1 Introduction

In many longitudinal studies, e.g. in medical research, subjects are followed over time for the occurrence of certain events. One such example is the ‘PROVA’ trial (PROVA study group 1991) that included 286 patients in whom liver cirrhosis was histologically verified. In eligible patients, endoscopy had shown oesophageal varices, but a transfusion-requiring bleeding had not yet been observed. Patients were randomized in a four-arm design to treatment with propranolol or not combined with sclerotherapy or not. The purpose of the trial was to study to what extent these treatments were PROphylactic against transfusion-requiring bleeding from the VArices and against death (without bleeding). In another example (Andersen and Pohar Perme 2008), 2009 patients with acute leukemia were followed after bone marrow transplantation (BMT) in an observational study with the purpose of evaluating prognostic factors for the events relapse of disease and death, and with special emphasis on how the intermediate event graft-versus-host disease (GvHD) is associated with these outcomes.

A suitable mathematical framework in which to study such phenomena is that of multi-state models where events of interest are considered as transitions between a (typically small) number of states, see Figs. 1 and 2 for examples of ‘box-and-arrow diagrams’ depicting possible multi-state models with the transitions of interest for the PROVA and BMT studies. Reviews on multi-state models have been presented, e.g. by Andersen and Keiding (2002) and by Meira-Machado and Sestelo (2017), the latter focussing on the model in Fig. 1, known as the (irreversible) illness–death model. The book by Cook and Lawless (2018) gives a comprehensive review of the field.

Fig. 1 An irreversible illness–death model for the PROVA study

Fig. 2 States and transitions in the bone marrow transplantation (BMT) study

In a multi-state model, time is measured relative to some time origin (time of randomization in the PROVA study and time of bone marrow transplantation in the BMT study) and if X(t) denotes the state occupied at time t then the transition probabilities are \(P_{hj}(s,t)=P(X(t)=j\mid X(s)=h)\) where h and j are in the finite state space \(\mathcal{S}\). Sometimes, the transition probabilities are further conditioned on the past of the multi-state process up till just before time s: \(\mathcal{F}_{s-}=\sigma \{X(u), u<s, {\varvec{Z}}\}\), possibly including time-fixed covariates \({\varvec{Z}}\) observed at study entry. If, for all \(j\ne h\) and all \(t\ge s\), \(P_{hj}(s,t)=0\) then state h is absorbing, otherwise h is transient.

The transition intensities are

$$\begin{aligned} \alpha _{hj}(t)=\lim _{\Delta t\rightarrow 0}\frac{P_{hj}(t,t+\Delta t)}{\Delta t}, \quad h,j\in \mathcal{S}, h\ne j, \end{aligned}$$

and if \(\alpha _{hj}(t)\) only depends on the past \(\mathcal{F}_{t-}\) via the state h occupied at t (and possibly via time-fixed covariates) then the multi-state process is Markovian. For an irreversible illness–death model, the process is Markovian if the \(1\rightarrow 2\) intensity does not depend on the time, say \(T_1\), of entry into state 1. Similarly, if in Fig. 2 all transition intensities out of the non-initial transient states (1 and 2) are independent of times and types of previous transitions then the process is Markovian.

Transition intensities can be considered as the basic ‘building blocks’ of a multi-state process in the sense that if all transition intensities are known then, via Jacod’s formula (e.g., Andersen et al. 1993), a likelihood is available that only depends on the transition intensities. From this likelihood, inference on the intensities may be performed based on observation of possibly right-censored realizations of the process for independent subjects. Thus, regression models for intensities may be based on versions of the Cox (1972) proportional hazards model (or on other hazard regression models known from survival analysis). Nevertheless, transition probabilities (and state occupation probabilities, i.e. \(Q_h(t)=P(X(t)=h), h\in \mathcal{S}\)) have a simpler interpretation, especially for scientists with limited mathematical background, and it is, therefore, of considerable interest to be able to estimate such quantities. For Markov models, the problem of obtaining transition probabilities from intensities was completely solved by Aalen and Johansen (1978) who showed that the product-integral maps transition intensities onto transition probabilities. We will briefly review this in Sect. 2. For general multi-state processes, Datta and Satten (2001) showed that state occupation probabilities may also be estimated using the product-integral. However, estimation of transition probabilities in non-Markov models is more involved and several approaches to this have been put forward. We will review some of these methods in Sect. 3. This includes plug-in methods where, for non-reversible processes (such as those exemplified in Figs. 1 and 2 where a state cannot be reached once it has been left), transition probabilities are explicit functionals of the intensities. It also includes more general approaches based on land-marking where estimation of \(P_{hj}(s,t)\) is based solely on subjects in state h at time s, such as those discussed by Titman (2015) and Putter and Spitoni (2018).

A general technique to obtain estimates of transition probabilities for given intensities is micro-simulation where paths of the multi-state process are generated and, from these, transition probabilities may be estimated as simple averages over repeated simulated paths (e.g., Mitton et al. 2000). However, in a regression situation where intensities are given as a function of time-fixed covariates, neither this technique, nor plug-in will provide parameters describing directly the association between covariates and transition probabilities. We will discuss this problem in Sect. 4 where we will show that pseudo-observations (e.g., Andersen et al. 2003; Andersen and Pohar Perme 2010) may be applicable for this purpose. Results will be supported by simulations (Sect. 5), and we will study how some of the methods work in the PROVA and BMT examples (Sect. 6). The article is concluded by a brief discussion of findings and lines of further research (Sect. 7).
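The micro-simulation idea can be illustrated with a minimal sketch. It assumes a Markov illness–death model with constant intensities; the numerical values, the function names and the number of simulated paths are hypothetical choices made for illustration and are not taken from the studies discussed here. The transition probability \(P_{01}(s,t)\) is approximated by the fraction of simulated paths in state 1 at time t among those in state 0 at time s, and compared with the closed-form value available in this simple case.

```python
import math
import random

def simulate_illness_death(a01, a02, a12, rng):
    """One path of a Markov illness-death model with constant intensities.

    Returns (t0, cause, t2): exit time from state 0, the state entered
    (1 or 2), and the time of entry into the absorbing state 2.
    """
    t0 = rng.expovariate(a01 + a02)
    if rng.random() < a01 / (a01 + a02):
        return t0, 1, t0 + rng.expovariate(a12)
    return t0, 2, t0

def micro_sim_p01(s, t, a01, a02, a12, n_paths=20000, seed=1):
    """Fraction in state 1 at time t among simulated paths in state 0 at s."""
    rng = random.Random(seed)
    in_0_at_s = in_1_at_t = 0
    for _ in range(n_paths):
        t0, cause, t2 = simulate_illness_death(a01, a02, a12, rng)
        if t0 > s:
            in_0_at_s += 1
            if cause == 1 and t0 <= t < t2:
                in_1_at_t += 1
    return in_1_at_t / in_0_at_s

# Hypothetical intensities and time points.
a01, a02, a12 = 0.3, 0.1, 0.5
s, t = 1.0, 3.0
estimate = micro_sim_p01(s, t, a01, a02, a12)

# Closed-form P_01(s,t) for constant intensities, for comparison.
a0 = a01 + a02
exact = a01 / (a0 - a12) * (math.exp(-a12 * (t - s)) - math.exp(-a0 * (t - s)))
print(estimate, exact)
```

With intensities depending on covariates, the same loop would be run once per covariate pattern, which is why the technique yields predictions but no regression coefficients for the transition probability itself.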

2 Markov models

Let X(t) be a multi-state process with state space \(\mathcal{S}=\{0,1,\dots ,k\}\) and assume that X(t) is Markovian, i.e.

$$\begin{aligned} \alpha _{hj}(t)dt\approx P(X(t+dt)=j\mid X(t)=h, \mathcal{F}_{t-})=P(X(t+dt)=j\mid X(t)=h, \varvec{Z}) \end{aligned}$$

for all states \(h,j\in \mathcal{S}, j\ne h\). We define the cumulative intensities \(A_{hj}(t)=\int _0^t\alpha _{hj}(u)du\) and let \(A_{hh}(t)=-\sum _{j\in \mathcal{S}, j\ne h}A_{hj}(t).\) We can then collect all \(A_{hj}(t), h,j\in \mathcal{S}\) in a \((k+1)\times (k+1)\)-matrix \(\mathbf{A}(t)\) and the product-integral of \(\mathbf{A}(\cdot )\) over the interval \((s,t]\) is defined as the \((k+1)\times (k+1)\)-matrix

$$\begin{aligned} \mathbf{P}(s,t)= & {} \mathop {{\varvec{\varPi }}}\limits _{(s,t]}\bigl (\mathbf{I}+d\mathbf{A}(u)\bigr )\nonumber \\= & {} \lim _{\max |u_i-u_{i-1}|\rightarrow 0}\prod \bigl (\mathbf{I}+\mathbf{A}(u_i)-\mathbf{A}(u_{i-1})\bigr ) \end{aligned}$$
(1)

for any partition \(s=u_0<u_1<\dots <u_N=t\) of \((s,t]\) (Gill and Johansen 1990). Here, \(\mathbf{I}\) is the \((k+1)\times (k+1)\) identity matrix. Note that equation (1) is also well-defined if the \(A_{hj}\) have jumps, and in the case where the \(A_{hj}\) correspond to purely discrete measures, the product-integral is just a finite matrix product over the jump times in \((s,t]\). Gill and Johansen (1990) showed that \(\mathbf{P}(s,t)\) is the transition probability matrix for the Markov process \(X(\cdot )\).
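The limit in equation (1) can be approximated numerically by a finite matrix product over a fine partition. The following is a minimal sketch for an illness–death model with constant intensities; the intensity values, the grid size and the helper names are assumptions made for illustration only. The result is compared with the closed-form expression for \(P_{01}(s,t)\) that holds in this special case.

```python
import math

def mat_mul(A, B):
    """Multiply two square matrices given as nested lists."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def product_integral(increments):
    """Finite product of (I + dA) over the time-ordered increments dA."""
    n = len(increments[0])
    P = [[float(i == j) for j in range(n)] for i in range(n)]
    for dA in increments:
        step = [[float(i == j) + dA[i][j] for j in range(n)] for i in range(n)]
        P = mat_mul(P, step)
    return P

# Hypothetical constant intensities for states 0, 1, 2.
a01, a02, a12 = 0.3, 0.1, 0.5
s, t, n_steps = 1.0, 3.0, 2000
du = (t - s) / n_steps

# Increment of the cumulative intensity matrix A over one grid cell;
# diagonal entries are minus the sum of the off-diagonal row entries.
dA = [[-(a01 + a02) * du, a01 * du, a02 * du],
      [0.0, -a12 * du, a12 * du],
      [0.0, 0.0, 0.0]]
P = product_integral([dA] * n_steps)

# Closed-form P_01(s,t) for constant intensities, for comparison.
a0 = a01 + a02
p01_exact = a01 / (a0 - a12) * (math.exp(-a12 * (t - s)) - math.exp(-a0 * (t - s)))
print(P[0][1], p01_exact)
```

Each factor has rows summing to one, so the approximation is a proper transition matrix for any grid; refining the grid only improves the agreement with the limit.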

This immediately suggests plug-in estimators for \(\mathbf{P}(s,t)\) based on models fitted for the intensities, as follows. We assume that independent, possibly right-censored, realizations of \(X(\cdot )\) are observed. These data can be represented as counting processes \(N_{hji}(t), t\le \tau _i\) where, for \(h, j\in \mathcal{S}, h\ne j\), the process \(N_{hji}(t)\) counts the observed number of direct \(h\rightarrow j\) transitions in [0, t] for subject \(i=1,\dots ,n\). The time point \(\tau _i\) is either the time at which \(X_i(\cdot )\) reaches an absorbing state or a previous time of right-censoring. A non-parametric estimator for \(\mathbf{P}(s,t)\) for an assumed homogeneous group is obtained by plugging in the Nelson-Aalen estimator \(\widehat{\mathbf{A}}\), where

$$\begin{aligned} \widehat{A}_{hj}(t)=\int _0^t\frac{\sum _i dN_{hji}(u)}{\sum _i Y_{hi}(u)} \end{aligned}$$

and \(Y_{hi}(u)=I(X_i(u)=h)\) is the state h indicator for subject i. The resulting estimator

$$\begin{aligned} \widehat{\mathbf{P}}(s,t)=\mathop {{\varvec{\varPi }}}\limits _{(s,t]}\bigl (\mathbf{I}+d\widehat{\mathbf{A}}(u)\bigr ) \end{aligned}$$
(2)

is the Aalen–Johansen estimator (Aalen and Johansen 1978). The expression (2) also applies if the model for the intensities is a hazard regression model with time-fixed covariates, e.g. a Cox model.
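Since the Nelson-Aalen estimator is a step function, expression (2) is a finite matrix product over the observed transition times. A minimal sketch follows; the data layout (a list of transitions and an end-of-follow-up time per subject), the function names and the toy data are assumptions made for illustration. The toy data are uncensored, in which case the first row of \(\widehat{\mathbf{P}}(0,t)\) should coincide with the empirical fractions of subjects in each state at time t.

```python
def state_just_before(transitions, u):
    """State occupied just before time u; every subject starts in state 0."""
    state = 0
    for time, _, to_state in transitions:
        if time < u:
            state = to_state
    return state

def aalen_johansen(subjects, s, t, n_states):
    """Aalen-Johansen estimate of the matrix P(s, t).

    subjects: list of (transitions, end_time), where transitions is a
    time-ordered list of (time, from_state, to_state) and end_time is the
    time of absorption or of right-censoring.
    """
    jump_times = sorted({time for trans, _ in subjects
                         for time, _, _ in trans if s < time <= t})
    P = [[float(h == j) for j in range(n_states)] for h in range(n_states)]
    for u in jump_times:
        at_risk = [0] * n_states
        dN = [[0] * n_states for _ in range(n_states)]
        for trans, end_time in subjects:
            if end_time < u:
                continue  # no longer under observation at time u
            at_risk[state_just_before(trans, u)] += 1
            for time, h, j in trans:
                if time == u:
                    dN[h][j] += 1
        # Factor I + dA_hat(u) of the product-integral.
        step = [[float(h == j) for j in range(n_states)] for h in range(n_states)]
        for h in range(n_states):
            for j in range(n_states):
                if dN[h][j] > 0:
                    step[h][j] += dN[h][j] / at_risk[h]
                    step[h][h] -= dN[h][j] / at_risk[h]
        P = [[sum(P[a][k] * step[k][b] for k in range(n_states))
              for b in range(n_states)] for a in range(n_states)]
    return P

# Hypothetical uncensored illness-death data (states 0, 1, 2).
subjects = [
    ([(1.0, 0, 1), (2.5, 1, 2)], 2.5),
    ([(0.5, 0, 2)], 0.5),
    ([(2.0, 0, 1)], 5.0),
    ([], 5.0),
    ([(3.0, 0, 2)], 3.0),
    ([(1.5, 0, 1), (4.0, 1, 2)], 4.0),
]
P = aalen_johansen(subjects, 0.0, 3.5, 3)
print(P[0])
```

Right-censoring would be represented in this layout by an end time at which the subject is still in a transient state; such subjects leave the risk set after that time.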

The state occupation probabilities are

$$\begin{aligned} Q_h(t)=P(X(t)=h)=\sum _j Q_j(0)P_{jh}(0,t), \quad h\in \mathcal{S}. \end{aligned}$$

In the situation where all subjects are in the same state (0) at time 0, i.e. \(Q_0(0)=1\), these are \(Q_h(t)=P_{0h}(0,t)\) and the Aalen–Johansen estimator may be used for this parameter. Since \(P_{hj}(s,t)\) is a differentiable functional of the intensities, large-sample properties of the resulting plug-in estimator may be derived from those of the intensity estimators using (functional) delta-methods (Andersen et al. 1993).

3 Non-Markov models

3.1 General non-Markov models

As mentioned above, estimation of state occupation probabilities is possible using the Aalen–Johansen estimator for a general multi-state model (Datta and Satten 2001). This feature was used by Putter and Spitoni (2018) to estimate transition probabilities in any multi-state model using land-marking (or sub-setting). To estimate \(P_{hj}(s,t)=P(X(t)=j\mid X(s)=h)\) for a fixed value of s and a fixed state \(h\in \mathcal{S}\), attention was restricted to those processes \(X_i(\cdot )\) observed to be in state h at time s, i.e. processes for which \(Y_{hi}(s)=1\), and counting process increments and at-risk processes were studied for this subset:

$$\begin{aligned} dN_{j\ell }^{LM}(t)=\sum _i dN_{j\ell i}(t)Y_{hi}(s),\quad Y_{j}^{LM}(t)=\sum _iY_{ji}(t)Y_{hi}(s), \quad t\ge s. \end{aligned}$$

The Nelson-Aalen estimators \(\widehat{\mathbf{A}}^{LM}(t)\) based on these sub-sets are then plugged-in to the product-integral to yield the land-mark Aalen–Johansen estimator:

$$\begin{aligned} \widehat{\mathbf{P}}^{LM}(s,t)=\mathbf{Q}^{LM}(s)\mathop {{\varvec{\varPi }}}\limits _{(s,t]}\bigl (\mathbf{I}+d\widehat{\mathbf{A}}^{LM}(u)\bigr ), \end{aligned}$$
(3)

where \(\mathbf{Q}^{LM}(s)\) is the \((k+1)\) row vector with element h equal to 1 and other elements equal to 0. Malzahn et al. (2021) extended this technique to ‘hybrid’ situations where, only for some transitions, the Markov property fails whereas, for others, the Markov assumption is compatible with the data.
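A minimal sketch of the land-mark Aalen–Johansen estimator (3) is given below. The data layout, function names and toy data are hypothetical and chosen for illustration. The sketch first selects the subjects observed in state h at time s and then propagates the row vector \(\mathbf{Q}^{LM}(s)\) through the factors of the product-integral computed from that subset only. The toy data are uncensored, so the result should equal the empirical distribution at time t among the selected subjects.

```python
def state_at(transitions, u):
    """State occupied at time u; every subject starts in state 0."""
    state = 0
    for time, _, to_state in transitions:
        if time <= u:
            state = to_state
    return state

def landmark_aalen_johansen(subjects, s, t, h, n_states):
    """Row vector of land-mark Aalen-Johansen estimates of P_hj(s, t)."""
    # Land-mark subset: under observation and in state h at time s.
    subset = [(tr, end) for tr, end in subjects
              if end > s and state_at(tr, s) == h]
    Q = [float(j == h) for j in range(n_states)]  # Q^LM(s)
    times = sorted({time for tr, _ in subset
                    for time, _, _ in tr if s < time <= t})
    for u in times:
        at_risk = [0] * n_states
        dN = [[0] * n_states for _ in range(n_states)]
        for tr, end in subset:
            if end < u:
                continue  # no longer under observation at time u
            state = 0
            for time, a, b in tr:
                if time < u:
                    state = b
                elif time == u:
                    dN[a][b] += 1
            at_risk[state] += 1
        # Q <- Q (I + dA_hat^LM(u)), written as probability flows a -> b.
        new_Q = Q[:]
        for a in range(n_states):
            for b in range(n_states):
                if dN[a][b]:
                    flow = Q[a] * dN[a][b] / at_risk[a]
                    new_Q[a] -= flow
                    new_Q[b] += flow
        Q = new_Q
    return Q

# Hypothetical uncensored illness-death data: (transitions, end of follow-up).
subjects = [
    ([(1.0, 0, 1), (2.5, 1, 2)], 2.5),
    ([(0.5, 0, 2)], 0.5),
    ([(2.0, 0, 1)], 5.0),
    ([], 5.0),
    ([(3.0, 0, 2)], 3.0),
    ([(1.5, 0, 1), (4.0, 1, 2)], 4.0),
]
Q = landmark_aalen_johansen(subjects, s=1.2, t=3.5, h=0, n_states=3)
print(Q)
```

Only four of the six toy subjects are in state 0 at the land-mark time, which illustrates the loss of data that sub-setting entails.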

Titman (2015) also used sub-setting to obtain estimators for transition probabilities in non-Markov models, as follows. Define \(\mathcal{R}_{hj}\) to be the set of states reachable from h but from which j cannot be reached. For the considered subset of processes one then defines the following competing risks process for \(u\ge s\) when j is an absorbing state:

$$\begin{aligned} X^*_s(u)=\left\{ \begin{array}{lll} 0&{}\text{ if }&{}X(u)\notin \mathcal{R}_{hj}\cup \{j\}\\ 1&{}\text{ if }&{}X(u)\in \mathcal{R}_{hj}\\ 2&{}\text{ if }&{}X(u)=j \end{array} \right. \end{aligned}$$

For the considered subset, this process is linked to X(t) by the relation \(P_{hj}(s,t)=P(X^*_s(t)=2)\) and, therefore, the desired transition probability can be estimated using the Aalen–Johansen estimator for the cause 2 cumulative incidence for \(X^*_s(t)\). More specifically, if \(N_{s\ell }(u)\) counts cause \(\ell =1,2\) events for \(X^*_s(\cdot )\) and \(Y_s(u)\) is the number still at risk for cause 1 or 2 events at time \(u-\) then the estimator is

$$\begin{aligned} \widehat{P}_{hj}^T(s,t)=\int _s^t\widehat{P}(X^*_s(u)=0\mid X^*_s(s)=0)\frac{dN_{s2}(u)}{Y_s(u)}, \end{aligned}$$

where \(\widehat{P}(X^*_s(u)=0\mid X^*_s(s)=0)\) is estimated using the Kaplan–Meier estimator

$$\begin{aligned} \mathop {{\varvec{\varPi }}}\limits _{(s,u]}\bigl (1-\frac{dN_{s1}(v)+dN_{s2}(v)}{Y_s(v)}\bigr ) \end{aligned}$$

based on events of both types. If j is a transient state then one defines the following survival process for \(u\ge s\) for the considered subset of processes:

$$\begin{aligned} X^*_s(u)=\left\{ \begin{array}{lll} 0&{}\text{ if }&{}X(u)\notin \mathcal{R}_{hj}\\ 1&{}\text{ if }&{}X(u)\in \mathcal{R}_{hj} \end{array}\right. \end{aligned}$$

For this subset, the process \(X^*_s(t)\) is related to X(t) via \(P_{hj}(s,t)=P(X^*_s(t)=0)P(X(t)=j\mid X^*_s(t)=0)\), where the first factor can, once more, be estimated by the Kaplan–Meier estimator for \(X^*_s(t)\). Titman (2015) proposed to use the relative frequency of processes in state j at time t among those for which \(X^*_s(t)=0\), i.e.

$$\begin{aligned} \frac{\sum _i I(X_i(t)=j, X_i(s)=h, X_i(t)\notin \mathcal{R}_{hj})}{\sum _i I(X_i(s)=h, X_i(t)\notin \mathcal{R}_{hj})} \end{aligned}$$

as an (ad hoc) estimator of the second factor.

We will not study Titman’s estimators any further but rather focus on non-reversible models for which some alternative estimators are possible.

3.2 Non-reversible models

If the multi-state process is non-reversible then, as discussed by Titman (2015), a simple estimator—building on Pepe (1991) and on sub-setting—is available for \(P_{hj}(s,t)\) when state j is transient. Thus, to estimate \(P_{hj}(s,t)\), one again looks at the subset of processes \(X_i(\cdot )\) observed to be in state h at time s and, for fixed s, this transition probability is estimated as the difference between Kaplan–Meier estimators of staying in sets of states \(\mathcal{S}_{hj}\) and \(\mathcal{S}_{hj}\cup \{j\}\), respectively, at time t where \(\mathcal{S}_{hj}\) is the set of states reachable from h and from which j can be reached. A variance estimator was also presented.

Violation of the Markov property is typically detected by fitting models for the intensities allowing past events (i.e., in \(\mathcal{F}_{t-}\)) to affect \(\alpha _{hj}(t)\). When this dependence is explicitly modelled and if an expression is available for the way in which transition probabilities depend on the intensities then \(P_{hj}(s,t)\) may be estimated by plug-in. This is the case for non-reversible models such as those depicted in Figs. 1 and 2. Thus, for the illness–death model, we have \(P_{00}(s,t)=\exp (-\int _s^t(\alpha _{01}(u)+\alpha _{02}(u))du)\), \(P_{11}(s,t\mid T_1)=\exp (-\int _s^t \alpha _{12}(u\mid T_1)du)\), where \(T_1<s\) is the time of \(0\rightarrow 1\) transition, and

$$\begin{aligned} P_{01}(s,t)=\int _s^tP_{00}(s,u)\alpha _{01}(u)P_{11}(u,t\mid u)du. \end{aligned}$$
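The integral for \(P_{01}(s,t)\) is straightforward to evaluate numerically once the intensities are specified. The sketch below assumes, purely for illustration, constant \(\alpha _{01}\) and \(\alpha _{02}\) and a \(1\rightarrow 2\) intensity of the form \(a_{12}\exp (b(u-T_1))\), i.e. a log-linear effect of duration in state 1; the parameter values and function names are hypothetical. In an application the intensities would be estimates from fitted hazard models rather than known functions.

```python
import math

def cum_hazard_12(duration, a12, b):
    """Integral of a12 * exp(b * d) over d in [0, duration]."""
    if b == 0.0:
        return a12 * duration
    return a12 * (math.exp(b * duration) - 1.0) / b

def plug_in_p01(s, t, a01, a02, a12, b, n_grid=2000):
    """Midpoint-rule evaluation of the integral expression for P_01(s, t)."""
    du = (t - s) / n_grid
    total = 0.0
    for i in range(n_grid):
        u = s + (i + 0.5) * du                        # time of the 0 -> 1 transition
        p00 = math.exp(-(a01 + a02) * (u - s))        # P_00(s, u)
        p11 = math.exp(-cum_hazard_12(t - u, a12, b)) # P_11(u, t | T_1 = u)
        total += p00 * a01 * p11 * du
    return total

# Hypothetical parameter values.
a01, a02, a12 = 0.3, 0.1, 0.5
s, t = 1.0, 3.0
p01_markov = plug_in_p01(s, t, a01, a02, a12, b=0.0)
p01_decreasing = plug_in_p01(s, t, a01, a02, a12, b=-1.0)

# Closed form in the Markov case (b = 0), for comparison.
a0 = a01 + a02
p01_closed = a01 / (a0 - a12) * (math.exp(-a12 * (t - s)) - math.exp(-a0 * (t - s)))
print(p01_markov, p01_closed, p01_decreasing)
```

With \(b<0\) the intensity out of state 1 declines with duration, so the probability of still being in state 1 at time t exceeds the Markov value.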

Similar expressions hold for the model in Fig. 2. Thus, \(P_{02}(s,t)\) is the sum of two terms corresponding to a direct \(0\rightarrow 2\) transition and to transition via state 1, say

$$\begin{aligned} P_{02}^{(a)}(s,t)=\int _s^t P_{00}(s,u)\alpha _{02}(u)P_{22}(u,t\mid \infty , u)du \end{aligned}$$

and

$$\begin{aligned} P_{02}^{(b)}(s,t)=\int _s^tP_{00}(s,u)\alpha _{01}(u)\int _u^tP_{11}(u,x\mid u)\alpha _{12}(x\mid u)P_{22}(x,t\mid u,x)dxdu. \end{aligned}$$

Here,

$$\begin{aligned} P_{11}(s,t\mid T_1)=\exp (-\int _s^t(\alpha _{12}(u\mid T_1)+\alpha _{13}(u\mid T_1))du) \end{aligned}$$

is the probability of staying in state 1 from time s to time t given entry into state 1 at \(T_1<s\) and

$$\begin{aligned} P_{22}(s,t\mid T_1,T_2)=\exp (-\int _s^t\alpha _{23}(u\mid T_1,T_2)du) \end{aligned}$$

is similarly the probability of staying in state 2 from time s to time t given entry into state 2 at \(T_2<s\) and, possibly, entry into state 1 at \(T_1<T_2\), \(T_1=\infty \) denoting no previous \(0\rightarrow 1\) transition.

To apply these expressions, intensity models describing the influence of \(T_1, T_2\) are needed. Obviously, such models may be difficult to assess. However, if such models are available then estimation may be based on the entire data set, rather than restricting attention to the land-mark sub-set of processes in state 0 at time s. This may entail a considerable efficiency gain, as we will further study in Sects. 5 and 6. Alternatively, semi-Markov Cox-type models with the baseline intensities out of non-initial transient states h depending on the sojourn time spent in h are possible. Large sample theory for the resulting estimators follows, in principle, from that of the hazard models via the functional delta-method. The details, however, may be cumbersome to verify, see Shu et al. (2007) for a study of the semi-Markov illness–death model.

It should be noted that for the irreversible illness–death model (Fig. 1), some special estimators are available, see e.g. Meira-Machado et al. (2006), Meira-Machado and Sestelo (2017), and Allignol et al. (2014). These estimators will not be further studied here. For this model, Rodriguez-Girondo and Uña-Alvarez (2012) studied non-parametric tests for Markovianity based on estimates of Kendall’s \(\tau \). Tests for the Markov assumption in general multi-state models were discussed by Titman and Putter (2022).

4 Regression

If transition probabilities are studied in relation to time-fixed covariates \({\varvec{Z}}\) then the plug-in methods described above are available, at least for non-reversible models. This would entail fitting regression models, such as Cox models, for each of the relevant transition intensities. Thereby, probabilities may be predicted for given \({\varvec{Z}}\). This approach, however, would not provide coefficients that directly describe the association between \(P_{hj}(s,t)\) and \({\varvec{Z}}\). To obtain this, we suggest using pseudo-observations (e.g., Andersen et al. 2003; Andersen and Pohar Perme 2010), as follows.

If there had been no censoring then, for all subjects in state h at time s, the indicator \(Y_{ji}(t)=I(X_i(t)=j)\) of being in state j at time \(t>s\) would be observable and could be used as outcome variable in a regression model. With potential censoring, let \(\widehat{P}_{hj}(s,t)\) be the estimator based on ‘all’ subjects and \(\widehat{P}^{-i}_{hj}(s,t)\) the same estimator applied to the data set obtained by eliminating subject i. The pseudo-observation for subject i is then

$$\begin{aligned} \theta _i=n_s\cdot \widehat{P}_{hj}(s,t)-(n_s-1)\widehat{P}^{-i}_{hj}(s,t) \end{aligned}$$

where \(n_s\) is the size of the data set used for estimating \(\widehat{P}_{hj}(s,t)\). Depending on the estimator used, \(n_s\) could be the size of the full land-mark data set (e.g., land-mark Aalen–Johansen or land-mark Pepe estimators) or of the complete data set (e.g., Aalen–Johansen or plug-in estimators). For plug-in estimators, an ‘intermediate’ data set could consist of all subjects still at risk at time s, but not necessarily in state h at that time. Note that the Aalen–Johansen estimators based on either the complete data set or the ‘at risk’ (at time s) data set are identical. Also note that, in an uncensored data set, pseudo-values based on the land-mark Aalen–Johansen or the land-mark Pepe estimators both reduce to the indicator \(\theta _i=I(X_i(t)=j)\).
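The leave-one-out construction can be sketched generically. In the minimal example below, the function name, the toy data and the base estimator are assumptions made for illustration. The base estimator is the simple relative frequency of subjects in state j at time t in an uncensored land-mark data set, which is what the land-mark Aalen–Johansen estimator reduces to without censoring. The pseudo-observations should then reproduce the indicators \(I(X_i(t)=j)\).

```python
def pseudo_observations(data, estimator):
    """Leave-one-out pseudo-observations n*est(all) - (n-1)*est(all but i)."""
    n = len(data)
    full = estimator(data)
    return [n * full - (n - 1) * estimator(data[:i] + data[i + 1:])
            for i in range(n)]

# Hypothetical uncensored land-mark data: state occupied at time t for each
# subject that was in state h at time s.
states_at_t = [1, 0, 2, 1, 1, 0, 2, 1]
j = 1

def fraction_in_j(data):
    """Relative frequency of state j (stand-in for the base estimator)."""
    return sum(x == j for x in data) / len(data)

theta = pseudo_observations(states_at_t, fraction_in_j)
print(theta)
```

With censored data, `fraction_in_j` would be replaced by one of the estimators of \(P_{hj}(s,t)\) discussed in Sect. 3, and individual pseudo-observations may then fall outside the interval [0, 1].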

Estimates of the parameters \({\varvec{\beta }}\) in a regression model \(g(P_{hj}(s,t\mid {\varvec{Z}}))={\varvec{\beta }}^\mathsf{T}{\varvec{Z}}\) with link function g are now obtained by solving the estimating equations

$$\begin{aligned} {\varvec{U}}({\varvec{\beta }})=\sum _i \mathbf{A}({\varvec{\beta }},{\varvec{Z}}_i)\bigl (\theta _i-g^{-1}({\varvec{\beta }}^\mathsf{T}{\varvec{Z}}_i)\bigr )=\mathbf{0} \end{aligned}$$
(4)

(e.g., Andersen and Pohar Perme 2010). The function \(\mathbf{A}({\varvec{\beta }},{\varvec{Z}}_i)\) is typically the p-vector

$$\begin{aligned} \mathbf{A}({\varvec{\beta }},{\varvec{Z}}_i)=\left( \frac{\partial }{\partial \beta _j}g^{-1}({\varvec{\beta }}^\mathsf{T}{\varvec{Z}}_i),\quad j=1,\dots ,p\right) \end{aligned}$$

of partial derivatives of the mean function. Properties of the resulting estimators have been studied in special cases, e.g. by Overgaard et al. (2017) when pseudo-values are based on the Kaplan–Meier estimator or the Aalen–Johansen estimator for the competing risks cumulative incidence function and when censoring is independent of covariates. In a simulation study, we will in the next section investigate the behavior of the estimator of \({\varvec{\beta }}\) when pseudo-observations are based on the estimators for \(P_{hj}(s,t)\) discussed in Sect. 3.
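As an illustration of how equation (4) can be solved, the following minimal sketch uses a log link with an intercept and one binary covariate, so that \(g^{-1}({\varvec{\beta }}^\mathsf{T}{\varvec{Z}}_i)=\exp (\beta _0+\beta _1Z_i)\). The Gauss–Newton iteration, the function name and the pseudo-observation values are assumptions made for illustration; in practice, standard generalized estimating equation software would typically be used and would also supply sandwich variance estimates. For this saturated model, \(\exp (\widehat{\beta }_1)\) should equal the ratio of the group means of the pseudo-observations.

```python
import math

def fit_log_link(theta, z, n_iter=50):
    """Solve sum_i A_i * (theta_i - exp(b0 + b1 * z_i)) = 0 by Gauss-Newton,
    where A_i is the gradient of the mean function with respect to (b0, b1)."""
    b0, b1 = math.log(sum(theta) / len(theta)), 0.0
    for _ in range(n_iter):
        s00 = s01 = s11 = u0 = u1 = 0.0
        for th, zi in zip(theta, z):
            mu = math.exp(b0 + b1 * zi)
            d0, d1 = mu, mu * zi        # gradient of the mean w.r.t. (b0, b1)
            u0 += d0 * (th - mu)        # components of U(beta)
            u1 += d1 * (th - mu)
            s00 += d0 * d0
            s01 += d0 * d1
            s11 += d1 * d1
        det = s00 * s11 - s01 * s01
        # Update beta by the inverse 2x2 matrix applied to U(beta).
        b0 += (s11 * u0 - s01 * u1) / det
        b1 += (s00 * u1 - s01 * u0) / det
    return b0, b1

# Hypothetical pseudo-observations and binary covariate values.
theta = [1.05, -0.05, 0.0, 0.2, 1.0, 0.9, -0.1, 0.6]
z = [0, 0, 0, 0, 1, 1, 1, 1]
b0, b1 = fit_log_link(theta, z)
print(math.exp(b0), math.exp(b1))
```

The starting value takes the logarithm of the overall mean of the pseudo-observations, so the sketch presupposes that this mean is positive.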

A crucial assumption when using pseudo-observations is, as mentioned, that censoring is independent of covariates. An alternative to using pseudo-observations would be to use inverse probability of censoring weighted ‘direct binomial regression’. Thus, Scheike and Zhang (2007) used that technique for analyzing state occupation probabilities, and Azarang et al. (2017) studied transition probabilities in the irreversible illness–death model.

5 Simulations

In the simulation study, we focus on the illness–death model of Fig. 1 and consider three scenarios:

  • Scenario A: Markov model with constant transition intensities

  • Scenario B: Non-Markov model with constant transition intensities and with a linear effect of duration d on \(\alpha _{12}(t,d)\)

  • Scenario C: Non-Markov model with constant transition intensities and with a piece-wise constant effect of duration d on \(\alpha _{12}(t,d)\)
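To make the non-Markov scenarios concrete, the sketch below simulates a process of the Scenario B type. The exact intensity specification and parameter values used in the simulation study are not reproduced here; the sketch assumes, for illustration only, that \(\alpha _{12}(t,d)=a_{12}\exp (bd)\) with \(b>0\), and all numerical values and function names are hypothetical. The simulated fraction in state 1 at time t among paths in state 0 at time s is compared with a numerical evaluation of the integral expression for \(P_{01}(s,t)\) from Sect. 3.2.

```python
import math
import random

def draw_sojourn(a12, b, rng):
    """Sojourn time in state 1 for intensity a12 * exp(b * d), b > 0,
    obtained by inverting the cumulative intensity a12 * (exp(b*d) - 1) / b."""
    e = rng.expovariate(1.0)
    return math.log(1.0 + b * e / a12) / b

def simulate_p01(s, t, a01, a02, a12, b, n_paths=20000, seed=2):
    """Fraction in state 1 at time t among simulated paths in state 0 at s."""
    rng = random.Random(seed)
    n_s = n_hit = 0
    for _ in range(n_paths):
        t1 = rng.expovariate(a01 + a02)            # exit time from state 0
        to_illness = rng.random() < a01 / (a01 + a02)
        if t1 <= s:
            continue
        n_s += 1
        if to_illness and t1 <= t < t1 + draw_sojourn(a12, b, rng):
            n_hit += 1
    return n_hit / n_s

def integral_p01(s, t, a01, a02, a12, b, n_grid=2000):
    """Midpoint-rule evaluation of the integral expression for P_01(s, t)."""
    du = (t - s) / n_grid
    total = 0.0
    for i in range(n_grid):
        u = s + (i + 0.5) * du
        stay_1 = math.exp(-a12 * (math.exp(b * (t - u)) - 1.0) / b)
        total += math.exp(-(a01 + a02) * (u - s)) * a01 * stay_1 * du
    return total

# Hypothetical parameter values.
a01, a02, a12, b = 0.3, 0.1, 0.5, 0.5
s, t = 1.0, 3.0
simulated = simulate_p01(s, t, a01, a02, a12, b)
target = integral_p01(s, t, a01, a02, a12, b)

# Value the transition probability would take without the duration effect.
a0 = a01 + a02
markov = a01 / (a0 - a12) * (math.exp(-a12 * (t - s)) - math.exp(-a0 * (t - s)))
print(simulated, target, markov)
```

The gap between `target` and `markov` is purely a consequence of the assumed duration effect on the intensity out of state 1.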

5.1 Marginal estimators

First, the probability \(P_{01}(1,t)\) was estimated on a large data-set (75000 individuals, no censoring) to obtain ‘true values’ for the transition probability. Then, a simulation with 100 simulation runs with sample size \(n=750\) was performed and on each simulated data set, the following estimators were used:

Land-mark data set (only individuals in state 0 at time \(s=1\)):

  • LM Pepe: Pepe estimator

  • LM AAJ: Aalen–Johansen estimator

  • LM PGL: Plug-in estimator using land-mark data only, modelling duration effect in state 1 linearly

At risk data set (only individuals at risk at time \(s=1\)):

  • At risk PGL: Plug-in estimator using everyone at risk at time \(s=1\), modelling duration effect in state 1 linearly

Complete data set (using all individuals):

  • AAJ: Aalen–Johansen estimator (same as at risk AAJ)

  • PGL: plug-in estimator, modelling the effect of duration d in state 1 linearly

  • PGS: plug-in estimator, modelling the effect of duration d in state 1 using a spline with 3 df

  • PGR: plug-in estimator, using duration as time-scale in state 1 and modelling the effect on \(\alpha _{12}(t,d)\) of time t since 0 linearly

Figure 3 shows the true \(P_{01}(1,t)\) together with averages over the simulation runs of the various estimates. It is seen that all estimators are unbiased when the process is Markov, that the Aalen–Johansen estimator using all data is biased on non-Markov data, and that the simple linear model for duration makes the plug-in estimator biased when the duration effect is non-linear.

Fig. 3 Estimates of the transition probability \(P_{01}(s,t), t>s\) for \(s=1\) in Scenarios A, B, and C

Table 1 Standard deviations for different estimators at three time points in Scenarios A, B, and C

Turning, next, to the variability of the estimators, Table 1 shows the standard deviation over the simulation runs of the estimates at three time points chosen, approximately, at the 25th, 50th, and 75th percentiles of transition times in the land-mark data set. The general picture is that the estimators using the complete data (plug-in or Aalen–Johansen) give the lowest variance. However, in Scenarios B and C we have seen that Aalen–Johansen is biased. Furthermore, the model based on the at-risk data gives a standard deviation similar to models using the complete data set, the model using duration as the time scale has the lowest standard deviation and, on the land-mark data set, all methods give similar standard deviations. The results for the Aalen–Johansen estimator and for the land-mark Aalen–Johansen and Pepe estimators are well in line with the results reported in the simulation studies by Putter and Spitoni (2018).

5.2 Regression

We now turn to regression and consider one binary covariate. All regression models are based on pseudo-observations where we directly model the probability \(P_{01}(1,t)\). The link function is the logarithm, i.e. we estimate the coefficient \(\beta =\log (P_{01}(1,t\mid Z=1)/P_{01}(1,t\mid Z=0))\).

We first study the average of the pseudo-observations and Fig. 4 shows the true curves and the averages of pseudo-observations based on estimates in a data set with \(n=2000\) and 20% censoring. We can see that the average pseudo-observations with all methods are close and follow the true values nicely. The only exception is the complete data AAJ (not shown), which we already know to be biased.

Fig. 4 Transition probability \(P_{01}(s,t\mid Z=\ell ), t>s, \ell =0,1,\) for \(s=1\) in Scenario B together with average pseudo observations using different estimators

It is instructive to study what pseudo-values look like for different subjects and for the different base estimators, see the Supplementary Material.

To study the behavior of estimators of \(\beta \) we consider Scenario B (linear effect of duration on \(\alpha _{12}(t,d)\)) and three effects of Z: (1) no effect, (2) Z affects only \(\alpha _{12}\), (3) Z affects all three transition intensities. Fig. 5 shows box-plots of the \(\widehat{\beta }\)’s in these situations based on 100 simulation runs with \(n=750\) subjects and 20% censoring. It is seen that all estimators are unbiased, except Aalen–Johansen using the full data set (which, however, provides an unbiased estimate of \(\beta \) under the null, even though the average of pseudo-observations does not hit the target). The box-plots suggest some differences in the variability of the estimators: the variance is lower with the plug-in model in the complete or at risk data sets, whereas the other three options give roughly the same variance. At the later time point, with only few individuals at risk, the plug-in estimators become more variable. In the Supplementary Material, the standard deviations are tabulated, as well as the power of a (Wald) test for the null hypothesis \(\beta =0\). There, we also show how different levels of censoring affect the estimators.

Fig. 5 Box-plots of \(\widehat{\beta }\) under the null hypothesis ((1) – no effect of Z, top panel) and under two alternatives ((2) and (3) – middle and lower panels). The ‘true value’ (i.e., based on 75000 subjects) is at the red line

6 Examples

6.1 The PROVA trial

Based on data from the PROVA trial, briefly introduced in Sect. 1, we will illustrate the estimators discussed in Sects. 2–4. We will first study the entire data set, ignoring treatment and other covariates, and aim at estimating the probability, \(P_{01}(1,t)\), of being alive in the bleeding state (1) at time t among those who were alive and free of bleeding \(s=1\) year after randomization. At that time, 190 out of the initial 286 patients were still at risk in state 0. Figure 6 shows the Markov-based Aalen–Johansen estimator (using all 286 patients) and the land-mark Aalen–Johansen and Pepe estimators. The Aalen–Johansen estimator is seen to provide somewhat higher values compared to the land-mark estimators, suggesting that the Markov assumption may be questionable. This observation is supported by fitting models for the death intensity after bleeding, \(\alpha _{12}(t\mid T_1)\), taking the time \(T_1\) of bleeding into account. This was done by introducing time-dependent covariates \(Z(t)=t-T_1\) or \((Z_1(t),Z_2(t))=(I(5 \text{ days }>t-T_1),I(5 \text{ days }\le t-T_1<10 \text{ days}))\) and testing their significance. The estimated coefficients (with estimated standard deviation) were, respectively, \(\widehat{\beta }=-2.21 (0.62) , (\widehat{\beta }_1=3.22 (0.61),\widehat{\beta }_2=2.14 (0.73))\) and in both models the Markov assumption is clearly rejected. The estimates show that, shortly after a bleeding episode, the mortality rate is high. Figure 6 also shows the plug-in estimators based on these models as well as that based on a semi-Markov model using duration \(d=t-T_1\) as baseline time in the model for \(\alpha _{12}(\cdot )\). It is seen that the latter is close to the non-parametric estimators based on land-marking, whereas the plug-in estimator based on a piecewise constant duration effect in the Cox model does not seem to capture the true effect very well. The plug-in estimator based on a linear duration effect in the Cox model is close to the non-parametric estimates using land-marking, and a model with a more detailed duration effect (using a fractional polynomial, not shown) was quite similar to the model with a linear effect.

Fig. 6 Estimates for the transition probability \(P_{01}(s,t), t>s\) for \(s=1\) year in the PROVA trial. AAJ: Aalen–Johansen estimator based on the entire data set; LM AAJ: land-mark Aalen–Johansen estimator using only subjects in state 0 at time \(s=1\); PGR: Semi-Markov model using d as time axis for the 1 to 2 transition intensity; PGL: Plug-in model with a linear effect of d; PCW: Plug-in model with piecewise constant effect of d; LM Pepe: land-mark Pepe estimator using only subjects in state 0 at time \(s=1\)

Turning, next, to the precision of the estimators, a bootstrap experiment was conducted, re-sampling data sets of size 286, \(B=1000\) times with replacement from the PROVA data. On each data set, estimators of \(P_{01}(1,t)\) were computed and the standard deviation of the estimates was calculated. Since the variability tends to be larger when the point estimate is large, Table 2 reports relative bootstrap standard deviations (coefficients of variation, CV). It is seen that the two estimators based on land-marking (\(n=190\)) have relatively large CV-values compared to those based on the full data set (\(n=286\)). The smallest CV is for the biased Aalen–Johansen estimator but also the semi-Markov estimator is quite precise. The plug-in estimator has large CV-values for high values of t.

Table 2 Coefficient of variation (%) for different estimators for the transition probability \(P_{01}(s,t), t>s\) for \(s=1\) year in the PROVA trial

6.1.1 Pseudo-values

To illustrate the use of pseudo-values, we now study how \(P_{01}(s,t), t>s\) for \(s=1\) year depends on whether sclerotherapy was given. First, we use the Cox model to model each transition intensity separately. To this end, we can use the whole data set or only at-risk or land-mark data (the two are equal for transitions 0 to 1 and 0 to 2). When considering the 1 to 2 transition, we have several options: fitting on the original time axis but with duration as a covariate (linear or as a spline), fitting on the duration time axis but with time since randomization as a covariate (linear or spline). The results are shown in Fig. 7; the effect of treatment only seems important for the 0 to 2 transition.

Fig. 7 Effect of sclerotherapy on transition rates in the PROVA trial using different subsets of the data

When modelling the transition probability directly, we first study the two estimated curves (using the land-mark Aalen–Johansen estimator) and compare to the average of pseudo-observations for each of the subgroups of the data, see Fig. 8. We can see no important differences between the two estimated curves. We also investigated if censoring was independent of covariates and that assumption turned out to be reasonably well fulfilled. As the last step, we fit the model using pseudo-observations at a time point around the 50th percentile of observed transition times. These can be calculated in different ways (Fig. 9), but all the estimated coefficients have a large standard deviation and no effects of the covariate are seen. However, those based on land-marking show the largest variability in accordance with the simulation study.

Fig. 8 Effect of sclerotherapy on \(P_{01}(s,t), t>s\) for \(s=1\) in the PROVA trial: land-mark Aalen–Johansen estimators and average pseudo-values

Fig. 9 Effect of sclerotherapy on \(P_{01}(s,t), t>s\) for \(s=1\) in the PROVA trial based on pseudo-observations

6.2 The BMT study

We also illustrate some of the methods for the bone marrow transplantation example introduced in Sect. 1. For this example, a Markov model does not fit the data well, judging from Cox models for the transition intensities that allow for duration dependence in states 1 or 2. Specifically, the two death intensities depend significantly on (a linear effect of) duration, \(t-T_1\) or \(t-T_2\) in states 1 or 2, respectively. The former increases with duration (\(\widehat{\beta }=0.050, \quad SD=0.021\)) while the latter decreases (\(\widehat{\beta }=-0.066, \quad SD=0.016\)). The transition intensity from state 1 to state 2 increases, though not significantly, with duration (\(t-T_1\)) in state 1 (\(\widehat{\beta }=0.074, \quad SD=0.046\)).
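The statements about significance can be checked from the quoted estimates with a simple calculation. The sketch below assumes a two-sided Wald test with a standard normal reference distribution and a conventional 5% level; the text does not state which test was used, so this is an illustration only, and the helper name `wald_test` is hypothetical.

```python
import math

def wald_test(beta, sd):
    """Wald statistic z = beta / sd and its two-sided normal p-value."""
    z = beta / sd
    p = math.erfc(abs(z) / math.sqrt(2.0))  # equals 2 * (1 - Phi(|z|))
    return z, p

# Estimates and standard deviations quoted in the text for the duration effects.
estimates = {
    "death from state 1, duration t - T1": (0.050, 0.021),
    "death from state 2, duration t - T2": (-0.066, 0.016),
    "state 1 to state 2, duration t - T1": (0.074, 0.046),
}

results = {name: wald_test(b, sd) for name, (b, sd) in estimates.items()}
for name, (z, p) in results.items():
    print(f"{name}: z = {z:.2f}, p = {p:.3g}")
```

Under this assumption, the first two ratios exceed 1.96 in absolute value while the third does not, in line with the description above.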

Figure 10 shows estimates of \(P_{02}(s,t), t>s\) for \(s=6\) months using various estimators: the Markov-based Aalen–Johansen estimator, the land-mark Aalen–Johansen and Pepe estimators, and two plug-in estimators. The first plug-in estimator uses durations in states 1 or 2 as baseline time variables with no adjustment for time t since transplantation (‘semi-Markov estimators’), and the other models the intensities out of states 1 or 2 using t as baseline time variable and adjusting for duration in states 1 and 2 using the models with estimates quoted above. It is seen that, for this example, the deviations from the Markov assumption are less severe for the estimation of the transition probability, and no big differences between the various estimators are apparent. However, the estimator based on the semi-Markov model does seem to give somewhat lower estimates for large values of t. A possible explanation is that the semi-Markov model does not take time since transplantation into account for the death intensities out of states 1 and 2. The example thus illustrates that when using plug-in estimators and modelling duration effects explicitly, great care must be exercised when setting up these models.

Fig. 10 Estimates for the transition probability \(P_{02}(s,t), t>s\) for \(s=6\) months for the bone marrow transplantation data. AAJ: Aalen–Johansen; LM AAJ: land-mark Aalen–Johansen; LM Pepe: land-mark Pepe; PGL: plug-in with a linear effect of d on death intensities from states 1 and 2; PGR: semi-Markov model using d as time axis for death intensities from states 1 and 2

7 Discussion

We have reviewed how inference for transition probabilities \(P_{hj}(s,t)\) may be carried out in non-Markov multi-state models. For (‘marginal’) estimators of the probability in a homogeneous population, i.e. with no consideration of covariates, we saw that the standard Aalen–Johansen estimator using all data could be biased when the process was, indeed, non-Markov, whereas this estimator was the most efficient one in the Markovian case. Restricting to the land-mark data set of subjects in state h at time s, the Aalen–Johansen method provided unbiased estimation and was comparable to the land-mark Pepe estimator in the situations studied. Estimators based on plug-in were highly competitive in terms of precision if the plug-in models for the intensities were correctly specified but did suffer from bias otherwise. This was seen both in the simulations and in the practical examples.
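The land-mark Aalen–Johansen idea can be sketched for an irreversible illness–death model (states 0, 1, 2). The code below is a minimal illustration under stated assumptions, not the implementation used in the paper: the function name, the data layout (one illness time, one exit time and a death indicator per subject) and the toy data are hypothetical, and an observed illness time is assumed to precede the exit time strictly.

```python
import math

def landmark_aalen_johansen(ill, exit_time, died, s, t, h=0):
    """Row h (h = 0 or 1) of the land-mark Aalen-Johansen estimate of P(s, t).

    ill[i]        time of the 0 -> 1 transition (math.inf if never observed)
    exit_time[i]  time of death or censoring
    died[i]       True if subject i died at exit_time[i]
    """
    # Land-mark subset: still under observation and in state h at time s.
    subset = [
        (a, e, d) for a, e, d in zip(ill, exit_time, died)
        if e > s and ((a > s) if h == 0 else (a <= s))
    ]
    # Distinct observed transition times in (s, t].
    times = sorted({a for a, e, d in subset if s < a <= t} |
                   {e for a, e, d in subset if d and s < e <= t})
    row = [0.0, 0.0, 0.0]
    row[h] = 1.0
    for u in times:
        y0 = sum(1 for a, e, d in subset if min(a, e) >= u)   # at risk in 0
        y1 = sum(1 for a, e, d in subset if a < u <= e)       # at risk in 1
        n01 = sum(1 for a, e, d in subset if a == u)
        n02 = sum(1 for a, e, d in subset if d and e == u and a > e)
        n12 = sum(1 for a, e, d in subset if d and e == u and a < e)
        a01 = n01 / y0 if y0 else 0.0
        a02 = n02 / y0 if y0 else 0.0
        a12 = n12 / y1 if y1 else 0.0
        p0, p1, p2 = row
        # One factor of the product integral: row <- row (I + dA(u)).
        row = [p0 * (1.0 - a01 - a02),
               p0 * a01 + p1 * (1.0 - a12),
               p0 * a02 + p1 * a12 + p2]
    return row

# Hypothetical toy data without censoring (six subjects).
ill = [math.inf, 2, 0.5, 3, math.inf, 1.5]
exit_time = [4, 5, 6, 3.5, 1.2, 2.5]
died = [True] * 6
row = landmark_aalen_johansen(ill, exit_time, died, s=1, t=3)
print(row)  # estimates of P00(1,3), P01(1,3), P02(1,3)
```

Without censoring, the estimate reduces to the empirical state proportions at time t among the subjects in state h at time s, which gives a simple sanity check for the toy data.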

The results for regression models using pseudo-observations were compatible with the findings for the overall estimators. Thus, using the Aalen–Johansen estimator for the full data set as the base estimator when calculating pseudo-values gave biased estimates of regression coefficients in non-Markov situations. Furthermore, plug-in models may be efficient but they do suffer from bias when the intensity models are not correctly specified, whereas base estimators using land-marking gave unbiased results, though possibly with a large variability. In conclusion, there is a bias/variance trade-off when choosing between land-mark and plug-in estimators.

Some remarks about the calculation of pseudo-values are in order. We think it is good practice always to look at the marginal estimator (which may exhibit large jumps and may also become negative at later time points) and to check whether the average of the pseudo-observations is close to the marginal estimator. Furthermore, convergence problems, especially at later time points, are not unusual, and it should be noted that pseudo-observations for the plug-in models are very computationally intensive.
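The recommended check can be sketched as follows, assuming the usual leave-one-out (jackknife) definition of pseudo-observations, \(n\widehat{\theta } - (n-1)\widehat{\theta }^{(-i)}\). The function names and toy data are hypothetical, and the sample mean stands in for the base estimator purely for illustration; for the mean, the pseudo-observations reproduce the original observations exactly.

```python
def mean(xs):
    """Hypothetical stand-in for the base (marginal) estimator."""
    return sum(xs) / len(xs)

def pseudo_values(data, estimator):
    """Leave-one-out pseudo-observations n*theta - (n-1)*theta^(-i)."""
    n = len(data)
    full = estimator(data)
    return [n * full - (n - 1) * estimator(data[:i] + data[i + 1:])
            for i in range(n)]

# Hypothetical toy data: indicator of occupying the state of interest at time t.
toy = [0, 1, 1, 0, 1, 0, 0, 1, 0, 0]
pv = pseudo_values(toy, mean)

# Check recommended in the text: average pseudo-observation vs. marginal estimate.
gap = abs(mean(pv) - mean(toy))
print(f"average pseudo-value: {mean(pv):.4f}, marginal estimate: {mean(toy):.4f}")
```

With a base estimator that handles censoring, the agreement is no longer exact, and a large gap would suggest revisiting the choice of base estimator or time point. Each pseudo-observation requires refitting the base estimator once, which is why plug-in base estimators become computationally heavy.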

Some points deserve further attention. Future studies should investigate whether the pseudo-values calculated from different base estimators have the desired mathematical properties that make the GEE in (4) unbiased when censoring is independent of covariates, see e.g. Overgaard et al. (2017). Additionally, methods for adjusting for covariate-dependent censoring should be developed, and it should be investigated to what extent plug-in estimators using, respectively, the full data set, the land-mark data set, or the at-risk data set are applicable as base estimators for the calculation of pseudo-values.