1 Introduction

Due to increased life expectancy in many countries, older adults are exposed to a higher risk of adverse health outcomes and are vulnerable to increased utilization of health care services and death [11]. For example, difficulty in activities of daily living (ADL) may prospectively relate to the progression of functional limitations and disability among the elderly [5]. The Health and Retirement Study (HRS) is a nationally representative longitudinal survey in the USA that repeatedly measured ADL as ordinal outcomes. Individuals make transitions over time among different response categories, and a trajectory based on the series of events is helpful for understanding disease progression [4, 6, 12, 15, 40]. The HRS is a large-scale longitudinal study with more than 20 years of follow-up data on approximately 20,000 people, producing a huge data volume. Nowadays, due to the lower cost of data acquisition, large and complex longitudinal data sets are captured, termed big data. As a result, there are new statistical challenges in methodology, theory, and computation to gain vital insight into actual behavior and make sense of such extensive, complex data; a 2018 special issue of Statistics and Probability Letters, “The role of Statistics in the era of big data”, was devoted to this topic. Modeling transitions over time among different response categories and predicting trajectory risks based on various risk factors is therefore a difficult task.

A growing area of interest is predicting the joint probability of a sequence of events (a trajectory) based on a specified covariate vector [23, 29, 32, 35, 41, 42]. Modeling these sequences allows us to predict likely future outcomes. Specifically, interest might be in: (i) What is the expected risk that a patient has a condition, given previous responses and risk factors? (ii) What is the predicted risk of occurrence of a sequence of events, given specified features at different follow-ups? (iii) What is the predicted outcome at the next follow-up, given specified values of covariates and previous outcomes? (iv) At a given follow-up, what are the interaction effects between risk factors and outcomes of earlier follow-ups, and among the previous events themselves? Using the predicted risk of a sequence of outcomes, health care providers can screen individuals and suggest necessary therapy and preventive measures. A physician can recommend early or regular office visits or prescribe medication to prevent hospitalization based on a patient’s trajectory [37]. Risk prediction can also make a patient aware of the future course of the disease [39].

Predicting trajectories for a sequence of ordinal outcomes based on specified covariates is a great challenge. To predict the joint probability of a series of events, we need to examine the progression of responses over subsequent follow-ups using a joint (multivariate) model for ordinal outcomes. A multivariate approach is often complicated and is challenging to develop for a large number of follow-ups [16]. The multistate higher-order Markov model (a conditional model) can be used to study the underlying dependence among consecutive follow-ups [24], and one can estimate the risk of a sequence of events with it [26]. However, these models are restricted to a small number of follow-ups due to over-parameterization, and one cannot assess the impact of prior outcomes because of the effective stratification [16, 25]. It is also not possible to include interactions between responses from previous follow-ups, or between previous responses and risk factors, in the model [13]. For many follow-ups, Markov models require a large sample size maintained throughout the follow-ups.

Figure 1 displays three repeated outcomes, each with three categories, and the twenty-seven possible trajectories (paths). In this case, one needs to fit a total of thirteen models: one marginal model for follow-up one (baseline), three first-order, and nine second-order Markov models. This can be computationally cumbersome, and the number of models may explode for a large number of repeated responses [14, 17]. Another choice is the regressive logistic models under the Markovian assumption, which include the binary outcomes at previous times, in addition to covariates, in the conditional models [7, 8, 23, 36]. Islam and Chowdhury [23] developed a regressive logistic model to predict the joint probability of a sequence of binary outcomes based on specified covariates, which significantly reduces the number of conditional models to be fitted. Chowdhury and Islam [13] extended this model to repeated multinomial responses. Ordinal logistic regression models with different variants are available for modeling an ordinal response [2, 3, 9, 19, 33, 34], for example the proportional odds, partial proportional odds, continuation ratio, stereotype, adjacent category, and baseline category models. However, these are univariate models for a single ordinal outcome only.

Against this backdrop, we propose two regressive models for repeated ordinal outcomes and derive the corresponding joint model, which is a new development. First, we propose a proportional odds regressive model (POM) for repeated ordinal outcomes; for the POM, one needs to test the proportional odds assumption [9]. Second, for cases in which the proportional odds assumption is violated, we propose a partial proportional odds regressive model (PPOM) for repeated ordinal outcomes. We also applied the multinomial regressive logistic model (MNOM) for repeated responses [13], ignoring the ordinal nature of the outcome variables. We then estimate the risk of a sequence of events for specified covariate values by linking marginal and conditional probabilities: the marginal probability is obtained from the outcome at the first follow-up, and the conditional probabilities are estimated from the subsequent follow-ups using the proposed regressive models. Using data partitioning (training and test data), we computed the prediction accuracy to check for over- or under-fitting. Furthermore, 10,000 bootstrap simulations were performed to assess the proposed models’ performance. Finally, we illustrated the proposed methods using follow-up data from the USA’s Health and Retirement Study (HRS).

2 Repeated outcomes and trajectories

Suppose \(Y_1, Y_2\), and \(Y_3\) are three repeated ordinal outcomes, each with three categories that may represent three states of ADL difficulty (0, 1, 2). Figure 1 displays the possible transitions among the three outcome categories over the three follow-ups; a total of twenty-seven distinct trajectories (paths) are possible. The first column corresponds to marginal probabilities, and the second and third to conditional probabilities.

Fig. 1 Transitions between states for regressive models

2.1 Notations

Let \({Y_{i1}},{Y_{i2}},\ldots ,{Y_{iJ_i}}\) represent the past and present responses for the ith subject at the jth follow-up, where \(i=1,2,\ldots ,n\), \(j=1,2,\ldots ,J_i\), and \(J_i\) is the number of follow-ups for subject i. For simplicity, the subscript i is omitted in what follows unless explicitly specified. Define \(Y_j=s\), where \(s=0,1,2,\ldots ,S\), with \(S+1\) outcome categories; category 0 may denote a non-event.

Following the notation used in [13], the joint probability mass function of \(Y_1,Y_2,\ldots ,Y_J\) with covariate vector \({\varvec{X=x}}\) can be expressed as:

$$\begin{aligned} \begin{aligned} P(Y_1 = y_1,Y_2 = y_2,\ldots ,Y_J = y_J\mid {\varvec{x}})&=P(Y_1 = y_1\mid {\varvec{x}})\times P(Y_2 = y_2\mid y_1;{\varvec{x}})\\&\quad \times \cdots \times P(Y_J = y_J \mid y_{J-1},\ldots ,y_1;{\varvec{x}})\\&=P_{y_1}({\varvec{x}})\times P_{y_2.y_{1}}({\varvec{x}})\times \cdots \\&\quad \times P_{y_J.y_{J-1},\ldots ,y_1}({\varvec{x}}), \end{aligned} \end{aligned}$$
(1)

where \({\varvec{X}}^\prime =[1,x_1,\ldots ,x_p]\) is the vector of covariates for a subject at the first follow-up. It should be noted that \({\varvec{X}}={\varvec{x}}\) can be time dependent. The functions on the right-hand side of Eq. (1) are as follows:

\(P(Y_1 = s\mid {\varvec{x}})=P_s({\varvec{x}})\) is the marginal probability function of \(Y_1\) conditional on \({\varvec{x}}\);

\(P(Y_j = s\mid y_{j-1};{\varvec{x}})=P_{s.y_{j-1}}({\varvec{x}})\) is the probability function of \(Y_j\) conditional on \(y_{j-1}\) and \({\varvec{x}}\) of order one;

\(P(Y_j = s\mid y_{j-1}, y_{j-2};{\varvec{x}})=P_{s.y_{j-1},y_{j-2}}({\varvec{x}})\) is the probability function of \(Y_j\) conditional on \(y_{j-1}, y_{j-2}\) and \({\varvec{x}}\) of order two;

\(P(Y_j = s\mid y_{j-1},y_{j-2},\ldots ,y_1;{\varvec{x}})=P_{s.y_{j-1},y_{j-2},\ldots ,y_1}({\varvec{x}})\) is the probability function of \(Y_j\) conditional on \(y_{j-1},\ldots ,y_1\) and \({\varvec{x}}\) of order \(k=j-1\).

The unconditional probability of the left-hand side of Eq. (1) is defined as:

\(P(Y_1 = y_1,Y_2 = y_2,\cdots ,Y_J = y_J\mid {\varvec{x}})=P_{y_1,y_2,\cdots ,y_J}({\varvec{x}})\).

3 Models

3.1 Proportional odds model (POM)

McCullagh [33] proposed the proportional odds model (POM) to analyze ordinal outcomes as a function of covariates. In this model, the regression coefficients are the same across all cumulative splits (lower levels versus all higher levels) of the response variable; this is the proportional odds (parallel regression) assumption, and it needs to be tested [9]. We assessed the proportional odds assumption using the Brant test [9]. Fitting the POM to the baseline outcome as a function of covariates provides the marginal model.

Let the outcome \(Y_1\) have \(S+1\) categories \((s=0,1,\ldots ,S)\) with associated probabilities \(\pi _0,\pi _1,\ldots ,\pi _S\) and cumulative probabilities \(P(Y_1 \le s)=\pi _0+\cdots +\pi _s\), so that \(P(Y_1 \le 0) \le P(Y_1 \le 1) \le \cdots \le P(Y_1 \le S)=1\). Then the proportional odds model can be written as:

$$\begin{aligned} P(Y_1\le s\mid {\varvec{x}})=\frac{\hbox {exp}\left( \alpha _{s}-{\varvec{\beta _1^\prime }}{\varvec{X}}\right) }{1+\hbox {exp}\left( \alpha _{s}-{\varvec{\beta _1^\prime }}{\varvec{X}}\right) }, \quad s=0,1,\ldots ,S-1 \end{aligned}$$
(2)

or equivalently can be expressed in logit form as

$$\begin{aligned} \begin{aligned} \hbox {logit}[P(Y_1\le s\mid {\varvec{x}})]&=\ln \bigg [\frac{\pi _0+\cdots +\pi _s}{\pi _{s+1}+\cdots +\pi _S}\bigg ]\\&= \alpha _s - \left( \beta _1X_1 +\cdots + \beta _pX_p\right) \\&=\alpha _s-{\varvec{\beta _1^\prime }}{\varvec{X}} \end{aligned} \end{aligned}$$
(3)

where the \(\alpha _s\)’s are the threshold parameters (cut points) and \({\varvec{\beta _1}}=[\beta _1,\beta _2,\ldots ,\beta _p]^\prime \) is the vector of regression coefficients corresponding to the covariate vector \({\varvec{X}}=[X_1,X_2,\ldots ,X_p]^\prime \). This model assumes that the effects of the covariates are the same for all cumulative logits (proportional odds). Then, with the conventions \(P(Y_1\le -1\mid {\varvec{x}})=0\) and \(P(Y_1\le S\mid {\varvec{x}})=1\), the marginal probability of the sth category is

$$\begin{aligned} P_s({\varvec{x}})=P(Y_1=s\mid {\varvec{x}})=P(Y_1\le s\mid {\varvec{x}}) - P(Y_1 \le s-1\mid {\varvec{x}}), \quad s=0,1,\ldots ,S. \end{aligned}$$
(4)

3.2 Proposed kth-order proportional odds regressive model

Let \(Y_1,\ldots , Y_J\) be repeated ordinal outcomes, each with \(S+1\) outcome levels \((s=0,1,2,\ldots ,S)\). Then the proposed kth-order \((k=j-1)\) proportional odds regressive model can be written as follows:

$$\begin{aligned} \begin{aligned}&\hbox {logit}[P(Y_j\le s\mid {\varvec{z}})]= \alpha _{s.y_{j-1}} \\&\quad - \left( \beta _{j.y_{j-1}1}Z_1 +\cdots + \beta _{j.y_{j-1}p}Z_p+\beta _{j.y_{j-1}(p+1)}Z_{p+1}+\cdots \right. \\&\quad + \beta _{j.y_{j-1}(p+S)} Z_{p+S} +\beta _{j.y_{j-1}(p+S+1)}Z_{p+S+1} +\cdots \\&\quad +\beta _{ j.y_{j-1}(p+2S)} Z_{p+2S}+\cdots +\beta _{j.y_{j-1}[p+(j-2)S+1]}Z_{p+(j-2)S+1} +\cdots \\&\quad \left. +\beta _{j.y_{j-1}[p+(j-1)S]} Z_{p+(j-1)S}\right) =\alpha _{s.y_{j-1}}-{\varvec{\beta ^\prime }}_{j.y_{j-1}}{\varvec{Z}}, \quad s=0,1,\ldots ,S-1 \end{aligned} \end{aligned}$$
(5)

where the \(\alpha _{s.y_{j-1}}\)’s are the threshold parameters and

$$\begin{aligned} \begin{aligned}&{\varvec{\beta _{j.y_{j-1}}}}=\left[ \beta _{j.y_{j-1}1},\ldots , \beta _{j.y_{j-1}p},\beta _{j.y_{j-1}(p+1)}\ldots ,\beta _{j.y_{j-1}(p+S)},\beta _{j.y_{j-1}(p+S+1)}\right. \\&\quad \left. ,\ldots ,\beta _{j.y_{j-1}(p+2S)},\ldots ,\beta _{j.y_{j-1}[p+(j-2)S+1]},\ldots ,\beta _{j.y_{j-1}\left[ p+(j-1)S\right] }\right] ^\prime \end{aligned} \end{aligned}$$
(6)

is the vector of regression coefficients corresponding to the covariate vector

$$\begin{aligned} \begin{aligned} {\varvec{Z}}&= \left[ Z_1,\ldots ,Z_p,Z_{p+1},\ldots ,Z_{p+S},Z_{p+S+1},\ldots ,Z_{p+2S},\right. \\&\quad \left. \ldots ,Z_{p+(j-2)S+1},\ldots , Z_{p+(j-1)S}\right] ^\prime \\&=\left[ {\varvec{X^\prime }},{\varvec{D^\prime }}\right] \\&=\left[ X_1,X_2,\ldots ,X_p,D_{11},\ldots ,D_{1S},D_{21},\ldots ,D_{2S},\ldots ,D_{(j-1)1},\ldots ,D_{(j-1)S}\right] ^\prime . \end{aligned} \end{aligned}$$
(7)

Here \(D_{11},\ldots ,D_{1S},D_{21},\ldots ,D_{2S},\ldots ,D_{(j-1)1},\ldots ,D_{(j-1)S}\) are the dummy variables for categories \(1, 2,\ldots ,S\) of \(Y_1,\ldots ,Y_{j-1}\), with 0 as the reference category. Then, with the same conventions as in Eq. (4), the conditional probability of the sth category is

$$\begin{aligned} \begin{aligned} P_{s.y_{j-1},\ldots ,y_1}({\varvec{z}})&=P(Y_j=s\mid y_1,y_2,\ldots ,y_{j-1};{\varvec{x}})\\&=P(Y_j\le s\mid y_1,y_2,\ldots ,y_{j-1};{\varvec{x}}) - P(Y_j \\&\le s-1\mid y_1,y_2,\ldots ,y_{j-1};{\varvec{x}}),\\&\quad s,y_1,\ldots ,y_{j-1}=0,1,\ldots ,S. \end{aligned} \end{aligned}$$
(8)

3.3 Partial proportional odds model (PPOM)

If the proportional odds assumption is violated for some predictors, alternative models include the unconstrained and constrained partial proportional odds models [38] and the multinomial logistic model, among others [1, 20, p. 290–292]. The unconstrained partial proportional odds model allows non-proportional odds for the subset of q predictors (\(q<p\), where p is the total number of predictors) for which the proportional odds assumption is violated. The marginal model using the baseline outcome can then be written as:

$$\begin{aligned} \begin{aligned} P(Y_1\le s\mid {\varvec{x}})=\frac{\hbox {exp}\left( \alpha _{s}-{\varvec{\beta _1^\prime }}{\varvec{X}}-{\varvec{\gamma }}_s^\prime {\varvec{T}}\right) }{1+\hbox {exp}\left( \alpha _{s}-{\varvec{\beta _1^\prime }}{\varvec{X}}-{\varvec{\gamma }}_s^\prime {\varvec{T}}\right) }, \quad s=0,1,\ldots ,S-1, \end{aligned} \end{aligned}$$
(9)

or equivalently can be expressed in logit form as

$$\begin{aligned} \hbox {logit}\left[ P\left( Y_1\le s\mid {\varvec{x}}\right) \right] =\alpha _s-{\varvec{\beta _1^\prime }}{\varvec{X}}-{\varvec{\gamma }}_s^\prime {\varvec{T}} \end{aligned}$$
(10)

where \(\alpha _s\) are the cut points, \({\varvec{T}}\) is the subset of the covariate vector for which the proportional odds assumption is violated, \({\varvec{\gamma }}_s\) is the vector of regression coefficients corresponding to the q covariates in \({\varvec{T}}\), and \({\varvec{\beta _1}}\) is the vector of regression coefficients of the covariates not in \({\varvec{T}}\). Then, using Eq. (4), we can obtain the marginal probability of the sth category.

3.4 Proposed kth-order partial proportional odds regressive model

The kth-order partial proportional odds regressive model for \(Y_1,\ldots ,Y_j\) can be written as:

$$\begin{aligned} \hbox {logit}\left[ P\left( Y_j\le s\mid {\varvec{z}}\right) \right] =\alpha _{j.s}-{\varvec{\beta _{j.y_{j-1}}^\prime }}{\varvec{Z}}-{\varvec{\gamma }}_{j.s}^\prime {\varvec{T}} \end{aligned}$$
(11)

where \(\alpha _{j.s}\) are the cut points; \({\varvec{T}}\), \({\varvec{\gamma }}_{j.s}\), and \({\varvec{\beta _{j.y_{j-1}}}}\) are as explained in Eq. (10); and \({\varvec{Z}}\) is the covariate vector defined in Eq. (7). The conditional probability of the sth category can be estimated using Eq. (8).

3.5 Multinomial regressive logistic model

Chowdhury and Islam [13] proposed the kth-order multinomial regressive logistic model. The first-order multinomial regressive model \(P(Y_2\mid y_1; {\varvec{z}})\) for outcomes \(Y_1\) and \(Y_2\) can be written as:

$$\begin{aligned} P_{s.y_1}({\varvec{z}})=P(Y_2=s\mid y_1; {\varvec{z}})=\frac{e^{g_{s.y_1}({\varvec{Z}})}}{\sum \limits _{r=0}^S e^{g_{r.y_1}({\varvec{Z}})}} ,\quad s, y_1=0,1,\ldots ,S, \end{aligned}$$
(12)
$$\begin{aligned} \begin{aligned} \hbox {where } g_{s.y_1}({\varvec{Z}})&=\beta _{s.y_10}+\beta _{s.y_11}Z_1+\cdots +\beta _{s.y_1p}Z_p+\beta _{s.y_1(p+1)}Z_{p+1}+\cdots \\&\quad +\beta _{s.y_1(p+S)}Z_{p+S}, \quad s=1,\ldots ,S, \quad g_{0.y_1}({\varvec{Z}})=0, \text{ and } \end{aligned} \end{aligned}$$

\({\varvec{Z^\prime }}=\left[ 1,Z_1,\ldots ,Z_p,Z_{p+1},\ldots , Z_{p+S}\right] \) \(=\left[ {\varvec{X^\prime }},{\varvec{D^\prime }}\right] =\left[ 1,X_1,\ldots ,X_p,D_{11},\ldots , D_{1S}\right] \). Here \(D_{11},\ldots , D_{1S}\) are the dummy variables for categories \(1,\ldots , S\) of outcome \(Y_1\), with 0 as the reference category, producing a total of \([(p+1)+S]S\) regression coefficients. A fitting sketch is given below.

The first- and all higher-order regressive models have the same form as the corresponding marginal models, with previous responses entering as additional covariates. The regressive modeling approach therefore requires fitting only one model for each repeated outcome, incorporating the previous responses as covariates along with the risk factors. Besides, it allows the divide-and-recombine technique for large complex data: one can run the models for all follow-ups in parallel, exploiting multiple processors, as sketched below. We can use R, SAS, STATA, or other software capable of fitting POM, PPOM, and MNOM. It is noteworthy that the regressive models for binary outcomes proposed by Islam and Chowdhury [23] and Bonney [7, 8] are special cases of the regressive model in Eq. (12) with \(s=0,1\).

3.6 Predictive models and joint probabilities

The log-likelihood function of the joint mass function in (1) can be obtained as:

$$\begin{aligned} \begin{aligned} l&=\sum \limits _{i=1}^{n} \ln P(Y_{i1} = y_{i1},Y_{i2} = y_{i2},\ldots ,Y_{iJ} = y_{iJ}\mid {\varvec{x}})\\&=\sum \limits _{i=1}^{n} \bigg [\ln P(Y_{i1} = y_{i1}\mid {\varvec{x}}) + \ln P(Y_{i2} = y_{i2}\mid y_{i1};{\varvec{x}}) \\&\quad +\cdots + \ln P(Y_{iJ} = y_{iJ} \mid y_{i1},\ldots ,y_{i(J-1)};{\varvec{x}})\bigg ]. \end{aligned} \end{aligned}$$
(13)

For each proposed model, differentiating the log-likelihood with respect to the parameters and setting the derivatives to zero yields the estimating equations whose solutions are the maximum likelihood estimates of the parameters. The observed information matrix is obtained from the second derivatives, and the estimates are computed using the Newton–Raphson method. The fitted models are then used for prediction.

We can predict the risks of a sequence of outcomes for a subject with specified covariate vector \({\varvec{X}}^*={\varvec{x}}^*\) for a particular trajectory as shown in Fig. 1.

The predicted joint probability \({\hat{P}}(Y_1=y_1,Y_2=y_2,\ldots ,Y_J=y_J\mid {\varvec{x}}^*)\) can be obtained as:

$$\begin{aligned} \begin{aligned} {\hat{P}}\left( Y_1 = y_1,Y_2 = y_2,\ldots ,Y_J = y_J\mid {\varvec{x}}^*\right)&={\hat{P}}\left( Y_1 = y_1\mid {\varvec{x}}^*\right) \times {\hat{P}}\left( Y_2 = y_2\mid y_1;{\varvec{x}}^*\right) \\&\quad \times \cdots \times {\hat{P}}\left( Y_J = y_J \mid y_{J-1}, \ldots ,y_1;{\varvec{x}}^*\right) \\&={\hat{P}}_{y_1}\left( {\varvec{x}}^*\right) \times {\hat{P}}_{y_2.y_{1}}\left( {\varvec{x}}^*\right) \\&\quad \times \cdots \times {\hat{P}}_{y_J. y_{J-1}, \ldots ,y_1}\left( {\varvec{x}}^*\right) . \end{aligned} \end{aligned}$$
(14)

For simplicity, consider two repeated outcomes \(Y_1\) and \(Y_2\), each with categories \(s=0,1,2\). Then, using Eq. (14), the predicted joint probability \(P(Y_1=y_1,Y_2=y_2 \mid {\varvec{x}}^*)\) is

$$\begin{aligned} \begin{aligned} {\hat{P}}_{y_1,y_2}\left( {\varvec{x}}^*\right)&={\hat{P}}\left( Y_1=y_1,Y_2=y_2 \mid {\varvec{x}}^*\right) ={\hat{P}}\left( Y_1=y_1 \mid {\varvec{x}}^*\right) \times {\hat{P}}\left( Y_2=y_2 \mid y_1; {\varvec{x}}^*\right) \\&={\hat{P}}_{y_1}\left( {\varvec{x}}^*\right) \times {\hat{P}}_{y_2.y_1}\left( {\varvec{x}}^*\right) , \quad y_1, y_2 = 0, 1, 2. \end{aligned} \end{aligned}$$
(15)

We can predict the marginal probabilities \({\hat{P}}_0({\varvec{x}}^*), {\hat{P}}_1({\varvec{x}}^*), {\hat{P}}_2({\varvec{x}}^*)\) from the fitted marginal model and the first-order conditional probabilities \({\hat{P}}_{s.y_1}({\varvec{x}}^*)\) from the fitted first-order regressive model using the covariate vector \({\varvec{Z}}=[{\varvec{x}}^*,D_{11},D_{12}]^\prime \), where \(D_{11},D_{12}=0,1\). For example, \({\hat{P}}_{1.0}({\varvec{x}}^*)\) and \({\hat{P}}_{2.0}({\varvec{x}}^*)\) are estimated using \({\varvec{Z}}=[{\varvec{x}}^*,0,0]^\prime \); \({\hat{P}}_{1.1}({\varvec{x}}^*)\) and \({\hat{P}}_{2.1}({\varvec{x}}^*)\) using \({\varvec{Z}}=[{\varvec{x}}^*,1,0]^\prime \); and \({\hat{P}}_{1.2}({\varvec{x}}^*)\) and \({\hat{P}}_{2.2}({\varvec{x}}^*)\) using \({\varvec{Z}}=[{\varvec{x}}^*,0,1]^\prime \). The joint probabilities for the two outcomes are then \({\hat{P}}_{00}={\hat{P}}_0\times {\hat{P}}_{0.0}\), \({\hat{P}}_{01}={\hat{P}}_0\times {\hat{P}}_{1.0}\), \({\hat{P}}_{02}={\hat{P}}_0\times {\hat{P}}_{2.0}\), and so on, as sketched below.

4 Tests

4.1 Significance of the joint model

We can test the significance of the joint model using the likelihood ratio test between the constant-only joint model (Red.) and the full joint model (Full) as follows:

$$\begin{aligned} -2\left[ \ln L_{\text {Red.}}({\varvec{{\hat{\beta }}_0}})-\ln L_{\text {Full}}({\varvec{{\hat{\beta }}_1}})\right] \text{ is } \text{ distributed } \text{ asymptotically } \text{ as } \chi ^2_{(d)}. \end{aligned}$$
(16)

The degrees of freedom (d) for the three models are as follows:

POM: \(d=[\{p+S\}+\{p+S+S\} +\{p+2S+S\} +\cdots + \{p+(j-1)S+S\}]-jS\);

PPOM: \(d=[\{p^\prime +S\}+\{p^\prime +S+S\} +\{p^\prime +2S+S\} +\cdots + \{p^\prime +(j-1)S+S\}]-jS\);

MNOM: \(d=[\{(p+1)S\}+\{(p+1+S)S\} +\{(p+1+2S)S\} +\cdots + \{(p+1+(j-1)S)S\}] - jS\).

Here \({\varvec{{\hat{\beta }}_0}}\) and \({\varvec{{\hat{\beta }}_1}}\) include the regression coefficients from the constant-only joint model and the full joint model, respectively. Table 1 displays the number of parameters for the different models.

Table 1 Number of parameters for different models

4.2 Test for order of the regressive model

To test the order of the regressive model, i.e., whether a given response depends on any of the previous ones, we used the test shown by Chowdhury and Islam [13]. The null hypothesis for the kth-order \((k=j-1)\) regressive model is

$$\begin{aligned} \begin{aligned} H_0: \quad \beta _{j.y_{j-1}(p+1)}&=\cdots =\beta _{j.y_{j-1}(p+S)}=\beta _{j.y_{j-1}(p+S+1)}= \cdots = \beta _{j.y_{j-1}(p+2S)}\\&\quad =\cdots = \beta _{j.y_{j-1}[p+(j-2)S+1]} = \cdots = \beta _{j.y_{j-1}[p+(j-1)S]} = 0, \end{aligned} \end{aligned}$$
(17)

which can be tested using the following test statistic:

$$\begin{aligned} -2\left[ \ln L\left( {\varvec{{\hat{\beta }}_1}}\right) -\ln L\left( {\varvec{{\hat{\beta }}}}\right) \right] , \end{aligned}$$
(18)

which is distributed asymptotically as \(\chi ^2\) with \([p+(j-1)S+S]-[p+S]=(j-1)S\) degrees of freedom. The term \([p+(j-1)S+S]\) is the total number of parameters of the \((j-1)\)th-order regressive model, and \((j-1)S\) is the number of previous outcomes \(y_1,\ldots ,y_{j-1}\) multiplied by the number of dummy variables (S). The test can then be performed in two stages (a code sketch follows the list):

(i) The likelihood ratio test can be used to test the significance of the overall model at the first stage.

(ii) The Wald test can be used to test the significance of the parameter(s) corresponding to the previous outcomes.

4.3 Overfitting, underfitting and predictive accuracy

We evaluated the performance and predictive capability of the models by estimating the prediction accuracy, using the confusion matrix, on the training, test, and full data sets, and used these accuracies to check for over- or under-fitting [27, p. 21, 29], as sketched below. Well-fitting models with better discriminative ability and predictive power yield higher prediction accuracy.

5 An illustration

The panel data from the Health and Retirement Study (HRS), sponsored by the National Institute on Aging (grant number NIA U01AG09740) and conducted by the University of Michigan [21], are used for illustration. We used data from follow-up six to follow-up eleven of the RAND version. At follow-up six, the minimum age of the subjects was 60 years. The outcome variable considered is the Activities of Daily Living (ADL) index; the ADL outcomes from follow-up six to follow-up eleven are denoted by \(Y_1, Y_2, Y_3, Y_4, Y_5\), and \(Y_6\), respectively. This index ranges from 0 to 5 and is the sum of five tasks (yes/no): whether respondents faced difficulty in walking, dressing, bathing, eating, and getting in/out of bed. Due to small frequencies, we recoded all six ADL outcomes as follows: 0 as 0; 1 or 2 as 1; and 3 or more as 2, termed independent (free of ADL difficulty), mild ADL difficulty, and severe ADL difficulty, respectively (a recoding sketch is given below). The explanatory variables considered are: age (in years), marital status (married/partnered = 1, single/separated = 0), whether drink (yes = 1, no = 0), gender (male = 1, female = 0), ncond (number of conditions ever had, range 0–8), white (yes = 1, no = 0) and black (yes = 1, no = 0) with others as the reference category, education (in years), veteran status (yes = 1, no = 0), mobility (mobility index, range 0–5), BMI (body mass index), CESD (mental health index, range 0–8), lmuscle (large muscle index, range 0–4), gskills (gross motor skills, range 0–5), and wrecall (word recall score, range 0–20). More details about the variables can be found in the RAND HRS Longitudinal File 2016 (V1) documentation. Table 2 presents the frequency distribution of the outcome variables \(Y_1, \ldots ,Y_6\).

Table 2 Distribution of outcomes (Activity of Daily Living)

Tables 3 and 4 display the parameter estimates, significance levels, standard errors, and Brant-test p-values for the proportional odds assumption from the POM for the marginal and regressive models. Various predictors are significantly associated with the outcome variables in the different models; in particular, gender, mobility, lmuscle, and gskills were found to be significant in all models (Tables 3 and 4). Most of the dummy indicators for previous outcomes are significantly and positively associated with the current outcomes, except for some in the higher-order models. The Brant test indicated violation of the proportional odds assumption for several covariates in the marginal and all regressive models. We fitted the PPOM to handle the variables that violated the proportional odds assumption in the POM. We also fitted the multinomial regressive logistic model, ignoring the ordinal nature of the outcomes. The parameter estimates for PPOM and MNOM are shown in the supplementary materials (Appendix A: Tables 6, 7, 8, 9 and 10). All the models were fitted in parallel using three cores of a desktop computer (Intel i5-4590 CPU, 3.30 GHz, four cores, NVIDIA Quadro FX 370 LP graphics card). The CPU time for fitting the POM for all six follow-ups was short (user 0.45 s, system 0.09 s, elapsed 13.00 s).

Table 3 Parameter estimates of proportional odds model for the marginal, first-order, and second-order conditional models
Table 4 Parameter estimates of proportional odds model for the third-order, fourth-order, and fifth-order conditional models

Table 5 displays model statistics, including the log-likelihood values for the constant-only and full models, for the marginal and all higher-order models under POM, PPOM, and MNOM. The likelihood ratio test between the constant-only and full models is statistically significant (\(p < 0.001\)) for POM, PPOM, and MNOM. The prediction accuracy based on the confusion matrix for the full, training, and test data varies between 0.91 and 0.94 and is very similar across POM, PPOM, and MNOM. The accuracies from the full, training, and test data are also very close for each model, indicating the absence of over- or under-fitting and better generalization for out-of-sample prediction.

Table 5 Various statistics for proportional odds, partial proportional odds, and multinomial models

5.1 Predicted trajectories

To illustrate, we show predictions for three selected trajectories: (i) remaining free of ADL difficulty at all follow-ups, \(P_{0,0,0,0,0,0}({\varvec{x}})\); (ii) mild ADL difficulty at all follow-ups, \(P_{1,1,1,1,1,1}({\varvec{x}})\); and (iii) severe ADL difficulty at all follow-ups, \(P_{2,2,2,2,2,2}({\varvec{x}})\).

5.1.1 Impact of gender on trajectory

Figure 2 displays the predicted joint probabilities by gender from the three models (POM, PPOM, and MNOM) for the three trajectories. The predicted risk at follow-up six in the graphs is the marginal probability, while those from follow-up seven onward are the joint probabilities. Table 11 in Appendix presents the outcomes and covariates at all six follow-ups for these subjects. The differences between models and genders are clearest for the severe trajectory \(P_{2,2,2,2,2,2}({\varvec{x}})\). The predicted risk from PPOM is highest for the male (top line) compared with that of the female (the third solid line from the top). These differences reflect the significant positive association of gender with the outcomes at all follow-ups, with PPOM giving better predictions because the proportional odds assumption is violated (Tables 3 and 4). The first two panels show the other two trajectories, \(P_{0,0,0,0,0,0}({\varvec{x}})\) and \(P_{1,1,1,1,1,1}({\varvec{x}})\), from the different models. The joint probabilities (second follow-up onward) for these trajectories are close to zero because the observed ADL difficulty at all six follow-ups for these subjects was severe (Table 11 in Appendix). The predicted joint probabilities for the trajectory \(P_{2,2,2,2,2,2}({\varvec{x}})\) for all figures are shown in Table 18 in Appendix.

5.1.2 Impact of mobility index on trajectory

Next, we assess the impact of the mobility index at different values (0 and 5) on the trajectory of a female subject (Fig. 3). The other covariates remain the same as in Table 11 in Appendix. The predicted risks of the trajectories for a mobility index of zero were close to one for all three models, and the three trajectories coincided (top line in the bottom panel). A higher value (5) of the mobility index reduces ADL difficulty compared with a value of zero; this reduction is due to the significant negative association of mobility with the outcomes at all follow-ups. For a mobility index of 5, the predicted joint probabilities for the trajectory were highest from POM, followed by PPOM and MNOM, respectively.

5.1.3 Impact of large muscle index on trajectory

The large muscle index showed significant positive associations with the outcomes for the marginal and all regressive models. Figure 4 displays the trajectories for two values (0 and 4) of this index from the three models for a male subject. Zero means no difficulty with any of the four tasks included in this index, and four means difficulty with all four tasks. The top lines in the bottom panel, for the path \(P_{2,2,2,2,2,2}({\varvec{x}})\) from POM, PPOM, and MNOM with a large muscle index of four, are close to one. For a large muscle index of zero, these were followed by the trajectories from MNOM, PPOM, and POM, respectively.

5.1.4 Impact of large muscle index on trajectory with mild ADL difficulties

Next, we assess the impact of the large muscle index on the trajectory \(P_{2,2,2,2,2,2}({\varvec{x}})\) for a subject with mild ADL difficulty at all outcomes (Fig. 5). The trajectories from all three models with an index value of 4 predicted probabilities close to one (top three lines). A zero large muscle index lowered the probabilities sharply; the smallest reduction was from POM, followed by PPOM and MNOM, respectively. Controlling the covariates that are significantly associated with the outcomes and with previous episodes of ADL difficulty slows down the progression of the disease considerably.

5.1.5 Impact of large muscle, mobility index, and previous outcomes on trajectory

Lastly, we assessed the joint impact of a large muscle index of 0 and a mobility index of 5 on the trajectory \(P_{2,2,2,2,2,2}({\varvec{x}})\) with mild ADL difficulty at all outcomes (Fig. 6). The largest reduction was from PPOM, followed by POM and MNOM, respectively. It is clear from the figure that controlling risk factors can reduce ADL difficulty substantially over time.

6 Bootstrapping

We performed 10,000 bootstrap simulations and computed the bias, standard error, and mean squared error to measure the accuracy of the parameter estimates and the predicted joint probabilities for trajectories. We used the nonparametric bootstrap, sampling randomly with replacement from the original longitudinal data without any distributional assumptions on the observations. Because the bootstrap samples are drawn with replacement, the observations not included in a given bootstrap sample are used as test data to assess the model’s ability to generalize. For POM, the estimates in Tables 3 and 4 are treated as the population parameters when computing bias, standard error, and mean squared error; for MNOM, we used Tables 8, 9, and 10 in Appendix. We could not bootstrap the PPOM, as a varying number of covariates violated the proportional odds assumption in different bootstrap samples (Tables 6 and 7 in Appendix). For POM, the bias is generally minimal (less than 1 percent) for the estimators of all model parameters, with very low standard errors and mean squared errors (Tables 12 and 13 in Appendix); the same holds for MNOM (Tables 14, 15 and 16 in Appendix). We did, however, compute bootstrap estimates of the predicted joint probabilities for trajectories on the full, training, and test data sets for POM, PPOM, and MNOM. The bias of the prediction accuracy from all three models for the full, training, and test data is less than 0.01 percent (Table 17 in Appendix). Table 18 in Appendix displays the bootstrap estimates of the predicted joint probabilities for the subject used in all the graphs. The bootstrap means (\({\bar{x}}\)) of the predicted joint probabilities are very similar to the population values; the biases are negligible, and the average bootstrap estimates essentially coincide with the population estimates.

Fig. 2 The predicted trajectories by gender from three models

Fig. 3 Trajectory of a male subject by mobility index from three models

Fig. 4 Trajectory of a female subject by large muscle index from three models

Fig. 5 \(P_{2,2,2,2,2,2}({\varvec{x}})\) trajectory by large muscle index with mild ADL difficulties

Fig. 6 Impact of large muscle, mobility index and mild ADL difficulties on the trajectory

7 Conclusion

In this paper, we proposed (i) proportional odds and (ii) partial proportional odds regressive models, along with a framework to predict the risks of a sequence of ordinal responses from longitudinal studies in which subjects can make transitions through different trajectories. The proposed models and the risk prediction framework are a new development. We also compared the results from POM and PPOM with the multinomial regressive logistic model of [13]. Using a well-known longitudinal data set, we illustrated the proposed models; the estimates are computed conditionally for each stage in the process, and the conditional estimates are then linked with the marginal model to provide the joint model and the trajectory. The proposed modeling framework allows answering different questions of interest to researchers, clinicians, and policymakers. (i) We can obtain the conditional probability estimate, and hence the class prediction, at each stage. This conditional probability allows us to assess the effect of responses from previous follow-ups, and we can compare the predicted sequence across all time points with the observed one. [31] suggested that, for repeated measures, prediction must generally be conditioned on the previous history of a subject, and [30] concluded that conditional models are of fundamental interest and that one can make marginal predictions from conditional models. (ii) The estimated joint probability provides the trajectory, which is of vital importance: using it, one can see how individual responses change over time, which is the advantage of repeated measures [31]. (iii) The model allows interactions among previous outcomes and between previous outcomes and predictors; such interaction terms may provide a better understanding of the underlying disease process and the relationships between outcomes and related risk factors. (iv) In the proposed regressive model, it is easy to include varying numbers of predictors at each stage, and one can easily add a terminal event at each stage, for example, death as the last category of the outcome variables [18, 25].

We reported the likelihood ratio tests and AIC for the marginal and regressive models. When the outcomes are treated as ordinal, the PPOM performed better than the POM, as its AIC was lowest for the marginal and all regressive models; this pattern was also evident in the figures of predicted trajectories. Classical marginal models (such as GEE) provide average relationships estimated from repeated observations; however, the transition probabilities may depend on different models, each representing a transition from one stage to another. [31] examined some important theoretical aspects of marginal models and demonstrated various limitations. Another alternative is subject-specific models, which account for heterogeneity by including random-effect terms in the linear predictor [10]. The proposed method, however, provides a more comprehensive, flexible, and attractive setup for the risk prediction of repeated ordinal outcomes emerging from longitudinal studies.

A significant improvement of the proposed approach is the reduction of overparameterization, as it requires only one model for each follow-up compared with a sequence of conditional models, such as Markov models. The proposed modeling framework can readily use other available models for ordinal outcomes (e.g., continuation ratio, stereotype, adjacent category), and one can use different machine learning algorithms (e.g., neural networks, support vector machines, decision trees, and random forests) for multiclass classification and trajectory prediction within the same framework. Moreover, the proposed method is beneficial for analyzing big data with a large number of repeated outcomes, as it readily permits a divide-and-recombine approach in a statistically valid manner [14, 17]. Follow-ups form a natural division variable that allows data division and recombination for the trajectory using Eq. (1), and for a large sample size within follow-ups a second-level data division is also possible [28]. One can then analyze all subsets using multiple cores in a single computer or several CPUs in a distributed system [22]. We believe the proposed methods can be applied to the analysis and risk prediction of sequences of events in many fields, such as epidemiology, public health, survival analysis, genetics, reliability, and environmental studies.