1 Introduction

Longitudinal item response data occur when students are assessed at several time points (Singer and Andrade 2000). Such data consist of response patterns of different examinees responding to different tests at different measurement occasions (e.g., grades). This leads to a complex dependence structure, which arises from the fact that measurements from the same student are typically correlated (Tavares and Andrade 2006).

Various longitudinal item response theory (LIRT) models have been proposed to handle the correlation between measurements made over time. The popular mixed-effects regression modeling approach is often considered, where random effects are used to model the between-student and within-student dependencies. Conaway (1990) proposed a Rasch LIRT model to analyze panel data, together with a marginal maximum likelihood method (Bock and Aitkin 1981) for parameter estimation. Liu and Hedeker (2006) developed a comparable three-level model to analyze LIRT data with ordinal responses. Eid (1996) defined a LIRT model for polytomous response data. Douglas (1999) analyzed longitudinal response data from a quality-of-life instrument using a joint model, which consisted of a proportional odds model and the graded item response model.

In these mixed effects models, the assumption of conditional independence is achieved by the random effects. That is, the assumption is made that the time-variant measurements are conditionally independent given the student’s latent trait. The random effects imply a compound symmetry covariance structure, which assumes equal variances and covariances over time. In practice, the within-student latent trait dependencies are often not completely modeled and the errors are still correlated over time. Furthermore, in regression using repeated measurements it is common for the errors to show a time series structure (e.g., an autoregressive dependence) (Fitzmaurice et al. 2008; Hedeker and Gibbons 2006). If the dependence structure of the errors is not correctly specified, the parameter estimates and their standard errors will be biased.

Therefore, following the work of Jennrich and Schluchter (1986), Muthén (1998) and Tavares and Andrade (2006), restricted covariance pattern models are considered to model the time series structure of the errors. That is, the errors are allowed to correlate over time, and different variance–covariance structures are proposed to capture time-specific between-student variability and time-heterogeneous longitudinal dependencies between latent traits. An important aspect is that the covariance matrices considered allow for time-heterogeneous variances and covariances. The covariance pattern modeling framework is integrated in the LIRT modeling approach. At the student level, the time-specific latent traits are assumed to be multivariate normally distributed, and the within-student correlation structure is modeled using a covariance pattern model. This makes it possible to model specific types of time-invariant and time-variant dependencies.

This modeling framework builds on the work of Tavares and Andrade (2006), who proposed a logistic three-parameter IRT model with a multivariate normal population distribution for the latent traits. They used a covariance matrix to model the within-examinee dependency, where the variances are allowed to vary over time but the covariance structure is assumed to be time-homogeneous. The modeling framework also relates to the generalized linear latent trait model for longitudinal data of Dunson (2003), where the latent variable covariance structure is modeled via a linear transition model using observed predictors and an autoregressive component.

A full Gibbs sampling (FGS) algorithm is developed, which avoids MCMC methods that require adaptive implementations, such as the Metropolis–Hastings algorithm, to regulate the convergence of the algorithm. Furthermore, Sahu (2002) and Azevedo et al. (2012a) have shown that an FGS algorithm tends to perform better, in terms of parameter recovery, than a Metropolis–Hastings-within-Gibbs sampling algorithm when dealing with IRT models. The proposed MCMC algorithm recovers all parameters properly and accommodates a wide range of variance–covariance structures. Using a parameter transformation method, MCMC samples of restricted parameters are obtained by transforming MCMC samples of unrestricted parameters. The proposed modeling framework is extended with various Bayesian model-fit assessment tools. Among other things, a Bayesian p-value is defined based on a suitable discrepancy measure for a global model-fit assessment, and it is shown how Bayesian latent residuals can be used to evaluate the normality assumptions (Albert and Chib 1995).

This paper is outlined as follows. After introducing the Bayesian LIRT model, the FGS method is given, which can handle different variance–covariance structures. Then, the accuracy of the MCMC estimation method as well as the prior sensitivity are assessed. Subsequently, a real data study is presented, where the data set comes from a large-scale longitudinal study of children from the fourth to the eighth grade of different Brazilian public schools. One of the objectives of the study is to analyze student achievement across different grade levels. The model assessment tools are used to evaluate the fit of the model. In the last section, the results and some model extensions are discussed.

2 The model

A longitudinal test design is considered, where tests are administered to different examinees at different points in time. For each measurement occasion at time point t, \(t = 1, \ldots, T\), \(n_t\) examinees complete a test consisting of \(I_t\) items. The design can be characterized as an incomplete block design such that common items are defined across tests and the total number of items equals \(I \le \sum _{t=1}^TI_t\). For a complete design, \(n_t = n\) for all \(t\). Dropouts and inclusion of students during the study are allowed.

The following notation will be introduced. Let \(\theta _{jt}\) represent the latent trait of examinee j (\(j = 1,\ldots ,n,\)) at time-point or measurement occasion t (\(t= 1,\ldots ,T\)), \(\varvec{\theta }_{j.} = (\theta _{j1},\ldots ,\theta _{jT})^{t}\) the vector of the latent traits of the examinee j, and \(\varvec{\theta }_{..} = (\varvec{\theta }_{1.},\ldots ,\varvec{\theta }_{n.})^{t}\) the vector of all latent traits. Let \(Y_{ijt}\) represent the response of examinee j to item i (\(i = 1,\ldots ,I\)) in time-point t, \(\varvec{Y}_{.jt} = (Y_{1jt},\ldots ,Y_{I_tjt})^{t}\) the response vector of examinee j in time-point \(t,\,\varvec{Y}_{..t}= (\varvec{Y}_{.1t}^{t},\ldots ,\varvec{Y}_{.n_tt}^{t})^{t}\) the response vector of all examinees in time-point \(t,\,\varvec{Y}_{...}=(\varvec{Y}_{..1}^{t},\ldots ,\varvec{Y}_{..T}^{t})^{t}\) the entire response set. Let \(\varvec{\zeta }_i\) denote the vector of parameters of item i, \(\varvec{\zeta }=(\varvec{\zeta }_{1}^t,\ldots ,\varvec{\zeta }_{I}^t)^t\) the whole set of item parameters, and \(\varvec{\eta }_{\varvec{\theta }}\) the vector with population parameters (related to the latent traits distribution).

A LIRT model is proposed that consists of two stages. At the first stage, a time-specific two-parameter IRT model is considered for the measurement of the time-specific latent traits given observed dichotomous response data. The item-specific response probabilities are assumed to be independently distributed given the item and time-specific latent trait parameters. At the second stage, the subject-specific latent traits are assumed to be multivariate normally distributed with a time-heterogeneous covariance structure, that is:

$$\begin{aligned}&Y_{ijt} \mid (\theta _{jt},\varvec{\zeta }_i) \sim \text{ Bernoulli }(P_{ijt}) \nonumber \\&P_{ijt} = P(Y_{ijt}=1 \mid \theta _{jt},\varvec{\zeta }_i) = \varPhi (a_i\theta _{jt}-b_i) \end{aligned}$$
(1)
$$\begin{aligned} \varvec{\theta }_{j.} | \varvec{\eta }_{\varvec{\theta }}&\sim N_{T}( \varvec{\mu }_{\varvec{\theta }},\varvec{\varPsi }_{\varvec{\theta }}), \end{aligned}$$
(2)

where \(\varvec{\eta }_{\varvec{\theta }}\) consists of \(\varvec{\mu }_{\varvec{\theta }}\) and \(\varvec{\varPsi }_{\varvec{\theta }}\), and \(\varPhi (\cdot )\) stands for the standard normal cumulative distribution function. In this parametrization, the difficulty parameter \(b_i = a_ib_i^*\) is a transformation of the original difficulty parameter denoted by \(b_i^*\).
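To make the measurement model concrete, the response probability in Eq. (1) can be evaluated directly. The sketch below (illustrative Python using numpy/scipy; the function names and parameter values are assumptions, not part of the paper) computes \(P_{ijt} = \varPhi (a_i\theta _{jt}-b_i)\) and simulates dichotomous responses:

```python
import numpy as np
from scipy.stats import norm

def response_prob(a, b, theta):
    """P(Y_ijt = 1 | theta_jt, a_i, b_i) under the probit model of Eq. (1)."""
    return norm.cdf(a * theta - b)

def simulate_responses(a, b, theta, rng):
    """Draw Bernoulli responses for one examinee at one measurement occasion."""
    p = response_prob(a, b, theta)
    return (rng.random(p.shape) < p).astype(int)

rng = np.random.default_rng(0)
a = np.array([1.0, 1.5, 0.8])    # discrimination parameters a_i
b = np.array([0.0, -0.5, 1.0])   # transformed difficulties b_i = a_i * b_i^*
y = simulate_responses(a, b, theta=0.3, rng=rng)
```

Note that `b` here is the transformed difficulty \(b_i = a_ib_i^*\), matching the parametrization above.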

The within-subject dependencies among the time-specific latent traits are modeled using a \(T\)-dimensional normal distribution, denoted as \(N_T (\varvec{\mu }_{\varvec{\theta }},\varvec{\varPsi }_{\varvec{\theta }})\), with mean vector \(\varvec{\mu }_{\varvec{\theta }}\) and unstructured covariance matrix \(\varvec{\varPsi }_{\varvec{\theta }}\), where

$$\begin{aligned} \varvec{\mu }_{\varvec{\theta }} = \left[ \begin{array}{c}\mu _{\theta _1}\\ \mu _{\theta _2}\\ \vdots \\ \mu _{\theta _T}\end{array}\right] \text { and } \varvec{\varPsi }_{\varvec{\theta }} = \left[ \begin{array}{c@{\quad }c@{\quad }c@{\quad }c} \psi _{\theta _1}&{} \psi _{\theta _{12}}&{}\ldots &{}\psi _{\theta _{1T}}\\ \psi _{\theta _{12}}&{}\psi _{\theta _2}&{}\ldots &{}\psi _{\theta _{2T}}\\ \vdots &{}\vdots &{}\ddots &{}\vdots \\ \psi _{\theta _{1T}}&{}\psi _{\theta _{2T}}&{}\ldots &{}\psi _{\theta _T} \end{array}\right] \,, \end{aligned}$$
(3)

respectively. A total of \(\frac{T(T+1)}{2}\) parameters need to be estimated for the unstructured covariance model.
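As an illustration of the parameter count, the unstructured matrix of Eq. (3) can be assembled from its \(T\) variances and \(T(T-1)/2\) distinct covariances. The helper below is a minimal numpy sketch (the function name is hypothetical, not the authors' implementation):

```python
import numpy as np

def unstructured_cov(variances, covariances):
    """Assemble the symmetric matrix of Eq. (3) from T variances and the
    T(T-1)/2 upper-triangular covariances, listed row by row."""
    T = len(variances)
    Psi = np.zeros((T, T))
    Psi[np.triu_indices(T, k=1)] = covariances
    Psi = Psi + Psi.T                      # mirror the covariances
    Psi[np.diag_indices(T)] = variances    # fill in the variances
    return Psi

# T = 3 gives 3 variances + 3 covariances = T(T+1)/2 = 6 free parameters
Psi = unstructured_cov([1.0, 1.2, 1.5], [0.5, 0.4, 0.6])
```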

2.1 Restricted covariance pattern structures

In the LIRT model, the mean component of the multivariate population distribution for the latent traits can be extended to allow latent growth curves and include explanatory information. Besides modeling the mean structure, it is also possible to model the correlation structure between latent traits. Therefore, different covariance patterns are considered, which are restricted versions of the unrestricted covariance matrix (Eq. 3). Each restricted covariance pattern can address specific dependencies between the latent traits.

Several arguments can be given to explicitly model the covariance structure of the errors. First, the unstructured covariance model for the latent variables measured at different occasions allows one parameter for every unique covariance term. No assumptions are made about the nature of the residual correlation between the latent traits over time. However, for unbalanced data designs, small sample sizes with respect to the number of subjects and items, and many measurement occasions, the unstructured covariance model may lead to unstable covariance parameter estimates with large posterior variances (Hedeker and Gibbons 2006; Jennrich and Schluchter 1986). In practical test situations, the test length differs over occasions and students, and the measurement error associated with the traits differs over students and measurement occasions (e.g., NELS, the National Education Longitudinal Study; student monitoring systems for pupils in the Netherlands). As noted by Muthén (1998), the available longitudinal test data are of complex multivariate form, often involving different test forms, attrition, and students sampled hierarchically within schools.

Second, a fitted covariance pattern can provide insight into the residual correlation between latent measurements over time. When a covariance pattern is identified, information about the underlying growth process is revealed. That is, on top of the change in latent traits modeled in the mean structure, the fitted covariance pattern can further explain the growth in latent traits. For example, when covariances of latent traits change over time, a fitted covariance pattern can be used to identify and describe the type of change.

Third, by correctly modeling the subject-specific correlated residuals across measurement occasions, more accurate statistical inferences can be made from the mean structure. Here, time-heteroscedastic covariance structures are considered to model complex patterns of residuals over time, where population variances of latent measurements can differ over time points. A restricted covariance pattern model can lead to a decrease in model fit compared to the unstructured covariance pattern model, when the data do not fully support the restriction. When the restricted covariance model is supported by the data, it can lead to an improved fit compared to the unstructured model. Furthermore, due to the decrease in the number of model parameters, model selection criteria might prefer the restricted covariance model over the unstructured covariance model. Muthén (1998) and Jennrich and Schluchter (1986), among others, already noted that the efficiency of the mean structure parameters can be improved by modeling the covariance structure parsimoniously. For small sample sizes and unbalanced data, the parameter estimates are most likely to be improved.

Muthén (1998) explained in more detail that the growth model consists of two components, a mean and a covariance structure component. Both components need to be modeled properly to describe the growth, and data interpretations depend on the specification of each component. Hedeker and Gibbons (2006) and Fitzmaurice et al. (2008) have shown that the analysis of longitudinal multivariate response data requires more complex covariance structures to capture the often complex dependency structures. Here, different covariance pattern models will be considered. For all cases, the sampling design is allowed to be unbalanced, where subjects can vary in the number of measurement occasions and response observations per measurement. The measurement times can vary over subjects and are not restricted to be equally spaced over subjects.

2.1.1 First-order heteroscedastic autoregressive model: ARH

This structure assumes that the correlations between a subject's latent traits decrease as the distance between measurement occasions increases. However, the magnitude of the correlations depends only on the distance between the time points, not on their values. In addition, the variances are not assumed to follow any specific pattern. The form of the covariance matrix is given by,

$$\begin{aligned} \varvec{\varPsi }_{\varvec{\theta }}&= \left[ \begin{array}{c@{\quad }c@{\quad }c@{\quad }c} \psi _{\theta _{1}} &{} \sqrt{\psi _{\theta _{1}}}\sqrt{\psi _{\theta _{2}}}\rho ^{}_{\theta } &{}\ldots &{}\sqrt{\psi _{\theta _{1}}} \sqrt{\psi _{\theta _{T}}}\rho ^{T-1}_{\theta }\\ \sqrt{\psi _{\theta _{1}}}\sqrt{\psi _{\theta _{2}}}\rho ^{}_{\theta } &{}\psi _{\theta _{2}} &{}\ldots &{}\sqrt{\psi _{\theta _{2}}} \sqrt{\psi _{\theta _{T}}}\rho ^{T-2}_{\theta }\\ \vdots &{}\vdots &{}\ddots &{}\vdots \\ \sqrt{\psi _{\theta _{1}}}\sqrt{\psi _{\theta _{T}}} \rho ^{T-1}_{\theta }&{}\sqrt{\psi _{\theta _{2}}} \sqrt{\psi _{\theta _{T}}}\rho ^{T-2}_{\theta }&{}\ldots &{}\psi _{\theta _{T}} \end{array}\right] \,,\nonumber \\ \end{aligned}$$
(4)

where \(\psi _{\theta _t} \in (0,\infty )\), for \(t=1,\ldots ,T\), and \(\rho _{\theta } \in (-1,1)\). Note that the number of parameters for the ARH(1) is \(T+1\), which is much lower than the number of parameters for the unstructured covariance matrix. For more details, see Singer and Andrade (2000), Tavares and Andrade (2006), Andrade and Tavares (2005) and Fitzmaurice et al. (2008).
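In code, the ARH(1) matrix of Eq. (4) follows from the rule that entry \((t,s)\) equals \(\sqrt{\psi _{\theta _t}\psi _{\theta _s}}\,\rho _{\theta }^{|t-s|}\). The following is an illustrative numpy sketch (the function name is an assumption, not the authors' implementation):

```python
import numpy as np

def arh_cov(variances, rho):
    """ARH(1) covariance of Eq. (4): entry (t, s) equals
    sqrt(psi_t * psi_s) * rho**|t - s|."""
    psi = np.asarray(variances, dtype=float)
    T = len(psi)
    lags = np.abs(np.subtract.outer(np.arange(T), np.arange(T)))
    s = np.sqrt(psi)
    return np.outer(s, s) * rho ** lags

# T + 1 = 5 parameters: four variances and one correlation
Psi = arh_cov([1.0, 1.3, 1.6, 2.0], rho=0.6)
```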

2.1.2 Heteroscedastic uniform model: HU

This is a special case of the ARH, which also assumes time-heterogeneous variances over measurement occasions but time-homogeneous correlations over time. So, the HU model assumes equal correlations between all pairs of time-specific latent trait measurements, independent of the distance between them. The heteroscedastic uniform covariance matrix is given by,

$$\begin{aligned} \varvec{\varPsi }_{\varvec{\theta }}&= \left[ \begin{array}{c@{\quad }c@{\quad }c@{\quad }c}\psi _{\theta _{1}} &{} \sqrt{\psi _{\theta _{1}}}\sqrt{\psi _{\theta _{2}}} \rho _{\theta }&{}\ldots &{}\sqrt{\psi _{\theta _{1}}}\sqrt{\psi _{\theta _{T}}} \rho _{\theta }\\ \sqrt{\psi _{\theta _{1}}}\sqrt{\psi _{\theta _{2}}}\rho _{\theta } &{}\psi _{\theta _{2}}&{}\ldots &{}\sqrt{\psi _{\theta _{2}}} \sqrt{\psi _{\theta _{T}}}\rho _{\theta }\\ \vdots &{}\vdots &{}\ddots &{}\vdots \\ \sqrt{\psi _{\theta _{1}}}\sqrt{\psi _{\theta _{T}}} \rho _{\theta }&{}\sqrt{\psi _{\theta _{2}}}\sqrt{\psi _{\theta _{T}}} \rho _{\theta }&{}\ldots &{}\psi _{\theta _{T}} \end{array}\right] \,\,\,\,,\nonumber \\ \end{aligned}$$
(5)

again, \(\psi _{\theta _t} \in (0,\infty )\), for \(t=1,\ldots ,T\), and \(\rho _{\theta } \in (-1,1)\). This covariance structure reduces the number of covariance parameters to one, while allowing \(T\) variance parameters, so that the total number of parameters, \(T+1\), equals that of the ARH. See Singer and Andrade (2000), Tavares and Andrade (2006), Andrade and Tavares (2005) and Fitzmaurice et al. (2008) for more details.
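Analogously, the HU matrix of Eq. (5) replaces the lag-dependent correlation \(\rho _{\theta }^{|t-s|}\) by a constant \(\rho _{\theta }\). Again an illustrative numpy sketch with a hypothetical function name:

```python
import numpy as np

def hu_cov(variances, rho):
    """Heteroscedastic uniform covariance of Eq. (5): a common correlation
    rho between every pair of occasions, with free variances."""
    s = np.sqrt(np.asarray(variances, dtype=float))
    R = np.full((len(s), len(s)), float(rho))
    np.fill_diagonal(R, 1.0)
    return np.outer(s, s) * R

Psi = hu_cov([1.0, 1.4, 2.0], rho=0.5)
```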

2.1.3 Heteroscedastic Toeplitz model: HT

As a special case of covariance pattern HU, the heteroscedastic Toeplitz model assumes a zero covariance between a subject's latent traits at two nonconsecutive time points. This might be suitable when correlations decay quickly due to relatively large time gaps between nonconsecutive measurement occasions. This covariance pattern is represented by,

$$\begin{aligned} \varvec{\varPsi }_{\varvec{\theta }}&= \left[ \begin{array}{c@{\quad }c@{\quad }c@{\quad }c@{\quad }c}\psi _{\theta _{1}} &{} \sqrt{\psi _{\theta _{1}}}\sqrt{\psi _{\theta _{2}}} \rho _{\theta }&{}0&{}\ldots &{}0\\ \sqrt{\psi _{\theta _{1}}}\sqrt{\psi _{\theta _{2}}}\rho _{\theta } &{}\psi _{\theta _{2}} &{}\sqrt{\psi _{\theta _{2}}}\sqrt{\psi _{\theta _{3}}}\rho _{\theta }&{}\ldots &{}0\\ 0&{}\sqrt{\psi _{\theta _{2}}}\sqrt{\psi _{\theta _{3}}}\rho _{\theta } &{}\psi _{\theta _{3}}&{}\ldots &{}0\\ \vdots &{}\vdots &{}\vdots &{}\ddots &{}\vdots \\ 0&{}0&{}0&{}\ldots &{}\psi _{\theta _{T}} \end{array}\right] \,\,\,\,.\nonumber \\ \end{aligned}$$
(6)

In this case, \(\psi _{\theta _t} \in (0,\infty )\) for \(t=1,\ldots ,T\), but \(\rho _{\theta } \in (-k,k)\), where \(k\) depends on the value of \(T\). For instance, \(k \approx 1/\sqrt{2}\), for \(T=3\) and \(k=1/2\) for large \(T\). For more details see Singer and Andrade (2000), Andrade and Tavares (2005), and Tavares and Andrade (2006).
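The banded structure of Eq. (6) and the \(T\)-dependent admissible range of \(\rho _{\theta }\) can be checked numerically. In the sketch below (illustrative numpy code; the eigenvalue-based check is our addition), positive definiteness fails once \(|\rho _{\theta }|\) exceeds the bound, which is \(1/\sqrt{2} \approx 0.707\) for \(T=3\), consistent with the values of \(k\) quoted above:

```python
import numpy as np

def ht_cov(variances, rho):
    """Heteroscedastic Toeplitz covariance of Eq. (6): correlation rho
    between consecutive occasions only, zero otherwise."""
    s = np.sqrt(np.asarray(variances, dtype=float))
    T = len(s)
    R = np.eye(T)
    idx = np.arange(T - 1)
    R[idx, idx + 1] = R[idx + 1, idx] = rho
    return np.outer(s, s) * R

def is_pd(M):
    """Positive definiteness via the smallest eigenvalue."""
    return bool(np.all(np.linalg.eigvalsh(M) > 0))

# For T = 3 the bound on |rho| is about 0.707:
# rho = 0.70 yields a valid covariance matrix, rho = 0.72 does not.
```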

2.1.4 Heteroscedastic covariance model: HC

The heteroscedastic covariance (HC) model is a restricted version of the heteroscedastic uniform model, where in the HC model, a common covariance is assumed across time points. As in the other covariance models, time-heterogeneous variances are assumed. The HC model also corresponds to an unstructured covariance matrix with equal covariances. Note that the time-heterogeneous variances define time-heterogeneous correlations, while assuming a common covariance term across time. Consequently, relatively high time-specific latent trait variances imply a low correlation between the corresponding latent traits. The HC covariance structure is represented by,

$$\begin{aligned} \varvec{\varPsi }_{\varvec{\theta }}&= \left[ \begin{array}{c@{\quad }c@{\quad }c@{\quad }c} \psi _{\theta _{1}} &{} \rho _{\theta }&{}\ldots &{}\rho _{\theta }\\ \rho _{\theta }&{}\psi _{\theta _{2}}&{}\ldots &{}\rho _{\theta }\\ \vdots &{}\vdots &{}\ddots &{}\vdots \\ \rho _{\theta }&{}\rho _{\theta }&{}\ldots &{}\psi _{\theta _{T}} \end{array}\right] , \end{aligned}$$
(7)

where \(\psi _{\theta _t} \in (0,\infty )\) for \(t=1,\ldots ,T\), and \(|\rho _{\theta }| < \text{ min }_{i,j}\sqrt{\psi _{\theta _i}\psi _{\theta _j}}\). For more details, see Andrade and Tavares (2005) and Tavares and Andrade (2006).
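A sketch of Eq. (7) also makes the implied time-heterogeneous correlations visible: with a common covariance \(\rho _{\theta }\), the correlation between occasions \(t\) and \(s\) is \(\rho _{\theta }/\sqrt{\psi _{\theta _t}\psi _{\theta _s}}\), so larger variances imply weaker correlations (illustrative numpy code with a hypothetical function name):

```python
import numpy as np

def hc_cov(variances, rho):
    """Heteroscedastic covariance model of Eq. (7): a common covariance
    rho off the diagonal, free variances on the diagonal."""
    T = len(variances)
    Psi = np.full((T, T), float(rho))
    np.fill_diagonal(Psi, variances)
    return Psi

Psi = hc_cov([1.0, 2.0, 4.0], rho=0.5)
# implied correlations 0.5/sqrt(1*2), 0.5/sqrt(1*4), 0.5/sqrt(2*4) all differ
```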

2.1.5 First-order autoregressive moving-average model: ARMAH

As in the first-order autoregressive ARH structure, correlations between a subject's latent traits decrease as the distance between measurement occasions increases. However, the decrease is further parameterized through the additional covariance parameter \(\gamma _{\theta }\). This covariance matrix, denoted as ARMAH, generalizes the ARH model, since it supports a more flexible modeling of the time-specific correlations. The ARMAH covariance matrix is represented by,

$$\begin{aligned} \varvec{\varPsi }_{\varvec{\theta }}\!&= \!\!\left[ \begin{array}{c@{\quad }c@{\quad }c@{\quad }c} \psi _{\theta _{1}} &{} \sqrt{\psi _{\theta _{1}}\psi _{\theta _{2}}} \gamma _{\theta }&{}\ldots &{}\sqrt{\psi _{\theta _{1}}\psi _{\theta _{T}}} \gamma _{\theta }\rho ^{T-2}_{\theta }\\ \sqrt{\psi _{\theta _{1}}\psi _{\theta _{2}}}\gamma _{\theta } &{}\psi _{\theta _{2}}&{}\ldots &{}\sqrt{\psi _{\theta _{2}} \psi _{\theta _{T}}}\gamma _{\theta }\rho ^{T-3}_{\theta }\\ \vdots &{}\vdots &{}\ddots &{}\vdots \\ \sqrt{\psi _{\theta _{1}}\psi _{\theta _{T}}} \gamma _{\theta }\rho ^{T-2}_{\theta }&{}\sqrt{\psi _{\theta _{2}} \psi _{\theta _{T}}}\gamma _{\theta }\rho ^{T-3}_{\theta } &{}\ldots &{}\psi _{\theta _{T}}\end{array}\right] \,.\nonumber \\ \end{aligned}$$
(8)

In this case, \(\psi _{\theta _t} \in (0,\infty )\) for \(t=1,\ldots ,T\), and \((\gamma _{\theta },\rho _{\theta }) \in (-1,1)^2\). For more details, see Singer and Andrade (2000) and Rochon (1992).
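In code, the \((t,s)\) entry of Eq. (8) equals \(\sqrt{\psi _{\theta _t}\psi _{\theta _s}}\,\gamma _{\theta }\rho _{\theta }^{|t-s|-1}\) for \(t \ne s\); setting \(\gamma _{\theta } = \rho _{\theta }\) recovers the ARH. Again an illustrative numpy sketch:

```python
import numpy as np

def armah_cov(variances, gamma, rho):
    """ARMAH covariance of Eq. (8): entry (t, s) equals
    sqrt(psi_t * psi_s) * gamma * rho**(|t - s| - 1) for t != s."""
    psi = np.asarray(variances, dtype=float)
    T = len(psi)
    s = np.sqrt(psi)
    lags = np.abs(np.subtract.outer(np.arange(T), np.arange(T)))
    Psi = np.outer(s, s) * gamma * rho ** np.maximum(lags - 1, 0)
    Psi[np.diag_indices(T)] = psi          # restore the variances
    return Psi

Psi = armah_cov([1.0, 1.0, 1.0], gamma=0.5, rho=0.8)
```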

2.1.6 Ante-dependence model: AD

The last covariance structure model that will be considered is specifically useful when time points are not equally spaced and/or there is an additional source of variability present. This (first-order) AD model is a general covariance model that allows for changes in the correlation structure over time and unequally spaced measurement occasions.

The AD model generalizes the ARH using time-specific covariance parameters. It also generalizes the ARMAH model, since the covariance structure of the ARMAH is obtained by setting \(\rho _{\theta _1}=\gamma _{\theta }\) and \(\rho _{\theta _2}=\cdots =\rho _{\theta _{T-1}}=\rho _{\theta }\). The AD model supports a more dynamic modeling of the covariance pattern compared to the ARH and ARMAH. The AD covariance model is represented by,

$$\begin{aligned} \varvec{\varPsi }_{\varvec{\theta }} \!\!&= \!\! \left[ \begin{array}{c@{\quad }c@{\quad }c@{\quad }c} \psi _{\theta _{1}} &{} \sqrt{\psi _{\theta _{1}}\psi _{\theta _{2}}}\rho ^{}_{\theta _{1}} &{}\ldots &{}\sqrt{\psi _{\theta _{1}}\psi _{\theta _{T}}} \displaystyle \prod _{t=1}^{T-1}\rho _{\theta _{t}}\\ \sqrt{\psi _{\theta _{1}}\psi _{\theta _{2}}}\rho _{\theta _{1}} &{}\psi _{\theta _{2}}&{}\ldots &{}\sqrt{\psi _{\theta _{2}}\psi _{\theta _{T}}} \displaystyle \prod _{t=2}^{T-1}\rho _{\theta _{t}}\\ \vdots &{}\vdots &{}\ddots &{}\vdots \\ \sqrt{\psi _{\theta _{1}}\psi _{\theta _{T}}} \displaystyle \prod _{t=1}^{T-1}\rho _{\theta _{t}} &{}\sqrt{\psi _{\theta _{2}}\psi _{\theta _{T}}}\displaystyle \prod _{t=2}^{T-1} \rho _{\theta _{t}}&{}\ldots &{}\psi _{\theta _{T}} \end{array}\right] ,\nonumber \\ \end{aligned}$$
(9)

where \(\psi _{\theta _t} \in (0,\infty )\) for \(t =1,\ldots ,T\), and \(\rho _{\theta _t} \in (-1,1)\) for \(t =1,\ldots ,T-1\). In conclusion, the AD model permits variances and correlations to change over time, and uses \(2T-1\) parameters. For more details, see Singer and Andrade (2000) and Nunez-Anton and Zimmerman (2000).
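The product structure of Eq. (9) translates directly into code: the covariance between occasions \(t < s\) multiplies the lag-specific correlations \(\rho _{\theta _t},\ldots ,\rho _{\theta _{s-1}}\). An illustrative numpy sketch with the \(2T-1\) parameters as inputs:

```python
import numpy as np

def ad_cov(variances, rhos):
    """First-order ante-dependence covariance of Eq. (9): entry (t, s),
    t < s, equals sqrt(psi_t * psi_s) * rho_t * ... * rho_{s-1}."""
    psi = np.asarray(variances, dtype=float)
    rho = np.asarray(rhos, dtype=float)     # T - 1 lag-specific correlations
    s = np.sqrt(psi)
    T = len(psi)
    Psi = np.empty((T, T))
    for t in range(T):
        for u in range(T):
            lo, hi = min(t, u), max(t, u)
            Psi[t, u] = s[t] * s[u] * np.prod(rho[lo:hi])  # empty product = 1
    return Psi

Psi = ad_cov([1.0, 1.2, 1.5], [0.6, 0.4])   # 2T - 1 = 5 parameters
```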

2.2 A restricted unstructured covariance structure

The latent variable framework will require a reference time-point to identify the latent scale. To accomplish that, the latent mean and variance of the first time-point will be fixed to zero and one, respectively. This way, the latent trait estimates across time will be estimated on one common scale, since an incomplete test design is used such that common items are administered at different measurement occasions (time-points). The common items, also known as anchors, make it possible to measure the latent traits on one common scale.

However, the estimation of the covariance parameters is complicated by the identifying constraints. Note that even for the unstructured covariance matrix, a restriction is implied on a variance parameter, which leads to a restricted unstructured covariance matrix. Furthermore, the restrictions on the parameters of the latent trait distribution also complicate the specification of priors. In this case, assuming an inverse-Wishart distribution for the unstructured covariance matrix is not possible, when the variance parameter of the first time point is restricted to one.

In the present latent variable framework, a novel prior modeling approach will be followed to account for the restricted covariance structure. Following McCulloch et al. (2000), a parametrization of the latent trait's covariance structure is considered. Therefore, the following partition of the latent trait structure is defined,

$$\begin{aligned} \varvec{\theta }_{j.}&= (\theta _{j1},\theta _{j2},\ldots ,\theta _{jT})^{t} = (\theta _{j1},\varvec{\theta }_{j(1)})^{t}, \\ \varvec{\mu }_{\varvec{\theta }}&= (\mu _{\theta _1},\mu _{\theta _2},\ldots ,\mu _{\theta _T})^t = (\mu _{\theta _1},\varvec{\mu }_{\varvec{\theta }(1)})^t, \end{aligned}$$

where \(\varvec{\theta }_{j(1)} = (\theta _{j2},\ldots ,\theta _{jT})^{t}\) and \(\varvec{\mu }_{\varvec{\theta }(1)} = (\mu _{\theta _2},\ldots ,\mu _{\theta _T})^{t}\). In this notation, the index \((1)\) indicates that the first component is excluded. It follows that the covariance structure, see the definition in Eq. (3), is partitioned as,

$$\begin{aligned} \varvec{\varPsi }_{\varvec{\theta }}&= \left[ \begin{array}{c@{\quad }c} \psi _{\theta _1} &{} \varvec{\psi }_{\varvec{\theta }(1)}^{t}\\ \varvec{\psi }_{\varvec{\theta }(1)} &{} \varvec{\varPsi }_{\varvec{\theta }(1)} \end{array}\right] , \end{aligned}$$
(10)

where \(\varvec{\psi }_{\varvec{\theta }(1)} = (\psi _{\theta _{12}},\ldots ,\psi _{\theta _{1T}})^{t}\) and

$$\begin{aligned} \varvec{\varPsi }_{\varvec{\theta }(1)}&= \left[ \begin{array}{c@{\quad }c@{\quad }c}\psi _{\theta _2}&{}\ldots &{}\psi _{\theta _{2T}}\\ \vdots &{}\ddots &{}\vdots \\ \psi _{\theta _{2T}} &{}\ldots &{}\psi _{\theta _T} \end{array}\right] \,. \end{aligned}$$
(11)

From properties of the multivariate normal distribution, see Rencher (2002), it follows that

$$\begin{aligned} \varvec{\theta }_{j(1)}|\theta _{j1} \sim N_{(T-1)}\left( \varvec{\mu }^*,\varvec{\varPsi }^*\right) \,, \end{aligned}$$
(12)

where

$$\begin{aligned} \varvec{\mu }^*&= \varvec{\mu }_{\varvec{\theta }(1)} + \psi _{\theta _1}^{-1}\varvec{\psi }_{\varvec{\theta }(1)}\left( \theta _{j1} - \mu _{\theta _1}\right) , \end{aligned}$$

and

$$\begin{aligned} \varvec{\varPsi }^*&= \varvec{\varPsi }_{\varvec{\theta }(1)} - \psi _{\theta _1}^{-1}\varvec{\psi }_{\varvec{\theta }(1)}\varvec{\psi }_{\varvec{\theta }(1)}^{t}. \end{aligned}$$
(13)

As a result, when conditioning on the restricted first time-point parameter, \(\theta _{j1}\), the remaining \(\varvec{\theta }_{j(1)}\) are conditionally multivariate normally distributed given \(\theta _{j1}\), with an unrestricted covariance matrix. The matrix \(\varvec{\varPsi }^*\) is an unstructured covariance matrix without any identifiability restrictions, see Singer and Andrade (2000). Consequently, common modeling (e.g., using an inverse-Wishart prior) and estimation approaches can be applied for Bayesian inference, see Gelman et al. (2004).
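The decomposition of Eqs. (12)–(13) is a standard multivariate normal conditioning via a Schur complement. The sketch below (illustrative numpy code; the function name is hypothetical) computes the regression slope \(\psi _{\theta _1}^{-1}\varvec{\psi }_{\varvec{\theta }(1)}\) and the residual covariance \(\varvec{\varPsi }^*\) from a given \(\varvec{\varPsi }_{\varvec{\theta }}\):

```python
import numpy as np

def condition_on_first(Psi):
    """Conditioning theta_{(1)} on theta_1 (Eqs. 12-13): returns the
    regression slope psi_(1)/psi_1 and the Schur complement Psi*."""
    psi1 = Psi[0, 0]
    psi_vec = Psi[1:, 0]
    slope = psi_vec / psi1
    Psi_star = Psi[1:, 1:] - np.outer(psi_vec, psi_vec) / psi1
    return slope, Psi_star

Psi = np.array([[1.0, 0.5, 0.3],
                [0.5, 1.2, 0.4],
                [0.3, 0.4, 1.5]])
slope, Psi_star = condition_on_first(Psi)
# conditional mean: mu* = mu_(1) + slope * (theta_j1 - mu_1), as in Eq. (12)
```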

For estimation purposes (using the restriction \(\psi _{\theta _1}=1\)), it is convenient to eliminate the restricted parameter \(\psi _{\theta _1}\) from the vector of covariances, \(\varvec{\psi }_{\varvec{\theta }(1)}\) (see the covariance structures in Eqs. (4)–(9)), between the first component, \(\theta _{j1}\), and the remaining components \(\varvec{\theta }_{j(1)}\). This new vector is denoted as \(\varvec{\psi }^*\) and is equal to

$$\begin{aligned} \varvec{\psi }^* =\varvec{\psi }_{\varvec{\theta }(1)}/\sqrt{\psi _{\theta _1}}. \end{aligned}$$
(14)

Subsequently, the conditional distribution of the unrestricted latent variables is expressed as

$$\begin{aligned} \varvec{\theta }_{j(1)} \mid \theta _{j1}&= \varvec{\mu }_{\varvec{\theta }(1)} + \frac{\varvec{\psi }^*}{\sqrt{\psi _{\theta _1}}}\left( \theta _{j1} - \mu _{\theta _1}\right) + \varvec{\xi }_{j}\,, \end{aligned}$$
(15)

where \(\varvec{\xi }_{j} \sim N(\varvec{0},\varvec{\varPsi }^*)\). The variance and correlation parameters,

$$\begin{aligned} \varvec{\psi }^*\,\, \text{ and }\,\, \varvec{\varPsi }^*, \end{aligned}$$
(16)

define a one-to-one relation with the free parameters of the original covariance matrix \(\varvec{\varPsi }_{\varvec{\theta }}\), since the parameter \(\psi _{\theta _1}\) is restricted to 1. As a result, the estimates of the population variances and covariances can be obtained from the estimates of Eq. (16). The latent variable distribution of the first measurement occasion will be restricted to identify the model. This is done by re-scaling the vector of latent variable values of the first measurement occasion to a pre-specified scale in each MCMC iteration. The latent variable population distributions of subsequent measurement occasions are conditionally specified according to Eq. (15), given the restricted population distribution parameters of the first measurement occasion. Subsequently, the covariance parameters of the latent multivariate model are not restricted for identification purposes, which facilitates a straightforward specification of the prior distributions.
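Since the map in Eq. (16) is one-to-one given \(\psi _{\theta _1}=1\), the full matrix \(\varvec{\varPsi }_{\varvec{\theta }}\) can be rebuilt from draws of \((\varvec{\psi }^*, \varvec{\varPsi }^*)\) by inverting Eqs. (13)–(14). A minimal numpy sketch (the function name is hypothetical):

```python
import numpy as np

def recover_full_cov(psi_star, Psi_star, psi1=1.0):
    """Rebuild Psi_theta from the unrestricted (psi*, Psi*), inverting
    Eqs. (13)-(14) with the first-occasion variance fixed at psi1."""
    psi_star = np.asarray(psi_star, dtype=float)
    Psi_star = np.asarray(Psi_star, dtype=float)
    psi_vec = psi_star * np.sqrt(psi1)                    # invert Eq. (14)
    Psi_1 = Psi_star + np.outer(psi_vec, psi_vec) / psi1  # invert Eq. (13)
    T = len(psi_vec) + 1
    Psi = np.empty((T, T))
    Psi[0, 0] = psi1
    Psi[0, 1:] = Psi[1:, 0] = psi_vec
    Psi[1:, 1:] = Psi_1
    return Psi

Psi = recover_full_cov([0.5, 0.3], [[0.95, 0.25], [0.25, 1.41]])
```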

3 Bayesian inference and Gibbs sampling methods

The marginal posterior distributions comprise the main tool to perform Bayesian inference. Unfortunately, it is not possible to obtain closed-form expressions of the marginal posterior distributions. An MCMC algorithm will be used to obtain samples from the marginal posteriors, see Gamerman and Lopes (2006). More specifically, we will develop a FGS algorithm to estimate all parameters simultaneously.

MCMC methods for longitudinal and multivariate probit models have been developed by, among others, Albert and Chib (1993), Chib and Greenberg (1998), Chib and Carlin (1999), Imai and van Dyk (2005), and McCulloch et al. (2000). A particular problem in Bayesian modeling of longitudinal multivariate response data is the prior specification for covariance matrices. An inverse-Wishart prior distribution is plausible when covariance parameters are not functionally dependent, see Tiao and Zellner (1964). When this is not the case, the prior specification of covariance parameters becomes much more complicated. Here, identification rules impose restrictions on the covariance parameters of the latent trait distribution. Therefore, the covariance structure is modeled by conditioning on the restricted parameters, which are related to the first measurement occasion. Following McCulloch et al. (2000), this approach supports a proper implementation of the identifying restrictions and an FGS implementation.

Conjugate prior distributions are considered, see Gelman et al. (2004) and Gelman (2006). According to the approach presented in Sect. 2, the parameters of interest are \((\varvec{\mu }_{\varvec{\theta }}^{t},\psi _{\theta _1},\varvec{\psi }^{*^{t}})^{t}\) and \(\varvec{\varPsi }^*\). Conjugate priors are specified as,

$$\begin{aligned} \varvec{\mu }_{\varvec{\theta }}&\sim N_{T}(\varvec{\mu }_0,\varvec{\varPsi }_0)\,,\end{aligned}$$
(17)
$$\begin{aligned} \psi _{\theta _1}&\sim IG(\nu _0,\kappa _0)\,,\end{aligned}$$
(18)
$$\begin{aligned} \varvec{\psi }^*&\sim N_{T-1}(\varvec{\mu }_{\varvec{\psi }},\varvec{\varPsi }_{\varvec{\psi }})\,,\end{aligned}$$
(19)
$$\begin{aligned} \varvec{\varPsi }^*&\sim IW_{T-1}(\nu _{\varvec{\varPsi }},\varvec{\varPsi }_{\varvec{\varPsi }})\,, \end{aligned}$$
(20)

where \(IG(\nu _0,\kappa _0)\) stands for the inverse-gamma distribution with shape parameter \(\nu _0\) and scale parameter \(\kappa _0\), and \(IW_{T-1}(\nu _{\varvec{\varPsi }},\varvec{\varPsi }_{\varvec{\varPsi }})\) for the inverse-Wishart distribution with degrees of freedom \(\nu _{\varvec{\varPsi }}\) and dispersion matrix \(\varvec{\varPsi }_{\varvec{\varPsi }}\).
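For illustration, draws from the conjugate priors in Eqs. (17)–(20) can be generated with scipy; the hyperparameter values below are arbitrary illustrative choices, not the ones used in the paper:

```python
import numpy as np
from scipy.stats import multivariate_normal, invgamma, invwishart

rng = np.random.default_rng(1)
T = 4  # illustrative number of measurement occasions

# Eq. (17): multivariate normal prior on the latent means
mu_theta = multivariate_normal.rvs(mean=np.zeros(T), cov=np.eye(T), random_state=rng)
# Eq. (18): inverse-gamma prior on the first-occasion variance
psi_1 = invgamma.rvs(a=2.0, scale=1.0, random_state=rng)
# Eq. (19): multivariate normal prior on the covariance vector psi*
psi_star = multivariate_normal.rvs(mean=np.zeros(T - 1), cov=np.eye(T - 1), random_state=rng)
# Eq. (20): inverse-Wishart prior on the residual covariance Psi*
Psi_star = invwishart.rvs(df=T + 2, scale=np.eye(T - 1), random_state=rng)
```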

The prior for the item parameters is specified as

$$\begin{aligned} p\left( \varvec{\zeta }_i = (a_i,b_i) \mid \varvec{\mu }_{\varvec{\zeta }}, \varvec{\varPsi }_{\varvec{\zeta }} \right)&\propto \exp \Biggl (-0.5\left( \varvec{\zeta }_i - \varvec{\mu }_{\varvec{\zeta }}\right) ^{t}\varvec{\varPsi }_{\varvec{\zeta }}^{-1}\nonumber \\&\times \left( \varvec{\zeta }_i - \varvec{\mu }_{\varvec{\zeta }}\right) \Biggl )1\!\!1_{(a_i > 0)}, \end{aligned}$$
(21)

where \(\varvec{\mu }_{\varvec{\zeta }}\) and \(\varvec{\varPsi }_{\varvec{\zeta }}\) are the hyperparameters, and \(1\!\!1\) the usual indicator function. The hyperparameters are fixed and often set in such a way that they represent reasonable values for the prior parameters.

To facilitate an FGS approach, and to account for missing response data, a data augmentation scheme is introduced, see Albert (1992) and Albert and Chib (1993). The scheme samples normally distributed latent response data \(\varvec{Z}_{...}=(Z_{111},\ldots ,Z_{I_TnT})^{t}\), given the discrete observed response data; that is,

$$\begin{aligned} Z_{ijt}|(\theta _{jt}, \varvec{\zeta }_i, Y_{ijt}) \sim N(a_i\theta _{jt} - b_i,1), \end{aligned}$$
(22)

where \(Y_{ijt}\) is the indicator of \(Z_{ijt}\) being greater than zero.
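A single data augmentation step of Eq. (22) amounts to sampling from a normal distribution truncated at zero. A minimal sketch, with made-up item and person values:

```python
import numpy as np
from scipy.stats import truncnorm

def draw_augmented_data(y, a, b, theta, rng=None):
    """Draw Z | (theta, zeta, Y) ~ N(a*theta - b, 1), truncated to (0, inf)
    when y == 1 and to (-inf, 0] when y == 0, as in Eq. (22)."""
    mean = a * theta - b
    lo = np.where(y == 1, 0.0, -np.inf)   # lower truncation bound
    hi = np.where(y == 1, np.inf, 0.0)    # upper truncation bound
    # truncnorm expects the bounds on the standardized scale
    return truncnorm.rvs(lo - mean, hi - mean, loc=mean, scale=1.0,
                         random_state=rng)

# Hypothetical values: two items answered by one examinee at one occasion
y = np.array([1, 0])
z = draw_augmented_data(y, a=np.array([1.2, 0.8]),
                        b=np.array([0.3, -0.1]), theta=0.5)
# z[0] is positive and z[1] is non-positive, matching the observed responses
```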

To handle incomplete block designs, an indicator variable \(\varvec{I}\) is introduced that specifies the set of administered items for each occasion and subject. This indicator variable is defined as follows,

$$\begin{aligned} I_{ijt} = \left\{ \begin{array}{r@{\quad }l} 1,&{} \text{ item } i \text{ administered } \text{ for } \text{ examinee } j \text{ at } \text{ time } \text{ point } \text{ t } \\ 0, &{} \text{ missing } \text{ by } \text{ design }. \end{array} \right. \end{aligned}$$
(23)

Non-selective missing responses, due to uncontrolled events such as dropout, inclusion of examinees, non-response, or errors in recoding the data, are marked by another indicator, which is defined as,

$$\begin{aligned} V_{ijt} \!=\! \left\{ \begin{array}{r@{\quad }l} 1,&{} \text{ observed } \text{ response } \text{ of } \text{ examinee } j \text{ at } \text{ time } \text{ point } t \text{ on } \text{ item } \ i \\ 0, &{} \text{ otherwise }, \end{array} \right. \end{aligned}$$
(24)

It is assumed that the missing data are missing at random (MAR), such that the distribution of patterns of missing data does not depend on the unobserved data. When the MAR assumption does not hold and the missing data are non-ignorable, a missing data model can be defined to model explicitly the pattern of missingness. In case of MAR, the observed data can be used to make valid inferences about the model parameters.

To ease the notation, let indicator matrix \(\mathbf {I}\) represent both cases of missing data. Then, under the above assumptions, the distribution of augmented data \(\varvec{Z}_{...}\) (conditioned on all other quantities) is given by

$$\begin{aligned}&p(\varvec{z}_{...}\mid \varvec{y}_{...},\varvec{\zeta }, \varvec{\theta }_{..}, \varvec{\eta }_{\varvec{\theta }}, \mathbf {I}) \propto \prod _{t=1}^{T}\prod _{j=1}^{n}\prod _{i \mid I_{ijt=1}}\nonumber \\&\times \Bigl \{\exp {\left\{ -0.5\left( z_{ijt} - a_i\theta _{jt} + b_i\right) ^2\right\} }1\!\!1_{(z_{ijt},y_{ijt})}\Bigl \}, \end{aligned}$$
(25)

where \(1\!\!1_{(z_{ijt},y_{ijt})}\) represents the restriction that \(z_{ijt}\) is greater (less) than zero when \(y_{ijt}\) equals one (zero), according to Eq. (22).

Given the augmented data likelihood in Eq. (25) and the prior distributions in Eqs. (2), (17), (18), (19), (20) and (21), the joint posterior distribution is given by:

$$\begin{aligned}&p(\varvec{\theta }_{..},\varvec{\zeta },\varvec{\mu }_{\varvec{\theta }},\psi _{\theta _1},\varvec{\psi }^* , \varvec{\varPsi }^* | \varvec{z}_{...},\varvec{y}_{...}) \nonumber \\&\quad \propto p(\varvec{z}_{...}|\varvec{\theta }_{..},\varvec{\zeta },\varvec{y}_{...})p(\varvec{\theta }_{..}|\varvec{\eta }_{\varvec{\theta }})\nonumber \\&\quad \times p(\varvec{\zeta }|\varvec{\mu }_{\varvec{\zeta }},\varvec{\varPsi }_{\varvec{\zeta }})p(\varvec{\eta }_{\varvec{\theta }}). \end{aligned}$$
(26)

where

$$\begin{aligned} p(\varvec{\theta }_{..}|\varvec{\eta }_{\varvec{\theta }}) =\displaystyle \prod _{j=1}^np(\varvec{\theta }_{j.}|\varvec{\eta }_{\varvec{\theta }}), \end{aligned}$$
(27)

and

$$\begin{aligned} p(\varvec{\eta }_{\varvec{\theta }}) = p(\varvec{\mu }_{\varvec{\theta }})p(\psi _{\theta _1})p(\varvec{\psi }^*)p(\varvec{\varPsi }^*)\,. \end{aligned}$$

This posterior distribution (26) has an intractable form but, as shown in the Appendix, the full conditionals are known and easy to sample from. Let (.) denote the set of all necessary parameters. The FGS algorithm is defined as follows:

  1. Start the algorithm by choosing suitable initial values. Repeat steps 2–10.

  2. Simulate \(Z_{ijt}\) from \(Z_{ijt} \mid (.)\), \(i = 1, \ldots ,I_t\), \(j = 1,\ldots , n\), \(t = 1, \ldots , T\).

  3. Simulate \(\varvec{\theta }_{j.}\) from \(\varvec{\theta }_{j.}\mid (.)\), \(j = 1,\ldots ,n\).

  4. Simulate \(\varvec{\zeta }_{i}\) from \(\varvec{\zeta }_{i} \mid (.)\), \(i = 1,\ldots ,I\).

  5. Simulate \(\varvec{\mu }_{\varvec{\theta }}\) from \(\varvec{\mu }_{\varvec{\theta }}\mid (.)\).

  6. Simulate \(\psi _{\theta _1}\) from \(\psi _{\theta _1} \mid (.)\).

  7. Simulate \(\varvec{\psi }^*\) from \(\varvec{\psi }^* \mid (.)\).

  8. Simulate \(\varvec{\varPsi }^*\) from \(\varvec{\varPsi }^* \mid (.)\).

  9. Compute the unstructured covariance matrix using the sampled covariance components from Steps 6–8 and Eqs. (10), (13) and (14).

  10. Through a parameter transformation using the sampled unstructured covariance parameters, compute the restricted covariance components of interest. The sampled restricted covariance structure \(\varvec{\varPsi }_{\varvec{\theta }}\) is used when repeating steps 2–8.

To handle the restriction \(\mu _{\theta _1} = 0\), the expression in Eq. (12) is used to simulate \(\varvec{\mu }_{\varvec{\theta }(1)}\). To simulate \((\mu _{\theta _1},\psi _{\theta _1})^{t}\), the following decomposition is used in (27),

$$\begin{aligned} p(\varvec{\theta }_{j.} | \varvec{\eta }_{\varvec{\theta }}) = p(\varvec{\theta }_{j(1)}|\varvec{\eta }_{\varvec{\theta }}, \theta _{j1})p(\theta _{j1}|\varvec{\eta }_{\theta _1}). \end{aligned}$$

where \(\varvec{\eta }_{\theta _1} =(\mu _{\theta _1},\psi _{\theta _1})^{t}\). To identify the model, the scale of the latent variable of measurement occasion one is transformed to mean zero and variance one. It is also possible to restrict the parameters \((\mu _{\theta _1},\psi _{\theta _1})^{t}\) to specific values.

In Step 9, MCMC samples of \(\varvec{\varPsi }^*\) are drawn from an inverse-Wishart distribution, and each sampled covariance matrix is restricted to be positive definite. Now, the following relationship can be defined,

$$\begin{aligned} det(\varvec{\varPsi }_{\varvec{\theta }})&= det(\psi _{\theta _1}) det \left( \varvec{\varPsi }_{\varvec{\theta }(1)} - \psi _{\theta _1}^{-1}\varvec{\psi }_{\varvec{\theta }(1)}\varvec{\psi }_{\varvec{\theta }(1)}^{^{t}}\right) \\&= \psi _{\theta _1} det(\varvec{\varPsi }^*), \end{aligned}$$

using Eqs. (10) and (13) and a property of the determinant of block matrices. As a result, \(det(\varvec{\varPsi }_{\varvec{\theta }})\) is greater than zero, since both \(det(\varvec{\varPsi }^*)\) and \(\psi _{\theta _1}\) are greater than zero. This implies positive definite samples of \(\varvec{\varPsi }_{\varvec{\theta }}\).
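This is the standard Schur complement factorization of a partitioned covariance matrix; a quick numeric check with an illustrative 3×3 matrix (made up for the sketch):

```python
import numpy as np

# Illustrative positive definite covariance matrix for T = 3 occasions
Psi_theta = np.array([[1.00, 0.60, 0.45],
                      [0.60, 0.90, 0.55],
                      [0.45, 0.55, 0.95]])

psi_theta1 = Psi_theta[0, 0]   # variance at the first occasion
psi_1 = Psi_theta[1:, 0]       # covariances with the first occasion
Psi_sub = Psi_theta[1:, 1:]    # Psi_theta(1)

# Schur complement: conditional covariance of occasions 2..T given occasion 1
Psi_star = Psi_sub - np.outer(psi_1, psi_1) / psi_theta1

# det(Psi_theta) = psi_theta1 * det(Psi_star), so positivity of both
# factors guarantees a positive definite Psi_theta
lhs = np.linalg.det(Psi_theta)
rhs = psi_theta1 * np.linalg.det(Psi_star)
```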

In MCMC Step 10, parameters of a posited covariance pattern structure are computed given an MCMC sample of the unrestricted unstructured covariance parameters. Each simulated covariance matrix will be positive definite, since it is based on a positive definite unstructured covariance matrix. In the Appendix, the reparameterization for each covariance structure is specified, which facilitates the sampling of the parameters of the restricted covariance matrices. That is, in each MCMC iteration, parameters of a specific covariance pattern are computed using sampled unstructured covariance parameters. This procedure is based on the notion that each restricted covariance pattern is nested in the most general unstructured pattern, and that in the MCMC procedure parameter transformations can be used to achieve draws from the transformed parameter distribution.

4 Selection of covariance structure

Accurate inferences are obtained when the most appropriate covariance pattern is selected. Selecting too simple a covariance pattern can lead to underestimated standard errors and biased parameter estimates, see Singer and Andrade (2000) and Singer and Andrade (1994). Selecting too complex a covariance pattern can lead to a decrease in power and efficiency. The general method to select an appropriate covariance structure is based on some Bayesian optimality criterion. The different covariance structures are viewed as competing covariance models, and the one that optimizes the Bayesian model criterion is selected. Attention is focused on three criteria for model selection, which are widely used in the literature: the deviance information criterion (DIC), the posterior expectation of Akaike's information criterion (AIC), and the posterior expectation of the Bayesian information criterion (BIC), see Spiegelhalter et al. (2002). For each criterion, the covariance pattern with the smallest criterion value is selected. All competing covariance structures are time heterogeneous, which generalizes the work of Andrade and Tavares (2005) and Tavares and Andrade (2006).

The general form of the different information criteria is the deviance (i.e. minus two times the log-likelihood) plus a penalty term for model complexity, which includes the number of model parameters. Let \(\varvec{\vartheta }\) denote the set of relevant parameters, that is, the latent traits, the item and the population parameters, then the following deviance is considered to define the model selection criteria,

$$\begin{aligned} D_1(\varvec{\vartheta })&= -2\left[ \log p(\varvec{y}\mid \varvec{\theta }, \varvec{\zeta }) + \log p(\varvec{\theta }\mid \varvec{\mu }_{\varvec{\theta }}, \varvec{\varPsi }_{\varvec{\theta }})\right] \\&= -2\left( \sum _{t=1}^T\sum _{j=1}^n\sum _{i|I_{ijt}=1} \log P(Y_{ijt}=y_{ijt}|\theta _{jt},\varvec{\zeta }_i)\right. \\&+ \left. \sum _{j=1}^n\log p(\varvec{\theta }_j \mid \varvec{\mu }_{\varvec{\theta }}, \varvec{\varPsi }_{\varvec{\theta }})\, \right) \\&= -2(LL + LLLT), \end{aligned}$$

where \(p(\varvec{\theta }_j \mid \varvec{\mu }_{\varvec{\theta }}, \varvec{\varPsi }_{\varvec{\theta }})\) represents the density of the multivariate normal distribution, \(LL = \sum _{t=1}^T\sum _{j=1}^n\sum _{i|I_{ijt}=1}\) \( \log P(Y_{ijt}=y_{ijt}|\theta _{jt},\varvec{\zeta }_i) \) and \(LLLT = \sum _{j=1}^n\log p(\varvec{\theta }_j \mid \varvec{\mu }_{\varvec{\theta }}, \varvec{\varPsi }_{\varvec{\theta }})\).

The deviance depends heavily on the estimated latent traits. The covariance structure influences the latent trait estimates, although they are mostly determined by the data. The terms \(LL\) and \(LLLT\) both emphasize the fit of the latent traits, which diminishes the importance of the fit of the covariance structure. Therefore, the deviance term \(D_2(\varvec{\vartheta })= LL\) is also considered, to evaluate the fit of the covariance structure through the fit of the latent traits in the likelihood term.

Let \(\overline{D_i(\varvec{\vartheta })}\) denote the posterior mean deviance and \(D_i(\hat{\varvec{\vartheta }})\) the deviance at the posterior mean. Then, the DIC is defined as,

$$\begin{aligned} DIC_i = 2\overline{D_i(\varvec{\vartheta })} - D_i(\hat{\varvec{\vartheta }}), i=1,2, \end{aligned}$$

where the penalty function for model complexity is determined by an estimate of the effective number of model parameters, which allows nonzero covariance among model parameters.

For the AIC and the BIC, the penalty function for model complexity is determined by the effective number of parameters in the model, which is difficult or impossible to ascertain when random effects are involved. Following Spiegelhalter et al. (2002) and Congdon (2003), the posterior mean deviance is used as a penalized fit measure, which includes a measure of complexity. Then, the following specification is made for the AIC and the BIC,

$$\begin{aligned} AIC_i = \overline{D_i(\varvec{\vartheta })} +2(2(I-1) + (T+n_{\varvec{\varPsi }_{\theta }})), \end{aligned}$$

and

$$\begin{aligned} BIC_i = \overline{D_i(\varvec{\vartheta })} + (2(I-1) + (T+n_{\varvec{\varPsi }_{\theta }})) \ln (n^*), \end{aligned}$$

i = 1, 2, respectively, where \(n_{\varvec{\varPsi }_{\theta }}\) is the total number of covariance parameters, \(T\) is the number of time points and \(n^* = \sum _{j=1}^n\sum _{t=1}^T\sum _{i=1}^{I}V_{ijt}\).

The AIC and BIC results are not guaranteed to lead to the same model, see Spiegelhalter et al. (2002) and Ando (2007). The BIC has a much higher penalty term for model complexity than the AIC. Therefore, a relatively more concise description of the covariance structure can be expected from the BIC. When different results of the two criteria are obtained, the model selected by the BIC is preferred over the one selected by the AIC, see Spiegelhalter et al. (2002) and Ando (2007).

The deviance can be approximated using the MCMC output, and using \(G\) MCMC iterations the posterior mean of the deviance is estimated by

$$\begin{aligned} \overline{D_i(\varvec{\vartheta })} = \frac{1}{G}\sum _{g=1}^G D_i(\varvec{\vartheta }^g), \end{aligned}$$

and the deviance at the posterior mean by,

$$\begin{aligned} {D_i(\hat{\varvec{\vartheta }})} = D_i\left( \frac{1}{G}\sum _{g=1}^G \varvec{\vartheta }^g\right) , \end{aligned}$$

with index \(g\) representing the \(g\)-th value of the valid MCMC sample (after discarding the burn-in and applying the thinning interval).
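Given the stored deviance evaluations, the three criteria reduce to a few lines of arithmetic. A sketch, where the penalty count and sample size passed in correspond to the paper's \(2(I-1) + (T+n_{\varvec{\varPsi }_{\theta }})\) and \(n^*\):

```python
import numpy as np

def model_choice_criteria(deviance_draws, deviance_at_post_mean,
                          n_params, n_obs):
    """DIC, posterior-expected AIC and BIC from G deviance evaluations.

    deviance_draws        -- D_i(theta^g) for g = 1, ..., G
    deviance_at_post_mean -- D_i evaluated at the posterior mean
    n_params              -- penalty count, e.g. 2(I-1) + (T + n_Psi_theta)
    n_obs                 -- n*, the number of observed responses
    """
    d_bar = np.mean(deviance_draws)             # posterior mean deviance
    dic = 2.0 * d_bar - deviance_at_post_mean   # D_bar + p_D
    aic = d_bar + 2.0 * n_params
    bic = d_bar + n_params * np.log(n_obs)
    return dic, aic, bic
```

For each competing covariance structure, the structure with the smallest criterion values would be retained.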

Here, the selection of the optimal covariance structure is carried out using Bayesian measures of model complexity, as in Spiegelhalter et al. (2002). It is also possible to use pseudo-Bayes factors, as in Kass and Raftery (1995), or reversible jump MCMC algorithms, see Green (1995) and Azevedo (2008), which would require a different computational implementation.

4.1 Model assessment: posterior predictive checks

Besides using model selection criteria for selecting the covariance structure, the fit of the general LIRT model can be evaluated using Bayesian posterior predictive tests and Bayesian residual analysis techniques (Albert and Chib 1995). The literature about posterior predictive checks for Bayesian item response models shows several diagnostics for evaluating the model fit. A general discussion can be found in, among others, Stern and Sinharay (2005), Sinharay (2006), Sinharay et al. (2006), and Fox (2004, 2005, 2010).

The common posterior predictive tests can be generalized to make them applicable for the LIRT model. Each posterior predictive test is based on a discrepancy measure, where this discrepancy measure is defined in such a way that a specific assumption or general fit of the model can be evaluated. The main idea is to generalize the well known discrepancy measures to a longitudinal structure.

In general, let \(\varvec{y}^{obs}\) be the matrix of observed responses, and \(\varvec{y}^{rep}\) the matrix of replicated responses generated from its posterior predictive distribution. The posterior predictive distribution of the response data of time-point \(t\) is represented by

$$\begin{aligned} p\left( \varvec{y}_t^{rep} \mid \varvec{y}_t^{obs}\right)&= \int p\left( \varvec{y}_t^{rep} \mid \varvec{\theta }_t \right) p\left( \varvec{\theta }_t \mid \varvec{y}_t^{obs}\right) d\varvec{\theta }_t, \end{aligned}$$

where \(\varvec{\theta }_t\) denotes the set of model parameters corresponding to time-point \(t\). Generally, given a discrepancy measure \(D\left( \varvec{y}_t, \varvec{\theta }_t \right) \), the replicated data are used to evaluate whether the discrepancy value given the observed data is typical under the model. A p-value can be defined that quantifies the extremeness of the observed discrepancy value at time-point \(t\),

$$\begin{aligned} p_0\left( \varvec{y}_t^{(obs)}\right) \!=\! P\left( D\left( \varvec{y}^{(rep)}_t, \varvec{\theta }_t \right) \!\ge \! D\left( \varvec{y}^{(obs)}_t, \varvec{\theta }_t \right) \mid \varvec{y}_t^{(obs)} \right) , \end{aligned}$$

where the probability is taken over the joint posterior of \((\varvec{y}^{(rep)}_t, \varvec{\theta }_t)\). In some cases, the discrepancy measure can be generalized from the time-point level to the population level. In that case, the discrepancy measure can be used to evaluate model fit at the time-point and population level.

Here, p-values based on a chi-square distance, predictive distributions of latent scores, and Bayesian latent residuals are considered (Fox 2004, 2010; Azevedo et al. 2011). The chi-square posterior predictive check compares the predictive score distribution with the observed score distribution. The discrepancy measure for evaluating the score distribution is defined as,

$$\begin{aligned} D\left( \mathbf {y}_t \right) = \sum _l \frac{\left( n_{l,t} - E(n_{l,t})\right) ^2}{V(n_{l,t})}\,, \end{aligned}$$

where \(n_{l,t}\) is the number of subjects with a score \(l\) at measurement occasion \(t\), and \(E(.)\) and \(V(.)\) stand for the expectation and the variance, respectively. The posterior predictive check based on the score distribution is evaluated using MCMC output.
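The discrepancy and the resulting posterior predictive p-value can be sketched as follows; the expectations and variances of the counts, which would come from the model, are passed in as arguments here, and the per-iteration discrepancy values are made up:

```python
import numpy as np

def score_discrepancy(counts, expected, variance):
    """Chi-square distance D(y_t) between the score counts n_{l,t} and
    their model-implied expectation and variance at occasion t."""
    counts, expected, variance = map(np.asarray, (counts, expected, variance))
    return np.sum((counts - expected) ** 2 / variance)

def posterior_predictive_p(observed_disc, replicated_disc):
    """Share of MCMC iterations in which the replicated discrepancy is at
    least as extreme as the observed one, i.e. the p-value p_0."""
    return np.mean(np.asarray(replicated_disc) >= np.asarray(observed_disc))

# Hypothetical per-iteration discrepancies for one occasion
obs = [1.4, 2.0, 1.1, 3.2]   # D(y_t^obs, theta_t^g) per iteration g
rep = [2.1, 1.5, 1.8, 2.9]   # D(y_t^rep, theta_t^g) per iteration g
p_value = posterior_predictive_p(obs, rep)
# p-values near 0 or 1 would signal misfit of the score distribution
```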

The predictive score distribution is easily calculated using the MCMC output. In each iteration, a sample of the score distribution is obtained by generating response data given the sampled parameters. Subsequently, the number of subjects with each possible score at each time-point can be calculated. For each possible score, the median and 95 % credible interval are calculated to evaluate the score distribution.

A general approach for model adequacy assessment using Bayesian (latent) residuals is described by Albert and Chib (1995). Here, Bayesian residuals are analyzed for the latent traits at each time-point. The following quantity is considered,

$$\begin{aligned} \frac{\widehat{\theta }_{jt} - \widehat{\mu }_{\theta _t}}{\sqrt{\widehat{\psi }_{\theta _t}}}, \end{aligned}$$

for \(t=1,2,...,T\), using posterior mean estimates. Subsequently, the normality assumption is evaluated using box and/or Q-Q plots.
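A sketch of this residual computation, with simulated stand-ins for the posterior mean estimates:

```python
import numpy as np

def latent_trait_residuals(theta_hat, mu_hat, psi_hat):
    """Standardized latent trait residuals at one occasion, as above."""
    return (theta_hat - mu_hat) / np.sqrt(psi_hat)

# Stand-in posterior mean estimates for one occasion (illustrative only)
rng = np.random.default_rng(7)
theta_hat = rng.normal(loc=1.0, scale=0.9, size=500)
res = latent_trait_residuals(theta_hat, mu_hat=1.0, psi_hat=0.9 ** 2)
# res would then be inspected with box plots and/or Q-Q plots
```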

5 Simulation study

Convergence properties and parameter recovery were analyzed using simulated data. The following hyperparameter settings were used in the simulation study:

$$\begin{aligned} \varvec{\mu }_{\varvec{\psi }}&= \varvec{0}_{T-1}\,,\varvec{\varPsi }_{\varvec{\psi }} = \tau \varvec{I}_{T-1}\,\,\, \end{aligned}$$
(28)
$$\begin{aligned} \varvec{\varPsi }_{\varvec{\varPsi }}&= \left( \nu _{\varvec{\varPsi }} - T + 1\right) \left( \varvec{I}- \varvec{\varPsi }_{\varvec{\psi }}\right) , \end{aligned}$$
(29)

where \(\nu _{\varvec{\varPsi }}= 5\), \(\tau =1/8\), and the hyperparameters for the item parameters were specified as \(\varvec{\mu }_{\varvec{\zeta }} = (1,0)^{\top }\) and \(\varvec{\varPsi }_{\varvec{\zeta }} = \text{ diag }(0.5,3)\).
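For these settings, Eqs. (28) and (29) reduce to small matrices; a direct computation (with \(T = 3\) as in the study):

```python
import numpy as np

T, nu_Psi, tau = 3, 5.0, 1.0 / 8.0   # values used in the simulation study

mu_psi = np.zeros(T - 1)                                 # Eq. (28)
Psi_psi = tau * np.eye(T - 1)
Psi_Psi = (nu_Psi - T + 1) * (np.eye(T - 1) - Psi_psi)   # Eq. (29)
# With T = 3 and tau = 1/8 this gives Psi_Psi = 3 * (7/8) * I = 2.625 * I_2
```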

Responses of \(n = 1{,}000\) examinees were simulated for three measurement occasions. At each occasion, data were simulated according to a test of 24 items. There were six common items between tests one and two, and six between tests two and three. The item parameter values varied in terms of discrimination power and difficulty. For each examinee, a total of 60 items was administered.

Examinees’ latent traits were generated from a trivariate normal distribution with \(\varvec{\mu }_{\varvec{\theta }} = (0,1,2){^{t}}\). The within-subject latent traits were correlated according to an ARH covariance structure, with \(\varvec{\psi }_{\varvec{\theta }} = (1,0.9,0.95){^{t}}\) and \(\rho _{\theta } = 0.75\). This implies latent growth in the mean structure, weakly heterogeneous latent trait variances across time, and a strong within-subject correlation over time.

5.1 Convergence and autocorrelation assessment

Following Gamerman and Lopes (2006), the convergence of the MCMC algorithm was investigated by monitoring trace plots generated by three different sets of starting values, and by evaluating Geweke’s and Gelman and Rubin’s convergence diagnostics.

Following DeMars (2003), the sampled latent traits were transformed to the scale of the simulated latent traits according to

$$\begin{aligned} \varvec{\theta }_{j.}^{**}&= Chol\left( \varvec{\varPsi }_{\theta }\right) Chol\left( \varvec{S}_{\varvec{\theta }}\right) ^{-1}\left( \varvec{\theta }_{j.}^{*} - \overline{\varvec{\theta }}\right) + \varvec{\mu }_{\varvec{\theta }}\,, \end{aligned}$$

where \(\varvec{\theta }_{j.}^{*}\) are the sampled latent traits, \(\overline{\varvec{\theta }}\) and \(\varvec{S}_{\varvec{\theta }}\) are their sample mean vector and covariance matrix, respectively, and Chol stands for the Cholesky decomposition.
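A vectorized sketch of this transformation, with row \(j\) of the input matrix holding \(\varvec{\theta }_{j.}^{*t}\) (the target mean and covariance below are illustrative):

```python
import numpy as np

def rescale_latent_traits(theta_samples, Psi_true, mu_true):
    """Rescale sampled latent traits to the generating scale via the
    Cholesky-based transformation above (DeMars-type transform)."""
    theta_bar = theta_samples.mean(axis=0)    # sample mean vector
    S = np.cov(theta_samples, rowvar=False)   # sample covariance matrix
    L_true = np.linalg.cholesky(Psi_true)
    L_S = np.linalg.cholesky(S)
    centered = theta_samples - theta_bar
    # Row-wise version of Chol(Psi) Chol(S)^{-1} (theta - theta_bar) + mu
    return centered @ np.linalg.inv(L_S).T @ L_true.T + mu_true

# After rescaling, the in-sample mean and covariance equal mu_true and
# Psi_true exactly, by construction of the Cholesky factors.
```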

Figure 1 represents trace plots of the latent trait population parameters for occasions two and three. The population parameters of time point one were fixed for identification. Figure 2 represents trace plots of the parameters of two randomly selected items. Sampled values were stored every 30th iteration; the resulting MCMC sample showed negligible autocorrelation. Posterior density plots (not shown) of the sampled values showed symmetric behavior of the posteriors, which supports the posterior mean as a Bayesian point estimate.

Fig. 1

For different starting values, trace plots of the simulated values of the population parameters

Fig. 2

For different starting values, trace plots of the simulated values for parameters of item 4 and 33

In each plot, three different chains are plotted, which correspond to three different initial values. From a visual inspection it can be concluded that within 100 (thinned) iterations each chain of simulated values reached the same area of plausible parameter values. Each MCMC chain mixed very well, which indicates that the entire area of the parameter space was easily reached. The Geweke diagnostic, based on a burn-in period of 16,000 iterations, indicated convergence of the chains of all model parameters. Furthermore, the Gelman–Rubin diagnostics were close to one for all parameters. Convergence was established easily, without requiring informative initial parameter values or long burn-in periods. Therefore, the burn-in was set to 16,000 iterations, a total of 46,000 values were simulated, and samples were collected at a spacing of 30 iterations.

5.2 Parameter recovery

The linked test design contains 60 items, such that 120 item parameters and 3,000 person parameters need to be estimated. The general population model for the person parameters leads to an additional set of five parameters, since two population parameters were restricted. Ayala and Sava-Bolesta (1999) suggest considering around 1,200 subjects per item to obtain accurate parameter estimates. Here, 1,000 responses per item were simulated, since the specification of a correct prior structure for the LIRT model becomes more important when fewer data are available. Furthermore, the characteristics of the real data study described further on resemble those of the simulated data study.

Different statistics were used to compare the results: the mean of the estimates (M. Est.), the correlation (Corr), the mean of the standard errors (MSE), the variance (VAR), the absolute bias (ABias), and the root mean squared error (RMSE). To evaluate the accuracy of the MCMC estimates, a total of ten replicated data sets was generated, following Azevedo and Andrade (2010) and Ayala and Sava-Bolesta (1999). For the item and latent trait parameters, average statistics were computed by averaging across data sets, and across items and persons, respectively.

Table 1 represents the results for the latent traits and item parameters. The estimated values of the statistics indicate that the MCMC algorithm recovered all parameters properly. Furthermore, the estimated posterior means of the discrimination and difficulty parameters were also close to the true values. Similar conclusions can be drawn about the estimates of the latent trait population parameters, see Table 2. The estimated posterior means are close to the true values, and the biases are relatively small.

Table 1 Replication study: results for the estimated latent trait and item parameters
Table 2 Replication study: results for the estimated latent trait population parameters

5.3 Covariance structure selection

The information criteria were used to compare the fit of the different covariance models. The results are given in Table 3, which includes the information criteria for model comparison, as presented in Sect. 4. The information criteria results for the heteroscedastic Toeplitz (HT) model were much higher in comparison to the other covariance models, since its dependency structure is restricted to correlations between two adjacent time measurements. Therefore, to avoid distraction, these results were not included in Table 3.

Table 3 Selecting the optimal covariance structure for the simulated data set: estimated Bayesian information criteria. Bold values indicate models chosen by the statistics

From Table 3 it follows that the \(DIC_2\) selects the true covariance model (ARH), whereas the \(AIC_2\) and \(BIC_2\) select the UH structure. However, note that the ARH model was ranked second by these two criteria. It is not surprising that the \(BIC_2\) selects the UH above the ARH, since the BIC tends to prefer simpler models. However, the \(AIC_2\), which tends to select more complex models, also selected the UH model, even though the difference from the related statistic for the ARH model is quite small. This behavior could be caused by sampling fluctuation.

As expected, the results of the \(AIC_1\), \(BIC_1\) and \(DIC_1\) were inflated by the LLLT values, which emphasize the fit of the latent trait estimates. The fit of each particular covariance structure is not well quantified by these criteria, since the latent trait estimates dominate the deviance term. Although the results show some consistency when considering the \(DIC_2\), a more thorough study is necessary, which is beyond the scope of the present study.

6 The Brazilian school development study

The data set analyzed stems from a major study initiated by the Brazilian Federal Government, known as the School Development Program. The aim of the program is to improve the teaching quality and the general infrastructure (classrooms, libraries, informatics laboratories, etc.) in Brazilian public schools. A total of 400 schools in different Brazilian states joined the program. Achievements in mathematics and Portuguese language were measured over five years (from the fourth to the eighth grade of primary school) for students of schools selected and not selected for the program.

The study was conducted from 1999 to 2003. At the start, 158 public schools were monitored, of which 55 schools were selected for the program. The sampled schools were located in six Brazilian states, with two states in each of three Brazilian regions (North, Northeast, and Center West). The schools had at least 200 students enrolled in the daytime educational programs, were located in urban zones, and offered an educational program up to the eighth grade. At baseline, a total of 12,580 students was sampled. From 2000 to 2003, the cohort consisted of students from the baseline sample who were promoted to the fifth grade and did not switch schools. Students enrolled in the fifth grade but coming from another school, and students not assessed in former grades, constituted a second cohort, which was followed for the four subsequent years. Other cohorts were defined in the same way. The longitudinal test design allowed dropouts and inclusions along the time points. Besides achievements, socio-cultural information was collected. The selected students were tested each year.

In the present study, the mathematics performance of 1,500 randomly selected students, who were assessed in the fourth, fifth, and sixth grade, was considered. A total of 72 test items was used, of which 23, 26, and 31 items were used in the tests in grade four, grade five, and grade six, respectively. Five anchor items were used in all three tests. Another common set of five items was used in the tests in grades four and five. Furthermore, four common items were used in the tests in grades five and six.

In an exploratory analysis, the multiple group model (MGM), described in Azevedo et al. (2011), was used to estimate the latent student achievements given the response data. The MGM for cross-sectional data assumes that students are nested in groups and that latent traits are independent given the mean level of the group. Given the longitudinal nature of the study, a positive correlation among latent traits from the same examinee is to be expected, but this aspect was ignored in this exploratory analysis. Pearson's correlations, variances, and covariances were estimated among the vectors of estimated latent traits corresponding to grades four to six. The estimates are represented in Table 4.

Table 4 Estimated posterior variances, covariances, and correlations among estimated latent traits are given in the diagonal, lower and upper triangle, respectively

The results show significant between-grade dependencies. That is, the latent traits are not conditionally independently distributed over grades given the grade-specific means. The estimated variances increased after grade four, which indicates the presence of time-heterogeneous variances. Furthermore, given the covariance estimates, time-heteroscedastic covariances and time-decreasing correlations are to be considered to account for within-subject (between-grade) dependencies among latent traits. Therefore, the LIRT model was estimated using each of the covariance structures to account for the specific dependencies.

The response data were modeled according to the LIRT model using the different covariance structures. First, attention was focused on selecting the optimal covariance structure. Second, a more detailed model fit assessment was carried out using the selected covariance structure. The three model selection criteria were used to identify the most suitable covariance structure. As in the simulation study, the heteroscedastic Toeplitz model did not fit the data and produced much higher information criteria estimates. For each of the other covariance structures, Table 5 represents the estimated values of the \(AIC_i\), \(BIC_i\), and \(DIC_i\), i = 1, 2. The information criteria are represented such that a smaller value corresponds to a better model fit.

Table 5 Selecting the optimal covariance structure for the real data set: estimated Bayesian information criteria. Bold values indicate models chosen by the statistics

The \(AIC_2\) and \(DIC_2\) preferred the unstructured covariance model, whereas the \(BIC_2\) preferred the more parsimonious UH. However, the unstructured model was ranked second by the \(BIC_2\), whereas the UH was ranked second by the \(AIC_2\) and \(DIC_2\). There were only three measurement occasions, and the correlations between grade years were high, which made the comparison between the unstructured and UH models difficult. In the presence of high between-grade correlations and few time points, the information criteria preferred the UH (the most parsimonious model) and the unstructured covariance matrix. From the various competing covariance structures, the unstructured covariance matrix was used for further analysis.

Different model fit assessment tools, based on posterior predictive densities of different quantities, were used to evaluate the LIRT model with the selected unstructured covariance structure. The p-value based on a chi-squared distance, predictive distributions of latent scores, and Bayesian latent residuals were considered; see Albert and Chib (1995), Azevedo (2008), and Azevedo et al. (2011) for more details about the posterior checks.

The Bayesian p-value was p = .398, which indicates that the model fitted well. In addition, almost all observed scores fall within the credible intervals for each grade, except for observed scores equal to 20 in grade five (see Fig. 3). Figure 4 shows an estimated quantile-quantile plot of the latent trait residuals for each grade. Visual inspection suggests that the normal distribution assumed in each grade is appropriate.
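
The general recipe behind such a posterior predictive p-value can be sketched as follows: at each posterior draw, compare a chi-squared-type discrepancy for the observed data with the same discrepancy for data replicated from the model; the p-value is the proportion of draws where the replicated discrepancy exceeds the observed one. The normal sampling model and function names below are illustrative assumptions, not the paper's exact check.

```python
import numpy as np

def chi2_discrepancy(y, mu, var):
    """Chi-squared-type distance between data and model expectation."""
    return np.sum((y - mu) ** 2 / var)

def ppp_value(y_obs, post_mu, post_var, rng):
    """Posterior predictive p-value (sketch; normal replicates assumed).

    post_mu, post_var : arrays of shape (n_draws, n_obs) holding the
    model-implied mean and variance of each observation at each draw.
    """
    exceed = 0
    for mu, var in zip(post_mu, post_var):
        # Replicate a data set from the posterior predictive at this draw
        y_rep = rng.normal(mu, np.sqrt(var))
        d_obs = chi2_discrepancy(y_obs, mu, var)
        d_rep = chi2_discrepancy(y_rep, mu, var)
        exceed += d_rep >= d_obs
    # Values near 0 or 1 signal misfit; values near .5 indicate good fit
    return exceed / len(post_mu)
```

Under this convention, the reported p = .398 sits comfortably away from the extremes, which is what supports the conclusion of adequate fit.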

Fig. 3

Observed score distribution, predicted score distribution, and 95 % central credible intervals

Fig. 4

Quantile-quantile plots of estimated latent trait residuals for each grade
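
The normality check in Fig. 4 can be reproduced for any set of residuals by pairing empirical quantiles with theoretical normal quantiles; near-linearity of the pairs supports the normality assumption. The function name and the Blom-type plotting positions are illustrative choices.

```python
import numpy as np
from statistics import NormalDist

def qq_points(residuals):
    """Theoretical vs. empirical quantiles for a normal Q-Q plot (sketch).

    Plotting the returned pairs and checking linearity mirrors the visual
    inspection described for the latent trait residuals per grade.
    """
    r = np.sort(np.asarray(residuals, dtype=float))
    n = len(r)
    # Standardize the sorted residuals (empirical quantiles)
    z = (r - r.mean()) / r.std(ddof=1)
    # Blom-type plotting positions for the theoretical normal quantiles
    probs = (np.arange(1, n + 1) - 0.375) / (n + 0.25)
    theo = np.array([NormalDist().inv_cdf(p) for p in probs])
    return theo, z
```

A high correlation between the two returned quantile vectors (points close to the identity line) corresponds to the "appropriate" verdict drawn from the figure.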

Table 6 reports the population parameter estimates and 95 % HPD credible intervals for the three grade levels, while accounting for a time-heterogeneous correlation structure among latent traits. Significant growth in the latent trait means was detected, given the non-overlapping credible intervals. As expected, the mean growth in math achievement over grade years is significant. The within-grade variability is relatively small, but the between-grade correlations are significant. Each examinee's latent growth was computed while accounting for the complex dependencies, and showed a pattern comparable to the mean latent growth over grade years.

Table 6 Population parameter estimates and 95 % credible intervals

Finally, Figs. 5 and 6 show the posterior means and 95 % credible intervals of the item discrimination and original difficulty estimates (\(b_i^*\)), respectively. The discrimination parameter estimates are relatively low; only approximately 50 % of the items have sufficient discriminating power. In addition, comparing the difficulty parameter estimates with the population mean estimates shows that the tests were relatively easy, since most difficulty values are below zero. To obtain more accurate estimates of the latent growth of well-performing and excellent examinees, more difficult test items are needed. The relatively easy items led to skewed population distributions (see Fig. 3), in which many students performed very well, making it difficult to accurately measure the math performance of these students. Note, however, that the within-examinee dependency structure over time contributes to an improved estimate of the subject-specific latent trait, since it supports the use of information from other grade years to estimate the achievement level.

Fig. 5

Posterior means and HPD intervals for the discrimination parameters

Fig. 6

Posterior means and HPD intervals for the difficulty parameters

7 Conclusions and comments

A longitudinal item response model is proposed in which the within-examinee latent trait dependencies are explicitly modeled using different covariance structures. The time-heterogeneous covariance structures allow for time-varying latent trait variances, covariances, and correlations. The complex dependency structure across time and identification issues lead to restrictions on the covariance matrix, which complicates the specification of priors and the implementation of an MCMC algorithm. By conditioning on a reference or baseline time point, an unrestricted unstructured covariance matrix was specified given the baseline population parameters. Furthermore, the structured covariance models were handled as restricted versions of this unstructured covariance model, which was estimated through the developed MCMC method.
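
The nesting of structured models inside the unstructured one can be illustrated with a small sketch: a heteroscedastic AR(1)-type structure is simply the unstructured matrix with its correlations constrained to decay geometrically in lag. This is an illustrative parameterization, not the paper's exact set of structures, and the baseline-variance constraint shown in the comment is one common identification choice.

```python
import numpy as np

def unstructured(sds, corr):
    """Unstructured covariance: free standard deviations, free correlations.

    Identification in practice fixes the baseline (first time point)
    variance, e.g. sds[0] = 1, as with the reference time point above.
    """
    D = np.diag(sds)
    return D @ np.asarray(corr, dtype=float) @ D

def arh_cov(sds, rho):
    """Heteroscedastic AR(1): a restricted version of the unstructured
    matrix with corr(t, s) = rho ** |t - s| (illustrative structure)."""
    T = len(sds)
    lags = np.abs(np.subtract.outer(np.arange(T), np.arange(T)))
    return unstructured(sds, rho ** lags)
```

Because every structured matrix is a constrained instance of `unstructured`, a single sampler for the unstructured model can, in principle, be adapted to the restricted structures by sampling only their free parameters.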

The developed Bayesian methods include an MCMC estimation method, and different posterior predictive assessment tools. In a simulation study, the MCMC algorithm showed a good recovery of the model parameters. The assessment tools were shown to be useful in evaluating the fit of the model.

Various extensions of the LIRT model can be considered. The latent variable distribution is assumed to be multivariate normal. This can be relaxed, for example, by using a multivariate skewed latent variable distribution to model asymmetric latent trait distributions; the skewed latent variable approach of Azevedo et al. (2011) could be used for this purpose. The extension to nominal and ordinal response data can be made by defining a more flexible response model at level 1 of the longitudinal model. Dropouts and inclusions of examinees were not allowed in the present data study; a multiple imputation method could be developed to support this situation, see Azevedo (2008). More generally, the LIRT model can be adapted to accommodate incomplete designs, latent growth curves, collateral information for latent traits, informative non-response mechanisms, mixture structures on latent traits and/or item and population parameters, and flexible latent trait distributions, among other things. This requires defining a more general IRT model for the response data using flexible priors that can incorporate the different extensions.