Item response theory (IRT; e.g., Lord, 1980) has become the preeminent modeling paradigm in educational and psychological measurement. In large-scale testing, IRT has played a dominant role in operational calibration and scoring. The development and application of IRT models have been well studied; historical overviews can be found in van der Linden and Hambleton (1997), Embretson and Reise (2000), and Thissen and Wainer (2001), among others. IRT models specify the probability of an item response as a function of a person’s latent ability. The modeling process links the theory underlying the test, the practices used to administer it, and statistical modeling so that a test can be constructed fairly and scientifically. This paper seeks to improve the efficiency of estimation of Bayesian IRT models by developing an innovative approach using so-called Pólya–Gamma distributions. We begin the paper by discussing common IRT estimation methods to provide context for comparing the estimation methods we develop. Following that, we develop the Bayesian Pólya–Gamma estimator and demonstrate its usefulness in a simulation study and an empirical data analysis using IRT models of differing complexity.

1 Bayesian Item Response Theory Estimation

IRT models are estimated in a number of ways. Perhaps the most often-used method is marginal maximum likelihood (MML) estimation using either the expectation–maximization (EM) algorithm, a variant of a Newton–Raphson procedure, or some hybrid of the two designed to speed convergence (e.g., Baker & Kim, 2004; Bock & Aitkin, 1981). MML algorithms rely upon numerical integration to marginalize the likelihood function across the space of the latent traits. The integral is approximated over a grid of discrete quadrature points, and the size of this grid increases exponentially as the number of latent variables increases. As a result, high-dimensional models require tremendous numbers of calculations to estimate yet often yield inaccurate results. Adaptive quadrature has been developed to mitigate this computational burden by using fewer, better-placed points (see Schilling & Bock, 2005), but it does not solve the problem completely.

Following an estimation method similar to the EM algorithm, the quasi-Monte Carlo integration (QMCEM) algorithm replaces quadrature points with quasi-random, low-discrepancy sequences (e.g., Niederreiter, 1978). Although the QMCEM algorithm is better suited to high-dimensional integration, it is relatively slow in estimation time when compared with some fully Bayesian estimation algorithms. Newer algorithms have combined Bayesian and maximum likelihood estimation with stochastic approximation methods such as the Metropolis–Hastings Robbins–Monro (MHRM) algorithm (e.g., Cai, 2010).

The most frequently used Bayesian algorithms are based upon two fundamental mechanisms: Gibbs sampling and the Metropolis–Hastings (MH) algorithm. Gibbs sampling is used in situations where the full conditional posterior distributions of parameters can be derived in closed-form expressions, whereas the MH algorithm uses a proposal distribution in place of the exact conditional distribution to enable the MCMC process (e.g., Lynch, 2010). In their seminal paper on the estimation of binary and polychotomous regression via Gibbs sampling, Albert and Chib (1993) used a probit link function to enable Gibbs sampling of parameters. Extending this work, Gibbs sampling has been used to estimate parameters of the two-parameter normal ogive model, which gives the conditional probability that examinee e answers item i correctly \(\left( {X_{ei} =1} \right) \) as:

$$\begin{aligned} P\left( {X_{ei} =1{|}\theta _e } \right) =\varPhi \left( {a_i \left( {\theta _e -b_i } \right) } \right) \end{aligned}$$
(1)

where, parameterized in discrimination/difficulty form, \(a_i \) is the discrimination parameter for an item i, \(b_i \) is the difficulty parameter for an item i, \(\theta _e \) is the continuous ability parameter of examinee e, and \(\varPhi \left( \cdot \right) \) is the normal cumulative distribution function.

Instead of the probit link, IRT models can be parameterized with logistic link functions. In the unidimensional case, this yields the two-parameter logistic (2-PL) model:

$$\begin{aligned} P\left( X_{ei} =1\mid \theta _e \right) =\frac{\exp \left\{ a_i \left( \theta _e -b_i \right) \right\} }{1+\exp \left\{ a_i \left( \theta _e -b_i \right) \right\} }, \end{aligned}$$
(2)

where \(a_i \), \(b_i \), and \(\theta _e \) retain their meanings from the two-parameter normal ogive model. Historically, the 2-PL model has been parameterized with a scaling constant multiplying the \(a_i \) parameter. When the scaling constant is included and set to a value of approximately 1.7, estimates from the 2-PL model are nearly identical to estimates from the two-parameter normal ogive model. Although the two models achieve similar results, many contemporary developments in IRT are based on the logistic version of the model, which, until this point, could only be estimated with Metropolis–Hastings-based algorithms. The MH algorithm requires a rejection/acceptance decision for each parameter at each step of the Markov chain, whereas all values generated from Gibbs sampling are accepted. Thus, if the full conditional distribution has a closed form, Gibbs sampling provides a much more efficient chain and converges faster than an MH algorithm will. Fully Bayesian estimation of logistic link function IRT models has been conducted using Markov chain Monte Carlo (MCMC) simulation techniques with the Metropolis–Hastings algorithm, and Hamiltonian Monte Carlo has become popular (e.g., Duane, Kennedy, Pendleton, & Roweth, 1987; Geman & Geman, 1984; Hastings, 1970; Patz & Junker, 1999).
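The closeness of the two link functions under the 1.7 scaling is easy to verify numerically. The following R check (a quick illustration of the claim above, not part of the estimation machinery) computes the largest absolute gap between the normal ogive and the rescaled logistic curve, which comes out just under .01:

```r
# Maximum absolute gap between the normal ogive Phi(x) and the
# logistic curve with scaling constant 1.7; roughly 0.0095.
x <- seq(-4, 4, by = 0.001)
max(abs(pnorm(x) - plogis(1.7 * x)))
```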

Extending Bayesian IRT models, Fox and Glas (2001) used Gibbs sampling with a probit link function to estimate a multilevel IRT model, Edwards (2010) compared the performance of MML with MHRM, and Monroe and Cai (2014) estimated a Ramsay-curve item response theory model via the MHRM. More recently, Kuo and Sheng (2015) compared different estimation methods for multidimensional graded response models (GRMs) using various statistical software programs and packages, including two MML approaches (EM algorithm and adaptive quadrature), four fully Bayesian algorithms (Gibbs sampling, Metropolis–Hastings, Metropolis–Hastings-within-Gibbs, and blocked Metropolis), and the MHRM algorithm. The comparison study showed that when the correlation or covariance among latent traits is moderate or high, fully Bayesian algorithms such as Metropolis–Hastings-within-Gibbs perform better than non-fully Bayesian algorithms in the recovery of item discrimination and trait correlation parameters.

In addition to Gibbs sampling and the Metropolis–Hastings algorithm, Hamiltonian Monte Carlo (HMC) has gained researchers’ attention in recent years (see Brooks, Gelman, Jones, & Meng, 2011; Hoffman & Gelman, 2014). HMC extends the MH algorithm by providing more precise proposal values using Hamiltonian dynamics. In each iteration of the algorithm, the values of parameters are said to “leapfrog” to states closer to their posterior densities, shortcutting the time the MH algorithm takes by avoiding proposal values that are ultimately rejected. Once new values are proposed, the HMC algorithm uses MH to accept/reject proposals. Therefore, depending on the model, HMC algorithms can be inefficient when compared to Gibbs samplers, as the MH steps of the algorithm involve the rejection of at least some of the proposed parameters, something Gibbs avoids. Girolami and Calderhead (2011), however, show that when the posterior correlations are high, the HMC algorithm can outperform Gibbs in terms of sampling efficiency. On the other hand, HMC requires tuning and is therefore sensitive to the choice of the tuning parameters, whereas Gibbs needs fewer extra manipulations and can thus be treated as a “plug-and-play” toolkit.

As the latent trait(s) in a model are often assumed to be (multivariate) normally distributed, estimating parameters via a Bayesian approach with Gibbs sampling has relied on a probit link function instead of a logit link (Albert & Chib, 1993; Kuo & Sheng, 2015). If a logit model is used, Bayesian estimation usually adopts an MH algorithm and/or its variants. Skene and Wakefield (1990), Carlin, Polson, and Stoffer (1992), and Forster and Skene (1994) have proposed several analytic approximations for Bayesian logit models, while Gamerman (1997), Chib, Greenberg, and Chen (1998), Lenk and DeSarbo (2000), and Dobra, Tebaldi, and West (2006) use MH algorithms. None of these Metropolis–Hastings algorithms is nearly as efficient as Gibbs sampling. Holmes and Held (2006) and Frühwirth-Schnatter and Frühwirth (2010) propose analogues of Albert and Chib’s method, but both yield slow-mixing samplers that only weakly approximate the posterior distributions.

A Gibbs sampling method can be implemented for logistic-based IRT models by adopting the methods of Polson, Scott, and Windle (2013), who proposed a new data-augmentation strategy based upon the Pólya–Gamma family of distributions. The next section briefly introduces the Pólya–Gamma distribution and shows its utility in a logit model; this serves as the core for extending these new techniques to IRT model estimation.

2 Pólya–Gamma Distribution

Definition

A random variable X has a Pólya–Gamma distribution with parameters \(b>0\) and \(c\in {\mathbb {R}}\), denoted \(X\sim PG\left( {b,c} \right) \), if

$$\begin{aligned} X\sim \sum \limits _{k=1}^{\infty } \frac{g_k }{2\pi ^{2}\left( k-0.5\right) ^{2}+c^{2}/2}, \end{aligned}$$
(3)

where the \(g_k \sim G\left( b,1\right) \) are independent gamma random variables with shape parameter b and rate 1. The Pólya–Gamma distribution is thus an infinite weighted sum of independent gamma variables, a form that makes approximate sampling via gamma draws straightforward.

In a logistic regression model, given that the likelihood is binomial, Polson et al. (2013) show that the likelihood contribution of observation e can be expressed as

$$\begin{aligned} L_e \left( {\varvec{\beta }} \right) =\frac{\left( \exp \left( {{\varvec{x}}}_e^T {\varvec{\beta }}\right) \right) ^{y_e }}{\left( 1+\exp \left( {{\varvec{x}}}_e^T {\varvec{\beta }} \right) \right) ^{n_e }}\propto \exp \left( k_e {{\varvec{x}}}_e^T {\varvec{\beta }} \right) \int \limits _0^\infty e^{-w_e \left( {{\varvec{x}}}_e^T {\varvec{\beta }} \right) ^{2}/2}\,p\left( w_e \mid n_e ,0\right) \mathrm{d}w_e , \end{aligned}$$
(4)

where \(y_e\) is the number of successes, \(n_e\) is the number of trials, \(k_e =y_e -\frac{n_e }{2}\), \({{\varvec{x}}}_e\) is a vector of p predictors for observation \(e\left( {e\in \left\{ {1,\ldots ,N} \right\} } \right) \), and \(p\left( w_e \mid n_e ,0\right) \) is the conditional density of \(w_e \), a Pólya–Gamma random variable with parameters \((n_e ,0)\). Biane et al. (2001) provide proofs that, given Eq. (4), if \({\varvec{\beta }}\) has a prior distribution \(p\left( {\varvec{\beta }} \right) \), then the conditional posterior of \({\varvec{\beta }}\) given a set of Pólya–Gamma random variables \({{\varvec{w}}}=\left( {w_1 ,w_2 ,\ldots ,w_N } \right) \) is

$$\begin{aligned} p({\varvec{\beta }} \mid {{\varvec{w}}},{{\varvec{y}}})\propto & {} p\left( {\varvec{\beta }} \right) \prod \limits _{e=1}^N \exp \left\{ k_e {{\varvec{x}}}_e^T {\varvec{\beta }} -\frac{w_e \left( {{\varvec{x}}}_e^T {\varvec{\beta }}\right) ^{2}}{2} \right\} \nonumber \\\propto & {} p\left( {\varvec{\beta }} \right) \prod \limits _{e=1}^N \exp \left\{ -\frac{w_e }{2}\left( {{\varvec{x}}}_e^T {\varvec{\beta }} -\frac{k_e }{w_e } \right) ^{2} \right\} \nonumber \\\propto & {} p\left( {\varvec{\beta }} \right) \,\exp \,\left\{ -\frac{1}{2}\left( {{\varvec{z}}}-{{\varvec{X}}}{\varvec{\beta }} \right) ^{T}{\varvec{\varOmega }} \left( {{\varvec{z}}}-{{\varvec{X}}}{\varvec{\beta }} \right) \right\} , \end{aligned}$$
(5)

where \({{\varvec{z}}}=\left( \frac{k_1 }{w_1 },\frac{k_2 }{w_2 },\ldots ,\frac{k_N }{w_N }\right) \) and \({\varvec{\varOmega }} =\mathrm{diag}\left( w_1 ,w_2 ,\ldots ,w_N \right) \). Here, \({{\varvec{y}}}\) is reparametrized to \({{\varvec{k}}}\) by calculating \(k_e =y_e -\frac{n_e }{2}\). If the prior distribution \(p\left( {\varvec{\beta }} \right) \) is specified as \(N\left( {{\varvec{b}},{\varvec{B}}}\right) \), where \({{\varvec{b}}}\) is the mean vector and \({{\varvec{B}}}\) is the covariance matrix, Gibbs sampling can be used for estimation. If \(y_e \sim \hbox {Binom}\left( n_e ,\frac{1}{1+\exp \left( -{{\varvec{x}}}_e^T {\varvec{\beta }}\right) }\right) \) and \({\varvec{\beta }} \sim N\left( {{\varvec{b}},{\varvec{B}}}\right) \), sampling from the posterior defined in Eq. (5) follows two steps:

$$\begin{aligned} w_e \mid {\varvec{\beta }}&\sim PG \left( n_e ,{{\varvec{x}}}_e^T {\varvec{\beta }} \right) \nonumber \\ {\varvec{\beta }} \mid {{\varvec{y}}},{{\varvec{w}}}&\sim N\left( {\varvec{\mu }} _w ,{{\varvec{V}}}_w \right) \end{aligned}$$
(6)

Given the conjugacy of the normal prior with the conditionally Gaussian likelihood, it can be shown that \({{\varvec{V}}}_w =\left( {{\varvec{X}}}^{T}{\varvec{\varOmega }} {{\varvec{X}}}+{{\varvec{B}}}^{-1}\right) ^{-1}\) and \({\varvec{\mu }} _w ={{\varvec{V}}}_w \left( {{\varvec{X}}}^{T}{{\varvec{k}}}+{{\varvec{B}}}^{-1}{{\varvec{b}}}\right) \), where \({{\varvec{k}}}=\left( y_1 -\frac{n_1 }{2},\ldots ,y_N -\frac{n_N }{2}\right) \) (see Zeithammer & Lenk, 2006, pp. 5–7, for derivation details). This finding serves as the foundation for the IRT applications that follow.
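To make the two-step sampler concrete, the following R sketch implements Eq. (6) for Bayesian logistic regression, assuming the BayesLogit package (which implements the Pólya–Gamma sampler of Polson et al., 2013) is available; all function and variable names here are ours, chosen for illustration:

```r
library(BayesLogit)  # rpg() draws Polya-Gamma random variables
library(MASS)        # mvrnorm() draws multivariate normals

# Two-step Gibbs sampler of Eq. (6): y = successes, n_trials = trials,
# X = N x p design matrix, prior beta ~ N(b0, B0).
pg_logit_gibbs <- function(y, X, n_trials, b0, B0, n_iter = 5000) {
  N <- nrow(X); p <- ncol(X)
  beta  <- rep(0, p)
  k     <- y - n_trials / 2          # k_e = y_e - n_e / 2
  B0inv <- solve(B0)
  draws <- matrix(NA_real_, n_iter, p)
  for (t in seq_len(n_iter)) {
    # Step 1: w_e | beta ~ PG(n_e, x_e' beta)
    w <- rpg(N, n_trials, as.vector(X %*% beta))
    # Step 2: beta | y, w ~ N(mu_w, V_w) with
    #   V_w = (X' Omega X + B0^-1)^-1, mu_w = V_w (X' k + B0^-1 b0)
    V_w  <- solve(crossprod(X * w, X) + B0inv)
    mu_w <- V_w %*% (crossprod(X, k) + B0inv %*% b0)
    beta <- as.vector(mvrnorm(1, mu_w, V_w))
    draws[t, ] <- beta
  }
  draws
}
```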

Sampling from a Pólya–Gamma distribution can, per the definition in Eq. (3), be achieved by truncating the infinite sum of gamma draws; this practice is called a naïve finite approximation. Alternatively, Polson et al. (2013) adopted a rejection sampling algorithm based on an alternating-series method proposed by Devroye (2002) to avoid the difficulties that can result from an infinite sum: The alternative sampler only requires exponential and inverse Gaussian draws. In a small-scale simulation, Polson et al. (2013) showed that drawing 50,000 points from PG(1, 3.2) was approximately 15 times faster than drawing the same number of points from a truncated normal distribution TN(0, 1).
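For illustration, the naïve finite approximation can be coded in a few lines of R by truncating the sum in Eq. (3) at K terms (the truncation point and names below are our own choices):

```r
# Truncated-sum approximation to one PG(b, c) draw, per Eq. (3).
rpg_naive <- function(b, c, K = 200) {
  k <- 1:K
  sum(rgamma(K, shape = b, rate = 1) /
        (2 * pi^2 * (k - 0.5)^2 + c^2 / 2))
}
```

In practice, the exact Devroye-type sampler is preferred because the truncated sum slightly biases draws downward and requires many gamma draws per variate.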

To reiterate, we propose a new method for estimating Bayesian IRT models with logistic link functions based on Pólya–Gamma distributions. The proposed method is fully Bayesian; depending on the model of interest, it can be tailored to pure Gibbs sampling or to Metropolis–Hastings-within-Gibbs (and, by extension, can serve as part of Hamiltonian Monte Carlo algorithms). In the next section, we demonstrate methods for building Gibbs samplers to estimate uni- and multidimensional 2-PL models.

3 Estimation of IRT Models with Pólya–Gamma Distributions

3.1 Unidimensional 2-PL IRT Model

Bayesian estimation of the unidimensional 2-PL model relies on the MCMC process, where blocks of parameters are sampled from their full conditional distributions instead of directly from the overall joint likelihood. To construct Gibbs samplers, the full conditionals of these blocks must be derived in the form that Eq. (6) shows (see Junker, Patz, & Vanhoudnos, 2016). Without the Pólya–Gamma distribution method, the conditional posterior distributions of \(\theta _e \), \(a_i \), and \(b_i \) have no closed-form expressions, so Gibbs samplers cannot be adopted; instead, accept/reject sampling algorithms (i.e., MH and HMC) are often used. These estimation methods can be inefficient when compared with Gibbs samplers because every value sampled in Gibbs is retained, whereas some proposals in accept/reject sampling are discarded.

Using the Pólya–Gamma distribution, Gibbs samplers can be derived from closed-form expressions. The derivations of the conditional posterior distributions in Eqs. (7), (8), (9), and (12) are given in “Appendix I”. In our derivation, we follow the same formula structure as Eq. (5) so that the general conclusions in Eq. (6) can be applied analogously. In the following equations, the letters e and i represent examinee and item, respectively, such that \(y_{ei} \) is the observed response of examinee e to item i. Let N and I be the numbers of examinees and items, respectively. For \(\theta _e \), the conditional posterior distribution can be written as:

$$\begin{aligned} f\left( \theta _{e} \mid {{\varvec{a}}},{{\varvec{b}}},{{\varvec{y}}}_{e} \right) \propto p\left( \theta _e \right) \exp \, \left\{ -\frac{1}{2}\left( {{\varvec{z}}}_e -{{\varvec{a}}}\theta _{e} \right) ^{T}{\varvec{\varOmega }} _{e} \left( {{\varvec{z}}}_e -{{\varvec{a}}}\theta _{e} \right) \right\} , \end{aligned}$$
(7)

where the prior is \(\theta _e \sim N\left( 0,1\right) \), \({{\varvec{z}}}_e =\left( \frac{a_1 b_1 w_{1e} +k_{1e} }{w_{1e} },\ldots ,\frac{a_I b_I w_{Ie} +k_{Ie} }{w_{Ie} }\right) \), and \({\varvec{\varOmega }} _{e} =\mathrm{diag}\left( w_{1e} ,\ldots ,w_{Ie} \right) \). For \(b_i \), the conditional posterior distribution is:

$$\begin{aligned} f\left( {b_i {|}{\varvec{\theta }} ,a_i ,{{\varvec{y}}}_i } \right) \propto p\left( {b_i } \right) \exp \,\left\{ {-\frac{1}{2}\left( {{{\varvec{z}}}_b +\mathbf{1}a_i b_i } \right) ^{T}{\varvec{\varOmega }} _{ab} \left( {{{\varvec{z}}}_b +\mathbf{1}a_i b_i } \right) } \right\} , \end{aligned}$$
(8)

where the prior is \(b_i \sim N\left( \mu _b ,\sigma _b^2 \right) \), \({{\varvec{z}}}_b =\left( \frac{k_{i1} -a_i \theta _1 w_{i1} }{w_{i1} },\ldots ,\frac{k_{iN} -a_i \theta _N w_{iN} }{w_{iN} }\right) \), and \({\varvec{\varOmega }} _{ab} =\mathrm{diag}\left( w_{i1} ,\ldots ,w_{iN} \right) .\) Finally, for \(a_i \), the conditional posterior distribution is:

$$\begin{aligned} f\left( {a_i {|}{\varvec{\theta }} ,b_i ,{{\varvec{y}}}_i } \right) \propto p\left( {a_i } \right) \hbox {exp}\left\{ {-\frac{1}{2}\left( {{{\varvec{z}}}_a -\left( {{\varvec{\theta }} -\mathbf{1}b_i } \right) a_i } \right) ^{T}{\varvec{\varOmega }} _{\mathrm {ab}} \left( {{{\varvec{z}}}_a -\left( {{\varvec{\theta }} -\mathbf{1}b_i } \right) a_i } \right) } \right\} , \end{aligned}$$
(9)

where \(p\left( {a_i } \right) \sim TN_{\left( {0,\infty } \right) } \left( {\mu _a ,\sigma _a^2 } \right) \), \({{\varvec{z}}}_a =\left( \frac{k_{i1} }{w_{i1} },\ldots ,\frac{k_{iN} }{w_{iN} }\right) \), and \({\varvec{\varOmega }} _{ab} \) is identical to that defined for \(b_i \). With Eqs. (7) and (8), the new forms of the conditionals of \(\theta _e \) and \(b_i \) enable the use of Gibbs samplers, namely sampling from normal distributions. Note that for \(a_i \), if a lower bound of zero is desired (as some Bayesian IRT algorithms enforce), the parameter can be sampled in a closed-form expression via an inverse transformation. Details of the inverse transformation technique can be found in Lynch (2010, pp. 203–206) and Robert and Casella (2004, pp. 35–77). From Eq. (6), the last sampler, \(w_{ie} \), follows a Pólya–Gamma distribution with the first and second parameters equal to 1 and \(a_i \left( {\theta _e -b_i } \right) \), respectively:

$$\begin{aligned} w_{ie} \sim \sum \limits _{k=1}^{\infty } \frac{g_k }{2\pi ^{2}\left( k-0.5\right) ^{2}+a_i^2 \left( \theta _e -b_i \right) ^{2}/2}, \quad g_k \sim G\left( 1,1\right) . \end{aligned}$$
(10)
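Taken together, Eqs. (7)–(10) define one complete Gibbs cycle. A minimal R sketch of such a cycle is given below, assuming the BayesLogit package for the PG draws of Eq. (10); the inverse-CDF draw for \(a_i \) mirrors the inverse transformation discussed above. This is an illustration with our own variable names, not our operational code:

```r
library(BayesLogit)  # rpg() for the Polya-Gamma draws in Eq. (10)

# One Gibbs cycle for the unidimensional 2-PL model (Eqs. 7-10).
# y is the N x I response matrix; priors follow the text:
# theta_e ~ N(0, 1), b_i ~ N(mu_b, s2_b), a_i ~ TN_(0,Inf)(mu_a, s2_a).
pg_2pl_step <- function(y, theta, a, b, mu_a, s2_a, mu_b, s2_b) {
  N <- length(theta); I <- length(a)
  k   <- y - 0.5                                            # k_ei = y_ei - 1/2
  eta <- outer(theta, a) - matrix(a * b, N, I, byrow = TRUE) # a_i(theta_e - b_i)
  w   <- matrix(rpg(N * I, 1, as.vector(eta)), N, I)        # w_ie ~ PG(1, eta_ei)
  for (e in 1:N) {                                          # theta_e | . (Eq. 7)
    v <- 1 / (sum(a^2 * w[e, ]) + 1)                        # prior N(0, 1)
    m <- v * sum(a * (k[e, ] + a * b * w[e, ]))
    theta[e] <- rnorm(1, m, sqrt(v))
  }
  for (i in 1:I) {                                          # b_i | . (Eq. 8)
    v <- 1 / (a[i]^2 * sum(w[, i]) + 1 / s2_b)
    m <- v * (a[i] * sum(a[i] * theta * w[, i] - k[, i]) + mu_b / s2_b)
    b[i] <- rnorm(1, m, sqrt(v))
  }
  for (i in 1:I) {                                          # a_i | . (Eq. 9)
    d <- theta - b[i]
    v <- 1 / (sum(w[, i] * d^2) + 1 / s2_a)
    m <- v * (sum(k[, i] * d) + mu_a / s2_a)
    u <- runif(1, pnorm(0, m, sqrt(v)), 1)                  # inverse-CDF draw
    a[i] <- qnorm(u, m, sqrt(v))                            # from TN_(0,Inf)(m, v)
  }
  list(theta = theta, a = a, b = b, w = w)
}
```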

3.2 Multidimensional 2-PL IRT Model

The multidimensional 2-PL model extends the unidimensional 2-PL model to multiple latent variables. Although not every latent variable is measured by each item, a unique discrimination parameter is estimated for each latent variable that an item measures. In the multidimensional 2-PL model, let \({\varvec{\theta }} _{{\varvec{e}}} =\left( \theta _{e1} ,\theta _{e2} ,\ldots ,\theta _{eK} \right) \) be the K-dimensional column vector of latent traits for an examinee e. The probability of a correct response to item i from examinee e can be expressed as:

$$\begin{aligned} P\left( {X_{ei} =1{|}{\varvec{\theta }} _{{\varvec{e}}} ,{{\varvec{a}}}_{{\varvec{i}}} ,b_{{\varvec{i}}} } \right) =\frac{\exp \left( {{{\varvec{a}}}_i^T {\varvec{\theta }} _{{e}} -b_{{i}} } \right) }{1+\exp \left( {{{\varvec{a}}}_i^T {\varvec{\theta }} _e -b_i } \right) } \end{aligned}$$
(11)

where \(b_i\) is a scalar difficulty parameter for an item and \({{\varvec{a}}}_{{\varvec{i}}} =\left( {a_{1i} ,a_{2i} ,\ldots ,a_{Ki} } \right) \) is a column vector of discrimination parameters for item i. To make the model identified, a minimum of \(K(K-1)/2\) constraints must be placed on the elements of \({{\varvec{a}}}_i \). These constraints can be imposed either by setting some discrimination parameters to zero (at a minimum, the set must follow so-called row-echelon form; e.g., McDonald, 1999) or by placing multivariate constraints (e.g., Lawley & Maxwell, 1971). Commonly, MIRT models are constructed so that each item measures a small number of the latent traits; each item’s discrimination parameters for non-measured traits are then set to zero. Effectively, such models are identified by the set of non-estimated discrimination parameters.

In our algorithm, the sampler of the covariance matrix \({\varvec{\Sigma }}\) for the multidimensional 2-PL model uses the MH algorithm. A typical Gibbs sampler for a covariance matrix relies on the conjugate inverse Wishart distribution; however, when constraints are placed on the covariance matrix, such a sampler is not appropriate. In particular, the current model constrains the diagonal elements of the covariance matrix to 1 because we chose to estimate all discrimination parameters for a given latent variable. As this set of constraints turns the matrix into a correlation matrix, known distributions for covariance matrices do not apply. Decomposing samples from an inverse Wishart distribution is one alternative (Imai & van Dyk, 2005): A matrix \(\widetilde{\varvec{\Sigma }}\) is simulated from an inverse Wishart distribution and converted into a correlation matrix by dividing each element \(\tilde{\sigma }_{{dj}}\) of \(\widetilde{\varvec{\Sigma }}\) by the square root of the product of \(\tilde{\sigma }_{{dd}}\) and \(\tilde{\sigma }_{{jj}} \). Lynch (2010, p. 296) notes that this decomposition approach is inexact when the sample size is small or the off-diagonal elements are large. Given this drawback, an MH algorithm is adopted instead to sample the correlations. The transition kernel for correlation \(\sigma _{dj}^{\left( t \right) } \) used here is simply a symmetric distribution \(N\left( \sigma _{dj}^{\left( {t-1} \right) } ,0.05\right) \), where t indexes the sampling iteration. Note that an adaptive MH algorithm could substitute for the fixed one used in the present paper (see Shaby & Wells, 2010, for more details). Alternatively, we could have fixed one item discrimination parameter to one per dimension, allowing a Gibbs sampler for the covariance matrix of the latent traits; a method for updating a correlation matrix within a Gibbs sampling framework can be found in Talhouk, Doucet, and Murphy (2012). Our choice of the MH algorithm for the covariance matrix was motivated by our desire to demonstrate the Pólya–Gamma Gibbs sampling methods for the discrimination parameters developed in the remainder of this manuscript.
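A sketch of this fixed-kernel MH step for a single correlation is shown below, using the proposal variance of 0.05 stated above; log_post() is a hypothetical helper (not defined here) that evaluates the log density of \({\varvec{\Theta }}\) under \(N\left( \mathbf{0},{\varvec{\Sigma }}\right) \) plus any prior on \({\varvec{\Sigma }}\):

```r
# One MH update of the (d, j) correlation with proposal
# N(sigma_dj^(t-1), 0.05); the proposed matrix must remain a valid
# (positive definite) correlation matrix to be considered.
mh_corr_step <- function(Sigma, d, j, Theta, log_post) {
  prop <- Sigma
  prop[d, j] <- prop[j, d] <- rnorm(1, Sigma[d, j], sd = sqrt(0.05))
  valid <- abs(prop[d, j]) < 1 &&
    all(eigen(prop, symmetric = TRUE, only.values = TRUE)$values > 0)
  if (valid &&
      log(runif(1)) < log_post(prop, Theta) - log_post(Sigma, Theta)) {
    prop    # accept the proposed correlation matrix
  } else {
    Sigma   # reject and keep the current state
  }
}
```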

Given that the multidimensional 2-PL model is a multivariate version of the unidimensional 2-PL model, the samplers for \({{\varvec{a}}}_{{\varvec{i}}} \), \(b_i \), and \({\varvec{\theta }} _e\) require only slight changes from the conditional posterior distributions defined for the unidimensional model. To distinguish the notation from the unidimensional case, \(a_{id} \) represents a nonzero element of the vector \({{\varvec{a}}}_{{\varvec{i}}}\), where the subscript d indexes the related dimension. Note that the subscript \(-d\) denotes the remaining dimensions after excluding d. Thus, the conditionals can be expressed as

$$\begin{aligned}&f\left( {\varvec{\theta }} _e \mid {\varvec{\Sigma }},{{\varvec{a}}},{{\varvec{b}}},{{\varvec{y}}}_e \right) \propto p\left( {\varvec{\theta }} _{e} \right) \exp \left\{ -\frac{1}{2}\left( {{\varvec{z}}}_{e} -{{\varvec{a}}}^{T}{\varvec{\theta }} _{e} \right) ^{T}{\varvec{\varOmega }} _{e} \left( {{\varvec{z}}}_{e} -{{\varvec{a}}}^{T}{\varvec{\theta }} _{e} \right) \right\} ,\nonumber \\&f\left( b_i \mid {\varvec{\theta }} ,a_i ,{{\varvec{y}}}_{i} ,{{\varvec{w}}}\right) \propto p\left( b_i \right) \exp \left\{ -\frac{1}{2}\left( {{\varvec{z}}}_b +\mathbf{1}b_i \right) ^{T}{\varvec{\varOmega }} _b \left( {{\varvec{z}}}_b +\mathbf{1}b_i \right) \right\} ,\nonumber \\&f\left( a_{id} \mid {\varvec{\Theta }},{{\varvec{a}}}_{i\left( -d\right) } ,b_i ,{{\varvec{y}}}_{i} ,{{\varvec{w}}}\right) \propto p\left( a_{id} \right) \exp \left\{ -\frac{1}{2}\left( {{\varvec{z}}}_{ad} -a_{id} {\varvec{\Theta }}_d \right) ^{T}{\varvec{\varOmega }} _{a} \left( {{\varvec{z}}}_{ad} -a_{id} {\varvec{\Theta }}_d \right) \right\} , \end{aligned}$$
(12)
$$\begin{aligned}&f\left( {\varvec{\Sigma }}\mid {\varvec{\Theta }}\right) \propto \prod \limits _{e=1}^{N} N\left( {\varvec{\theta }}_e \mid \mathbf{0},{\varvec{\Sigma }}\right) ,\nonumber \\&w_{ie} \mid \cdot \sim PG \left( 1,{{\varvec{a}}}_i^T {\varvec{\theta }} _e -b_i \right) . \end{aligned}$$

Some comments clarify the symbols and notation: (1) \({\varvec{\Theta }}\) is the \(N \times K\) matrix of latent traits, \({\varvec{\Theta }}_{{\varvec{d}}} \) is the dth column of \({\varvec{\Theta }}\), and \({\varvec{\Theta }}_{-{{\varvec{d}}}} \) is the \({\varvec{\Theta }}\) matrix excluding \({\varvec{\Theta }}_{{\varvec{d}}} \); (2) \({\varvec{\Sigma }}\) is the correlation matrix of \({\varvec{\Theta }}\); (3) all \({\varvec{\Omega }}\) matrices are identical to those defined earlier; (4) \({\varvec{\theta }}_{{\varvec{e}}} \) is the row vector of \({\varvec{\Theta }}\) for an examinee e; and (5) \({\varvec{z}}_{\varvec{e}} \), \({\varvec{z}}_{\varvec{b}} \), and \({\varvec{z}}_{{\varvec{ad}}} \) are \({\varvec{z}}+{\varvec{b}}\), \({\varvec{z}}-{\varvec{a}}_{\varvec{i}}^T {\varvec{\Theta }}\), and \({\varvec{z}}-{\varvec{a}}_{{\varvec{i}}\left( {-{\varvec{d}}} \right) }^T {\varvec{\Theta }}_{-d} +\mathbf{1} b_i \), respectively. To reiterate, the derivation details are given in “Appendix I”.
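For example, the first conditional in Eq. (12) is again normal. Under the intercept parameterization of Eq. (11), a draw of \({\varvec{\theta }}_e \) can be sketched as follows, where A denotes the \(I \times K\) matrix of discrimination parameters and w_e and k_e collect the PG draws and k-values for examinee e (a sketch in our notation; the full derivation is in “Appendix I”):

```r
# Multivariate normal draw of theta_e implied by Eq. (12):
# precision = A' Omega_e A + Sigma^-1, mean = V A'(k_e + w_e * b).
draw_theta_e <- function(A, b, w_e, k_e, Sigma_inv) {
  V <- solve(t(A) %*% (w_e * A) + Sigma_inv)  # (A' Omega_e A + Sigma^-1)^-1
  m <- V %*% (t(A) %*% (k_e + w_e * b))       # the intercepts enter through z_e
  as.vector(MASS::mvrnorm(1, m, V))
}
```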

4 Simulation Study

The proposed algorithm is called PG-MCMC throughout the remainder of the paper. We conducted a simulation study to demonstrate the utility of the proposed estimation method. The study comprises two components: (1) the first part examines the accuracy and efficiency of the proposed estimation method, and (2) the second part investigates its MCMC mixing performance against an MH algorithm. Although the simulation design includes the unidimensional case, we primarily focus on the multidimensional conditions where researchers and practitioners frequently encounter estimation difficulty. Throughout the simulation, the R software (R Core Team, 2018) was used to generate data, construct the proposed algorithm, execute package functions, and aggregate model parameter estimates. In the first part of the study, the mirt package (Chalmers, 2012) was used for comparison. The mirt package, known as a toolkit for numerous IRT-related estimation tasks, has been widely cited in a large body of published work (see DeMars, 2016; Eckes & Baghaei, 2015; Matlock, Turner, & Gitchel, 2016). More specifically, the package provides marginal maximum likelihood algorithms such as EM, QMCEM, and MHRM. The EM algorithm was used for the unidimensional 2-PL IRT simulation, whereas QMCEM and MHRM were adopted for the multidimensional simulation. All stopping criteria (tolerance levels) in mirt were kept at their default of 0.001. In the second part of the study, an MH algorithm implemented in WinBUGS (Lunn, Thomas, Best, & Spiegelhalter, 2000; Spiegelhalter, Thomas, Best, & Lunn, 2003) was adopted for comparison so that the mixing and convergence properties of the proposed approach could be investigated. Throughout the simulation study, the point estimates yielded by the Bayesian frameworks are the means of the posterior distributions.

4.1 Method

The syntax in the present paper was implemented in R 3.4.2 and executed on two different computing platforms. In the first part of the study, the unidimensional 2-PL IRT model simulation served as an initial test: 500 examinees and 20 items (i.e., \(N=500\) and \(I=20)\). Following the notation in Eqs. (7)–(9), the a and b parameters were generated from two different distributions, logN(0.3, 0.2) and N(0, 1), respectively, and \(\theta \) was generated from a standard normal distribution. Note that logN represents a lognormal distribution, so all sampled values are positive. This simulation schema has been suggested in several studies, for example, Harwell and Baker (1991), Feinberg and Rubright (2016), and Mislevy and Stocking (1989). For simplicity and without loss of generality, the multidimensional 2-PL model had each item measure only one latent trait. Responses were generated under 8 conditions: (1) two levels of the number of latent trait dimensions, \({{\varvec{K}}}=\left( {2,4} \right) \); (2) two levels of the number of items per dimension, \({{\varvec{I}}}=\left( {20,40} \right) \); and (3) two levels of the number of examinees, \({{\varvec{N}}}=\left( {200,1000} \right) \). The average correlation among the latent trait dimensions \(\overline{\rho _{vv^{{\prime }}}}\) was randomly generated from a uniform distribution on [0.7, 0.9]. As in the unidimensional simulation, all elements of the vector \({{\varvec{a}}}\) and the scalar b were generated from the distributions above: logN(0.3, 0.2) and N(0, 1). Note that the actual correlations for the true latent trait parameters were generated as random variations around \(\overline{\rho _{vv^{{\prime }}}}\); that is, for each specified level of average correlation, the off-diagonal values of the correlation matrix were not constant from replication to replication. Creating random variation in the correlations ensured that the average level of correlation equaled \(\overline{\rho _{vv^{{\prime }}}}\) while allowing the actual subtest correlations to differ, without constraining the elements of \({\varvec{\Sigma }}\) to a specific pattern. Each condition was replicated 100 times. For all Bayesian estimations, uninformative priors were used: The priors of the a and b parameters were TN(0.01, 10) and N(0.01, 10), respectively. The number of MCMC iterations was set to 5000, of which the first 4000 were discarded as burn-in. As the test lengths were relatively short (20 and 40 items), in some replications responses to certain items were either all correct or all incorrect, which results in unidentified item parameters. When this situation was detected, we regenerated responses until it was avoided. In situations where estimates became positively/negatively infinite due to computational instability, we arbitrarily fixed them to ±100. The first part of the study was executed on an Intel® Xeon® processor (4M cache, 3.00 GHz) with 32 GB RAM under the Windows 10 operating system.
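For concreteness, the unidimensional data-generation scheme can be sketched as follows, reading logN(0.3, 0.2) as a lognormal with log-mean 0.3 and log-SD 0.2 (our reading of the notation; the seed is arbitrary, and the all-correct/all-incorrect check described above would wrap this step):

```r
set.seed(2018)
N <- 500; I <- 20
a     <- rlnorm(I, meanlog = 0.3, sdlog = 0.2)  # discriminations, logN(0.3, 0.2)
b     <- rnorm(I, 0, 1)                         # difficulties, N(0, 1)
theta <- rnorm(N, 0, 1)                         # abilities, N(0, 1)
# 2-PL response probabilities and binary responses
p <- plogis(outer(theta, a) - matrix(a * b, N, I, byrow = TRUE))
y <- matrix(rbinom(N * I, 1, p), N, I)
```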

The second part of the study is essentially a one-condition simulation. A certification test dataset from a health profession (Jiang & Raymond, 2018) was analyzed via both the MHRM algorithm and the PG-MCMC algorithm. The dataset, administered to 3,399 examinees, comprises 200 items unevenly nested within five subtests, as specified in Table 1. The means of the MHRM estimates and the PG-MCMC estimates served as true values to generate responses. The simulated responses were then analyzed via both the PG-MCMC algorithm and the MH algorithm, and (1) parameter recovery under different truncation conditions and (2) the autocorrelation of each parameter were recorded. This part of the study was executed on a SuperMicro server with four 8(16)-core AMD Opteron “Seoul” processors (32(64) cores in total) and 128 GB RAM running the Linux operating system.

Table 1 Subtest specifications of a certification test dataset in a health profession.
Table 2 QMCEM, MHRM, and PG-MCMC simulation outcomes for multidimensional 2-PL model.

4.2 Results

To assess the estimation accuracy of the proposed method against the algorithms embedded in mirt, parameter bias and root mean squared error (RMSE) were calculated for each simulation condition. The EM algorithm and the PG-MCMC algorithm produced similarly accurate parameter estimates. The biases of the a, b, and \(\theta \) parameters from the EM algorithm were 0.015, −0.001, and 0.001, and the RMSEs were 0.178, 0.139, and 0.350, respectively. The biases of the a, b, and \(\theta \) parameters produced by the PG-MCMC algorithm were 0.012, 0.005, and 0.001, while the RMSEs were 0.166, 0.141, and 0.210. Computing speed, however, differed markedly: The average time was 0.15 s for the EM algorithm versus 2246 s for the PG-MCMC algorithm (with 5000 iterations).

Table 2 summarizes the simulation results of both the two-dimensional and four-dimensional 2-PL IRT models without cross-loaded items. The cells with the lowest absolute values are bolded to indicate the winning algorithm. Note that most RMSEs for the correlation parameters were lower than 0.003, so the differences between conditions were barely discernible; these values are therefore not presented in the table. When the number of observations was small (i.e., the first condition), the QMCEM algorithm tended to outperform both the PG-MCMC algorithm and the MHRM algorithm. On the other hand, in the \(D = 4\) conditions, the QMCEM algorithm produced less accurate and less efficient results; for example, when \(D = 4\), \(I = 40\), and \(N = 1000\), the BIAS and RMSE of QMCEM for the a parameter were almost ten times larger than those of the other two approaches, and a similar pattern can be found in the sixth and seventh conditions. Overall, the PG-MCMC algorithm achieved the more favorable results: It yielded the lowest values in 26 of the 56 measures in Table 2.

In terms of computation time, the MHRM algorithm was the fastest, the QMCEM algorithm second, and the PG-MCMC algorithm the slowest. Referring to Table 2, the computing times (in seconds) of the MHRM algorithm for conditions (1–8) were [17, 92, 42, 205, 112, 242, 233, 437]; the corresponding times for QMCEM were [22, 156, 99, 395, 347, 503, 483, 829]. As in the unidimensional 2-PL simulation, the PG-MCMC algorithm took substantially longer due to the large number of iterations: The times needed for the 5000 iterations were [839, 2140, 1371, 3008, 1745, 8583, 7197, 9104].

Fig. 1 a- and b-parameter estimates of the health profession certification test.

In the second part of the study, applying both the MHRM and the PG-MCMC algorithms (with 1000 burn-ins) to the certification test dataset yielded similar estimates. Figure 1 shows the estimates of the a and b parameters, for which the correlations between the two estimation approaches were 0.91 and 0.96, respectively. The correlation of \(\widehat{{\varvec{\Theta }}}\) between the two approaches was 0.88. Finally, the off-diagonal element estimates of the covariance (correlation) matrix, ordered by column, were [0.93, 0.94, 0.89, 0.76, 0.95, 0.87, 0.77, 0.91, 0.77, 0.77] for the MHRM algorithm and [0.83, 0.74, 0.94, 0.90, 0.86, 0.86, 0.70, 0.94, 0.92, 0.96, 0.71] for the PG-MCMC algorithm, showing a larger discrepancy than the other estimates. These estimates were averaged across both algorithms to serve as true parameters for further response generation. For each replication, the following MCMC truncation ranges were used: (1) 1–500, (2) 500–1000, (3) 1000–2000, and (4) 2000–5000, while the initial values were all set to (1) a = 2, (2) b = −1, (3) the off-diagonal elements of \({\varvec{\Sigma }} = 0.5\), and (4) \({\varvec{\Theta }}\) = the values generated from the initial \({\varvec{\Sigma }}\) with seed 2018 in R.

Table 3 shows the simulation outcomes of the two estimation approaches at three MCMC truncation levels, with the upper panel showing the parameter recovery results and the lower panel showing the autocorrelation status. Given that the outcomes at the 2000–5000 truncation level were close to those at the 1000–2000 level, it can be claimed that 1000 burn-ins were sufficient for both algorithms with the given starting values (therefore, the outcomes at the 2000–5000 truncation level are not shown here). Overall, the PG-MCMC algorithm mixed faster than the MH algorithm, as the BIAS and RMSE values dropped more quickly as the burn-in increased. For example, when the burn-in was 500 and the following 500 MCMC iterations were used to extract estimates, the biases of the a parameter decreased to −1.015 and 0.757 for the MH and PG-MCMC algorithms, respectively, from −1.191 and 1.282 at the 1–500 truncation level. Similar patterns can be seen for the other parameters. Among them, the PG-MCMC algorithm outperformed most substantially in mixing \(\Sigma \): The outcomes at the 500–1000 truncation level yielded by the PG-MCMC algorithm were very similar to those of the MH algorithm at the 1000–2000 truncation level. Meanwhile, the autocorrelation results match this conclusion: As the truncation level increased, the autocorrelation declined, and the convergence speed of the PG-MCMC algorithm was faster than that of the MH algorithm. Note that the autocorrelations of the \({\varvec{\Theta }}\) draws were higher than those of the other parameters; even at the 1000–2000 truncation level, the lag-5 autocorrelation values were above 0.1 for both algorithms. These autocorrelation findings are consistent with Sinharay (2003).

Table 3 Simulation outcomes for MH and PG-MCMC at different MCMC truncation levels.

5 Discussion

The simulation results demonstrate the efficiency of the PG-MCMC algorithm in IRT model estimation. The Gibbs sampling estimation strategy based on PG distributions enables the algorithm to converge within far fewer iterations than similar fully Bayesian algorithms. In an ideal estimation situation, such as a unidimensional 2-PL model where all conditional posterior distributions can be sampled by Gibbs, every sampled point is accepted. As such, the mixing speed is faster than that of methods requiring rejections, such as MH and HMC. Even when Gibbs samplers and MH are used simultaneously, PG-MCMC still outperforms pure MH, if only because PG-MCMC allows several of the conditional posterior samplers to be rejection-free.

Although the computation time of PG-MCMC was outperformed by MHRM as the number of latent variables increased, this result is due to how the PG-MCMC algorithm was coded. Theoretically, PG-MCMC could be many times faster if the entire function were constructed in C++ or Fortran; currently, the PG-MCMC algorithm is written in the base R scripting language, while mirt is a C++-based package. Research has shown that using a compiler package with R often cuts execution time by more than half relative to uncompiled code (e.g., Aruoba & Fernández, 2014). Further, the computation time depends on the user-specified number of iterations in addition to model complexity. Finally, the MCMC-based approach produces posterior distributions for all parameters as it runs and does not rely on asymptotic limiting distributions, whereas both MHRM and QMCEM need separate steps to obtain standard errors of the estimates, some of which are difficult to compute precisely (i.e., the standard errors of \({\varvec{\Sigma }})\).
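For instance, the base R byte compiler alone can be applied to the Gibbs cycle sketched earlier (our illustrative pg_2pl_step() function) in one line; the actual speedup depends on the code and platform:

```r
library(compiler)
pg_2pl_step_c <- cmpfun(pg_2pl_step)  # byte-compiled version of the Gibbs cycle
```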

The PG-MCMC approach cannot be fully generalized to 3-PL IRT model estimation: Its conditional posterior distributions are not available in closed form because adding the guessing parameter takes the model out of the pure logistic regression framework. MHRM and QMCEM, however, have no problem handling the 3-PL IRT model. Despite this limitation, the PG-MCMC algorithm is beneficial because 2-PL IRT models, especially multidimensional ones, are used frequently in practice. That said, PG-MCMC can potentially be applied to models that are hard to estimate due to substantial parameter constraints. Meanwhile, the 2-PL IRT models used in this paper are for binary response data, whereas many measurement designs contain more than two response levels, for which GRMs are needed. Polson et al. (2013) showed that the PG strategy can be extended to a multinomial regression model; thus, it is plausible that PG-MCMC can be customized to handle polytomous IRT models such as the GRM.