1 Introduction

Item response theory (IRT) refers to a collection of latent variable models and statistical methods that has been widely used for describing the underlying structure of survey questionnaires or standardized tests in psychological and educational research. In the current work, we focus on logistic IRT models for dichotomously scored items, e.g., questions with “yes/no” response options in an attitude survey, or multiple-choice questions with a single correct answer in an aptitude test. In particular, the binary response to each item in the test is modeled as a logistic regression on one or more latent variables, each of which represents some latent construct we intend to measure.

Maximum likelihood (ML) has been the gold-standard estimation method for IRT models. The ML estimates can be numerically found by expectation-maximization (EM; e.g., Bock & Aitkin, 1981) or Newton-type (e.g., Bock & Lieberman, 1970; Haberman, 2013) algorithms. The likelihood function of IRT models usually involves an intractable integration over the space of latent variables. When the dimensionality of the latent variables is low, simple tensor-product Gaussian/rectangular quadrature suffices to approximate the integral. As the dimensionality increases, however, the naive quadrature representation suffers from the well-known “curse of dimensionality”: the total number of quadrature points grows exponentially. Adaptive quadrature (e.g., Schilling & Bock, 2005; Haberman, 2006), which re-scales the quadrature grid for each observed response pattern at each iteration based on the current parameter estimates, attains the same accuracy using far fewer points, and is thus more efficient in high-dimensional problems. Alternatively, the integral can be approximated by Markov chain Monte Carlo (MCMC) techniques, which results in stochastic variants of EM- or Newton-type algorithms (e.g., Meng & Schilling, 1996; Cai, 2010a, b) that are also suitable for models with a large number of latent variables.
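To make the quadrature discussion concrete, the following sketch (written in Python purely for illustration; it is not part of any implementation reported here, and the item parameters and grid settings are arbitrary) approximates the marginal likelihood of a single response pattern under a unidimensional two-parameter logistic model using simple rectangular quadrature. With r latent variables, a tensor-product grid of G points per dimension would require \(G^r\) evaluations, which is the curse of dimensionality referred to above.

```python
import numpy as np

def pattern_likelihood(y, alpha, beta, n_points=49, lo=-5.0, hi=5.0):
    """Marginal probability of response pattern y under a unidimensional 2PL."""
    z = np.linspace(lo, hi, n_points)                 # quadrature nodes
    w = np.exp(-0.5 * z**2) / np.sqrt(2 * np.pi)      # standard normal density
    w *= (hi - lo) / (n_points - 1)                   # rectangular weights
    p = 1.0 / (1.0 + np.exp(-(alpha[None, :] + np.outer(z, beta))))  # P(Y_ij=1|z)
    cond = np.where(y[None, :] == 1, p, 1.0 - p).prod(axis=1)        # prod over items
    return np.sum(w * cond)

alpha = np.array([0.0, 0.5, -0.5])   # item intercepts (illustrative)
beta = np.array([1.0, 1.5, 0.8])     # item slopes (illustrative)
print(pattern_likelihood(np.array([1, 0, 1]), alpha, beta))
```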

Bayesian inference (e.g., Albert, 1992; Patz & Junker, 1999; Edwards, 2010) based on sampling from the posterior distribution of model parameters has become popular in recent years, partly because of enhanced computing power and the availability of user-friendly software. Bayesian estimation circumvents the evaluation of the likelihood, and thus remains feasible in models with high-dimensional latent traits. However, it should be used with caution, since specifying appropriate prior distributions and tuning the sampling algorithm require considerable statistical expertise. Even though the asymptotic optimality of Bayesian posteriors can be guaranteed by the Bernstein–von Mises theorem (e.g., Le Cam & Yang, 2000), erroneous results may be seen in real applications due to improperly chosen prior distributions or ill-behaved samplers.

Confidence intervals (CIs) convey information about the sampling variability, and should always be reported alongside point estimates. The most widely used interval estimator associated with the ML estimation of IRT models is the Wald-type CI, defined as the point estimate plus or minus the standard error multiplied by the proper normal quantile that matches the nominal coverage level. The standard error computation for IRT model parameters was discussed by, e.g., Cai (2008) and Yuan, Cheng, and Patton (2014). Caveats on the use of Wald-type CIs, due to the reliance on a quadratic approximation of the log-likelihood, have been raised in the statistical literature (e.g., Neale & Miller, 1997): For instance, they are not invariant under non-linear transformations, may cover values beyond the boundary of the parameter space, and may have unsatisfactory small-sample behavior. As pointed out by a referee, CIs obtained by inverting the likelihood ratio or score test may have better finite-sample performance. Those methods, however, are not yet available in the IRT literature. Moreover, they require fitting the model multiple times for each parameter, which is computationally intensive, and thus may not be suitable for multidimensional models.

Bayesian inference is more flexible in terms of quantifying the sampling error. For a given reparameterization of the model, transforming each Monte Carlo sample from the original posterior accordingly yields an approximation to the transformed posterior, from which credible intervals can be constructed by taking, for example, the equi-tailed region. In finite samples, however, dissimilar interval estimates may result from different prior configurations, and preferring one solution over others reduces in essence to the subtle question of prior selection.

In summary, the extant likelihood-based and Bayesian inference methods for IRT parameters both have their merits and deficiencies. In this paper, we aim to develop a comprehensive estimation and inference framework that is able to (a) deal with high-dimensional latent traits, (b) facilitate interval estimation for transformations of parameters, and (c) avoid as much subjectivity and ambiguity as possible in application. Generalized fiducial inference (GFI; Hannig, 2009, 2013), a new variant of Fisher’s fiducial inference, is believed to achieve most, if not all, of the aforementioned desiderata. In this article, we apply GFI to a family of binary logistic IRT models; in particular, a fiducial distribution of item intercepts and slopes is derived. The resulting fiducial distribution is closely approximated by a Bayesian posterior with a data-dependent prior, which is shown to satisfy a Bernstein–von Mises-type asymptotic normality. An MCMC algorithm is proposed to obtain samples from the fiducial distribution, which can subsequently be used for constructing CIs. Using simulated data, we evaluate the comparative performance of the fiducial percentile CI against two types of ML Wald CIs in terms of empirical coverage and length. A real data example is provided at the end, illustrating the use of GFI for exploratory item factor analysis.

2 Theory

2.1 Generalized Fiducial Inference

The origin of fiducial inference can be traced to Fisher (1930, 1933, 1935). To redress what he regarded as a “fallacy” of Bayesian inference that uninformative/flat priors are specified when such a priori information is indeed absent, Fisher invented a fiducial argument to transfer to the parameter space a prior-free probability distribution, namely, the fiducial distribution, which can be used for inferential purposes in ways that resemble the use of a Bayesian posterior. However, he failed to provide an unambiguous interpretation of the fiducial probability, and some of the claimed properties of the fiducial distribution could not be established (Zabell et al., 1992). As a result, fiducial inference has been considered Fisher’s “one great failure” (Zabell et al., 1992), and largely eschewed by mainstream statisticians. Recently, from roots in the theory of structural inference (Fraser, 1968), Dempster–Shafer calculus (e.g., Dempster, 1968, 2008; Shafer, 1976), and generalized confidence intervals (Weerahandi, 1993), the re-formulated generalized fiducial inference (GFI; Hannig, 2009, 2013) was brought back into the spotlight. GFI is a completely general framework adaptable to various parametric models, and usually has justified asymptotic frequentist properties under mild regularity conditions.

We illustrate the idea of fiducial inference using a simple example. Consider a normal location model \(Y\sim \mathcal{N}(\theta , 1)\) with parameter \(\theta \in {\mathbb R}\). When \(\theta = \theta _0\) is known, data can be generated by \(Y = \theta _0 + U\) in which \(U\sim \mathcal{N}(0, 1)\). Conversely, we may want to make inference about \(\theta _0\) after observing \(Y = y\). Under most circumstances, we are not able to identify the data generating \(U = u_0\) satisfying \(y = \theta _0 + u_0\); otherwise, \(\theta _0\) can be obtained trivially by \(\theta _0 = y - u_0\). Despite the fact that the exact recovery of \(\theta _0\) via \(u_0\) is not viable, the quantity \(y - u\) corresponds to the \(\theta \) value that is needed to reproduce the observed data y for any fixed u. If we replace the fixed u by an independent and identically distributed (i.i.d.) copy of the data generating U, denoted \(U^\star \), then the distribution of \(y - U^\star \), referred to as a fiducial distribution of \(\theta \), gauges how plausibly each \(\theta \) value could have reproduced y. The definition of a fiducial probability does not require any information other than the model and observed data, unlike the definition of a Bayesian posterior probability, in which prior information is indispensable. Because of its dependence on fixed data, the fiducial probability is also not the confidence probability in the usual frequentist sense.
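A minimal numerical sketch of this example follows (Python; the observed value y is arbitrary and the code is purely illustrative). Monte Carlo draws of \(y - U^\star \) approximate the fiducial distribution, and its equi-tailed 95 % interval coincides with the classical confidence interval \(y \pm 1.96\).

```python
import numpy as np

rng = np.random.default_rng(1)
y = 0.3                                # observed datum (arbitrary)
u_star = rng.standard_normal(100_000)  # i.i.d. copies of the data-generating U
fiducial_draws = y - u_star            # draws from the fiducial distribution of theta
# The equi-tailed 95% fiducial interval matches the classical CI y -/+ 1.96.
print(np.percentile(fiducial_draws, [2.5, 97.5]))
```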

We now introduce the theory of GFI. For a family of parametric models indexed by some parameter space \(\Theta \), the data-generating equation (DGE) expresses the data \(\mathbf Y\) as a composition of parameters \({\varvec{\theta }}\in \Theta \) and random components \(\mathbf U\) with parameter-free distributions:

$$\begin{aligned} \mathbf{Y}=\mathbf{g}({\varvec{\theta }},\mathbf U). \end{aligned}$$
(1)

As the name suggests, the DGE characterizes how data are generated from the model; in the normal location example, the DGE is \(Y = \theta + U\). When the parameters \(\varvec{\theta }\) are considered known, data \(\mathbf Y\) can be obtained using Eq. 1 after sampling the random components \(\mathbf U\) from their known distributions. For the reversal, i.e., making inference about \({\varvec{\theta }}\), the data \(\mathbf Y=\mathbf{y}\) are considered fixed and known, and Eq. 1 is regarded as an implicit function expressing the parameters in terms of the data and random components. A distribution on the parameter space is then implicitly determined by Eq. 1, transferred from the known distributions of \(\mathbf U\). Properly explicating this relationship, in other words, “solving” \(\varvec{\theta }\) from the DGE, leads to a fiducial distribution that can be used for making inference about parameters \(\varvec{\theta }\). The same role-switching of data and parameters can be found in the duality between the likelihood and density functions, which is fundamental in likelihood-based inference. Here, applying the same idea to the DGE yields a probabilistic quantification regarding which \({\varvec{\theta }}\) in \(\Theta \) is the truth, which is more intuitive than the deterministic quantification provided by the likelihood function.

Define the set inverse of the DGE:

$$\begin{aligned} Q(\mathbf{y}, \mathbf{u})=\left\{ {\varvec{\theta }}\in \Theta : \mathbf{g}({\varvec{\theta }},\mathbf u)=\mathbf{y}\right\} , \end{aligned}$$
(2)

which contains all possible solutions to Eq. 1 given fixed \(\mathbf y\) and \(\mathbf u\). In the normal location example, the set inverse is a singleton set \(\{y - u\}\); the unique element defines unambiguously a fiducial distribution, i.e., \(y - U^\star \). In general, however, Eq. 2 may contain more than one element for some combinations of \(\mathbf y\) and \(\mathbf u\), and may be empty for others. More involved arguments are needed to define a fiducial distribution rigorously.

Infinitely many solutions to the DGE may co-exist. An example is the Bernoulli family \(Y\sim \hbox {Bernoulli}(\theta )\), \(0\le \theta \le 1\). Its DGE can be expressed as \(Y = {\mathbb {I}}\{U\le \theta \}\), in which \({\mathbb {I}}(\cdot )\) denotes the indicator function, and \(U\sim \hbox {Uniform}(0, 1)\). Given \(y\in \{0, 1\}\) and \(0\le u\le 1\) as realizations of Y and U, the corresponding set inverse is one of the two intervals partitioned by the value of u (see Hannig, 2009, Example 6): \(u\le \theta \le 1\) if \(y = 1\), and \(0\le \theta <u\) if \(y = 0\). In fact, models for discrete data often yield a non-singleton set inverse \(Q(\mathbf{y}, \mathbf{u})\); in such cases, the values of \(\mathbf y\) and \(\mathbf u\) alone cannot determine which element should be favored when defining a fiducial distribution for \(\varvec{\theta }\). Hence, we need a user-defined rule that uniquely identifies an element from the set inverse: e.g., randomly selecting a point from \(Q(\mathbf{y}, \mathbf{u})\).

It is also possible that the set determined by Eq. 2 is empty. For example, let \(Y_1 = {\mathbb {I}}\{U_1\le \theta \}\) and \(Y_2 = {\mathbb {I}}\{U_2\le \theta \}\) be two observations from \(\hbox {Bernoulli}(\theta )\). If we observe \(y_1 = 1\) and \(y_2 = 0\), then the joint set inverse is \([u_1, 1]\cap [0, u_2)\), which is empty whenever \(u_1 \ge u_2\). An empty set inverse \(Q(\mathbf{y}, \mathbf{u})\) implies that no parameter value is able to recover \(\mathbf y\) combined with the particular \(\mathbf u\). Because the model is assumed to be correctly specified, intuitively it means that this \(\mathbf u\) value is not helpful to the inference of \(\varvec{\theta }\) and should be discarded. One natural resolution is to concentrate on the set of \(\mathbf u\) such that Eq. 2 is non-empty.
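The following sketch (Python; the data are arbitrary, and this is a brute-force illustration rather than the sampler developed in Sect. 2.4) combines the last two points for a small Bernoulli sample: draws of \(\mathbf{U}^\star \) that yield an empty set inverse are discarded, and a simple selection rule (a uniform draw within the interval) is applied to the rest.

```python
import numpy as np

rng = np.random.default_rng(2)
y = np.array([1, 1, 0, 1, 0])      # observed Bernoulli data (arbitrary)
draws = []
while len(draws) < 2000:
    u = rng.uniform(size=y.size)   # i.i.d. copy of the data-generating U
    lower = u[y == 1].max() if (y == 1).any() else 0.0
    upper = u[y == 0].min() if (y == 0).any() else 1.0
    if lower < upper:              # Q(y, u) = [lower, upper) is non-empty
        draws.append(rng.uniform(lower, upper))   # selection rule
print(np.percentile(draws, [2.5, 50, 97.5]))
```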

Following these heuristics, we define a fiducial distribution as

$$\begin{aligned} \mathbf{v}(Q(\mathbf{y}, \mathbf{U}^\star ))\ | \ \{Q(\mathbf{y}, \mathbf{U}^\star )\ne \emptyset \}, \end{aligned}$$
(3)

in which \(\mathbf{U}^\star \) is an i.i.d. copy of the data generating \(\mathbf{U}\), and \(\mathbf{v}(\cdot )\) denotes a selection rule. A random variable having the distribution determined by Eq. 3 is referred to as a generalized fiducial quantity (GFQ).

Two sources of non-uniqueness are inherent in Eq. 3. First, there may exist different DGEs for the same model, and thus different fiducial distributions (e.g., Hannig, 2013, Example 5.1). In our IRT application, we focus our attention on a specific DGE that has been widely used in practice for generating item response data; future studies are encouraged to investigate other possibilities. Second, when the set inverse consists of more than one point, different selection rules lead to different fiducial distributions. Hannig (2013) proved for a general class of models without random effects that the diameter of the set inverse converges to zero at a fast rate (of order \(1/n\)), which implies that the impact of selection rules is asymptotically negligible. Simulation studies suggest that the sizes of the polytopes produced by the proposed sampler (introduced in Sect. 2.4) are always tiny when the sample size is large. Thus, we conjecture that a higher-order convergence result similar to Hannig’s holds in our case as well. Theorem 2 presented in the current work is an initial step in this direction, in which we establish that the size of the set inverse under the empirical Bayesian approximation to the fiducial distribution is of order \(1/n\) for unidimensional models (\(r = 1\)).

2.2 GFI for Binary Logistic IRT Models

Next, we consider a family of binary logistic IRT models, and derive a GFQ for item parameters. Let a person i’s response to a binary item j, \(Y_{ij} = y_{ij}\in \{0,1\}\), be modeled by the following conditional likelihood (also known as the item response function):

$$\begin{aligned} f_j({\varvec{\theta }}_j, y_{ij} | \mathbf{z}_i)=P\{Y_{ij}=y_{ij}|\mathbf{Z}_i=\mathbf{z}_i\}=\frac{1}{1+e^{(-1)^{y_{ij}}\tau _j({\varvec{\theta }}_j, \mathbf{z}_i)}},\quad \tau _j({\varvec{\theta }}_j, \mathbf{z}_i) = \alpha _j+{\varvec{\beta }}_j{}^\top \mathbf{z}_i, \end{aligned}$$
(4)

in which \(\mathbf{Z}_i=(Z_{id})_{d=1}^r \in {\mathbb R}^r\) are the latent variables. In Eq. 4, \(\alpha _j\) denotes the item intercept, and \({\varvec{\beta }}_j\) denotes the r item slopes. We assume that the intercept is always freely estimated, but some slopes must be fixed for model identification. We denote all \(q_j\) free parameters that calibrate item j by \({\varvec{\theta }}_j\), and write \(\tau _j({\varvec{\theta }}_j, \mathbf{z}_i)\) as the usual linear regression on the latent variables to highlight its dependence on \({\varvec{\theta }}_j\). In addition, we restrict consideration to the case that \(\mathbf{Z}_i\sim \mathcal{N}(\mathbf{0}, \mathbf{I}_r)\), in which \(\mathbf{I}_r\) is an r-dimensional identity matrix; that is, no correlation component is estimated among the latent variables. This general setup encompasses the two-parameter logistic (2PL), bifactor, and exploratory item factor analysis models, but not the independent-cluster or the general two-tier models; future research is encouraged to extend the current framework to a broader class of IRT models.

\(Y_{ij}\) is a Bernoulli random variable with success probability given by \(f_j({\varvec{\theta }}_j, 1 | \mathbf{z}_i)\). The DGE of \(Y_{ij}\) has the following form:

$$\begin{aligned} Y_{ij}={\mathbb {I}}\left\{ U_{ij}\le f_j({\varvec{\theta }}_j, 1 | \mathbf{z}_i)\right\} ={\mathbb {I}}\{ \hbox {logit}(U_{ij})\le \tau _j({\varvec{\theta }}_j, \mathbf{Z}_i) \} = {\mathbb {I}}\left\{ A_{ij}\le \tau _j({\varvec{\theta }}_j, \mathbf{Z}_i) \right\} , \end{aligned}$$
(5)

in which \(U_{ij}\sim \hbox {Uniform}(0,1)\) is independent of \(\mathbf{Z}_i\), and \(A_{ij}=\hbox {logit}(U_{ij})\sim \hbox {Logistic}(0,1)\). In Eq. 5, the free components of \(\alpha _j\) and \({\varvec{\beta }}_j\), i.e., \({\varvec{\theta }}_j\), can be identified as parameters \(\varvec{\theta }\) in Eq. 1; \(A_{ij}\) and \(\mathbf{Z}_i\) are the random components with parameter-free distributions; they jointly correspond to \(\mathbf U\) in Eq. 1. The set inverse of Eq. 5 becomes

$$\begin{aligned} Q_{ij}(y_{ij}, a_{ij}, \mathbf{z}_i)=\{ {\varvec{\theta }}_j\in {\mathbb R}^{q_j}:\,&a_{ij}\le \tau _j( {\varvec{\theta }}_j, \mathbf{z}_i),\hbox { if }y_{ij}=1;\\&a_{ij}>\tau _j( {\varvec{\theta }}_j, \mathbf{z}_i),\hbox { if }y_{ij}=0\}. \end{aligned}$$
(6)

The geometric representation of Eq. 6 is a half-space, i.e., one half of the Euclidean space \({\mathbb R}^{q_j}\) with the partitioning determined by the affine hyperplane \(a_{ij}=\tau _j({\varvec{\theta }}_j, \mathbf{z}_i)\); a graphical illustration using a 2PL item is shown in the left panel of Figure 1.
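As a concrete illustration of Eq. 5, the following sketch (Python; the item parameters are arbitrary and the code is not part of the reported implementation) generates a binary response matrix for a unidimensional 2PL test directly from the data-generating equation.

```python
import numpy as np

rng = np.random.default_rng(3)
n, m = 500, 9
alpha = rng.uniform(-1, 1, size=m)        # item intercepts (illustrative)
beta = rng.uniform(0.5, 2.0, size=m)      # item slopes (illustrative)
z = rng.standard_normal(n)                # latent variables Z_i ~ N(0, 1)
a = rng.logistic(size=(n, m))             # A_ij = logit(U_ij) ~ Logistic(0, 1)
tau = alpha[None, :] + np.outer(z, beta)  # linear predictor tau_j(theta_j, z_i)
y = (a <= tau).astype(int)                # Y_ij = I{A_ij <= tau_j(theta_j, Z_i)}
print(y.mean(axis=0))                     # observed proportion correct per item
```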

Fig. 1

In both panels, the horizontal axis is intercept \(\alpha _j\), and the vertical axis is slope \(\beta _j\). Each straight line gives the boundary condition in Eq. 6 for a particular observation: Dashed lines indicate strict inequalities, while solid lines indicate non-strict ones. The arrow perpendicular to each boundary points into the half-space. The left plot presents the set inverse function (the shaded area) for a single observation (Observation 1, abbreviated as O1). In the right plot, four additional observations (O2–O5) are included; the intersection of the five half-spaces is shown as the shaded area.

Now consider an \(n\times m\) binary response data matrix, denoted \(\mathbf{Y} = (Y_{ij})_{i=1}^n{}_{j=1}^m\), in which n is the sample size, m is the test length, and each \(Y_{ij}\) is generated from a version of Eq. 5. It is assumed that the n individual response patterns, denoted \(\mathbf{Y}_i = (Y_{ij})_{j=1}^m\), \(i = 1,\ldots ,n\), are i.i.d., and that for each observation i, \(Y_{ij}\), \(j = 1,\ldots ,m\), are independent conditional on \(\mathbf{Z}_i\). This implies the independence of the corresponding logistic and normal variates, denoted \(\mathbf{A} = (A_{ij})_{i=1}^n{}_{j=1}^m\) and \(\mathbf{Z} = (\mathbf{Z}_i)_{i=1}^n\), respectively. The set inverse for the DGE of \(\mathbf{Y}\) can be written as

$$\begin{aligned} Q(\mathbf{y}, \mathbf{a}, \mathbf{z})=\mathop {\times }\limits _{j=1}^m\bigcap _{i=1}^n Q_{ij}(y_{ij}, a_{ij}, \mathbf{z}_i), \end{aligned}$$
(7)

in which we write \(\times \) for the Cartesian product. For each j, we take the intersection because the set inverse by definition should include \({\varvec{\theta }}_j\) values that are consistent with all individual DGEs (Eq. 5). For easy reference, we introduce the notation \(\mathbf{Y}_{(j)} = (Y_{ij})_{i=1}^n\) for all n responses to item j, and similarly \(\mathbf{A}_{(j)} = (A_{ij})_{i=1}^n\) for the corresponding logistic variates. Also, let

$$\begin{aligned} Q_j(\mathbf{y}_{(j)}, \mathbf{a}_{(j)}, \mathbf{z})=\!\bigcap _{i=1}^n Q_{ij}(y_{ij}, a_{ij}, \mathbf{z}_i). \end{aligned}$$
(8)

Geometrically, \(Q_j(\mathbf{y}_{(j)}, \mathbf{a}_{(j)}, \mathbf{z})\) is an \({\mathbb R}^{q_j}\)-polyhedron, whose faces are the boundaries of a selective collection of individual half-spaces (Eq. 6). For a single 2PL item, an illustration of Eq. 8 as an \({\mathbb R}^2\)-polygon is given in the right panel of Figure 1. Because we assume that different items do not share parameters, the overall set inverse, i.e., Eq. 7, is finally obtained by taking the Cartesian product. In the sequel, we treat without loss of generality the set inverses (Eqs. 6–8) as the closure of what we defined earlier; since the random components are continuous, all the stated properties that hold for the closure also apply to the interior with probability one.
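To illustrate Eq. 8 numerically, the following sketch (Python; a brute-force illustration, not the production sampler of Sect. 2.4) enumerates the vertices of the polygon for a single 2PL item: each observation contributes one half-space in the \((\alpha _j, \beta _j)\) plane, and the feasible pairwise intersections of boundary lines are the polygon's vertices. A bounding box \([-M, M]^2\), in the spirit of the restriction introduced later in Sect. 2.4, keeps the polygon bounded.

```python
import itertools
import numpy as np

def polygon_vertices(y_j, a_j, z, M=20.0, tol=1e-9):
    """Vertices of the set inverse Q_j for one 2PL item, in (alpha_j, beta_j)."""
    # Each constraint is stored as (c0, c1, c2), meaning c0 + c1*alpha + c2*beta >= 0.
    cons = []
    for yi, ai, zi in zip(y_j, a_j, z):
        sign = 1.0 if yi == 1 else -1.0          # flip the inequality when y_ij = 0
        cons.append((sign * (-ai), sign * 1.0, sign * zi))
    # bounding box [-M, M]^2 keeps the polygon bounded
    cons += [(M, 1.0, 0.0), (M, -1.0, 0.0), (M, 0.0, 1.0), (M, 0.0, -1.0)]
    cons = np.array(cons)
    verts = []
    for (c0, c1, c2), (d0, d1, d2) in itertools.combinations(cons, 2):
        A = np.array([[c1, c2], [d1, d2]])
        if abs(np.linalg.det(A)) < tol:
            continue                             # parallel boundary lines
        v = np.linalg.solve(A, np.array([-c0, -d0]))   # intersection point
        if np.all(cons[:, 0] + cons[:, 1:] @ v >= -tol):
            verts.append(v)                      # feasible for every half-space
    return np.array(verts)                       # no rows => empty set inverse

# a tiny usage example with arbitrary generating values
rng = np.random.default_rng(4)
z = rng.standard_normal(60)
a = rng.logistic(size=60)
y = (a <= 0.2 + 1.2 * z).astype(int)
print(polygon_vertices(y, a, z).shape)
```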

Denote by \(\mathbf{A}^\star \) and \(\mathbf{Z}^\star \) i.i.d. copies of the random components \(\mathbf A\) and \(\mathbf Z\), respectively. A GFQ for item parameters can be constructed from the random set \(Q(\mathbf{y}, \mathbf{A}^\star , \mathbf{Z}^\star )\), provided the set is not empty. We remark that polyhedrons constituting the set inverse are unbounded with a positive probability for fixed n and \(\mathbf y\): For example, when \(n\le q_j\), a non-empty polyhedron \(Q_j(\mathbf{y}_{(j)}, \mathbf{a}_{(j)}, \mathbf{z})\) is certainly unbounded, because a bounded \({\mathbb R}^{q_j}\)-polytope has at least \(q_j + 1\) faces. For ease of exposition, we restrict ourselves to selection rules \(\mathbf{v}(\cdot )\) returning finite values within the set inverse. Following the generic recipe (Eq. 3), a GFQ of item parameters is

$$\begin{aligned} \mathbf{v}(Q(\mathbf{y}, \mathbf{A}^\star , \mathbf{Z}^\star ))\ |\ \left\{ Q(\mathbf{y}, \mathbf{A}^\star , \mathbf{Z}^\star )\ne \emptyset \right\} . \end{aligned}$$
(9)

In the simulation study and the empirical example discussed later, we consider a selection rule that randomly (with equal probability) selects for each item an interior vertex of the corresponding polytope, which parallels Hannig’s (2009) recommendation of non-informative and data-independent selection rules.

2.3 A Bernstein–von Mises Theorem

In Bayesian inference, the Bernstein–von Mises theorem describes the well-known phenomenon that a posterior distribution converges to a normal limit as the sample size increases; it is sometimes referred to as “the Bayesian central limit theorem” in foundational texts. To illustrate, we consider a one-parameter model with parameter \(\theta \); denote by \(\theta _0\) the true parameter that generates the observed data \(\mathbf y\). Let \(R(\mathbf{y})\) be a random variable that follows the posterior distribution of \(\theta \) given observed data \(\mathbf y\). The Bernstein–von Mises theorem implies that the distribution of \(R(\mathbf{Y})\) approaches \(\mathcal{N}(X, \sigma ^2_0)\), in which \(X\sim \mathcal{N}(\theta _0, \sigma ^2_0)\), and \(\sigma ^2_0\) is the reciprocal of the sample Fisher information evaluated at \(\theta _0\). As a result, in large samples, a Bayesian credible interval has approximately the correct frequentist coverage. For instance, consider the one-sided credible interval \((-\infty , r_\alpha (\mathbf{y})]\) in which \(r_\alpha (\mathbf{y})\) is the upper \(\alpha \) quantile of \(R(\mathbf{y})\). The normal approximation suggests \(r_\alpha (\mathbf{Y}) \approx X + z_\alpha \sigma _0\), in which \(z_\alpha \) is the upper \(\alpha \) quantile of the standard normal distribution, so \(P\{\theta _0\le r_\alpha (\mathbf{Y})\} \approx P\{\theta _0\le X + z_\alpha \sigma _0\} = 1 - \alpha \).
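The coverage statement can be checked numerically in the simplest possible setting, the normal location model with known variance, where the posterior under a flat prior (and the fiducial distribution) is \(\mathcal{N}(y, 1)\). The sketch below (Python; purely illustrative) confirms that the one-sided 95 % interval covers \(\theta _0\) at about the nominal rate.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(5)
theta0, alpha = 0.7, 0.05
y = theta0 + rng.standard_normal(100_000)   # replicated data sets, each of size one
upper = y + norm.ppf(1 - alpha)             # r_alpha(y) = y + z_alpha
print(np.mean(theta0 <= upper))             # empirical coverage, approximately 0.95
```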

In this section, we establish a Bernstein–von Mises theorem for a posterior distribution derived from a data-dependent prior, which amounts to approximating the conditioning set involved in the GFQ (Eq. 9) by a first-order inclusion-exclusion expansion. Some notation is introduced first. Suppose that i.i.d. item response data \(\mathbf{Y}=(\mathbf{Y}_i)_{i=1}^n\) are generated from the same logistic IRT family as described in the previous section. Each observation \(\mathbf{Y}_i\) is a multinomial random variable with the following probability mass/likelihood function:

$$\begin{aligned} f({\varvec{\theta }}, \mathbf{y}_i) = \int _{ {\mathbb R}^r} \prod _{j=1}^m f_j({\varvec{\theta }}_j, y_{ij} | \mathbf{z}_i)d\Phi (\mathbf{z}_i). \end{aligned}$$
(10)

Let \(\mathbf{s}({\varvec{\theta }},\mathbf{y}_i)=\partial \log f({\varvec{\theta }}, \mathbf{y}_i)/\partial {\varvec{\theta }}\) be the single-observation score vector, and \(\mathbf{H}({\varvec{\theta }},\mathbf{y}_i) = \partial ^2\log f({\varvec{\theta }}, \mathbf{y}_i)/\partial {\varvec{\theta }}\partial {\varvec{\theta }}{}^\top \) be the single-observation Hessian matrix. Also define \({{\varvec{\mathcal {I}}}}({\varvec{\theta }}) = \hbox {Cov}_{\varvec{\theta }}\left[ \mathbf{s}({\varvec{\theta }},\mathbf{Y}_i)\right] \), which is usually referred to as the Fisher information matrix. It can be verified by direct calculation that

$$\begin{aligned} E_{\varvec{\theta }}\left[ \mathbf{s}({\varvec{\theta }},\mathbf{Y}_i)\right] = \mathbf{0}, \end{aligned}$$
(11)

and

$$\begin{aligned} {{\varvec{\mathcal {I}}}}({\varvec{\theta }}) = E_{ {\varvec{\theta }}}\left[ \mathbf{s}({\varvec{\theta }},\mathbf{Y}_i)\mathbf{s}({\varvec{\theta }},\mathbf{Y}_i){}^\top \right] = -E_{\varvec{\theta }}\left[ \mathbf{H}({\varvec{\theta }},\mathbf{Y}_i)\right] . \end{aligned}$$
(12)

Let \({\varvec{\theta }}_0\) be the true parameter value that generates \(\mathbf{Y}\), and \({{\varvec{\mathcal {I}}}}_0\) be a short-hand notation for \({{\varvec{\mathcal {I}}}}({\varvec{\theta }}_0)\). Also define the (scaled) sample score function \(\mathbf{S}_n = n^{-1/2}\sum _{i=1}^n\mathbf{s}({{\varvec{\theta }}_0},\mathbf{Y}_i)\). By Eqs. 11 and 12, and the Central Limit Theorem,

$$\begin{aligned} \mathbf{S}_n \mathop {\rightarrow }\limits ^{d}\mathcal{N}(\mathbf{0}, {{\varvec{\mathcal {I}}}}_0). \end{aligned}$$
(13)

It follows that \({{\varvec{\mathcal {I}}}}_0^{-1}{} \mathbf{S}_n \mathop {\rightarrow }\limits ^{d}\mathcal{N}(\mathbf{0}, {\varvec{\mathcal {I}}}_0^{-1})\).

Let \(I = (I_j)_{j=1}^m\) be an m-tuple of index sets, in which each \(I_j\) indexes a size-\(q_j\) sub-sample, i.e., \(I_j\subset \{1,\ldots ,n\}\) and \(|I_j| = q_j\). For each item j, the linear system \(A_{ij}^\star = \tau _j({\varvec{\theta }}_j, \mathbf{Z}_i^\star )\), \(i\in I_j\), has a unique solution with probability one, denoted \(\mathbf{V}_{I_j}\), which can potentially be an interior vertex of the random polytope \(Q_j(\mathbf{y}_{(j)}, \mathbf{A}_{(j)}^\star , \mathbf{Z}^\star )\). Pooling across all items, I determines a potential extremal point of \(Q(\mathbf{y}, \mathbf{A}^\star , \mathbf{Z}^\star )\), denoted \(\mathbf{V}_I = (\mathbf{V}_{I_j})_{j=1}^m\); there are in total \(C_n = \prod _{j=1}^m{n\atopwithdelims ()q_j}\) different choices of I. Let \(D_I\) be the event that I determines an extremal point of the non-empty set inverse; the event \(Q(\mathbf{y}, \mathbf{A}^\star , \mathbf{Z}^\star )\ne \emptyset \) used for conditioning in Eq. 9 is then equivalent to \(\bigcup _{I}D_I\). Conditioning on the union of multiple \(D_I\)’s is difficult to manipulate, so we resort to the following approximation. Define the event \(D(\mathbf{y})\) by a two-stage construction: each sub-sample I is selected with probability \(C_n^{-1}\), and, given that the selected I forms an extremal point, \(Q(\mathbf{y}, \mathbf{A}^\star , \mathbf{Z}^{\star })\) is non-empty. It is clear that \(P\{D(\mathbf{y})\} \propto \sum _IP\{D_I\}\). Let \(\mathbf{R}(\mathbf{y})\) be a random variable that follows the distribution of the selected \(\mathbf{V}_I\) conditional on \(D(\mathbf{y})\). \(\mathbf{R}(\mathbf{y})\) differs from the GFQ (Eq. 9) only in the conditioning event: \(D(\mathbf{y})\) can be considered a first-order approximation to \(\cup _ID_I\) in the inclusion-exclusion formula:

$$\begin{aligned} P\{\bigcup _ID_I\} = \sum _{I}P\{D_I\} - \sum _{I\ne I'}P\{D_I\cap D_{I'}\} + \sum _{I\ne I'\ne I''}P\{D_I\cap D_{I'}\cap D_{I''}\}-\cdots . \end{aligned}$$
(14)

The construction of the equi-probability mixture distribution is inspired by Hannig’s (2009, Sect. 4.1) suggested implementation of the fiducial recipe for continuous data. We conjecture that the higher-order terms on the right-hand side of Eq. 14 do not affect the conditional distribution as the sample size grows, but leave the theoretical justification for future research.

A roadmap for our theoretical justification is summarized as follows. We first establish that the density of \(\mathbf R(\mathbf{y})\) has a closed-form expression (Lemma 1) and satisfies the desired Bernstein–von Mises result (Theorem 1). Next, it is proved for unidimensional models (\(r = 1\)) that the diameter of the set inverse goes to 0 at the rate \(1/n\) (Theorem 2), faster than the rate \(1/\sqrt{n}\) at which the distribution of \(\mathbf R(\mathbf{Y})\) approaches its normal limit. This provides partial support for the observation that different selection rules tend to yield essentially the same inference about model parameters when the sample size is large enough.

The following lemma gives explicitly the density of \(\mathbf R(y)\), which amounts to a posterior density defined by a data-dependent prior; detailed derivations can be found in Appendix 1.

Lemma 1

(Density) Consider a test of m dichotomous items each of which is characterized by a version of Eq. 4. Let \(\Theta \subset {\mathbb R}^q\), \(q = \sum _{j=1}^mq_j\), be the parameter space comprising all free intercepts and slopes \({\varvec{\theta }} = ({\varvec{\theta }}_j)_{j=1}^m\). For ease of exposition, the fixed slopes are set to zero. Given observed response data \(\mathbf{y} = (\mathbf{y}_i)_{i=1}^n = (y_{ij})_{i=1}^n{}_{j=1}^m\), the density of \(\mathbf{R}(\mathbf{y})\) can be written as

$$\begin{aligned} g_n({\varvec{\theta }} |\mathbf{y}) \propto&\sum _{I}\int _{{\mathbb R}^{nr}}d_I({\varvec{\theta }},\mathbf{z}_I)\prod _{j=1}^m\left\{ \prod _{i\in {I}_j}\frac{e^{\tau _j({\varvec{\theta }}_j, \mathbf{z}_i)}}{\left[ 1 + e^{\tau _j({\varvec{\theta }}_j, \mathbf{z}_i)}\right] ^2}\prod _{i\in I_j^c}f_j({\varvec{\theta }}_j, y_{ij}|\mathbf{z}_i)\right\} d\Phi (\mathbf{z}). \end{aligned}$$
(15)

In Eq. 15, \(\Phi \) denotes the probability measure of \(\mathcal{N}(\mathbf{0}, \mathbf{I}_{nr})\), and \(d_I({\varvec{\theta }}, \mathbf{z}_I) =\prod _{j=1}^m \left| \det \left( \partial \tau _j({\varvec{\theta }}_j,\mathbf{z}_i)/\partial {\varvec{\theta }}_j\right) _{i\in I_j}\right| \) gives a Jacobian determinant, in which \(\mathbf{z}_I = (\mathbf{z}_i)_{i\in I}\).

Remark 1

The connection to Bayesian inference can be seen from Eq. 15. Rewrite Eq. 15 by splitting the integral into two parts—one for \(\mathbf{z}_{I}\), and the other for \(\mathbf{z}_{ {I}^c}\):

$$\begin{aligned} g_n({\varvec{\theta }} | \mathbf{y})\propto&\sum _I\int d_I({\varvec{\theta }}, \mathbf{z}_I)\prod _{i\in {I}}\left\{ \prod _{j\in J_i}\frac{e^{\tau _j({\varvec{\theta }}_j, \mathbf{z}_i)}}{\left[ 1 + e^{\tau _j({\varvec{\theta }}_j, \mathbf{z}_i)}\right] ^2}\prod _{j\notin J_i}f_j({\varvec{\theta }}_j, y_{ij}| \mathbf{z}_i)\right\} d\Phi (\mathbf{z}_{I})\\&\cdot \int \prod _{i\in {I}^c}\prod _{j=1}^mf_j({\varvec{\theta }}_j, y_{ij}| \mathbf{z}_i)d\Phi (\mathbf{z}_{ {I}^c}), \end{aligned}$$
(16)

in which \(J_i=\{j: i\in {I}_j\}\) for \(i\in {I}\). Note that the second line of Eq. 16 is the marginal likelihood function of the observations in \(I^c\). We can multiply and divide the right-hand side of Eq. 16 by the likelihood of the vertex-determining observations I, and then simplify it to

$$\begin{aligned} g_n({\varvec{\theta }} | \mathbf{y})&\propto b_n({\varvec{\theta }},\mathbf{y})f_n({\varvec{\theta }}, \mathbf{y}). \end{aligned}$$
(17)

In Eq. 17,

$$\begin{aligned} f_n({\varvec{\theta }}, \mathbf{y}) = \int \prod _{i=1}^n\prod _{j=1}^mf_j({\varvec{\theta }}_j, y_{ij}| \mathbf{z}_i)d\Phi (\mathbf{z}) \end{aligned}$$
(18)

denotes the complete sample likelihood, and

$$\begin{aligned} b_n({\varvec{\theta }}, \mathbf{y}) =&\sum _I\int d_I({\varvec{\theta }},\mathbf{z}_I)\prod _{i\in I}\left\{ \prod _{j\in J_i}\frac{e^{\tau _j({\varvec{\theta }}_j, \mathbf{z}_i)}}{\left[ 1 + e^{\tau _j({\varvec{\theta }}_j, \mathbf{z}_i)}\right] ^2}\prod _{j\notin J_i}f_j({\varvec{\theta }}_j, y_{ij}| \mathbf{z}_i)\right\} d\Phi (\mathbf{z}_{I})\\&\biggr /\int \prod _{i\in I}\prod _{j=1}^mf_j({\varvec{\theta }}_j, y_{ij}| \mathbf{z}_i)d\Phi (\mathbf{z}_{I}) \end{aligned}$$
(19)

is a function of both the item parameters and data. Therefore, the density of \(\mathbf{R}(\mathbf{y})\) can be conceived of as the (empirical) Bayesian posterior computed from the data-dependent prior proportional to Eq. 19.

It can be straightforwardly shown that the density expressed by Eq. 15 (or equivalently Eq. 17) satisfies a Bernstein–von Mises theorem. The proof, which parallels Ghosh and Ramamoorthi’s (2003, Theorem 1.4.2) proof of a Bayesian Bernstein–von Mises theorem, is relegated to Appendix 2.

Theorem 1

(Bernstein–von Mises) Suppose that item response data \(\mathbf{Y}=(\mathbf{Y}_i)_{i=1}^n\) are i.i.d. with probability mass function \(f( {\varvec{\theta }}_0, \mathbf{y}_i)\). Let \(\Theta \subset {\mathbb R}^q\) be the parameter space as usual. Assume that

  1. (i)

    \(m\ge r + 1\);

  2. (ii)

    For all \({\varvec{\theta }},{\varvec{\theta }}'\in \Theta \) such that \({\varvec{\theta }} \ne {\varvec{\theta }}'\), \(f({\varvec{\theta }}, \mathbf{y}_i) \ne f({\varvec{\theta }}', \mathbf{y}_i)\) for some response pattern \(\mathbf{y}_i\);

  3. (iii)

    \({\varvec{\theta }}_0\) is in the interior of \(\Theta \);

  4. (iv)

    The Fisher information matrix \(\varvec{\mathcal {I}}_0\) is positive definite.

Let \(\bar{g}_n(\mathbf{h}|\mathbf{y}) = g_n({\varvec{\theta }}_0 + \mathbf{h}/\sqrt{n} | \mathbf{y}) / \sqrt{n}\) be the density of \(\sqrt{n}[\mathbf{R}(\mathbf{y})-{\varvec{\theta }}_0]\), \(H_n\) be the correspondingly rescaled parameter space, and \(\phi _{\varvec{\mathcal {I}}_0^{-1}\mathbf{S}_n,\varvec{\mathcal {I}}_0^{-1}}\) be the density of \({\mathcal N}(\varvec{\mathcal {I}}_0^{-1}\mathbf{S}_n,\varvec{\mathcal {I}}_0^{-1})\). Then,

$$\begin{aligned} \int _{H_n}\left| \bar{g}_n(\mathbf{h}|\mathbf{Y})-\phi _{\varvec{\mathcal {I}}_0^{-1}\mathbf{S}_n,\varvec{\mathcal {I}}_0^{-1}}(\mathbf{h})\right| d\mathbf{h}\mathop {\rightarrow }\limits ^{P_{ {\varvec{\theta }}_0}}0, \end{aligned}$$
(20)

in which \(P_{ {\varvec{\theta }}_0}\) denotes the probability measure of \(\mathbf Y\) under the true parameter values \({\varvec{\theta }}_0\), and \(\mathop {\rightarrow }\limits ^{P_{ {\varvec{\theta }}_0}}\) denotes convergence in probability under the true model.

Remark 2

Assumptions (ii) to (iv) are standard regularity conditions for establishing the asymptotic optimality of the ML estimator. Assumption (i) guarantees the existence of some neighborhood of \({\varvec{\theta }}_0\) such that, for \({\varvec{\theta }}\) outside this neighborhood, the likelihood ratio \(f_n({\varvec{\theta }},\mathbf{Y})/f_n({\varvec{\theta }}_0, \mathbf{Y})\) converges uniformly to zero in probability; this functions similarly to Assumption (v) in Ghosh and Ramamoorthi (2003).

Remark 3

As remarked in van der Vaart (2000, Sect. 10.2), the alternative “centering sequence” \(\sqrt{n}(\hat{\varvec{\theta }} - {\varvec{\theta }}_0)\), in which \(\hat{\varvec{\theta }}\) is the ML estimator, can be used in place of \(\varvec{\mathcal I}_0^{-1}{} \mathbf{S}_n\) in Eq. 20, because the latter is a local linear approximation of the former at the true parameter values \({\varvec{\theta }}_0\) and the two are asymptotically equivalent.

When \(r = 1\), we are able to control the diameter of the set inverse by an \(O_p(n^{-1})\) term (Theorem 2). Because the rate of convergence in Theorem 1 is of order \(1/\sqrt{n}\), the same convergence result also holds for any other point selected from the set inverse, given that each sub-sample I is equally likely to serve as the selected extremal point. The proof is provided in Appendix 3.

Theorem 2

Suppose that Assumptions (i)–(iv) of Theorem 1 hold. Consider \(r = 1\). For any \(K > 0\), define

$$\begin{aligned} \rho _K(\mathbf{y}) = P\{\hbox {diam}\,Q(\mathbf{y}, \mathbf{A}^\star , \mathbf{Z}^{\star }) > K/n\ |\ D(\mathbf{y})\}. \end{aligned}$$
(21)

Then, for each \(\varepsilon > 0\),

$$\begin{aligned} P_{ {\varvec{\theta }}_0}\{\exists K, N > 0:\ \rho _K(\mathbf{Y}) < \varepsilon ,\ \forall n>N\}\rightarrow 1. \end{aligned}$$
(22)

Remark 4

A majority of the proof (Appendix 3) is extensible to multidimensional models (i.e., \(r > 1\)), except for the last part that involves a case enumeration.

2.4 A Markov Chain Monte Carlo Algorithm

Next, we introduce an MCMC algorithm to sample from the fiducial distribution (Eq. 9). Our main task is to sample \(\mathbf{A}^\star \) and \(\mathbf{Z}^\star \) such that the set inverse \(Q(\mathbf{y}, \mathbf{A}^\star ,\mathbf{Z}^\star )\) is non-empty. We solve this high-dimensional truncated sampling problem by a Gibbs sampler, which consists of two types of conditional sampling steps, one for \(A_{ij}^\star \) and the other for \(Z_{id}^\star \). After initialization, our algorithm sequentially draws each random component from its conditional distribution given the latest values of the rest. By the standard theory for Gibbs samplers, the generated Markov chain converges to the joint distribution of \(\mathbf{A}^\star \) and \(\mathbf{Z}^\star \) conditional on \(Q(\mathbf{y}, \mathbf{A}^\star ,\mathbf{Z}^\star )\ne \emptyset \). Following the update of each random component, the interior polyhedrons are rebuilt accordingly. After each MCMC cycle, one extremal point of the set inverse is selected and recorded as an instance of the GFQ. Below, we discuss the two Gibbs sampling steps, the choice of starting values, and some tuning details of the algorithm.

Conditional sampling of \(A_{ij}^{\star }\). Fix i and j. The goal of this step is to obtain an update of \(A_{ij}^\star \) such that the resulting new half-space has a non-empty intersection with the interior polyhedron determined by all current realizations of the random components except for those of the ith observation. Notationally, we use superscript 0 to highlight the dependency solely on the current values of the random components, and superscript 1 the involvement of the updated one. Let \(\mathbf{y}_{-i(j)} = (y_{kj})_{k\ne i}\), \(\mathbf{a}_{-i(j)}^0 = (a_{kj}^0)_{k\ne i}\), and \(\mathbf{z}_{-i}^0 = (\mathbf{z}_k^0)_{k\ne i}\). Any valid update of \(A_{ij}^\star \), denoted \(a_{ij}^1\), should satisfy the following condition:

$$\begin{aligned} Q_{ij}(y_{ij}, a_{ij}^1, \mathbf{z}_i^0) \cap Q_j(\mathbf{y}_{-i(j)}, \mathbf{a}^0_{-i(j)}, \mathbf{z}^0_{-i}) \ne \emptyset \ \Leftrightarrow \ \left\{ \begin{matrix} a_{ij}^1\le \max \nolimits _{{\varvec{\theta }}_j\in \mathcal{V}_{-ij}^0}{\tau _j(\varvec{\theta }}_j, \mathbf{z}_i^0),\quad \hbox {if }y_{ij}=1;\\ a_{ij}^1\ge \min \nolimits _{{\varvec{\theta }}_j\in \mathcal{V}_{-ij}^0}{\tau _j(\varvec{\theta }}_j, \mathbf{z}_i^0),\quad \hbox {if }y_{ij}=0.\\ \end{matrix} \right. \end{aligned}$$
(23)

in which \(\mathcal{V}_{-ij}^0\) denotes the collection of interior vertices of \(Q_j(\mathbf{y}_{-i(j)}, \mathbf{a}^0_{-i(j)}, \mathbf{z}^0_{-i})\). Equation 23 follows from the fact that the left-hand side intersection, i.e., the updated interior polyhedron for item j, is non-empty if and only if at least one point in \(Q_j(\mathbf{y}_{-i(j)}, \mathbf{a}^0_{-i(j)}, \mathbf{z}^0_{-i})\) satisfies the inequality posed by \(Q_{ij}(y_{ij}, a_{ij}^1, \mathbf{z}_i^0)\); due to convexity, it suffices to require that at least one vertex of the polyhedron \(Q_j(\mathbf{y}_{-i(j)}, \mathbf{a}^0_{-i(j)}, \mathbf{z}^0_{-i})\) satisfy the inequality. Therefore, we sample \(A_{ij}^{\star }=a_{ij}^1\) from \(\hbox {Logistic}(0,1)\) truncated from above by \(\max _{{\varvec{\theta }}_j\in \mathcal{V}_{-ij}^0}\tau _j({\varvec{\theta }}_j, \mathbf{z}_i^0)\) when \(y_{ij}=1\), and from below by \(\min _{{\varvec{\theta }}_j\in \mathcal{V}_{-ij}^0}\tau _j({\varvec{\theta }}_j, \mathbf{z}_i^0)\) when \(y_{ij}=0\). A graphical illustration of Step 1 using a 2PL item can be found in the left panel of Figure 2.
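The only non-standard ingredient of this step is drawing from a one-sided truncated \(\hbox {Logistic}(0,1)\) distribution, which can be done by the inverse-CDF method; the same device also applies to the truncated normal draws of the next step. A sketch follows (Python; the reported implementation is in Fortran, so this is only an illustration of the idea).

```python
import numpy as np

def rtrunc_logistic(rng, bound, upper=True, size=None):
    """Logistic(0,1) truncated above at `bound` (upper=True) or below (upper=False)."""
    cdf = lambda x: 1.0 / (1.0 + np.exp(-x))        # logistic CDF
    quantile = lambda p: np.log(p / (1.0 - p))      # logistic quantile (logit)
    u = rng.uniform(size=size)
    if upper:                                       # support (-inf, bound]
        return quantile(u * cdf(bound))
    return quantile(cdf(bound) + u * (1.0 - cdf(bound)))   # support [bound, inf)

rng = np.random.default_rng(6)
print(rtrunc_logistic(rng, bound=0.5, upper=True, size=5))
```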

Fig. 2

The left panel illustrates updating \(A_{5j}^\star \) for Observation 5 (O5) given values of other random components. The updated half-space (O5, new) is parallel to the old O5. The dot-filled area is the feasible region for the updated half-space to have non-empty intersection with the interior polygon without old O5. The shaded area shows the updated polygon. Similarly, the right panel illustrates the conditional sampling of \(Z_5^\star \). This time both the old and new O5 pass through the same point on \(\beta _j=0\).

Conditional sampling of \(Z_{id}^{\star }\). Fix i and d. The goal of this step is to sample \(Z_{id}^\star \) from a suitably truncated standard normal distribution ensuring for all items that the updated interior polyhedrons are not empty. Let \(\mathbf{z}_i^d = (z_{i1}^0\ \cdots \ z_{i,d-1}^0\ z_{id}^1\ z_{i, d+1}^0\ \cdots \ z_{ir}^0){}^\top \). For each item j, the updated \(z_{id}^1\) should satisfy:

$$\begin{aligned} Q_{ij}(y_{ij}, a_{ij}^0, \mathbf{z}_i^d) \cap Q_j(\mathbf{y}_{-i(j)}, \mathbf{a}^0_{-i(j)}, \mathbf{z}^0_{-i}) \ne \emptyset \ \Leftrightarrow \ \left\{ \begin{matrix} a_{ij}^0\le \max \nolimits _{{\varvec{\theta }}_j\in \mathcal{V}_{-ij}^0}{\tau _j(\varvec{\theta }}_j, \mathbf{z}_i^d),\hbox { if }y_{ij}=1;\\ a_{ij}^0\ge \min \nolimits _{{\varvec{\theta }}_j\in \mathcal{V}_{-ij}^0}{\tau _j(\varvec{\theta }}_j, \mathbf{z}_i^d),\hbox { if }y_{ij}=0.\\ \end{matrix}\right. \end{aligned}$$
(24)

Pooling across all items, we express the desired truncation of this sampling step as

$$\begin{aligned} z_{id}^1\in \bigcap _{j=1}^m\bigcup _{{\varvec{\theta }}_j\in \mathcal{V}_{-ij}^0}\{&a_{ij}^0 \le \tau _j({\varvec{\theta }}_j, \mathbf{z}_i^d),\hbox { if }y_{ij} = 1;\\&a_{ij}^0 \ge \tau _j({\varvec{\theta }}_j, \mathbf{z}_i^d),\hbox { if }y_{ij} = 0.\} \end{aligned}$$
(25)

The geometric object implied by Eq. 25 can be a finite interval, an infinite interval, or a disjoint union of intervals. An example using a single 2PL item can be found in the right panel of Figure 2.

Starting values. A non-empty set \(Q(\mathbf{y}, \mathbf{a}^0, \mathbf{z}^0)\) is required to initialize our Gibbs sampler, which can be constructed from some suitable starting values of the parameters and random components. Suppose that initial guesses of the parameter values \({\varvec{\theta }}^0=({\varvec{\theta }}_j^0)_{j=1}^m\) and the latent variable values \(\mathbf{z}^0\) are available. For each i and j, we could execute the Gibbs sampling step of \(A_{ij}^{\star }\) to obtain starting values \(a_{ij}^0\) assuming that the interior polytope has only one vertex \({\varvec{\theta }}^0_j\); that is, we sample \(A_{ij}^\star \) from Logistic(0, 1) truncated from above by \(\tau _j({\varvec{\theta }}^0_j, \mathbf{z}_i^0)\) if \(y_{ij}=1\), and truncated from below by the same quantity if \(y_{ij}=0\). It is clear that the resulting set inverse function is non-empty, because it contains at least some neighborhood of \({\varvec{\theta }}^0\).

In practice, conveniently computable parameter estimates, such as those produced by weighted least squares methods based on tetrachoric correlations (e.g., Muthén, 1978; Gunsjö, 1994), can be used as \({\varvec{\theta }}^0\); alternatively, one could use naive starting values such as 0 for intercepts and 1 for slopes. \(\mathbf{z}^0\) can be generated from the conditional distribution of the latent variables given \(\mathbf y\) and \({\varvec{\theta }}^0\), or simply from a standard normal distribution. In our experience, the generated Markov chain often appears stationary after several thousand iterations, and the final results are not affected by the choice of starting points.

Additional tuning of the sampler. To simplify the sampling algorithm, we restrict the item parameters to a compact set \([-M, M]^{q}\) with \(M>0\) being some pre-specified large number. Hence, the set inverse \(Q(\mathbf{y}, \mathbf{A}^\star ,\mathbf{Z}^\star )\) always comprises closed polytopes which can be efficiently represented by their vertices. The results are not significantly affected by the choice of M provided the sample size is large enough, in which case the generated polyhedrons usually have small diameters and thus are unlikely to reach the arbitrary bounding box.

In small samples, however, polytopes reaching the bounding box emerge from time to time, resulting from unbounded polyhedrons being hard-truncated at the arbitrary bound. Consequently, the marginal fiducial distribution for the associated item parameters can be heavy-tailed; it leaves visible “spikes” on the trace plots, and yields less efficient interval estimators. To resolve this, we propose an extra tuning operation based on the observation that unbounded polyhedrons typically result from a lack of lower/upper bounds for the slope parameters. For fixed item j and dimension d on which item j loads, the single-entry set inverse (Eq. 6) imposes an upper bound for the corresponding slope parameter if \(y_{ij} = 1\) and \(Z^\star _{id} < 0\), or \(y_{ij} = 0\) and \(Z^\star _{id} > 0\); a lower bound is imposed otherwise. Some combinations of \(y_{ij}\) and \(Z^\star _{id}\) rarely occur under certain data-generating models, which may lead to a shortage, if not a sheer absence, of bounds on one side. A natural workaround is to modify the set inverse \(Q_{ij}(y_{ij}, a_{ij}, \mathbf{z}_i)\) to give each slope both lower and upper bounds; in particular, we define

$$\begin{aligned} Q_{ij}^M(y_{ij}, a_{ij}, \mathbf{z}_i)=\{ {\varvec{\theta }}_j\in {\mathbb R}^{q_j}:\,&-M + {\varvec{\beta }}_j{}^\top \mathbf{z}_i\le a_{ij}\le \alpha _j + {\varvec{\beta }}_j{}^\top \mathbf{z}_i,\hbox { if }y_{ij}=1;\\&\alpha _j + {\varvec{\beta }}_j{}^\top \mathbf{z}_i < a_{ij}\le M + {\varvec{\beta }}_j{}^\top \mathbf{z}_i,\hbox { if }y_{ij}=0\}, \end{aligned}$$
(26)

which, for fixed \(a_{ij}\) and \(\mathbf{z}_i\), approaches Eq. 6 as M increases. Pilot studies suggest that replacing Eq. 6 by Eq. 26 with parameter bound \(M = 20\) in the construction of a fiducial distribution substantially alleviates the problems caused by heavy-tailedness. In practice, we do not expect item parameters to exceed this value either.

3 Simulation Study

We report next a comparative evaluation of fiducial and ML Wald-type interval estimators via Monte Carlo simulations. Nine-item tests (\(m=9\)) and two sample size conditions (\(n=100\) and 500) were considered. Under each condition, 500 data sets were simulated. Apart from the intercepts and slopes in the original parameterization of the model, three additional parameters are of interest to us. The item difficulty parameter,

$$\begin{aligned} \delta _j = -\alpha _j / \beta _j, \end{aligned}$$
(27)

gauges the latent variable level at which a correct response is produced with 50 % chance. The loading \(\lambda _j\) and threshold \(\tau _j\) are standardizations of slope and intercept, respectively:

$$\begin{aligned} \lambda _j = \frac{\beta _j/1.7}{\sqrt{1 + (\beta _j/1.7)^2}}, \end{aligned}$$
(28)

and

$$\begin{aligned} \tau _j = \delta _j\lambda _j. \end{aligned}$$
(29)

They are defined on a standardized scale pertaining to the notion of explained variance (communality), which is the preferred metric in the literature of item factor analysis (e.g., Wirth & Edwards, 2007). The true item parameters, tabulated in Table 1, were determined by two factors: (a) \(\lambda _j^2 = 0.1, 0.5, 0.9\), representing low, medium, and high communality, and (b) \(|\tau _j| = 0, 0.5, 1\), representing no, low, and high skewness.
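For reference, the following sketch (Python; the input values are placeholders) implements the transformations in Eqs. 27-29, mapping an item's intercept and slope to its difficulty, standardized loading, and threshold.

```python
import numpy as np

def transform(alpha_j, beta_j):
    """Difficulty, loading, and threshold (Eqs. 27-29) for a unidimensional item."""
    delta = -alpha_j / beta_j                  # difficulty (Eq. 27)
    b = beta_j / 1.7                           # 1.7 approximates the normal-ogive metric
    lam = b / np.sqrt(1.0 + b**2)              # standardized loading (Eq. 28)
    tau = delta * lam                          # threshold (Eq. 29)
    return delta, lam, tau

print(transform(alpha_j=0.5, beta_j=1.2))      # placeholder values
```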

Table 1 True item parameter values in the Monte Carlo simulation.

We implemented the previously discussed Gibbs sampler (Sect. 2.4) in Fortran. We set 0 as the starting value for intercepts, and 1 for slopes; \(\mathbf{z}^0\) were generated from the standard normal distribution, and \(\mathbf{a}^0\) were generated by running the Gibbs sampling step once, as described in the previous section. For each simulated dataset, we ran 60000 MCMC cycles, and discarded the first 10000 as burn-in to remove the influence of starting values; 5000 draws were then extracted by applying a thinning interval of 10, from which equi-tailed percentile CIs (FID) were obtained. The parameter bound M was set to 20.

The ML estimates of item parameters were found by the Bock–Aitkin EM algorithm using Mplus 7.0 (Muthén & Muthén, 1998–2012). The integral in the response pattern likelihood function (Eq. 10) was approximated using 49 equally spaced rectangular quadrature points from \(-5\) to 5. We adopted the software’s default convergence criteria, maximum number of iterations, and starting values. Two types of Wald CIs were computed from the two commonly used sample estimates of the Fisher information matrix: the Hessian form (MWH; in Mplus, estimator = ML) and the outer-product form (MWO; estimator = MLF). Delta-method standard errors were used for transformed parameters.
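As an illustration of how such a transformed-parameter Wald CI can be formed (Python; this is a generic delta-method sketch, not the exact software computation, and the estimate and standard error below are placeholders), the code applies the delta method to the loading transformation in Eq. 28.

```python
import numpy as np
from scipy.stats import norm

def loading_wald_ci(beta_hat, se_beta, level=0.95):
    """Delta-method Wald CI for lambda_j (Eq. 28) from an ML slope estimate and its SE."""
    b = beta_hat / 1.7
    lam = b / np.sqrt(1.0 + b**2)
    dlam_dbeta = (1.0 / 1.7) * (1.0 + b**2) ** (-1.5)    # derivative of Eq. 28
    se_lam = abs(dlam_dbeta) * se_beta                    # delta-method standard error
    z = norm.ppf(0.5 + level / 2.0)
    return lam - z * se_lam, lam + z * se_lam

print(loading_wald_ci(beta_hat=1.2, se_beta=0.25))        # placeholder inputs
```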

The empirical coverage and median length of CIs are two main criteria for comparison. Intervals having coverage probabilities greater than or equal to the nominal level (95 % in the current work) and short lengths are preferred. Whenever a trade-off between coverage and length is observed, we always prioritize coverage over length. The results are tabulated in Tables 2 and 3 for the two sample size conditions, respectively.

Table 2 Empirical coverage and median length of CIs (\(n = 100\)).
Table 3 Empirical coverage and median length of CIs (\(n = 500\)).

As expected, the difference among the three candidate CIs is more salient in the small-sample condition (\(n = 100\)); in large samples (e.g., \(n = 500\) in the current study), the three methods are more comparable in accordance with the asymptotic theory. Hence, we only discuss the results for \(n = 100\) (Table 2) here.

For the original parameterization, MWO and FID always exhibit well-calibrated coverage, with FID being uniformly more efficient (i.e., having shorter lengths) than MWO across all items. In contrast, MWH significantly under-covers for large slopes (items 7–9), and skewed intercepts when combined with large slopes (items 8 and 9); what is worse, MWH also tends to be much wider than FID for those parameters. For low and medium communality items (items 1–6), however, MWH is the most reliable and efficient choice for slope and intercept parameters, trailed by FID with slightly less desirable lengths.

The coverage of MWH for loading parameters decreases substantially as the true value increases; for high communality items (items 7–9), its empirical coverage can drop below 80 %. This may be construed as a failure of the normal approximation when the true parameters are close to the boundary (here, 1 is the upper bound for the loading parameter), due to a skewed sampling distribution of the ML estimate. FID, in contrast, maintains well-controlled coverage with lengths comparable to MWH on average; moreover, for large loading parameters (items 7–9), FID achieves the highest empirical coverage with the shortest median length. For threshold parameters, all three candidate methods show acceptable coverage; MWO is less favorable than MWH and FID, because it always yields wider intervals.

Both MWH and MWO are subject to insufficient coverage for non-zero difficulty parameters in low communality items (items 2 and 3). When a small slope co-occurs with a somewhat large intercept, the difficulty parameter tends to be extreme, which may lead to a non-normal sampling distribution of the ML estimate and, consequently, the poor performance of normal-approximation intervals. Meanwhile, the coverage of FID is not affected by extreme difficulty values for low communality items, at the cost of excessive lengths (for item 3, FID is almost 5 times as wide as MWH).

In summary, FID, although not always the most efficient interval estimator, is always reliable in terms of coverage for all five parameterizations. The gold-standard method MWH is liberal when the ML estimates have non-normal sampling distributions, which is likely to happen for extreme parameters in small samples. The alternative MWO approach often yields conservative intervals that are adequate in coverage but typically wider than the corresponding FID and MWH ones.

4 Empirical Example

In this section, we apply the proposed GFI to an exploratory item factor analysis (EIFA) problem. The dataset being analyzed is the UK female normative sample data of the revised Eysenck Personality Questionnaire (EPQ-R; Eysenck, Eysenck, & Barrett, 1985). We are grateful to Dr. Paul Barrett for granting us access to the data. This questionnaire was originally designed to measure three dimensions of individual differences: extraversion (E), neuroticism (N), and psychoticism (P). In this analysis, we use only the 12 short-form items from each subscale, so there are 36 items (\(m=36\)) in total. The sample size is \(n=824\), after listwise deletion of incomplete cases.

In EIFA, substantive researchers are more interested in the multi-factor structure of the scale and in how strongly each test item is associated with each factor. In this sense, the standardized loading-threshold parameterization is more helpful, because it is on a scale that eases the computation of the variance/covariance of the test items explained by the factors. In addition, analytic rotations of factor loadings (see Browne, 2001, for a review) are often applied to obtain more interpretable patterns of item-factor dependency. The goal of this analysis is to obtain CIs for rotated factor loadings and the inter-factor correlations.

An r-dimensional (\(r > 1\)) EIFA model can be parameterized by Eq. 4: For each of the first \(r - 1\) items, indexed by \(j = 1,\ldots ,r - 1\), the last j slopes are fixed to 0; for the remaining items, all slopes are freely estimated. Then, unrotated factor loadings for each item were computed as a non-linear transformation of the slopes:

$$\begin{aligned} {\varvec{\lambda }}_j = \frac{ {\varvec{\beta }}_j/1.7}{\sqrt{1 + {\varvec{\beta }}_j{}^\top {\varvec{\beta }}_j/1.7^2}}, \end{aligned}$$
(30)

which is a generalization of Eq. 28. The Crawford–Ferguson Quartimax criterion (Crawford & Ferguson, 1970) was minimized to obtain rotated factor loadings and inter-factor correlations, which leads to an implicit non-linear transformation of the unrotated loadings that does not have a closed-form expression. Implicit differentiation is required to compute the Delta-method standard errors for the rotated ML solutions, which has been described by Jennrich (1973).

Using GFI, however, we can easily approximate the fiducial distribution of rotated solutions by applying the transformation given by Eq. 30 and then the rotation routine to each Monte Carlo sample from the marginal fiducial distribution of slopes. We tuned the Gibbs sampler in the same way as in the simulation study, except that the weighted least squares solution (estimator = WLSMV) produced by Mplus 7.0 (Muthén & Muthén, 1998–2012) and the corresponding factor score estimates were used as starting values \({\varvec{\theta }}^0\) and \(\mathbf{z}^0\), respectively, in order to accelerate the convergence of the generated Markov chain. The R package GPArotation (Bernaards & Jennrich, 2005) was used to perform analytic rotation. Note that the directions and the order of factors are not identified for the rotated solutions; a matching procedure was applied to establish a uniform orientation across all MCMC iterations, similar to that described by Asparouhov and Muthén (2012) in the context of Bayesian EIFA.

We fitted three-, four-, and five-factor EIFA models to the EPQ-R data; for succinctness, only the five-factor solution is reported here. The high dimensionality and the relatively small sample size render likelihood-based estimation and inference very challenging; with GFI, however, we are still able to obtain substantively meaningful results. Fiducial medians and 95 % equi-tailed percentile CIs for rotated loadings and inter-factor correlations are shown in Figures 3 and 4, respectively.

When multiple CIs for selected parameters are reported, Benjamini and Yekutieli (2005) recommended a general procedure that controls the false coverage-statement rate (FCR), i.e., the expected proportion of the selected parameters not covered by the constructed CIs. Here, we are interested in testing whether the rotated loadings and inter-factor correlations are 0. Thus, we computed for each parameter the empirical two-sided p-value for the corresponding test, selected R significant parameters by the Benjamini–Hochberg step-up procedure (Benjamini & Hochberg, 1995) at nominal level 0.05, and then constructed \(100(1 - 0.05R/\tilde{q} )\,\%\) CIs for all loading and correlation parameters, in which \(\tilde{q} = rm + r(r - 1)/2 = 190\) is the total number of parameters being tested for the five-factor EIFA.
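A sketch of this adjustment follows (Python; the p-values are random placeholders rather than the EPQ-R results). The Benjamini-Hochberg step-up rule selects R parameters at level 0.05, and all reported CIs are then widened to the \(100(1 - 0.05R/\tilde{q})\,\%\) level.

```python
import numpy as np

def fcr_adjusted_level(pvals, level=0.05):
    """Benjamini-Hochberg selection count R and the FCR-adjusted CI level."""
    p = np.sort(np.asarray(pvals))
    q_tilde = p.size                                     # total number of tests
    thresholds = level * np.arange(1, q_tilde + 1) / q_tilde
    below = np.nonzero(p <= thresholds)[0]
    R = 0 if below.size == 0 else below[-1] + 1          # BH step-up: number selected
    return R, 1.0 - level * R / q_tilde                  # adjusted confidence level

rng = np.random.default_rng(7)
pvals = rng.uniform(size=190)                            # q~ = 190 placeholder p-values
print(fcr_adjusted_level(pvals))
```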

Fig. 3

Point estimates and 95 % CIs for rotated factor loadings in the five-factor model. The tabular layout has a row for each item; item stems are listed in the leftmost column, and the following five columns correspond to the five factors. Within each cell, the estimated fiducial density is shown in the background; superimposed are the fiducial medians (shown as dots) and the 95 % fiducial equi-tailed percentile CIs. The 0 point on the factor loading scale is highlighted by the vertical dashed lines. For each parameter, the empirical coverage frequency in the bootstrap simulation is also included.

Fig. 4

Point estimates and 95 % CIs for inter-factor correlations in the five-factor model. The tabular layout resembles a correlation matrix. Within each cell, the estimated fiducial density is shown in the background; superimposed are the fiducial medians (shown as dots) and the 95 % fiducial equi-tailed percentile CIs. The 0 point on the correlation scale is highlighted by the vertical dashed lines. For each parameter, the empirical coverage frequency in the bootstrap simulation is also included.

The psychoticism items dominate the first factor in the five-factor EIFA. Factors 2 and 3 yield a further decomposition of the extraversion subscale. The separation of the two factors is driven by two locally dependent pairs of items (e.g., Liu & Thissen, 2012): Factor 2 is led by two “party” items, i.e., “Can you easily get some life into a rather dull party? (E51)” and “Can you get a party going? (E78)”; the two items loading highest on factor 3 are “Are you a talkative person? (E6)” and “Are you mostly quiet when you are with other people? (E47)”, both related to loquaciousness. The remaining extraversion items are moderately cross-loaded on both factors. The correlation between factors 2 and 3 is about 0.5, the highest among all pairs of factors. Meanwhile, the neuroticism items are split into halves (factors 4 and 5). After examining the item stems, we conclude that factor 5 is mainly indicated by the mood-related items in the neuroticism subscale, e.g., “Does your mood often go up and down? (N3)” and “Do you often feel ‘fed-up’? (N26)”. Factor 4, on the other hand, is defined by the items related to worrying and nerves. In addition, extraversion (factors 2 and 3) is nearly uncorrelated with the emotion-related neuroticism factor (factor 5), but negatively correlated with the emotion-free one (factor 4).

To assess the quality of the fiducial solution, we conducted a 100-replication bootstrap simulation: Data sets were generated from a five-dimensional model using the point estimates of the rotated factor loadings and inter-factor correlations as the true values. The empirical coverage, tallied across the 100 resamples, is also included in Figures 3 and 4. For almost all loading and correlation parameters, the empirical coverage of the fiducial percentile interval is close to the nominal level (coverage frequency \(>\)90 out of 100). For some moderately high loadings and the largest inter-factor correlation, however, the fiducial interval is too liberal. We conclude that the fiducial intervals obtained in the current example can in general be trusted, but further investigation of the problematic cases is needed to better understand the behavior of GFI in EIFA.
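The coverage bookkeeping in this bootstrap check is straightforward; the following sketch assumes each replication has already been refit and its 95 % fiducial percentile CIs collected, with array names of our own choosing.

```python
import numpy as np

def coverage_frequency(true_values, ci_per_replication):
    """true_values: (q,) generating parameter values for the bootstrap data sets;
    ci_per_replication: list of (q, 2) arrays of lower/upper CI endpoints,
    one array per bootstrap replication.  Returns the per-parameter count of
    replications whose CI covers the generating value."""
    true_values = np.asarray(true_values, dtype=float)
    hits = np.zeros(true_values.shape[0])
    for ci in ci_per_replication:
        hits += (ci[:, 0] <= true_values) & (true_values <= ci[:, 1])
    return hits  # e.g., coverage frequency out of 100 replications
```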

5 Discussion and Conclusion

In the current research, GFI is employed to address interval estimation for a family of binary logistic IRT models. We derive a fiducial distribution for the item parameters, prove a Bernstein–von Mises theorem analogous to the well-known version for Bayesian posteriors, and implement an efficient MCMC sampler to fit the model. The simulation study shows that the fiducial percentile CI outperforms the commonly used ML Wald-type CI when the sample size is small and the generating parameters are extreme. In addition, as shown in the EIFA example, GFI offers great flexibility and reliable performance when interval estimation is desired for complex transformations of parameters. All these render GFI a promising statistical tool that caters to the growing use of item response models in psychological and educational testing.

As pointed out by a referee, good coverage coupled with short CIs often translates into small mean squared error of the corresponding point estimates; in this regard, pilot simulations showed that the fiducial median can be less biased and less variable than the ML estimate when the sample size is small, in line with the empirical coverage and length results reported in Sect. 3. However, because such improvements are often dwarfed by the sampling variability itself, we strongly recommend relying on CIs, rather than point estimates, when interpreting model parameters in small-sample calibrations. Even in large-scale educational testing, the usefulness of CIs is likely underestimated. Operational researchers in educational assessment programs tend to pay heed only to point estimates, because their pool of respondents is often large. Indeed, when the sample size is large enough, the ML, fiducial, and Bayesian solutions should differ little because of their asymptotic equivalence. What is often overlooked, however, is the trade-off between sample size and model complexity in determining the sampling variability of the point estimates, the extent of which is largely unknown in practice until CIs are calculated. Therefore, we believe that methods producing high-quality CIs, such as GFI, deserve more attention than they currently receive.

There are limitations and extensions of the current study that remain to be addressed by future research.

First, the Bernstein–von Mises theorem (Theorem 1) is established only for an approximation of the fiducial distribution, which is a limitation of the current work. Although similar constructions based on the empirical Bayesian approximation have been considered “fiducial” by some authors (e.g., Hannig, 2009), the exact relation of this approximation to GFI is yet to be demonstrated. In addition, an extension of Theorem 2 to multidimensional models (\(r > 1\)) should be pursued. As the latent dimensionality increases, we can no longer enumerate all possible cases as we do in the last part of the unidimensional proof (see Appendix 3); more intricate arguments involving high-dimensional Euclidean geometry are expected to replace the current case-enumerating one.

Second, the current study is more theoretically oriented, and the simulation study serves more as an illustration than a thorough evaluation of the proposed GFI. Carefully designed large-scale simulation studies should be conducted to evaluate all existing frequentist and Bayesian inference methods in the context of more involved multidimensional IRT models. It is of particular interest to compare GFI with stochastic variants of the EM algorithm (e.g., Cai, 2010a, b) for ML estimation, and with “less informative” Bayesian methods using flat priors. Apart from standard criteria of parameter recovery, practical matters, such as computational time and the convergence of the Markov chain, should be compared as well.

Third, the reliable performance of GFI observed under the combination of a small sample size and extreme parameter values leads us to speculate that GFI can, in general, handle nearly unidentified models properly. For example, as brought up by a referee, the guessing parameter in a three-parameter logistic (3PL) model (Birnbaum, 1968) is difficult to estimate. In such cases, the log-likelihood function is flat, and finding its mode can be challenging. Adding a prior distribution improves the numerical conditioning; however, because the likelihood contributes very little information, the performance of a Bayesian estimator hinges almost entirely on the quality of the prior. Meanwhile, GFI may produce wide CIs when the data indeed carry barely any information, yet those wide CIs at least have trustworthy coverage, from which sound statistical inferences can be made.

Finally, the usefulness of GFI for other inferential purposes, such as goodness-of-fit testing and test scoring, should be explored. It has been shown that the quality of the asymptotic covariance matrix estimate plays an important role in determining the performance of various quadratic-form goodness-of-fit statistics (e.g., Cai, 2008; Liu & Maydeu-Olivares, 2013). Suitable covariance estimates derived from the fiducial distribution are natural candidates for this purpose, and their performance should be examined and compared with existing approaches such as the inverse outer-product/Hessian information estimators. Alternatively, a fiducial analogue of Bayesian posterior predictive checks (Rubin, 1984) can be easily programmed; its theoretical justification and empirical evaluation can be pursued as a distinct line of research to validate its use. As for test scoring, a Monte Carlo sample from the marginal distribution of \(\mathbf{Z}^\star \) given \(Q(\mathbf{y}, \mathbf{A}^\star , \mathbf{Z}^\star )\ne \emptyset \), an incidental by-product of the sampling algorithm, can be used to score the latent traits for each observation. Consistency of the individual latent score estimates, in some suitable sense, is anticipated.
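To illustrate the predictive-check idea mentioned above, the sketch below assumes a hypothetical generator `simulate_responses(theta, n)` that produces an \(n \times m\) replicate binary data set from one fiducial parameter draw, and uses the item endorsement proportions as the discrepancy; both are our own illustrative choices, not part of the proposed methodology.

```python
import numpy as np

def fiducial_predictive_pvalues(y, theta_draws, simulate_responses):
    """Rough fiducial analogue of a posterior predictive check.
    y: observed n x m binary response matrix;
    theta_draws: iterable of fiducial parameter draws;
    simulate_responses: hypothetical function mapping (theta, n) to replicate data.
    Returns one predictive p-value per item based on endorsement proportions."""
    n, m = y.shape
    t_obs = y.mean(axis=0)                        # observed item proportions
    draws = list(theta_draws)
    exceed = np.zeros(m)
    for theta in draws:
        y_rep = simulate_responses(theta, n)      # replicate data for this draw
        exceed += (y_rep.mean(axis=0) >= t_obs)   # tail indicator per item
    return exceed / len(draws)
```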