1 Introduction

A major topic of study in educational and psychological testing is measurement invariance, with violation of this assumption being called differential item functioning (DIF) in the item response literature (see, for example, Millsap, 2012, for a review). If a set of items violates measurement invariance, then individuals with the same ability (“amount” of the latent variable) may systematically receive different scale scores. This is problematic because researchers might conclude that groups differ in ability when, in reality, the differences arise from unfair items.

We can formally define measurement invariance in a general fashion via (Mellenbergh, 1989):

$$\begin{aligned} f(\varvec{y}_i|v_i,\varvec{\theta }_i)=f(\varvec{y}_i|\varvec{\theta }_i), \end{aligned}$$
(1)

where \(\varvec{y}_i\) is a vector of observed variables for individual i; \(\varvec{\theta }_i\) is the latent variable vector for individual i, which can be viewed as a random variable generated from a normal or multivariate normal distribution; \(v_i \in V\), where V is an auxiliary variable (such as age, gender, or ethnicity) against which we are testing measurement invariance; and \(f(\cdot )\) is an assumed parametric distribution. In applying the measurement invariance definition to a parametric item response theory (IRT) framework, Eq. (1) states that the relationship between the latent construct (ability) \(\varvec{\theta }_i\) and response \(\varvec{y}_i\) (binary or ordinal) holds regardless of the value of V.

Under this definition, many procedures have been proposed to assess measurement invariance/DIF in IRT models, including the Mantel–Haenszel statistic (Holland & Thayer, 1988), Raju’s area approach (Raju, 1988), logistic regression methods (Swaminathan & Rogers, 1990; Van den Noortgate & De Boeck, 2005), Lord’s Wald test (Lord, 1980), and the likelihood ratio test (Thissen, Steinberg, & Wainer, 1988). Overviews can be found in Millsap and Everson (1993), Osterlind and Everson (2009), Magis, Béland, Tuerlinckx, and De Boeck (2010), and Glas (2015). These methods generally focus on detecting the presence or absence of DIF. When a measurement invariance violation is detected, however, researchers are typically interested in “locating” the violation. As Millsap (2005) stated, locating the invariance violation is one of the major outstanding problems in the field. This locating problem can be divided into two aspects. One is to locate which item parameter violates the measurement invariance assumption. The other is to locate the point/level of the auxiliary variable (V) at which the violation occurs. Unfortunately, this second aspect is often ignored because previous procedures require us to pre-define the reference and focal groups (based on V).

Beyond the approaches described in the previous paragraph, Glas and colleagues have done seminal work applying the Lagrange multiplier test (see also Satorra, 1989) to item response models, focusing on situations where pre-defined group information is available (i.e., V is treated as a categorical variable). Their work has included the traditional two-parameter (Glas, 1998), three-parameter (Glas & Falcón, 2003), and nominal response (Glas, 1999, 2010) models, with applications including computerized adaptive testing (Glas, 2009), country-specific DIF (Glas & Jehangir, 2014), and models of response time (Glas & Linden, 2010). The main estimation framework in this line of research is marginal maximum likelihood (Glas, 2009; Bock & Schilling, 1997; Schilling & Bock, 2005), which is generally the most popular estimation approach in IRT applications.

A more general family of score-based or Lagrange multiplier tests has been recently proposed to address “locating” issues in factor models for continuous response data (Merkle & Zeileis, 2013; Merkle, Fan, & Zeileis, 2014; T. Wang, Merkle, & Zeileis, 2014), where the auxiliary variable V can be continuous, ordinal, or categorical. Additionally, Strobl, Kopf, and Zeileis  (2015) applied related tests to Rasch models estimated via conditional ML in order to identify the violating point along a categorical or continuous auxiliary variable. Moreover, Strobl et al.  (2015) applied the tests recursively to multiple auxiliary variables via a “Rasch trees” approach, highlighting the fact that the groups tested for DIF need not be specified in advance and can even be formed by interactions of several auxiliary variables. Unfortunately, the conditional ML framework is only applicable to models of the Rasch family. Penalized maximum likelihood has also been recently proposed to detect DIF (Tutz & Schauberger, 2015), but the work has also been confined to the Rasch model.

In this paper, we extend the score-based tests to more general IRT models in a unified way, using both pairwise and marginal maximum likelihood estimation. We focus on identifying problematic item parameters without pre-specifying reference and focal groups. This approach allows us to (1) detect DIF in various IRT models without additional computational burden and (2) detect DIF against ordinal auxiliary variables like socioeconomic status and age group, whose ordinal nature is often ignored in the IRT literature. We first describe the two-parameter IRT model and its relationship to factor analysis, along with the score-based tests’ application to IRT via pairwise maximum likelihood estimation. Next, we report on the results of two simulation studies designed to examine the tests’ ability to locate problematic item parameters while simultaneously handling the issue of person impact. Next, we apply the tests to real data, studying the measurement invariance of a mathematics achievement test with respect to socioeconomic status. Finally, we discuss test extensions and further IRT applications.

2 Model

In this study, we focus on binary data \(y_{ij}\), where i represents individuals (\(i \in \{1, \ldots , n\}\)) and j represents items (\(j \in \{1, \ldots , p\}\)). There are two related approaches in the social science literature for analyzing these data: IRT and factor analysis. A two-parameter IRT model can be written as

$$\begin{aligned} y_{ij} \sim \text {Bernoulli}(p_{ij}), \end{aligned}$$
(2)
$$\begin{aligned} \text {logit}(p_{ij}) = \alpha _{j}\theta _{i}+\gamma _{j}, \end{aligned}$$
(3)
$$\begin{aligned} \theta _i \sim N(\mu , \sigma ^2), \end{aligned}$$
(4)

where Eq. (2) states that each person’s response to each item (\(y_{ij}\)) arises from a Bernoulli distribution with parameter \(p_{ij}\). Then Eq. (3) transforms \(p_{ij}\) to \(\text {logit}(p_{ij})=\log (\frac{p_{ij}}{1-p_{ij}})\), which is a linear function of the person’s ability \(\theta _{i}\) and the item parameters \(\gamma _{j}\) and \(\alpha _{j}\). The alternative parameterization, \(\alpha _{j} (\theta _{i} - \gamma _{j})\), could also be used here. Finally, person ability \(\theta _i\) is described by hyperparameters \(\mu \) and \(\sigma ^2\), with these parameters commonly being fixed to 0 and 1, respectively, for identification. Instead of using the logit as the link function in Eq. (3), we can alternatively use the inverse cumulative distribution function of the standard normal distribution \(\Phi ^{-1}(\cdot )\) (the probit link function). In this case, Eq. (3) could be written as \(p_{ij} = \Phi (\alpha _{j}\theta _{i}+\gamma _{j})\).
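
To make the model concrete, here is a minimal sketch of generating responses from Eqs. (2)–(4) in R; the sample size, item parameter values, and object names are illustrative only.

```r
set.seed(1)
n <- 480; p <- 5
alpha <- c(1.0, 1.2, 0.8, 1.5, 1.1)    # item slopes (discrimination-related)
gamma <- c(0.0, -0.5, 0.5, -1.0, 1.0)  # item intercepts (difficulty-related)
theta <- rnorm(n)                      # person abilities, Eq. (4) with mu = 0, sigma^2 = 1
eta   <- outer(theta, alpha) + matrix(gamma, n, p, byrow = TRUE)  # Eq. (3)
pij   <- plogis(eta)                   # inverse logit
y     <- matrix(rbinom(n * p, 1, pij), n, p)                      # Eq. (2)
colnames(y) <- paste0("y", 1:p)
```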

The use of the probit link function in the above model is equivalent to placing a factor analysis model on latent continuous variables \(\varvec{y}^\star \) (Takane & de Leeuw, 1987). In particular,

$$\begin{aligned} \varvec{y}_{i}^{\star } = \varvec{\Lambda }\theta _{i} + \varvec{\epsilon }, \end{aligned}$$
(5)

where \(\varvec{\Lambda }\) is a \(p \times 1\) factor loading vector, with components \(\lambda _{1}, \ldots , \lambda _{p}\); \(\theta _{i} \sim N(0,1)\); and \(\varvec{\epsilon }\) is an error term, which follows the distribution \(N(\varvec{0}, \varvec{\Psi })\). The matrix \(\varvec{\Psi }\) is diagonal and defined as \(\varvec{I}-\text {diag}(\varvec{\Lambda }\varvec{\Lambda }^{\prime })\). The continuous response vector \(\varvec{y}_{i}^{\star }\) is composed of the \(y_{ij}^{\star }\) (\(j=1,\ldots , p\)), with the observed binary data being obtained via

$$\begin{aligned} y_{ij}= \begin{cases} 1 &{} y_{ij}^{\star } \ge \tau _{j}\\ 0 &{} y_{ij}^{\star } < \tau _{j}. \end{cases} \end{aligned}$$
(6)

Therefore, we can see that \(\lambda _{j}\) is similar to \(\alpha _{j}\) in Eq. (3); they are both attached to the ability variable \(\theta _{i}\). The error term \(\varvec{\epsilon }\) is related to the probit link function that could be used in Eq. (3). Finally, the threshold \(\tau _{j}\) corresponds to \(\gamma _{j}\), which is related to item j’s difficulty.
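
The parameter mappings implied by this equivalence can be written down directly. The following sketch converts between the probit two-parameter IRT parameterization and the factor-analytic one under the unit-variance conventions above (Takane & de Leeuw, 1987); the function names are ours.

```r
## Probit 2PL: P(y = 1 | theta) = Phi(alpha * theta + gamma)
## FA: y* = lambda * theta + eps, eps ~ N(0, 1 - lambda^2), y = 1 iff y* >= tau
irt2fa <- function(alpha, gamma) {
  list(lambda = alpha / sqrt(1 + alpha^2),   # factor loading
       tau    = -gamma / sqrt(1 + alpha^2))  # threshold
}
fa2irt <- function(lambda, tau) {
  list(alpha = lambda / sqrt(1 - lambda^2),  # slope
       gamma = -tau / sqrt(1 - lambda^2))    # intercept
}
```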

No matter which link function is used, however, estimation of the two-parameter IRT model is not straightforward. The difficulty is caused by the person parameters \(\theta _i\), which we generally avoid estimating (either by conditioning on them or by integrating them out). Estimation methods that address this difficulty include conditional maximum likelihood (CML; e.g., Fischer & Molenaar, 2012; De Ayala, 2009), marginal maximum likelihood (MML; e.g., Thissen, 1982), and pairwise maximum likelihood (PML; e.g., Katsikatsou, Moustaki, Yang-Wallentin, & Jöreskog, 2012). We briefly describe PML below, which is the main focus of this paper (though we also consider MML).

3 Estimation

If we employ the factor analysis version of the model, the log-likelihood of individual i’s observed data \(\varvec{y}_i\), given the parameter vector \(\varvec{\eta }\) (including \(\varvec{\Lambda }\) and \(\varvec{\tau }\)), involves the integral

$$\begin{aligned} \ell (\varvec{\eta }; \varvec{y}_i) = \log \int _{\varvec{\tau }} f(\varvec{y}_{i}^{\star } | \varvec{\eta })\, d \varvec{y}_{i}^{\star }, \end{aligned}$$
(7)

where \(\varvec{y}_i^{\star }\) is described by Eq. (5), and the distribution of \(\varvec{y}_i^{\star }\) with \(\theta _i\) marginalized out is denoted \(f(\varvec{y}_i^{\star } | \varvec{\eta })\); this p-dimensional distribution is multivariate normal, \(N(\varvec{0}, \varvec{\Lambda }\varvec{\Lambda }^{T}+\varvec{\Psi }) \). The integration of this p-dimensional multivariate normal distribution over the region defined by \(\varvec{\tau }\) is the difficult part, which does not have a closed form.

Katsikatsou et al. (2012) proposed that the likelihood function above can be approximated by the pairwise log-likelihood:

$$\begin{aligned} p\ell (\varvec{\eta };\varvec{y}_i) = \left\{ \sum _{j<k}\ell (\varvec{\eta };(y_{ij}, y_{ik}))\right\} , \end{aligned}$$
(8)
$$\begin{aligned} p\ell (\varvec{\eta };\varvec{y}_i) = \left\{ \sum _{j<k} \left( \sum _{c_{j}=1}^{2}\sum _{c_{k}=1}^{2}\log \pi _{y_{ij}y_{ik}}^{(c_{j}c_{k})}(\varvec{\eta })\right) \right\} , \end{aligned}$$
(9)

where \(\sum _{j<k}\ell (\varvec{\eta };(y_{ij},y_{ik}))\) is the log-likelihood associated with all pairs of items, corresponding to a series of two-way contingency tables; \(\pi _{y_{ij}y_{ik}}^{(c_{j}c_{k})}(\varvec{\eta })\) is the probability that individual i responds to items j and k with categories \(c_{j}\) (\(c_{j} = 1, 2\)) and \(c_{k}\) (\(c_{k} = 1, 2\)) under the model, which is expressed as a function of pairwise integrals. See Katsikatsou et al. (2012) for the explicit expression of \(\pi _{y_{ij}y_{ik}}^{(c_{j}c_{k})}(\varvec{\eta })\), and note that the categories 1, 2 above represent responses of “0”, “1”, respectively, in Eq. (6).

Comparing Eq. (7) with Eq. (8), we can see that the p-dimensional integral is reduced to all possible pairwise (\(j < k\)) integrals, which involve only bivariate normal distributions and can be evaluated quickly and accurately. This significantly reduces the computational complexity, which is a major advantage of PML.
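
As an illustration of the quantities entering Eq. (9), the sketch below computes the four pairwise probabilities \(\pi _{y_{ij}y_{ik}}^{(c_{j}c_{k})}(\varvec{\eta })\) for a single item pair from the bivariate normal distribution of \((y_{ij}^{\star }, y_{ik}^{\star })\), whose correlation is \(\lambda _{j}\lambda _{k}\) under Eq. (5). We assume the mvtnorm package for the bivariate normal probabilities; the function name is ours.

```r
library("mvtnorm")

pair_probs <- function(lambda_j, lambda_k, tau_j, tau_k) {
  ## correlation of the underlying continuous responses y*_j and y*_k
  R <- matrix(c(1, lambda_j * lambda_k,
                lambda_j * lambda_k, 1), 2, 2)
  p11 <- pmvnorm(lower = c(tau_j, tau_k), upper = c(Inf, Inf), corr = R)
  p10 <- pmvnorm(lower = c(tau_j, -Inf), upper = c(Inf, tau_k), corr = R)
  p01 <- pmvnorm(lower = c(-Inf, tau_k), upper = c(tau_j, Inf), corr = R)
  c("11" = p11, "10" = p10, "01" = p01, "00" = 1 - p11 - p10 - p01)
}

pair_probs(0.7, 0.6, -0.2, 0.4)  # the four probabilities sum to 1
```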

4 Maximizing the Likelihood Function

The model’s log-likelihood function can be written as the sum of individual log-likelihoods

$$\begin{aligned} \ell (\varvec{\eta }; \varvec{y}_1, \ldots , \varvec{y}_n)= \sum _{i=1}^{n}\log f(\varvec{y}_i|\varvec{\eta }), \end{aligned}$$
(10)

where the length of the parameter vector \(\varvec{\eta }\) is q.

Maximizing the model’s log-likelihood function is equivalent to solving the first-order conditions

$$\begin{aligned} \displaystyle \sum _{i=1}^{n} \varvec{s}(\hat{\varvec{\eta }} ; \varvec{y}_{i}) = {\varvec{0}}, \end{aligned}$$
(11)

where

$$\begin{aligned} \hat{\varvec{\eta }} = \mathop {\hbox {argmax}}\limits _{\varvec{\eta }} \ell (\varvec{\eta }; \varvec{y}_{1}, \varvec{y}_{2}, \ldots , \varvec{y}_{n}) \end{aligned}$$
(12)

and

$$\begin{aligned} \varvec{s}(\hat{\varvec{\eta }}; \varvec{y}_{i}) = \frac{\partial \ell (\varvec{\eta }; \varvec{y}_{i})}{\partial \varvec{\eta }} \bigg |_{\varvec{\eta } = \hat{\varvec{\eta }}} \end{aligned}$$
(13)
$$\begin{aligned} \varvec{s}(\hat{\varvec{\eta }}; \varvec{y}_{i}) = \left( \frac{\partial \ell (\varvec{\eta };\varvec{y}_{i})}{\partial \eta _{1}}, \ldots , \frac{\partial \ell (\varvec{\eta };\varvec{y}_i)}{\partial \eta _{q}} \right) \bigg |_{\varvec{\eta } = \hat{\varvec{\eta }}}. \end{aligned}$$
(14)

For the two-parameter IRT model analyzed in this paper, the log-likelihood function, and consequently the individual score function, differs depending on the estimation method. We provide some detail on the PML score function below, with the MML score function being detailed in, for example, Glas (1998).

Maximizing the log-likelihood function in Eq. (8) over the parameter \(\varvec{\eta }\), we obtain the composite pairwise maximum likelihood estimator \(\hat{\varvec{\eta }}_{\text {PML}}\). Again, this is equivalent to solving for \(\varvec{\eta }\) so that the sum of scores equals zero. The score vector of the pairwise likelihood for each individual can be decomposed into two blocks: the first derivatives with respect to the factor loadings \(\varvec{\Lambda }\) and the first derivatives with respect to the thresholds \(\varvec{\tau }\):

$$\begin{aligned} \displaystyle \varvec{s} (\varvec{\eta }; \varvec{y}_{i}) = \left( \frac{\partial \left\{ \sum \limits _{j<k} \ell (\varvec{\eta }; (y_{ij},y_{ik}))\right\} }{\partial \varvec{\Lambda }}, \frac{\partial \left\{ \sum \limits _{j<k} \ell (\varvec{\eta }; (y_{ij},y_{ik}))\right\} }{\partial \varvec{\tau }} \right) . \end{aligned}$$
(15)

The elements of the score matrix have analytic solutions, requiring no approximation via quadrature. The derivatives are given explicitly in the Appendix of Katsikatsou et al. (2012).

PML estimates are easier to obtain and less computationally intensive than those from traditional maximum likelihood estimation, e.g., marginal maximum likelihood (MML) estimation, which often involves quadrature or adaptive quadrature to approximate integrals (Schilling & Bock, 2005; Katsikatsou et al., 2012). Further, Katsikatsou and Moustaki (2016) have recently derived likelihood ratio tests for the PML framework, which in turn lead to expressions for pairwise AIC and BIC. Thus, we focus on PML in the simulations and analyses below, with similar results holding for MML as demonstrated in supplementary material. In the next section, we describe the scores’ use in tests of measurement invariance.

5 Score-Based Tests of Measurement Invariance

Measurement invariance is usually studied in a hypothesis testing framework. We can write the hypothesis very generally by assuming a potential observation-specific parameter vector \(\varvec{\eta }_{i}\). The null hypothesis of measurement invariance can then be expressed as all observations arising from a common set of population parameters \(\varvec{\eta }_{0}\)

$$\begin{aligned} H_{0}: \varvec{\eta }_{i}=\varvec{\eta }_{0} \quad (i=1,\ldots ,n) \end{aligned}$$
(16)

versus

$$\begin{aligned} H_{1}: \varvec{\eta }_{i}=\varvec{\eta }(v_{i}) \quad (i=1,\ldots ,n), \end{aligned}$$
(17)

where \(\varvec{\eta }(v_{i})\) is typically an unknown function of \(v_{i}\). If the function is known, the alternative hypothesis can be expressed more specifically. For example, one function of particular interest involves V dividing individuals into two subgroups with different parameter vectors based on the cut point v:

$$\begin{aligned} H_{1}:\varvec{\eta }_{i}= \begin{cases} \varvec{\eta }^{(A)} &{} v_{i} \le v \\ \varvec{\eta }^{(B)} &{} v_{i} > v. \end{cases} \end{aligned}$$
(18)

For this hypothesis testing problem with known cut point v, the likelihood ratio test (LRT; Thissen et al., 1988) is most popular. The LRT compares two models, a full model and a reduced model. The full model is a multiple-group model with parameters free to vary across group A and group B, while the reduced model constrains some parameters to be equal across groups. The LRT statistic for cut point v can be expressed as

$$\begin{aligned} LR(v)=-2\left[ \ell \left( \hat{\varvec{\eta }};\varvec{y}_{1},\ldots ,\varvec{y}_{n}\right) -\left\{ \ell \left( \hat{\varvec{\eta }}^{(A)};\varvec{y}_{1},\ldots ,\varvec{y}_{m}\right) +\ell \left( \hat{\varvec{\eta }}^{(B)}; \varvec{y}_{m+1},\ldots ,\varvec{y}_{n}\right) \right\} \right] , \end{aligned}$$
(19)

where \(\ell \) represents the log-likelihood function, \(\hat{\varvec{\eta }}^{(A)}\) is the MLE of \(\varvec{\eta }^{(A)}\) based on \(\{\varvec{y}_{1}, \ldots , \varvec{y}_{m}\}\), for which \(v_{i} \le v\), and \(\hat{\varvec{\eta }}^{(B)}\) is the MLE of \(\varvec{\eta }^{(B)}\) based on \(\{\varvec{y}_{m+1}, \ldots , \varvec{y}_{n}\}\) for which \(v_{i} > v\). This LRT statistic has an asymptotic \(\chi ^{2}\) distribution with degrees of freedom equal to the number of parameters in \(\varvec{\eta }\).

When the grouping information is unknown, however, we can compute LR(v) for each possible value of V in some interval \([\underline{v}, \bar{v}]\) and take the maximum as the test statistic:

$$\begin{aligned} \max _{v \in [\underline{v}, \bar{v}]} LR(v). \end{aligned}$$
(20)

The asymptotic distribution of this maximum LR statistic is not \(\chi ^{2}\); Andrews (1993) showed that, under the null hypothesis in (16), the statistic converges in distribution to some stochastic process. This result is also utilized in the score-based tests discussed below.
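
To see why this approach is computationally demanding, note that each candidate cut point requires refitting the subgroup models. A rough sketch follows, where fit_2pl() and candidate_cuts are hypothetical placeholders for a model-fitting routine (returning a log-likelihood) and for the grid of cut points, respectively.

```r
ll0 <- fit_2pl(dat)$logLik  # null model fitted to the full sample
LRv <- sapply(candidate_cuts, function(v) {
  llA <- fit_2pl(dat[dat$V <= v, ])$logLik  # group A of Eq. (18)
  llB <- fit_2pl(dat[dat$V >  v, ])$logLik  # group B of Eq. (18)
  -2 * (ll0 - (llA + llB))                  # Eq. (19)
})
maxLR <- max(LRv)  # Eq. (20)
```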

5.1 Test Background

The score-based tests described here utilize the scores defined above, and they are based on theory showing that functions of the scores follow a stochastic process along an auxiliary variable V. Related descriptions can be found in Zeileis and Hornik  (2007), Merkle et al. (2014), and T. Wang et al. (2014).

We can build the following intuition for the tests. We examine individuals’ scores as we move from the smallest value of V to the largest. If there are no measurement invariance violations, the scores should fluctuate around zero. Conversely, the scores will deviate systematically from zero when measurement invariance is violated.

To obtain formal test statistics, we define the cumulative score as

$$\begin{aligned} {\varvec{B}}(t; \hat{\varvec{\eta }}) = \hat{\varvec{I}}^{-1/2} n^{-1/2} \displaystyle \sum _{i = 1}^{\lfloor n \cdot t \rfloor } {\varvec{s}} (\hat{\varvec{\eta }}; \varvec{y}_{(i)}) \qquad (0 \le t \le 1), \end{aligned}$$
(21)

where \(\varvec{y}_{(i)}\) represents the observed data vector for the ith ordered observation, with ordering determined by the auxiliary variable V; \(\hat{\varvec{I}}\) denotes some estimate of the covariance matrix of the scores, which serves to decorrelate the fluctuation processes associated with individual model parameters; \(\lfloor nt \rfloor \) is the integer part of nt (i.e., a floor operator); and \(0 \le t \le 1\). In a sample of size n, \({\varvec{B}}(t; \hat{\varvec{\eta }})\) changes at \(0, \frac{1}{n}, \frac{2}{n}, \ldots , \frac{n}{n}\). For \(t=1\) the cumulative score vector always equals \(\varvec{0}\), as implied by Eq. (11). We are specifically interested in how the cumulative score fluctuates as we move from \(t=0\) to \(t=1\).

Along with the score vectors, we need an estimate of the score covariance matrix, which is shown in Eq. (21) as \(\hat{\varvec{I}}\). For regular maximum likelihood estimation, the covariance matrix is equal to the information matrix. However, this identity does not hold for PML (Katsikatsou et al., 2012). Therefore, instead of the information matrix, we use an estimate based on the outer product of scores: \(\hat{\varvec{I}} = (1/n) \sum _{i=1}^{n} \varvec{s}(\hat{\varvec{\eta }}; \varvec{y}_{(i)}) \varvec{s}(\hat{\varvec{\eta }}; \varvec{y}_{(i)})^{T}\).
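
A minimal sketch of Eq. (21) with this outer-product covariance estimate, given an n × q matrix of casewise scores whose rows are already ordered by V (the function name is ours):

```r
cum_process <- function(sc) {
  n    <- nrow(sc)
  Ihat <- crossprod(sc) / n  # outer-product-of-scores estimate of I-hat
  U    <- chol(Ihat)         # Ihat = U'U, so right-multiplying by solve(U) decorrelates
  B    <- apply(sc, 2, cumsum) %*% solve(U) / sqrt(n)
  rbind(0, B)  # prepend t = 0; last row is 0 when sc is evaluated at the estimates, Eq. (11)
}
```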

Hjort and Koning  (2002) showed that, under the null hypothesis from (16), \({\varvec{B}}(t; \hat{\varvec{\eta }})\) converges in distribution to an independent Brownian bridge:

$$\begin{aligned} {\varvec{B}}(\cdot ; \hat{\varvec{\eta }}) ~\overset{d}{\rightarrow }~ {\varvec{B}}^{0}(\cdot ), \end{aligned}$$
(22)

where \({\varvec{B}}^{0}(\cdot )\) is a q-dimensional Brownian bridge and each column represents a unidimensional Brownian bridge associated with a single parameter.

Empirically, the \(\varvec{B}(t;\hat{\varvec{\eta }})\) process can be described by an \(n \times q\) matrix, with each column asymptotically following an independent Brownian bridge. Each row contains the ordered observations’ cumulative score vector up to that point, and the last row is zero, as implied by Eq. (11). To obtain scalar test statistics, we summarize the empirical behavior of Eq. (21) and compare it to the analogous scalar summary of the Brownian bridge. In the next section, we introduce various summaries of Eq. (21) that can serve as test statistics.

5.2 Test Statistics

After summarizing or aggregating the empirical cumulative score process via a scalar, the asymptotic distribution of the scalar can be obtained by applying the same summary to the asymptotic Brownian bridge. This allows us to obtain critical values and p values. Various statistics have been proposed, with selection of a statistic being based on the plausible patterns of potential measurement invariance violations.

The simplest aggregation strategy is to reject measurement invariance if the largest component of the empirical cumulative score matrix is greater than a critical value. Based on the location of the detected component, we can easily identify the violating parameter and the value of V at which the violation occurs. Because this statistic searches for the maximum over both the parameters (columns of the empirical cumulative score matrix) and the individuals (rows of the empirical cumulative score matrix), it is called the “double maximum” (DM):

$$\begin{aligned} DM = \max _{i = 1, \ldots , n} \max _{j = 1, \ldots , q} | {\varvec{B}}(\hat{{\varvec{\eta }}})_{ij} |. \end{aligned}$$
(23)

However, the DM statistic is suboptimal if many of the parameters change at the same value of V and/or there exist many (rather than only one) change points in V, because it “wastes” power by only considering the maximum. In such cases, sums across parameters and individuals are more suitable. The Cramér–von Mises (CvM) statistic falls in this category,

$$\begin{aligned} CvM = n^{-1} \sum _{i = 1, \ldots , n} \sum _{j = 1, \ldots , q} {\varvec{B}}(\hat{\varvec{\eta }})_{ij}^2. \end{aligned}$$
(24)

If we expect there to be only one change point, but that change point affects multiple parameters, we can aggregate by summing over parameters and then taking the maximum over individuals within an interval (scaled by the asymptotic variance). This statistic is equivalent to obtaining the maximum of Lagrange multiplier statistics, and it can be formally written as

$$\begin{aligned} \max LM = \max _{i=\underline{i}, \ldots , \bar{i}} \left\{ \frac{i}{n}\left( 1- \frac{i}{n}\right) \right\} ^{-1}\sum _{j=1,\ldots , q}\varvec{B}(\hat{\varvec{\eta }})_{ij}^{2}. \end{aligned}$$
(25)

Note that this statistic is asymptotically equivalent to the \(\max LR\) mentioned before, in the same way that the traditional likelihood ratio test is asymptotically equivalent to the traditional Lagrange multiplier test.
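
For concreteness, the three continuous-V statistics can be computed directly from the empirical process. The sketch below takes the matrix B returned by cum_process() above (dropping the leading zero row) and implements Eqs. (23)–(25); the trimming fraction for \(\underline{i}\) and \(\bar{i}\) is an assumption on our part.

```r
dm_stat  <- function(B) max(abs(B))         # DM, Eq. (23)
cvm_stat <- function(B) sum(B^2) / nrow(B)  # CvM, Eq. (24)
maxlm_stat <- function(B, trim = 0.1) {     # max LM, Eq. (25)
  n <- nrow(B)
  i <- seq(max(1, floor(trim * n)), min(n - 1, ceiling((1 - trim) * n)))
  max(rowSums(B[i, , drop = FALSE]^2) / ((i / n) * (1 - i / n)))
}
```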

Across the above statistics, the auxiliary variable V is assumed to be continuous. Merkle et al. (2014) introduced two modified statistics that could deal with ordinal V, which could include school grades or income levels. For an ordinal auxiliary variable with m levels, the modifications are based on \(t_{l}\) \((l=1,\ldots ,m-1)\), which are the empirical, cumulative proportions of individuals observed at the first \(m-1\) levels. The modified statistics are then given by

$$\begin{aligned} WDM_o= & {} \max _{i \in \{i_1, \ldots , i_{m-1} \}} ~ \left\{ \frac{i}{n} \left( 1 - \frac{i}{n} \right) \right\} ^{-1/2} \max _{j = 1, \ldots , q} | {\varvec{B}}(\hat{\varvec{\eta }})_{ij} |, \end{aligned}$$
(26)
$$\begin{aligned} \max LM_o= & {} \max _{i \in \{i_1, \ldots , i_{m-1} \}} ~ \left\{ \frac{i}{n} \left( 1 - \frac{i}{n} \right) \right\} ^{-1} \sum _{j = 1, \ldots , q} {\varvec{B}}(\hat{\varvec{\eta }})_{ij}^2, \end{aligned}$$
(27)

where \(i_{l}=\lfloor n \cdot t_{l} \rfloor \) \((l=1,\ldots ,m-1)\).

If the auxiliary variable V is nominal/categorical, the empirical cumulative sums of scores can be used to obtain a Lagrange multiplier statistic by first summing the scores within each of the m levels of the auxiliary variable and then aggregating the squared sums across levels and parameters (Hjort & Koning, 2002). This test statistic can be formally written as

$$\begin{aligned} LM_{uo} = \sum _{l = 1, \ldots , m} \sum _{j = 1, \ldots , q} \left( {\varvec{B}}(\hat{\varvec{\eta }})_{i_l j} - {\varvec{B}}(\hat{\varvec{\eta }})_{i_{l - 1}j} \right) ^2, \end{aligned}$$
(28)

where \({\varvec{B}}(\hat{{\varvec{\eta }}})_{i_{0}j} = 0\) for all j. This statistic is asymptotically equivalent to the usual likelihood ratio statistic, and it is advantageous over the LRT from (19) because it requires the estimation of only one model (the null model). We expect similar asymptotic equivalence results to hold in the PML framework, though this equivalence has not been investigated.
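
The ordinal and categorical statistics aggregate the same empirical process at the level boundaries only. A sketch implementing Eqs. (26)–(28), given B as above and the vector of observed frequencies at the m levels of V (function and argument names are ours):

```r
ord_stats <- function(B, freq) {
  n  <- nrow(B)
  il <- cumsum(freq)[-length(freq)]  # boundary indices i_l = floor(n * t_l)
  w  <- (il / n) * (1 - il / n)
  WDMo   <- max(apply(abs(B[il, , drop = FALSE]), 1, max) / sqrt(w))  # Eq. (26)
  maxLMo <- max(rowSums(B[il, , drop = FALSE]^2) / w)                 # Eq. (27)
  Bl   <- rbind(0, B[cumsum(freq), , drop = FALSE])  # cumulative sums at each level
  LMuo <- sum(diff(Bl)^2)                            # Eq. (28)
  c(WDMo = WDMo, maxLMo = maxLMo, LMuo = LMuo)
}
```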

In the following sections, we apply these theoretical results to IRT models. We focus on the two-parameter model estimated via PML, where the \(\theta _i\) are assumed to arise from a normal distribution (and, as mentioned previously, MML results are included in supplementary material).

6 Simulation 1

In this study, we aim to examine the tests’ abilities to locate item parameters that violate measurement invariance. Consider a hypothetical battery of five items administered to students in several ordered groups (e.g., \(m = 8\)), with the item responses being described by a traditional two-parameter model. Measurement invariance violations may occur in the item intercept or the item slope parameters (related to difficulty and discrimination, respectively). It is plausible that violations in an item’s slope parameter influence the item’s intercept parameter, or that one violating item influences the other items. Thus, the goal of Simulation 1 is to examine the extent to which the score-based tests attribute the measurement invariance violation to the correct item parameters.

6.1 Method

Data were generated from a two-parameter model (with probit link function) for a test with 5 items. A violation occurred in one of two places: the item 3 slope parameter (\(\alpha _{3}\)) or intercept parameter (\(\gamma _{3}\)). The fitted models matched the data-generating model, and parameter estimates were obtained by PML. Measurement invariance violations were tested in eight subsets of parameters: each item’s intercept parameter (or slope parameter, depending on the location of the true violation), item 3’s non-violating parameter (\(\gamma _{3}\) or \(\alpha _{3}\)), all items’ intercept parameters, and all items’ slope parameters.

Power and type I error were examined across three sample sizes (\(n = 120\), 480, and 960), three numbers of ordered groups (\(m = 4\), 8, and 12), and 17 magnitudes of invariance violations. The measurement invariance violations occurred at level \(m/2+1\) of V: Students with \(V < (m/2+1)\) deviated from students with \(V \ge (m/2+1)\) by d times the parameters’ asymptotic standard errors (scaled by \(\sqrt{n}\)), with \(d = 0, 0.25, 0.5, \ldots , 4\).

For each combination of sample size (n) \(\times \) violation magnitude (d) \(\times \) violating parameter \(\times \) number of groups (m), 5000 data sets were generated and tested. In all conditions, we maintained equal sample sizes in each of the m categories. Statistics from Eqs. (26) and (27) (both ordinal statistics) were examined, as was the statistic from Eq. (28) (a categorical statistic that ignores the ordering information). As mentioned previously, the latter statistic is asymptotically equivalent to the usual likelihood ratio test. Thus, it provides information about the relative performance of the ordinal statistics versus the LRT.
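
A hedged sketch of a single replication in the spirit of this design, using lavaan’s PML estimator and strucchange’s ordinal functionals (the packages named in our computational details). Here gen_2pl() is a hypothetical stand-in for the data-generation scheme described above, and the gefp()/sctest() arguments follow the usage in Merkle et al. (2014) and may need adjustment.

```r
library("lavaan")
library("strucchange")

dat <- gen_2pl(n = 960, m = 8, d = 2, dif = "alpha3")  # hypothetical generator
fit <- cfa("theta =~ y1 + y2 + y3 + y4 + y5", data = dat,
           ordered = paste0("y", 1:5), std.lv = TRUE, estimator = "PML")

ses <- factor(dat$V)  # the ordinal auxiliary variable
gfp <- gefp(fit, fit = NULL, scores = lavScores, order.by = ses)
sctest(gfp, functional = ordL2BB(ses))  # max LM_o, Eq. (27)
sctest(gfp, functional = ordwmax(ses))  # WDM_o, Eq. (26)
sctest(gfp, functional = catL2BB(ses))  # LM_uo, Eq. (28)
```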

6.2 Results

Full simulation results for PML are presented in Figs. 1, 2, 3, and 4. (Similar results for MML are shown in supplementary material.) Figures 1 and 2 show the comparison of different test statistics at a fixed value of n, while Figs. 3 and 4 display a single test statistic across all values of n. Because items 1, 2, 4, and 5 display similar power curves in all conditions, we only show item 2’s results.

Fig. 1

Simulation 1. Simulated power curves for \(\max LM_o\), \(WDM_o\), and \(LM_{uo}\) across three levels of the ordinal variable m and measurement invariance violations of 0–4 standard errors (scaled by \(\sqrt{n}\)), estimated by the PML two-parameter model. The parameter violating measurement invariance is \(\alpha _{3}\). \(n=960\). Panel labels denote the parameter(s) being tested and the number of levels of the ordinal variable m.

Fig. 2

Simulation 1. Simulated power curves for \(\max LM_o\), \(WDM_o\), and \({LM}_{uo}\) across three levels of the ordinal variable m and measurement invariance violations of 0–4 standard errors (scaled by \(\sqrt{n}\)), estimated by PML two-parameter model. The parameter violating measurement invariance is \(\gamma _{3}\). \(n=960\). Panel labels denote the parameter(s) being tested and the number of levels of the ordinal variable m.

Fig. 3

Simulation 1. Simulated power curves for sample sizes \(n=120\), 480, and 960 of test statistic \({WDM}_o\), across three levels of the ordinal variable m and measurement invariance violations of 0–4 standard errors (scaled by \(\sqrt{n}\)), estimated by PML two-parameter model. The parameter violating measurement invariance is \(\alpha _{3}\). Panel labels denote the parameter(s) being tested and the number of levels of the ordinal variable m.

Fig. 4

Simulation 1. Simulated power curves for sample sizes \(n=120\), 480, and 960 of test statistic \({WDM}_o\), across three levels of the ordinal variable m and measurement invariance violations of 0–4 standard errors (scaled by \(\sqrt{n}\)), estimated by PML two-parameter item response model. The parameter violating measurement invariance is \(\gamma _{3}\). Panel labels denote the parameter(s) being tested and the number of levels of the ordinal variable m.

Figure 1 demonstrates power curves (of sample size 960) as a function of violation magnitude in item 3’s slope parameter \(\alpha _{3}\), with the tested parameters changing across rows, the number of levels m of the ordinal variable V changing across columns, and lines reflecting different test statistics. In each panel, the x-axis represents the violation magnitude and the y-axis represents power. Figure 2 demonstrates similar power curves when the violating parameter is item 3’s intercept parameter \(\gamma _{3}\).

These two graphs show that the ordinal statistics exhibit similar results, with the \({LM}_{uo}\) statistic demonstrating lower power across all situations. This demonstrates the sensitivity of the ordinal statistics to invariance violations that are monotonic with V. In situations where only one parameter is tested, \({WDM}_o\) and \(\max {LM}_o\) exhibit equivalent power curves. This is because these two statistics are equivalent when only one parameter is tested (see Merkle et al., 2014).

Figures 3 and 4 display similar power curves (of statistic \({WDM}_o\)), but the lines now reflect different sample sizes. Figure 3 demonstrates results when the violating parameter is \(\alpha _{3}\), and Fig. 4 displays the results when the violating parameter is \(\gamma _{3}\).

From these figures, one generally observes that the tests isolate the parameter violating measurement invariance. Comparing Fig. 1 to Fig. 2, we can see that the tests have somewhat higher power to detect measurement invariance violations in the intercept parameter as opposed to the slope parameter. This is because it is easier to detect violations in “main effects” (the intercept can be viewed as intercept \(\times \) 1) than in “interactions” (slope \(\times \) person parameter \(\theta _i\)). A change in an intercept parameter influences every person equally, whereas a change in a slope parameter is moderated by each person’s ability \(\theta _i\). Meanwhile, comparing Fig. 3 and Fig. 4, we can see that sample size has a much larger influence on power to detect violations in the slope parameter, as compared to the intercept parameter. This is related to the fact that the violation magnitudes were scaled by the square root of n and that the slope parameter is attached to the person parameter \(\theta _i\), which follows a distribution rather than being constant.

Finally, simultaneous tests of all slope parameters or of all intercept parameters resulted in decreased power, as compared to the situation where only the violating parameter is tested. This “dampening” phenomenon is more apparent for the \(\max {LM}_o\) statistic, because it involves a sum across all tested parameters [see Eq. (27)], whereas \({WDM}_o\) only takes the maximum over parameters [see Eq. (26)]. However, the relative power advantage of using \(\max {LM}_o\) and \({WDM}_o\) when testing multiple parameters depends on the number of parameters that actually violate invariance (Merkle et al., 2014). In practice, we often test multiple parameters in the exploratory stage and, when we have no information about which parameter(s) might be problematic, \(\max {LM}_o\) has more power than \({WDM}_o\) (Merkle et al., 2014; T. Wang et al., 2014).

In summary, we found that the proposed tests can attribute measurement invariance violations to the correct parameter of a two-parameter item response model. While this can give practitioners some confidence in the tests, we did not examine the situation where person abilities differ across groups, which is often called “impact” in the item response literature (Fischer, 1995b). We consider this situation in Simulation 2.

7 Simulation 2

In Simulation 1, the ability distributions were assumed to be the same for all persons. This ignored the fact that person hyperparameters (mean ability and variance of ability) could change across groups along with the item parameters. Changes in person hyperparameters do not count as measurement invariance violations, but ignoring these changes may lead us to incorrectly conclude an invariance violation (Woods, 2009; Stark, Chernyshenko, & Drasgow, 2006; W.-C. Wang & Yeh, 2003; Fischer, 1995a; Kopf, Zeileis, & Strobl, 2015).

Formally, in a regular two-parameter model, we assume that the person parameters follow a standard normal distribution across all groups: \(\theta _i \sim N(0,1)\). There is the potential that the hyperdistribution is group specific, however, with \(\theta _{i}^{\star } \sim N(\mu _{v_i}, \sigma _{v_i}^{2})\), where \(v_i \in \{1,\ldots , m\}\). If the hyperparameters change from group to group, then our model can be written as:

$$\begin{aligned} \Phi ^{-1}(p_{ij}) = \gamma _{j} + \alpha _{j}\theta _{i}^{\star }, \end{aligned}$$
(29)
$$\begin{aligned} \Phi ^{-1}(p_{ij}) = \gamma _{j} + \alpha _{j}(\mu _{v_i} + \sigma _{v_i}\theta _{i}), \end{aligned}$$
(30)
$$\begin{aligned} \Phi ^{-1}(p_{ij}) = (\gamma _{j} + \alpha _{j}\mu _{v_i}) + \sigma _{v_i}\alpha _{j}\theta _{i}. \end{aligned}$$
(31)

This shows that, when \(\sigma _{v_i}\) differs across values of \(v_i\), it will look like there are measurement invariance violations in \(\alpha _{j}\) (for all j). Similarly, when \(\mu _{v_i}\) differs across values of \(v_i\), it will look like there are measurement invariance violations in \(\gamma _{j}\) (for all j). Further, because \(\mu _{v_i}\) is no longer 0, changes in \(\alpha _{j}\) will also make it look like there are measurement invariance violations in the \(\gamma _{j}\) (through the term \(\alpha _{j}\mu _{v_i}\)). Therefore, the proposed tests’ good properties from Simulation 1 are lost when the person hyperparameters change across groups.
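
A tiny numeric illustration of Eqs. (29)–(31): shifting a group’s ability mean moves every item’s effective intercept, and inflating its variance rescales every slope, mimicking uniform and non-uniform DIF, respectively (parameter values are arbitrary).

```r
alpha <- c(1.0, 1.2, 0.8)   # item slopes
gamma <- c(0.0, -0.5, 0.5)  # item intercepts
mu <- -1; sigma <- sqrt(2)  # unmodeled group hyperparameters
cbind(eff_intercept = gamma + alpha * mu,  # (gamma_j + alpha_j * mu), Eq. (31)
      eff_slope     = sigma * alpha)       # sigma * alpha_j, Eq. (31)
```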

To avoid this problem, we should estimate the person hyperparameters \(\mu _{v_i}\) and \(\sigma _{v_i}^{2}\), when there is uncertainty about person abilities. It is clear that estimation of these extra parameters will decrease the proposed tests’ power. However, both the extent of decrease and the relative performance compared to traditional statistics are unclear. In this section, we conduct two simulations that address these issues.

7.1 Method

To examine the decrease in power when we estimate person hyperparameters with or without a “true” person hyperparameter change, we organize Simulation 2 into two subsections. In Simulation 2.1, the data generation model is the same as Simulation 1, with abilities of students generated from \(\theta _{i} \sim N(0,1)\), whereas in Simulation 2.2, the abilities of students were manipulated. Specifically, abilities of students with \(V=1, 2, 3,\) or 4 were generated from \(\theta _{i} \sim N(0,1)\), while the abilities of students with \(V=5, 6, 7,\) or 8 were generated from \(\theta _{i} \sim N(-1,2)\).

The estimated model for both Simulations 2.1 and 2.2 is the multiple-group two-parameter model, which includes free \(\mu _{v_i}\) parameters for each level (level 1 fixed to 0 for identification), free \(\sigma _{v_i}^{2}\) parameters for each level (level 1 fixed to 1 for identification), and the five items’ slope and intercept parameters (as in Simulation 1). Estimates were again obtained by PML.
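
In lavaan syntax, such a multiple-group model might be sketched as follows; the data object, variable names, and grouping variable V are illustrative, and for eight groups the first entry of each constraint vector fixes level 1’s hyperparameters.

```r
model_mg <- "
  theta =~ y1 + y2 + y3 + y4 + y5
  theta ~  c(0, NA, NA, NA, NA, NA, NA, NA) * 1      # latent means, level 1 fixed to 0
  theta ~~ c(1, NA, NA, NA, NA, NA, NA, NA) * theta  # latent variances, level 1 fixed to 1
"
fit_mg <- cfa(model_mg, data = dat, ordered = paste0("y", 1:5),
              group = "V", group.equal = c("loadings", "thresholds"),
              estimator = "PML")
```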

Because the multiple-group two-parameter model has more parameters to be estimated (7 mean parameters \(\mu _{v_i}\) and 7 variance parameters \(\sigma _{v_i}^{2}\)), the sample sizes were increased to \(n=1200, 4800\), and 9600. Measurement invariance violations still occurred in the same places (either \(\alpha _3\) or \(\gamma _3\)), and the subsets of tested parameters were the same as in Simulation 1.

Power and type I error were examined across three sample sizes and 17 magnitudes of invariance violations (manipulated in the same way as Simulation 1). For each combination of sample size (n) \(\times \) violation magnitude (d), 5000 data sets were generated and tested. In all conditions, we still maintained equal sample sizes in each level of V. We examined the statistics from Eqs. (26), (27), and (28).

Fig. 5

Simulation 2.1. Simulated power curves for \(\max {LM}_o\), \({WDM}_o\), and \({LM}_{uo}\) across measurement invariance violations of 0–4 standard errors (scaled by \(\sqrt{n}\)), estimated by PML (fitting the multiple-group two-parameter model, with no person ability change in the data-generating model). The parameter violating measurement invariance is \(\alpha _{3}\). The number of categories is \(m=8\). Panel labels denote the parameter(s) being tested and sample size.

Fig. 6

Simulation 2.1. Simulated power curves for \(\max {LM}_o\), \({WDM}_o\), and \({LM}_{uo}\) across measurement invariance violations of 0–4 standard errors (scaled by \(\sqrt{n}\)), estimated by PML (fitting the multiple-group two-parameter model, with no person ability change in the data-generating model). The parameter violating measurement invariance is \(\gamma _{3}\). The number of categories is \(m=8\). Panel labels denote the parameter(s) being tested and sample size.

7.2 Results

In the sections below, we first discuss results when the data generation model had person hyperparameters that were the same across groups (Simulation 2.1). We then discuss results when the data generation model had person hyperparameters that differed across groups (Simulation 2.2).

7.2.1 Simulation 2.1

The results are presented in Figs. 5 and 6. Figure 5 demonstrates power curves as a function of violation magnitude in item 3’s slope parameter \(\alpha _{3}\), with the parameters being tested changing across rows, the sample sizes n changing across columns, and lines reflecting different test statistics. Figure 6 demonstrates similar power curves when the violating parameter is item 3’s intercept parameter \(\gamma _{3}\). In both figures, tests of item 2’s parameters are again representative of all invariant items.

From these two figures, one generally observes that the tests isolate the parameter violating measurement invariance in the multiple-group two-parameter model (across rows) and that power increases with n (across columns). The impact of n is more substantial when the slope parameter, as opposed to the intercept parameter, violates invariance. We need a sample size as large as 9600 to obtain power near .8 for detecting DIF in the slope parameter (with increasing violation magnitude), whereas there is no large difference across columns when the intercept parameter violates invariance.

Within each panel of Figs. 5 and 6, the three lines reflect the three test statistics. It is seen that the two ordinal statistics still exhibit similar results, with \({LM}_{uo}\) demonstrating lower power across all situations. Therefore, the sensitivity of the ordinal statistics is preserved in the multiple-group two-parameter model.

Comparing Fig. 5 and Fig. 6 in general, we can see the tests still have somewhat higher power to detect measurement invariance violations in the intercept parameter as opposed to the slope parameter. Moreover, power is lower when we test the full set of slope (or intercept) parameters, as opposed to only the problematic parameter.

7.2.2 Simulation 2.2

The results are presented in supplementary material, with the same figure and panel arrangements as Simulation 2.1. They demonstrate the same pattern as Simulation 2.1. We can observe that the power decrease is related to the number of parameters in the estimated model, regardless of the data generation model.

In summary, we found that the proposed tests can attribute measurement invariance violations to the correct multiple-group model parameter when impact is exhibited. Although the multiple-group model requires a much larger sample size to obtain reasonable power, this type of model is necessary in practice when there is uncertainty about changes in person hyperparameters. Otherwise, there will be serious “false alarms,” as illustrated by Eqs. (29)–(31). The sample size issue can often be addressed, as IRT researchers often have thousands of respondents in their data sets. However, in other situations, we may wish to first test whether the hyperparameters vary across groups before examining the item parameters. If this test indicates that the hyperparameters do not vary, then we can constrain them to be equal and gain more power to detect DIF. If the test indicates that the hyperparameters do vary, then we can use the “location” information resulting from the tests to potentially reduce the number of hyperparameters in the model (i.e., by constraining similar groups’ parameters to be equal). This would again lead to increased power to detect DIF.

The procedure outlined in the previous paragraph has the potential to capitalize on chance, as we are relying on sequential statistical tests to modify the focal model. However, sequential statistical tests are commonly used in the DIF literature for, e.g., anchor item selection and item “purification.” Because the tests proposed here utilize a constrained model where all items serve as anchors, we have essentially replaced the sequential anchor item tests with sequential tests of the person hyperparameters. Further, to address concerns about capitalizing on chance, we can employ cross-validation methods. These strategies are demonstrated below in a practical example.

8 Application

We illustrate the tests’ application using 18 dichotomously scored mathematics items from the graduation examination developed by the Netherlands National Institute for Educational Measurement (Doolaard, 1999; Fox, 2010).

8.1 Method

In the data set, 2156 eighth-grade students completed the test, with a socioeconomic status (SES) variable also being measured on each student. The SES scores were based on four indicators, which were the education and occupation levels of both parents (if present). In this sample, there are 40 unique SES values ranging from \(-3.23\) to 2.8, with higher values indicating higher SES. For the purposes of demonstration, we treat SES as a 6-category ordinal variable here and maintain equal sample sizes at each level.

The correlation between SES and mathematics achievement (sum of the 18 items) equals 0.49. Of course, this relationship could be explained in two different manners: Either people of different SES exhibit different abilities, or the items are unfair to people of certain SES levels. We use the score-based tests to distinguish between these different explanations.

Following the strategy outlined at the end of the previous section, we start with a two-parameter item response model where the person hyperparameters \(\mu _{1}\) and \(\sigma _{1}^{2}\) (for level 1) are fixed to 0 and 1, while the hyperparameters at the other levels are estimated but constrained to be equal; we refer to this below as the constrained hyperparameter model. This allows us to test whether the hyperparameters are equal across levels and, if they are not, it provides us with information about which specific groups are unequal. This information is used to build a model with relatively higher power to detect DIF that avoids “false alarms” by accounting for person hyperparameters. To address the potential problem of “capitalizing on chance” in adopting this strategy, overall model fit and cross-validation are examined.

8.2 Results

We describe the results in three sections: one for the initial examination of fluctuations in the hyperparameters, one for the examination of parameters based on the model with appropriate hyperparameters, and one for further support of our chosen model.

8.2.1 Testing the Hyperparameters

The results representing the statistics’ fluctuations across SES levels based on the constrained hyperparameter model are shown in Fig. 7. The first column displays the fluctuation process associated with \(\max {LM}_o\) for testing the 18 items’ slopes (first row), the 18 items’ intercepts (second row), the person mean parameters (third row), and the person variance parameters (fourth row). The second column displays the fluctuation process associated with \({WDM}_o\) for the same sets of parameters. In other words, these panels show the values of Eqs. (26) and (27) for each SES level, with the dashed horizontal line being the 5% critical value. If the solid line crosses the critical value, then there is evidence that the corresponding parameter fluctuates across levels of SES. Because the final level’s statistics always equal zero [see Eq. (11)], the final level (level 6 here) is not displayed.

Fig. 7

Empirical fluctuation processes of the \(\max {LM}_o\) statistic (first column) and \({WDM}_o\) (second column) for slope parameters (first row), intercept parameters (second row), person mean parameter (third row), and person variance parameter (fourth row), using constrained hyperparameter model.

It is observed that the person mean parameter (third row) fluctuates across all levels, while the person variance parameter (fourth row) fluctuates between the middle levels and level 5 (note that person hyperparameter change is not DIF). As shown in Simulation 2, this can cause the slope (first row) and intercept (second row) parameters to appear to exhibit DIF regardless of whether they actually do. Therefore, we need to examine a second model where person hyperparameters are free across specific levels of SES. Based on the statistics’ fluctuation processes, the second model should estimate a separate person mean parameter for each SES level, along with one person variance parameter for the middle levels (levels 2–4) and one for the extreme levels (levels 5–6). The test results involving this partially free hyperparameter model are described in the next section.

8.2.2 Testing the Partially Free Hyperparameter Model

In estimating a separate \(\mu _{v_i}\) for each of the six SES groups (with the first level fixed to 0 for identification) and two separate \(\sigma ^2\)s for the middle and extreme levels, we obtain the results shown in Fig. 8. The panel arrangements are the same as in Fig. 7.

Fig. 8

Empirical fluctuation processes of the \(\max {LM}_o\) statistic (first column) and \({WDM}_o\) (second column) for slope parameters (first row), intercept parameters (second row), person mean parameters (third row), and person variance parameters (fourth row), using partially free hyperparameter model.

Figure 8 implies that no sets of item parameters exhibit DIF, according to either statistic. This is the opposite of what we found in the previous section, and it is related to the findings from Simulation 2. Further, the estimated \(\mu _{v_i}\) increases monotonically with SES, with the lowest SES level having a fixed mean of 0, followed by 0.54, 1.01, 1.26, 1.58, and 2.25. Meanwhile, the \(\sigma ^2\) estimates for the middle SES levels (levels 2–4) and extreme SES levels (levels 5–6) are 1.14 and 1.37, respectively, with the lowest SES level having a fixed variance of 1.

Table 1 Model fit, comparison, and replication statistics from the application.

8.2.3 Overall Model Fit and Cross-Validation

As mentioned earlier, the fact that we sequentially studied person parameters and item parameters is potentially problematic from the perspective of “capitalizing on chance.” This is because our tests of item parameters were based on a model that was influenced by the tests of person parameters. In this section, we conduct model comparisons and cross-validation to examine the extent to which our results are robust.

We start with general model comparisons. In addition to the two models examined above, we added a third model where each SES level has unique hyperparameters (similar to the model from Simulation 2). As mentioned before, this model will generally have lower power compared to the partially free hyperparameter model, but it also avoids “false alarms” to the greatest extent. This third model is referred to as the fully free hyperparameter model in the following. Both the AIC and BIC statistics (arising from PML, denoted \(\text {AIC}_{\text {PL}}\) and \(\text {BIC}_{\text {PL}}\) below) preferred the partially free hyperparameter model to the fully free hyperparameter model, as well as to the constrained hyperparameter model. Model statistics are given in the top half of Table 1. In addition, the pairwise likelihood ratio test (PLRT) preferred the partially free hyperparameter model to the other two models. Specifically, the fit of the fully free hyperparameter model is not significantly better than that of the partially free hyperparameter model (3.84, \(p=0.27\)), and the fit of the partially free model is significantly better than that of the constrained hyperparameter model (346.73, \(p=0.00\)). Additionally, the model with all free hyperparameters is preferred to the constrained hyperparameter model (380.80, \(p=0.00\)). Thus, the preference order of these three models is: partially free hyperparameter model, then fully free hyperparameter model, followed by the constrained hyperparameter model.

In order to confirm the generality of the above model assessment, we conducted a cross-validation whereby half of the original data set was randomly allocated to the training set, with the remaining half being allocated to the test set. This random allocation was replicated 100 times, resulting in 100 training data sets and 100 corresponding test data sets. For each of the 100 training data sets, we fitted the three models and compared them to one another. We then computed analogous model fit statistics for the test data, holding the model parameter values at the estimates from the training data. This allows us to examine the extent to which the fitted models continue to be preferred in new data. These results are displayed in the bottom half of Table 1, where the numbers represent percentages of the 100 data sets that replicated the original results.

The bottom half of the table shows that, in the training data sets, the original model preference ordering was replicated 86 times out of 100 based on \(\text {AIC}_{\text {PL}}\) and 100 times out of 100 based on \(\text {BIC}_{\text {PL}}\). This result is also supported by the pairwise likelihood ratio test when comparing the top two models, whereby 90 out of 100 statistics preferred the partially free hyperparameter model at \(\alpha =.05\). Similarly, in the test data sets, the same model preference ordering was replicated 97 times out of 100 based on \(\text {AIC}_{\text {PL}}\), as well as 97 times out of 100 based on \(\text {BIC}_{\text {PL}}\).

Finally, applying the score-based tests to the partially free hyperparameter model for the training data, 83% of the data sets (using the \({WDM}_o\) statistic) and 99% (using the \(\max {LM}_{o}\) statistic) exhibited no DIF. Taken together, these analyses illustrate that our original results (model preferences and DIF results) remain similar across the 100 resamples.

In summary, we found that the positive correlation between SES and math achievement is due to the fact that students’ ability means and variances increase with SES. All parameters appear to fulfill the measurement invariance assumption after we account for changes in person ability across SES levels. The score-based tests allowed us to systematically study these issues without estimating an excessive number of models. If desired, we could also test each item’s parameters individually (as opposed to the set of intercepts and the set of slopes) without fitting any new models. This illustrates the inherent flexibility of the tests.

9 General Discussion

In this paper, we extended a recently proposed family of score-based tests to item response models, focusing on multiple-group two-parameter models. The tests’ power levels are comparable to traditional statistics, and the tests can isolate specific parameters violating invariance so long as we account for changes in person ability across groups.

The test statistics examined here, along with estimation by PML, provide a more general and flexible framework for detecting DIF in IRT research. Traditionally, we pre-define two groups of individuals and compare them via a multiple-group model. In using score-based tests, we do not need to pre-define the groups and can test many groups simultaneously. Additionally, person hyperparameters can be estimated conveniently in a multiple-group null model (that assumes measurement invariance holds) without refitting multiple alternative models, as is required by the LRT or Wald test (see also Glas, 1998). This can enhance our ability to detect DIF in large data sets with many groups.

In the sections below, we consider the tests’ applications in related models and in complex scenarios.

9.1 Model Extensions

The PML framework generally allows us to use the score-based tests in situations when the responses have multiple categories, where a graded response model (Samejima, 1969) or partial credit model (Muraki, 1992) may be used. These models become increasingly difficult to estimate when we have many groups and when items have many categories. In these situations, the score-based tests become increasingly attractive because they require estimation of only a null model (assuming that invariance holds).

Another extension involves the use of multidimensional IRT models, especially because multidimensionality is one possible cause of DIF (Millsap, 2012). However, it is difficult to test this hypothesis due to the multidimensional integration involved. In employing the factor-analytic framework described here with PML, we can more easily estimate models with multiple dimensions. This can further help us study DIF in larger data sets.

Finally, moving beyond traditional IRT models, the tests proposed here can be applied to multilevel/mixed models where, for example, students’ responses may be nested in classes, schools, or states. Score-based tests only rely on the derivative of each individual’s likelihood function so that, as long as the individual derivative (analytic or approximation) can be specified, the tests can be applied. Scores for generalized linear mixed models will be more difficult to obtain than scores for linear mixed models, in the same way that scores for continuous data factor analysis are easier to obtain than scores for IRT models.

9.2 Full Structural Equation Modeling Approach to Linking/Equating Problem

In practice, we often need to transform person parameters so that ability estimates are equivalent across different scales. This is called equating (see Kolen & Brennan, 2004, for a review). For example, we may need to equate test takers’ abilities across multiple versions of the SAT.

The existence of DIF complicates equating. Suppose that Form A of the SAT exhibits DIF with respect to country/grade/age, but Form B does not exhibit DIF. We must then decide whether we should equate each level of V separately, as opposed to equating simultaneously across the whole sample. Dorans  (2004) dealt with this question by introducing new statistics that utilized the test characteristic curve. Alternatively, we can frame the question in a full structural equation model (SEM) and employ the score-based test to examine the corresponding coefficients’ stability against V. In this way, no new statistics need to be introduced.

9.3 Multiple Violating Slope Parameters

In this paper, we studied the tests’ applications to two-parameter and multiple-group two-parameter models when only one parameter violated invariance. When there are multiple violating parameters, Bechger and Maris  (2015) point out that both the null and alternative hypotheses of a score-based test can be incorrect. For example, if we test a single item intercept parameter, then the null hypothesis would involve all intercept parameters being equal across groups and the alternative hypothesis would involve the focal intercept parameter being unequal across groups (with the remaining intercepts being equal). If a non-focal intercept parameter is unequal across groups, however, then both hypotheses are incorrect.

To address this issue, we can employ recursive tests related to item purification. This could proceed as follows (see Glas, 1998, for a related approach): (1) Fit the null model with person hyperparameters, (2) test for DIF in each item parameter, (3) free the parameter with the largest statistic and refit the model with person hyperparameters, (4) repeat steps (2)–(3) until there is no further DIF detected. This procedure is similar to the LRT algorithm described by Magis et al.  (2010), which is implemented in R packages mirt (Chalmers, 2012) and difR (Magis, Beland, & Raiche, 2015). The score-based tests are advantageous here because no anchor items are needed (see Woods, 2009, for a review of procedures involving anchor items). This is because we only need to estimate the null model, where all parameters are already assumed to be invariant across groups. However, the sensitivity to the order of purification described by, for example, Magis and Facon  (2013) and Bechger and Maris  (2015) cannot be avoided under this approach.
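
In code, this recursive procedure might be sketched as below, where fit_null() and dif_stats() are hypothetical helpers standing in for fitting the null model (with person hyperparameters) and for computing a per-parameter score-based statistic with its p value.

```r
free_set <- character(0)  # parameters allowed to differ across groups
repeat {
  fit   <- fit_null(dat, free = free_set)  # steps (1)/(3): null model refit
  stats <- dif_stats(fit)                  # step (2): statistic per item parameter
  worst <- names(which.max(stats$statistic))
  if (stats$p.value[worst] >= 0.05) break  # step (4): stop when no DIF remains
  free_set <- c(free_set, worst)           # step (3): free the most problematic parameter
}
```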

As an alternative to the item purification approach, Bechger and Maris  (2015) make the insightful point that, in a Rasch framework, pairwise differences between item parameters are preserved across the set of possible identification constraints. Thus, they conceptualize differential item pair functioning as a property of item pairs, whereby differences between item parameters may vary across groups (as opposed to individual item parameters varying across groups). This proposal leads to Wald tests of differences between item parameters, where the test results are the same regardless of choice of identification constraint. A potential difficulty here is that, in the marginal and pairwise ML frameworks considered in this paper, we typically want hyperparameters to be free across groups, and estimation of these hyperparameters requires more parameter constraints than would typically be employed in the Wald test framework. It appears difficult to address all these issues without an iterative procedure.

Nonetheless, we might make some progress through consideration of alternative parameter constraints. That is, instead of constraining one group’s mean and variance hyperparameters to 0 and 1, respectively, we may employ “sum” constraints that allow us to freely estimate more parameters. For example, Verhagen, Levy, Millsap, and Fox (2016) constrained the sum of all intercept parameters to be zero (in a Rasch-type model) to avoid the need for defining anchor items or assuming group ability (i.e., fixing one group ability parameter). These constraints can be extended to the slopes of a two-parameter model, requiring that the squared slope parameters sum to 1. Further work may consider the combination of these types of parameter constraints with both score-based tests and differential item pair functioning.

9.4 Summary

In this paper, we generalized the score-based tests to IRT models estimated by MML and PML. This extension has advantages over traditional DIF detection methods: it locates the violating parameter without pre-specified grouping information, and it accounts for the ordinal nature of the auxiliary variable V. In addition, implementation of these tests is simple, requiring only estimation of a null model that assumes measurement invariance. Applied researchers in psychology and education could use these tests to conveniently examine measurement invariance in their own data sets.

10 Computational Details

All results were obtained using the R system for statistical computing (R Core Team, 2017), version 3.4.2, employing the add-on package lavaan 0.5-23.1097 (Rosseel, 2012) for fitting of the factor analysis models and strucchange 1.5-1 (Zeileis, Leisch, Hornik, & Kleiber, 2002; Zeileis, 2006) for evaluating the parameter instability tests. R and both packages are freely available under the General Public License from the Comprehensive R Archive Network at https://CRAN.R-project.org/. R code for replication of our results is available in the supplementary materials and at http://semtools.R-Forge.R-project.org/.