
1 Introduction

Differential item functioning (DIF) analysis is an essential procedure for educational and psychological tests. DIF occurs when individuals from different groups (defined by, e.g., gender, ethnicity, country, or age) have different probabilities of endorsing or correctly answering a given item after controlling for overall test scores. DIF violates the assumption of measurement invariance and renders test scores incomparable for individuals of the same ability level from different groups, which substantially threatens test validity. DIF detection can reveal how test scores are affected by external variables unrelated to the construct being measured (Glas, 1998). It is therefore important to know whether items are subject to DIF, that is, whether examinees are measured fairly.

Many approaches have been developed for DIF detection, and they can be classified into two categories (Magis, Béland, Tuerlinckx, & De Boeck, 2010): item response theory (IRT)-based and non-IRT-based approaches. The IRT-based approaches include the Lagrange multiplier test (Glas, 1998), the likelihood ratio test (Cohen, Kim, & Wollack, 1996), Lord’s chi-square test (Lord, 1980), and Raju’s (1988) signed area method, among others. The IRT-based approaches require estimating item parameters separately for each group; an item is identified as a DIF item if its parameters differ significantly between groups. By contrast, the non-IRT-based approaches require neither a specific form for the IRT model nor large sample sizes (Narayanan & Swaminathan, 1996). They include the Mantel-Haenszel (MH; Holland & Thayer, 1988), logistic regression (LR; Rogers & Swaminathan, 1993), and simultaneous item bias test (SIBTEST; Shealy & Stout, 1993) methods, among others.

Among the non-IRT-based approaches, the MH and LR methods perform well in flagging DIF items when the percentage of DIF items is not very high and there is no mean ability difference between groups (French & Maller, 2007; Narayanan & Swaminathan, 1996). A common feature of these two methods is that examinees from different groups are placed on a common metric based on the test scores, which are usually called the matching variables. The choice of matching variables is critical for DIF detection (Kopf, Zeileis, & Strobl, 2015). If the matching variables are contaminated (i.e., include DIF items), examinees with the same ability levels are not matched well, and the subsequent DIF detection is biased (Clauser, Mazor, & Hambleton, 1993). In practice, it is challenging to identify a set of DIF-free items to serve as the matching variables, especially when the percentage of DIF items is high or the DIF magnitudes are large (Narayanan & Swaminathan, 1996; Rogers & Swaminathan, 1993).

To overcome this difficulty, the odds ratio (OR; Jin, Chen, & Wang, 2018) method was proposed to detect uniform DIF under various manipulated conditions, such as different DIF patterns, impact, sample sizes, and the presence or absence of purification. Jin, Chen, and Wang (2018) found that the OR method without a purification procedure outperformed the MH and LR methods in controlling the false positive rate (FPR) and achieving a high true positive rate (TPR) when tests contained high percentages of DIF items. Another recently developed IRT-based DIF detection method is the credible interval (CI) method proposed by Su, Chang, and Tsai (2018) to detect uniform and nonuniform DIF items under the Bayesian framework. Su et al. (2018) found that the CI method performed well; however, only unbalanced DIF conditions without impact (i.e., with zero mean ability difference between the reference and focal groups) were considered in their study.

A common feature of the CI and OR methods is that both perform DIF detection by constructing intervals. The OR method follows the frequentist approach and constructs a confidence interval for the mean ability difference between the reference and focal groups. By contrast, the CI method follows the Bayesian approach and constructs a credible interval for the item difficulty difference between the reference and focal groups; see the next section for details. Because of the nature of the Bayesian framework, the CI method needs more computation time to perform DIF examination. In addition, the CI method assumes that the Rasch (1960) model is the correct model for the data. By contrast, the OR method does not require the specification of an IRT model; however, it may not work when the number of examinees in either group is very small. Given the very different nature of these two newly developed methods, it is of interest to compare them under the Rasch model. In this paper, we investigated the performance of the CI and OR methods for detecting uniform DIF within the framework of the Rasch model through a series of simulation studies. The effectiveness of the two approaches was also illustrated with an empirical example.

2 The CI and OR DIF Detection Methods

2.1 The CI Method

We first review the CI method proposed by Su, Chang, and Tsai (2018). Let \( Y_{pj} \) be the dichotomous response of examinee p to item j, where p = 1, …, P, and j = 1, …, J. Denote by \( b_{j} \) and \( \theta_{p} \) the difficulty parameter of item j and the ability parameter of examinee p, respectively. Under the Rasch (1960) model, the probability that examinee p answers item j correctly is given by

$$ \pi_{pj} = {\text{P}}\left( {Y_{pj} = \left. 1 \right|\theta_{p} ,b_{j} } \right) = \frac{1}{{1 + e^{{ - \theta_{p} + b_{j} }} }}. $$
(1)

An item is flagged as DIF if the probability of answering it correctly differs across groups after controlling for the underlying ability levels. The CI method performs DIF detection under a Bayesian estimation framework (Su et al., 2018). Consider the simplest case of two groups, in which examinee p belongs either to the reference group (\( g_{p} = 0 \)) or to the focal group (\( g_{p} = 1 \)), and each group has its own difficulty parameter. Then, Eq. (1) becomes

$$ \pi_{pj} = {\text{P}}\left( {\left. {Y_{pj} = 1} \right|g_{p} ,\theta_{p} ,b_{j} ,d_{j} } \right) = \left\{ {\begin{array}{*{20}l} {\frac{1}{{1 + e^{{ - \theta_{p} + b_{j} }} }},} \hfill & {g_{p} = 0,} \hfill \\ {\frac{1}{{1 + e^{{ - \theta_{p} + d_{j} }} }},} \hfill & { g_{p} = 1,} \hfill \\ \end{array} } \right. $$
(2)

where \( b_{j} \) and \( d_{j} \) are the difficulty parameters for the reference and focal groups, respectively. Alternatively, the notation of Glas (1998) can be adopted to rewrite Eq. (2) as

$$ \pi_{pj} = {\text{P}}\left( {\left. {Y_{pj} = 1} \right|g_{p} ,\theta_{p} ,b_{j} ,\delta_{j} } \right) = \left\{ {\begin{array}{*{20}l} {\frac{1}{{1 + e^{{ - \theta_{p} + b_{j} }} }},} \hfill & {g_{p} = 0,} \hfill \\ {\frac{1}{{1 + e^{{ - \theta_{p} + b_{j} + \delta_{j} }} }},} \hfill & {g_{p} = 1.} \hfill \\ \end{array} } \right. $$
(3)

Equation (3) implies that the responses of the focal group involve an additional difficulty parameter \( \delta_{j} \). Therefore, the following hypothesis is tested:

$$ H_{0} :\delta_{j} = 0\;{\text{versus}}\;H_{1} :\delta_{j} \ne 0. $$
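To make Eqs. (1)–(3) concrete, the following minimal Python sketch computes the response probability for an examinee in either group; the parameter values in the usage lines are illustrative assumptions, not values taken from this paper.

```python
import numpy as np

def rasch_prob(theta, b, delta=0.0, focal=False):
    """Probability of a correct response under Eq. (3).

    theta : examinee ability
    b     : item difficulty for the reference group
    delta : additional difficulty for the focal group (0 under H0)
    focal : True if the examinee belongs to the focal group
    """
    difficulty = b + delta if focal else b
    return 1.0 / (1.0 + np.exp(-(theta - difficulty)))

# Illustrative values (assumptions for demonstration only):
print(rasch_prob(0.5, b=0.2))                         # reference group
print(rasch_prob(0.5, b=0.2, delta=0.5, focal=True))  # focal group
```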

Due to the complexity of the likelihood function, a Bayesian estimation method is used. Specifically, we closely follow the Bayesian approaches proposed by Chang, Tsai, and Hsu (2014), Chang, Tsai, Su, and Lin (2016), and Su et al. (2018). In particular, a two-layer hierarchical prior is assumed for the model parameters to reduce the impact of the prior settings on the posterior inference. For model identification, we follow Frederickx, Tuerlinckx, De Boeck, and Magis (2010) in assuming that the marginal distribution of \( \theta_{p} \) is normal:

$$ \theta_{p} \sim\left\{ {\begin{array}{*{20}l} {N\left( {0, \sigma_{r}^{2} } \right) ,} \hfill & {g_{p} = 0,} \hfill \\ {N\left( {\mu_{f} , \sigma_{f}^{2} } \right) ,} \hfill & {g_{p} = 1.} \hfill \\ \end{array} } \right. $$

For the first-layer priors of the model parameters, we assume

$$ \begin{aligned} & b_{j} \sim N\left( {\mu_{b} , \sigma_{b}^{2} } \right), \\ & d_{j} \sim N\left( {\mu_{d} , \sigma_{d}^{2} } \right). \\ \end{aligned} $$

Given the first-layer prior, we assume the second-layer prior to be

$$ \begin{array}{*{20}c} {\mu_{f} \sim N\left( {\mu_{1} , \sigma_{1}^{2} } \right),} \\ {\mu_{b} \sim N\left( {\mu_{2} , \sigma_{2}^{2} } \right),} \\ {\mu_{d} \sim N\left( {\mu_{3} , \sigma_{3}^{2} } \right),} \\ {\sigma_{r}^{2} \sim {\text{Inv-Gamma}}\left( {\alpha_{1} , \beta_{1} } \right),} \\ {\sigma_{f}^{2} \sim {\text{Inv-Gamma}}\left( {\alpha_{2} , \beta_{2} } \right),} \\ {\sigma_{b}^{2} \sim {\text{Inv-Gamma}}\left( {\alpha_{3} , \beta_{3} } \right),} \\ {\sigma_{d}^{2} \sim {\text{Inv-Gamma}}\left( {\alpha_{4} , \beta_{4} } \right).} \\ \end{array} $$

All parameters in the second-layer priors,

$$ (\mu_{1} ,\mu_{2} ,\mu_{3} ,\sigma_{1}^{2} ,\sigma_{2}^{2} ,\sigma_{3}^{2} , \alpha_{1} , \alpha_{2} ,\alpha_{3} ,\alpha_{4} , \beta_{1} ,\beta_{2} ,\beta_{3} ,\beta_{4} ), $$

are assigned prespecified values chosen in a reasonable way. Furthermore, all the priors are assumed to be mutually independent.
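As an illustration of this hierarchical structure, the following Python sketch draws one set of parameters from the two-layer prior; all hyperparameter values below are our own illustrative assumptions, since the method only requires that they be assigned in a reasonable way.

```python
import numpy as np
from scipy.stats import invgamma

rng = np.random.default_rng(1)
J = 20  # number of items

# Second-layer draws (all hyperparameter values are assumptions)
mu_f = rng.normal(0.0, 1.0)   # mu_f ~ N(mu_1, sigma_1^2)
mu_b = rng.normal(0.0, 1.0)   # mu_b ~ N(mu_2, sigma_2^2)
mu_d = rng.normal(0.0, 1.0)   # mu_d ~ N(mu_3, sigma_3^2)
sig2_r = invgamma.rvs(2.0, scale=1.0, random_state=rng)  # sigma_r^2
sig2_f = invgamma.rvs(2.0, scale=1.0, random_state=rng)  # sigma_f^2
sig2_b = invgamma.rvs(2.0, scale=1.0, random_state=rng)  # sigma_b^2
sig2_d = invgamma.rvs(2.0, scale=1.0, random_state=rng)  # sigma_d^2

# First-layer draws given the second layer
b = rng.normal(mu_b, np.sqrt(sig2_b), size=J)  # reference difficulties
d = rng.normal(mu_d, np.sqrt(sig2_d), size=J)  # focal difficulties
```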

More specifically, the CI method proceeds as follows. Each of the J items in the test is examined one at a time. For item j, a size-α test of δj = 0 is constructed: item j is assumed to follow Eq. (3), while the other items follow Eq. (1). That is, we test whether the responses of the focal group to item j require the additional parameter δj. Bayesian analysis via a Markov chain Monte Carlo (MCMC) scheme is implemented to construct the equal-tailed 1 − α credible interval for δj. If the interval includes 0, then δj = 0 is not rejected; otherwise, δj = 0 is rejected and item j is flagged as a DIF item.
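Whatever sampler is used, the decision rule itself is simple: given the retained post-burn-in draws of δj, compute the equal-tailed interval and check whether it covers 0. A minimal Python sketch follows; the example draws are simulated stand-ins, not real MCMC output.

```python
import numpy as np

def ci_decision(delta_draws, alpha=0.05):
    """Flag item j as DIF if the equal-tailed 1 - alpha credible
    interval for delta_j, formed from MCMC draws, excludes 0."""
    lo, hi = np.percentile(delta_draws,
                           [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return not (lo <= 0.0 <= hi), (lo, hi)

# Illustrative draws standing in for real MCMC output:
draws = np.random.default_rng(2).normal(0.4, 0.15, size=10_000)
print(ci_decision(draws))  # flagged, since the interval excludes 0
```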

2.2 The OR Method

The OR method was proposed by Jin, Chen, and Wang (2018) to detect uniform DIF. Let \( n_{R1j} \) and \( n_{R0j} \) be the numbers of examinees in the reference group who answer item j correctly and incorrectly, respectively, and let \( n_{F1j} \) and \( n_{F0j} \) be the corresponding numbers for the focal group. For item j, let \( {\hat{\uplambda}}_{j} \) denote the logarithm of the OR of success over failure for the reference and focal groups:

$$ {\hat{\uplambda}}_{j} = { \log }\left( {\frac{{n_{R1j} /n_{R0j} }}{{n_{F1j} /n_{F0j} }}} \right), $$
(4)

which follows a normal distribution asymptotically (Agresti, 2002) with mean \( \uplambda \) and standard deviation

$$ \sigma ({\hat{\uplambda}}_{j} ) = \left( {n_{R1j}^{ - 1} + n_{R0j}^{ - 1} + n_{F1j}^{ - 1} + n_{F0j}^{ - 1} } \right)^{1/2} , $$
(5)

where \( \uplambda \) is the mean ability difference between the reference and focal groups. For each item j, \( {\hat{\uplambda}}_{j} \), \( \sigma ({\hat{\uplambda}}_{j} ) \), and \( {\hat{\uplambda}}_{j} \pm z_{\alpha /2} \times \sigma ({\hat{\uplambda}}_{j} ) \) are computed, and the median of \( {\hat{\uplambda}}_{1} ,{\hat{\uplambda}}_{2} , \ldots ,{\hat{\uplambda}}_{J} \) is obtained. Item j is flagged as a DIF item if its \( 1 - \alpha \) confidence interval, \( {\hat{\uplambda}}_{j} \pm z_{\alpha /2} \times \sigma ({\hat{\uplambda}}_{j} ) \), does not cover this median. Note that the method may not work when the number of examinees is very small, because \( {\hat{\uplambda}}_{j} \) cannot be computed when any count in Eq. (4) is zero. A scale purification procedure can easily be implemented with the OR method; all that is necessary is to recompute the sample median based on presumably DIF-free items. See Jin, Chen, and Wang (2018) for details.
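Because Eqs. (4) and (5) depend only on four counts per item, the OR method is easy to implement. The Python sketch below reflects our reading of the procedure; the one-step purification shown (recomputing the median from the items not flagged initially) is a simplified version of the scale purification discussed by Jin, Chen, and Wang (2018).

```python
import numpy as np
from scipy.stats import norm

def or_method(nR1, nR0, nF1, nF0, alpha=0.05, purify=False):
    """Flag DIF items with the OR method.

    nR1, nR0, nF1, nF0 : length-J arrays of correct/incorrect counts
    for the reference (R) and focal (F) groups; all counts must be
    positive, otherwise the log odds ratio in Eq. (4) is undefined.
    """
    lam = np.log((nR1 / nR0) / (nF1 / nF0))        # Eq. (4)
    se = np.sqrt(1/nR1 + 1/nR0 + 1/nF1 + 1/nF0)    # Eq. (5)
    z = norm.ppf(1 - alpha / 2)
    lo, hi = lam - z * se, lam + z * se
    med = np.median(lam)
    flags = (med < lo) | (med > hi)                # CI does not cover median
    if purify and flags.any():
        med = np.median(lam[~flags])               # median of DIF-free items
        flags = (med < lo) | (med > hi)
    return flags
```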

3 Simulation Study

3.1 Design

In this section, simulation studies were conducted to compare the performance of the CI and OR methods. In each experiment, we simulated a test consisting of 20 items (i.e., J = 20), and the number of examinees (P) was 1000. Specifically, we were interested in comparisons based on five factors, which were also considered in Simulation Study I of Jin et al. (2018): (a) equal and unequal sample sizes of the reference and focal groups (500/500 and 800/200), (b) percentage of DIF items (0, 10, 20, 30, and 40%), (c) DIF pattern (balanced or unbalanced), (d) impact (0 or 1), and (e) purification procedure (with or without). Under the balanced DIF conditions, half of the DIF items favored the reference group and the other half favored the focal group. By contrast, under the unbalanced DIF conditions, all DIF items favored the reference group.

Item responses were generated according to Eq. (3). The true difficulty parameters \( b_{j} \) were generated independently from a uniform distribution between −1.5 and 1.5. The true ability parameters \( \theta_{p} \) for the reference group (\( g_{p} = 0 \)) were generated from the standard normal distribution. When impact = 0, the true values of \( \theta_{p} \) for the focal group (\( g_{p} = 1 \)) were also generated from the standard normal distribution; when impact = 1, they were generated from a normal distribution with mean −1 and variance 1. Under the unbalanced DIF conditions, \( d_{j} - b_{j} = 0.5 \) for all DIF items; under the balanced DIF conditions, \( d_{j} - b_{j} = 0.5 \) for the first half of the DIF items and \( d_{j} - b_{j} = - 0.5 \) for the second half. We fixed α, the Type I error rate of each test, at 0.05.
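As an illustration, the following Python sketch generates one replication under an assumed condition (unbalanced DIF, 20% DIF items, impact = 1, equal 500/500 sample sizes); it is a sketch of the data-generating scheme described above, not the authors' FORTRAN code.

```python
import numpy as np

rng = np.random.default_rng(3)
J, nR, nF = 20, 500, 500
n_dif = int(0.2 * J)           # 20% DIF items
dif_size, impact = 0.5, 1.0    # unbalanced DIF; focal mean = -impact

b = rng.uniform(-1.5, 1.5, size=J)   # reference-group difficulties
d = b.copy()
d[:n_dif] += dif_size                # all DIF items favor the reference group
theta_R = rng.normal(0.0, 1.0, size=nR)
theta_F = rng.normal(-impact, 1.0, size=nF)

def gen_responses(theta, diff):
    """Draw 0/1 responses from the Rasch model in Eq. (3)."""
    p = 1.0 / (1.0 + np.exp(-(theta[:, None] - diff[None, :])))
    return (rng.uniform(size=p.shape) < p).astype(int)

Y_R, Y_F = gen_responses(theta_R, b), gen_responses(theta_F, d)
```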

To construct the credible intervals, we produced 11,000 MCMC draws, with the first 1000 draws discarded as burn-in. A total of 100 replications were carried out under each condition. The performance of the two methods was compared in terms of the FPR and TPR: the FPR was the rate at which DIF-free items were misclassified as having DIF, whereas the TPR was the rate at which DIF items were correctly classified as having DIF. The FPR averaged across the DIF-free items and the TPR averaged across the DIF items were reported for each method. Both the OR and CI methods were implemented in FORTRAN with IMSL subroutines, and the code is available upon request.
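For completeness, given the binary flags returned by either method and the known simulation truth, the per-replication FPR and TPR can be computed as in the short sketch below.

```python
import numpy as np

def fpr_tpr(flags, is_dif):
    """FPR: share of DIF-free items flagged as DIF;
    TPR: share of true DIF items flagged as DIF.
    flags, is_dif : boolean arrays of length J."""
    fpr = flags[~is_dif].mean() if (~is_dif).any() else np.nan
    tpr = flags[is_dif].mean() if is_dif.any() else np.nan
    return fpr, tpr
```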

3.2 Results

The averaged FPR and TPR of the two DIF detection methods for equal (500/500) and unequal (800/200) sample sizes are listed in Tables 1 and 2, respectively. As expected, both methods yielded well-controlled FPR under the no-DIF (0% DIF items) and balanced DIF conditions, although the OR method was slightly conservative. Following Jin, Chen, and Wang (2018), an FPR of 7.5% or higher was defined as inflated in the present study. Under the unbalanced DIF conditions, the OR method yielded slightly inflated FPR only when tests had 40% or more DIF items, whereas the CI method yielded inflated FPR when tests had 20% or more DIF items.

The TPR of the CI method was higher than that of the OR method under the following two conditions: (i) the balanced DIF conditions and (ii) the unbalanced DIF conditions with 10% DIF items. Under these two conditions, the ratio of the TPR of the CI method to that of the OR method with scale purification ranged from 1.01 to 1.27, and it was larger for unequal (800/200) than for equal (500/500) sample sizes. For a total sample size of 1000, the TPR for equal (500/500) sample sizes was higher than that for unequal (800/200) sample sizes. In general, both the FPR and TPR increased with the percentage of DIF items. The TPR for balanced DIF was higher than that for unbalanced DIF, except for the OR method with impact = 0 and equal (500/500) sample sizes. In general, the TPR was higher when impact = 0 than when impact = 1. The purification procedure increased the TPR under the unbalanced DIF conditions, and the higher the percentage of DIF items, the higher the ratio of the TPR of the OR method with scale purification to that without. By contrast, the purification procedure did not increase the TPR under the balanced DIF conditions.

Table 1 Averaged FPR (%) and TPR (%) under the conditions with sample sizes of the reference and the focal groups: 500/500
Table 2 Averaged FPR (%) and TPR (%) under the conditions with sample sizes of the reference and the focal groups: 800/200

4 Application

In this section, the CI and OR methods were applied to data from the physics examination of the 2010 Department Required Test for college entrance in Taiwan, provided by the College Entrance Examination Center (CEEC). Each examinee was required to answer 26 questions within 80 min. The 26 questions were divided into three parts, the total score was 100, and the test was administered under formula-scoring directions. The first part comprised 20 multiple-choice questions, each requiring one correct answer to be chosen out of 5 choices; each correct answer was worth 3 points, and 3/4 of a point was deducted from the raw score for each incorrect answer. The second part comprised 4 multiple-response questions, each with 5 choices, and examinees needed to select all choices that applied. The choices within each question were knowledge-related but were answered and graded separately: 1 point was earned for each correct choice, and 1 point was deducted from the raw score for each incorrect choice. The final adjusted score for each of these two parts had a floor of 0. The last part comprised 2 calculation problems worth 20 points in total.

The data from 1000 randomly sampled examinees contained the original responses, including nonresponse information; following Chang et al. (2014), we treated nonresponses and incorrect answers in the same way and coded both as \( Y_{pj} = 0 \). For the calculation part, the response \( Y_{pj} \) was coded as 1 whenever the original score exceeded 7.5 out of 10 points, and 0 otherwise (see also Chang et al., 2014). We considered males and females as the reference and focal groups, respectively; among the 1000 examinees, 692 were male and 308 were female.
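In code, this recoding is a simple thresholding step. The sketch below uses randomly generated hypothetical stand-ins for the raw data, with nonresponses stored as NaN; the array names and shapes are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(4)
# Hypothetical stand-ins for the raw data (assumptions for illustration):
raw_mc = rng.choice([0.0, 1.0, np.nan], size=(1000, 24), p=[0.4, 0.5, 0.1])
raw_calc = rng.uniform(0, 10, size=(1000, 2))      # scores out of 10 points

Y_mc = np.nan_to_num(raw_mc, nan=0.0).astype(int)  # nonresponse coded as 0
Y_calc = (raw_calc > 7.5).astype(int)              # 1 if more than 7.5 points
Y = np.hstack([Y_mc, Y_calc])                      # 1000 x 26 response matrix
```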

We produced more MCMC draws than in Sect. 3: specifically, 40,000 draws, with the first 10,000 discarded as burn-in. We then tested δj = 0 for j = 1, …, 26, again with α = 0.05. The intervals \( {\hat{\uplambda}}_{j} \pm z_{\alpha /2} \times \sigma ({\hat{\uplambda}}_{j} ) \) for the OR method, which were the same with and without purification, and the credible intervals obtained from the real data are summarized in Table 3. The medians of \( {\hat{\uplambda}}_{1} ,{\hat{\uplambda}}_{2} , \ldots ,{\hat{\uplambda}}_{J} \) before and after purification were 0.5687 and 0.6163, respectively, so the OR method identified Items 3, 5, 8, 19, and 23 as DIF items; these are underlined and bolded in Table 3. Table 3 also shows that the CI method identified not only Items 3, 5, 8, 19, and 23 as DIF items, but also Items 6, 10, 18, and 25. Based on the results from the OR method, the real data could be contaminated with unbalanced DIF items, because the intervals of the identified DIF items all fell on the same side of the median. According to the simulation results in Tables 1 and 2, the CI method yielded inflated FPR when tests had 20% or more unbalanced DIF items.

Table 3 The intervals of the OR and CI methods for the real data

To reduce the inflated FPR of the CI method, we propose a two-stage CI method, implemented as follows. At the first stage, DIF items are detected with the CI method; let \( \left\{ {i_{1} ,i_{2} , \ldots ,i_{k} } \right\} \) be the set of DIF items so identified. At the second stage, for j = 1, …, k, we check whether item \( i_{j} \) is a real DIF item by deleting the other flagged items, fitting the Rasch model with only item \( i_{j} \) and the non-DIF items, and applying the CI method to item \( i_{j} \) again. With the two-stage CI method, the identified DIF items were Items 3, 5, 6, 8, 19, 23, and 25; the credible intervals of these items are underlined and bolded in Table 3. Items 10 and 18 were identified as DIF items at the first stage but not at the second stage; the credible intervals of these two items are marked in italic and underlined in Table 3.
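The two-stage procedure can be expressed compactly. The Python sketch below assumes a hypothetical helper `ci_flags(Y, group, items, alpha)` that fits the Rasch model to the listed items and returns the subset of `items` flagged by the CI method; that helper stands in for the full MCMC machinery of Sect. 2.1.

```python
def two_stage_ci(Y, group, J, ci_flags, alpha=0.05):
    """Two-stage CI method (sketch; `ci_flags` is a hypothetical helper).

    Stage 1: flag DIF items with the CI method on all J items.
    Stage 2: retest each flagged item together with only the
    non-flagged items, and keep it only if it is flagged again.
    """
    all_items = list(range(J))
    stage1 = ci_flags(Y, group, all_items, alpha)   # {i_1, ..., i_k}
    clean = [j for j in all_items if j not in stage1]
    return [i for i in stage1
            if i in ci_flags(Y, group, clean + [i], alpha)]
```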

5 Concluding Remarks

In this article, we compared the finite-sample performance of the CI and OR methods for detecting whether the responses of the focal group require an additional difficulty parameter when the data follow the Rasch model. Simulation studies showed that the CI method worked better than the OR method under the balanced DIF conditions, whereas the CI method yielded inflated FPR under the unbalanced DIF conditions. The two methods were also applied to an empirical example. Comparing these two methods under other IRT models would be an interesting line of future research.