Definition
Differential item functioning refers to the situation where members from different groups (age, gender, race, education, culture) on the same level of the latent trait (disease severity, quality of life) have a different probability of giving a certain response to a particular item.
Description
Differential item functioning (DIF) is a threat to the validity of a patient-reported outcome (PRO) instrument. DIF occurs when subjects at the same level of the latent trait, such as disease severity, answer the same item differently depending on their group membership (e.g., age group, gender, race) (Chang, 2005; Holland & Thayer, 1988). The validity of the instrument is threatened because the response to the DIF item is governed by something other than the construct that the instrument is intended to measure. For example, crying spells are a symptom of depression, but this concept is reported more often by women than by men with the same level of depression severity (Teresi et al., 2009). An item asking about the amount of crying is therefore likely to underestimate the severity of men relative to women who are otherwise equally depressed. The crying item is said to exhibit DIF due to gender when assessing depression.
DIF has been studied in educational testing since the 1960s (Berk, 1982), where it was usually referred to as "item bias." A test item is biased when it favors one group of test takers (making the item easier to answer correctly) and disadvantages another (making it harder to answer correctly). The term DIF was later coined from the recognition that the differential behavior of an item does not always arise from bias. This is almost always the case for items in PRO instruments: a PRO item usually does not favor a particular group of patients; it simply behaves like a different item depending on group membership.
A group difference on a scale assessing a latent construct is often confused with DIF. That the average scores on an item differ between two groups is not evidence of DIF for that item: two groups of subjects with different distributions on the latent construct are expected to have different average scores. However, subjects from both groups who are at the same level of the latent construct are expected to answer the same items similarly regardless of their group membership. DIF is suspected only when subjects at the same level of the latent construct answer the same items differently. This is the most important part of the definition of DIF. In mathematical terms, DIF is a discrepancy between two groups of subjects in the probability of a response to an item conditioned on the level of the latent construct.
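The conditional-probability definition can be illustrated with a minimal sketch, assuming a simple one-parameter logistic response model; the parameter values and group labels below are hypothetical, chosen only to show how two groups can differ in response probability at the same trait level:

```python
import math

def response_prob(theta, difficulty):
    """Probability of endorsing an item given latent trait theta (logistic model)."""
    return 1.0 / (1.0 + math.exp(-(theta - difficulty)))

# Hypothetical crying item: same depression severity (theta = 1.0),
# but the item behaves as if it were "easier" to endorse for the focal group.
theta = 1.0
p_reference = response_prob(theta, difficulty=1.5)  # e.g., men
p_focal = response_prob(theta, difficulty=0.5)      # e.g., women

# DIF: the conditional probabilities differ at the same trait level
print(round(p_reference, 3), round(p_focal, 3))
```

Both subjects sit at the same point on the latent continuum, yet the focal-group probability is markedly higher, which is exactly the conditional discrepancy that defines DIF.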
Early methods for DIF detection relied on the assumption that, for an item with DIF, the percentage of test takers answering the item correctly will differ between the two groups. In classical test theory, this percentage is referred to as item difficulty and denoted as the p value of an item. The delta plot (Angoff, 1984; Angoff & Ford, 1973) involves a transformation of these p values. Several methods involve constructing three-way contingency tables of frequency counts; these include Scheuneman's chi-square (Scheuneman, 1979), Camilli's chi-square (Ironson, 1982), and the Mantel-Haenszel procedure (MH D-DIF) (Holland & Thayer, 1988). Generally, for all DIF detection methods, groups are first identified based on the group membership of particular interest (e.g., age >60 vs. <60, male vs. female, Blacks vs. Whites, high school graduates vs. those with some college education). It is customary to refer to one group as the reference group and the other as the focal group. Subjects are then divided into subgroups using a matching criterion applied to both the reference and focal groups. The criterion should be a measure of the level of the latent construct being assessed, usually the score on the instrument itself. The percentages of correct (and incorrect) answers to the item under investigation in the two groups are then compared across the subgroups. Unlike educational tests, the concept of a correct answer does not apply to PRO instruments in health-care research. Many PRO instruments use Likert-type scales, and some include numeric rating scales (NRS) and visual analogue scales (VAS), which are continuous. In these cases, where there are more than two item responses (i.e., polytomous item responses), the generalized Mantel-Haenszel procedure can be used for DIF detection (Fidalgo & Madeira, 2008; Zwick, Donoghue, & Grima, 1993).
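As an illustration of the contingency-table approach, the sketch below computes the Mantel-Haenszel common odds ratio across score-matched strata and converts it to the ETS delta-scale effect size (MH D-DIF = −2.35 ln α). The 2×2 counts are invented for illustration; a real analysis would build them from observed responses:

```python
import math

# Each stratum matches reference and focal subjects on total score.
# Per stratum: A = reference endorsed, B = reference not endorsed,
#              C = focal endorsed,     D = focal not endorsed.
strata = [
    # (A, B, C, D) -- hypothetical counts for one item, three score strata
    (40, 60, 25, 75),
    (55, 45, 40, 60),
    (70, 30, 55, 45),
]

def mantel_haenszel_alpha(strata):
    """Mantel-Haenszel common odds ratio pooled over 2x2 tables."""
    num = sum(a * d / (a + b + c + d) for a, b, c, d in strata)
    den = sum(b * c / (a + b + c + d) for a, b, c, d in strata)
    return num / den

alpha = mantel_haenszel_alpha(strata)
# ETS delta-scale effect size; negative values indicate the item
# favors the reference group over the matched focal group.
mh_d_dif = -2.35 * math.log(alpha)
print(round(alpha, 3), round(mh_d_dif, 3))
```

Because α > 1 here (the reference group's odds of endorsement are higher at every matched score level), the MH D-DIF value is negative, flagging the item for DIF against the focal group.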
With the advances of item response theory (IRT) in the psychometric analysis of items and instrument construction, DIF detection methods based on IRT were also developed. Because IRT consists of mathematical models already expressed as the conditional probability of a certain item response given the latent trait, its application to DIF detection is straightforward. One approach is to compare the item characteristic curves (ICC) between the reference and focal groups (Linn, Levine, Hastings, & Wardrop, 1981; Rudner, 1977). An alternative to comparing the ICCs is to compare the item parameters directly (Hambleton & Swaminathan, 1985; Muthen & Lehman, 1985; Thissen, Steinberg, & Gerrard, 1986), since the item parameters determine the ICC. A more elaborate approach uses a likelihood ratio test, comparing a model in which the parameters of the target items are estimated separately for the reference and focal groups against one in which these parameters are constrained to be equal (Thissen, Steinberg, & Wainer, 1988). A special feature of these IRT-based methods is that they do not require computing the level of the latent construct for each individual and matching individuals into subgroups. While these IRT-based methods were first developed for dichotomous response items, they can be extended to polytomous response items, which are more common in health outcome assessments.
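The ICC-comparison idea can be sketched by measuring the area between the curves implied by the two groups' item parameters. The sketch below assumes a two-parameter logistic (2PL) model with hypothetical parameter values; an area near zero would indicate the item functions the same in both groups:

```python
import math

def icc_2pl(theta, a, b):
    """2PL item characteristic curve: P(endorse | theta)."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

# Hypothetical item parameters estimated separately in each group
ref_params = (1.2, 0.0)    # discrimination a, difficulty b (reference group)
focal_params = (1.2, 0.5)  # same discrimination, shifted difficulty (focal group)

# Unsigned area between the two ICCs, approximated on a theta grid
step = 0.01
thetas = [-4 + i * step for i in range(800)]  # grid over [-4, 4)
area = sum(
    abs(icc_2pl(t, *ref_params) - icc_2pl(t, *focal_params)) * step
    for t in thetas
)
print(round(area, 2))
```

With equal discriminations, the curves never cross, so the item shows uniform DIF and the area is close to the difficulty difference (0.5); crossing curves (unequal discriminations) would instead signal nonuniform DIF.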
Two methods commonly used for DIF detection with polytomous response items are SIBTEST (Chang, Mazzeo, & Roussos, 1996; Shealy & Stout, 1993a, 1993b) and ordinal logistic regression (Zumbo, 1999). SIBTEST is based on a multidimensional model of DIF that formally defines a second latent trait contributing to the DIF. Many variations of the SIBTEST method have been developed since its introduction. The ordinal logistic regression method of detecting DIF is a straightforward application of logistic regression modeling. Because an interaction term of latent trait by group can be included in the model, the logistic regression method can detect nonuniform DIF directly. The concepts of uniform and nonuniform DIF were first introduced by Mellenbergh (1982); nonuniform DIF is referred to as crossing DIF in the SIBTEST framework (Li & Stout, 1996). With uniform DIF, the discrepancy between the reference and focal groups is constant across the range of the latent construct; with nonuniform DIF, the discrepancy differs depending on the level of the latent construct. More recently, a method based on the multiple indicators, multiple causes (MIMIC) confirmatory factor analysis model was developed for DIF detection (Finch, 2005).
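The logistic regression approach can be sketched as follows: regress the item response on the latent trait, a group indicator, and their interaction; a nonzero group main effect signals uniform DIF, and a nonzero interaction signals nonuniform DIF. The simulation and plain gradient-ascent fit below are a minimal sketch for a binary item, not Zumbo's implementation (which extends to ordinal responses); the trait is treated as known for illustration, whereas in practice the total score serves as its proxy:

```python
import math
import random

random.seed(0)

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Simulate binary item responses with uniform DIF only
# (the group shifts the intercept; the slope is the same in both groups).
data = []
for _ in range(1000):
    theta = random.gauss(0, 1)
    group = random.randint(0, 1)                    # 0 = reference, 1 = focal
    true_logit = -0.5 + 1.0 * theta + 0.8 * group   # 0.8 = uniform DIF effect
    y = 1 if random.random() < sigmoid(true_logit) else 0
    data.append((theta, group, y))

# Fit logit P(y=1) = b0 + b1*theta + b2*group + b3*theta*group
# by gradient ascent on the log-likelihood.
b = [0.0, 0.0, 0.0, 0.0]
lr = 0.5
for _ in range(1000):
    grad = [0.0] * 4
    for theta, group, y in data:
        x = (1.0, theta, group, theta * group)
        err = y - sigmoid(sum(bi * xi for bi, xi in zip(b, x)))
        for j in range(4):
            grad[j] += err * x[j]
    for j in range(4):
        b[j] += lr * grad[j] / len(data)

# b[2] (group main effect) captures uniform DIF;
# b[3] (interaction) would capture nonuniform DIF and should be near zero here.
print([round(v, 2) for v in b])
```

Testing whether adding the group term, and then the interaction term, significantly improves model fit is what turns this into a formal DIF test in practice.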
The usual treatment of a DIF item in educational testing is to remove the item from the test. While items with DIF are a serious problem in educational testing, especially if a DIF item biases against a certain group of test takers, the role of DIF in health outcomes research is less clear. Multiple factors affect the assessment of health-related quality of life and other patient-reported outcomes, and in many cases differences in item responses are expected for members of different groups. An item may cover content that is very important for assessing a health-related quality of life or patient-reported outcome even though different groups of subjects respond to it differently. In this case, the item should be kept in the instrument, but each individual's response should be compared within his or her group. This requires the item to be treated differently according to group membership, e.g., with different item scores for male and female respondents. An ideal assessment of health-related quality of life or patient-reported outcomes tailors the items to the unique characteristics of the individual for maximum information. This is possible using computerized adaptive testing with a large item bank. In that case, items with DIF are no longer a threat to the validity of the test; they become an asset of the instrument.
References
Angoff, W. H. (1984). Scales, norms, and equivalent scores. Princeton, NJ: Educational Testing Service.
Angoff, W. H., & Ford, S. F. (1973). Item-race interaction on a test of scholastic aptitude. Journal of Educational Measurement, 10, 95–105.
Berk, R. A. (Ed.). (1982). Handbook of methods for detecting test bias. Baltimore: Johns Hopkins University Press.
Chang, C. H. (2005). Item response theory and beyond: Advances in patient-reported outcomes measurement. In W. R. Lenderking & D. A. Revicki (Eds.), Advancing health outcomes research methods and clinical applications (pp. 37–55). McLean, VA: Degnon Associates.
Chang, H., Mazzeo, J., & Roussos, R. (1996). Detect DIF for polytomously scored items: An adaptation of Shealy-Stout’s SIBTEST procedure. Journal of Educational Measurement, 33, 333–353.
Fidalgo, A. M., & Madeira, J. M. (2008). Generalized Mantel-Haenszel methods for differential item functioning detection. Educational and Psychological Measurement, 68(6), 940–958.
Finch, H. (2005). The MIMIC model as a method for detecting DIF: Comparison with Mantel-Haenszel, SIBTEST, and the IRT likelihood ratio. Applied Psychological Measurement, 29(4), 278–295.
Hambleton, R. K., & Swaminathan, H. (1985). Item response theory: Principles and applications. Hingham, MA: Kluwer.
Holland, P. W., & Thayer, D. T. (1988). Differential item performance and the Mantel-Haenszel procedure. In H. Wainer & H. I. Braun (Eds.), Test validity (pp. 129–145). Hillsdale, NJ: LEA.
Ironson, G. H. (1982). Use of chi-square and latent trait approaches for detecting item bias. In R. A. Berk (Ed.), Handbook of methods for detecting test bias (pp. 117–160). Baltimore: Johns Hopkins University Press.
Li, H.-H., & Stout, W. (1996). A new procedure for detection of crossing DIF. Psychometrika, 61, 647–677.
Linn, R. L., Levine, M. V., Hastings, C. N., & Wardrop, J. L. (1981). An investigation of item bias in a test of reading comprehension. Applied Psychological Measurement, 5, 159–173.
Mellenbergh, G. J. (1982). Contingency table models for assessing item bias. Journal of Educational Statistics, 7(2), 105–118.
Muthen, B., & Lehman, J. (1985). Multiple group IRT modeling: Application to item bias analysis. Journal of Educational Statistics, 10, 133–142.
Rudner, L. M. (1977, April). An approach to biased item identification using latent trait measurement theory. Paper presented at the annual meeting of the American Educational Research Association, New York.
Scheuneman, J. D. (1979). A new method of assessing bias in test items. Journal of Educational Measurement, 16, 143–152.
Shealy, R., & Stout, W. (1993a). An item response theory model for test bias and differential test functioning. In P. Holland & H. Wainer (Eds.), Differential item functioning (pp. 197–240). Hillsdale, NJ: Erlbaum.
Shealy, R., & Stout, W. (1993b). A model-based standardization approach that separates true bias/DIF from group ability differences and detects test bias/DIF as well as item bias/DTF. Psychometrika, 58, 159–194.
Teresi, J. A., Ocepek-Welikson, K., Kleinman, M., Eimicke, J. P., Crane, P. K., Jones, R. N., et al. (2009). Analysis of differential item functioning in the depression item bank from the Patient Reported Outcome Measurement Information System (PROMIS): An item response theory approach. Psychology Science Quarterly, 51(2), 148–180.
Thissen, D., Steinberg, L., & Gerrard, M. (1986). Beyond group mean differences: The concept of item bias. Psychological Bulletin, 99, 118–128.
Thissen, D., Steinberg, L., & Wainer, H. (1988). Use of item response theory in the study of group differences in the trace lines. In H. Wainer & H. I. Braun (Eds.), Test validity (pp. 147–168). Hillsdale, NJ: LEA.
Zumbo, B. D. (1999). A handbook on the theory and methods of differential item functioning (DIF): Logistic regression modeling as a unitary framework for binary and Likert-type (ordinal) item scores. Ottawa, ON: Directorate of Human Resources Research and Evaluation, Department of National Defense.
Zwick, R., Donoghue, J. R., & Grima, A. (1993). Assessment of differential item functioning for performance tasks. Journal of Educational Measurement, 30, 233–251.
Copyright information
© 2014 Springer Science+Business Media Dordrecht
Cite this entry
Chen, WH., Revicki, D. (2014). Differential Item Functioning (DIF). In: Michalos, A.C. (eds) Encyclopedia of Quality of Life and Well-Being Research. Springer, Dordrecht. https://doi.org/10.1007/978-94-007-0753-5_728
Print ISBN: 978-94-007-0752-8
Online ISBN: 978-94-007-0753-5