Definition
Differential item functioning refers to the situation where members from different groups (age, gender, race, education, culture) on the same level of the latent trait (disease severity, quality of life) have a different probability of giving a certain response to a particular item.
Description
Differential item functioning (DIF) is a threat to the validity of a patient-reported outcome (PRO) instrument. DIF occurs when subjects at the same level of the latent trait, such as disease severity, answer the same item differently depending on their group membership (e.g., age group, gender, race) (Chang, 2005; Holland & Thayer, 1988). The validity of the instrument is threatened because the response to the DIF item is governed by something other than the construct that the instrument is intended to measure. For example, crying spells are a symptom of depression, but this concept is reported more often by women than by men with the same level of depression severity (Teresi et al., 2009). An item asking about the amount of crying is therefore likely to underestimate the severity of men relative to women who are otherwise equally depressed. The crying item is said to exhibit DIF due to gender when assessing depression.
DIF has been studied in educational testing since the 1960s (Berk, 1982), where it was usually referred to as "item bias." A test item is biased when it favors one group of test takers (making the item easier to answer correctly) and disadvantages another (making it harder to answer correctly). The term DIF was later coined from the recognition that the differential behavior of an item does not always arise from bias. This is almost always the case for items in PRO instruments: a PRO item usually does not favor a particular group of patients; it simply behaves like a different item depending on group membership.
A group difference on a scale assessing a latent construct is often confused with DIF. That the average scores on an item differ between two groups is not evidence of DIF for that item: two groups of subjects with different distributions on the latent construct are expected to have different average scores. However, subjects from both groups who are at the same level of the latent construct are expected to answer the same items similarly regardless of their group membership. DIF is suspected only when subjects at the same level of the latent construct answer the same items differently. This is the most important part of the definition of DIF. In mathematical terms, DIF is a discrepancy between two groups of subjects in the probability of a response to an item conditioned on the level of the latent construct.
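The conditional-probability definition can be illustrated with a minimal sketch, assuming a simple one-parameter logistic response model; the parameter values and group labels below are hypothetical, chosen only to show how two groups can differ in response probability at the same trait level:

```python
import math

def response_prob(theta, difficulty):
    """Probability of endorsing an item given latent trait theta (logistic model)."""
    return 1.0 / (1.0 + math.exp(-(theta - difficulty)))

# Hypothetical crying item: same depression severity (theta = 1.0),
# but the item behaves as if it were "easier" to endorse for the focal group.
theta = 1.0
p_reference = response_prob(theta, difficulty=1.5)  # e.g., men
p_focal = response_prob(theta, difficulty=0.5)      # e.g., women

# DIF: the conditional probabilities differ at the same trait level
print(round(p_reference, 3), round(p_focal, 3))
```

Both subjects sit at the same point on the latent continuum, yet the focal-group probability is markedly higher, which is exactly the conditional discrepancy that defines DIF.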
Early methods for DIF detection relied on the assumption that, for an item with DIF, the percentage of test takers answering the item correctly will differ between the two groups. In classical test theory, this percentage is referred to as item difficulty and denoted as the p value of an item. The delta plot (Angoff, 1984; Angoff & Ford, 1973) involves a transformation of these p values. Several methods involve constructing three-way contingency tables of frequency counts; these include Scheuneman's chi-square (Scheuneman, 1979), Camilli's chi-square (Ironson, 1982), and the Mantel-Haenszel procedure (MH D-DIF) (Holland & Thayer, 1988). Generally, for all DIF detection methods, groups are first identified based on the group membership of particular interest (e.g., age >60 vs. <60, male vs. female, Blacks vs. Whites, high school graduates vs. those with some college education). It is customary to refer to one group as the reference group and the other as the focal group. Subjects are then divided into subgroups using a matching criterion applied to both the reference and focal groups. The criterion should be a measure of the level of the latent construct being assessed, usually the score on the instrument itself. The percentages of correct (and incorrect) answers to the item under investigation in the two groups are then compared across the subgroups. Unlike educational tests, the concept of a correct answer does not apply to PRO instruments in health-care research. Many PRO instruments use Likert-type scales, and some include numeric rating scales (NRS) and visual analogue scales (VAS), which are continuous. In these cases, where there are more than two item responses (i.e., polytomous item responses), the generalized Mantel-Haenszel procedure can be used for DIF detection (Fidalgo & Madeira, 2008; Zwick, Donoghue, & Grima, 1993).
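As an illustration of the contingency-table approach, the sketch below computes the Mantel-Haenszel common odds ratio across score-matched strata and converts it to the ETS delta-scale effect size (MH D-DIF = −2.35 ln α). The 2×2 counts are invented for illustration; a real analysis would build them from observed responses:

```python
import math

# Each stratum matches reference and focal subjects on total score.
# Per stratum: A = reference endorsed, B = reference not endorsed,
#              C = focal endorsed,     D = focal not endorsed.
strata = [
    # (A, B, C, D) -- hypothetical counts for one item, three score strata
    (40, 60, 25, 75),
    (55, 45, 40, 60),
    (70, 30, 55, 45),
]

def mantel_haenszel_alpha(strata):
    """Mantel-Haenszel common odds ratio pooled over 2x2 tables."""
    num = sum(a * d / (a + b + c + d) for a, b, c, d in strata)
    den = sum(b * c / (a + b + c + d) for a, b, c, d in strata)
    return num / den

alpha = mantel_haenszel_alpha(strata)
# ETS delta-scale effect size; negative values indicate the item
# favors the reference group over the matched focal group.
mh_d_dif = -2.35 * math.log(alpha)
print(round(alpha, 3), round(mh_d_dif, 3))
```

Because α > 1 here (the reference group's odds of endorsement are higher at every matched score level), the MH D-DIF value is negative, flagging the item for DIF against the focal group.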
With the advances of item response theory (IRT) in the psychometric analysis of items and instrument construction, DIF detection methods based on IRT were also developed. Because IRT consists of mathematical models already expressed as the conditional probability of a certain item response given the latent trait, its application to DIF detection is straightforward. One approach is to compare the item characteristic curves (ICC) between the reference and focal groups (Linn, Levine, Hastings, & Wardrop, 1981; Rudner, 1977). An alternative to comparing the ICCs is to compare the item parameters directly (Hambleton & Swaminathan, 1985; Muthen & Lehman, 1985; Thissen, Steinberg, & Gerrard, 1986), since the item parameters determine the ICC. A more elaborate approach uses a likelihood ratio test, comparing a model in which the parameters of the target items are estimated separately for the reference and focal groups against one in which these parameters are constrained to be equal (Thissen, Steinberg, & Wainer, 1988). A special feature of these IRT-based methods is that they do not require computing the level of the latent construct for each individual and matching individuals into subgroups. While these IRT-based methods were first developed for dichotomous response items, they can be extended to polytomous response items, which are more common in health outcome assessments.
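The ICC-comparison idea can be sketched by measuring the area between the curves implied by the two groups' item parameters. The sketch below assumes a two-parameter logistic (2PL) model with hypothetical parameter values; an area near zero would indicate the item functions the same in both groups:

```python
import math

def icc_2pl(theta, a, b):
    """2PL item characteristic curve: P(endorse | theta)."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

# Hypothetical item parameters estimated separately in each group
ref_params = (1.2, 0.0)    # discrimination a, difficulty b (reference group)
focal_params = (1.2, 0.5)  # same discrimination, shifted difficulty (focal group)

# Unsigned area between the two ICCs, approximated on a theta grid
step = 0.01
thetas = [-4 + i * step for i in range(800)]  # grid over [-4, 4)
area = sum(
    abs(icc_2pl(t, *ref_params) - icc_2pl(t, *focal_params)) * step
    for t in thetas
)
print(round(area, 2))
```

With equal discriminations, the curves never cross, so the item shows uniform DIF and the area is close to the difficulty difference (0.5); crossing curves (unequal discriminations) would instead signal nonuniform DIF.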
Two methods commonly used for DIF detection with polytomous response items are SIBTEST (Chang, Mazzeo, & Roussos, 1996; Shealy & Stout, 1993a, 1993b) and ordinal logistic regression (Zumbo, 1999). SIBTEST is based on a multidimensional model of DIF that formally defines a second latent trait contributing to the DIF. Many variations of the SIBTEST method have been developed since its introduction. The ordinal logistic regression method of detecting DIF is a straightforward application of logistic regression modeling. Because an interaction term of latent trait by group can be included in the model, the logistic regression method can detect nonuniform DIF directly. The concepts of uniform and nonuniform DIF were first introduced by Mellenbergh (1982); nonuniform DIF is referred to as crossing DIF in the SIBTEST framework (Li & Stout, 1996). With uniform DIF, the discrepancy between the reference and focal groups is constant across the range of the latent construct; with nonuniform DIF, the discrepancy differs depending on the level of the latent construct. More recently, a method based on the multiple indicators, multiple causes (MIMIC) confirmatory factor analysis model was developed for DIF detection (Finch, 2005).
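The logistic regression approach can be sketched as follows: regress the item response on the latent trait, a group indicator, and their interaction; a nonzero group main effect signals uniform DIF, and a nonzero interaction signals nonuniform DIF. The simulation and plain gradient-ascent fit below are a minimal sketch for a binary item, not Zumbo's implementation (which extends to ordinal responses); the trait is treated as known for illustration, whereas in practice the total score serves as its proxy:

```python
import math
import random

random.seed(0)

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Simulate binary item responses with uniform DIF only
# (the group shifts the intercept; the slope is the same in both groups).
data = []
for _ in range(1000):
    theta = random.gauss(0, 1)
    group = random.randint(0, 1)                    # 0 = reference, 1 = focal
    true_logit = -0.5 + 1.0 * theta + 0.8 * group   # 0.8 = uniform DIF effect
    y = 1 if random.random() < sigmoid(true_logit) else 0
    data.append((theta, group, y))

# Fit logit P(y=1) = b0 + b1*theta + b2*group + b3*theta*group
# by gradient ascent on the log-likelihood.
b = [0.0, 0.0, 0.0, 0.0]
lr = 0.5
for _ in range(1000):
    grad = [0.0] * 4
    for theta, group, y in data:
        x = (1.0, theta, group, theta * group)
        err = y - sigmoid(sum(bi * xi for bi, xi in zip(b, x)))
        for j in range(4):
            grad[j] += err * x[j]
    for j in range(4):
        b[j] += lr * grad[j] / len(data)

# b[2] (group main effect) captures uniform DIF;
# b[3] (interaction) would capture nonuniform DIF and should be near zero here.
print([round(v, 2) for v in b])
```

Testing whether adding the group term, and then the interaction term, significantly improves model fit is what turns this into a formal DIF test in practice.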
The usual treatment of a DIF item in educational testing is to remove the item from the test. While items with DIF are a serious problem in educational testing, especially if a DIF item biases against a certain group of test takers, the role of DIF in health outcomes research is less clear. Multiple factors affect the assessment of health-related quality of life and other patient-reported outcomes, and in many cases differences in item responses are expected for members of different groups. An item may cover content that is very important for assessing a health-related quality of life or patient-reported outcome even though different groups of subjects respond to it differently. In this case, the item should be kept in the instrument, but each individual's response should be compared within his or her group. This requires the item to be treated differently according to group membership, e.g., with different item scores for male and female respondents. An ideal assessment of health-related quality of life or patient-reported outcomes tailors the items to the unique characteristics of the individual for maximum information. This is possible using computerized adaptive testing with a large item bank. In that case, items with DIF are no longer a threat to the validity of the test; they become an asset of the instrument.
References
Angoff, W. H. (1984). Scales, norms, and equivalent scores. Princeton, NJ: Educational Testing Service.
Angoff, W. H., & Ford, S. F. (1973). Item-race interaction on a test of scholastic aptitude. Journal of Educational Measurement, 10, 95–105.
Berk, R. A. (Ed.). (1982). Handbook of methods for detecting test bias. Baltimore: Johns Hopkins University Press.
Chang, C. H. (2005). Item response theory and beyond: Advances in patient-reported outcomes measurement. In W. R. Lenderking & D. A. Revicki (Eds.), Advancing health outcomes research methods and clinical applications (pp. 37–55). McLean, VA: Degnon Associates.
Chang, H., Mazzeo, J., & Roussos, R. (1996). Detect DIF for polytomously scored items: An adaptation of Shealy-Stout’s SIBTEST procedure. Journal of Educational Measurement, 33, 333–353.
Fidalgo, A. M., & Madeira, J. M. (2008). Generalized Mantel-Haenszel methods for differential item functioning detection. Educational and Psychological Measurement, 68(6), 940–958.
Finch, H. (2005). The MIMIC model as a method for detecting DIF: Comparison with Mantel-Haenszel, SIBTEST, and the IRT likelihood ratio. Applied Psychological Measurement, 29(4), 278–295.
Hambleton, R. K., & Swaminathan, H. (1985). Item response theory: Principles and applications. Hingham, MA: Kluwer.
Holland, P. W., & Thayer, D. T. (1988). Differential item performance and the Mantel-Haenszel procedure. In H. Wainer & H. I. Braun (Eds.), Test validity (pp. 129–145). Hillsdale, NJ: LEA.
Ironson, G. H. (1982). Use of chi-square and latent trait approaches for detecting item bias. In R. A. Berk (Ed.), Handbook of methods for detecting test bias (pp. 117–160). Baltimore: Johns Hopkins University Press.
Li, H.-H., & Stout, W. (1996). A new procedure for detection of crossing DIF. Psychometrika, 61, 647–677.
Linn, R. L., Levine, M. V., Hastings, C. N., & Wardrop, J. L. (1981). An investigation of item bias in a test of reading comprehension. Applied Psychological Measurement, 5, 159–173.
Mellenbergh, G. J. (1982). Contingency table models for assessing item bias. Journal of Educational Statistics, 7(2), 105–118.
Muthen, B., & Lehman, J. (1985). Multiple group IRT modeling: Application to item bias analysis. Journal of Educational Statistics, 10, 133–142.
Rudner, L. M. (1977, April). An approach to biased item identification using latent trait measurement theory. Paper presented at the annual meeting of the American Educational Research Association, New York.
Scheuneman, J. D. (1979). A new method of assessing bias in test items. Journal of Educational Measurement, 16, 143–152.
Shealy, R., & Stout, W. (1993a). An item response theory model for test bias and differential test functioning. In P. Holland & H. Wainer (Eds.), Differential item functioning (pp. 197–240). Hillsdale, NJ: Erlbaum.
Shealy, R., & Stout, W. (1993b). A model-based standardization approach that separates true bias/DIF from group ability differences and detects test bias/DIF as well as item bias/DTF. Psychometrika, 58, 159–194.
Teresi, J. A., Ocepek-Welikson, K., Kleinman, M., Eimicke, J. P., Crane, P. K., Jones, R. N., et al. (2009). Analysis of differential item functioning in the depression item bank from the Patient Reported Outcome Measurement Information System (PROMIS): An item response theory approach. Psychology Science Quarterly, 51(2), 148–180.
Thissen, D., Steinberg, L., & Gerrard, M. (1986). Beyond group mean differences: The concept of item bias. Psychological Bulletin, 99, 118–128.
Thissen, D., Steinberg, L., & Wainer, H. (1988). Use of item response theory in the study of group differences in the trace lines. In H. Wainer & H. I. Braun (Eds.), Test validity (pp. 147–168). Hillsdale, NJ: LEA.
Zumbo, B. D. (1999). A handbook on the theory and methods of differential item functioning (DIF): Logistic regression modeling as a unitary framework for binary and Likert-type (ordinal) item scores. Ottawa, ON: Directorate of Human Resources Research and Evaluation, Department of National Defense.
Zwick, R., Donoghue, J. R., & Grima, A. (1993). Assessment of differential item functioning for performance tasks. Journal of Educational Measurement, 30, 233–251.
Copyright information
© 2014 Springer Science+Business Media Dordrecht
Cite this entry
Chen, WH., Revicki, D. (2014). Differential Item Functioning (DIF). In: Michalos, A.C. (eds) Encyclopedia of Quality of Life and Well-Being Research. Springer, Dordrecht. https://doi.org/10.1007/978-94-007-0753-5_728
Print ISBN: 978-94-007-0752-8
Online ISBN: 978-94-007-0753-5