Synonyms

Differential item performance; Item bias

Definition

Differential item functioning refers to the situation where members of different groups (e.g., age, gender, race, education, culture) who are at the same level of the latent trait (e.g., disease severity, quality of life) have a different probability of giving a certain response to a particular item.

Description

Differential item functioning (DIF) is a threat to the validity of a patient-reported outcome (PRO) instrument. DIF occurs when subjects at the same level of the latent trait, such as disease severity, answer the same item differently depending on their group membership (e.g., age group, gender, race) (Chang, 2005; Holland & Thayer, 1988). The validity of the instrument is threatened because the response to the DIF item is governed by something other than the construct that the instrument is intended to measure. For example, crying spells are a symptom of depression, but this symptom is reported more often by women than by men with the same level of depression severity (Teresi et al., 2009). An item asking about the amount of crying will therefore likely underestimate the severity of depression in men relative to women who are otherwise equally depressed. The crying item is said to exhibit DIF due to gender when assessing depression.

DIF has been studied in educational testing since the 1960s (Berk, 1982) and was usually referred to as “item bias.” A test item is biased when it favors one group of test takers (the item is easier to answer correctly) and disadvantages another (the item is harder to answer correctly). The term DIF was later coined from the recognition that the differential behavior of an item does not always arise from bias. This is almost always the case for items in PRO instruments: a PRO item usually does not favor a particular group of patients; it simply behaves like a different item depending on the group membership.

Group differences on a scale assessing a certain latent construct are often confused with DIF. A difference in average item scores between two groups is not, by itself, evidence of DIF for that item. Two groups of subjects with different distributions on the latent construct are expected to have different average scores. However, subjects at the same level of the latent construct are expected to answer the same items similarly regardless of their group membership. DIF is suspected only when subjects at the same level of the latent construct answer the same items differently. This conditioning on the latent construct is the most important part of the definition of DIF. In mathematical terms, DIF can be described as a discrepancy between two groups of subjects in the conditional probability of a response to an item, given the level of the latent construct.
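
As a notational sketch (the symbols here are introduced only for illustration: X_i denotes the response to item i, θ the latent trait, and R and F the reference and focal groups), the absence of DIF can be written as

```latex
% No DIF: the conditional response probabilities coincide at every trait level
P(X_i = x \mid \theta,\, G = \mathrm{R}) \;=\; P(X_i = x \mid \theta,\, G = \mathrm{F})
\qquad \text{for all } \theta \text{ and all response categories } x,
% and DIF is present whenever this equality fails for some values of theta.
```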

Early methods for DIF detection relied on the expectation that, for an item with DIF, the percentage of test takers answering the item correctly will differ between two groups. In classical test theory, this percentage is referred to as item difficulty and denoted as the p value of an item. The delta plot (Angoff, 1984; Angoff & Ford, 1973) involves a transformation of these p values. Several other methods involve the construction of three-way contingency tables of frequency counts; these include Scheuneman’s chi-square (Scheuneman, 1979), Camilli’s chi-square (Ironson, 1982), and the Mantel-Haenszel procedure (MH D-DIF) (Holland & Thayer, 1988). Generally, for all DIF detection methods, groups are first identified based on the group membership of particular interest (e.g., age >60 vs. <60, male vs. female, Blacks vs. Whites, high school graduates vs. those with some college education). It is customary to refer to one group as the reference group and the other as the focal group. Subjects are then divided into subgroups using a matching criterion shared by the reference and focal groups. The criterion should be a measure of the level of the latent construct being assessed, usually the score of the instrument itself. The percentages of correct (and incorrect) answers to the item under investigation are then compared between the two groups across the subgroups. Unlike educational tests, the concept of a correct answer does not apply to PRO instruments in health-care research. Many PRO instruments use Likert-type scales, and some include numeric rating scales (NRS) and visual analogue scales (VAS), which are continuous. In these cases, where there are more than two item responses (i.e., polytomous item responses), the generalized Mantel-Haenszel procedure can be used for DIF detection (Fidalgo & Madeira, 2008; Zwick, Donoghue, & Grima, 1993).
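
As an illustrative sketch only (the counts and variable names below are invented, not drawn from the entry), the Mantel-Haenszel statistics for a dichotomous item can be computed directly from the 2 × 2 tables formed at each matched score level:

```python
import numpy as np

# One 2 x 2 table per matched total-score level (stratum); counts are invented.
# Rows: reference group, focal group. Columns: endorsed (correct), not endorsed.
strata = [
    np.array([[30, 10], [25, 15]]),
    np.array([[40, 20], [30, 30]]),
    np.array([[20, 30], [12, 38]]),
]

num_or = den_or = 0.0          # numerator / denominator of the MH common odds ratio
a_sum = e_sum = v_sum = 0.0    # observed, expected, and variance terms for the chi-square

for tab in strata:
    a, b = tab[0]              # reference group: endorsed, not endorsed
    c, d = tab[1]              # focal group: endorsed, not endorsed
    t = tab.sum()
    n_r, n_f = a + b, c + d    # row (group) totals
    m1, m0 = a + c, b + d      # column totals

    num_or += a * d / t
    den_or += b * c / t

    a_sum += a
    e_sum += n_r * m1 / t                              # expected count in cell a
    v_sum += n_r * n_f * m1 * m0 / (t ** 2 * (t - 1))  # hypergeometric variance of cell a

alpha_mh = num_or / den_or                             # MH common odds ratio
chi2_mh = (abs(a_sum - e_sum) - 0.5) ** 2 / v_sum      # continuity-corrected MH chi-square (1 df)
mh_d_dif = -2.35 * np.log(alpha_mh)                    # ETS delta metric (MH D-DIF)

print(f"odds ratio {alpha_mh:.3f}, chi-square {chi2_mh:.3f}, MH D-DIF {mh_d_dif:.3f}")
```

A p value can then be obtained by referring the chi-square statistic to a chi-square distribution with one degree of freedom (e.g., scipy.stats.chi2.sf(chi2_mh, 1)); an MH D-DIF value near zero suggests negligible DIF.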

With the advances of item response theory (IRT) in the psychometric analysis of items and instrument construction, DIF detection methods based on IRT were also developed. Because IRT consists of mathematical models already expressed as the conditional probability of a certain item response given the latent trait, its application to DIF detection is straightforward. One approach is to compare the item characteristic curves (ICC) between the reference and focal groups (Linn, Levine, Hastings, & Wardrop, 1981; Rudner, 1977). An alternative to comparing the ICCs is to compare the item parameters directly (Hambleton & Swaminathan, 1985; Muthén & Lehman, 1985; Thissen, Steinberg, & Gerrard, 1986), since the item parameters determine the ICC. A more elaborate approach involves a likelihood ratio test comparing the fit when the target item's parameters are estimated separately for the reference and focal groups versus when these parameters are constrained to be equal (Thissen, Steinberg, & Wainer, 1988). One of the special features of these IRT-based methods is that they do not require computing the level of the latent construct for each individual and matching individuals into subgroups. While these IRT-based methods were first developed for dichotomous response items, they can be extended to polytomous response items, which are more common in health outcome assessments.
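
A minimal sketch of the ICC-comparison idea, under an assumed two-parameter logistic (2PL) model with hypothetical item parameters estimated separately in each group (none of these values come from the cited studies), is:

```python
import numpy as np

def icc_2pl(theta, a, b):
    """2PL item characteristic curve: probability of endorsing the item given theta."""
    return 1.0 / (1.0 + np.exp(-a * (theta - b)))

# Hypothetical item parameters estimated separately for each group.
a_ref, b_ref = 1.2, 0.0   # reference group: discrimination, difficulty
a_foc, b_foc = 1.2, 0.5   # focal group: same discrimination, higher difficulty

theta = np.linspace(-4, 4, 801)                 # grid over the latent trait
gap = icc_2pl(theta, a_ref, b_ref) - icc_2pl(theta, a_foc, b_foc)

dtheta = theta[1] - theta[0]
signed_area = gap.sum() * dtheta                # signed area between the two ICCs
unsigned_area = np.abs(gap).sum() * dtheta      # unsigned area, sensitive to crossing ICCs

print(f"signed area {signed_area:.3f}, unsigned area {unsigned_area:.3f}")
```

Areas near zero suggest the item behaves similarly in both groups; a large signed area suggests a consistent advantage for one group, while a small signed but large unsigned area suggests the curves cross.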

Two methods commonly used for DIF detection with polytomous response items are SIBTEST (Chang, Mazzeo, & Roussos, 1996; Shealy & Stout, 1993a, 1993b) and ordinal logistic regression (Zumbo, 1999). SIBTEST is based on a multidimensional model of DIF that formally defines a second latent trait contributing to the DIF. Many variations of the SIBTEST method have been developed since its introduction. The ordinal logistic regression method of detecting DIF is a straightforward application of logistic regression modeling. Because an interaction term of latent trait by group can be included in the model, the logistic regression method can detect nonuniform DIF directly, as sketched below. The concepts of uniform and nonuniform DIF were first introduced by Mellenbergh (1982); nonuniform DIF is referred to as crossing DIF in the SIBTEST literature (Li & Stout, 1996). Uniform DIF is where the discrepancy between the reference and focal groups is constant across the range of the latent construct, whereas with nonuniform DIF the discrepancy between the two groups differs depending on the level of the latent construct. More recently, a method based on the multiple indicators, multiple causes (MIMIC) confirmatory factor analysis model was developed for DIF detection (Finch, 2005).
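
A minimal sketch of the ordinal logistic regression approach, using simulated data and statsmodels' OrderedModel (the variable names, cut points, and effect sizes are assumptions for illustration, not part of the cited method), compares nested models with and without the group and trait-by-group terms:

```python
import numpy as np
import pandas as pd
from scipy import stats
from statsmodels.miscmodels.ordinal_model import OrderedModel

# Simulated data for illustration only: 'trait' plays the role of the matching
# score, 'group' is the focal-group indicator, 'item' is an ordinal response (0-3).
rng = np.random.default_rng(0)
n = 500
trait = rng.normal(size=n)
group = rng.integers(0, 2, size=n)
latent = trait + 0.4 * group + rng.logistic(size=n)   # the 0.4 builds in uniform DIF
item = pd.cut(latent, bins=[-np.inf, -1.0, 0.0, 1.0, np.inf], labels=False)

df = pd.DataFrame({"trait": trait, "group": group})
df["trait_x_group"] = df["trait"] * df["group"]
endog = pd.Series(pd.Categorical(item, ordered=True), name="item")

def fit_ordinal(exog_cols):
    """Fit a proportional-odds (ordinal logistic) model with the given predictors."""
    return OrderedModel(endog, df[exog_cols], distr="logit").fit(method="bfgs", disp=False)

m_base = fit_ordinal(["trait"])                                   # matching variable only
m_uniform = fit_ordinal(["trait", "group"])                       # + group main effect
m_nonuniform = fit_ordinal(["trait", "group", "trait_x_group"])   # + interaction term

# Likelihood-ratio tests between the nested models (1 df each)
lr_uniform = 2 * (m_uniform.llf - m_base.llf)
lr_nonuniform = 2 * (m_nonuniform.llf - m_uniform.llf)
print("uniform DIF p =", stats.chi2.sf(lr_uniform, df=1))
print("nonuniform DIF p =", stats.chi2.sf(lr_nonuniform, df=1))
```

In practice, these likelihood-ratio (or chi-square) tests are often paired with an effect-size measure, such as the change in pseudo R-squared between the nested models, so that trivially small DIF in large samples is not over-flagged.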

The usual treatment of a DIF item in educational testing is to remove the item from the test. While items with DIF are a serious problem in educational testing, especially if the DIF item biases against a certain group of test takers, the role of DIF in health outcomes research is less clear. Multiple factors affect the assessment of health-related quality of life and other patient-reported outcomes, and in many cases differences in item responses are expected for members of different groups. An item may cover content that is very important for assessing a health-related quality of life or patient-reported outcome even though different groups of subjects respond to it differently. In this case, the item should be kept in the instrument, but each individual’s response should be compared within his or her group. This requires the item to be treated differently according to group membership, e.g., different item scores for males and females. An ideal assessment of health-related quality of life or patient-reported outcomes involves tailoring the items to the unique characteristics of the individual for maximum information. This is possible by using computerized adaptive testing with a large item bank. In this case, items with DIF are no longer a threat to the validity of the test; they become an asset of the instrument.

Cross-References

Item Response Theory