Background

Measuring psychological well-being and quality of life is more complicated than measuring other aspects of health where aetiology and pathology make a greater contribution [1]. Where a scale shows differences in scores between groups, these differences may be due to a characteristic of the test items other than the attribute the scale is intended to measure. For example, some items within a scale may be more likely to be endorsed by those in a particular age, gender or ethnic group. Differential item functioning (DIF) is considered present where items on a scale show such bias. Awareness of this bias is of particular importance where scale thresholds are used to inform decisions on diagnosis and subsequent treatment. Where DIF is present, it could lead to under- or over-treatment of particular groups, depending on the direction of bias. As such, scales should be assessed for DIF and the extent of its presence taken into account when interpreting scale scores. Several methods have been applied to measure DIF in health-related scales, for example: structural equation modelling [2]; ordinal logistic regression [3]; item response theory (IRT) analysis [4]; and contingency table methods [5]. Personal preference has been advocated to guide choice of method [6], yet these methods may differ in their sensitivity to detect DIF.

We compared three approaches for assessing DIF in the Hospital Anxiety and Depression Scale (HADS) [7], a 14-item self-reported instrument that comprises anxiety (HADS-A) and depression (HADS-D) subscales, where higher scores represent greater symptom severity. HADS is commonly used in clinical practice and research [8]. It is therefore important that comparable results are obtained regardless of demographic characteristics of respondents such as age and gender. Three methods of measuring DIF were applied to one dataset and their findings compared: (1) ordinal logistic regression; (2) an IRT method using Rasch analysis; and (3) a contingency table method using Mantel χ2. The objective was to assess whether the methods identified the same items as exhibiting DIF, and whether some methods were more sensitive in detecting DIF than others.

Methods

Sample

In four practices in North East Scotland, 1,068 adults consulting primary care professionals completed HADS [9]. Approval was granted by the North of Scotland Research Ethics Committee (06/S0802/27).

Statistical methods

DIF analyses were conducted independently by different researchers: ordinal logistic regression (Scott), Rasch analysis (Adler) and the Mantel χ2 procedure (Cameron). In DIF analysis, a comparison is made between a ‘reference’ and a ‘focal’ group. Each researcher received an anonymised dataset of HADS items, sex (reference group = female, focal group = male) and age (reference group = <65 years, focal group = ≥65 years). Each researcher completed their analysis before appraising the other analyses, to reduce interpretative bias. A fourth author (Reid), free of methodological preference bias, appraised the findings.

Method 1: ordinal logistic regression

For each item in HADS-D and HADS-A, an ordinal logistic regression (OLR) model was fitted with the item response as the dependent variable and age group, sex and the overall scale score as independent variables. A log odds ratio greater than zero indicated that those in the focal group (age ≥ 65 or males) were more likely to report higher anxiety/depression symptoms on this item than those in the reference group (age < 65 or females) with the same overall scale score. Items were regarded as having important DIF if p < 0.001 and the magnitude of the log odds ratio was greater than 0.64 [10]. Items associated with p < 0.05 were also noted. For greater detail on the OLR approach, see Scott et al. [3], Crane et al. [11] and Zumbo [12].
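The decision rule described above can be sketched in a few lines of Python. This is an illustrative sketch only, not the code used in the study: the function names and the labels "important", "noted" and "none" are assumptions introduced here, and the log odds ratio and p value are taken as already estimated by a fitted OLR model.

```python
import math

# Thresholds quoted in the text for "important" DIF.
P_IMPORTANT = 0.001
P_NOTED = 0.05
LOG_OR_THRESHOLD = 0.64  # exp(0.64) is roughly an odds ratio of 1.9


def classify_olr_dif(log_or, p_value):
    """Label one item's DIF result (labels are hypothetical, for illustration)."""
    if p_value < P_IMPORTANT and abs(log_or) > LOG_OR_THRESHOLD:
        return "important"
    if p_value < P_NOTED:
        return "noted"
    return "none"


def favoured_group(log_or):
    """A positive log odds ratio means the focal group tends to score higher."""
    if log_or > 0:
        return "focal"
    if log_or < 0:
        return "reference"
    return "neither"


def odds_ratio(log_or):
    """Convert a log odds ratio to the odds-ratio scale."""
    return math.exp(log_or)
```

Note that an item with a very small p value but a log odds ratio of, say, 0.3 would only be "noted" under this rule, because both the significance and the magnitude criteria must be met for important DIF.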

Method 2: Rasch model

Parametric IRT models are built on the premise that it is possible to formulate a mathematical function that adequately describes the probability that respondents, at different levels of the underlying dimension, endorse a given response option in a rating scale. Here, the one-parameter Rasch model was applied [13]. The quality of measurement can be evaluated by fit to the model, dimensionality and DIF. The analysis was performed using the Winsteps programme [14]. The magnitude of DIF is referred to as a DIF contrast. DIF contrasts <0.5 are considered negligible, contrasts of 0.5 to 1 moderate and contrasts >1 substantial, provided that the DIF contrasts are statistically significant (p < 0.05, T value > 2). For greater detail on the Rasch model approach, see Bond and Fox [13] and Tennant et al. [15].
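The DIF contrast bands above can be expressed as a simple classification. The sketch below is illustrative and is not part of Winsteps. It assumes that the bands apply to the absolute value of the contrast and of the T value (so that DIF in either direction is treated alike), and the function name and returned labels are hypothetical.

```python
def classify_dif_contrast(contrast, t_value):
    """Band a Rasch DIF contrast using the cut-offs quoted in the text.

    Assumption: the cut-offs are applied to absolute values, so the sign
    of the contrast only indicates the direction of DIF.
    """
    if abs(t_value) <= 2:
        # Magnitude bands are only interpreted for significant contrasts.
        return "not significant"
    size = abs(contrast)
    if size < 0.5:
        return "negligible"
    if size <= 1:
        return "moderate"
    return "substantial"
```

For example, a contrast of 0.8 with a T value of 1.5 would not be interpreted as moderate DIF under this rule, because the significance condition is not met.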

Method 3: Mantel chi-square procedure

DIF analyses were performed using DIFAS-5 [16]. Data were stratified by the sums of the respective scales and assessed for DIF by sex and age. The Mantel χ2 statistic [17], a contingency table method of assessing DIF in scales made up of polytomous items, was computed. The total score on the scale was divided into slices and the performance of each item assessed at these different score levels according to the grouping variables of interest. As fourteen items were each being tested for two different groupings, a stringent significance level of 0.001 was adopted, corresponding to a Mantel χ2 value >10.83 (1 degree of freedom). The Mantel χ2 value was then considered in the context of the effect size. Standardised Liu-Agresti Cumulative Common Log-Odds Ratios (LOR Z) are presented. Where this value is >2 or <−2, evidence of DIF is indicated [18]. Positive values indicate greater propensity for item endorsement by the reference group and negative values by the focal group. For greater detail on this method, see Penfield and Algina [19].
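To make the contingency table approach concrete, the following sketch computes a Mantel χ2 statistic from stratified counts and checks the 10.83 threshold. It is a minimal illustration under stated assumptions, not the DIFAS-5 implementation: the function names and the input layout (one pair of count lists per total-score stratum) are hypothetical, item categories are assumed to be scored 0 to 3, and the textbook form of the statistic is used (observed minus expected sum of focal-group item scores, squared, divided by the summed hypergeometric variance).

```python
import math


def mantel_chi2(strata, scores=(0, 1, 2, 3)):
    """Mantel chi-square (1 df) for one polytomous item.

    strata: list of (reference_counts, focal_counts) pairs, one pair per
    total-score stratum; each element is a list of counts per item category.
    """
    sum_obs = sum_exp = sum_var = 0.0
    for ref, foc in strata:
        n_ref, n_foc = sum(ref), sum(foc)
        n = n_ref + n_foc
        if n_ref == 0 or n_foc == 0:
            continue  # a stratum with only one group carries no information
        col = [r + f for r, f in zip(ref, foc)]           # category totals
        s1 = sum(y * m for y, m in zip(scores, col))      # sum of scores
        s2 = sum(y * y * m for y, m in zip(scores, col))  # sum of squared scores
        sum_obs += sum(y * f for y, f in zip(scores, foc))
        sum_exp += n_foc * s1 / n
        sum_var += n_ref * n_foc * (n * s2 - s1 * s1) / (n * n * (n - 1))
    if sum_var == 0:
        raise ValueError("no stratum has variation in item scores")
    return (sum_obs - sum_exp) ** 2 / sum_var


def chi2_1df_pvalue(x):
    """Upper-tail p value of a chi-square variate with 1 degree of freedom."""
    return math.erfc(math.sqrt(x / 2.0))


# The threshold quoted in the text corresponds to p of about 0.001.
print(round(chi2_1df_pvalue(10.83), 4))
```

With 28 tests (fourteen items by two groupings), a Bonferroni-adjusted 0.05 level would be about 0.0018, so the 0.001 level used here is slightly more conservative than that adjustment.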

Assessment of unidimensionality and model fit

Prior to the DIF analyses, dimensionality was assessed to ensure that HADS-D and HADS-A were each measuring one underlying construct. Additionally, for a valid analysis of DIF within the Rasch model, the data must also show an acceptable fit to the model (fit values within the recommended range of 0.5–1.5 [14]). Using an IRT approach to test for unidimensionality, HADS-D and HADS-A were analysed using a principal component analysis (PCA) of the residuals remaining after the Rasch model was fitted to the data. Each item is modelled to contribute one unit of information (an eigenvalue of 1) to the principal components decomposition of residuals, so the eigenvalue of a contrast corresponds to the number of items that the contrast represents. Contrasts with eigenvalues below two imply low influence from secondary dimensions.

Results

Sample

For age, there were 814 respondents in the reference group (<65 years) and 254 in the focal group (≥65 years). For sex, there were 633 in the reference group (female) and 435 in the focal group (male).

Dimensionality and model fit assessment

All fit values of the Rasch model were within the recommended range (HADS-D: item infit 0.76–1.27, item outfit 0.67–1.41; HADS-A: item infit 0.84–1.31, item outfit 0.81–1.32), and the eigenvalues of the first residual contrasts were below two (eigenvalue of first contrast for HADS-D = 1.4 and for HADS-A = 1.6), indicating unidimensionality in both HADS-D and HADS-A.
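As a worked check, the reported ranges can be compared against the two criteria stated in the Methods (fit values within 0.5–1.5 and a first-contrast eigenvalue below two). The sketch below simply encodes the figures reported above; the data structure and function name are illustrative assumptions.

```python
FIT_RANGE = (0.5, 1.5)   # recommended range for item fit values
MAX_EIGENVALUE = 2.0     # first-contrast eigenvalue should be below this

# Minimum and maximum item fit values and first-contrast eigenvalues,
# as reported in the text.
reported = {
    "HADS-D": {"infit": (0.76, 1.27), "outfit": (0.67, 1.41), "first_contrast": 1.4},
    "HADS-A": {"infit": (0.84, 1.31), "outfit": (0.81, 1.32), "first_contrast": 1.6},
}


def meets_criteria(stats):
    """True if both fit ranges and the first contrast satisfy the criteria."""
    low, high = FIT_RANGE
    fit_ok = all(low <= lo and hi <= high
                 for lo, hi in (stats["infit"], stats["outfit"]))
    return fit_ok and stats["first_contrast"] < MAX_EIGENVALUE


for subscale, stats in reported.items():
    print(subscale, meets_criteria(stats))
```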

Method 1: Ordinal logistic regression method

Using the combined criteria of p < 0.001 and |log(OR)| > 0.64 as indicating important DIF, three items (Q1, 6 and 8) had age group DIF but no items met the stricter criteria for sex DIF (Table 1).

Table 1 Uniform DIF based on ordinal logistic regression

Method 2: Rasch model method

DIF contrasts are shown in Table 2. There were four items with a DIF contrast >0.5 for age group (Q1, 6, 8 and 10) and one for sex (Q11). All items with contrast values >0.5 were also statistically significant (T value >2).

Table 2 DIF based on the Rasch model

Method 3: Mantel chi-square procedure

Significant DIF by age was identified for four items (Q1, 6, 8 and 10) and two by sex (Q9, 11) (Table 3).

Table 3 DIF based on the contingency tables method

Discussion

Ordinal logistic regression, Rasch analysis and Mantel chi-square methods of measuring DIF in HADS-D and HADS-A led to similar findings regarding the presence of DIF. There was remarkable consistency between the methods in the size and direction of the DIF effects found, although there were differences in the number of items crossing the threshold indicating important DIF.
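The overlap between methods can be summarised with simple set operations on the items reported in the Results. The sketch below uses only the item numbers given in the text; the dictionary layout and function name are illustrative assumptions, and each method applies its own threshold, so an item flagged by only some methods may reflect a difference in cut-offs as much as a difference in the estimated DIF.

```python
# Items flagged by each method, as reported in the Results.
flagged = {
    "age": {"OLR": {1, 6, 8}, "Rasch": {1, 6, 8, 10}, "Mantel": {1, 6, 8, 10}},
    "sex": {"OLR": set(), "Rasch": {11}, "Mantel": {9, 11}},
}


def agreement(by_method):
    """Return (items flagged by every method, items flagged by only some)."""
    item_sets = list(by_method.values())
    by_all = set.intersection(*item_sets)
    by_any = set.union(*item_sets)
    return by_all, by_any - by_all


for grouping, by_method in flagged.items():
    by_all, by_some = agreement(by_method)
    print(grouping, "all methods:", sorted(by_all), "some methods:", sorted(by_some))
```

For age, Q1, Q6 and Q8 were flagged by all three methods and Q10 by two; for sex, no item was flagged by all three.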

Regardless of method, the analyses of DIF implied that both HADS-D and HADS-A are valid tools for comparisons between sexes, and that HADS-A is valid for comparisons between age groups. In HADS-D, all three methods identified significant levels of age-related DIF. This is a potential concern given how frequently HADS is used in research studies and clinical practice. However, the DIF on the problematic HADS-D items ran in opposite directions, so its effect might cancel out numerically at the scale score level.

Within the Rasch model, it is possible to remove DIF by splitting the estimation of measures between the subgroups on the items showing substantial DIF. This is an alternative to removing or reformulating items with DIF.
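One way to picture item splitting is as a data preparation step: the item showing DIF is replaced by two group-specific pseudo-items, each treated as missing for the other group, so that a separate item difficulty is estimated for each subgroup. The sketch below illustrates this step only. It is a minimal sketch under stated assumptions, not the Winsteps procedure, and the function name, column suffixes and example data are hypothetical.

```python
def split_item(rows, groups, item, reference_label):
    """Replace `item` with two group-specific pseudo-items.

    rows: list of dicts mapping item name to response.
    groups: group label for each row, in the same order.
    Responses of the other group are set to None (treated as missing).
    """
    out = []
    for row, group in zip(rows, groups):
        new_row = {k: v for k, v in row.items() if k != item}
        is_ref = group == reference_label
        new_row[f"{item}_ref"] = row[item] if is_ref else None
        new_row[f"{item}_focal"] = None if is_ref else row[item]
        out.append(new_row)
    return out


# Hypothetical responses from two respondents, for illustration only.
rows = [{"Q1": 2, "Q8": 1}, {"Q1": 0, "Q8": 3}]
groups = ["<65", ">=65"]
print(split_item(rows, groups, "Q8", reference_label="<65"))
```

The remaining items are left unchanged, so they continue to anchor both subgroups on a common scale.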

We studied only three of many methods proposed to assess DIF. Further investigations with structural equation modelling methods would add to the understanding of the benefits and drawbacks of the different methods.

Our findings, in relation to the presence of DIF in HADS, concur with other studies [5, 20, 21]. A small number of studies have examined the relative performance of DIF detection methods in depressive symptom tools [22–24]. The Mini-Mental State Examination (MMSE) has been subjected to several methods to assess for the presence of DIF in relation to translation and other variables [11, 25–28]. The methods assessed included logistic regression, IRT, contingency tables and structural equation modelling methods. In considering the relative findings, there was a general lack of agreement between methods.

Conclusion

Ordinal logistic regression, Rasch analysis and contingency table methods of investigating DIF yielded consistent results when identifying DIF in HADS-D and HADS-A. Regardless of method, investigators should combine statistical significance, magnitude of the DIF effect and investigator judgement when interpreting the results.