Abstract
Purpose
It is important for clinical practice and research that measurement scales of well-being and quality of life exhibit only minimal differential item functioning (DIF). DIF occurs where different groups of people endorse items in a scale to different extents after being matched by the intended scale attribute. We investigate the equivalence or otherwise of common methods of assessing DIF.
Method
Three methods of measuring age- and sex-related DIF (ordinal logistic regression, Rasch analysis and Mantel χ2 procedure) were applied to Hospital Anxiety Depression Scale (HADS) data pertaining to a sample of 1,068 patients consulting primary care practitioners.
Results
Three items were flagged by all three approaches as having either age- or sex-related DIF with a consistent direction of effect; a further three items identified did not meet stricter criteria for important DIF using at least one method. When applying strict criteria for significant DIF, ordinal logistic regression was slightly less sensitive.
Conclusions
Ordinal logistic regression, Rasch analysis and contingency table methods yielded consistent results when identifying DIF in the HADS depression and HADS anxiety scales. Regardless of methods applied, investigators should use a combination of statistical significance, magnitude of the DIF effect and investigator judgement when interpreting the results.
Background
Measuring psychological well-being and quality of life is more complicated than measuring other aspects of health where aetiology and pathology make a greater contribution [1]. Where scale scores differ between groups, the difference may be due to a characteristic of the test items other than the scale attribute. For example, some items within a scale may be more likely to be endorsed by those in a particular age, gender or ethnic group. Differential item functioning (DIF) is considered present where items on a scale show such bias. Awareness of this bias is of particular importance where scale thresholds are used to inform decisions on diagnosis and subsequent treatment. Where DIF is present, it could lead to under- or over-treatment of particular groups, depending on the direction of bias. Scales should therefore be assessed for DIF, and the extent of its presence taken into account when interpreting scale scores. Several methods have been applied to measure DIF in health-related scales, for example: structural equation modelling [2]; ordinal logistic regression [3]; item response theory (IRT) analysis [4]; and contingency table methods [5]. Personal preference has been advocated to guide choice of method [6], yet these methods may differ in their sensitivity to detect DIF.
We compared three approaches for assessing DIF in the Hospital Anxiety Depression Scale (HADS) [7], a 14-item self-reported instrument comprising anxiety (HADS-A) and depression (HADS-D) subscales, where higher scores represent greater symptom severity. HADS is commonly used in clinical practice and research [8]. It is therefore important that comparable results are obtained regardless of demographic aspects such as age and gender of respondents. Three methods of measuring DIF were applied to one dataset and the relative findings examined: (1) ordinal logistic regression; (2) an IRT method using Rasch analysis; and (3) a contingency table method using Mantel χ2. The objective was to assess whether the methods identified the same items as exhibiting DIF, and whether some methods were more sensitive in detecting DIF than others.
Methods
Sample
In four practices in North East Scotland, 1,068 adults consulting primary care professionals completed the HADS [9]. Approval was granted by the North of Scotland Research Ethics Committee (06/S0802/27).
Statistical methods
DIF analyses were conducted independently by different researchers: ordinal logistic regression (Scott), Rasch analysis (Adler) and Mantel χ2 procedure (Cameron). In DIF analysis, a comparison is made between a ‘reference’ and a ‘focal’ group. Each researcher received an anonymised dataset of HADS items, sex (reference group = female, focal group = male) and age (reference group = <65 years, focal group = ≥65 years). Each researcher completed their analysis before appraising the other analyses, to reduce interpretative bias. A fourth author (Reid), free of methodological preference bias, appraised the findings.
Method 1: ordinal logistic regression
For each item in HADS-D and HADS-A, an ordinal logistic regression (OLR) model was fitted with the item response as the dependent variable and age group, sex and the overall scale score as independent variables. A log odds ratio greater than zero indicated that those in the focal group (age ≥ 65 or males) were more likely to have higher anxiety/depression symptoms on this item than those in the reference group (age < 65 or females). Items were regarded as having important DIF if p < 0.001 and the magnitude of the log odds ratio was greater than 0.64 [10]. Items associated with p < 0.05 were also noted. For greater detail on the OLR approach, see Scott et al. [3], Crane et al. [11] and Zumbo [12].
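The decision rule described above can be sketched in a few lines of Python. This is an illustrative sketch only, not the analysis code used in the study: the function names `classify_olr_dif` and `dif_direction` and the labels they return are hypothetical, and the log odds ratios and p values are assumed to come from an OLR model fitted elsewhere.

```python
P_IMPORTANT = 0.001     # significance criterion for important DIF (from the text)
LOG_OR_CUTOFF = 0.64    # magnitude criterion for important DIF (from the text)
P_NOTED = 0.05          # items below this level were "also noted" (from the text)


def classify_olr_dif(log_odds_ratio, p_value):
    """Label one item/grouping pair; the labels are hypothetical names."""
    if p_value < P_IMPORTANT and abs(log_odds_ratio) > LOG_OR_CUTOFF:
        return "important"
    if p_value < P_NOTED:
        return "noted"
    return "none"


def dif_direction(log_odds_ratio):
    """Which group tends to score higher on the item after matching."""
    if log_odds_ratio > 0:
        return "focal"       # age >= 65 or male
    if log_odds_ratio < 0:
        return "reference"   # age < 65 or female
    return "neither"


# Made-up inputs for illustration only; they are not results from the study.
for log_or, p in [(0.90, 0.0001), (0.30, 0.0001), (-0.80, 0.20)]:
    print(log_or, p, classify_olr_dif(log_or, p), dif_direction(log_or))
```

A log odds ratio of 0.64 corresponds to an odds ratio of roughly 1.9, so the magnitude criterion screens out effects that are statistically significant but small.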
Method 2: Rasch model
Parametric IRT-models are built on the premise that it is possible to formulate a mathematical function that adequately describes the probability of respondents, at different levels of the dimension, to endorse a response option in a rating scale. Presently, the 1-parameter Rasch model is applied [13]. The quality of the measurement can be evaluated by fit to the model, dimensionality and DIF. The analysis was performed using the Winsteps programme [14]. Magnitude of DIF is referred to as a DIF contrast. DIF contrasts <0.5 are considered negligible, contrasts 0.5 to 1 as moderate and >1 as substantial, provided that the DIF contrasts are statistically significant (p < 0.05, T value > 2). For greater detail on the Rasch model approach, see Bond and Fox [13], Tennant et al. [15].
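The thresholds for interpreting DIF contrasts can be written as a small helper. This is a hedged sketch, not part of Winsteps: the function name `classify_dif_contrast` is hypothetical, and taking absolute values of the contrast and T value (so that DIF in either direction is treated alike) is an assumption made here for illustration.

```python
def classify_dif_contrast(contrast, t_value):
    """Label a DIF contrast using the size bands quoted in the text.

    Assumption: direction is ignored, so absolute values are compared.
    """
    if abs(t_value) <= 2:
        return "not significant"
    size = abs(contrast)
    if size < 0.5:
        return "negligible"
    if size <= 1:
        return "moderate"
    return "substantial"


# Made-up inputs for illustration only; they are not results from the study.
for contrast, t in [(0.3, 3.0), (0.7, 2.5), (-1.2, -4.0), (0.8, 1.0)]:
    print(contrast, t, classify_dif_contrast(contrast, t))
```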
Method 3: Mantel chi-square procedure
DIF analyses were performed using DIFAS-5 [16]. Data were stratified by the sums of the respective scales and assessed for DIF by sex and age. The Mantel χ2 [17] statistic was computed (a contingency table method of assessing DIF in scales made up of polytomous items). The total score on the scale was divided into slices and the performance of each item assessed at these different score levels according to the grouping variables of interest. As fourteen items were being tested by two different groupings, a Mantel χ2 value >10.83 was considered indicative of a statistically significant difference at the 0.001 level. The Mantel χ2 value was then considered in the context of the effect size. Standardised Liu-Agresti Cumulative Common Log-Odds Ratios (LOR Z) are presented. Where this value is >2 or <−2, evidence of DIF is indicated [18]. Positive values indicate greater propensity for item endorsement by the reference group and negative values by the focal group. For greater detail on this method, see Penfield and Algina [19].
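To make the stratified calculation concrete, the following Python sketch computes the Mantel χ2 statistic for one polytomous item from per-stratum contingency tables. It is a minimal illustration of the published Mantel (1963) formula and does not reproduce the DIFAS implementation: the function names, the input layout (one pair of count lists per score stratum) and the decision to skip strata containing only one group are assumptions made here.

```python
import math


def mantel_chi_square(strata, scores):
    """Mantel chi-square (1 df) for one item.

    strata: list of (reference_counts, focal_counts); each is a list of
            respondent counts per item response category, ordered as `scores`.
    scores: numeric score attached to each response category, e.g. [0, 1, 2, 3].
    """
    observed = expected = variance = 0.0
    for ref_counts, focal_counts in strata:
        n_ref, n_focal = sum(ref_counts), sum(focal_counts)
        n = n_ref + n_focal
        if n_ref == 0 or n_focal == 0:
            continue  # a stratum with only one group carries no DIF information
        totals = [r + f for r, f in zip(ref_counts, focal_counts)]
        sum_y = sum(y * c for y, c in zip(scores, totals))
        sum_y2 = sum(y * y * c for y, c in zip(scores, totals))
        # Sum of item scores in the focal group, with its expectation and
        # variance under the hypothesis of no DIF within this stratum.
        observed += sum(y * f for y, f in zip(scores, focal_counts))
        expected += n_focal * sum_y / n
        variance += n_ref * n_focal * (n * sum_y2 - sum_y ** 2) / (n * n * (n - 1))
    if variance == 0:
        raise ValueError("item scores do not vary within any usable stratum")
    return (observed - expected) ** 2 / variance


def p_value_df1(chi2):
    """Upper-tail probability of a chi-square variate with 1 degree of freedom."""
    return math.erfc(math.sqrt(chi2 / 2.0))


# Made-up counts for two score strata; they are not data from the study.
example_strata = [([12, 9, 4, 1], [5, 8, 6, 3]), ([3, 7, 10, 6], [1, 4, 9, 8])]
chi2 = mantel_chi_square(example_strata, scores=[0, 1, 2, 3])
print(chi2, p_value_df1(chi2), chi2 > 10.83)
```

The helper `p_value_df1` shows where the 10.83 cut-off comes from: it is approximately the value at which the upper-tail probability of a χ2 distribution with one degree of freedom falls to 0.001.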
Assessment of unidimensionality and model fit
Prior to the DIF analyses, dimensionality was assessed to ensure that HADS-D and HADS-A were each measuring one underlying construct. Additionally, for a valid analysis of DIF within the Rasch model, it is mandatory that the data also show an acceptable fit to the model (within the recommended range of 0.5–1.5 [14]). Using an IRT approach to test for unidimensionality, HADS-D and HADS-A were analysed using a principal component analysis (PCA) of the residuals left after the Rasch model was fitted to the data. Each item is modelled to contribute one unit of information (=1 eigenvalue) to the principal components decomposition of residuals. The eigenvalue of each PCA contrast therefore corresponds to the number of items that the contrast represents. Contrasts with eigenvalues below two imply low influence from secondary dimensions.
Results
Sample
For age, there were 814 respondents in the reference group (<65 years) and 254 in the focal group (≥65 years). For sex, there were 633 in the reference group (female) and 435 in the focal group (male).
Dimensionality and model fit assessment
All fit values of the Rasch model were within the recommended range (HADS-D: item infit 0.76–1.27, item outfit 0.67–1.41; HADS-A: item infit 0.84–1.31, item outfit 0.81–1.32), and eigenvalues of the first residuals were below two (eigenvalue of first contrast for HADS-D = 1.4 and for HADS-A = 1.6) indicating unidimensionality in both HADS-D and HADS-A.
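As a quick check, the reported fit ranges and first-contrast eigenvalues can be compared against the stated criteria. The values below are copied from the paragraph above; the dictionary layout and the function name `passes_checks` are hypothetical and serve only to illustrate the two rules (fit within 0.5–1.5, first-contrast eigenvalue below two).

```python
FIT_RANGE = (0.5, 1.5)    # recommended infit/outfit range (from the text)
EIGENVALUE_LIMIT = 2.0    # first-contrast eigenvalue should fall below this

# (minimum, maximum) item fit values and first-contrast eigenvalues as reported.
reported = {
    "HADS-D": {"infit": (0.76, 1.27), "outfit": (0.67, 1.41), "first_contrast": 1.4},
    "HADS-A": {"infit": (0.84, 1.31), "outfit": (0.81, 1.32), "first_contrast": 1.6},
}


def passes_checks(summary):
    """True if both fit ranges sit inside FIT_RANGE and the eigenvalue is below the limit."""
    low, high = FIT_RANGE
    fit_ok = all(
        low <= smallest and largest <= high
        for smallest, largest in (summary["infit"], summary["outfit"])
    )
    return fit_ok and summary["first_contrast"] < EIGENVALUE_LIMIT


for scale, summary in reported.items():
    print(scale, passes_checks(summary))
```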
Method 1: ordinal logistic regression method
Using the combined criteria of p < 0.001 and |log(OR)| > 0.64 as indicating important DIF, three items (Q1, 6 and 8) had age group DIF but no items met the stricter criteria for sex DIF (Table 1).
Method 2: Rasch model method
DIF contrasts are shown in Table 2. There were four items with a DIF contrast >0.5 for age group (Q1, 6, 8 and 10) and one for sex (Q11). All items with contrast values >0.5 were also statistically significant (T value >2).
Method 3: Mantel chi-square procedure
Significant DIF by age was identified for four items (Q1, 6, 8 and 10) and two by sex (Q9, 11) (Table 3).
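The overlap between the three sets of flagged items can be tabulated with simple set operations. The item numbers below are taken from the results reported above for each method; the data structure and the function name `agreement` are hypothetical. The sketch compares only which items were flagged, not the direction or size of the effects, which are given in the tables.

```python
# Items flagged by each method, as reported in the three Results subsections.
flagged = {
    "age": {"OLR": {1, 6, 8}, "Rasch": {1, 6, 8, 10}, "Mantel": {1, 6, 8, 10}},
    "sex": {"OLR": set(), "Rasch": {11}, "Mantel": {9, 11}},
}


def agreement(by_method):
    """Return (items flagged by every method, items flagged by only some methods)."""
    item_sets = list(by_method.values())
    by_all = set.intersection(*item_sets)
    by_any = set.union(*item_sets)
    return by_all, by_any - by_all


for grouping, by_method in flagged.items():
    by_all, by_some = agreement(by_method)
    print(grouping, "all methods:", sorted(by_all), "some methods:", sorted(by_some))
```

On these inputs, Q1, Q6 and Q8 are flagged for age by all three methods, while Q10 (age) and Q9 and Q11 (sex) are flagged by only some of them, which matches the three plus three items described in the abstract.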
Discussion
Ordinal logistic regression, Rasch analysis and Mantel chi-square methods of measuring DIF in HADS-D and HADS-A led to similar findings regarding the presence of DIF. There was remarkable consistency between the methods in the size and direction of the DIF effects found, although there were differences in the number of items crossing the threshold indicating important DIF.
Regardless of method, the analyses of DIF implied that HADS-D and HADS-A are valid tools for comparisons between sexes, and that HADS-A is valid for comparisons between age groups. In HADS-D, all three methods identified significant levels of age-related DIF. This is a potential concern given how frequently HADS is used in research studies and clinical practice. However, the DIF on the problematic HADS-D items ran in different directions, so the effect might cancel out numerically at the scale score level.
Within the Rasch model, it is possible to remove DIF by splitting the estimation of measures between the subgroups on the items showing substantial DIF. This is an alternative to removing or reformulating items with DIF.
We studied only three of many methods proposed to assess DIF. Further investigations with structural equation modelling methods would add to the understanding of the benefits and drawbacks of the different methods.
Our findings, in relation to the presence of DIF in HADS, concur with other studies [5, 20, 21]. A small number of studies have examined the relative performance of DIF detection methods in depressive symptom tools [22–24]. The Mini-Mental State Examination (MMSE) has been subjected to several methods to assess for the presence of DIF in relation to translation and other variables [11, 25–28]. The methods assessed included logistic regression, IRT, contingency tables and structural equation modelling methods. In considering the relative findings, there was a general lack of agreement between methods.
Conclusion
Ordinal logistic regression, Rasch analysis and contingency table methods of investigating DIF yielded consistent results when identifying DIF in HADS-D and HADS-A. Regardless of method, investigators should combine statistical significance, magnitude of DIF effect and investigator judgement to interpret the results.
References
Warner, J. (2004). Clinicians’ guide to evaluating diagnostic and screening tests in psychiatry. Advances in Psychiatric Treatment, 10(6), 446–454.
Crawford, J. R., Garthwaite, P. H., & Slick, D. J. (2009). On percentile norms in neuropsychology: Proposed reporting standards and methods for quantifying the uncertainty over the percentile ranks of test scores. The Clinical Neuropsychologist, 23, 1173–1195.
Scott, N. W., Fayers, P. M., Aaronson, N. K., Bottomley, A., De Graaf, R., Groenvold, M., et al. (2010). Differential Item Functioning (DIF) analysis of health-related quality of life instruments using logistic regression. Health and Quality of Life Outcomes, 8(81), 1–9.
Isacsson, G., & Adler, M. (2011). Randomized clinical trials underestimate the efficacy of antidepressants in less severe depression. Acta Psychiatrica Scandinavica, 125(8), 453–459.
Cameron, I. M., Crawford, J. R., Lawton, K., & Reid, I. C. (2013). Differential item functioning of the HADS and PHQ-9: An investigation of age, gender and educational background in a clinical UK primary care sample. Journal of Affective Disorders, 147(1–3), 262–268.
Clauser, B. E., & Mazor, K. M. (1998). Using statistical procedures to identify differentially functioning test items. Educational Measurement: Issues and Practice, 17(1), 31–44.
Zigmond, A. S., & Snaith, P. (1983). The Hospital Anxiety and Depression Scale (HAD). Acta Psychiatrica Scandinavica, 67, 361–370.
Herrmann, C. (1997). International experiences with the Hospital Anxiety and Depression Scale—a review of validation data and clinical results. Journal of Psychosomatic Research, 42, 17–41.
Cameron, I. M., Lawton, K., & Reid, I. C. (2009). Appropriateness of antidepressant prescribing: An observational study in a Scottish primary-care setting. British Journal of General Practice, 59, 644–649.
Bjorner, J. B., Kreiner, S., Ware, J. E., Damsgaard, M. T., & Bech, P. (1998). Differential item functioning in the Danish translation of the SF-36. Journal of Clinical Epidemiology, 51(11), 1189–1202.
Crane, P. K., Gibbons, L. E., Jolley, L., & van Belle, G. (2006). Differential item functioning analysis with ordinal logistic regression techniques. DIFdetect and difwithpar. Medical Care, 44(11 Suppl 3), S115–S123.
Zumbo, B. D. (1999). A handbook on the theory and methods of Differential Item Functioning (DIF). Ottawa: Directorate of Human Resources Research and Evaluation, National Defense Headquarters.
Bond, T. G., & Fox, C. M. (2007). Applying the Rasch model: Fundamental measurement in the human sciences (2nd ed.). New Jersey: Lawrence Erlbaum Associates Inc.
Linacre, J. M. (2010). Winsteps Rasch Measurement, 3.70.0.
Tennant, A., Penta, M., Tesio, L., Grimby, G., Thonnard, J. L., Slade, A., et al. (2004). Assessing and adjusting for cross-cultural validity of impairment and activity limitation scales through differential item functioning within the framework of the Rasch model: the PRO-ESOR project. Medical Care, 42(1 Suppl), I37–I48.
Penfield, R. D. (2007). DIFAS 4.0: Differential item functioning analysis system user’s manual.
Mantel, N. (1963). Chi square tests with one degree of freedom: Extension of the Mantel-Haenszel procedure. Journal of the American Statistical Association, 58, 690–700.
Liu, I., & Agresti, A. (1996). Mantel-Haenszel-type inference for cumulative odds ratios with a stratified ordinal response. Biometrics, 52, 1223–1234.
Penfield, R. D., & Algina, J. (2003). Applying the Liu-Agresti estimator of the cumulative common odds ratio to DIF detection in polytomous items. Journal of Educational Measurement, 40, 353–370.
Lambert, S., Pallant, J. F., & Girgis, A. (2010). Rasch analysis of the Hospital Anxiety and Depression Scale among caregivers of cancer survivors: Implications for its use in psycho-oncology. Psycho-Oncology, 20(9), 919–925.
Pallant, J. F., & Tennant, A. (2007). An introduction to the Rasch measurement model: An example using the Hospital Anxiety and Depression Scale (HADS). British Journal of Clinical Psychology, 46(1), 1–18.
Yang, F. M., & Jones, R. N. (2007). Center for Epidemiologic Studies-Depression scale (CES-D) item response bias found with Mantel-Haenszel method was successfully replicated using latent variable modeling. Journal of Clinical Epidemiology, 60(11), 1195–1200.
Cole, S. R., Kawachi, I., Maller, S. J., & Berkman, L. F. (2000). Test of item-response bias in the CES-D scale. Experience from the New Haven EPESE study. Journal of Clinical Epidemiology, 53(3), 285–289.
Huang, F. Y., Chung, H., Kroenke, K., Dellucchi, K. L., & Spitzer, R. L. (2006). Using the Patient Health Questionnaire 9 to measure depression among racially and ethnically diverse primary care patients. Journal of General Internal Medicine, 21, 547–552.
Dorans, N. J., & Kulick, E. (2006). Differential item functioning on the Mini-Mental State Examination. An application of the Mantel-Haenszel and standardization procedures. Medical Care, 44(11 Suppl 3), S107–S114.
Jones, R. N. (2006). Identification of measurement differences between English and Spanish language versions of the Mini-Mental State Examination. Detecting differential item functioning using MIMIC modeling. Medical Care, 44(11 Suppl 3), S124–S133.
Orlando Edelen, M. O., Thissen, D., Teresi, J. A., Kleinman, M., & Ocepek-Welikson, K. (2006). Identification of differential item functioning using item response theory and the likelihood-based model comparison approach. Application to the Mini-Mental State Examination. Medical Care, 44(11 Suppl 3), S134–S142.
Morales, L. S., Flowers, C., Gutierrez, P., Kleinman, M., & Teresi, J. A. (2006). Item and scale differential functioning of the Mini-Mental State Exam assessed using the Differential Item and Test Functioning (DFIT) Framework. Medical Care, 44(11 Suppl 3), S143–S151.
Acknowledgments
We would like to thank the primary care participants and general practices who kindly took part in the original study from which the data were collected. The original research from which the data presently analysed were collected was funded by the Centre for Change and Innovation, of the then Scottish Executive; and from Support for Science funding, Grampian NHS Research and Development. The present methodological investigations were conducted without additional funding.
Ethical standards
The anonymised data analysed in this study were originally collected for research conducted with the approval of the North of Scotland Research Ethics Committee (06/S0802/27).
Conflict of interest
IMC and NWS have nothing to declare. MA has received fees for speaking from Otsuka, AstraZeneca and Servier and served as consultant for Otsuka. ICR has received fees for speaking from AstraZeneca UK and received travel and meeting registration assistance from Lundbeck.
Cite this article
Cameron, I.M., Scott, N.W., Adler, M. et al. A comparison of three methods of assessing differential item functioning (DIF) in the Hospital Anxiety Depression Scale: ordinal logistic regression, Rasch analysis and the Mantel chi-square procedure. Qual Life Res 23, 2883–2888 (2014). https://doi.org/10.1007/s11136-014-0719-3
DOI: https://doi.org/10.1007/s11136-014-0719-3