Introduction

Mental disorders are among the leading causes of disability worldwide and will be among the most burdensome conditions by the year 2020 [1]. Among mental disorders, major depression [2–4] is associated with the greatest disability in high-, middle-, and low-income countries.

The prevalence of major depression in primary health care (PC) is considerable [5–7]; however, fewer than half of the patients with depression are correctly identified and adequately treated by general practitioners (GPs) [8, 9]. Moreover, we now know that, if used alone, screening instruments do not distinguish well between patients who are disabled by their symptoms and those who are not [10], and have little or no impact on the detection, management, and outcome of depression [11].

The World Health Organization Disability Assessment Schedule II (WHO-DAS II) [12] was designed to assess the activity limitations and participation restrictions experienced by an individual irrespective of medical diagnosis. The main advantages of this instrument over other disability measures are as follows: it was cross-culturally developed and field tested in 16 languages in 14 different countries, it is compatible with an international classification system (the International Classification of Functioning, Disability and Health) [13], and it treats all disorders at parity when establishing the level of functioning. Several studies have extensively analyzed the dimensionality, internal consistency, test–retest reliability, and construct validity of the 36- and 12-item versions of the WHO-DAS II in patients with diverse physical and mental conditions [14–17], demonstrating that the instrument possesses sound psychometric properties.

However, to the best of our knowledge, no study has addressed the extent to which the 12-item WHO-DAS II can discriminate between the following clinical groups in the context of primary health care: patients with and without major depression, patients with depression with and without medical comorbidity, and patients with depression of differing severity. These known-groups validity analyses were carried out in the present work.

Methods

For this study, we utilized the ERASMAP data set. The ERASMAP was a cross-sectional observational study carried out in 874 PC centers in Spain designed to identify the sociodemographic and clinical factors associated with diagnostic delay in a first diagnosed major depressive episode. The study was performed in accordance with the ethical standards laid down in the 1964 Declaration of Helsinki and was approved by the Clinical Research and Ethics Committee of the University Hospital La Princesa (Madrid, Spain).

Participants

The sample consisted of 3,615 adult (18 years or older) PC patients from 17 regions of Spain, with a first-time diagnosis of major depressive episode. PC patients with a previously diagnosed major depressive episode, bipolar disorder, schizophrenia or delusional disorder, and those who were receiving treatment with any psychotropic medication were not included in the study.

Measures

The 12-item interviewer-administered version of the World Health Organization Disability Assessment Schedule II (12-item WHO-DAS II) [14, 17]. For each item, individuals estimate the magnitude of the disability during the previous 30 days, from none = 1 to extreme/cannot do = 5. The total score may vary from 0 to 100, with higher scores reflecting greater disability.
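The text gives the item range (1 to 5) and the total range (0 to 100) but not the transformation between them. The following is a minimal sketch that assumes a simple linear rescaling of the summed ratings; the function name `whodas12_total` and the rescaling rule are illustrative assumptions, not the scoring procedure confirmed by the authors.

```python
def whodas12_total(items):
    """Rescale twelve 1-5 item ratings to a 0-100 total.

    ASSUMPTION: a linear rescaling of the raw sum. The source states only
    the item range and the total range, not the scoring rule itself.
    """
    if len(items) != 12:
        raise ValueError("expected 12 item ratings")
    if any(rating not in (1, 2, 3, 4, 5) for rating in items):
        raise ValueError("each rating must be an integer from 1 to 5")
    raw = sum(items)               # raw sum ranges from 12 to 60
    return (raw - 12) / 48 * 100   # maps 12 -> 0 and 60 -> 100


if __name__ == "__main__":
    # A respondent rating every item "none" and one rating every item "extreme".
    print(whodas12_total([1] * 12))
    print(whodas12_total([5] * 12))
```

Under this assumed rule, the cutoff of ≥50% discussed later corresponds to a raw sum of 36 or more.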

The Patient Health Questionnaire nine-item depression module (PHQ-9) [18, 19]. A nine-item scale that assesses the nine DSM-IV [20] depression symptoms. Each of the nine items is scored from 0 (not at all) to 3 (nearly every day). The PHQ-9 can be used as a screening tool, with the summed score ranging from 0 (no depressive symptoms) to 27 (all symptoms occurring nearly every day). Summed scores of 0–4 represent a minimal level of depression; 5–9, mild; 10–14, moderate; 15–19, moderately severe; and 20–27, severe. The PHQ-9 can also be used as a diagnostic tool using a “diagnostic algorithm”; major depression is diagnosed if 5 or more of the 9 symptoms have been present at least “more than half the days” in the past 2 weeks, and 1 of these symptoms is either depressed mood or anhedonia.
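The two uses of the PHQ-9 described above can be sketched as follows. The function names are hypothetical, the code assumes that an item score of 2 or more corresponds to “more than half the days”, and it assumes that the first two list positions hold the anhedonia and depressed-mood items. It implements the algorithm only as stated in the text.

```python
def _check(items):
    # Nine items, each scored 0 (not at all) to 3 (nearly every day).
    if len(items) != 9 or any(score not in (0, 1, 2, 3) for score in items):
        raise ValueError("expected nine item scores, each from 0 to 3")


def phq9_severity(items):
    """Map the summed score (0-27) to the severity bands given in the text."""
    _check(items)
    total = sum(items)
    if total <= 4:
        return "minimal"
    if total <= 9:
        return "mild"
    if total <= 14:
        return "moderate"
    if total <= 19:
        return "moderately severe"
    return "severe"


def phq9_major_depression(items):
    """Diagnostic algorithm as described in the text.

    ASSUMPTIONS: a score >= 2 means the symptom was present at least
    "more than half the days"; items[0] and items[1] are the anhedonia
    and depressed-mood items.
    """
    _check(items)
    present = [score >= 2 for score in items]
    cardinal_symptom = present[0] or present[1]
    return sum(present) >= 5 and cardinal_symptom


if __name__ == "__main__":
    answers = [2, 2, 2, 2, 2, 0, 0, 0, 0]  # hypothetical respondent
    print(phq9_severity(answers), phq9_major_depression(answers))
```

Note that the two outputs need not agree: a respondent can reach a moderate summed score without meeting the symptom-count criterion, which is why the analyses below state which of the two classifications they use.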

Chronic medical conditions checklist

The presence of comorbid medical conditions was assessed using a yes-or-no checklist developed by the authors for the present study. It included questions about a wide range of conditions (e.g. migraine, arthritis, heart attack, hypertension, asthma, tuberculosis, and diabetes). Respondents were asked whether they had experienced any of the symptom-based conditions in the checklist during the previous year.

Procedure

During the consultation, the participating GPs assessed the patients meeting the inclusion criteria using a paper-and-pencil interview. Prior to the assessment, all patients had provided written informed consent.

Data analyses

The known-groups’ validity approach is founded on the basis that certain specified groups of patients might be expected to score differently from others. In the present work, we carried out the following analyses to examine the known-groups’ validity of the 12-item WHO-DAS II: First, a Student’s t test for independent samples (with unequal variances) was performed to assess the validity of the instrument for discriminating between the patients with major depression and those without (according to the PHQ-9 diagnostic algorithm).

We then conducted a ROC analysis to examine the sensitivity and specificity of the instrument for major depression, using the PHQ-9 as a “gold standard”. The area under the curve (interpretation: .50 to .75 = fair, .75 to .92 = good, .92 to .97 = very good, .97 to 1.00 = excellent), positive and negative predictive value, and the positive and negative likelihood ratio were all calculated.

Finally, to examine the differences in disability among patients with depression with and without comorbid medical conditions, as well as among those reporting different degrees of depression severity (depression groups using PHQ-9: 10–14 = moderate; 15–19 = moderately severe; 20–27 = severe depression), a Student’s t test for independent samples (with unequal variances) and a one-way Analysis of Variance (ANOVA; using Games–Howell for post hoc comparisons) were performed, respectively. The overall alpha level was set at .05.

Results

Patient characteristics and scores on study measures are described in Table 1 using means and standard deviations for continuous variables and percentages for categorical variables. Means and standard deviations for the PHQ-9 and the 12-item WHO-DAS II by group are displayed in Table 2.

Table 1 Characteristics of the study sample
Table 2 Means and standard deviations (SD) for the PHQ-9 and the 12-item WHO-DAS II by group

Discriminating depression “caseness”

The analysis revealed a significant group difference in disability, t (1684.57) = 26.86, P < .001. The 2,612 PC patients with major depression (according to PHQ-9) obtained significantly higher scores on the WHO-DAS II (M = 58.02, SD = 16.39) than the 913 without depression (M = 41.84, SD = 15.41). We computed Cohen’s d from the value of the t test of the differences between the two groups (Rule of thumb: .20 = small; .50 = medium; .80 = large). The effect size was large (d = 1.31).
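As a check on the figures above, the unequal-variances t statistic and its degrees of freedom can be recomputed from the reported group means, standard deviations, and sample sizes. The sketch below also converts t to Cohen’s d using d = 2t/√df; this conversion is an assumption on our part, chosen because it reproduces the reported d = 1.31, and the text does not name the formula used. The function names are illustrative.

```python
import math


def welch_t(mean1, sd1, n1, mean2, sd2, n2):
    """Unequal-variances t statistic and Welch-Satterthwaite df from summary data."""
    v1 = sd1 ** 2 / n1   # squared standard error of group 1 mean
    v2 = sd2 ** 2 / n2   # squared standard error of group 2 mean
    t = (mean1 - mean2) / math.sqrt(v1 + v2)
    df = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))
    return t, df


def cohens_d_from_t(t, df):
    """ASSUMED conversion d = 2t / sqrt(df); not confirmed by the source."""
    return 2 * t / math.sqrt(df)


if __name__ == "__main__":
    # Reported values: depressed (n = 2,612) vs. non-depressed (n = 913).
    t, df = welch_t(58.02, 16.39, 2612, 41.84, 15.41, 913)
    print(round(t, 2), round(df, 1))          # close to the reported t and df
    print(round(cohens_d_from_t(26.86, 1684.57), 2))
```

Because the inputs are rounded to two decimals, the recomputed t and df agree with the reported values only approximately. A standardized difference computed directly from the means and a pooled standard deviation would be somewhat smaller than the value obtained from the t-based conversion, although still above the .80 threshold for a large effect.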

Subsequently, the ROC analysis revealed that the accuracy of the WHO-DAS II with respect to discriminating depression “caseness” was good (see Fig. 1; AUC = .76, SE = .0088, P < .001, 95% CI .75–.78, LR+ = 2.20, LR− = .42). The point of maximum curvature of the ROC analysis suggested that a cutoff score ≥50% yielded the best trade-off between sensitivity (71.4%) and specificity (67.6%) for the 12-item WHO-DAS II, correctly classifying 70.4% of the sample and producing a positive predictive value of 86.3% and a negative predictive value of 45.2%.
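The likelihood ratios, predictive values, and overall classification rate follow arithmetically from the sensitivity, the specificity, and the group sizes. The sketch below recomputes them from the reported figures (2,612 cases and 913 non-cases according to the PHQ-9); the function name and dictionary keys are illustrative, and the expected cell counts are fractional because they are derived from rounded percentages rather than from the raw data.

```python
def diagnostic_summary(sensitivity, specificity, n_cases, n_noncases):
    """Derive likelihood ratios, predictive values and accuracy at one cutoff."""
    tp = sensitivity * n_cases        # expected true positives
    fn = n_cases - tp                 # expected false negatives
    tn = specificity * n_noncases     # expected true negatives
    fp = n_noncases - tn              # expected false positives
    return {
        "lr_pos": sensitivity / (1 - specificity),
        "lr_neg": (1 - sensitivity) / specificity,
        "ppv": tp / (tp + fp),
        "npv": tn / (tn + fn),
        "accuracy": (tp + tn) / (n_cases + n_noncases),
    }


if __name__ == "__main__":
    # Reported sensitivity and specificity at the >= 50% cutoff.
    summary = diagnostic_summary(0.714, 0.676, 2612, 913)
    for name, value in summary.items():
        print(f"{name}: {value:.3f}")
```

The low negative predictive value reflects the high proportion of cases in this sample (about 74%): with so few non-cases, even a moderate false-negative rate outweighs the true negatives.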

Fig. 1 Receiver operating characteristic (ROC) curve for the 12-item WHO-DAS II versus PHQ-9 (gold standard) for major depression diagnosis

Discriminating depression with/without medical comorbidity

The 744 patients with depression without medical comorbidity presented lower scores on the 12-item WHO-DAS II (M = 52.82, SD = 19.30) than the 2,781 depressed participants who had one or more comorbid medical conditions (M = 54.10, SD = 17.15), but this difference was not statistically significant, t (1077.03) = 1.63, P = .10.

Discriminating depression severity

Mean scores (standard deviations) on the 12-item WHO-DAS II were 43.12 (13.23), 54.56 (15.08), and 64.74 (15.51) for patients with moderate (n = 793), moderately severe (n = 1,414), and severe depression (n = 1,105), respectively. The ANOVA yielded significant group differences in disability, F (2, 3309) = 494.55, P < .001 (n = 3,312 after listwise deletion). The effect size analysis based on partial eta-squared (ηp² rule of thumb: .01 = small; .06 = medium; .14 = large) indicated that the difference was large (ηp² = .23). The Games–Howell post hoc test indicated that all pairwise comparisons were statistically significant (Fig. 2).
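The F ratio and the effect size can be recovered from the group sizes, means, and standard deviations reported above. The following sketch uses the standard one-way ANOVA decomposition into between-group and within-group sums of squares; in a one-way design, partial eta-squared coincides with eta-squared. The function name is illustrative, and small discrepancies from the reported values are expected because the inputs are rounded.

```python
def anova_from_summary(groups):
    """One-way ANOVA from summary data.

    groups: list of (n, mean, sd) tuples, one per group.
    Returns (F, df_between, df_within, eta_squared).
    """
    n_total = sum(n for n, _, _ in groups)
    grand_mean = sum(n * mean for n, mean, _ in groups) / n_total
    ss_between = sum(n * (mean - grand_mean) ** 2 for n, mean, _ in groups)
    ss_within = sum((n - 1) * sd ** 2 for n, _, sd in groups)
    df_between = len(groups) - 1
    df_within = n_total - len(groups)
    f_ratio = (ss_between / df_between) / (ss_within / df_within)
    eta_squared = ss_between / (ss_between + ss_within)
    return f_ratio, df_between, df_within, eta_squared


if __name__ == "__main__":
    # Reported values for moderate, moderately severe and severe depression.
    severity_groups = [
        (793, 43.12, 13.23),
        (1414, 54.56, 15.08),
        (1105, 64.74, 15.51),
    ]
    f_ratio, df1, df2, eta_sq = anova_from_summary(severity_groups)
    print(round(f_ratio, 2), df1, df2, round(eta_sq, 2))
```

The same effect size can be obtained from the F ratio alone as F·df1 / (F·df1 + df2), which is a convenient check when only the test statistic is reported.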

Fig. 2 12-item WHO-DAS II scores of patients with depression with different levels of depression severity (according to PHQ-9)

Discussion

The known-groups validity analyses reported here support the utility of the 12-item WHO-DAS II for discriminating depression “caseness” and severity among PC patients with a first diagnosed major depressive episode. However, the instrument was not able to discriminate the presence/absence of medical comorbidity among PC patients with depression.

Our results are in line with those recently obtained by Baron and collaborators [21] with the 36-item WHO-DAS II. These authors divided the patients with inflammatory arthritis into two subsets according to their scores on the Center for Epidemiologic Studies Depression Scale (CES-D) and found that the instrument was able to discriminate between patients with low (CES-D < 19) and high (CES-D ≥ 19) depressive symptoms.

Although our objective was not to examine the validity of the instrument as a diagnostic tool, we found in the ROC analysis that with a cutoff score ≥50%, depression “caseness” was detected with acceptable sensitivity and specificity. Nevertheless, it would not be reasonable to use the instrument as a substitute for available screening tools (e.g. the PHQ-9). In the clinical context, positive screening results are usually followed by a further diagnostic interview. Therefore, the sensitivity of screening instruments should exceed their specificity and be as high as possible (at least 90%) in order to avoid excessive false-negative results. At the same time, it is also necessary to avoid too many false-positive results, which requires a specificity of at least 75% [22]. In the ROC analysis, none of the cutoff points yielded sensitivity and specificity values that met both criteria (data not presented). Therefore, we find it appropriate to use the 12-item WHO-DAS II as a complementary tool, and we recommend administering it in combination with a depression-screening or case-finding instrument.

The discriminative validity reported here was quite similar to that obtained with other disability instruments used in PC [23, 24]. Luciano and collaborators [23] found that the Sheehan disability scale (SDS) had good sensitivity (82%) and specificity (71%) for major depression. Similarly, Leon et al. [24] examined the utility of the SDS for identifying PC patients with any of six mental disorders (alcohol dependence, drug dependence, generalized anxiety disorder, major depression, OCD and panic disorder) and found adequate sensitivity (83%) and specificity (69%). Given that the 12-item WHO-DAS II and the SDS seem to have similar psychometric properties using classical test theory, it might be interesting to analyze in a future study the ability of their individual items to discriminate across varying levels of disability using methods based on item-response theory [25].