Introduction

Primary health care (PHC) plays a key role in the detection and management of depression and represents the entry point to psychiatric care for people with depression (WHO, 2001). While the reported prevalence of depression depends on the diagnostic methods used, it usually ranges between 6 and 15 % (Bauer et al., 2007; King et al., 2008), with some studies reporting even higher rates (e.g., Al-Windi, 2005) and large variability across countries (Tarricone et al., 2012). The majority of depressed patients seek help from and are treated by general practitioners (GPs) (Gaynes et al., 2008; Wancata & Friedrich, 2011), and the severity of symptoms in these individuals is similar to that of psychiatric patients (Gaynes et al., 2005). Unfortunately, it is estimated that over 50 % of depressed patients remain undiagnosed by GPs and are therefore left untreated (Cepoiu et al., 2008; Jackson, Passamonti, & Kroenke, 2007). These individuals have increased levels of mortality and morbidity (Katon, 2003; Licht-Strunk, Beekman, de Haan, & van Marwijk, 2009) and seek medical help more often (Weissman et al., 2010). The early detection and treatment of depression is therefore a worthy aim, both to reduce these negative consequences and to promote remission and prevent relapse. Indeed, in the last decade, the use of screening instruments for the early detection of depression has increased (O’Connor, Whitlock, Beil, & Gaynes, 2009).

In Croatian PHC, the use of depression screening instruments is not common practice, and for this reason there is currently no accurate and valid data on the prevalence of depression in primary care (Stojanovic-Spehar et al., 2009). In contrast, there is frequent use of screening instruments for measuring depression in patients suffering from different somatic illnesses, such as diabetic patients (Pibernik-Okanovic, Peros, Szabo, Begic, & Metelko, 2005), patients infected with HIV (Kolaric, Tesic, Ivankovic, & Begovac, 2006), epilepsy patients (Hecimovic, Bosnjak, & Demarin, 2008), and various other somatic disorders (Filipcic et al., 2007).

Croatia is an Eastern European country currently in a post-transitional period in which society has evolved from a traditional socialist to a liberal capitalist model. Although Croatia is currently at the end of this transition, the conclusion of these important multilevel changes was postponed by war and the later impact of this war on Croatian society. These turbulent conditions over the past two decades became a source of stress for all Croatian citizens. Recent research in a community sample of citizens directly exposed to war in Croatia and other ex-Yugoslavian countries (Priebe et al., 2010) documented higher prevalence rates of depressive disorders (25.9 %) than in Western countries (Richards, 2011), a finding that clearly demands both research and clinical attention. Arguably, it is important to first establish a psychometrically valid and “user-friendly” screening instrument for the early detection of depression that can be used in both PHC practice and in research.

One of the most widely used instruments for depression is the Beck Depression Inventory-Second Edition (BDI-II; Beck, Steer, & Brown, 1996), a self-rating scale for the assessment of the severity of depression in adults and adolescents older than 13 years. Validation studies have shown high test–retest reliability and internal consistency (e.g., Arnau, Meagher, Norris, & Bramson, 2001; Steer, Rissmiller, & Beck, 2000) and moderate to high convergent and divergent validity (e.g., Beck et al., 1996; Kapci, Uslu, Turkcapar, & Karaoglan, 2008). The BDI-II has been translated into many languages and has shown good qualities as a screening method for identifying the possible presence and severity of depressive symptoms (e.g., Campos & Goncalves, 2011; Segal, Coolidge, Cahill, & O’Riley, 2008). The BDI-II is also one of the most commonly used screening measures for adults in PHC settings (Sharp & Lipsky, 2002) and has been recommended specifically for this purpose, together with the Patient Health Questionnaire 9 (PHQ-9) and the Hospital Anxiety and Depression Scale (HADS), by the UK Quality and Outcomes Framework (QOF) (NHS Employers and the General Practitioners’ Committee, 2009). All three recommended screening measures have shown generally strong and comparable psychometric properties in terms of their internal consistency, factor structure and convergent validity (Applied Health Sciences, 2011).

The factor structure of the BDI-II is inconsistent across studies and remains somewhat controversial. Studies using non-clinical samples have often found a structure comprising two factors, usually called Cognitive-affective and Somatic (Beck et al., 1996; Dozois, Dobson, & Ahnberg, 1998; Whisman, Perez, & Ramel, 2000; Wiebe & Penley, 2005), while other studies have failed to confirm this particular structure (e.g., Kojima et al., 2002; Uslu, Kapci, Oncu, Ugurlu, & Turkcapar, 2008). In clinical samples, two factors have also been frequently obtained, but these differed from those found in non-clinical samples and are typically labelled Somatic-affective and Cognitive (Beck et al., 1996; Bedi, Koopman, & Thompson, 2001; Steer et al., 2000). In two studies, a three-factor model was obtained that distinguished between cognitive, somatic and affective factors (Beck, Steer, Brown, & Van der Does, 2002; Buckley, Parker, & Heggie, 2001). However, these studies differed somewhat in how items were assigned to factors. To date, there has been little research examining the factor structure of the BDI-II with general PHC samples. The existing studies have nevertheless replicated the two-factor structure obtained in the original validation with the clinical sample, thus confirming the existence of somatic-affective and cognitive factors (e.g., Arnau et al., 2001; Viljoen, Iverson, Griffiths, & Woodward, 2003).

In clinical practice, the diagnostic validity of cut-off scores empirically derived from psychometric instruments is of critical importance. These scores are assessed using Receiver Operating Characteristics (ROC) Analysis, which includes the value of the area under the ROC curve (AUC), sensitivity, specificity, positive (PPP) and negative predictive power (NPP), as well as the optimal cut-off score in the differentiation of healthy and depressed individuals. For the BDI-II, Beck et al. (1996) suggested the following cut-off scores for patients suffering from depression: minimal (0–13), mild (14–19), moderate (20–28), and severe depression (29–63). While some studies have obtained very similar values for these levels of depression (e.g., Kapci et al., 2008), most studies have found that cut-off scores between 14 and 20 best discriminated healthy individuals from depressed ones (e.g., Arnarson, Olason, Smar, & Sigurethsson, 2008; Bunevicius, Staniute, Brozaitiene, & Bunevicius, 2012; Huffman et al., 2010). These studies were carried out using various samples, with the particular values depending on the specific characteristics of the examined population. To date, only a handful of studies investigating the diagnostic validity parameters of the BDI-II have been conducted with general PHC samples (e.g., Arnau et al., 2001; Dutton et al., 2004). Furthermore, these cut-off scores are not culturally independent. The results of one pan-European study using the BDI (an earlier, and very similar, version of the BDI-II) suggest that the experience of depression may differ cross-culturally (Nuevo et al., 2009). The authors of this study concluded that, while the BDI can be used across cultures in European settings, one must take into account that the probability of responding low or high to several items might be biased by cultural or language issues.

A comparison of the BDI-II structure across different cultures has yet to receive sufficient attention (Nuevo et al., 2009). In order to ensure that a scale is consistent across different versions or with different groups, and that any interpretation based on differences in scores is valid, it is necessary to first establish evidence of measurement invariance. In light of this prerequisite, the aim of this study was to examine the factorial and diagnostic validity of BDI-II in a general PHC sample in Croatia. In doing so, the findings of the present study aim to contribute to further development and understanding of a valid and easily applicable instrument that allows rapid and accurate screening of depressed individuals. In light of previous findings from the limited number of studies conducted in PHC settings with the BDI-II thus far, a two-factor structure (with somatic-affective and cognitive factors) was expected. Further, the diagnostic validity of this questionnaire was expected to meet criteria consistent with high-precision screening instruments.

Method

The sample consisted of 314 adult participants recruited from four primary health care offices in Zagreb, of whom 204 were women (65 %). The age of participants ranged from 25 to 87 years, with an average age of 55.01 years (SD = 12.99). The level of educational attainment amongst participants was also assessed, with 12 % of participants (n = 38) having finished primary school only, 63 % (n = 198) having finished secondary school, and 25 % (n = 78) holding a degree in post-secondary education. All of the participants were Caucasian. Participants were recruited during GP visits; the majority were visiting their GP for various acute physical complaints and related social concerns, such as work-related sick leave. Exclusion criteria for participation included being below 18 or over 90 years of age, an inability to read and write Croatian, or the presence of mental retardation, dementia or other severe cognitive disabilities.

The Beck Depression Inventory-Second Edition (BDI-II; Beck et al., 1996) is a self-report measure used to assess the presence and severity of depressive symptoms. It contains 21 items, each presenting four statements from which respondents select the one that best describes how they have felt over the last 2 weeks. Each item is scored on a 4-point scale (0–3), with higher scores indicating greater severity of depressive symptoms, giving a maximum possible total score of 63. Administration of the questionnaire is simple and usually takes between 5 and 10 min.
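The scoring rule described above can be illustrated with a minimal sketch. It assumes that each of the 21 items is coded 0–3 and applies the severity bands suggested by Beck et al. (1996); the function names `bdi2_total` and `bdi2_severity` are hypothetical and do not correspond to any official scoring software.

```python
from typing import Sequence

# Severity bands for the BDI-II total score as suggested by Beck et al. (1996).
SEVERITY_BANDS = [
    (0, 13, "minimal"),
    (14, 19, "mild"),
    (20, 28, "moderate"),
    (29, 63, "severe"),
]


def bdi2_total(item_scores: Sequence[int]) -> int:
    """Sum the 21 item scores (each assumed to be coded 0-3, maximum 63)."""
    if len(item_scores) != 21:
        raise ValueError("the BDI-II has 21 items")
    if any(score not in (0, 1, 2, 3) for score in item_scores):
        raise ValueError("each item must be scored 0, 1, 2 or 3")
    return sum(item_scores)


def bdi2_severity(total: int) -> str:
    """Map a total score onto the severity band that contains it."""
    for low, high, label in SEVERITY_BANDS:
        if low <= total <= high:
            return label
    raise ValueError("total score must lie between 0 and 63")


# Hypothetical respondent who selects the second statement on every item.
example_total = bdi2_total([1] * 21)
example_band = bdi2_severity(example_total)
```

These bands are the manual's suggestions only; an empirically derived critical value for a given population may differ from them.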

In addition to the BDI-II, two further instruments were used in the present study. The Major Depression Inventory (MDI) and Doctor’s Interview (DI) are two measures that are a part of the World Health Organization (WHO) Information Pack (WHO, 1998), a tool intended for the formal diagnosis of major depressive disorder in PHC settings. The MDI is a self-report questionnaire comprising 10 items, with two items (numbers 8 and 10) divided into two subitems, a and b. For these items, only the highest scores (either a or b) are included in statistical analysis. On each item, respondents are asked to report the amount of time various depressive symptoms were present during the past 14 days. Responses are made using a 6-point Likert scale, ranging from 0 (the symptom was not present at all) to 5 (the symptom was present all of the time). The maximum possible score is 50. The MDI includes a specific scoring algorithm that indicates the absence or presence of depression and, in the latter case, the severity of depression according to ICD-10 and/or DSM-IV criteria. Validation studies were conducted with psychiatric patients and have shown high reliability and diagnostic validity for this instrument (Bech et al., 2001; Olsen, Jensen, Noerholm, Martiny, & Bech, 2003). The DI is an interview conducted by the GP, consisting of 7 questions designed to help the physician confirm the diagnosis of major depressive disorder indicated by the MDI questionnaire. The DI is structured according to three symptom criteria (A, B and C) derived from the MDI, which are explained in detail on the instruments’ official website (WHO, 2006).
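The handling of the two split MDI items can be sketched as follows. This is only an illustration of the total-score rule stated above (ten items rated 0–5, with the higher of subitems a and b counted for items 8 and 10); it does not reproduce the MDI diagnostic algorithm, and the function name and input layout are assumptions made for the example.

```python
def mdi_total(responses: dict) -> int:
    """Compute the MDI total score (0-50).

    `responses` maps item numbers 1-10 to a rating from 0 to 5.
    Items 8 and 10 are given as (a, b) pairs, and only the higher
    of the two subitem ratings enters the total.
    """
    total = 0
    for item in range(1, 11):
        value = responses[item]
        ratings = list(value) if item in (8, 10) else [value]
        if any(not 0 <= rating <= 5 for rating in ratings):
            raise ValueError(f"item {item}: ratings must lie between 0 and 5")
        total += max(ratings)
    return total


# Hypothetical respondent used purely for illustration.
example_responses = {
    1: 3, 2: 2, 3: 4, 4: 1, 5: 0, 6: 2, 7: 1,
    8: (1, 3),   # subitems 8a and 8b: the higher rating (3) is counted
    9: 0,
    10: (2, 0),  # subitems 10a and 10b: the higher rating (2) is counted
}
example_mdi_total = mdi_total(example_responses)
```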

Finally, a demographic data sheet that included information about age, gender, and educational status was completed by each participant.

Procedure

All data were gathered by the authors of the study in four PHC offices in Zagreb, Croatia. Two researchers were assigned to a single office, and each was present for 4 h on a given day. During this time, primary care physicians identified eligible patients from those visiting the office and invited these patients to participate in the study. Information regarding the purpose and procedure of the study was provided to patients by the GP according to a pre-determined standardized protocol. Those who provided informed consent were taken by the researchers to a private room where they completed the demographic data sheet, the BDI-II and the MDI. Those who met the MDI criteria for the presence of major depressive disorder, according to the DSM-IV, were questioned with the DI to confirm the diagnosis. Data were collected over a time period of 2 months. Primary care physicians were informed of the assessment results for these patients. All study procedures were conducted in accordance with ethical standards on human experimentation (World Medical Association Helsinki Declaration) and were approved by the ethics committee of the Faculty of Medicine at the University of Zagreb.

Data Analyses

Data analysis was performed using SPSS (Statistical Package for the Social Sciences) version 17, AMOS 7 and MedCalc version 12. Visual inspection of all completed assessment measures revealed no missing data, which was confirmed by the descriptive statistics.

The reliability of the BDI-II was analyzed using the internal consistency coefficient and corrected item-total correlations. The association between BDI-II scores and the socio-demographic characteristics of participants was determined by t test, Pearson r and Spearman rho correlations. A t test was also used to test the significance of differences in total BDI-II scores between healthy and depressed participants, and Cohen’s d was calculated as a measure of effect size. In order to examine the factor structure of the BDI-II, the optimal number of factors was explored using: (a) Velicer’s minimum average partial (MAP) test together with parallel analysis, (b) a criterion of eigenvalues over 1, and (c) model fit indices of confirmatory factor analysis (CFA). In addition, previously proposed models and the model revealed in our sample through exploratory factor analysis (EFA) were compared using CFAs. The diagnostic validity of the BDI-II was investigated using the MDI and DI as criterion measures for the presence of depression, followed by ROC analysis to calculate the AUC, sensitivity, specificity, positive (PPP) and negative predictive power (NPP), as well as the optimal cut-off score.

Results

Depression Severity and Sociodemographic Characteristics

Based on the results of the MDI and subsequent DI as diagnostic criterion measures, 52 participants (16 % of the total 314 participants) were found to be currently suffering from depression. In contrast, a BDI-II score of 14 and higher was achieved by 26.8 % of participants, which, according to the authors of the questionnaire, corresponds to at least a mild level of depression. The average BDI-II score for participants in this study was 10.35 (SD = 10.27), indicative of a minimal level of depression. The lowest score recorded was 0 and the highest was 51. Across BDI-II items, the lowest score was achieved on the 9th item (Suicidal Thoughts) while the highest was achieved on the 15th item (Loss of Energy).

While there was no statistically significant gender difference (t = 1.319, p > 0.05), the BDI-II score was positively associated with age (r = .12, p < 0.05) and negatively associated with level of educational attainment (rho = −.20, p < 0.01).

Reliability

Internal consistency of the BDI-II was analyzed using Cronbach’s α, a coefficient that reflects both the number of test items and their average intercorrelation. The Cronbach’s α of the BDI-II in the present study was .94, indicating very high reliability. Corrected BDI-II item-total correlations ranged from .44 (Loss of Interest in Sex) to .75 (Tiredness), with most correlations exceeding .60, a finding also suggestive of high internal consistency.
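For readers who wish to reproduce these two reliability indices on their own data, a minimal sketch is given below. It uses the standard variance-based formula for Cronbach's α and correlates each item with the sum of the remaining items to obtain the corrected item-total correlation. The data matrix is a small hypothetical example, not the study data, and the function names are illustrative.

```python
from statistics import mean, variance


def cronbach_alpha(rows):
    """Cronbach's alpha for a list of respondents, each a list of item scores."""
    k = len(rows[0])
    item_variances = [variance([row[j] for row in rows]) for j in range(k)]
    total_variance = variance([sum(row) for row in rows])
    return k / (k - 1) * (1 - sum(item_variances) / total_variance)


def pearson_r(x, y):
    """Pearson product-moment correlation between two equally long lists."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def corrected_item_total(rows, j):
    """Correlation of item j with the sum of all other items."""
    item = [row[j] for row in rows]
    rest = [sum(row) - row[j] for row in rows]
    return pearson_r(item, rest)


# Hypothetical responses of five people to four items scored 0-3.
toy_rows = [
    [0, 1, 0, 0],
    [1, 1, 2, 1],
    [2, 2, 1, 2],
    [3, 2, 3, 3],
    [1, 0, 1, 1],
]
toy_alpha = cronbach_alpha(toy_rows)
toy_item_totals = [corrected_item_total(toy_rows, j) for j in range(4)]
```

The item-total correlation is "corrected" because the item itself is removed from the total before correlating, which avoids inflating the coefficient.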

Factor Analysis

Velicer’s minimum average partial (MAP) test compares the relative amount of systematic and unsystematic variance remaining in a correlation matrix after the extraction of an increasing number of components (O’Connor, 2000). The smallest average squared partial correlation that indicates the appropriate number of components was .0131, which was found in the case of the first factor (Table 1). Consistent with O’Connor’s (2000) recommendation, parallel analyses were also conducted in order to compare the eigenvalues derived from the actual data in the MAP test and the eigenvalues of 1,000 random datasets. Factors are usually retained as long as the eigenvalue from the actual data is greater than the mean (or 95th percentile) eigenvalue of the random data. In the present study, this assumption was satisfied for one factor only (Table 1). Together with the results of the MAP test, this finding supports the one-factor model of depression.
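The two retention rules can be expressed compactly. The sketch below assumes that the MAP values are listed starting from zero components partialled out, and that the random-data benchmarks are the mean (or 95th percentile) eigenvalues from the simulated datasets. Apart from the reported minimum of .0131, all numbers are placeholders chosen only to illustrate the logic and are not taken from Table 1.

```python
def map_number_of_components(avg_squared_partials):
    """Velicer's MAP rule.

    `avg_squared_partials[m]` is the average squared partial correlation
    after m components have been partialled out (m = 0, 1, 2, ...).
    The suggested number of components is the m with the smallest value.
    """
    return min(range(len(avg_squared_partials)),
               key=lambda m: avg_squared_partials[m])


def parallel_analysis_retained(actual_eigenvalues, random_benchmarks):
    """Count leading factors whose eigenvalue exceeds the random benchmark."""
    retained = 0
    for actual, benchmark in zip(actual_eigenvalues, random_benchmarks):
        if actual <= benchmark:
            break  # stop at the first factor that fails the comparison
        retained += 1
    return retained


# Placeholder values: only .0131 is reported in the text.
map_values = [0.20, 0.0131, 0.016, 0.019]
n_map = map_number_of_components(map_values)

# Placeholder eigenvalues for the actual and the random data.
actual = [9.7, 1.2, 0.9]
random_95th = [1.5, 1.4, 1.3]
n_parallel = parallel_analysis_retained(actual, random_95th)
```

With these placeholder values both rules point to a single component, which mirrors the pattern described in the text.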

Table 1 Velicer’s minimum average partial (MAP, 1976) test and parallel analysis (Horn, 1965) test for determining the BDI-II number of components

However, principal axis factoring revealed the hypothesized two-factor solution with a total accounted variance of 47.34 % (F-I: 46.39 %; F-II: 5.76 %). The appropriateness of this factor analysis was confirmed by Bartlett’s test of sphericity (χ2 = 3,568.44, p < 0.001), as well as by the high value of the Kaiser–Meyer–Olkin (KMO) coefficient (0.95). Consistent with our expectations, the present study obtained somatic-affective and cognitive factors (Table 2). These two factors were found to correlate strongly (r = .647, p < 0.01).

Table 2 Pattern matrix of principal axis factoring

The single-factor model, five 2-factor models and two 3-factor models were all tested using CFAs. Model fit indices for all models are shown in Table 3.

Table 3 Goodness of fit indices for various BDI-II models

The 3-factor model proposed by Beck et al. (2002), consisting of cognitive, somatic and affective factors, achieved the best fit indices overall, while the 3-factor model obtained by Buckley et al. (2001) achieved similar values. When the CMIN/df (χ2/df) statistic is considered, all models demonstrated acceptable fit (CMIN/df was lower than 5 in all models). For all tested models, GFI and TLI scores were lower than the minimally expected .90, but were moderately acceptable in both 3-factor models. In addition, these latter two were the only models with moderately acceptable CFI scores (>.90). For all models, RMSEA significantly differed from the expected interval (it was significantly larger than .05, p = .001). Figure 1 depicts factor loadings and factor intercorrelations for Beck et al.’s (2002) model, which achieved the best fit indices in relation to all other models. All factor loadings are significant and the intercorrelations between the three factors are high.

Fig. 1 Factor loadings and factor intercorrelations for the 3-factor model based on Beck et al. (2002)

Interestingly, the 3-factor solution is not consistent with our expectations, which were based on previous studies conducted in PHC settings (e.g., Arnau et al., 2001; Viljoen et al., 2003). This is perhaps due to the fact that previous studies used exploratory factor analysis instead of CFA, which allows for the comparison of different factor models and the determination of the best fitting model.

Diagnostic Validity

Healthy and depressed participants, categorized according to the results of the MDI and DI as criterion measures, differed significantly (t = 13.13, p < 0.001) on the BDI-II scores, with a large effect size (Cohen’s d = −3.04, r = −.84). In other words, healthy participants (M = 6.92, SD = 5.590) had, on average, lower BDI-II scores than depressed participants (M = 27.62, SD = 11.090). According to the cut-off values proposed by Beck et al. (1996), participants with depressive disorder, diagnosed according to the criterion measures, had average BDI-II scores that indicated moderate levels of depression.
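The reported effect size can be recomputed from the group statistics given above. The sketch below assumes a pooled-standard-deviation version of Cohen's d, group sizes of 262 healthy and 52 depressed participants (314 minus the 52 depressed cases), and the common conversion r = d / √(d² + 4), which strictly assumes equal group sizes but appears to match the reported value. The function names are illustrative.

```python
from math import sqrt


def cohens_d(m1, sd1, n1, m2, sd2, n2):
    """Cohen's d using the pooled standard deviation of two groups."""
    pooled_variance = ((n1 - 1) * sd1 ** 2 + (n2 - 1) * sd2 ** 2) / (n1 + n2 - 2)
    return (m1 - m2) / sqrt(pooled_variance)


def d_to_r(d):
    """Convert d to a correlation-type effect size (equal-n approximation)."""
    return d / sqrt(d ** 2 + 4)


# Group statistics as reported in the text; group sizes are inferred
# from the total sample (314) and the number of depressed cases (52).
d = cohens_d(6.92, 5.59, 262, 27.62, 11.09, 52)
r = d_to_r(d)
```

The negative sign simply reflects the order of subtraction (healthy minus depressed).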

A Receiver Operating Characteristics (ROC) analysis was conducted to determine the diagnostic characteristics of the BDI-II as a screening instrument that discriminates between participants with and without a depressive disorder diagnosis, as established by the previously described criterion measures. The AUC value was 0.96 (95 % CI .94 to .98), which differed significantly from the area under the curve of an instrument that would discriminate participants by chance (z = 46.92, p < 0.001). Sensitivity, specificity, positive predictive power (PPP) and negative predictive power (NPP) were also examined for all possible total scores in order to identify the optimal critical value for discriminating non-depressed and depressed individuals in this sample. As shown in Table 4, the maximum Youden index value (sensitivity + specificity − 100) of 79.68 was achieved at a critical value of 15/16. The highest diagnostic accuracy of the BDI-II was therefore achieved at this critical value, which offers the best balance between sensitivity (88.46 %) and specificity (91.22 %). In addition, the PPP at this critical value was 66.7 %, while the NPP was very high (97.6 %). These results confirm our expectation that the BDI-II is a screening instrument with high levels of diagnostic validity. A score of 16 or higher, indicative of at least a mild level of depression, was achieved by 22 % of the participants.
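The relationship between these indices can be made explicit with a small worked example. The cell counts used below (46 true positives, 6 false negatives, 239 true negatives, 23 false positives) are not reported in the text; they are reconstructed here from the reported percentages and the group sizes (52 depressed, 262 non-depressed) and should be read as an assumption. The function name is illustrative.

```python
def screening_metrics(tp, fn, tn, fp):
    """Return diagnostic indices (in percent) from a 2 x 2 classification table."""
    sensitivity = 100 * tp / (tp + fn)   # depressed correctly flagged
    specificity = 100 * tn / (tn + fp)   # non-depressed correctly cleared
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "ppp": 100 * tp / (tp + fp),     # positive predictive power
        "npp": 100 * tn / (tn + fn),     # negative predictive power
        "youden": sensitivity + specificity - 100,
    }


# Counts at the 15/16 critical value, reconstructed from the reported
# percentages (an assumption, not data taken from Table 4).
metrics = screening_metrics(tp=46, fn=6, tn=239, fp=23)

# Share of the whole sample scoring 16 or higher (true + false positives).
percent_screen_positive = 100 * (46 + 23) / 314
```

Under these reconstructed counts, the indices agree with the values reported above, including the 22 % of participants who scored 16 or higher.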

Table 4 Critical values, sensitivity, specificity, positive predictive power (PPP) and negative predictive power (NPP) in distinguishing between non-depressed and depressed subjects

Discussion

The results of the present study indicated that 16 % of the sample met criteria for depression, according to the MDI and DI. A BDI-II score of 14 or higher (recommended in the BDI-II manual as a cut-off score for mild depression) was achieved by 26.8 % of subjects, while a score of 16 or higher (the cut-off score based on the ROC curve) was achieved by 22 % of subjects. Together, these results indicate somewhat higher prevalence rates for depressive disorder in this study than those found in PHC patients in research from other national and cultural communities (e.g., Arnau et al., 2001; Dutton et al., 2004), particularly in the case of the BDI-II scores. There are several potential explanations for these results. The first potential reason for this finding relates to environmental factors specific to the Croatian context that might produce stressful circumstances. As previously mentioned, Croatia is an Eastern European country in a post-transitional period with recent experience in war and post-war consequences and currently experiencing an economic recession. However, an alternative explanation for these results might be found in a consideration of the structure of the sample in the present study. For the most part, the participants in this study were PHC patients who were visiting their GP with acute medical complaints or conditions, a factor that might have elevated the prevalence of depressive disorder. Indeed, the achieved prevalence rates in the present study are comparable to those from a study conducted with a sample of Croatian patients suffering from a chronic somatic illness, in which 29 % of patients achieved BDI-II results above the cut-off score (Filipcic et al., 2007). 
Furthermore, the observed discrepancy, in relation to the prevalence of depression, between the criterion measures (MDI and DI) and the BDI-II might be due to the fact that the BDI-II is more highly saturated with items describing somatic complaints and thus slightly over-estimates the number of depressed individuals, particularly amongst a sample of PHC patients. This criticism of the BDI-II has similarly been raised in a recent study examining the BDI-II as a screening measure for mood disorders in pregnant women (Curzik & Jokic Begic, 2012).

The highest score on the BDI-II was achieved on the Loss of Energy item, which is not surprising considering that energy loss is a common symptom of numerous health problems. More specifically, this finding might be the result of the fact that participants, PHC patients visiting their GP with an acute medical complaint, were interviewed in a situation in which they were experiencing some form of physical or mental discomfort. It might also be argued that the over-expression of this symptom can be explained by the collectivist culture to which Croatia belongs, a context in which the reporting of physical symptoms such as energy loss might be more prevalent. An extensive meta-analysis (Oyserman, Coon, & Kemmelmeier, 2002) demonstrated that Southern European cultures presented higher levels of collectivism than northern European countries (e.g., Norway) or English-speaking countries (e.g., Australia). In a similar examination on the differences in social support amongst European countries, the ODIN study also reported results pointing in a similar direction (Lehtinen et al., 2003). In such a cultural context, it seems logical that symptoms demonstrating a help-seeking message are more dominant. Conversely, the lowest score on the BDI-II was achieved on the Suicidal Thoughts item. This is perhaps a similarly unsurprising finding, where suicide is generally a final choice when people do not find other solutions to their problems. This symptom is present at similarly low levels in northern European countries (Nuevo et al., 2009), in which a more highly individualist culture encourages the independent resolution of one’s problems.

Although depression is more common among women than men, a significant gender difference in total BDI-II scores was not obtained in the present sample. Findings from other studies in this regard are inconsistent (Dozois et al., 1998; Arnau et al., 2001; Osman, Kopper, Barrios, Gutierrez, & Bagge, 2004; VanVoorhis & Blumentritt, 2007).

In the present study, the correlation between age and BDI-II score was low and positive, while the correlation between level of educational attainment and total score was low and negative. Similar results were obtained in Arnau et al.’s (2001) study with a PHC sample, while other studies conducted with different samples provided inconsistent results concerning the association between age and BDI-II scores (Steer, Kumar, Ranieri, & Beck, 1998; Kojima et al., 2002; VanVoorhis & Blumentritt, 2007).

The BDI-II proved to be a highly reliable instrument, consistent with the estimates of reliability in other studies conducted with general PHC samples, where reliability estimates are generally around 0.90 (e.g., Arnau et al., 2001; Dutton et al., 2004).

In this and other studies, the findings with regard to the structure of the scale and the existence of subscales are somewhat contradictory. While high internal reliability and Velicer’s MAP test (along with parallel analysis, as recommended by O’Connor (2000)) both confirmed the adequacy of computing a total depression score as a reliable indicator of depression, the eigenvalue criterion of principal axis factoring suggested that a two-factor solution should be used. Consistent with the results from the present study, a two-factor structure with somatic-affective and cognitive dimensions was obtained in studies using comparable general PHC samples (Arnau et al., 2001; Viljoen et al., 2003), as well as in studies using other non-clinical samples (e.g., Kojima et al., 2002; Uslu et al., 2008). However, in the present study, confirmatory factor analysis demonstrated that the two 3-factor models had the best fit indices, with the model proposed by Beck et al. (2002) achieving the best overall values of the various fit indices. A similar model obtained by Buckley et al. (2001) was also found to be quite satisfactory. The existence of three factors has also been found in other BDI-II studies. In one study using a sample of adolescent psychiatric outpatients, negative attitude, performance difficulty and somatic elements were found as factors (Osman et al., 2004), a result confirmed by Carmody (2005) using a student sample. Finally, a recent large study using confirmatory factor analysis also yielded a three-factor structure (with cognitive, affective and somatic factors), although the authors also suggested the need for a shortened version of the BDI-II (Vanheule, Desmet, Groenvynck, Rosseel, & Fontaine, 2008).

The discrepancies between factor solutions observed using different analytical methods are partly a consequence of high inter-item correlations and high associations between cognitive, somatic and affective factors. These findings support the use of the total BDI-II score in clinical settings. In addition, there are two important advantages to using this total score in primary care specifically. First, the total score covers the broad clinical picture of major depressive disorder based on DSM-IV criteria. Second, a number of studies, including the present study, have empirically derived critical values using the whole instrument rather than its subscales. However, the findings of the present study also suggest that three subscales with a good CFA fit can be used by researchers and clinicians interested in a straightforward and unambiguous assessment of the cognitive, affective, and somatic symptoms of depression measured by the BDI-II. This becomes especially relevant in light of the fact that these subscales have shown unique relations with different psychological variables, such as alexithymia (Vanheule, Desmet, Verhaeghe, & Bogaerts, 2007) and autobiographical memory (Mackinger & Svaldi, 2004). Indeed, more research is needed to develop a clearer picture of each subdimension of depression as well as their potential sensitivity to different antidepressant treatment options used in primary care.

In order to verify the usefulness of the BDI-II as a screening instrument for depression in a PHC population, various parameters of diagnostic validity were analyzed. When differentiating healthy and depressed patients, the AUC value was very high, thus locating the BDI-II in the category of high-precision diagnostic instruments (Streiner & Cairney, 2007). Specifically, the findings indicated a 96 % chance that a randomly selected individual from the depressed group would have a higher overall score than a randomly selected individual from the healthy group. Other studies conducted with different populations have also demonstrated relatively high levels of general diagnostic informativeness for this instrument, with AUC values usually ranging between 0.86 and 0.96 (e.g., Arnau et al., 2001; Kumar, Steer, Teitelman, & Villacis, 2002; Uslu et al., 2008). The results also indicated that the largest sum of sensitivity and specificity was obtained with a critical value of 15/16, meaning that total BDI-II scores of 16 and above were indicative of an increased probability of depressive disorder in PHC patients. Furthermore, the levels of both sensitivity and specificity were approximately equal and very satisfactory at this critical value. Despite a certain number of false positive results, these findings suggest that, if a patient’s score is below the critical value, one can infer with very high certainty that he or she is not clinically depressed, a conclusion reflected in the very high value of NPP. Arguably, this is a more critical feature for instruments used for screening purposes in PHC, where the goal is to avoid a failure to detect patients who suffer from this disorder.
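The probabilistic reading of the AUC given above can be illustrated directly: the AUC equals the proportion of all depressed/healthy pairs in which the depressed individual has the higher score. The sketch below counts tied pairs as one half, which is the usual convention but an assumption here; the scores are hypothetical and the function name is illustrative.

```python
def auc_by_pairs(depressed_scores, healthy_scores):
    """AUC as the share of (depressed, healthy) pairs ranked correctly.

    A pair counts 1 if the depressed person scores higher,
    0.5 if the two scores are tied, and 0 otherwise.
    """
    wins = 0.0
    for d_score in depressed_scores:
        for h_score in healthy_scores:
            if d_score > h_score:
                wins += 1.0
            elif d_score == h_score:
                wins += 0.5
    return wins / (len(depressed_scores) * len(healthy_scores))


# Hypothetical BDI-II totals used purely for illustration.
toy_depressed = [30, 22, 15]
toy_healthy = [5, 10, 15, 2]
toy_auc = auc_by_pairs(toy_depressed, toy_healthy)
```

In this toy example 11.5 of the 12 possible pairs are ranked correctly, so the AUC is about .96; an AUC of .50 would correspond to chance-level discrimination.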

It should be kept in mind that the decision to use a certain critical value must be based on the unique characteristics of the population being served and the purpose for which the instrument is used (Pintea & Moldovan, 2009). To date, only a few studies have examined the parameters of diagnostic validity of the BDI-II in the context of PHC, reporting critical values of 14 (Dutton et al., 2004) and 18 (Arnau et al., 2001). In Dutton et al.’s (2004) study, the percentage of depressed PHC patients was significantly higher than the average prevalence of depression generally reported in PHC. Therefore, the authors proposed the use of a higher critical value, similar to that found in the present study, where the prevalence of depressed patients (16 %) was similar to the expected prevalence. Determining the most appropriate critical value also depends on the purpose of the instrument being examined. In the case of screening measures for use in PHC, it is perhaps most critical that such instruments have high sensitivity in order to minimize the number of false negative outcomes while at the same time retaining a reasonable level of specificity. For example, in the present study, a critical value of 12 would result in very high sensitivity and NPP, making it nearly impossible to fail to identify patients with depressive disorder. However, applying this critical value would reduce specificity and PPP to the point that 50 % of patients labeled as depressed would actually be healthy, thus resulting in unnecessary further assessment or treatment for these individuals.
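
The dependence of predictive power on prevalence can be made concrete with a short Python sketch based on Bayes' rule. The sensitivity and specificity values used here are illustrative assumptions rather than the estimates obtained in the present study; only the 16 % prevalence is taken from the text, and the 40 % comparison value is hypothetical.

```python
def predictive_values(sensitivity, specificity, prevalence):
    """Positive and negative predictive power (PPP, NPP) for a screening test,
    derived from sensitivity, specificity, and prevalence via Bayes' rule."""
    true_pos = sensitivity * prevalence
    false_pos = (1 - specificity) * (1 - prevalence)
    false_neg = (1 - sensitivity) * prevalence
    true_neg = specificity * (1 - prevalence)
    ppp = true_pos / (true_pos + false_pos)
    npp = true_neg / (true_neg + false_neg)
    return ppp, npp


# Illustrative sensitivity/specificity of 0.90 each (an assumption, not a study result).
for prevalence in (0.16, 0.40):
    ppp, npp = predictive_values(0.90, 0.90, prevalence)
    print(f"prevalence={prevalence:.2f}  PPP={ppp:.2f}  NPP={npp:.2f}")
```

With sensitivity and specificity held fixed, a higher prevalence raises PPP and lowers NPP, which is one reason why a critical value derived in a high-prevalence sample may not transfer directly to a setting with a lower prevalence.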

The importance of implementing effective screening for depression in PHC is reflected in research demonstrating that GPs sometimes fail to detect more than half of depressed patients (Cepoiu et al., 2008; Jackson et al., 2007). Research has also shown that the use of screening measures increases the detection of depression in PHC (Pignone et al., 2002; Gilbody, Whitty, Grimshaw, & Thomas, 2003), but that systematic screening should also be an integral part of depression support programs in order to achieve a significant reduction of depressive symptoms, an improvement in social and work-specific functioning, and a reduction in mortality rates of PHC patients (Gilbody, House, & Sheldon, 2008; O’Connor et al., 2009; Mitchell, 2012; Sikorski, Luppa, König, van den Bussche, & Riedel-Heller, 2012). These programs include various aspects of depression care such as systematic screening, office staff training, individualized evidence-based treatment, closer monitoring of the patient, and available mental health referral. It has also been shown that depression-specific instruments such as the BDI-II and high-risk screening instruments that use empirically derived critical values have a greater effect in depression care (Gilbody et al., 2008). In the context of the present study, namely Croatian PHC, the BDI-II could be easily used because it takes little time to complete. In this setting, the instrument might be administered to patients waiting to see the GP, thus avoiding unnecessary extra time with the GP. Those individuals who score above the optimal critical value of 15 (i.e., 16 and above) would then be examined using a diagnostic interview for final confirmation of the presence of a depressive disorder. Furthermore, specific results from each of the three BDI-II factors could help the GP to better determine the specific nature of each patient’s depressive disorder by providing information about which of the three types of symptoms (affective, cognitive, and/or somatic) are predominant in the clinical picture. This information could potentially allow the GP to provide the most appropriate antidepressant treatment, although more research is needed before such conclusions can be made. An additional advantage of the BDI-II is that it is often used as an antidepressant treatment outcome measure in various populations, including PHC patients (e.g., de Graaf et al., 2009; Reeves, Rohan, Langenberg, Snitker, & Postolache, 2012).

A number of limitations to the present study should be considered alongside the presented findings. As previously mentioned, a self-report instrument (MDI) was used as a criterion measure against which the diagnostic validity of the BDI-II was examined. Due to their shared method variance, the diagnostic usefulness of the BDI-II might have been somewhat inflated. It is perhaps more methodologically sound to use validated diagnostic interviews such as the Structured Clinical Interview for DSM-IV (SCID-IV; Spitzer, Williams, Gibbon, & First, 1994) as the criterion measure. In addition, the participation refusal rate was not recorded for practical reasons. Although the vast majority of patients approached agreed to participate in the study, the number and depression-related characteristics of those who refused to participate are unknown, which limits the generalization of the findings on the prevalence of depression and the diagnostic parameters of the BDI-II, including the size of the critical value. Furthermore, the use of a convenience sampling method in the present study may have influenced the response rate and style of the participants. Indeed, one might hypothesize that individuals suffering from depression and related mental disorders might be more prone to agreeing to take part in such a study, perhaps because they expect some form of psychosocial benefit from participation. If this were the case, the prevalence of depression in the whole sample would have been inflated. Finally, in light of the confirmation of the 3-factor structure of the BDI-II achieved in the present study, it might be useful to conduct future ROC analyses using these three subscales separately.

On the whole, the findings of the present study suggest that the Croatian version of the BDI-II is a highly reliable instrument, with satisfactory structural validity and high diagnostic accuracy. As such, these findings support the use of the BDI-II as a screening instrument in Croatian primary health care settings.