Introduction

Prevalence estimates of depression in the US have varied widely, ranging from 3.8 to 27.3% in studies based on structured patient interviews and from 11.5 to 60.7% in studies based on self-administered questionnaires (Anderson et al. 2001). A recent report, based on 2006 data from the Behavioral Risk Factor Surveillance System (BRFSS), estimated the prevalence of depressive disorder assessed by the 8-item Patient Health Questionnaire (PHQ-8) to be 9.1% in the general population and 16.6% among persons with diabetes (Kroenke et al. 2009; Li et al. 2008).

The identification of depression reflects the sensitivity and specificity of the diagnostic instrument. The Structured Clinical Interview for Diagnostic and Statistical Manual of Mental Disorders (SCID) is currently the gold standard but must be administered by an experienced clinician or trained mental health professional and is not suitable for population surveys (First et al. 2002). The PHQ-8 used in BRFSS was derived from the 9-item Patient Health Questionnaire (PHQ-9), which queries the nine criteria for depressive disorders defined in the Diagnostic and Statistical Manual of Mental Disorders, Fourth Edition (DSM-IV) (American Psychiatric Association 1994). The validity of the PHQ-9 for establishing provisional diagnoses of major and minor depression and evaluating the severity of depressive symptoms (Kroenke and Spitzer 2002) has been established in the general population (Martin et al. 2006) and in diverse ethnic groups (Huang et al. 2006). The PHQ-9 diagnoses of major depression have 73% sensitivity and 98% specificity compared with independent diagnoses made by mental health professionals using the SCID (Spitzer et al. 1999). The PHQ-8 omits the item on suicidality and has validity for major depression similar to that of the PHQ-9 (Kroenke and Spitzer 2002). The PHQ-8 internal consistency reliability coefficient from previous studies is 0.82 (Pressler et al. 2010).
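To illustrate how sensitivity and specificity shape a prevalence estimate, the sketch below applies the standard relation between true and apparent prevalence, together with its inversion (often called the Rogan-Gladen estimator). The 73% and 98% figures are those quoted above for PHQ-9 major depression against the SCID; the 5% true prevalence is a purely hypothetical input, and the function names are assumptions made for this example, not part of any survey software.

```python
def apparent_prevalence(true_prev: float, sensitivity: float, specificity: float) -> float:
    """Expected share screening positive: true positives plus false positives."""
    return sensitivity * true_prev + (1.0 - specificity) * (1.0 - true_prev)


def corrected_prevalence(apparent_prev: float, sensitivity: float, specificity: float) -> float:
    """Invert the relation above (Rogan-Gladen estimator), clipped to [0, 1]."""
    denom = sensitivity + specificity - 1.0
    if denom <= 0:
        raise ValueError("instrument must perform better than chance")
    estimate = (apparent_prev + specificity - 1.0) / denom
    return min(1.0, max(0.0, estimate))


# Values quoted in the text for PHQ-9 major depression versus the SCID.
SENS, SPEC = 0.73, 0.98

# Hypothetical true prevalence of 5%, chosen only for illustration.
observed = apparent_prevalence(0.05, SENS, SPEC)
recovered = corrected_prevalence(observed, SENS, SPEC)
print(round(observed, 4), round(recovered, 4))
```

The point of the sketch is only that the same population can yield different apparent prevalences under instruments with different error rates; it does not model the effect of administration mode.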

In addition to the diagnostic instrument, modes of questionnaire administration in surveys impact both data quality and prevalence estimates, particularly for sensitive topics and stigmatized behaviors (Tourangeau and Smith 1996; Bowling 2005; Hanmer et al. 2007; Salomon et al. 2009; Smith et al. 2009). A validation study conducted among 346 participants showed that the PHQ-9 scores were highly correlated (r = 0.92) between the self-administered and telephone-administered questionnaire (Pinto-Meza et al. 2005). Good agreement also exists between face-to-face interview and telephone interview scores for mental health assessed by the General Health Questionnaire (intra-class correlation coefficient r = 0.83) and Revised Clinical Interview Schedule (intra-class correlation coefficient r = 0.92) (Evans et al. 2004). We are unaware of studies comparing identification of depression by the PHQ-8 administered by computer-assisted telephone interviewing (CATI) and computer-assisted personal interviewing (CAPI). Thus, we conducted this study to compare the prevalence estimates of the PHQ-8 depression between the CATI, as administered in the 2006 BRFSS, and CAPI, as administered in the 2005–2006 National Health and Nutrition Examination Survey (NHANES).

Methods

Study design and population

The BRFSS uses a random-digit dialing (RDD) method to obtain a state-representative sample of the civilian, non-institutionalized population aged ≥18 years in all 50 states, the District of Columbia, and territories. The BRFSS uses a disproportionate stratified random sampling (DSS) method to obtain a probability sample within each state. Information obtained from the sample can be used to generalize results to the total population in the state as well as to the nation as a whole. With DSS, telephone numbers are drawn from two strata that are based on the presumed density of known household telephone numbers. In this design, telephone numbers are classified into strata that are either high density or medium density to yield residential telephone numbers. Telephone numbers in the high density stratum are sampled at the highest rate. The 2006 BRFSS used CATI to assess key behavioral risk factors and chronic conditions among adults [Centers for Disease Control and Prevention (CDC) 2007].

The NHANES 2005–2006 is a complex, multistage probability sample of the civilian non-institutionalized population of the US. The NHANES 2005–2006 uses a four-stage sample design: (1) primary sampling units (PSUs) comprising mostly single counties, (2) area segments within PSUs, (3) households within area segments, and (4) persons within households (Plan and operation of the Third National Health and Nutrition Examination Survey 1994; Centers for Disease Control and Prevention 2007). NHANES combines in-home face-to-face CAPI and physical examinations at the mobile examination center to collect data on individuals. In this study, we restricted our analyses in the NHANES 2005–2006 sample to adults aged ≥18 years.

Measurement of depression

The same PHQ-8 was administered to participants aged ≥18 years in the 2006 BRFSS by CATI and to participants aged ≥18 years in the 2005–2006 NHANES by CAPI conducted by trained lay interviewers. Major depression was defined as having at least five of eight PHQ-8 criteria, one of which must be depressed mood or loss of interest or pleasure, for ≥7 days in the past 2 weeks. Minor depression was defined as having two to four of the eight PHQ-8 criteria, one of which must be depressed mood or loss of interest or pleasure, for ≥7 days in the past 2 weeks (American Psychiatric Association 1994; Kroenke and Spitzer 2002). The PHQ-8 depression included both major and minor depression. Alternatively, a severity score of 0–3 was assigned to each item (0 = 0–1 day, 1 = 2–6 days, 2 = 7–11 days, and 3 = 12–14 days), yielding a total score between 0 and 24 points. A total score of 0–4, 5–9, 10–14, 15–19, and 20–24 represents no, mild, moderate, moderately severe, and severe depressive symptoms, respectively. A total score ≥10 has been recommended as a cutoff point for screening depression (Kroenke and Spitzer 2002). Because using either a combination of major and minor depression or a total of PHQ-8 score ≥10 to define depression has yielded similar prevalence estimates (Li et al. 2008), we used the former definition in our current analysis.
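The two scoring rules described above can be sketched in a few lines. This is a minimal illustration, not the surveys' actual processing code: the input is assumed to be the number of days (0 to 14) reported for each of the eight items, and the item order (index 0 = depressed mood, index 1 = loss of interest or pleasure) is an assumption made for this example.

```python
from typing import Sequence

# Assumed item order for this sketch: 0 = depressed mood, 1 = loss of interest or pleasure.
CORE_ITEMS = (0, 1)


def item_score(days: int) -> int:
    """Map days with a symptom in the past 2 weeks to a 0-3 severity score."""
    if not 0 <= days <= 14:
        raise ValueError("days must be between 0 and 14")
    if days <= 1:
        return 0
    if days <= 6:
        return 1
    if days <= 11:
        return 2
    return 3


def total_score(days_per_item: Sequence[int]) -> int:
    """Sum of the eight item scores, ranging from 0 to 24."""
    if len(days_per_item) != 8:
        raise ValueError("the PHQ-8 has exactly eight items")
    return sum(item_score(d) for d in days_per_item)


def severity(total: int) -> str:
    """Severity category for a total score, using the bands given in the text."""
    if not 0 <= total <= 24:
        raise ValueError("total score must be between 0 and 24")
    for upper, label in ((4, "none"), (9, "mild"), (14, "moderate"),
                         (19, "moderately severe"), (24, "severe")):
        if total <= upper:
            return label


def diagnosis(days_per_item: Sequence[int]) -> str:
    """Criteria-based classification: a criterion is met when reported for >= 7 days."""
    if len(days_per_item) != 8:
        raise ValueError("the PHQ-8 has exactly eight items")
    met = [d >= 7 for d in days_per_item]
    has_core = any(met[i] for i in CORE_ITEMS)
    count = sum(met)
    if has_core and count >= 5:
        return "major"
    if has_core and 2 <= count <= 4:
        return "minor"
    return "none"


example = [10, 8, 7, 7, 7, 0, 0, 0]
print(diagnosis(example), total_score(example), severity(total_score(example)))
```

Under the definition used in the analysis, a respondent counts as having PHQ-8 depression when `diagnosis` returns either "major" or "minor"; the alternative screening definition corresponds to `total_score(...) >= 10`.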

Assessment of demographic characteristics, lifestyle habits, and chronic illnesses

In both the 2006 BRFSS and the 2005–2006 NHANES, all covariates were assessed by participants’ self-report using similar instruments. Demographic characteristics included sex (men vs. women), age (18–24, 25–44, 45–64, ≥65 years), race/ethnicity (non-Hispanic white, non-Hispanic black, Hispanic, other), education attainment (<high school, high school/some college, ≥college), and marital status (married, previously married, never married). Smoking status was categorized as current smokers (have smoked at least 100 cigarettes during their entire life and smoked at the time of the interview), former smokers (have smoked at least 100 cigarettes during their entire life but did not smoke at the time of the interview), and persons who never smoked. Body mass index [BMI = weight (kg)/height² (m²)] was calculated by using self-reported weight in kilograms and height in meters. Diabetes, myocardial infarction (MI), coronary heart disease (CHD), stroke, and asthma were determined by participants’ self-report. Participants’ general health status was categorized as being fair or poor versus excellent, very good, or good. Disability was determined by asking participants, “Do you now have any health problem that requires you to use special equipment, such as a cane, a wheelchair, a special bed, or a special telephone?” in both surveys. Persons who answered this question affirmatively were considered to have a disability.
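The BMI formula and the smoking categories above translate directly into code. The sketch below is illustrative only: the function names and boolean inputs are assumptions for this example, and the obesity cutoff of 30 kg/m² is the one used in the Results.

```python
def bmi(weight_kg: float, height_m: float) -> float:
    """Body mass index: weight in kilograms divided by height in meters squared."""
    if weight_kg <= 0 or height_m <= 0:
        raise ValueError("weight and height must be positive")
    return weight_kg / height_m ** 2


def is_obese(weight_kg: float, height_m: float) -> bool:
    """Obesity defined as BMI >= 30 kg/m^2."""
    return bmi(weight_kg, height_m) >= 30.0


def smoking_status(smoked_100_lifetime: bool, smokes_now: bool) -> str:
    """Three-level smoking status from two self-reported answers."""
    if not smoked_100_lifetime:
        return "never"
    return "current" if smokes_now else "former"


print(round(bmi(95.0, 1.75), 1), is_obese(95.0, 1.75), smoking_status(True, False))
```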

Statistical analysis

We compared sample characteristics between the 2006 BRFSS and the 2005–2006 NHANES. We estimated standardized internal consistency reliability coefficient of the PHQ-8 in the two surveys separately and estimated the crude and age-standardized prevalence of depression by different diagnostic criteria, severity category, and each single item. We also assessed the differences in the prevalence of the PHQ-8 depression in each subgroup stratified by demographic characteristics, lifestyle habits, and chronic illnesses with adjustment for sex, age, race/ethnicity, education attainment, and marital status.
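For readers unfamiliar with the reliability coefficient, the sketch below assumes it is the standardized Cronbach's alpha, computed from the mean pairwise correlation among the k items as k·r̄ / (1 + (k − 1)·r̄). The text does not spell out the formula, so this is an assumption; the example also ignores survey weights, which the actual analysis used, and all names are hypothetical.

```python
from itertools import combinations
from math import sqrt
from typing import Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Unweighted Pearson correlation (both inputs must have non-zero variance)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def alpha_from_mean_r(k: int, mean_r: float) -> float:
    """Standardized alpha from the number of items and mean inter-item correlation."""
    return k * mean_r / (1.0 + (k - 1) * mean_r)


def standardized_alpha(items: Sequence[Sequence[float]]) -> float:
    """items[i] holds the scores on item i for the same respondents, in the same order."""
    k = len(items)
    if k < 2:
        raise ValueError("at least two items are required")
    pairs = list(combinations(range(k), 2))
    mean_r = sum(pearson(items[i], items[j]) for i, j in pairs) / len(pairs)
    return alpha_from_mean_r(k, mean_r)


# With eight items and a hypothetical mean inter-item correlation of 0.4:
print(round(alpha_from_mean_r(8, 0.4), 3))
```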

To control for possible differences in the distribution of demographic characteristics between the two surveys, we estimated the adjusted prevalence of PHQ-8 depression using a predicted marginal approach in log-linear regression analyses. Predictive marginals have been proposed as a direct standardization method controlling for differences in the distribution of covariates (Graubard and Korn 1999). PHQ-8 depression was dichotomized as 1 = yes and 0 = no and modeled as the dependent variable in the log-linear regression. Depression prevalence in each survey was estimated while controlling for possible variations in sex, age, race/ethnicity, education attainment, and marital status. The difference in the prevalence between the two surveys was tested with a two-sample t test. We performed data management and variable recoding using SAS (version 9.2) and conducted all analyses using SUDAAN software (Release 9.0; Research Triangle Institute, Research Triangle Park, NC). In all analyses, sample weights were used to account for the unequal selection probabilities of the complex sampling designs and for non-response. We considered results with a two-tailed P value <0.05 to be statistically significant.
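The idea behind predictive marginals can be sketched as follows: given a fitted model, predict each pooled respondent's probability of depression as if he or she had been interviewed in a given survey, then take the weighted average. The code below is a simplified illustration and does not reproduce the SUDAAN procedure. The coefficients, the single covariate, the weights, and all names are hypothetical, and model fitting and variance estimation are omitted.

```python
from math import exp
from typing import Callable, Dict, List, Tuple

Respondent = Tuple[Dict[str, float], float]  # (covariates, sample weight)


def predicted_marginal(sample: List[Respondent],
                       predict: Callable[[str, Dict[str, float]], float],
                       survey: str) -> float:
    """Weighted mean predicted prevalence if every respondent were in `survey`."""
    numerator = sum(weight * predict(survey, covs) for covs, weight in sample)
    denominator = sum(weight for _, weight in sample)
    return numerator / denominator


# Hypothetical fitted log-linear model: log p = B0 + B_CATI*[CATI] + B_FEMALE*female.
B0, B_CATI, B_FEMALE = -2.9, 0.45, 0.40


def predict(survey: str, covs: Dict[str, float]) -> float:
    """Predicted probability of depression under the hypothetical model."""
    return exp(B0 + B_CATI * (survey == "CATI") + B_FEMALE * covs["female"])


# Hypothetical pooled sample from both surveys with made-up weights.
pooled: List[Respondent] = [
    ({"female": 1.0}, 1200.0),
    ({"female": 0.0}, 900.0),
    ({"female": 1.0}, 300.0),
    ({"female": 0.0}, 650.0),
]

cati = predicted_marginal(pooled, predict, "CATI")
capi = predicted_marginal(pooled, predict, "CAPI")
print(round(cati, 4), round(capi, 4), round(cati - capi, 4))
```

Because both marginals are averaged over the same pooled covariate distribution, the difference between them reflects the survey term in the model rather than differences in sample composition.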

Results

Sample characteristics of the 2006 BRFSS and 2005–2006 NHANES surveys

A total of 198,678 participants in the 2006 BRFSS and 4,800 participants in the 2005–2006 NHANES provided PHQ-8 data. Sex and age distributions were comparable in the 2006 BRFSS and the 2005–2006 NHANES surveys (Table 1). However, distributions of race/ethnicity, education, and marital status differed significantly between the two samples (all P < 0.001). The prevalence of current smoking and of self-reported obesity (BMI ≥ 30 kg/m²) was lower in the 2006 BRFSS than in the 2005–2006 NHANES (both P < 0.001), whereas the prevalence of CHD and of asthma was higher in the 2006 BRFSS than in the 2005–2006 NHANES (both P < 0.01). There were no significant differences in the prevalence of diabetes, MI, stroke, disability, and poor or fair health between the two surveys.

Table 1 Sample characteristics between BRFSS (2006) and NHANES (2005–2006), US

Prevalence of depression

The standardized internal consistency reliability coefficient of the PHQ-8 was 0.85 in BRFSS and 0.83 in NHANES. The crude prevalence of a PHQ-8 score ≥10 (P < 0.001), depression (P < 0.001), major depression (P < 0.001), and minor depression (P = 0.04) was higher in BRFSS than in NHANES (Fig. 1). We also observed a significant difference in the crude proportion of depression severity between the two surveys (P < 0.001) (Fig. 2). The crude prevalence of each PHQ-8 single item was higher in BRFSS than in NHANES, with differences ranging from 1.4 to 5.7% (Table 2). The age-standardized prevalence estimates of PHQ-8 depression and each PHQ-8 single item were nearly identical to their crude estimates (data not shown).

Fig. 1 Crude prevalence of depression diagnosis assessed by the PHQ-8 with CATI in the 2006 BRFSS and CAPI in the 2005–2006 NHANES. PHQ patient health questionnaire, CATI computer-assisted telephone interviewing, CAPI computer-assisted personal interviewing

Fig. 2 Crude proportion of depression severity assessed by the PHQ-8 with CATI in the 2006 BRFSS and CAPI in the 2005–2006 NHANES. A total PHQ-8 score of 0–4, 5–9, 10–14, 15–19, and 20–24 represents no, mild, moderate, moderately severe, and severe depressive symptoms, respectively. PHQ patient health questionnaire, CATI computer-assisted telephone interviewing, CAPI computer-assisted personal interviewing

Table 2 Crude prevalence of the PHQ-8 single item in BRFSS (2006) and NHANES (2005–2006), US

After adjustment for sex, age, race/ethnicity, education attainment, and marital status, the prevalence of PHQ-8 depression was about 3.5% higher in absolute value in BRFSS (9.7%) than in NHANES (6.1%) (P < 0.001) (Table 3). The difference in the prevalence of depression between the two surveys ranged from 1.5 to 7.0% in the subgroups stratified by demographic characteristics, lifestyle habits, and chronic illness status. We observed significant differences in the prevalence in each subgroup, except for participants with stroke or asthma.

Table 3 Adjusted prevalence of the PHQ-8 depression in BRFSS (2006) and NHANES (2005–2006), US

Discussion

Our results indicated that the overall prevalence estimate of PHQ-8 depression was approximately 3.5% higher in absolute value with CATI in BRFSS (9.7%) than with CAPI in NHANES (6.1%) after adjustment for demographic characteristics. The difference in the adjusted prevalence estimates persisted in most of the subgroups stratified by demographic characteristics, lifestyle habits, and chronic illness status.

It is surprising to observe the significant differences in the prevalence estimates of the PHQ-8 depression between the two modes administered in these two national surveys. Our results were in contrast to the previous findings that showed similar results of mental health scores between telephone interviews and self-administered assessments or face-to-face interviews in two validation studies (Evans et al. 2004; Pinto-Meza et al. 2005). Our study was not intended to validate the results of the PHQ-8 between CATI and CAPI because the two modes were not applied within the same survey. Rather, we attempted to compare the population prevalence estimates of the PHQ-8 depression in two surveys that targeted the same population with two different modes.

There are several limitations to this study. First, the PHQ-8 assessment in BRFSS was conducted among participants selected in 38 states, whereas the PHQ-8 assessment in NHANES was conducted among participants of a representative sample of the US population. Differences in the geographic coverage between the two surveys may lead to small variations in the prevalence estimates. Because the PHQ-8 assessment in the BRFSS was conducted among participants from about 76% of all 50 states in the US, exclusion of participants in the remaining 12 states is likely to have influenced the prevalence estimates only minimally.

Second, the PHQ-8 assessment in BRFSS was conducted in 2006, whereas the PHQ-8 assessment in NHANES was conducted during 2005 and 2006; both assessments were done for the first time. No data on PHQ-8 were available in prior BRFSS and NHANES surveys. The slightly different time frame may lead to bias for the comparison of the prevalence estimates between the two surveys. However, the difference in the prevalences between two time periods could be minimal because we found similar prevalence estimates of the PHQ-8 depression in the 2006 BRFSS survey and in the 2008 BRFSS survey (data not shown) in the eight states that had the PHQ-8 assessment in both years.

Third, because the PHQ-8 assessment in BRFSS was conducted among participants who had landline telephones at home, people with advanced age, disability, or severe chronic illness were likely to be included in the survey, whereas people in households without landline telephones and those who were hospitalized or institutionalized were excluded from the survey. In contrast, the PHQ-8 assessment in NHANES was conducted among participants who attended the Mobile Examination Center, and people with advanced age, disability, or severe chronic illness may have been less likely to have attended. However, the impact of these differences on our results could be minimal because our subgroup analyses stratified by age, disability, and chronic illness showed persistent differences in the prevalence estimates of depression between the two surveys.

Fourth, variations in the response rates between BRFSS and NHANES may have contributed to the differences in the PHQ-8 depression prevalence estimates. The median response rate was 51.4%, ranging from 35.1 to 66.0% at the state level in the 2006 BRFSS survey, whereas the response rate was 77.4% in the 2005–2006 NHANES survey. A low response rate in the BRFSS may have increased non-response bias in the PHQ-8 depression prevalence estimates. However, previous research has shown that a standard RDD telephone survey with a response rate of 36.0% and a rigorous RDD survey with a response rate of 60.6% yielded similar results (Keeter et al. 2000).

Fifth, different sampling methods used in BRFSS and NHANES may be partially responsible for the observed differences in the prevalence estimates of depression between the two surveys. Indeed, we observed some differences in the proportions of race/ethnicity, education attainment, and marital status and the prevalences of several health risk factors and chronic illnesses between the two surveys. However, the impact of sampling methods on our results could be minimal for the following two reasons: first, sample weights were used in all analyses to account for the varying probabilities of complex sampling design and non-response to generate unbiased estimates to the US population in both surveys. Second, we adjusted for the possible variations in demographic characteristics between the two surveys for the comparisons in the prevalence estimates of depression between the two surveys. Moreover, as shown in a previous study (Nelson et al. 2003), BRFSS provided national estimates of major health measures comparable to those of the National Health Interview Survey, which uses similar sampling methods as the NHANES does.

Despite these limitations, the mode of administration (CATI vs. CAPI) for the PHQ-8 questionnaire may play a major role in the differences in the prevalence estimates of the PHQ-8 depression because BRFSS data have consistently been found to provide national estimates comparable to results from national household surveys with a face-to-face interview in health risk factors, chronic conditions, and health care access (Mokdad et al. 2003; Nelson et al. 2003). Numerous studies have shown that survey administration mode may have significant impact on the data quality and results (Bowling 2005; Newman et al. 2002), particularly for sensitive issues and stigmatized behaviors (Durant and Carey 2000; Smith et al. 2009; Tourangeau and Smith 1996). Studies with either CATI or self-administered surveys have been found to report a higher rate of stigmatized behaviors including sexual behaviors and drug use than those using face-to-face interviewing (Durant and Carey 2000; Smith et al. 2009; Tourangeau and Smith 1996). Similarly, studies have suggested that telephone administration has yielded more positive health-related quality of life scores than either the self-administered survey (Hanmer et al. 2007) or mail survey (Hays et al. 2009). Moreover, a recent study compared the four national surveys including BRFSS (with CATI), NHANES (with CAPI), National Health Interview Survey (NHIS with CAPI), and Current Population Survey (CPS with CATI/CAPI) and found that BRFSS yielded a higher prevalence of fair or poor self-rated general health than either NHANES or NHIS, but a lower prevalence than CPS (Salomon et al. 2009).

Consistent with those findings, CATI as administered in the 2006 BRFSS yielded a higher prevalence estimate of PHQ-8 depression than CAPI as administered in the 2005–2006 NHANES. Telephone interviewing may be more successful in eliciting frank responses to sensitive issues than face-to-face interviewing (Sykes and Collins 1988). Stigmatization of people with mental illness has long been recognized as a public health issue (US Department of Health and Human Services 1999). Stigmatizing attitudes about mental illness in the general public and self-stigma and fear of rejection by others have a negative impact on people’s lives (Corrigan and Watson 2002; Link et al. 1997). The negative impact may also be reflected in their responses to the PHQ-8 items in the surveys, particularly in the face-to-face interviewing mode. Alternatively, differences in sampling frames (i.e., telephone sampling frame in BRFSS versus household sampling frame in NHANES) between the two studies may have contributed to the difference.

In sum, by comparing the prevalence estimates of PHQ-8 depression between the BRFSS and NHANES surveys, which used two different modes of questionnaire administration, we identified significant variations in the prevalence estimates of depression between the two modes. Although our results cannot identify the causes of the variations in the PHQ-8 depression prevalence estimates in the US adult population between CATI and CAPI, the notable differences in the prevalence estimates between the two administration modes warrant caution when one compares and interprets the results from these two national surveys. Given the stigmatization of mental illness in the general public, the selection of administration modes for the PHQ-8 questionnaire needs to be scrutinized to elicit accurate information regarding depression.