Introduction

Heterogeneity in major depressive disorder (MDD) illness course complicates clinical decision-making. Clinicians have consistently identified absence of guidance on how to deal with this variation as a critical gap in personalizing MDD treatment.1, 2, 3, 4 However, efforts to address this problem by finding useful prognostic subtypes based on empirically derived symptom profiles5, 6 or biomarkers7, 8, 9 have so far yielded disappointing results. A potentially promising complementary approach would be to apply machine-learning (ML) methods to baseline data on symptoms and other easily assessed clinical features to develop first-stage prediction models of subsequent depression course and treatment response10, 11 that could be expanded to target and examine incremental prognostic effects of novel biomarkers among patients who could not be classified definitively with the inexpensive first-stage prediction models.

Although ML methods have been used successfully to develop risk-prediction schemes in other areas of medicine,12, 13 applications to depression have so far relied on small samples and thin predictor sets, failing to realize the full potential of the methods.14, 15 A recent exception is a study carried out among 8261 respondents with lifetime DSM-IV MDD in the World Health Organization World Mental Health (WMH) surveys.16, 17 Retrospective reports about parental history of depression, temporally primary comorbid disorders and characteristics of incident depressive episodes were used to predict retrospectively reported subsequent depression persistence (number of years with episodes), chronicity (number of years with episodes lasting most days), hospitalization for depression and work disability due to depression. K-means cluster analysis of the four predicted risk scores found a parsimonious three-cluster solution with the high-risk cluster (32.4% of cases) accounting for 56.6–72.9% of high persistence, chronicity, hospitalization and disability.

Although useful as a proof of concept, the WMH results were based on retrospective reports. A prospective validation is reported here that uses the WMH models to predict subsequent MDD persistence, chronicity and severity in a sample of 1056 respondents with lifetime DSM-III-R MDD in the 1990–1992 US National Comorbidity Survey (Survey 1)18 who were re-interviewed 10–12 years later in the 2001–2003 National Comorbidity Survey Follow-Up (Survey 2).19 ML model results are compared with results based on more conventional logistic regression models to determine whether ML methods improve on conventional methods.

Materials and Methods

Sample

Survey 1 was a community epidemiological survey of common DSM-III-R disorders among English-speaking residents of the non-institutionalized civilian US household population aged 15–54 years (n=5877 respondents; 82.4% response rate).18 Respondents were paid $25 for participation. Recruitment-consent procedures were approved by the human subjects committee of the University of Michigan. Interviews were conducted face-to-face in respondent homes after obtaining verbal informed consent. Survey 2 attempted to re-interview all baseline respondents considered here 10–12 years later using recruitment-consent procedures identical to those of Survey 1 other than a $50 incentive. These procedures were approved by the human subjects committees of both Harvard Medical School and the University of Michigan. Interviews were again conducted face-to-face in respondent homes after obtaining verbal informed consent. The 5001 Survey 2 respondents (87.6% of living targeted Survey 1 respondents) were administered an expanded version of the baseline interview assessing onset-course of disorders between the two surveys. A non-response adjustment weight corrected for baseline differences between Survey 2 respondents and non-respondents conditional on Survey 1 responses. Analyses reported here use the weighted data from the 1056 Surveys 1–2 panel respondents who met lifetime criteria for MDD in Survey 1.

The baseline assessment of DSM-III-R disorders

Survey 1 assessed DSM-III-R disorders with the World Health Organization’s Composite International Diagnostic Interview (CIDI) Version 1.1, a fully structured lay-administered interview that assessed common mental disorders using DSM-III-R criteria.20 Syndromes assessed included major depressive episode, mania-hypomania, six anxiety disorders (generalized anxiety disorder, panic disorder, agoraphobia, specific phobia, social phobia and post-traumatic stress disorder) and five externalizing disorders (conduct disorder, alcohol abuse, alcohol dependence, drug abuse and drug dependence). Blinded Structured Clinical Interview for DSM-III-R21 clinical reappraisal interviews in a probability sub-sample found good concordance with the DSM-III-R/CIDI diagnoses.20 Respondents with lifetime MDD were asked whether their first lifetime episode ‘was brought on by some stressful experience’ or happened ‘out of the blue’. DSM-III-R Criteria A-D MDE symptoms were then assessed for this incident episode. Family History Research Diagnostic Criteria questions22 were used to determine parental history of depression.

Outcome measures

Depression persistence, chronicity and severity were assessed in Survey 2 with a computerized version of CIDI 3.0 using ‘pre-loaded’ information about Survey 1 responses to guide follow-up questioning. Respondents with Survey 1 lifetime MDD were asked to review the depressive symptoms reported in Survey 1, update subsequent episodes and symptoms using a life history calendar and answer four summary questions about subsequent episodes: in how many years since baseline did the respondent have a depressive episode lasting 2+ weeks (referred to below as 'persistence') and an episode lasting most days throughout the year (referred to below as 'chronicity')? Was the respondent ever hospitalized for depression since the baseline? Was the respondent currently disabled (at least 50% limitation in ability to perform paid work) because of depression? A fifth Survey 2 outcome measure was whether the respondent attempted suicide at any time since the baseline.

Analysis methods

Predicting the outcomes in the WMH surveys

The predictors in the WMH surveys included temporally primary comorbid lifetime disorders, parental depression, MDD incident episode symptoms and other information about the incident episode (age-of-onset and if the episode was triggered or endogenous). The outcomes were MDD persistence severity (number of years since age-of-onset with episodes lasting 2+ weeks and lasting most days throughout the year, each standardized to a 0–100% range in relation to number of years between age-of-onset and age-at-interview), whether respondents were ever hospitalized for depression after their first episode, and whether respondents were disabled at the time of interview because of their depression. The ML methods used to develop the models included ensemble regression trees23 and 10-fold cross-validated penalized regression,24 both of which were designed to avoid overfitting. These methods are described elsewhere.16, 17
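The standardization of the persistence and chronicity outcomes to a 0–100% range can be illustrated with a minimal sketch. The function name, the handling of respondents with no elapsed time since onset and the cap at 100 are assumptions made for illustration; they are not taken from the WMH analysis code.

```python
def standardized_persistence(years_in_episode, age_of_onset, age_at_interview):
    """Return years in episode as a percentage (0-100) of years at risk.

    Years at risk are taken as age-at-interview minus age-of-onset.
    """
    years_at_risk = age_at_interview - age_of_onset
    if years_at_risk <= 0:
        # Assumption: with no elapsed time since onset the measure is undefined.
        return None
    # Assumption: cap at 100 in case reported episode-years exceed elapsed years.
    return min(100.0, 100.0 * years_in_episode / years_at_risk)
```

Under these assumptions, a respondent with onset at age 20, interviewed at age 40 and reporting episodes in 5 of the intervening years would receive a score of 25%.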

Between 9 and 13 predictors available at baseline in Surveys 1–2 emerged as significant in each WMH model, including measures of individual symptoms and symptom clusters in the incident episode, whether that episode was triggered or endogenous, parental history of depression and various measures of temporally primary comorbid anxiety and externalizing disorders (some of them depending on age-of-onset). A more detailed discussion of the final WMH models is available elsewhere.16, 17

To evaluate whether models based on ML methods improve prediction in an independent data set more than models based on conventional methods, we also estimated a logistic regression model for each outcome in the WMH data that included 23 predictors: the nine DSM-III-R Criterion A symptoms of MDD, a measure of whether the episode was triggered or endogenous, parental history of depression and 11 measures of the temporally primary comorbid anxiety and externalizing disorders that were also available in Survey 1. To the extent that the ML methods stabilize estimates, we would expect predictions based on these methods to out-perform predictions based on logistic regression despite the ML models containing fewer predictors (9–13) than the logistic models (23).
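Scoring a respondent with a fitted logistic model of this kind amounts to applying fixed coefficients to that respondent's predictor values. The sketch below shows the calculation in general terms; the predictor names, coefficient values and intercept are hypothetical and are not the WMH estimates.

```python
import math


def logistic_risk(intercept, coefficients, predictors):
    """Predicted probability from fixed logistic regression coefficients."""
    linear = intercept + sum(
        coefficients[name] * predictors[name] for name in coefficients
    )
    return 1.0 / (1.0 + math.exp(-linear))


# Hypothetical coefficients and one hypothetical respondent (illustration only).
coefs = {"depressed_mood": 0.4, "parental_depression": 0.7}
respondent = {"depressed_mood": 1, "parental_depression": 0}
risk = logistic_risk(-2.0, coefs, respondent)
```

Because the coefficients are held fixed, the same function can be applied unchanged to any sample that measured the same predictors.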

Assigning WMH-predicted risk scores to Survey 1 respondents

Risk scores based on the logistic models were generated in Survey 1 using the WMH coefficients and the Survey 1 predictors. This direct estimation method could not be used for the ML models, though, as Survey 1 did not assess a number of significant predictors in the ML models (symptoms of anxious depression and mixed episodes in incident episodes, comorbid obsessive–compulsive disorder, intermittent explosive disorder and oppositional defiant disorder). We addressed this problem by imputing ML risk scores to Survey 1 respondents from a consolidated data set that combined WMH respondents and Surveys 1 and 2 respondents. The data set included all predictors in common across the surveys along with the four ML-predicted risk scores. The latter four scores had valid values for WMH cases and missing values for Survey 1 cases. Multiple imputation was applied to this data set to generate 10 predicted scores on each missing variable for each Survey 1 respondent using SAS 9.2 (Cary, NC, USA) proc mi.25 Modal imputed values were assigned to each Survey 1 respondent for purposes of analysis. As these scores were strongly correlated across outcomes, a single composite ML-predicted risk score was then constructed for each respondent by averaging across the four scores after transforming to percentiles.
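The final step, averaging percentile-transformed scores into a composite, can be sketched as follows. This is an unweighted illustration: the percentile definition (rank divided by sample size, with tied scores sharing their mean rank) and the function names are assumptions, and the imputation step itself is not reproduced.

```python
def percentile_ranks(scores):
    """Map each score to its percentile (0-100) within the sample.

    Assumption: percentile = 100 * rank / n, with ties sharing the mean rank.
    """
    n = len(scores)
    order = sorted(range(n), key=lambda i: scores[i])
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        # Extend j to the end of the current group of tied scores.
        while j + 1 < n and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # 1-based average rank for the tie group
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return [100.0 * r / n for r in ranks]


def composite_score(score_columns):
    """Average the percentile-transformed scores across outcomes.

    `score_columns` holds one list of scores per outcome, all in the
    same respondent order.
    """
    percentile_columns = [percentile_ranks(col) for col in score_columns]
    n = len(score_columns[0])
    return [
        sum(col[i] for col in percentile_columns) / len(percentile_columns)
        for i in range(n)
    ]
```

Transforming to percentiles before averaging places the four scores on a common scale, so that no single outcome dominates the composite simply because its predicted values have a wider range.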

Validating the prediction models

Survey 2 outcomes were predicted from risk scores based on the ML and logistic models applied to the Survey 1 data. The Survey 2 outcomes included high (top 10%) MDD persistence and chronicity in the 10–12 years between the two surveys, hospitalization for depression and attempted suicide during those years and disability due to depression at the time of Survey 2. Area under the receiver operating characteristic curve (AUC) was calculated for each Survey 2 outcome separately for the ML and logistic models. Sensitivity (SN; the percentage of respondents with the outcome classified by the predicted risk scores as having high risk), positive predictive value (PPV; the percentage of respondents predicted to have high risk who experienced the outcome) and likelihood-ratio positive (LR+; the relative proportions of respondents who experienced the outcome among those classified as having or not having high risk) were also calculated for the 20 and 33% of Survey 1 respondents with highest and lowest ML-imputed composite risk scores. S.e.m. of SN, PPV and LR+ were estimated using the Taylor series method with SUDAAN26 to adjust for design effects in the Surveys 1 and 2 panels.
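The quantities defined above can be illustrated with a minimal unweighted sketch. It ignores the survey weights and design-based standard errors used in the actual analysis, assumes at least one case and one non-case in each group, and reads LR+ as the ratio of the outcome proportion in the flagged group to that in the remainder, which is one interpretation of the definition given here. All function and variable names are hypothetical.

```python
def auc(scores, outcomes):
    """AUC as the probability that a randomly chosen case has a higher
    score than a randomly chosen non-case (ties count as one half)."""
    cases = [s for s, y in zip(scores, outcomes) if y]
    controls = [s for s, y in zip(scores, outcomes) if not y]
    wins = 0.0
    for c in cases:
        for d in controls:
            if c > d:
                wins += 1.0
            elif c == d:
                wins += 0.5
    return wins / (len(cases) * len(controls))


def top_fraction_metrics(scores, outcomes, fraction):
    """Sensitivity, PPV and outcome-proportion ratio when the top
    `fraction` of respondents by score are classified as high risk."""
    n = len(scores)
    n_high = max(1, round(n * fraction))
    order = sorted(range(n), key=lambda i: scores[i], reverse=True)
    high = set(order[:n_high])
    cases_high = sum(1 for i in high if outcomes[i])
    cases_total = sum(1 for y in outcomes if y)
    sensitivity = cases_high / cases_total
    ppv = cases_high / n_high
    # Outcome proportion among respondents not classified as high risk.
    rate_rest = (cases_total - cases_high) / (n - n_high)
    ratio = ppv / rate_rest if rate_rest > 0 else float("inf")
    return sensitivity, ppv, ratio
```

The same function can be applied to the low end of the distribution by negating the scores, so that the flagged group becomes the respondents with the lowest predicted risk.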

Results

Outcome distributions

More than one-third (37.9%) of the 1056 Surveys 1 and 2 respondents had at least one depressive episode in the 10–12 years between surveys (Table 1). Mean (s.e.m.) number of years in episode was 2.0 (0.2) and the 90th percentile was 9 years. Roughly half the respondents with episodes (16.7% (1.5) of all respondents) reported episodes lasting most days throughout one or more years, with a mean (s.e.m.) of 0.8 (0.1) and a 90th percentile of 4 such years. A strong correlation (polychoric) was found between number of years in episode and number of years with episodes lasting most days throughout the year (rp=0.61).

Table 1 Distributions and polychoric/tetrachoric correlations among the outcomes in the Surveys 1 and 2 panels (N=1056)

Hospitalization for depression in the years between Surveys 1 and 2 was reported by 5.8% (1.1) of Survey 2 respondents and attempted suicide by 4.5% (0.6). Current disability because of depression was reported by 3.2% (0.6) of Survey 2 respondents. Correlations (tetrachoric) among these three severity indicators were rt=0.51–0.84. Correlations (polychoric) between number of years in episode and the severity indicators were rp=0.38–0.49. Correlations (polychoric) between number of years in episodes lasting most days throughout the year and the severity indicators were rp=0.30–0.53.

Associations of the Survey 1 risk scores with Survey 2 outcomes

AUCs of the Survey 1 ML and logistic risk scores were 0.71 and 0.68, respectively, in predicting high persistence; 0.63 and 0.62, respectively, in predicting high chronicity; 0.73 and 0.65, respectively, in predicting hospitalization; 0.74 and 0.69, respectively, in predicting disability; and 0.76 and 0.70, respectively, in predicting attempted suicide (Table 2). The AUCs of the ML scores were somewhat higher than those of the logistic regression scores for all five outcomes despite the ML scores being based on models that used only 9–13 predictors compared with 23 predictors in the logistic models and the fact that the ML-predicted values were based on multiple imputation rather than direct estimation.

Table 2 AUC of Survey 1 risk scores based on ML models and logistic regression models predicting Survey 2 outcomes (N=1056)

Operating characteristics of the composite-imputed risk score

The 20% of Survey 1 respondents with highest ML composite-imputed predicted risk scores accounted for 38.1% of high persistence in the years between the two surveys, 34.6% of high chronicity, 40.8% of hospitalizations for depression, 55.8% of disability because of depression and 55.8% of attempted suicides. Sensitivities were substantially higher (49.7–70.7%) in the 33% of Survey 1 respondents with highest predicted risk scores (Table 3). Positive predictive values of the outcomes in the 20% of respondents with highest predicted risk scores were in the range 8.8–18.3% (that is, 1.8–3.0 times the positive predictive values in the remaining 80% of the sample), while positive predictive values were 6.3–17.5% in the 33% of respondents with highest predicted risk (that is, 1.5–2.2 times the positive predictive values in the remaining 67% of the sample).

Table 3 Sensitivity, positive predictive value and likelihood-ratio positive of Survey 1 risk scores based on ML models in the upper and lower 20 and 33% of the risk distribution predicting Survey 2 outcomes (N=1056)

The ML-predicted risk scores were also useful at the low end of the distribution, as seen most vividly in the fact that the 20% of Survey 1 respondents with lowest predicted risk accounted for only 0.9% of all hospitalizations and 1.5% of all attempted suicides in the 10–12 years between surveys. This means that low ML-predicted risk scores can be used as rule-outs for these outcomes (LR+=0.0–0.1). Sensitivities for other outcomes in this 20% of respondents with lowest predicted risk were 5.6–15.9%, while those of the 33% of respondents with lowest predicted risk were 9.7–16.7%. Positive predictive values of the outcomes in the 20% of respondents with lowest predicted risk were 0.3–6.7% (that is, 0.0–0.8 times the positive predictive values in the remaining 80% of the sample), while positive predictive values were 0.9–4.2% in the 33% of respondents with lowest predicted risk (that is, 0.3–0.5 times the positive predictive values in the remaining 67% of the sample).

Discussion

Four important limitations of the WMH models should be noted before discussing the results. First, MDD was assessed with a fully structured diagnostic interview rather than a semi-structured clinical interview. Second, the models were developed in a cross-sectional sample using retrospective reports that could have been biased. Third, because the data were retrospective, predictors were limited in two important ways: the predictors for comorbid disorder did not include those with first onsets subsequent to first onset of MDD; and no predictors were included for MDD course subsequent to first onset. Both these types of predictors would normally be available to clinicians interested in evaluating differential patient risk for MDD persistence severity. Because of these limitations, we would expect the performance of the WMH models to be lower bounds on the performance of models with a more complete set of predictors. Fourth, only a limited set of ML methods was used to develop the WMH models. Because of these limitations, it would be useful to replicate and expand the model development and validation process illustrated here in prospective clinical samples using consistently administered semi-structured clinical interviews with a more complete set of predictors using additional ML algorithms (for example, naive Bayesian, random forests and support vector machines)27 and an optimal combined suite of algorithms to maximize cross-validated prediction accuracy.28

Within the context of these limitations, the validation exercise reported here confirmed the predictive value of the kinds of self-report variables included in the WMH ML models over a 10–12 year follow-up period in an independent sample of the US household population. We also showed that prediction accuracy (AUC) of the ML models was consistently higher across all study outcomes (0.63–0.76) than a more conventional logistic model (0.62–0.70) despite the logistic model including 23 predictors and the ML models 9–13 predictors. This finding illustrates the value of ML methods in stabilizing predictions to avoid overfitting in a training data set (that is, the WMH sample), so as to improve prediction in independent samples.

A question can be raised about how well the prediction accuracy of the WMH ML composite risk score compares with previous attempts to predict long-term depression persistence severity. Only a handful of relevant comparison studies exist over a follow-up period of 10+ years in samples of initially depressed patients29, 30 or community residents.31, 32 These studies were all quite small (n=87–424) and none reported AUC. However, AUC can be computed post hoc from two of these studies. The first was a 50-year follow-up of 293 community respondents classified post hoc as having had baseline DSM-IV MDD, 20 of whom subsequently died by suicide.32 A composite measure of baseline depression severity predicted subsequent suicide with 0.69 AUC compared with 0.76 for the validated AUC of the most comparable Survey 2 outcome (attempted suicide). The second comparison study followed 313 outpatients with initial diagnoses of MDD 1, 4 and 10 years after baseline and predicted persistent depression over that time period from 10 baseline depressive symptoms along with 10 baseline measures of self-concept, social function and coping. The AUC of 0.70 is quite similar to the 0.71 AUC for the most comparable Survey 2 outcome of high persistence.

In making these comparisons, it is important to remember that the AUCs in these other studies were not validated in independent samples. As noted above, AUC estimates in the Surveys 1 and 2 panels were ~10% lower than in the WMH sample. Shrinkage would be expected to be even greater in the earlier studies because of their much smaller samples than in the broadly representative WMH sample of 8261 respondents. Prediction models in the two comparison studies might consequently yield validated AUCs below 0.60 in independent samples. AUCs in that range are considered small based on conventional guidelines, while WMH ML AUCs would typically be considered moderate.33, 34, 35

It is noteworthy that AUC of the ML models in Surveys 1 and 2 was similar to widely used risk models in other areas of medicine.36, 37 For example, the 0.73 mean AUC of the ML models over the four Survey 2 outcomes other than high chronicity is similar to the 0.74 average AUC of the Framingham Risk Score of coronary heart disease, one of the most widely used prediction scores in medicine, across 79 different validation studies,38 and higher than the AUCs (typically below 0.70) of models to predict the course of breast cancer.39 Nonetheless, these AUCs are only moderate, which means that predictions based on such models could not be used to make definite rule-ins and could be used to make definite rule-outs only for risks of hospitalization and suicide attempts in the lowest 20% of the composite risk distribution. But this level of precision could be useful in defining bands of differential risk warranting variation in clinical attention. Tiered risk assessments of this sort are becoming increasingly important in other areas of medicine.40, 41, 42

Given that predictions based on models of the sort evaluated here would most realistically be used to help clinicians identify patients who might profit more from more intensive treatment (for example, long-term maintenance therapy), the vast majority of whom present for treatment of recurrent rather than incident episodes, an obvious future direction should be to go beyond the WMH model focus on incident episodes to develop expanded models in the Surveys 1 and 2 panels focused on recurrent episodes. Such an expansion could evaluate the incremental value of including new predictors for course of MDD between onset and time of Survey 1, secondary comorbid disorders, and other variables found to be important in previous studies of the course of depression (for example, childhood family adversities, history of traumatic stress exposure, comorbid physical disorders, social networks-support, personality). We plan to implement this kind of expansion in future work with the Surveys 1 and 2 panels.

Beyond our own work with these data, it would be useful to develop an interview schedule to assess the full set of self-report predictors found in the WMH data and in the earlier studies reviewed above to use in future depression treatment trials. Such an instrument, if administered at trial baseline, could be used as part of a principled approach to study heterogeneity of treatment effects.43, 44 An even more promising extension given the small size of most depression treatment trials might be to administer this same instrument to a large observational sample of patients beginning depression treatment, follow these patients to assess treatment response and analyze these data to develop a robust model predicting heterogeneity of treatment effects. In addition to providing an a priori representation of predicted treatment response for use in subsequent controlled trials, such a model could be useful in targeting depressed patients with high risk of treatment resistance at the beginning of treatment who might warrant the substantial investment currently being made in large pragmatic trials to determine the value of expensive baseline biomarker assessments in guiding depression treatment targeting.8, 9 It would also be valuable in this context to evaluate the 'incremental' value of promising biomarkers to prediction over and above the level of prediction accuracy achieved in a model based only on baseline self-reports.45

Risk stratification data from a large observational study of this sort could also be analyzed using an extension of the innovative statistical approaches recently developed to study comparative effectiveness in observational studies.46 The potential value of such an approach is supported both by evidence that treatment effect size estimates in appropriately analyzed observational studies are comparable with those in controlled trials47 and by the existence of numerous replicated predictors of heterogeneity of depression treatment effects in existing trials.14, 15, 48, 49 The use of an expansion of our model in this way would address two important problems in previous research on heterogeneity of depression treatment effects: the small sample sizes of depression treatment trials;50, 51 and the fact that most such trials assess only a small number of potential treatment effect modifiers, thus providing no principled basis for using pooling across trials to develop the kind of fine-grained multivariate models of heterogeneity of treatment effects that will eventually be needed to guide personalized depression treatment planning.44 The results presented in the current report, while only taking a first step in this direction, provide strong support for the potential value of this possible extension.