Introduction

Major depressive disorder (MDD) is a highly burdensome disorder [1, 2]. The most common first-line treatments are antidepressant medication (ADM), psychotherapy, and combined ADM-psychotherapy [3]. Meta-analyses of randomized controlled trials find that psychotherapies, either alone or in combination with ADM, have better aggregate MDD symptom response and remission than ADM-only over 6- to 12-month follow-up periods [4]. A recent meta-analysis found that psychotherapy-only is also associated with significantly lower risk of a composite serious negative outcome (suicide attempt, emergency department [ED] visit, psychiatric hospitalization, and/or suicide death in the 12 months after initiating treatment) than either ADM-only or combined treatment [5]. Most patients also prefer psychotherapy with or without ADM to ADM-only [6]. Yet the great majority of MDD patients are treated with ADM-only [3] due to its greater availability [7] and lower costs [8].

The above observations suggest that aggregate treatment response would improve if the number of MDD patients receiving psychotherapy increased. This is unrealistic in the near term, though, given shortages of trained psychotherapists. However, if the benefits of psychotherapy are much greater for some patients than others, it might be possible to increase access to psychotherapy for the patients who need it most if a method existed to identify these patients. Numerous predictors of differential MDD treatment response have been suggested for this purpose [9,10,11,12,13,14,15]. However, none of these predictors is sufficiently strong to be of practical value by itself. Recent studies consequently have attempted to develop an individualized treatment rule (ITR) that combines information across a range of prescriptive predictors to optimize the selection of either (i) psychotherapy vs. ADM [16,17,18,19,20], (ii) combined treatment vs. ADM [21, 22], (iii) combined treatment vs. psychotherapy [23], or (iv) combined treatment vs. either of the two monotherapies [24,25,26,27,28,29]. These studies have been limited, though, by small samples, limited predictor sets, and suboptimal analysis methods [30, 31].

We present here an attempt to move beyond prior ITR development studies by determining whether a preliminary ITR can be developed from data in a large electronic health record (EHR) administrative data system. We focus on patients treated in Veterans Health Administration (VHA) Primary Care Mental Health Integration (PC-MHI) clinics, as such clinics offer not only ADMs but also psychotherapy and combined ADM-psychotherapy. The outcome is a composite measure of experiencing one or more serious negative events in the 365 days after initiating treatment: suicide attempts, psychiatric emergency department visits, psychiatric urgent care visits, psychiatric hospitalizations, or suicide-related deaths. Prior research showed that all these negative outcomes are significantly elevated among MDD patients in PC-MHI clinics [32, 33].

As this is an observational study, not a randomized controlled trial, results will be biased to the extent that type of treatment is nonrandom with respect to confounders. Great care was consequently taken to use best-practices methods described below and in the Supplementary Methods section to adjust for nonrandom treatment assignment and improve both aggregate model accuracy and fairness [34]. Previous research shows that observational analyses based on such methods can be useful in generating preliminary ITRs so long as they are followed by pragmatic trials in which the ITRs are experimentally evaluated and refined [35].

Methods

Sample

Our sample consisted of all n = 43,470 PC-MHI patients beginning MDD outpatient treatment with an ADM, psychotherapy, or both between October 1, 2015 and December 31, 2016. Exclusion criteria were: (i) any previous MDD treatment in the past 365 days; (ii) any lifetime diagnosis of autism, bipolar disorder, borderline intellectual functioning, dementia, intellectual disability, nonaffective psychosis, stereotyped movement disorder, or Tourette’s disorder; (iii) any lifetime VHA treatment with antimanic or antipsychotic medication; or (iv) an administratively recorded suicide attempt in the prior 365 days. ADM-only treatment was defined as receiving an ADM prescription but not a referral for psychotherapy during the first PC-MHI visit. Psychotherapy-only was defined as either seeing a psychotherapist or receiving an appointment for such a visit, but not receiving an ADM prescription, on the day of the initial PC-MHI visit. Combined treatment was defined as receiving an ADM prescription and either seeing a psychotherapist or receiving an appointment for such a visit on the day of the first PC-MHI visit. The human subjects committees of both Harvard Medical School and the Canandaigua VA Boston Medical Center approved these procedures. HIPAA and VHA waivers of consent were obtained to conduct secondary analyses with this dataset.
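
The three treatment definitions can be read as a decision rule over two first-visit indicators. The following Python sketch is illustrative only: the function and argument names are hypothetical rather than VHA data fields, and it assumes that a psychotherapy referral, visit, or scheduled appointment on the day of the first PC-MHI visit can be collapsed into a single indicator.

```python
from typing import Optional


def classify_treatment(adm_prescribed: bool,
                       psychotherapy_seen_or_scheduled: bool) -> Optional[str]:
    """Map two first-visit indicators to a treatment arm.

    Both argument names are hypothetical. Returns None when neither
    treatment was initiated (such patients fall outside the sample).
    """
    if adm_prescribed and psychotherapy_seen_or_scheduled:
        return "combined"
    if adm_prescribed:
        return "adm_only"
    if psychotherapy_seen_or_scheduled:
        return "psychotherapy_only"
    return None
```

For example, a patient who leaves the first visit with a prescription and a psychotherapy appointment is classified as receiving combined treatment.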

Outcome

The serious negative events making up our outcome were selected because information about these events, unlike the symptom severity measures conventionally used as outcomes in depression treatment trials, is available for all PC-MHI patients in the VHA Corporate Data Warehouse [36], VHA Suicide Prevention Applications Network (SPAN; [37]), or National Death Index (NDI; [38]). As detailed in the Supplementary Literature Review, a recent meta-analysis found 34 published randomized trials that assessed comparative treatment effects of ADM-only, psychotherapy-only, and combined treatment on these outcomes [5], but none of those studies attempted to develop an ITR to optimize treatment assignment. It is important to recognize that an ITR to minimize risk of this composite negative outcome might not be optimal for other outcomes such as symptom response or remission.

Predictors

Potential predictors were extracted from the VHA Corporate Data Warehouse [36] (Supplementary Table 1) and a geospatial social determinants of health (SDoH) database on characteristics of patient residential neighborhoods (Census Tracts, Block Groups), Counties, and States (e.g., economic deprivation, social cohesion) that have been linked in previous research to time-space variation in indicators of mental disorders [39] (Supplementary Table 2). The 5865 variables in this combined database operationalized four broad classes of such prescriptive predictors (see Supplementary Literature Review): psychopathological risk factors (i.e., history of treated mental disorders, treatment types, suicidality), comorbid physical disorders and treatments (including medications suspected to increase self-harm risk; Supplementary Table 3), SDoH (at both the individual and geospatial levels), and facility-level quality indicators [9, 12,13,14,15]. Missingness, which occurred only for geospatial variables, was addressed by using nearest-neighbor imputations to fill in missing scores by assigning non-missing values from contiguous areas.
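
A minimal sketch of imputation from contiguous areas follows. It assumes, since the text does not specify this, that a missing area score is replaced by the mean of the non-missing scores of adjacent areas. The adjacency map and area labels are hypothetical.

```python
from statistics import mean


def impute_from_neighbors(values, adjacency):
    """Fill missing (None) area scores from contiguous areas.

    values:    dict mapping area id -> score, or None when missing
    adjacency: dict mapping area id -> list of contiguous area ids
    Assumption: the imputed score is the mean of non-missing neighbor scores.
    Areas with no non-missing neighbor are left as None.
    """
    imputed = dict(values)
    for area, value in values.items():
        if value is None:
            donors = [values[n] for n in adjacency.get(area, [])
                      if values.get(n) is not None]
            if donors:
                imputed[area] = mean(donors)
    return imputed


# Hypothetical example: tract B is missing and borders tracts A and C.
example = impute_from_neighbors({"A": 1.0, "B": None, "C": 3.0},
                                {"B": ["A", "C"]})
```

Only original (non-imputed) scores are used as donors, so the result does not depend on the order in which areas are processed.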

Analysis methods

Estimating aggregate negative outcomes

The sample was split into a 70% training sample and a 30% test sample. We began by estimating an aggregate risk model [40] to predict whether each patient would experience the negative outcome using information from all pretreatment predictors but ignoring type of treatment received. This model was estimated using the Super Learner (SL) stacked generalization machine learning (ML) method [41], an ensemble ML method that pools results across a user-specified range of algorithms (Supplementary Table 4). This was done with the SuperLearner R package [42]. See Supplementary Methods for further details. Model fit was evaluated by calculating area under the receiver operating characteristic curve (AU-ROC). We also inspected observed within-ventile and cumulative sensitivity and positive predictive value in the test sample based on predictions from the training sample.
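
The evaluation metrics named above can be sketched in a few lines of Python. This is not the SuperLearner workflow itself. It is a standard-library illustration, with hypothetical function names, of how AU-ROC and within-ventile sensitivity and positive predictive value could be computed from test-sample outcomes (coded 0/1) and predicted probabilities.

```python
def auc_roc(y_true, y_score):
    """AU-ROC as the probability that a randomly chosen case outscores a
    randomly chosen non-case (ties count one half). Quadratic time; a sketch."""
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


def ventile_table(y_true, y_score, n_bins=20):
    """Rank patients from highest to lowest predicted risk, cut the ranking
    into n_bins equal-sized groups, and report per-group sensitivity
    (share of all cases in the group) and PPV (share of the group who are cases)."""
    order = sorted(range(len(y_true)), key=lambda i: y_score[i], reverse=True)
    total_pos = sum(y_true)
    n = len(order)
    rows = []
    for b in range(n_bins):
        idx = order[b * n // n_bins:(b + 1) * n // n_bins]
        cases = sum(y_true[i] for i in idx)
        rows.append({
            "ventile": b + 1,
            "n": len(idx),
            "sensitivity": cases / total_pos if total_pos else float("nan"),
            "ppv": cases / len(idx) if idx else float("nan"),
        })
    return rows
```

Cumulative sensitivity over the top k ventiles is the sum of the first k sensitivity values, which is how a statement such as "the top 15% of the sample contains a given share of all cases" is obtained.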

Estimating average treatment effects

We then estimated average treatment effects by adjusting for significant differences in the extensive battery of baseline covariates across treatment types. Best-practices methods were used to do this by combining two types of adjustments [43]. (i) The first was a propensity score weighting method to adjust for nonrandom treatment assignment [44] based on the Random Forests (RF) ML method [45]. (ii) The second was an outcome modeling method that predicted probability of the outcome for each patient separately within each treatment arm from baseline covariates, again using RF, and adjusted for differences in multivariate distributions of significant predictors across arms by assigning patients who received each treatment the average scores on these predictors. The predicted outcome scores in the treatment groups generated in these outcome models were then compared to estimate average treatment effect (ATE). As detailed in the Supplementary Methods section, results based on these two methods were combined using doubly robust methods to yield consistent results if either method was correct and to reduce finite-sample bias. The specific doubly robust method used was targeted minimum loss-based estimation [46], implemented in the tmle3 R package [47]. See Supplementary Methods for further details.
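
The doubly robust idea can be illustrated with the augmented inverse probability weighting (AIPW) estimator. AIPW is a simpler relative of the targeted minimum loss-based estimator used in the analysis and is shown here only as a stand-in, not as the tmle3 procedure. The sketch assumes that the propensity scores and outcome-model predictions have already been estimated, and all names are hypothetical.

```python
def aipw_arm_mean(y, received, propensity, predicted):
    """Doubly robust estimate of the mean outcome if everyone received one arm.

    y          : observed outcomes (0/1)
    received   : 1 if the patient received this arm, else 0
    propensity : estimated probability of receiving this arm given covariates
    predicted  : outcome-model prediction for this arm given covariates
    Formula: mean_i [ m_i + received_i * (y_i - m_i) / e_i ]
    The estimate is consistent if either the propensity model or the
    outcome model is correctly specified.
    """
    total = 0.0
    for yi, ai, ei, mi in zip(y, received, propensity, predicted):
        total += mi + (ai / ei) * (yi - mi)
    return total / len(y)


def aipw_ate(arm_mean_a, arm_mean_b):
    """Average treatment effect as the difference between two arm means."""
    return arm_mean_a - arm_mean_b
```

The outcome-model term supplies a prediction for every patient, and the weighted residual term corrects that prediction using only the patients who actually received the arm.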

Estimating heterogeneity of treatment effects

We then estimated between-patient differences in comparative treatment effects, referred to below as Heterogeneity of Treatment Effects (HTE), using Generalized Random Forest (GRF), a doubly robust ML approach that expands on RF with a focus on individual differences in comparative treatment effects with respect to observed baseline variables while adjusting for confounding due to these variables [48, 49]. Analyses were implemented in the grf R package [50]. In broad outline, the analysis entailed developing a model using a two-step doubly robust method. The first step estimated an expected outcome for each patient for each of the three treatment types by estimating treatment-specific models in the subsamples of patients who received each treatment and then imputing predicted treatment-specific outcomes based on those models to all patients regardless of type of treatment received. The second step used these first-stage within-patient estimates to create within-patient scores for differences in expected outcomes across treatments and then used those difference scores as outcomes in a second set of RF models. Importantly, the latter models directly fit interaction terms (i.e., differences in expected outcomes across treatments within patients) without requiring the correct specification of main effects, simplifying the task of estimating HTE. The ITR for a specific patient was defined as the treatment associated with the lowest predicted probability of the negative outcome for that patient. See Supplementary Methods for further details.
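
The final step, turning treatment-specific predictions into difference scores and a recommendation, can be sketched as follows. The sketch takes the predicted probabilities as given and does not reproduce the GRF estimation. The function names, arm labels, and example risks are hypothetical.

```python
def difference_scores(predicted_risk, reference="adm_only"):
    """Within-patient differences in predicted risk relative to a reference arm.
    Negative values mean lower predicted risk than under the reference arm."""
    base = predicted_risk[reference]
    return {arm: risk - base
            for arm, risk in predicted_risk.items() if arm != reference}


def recommend_treatment(predicted_risk):
    """ITR: the arm with the lowest predicted probability of the negative outcome.
    Ties are broken by dictionary order, which is an arbitrary choice."""
    return min(predicted_risk, key=predicted_risk.get)


# Hypothetical predicted risks for one patient.
risks = {"adm_only": 0.09, "psychotherapy_only": 0.07, "combined": 0.08}
recommended = recommend_treatment(risks)
```

In this example the recommendation is psychotherapy-only because it has the smallest predicted risk of the three arms.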

The ITR was then evaluated in the test sample by: (i) dividing the test sample into three subgroups depending on which treatment type was estimated by the ITR in the training sample to have the lowest probability of the negative outcome; and then (ii) using tmle3 to estimate ATE within each of these subgroups in the test sample. If the ITR improves on a non-individualized treatment strategy, we would expect predicted probability of the negative outcome to be significantly lower for the treatment type estimated to be optimal. Importantly, only information in the test sample was used to evaluate the ITR, whereas only information in the training sample was used to estimate the ITR.
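
The logic of this evaluation can be sketched with crude event rates. The analysis itself used tmle3 to obtain adjusted estimates within each subgroup, so the unadjusted rates below are a simplification, and the record layout is hypothetical.

```python
from collections import defaultdict


def outcome_rates_by_subgroup(records):
    """Crude outcome rate for each (recommended arm, received arm) cell.

    records: iterable of (recommended, received, outcome) tuples from the
             test sample, where outcome is 0 or 1.
    Assumption: unadjusted rates stand in for the adjusted estimates.
    """
    counts = defaultdict(lambda: [0, 0])  # cell -> [events, patients]
    for recommended, received, outcome in records:
        cell = counts[(recommended, received)]
        cell[0] += outcome
        cell[1] += 1
    return {cell: events / patients
            for cell, (events, patients) in counts.items()}
```

If the ITR is informative, the rate in a cell where the received arm matches the recommended arm should be lower than in the other cells of the same subgroup.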

Predictor importance

Predictor importance was examined using the kernel Shapley Additive Explanations (SHAP) method [51] implemented in the fastshap R package [52]. This method estimates the effect of changing a predictor from its observed value to the sample mean separately averaged across all logically possible permutations of other predictors. More important predictors are associated with higher mean absolute SHAP values. Proportional mean absolute SHAP values (SHAPP) were calculated by dividing mean absolute SHAP values of classes and important predictors within classes by the mean absolute SHAP value for the entire model. Bee swarm plots were used to identify dominant directions and distributions of associations. See Supplementary Methods for further details.
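
One way to compute a proportional value of this kind is sketched below. The exact definition is given in the Supplementary Methods, so the sketch rests on an assumption: the value for a class is taken to be the mean absolute value of the summed SHAP values of the predictors in that class, divided by the mean absolute value of the summed SHAP values of all predictors. The function name and data layout are hypothetical, and this is not the fastshap API.

```python
from statistics import mean


def shapp(shap_rows, columns):
    """Proportional mean absolute SHAP value for a set of predictors.

    shap_rows: list of dicts, one per patient, mapping predictor -> SHAP value
    columns:   predictors belonging to the class (or a single predictor)
    Assumed definition: mean |sum of SHAP over columns| divided by
    mean |sum of SHAP over all predictors|.
    """
    numerator = mean(abs(sum(row[c] for c in columns)) for row in shap_rows)
    denominator = mean(abs(sum(row.values())) for row in shap_rows)
    return numerator / denominator


# Hypothetical SHAP values for two patients and two predictors.
rows = [{"a": 1.0, "b": -0.5}, {"a": -1.0, "b": 0.5}]
share_a = shapp(rows, ["a"])
share_b = shapp(rows, ["b"])
```

Under this definition the values for different classes need not sum to 100%, because contributions with opposite signs cancel in the denominator but not in the class-specific numerators.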

Reporting

We followed the Transparent Reporting of a Multivariable Prediction Model for Individual Prognosis or Diagnosis (TRIPOD) reporting guidelines [53] to report our analyses intended to build predictive models.

Results

Sample attributes

Most patients received either ADM-only (46.9%) or psychotherapy-only (39.7%). The remaining 13.4% received combined ADM-psychotherapy. 25.6–33.1% of patients across treatment types were 60+ years of age, 17.0–20.6% were 50–59 years old, 23.1–27.9% were 35–49 years old, and 23.1–29.2% were younger than 35 years old. The great majority (82.0–83.5%) of patients were male. 44.3–46.7% were married. Over half (50.5–55.1%) lived in the South and the great majority (87.5–88.4%) lived in Major Metropolitan Areas (Table 1).

Table 1 Selected baseline characteristics of patients by treatment group in the total sample.

Aggregate prediction of negative outcomes

8.6% (S.E. = 0.2%) of patients experienced the negative outcome. The aggregate SL risk model estimated in the training sample predicted this outcome in the test sample with AU-ROC = 0.68 (S.E. = 0.01) using an optimal weighting scheme across classifiers (Supplementary Table 5). Observed sensitivity (SN) was significantly higher than the 5.0% expected by chance in each of the top 3 ventiles (18.9–7.0%) in the test sample, with 35.2% of all patients having the negative outcome falling into this top 15% of the sample (Table 2), positive predictive value (PPV) in the top ventile of 32.6% (S.E. = 1.8), and cumulative PPV across the top 3 ventiles of 20.2% (S.E. = 0.9) compared to 6.6% (S.E. = 0.5) across the remaining 17 ventiles. Observed sensitivities were significantly lower than chance, in comparison, in the bottom 9 ventiles of the test sample (3.4–1.2%), with only 23.3% of all patients with the negative outcome falling into this 45% of the sample. Cumulative PPV in this bottom 45% of the sample was 4.5%.
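
The sensitivity and PPV figures above are linked by a simple identity: the PPV of a risk stratum equals its sensitivity times the overall prevalence, divided by the share of the sample in the stratum. The short Python check below uses the rounded values reported in this section; the function name is hypothetical.

```python
def ppv_from_sensitivity(sensitivity, prevalence, sample_share):
    """PPV of a stratum = sensitivity * prevalence / share of sample in stratum."""
    return sensitivity * prevalence / sample_share


prevalence = 0.086  # overall prevalence of the negative outcome

top_15 = ppv_from_sensitivity(0.352, prevalence, 0.15)            # top 3 ventiles
remaining_85 = ppv_from_sensitivity(1 - 0.352, prevalence, 0.85)  # other 17 ventiles
bottom_45 = ppv_from_sensitivity(0.233, prevalence, 0.45)         # bottom 9 ventiles
```

Rounded to one decimal place, these reproduce the reported cumulative PPVs of 20.2%, 6.6%, and 4.5%.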

Table 2 Observed prevalence of the negative outcome in the test sample (n = 12,986) by ventiles of predicted probability of the negative outcome based on the SL analysis carried out in the training sample (n = 30,484)a.

Important predictors of aggregate negative outcomes

The most important classes of predictors of aggregate negative outcome risk were physical disorders (SHAPP = 53.3%), patient-level SDoH (SHAPP = 35.6%), and number of visits for mental and/or substance disorders (SHAPP = 22.9%) (Fig. 1). Strikingly, only substance disorders featured among the most important individual predictors in the last of these categories. Predictors involving treatments of mental and/or substance use disorders were notable for their low importance (SHAPP = 5.6%). There were only four individual predictors with SHAPP values of at least 5%, all associated with increased probability of the negative outcome. Three of these involved housing problems: administrative evidence of having been homeless in the past 5 years (SHAPP = 6.6%), having had housing/economic problems in the past 5 years (SHAPP = 6.1%), and number of VA housing problem visits in the past 5 years (SHAPP = 5.3%). The fourth was a measure of high physical comorbidity (SHAPP = 6.0%).

Fig. 1: Predictors of high negative outcome risk in the test sample.

SHAP Shapley Additive exPlanations, # Number of, 5 yr 5 years before outpatient treatment visit, Rx prescription, CNS central nervous system, 12 m 12 months before outpatient treatment visit, 2 yr 2 years before outpatient treatment visit, 6 m 6 months before outpatient treatment visit, 3 m 3 months before outpatient treatment visit, 2 m 2 months before outpatient treatment visit, SDoH social determinants of health. ¹ See Supplementary Methods for discussion of how to interpret SHAPP. ² Visits with alcohol abuse uncomplicated, with intoxication or alcohol-induced disorders, any dependence, or unspecified use with intoxication or alcohol-induced disorders as the primary diagnosis as indicated by a selected set of International Classification of Diseases, Ninth/Tenth Revision, Clinical Modification (ICD-9/10-CM) codes.

Average treatment effects

Average treatment-specific probability of the outcome adjusting for observed differences in baseline variables did not differ significantly across the three treatment types: 9.1% (S.E. = 0.3) for ADM-only; 8.5% (S.E. = 0.3) for psychotherapy-only; and 8.8% (S.E. = 0.4) for combined ADM-psychotherapy (χ²₂ = 2.0, p = 0.36) (Table 3).

Table 3 Estimated average treatment effect (ATE) for each treatment type in total training sample (n = 30,484)a.

Heterogeneity of treatment effects

ADM-only was estimated by the ITR to be optimal with respect to low probability of the negative outcome for 8.5% of patients, psychotherapy-only for another 55.0%, and combined treatment for the remaining 36.5%. Patients estimated by the ITR to be optimized by a given treatment were only slightly more likely than others to receive that treatment (Supplementary Table 6). However, estimated ATE in the test sample differed significantly by treatment type received in the subgroup predicted by the ITR to be optimized with psychotherapy-only (χ²₂ = 7.2, p = 0.028). As predicted by the ITR, probability of the negative outcome in this subgroup was significantly lower among the patients treated with psychotherapy-only (6.9%, S.E. = 0.5) than those treated with either ADM-only (8.4%, S.E. = 0.5, χ²₁ = 5.2, p = 0.023) or combined ADM-psychotherapy (9.0%, S.E. = 0.9, χ²₁ = 4.3, p = 0.038), suggesting that the proportion of patients with this outcome is proportionally about 20% lower when treated with psychotherapy-only than one of the other treatments. In comparison, estimated ATE in the test sample did not differ significantly by treatment received either in the subgroup predicted by the ITR to be optimized with ADM-only (χ²₂ = 0.9, p = 0.63) or with combined ADM-psychotherapy (χ²₂ = 0.4, p = 0.81) (Table 4).
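
For readers who wish to check the reported significance levels, chi-square tail probabilities with 1 and 2 degrees of freedom have closed forms that need only the Python standard library. The function name is hypothetical. Because the test statistics are reported to one decimal place, recomputed p-values can differ from the reported ones in the last digit.

```python
import math


def chi2_sf(statistic, df):
    """Upper-tail probability of a chi-square statistic for df = 1 or 2.

    df = 1: P(Z**2 > x) = erfc(sqrt(x / 2)), with Z standard normal
    df = 2: P(X > x)    = exp(-x / 2)
    """
    if df == 1:
        return math.erfc(math.sqrt(statistic / 2.0))
    if df == 2:
        return math.exp(-statistic / 2.0)
    raise ValueError("this sketch only handles 1 or 2 degrees of freedom")


p_overall = chi2_sf(7.2, 2)    # three-arm comparison in the psychotherapy-only subgroup
p_vs_adm = chi2_sf(5.2, 1)     # psychotherapy-only vs. ADM-only
p_vs_combined = chi2_sf(4.3, 1)  # psychotherapy-only vs. combined treatment
```

The three recomputed values are close to the reported p-values of 0.028, 0.023, and 0.038.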

Table 4 Estimated average treatment effect (ATE) for each treatment type in subsamples of the test sample estimated by the individualized treatment rule (ITR) to be optimized by the different treatment types (n = 12,986)a.

Important predictors of heterogeneity of treatment effects

Given that the only significant HTE involved optimization with psychotherapy-only, we focused on that outcome in considering predictor importance (Fig. 2). By far the most important class of predictors of being optimized with psychotherapy-only was geocode-level SDoH (SHAPP = 93.5%). Physical disorders were the only other class of predictors with a SHAPP in double digits (13.7%). All individual predictors with SHAPP values of at least 5% were geocode-level SDoH variables, the most important of them involving low county-level access to medical treatment.

Fig. 2: Predictors of being optimized by psychotherapy only.

SHAP Shapley Additive exPlanations, 6 m 6 months before outpatient treatment visit, # Number of, 12 m 12 months before outpatient treatment visit, Rx prescription, 2 yr 2 years before outpatient treatment visit, SDoH social determinants of health, MD doctor of medicine, DO doctor of osteopathic medicine. ¹ See Supplementary Methods for discussion of how to interpret SHAPP.

The implications of using the ITR for treatment assignment

The test sample results suggest that all the 56.0% of patients who were predicted by the ITR to be optimized by psychotherapy-only should receive psychotherapy-only if the goal of treatment is to minimize risk of the negative outcome considered here, as estimated prevalence of that outcome was proportionally about 20% lower (6.9%) when these patients were treated optimally than otherwise (8.6–9.0%). However, the test sample results suggest that the remaining 44.0% of patients would have equivalent prevalence of the negative outcome no matter which treatment they received. If treatment assignment were made according to these results, 56% of patients would be treated with psychotherapy-only and 44% with ADM-only (i.e., the least expensive and most readily available treatment). This compares with 53.1% of patients who were observed to receive psychotherapy (39.7% psychotherapy-only and 13.4% combined ADM-psychotherapy) and 60.3% observed to receive ADM (46.9% ADM-only and 13.4% combined ADM-psychotherapy), for a total increase of 2.9 percentage points in the share of patients who would receive psychotherapy and a total decrease of 16.3 percentage points in the share of patients who would receive ADM under the ITR compared to current practice. If the test sample conditional ATE results reflect causal effects, these changes in treatment assignment would be expected to result in a 7.7% proportional decrease in prevalence of the negative outcome across all patients (from 8.6% to 8.0%).
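
The comparison of observed and ITR-based treatment shares can be laid out explicitly. The sketch below uses the percentages reported in this section; the function and variable names are hypothetical, and the differences it returns are in percentage points.

```python
def shares_receiving(assignment):
    """Share of patients receiving any psychotherapy and any ADM, given the
    percentage of patients in each of the three treatment arms."""
    return {
        "psychotherapy": assignment["psychotherapy_only"] + assignment["combined"],
        "adm": assignment["adm_only"] + assignment["combined"],
    }


def change_in_shares(observed, proposed):
    """Percentage-point change in each share when moving from observed to proposed."""
    before, after = shares_receiving(observed), shares_receiving(proposed)
    return {kind: after[kind] - before[kind] for kind in before}


# Observed treatment distribution and the distribution implied by the ITR.
observed = {"adm_only": 46.9, "psychotherapy_only": 39.7, "combined": 13.4}
under_itr = {"adm_only": 44.0, "psychotherapy_only": 56.0, "combined": 0.0}

change = change_in_shares(observed, under_itr)
```

Because combined treatment counts toward both shares, the observed psychotherapy and ADM shares sum to more than 100%, whereas under the ITR each patient receives exactly one monotherapy.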

Discussion

Prevalence and prediction of aggregate negative outcomes

We are aware of no previous research that attempted to use administrative data to predict serious negative events of the sort considered here among patients in outpatient MDD treatment. However, previous research has documented that administrative data can predict two components of our outcome, suicide attempts and suicide deaths, after outpatient visits in both civilian (e.g., [54]) and military (e.g., [55]) samples. The PPV of the negative outcome in the highest risk ventile for our aggregate model was sufficiently high (32.6% observed prevalence of the negative outcome among the 5% of patients with highest predicted risk) that some clinicians might think of these patients as likely treatment-resistant. If so, both practice guidelines and health-economic analyses would support starting with more intensive ADM/psychotherapy than is standard, such as aggressive medication dosing, more frequent psychotherapy scheduling, or more invasive treatments (e.g., antipsychotic augmentation, electroconvulsive therapy, transcranial magnetic stimulation, or ketamine [56,57,58,59]). It might be that, relative to current practice, one of these approaches would reduce futile trial and error efforts to find less intensive effective treatments for these cases.

Which of these or other more intensive courses of action would be preferable among patients at high risk of a serious negative event of the sort considered here cannot be determined from our results, as our models were not designed to identify optimal alternative treatments beyond the three considered here. There would certainly be barriers to implementing any more intensive courses of action as first-line treatments, not least of which in the civilian healthcare sector would be a payment structure that often requires several failed trials of first-line treatments before more intensive interventions are reimbursed [60]. But there is some indication that going directly to more intensive treatments could be acceptable to payers as an extension of existing stepped-care approaches if this was demonstrated to be cost-effective [61].

Aggregate treatment effects

Our finding of nonsignificant aggregate differences in prevalence of the composite negative outcome across treatment types is consistent with previous MDD randomized controlled trials that evaluated comparative effectiveness of ADM-only, psychotherapy-only, and combined ADM-psychotherapy in reducing psychiatric hospitalizations [62, 63], psychiatric emergency department visits [64], and suicide deaths [65, 66]. However, as noted in the introduction, a recent meta-analysis of these studies found that psychotherapy-only is associated with significantly lower risk than either ADM-only or combined ADM-psychotherapy of a composite serious negative outcome of the sort we considered here [5]. Our sample size was large enough to detect differences of the magnitude found in the meta-analysis, but the estimated ATEs in our sample, although consistent in sign with those of the meta-analysis (i.e., psychotherapy-only associated with lowest risk followed by combined treatment and ADM-only), were nonsignificant. This suggests either that ATEs are weaker in the VHA than in the more general population samples included in the meta-analysis and/or that residual bias existed in our nonexperimental analysis that under-estimated the benefit of psychotherapy-only. A pragmatic trial would be required to adjudicate between these two contending possibilities.

Heterogeneity of treatment effects

We found significant HTE for the 56% of the sample that was estimated to be optimized by psychotherapy-only, with the estimated benefit of optimal versus suboptimal treatment assignment associated with a roughly 20% proportional reduction in risk of the negative outcome. An effect of this magnitude would be considered clinically significant but small, raising the question of whether this level of HTE, even if it could be confirmed in a pragmatic trial, would be sufficiently large to warrant carrying out such a trial. The alternative, given the results of the recent meta-analysis cited above, might be to recommend psychotherapy-only in VHA PC-MHI clinics whenever concerns exist about any of the serious negative events considered here. However, even in PC-MHI clinics MDD patients often wait up to 8 weeks before being seen by a psychotherapist. Such patients are routinely prescribed ADMs at the initial PC-MHI visit in which they are referred to psychotherapy. PC-MHI psychotherapy-only patients, in comparison, are those who are either able to see a psychotherapist at the time of their initial visit or who get a near-term psychotherapy appointment at the time of this initial visit. Given that this quick access to psychotherapy is not always possible, our ITR might be useful in PC-MHI settings when triage decisions are needed about which patients to prioritize for near-term psychotherapy.

Our ITR might be more important in VHA primary care settings that do not qualify as PC-MHI sites (i.e., do not have at least one full-time psychotherapist on staff), where a lower proportion of MDD patients have access to psychotherapy, and even more so in clinics not associated with VHA, where only a small minority of MDD patients receive psychotherapy [3]. In both cases, though, new ITRs should be developed and evaluated before considering a pragmatic trial. Given the stronger estimates of ATE in the recent meta-analysis of controlled trials than in our observational analysis, it might well be that HTE would be stronger in these other settings than in VHA PC-MHI clinics.

It is also important to repeat a point made earlier: that HTE with respect to more conventional measures of MDD treatment response and remission might be quite different than for the serious negative events considered in the current analysis. Meta-analyses of controlled trials found stronger ATEs for psychotherapy-only, ADM-only, and combined treatment with respect to these more typical outcome measures [4] than for the serious negative events considered in our study [5]. The possibility that the same might be true for HTE adds to the argument in favor of carrying out a pragmatic trial.

Predictor importance

It is hazardous to place too much emphasis on predictor importance either in modeling aggregate risk of negative outcomes or HTE, as the ML methods used to train these models are designed to optimize overall model prediction accuracy at the expense of the accuracy of individual predictors [67]. This means that the predictors highlighted as important are really markers of predictive associations involving the (sometimes many) other baseline variables that are correlated significantly with the predictors designated important. Furthermore, important predictors cannot be assumed to be important causes, as markers of unmeasured causal factors often emerge as important predictors. Within the context of these cautions, it is noteworthy that physical disorders and patient-level indicators of SDoH emerged as more important than psychopathological factors as predictors of aggregate risk, whereas geocode-level indicators of SDoH were by far the most important predictors of HTE involving the comparative benefits of psychotherapy-only. It is unclear whether the same predictors would be important in predicting MDD symptom response or remission in VHA, or in predicting the serious negative outcomes considered here in other treatment settings.

Strengths and limitations

This study had several noteworthy strengths, including a large representative sample of patients, an extensive set of baseline variables related to constructs found by prior studies to predict MDD HTE, and the use of a best-practices approach to estimate the ITR. The absence of these features has been identified as an important weakness of prior ITR studies of psychiatric disorders [68].

But the study also had some noteworthy limitations. First, and most importantly, the data were observational rather than experimental. This meant that unmeasured confounding variables could have introduced bias into estimates even though we adjusted for an extensive set of baseline covariates and our aggregate results showed a pattern similar to, albeit weaker than, that in prior controlled trials that evaluated the comparative effectiveness of ADM-only, psychotherapy-only, and combined treatment in preventing the negative events that made up our composite outcome.

Second, despite our use of an extensive battery of baseline covariates, the predictor set did not include information about some variables that have been the focus of prior research, including biomarkers [69] and several other potentially important prescriptive predictors [70, 71]. Although the absence of biomarker information might not be seen as a major limitation given that biomarkers have not as yet shown great promise as predictors of MDD HTE, other variables we did not include, most notably patient treatment preferences, are known both to influence the type of treatment received and to predict treatment outcomes [72, 73]. We attempted to develop proxy measures for patient treatment preferences by abstracting information from EHRs that might be indicators of past MDD treatments perceived by the patient as successful (as indicated by adherence over a period of time long enough to be considered a full course of treatment) vs. not successful (as indicated either by a short course of treatment consistent with treatment dropout or by a serious negative outcome, such as a psychiatric emergency department visit or hospitalization, in the course of treatment, suggesting that the treatment failed). However, it would have been better to have direct measures of patient preferences than these proxies.

Third, although we excluded patients previously treated with antipsychotic medications to make sure there were no cases of bipolar disorder or nonaffective psychosis in the sample, this also had the effect of excluding refractory MDD cases treated with novel antipsychotics. It might be that retaining such patients in future extensions and using information about their history of antipsychotic treatment would improve prediction accuracy, increase external validity, and possibly help define a subgroup of patients who are optimized with combined ADM-psychotherapy.

Fourth, information on the types of ADMs and psychotherapies received was not used in the analysis. This was by design given that previous research has failed to document substantial evidence of HTE across ADM types [74]. Information on the effectiveness of psychotherapy is strongest for Cognitive Behavioral Therapy [4] but psychotherapy type was not recorded in the VHA EHR. In addition, information on treatment adherence was not used in the analysis even though VHA records contain information on prescription refills and psychotherapy sessions attended and missed. We did not include these measures, as they are endogenous (i.e., could be a consequence rather than a cause of treatment nonresponse). Information on ADM dosage was not extracted from the EHR.

Fifth, our use of a composite outcome of serious negative events limited generalizability. We know from previous randomized controlled trials that the comparative treatment effects of ADM-only, psychotherapy-only, and combined ADM-psychotherapy are stronger for symptom response-remission [4] and functioning [75] than for the serious negative events that made up our composite outcome measure [5]. It might be that something along the same lines holds for HTE; that is, that HTE is stronger with respect to symptom response-remission and functioning than with respect to the serious negative events we considered. If so, the benefit of psychotherapy-only estimated by our ITR might be a lower bound.

Sixth, caution should be used in extrapolating results to all potential VHA patients, as MDD treatment is less available in VHA clinics that do not qualify for a designation as PC-MHI and in rural areas. In addition, some depressed Veterans are more reluctant than others to seek treatment. Even greater caution should be used in extrapolating results outside of VHA given the unique socio-demographic characteristics of Veterans and the fact that the VHA system provides relatively high access to affordable and quality care, including much higher access to psychotherapy than is available in the civilian healthcare system.

Future directions

As noted in the introduction, the next logical step in evaluating the ITR developed here would be to implement a pragmatic trial in which the ITR is used to help provide clinical decision support for treatment selection to some, but not all, VHA patients seeking treatment for MDD. Replication and expansion of the analysis carried out here in that presumably large experimental sample could then be used to refine the ITR by including easily collected baseline and follow-up measures of patient self-reported symptom response. It is noteworthy in this regard that such a pragmatic trial could be carried out at low cost merely by using administrative data of the sort we used here to define the subset of patients hypothesized to be optimized by psychotherapy-only and to randomize access to this information as a clinical decision support tool.

If the results reported here are not judged to be sufficiently strong to warrant such a trial, then they might be strong enough to justify more intensive observational analyses that address the limitations noted above in our study. The most plausible way to do this would be by using instrumental variable methods. The key to this approach would be to identify one or more baseline variables (instrumental variables; IVs) that could plausibly be thought of as influencing treatment assignment but not influencing treatment outcomes other than through treatment assignment [76]. This approach is becoming increasingly common in nonexperimental studies of treatment effects using EHRs based on two general classes of plausible IVs. One potentially useful class of IVs involves geographic information [77]. These are typically used in studies of the effects of innovative treatments that are at first available only in limited geographic areas (so long as the patients in the trial are restricted to those residing close to the treatment setting in which they obtain treatment to avoid bias due to selection into innovative treatments by early adopters from other areas). This type of instrument might be expected to be of little value in VHA because all VHA medical centers should adopt the PC-MHI model over the next few years [78]. However, large variations continue to exist across VHA centers in PC-MHI implementation, with the result that the proportion of VHA patients seeking MDD treatment who obtain psychotherapy varies widely across centers in ways that could be treated as a valid IV [71].
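The logic of the IV approach described above can be illustrated with a minimal two-stage least squares (2SLS) sketch on simulated data. This is not the analysis carried out in our study: the data-generating process, the variable names (`z` for a binary instrument, `d` for treatment, `y` for a continuous outcome, `u` for an unmeasured confounder), and the assumption of a homogeneous treatment effect are all hypothetical and chosen only to show how a valid instrument can recover a treatment effect that a naive comparison misses.

```python
import numpy as np


def ols(X, y):
    """Least-squares coefficients for y ~ X (X already contains an intercept column)."""
    return np.linalg.lstsq(X, y, rcond=None)[0]


def two_stage_least_squares(z, d, y):
    """2SLS estimate of the effect of treatment d on outcome y using instrument z."""
    ones = np.ones_like(z, dtype=float)
    Z = np.column_stack([ones, z])
    d_hat = Z @ ols(Z, d)                    # stage 1: treatment predicted from the instrument
    X_hat = np.column_stack([ones, d_hat])
    return ols(X_hat, y)[1]                  # stage 2: outcome regressed on predicted treatment


def naive_difference(d, y):
    """Unadjusted treated-vs-untreated difference in mean outcome (confounded)."""
    X = np.column_stack([np.ones_like(d, dtype=float), d])
    return ols(X, y)[1]


def simulate(n=50_000, true_effect=-0.5, seed=0):
    """Hypothetical data: u affects both treatment and outcome; z affects only treatment."""
    rng = np.random.default_rng(seed)
    u = rng.normal(size=n)                          # unmeasured confounder
    z = rng.integers(0, 2, size=n).astype(float)    # binary instrument
    d = (2.0 * z + u + rng.normal(size=n) > 1.0).astype(float)
    y = true_effect * d + u + rng.normal(size=n)    # z enters y only through d
    return z, d, y


if __name__ == "__main__":
    z, d, y = simulate()
    print("naive estimate:", naive_difference(d, y))
    print("2SLS estimate: ", two_stage_least_squares(z, d, y))
```

In this simulated setting the naive estimate is biased by the unmeasured confounder, whereas the 2SLS estimate should fall close to the simulated effect. That holds only because the instrument was constructed to satisfy the exclusion restriction, which cannot be guaranteed in real EHR data.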

A second class of IVs involves clinician preference [79]. These IVs can be constructed from administrative databases to describe the past treatment patterns of the provider treating each patient. These past treatment patterns are typically strong predictors of patient treatment assignment and might plausibly be thought to influence outcomes only through treatment assignment. This type of IV is most useful when substantial variation exists across providers in the types of preferred treatments. Previous research has shown that this variation exists across prescribers in types of ADMs prescribed and that these differences are strong predictors of subsequent prescribing patterns [80]. Our preliminary analysis of VHA data shows that the same is true for past psychotherapy referral patterns of primary care physicians.
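One way a clinician-preference IV of this kind might be constructed from administrative records is sketched below. The record layout (`provider_id`, `visit_day`, `got_psychotherapy`) and the function name are hypothetical. For each patient, the instrument is defined as the share of that provider's earlier patients who received psychotherapy, so that the patient's own treatment never enters the instrument.

```python
from collections import defaultdict


def provider_preference_iv(records):
    """Return (provider_id, visit_day, preference) for each record.

    records: iterable of (provider_id, visit_day, got_psychotherapy) tuples.
    preference is the proportion of the provider's *prior* patients who
    received psychotherapy, or None when the provider has no prior patients.
    Records sharing a visit_day are processed in input order (stable sort).
    """
    counts = defaultdict(lambda: [0, 0])  # provider -> [n_prior, n_prior_psychotherapy]
    result = []
    for provider, day, treated in sorted(records, key=lambda r: r[1]):
        n_prior, n_psy = counts[provider]
        preference = n_psy / n_prior if n_prior else None
        result.append((provider, day, preference))
        # Update the provider's history only after the instrument is computed.
        counts[provider][0] += 1
        counts[provider][1] += int(treated)
    return result


if __name__ == "__main__":
    example = [
        ("A", 1, True),
        ("A", 2, False),
        ("A", 3, True),
        ("B", 1, False),
        ("B", 4, False),
    ]
    for row in provider_preference_iv(example):
        print(row)
```

In practice the history window, the minimum number of prior patients required, and the handling of patients seen by several providers would all need to be specified; the sketch leaves these choices open.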

It is important to note that IV methods are highly sensitive to misspecification. Formal tests consequently should be used to evaluate the validity of IVs before using them [81, 82]. It is also noteworthy that statistical methods exist to estimate HTE in the context of IV models [83]. However, measures of prescriptive predictors remain important for developing strong ITRs even when valid IVs are present. The section on limitations noted several important prescriptive predictors that we could not measure in our study, including baseline MDD symptom severity (which, as noted above, could also be used as an additional outcome), patient treatment preferences, and psychotherapy type. As measurement-based care becomes more common in large health systems like VHA, it will become more and more possible to carry out studies like the current one with patient-reported symptom measures used both as prescriptive predictors and outcomes. Even before this time, though, natural language processing (NLP) methods have shown considerable promise in extracting information from clinical notes that allows PHQ-9 scores to be approximated [84, 85]. The same methods have been used to extract information from clinical notes about SDoH as prescriptive predictors [86]. Although we are aware of no comparable studies designed to elicit information about patient treatment preferences, psychotherapy type, or other potentially important prescriptive predictors of MDD HTE, dramatic recent successes in large language models on various NLP tasks suggest that it might be useful to attempt to develop such models [87]. VHA would be an ideal site for such studies given that the VA Informatics and Computing Infrastructure system has created a consolidated free text database to facilitate large-scale NLP analyses [88].

Conclusions

We were able to define a subgroup of patients with high risk of the composite negative outcome considered here based on information available in administrative records at the time of beginning treatment. We also found evidence for substantial MDD HTE with respect to the benefit of psychotherapy-only. Finally, we found no evidence for the benefit of combined ADM-psychotherapy in any subgroup of patients, although it is noteworthy that we under-represented patients who were previously refractory by excluding from analysis those with a history of treatment with novel antipsychotics. This information could be useful to target more intensive treatments to high-risk patients and to increase the match between the type of first-line treatment received and the treatment likely to be optimal for remaining patients. Replication would be needed before implementing such changes, though, both to confirm the stability of the aggregate risk model and to determine if the same predictors can be used to predict aggregate variation in risk of more common outcomes involving MDD symptom response-remission. NLP methods applied to clinical notes could also be used both to generate a proxy MDD symptom response-remission outcome measure and to obtain estimates of baseline prescriptive predictors that could refine ITR development in conjunction with the use of IV methods to address the problem of nonrandom treatment assignment. Supplementary information is available on MP’s website.