Introduction

The State of Neuroscience and Clinical Care in MDD

Major Depressive Disorder (MDD) is a brain disease, or heterogeneous set of brain diseases. It can lead to temporary or permanent disruptions in emotions, problem solving, attention, motivation, and sleep. MDD also has a high prevalence, now with lifetime estimates at nearly 20% [1]. Indeed, the majority who experience a depressive episode will re-experience the illness within 2 years, with some estimates at 80% recurrence [2,3,4]. Moreover, it is becoming increasingly clear that environmental and “endogenous” risk factors for MDD become evident in childhood, even if the majority of individuals who have these risk factors do not express illness.

It is surprising, then, that so little knowledge exists about how to prevent or treat MDD. The research expenditure for MDD has paled in comparison to other diseases that are rarer and less costly – MDD is estimated to result in a 500 million dollar loss of productivity and earnings per year in the US alone. Current treatments for MDD have limited success, and no clinical predictors of treatment response exist that could be used on the individual patient level. Unfortunately, longitudinal studies that might better illustrate the risk and expression factors for MDD are expensive and difficult to maintain using standard extramural funding cycles, so only a limited number of studies have followed samples longitudinally.

One particular challenge for treatment identification and prediction is that MDD is a highly heterogeneous condition that can present with any number and severity of symptoms. For example, one of the best-studied mechanisms for mood disorders is the Hypothalamic Pituitary Adrenal (HPA) axis. Yet even this well-established model has led to a highly heterogeneous set of reports, and to the recent failure of multiple clinical trials for HPA axis modifiers [5,6,7,8,9,10,11,12,13]. To this end, one could consider MDD as a multidimensional condition, as the neural underpinnings of any of the symptoms are (a) shared with many other conditions, and (b) supported by unique and integrated neural circuits [14]. The Research Domain Criteria (RDoC) is one framework by which the field has begun to deconstruct the substrates that are potential risks for MDD, or that might be adversely affected by MDD.

A related challenge is that treatment for MDD has almost uniformly consisted of repurposed treatments for alternative conditions, or accidental findings. Of these treatments, there is not a clear understanding of the mechanisms by which they work. For example, SSRI action happens over the course of hours, whereas the treatment effect is not evident for weeks [15]. More recently, alternative treatments like repetitive TMS, magnetic seizure therapy, and ketamine have enabled us to imagine and test different pathways and mechanisms for MDD [16, 17]. Furthermore, large clinical trials like I-SPOT have solidified some pathways on treatment response [18••, 19••], and EMBARC holds promise for new insights [20]. It is perhaps a time for some critical reflection and discussion.

The Promise of Neuroprediction in MDD

Given the many challenges toward understanding factors involved in risk for and expression of MDD, neuroscientists have designed experiments as bottom-up tests of response in MDD. The idea was to understand what neural mechanisms and pathways mediate or predict (neuroprediction) response for some individuals, so that specific treatments could be targeted for certain individuals. First, there was a hope that neuroprediction might illuminate the mechanisms by which standard treatments effect change in MDD, and for whom. Second, there was also hope that different treatments might have different predictive capacity based upon regions and networks [21••]. Indeed, there is mounting evidence that brain activity is better than standard clinical measures at predicting treatment outcome [19••, 22, 23]. Unfortunately, neither of those hypotheses has led to any clear breakthroughs as of yet – what makes for significant prediction at the group level may not be specific enough to transfer to the individual patient [24••].

In addition, in many studies, samples that are eligible for MDD imaging trials tend to be younger, more highly educated, less severely ill, absent many other medical and clinical comorbidities, and with lower body mass index. Each of these factors alone, and in combination, results in a far greater likelihood of a better treatment response [25]. In contrast, studies of treatment resistant depression (TRD) are plagued by variable definitions of MDD treatment resistance, and more heterogeneous, chronic representations of the illness [26,27,28,29,30]. How might we integrate these disparate results?

Predicting Treatment Response in MDD

By our reckoning, there are well over 60 studies that have attempted to use neuroimaging measures to predict treatment response (see Table 1 for a subset of just task-based fMRI studies). The majority of these studies have been open-label, unblinded medication trials, many of which are discussed in two separate meta-analyses [31, 32] and three recent review articles [33••, 34, 35]. These studies include a near-term prediction of reduction or resolution of depressive symptoms, typically over 4–16 weeks.

Table 1 Neuroprediction Studies in Major Depressive Disorder with Task-Based fMRI

Prediction studies fall into four broad categories of RDoC domains – negative valence processing (e.g., emotion reactivity, attentional bias), positive valence processing (e.g., reward), cognitive control and working memory (e.g., inhibitory control), and resting. In addition, the methodologies for measurement of brain function are multifaceted, including EEG, MRI, fMRI (also including ASL, rs-fMRI, ALFF, ICA techniques), PET, SPECT, and fNIRS. We focus in this review on task-based fMRI, the predominant strategy, especially since other reviews have included PET, ASL, EEG, and rs-fMRI. One of the main differences across studies is whether whole brain vs region of interest analyses were used. Moreover, different studies have used different thresholds for significance, which, when combined with variable and often small sample sizes (see below), results in an investigator/team effect outside the effects of measurement. Finally, the median number of subjects is about 20 (see Table 1), which limits the accuracy of regression models [24••, 36]. Likewise, in t test comparisons of responders vs non-responders, if a majority of those enrolled will be treatment “responders” then the clinically more meaningful group (non-responders) is underweighted within the model. Overall, these considerations and variations render integration and interpretation challenging.
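
The underweighting of non-responders can be made concrete with a small sketch. The sample below (15 responders, 5 non-responders) and the function names are hypothetical and chosen only for illustration; they are not taken from any study in Table 1. The sketch compares raw accuracy with balanced accuracy (the mean of sensitivity and specificity) for a trivial rule that labels every patient a responder.

```python
def confusion_counts(y_true, y_pred):
    """Count outcomes, coding 1 = responder and 0 = non-responder."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, tn, fp, fn


def accuracy(y_true, y_pred):
    """Proportion of all patients classified correctly."""
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)
    return (tp + tn) / (tp + tn + fp + fn)


def balanced_accuracy(y_true, y_pred):
    """Mean of sensitivity and specificity.

    Assumes both classes are present in y_true; otherwise a
    division by zero occurs.
    """
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)
    sensitivity = tp / (tp + fn)  # responders correctly identified
    specificity = tn / (tn + fp)  # non-responders correctly identified
    return (sensitivity + specificity) / 2


# Hypothetical sample: 15 responders and 5 non-responders.
outcomes = [1] * 15 + [0] * 5
# A trivial "predictor" that calls everyone a responder.
always_responder = [1] * 20

print("accuracy:", accuracy(outcomes, always_responder))
print("balanced accuracy:", balanced_accuracy(outcomes, always_responder))
```

Under these assumed numbers the trivial rule looks respectable on raw accuracy while identifying none of the non-responders, which is why a class-balanced metric may be the more informative summary when responders dominate a sample.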

Existing Studies and Networks in Treatment Prediction in MDD

Some early studies, reviews, and meta-analyses have homed in on key neural circuits involved in neuroprediction of treatment response [31,32,33,34], and we will only briefly retouch upon these here, offering a network-based framework for interpretation. The utility of a “key region” (KR) predictive model is balanced by the reality of the level of precision in the data (smoothness), the nature of network-based functioning, and realizations that there is more scanner noise in a given region, particularly for fMRI. For example, the ability to replicate an exact KR is not the same as replicating a performance or self-report predictor in the standard sense of replication. Differences in scanners, software, preprocessing pipelines (including choices made regarding realignment, slice-timing, normalization, standard templates, smoothing kernel), and quality control procedures substantially add to variability in 3D coordinate system replication, beyond the assessment of the effect size. An alternative strategy might be to employ some additional smoothing function when a meta-analytic tool, such as GingerALE, is employed. As a result, here we organize the regions by virtue of recent network parcellations. For simplicity, we focus on the three-network model of Menon and colleagues (see Table 1), while acknowledging that other researchers choose to parcellate networks and subnetworks differently [37].

Salience and Emotion Network

Within the concept of a three-network model, there is one network that prioritizes processing and reacting to information that has a high salience level, including self-relevant and often emotionally-laden stimuli. As a result, it can be difficult to dissociate what could be considered emotional and not salient, and even more challenging to segregate out those experiences that might be self-relevant but somehow not salient. One can imagine, then, that these categorizations do not lend themselves to a transparent set of questions or answers within neuroimaging research. As such, we combine these two highly overlapping concepts into one broad network. The key nodes involved in this broad network are amygdala, subgenual cingulate, dorsal cingulate, ventral striatum, and anterior insula (possibly more ventral). The rostral and subgenual anterior cingulate have been implicated in prediction studies across modality and task [24••, 25, 31, 34, 38, 39]. The amygdala has been a much trickier region to study (more heavily targeted, but inconsistent results) in prediction of response [19••, 40,41,42]. The insula has also been observed in a few task paradigms [32, 43, 44]. Within studies of reward, ROI approaches have been very common, with hypoactivation in ventral striatum and subgenual cingulate as predictors of poor treatment response [45,46,47].

Cognitive Control Network

The cognitive control network (CCN) is thought to be a large subnetwork within the “task-positive” network. It is thought to prioritize processing of information in relation to planning, organization, sequencing, stopping and starting, and processing of mental operations. Key nodes are the dorso-lateral prefrontal cortex (DLPFC), the inferior parietal lobule, and dorsal anterior cingulate. Among the few studies that have employed cognitive control or working memory tasks, several have reported the importance of the right DLPFC in treatment prediction [18••, 24••, 38, 48]. Notably, many negative valence studies have also reported DLPFC activation as a predictor of treatment response [39, 49,50,51,52]. It is possible and even likely that emotion regulation engages cognitive control regions to aid in managing the emotional response, even if it is somewhat unclear how much volitional control a particular patient or control participant might have over such regions.

Default Mode Network (DMN)

The DMN is a distributed neural network encompassing a large amount of medial cortex, proximal to both anterior and posterior aspects of the medial prefrontal cortex. It also includes nodes within medial and lateral temporal and parietal cortex. It is thought to represent a host of functions including memory, self-referential thought, and theory of mind [37]. Because DMN is a task-negative network, it has been relatively understudied in task domains, yet it does routinely activate in prediction of treatment response for affective paradigms [34, 50, 53,54,55]. The medial prefrontal cortex is a frequently reported predictor in fMRI task-based studies, and includes/extends into rostral cingulate, primarily in emotion perception, processing, and regulation tasks [50, 51]. The most ventral aspect of the posterior cingulate, extending into posterior hippocampus, is a reported predictor for both cognitive and affective paradigms [50, 55, 56]. The anterior hippocampus has also been reported as a predictor, primarily in emotion processing studies [49].

Other Regions

Surprisingly, visual cortex and cerebellum also appear in such task-based fMRI studies, irrespective of mechanism [24••, 31, 34, 38, 49]. Not surprisingly, these results are often under-specified and under-discussed, potentially due to the lack of theory regarding the contributions of these regions to MDD. Such findings nevertheless cause us to pause in making assumptions about network or KR specificity in neuroprediction, and require further study.

Recent studies have even looked at comparative and integrative prediction of different neuroimaging approaches [24••, 57]. The goal is to obtain a treatment prediction accuracy of >95% (binary question of whether this particular patient will achieve remission) so that such predictors could be used in treatment prescriptive studies. More importantly, using combined/comparative treatment studies could identify highly accurate moderators and mediators of treatment response, so that a prescriptive clinical imaging design could be planned. The results of such studies could inform newer guided clinical trials where the key outcome is time to remission. If we could reduce the median time to wellness by weeks or even months, a considerable degree of the “burden” of depression could be reversed.

One recent study by our group combined behavioral, task-fMRI, and task-fMRI with independent component analysis in an integrative predictive model that achieved 89% accuracy in prediction of treatment response, including steps with cross validation [24••]. Medial and lateral prefrontal cortex synchronization of activation during errors was a positive predictor of degree of treatment responsiveness, and accuracy of prediction significantly increased when combined with poorer behavioral inhibitory control and increased activation in several prefrontal regions. In addition, one of the I-SPOT reports suggested that hypo-reactivity in emotional stimuli within the amygdala was successful in predicting treatment response with 75% accuracy [19••].
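
To illustrate what a cross-validated accuracy figure means in this context, the following is a minimal sketch of leave-one-out cross-validation. It is not the integrative model described above: the single hypothetical feature, the nearest-class-mean rule, and the toy values are all assumptions made for illustration. The key point is that each patient is classified by a rule fitted without that patient's data.

```python
from statistics import mean


def loocv_accuracy(features, labels):
    """Leave-one-out cross-validated accuracy of a nearest-class-mean rule.

    features: one numeric value per patient (hypothetical, e.g. an
              activation estimate); labels: 1 = responder, 0 = non-responder.
    Assumes at least two patients per class, so that each training fold
    still contains both classes.
    """
    correct = 0
    for i, (x_test, y_test) in enumerate(zip(features, labels)):
        # Fit the rule on every patient except patient i.
        train = [(x, y) for j, (x, y) in enumerate(zip(features, labels)) if j != i]
        mean_responder = mean(x for x, y in train if y == 1)
        mean_nonresponder = mean(x for x, y in train if y == 0)
        # Assign the held-out patient to the closer class mean.
        closer_to_responder = abs(x_test - mean_responder) <= abs(x_test - mean_nonresponder)
        predicted = 1 if closer_to_responder else 0
        correct += int(predicted == y_test)
    return correct / len(features)


# Toy, well-separated values chosen only to show the mechanics.
toy_features = [0.9, 1.1, 1.0, 1.2, 0.1, 0.2, 0.0, 0.3]
toy_labels = [1, 1, 1, 1, 0, 0, 0, 0]
print("LOOCV accuracy:", loocv_accuracy(toy_features, toy_labels))
```

With real data the classes overlap, and any feature selection or threshold tuning would also need to happen inside each training fold; otherwise the reported accuracy is likely to be optimistic.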

Predicting Risk and Disease Course

Another potential use of neuroimaging studies in MDD is to predict disease course or recurrence of illness. Of the studies that have been conducted, initial results are interesting and potentially promising. However, there are relatively few studies of this type. What distinguishes risk and disease course studies is that they follow individuals over a longer period of time (e.g., 6 months up through decades of follow-up), and the goal is to predict a distant event. The likelihood that such studies will yield a positive predictor is quite modest. In fact, this weakened predictive capacity is compounded by attrition, and further limited by the tendency for negative studies to go unpublished [58], and the difficulty in publishing replication studies (see below for more details).

A prior review identified biological markers of vulnerability in at-risk youth [59]. Despite the many challenges of publishing longitudinal data, investigator variability in task/physiological/measurement probes that were used, and the marked cost of longitudinal studies, this review noted that there were a significant number of biological predictors that were reported by more than one group. For example, EEG-measured alpha band power, P300, and frontal asymmetry all demonstrated some degree of hereditary convergence. Studies that use fMRI measures to predict outcome are few and far between. Recently, the longitudinal assessment of manic symptoms (LAMS) multisite group suggested that self-report and neuroimaging markers could account for 28% of variance in future manic symptoms [60]. This study followed 78 at-risk youth for an average of 15 months and used cingulum connectivity and connectivity from the ventral striatum to the parietal cortex as predictors. Another study used cross-hemispheric connectivity from subgenual anterior cingulate seeds within a psychophysiological interaction analysis during a self-blame task. Surprisingly, though, the resilient group (no MDD recurrence) was different from the healthy control (HC) group, whereas the group with recurrent MDD did not noticeably differ from the HC group [61]. Recurrence was predicted with 75% accuracy in this sample. A final example of the utility of MRI in the prediction of illness course comes from a study of anterior cingulate volume, which predicted 52% of variance in future depression scores, along with other relevant clinical data [62]. While these studies are encouraging, more are needed. As an example of what might be conducted in future studies, a recent paper used a discrete-time Markov chain with finite states (based upon 1 year of monthly self-report questionnaires) to define latent symptom classes in 209 adults with bipolar disorder [63••]. Repeated measures analyses of this type, combined with biological measures in patients with MDD, could be very helpful for predicting future states and course of illness.
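
As a rough illustration of the discrete-time Markov chain idea, the sketch below estimates month-to-month transition probabilities from observed state sequences. This is a simplification and not the cited method: the cited paper defined latent symptom classes, whereas here the states are assumed to be directly observed, and the state names and toy sequences are hypothetical.

```python
from collections import Counter


def transition_matrix(sequences, states):
    """Estimate first-order transition probabilities from state sequences.

    sequences: list of per-patient lists of monthly states.
    states: the finite set of allowed states.
    Returns a nested dict: matrix[from_state][to_state] = probability.
    Rows with no observed transitions are left as all zeros.
    """
    counts = Counter()
    for seq in sequences:
        # Pair each month with the following month.
        for current, following in zip(seq, seq[1:]):
            counts[(current, following)] += 1

    matrix = {}
    for a in states:
        row_total = sum(counts[(a, b)] for b in states)
        matrix[a] = {
            b: (counts[(a, b)] / row_total if row_total else 0.0)
            for b in states
        }
    return matrix


# Hypothetical monthly self-report states for two patients.
states = ["well", "depressed"]
sequences = [
    ["well", "well", "depressed", "depressed", "well"],
    ["depressed", "depressed", "depressed", "well", "well"],
]

for origin, row in transition_matrix(sequences, states).items():
    print(origin, row)
```

Each row of the estimated matrix sums to one when that state has at least one observed transition. Biological measures could then be examined as predictors of which patients follow more or less favorable transition patterns.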

Open Label Studies, Placebo Response and the Specificity of Clinical Prediction

The majority of predictive studies in MDD with fMRI have been open label studies. Those who have longstanding interest in clinical trials have questioned the internal validity of open label, one-arm predictive studies, because there is no comparison treatment, the treatment is not blinded, and there is no placebo control. We agree that single arm studies have challenges for specificity of prediction – but if replicated these studies still may offer some prescriptive value. We highlight the importance of the control group to evaluate effects of time and maturation independent of the treatment condition [64].

Placebo-controlled designs have several challenges and merits, as there are opportunities in these designs to distinguish treatment specific and more generic effects of help-seeking and return to wellness. The role of placebo responding is an important consideration in treatment prediction modeling. Many studies suggest that placebo responding can be nearly as good as the effects of an active treatment [20, 65]. These studies have led to concerns about the biological specificity and clarity of diagnoses and treatments, including with MDD. They have also led to broader concerns with specificity of treatments for MDD.

More recently, our group has focused on whether placebo responding is in fact distinct in any way from response to a standard psychiatric medication [66]. Notably, the individuals who were most responsive to a suggestive placebo effect were also the ones who showed the greatest responsiveness to a psychoactive treatment, and they had greater μ-opioid release during placebo in the nucleus accumbens. The EMBARC study should also be able to address this question in some detail.

Psychometric considerations beyond placebo include natural resolution of illness, regression to the mean, and the close links between hopefulness, behavioral activation, and placebo responding [65]. Continued innovation is needed to better understand placebo response, and perhaps how placebo might be marshaled to facilitate, enhance or extend the effects of psychoactive medications and psychotherapies. In summary, although we agree that single arm studies clearly have challenges for specificity of prediction, if replicated these studies still may offer some prescriptive value.

Power, Clinical Significance, Effect Sizes, and Adjustments for Multiple Comparisons

There is continued misunderstanding within the imaging field (although it is not limited to imaging studies) about the role of statistical adjustment of the accepted type I error rate, the relationship of statistical threshold adjustment to chances for replication, and whether one-off studies can actually diminish the type I error without compromising a scientific line of inquiry. Statisticians often counsel on the careful selection of a p value to balance the risk of a false positive (type I error) against that of a false negative (type II error) [36, 67]. In addition, the concept of meaningfulness of a significant effect – does it help us to understand illness or treatment with a reasonable degree of precision and effect size – is often lost in the discussion [64, 68]. We hope to illustrate that the concerns about type I error are valid, but that they have misled reviewers and the field into a p value war that can only sacrifice type II error and clinical significance, and will very likely reduce the capability of time-tested strategies like replication and meta-analysis. Figure 1, Panel A illustrates the relationship between sample size and statistical significance using G*Power. We set alpha at .005, as our experience suggests that this threshold strikes a balance between statistical stringency and clinical significance. To achieve significance with an alpha of .005 and power of .80, an effect size of 1.25 (very large) is needed with equal samples of 20. This means that many comparison studies are underpowered for large and medium effect sizes; they would have a higher likelihood of non-significance in this scenario (type II error). This is particularly troubling, as the vast majority of medical treatments have small to moderate effect sizes (Fig. 1, Panel B). So, would we counsel throwing away the baby with the bathwater?
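
The relationship between sample size, alpha, power, and detectable effect size can be sketched with a normal approximation for a two-group comparison. This is an assumption-laden approximation and not a reproduction of the power software calculation behind Figure 1: it assumes a two-tailed test with equal group sizes, and the function names are hypothetical. Because it ignores the t-distribution correction, it will tend to give somewhat smaller effect sizes and sample sizes than an exact calculation.

```python
from math import ceil, sqrt
from statistics import NormalDist


def min_detectable_d(n_per_group, alpha=0.005, power=0.80):
    """Approximate smallest standardized mean difference (Cohen's d)
    detectable with the given per-group n, two-tailed alpha, and power.

    Normal approximation: d = (z_{1-alpha/2} + z_{power}) * sqrt(2 / n).
    """
    z = NormalDist()  # standard normal
    return (z.inv_cdf(1 - alpha / 2) + z.inv_cdf(power)) * sqrt(2 / n_per_group)


def n_per_group_for_d(d, alpha=0.005, power=0.80):
    """Approximate per-group n needed to detect effect size d
    (same normal approximation, rounded up)."""
    z = NormalDist()
    return ceil(2 * ((z.inv_cdf(1 - alpha / 2) + z.inv_cdf(power)) / d) ** 2)


# Equal groups of 20, alpha = .005, power = .80, as in the text.
print("approx. detectable d with n = 20 per group:", round(min_detectable_d(20), 2))

# Conventional benchmarks for large, medium, and small effects.
for d in (0.8, 0.5, 0.2):
    print(f"approx. n per group for d = {d}:", n_per_group_for_d(d))
```

Under these assumptions, groups of 20 can detect only effects well above the conventional "large" benchmark of 0.8, and a medium effect requires on the order of a hundred participants per group, which is consistent in spirit with the point made above.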

Fig. 1

A. Illustrates the N needed to obtain power of .80 to reject the null hypothesis, in a given voxel, based upon alpha < .005. This is independent of adjustment for cluster size. An exponential fit line is included to illustrate the relationship between sample size and power. B. Effect sizes for comparison with 1A. Most psychotherapies have moderate effect sizes, suggesting that a similar brain effect size would have adequate power with Ns of between 50 and 109. The assumption is that the brain marker would have the same effect size as the treatment. Brain effect sizes may be larger or smaller, as it is doubtful that they are parametrically linked. Effect sizes from Meyer et al., 2001 [69] and Leuck et al., 2013 [70]

This illustration enables us to see what types of effect sizes would be significant with a given sample size. This is compounded by the reality that large effect sizes may be no more or less likely to replicate than very large effect sizes. This creates an unhealthy tension between whole brain analyses and ROI analyses. There may be a temptation to report only ROI analyses to avoid adjustments for multiple comparisons. Reporting only ROI analyses limits the ability to conduct meaningful meta-analyses across many studies. This challenge is compounded because few groups are capable of and motivated toward carrying out treatment studies with biological markers.

A brief comment on adjustments for multiple comparisons is warranted. We contend that the unbalanced concern about p value adjustments, with multiple comparisons in mind, has created a mindset in authors, reviewers and editors that is not conducive to evaluating the relative costs of false positives vs false negatives. There is already a tremendous bias against publication of negative results, often referred to as the file drawer effect. Well-funded labs are left to resort to publishing in paid journals, if they choose to publish the findings at all. Less well-funded labs resort to planting summary results in chapters and reviews, with sparse and under-evaluated methods. No matter the outcome, the lack of published negative studies substantially limits the benefits of meta-analytic and qualitative review techniques.

Finally, an unadjusted p value for a new treatment (exploratory) should be viewed differently than an unadjusted value for a known treatment. There is currently a focus on rapid-fail clinical trials at the NIMH. A strict evaluation of merit based upon adjusted p value may result in type II error – a promising treatment may be relegated to the dust heap. An evaluation based upon effect size with confidence estimates around the effect size, combined with rapid extension into a replication sample, can help balance type I vs type II error. Moreover, before the rapid extension to multisite trials, it is wise to require a semi-independent replication at a separate site. An extension of R61 to R33 could be followed by a second R33 (or even concurrently run, perhaps a new R mechanism) in an independent lab with input from the PI and team from the R61/R33.
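
An evaluation based on effect size with a confidence estimate can be sketched as follows. The sketch computes Cohen's d for two independent groups with a large-sample approximate confidence interval; the standard error formula is a common approximation and is less reliable at the small sample sizes typical of this literature. The function name and the toy scores are hypothetical.

```python
from math import sqrt
from statistics import NormalDist, mean, variance


def cohens_d_with_ci(group_a, group_b, confidence=0.95):
    """Cohen's d (pooled SD) with an approximate confidence interval.

    Uses the large-sample standard error
        SE(d) = sqrt((n1 + n2) / (n1 * n2) + d**2 / (2 * (n1 + n2)))
    and a normal critical value. Returns (d, lower, upper).
    """
    n1, n2 = len(group_a), len(group_b)
    pooled_var = (
        (n1 - 1) * variance(group_a) + (n2 - 1) * variance(group_b)
    ) / (n1 + n2 - 2)
    d = (mean(group_a) - mean(group_b)) / sqrt(pooled_var)
    se = sqrt((n1 + n2) / (n1 * n2) + d ** 2 / (2 * (n1 + n2)))
    z_crit = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return d, d - z_crit * se, d + z_crit * se


# Toy symptom-change scores for two hypothetical treatment arms.
arm_a = [12, 9, 15, 11, 14, 10, 13, 8]
arm_b = [9, 7, 11, 8, 10, 6, 12, 7]

d, lower, upper = cohens_d_with_ci(arm_a, arm_b)
print(f"d = {d:.2f}, 95% CI [{lower:.2f}, {upper:.2f}]")
```

With small groups the interval is wide, which is the practical argument for pairing an effect size estimate with rapid extension into a replication sample instead of relying on a single thresholded p value.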

Neuroprediction in the RDoC Era: Current Directions and Recommendations

The emergence of the RDoC era has placed the question of treatment effect sizes and specificity squarely in the cross-hairs. There are many non-specific effects of intervention: effects of therapeutic alliance, intervention time, effort toward change, regression to the mean, placebo effects, natural resolution of illness, etc. Each of these can contribute to significant “improvement” that is not related to the specific mechanisms of treatment (e.g., domain). RDoC highlights this tension because it shows how many diseases may have common and overlapping domains of illness – therefore they may also have common pathways to wellness [69,70,71,72,73]. Anxiety and depression may share similar negative valence domain disruptions, whereas only depression might have positive valence domain dysfunction.

To date, the study of clinical predictors has tended to overly rely on a categorical-polythetic diagnostic nomenclature (e.g., DSM-IV) constricting tests to one disorder, often testing the therapeutic response in terms of rigid measures of symptom change – these are inevitably tied to categorical diagnostic systems. Given the heterogeneity of major depression and dimensional nature of symptomatology, neuropredictors of treatment response may elucidate distinct and shared pathways that interact with particular interventions. Therefore, testing the discriminant and construct validities of several RDoC domains and dimensions (e.g., reward, threat responding, loss, affect regulation) linked to circuits in experimental designs that examine response to interventions with different mechanisms of action (e.g., pharmacotherapy, psychotherapy, neuromodulation) can lead to new insights.

The convergence of anxiety and depression symptoms and affected domains suggests that there may be parallel predictors in treatment response. The RDoC initiative has encouraged us to frame our understanding of treatment mechanisms and predictors to have the broadest impact on the care of patients with major depression and other internalizing psychopathologies (IPs [74]). More specifically, the RDoC framework is grounded on three postulates of high relevance to neuroprediction: 1) IPs as mental illnesses are disorders of brain circuits (e.g., amygdala-frontal circuitry); 2) tools of clinical neuroscience (e.g., functional neuroimaging, electrophysiology, etc.) can be used to test and advance biosignatures that will guide treatment; 3) brain-based predictors can be enhanced by a multimodal approach that incorporates different units of information that are likely to moderate or mediate neural predictors [75]. This framework has the potential to catalyze research that will address knowledge gaps that have hindered progress in incorporating biological predictors into clinical practice.

Additionally, accumulating data from the literature and our teams suggest that brain regions implicated in the pathophysiology of a disorder or even in treatment-mediated change may not be the same regions that predict treatment response [35, 76••, 77]. For example, those factors that contribute to risk of illness are thought of as endophenotypes. Those that mediate the treatment response are considered treatment targets. Those factors or biological markers that predict treatment response could be endophenotypes. They could also be treatment targets. They could, however, be independent treatment predictors, and be unrelated to endophenotypes or treatment targets. RDoC studies with a focus on specific domains of dysfunction (e.g., treatment targets and/or endophenotypes) may in fact best highlight (or even expand) treatment predictors across multiple illnesses and domains. Thus, continued focus on mechanisms for endophenotypic risk is likely a different path than the advancement of biomarkers toward precision medicine (treatment targets or predictors).

Conclusions and Recommendations

We have attempted to cover a few important issues in neuroprediction studies in MDD. In our view, there are too few studies of the neurobiological predictors of and mechanisms involved in treatment response, even of accepted clinical treatments. Neuroprediction studies offer several windows into disease, illness expression, processes of recovery, and maintenance of wellness. We recommend that concerted effort be focused toward collections of patients with internalizing disorders, often referred to as repositories. These repositories of eligible and interested patients can then be tested with different RDoC paradigms, with sufficient sample sizes, using different treatment strategies, with the idea that accumulated knowledge will improve the matching of treatments to patients for optimal outcomes. In the meantime, a great deal will be learned about how our treatments work, and for whom.

We close with some additional recommendations for uniformity in reporting for neuroprediction studies (Table 2). These can be considered as additional and complementary to already existing reporting guidelines (e.g., COBIDAS [78••]), with a specific focus on data that will assist in evaluating clinical specificity and meaningfulness, and that can contribute to meta-analyses. We highlight again that such reporting guidelines do not and cannot protect against a failure to replicate. They can only guide better implementation of replication studies and increased rigor. Moreover, we add that replication studies should carefully consider the challenges of overfitting, p-hacking, and spatial alignment. A poorly executed replication study (by sample size, design, inclusion and exclusion criteria, treatment fidelity) has the potential for great harm. As the number and types of therapies for internalizing disorders have expanded, including many different potential mechanisms, we harbor optimism that we will move the needle forward, toward better and more precise treatment matching.

Table 2 Reporting Recommendations for Neuroprediction Studies