Introduction

Major depressive disorder (MDD) is a highly prevalent and disabling condition [1, 2]. Established treatments for MDD include psychological interventions such as cognitive behavioral therapy (CBT) and interpersonal psychotherapy, antidepressant medications, and somatic non-pharmacological treatments including electroconvulsive therapy (ECT), repetitive transcranial magnetic stimulation (rTMS), and direct current stimulation. However, treatment outcomes for MDD patients are highly variable and have been shown to be influenced by variables including age [3], sex [4], disease duration, and symptom severity [5]. Outcomes are difficult to predict from clinical and demographic features, and approximately 30–50% of patients with MDD do not respond to first-line medication or psychotherapy [6]. Therefore, treatment selection often begins with a “trial and error” approach, with trials lasting weeks or months until an effective and well-tolerated treatment is found. Consequently, several studies have investigated the potential of pretreatment features to guide personalized medicine approaches that can speed optimal treatment selection and positive clinical outcomes.

Multiple features have been evaluated for predicting treatment outcomes, including clinical [7] and neuroimaging features [8]. A previous study demonstrated that pretreatment clinical features, such as the scores for apparent sadness, reported sadness, and inability to feel on the Montgomery-Åsberg Depression Rating Scale (MADRS), can successfully predict ECT treatment outcomes [9]. The pretreatment scores of the Beck Depression Inventory (BDI), neuroticism, extraversion, depression, anxiety, and stress can also predict treatment outcomes of rTMS [10]. Clinical features, including baseline symptom severity, suicidality, and appetite changes, and demographic features such as age, sex, and ethnicity were also significant predictors of antidepressant treatment outcomes [11]. However, the predictive accuracy using pretreatment clinical features varied from 44.3% to 94.3% (Table S6), and clinical heterogeneities between studies are important obstacles to the generalizability of diagnostic models.

Neuroimaging using magnetic resonance imaging (MRI) employs non-invasive techniques to evaluate brain anatomy and function whose predictive utility can be optimized using machine learning [12,13,14,15]. Numerous studies have predicted treatment outcomes of antidepressants, ECT, and other treatments using pretreatment brain structural and functional MRI features. For example, resting state functional connectivity (rsFC) between dorsolateral prefrontal cortex (dlPFC) and visual regions evaluated with resting state functional MRI (rsfMRI) has shown a predictive utility [16], as well as activation of rostral anterior cingulate cortex (ACC) in task-based fMRI (tbfMRI) [17]. Structural MRI (sMRI) studies have shown a predictive utility for gray matter volume (GMV) of hippocampal subfields [18] and cortical thickness (CTh) of supplementary motor area [19]. These features have shown varying levels of sensitivity (ranging from 0.74 to 0.84) and specificity (ranging from 0.67 to 0.97) in predicting treatment outcomes. The differences in sensitivity and specificity among studies may arise from variations in interventions, sample cohort features, small study samples, varying MRI modalities, different tasks in tbfMRI, acquisition parameters, and analysis methods. Given these factors and the urgent need for predictive features for MDD treatment, a meta-analysis of the literature is needed to advance progress in this field.

Prior work has assessed the use of neuroimaging features for treatment outcome prediction. A meta-analysis explored prediction based on electroencephalogram (EEG) and MRI, achieving an area under the curve (AUC) of 0.85 [8]. This meta-analysis did not separate neuroimaging techniques, limiting insights into the specific role of MRI features for MDD treatment outcomes. Moreover, a meta-analysis of brain MRI features used for outcome prediction in MDD reported an AUC of 0.84 [20]. However, the expanding literature, along with limitations and unresolved research questions of prior work, emphasizes the need for further investigation. For example, this meta-analysis omitted separate subgroup meta-analyses for fMRI and sMRI studies [20], preventing a detailed understanding of the performance of different MRI modalities. Moreover, for functional MRI studies, the potential differential predictive utility of rsfMRI and tbfMRI has not been systematically examined. Furthermore, no meta-analyses have compared the predictive potential of MRI and clinical features. Therefore, an updated meta-analysis and systematic review are warranted to comprehensively understand the performance of brain features in predicting treatment outcomes, drawing on the larger body of published studies now available.

The primary objective of the present meta-analysis was to evaluate the overall performance of clinical and brain MRI features for predicting treatment outcomes for MDD. A secondary objective was to explore the utility of different MRI modalities for predicting treatment outcomes and determine variations in predictive performance for different interventions. Our primary hypotheses were that: (a) MRI data would have greater performance for predicting treatment outcomes than clinical features in patients with MDD, and (b) the predictive performance in MRI studies would differ across imaging modalities and interventions.

Materials and Methods

Search strategy and selection criteria

Our study followed the Preferred Reporting Items for Systematic Reviews and Meta-analyses (PRISMA) reporting guideline [21] (Table S1) and was registered on PROSPERO (CRD42022376797). Three authors (FHL, YFC, and QZ) conducted a literature search in PubMed, Embase, Web of Science, and Science Direct databases from inception to July 22, 2023.

Inclusion criteria were used to select studies that: (a) included participants meeting Diagnostic and Statistical Manual of Mental Disorders (DSM) or International Classification of Diseases (ICD) criteria for MDD, (b) evaluated prediction of treatment outcome using pretreatment clinical (including severity ratings, duration, and demographics) or brain MRI data, and (c) provided a specific evidence-based clinical intervention such as antidepressants, ECT, or CBT. We excluded: (a) theoretical papers, case reports, reviews, and meta-analyses, (b) animal studies, (c) studies of samples with mean age younger than 18 or older than 65 years, and (d) studies for which confusion matrix data, which depict the differences between model predictions and actual outcomes using true/false positives/negatives, were not available even after contacting the authors. Details of the search strategy and selection criteria (Figure S1) and quality assessment and data extraction (Tables S2–S5) are provided in the Supplementary Methods.

Meta-analysis

Data analysis was conducted in R (version 4.2.1) using the mada [22] and glmnet [23] packages. First, we calculated I² as a measure of study heterogeneity and classified it as low, moderate, or high for I² values of < 50%, 50–75%, and > 75%, respectively [24]. If heterogeneity was low, a fixed effects model was employed to estimate the pooled effect, expressed here as the logarithm of the diagnostic odds ratio [log(DOR)]. Otherwise, a random effects model was employed. A log(DOR) greater than zero indicates that the predictive model can discriminate between responders/remitters and non-responders/non-remitters, and higher values indicate better predictive performance [25]. Second, bivariate analyses for sensitivity and specificity were implemented using the approach of Reitsma [26]. Because the pooled studies used different diagnostic thresholds (such as varying treatment response definitions), heterogeneity could arise in the estimates of sensitivity and specificity. To examine the potential impact of variable outcome thresholding approaches, we computed a summary receiver operating characteristic (SROC) curve that displays predictive utility across the range of potential thresholds. When the correlation coefficient (r) between sensitivity and false positive rate exceeds 0.6, it signifies a considerable threshold effect [27]. For the pooled meta-analysis and all subgroup meta-analyses in the present study, no considerable threshold effect was observed (all r < 0.6; r range, −0.08 to 0.52). Of note, an AUC of an SROC curve ranging from 0.75 to 0.92 is generally considered to indicate good predictive performance [28]. When the lower bound of the 95% confidence interval (CI) for the AUC exceeds 0.5 (chance performance), it signifies that the current prediction is significantly better than chance [29].
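As an illustration of the quantities described above, the following minimal Python sketch computes log(DOR) from a 2 × 2 table, derives I² from Cochran's Q, and pools study effects with fixed and random effects weights. It is not the code used in the analysis, which was run in R with mada. The function names, the 0.5 continuity correction for zero cells, and the DerSimonian-Laird estimator for between-study variance are assumptions made for this example.

```python
import math

def log_dor(tp, fp, fn, tn, correction=0.5):
    """Return (log diagnostic odds ratio, variance) for one 2x2 table."""
    # Assumption: a 0.5 continuity correction is applied only if a cell is zero.
    if 0 in (tp, fp, fn, tn):
        tp, fp, fn, tn = (cell + correction for cell in (tp, fp, fn, tn))
    estimate = math.log((tp * tn) / (fp * fn))
    variance = 1 / tp + 1 / fp + 1 / fn + 1 / tn
    return estimate, variance

def classify_i2(i2_percent):
    """Heterogeneity bands from the text: <50% low, 50-75% moderate, >75% high."""
    if i2_percent < 50:
        return "low"
    if i2_percent <= 75:
        return "moderate"
    return "high"

def pool_log_dor(effects, variances):
    """Inverse-variance pooling with I² and a DerSimonian-Laird random effects estimate."""
    weights = [1 / v for v in variances]
    fixed = sum(w * y for w, y in zip(weights, effects)) / sum(weights)
    # Cochran's Q: weighted squared deviations from the fixed effects estimate.
    q = sum(w * (y - fixed) ** 2 for w, y in zip(weights, effects))
    df = len(effects) - 1
    i2 = max(0.0, (q - df) / q) * 100 if q > 0 else 0.0
    # DerSimonian-Laird between-study variance (assumed estimator).
    c = sum(weights) - sum(w ** 2 for w in weights) / sum(weights)
    tau2 = max(0.0, (q - df) / c) if c > 0 else 0.0
    re_weights = [1 / (v + tau2) for v in variances]
    pooled = sum(w * y for w, y in zip(re_weights, effects)) / sum(re_weights)
    return {
        "fixed": fixed,
        "random": pooled,
        "se_random": math.sqrt(1 / sum(re_weights)),
        "I2": i2,
        "tau2": tau2,
        "heterogeneity": classify_i2(i2),
    }

if __name__ == "__main__":
    # Hypothetical 2x2 tables (tp, fp, fn, tn), not data from the included studies.
    tables = [(40, 10, 10, 40), (30, 12, 8, 25), (18, 9, 7, 20)]
    estimates = [log_dor(*t) for t in tables]
    print(pool_log_dor([e for e, _ in estimates], [v for _, v in estimates]))
```

A pooled log(DOR) whose confidence interval excludes zero would then be read as discrimination between responders/remitters and non-responders/non-remitters, as described above.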

We conducted separate pooled meta-analyses for prediction models based on clinical features, encompassing demographics and severity ratings (Hamilton Depression Rating Scale [HDRS] scores, Hamilton Anxiety Rating Scale [HARS] scores, illness duration, sleep disruption score, etc.), and models based on brain MRI features. Subsequently, MRI studies were categorized into subgroups based on modalities (rsfMRI, tbfMRI, sMRI, and diffusion tensor imaging [DTI]) and interventions (antidepressant, ECT, rTMS, and CBT). In studies utilizing multiple MRI modalities, those exclusively relying on features from a single modality to predict treatment outcomes were categorized within the corresponding modality subgroup. One study that combined features from multiple modalities to collectively predict treatment outcomes was classified under the multiple MRI modalities group [30]. The tbfMRI studies were categorized into two subsets based on the emotional and cognitive tasks employed. We subdivided the antidepressant studies into those using selective serotonin reuptake inhibitors (SSRIs) and those using other medications. Separate meta-analyses were performed in subgroups that included five or more original studies (n ≥ 5) to provide reasonable statistical power, noting the preliminary nature of such analyses [31]. If a study reported on both response and remission, or multiple studies used the same patient data, we calculated a weighted average of their 2 × 2 tables, with weights determined by sample size [20]. We used a systematic review to summarize the current state of the field with regard to less studied imaging modalities and intervention approaches.
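The sample-size-weighted averaging of 2 × 2 tables mentioned above can be sketched as follows. This is a hedged illustration rather than the procedure actually implemented: the function name is hypothetical, and taking each table's total count as its weight is an assumption, since the text states only that weights were determined by sample size.

```python
def weighted_average_table(tables):
    """Average several (tp, fp, fn, tn) tables, weighting each by its total count.

    Assumption: the weight of a table is the number of patients it contains.
    """
    sizes = [sum(table) for table in tables]
    total = sum(sizes)
    return tuple(
        sum(n * table[cell] for n, table in zip(sizes, tables)) / total
        for cell in range(4)
    )

if __name__ == "__main__":
    # Hypothetical response and remission tables from one overlapping sample.
    response = (30, 10, 8, 32)
    remission = (22, 9, 12, 37)
    print(weighted_average_table([response, remission]))
```

The averaged cells are generally non-integer, which is acceptable for the downstream log(DOR) and bivariate calculations because these operate on cell counts as continuous quantities.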

Next, we assessed publication bias using Deeks’ funnel plot asymmetry test, considering bias present if the slope coefficient differed significantly from zero (P < 0.05). Original diagnostic studies usually contain more negative samples (i.e., patients with poor treatment outcomes) than positive ones. This imbalance tends to inflate the calculated DOR above 1, potentially affecting the standard errors and increasing the likelihood of a false positive result when bias is assessed with Begg’s and Egger’s methods. Deeks’ test performs a linear regression of log(DOR) on precision (the reciprocal of the square root of the effective sample size) to investigate the relationship between effect size and effective sample size, and thereby evaluates the impact of publication bias. The effective sample size is more appropriate than the raw sample size for summarizing precision because of the unequal numbers of responders/remitters and non-responders/non-remitters in each study [32].
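The regression underlying Deeks’ test can be sketched in a few lines of Python. The sketch below assumes the commonly used form of the test, in which the effective sample size is 4·n₁·n₂/(n₁ + n₂) and the regression of log(DOR) on 1/√ESS is weighted by ESS. The function names are hypothetical, and the P value (from a t distribution with k − 2 degrees of freedom) is left to a statistics library.

```python
import math

def effective_sample_size(n_pos, n_neg):
    """Effective sample size for unequal group sizes: 4 * n1 * n2 / (n1 + n2)."""
    return 4 * n_pos * n_neg / (n_pos + n_neg)

def deeks_regression(log_dors, n_pos, n_neg):
    """Weighted least squares of log(DOR) on 1/sqrt(ESS), with ESS as weights.

    Requires at least three studies with differing effective sample sizes.
    """
    ess = [effective_sample_size(p, n) for p, n in zip(n_pos, n_neg)]
    x = [1 / math.sqrt(e) for e in ess]
    w_sum = sum(ess)
    x_bar = sum(w * xi for w, xi in zip(ess, x)) / w_sum
    y_bar = sum(w * yi for w, yi in zip(ess, log_dors)) / w_sum
    sxx = sum(w * (xi - x_bar) ** 2 for w, xi in zip(ess, x))
    sxy = sum(w * (xi - x_bar) * (yi - y_bar)
              for w, xi, yi in zip(ess, x, log_dors))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    # Weighted residual sum of squares and the standard error of the slope.
    rss = sum(w * (yi - intercept - slope * xi) ** 2
              for w, xi, yi in zip(ess, x, log_dors))
    dof = len(x) - 2
    se_slope = math.sqrt(rss / dof / sxx)
    t_stat = slope / se_slope if se_slope > 0 else float("nan")
    return {"slope": slope, "intercept": intercept,
            "se_slope": se_slope, "t": t_stat, "df": dof}

if __name__ == "__main__":
    # Hypothetical studies: log(DOR), responders, non-responders.
    print(deeks_regression([2.9, 2.4, 2.6, 2.1], [12, 25, 40, 90], [20, 30, 65, 110]))
```

A slope that does not differ significantly from zero corresponds to the absence of detectable funnel plot asymmetry.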

Then, bivariate meta-regression was first used to explore the effects of covariates (sample size, age, sex ratio, illness duration, HDRS score, and publication year) on the sensitivity and specificity of each meta-analysis [33]. Baseline BDI and MADRS scores were transformed to approximate HDRS categories using published transformation rules [34, 35]. The bivariate meta-regression was also used to test for differential effects between clinical and MRI studies, as well as between pairs of modality and intervention subgroups within MRI studies. The likelihood ratio test was conducted to determine the statistical significance of the differences in estimated variances of logit sensitivity and logit specificity before and after adding covariates (i.e., imaging modalities or interventions) in bivariate meta-regression [36]. For example, to assess the influence of interventions on predictive performance, the antidepressant subgroup was taken as a reference group, and an ECT subgroup was added as a covariate to analyze the statistical significance of changes in the estimated variance. Accordingly, we conducted bivariate meta-regression on the following comparisons: (a) clinical and MRI studies, (b) three different MRI modalities (rsfMRI, tbfMRI, and sMRI) to evaluate potential imaging modality variations and provide additional information for outcome prediction, (c) tbfMRI studies employing emotional and cognitive tasks to understand the impact of tasks on prediction, (d) antidepressant and ECT subgroups to assess their distinct predictive performance, and (e) SSRI-specific and ECT studies to contrast the performance of their predictive features. For multiple comparisons in the analyses among three different MRI modalities, we used the Bonferroni correction, and results were considered significant if the P value was less than 0.05/3 = 0.017.
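The likelihood ratio test and the Bonferroni threshold described above can be illustrated with a short Python sketch. It assumes that adding one binary covariate to the bivariate model introduces two parameters (one for logit sensitivity and one for logit specificity), so that the test statistic is referred to a chi-square distribution with 2 degrees of freedom, whose upper tail has the closed form exp(−x/2). The degrees of freedom and the function names are assumptions of this example.

```python
import math

def lrt_statistic(loglik_reduced, loglik_full):
    """Likelihood ratio statistic: twice the gain in log-likelihood."""
    return 2 * (loglik_full - loglik_reduced)

def lrt_p_value_df2(chi2_stat):
    """Upper-tail P value of a chi-square statistic with 2 degrees of freedom.

    For df = 2 the survival function is exactly exp(-x / 2).
    """
    if chi2_stat < 0:
        raise ValueError("chi-square statistic must be non-negative")
    return math.exp(-chi2_stat / 2)

def bonferroni_significant(p_value, n_comparisons, alpha=0.05):
    """True if the P value is below the Bonferroni-corrected threshold."""
    return p_value < alpha / n_comparisons

if __name__ == "__main__":
    # Hypothetical log-likelihoods for models without and with the covariate.
    stat = lrt_statistic(loglik_reduced=-105.0, loglik_full=-100.65)
    p = lrt_p_value_df2(stat)
    print(round(stat, 2), round(p, 3), bonferroni_significant(p, n_comparisons=3))
```

With three pairwise modality comparisons, the corrected threshold is 0.05/3, matching the value of 0.017 stated above.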

Finally, the elastic net algorithm was employed to construct a multivariate regression model for predicting log(DOR) in each study, investigating the impact of methodological and clinical variables on the prediction. Methodological variables mainly encompassed data types (e.g., rsfMRI data), validation status, magnetic field strength, and prediction methods (e.g., support vector machine [SVM]). Clinical variables included sample size, age, intervention type, etc. To address missing values in variables, we utilized the k-nearest neighbors algorithm for imputation. We utilized nested cross-validation (CV), with 10-fold CV in the inner loop and leave-one-out CV in the outer loop, to select alpha and lambda values that minimized the root mean squared error. Correlation analysis was conducted between the predicted log(DOR) and true log(DOR) to assess the reliability of the predictive model.
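The nested CV scheme described above (leave-one-out outer loop, 10-fold inner loop, hyperparameters chosen by minimum root mean squared error) can be sketched generically in Python. This is a structural illustration only: the analysis itself used the elastic net from glmnet, whereas the model here, `shrunken_mean`, is a hypothetical stand-in with a single tuning parameter, and Pearson's r is assumed as the correlation measure.

```python
import math

def k_folds(indices, k):
    """Split indices into at most k interleaved folds."""
    k = min(k, len(indices))
    return [indices[i::k] for i in range(k)]

def rmse(predicted, observed):
    return math.sqrt(sum((p - o) ** 2 for p, o in zip(predicted, observed)) / len(observed))

def pearson_r(a, b):
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)

def nested_cv_predict(X, y, grid, fit_predict, inner_k=10):
    """Leave-one-out outer loop; the inner k-fold loop picks hyperparameters by RMSE."""
    n = len(y)
    predictions = []
    for held_out in range(n):
        train = [i for i in range(n) if i != held_out]
        best_params, best_err = None, math.inf
        for params in grid:
            preds, obs = [], []
            for fold in k_folds(train, inner_k):
                fit_idx = [i for i in train if i not in fold]
                preds += fit_predict([X[i] for i in fit_idx], [y[i] for i in fit_idx],
                                     [X[i] for i in fold], params)
                obs += [y[i] for i in fold]
            err = rmse(preds, obs)
            if err < best_err:
                best_params, best_err = params, err
        # Refit on all outer-training studies with the selected hyperparameters.
        predictions.append(fit_predict([X[i] for i in train], [y[i] for i in train],
                                       [X[held_out]], best_params)[0])
    return predictions

def shrunken_mean(train_X, train_y, test_X, params):
    """Hypothetical stand-in model: ignores X and predicts a shrunken training mean."""
    mean = sum(train_y) / len(train_y)
    return [mean * (1 - params["shrink"])] * len(test_X)

if __name__ == "__main__":
    # Hypothetical per-study log(DOR) values and one placeholder feature.
    y = [2.1, 2.8, 1.9, 3.0, 2.5, 2.2, 2.7]
    X = [[0.0] for _ in y]
    grid = [{"shrink": s} for s in (0.0, 0.1, 0.3)]
    print(nested_cv_predict(X, y, grid, shrunken_mean))
```

Because each held-out study is predicted by a model that never saw it during tuning or fitting, the correlation between these out-of-sample predictions and the true log(DOR), computed here with `pearson_r`, is not inflated by hyperparameter selection.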

Results

Characteristics of included studies

We included 13 studies that used clinical features to predict treatment outcomes, covering 4301 patients (mean age, 45.1 years; male/female, 1753/2548), and 44 MRI studies that recruited 2623 patients (mean age, 38.2 years; male/female, 1109/1514) (Tables 1 and S6). Within MRI studies, 19 rsfMRI, 13 tbfMRI, and 10 sMRI studies were included in modality subgroups (Table S7); 27 MRI studies utilized antidepressants and nine utilized ECT, and these were included in intervention subgroup analyses (Table S8). Detailed characteristics of included studies are provided in the Supplementary Results.

Table 1 Characteristics of all included studies in the present meta-analysis.

Pooled meta-analysis

Due to the high heterogeneity observed with the fixed effects model, a random effects model was employed for the present study. The overall log(DOR) of clinical studies for treatment outcome prediction was 1.62 (95% CI 1.16–2.09; Fig. 1). The AUC of the SROC curve was 0.73 (95% CI 0.67–0.81; Fig. 2), sensitivity was 0.62 (95% CI 0.48–0.74), and specificity was 0.76 (95% CI 0.64–0.85). No covariates were found to affect sensitivity or specificity (P > 0.05). Low heterogeneity was observed among studies (I² = 42.4%). Deeks’ funnel plot asymmetry test did not reveal significant publication bias in the included studies (beta = 0.008, P = 0.51; Fig. S2). No significant correlation was observed between the predicted log(DOR) and true log(DOR) in clinical studies (r = 0.12, P = 0.71; Fig. S3).

Fig. 1: Overall random effects model forest plot of the logarithm of diagnostic odds ratios in clinical and MRI studies.

CI confidence interval; log(DOR), the logarithm of diagnostic odds ratios. Notes: * represents data after weighted averaging of studies that used repeated samples. ISPOT-D included six studies utilizing the international study to predict optimized treatment in depression data.

Fig. 2: Summary receiver operator characteristic (SROC) curve within clinical and MRI studies.

AUC Area under the curve, CI Confidence interval; Conf. region, the region of confidence interval, MRI Magnetic resonance imaging.

The pooled meta-analysis of all included MRI studies revealed an overall log(DOR) of 2.53 (95% CI 2.22–2.84; Fig. 1). The AUC of the SROC curve was 0.89 (95% CI 0.87–0.91; Fig. 2), indicating performance better than chance. Sensitivity was 0.78 (95% CI 0.75–0.81), and specificity was 0.75 (95% CI 0.71–0.79). No covariates had a significant impact on overall sensitivity and specificity (P > 0.05). There was no evidence of heterogeneity observed among studies (Fig. 1). Deeks’ funnel plot asymmetry test did not demonstrate significant publication bias in the included studies (beta = 0.001, P = 0.93; Fig. S2). In the meta-regression comparing clinical and MRI studies, we identified significant differences in predicting treatment outcomes (Chi² = 6.53, P = 0.03), with the MRI studies exhibiting higher sensitivity (Z = 3.42, P = 0.001).

For the pooled MRI studies, we employed the elastic net algorithm with an average of alpha = 0.5 and lambda = 0.21 across all CV-folds. The predicted log(DOR) showed a significant correlation with the true log(DOR) (r = 0.39, P = 0.02). Six variables were identified based on the absolute value of their coefficients. Specifically, “data: tbfMRI”, “scanner: 1.5 T”, and “sample size” were linked to lower log(DOR), while “data: rsfMRI”, “method: ROC curve analysis”, and “illness duration” demonstrated associations with higher log(DOR) (Fig. 3).

Fig. 3: Results of multiple regression model in MRI group by the elastic net algorithm predicting log(DOR) of individual studies.

a Nineteen methodological and clinical variables were included in the elastic net algorithm. b The predicted log(DOR) was significantly correlated with true log(DOR) (r = 0.39, P = 0.02). c Six variables with non-zero coefficients were important predictors for log(DOR) prediction, ranked by the absolute value of their coefficients from the lowest to the highest. CV Cross validation, ECT Electroconvulsive therapy, HDRS Hamilton depression rating scale, log(DOR), the logarithm of diagnostic odds ratios; MRI Magnetic resonance imaging, ROC receiver operating characteristic, rs/tbfMRI resting-state/task-based functional magnetic resonance imaging; sMRI, structural magnetic resonance imaging; SVM support vector machine. Note: - indicates that the relevant information or data were not reported.

Modality subgroup outcomes in MRI studies

Meta-analysis

The rsfMRI subgroup consisted of a total of 1130 patients (mean age, 40.7 years; male/female, 440/690). The outcome prediction model had a log(DOR) of 2.74 (95% CI 2.39–3.08), sensitivity of 0.80 (95% CI 0.75–0.84), specificity of 0.79 (95% CI 0.75–0.82), and an AUC of 0.90 (95% CI 0.87–0.93).

For tbfMRI studies, which included 891 participants (mean age, 34.2 years; male/female, 394/497), the log(DOR) was 2.00 (95% CI 1.28–2.72), sensitivity was 0.74 (95% CI 0.67–0.81), specificity was 0.69 (95% CI 0.56–0.79), and AUC was 0.85 (95% CI 0.78–0.92). Regarding the different tasks employed in tbfMRI, the log(DOR) of the emotional task subgroup was 1.78 (95% CI 0.91–2.65), and sensitivity, specificity, and AUC were 0.77 (95% CI 0.68–0.84), 0.63 (95% CI 0.48–0.75), and 0.84 (95% CI 0.73–0.99), respectively. The cognitive task subgroup had a log(DOR) of 2.35 (95% CI 1.69–3.02), sensitivity of 0.77 (95% CI 0.69–0.84), specificity of 0.73 (95% CI 0.63–0.81), and AUC of 0.88 (95% CI 0.80–0.97).

The sMRI studies included 347 patients (mean age, 39.5 years; male/female, 146/201). The log(DOR) was 2.63 (95% CI 1.96–3.30), and sensitivity, specificity, and AUC were 0.79 (95% CI 0.71–0.86), 0.73 (95% CI 0.63–0.81), and 0.91 (95% CI 0.86–0.96), respectively (Fig. S4a and Table S9).

The heterogeneity test showed low heterogeneity in the rsfMRI subgroup (I² = 1.57%), and no evidence of heterogeneity was observed in the tbfMRI and sMRI subgroups (Fig. S6). No publication bias was found in any of these three subgroup analyses (Fig. S5a). Using meta-regression analysis, we found a significant difference in the sensitivity and specificity of outcomes predicted by the rsfMRI and tbfMRI subgroups (Chi² = 8.70, uncorrected P = 0.013, which survived Bonferroni correction). Specifically, while the sensitivity was similar (Z = −1.13, P = 0.26), rsfMRI showed higher specificity than tbfMRI (Z = −2.86, P = 0.004). There was no significant difference in sensitivity and specificity in prediction of treatment outcome between sMRI and rsfMRI subgroups (Chi² = 1.00, P = 0.61), between sMRI and tbfMRI subgroups (Chi² = 1.70, P = 0.43), or between emotional and cognitive tbfMRI (Chi² = 1.61, P = 0.45). Furthermore, we found that HDRS score was a significant covariate influencing the prediction in the emotional task subgroup (Chi² = 6.34, P = 0.04), with a negative impact on its specificity (Z = −2.88, P = 0.004).

Predictive brain features in models including all studies using each type of MRI protocol

Analysis of features selected for outcome predictions indicated that predictive brain regions were predominantly located within the limbic and default mode networks (DMN) for both rsfMRI and tbfMRI studies. The rsfMRI features included rsFC between ACC and middle frontal gyrus [37], amygdala [38], and dlPFC [39], as well as between medial PFC and posterior cingulate cortex (PCC) [40]. Predictive features for tbfMRI included task-based FC between limbic and somatomotor networks [41], as well as within DMN [42]. Activation of ACC [43, 44] and precuneus [45, 46] in tbfMRI studies also contributed to prediction. The sMRI predictive features for all treatments predominantly included brain regions within the limbic network rather than the DMN, including GMV of hippocampus [18, 47], GM density of ACC [48], and CTh of hippocampus [49] (Table S10, Fig. 4).

Fig. 4: Summary schematic representation of brain MRI features predicting treatment outcomes of major depressive disorder.

Illustrated are all predictive brain MRI features and features for different MRI modalities and interventions. Node color illustrates the brain networks the nodes belong to. Yellow/white edges represent resting-state/task-based FC. ACT activation, aMCC anterior midcingulate cortex, AMYG amygdala, CTh cortical thickness, d/pg/sgACC dorsal/pregenual/subgenual anterior cingulate cortex, dl/dm/vmPFC dorsolateral/dorsomedial/ventromedial prefrontal cortex, ECT electroconvulsive therapy, FC functional connectivity, GMD/V gray matter density/volume, HIP hippocampus, I/MFG inferior/middle frontal gyrus, IPL inferior parietal lobule, L left, NAcc nucleus accumbens, PCC posterior cingulate cortex, PCUN precuneus, PHG parahippocampal gyrus, Po/PreCG post/precentral gyrus, R right, rs/tbfMRI resting-state/task-based functional magnetic resonance imaging, STG superior temporal gyrus, sMRI structural magnetic resonance imaging.

Intervention subgroup outcomes in MRI studies

Meta-analysis

Studies including 1700 patients using antidepressants (mean age, 35.5 years; male/female, 736/964) showed a log(DOR) of 2.48 (95% CI 2.05–2.91), sensitivity of 0.78 (95% CI 0.74–0.82), specificity of 0.74 (95% CI 0.68–0.80), and an AUC of SROC of 0.89 (95% CI 0.86–0.92). The log(DOR), sensitivity, specificity, and AUC for studies with patients only administered SSRIs were 2.68 (95% CI 1.81–3.56), 0.79 (95% CI 0.72–0.84), 0.75 (95% CI 0.61–0.85), and 0.91 (95% CI 0.86–0.96), respectively. ECT studies included 395 participants (mean age, 44.9 years; male/female, 155/240). The log(DOR) for ECT studies was 2.56 (95% CI 1.90–3.22), sensitivity and specificity were 0.83 (95% CI 0.69–0.91) and 0.74 (95% CI 0.65–0.82), and AUC was 0.89 (95% CI 0.80–1.00; Fig. S4b and Table S9).

No publication bias or evidence of heterogeneity was found in any intervention subgroup (Figs. S5b and S7). Meta-regression showed no significant differences in sensitivity and specificity between antidepressant (SSRI and other antidepressant studies combined) and ECT subgroups (combined across imaging modalities) (Chi² = 0.98, P = 0.61), or between SSRI and ECT subgroups (Chi² = 0.10, P = 0.95). We observed that sample size significantly affected the predictive efficacy for ECT treatment outcomes (Chi² = 7.98, P = 0.02), negatively influencing sensitivity (Z = −3.50, P < 0.001).

Predictive brain features

Through a systematic review, we found that features for antidepressants, including SSRIs examined separately, were distributed in the limbic network and DMN. Predictive features included rsFC between hippocampus and angular gyrus [50], and between ACC and supplementary motor area [51]. The task-based FC between DMN and somatomotor networks [42], and activation of medial PFC [40], were also significant predictors. In terms of ECT, features related to treatment outcome were mainly found in the limbic network, including rsFC between ACC and dlPFC [39], GMV of subgenual ACC [52], as well as CTh of hippocampus [49] (Table S11, Fig. 4).

Discussion

Findings of the present meta-analysis highlight the potential of utilizing pretreatment brain MRI data to predict treatment outcomes for MDD patients, outperforming clinical features. Pretreatment alterations in functional and structural brain features may explain in part the wide heterogeneity of clinical response to antidepressant therapies. In imaging modality subgroups, rsfMRI outperformed tbfMRI in specificity, more accurately identifying true negatives (i.e., non-responders and non-remitters) among patients. Outcome prediction of sMRI features did not differ significantly from either of the two fMRI modalities. No significant differences were found among the different intervention subgroups in accuracy of outcome prediction. Although outcome prediction features mainly involved DMN and limbic networks, predictive neuroimaging features differed somewhat among modality and intervention subgroups.

Overall prediction performance

Our meta-analysis utilized brain MRI data to predict treatment outcomes and revealed a superior predictive performance compared to that reported in a meta-analysis of EEG data, with an AUC of 0.89 versus 0.76 [53]. Our overall predictive performance is similar to the findings reported by Lee et al. for EEG and MRI-based predictive markers (AUC: 0.89 vs. 0.85) [8]. Furthermore, we observed comparable overall performance to another meta-analysis (AUC: 0.89 vs. 0.84) [20], further emphasizing the robust potential of MRI data in predicting treatment outcomes in MDD patients. In addition, by leveraging more recent studies to increase the available sample, we were able to compare different imaging modalities and treatment types. In terms of comparisons of MRI modalities, we observed that rsfMRI outperformed tbfMRI in predicting treatment outcomes. We additionally utilized a multivariate regression model to identify factors associated with predictive accuracy. The results further support our findings, indicating an association between rsfMRI data and higher log(DOR), and between tbfMRI data and lower log(DOR).

Pretreatment clinical features are readily accessible for patients, but the performance of models based on them in predicting MDD treatment outcomes is limited. This may be attributed to the diversity of clinical features introducing heterogeneity. Although all clinical studies included scores of symptom severity scales, the feature sets for prediction were not uniform (e.g., not all studies included comorbidities or illness duration). A prior study revealed that combining pretreatment clinical features with early response factors (i.e., severity ratings at two weeks after medication administration) significantly improves treatment outcome prediction, raising specificity from 30% to 90% [54]. Furthermore, integrating clinical features with patient genetics, metabolomics, and other features is poised to further enhance prediction accuracy [55, 56]. Collectively, our study highlights the superior suitability of pretreatment brain MRI relative to clinical features for predicting treatment outcomes. It is worth noting, however, that it was not possible to directly compare clinical and MRI features for prediction in the same sample, and this finding warrants further validation in future studies.

As key brain networks that have been implicated in MDD, the limbic network and DMN play crucial roles in various cognitive processes, reward regulation, and emotion homeostasis [57, 58]. Previous studies have suggested that MDD patients exhibit decreased activation in the limbic network, which has been hypothesized to be related to abnormal reward-related behaviors and reduced hedonic tone [59]. Furthermore, reduced rsFC within DMN can disrupt patients’ ability to disengage from internal emotions and cognitive processes, impairing their ability to focus on external tasks and experiences [57]. Our meta-analysis demonstrated the predictive power of alterations in these networks for treatment outcomes, showing good predictive performance with AUCs ranging from 0.84 to 0.91 depending on the MRI modality and intervention subgroup. Previous studies have reported that treatment-resistant patients with MDD demonstrated decreased rsFC and regional homogeneity within DMN [42, 60,61,62], and reduced rsFC between DMN and other brain regions [51, 63], as compared to successfully treated individuals. In parallel, fractional amplitude of low-frequency fluctuations and rsFC within the limbic network [38, 64], along with rsFC between the limbic network and other brain regions [65], were found to be greater in treatment-resistant patients. These findings are reflected in aggregate in our meta-analysis findings.

Some factors influenced the prediction of treatment outcomes using brain MRI data. Previous studies demonstrated that longer MDD duration is associated with reduced GMV in hippocampus and ACC [66], along with weakened FC between ACC and DMN [67]. This is consistent with the crucial regions in DMN and the limbic network that we systematically identified for prediction, and with our finding that MRI features yield a higher predictive accuracy for patients with a longer illness duration. Moreover, the negative correlation between sample size and predictive accuracy may be influenced by the clinical and neurophysiological heterogeneity in MDD patients [68]. Regarding methodology, 1.5 T MRI data have a low signal-to-noise ratio and spatial resolution [69], which might contribute to the observed low predictive accuracy in our results. Lastly, prediction using feature-based ROC curve analysis, which typically selects MRI features with optimal performance, exhibited high predictive accuracy. However, as the performance of this method does not stem from its learning capabilities, our findings do not imply its superiority over other machine learning methods.

Varied MRI modalities

The superior predictive utility of rsfMRI relative to tbfMRI reflects the studies published to date. While methods across studies can vary in terms of how measurements are performed, for tbfMRI there is an additional major consideration of what task is being performed during scans. While many studies using a range of tasks were examined, the relative utility of different tasks is unknown. Other methodological features for all fMRI include factors such as the focus on regional activity vs. FC between pairs of regions vs. whole brain connectome analysis using graph theory approaches [70]. Sufficient evidence does not yet exist to clarify which of these approaches has the best utility considered individually or in combination. Further, a previous study suggested that increased severity of MDD correlates with reduced hippocampus and amygdala activation during emotional processing [71]. Patients with severe symptoms and varied treatment outcomes may exhibit comparably low baseline activation in these regions, which is consistent with our findings and might be one possible explanation for the negative impact of heightened symptom severity on specificity when using emotional tbfMRI for treatment outcome prediction.

Similar methodological issues exist for anatomic imaging, as cortical thickness, surface area, gyral features, and volume measurements can all provide somewhat independent information. In our meta-analysis, sMRI predictors of treatment outcome were more commonly observed within the limbic network than within the DMN, although from a broader perspective we did not observe significant differences in predictive performance between sMRI and the two fMRI modalities. The differences in findings across MRI modalities are of mechanistic interest for MDD, as local anatomic alterations can induce functional alterations in the regions to which they are connected. They also support the need for more multimodal imaging studies [72, 73, 74, 75, 76] that use diverse MRI data in an integrated way, which may improve both the prediction and the understanding of treatment response in MDD.

Different interventions

Antidepressant medications are commonly used to treat depression. They aid in restoring the functionality of brain networks by modulating the interaction of neurotransmitter systems [77]. ECT is a stimulation therapy that produces therapeutic effects by directly stimulating neural activity. Animal models have shown that repeated electrical stimulation induces neurogenesis, synaptogenesis, and synaptic plasticity in the brain [78]. In the intervention subgroup meta-regression, despite the different mechanisms of antidepressants and ECT, similar sensitivity, specificity, and AUC were observed in treatment outcome prediction using overlapping brain features. Our study demonstrated that the brain features predicting outcomes of antidepressant therapy were primarily distributed in the limbic network and DMN, while those for ECT were more specific to the limbic network. Although FC of the PCC and precuneus contributed to outcome prediction for both antidepressants and ECT, structural characteristics and functional activation within the DMN were exclusive to predicting antidepressant outcomes. Interpretation of this similarity is also complicated by the fact that many patients receiving ECT are also administered antidepressants before and during ECT treatment.

While ECT is widely recognized for its high efficacy in treating therapy-resistant patients, it is not a first-line treatment because of its costs, adverse effects on memory, and stigma [79]. In this context, identifying brain features that predict a poor response to antidepressants and a more positive outcome with ECT would be clinically advantageous: for treatment-nonresponsive patients, such features would suggest a need for ECT rather than a trial of a different antidepressant medication.

Limitations

Certain limitations should be noted in interpreting the present findings. First, the confusion matrices essential for our primary analysis were unavailable in some included studies even after we contacted the authors. To include more studies in our analysis, we estimated these matrices from scatter plots of all participants and high-resolution ROC curves [44, 50, 80, 81]. Second, two of the thirteen clinical studies included patients with bipolar depression (at a proportion of no more than 15%) [82, 83]. The inclusion of these patients may affect the predictive performance for pure MDD patients and also the comparisons between clinical and MRI studies. Further, many clinical studies (more than 20 published to date based on our rough search, such as [84, 85]) used early response (i.e., clinical ratings after two or four weeks of treatment) rather than pretreatment baseline clinical ratings as input features to predict treatment response in MDD. To be consistent with the inclusion and exclusion criteria of the MRI studies, and to compare clinical predictive results with MRI results obtained from pretreatment brain MRI features, we included only the 13 clinical studies we found that used pretreatment clinical ratings as predictors. Given this imbalance between the numbers of clinical and MRI studies, more studies using pretreatment clinical features are needed to validate the present results. Third, the imbalance in patient numbers between the fMRI and sMRI subgroups, as well as between the antidepressant and ECT subgroups, similarly limits the ability to draw firm conclusions about differences in predictive performance among treatment and imaging modality subgroups. Additionally, given the influence of sample size on sensitivity in the ECT subgroup, caution is warranted in interpreting our results. Fourth, subgroup analyses of rTMS, CBT, and DTI were not performed because the number of included studies (n ≤ 4) was insufficient to analyze and draw a robust SROC. Fifth, although the included studies defined treatment response and remission based on previous research or professional consensus, definitions still vary because different rating scales were used, which affects the ability to model binarized outcome prediction [86]. Sixth, we did not require validation as an inclusion criterion, in order to ensure a larger sample size and thereby allow a comprehensive and robust analysis with enhanced statistical power and reliability. Most included studies utilized separate datasets for training, testing, and validating predictive models. Only two studies employed independent external validation datasets [43, 87]. While we were able to include more studies in our more up-to-date review of prior work, limitations in the available literature may lead to some degree of overfitting that would inflate estimated model utility. Finally, the current meta-analysis combined studies employing varying modalities and interventions, and validation was not universal across all included studies. Consequently, the applicability of the current model to clinical practice remains limited.
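
To make the first limitation above concrete, the following is a minimal sketch of how a 2 × 2 confusion matrix could be approximated when a study reports only an operating point (sensitivity and specificity, for example as read from a published ROC curve) together with group sizes. It is an illustration under stated assumptions and not necessarily the procedure used in this meta-analysis: the function name, the argument names, and the choice of which outcome counts as "positive" are all hypothetical, and rounding to whole patients introduces a small approximation error.

```python
from typing import NamedTuple


class ConfusionMatrix(NamedTuple):
    tp: int  # positives correctly predicted
    fn: int  # positives missed
    fp: int  # negatives wrongly predicted as positive
    tn: int  # negatives correctly predicted


def estimate_confusion_matrix(
    sensitivity: float,
    specificity: float,
    n_positive: int,
    n_negative: int,
) -> ConfusionMatrix:
    """Approximate a 2 x 2 table from one ROC operating point.

    Assumption: the 'positive' class is whichever outcome the study
    treats as its prediction target (e.g., responders).
    """
    if not (0.0 <= sensitivity <= 1.0 and 0.0 <= specificity <= 1.0):
        raise ValueError("sensitivity and specificity must lie in [0, 1]")
    if n_positive < 0 or n_negative < 0:
        raise ValueError("group sizes must be non-negative")

    # Counts must be whole patients, so round to the nearest integer.
    tp = round(sensitivity * n_positive)
    tn = round(specificity * n_negative)
    return ConfusionMatrix(tp=tp, fn=n_positive - tp, fp=n_negative - tn, tn=tn)


# Hypothetical study: 20 responders, 24 non-responders,
# sensitivity 0.80 and specificity 0.75 read from its ROC curve.
example = estimate_confusion_matrix(0.80, 0.75, n_positive=20, n_negative=24)
```

Because the reconstructed counts depend on how precisely the operating point can be read from a figure, estimates of this kind are best treated as approximations and checked in a sensitivity analysis.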

Conclusion

The present findings indicate that pretreatment brain MRI features outperformed clinical characteristics in predicting short-term treatment outcomes in patients with MDD. The observed differences between rsfMRI and tbfMRI are also noteworthy: rsfMRI biomarkers had higher accuracy than tbfMRI biomarkers in predicting non-responders/non-remitters. These findings may be helpful in the early identification of patients who may not benefit from treatment, potentially aiding clinicians in considering alternative treatment options. Additional research is required to validate and expand upon these findings, particularly in exploring the predictive capabilities of specific MRI modalities and specific interventions.