1 Introduction

Major depressive disorder (MDD) has a lifetime prevalence of 6–21% worldwide and is a major cause of disability in adults [12]. Although half of MDD cases are treated with medication, there are dozens of antidepressants available and a patient’s response to each is highly unpredictable [7]. The current standard of care entails a long trial-and-error process in which a patient tries a series of different antidepressants. The patient must test each drug for up to 3 months, and if satisfactory symptomatic improvement is not achieved within this time, the clinician modifies the dosage or selects a different drug to test next. This trial-and-error process may take months to years to find the optimal treatment, during which patients suffer continued debilitation, including worsening symptoms, social impairment, loss of employment or marriage, and suicidal ideation. It has been shown that 30–40% of patients do not find adequate treatment even after a year or more of drug trials [19, 22]. Consequently, a predictive tool that helps prioritize the selection of antidepressants best suited to each patient would have high clinical impact.

This work demonstrates the use of deep learning and pretreatment task-based fMRI to predict long-term response to bupropion, a widely used antidepressant with a response rate of 44% [15]. An accurate screening tool that distinguishes bupropion responders from non-responders using pretreatment imaging would reduce morbidity and unnecessary treatment for non-responders and prioritize the early administration of bupropion for responders.

The use of functional magnetic resonance imaging (fMRI) measurements to infer quantitative estimates of bupropion response is motivated by evidence for an association between fMRI and antidepressant response. For example, resting-state activity in the anterior cingulate cortex, as well as activity evoked by reward processing tasks in the anterior cingulate cortex and amygdala, has been associated with antidepressant response [13, 16, 17].

In this work, predictive models of individual response to bupropion treatment are built using deep learning and pretreatment, task-based fMRI from a cohort of MDD subjects. The novel contributions of this work are: (1) the first tool for accurately predicting long-term bupropion response, and (2) the use of an unbiased neural architecture search (NAS) to identify the best-performing model and brain parcellation from 800 distinct model architecture and parcellation combinations.

2 Methods

2.1 Materials

Data for this analysis come from the EMBARC clinical trial [23] and comprise 37 subjects who were imaged with fMRI at baseline and then completed an 8-week trial of bupropion XL. To track symptomatic outcomes, the 52-point Hamilton Rating Scale for Depression (HAMD) was administered at baseline and at week 8 of antidepressant treatment. Higher HAMD scores indicate greater MDD severity. Quantitative treatment response for each subject was defined as \(\varDelta \text {HAMD}=\text {HAMD}(\text {week 8}) - \text {HAMD}(\text {baseline})\), where a negative \(\varDelta \text {HAMD}\) indicates improvement in symptoms. The mean \(\varDelta \text {HAMD}\) for these subjects was \(-5.98 \pm 6.25\), indicating large variability in individual treatment outcomes. For comparison, placebo-treated subjects in this study exhibited a mean \(\varDelta \text {HAMD}\) of \(-6.70 \pm 6.93\).

Image Acquisition. Subjects were imaged with resting-state and task-based fMRI (gradient echo-planar imaging at 3T, TR of 2000 ms, \(64 \times 64 \times 39\) image dimensions, and \(3.2 \times 3.2 \times 3.1\) mm voxel dimensions). Resting-state fMRI was acquired for 6 min. Task-based fMRI was acquired immediately afterwards for 8 min during a well-validated block-design reward processing task assessing reactivity to reward and punishment [8, 11]. In this task, subjects must guess in the response phase whether an upcoming number will be higher or lower than 5. They are then informed in the anticipation phase whether the trial is a “possible win”, in which they receive a $1 reward for a correct guess and no punishment for an incorrect guess, or a “possible loss”, in which they lose $0.50 for an incorrect guess and receive no reward for a correct guess. In the outcome phase, they are presented with the number and the outcome of the trial.

2.2 Image Preprocessing

Both resting-state and task-based fMRI images were preprocessed as follows. Frame-to-frame head motion was estimated and corrected with FSL MCFLIRT, and frames where the norm of the fitted head motion parameters was \({>}1\) mm or the intensity Z-score was \({>}3\) were marked as outliers. Images were then skull-stripped using a combination of FSL BET and AFNI Automask. To perform spatial normalization, fMRI images were registered directly to an MNI EPI template using ANTs. This coregistration approach has been shown to better correct for nonlinear distortions in EPI acquisitions compared to T1-based coregistration [2, 6]. Finally, the images were smoothed with a 6 mm Gaussian filter.
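For illustration, the outlier-frame criterion can be expressed in a few lines of NumPy. This is a minimal sketch assuming the six fitted motion parameters and the mean per-frame intensity have already been loaded into arrays; the function name and inputs are hypothetical, and treating the rotation parameters as millimeter-equivalent values is a simplification.

```python
import numpy as np

def flag_outlier_frames(motion_params, frame_intensity,
                        motion_thresh_mm=1.0, z_thresh=3.0):
    """Flag frames with excessive head motion or aberrant global intensity.

    motion_params   : (n_frames, 6) fitted motion parameters (mm-equivalent)
    frame_intensity : (n_frames,) mean whole-brain intensity per frame
    Returns a boolean mask marking outlier frames.
    """
    # Norm of the motion parameter vector for each frame (> 1 mm flags an outlier)
    motion_norm = np.linalg.norm(motion_params, axis=1)

    # Z-score of the global intensity time course (|Z| > 3 flags an outlier)
    z = (frame_intensity - frame_intensity.mean()) / frame_intensity.std()

    return (motion_norm > motion_thresh_mm) | (np.abs(z) > z_thresh)
```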

Predictive features were extracted from the preprocessed task-based fMRI images in the form of contrast maps (i.e., spatial maps of task-related neuronal activity). Each task-based fMRI image was fit with a general linear model,

$$\begin{aligned} \varvec{Y} = \varvec{X} \varvec{\beta } + \varvec{\epsilon } \end{aligned}$$

where \(\varvec{Y}\) is the time \(\times \) voxels matrix of BOLD signals, \(\varvec{X}\) is the time \(\times \) regressors design matrix, \(\varvec{\beta }\) is the regressors \(\times \) voxels parameter matrix, and \(\varvec{\epsilon }\) is the residual error. The model was fit using SPM12. The design matrix \(\varvec{X}\) was defined as described in [11] and included regressors for the response, anticipation, outcome, and inter-trial phases of the task paradigm. A reward expectancy regressor was also included, with values of \(+0.5\) during the anticipation phase of “possible win” trials and \(-0.25\) during the anticipation phase of “possible loss” trials; these values correspond to the expected value of the monetary reward/punishment in each trial. In addition to these task-related regressors and their first temporal derivatives, the head motion parameters and outlier frames were included as nuisance regressors in \(\varvec{X}\).

After fitting the general linear model, contrast maps for anticipation (\( \varvec{C}_{antic}\)) and reward expectation (\( \varvec{C}_{re}\)) were computed from the fitted \(\varvec{\beta }\) coefficients:

$$\begin{aligned} \varvec{C}_{antic}&= \varvec{\beta }_{\text {anticipation}} - \varvec{\beta }_{\text {inter-trial}} \\ \varvec{C}_{re}&= \varvec{\beta }_{\text {reward expectation}} \end{aligned}$$
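For intuition, the GLM fit and the two contrasts can be sketched with ordinary least squares in NumPy. This is a simplified sketch: the regressor names and design-matrix construction are assumed, and SPM12 additionally performs HRF convolution, prewhitening, and high-pass filtering that are omitted here.

```python
import numpy as np

def fit_glm_and_contrasts(Y, X, regressor_names):
    """Y: (n_timepoints, n_voxels) BOLD data; X: (n_timepoints, n_regressors) design matrix.
    regressor_names gives the column order of X (assumed for illustration)."""
    # Ordinary least-squares estimate of beta: (n_regressors, n_voxels)
    beta, _, _, _ = np.linalg.lstsq(X, Y, rcond=None)

    idx = {name: i for i, name in enumerate(regressor_names)}

    # C_antic = beta_anticipation - beta_inter-trial
    c_antic = beta[idx["anticipation"]] - beta[idx["inter_trial"]]
    # C_re = beta_reward_expectation
    c_re = beta[idx["reward_expectation"]]
    return c_antic, c_re
```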

To extract region-based features from these contrast maps, three custom, study-specific brain parcellations (later referred to as ss100, ss200 and ss400) were generated with 100, 200, and 400 regions-of-interest (ROIs) from the resting-state fMRI data using a spectral clustering method [5]. Each parcellation was then used to extract mean contrast values per ROI. The performance achieved with each of these custom parcellations, as well as a canonical functional atlas generated from healthy subjects (Schaefer 2018, 100 ROIs) [20], is compared in the following experiments.
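A minimal sketch of the ROI feature extraction follows, assuming a recent version of nilearn is used and that a subject's contrast maps and the parcellation are available as NIfTI images (file names are placeholders):

```python
import numpy as np
from nilearn.maskers import NiftiLabelsMasker

# Parcellation with K ROIs (e.g. the ss100 parcellation)
masker = NiftiLabelsMasker(labels_img="ss100_parcellation.nii.gz")

# Mean contrast value per ROI for each contrast map: shape (1, K)
antic_feats = masker.fit_transform("contrast_anticipation.nii.gz")
re_feats = masker.fit_transform("contrast_reward_expectation.nii.gz")

# Concatenated feature vector used as model input: shape (2 * K,)
features = np.concatenate([antic_feats.ravel(), re_feats.ravel()])
```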

2.3 Construction of Deep Learning Predictive Models

Dense feed-forward neural networks were constructed to take the concatenated ROI mean values from the two contrast maps as inputs and predict 8-week \(\varDelta \text {HAMD}\). Rather than hand-tuning model hyperparameters, a random search was conducted to identify a high-performing model for predicting response to bupropion. Random search constitutes an unbiased neural architecture search (NAS); it was chosen because it has been shown to outperform grid search [1] and, when properly configured, can be competitive with leading NAS methods such as ENAS [14].

A total of 200 architectures were sampled uniformly at random from a defined hyperparameter space (Table 1) and used to construct models that were trained in parallel on 4 NVIDIA P100 GPUs. All models contained a single-neuron output layer to predict \(\varDelta \)HAMD and were trained with the Nadam optimizer, a maximum of 1000 epochs, and early stopping after 50 epochs without a decrease in validation root mean squared error (RMSE).
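The per-model sampling and training step might look like the following Keras sketch. The specific hyperparameter ranges shown (layer counts, widths, dropout rates, activations, learning rates) are illustrative assumptions, not the exact values of Table 1.

```python
import random
import tensorflow as tf

def sample_architecture():
    """Randomly sample one model configuration (ranges are illustrative)."""
    n_layers = random.randint(1, 4)
    layers = [{"units": random.choice(range(32, 513, 32)),
               "activation": random.choice(["relu", "tanh", "elu"]),
               "dropout": random.choice([0.0, 0.25, 0.5])}
              for _ in range(n_layers)]
    return {"layers": layers, "learning_rate": random.choice([1e-4, 1e-3, 1e-2])}

def build_model(config):
    model = tf.keras.Sequential()
    for layer in config["layers"]:
        model.add(tf.keras.layers.Dense(layer["units"], activation=layer["activation"]))
        model.add(tf.keras.layers.Dropout(layer["dropout"]))
    model.add(tf.keras.layers.Dense(1))  # single-neuron output predicting delta-HAMD
    model.compile(optimizer=tf.keras.optimizers.Nadam(config["learning_rate"]),
                  loss="mse",
                  metrics=[tf.keras.metrics.RootMeanSquaredError()])
    return model

# Early stopping on validation RMSE (50-epoch patience, 1000 maximum epochs)
early_stop = tf.keras.callbacks.EarlyStopping(
    monitor="val_root_mean_squared_error", patience=50, restore_best_weights=True)
# model.fit(X_train, y_train, validation_data=(X_val, y_val),
#           epochs=1000, callbacks=[early_stop], verbose=0)
```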

The combination of 200 model architectures with 4 different parcellations resulted in a total of 800 distinct model configurations. To ensure robust model selection and to accurately estimate generalization performance, these 800 configurations were evaluated with a nested K-fold cross-validation scheme with 3 outer and 3 inner folds. Although a single random split is commonly used in place of the outer validation loop, nested cross-validation ensures that no test data are used during training or model selection and provides an unbiased estimate of final model performance [24]. Within each outer fold, the best-performing configuration was selected based on its mean RMSE over the inner folds, retrained on all training and validation data from the inner folds, and then evaluated on the held-out test data of the outer fold. Repeating this process for each outer fold yielded 3 best-performing models, and the mean test performance of these models is reported here.
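A skeleton of this nested cross-validation is sketched below, assuming `configs` holds the 800 sampled architecture/parcellation combinations and that `train_and_score` and `fit_on_all` are hypothetical helpers that train one configuration and return, respectively, its validation RMSE and a fitted model.

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.metrics import mean_squared_error

outer_cv = KFold(n_splits=3, shuffle=True, random_state=0)
inner_cv = KFold(n_splits=3, shuffle=True, random_state=0)
outer_rmse = []

for train_idx, test_idx in outer_cv.split(X):
    X_dev, y_dev = X[train_idx], y[train_idx]    # training + validation data
    X_test, y_test = X[test_idx], y[test_idx]    # held-out outer test fold

    # Inner loop: score every configuration by its mean validation RMSE
    mean_rmse = []
    for config in configs:
        fold_rmse = [train_and_score(config, X_dev[tr], y_dev[tr], X_dev[va], y_dev[va])
                     for tr, va in inner_cv.split(X_dev)]
        mean_rmse.append(np.mean(fold_rmse))

    # Retrain the best configuration on all inner data; evaluate on the outer test fold
    best = configs[int(np.argmin(mean_rmse))]
    model = fit_on_all(best, X_dev, y_dev)
    preds = model.predict(X_test).ravel()
    outer_rmse.append(np.sqrt(mean_squared_error(y_test, preds)))

print("Mean test RMSE over outer folds:", np.mean(outer_rmse))
```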

Table 1. Hyperparameter space defined for the random neural architecture search. For each model, one value was randomly selected from each of the first set of hyperparameters; for each layer in each model, one value was randomly selected from the second set of hyperparameters.
Fig. 1. Mean inner validation fold RMSE of the 800 model architecture and parcellation combinations evaluated in the unbiased neural architecture search. Results from one outer cross-validation fold are illustrated here; findings for the other two folds were similar.

3 Results and Discussion

3.1 Neural Architecture Search (NAS)

Results indicate that the NAS is beneficial. A wide range of validation RMSE was observed across the 800 tested model configurations (Fig. 1): certain models performed particularly well, achieving RMSE approaching 4.0, while other architectures were far less suitable. The NAS thus identified high-performing configurations expediently.

The information from the NAS can be examined for insight into which configurations constitute high- versus low-performing models and whether the ranges of hyperparameters searched were sufficiently broad. Toward this end, the hyperparameter distributions of the top and bottom quartiles of the 800 model configurations, sorted by RMSE, were compared. Substantial differences in the hyperparameter values that yielded high and low predictive accuracy were observed (Fig. 2). Notably, the custom, study-specific parcellation with 100 ROIs (ss100) provided significantly better RMSE than the “off-the-shelf” Schaefer parcellation (\(p = 0.023\)). Additionally, the top quartile of models using ss100 used fewer layers (1–2) but more neurons (384–416) in the first hidden layer compared to the bottom quartile of models. Note that unlike in a parameter sensitivity analysis, where ideal results exhibit uniform model performance over a wide range of parameters, an objective of a neural architecture search is to demonstrate adequate coverage over a range of hyperparameters. This objective is met when local performance maxima are observed, as shown in Fig. 2b–d, where peaks in the top quartile (blue curve) of model architectures are evident.

Fig. 2. Hyperparameter patterns for the top (blue) and bottom (orange) quartiles of the 800 model configurations evaluated in the unbiased neural architecture search. Representative results for one of the outer cross-validation folds are presented. a: Top quartile models tended to use the ss100 parcellation, while bottom quartile models tended to use the Schaefer parcellation. b–d: Distributions of three selected hyperparameters compared for the top and bottom quartiles of model configurations, revealing the distinct patterns of hyperparameters for high-performing models. The top quartile of model architectures have fewer layers (peaking at 1–2) but more neurons in the first hidden layer (peaking at 384–416 neurons). (Color figure online)

The best performing model configuration used an architecture with two hidden layers and the 100-ROI study-specific parcellation (ss100). Regression accuracy in predicting \(\varDelta \text {HAMD}\) in response to bupropion treatment was an RMSE of 4.71 and an \(R^2\) of 0.26. This \(R^2\) value (95% confidence interval 0.12–0.40 for \(n = 37\)) constitutes a highly significant effect size for a neuroimaging study, where effect sizes are commonly much lower, e.g. 0.01–0.10 in [3] and 0.09–0.15 in [21]. Furthermore, this predictor identifies individuals who will experience clinical remission (\(\text {HAMD}(\text {week 8}) \le 7\)) with a number needed to treat (NNT) of 3.2 subjects and an AUC of 0.71. This NNT indicates that, on average, one additional remitter will be identified for every 3 individuals screened by this predictor. In comparison, clinically-adopted pharmacological and psychotherapeutic treatments for MDD have NNTs ranging from 2–25 [18], and other proposed predictors for antidepressants besides bupropion have reported NNTs of 3–5 [9, 10]. Therefore, this NNT of 3.2 has high potential for clinical benefit in identifying individuals most likely to respond to bupropion (Table 2).

When evaluated on sertraline-treated and placebo-treated subjects from this dataset, the model demonstrated poor accuracy (negative \(R^2\)), which is desirable because it indicates the model learned features specific to bupropion response. Additionally, clinical covariates such as demographics, disease duration, and baseline clinical scores were added to the data in another NAS, but this did not increase predictive power. Lastly, less statistically complex models, including multiple linear regression and a support vector machine, performed poorly (negative \(R^2\)), even after hyperparameter optimization with a comparable random search of 800 configurations. This finding suggests that a model with greater statistical capacity, such as a neural network, was needed to learn the association between the data and treatment outcome.

Table 2. Performance of the best model configuration from the neural architecture search. To obtain classifications of remission, the model’s regression outputs were thresholded post-hoc using the clinical criterion for MDD remission (\(\text {HAMD}(\text {week 8}) \le 7\)). RMSE: root mean squared error, NNS: number needed to screen, PPV: positive predictive value, AUC: area under the receiver operating characteristic curve.

3.2 Learned Neuroimaging Biomarker

Permutation feature importance was measured on the best-performing model configuration to extract a composite neuroimaging biomarker of bupropion response. Specifically, for each feature, the change in \(R^2\) was measured after randomly permuting the feature’s values among the subjects. This was repeated 100 times per feature, and the mean change in \(R^2\) provided an estimate of each feature’s importance in accurately predicting bupropion response. The 10 most important regions for bupropion response prediction are visualized in Fig. 3 and include the medial frontal cortex, amygdala, cingulate cortex, and striatum. The regions the model learned to use agree with the regions neurobiologists have identified as key nodes of the reward processing neural circuitry [4]. This circuit is the putative target of bupropion and is the circuit primarily probed by the reward processing task in this task-based fMRI study.
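The permutation importance computation can be sketched as below for a fitted model exposing `predict`, a feature matrix `X`, and targets `y` (names are placeholders): each feature column is shuffled 100 times and the mean drop in \(R^2\) is recorded as that feature's importance.

```python
import numpy as np
from sklearn.metrics import r2_score

def permutation_importance(model, X, y, n_repeats=100, seed=0):
    rng = np.random.default_rng(seed)
    baseline_r2 = r2_score(y, model.predict(X).ravel())
    importances = np.zeros(X.shape[1])

    for j in range(X.shape[1]):
        drops = []
        for _ in range(n_repeats):
            X_perm = X.copy()
            rng.shuffle(X_perm[:, j])      # permute feature j across subjects
            drops.append(baseline_r2 - r2_score(y, model.predict(X_perm).ravel()))
        importances[j] = np.mean(drops)    # mean decrease in R^2 = importance

    return importances
```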

Fig. 3. The 10 most important ROIs for bupropion response prediction, as measured by permutation feature importance. These included 5 regions in the anticipation contrast map (\(\varvec{C}_{antic}\), top row) and 5 regions in the reward expectation contrast map (\(\varvec{C}_{re}\), bottom row). Darker hues indicate greater importance in predicting \(\varDelta \)HAMD.

4 Conclusions

In this work, deep learning and an extensive, unbiased NAS were used to construct predictors of bupropion response from pretreatment task-based fMRI. These methods produced a novel, accurate predictive tool to screen for MDD patients likely to respond to bupropion, to estimate the degree of long-term symptomatic improvement after treatment, and to identify patients who will not respond appreciably to the antidepressant. Predictors such as the one presented here are an important step toward narrowing down the set of candidate antidepressants to be tested for each patient and addressing the urgent need for individualized treatment planning in MDD. The results also underscore the value of fMRI in MDD treatment prediction, and future work will target extension to additional treatments.