Introduction

There are currently ambitious but still preclinical efforts to complement the diagnostic spectrum for individual patients with unipolar depression, particularly the clinical entity major depressive disorder (MDD), by imaging biomarkers. These neurobiological markers are aimed at amending therapeutic decision making by capturing biological information which cannot be assessed by clinical interviews (Atluri et al. 2013; Phillips et al. 2015; Schneider and Prvulovic 2013). Particularly, diagnostic models based on multivariate pattern analyses (MVPA) of functional magnetic resonance imaging (fMRI) have been proposed as potentially powerful clinical tools for mental disorders (Arbabshirani et al. 2017; Castellanos et al. 2013; Haller et al. 2014; Klöppel et al. 2012; Lui et al. 2016; Orru et al. 2012; Patel et al. 2016; Sundermann et al. 2014a; Wolfers et al. 2015).

FMRI data acquisition at rest has attracted particular interest of clinical researchers, because it seems to promise robust information on functionally relevant neural networks with simple setups and little demand for patient cooperation (Barkhof et al. 2014; Castellanos et al. 2013; Smith et al. 2009; Sundermann et al. 2014a; Zhang and Raichle 2010). Most approaches to analyze data obtained by this so-called resting-state fMRI (rs-fMRI) are based on analyses of functional connectivity (FC), the correlation of spontaneous activity in remote brain areas (Margulies et al. 2010; van den Heuvel and Hulshoff Pol 2010). Rs-fMRI has been used to characterize abnormal FC (rs-fcMRI) or spontaneous local activity in MDD. Major findings are increased spontaneous activity in cortical midline structures related to self-referential cognition and altered interactions of these regions with lateral cortical areas (Hamilton et al. 2015; Kaiser et al. 2015; Sundermann et al. 2014b). They have been interpreted as a correlate of a reduced top–down inhibition of cortical midline and limbic regions reflecting increased ruminative brooding (Hamilton et al. 2015; Marchetti et al. 2012; Nejad et al. 2013). Rs-fMRI, therefore, seems to provide information on important aspects of MDD etiopathogenesis and symptomatology (Kupfer et al. 2012).

MVPA conceptually overlaps with multivariate classification, pattern recognition, and predictive analyses. Here, MVPA mainly refers to instances of supervised learning (a subfield of machine learning). These methods facilitate the automated generation of decision rules based on previous experience, i.e., labeled training data (Alpaydin 2010; James et al. 2013). MVPA integrates information from multiple sources (for example, brain regions) with the aim to increase discriminative power compared to conventional univariate analyses of intrinsically noisy fMRI data. MVPA, particularly nonlinear techniques, additionally exploits complex relationships among individual features. Information captured by MVPA, therefore, exceeds and fundamentally differs from that assessed by standard univariate analyses (Arbabshirani et al. 2017; Pereira et al. 2009; Sundermann et al. 2014a). Popular and powerful classification tools for such analyses are support vector machines (SVM). In SVMs, subjects/imaging data sets are represented as points in a multidimensional feature space and the diagnostic problem can be operationalized as defining a hyperplane (i.e., a decision boundary) which best distinguishes between two groups of subjects. The classifier is trained by maximizing the margin of separation between groups based on the examples closest to this hyperplane; the 'kernel trick' additionally allows decision boundaries that are nonlinear in the original feature space (Alpaydin 2010; James et al. 2013; Orru et al. 2012; Pereira et al. 2009; Vapnik 2000). The choice of kernel and model parameters (such as C in SVMs) determines a model’s complexity and thus its performance given a tradeoff between over- and underfitting. If a model is overfitted, it conforms almost perfectly to the training samples while losing generalization ability for new samples. On the other hand, if a model is underfitted, it is too simple to capture the information essential for the diagnostic decision of interest.
Consequently, parameters have to be optimized to generate successful diagnostic models that generalize to new clinical data (Alpaydin 2010; Arbabshirani et al. 2017; James et al. 2013; Orru et al. 2012; Vapnik 2000). The actual classification algorithm can be combined with different methods for dimensionality reduction and feature selection (FS) to prepare input data with the aim to improve diagnostic accuracy (Mwangi et al. 2014; Pereira et al. 2009; Sundermann et al. 2014a).
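For orientation, the role of C and of the kernel can be written out with the standard soft-margin SVM formulation. This is the generic textbook form and is not taken from this study; the symbols w, b, ξ, φ, and K are introduced here for illustration only.

```latex
\begin{aligned}
\min_{w,\,b,\,\xi}\quad & \tfrac{1}{2}\lVert w\rVert^{2} + C\sum_{i=1}^{n}\xi_{i} \\
\text{subject to}\quad & y_{i}\bigl(w^{\top}\phi(x_{i}) + b\bigr) \ge 1 - \xi_{i},
\qquad \xi_{i}\ge 0,\qquad i = 1,\dots,n, \\[4pt]
K_{\text{linear}}(x, x') &= x^{\top}x', \qquad
K_{\text{RBF}}(x, x') = \exp\bigl(-\gamma\,\lVert x - x'\rVert^{2}\bigr).
\end{aligned}
```

Here x_i is the feature vector of subject i, y_i ∈ {−1, +1} the diagnostic label, and ξ_i a slack variable that permits margin violations. A large C penalizes training errors heavily (tending towards overfitting), whereas a small C favors a wide margin (tending towards underfitting). With the RBF kernel, γ additionally controls how local the influence of a single training example is.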

Groundbreaking work in the field of diagnostic MVPA in depression was reported by Craddock et al. in 2009: the authors used FS and linear SVMs based on pairwise FC and reached diagnostic accuracies of 83.33% (hold-out validation) (Craddock et al. 2009). Usually, MVPA models to differentiate patients from controls were trained on rs-fcMRI data of small, selected MDD samples and explicitly healthy subjects under ideal research conditions: these subjects were mostly young and had high levels of current depressive symptoms. Other factors of clinical heterogeneity (e.g., antidepressant medication or any clinical comorbidity) were either excluded or not explicitly reported (Cao et al. 2014; Guo et al. 2014; Lord et al. 2012; Ma et al. 2013; Qin et al. 2015; Ramasubbu et al. 2016; Zeng et al. 2012; Zeng et al. 2014). One recent study reported successful diagnostic performance even for medication-free remitted MDD patients while even more strictly constraining sources of heterogeneity (Bhaumik et al. 2016). In contrast, Ramasubbu et al. observed above-chance accuracies only in a subgroup with the highest symptom severity, and even these remained in a range not deemed clinically meaningful (Ramasubbu et al. 2016). Another study included patients with schizophrenia as a second control group (Yu et al. 2013). For a detailed overview of sample characteristics and classification methods in previous studies on diagnostic MVPA of rs-fMRI data in depression, see Table 1.

Table 1 Sample characteristics, methods, and main classification results of the previous studies reporting diagnostic applications of rs-fcMRI and MVPA in unipolar depression

Small samples as well as a high diversity of tested computational models in the field carry an inherent bias to a publication of false-positive results (Button et al. 2013). Diagnostic classification studies typically use cross validation (CV) to partially alleviate problems coming along with small samples. CV is an established method to assess the generalizability of classifiers to new data. It makes efficient use of data sets by repartitioning them into test- and training sets multiple times (Pereira et al. 2009). Nevertheless, rigorous confirmation of findings with independent data (Ioannidis 2005; Sundermann et al. 2014a) and sufficiently large samples (Arbabshirani et al. 2017) is crucial to assess whether these desirable methods can be translated from well-controlled research environments to routine clinical diagnostic applications in more heterogeneous patient populations.
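The repartitioning underlying CV can be sketched as follows. This is a minimal illustration using only the Python standard library; the number of folds (k = 10), the sample size of 360, and the function name are assumptions chosen for the example, not details reported for the studies discussed here.

```python
import random


def k_fold_indices(n_samples, k, seed=0):
    """Yield (train, test) index lists for k-fold cross validation.

    Hypothetical helper for illustration; k and the seed are assumptions.
    """
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Striding distributes samples so that fold sizes differ by at most one.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


# Every sample is used for testing exactly once and for training k - 1 times.
splits = list(k_fold_indices(n_samples=360, k=10))
```

Any step that learns from the labels (for example, feature selection) would have to be repeated inside each training partition; otherwise, information from the test fold leaks into the model and the accuracy estimate becomes optimistic.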

The purposes of this multi-step investigation were replication, clinical translation, and model optimization. The main goal was to clarify whether combining rs-fcMRI with a common MVPA approach to diagnose MDD is feasible in a diverse population, as a prerequisite for generalization to routine depression care. In doing so, we wanted to identify suitable modelling parameters and FS techniques. SVMs, a well-established classification method, were applied to rs-fcMRI data from a large cohort acquired under clinically realistic conditions. The MDD sample was heterogeneous regarding symptom severity, comorbidity, and therapy status. A non-depressed population sample served as control group. In addition, we aimed to corroborate the feasibility of this approach in a more homogeneous subgroup of patients with more distinct depressive symptoms.

Methods

First, we give a brief overview of our data analysis strategy: For the main analysis, two sub-samples—both comprising patients with MDD and controls—were drawn from participants in a cohort study. Estimates of functional connectivity derived from rs-fMRI data were used as features to train computational models to identify individual MDD patients. All this modelling was accomplished in the form of planned exploratory analyses with widely varying technical parameters in the first sub-sample. The second sub-sample was reserved to validate potentially successful models identified at that exploratory stage. In additional analyses, we evaluated modelling performance in a more selected subgroup and explored the univariate information content of the features used.

Subjects: the BiDirect study sample

This investigation is based on baseline data from the BiDirect study (Hermesdorf et al. 2016; Teismann et al. 2014; Teuber et al. 2017). BiDirect primarily aims to disentangle the bidirectional relationship between depression and subclinical arteriosclerosis. The particular analysis reported here uses data from the depression cohort (current or recent episode) and the population control cohort (invited via population register). The acquired data comprised clinical, psychological, and neuropsychometric testing. Structural and functional [wakeful rest, emotional faces paradigm (Dannlowski et al. 2009)] imaging was also performed. Psychometric assessment included the Center for Epidemiological Studies Depression Scale (CES-D) (Radloff 1977), Hamilton Rating Scale for Depression (HAM-D-17) (Hamilton 1960), and the Mini International Neuropsychiatric Interview (MINI) (Ackenheil et al. 1999). Clinical data included information on medical history and current medication. Drugs were classified according to the Anatomical Therapeutic Chemical (ATC) Classification System (http://www.whocc.no/atc_ddd_index/). A comprehensive study protocol detailing recruitment and data acquisition of BiDirect has been published (Teismann et al. 2014). Sub-sample selection for this methodological work, including demographical and clinical characteristics, is detailed in the following.

Resting-state fMRI data acquisition and feature extraction

MRI data were acquired in a setting equivalent to clinical appointments: The same 3 Tesla scanner is also used for clinical examinations, and data acquisition was accomplished by clinical personnel. Rs-fMRI data were acquired using T2*-weighted echo planar imaging at the end of the MRI protocol [see the methods supplement (Online Resource 1) for further details on the rs-fMRI protocol]. 1378 technically complete rs-fMRI data sets were obtained. Structural scans were subjected to neuroradiological reporting.

We followed a standard approach for preprocessing and analysis of individual rs-fMRI data sets implemented in the Data Processing Assistant for Resting-State fMRI (DPARSF 2.3) (Chao-Gan and Yu-Feng 2010), REST 1.8 (Song et al. 2011) and SPM8 (http://www.fil.ion.ucl.ac.uk/spm/) [see the methods supplement (Online Resource 1) for further details].

Signal time courses were extracted from two sets of ROIs: (1) peak coordinates (n = 38) derived from a meta-analysis on rs-fMRI in depression (Sundermann et al. 2014b) as center coordinates for spherical ROIs (5 mm radius). They represent prior knowledge on altered spontaneous activity in MDD, a common approach in MVPA of MRI data (Chu et al. 2012; Schrouff et al. 2013). (2) Another set consisted of 200 ROIs spanning the whole gray matter. They were derived using spatially constrained spectral clustering of rs-fMRI data with the aim to better comprehend the functional architecture of the human brain compared to conventional atlases based on surface anatomy (Craddock et al. 2012). They are distributed with DPARSF (Chao-Gan and Yu-Feng 2010). FC was determined by calculating pairwise correlation (Pearson’s r) for all ROIs within each of the two sets separately. Subsequently, correlation coefficients were z-transformed. FC analyses resulted in 703 unique features for the meta-analytically defined ROIs and 19,900 unique features for the whole-brain parcellation.
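The feature construction described above can be sketched in a few lines. This is an illustration under stated assumptions: the z-transformation is taken to be Fisher's r-to-z transform (atanh), which the text does not spell out, and the function names and the toy time courses are hypothetical. The actual analysis used DPARSF/REST rather than custom code.

```python
import math
from itertools import combinations


def pearson_r(x, y):
    """Pearson correlation of two equally long signal time courses."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def fc_features(time_courses):
    """Fisher z-transformed correlations for all unique ROI pairs.

    Assumes no pair is perfectly correlated (atanh is undefined at r = +/-1).
    """
    features = []
    for i, j in combinations(range(len(time_courses)), 2):
        features.append(math.atanh(pearson_r(time_courses[i], time_courses[j])))
    return features


def n_unique_pairs(n_rois):
    """Number of unique undirected ROI pairs, n * (n - 1) / 2."""
    return n_rois * (n_rois - 1) // 2


# Toy time courses for three hypothetical ROIs (not study data).
toy = [[1.0, 2.0, 3.0, 4.0], [2.0, 1.0, 4.0, 3.0], [4.0, 3.0, 1.0, 2.0]]
toy_features = fc_features(toy)
```

The pair count reproduces the feature numbers given in the text: `n_unique_pairs(38)` is 703 and `n_unique_pairs(200)` is 19,900.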

Definition of balanced sub-samples for model generation and validation

To attain unbiased estimates of the generalizability of MVPA to diagnose the clinical entity MDD, we chose to draw balanced sub-samples from the whole BiDirect sample. The three-step procedure applied here is an important characteristic of this investigation. It was specifically chosen to fulfill the goals of parameter optimization/model identification and model validation:

  1. We excluded data sets bearing particular risks of potential bias through imaging artefacts, brain lesions, and clinical characteristics.

  2. The complete BiDirect sample exhibited some demographical and cohort size imbalances. In contrast to conventional statistical modelling, these imbalances cannot directly be taken into account when training and assessing MVPA models. We, therefore, adopted a pairwise-matching procedure to draw a balanced sub-sample from both the MDD and control cohorts.

  3. The resulting data set was randomly split in half. This resulted in two highly comparable yet statistically independent data sets: subset 1 (unipolar depression: n = 180, controls: n = 180) was used for all subsequent explorative analyses to identify sufficiently accurate diagnostic models. Demographical and detailed clinical characteristics of both patients and controls in this subset are shown in Table 2. Subset 2 (unipolar depression: n = 180, controls: n = 180) was kept apart from subset 1 and remained as an independent hold-out data set to validate potentially successful models with a high level of evidence.

    Table 2 Demographical and medical characteristics of subjects in subset 1 used for exploratory model comparison

See Fig. 1 and the methods supplement (Online Resource 1) for further details on sub-sample selection.
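The final splitting step can be illustrated as follows. This sketch assumes that the random split operated on matched patient-control pairs, so that each pair ends up in the same subset; this is an assumption made for the example and is not stated explicitly in the text. Subject identifiers, the seed, and the function name are hypothetical.

```python
import random


def split_matched_pairs(pairs, seed=0):
    """Randomly split matched (patient, control) pairs into two halves.

    Keeping pairs together is an assumption of this sketch; it preserves
    the matching and the class balance within each subset.
    """
    shuffled = list(pairs)
    random.Random(seed).shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]


# 360 hypothetical matched pairs, i.e. 360 patients and 360 controls in total.
pairs = [(f"mdd_{i:03d}", f"ctrl_{i:03d}") for i in range(360)]
subset_1, subset_2 = split_matched_pairs(pairs)
```

With 360 pairs, each half contains 180 patients and 180 controls, matching the subset sizes reported above.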

Fig. 1

Flow chart of steps leading from the initial data set to two balanced and independent subsets facilitating unbiased estimates of the diagnostic performance regarding MDD. Figures do not necessarily add to 100% as some data sets fulfilled multiple exclusion criteria

Training and performance estimation of diagnostic MVPA models

Subsequent MVPA used the z-transformed correlation coefficients as features to train diagnostic models for this two-class problem (MDD vs. controls). Analyses were implemented using RapidMiner (open source edition, version 5.3.013, RapidMiner GmbH, Dortmund, North Rhine-Westphalia, Germany, http://github.com/rapidminer/) (Mierswa et al. 2006; Schowe 2011). Given the comparably large data set available, we opted for a hierarchical approach to identification and validation of potential diagnostic models. In a first step, a wide range of models was explored in subset 1. As pointed out previously, the independent subset 2 was kept for validation and potentially further refinement of models. An overview of the MVPA approach is presented in Fig. 2. In-depth information on our MVPA approach is presented in the methods supplement (Online Resource 1).

Fig. 2

Overview of the general data analysis pipeline for generating FC-based classifiers

We explored the diagnostic accuracy, sensitivity, and specificity of soft-margin SVM models [C-SVC from LIBSVM (Chang and Lin 2011)]. Linear kernels are the preferential choice in diagnostic SVM models of neuroimaging data given the typically higher number of features than patients (Orru et al. 2012), have been the most common choice in MDD (see Table 1), and were consequentially evaluated here. We additionally explored nonlinear radial basis function (RBF) kernels to facilitate a higher model complexity. Model parameters were varied systematically over a wide range spanning potential under- and overfitting to identify optimal settings. Models were tested either with one of four different FS techniques or without any FS: a t test-based filter, a filter based on linear SVM weights, recursive feature elimination (SVM-RFE) (Guyon et al. 2002), and minimum redundancy maximum relevance (MRMR) FS (Ding and Peng 2005). Classification accuracies were estimated using cross validation (CV) (Pereira et al. 2009).

In total, 210 different settings were assessed in the exploratory analysis in subset 1.
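The three performance measures can be made concrete with a short sketch. The function name and the labels are hypothetical. The example also shows why accuracy alone is insufficient in a balanced sample: a trivial classifier that assigns every subject to the MDD group reaches 50% accuracy with 100% sensitivity and 0% specificity.

```python
def diagnostic_metrics(true_labels, predicted_labels, positive="MDD"):
    """Accuracy, sensitivity, and specificity for a two-class problem."""
    pairs = list(zip(true_labels, predicted_labels))
    tp = sum(t == positive and p == positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    return {
        "accuracy": (tp + tn) / len(pairs),
        "sensitivity": tp / (tp + fn),  # proportion of patients detected
        "specificity": tn / (tn + fp),  # proportion of controls recognized
    }


# Balanced sample as in subset 1; the predictions are a hypothetical
# degenerate classifier that always outputs the patient label.
truth = ["MDD"] * 180 + ["control"] * 180
always_mdd = ["MDD"] * 360
trivial = diagnostic_metrics(truth, always_mdd)
```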

Subgroup analysis in patients with more severe current depressive symptoms

To further explore the technical feasibility of different models as well as the potential influence of the severity of depressive symptoms at the time of MR scanning and to allow for comparison with pilot studies in severely depressed patients, we investigated the diagnostic accuracies in a subgroup of subset 1. This subgroup comprised the one-third of MDD patients (n = 60) with the most severe depressive symptoms according to HAM-D-17 and their respective matched controls (n = 60). The mean HAM-D-17 score in this MDD subgroup was 20.2 ± SD 2.9. For further subject characteristics, see supplementary Table 3 (Online Resource 2). Exploratory analyses comprised the same modelling and validation approaches as in the main analysis. As an exception, only moderate C values were tested, as very low or very high C values in the main analysis either did not change results or led to biased classifiers with a high tendency towards a single diagnostic label (see results). In total, 90 different settings were assessed.
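The subgroup definition can be sketched as a simple ranking of matched pairs by the patient's HAM-D-17 score. The data structure, field names, and scores below are hypothetical, and how tied scores at the cut-off were handled is not stated in the text; this sketch simply keeps the order produced by a stable sort.

```python
def most_severe_third(matched_pairs):
    """Return the third of matched pairs whose patients score highest.

    Each pair is a dict with the hypothetical keys "patient", "control",
    and "hamd17" (the patient's HAM-D-17 score).
    """
    ranked = sorted(matched_pairs, key=lambda pair: pair["hamd17"], reverse=True)
    return ranked[: len(ranked) // 3]


# 180 hypothetical matched pairs with arbitrary placeholder scores.
matched_pairs = [
    {"patient": f"mdd_{i:03d}", "control": f"ctrl_{i:03d}", "hamd17": (i * 7) % 31}
    for i in range(180)
]
subgroup = most_severe_third(matched_pairs)
```

Selecting whole pairs keeps each patient's matched control in the subgroup, so the balance achieved by the matching procedure is preserved.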

Parameter sets that achieved an overall cross-validated accuracy of at least 60% were further validated in subset 2. We, therefore, drew a comparable subgroup of the most severely depressed MDD patients (n = 60, mean HAM-D-17 score: 20.9 ± SD 3.3) and their controls (n = 60) from the independent subset 2 (see supplementary Table 6 in Online Resource 2). Parameters derived from successful models in the cross-validated exploratory subgroup analyses in subset 1 were used to train classifiers based on the entire subgroup of subset 1. Generalizability of these hypothetically most powerful models was assessed by applying the resulting decision rules to subjects in the subgroup of subset 2. These hypotheses about overall accuracies generated in the more severely depressed subgroup of subset 1 were tested using a one-tailed binomial test and corrected for multiple comparisons by controlling the false-discovery rate (FDR) (Benjamini and Hochberg 1995).
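The hypothesis test used at the validation stage can be sketched as follows: a one-tailed binomial test of the number of correct classifications against a chance level of 0.5, followed by Benjamini-Hochberg adjustment across models. This is a standard-library illustration; the function names are hypothetical, and the study's own software for this step is not specified here.

```python
from math import comb


def binomial_p_one_tailed(n_correct, n_total, chance=0.5):
    """P(X >= n_correct) for X ~ Binomial(n_total, chance)."""
    return sum(
        comb(n_total, k) * chance**k * (1 - chance) ** (n_total - k)
        for k in range(n_correct, n_total + 1)
    )


def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p values in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p value downwards to enforce monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical counts of correct classifications among 120 validation
# subjects (60 patients, 60 controls); not results of this study.
raw_p = [binomial_p_one_tailed(n, 120) for n in (78, 72, 66)]
adjusted_p = benjamini_hochberg(raw_p)
```

A model would be regarded as performing above chance if its adjusted p value falls below the chosen threshold (0.05 in this investigation).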

Post hoc estimation of feature set information content by univariate analyses

As a final step, we retrospectively assessed how much information about the clinical diagnosis of unipolar depression the FC-based feature sets contained, using more classical univariate analyses. We, therefore, conducted two-sided two-sample t tests in the entire subset 1 and the subgroup of subset 1 with pronounced depressive symptoms, both for FC coefficients based on the meta-analytical ROIs and for those based on the whole-brain parcellation. To assess information content while controlling for multiple comparisons, we then estimated the proportion π₁ of univariate results that truly followed the alternative hypothesis of a group difference between patients and controls using Matlab. This method is based on the assumption of a uniform distribution of p values that follow the null hypothesis (Storey and Tibshirani 2003) and was introduced in the framework of the positive false-discovery rate for high-dimensional data sets (Storey and Tibshirani 2003; Storey 2002).
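The idea behind this estimate can be sketched with a simplified version of the estimator. Under the null hypothesis, p values are uniformly distributed, so the density of p values above a threshold λ reflects the proportion of true null hypotheses; the remainder is attributed to the alternative. The sketch below uses a single fixed λ = 0.5, which is an assumption made for brevity; Storey and Tibshirani (2003) describe a smoothed estimate over a range of λ values, and the study itself used Matlab. The function name and toy p values are hypothetical.

```python
def estimate_pi1(p_values, lam=0.5):
    """Estimate the proportion of tests following the alternative hypothesis.

    Simplified fixed-lambda estimator: the proportion of true nulls is
    estimated from the share of p values above lam, rescaled by (1 - lam).
    """
    m = len(p_values)
    n_above = sum(p > lam for p in p_values)
    pi0 = min(1.0, n_above / (m * (1.0 - lam)))
    return 1.0 - pi0


# Toy data: 800 evenly spread "null" p values plus 200 very small ones.
null_p = [(i + 0.5) / 800 for i in range(800)]
effect_p = [0.001] * 200
pi1 = estimate_pi1(null_p + effect_p)
```

In this toy example, 400 of the 1000 p values exceed 0.5, giving an estimated null proportion of 0.8 and therefore an estimated proportion of 0.2 following the alternative hypothesis.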

Results

Models based on features supported by a priori knowledge

In the explorative model identification step of the main analysis in the entire subset 1, SVM models based on all pairwise connections of the 38 meta-analytically defined ROIs reached overall diagnostic accuracies from 47.5% (linear kernel, C = 0.1) to 53.6% (linear kernel, C = 0.001). Sensitivities ranged from 45.0 to 97.2% and specificities from 1.7 to 53.9%, as some models assigned nearly all subjects to one single group. FS did not improve performance, with overall accuracies ranging from 47.0% (linear kernel, C = 0.1, FS based on SVM weights) to 53.3% (linear kernel, C = {0.001, 0.01, 0.1}, FS based on t tests), sensitivities from 45.0 to 98.33%, and specificities from 3.3 to 51.7%. To summarize, results were distributed closely around the chance level of 50% accuracy with a typical tradeoff of sensitivities and specificities. For detailed results, see supplementary Table 1 (Online Resource 2).

Models based on whole-brain connectivity

Models based on the entire gray matter parcellation in the model identification step of the main analysis reached overall diagnostic accuracies from 45.0% (RBF kernel, γ = 0.01, C = {10, 100, 1000}) to 52.8% (linear kernel, C = {0.01, 0.1, 1, 10, 100, 1000}) in the explorative analysis in the entire subset 1. Sensitivities ranged from 0.0 to 88.3% and specificities from 9.4 to 100.0%. Some instances of automated feature selection yielded slightly better performance in this case with overall accuracies ranging from 48.1% (RBF kernel, γ = 1, C = {0.001, 0.01, 0.1}, MRMR FS) to 56.1% (linear kernel, C = 0.1, MRMR FS) with sensitivities from 51.1 to 83.9% and specificities from 12.8 to 55.0%. Thus, these exploratory results were slightly shifted with regard to chance level. For detailed results, see supplementary Table 2 (Online Resource 2).

With the range of diagnostic performances observed, no model generation technique tested in the entire subset 1 reached clinically relevant accuracies in the exploratory analysis. We, therefore, refrained from further model validation in subset 2 (Table 3a).

Table 3 Models with cross-validated accuracies of at least 60.0% and corresponding results of model validation in the hold-out data set

Subgroup analysis in patients with higher current depressive symptoms

Models explored in the subgroup of most depressed patients in subset 1 to assess the general feasibility of this approach achieved diagnostic accuracies from 40.8 to 65.0%. 13 models reached cross-validated accuracies of at least 60.0%. For detailed results, see supplementary Tables 4 and 5 (Online Resource 2).

These 13 models were further assessed by validation in an independent non-overlapping subgroup with the most severely depressed patients from subset 2 (hypothesis tests). Three of these modelling approaches performed significantly above chance (p < 0.05, FDR), and further six models reached a formal trend to statistical significance (p < 0.1, FDR). Detailed validation results are presented in Table 3b.

Post hoc estimation of feature set information content by univariate analyses

The following results calculated with a conventional approach serve as a surrogate of the information content of features. The estimated proportion π₁ of pairwise connections between meta-analytically defined ROIs which contain group information of diagnostic interest in the univariate analyses was 9.05% in the entire subset 1 and 16.38% in the subgroup with more severely depressed patients and matched controls. Corresponding π₁ of features based on the whole-brain analysis was 19.00% in the entire subset 1 and 26.18% in the subgroup analysis. p value histograms illustrating this deviation of univariate effects from a uniform distribution under the null hypothesis are shown in Online Resource 3.

Discussion

This study in a diverse unipolar depression and population control sample fails to confirm reports (Cao et al. 2014; Craddock et al. 2009; Guo et al. 2014; Lord et al. 2012; Ma et al. 2013; Qin et al. 2015; Yu et al. 2013; Zeng et al. 2012; Zeng et al. 2014) that the combination of MVPA and rs-fMRI facilitates the clinically reliable identification of individual MDD patients. This is particularly noteworthy, since the current investigation follows methodological principles (combinations of correlation-based FC analyses and SVMs) that have yielded particularly promising results in pilot studies in depression and have been commonly used in pilot studies of diverse mental disorders in recent years (Arbabshirani et al. 2017; Orru et al. 2012; Sundermann et al. 2014a).

Sample characteristics

Pilot studies on the combination of rs-fcMRI and MVPA in MDD (Cao et al. 2014; Craddock et al. 2009; Guo et al. 2014; Lord et al. 2012; Ma et al. 2013; Qin et al. 2015; Yu et al. 2013; Zeng et al. 2012; Zeng et al. 2014) as well as most work on diagnostic MVPA of fMRI data so far (Sundermann et al. 2014a) have adopted control groups of mostly young explicitly healthy subjects. In contrast, this analysis features controls from the general population (Teismann et al. 2014). Some recruitment bias can also be expected in general population samples like this (Heun et al. 1997). Even though subjects with signs of depression among controls as well as disorders potentially mimicking MDD in the depression cohort were excluded and data sets were balanced for potential general confounders, we aimed at keeping sample heterogeneity at its original level. Thus, compared to the above-detailed previous work, there is substantial heterogeneity in our data regarding age-related physical and mental comorbidity, medication in both groups, and levels of depressive symptoms as well as disease duration in the MDD group (Table 2). There is initial evidence that in addition to increasing the level of heterogeneity antidepressant medication may obscure typical FC alterations in depression beyond correlates of symptom reduction (McCabe and Mishor 2011). Thus, current antidepressant medication may limit the diagnostic power of MVPA (Qin et al. 2015). Moreover, the specific recruitment strategy for BiDirect (Teismann et al. 2014) and the matching procedure essential for unbiased estimates of classifier performance resulted in a comparatively older sample lacking female predominance compared to typical disease-onset MDD populations (Andrade et al. 2003). There is evidence that particularly age has a major effect on brain structure and function (Douaud et al. 2014) including measures typically derived from rs-fcMRI (Damoiseaux et al. 2008; Dosenbach et al. 2010). 
In addition, effects of cardiovascular disorders on depression pathogenesis and vice versa are expected to be more prevalent than in younger depression samples. This reflects the core objective of the underlying BiDirect study (Teismann et al. 2014).

Despite its inherent limitations, we believe that the data set used here represents clinical populations far better than the homogeneous MDD samples and explicitly healthy control groups in the previous studies and is, therefore, beneficial to rigorously assess the transferability of current MVPA approaches to routine care.

Significant positive results in the subgroup analysis with more severely depressed patients further support the assumption that sample heterogeneity is an important determinant of the ineffectiveness of this approach in this sample. Moreover, this indicates that these rs-fcMRI models in depression may be particularly sensitive to the current depressive state. This finding is in line with a recently reported failure to achieve above-chance diagnostic accuracies in subgroups other than one comprising patients with the highest current symptom severity (Ramasubbu et al. 2016). That study is, however, limited by its small sample size.

It remains unclear if these models are also capable of sufficiently representing longer term traits (Bhaumik et al. 2016; Graham et al. 2013; Qin et al. 2015), which may be more important for clinical decision making in this population. Nevertheless, results from the subgroup analysis confirm the general feasibility of this computational approach based on rs-fcMRI. However, even most successful models in the subgroup analysis did not reach sufficient diagnostic accuracies for clinical use.

Methodological aspects

We used two conceptually different approaches to extract relevant features from the original data. Compared to voxel-based FC analyses, our ROI-based approaches limit redundancy. Avoiding excessive numbers of features is supposed to improve the power of MVPA (Mwangi et al. 2014; Pereira et al. 2009). We have compared two different sets of ROIs: One set of regions optimized specificity by relying on prior knowledge about the effects of diagnostic interest (Sundermann et al. 2014b). Despite the potential advantage of a strictly hypothesis-driven approach, important discriminative features may be missed. We, therefore, additionally extracted features based on the entire gray matter (Craddock et al. 2012). The univariate post hoc analysis of information content confirms that both ROI definitions lead to informative feature sets.

First, SVM models were trained and tested based on the whole feature set of pairwise connections. In addition, we applied different automated FS methods to select the most discriminative features. Model parameters were systematically varied over a wide range at the exploratory stage of analyses, as such parameter optimization is mandatory to avoid under- or overfitting and to improve the generalizability of our results to comparable, yet still versatile MVPA approaches in MDD (Arbabshirani et al. 2017; Pereira et al. 2009). Please also note that if no successful model is identified in a wide search like this, poor classification performance is more likely caused by the underlying data not conveying sufficient information for the diagnostic question of interest, in a form that this commonly used family of computational models can capture, than by an unsuitable choice of parameters.

Successful modelling approaches in the subgroup analysis of patients with a pronounced depressive state were dominated by nonlinear models. This indicates a high complexity of the underlying information to be captured, i.e., beyond a pure summation of univariate information. As an observation (not directly amenable to statistical analyses), the feature set based on the previous knowledge about FC alterations in MDD led to superior models compared to features representing whole gray matter connectivity. However, in this setting, all successful SVM models required further FS, preferably based on SVMs themselves. In contrast to findings in the subgroup analysis, whole-brain features with powerful FS (MRMR) yielded slightly better diagnostic accuracies compared to whole-brain models with other types of FS or literature-informed features in the main analysis in the original more heterogeneous data set. However, as stated previously, the generality of this result is limited by its explorative nature as no approach reached sufficient accuracies in this diverse sample.

As there were no successful models in the first exploratory stage of the main analysis, there was no model to be validated in a second stage. The hold-out data set was, therefore, only used in the subgroup analysis with most severely depressed patients. While overall performance was slightly poorer in the validation set of the subgroup analysis, there was generally a good agreement and most results were either significant or exhibited a formal trend towards significance after correcting for multiple comparisons. These findings are in line with earlier work in this field, for example, the seminal paper by Craddock et al. (2009).

Potential limitations and future directions

Reliable estimates of FC can be obtained using relatively short runs of data acquisition (Van Dijk et al. 2010). According to recent evidence, prolonged acquisitions (even beyond clinically realistic timeframes) can, however, improve reliability depending on analysis techniques (Anderson et al. 2011; Birn et al. 2013). Determining the optimal scan duration for these particular analysis methods was not within the scope of this study. Therefore, it cannot be excluded that prolonged scanning may help reach sufficient diagnostic accuracies. All subjects were examined in a single center with a single MRI scanner. In a clinical context, it will be important that successful techniques can be adapted to different hardware.

The resting period in most subjects of this study followed an fMRI run with emotional faces, a popular paradigm in MDD research (Stuhrmann et al. 2011). Preceding tasks can influence quantitative estimates in rs-fcMRI (Pyka et al. 2013; Waites et al. 2005). During preparation of the training and validation sets, the percentage of subjects with this preceding task was balanced across MDD patients and controls by matching. This matching ensured that diagnostic decisions were not driven by this protocol variation. Although the preceding stimuli marginally limit generalizability to imaging protocols that integrate the resting period differently, they are not expected to have diminished diagnostic power.

FC analysis based on Pearson correlation is one of the most common approaches in rs-fcMRI (Margulies et al. 2010). Recently, sparse connectivity models have been proposed that were able to achieve higher diagnostic power than conventional correlation-based approaches (Rosa et al. 2015). Partial correlation and independent component analyses are further alternative strategies for extracting FC features (Margulies et al. 2010). Such approaches may be an important area of future research as long as computational expenses can be accommodated to clinical settings.
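
As a minimal sketch of two of the feature types mentioned above, the following example derives Pearson and partial correlation matrices from regional time series and vectorizes them into FC features. The simulated time series, the number of regions and the function names (`pearson_fc`, `partial_fc`, `upper_triangle_features`) are assumptions for illustration; this is not the preprocessing or feature extraction pipeline of the present study.

```python
import numpy as np

def pearson_fc(ts):
    """Pearson correlation between all region pairs.

    ts has shape (n_timepoints, n_regions).
    """
    return np.corrcoef(ts, rowvar=False)

def partial_fc(ts):
    """Partial correlation of each region pair given all other regions,
    obtained from the inverse covariance (precision) matrix."""
    precision = np.linalg.pinv(np.cov(ts, rowvar=False))
    d = np.sqrt(np.diag(precision))
    pc = -precision / np.outer(d, d)
    np.fill_diagonal(pc, 1.0)
    return pc

def upper_triangle_features(mat):
    """Vectorize the unique off-diagonal entries of a symmetric matrix."""
    iu = np.triu_indices_from(mat, k=1)
    return mat[iu]

# Simulated time series (hypothetical): region 0 drives region 1,
# which in turn drives region 2; regions 3 and 4 are independent noise.
rng = np.random.default_rng(0)
n_timepoints, n_regions = 200, 5
ts = rng.standard_normal((n_timepoints, n_regions))
ts[:, 1] += ts[:, 0]
ts[:, 2] += ts[:, 1]

fc_pearson = pearson_fc(ts)
fc_partial = partial_fc(ts)

# Fisher z-transform of the Pearson features, a common step before
# feeding correlations into a classifier.
features = np.arctanh(upper_triangle_features(fc_pearson))

print("Pearson r, regions 0 and 2:", round(float(fc_pearson[0, 2]), 2))
print("Partial r, regions 0 and 2:", round(float(fc_partial[0, 2]), 2))
print("Number of FC features:", features.size)
```

In this simulation, regions 0 and 2 show a clear Pearson correlation although they are linked only through region 1, whereas their partial correlation is close to zero. This illustrates why the choice of FC estimator can change which connections enter a diagnostic model.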

It has been proposed that clinically defined mental disorders, such as MDD, may in fact represent a diverse spectrum of brain disorders with important differences in underlying disease mechanisms (Krishnan 2014). This potential neurobiological source of heterogeneity in clinically defined samples can be the subject of future work that tries to optimize diagnostic strategies in mental disorders (Atluri et al. 2013). However, this also suggests that a clinical diagnosis of MDD is probably a weak reference (Hickie et al. 2013) for assessing new diagnostic tests. Yet, no other reference is currently available. The weaknesses of clinical signs and scores as a diagnostic reference also apply to our group definition approach: In particular, the adoption of a range of surrogate parameters to achieve a high sensitivity for excluding potentially depressed subjects from the original control cohort reduces the spectrum of psychiatric comorbidity in the final samples. Moreover, information about mental comorbidity and comorbidity related to older age is limited because it relies in part on self-report data. These factors limit conclusions about the differential diagnostic ability of this classification approach regarding other mental conditions associated with depression.

The current investigation aims at distinguishing MDD patients from non-depressed controls. Using this principal diagnosis as the key diagnostic question is a straightforward approach to identify classification methods and parameters viable in MDD. It has been followed in the majority of such pilot studies (Patel et al. 2016; Sundermann et al. 2014a). However, such tools have recently been applied to more specific clinical questions, such as prognostics or differential diagnoses (Fu et al. 2013; Hahn et al. 2015; Lener and Iosifescu 2015; Patel et al. 2016; Phillips et al. 2015; Qin et al. 2015; Schmaal et al. 2014; Sundermann et al. 2014a; van Waarde et al. 2015). Prediction of therapy response in MDD by biomarkers, including fMRI, is currently being investigated as part of prospective trials (Dunlop et al. 2012; Grieve et al. 2013; Kennedy et al. 2012; Trivedi et al. 2016; Williams et al. 2011). Beyond the fact that such diagnostic problems presumably involve more highly selected samples, they may in part rely on different neurobiological bases (Fu et al. 2013; Kupfer et al. 2012; Lener and Iosifescu 2015; Phillips et al. 2015). Therefore, the finding here that the combination of rs-fcMRI and MVPA does not generalize to a clinically more realistic population does not necessarily hold true for such more specific clinical questions. Hence, we propose that these treatment-related questions and adequate selection of patients to be examined (with disease severity as a crucial factor) are important lines of future research. Even if rs-fcMRI alone does not serve as a clinically reliable diagnostic biomarker, it may be included in MRI biomarkers in future studies, which may rely on integrated multimodal instead of unimodal information (Calhoun and Sui 2016; Douaud et al. 2013; Wee et al. 2012).

Another limitation is that information about potential clinical confounders, such as comorbidity or medication, cannot reasonably be handled directly in diagnostic models using common techniques, such as SVMs. We, therefore, believe that another direction of future research should be the development of more sophisticated computational models (Li et al. 2011) that are less influenced by clinical heterogeneity.

We would like to conclude by saying that we are aware from extensive discussions with other scientists that these sobering results may be very disappointing and even discouraging for clinical and non-clinical scientists working in this interdisciplinary field. However, we would like to stress that this paper should not be considered an objection to the work that people have put and will put into these kinds of methods. Rather, we would like this article to be understood as a note of caution to prevent a premature translation of “standard” methods in this emerging research area into clinical practice or large-scale prospective clinical trials for derivative medical products. In the context of this study, this statement particularly refers to the SVM family of classifiers. As discussed above (where directly related to our results), there are further new, potentially more sophisticated methods that are currently being tested or developed, or that are yet to be developed. Further lines of research cover the whole range from improved data acquisition (Ugurbil et al. 2013) through improved preprocessing strategies (Murphy et al. 2013; Salimi-Khorshidi et al. 2014) to new, more complex classes of classification methods, including deep learning (Arbabshirani et al. 2017; Plis et al. 2014; Sarraf and Tofighi 2016) and elastic net approaches (Bowman et al. 2016; Mwangi et al. 2014; Schouten et al. 2016; Zou and Hastie 2005). Therefore, our hope is that these results will motivate researchers to improve such techniques for diagnostic classification.

Conclusions

Straightforward combinations of classification methods that are seemingly established based on the results of small homogeneous samples do not translate to a diverse sample in a situation closer to daily life and thus to a potential clinical application. The sample size in this investigation limits the risk of false-positive results compared to earlier work and is, therefore, suitable to assess the reproducibility of such findings. Results of a subgroup analysis show that this methodological approach is feasible yet still not clinically reliable even in patients with a preeminent depressive state. This indicates that such MVPA approaches need to take the heterogeneity of clinical populations (including symptom severity) into account. Presumably, they also need to focus on more specific clinical questions, such as therapeutic outcomes, and on the improvement of data acquisition and analysis techniques.