Introduction

As the population ages, the world is facing an epidemic of dementia, the loss of mental functions such as memory, thinking, and reasoning that is severe enough to interfere with a person’s activities of daily living. Among the various causes of dementia, Alzheimer’s disease (AD) is the most prevalent in elderly people, and its proportion among causes of death has been rising significantly every year (Alzheimer’s Association 2012). Furthermore, it is reported that people with mild cognitive impairment (MCI), known as a precursor stage of AD dementia, progress to AD at an average conversion rate of 10 % per year (Busse et al. 2006; Alzheimer’s Association 2012). Although there is currently no pharmaceutical treatment that can restore patients with AD/MCI to cognitively normal (CN) status, it is still important to detect the diseases early, so that timely treatments can possibly delay their progression. Thus, AD/MCI diagnosis and prognosis are of great clinical interest.

With the advent of neuroimaging tools such as magnetic resonance imaging (MRI), positron emission tomography (PET), and functional MRI, many researchers have devoted their efforts to investigating the underlying biological or neurological mechanisms and to discovering biomarkers for AD/MCI diagnosis or prognosis (Li et al. 2012; Zhang and Shen 2012). Recent studies have shown that fusing information from multiple modalities can help enhance diagnostic performance (Perrin et al. 2009; Kohannim et al. 2010; Walhovd et al. 2010; Cui et al. 2011; Hinrichs et al. 2011; Zhang et al. 2011; Westman et al. 2012; Yuan et al. 2012; Zhang and Shen 2012; Suk et al. 2015). The main challenge in AD/MCI diagnosis or prognosis with neuroimaging arises from the fact that, while the data dimensionality is intrinsically high, only a small number of samples are generally available. In this regard, machine learning has played a pivotal role in overcoming this so-called “large p, small n” problem (West 2003). Broadly, we can categorize the existing methods into feature dimension-reduction approaches and feature selection approaches. A feature dimension-reduction approach transforms the original features in an ambient space into a lower dimensional subspace, while a feature selection approach finds informative features in the original space. In neuroimaging data analysis, feature selection techniques have recently drawn much attention due to the easy interpretability of their results. In this work, we focus on the feature selection approach.

Among different feature selection techniques, sparse (least squares) regression methods, e.g., \(\ell _{1}\)-penalized linear regression (Tibshirani 1994), \(\ell _{2,1}\)-penalized group sparse regression (Yuan and Lin 2006; Nie et al. 2010), and their variants (Roth 2004; Wang et al. 2011; Wan et al. 2012; Zhu et al. 2014), have attracted researchers because of their theoretical strengths and effectiveness in various applications (Varoquaux et al. 2010; Fazli et al. 2011; de Brecht and Yamagishi 2012; Yuan et al. 2012; Zhang and Shen 2012; Suk et al. 2015).

For example, Wang et al. proposed a sparse multi-task regression and feature selection method to jointly analyze neuroimaging and clinical data in predicting memory performance (Wang et al. 2011), where \(\ell _{1}\)- and \(\ell _{2,1}\)-norm regularizations were used for sparsity and for facilitating multi-task learning, respectively. Zhang and Shen exploited an \(\ell _{2,1}\)-norm based group sparse regression method to select features that could jointly represent the clinical status, e.g., AD, MCI, or CN, and two clinical scores, the Mini-Mental State Examination (MMSE) and the Alzheimer’s Disease Assessment Scale-Cognitive (ADAS-Cog) (Zhang and Shen 2012). Varoquaux et al. (2010) formulated subject-level functional connectivity estimation as a multivariate Gaussian process and imposed a group constraint for a common structure on the graphical model in the population. Suk et al. (2013) proposed a supervised discriminative group sparse representation to estimate functional connectivity from fMRI by penalizing a large within-class variance and a small between-class variance of features. Recently, Yuan et al. (2012), Xiang et al. (2014), and Thung et al. (2014) independently proposed sparse regression-based feature selection methods for AD/MCI diagnosis that maximally utilize features from multiple sources by focusing on a missing modality problem.

In the context of the data distribution, previous sparse regression methods mostly assumed a unimodal distribution within each group of subjects. However, due to the inter-subject variability within a group (Fotenos et al. 2005; Noppeney et al. 2006; DiFrancesco et al. 2008), neuroimaging data are highly likely to have a complex distribution, e.g., a mixture of Gaussians. To this end, Suk et al. (2014) recently proposed a subclass-based sparse multi-task learning method, in which they approximated the complex data distribution of each class by means of clustering and defined subclasses to better encompass the distributional characteristics in feature selection.

Note that the above-mentioned sparse regression methods find the optimal regression coefficients for the respective objective function in one step, i.e., a single hierarchy, using the training feature vectors as regressors. Since the training feature vectors comprise both informative and uninformative or less informative features, the resulting optimal regression coefficients are inevitably affected by the uninformative or less informative features. While the regularization terms drive the regression coefficients of the uninformative or less informative features to zero or close to zero, so that we can discard the corresponding features by thresholding, it remains problematic to find the optimal threshold for feature selection. As for the subclass-based feature selection method (Suk et al. 2014), the clustering is performed with the original full features. Therefore, the clustering results can also be affected by uninformative or less informative features, which in turn can influence the sparse multi-task learning, feature selection, and classification accuracy.

In this paper, we propose a deep sparse multi-task learning method that can mitigate the effect of uninformative or less informative features in feature selection. Specifically, we iteratively perform subclass-based sparse multi-task learning by discarding uninformative features in a hierarchical fashion. That is, in each hierarchy, we first cluster the current feature samples for each original class. Based on the clustering results, we then assign new label vectors and perform sparse multi-task learning with an \(\ell _{2,1}\)-norm regularization. It should be noted that, unlike the conventional multi-task learning methods, which treat all features equally, we further propose to utilize the optimal regression coefficients learned in the lower hierarchy as context information to weight features adaptively. We validate the effectiveness of the proposed method on the ADNI cohort by comparing with the state-of-the-art methods.

Our main contributions are threefold:

  • We propose a novel deep architecture that recursively discards uninformative features by performing sparse multi-task learning in a hierarchical fashion. The rationale of the proposed hierarchical feature selection is that, while the convex optimization algorithm finds the optimal regression coefficients, these coefficients are still affected by the less informative features. Therefore, if we discard uninformative features and perform the sparse multi-task learning iteratively, the optimal solution becomes more robust to less informative features, and thus better suited to selecting task-relevant features.

  • We also devise a weighted sparse multi-task learning that uses the optimal regression coefficients learned in one hierarchy as feature-adaptive weighting factors in the next, deeper hierarchy. In this way, we can adaptively assign different weights to different features in each hierarchy, and the features with small regression coefficients that survived in the lower hierarchy are penalized more heavily, and thus less likely to be selected, in the deeper hierarchy.

  • Motivated by Suk et al.’s work (2014), we also take into account the distributional characteristics of the samples in each class and define clustering-induced label vectors. That is, in each hierarchy, we define subclasses by clustering the training samples, but with only the feature set selected in the lower hierarchy, and then assign new label vectors. Taking these new label vectors as target response values, we perform the proposed weighted sparse multi-task learning.

Materials and image processing

Subjects

In this work, we use the ADNI cohort, but consider only the baseline MRI, 18-fluoro-deoxyglucose PET, and cerebrospinal fluid (CSF) data acquired from 51 AD, 99 MCI, and 52 CN subjects. The MCI subjects were clinically subdivided further into 43 progressive MCI (pMCI), who progressed to AD within 18 months, and 56 stable MCI (sMCI), who did not. We summarize the demographics of the subjects in Table 1.

Table 1 Demographic and clinical information of the subjects

With regard to the general eligibility criteria in ADNI, subjects were aged between 55 and 90 and had a study partner who could provide an independent evaluation of functioning. The general inclusion/exclusion criteria are as follows: (1) healthy subjects: Mini-Mental State Examination (MMSE) scores between 24 and 30 (inclusive), a Clinical Dementia Rating (CDR) of 0, non-depressed, non-MCI, and non-demented; (2) MCI subjects: MMSE scores between 24 and 30 (inclusive), a memory complaint, objective memory loss measured by education-adjusted scores on the Wechsler Memory Scale Logical Memory II, a CDR of 0.5, an absence of significant levels of impairment in other cognitive domains, essentially preserved activities of daily living, and an absence of dementia; and (3) mild AD subjects: MMSE scores between 20 and 26 (inclusive), a CDR of 0.5 or 1.0, and meeting the National Institute of Neurological and Communicative Disorders and Stroke and the Alzheimer’s Disease and Related Disorders Association (NINCDS/ADRDA) criteria for probable AD.

Image processing and feature extraction

The MRI images were preprocessed by applying the typical procedures of Anterior Commissure (AC)–Posterior Commissure (PC) correction, skull stripping, and cerebellum removal. Specifically, we used MIPAV software for AC–PC correction, resampled images to \(256 \times 256 \times 256\), and applied the N3 algorithm (Sled et al. 1998) to correct intensity inhomogeneity. An accurate and robust skull stripping (Wang 2014) was performed, followed by cerebellum removal. We further manually reviewed the skull-stripped images to ensure clean removal of the skull and dura. Then, FAST in the FSL package (Zhang et al. 2001) was used to segment the structural MRI images into three tissue types: gray matter (GM), white matter (WM), and CSF. We finally parcellated them into 93 regions of interest (ROIs) by warping Kabani et al.’s atlas (1998) to each subject’s space via HAMMER (Shen and Davatzikos 2002).

In this work, we considered only GM for classification because of its relatively high relevance to AD/MCI compared to WM and CSF (Liu et al. 2012). The PET images were rigidly aligned to the corresponding MRI images, and the parcellation was then propagated from the atlas by registration.

For each ROI, we used the GM tissue volume from MRI and the mean intensity from PET as features, both of which are widely used in the field for AD/MCI diagnosis (Davatzikos et al. 2011; Hinrichs et al. 2011; Zhang and Shen 2012; Suk et al. 2015). Therefore, we have 93 features from an MRI image and another 93 features from a PET image. In addition, we have three CSF biomarkers, A\(\beta _{42}\), t-tau, and p-tau, as features.
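To make the resulting representation concrete, a minimal sketch of assembling the multimodal feature matrix (the array names and random data are illustrative placeholders, not the actual pipeline outputs):

```python
import numpy as np

n_subjects = 202                       # 51 AD + 99 MCI + 52 CN
mri = np.random.rand(n_subjects, 93)   # GM tissue volume per ROI
pet = np.random.rand(n_subjects, 93)   # mean PET intensity per ROI
csf = np.random.rand(n_subjects, 3)    # A-beta42, t-tau, p-tau

X_mp = np.hstack([mri, pet])           # MRI + PET (MP): 186 features
X_mpc = np.hstack([mri, pet, csf])     # MRI + PET + CSF (MPC): 189 features
```

This simple concatenation is the fusion scheme used throughout the experiments (see “Experimental setting”).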

Method

Notations

In this paper, we denote matrices as boldface uppercase letters, vectors as boldface lowercase letters, and scalars as normal italic letters. For a matrix \({\mathbf {X}}= [x_{ij}]\), its i-th row and j-th column are denoted as \({\mathbf {x}}_{i}\) and \({\mathbf {x}}^{j}\), respectively. We further denote the Frobenius norm and the \(\ell _{2,1}\)-norm of a matrix \({\mathbf {X}}\) as \(\Vert {\mathbf {X}}\Vert _F = \sqrt{\sum _i\Vert {\mathbf {x}}_i\Vert _2^2} = \sqrt{\sum _j\Vert {\mathbf {x}}^j\Vert _2^2}\) and \(\Vert {\mathbf {X}}\Vert _{2,1} = \sum _i\Vert {\mathbf {x}}_i\Vert _2 = \sum _i{\sqrt{\sum _j x_{ij}^2}}\), respectively. Let \({\mathbf {1}}_{q}\) and \({\mathbf {0}}_{q}\) denote q-dimensional row vectors whose elements are all 1 and all 0, respectively, and let \(|\mathbb {F}|\) be the cardinality of a set \(\mathbb {F}\).
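As a quick check of these definitions, both matrix norms can be computed directly; a small numerical example:

```python
import numpy as np

X = np.array([[3.0, 4.0],
              [0.0, 0.0],
              [1.0, 2.0]])

fro = np.sqrt((X ** 2).sum())          # Frobenius norm: sqrt(30) ~= 5.477
l21 = np.linalg.norm(X, axis=1).sum()  # l_{2,1}: 5 + 0 + sqrt(5) ~= 7.236
```

Note that the \(\ell _{2,1}\)-norm sums row-wise \(\ell _{2}\)-norms, which is what induces row-wise (i.e., feature-wise) sparsity in the regression models below.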

Preliminary

Let \({\mathbf {X}}\in {\mathbb {R}}^{N\times D}\) and \({\mathbf {Y}}\in {\mathbb {R}}^{N\times C}\) denote, respectively, the D neuroimaging features and the corresponding class label vectors of N samples for C-class classification. In this work, without loss of generality, we represent a class label with a 0/1 encoding scheme. For example, in a binary classification problem, the class label of each training sample is represented by either \({\mathbf {o}}_{1}=\left[ \begin{matrix}1&0\end{matrix}\right]\) or \({\mathbf {o}}_{2}=\left[ \begin{matrix}0&1\end{matrix}\right]\). Although it is more common to use scalar values of \(+1/-1\) for a binary classification problem, in this work, for general applicability of the proposed method, we use a 0/1 encoding scheme, by which we can naturally apply our method to both binary and multi-class classification problems.

In the context of AD/MCI diagnosis, sparse (least squares) regression methods with different types of regularizers have been used for feature selection in neuroimaging data (Wang et al. 2011; Zhou et al. 2013; Suk et al. 2014; Zhu et al. 2014). The common assumption on these methods is that the target response values, which comprise the class labels in our work, can be predicted by a linear combination of the regressors, i.e., feature values in \({\mathbf {X}}\), as follows:

$$\min \limits_{\mathbf {W}} \left\| {\mathbf {Y}}-{\mathbf {XW}}\right\| _{F}^{2} + R({\mathbf {W}})$$
(1)

where \({\mathbf {W}}\in {\mathbb {R}}^{D\times C}\) is a regression coefficient matrix and \(R({\mathbf {W}})\) denotes a regularization function. Note that, since our main goal is to identify a clinical label based on the neuroimaging features, we constrain a common subset of features to be used in predicting the target values. In this regard, we can use an \(\ell _{2,1}\)-norm regularizer for \(R({\mathbf {W}})\) in Eq. (1) and define a group sparse regression model (Zhou et al. 2013) as follows:

$$\min \limits _{\mathbf {W}} {\left\| {\mathbf {Y}}-{\mathbf {XW}}\right\| _{F}^{2} + \lambda \left\| {\mathbf {W}}\right\| _{2,1}}$$
(2)

where \(\lambda\) denotes a group sparsity control parameter. By regarding the prediction of each target vector \({\mathbf{y}}^{i}\) (\(i\in \{1,\ldots ,C\}\)) as a task, we designate this as sparse multi-task learning (SMTL). Due to the use of an \(\ell _{2,1}\)-norm regularizer in Eq. (2), the estimated optimal coefficient matrix \(\hat{\mathbf {W}}\) will have some zero-valued row vectors, denoting that the corresponding features are not useful in predicting the target response variables, i.e., class labels. Furthermore, the lower the \(\ell _{2}\)-norm of a row vector, the less informative the corresponding feature in \({\mathbf {X}}\) is in representing the target response variables in \({\mathbf {Y}}\).
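In our experiments, Eq. (2) is optimized with the SLEP toolbox (see “Experimental setting”); purely as an illustration of what such a solver does, a minimal proximal-gradient (ISTA-style) sketch of the SMTL objective could look as follows (our sketch, not the authors’ implementation):

```python
import numpy as np

def prox_l21(W, t):
    # Proximal operator of t * ||W||_{2,1}: row-wise soft-thresholding.
    norms = np.linalg.norm(W, axis=1, keepdims=True)
    scale = np.maximum(0.0, 1.0 - t / np.maximum(norms, 1e-12))
    return W * scale

def smtl(X, Y, lam, n_iter=500):
    """Solve min_W ||Y - X W||_F^2 + lam * ||W||_{2,1} by proximal gradient."""
    D, C = X.shape[1], Y.shape[1]
    W = np.zeros((D, C))
    L = 2.0 * np.linalg.norm(X, 2) ** 2   # Lipschitz constant of the gradient
    for _ in range(n_iter):
        grad = 2.0 * X.T @ (X @ W - Y)
        W = prox_l21(W - grad / L, lam / L)
    return W
```

Rows of the returned W that are exactly zero correspond to discarded features; the surviving rows’ \(\ell _{2}\)-norms quantify feature relevance, as discussed above.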

Meanwhile, although neuroimaging data are highly variable among subjects of the same group, conventional sparse multi-task learning assumes a unimodal data distribution. That is, it overlooks the complicated distributional characteristics inherent in the samples, and thus can fail to select task-relevant features. In this regard, Suk et al. (2014) recently proposed a subclass-based sparse multi-task learning (S\(^{2}\)MTL) method. Specifically, they used a clustering method to discover the complex distributional characteristics and defined subclasses based on the clustering results. Then, they encoded the respective subclasses, i.e., clusters, with unique codes. Finally, by setting the codes as new label vectors of the training samples, they performed sparse multi-task learning as follows:

$$\begin{aligned} \hat{\mathbf{W}} = \mathop {{{\mathrm{argmin}}}}\limits _{\mathbf{W}}\left\| \tilde{\mathbf{Y}}-{\mathbf{X}}{\mathbf{W}}\right\| _{F}^{2}+\lambda \left\| {\mathbf{W}}\right\| _{2,1} \end{aligned}$$
(3)

where \(\tilde{\mathbf{Y}}\in \mathbb {R}^{N\times C'}\) denotes a new label matrix and \(C'\) is the total number of response variables, i.e., the number of original classes plus the total number of subclasses over all original classes.

Fig. 1 A framework for AD/MCI diagnosis with the proposed deep weighted subclass-based sparse multi-task learning (DW-S\(^{2}\)MTL) method

Deep weighted subclass-based sparse multi-task learning

The main limitation of the SMTL and S\(^{2}\)MTL methods is that they find the optimal regression coefficients and then select task-relevant features based on those coefficients in one step, i.e., a single hierarchy. However, uninformative or less informative features, which are also included among the regressors, can affect the search for the optimal regression coefficients in both Eqs. (2) and (3). Thus, the features selected in a single hierarchy may not be optimal for classification. To mitigate the effects of uninformative or less informative features in optimizing the coefficients and in selecting features, we propose a ‘deep weighted subclass-based sparse multi-task learning’ method. Specifically, rather than selecting features in one step, we iteratively discard uninformative features and perform sparse multi-task learning in a hierarchical fashion. In particular, we devise a novel sparse multi-task learning with a feature-adaptive weighting scheme under the hypothesis that the optimal regression coefficients reflect the relative importance of features in representing the target response variables. Motivated by Suk et al.’s work (2014), we also use the S\(^{2}\)MTL framework combined with the proposed feature weighting scheme to reflect the distributional characteristics inherent in the samples. Hereafter, we call the proposed method deep weighted S\(^{2}\)MTL (DW-S\(^{2}\)MTL).

Figure 1 illustrates the overall framework of our method for AD/MCI diagnosis. Given the multiple modalities of MRI, PET, and CSF, we extract features from MRI and PET after image preprocessing, as described in “Image processing and feature extraction”, and then concatenate the features of all modalities into a long vector for complementary information fusion. Using the concatenated features as regressors and the corresponding class labels as target response values, we perform the proposed DW-S\(^{2}\)MTL for feature selection. In this step, we (1) perform S\(^{2}\)MTL (clustering, label encoding, and multi-task learning), (2) select features based on the learned optimal regression coefficients, (3) train a classifier using the training samples but with only the selected features, and (4) compute the validation accuracy. If the validation accuracy is higher than the previous one (initially, we set the previous validation accuracy to zero), we iterate processes (1) through (4) in a hierarchical manner. That is, in the following hierarchy, we consider only the selected features along with the corresponding regression coefficients learned from the current hierarchy. Once converged, i.e., when there is no increase in the validation accuracy, we use the current feature set and the corresponding classifier to identify the clinical label of a testing sample.

Now, let us describe the proposed method in detail. Assume that, at the h-th hierarchy, we have the dimension-reduced training samples \(\tilde{\mathbf {X}}^{(h)}\in {\mathbb {R}}^{N\times |{\mathbb {F}}^{(h-1)}|}\), where \(\mathbb {F}^{(h-1)}\) denotes the set of features selected in the \((h-1)\)-th hierarchy, along with the corresponding class labels \({\mathbf {Y}}\). Regarding \(\tilde{\mathbf {X}}^{(h)}\) and \({\mathbf {Y}}\) as our current training samples, we perform clustering to find subclasses for each original class, by which we can capture the distributional characteristics of the samples.

Earlier, Suk et al. (2014) used the K-means algorithm for this purpose due to its simplicity and computational efficiency. However, since it requires the number of clusters, i.e., K, to be predefined, for which a cross-validation technique is usually applied in the literature, the K-means algorithm is of limited use in practical applications. To this end, in this work, we use affinity propagation (Frey and Dueck 2007), which can automatically select the optimal number of clusters and has been successfully applied to a variety of applications (Dueck and Frey 2007; Lu and Carreira-Perpinan 2008; Wang 2010; Shi et al. 2011; Alikhanian et al. 2013). For the details of affinity propagation, please refer to the Appendix and Frey and Dueck (2007).
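As a sketch, per-class clustering with scikit-learn’s affinity propagation (default preference and damping; the paper’s exact settings may differ) could be written as:

```python
import numpy as np
from sklearn.cluster import AffinityPropagation

def cluster_per_class(X, y):
    """Return, for each class c, the subclass assignment of its samples
    and the number K_c of clusters found by affinity propagation."""
    subclasses = {}
    for c in np.unique(y):
        ap = AffinityPropagation(random_state=0).fit(X[y == c])
        subclasses[c] = (ap.labels_, len(ap.cluster_centers_indices_))
    return subclasses
```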

After clustering samples in \(\tilde{\mathbf {X}}^{(h)}\) via affinity propagation, we define subclasses and assign a new label to each sample. Let us consider a binary classification problem and assume that affinity propagation finds \(K_{1}^{(h)}\) and \(K_{2}^{(h)}\) numbers of clusters/exemplars for class 1 and class 2, respectively. Note that we regard the clusters as subclasses of the original class. Then, we define sparse codes for subclasses of the original class 1 and the original class 2 as follows:

$$\begin{aligned} \left( {\mathbf{z}}_{l}^{(1)}\right) ^{(h)}= & {} \left[ \begin{array}{lll}{\mathbf {o}}_{1}&\left( {\mathbf{s}}_{l}^{(1)}\right) ^{(h)}&{\mathbf {0}}_{K_{2}^{(h)}}\end{array}\right] \\ \left( {\mathbf{z}}_{m}^{(2)}\right) ^{(h)}= & {} \left[ \begin{array}{lll}{\mathbf {o}_{2}}&{\mathbf {0}}_{K_{1}^{(h)}}&\left( {\mathbf{s}}_{m}^{(2)}\right) ^{(h)}\end{array}\right] \end{aligned}$$

where \({\mathbf {o}}_{1}=\left[ \begin{matrix}1&0\end{matrix}\right]\) and \({\mathbf {o}}_{2}=\left[ \begin{matrix}0&1\end{matrix}\right]\) denote the original class labels for class 1 and class 2, respectively, \(l\in \{1,\ldots , K_{1}^{(h)}\}\), \(m\in \{1,\ldots , K_{2}^{(h)}\}\), and \(({\mathbf{s}}_{l}^{(1)})^{(h)}\in \{0, 1\}^{K_{1}^{(h)}}\) and \(({\mathbf{s}}_{m}^{(2)})^{(h)}\in \{0, 1\}^{K_{2}^{(h)}}\) denote subclass-indicator row vectors in which only the l-th/m-th element is set to 1 and the others are 0. Thus, the full label set for binary classification becomes:

$$\begin{aligned} \mathbb {Z}_{1:2}^{(h)}=\left\{ \begin{array}{l} \left( {\mathbf{z}}_{1}^{(1)}\right) ^{(h)}, \ldots , \left( {\mathbf{z}}_{l}^{(1)}\right) ^{(h)}, \ldots , \left( {\mathbf{z}}_{K_{1}^{(h)}}^{(1)}\right) ^{(h)}, \\ \left( {\mathbf{z}}_{1}^{(2)}\right) ^{(h)}, \ldots , \left( {\mathbf{z}}_{m}^{(2)}\right) ^{(h)}, \ldots , \left( {\mathbf{z}}_{K_{2}^{(h)}}^{(2)}\right) ^{(h)} \end{array} \right\} . \end{aligned}$$
(4)

Now, without loss of generality, based on Eq. (4), we can extend the full label set for C-class classification as follows:

$$\begin{aligned} \mathbb {Z}_{1:C}^{(h)}=\left\{ \begin{array}{c} \left( {\mathbf{z}}_{1}^{(1)}\right) ^{(h)}, \cdots , \left( {\mathbf{z}}_{l}^{(1)}\right) ^{(h)}, \cdots , \left( {\mathbf{z}}_{K_{1}^{(h)}}^{(1)}\right) ^{(h)}, \\ \vdots \\ \left( {\mathbf{z}}_{1}^{(c)}\right) ^{(h)}, \cdots , \left( {\mathbf{z}}_{m}^{(c)}\right) ^{(h)}, \cdots , \left( {\mathbf{z}}_{K_{c}^{(h)}}^{(c)}\right) ^{(h)}, \\ \vdots \\ \left( {\mathbf{z}}_{1}^{(C)}\right) ^{(h)}, \cdots , \left( {\mathbf{z}}_{p}^{(C)}\right) ^{(h)}, \cdots , \left( {\mathbf{z}}_{K_{C}^{(h)}}^{(C)}\right) ^{(h)} \end{array} \right\} \end{aligned}$$
(5)

where \(({\mathbf{z}}_{m}^{(c)})^{(h)}= \left[ \begin{array}{llllll} {\mathbf {o}}_{c}&{\mathbf {0}}_{K_{1}^{(h)}}&\cdots&\left( {\mathbf{s}}_{m}^{(c)}\right) ^{(h)}&\cdots&{\mathbf {0}}_{K_{C}^{(h)}}\end{array}\right] \in \{0,1\}^{\left( C+\sum _{c=1}^{C}K_{c}^{(h)}\right) }\) and \({\mathbf {o}}_{c}\) is an original class indicator row vector. Then, for the n-th training sample \((\tilde{\mathbf {x}}_{n})^{(h)}\) at the h-th hierarchy, if it belongs to the original class c and is assigned to cluster m of that class, its new label vector \(({\tilde{\mathbf{y}}_{n}})^{(h)}\) is set to \(({\mathbf {z}}_{m}^{(c)})^{(h)}\).
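A small sketch of this clustering-induced label encoding (0-based indices; the helper name is ours, not the paper’s):

```python
import numpy as np

def encode_label(c, m, K, C):
    """Build z_m^(c): original class indicator o_c followed by one
    subclass-indicator block per class (K[c] slots for class c)."""
    z = np.zeros(C + sum(K))
    z[c] = 1.0                    # o_c: original class indicator
    offset = C + sum(K[:c])       # start of class c's subclass block
    z[offset + m] = 1.0           # s_m^(c): subclass indicator
    return z

# e.g. C=2, K=[3, 2], class 0, cluster 1 -> [1 0 | 0 1 0 | 0 0]
```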

By regarding the newly assigned label vectors \(\{(\tilde{\mathbf {y}}_{n})^{(h)}\}_{n=1}^{N}\) as target response values, i.e., \(\tilde{\mathbf{Y}}^{(h)}=\left[ \left( \tilde{\mathbf {y}}_{1}\right) ^{(h)}; \cdots ; \left( \tilde{\mathbf {y}}_{N}\right) ^{(h)}\right] \in \mathbb {R}^{N\times \left( C+\sum _{c=1}^{C}K_{c}^{(h)}\right) }\), we can learn the regression coefficients of an S\(^{2}\)MTL model in Eq. (3). Here, it is noteworthy that the \(\ell _{2}\)-norm of a row vector in an optimal regression coefficient matrix quantifies the relevance of the corresponding feature in representing the target response variables. In our deep architecture, we use such context information to adaptively weight the selected features in the upper hierarchy. Specifically, we devise a novel weighted sparse multi-task learning method by exploiting the optimal regression coefficients learned in the lower hierarchy as feature weighting factors. We define an adaptive feature weighting vector at the h-th hierarchy as follows:

$$\begin{aligned} {\varvec{\delta }}^{(h)}=\left\{ \begin{array}{ll} {\mathbf {1}}_{\left| \mathbb {F}^{(h-1)}\right| }-\frac{1}{Z}\left[ \left\| \hat{\mathbf{w}}_{1}^{(h-1)}\right\| _{2}, \cdots , \left\| \hat{\mathbf{w}}_{\left| \mathbb {F}^{(h-1)}\right| }^{(h-1)}\right\| _{2}\right] &{} (h\ne 1)\\ \frac{1}{\left| \mathbb {F}^{(0)}\right| }{\mathbf {1}}_{\left| \mathbb {F}^{(0)}\right| } &{} (h=1) \end{array} \right. \end{aligned}$$
(6)

where \(Z=\sum _{i=1}^{|\mathbb {F}^{(h-1)}|}\Vert \hat{\mathbf{w}}_{i}^{(h-1)}\Vert _{2}\) is a normalizing constant. In our adaptive feature weighting scheme in Eq. (6), the higher the \(\ell _{2}\)-norm of the optimal regression coefficient vector \(\hat{\mathbf {w}}_{i}^{(h-1)}\), the smaller the weight assigned to the i-th feature. By introducing this feature-adaptive weighting factor into the regularization term of a sparse regression model, we impose that, in the upper hierarchy, the features with high \(\ell _{2}\)-norm values from the lower hierarchy also have high regression coefficients, while those with low \(\ell _{2}\)-norm values from the lower hierarchy have low regression coefficients and ultimately become zero, to be discarded. Thus, we formulate a weighted sparse multi-task learning method as follows:

$$\begin{aligned} \hat{\mathbf{W}}^{(h)} = \mathop {{{\mathrm{argmin}}}}\limits _{\mathbf{W}^{(h)}}\left\| \tilde{\mathbf{Y}}^{(h)}-\tilde{\mathbf{X}}^{(h)}{\mathbf{W}}^{(h)}\right\| _{F}^{2}+\lambda ^{(h)}\left\| \varvec{\Delta }^{(h)}\odot {\mathbf{W}}^{(h)}\right\| _{2,1} \end{aligned}$$
(7)

where \({\mathbf{W}}^{(h)}\in \mathbb {R}^{|\mathbb {F}^{(h-1)}|\times (C+\sum _{c=1}^{C}K_{c}^{(h)})}\), \({\varvec{\Delta }}^{(h)}=({\varvec{\delta }}^{(h)})^{T} {\mathbf {1}}_{(C+\sum _{c=1}^{C}K_{c}^{(h)})}\), and \(\odot\) denotes element-wise matrix multiplication. Note that the feature weights defined in Eq. (6) are used to guide the selection of informative features in the current hierarchy by adaptively adjusting the penalty levels of different features. That is, by giving small weights to the features that are informative in representing the target responses, we encourage the corresponding regression coefficients to be larger, and thus those features to survive feature selection. We should note that, since we use class labels as target responses, features with low regression coefficients would have low discriminative power for classifying the respective classes. In this regard, the proposed method can effectively remove such features through its hierarchical learning.
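To make both steps concrete, a sketch follows: the Eq. (6) weights computed from the previous hierarchy’s coefficients, and a solver for Eq. (7) obtained via the substitution \({\mathbf {V}} = \mathrm{diag}({\varvec{\delta }}){\mathbf {W}}\), which turns the weighted penalty into a standard \(\ell _{2,1}\)-norm so that the smtl sketch above can be reused (our sketch; assumes strictly positive weights):

```python
import numpy as np

def feature_weights(W_prev):
    # Eq. (6), h != 1: features with large row-wise l2-norms (informative)
    # receive small weights, i.e., weak penalties, in the next hierarchy.
    norms = np.linalg.norm(W_prev, axis=1)
    return 1.0 - norms / norms.sum()

def weighted_smtl(X, Y, lam, delta, n_iter=500):
    # Eq. (7): with V = diag(delta) W, ||Delta (.) W||_{2,1} = ||V||_{2,1},
    # so a standard SMTL problem is solved on a rescaled design matrix.
    X_scaled = X / delta[np.newaxis, :]   # X diag(1/delta)
    V = smtl(X_scaled, Y, lam, n_iter)    # smtl() from the sketch above
    return V / delta[:, np.newaxis]       # map back: W = diag(1/delta) V
```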

Based on the optimal regression coefficients \(\hat{\mathbf {W}}^{(h)}\), we select the features whose regression coefficient vector is non-zero, i.e., \(\Vert (\hat{\mathbf{w}}_{i})^{(h)}\Vert _{2}>0\). With the selected features, we train a linear support vector machine (SVM), which has been successfully used in many applications (Zhang and Shen 2012; Suk and Lee 2013), and then compute the accuracy on the validation samples. If the validation accuracy is higher than the accuracy in the lower hierarchy, we move to the next level of the hierarchy to further filter out uninformative features (if any exist) and thus reduce the dimensionality; otherwise, we stop building the hierarchy. Algorithm 1 summarizes the overall procedure of the proposed DW-S\(^{2}\)MTL method for feature selection.

Algorithm 1 The overall procedure of the proposed DW-S\(^{2}\)MTL method for feature selection
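In code, the outer loop of Algorithm 1 might be sketched as follows, reusing weighted_smtl and feature_weights from above; build_labels (clustering plus label encoding) and train_svm are hypothetical helpers standing in for the steps described earlier:

```python
import numpy as np

def dw_s2mtl(X_tr, y_tr, X_val, y_val, lam):
    selected = np.arange(X_tr.shape[1])             # F^(0): all features
    delta = np.ones(selected.size) / selected.size  # uniform weights (h = 1)
    best_acc, best_set = 0.0, selected
    while True:
        Xh = X_tr[:, selected]
        Y_tilde = build_labels(Xh, y_tr)            # clustering + new labels
        W = weighted_smtl(Xh, Y_tilde, lam, delta)
        keep = np.linalg.norm(W, axis=1) > 0        # non-zero coefficient rows
        if not keep.any():
            break
        selected = selected[keep]
        clf = train_svm(X_tr[:, selected], y_tr)    # linear SVM on survivors
        acc = clf.score(X_val[:, selected], y_val)
        if acc <= best_acc:                         # no improvement: converged
            break
        best_acc, best_set = acc, selected          # accept current feature set
        delta = feature_weights(W[keep])            # Eq. (6), next hierarchy
    return best_set, best_acc
```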
Fig. 2 Schematic illustration of the proposed deep weighted subclass-based sparse multi-task learning for feature selection. \(f(\tilde{\mathbf{Y}}^{(h)}, \tilde{\mathbf{X}}^{(h)}, \varvec{\delta }^{(h)})=\Vert \tilde{\mathbf{Y}}^{(h)}-\tilde{\mathbf{X}}^{(h)}\mathbf{W}^{(h)}\Vert _{F}^{2}+\lambda ^{(h)}\Vert \varvec{\Delta }^{(h)}\odot {\mathbf{W}}^{(h)}\Vert _{2,1}\) denotes the objective function in Eq. (7), \(\varvec{\delta }^{(h)}\) is defined by Eq. (6), and \(a^{(h)}\) (\(a^{(0)}=0\)) and \(\mathbb {F}^{(h)}\) denote, respectively, the validation accuracy and the set of selected features at the h-th hierarchy

For better understanding, in Fig. 2, we present an example of applying the proposed DW-S\(^{2}\)MTL for feature selection in binary classification. In the 1st hierarchy, we have the training feature samples \(\tilde{\mathbf{X}}^{(1)}\) and the new label vectors \(\tilde{\mathbf{Y}}^{(1)}\) determined by clustering. In this hierarchy, since we have no prior weight information on the features, we treat all the features equally by setting \(\varvec{\delta }^{(1)}=\frac{1}{|\mathbb {F}^{(0)}|}{\mathbf {1}}_{|\mathbb {F}^{(0)}|}\). Note that the optimization problem in this hierarchy corresponds to S\(^{2}\)MTL (Suk et al. 2014). Based on the learned optimal regression coefficients \(\hat{\mathbf {W}}^{(1)}\), we select a feature set \(\mathbb {F}^{(1)}\) and define \(\varvec{\delta }^{(2)}\) by Eq. (6). Using the values of the selected features in \(\tilde{\mathbf {X}}^{(1)}\) and the original class labels \({\mathbf {Y}}\), we train a linear SVM and compute the classification accuracy \(a^{(1)}\) on a validation set. If \(a^{(1)}\) is greater than \(a^{(0)}(=0)\), we set \(\hat{\mathbb {F}}=\mathbb {F}^{(1)}\) and the algorithm proceeds to the next hierarchy. For the 2nd hierarchy, we construct our feature samples \(\tilde{\mathbf {X}}^{(2)}\) from \(\tilde{\mathbf {X}}^{(1)}\) with only the selected features of \(\mathbb {F}^{(1)}\), and define new label vectors \(\tilde{\mathbf {Y}}^{(2)}\) via clustering for each original class with the feature samples in \(\tilde{\mathbf {X}}^{(2)}\). We then learn the optimal regression coefficients \(\hat{\mathbf {W}}^{(2)}\) by solving Eq. (7) with \(\tilde{\mathbf {Y}}^{(2)}\), \(\tilde{\mathbf {X}}^{(2)}\), and \(\varvec{\delta }^{(2)}\) as inputs. Again, we select a feature set \(\mathbb {F}^{(2)}\) based on \(\hat{\mathbf {W}}^{(2)}\), and train a linear SVM with the feature samples of \(\tilde{\mathbf {X}}^{(2)}\), restricted to the features in \(\mathbb {F}^{(2)}\), and the original class labels \(\mathbf {Y}\). With the trained SVM, we compute the classification accuracy \(a^{(2)}\) on the validation set. If the current validation accuracy \(a^{(2)}\) is higher than \(a^{(1)}\), we update our optimal feature set \(\hat{\mathbb {F}}=\mathbb {F}^{(2)}\), compute the feature weights \(\varvec{\delta }^{(3)}\), and proceed to the 3rd hierarchy.

In a nutshell, in the h-th hierarchy, we sequentially perform the steps of (1) clustering the samples to define subclasses and assigning a new label to each sample, (2) learning the optimal regression coefficients \(\hat{\mathbf{W}}^{(h)}\) by taking into account the features selected in the \((h-1)\)-th hierarchy and the regression coefficients \(\hat{\mathbf{W}}^{(h-1)}\), (3) selecting an informative feature set based on \(\hat{\mathbf{W}}^{(h)}\), (4) reorganizing the training and validation samples by discarding the unselected features, and (5) training an SVM classifier and computing the validation accuracy \(a^{(h)}\). If the current validation accuracy is higher than the previous one, i.e., \(a^{(h-1)}\), which means that the current feature set is better suited for classification than the previous one, we repeat steps (1) to (5) until convergence, i.e., until there is no improvement in the validation accuracy. Note that the number of features under consideration gradually decreases as the algorithm advances to higher levels of the hierarchy, with the respective feature weights determined from the optimal regression coefficients one level below.

Experimental results

In this section, we validate the effectiveness of the proposed deep weighted subclass-based sparse multi-task learning for feature selection in AD/MCI diagnosis. We conducted two sets of experiments, namely, binary and multi-class classification problems. For the binary classification, we considered three tasks: (1) AD vs. CN, (2) MCI vs. CN, and (3) progressive MCI (pMCI), who converted to AD within 18 months, vs. stable MCI (sMCI), who did not convert to AD within 18 months. Meanwhile, for the multi-class classification, we performed two tasks: (1) AD vs. MCI vs. CN (3-class) and (2) AD vs. pMCI vs. sMCI vs. CN (4-class). In the classifications of MCI vs. CN (binary) and AD vs. MCI vs. CN (3-class), we labeled both pMCI and sMCI as MCI.

Experimental setting

For performance comparison, we consider five competing methods as follows:

  • Sparse multi-task learning (SMTL) (Zhou et al. 2013) that assumes a unimodal data distribution and selects features in a single hierarchy.

  • Subclass-based SMTL (S\(^{2}\)MTL) (Suk et al. 2014) that takes into account a complex data distribution and selects features in a single hierarchy.

  • Deep weighted SMTL (DW-SMTL) that assumes a unimodal data distribution and selects features in a hierarchical fashion using the proposed deep sparse multi-task learning with a feature weighting scheme.

  • Deep S\(^{2}\)MTL (D-S\(^{2}\)MTL) that takes into account a complex data distribution and also selects features in a hierarchical fashion using the proposed deep sparse multi-task learning but without a feature weighting scheme.

  • Deep weighted S\(^{2}\)MTL (DW-S\(^{2}\)MTL) that takes into account a complex data distribution and also selects features in a hierarchical fashion using the proposed deep sparse multi-task learning with a feature weighting scheme.

For the S\(^{2}\)MTL method, unlike the original work in Suk et al. (2014), we used affinity propagation to define subclasses for a fair comparison with D-S\(^{2}\)MTL and DW-S\(^{2}\)MTL. It should be noted that the main differences among the competing methods lie in methodological characteristics such as the assumed data distribution (unimodal or complex), the number of hierarchies (single or multiple), and the use of context information, i.e., feature weights. We compare their characteristics in Table 2.

Table 2 Characteristics of the competing methods considered in our experiments

Due to the limited number of samples, we evaluated the performance of all the competing methods by applying a tenfold cross-validation technique to each classification problem and taking the average of the results. Specifically, we randomly partitioned the samples of each class into 10 subsets of approximately equal size without replacement. We then used 9 of the 10 subsets for training and the remaining one for testing. We repeated this process 10 times. It is noteworthy that, for a fair comparison among the competing methods, we used the same training and testing samples in our cross-validation.

Regarding model selection of the sparsity control parameter \(\lambda\) in the sparse regression models and the soft margin parameter C in the SVM (Burges 1998), we defined the parameter spaces as \(\lambda \in \{0.001, 0.005, 0.01, 0.05, 0.1, 0.3, 0.5\}\) and \(C \in \{2^{-10}, \dots , 2^{5}\}\), and performed a grid search. The parameters that achieved the best classification accuracy in the inner cross-validation were finally used in testing. In our implementation, we used the SLEP toolbox for optimization of the respective objective functions and the LIBSVM toolbox for SVM classifier learning. For the multi-class classification, we applied a one-versus-all strategy (Milgram et al. 2006) and chose the class that classified the test sample with the greatest margin.
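A sketch of this inner-loop grid search (select_features is a hypothetical stand-in for whichever feature-selection model is being tuned, e.g., the dw_s2mtl sketch above):

```python
import numpy as np
from sklearn.svm import SVC

def grid_search(X_tr, y_tr, X_val, y_val):
    lambdas = [0.001, 0.005, 0.01, 0.05, 0.1, 0.3, 0.5]
    Cs = 2.0 ** np.arange(-10, 6)                 # C in {2^-10, ..., 2^5}
    best_acc, best = -np.inf, None
    for lam in lambdas:
        feats = select_features(X_tr, y_tr, lam)  # hypothetical helper
        for C in Cs:
            clf = SVC(kernel="linear", C=C).fit(X_tr[:, feats], y_tr)
            acc = clf.score(X_val[:, feats], y_val)
            if acc > best_acc:
                best_acc, best = acc, (lam, C, feats)
    return best
```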

We used 93 MRI features, 93 PET features, and/or 3 CSF features as regressors in all the competing methods. Regarding the multimodality neuroimaging fusion, e.g., MRI + PET (MP for short) and MRI + PET + CSF (MPC for short), we constructed a long feature vector by concatenating features of the modalities.

Performance comparison

Let TP, TN, FP, and FN denote, respectively, true positive, true negative, false positive, and false negative. We considered the following metrics to measure the performance of the methods:

  • ACCuracy (ACC) = (TP + TN)/(TP + TN + FP + FN)

  • SENsitivity (SEN) = TP/(TP + FN)

  • SPECificity (SPEC) = TN/(TN + FP)

  • Balanced ACcuracy (BAC) = (SEN + SPEC)/2

  • Positive Predictive Value (PPV) = TP/(TP+FP)

  • Negative Predictive Value (NPV) = TN/(TN+FN)

Accuracy, which counts the number of correctly classified samples in a test set, is the most direct metric for comparing methods. Regarding sensitivity and specificity, the higher their values, the lower the chance of misdiagnosis with respect to the corresponding clinical label.
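For reference, the six metrics collected in one helper:

```python
def metrics(tp, tn, fp, fn):
    """Evaluation metrics from the confusion counts defined above."""
    sen = tp / (tp + fn)                          # sensitivity
    spec = tn / (tn + fp)                         # specificity
    return {
        "ACC": (tp + tn) / (tp + tn + fp + fn),
        "SEN": sen,
        "SPEC": spec,
        "BAC": (sen + spec) / 2,                  # balanced accuracy
        "PPV": tp / (tp + fp),
        "NPV": tn / (tn + fn),
    }
```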

Note that in our dataset, since the number of samples available for each class is imbalanced, performance estimates are likely to be inflated for two binary classification tasks, i.e., MCI (99) vs. CN (52) and pMCI (43) vs. sMCI (56), and one multi-class classification task, i.e., AD (51) vs. MCI (99) vs. CN (52). For this reason, we also considered the balanced accuracy and the positive/negative predictive values (Wei and Dunbrack 2013).

Table 3 A summary of the performances for AD vs. CN classification
Table 4 A summary of the performances for MCI vs. CN classification
Table 5 A summary of the performances for pMCI vs. sMCI classification

Binary classification results

We summarize the performances of the competing methods with various modalities in Tables 3, 4, and 5. In discriminating between AD and CN (Table 3), SMTL achieved ACCs of 86.55 % (MRI), 80.45 % (PET), 87.64 % (MP), and 92.45 % (MPC), while S\(^{2}\)MTL achieved ACCs of 86.55 % (MRI), 85.36 % (PET), 93.18 % (MP), and 92.36 % (MPC). When applying the proposed deep and feature-adaptive weighting scheme to these methods, we obtained ACCs of 88.36 % (MRI), 82.45 % (PET), 90.45 % (MP), and 92.45 % (MPC) by DW-SMTL and ACCs of 90.36 % (MRI), 89.27 % (PET), 93.18 % (MP), and 95.09 % (MPC) by DW-S\(^{2}\)MTL. Note that, thanks to the proposed deep and feature-adaptive weighting scheme, the ACCs improved by 1.81 % (MRI), 2 % (PET), and 2.81 % (MP) between SMTL and DW-SMTL, and by 3.81 % (MRI), 3.91 % (PET), and 2.73 % (MPC) between S\(^{2}\)MTL and DW-S\(^{2}\)MTL. Regarding the proposed feature weighting scheme, we could also verify its effectiveness by comparing D-S\(^{2}\)MTL and DW-S\(^{2}\)MTL. Overall, the proposed DW-S\(^{2}\)MTL outperformed the other four competing methods. It is worth noting that, since discriminating between AD and CN is relatively easier than the other classification tasks described below, all the competing methods achieved good performance with the fused MPC data, i.e., higher than 90 % in accuracy, so there is no substantial difference among the competing methods on this task.

For the task of MCI vs. CN classification (Table 4), the proposed DW-S\(^{2}\)MTL achieved the best ACCs of 77.57 % (MRI), 74.90 % (PET), 80.11 % (MP), and 78.77 % (MPC), while D-S\(^{2}\)MTL/DW-SMTL achieved ACCs of 68.85/68.89 % (MRI), 68.89/64.31 % (PET), 70.98/70.94 % (MP), and 68.98/72.77 % (MPC). Meanwhile, SMTL/S\(^{2}\)MTL achieved ACCs of 70.90/70.32 % (MRI), 64.98/67.90 % (PET), 66.76/69.65 % (MP), and 68.32/67.02 % (MPC), respectively. By applying the proposed deep and feature-adaptive weighting scheme, DW-SMTL improved the ACCs by 4.18 % (MP) and 4.45 % (MPC) compared to SMTL. It is remarkable that, compared to S\(^{2}\)MTL, DW-S\(^{2}\)MTL improved the ACCs by 7.25 % (MRI), 7 % (PET), 10.46 % (MP), and 11.75 % (MPC).

Lastly, in the classification of pMCI vs. sMCI (Table 5), which is clinically the most important because timely symptomatic treatment can potentially delay progression (Francis et al. 2010), DW-S\(^{2}\)MTL again outperformed the other competing methods, and the proposed deep and feature-adaptive weighting scheme helped improve the accuracies of both SMTL and S\(^{2}\)MTL. Concretely, we obtained ACCs of 69.84 % (MRI), 65.71 % (PET), 74.15 % (MP), and 73.04 % (MPC) by DW-S\(^{2}\)MTL, and ACCs of 63.71/55.46 % (MRI), 55.25/54.12 % (PET), 67.82/56.71 % (MP), and 70.73/58.56 % (MPC) by D-S\(^{2}\)MTL/DW-SMTL. In comparison between S\(^{2}\)MTL and DW-S\(^{2}\)MTL, the improvements were 8.84 % (MRI), 7.84 % (PET), 8.82 % (MP), and 6 % (MPC). It is also noteworthy that the subclass-based methods, i.e., S\(^{2}\)MTL and DW-S\(^{2}\)MTL, which encompass the characteristics of a complex distribution, were superior to both SMTL and DW-SMTL, which assume a unimodal data distribution.

Fig. 3 Performance comparison on two multi-class classification problems. (AVG average of accuracies over different modalities, SMTL sparse multi-task learning, S\(^{2}\)MTL subclass-based SMTL, DW-SMTL deep weighted SMTL, D-S\(^{2}\)MTL deep S\(^{2}\)MTL, DW-S\(^{2}\)MTL deep weighted S\(^{2}\)MTL)

Multi-class classification results

From a clinical standpoint, although there exist multiple stages in the spectrum between AD and CN, previous work has mostly focused on binary classification problems. To consider more practical applications, we also performed multi-class classification experiments. Note that no change in our framework is required for multi-class classification, except for the class labels.

Figure 3 summarizes the performances on the two multi-class classification tasks. As in the binary classification results, we observed that the proposed DW-S\(^{2}\)MTL method outperformed the competing methods on both the three-class and four-class classification tasks. Concretely, in three-class classification, SMTL achieved ACCs of 50.10 % (MRI), 49.52 % (PET), 54.57 % (MP), and 58.55 % (MPC), and DW-SMTL achieved ACCs of 50.10 % (MRI), 51.50 % (PET), 56.52 % (MP), and 58.55 % (MPC). Meanwhile, DW-S\(^{2}\)MTL achieved 55.50 % (MRI), 53.50 % (PET), 62.43 % (MP), and 62.93 % (MPC). In four-class classification, the maximal ACC of 53.72 % was produced by the proposed DW-S\(^{2}\)MTL method with MPC data, improving the ACC by 9.08 % (vs. SMTL), 8.63 % (vs. DW-SMTL), 11.22 % (vs. S\(^{2}\)MTL), and 12.21 % (vs. D-S\(^{2}\)MTL), respectively.

Classification results on a large MRI dataset

Since research on AD/MCI diagnosis or prognosis has focused mostly on MRI, we further performed experiments with a larger set of MRI data. Specifically, we considered 805 subjects: 198 AD, 167 pMCI, 236 sMCI, and 229 CN. With this larger dataset, we conducted experiments on the same tasks as considered above. The classification accuracies and the respective standard deviations are presented in Fig. 4. In all classification tasks, the proposed DW-S\(^{2}\)MTL clearly surpassed the other four competing methods, achieving ACCs of 90.27 % (AD vs. CN), 70.86 % (MCI vs. CN), 73.93 % (pMCI vs. sMCI), 57.74 % (AD vs. MCI vs. CN), and 47.83 % (AD vs. pMCI vs. sMCI vs. CN), respectively.

Fig. 4 Performance comparison on a large MRI dataset from ADNI. (SMTL sparse multi-task learning, S\(^{2}\)MTL subclass-based SMTL, DW-SMTL deep weighted SMTL, D-S\(^{2}\)MTL deep S\(^{2}\)MTL, DW-S\(^{2}\)MTL deep weighted S\(^{2}\)MTL)

Discussions

Based on our experiments in binary and multi-class classification, we observed two interesting results: (1) comparing SMTL with S\(^{2}\)MTL and DW-SMTL with DW-S\(^{2}\)MTL, the subclass-based approaches, i.e., S\(^{2}\)MTL and DW-S\(^{2}\)MTL, outperformed their respective counterparts, i.e., SMTL and DW-SMTL; (2) the proposed deep sparse multi-task learning method with a feature-adaptive weighting scheme helped enhance the diagnostic accuracies, i.e., DW-SMTL outperformed SMTL, and DW-S\(^{2}\)MTL outperformed both S\(^{2}\)MTL and D-S\(^{2}\)MTL. In this section, we discuss the results further from various perspectives.

Table 6 A summary of Henze–Zirkler’s multivariate normality test on our dataset

Data distributions

In our experiments, the subclass-based methods, i.e., S\(^{2}\)MTL and DW-S\(^{2}\)MTL, were superior to their respective counterparts, i.e., SMTL and DW-SMTL. To justify these results, we performed Henze–Zirkler’s multivariate normality test (Henze and Zirkler 1990), which statistically determines how well samples can be modeled by a multivariate normal distribution, and summarized the results in Table 6. In our test, the null hypothesis was that the samples come from a multivariate normal distribution. Regarding MRI, the null hypothesis was rejected for both AD and MCI. With respect to PET, the test rejected the hypothesis for MCI. Meanwhile, it turned out that the CSF samples of all the disease labels did not follow a multivariate Gaussian distribution. Based on these statistical evaluations, we can confirm the complex data distributions and also justify the necessity of the subclass-based approach, which can efficiently handle such complex distributions.
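Such a test can be reproduced, for example, with the pingouin package, which implements Henze–Zirkler’s test (illustrative data; the test requires more samples than feature dimensions, so a low-dimensional stand-in is used here):

```python
import numpy as np
import pingouin as pg

rng = np.random.default_rng(0)
X_group = rng.standard_normal((51, 10))  # stand-in for one group's features

hz, pval, normal = pg.multivariate_normality(X_group, alpha=0.05)
# 'normal' is False when the null hypothesis (samples drawn from a
# multivariate Gaussian) is rejected, as observed in Table 6 for the
# MRI/AD, MRI/MCI, PET/MCI, and CSF groups.
```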

Table 7 A summary of the statistics (mean ± std [min–max]) of the number of hierarchies with the proposed DW-S\(^{2}\)MTL in the tasks of binary and multi-class classification with modalities

Effect of deep architecture in feature selection

To see the effect of the proposed deep learning scheme in a sparse regression framework, in Fig. 5a and b, respectively, we illustrate the change of the weights for each feature and the selected features over hierarchies by DW-S\(^{2}\)MTL from one of the ten folds in three-class classification with MP data. From the figure, it is clear that in the 1st hierarchy, which corresponds to S\(^{2}\)MTL, the weights for the features are equal and more than 80 % of the total features were selected. However, as the algorithm progressed to higher hierarchies, it gradually discarded uninformative or less informative features, whose weights derived from the optimal regression coefficients in the lower hierarchy were relatively low, and after the 4th hierarchy, it finally selected only 19 features (approximately 10 % of the total features). The ROIs corresponding to the finally selected features, i.e., those weighted highly for classification, included the hippocampal formation left/right and amygdala left/right (in the medial temporal lobe, which comprises a system of anatomically related structures that are vital for declarative or long-term memory) (Braak and Braak 1991; Visser et al. 2002; Mosconi 2005; Lee et al. 2006; Devanand et al. 2007; Frisoni et al. 2008; Burton et al. 2009; Desikan et al. 2009; Ewers et al. 2012; Walhovd et al. 2010), precuneus left/right (Karas et al. 2007), cuneus left (Bokde et al. 2006; Singh et al. 2006; Davatzikos et al. 2011), uncus left, anterior cingulate gyrus left, occipital pole left, subthalamic nucleus left, postcentral gyrus left/right, superior parietal lobule right, anterior limb of internal capsule right, and angular gyrus left (Schroeter et al. 2009; Nobili et al. 2010; Yao et al. 2012). From a biological perspective, some of the ROIs selected from our MRI features, such as the hippocampal formation, amygdala, and precuneus, are related to volume atrophy in the medial temporal cortex, while the precuneus, cingulate gyrus, and parietal lobule selected from our PET features could be associated with hypometabolism (Joie et al. 2012). For reference, we also summarize the statistics of the number of hierarchies built by the proposed DW-S\(^{2}\)MTL in the binary and multi-class classification tasks with different modalities in Table 7.

Fig. 5 An example of the change of the selected features over hierarchies with MP in AD vs. MCI vs. CN

Performance interpretation

In “Binary classification results” and “Multi-class classification results”, we showed the superiority of the proposed DW-S\(^{2}\)MTL method over the competing methods in terms of classification accuracy. For the binary classifications of MCI vs. CN and pMCI vs. sMCI, the proposed DW-S\(^{2}\)MTL method with MP data showed better performance than with MPC data, even though the latter provides additional information from CSF. Note that in this work, we treated the different modalities equally, i.e., with uniform weights across modalities. However, if we applied a modality-adaptive weighting scheme similar to that of Zhang et al. (2011), we would expect to obtain enhanced performance with MPC data.

Regarding sensitivity and specificity, the higher the sensitivity, the lower the chance of misdiagnosing AD/MCI patients; likewise, the higher the specificity, the lower the chance of misdiagnosing CN as AD/MCI. In our three binary classification tasks, although the proposed DW-S\(^{2}\)MTL method achieved the best accuracies, it did not necessarily obtain the best sensitivity or specificity (though both remained high). It is noteworthy that, due to the imbalanced samples between classes, we obtained low sensitivity in pMCI vs. sMCI and low specificity in MCI vs. CN. In this regard, we also computed the balanced accuracy, which avoids inflated performance estimates on imbalanced datasets by taking the average of sensitivity and specificity. Based on this metric, we clearly see that the proposed DW-S\(^{2}\)MTL method outperformed the competing methods, achieving maximal BACs of 95 % (MPC) in AD vs. CN, 73.78 % (MP) in MCI vs. CN, and 71.58 % (MP) in pMCI vs. sMCI.

The metrics of sensitivity and specificity have been widely considered in the field of computer-aided AD diagnosis. However, since both sensitivity and specificity are defined over populations with or without a disease, they are of little practical use in estimating the probability of disease in an individual patient (Akobeng 2007). We rather need to know the positive/negative predictive values (PPV/NPV for short), which describe a patient’s probability of having the disease once the classification results are known. Furthermore, PPV and NPV are highly related to the prevalence of the disease. That is, the higher the disease prevalence, the higher the PPV, i.e., the more likely a positive diagnostic result is correct; the lower the disease prevalence, the lower the PPV, i.e., the less likely a positive diagnostic result is correct. NPV shows exactly the opposite trend. In our experiments, the proposed DW-S\(^{2}\)MTL method achieved maximal PPVs/NPVs of 97.74 % (MPC)/92.86 % (MPC) in AD vs. CN, 79.64 % (MPC)/82.07 % (MP) in MCI vs. CN, and 84.36 % (MP)/70.51 % (MP) in pMCI vs. sMCI. It is remarkable that in pMCI vs. sMCI classification, which is clinically the most important, the proposed DW-S\(^{2}\)MTL showed PPV improvements of 28.4 % (vs. SMTL with MPC), 30.58 % (vs. DW-SMTL with MPC), 22.88 % (vs. S\(^{2}\)MTL with MPC), and 1.63 % (vs. D-S\(^{2}\)MTL), and NPV improvements of 7.71 % (vs. SMTL with MPC), 9.94 % (vs. DW-SMTL with MP), 0.62 % (vs. S\(^{2}\)MTL with MPC), and 3.29 % (vs. D-S\(^{2}\)MTL with MPC).

Table 8 Comparison of classification accuracies (%) with the state-of-the-art methods that used multimodal neuroimaging for AD/CN and MCI/CN. The boldface denotes the maximum performance in each classification problem. (MP: MRI+PET, MPC: MRI+PET+CSF)

Comparison with the state-of-the-art methods

In Table 8, we also compare the classification accuracies of the proposed DW-S\(^{2}\)MTL method with those of state-of-the-art methods that fused multiple modalities for the classifications of AD vs. CN and MCI vs. CN. Note that, due to different datasets and different approaches to extracting features and building classifiers, it is not fair to directly compare the performances among the methods. Nevertheless, the proposed method showed the highest accuracies among the methods in both binary classification problems. In particular, it is noteworthy that, compared to Zhang and Shen’s work (2011), which used the same dataset as ours, the proposed method enhanced the accuracies by 1.89 and 3.71 % for the classifications of AD/CN and MCI/CN, respectively. Furthermore, in comparison with Liu et al.’s work (2013), which also used the same types of features from MRI and PET and the same number of subjects as ours, our method improved the accuracies by 0.72 % (AD/CN) and 1.31 % (MCI/CN), respectively. We also performed statistical significance tests to compare with Liu et al.’s and Zhang et al.’s methods. In summary, the null hypothesis was rejected at the 99 % confidence level, based on p-values of 0.00024 (vs. Liu et al.’s method) and 0.00012 (vs. Zhang et al.’s method).

Conclusions

In neuroimaging-based AD/MCI diagnosis, the ‘high-dimension and small sample’ problem has been one of the major issues. To tackle this problem, sparse regression methods have been widely exploited for feature selection, thus reducing the dimensionality. To the best of our knowledge, most of the existing methods select informative features in a single hierarchy. However, during the optimization of the regression coefficients, the weights of informative features are inevitably affected by non-informative or noisy features, and thus there is a high possibility that the informative features are underestimated or the uninformative features are overestimated. In this regard, we proposed a deep sparse multi-task learning method with a feature-adaptive weighting scheme for feature selection in AD/MCI diagnosis. The main contributions of this work are threefold: (1) rather than selecting informative features in a single hierarchy, the proposed method iteratively filters out uninformative features in a hierarchical fashion; (2) at each hierarchy, our method utilizes the regression coefficients optimized in the lower hierarchy as context information to better determine the features informative for classification; (3) last but not least, our method reflects the complex distributional characteristics of each class via a subclass labeling scheme.

In our experimental results on the ADNI cohort, we validated the effectiveness of the proposed method in both binary classification and multi-class classification tasks, outperforming the competing methods in various metrics.

It is noteworthy that in this work, we regarded the importance of features from different modalities equally. However, as demonstrated by Zhang et al. (2011), different modalities may have different impacts on a clinical decision. If a multi-kernel SVM (Gönen and Alpaydin 2011) were used to replace the linear SVM in our framework, it would be possible to learn modality-adaptive weights and thus obtain the relative importance of the different modalities.

According to a recent broad spectrum of studies, there is increasing evidence that subjective cognitive complaint is one of the important genetic risk factors that increase the risk of progression to MCI or AD (Loewenstein et al. 2012; Mark and Sitskoorn 2013). That is, among cognitively normal elderly individuals who have subjective cognitive impairment, there is a high possibility that some are in a ‘pre-MCI’ stage. However, this issue has been underestimated in the field. Thus, we believe that it is important to design and develop diagnostic methods that take such information into account as well. In addition, to the best of our knowledge, most of the existing computational methods have focused on improving diagnostic accuracy or finding potential biomarkers. However, for the practical application of these computational tools as an expert system, it is necessary to present the grounds for the clinical decision. For example, when a diagnostic system decides on MCI, it would be beneficial for doctors to know which brain regions are distinct or abnormal compared to those of normal healthy controls.