Introduction

As the most common form of dementia worldwide, Alzheimer’s disease (AD) is a primary neurodegenerative brain disease occurring in elderly people. It was first described by the German psychiatrist and neuropathologist Alois Alzheimer in 1906 and was named after him (Berchtold and Cotman 1998). It was reported that there were 26.6 million AD patients worldwide in 2006 (Berchtold and Cotman 1998), and it is predicted that 1 in 85 people will be affected by AD by 2050 (Brookmeyer et al. 2007). There is a prodromal stage between normal aging and AD, called mild cognitive impairment (MCI), and most individuals with MCI eventually progress to dementia within 5 years (Gauthier et al. 2006). There is currently no cure for AD and no treatment that can reverse or halt its progression. Therefore, accurate diagnosis of AD and MCI is very important for delaying disease progression. Moreover, since AD-related brain changes occur before clinical symptoms appear, detecting these changes is critical for the early diagnosis of AD. Recently, neuroimaging techniques have been increasingly used to identify such abnormal changes in the early stage of AD (Cheng et al. 2012; Petersen et al. 1999; Sui et al. 2012; Ye et al. 2011; Zhang et al. 2011).

Early studies on AD/MCI classification mainly focused on a single modality of biomarkers, such as magnetic resonance imaging (MRI) (De Leon et al. 2007; Fan et al. 2008; McEvoy et al. 2009), fluorodeoxyglucose positron emission tomography (FDG-PET) (Higdon et al. 2004; Morris et al. 2001; De Santi et al. 2001), and cerebrospinal fluid (CSF) (Mattsson et al. 2009; Shaw et al. 2009). However, these studies ignore the complementary information across different modalities of biomarkers, which can help further improve classification accuracy. Recently, several researchers have explored combining multiple modalities of biomarkers (Apostolova et al. 2010; Fjell et al. 2010; Landau et al. 2010; Walhovd et al. 2010; Jie et al. 2013). For instance, Hinrichs et al. (Hinrichs et al. 2009) combined two modalities, i.e., MRI and PET, for AD classification. Bouwman et al. (Bouwman et al. 2007) combined MRI and CSF to identify MCI patients from healthy controls (HC). Fellgiebel et al. (Fellgiebel et al. 2007) used PET and CSF to predict cognitive deterioration in MCI. Zhang et al. (Zhang et al. 2011) combined three modalities, i.e., MRI, FDG-PET, and CSF, to distinguish AD/MCI from HC. Gray et al. (Gray et al. 2013) used four modalities, i.e., MRI, FDG-PET, CSF, and genetic information, for AD classification. These studies suggest that different modalities of biomarkers provide inherently complementary information that can improve diagnostic accuracy when used together (Apostolova et al. 2010; Fjell et al. 2010; Landau et al. 2010; Walhovd et al. 2010; Foster et al. 2007).

In multi-modality based classification methods, traditional feature selection approaches, such as the least absolute shrinkage and selection operator (Lasso) and the t-test, are often used to select disease-related brain features for training a good learning model (Tibshirani 1996; Wee et al. 2012). However, one main disadvantage of these feature selection methods is that they usually ignore the inherent relatedness among features from different modalities. Recently, multi-modality based feature selection methods have been proposed to overcome this problem. For example, Huang et al. (Huang et al. 2011) presented a sparse composite linear discriminant analysis model to identify AD-related regions of interest (ROIs) from multi-modality data. Liu et al. (Liu et al. 2014) proposed a multi-task feature selection method for AD classification, in which each task corresponds to a learning model on an individual modality and inter-modality information is embedded into the multi-task learning model. Gray et al. (Gray et al. 2013) constructed a multi-modality classification framework based on pairwise similarity measures derived from random forest classifiers for the classification between AD/MCI and HC. However, in these methods, some useful discriminative information, such as the distribution information of intra-class and inter-class subjects, is not well exploited, which may affect the final classification performance.

To address this problem, in this paper we propose a new discriminative multi-task feature selection (DMTFS) model, which considers both the inherent relations among multi-modality data and the distribution information of intra-class subjects (i.e., subjects from the same class) and inter-class subjects (i.e., subjects from different classes) within each modality. Specifically, we first formulate feature selection on multi-modality data as a multi-task learning problem, with each task corresponding to a learning problem on an individual modality. Then, two regularization terms are included in the proposed DMTFS model. The first term is the group-sparsity regularizer (Ng and Abugharbieh 2011; Yuan and Lin 2006), which ensures that only a small number of common brain region-specific features are jointly selected from the multi-modality data. The second is a new Laplacian regularization term introduced into the proposed objective function, which preserves the compactness of intra-class subjects and the separability of inter-class subjects, and hence induces more discriminative features. Finally, we adopt the multi-kernel support vector machine (SVM) technique to fuse the multi-modality data for AD/MCI classification. To evaluate the proposed method, a series of experiments are performed on the baseline MRI and FDG-PET image data of 202 subjects from the Alzheimer’s Disease Neuroimaging Initiative (ADNI), including 51 AD patients, 99 MCI patients, and 52 HC. The experimental results show the superiority of our proposed method in comparison with existing multi-modality based methods.

Methods

Figure 1 shows an overview of our proposed framework, which contains three major steps, i.e., image pre-processing and feature extraction, discriminative multi-task feature selection, and multi-kernel SVM classification. In this section, before giving detailed descriptions of these steps, we first introduce the subjects used in this study.

Fig. 1

Overview of proposed method

Subjects

The dataset used in this study was obtained from the Alzheimer’s Disease Neuroimaging Initiative (ADNI) database (www.adni-info.org). ADNI is a non-profit organization launched in 2003 by the National Institute of Biomedical Imaging and Bioengineering, with researchers from many institutions working together on this initiative. ADNI is committed to evaluating the progression of MCI and early Alzheimer’s disease by combining measures such as magnetic resonance imaging (MRI), fluorodeoxyglucose positron emission tomography (FDG-PET), cerebrospinal fluid (CSF) markers, and other clinical assessments, which greatly improves diagnostic efficiency and reduces the time and cost of treatment for patients.

Following our previous works (Zhang et al. 2011, 2012), we evaluate the proposed method on the baseline MRI and FDG-PET data of 202 ADNI subjects, including 51 AD patients, 99 MCI patients (43 MCI converters (MCI-C) and 56 MCI non-converters (MCI-NC)), and 52 healthy controls (HC). Specifically, we build multiple binary classifiers to evaluate the classification performance of our proposed method, i.e., AD vs. HC, MCI vs. HC, and MCI-C vs. MCI-NC.

Image Pre-processing and feature extraction

The same image pre-processing as in (Zhang et al. 2011, 2012) is performed for all MRI and PET images, including anterior commissure (AC) - posterior commissure (PC) correction, skull-stripping, removal of cerebellum, and segmentation of structural MR images into three different tissues: grey matter (GM), white matter (WM), and cerebrospinal fluid (CSF). With atlas warping, we partition each subject image into 93 regions of interest (ROIs). For each of the 93 ROIs, we compute the GM tissue volume from the subject’s MRI image. For each PET image, we first rigidly align it to the MRI image of the same subject and then compute the average PET signal within each ROI. Therefore, for each subject we finally obtain 93 features from the MRI image and another 93 features from the PET image.
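
As a rough illustration of this step (not the authors' actual pipeline), the sketch below computes one value per ROI from a voxel-wise image and an aligned 93-label atlas map; for PET this corresponds to the mean signal per ROI, whereas the MRI feature would instead be the GM tissue volume within each ROI. All names are hypothetical.

```python
# Hypothetical sketch: summarize an image over each of 93 atlas-defined ROIs.
import numpy as np

def roi_average_features(image, atlas_labels, n_rois=93):
    """image and atlas_labels are arrays of identical shape;
    atlas_labels contains integer ROI labels 1..n_rois (0 = background)."""
    feats = np.zeros(n_rois)
    for r in range(1, n_rois + 1):
        mask = atlas_labels == r
        feats[r - 1] = image[mask].mean() if mask.any() else 0.0
    return feats
```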

Discriminative Multi-Task Feature Selection (DMTFS)

Before deriving our proposed discriminative multi-task feature selection (DMTFS) method, we first briefly introduce the traditional multi-task feature selection (MTFS) model (Zhang et al. 2012). Let \( \boldsymbol{X}^m = [\boldsymbol{x}_1^m, \ldots, \boldsymbol{x}_i^m, \ldots, \boldsymbol{x}_N^m]^T \in \mathbb{R}^{N \times d} \) denote the training subjects from the m-th modality (i.e., task), and \( \boldsymbol{Y} = [y_1, \ldots, y_i, \ldots, y_N]^T \in \mathbb{R}^N \) represent the corresponding response vector of all training subjects, where d and N are the numbers of features and training subjects, respectively. Here, \( \boldsymbol{x}_i^m \) is the feature vector of the i-th subject from the m-th modality, and \( y_i \in \{+1, -1\} \) is the corresponding class label (i.e., patient or healthy control). In addition, \( \boldsymbol{w}^m \in \mathbb{R}^d \) represents the weight vector of the linear function for the m-th task, and \( \boldsymbol{W} = [\boldsymbol{w}^1, \ldots, \boldsymbol{w}^m, \ldots, \boldsymbol{w}^M] \in \mathbb{R}^{d \times M} \) denotes the weight matrix collecting all \( \boldsymbol{w}^m \). Then, the MTFS model optimizes the following objective function:

$$ \underset{\boldsymbol{W}}{ \min}\frac{1}{2}{\displaystyle \sum_{m=1}^M}{\left\Vert \boldsymbol{Y}-{\boldsymbol{X}}^{\boldsymbol{m}}{\boldsymbol{w}}^{\boldsymbol{m}}\right\Vert}_2^2+\lambda {\left\Vert \boldsymbol{W}\right\Vert}_{2,1} $$
(1)

where M is the number of modalities, \( \left\Vert \boldsymbol{W}\right\Vert_{2,1} = \sum_{j=1}^{d} \left\Vert \boldsymbol{w}_j \right\Vert_2 \) is the \( l_{2,1} \)-norm of the weight matrix, which computes the sum of the \( l_2 \)-norms of the rows \( \boldsymbol{w}_j \) (Yuan and Lin 2006), and \( \boldsymbol{w}_j \) is the j-th row of \( \boldsymbol{W} \), representing the weight vector of the j-th feature across the M tasks. Here, the \( l_{2,1} \)-norm is adopted to enforce group sparsity on the weight matrix, i.e., encouraging a number of rows of the weight matrix to be zero. The first term in Eq. (1) is the empirical loss function, which measures the error between the values predicted by the learning model and the true values. λ is a regularization parameter that balances the relative importance of the two terms: a larger λ yields more zero rows in the weight matrix, i.e., fewer features are preserved.
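
For concreteness, the sketch below (illustrative only; the variable names are our own) evaluates the MTFS objective in Eq. (1), assuming X_list holds the M modality matrices (each of size N × d), y is the label vector, and W is the d × M weight matrix.

```python
# Illustrative sketch of Eq. (1): squared loss per modality plus the
# l_{2,1}-norm of W (sum of the l_2-norms of its rows).
import numpy as np

def l21_norm(W):
    return np.sum(np.linalg.norm(W, axis=1))   # sum_j ||w_j||_2

def mtfs_objective(X_list, y, W, lam):
    loss = 0.5 * sum(np.sum((y - X @ W[:, m]) ** 2)
                     for m, X in enumerate(X_list))
    return loss + lam * l21_norm(W)
```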

In the MTFS model, a linear function (i.e., \( f(\boldsymbol{x}) = \boldsymbol{w}^T\boldsymbol{x} \)) is used to map the data from the original high-dimensional feature space to a one-dimensional space. This model only focuses on the relationship between labels and subjects, and thus ignores the distribution information of subjects within each modality, such as the compactness of intra-class subjects and the separability of inter-class subjects. This kind of information may help induce more discriminative features and thus further improve the classification performance. Figure 2 illustrates an example. Here, each color denotes a class, and points with the same color come from the same class. The green arrows indicate that the green points (the intra-class nearest neighbors) should be closer to the central green point in the new feature space, while the purple arrows indicate that the purple points (the inter-class nearest neighbors) should be far away from the central green point in the new feature space. Intuitively, Fig. 2 shows that intra-class samples should be pulled closer together while inter-class samples should be pushed farther apart in the new feature space.

Fig. 2

The diagram of discriminative analysis

To address this problem, inspired by some recent works (Cai et al. 2007; Xue et al. 2009), we propose a new discriminative regularization term to preserve the distribution information of subjects. To be specific, in each modality, for each subject \( \boldsymbol{x}_i^m \), we first seek its k nearest neighbors, i.e., \( n(\boldsymbol{x}_i^m) = \{\boldsymbol{x}_i^{m,1}, \boldsymbol{x}_i^{m,2}, \ldots, \boldsymbol{x}_i^{m,k}\} \), and define two disjoint subject subsets as follows:

$$ {n}_w\left({\boldsymbol{x}}_i^m\right)=\left\{{\boldsymbol{x}}_i^{m,l}\;\middle|\;{\boldsymbol{x}}_i^{m,l}\ \text{and}\ {\boldsymbol{x}}_i^m\ \text{belong to the same class},\ 1\le l\le k\right\} $$
(2)
$$ {n}_b\left({\boldsymbol{x}}_i^m\right)=\left\{{\boldsymbol{x}}_i^{m,l}\;\middle|\;{\boldsymbol{x}}_i^{m,l}\ \text{and}\ {\boldsymbol{x}}_i^m\ \text{belong to different classes},\ 1\le l\le k\right\} $$
(3)

where \( n_w(\boldsymbol{x}_i^m) \) includes the neighbors that have the same label as the subject \( \boldsymbol{x}_i^m \), and \( n_b(\boldsymbol{x}_i^m) \) contains the neighbors having different labels from the subject \( \boldsymbol{x}_i^m \). Then, to discover the discriminative structure and geometrical information of the data, we construct two graphs, i.e., an intra-class graph \( G_w^m \) and an inter-class graph \( G_b^m \), with each subject as a node in both graphs. Let \( \boldsymbol{Z}_w^m \) and \( \boldsymbol{Z}_b^m \) denote the weight matrices of \( G_w^m \) and \( G_b^m \), respectively. We define:

$$ Z_{w,ij}^m=\begin{cases}1, & \text{if}\ {\boldsymbol{x}}_j^m\in {n}_w\left({\boldsymbol{x}}_i^m\right)\ \text{or}\ {\boldsymbol{x}}_i^m\in {n}_w\left({\boldsymbol{x}}_j^m\right)\\ 0, & \text{otherwise}\end{cases} $$
(4)
$$ Z_{b,ij}^m=\begin{cases}1, & \text{if}\ {\boldsymbol{x}}_j^m\in {n}_b\left({\boldsymbol{x}}_i^m\right)\ \text{or}\ {\boldsymbol{x}}_i^m\in {n}_b\left({\boldsymbol{x}}_j^m\right)\\ 0, & \text{otherwise}\end{cases} $$
(5)

Then, to preserve the discriminative and structural information of the two graphs under the linear mapping, we introduce a new discriminative regularization term:

$$ Q\left(\boldsymbol{W}\right)=\sigma {S}_w-\left(1-\sigma \right){S}_b $$
(6)

where

$$ {S}_w=\sum_{m=1}^M\sum_{i,j}^N{\left\Vert f\left({\boldsymbol{x}}_i^m\right)-f\left({\boldsymbol{x}}_j^m\right)\right\Vert}^2{Z}_{w,ij}^m=2\sum_{m=1}^M{\left({\boldsymbol{w}}^m\right)}^T{\left({\boldsymbol{X}}^m\right)}^T{\boldsymbol{L}}_w^m{\boldsymbol{X}}^m{\boldsymbol{w}}^m $$
(7)

and

$$ {S}_b=\sum_{m=1}^M\sum_{i,j}^N{\left\Vert f\left({\boldsymbol{x}}_i^m\right)-f\left({\boldsymbol{x}}_j^m\right)\right\Vert}^2{Z}_{b,ij}^m=2\sum_{m=1}^M{\left({\boldsymbol{w}}^m\right)}^T{\left({\boldsymbol{X}}^m\right)}^T{\boldsymbol{L}}_b^m{\boldsymbol{X}}^m{\boldsymbol{w}}^m $$
(8)

Here, \( \boldsymbol{L}_w^m = \boldsymbol{D}_w^m - \boldsymbol{Z}_w^m \) and \( \boldsymbol{L}_b^m = \boldsymbol{D}_b^m - \boldsymbol{Z}_b^m \) represent the intra-class and inter-class Laplacian matrices of the m-th modality, respectively, where \( D_{w,ii}^m = \sum_{j=1}^N Z_{w,ij}^m \) and \( D_{b,ii}^m = \sum_{j=1}^N Z_{b,ij}^m \) are the corresponding diagonal degree matrices. σ is a positive constant that controls the relative importance of the two terms.
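
The following sketch illustrates how the neighbor graphs of Eqs. (2)-(5) and the Laplacians used in Eqs. (7)-(8) could be built for one modality; the Euclidean k-nearest-neighbor search and the function name are our own assumptions rather than details specified in the paper.

```python
import numpy as np

def build_laplacians(X, y, k=5):
    """X: N x d data of one modality; y: class labels of the N subjects."""
    N = X.shape[0]
    Zw, Zb = np.zeros((N, N)), np.zeros((N, N))
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(dist, np.inf)              # exclude self-matches
    for i in range(N):
        for j in np.argsort(dist[i])[:k]:       # k nearest neighbors of x_i
            if y[j] == y[i]:
                Zw[i, j] = Zw[j, i] = 1         # intra-class edge, Eq. (4)
            else:
                Zb[i, j] = Zb[j, i] = 1         # inter-class edge, Eq. (5)
    Lw = np.diag(Zw.sum(axis=1)) - Zw           # L_w = D_w - Z_w, used in Eq. (7)
    Lb = np.diag(Zb.sum(axis=1)) - Zb           # L_b = D_b - Z_b, used in Eq. (8)
    return Lw, Lb
```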

With the regularizer in Eq. (6), our proposed discriminative multi-task feature selection model (DMTFS) has the following objective function:

$$ \underset{\boldsymbol{W}}{\min}\;\frac{1}{2}\sum_{m=1}^M{\left\Vert \boldsymbol{Y}-{\boldsymbol{X}}^m{\boldsymbol{w}}^m\right\Vert}_2^2+\lambda {\left\Vert \boldsymbol{W}\right\Vert}_{2,1}+\sum_{m=1}^M{\left({\boldsymbol{w}}^m\right)}^T{\left({\boldsymbol{X}}^m\right)}^T\left[\sigma {\boldsymbol{L}}_w^m-\left(1-\sigma \right){\boldsymbol{L}}_b^m\right]{\boldsymbol{X}}^m{\boldsymbol{w}}^m $$
(9)

where λ and σ are positive constants whose values can be determined via inner cross-validation on the training data. Below, we give an algorithm to solve the optimization problem in Eq. (9).

Optimization algorithm

In our study, we use the Accelerated Proximal Gradient (APG) technique (Chen et al. 2009; Liu 2999) to solve the optimization problem in Eq. (9). Specifically, we first separate the objective function in Eq. (9) into a non-smooth part:

$$ g\left(\boldsymbol{W}\right)=\lambda {\left\Vert \boldsymbol{W}\right\Vert}_{2,1} $$
(10)

and a smooth part:

$$ h\left(\boldsymbol{W}\right)=\frac{1}{2}{\displaystyle \sum_{m=1}^M}\left({\left\Vert \boldsymbol{Y}-{\boldsymbol{X}}^{\boldsymbol{m}}{\boldsymbol{w}}^{\boldsymbol{m}}\right\Vert}_2^2+2{\left({\boldsymbol{w}}^{\boldsymbol{m}}\right)}^T{\left({\boldsymbol{X}}^{\boldsymbol{m}}\right)}^T\left[\sigma {\boldsymbol{L}}_{\boldsymbol{w}}^{\boldsymbol{m}}-\left(1-\sigma \right){\boldsymbol{L}}_{\boldsymbol{b}}^{\boldsymbol{m}}\right]{\boldsymbol{X}}^{\boldsymbol{m}}{\boldsymbol{w}}^{\boldsymbol{m}}\right) $$
(11)

Then, the function h(W) + g(W) can be approximated by the following function:

$$ {\varOmega}_n\left(\boldsymbol{W},{\boldsymbol{W}}_k\right)=h\left({\boldsymbol{W}}_k\right)+\frac{n}{2}{\left\Vert \boldsymbol{W}-{\boldsymbol{W}}_k\right\Vert}_F^2+\left\langle \boldsymbol{W}-{\boldsymbol{W}}_k,\nabla h\left({\boldsymbol{W}}_k\right)\right\rangle +g\left(\boldsymbol{W}\right) $$
(12)

where \( \left\Vert \cdot \right\Vert_F \) denotes the Frobenius norm, \( \nabla h(\boldsymbol{W}_k) \) represents the gradient of h(W) at the point \( \boldsymbol{W}_k \) in the k-th iteration, and \( \left\langle \boldsymbol{W}-\boldsymbol{W}_k,\nabla h(\boldsymbol{W}_k)\right\rangle \) denotes the inner product of matrices, which equals \( Tr\left((\boldsymbol{W}-\boldsymbol{W}_k)^T\nabla h(\boldsymbol{W}_k)\right) \). Here, n determines the iteration step size, and its value can be chosen by line search.

Finally, the iterative process of APG algorithm can be interpreted as follows:

$$ {\boldsymbol{W}}_{\boldsymbol{k}+1}= arg\underset{W}{min}\frac{1}{2}{\left\Vert \boldsymbol{W}-{\boldsymbol{P}}_{\boldsymbol{k}}\right\Vert}_2^2+\frac{1}{n}g\left(\boldsymbol{W}\right) $$
(13)

where \( {\boldsymbol{P}}_{\boldsymbol{k}}={\boldsymbol{W}}_{\boldsymbol{k}}-\frac{1}{n}\nabla h\left({\boldsymbol{W}}_{\boldsymbol{k}}\right) \). According to (Chen et al. 2009; Liu 2999), the problem in Eq. (13) can be decomposed into d separate sub-problems, each of which has an easily obtained analytical solution. Also, following (Chen et al. 2009; Liu 2999), instead of performing the gradient step based on \( \boldsymbol{W}_k \), we can use the following search point:

$$ {\boldsymbol{R}}_{\boldsymbol{k}}={\boldsymbol{W}}_{\boldsymbol{k}}+{\alpha}_k\left({\boldsymbol{W}}_{\boldsymbol{k}}-{\boldsymbol{W}}_{\boldsymbol{k}-1}\right) $$
(14)

where \( {\alpha}_k=\frac{\left(1-{\tau}_{k-1}\right){\tau}_k}{\tau_{k-1}} \) and \( {\tau}_k=\frac{2}{k+3} \).
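
Putting Eqs. (10)-(14) together, each APG iteration forms the search point of Eq. (14), takes a gradient step of the smooth part h, and applies row-wise soft-thresholding, i.e., the proximal operator of the \( l_{2,1} \)-norm in Eq. (13). The sketch below is a simplified illustration that uses a fixed step size instead of the line search described above; all names are illustrative.

```python
import numpy as np

def prox_l21(P, thresh):
    # Row-wise soft-thresholding: proximal operator of thresh * ||W||_{2,1}
    norms = np.linalg.norm(P, axis=1, keepdims=True)
    scale = np.maximum(0.0, 1.0 - thresh / np.maximum(norms, 1e-12))
    return scale * P

def grad_h(W, X_list, y, Lw_list, Lb_list, sigma):
    # Gradient of the smooth part h(W) in Eq. (11), one column per modality
    G = np.zeros_like(W)
    for m, X in enumerate(X_list):
        A = sigma * Lw_list[m] - (1 - sigma) * Lb_list[m]
        G[:, m] = X.T @ (X @ W[:, m] - y) + 2 * X.T @ (A @ (X @ W[:, m]))
    return G

def dmtfs_apg(X_list, y, Lw_list, Lb_list, lam, sigma, step=1e-3, n_iter=200):
    d, M = X_list[0].shape[1], len(X_list)
    W = W_prev = np.zeros((d, M))
    for k in range(n_iter):
        tau_k, tau_km1 = 2.0 / (k + 3), 2.0 / (k + 2)
        alpha = (1 - tau_km1) * tau_k / tau_km1            # Eq. (14)
        R = W + alpha * (W - W_prev)                       # search point R_k
        P = R - step * grad_h(R, X_list, y, Lw_list, Lb_list, sigma)
        W_prev, W = W, prox_l21(P, step * lam)             # Eq. (13)
    return W
```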

Multi-kernel SVM classification

After selecting the discriminative and common features (i.e., brain regions) across multiple modalities, we use the multi-kernel SVM method proposed in (Zhang et al. 2011) for the final classification of AD/MCI from healthy controls. Specifically, based on the features obtained with the proposed method, we compute a linear kernel between subjects for each modality and then integrate the multiple kernels using the following function:

$$ K\left({\boldsymbol{x}}_{\boldsymbol{i}},{\boldsymbol{x}}_{\boldsymbol{j}}\right)={\displaystyle {\sum}_m{\alpha}_m{K}^m\left({\boldsymbol{x}}_{\boldsymbol{i}}^{\boldsymbol{m}},{\boldsymbol{x}}_{\boldsymbol{j}}^{\boldsymbol{m}}\right)} $$
(15)

where \( K^m(\boldsymbol{x}_i^m, \boldsymbol{x}_j^m) \) represents the kernel function of the m-th modality between the subjects \( \boldsymbol{x}_i \) and \( \boldsymbol{x}_j \), and \( \alpha_m \ge 0 \) is a weight parameter with the constraint \( \sum_m \alpha_m = 1 \). Here, we determine the optimal values of \( \alpha_m \) via a coarse grid search on the training subjects over the range from 0 to 1 at an interval of 0.1. Finally, the LIBSVM toolbox (Chang and Lin 2011) is adopted to perform SVM classification with the mixed kernel defined in Eq. (15).
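
As an illustration of Eq. (15), the sketch below combines per-modality linear kernels with fixed weights and trains an SVM on the precomputed mixed kernel; it uses scikit-learn's SVC rather than the LIBSVM toolbox purely for brevity, and the fixed alpha weights stand in for the coarse grid search described above.

```python
import numpy as np
from sklearn.svm import SVC

def mixed_kernel(X_train_list, X_other_list, alphas):
    # Linear kernel per modality, combined with weights alphas summing to 1
    return sum(a * (Xo @ Xtr.T)
               for a, Xtr, Xo in zip(alphas, X_train_list, X_other_list))

def multikernel_svm(X_train_list, y_train, X_test_list, alphas, C=1.0):
    K_train = mixed_kernel(X_train_list, X_train_list, alphas)   # (N_train, N_train)
    K_test = mixed_kernel(X_train_list, X_test_list, alphas)     # (N_test, N_train)
    clf = SVC(C=C, kernel='precomputed').fit(K_train, y_train)
    return clf.predict(K_test)
```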

Results

Classification performance

In this paper, we adopt 10-fold cross-validation to evaluate the classification performance. Specifically, we divide all samples into 10 parts; in each fold, one part is used for testing and the remaining parts for training. The whole cross-validation process is repeated 10 times independently to reduce the bias introduced by the random partitioning of samples. Four performance measures are used to evaluate the different classification methods: classification accuracy (ACC), measuring the proportion of subjects correctly classified among all subjects; sensitivity (SEN), measuring the proportion of AD or MCI patients correctly classified; specificity (SPE), measuring the proportion of healthy controls correctly classified; and the area under the receiver operating characteristic (ROC) curve (AUC).
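
For reference, the four measures can be computed from fold-level predictions as in the generic sketch below, assuming patients (AD, MCI, or MCI-C) are labeled +1 and the negative class -1; this is not the authors' evaluation code.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def evaluate(y_true, y_pred, y_score):
    """y_true/y_pred in {+1, -1}; y_score: continuous decision values."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = np.sum((y_true == 1) & (y_pred == 1))
    tn = np.sum((y_true == -1) & (y_pred == -1))
    fp = np.sum((y_true == -1) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == -1))
    acc = (tp + tn) / len(y_true)            # ACC
    sen = tp / (tp + fn)                     # SEN: patients correctly classified
    spe = tn / (tn + fp)                     # SPE: controls correctly classified
    auc = roc_auc_score(y_true, y_score)     # AUC
    return acc, sen, spe, auc
```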

We compare our proposed DMTFS method with several other methods, including the multi-task feature selection method (denoted as MTFS) (Zhang et al. 2012) and the multi-modal classification method proposed in (Zhang et al. 2011), which uses the least absolute shrinkage and selection operator (Lasso) for feature selection (denoted as MML). For further comparison, we also concatenate the MRI and PET features into a long feature vector, apply the sequential forward floating selection (SFFS) (Pudil et al. 1994) for feature selection, and then use a standard SVM for classification. Table 1 lists the comparison of different methods for AD/MCI classification. Figure 3 plots the ROC curves of the different methods.

Table 1 The comparison of different methods for AD and MCI classification
Fig. 3

ROC curves of different methods for AD vs. HC and MCI vs. HC classification

From Table 1 and Fig. 3, we can see that our proposed method outperforms the other methods on all performance measures for both AD and MCI classification. Specifically, our method achieves classification accuracies of 95.92 % and 82.13 % for AD vs. HC and MCI vs. HC, respectively, while the best accuracies of the other methods are only 92.07 % and 74.17 %, respectively. In addition, our method achieves high AUC values of 0.97 and 0.82 for AD vs. HC and MCI vs. HC, respectively, showing better diagnostic power than the other methods for AD/MCI classification.

On the other hand, we also perform experiments on classifying MCI converters (MCI-C) from MCI non-converters (MCI-NC), with the corresponding results shown in Table 2 and Fig. 4. As can be seen from Table 2 and Fig. 4, our proposed method achieves better classification performance than the other methods for MCI-C vs. MCI-NC classification. Specifically, our proposed method achieves a classification accuracy of 71.12 % for MCI-C vs. MCI-NC classification, which is nearly 10 percentage points higher than the best result of the other methods.

Table 2 The comparison of different methods for MCI converter classification
Fig. 4

ROC curves of different methods for MCI-C vs. MCI-NC classification

In addition, we perform significance tests on the classification performance between our proposed method and the other compared methods by using the standard paired \( t \)-test at the 0.05 significance level. Table 3 shows the results of the \( t \)-tests between our method and each of the other methods. As can be seen from Table 3, for all three classification tasks, i.e., AD vs. HC, MCI vs. HC, and MCI-C vs. MCI-NC, our proposed method is significantly better than the other compared methods, which again shows the advantages of our proposed method.

Table 3 Significance test on the classification accuracies between our proposed method and other methods

The most discriminative brain regions

Because identifying disease-related brain regions is important for early diagnosis, besides reporting classification performances, we also investigate the top features (i.e., brain regions) selected by our proposed DMTFS method. To be specific, since the features selected in each cross-validation fold are not identical, we regard the features with the highest occurrence frequency across all cross-validation folds as the most discriminative features.
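
A minimal sketch of this frequency-based ranking is given below, assuming the weight matrix learned in each fold is available and that a feature counts as selected when its row in W is non-zero; the threshold and names are our own choices.

```python
import numpy as np

def top_features_by_frequency(W_per_fold, top_k=15):
    """W_per_fold: list of d x M weight matrices, one per cross-validation fold."""
    counts = sum((np.linalg.norm(W, axis=1) > 1e-8).astype(int)
                 for W in W_per_fold)
    return np.argsort(-counts)[:top_k]   # indices of the most frequently selected ROIs
```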

Figure 5 plots the top 15 selected brain regions for MCI vs. HC classification. As can be seen from Fig. 5, our method effectively identifies disease-related brain regions such as the hippocampus, amygdala, precuneus, and temporal pole, which have been reported to be relevant to AD in previous studies (Dai et al. 2009; Del Sole et al. 2008; Misra et al. 2009; Solodkin et al. 2013; Van Hoesen and Hyman 1990; Wang et al. 2012). For example, the hippocampus is located in the temporal lobe of the brain and plays a key role in memory and spatial navigation. The hippocampi are among the first regions damaged in AD, leading to loss of memory and spatial disorientation. Hyman et al. (Hyman et al. 1984) also reported that the focal pattern of pathology isolating the hippocampal formation may contribute to the memory impairment in AD. The amygdala is a subcortical center of the limbic system, involved in regulating visceral sensation and producing emotions. Many researchers have reported the important role of the amygdala in AD (Knafo et al. 2009; Poulin et al. 2011). For instance, Knafo et al. (Knafo et al. 2009) observed significant shrinkage of the amygdala and extensive gliosis in individuals with AD. In addition, the precuneus (Del Sole et al. 2008; Karas et al. 2007) and the temporal pole (Nobili et al. 2008) also show significant abnormalities in AD.

Fig. 5

Top 15 brain regions in MCI vs. HC classification

Discussion

In this paper, we propose a new discriminative multi-task feature selection method for AD vs. HC, MCI vs. HC, and MCI-C vs. MCI-NC classification. Experimental results demonstrate that our proposed method achieves better classification performance and identifies more discriminative features, compared with existing multi-modality based methods. Specifically, our proposed method achieves an accuracy of 95.92 % for classification between AD and HC, an accuracy of 82.13 % for classification between MCI and HC, and an accuracy of 71.12 % for classification between MCI-C and MCI-NC.

Multi-modality based classification

Since different modalities may provide complementary information for the diagnosis of AD (Apostolova et al. 2010; Landau et al. 2010), many recent studies have investigated combining multiple modalities of data for AD diagnosis, showing improved classification performance (Walhovd et al. 2010; Bouwman et al. 2007; Wee et al. 2012; Ye et al. 2008; Davatzikos et al. 2011). For further comparison, Table 4 lists our proposed method against several other state-of-the-art methods for multi-modality based AD/MCI classification. For example, Huang et al. (Huang et al. 2011) proposed the sparse composite linear discriminant analysis (SCLDA) model applied to MRI and PET data, achieving an accuracy of 94.30 % for AD classification. Gray et al. (Gray et al. 2013) used four modalities (MRI, PET, CSF, and genetic information) and achieved accuracies of 89.00 %, 74.60 %, and 58.00 % for classifying AD, MCI, and MCI-C, respectively. Liu et al. (Liu et al. 2014) used two modalities (MRI and PET) and achieved accuracies of 94.40 %, 78.80 %, and 67.80 % for classifying AD, MCI, and MCI-C, respectively. As we can see from Table 4, our proposed method consistently outperforms the other state-of-the-art methods for multi-modality based classification of AD, MCI, and MCI-C.

Table 4 The comparison between proposed method and the state-of-the-art multi-modality based classification methods

Effect of parameters

In our proposed model, there are two regularization terms, i.e., the group-sparsity regularizer and the discriminative regularizer. Accordingly, two regularization parameters (i.e., λ and σ) are used to balance the contributions of the different terms. More specifically, λ controls the group sparsity of the model, and σ balances the relative importance of the intra-class Laplacian matrix and the inter-class Laplacian matrix. Figure 6 gives the classification accuracies of our proposed method under different values of the parameter λ. For comparison, we also give the classification results of the standard MTFS method (i.e., without the discriminative regularization term). It is worth noting that when λ = 0, no feature selection is performed, i.e., all features are used for the subsequent classification. In addition, we also test different values of the parameter σ, ranging from 0 to 1 at a step size of 0.1, with a fixed λ value, as shown in Fig. 7.

Fig. 6

Classification accuracies under different values of λ

Fig. 7

Classification accuracy under different values of the discriminative parameter σ

As we can see from Fig. 6, under all values of λ, our proposed method significantly outperforms the MTFS method on all three classification tasks (i.e., AD vs. HC, MCI vs. HC, and MCI-C vs. MCI-NC), which again shows the advantage of introducing the discriminative regularization term based on the intra-class and inter-class Laplacian matrices. On the other hand, Fig. 7 indicates that the corresponding curves w.r.t. different values of σ are very smooth for all three classification tasks, showing good robustness, i.e., our method is insensitive to the value of σ.

Comparison with single-modality methods

Here, to evaluate the effect of combining multi-modality imaging data and provide a more comprehensive comparison, we further perform two experiments: (1) using only the MRI modality, and (2) using only the PET modality. It is worth noting that our proposed model can also be used in the single-modality case, where it degrades into discriminative single-task (single-modality) feature selection followed by SVM classification. The corresponding results are shown in Table 5. As can be seen from Table 5, using multiple modalities (i.e., MRI + PET) achieves significantly better performance than using a single modality (MRI or PET) alone.

Table 5 The classification performance of different modalities

Comparison with other feature selection methods

In order to further show the superiority of our proposed method, we compare it with other popular feature selection methods, including RelieF (Kira and Rendell 1992) and Elastic Net (Zou and Hastie 2005). For a fair comparison, we use the same classifier (i.e., multi-kernel SVM) after performing feature selection with RelieF, Elastic Net, and our proposed method. Table 6 gives the classification accuracies of the different feature selection methods for AD vs. HC, MCI vs. HC, and MCI-C vs. MCI-NC, respectively. As we can see from Table 6, our proposed method always achieves the best classification accuracy in all three classification tasks, compared to RelieF and Elastic Net. In particular, for MCI-C vs. MCI-NC classification, the accuracy of our proposed method exceeds that of the other two methods by nearly 10 percentage points. This result again validates the efficacy of our proposed method.

Table 6 The accuracies of different feature selection methods for AD, MCI and MCI-C classification

Limitations

The current study is limited by the following two factors. First, in this paper we use only two modalities, i.e., MRI and PET, for AD/MCI classification. However, there exist other modalities (e.g., CSF and APOE) that may contain complementary information for further improving the classification performance. Second, we only consider binary classification problems (i.e., AD vs. HC, MCI vs. HC, and MCI-C vs. MCI-NC), and did not test our proposed method on multi-class classification. In the future, we will address these limitations to further improve the classification performance.

Conclusion

This paper proposed a discriminative multi-task feature selection method for the classification of AD/MCI. Different from existing multi-modality based feature selection methods, our proposed method exploits the distribution information of both intra-class and inter-class subjects. Experimental results on the ADNI dataset show that, in comparison with state-of-the-art multi-modality based methods, our proposed method not only improves the classification performance but also has the potential to discover disease-related biomarkers useful for disease diagnosis.