Introduction

Alzheimer’s disease (AD) is a physical disease that affects the brain and is the most common cause of dementia. There were more than 26.6 million people worldwide with AD in 2006, and it is predicted that 1 in 85 people will be affected by 2050 (Brookmeyer et al. 2007). There is currently no cure for the disease, which worsens as it progresses and eventually leads to death. It is therefore very important to identify AD accurately, especially at its early stage, known as mild cognitive impairment (MCI), which carries a high risk of progression to AD (Petersen et al. 1999).

Existing studies have shown that AD is related to structural atrophy, pathological amyloid depositions, and metabolic alterations in the brain (Jack et al. 2010; Nestor et al. 2004). Multiple biomarkers have been shown to be sensitive for the diagnosis of AD and MCI, e.g., structural MR imaging (MRI) for measuring brain atrophy (Leon et al. 2007; Du et al. 2007; Fjell et al. 2010; Mcevoy et al. 2009), functional imaging (e.g., FDG-PET) for quantifying hypometabolism (De et al. 2001; Morris et al. 2001), and cerebrospinal fluid (CSF) for quantifying specific proteins (Bouwman et al. 2007; Mattsson et al. 2009; Shaw et al. 2009; Fjell et al. 2010).

In recent years, machine learning and pattern classification methods, which learn a model from training subjects to predict the class label (i.e., patient or normal control) of an unseen subject, have been widely applied to studies of AD and MCI based on a single modality of biomarkers. For example, researchers have extracted features from structural MRI, such as voxel-wise tissue density (Desikan et al. 2009; Fan et al. 2007; Magnin et al. 2009), cortical thickness (Desikan et al. 2009; Oliveira et al. 2010), and hippocampal volume (Gerardin et al. 2009; MJ et al. 2004), for AD and MCI classification. Besides structural MRI, some researchers have also used fluorodeoxyglucose positron emission tomography (FDG-PET) (Chételat et al. 2003; Foster et al. 2007; Higdon et al. 2004) for AD or MCI classification.

Different imaging modalities provide different views of brain structure or function. For example, structural MRI reveals patterns of gray matter atrophy, while FDG-PET measures reduced glucose metabolism in the brain. It has been reported that MRI and FDG-PET offer different sensitivities for predicting memory performance in disease and health (Walhovd et al. 2010). Using multiple biomarkers may thus reveal hidden information that would be overlooked by any single modality. Researchers have begun to integrate multiple modalities to further improve the accuracy of disease classification (Leon et al. 2007; Fjell et al. 2010; Foster et al. 2007; Walhovd et al. 2010; Apostolova et al. 2010; Dai et al. 2012; Gray et al. 2012; Hinrichs et al. 2011; Huang et al. 2011; Landau et al. 2010; Westman et al. 2012; Yuan et al. 2012; Zhang et al. 2011). For instance, Hinrichs et al. (2011) used two modalities (MRI and FDG-PET) for AD classification. Zhang et al. (2011) combined MRI, FDG-PET and cerebrospinal fluid (CSF) for classifying patients with AD/MCI from normal controls. Dai et al. (2012) integrated structural MRI (sMRI) and functional MRI (fMRI) for AD classification. Gray et al. (2012) used MRI, FDG-PET, CSF and categorical genetic information for AD/MCI classification.

Although promising results have been achieved by existing multimodal classification methods, the small number of subjects relative to the large feature dimensionality limits further performance improvement of the above methods. For neuroimaging data, even after feature extraction, the feature dimensionality remains high relative to the number of subjects, and redundant or irrelevant features may exist that hamper the subsequent classification task. Such irrelevant and redundant features should therefore be removed by feature selection. In the literature, feature selection is typically performed on each modality individually, which ignores the potential relationships among different modalities. To the best of our knowledge, only a few studies focus on jointly selecting features from multi-modality neuroimaging data for AD/MCI classification. For example, Huang et al. (2011) proposed a sparse composite linear discriminant analysis (SCLDA) model for identifying disease-related brain regions of early AD from multi-modality data. Zhang and Shen (2012) proposed multi-modal multi-task learning for joint feature selection in AD classification and regression. Liu et al. (2014) proposed an inter-modality relationship constrained multi-task feature selection method for AD/MCI classification. Jie et al. (2015) presented a manifold regularized multi-task feature selection method for multimodal classification of AD/MCI. However, except for Jie et al.’s work, most existing multi-modality feature selection methods use only the multi-modality information of the same subjects, ignoring the intrinsic relationships across different subjects, which may also contain useful information for further improving classification performance. Different from Jie et al.’s method, the proposed approach not only considers the information of each modality, but also regards the relationships across different modalities as extra information. Hence, Jie et al.’s method can be regarded as a special case of our proposed method.

In this paper, we propose a novel learning method that fully explores the relationships across both modalities and subjects by mining and fusing discriminative features from multi-modality data for AD/MCI classification. Specifically, our proposed learning method includes two major steps: 1) label-aligned multi-task feature selection, and 2) multimodal classification. First, we treat the feature selection from each modality as a separate learning task and adopt a group sparsity regularizer to ensure that a common subset of relevant features is jointly selected across modalities. Moreover, to utilize the discriminative information among labeled subjects, we introduce a new label-aligned regularization term into the objective function of standard multi-task feature selection. Here, label alignment means that all multi-modality subjects with the same class label should be close in the new feature-reduced space. Then, we use a multi-kernel support vector machine (SVM) to fuse the selected features from multi-modality data for final classification. The proposed method has been evaluated on the Alzheimer’s Disease Neuroimaging Initiative (ADNI) database, demonstrating better results than several state-of-the-art multi-modality-based methods.

Method

Neuroimaging data

We use data obtained from the Alzheimer’s Disease Neuroimaging Initiative (ADNI) database (www.loni.usc.edu) in this paper. The ADNI was launched in 2003 by the National Institute on Aging (NIA), the National Institute of Biomedical Imaging and Bioengineering (NIBIB), the Food and Drug Administration (FDA), private pharmaceutical companies and non-profit organizations, as a $60 million, 5-year public-private partnership. Determining sensitive and specific markers of very early AD progression is intended to aid researchers and clinicians in developing new treatments and monitoring their effectiveness, as well as in lessening the time and cost of clinical trials. The initial goal of ADNI was to recruit approximately 200 cognitively normal older individuals to be followed for 3 years, 400 MCI patients to be followed for 3 years, and 200 early AD patients to be followed for 2 years.

We use imaging data from 202 ADNI participants with corresponding baseline MRI and FDG-PET data: 51 AD patients, 99 MCI patients, and 52 normal controls (NC). The MCI patients were divided into 43 MCI converters (MCI-C), who progressed to AD within 18 months, and 56 MCI non-converters (MCI-NC), whose diagnoses remained stable over the same period. Table 1 lists the clinical and demographic information for the study population. A detailed description of the MRI and PET acquisition from ADNI as used in this paper can be found in (Zhang et al. 2011). All structural MR scans were acquired on 1.5 T scanners. Raw Digital Imaging and Communications in Medicine (DICOM) MRI scans were downloaded from the public ADNI site (adni.loni.usc.edu), reviewed for quality, and automatically corrected for spatial distortion caused by gradient nonlinearity and B1 field inhomogeneity. PET images were acquired 30–60 min post-injection, averaged, spatially aligned, interpolated to a standard voxel size, intensity normalized, and smoothed to a common resolution of 8 mm full width at half maximum.

Table 1 Subject information

Image pre-processing and feature extraction are performed for all MR and PET images following the same procedures as in (Zhang et al. 2011). First, we perform anterior commissure (AC)-posterior commissure (PC) correction on all images and use the N3 algorithm (Sled et al. 1997) to correct intensity inhomogeneity. Next, we perform skull-stripping on structural MR images using both the brain surface extractor (BSE) (Shattuck et al. 2001) and the brain extraction tool (BET) (Smith and Stephen 2002), followed by manual editing and further intensity inhomogeneity correction. After removal of the cerebellum, FAST in the FSL package (Zhang et al. 2001) is used to segment the structural MR images into three tissue types: gray matter (GM), white matter (WM), and cerebrospinal fluid (CSF). After registration using HAMMER (Shen and Davatzikos 2002), we obtain a subject-labeled image based on a template with 93 manually labeled regions of interest (ROIs). We then compute the GM tissue volume of each region as a feature. Each PET image is first aligned to the MR image of the same subject using a rigid transformation, and the average intensity of each ROI in the PET image is computed as a feature. Therefore, for each subject, we obtain 93 features from the MR image and another 93 features from the PET image.
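
As a concrete illustration of this last step, the following minimal NumPy sketch (with assumed array names and shapes, not the actual pipeline code) computes the 93 ROI features per modality from a GM tissue map and a co-registered 93-label atlas:

```python
import numpy as np

def roi_features(gm_map, pet_img, atlas, voxel_vol, n_roi=93):
    """gm_map: GM tissue map; pet_img: co-registered PET image;
    atlas: labeled template warped to the subject, ROI labels 1..n_roi."""
    labels = range(1, n_roi + 1)
    # MRI feature: total GM volume inside each labeled ROI
    mri_feat = np.array([gm_map[atlas == r].sum() * voxel_vol for r in labels])
    # PET feature: average tracer intensity inside each labeled ROI
    pet_feat = np.array([pet_img[atlas == r].mean() for r in labels])
    return mri_feat, pet_feat
```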

Label-aligned multi-task feature learning

In this section, we first briefly introduce conventional multi-task feature selection (Evgeniou and Pontil 2004; Kumar and Daume III 2012; Obozinski et al. 2006, 2010; Yuan and Lin 2006), and then derive our proposed label-aligned multi-task feature selection model as well as the corresponding optimization algorithm. Finally, we describe the multi-kernel support vector machine used for classification. Figure 1 gives an overview of the proposed classification method.

Fig. 1 Schematic illustration of the proposed classification pipeline

Multi-task feature selection

Denote by \( {\boldsymbol{X}}^m=\left[{\boldsymbol{x}}_1^m,\ldots,{\boldsymbol{x}}_i^m,\ldots,{\boldsymbol{x}}_N^m\right]^T\in {\mathbb{R}}^{N\times d} \) the training data matrix of the \( m \)-th modality, where \( {\boldsymbol{x}}_i^m \) is the corresponding (column) feature vector of the \( i \)-th subject, \( d \) is the feature dimensionality, and \( N \) is the number of subjects. Let \( \boldsymbol{Y}=\left[{y}_1,\ldots,{y}_i,\ldots,{y}_N\right]^T\in {\mathbb{R}}^N \) be the label vector of the \( N \) training samples, where \( {y}_i \) is +1 or −1 (i.e., patient or normal control). The objective function of the multi-task feature selection (MTFS) model is then (Yuan and Lin 2006):

$$ \underset{\boldsymbol{W}}{\min}\ \frac{1}{2}\sum_{m=1}^{M}{\left\Vert \boldsymbol{Y}-{\boldsymbol{X}}^m{\boldsymbol{w}}^m\right\Vert}_2^2+{\lambda}_1{\left\Vert \boldsymbol{W}\right\Vert}_{2,1} $$
(1)

where \( {\boldsymbol{w}}^m\in {\mathbb{R}}^d \) is the regression coefficient vector for the \( m \)-th modality, and the coefficient vectors of all \( M \) modalities form the coefficient matrix \( \boldsymbol{W}=\left[{\boldsymbol{w}}^1,\ldots,{\boldsymbol{w}}^m,\ldots,{\boldsymbol{w}}^M\right]\in {\mathbb{R}}^{d\times M} \), where \( M \) is the total number of modalities. In (1), \( {\left\Vert \boldsymbol{W}\right\Vert}_{2,1} \) is the \( \ell_{2,1} \)-norm of the matrix \( \boldsymbol{W} \), defined as \( {\left\Vert \boldsymbol{W}\right\Vert}_{2,1}=\sum_{j=1}^{d}{\left\Vert {\boldsymbol{w}}_j\right\Vert}_2 \), where \( {\boldsymbol{w}}_j \) is the \( j \)-th row of \( \boldsymbol{W} \). The regularization parameter \( {\lambda}_1 \) controls the relative contribution of the two terms.

The \( \ell_{2,1} \)-norm \( {\left\Vert \boldsymbol{W}\right\Vert}_{2,1} \) can be seen as the sum of the \( \ell_2 \)-norms of the rows of \( \boldsymbol{W} \) (Yuan and Lin 2006). It encourages the weights corresponding to the same feature across different modalities to be grouped together, so that a small number of common features are jointly selected. The solution of MTFS is therefore a weight matrix \( \boldsymbol{W} \) in which many rows are entirely zero, reflecting its characteristic ‘group sparsity’. It is worth noting that when there is only one modality (i.e., \( M=1 \)), the MTFS model degenerates into the least absolute shrinkage and selection operator (LASSO) model (Tibshirani 1994).
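
To make the group-sparsity effect concrete, the following minimal NumPy sketch (with a hypothetical \( d=4 \), \( M=2 \) coefficient matrix) computes \( {\left\Vert \boldsymbol{W}\right\Vert}_{2,1} \) as the sum of row-wise \( \ell_2 \)-norms; note that zeroing a row drops the same feature in all modalities at once:

```python
import numpy as np

def l21_norm(W):
    """||W||_{2,1}: sum of the l2-norms of the d rows of W (d features x M modalities)."""
    return np.sum(np.linalg.norm(W, axis=1))

# Hypothetical row-sparse solution: features 1 and 3 are jointly discarded,
# while features 2 and 4 are jointly selected across both modalities.
W = np.array([[0.0, 0.0],
              [1.2, -0.5],
              [0.0, 0.0],
              [0.3, 0.8]])
print(l21_norm(W))  # ||(1.2, -0.5)||_2 + ||(0.3, 0.8)||_2 ≈ 2.154
```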

Label-aligned multi-task feature selection

One limitation of the standard multi-task feature selection model is that it considers only the relationship between modalities within the same subject, ignoring the important relationships among labeled subjects. To address this issue, we introduce a new label-aligned regularization term, which minimizes the distance between within-class subjects in the feature-reduced space:

$$ \Omega =\sum_{i,j}^{N}\sum_{p,q\,(p\le q)}^{M}{\left\Vert {\left({\boldsymbol{w}}^p\right)}^T{\boldsymbol{x}}_i^p-{\left({\boldsymbol{w}}^q\right)}^T{\boldsymbol{x}}_j^q\right\Vert}_2^2{S}_{ij} $$
(2)

where \( {S}_{ij} \) is defined as:

$$ {S}_{ij}=\begin{cases}1, & \text{if } {\boldsymbol{x}}_i^p \text{ and } {\boldsymbol{x}}_j^q \text{ are from the same class}\\ 0, & \text{otherwise}\end{cases} $$
(3)

The regularization term (2) can be explained as follows. The quantity \( {\left\Vert {\left({\boldsymbol{w}}^p\right)}^T{\boldsymbol{x}}_i^p-{\left({\boldsymbol{w}}^q\right)}^T{\boldsymbol{x}}_j^q\right\Vert}_2^2{S}_{ij} \) measures the distance between \( {\boldsymbol{x}}_i^p \) and \( {\boldsymbol{x}}_j^q \) in the projected space, implying that if \( {\boldsymbol{x}}_i^p \) and \( {\boldsymbol{x}}_j^q \) are from the same class, the distance between them should be as small as possible after projection. It is worth noting that 1) when \( p=q \), the local geometric structure of the same modality is preserved in the feature-reduced space; and 2) when \( p<q \), the complementary information provided by different modalities is used to guide the estimation of the feature-reduced space. Therefore, Eq. (2) preserves the intrinsic label relatedness among multi-modality data while exploiting the complementary information conveyed by different modalities. Generally speaking, the goal of (2) is to preserve label relatedness by aligning paired within-class subjects from multiple modalities.
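
For clarity, a direct (unoptimized) NumPy sketch of Eqs. (2)–(3) is given below; the inputs are assumed to be a list X of \( M \) arrays of shape \( N\times d \), a coefficient matrix W of shape \( d\times M \), and a label vector y with entries ±1:

```python
import numpy as np

def label_aligned_penalty(X, W, y):
    """Omega in Eq. (2): sum over modality pairs (p <= q) of the squared
    projected distances between same-class subjects, weighted by S_ij."""
    M = len(X)
    S = (y[:, None] == y[None, :]).astype(float)   # Eq. (3): 1 iff same class
    proj = [X[m] @ W[:, m] for m in range(M)]      # (w^m)^T x_i^m for all i
    omega = 0.0
    for p in range(M):
        for q in range(p, M):                      # pairs with p <= q
            diff = proj[p][:, None] - proj[q][None, :]   # N x N pairwise gaps
            omega += np.sum(S * diff ** 2)
    return omega
```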

By incorporating the regularizer (2) into (1), we can obtain the objective function of our label-aligned multi-task feature selection model as below:

$$ \underset{\boldsymbol{W}}{\min}\ \frac{1}{2}\sum_{m=1}^{M}{\left\Vert \boldsymbol{Y}-{\boldsymbol{X}}^m{\boldsymbol{w}}^m\right\Vert}_2^2+{\lambda}_1{\left\Vert \boldsymbol{W}\right\Vert}_{2,1}+{\lambda}_2\sum_{i,j}^{N}\sum_{p,q\,(p\le q)}^{M}{\left\Vert {\left({\boldsymbol{w}}^p\right)}^T{\boldsymbol{x}}_i^p-{\left({\boldsymbol{w}}^q\right)}^T{\boldsymbol{x}}_j^q\right\Vert}_2^2{S}_{ij} $$
(4)

where \( {\lambda}_1 \) and \( {\lambda}_2 \) are two positive constants that control the sparseness and the degree of distance preservation between subjects, respectively. With (4), we can not only jointly select a subset of common features from multi-modality data, but also preserve label relatedness by aligning paired within-class subjects. Figure 2 illustrates the relationships among modalities and subjects exploited by our proposed model, compared with traditional multi-modality methods. In Fig. 2a, traditional multimodal methods consider only the relationships between different modalities of the same subject (i.e., the single line connecting MRI and PET). As shown in Fig. 2b, our proposed method preserves not only the multi-modality relationship within the same subject, but also the correlations across modalities between different subjects.

Fig. 2 Illustration of the relationships among modalities and subjects in a traditional multi-modality methods and b the proposed method in identifying subjects in class 1 and class 2. Circles and rectangles represent MRI and PET data, respectively. Red and blue denote different classes

Optimization algorithm

Several algorithms have been developed to solve the optimization problem in (4). Here, we choose the widely applied Accelerated Proximal Gradient (APG) method (Nesterov 2003; Chen et al. 2009) to solve our proposed model. Specifically, we separate the objective function in (4) into the smooth part:

$$ f\left(\boldsymbol{W}\right)=\frac{1}{2}\sum_{m=1}^{M}{\left\Vert \boldsymbol{Y}-{\boldsymbol{X}}^m{\boldsymbol{w}}^m\right\Vert}_2^2+{\lambda}_2\sum_{i,j}^{N}\sum_{p,q\,(p\le q)}^{M}{\left\Vert {\left({\boldsymbol{w}}^p\right)}^T{\boldsymbol{x}}_i^p-{\left({\boldsymbol{w}}^q\right)}^T{\boldsymbol{x}}_j^q\right\Vert}_2^2{S}_{ij} $$
(5)

and non-smooth part:

$$ g\left(\boldsymbol{W}\right)={\lambda}_1{\left\Vert \boldsymbol{W}\right\Vert}_{2,1} $$
(6)
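
The APG iterations below require \( \nabla f \). Differentiating (5) gives, for each column \( {\boldsymbol{w}}^m \), a least-squares term plus contributions from every modality pair involving \( m \); a minimal NumPy sketch (with the same assumed X, Y, S inputs as above) is:

```python
import numpy as np

def grad_f(W, X, Y, S, lam2):
    """Gradient of the smooth part f in Eq. (5) w.r.t. the d x M matrix W."""
    M = len(X)
    proj = [X[m] @ W[:, m] for m in range(M)]
    # Gradient of the least-squares term (the 1/2 cancels the square)
    G = np.stack([X[m].T @ (proj[m] - Y) for m in range(M)], axis=1)
    # Gradient of the label-aligned term, accumulated over pairs p <= q
    for p in range(M):
        for q in range(p, M):
            D = S * (proj[p][:, None] - proj[q][None, :])
            G[:, p] += 2.0 * lam2 * (X[p].T @ D.sum(axis=1))
            G[:, q] -= 2.0 * lam2 * (X[q].T @ D.sum(axis=0))
    return G
```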

Then, the following function is constructed to approximate the composite function \( f\left(\boldsymbol{W}\right)+g\left(\boldsymbol{W}\right) \):

$$ {\Omega}_l\left(\boldsymbol{W},{\boldsymbol{W}}_k\right)=f\left({\boldsymbol{W}}_k\right)+\left\langle \boldsymbol{W}-{\boldsymbol{W}}_k,\nabla f\left({\boldsymbol{W}}_k\right)\right\rangle +\frac{l}{2}{\left\Vert \boldsymbol{W}-{\boldsymbol{W}}_k\right\Vert}_F^2+g\left(\boldsymbol{W}\right) $$
(7)

where \( {\left\Vert \cdot \right\Vert}_F \) is the Frobenius norm, \( \nabla f\left({\boldsymbol{W}}_k\right) \) is the gradient of \( f\left(\boldsymbol{W}\right) \) at the point \( {\boldsymbol{W}}_k \) of the \( k \)-th iteration, and \( l \) is the step size. The update step of the APG algorithm is then defined as:

$$ {\boldsymbol{W}}_{k+1}=\underset{\boldsymbol{W}}{ \arg \min}\frac{1}{2}{\left\Vert \boldsymbol{W}-{\boldsymbol{U}}_k\right\Vert}_F^2+\frac{1}{l}g\left(\boldsymbol{W}\right) $$
(8)

where l can be determined by line search, and \( {\boldsymbol{U}}_k={\boldsymbol{W}}_k-\frac{1}{l}\nabla f\left({\boldsymbol{W}}_k\right) \).

The key to the APG algorithm is solving the update step (8) efficiently. The study in (Liu and Ye 2010) shows that this problem decomposes into \( d \) separate subproblems, one per row of \( \boldsymbol{W} \), whose analytical solutions are easily obtained.
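
Concretely, each row of the solution to (8) is obtained by group soft thresholding of the corresponding row of \( {\boldsymbol{U}}_k \); a minimal sketch of this row-wise analytical solution (following Liu and Ye 2010) is:

```python
import numpy as np

def prox_l21(U, tau):
    """argmin_W 0.5*||W - U||_F^2 + tau*||W||_{2,1}, solved row by row:
    each row u_j is scaled by max(0, 1 - tau/||u_j||_2)."""
    row_norms = np.linalg.norm(U, axis=1, keepdims=True)
    scale = np.maximum(0.0, 1.0 - tau / np.maximum(row_norms, 1e-12))
    return scale * U   # rows with ||u_j||_2 <= tau become exactly zero
```

Here \( \tau ={\lambda}_1/l \), so a larger \( {\lambda}_1 \) zeroes out more rows of \( \boldsymbol{W} \), i.e., jointly discards more features across modalities.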

In addition, following the technique in (Chen et al. 2009), instead of computing (7) based on \( {\boldsymbol{W}}_k \), we compute \( {\Omega}_l\left(\boldsymbol{W},{\boldsymbol{Q}}_k\right) \) at a search point \( {\boldsymbol{Q}}_k \) defined as:

$$ {\boldsymbol{Q}}_k={\boldsymbol{W}}_k+{\eta}_k\left({\boldsymbol{W}}_k-{\boldsymbol{W}}_{k-1}\right) $$
(9)

where \( {\eta}_k=\frac{\left(1-{\gamma}_{k-1}\right){\gamma}_k}{\gamma_{k-1}} \) and \( {\gamma}_k=\frac{2}{k+3} \). The resulting algorithm for Eq. (4) achieves a convergence rate of \( O\left(1/{K}^2\right) \), where \( K \) is the number of iterations.
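
Putting the pieces together, a minimal APG sketch for Eq. (4) is given below; it uses the grad_f and prox_l21 functions sketched above and, for simplicity, a fixed step size \( l \) rather than the line search used in the paper:

```python
import numpy as np

def apg(X, Y, S, lam1, lam2, step, n_iter=200):
    d, M = X[0].shape[1], len(X)
    W = W_prev = np.zeros((d, M))
    gamma_prev = 1.0
    for k in range(n_iter):
        gamma = 2.0 / (k + 3.0)
        eta = (1.0 - gamma_prev) * gamma / gamma_prev  # Eq. (9) momentum
        Q = W + eta * (W - W_prev)                     # search point Q_k
        U = Q - grad_f(Q, X, Y, S, lam2) / step        # gradient step: U_k
        W_prev, W = W, prox_l21(U, lam1 / step)        # proximal step, Eq. (8)
        gamma_prev = gamma
    return W   # rows of exact zeros mark the discarded features
```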

Multi-kernel support vector machine

Multi-kernel SVM can effectively integrate data from multiple modalities for classification of Alzheimer’s disease (Zhang et al. 2011). Given a set of training subjects, let \( {k}^m\left({\boldsymbol{z}}_i^m,{\boldsymbol{z}}_j^m\right)={\phi}^m{\left({\boldsymbol{z}}_i^m\right)}^T{\phi}^m\left({\boldsymbol{z}}_j^m\right) \) denote the kernel function over subjects \( {\boldsymbol{z}}_i^m \) and \( {\boldsymbol{z}}_j^m \) of the \( m \)-th modality, for \( m=1,\ldots,M \). A linearly combined kernel, \( k\left({\boldsymbol{z}}_i,{\boldsymbol{z}}_j\right)=\sum_{m=1}^{M}{\beta}_m{k}^m\left({\boldsymbol{z}}_i^m,{\boldsymbol{z}}_j^m\right) \), is adopted to fuse information from the different modalities. Here \( {\beta}_m \) is the combining weight of the \( m \)-th kernel, with \( \sum_{m=1}^{M}{\beta}_m=1 \). In our experiments, the optimal \( {\beta}_m \) is determined via a coarse-grid search through cross-validation on the training set.
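
A minimal sketch of this fusion scheme, using scikit-learn’s precomputed-kernel interface as a stand-in for LIBSVM (Xtr and Xte are assumed lists of per-modality feature arrays after feature selection), is:

```python
import numpy as np
from sklearn.svm import SVC

def combined_kernel(A, B, betas):
    # k(z_i, z_j) = sum_m beta_m * k^m(z_i^m, z_j^m), with linear base kernels
    return sum(b * (Am @ Bm.T) for b, Am, Bm in zip(betas, A, B))

def multikernel_svm(Xtr, ytr, Xte, betas, C=1.0):
    clf = SVC(C=C, kernel='precomputed')
    clf.fit(combined_kernel(Xtr, Xtr, betas), ytr)   # n_train x n_train Gram
    return clf.predict(combined_kernel(Xte, Xtr, betas))
```

In practice, the weights betas would be selected on a coarse grid (e.g., steps of 0.1 summing to 1) via cross-validation on the training set, as described above.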

Experiments and results

We test the performance of the proposed method on 202 ADNI participants with corresponding baseline MRI and FDG-PET data. Classification performance is assessed for three clinically relevant pairs of diagnostic groups (AD vs. NC, MCI vs. NC, and MCI-C vs. MCI-NC). The proposed method is compared with three existing multi-kernel-based multimodal classification methods: the multi-kernel method (Zhang et al. 2011) without feature selection (denoted as Baseline), the multi-kernel method with LASSO feature selection performed independently on each modality (denoted as SMFS), and the multi-kernel method with the multi-modal feature selection method (denoted as MMFS) proposed in (Zhang and Shen 2012). We also directly concatenate the 93 features from MRI and the 93 features from FDG-PET into a 186-dimensional vector, and then perform t-test or LASSO feature selection followed by a standard SVM with linear kernel for classification (with the corresponding methods denoted as t-test and LASSO, respectively). The same training and test subjects are used in all methods for fair comparison.

Validation

In our experiments, we use a 10-fold cross-validation strategy to evaluate the effectiveness of our proposed method. Specifically, the whole set of subjects is partitioned into 10 subsets of equal size. In each fold, nine subsets are used for training and the remaining one for testing. The process is independently repeated 10 times to avoid any bias introduced by the random partitioning of the dataset. We evaluate the performance of different methods using the classification accuracy (ACC), as well as the sensitivity (SEN), the specificity (SPE), and the area under the receiver operating characteristic (ROC) curve (AUC). Here, the accuracy measures the proportion of subjects correctly classified in the whole population, the sensitivity represents the proportion of AD or MCI patients correctly classified, and the specificity denotes the proportion of normal controls correctly classified. The SVM classifier is implemented using the LIBSVM toolbox (Chang and Lin 2007), with a linear kernel and the default value of the parameter C (i.e., \( C=1 \)). The optimal values of the regularization parameters \( {\lambda}_1 \) and \( {\lambda}_2 \) and the weights in the multi-kernel classification method are determined by another 10-fold cross-validation on the training subjects.
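
For reference, the reported metrics can be computed as in the following sketch (hypothetical variable names; labels are +1 for patients and −1 for normal controls):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def evaluate(y_true, y_pred, y_score):
    acc = np.mean(y_true == y_pred)                 # ACC: all subjects
    sen = np.mean(y_pred[y_true == 1] == 1)         # SEN: patients correctly classified
    spe = np.mean(y_pred[y_true == -1] == -1)       # SPE: controls correctly classified
    auc = roc_auc_score(y_true, y_score)            # AUC from decision scores
    return acc, sen, spe, auc
```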

Results of AD/MCI vs. NC classification

The classification results of AD vs. NC and MCI vs. NC produced by different methods are listed in Table 2. As can be seen from Table 2, our proposed method consistently achieves better performance than the other methods in classifying AD/MCI patients versus normal controls. Specifically, for classifying AD from NC, our proposed method achieves a classification accuracy of 95.95 %, while the best accuracy of the other methods is only 92.25 % (obtained by SMFS). For classifying MCI from NC, our proposed method achieves a classification accuracy of 80.26 %, while the best accuracy of the other methods is only 74.34 % (obtained by Baseline). Furthermore, we perform significance tests using the paired t-test on the classification accuracies of our proposed method versus each compared method, with the corresponding results given in Table 2. From Table 2, we can see that our proposed method is significantly better than the compared methods (i.e., the corresponding p values are very small).

Table 2 Comparison of performance of different methods for AD vs. NC and MCI vs. NC classifications, respectively

For further validation, Fig. 3 plots the ROC curves of the four multi-modality based classification methods for AD/MCI vs. NC classification. Figure 3 shows that our proposed method consistently achieves better classification performance than the other multi-modality based methods for both AD vs. NC and MCI vs. NC classification. Specifically, as can be seen from Table 2, our method achieves an area under the ROC curve (AUC) of 0.97 and 0.81 for AD vs. NC and MCI vs. NC classification, respectively, showing better classification ability than the other methods.

Fig. 3 ROC curves of four multi-modality based methods. a Classification of AD vs. NC, b Classification of MCI vs. NC

Results of MCI conversion prediction

The classification results for MCI-C vs. MCI-NC are shown in Table 3. As can be seen from Table 3 and Fig. 4, our proposed method consistently outperforms the other methods in MCI-converter classification. Specifically, our proposed method achieves a classification accuracy of 69.78 %, while the best accuracy of the other methods is only 61.67 % (obtained by SMFS). The classification accuracy of our proposed method is significantly (p < 0.001) higher than that of any compared method.

Table 3 Comparison of performance of different methods for MCI-C vs. MCI-NC classification
Fig. 4 ROC curves of four multi-modality based methods for classification of MCI converters

Figure 4 plots the corresponding ROC curves of the four multi-modality based methods for MCI-C vs. MCI-NC classification. We can see from Fig. 4 that the best classification performance is obtained by our proposed method. Table 3 also lists the area under the ROC curve (AUC) of the different classification methods. As can be seen from Table 3, the AUC achieved by our proposed method is 0.69 for MCI-C vs. MCI-NC classification, while the best value among the other methods is only 0.64 (obtained by t-test), indicating the superior classification performance of our proposed method.

The most discriminative brain regions

The most discriminative regions are defined as those most frequently selected during cross-validation. For each selected discriminative feature, a standard paired t-test is performed to evaluate its discriminative power between the patient and normal control groups. The top 10 ROIs detected from both MRI and FDG-PET data for MCI classification are listed in Table 4, and Fig. 5 plots these regions in the template space. As can be seen from Table 4 and Fig. 5, the most important regions for MCI classification include the hippocampus, amygdala, etc., in agreement with other recent AD/MCI studies (Sole et al. 2008; Derflinger et al. 2011; Al 2008; Poulina et al. 2011; Wolf et al. 2003).

Table 4 Top 10 ROIs selected by the proposed method for MCI classification
Fig. 5 Top 10 ROIs selected by the proposed method for MCI classification

Discussion

In this paper, we proposed a novel label-aligned multi-task feature learning method for multimodal classification of Alzheimer’s disease and mild cognitive impairment. Experimental results on the ADNI database show that our proposed method achieves high classification accuracies of 95.95, 80.26, and 69.78 % for AD vs. NC, MCI vs. NC, and MCI-C vs. MCI-NC classification, respectively, in comparison with several state-of-the-art multimodal AD/MCI classification methods.

Multi-task learning

Multi-task learning is a recently developed machine learning technique that jointly learns multiple tasks via a shared representation. Because the learning tasks share domain information or some commonality, multi-task learning can usually improve performance by learning classifiers for multiple tasks together.

Recently, multi-task learning has been introduced into the medical imaging field. For example, Zhang and Shen (2012) applied multi-task learning for the joint prediction of regression variables (i.e., clinical scores) and a classification variable (i.e., class labels) in Alzheimer’s disease. In their method, multi-task feature selection was first used to select the common subset of features across different tasks, and multi-kernel SVM was then performed for final regression and classification. It is worth noting that the feature selection step in (Zhang and Shen 2012) was performed separately for each modality, ignoring the potential relationships among different modalities. Afterwards, Liu et al. (2014) considered the inter-modality relationship within each subject to preserve the complementary information among modalities; however, their method only concerns information within individual subjects. Suk et al. (2014) first assumed that the data classes follow multi-peak distributions, and then formulated a multi-task learning problem in an \( \ell_{2,1} \) framework with new label encodings obtained by clustering. However, the method in (Suk et al. 2014) still did not consider the potential information across different modalities. More recently, Jie et al. (2015) proposed a manifold regularized multi-task feature learning method, which considers the manifold information in each modality separately and thus cannot reflect the information across different modalities. It is worth noting that our proposed method and Jie et al.’s method are developed from different considerations: Jie et al.’s method only concerns preserving the manifolds existing in each modality of the data, whereas our approach not only takes the structure information of each modality into account, but also regards the relationships across different modalities as extra information. Hence, Jie et al.’s method can be regarded as a special case of our proposed method. Although our proposed method has a more general feature selection framework than Jie et al.’s approach, the objective function of our method is still convex; thus, the optimal solution can still be obtained, e.g., by the Accelerated Proximal Gradient (APG) method.

In contrast, our proposed label-aligned multi-task feature learning method preserves the relationships not only across different modalities within the same subject but also among different modalities across different subjects. Our proposed method is evaluated on the ADNI database using baseline MRI and FDG-PET data for three classification tasks (AD vs. NC, MCI vs. NC, and MCI-C vs. MCI-NC), and the experimental results demonstrate its effectiveness.

Comparison with existing methods

To compare our proposed method with existing methods, in this section we compare our results with those of existing state-of-the-art multi-modality methods, as shown in Table 5. As can be seen from Table 5, Hinrichs et al. (2011) used 48 AD subjects and 66 NC subjects and obtained an accuracy of 87.6 % using two modalities (MRI + PET). Huang et al. (2011) used 49 AD patients and 67 NC with MRI and PET modalities for AD classification, achieving an accuracy of 94.3 %. In (Gray et al. 2012), the authors used 37 AD patients, 75 MCI patients, and 35 NC and reported classification accuracies of 89.0, 74.6, and 58.0 % for AD, MCI, and MCI-converter classification, respectively, using four different modalities (MRI + PET + CSF + genetic). Jie et al. (2015) achieved accuracies of 95.03, 79.27, and 68.94 % for classification of AD/NC, MCI/NC, and MCI-C/MCI-NC, respectively. Liu et al. (2014) obtained accuracies of 94.37, 78.80, and 67.83 % for AD, MCI, and MCI-converter classification, respectively. It is worth noting that the datasets used in (Jie et al. 2015) and (Liu et al. 2014) are the same as that in the current study. Table 5 indicates that our proposed method consistently outperforms the other methods, which further validates its efficacy for AD diagnosis.

Table 5 Comparison of classification accuracy of different multi-modality methods

The effect of regularization parameters

Our method has two regularization terms, whose relative contributions are controlled by the parameters \( {\lambda}_1 \) (sparsity) and \( {\lambda}_2 \) (label alignment). Here, the values of \( {\lambda}_1 \) and \( {\lambda}_2 \) are each varied from 0 to 50 at a step size of 10 to observe their effect on the classification performance of our proposed method. Figure 6 shows the classification results with respect to different values of \( {\lambda}_1 \) and \( {\lambda}_2 \). When \( {\lambda}_1=0 \), all features extracted from the MRI and FDG-PET data are used for classification, and our method degenerates to the multi-kernel method proposed in (Zhang et al. 2011). Likewise, when \( {\lambda}_2=0 \), no label-aligned regularization term is introduced, and our method degenerates to the MMFS method proposed in (Zhang and Shen 2012).

Fig. 6 Classification accuracy with respect to the regularization parameters \( {\lambda}_1 \) and \( {\lambda}_2 \). a AD classification, b MCI classification, and c MCI conversion classification. Each curve denotes the performance for a different selected value of \( {\lambda}_1 \); the x-axis represents the values of \( {\lambda}_2 \)

As we can observe from Fig. 6, for all values of \( {\lambda}_1 \) and \( {\lambda}_2 \), our proposed method consistently outperforms the MMFS method on the three classification tasks (i.e., AD vs. NC, MCI vs. NC, and MCI-C vs. MCI-NC), which further indicates the advantage of the label-aligned regularization term. Also, Fig. 6 shows that for a fixed value of \( {\lambda}_1 \), the curves over different values of \( {\lambda}_2 \) are very smooth on all three classification tasks, indicating that our method is relatively robust to the regularization parameter \( {\lambda}_2 \). Finally, for a fixed value of \( {\lambda}_2 \), the results on the three classification tasks vary considerably with \( {\lambda}_1 \), which implies that the selection of \( {\lambda}_1 \) is very important for the final classification results. This is reasonable, since \( {\lambda}_1 \) controls the sparsity of the model and thus determines the size of the optimal feature subset.

The effect of weights for multimodal classification

We investigate how the two combining kernel weights \( {\beta}_{\mathrm{MRI}} \) and \( {\beta}_{\mathrm{PET}} \) affect the classification performance of our proposed method. The combining kernel weights are varied from 0 to 1 at a step size of 0.1, under the constraint \( {\beta}_{\mathrm{MRI}}+{\beta}_{\mathrm{PET}}=1 \). Figure 7 shows the classification accuracy and AUC value under different combinations of the MRI and PET kernel weights. As we can observe from Fig. 7, relatively high classification performance is obtained in the middle of the range, which demonstrates the effectiveness of combining the two modalities for classification. Moreover, the higher-performance settings mainly lie within the interval [0.2, 0.8], implying that each modality is indispensable for achieving good classification performance.

Fig. 7 Classification results on the three classification tasks with respect to different combining weights of MRI and PET (top: classification accuracy; bottom: AUC value)

Limitations

Several limitations should be considered in future studies. First, in the current study we only investigated binary classification problems (i.e., AD vs. NC, MCI vs. NC, and MCI-C vs. MCI-NC) and did not test the ability of the classifier in multi-class classification of AD, MCI, and normal controls. Although multi-class classification is more challenging than binary classification, it is very important for diagnosing the different stages of dementia. Second, the proposed method requires the same number of features from each modality. Other modalities in the ADNI database, such as CSF and genetic data, which have different feature numbers, may also carry important pathological information that could further improve classification performance. Finally, longitudinal data may contain very important information for classification, while our proposed method can only deal with baseline data.

Conclusion

This paper proposed a novel multi-task feature learning method for jointly selecting features from multi-modality neuroimaging data for AD/MCI classification. By introducing a label-aligned regularization term into the multi-task learning framework, the proposed method utilizes the relationships across both modalities and subjects to seek the most discriminative subset of features. Experimental results on the ADNI database demonstrate that our proposed method outperforms state-of-the-art methods for multimodal classification of AD/MCI.