Introduction

Alzheimer’s disease (AD), characterized by progressive impairment of cognitive and memory functions, is the most prevalent cause of dementia in elderly people and is recognized as one of the major challenges to global health care systems. A recent report by the Alzheimer’s Association indicates that AD is the sixth-leading cause of death in the United States and that its share among causes of death is rising significantly every year (Alzheimer’s 2012). It is also estimated that 10–20 % of people aged 65 or older have mild cognitive impairment (MCI), a prodromal stage of AD (Alzheimer’s 2012) that lies on the spectrum between normal cognition and dementia (Cui et al. 2011). Because symptomatic treatments are effective only within a limited time window, early diagnosis and prognosis of AD/MCI have become of great importance in the clinic.

To this end, researchers in many scientific fields have devoted their efforts to understanding the underlying mechanisms of these diseases and to identifying pathological biomarkers for diagnosis or prognosis of AD/MCI by analyzing different types of neuroimaging and biological data, such as magnetic resonance imaging (MRI) (Davatzikos et al. 2011; Wee et al. 2011), positron emission tomography (PET) (Nordberg et al. 2010), functional MRI (fMRI) (Greicius et al. 2004; Suk et al. 2013), cerebrospinal fluid (CSF) (Nettiksimmons et al. 2010), etc. In terms of clinical diagnosis, structural MRI provides visual information regarding macroscopic tissue atrophy, which results from the cellular changes underlying AD/MCI, and PET can be used to investigate cerebral glucose metabolism (Nordberg et al. 2010), which reflects functional brain activity.

While these neuroimaging techniques have contributed substantially to our observation of the brain, how to efficiently understand and analyze the observed information has been a major concern for the last decades. In that respect, machine learning has provided effective tools to tackle these challenges. Specifically, it has proved its efficacy in multivariate pattern analysis and feature selection for clinical diagnosis. It has also offered a new strategy to efficiently fuse complementary information from different modalities, including MRI, PET, and biological and neurological data, for discriminating AD/MCI patients from healthy normal controls (HC) (Fan et al. 2007; Perrin et al. 2009; Kohannim et al. 2010; Walhovd et al. 2010; Cui et al. 2011; Hinrichs et al. 2011; Zhang et al. 2011; Wee et al. 2012; Westman et al. 2012; Yuan et al. 2012; Zhang and Shen 2012). Kohannim et al. (2010) concatenated features from modalities into a vector and used a support vector machine (SVM) classifier. Walhovd et al. (2010) applied multi-method stepwise logistic regression analyses, and Westman et al. (2012) exploited hierarchical modeling with orthogonal partial least squares to latent structures. Hinrichs et al. (2011) and Zhang et al. (2011) independently utilized kernel-based machine learning techniques. There have also been attempts to select features by means of sparse learning, which jointly learns the tasks of clinical label identification and clinical score prediction (Yuan et al. 2012; Zhang and Shen 2012).

Although these studies demonstrated the effectiveness of their methods in their own experiments on multi-modal AD/MCI classification, the main limitation of the previous work is that it considered only simple low-level features, such as cortical thickness and/or gray matter tissue volumes from MRI (Klöppel et al. 2008; Gray et al. 2013; Zhang et al. 2011; Zhang and Shen 2012; Cui et al. 2011; Desikan et al. 2009; Walhovd et al. 2010; Yao et al. 2012; Westman et al. 2012; Ewers et al. 2012; Zhou et al. 2011; Li et al. 2012; Liu et al. 2012), mean signal intensities from PET (Mosconi et al. 2008; Walhovd et al. 2010; Nordberg et al. 2010; Zhang et al. 2011; Zhang and Shen 2012; Gray et al. 2013), and t-tau, p-tau, and β-amyloid 42 (Aβ42) from CSF (Cui et al. 2011; Yuan et al. 2012; Zhang et al. 2011; Zhang and Shen 2012; Walhovd et al. 2010; Westman et al. 2012; Ewers et al. 2012; Tapiola et al. 2009). In this paper, we assume that there exists hidden or latent high-level information inherent in those low-level features, such as relations among them, which can help to build a more robust model for AD/MCI diagnosis and prognosis.

To tackle this problem, we exploit a deep learning framework, which has been used effectively to discover visual features in computer vision (Hinton and Salakhutdinov 2006; Bengio et al. 2007; Lee et al. 2011; Yu et al. 2011). The main concept of deep learning is that deep architectures can be much more efficient than shallow architectures in terms of the computational elements and parameters required to represent unknown functions (Bengio et al. 2007). Furthermore, one of the key features of deep learning is that lower layers represent low-level features while higher layers abstract those low-level features. In the case of our neuroimaging and biological data, a deep or hierarchical architecture can be used efficiently to discover the latent or hidden representation inherent in the low-level features of each modality, and ultimately to enhance classification accuracy. Specifically, a ‘stacked auto-encoder’ (SAE) is used to discover latent representations from the original neuroimaging and biological features. It is also noteworthy that, thanks to the unsupervised characteristic of pre-training in deep learning, the SAE model allows us to benefit from target-unrelated samples to discover general latent feature representations, and hence to further enhance classification accuracy.

The main contributions of our work can be summarized as follows: (1) To the best of our knowledge, this is the first work that applies deep learning to feature representation in brain disease diagnosis and prognosis. (2) Unlike the previous work in the literature, we consider a complicated non-linear latent feature representation, which can be discovered from the data by self-taught learning. (3) By constructing an augmented feature vector via a concatenation of the original low-level features and the SAE-learned latent feature representation, we improve diagnostic accuracy on the public ADNI dataset. (4) By pre-training the SAE in an unsupervised manner with target-unrelated samples and then fine-tuning it with target-related samples, the proposed method further enhances classification performance.

Materials and image processing

Subjects

In this work, we use the ADNI dataset, publicly available on the web. Specifically, we consider only the baseline MRI, 18-fluoro-deoxyglucose (FDG) PET, and CSF data acquired from 51 AD patients, 99 MCI patients (43 MCI converters, who progressed to AD, and 56 MCI non-converters, who did not progress to AD within 18 months), and 52 HC subjects. The demographics of the subjects are detailed in Table 1. Along with the neuroimaging and biological data, two types of clinical scores, the mini-mental state examination (MMSE) and the Alzheimer’s disease assessment scale-cognitive subscale (ADAS-Cog), are also provided for each subject.

Table 1 Demographic and clinical information of the subjects

With regard to the general eligibility criteria in ADNI, subjects were aged between 55 and 90 and had a study partner able to provide an independent evaluation of functioning. The general inclusion/exclusion criteria are as follows: (1) healthy subjects: MMSE scores between 24 and 30 (inclusive), a clinical dementia rating (CDR) of 0, non-depressed, non-MCI, and non-demented; (2) MCI subjects: MMSE scores between 24 and 30 (inclusive), a memory complaint, objective memory loss measured by education-adjusted scores on the Wechsler memory scale logical memory II, a CDR of 0.5, absence of significant levels of impairment in other cognitive domains, essentially preserved activities of daily living, and an absence of dementia; and (3) mild AD: MMSE scores between 20 and 26 (inclusive), a CDR of 0.5 or 1.0, and fulfillment of the National Institute of Neurological and Communicative Disorders and Stroke and the Alzheimer’s Disease and Related Disorders Association (NINCDS/ADRDA) criteria for probable AD.

MRI and PET scanning

The structural MR images were acquired from 1.5 T scanners. We downloaded data in Neuroimaging Informatics Technology Initiative (NIfTI) format, which had been pre-processed for spatial distortion correction caused by gradient nonlinearity and B1 field inhomogeneity. The FDG-PET images were acquired 30–60 min post-injection, averaged, spatially aligned, interpolated to a standard voxel size, normalized in intensity, and smoothed to a common resolution of 8 mm full width at half maximum. CSF data were collected in the morning after an overnight fast using a 20- or 24-gauge spinal needle, frozen within 1 h of collection, and transported on dry ice to the ADNI Biomarker Core Laboratory at the University of Pennsylvania Medical Center.

Image processing and feature extraction

The MR images were preprocessed by applying the typical procedures of anterior commissure (AC)–posterior commissure (PC) correction, skull-stripping, and cerebellum removal. Specifically, we used the MIPAV software for AC–PC correction, resampled images to 256 × 256 × 256, and applied the N3 algorithm (Sled et al. 1998) to correct intensity inhomogeneity. An accurate and robust skull stripping (Wang et al. 2011) was performed, followed by cerebellum removal. We further manually reviewed the skull-stripped images to ensure clean skull and dura removal. Then, FAST in the FSL package (Zhang et al. 2001) was used to segment the structural MR images into three tissue types: gray matter (GM), white matter (WM), and cerebrospinal fluid (CSF). We finally parcellated them into 93 regions of interest (ROIs) by warping Kabani et al.’s (1998) atlas to each subject’s space via HAMMER (Shen and Davatzikos 2002), although other advanced registration methods can also be applied for this process (Friston et al. 1995; Evans and Collins 1997; Rueckert et al. 1999; Shen et al. 1999; Wu et al. 2006; Xue et al. 2006a, b; Avants et al. 2008; Yang et al. 2008; Tang et al. 2009; Vercauteren et al. 2009; Jia et al. 2010). In this work, we considered only GM for classification, because of its relatively high relatedness to AD/MCI compared to WM and CSF (Liu et al. 2012).

The FDG-PET images were rigidly aligned to the respective MR images, and the parcellation was then propagated from the atlas by registration. For each ROI, we used the GM tissue volume from MRI and the mean intensity from FDG-PET as features, which are among those most widely used in the field for AD/MCI diagnosis (Davatzikos et al. 2011; Hinrichs et al. 2011; Zhang and Shen 2012; Liu et al. 2013). Therefore, we have 93 features from an MR image and the same number of features from an FDG-PET image. Here, we should note that, although the regions of the medial temporal and superior parietal lobes are known to be mainly affected by the disease, we assume that other brain regions, whose relatedness to AD has not yet been clearly investigated, may also contribute to the diagnosis of AD/MCI; we thus consider all 93 ROIs in our study. In addition, we have three CSF biomarkers, Aβ42, t-tau, and p-tau, as features.

Stacked auto-encoder for latent feature representation

In this section, we describe the proposed method for AD/MCI classification. Figure 1 illustrates a schematic diagram of the proposed method. Given multi-modal data along with the class-label and clinical scores, we first extract low-level features from MRI and FDG-PET as explained in "Image processing and feature extraction". We then discover a latent feature representation from the low-level features of MRI, FDG-PET, and CSF, individually, by deep learning with a SAE. In deep learning, we perform two steps sequentially: (1) we first pre-train the SAE in a greedy layer-wise manner to obtain good initial parameters; (2) we then fine-tune the deep network to find the optimal parameters. Sparse learning is then applied to the augmented feature vectors, i.e., the concatenation of the original low-level features and the SAE-learned features, to select features that efficiently regress the target values, e.g., the class-label and/or clinical scores. Finally, we fuse the selected multi-modal feature information via a multi-kernel SVM (MK-SVM) for diagnosis. Note that the latent feature representation and feature selection are performed for each modality individually. Hereafter, for simplicity, we do not explicitly indicate the modality of samples unless specified. The method described below is applied to each modality individually, but it is also applicable to the concatenated feature vectors of the three modalities as a form of information fusion, which is considered later in our experiments for comparison.

Fig. 1
figure 1

An illustration of the proposed method for AD/MCI diagnosis

Sparse auto-encoder

An auto-encoder, also called an auto-associator, is a type of artificial neural network structurally defined by three layers: an input layer, a hidden layer, and an output layer. The input layer is fully connected to the hidden layer, which is in turn fully connected to the output layer, as illustrated in Fig. 2. The aim of the auto-encoder is to learn a latent or compressed representation of the input by minimizing the reconstruction error between the input and the output reconstructed from the learned representation.

Fig. 2
figure 2

Illustration of an auto-encoder and its parameters. (The bias parameters \(\mathbf{b}_1\) and \(\mathbf{b}_2\) are omitted for clarity.)

Let \(D_H\) and \(D_I\) denote, respectively, the number of hidden and input units in a neural network. Given a set of training samples \(\mathbf{X}=\{\mathbf{x}_{i}\in \mathbb{R}^{D_I}\}_{i=1}^{N}\) from N subjects, an auto-encoder maps \(\mathbf{x}_i\) to a latent representation \(\mathbf{y}_{i}\in \mathbb{R}^{D_H}\) through a linear deterministic mapping and a nonlinear activation function f as follows:

$$\mathbf{y}_{i}=f(\mathbf{W}_{1}\mathbf{x}_{i}+\mathbf{b}_{1})$$
(1)

where \(\mathbf{W}_{1}\in \mathbb{R}^{D_H\times D_I}\) is an encoding weight matrix and \(\mathbf{b}_{1}\in \mathbb{R}^{D_H}\) is a bias vector. Regarding the activation function, in this study we use the logistic sigmoid function \(f(a)=1/\left(1+\exp(-a)\right)\), which is the most widely used in the field of pattern recognition and machine learning (Bengio et al. 2007; Lee et al. 2008; Bengio 2009; Larochelle et al. 2009; Ngiam et al. 2011; Shin et al. 2013). The representation \(\mathbf{y}_i\) of the hidden layer is then mapped to a vector \(\mathbf{z}_{i}\in \mathbb{R}^{D_I}\), which approximately reconstructs the input vector \(\mathbf{x}_i\) by another linear mapping as follows:

$$\mathbf{z}_{i}=\mathbf{W}_{2}\mathbf{y}_{i}+\mathbf{b}_{2}\approx \mathbf{x}_{i}$$
(2)

where \(\mathbf{W}_{2}\in \mathbb{R}^{D_I\times D_H}\) and \(\mathbf{b}_{2}\in \mathbb{R}^{D_I}\) are a decoding weight matrix and a bias vector, respectively.
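To make Eqs. (1) and (2) concrete, the following minimal NumPy sketch performs a single encode–decode pass. The dimensions, the random initialization scale, and the variable names are illustrative assumptions, not the settings used in our experiments.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

rng = np.random.default_rng(0)
D_I, D_H = 93, 50                              # e.g., 93 ROI features, 50 hidden units
x = rng.random(D_I)                            # one input sample

W1 = rng.normal(scale=0.01, size=(D_H, D_I))   # encoding weight matrix
b1 = np.zeros(D_H)                             # encoding bias
W2 = rng.normal(scale=0.01, size=(D_I, D_H))   # decoding weight matrix
b2 = np.zeros(D_I)                             # decoding bias

y = sigmoid(W1 @ x + b1)                       # latent representation, Eq. (1)
z = W2 @ y + b2                                # reconstruction of x, Eq. (2)
```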

Structurally, the numbers of input and output units are fixed to the dimension of an input vector. Meanwhile, the number of hidden units can be determined based on the nature of the data. If the number of hidden units is less than the dimension of the input data, the auto-encoder can be used for dimensionality reduction. However, it is noteworthy that, in order to capture complicated non-linear relations among neuroimaging features, we can allow the number of hidden units to be even larger than the input dimension; in this over-complete case we can still find interesting structure by imposing a sparsity constraint (Lee et al. 2008; Larochelle et al. 2009).

From a learning perspective, we aim to minimize the reconstruction error between the input \(\mathbf{x}_i\) and the output \(\mathbf{z}_i\) with respect to the parameters. Let \(\mathbf{Z}=\{\mathbf{z}_{i}\}_{i=1}^{N}\) and let \(l(\mathbf{X},\mathbf{Z})=\frac{1}{2}\sum_{i=1}^{N}\left\|\mathbf{x}_{i}-\mathbf{z}_{i}\right\|_{2}^{2}\) denote the reconstruction error. To encourage sparseness of the hidden units, we further consider the Kullback–Leibler (KL) divergence between the average activation \(\hat{\rho}_{j}\) of the jth hidden unit over the training samples and a target average activation ρ, defined as follows (Shin et al. 2013):

$$\text{KL}(\rho\|\hat{\rho}_{j})=\rho \log\frac{\rho}{\hat{\rho}_{j}} + (1-\rho) \log\frac{1-\rho}{1-\hat{\rho}_{j}}$$
(3)

where ρ and \(\hat{\rho}_{j}\) are regarded as the means of Bernoulli random variables. Then our objective function can be written as follows:

$$l(\mathbf{X}, \mathbf{Z})+\gamma\sum_{j=1}^{D_H}\text{KL}(\rho\|\hat{\rho}_{j}).$$
(4)

By adding the KL divergence, weighted by a sparsity control parameter γ, to the objective function, we penalize large average activations of hidden units over the training samples by setting ρ small. This penalization drives many of the hidden units’ activations to be equal or close to zero, resulting in sparse connections between layers.
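As a sketch of how the objective in Eq. (4) is evaluated, the function below computes the reconstruction error plus the KL sparsity penalty for a batch of samples; the default values of ρ and γ are illustrative, not those tuned in our experiments.

```python
import numpy as np

def sparse_ae_objective(X, Z, Y, rho=0.05, gamma=0.1):
    """Objective of Eq. (4): squared reconstruction error plus the
    KL sparsity penalty of Eq. (3) summed over hidden units.
    X: inputs (N x D_I), Z: reconstructions (N x D_I),
    Y: hidden activations (N x D_H)."""
    recon = 0.5 * np.sum((X - Z) ** 2)
    rho_hat = Y.mean(axis=0)                 # average activation of each hidden unit
    kl = np.sum(rho * np.log(rho / rho_hat)
                + (1.0 - rho) * np.log((1.0 - rho) / (1.0 - rho_hat)))
    return recon + gamma * kl
```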

Note that the output from the hidden layer determines the latent representation of the input vector. However, due to its shallow structure, the representational power of a single-layer auto-encoder is known to be very limited.

Stacked auto-encoder

Inspired by the biological model of the human visual cortex (Fukushima 1980; Serre et al. 2005), recent studies in machine learning have shown that a deep or hierarchical architecture is useful for finding highly non-linear and complex patterns in data (Bengio 2009). Motivated by these studies, in this paper we consider a SAE (Bengio et al. 2007), in which the auto-encoder becomes a building block, for latent feature representation in neuroimaging and biological data. Specifically, as the name suggests, we stack auto-encoders one after another, taking the outputs of the hidden units of a lower layer as the inputs to the upper layer, and so on. Figure 3 shows an example of a SAE model with three auto-encoders stacked hierarchically. Note that the number of units in the input layer is equal to the dimension of the input feature vector, whereas the number of hidden units in the upper layers can be determined according to the nature of the input, i.e., it may even be larger than the input dimension.

Fig. 3
figure 3

A deep architecture of our stacked auto-encoder and the two-step (unsupervised greedy layer-wise pretraining and supervised fine-tuning) parameter optimization scheme. (The black arrows denote the parameters to be optimized in the current stage). a Pre-training of the first hidden layer with the training samples as inputs, b pre-training of the second hidden layer with the outputs from the first hidden layer as inputs, c pre-training of the third hidden layer with the output from the second hidden layer as inputs, d fine-tuning of the whole network with an additional label-output layer, taking the pre-trained parameters as the starting point in optimization

Thanks to its hierarchical structure, one of the most important characteristics of the SAE is its ability to learn or discover highly non-linear and complicated patterns, such as the relations among input features. Another important characteristic of deep learning is that the latent representation can be learned directly from the data. Utilizing these representational and self-taught learning properties, we can find a latent representation of the original low-level features directly extracted from neuroimaging and biological data. When an input sample is presented to a SAE model, the different layers of the network represent different levels of information: the lower the layer in the network, the simpler the patterns it captures (e.g., linear relations of features); the higher the layer, the more complicated or abstract the patterns inherent in the input feature vector (e.g., non-linear relations among features).

With regard to training the weight matrices and biases in the deep network of our SAE model, a straightforward way is to treat the deep network as a conventional multi-layer neural network and apply back-propagation with a gradient-based optimization technique starting from random initialization. Unfortunately, it is generally known that deep networks trained in this manner perform worse than networks with a shallow architecture, as they tend to fall into poor local optima (Larochelle et al. 2009). Recently, however, Hinton et al. introduced a greedy layer-wise unsupervised learning algorithm and showed its success in learning a deep belief network (Hinton et al. 2006). The key idea of greedy layer-wise learning is to train one layer at a time by maximizing the variational lower bound (Hinton et al. 2006). That is, we first train the first hidden layer with the training data as input, and then train the second hidden layer with the outputs of the first hidden layer as input, and so on; the representation of the lth hidden layer is used as input for the (l + 1)-th hidden layer. This greedy layer-wise learning is called ‘pre-training’ (Fig. 3a–c). The pre-training is performed in an unsupervised manner with a standard back-propagation algorithm (Bishop 1995). Later in our experiments, we utilize this unsupervised characteristic of pre-training to find better parameters for discovering a latent representation in the neuroimaging and biological data, taking benefit from target-unrelated samples.
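The sketch below illustrates the greedy layer-wise pre-training described above, using plain batch gradient descent on the squared reconstruction error (the sparsity penalty is omitted for brevity); the learning rate, iteration count, and initialization are illustrative assumptions. Unlike Eq. (1), the weight matrices here are stored transposed so that samples can be processed as rows of a matrix.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def train_autoencoder(X, n_hidden, lr=0.1, n_iter=200, seed=0):
    """Train one auto-encoder by batch gradient descent on
    0.5 * ||X - Z||^2 (sparsity penalty omitted for brevity)."""
    rng = np.random.RandomState(seed)
    n, d = X.shape
    W1 = rng.normal(scale=0.01, size=(d, n_hidden))   # encoder (transposed layout)
    b1 = np.zeros(n_hidden)
    W2 = rng.normal(scale=0.01, size=(n_hidden, d))   # decoder
    b2 = np.zeros(d)
    for _ in range(n_iter):
        Y = sigmoid(X @ W1 + b1)          # hidden representation, cf. Eq. (1)
        Z = Y @ W2 + b2                   # reconstruction, cf. Eq. (2)
        R = Z - X                         # residual, i.e., dLoss/dZ
        dY = (R @ W2.T) * Y * (1.0 - Y)   # back-propagate through the sigmoid
        W2 -= lr * (Y.T @ R) / n
        b2 -= lr * R.mean(axis=0)
        W1 -= lr * (X.T @ dY) / n
        b1 -= lr * dY.mean(axis=0)
    return W1, b1

def pretrain_sae(X, layer_sizes):
    """Greedy layer-wise pre-training (Fig. 3a-c): each auto-encoder is
    trained on the hidden outputs of the one below it."""
    params, H = [], X
    for n_hidden in layer_sizes:
        W, b = train_autoencoder(H, n_hidden)
        params.append((W, b))
        H = sigmoid(H @ W + b)            # becomes the input to the next layer
    return params, H
```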

Focusing on the ultimate goal of our work, namely improving diagnostic performance in AD/MCI identification, we further optimize the deep network in a supervised manner. To this end, we stack another output layer on top of the SAE (Fig. 3d). This top output layer represents the class-label of an input sample, and we set its number of units equal to the number of classes of interest. The extended network can be considered a multi-layer neural network, and we call it the ‘SAE-classifier’ in this paper. It is therefore straightforward to optimize the deep network by back-propagation with gradient descent, with all parameters except those of the final classification layer initialized by the pre-trained ones. Note that the initialization of the parameters via pre-training is what distinguishes the deep network from a conventional neural network, and it helps the supervised optimization, called ‘fine-tuning’, reduce the risk of falling into poor local optima (Hinton et al. 2006; Larochelle et al. 2009). We summarize the deep learning of the SAE in Algorithm 1. Besides fine-tuning the parameters, we also utilize the SAE-classifier to determine the optimal SAE structure.
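A minimal sketch of the SAE-classifier’s forward pass follows, assuming encoder parameters in the format returned by the `pretrain_sae` sketch above; fine-tuning then back-propagates the classification error through this whole stack, starting from the pre-trained weights rather than random ones. The output-layer parameters `Wc` and `bc` are hypothetical names.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def sae_classifier_forward(X, encoder_params, Wc, bc):
    """Forward pass of the SAE-classifier (Fig. 3d): the pre-trained
    encoder stack topped with a logistic output layer."""
    H = X
    for W, b in encoder_params:     # pre-trained (W, b) pairs, bottom to top
        H = sigmoid(H @ W + b)
    return sigmoid(H @ Wc + bc)     # class posterior(s) from the label layer
```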

Later in our experiments, we consider the following two learning schemes, which differ mainly in how they utilize the available training samples: (1) The supervised approach learns the parameters of the SAE solely from the target-related training samples. For example, in the task of classifying MCI converters (MCI-C) and MCI non-converters (MCI-NC), we use the target-related training samples of MCI-C and MCI-NC for both pre-training and fine-tuning in deep learning, as well as for the SVM classifier learning (Fig. 4a). (2) The semi-supervised approach first performs pre-training using both target-related and target-unrelated samples, and then fine-tunes the model with only the target-related samples. For example, in the task of discriminating MCI-C from MCI-NC, we first perform pre-training with the samples of AD and HC as well as those of MCI-C and MCI-NC, and then fine-tune with only the MCI-C and MCI-NC training samples (Fig. 4b). Finally, the representations of the target-related MCI-C and MCI-NC training samples are used for SVM learning. The motivation for this learning scheme is that the more samples we use in pre-training of the deep architecture, the better the initialization of the parameters we can obtain, and thus the better the latent representation inherent in the low-level features we can discover (Larochelle et al. 2009). Hereafter, we use the terms ‘supervised’ and ‘semi-supervised’ throughout the paper to specify the respective strategies of learning the parameters of a SAE model, as described above.
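The difference between the two schemes lies purely in which samples feed each stage, as the sketch below shows for MCI-C vs. MCI-NC; `load_groups` and the layer sizes are hypothetical placeholders, and `pretrain_sae` is the sketch given earlier.

```python
import numpy as np

# Hypothetical per-group feature matrices (rows are subjects).
X_mcic, X_mcinc, X_ad, X_hc = load_groups()   # assumed data loader

# (1) Supervised: pre-train (and later fine-tune) on target-related samples only.
params_sup, _ = pretrain_sae(np.vstack([X_mcic, X_mcinc]), [100, 100, 10])

# (2) Semi-supervised: pre-train on ALL samples, then fine-tune and train
#     the SVM on the MCI-C / MCI-NC samples only.
params_semi, _ = pretrain_sae(np.vstack([X_mcic, X_mcinc, X_ad, X_hc]),
                              [100, 100, 10])
```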

Fig. 4
figure 4

An example of the SAE model training schemes for MCI converter (MCI-C) vs. MCI non-converter (MCI-NC) classification. The colored boxes denote the samples used for training during the specified step. The size of a rectangle represents the number of training samples available for each class

Once we determine the structure of a SAE model, we consider the outputs of the top hidden layer as our latent feature representation, i.e., \(\mathbf{Y}_H=f\left(\hat{\mathbf{W}}_{1}^{H}\mathbf{Y}_{H-1}+\hat{\mathbf{b}}_{1}^{H}\right)\in \mathbb{R}^{D_{H}\times N}\), where \(\hat{\mathbf{W}}_{1}^{H}\) and \(\hat{\mathbf{b}}_{1}^{H}\) denote, respectively, the trained weight matrix and bias of the top (Hth) hidden layer, and \(\mathbf{Y}_{H-1}\) is the output of the (H−1)-th hidden layer. To utilize both the low-level simple features and the high-level latent representation, we construct an augmented feature vector \(\hat{\mathbf{X}}\) by concatenating the SAE-learned feature representation \(\mathbf{Y}_H\) with the original low-level features \(\mathbf{X}\), i.e., \(\hat{\mathbf{X}}=\left[\mathbf{X}^{\mathsf T}, \mathbf{Y}_{H}^{\mathsf T}\right]\in \mathbb{R}^{N\times(D_I+D_H)}\), which is then fed into the sparse learning for feature selection as described below.
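A sketch of this augmentation step, reusing the encoder format of the pre-training sketch above (the function name is ours):

```python
import numpy as np

def augment_features(X, encoder_params):
    """Build X_hat = [X, Y_H]: the original low-level features (N x D_I)
    concatenated with the top-hidden-layer outputs (N x D_H)."""
    H = X
    for W, b in encoder_params:
        H = 1.0 / (1.0 + np.exp(-(H @ W + b)))   # propagate to the top layer
    return np.hstack([X, H])
```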

Feature selection with sparse representation learning

Earlier, Zhang and Shen showed the efficacy of sparse representation for feature selection in AD/MCI diagnosis (Zhang and Shen 2012). Here, we consider two sparse representation methods, namely the least absolute shrinkage and selection operator (lasso) (Tibshirani 1996) and the group lasso (Yuan and Lin 2006), which penalize a linear regression model with the \(l_1\)-norm and the \(l_{2,1}\)-norm, respectively. In this work, we select features for each modality individually and defer the multi-modal information fusion to MK-SVM learning. The rationale for the modality-specific feature selection is our belief that it is easier to find discriminative features in the low-dimensional space of a single modality than in the high-dimensional space of the modality-concatenated feature vectors.

Let \(m\in \{1, \ldots, M\}\) denote an index of modalities and \(\hat{\mathbf{X}}^{(m)}\in \mathbb{R}^{N \times D}\) denote a set of the augmented feature vectors, where N and \(D(=D_H+D_I)\) are, respectively, the number of samples and the dimension of the augmented feature vector. In the lasso, we focus on finding optimal weight coefficients \(\mathbf{a}^{(m)}\) to regress the target response vector \(\mathbf{t}^{(m)}\in \mathbb{R}^{N}\) by a combination of the features in \(\hat{\mathbf{X}}^{(m)}\) with a sparsity constraint as follows:

$$J\left(\mathbf{a}^{(m)}\right)=\min_{\mathbf{a}^{(m)}}\frac{1}{2}\left\| \mathbf{t}^{(m)}-\hat{\mathbf{X}}^{(m)}\mathbf{a}^{(m)}\right\|_{2}^{2}+ \lambda_{1} \left\|\mathbf{a}^{(m)} \right\|_{1}$$
(5)

where \(\lambda_1\) is a sparsity control parameter. In our work, the target response vector corresponds to the target clinical labels. The \(l_1\)-norm penalty on the linear regression imposes sparsity on the solution \(\mathbf{a}^{(m)}\), meaning that many of its elements become zero. By applying the lasso, we select the features whose weight coefficients are non-zero.
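As a sketch, the lasso-based selection can be reproduced with scikit-learn, whose `Lasso` minimizes the same criterion up to a 1/N scaling of the squared-error term; the synthetic data and the value of `alpha` below are illustrative.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X_aug = rng.normal(size=(100, 143))      # e.g., 93 LLF + 50 SAEF features
t = rng.choice([-1.0, 1.0], size=100)    # class labels as the target response

lasso = Lasso(alpha=0.01)                # alpha plays the role of lambda_1
lasso.fit(X_aug, t)
selected = np.flatnonzero(lasso.coef_)   # keep features with non-zero weights
X_sel = X_aug[:, selected]
```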

Meanwhile, unlike the lasso, which considers a single target response vector, the group lasso can accommodate multiple target response vectors, where each target response vector can be regarded as one task; it imposes a constraint that encourages correlated features to be jointly selected across the multiple tasks in a data-driven manner:

$$J\left(\mathbf{A}^{(m)}\right)=\min_{\mathbf{A}^{(m)}}\frac{1}{2}\sum_{s=1}^{S}\left\| \mathbf{t}_{s}^{(m)}-\hat{\mathbf{X}}^{(m)}\mathbf{a}^{(m)}_{s}\right\|_{2}^{2}+ \lambda_{2} \left\|\mathbf{A}^{(m)} \right\|_{2,1}$$
(6)

where \(s\in\{1,\ldots, S\}\) denotes an index of tasks, \(\mathbf{A}^{(m)}=\left[\mathbf{a}^{(m)}_{1}\cdots \mathbf{a}^{(m)}_{s}\cdots \mathbf{a}^{(m)}_{S}\right]\), and \(\lambda_2\) is a group-sparsity control parameter. In Eq. 6, \(\left\|\mathbf{A}^{(m)} \right\|_{2,1}=\sum_{d=1}^{D}\|\mathbf{A}^{(m)}[d,:]\|_{2}\), where \(\mathbf{A}^{(m)}[d,:]\) denotes the dth row of the matrix \(\mathbf{A}^{(m)}\). This \(l_{2,1}\)-norm encourages the selection of features that are jointly used to represent the target response vectors \(\{\mathbf{t}_{s}^{(m)}\}_{s=1}^{S}\) across tasks. We select the features whose absolute weight coefficients are larger than zero.

From inspection of Eqs. 5 and 6, we can see that the group lasso is a generalized form of the lasso in terms of the number of tasks involved in the regression; if we have information for only a single task, the group lasso reduces to the conventional lasso. Later in our experiments, we consider both of these sparse representation learning methods and observe their effects on feature selection as well as on classification performance. We use a set of class-labels in the lasso, and clinical scores as well as class-labels in the group lasso. The hyper-parameters \(\lambda_1\) and \(\lambda_2\) in Eqs. 5 and 6, respectively, are determined by a grid search over the space {0.001, 0.005, 0.01, 0.05, 0.1, 0.2, 0.3}. For the optimization, we use the SLEP toolbox.
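For the group lasso, scikit-learn’s `MultiTaskLasso` applies the same \(l_{2,1}\) penalty of Eq. (6) across tasks (again up to a 1/N scaling); the three illustrative target columns below stand for the class label and the two clinical scores.

```python
import numpy as np
from sklearn.linear_model import MultiTaskLasso

rng = np.random.default_rng(0)
X_aug = rng.normal(size=(100, 143))
T = rng.normal(size=(100, 3))            # S = 3 tasks: label, MMSE, ADAS-Cog

glasso = MultiTaskLasso(alpha=0.05)      # l_{2,1}-penalized multi-task regression
glasso.fit(X_aug, T)                     # coef_ has shape (S, D)
row_norms = np.linalg.norm(glasso.coef_.T, axis=1)   # ||A[d, :]||_2 per feature
selected = np.flatnonzero(row_norms > 0)
```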

Multi-kernel SVM learning

It has been shown in previous studies that biomarkers from different modalities can provide complementary information in AD/MCI diagnosis (Perrin et al. 2009). In this paper, we combine the complementary information from the MRI, FDG-PET, and CSF modalities in the kernel space using a linear SVM, which has proved its efficacy in many fields (Wee et al. 2012; Han and Davis 2012; Suk and Lee 2013). Given the dimension-reduced training samples \(\tilde{\mathbf{X}}^{(m)}=\{\tilde{\mathbf{x}}_{i}^{(m)}\}_{i=1}^{N}\) obtained through the sparse representation learning described in "Feature selection with sparse representation learning", and a test sample \(\tilde{\mathbf{x}}^{(m)}\), where \(m\in \{1, \ldots, M\}\) denotes an index of modalities, the decision function of the MK-SVM is defined as follows:

$$f\left(\tilde{\mathbf{x}}^{(1)},\ldots,\tilde{\mathbf{x}}^{(M)}\right)=\text{sign}\left\{\sum_{i=1}^{N}\zeta_{i}\alpha_{i}\sum_{m=1}^{M}\beta^{(m)}k^{(m)}\left(\tilde{\mathbf{x}}^{(m)}_{i},\tilde{\mathbf{x}}^{(m)}\right)+b\right\}$$
(7)

where \(\zeta_{i}\) is the class-label of the ith sample, \(\alpha_i\) and b are, respectively, a Lagrangian multiplier and a bias, \(k^{(m)}\left(\tilde{\mathbf{x}}^{(m)}_{i},\tilde{\mathbf{x}}^{(m)}\right)=\left\{\phi^{(m)}\left(\tilde{\mathbf{x}}^{(m)}_{i}\right)\right\}^{\mathsf T}\left\{\phi^{(m)}\left(\tilde{\mathbf{x}}^{(m)}\right)\right\}\) is a kernel function, \(\phi^{(m)}\) is a kernel-induced mapping function, and \(\beta^{(m)} \geq 0\) is a weight coefficient of the mth modality with the constraint \(\sum_{m=1}^{M} \beta^{(m)} = 1\). Refer to Gönen (2011) for a detailed explanation of the MK-SVM.
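A minimal sketch of this kernel combination with fixed modality weights follows; in practice the weights β are tuned under the sum-to-one constraint, and the data below are synthetic placeholders. Scikit-learn’s precomputed-kernel SVC absorbs the summed kernel directly.

```python
import numpy as np
from sklearn.svm import SVC

def combined_kernel(Xs_a, Xs_b, betas):
    """Weighted sum of per-modality linear kernels, cf. Eq. (7)."""
    return sum(beta * (Xa @ Xb.T) for beta, Xa, Xb in zip(betas, Xs_a, Xs_b))

rng = np.random.default_rng(0)
dims = (40, 35, 3)                                     # e.g., MRI, FDG-PET, CSF
Xs_train = [rng.normal(size=(80, d)) for d in dims]    # selected features per modality
Xs_test = [rng.normal(size=(20, d)) for d in dims]
y_train = rng.choice([0, 1], size=80)
betas = (0.5, 0.3, 0.2)                                # illustrative weights, sum to 1

svm = SVC(kernel='precomputed', C=1.0)
svm.fit(combined_kernel(Xs_train, Xs_train, betas), y_train)
y_pred = svm.predict(combined_kernel(Xs_test, Xs_train, betas))
```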

Experimental results

Experimental setup

In this section, we evaluate the effectiveness of the proposed method for non-linear latent feature representation by deep learning with a SAE, considering four binary classification problems: AD vs. HC, MCI vs. HC, AD vs. MCI, and MCI-C vs. MCI-NC. In the classifications of MCI vs. HC and AD vs. MCI, both MCI-C and MCI-NC data were used as the MCI class. For each classification problem, we applied a tenfold cross-validation technique: we randomly partitioned the dataset into 10 subsets, each including 10 % of the total dataset, and then used 9 of the 10 subsets for training and the remaining one for testing. We repeated this whole process 10 times for unbiased evaluation.
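The evaluation protocol amounts to the loop sketched below, with a stand-in linear SVM in place of the full pipeline and synthetic data; note that our partitions were drawn at random, whereas `RepeatedStratifiedKFold` additionally stratifies by class.

```python
import numpy as np
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(103, 93))            # e.g., 51 AD + 52 HC subjects
y = np.r_[np.ones(51), np.zeros(52)]

cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=10, random_state=0)
accs = []
for tr, te in cv.split(X, y):
    clf = SVC(kernel='linear').fit(X[tr], y[tr])   # stand-in for the full pipeline
    accs.append(clf.score(X[te], y[te]))
print(np.mean(accs), np.std(accs))        # mean +/- std over 10 x 10 folds
```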

To show the validity of the proposed method of combining the SAE-learned feature representation with the original low-level features, we compared the results of the proposed method with those obtained from the original low-level features and from the SAE-learned feature representation alone, applying the same strategies of feature selection and MK-SVM learning. Hereafter, we denote by LLF, SAEF, and LLF + SAEF the cases of using the original low-level features, the SAE-learned features, and the concatenation of LLF and SAEF, respectively. It is noteworthy that we used the same training and test samples across the competing methods for fair comparison.

Determination of the structure of a SAE model

With regard to the structure of a SAE model, we considered three hidden layers for MRI, FDG-PET, and CONCAT (the concatenation of features from all modalities), and two hidden layers for CSF, taking into account the dimensionality of the low-level features in each modality. To determine the number of hidden units, we performed classification with a SAE-classifier using a grid search. Due to the possibility of over-fitting with a small number of training samples, we stopped the fine-tuning step early by setting a small number of iterations. The optimal structures of the SAE models and the respective performances are presented in Table 2. For example, in the classification of AD and HC, we obtained the best accuracy of 85.7 % with MRI from a SAE-classifier with 500-50-10 (from bottom to top) hidden units in supervised learning. We used the DeepLearnToolbox to train our SAE model.
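The structure search reduces to the loop sketched below; the candidate sizes and the `evaluate_sae_classifier` helper (pre-train, briefly fine-tune, cross-validate) are hypothetical, as the exact grid we used is not reproduced here.

```python
import itertools

candidate_sizes = itertools.product([100, 500, 1000],   # 1st hidden layer
                                    [50, 100],           # 2nd hidden layer
                                    [10, 20, 30])        # 3rd hidden layer
best_acc, best_sizes = -1.0, None
for sizes in candidate_sizes:
    acc = evaluate_sae_classifier(sizes)   # hypothetical: returns CV accuracy
    if acc > best_acc:
        best_acc, best_sizes = acc, sizes
```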

Table 2 Classification accuracies (mean ± standard deviation) obtained from SAE-classifiers and their corresponding structures in terms of the number of hidden units

Classification results

Regarding feature selection, we observed that the lasso-based method showed better classification performance than the group lasso-based one. Here, we present the classification results obtained with the lasso-based feature selection method.

Table 3 shows the mean accuracies of the competing methods in the classification of AD and HC. Although the proposed method of LLF + SAEF with a single modality was outperformed in a couple of cases by the LLF-based one, e.g., 89 % (LLF) vs. 88.2 % (LLF + SAEF) with MRI and 93.7 % (LLF) vs. 93.5 % (LLF + SAEF) with CONCAT, the results from multi-modality fusion via MK-SVM showed the best accuracies of 97.9 and 98.8 % in supervised and semi-supervised learning, respectively. Compared to the accuracy of 97 % with the LLF-based method, the proposed method improved the accuracy by 0.9 and 1.8 % in supervised and semi-supervised learning, respectively.

Table 3 Performance comparison of different feature sets with lasso-based feature selection in AD vs. HC classification

In the classification of MCI and HC, as presented in Table 4, the proposed method showed the best classification accuracies of 88.8 and 90.7 % with supervised and semi-supervised learning schemes, respectively. The performance improvements compared to the classification accuracy of 84.8 % with the LLF-based method were 4 and 5.9 %, respectively.

Table 4 Performance comparison of different feature sets with lasso-based feature selection in MCI vs. HC classification

In the classification of AD and MCI, as shown in Table 5, the proposed method showed the best classification accuracies of 82.7 and 83.7 % with supervised and semi-supervised learning schemes, respectively. We could enhance the classification accuracy by 3.9 and 4.9 % with supervised and semi-supervised learning schemes, respectively, compared to the LLF-based method, whose accuracy was 78.8 %.

Table 5 Performance comparison of different feature sets with lasso-based feature selection in AD vs. MCI classification

In discriminating MCI-C from MCI-NC, the proposed method also outperformed the LLF-based method as presented in Table 6. While the LLF-based method showed the classification accuracy of 76 % with multi-modality fusion via MK-SVM, we could obtain the classification accuracies of 77.9 and 83.3 % in supervised and semi-supervised learning, respectively. It is remarkable that the semi-supervised learning scheme enhanced the performance by 7.3 % compared to that of the LLF-based method.

Table 6 Performance comparison of different feature sets with lasso-based feature selection in MCI converter (MCI-C) vs. MCI non-converter (MCI-NC) classification

We also plotted in Fig. 5 the best performances of the competing methods, regardless of the model training schemes, for the four binary classification problems, along with their sensitivity and specificity. From the figure, we can clearly see that the proposed method outperforms the competing methods. It is noteworthy that the improvement tends to increase in the order of AD vs. HC, AD vs. MCI, MCI vs. HC, and MCI-C vs. MCI-NC. That is, we achieved larger improvements in the more challenging and important tasks for early diagnosis and treatment, e.g., classifying between MCI-C and MCI-NC.

Fig. 5
figure 5

Comparison of the best performances of the competing methods, regardless of the learning schemes for a SAE model

Discussions

Deep learning-based latent feature representation

In our method of discovering a latent feature representation, we built a SAE-classifier as a means of determining the optimal SAE structure. It is worth noting that, across classification tasks, different numbers of hidden units were selected for the same modality, e.g., 500-50-10 in AD vs. HC, 100-100-20 in MCI vs. HC, 1000-50-30 in AD vs. MCI, and 100-100-10 in MCI-C vs. MCI-NC for MRI in supervised learning. We believe this reflects the necessity of considering different high-level non-linear relations inherent in the LLF for different classification problems.

In terms of model architecture, the SAE-classifier can be considered a simple logistic regression model taking the SAE-learned feature representation as input. Despite this simple architecture, it achieved classification accuracies higher than or comparable to those of the SVM classifier into which the SAE-learned features were fed after feature selection. This results from the fact that the SAE-learned features were optimized for the SAE-classifier, not for the SVM classifier.

Meanwhile, when we constructed an augmented feature vector via a concatenation of LLF and SAEF, we could greatly improve the accuracies. That is, the original low-level features remain informative for brain disease diagnosis alongside the latent feature representations.

In comparison with the LLF-based method, the proposed method of LLF + SAEF greatly improved the diagnostic accuracy over all the classification problems considered in this work. Specifically, the proposed method with semi-supervised learning consistently outperformed the competing methods in both uni-modality and multi-modality settings.

In deep learning, the size of the training set is an important issue. While only a limited number of samples is available in the ADNI dataset, we note that, under the circumstance of a small sample size, there is empirical evidence that unsupervised pre-training helps deep learning find better parameters for reducing errors (Erhan et al. 2010). In the same vein, we obtained the best performances in the four binary classification problems from semi-supervised learning, which means we could benefit from target-unrelated samples for pre-training and learning good parameters for the deep network, and hence enhance the classification accuracy. This is one of the most prominent and important characteristics of deep learning with a SAE, compared to the conventional neural network. In the conventional neural network, we find the parameters starting from random initialization in a supervised manner, which means we can use only the limited number of target-related samples in learning; its application is therefore restricted to networks with only a small number of layers. Meanwhile, deep learning allows us to utilize unlabeled or target-unrelated samples in learning. From a practical point of view, it is of great importance to exploit information from unlabeled or target-unrelated data, of which we have much more available in reality.

The interpretation of the trained weights and the latent feature representations is also important. We can regard the trained weights as filters that find different types of relations among the inputs. For example, each hidden unit in the first hidden layer captures a different representation via a non-linear transformation of a weighted linear combination of the input low-level features. Note that each unit has a different weight set, and the weights of the input low-level features can be positive, negative, or zero. That is, by assigning different weights to each low-level feature, e.g., GM tissue volume from MRI or mean intensity from FDG-PET, the model discovers different latent relations among the low-level features at different hidden units. From a neuroscience perspective, the hidden layer can discover structural non-linear relations from MRI features and functional non-linear relations from FDG-PET features. The outputs of the first hidden layer are further combined in the upper hidden layers, capturing even more complicated relations. In this way, the SAE hierarchically captures latent complicated information inherent in the input low-level features, which is helpful for classifying patients and healthy normal controls. To date, there is no standard way to visualize or interpret the trained weights intuitively; this remains a challenging issue in the field of pattern recognition and machine learning. We would like to mention that, while it is not straightforward to interpret the meaning of the trained weights or the latent feature representations, it is clear from our experiments that the latent complicated information is useful for AD/MCI diagnosis.

To further validate the effectiveness of the proposed method, we also present the statistical significance of the results with a paired t test in Table 7. The test was performed on the results obtained from LLF and LLF + SAEF with MK-SVM. The lasso-based feature selection was used for both methods and, for LLF + SAEF, the SAE model was learned in a semi-supervised manner. The proposed method statistically outperformed the LLF-based method in all cases except CSF, rejecting the null hypothesis at the 95 % confidence level. We believe that, due to the low dimensionality of the original CSF features, the SAE-learned latent feature representation was not very informative for classification in that case.
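The test itself is a one-line call with SciPy; the per-fold accuracies below are illustrative placeholders, not our reported results.

```python
import numpy as np
from scipy.stats import ttest_rel

acc_llf = np.array([0.95, 0.96, 0.97, 0.96, 0.98, 0.97, 0.96, 0.97, 0.98, 0.96])
acc_prop = np.array([0.97, 0.98, 0.99, 0.98, 0.99, 0.98, 0.98, 0.99, 0.99, 0.98])

t_stat, p_value = ttest_rel(acc_prop, acc_llf)   # paired t test across folds
print(p_value < 0.05)                            # True: significant at the 95 % level
```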

Table 7 Statistical significance (paired t test) between the classification accuracies obtained from LLF and LLF + SAEF, which used supervised and semi-supervised learning schemes, respectively

Lasso vs. group lasso for feature selection

Here, we compare the performances of the lasso- and group lasso-based feature selection methods. In the group lasso, we considered the clinical labels and the clinical scores of MMSE and ADAS-Cog as the target responses. In conclusion, we observed that the lasso-based feature selection outperformed the group lasso-based one, as presented in Fig. 6. We believe the reason for this result is that, although \(l_{2,1}\)-norm-based multi-task learning can take advantage of richer information, it focuses on target regression rather than classification. Therefore, it finds features that most accurately regress the target values, i.e., clinical labels and clinical scores, regardless of the discriminative power of the selected features between classes. Moreover, the MMSE scores of the different groups overlapped highly, which means they provided little information and might have acted as a potential confound in discriminative feature selection. Meanwhile, in \(l_1\)-norm-based single-task learning, the clinical labels, whose prediction is our main goal, are used as the target response; the features selected to regress the target clinical labels can thus be class-discriminative in some sense. However, we should note that multi-task learning is a generalized form of single-task sparse learning. Therefore, if other class-related information exists, we should utilize it within the multi-task learning framework, which should then produce better performance.

Fig. 6
figure 6

Comparison of the best performances between lasso- and group lasso-based feature selection methods

Comparison with the state-of-the-art method

We also compared the performance of the proposed method with that of the multi-task multi-modal learning (M3T) method (Zhang and Shen 2012), which first performs multi-task learning, i.e., group lasso, on LLF for feature selection and then fuses the multi-modal information via MK-SVM. For fair comparison, we used the same training and test samples for M3T. Compared to the accuracies of M3T, which were 94.5 ± 0.8, 84 ± 1.1, 78.8 ± 1.8, and 71.8 ± 2.6 % for AD vs. HC, MCI vs. HC, AD vs. MCI, and MCI-C vs. MCI-NC classification, respectively, the proposed LLF + SAEF method achieved performance improvements of 3.4, 4.8, 3.9, and 6.1 % using the supervised learning scheme, and 4.3, 6.7, 4.9, and 11.5 % using the semi-supervised learning scheme, both with \(l_1\)-norm-based feature selection.

Selected region of interests

From Figs. 7, 8, 9 and 10, we can see that the SAE-learned latent features were not selected with high frequency for classification. However, based on the classification accuracies and the smaller number of high-frequency ROIs in the graphs, we surmise that the SAE-learned latent features helped filter out, during feature selection, the original low-level features that were not discriminative for classification.

Fig. 7
figure 7

Frequencies of the selected ROIs in AD vs. HC classification. Blue and red bars correspond, respectively, to the original low-level features and the SAE-learned feature representations

Fig. 8
figure 8

Frequencies of the selected ROIs in MCI vs. HC classification. Blue and red bars correspond, respectively, to the original low-level features and the SAE-learned feature representations

Fig. 9
figure 9

Frequencies of the selected ROIs in AD vs. MCI classification. Blue and red bars correspond, respectively, to the original low-level features and the SAE-learned feature representations

Fig. 10
figure 10

Frequencies of the selected ROIs in MCI-C vs. MCI-NC classification. Blue and red bars correspond, respectively, to the original low-level features and the SAE-learned feature representations

However, in the classification of MCI vs. HC, a larger number of ROIs was involved in the discrimination by the proposed method. Our understanding of this phenomenon is that, due to the subtlety of the cognitive impairment involved in MCI compared to AD, a larger number of ROIs, as well as the relations among them, must be considered for more accurate diagnosis.

The selected ROIs included the medial temporal lobe, which comprises a system of anatomically related structures that are vital for declarative or long-term memory, including the amygdala, the hippocampal formation and hippocampal region, and the perirhinal, entorhinal, and parahippocampal cortices (Braak and Braak 1991; Visser et al. 2002; Mosconi 2005; Lee et al. 2006; Devanand et al. 2007; Burton et al. 2009; Desikan et al. 2009; Ewers et al. 2012; Walhovd et al. 2010), as well as the regions of the supramarginal gyrus (Buckner et al. 2005; Desikan et al. 2009; Dickerson et al. 2009; Schroeter et al. 2009), angular gyrus (Schroeter et al. 2009; Nobili et al. 2013; Yao et al. 2012), superior parietal lobule, precuneus, and cuneus (Bokde et al. 2006; Singh et al. 2006; Davatzikos et al. 2011), the cingulate region (Mosconi 2005), the anterior limb of the internal capsule (Zhang et al. 2009), the caudate nucleus (Dai et al. 2009), and the fornix (Copenhaver et al. 2006).

Limitations of the current work

Although we achieved performance enhancements in the four different classification problems, the proposed method has some limitations and disadvantages.

First, in PET imaging, it is known that the partial volume effect, caused by a combination of the limited resolution of PET and image sampling, can lead to underestimation or overestimation of regional concentrations of radioactivity in the reconstructed images, and to further errors in statistical parametric images (Aston et al. 2002). However, in this work, we did not apply a procedure for partial volume correction. Therefore, each voxel may reflect a mixed combination of multiple tissue values, reducing the differences between GM and WM. On the other hand, since we use ROI-based features for classification, the performance of our method is less affected by this partial volume effect.

Second, as for computational complexity, once the model was built by determining the network structure, learning the model parameters, and selecting the features, it took less than a minute to obtain the result for a given patient on our system (Mac OS X with a 3.2 GHz Intel Core i5 and 16 GB of memory). However, as stated in "Deep learning-based latent feature representation", there is to date no general or intuitive method for visualizing the trained weights or interpreting the latent feature representations. The problem of efficient visualization or interpretation of the latent feature representation is another big challenge that should be tackled collaboratively by the machine learning and clinical neuroscience communities. Furthermore, we used a relatively small number of data samples (51 AD, 43 MCI-C, 56 MCI-NC, and 52 HC). Therefore, the network structures used to discover latent information in our experiments are not necessarily optimal for other datasets. We believe that more intensive studies, such as learning the optimal network structure from big data, are needed for the practical use of deep learning in clinical settings.

Third, according to a recent broad spectrum of studies, there is increasing evidence that subjective cognitive complaints are one of the important risk factors for progression to MCI or AD (Loewenstein et al. 2012; Mark and Sitskoorn 2013). That is, among cognitively normal elderly individuals who have subjective cognitive impairments, there is a high possibility that some are in a ‘pre-MCI’ stage. However, the ADNI dataset contains no related information. Thus, in our experiments, the HC group could include both genuine controls and those with subjective cognitive complaints.

Lastly, we should mention that data fusion in our deep learning is performed through a simple concatenation of the features from the modalities into a single vector, which results in lower performance than the multi-kernel SVM. In terms of network architecture, such a model is too shallow to discover the non-linear relations among features from multiple modalities. We believe that, although the proposed SAE-based deep learning successfully found latent information in this work, there is still room to design a multi-modal deep network for a shared representation across modalities. Furthermore, inspired by recent computer vision research (Ngiam et al. 2011; Srivastava and Salakhutdinov 2012), we could efficiently handle the incomplete-data problem (Yuan et al. 2012) with multi-modal deep learning. Therefore, building a novel multi-modal deep architecture that can efficiently model and combine complementary information in a unified framework will be a forthcoming research issue. Besides that, while we used the complementary information from the three modalities of MRI, FDG-PET, and CSF in this work, it will also be beneficial to consider genetic risk factors, such as the presence of the ε4 allele of Apolipoprotein E (ApoE), in our future work.

Conclusions

Due to the increasing proportion of deaths in elderly people attributable to AD, there has been great interest in early diagnosis and prognosis of this neurodegenerative disease in the clinic. Recent neuroimaging tools and machine learning techniques have greatly contributed to computer-aided brain disease diagnosis. However, the previous work in the literature considered only simple low-level features, such as cortical thickness and/or gray matter tissue volumes from MRI, mean signal intensities from FDG-PET, and t-tau, p-tau, and Aβ42 from CSF.

The main motivation of our work is that there may exist hidden or latent high-level information inherent in the original low-level features, such as relations among features, which can be helpful to build a more robust diagnostic model. To this end, in this paper, we proposed to utilize a deep learning with SAE for a latent feature representation from the data for AD/MCI diagnosis.

While the SAE is a neural network in terms of model structure, thanks to the two-step learning scheme of greedy layer-wise pre-training and fine-tuning in deep learning, we could reduce the risk of falling into a poor local optimum, which is the main limitation of the conventional neural network. We believe that deep learning can shed new light on the analysis of neuroimaging data, and our paper demonstrated the applicability of the method to brain disease diagnosis for the first time.

The contributions of our work are as follows: (1) to the best of our knowledge, this is the first work that applies deep learning to feature representation in brain disease diagnosis and prognosis; (2) unlike the previous work in the literature, we considered a complicated non-linear latent feature representation, which was discovered directly from the data; (3) by constructing an augmented feature vector via a concatenation of the original low-level features and the SAE-learned latent feature representation, we could greatly improve diagnostic accuracy; and (4) thanks to the unsupervised characteristic of pre-training in deep learning, the proposed method can utilize target-unrelated samples to discover a general feature representation, which helped further enhance classification performance. Using the publicly available ADNI dataset, we evaluated the effectiveness of the proposed method and achieved maximum accuracies of 98.8, 90.7, 83.7, and 83.3 % for AD vs. HC, MCI vs. HC, AD vs. MCI, and MCI-C vs. MCI-NC classification, respectively, outperforming the competing methods.