
1 Introduction

Alzheimer’s disease (AD) is an irreversible and progressive brain disorder. Early prediction of the disease using multimodal neuroimaging data has yielded important insights into the progression patterns of AD [11, 16, 18]. Among the many risk factors for AD, genetic variation has been identified as an important one [11, 17]. Therefore, it is important and beneficial to build prediction models that leverage both imaging and genetic data, such as magnetic resonance imaging (MRI), positron emission tomography (PET), and single-nucleotide polymorphisms (SNPs). However, this is a challenging task due to the multimodal nature of the data, the limited number of observations, and the highly redundant, high-dimensional features.

Multiple kernel learning (MKL) provides an elegant framework to learn an optimally combined kernel representation for heterogeneous data [4, 5, 10]. When applied to classification with multimodal data, each modality is usually represented by a base kernel [3, 8, 12]. The choice of sparse regularization, such as lasso (\(\ell _1\) norm) [13] or group lasso (\(\ell _{2,1}\) norm) [15], yields different modality selection approaches [3, 8, 12]. In particular, \(\ell _1\)-MKL [10] is able to sparsely select the most discriminative modalities. With grouped kernels, group lasso performs sparse group selection while densely combining kernels within groups. In [8], group lasso regularized MKL was employed to select the most relevant modalities. In [12], a class of generalized group lasso focusing on inter-group sparsity was introduced into MKL for channel selection on EEG data, where groups correspond to channels.

Fig. 1. Schematic illustration of our proposed framework (a), and different sparsity patterns (b) produced by lasso (\(\ell _1\) norm), group lasso (\(\ell _{2,1}\) norm), and the proposed structured sparsity (\(\ell _{1,p}\) norm, \(p>1\)). Darker color in (b) indicates larger weights.

In view of the unique and complementary information contained in different modalities, all of them are expected to be utilized for AD prediction. Moreover, integrating feature-level and modality-level analysis is preferable to analyzing each modality separately and then selecting the relevant modalities. However, the features of some modalities, either individually or as a whole, are weaker than those of other modalities. In such scenarios, as shown in Fig. 1(b), lasso and group lasso tend to independently select the most discriminative features/groups, leaving features from weak modalities little chance of being selected. Moreover, with the \(\ell _1\) norm penalty they are less effective at exploiting complementary information among modalities [5, 7]. To address these issues, we propose to jointly learn a better integration of multiple modalities and simultaneously select subsets of discriminative features from all the modalities.

Accordingly, we propose a novel structured sparsity (i.e., \(\ell _{1,p}\) norm with \(p>1\)) regularized MKL for heterogeneous multimodal data integration. It is noteworthy that the \(\ell _{1,2}\) norm has been considered in settings such as regression and multitask learning [6, 7]. Here, we go beyond these studies by considering \(\ell _{1,p}\) constrained MKL for multimodal feature selection and fusion, and by applying it to AD diagnosis. Moreover, contrary to representing each modality with a single kernel as in conventional MKL based methods [3, 4, 8], we assign each feature its own kernel and then group the kernels according to modalities, which facilitates both feature- and group-level analysis. Specifically, we promote sparsity inside groups with the inner \(\ell _1\) norm and pursue a dense combination of groups with the outer non-sparse \(\ell _p\) norm. Guided by the learning of a modality-level dense combination, the sparse feature selections in different modalities interact with each other for better overall performance. This \(\ell _{1,p}\) regularizer is fundamentally different from group lasso [15] and its generalization [9] (i.e., the \(\ell _{p,1}\) norm), which select sparse groups but perform no feature selection within each group [12, 15]. An illustration of the sparsity patterns produced by lasso, group lasso, and the proposed method is shown in Fig. 1(b). In comparison, the proposed model not only keeps information from each modality through the outer non-sparse regularization but also supports interpretability and scalability through the inner sparse feature selection.

2 Method

Suppose we are given a set of N labeled data samples \(\{\varvec{x}^i,y^i\}_{i=1}^N\), where \(\varvec{x}^i=( x^i_1, x^i_2,\cdots ,x^i_M)^T\), M is the total number of features across all modalities, and \(y^i \in \{1,-1\}\) is the class label. MKL aims to learn an optimal combination of base kernels, with each kernel describing a different property of the data. To additionally perform joint feature selection, we assign each feature a base kernel through its own feature mapping. An overview of the proposed framework is illustrated in Fig. 1(a).
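As a minimal sketch (not the authors' implementation), the per-feature base kernels and modality groups can be constructed as follows; the variable names and array sizes are illustrative assumptions.

```python
# Hedged sketch: one linear base kernel per feature, grouped by modality.
# X_mri, X_pet, X_snp and their sizes are illustrative stand-ins, not the ADNI data.
import numpy as np

def per_feature_linear_kernels(X):
    """Return one N x N linear kernel K_m = x_m x_m^T per feature column of X."""
    return [np.outer(X[:, m], X[:, m]) for m in range(X.shape[1])]

rng = np.random.default_rng(0)
X_mri, X_pet, X_snp = rng.normal(size=(10, 4)), rng.normal(size=(10, 3)), rng.normal(size=(10, 5))
X = np.hstack([X_mri, X_pet, X_snp])
kernels = per_feature_linear_kernels(X)

# Feature index groups G_l, one per modality (L = 3)
sizes = [X_mri.shape[1], X_pet.shape[1], X_snp.shape[1]]
offsets = np.cumsum([0] + sizes)
groups = [list(range(offsets[l], offsets[l + 1])) for l in range(len(sizes))]
```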

2.1 Structured Sparse Feature and Kernel Learning

Let \(\mathcal {G}=\{1,2,\cdot \cdot \cdot ,M\}\) be the feature index set which is partitioned into L non-overlapping groups \(\{\mathcal {G}_l\}_{l=1}^L\) according to task-specific knowledge. For instance, in our application, we partition \(\mathcal {G}\) into \(L=3\) groups according to modalities. Let \(\{K_m\succeq 0\}_{m=1}^M\) be the M base kernels for the M features respectively, which are induced by M feature mappings \(\{\varvec{\phi }_m\}_{m=1}^M\). Given the feature space defined by the joint feature mapping \(\mathbf {\Phi }(\varvec{x})=(\varvec{\phi }_1(x_1),\varvec{\phi }_2(x_2),\cdots ,\varvec{\phi }_M(x_M))^T\), we learn a linear discriminant function of the form \(f(\varvec{x})=\sum _{l=1}^L \sum _{m \in \mathcal {G}_l}\sqrt{\theta _{m}}\tilde{\varvec{w}}_{m}^{T}\varvec{\phi }_{m}( x_{m})+b\). Here, we have explicitly written out the group structure in the function \(f(\varvec{x})\), in which \(\tilde{\varvec{w}}_m\) is the normal vector corresponding to \(\varvec{\phi }_m\), b encodes the bias, and \(\varvec{\theta }=(\theta _1,\theta _2,\cdots ,\theta _M)^T\) contains the weights for the M feature mappings. Therefore, feature mappings with zero weights would not be active in \(f(\varvec{x})\).

In the following, we perform feature selection by enforcing a structured sparsity constraint on the weights of the feature mappings. To obtain a more general model, we further introduce (1) M positive weights \(\varvec{\beta }=(\beta _1, \beta _2, \cdots , \beta _M )^T\) for features and (2) L positive weights \(\varvec{\gamma }=(\gamma _1, \gamma _2, \cdots , \gamma _L )^T\) for feature groups to encode prior information. If we have no knowledge about feature/group importance, we can set \(\beta _m=1\) and \(\gamma _l=1\) for all m and l. Accordingly, our generalized MKL model with a structured sparsity inducing constraint is formulated as follows:

$$\begin{aligned} &\mathop {\text {min}}\limits _{\varvec{\theta }}\mathop {\text {min}}\limits _{\tilde{\varvec{w}}_m,b}~ \frac{1}{2}\sum \limits _{l=1}^L \sum \limits _{m \in \mathcal {G}_l} \Vert \tilde{\varvec{w}}_{m} \Vert _2^2+C^{\prime }\sum \limits _{i=1}^N\mathcal {L}\left( f(\varvec{x}^i), y^i \right) , \\ &~\mathrm{{s.t.}} ~~\Vert \varvec{\theta }\Vert _{1,p;\varvec{\beta },\varvec{\gamma }}\triangleq \left( \sum \limits _{l=1}^L \gamma _l \left( \sum \limits _{m \in \mathcal {G}_l}\beta _{m}|\theta _{m}|\right) ^p\right) ^{\frac{1}{p}}\le \tau , ~~\mathbf {0}\le \varvec{\theta }, \end{aligned}$$
(1)

where \(\mathcal {L}(t,y) = \max (0,1-ty)\) is the hinge loss function, \(C^{\prime }\) is a trade-off weight, \(\tau \) controls the sparsity level, and \(\mathbf {0}\) is a vector of all zeros. Similar to typical MKL [10], this model is equivalent to learning an optimally combined kernel \(K=\sum _{l=1}^L \sum _{m \in \mathcal {G}_l}\theta _{m} K_{m}\). The inequality constraint employs a weighted \(\ell _{1,p}\) mixed norm (\(p>1\)), i.e., \(\Vert \cdot \Vert _{1,p;\varvec{\beta },\varvec{\gamma }}\), which simultaneously promotes sparsity inside groups with the inner weighted \( \ell _1\) norm and pursues a dense combination of groups with the outer weighted \(\ell _p\) norm.
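As a small illustrative sketch (our own helper functions, not reference code), the combined kernel and the value of the weighted \(\ell _{1,p}\) constraint can be computed as follows, with groups given as lists of feature indices and \(\varvec{\beta }\), \(\varvec{\gamma }\) as NumPy arrays.

```python
import numpy as np

def weighted_l1p_norm(theta, groups, beta, gamma, p):
    """Weighted l_{1,p} mixed norm: inner weighted l_1 within groups, outer weighted l_p across groups."""
    inner = np.array([np.sum(beta[g] * np.abs(theta[g])) for g in groups])
    return np.sum(gamma * inner ** p) ** (1.0 / p)

def combined_kernel(theta, kernels):
    """Optimally combined kernel K = sum_m theta_m K_m."""
    return sum(t * K for t, K in zip(theta, kernels))
```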

The rationale for this regularization is that, while each individual modality contains redundant high-dimensional features, different modalities offer unique and complementary information. Owing to the heterogeneity of the modalities, we sparsely select features within each homogeneous feature group, i.e., modality, and densely integrate the different modalities. As discussed in [5], with \(p>1\) the non-sparse \(\ell _p\) norm is better at combining complementary features than the \(\ell _1\) norm. Moreover, in view of the unequal reliability of different modalities, we take a compromise between \(\ell _1\) lasso and \(\ell _2\) ridge regularization and set \(p=1.5\) for the inter-group regularization, i.e., \(\ell _{1,1.5}\). More specifically, owing to the geometry of the \(\ell _{1.5}\) contour lines, it shrinks the weights unequally with higher probability than the \(\ell _2\) norm, thus allowing larger weights to be assigned to the leading groups/modalities.

Further understanding and computation of our model can be achieved with the following lemma and theorem. Let \(\varvec{w}_m=\sqrt{\theta _m}\tilde{\varvec{w}}_m\), \(\varvec{w}=(\varvec{w}_1, \varvec{w}_2, \cdots ,\varvec{w}_M)^T\), and \( \mathbf {W}=(\Vert \varvec{w}_1\Vert _2, \Vert \varvec{w}_2\Vert _2,\cdots ,\Vert \varvec{w}_M\Vert _2)^T\); we then have the following lemma.

Lemma 1

Let \(p\ge 1\), and let \(\varvec{\gamma }\) and \(\varvec{\beta }\) be positive weights; we use the convention that \(0/0=0\). For fixed \(\varvec{w}\ne \varvec{0}\), the minimizing \(\varvec{\theta }\) in Eq. (1) is attained at

$$\begin{aligned} \theta _m^*=\frac{\Vert \varvec{w}_m \Vert _2}{\beta _m^\frac{1}{2}\gamma _{l_m}^\frac{1}{p+1}\Vert \mathbf {W}_{\mathcal {G}_{l_m}}\Vert _{1;{\varvec{\beta }}}^\frac{p-1}{p+1}}\cdot \frac{\tau }{(\sum _{l=1}^L\gamma _l^{\frac{1}{p+1}}\Vert \mathbf {W}_{\mathcal {G}_l}\Vert _{1;{\varvec{\beta }}}^{\frac{2p}{p+1}})^\frac{1}{p}},~~\forall m=1,2,\cdot \cdot \cdot ,M \end{aligned}$$
(2)

where \(\Vert \mathbf {W}_{\mathcal {G}_l}\Vert _{1;{\varvec{\beta }}}=\sum _{m^{\prime } \in \mathcal {G}_l}\beta _{m^{\prime }}^\frac{1}{2} \Vert \varvec{w}_{m^{\prime }}\Vert _2\), and \(\mathcal {G}_{l_{m}}\) is the index set containing m.

For fixed \(\varvec{w}\), this lemma gives an explicit solution for \(\varvec{\theta }\). The proof follows from the first-order optimality conditions of Eq. (1). Plugging Eq. (2) into the model in Eq. (1) yields the following compact optimization problem.
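For concreteness, a minimal sketch of the closed-form update in Eq. (2) is given below; it assumes groups are lists of feature indices and that \(\varvec{\beta }\), \(\varvec{\gamma }\), and the norms \(\Vert \varvec{w}_m\Vert _2\) are supplied as NumPy arrays.

```python
import numpy as np

def update_theta(w_norms, groups, beta, gamma, p, tau):
    """Closed-form minimizer of Eq. (1) over theta for fixed w (Lemma 1, Eq. (2))."""
    # ||W_{G_l}||_{1;beta} = sum_{m in G_l} beta_m^{1/2} ||w_m||_2
    group_norms = np.array([np.sum(np.sqrt(beta[g]) * w_norms[g]) for g in groups])
    # Shared denominator (sum_l gamma_l^{1/(p+1)} ||W_{G_l}||^{2p/(p+1)})^{1/p}; Lemma 1 assumes w != 0
    denom = np.sum(gamma ** (1.0 / (p + 1)) * group_norms ** (2.0 * p / (p + 1))) ** (1.0 / p)
    theta = np.zeros_like(w_norms)
    for l, g in enumerate(groups):
        scale = gamma[l] ** (1.0 / (p + 1)) * group_norms[l] ** ((p - 1.0) / (p + 1))
        for m in g:
            if w_norms[m] > 0:  # convention 0/0 = 0
                theta[m] = w_norms[m] / (np.sqrt(beta[m]) * scale) * tau / denom
    return theta
```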

Theorem 1

Let \(q=\frac{2p}{p+1}\). For \(p>1\), the model in Eq. (1) is equivalent to

$$\begin{aligned} \min _{\varvec{w}_m,b} \frac{1}{2\tau }\left( \sum _{l=1}^L \gamma _l ^{\frac{2-q}{q}}\Vert \mathbf {W}_{\mathcal {G}_l}\Vert _{1;{\varvec{\beta }}}^q\right) ^\frac{2}{q}+C^{\prime }\sum _{i=1}^N\mathcal {L}\left( \sum _{l=1}^L \sum _{m \in \mathcal {G}_l}\varvec{w}_{m}^{T}\varvec{\phi }_{m}(x^i_{m})+b, y^i \right) . \end{aligned}$$
(3)

The first term is a weighted \(\ell _{1,q}\) norm penalty on \(\mathbf {W}\) with \(q\in (1,2)\). With \(p=1.5\) and thus \(q=1.2\), it shares a similar group-level regularization property with the constraint imposed on \(\varvec{\theta }\) in Eq. (1). Specifically, within each group only a small number of \(\varvec{w}_m\) contribute to the decision function \(f(\varvec{x})\) with nonzero values, so only a few features in each group are selected. Meanwhile, the sparsely filtered groups are densely combined, while still allowing the presence of leading groups.

2.2 Model Computation

After this change of variables, we can optimize the proposed model via block coordinate descent. For fixed \(\varvec{\theta }\), the subproblem in \(\varvec{w}\) and b can be solved with any support vector machine (SVM) [2] solver. According to Lemma 1, \(\theta _m\) can then be computed analytically with \(\varvec{w}\) fixed. \(\theta _m\) is initialized as \(\theta _m=(\sum _{l=1}^L \gamma _l(\sum _{{m^{\prime }}\in \mathcal {G}_l}\beta _{m^{\prime }})^p)^{-\frac{1}{p}}\) to satisfy the constraint in Eq. (1). Moreover, from Eq. (3) it is clear that we can fold \(\tau \) and \(C^{\prime }\) into a single trade-off weight C and set \(\tau =1\). In this way, we have a single model parameter C, which not only acts as the soft margin parameter but also controls the sparsity of \(\varvec{\theta }\) and \(\mathbf {W}\).
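A minimal sketch of this alternating scheme is shown below, using scikit-learn's SVC with a precomputed kernel for the SVM step and the update_theta function sketched in Sect. 2.1; this is an illustrative rendering of the optimization loop under our own assumptions, not the authors' released code.

```python
import numpy as np
from sklearn.svm import SVC

def mkl_bcd(kernels, y, groups, beta, gamma, p=1.5, C=1.0, tau=1.0, n_iter=20):
    """Block coordinate descent: SVM step for (w, b) with theta fixed, then closed-form theta step."""
    M = len(kernels)
    # Feasible initialization: theta_m = (sum_l gamma_l (sum_{m' in G_l} beta_{m'})^p)^{-1/p}
    theta = np.full(M, np.sum([gamma[l] * np.sum(beta[g]) ** p
                               for l, g in enumerate(groups)]) ** (-1.0 / p))
    for _ in range(n_iter):
        K = sum(theta[m] * kernels[m] for m in range(M))        # combined kernel
        svm = SVC(C=C, kernel="precomputed").fit(K, y)          # (w, b) subproblem
        coef = np.zeros(len(y))
        coef[svm.support_] = svm.dual_coef_.ravel()             # alpha_i * y_i on support vectors
        # ||w_m||_2 = theta_m * sqrt((alpha o y)^T K_m (alpha o y))
        w_norms = np.array([theta[m] * np.sqrt(max(coef @ kernels[m] @ coef, 0.0)) for m in range(M)])
        theta = update_theta(w_norms, groups, beta, gamma, p, tau)  # Lemma 1 step
    return theta, svm
```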

3 Experimental Results

3.1 Dataset

We evaluated our method on a subset of the Alzheimer’s Disease Neuroimaging Initiative (ADNI) dataset (Footnote 1). In total, we used MRI, PET, and SNP data of 189 subjects, including 49 patients with AD, 93 patients with Mild Cognitive Impairment (MCI), and 47 Normal Controls (NC). After preprocessing, the MRI and PET images were segmented into 93 regions-of-interest (ROIs). The gray matter volume of each ROI in MRI and the average intensity of each ROI in PET were calculated as features. The SNPs [11] were genotyped using the Human 610-Quad BeadChip. After the standard quality control and imputation steps, only SNPs belonging to the top AD candidate genes listed in the AlzGene database (Footnote 2) as of June 10, 2010, were selected. The Illumina annotation information based on Genome build 36.2 was used to select a subset of SNPs belonging or proximal to the top 135 AD candidate genes. This procedure yielded 5677 SNPs from 135 genes. In total, we thus have 93 + 93 + 5677 = 5863 features from the three modalities for each subject.

3.2 Experimental Settings

For method evaluation, we used 10 times repeated 10-fold cross-validation. All parameters were selected by 5-fold inner cross-validation. Three measures were used: classification accuracy (ACC), sensitivity (SEN), and specificity (SPE). We compared the proposed method with (1) feature selection based methods, i.e., Fisher Score (FS) [2] and Lasso [13], and (2) MKL based methods, i.e., the method of Zhang et al. [16] and \(\ell _1\)-MKL [10]. In the Lasso method, the logistic loss [2] was used. The method in [16] represents each modality with a base kernel and learns a linearly combined kernel via cross-validation. For FS, Lasso, and the method in [16], the linear SVM implemented in the LibSVM software (Footnote 3) was used as the classifier. For all methods, we used a t-test [2] thresholded by p-value as a feature pre-selection step to reduce the feature size and improve computational efficiency. The commonly used threshold of p-value \(<0.05\) was applied to MRI and PET. Considering the large number of SNP features, their p-value threshold was selected from \(\{0.05,0.02,0.01\}\). A t-test-SVM baseline that combines the t-test with an SVM was also included for comparison, using the same p-value setting. For our proposed model, \(\ell _1\)-MKL, and Zhang’s method, each kernel matrix was defined as a linear kernel on a single feature to avoid further kernel parameter selection. Furthermore, we assumed no prior knowledge on feature or group weights and thus set \(\varvec{\gamma }=\varvec{1}\) and \(\varvec{\beta }=\varvec{1}\). The soft margin parameter C was selected by grid search over \(\{2^{-5},2^{-4},\cdots ,2^{5}\}\).
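As an illustrative sketch of the t-test pre-selection step (assuming labels in \(\{1,-1\}\); the function name is ours), features can be pre-filtered per modality as follows, with the SNP threshold chosen from \(\{0.05,0.02,0.01\}\) by the inner cross-validation.

```python
import numpy as np
from scipy.stats import ttest_ind

def ttest_preselect(X, y, p_threshold=0.05):
    """Return indices of features whose two-sample t-test p-value is below the threshold."""
    pvals = ttest_ind(X[y == 1], X[y == -1], axis=0).pvalue
    return np.where(pvals < p_threshold)[0]
```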

3.3 Results and Discussions

The classification results of AD vs. NC and MCI vs. NC using all three modalities are listed in Table 1. By taking advantage of structured feature learning in kernel space, the proposed method outperforms all competing methods in classification rate. For AD vs. NC classification, our method achieves an ACC of \(96.1\,\%\), an improvement of \(2.1\,\%\) over the best performance of the other methods. Meanwhile, the standard deviation of the proposed method is also lower, demonstrating its stability. For classifying MCI from NC, the improvement by the proposed method is \(2.4\,\%\) in terms of ACC. In comparison with t-test-SVM, we obtained \(4.2\,\%\) and \(7.6\,\%\) improvements in terms of ACC for classifying AD and MCI from NC, respectively. Similar results are obtained for the classification of AD vs. MCI, which are not listed in Table 1 due to the space limit. For example, the ACCs of Lasso-SVM, \(\ell _1\)-MKL, and our method are 70.3 ± 1.5 %, 73.0 ± 1.6 %, and 76.9 ± 1.4 %, respectively. In summary, these results demonstrate the improved classification performance of our method.

Table 1. Performance comparison of different methods in terms of “mean ± standard deviation” for AD vs. NC and MCI vs. NC classifications, using MRI, PET and SNPs. The superscript “\(*\)” indicates statistically significant difference (p-value \(<0.05\)) compared with the proposed method

To further investigate the benefit of SNP data and multimodality fusion, Table 2 reports the performance of the proposed method for different modality combinations. First of all, the performance of any single modality is much lower than that of their combinations. Among the three modalities, the SNP data shows the lowest performance. However, when combined with other modalities, genetic data clearly helps improve predictions. For example, in AD vs. NC classification, MRI+SNP and PET+SNP yield \(2.7\,\%\) and \(5.7\,\%\) improvements in terms of ACC over using MRI and PET alone, respectively; the improvement with MRI+PET+SNP over MRI+PET is \(3.8\,\%\). Similar results are obtained for MCI vs. NC.

Table 2. Comparison of our proposed method using different modality combinations. “\(*\)” indicates statistically significant difference from MRI+PET+SNP

The brain regions and SNPs most frequently selected by our algorithm can also serve as potential biomarkers for clinical diagnosis. In MRI, the hippocampal formation and the uncus in the parahippocampal gyrus are recognized in both AD vs. NC and MCI vs. NC classifications, as well as multiple temporal gyrus regions. This is in line with the most affected regions in AD reported in previous neuroimaging studies [3, 8, 16, 18]. The amygdala, a subcortical region that serves as the integrative center for emotions, is also identified in AD classification. In PET, the angular gyri, precuneus, and entorhinal cortices are the identified regions, which are also among the altered regions in AD reported in prior studies [16, 18]. As to the genetic information, the most frequently selected SNPs for AD vs. NC classification are from the APOE, VEGFA, and SORCS1 genes. For MCI prediction, the most frequently selected SNPs are from the KCNMA1, APOE, VEGFA, and CTNNA3 genes. Generally, our results are consistent with existing findings [11, 17]. For instance, APOE and SORCS1 are well-known top candidate genes related to AD and MCI [11]. VEGFA, which encodes vascular endothelial growth factor, represents a potential mechanism by which vascular and AD pathologies are related [1].

4 Conclusion

We developed a kernel-based multimodal feature selection and integration method and applied it to imaging and genetic data for AD diagnosis. Instead of independently selecting features from each modality and then combining them [16], or performing most-relevant modality selection [8, 14], we integrated multimodal feature selection and combination in a novel structured sparsity regularized kernel learning framework. A block coordinate descent algorithm was derived to solve the general \(\ell _{1,p}\) (\(p\ge 1\)) constrained non-smooth objective function. Extensive experimental comparisons have shown the improved AD diagnosis performance of the proposed method. In future work, we will incorporate prior knowledge about feature/group importance into the proposed framework.