Introduction

Alzheimer’s disease (AD) is one of the most common forms of dementia, characterized by progressive cognitive and memory deficits. It has been reported that, by the year 2050, one in every 85 persons will likely be affected by this disease (Brookmeyer et al. 2007). The increasing incidence of AD makes this disease a very important health issue and a huge financial burden for both patients and governments (Hebert et al. 2001; Bain et al. 2008). Thus, it is very important to develop methods for timely diagnosis of AD and its prodromal stage, i.e., mild cognitive impairment (MCI). Over the last decade, many machine learning methods have been used for early diagnosis of AD and MCI based on different modalities of biomarkers, e.g., structural brain atrophy delineated by structural magnetic resonance imaging (MRI) (Du et al. 2007; McEvoy et al. 2009; Fjell et al. 2010; Yu et al. 2014), metabolic alterations characterized by fluorodeoxyglucose positron emission tomography (FDG-PET) (De Santi et al. 2001; Morris et al. 2001), and pathological amyloid depositions measured in cerebrospinal fluid (CSF) (Bouwman et al. 2007; Fjell et al. 2010). Typically, these methods learn a binary classification model from training data and use this model to predict the disease status (i.e., class label) of testing subjects.

Besides classification of disease status, accurate prediction of clinical scores such as the Mini-Mental State Examination (MMSE) score and the Alzheimer’s Disease Assessment Scale-Cognitive Subscale (ADAS-Cog) is also important and useful, since these scores help evaluate the stage of AD pathology and predict future progression. Specifically, as a brief 30-point questionnaire test, the MMSE is commonly used to screen for cognitive impairment; it examines a patient’s arithmetic, memory, and orientation (Folstein et al. 1975). As another important clinical score for AD, ADAS-Cog is a cognitive testing instrument widely used in clinical trials; it is designed to measure the severity of the most important symptoms of AD (Rosen et al. 1984). Several regression-based studies have been conducted to estimate MMSE and ADAS-Cog using features extracted from MRI and FDG-PET. For example, Duchesne et al. (2005) used linear regression models, Wang et al. (2010) developed a high-dimensional kernel-based regression method, and Cheng et al. (2013) proposed a semi-supervised multi-modal relevance vector regression method. However, almost all of these regression methods model different clinical scores separately and do not use the class label information that is often available in practice.

Although the classification of disease status and the prediction of clinical scores are different tasks, there exists inherent correlation among them since the underlying pathology is the same (Fan et al. 2010; Stonnington et al. 2010). In the literature, Zhang and Shen (2012) proposed multi-modal multi-task (M3T) learning to predict both the class label and the clinical scores jointly. M3T formulates the estimation of the class label and the clinical scores as different tasks and uses the \(l_{2,1}\) penalty to deliver sparse models with a common feature subset across tasks. Their experimental results indicated that selecting a common feature subset for different correlated tasks achieves better prediction of both the class label and the clinical scores than choosing a feature subset for each task separately. Although it benefits from the commonality among different correlated tasks, the M3T method does not incorporate the correlation information among features. In fact, many features extracted from brain images such as structural MRI are significantly statistically correlated. In this case, feature selection combined with the additional correlation information among features can improve classification/regression performance (Yang et al. 2012).

In this paper, we extract effective correlation information among features by constructing a sparse undirected feature graph. This undirected graph uses all features as nodes, and two features are connected by an edge if there is statistically significant partial correlation between them. In practice, many existing high-dimensional precision matrix estimation methods (Friedman et al. 2008; Cai et al. 2011) can be used to construct this undirected graph. Based on this feature graph, we propose a new graph-guided multi-task learning (GGML) method to predict both the class label and the clinical scores simultaneously. Specifically, we utilize a new latent group Lasso penalty to encourage significantly correlated features to enter or leave the models together. This new penalty also encourages the intrinsically correlated tasks to share a common feature subset, which yields robust and accurate feature selection. Computationally, the optimization problem for our proposed GGML method can be solved efficiently by the traditional group Lasso algorithm (Yuan and Lin 2006). Theoretically, our proposed GGML method includes the M3T method as a special case. To validate our proposed GGML method, we conducted extensive numerical studies using simulated datasets and the Alzheimer’s Disease Neuroimaging Initiative (ADNI) (http://www.loni.ucla.edu/ADNI) dataset. Compared with the other methods, our proposed GGML method achieved very promising results.

The remainder of this paper is organized as follows. In the “Materials” section, we introduce the ADNI dataset used in this study. In the “Methods” section, we show how to extract useful correlation information among features and describe our proposed new method. In the “Simulation study” and “Analysis of the ADNI dataset” sections, we compare our method with the other methods through a simulation study and an analysis of the ADNI dataset. In the “Discussion” section, we discuss some possible extensions of our proposed method. Finally, we conclude this paper in the “Conclusion” section.

Materials

Data

Data used in this paper were obtained from the ADNI database. As a $60 million, 5-year public–private partnership, the ADNI was launched in 2003 by the National Institute on Aging (NIA), the National Institute of Biomedical Imaging and Bioengineering (NIBIB), the Food and Drug Administration (FDA), private pharmaceutical companies, and non-profit organizations. The main goal of ADNI was to test whether serial MRI, PET, other biological markers, and clinical and neuropsychological assessments could be combined to measure the progression of MCI and early AD. To that end, 800 adults aged between 55 and 90 were recruited from over 50 sites across the US and Canada. Approximately 200 cognitively normal controls (NC) and 400 MCI individuals were followed for 3 years, and 200 individuals with early AD were followed for 2 years (see http://www.adni-info.org for up-to-date information). The general inclusion/exclusion criteria for the subjects are described in Zhang and Shen (2012). In this paper, we use data from 199 subjects who have complete baseline MRI, FDG-PET, and CSF data: 50 AD subjects, 97 MCI subjects, and 52 NC subjects. The detailed demographic information about these 199 subjects is summarized in Table 1.

Table 1 Demographic information of the 199 subjects used in this study

Data preprocessing

Image preprocessing was performed for both MRI and PET. For MRI, the preprocessing steps included anterior commissure (AC)–posterior commissure (PC) correction, intensity inhomogeneity correction (Sled et al. 1998), skull stripping (Wang et al. 2011), cerebellum removal based on registration with an atlas, spatial segmentation (Zhang et al. 2001), and registration (Shen and Davatzikos 2002). After registration, we obtained the subject-labeled image based on the Jacob template (Kabani et al. 1998) with 93 manually labeled regions of interest (ROIs). For each of the 93 ROIs in the labeled MRI, we computed the volume of gray matter as a feature. For each PET image, we first aligned the PET image to its respective MRI using affine registration. Then, we obtained the skull-stripped PET image using the brain mask from the corresponding MRI and computed the average intensity of every ROI in the PET image as a feature. Besides MRI and PET, we used CSF A\(\beta\)42, CSF t-tau, and CSF p-tau as CSF features. For each subject, we finally obtained 93 MRI features, 93 PET features, and 3 CSF features. We also had the class label, MMSE score, and ADAS-Cog score for each subject.

Methods

In this section, after introducing some notation, we discuss how to extract the correlation information among features. To show clearly how this information is utilized, we then introduce the graph-guided single-task learning (GGSL) method. Finally, we describe our proposed graph-guided multi-task learning (GGML) method as an extension of GGSL.

Notation

For a set \({\mathcal {A}}\), we denote \(|{\mathcal {A}}|\) as the number of elements in \({\mathcal {A}}\). For a matrix \({\mathbf {B}}\), we denote \({\mathbf {B}}^{T}\) and \({\mathbf {B}}^{-1}\) as the transpose and the inverse of matrix \({\mathbf {B}}\), respectively. We also denote \(\Vert {\mathbf {B}}\Vert _{F}=\sqrt{\sum _{i}\sum _{j}{\mathbf {B}}_{ij}^{2}}\) as the Frobenius norm.

Suppose we have n samples and p features. Let \(\mathbf X =(X_{1},X_{2},\ldots ,X_{p})=(x_{1},x_{2},\ldots ,x_{n})^{T}\) denote the \(n\times p\) training data matrix of features, where \(x_{1},x_{2},\ldots ,x_{n}\) are i.i.d. samples generated from a p-dimensional multivariate distribution with mean vector \(0_{p\times 1}\) and covariance matrix \(\varvec{\Sigma }=(\sigma _{ij})_{i,j=1}^{p}\). Also, let \(\varvec{\Omega }=(\omega _{ij})_{i,j=1}^{p}=\varvec{\Sigma }^{-1}\) denote the precision matrix. Furthermore, suppose we have q response variables. Let \(\mathbf Y =(Y_{1},Y_{2},\ldots ,Y_{q})=(y_{1},y_{2},\ldots ,y_{n})^{T}\) denote the \(n\times q\) training data matrix of response variables, where the response variables can be binary (for classification) or continuous (for regression). Note that, for the ADNI dataset used in our study, we have three response variables, which are class label, MMSE score, and ADAS-Cog score. The class labels are coded as +1 and \(-1\) for the binary classification problem considered in this paper.

Extract the correlation information among features

The correlation information is often measured by the Pearson correlation between each pair of features, and sample Pearson correlation coefficients can be used to identify statistically significantly correlated features. One issue with this approach is that it only estimates the marginal linear dependence between a pair of features, without accounting for the influence of the other features and common driving influences. This issue can be overcome by using partial correlation, which measures the linear dependence between each pair of features after eliminating the linear effect of the other features. In practice, we can compute the sample partial correlation coefficient between features i and j, denoted as \(\hat{\rho }_{ij}^{*}\), which is defined as the sample Pearson correlation coefficient between the residuals \(R_{i}\) and \(R_{j}\) resulting from the linear regressions of feature \(X_{i}\) on the features \(\{X_{k}: k\ne i, j\}\) and of feature \(X_{j}\) on the features \(\{X_{k}: k\ne i, j\}\), respectively. The resulting \(\hat{\rho }_{ij}^{*}\)s can then be used to identify pairs of features that are statistically significantly partially correlated.
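For concreteness, the following is a minimal NumPy sketch of this residual-based definition (the function name partial_corr is ours for illustration; it assumes \(n > p\) so that the least-squares regressions are well defined):

```python
import numpy as np

def partial_corr(X, i, j):
    """Sample partial correlation between features i and j of X (n x p):
    the Pearson correlation of the residuals after regressing each of
    X[:, i] and X[:, j] on all remaining features (plus an intercept)."""
    n, p = X.shape
    others = [k for k in range(p) if k not in (i, j)]
    Z = np.column_stack([np.ones(n), X[:, others]])  # intercept + other features
    # Least-squares residuals of X_i and X_j on the other features
    r_i = X[:, i] - Z @ np.linalg.lstsq(Z, X[:, i], rcond=None)[0]
    r_j = X[:, j] - Z @ np.linalg.lstsq(Z, X[:, j], rcond=None)[0]
    return np.corrcoef(r_i, r_j)[0, 1]
```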

When the number of features p is small and the sample size n is big enough (bigger than p), it is easy to obtain good estimates of the partial correlation coefficients. In this case, many previous studies (Hampson et al. 2002; Lee et al. 2011) have used partial correlations to identify significantly correlated features. However, in the high-dimensional case where the number of features p exceeds the sample size n, conventional methods for estimating partial correlation may result in over-fitting of the data (Ryali et al. 2012), and it is difficult to obtain accurate estimates of the partial correlation coefficients.

For our proposed method introduced in the next section, in order to incorporate the correlation information among features, instead of requiring accurate estimation of \(\rho _{ij}^{*}\)s, we only need to estimate which pairs of features are partially correlated, i.e., estimate the set \({\mathcal {E}}=\{(i,j): i<j \;\text { and }\;\rho _{ij}^{*}\ne 0\}\). It is well known that the partial correlation coefficients are proportional to the off-diagonal entries of the precision matrix \(\varvec{\Omega }\) (Meinshausen and Bühlmann 2006). Thus, estimating \({\mathcal {E}}\) is equivalent to estimating the set \(\{(i,j): i<j \,\,\text {and}\,\,\omega _{ij}\ne 0\}\). In this way, many existing methods (Meinshausen and Bühlmann 2006; Friedman et al. 2008; Cai et al. 2011) can be used to estimate \({\mathcal {E}}\) effectively.

In this paper, we will use the graphical Lasso (Friedman et al. 2008) or the neighborhood selection method (Meinshausen and Bühlmann 2006) to estimate \({\mathcal {E}}\) and denote its estimate as \(\hat{{\mathcal {E}}}\). Furthermore, we represent \(\hat{{\mathcal {E}}}\) as an undirected graph \(\mathbf G\) with p nodes and \(|\hat{{\mathcal {E}}}|\) edges, where each node represents one feature and each edge indicates that the two connected features are significantly partially correlated. Figure 1 shows an example of how to transform the estimated precision matrix \(\hat{\varvec{\Omega }}\) into the estimated undirected graph \({\mathbf {G}}\). In the graph \(\mathbf G\), features i and j are connected if and only if \(\hat{\omega }_{ij}\ne 0\).

Fig. 1

Transforming the precision matrix \(\hat{\varvec{\Omega }}\) (left) into the undirected graph \(\mathbf G\) (right). Features i and j are connected if and only if \(\hat{\omega }_{ij}\ne 0\)
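In practice, this construction can be implemented with off-the-shelf tools. Below is a sketch using scikit-learn's GraphicalLassoCV; the threshold tol and the helper name estimate_feature_graph are our illustrative choices:

```python
import numpy as np
from sklearn.covariance import GraphicalLassoCV

def estimate_feature_graph(X, tol=1e-8):
    """Estimate the edge set from the n x p training feature matrix X via the
    graphical Lasso; returns a boolean p x p adjacency matrix for the graph G."""
    omega_hat = GraphicalLassoCV().fit(X).precision_
    adj = np.abs(omega_hat) > tol        # features i, j connected iff omega_ij != 0
    np.fill_diagonal(adj, False)         # no self-loops in G
    return adj

# Neighborhoods N_i = {i} together with the neighbors of feature i in G:
# neighborhoods = [[i] + np.flatnonzero(adj[i]).tolist()
#                  for i in range(adj.shape[0])]
```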

Graph-guided single-task learning (GGSL) method

In this section, we assume that the undirected feature graph \({\mathbf {G}}\) has been constructed. For each \(i=1,2,\ldots ,p\), denote \({\mathcal {N}}_{i}\) as the set including the ith feature and its neighbors in the graph \(\mathbf G\), i.e., \({\mathcal {N}}_{i}=\{j:\hat{\omega }_{ji}\ne 0\}\).

To show how to use the correlation information represented by \({\mathbf {G}}\), we consider single-task learning first and then generalize the idea to multi-task learning. Without loss of generality, considering the tth task, we use the following linear model to predict the response variable \(Y_{t}\),

$$\begin{aligned} Y_{t}=\mathbf X B_{t}+\epsilon _{t}, \end{aligned}$$
(1)

where \(B_{t}=(b_{1t},b_{2t},\ldots ,b_{pt})^{T}\in R^{p}\) is the coefficient vector of interest and \(\epsilon _{t}=(\epsilon _{1t},\epsilon _{2t},\ldots ,\epsilon _{nt})\in R^{n}\) is the error vector with \(\text {E}(\epsilon _{st})=0\) and \(\text {Var}(\epsilon _{st})=\sigma _{t}^{2}\) for each \(1\le s\le n\).

Suppose the feature matrix \(\mathbf X\) is independent of the error vector \(\epsilon _{t}\). Denote \(C_{t}\) as the marginal correlation vector between p features and the response variable \(Y_{t}\), i.e., \(C_{t}=\text {E}(\mathbf X ^{T}Y_{t}/n)=(c_{1t},c_{2t},\ldots ,c_{pt})^{T}\in R^{p}\). Then by (1), we have

$$\begin{aligned} C_{t}=\text {E}(\mathbf X ^{T}Y_{t}/n)=\text {E}(\mathbf X ^{T}\mathbf X B_{t}/n)+\text {E}(\mathbf X ^{T}\epsilon _{t}/n) =\mathbf {\Sigma }B_{t}. \end{aligned}$$
(2)

Thus, the true coefficient vector \(B_{t}\) can be represented as

$$\begin{aligned} B_{t}=\mathbf {\Sigma }^{-1}C_{t}={\mathbf {\Omega }}C_{t}, \end{aligned}$$
(3)

where \({\mathbf {\Omega }}\) shows the partial correlations among different features, and \(C_{t}\) reflects the marginal correlations between the features and the tth response variable \(Y_{t}\).

Furthermore, the Eq. (3) can be expanded as follows:

$$\begin{aligned}&b_{1t}=\omega _{11}c_{1t}+\omega _{12}c_{2t}+\cdots +\omega _{1i}c_{it}+\cdots +\omega _{1p}c_{pt}\\ \nonumber&b_{2t}=\omega _{21}c_{1t}+\omega _{22}c_{2t}+\cdots +\omega _{2i}c_{it}+\cdots +\omega _{2p}c_{pt}\\ \nonumber&\vdots \\ \nonumber&b_{pt}=\omega _{p1}c_{1t}+\omega _{p2}c_{2t}+\cdots +\omega _{pi}c_{it}+\cdots +\omega _{pp}c_{pt}. \nonumber \end{aligned}$$
(4)

We observe that the coefficient vector \(B_{t}=(b_{1t},b_{2t},\ldots ,b_{pt})^{T}\) is the sum of p parts, where the ith part, \((\omega _{1i}c_{it},\omega _{2i}c_{it},\ldots , \omega _{pi}c_{it})^{T}\), is the ith vertical part on the right-hand side of the above equations (4). In addition, for each i, if there is no marginal correlation between the ith feature and the response variable \(Y_{t}\), i.e., \(c_{it}=0\), then the components of the ith part \((\omega _{1i}c_{it},\omega _{2i}c_{it},\ldots , \omega _{pi}c_{it})^{T}\) are zero simultaneously due to the common factor \(c_{it}\). Furthermore, if the ith feature and the response variable \(Y_{t}\) are marginally correlated, then \(c_{it}\ne 0\) and the set of candidate nonzero components of the ith part is \(\{j: \omega _{ji}\ne 0\}\), which can be well estimated by the set \({\mathcal {N}}_{i}\) containing the ith feature and its neighbors in the estimated undirected graph \({\mathbf {G}}\).
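As a quick numerical sanity check of identity (3), note that with population quantities the identity is exact. The following NumPy snippet, with an arbitrary positive-definite \(\varvec{\Sigma }\) of our choosing, verifies it:

```python
import numpy as np

rng = np.random.default_rng(0)
p = 5
A = rng.standard_normal((p, p))
Sigma = A @ A.T + p * np.eye(p)       # an arbitrary positive-definite covariance
Omega = np.linalg.inv(Sigma)          # precision matrix
B_t = rng.standard_normal(p)          # arbitrary true coefficient vector

C_t = Sigma @ B_t                     # marginal correlation vector, Eq. (2)
assert np.allclose(Omega @ C_t, B_t)  # Eq. (3): Omega recovers B_t from C_t
```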

Motivated by the decomposition shown in Eq. (4), we assume that there is a latent decomposition of the coefficient vector \(B_{t}\) into p parts, \(V^{1t},\ldots ,V^{it},\ldots ,V^{pt}\), where \(V^{it}\) is a p-dimensional latent vector representing the ith vertical part on the right-hand side of Eq. (4). In order to incorporate the correlation information represented by the undirected graph \({\mathbf {G}}\), a group penalty term is used to encourage the ith latent vector \(V^{it}\) to be zero or to have nonzero components only for the indices in the set \({\mathcal {N}}_{i}\). Hence, we use the following GGSL method to estimate \(B_{t}\):

$$\begin{aligned} \min _{B_{t},V^{1t},V^{2t},\ldots ,V^{pt}\in R^{p}}\Vert Y_{t}-\mathbf X B_{t}\Vert _{2}^{2}+\lambda \sum _{i=1}^{p}\tau _{it}\Vert V^{it}\Vert _{2}, \end{aligned}$$
(5)

subject to \(B_{t}=\sum _{i=1}^{p}V^{it}\) and \(\text {supp}(V^{it})\subseteq {\mathcal {N}}_{i}\) for each \(1\le i\le p\), where \(\text {supp}(V^{it})\) is the index set of the nonzero components in the vector \(V^{it}\).

In the optimization problem (5), \(\tau _{it}\) is a positive weight for the ith part and the tth task. Similar to the weighting schemes for the adaptive Lasso (Zou 2006) and the group Lasso (Yuan and Lin 2006), we can set \(\tau _{it}=\frac{\sqrt{|{\mathcal {N}}_{i}|}}{|\tilde{b}_{it}|^{\gamma }}\), where \(\gamma\) is a positive parameter and \(\tilde{b}_{it}\) is an initial estimate of \(b_{it}\). In our experiments, we chose \(\tilde{b}_{it}\) as the sample correlation coefficient between \(X_{i}\) and \(Y_{t}\). Both the positive parameter \(\gamma\) and the tuning parameter \(\lambda\) were chosen by cross-validation. Our experimental results indicate that this choice achieves good performance in general.

Theoretically, the GGSL method is very general and covers the popular Lasso method as a special case. Specifically, if we ignore the correlation information among features, we can set the undirected graph \({\mathbf {G}}\) to be an empty graph with no edges. In this case, with constant weights \(\tau _{it}\), we can show that \(\sum _{i=1}^{p}\tau _{it}\Vert V^{it}\Vert _{2}\propto |B_{t}|_{1}\), and the GGSL method coincides with the Lasso method (Tibshirani 1996). In general, we can estimate a sparse undirected graph \({\mathbf {G}}\) for modeling the significant partial correlation information among features; the GGSL method utilizes this correlation information effectively and thus achieves good prediction performance.

Graph-guided multi-task learning (GGML) method

For the multi-task learning, we aim at estimating q response variables simultaneously. Similar to the above GGSL method, for each t, we assume that the coefficient vector \(B_{t}\) can be decomposed as \(B_{t}=\sum _{i=1}^{p}V^{it}\), where each \(V^{it}\) is a p-dimensional latent vector satisfying \(\text {supp}(V^{it})\subseteq {\mathcal {N}}_{i}\). Furthermore, in order to make use of the intrinsic correlation among these q tasks (response variables), we also assume that the decompositions of q coefficient vectors \(B_{1}, B_{2},\ldots , B_{q}\) have the same pattern, i.e., \(\text {supp}(V^{i1})=\text {supp}(V^{i2})=\cdots =\text {supp}(V^{iq})\) for each \(1\le i\le p\). That is, for each \(i=1,2,\ldots ,p\), we assume that, if both the ith feature and its partially correlated features are useful for prediction of one response variable, they are also useful for prediction of the other response variables.

Based on the above assumption, denote \(\mathbf B =(B_{1},B_{2},\ldots ,B_{q})\in R^{p\times q}\) and \(\mathbf V ^{i}=(V^{i1},V^{i2},\ldots ,V^{iq})\in R^{p\times q}\) for each \(1\le i\le p\), we generalize the GGSL method to the following GGML method:

$$\begin{aligned} \min _\mathbf{B ,\mathbf V ^{1},\mathbf V ^{2},\ldots ,\mathbf V ^{p}\in R^{p\times q}}\Vert \mathbf Y -\mathbf X \mathbf B \Vert _{F}^{2}+\lambda \sum _{i=1}^{p}\tau _{i}\Vert \mathbf V ^{i}\Vert _{F}, \end{aligned}$$
(6)

subject to \(\mathbf B =\sum _{i=1}^{p}\mathbf V ^{i}\) and \(\{j: \Vert \mathbf V ^{i}_{j\cdot }\Vert _{2}\ne 0\}\subseteq {\mathcal {N}}_{i}\) for each \(1\le i\le p\), where \(\mathbf V ^{i}_{j\cdot }\) is the jth row of the matrix \(\mathbf V ^{i}\).

Similar to the GGSL method discussed in “Graph-guided single-task learning (GGSL) method” section, we can set the weight \(\tau _{i}=\frac{\sqrt{|{\mathcal {N}}_{i}|}}{\max _{1\le t\le q}|\tilde{b}_{it}|^{\gamma }}\). The cross-validation method can be used to choose the best \(\gamma\) and the best tuning parameter \(\lambda\) for different tasks separately. Note that the penalty term in (6) along with the additional constraints not only encourage the significantly partially correlated features to be in or out of the model jointly, but also choose a common feature subset for different tasks. Due to the use of both the correlation information among features and the intrinsic commonality among different related tasks, our proposed GGML method can acquire better prediction performance than the methods not using or only using part of these two kinds of information.

As an interesting remark, we note that the M3T method (Zhang and Shen 2012) is a special case of our proposed GGML method. In particular, when we ignore the correlation information among features, we can set the undirected graph \({\mathbf {G}}\) to be an empty graph with no edges. In this case, with constant weights \(\tau _{i}\), we can show that \(\sum _{i=1}^{p}\tau _{i}\Vert \mathbf {V}^{i}\Vert _{F}\propto \sum _{i=1}^{p}\Vert {\mathbf {B}}_{i\cdot }\Vert _{2}\), where \({\mathbf {B}}_{i\cdot }\) is the ith row of the coefficient matrix \({\mathbf {B}}\). Thus, our proposed GGML method is exactly the same as the M3T method using the \(l_{2,1}\) penalty.

Objective function optimization

For our proposed GGML method, we need to solve the optimization problem (6). We can transform this constrained optimization problem into a simple unconstrained optimization problem by feature duplication.

Denote \(\mathbf X _{\cdot {\mathcal {N}}_{i}}\) as the sub-matrix of \(\mathbf X\) with column indices in \({\mathcal {N}}_{i}\), and denote \(\mathbf V _{{\mathcal {N}}_{i}\cdot }^{i}\) as the sub-matrix of \(\mathbf V ^{i}\) with row indices in \({\mathcal {N}}_{i}\). Furthermore, denote \(\tilde{\mathbf{X }}=(\mathbf X _{\cdot {\mathcal {N}}_{1}},\mathbf X _{\cdot {\mathcal {N}}_{2}},\ldots ,\mathbf X _{\cdot {\mathcal {N}}_{p}}) \in R^{n\times (\sum _{i=1}^{p}|{\mathcal {N}}_{i}|)}\) as the duplicated feature matrix and \(\tilde{\mathbf{V }}=((\mathbf V _{{\mathcal {N}}_{1}\cdot }^{1})^{T},(\mathbf V _{{\mathcal {N}}_{2}\cdot }^{2})^{T},\ldots , (\mathbf V _{{\mathcal {N}}_{p}\cdot }^{p})^{T})^{T}\) as the \((\sum _{i=1}^{p}|{\mathcal {N}}_{i}|)\times q\) coefficient matrix. Then, we can check that \({\mathbf{XB}} ={\tilde{\mathbf{X }}}{\tilde{\mathbf{V}}}\) and (6) is equivalent to the following unconstrained optimization problem:

$$\min \limits _{\tilde{\mathbf{V}}}\Vert {\mathbf{Y}} -{\tilde{\mathbf{X }}}{\tilde{\mathbf{V }}}\Vert _{F}^{2}+\lambda \sum _{i=1}^{p}\tau _{i}\Vert {\mathbf{ V}} _{{\mathcal {N}}_{i}\cdot }^{i}\Vert _{F}.$$
(7)

The above problem (7) is a traditional group Lasso problem, which can be solved efficiently by the blockwise majorization descent algorithm (Yang and Zou 2013). Denote the estimate of \({\mathbf{B}}\) as \(\hat{{\mathbf{B}}}\). In the application stage, given a testing subject \(x^{*}\), for the tth task, we estimate \(Y_{t}^{*}\) by \({\hat{Y}}_{t}^{*}={\text{sign}}(\hat{B}_{t}^{T}x^{*})\) if \(Y_{t}^{*}\) is a class label and by \(\hat{Y}_{t}^{*}=\hat{B}_{t}^{T}x^{*}\) if \(Y_{t}^{*}\) is a continuous response variable.
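To make the reduction concrete, below is a minimal self-contained NumPy sketch of problem (7). For simplicity it uses plain proximal gradient descent with groupwise soft-thresholding rather than the blockwise majorization descent algorithm cited above; the function name ggml_fit, the fixed iteration count, and the step size are our illustrative choices:

```python
import numpy as np

def ggml_fit(X, Y, neighborhoods, lam, tau, n_iter=500):
    """Proximal-gradient sketch for the duplicated-feature problem (7).

    X: (n, p) features; Y: (n, q) responses; neighborhoods: list where
    neighborhoods[i] holds the indices in N_i (always containing i);
    lam: tuning parameter; tau: length-p array of positive weights tau_i.
    Returns the (p, q) coefficient matrix B-hat = sum_i V^i."""
    n, p = X.shape
    q = Y.shape[1]
    # Feature duplication: stack the columns X[:, N_i] for every i.
    cols = np.concatenate(neighborhoods)
    groups = np.concatenate([[i] * len(Ni) for i, Ni in enumerate(neighborhoods)])
    Xt = X[:, cols]                                  # n x sum_i |N_i|
    V = np.zeros((Xt.shape[1], q))
    L = 2.0 * np.linalg.eigvalsh(Xt.T @ Xt).max()    # Lipschitz constant of the loss
    for _ in range(n_iter):
        G = V - (2.0 / L) * (Xt.T @ (Xt @ V - Y))    # gradient step on the loss
        for i in range(p):                           # groupwise soft-thresholding
            idx = groups == i
            norm = np.linalg.norm(G[idx])
            shrink = max(0.0, 1.0 - lam * tau[i] / (L * norm)) if norm > 0 else 0.0
            V[idx] = shrink * G[idx]
    # Fold the duplicated rows back together: B = sum_i V^i
    B = np.zeros((p, q))
    np.add.at(B, cols, V)
    return B
```

Note that with the empty graph (\({\mathcal {N}}_{i}=\{i\}\)) and constant weights, each group is a single row of \({\mathbf {B}}\), so the same routine solves the M3T problem, consistent with the special-case remark above.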

Simulation study

In this section, we perform numerical studies using simulated examples. For each example, we compare our proposed GGML method with (1) the Lasso method, which learns different tasks separately; (2) the GGSL method, which uses the correlation information among features but learns different tasks separately; and (3) the M3T method, which learns different tasks jointly while ignoring the correlation information among features. We implement the Lasso, GGSL, and M3T methods via the formulation shown in the “Objective function optimization” section to predict the response variables.

Similar to the measures used in Zhang and Shen (2012), the classification accuracy and the Pearson correlation coefficient (CC) are used here to evaluate the classification and regression performance, respectively. In addition, we also use the root-mean-square error (RMSE) to evaluate the regression performance.

Simulated examples

We study three simulated examples. Each example has one classification task and two regression tasks. We set \(p=100\), \(B_{1}=(2,2,\ldots ,2,0,0,\ldots ,0)^{T}\), \(B_{2}=B_{3}=(1,1,\ldots ,1,0,0,\ldots ,0)^{T}\), where only the first 15 elements of each \(B_{t}\) (\(t=1,2,3\)) are nonzero. For each t, the errors \(\epsilon _{1t},\epsilon _{2t},\ldots ,\epsilon _{nt}\mathop {\sim }\limits ^{i.i.d.}N(0,9)\). For \(s=1,2,\ldots ,n\), the feature vector \((x_{s1},x_{s2},\ldots ,x_{sp})^{T}\) is generated as follows.

Example 1

For \(1\le j\le 5\), \(x_{sj}=z_{1}+0.4\epsilon _{j}^{x}\). For \(6\le j\le 10\), \(x_{sj}=z_{2}+0.4\epsilon _{j}^{x}\). For \(11\le j\le 15\), \(x_{sj}=z_{3}+0.4\epsilon _{j}^{x}\). For \(16\le j\le p\), \(x_{sj}\overset{i.i.d}{\sim } N(0,1)\). Here, \(z_{1}, z_{2}, z_{3},\epsilon _{1}^{x}, \epsilon _{2}^{x}, \ldots , \epsilon _{15}^{x}\overset{i.i.d}{\sim } N(0,1)\).

Example 2

The features \((x_{s1},x_{s2},\ldots ,x_{sp})^{T}\sim N(0,\varvec{\Sigma })\) with \(\sigma _{ij}=0.5^{|i-j|}\). For this example, we have \(\omega _{ii}=1.333\), \(\omega _{ij}=-0.667\) if \(|i-j|=1\) and \(\omega _{ij}=0\) if \(|i-j|>1\).

Example 3

The features \(\{x_{sj}: 1\le j\le 15\}\) are generated from the same model as in Example 1. In addition, the features \((x_{s16},x_{s17},\ldots ,x_{sp})\sim N(0,\tilde{\varvec{\Omega }}^{-1})\), where \(\tilde{\varvec{\Omega }}=\mathbf M +\delta \mathbf I\). Each off-diagonal entry of \(\mathbf M\) is generated independently and equals 0.5 with probability 0.05 or 0 with probability 0.95. The diagonal entries of \(\mathbf M\) are 0. Here, \(\delta\) is chosen such that the condition number of \(\tilde{\varvec{\Omega }}\) is equal to \(p-15\). Finally, \(\tilde{\varvec{\Omega }}\) is standardized to have unit diagonals.

After generating each column of the response matrix \({\mathbf {Y}}\) by model (1), we replace the elements in the first column of \({\mathbf {Y}}\) by their signs (positive or negative) to simulate class labels. For all examples, we generate 40 training samples, 40 validation samples, and 400 testing samples. All models are fitted on the training data; the validation data are used to choose the tuning parameters, and the testing data are used to evaluate the different methods. For each example, we repeat the simulation 30 times.
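For concreteness, the following is a sketch of the generator for Example 1 (the function name and vectorization are ours), including the sign transformation of the first response column:

```python
import numpy as np

def make_example1(n, p=100, seed=0):
    """Generate one dataset from Example 1: three blocks of five correlated
    useful features driven by shared latent factors, the rest independent
    N(0, 1) noise features; one classification task and two regression tasks."""
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((n, p))
    for lo in (0, 5, 10):                           # blocks: features 1-5, 6-10, 11-15
        z = rng.standard_normal((n, 1))             # shared latent factor z_k
        X[:, lo:lo + 5] = z + 0.4 * rng.standard_normal((n, 5))
    B = np.zeros((p, 3))
    B[:15, 0] = 2.0                                 # B_1
    B[:15, 1:] = 1.0                                # B_2 = B_3
    Y = X @ B + 3.0 * rng.standard_normal((n, 3))   # errors ~ N(0, 9)
    Y[:, 0] = np.sign(Y[:, 0])                      # first task: class labels
    return X, Y
```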

Figure 2 shows the binary maps of the true precision matrices and Fig. 3 shows the corresponding feature graphs of the three examples. All three graphs are sparse. For Examples 1 and 3, useful features (i.e., features with nonzero regression coefficients) are connected only with useful features; for Example 2, one useful feature is connected with one useless feature. In addition, for each example, the different tasks are highly correlated since they share the same useful features. It is therefore interesting to study whether the correlation information among features represented by the feature graph and the correlation information among tasks can be incorporated to improve prediction performance.

Fig. 2

Binary maps of the true precision matrices corresponding to these three simulated examples: left (Example 1), middle (Example 2), and right (Example 3). Each red dot represents a nonzero element in the precision matrix

Fig. 3

True feature graphs corresponding to these three simulated examples: left (Example 1), middle (Example 2), and right (Example 3). Each blue dot indicates a feature

Simulation results

Table 2 shows the comparison of the different methods on the three simulated examples. As shown in Table 2, for all three examples, the GGSL and GGML methods achieve better performance than the Lasso and M3T methods, respectively. This indicates that the partial correlation information extracted from the features can be utilized to improve prediction performance. In addition, the GGML and M3T methods also achieve better performance than the GGSL and Lasso methods, respectively, which indicates that learning different correlated tasks jointly can also improve prediction performance. Since our proposed GGML method incorporates both the partial correlation information among features and the intrinsic correlation information among different related tasks, it delivers the best performance in all cases. In the next section, we further compare these four methods using the ADNI dataset.

Table 2 Comparison of different methods using the simulated examples

Analysis of the ADNI dataset

For the ADNI dataset, we estimate one class label and two clinical scores (i.e., MMSE and ADAS-Cog) using the MRI, FDG-PET and/or CSF features. Since there are two binary classification problems (AD vs. NC, and MCI vs. NC), we perform two sets of experiments. The first set of experiments uses the AD/NC dataset including only AD and NC subjects. The second set of experiments uses the MCI/NC dataset including only MCI and NC subjects. For each set of experiments, we consider four cases: (I) use only MRI features; (II) use only PET features; (III) use both MRI and PET features (denoted as MRI \(+\) PET); (IV) use all MRI, PET and CSF features (denoted as MRI \(+\) PET \(+\) CSF).

To evaluate the performance of the different methods, we used a tenfold cross-validation (CV) strategy. Specifically, all samples were randomly partitioned into ten subsets; each time, nine subsets were used for training and the remaining one for testing, and we repeated this process ten times so that each of the ten subsets was used exactly once as the testing data. Furthermore, in consideration of possible bias due to the random partition, we repeated the whole tenfold CV process 30 times. In the training process, each column of the training data was normalized to have mean 0 and standard deviation 1. For all methods, we performed an inner fivefold CV on the training data to choose the tuning parameters.
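The following sketch illustrates this evaluation protocol; the data here are synthetic placeholders with a weak signal, and scikit-learn's LassoCV stands in for any of the four methods (each of which would instead tune \((\gamma, \lambda)\) by the inner fivefold CV):

```python
import numpy as np
from sklearn.linear_model import LassoCV
from sklearn.model_selection import KFold
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.standard_normal((199, 189))                 # placeholder for 199 subjects
y = X[:, :5] @ np.ones(5) + rng.standard_normal(199)  # placeholder response

scores = []
for rep in range(30):                               # 30 repetitions of tenfold CV
    kf = KFold(n_splits=10, shuffle=True, random_state=rep)
    for tr, te in kf.split(X):
        scaler = StandardScaler().fit(X[tr])        # normalize on training folds only
        model = LassoCV(cv=5).fit(scaler.transform(X[tr]), y[tr])  # inner fivefold CV
        pred = model.predict(scaler.transform(X[te]))
        scores.append(np.corrcoef(pred, y[te])[0, 1])
print(f"mean CC over 300 test folds: {np.mean(scores):.3f}")
```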

Partial correlation among different features

In the first step of the GGSL and GGML methods, we need to extract the effective correlation information from the features. Note that only the training data matrix of features was used to estimate the sparse undirected graph \({\mathbf {G}}\) representing the significant partial correlations among features. Figure 4 shows the binary maps of the estimated precision matrices. The binary maps in the first two columns indicate that many features within the same modality (e.g., MRI or PET) are statistically significantly partially correlated. However, as shown by the binary maps in the third column, the partial correlations between MRI features and PET features are not statistically significant in most cases. Furthermore, the comparison between the binary maps in the first row and the second row indicates that the partial correlation information extracted from the AD/NC data is similar to that extracted from the MCI/NC data. As in the example shown in Fig. 1, we can transform the estimated precision matrices into undirected graphs. The feature graphs corresponding to the estimated precision matrices are shown in Fig. 5. This graph information is used in the GGML and GGSL methods.

Fig. 4

Binary maps of the estimated precision matrices. First row uses AD/NC data; second row uses MCI/NC data. First column uses only MRI features; second column uses only PET features; third column uses both MRI and PET features. Each red dot in the plot represents a nonzero element

Fig. 5

Feature graphs corresponding to the estimated precision matrices. First row uses AD/NC data; second row uses MCI/NC data. First column uses only MRI features; second column uses only PET features; third column uses both MRI and PET features. Each blue dot represents an MRI feature and each green dot represents a PET feature

Classification results

The classification accuracies of the different methods are shown in Table 3. All methods achieve higher classification accuracy on the AD/NC dataset than on the MCI/NC dataset. For the AD/NC dataset, when we use only MRI features or only PET features, the GGSL and GGML methods achieve better classification performance than the Lasso and M3T methods, respectively. This indicates that the partial correlation information extracted from the features can be utilized to improve classification performance. In addition, when we use both MRI and PET features, or all the MRI, PET, and CSF features, it is relatively easy to discriminate AD subjects from NC subjects, and all four methods achieve similarly high classification accuracies.

Table 3 Comparison of the classification performance on the ADNI dataset

For the MCI/NC dataset, on the one hand, the comparison between GGSL and Lasso (or GGML and M3T) indicates that using the extracted partial correlation information among features improves the classification performance significantly. On the other hand, the comparison between GGML and GGSL (or M3T and Lasso) shows that joint classification and regression can provide better classification performance than separate classification. Since our proposed GGML method incorporates both the partial correlation information among features and the intrinsic correlation information among different related tasks, it delivers the best classification performance.

Regression results

For the regression tasks, we need to predict both the MMSE score and the ADAS-Cog score. Tables 4 and 5 show the comparison of regression performance on the AD/NC data and the MCI/NC data, respectively. As shown in Tables 4 and 5, our proposed GGML method achieves promising performance in most cases. For example, when we use all the features to predict the MMSE score on the AD/NC data, our proposed GGML method achieves the highest correlation coefficient of 0.745, while the corresponding correlation coefficients for Lasso, GGSL, and M3T are 0.709, 0.723, and 0.724, respectively. On the MCI/NC data, GGML also has the best performance with a correlation coefficient of 0.382, while the corresponding values for Lasso, GGSL, and M3T are 0.303, 0.325, and 0.364, respectively. Similarly, when we use all the features to predict the ADAS-Cog score on the AD/NC data, our proposed GGML method achieves the highest correlation coefficient of 0.740, while the corresponding values for Lasso, GGSL, and M3T are 0.664, 0.719, and 0.718, respectively. On the MCI/NC data, GGML again has the best performance with a correlation coefficient of 0.472, while the corresponding values for Lasso, GGSL, and M3T are 0.336, 0.464, and 0.426, respectively.

Table 4 Comparison of the regression performance on the AD/NC dataset
Table 5 Comparison of the regression performance on the MCI/NC dataset

It is interesting to note that, for the MCI/NC dataset, the PET and CSF data seem not to be useful for the prediction of the MMSE score. All four methods predict the MMSE scores poorly when only the PET data are used. In addition, compared with the cases using only MRI data, both the M3T and GGML methods perform worse when the additional PET/CSF data are used. Similar to the previous discussion of classification performance, the comparison between GGSL and Lasso (or GGML and M3T) indicates that using the extracted partial correlation information among features improves the prediction of the MMSE and ADAS-Cog scores significantly. In addition, the comparison between GGML and GGSL (or M3T and Lasso) shows that joint classification and regression can deliver better prediction performance than separate regression of MMSE (or ADAS-Cog) on the features. Since our GGML method incorporates both the partial correlation information among features and the intrinsic correlation information among different tasks, it delivers the best prediction of the MMSE and ADAS-Cog scores.

Most discriminative brain regions

In this subsection, we investigate the most discriminative brain regions for the diagnosis of disease status and the prediction of the MMSE and ADAS-Cog scores. For each method, we repeated the whole tenfold CV process 30 times and obtained 300 different models trained on different training datasets. Figure 6 shows the selection frequency of each of the 93 ROIs for the AD/NC classification task using only MRI features, where the selection frequency for each ROI is defined as

$$\begin{aligned} \text {Frequency}=100\times \frac{\text {number of times selected in the 300 models}}{300}. \end{aligned}$$

For each method, some ROIs are always selected while some ROIs are seldom selected. Compared with Lasso and M3T, the GGSL and GGML methods tend to select more ROIs since they use the feature graph information and encourage the significantly partially correlated features to be selected jointly. According to the selection frequency, we compare the top ten selected ROIs of different methods for different tasks. Tables 6, 7 and 8 show the indices of the top ten selected ROIs of the four methods for different tasks (classification or regression), different datasets (AD/NC or MCI/NC) and different modalities (MRI or PET). Table 9 contains the full names of the ROIs.

Fig. 6

Selection frequency of 93 ROIs for the AD/NC classification task using only MRI features

Table 6 Comparison of the top ten selected ROIs for the classification task
Table 7 Comparison of the top ten selected ROIs for the regression task (MMSE)
Table 8 Comparison of the top ten selected ROIs for the regression task (ADAS-Cog)
Table 9 Names of the selected ROIs in this study

As shown in Tables 6, 7 and 8, for different tasks, the top ten selected ROIs of the single-task learning methods (Lasso and GGSL) differ, while the top ten selected ROIs of the multi-task learning methods (M3T and GGML) are the same. We can also observe that the top ten selected ROIs for the cases using MRI features are not very similar to those for the cases using PET features. One possible reason is that MRI features and PET features provide complementary information for the diagnosis of AD. However, for each case, the top ten selected ROIs of the four methods are similar. For example, for the AD/NC classification task using MRI features, Table 6 indicates that the ROIs with indices 18, 80, 83, 84, and 90 are frequently selected by all four methods. It is interesting to point out that both the GGML and M3T methods also select the 48th ROI frequently for the AD/NC classification task, while this ROI is not among the top ten selected ROIs of Lasso and GGSL for this task. However, as shown in Table 8, the 48th ROI is frequently selected by Lasso and GGSL for the regression task (ADAS-Cog) using the AD/NC data. This indicates that the multi-task learning methods such as GGML and M3T incorporate the clinical score information into the classification task. Conversely, as shown in Table 8, both the GGML and M3T methods select the 22nd ROI frequently for the regression task (ADAS-Cog) using the AD/NC data, while this ROI is not among the top ten selected ROIs of Lasso and GGSL for this task. However, as shown in Table 6, the 22nd ROI is frequently selected by Lasso and GGSL for the classification task (AD vs. NC). This indicates that the multi-task learning methods such as GGML and M3T incorporate the class label information into the regression task.

Furthermore, as shown in Tables 6, 7 and 8, for the study using the AD/NC data and MRI features, the common top ten selected ROIs of Lasso across the different tasks are the ROIs with indices 18, 80, 83, 84, and 90, and the common top ten selected ROIs of the GGSL method across the different tasks are the ROIs with indices 58, 80, 83, and 84. Most of these ROIs are among the top ten selected ROIs of our proposed GGML method. In Figs. 7 and 8, we visualize the top ten selected ROIs of our proposed GGML method when different datasets (AD/NC or MCI/NC) and different modalities (MRI or PET) are used. Most of the selected regions, e.g., uncus right (22), hippocampal formation right (30), uncus left (46), middle temporal gyrus left (48), hippocampal formation left (69), middle temporal gyrus right (80), and amygdala right (83), are known to be highly correlated with AD and MCI from many studies using group comparison methods (Jack et al. 1999; Misra et al. 2009; Zhang and Shen 2012).

Fig. 7

Top ten most discriminative brain regions selected by GGML method using AD/NC dataset

Fig. 8

Top ten most discriminative brain regions selected by GGML method using MCI/NC dataset

Discussion

In this section, we first discuss some issues about constructing the undirected feature graph \({\mathbf {G}}\). Then, some possible extensions of our proposed method will be discussed.

Construction of the undirected feature graph \({\mathbf {G}}\)

Before performing our proposed GGML method, we need to construct an undirected feature graph \({\mathbf {G}}\) representing the significant correlation information among features. In the “Extract the correlation information among features” section, we proposed using the graphical Lasso method to construct this graph. For some datasets, the constructed graph \({\mathbf {G}}\) may include many edges corresponding to weak or even spurious partial correlations due to poor estimation of the precision matrix. In this case, by thresholding the estimated precision matrix, we can construct a sparser undirected graph representing only the most reliable partial correlations.

Furthermore, besides partial correlation information among features, we can also combine other useful information (e.g., some prior information about features) to construct this graph \({\mathbf {G}}\). Our proposed GGML method can be used for any given undirected feature graph \({\mathbf {G}}\) representing the relationships among different features.

Use of the structure information among different subjects

Our proposed GGML method utilizes both the correlation information among features and the intrinsic correlation information among different response variables. In fact, we can also generalize the GGML method to incorporate the structure information among different subjects. Similar to the locality preserving projection (LPP) method (He and Niyogi 2004), we can model the structure information among the training subjects as another sparse undirected graph \(\mathbf S\) with n nodes, where each node represents one subject. The connectivity of the graph \(\mathbf S\) can be defined by the k nearest neighbors, i.e., subjects \(x_{s}\) and \(x_{l}\) are connected by an edge if \(x_{s}\) is among the k nearest neighbors of \(x_{l}\), or \(x_{l}\) is among the k nearest neighbors of \(x_{s}\). To use the structure information represented by \(\mathbf S\), we can preserve the neighborhood structure of the subjects, i.e., encourage the predicted response variables \(\hat{y}_{s}={\mathbf {B}}^{T}x_{s}\) and \(\hat{y}_{l}={\mathbf {B}}^{T}x_{l}\) to be close whenever the sth and the lth subjects are connected in the undirected graph \({\mathbf {S}}\).
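As a sketch of how such a subject graph could be built (the helper name subject_graph and the default k are our illustrative choices), scikit-learn's kneighbors_graph gives the directed k-NN relation, and taking the elementwise maximum with its transpose implements the “or” in the definition above:

```python
import numpy as np
from sklearn.neighbors import kneighbors_graph

def subject_graph(X, k=5):
    """Build the undirected k-nearest-neighbor subject graph S: subjects s and l
    are connected if either is among the other's k nearest neighbors."""
    S = kneighbors_graph(X, n_neighbors=k, mode='connectivity', include_self=False)
    return S.maximum(S.T)   # symmetrize: sparse n x n adjacency matrix
```

One natural way to encourage connected subjects’ predictions to be close would be to add a graph-Laplacian-type penalty, e.g., \(\mu \sum _{(s,l)\in {\mathbf {S}}}\Vert {\mathbf {B}}^{T}x_{s}-{\mathbf {B}}^{T}x_{l}\Vert _{2}^{2}\), to problem (6); we note this only as a possible extension.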

Conclusion

In summary, we propose a new graph-guided multi-task learning method to incorporate the correlation information among features and the intrinsic correlation information among different tasks. To use the correlation information among features, our proposed GGML method encourages the partially correlated features to be in or out of the model jointly. Furthermore, in order to acquire more robust and accurate feature selection, our proposed GGML method encourages different tasks to share a common useful feature subset. Theoretically, our proposed GGML method is very general and includes the M3T method as a special case. The experimental results on the simulated examples and the ADNI dataset also show the advantage of the proposed GGML method over the existing methods.