Introduction

Alzheimer’s disease (AD) is the most prevalent dementia, accounting for about 60–80% of dementia cases among the elderly population worldwide. It has been reported that the incidence of AD will double every five years after the age of 65 (Bain et al. 2008), and 1 in every 85 people will be affected by the year 2050 (Brookmeyer et al. 2007). As a typical neurological disorder disease, AD is commonly characterized by some predominant clinical symptoms, including progressive memory loss and cognitive deficits, which severely interfere with patients’ daily lives. This disease is incurable and worsens over time due to the degeneration of specific nerve cells, presence of neuritic plaques, and neurofibrillary tangles (McKhann et al. 1984), eventually causing death. Thus, it is important for AD patients to be diagnosed as early and accurately as possible, so the effective pharmacological and behavioral treatments can be provided to potentially delay the progress of AD.

Some works (Johnson et al. 2006; Whitwell et al. 2007) suggest that the pathological manifestation of AD begins many years before it can be diagnosed. Mild cognitive impairment (MCI), as a prodromal stage of AD, has attracted much attention because MCI subjects tend to convert to clinical AD with an average conversion rate of 10% to 15% per year (Misra et al. 2009), and more than 50% within 5 years (Gauthier et al. 2006; Petersen et al. 2001). Therefore, earlier diagnosis of MCI is very important for treatment and possible delaying the progression of MCI to moderate and severe stages. However, identifying MCI subjects from subjects undergoing normal aging is much more difficult because of the subtlety of involved cognitive impairment. Through analyzing neuroimaging data with machine learning algorithms (Mitchell 1997), it is possible to identify subtle diagnostic biomarkers that can effectively distinguish MCI from normal controls (NC). Some successfully applied machine learning algorithms include support vector machine and multiple kernel learning (Cortes and Vapnik 1995; Jie et al. 2014b; Zhang, et al. 2011), multi-task and sparse learning (Friedman et al. 2008; Huang et al. 2010; Suk, et al. 2014a), and the recently emerging deep neural networks (Liu et al. 2015; Suk et al. 2015; Suk, et al. 2013; Suk, et al. 2014b) amongst many others.

Resting-state functional magnetic resonance imaging (RS-fMRI), a cutting-edge technology at disposal of cognitive neuroscience, can measure the blood oxygenation level dependent (BOLD) signal. This signal reflects low-frequency spontaneous fluctuations in the resting brain, which are related to intrinsic neural activity within the brain (Fox and Raichle 2007). From RS-fMRI data, it is possible to infer functional connectivity (FC) between structurally separated brain regions. Here, the FC is defined as the temporal correlation of BOLD signals measured in different brain areas (Friston et al. 1993). It has been proven (Greicius 2008) that FC is a useful tool for understanding the pathological underpinnings of MCI at the whole-brain level (Rombouts et al. 2005; Sorg et al. 2007; Wang et al. 2007) and has great potential in providing a significant biomarker for the diagnosis of MCI (Wee, et al. 2015; Wee et al. 2012; Wee et al. 2014). This is because the topological structure and strength of FC are disrupted due to the pathological attack of AD (Anderson and Cohen 2013; dos Santos Siqueira et al. 2014; Fekete et al. 2013; Wang et al. 2007; Zhang et al. 2016a). In most studies, FC is represented by a graph (Brier et al. 2014; Toussaint et al. 2014) in which a node corresponds to a brain region and an edge characterizes the FC strength between different brain regions. Therefore, constructing FC network based on RS-fMRI holds great promise for distinguishing MCI from normal aging (Stam et al. 2009; Stam et al. 2007). Most studies have focused on estimating FC from BOLD signals by using different statistical methods. For instance, Pearson’s correlation (Jie et al. 2014b; Wee et al. 2012) can measure the pairwise similarity of BOLD signals. But this method handles each pair of brain regions independently, without taking into account the effects of other brain regions. Partial correlation characterizes the pairwise correlation and, at the same time, factors out the effects of other brain regions. Sparse representation (Yu et al. 2016; Suk et al. 2014c; Wee, et al. 2015; Wee et al. 2014; Wright et al. 2009) is able to approximate the BOLD signals of one brain region by linearly combining a smallest set of the signals from the rest brain regions, thus resulting in a sparse FC network.

Traditionally, the FC is estimated based on the entire length of BOLD time series (Jie et al. 2014a; Jie et al. 2014b; Wee et al. 2014). It assumes the FC is stationary, by ignoring the complexity and dynamic property of the brain activities. To deal with this problem, a sliding window approach has been utilized in (Wee et al. 2013; Wee, et al. 2015) to partition the entire BOLD signals (derived from each brain region) into several overlapping segments. For each segment, a FC network is constructed to characterize the time-varying FC between two brain regions during a certain time interval. Consequently, a set of FC networks can be generated and fused for MCI classification. Nevertheless, as in most traditional methods, the involved correlation (and thus the FC network) is still low-order in the sense that 1) it is calculated based on the raw BOLD signals; 2) it only describes how two brain regions are functionally interacted with each other; 3) it cannot be directly used because of the possible phase mismatch across different subjects. In our previous study (Chen et al. 2016), we regarded these low-order FC networks as a set of FC time series, each of which is associated with a specific pair of brain regions and characterizes the variation of their FC over time. Thus, by computing the correlation between two FC time series (involving up to four brain regions), a novel high-order FC network can be constructed for each subject. It should be emphasized that, different from the low-order case above, the correlation (and thus the FC network) obtained in such a manner is high-order because 1) it is computed based on FC time series and is thus a high-level feature; 2) it characterizes how different brain region pairs functionally interact with each other; 3) it owns a merit of invariance, avoiding the adverse impact of the possible phase mismatch across different subjects. Moreover, the proposed high-order FC network is different from either static or dynamic FC. First, the high-order FC network uses time-varying FC obtained by the sliding window, which is different from the procedure of computing the static FC. Second, by calculating a second round of correlation, our method aggregates the time-varying FCs into a single high-order functional network (i.e., computing the correlation’s correlation), thus very different from those traditional dynamic (low-order) FC computational approaches.

Note that the scale of high-order FC networks is much larger than that of low-order FC networks, because the quantity of brain region pairs is much larger than that of brain regions (e.g., 116 brain regions yield about 1162 brain region pairs). As a result, the feature vector extracted from the high-order FC networks is of high-dimensionality, which increases the model complexity as well as the risk of over-fitting. To overcome these problems, a clustering algorithm (Chen et al. 2014; Ward 1963) is first used to group the FC time series (associated with each pair of brain regions) into U different clusters, where U is a predefined number of clusters. Then, the mean of the FC time series within each cluster is used to compute high-order FC and further construct high-order FC networks. This strategy succeeds in reducing the scale of high-order FC networks as well as the number of features, thus improving computation efficiency and alleviating over-fitting. However, we found that the selected value of U will severely influence the discriminative ability of the resulting high-order FC networks in diagnosis (Chen et al. 2016), since different values of U lead to different high-order FC networks. To find an optimal high-order FC network, a candidate set of high-order FC networks with different discriminative ability is generated by varying U (Chen et al. 2016). Then, each high-order FC network is evaluated independently on the validation data and the one with the best performance is finally selected, while others are simply discarded. Although this method achieves promising results, there is still room for improvement.

In order to boost the diagnosis performance of high-order FC networks, we hypothesize in this paper that those discarded high-order FC networks in (Chen et al. 2016), although inferior to the optimal network in overall discriminative performance, may contain essential and complementary information for classification. In this sense, choosing a single optimal high-order FC network while discarding others, may cause the loss of useful information for diagnosis. Therefore, inspired by the above insight, we propose a novel hierarchical high-order FC network and feature fusion framework to take advantage of all high-order FC networks and further improve the diagnosis performance. Notably, by selection and integration of the information contained in the candidate high-order FC networks in an appropriate way, this method takes full advantage of all of the available discriminative information and also avoids over-fitting. This study is featured by two following aspects:

First , in order to exploit multiple high-order FC networks while avoiding large quantity of features, we construct these high-order FC networks in a hierarchical way, where the network in one layer is closely dependent on the network in the previous layer. In view of this inherent relationship between two consecutive layers, the features extracted from the network in each layer can be decomposed into two feature blocks, depending on the correlation with the features extracted from the network in the previous layer. Those highly-correlated features with respect to the previous layer are redundant and thus can be eliminated before feature fusion without losing vital information. In contrast, those less-correlated features may contain complementary information and should be taken into account in feature fusion. By constructing the hierarchical high-order FC networks, we can not only generate useful and complementary features, but also significantly reduce the redundancy.

Second, considering the layer-wise structure of features, we propose a novel feature fusion method to utilize the information contained in all candidate high-order FC networks. This method elaborately combines the sequential forward selection (Jain and Zongker 1997) and sparse regression (Tibshirani 1996) under the classic wrapper-based feature selection framework (Kohavi and John 1997). First, sequential forward selection operates on feature block from each layer and adds one block each time. Then, sparse regression operates on the combined feature blocks and attempts to select only a small number of individual features. Finally, support vector machine (Chen et al. 2011a; Chen et al. 2011b; Cortes and Vapnik 1995) with simple linear kernel is trained based on the selected features, and the classification performance on validation data is employed to guide the selection procedure above. This feature fusion method further removes redundant and uncorrelated features, with respect to the classification, thus mitigating the influence of over-fitting.

It is worth noting the existence of the previous work (Zhou et al. 2011) on modelling the hierarchical structure of brain anatomical network. However, there are two main differences from our proposed approach. First, their brain anatomical network focuses on multi-scale structural relationship generated based on T1-weighted MRI data, while our high-order FC networks model multi-scale functional relationship derived from RS-fMRI data. Second, their brain anatomical network is still a low-order network since it considers only the direct pairwise interaction between brain regions. Our proposed hierarchical high-order FC network reveals high-order correlations, reflecting how different pairs of brain regions (involving at least four brain regions) interact with each other.

Moreover, compared to the previous work (Chen et al. 2016), this study can be viewed as an important extension with the following novelties: 1) a hierarchical strategy is developed to generate multiple related high-order FC networks for containing as more information as possible; 2) due to the use of hierarchical structure, a feature decomposition technique is further developed based on correlation analysis of hierarchical high-order FC networks from consecutive layers in order to reduce feature redundancy; 3) due to the complexity of those obtained features, a feature fusion method (by integration of sequential forward selection and sparse regression) is developed to select a few discriminative features for classification.

Data and Methodology

Data Preparation

The Alzheimer’s Disease Neuroimaging Initiative (ADNI) dataset is used in this study. ADNI was launched in 2003 by the National Institute on Aging, the National Institute of Biomedical Imaging and Bioengineering, the Food and Drug Administration, private pharmaceutical companies and nonprofit organizations. The goal of ADNI is to validate the use of various biomarkers, including MRI, PET imaging and related neuropsychological assessments for AD clinical trials and diagnosis.

In this work, 50 MCI subjects and 49 normal controls (NCs) are randomly selected from ADNIGo and ADNI2 dataset. (For more details about imaging parameters, please see the ADNI protocols at adni.loni.ucla.edu.) The patient group consists of 27 early MCI subjects and 23 late MCI subjects. Subjects from both groups were age-matched and scanned using 3.0 T Philips scanner. SPM8 software package (http://www.fil.ion.ucl.uk/spm/software/spm8) was used to preprocess the acquired RS-fMRI data. The scanning time for each subject is 7 min (i.e., 140 volumes), and the subjects with more than 2.5 min of large frame-wise displacement (FD > 0.5) were not included in this study (i.e., excluded before data inclusion). For magnetization equilibrium, the first 3 volumes of each subject were discarded before preprocessing, and then the remaining 137 volumes were used for subsequent analysis. A rigid-body transformation was used to correct head motion during the scan; and the subject with head motion larger than 2 mm or 2° were not included in this study. The fMRI images were registered to the Montreal Neurological Institute (MNI) space and spatially smoothed using a Gaussian kernel with full width at half maximum (FWHM) = 6 × 6 × 6 mm3. We did not perform data scrubbing (i.e., removing volumes with FD > 0.5), as it would introduce additional artifacts to the subsequent dynamic FC analysis. The RS-fMRI images were parcellated into 116 regions according to the Automated Anatomical Labeling (AAL) template. The mean RS-fMRI time series of each brain region was band-pass filtered (0.015–0.15 Hz). Head motion parameters (Friston24), mean BOLD signal of white matter, and mean BOLD signal of cerebrospinal fluid were all regressed out from the RS-fMRI data to further reduce artifacts.

Framework

The main steps of hierarchical high-order FC networks construction are shown in Fig. 1 where four brain regions are denoted by A, B, C, and D. Generally speaking, this method consists of the following steps: 1) A sliding window with length N and step size s is applied to partition the entire BOLD signal into multiple overlapping segments, each of which characterizes the neural activity of brain region in a relatively small time interval. 2) For each subject, a set of low-order FC networks is constructed, each of which is based on a BOLD signal segment. By doing so, we actually obtain a set of FC time series, each describing the temporal variation of correlation between two brain regions. 3) All subjects’ FC time series associated with the same brain region pair are concatenated together to form a long FC time series which is represented by a point in high-dimensional space. 4) In the high-dimensional space, the long FC time series from all brain region pairs are grouped into U clusters by the clustering algorithm, thus, yielding consistent clustering results across different subjects. 5) For each subject, the mean of the FC time series within the same cluster is computed and then a high-order FC network (HON) is constructed based on the correlation between the mean FC time series of different clusters. 6) High-order feature vector (e.g., weighted local clustering coefficients in this paper) is extracted from the constructed high-order network. 7) Repeating steps 4–6 multiple times with different Us generates multiple high-order FC networks, each of which characterizes the high-order correlation at different scales, and also multiple high-order feature vectors for each subject. 8) The feature vectors extracted from all high-order FC networks are analyzed based on correlation and then a feature subset is selected by a feature selection that combines the sequential forward selection and sparse regression. 9) Support vector machine (SVM) is trained with the selected features to classify MCI and NC subjects.

Fig. 1
figure 1

Framework of the proposed hierarchical high-order FC networks construction

We can observe that the resulting high-order FC network from Steps 4–5 closely depends on the number of clusters U. By varying U, we can obtain a candidate set of high-order FC networks, from which their corresponding features can be extracted. Then, we can ensemble these networks in a principled way to make full use of all available features. On one hand, more networks will produce more features, which may bring benefits to MCI diagnosis. On the other hand, more networks also make the relationship among the features more complex and difficult to analyze. Also, due to the limited sample size in practice, this will probably cause the curse of dimensionality and over-fitting. As a result, taking advantage of the information brought by all high-order FC networks and also avoiding over-fitting become a crucial problem for the multiple high-order FC networks ensemble.

To deal with the problems above, we propose hierarchical high-order FC networks, as well as an effective feature fusion method, for MCI classification. First, the agglomerative hierarchical clustering algorithm is used to produce multiple clustering results with different Us. Then, each clustering result is utilized to generate a high-order FC network in the corresponding layer. Second, features are extracted from all high-order FC networks, and the correlation analysis is performed to reveal their inherent relationship, based on which features in each layer can be decomposed into the redundant part and the informative part, respectively. Third, a sequential forward selection procedure is employed to combine the informative features from different high-order FC networks, followed by a sparse regression to select the individual features beneficial for classification. The detailed procedure is explained as follows.

Hierarchical Clustering and Feature Decomposition

Suppose the FC time series in high-dimensional space are progressively grouped into u i and u i + 1 clusters, respectively, in the layer i and layer i + 1, where u i  > u i + 1. Due to the nature of agglomerative hierarchical clustering, some clusters in the layer i, which are highly similar to each other, will be merged to form new clusters in the layer i + 1, while the remaining clusters are retained without change. An illustration of the agglomerative hierarchical clustering procedure is shown in the panel of Step 4 and its upper panel in Fig. 1. Notice that the FC time series in Fig. 1 are generated by the sliding window approach in Step 1. Due to the hierarchical clustering in Fig. 1, we can observe that, from the layer i to the layer i + 1, the clusters in the blue and purple ellipses are merged to form new clusters as shown in red ellipse, while all other clusters are retained, thus reducing the total number of clusters in the layer i + 1. Based on the clustering results in the layer i and the layer i + 1, the high-order FC networks HON i and HON i + 1 are constructed, respectively, where one vertex corresponds to one cluster in the corresponding layer, as shown by the high-order FC network in the panel of Step 5 and its upper panel in Fig. 1. Subsequently, the feature vectors \( {Fea}_i\in {R}^{u_i} \) and \( {Fea}_{i+1}\in {R}^{u_{i+1}} \) (i.e., weighted local clustering coefficients (Rubinov and Sporns 2010; Watts and Strogatz 1998) in this paper) can be extracted from the high-order FC networks HON i and HON i + 1, respectively. For this type of features, each entry in Fea i and Fea i + 1 corresponds to a vertex in HON i and HON i + 1, respectively, thus also corresponding to a cluster in the layer i and the layer i + 1. Due to the high overlapping between the clustering results in the layer i + 1 and the layer i, Fea i + 1 can be decomposed as Fea i + 1 = [D i + 1, S i + 1], where D and S respectively indicate the corresponding clusters that are newly formed from and already exist in the previous layer. As a result, S i + 1 may be highly correlated with some features in Fea i . This implies that only D i + 1 in the layer i + 1 may contain useful information, whereas S i + 1 is redundant with respect to the previous layer. This observation can be generalized to the case of multiple layers. Suppose hierarchical high-order FC networks are generated from the layer 1 to the layer L. Following the above analysis, the feature vector Fea i extracted from the layer i can be decomposed into two parts Fea i  = [D i , S i ] by comparing it with the previous layer. Note that, to decompose the features in the layer 1, an extra layer 0 is used. For the layer i (≥2), only D i may contain useful information while S i is redundant with respect to the previous layer. But for the layer 1, both S 1 and D 1 should be reserved to guarantee the completeness of information contained in all networks. Therefore, for each subject, the features extracted from all hierarchical high-order FC networks can be condensed and expressed as Fea = [D 0, D 1, D 2,  ⋯ , D L ] where D 0 = S 1. In this paper, each D i is called a feature block since it is associated with a specific layer.

Selective Feature Fusion

Note that although applying the above agglomerative hierarchical clustering and correlation analysis can reduce the dimensionality of features to a large extent, the redundancy between different layers may still exist, especially when taking into account multiple layers. In addition, not all of the features in Fea are discriminative in terms of MCI classification. To maximally benefit from the information contained in Fea and also further reduce redundancy simultaneously, we propose a feature fusion method, which combines sequential forward selection and sparse regression (Liu et al. 2009), under the framework of wrapper-based feature selection (Kohavi and John 1997). Here, sequential forward selection can find feature block progressively, while sparse regression can select individual features that are predictive for classification. Specifically, given a current set A of feature blocks, a new feature block D i from Fea − A can be selected and combined with A. Then, sparse regression (Zhang et al. 2016b; Zhang et al. 2017) is performed on all training samples with this enlarged feature subset to find a small subset C, which is beneficial for classification. Next, the selected features of all training subjects are used to train a linear SVM model, and the classification accuracy on the validation subjects is used to guide the selection of D i , which means that the one yielding the optimal accuracy is finally selected. The procedure above is repeated until a required number of feature blocks is reached. This procedure is illustrated in the following Table 1.

Table 1 Sequential forward selection and sparse regression

Evaluation Protocol

In this study, because of limited samples, the leave-one-out cross validation (LOOCV) is adopted to evaluate the performance of different methods. Specifically, given a total of N subjects, N − 1 subjects are used for training a SVM classifier and the rest one is used for testing. The classification results on the testing subject are recorded. The above procedure is repeated N times, each time leaving out a different subject for testing. Finally, the average of the N evaluation results is computed to compare the generalization performance of different methods. For the hyper-parameter in each method, we tune its value on the training subjects by using the nested LOOCV. That is, for each fold of the above LOOCV, we have N − 1 training subjects. Then, for the N − 1 training subjects, one is left out for testing and the remaining N − 2 subjects are for training, using each specific hyper-parameter. This procedure will repeat N − 1 times, each time leaving out a different subject from the N − 1 training subjects. Then, the average of N − 1 evaluation results is computed and the hyper-parameter that gives rise to the best accuracy is finally selected for this fold.

Experimental Analysis and Results

Feature Correlation Analysis and Feature Decomposition

In this study, all methods were implemented in MATLAB 2012b environment. To generate high-order FC networks, we use a sliding window with s = 1 and N = 50. To generate multiple layers, we start from one layer with a relatively large number of clusters (U = 220), for retaining sufficient information. Then, the sequent layers are added by gradually reducing U by 30 until the optimal performance is achieved. In such a way, it eventually generates 4 high-order FC networks from the layer 1 to the layer 4: HON1, HON2, HON3, and HON4, where the number of clusters U equals 220, 190, 160, and 130, respectively. The feature vectors (i.e., weighted local clustering coefficients) Fea 1 ∈ R 220 , Fea 2 ∈ R 190 , Fea 3 ∈ R 160, and Fea 4 ∈ R 130 are extracted from HON1, HON2, HON3, and HON4, respectively. Since our method is a feature fusion method, the correlation between features from different high-order FC networks provides important prior information. To verify the analysis in Hierarchical Clustering and Feature Decomposition Section, we show the correlation between Fea i and Fea i + 1 (i = 1 , 2 , 3) in Fig. 2. As shown by the red lines in Fig. 2, most features in Fea i + 1 are actually highly correlated with features in Fea i , implying that most features in the current layer are redundant with respect to those in the previous layer and thus should be eliminated before feature fusion. Based on this correlation with the previous layer, each feature vector Fea i (i = 1 , 2 , 3 , 4) can be decomposed into two feature blocks, as shown in Fig. 3. As we can see, only about 30 features of each layer are less correlated with the previous layer. To guarantee the inclusion of sufficient information and also reduce the redundancy, five feature blocks S 1 ∈ R 191 , D 1 ∈ R 29 , D 2 ∈ R 30 , D 3 ∈ R 29, and D 4 ∈ R 30 are engaged in the subsequent feature fusion, while others are eliminated. In short, the total number of features decreases from 700 to 309 by this unsupervised correlation analysis.

Fig. 2
figure 2

Correlation between features from neighboring high-order FC networks

Fig. 3
figure 3

Illustration of feature decomposition in each layer

Classification Accuracy

For sparse regression, the SLEP toolbox (Liu et al. 2009) is utilized. The hyper-parameter involved in sparse regression is determined by the nested LOOCV described in Evaluation Protocol Section. To construct SVM classifier, the well-known LIBSVM toolbox (Chang and Lin 2001) is applied with default hyper-parameter. The proposed sequential forward selection and sparse regression based hierarchical high-order FC networks feature fusion (HHON-SFS) is compared with some closely related methods, including 1) a simple feature fusion method (HHON-CON), which directly concatenates all features extracted from four high-order FC networks, 2) four individual high-order FC networks (HON1, HON2, HON3, and HON4), 3) two low-order FC networks based on partial correlation (LON-PAC) and Pearson’s correlation (LON-PEC), and 4) sparse representation (SR) and weighted SR (WSR) based FC networks (Yu et al. 2016), respectively. To measure the performance of different methods, we use the following 7 indices: accuracy (ACC), area under ROC curve (AUC), sensitivity (SEN), specificity (SPE), Youden’s Index (YI), F-score, and balanced accuracy (BAC). These statistical measures are defined as (Sokolova et al. 2006)

$$ \begin{array}{c}\kern1em \mathrm{ACC}=\frac{\mathrm{TP}+\mathrm{TN}}{\mathrm{TP}+\mathrm{TN}+\mathrm{FP}+\mathrm{FN}}\kern1em \\ {}\kern1em \mathrm{SEN}=\frac{\mathrm{TP}}{\mathrm{TP}+\mathrm{FN}}\kern1em \\ {}\kern1em \mathrm{SPE}=\frac{\mathrm{TN}}{\mathrm{TN}+\mathrm{FP}}\kern1em \\ {}\kern1em \mathrm{YI}=\mathrm{SEN}+\mathrm{SPE}-1\kern1em \\ {}\kern1em \mathrm{F}\hbox{-} \mathrm{score}=2\times \frac{\mathrm{precision}\times \mathrm{recall}}{\mathrm{precision}+\mathrm{recall}}\kern1em \\ {}\kern-2em \mathrm{BAC}=\frac{1}{2}\times \left(\mathrm{SEN}+\mathrm{SPE}\right)\kern1em \end{array} $$

where \( \mathrm{precision}=\frac{\mathrm{TP}}{\mathrm{TP}+\mathrm{FP}} \), \( \mathrm{recall}=\frac{\mathrm{TP}}{\mathrm{TP}+\mathrm{FN}} \), and TP, TN, FP, and FN denote the true positive, true negative, false positive and false negative, respectively.

The experimental results are shown in Table 2. As we can see, the high-order FC networks achieve better accuracy than the two low-order FC networks. This is consistent with our previous research (Chen et al. 2016), indicating that high-order FC networks provide more discriminative biomarkers for MCI identification. Two sparse representation based networks, i.e., SR and WSR, remarkably improve correlation based networks; however, they are still inferior to the proposed hierarchical high-order FC networks. Comparing the four individual high-order FC networks, we can observe its sensitive performance to the number of clusters U. For example, too large or too small U will adversely affect the performance. This can be understood as follows: large U will cause too many redundant features, while small U will lead to significant information loss. We observe that the high-order FC network HON2 (U = 190) achieves better performance than other individual high-order FC networks (HON1, HON3, and HON4). Nevertheless, it does not mean the information contained in other suboptimal networks is completely useless, as they can also distinguish MCI and NC subjects to a certain extent, although their overall accuracies are lower than HON2. As an attempt to make use of all information, HHON-CON directly concatenates Fea 1 , Fea 2 , Fea 3, and Fea 4 together to form a combined feature vector of length 700. Although this method involves all features, the accuracy falls just between the best one and the worst one, as indicated by Table 2. This may be due to too many redundant features that make the relationship between features more complex, thus causing difficulty in individual feature selection and also potential over-fitting. In contrast, the proposed feature fusion method, HHON-SFS, achieves the best performance among all competitive methods. On one hand, we attribute this improvement to the feature correlation analysis and also the resulting feature decomposition, which eliminates many redundant features. On the other hand, the combination of sequential forward selection and sparse regression makes it possible to evaluate the importance of feature blocks and individual features progressively. As a result, the crucial and complementary features have more probability to be selected and fused for classification.

Table 2 Performance comparison of different methods in MCI classification

Sequential Forward Selection and Individual Feature Selection

For the sequential forward selection procedure, we vary the number of feature blocks from one to five and show the corresponding classification accuracies in Fig. 4(a). We can see that two feature blocks yield preferable performance. Moreover, it is necessary to choose some feature blocks according to the specific training subjects, rather than simply using all feature blocks. The proposed method can automatically determine which feature blocks should be selected based on the training subjects. Due to the use of LOOCV, different training subjects may cause differences in selection results of feature blocks. If one feature block is selected due to the nested LOOCV on the training subjects, but no individual feature is selected from this block based on the whole training subjects by sparse regression, this feature block will be eventually discarded. Therefore, we show the total number of occurrences of each feature block in Fig. 4(b). As we can see, S 1 is selected in each fold of LOOCV, which means it contains many features beneficial for disease classification. In fact, we can observe from Fig. 3 that, without S 1, the rest feature blocks cannot form a complete high-order FC network and much information will be lost. S 1, D 2 and D 4 are also selected many times, while D 2 is never chosen, which indicates that the information contained in D 2 cannot bring remarkable guidance for classification. For the individual feature selection, the experimental result is shown in Fig. 4(c). This is consistent with Fig. 4(b), which further confirms that most features are located in S 1, but D 2 and D 4 also provide some complementary features for improving classification. As a whole, we can see from these results that, despite the overall performances of individual high-order FC networks are different, they do contain complementary information and thus integrating them can boost the overall performance.

Fig. 4
figure 4

a Variation of accuracy versus the number of feature blocks; b Number of occurrences of each feature block; c Number of occurrences of each individual feature

Hierarchical High-Order FC Networks

The averaged low-order FC networks built by LON-PEC and LON-PAC on MCI and NC subjects are shown in Figs. 5 and 6, respectively. Similarly, the illustrations of averaged sparse representation and weighted sparse representation based FC networks are shown in Figs. 7 and 8, respectively. As we can see, the difference between MCI and NC is imperceptible for low-order FC networks, which implies the inferior classification performance. In contrast, the high-order FC networks in the four layers, as shown in Figs. 9, 10, 11, and 12, have more remarkable differences. Specifically, the number of strong high-order correlations in MCI subjects is generally more than that in NC subjects, which, to a certain sense, reflects the disorder characteristic of neural degenerative disease. This difference also explains why the high-order FC networks can better distinguish MCI and NC subjects than the traditional low-order FC networks. This observation is also consistent with our previous research. Moreover, due to the agglomerative hierarchical clustering, the high-order FC networks actually have much more overlap between two consecutive layers.

Fig. 5
figure 5

Averaged low-order FC networks built by LON-PEC for (a) MCI and (b) NC subjects

Fig. 6
figure 6

Averaged low-order FC networks built by LON-PAC for (a) MCI and (b) NC subjects

Fig. 7
figure 7

Averaged FC networks built by SR for (a) MCI and (b) NC subjects

Fig. 8
figure 8

Averaged FC networks built by WSR for (a) MCI and (b) NC subjects

Fig. 9
figure 9

Averaged high-order FC networks for (a) MCI and (b) NC subjects in the layer 1

Fig. 10
figure 10

Averaged high-order FC networks for (a) MCI and (b) NC subjects in the layer 2

Fig. 11
figure 11

Averaged high-order FC networks for (a) MCI and (b) NC subjects in the layer 3

Fig. 12
figure 12

Averaged high-order FC networks for (a) MCI and (b) NC subjects in the layer 4

Limitations

One important issue of the current study is related to the limited number of subjects used, which may lead to less statistical power. In addition, the leave-one-out cross validation approach was used to evaluate the classification performance of different approaches, which often leads to optimistic performance. This issue can be better overcome by other approaches such as bootstrapping. In the future, our proposed method will be further evaluated by the bootstrapping technique, along larger dataset. Besides, the number of clusters could be also determined by the bootstrapping technique.

Another limitation is related to the use of Pearson’s correlation which requires the samples to be independent to each other, although it has been used extensively in RS-fMRI and brain network studies. In our case, the dynamic FC time series generated by using the sliding window might violate this requirement, since neighboring windows are highly overlapped. Thus, the estimated correlation coefficient could be biased from the “true” correlation and less robust. But, in this study, we just simply calculate a “second level” of correlation based on such dynamic FC time series because of its simplicity and popularity in brain network construction (e.g., most of the previous FC analysis studies computed temporal correlation using the RS-fMRI signals of each pair of brain regions, without considering the sample dependency problem). In the future work, how to robustly measure high-order relationship between dynamic FC time series using other statistical techniques will be investigated.

Conclusion

In this paper, we propose integrating the information contained in multiple high-order FC networks for MCI classification. Hierarchical clustering technique is utilized to generate multiple high-order FC networks, each of which is located in one layer. Due to such hierarchy structure, the features extracted from the network in each layer can be simplified, and only the informative feature block is taken into account. By combining sequential forward selection and sparse regression, a novel feature fusion method is further developed. This method is able to selectively fuse informative feature blocks from different layers and further determine a small number of distinctive features for early diagnosis. Finally, support vector machine with simple linear kernel is used for MCI classification. The experimental results demonstrate the capability of the proposed approach in making full use of information contained in multiple high-order FC networks. It also shows that the appropriate combination of multiple high-order FC networks can yield much better classification performance than any single optimal high-order FC network.

Information Sharing Statement

The dataset used in this paper are from the Alzheimer’s Disease Neuroimaging Initiative (ADNI, RRID:SCR_003007), which are available at http://adni.loni.usc.edu/. Besides, Libsvm toolbox (RRID:SCR_010243) is available at https://www.csie.ntu.edu.tw/~cjlin/libsvm/, and SLEP (RRID:SCR_001870) toolbox is available at www.yelab.net/software/SLEP.