1 Introduction

Dementia in elderly people is caused mainly by Alzheimer's disease (AD), which progresses slowly from mild through moderate to severe stages of dementia. It arises mainly from the abnormal build-up of proteins in the brain, such as amyloid plaques and tau tangles. In the preclinical stage, changes take place in the brain decades before the actual diagnosis of AD. The next stage is mild cognitive impairment (MCI), in which slight but noticeable changes occur in memory and cognitive function. The final stage is dementia, marked by memory loss and impaired daily activities. None of the available treatments is a complete cure for the disease, so there is a strong need to identify it at an earlier stage in order to implement preventive measures effectively.

Significant studies have been carried out to understand the pathological conditions of the brain. Imaging modalities such as structural and functional magnetic resonance imaging (sMRI, fMRI) [1,2,3] and positron emission tomography (PET) [4], together with cerebrospinal fluid (CSF) measures [5], were used as biomarkers to classify the disease stage, either separately or in combination [5,6,7]. These imaging biomarkers quantify the structural and functional information of the brain. Apart from these biomarkers, mini-mental state examination (MMSE) scores were considered for the diagnosis of AD [1, 2]. The MMSE is a 30-point questionnaire for measuring cognitive impairment.

Different approaches were used for identifying AD. Traditional machine learning approaches based on manually extracted features were used in [4, 8]. Quantitative studies were developed to analyze the volume, thickness, surface, shape and texture of the brain [9,10,11]. In [12], region of interest (ROI)-based analysis was used to obtain the features. Voxel-based morphometry in statistical parametric mapping [13] and volumes generated by FreeSurfer [14] were used to extract brain features. In [15], a multi-feature kernel discriminant dictionary learning technique combining sMRI, fluorodeoxyglucose (FDG) PET and florbetapir-PET imaging features was used. In [16], three different two-dimensional convolutional neural networks were used for Alzheimer's disease classification based on the slice importance of sMRI images. Instead of using the entire brain, the hippocampus region was segmented and used for disease classification in [17]. In [6], networks were constructed from cortical gray matter volume, cortical thickness, cortical surface area, cortical curvature, cortical folding index and subcortical volume; node and edge features of these networks were selected using the F-score and used for classification. In [18], recursive feature elimination (RFE) was used with a support vector machine (SVM) to reduce model complexity.

Most of the previous studies used only baseline data, and despite many efforts to identify biomarkers for early diagnosis, classification of the disease state remains difficult. In this study, sMRI images are used to classify the stages of AD. Volumetric segmentation is applied, with the images normalized and registered using the Desikan–Killiany atlas. The predominant challenge is the limited number of records relative to the large dimensionality of the imaging features. Cortical parcellation volume (CV), subcortical parcellation volume (SV), surface area (SA), cortical thickness average (TA), cortical thickness standard deviation (TS) and hippocampal subfield parcellation volume (HS) biomarkers are considered for the classification. Although RFE and the genetic algorithm (GA) were used for feature selection in other studies, there has been no study of classification based on the biomarkers proposed above. For these biomarkers, different wrapper-based feature selection techniques using RFE and GA are performed to find the optimal feature set. The MMSE score is used along with the optimal volumetric segmentation features to check whether it affects classification performance. Finally, the classifier with the best predictive accuracy is determined.

2 Methods

The data used in the preparation of this article were obtained from the Alzheimer's Disease Neuroimaging Initiative (ADNI) database (adni.loni.usc.edu). Longitudinal sMRI images are used, which were processed by the Mayo Clinic for gradient inhomogeneity, B1 non-uniformity correction and scaling. Volumetric segmentation is carried out using FreeSurfer version 5.1, following the image processing framework described in [19, 20]. A quality control process is applied to the segmented images, which are made available on the ADNI website.

For the classification of the sMRI imaging data into normal control (NC), mild cognitive impairment (MCI) and Alzheimer's disease (AD), only the imaging data that pass the overall quality control process are considered. This reduces the data size considerably, owing to partial or total segmentation failures detected by the quality control. In addition to the imaging data, the neuropsychological measure, the mini-mental state examination (MMSE) score, is obtained from the LONI Image Data Archive website. Missing feature values are not imputed, to avoid introducing bias into the results; hence, records with missing values are discarded.

After preprocessing, the data comprise 347 normal control (NC), 558 MCI and 171 AD records. Among these, 27 MCI subjects converted to AD within three years of the baseline. The demographic information of the data samples used in this study is shown in Table 1.

Table 1 Demographic information of the data samples

3 Feature selection

In total, 69 CV, 50 SV, 70 SA, 68 TA, 68 TS and 16 HS features are present for every record, giving 341 features. The features are scaled by their inter-quartile range so that the representation is robust to outliers. Training time is high if all the features are used for classification, and a large feature dimension leads to problems such as overfitting, higher model complexity and lower accuracy. To overcome these problems, feature selection strategies are carried out.
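As an illustration, this inter-quartile scaling corresponds to scikit-learn's RobustScaler; the sketch below uses random placeholder data with the same shape as the feature matrix and is not the exact preprocessing code of this study.

```python
import numpy as np
from sklearn.preprocessing import RobustScaler

# Placeholder data: 1076 records (347 NC + 558 MCI + 171 AD) x 341 features.
rng = np.random.default_rng(0)
X = rng.normal(size=(1076, 341))

# RobustScaler centers each feature on its median and divides by its
# inter-quartile range, limiting the influence of outlying measurements.
X_scaled = RobustScaler().fit_transform(X)
```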

Univariate methods such as the t-test and Fisher's criterion were used for feature selection in [21, 22]. Although the features selected in this way may be the best individually, they are not necessarily the best as a whole. Multivariate feature selection methods do not rank the features individually; instead, they rank sets of features.

RFE is a backward elimination technique that starts with the full feature set and removes the most irrelevant features one after another; it is used here to select the most prominent and discriminative features from the entire feature set. GA is an evolutionary computing algorithm suitable for searching for an efficient subset of features that is near-optimal within the high-dimensional feature set. In this work, the RFE and GA methods are combined with logistic regression and linear support vector machine classifiers in a wrapper technique to select highly relevant features from the large feature set. The classifiers are then trained to verify the usefulness of the different sets of selected features.

3.1 RFE feature selection

Recursive feature elimination is a simple heuristic approach for selecting the features most relevant to predicting the target. RFE searches for a feature subset by starting with all the features and removing them iteratively until the desired set size is reached. This is achieved by fitting an LR or SVM model with all the features; the features are then ranked by importance score, the least important ones are discarded, and the model is re-fitted. This process is repeated until the specified number of features remains.
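A minimal sketch of this wrapper, assuming scikit-learn's RFE with LR and linear SVM estimators; the synthetic data and estimator settings are illustrative stand-ins, not the exact configuration of this study.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC

# Synthetic stand-in: 518 records (NC/AD task) x 341 features.
X, y = make_classification(n_samples=518, n_features=341,
                           n_informative=40, random_state=0)

# Drop the lowest-ranked feature at each iteration until 170 remain.
rfe_lr = RFE(estimator=LogisticRegression(max_iter=1000),
             n_features_to_select=170, step=1).fit(X, y)

# The same wrapper with a linear SVM estimator.
rfe_svm = RFE(estimator=LinearSVC(max_iter=5000),
              n_features_to_select=170, step=1).fit(X, y)

# Rank 1 marks the retained features in each wrapper.
overlap = (rfe_lr.ranking_ == 1) & (rfe_svm.ranking_ == 1)
print(overlap.sum(), "features selected by both wrappers")
```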

3.2 GA feature selection

The genetic algorithm is a simple meta-heuristic that imitates biological evolution and is used here for feature selection. The volumetric measures are encoded as genomes using binary strings. The steps involved in the GA feature selection process are listed below, followed by a minimal code sketch.

  • Step 1. A random population of 50 chromosomes, the candidate solutions, is generated.

  • Step 2. The fitness function is evaluated for each chromosome in the population. Here, the fitness function is the accuracy of the classifier (LR or SVM), calculated using fivefold cross-validation.

  • Step 3. Tournament selection with a tournament size of 3 is performed to select the parent chromosomes based on their fitness values.

  • Step 4. A uniform crossover operator with a crossover probability of 0.5 is applied to produce offspring.

  • Step 5. With a mutation probability of 0.1, the new offspring are mutated by flipping binary bits. The offspring are added to the population, and this newly generated population is used in subsequent iterations.

  • Step 6. If the chromosomes generated remain the same for the last 10 generations, the process stops and the best solution (feature set) from the current population is returned. Otherwise, go to Step 2.
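The sketch below is a minimal hand-rolled version of this procedure, not the authors' exact implementation: it interprets the crossover and mutation probabilities as per-gene rates and reads the stopping rule of Step 6 as the best chromosome being unchanged for 10 generations.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def fitness(mask, X, y):
    # Step 2: fivefold CV accuracy of LR on the selected features.
    if mask.sum() == 0:
        return 0.0
    return cross_val_score(LogisticRegression(max_iter=1000),
                           X[:, mask], y, cv=5).mean()

def ga_select(X, y, pop_size=50, n_gen=40, cx_prob=0.5, mut_prob=0.1,
              tour_size=3, patience=10, seed=0):
    rng = np.random.default_rng(seed)
    n_feat = X.shape[1]
    pop = rng.random((pop_size, n_feat)) < 0.5        # Step 1: random bit strings
    best, best_fit, stall = None, -1.0, 0
    for _ in range(n_gen):
        fits = np.array([fitness(ind, X, y) for ind in pop])
        if fits.max() > best_fit:
            best, best_fit, stall = pop[fits.argmax()].copy(), fits.max(), 0
        else:
            stall += 1
        if stall >= patience:                          # Step 6: early stopping
            break
        children = []
        while len(children) < pop_size:
            parents = []
            for _ in range(2):                         # Step 3: tournament of 3
                cand = rng.choice(pop_size, tour_size, replace=False)
                parents.append(pop[cand[np.argmax(fits[cand])]])
            swap = rng.random(n_feat) < cx_prob        # Step 4: uniform crossover
            child = np.where(swap, parents[1], parents[0])
            child = child ^ (rng.random(n_feat) < mut_prob)  # Step 5: bit flips
            children.append(child)
        pop = np.array(children)
    return best, best_fit   # boolean feature mask and its CV accuracy
```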

4 Proposed workflow

The preprocessed and normalized ADNI longitudinal sMRI features are given as input to the four wrapper-based feature selection processes. The resulting feature subsets are then used to train the classifiers. Three binary classifications are performed for AD diagnosis: NC/AD, NC/MCI and MCI/AD. For this purpose, logistic regression, SVM, random forest (RF) and extreme gradient boosting (XGB) classifiers are used. Based on these classifications, the best feature set is selected. The neuropsychological measure, the MMSE score, is then added to this feature set, and classification is performed once more on the combined measures. The workflow of the classification process is shown in Fig. 1.

Fig. 1 Workflow of the classification process

4.1 Classifiers

The following classifiers are considered for the diagnosis of Alzheimer’s disease stage using sMRI features.

4.1.1 LR classifier

Consider a training set of n points {(xi, yi) | xi ∈ Rm, yi ∈ {0,1}}. Logistic regression generalizes linear regression by applying the sigmoid function to the linear regression function, as given in Eq. (1):

$$\begin{gathered} y = g\left( {\theta^{{\text{T}}} x} \right) \hfill \\ {\text{where}}\;g\left( z \right) = \frac{1}{{1 + e^{ - z} }} \hfill \\ \end{gathered}$$
(1)

This enables the inputs to be classified into binary-valued labels, yi ∈ {0,1}.
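As a minimal numerical illustration of Eq. (1), the sigmoid of the linear predictor gives a probability that is thresholded into a binary label; the weights below are arbitrary placeholders.

```python
import numpy as np

def sigmoid(z):
    # g(z) = 1 / (1 + e^{-z}) from Eq. (1)
    return 1.0 / (1.0 + np.exp(-z))

theta = np.array([0.5, -1.2, 2.0])   # illustrative weights
x = np.array([1.0, 0.3, 0.7])        # one feature vector
p = sigmoid(theta @ x)               # P(y = 1 | x)
label = int(p >= 0.5)                # threshold at 0.5 for the binary label
```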

4.1.2 SVM classifier

Consider a training set of n points {(xi, yi) | xi ∈ Rm, yi ∈ {±1}}, where i = 1 to n, xi denotes the feature vectors and yi denotes the class labels. SVM maximizes the margin around the separating hyperplane, which is achieved by solving the optimization problem in Eq. (2):

$$\begin{gathered} \mathop {\min }\limits_{w,\xi ,b} \left\{ {\frac{1}{2}||w||^{2} + C \mathop \sum \limits_{i = 1}^{n} \xi_{i} } \right\} \hfill \\ {\text{s}}.{\text{t}}.\;y_{i} \left( {w \cdot x_{i} - b} \right) \ge 1 - \xi_{i} ,\quad \xi_{i} \ge 0,\quad i = 1, \ldots ,n \hfill \\ \end{gathered}$$
(2)

When the radial basis function (RBF) kernel is used, the dot product is replaced by the Gaussian kernel function given in Eq. (3):

$$K\left( {x_{i} ,x_{j} } \right) = \exp \left( { - \gamma|| x_{i} - x_{j}||^{2} } \right),\quad \gamma > 0,$$
(3)
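The kernel of Eq. (3) is straightforward to evaluate; the sketch below, with an illustrative value of γ, returns a similarity in (0, 1] that equals 1 only when the two points coincide.

```python
import numpy as np

def rbf_kernel(xi, xj, gamma=0.1):
    # K(xi, xj) = exp(-gamma * ||xi - xj||^2) from Eq. (3)
    return np.exp(-gamma * np.sum((xi - xj) ** 2))

xi = np.array([1.0, 2.0])
xj = np.array([1.5, 1.0])
print(rbf_kernel(xi, xj))   # closer points give values nearer to 1
```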

4.1.3 RF classifier

Random forest is an ensemble method that builds many decision trees and combines their outputs to produce better results. Consider a training set of n points {(xi, yi) | xi ∈ Rm, yi ∈ {0,1}}.

for b = 1 to B:

  • 1. n samples are drawn with replacement from the training set.

  • 2. A classification tree fb is constructed on these n samples.

After training, predictions for an unseen sample \(\hat{x}\) are made by averaging the predictions of all the individual trees, as per Eq. (4):

$$\hat{f} = \frac{1}{B} \mathop \sum \limits_{b = 1}^{B} f_{b} \left( {\hat{x}} \right)$$
(4)

When data reach a node of a decision tree, the splitting of the data is based on an impurity measure such as the Gini index or entropy.
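A sketch of this ensemble, assuming scikit-learn's RandomForestClassifier; the sample and feature counts mirror the MCI/AD setting of this study, but the data are synthetic placeholders.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Placeholder data: 729 records (558 MCI + 171 AD) x 84 selected features.
X, y = make_classification(n_samples=729, n_features=84, random_state=0)

# B = 100 bootstrap-sampled trees; node splits use the Gini impurity here.
rf = RandomForestClassifier(n_estimators=100, criterion="gini",
                            random_state=0).fit(X, y)

# Class probabilities are averaged over the trees, as in Eq. (4).
print(rf.predict_proba(X[:5]))
```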

4.1.4 XGBoost classifier

XGBoost (XGB), the extreme gradient boosting classifier, is an optimized gradient boosting algorithm. Trees are grown one by one, and each new tree attempts to reduce the errors of the ensemble built in the previous iterations by correcting its misclassified points. Consider a dataset {(xi, yi) | xi ∈ Rm, yi ∈ {0,1}}. The objective function of XGBoost at iteration t is given in Eq. (5):

$$\begin{gathered} {\mathcal{L}}^{\left( t \right)} = \mathop \sum \limits_{i = 1}^{n} l\left( {y_{i} ,\widehat{{y_{i} }}^{{\left( {t - 1} \right)}} + f_{t} \left( {x_{i} } \right)} \right) + \Omega \left( {f_{t} } \right) \hfill \\ {\text{where}}\;\Omega \left( f \right) = \gamma T + \frac{1}{2} \lambda ||w||^{2} \hfill \\ \end{gathered}$$
(5)

\(f_{t}\) represents a tree with leaf weights \(w\), and \(T\) is the number of leaves in a tree. \(l\) denotes a differentiable convex loss function that measures the difference between the prediction \(\widehat{{y_{i} }}\) and the target \(y_{i}\). The second term \(\Omega\) penalizes the complexity of the model.
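A sketch using the xgboost library's scikit-learn interface; reg_lambda and gamma correspond to the λ and γ penalties in Eq. (5), and the remaining settings are illustrative rather than the values used in this study.

```python
from sklearn.datasets import make_classification
from xgboost import XGBClassifier

# Placeholder data of the same shape as an MCI/AD feature subset.
X, y = make_classification(n_samples=729, n_features=84, random_state=0)

# reg_lambda is the L2 leaf-weight penalty (lambda) and gamma the per-leaf
# penalty (gamma * T) in the regularizer Omega of Eq. (5).
xgb = XGBClassifier(n_estimators=100, learning_rate=0.1,
                    reg_lambda=1.0, gamma=0.0, eval_metric="logloss")
xgb.fit(X, y)
print(xgb.predict(X[:5]))
```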

4.2 Performance metrics

For analyzing the performance of the models, accuracy, sensitivity and specificity are used. Accuracy is calculated by the formula given in Eq. (6):

$${\text{Accuracy}} = \frac{{{\text{TP}} + {\text{TN}}}}{{{\text{TP}} + {\text{FP}} + {\text{FN}} + {\text{TN}}}}$$
(6)

Sensitivity is calculated by the formula given in Eq. (7):

$${\text{Sensitivity}} = \frac{{{\text{TP}}}}{{{\text{TP}} + {\text{FN}}}}$$
(7)

Specificity is calculated by the formula given in Eq. (8):

$${\text{Specificity}} = \frac{{{\text{TN}}}}{{{\text{TN}} + {\text{FP}}}}$$
(8)

where TP, TN, FP and FN stand for true positive, true negative, false positive and false negative, respectively. The reported performance measures are the means of these metrics computed over the cross-validation runs.
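The three metrics follow directly from the binary confusion matrix; a minimal example with toy labels:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_pred = np.array([1, 0, 0, 1, 0, 1, 1, 0])

# ravel() on a 2x2 confusion matrix yields tn, fp, fn, tp in that order.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
accuracy    = (tp + tn) / (tp + fp + fn + tn)   # Eq. (6)
sensitivity = tp / (tp + fn)                    # Eq. (7)
specificity = tn / (tn + fp)                    # Eq. (8)
```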

5 Results and discussion

In RFE, the volumetric features with the lowest importance are discarded at each level, and the remaining metrics are used to retrain the classifier. The feature rankings for all three binary classifications using the RFE method are shown in Fig. 2.

Fig. 2 Feature ranking using RFE: a NC vs AD, b NC vs MCI, c MCI vs AD

One hundred and seventy features are chosen from the 341 features for the LR and linear SVM wrapper methods. Between the two wrapper methods, 140, 132 and 118 features are identical in the NC/AD, NC/MCI and MCI/AD classifications, respectively.

In GA, the genomes are binary strings encoding the set of volumetric measures. A bit value of zero indicates that the corresponding feature is not selected; a value of one indicates that it is. The initial population size is 50, and the tournament size is 3. At most forty generations are performed with fivefold cross-validation, where four folds are used for training and one fold for testing in every iteration. Between 82 and 104 features are chosen from the 341 features for the LR and SVM wrapper methods, as shown in Fig. 3. When the two wrapper methods are compared, 19, 44 and 25 features are identical in the NC/AD, NC/MCI and MCI/AD classifications, respectively.

Fig. 3 Features selected using GA: a NC vs AD, b NC vs MCI, c MCI vs AD

The features with a ranking score of 1 from RFE and GA are selected for further classification; the remaining features are discarded. After the four feature subsets are identified, the hyper-parameters C and γ of the SVM are tuned using grid search, implemented with the LIBSVM library [23]. Fivefold cross-validation is performed for the grid search. Similarly, models are built using the LR, RF and XGB classifiers.
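A sketch of this grid search, assuming scikit-learn's SVC (which wraps LIBSVM) in place of a direct LIBSVM call; the grid values and data are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Placeholder data: 518 records (NC/AD task) x 82 GA-selected features.
X, y = make_classification(n_samples=518, n_features=82, random_state=0)

# Fivefold CV grid search over C and gamma for the RBF kernel.
param_grid = {"C": [0.1, 1, 10, 100], "gamma": [1e-3, 1e-2, 1e-1, 1]}
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5,
                      scoring="accuracy").fit(X, y)
print(search.best_params_, search.best_score_)
```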

All the models are developed using tenfold stratified cross-validation repeated fifteen times. The records are randomly partitioned into ten subsets with roughly the same proportions of the different class labels, so each class is almost equally represented across the training and test folds. Stratified cross-validation is performed because of the imbalance in the data. For each of the ten folds, the model is trained on nine folds and tested on the remaining fold. This process is repeated fifteen times with different randomizations of the samples. The results of the binary classification experiments are presented in Tables 2, 3 and 4.
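This protocol corresponds to scikit-learn's RepeatedStratifiedKFold; the sketch below uses synthetic data with an MCI/AD-like class imbalance.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score

# Placeholder data with a roughly 558:171 class imbalance.
X, y = make_classification(n_samples=729, n_features=84,
                           weights=[0.77, 0.23], random_state=0)

# Tenfold stratified CV repeated fifteen times = 150 train/test runs.
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=15, random_state=0)
scores = cross_val_score(RandomForestClassifier(random_state=0), X, y, cv=cv)
print(scores.shape, scores.mean())   # (150,) scores; mean over all runs
```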

Table 2 NC/AD classification using different feature selection techniques
Table 3 NC/MCI classification using different feature selection techniques
Table 4 MCI/AD classification using different feature selection techniques

Tables 2, 3 and 4 show that the features selected by the wrapper-based GA-LR algorithm differentiate the disease stages better than those from the other feature selection algorithms. From the 341 structural MRI features, only 82, 89 and 84 features are selected for the NC/AD, NC/MCI and MCI/AD classifications, respectively. This yields better classifiers with fewer features and therefore lower model complexity. With the GA-LR feature selection technique, SVM with the RBF kernel gives better accuracy than the other classifiers. LR provides the lowest accuracy in all the classification models, while RF and XGB produce approximately equal accuracies.

The performance metrics of the models developed with the MMSE score alongside the best feature subsets obtained for the NC/AD, NC/MCI and MCI/AD classifications are shown in Fig. 4. Receiver operating characteristic (ROC) curves for the three binary classifiers are shown in Fig. 5.

Fig. 4 Classification upon combining the MMSE feature with the MRI features

Fig. 5 ROC curves for disease classification: a NC vs AD, b NC vs MCI, c MCI vs AD

For each classifier, the ROC curve represents the mean ROC over the 150 cross-validation runs, and the area under the curve (AUC) is the corresponding mean AUC. SVM with the RBF kernel achieves the highest AUC: 0.99, 0.95 and 0.94 for NC vs AD, NC vs MCI and MCI vs AD classification, respectively.

Upon adding the MMSE score to the feature subset, there is little difference in model performance for the NC/AD and NC/MCI classifications. For the MCI/AD classification, however, the accuracy of the RF classifier increases by approximately 2.7%, its sensitivity by 8.7% and its specificity by 1.4% when the MMSE score is combined with the sMRI feature subset. Adding the MMSE score does not improve the performance of SVM with the RBF kernel. The XGB classifier performs better than SVM but underperforms RF. Thus, the RF classifier is best suited to classifying the MCI and AD subjects with improved accuracy.

These results show that the MMSE score does not play a major role in the NC/AD or NC/MCI classification, although it does influence the RF and XGB classifiers in the MCI/AD classification. They indicate that the features derived from sMRI images play a more important role in the classification than the neuropsychological MMSE score.

The proposed model outperforms the hippocampus-based method [17] on the reported performance measures. It also achieves better results than the other whole-brain methods [6, 15, 16], mainly because of its smaller, more effective feature set, as shown in Table 5. The proposed model performs much better in MCI/AD and NC/MCI classification with a single imaging modality, and it attains consistent performance in Alzheimer's disease classification with a minimal feature set.

Table 5 Performance comparison of binary classifications

6 Conclusion

The influence of different feature selection algorithms on the CV, SV, SA, TA, TS and HS features of longitudinal structural MRI images has been investigated. GA-LR feature selection performs better than the other algorithms. Appending the MMSE score to the reduced structural MRI feature set improves the accuracy and specificity of the classifier that distinguishes MCI from AD: with the features selected by GA-LR combined with the MMSE score, the RF classifier gains 2.7% in accuracy, 8.7% in sensitivity and 1.4% in specificity in MCI/AD classification. SVM with the RBF kernel produces the best results, with 96.82% and 89.39% accuracy for the binary classification of NC/AD and NC/MCI, respectively. The proposed models are developed with fewer features than existing works while exhibiting high accuracy. As future work, other imaging measures will be explored to improve the classification accuracy.