Introduction

Alzheimer's Disease (AD) is a neurodegenerative disorder that critically affects memory, reasoning, and behavior. It is the most common type of dementia, accounting for 60%–80% of cases (Alzheimer’s Association 2017). Worldwide, an estimated 47.5 million people live with dementia, and this prevalence is expected to double in 20 years (World Health Organization 2017). There is no cure for AD, but suitable treatment can relieve the symptoms and slow its aggravation. Early and accurate diagnosis is crucial to improve the patient's quality of life, but it is a complex task that requires cognitive and objective tests, patient records, and clinical and laboratory exams.

Recent works have shown that biomarkers obtained by image processing and machine learning techniques can aid the diagnosis of AD (Chincarini et al. 2011; Garali et al. 2016; Liu et al. 2014). Machines might even provide a more accurate diagnosis than clinicians, because they are free from fatigue and can deal with neurodegenerative patterns that are difficult to visualize (Casanova et al. 2011; Klöppel et al. 2008). The effects of aging also cause brain changes, making effective pattern identification more difficult (Ambastha 2015).

Many works have reported high accuracies using different image databases, pattern classifiers, and validation protocols, which makes a comparative analysis of their image descriptors impossible. In this work, we present an extensive study of image descriptors for the diagnosis of AD and introduce a new one, named Residual Center of Mass (RCM). RCM explores image moments and other operations to enhance brain regions and select the most relevant features for the diagnosis of AD. For validation, a Support Vector Machine (SVM) is trained with the selected features to classify images from Normal Control (NC) subjects and patients with AD. We show that RCM with SVM achieves the best accuracies on a considerable number of exams: 507 Fluorodeoxyglucose-Positron Emission Tomography (FDG-PET) scans and 1,374 Magnetic Resonance Imaging (MRI) scans, as provided by the Alzheimer’s Disease Neuroimaging Initiative (ADNI) (Jack et al. 2008).

The related works are described in Section “Related Works”, making clear the motivation and impact of this paper. Section “Materials” presents the databases and the image preprocessing techniques used in the experiments. The RCM descriptor is introduced in Section “Methods” and its comparative study is presented in Section “Results and Discussion”. Finally, Section “Conclusion” states the conclusions and provides directions for future research.

Related Works

Neuroimaging methods for Computer-Aided Diagnosis (CAD) are composed of different image processing and classification techniques, as shown in Fig. 1. Each method has its own pipeline of processes, described in the following. The images are preprocessed to remove the skull and muscle tissues, both irrelevant for classification. Then, the images are spatially normalized (e.g., registered into a reference image space) to account for natural differences in brain size and shape. The voxel intensities are also normalized to correct the large variations caused by the use of different scanners and parameters. Finally, the images may be resized to a lower resolution, for the sake of efficiency, and smoothed by a low-pass filter to reduce the effects of misregistration.

Fig. 1 Processes of neuroimaging methods for computer-aided diagnosis

In the next step, some works extract image features and/or apply feature-space reduction techniques. Subsequently, the image descriptor results from feature selection techniques and/or regions/volumes of interest (ROIs/VOIs). ROIs are image regions that represent the spatial extension of brain structures, which are interactively delineated or automatically segmented based on object models, such as a probabilistic atlas (Carmichael et al. 2005). VOIs are volumetric blocks extracted from the image at given locations, as defined by prior knowledge in some reference image space. Finally, a classifier is trained from the image descriptor and evaluated by some validation method.

The performances of recent neuroimaging methods using MRI and FDG-PET scans are presented in Tables 1 and 2. The following metrics are reported:

$$ \begin{array}{l} \text{Accuracy (ACC)} = \frac{TP + TN}{TP+TN+FP+FN},\\ \text{Precision (PRE)} = \frac{TP}{TP+FP},\\ \text{Recall (REC)} = \frac{TP}{TP+FN},\\ \text{Specificity (SPE)} = \frac{TN}{FP+TN}, \end{array} $$
(1)

where TP (true positives) are the AD patients correctly classified as AD, TN (true negatives) are the normal control (NC) individuals correctly classified as NC, FP (false positives) are the NC individuals incorrectly identified as AD, and FN (false negatives) are the AD patients incorrectly identified as NC. The positive class represents the data from AD patients and the negative class, the data from healthy individuals.
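For illustration, the metrics in (1) translate directly into code; the snippet below is a minimal Python sketch (the function name and the example counts are ours, not from the original experiments):

```python
def classification_metrics(tp, tn, fp, fn):
    """Compute the metrics of Eq. (1) from confusion-matrix counts."""
    acc = (tp + tn) / (tp + tn + fp + fn)   # Accuracy
    pre = tp / (tp + fp)                    # Precision
    rec = tp / (tp + fn)                    # Recall (sensitivity)
    spe = tn / (fp + tn)                    # Specificity
    return acc, pre, rec, spe

# Example: 90 AD scans correctly detected, 5 missed; 85 NC correct, 10 false alarms.
print(classification_metrics(tp=90, tn=85, fp=10, fn=5))
```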

Table 1 Classification performances of methods using MRI scans
Table 2 Classification performances of methods using FDG-PET scans

The simplest approach for the detection of AD consists of classifying the images directly based on their voxel values. Some works classify voxels from segmented tissues (ROIs) (Casanova et al. 2011; Klöppel et al. 2008; Rao et al. 2011) by logistic regression (LR) or SVM. Others train deep learning architectures with the whole brain image volume (Ambastha 2015; Gupta et al. 2013; Liu et al. 2015; Payan and Montana 2015). Klöppel et al. (2008) achieved an accuracy rate of 96.4% when classifying images of the gray matter tissue by SVM, but they used images from only 68 subjects. Casanova et al. (2011) tested the performance of a penalized logistic regression (PLR) to classify 98 subjects, resulting in an accuracy rate of 85.7%. In Rao et al. (2011), a sparse logistic regression (SLR) was trained by incorporating a sparsity penalty into the log-likelihood function, such that 85.3% of the images from 129 subjects were correctly classified.

Recent works (Ambastha 2015; Gupta et al. 2013; Liu et al. 2015; Payan and Montana 2015) adopt deep learning techniques for feature extraction and classification using images from the ADNI databases (Jack et al. 2008). Ambastha (2015) proposed a 3D convolutional neural network (ConvNet), reporting an accuracy rate of 81.8% in the classification of 100 individuals. In Payan and Montana (2015), the authors trained a 3D ConvNet with Sparse Autoencoders (SAE) and reported a high accuracy rate of 95.4% on images of 432 individuals. However, according to Ambastha (2015), the experiments in Payan and Montana (2015) indicate bias in the data. Liu et al. (2015) evaluated a multimodal approach, training Stacked Autoencoders (Stacked AE) with FDG-PET and MRI scans from 162 subjects. They reported a high accuracy rate of 91.4%. Gupta et al. (2013) trained a 2D ConvNet with SAE to extract features from MRI slices, achieving an accuracy rate of 94.7% on images of 432 individuals.

The most promising approaches apply feature selection methods to discover biomarkers and remove non-informative features. Liu et al. (2014) selected the most relevant features using a tree construction method based on hierarchical clustering, taking into account spatial adjacency and feature similarity and discriminability. They achieved an accuracy rate of 90.2% when classifying 198 AD patients and 229 NC. In Garali et al. (2016), the brain was segmented into 116 regions to create a ranking of the most discriminative regions. Features were selected from 29 ROIs using the SelectKBest method (Kramer 2016), achieving an accuracy rate of 94.4% in the classification of 142 subjects. In Chincarini et al. (2011), different features were extracted from 9 VOIs to classify 144 AD and 189 NC. The features with the highest relative importance values, given by a Random Forest classifier (Breiman 2001), were selected to train an SVM classifier. This approach was able to discriminate NC from AD individuals with a sensitivity of 89% and a specificity of 94%.

Other approaches use Principal Component Analysis (PCA) and Independent Component Analysis (ICA) to reduce the feature space. Khedher et al. (2015) applied PCA to extract features from segmented MRI scans, achieving an accuracy rate of 87.7% on images from 417 individuals. Approaches with ICA achieved accuracies of 88.9% (Yang et al. 2011) and 86.8% (Wenlu et al. 2011) in the classification of 438 and 160 subjects, respectively. In Illán et al. (2011), the experiments were performed by applying ICA and PCA to images from 192 subjects. The best accuracy was 88.24%, as obtained by PCA with SVM.

Therefore, given the differences in databases, classifiers, and validation protocols, it is impossible to compare indirectly the image descriptors or methods (descriptor and classifier) proposed in the aforementioned works. We have thus selected some of them for a fair comparative analysis based on 10-fold cross-validation and a considerable number of FDG-PET and MRI scans.

Materials

ADNI Database

The data used in the preparation of this article were obtained from the ADNI databases (adni.loni.usc.edu). The ADNI was launched in 2003 by the National Institute on Aging (NIA), the National Institute of Biomedical Imaging and Bioengineering (NIBIB), the Food and Drug Administration (FDA), private pharmaceutical companies and non-profit organizations, as a $60 million, 5-year public-private partnership. The primary goal of ADNI has been to test whether serial MRI, FDG-PET, other biological markers, and clinical and neuropsychological assessment can be combined to measure the progression of mild cognitive impairment (MCI) and early AD. Determination of sensitive and specific markers of very early AD progression is intended to aid researchers and clinicians to develop new treatments and monitor their effectiveness, as well as reduce the time and cost of clinical trials.

The Principal Investigator of this initiative is Michael W. Weiner, MD, VA Medical Center, the University of California - San Francisco. ADNI is the result of efforts of many co-investigators from a broad range of academic institutions and private corporations, and subjects have been recruited from over 50 sites across the U.S. and Canada. The initial goal of ADNI was to recruit 800 subjects, but ADNI has been followed by ADNI-GO and ADNI-2. To date, these three protocols have recruited over 1,500 adults, aged 55 to 90, consisting of cognitively normal older individuals, people with early or late MCI, and people with early AD. The follow-up duration of each group is specified in the protocols for ADNI-1, ADNI-2, and ADNI-GO. Subjects originally recruited for ADNI-1 and ADNI-GO had the option to be followed in ADNI-2. For up-to-date information, see www.adni-info.org.

Our experiments were performed using scans acquired over a 2- or 3-year period in two imaging modalities: FDG-PET and T1-weighted MRI. The group and demographic information about the data is summarized in Table 3. The data include 507 FDG-PET scans from 172 individuals and 1,374 MRI scans from 403 individuals.

Table 3 Demographic information of each dataset

Preprocessing

Before extracting the features, the images are preprocessed in four steps: skull-stripping, spatial normalization, min-max normalization, and image downsampling.

Skull-stripping:

Firstly, the skull is stripped off using the Brain Extraction Tool version 2 (BET2) from the FMRIB Software Library 4.1 (FSL 4.1) (Jenkinson et al. 2005). This step removes the skull and the muscle tissue of the head. The skull-stripping algorithm creates a three-dimensional mesh with a spherical shape positioned at the center of gravity of the head. The mesh is iteratively expanded, adjusting its vertices to the borders detected between the brain tissues and the skull. Then, the brain is separated from the skull by applying the mask defined by this mesh. In the FDG-PET images, the skull region is not well-defined; thus, skull-stripping was not performed and the images were only smoothed with a 12 mm FWHM (full width at half maximum) Gaussian filter (Landini et al. 2005).
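As an illustration, the sketch below scripts both operations; it assumes FSL's bet executable is on the PATH and that the images are in NIfTI format, and the file names are merely illustrative:

```python
import subprocess
import numpy as np
import nibabel as nib
from scipy.ndimage import gaussian_filter

# MRI: skull-stripping with BET2 (requires FSL; file names are illustrative).
subprocess.run(["bet", "mri_input.nii.gz", "mri_brain.nii.gz"], check=True)

# FDG-PET: no skull-stripping, only a 12 mm FWHM Gaussian smoothing.
img = nib.load("pet_input.nii.gz")
fwhm_mm = 12.0
sigma_mm = fwhm_mm / (2.0 * np.sqrt(2.0 * np.log(2.0)))      # FWHM -> sigma
sigma_vox = sigma_mm / np.array(img.header.get_zooms()[:3])  # per axis, in voxels
smoothed = gaussian_filter(img.get_fdata(), sigma=sigma_vox)
nib.save(nib.Nifti1Image(smoothed, img.affine), "pet_smoothed.nii.gz")
```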

Spatial normalization:

The images are spatially normalized onto the MNI (Montreal Neurological Institute) reference space (Fonov et al. 2011). This step reduces the brain anatomy variability among individuals, warping the images into a standard coordinate space. The MRI images are normalized using the symmetric diffeomorphic registration (SyN) (Avants et al. 2008). In an evaluation of 14 brain registration methods, SyN was the algorithm that presented the best results according to overlap and distance measures (Klein et al. 2009). The FDG-PET scans are registered using the Statistical Parametric Mapping 5 (SPM 5) toolbox (Eickhoff et al. 2005) configured with its default parameters.
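Since the Information Sharing Statement lists the Advanced Normalization Tools (ANTs), which implement SyN, a minimal sketch with the ANTsPy package is given below; the use of ANTsPy and the file names are assumptions on our part:

```python
import ants  # ANTsPy (pip package "antspyx")

fixed = ants.image_read("mni_template.nii.gz")   # MNI reference, illustrative path
moving = ants.image_read("mri_brain.nii.gz")     # skull-stripped MRI

# Symmetric diffeomorphic registration (SyN) onto the MNI space.
reg = ants.registration(fixed=fixed, moving=moving, type_of_transform="SyN")
ants.image_write(reg["warpedmovout"], "mri_mni.nii.gz")
```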

Min-max normalization:

The voxel values are mapped to the range [0,1], calculated by:

$$ I_{norm}(\mathbf{p}) = \frac{ I(\mathbf{p})-I_{min} }{ I_{max}-I_{min} }. $$
(2)

The minimum value Imin over all voxels is subtracted from each voxel value I(p) at position p = (p1,p2,p3), and the result is divided by the difference between the maximum Imax and the minimum Imin of the original values.
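In code, Eq. (2) is a one-liner over the whole volume; the epsilon guard against constant images is our addition:

```python
import numpy as np

def min_max_normalize(volume):
    """Map voxel values to [0, 1] as in Eq. (2); the epsilon guard is ours."""
    vmin, vmax = volume.min(), volume.max()
    return (volume - vmin) / max(vmax - vmin, 1e-12)
```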

Image downsampling:

After registration, each image is downsampled to the dimension 74 × 92 × 78 with a voxel size of 2 × 2 × 2 mm3, which results in 531,024 voxels. This process helps to eliminate noise and compensate for imprecisions in the registration. It also reduces the computational time and memory requirements without losing discriminant information (Segovia et al. 2010, 2012).
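One plausible way to reproduce this step is trilinear resampling to the stated grid; the sketch below uses scipy.ndimage.zoom and is our reading, not necessarily the authors' exact tool:

```python
from scipy.ndimage import zoom

def downsample(volume, target_shape=(74, 92, 78)):
    """Resample a volume to the target grid by trilinear interpolation."""
    factors = [t / s for t, s in zip(target_shape, volume.shape)]
    return zoom(volume, zoom=factors, order=1)
```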

Methods

Image Description

When analyzing biomarkers from other studies (Chincarini et al. 2011; Liu et al. 2014), we can observe that the brain boundaries concentrate the most discriminative voxels for the detection of AD. Thus, in this work we explore image operations that highlight these areas. We extract three types of features for evaluation: top-hat (Heijmans and Roerdink 1998), Mexican-hat (Russ 2016), and RCM. The first two have already been used in neuroimaging applications (Sensi et al. 2014; Somasundaram and Genish 2014). The last one, RCM, is proposed in this work.

In computer vision, image moments have been extensively used for pattern recognition (Chaumette 2004). In this work, we use image moments to extract a feature designed to highlight the brain boundaries. The process to extract the RCM features has three steps as presented in Fig. 2.

Fig. 2 Flowchart to compute the RCM features. The images on the right are the inputs and outputs of each feature extraction step

In the first step, we compute the center of mass locally in regions of the image using three-dimensional moments. For a continuous function f(p1,p2,p3), we can define the moment of order (q + r + s) as:

$$ M_{qrs} = \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} p_{1}^{q}\, p_{2}^{r}\, p_{3}^{s}\, f(p_{1}, p_{2}, p_{3})\, dp_{1}\, dp_{2}\, dp_{3} $$
(3)

where q, r, s = 0, 1, 2, .... Adapting (3) to volumetric images with voxel intensities I(p1,p2,p3), image moments Mijk are calculated by:

$$ M_{ijk} = \sum\limits_{p_{1},p_{2},p_{3}} {p_{1}^{i}} {p_{2}^{j}} {p_{3}^{k}} I(p_{1}, p_{2}, p_{3}) $$
(4)

where \(i, j, k \in \mathbb {N}\) and (i + j + k) is the order of the moment.

A set of moments up to order n consists of all moments Mijk such that 0 ≤ i + j + k ≤ n. Moments of low orders provide geometric properties of the image. The zero-order moment M000 defines the total mass of the image I. The ratios between the first-order moments and the zero-order moment (M100/M000, M010/M000, M001/M000) define the center of mass, or centroid, of the image.
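As a quick numerical check of these definitions, the centroid obtained from the moment ratios matches scipy's center_of_mass (the array here is random, for illustration only):

```python
import numpy as np
from scipy import ndimage

rng = np.random.default_rng(0)
img = rng.random((8, 8, 8))              # toy volume, for illustration only

grids = np.indices(img.shape)            # coordinate grids p1, p2, p3
m000 = img.sum()                         # zero-order moment (total mass)
centroid = [(g * img).sum() / m000 for g in grids]  # M100/M000, M010/M000, M001/M000

assert np.allclose(centroid, ndimage.center_of_mass(img))
```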

To extract the RCM features, an image IC is created from the centroids computed locally in regions of the input image. We define a region R centered at each voxel p of the input image and compute the centroid Cp of R as follows:

$$ C_{\mathbf{p}} = \frac{ \sum\nolimits_{\mathbf{q} \in R} \mathbf{q}\, I(\mathbf{q}) }{ \sum\nolimits_{\mathbf{q} \in R} I(\mathbf{q}) }, $$
(5)

where I(q) is the voxel value at position q. The region R centered at p = (p1,p2,p3) is defined by all valid voxels between (p1 − s/2, p2 − s/2, p3 − s/2) and (p1 + s/2, p2 + s/2, p3 + s/2), where s × s × s is the region’s size.

In the second step, the center of mass image IC is smoothed by a mean filter, creating a smoothed image \(I_{C}^{\prime }\). Lastly, we subtract \(I_{C}^{\prime }\) from the input image of the brain to obtain the RCM features. Figure 2 presents the resulting images from the RCM procedure. The first image is the input image used to extract the features. The second is the center of mass image IC. The third image is \(I_{C}^{\prime }\), and the voxel intensities of the last image represent the RCM features.
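A compact Python sketch of our reading of the three steps follows. Note two assumptions on our part: the voxel values of IC are taken by sampling the input image at each local centroid, and uniform_filter is used both for the local sums of Eq. (5) and for the final mean filter:

```python
import numpy as np
from scipy.ndimage import uniform_filter

def rcm_features(img, s=5):
    """Sketch of the RCM extraction, under our reading of the three steps."""
    img = np.asarray(img, dtype=float)
    eps = 1e-12
    # Local zero-order moment: uniform_filter gives the mean over the
    # s x s x s region; the s^3 factor cancels in the centroid ratio.
    m0 = uniform_filter(img, size=s)
    grids = np.indices(img.shape).astype(float)   # coordinate grids p1, p2, p3
    # Local first-order moments divided by the zero-order moment (Eq. 5).
    centroids = [uniform_filter(g * img, size=s) / (m0 + eps) for g in grids]
    # Step 1 (assumption): I_C(p) = input value at the nearest local centroid.
    idx = [np.clip(np.round(c).astype(int), 0, d - 1)
           for c, d in zip(centroids, img.shape)]
    i_c = img[idx[0], idx[1], idx[2]]
    # Step 2: smooth I_C with a mean filter.
    i_c_smooth = uniform_filter(i_c, size=s)
    # Step 3: subtract the smoothed image from the input (the residual).
    return img - i_c_smooth
```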

Feature Selection

One of the main challenges in working with neuroimages is the curse of dimensionality. In neuroimaging studies, there are generally hundreds of images to analyze, each described by thousands of features. This can easily lead to classifiers that overfit the data. Some studies avoid this problem by introducing regularization parameters to train sparse models (Rao et al. 2011). Others use feature selection techniques to identify biomarkers (Breiman 2001; Chincarini et al. 2011; Garali et al. 2016; Kramer 2016; Liu et al. 2014).

A popular technique used in machine learning applications is the feature selection by ANOVA (Analysis Of Variance) (Chen and Lin 2006; Costafreda et al. 2009; Costafreda et al. 2011; Elssied et al. 2014; Garali et al. 2016; Golugula et al. 2011; Grünauer and Vincze 2015). It consists of selecting the most relevant features for classification by calculating scores with the ANOVA F-test. The features with the lowest F-values are filtered out and the features with the highest F-values are maintained as the final image descriptor to train and test the classifier. We can define the F-value as:

$$ F = \frac{ {S_{B}^{2}} }{ {S_{W}^{2}} }. $$
(6)

\({S_{B}^{2}}\) is the between-group variability (sets of samples per class), given by:

$$ {S_{B}^{2}} = \frac{ \sum\nolimits_{i} n_{i} (\bar{x_{i}} -\bar{x})^{2} }{ K-1 }, $$
(7)

where ni and \(\bar {x_{i}}\) denote the number of observations and the sample mean in the ith group, respectively. \(\bar {x}\) is the overall mean of the data and K denotes the number of groups (or classes).

\({S_{W}^{2}}\) is the within-group variability, defined as:

$$ {S_{W}^{2}} = \frac{ \sum\nolimits_{ij} (x_{ij} -\bar{x_{i}})^{2} }{ N-K }, $$
(8)

where xij is the jth observation in the ith group and N is the overall sample size.

Studies have shown that ANOVA is an efficient method to discover biomarkers (Costafreda et al. 2009, 2011; Garali et al. 2016). However, it is very sensitive to outliers (French et al. 2017; Halldestam 2016). Noisy data and errors in the spatial normalization can hamper the effectiveness of the scores. To avoid that, the F-values are computed in an iterative process by randomly sampling the data. The pseudo-code of the method is summarized as follows:

Algorithm: iterative computation of the ANOVA scores (pseudo-code)

First, a score vector (scores) is initialized with zeros; each voxel value (or feature) is associated with a score value. X represents the set of vectors composed of the image voxels of each sample. We select a random subset of samples from X and compute the F-values (fValues) for each feature in this subset. Then the score of each feature is updated with the maximum between its current score and its F-value. The random selection and the score update are repeated until a stopping criterion is reached: we may stop the execution when a maximum number of iterations is achieved or when the scores converge. The positions of the features relevant for classification are identified by selecting the regions associated with the N highest scores. Lastly, we select the N features at the corresponding positions, as sketched below.
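The following Python sketch reconstructs the procedure from the description above, using scikit-learn's f_classif for Eqs. (6)–(8); the subject-wise sampling follows the details given later in “Experiments and Results”:

```python
import numpy as np
from sklearn.feature_selection import f_classif

def iterative_anova_scores(X, y, subjects, n_iter=100, frac=0.95, seed=0):
    """Reconstruction of the iterative ANOVA scoring described above.

    X: (n_samples, n_features) voxel matrix; y: class labels;
    subjects: subject identifier of each sample.
    """
    X, y, subjects = np.asarray(X), np.asarray(y), np.asarray(subjects)
    rng = np.random.default_rng(seed)
    scores = np.zeros(X.shape[1])
    unique_subjects = np.unique(subjects)
    n_pick = int(frac * len(unique_subjects))
    for _ in range(n_iter):
        # Sample 95% of the subjects, then one image per sampled subject,
        # keeping the number of samples fixed across iterations.
        chosen = rng.choice(unique_subjects, size=n_pick, replace=False)
        idx = [rng.choice(np.where(subjects == s)[0]) for s in chosen]
        f_values, _ = f_classif(X[idx], y[idx])
        # Update each feature's score with the maximum seen so far.
        scores = np.maximum(scores, np.nan_to_num(f_values))
    return scores

# The final descriptor keeps the N features with the highest scores, e.g.:
# selected = np.argsort(scores)[-N:]
```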

Results and Discussion

Validation Method

The experiments adopted the 10-fold cross-validation method, presented in the flowchart of Fig. 3. First, the data are split into ten folds, assigning all images of each subject to a single fold to avoid biased results. Then, ten classification tests are performed. In each test, one fold is selected as the test set and the others are split into training and validation sets. After splitting the data, we have about 10% of the data for testing, 85% for training, and 5% for validation.

Fig. 3 Schematic representation of the validation method

From the training set, we compute the scores for feature selection. Then, the features are selected and the classifier is trained, adjusting its parameters and the number of selected features using the validation data. Finally, the classifier is tested on the test data and the performance metrics are reported.
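Such a subject-wise split corresponds, for instance, to scikit-learn's GroupKFold; a minimal sketch, where X, y, and subject_ids are assumed to be defined as commented:

```python
from sklearn.model_selection import GroupKFold

# X: one feature vector per scan; y: AD/NC labels;
# subject_ids: the subject each scan belongs to (all assumed precomputed).
gkf = GroupKFold(n_splits=10)
for train_index, test_index in gkf.split(X, y, groups=subject_ids):
    # All scans of a given subject fall on one side of the split,
    # avoiding the subject-overlap bias discussed in the text.
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
```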

Experiments and Results

In the experiments of this work, we first analyze the classification performances of the preprocessed images and three descriptors: RCM and the preprocessed input image filtered by Mexican-hat and by top-hat. Different patterns are generated with different filter scales; thus, four filter sizes are used to extract the features: 3, 5, 7, and 9. The descriptors for classification are represented by vectors concatenating the features of each filter size, as selected with the scores computed by ANOVA. Examples of features extracted with filter size equal to 5 are shown in Fig. 4. All images are normalized to the range 0 to 255 of grayscale values for visualization. The images in Fig. 4b, c, and d are the RCM, top-hat, and Mexican-hat features extracted from the preprocessed image in Fig. 4a, respectively.

Fig. 4 Preprocessed image and features extracted with filter size 5

In feature selection, the scores are computed separately for each type of feature and filter size using 100 iterations. Figure 5 presents the scores computed with the preprocessed images and their features. The filters applied to the images help to select discriminant features, highlighting the scores of relevant voxels near the boundary. We ranked the brain regions by their scores and the results are consistent with the literature (Ambastha 2015; Garali et al. 2016). The brain is segmented into 116 anatomical regions using masks extracted from the MARINA software (Walter et al. 2003). The 10 regions with the highest mean scores in MRI and FDG-PET scans are presented in Tables 4 and 5. The MRI features with the highest scores are located in the hippocampi, parahippocampi, and amygdalae. In the FDG-PET scans, the top-ranked regions are the posterior cingulate gyrus, angular gyrus, and hippocampi.

Fig. 5 Scores computed using preprocessed images and their features

Table 4 Rank of the regions with highest scores in MRI scans
Table 5 Rank of the regions with highest scores in FDG-PET scans

To calculate the scores, random subsets are selected by randomly choosing images from 95% of the subjects in the training data. At each iteration, only one image is chosen per subject to keep the number of samples fixed for the calculation of the F-values. The number of features is selected as the value that achieves the highest accuracy on the validation set. Experiments were performed by training linear SVM classifiers with the penalty parameter C fixed at 1. Other values of C did not cause variations in the classification performances.
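Putting the selection and the classifier together, a hedged scikit-learn sketch of this training step is given below; scores, N, and the data splits are assumed from the previous sketches:

```python
import numpy as np
from sklearn.svm import SVC

# scores: output of the iterative ANOVA scoring; N: tuned on validation data;
# X_train, y_train, X_test, y_test: subject-wise splits (see "Validation Method").
selected = np.argsort(scores)[-N:]        # indices of the N highest-scored features
clf = SVC(kernel="linear", C=1.0)         # linear SVM, penalty C fixed at 1
clf.fit(X_train[:, selected], y_train)
accuracy = clf.score(X_test[:, selected], y_test)
```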

Figure 6 shows the average accuracy on the test set with respect to different numbers of selected features. For each neuroimaging modality, we evaluated the classification performance of SVM with each descriptor: the preprocessed input image and the filtered images. The highest classification performance of SVM was obtained with the RCM descriptor. Correct classification rates of 90.3% and 95.1% were achieved for the MRI and FDG-PET modalities, respectively. Table 6 presents the results for each image descriptor. The highest means of the performance rates are emphasized in bold.

Fig. 6 Classification accuracy when the number of selected features varies

Table 6 Classification performances of SVM with each descriptor

Figure 7 shows the Receiver Operating Characteristic (ROC) (Hanley and McNeil 1982) plots of the RCM descriptor. Analyzing the ROC curves, we can observe variations in the classification performances across the folds due to the diversity of individuals. This shows that some subjects are more difficult to classify than others; thus, our choice of 10-fold cross-validation is justified to obtain more reliable results.

Fig. 7 ROC curves of experiments performed with RCM features

The results reported in the literature are difficult to compare given the differences in the validation methods, databases, and amount of data used. In addition, problems such as the presence of images of the same individual in both training and test sets, or the lack of consideration of the variability among individuals in the validation method, may compromise the accuracy of the results. Therefore, for a more reliable comparison with other methods, the performances of different techniques were evaluated using k-fold cross-validation with k = 10, without overlapping images of the same individual in the training and test sets. The same images and folds were used to evaluate these methods. The results are reported in Table 7. The highest means of the performance rates are emphasized in bold.

Table 7 Classification performances obtained from different techniques using 10-fold cross-validation

The classification performances of LR with the preprocessed images, feature-space reduction, and deep learning techniques were also evaluated. PLR models were trained with regularization parameter λ = 0.5, achieving accuracies of 82.2% and 91.6% on MRI and FDG-PET images, respectively. SLR with regularization and sparsity parameters λ = 0.5 and β = 0.5 achieved correct classification rates of 81.0% and 91.6% in the classification of MRI and FDG-PET images, respectively.

Feature-space reduction techniques also resulted in high classification accuracies. The space dimension of the images was reduced by PCA and ICA, followed by SVM classification. The best results were obtained with ICA, presenting correct classification rates of 82.3% and 92.1% on MRI and FDG-PET images, respectively. The PCA method resulted in accuracies of 81.9% and 89.8% on MRI and FDG-PET images. The number of components chosen for the feature-space reduction was determined as the value that reached the highest accuracy in the classification of the validation data.
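For reference, these baselines can be assembled as scikit-learn pipelines; the component count below is illustrative, since the actual number was tuned on the validation data:

```python
from sklearn.decomposition import PCA, FastICA
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Reduce the voxel space before the linear SVM; 100 components is illustrative.
pca_svm = make_pipeline(PCA(n_components=100), SVC(kernel="linear", C=1.0))
ica_svm = make_pipeline(FastICA(n_components=100, max_iter=1000),
                        SVC(kernel="linear", C=1.0))
pca_svm.fit(X_train, y_train)  # X_train, y_train as in the validation protocol
```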

Deep learning techniques have proved to be effective for image representation and classification. For neuroimaging applications, different architectures have been exploited (Ambastha 2015; Gupta et al. 2013; Liu et al. 2015; Payan and Montana 2015). Most of the works train their models with Autoencoders (Gupta et al. 2013; Liu et al. 2015; Payan and Montana 2015) due to the high dimensionality of the data. Some of these works, however, do not split the training and test sets by subject. This implies a high risk of overfitting, because some databases can have multiple images of the same subject. Thus, in this work we evaluated the performances of two ConvNet architectures (Gupta et al. 2013; Payan and Montana 2015) trained with SAE.

The 2D architecture of Gupta et al. (2013) resulted in low accuracies, below 60%. Its large pooling causes the loss of discriminant information, affecting the classification performances: the 43,084,800 features extracted by the convolution layer are reduced to 61,200. ConvNets generally use pooling sizes of 3 × 3 or 5 × 5, preserving discriminant information, but in this architecture the size used was 24 × 30.

Payan and Montana (2015) trained a 3D ConvNet with the same data used in Gupta et al. (2013), reporting a greater accuracy. Our experiments with this architecture reached accuracies of 82.3% and 87.1% on MRI and FDG-PET images, respectively.

Considering that randomness in the data can affect the classification performances, we also report the Cohen's kappa coefficient (Association et al. 1999) obtained in each experiment. Unlike accuracy, Cohen's kappa calculates the correct classification rates independently for each class and then aggregates them into a single value. This metric is less sensitive to randomness caused by variations in the number of observations of each class. The coefficient κ can be defined by:

$$ \kappa = 1- \frac{1 -p_{o}}{1 -p_{e}}, $$
(9)

where po is the relative agreement observed between the real value and the value estimated by the classifier, which equals the accuracy:

$$ p_{o} = \frac{TP + TN}{TP+TN+FP+FN}. $$
(10)

The term pe is the hypothetical probability of chance agreement, calculated by:

$$ p_{e} = \frac{(TP + FN)(TP + FP) + (FP + TN)(FN + TN)}{(TP+TN+FP+FN)^{2}}. $$
(11)
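Equations (9)–(11) translate directly into code; a minimal sketch, reusing the illustrative counts from Eq. (1):

```python
def cohens_kappa(tp, tn, fp, fn):
    """Cohen's kappa from confusion counts, following Eqs. (9)-(11)."""
    n = tp + tn + fp + fn
    p_o = (tp + tn) / n                      # observed agreement, Eq. (10)
    p_e = ((tp + fn) * (tp + fp)
           + (fp + tn) * (fn + tn)) / n**2   # chance agreement, Eq. (11)
    return 1 - (1 - p_o) / (1 - p_e)         # Eq. (9)

print(cohens_kappa(tp=90, tn=85, fp=10, fn=5))
```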

According to Landis and Koch (1977), the κ value can be interpreted based on the intervals shown in Table 8.

Table 8 Interpretation of κ based on intervals of values (Landis and Koch 1977)

The κ values obtained for each classification method are presented in Table 7. Analyzing κ, the performances with RCM indicate substantial agreement and almost perfect agreement with the clinical information of the MRI and FDG-PET scans, respectively.

Conclusion

In this paper, we analyzed the performances of different methods for the discovery of image biomarkers associated with AD. Experiments were performed with large, public datasets of MRI and FDG-PET scans. The results showed that image filtering techniques can be useful to improve the classification performances. We also proposed the RCM descriptor, which extracts features from the brain boundaries. Our method was able to find relevant biomarkers without requiring prior knowledge of ROIs/VOIs. Moreover, the whole brain, with gray matter and white matter tissues, was used for feature extraction; thus, tissue segmentation was not necessary as in other approaches. In comparison with other methods, the classification with the RCM descriptor obtained high performances with low variances.

AD severely damages the hippocampal region, which begins to be affected in the earliest stages of the disease, even before it impairs the patient's cognitive ability. In this work, we noted that this region is very important for the diagnosis, as also reported in medical studies. The results indicate that relevant regions can be automatically found by computers and are useful for supporting the clinical diagnosis.

PET images showed a high potential to aid the diagnosis of AD. It is expected that this type of exam will become an important tool for diagnosis and prognosis. Furthermore, with advances in treatment methods, imaging exams will be of great importance for the discovery and determination of the stages of AD. In future work, we intend to investigate multimodal approaches with FDG-PET and MRI, since the combination of different modalities has improved the classification performances in other studies. RCM can also be investigated to classify images of patients at other stages of dementia, such as early and late MCI.

Information Sharing Statement

The implementation of the methods to extract features, preprocess, and classify the images is available at https://github.com/alexandreyy/alzheimer_cad. These methods require open-source software to be executed: the FSL library (RRID:SCR_002823) https://fsl.fmrib.ox.ac.uk/fsl/fslwiki and the Advanced Normalization Tools (RRID:SCR_004757) http://stnava.github.io/ANTs/. The image dataset is provided by the ADNI (RRID:SCR_003007) at http://adni.loni.usc.edu/.