1 Introduction

Alzheimer’s Disease (AD) is a neurodegenerative disease and the most common cause of dementia in adult life. It is characterized by the deterioration of neurons, which affects most of their functions and produces the loss of immediate memory [17]. One study has shown that the number of individuals affected by AD is expected to double in the next two decades and that, by 2050, a new diagnosis of AD is anticipated approximately every half minute, amounting to almost one million new cases every year in the United States [3]. As a result, the cost of treating and caring for AD patients will increase, so it becomes crucial to build computerized systems that can detect early AD accurately and thereby help slow down its progress.

Artificial intelligence, in particular machine learning (ML), has gained unprecedented attention during the last decade with applications such as anomaly detection [7, 27, 28], assay detection [23], biological data mining [14, 16], disease diagnosis [2, 18, 19, 29], education [25], financial prediction [20], natural language processing [21], trust management [15] and urban services [9]. Several ML methods (e.g., random forest [6] and auto-encoders [11]) have recently been employed for AD detection. This difficult task can be addressed with deep learning (DL) models that can be fed with 3D images and learn features directly from them, which enhances detection. Recent studies have shown that convolutional neural networks (CNNs) yield better results than traditional approaches in the computerized prediction of AD from MR images [8].

This paper proposes a novel probability-based fusion of several CNN models to diagnose AD stages using 3D brain MRI scans. The model performs a 4-way classification (CN vs. EMCI vs. LMCI vs. AD) on the ADNI dataset, distinguishing healthy brains (CN), brains with early mild cognitive impairment (EMCI), brains with late mild cognitive impairment (LMCI) and diseased brains with AD.

In the rest of this paper, Sect. 2 reviews the literature, Sect. 3 describes the proposed method, Sect. 4 reports and discusses the results, and Sect. 5 concludes the paper along with some possible future research directions.

2 Related Works

The automatic classification of AD has been under research for more than a decade. In recent years, there has been considerable progress in the field, with DL models achieving near-perfect accuracy scores [4]. Thanks to the progress in robust DL models, CNN-based models in particular have been widely used for medical diagnosis research.

When DL was first employed in medical image classification, Liu et al. [12] used a stacked auto-encoder to classify the early stage of AD. They evaluated the model with 10-fold cross-validation and achieved 47.42% accuracy in classifying 4 classes. Their dataset was unbalanced, which was a limitation of the approach; classifying some groups was therefore harder than others for the auto-encoder. Later, with the popularity of predesigned CNN architectures that performed well on the ImageNet Large-Scale Visual Recognition Challenge, researchers started to focus on the potential of transfer learning for computational biology applications. Farooq et al. [8] proposed an approach using predesigned CNNs and 2D slices of MRIs. The approach implemented complex CNN architectures from the ImageNet challenge; in this instance, two residual networks (ResNet-18 and ResNet-152) and GoogLeNet achieved accuracies of 98.01%, 98.14% and 98.88%, respectively, in a 3-way classification. Many of the reported AD classification methods were applied to 2D slices of MRIs, which are 3D by nature, and these approaches usually need multiple feature-extraction steps to support the later phases of training the model. Korolev et al. [10] developed adapted versions of the VGG and residual network architectures that work with 3D images. Additional research using 3D CNNs was conducted by Tang et al. [22], who addressed 1 ternary and 3 binary classification problems, achieving 91.32% accuracy for the ternary classification, 88.43% for AD/MCI, 92.62% for MCI/NC and 96.81% for AD/NC. A different method by Wang et al. [24] achieved strong performance in ternary classification by applying a 3D ensemble approach. That approach merged the most efficient DenseNet classifiers, which were trained individually and produced probabilistic outputs through a softmax layer; the final classification was then obtained by feeding these probability scores to a probability-based fusion method.

Fig. 1. Top: Architecture of a dense unit. Bottom: Composition of dense connectivity in a 6-layer dense block.

3 Proposed Method

The proposed approach employs distinct 3D DenseNets that vary in their hyper-parameters. Each network is fed with MR images, and the class probabilities it outputs are passed to a probability-based fusion step that makes the final classification. A traditional network model comprises a sequence of layers; \(z_l\) denotes the output of the \(l^{\text {th}}\) layer, and every layer implements a non-linear transformation \(H_l(.)\), where l indexes the layer. To mitigate vanishing gradients and improve the information flow during network training, the DenseNet connects each layer to all the following layers. In the approach implemented for this study, the idea of dense connectivity is extended to the 3D volumetric image classification task. Specifically, \(z_l\) is defined as:

$$\begin{aligned} z_l = H_l([z_0, z_1, z_2,..., z_{l-1}]) \end{aligned}$$

where \(z_0, z_1,..., z_{l-1}\) are the 3D feature volumes produced in previous layers and [...] denotes the concatenation operation. The structure of a dense unit is shown in Fig. 1 (top). The function \(H_l(.)\) has three main components: a batch normalisation (BN) layer to reduce internal covariate shift, a spatial convolution with k \(3 \times 3 \times 3\) convolution kernels to produce 3D feature volumes, and a rectified linear unit (ReLU) to accelerate the training phase. Figure 1 (bottom) shows how dense units are composed into a dense block: each unit forms one layer of the block, and each layer is connected to all the following layers. With dense connections between layers, features are reused more efficiently, and the number of features added per layer is lower than in traditional CNNs. Thus, the models are compact and have fewer parameters than other networks.

Previous research has shown that the hyper-parameters of the 3D DenseNet affect its performance [24]. Multiple tests were conducted with diverse hyper-parameter sets to produce individual networks with unique compositions that were able to extract different features. One sensitive hyper-parameter shown to improve the outputs of the network was the growth rate. If each function \(H_l\) generates g feature volumes, the \(l^{\text {th}}\) layer has \(g_0 + g \times (l-1)\) input feature volumes, where \(g_0\) is the number of channels in the input layer and g is the growth rate of the network. The 3D DenseNet can therefore have compact layers, e.g., \(g = 12\). Because every layer has access to all the previous feature volumes in its block, each layer only needs to append g feature volumes of its own to this shared state.
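The growth-rate formula can be checked with a few lines of Python. This is only an illustrative sketch: the function name is hypothetical and the value \(g_0 = 16\) is an assumption chosen for the example, not a setting reported in this study. The growth rates 12, 28 and 32 are those discussed in the paper, and the six layers mirror the block drawn in Fig. 1.

```python
def input_feature_volumes(layer_index, growth_rate, input_channels):
    """Feature volumes entering the l-th layer of a dense block: g_0 + g * (l - 1)."""
    if layer_index < 1:
        raise ValueError("layer_index starts at 1")
    return input_channels + growth_rate * (layer_index - 1)


G0 = 16  # assumed number of channels entering the block (illustrative only)

for g in (12, 28, 32):
    # Inputs seen by layers 1..6 of a 6-layer dense block
    widths = [input_feature_volumes(l, g, G0) for l in range(1, 7)]
    print(f"g = {g}: {widths}")
```

The input width grows linearly with depth, so a larger g makes every later layer, and hence the whole model, wider and more expensive to train.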

Fig. 2. Architecture of the proposed ensemble 3D DenseNet framework for 4-way AD classification.

The proposed method implements a probability-based fusion ensemble approach [26], in which the probabilistic outputs of the last classification layer of the individual networks are combined (see Fig. 2). The usual majority voting method assigns the label that is predicted by most of the models. In this ensemble method, every model is trained individually, so the combined error margin across the different classifiers is small, making the results of the approach superior to those of a single classifier. The error margin could rise for single classifiers if the subject is hard to classify and there is uncertainty among the distinct classes. As an example, take three classifiers whose output probabilities for CN, EMCI, LMCI, and AD are (1) 0.8, 0.1, 0.1, 0.0; (2) 0.4, 0.5, 0.0, 0.1; and (3) 0.3, 0.5, 0.1, 0.1, respectively. With majority voting, the classification result is EMCI, because classifiers 2 and 3 both favour it. However, this is not necessarily the most accurate answer, considering that classifier 1 is confident in its prediction of CN, while classifiers 2 and 3 are uncertain in theirs. The probability-based fusion approach instead aggregates the probabilistic outputs of all the classifiers for each class and then makes a more certain prediction. For this research, c individual models were picked, and the probabilities assigned to the classes by 3D DenseNet\(_u\) (\(u = 1, \ldots, c\)) on the testing set were:

$$\begin{aligned} \varOmega ^u = (\beta _1^u,\beta _2^u,\beta _3^u,\beta _4^u) \end{aligned}$$

where \(\beta _n^u\) indicates the probability assigned to class n by the \(u^{\text {th}}\) model. Then \(\varOmega ^u\) is normalized by:

$$\begin{aligned} \varOmega ^u \leftarrow \frac{\varOmega ^u}{\max [\beta _1^u,\beta _2^u,\beta _3^u,\beta _4^u]} \end{aligned}$$

Once the outputs of the c base 3D DenseNets have been calculated, the final prediction label a is determined by the probability-based fusion method as:

$$\begin{aligned} a = \arg \,\max (\prod _{u=1}^{c}\beta _{1}^{u}, \prod _{u=1}^{c}\beta _{2}^{u}, \prod _{u=1}^{c}\beta _{3}^{u},\prod _{u=1}^{c}\beta _{4}^{u}) \end{aligned}$$
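The three-classifier example from the text can be worked through in code to contrast majority voting with the fusion rule. The sketch below is one possible reading of the equations, not the authors' implementation: it assumes that normalisation means dividing each probability vector by its largest entry and that the fused score of a class is the product of the normalised scores across models. All function names are hypothetical.

```python
from collections import Counter
from math import prod

CLASSES = ("CN", "EMCI", "LMCI", "AD")


def argmax(values):
    """Index of the largest value (first one wins on ties)."""
    return max(range(len(values)), key=lambda i: values[i])


def majority_vote(outputs):
    """Class index predicted by most classifiers."""
    votes = [argmax(p) for p in outputs]
    return Counter(votes).most_common(1)[0][0]


def normalise(probs):
    """Divide a probability vector by its largest entry (assumed reading of the text).

    Softmax outputs always contain a positive entry, so the divisor is non-zero.
    """
    top = max(probs)
    return [p / top for p in probs]


def probability_fusion(outputs):
    """Multiply the normalised per-class scores across models, then take the argmax."""
    normalised = [normalise(p) for p in outputs]
    fused = [prod(model[n] for model in normalised) for n in range(len(CLASSES))]
    return argmax(fused), fused


# Output probabilities of the three classifiers in the worked example
outputs = [
    (0.8, 0.1, 0.1, 0.0),
    (0.4, 0.5, 0.0, 0.1),
    (0.3, 0.5, 0.1, 0.1),
]

vote = majority_vote(outputs)
label, fused = probability_fusion(outputs)
print("majority voting:", CLASSES[vote])
print("probability fusion:", CLASSES[label])
```

On this example majority voting selects EMCI, whereas the fusion rule selects CN, because the confident output of classifier 1 outweighs the two uncertain votes. Summing the per-class probabilities instead of multiplying them would also select CN here.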

3.1 Experimentation

MRI Data. Structural brain MRI scans (n = 600 images) from the ADNI dataset (http://adni.loni.usc.edu/) were used in this study. Preprocessed MRI scans (e.g., with masking, intensity normalisation, reorientation, and spatial normalisation applied) were downloaded in NIfTI file format from ADNI2 and ADNIGO. For all the experiments, the dataset was divided into 80% training and 20% validation, so the training set consisted of 480 brain scans and the validation set of 120. To give the network an optimal dataset, both the training and validation sets were balanced.
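A balanced 80/20 split of this kind can be sketched as follows. The snippet is illustrative only: the scan identifiers are made up, the function name is hypothetical, and the figure of 150 scans per class is an assumption that follows from 600 scans spread evenly over four classes. The study does not state how the split was actually produced.

```python
import random
from collections import Counter, defaultdict


def balanced_split(items, train_fraction=0.8, seed=0):
    """Split (scan_id, label) pairs so that every label keeps the same train/validation ratio."""
    by_label = defaultdict(list)
    for scan_id, label in items:
        by_label[label].append(scan_id)

    rng = random.Random(seed)  # fixed seed so the split is reproducible
    train, val = [], []
    for label, ids in sorted(by_label.items()):
        rng.shuffle(ids)
        cut = round(len(ids) * train_fraction)
        train += [(scan_id, label) for scan_id in ids[:cut]]
        val += [(scan_id, label) for scan_id in ids[cut:]]
    return train, val


# Hypothetical identifiers: 150 scans per class is an assumption
scans = [(f"{label}_{i:03d}", label)
         for label in ("CN", "EMCI", "LMCI", "AD")
         for i in range(150)]

train, val = balanced_split(scans)
print(len(train), len(val))                 # sizes of the two sets
print(Counter(label for _, label in val))   # per-class counts in the validation set
```

Splitting within each class, rather than over the pooled scans, is what keeps both sets balanced.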

Parameter Selection. Multiple tests on the 3D DenseNet were carried out, and the network hyper-parameters were optimised to obtain the best results on the 4-way classification task. The following hyper-parameter settings were used during training:

  • Adam stochastic optimisation algorithm;

  • PyTorch’s Cross-Entropy loss function;

  • \(Learning\ rate = 0.0001\);

  • \(Batch\ size = 4\);

  • \(Dropout = 0.5\);

  • \(Epochs = 100\).

Fig. 3. Comparison of different growth rates.

Growth Rate Analysis. The number of new features added at each layer is determined by the hyper-parameter g, known as the growth rate of the model. Figure 3 shows that the accuracy of the models changes considerably depending on the value assigned to g. The model achieved the best performance when \(g = 28\); nonetheless, with \(g = 12\) the accuracy is close to the best result. Previous research [24] showed that a 3D DenseNet with a low growth rate was incapable of learning crucial features for prediction and consequently did not achieve good performance.

DenseNet Network Depth Selection. Different network depth configurations of DenseNet, specifically DenseNet-121, DenseNet-169 and DenseNet-201, were compared for time and accuracy on the 4-way classification task. As shown in Fig. 4, DenseNet-121 was the most efficient network depth on both criteria and hence was chosen as the base classifier for this task.

3.2 Selecting Optimal Number of Models

The combination of different individual classifiers can reduce the error margin, principally because the probability-based fusion approach combines the probabilistic outputs of different classifier models and produces a more certain decision based on more robust and reliable data, instead of relying on a single classifier or on majority voting (see Sect. 3). Various tests were carried out to find the optimal number of models for the ensemble. As shown in Fig. 5, the ensemble with three models achieved the best accuracy. These experiments suggest that the quality of the models, i.e., how well an individual model predicts a specific class, is what actually determines the number of models that produces the optimal performance.

Fig. 4. Comparison of different network depths of the DenseNet model.

Fig. 5. Comparison of different number of models in the ensemble.

4 Results and Discussion

4.1 Individual Classifier Performance

The test findings for the independent classifier models and their parameters are shown in Table 1. The best result of the three was produced when \(g = 28\), suggesting that 28 is the optimal growth rate for 4-way classification with the DenseNet implementation of this study. Figure 6 shows that although classifier 1 was only 53.33% accurate in this task, it performed well in predicting EMCI, while the other two struggled. One explanation for why classifier 1 could extract features to predict EMCI subjects could be its growth rate. With \(g = 32\), classifier 1 had the largest number of parameters, which gave the model some leverage to extract more complex features. That said, classifier 1 struggled to predict simpler groups such as CN, on which the other two performed better; this can occur when the classifier is too complex for the training data.

Fig. 6. Confusion matrices (a) for classifier-1 (accuracy = 53.33%), (b) for classifier-2 (accuracy = 57.50%), and (c) for classifier-3 (accuracy = 66.67%). Labels: 0 = CN, 1 = AD, 2 = EMCI, 3 = LMCI.

Table 1. Parameter comparison of different network structures.

4.2 Comparison with Residual Network

The results of the comparison between DenseNet-121 and ResNet-18 are presented in Fig. 7. The experiment demonstrated that DenseNet-121 can be trained faster and achieves higher accuracy than ResNet-18. The longer training time of ResNet-18 is probably due to it having around 108 million parameters to train, compared with the 8.6 million of DenseNet-121.

Fig. 7. Comparison of ResNet-18 and DenseNet-121.

Table 2. Test outcome of our approach compared with different methods in 4-way classification of AD.

4.3 Comparison with the State-of-the-Art

Test outcomes for this research approach compared with other similar research are shown in Table 2. Among the models that use 3D MRI as input, our proposed model achieved the best performance (83.33%), which is 0.32 percentage points higher than the test outcome reported in [1]. However, it is crucial to note that these results are not directly comparable, given that the other studies used stable MCI and converting (progressive) MCI as the two distinct phases of MCI, whereas this research used early MCI and late MCI, which appear to be defined differently in the literature.

Test results in the study show that the ensemble approach can lead to higher classification performance, primarily because it can accumulate the probabilistic outputs of different classifier models and produce better predictions from more robust and reliable data, instead of classifying based on only one classifier model.

Compared with the independent classifiers, the ensemble shows a substantial increase in accuracy on both phases of MCI, as a result of merging the output of classifier 1, which performed well in predicting EMCI, with the predictions of the other classifiers, which were accurate in classifying the other groups.
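To make the size of this improvement concrete, the reported accuracies can be converted back into counts of correctly classified scans. The sketch below assumes that all four accuracies were measured on the 120-scan held-out set (20% of the 600 scans); that is an inference from the data split described in Sect. 3.1 and is not stated explicitly for every figure. The names used are hypothetical.

```python
HELD_OUT_SCANS = 120  # assumed evaluation set size: 20% of 600 scans

# Accuracies (%) as reported for the three classifiers and the ensemble
reported = {
    "classifier-1": 53.33,
    "classifier-2": 57.50,
    "classifier-3": 66.67,
    "ensemble": 83.33,
}


def correct_count(accuracy_percent, total=HELD_OUT_SCANS):
    """Nearest whole number of correctly classified scans for a reported accuracy."""
    return round(accuracy_percent / 100 * total)


counts = {name: correct_count(acc) for name, acc in reported.items()}
for name, n in counts.items():
    # Recompute the percentage from the whole-number count as a consistency check
    print(f"{name}: {n}/{HELD_OUT_SCANS} = {100 * n / HELD_OUT_SCANS:.2f}%")
```

Under this assumption each reported accuracy corresponds to a whole number of scans, and the ensemble classifies 20 more scans correctly than the best individual classifier.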

5 Conclusion

This study presented an ensemble of multiple 3D densely connected convolutional neural networks to predict AD as well as two critical stages of MCI (known as early MCI and late MCI) using MR images. Discriminating between early MCI, late MCI and AD can aid in recognising the different phases of dementia and allow treatment to begin in the early phases. The proposed approach was implemented to address the problem of having a limited number of MR images for the training phase. The 3D DenseNet is simpler to train because of its lower number of parameters, which results from connections that enhance the flow of gradients and data throughout the network. Various tests were conducted to study the performance of the model with different parameters. Furthermore, these tests produced individual classifier models with diverse parameters and structures that could be used for the ensemble. A probability-based fusion approach was used to merge the probabilistic outputs from these models. Using the probability-based fusion approach improved the accuracy over that of the individual member classifiers, giving a final accuracy of 83.33%. This indicates that the approach of this study is a robust and reliable method for 4-way prediction tasks, while also outperforming some previous studies. In future work, we would like to enlarge the training dataset and re-test classifier 1 and the ensemble approach to enhance the results. Alternatively, further study could use less training data and seek an approach that yields the same or higher accuracy, with a real-world environment in mind where data may be scarce.