1 Background

Alzheimer’s disease (AD) is a chronic neurodegenerative disease associated with aging [44]. It accounts for 60% to 70% of dementia cases and causes severe problems with thinking, memory, and behavior [66]. Nearly 30 million elderly people have been reported as suffering from AD, and by 2050, AD is expected to affect 1 in 85 people worldwide [63].

As the world enters an aging society, people with AD place heavy burdens on both their families and society. In the US, healthcare costs for people with AD are about $100 billion every year, and are expected to rise to $1 trillion annually within thirty years [48].

In recent decades, several neuroimaging techniques have been widely applied in clinical diagnosis and AD detection. These techniques include computed tomography (CT) [11], single-photon emission computed tomography (SPECT) [37], positron emission tomography (PET), magnetic resonance imaging (MRI) [9, 19, 49, 69], magnetic resonance spectral imaging (MRSI) [12], and functional magnetic resonance imaging (fMRI) [62].

Automatic AD detection is extremely important for patients, since it gives them enough time to receive early treatment. Two types of detection systems exist in past research: whole-brain-based detection (WBD) [47] and single-slice-based detection (SSD) [78]. In hospitals, the latter is widely used because it is inexpensive (only 300–500 RMB per scan) and fast (only two or three minutes per scan). In this study, we focus on the latter.

How can important features be detected from brain images? Physicians and computer experts hold different opinions on this problem. Physicians prefer to extract local features: they measure the volume of a segmented region of interest (ROI), use voxel-based morphometry (VBM) to measure atrophy, or measure cortical thickness and other features related to brain tissues [46]. Computer experts, on the other hand, prefer to use image processing [35, 36, 46] and artificial intelligence methods [5, 75] to directly extract global image features, such as Hu’s moment invariants [60], Zernike moments [21], wavelet energy [72], and wavelet transform [34].

Scholars have proposed various methods to detect AD, which are reviewed in Section 2. Our contribution is a novel SSD system for AD with higher accuracy than state-of-the-art approaches, built on wavelet entropy, a multilayer perceptron, and an improved biogeography-based optimization method. In addition, we compare the proposed method with widely used methods through rigorous statistical experiments.

The rest of the paper is organized as follows: Section 2 discusses the background and the latest methods. Section 3 presents the feature extraction methods. Section 4 describes the classification methods. Section 5 reports the data and results and gives the corresponding discussion. Finally, Section 6 presents the concluding remarks.

2 State-of-the-art

Currently, there are many novel AD detection methods: Dong (2014) [14] employed the under-sampling (US) technique. They used principal component analysis (PCA) and singular value decomposition (SVD) to select features. Finally, they combined decision tree (DT) with support vector machine (SVM). Plant (2010) [51] employed brain region cluster (BRC) and information gain (IG). Savio (2013) [55] offered a new deformation-based morphometry (DBM) method. They found modulated gray matter (MGM) performed well. Furthermore, they utilized Pearson’s correlation (PC) to select important features. Yuan (2015) [73] employed the eigenbrain (EB) to extract features. Afterwards, they employed Welch’s t-Test (WTT) to reduce the feature dimensionality. Gray (2013) [23] put forward a voxel-based morphometry (VBM) method and employed random forest (RF) technique. Zhang (2015) [76] proposed a novel displacement field (DF) to detect AD.

This paper proposes a novel AD detection system based on wavelet entropy (WE) and multi-layer perceptron (MLP). WE has been successfully applied in various medical applications. For example, Shiyang (2007) [57] applied WE to analyze heart rate fluctuations. Bakhshi (2013) [4] applied continuous-time WE to detect cardiac repolarization alternans. Frantzidis (2014) [18] used relative WE and electroencephalography (EEG) to detect AD. Candra (2015) [10] used WE to classify EEG emotion signals.

On the other hand, MLP is also a prevalent classification tool in medical fields. For instance, Sonawane (2014) [58] applied MLP to predict heart diseases. Behera (2015) [6] used a bird mating optimization method and MLP to classify diseases. Ibrahim (2015) [26] used MLP to diagnose breast cancer. Meng (2015) [41] used MLP to analyze the meteorological factors related to emergency admissions of elderly stroke patients in Shanghai.

From the above, we see that WE and MLP are efficient tools for analyzing medical signals and images, and that they have achieved success in recent studies. This provides solid support for our study.

3 Feature extraction

3.1 Single slice selection

The single slice was selected via the inter-class variance (ICV) criterion we proposed previously [76], with important modifications. Reference [76] selected the 10 most important slices from each 3D volumetric brain image. In this paper, we choose only one important slice; hence, the slice with the highest ICV was picked from all slices and used for subsequent processing. The slice direction may be sagittal, coronal, or axial. In this study, we chose the axial direction based on experience.
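As an illustration of this selection step, a simple sketch is given below; it assumes the preprocessed volumes are available as NumPy arrays and uses a plain between-class variance of mean slice intensities as a stand-in for the ICV criterion, whose exact definition follows [76]. All function names are illustrative, not the authors' implementation.

```python
import numpy as np

def inter_class_variance(ad_volumes, hc_volumes, slice_idx):
    """Between-class variance of slice intensities at one axial index.

    ad_volumes, hc_volumes: lists of 3D numpy arrays (one per subject), same shape.
    """
    ad_slices = np.stack([v[:, :, slice_idx] for v in ad_volumes])
    hc_slices = np.stack([v[:, :, slice_idx] for v in hc_volumes])
    mu_ad, mu_hc = ad_slices.mean(), hc_slices.mean()
    mu_all = np.concatenate([ad_slices.ravel(), hc_slices.ravel()]).mean()
    n_ad, n_hc = ad_slices.size, hc_slices.size
    # weighted squared deviation of each class mean from the global mean
    return (n_ad * (mu_ad - mu_all) ** 2 + n_hc * (mu_hc - mu_all) ** 2) / (n_ad + n_hc)

def select_key_slice(ad_volumes, hc_volumes):
    """Return the axial slice index with the highest inter-class variance."""
    n_slices = ad_volumes[0].shape[2]
    scores = [inter_class_variance(ad_volumes, hc_volumes, k) for k in range(n_slices)]
    return int(np.argmax(scores))
```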

3.2 Wavelet transform

In this study, the wavelet transform (WT) was analyzed first. Torrents-Barrena (2015) [61] selected the complex wavelet transform to handle Alzheimer’s electroencephalography signals. Aggarwal (2015) [2] used the 3D discrete wavelet transform on T1-weighted brain magnetic resonance images for the diagnosis of AD. However, the wavelet transform introduces additional parameters, such as the wavelet family and the decomposition scale [79]. Previous studies commonly selected these by experience or arbitrarily.

When wavelet analysis is employed for AD detection, the ultimate aim is to obtain a better identification rate for AD subjects. Another problem is that the wavelet transform generates the same number of coefficients as the original 3D brain image, which burdens the subsequent analysis.

Entropy can be employed to measure the information content of the decomposed wavelet coefficients. Wavelet entropy (WE) has been proposed to calculate the entropy of the distribution of the wavelet subband coefficients.

The discrete wavelet transform (DWT) [39] implements the continuous wavelet transform (CWT) using dyadic scales and positions [43]. Suppose t represents time and r(t) is a given signal in the time domain (easily extended to the spatial domain); the CWT is defined as:

$$ S\left(a,u\right)=\int_{-\infty}^{\infty}r(t)\frac{1}{\sqrt{a}}\,\psi^{*}\!\left(\frac{t-u}{a}\right)\mathrm{d}t $$
(1)

where ψ is a real-valued wavelet function, S denotes the wavelet coefficients, a the dilation factor, and u the translation factor.

We discretize formula (1) by restricting a and u to a dyadic lattice:

$$ a={2}^j $$
(2)
$$ u={2}^jk $$
(3)

Then we obtain the DWT form:

$$ \begin{aligned}L_{j,k}(n)&=\Omega\left[\sum_n r(n)\,l_j^{*}\left(n-2^jk\right)\right]\\ H_{j,k}(n)&=\Omega\left[\sum_n r(n)\,h_j^{*}\left(n-2^jk\right)\right]\end{aligned} $$
(4)

Here Ω denotes downsampling [16], and n is the discrete counterpart of t. L represents the approximation coefficients obtained through a low-pass filter l(n), and H represents the detail coefficients obtained through a high-pass filter h(n). j and k denote the scale and translation factors of the wavelet function, respectively.
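In practice, the filter-bank form of Eq. (4), extended to two dimensions, is provided by standard wavelet libraries. The following is a minimal sketch using PyWavelets (an assumed dependency, not part of the original work) that applies a one-level 2D DWT to a brain slice:

```python
import numpy as np
import pywt

# toy stand-in for a 256 x 256 brain slice
slice_img = np.random.rand(256, 256)

# one-level 2D DWT: LL holds the approximation coefficients (low-pass in both
# directions); (LH, HL, HH) hold horizontal, vertical, and diagonal details
LL, (LH, HL, HH) = pywt.dwt2(slice_img, 'bior4.4')

print(LL.shape)  # roughly half the original size in each dimension after downsampling
```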

3.3 Brain image oriented wavelet

What kind of wavelet is suitable for brain images? To answer this question, we selected three row-lines and three column-lines from a randomly selected brain image. Figure 1(a) shows the selected brain image. Figure 1(b-d) presents three row-lines with indexes 30, 40, and 50, respectively. Figure 1(e-g) presents three column-lines with indexes 60, 70, and 80, respectively.

Fig. 1
figure 1

Randomly selected row-line and column-line of brain images (RI = Row Index, CI = Column Index)

After comparing the row-lines and column-lines against different wavelets, we selected the bior4.4 wavelet, since its wavelet functions resemble the sharp changes of gray-level values along the lines of brain images. Figure 2 shows the decomposition functions of bior4.4.

Fig. 2
figure 2

Some important functions of bior4.4 decomposition (SF = scaling function; WF = wavelet function; LPF = low-pass filter; HPF = high-pass filter)

Compared with orthogonal wavelets, biorthogonal wavelets have more degrees of freedom, and their wavelet transform is invertible but not necessarily orthogonal. Another advantage of biorthogonal wavelets is that they can generate symmetric wavelet functions.
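For reference, the decomposition filters and the functions shown in Fig. 2 can be inspected directly with PyWavelets (again an assumed dependency), which also confirms the symmetry and non-orthogonality noted above:

```python
import pywt

# biorthogonal 4.4 wavelet: symmetric, with separate decomposition and
# reconstruction filter pairs (the extra degrees of freedom mentioned above)
w = pywt.Wavelet('bior4.4')

print(w.symmetry)                    # 'symmetric'
print(w.orthogonal)                  # False: biorthogonal, not orthogonal
print(len(w.dec_lo), len(w.dec_hi))  # lengths of the low- and high-pass decomposition filters

# the scaling/wavelet functions of Fig. 2 can be approximated numerically
phi_d, psi_d, phi_r, psi_r, x = w.wavefun(level=5)
```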

3.4 Entropy and wavelet entropy

In statistics, entropy is defined for a stochastic system to quantitatively measure its randomness. Suppose we have a continuous random variable X ∈ R^n; then the entropy S can be calculated as:

$$ S=\int_{0}^{\infty}-\eta(x)\log\eta(x)\,\mathrm{d}x $$
(5)

where η(x) denotes the probability density function (PDF) of the variable X. The entropy value lies between zero and one. The lower the entropy, the lower the degree of uncertainty in the system, and vice versa.

Wavelet entropy (WE) [25] calculates the entropy of the PDF of the energy distribution of the wavelet subband coefficients in the wavelet domain. It combines the wavelet transform and Shannon entropy to estimate the degree of disorder/order of a particular image at a specified spatial-frequency resolution. Suppose we have a brain image of size 256 × 256, and take a 2-level WE as an example.

Figure 3 shows the diagram for calculating WE. A brain image is submitted; after a 1-level DWT, we have four subbands (LL1, LH1, HL1, and HH1). A 2-level DWT then decomposes the LL1 subband into four further subbands (LL2, LH2, HL2, and HH2). The LL subband usually contains most of the image information and is therefore called the approximation subband. The other three subbands contain only detail information and are therefore called detail subbands. In short, the next decomposition is always performed over the LL subband. In total, there are 7 subbands (LL2, HL2, LH2, HH2, HL1, LH1, and HH1). Entropy is then computed over each subband, and finally a 7-element entropy vector is output.

Fig. 3
figure 3

Pipeline of Calculating WE (L = Low; H = High; S = Entropy)
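The pipeline of Fig. 3 can be sketched in a few lines, assuming PyWavelets is available and that the subband entropy is computed as the Shannon entropy of the normalized energy distribution of each subband's coefficients (a common definition; the exact normalization in [25] may differ):

```python
import numpy as np
import pywt

def subband_entropy(coeffs):
    """Shannon entropy of the normalized energy distribution of one subband."""
    energy = coeffs.ravel() ** 2
    p = energy / (energy.sum() + 1e-12)   # probability-like distribution
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

def wavelet_entropy(image, wavelet='bior4.4', level=2):
    """Return one entropy value per subband (7 values for a 2-level DWT)."""
    c = pywt.wavedec2(image, wavelet, level=level)
    subbands = [c[0]]                      # LL at the coarsest level
    for (LH, HL, HH) in c[1:]:             # detail subbands, coarse to fine
        subbands.extend([LH, HL, HH])
    return [subband_entropy(s) for s in subbands]

features = wavelet_entropy(np.random.rand(256, 256))  # toy 256 x 256 slice
print(len(features))  # 7
```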

4 Classifier

4.1 Multilayer perceptron

In the field of artificial intelligence [5], a multilayer perceptron (MLP) is a feed-forward neural network that maps input training points to target labels. The universal approximation theorem [74] guarantees that an MLP with enough hidden neurons can approximate the mapping required for our task. Figure 4 shows that an MLP is composed of multiple layers (usually three) of nodes in a directed graph.

Fig. 4
figure 4

Diagram of a multilayer perceptron, in which each layer is fully connected to the next layer
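For illustration, a minimal forward pass of such a three-layer MLP is sketched below, assuming the 7-element WE vector as input, sigmoid activations, and a single output node thresholded for AD versus HC; the hidden-layer size is an illustrative assumption, not the value used in our experiments.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def mlp_forward(x, params):
    """Forward pass of a 7-h-1 MLP; params = (W1, b1, W2, b2)."""
    W1, b1, W2, b2 = params
    hidden = sigmoid(W1 @ x + b1)      # hidden layer
    return sigmoid(W2 @ hidden + b2)   # output in (0, 1), thresholded for AD vs. HC

# illustrative sizes: 7 WE features, 10 hidden neurons, 1 output
rng = np.random.default_rng(0)
params = (rng.normal(size=(10, 7)), np.zeros(10),
          rng.normal(size=(1, 10)), np.zeros(1))
y = mlp_forward(rng.normal(size=7), params)
```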

4.2 Biogeography-based optimization

Traditionally, an MLP is trained by the backpropagation (BP) algorithm. The BP method uses the gradient of the loss function to train the weights and biases of the MLP. Before training, the weights and biases are generated at random. Then, we measure the mean-squared error (MSE) between the actual output R and the target output T:

$$ E=\frac{1}{2}{\left\Vert T-R\right\Vert}^2 $$
(6)

Here E represents the MSE value. The target here is the clinical dementia rating (CDR), as shown in Table 2.

Nevertheless, the loss function contains many local minima, and the gradient descent method may converge to one of them. To improve performance, scholars have suggested using swarm intelligence methods, which converge to the global minimum with high probability. For instance, Dil (2016) [13] used a genetic algorithm (GA) to train an artificial neural network. Saghatforoush (2016) [54] combined ant colony optimization (ACO) with a neural network. Mashhadban (2016) [40] applied particle swarm optimization (PSO) to train an ANN. Shamshirband (2016) [56] combined cuckoo search (CS) with an ANN. Two years ago, Mirjalili (2014) [42] proposed a novel training method, biogeography-based optimization (BBO), to train MLPs, and reported that BBO performed better than other swarm-intelligence-based training algorithms. Therefore, we chose BBO in this study.

Biogeography-based optimization (BBO) iteratively approximates the global optimum of an optimization problem [28, 65, 70] by mimicking biogeography. First, we define the “habitat suitability index (HASI)” as a measure of the comfort of each habitat given its current living conditions. The HASI depends on numerous variables [17], such as temperature, rainfall, area, humidity, and vegetation. These variables are called “suitability index variables (SUIV)” [52].

The BBO algorithm involves three important components: migration, mutation, and elitism. We discuss them in sequence below.

Habitats with higher HASI values tend to emigrate, and habitats with lower HASI values tend to immigrate, since emigration is driven by intense competition among existing species, while immigration is driven by abundant resources available for additional species [15, 71]. Therefore, based on the relationship between the emigration rate x and the immigration rate y, we can model the migration of species as:

$$ x(z)=X\times \frac{z}{Z} $$
(7)
$$ y(z)=Y\times \left(1-\frac{z}{Z}\right) $$
(8)

In the formula, z denotes the number of species and Z the maximum number of species. X and Y represent the maximum emigration and immigration rates, respectively [24].

Suppose a(z) represents the species count probability, i.e., the probability that a habitat hosts z species, and A is the maximum value of a. The mutation rate u is then defined as

$$ u(z)=U\times \left(1-\frac{a(z)}{A}\right) $$
(9)

where U is the maximum mutation rate [8]. The mutation operation is carried out as:

$$ {F}_{i,\mathrm{m}}={F}_{i,k}+\theta \times \left({F}_{i, \max }-{F}_{i, \min}\right) $$
(10)

where θ is a random number in the range [0, 1]. F_{i,k} represents the SUIV value at the k-th step. F_{i,m} is the mutated value and will be assigned to F_{i,k} if it provides a better HASI value. F_{i,max} and F_{i,min} represent the upper and lower bounds of F_{i,k}, respectively. Note that mutation is carried out independently on each SUIV:

$$ F_{i,k}\leftarrow F_{i,\mathrm{m}},\quad \mathrm{if}\; v\left(\left[F_{1,k},F_{2,k},\dots, F_{i-1,k},F_{i,\mathrm{m}},F_{i+1,k},\dots \right]\right)<v\left(F_k\right) $$
(11)

where v represents the HASI objective function. Elitism, on the other hand, keeps the best solutions in the ecosystem [50] to counteract the effect of the mutation operation. Assuming the number of elites is ξ, the algorithm performs elitism by setting y = 0 for the best ξ habitats.
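To make the above steps concrete, the following is a condensed, illustrative Python sketch of the BBO loop (not the authors' implementation): habitats are candidate solutions (e.g., flattened MLP weight vectors), migration follows rank-based rates in the spirit of Eqs. (7)-(8), elitism keeps the best ξ habitats, and mutation is simplified to a random re-draw within the SUIV bounds with a fixed rate rather than the probability-dependent rate of Eq. (9). The cost function and all parameter values are assumptions.

```python
import numpy as np

def bbo_minimize(cost, dim, pop=30, iters=100, U=0.01, elites=2,
                 lo=-1.0, hi=1.0, seed=0):
    """Minimal biogeography-based optimization: each habitat is one candidate
    solution (e.g. the flattened MLP weights); its SUIVs are the components."""
    rng = np.random.default_rng(seed)
    habitats = rng.uniform(lo, hi, size=(pop, dim))
    fitness = np.array([cost(h) for h in habitats])

    for _ in range(iters):
        order = np.argsort(fitness)            # best (lowest cost) first
        habitats, fitness = habitats[order], fitness[order]
        rank = np.arange(pop)
        emigration = 1.0 - rank / (pop - 1)    # best habitats emigrate most
        immigration = rank / (pop - 1)         # worst habitats immigrate most

        new = habitats.copy()
        for i in range(elites, pop):           # elitism: best habitats are untouched
            for d in range(dim):
                if rng.random() < immigration[i]:  # accept an immigrant SUIV
                    src = rng.choice(pop, p=emigration / emigration.sum())
                    new[i, d] = habitats[src, d]
                if rng.random() < U:               # mutation: random re-draw within bounds
                    new[i, d] = lo + rng.random() * (hi - lo)
        habitats = new
        fitness = np.array([cost(h) for h in habitats])

    best = int(np.argmin(fitness))
    return habitats[best], fitness[best]
```

In the full system, `cost` would wrap the MLP of Section 4.1, mapping a flattened weight vector to the MSE of Eq. (6) on the training data.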

4.3 Stratified cross validation

For statistical analysis, 10-fold stratified cross-validation (SCV) was employed for fair comparison. The 10-fold SCV was repeated 50 times, i.e., a 50 × 10-fold SCV was implemented. For each run, we take the accuracy, sensitivity, and specificity as performance measures (see Table 1).

Table 1 Measurement of classification performance

A correctly recognized AD case was counted as a true positive. Based on the 50 runs, the final three measures were reported in the form of mean and standard deviation (SD).
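As a reference for this protocol, the following is a minimal sketch of a 50 × 10-fold SCV loop using scikit-learn's StratifiedKFold (an assumed dependency); `classify_fold` is a hypothetical placeholder for training and applying the WE + MLP + BBO pipeline on one split.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

def run_repeated_scv(X, y, classify_fold, n_repeats=50, n_splits=10):
    """y: 1 for AD, 0 for HC. classify_fold(X_tr, y_tr, X_te) -> predicted labels."""
    accs, sens, spes = [], [], []
    for run in range(n_repeats):
        skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=run)
        tp = tn = fp = fn = 0
        for tr, te in skf.split(X, y):
            pred = np.asarray(classify_fold(X[tr], y[tr], X[te]))
            tp += int(((pred == 1) & (y[te] == 1)).sum())
            tn += int(((pred == 0) & (y[te] == 0)).sum())
            fp += int(((pred == 1) & (y[te] == 0)).sum())
            fn += int(((pred == 0) & (y[te] == 1)).sum())
        accs.append((tp + tn) / (tp + tn + fp + fn))
        sens.append(tp / (tp + fn))   # correctly recognized AD cases
        spes.append(tn / (tn + fp))   # correctly recognized HC cases
    return ((np.mean(accs), np.std(accs)),
            (np.mean(sens), np.std(sens)),
            (np.mean(spes), np.std(spes)))
```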

5 Experiments, results, and discussions

The data used in the simulation experiments were downloaded from the “Open Access Series of Imaging Studies (OASIS)” [3]. OASIS is a project that compiles and freely distributes brain MRI data sets to the scientific community [53].

OASIS covers two types of data: cross-sectional MRI data and longitudinal MRI data. In this study, we used the cross-sectional MRI data, because our study aims to develop an automatic system to detect AD, which is not relevant to the longitudinal data, in which AD subjects were followed over a long period of time.

The cross-sectional MRI data in OASIS include 416 subjects aged 18 to 96. All subjects are right-handed, and both men and women are included. In this study, we selected 126 samples (28 ADs and 98 HCs); subjects younger than 60 years or with missing records were excluded. The demographic status is reported in Table 2. Since the imbalanced data may cause problems in the subsequent identification, we adjusted the cost matrix [29, 45] to address this issue.

Table 2 Demographic Status of subjects

5.1 Image preprocessing

For each subject, each scanning session includes three or four individual T1-weighted MRI scans. To increase the signal-to-noise ratio (SNR), all MRI scans of the same person acquired with the same protocol were motion-corrected and spatially co-registered to Talairach space to generate an averaged image, which was then brain-masked. Motion correction registered the 3D images of all scans and generated an average 3D image in the original acquisition space. The images were then resampled to 1 mm × 1 mm × 1 mm, transformed from acquisition space to Talairach coordinate space, and finally brain-extracted. The whole preprocessing pipeline is shown in Fig. 5.

Fig. 5
figure 5

Preprocessing of a specified subject (Axial View)

All MR images were downloaded from OASIS and preprocessed. We considered only one slice of each MR image in the axial view. Figure 6 shows exemplar instances of both HC and AD.

Fig. 6
figure 6

Axial view of (a) HC and (b) AD

5.2 WE results

The AD image was decomposed by the bior4.4 wavelet. Figure 7(a) presents an original AD image; the ventricles are clearly enlarged and the cortex is atrophied compared with healthy controls. Figure 7(b) shows the 1-level DWT decomposition, in which four subbands (LL1, LH1, HL1, and HH1) preserve different components of the original image. Figure 7(c) shows the 2-level decomposition, where LL1 is decomposed into LL2, LH2, HL2, and HH2. Figure 7(d) shows the 3-level decomposition, where LL2 is further decomposed into LL3, LH3, HL3, and HH3.

Fig. 7
figure 7

DWT Decomposition Results (DC = decomposition)

5.3 Optimal decomposition level

After the 50 × 10-fold stratified cross-validation, our “WE + MLP + BBO” approach achieved an accuracy of 92.40%, a sensitivity of 92.14%, a specificity of 92.47%, and a precision of 77.76% when 3-level decomposition was used. We also varied the decomposition level from 1 to 4 and plotted the corresponding performance in Fig. 8, from which we observe that 3-level decomposition yields the best performance.

Fig. 8
figure 8

Classification performance varies with the decomposition level

5.4 Statistical analysis

In the third experiment, we give the detailed results of each run over each fold of the proposed method. Appendix 1 gives the segmentation results based on stratified cross-validation. Recall that we have 28 AD subjects and 98 HC subjects. The 10-fold segmentation divides the dataset into ten folds. Each column in Appendix 1 represents a different fold, and each row represents a run. The same setting applies to Appendix 2.

The sensitivities, specificities, and accuracies over the 50 runs of 10-fold SCV are shown in Table 3. We observe that the sensitivity of our method is 92.14 ± 4.39%, the specificity is 92.47 ± 1.23%, and the accuracy is 92.40 ± 0.83%.

Table 3 Measures over 50 runs (Unit: %)

5.5 Comparison with other approaches

To further demonstrate the effectiveness of the proposed “WE + MLP + BBO”, we compared it with 6 state-of-the-art approaches in Table 4. These methods include US + SVD-PCA + SVM-DT [14], BRC + IG + SVM [51], MGM + PC + SVM [55], EB + WTT + SVM [73], VBM + RF [23], and DF + PCA + SVM [76]. The meanings of these abbreviations can be found in Table 5. For a clearer view, Fig. 9 presents the corresponding bar plot.

Table 4 Comparison with State-of-the-art Approaches
Fig. 9
figure 9

Bar plot of algorithm comparison (MGM + PC + SVM [55] did not report its specificity)

The results in Table 4 show that US + SVD-PCA + SVM-DT [14] and BRC + IG + SVM [51] did not report the standard deviation of the three measures. The former obtained an accuracy of 90%, a sensitivity of 94%, and a specificity of 71%. The latter obtained an accuracy of 90.00%, a sensitivity of 96.88%, and a specificity of 77.78%. Their specificities are too low compared with the other approaches; therefore, these two methods were not studied further.

The other five methods report both the average and standard deviation values. MGM + PC + SVM [55] obtained an accuracy of 92.07 ± 1.12% and a sensitivity of 86.67 ± 4.71%. Nevertheless, they did not report the specificity, so this method is not considered further in this study.

All the remaining algorithms achieved satisfactory results. EB + WTT + SVM [73] obtained an accuracy of 91.47 ± 1.02%, a sensitivity of 90.17 ± 1.66%, and a specificity of 91.84 ± 1.09%. Their excellent performance is attributed to the proposed eigenbrain, which was inspired by the eigenface theory [7] widely used in face recognition.

VBM + RF [23] obtained an accuracy of 89.0 ± 0.7%, a sensitivity of 87.9 ± 1.2%, and a specificity of 90.0 ± 1.1%. Their success is attributed to voxel-based morphometry. Indeed, VBM has been commonly used to study brain changes. Maguire (2000) [38] showed that taxi drivers have, on average, a larger posterior hippocampus. Good (2001) [20] showed that global gray matter decreases linearly with age, whereas global white matter does not. Nevertheless, VBM needs accurate spatial normalization; otherwise, the classification performance may decrease significantly.

DF + PCA + SVM [76] obtained an accuracy of 88.27 ± 1.89%, a sensitivity of 84.93 ± 1.21%, and a specificity of 89.21 ± 1.63%. This method relies on a novel technique called the displacement field (DF), which measures the displacement field between slices of AD patients and HC subjects. Liu (2016) [33] extended DF to three dimensions. The DF method is promising, but it still needs further development to solve several problems: (i) DF is sensitive to noise, i.e., it will fail if the brain extraction result is not clean. (ii) The initial random solution candidate affects the final search result. (iii) Faster algorithms are needed.

Finally, the proposed “WE + MLP + BBO” achieves the highest accuracy of 92.40% and the highest specificity of 92.47% among all methods. In addition, our method obtains a sensitivity of 92.14%, which is slightly lower than the 96.88% of BRC + IG + SVM [51] and the 94% of US + SVD-PCA + SVM-DT [14]. Considering all three measures, our method performs better than the other six methods. Our method does not propose any new algorithm; it is a simple combination of mature algorithms. Nevertheless, the result shows that “simple is better than complex”, as stated by Occam’s razor [59]. This also hints that combinations of simple methods may work in other medical applications, such as sensorineural hearing loss [22, 31], multiple sclerosis [77], and breast cancer [32].

6 Conclusions and future research

In this paper, we proposed a new AD identification approach based on wavelet entropy (WE), a multilayer perceptron (MLP), and biogeography-based optimization (BBO). The proposed “WE + MLP + BBO” approach yields an accuracy of 92.40%, a sensitivity of 92.14%, and a specificity of 92.47%.

In the future, we will test advanced variants of WE, such as relative wavelet entropy [30] and wavelet singular entropy [27]. In addition, advanced swarm intelligence [1] methods can be used to train the MLP. Other image preprocessing methods will be tested to enhance the classification performance, such as image denoising [67], image enhancement [64], and image segmentation [68]. A morphological shared-weight neural network will also be employed as an alternative to MLP.

7 Nomenclature

Table 5 Acronym list