1 Introduction

The brain is the most sophisticated organ in the human body. As the control center of the nervous system, it governs our behavior, so brain diseases are among the deadliest of all diseases. Early diagnosis can greatly improve patients' chances of survival. Currently, the diagnosis of brain disease relies heavily on medical imaging. Magnetic resonance imaging (MRI), computed tomography (CT) and X-ray are common imaging modalities in clinical diagnosis. MRI is non-invasive and radiation-free, and it provides clearer images of soft tissue than CT and X-ray. Therefore, it is the first choice for brain disease diagnosis.

Recently, automated medical image analysis has become a hot research topic; it requires both medical expertise and machine learning. Computer-aided diagnosis (CAD) systems can assist doctors and physicians in making decisions based on medical images. From the viewpoint of artificial intelligence, abnormal brain detection can be regarded as an image recognition and classification problem. A general framework for image classification usually consists of feature extraction and classifier training. During the last two decades, researchers and practitioners have proposed numerous methods to detect abnormal brains automatically from MRI.

These abnormal brain detection methods can be classified into two groups: traditional machine learning and deep learning. In classical machine learning, the image features are usually handcrafted, and much attention is paid to classifier training and optimization. In [1], the authors proposed to use the discrete wavelet transform (DWT) for feature extraction and employed two classification algorithms for brain MRI classification: the self-organizing map (SOM) neural network and the support vector machine (SVM). The SOM and SVM yielded accuracies of 94% and 98%, respectively. El-Dahshan and Hosny [2] proposed a hybrid pathological brain detection method. They first employed DWT to extract features from brain MRI; thereafter, principal component analysis (PCA) was leveraged to reduce the feature dimension. Finally, a feedforward back-propagation neural network (BPNN) and k-nearest neighbors (k-NN) were selected as the classification algorithms, achieving 97% and 98% accuracy, respectively. Kalbkhani and Shayesteh [3] suggested combining DWT and generalized autoregressive conditional heteroscedasticity (GARCH) for feature generation. PCA and linear discriminant analysis (LDA) were utilized to remove redundant features. Finally, they trained k-NN and SVM to identify the types of the brain MRIs; their approach could not only distinguish abnormal from healthy brains but also recognize seven different brain abnormalities. Saritha and Paul Joseph [4] first performed DWT on brain MRIs and extracted entropies from the DWT sub-bands. They then used spider-web plots to generate features based on wavelet entropy, and a probabilistic neural network (PNN) was trained for image classification, achieving good results. El-Dahshan and Mohsen [5] put forward a brain tumor detection system based on brain MRI. A feedback pulse-coupled neural network was trained to segment the brain tumors before classification; DWT with PCA was then employed to generate image features, and a BPNN served as the classification algorithm to label each image as abnormal or healthy. The proposed method achieved 99% accuracy on both training and testing samples. Bahadure and Ray [6] put forward a brain tumor segmentation and recognition method. The signal-to-noise ratio of the raw images was improved by pre-processing. They then compared several image segmentation methods, including watershed segmentation, fuzzy c-means clustering, the discrete cosine transform and the Berkeley wavelet transform, and found that the Berkeley wavelet transform performed best. Morphological operations were applied to the segmented images, and a set of texture and statistical features was calculated to form the feature vector. Finally, a genetic algorithm (GA) was employed for feature selection and classification, and their system achieved an overall accuracy of 92.03%. Gudigar and Raghavendra [7] investigated two image decomposition methods: bidimensional empirical mode decomposition and variational mode decomposition. Supervised neighborhood projection embedding and a bispectral feature extractor were then used to generate the feature vector, and an SVM was trained as the classifier. Experimental results suggested that variational mode decomposition was better than bidimensional empirical mode decomposition, with 90.68% accuracy. Acharya and Fernandes [8] proposed an Alzheimer's disease detection system based on brain MRI. A number of image transforms, including the wavelet transform and its variants, were employed to extract features. Student's t test was then utilized for feature selection, and k-NN was trained for identification and recognition. There are other reports showing the success of AI and signal processing methods in handling various tasks [9,10,11,12,13].

On the other hand, deep learning techniques usually generate image features in an automated manner, and deep learning has become an important tool for image classification in the last five years. Convolutional neural networks (CNNs) have brought substantial improvements to image-based machine learning tasks. We no longer need image transforms or decomposition methods to extract handcrafted features, because a CNN provides a unified framework that implements feature extraction and classification automatically and simultaneously. Consequently, many deep learning-based abnormal brain detection approaches have been proposed recently. Nayak and Das [14] proposed a multilayer extreme learning machine (ELM) autoencoder with leaky rectified linear units to classify brain MRIs; the ELM autoencoders were stacked to form a deep ELM in their multi-class classification experiment. Deepak and Ameer [15] proposed a brain tumor classification approach that can distinguish three types of tumors: glioma, meningioma and pituitary tumors. They used a pre-trained GoogLeNet and transfer learning to implement the classification: the last three layers of the pre-trained GoogLeNet were modified while the parameters of the early layers remained unchanged, so training only determined the weights of the last three layers. Their method achieved good classification performance in experiments. Han and Rundo [16] proposed a data augmentation method for brain tumor detection, because medical image datasets are usually small; their generative adversarial network (GAN)-based brain MRI augmentation algorithm improved the classification accuracy. Lu and Lu [17] combined AlexNet and transfer learning for detecting abnormalities in brain MRI. They used a pre-trained AlexNet, modified its last several layers and fine-tuned the whole modified network on the brain MRIs. Their method achieved 100% accuracy on the testing set.

From the above analysis, we can see that in abnormal brain detection systems based on classical machine learning, feature extraction relies on manually chosen image transforms and sometimes requires further feature selection and dimension reduction. However, classifier training is generally faster than in deep learning methods because the classifier structures are simpler and have far fewer parameters. Deep learning methods are capable of generating image features automatically: with the convolution and pooling operations in a CNN, features are learned gradually from low level to high level. Nevertheless, training deep CNN models is time-consuming.

The contribution of this study is that we combine classical machine learning and deep learning techniques to obtain both fast training and automated feature learning. We improved the performance of a pre-trained AlexNet by introducing batch normalization (BN) layers, and the improved AlexNet was fine-tuned on our brain MRI dataset. Thereafter, we replaced its last several layers with an ELM structure and proposed a searching approach to find the optimal number of layers to be replaced. To obtain better classification results, we optimized the weights and biases of the ELM with a novel chaotic bat algorithm, testing four different chaotic maps. All evaluation results were obtained by 5 × hold-out validation. Experimental results suggest that our system achieves good classification performance. Furthermore, our method provides a general framework that can be used in other image classification tasks.

The rest of this paper is organized as follows. Section 2 presents the brain MRIs used in our experiments. Section 3 explains the methods in detail. Section 4 describes the experimental environment and settings, and the experimental results and discussion are given in Sect. 5. Finally, Sect. 6 provides the conclusion and future work.

2 Material

The brain MRIs used in this study were obtained from the Whole Brain Atlas of Harvard Medical School (website: http://www.med.harvard.edu/AANLIB/). The key slices were selected by radiologists with over ten years of experience. Our original dataset contains 177 abnormal samples but only 28 healthy controls. To balance the two classes, we first randomly select 14 samples from each class to form the testing set, with the remaining samples forming the training set. The 14 healthy samples in the training set are then oversampled to 168 normal samples by copying each image eleven more times. In this way, both the training set and the testing set are roughly balanced. The abnormal samples cover cerebrovascular, neoplastic, degenerative, and inflammatory or infectious diseases. Some samples from our dataset are presented in Fig. 1.

Fig. 1 Some samples of our dataset
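
For clarity, the split and oversampling described above can be summarized in a short sketch (illustrative Python; the list variables, loading step and random seed are assumptions, not the authors' code):

```python
# Illustrative sketch of the split and oversampling described in Sect. 2
# (not the authors' code; image loading and variable names are assumed).
import random

def split_and_balance(abnormal, healthy, n_test=14, n_copies=11, seed=0):
    """abnormal: list of 177 images, healthy: list of 28 images."""
    rng = random.Random(seed)
    rng.shuffle(abnormal)
    rng.shuffle(healthy)

    # 14 samples per class form the testing set.
    test = abnormal[:n_test] + healthy[:n_test]

    # The rest form the training set: 163 abnormal and 14 healthy images.
    train_abnormal = abnormal[n_test:]
    train_healthy = healthy[n_test:]

    # Each healthy training image is copied 11 more times: 14 * 12 = 168.
    train_healthy = train_healthy * (1 + n_copies)

    return train_abnormal + train_healthy, test
```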

3 Methods

We propose a novel abnormal brain detection algorithm based on classical machine learning and deep learning techniques. First, a pre-trained AlexNet is modified and fine-tuned on our brain MRI dataset. Then, we substitute the last several layers of the modified AlexNet with an extreme learning machine. Finally, the extreme learning machine is optimized by a novel chaotic bat algorithm to obtain better generalization ability.

3.1 Improved AlexNet

AlexNet is one of the most well-known deep CNN structures, proposed by Krizhevsky and Sutskever [18]. AlexNet achieved high classification accuracy on the ImageNet dataset, which was a significant breakthrough in the machine learning field. Since then, researchers have put more time and effort into deep learning models, and various deep CNNs have been proposed, such as ResNet [19], GoogLeNet [20] and VGG [21], along with numerous training and optimization algorithms.

In this study, we propose to use batch normalization (BN) to improve the robustness of AlexNet for abnormal brain detection. The distribution of brain MRIs is complex because of the high variance among human brains. As a result, the input distributions of the layers in AlexNet differ from layer to layer, which makes parameter training hard and time-consuming and requires careful initialization. BN was invented to overcome this internal covariate shift. The intuition behind BN is simple: as CNNs are trained in mini-batch mode, BN applies a normalization transform to the activations of each layer so that their means and variances remain fixed. For a random variable x and its values in a mini-batch S:

$$S = \left[ {x_{1} ,x_{2} ,x_{3} , \ldots ,x_{n} } \right]$$
(1)

The mean μS and variance \(\sigma_{S}^{2}\) of x can be obtained by:

$$\mu_{S} = \frac{1}{n}\mathop \sum \limits_{i = 1}^{n} x_{i}$$
(2)
$$\sigma_{S}^{2} = \frac{1}{n}\mathop \sum \limits_{i = 1}^{n} \left( {x_{i} - \mu_{S} } \right)^{2}$$
(3)

So, the normalized values \(\hat{x}_{i}\) can be obtained by:

$$\hat{x}_{i} = \frac{{x_{i} - \mu_{S} }}{{\sqrt {\sigma_{S}^{2} + \varepsilon } }}$$
(4)

where \(\varepsilon\) denotes a small constant that increases numerical stability. Nevertheless, the normalized activations may not be the desired representation of a layer in some cases, so a learnable transformation is added to the result:

$$y_{i} = \gamma \hat{x}_{i} + \alpha$$
(5)

where γ and α are two learnable scale and shift parameters.

With BN, the training of deep CNNs is accelerated and the gradients are less dependent on the initial parameter values. Furthermore, BN serves as a form of regularization, which improves the generalization ability of deep networks.
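
As an illustration of Eqs. (2)-(5), the following sketch normalizes a mini-batch of activations and applies the learnable scale and shift (a NumPy illustration, not the authors' MATLAB implementation; the mini-batch size of 40 in the example mirrors the setting in Sect. 4.2):

```python
# Sketch of the batch normalization transform in Eqs. (2)-(5).
import numpy as np

def batch_norm(x, gamma, alpha, eps=1e-5):
    """x: activations of one mini-batch, shape (n, features)."""
    mu = x.mean(axis=0)                     # Eq. (2): mini-batch mean
    var = x.var(axis=0)                     # Eq. (3): mini-batch variance
    x_hat = (x - mu) / np.sqrt(var + eps)   # Eq. (4): normalization
    return gamma * x_hat + alpha            # Eq. (5): learnable scale and shift

# Example: normalize a mini-batch of 40 activation vectors of dimension 256.
x = np.random.randn(40, 256) * 3.0 + 5.0
y = batch_norm(x, gamma=np.ones(256), alpha=np.zeros(256))
print(y.mean(axis=0)[:3], y.std(axis=0)[:3])  # approximately 0 and 1
```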

3.2 ELM

The improved AlexNet can yield good classification performance, but its classification depends on the last several layers (mostly fully connected layers). We propose to replace these layers with a more efficient classifier, the extreme learning machine, to further improve the detection accuracy. ELM is a training algorithm for the single-hidden-layer feedforward network (SLFN), proposed by Huang and Zhu [22]. An SLFN contains only three layers, namely the input layer, hidden layer and output layer, as shown in Fig. 2. Here, w and β are the input and output weights, respectively, b denotes the biases of the hidden nodes, and x and o represent the input and output, respectively.

Fig. 2 Architecture of SLFN

The advantage of ELM is that it is trained without iteration, which makes it converge much faster than the traditional BPNN [23, 24], while its generalization ability is also promising [25]. The training algorithm contains only three steps. Given a training set M:

$${\mathbf{M}} = \left[ {\left( {\varvec{x}_{1} ,\varvec{t}_{1} } \right),\left( {\varvec{x}_{2} ,\varvec{t}_{2} } \right),\left( {\varvec{x}_{3} ,\varvec{t}_{3} } \right), \ldots ,\left( {\varvec{x}_{\varvec{n}} ,\varvec{t}_{\varvec{n}} } \right)} \right]$$
(6)

where \(\varvec{x}_{i}\) represents the input vector and \(\varvec{t}_{i}\) denotes the corresponding label. ELM first initializes the input weights w and biases b with random values. Then, the hidden-layer output matrix H can be calculated:

$${\mathbf{H}} = \left[ {h_{ji} } \right]_{n \times \hat{N}} ,\quad h_{ji} = g\left( {\varvec{w}_{i} \varvec{x}_{j} + b_{i} } \right),\quad j = 1, \ldots ,n,\;i = 1, \ldots ,\hat{N}$$
(7)

where \(\hat{N}\) denotes the number of hidden nodes and \(g\)(·) denotes the activation function of the hidden layer. Finally, the target is for the ELM output to equal the actual sample labels:

$${\mathbf{H\beta }} = {\mathbf{T}}$$
(8)

where \({\mathbf{T}} = \left( {\varvec{t}_{1} ,\varvec{t}_{2} ,\varvec{t}_{3} , \ldots ,\varvec{t}_{\varvec{n}} } \right)^{\varvec{T}}\). The output weights β can then be obtained by the Moore–Penrose pseudo-inverse:

$${\varvec{\upbeta}} = {\mathbf{H}}^{\dag } {\mathbf{T}}$$
(9)

where \({\mathbf{H}}^{\dag }\) represents the Moore–Penrose pseudo-inverse of H. The training algorithm is summarized in Table 1.

Table 1 Training algorithm of ELM

From the above analysis, it is clear that ELM training is simple to implement, so ELM is widely used in practical applications such as recognition [26], prediction [27] and clustering [28]. Therefore, we employ ELM in this study to replace the last several layers for brain MRI classification. However, the random input weights and biases can impair the robustness of ELM, so we further optimize these parameters with the chaotic bat algorithm proposed below.
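
A minimal sketch of the three-step ELM training in Eqs. (6)-(9) is given below (an illustration only, not the authors' implementation; the sigmoid activation and the one-hot label encoding are assumptions):

```python
# Minimal ELM sketch following Eqs. (6)-(9): random input weights and biases,
# hidden output matrix H, and output weights by Moore-Penrose pseudo-inverse.
import numpy as np

class ELM:
    def __init__(self, n_hidden=500, seed=0):
        self.n_hidden = n_hidden
        self.rng = np.random.default_rng(seed)

    def _hidden(self, X):
        # H[j, i] = g(w_i . x_j + b_i); sigmoid activation assumed here.
        return 1.0 / (1.0 + np.exp(-(X @ self.W + self.b)))

    def fit(self, X, T):
        """X: (n, d) feature matrix, T: (n, classes) one-hot label matrix."""
        d = X.shape[1]
        self.W = self.rng.standard_normal((d, self.n_hidden))  # random input weights
        self.b = self.rng.standard_normal(self.n_hidden)       # random hidden biases
        H = self._hidden(X)
        self.beta = np.linalg.pinv(H) @ T                      # Eq. (9): beta = H^+ T
        return self

    def predict(self, X):
        return self._hidden(X) @ self.beta                     # Eq. (8): O = H beta
```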

3.3 SNN

We also employed the Schmidt neural network (SNN) and the random vector functional link (RVFL) network as classifiers to compare with ELM. SNN and RVFL are both random neural networks, but their structures are different. SNN was proposed by Schmidt and Kraaijveld [29] and contains three layers, as shown in Fig. 3. The weights from the input layer to the hidden layer are randomly assigned, and there are biases in both the hidden layer and the output layer. The output of SNN can be expressed as

Fig. 3 Architecture of SNN

$$\varvec{o}_{j} = \mathop \sum \limits_{i = 1}^{{\hat{N}}} \varvec{\beta}_{i} g\left( {\varvec{w}_{i} \varvec{x}_{j} + b_{i} } \right) + \varvec{b},\quad j = 1, \ldots ,n$$
(10)

where \(\hat{N}\) denotes the number of hidden nodes and \(\varvec{b}\) is the output bias. The training of SNN is similar to that of ELM: the output weights β are obtained by the pseudo-inverse.

3.4 RVFL

RVFL was proposed by Pao and Park [30] and differs from ELM and SNN. RVFL first maps the input features to an enhancement space with random weights and biases. Then, the input features and the enhanced features are concatenated to form the final feature vector. This structure resembles the shortcut connection shown in Fig. 4, similar to the modules in ResNet. Finally, the output weights β are obtained by the pseudo-inverse, as in ELM and SNN.

Fig. 4 Architecture of RVFL
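
The defining feature of RVFL, the direct link from the input features to the output weights, can be sketched as follows (an illustrative NumPy sketch; the sigmoid enhancement function and the single-pass fit are assumptions, not the authors' implementation):

```python
# RVFL sketch: random enhancement features concatenated with the original
# input features (the "direct link"); output weights by pseudo-inverse.
import numpy as np

def rvfl_fit_predict(X_train, T_train, X_test, n_enhance=500, seed=0):
    rng = np.random.default_rng(seed)
    W = rng.standard_normal((X_train.shape[1], n_enhance))  # random weights
    b = rng.standard_normal(n_enhance)                      # random biases

    def features(X):
        H = 1.0 / (1.0 + np.exp(-(X @ W + b)))  # enhancement nodes
        return np.hstack([X, H])                # concatenate input + enhancement

    beta = np.linalg.pinv(features(X_train)) @ T_train
    return features(X_test) @ beta
```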

3.5 Chaotic bat algorithm

The chaotic bat algorithm (CBA) is a swarm intelligence optimization method evolved from the bat algorithm [31]. Inspired by the echolocation behavior of bats, CBA uses a set of bats carrying candidate solutions to search the solution space according to certain strategies. In every iteration, the position, velocity and frequency of each bat are updated based on the best solution found so far. The bat algorithm outperforms traditional particle swarm optimization (PSO), and we introduce chaotic maps to further improve its searching ability.

Chaotic maps are used to update the positions of the bats in our CBA. Among the various chaotic maps, we choose four for optimization: the sine map, cosine map, Gaussian map and logistic map [32]. Their formulae are presented below.


Sine map:

$$x_{k + 1} = \mu { \sin }\left( {\pi x_{k} } \right)$$
(11)

where k denotes the iteration index and μ is a parameter ranging from 0 to 1.


Cosine map:

$$x_{k + 1} = \mu { \cos }\left( {\pi x_{k} } \right)$$
(12)

where μ is a parameter ranging from 0 to 1.


Gaussian map:

$$x_{k + 1} = { \exp }\left( { - \alpha x_{k}^{2} } \right) + \beta$$
(13)

where α and β are two real-valued parameters.


Logistic map:

$$x_{k + 1} = rx_{k} \left( {1 - x_{k} } \right)$$
(14)

where r denotes a positive control parameter (typically set close to 4 to obtain chaotic behavior).
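
The four maps of Eqs. (11)-(14) can be written compactly as follows (the parameter values below are illustrative assumptions, not the settings used in this paper):

```python
# The four chaotic maps of Eqs. (11)-(14); parameter values are illustrative.
import numpy as np

def sine_map(x, mu=0.99):                   # Eq. (11)
    return mu * np.sin(np.pi * x)

def cosine_map(x, mu=0.99):                 # Eq. (12)
    return mu * np.cos(np.pi * x)

def gaussian_map(x, alpha=6.2, beta=-0.5):  # Eq. (13)
    return np.exp(-alpha * x ** 2) + beta

def logistic_map(x, r=4.0):                 # Eq. (14)
    return r * x * (1.0 - x)

# Example: iterate the logistic map from x0 = 0.3.
x = 0.3
for _ in range(5):
    x = logistic_map(x)
```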

CBA first initializes the parameters of the bats with random values. Then, in each iteration, all bats search the solution space with their velocities and update their solutions using the chaotic maps. The best solution of that iteration is obtained by sorting. The iterations continue until the stopping criterion is met. A brief diagram of CBA is given in Fig. 5.

Fig. 5 CBA optimization

3.6 BN-AlexNet-ELM-CBA

We propose an abnormal brain detection method based on the batch-normalized AlexNet, extreme learning machine and chaotic bat algorithm, abbreviated as BN-AlexNet-ELM-CBA. First, we employ a pre-trained AlexNet to extract image features from the brain MRIs. We add batch normalization layers to the AlexNet model to handle the internal covariate shift problem, and we modify the last three layers because the original output contains 1000 nodes while our brain images have only two categories: abnormal and healthy. In total, six BN layers are added to AlexNet, mainly located after the convolution and pooling layers. The original fully connected layer 'fc8' is also replaced by two new fully connected layers: the output dimension of 'drop7' is 4096 × 1 and the original 'fc8' contains 1000 nodes, whereas our abnormal brain detection is a binary problem, so we use two layers 'fc8' and 'fc9' to gradually shrink the dimensions to 256 × 1 and 2 × 1, respectively. We also construct a transfer-AlexNet (T-AlexNet) for performance comparison, which is built by removing all batch normalization layers from BN-AlexNet. The three deep CNN structures are given in Fig. 6, and a code sketch of the modification follows the figure.

Fig. 6 The structures of AlexNet and BN-AlexNet
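
For concreteness, the sketch below shows an analogous modification in PyTorch (an illustration only: the paper's implementation uses the MATLAB deep learning toolbox, and the exact positions of its six BN layers follow Fig. 6, whereas here one BN layer is inserted after each convolution):

```python
# PyTorch sketch of a BN-AlexNet-style modification (illustrative only).
import torch.nn as nn
from torchvision import models

def build_bn_alexnet(num_classes=2):
    net = models.alexnet(pretrained=True)  # ImageNet weights for transfer learning
                                           # (newer torchvision uses weights=...)

    # Insert a BatchNorm2d layer after every Conv2d in the feature extractor.
    layers = []
    for layer in net.features:
        layers.append(layer)
        if isinstance(layer, nn.Conv2d):
            layers.append(nn.BatchNorm2d(layer.out_channels))
    net.features = nn.Sequential(*layers)

    # Replace the original 1000-way output layer with 'fc8' (4096 -> 256)
    # and 'fc9' (256 -> 2) for the binary abnormal/healthy problem.
    net.classifier[6] = nn.Sequential(
        nn.Linear(4096, 256),
        nn.ReLU(inplace=True),
        nn.Linear(256, num_classes),
    )
    return net
```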

Then, the last several layers of BN-AlexNet are replaced by an ELM classifier. To determine the optimal number of layers to be substituted, we propose to search for it based on classification performance: we test the accuracy of our system with n replaced layers under 5 × hold-out validation and select the best setting. The searching algorithm is given in Table 2, and an illustrative sketch follows it.

Table 2 Searching algorithm for the optimal layers to be replaced
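
The search in Table 2 can be sketched as follows (illustrative Python; the candidate values and the caller-supplied `evaluate` function are assumptions standing in for one hold-out run of the BN-AlexNet-ELM system):

```python
# Sketch of the search for the optimal number of replaced layers (Table 2).
# `evaluate(n, run)` is a caller-supplied (hypothetical) function that replaces
# the last n layers of BN-AlexNet with the ELM classifier and returns the
# testing accuracy of one hold-out run.
def search_replaced_layers(evaluate, candidates=(2, 3, 4, 5), n_runs=5):
    best_n, best_acc = None, -1.0
    for n in candidates:
        # 5 x hold-out validation: average accuracy over n_runs repetitions.
        mean_acc = sum(evaluate(n, run) for run in range(n_runs)) / n_runs
        if mean_acc > best_acc:
            best_n, best_acc = n, mean_acc
    return best_n, best_acc
```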

In the chaotic bat algorithm optimization of the ELM, each bat encodes the input weights w and biases b of the ELM. The fitness function f(·) of CBA is the squared error between the predicted labels and the actual labels:

$$f\left( {\varvec{w},\varvec{b}} \right) = \mathop \sum \limits_{i = 1}^{n} \left( {\varvec{o}_{i} - \varvec{t}_{i} } \right)^{2}$$
(15)

where \(\varvec{o}_{i}\) and \(\varvec{t}_{i}\) stand for the ELM output and the image label, respectively, and n denotes the number of training samples. The solutions carried by the bats are updated with their velocities and the chaotic maps:

$$x_{i}^{t} = x_{i}^{t - 1} + v_{i}^{t} + \lambda \times {\text{chaotic}}\left( {x_{i}^{t - 1} } \right)$$
(16)

where \(x_{i}^{t}\) denotes the solution of the i-th bat in the t-th iteration and λ is a weighting parameter ranging from 0 to 1; in this paper, λ is set to 0.3. All evaluation is carried out with 5 × hold-out validation, i.e., we run the systems five times and calculate the average classification performance for comparison. The pseudocode of our BN-AlexNet-ELM-CBA is presented in Table 3, and a brief diagram is illustrated in Fig. 7. Our method provides a general framework built on off-the-shelf deep learning models, and the system can be applied to other image classification tasks with simple parameter tuning. A condensed code sketch of the CBA-ELM optimization follows Fig. 7.

Table 3 Training of BN-AlexNet-ELM-CBA
Fig. 7 Flowchart of our BN-AlexNet-ELM-CBA
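
A condensed sketch of the CBA optimization of the ELM parameters, following Eqs. (15) and (16), is given below (an illustration only: the loudness and pulse-rate updates of the full bat algorithm are omitted, and the sigmoid activation, logistic map and parameter ranges are assumptions):

```python
# Condensed sketch of CBA-optimized ELM following Eqs. (15)-(16).
import numpy as np

def chaotic(x, r=4.0):
    # Logistic map applied element-wise after rescaling the values to [0, 1].
    x01 = (x - x.min()) / (x.max() - x.min() + 1e-12)
    return r * x01 * (1.0 - x01)

def fitness(params, X, T, n_hidden):
    # Eq. (15): squared error between ELM outputs and labels for given (w, b).
    d = X.shape[1]
    W = params[: d * n_hidden].reshape(d, n_hidden)
    b = params[d * n_hidden:]
    H = 1.0 / (1.0 + np.exp(-(X @ W + b)))
    beta = np.linalg.pinv(H) @ T
    return np.sum((H @ beta - T) ** 2)

def cba_optimize(X, T, n_hidden=500, n_bats=20, n_iter=5, lam=0.3, seed=0):
    rng = np.random.default_rng(seed)
    dim = X.shape[1] * n_hidden + n_hidden        # flattened (w, b) per bat
    pos = rng.standard_normal((n_bats, dim))
    vel = np.zeros_like(pos)
    fit = np.array([fitness(p, X, T, n_hidden) for p in pos])
    best, best_fit = pos[fit.argmin()].copy(), fit.min()
    for _ in range(n_iter):
        freq = rng.uniform(0.0, 1.0, (n_bats, 1))
        vel = vel + (pos - best) * freq           # frequency-based velocity update
        pos = pos + vel + lam * chaotic(pos)      # Eq. (16): chaotic position update
        fit = np.array([fitness(p, X, T, n_hidden) for p in pos])
        if fit.min() < best_fit:                  # keep the best solution found so far
            best, best_fit = pos[fit.argmin()].copy(), fit.min()
    return best
```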

4 Experiment

We implemented our BN-AlexNet-ELM-CBA in MATLAB 2018a with the deep learning toolbox. The experiments were run on a laptop with an i7-7700HQ CPU, a GTX 1060 GPU and 16 GB of RAM.

4.1 Dataset

We obtained 359 samples in total and used 331 for training and 28 for testing. The training set contains 163 abnormal samples and 168 normal controls, and the testing set contains 14 images per class. The dataset information is listed in Table 4.

Table 4 Dataset information

4.2 Hyper-parameter settings

We added six batch normalization layers to AlexNet and modified its last three layers to obtain BN-AlexNet. The two fully connected layers 'fc8' and 'fc9' contained 256 and 2 nodes, respectively. BN-AlexNet was trained on our brain images with the stochastic gradient descent with momentum (SGDM) algorithm, using a mini-batch size of 40, a maximum of 3 epochs and a learning rate of 1e-4. T-AlexNet was trained with the same settings as BN-AlexNet.

ELM is a simple structure with only one hyper-parameter, the number of hidden nodes, which we set to 500 considering the input dimension. Finally, the hyper-parameters of CBA were determined: the bat population size was 20 and the maximum number of iterations was 5. The values of all hyper-parameters are provided in Table 5.

Table 5 Hyper-parameter settings

4.3 Evaluation measurements

Six widely used measurements were employed to evaluate the classification performance of our method and to compare it with state-of-the-art approaches: sensitivity, specificity, accuracy, precision, F1 score and the Matthews correlation coefficient (MCC). They can be computed by the following equations:

$${\text{Sensitivity}} = \frac{TP}{TP + FN}$$
(17)
$${\text{Specificity}} = \frac{TN}{TN + FP}$$
(18)
$${\text{Accuracy}} = \frac{TP + TN}{TP + TN + FP + FN}$$
(19)
$${\text{Precision}} = \frac{TP}{TP + FP}$$
(20)
$${\text{F}}1 {\text{score}} = 2 \times \frac{{{\text{Precision }} \times {\text{Sensitivity}}}}{{{\text{Precision }} + {\text{Sensitivity}}}}$$
(21)
$${\text{MCC}} = \frac{{{\text{TP}} \times {\text{TN}} - {\text{FP}} \times {\text{FN}}}}{{\sqrt {\left( {{\text{TP}} + {\text{FP}}} \right) \times \left( {{\text{TP}} + {\text{FN}}} \right) \times \left( {{\text{TN}} + {\text{FP}}} \right) \times \left( {{\text{TN}} + {\text{FN}}} \right) } }}$$
(22)

where TP and FN denote the numbers of abnormal samples correctly and incorrectly classified, respectively, and TN and FP represent the numbers of healthy samples correctly and incorrectly classified, respectively.
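
The six measurements of Eqs. (17)-(22) can be computed directly from the confusion-matrix counts, as the following sketch shows (the example counts are hypothetical, not results from this paper):

```python
# Computing the six measurements of Eqs. (17)-(22) from confusion-matrix counts.
import math

def metrics(tp, tn, fp, fn):
    sensitivity = tp / (tp + fn)                                  # Eq. (17)
    specificity = tn / (tn + fp)                                  # Eq. (18)
    accuracy = (tp + tn) / (tp + tn + fp + fn)                    # Eq. (19)
    precision = tp / (tp + fp)                                    # Eq. (20)
    f1 = 2 * precision * sensitivity / (precision + sensitivity)  # Eq. (21)
    mcc = (tp * tn - fp * fn) / math.sqrt(                        # Eq. (22)
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return sensitivity, specificity, accuracy, precision, f1, mcc

# Hypothetical example with a balanced testing set of 14 abnormal and 14 healthy images.
print(metrics(tp=13, tn=14, fp=0, fn=1))
```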

5 Results and discussion

5.1 Performance of the proposed method

The classification performance of T-AlexNet and BN-AlexNet is presented in Tables 6 and 7, respectively, and the classification results of BN-AlexNet-ELM and BN-AlexNet-ELM-CBA are provided in Tables 8 and 9, respectively. The training time of T-AlexNet was 18 s for a single run. A comparison of the four methods is provided in Table 10 and Fig. 8. BN-AlexNet achieved a sensitivity of 78.57%, a specificity of 97.14% and an overall accuracy of 87.86%, which was better than T-AlexNet in terms of specificity and accuracy; thus, the introduction of batch normalization did improve the classification performance of AlexNet. The performance of BN-AlexNet-ELM was better than that of BN-AlexNet, with an accuracy of 92.86%, and BN-AlexNet-ELM-CBA outperformed BN-AlexNet-ELM with an accuracy of 96.43%. The CBA optimization contributed to the high accuracy of BN-AlexNet-ELM-CBA. Although plain ELM training is extremely fast and finishes within 0.03 s, ELM-CBA training converges within 3 s, which is still acceptable for real-world applications. The number of replaced layers was 3 and the chaotic map was the Gaussian map in our BN-AlexNet-ELM-CBA; detailed analysis is provided in the following sections.

Table 6 Classification performance of T-AlexNet
Table 7 Classification performance of BN-AlexNet
Table 8 Classification performance of BN-AlexNet-ELM
Table 9 Classification performance of BN-AlexNet-ELM-CBA
Table 10 Classification performance comparison of our proposed four methods
Fig. 8 Comparison of the four proposed methods

5.2 Optimal numbers of layers to be replaced

The number of layers to be replaced can make a great difference in our system, because the input dimension of the ELM changes with the number of replaced layers; as a result, the training and testing results of the ELM vary. Therefore, we searched for the optimal number of replaced layers by evaluating the classification performance of our system for each setting. The statistics, averaged over 5 × hold-out validation, are shown in Table 11. Our system achieved over 90% accuracy except with 2 replaced layers, because in that case the feature dimension was only two and most information was lost. BN-AlexNet-ELM-CBA performed best with 3 replaced layers in terms of F1 score and MCC. Although the specificity was highest with 5 replaced layers, the sensitivity was only 90.00%; additionally, 4096 features are far more than 256 features, which inevitably increases the computational complexity. Therefore, the optimal number of layers to be replaced is 3 in this study.

Table 11 Performance of BN-AlexNet-ELM-CBA with different number of replaced layers

5.3 Optimal chaotic map

In this study, we tested the performance of our BN-AlexNet-ELM-CBA with four different chaotic maps (sine, cosine, Gaussian and logistic) based on 5 × hold-out validation. The results are given in Table 12, where 'No map' means the ELM was trained with the bat algorithm but without any chaotic map. We can see that the introduction of chaotic maps generally improves the classification performance, with the exception of the cosine map. The improvement was most obvious in specificity, from 87.14% to the best value of 95.71%, obtained by the Gaussian and cosine maps. The accuracy of BN-AlexNet-ELM-CBA with the Gaussian map was marginally better than with the logistic map, and it also achieved the best F1 score and MCC. Therefore, the Gaussian map was selected as the optimal chaotic map in our method: it provides a better chaotic mechanism for CBA optimization, so the ELM achieves higher generalization ability.

Table 12 Performance of BN-AlexNet-ELM-CBA with different chaotic maps

5.4 Comparison of three classifiers

We compared the performance of ELM, SNN and RVFL using the same image features from BN-AlexNet and the same CBA optimization, i.e., BN-AlexNet-ELM-CBA, BN-AlexNet-SNN-CBA and BN-AlexNet-RVFL-CBA. The statistics, obtained by 5 × hold-out validation, are presented in Table 13. All three methods achieved over 90% accuracy. The sensitivity of BN-AlexNet-SNN-CBA and BN-AlexNet-RVFL-CBA was 98.57%, marginally better than that of BN-AlexNet-ELM-CBA, but the specificity of BN-AlexNet-ELM-CBA was the best of the three. Moreover, BN-AlexNet-ELM-CBA yielded over 90% on all six measurements, so we consider it marginally better than the other two methods.

Table 13 Performance of different classifier structures

5.5 Comparison with state-of-the-art methods

We compared our BN-AlexNet-ELM-CBA with state-of-the-art methods for detecting abnormal brains in MRI, including RBFNN [33], CNN [34], GA [6] and SVM [7]. The detailed information is listed in Table 14. SVM yielded the best sensitivity and CNN achieved the best specificity; however, the gap between sensitivity and specificity of these two methods was relatively large, which resulted in lower accuracy. Our BN-AlexNet-ELM-CBA was marginally worse than SVM in sensitivity and than CNN in specificity, but achieved the best accuracy among all the methods. Meanwhile, BN-AlexNet-ELM-CBA was also robust because of the small differences among the three measurements (Fig. 9).

Table 14 Performance comparison
Fig. 9 Comparison with state-of-the-art methods

6 Conclusion

In this study, we proposed four abnormal brain detection methods for brain MRI: T-AlexNet, BN-AlexNet, BN-AlexNet-ELM and BN-AlexNet-ELM-CBA. Experimental results revealed that BN-AlexNet-ELM-CBA was the best of the four, with a sensitivity of 97.14%, a specificity of 95.71% and an overall accuracy of 96.43%. Our method leverages the feature learning ability of deep neural networks and swarm intelligence for ELM optimization to achieve classification on a small dataset. The introduction of batch normalization and the chaotic bat algorithm improved the generalization ability of our system. Moreover, our method provides a general framework to search for the optimal feature layers in deep CNN models, which is applicable to other image classification tasks.

However, our system can only classify brain images as abnormal or healthy; multi-class classification is more useful in clinical diagnosis and is one of our future research directions. We shall also collect more samples and build a bigger dataset to re-test our method, because deep models generally perform better with larger training sets. In addition, we shall adopt other swarm intelligence algorithms to optimize the ELM in the future.