Introduction

Pathological brain detection systems (PBDSs) can help physicians interpret medical brain images accurately [1–3]. In hospitals, the picture archiving and communication system (PACS) can provide either a 3D brain volume or only a single slice associated with the foci within the brain [4–6]. Nevertheless, scanning the whole 3D brain is expensive and time-consuming [7–9]; hence, we propose a PBDS for single-slice brain images.

At present, neuroradiologists use many neuroimaging methods to examine the brain in two ways: structural and functional. Structural imaging measures the inner structure of the brain, while functional imaging measures its functions. In hospitals, structural imaging is commonly performed by magnetic resonance imaging (MRI), since it displays better resolution for brain soft tissues and, unlike traditional X-ray and computed tomography (CT), it does not involve ionizing radiation [10].

In recent years, various PBDSs have been developed by scholars [11–13]. They can provide user-friendly, professional, and even personalized assistance [14–16]. Their accurate performance motivates an increasing willingness of neuroradiologists to make decisions, and of patients to monitor their health regularly [17–19], with the help of PBDSs.

For instance, El-Dahshan, Hosny and Salem (2010) [20] employed discrete wavelet transform (DWT) and principal component analysis (PCA), and used a K-nearest neighbor (KNN) classifier. Dong, et al. (2011) [21] employed a scaled conjugate gradient (SCG) method to train an artificial neural network (ANN). Das, Chowdhury and Kundu (2013) [22] combined the Ripplet transform (RT) with PCA and least-squares SVM (LS-SVM). Wu (2012) [23] used a support vector machine (SVM) to develop a PBDS. Saritha, Paul Joseph and Mathew (2013) [24] used spider-web plots (abbreviated as SWP) and wavelet-entropy (abbreviated as WE), and then employed a probabilistic neural network (PNN) as the classifier. El-Dahshan, et al. (2014) [25] employed the feedback pulse-coupled neural network (PCNN) to preprocess the brain images, combined DWT and PCA to extract features, and finally employed a back-propagation neural network (BPNN). Wang, et al. (2015) [26] employed the stationary wavelet transform (abbreviated as SWT) to replace the traditional DWT; afterwards, to train the classifier, they designed a new training algorithm, viz., the hybridization of PSO and ABC (abbreviated as HPA). Sun, et al. (2015) [27] combined Hu moment invariants (HMI) with wavelet entropy, and employed the generalized eigenvalue proximal SVM (GEPSVM). Wibmer, et al. (2015) [28] proposed novel Haralick texture image features. Dong, Ji and Yang (2015) [29] proposed two new image features, wavelet packet Tsallis entropy (WPTE) and wavelet packet Shannon entropy (WPSE), and proved that WPTE is an extension of WPSE, i.e., WPSE is a particular case of WPTE. Sheejakumari and Gomathi (2015) [30] proposed an improved PSO with a neural network to classify healthy and pathological tissues. Dong, et al. (2015) [31] used the stationary wavelet transform (SWT) and PCA. Hemanth, et al. (2014) [32] used an iteration-free artificial neural network for abnormal brain image classification. Zhang, et al. (2015) [33] tested wavelet packet Tsallis entropy (WPTE) and fuzzy SVM (FSVM).

Nevertheless, the classification accuracies of the above methods do not meet realistic requirements (high accuracy and fast detection speed); those methods can still be enhanced [34]. Yang, et al. (2015) [35] proposed a novel feature named fractional Fourier entropy (FRFE). Their method was shown to outperform most existing PBDSs. This study continues to use FRFE.

In addition, the multilayer perceptron (MLP) is a type of feedforward neural network (FNN) that has been applied successfully in various fields. In this study, we propose two improvements for MLP: we compare three pruning techniques, and we introduce a relatively new algorithm to train its weights and biases.

The structure of this paper is organized as follows: Section 2 describes the materials. Section 3 shows how to extract features by FRFE. Section 4 describes the mechanism of MLP and presents the two improvements. Section 5 shows the experimental results and discussions. Section 6 concludes the paper. Abbreviations are listed at the end of this paper.

Materials

In PBDS research, there are three open-access datasets, which contain different numbers of brain magnetic resonance (MR) images. Dataset I (D_I) contains 66 brain images, Dataset II (D_II) contains 160 images, and Dataset III (D_III) contains 255 images.

Figure 1 shows sample MR brains, all T2-weighted and of size 256 × 256. T2-weighted (spin-spin) relaxation gives better image contrast, so that different anatomical structures show clearly. Note that all pathological brains in Fig. 1 suffer from structural alteration, which is the basis of the success of our PBDS. Here meningioma, glioma, and sarcoma are neoplastic diseases; AD, AD with VA, PD, and HD are degenerative diseases; MS is an inflammatory disease; SDH is a cerebrovascular disease. Therefore, the chosen images cover various types of brain diseases.

Fig. 1

Samples of MR brains. (AD Alzheimer’s disease, PD Pick’s disease, MS multiple sclerosis, HE Herpes encephalitis, HD Huntington’s disease, SDH subdural hematoma, VA visual agnosia)

Feature extraction

Fractional Fourier transform

Suppose we have a function x(t); its a-angle fractional Fourier transform (FRFT) F_a is defined as

$$ F_a(u) = \int_{-\infty}^{\infty} x(t)\, Z(t, u \mid a)\, \mathrm{d}t $$
(1)

Here t represents time, u denotes frequency, and Z is the transform kernel:

$$ Z(t, u \mid a) = \sqrt{1 - j \cot a}\, \exp\left( j\pi \left( t^2 \cot a - 2ut \csc a + u^2 \cot a \right) \right) $$
(2)

Here j represents the imaginary unit. A problem is that both cot and csc diverge when a is a multiple of π. Taking limits, Eq. (2) can be rewritten as [36]

$$ Z(t, u \mid a) = \begin{cases} \mathbb{D}(t - u) & a/\pi = 2m \\ \sqrt{1 - j \cot a}\, \exp\left( j\pi \left( t^2 \cot a - 2ut \csc a + u^2 \cot a \right) \right) & a/\pi \ne m \\ \mathbb{D}(t + u) & a/\pi = 2m + 1 \end{cases} $$
(3)

where \( \mathbb{D} \) represents the Dirac delta function and m an arbitrary integer. For the 2D FRFT, there is not only an angle (denoted by a) for the x-axis, but also another angle (denoted by b) for the y-axis.

To show the connection between the standard Fourier transform (SFT) and the FRFT, we show in Fig. 2 the FRFT of the rectangular function rect(t), defined as

Fig. 2

FRFT of rect function (a changes from 0 to 1)

$$ \mathrm{rect}(t) = \begin{cases} 0 & |t| > 1/2 \\ 1/2 & |t| = 1/2 \\ 1 & |t| < 1/2 \end{cases} $$
(4)

In Fig. 2, we present the FRFT results with the angle a increasing from 0 to 1 in equal steps of 0.1. Recall that the SFT of rect(t) is sinc(u). In this figure, the red line represents the real part and the blue line the imaginary part. It is easily observed that the FRFT result approximates the SFT result as the value of a increases to 1. This agrees with the theoretical prediction. Another point that can be deduced is that the extra parameter a provides more information than the SFT does.
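To make Eqs. (1)–(4) concrete, the following minimal numerical sketch (ours, not the authors' code) evaluates the FRFT of rect(t) by direct quadrature. It assumes the fractional order a in [0, 1] maps to the rotation angle α = aπ/2, so that a = 1 recovers the SFT.

```python
# Minimal sketch of Eqs. (1)-(4): FRFT of rect(t) by direct quadrature.
# Assumption: order a in [0, 1] corresponds to angle alpha = a * pi / 2.
import numpy as np

def frft_kernel(t, u, alpha):
    """Kernel Z(t, u | alpha) of Eq. (2), for alpha not a multiple of pi."""
    cot, csc = 1.0 / np.tan(alpha), 1.0 / np.sin(alpha)
    amp = np.sqrt(1 - 1j * cot)
    return amp * np.exp(1j * np.pi * (t**2 * cot - 2 * u * t * csc + u**2 * cot))

def frft_rect(u, a, T=8.0, n=4096):
    """Approximate Eq. (1) for x(t) = rect(t) with a Riemann sum on [-T/2, T/2]."""
    alpha = a * np.pi / 2
    t = np.linspace(-T / 2, T / 2, n)
    x = (np.abs(t) < 0.5).astype(float)      # rect(t), Eq. (4)
    dt = t[1] - t[0]
    return np.array([np.sum(x * frft_kernel(t, ui, alpha)) * dt for ui in u])

u = np.linspace(-4, 4, 201)
for a in (0.5, 0.9, 1.0):
    F = frft_rect(u, a)
    print(f"a = {a}: max |F| = {np.abs(F).max():.3f}")
# At a = 1.0 the result matches the SFT of rect, i.e. sinc(u) = sin(pi*u)/(pi*u).
```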

Fractional Fourier entropy

Yang, et al. (2015) [35] combined the FRFT with Shannon entropy and proposed a novel image feature, based on analysis of MR brain images, which they named FRFE. Suppose the Shannon entropy operation is denoted H, the FRFE operation E, and the FRFT operation F as defined above; then we have

$$ E=H\cdot F $$
(5)

Nevertheless, Yang, et al. (2015) [35] used Welch’s t-test (WTT) and found that only 12 angle combinations yield effective features for brain images. Those angle combinations are listed in Table 1. Therefore, our FRFE follows this setting, and we define it as

Table 1 Angle combination of FRFE for brain images
$$ E(x) = H\left[ \bigcup_{(a,b) \in S} F(x) \right] $$
(6)

here x denotes any brain image (pathological or healthy) and S denotes the angle-combination set

$$ S = \{ (0.6, 1.0), (0.7, 1.0), (0.8, 0.9), (0.8, 1.0), (0.9, 0.8), (0.9, 0.9), (0.9, 1.0), (1.0, 0.6), (1.0, 0.7), (1.0, 0.8), (1.0, 0.9), (1.0, 1.0) \} $$
(7)
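The following schematic sketch illustrates Eqs. (6)–(7). The 2D FRFT is applied separably (rows with order a, columns with order b) via sampled-kernel matrices, and the Shannon entropy is taken over the normalized magnitude spectrum; this normalization and the unit sampling grid are our assumptions, not necessarily the choices of [35].

```python
# Schematic sketch of the 12-dimensional FRFE feature of Eqs. (6)-(7).
import numpy as np

S = [(0.6, 1.0), (0.7, 1.0), (0.8, 0.9), (0.8, 1.0), (0.9, 0.8), (0.9, 0.9),
     (0.9, 1.0), (1.0, 0.6), (1.0, 0.7), (1.0, 0.8), (1.0, 0.9), (1.0, 1.0)]

def frft_matrix(n, a):
    """Sampled FRFT kernel of Eq. (2) as an n x n matrix (angle = a*pi/2)."""
    alpha = a * np.pi / 2
    cot, csc = 1.0 / np.tan(alpha), 1.0 / np.sin(alpha)
    t = np.linspace(-1, 1, n)
    dt = t[1] - t[0]
    T, U = np.meshgrid(t, t)                  # T[i,j] = t_j, U[i,j] = u_i
    return np.sqrt(1 - 1j * cot) * np.exp(
        1j * np.pi * (T**2 * cot - 2 * U * T * csc + U**2 * cot)) * dt

def shannon_entropy(p):
    p = p / p.sum()
    p = p[p > 0]
    return float(-np.sum(p * np.log2(p)))

def frfe_features(img):
    """Eq. (6): Shannon entropy of the 2D FRFT magnitude for each (a, b) in S."""
    img = img.astype(float)
    feats = []
    for a, b in S:
        Fa, Fb = frft_matrix(img.shape[0], a), frft_matrix(img.shape[1], b)
        spec = np.abs(Fa @ img @ Fb.T)        # separable 2D FRFT
        feats.append(shannon_entropy(spec))
    return np.array(feats)                    # 12-dim FRFE vector

features = frfe_features(np.random.rand(64, 64))  # stand-in for a brain slice
print(features.shape)                              # (12,)
```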

Multi-layer perceptron

A multilayer perceptron (MLP) is a type of neural network that maps given input data to expected target data. An MLP consists of multiple layers of nodes in a directed graph, each layer fully connected to the next. In this study, we used the common one-hidden-layer MLP, so our model consists of an input layer with d = 12 nodes, a hidden layer with an unknown number M of neurons, and an output layer with c = 1 neuron whose value is either true (denoting pathological) or false (denoting healthy).

For generality (see Fig. 3), suppose [x(n), t(n)] denotes the n-th training sample, where x(n) = [x_1(n), x_2(n), …, x_d(n)]^T (n = 1, 2, …, N) denotes the d-dimensional input vector, and t(n) = [t_1(n), t_2(n), …, t_c(n)]^T the c-dimensional target. The training of the MLP is an optimization problem of minimizing the sum-of-squared-errors cost E between the targets t_k(n) and the actual outputs y_k(n).

Fig. 3

Structure of one-hidden-layer MLP

$$ E = \sum_{n=1}^{N} \sum_{k=1}^{c} \left( y_k(n) - t_k(n) \right)^2 $$
(8)

Assume g is the activation function in the hidden layer, h the activation function in the output layer, A the weight matrix connecting the input layer to the hidden layer, and B the weight matrix connecting the hidden layer to the output layer (k indexes the output dimension); then we have

$$ y_k(n) = h\left( \sum_{j=0}^{M} B_{kj} z_j(n) \right) $$
(9)

where z_j(n) represents the output of the j-th neuron in the hidden layer (j = 1, 2, …, M), with the definition

$$ z_j(n) = g\left( \sum_{i=0}^{d} A_{ji} x_i(n) \right) $$
(10)
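A minimal sketch of this forward pass follows. The j = 0 and i = 0 terms in Eqs. (9)–(10) act as bias nodes, so the input and hidden vectors are augmented with a constant 1; the tanh and sigmoid activations are our choices, since the paper does not state g and h explicitly.

```python
# Minimal sketch of the one-hidden-layer forward pass of Eqs. (9)-(10).
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def mlp_forward(x, A, B, g=np.tanh, h=sigmoid):
    """x: (d,) input; A: (M, d+1) input->hidden; B: (c, M+1) hidden->output."""
    x_aug = np.concatenate(([1.0], x))        # prepend bias input x_0 = 1
    z = g(A @ x_aug)                          # Eq. (10): hidden outputs z_j
    z_aug = np.concatenate(([1.0], z))        # prepend bias neuron z_0 = 1
    return h(B @ z_aug)                       # Eq. (9): outputs y_k

d, M, c = 12, 20, 1                           # 12 FRFE features, 20 hidden, 1 output
rng = np.random.default_rng(0)
A = rng.standard_normal((M, d + 1)) * 0.1
B = rng.standard_normal((c, M + 1)) * 0.1
y = mlp_forward(rng.standard_normal(d), A, B)
print(y)                                      # y > 0.5 -> pathological (true)
```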

Nevertheless, the traditional MLP suffers from two shortcomings: (i) it is difficult to determine the optimal number of hidden neurons; and (ii) the weight training may become trapped in local minima. To solve these problems, we make two improvements in this work.

Pruning technique

The first major problem is to determine the number of hidden neurons. One popular approach is the pruning technique (PT), which starts with more hidden neurons than necessary; however, this leads to a sparsely connected network with most weights near zero. Hence, iterative methods were proposed that remove the neuron with the lowest (or, depending on the measure, the largest) score at each step, until the error estimate increases. Below, we introduce how to define the error estimate e and the score function S.

Error estimation

The apparent error rate (APER) was used as the error estimate e. It can be obtained directly from the confusion matrix. Suppose n_ij is the entry in the i-th row and j-th column of the confusion matrix, i.e., the number of samples of class i predicted as class j; then we have

$$ e_{\mathrm{APER}} = \frac{\sum_{i=1}^{c} \sum_{j=1}^{c} n_{ij} - \sum_{i=1}^{c} n_{ii}}{\sum_{i=1}^{c} \sum_{j=1}^{c} n_{ij}} $$
(11)

APER represents the proportion (in percent) of incorrectly classified samples. Nevertheless, APER tends to underestimate the true error rate due to overfitting. Therefore, stratified cross-validation (SCV) was employed over the datasets. Table 2 shows the statistical characteristics of each dataset. D_I is composed of 18 healthy and 48 pathological brains; hence, it is natural to segment D_I into 6 folds such that each fold contains 3 healthy and 8 pathological brains. For D_II, which has 20 healthy and 140 pathological brains, we divide it into 5 folds so that each fold consists of 4 healthy and 28 pathological brains. The same is done for D_III.
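The sketch below computes Eq. (11) from a confusion matrix; the example matrix is made up for illustration.

```python
# Small sketch of Eq. (11): APER as the off-diagonal mass of the confusion matrix.
import numpy as np

def aper(confusion):
    """Apparent error rate: misclassified / total, from a c x c confusion matrix."""
    confusion = np.asarray(confusion, dtype=float)
    total = confusion.sum()
    correct = np.trace(confusion)             # sum of the n_ii terms
    return (total - correct) / total

# Hypothetical 2-class confusion matrix (rows: true class, columns: predicted).
cm = [[46, 2],    # pathological: 46 right, 2 wrong
      [1, 17]]    # healthy: 17 right, 1 wrong
print(f"APER = {aper(cm):.4f}")               # 3/66 ~ 0.0455
```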

Table 2 Statistical characteristics

Measure of hidden neuron

Three measures of hidden neurons are introduced here and compared in the experiments. Murase, Matsunaga and Nakade (1991) [37] proposed dynamic pruning (DP), which scores each hidden neuron j with the following equation

$$ S_j^{\mathrm{DP}} = \frac{1}{N} \sum_{n=1}^{N} \sum_{k=1}^{c} B_{kj}^2 z_j^2(n) $$
(12)

where S represents the score. Silvestre and Lee Luan (2002) [38] proposed pruning based on the Bayesian detection boundary (BDB). The measure is similar to Eq. (12) except that the quadratic terms are dropped:

$$ S_j^{\mathrm{BDB}} = \frac{1}{N} \sum_{n=1}^{N} \sum_{k=1}^{c} B_{kj} z_j(n) $$
(13)

Based on the Kappa coefficient (KC), Silvestre and Ling (2014) [39] proposed a relatively new measure. Usually a higher KC indicates a better classifier. In the extreme cases, a KC of zero means the classifier succeeds only by chance, and a value of one indicates perfect classification. KC is defined as

$$ S^{\mathrm{KC}} = \frac{N \sum_{k=1}^{c} n_{kk} - \sum_{k=1}^{c} n_{k\bullet} n_{\bullet k}}{N^2 - \sum_{k=1}^{c} n_{k\bullet} n_{\bullet k}} $$
(14)

Here n_{k•} represents the sum of the k-th row of the confusion matrix, and n_{•k} the sum of its k-th column. The KC of the k-th neuron is computed with neuron k absent from the network, i.e., with all weights linking to neuron k deleted. Finally, the neuron with the largest KC should be removed, since the network performs best without it.
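The sketch below implements the DP and BDB scores and the generic pruning loop described above. The error callback `error_fn` is a hypothetical stand-in for an APER evaluation of the pruned network; the KC criterion is only described in the comments, since it requires a full re-evaluation of Eq. (14) per candidate neuron.

```python
# Sketch of the pruning loop with the DP and BDB scores of Eqs. (12)-(13).
# Z: (N, M) matrix of hidden outputs z_j(n); B: (c, M) hidden->output weights
# (bias column omitted for clarity).
import numpy as np

def dp_scores(B, Z):
    """Eq. (12): (1/N) sum_n sum_k B_kj^2 z_j^2(n), in factorized form."""
    return (Z**2).mean(axis=0) * (B**2).sum(axis=0)

def bdb_scores(B, Z):
    """Eq. (13): same as DP but without the quadratic terms."""
    return Z.mean(axis=0) * B.sum(axis=0)

def prune(B, Z, error_fn, score_fn=dp_scores):
    """Drop the lowest-scoring neuron per step until the error estimate rises.
    (For KC one would instead delete the neuron whose removal gives the
    largest Kappa coefficient, re-evaluating Eq. (14) for each candidate.)"""
    keep = list(range(B.shape[1]))
    best_err = error_fn(keep)                 # hypothetical APER callback
    while len(keep) > 1:
        scores = score_fn(B[:, keep], Z[:, keep])
        trial = [k for i, k in enumerate(keep) if i != int(np.argmin(scores))]
        err = error_fn(trial)
        if err > best_err:                    # stop once APER increases
            break
        keep, best_err = trial, err
    return keep                               # indices of surviving neurons
```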

Training method

The second major problem is to determine the optimal weights. Traditionally, back-propagation (BP) was the most common method to train MLPs [40–42]. During the last decade, swarm-intelligence methods have been employed to train MLPs, such as the genetic algorithm (GA) [43], an improved hybrid GA [44], bacterial chemotaxis optimization [45], particle swarm optimization [46], and cuckoo optimization [47]. Biogeography-based optimization (BBO) [48] is a relatively new swarm-intelligence method that has been reported to outperform other swarm-intelligence approaches.

Theory of BBO

Biogeography-based optimization (BBO) was proposed to solve optimization problems based on the study of the geographical distribution of species. It has three main operators: migration, mutation, and elitism [49]. The objective function is expressed as the habitat suitability index (HSI), and the search space as the suitability index variables (SIV) [50].

Migration

Migration modifies each individual in the habitat at random. Suppose s denotes the species count and S the maximum number of species; then the emigration rate a and the immigration rate b are related as

$$ b(s) = B \times \left( 1 - \frac{s}{S} \right) $$
(15)
$$ a(s) = A \times \frac{s}{S} $$
(16)

here A and B represent the maximal emigration and immigration rates, respectively. In the special case A = B, we have

$$ a(s)+b(s)=A=B $$
(17)

Mutation

Mutation occurs at the SIV level. Suppose the mutation rate is denoted w; then

$$ w(s) = W \times \left( 1 - \frac{p(s)}{P} \right) $$
(18)

here p(s) represents the solution probability of species count s, P the maximum value of p, and W the maximum mutation rate. Mutation is implemented by

$$ D_i^{\prime} = D_i + \mathrm{rand}(0,1) \times \left( D_{i,\max} - D_{i,\min} \right) $$
(19)

where D_i represents the i-th decision variable in the search space, and D_{i,max} and D_{i,min} represent its upper and lower bounds, respectively.

Elitism

Elitism occurs at the SIV level, as does mutation. It aims to keep the best solutions within the ecosystem untouched by the mutation operator [51]. Suppose the number of elites is l; then we perform elitism by setting b = 0 for the l elites.
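The sketch below collects the three BBO operators in code. Eq. (19) is followed literally, so in practice the mutated variable may need clipping back into its bounds; parameter names follow the text.

```python
# Compact sketch of the BBO operators: rates of Eqs. (15)-(18), the SIV
# mutation of Eq. (19), and elitism as "b = 0 for the l elites".
import numpy as np

def immigration_rate(s, S, B=1.0):
    return B * (1.0 - s / S)                  # Eq. (15): falls as habitat fills

def emigration_rate(s, S, A=1.0):
    return A * s / S                          # Eq. (16): rises with species count

def mutation_rate(p_s, P, W=0.01):
    return W * (1.0 - p_s / P)                # Eq. (18): improbable solutions mutate more

def mutate_siv(D, D_min, D_max, rng):
    """Eq. (19), taken literally; the result may need clipping to [D_min, D_max]."""
    return D + rng.random() * (D_max - D_min)

def apply_elitism(b_rates, fitness, l=2):
    """Zero the immigration rate of the l fittest habitats (higher = better)."""
    b_rates = b_rates.copy()
    b_rates[np.argsort(fitness)[-l:]] = 0.0
    return b_rates

# Eq. (17): with A == B the two rates sum to that common maximum for any s.
s, S = 3, 10
assert abs(immigration_rate(s, S) + emigration_rate(s, S) - 1.0) < 1e-12
```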

Adaptive real-coded BBO

The real-coded technique has been introduced to improve the performance of BBO. Gong, et al. (2010) [52] extended the original BBO and presented a real-coded biogeography-based optimization with mutation (RCBBO). Later, Kumar and Premalatha (2015) [53] introduced an adaptive mechanism into RCBBO and proposed adaptive RCBBO (ARCBBO).

ARCBBO makes two improvements to the standard BBO. First, ARCBBO represents individuals by real-parameter vectors; hence, Equation (19) must be modified. Kumar and Premalatha (2015) [53] proposed a probability-based Gaussian mutation to improve the convergence characteristics:

$$ D_i^{\prime} = D_i + N\left( m, \sigma_i^2 \right) $$
(20)

where N represents a Gaussian random number with mean m and variance σ_i²; m is set to zero. Second, an adaptive mechanism is introduced into the Gaussian mutation, in order to improve the worst half of the population by changing σ_i adaptively:

$$ \sigma_i = \beta(k) \times \sum_{i=1}^{n} \frac{F_i}{f_{\min}} \times \left( D_{i,\max} - D_{i,\min} \right) $$
(21)

where f_min represents the minimum fitness value in the whole ecosystem, F_i the fitness value of the i-th habitat, and β(k) an adaptive parameter at the k-th iteration, of the form

$$ \beta(k) = 1 - \frac{0.995}{K} \times k $$
(22)

where K denotes the maximum iteration number. Note that the adaptive mutation above applies only to the mutation operator; for ecosystem initialization, we still use a random generator. In all, Table 3 shows the pseudocode of ARCBBO, divided into eight steps.
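The sketch below follows Eqs. (20)–(22) literally; our reading of the sum in Eq. (21) (over the n habitats, with per-habitat fitness F_i and variable bounds) and all numbers are assumptions for illustration.

```python
# Sketch of the ARCBBO adaptive Gaussian mutation, Eqs. (20)-(22).
import numpy as np

def beta(k, K):
    return 1.0 - 0.995 * k / K                # Eq. (22): decays over iterations

def adaptive_sigma(F, D_max, D_min, k, K):
    """Eq. (21), read literally: scale from relative fitness and variable ranges."""
    return beta(k, K) * np.sum((F / F.min()) * (D_max - D_min))

def gaussian_mutation(D, sigma, rng):
    return D + rng.normal(0.0, sigma)         # Eq. (20), with mean m = 0

rng = np.random.default_rng(1)
F = np.array([3.2, 1.5, 2.8, 4.1])            # made-up fitness of 4 habitats
bounds_hi, bounds_lo = np.ones(4), -np.ones(4)
sigma = adaptive_sigma(F, bounds_hi, bounds_lo, k=10, K=100)
print(gaussian_mutation(0.3, sigma, rng))
```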

Table 3 Implementation of ARCBBO

Results and discussions

Our PBDS is composed of four parts: FRFE, MLP, PT, and ARCBBO. For the PT, we introduced three different measures: DP, BDB, and KC. Figure 4 shows the diagram of the proposed PBDS.

Fig. 4

Diagram of our PBDS

For statistical analysis, stratified cross-validation (SCV) was used [54]: 6-fold SCV was employed for D_I, and 5-fold SCV for D_II and D_III. Pathological (P) brains were labeled true, and healthy (H) brains were labeled false. Each experiment was run 10 times. The effectiveness of FRFT and FRFE was already reported in [35].
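As a sketch of this protocol, the snippet below runs 10 repetitions of stratified K-fold cross-validation with scikit-learn; `train_mlp` is a hypothetical stand-in for the proposed KC-MLP + ARCBBO trainer, not a real API.

```python
# Sketch of the evaluation protocol: 10 repetitions of stratified K-fold CV
# (K = 6 for D_I, K = 5 for D_II and D_III).
import numpy as np
from sklearn.model_selection import StratifiedKFold

def evaluate(X, y, k_folds, train_mlp, runs=10, seed=0):
    """Mean accuracy over `runs` repetitions of stratified k-fold CV."""
    accs = []
    for r in range(runs):
        skf = StratifiedKFold(n_splits=k_folds, shuffle=True, random_state=seed + r)
        for tr, te in skf.split(X, y):        # folds keep the H/P class ratio
            model = train_mlp(X[tr], y[tr])   # hypothetical trainer
            accs.append(np.mean(model.predict(X[te]) == y[te]))
    return float(np.mean(accs))
```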

ARCBBO versus BBO and RCBBO

In the first experiment, we compared ARCBBO with the standard BBO and RCBBO. We set the hidden neuron number to 20, and no pruning technique was employed. The average accuracies of 10 runs of K-fold SCV are shown in Fig. 5.

Fig. 5

Comparison among BBO, RCBBO, and ARCBBO (No Pruning Technique)

The average accuracies over D_I, D_II, and D_III by BBO were 99.09, 97.81, and 95.76 %, respectively. The average accuracies by RCBBO were 99.24, 97.69, and 96.12 %, respectively. Moreover, using ARCBBO, the average accuracies over D_I, D_II, and D_III increased to 99.85, 98.38, and 97.02 %, respectively.

The comparison in Fig. 5 among BBO, RCBBO, and ARCBBO validates that ARCBBO is more effective in training the MLP for PBDS than both BBO and RCBBO. The reason is that the real-coded representation and adaptive mechanism in ARCBBO improve population diversity and exploration ability, leading to better convergence and robustness than BBO. In the future, we will also test other improved BBO variants, such as multiobjective BBO [55], grouping BBO [56], and optimal integrated BBO [57].

Pruning technique comparison

In the second experiment, we compared three different pruning techniques (PTs). ARCBBO was chosen since it was shown to outperform BBO in Section 5.1. The comparison was based on 10 repetitions of K-fold SCV, with APER as the error estimate. The results are shown in Fig. 6.

Fig. 6

Pruning technique comparison (NPT no pruning technique)

Here, NPT denotes no pruning technique. For D_I, the NPT, DP, BDB, and KC approaches obtained average accuracies of 99.85, 100.00, 100.00, and 100.00 %, respectively. For D_II, they obtained 98.38, 99.19, 99.31, and 99.75 %, respectively. For D_III, they obtained 97.02, 98.24, 98.12, and 99.53 %, respectively.

The pruning technique comparison in Fig. 6 suggests that using a pruning technique gives better performance than not using one. The reason is that the MLP will contain plenty of near-zero weights and biases if a large hidden neuron number is simply assigned, and thus overfits the validation sets. After a pruning technique is employed, the unnecessary neurons are removed, and overfitting is avoided. The comparison also demonstrates that the KC method is superior to the DP and BDB methods.

The best proposed approach

From the above, we know the best proposed approach is “FRFE + KC-MLP + ARCBBO”. In this section, we report in Table 4 the classification results of each run and each fold over the largest dataset, D_III. For example, in the first run our algorithm succeeds in predicting 50 instances in Fold 1 and all 51 instances in each of the other four folds; hence, it achieves an accuracy of 99.61 % for that run. Summarizing all 10 runs, the average accuracy of our algorithm is 99.53 %.

Table 4 Classification results over D_III

Classifier comparison

In the fourth experiment, we compared the best proposed classifier, “KC-MLP + ARCBBO”, with the naive Bayesian classifier (NBC) [35] and the support vector machine (SVM) [35]. All methods used FRFE and ran 10 repetitions of K-fold SCV. Table 5 shows the comparison results.

Table 5 Classifier comparison (FRFE was used for all)

The classifier comparison in Table 5 shows that the proposed KC-MLP + ARCBBO gives better classification performance than both NBC and SVM. This indicates that the MLP has the potential to outperform NBC and SVM, provided the user carefully tunes its hidden neuron number and training algorithm.

Comparison to state-of-the-art approaches

In the fifth experiment, we compared FRFE + KC-MLP + ARCBBO with 11 approaches: DWT + PCA + KNN [20], DWT + PCA + SCG-ANN [21], RT + PCA + LS-SVM [22], DWT + PCA + SVM [23], WE + SWP + PNN [24], PCNN + DWT + PCA + BPNN [25], SWT + PCA + HPA-ANN [26], WE + HMI + GEPSVM + RBF [27], WPTE + GEPSVM [29], SWT + PCA + GEPSVM [31], and WPTE + FSVM [33]. Table 6 shows the comparison results together with the feature numbers. Here we only report the results over D_III, since the other two datasets are too small. The abbreviations can be found in Table 7.

Table 6 Classification comparison over D_III

Table 6 shows that the proposed FRFE + KC-MLP + ARCBBO achieved the highest average accuracy of 99.53 %, better than the 11 state-of-the-art approaches: DWT + PCA + KNN [20] with 96.79 %, DWT + PCA + SCG-ANN [21] with 98.82 %, RT + PCA + LS-SVM [22] with 99.39 %, DWT + PCA + SVM [23] with 94.29 %, WE + SWP + PNN [24] with 98.86 %, PCNN + DWT + PCA + BPNN [25] with 98.24 %, SWT + PCA + HPA-ANN [26] with 99.45 %, WE + HMI + GEPSVM + RBF [27] with 98.63 %, WPTE + GEPSVM [29] with 99.33 %, SWT + PCA + GEPSVM [31] with 99.02 %, and WPTE + FSVM [33] with 99.49 %. The improvement may be small in magnitude, but it was obtained over 10 repetitions of K-fold SCV; hence, the improvement of our method is reliable.

Conclusions and future research

This paper proposed a new PBDS, “FRFE + KC-MLP + ARCBBO”. The experiments validated its effectiveness, with an average accuracy of 99.53 %. Our contributions lie in three points: we compared three different pruning techniques for MLP and showed that KC is the most effective; we introduced ARCBBO and showed that it gives better performance than BBO; and we demonstrated that the proposed PBDS is superior to 11 state-of-the-art PBDS methods.

In the future, we will include images obtained with other modalities, such as MRSI [58]. Further, other advanced pruning techniques will be tested. Deep learning [59] will be considered after we obtain enough brain images. The Internet of Things [60] is another potential field in which to embed this PBDS.