Introduction

Brain diseases have caused a marked rise in mortality among individuals of all age groups across the globe. Pathological brain detection (PBD) plays a significant role in the early identification of various conditions such as Alzheimer’s disease [36], mild cognitive impairment, autism spectrum disorder [6], multiple sclerosis [33], hearing loss [34], and microbleeding [43]. The major objective of PBD is to assist radiologists in arriving at correct and quick clinical decisions. In PBD, a non-invasive imaging modality called magnetic resonance imaging (MRI) is often used since it offers better resolution of brain tissues [32]. However, manual interpretation of MR images is a costly, tedious and time-consuming task [2, 13, 15]. Hence, the current trend is to develop automated PBD systems (PBDSs) with the help of image processing and machine learning algorithms, which can detect brain diseases in less time. Further, it has been shown that PBDSs are effective and have practical applications.

Many attempts have been made toward the development of various PBDSs in the past decade [4]. However, the accuracy of these systems still requires notable improvement in order to meet the needs of real-world diagnostic situations. Hence, pathological brain detection remains an open and challenging problem for researchers. The goal of this study is to improve the performance of the system for pathological brain detection.

It has been observed that the discrete wavelet transform (DWT) is the most used feature extractor in PBDSs as it analyzes images at several scales and handles one-dimensional (1D) singularities effectively. However, it has limited capability of representing two-dimensional (2D) singularities (edges of an image). That is, DWT is not able to capture curve-like features effectively from the images. Therefore, to handle this issue, advanced transforms are in great demand. Further, classifiers like support vector machine (SVM) and feed-forward neural network (FNN) are often used in earlier PBDSs. To train FNN, traditional gradient-based learning algorithms such as Levenberg-Marquardt (LM) and back-propagation (BP) are used, which have many limitations such as trapping at local minima, slow learning speed, and a large number of learning epochs. Furthermore, the traditional SVM classifier incurs higher computational complexity and performs poorly on large datasets.

To overcome the aforementioned problems, we propose a novel PBDS in this paper. The main contributions of this study are summarized as follows:

  (a) Two dimensional PCA (2DPCA) is explored to extract the features from MR images.

  (b) To combat the issues of conventional learning algorithms, a simple and effective learning technique known as extreme learning machine (ELM) is employed.

  (c) To further enhance the performance of standard ELM, a new learning algorithm known as MDE-ELM based on modified differential evolution (MDE) and ELM is proposed.

  (d) To test the effectiveness of the suggested scheme, extensive experiments are conducted on three well-known datasets. In this context, the suggested scheme is compared against its counterparts with respect to classification accuracy and number of features required.

The remaining part of the article is structured as follows. Section “Related work” summarizes the related works. Section “Datasets used” offers the description of the datasets used in this study. Section “Proposed work” discusses the proposed methodology. In “Experimental results and analysis”, the experimental details and comparisons are presented. Finally, the concluding remarks are drawn in “Conclusions and future work”.

Related work

A significant number of PBDSs have been proposed in the past decade [4, 16]. Chaplot et al. [1] have suggested the use of 2D discrete wavelet transform (2D DWT) and support vector machine (SVM) for feature extraction and classification. El-Dahshan et al. [5] have employed 2D DWT and two classifiers, namely k-nearest neighbor (KNN) and feed forward back-propagation artificial neural network (FP-ANN). To reduce the feature dimensionality, they have applied principal component analysis (PCA). The authors in [32, 38, 40] have used scaled conjugate gradient (SCG), particle swarm optimization (PSO), adaptive chaotic PSO (ACPSO), and scaled chaotic artificial bee colony (SCABC) to train the feed forward neural network (FNN) classifier. Zhang et al. [39] have combined DWT, PCA and kernel SVM (KSVM). In [2], a PBDS based on Ripplet transform (RT), PCA and least squares SVM (LS-SVM) is suggested. In [18], the authors harnessed wavelet entropy (WE) to extract features and used a probabilistic neural network (PNN) for classification. Later, in [4], the authors have combined feedback pulse coupled neural network (FPCNN), DWT, PCA and FNN to detect pathological brain. Zhang et al. [41] have used weighted-type fractional Fourier transform (WFRFT) and two individual classifiers, namely generalized eigenvalue proximal SVM (GEPSVM) and twin SVM (TSVM). Later, Yang et al. [26] have used wavelet energy values as features. They have applied biogeography-based optimization (BBO) to train the SVM classifier. Dong et al. [31] have utilized wavelet packet Shannon entropy (WPSE) and wavelet packet Tsallis entropy (WPTE) separately as features. Here, GEPSVM is employed as the classifier. Nayak et al. [13] have utilized 2D DWT, probabilistic PCA (PPCA) and AdaBoost with random forests (ADBRF) for identifying pathological brains. In [30], the authors have offered a PBDS which combines stationary wavelet transform (SWT), PCA, and GEPSVM.
In [12], a PCA+LDA technique is applied on the 2D DWT features. In [45], a Naive Bayes classifier (NBC) based PBDS is proposed which uses WE features, while in [29], wavelet energy and SVM are used. Sun et al. [37] have utilized a GEPSVM+RBF classifier on WE and Hu moment invariants (HMI) features. Wang et al. [23] have proposed a novel feature called fractional Fourier entropy (FRFE) and performed Welch’s t-test (WTT) to select the relevant features. A twin SVM (TSVM) classifier is employed for classification. Later, in [35], a PBDS based on FRFE features and multilayer perceptron (MLP) is proposed. The authors have employed an adaptive real coded BBO (ARCBBO) approach for training the MLP. In this case, the number of hidden neurons of the MLP is found using three separate pruning methods, namely, Bayesian detection boundaries (BDB), dynamic pruning (DP) and Kappa coefficient (KC). Chen et al. [42] have utilized Minkowski-Bouligand dimension (MBD) features and proposed an improved PSO (IPSO) to train the single-hidden layer feedforward neural network. Dash et al. [14] have introduced a PBDS harnessing fast discrete curvelet transform and LS-SVM. Later on, Wang et al. [22] have combined the variance and entropy (VE) values of dual-tree complex wavelet transform (DTCWT) and TSVM to detect pathological brain. Li et al. [21] have employed wavelet packet Tsallis entropy (WPTE) and FNN with real-coded biogeography-based optimization (RCBBO) for pathological brain detection.

The literature study shows that most PBDSs use different forms of wavelet transform such as DWPT, SWT and DTCWT as the feature extractor. Despite the merits of these approaches, it has been observed that none of them achieves perfect classification accuracy in all cases. Therefore, the application of proper feature extraction algorithms needs to be explored. Further, classifiers like SVM and FNN are frequently used in the existing PBDSs in spite of their many limitations. Moreover, it has been found that a few PBDSs need a large number of features and hence, there is scope to limit the feature requirement without compromising the accuracy. It is noted that Yang et al. [28] have proposed an efficient and novel image feature extraction technique called two dimensional PCA (2DPCA) which has gained tremendous attention from researchers in the last decade. 2DPCA was initially applied to the face recognition task and has since been leveraged in many applications.

In order to combat the above issues, we propose an efficient PBDS to classify an MR image as healthy or pathological. The proposed PBDS utilizes 2DPCA for feature extraction. Subsequently, a PCA+LDA approach is employed in order to decide the most significant feature set. Lastly, an improved learning algorithm called MDE-ELM is proposed which offers several advantages over other classifiers like FNN, SVM, LS-SVM and ELM, such as local minima avoidance, better generalization capability, faster learning rate, and better conditioning.

Datasets used

The proposed PBDS has been evaluated on three benchmark datasets, namely, DS-I, DS-II, and DS-III, which contain 66, 160 and 255 brain MR images respectively. The datasets comprise T2-weighted brain MR images of size 256 × 256 in the axial view plane, which were downloaded from the Harvard Medical School website [10]. Both DS-I and DS-II hold samples of seven categories of diseases, namely sarcoma, glioma, meningioma, Alzheimer’s disease (AD) plus visual agnosia (VA), Pick’s disease (PD), AD and Huntington’s disease (HD), plus healthy brain samples. DS-III additionally includes four more diseases, namely cerebral toxoplasmosis (CTP), multiple sclerosis (MS), herpes encephalitis (HE), and chronic subdural hematoma (CSH). The proposed work deals with a binary classification problem (healthy or pathological), where the pathological class contains images from all kinds of diseases. Samples of all kinds of MR images are shown in Fig. 1.

Fig. 1 T2-weighted brain MR samples [2]

Proposed work

The proposed system involves four stages: contrast limited adaptive histogram equalization (CLAHE) based preprocessing, 2DPCA based feature extraction, PCA+LDA based feature reduction, and MDE-ELM based classification. The input of the system is an MR image and the output is the class label (healthy or pathological). The overview of the proposed PBDS is depicted in Fig. 2. A detailed description of each stage is presented below.

Fig. 2 Overview of the proposed framework

Preprocessing using CLAHE

It is observed that most of the images in the datasets considered in this study are of low contrast. Therefore, a standard contrast enhancement technique named CLAHE is employed. CLAHE first computes a histogram of gray values in a contextual region surrounding each pixel and then assigns each pixel an intensity within the display range [17]. Additionally, it uses a fixed value called the clip limit, which is used to clip the histogram prior to the computation of the cumulative distribution function (CDF). The parts of the histogram that exceed the clip limit are redistributed equally among all histogram bins.
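The clipping and redistribution step can be illustrated with a minimal NumPy sketch. It is not the implementation used in this study: the function names are hypothetical, the clip limit is assumed to be given as an absolute bin count, and only a single redistribution pass over one contextual region is shown.

```python
import numpy as np

def clip_and_redistribute(hist, clip_limit):
    """Clip a histogram at clip_limit and spread the excess evenly over all bins."""
    hist = np.asarray(hist, dtype=float)
    # Total mass lying above the clip limit
    excess = np.sum(np.maximum(hist - clip_limit, 0.0))
    clipped = np.minimum(hist, clip_limit)
    # Equal share of the excess goes to every bin (single pass, an assumption)
    return clipped + excess / hist.size

def equalization_map(hist, clip_limit):
    """Map each gray level to [0, 1] through the CDF of the clipped histogram."""
    clipped = clip_and_redistribute(hist, clip_limit)
    cdf = np.cumsum(clipped)
    return cdf / cdf[-1]
```

The total number of counted pixels is preserved by this step. After a single pass some bins may again exceed the clip limit, which practical implementations usually handle by repeating the redistribution.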

Feature extraction using 2DPCA

Two-dimensional PCA (2DPCA) has been shown to be promising in the domain of feature extraction and feature reduction over the last decade due to its salient properties such as lower memory storage and lower computational overhead [25]. In addition, 2DPCA enjoys a decorrelation property, so the feature vectors extracted from images are uncorrelated. It was originally applied to face recognition tasks and has since been successfully applied in several applications. This motivates us to employ 2DPCA for extracting features from brain MR images. Mathematically, it is described as follows.

For \(P\) given training MR images \(I_{j}\) (\(j = 1,2,\ldots,P\)) of size \(m \times n\), the image covariance matrix in 2DPCA takes the form [28]

$$ Cov=\frac{1}{P}\sum\limits_{j = 1}^{P}(I_{j}-\bar{I})^{T}(I_{j}-\bar{I}) $$
(1)

Here, \(Cov\) denotes a non-negative definite matrix of size \(n \times n\) and \(\bar {I}\) is the mean of all the training images.

Next, we evaluate the eigenvalues and eigenvectors of the matrix \(Cov\). Then, the \(\alpha\) eigenvectors \(V_{1},V_{2},\ldots,V_{\alpha}\) (also called the projection vectors of 2DPCA) corresponding to the \(\alpha\) largest eigenvalues are selected as the transforming axes and used for feature extraction. 2DPCA projects an image \(I\) onto the transforming axes and takes the resulting \(\alpha\) projections (projected vectors) as features, which is stated as

$$ R_{i}=IV_{i}, \ i = 1,2,\ldots,\alpha $$
(2)

It is worth mentioning here that the value of \(\alpha\) is selected using a measure called the normalized cumulative sum of variances (NCSV). The NCSV value for the \(a\)th eigenvector is calculated as

$$ NCSV(a)=\frac{\sum\limits_{u = 1}^{a}\lambda(u)}{\sum\limits_{u = 1}^{n}\lambda(u)} \qquad \quad ;\ 1\le a \le n $$
(3)

where \(\lambda(u)\) represents the eigenvalue of the \(u\)th eigenvector and \(n\) denotes the total number of eigenvectors, sorted in descending order of eigenvalues. Here, we choose a threshold value manually and select the smallest number of eigenvectors (denoted \(\alpha\)) for which the NCSV value surpasses the threshold. As mentioned earlier, these \(\alpha\) eigenvectors are retained for extraction of features from the MR images.
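As an illustration of Eq. 3, the following sketch computes the NCSV curve and the resulting number of retained eigenvectors. The function name is hypothetical, and treating "surpasses" as "greater than or equal to" is an assumption.

```python
def select_by_ncsv(eigenvalues, threshold):
    """Return (alpha, ncsv): the smallest number of leading eigenvalues
    whose normalized cumulative sum reaches the threshold, and the NCSV curve."""
    lam = sorted(eigenvalues, reverse=True)   # descending order of eigenvalues
    total = sum(lam)
    ncsv, running = [], 0.0
    for value in lam:
        running += value
        ncsv.append(running / total)          # NCSV(a) for a = 1, 2, ..., n
    for a, score in enumerate(ncsv, start=1):
        if score >= threshold:                # assumption: >= rather than strict >
            return a, ncsv
    return len(lam), ncsv

# Toy eigenvalues, not taken from the datasets of this study
alpha, curve = select_by_ncsv([4.0, 3.0, 2.0, 1.0], threshold=0.8)
```

For these toy eigenvalues the curve is 0.4, 0.7, 0.9, 1.0, so three eigenvectors are retained at a threshold of 0.8.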

For each input MR image, we apply 2DPCA and obtain the features. The implementation procedure of feature extraction is outlined in Algorithm 1.

Algorithm 1 Feature extraction using 2DPCA
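A minimal NumPy sketch of the feature extraction stage (Eqs. 1 and 2) is given below. It is an illustrative reading of the procedure rather than the exact algorithm of this study: the function names are hypothetical, and the number of projection vectors \(\alpha\) is assumed to have been chosen beforehand (for example with the NCSV measure).

```python
import numpy as np

def fit_2dpca(images, alpha):
    """images: array of shape (P, m, n).
    Returns the projection matrix V (n, alpha) and all eigenvalues (descending)."""
    images = np.asarray(images, dtype=float)
    centered = images - images.mean(axis=0)
    # Eq. 1: Cov = (1/P) * sum_j (I_j - mean)^T (I_j - mean), an n x n matrix
    cov = np.einsum("pij,pik->jk", centered, centered) / images.shape[0]
    eigvals, eigvecs = np.linalg.eigh(cov)     # eigh returns ascending eigenvalues
    order = np.argsort(eigvals)[::-1]          # reorder to descending
    return eigvecs[:, order[:alpha]], eigvals[order]

def extract_features(image, V):
    """Eq. 2: R_i = I V_i for each retained axis, flattened to m * alpha features."""
    return (np.asarray(image, dtype=float) @ V).ravel()
```

For a 256 × 256 image and 26 projection vectors, `extract_features` returns a vector of 26 × 256 = 6656 values.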

Feature reduction using PCA+LDA

It has been observed that the features extracted using 2DPCA are of high dimension, and a high dimensional feature vector leads to high computational overhead and storage requirements. Hence, dimensionality reduction is of great importance. PCA has been found to be effective in reducing feature dimension: it transforms high dimensional input data to a lower dimensional space while retaining the maximum variation of the data. In contrast, linear discriminant analysis (LDA) attempts to find a feature subspace that best discriminates between the classes. However, conventional LDA performs poorly on high dimensional, small sample size problems, as in this case the within-class scatter matrix (\(S_{w}\)) is always singular [27]. To make sure that \(S_{w}\) does not become singular, we need at least \(D + C\) samples (where \(D\) is the dimension of the feature vector and \(C\) is the number of classes), which is generally not possible in practice [11]. To address this issue, an approach called PCA+LDA is harnessed in the proposed system, where \(D\)-dimensional data is first reduced to \(M\) dimensions using PCA and then to \(L\) dimensions using LDA, with \(L \ll M < D\). It may be noted that the optimal number of features (\(L\)) required in our system is selected using the NCSV measure. The overall steps involved in the feature reduction stage are listed in Algorithm 2.

Algorithm 2 Feature reduction using PCA+LDA
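The two-step reduction can be sketched in NumPy as follows. This is an illustrative sketch under stated assumptions, not the exact Algorithm 2: the function names are hypothetical, \(M\) and \(L\) are passed in directly instead of being chosen by NCSV, and \(M\) is assumed small enough that the within-class scatter matrix in the PCA space is non-singular.

```python
import numpy as np

def pca_lda_fit(X, y, M, L):
    """X: (N, D) feature matrix, y: (N,) class labels.
    Returns the training mean and a (D, L) projection matrix."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    mean = X.mean(axis=0)
    Xc = X - mean
    # PCA via SVD: rows of Vt are principal directions, largest variance first
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    W_pca = Vt[:M].T                            # (D, M)
    Z = Xc @ W_pca                              # (N, M); its global mean is zero
    Sw = np.zeros((M, M))                       # within-class scatter
    Sb = np.zeros((M, M))                       # between-class scatter
    for c in np.unique(y):
        Zc = Z[y == c]
        mu = Zc.mean(axis=0)
        Sw += (Zc - mu).T @ (Zc - mu)
        Sb += Zc.shape[0] * np.outer(mu, mu)
    # LDA directions: leading eigenvectors of Sw^{-1} Sb
    eigvals, eigvecs = np.linalg.eig(np.linalg.solve(Sw, Sb))
    order = np.argsort(eigvals.real)[::-1][:L]
    W_lda = eigvecs[:, order].real              # (M, L)
    return mean, W_pca @ W_lda

def pca_lda_transform(X, mean, W):
    """Project samples into the L-dimensional PCA+LDA space."""
    return (np.asarray(X, dtype=float) - mean) @ W
```

Applying PCA first keeps \(S_{w}\) invertible, which is the reason the text gives for not applying LDA directly to the high dimensional 2DPCA features.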

Classification based on MDE-ELM

Extreme learning machine (ELM)

Extreme learning machine (ELM) is a simple and efficient learning algorithm for training single-hidden layer feed-forward neural networks (SLFNs) that avoids the limitations of gradient based learning schemes [8]. It has been successful in solving problems such as multi-label classification and regression. In contrast to conventional learning schemes such as BP, SVM and LS-SVM, ELM learns faster with better generalization performance [7]. In ELM, the hidden node parameters (the input weights and hidden biases) are randomly assigned, while the output weights of the SLFN are calculated analytically by a simple generalized inverse operation on the hidden layer output matrix.

Given \(N\) distinct training samples \((x_{j},t_{j})\), where \(x_{j}=[x_{j1},x_{j2},\ldots,x_{jL}]^{T} \in \mathbb{R}^{L}\) and \(t_{j} = [t_{j1},t_{j2},\ldots,t_{jC}]^{T} \in \mathbb{R}^{C}\), the number of hidden nodes \(n_{h}\) and an activation function \(\phi(\cdot)\), the ELM algorithm can be expressed as follows.

  1. Generate hidden node parameters randomly (\({w^{h}_{i}},b_{i}\)), \(i = 1,2,\ldots,n_{h}\).

  2. Compute the hidden layer output matrix \(H\), whose \((j,i)\)th entry is \(\phi({w^{h}_{i}} \cdot x_{j} + b_{i})\).

  3. Compute the output weight matrix \(w^{o} = H^{\dagger} T\).

Here, \({w^{h}_{i}}=\left [ w^{h}_{i1},w^{h}_{i2},\ldots ,w^{h}_{iL}\right ]^{T}\) represents the weight vector that links the \(i\)th hidden neuron and the input neurons, \({w^{o}_{i}}=\left [ w^{o}_{i1},w^{o}_{i2},\ldots ,w^{o}_{iC}\right ]^{T}\) indicates the weight vector that connects the \(i\)th hidden neuron and the output neurons, and \(b_{i}\) is the bias of the \(i\)th hidden neuron. \(H^{\dagger}\) indicates the Moore-Penrose (MP) generalized inverse of matrix \(H\). The sizes of \(H\), \(w^{o}\) and \(T\) are \(N \times n_{h}\), \(n_{h} \times C\) and \(N \times C\) respectively. The solution obtained in this way is unique and has the smallest norm among all the least-squares (LS) solutions. As the solution of ELM is obtained using an analytical method without iteratively tuning parameters, it converges faster than other traditional learning algorithms.
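The three ELM steps can be sketched in NumPy as follows. The function names are hypothetical, and drawing the hidden parameters uniformly from [-1, 1] together with a sigmoid activation are assumptions made for this illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def elm_train(X, T, n_hidden, rng):
    """X: (N, L) inputs, T: (N, C) targets. Returns (W_h, b, W_o)."""
    n_features = X.shape[1]
    # Step 1: random input weights and hidden biases (uniform range is an assumption)
    W_h = rng.uniform(-1.0, 1.0, size=(n_hidden, n_features))
    b = rng.uniform(-1.0, 1.0, size=n_hidden)
    # Step 2: hidden layer output matrix H of size N x n_hidden
    H = sigmoid(X @ W_h.T + b)
    # Step 3: output weights through the Moore-Penrose generalized inverse
    W_o = np.linalg.pinv(H) @ T
    return W_h, b, W_o

def elm_predict(X, W_h, b, W_o):
    """Network output for new samples; argmax over columns gives the class."""
    return sigmoid(X @ W_h.T + b) @ W_o
```

No iterative tuning takes place: once the hidden parameters are drawn, training reduces to one pseudo-inverse computation.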

Modified DE algorithm

Differential evolution (DE) is a simple and effective population based meta-heuristic approach for global optimization [3, 19]. The performance of DE is strongly influenced by its mutation strategy, crossover operation and control parameters. As a consequence, a significant number of works have been proposed to improve its search performance, and it has been reported that DE outperforms GA and PSO on various benchmark functions [9]. However, the standard DE suffers from premature convergence to local optima and from stagnation. Therefore, the recent trend is to improve the search performance of DE by means of novel strategies for mutation and parameter control. In this study, a novel mutation and random scale factor strategy is proposed to improve the performance of DE and hence, it is referred to as modified DE (MDE). The stepwise description of the proposed MDE algorithm is as follows.

DE Initialization

Randomly initialize the \(L\)-dimensional parameter vectors in a population of size \(N_{p}\) as \(\{S_{j,It} \mid j = 1,2,\ldots,N_{p}\}\) with \(S_{j,It} = [S_{1,j,It},S_{2,j,It},\ldots,S_{L,j,It}]\), where \(It\) denotes the generation number.

Mutation

For each target vector \(S_{j,It}\), generate the mutant vector using the proposed mutation strategy as

$$ V_{j,It}=S_{j,It}+f_{s}(S_{best,It}-S_{j,It})+f_{s}(S_{best,It}-S_{{r_{1}^{j}},It}) $$
(4)

where \({r_{1}^{j}}\) is a random integer between 1 and \(N_{p}\), different from the index \(j\). \(S_{best,It}\) denotes the parameter vector with the best fitness at generation \(It\) and \(f_{s}\) is the scaling factor applied to the difference vectors. In basic DE, the difference vector is scaled by a constant \(f_{s}\). In the proposed scheme, however, \(f_{s}\) is varied randomly using the following equation

$$ f_{s}= 0.75+[rand(.)/4] $$
(5)

where \(rand(.)\) is a uniformly distributed random number within the range [0,1], so that \(f_{s}\) lies in [0.75,1].

Crossover

Form a trial vector \(U_{j,It} = [U_{1,j,It},U_{2,j,It},\ldots,U_{L,j,It}]\) for the \(j\)th target vector \(S_{j,It}\) using binomial crossover as

$$ U_{d,j,It}\,=\,\left\{\begin{array}{ll} \!V_{d,j,It} & \!\text{if} \ randb(d)\!<=C_{r} \ \!\text{or} \ d\!=d_{rand}\\ \!S_{d,j,It} & \!\text{else} \end{array}\right.\!\!\!,\;\; d\,=\,1,2,\ldots,L $$
(6)

where \(randb(d)\) is the \(d\)th evaluation of a uniform random number generator with outcome in [0,1], \(d_{rand} \in \{1,2,\ldots,L\}\) is a randomly chosen index and \(C_{r} \in [0,1]\) is the crossover constant.

Selection

Evaluate the fitness of the target and the trial vector and check the following condition to find the solution for the next generation (i.e., \(It = It + 1\))

$$ S_{j,It+ 1}=\left\{\begin{array}{ll} U_{j,It} & \text{if} \ f(U_{j,It})<=f(S_{j,It}) \\ S_{j,It} & \text{if} \ f(U_{j,It}) > f(S_{j,It}) \end{array}\right. $$
(7)

Here, f(.) is the objective function which is to be minimized. Repeat the above procedure until a termination criterion gets satisfied.
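The four MDE steps (Eqs. 4 to 7) can be put together in a short, self-contained sketch. It is an illustration under assumptions rather than the exact implementation of this study: the function name and default values are hypothetical, a fresh \(f_{s}\) is drawn for every mutant vector, and the search stops after a fixed number of generations.

```python
import random

def mde_minimize(f, dim, n_pop=20, n_iter=30, c_r=0.7, low=-1.0, high=1.0, seed=0):
    """Minimize f over dim-dimensional vectors; returns (best_vector, best_fitness)."""
    rng = random.Random(seed)
    # Initialization: random population inside [low, high]
    pop = [[rng.uniform(low, high) for _ in range(dim)] for _ in range(n_pop)]
    fit = [f(s) for s in pop]
    for _ in range(n_iter):
        best = pop[min(range(n_pop), key=fit.__getitem__)][:]
        new_pop, new_fit = [], []
        for j in range(n_pop):
            r1 = rng.choice([k for k in range(n_pop) if k != j])
            f_s = 0.75 + rng.random() / 4.0                      # Eq. 5
            # Eq. 4: mutation pulled toward the best vector
            v = [pop[j][d] + f_s * (best[d] - pop[j][d])
                 + f_s * (best[d] - pop[r1][d]) for d in range(dim)]
            # Eq. 6: binomial crossover, at least one component taken from v
            d_rand = rng.randrange(dim)
            u = [v[d] if (rng.random() <= c_r or d == d_rand) else pop[j][d]
                 for d in range(dim)]
            # Eq. 7: greedy selection between trial and target
            fu = f(u)
            if fu <= fit[j]:
                new_pop.append(u)
                new_fit.append(fu)
            else:
                new_pop.append(pop[j])
                new_fit.append(fit[j])
        pop, fit = new_pop, new_fit
    j_best = min(range(n_pop), key=fit.__getitem__)
    return pop[j_best], fit[j_best]

def sphere(x):
    """Simple test objective with its minimum at the origin."""
    return sum(value * value for value in x)

best_vector, best_value = mde_minimize(sphere, dim=5)
```

Because the selection in Eq. 7 never replaces a target by a worse trial, the best fitness in the population cannot increase from one generation to the next.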

Proposed evolutionary extreme learning machine

Since ELM utilizes random input weights and hidden biases, it leads to two critical issues [24, 46]: (i) a high requirement of hidden neurons, because of which ELM responds slowly to unknown testing data, and (ii) an ill-conditioned hidden layer output matrix \(H\) in the presence of a large number of hidden neurons, which induces poor generalization performance.

To overcome such issues, a few research efforts have been reported in past years where population-based optimization schemes such as genetic algorithms (GA) [20], differential evolution (DE) [46] and PSO [24] are used to optimize the hidden node parameters of ELM. In this study, a new approach, MDE-ELM, which combines the modified DE (MDE) algorithm with ELM, is proposed to enhance performance compared to these existing schemes. Here, MDE is used to optimize the hidden node parameters, whereas the MP generalized inverse is utilized to find the output weights analytically. It is worth mentioning that the MDE algorithm searches for the global optimum by considering both the root-mean squared error (RMSE) and the norm of the output weights of the SLFN, which helps to improve the generalization performance and the conditioning of the SLFN. The proposed MDE-ELM is listed stepwise as follows.

  (a) Randomly initialize all the parameter vectors in the population within [-1,1] such that each vector comprises a set of input weights and hidden biases as

    $$ S_{j}=\left[ w^{h}_{11},w^{h}_{12},\ldots,w^{h}_{1L}, w^{h}_{21},w^{h}_{22},\ldots,w^{h}_{2L},\ldots,w^{h}_{n_{h}1},w^{h}_{n_{h}2},\ldots,w^{h}_{n_{h}L},b_{1},b_{2},\ldots,b_{n_{h}}\right] $$
    (9)

  (b) For each vector, evaluate the output weights and fitness. Here, for fitness evaluation, we compute the RMSE over the validation set rather than the whole training set to overcome the overfitting issue. Hence, we can define fitness as

    $$ f(\cdot)=\sqrt{\frac{\sum\limits_{j = 1}^{N_{v}}||\sum\limits_{i = 1}^{n_{h}}{w^{o}_{i}} \phi({w^{h}_{i}} \cdot x_{j} + b_{i})-t_{j}||^{2}_{2}}{N_{v}}} $$
    (10)

    where \(N_{v}\) indicates the number of validation samples.

  (c) Find \(S_{best}\) among all the solutions in the population and generate the mutant vector \(V_{j}\) and trial vector \(U_{j}\) using Eqs. 4 and 6 respectively.

  (d) Update the vectors using the fitness value and the norm of the output weights and generate the new population as follows:

    $$ S_{j,It+ 1}=\left\{\begin{array}{ll} U_{j,It} & \text{if} \ f(S_{j,It})-f(U_{j,It})> \epsilon f(S_{j,It})\\& \text{or} \ (|f(S_{j,It})-f(U_{j,It})| < \epsilon f(S_{j,It}) \ \text{and} \ ||w^{o}_{U_{j}}||< ||w^{o}_{S_{j}}||) \\ S_{j,It} & \text{otherwise} \end{array}\right. $$
    (11)

    where \(f(S_{j,It})\) and \(f(U_{j,It})\) denote the fitness values of the target vector \(j\) and its corresponding trial vector at iteration \(It\) respectively. \(w^{o}_{S_{j}}\) and \(w^{o}_{U_{j}}\) represent the output weights of the target vector \(j\) and its corresponding trial vector, respectively. \(\epsilon > 0\) is a user-defined tolerance rate.

  (e) To bound the input weights and biases in the range of [-1, 1], we use the following equation in the proposed MDE-ELM.

    $$ S_{d,j,It+ 1}=\left\{\begin{array}{ll} -1 & \text{if} \ S_{d,j,It+ 1}<-1\\ 1 & \text{if} \ S_{d,j,It+ 1}> 1\\ S_{d,j,It+ 1} & \text{otherwise} \end{array}\right., \ 1\le j\le N_{p}, \ 1\le d \le L $$
    (12)

  (f) Repeat (c)–(e) until the maximum number of iterations is reached and obtain the optimal input weights and hidden biases.
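The selection rule of Eq. 11 and the bounding rule of Eq. 12 can be sketched as two small helpers. The function names are hypothetical; the fitness is assumed to be a positive RMSE, and the default tolerance of 0.05 is only a placeholder value.

```python
def accept_trial(f_target, f_trial, norm_target, norm_trial, eps=0.05):
    """Eq. 11: return True if the trial vector should replace the target vector."""
    # Case 1: the trial reduces the fitness by more than the tolerance
    clearly_better = (f_target - f_trial) > eps * f_target
    # Case 2: fitness values are within the tolerance and the trial
    # yields output weights with a smaller norm
    similar_but_smaller = (abs(f_target - f_trial) < eps * f_target
                           and norm_trial < norm_target)
    return clearly_better or similar_but_smaller

def clamp_vector(vector, low=-1.0, high=1.0):
    """Eq. 12: keep every input weight and hidden bias inside [low, high]."""
    return [min(max(value, low), high) for value in vector]
```

Note that under the second condition a trial whose fitness is marginally worse than the target can still be accepted when its output weights have a smaller norm, which is how the rule trades a little validation error for better conditioning.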

The proposed scheme uses Eq. 11 to select the input weights and hidden biases and hence tends to provide a lower norm of the output weights of the SLFN. In turn, a smaller norm of the output weights leads to a smaller condition value of the hidden layer output matrix. To sum up, the proposed MDE-ELM offers the following advantages: (i) it improves the conditioning, and (ii) it produces better generalization performance with a much more compact network. Compared to gradient based methods and classical ELM, the MDE-ELM approach does not need the activation function to be differentiable.

Since the proposed PBDS includes techniques such as 2DPCA, PCA+LDA, and MDE-ELM, hereafter, in this paper, the proposed scheme is referred to as 2DPCA + PCA + LDA + MDE-ELM.

Experimental results and analysis

The parameters used and the statistical setup were kept similar to those of the competing schemes to allow a fair comparison.

Statistical set up

In order to validate the proposed scheme 2DPCA + PCA + LDA + MDE-ELM, simulations have been carried out on three different datasets, namely, DS-I, DS-II, and DS-III. For statistical analysis, cross-validation (CV) has been employed to guard against over-fitting. In this work, we have incorporated stratification into CV, which splits the folds in such a way that each fold has a similar class distribution. Figure 3 depicts the setting of a 5-fold CV for a single run. In each trial, one fold is used for testing, one for validation and the rest for training. The validation set is used to find the parameters of the MDE-ELM, i.e., it helps us to know when to stop training. The test set is used to evaluate the performance in each of the five trials of a run. Here, for DS-I, we employ 6-fold stratified cross validation (SCV), while for the other two datasets, we select 5-fold SCV. The statistical setting for all the three datasets is given in Table 1. The SCV procedure is run 10 times for each of the three datasets.
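The splitting scheme can be sketched in plain Python as follows. The function names are hypothetical, and the choice of the fold following the test fold as the validation fold is an assumption, since the text only states that one fold is used for each purpose.

```python
import random
from collections import defaultdict

def stratified_folds(labels, k, seed=0):
    """Split sample indices into k folds with a similar class distribution."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal the samples of each class to the folds in round-robin order
        for pos, idx in enumerate(indices):
            folds[pos % k].append(idx)
    return folds

def trial_splits(folds):
    """Yield (train, validation, test) index lists, one triple per trial."""
    k = len(folds)
    for t in range(k):
        v = (t + 1) % k          # assumption: the next fold serves as validation
        train = [i for f in range(k) if f not in (t, v) for i in folds[f]]
        yield train, folds[v], folds[t]

# Class counts of DS-III: 220 pathological (1) and 35 healthy (0) images
labels = [1] * 220 + [0] * 35
folds = stratified_folds(labels, k=5)
```

With these counts every fold holds 44 pathological and 7 healthy images, so each trial trains on 153 images and keeps 51 each for validation and testing.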

Fig. 3 Illustration of 5-fold cross validation setting for a single run

Table 1 Specification of three benchmark datasets [2, 13, 35]

Evaluation method

To decide whether the proposed scheme is effective or not, four different measures, namely sensitivity (\(S_{e}\)), specificity (\(S_{p}\)), precision (\(P_{r}\)) and accuracy (ACC), are computed. \(S_{e}\) is the fraction of pathological MR samples successfully predicted, while \(S_{p}\) is the fraction of healthy MR samples successfully predicted. \(P_{r}\) is the fraction of samples predicted as pathological that are truly pathological, and ACC is the fraction of correctly predicted samples (both pathological and healthy) in the total number of testing samples. Moreover, to compare the proposed MDE-ELM scheme against other schemes such as DE-ELM, PSO-ELM, basic ELM and BPNN, two further parameters, namely the condition number and the norm of the output weights, are used.

Experimental results

In the following, we discuss the results obtained at various stages of the proposed scheme.

Preprocessing and feature extraction results

In the preprocessing stage, CLAHE is utilized, which relies on the proper setting of its parameters. Here, the original MR image is divided into 64 contextual regions. The number of bins and the clip limit (β) are selected to be 256 and 0.01 respectively. The representative enhanced images corresponding to four original MR images are depicted in Fig. 4. From the figure, it is seen that the affected lesions are clearer in the enhanced images than in the original images.

Fig. 4 Preprocessing using CLAHE. Row 1 lists the original MR samples. Row 2 lists the corresponding preprocessed samples

Next, the 2DPCA algorithm is employed on the preprocessed images for feature extraction. In 2DPCA, the features are extracted using the projection vectors of the image covariance matrix. If we use all the projection vectors for feature extraction of an image, then the total number of features will be too high. Moreover, not all projection vectors contain important information. Hence, a simple strategy based on the NCSV measure is used in this study to select the optimal number of projection vectors (i.e., \(\alpha\)). To test this strategy, we compute the NCSV values with a varying number of projection vectors for all the three datasets as shown in Fig. 5. From the figure, it is seen that our algorithm needs at most 26 projection vectors for the three datasets (in particular 23, 25 and 26 for DS-I, DS-II, and DS-III respectively) with a threshold of 0.8. Hence, we fix the \(\alpha\) value as 26 in order to extract the salient features from the brain MR images of the three datasets. As a consequence, the total number of features extracted from a single image is 6656 (i.e., 26 × 256). Here, the threshold value is determined experimentally.

Fig. 5 NCSV values with respect to different number of projection vectors for three datasets

Feature reduction results

As the dimension of the feature vector obtained by the 2DPCA algorithm is very high (i.e., 6656 features), we employ PCA+LDA to reduce the dimensionality. The number of significant features is obtained based on the NCSV values of different features. It has been observed that PCA needs more features than PCA+LDA to preserve the maximum information. In this case, the threshold value for NCSV is set to 0.95. Moreover, the classification accuracy against the number of features for both PCA and PCA+LDA on the three datasets is depicted in Fig. 6. From the figure, it is clear that the PCA based scheme reaches its highest accuracy with 14 features on all the three datasets, while the PCA+LDA based scheme reaches its highest accuracy with only two features.

Fig. 6 Classification accuracy with respect to number of features for three datasets

Classification results

The proposed system employs MDE-ELM for classification of MR images as healthy or pathological. Here, the performance of the proposed MDE-ELM is compared against other learning algorithms such as DE-ELM, PSO-ELM, ELM, and BPNN. The activation function is kept the same for all the algorithms, i.e., the sigmoidal function, and the inputs to the network are normalized into the range [-1,1]. The population size and the maximum number of iterations are set to 20 and 30 respectively for the MDE-ELM, DE-ELM, and PSO-ELM algorithms. The \(\epsilon\) value in the proposed MDE-ELM is tested over the range [0.01,0.2] at equally spaced intervals, and it has been found that the proposed scheme achieves its highest performance with an \(\epsilon\) value of 0.05. In the case of PSO-ELM, the values of \(c_{1}\) and \(c_{2}\) are set to 2, while in DE-ELM, the crossover rate (\(C_{r}\)) and scaling factor (\(f_{s}\)) are set to 0.7 and 0.8 respectively.

Tables 2, 3 and 4 show the results obtained by MDE-ELM, DE-ELM, PSO-ELM, ELM and BPNN on the three benchmark datasets. From the tables, it is clear that MDE-ELM outperforms the others with fewer hidden neurons over all the datasets. It can also be noticed that basic DE-ELM achieves perfect classification on DS-I and DS-II and comparable accuracy on DS-III. Compared to the other algorithms, standard ELM demands more hidden neurons.

Table 2 Performance comparison of different algorithms on DS-I
Table 3 Performance comparison of different algorithms on DS-II
Table 4 Performance comparison of different algorithms on DS-III

Further, it is observed that the condition value of the matrix \(H\) obtained by the MDE-ELM, DE-ELM and PSO-ELM algorithms is much smaller than that of the conventional ELM. This indicates that the networks trained by these algorithms are better conditioned than those trained by basic ELM. Further, their corresponding norm values are much smaller than those of basic ELM and hence, these algorithms tend to have better generalization performance compared to traditional ELM. It can be seen that a smaller norm value of \(w^{o}\) goes together with a smaller condition value of matrix \(H\). Compared with PSO-ELM and DE-ELM, MDE-ELM obtains smaller condition and norm values. Therefore, it can be concluded that the proposed algorithm (MDE-ELM) can have better generalization performance with a compact network structure. It is worth mentioning here that the results reported in the tables are the average values of 50 trials and that the parameters of all the schemes are determined through experimental evaluation.
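The two quantities compared here can be computed with standard NumPy routines, as in the short sketch below. The function name is hypothetical, and the use of the 2-norm condition number for \(H\) and the Frobenius norm for \(w^{o}\) is an assumption, since the text does not state which norms are reported.

```python
import numpy as np

def conditioning_report(H, W_o):
    """Return (condition number of H, norm of the output weights W_o)."""
    # np.linalg.cond defaults to the 2-norm: largest / smallest singular value
    condition = float(np.linalg.cond(H))
    # np.linalg.norm on a matrix defaults to the Frobenius norm
    weight_norm = float(np.linalg.norm(W_o))
    return condition, weight_norm

# Toy matrices, not results of this study
condition, weight_norm = conditioning_report(np.diag([2.0, 1.0]),
                                             np.array([[3.0], [4.0]]))
```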

Moreover, to assess the efficacy of the suggested MDE-ELM classifier, its accuracy is compared against other classifiers like BPNN, KNN, random forest (RF), and SVM on all the three datasets, and the results are depicted in Fig. 7. For DS-I, KNN, BPNN, SVM, RF, ELM and DE-ELM yield an accuracy of 99.24%, 99.85%, 100.00%, 99.54%, 100.00% and 100.00% respectively, while on DS-II these classifiers obtain an accuracy of 99.38%, 99.88%, 99.81%, 99.69%, 100.00% and 100.00% respectively. The accuracies yielded by KNN, BPNN, SVM, RF, ELM and DE-ELM on DS-III are 99.14%, 99.37%, 99.49%, 99.33%, 99.49%, and 99.53% respectively. In comparison, MDE-ELM achieves ideal classification on DS-I and DS-II and an accuracy of 99.65% on DS-III. This shows that the proposed algorithm outperforms all other classifiers on DS-III and is able to provide ideal results on the other two datasets.

Fig. 7 Classification accuracy achieved by different classifiers for three datasets

Table 5 indicates the number of correctly classified MR images obtained by the proposed scheme (2DPCA + PCA+LDA + MDE-ELM) over DS-III in each trial of a 10 × k-fold SCV. It is found that the proposed scheme successfully classifies 2541 MR images out of 2550 samples (2200 pathological and 350 healthy MR images). In particular, 2195 pathological samples are correctly classified by our scheme and the remaining five are misclassified as healthy. Similarly, the proposed system correctly predicts 347 healthy MR images and the remaining three are misclassified as pathological. From these results, the sensitivity (S_e), specificity (S_p) and precision (P_r) of the proposed scheme are computed as 99.82%, 98.57% and 99.77%, respectively, as shown in Table 6.
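The way such measures follow from confusion-matrix counts can be sketched as below, using the standard definitions with the pathological class treated as positive. The function name `classification_metrics` is hypothetical. The per-class counts are an assumption: they are chosen because they reproduce the reported percentages and the total of 2541 correct out of 2550, they are not read from Table 5, and they differ slightly from the per-class counts quoted in the paragraph above.

```python
def classification_metrics(tp, fn, tn, fp):
    """Return sensitivity, specificity, precision and accuracy as percentages."""
    sensitivity = 100.0 * tp / (tp + fn)               # correctly detected pathological cases
    specificity = 100.0 * tn / (tn + fp)               # correctly detected healthy cases
    precision = 100.0 * tp / (tp + fp)                 # predicted pathological that truly are
    accuracy = 100.0 * (tp + tn) / (tp + fn + tn + fp)
    return sensitivity, specificity, precision, accuracy

# Assumed counts (pathological = positive class); inferred, not taken from Table 5.
se, sp, pr, acc = classification_metrics(tp=2196, fn=4, tn=345, fp=5)
print(f"Se={se:.2f}  Sp={sp:.2f}  Pr={pr:.2f}  Acc={acc:.2f}")
```

With these assumed counts the sketch gives 99.82, 98.57, 99.77 and 99.65 for sensitivity, specificity, precision and accuracy, respectively.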

Table 5 Correctly classified samples of the proposed scheme on DS-III
Table 6 Classification performance (%) of the proposed scheme based on PCA and PCA+LDA over three datasets

Comparison to PCA based PBDS

To test the effectiveness of the PCA+LDA approach over PCA alone, another experiment is carried out over the three datasets. The performances of both schemes, namely 2DPCA + PCA + MDE-ELM and 2DPCA + PCA+LDA + MDE-ELM, are listed in Table 6. It may be noticed that the proposed 2DPCA + PCA+LDA + MDE-ELM scheme achieves better sensitivity, precision and accuracy than 2DPCA + PCA + MDE-ELM over all the datasets with relatively fewer features. However, 2DPCA + PCA+LDA + MDE-ELM obtains slightly lower specificity than 2DPCA + PCA + MDE-ELM on DS-III. Nevertheless, it is worth noting that a CAD system with a higher sensitivity value is considered to perform better. Therefore, it can be concluded that the proposed 2DPCA + PCA+LDA + MDE-ELM scheme holds greater potential for making accurate clinical decisions.

Comparison to existing PBDSs

To benchmark the performance of the suggested scheme in terms of the number of features required and the classification accuracy, an extensive comparison with twenty existing schemes has been carried out over the three datasets and is shown in Table 7. It is found that most of the earlier PBDSs yield ideal classification on DS-I; however, three PBDSs, namely RT + PCA + LS-SVM [2], WPTE + FNN + RCBBO [21] and WPTE + GEPSVM [31], offer ideal classification on DS-II. Further, no existing PBDS yields perfect classification over DS-III. Our proposed PBDS obtains a higher accuracy on DS-III, i.e., 99.65%, than the other PBDSs, with the minimum number of features. Since MDE-ELM is used as the classifier, the proposed system achieves better generalization performance and responds faster to unseen test data.

Table 7 Comparison against other competent PBDSs on three standard datasets

The proposed system has been tested on three openly accessible datasets that contain images from patients in the middle and late stages of disease; validation on a larger dataset with images from all stages of disease is needed to better assess its generalization performance. The present study deals with a two-class classification problem, whereas multi-class brain disease classification is more challenging. Further, MDE requires more parameters to be tuned; hence, there is scope to investigate an optimization scheme that needs fewer parameters.

Conclusions and future work

This paper proposed an improved pathological brain detection system based on 2DPCA and an evolutionary ELM. In the proposed PBDS, 2DPCA is used for feature extraction, followed by a PCA+LDA approach for feature reduction. Thereafter, a novel learning algorithm called MDE-ELM is introduced to classify brain MR images, which offers several advantages over traditional classifiers. The goal of using MDE in MDE-ELM is to optimize the hidden node parameters of the standard ELM. The performance of the proposed scheme is evaluated on three standard datasets, and the experimental results confirm its effectiveness in improving classification accuracy compared to the existing schemes. Further, the number of features required is shown to be much smaller than that of the other schemes.
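The idea of optimizing the hidden node parameters of an ELM with an evolutionary search can be illustrated with the minimal sketch below. It uses the classic DE/rand/1/bin strategy with validation RMSE as fitness, which is an assumption made for illustration; it is not the modified DE (MDE) of this paper, and the function names, population size, F, CR, search range and synthetic data are all hypothetical choices.

```python
import numpy as np

def hidden_output(X, theta, n_hidden):
    """Decode a flat vector into (W, b) and return the sigmoid hidden-layer output."""
    d = X.shape[1]
    W = theta[: d * n_hidden].reshape(d, n_hidden)
    b = theta[d * n_hidden:]
    return 1.0 / (1.0 + np.exp(-(X @ W + b)))

def fitness(theta, X_tr, T_tr, X_va, T_va, n_hidden):
    """Validation RMSE of an ELM whose hidden parameters are given by theta."""
    w_o = np.linalg.pinv(hidden_output(X_tr, theta, n_hidden)) @ T_tr
    pred = hidden_output(X_va, theta, n_hidden) @ w_o
    return float(np.sqrt(np.mean((pred - T_va) ** 2)))

def de_elm(X_tr, T_tr, X_va, T_va, n_hidden=8, pop_size=20, gens=30,
           F=0.5, CR=0.9, seed=0):
    """Classic DE/rand/1/bin search over ELM hidden weights and biases."""
    rng = np.random.default_rng(seed)
    dim = X_tr.shape[1] * n_hidden + n_hidden
    pop = rng.uniform(-1.0, 1.0, size=(pop_size, dim))
    fit = np.array([fitness(p, X_tr, T_tr, X_va, T_va, n_hidden) for p in pop])
    history = [fit.min()]
    for _ in range(gens):
        for i in range(pop_size):
            # Pick three distinct individuals, all different from i.
            others = [j for j in range(pop_size) if j != i]
            r1, r2, r3 = rng.choice(others, size=3, replace=False)
            mutant = np.clip(pop[r1] + F * (pop[r2] - pop[r3]), -1.0, 1.0)
            cross = rng.random(dim) < CR
            cross[rng.integers(dim)] = True  # keep at least one gene from the mutant
            trial = np.where(cross, mutant, pop[i])
            f_trial = fitness(trial, X_tr, T_tr, X_va, T_va, n_hidden)
            if f_trial <= fit[i]:  # greedy one-to-one selection
                pop[i], fit[i] = trial, f_trial
        history.append(fit.min())
    return pop[np.argmin(fit)], history

rng = np.random.default_rng(1)
X = rng.normal(size=(120, 4))                             # synthetic features (placeholder)
T = (X[:, 0] * X[:, 1] > 0).astype(float).reshape(-1, 1)  # synthetic binary targets
best, history = de_elm(X[:80], T[:80], X[80:], T[80:])
print(f"validation RMSE: first {history[0]:.4f}, final {history[-1]:.4f}")
```

Because selection is greedy, the best validation RMSE in the population never increases from one generation to the next; the output weights are still obtained analytically, so only the hidden parameters are searched.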

The proposed MDE-ELM algorithm can also be tested on real-world regression and classification problems. Despite the merits of the proposed PBDS, it has been benchmarked on three accessible datasets that are small in size; hence, evaluation on a larger dataset collected online would further establish its effectiveness. Further, the images in the chosen datasets come from the middle and late stages of the diseases; images collected during all stages need to be validated. In future, it would be interesting to hybridize ELM with other metaheuristic algorithms such as the grey wolf optimizer (GWO), firefly algorithm (FA) and gravitational search algorithm (GSA). In addition, harnessing deep learning algorithms for analyzing 3D MR images is another possible direction for future work.