1 Introduction

Brain disease is one of the leading causes of death across all age groups. Brain diseases are of several types, such as neoplastic diseases (brain tumors), cerebrovascular diseases (stroke), degenerative diseases, and infectious diseases; some of them cause severe damage to the human brain and may lead to death. This necessitates the development of automated computer-based decision systems that help physicians make correct and fast clinical decisions at an early stage. Magnetic resonance imaging (MRI) is the medical imaging modality most commonly used in pathological brain detection systems (PBDSs) because it provides rich information about soft tissues [5, 44, 53]. In addition, MRI is non-invasive compared with other modalities such as X-ray and CT. However, manual interpretation is difficult because of the large volume of MRI data, and it is a costly, tedious, and time-consuming process [8, 27, 31]. To overcome these issues, automated PBDSs need to be developed to assist radiologists in making accurate and quick decisions. A PBDS employs various image processing and machine learning algorithms at its different stages.

A significant amount of work has been done on PBDSs in the past decades [10, 54, 65]. However, developing an ideal PBDS is still challenging because of the difficulty in selecting proper algorithms for feature extraction, feature reduction, and classification. Further, these three phases should work well together regardless of the image modality and the dataset size. Hence, PBDS design remains an open problem. Our objective here is to improve on the performance of existing PBDSs for abnormality detection in the human brain.

It has been observed that the discrete wavelet transform (DWT) is the most widely used feature extractor in PBDSs, since it analyzes images at several scales and handles one-dimensional (1D) singularities effectively. However, DWT cannot handle two-dimensional (2D) singularities (the edges of an image); that is, it cannot capture curve-like features effectively. Therefore, a transform that can capture 2D singularities is highly desirable. Further, classifiers such as the feedforward neural network (FNN) and the support vector machine (SVM) have often been used in earlier PBDSs because of their ability to separate nonlinear input patterns and approximate continuous functions. But conventional gradient-based learning algorithms such as back-propagation (BP) and Levenberg-Marquardt (LM), used for training the FNN, cause many problems such as trapping in local minima, slow learning speed, and excessive learning epochs. A few hybrid models have been designed with the help of population-based optimization strategies to overcome the limitations of traditional learning algorithms. Further, the traditional SVM classifier incurs high computational complexity and performs poorly on large datasets [26]. Moreover, some PBDSs need a large number of features, so there is scope to limit the feature requirement without compromising accuracy.

Considering these concerns, we propose a novel PBDS with the following characteristics.

  (a) Orthogonal discrete ripplet-II transform (O-DR2T) is used for feature extraction to capture 2D singularities along curves in MR images.

  (b) To address the problems of traditional learning algorithms, a recently proposed learning algorithm known as the extreme learning machine (ELM) is employed, which provides faster learning and better generalization than conventional learning algorithms. However, ELM suffers from limitations of its own, such as slow response on testing data, a high requirement of hidden neurons, and ill-conditioning.

  (c) To further enhance the performance of the standard ELM, a hybrid learning algorithm based on an improved Jaya optimization algorithm and ELM (IJaya-ELM) is proposed.

  (d) To validate the proposed scheme, extensive experiments are carried out on three well-known datasets, and the proposed PBDS is compared against other competent methods with respect to classification accuracy and number of features.

The remainder of the article is organized as follows. Section 2 presents a review of current PBDSs. Section 3 describes the materials used in the experiments. The proposed methodology is discussed in detail in Section 4. The statistical setting and pseudocode of the proposed scheme are presented in Section 5. In Section 6, a detailed experimental evaluation and comparative analysis are presented. Finally, concluding remarks are given in Section 7.

2 Related work

In the past years, a number of PBDSs have been reported in the literature for the detection of brain diseases, with MRI as the imaging modality in almost all of them. Depending on the type of features used, PBDSs can be broadly divided into two classes: direct-feature-based and indirect-feature-based. The former uses the coefficients of an image transform directly as features, whereas the latter extracts statistical descriptors such as energy, entropy, mean, and standard deviation from those coefficients. PBDSs of the first category require feature transformation or selection techniques to obtain a relevant feature set; this is optional for the second category.

Chaplot et al. [5] were among the first to propose a PBDS, using 2D DWT features and two separate classifiers, namely a self-organizing map (SOM) and SVM. The authors in [23] proposed a PBDS where the Slantlet transform (ST) is employed for feature extraction and a back-propagation neural network (BPNN) for classification. Later, El-Dahshan et al. [11] suggested a hybrid approach based on 2D DWT and two separate classifiers, the k-nearest neighbor (k-NN) and the feed forward back-propagation artificial neural network (FP-ANN); to reduce the feature dimensionality, they applied principal component analysis (PCA). Further, with the same features, the authors in [53, 57, 59, 61] proposed a variety of PBDSs in which gradient-based and population-based optimization algorithms such as scaled conjugate gradient (SCG), particle swarm optimization (PSO), adaptive chaotic PSO (ACPSO), and scaled chaotic artificial bee colony (SCABC) are used to optimize the parameters of FNN, BPNN, and kernel SVM (KSVM) classifiers. Zhang et al. [60] suggested a PBDS where DWT plus PCA based features are fed to a KSVM classifier. The authors in [8] derived features from the ripplet transform (RT), reduced the feature dimensionality using PCA, and applied a least squares SVM (LS-SVM) for classification. In [10], the authors utilized DWT and PCA for feature extraction and reduction prior to a feedback pulse coupled neural network (FPCNN), and finally applied FP-ANN for classification. Afterward, Wang et al. [42] offered PBDSs based on the stationary wavelet transform (SWT), PCA, and FNN, in which the FNN parameters are optimized using artificial bee colony (ABC) and PSO; the schemes are coined IABAP-FNN, ABC-SPSO-FNN, and HPA-FNN. In another work, Zhang et al. [62] deployed the weighted-type fractional Fourier transform (WFRFT) and PCA for feature extraction and reduction, respectively, and applied the generalized eigenvalue proximal SVM (GEPSVM) and twin SVM (TSVM) for classification. Later, Nayak et al. [27] proposed a PBDS based on 2D DWT and probabilistic PCA (PPCA), with AdaBoost with random forests (ADBRF) employed for classification. In [51], the authors used SWT, PCA, and GEPSVM for feature extraction, reduction, and classification, respectively. The authors in [25] utilized the HL3 coefficients of 2D DWT as features and harnessed a PCA+LDA strategy for dimensionality reduction. Later, Dash et al. [28] utilized the curvelet transform for feature extraction, with PCA and LS-SVM employed for feature reduction and classification. Chen et al. [63] used Minkowski-Bouligand dimension (MBD) features for MR image classification, performing Canny edge detection prior to feature extraction; subsequently, an improved PSO based on three-segment particle representation, time-varying acceleration coefficients, and chaos theory (PSO-TTC) was proposed to train a single-hidden-layer feedforward neural network. In [31], the authors used the fast discrete curvelet transform for feature extraction after segmentation with a simple pulse coupled neural network (SPCNN), and eventually applied a probabilistic neural network (PNN) for classification.

Recent PBDS articles have used feature descriptors such as energy, entropy [13], mean, and standard deviation in the feature extraction stage. For example, in [36], the entropy values of the wavelet coefficients are used as features, a spider-web plot and t-test strategy is used to select the significant features, and a PNN is employed for classification. Later, Yang et al. [48] computed energy values from level-3 DWT coefficients to serve as features and integrated biogeography-based optimization (BBO) into SVM for classification. In [52], a discrete wavelet packet transform (DWPT) based PBDS is proposed: two types of entropies, Shannon entropy (SE) [30] and Tsallis entropy (TE), are evaluated from the sub-bands, and GEPSVM is utilized to classify MR images as healthy or pathological. Furthermore, in [56], a hybrid BBO and PSO based method known as HBP is suggested for training the FNN, with wavelet entropy values as features. In [67], a PBDS based on wavelet entropy (WE) and a naive Bayes classifier (NBC) is proposed, while in [50], wavelet energy and SVM are used. Thereafter, Zhang et al. [64] used the Tsallis entropy of DWPT for feature extraction and a fuzzy support vector machine (FSVM) for classification. In [58], WE and Hu moment invariant (HMI) features are used, followed by a GEPSVM+RBF classifier. Wang et al. [43] proposed a PBDS based on a novel feature called fractional Fourier entropy (FRFE), the combination of FRFT and Shannon entropy; two separate tests, Welch's t-test (WTT) and the Mahalanobis distance (MD), are performed to select the relevant features, and TSVM is employed for classification. Later, in [55], a PBDS based on FRFE features and a multilayer perceptron (MLP) is proposed, in which three pruning methods, namely Bayesian detection boundaries (BDB), dynamic pruning (DP), and the Kappa coefficient (KC), are utilized to obtain the optimal number of hidden neurons, and an adaptive real-coded BBO (ARCBBO) approach updates the weights of the MLP. In [40], the authors employed three varieties of binary PSO (BPSO) to select significant features from the entropy values of an 8-level DWT, with PNN deployed for classification. In [39], the variance and entropy (VE) values of a dual-tree complex wavelet transform (DTCWT) are used as features, with both GEPSVM and TSVM as classifiers. Later, Nayak et al. [29] used the energy and entropy values of the 2D SWT as features, employing a symmetric uncertainty ranking (SUR) filter for feature selection and AdaBoost with support vector machine (ADBSVM) for classification. The authors in [38] employed wavelet packet Tsallis entropy (WPTE) for feature extraction and an FNN with real-coded biogeography-based optimization (RCBBO) for classification.

The literature study reveals that wavelets and their variants (SWT, DWPT, DTCWT, etc.) have been used most frequently for feature extraction in PBDSs. However, the traditional DWT suffers from drawbacks such as limited directional selectivity and translation variance. SWT resolves the translation-variance issue, but it is redundant and still unable to capture higher-dimensional singularities. DTCWT is efficient and less redundant, offering more directional selectivity (six directions) than SWT and DWT. Nevertheless, all these transforms are limited in handling 2D singularities, so further improvement in directional selectivity is needed. Additionally, FNN and SVM, which are common in many PBDSs, require many parameters to tune and are time-consuming. Further, most schemes have been validated on small datasets with high reported accuracies, but they perform poorly when evaluated on large datasets. Thus, there is scope to eradicate the shortcomings of the existing schemes.

To combat these issues, we propose an efficient PBDS to classify MR images as healthy or pathological. The proposed PBDS uses O-DR2T for feature extraction because of its ability to capture directional features (edges and curves). Subsequently, a PCA+LDA based approach is employed to determine the most significant feature set. Eventually, a hybrid learning algorithm, IJaya-ELM, is introduced for the single-hidden-layer feedforward network (SLFN); it offers advantages such as avoidance of local minima, better generalization, a faster learning rate, and well-conditioned solutions in contrast to classifiers like FNN, SVM, LS-SVM, and ELM. These improvements make the proposed PBDS more robust and accurate than other existing schemes.

3 Datasets used

The performance of the proposed PBDS is tested on three benchmark datasets, namely DS-66, DS-160, and DS-255, containing 66, 160, and 255 images, respectively. These datasets hold T2-weighted brain MR images of size 256×256 in the axial view plane and are available on the Harvard Medical School website [22]. Along with healthy brain samples, the datasets DS-66 and DS-160 have samples from seven classes of diseases, namely sarcoma, Alzheimer's disease (AD), AD plus visual agnosia (VA), glioma, meningioma, Huntington's disease (HD), and Pick's disease (PD). DS-255 contains four more diseases, viz., cerebral toxoplasmosis (CTP), multiple sclerosis (MS), herpes encephalitis (HE), and chronic subdural hematoma (CSH). Samples of all kinds of MR images are shown in Fig. 1.

Fig. 1: Samples of T2-weighted brain MR images [8]

Out of the 11 disease types, glioma, meningioma, and sarcoma are brain tumors, while CTP, MS, and HE are infectious diseases. AD, AD plus VA, PD, and HD are degenerative diseases, whereas CSH is a cerebrovascular disease. The proposed work addresses a two-class classification problem (healthy or pathological) in which the pathological class contains images from all disease types.

4 Proposed methodology

This section describes the methods involved in the proposed PBDS, which consists of four steps: preprocessing, feature extraction, feature reduction, and classification. The input of the system is an MR image and the output is a class label (healthy or pathological). In the preprocessing step, contrast limited adaptive histogram equalization (CLAHE) is employed. In the feature extraction step, we use the orthogonal discrete ripplet-II transform, and in the feature reduction step, the PCA+LDA approach is harnessed. Thereafter, for classification, the hybrid learning algorithm IJaya-ELM is utilized, where the improved Jaya (IJaya) algorithm optimizes the initial weights and biases of the SLFN. The proposed PBDS works in two parts: offline learning and online prediction. The former includes the training and evaluation of the system, whereas the latter predicts a class label for a query MR image. A detailed block diagram of the proposed PBDS is depicted in Fig. 2. All the steps are delineated below.

Fig. 2: Detailed block diagram of the proposed PBD system

4.1 Preprocessing based on CLAHE

It is observed that most of the images in the datasets considered in this work have low contrast. Therefore, contrast limited adaptive histogram equalization (CLAHE), a standard technique, is employed for contrast enhancement. CLAHE first evaluates a histogram of gray values in a contextual region centered on each pixel and then assigns each pixel an intensity within the display range [32]. Additionally, it uses a fixed value, dubbed the clip limit, to clip the histogram prior to computing the cumulative distribution function (CDF); the portion of the histogram that exceeds the clip limit is redistributed equally among all histogram bins.
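As an illustration, the following minimal sketch applies CLAHE to a single MR slice. It assumes scikit-image is available; the tile and clip-limit settings mirror those reported later in Section 6.1 (64 contextual regions, 256 bins, clip limit 0.01), though the original work was implemented in MATLAB and is only approximated here.

```python
# Hedged CLAHE preprocessing sketch using scikit-image.
import numpy as np
from skimage import exposure, img_as_float

def preprocess_clahe(mri_slice: np.ndarray) -> np.ndarray:
    """Contrast-enhance a 256x256 T2-weighted MR slice with CLAHE."""
    img = img_as_float(mri_slice)
    # 64 contextual regions on a 256x256 image -> an 8x8 tile grid,
    # i.e., 32x32-pixel kernels; 256 bins; normalized clip limit 0.01.
    return exposure.equalize_adapthist(img, kernel_size=32,
                                       nbins=256, clip_limit=0.01)
```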

4.2 Feature extraction based on O-DR2T

The Fourier transform has been found less suitable for feature extraction in images, as it loses time (spatial) information and cannot handle 1D singularities. Hence, it fails to represent images containing edges efficiently and works well only for smooth images. In contrast, the wavelet transform performs better in representing 1D (point) singularities, but the conventional wavelet transform cannot represent 2D singularities along arbitrarily shaped curves. To resolve this problem, the ridgelet transform, based on the Radon transform, was introduced [2, 9]. The ridgelet transform holds great potential for representing line singularities (it can extract lines of arbitrary orientation), but it is still unable to handle 2D singularities along curves. Thereafter, the first-generation curvelet transform, based on multiscale ridgelets, was proposed by Candès and Donoho [3] to resolve 2D singularities along smooth curves. Later, they proposed the second-generation curvelet transform [1], which is simpler, faster, and less redundant than the former. Because of capabilities like multiresolution, high directional selectivity, anisotropy, and localization, it has drawn attention over the last decades. Its anisotropy guarantees the resolution of 2D singularities along $C^{2}$ curves, which the curvelet achieves through a parabolic scaling law [4]. However, the reason behind the choice of parabolic scaling is not clear. To resolve this issue, the ripplet-I transform was proposed, which generalizes the scaling law [12, 46]: it generalizes the curvelet transform by adding two parameters, support c and degree d, and reduces to the curvelet transform when c = 1 and d = 2. These two parameters give the ripplet-I transform the anisotropy needed to represent 2D singularities along arbitrarily shaped curves. The ripplet-II transform [45], based on the generalized Radon transform (GRT) [6, 7], was then proposed to further improve the representation of 2D singularities; it satisfies multiresolution, localization, good directionality, and flexibility. Moreover, compared to the wavelet and ridgelet transforms, ripplet-II has the fastest coefficient decay, which enables sparser representation of images with edges. A variant known as the orthogonal ripplet-II transform generates even sparser feature vectors than the ripplet-II transform, which is crucial for the classification task, and it has therefore been leveraged in applications like texture classification and image retrieval [45]. Since the orthogonal ripplet-II transform represents edges and textures more efficiently than other conventional transforms, and the affected regions in MR images contain edges and textures of arbitrary shapes, it is used as the feature extraction tool in this work.

4.2.1 Ripplet-II transform

Given a 2D function g(x,y), the continuous ripplet-II transform in polar coordinates (ρ,α) is defined as

$$ RT2{_{g}}(s,t,d,\theta)=\int \int \bar{\psi}_{s,t,d,\theta}(\rho,\alpha)g(\rho,\alpha)\rho \ d\rho \ d\alpha $$
(1)

where g(ρ,α) is the polar-coordinate representation of g(x,y), \(\psi _{s,t,d,\theta }: \mathbb {R}^{2}\rightarrow \mathbb {R}^{2}\) is the ripplet-II function, and \(\bar {\psi }\) is the complex conjugate of ψ. The ripplet-II function is stated as

$$ \psi_{s,t,d,\theta}(\rho,\alpha)=s^{-1/2}\varphi((\rho \cos^{d}((\theta-\alpha)/d)-t)/s) $$
(2)

where \(\varphi : \mathbb {R}\rightarrow \mathbb {R}\) is a smooth univariate wavelet function, and s > 0, \(t \in \mathbb {R}\), \(d \in \mathbb {N}\), and 𝜃 ∈ [0,2π) indicate the scale, translation, degree, and orientation parameters, respectively. By tuning these parameters, the ripplet-II transform can capture structural information along arbitrary curves. Using (1) and (2), we have

$$ RT2{_{g}}(s,t,d,\theta)=\left\langle \varphi_{s,t}(r),GR_{d}[g]\right\rangle $$
(3)

where $GR_{d}[g]$ is the GRT of the function g, defined as

$$ GR_{d}(r,\theta)=\int \int g(\rho,\alpha)\delta(r-\rho \cos^{d}((\alpha-\theta)/d)) \rho \ d\rho \ d\alpha $$
(4)

The GRT can also be evaluated using the Fourier transform [45]. Equation (3) indicates that the ripplet-II transform is the inner product between the GRT and a 1D wavelet. It can also be represented as

$$ RT2_{g}(s,t,d,\theta)=\mathrm{WT}_{1D}\left(GR_{d}[g](r,\theta)\right) $$
(5)

which defines the ripplet-II transform in two steps: first compute the GRT of g, and then compute the 1D WT (with respect to r) of the GRT of g.

The discrete version of the ripplet-II transform (DR2T) can be defined as

$$ DR2T_{g}=\mathrm{DWT}_{1D}\left(DGR_{d}[g](r,\theta)\right) $$
(6)

in which the discrete GRT (DGRT) of g is first computed and subsequently the 1D discrete WT (DWT) of the DGRT of g is computed. The computation of the discrete ripplet-II transform becomes simpler when d = 2; in this case, the GRT is dubbed the 'parabolic Radon transform' and is defined as follows [45]

$$ GR_{2}(r,\theta)= 2 \sqrt{r}\, R[g(\rho^{\prime 2},2 \alpha^{\prime})](\sqrt{r},\theta / 2) $$
(7)

where $R[g(\rho,\alpha)](r,\theta)$ is the classical Radon transform (CRT) in polar coordinates. In general, however, the GRT of a function g for d > 0 takes the following form in the Fourier domain

$$ GR_{d}^{F}(r,\theta)= 2\sum\limits_{n=-\infty}^{+\infty}\left[{\int}_{r}^{\infty}\!\!\int g(\rho,\alpha)e^{-in\alpha}d\alpha \times (1-(r/\rho)^{2/d})^{-1/2} \times T_{nd}((r/\rho)^{1/d})\,d\rho \right] e^{in\theta} $$
(8)

where $T_{n}(\cdot)$ denotes the Chebyshev polynomial of degree n. In summary, the forward DR2T with d = 2 of an input image can be computed as follows:

  (i) Convert the input function from Cartesian to polar coordinates, i.e., g(x,y) to g(ρ,α). Replace (ρ,α) by $(\rho^{\prime 2},2\alpha^{\prime})$ in g(ρ,α). Subsequently, generate a new image $g^{\prime}(x,y)$ by interpolation after converting the polar coordinates $(\rho^{\prime},\alpha^{\prime})$ back to Cartesian coordinates (x,y), where x and y take integer values.

  (ii) Apply the discrete CRT to $g^{\prime}(x,y)$ to produce $R(r^{\prime},\theta^{\prime})$, and then substitute $(r^{\prime},\theta^{\prime})$ with \((\sqrt {r},\theta /2)\) in $R(r^{\prime},\theta^{\prime})$ as in (7) to obtain the DGRT coefficients $GR_{2}(r,\theta)$.

  (iii) Apply the 1D DWT to the DGRT coefficients with respect to r to obtain the discrete ripplet-II coefficients.

The above substitution of $(r^{\prime},\theta^{\prime})$ with \((\sqrt {r},\theta /2)\) makes the DR2T coefficients sparser than those of other transforms.
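The three steps above can be prototyped as follows. This is a rough, hedged sketch rather than the authors' implementation: `skimage.transform.radon` stands in for the discrete CRT, and the polar resampling of step (i) is abbreviated, so only the overall DGRT-then-1D-DWT structure is faithful.

```python
# Simplified forward DR2T sketch for d = 2 (illustrative only).
import numpy as np
import pywt
from skimage.transform import radon

def dr2t(image: np.ndarray, wavelet: str = "haar", level: int = 2):
    # Step (ii): classical Radon transform over half-angles theta/2;
    # the 2*sqrt(r) factor mirrors the reindexing of Eq. (7).
    theta = np.linspace(0.0, 180.0, max(image.shape), endpoint=False)
    sinogram = radon(image, theta=theta / 2.0, circle=False)  # rows index r
    r = np.arange(sinogram.shape[0])
    dgrt = 2.0 * np.sqrt(r)[:, None] * sinogram
    # Step (iii): 1D DWT of the DGRT coefficients along r.
    return pywt.wavedec(dgrt, wavelet, level=level, axis=0)
```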

4.2.2 Orthogonal ripplet-II transform

The orthogonal ripplet-II transform is an extension of the ripplet-II transform obtained by applying a 2D WT to the GRT coefficients, along both r and 𝜃, in place of the 1D WT along r. The additional WT along the angle 𝜃 improves the sparsity of the transform coefficients. The continuous orthogonal ripplet-II transform of the function g takes the form

$$\begin{array}{@{}rcl@{}} RT2_{g}^{orth}(s,t_{1},t_{2},d)= 2\sum\limits_{n=-\infty}^{+\infty} \int \int \frac{1}{s} \bar{\varphi}\left(\frac{r-t_{1}}{s}\right)\bar{\varphi}\left(\frac{\theta-t_{2}}{s}\right) {\int}_{r}^{\infty}\!\!\int g(\rho,\alpha)e^{-in\alpha}d\alpha \\ \times(1-(r/\rho)^{2/d})^{-1/2} \times T_{nd}((r/\rho)^{1/d}) \ d\rho \ e^{in\theta} \ dr \ d\theta \end{array} $$
(9)

Now, the discrete orthogonal ripplet-II transform (O-DR2T) can be stated as

$$ \text{O-}DR2T_{g}=\mathrm{DWT}_{2D}\left(DGR_{d}[g](r,\theta)\right) $$
(10)

where the DGRT of g is first evaluated and thereafter the 2D DWT of the DGRT of g is applied. It is worth mentioning that, unlike DR2T, O-DR2T has no explicit direction parameter, so it may lose explicit directional information. However, it is shown in [45] that the orthogonal ripplet-II transform supplies sparser image features than the wavelet, ridgelet, and standard ripplet-II transforms because the 1D DWT is replaced with a 2D DWT. Hence, O-DR2T is used as the feature extractor in the proposed system.

4.2.3 Feature generation

For each training MR image, we apply O-DR2T and obtain the transform coefficients, which are arranged in a feature vector of dimension D = mn, where m and n are the number of rows and columns of the image. This vector is calculated for each training image, and a feature matrix is finally formed. The implementation of the feature generation procedure is outlined in Algorithm 1, and a sketch is given below.

Algorithm 1: Feature generation based on O-DR2T
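The following hedged sketch of the feature generation step reuses the DGRT approximation from the DR2T sketch above and applies a 2-level 2D Haar DWT (the O-DR2T step), flattening all coefficients into one row per image; the helper names are ours, not the paper's.

```python
# Feature-matrix construction sketch following Algorithm 1 (illustrative).
import numpy as np
import pywt
from skimage.transform import radon

def odr2t_features(image: np.ndarray, wavelet: str = "haar",
                   level: int = 2) -> np.ndarray:
    """Flattened O-DR2T coefficient vector for one preprocessed slice."""
    theta = np.linspace(0.0, 180.0, max(image.shape), endpoint=False)
    sinogram = radon(image, theta=theta / 2.0, circle=False)
    dgrt = 2.0 * np.sqrt(np.arange(sinogram.shape[0]))[:, None] * sinogram
    # O-DR2T: a 2D DWT of the DGRT coefficients instead of the 1D DWT.
    coeffs = pywt.wavedec2(dgrt, wavelet, level=level)
    arr, _ = pywt.coeffs_to_array(coeffs)
    return arr.ravel()

def build_feature_matrix(images) -> np.ndarray:
    # One row per training image; the paper uses D = m*n = 65536 features
    # for 256x256 slices (the sketch's D depends on the Radon grid).
    return np.stack([odr2t_features(img) for img in images])
```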

4.3 Feature reduction based on PCA+LDA

The features generated by O-DR2T are of high dimension, which leads to high computational overhead and large storage requirements. Therefore, dimensionality reduction is of great importance. PCA is a frequently used technique that transforms high-dimensional input data into a lower-dimensional space while retaining the maximum variation of the data [35]. In contrast, linear discriminant analysis (LDA) attempts to find a feature subspace that best discriminates between the classes. However, conventional LDA performs poorly on high-dimensional, small-sample-size problems, where the within-class scatter matrix ($S_{w}$) is always singular [49]. Further, ensuring that $S_{w}$ does not become singular requires at least D + C samples (where D is the feature dimension and C the number of classes), which is generally impractical [24]. To address this issue, the well-known PCA+LDA method is applied in this study: D-dimensional data is first reduced to M dimensions using PCA and then to l dimensions using LDA, with l ≪ M < D.

To obtain a relevant feature set, we first sort the eigenvalues in decreasing order and then calculate the normalized cumulative sum of variances (NCSV) for each feature. The NCSV value for the j-th feature is defined as

$$ NCSV(j)=\frac{\sum\limits_{u = 1}^{j}\alpha(u)}{\sum\limits_{u = 1}^{D}\alpha(u)} \qquad \quad ;\ 1\le j \le D $$
(11)

where α(u) represents the eigenvalue of the u-th feature and D denotes the dimensionality of the feature vector. A threshold is set manually, and the smallest number of features whose NCSV value surpasses the threshold is selected; this number is determined experimentally to maximize accuracy. Note that the coefficients of the l eigenvectors (the basis vectors, BV) corresponding to the l largest eigenvalues are retained if these l eigenvalues collectively satisfy the given threshold.
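A minimal PCA+LDA sketch with scikit-learn follows. Note one assumption: scikit-learn's LDA yields at most C − 1 discriminant components (one axis for this two-class problem), so the exact feature counts reported later are not reproduced here; passing a float to `PCA(n_components=...)` implements precisely the NCSV-style cumulative-variance threshold of (11).

```python
# PCA+LDA feature reduction sketch (scikit-learn).
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

def reduce_features(X_train, y_train, X_test, ncsv_threshold=0.95):
    # A float n_components keeps the smallest number of components whose
    # cumulative explained-variance ratio exceeds the threshold (Eq. (11)).
    pca = PCA(n_components=ncsv_threshold).fit(X_train)
    Z_train, Z_test = pca.transform(X_train), pca.transform(X_test)
    # LDA then projects onto its (at most C-1) discriminant axes.
    lda = LinearDiscriminantAnalysis().fit(Z_train, y_train)
    return lda.transform(Z_train), lda.transform(Z_test)
```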

4.4 Classification based on IJaya-ELM

In this section, we first discuss the preliminaries of the extreme learning machine (ELM) and the Jaya algorithm, and thereafter present the proposed IJaya-ELM learning algorithm for single-hidden-layer feedforward neural networks (SLFNs) in order to classify brain MR images as healthy or pathological.

4.4.1 Extreme Learning Machine (ELM)

Single-hidden-layer feedforward neural networks (SLFNs) have been used in many applications, as they can approximate any continuous function and classify any disjoint region. Gradient-based learning algorithms such as Levenberg-Marquardt (LM) and back-propagation (BP) have been widely used to train SLFNs. Despite their popularity, these algorithms face various issues such as slow learning due to improper learning steps, getting trapped in local minima, requiring a large number of iterations to obtain good learning performance, and overfitting [21]. A recently developed learning algorithm called the extreme learning machine (ELM) avoids these limitations and can also solve multi-class classification and regression tasks [19, 20]. In contrast to conventional learning algorithms such as BP, SVM, and LS-SVM, ELM learns faster with better generalization performance. In ELM, the hidden node parameters (the input weights and hidden biases) are randomly assigned, while the output weights of the SLFN are determined analytically by a simple inverse operation on the hidden layer output matrix. ELM is described mathematically below.

Given N distinct training samples $(x_{i},t_{i})$, where $x_{i}=[x_{i1},x_{i2},\ldots,x_{il}]^{T} \in \mathbb{R}^{l}$ and $t_{i}=[t_{i1},t_{i2},\ldots,t_{iC}]^{T} \in \mathbb{R}^{C}$, an SLFN having $n_{h}$ hidden nodes and activation function $\phi(\cdot)$ can be represented as

$$ \sum\limits_{i = 1}^{n_{h}}{w^{o}_{i}} \phi(x_{j})=\sum\limits_{i = 1}^{n_{h}}{w^{o}_{i}} \phi({w^{h}_{i}} \cdot x_{j} + b_{i})=o_{j}, \ \ j = 1,2,\ldots,N $$
(12)

Here, \({w^{h}_{i}}=\left [ w^{h}_{i1},w^{h}_{i2},\ldots ,w^{h}_{il}\right ]^{T}\) represents the weight vector linking the i-th hidden neuron and the input neurons, \({w^{o}_{i}}=\left [ w^{o}_{i1},w^{o}_{i2},\ldots ,w^{o}_{iC}\right ]^{T}\) the weight vector connecting the i-th hidden neuron and the output neurons, and $b_{i}$ the bias of the i-th hidden neuron. The SLFN can approximate these N samples with zero error, i.e., there exist \({w^{h}_{i}}\), \({w^{o}_{i}}\), and $b_{i}$ such that

$$ \sum\limits_{i = 1}^{n_{h}}{w^{o}_{i}} \phi({w^{h}_{i}} \cdot x_{j} + b_{i})=t_{j}, \ \ j = 1,2,\ldots,N $$
(13)

Now, (13) can be represented in matrix form as

$$ \mathbf{H}w^{o}=\mathbf{T} $$
(14)

where,

$$\begin{array}{@{}rcl@{}} &&\mathbf{H}({w^{h}_{1}},{w^{h}_{2}},\ldots,w^{h}_{n_{h}}, b_{1},b_{2},\ldots,b_{n_{h}},x_{1},x_{2},\ldots,x_{N})\\\ &&=\left[{\begin{array}{ccc} \phi({w^{h}_{1}}\cdot x_{1}+b_{1}) & {\ldots} & \phi(w^{h}_{n_{h}}\cdot x_{1} + b_{n_{h}})\\ {\vdots} &{\ldots} & \vdots\\ \phi({w^{h}_{1}}\cdot x_{N}+b_{1}) & {\ldots} & \phi(w^{h}_{n_{h}}\cdot x_{N} + b_{n_{h}}) \end{array}}\right]_{N\times n_{h}} ,\\ &&w^{o} = \left[{\begin{array}{c} {{w^{o}_{1}}}^{T}\\ \vdots\\ {w^{o}_{n_{h}}}^{T}\\ \end{array}}\right]_{n_{h}\times C} \,and\, \mathbf{T} = \left[{\begin{array}{c} {t_{1}^{T}}\\ \vdots\\ {t_{N}^{T}}\\ \end{array}}\right]_{N\times C}\end{array} $$

Here, H denotes the hidden layer output matrix. Now, the output weights $w^{o}$ can be determined analytically as the smallest-norm least-squares (LS) solution of the linear system (14):

$$ \hat{w^{o}}=\mathbf{H}^{\dagger}\mathbf{T} $$
(15)

where $\mathbf{H}^{\dagger}$ indicates the Moore-Penrose (MP) generalized inverse of the matrix H; with this method, ELM attains better generalization performance [68]. The smallest-norm LS solution is unique and has the minimum norm among all LS solutions. As the solution of ELM is obtained analytically, without iterative parameter tuning, it converges faster than traditional learning algorithms.
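The whole training procedure of (12)-(15) fits in a few lines. The sketch below is a minimal NumPy rendering, assuming a sigmoid activation and one-hot targets T.

```python
# Minimal ELM sketch: random hidden-node parameters, analytic output
# weights via the Moore-Penrose pseudoinverse (Eq. (15)).
import numpy as np

def elm_train(X, T, n_hidden, rng=np.random.default_rng(0)):
    l = X.shape[1]
    Wh = rng.uniform(-1.0, 1.0, size=(l, n_hidden))  # input weights w^h
    b = rng.uniform(-1.0, 1.0, size=n_hidden)        # hidden biases
    H = 1.0 / (1.0 + np.exp(-(X @ Wh + b)))          # hidden output matrix
    Wo = np.linalg.pinv(H) @ T                       # w^o = H† T
    return Wh, b, Wo

def elm_predict(X, Wh, b, Wo):
    H = 1.0 / (1.0 + np.exp(-(X @ Wh + b)))
    return H @ Wo                                    # network outputs o_j
```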

4.4.2 Jaya algorithm

The Jaya algorithm is a recent optimization algorithm developed by Rao [33] that has been gaining the attention of researchers for its simplicity and robustness, and it has been shown to provide better results than other optimization algorithms [34, 41]. Unlike other population-based optimization algorithms, it needs no algorithm-specific parameters, only common control parameters such as population size and number of generations. Its conceptual idea is to always move the obtained solution toward the best solution and away from the worst solution.

Suppose f(s) is the objective function to be minimized or maximized. At any iteration k, let there be n candidate solutions (j = 1,2,…,n), each with m variables (d = 1,2,…,m). If $s_{jd}(k)$ denotes the value of the d-th variable of the j-th solution during iteration k, then its modified value is obtained as

$$ s^{\prime}_{jd}(k)=s_{jd}(k)+r_{1d}(k)(s_{bestd}(k)-|s_{jd}(k)|)-r_{2d}(k)(s_{worst d}(k)-|s_{jd}(k)|) $$
(16)

where $s_{best\,d}(k)$ and $s_{worst\,d}(k)$ represent the values of the best and worst candidate solutions in the d-th dimension during iteration k; the best and worst candidates are the solutions with the best and worst fitness values in the entire population at that iteration. $r_{1d}(k)$ and $r_{2d}(k)$ are two random numbers in the interval [0,1] for dimension d during iteration k, and \(s^{\prime }_{jd}(k)\) denotes the updated value of $s_{jd}(k)$. The term $r_{1d}(k)(s_{best\,d}(k)-|s_{jd}(k)|)$ indicates that the solution tries to move toward the best solution, and the term $-r_{2d}(k)(s_{worst\,d}(k)-|s_{jd}(k)|)$ indicates that it tries to avoid the worst solution. The modified value \(s^{\prime }_{jd}(k)\) is accepted if it yields a better objective value. The overall steps of the Jaya algorithm are shown in Fig. 3 and sketched below.

Fig. 3: Flow diagram of the Jaya algorithm

4.4.3 Proposed improved extreme learning machine

Since ELM randomly chooses the input weights and hidden biases, two crucial problems arise [47, 66, 68]: (i) ELM needs more hidden neurons than conventional gradient-based methods, which makes it respond slowly to unknown testing data, and (ii) with many hidden neurons, ELM tends to produce an ill-conditioned hidden layer output matrix H, which induces poor generalization performance. The condition number is a good quantitative measure of the conditioning of a matrix [66]: it indicates how close a system is to being ill-conditioned, with ill-conditioned systems having large condition numbers and well-conditioned systems small ones. The 2-norm condition number of the matrix H can be calculated as

$$ \mathcal{K}_{2}(\mathbf{H})=\sqrt{\frac{\lambda_{max}(\mathbf{H}^{T} \mathbf{H})}{\lambda_{min}(\mathbf{H}^{T}\mathbf{H})}} $$
(17)

where $\lambda_{max}(\mathbf{H}^{T}\mathbf{H})$ and $\lambda_{min}(\mathbf{H}^{T}\mathbf{H})$ denote the largest and smallest eigenvalues of the matrix $\mathbf{H}^{T}\mathbf{H}$.
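In NumPy, this quantity is available directly: the 2-norm condition number equals the ratio of the extreme singular values of H, which matches the eigenvalue form of (17).

```python
# 2-norm condition number of the hidden layer output matrix (Eq. (17)).
import numpy as np

def conditioning(H: np.ndarray) -> float:
    return np.linalg.cond(H, 2)  # = sqrt(lmax(H^T H) / lmin(H^T H))
```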

To tackle these issues, several efforts have been made over the last decade using evolutionary algorithms (EAs) and swarm intelligence, since such algorithms perform global search for optimization problems [18]. Zhu et al. [68] suggested a hybrid algorithm called evolutionary ELM (E-ELM), where a modified differential evolution (DE) algorithm optimizes the hidden node parameters and the MP generalized inverse finds the solution; E-ELM provides faster learning and better generalization than other traditional algorithms and obtains a much more compact network than ELM, but it demands two additional parameters to tune, namely the mutation factor and the crossover factor. Xu and Shu [47] introduced another evolutionary ELM based on PSO (PSO-ELM) to select the hidden node parameters, which requires only one parameter to tune; they added boundary conditions to conventional PSO to enhance the performance of ELM. Later, in [14], an improved PSO based ELM (IPSO-ELM) was proposed to find optimal SLFNs; IPSO considers both the root mean squared error (RMSE) and the norm of the output weights on a validation set to obtain better convergence. Suresh et al. [37] proposed a hybrid learning algorithm using a real-coded genetic algorithm and ELM (RCGA-ELM) for no-reference image quality assessment, but RCGA requires two genetic parameters, the crossover and mutation rates. Zhao et al. [66] offered an input weight selection technique that improves the conditioning of ELM with the help of linear hidden neurons, achieving numerical stability without degrading accuracy.

From the above literature, it is observed that researchers have utilized optimization algorithms like GA and its variants, PSO and its variants, DE, etc., to find optimal hidden node parameters. These techniques have their own advantages, but they need proper tuning of their algorithm-specific parameters, which significantly influences their performance. Therefore, to avoid improper tuning, this paper uses the parameter-free Jaya algorithm. In addition, we propose a new scheme, IJaya-ELM, which combines an improved Jaya (IJaya) algorithm with ELM and avoids the issues faced by the existing methods above. In this scheme, IJaya optimizes the hidden node parameters while the MP generalized inverse finds the solution analytically. The improved Jaya algorithm searches for global optima considering both the RMSE and the norm of the output weights of the SLFN, which improves the generalization performance and the conditioning of the SLFN. The main goals of IJaya are to minimize the norm of the output weights and to bound the hidden node parameters within a specific range in order to enhance the convergence of ELM. The steps of the proposed IJaya-ELM are delineated as follows:

  (a) First, randomly initialize all the candidate solutions in the population such that each candidate solution consists of a set of input weights and hidden biases:

    $$ s_{j}=\left[ w^{h}_{11},w^{h}_{12},\ldots,w^{h}_{1l}, w^{h}_{21},w^{h}_{22},\ldots,w^{h}_{2l},\ldots,w^{h}_{n_{h}1},w^{h}_{n_{h}2},\ldots,w^{h}_{n_{h}l},b_{1},b_{2},\ldots,b_{n_{h}}\right] $$
    (18)

    Note that all the input weights and hidden biases are initialized randomly within the range [-1,1].

  (b) For each solution, evaluate the output weights and the fitness. We take the root mean squared error (RMSE) on the validation set, rather than on the whole training set, as the fitness in order to avoid overfitting. The fitness is defined as

    $$ f(s_{j})=\sqrt{\frac{\sum\limits_{v = 1}^{N_{v}}\left\|\sum\limits_{i = 1}^{n_{h}}{w^{o}_{i}} \phi({w^{h}_{i}} \cdot x_{v} + b_{i})-t_{v}\right\|^{2}_{2}}{N_{v}}} $$
    (19)

    where $N_{v}$ indicates the number of validation samples.

  (c) Find $s_{best}$ and $s_{worst}$ among all the solutions in the population and modify the solutions using (16).

  (d) Update the solutions using the fitness value and the norm of the output weights, and generate the new population as follows:

    $$ s_{j}(k + 1)= \left\{\begin{array}{ll} s^{\prime}_{j}(k) & \text{if} \ f(s_{j}(k))-f(s^{\prime}_{j}(k))> \epsilon f(s_{j}(k)) \ \\ & \text{or } \left( |f(s_{j}(k))-f(s^{\prime}_{j}(k))| < \epsilon f(s_{j}(k)) \ \text{ and } ||w^{o}_{s^{\prime}_{j}}||< ||w^{o}_{s_{j}}||\right) \\ s_{j}(k) & \text{otherwise} \end{array}\right. $$
    (20)

    where $f(s_{j}(k))$ and \(f(s^{\prime }_{j}(k))\) denote the fitness values of candidate solution j and its modified counterpart during iteration k, respectively; \(w^{o}_{s_{j}}\) and \(w^{o}_{s^{\prime }_{j}}\) represent the output weights generated by the MP generalized inverse for candidate solution j and its modified counterpart, respectively; and $\epsilon > 0$ is a user-defined tolerance rate.

  (e) As given in the literature, all the input weights and biases should lie in the range [-1, 1]. Therefore, the following rule is used in IJaya-ELM to handle out-of-bound solutions:

    $$ s_{jd}(k + 1)=\left\{\begin{array}{ll} -1 & \text{if} \ s_{jd}(k + 1)<-1\\ 1 & \text{if} \ s_{jd}(k + 1)> 1 \end{array}\right. , \quad 1\le j\le N_{p}, \ 1\le d \le D $$
    (21)
  (f) Repeat steps (c)-(e) until the maximum number of iterations is reached. Finally, the optimal input weights and hidden biases are obtained and applied to the testing data to evaluate the performance of the system.

As the proposed scheme uses (20) to find the optimal input weights and hidden biases, it tends to produce a smaller norm of the output weights of the SLFN, which in turn leads to a smaller condition number of the hidden layer output matrix. In summary, the proposed IJaya-ELM has the following advantages: it has no algorithm-specific parameters, it improves conditioning, and it yields better generalization performance with a much more compact network. Moreover, unlike gradient-based methods, the proposed approach does not require the activation function to be differentiable. A compact sketch of the whole procedure is given below.
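The sketch below strings the pieces together. It is our hedged reconstruction of steps (a)-(f), assuming the ELM helpers above, with fitness given by the validation RMSE of (19), the norm-aware acceptance rule of (20) with ε = 0.05, and the clipping of (21).

```python
# IJaya-ELM sketch: Jaya searches the hidden-node parameters; each
# candidate is scored by validation RMSE after solving the output
# weights analytically (illustrative reconstruction, not the authors' code).
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fitness_and_norm(sol, Xtr, Ttr, Xval, Tval, l, n_h):
    Wh = sol[: l * n_h].reshape(l, n_h)
    b = sol[l * n_h :]
    Wo = np.linalg.pinv(sigmoid(Xtr @ Wh + b)) @ Ttr          # Eq. (15)
    resid = sigmoid(Xval @ Wh + b) @ Wo - Tval
    rmse = np.sqrt(np.sum(resid ** 2) / Xval.shape[0])        # Eq. (19)
    return rmse, np.linalg.norm(Wo)

def ijaya_elm(Xtr, Ttr, Xval, Tval, n_h, n_pop=20, iters=30, eps=0.05,
              rng=np.random.default_rng(0)):
    l = Xtr.shape[1]
    S = rng.uniform(-1, 1, size=(n_pop, l * n_h + n_h))       # Eq. (18)
    fn = [fitness_and_norm(s, Xtr, Ttr, Xval, Tval, l, n_h) for s in S]
    for _ in range(iters):
        f = np.array([v[0] for v in fn])
        best, worst = S[f.argmin()], S[f.argmax()]
        r1, r2 = rng.random(S.shape), rng.random(S.shape)
        S_new = np.clip(S + r1 * (best - np.abs(S))
                          - r2 * (worst - np.abs(S)), -1, 1)  # Eqs. (16), (21)
        for j in range(n_pop):
            f_new, nrm_new = fitness_and_norm(S_new[j], Xtr, Ttr,
                                              Xval, Tval, l, n_h)
            f_old, nrm_old = fn[j]
            # Eq. (20): accept on a clear RMSE gain, or on a near-tie
            # when the output-weight norm shrinks.
            if (f_old - f_new > eps * f_old or
                    (abs(f_old - f_new) < eps * f_old and nrm_new < nrm_old)):
                S[j], fn[j] = S_new[j], (f_new, nrm_new)
    f = np.array([v[0] for v in fn])
    return S[f.argmin()]                   # optimal w^h and b (flattened)
```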

The proposed PBDS involves O-DR2T, PCA+LDA, and IJaya-ELM, and is hence referred to as O-DR2T + PCA+LDA + IJaya-ELM. The overall steps are articulated in Algorithm 2.

Algorithm 2: Overall steps of the proposed O-DR2T + PCA+LDA + IJaya-ELM scheme

5 Experimental design and evaluation

To validate the proposed PBDS, simulations have been carried out on three different datasets, namely DS-66, DS-160, and DS-255. For statistical analysis, cross-validation (CV) has been employed to avoid over-fitting; CV helps the classifier generalize to independent data. In this work, we incorporate stratification into CV, which splits the folds so that each fold has a similar class distribution. Figure 4 depicts the setting of a 5-fold CV for a single run: in each trial, one fold is used for testing, one for validation, and the rest for training. The validation set is used to set the parameters of IJaya-ELM, i.e., it tells us when to stop training, while the test set is used to evaluate the performance over a run of five trials. For DS-160 and DS-255, we select 5-fold stratified cross-validation (SCV). For DS-66 (18 healthy and 48 pathological images), a 5-fold SCV would give folds with unequal class proportions, so we employ 6-fold SCV instead. The statistical setting for all three datasets is kept similar to the literature, as shown in Table 1, and the SCV procedure is run 10 times on each dataset to average out randomness. A sketch of the split generation is given below.

Fig. 4: Illustration of the 5-fold cross-validation setting for a single run

Table 1 Statistical setting of K-fold SCV for three benchmark datasets [8, 27, 55]

Four measures, namely sensitivity ($S_{e}$), specificity ($S_{p}$), precision ($P_{r}$), and accuracy (ACC), are used to evaluate the proposed system. $S_{e}$ is the fraction of pathological MR samples correctly predicted by the model, while $S_{p}$ is the fraction of healthy MR samples correctly predicted. ACC is the fraction of correctly predicted samples (both pathological and healthy) among all testing samples. Moreover, to compare the proposed IJaya-ELM scheme with Jaya-ELM, PSO-ELM, APSO-ELM, E-ELM, and GA-ELM, two additional measures are used: the condition number and the norm of the output weights.

6 Experimental results and analysis

The proposed system was implemented in MATLAB on a machine with a 3.4 GHz processor, 8 GB RAM, and the Windows 10 OS. The parameters used and the statistical setup were kept similar to those of other competent schemes to allow relative comparisons.

6.1 Preprocessing and feature extraction results

The quality of the features extracted from an MR image depends on the quality of the input image. CLAHE, which relies on the proper setting of its parameters, is utilized to enhance the original MR images. In the present case, the original MR image is divided into 64 contextual regions, and the number of bins and the clip limit (β) are set to 256 and 0.01, respectively. A uniform distribution is selected for each region to obtain a flat histogram shape. Representative enhanced images corresponding to four original MR images are shown in Fig. 5; the affected regions in the enhanced images are clearer than in the originals.

Fig. 5: Preprocessing using CLAHE. Row 1 lists the original MR samples; row 2 lists the corresponding contrast-enhanced images

O-DR2T is applied to each preprocessed image, and the transform coefficients serve as the features. The number of levels in the 2D DWT is set to 2 with the Haar wavelet as the basis. As the images are of size 256 × 256, the total number of features extracted from a single image is 256 × 256 = 65536, which is huge.

6.2 Feature reduction results

To attain better performance and make the classifier's job easier, the high-dimensional O-DR2T features (65536 per image) are reduced using PCA+LDA. The number of significant features is determined from the NCSV values. It is observed that PCA needs more features to preserve the maximum information, whereas PCA+LDA requires relatively few: setting the NCSV threshold to 0.95, two features are obtained from PCA+LDA versus 15 from PCA alone. Additionally, the classification accuracy as a function of the number of features for both PCA and PCA+LDA on the three datasets is shown in Fig. 6. From the figures, it is clear that PCA with 15 features and PCA+LDA with two features provide the highest results on all three datasets. Therefore, the PCA+LDA approach is more suitable than PCA alone.

Fig. 6: Classification accuracy with respect to the number of features for the three datasets

6.3 Classification results

To classify MR images as healthy or pathological, we employ the combined learning algorithm IJaya-ELM for the SLFN. In this section, we first compare the performance of the proposed IJaya-ELM with other learning algorithms, namely Jaya-ELM, GA-ELM, PSO-ELM, adaptive PSO-ELM (APSO-ELM), E-ELM, ELM, and BPNN; APSO-ELM refers to the basic PSO-ELM with a time-varying inertia weight. In IJaya-ELM, we use the sigmoid activation function and normalize all network inputs to the range [-1,1]. The population size and the maximum number of iterations for IJaya-ELM, Jaya-ELM, GA-ELM, PSO-ELM, APSO-ELM, and E-ELM are kept the same, 20 and 30 respectively. The 𝜖 value in IJaya-ELM is experimentally determined as 0.05. The parameters of the other algorithms, also determined experimentally, are as follows: in PSO-ELM, the acceleration coefficients $c_{1}$ and $c_{2}$ are set to 2; in E-ELM, the crossover rate (CR) and scaling factor (F) are set to 0.9 and 0.8, respectively; in APSO-ELM, the initial and final inertia parameters $\omega_{1}$ and $\omega_{2}$ are chosen as 0.4 and 0.9, respectively; and in GA-ELM, the crossover rate and mutation rate are 0.7 and 0.1, respectively.

The performance of IJaya-ELM, Jaya-ELM, GA-ELM, PSO-ELM, APSO-ELM, E-ELM, ELM, and BPNN on the three benchmark datasets is reported in Tables 2, 3, and 4. From the tables, it is seen that IJaya-ELM obtains higher accuracy than the others with fewer hidden neurons on all datasets. Jaya-ELM achieves ideal accuracy on DS-160, while it earns lower accuracy than IJaya-ELM on DS-66 and DS-255. Further, E-ELM performs better than the others except IJaya-ELM, and APSO-ELM outperforms PSO-ELM. It can also be seen that standard ELM demands more hidden neurons than the other algorithms. Furthermore, the condition number of the matrix H obtained by IJaya-ELM is smaller than those of the others on all datasets, and its norm of the output weights is also the smallest; it can therefore be expected to generalize better than traditional ELM and its variants, since a smaller norm of $w^{o}$ results in a smaller condition number of H. Between Jaya-ELM and IJaya-ELM, the latter obtains smaller condition and norm values and higher accuracy. Therefore, the proposed IJaya-ELM achieves better generalization performance with more compact networks than the others. The results reported in the tables are averages over 50 trials.

Table 2 Performance comparison of different classifiers on DS-66
Table 3 Performance comparison of different classifiers on DS-160
Table 4 Performance comparison of different classifiers on DS-255
Fig. 7: Classification accuracy achieved by different classifiers on the three standard datasets

Moreover, to demonstrate the effectiveness of the proposed IJaya-ELM with two features, an accuracy comparison is made with the k-NN, random forest (RF), and SVM classifiers along with BPNN, ELM, and Jaya-ELM on all three datasets; the results are shown in Fig. 7. For DS-66, the accuracies earned by k-NN, BPNN, RF, SVM, ELM, and Jaya-ELM are 98.94%, 100.00%, 99.39%, 100.00%, 100.00%, and 99.85%, respectively. For DS-160 they are 99.44%, 99.88%, 99.56%, 99.88%, 99.94%, and 100.00%, respectively, and for DS-255 they are 98.88%, 99.29%, 99.02%, 99.33%, 99.37%, and 99.61%, respectively. IJaya-ELM earns ideal classification on DS-66 and DS-160 and an accuracy of 99.69% on DS-255, which is superior to all the other classifiers. Therefore, the proposed learning algorithm is the most suitable among those compared.

Table 5 lists the correctly classified samples and the corresponding accuracies obtained by O-DR2T + PCA+LDA + IJaya-ELM on DS-255 during each trial of a 10 × K-fold SCV process. The results indicate that the proposed scheme correctly classifies 2542 out of 2550 samples (2200 pathological and 350 healthy). Among the 2200 pathological samples, 2195 are correctly classified and the remaining five are misclassified as healthy; among the 350 healthy samples, 347 are correctly classified and the remaining three are misclassified as pathological. From these counts, the sensitivity ($S_{e}$), specificity ($S_{p}$), and precision ($P_{r}$) of the proposed scheme are computed as 99.77%, 99.14%, and 99.86%, respectively, as listed in Table 6 and verified below.
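These figures follow directly from the confusion counts above:

$$ S_{e}=\frac{2195}{2200}=99.77\%, \quad S_{p}=\frac{347}{350}=99.14\%, \quad P_{r}=\frac{2195}{2195+3}=99.86\%, \quad ACC=\frac{2542}{2550}=99.69\% $$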

Table 5 10 × 5-fold SCV result of O-DR2T + PCA+LDA + IJaya-ELM method on DS-255
Table 6 Classification performances (%) of the proposed schemes based on PCA and PCA+LDA over three datasets

To compare the efficacy of PCA+LDA over PCA, another experiment has been carried out on all three datasets. The performance of the two schemes, O-DR2T + PCA + IJaya-ELM and O-DR2T + PCA+LDA + IJaya-ELM, is shown in Table 6. The proposed O-DR2T + PCA+LDA + IJaya-ELM scheme earns better performance than O-DR2T + PCA + IJaya-ELM on all datasets with relatively fewer features, and the PCA-only variant obtains slightly lower sensitivity, specificity, and precision values. Since a higher sensitivity indicates a better CAD system, the proposed O-DR2T + PCA+LDA + IJaya-ELM scheme holds greater potential for making correct clinical decisions.

Further, to support the effectiveness of O-DR2T features over DWT features, we conducted an experiment in which DWT features are used in place of O-DR2T features, recording results with the same number of features, as shown in Table 7. The proposed scheme achieves better performance than the DWT based schemes on all datasets. Here, the DWT features are extracted from all sub-bands of a 3-level decomposition. Additionally, the DWT features used in the literature [10, 27, 53] are also tested and yield lower accuracy than the proposed scheme. It is therefore concluded that the O-DR2T features bring potential improvements to the performance of the PBDS.

Table 7 Classification accuracy (%) comparison of the proposed method with wavelet based method

6.4 Comparison with other PBDSs

An extensive comparison with twenty-two existing competent PBDSs has been made on the three datasets with respect to feature count, number of runs, and classification accuracy, as given in Table 8. A large number of the PBDSs yield perfect classification on DS-66, but only two schemes, RT + PCA + LS-SVM [8] and DWPT + TE + GEPSVM [52], offer ideal classification on DS-160. On DS-255, no existing PBDS achieves perfect classification, but the suggested system earns a higher classification accuracy (99.69%) than the others with a minimum number of features. Though the improvement in accuracy is marginal and comparable with some existing schemes, the result is obtained over a number of runs of a K-fold SCV procedure, which reflects the robustness and reliability of the proposed scheme. The use of IJaya-ELM further provides better generalization performance and a faster response on unknown testing data.

Table 8 Comparative analysis with other competent PBDSs on three standard datasets

From the experimental results, it is clear that the suggested scheme yields superior performance in terms of classification accuracy and number of features compared to other existing schemes on all three datasets. The proposed system employs O-DR2T and IJaya-ELM, which together possess several advantages: O-DR2T captures edge and texture features effectively from MR images, while IJaya-ELM obtains a compact network structure, faster learning, and better generalization in contrast to the traditional learning algorithms frequently employed in existing PBDSs. These methods collectively strengthen the system. However, the proposed system has the following limitations. It has been validated on three available datasets containing images from patients in the middle and late stages of disease; a larger dataset with images from all stages should be tested to achieve better generalization. Moreover, the current work solves a two-class classification problem, whereas multi-class brain disease classification is also highly desirable.

7 Conclusions and future work

In this paper, an attempt has been made to develop an efficient pathological brain detection system. The proposed scheme first uses O-DR2T to extract features from the enhanced brain MR images. Subsequently, a PCA+LDA approach is employed to reduce the feature dimensionality. Finally, a novel learning algorithm called IJaya-ELM is proposed to train the SLFN. The proposed scheme inherits the advantages of O-DR2T and ELM for the detection of pathological brains in MR images. The experimental results on three standard datasets demonstrate that the proposed scheme yields higher accuracy than other competent schemes with a minimum number of features. Moreover, the proposed IJaya-ELM learning algorithm has been shown to hold several advantages over other learning algorithms.

The proposed IJaya-ELM can be applied to regression problems as well as multi-label classification problems. Other advanced machine learning techniques such as dictionary learning and deep learning could be investigated as alternatives to IJaya-ELM in the future. However, our proposed system has the following limitations: it has been validated on small datasets, and a larger dataset collected online would further prove its effectiveness; further, the images in the chosen datasets are from the middle and late stages of disease, and images from all stages need to be validated. In the future, interactive machine learning (iML) algorithms [15,16,17] can also be studied to overcome the issues of automatic machine learning algorithms.