1 Introduction

Brain diseases are recognized as being among the predominant causes of death across age groups worldwide. In general, brain diseases are categorized into various types such as cerebrovascular disease (stroke), degenerative disease, infectious disease and neoplastic disease (brain tumor). These diseases are progressive, and their occurrence increases with age. Early diagnosis is hence of great importance to limit the severity of these diseases and improve the patient’s quality of life. Magnetic resonance imaging (MRI), a non-invasive neuroimaging technique, has been widely used in biomedical and clinical research, particularly in pathological brain detection [28, 32, 42]. MRI provides better resolution of brain tissues than other imaging techniques such as CT, SPECT and X-ray [12, 19]. However, manual inspection of MR images is onerous, time-consuming and requires skilled supervision. Therefore, an automatic computer-aided medical diagnosis (CAMD) system is essential to facilitate fast, reliable, and accurate decisions [29, 37]. Automatic pathological brain detection has been extensively studied in the past decade. These studies can be roughly categorized into two groups: (1) binary classification based detection systems that separate pathological MR images from normal MR images, and (2) multiclass classification based systems that classify the brain into several categories (different types of brain diseases along with the healthy brain).

A large number of automated systems have been proposed for binary classification of brain MR images in past years. In the following, we outline the evolution of these automatic models. Chaplot et al. [2] developed an automatic diagnosis model that derives features from the low-frequency component of the discrete wavelet transform (DWT) and employs a support vector machine (SVM) for classification. In [4], the DWT features are first subjected to principal component analysis (PCA), and classification is then carried out using k-NN and feed-forward neural network (FNN) classifiers. Later, Zhang et al. [41,42,43,44] developed various hybrid automatic models using DWT features and classifiers such as FNN and kernel SVM, whose parameters are tuned by different optimization techniques. Then, a model based on ripplet transform (RT) features and a least squares SVM (LS-SVM) classifier was proposed in [3]. Later on, Nayak et al. presented a model based on DWT features and AdaBoost with a random forest classifier. In [27] and [45], the entropy features of the low-frequency components at each scale of an 8-level DWT decomposition (DWT-AE) are fed to a probabilistic neural network (PNN) for pathological brain detection, whereas Wang et al. [35] extracted entropy features from all the components of the 8-level DWT decomposition. The potency of wavelet packet Tsallis entropy and Shannon entropy features is individually evaluated with a generalized eigenvalue proximal SVM (GEPSVM) classifier in [46]. Yang et al. [40] proposed using DWT energy features and an SVM classifier for pathological brain detection. Later, stationary wavelet transform (SWT) based energy and entropy features were introduced in [21] to improve classification accuracy. Curvelet-based features have been studied in [20, 25] and yielded significant classification results. In [48], a model based on pseudo-Zernike moments and kernel SVM is proposed. Nayak et al. [24] presented an improved model based on two-dimensional PCA features and an evolutionary extreme learning machine (ELM), while in [47], wavelet packet Tsallis entropy features (DWPT-TE) and a Jaya-optimized ELM classifier are used to build the model. Recently, two contributions analyzed the effect of ripplet-II features (DR2T) together with two individual improved ELM classifiers on detection results [22, 23].

In comparison, there is scant literature on multiclass brain MR image classification. Kalbkhani et al. [14] have extracted features using DWT and generalized autoregressive conditional heteroscedasticity technique. The classification is carried out using SVM classifier that assigns the MR images into eight different categories (one normal and seven brain diseases). Recently, a five-category classification system is developed using deep stacked sparse autoencoder (SSA) in [12].

The literature reveals that almost all existing approaches follow a conventional multi-stage pipeline of feature extraction, feature selection, and classification. One of the major concerns in these approaches is the choice of proper feature descriptors and classifiers. DWT has been extensively used for feature extraction despite shortcomings such as limited directional selectivity and shift variance. Moreover, the detection accuracy for multiclass brain MR classification is still far from meeting real-time requirements. Deep neural-network models, on the other hand, have recently achieved remarkable success in medical image analysis [17, 38]. These models automatically learn high-level features from the input data through their hierarchical structure and eliminate the need for hand-engineered features. The deep SSA used in [12], however, has several drawbacks. The weights of the hidden layers of SSA are initialized by independent autoencoders in an unsupervised fashion, and the whole network is then fine-tuned using the traditional back-propagation (BP)-based learning algorithm. Because the parameters of SSA are optimized using this iterative procedure, learning is slow.

In this paper, an automated multiclass classification model based on the deep extreme learning machine is developed to counter the above challenges. The main contributions of the current work are summarized as follows.

  • ELM is an emerging learning paradigm for single layer feed-forward neural network (SLFN) that produces better generalization capability at faster learning speed [9, 11]. The deep ELM network, also called ML-ELM, is employed for multiclass classification of MR images that involves a stack of ELM autoencoders to provide high-level feature representations and does not require fine-tuning [15].

  • The leaky rectified linear unit (LReLU) has proved successful in a wide range of applications compared to the sigmoid, tanh and ReLU functions [18, 39]. Hence, the LReLU function is adopted in ML-ELM, which avoids computationally expensive operations (in particular, the exponential). The suggested model is referred to as ML-ELM+LReLU in the remainder of the paper.

  • The efficacy of the proposed model is evaluated on a multiclass brain MR image dataset.

The remainder of this paper is structured as follows. A detailed description of the dataset is presented in Section 2. Section 3 presents the basic concepts and theory of ELM along with its practical issues. The proposed method is detailed in Section 4. Section 5 presents the experimental settings and results. Finally, concluding remarks are drawn in Section 6.

2 Materials

A multiclass brain MR dataset comprising 200 images (40 normal and 160 pathological brain images) is used to evaluate the proposed model. The pathological brains contain diseases of four categories, namely brain stroke, degenerative, infectious and brain tumor; each category holds 40 images. The images are sourced from the Harvard Medical School website [13] and are T2-weighted MR scans acquired along the axial view plane. All the images have a resolution of 256 × 256 pixels. The dataset is labeled as the ‘Multiclass Harvard Dataset’ (MCHD). Figure 1 depicts typical brain MR samples from each of the five categories.

Fig. 1 Typical brain MR images: (a) healthy, (b) brain stroke, (c) degenerative, (d) infectious, and (e) brain tumor

3 Extreme learning machine (ELM)

ELM, developed by Huang et al. [9], is a learning mechanism for SLFNs with good generalization capability and fast learning speed. It overcomes the issues of traditional training algorithms and hence has been applied in a wide range of classification and regression applications [10, 11]. The principle of ELM is that the hidden node parameters are assigned randomly and kept fixed during training, while the output weights are computed analytically by the least squares method.

In ELM, the network response at a single output node is computed as

$$ {f}_K\left(\mathbf{x}\right)=\sum \limits_{i=1}^K{w}_i^o{h}_i\left(\mathbf{x}\right)=\mathbf{h}\left(\mathbf{x}\right){w}^o $$
(1)

where \( {w}^o={\left[{w}_1^o,\dots, {w}_K^o\right]}^T \) denotes the output weights that link the K hidden nodes to the output node, and \( \mathbf{h}\left(\mathbf{x}\right)=\left[{h}_1\left(\mathbf{x}\right),\dots, {h}_K\left(\mathbf{x}\right)\right] \) is the output of the hidden layer (also called the feature representation) for the input x, which maps the data from the d-dimensional input space to the K-dimensional hidden-layer feature space (ELM feature space). ELM is inspired by Bartlett’s theory for feedforward neural networks [1], and it aims to minimize both the training error and the norm of the output weights.

$$ \operatorname{Minimize}:\kern1em {\left\Vert \mathbf{H}{w}^o-Y\right\Vert}^2\kern1em \mathrm{and}\kern1em \left\Vert {w}^o\right\Vert $$
(2)

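where \( Y={\left[{y}_1,\dots, {y}_N\right]}^T \) denotes the target labels and \( \mathbf{H}={\left[{\mathbf{h}}^T\left({\mathbf{x}}_1\right),\dots, {\mathbf{h}}^T\left({\mathbf{x}}_N\right)\right]}^T \). The output weights \( {w}^o \) can be computed using the Moore-Penrose (MP) generalized inverse of matrix H, denoted \( {\mathbf{H}}^{\dagger } \), as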

$$ {w}^o={\mathbf{H}}^{\dagger }Y $$
(3)

An alternative for computing \( {w}^o \) is reported in [11]; it introduces a regularization parameter C and provides a more robust solution with better generalization performance as

$$ {w}^o={\mathbf{H}}^T{\left(\frac{I}{C}+\mathbf{H}{\mathbf{H}}^T\right)}^{-1}Y $$
(4)

or,

$$ {w}^o={\left(\frac{I}{C}+{\mathbf{H}}^T\mathbf{H}\right)}^{-1}{\mathbf{H}}^TY $$
(5)
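
As a minimal illustration of Eqs. (1), (4) and (5), the following NumPy sketch trains a basic ELM. It is not the implementation used in this work: the function names, the Gaussian initialization and the tanh activation are assumptions made only for the example. Eq. (4) solves an N × N system and Eq. (5) a K × K system, so the sketch picks whichever is smaller.

```python
import numpy as np

def output_weights(H, Y, C, use_dual=None):
    """Regularized least-squares output weights of an ELM.

    use_dual=True  -> Eq. (4): H^T (I/C + H H^T)^{-1} Y  (solves an N x N system)
    use_dual=False -> Eq. (5): (I/C + H^T H)^{-1} H^T Y  (solves a K x K system)
    use_dual=None  -> pick whichever system is smaller.
    """
    N, K = H.shape
    if use_dual is None:
        use_dual = N < K
    if use_dual:
        return H.T @ np.linalg.solve(np.eye(N) / C + H @ H.T, Y)
    return np.linalg.solve(np.eye(K) / C + H.T @ H, H.T @ Y)

def elm_fit(X, Y, K, C, seed=0):
    """Train a basic ELM: random fixed hidden layer, analytic output weights."""
    rng = np.random.default_rng(seed)
    W_h = rng.standard_normal((X.shape[1], K))   # random input weights (never trained)
    b = rng.standard_normal(K)                   # random biases (never trained)
    H = np.tanh(X @ W_h + b)                     # hidden output; tanh is an assumption
    return W_h, b, output_weights(H, Y, C)

def elm_predict(X, W_h, b, w_o):
    """Network response h(x) w_o of Eq. (1), one column per output node."""
    return np.tanh(X @ W_h + b) @ w_o
```

With one-hot targets Y, the predicted class of a sample is the index of the largest entry in the corresponding row returned by `elm_predict`.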

One of the notable characteristics of ELM is that it provides a unified solution to binary-class as well as multiclass classification tasks. However, the basic ELM is shallow in structure and thus, it may not be constructive for feature learning when dealing with natural signals such as images and videos. The deep learning architecture of ELM called multi-layer ELM (ML-ELM) [15, 31] is shown to be effective for learning meaningful feature representations from images.

4 Deep extreme learning machine with Leaky ReLU

The key challenge in the traditional machine learning approach is the proper choice of features, as they largely determine the generalization performance. Therefore, careful feature engineering is essential to provide an effective representation of the input data. However, designing such engineered features needs domain knowledge and expertise, and thus takes a lot of time. Multilayer neural networks (MLNN) can be effective in representing complex data (e.g., images); each layer attempts to learn increasingly high-level features. However, in practice, it is difficult to train MLNNs. Hence, neural network architectures such as the autoencoder (AE) and the restricted Boltzmann machine (RBM) [7, 8] have gained significant interest from researchers for feature learning. These networks effectively train MLNNs one layer at a time, and serve as the basic components for building several deep neural networks such as stacked autoencoders (SAE) [34], stacked denoising autoencoders (SDAE) [34], deep belief networks (DBN) [8] and deep Boltzmann machines (DBM) [26, 30]. In particular, SAE and SDAE stack AEs, while DBN and DBM stack RBMs. These deep networks train their hidden layers individually using either AEs or RBMs in an unsupervised manner, and the whole network is then fine-tuned in a supervised fashion using a traditional learning method such as BP. Thus, the training of these deep networks is time-consuming and cumbersome. In contrast, the deep ELM (ML-ELM) proposed in [15, 31] facilitates faster and effective learning without the need for fine-tuning. Therefore, ML-ELM is adopted in our work. In addition, we introduce the LReLU function in the hidden layer with the aim of improving the learning speed. In the following, we discuss the basic building block of ML-ELM, i.e., the ELM autoencoder (ELM-AE), and the deep architecture adopted in the current study.

4.1 ELM autoencoder with Leaky ReLU (ELM-AELR)

ELM theory has been extended to an autoencoder, known as ELM-AE, that serves as the basic building block of ML-ELM. ELM-AE learns powerful feature representations of the input data [15]. Like a conventional AE, ELM-AE consists of two parts: (i) an encoder and (ii) a decoder, as depicted in Fig. 2.

Fig. 2 Network structure of an ELM-AELR. The output is the same as the input. The LReLU activation function is used to compute the hidden output. x denotes the d-dimensional input, K indicates the number of hidden neurons, \( {w}^h \) indicates the input weights, b represents the bias, and \( {w}^o \) indicates the output weights

The encoder maps the input \( \mathbf{x}=\left[{x}_1,{x}_2,\dots, {x}_d\right] \) to a high-level feature representation \( \mathbf{h}\left(\mathbf{x}\right)=\left[{h}_1\left(\mathbf{x}\right),{h}_2\left(\mathbf{x}\right),\dots, {h}_K\left(\mathbf{x}\right)\right] \) using a set of random weights and biases \( \left({w}^h,b\right) \). The \( \left({w}^h,b\right) \) values are made orthogonal according to [15]. Rather than the sigmoid, tanh and other non-linear functions that are commonly used in traditional AE and ELM-AE, the leaky ReLU function is adopted in this study (ELM-AELR) for feature mapping because of its potential advantages [18, 39]. The leaky ReLU function, unlike ReLU, assigns a small non-zero slope to negative inputs, and is mathematically defined as follows

$$ \varphi (x)=\left\{\begin{array}{cc}x,& x\ge 0\\ {}\alpha x,& x<0\end{array}\right. $$
(6)

where α is a fixed parameter. LReLU avoids the exponential operation required by the sigmoid and tanh activation functions and hence helps in faster learning. A clipped ReLU activation can also be considered in place of LReLU [6]. The decoder maps h(x) back to the input x through the output weights \( {w}^o \), which are estimated analytically using \( {w}^o={\mathbf{H}}^T{\left(\mathbf{H}{\mathbf{H}}^T+\frac{I}{C}\right)}^{-1}\mathbf{X} \) or \( {w}^o={\left({\mathbf{H}}^T\mathbf{H}+\frac{I}{C}\right)}^{-1}{\mathbf{H}}^T\mathbf{X} \), where \( \mathbf{X}=\left[{\mathbf{x}}_1,{\mathbf{x}}_2,\dots, {\mathbf{x}}_N\right] \) denotes the input (or output) data and \( \mathbf{H}=\left[{\mathbf{h}}_1,{\mathbf{h}}_2,\dots, {\mathbf{h}}_N\right] \) collects the hidden-layer outputs for each input sample. It is worth mentioning that, similar to ELM-AE, the ELM-AELR can learn three different representations of the input data: (a) a compressed feature representation (K < d), (b) an equal-dimension feature representation (K = d), and (c) a sparse feature representation (K > d).
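
The encoder and decoder described above can be sketched in NumPy as follows. This is an illustrative sketch rather than the code used in the experiments: the function names are hypothetical, the value α = 0.01 is an assumption (the text only states that α is fixed), and orthogonalization via a QR decomposition with a unit-norm bias is one common way to obtain orthogonal random parameters. Samples are stored as rows of X.

```python
import numpy as np

def leaky_relu(x, alpha=0.01):
    """Eq. (6): identity for x >= 0, slope alpha for x < 0 (alpha value assumed)."""
    return np.where(x >= 0, x, alpha * x)

def orthogonal_weights(d, K, rng):
    """Random (d, K) matrix with orthonormal columns (K <= d) or rows (K > d)."""
    A = rng.standard_normal((d, K))
    if K <= d:
        Q, _ = np.linalg.qr(A)       # Q has shape (d, K) and Q^T Q = I
        return Q
    Q, _ = np.linalg.qr(A.T)         # Q has shape (K, d) and Q^T Q = I
    return Q.T                       # shape (d, K) with orthonormal rows

def elm_aelr_fit(X, K, C, alpha=0.01, rng=None):
    """Fit one ELM autoencoder with LReLU; returns (w_o, H)."""
    rng = np.random.default_rng() if rng is None else rng
    d = X.shape[1]
    W_h = orthogonal_weights(d, K, rng)      # random encoder weights, never trained
    b = rng.standard_normal(K)
    b /= np.linalg.norm(b)                   # unit-norm random bias
    H = leaky_relu(X @ W_h + b, alpha)       # hidden representation, shape (N, K)
    # Decoder: w_o = (H^T H + I/C)^{-1} H^T X, so that H @ w_o approximates X
    w_o = np.linalg.solve(H.T @ H + np.eye(K) / C, H.T @ X)
    return w_o, H
```

Only the decoder weights `w_o` (shape K × d) are learned; choosing K smaller than, equal to, or larger than d gives the compressed, equal-dimension, or sparse representation, respectively.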

4.2 Deep ELM with Leaky ReLU (ML-ELM+LReLU)

Similar to other deep networks, ML-ELM+LReLU is a multi-layer network that stacks ELM-AELRs (as shown in Fig. 3). The weights of the hidden layers are assigned by the ELM-AELRs, which perform layer-wise unsupervised learning. The output of each hidden layer i in ML-ELM+LReLU can be computed as follows

$$ {\mathbf{H}}_i=\varphi \left({\mathbf{H}}_{i-1}.{w}_i^{oT}\right);\kern1em 1\le i\le L $$
(7)

where φ(·) represents the LReLU activation function, \( {\mathbf{H}}_i \) denotes the output matrix at the ith hidden layer, \( {w}_i^o \) denotes the learned output weights of the ith autoencoder, and L denotes the number of hidden layers. Note that \( {\mathbf{H}}_0 \) represents the input data X. Finally, the output weights are evaluated analytically, as in conventional ELM, using the following equation

$$ {w}^o={\mathbf{H}}_2^T{\left(\frac{I}{C}+{\mathbf{H}}_2{\mathbf{H}}_2^T\right)}^{-1}Y $$
(8)

or,

$$ {w}^o={\left(\frac{I}{C}+{\mathbf{H}}_2^T{\mathbf{H}}_2\right)}^{-1}{\mathbf{H}}_2^TY $$
(9)

where \( {\mathbf{H}}_2 \) is the output of the last hidden layer and \( Y={\left[{y}_1,\dots, {y}_N\right]}^T \) are the target labels. One of the most notable characteristics of ML-ELM+LReLU is that, unlike traditional deep networks, it does not need additional fine-tuning, which helps to achieve faster learning.
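
The layer-wise procedure of Eqs. (7) to (9) can be summarized by the following NumPy sketch. It is a simplified illustration under stated assumptions, not the implementation evaluated in Section 5: the function names are hypothetical, α = 0.01 is assumed, plain Gaussian random weights are used instead of orthogonalized ones for brevity, and samples are stored as rows with one-hot targets Y.

```python
import numpy as np

def leaky_relu(x, alpha=0.01):
    return np.where(x >= 0, x, alpha * x)

def ridge_solve(H, T, C):
    """Regularized least squares in the form of Eqs. (5) and (9)."""
    return np.linalg.solve(H.T @ H + np.eye(H.shape[1]) / C, H.T @ T)

def ae_output_weights(X, K, C, alpha, rng):
    """One ELM autoencoder with LReLU: random hidden layer, analytic decoder."""
    W_h = rng.standard_normal((X.shape[1], K))   # Gaussian weights (orthogonalization omitted)
    b = rng.standard_normal(K)
    H = leaky_relu(X @ W_h + b, alpha)
    return ridge_solve(H, X, C)                  # shape (K, d): H @ w approximates X

def ml_elm_fit(X, Y, layer_sizes, ae_Cs, out_C, alpha=0.01, seed=0):
    """Stack autoencoders layer by layer, then solve for the output weights."""
    rng = np.random.default_rng(seed)
    H, weights = X, []                           # H_0 is the input data
    for K, C in zip(layer_sizes, ae_Cs):
        w_i = ae_output_weights(H, K, C, alpha, rng)
        H = leaky_relu(H @ w_i.T, alpha)         # Eq. (7): H_i = phi(H_{i-1} w_i^T)
        weights.append(w_i)
    w_out = ridge_solve(H, Y, out_C)             # Eq. (9) on the last hidden layer
    return weights, w_out

def ml_elm_predict(X, weights, w_out, alpha=0.01):
    """Forward pass through the stacked layers; returns predicted class indices."""
    H = X
    for w_i in weights:
        H = leaky_relu(H @ w_i.T, alpha)
    return (H @ w_out).argmax(axis=1)
```

No iterative fine-tuning step appears anywhere in the sketch: every set of weights is either random or obtained from a single linear solve.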

Fig. 3 Architecture of the ML-ELM+LReLU. The upper part shows the two independent ELM-AELRs that perform unsupervised learning. The lower part shows the stacked ELM-AELRs, with hidden layers initialized using the pre-trained weights and the output weights (shown in dashed lines) computed using L2-norm regularized least squares. X denotes the input data, \( {\mathbf{H}}_i \) denotes the output matrix at the ith hidden layer, \( {K}_i \) indicates the number of hidden neurons at the ith hidden layer, \( {w}_i^o \) denotes the learned output weights of the ith autoencoder, and c represents the number of classes

5 Experimental settings and results

All the programs are developed in the MATLAB 2017b environment and are run on a machine with an Intel Xeon 2.4 GHz processor and 64 GB RAM. The dataset is divided into two parts: a training set and a testing set. We randomly choose 60% of the MR samples for training and the remaining 40% for testing the model. In particular, 120 and 80 MR samples are chosen for training and testing, respectively.
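
For illustration, a random 60/40 split of 200 sample indices can be sketched as follows. The function name and the seed are hypothetical, and the sketch draws a plain (non-stratified) random split, which is an assumption since the text does not state whether the split is balanced across classes.

```python
import numpy as np

def random_split(n_samples, train_fraction, seed=0):
    """Shuffle sample indices and cut them into training and testing parts."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(n_samples)
    n_train = int(round(train_fraction * n_samples))
    return order[:n_train], order[n_train:]

train_idx, test_idx = random_split(200, 0.6)
print(len(train_idx), len(test_idx))  # 120 80
```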

5.1 Experimental setup

We include two hidden layers in the proposed ML-ELM+LReLU network and thus, two individual ELM-AELRs are required to initialize their weights. In ML-ELM+LReLU, only two hyperparameters need to be set, namely the number of nodes in the hidden layers (K) and the regularization parameter (C), while conventional deep neural networks demand more parameters to tune. The hyperparameters K and C need to be chosen carefully to achieve good generalization performance. In the experiments, K and C are selected from {50, 100, 150, 200, …, 3000} and \( \left\{{10}^{-10},{10}^{-9},\dots, {10}^9,{10}^{10}\right\} \), respectively. Similar to ML-ELM+LReLU, a two-layer architecture is used for the traditional ML-ELM, SAE and SDAE. The optimal network configuration for ML-ELM+LReLU is experimentally chosen as (256 × 256)-500-800-5. For a fair comparison, we use the same architecture for the original ML-ELM. The regularization parameters for training the two individual ELM-AEs are set to \( {10}^{-1} \) and \( {10}^2 \), while the parameter is set to \( {10}^4 \) for the final-layer output computation. Further, we choose a (256 × 256)-100-50-5 network for both SAE and SDAE, which demand more user-specified hyperparameters than ML-ELM and ML-ELM+LReLU. The number of epochs for these two networks is set to 100 and 200 during pre-training and fine-tuning, respectively. The L2 regularization parameters of the two autoencoders are initialized to 0.004 and 0.002. For SDAE, the input corruption rate is set to 0.2.
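
The candidate values of K and C stated above can be enumerated as in the short sketch below. The variable names are hypothetical, and how the candidates are combined across layers during the search is not specified in the text, so the sketch only builds the two sets and counts the (K, C) pairs for a single layer.

```python
import numpy as np

# Candidate numbers of hidden nodes: 50, 100, ..., 3000
K_grid = np.arange(50, 3001, 50)

# Candidate regularization parameters: 10^-10, 10^-9, ..., 10^10
C_grid = 10.0 ** np.arange(-10, 11)

# Number of (K, C) pairs for one layer: 60 x 21 = 1260
n_pairs = len(K_grid) * len(C_grid)
```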

5.2 Data augmentation

The training of deep learning models, in general, requires a very large amount of data in order to provide reliable results. However, obtaining such a large set of medical images is very difficult. One effective way to counter this issue is data augmentation, in which additional images are generated using label-preserving transformations [5]. Moreover, it helps to prevent the network from overfitting and thereby enhances the performance of the deep network [16]. The number of training samples in the dataset considered is too small to build a robust deep learning model. Thus, data augmentation is performed over the training samples, where each image is subjected to the following independent transformations.

  • Flipping in horizontal and vertical direction

  • Rotation by an angle from [−45°, 45°] with a step size of 5°

  • Gamma correction with a random r value in the range [0.7,1.3]

  • Gaussian noise injection with a variance of 0.01

It is worth noting that the augmented images created by the rotation and flipping operations are further subjected to random gamma correction and Gaussian noise injection. The resulting augmented images have the same class label as the original image from which they are obtained. The number of training images is increased by a factor of 63; that is, 120 × 63 = 7560 training samples are generated with the above-mentioned transformations.
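
The transformations listed above can be sketched with NumPy as follows. This is an illustrative sketch only: the function names are hypothetical, intensities are assumed to be scaled to [0, 1], rotation uses simple nearest-neighbour sampling with zero padding, and the sketch does not attempt to reproduce the exact combination of transformations that yields the factor of 63, which is not detailed in the text.

```python
import numpy as np

def rotate_nn(img, angle_deg):
    """Rotate about the image centre (nearest neighbour, zero padding)."""
    h, w = img.shape
    cy, cx = (h - 1) / 2.0, (w - 1) / 2.0
    t = np.deg2rad(angle_deg)
    ys, xs = np.mgrid[0:h, 0:w]
    # Inverse mapping: locate the source pixel for every output pixel
    src_x = np.cos(t) * (xs - cx) + np.sin(t) * (ys - cy) + cx
    src_y = -np.sin(t) * (xs - cx) + np.cos(t) * (ys - cy) + cy
    sx = np.rint(src_x).astype(int)
    sy = np.rint(src_y).astype(int)
    valid = (sx >= 0) & (sx < w) & (sy >= 0) & (sy < h)
    out = np.zeros_like(img)
    out[valid] = img[sy[valid], sx[valid]]
    return out

def gamma_and_noise(img, rng, gamma_range=(0.7, 1.3), noise_var=0.01):
    """Random gamma correction followed by additive Gaussian noise."""
    gamma = rng.uniform(*gamma_range)
    out = img ** gamma
    out = out + rng.normal(0.0, np.sqrt(noise_var), size=img.shape)
    return np.clip(out, 0.0, 1.0)

def augment(img, rng):
    """Flips and rotations, each followed by gamma correction and noise."""
    geometric = [np.fliplr(img), np.flipud(img)]
    geometric += [rotate_nn(img, a) for a in range(-45, 50, 5) if a != 0]
    return [gamma_and_noise(g, rng) for g in geometric]
```

Every image returned by `augment` inherits the class label of the input image.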

5.3 Results and analysis

The classification performance of ML-ELM+LReLU is compared with several relevant methods such as SAE [8], SDAE [33], ML-ELM [15] and ELM [9]. The results obtained over the testing set are listed in Table 1. It is evident that ML-ELM+LReLU outperforms the other deep networks in terms of classification accuracy and training time. Further, improved accuracy is observed with ML-ELM+LReLU when compared with the original ML-ELM and the single-layer ELM.

Table 1 Performance comparison of ML-ELM+LReLU with competing methods

To demonstrate the effectiveness of LReLU activation function over other non-linear functions in the proposed model, an additional experiment is carried out on the MCHD dataset. The classification results of ML-ELM in presence of various non-linear activation functions are individually tabulated in Table 2. The results show the superiority of LReLU function over sigmoid, tanh and ReLU function.

Table 2 Performance evaluation of ML-ELM with different activation functions

The representation learned by the encoder of ELM-AELR can be effective in extracting meaningful features from the input brain MR images. Each neuron in the encoder connects to a set of weights that are tuned to represent a specific visual feature. The representation of features learned by the first ELM-AELR of the ML-ELM+LReLU is shown in Fig. 4. In the figure, we have shown the visualization of the weights associated with only 100 neurons (out of 500) in the encoder since the size of the visual weights is significantly large while considering all neurons. The responses of the learned weights demonstrate brain-like structures.

Fig. 4 Visualization of weights learned by the first ELM-AELR

5.4 Comparison with state-of-the-art methods

We perform a set of experiments to compare the proposed ML-ELM+LReLU framework with recently published schemes. Table 3 and Fig. 5 show the comparison between the proposed framework and the state-of-the-art methods over the MCHD dataset. It is observed from the table that the proposed framework achieves better results than the other schemes in terms of classification accuracy. It can also be noticed that most of the existing methods, except SSA [12], require hand-engineered features. In comparison, ML-ELM+LReLU learns feature representations directly from the MR image and does not require any engineered features.

Table 3 Performance comparison of ML-ELM+LReLU with state-of-the-art methods

Fig. 5 Accuracy comparison plot between the proposed scheme and existing methods

In addition to the improved classification performance, ML-ELM+LReLU provides faster learning speed owing to the following salient features.

  1. Compared to traditional autoencoders, in which both the input and output weights are trained using an iterative method, the ELM-AELR computes only the output weights, using regularized least squares.

  2. A computationally efficient leaky ReLU function is used in place of the commonly used sigmoid and tanh functions.

  3. The weights at each hidden layer of ML-ELM+LReLU are initialized by ELM-AELRs, whereas the weights at the last layer are computed in the same way as in the single-layer ELM.

  4. The ML-ELM+LReLU framework does not require additional fine-tuning.

6 Conclusion

In this paper, an automated computer-aided medical diagnosis system is proposed for multiclass classification of brain MR images. The system employs a deep learning model based on multilayer ELM. The leaky ReLU function is used for feature mapping, which helps to improve both the performance and the computational speed. The basic purpose of the proposed scheme (ML-ELM+LReLU) is to avoid the manual feature extraction process and achieve good generalization performance with faster training speed. An extensive set of experiments has been performed on a multiclass brain MR dataset to verify the effectiveness of the proposed scheme. The obtained results confirm the superiority of the proposed scheme over its counterparts in terms of classification accuracy and training speed. The proposed ML-ELM+LReLU serves as both a feature extractor and a classifier, as opposed to the existing schemes.

The efficacy of the ML-ELM+LReLU model can be tested on several other image classification problems. The application of convolutional neural networks (CNN) could be investigated for multiclass pathological brain detection. At present, our proposed system classifies a brain image into a single brain disorder; in future, we plan to design a system that can detect more than one disorder simultaneously. In addition, obtaining a larger multiclass brain MR dataset remains an open challenge.