1 Introduction

Artificial neural networks (ANNs) are computational models that mimic biological nervous systems such as the human brain. Feedforward neural networks (FNNs) are among the most successful ANNs; they contain no cycles between node connections and are called feedforward because information travels only in the forward direction through the network. The basic functional unit of an FNN is the neuron [41, 99, 152]. The major components of an FNN are i) the input layer, whose neurons receive the input data and forward it to the next layer; ii) the hidden layers, whose neurons apply transformations to the incoming data; iii) the output layer, which produces the final output of the model; and iv) the neuron weights, which represent the strength of the connection between two neurons. Information flows from the input nodes through the hidden layer nodes and finally through the output layer nodes. The loss of an FNN is computed from the actual and predicted outputs using a loss function. Gradient descent is a commonly used optimization technique for finding a local minimum, but it is quite slow [80, 104]; a local minimum is a point at which the loss function attains its smallest value within a local region. Backpropagation is the algorithm used to perform supervised learning in FNNs: the error between predicted and actual output is computed using the loss function, propagated backward through the layers, and the weights are updated according to their contribution to the error.
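As a concrete illustration of the forward pass, loss computation, and backpropagation-based weight update described above, the following is a minimal NumPy sketch of a one-hidden-layer FNN trained with gradient descent. The layer sizes, learning rate, synthetic data, and number of epochs are illustrative assumptions, not values taken from the surveyed literature.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative synthetic data: 100 samples, 4 input features, 1 regression target
X = rng.normal(size=(100, 4))
y = rng.normal(size=(100, 1))

# One hidden layer with 8 neurons; all sizes and the learning rate are arbitrary
W1, b1 = rng.normal(size=(4, 8)), np.zeros(8)
W2, b2 = rng.normal(size=(8, 1)), np.zeros(1)
lr = 0.01

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

for epoch in range(200):
    # Forward pass: input layer -> hidden layer -> output layer
    H = sigmoid(X @ W1 + b1)
    y_hat = H @ W2 + b2
    loss = np.mean((y_hat - y) ** 2)            # mean-squared-error loss

    # Backward pass: propagate the error and update each weight by its contribution
    d_out = 2 * (y_hat - y) / len(X)
    dW2, db2 = H.T @ d_out, d_out.sum(axis=0)
    d_hid = (d_out @ W2.T) * H * (1 - H)        # derivative of the sigmoid
    dW1, db1 = X.T @ d_hid, d_hid.sum(axis=0)

    W1 -= lr * dW1; b1 -= lr * db1              # gradient-descent update
    W2 -= lr * dW2; b2 -= lr * db2

    if epoch % 50 == 0:
        print(epoch, loss)
```

The iterative nature of this loop is precisely what makes BP-based training slow, which motivates the analytic, non-iterative training used by ELM below.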

FNNs have gained considerable importance since the backpropagation (BP) algorithm came into existence [45, 78, 143]. The significant drawbacks of BP include slow convergence, an inability to handle large datasets, and the problem of local minima. Although various improvements to FNN training have been proposed, most of them do not guarantee a globally optimal solution. Thus, an efficient, generalized learning algorithm that can train FNNs faster still needs to be developed.

1.1 Evolution of ML-ELM

A proficient approach named the extreme learning machine (ELM) was put forward to train single-hidden-layer FNNs [55]. In ELM, the hidden nodes are initialized randomly and, crucially, never need to be iteratively tuned. The only learnable parameters are the weights connecting the hidden layer to the output layer, which are computed analytically using the least-squares method [10, 53, 130]. ELM performs remarkably well: it reduces the overall training time, generalizes well, is easy to implement, and can reach a global optimum. In the recent past, many researchers have worked extensively on the theory and applications of ELM [2, 9, 58, 95, 149], and many extensions of ELM have been developed [3, 5, 12, 105, 139, 153]. Despite its merits over other FNNs, ELM also has some limitations:

  • It cannot achieve a high level of data abstraction because it has only a single hidden layer.

  • Its implementation is difficult due to the requirement of a huge network for highly-modified input data.

  • Memory constraints in ELM and the massive computational cost of evaluating the inverse of large matrices make it challenging to manage extensive learning tasks.

To handle the above limitations, ELM gave rise to a new deep learning (DL) architecture called multilayer ELM (ML-ELM) [64, 128], which uses ELM autoencoders to perform unsupervised learning. It can represent complex target functions more easily than the prevalent machine learning (ML) architectures and deep networks. The evolution of ML-ELM is depicted in Fig. 1.

Fig. 1: Evolution of ML-ELM

1.2 Research motivation

Well-known DL architectures such as the Recurrent Neural Network (RNN), Convolutional Neural Network (CNN), and Long Short-Term Memory (LSTM) [42, 84, 86, 88, 114, 135, 164] suffer from a time-consuming training process owing to their complicated hierarchical structures and the need to fine-tune a large number of parameters. ML-ELM can handle such problems since it requires no iterations during the entire training process and no fine-tuning of parameters, while maintaining a high level of data abstraction [64, 110, 111].

Although many researchers have surveyed ELM in their publications [26, 32, 52, 54, 76, 103, 137], ML-ELM is a deep network that requires a thorough review of its own, which is still lacking in the research community. Very few studies in the literature [103, 163] have surveyed the ML-ELM architecture. Thus, the major objective and motivation of this study is to highlight the suitability and effectiveness of ML-ELM, which can give a new direction to the community and shed light on the associated opportunities and challenges.

1.3 Our contributions

As ML-ELM is an emerging field, very few researchers have reviewed it extensively. Zhang et al. [163] presented a review of ML-ELM development and investigated some commonly used hierarchical structures, while Parkavi et al. [103] focused on recent trends in ELM and ML-ELM. Our work, in contrast, provides an exhaustive survey of the ML-ELM architecture and discusses its variants and applications in detail. The comparison of our work with other similar survey papers is presented in Table 1. As can be seen from Table 1, the present work includes a topology of ML-ELM variants proposed for better feature learning, handling outliers or noise, optimizing hidden node parameters, reducing multicollinearity, and reducing overfitting; a comparison of ML-ELM with other ML and DL techniques; and details of ML-ELM feature mapping, none of which were considered in earlier studies. Furthermore, the different variants and applications of ML-ELM are presented comprehensively, and the open issues in ML-ELM are also discussed.

Table 1 Comparison of our work with other similar survey papers

Because of the advantages of ML-ELM mentioned earlier, which suggest that it may accelerate the development of DL, a detailed survey has been carried out covering this DL classifier from its inception to date. Since ML-ELM builds upon ELM, a comprehensive discussion of ELM is also presented, although the paper's main focus is on the different variants and application works of ML-ELM. The significant contributions of the paper include the following:

  i. A comprehensive discussion of the architecture of ELM, the ELM autoencoder, ML-ELM, and the feature mapping of ML-ELM is provided.

  ii. An in-depth study of the various ML-ELM variants is conducted, and a topology is defined that helps in understanding the practicality of the existing approaches.

  iii. A comparative analysis of ML-ELM with other machine learning and DL networks is performed.

  iv. An extensive survey of different applications of ML-ELM, covering the medical, industrial, academic, and security domains, among others, is presented.

  v. Finally, the open issues in ML-ELM are listed, which can provide future research directions in this field.

1.4 Search criteria for selection of papers

The relevant research articles were extracted from various web domains, including Science Direct (https://www.sciencedirect.com/), the ACM Digital Library (https://www.acm.org/), Google Scholar (https://scholar.google.com/), Springer (https://springerlink.bibliotecabuap.elogim.com/) and the IEEE Xplore Digital Library (https://ieeexplore.ieee.org), from 2006 to 2022, with the search keywords ‘Multilayer ELM’, ‘Deep learning’, ‘ELM’, ‘ELM-autoencoder’, ‘Feature space of ML-ELM’, ‘ML-ELM variants’ and ‘ML-ELM applications’. Initially, 1508 search results were retrieved based on these keywords. Of these, 560 research articles were then excluded because of duplicate entries, unrelated titles, non-relevant abstracts, etc. After the screening phase, 142 articles remained and were checked for eligibility. Of these, 70 papers were excluded after full-text reading as they did not meet the desired outcome. Finally, 72 research articles met the inclusion criteria and were used for analysis in the current study. The stages of this search process are described in Fig. 2.

Fig. 2: Search criteria for selection of papers in the current study

1.5 Paper organization

The rest of the paper is organized as follows: Sections 2 and 3 discuss the basics of ELM and ML-ELM, respectively. In Section 4, different variants of ML-ELM have been discussed according to their topology. A comparison of ML-ELM with other machine and DL architectures is provided in Section 5. Section 6 highlights the application domains of ML-ELM. Finally, the conclusion, open issues, and future enhancements of the work are provided in Section 7. The relationship between various parts of the manuscript is depicted in Fig. 3 and the complete layout is presented in Fig. 4.

Fig. 3: Relationship between parts of the manuscript

Fig. 4: Layout of the manuscript

2 Fundamentals of extreme learning machine

ELM (shown in Fig. 5) is a shallow network with a single hidden layer [55]. If X is the input layer with n nodes and H is the hidden layer with L nodes, then the output layer Y can be represented by (1).

$$ Y_{j} = \sum\limits_{i=1}^{L} \beta_{i} g(w_{i}\cdot x_{j}+ b_{i}), j = 1,{\cdots} N $$
(1)
Fig. 5: Structural diagram of ELM

Here, \(N\) is the total number of samples, \(w_{i}\) is the randomly initialized input weight vector connecting the input and hidden layer nodes, \(b_{i}\) is the random bias of the ith hidden node, \(g\) is the activation function, and \(\beta\) is the output weight vector connecting the hidden and output layer nodes. The relationship among H, β and Y is represented in (2).

$$ H\beta = \mathbf{Y} $$
(2)

This implies \(\beta = \mathbf{H}^{-1}\mathbf{Y}\), where \(\mathbf{H}^{-1}\) is the Moore–Penrose generalized inverse of \(\mathbf{H}\) [141].
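The following is a minimal NumPy sketch of this training rule: the hidden weights and biases are drawn at random and never tuned, and the output weights β are obtained in a single step from the Moore–Penrose pseudoinverse of H, as in (1) and (2). The sigmoid activation, layer size, and synthetic data are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def elm_train(X, Y, L=50):
    """Basic ELM: random hidden parameters, analytic output weights."""
    n = X.shape[1]
    W = rng.normal(size=(n, L))                    # random input weights w_i (never tuned)
    b = rng.normal(size=L)                         # random biases b_i
    H = 1.0 / (1.0 + np.exp(-(X @ W + b)))         # hidden layer output, g = sigmoid
    beta = np.linalg.pinv(H) @ Y                   # beta = H^(-1) Y via the Moore-Penrose inverse
    return W, b, beta

def elm_predict(X, W, b, beta):
    H = 1.0 / (1.0 + np.exp(-(X @ W + b)))
    return H @ beta

# Illustrative usage with synthetic data (shapes are assumptions)
X = rng.normal(size=(200, 10))
Y = rng.normal(size=(200, 3))
W, b, beta = elm_train(X, Y, L=100)
Y_hat = elm_predict(X, W, b, beta)
```

Because no iteration is involved, training reduces to one matrix factorization, which is the source of ELM's speed advantage over BP-trained FNNs.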

The classic ELM has been modified in many ways to improve its performance and make it suitable for real-life problems [5, 11, 12, 44, 105, 139, 153, 162, 165]. ELM finds applications in diverse domains, including image processing and computer vision [2, 3, 6, 13, 16, 58, 91, 95, 128, 149, 150, 157]. Several surveys are available that cover the variants and applications of ELM in detail [32, 52, 54, 76, 103, 137].

2.1 ELM autoencoder

Zhou et al. [171] proposed the ELM autoencoder (ELM-AE), in which the output is a reconstruction of the input supplied to the neural network (Fig. 6), as shown in (3). ELM-AE is thus a special case of ELM in which the (supervised) target output is the input itself.

$$ \sum\limits_{i=1}^{L}\beta_{i}g(x_{j}, w_{i}, b_{i}) = x_{j}, j=1, 2, \cdots, N $$
(3)

Equation (4) represents the relationship between input and output feature vector.

$$ \mathbf{H}\beta = \mathbf{X} $$
(4)
Fig. 6: Structural diagram of ELM-AE [64]

Unlike ELM, ELM-AE uses orthogonal hidden layer weights and biases, which enhances its performance [62], as shown in (5). The output weight vector β is used to learn the feature-space transformation in three alternative forms, namely equal-dimension, sparse, and compressed representation, illustrated in (6), (7), and (8), respectively (a minimal code sketch of ELM-AE follows the list below).

$$ \mathbf{h}=g(\mathbf{w}.x +\mathbf{b}), \mathbf{w}^{T}\mathbf{w}=\mathbf{I}, \mathbf{b}^{T}\mathbf{b}=1 $$
(5)

ELM-AE, like ELM, is a universal approximator [51].

  • Equal dimension representation: dimensions of input data and feature space are equivalent, i.e., n = L.

    $$ \beta = \mathbf{H}^{-1}\mathbf{X} $$
    (6)
  • Sparse representation: representing features to a higher dimensional feature space from a lower-dimensional input data space, i.e., n < L.

    $$ \beta= {\mathbf{H}^{T}\left( \frac{\mathbf{I}}{C}+\mathbf{H}\mathbf{H}^{T}\right)}^{-1}\mathbf{X} $$
    (7)
  • Compressed representation: representing features to a lower dimensional feature space from a higher dimensional input data space, i.e., n > L.

    $$ \beta= \left( \frac{\mathbf{I}}{C}+ {\mathbf{H}^{T}\mathbf{H}} \right)^{-1} {\mathbf{H}^{T}\mathbf{X}} $$
    (8)

    Here, I represents the identity matrix, and C is a scale parameter used to balance structural and empirical risk.
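A minimal NumPy sketch of ELM-AE follows, assuming the compressed case (n > L), a sigmoid activation, and an illustrative regularization constant C; the data dimensions are arbitrary. It mirrors (5) and (8): the random hidden weights are orthogonalized, the bias is normalized, and β is obtained analytically so that Hβ reconstructs X.

```python
import numpy as np

rng = np.random.default_rng(0)

def elm_ae(X, L, C=1e3):
    """ELM-AE sketch: orthogonal random hidden parameters (Eq. 5) and
    output weights beta that reconstruct the input (compressed case, Eq. 8)."""
    N, n = X.shape
    W = np.linalg.qr(rng.normal(size=(n, L)))[0]   # orthogonal columns: W^T W = I (assumes L <= n)
    b = rng.normal(size=L)
    b /= np.linalg.norm(b)                         # b^T b = 1
    H = 1.0 / (1.0 + np.exp(-(X @ W + b)))
    beta = np.linalg.solve(np.eye(L) / C + H.T @ H, H.T @ X)   # Eq. (8)
    return beta

# Illustrative usage: 64-dimensional inputs compressed to a 32-dimensional code
X = rng.normal(size=(500, 64))
beta = elm_ae(X, L=32)
codes = X @ beta.T          # beta^T acts as the learned feature transformation
```

The last line anticipates how ML-ELM reuses β: the transposed ELM-AE output weights become the weights of a hidden layer in the stacked network.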

Singular value decomposition (SVD) is widely used for feature representation, and ELM-AE can be regarded as representing features on the basis of SVD [8]. The SVD form corresponding to (8) is given in (9).

$$ \mathbf{H}\beta = \sum\limits_{i=1}^{N}u_{i}\frac{{d_{i}^{2}}}{{d_{i}^{2}} + C}{u_{i}^{T}} \mathbf{X} $$
(9)

where \(d_{i}\) denotes the singular values of \(\mathbf{H}\) corresponding to the SVD of the input data X, and \(u_{i}\) is an eigenvector of \(\mathbf{H}\mathbf{H}^{T}\). The hypothesis is that β learns the representation of the input features through the singular values, and H is the projected feature space of the input feature vector X squashed by a sigmoid function.

2.2 Hierarchical ELM variants

ELM has recently been extended to various multilayer structures, namely multilayer ELM (ML-ELM) [64], hierarchical ELM (H-ELM) [127], and hierarchical local receptive fields ELM (H-LRF-ELM) [48]. ML-ELM is a deep architecture that performs layer-wise unsupervised learning but requires no backpropagation for parameter tuning, which significantly reduces computational time. Tang et al. [127] proposed a hierarchical multilayer framework based on ELM named H-ELM, which comprises two major components, i.e., feature extraction and supervised classification, and utilizes an ELM-based sparse autoencoder built on l1-norm optimization. H-ELM differs from ML-ELM in the following respects: 1) ML-ELM uses a stacked-layer architecture, whereas H-ELM separates the whole network into two distinct subsystems; 2) ML-ELM uses an autoencoder based on the l2-norm, whereas H-ELM makes use of an l1 penalty; and 3) ML-ELM involves orthogonal initialization of parameters, whereas H-ELM avoids it. Experimental results showed that H-ELM has a shorter learning time and higher learning accuracy, with significant applications in domains such as object detection and gesture recognition. Huang et al. [48] put forward the concept of LRF-ELM, which uses sparse connections to learn the representations required in image processing and related tasks. The extension of LRF-ELM to a multilayer architecture is termed hierarchical LRF-ELM (H-LRF-ELM); its main components are a feature extractor and an ELM.

3 Multilayer extreme learning machine

The main focus of this survey is an in-depth study of ML-ELM, including its functioning, architectural variants, and applications, which are discussed in the following sections.

3.1 Principle of ML-ELM

ELM and ELM-AE are combined to form multilayer ELM, which has more than one hidden layer [64], as shown in Fig. 7. It inherits all the properties of ELM and ELM-AE. Some of the essential features of ML-ELM are:

  • The architecture is created by gradually stacking ELM-AEs.

  • The first level of the stacked ELM-AEs learns a primary representation of the input data; the second level combines the outputs of the first level into a higher-level representation, and so on.

  • ELM-AE is used for unsupervised training between the hidden layers.

  • The hidden layer weights in ML-ELM are initialized using ELM-AE.

  • No fine-tuning is required in ML-ELM, unlike other deep networks.

  • The output of one trained ELM-AE is used as input for the next trained ELM-AE and so forth.

  • Depending on the following two conditions, the hidden layer activation function g is either linear or non-linear:

    • linear: if the count of ith and (i-1)th hidden layer nodes is equivalent.

    • non-linear (such as a sigmoidal function): if the count of ith and (i-1)th hidden layer nodes is different.

  • The numerical relationship between the ith and (i-1)th hidden layers is established using (10).

    $$ H_{i}= g\left((\beta_{i})^{T} H_{i-1}\right) $$
    (10)

    where \(H_{i}\) and \(H_{i-1}\) are the output and input matrices of the ith hidden layer, and g(.) is the activation function. \(H_{0}\) is the input layer, and \(H_{1}\) is the first hidden layer. The output weight β is computed using regularized least squares [109] (see the code sketch after this list).

  • Finally, the output weights of the whole network are computed as in a standard ELM; no iterative fine-tuning is involved.
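To tie the above points together, here is a minimal NumPy sketch of layer-wise ML-ELM training under the assumptions of a sigmoid activation, orthogonal random initialization, and a ridge (regularized least squares) solution for each ELM-AE and for the output layer; the hidden layer sizes, the constant C, and the synthetic data are illustrative. Each layer's weights are the transposed ELM-AE output weights, as in (10), and no backpropagation or fine-tuning is used.

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

def orthogonal_init(n, L):
    """Orthogonal random weights: orthonormal columns if L <= n, orthonormal rows otherwise."""
    A = rng.normal(size=(n, L))
    return np.linalg.qr(A)[0] if L <= n else np.linalg.qr(A.T)[0].T

def ml_elm_train(X, Y, hidden=(64, 128), C=1e3):
    """Layer-wise ML-ELM training: one ELM-AE per hidden layer, then a final ELM-style solve."""
    H, layer_betas, widths = X, [], []
    for L in hidden:
        n = H.shape[1]
        W = orthogonal_init(n, L)                       # orthogonal random weights
        b = rng.normal(size=L); b /= np.linalg.norm(b)
        Ht = sigmoid(H @ W + b)
        beta = np.linalg.solve(np.eye(L) / C + Ht.T @ Ht, Ht.T @ H)  # ELM-AE output weights
        H = H @ beta.T                                  # Eq. (10): H_i = g(beta_i^T H_{i-1})
        if L != n:
            H = sigmoid(H)                              # non-linear only when layer sizes differ
        layer_betas.append(beta); widths.append((n, L))
    Lf = H.shape[1]
    beta_out = np.linalg.solve(np.eye(Lf) / C + H.T @ H, H.T @ Y)    # regularized least squares
    return layer_betas, widths, beta_out

def ml_elm_predict(X, layer_betas, widths, beta_out):
    H = X
    for beta, (n, L) in zip(layer_betas, widths):
        H = H @ beta.T
        if L != n:
            H = sigmoid(H)
    return H @ beta_out

# Illustrative usage with synthetic data
X = rng.normal(size=(300, 20)); Y = rng.normal(size=(300, 3))
params = ml_elm_train(X, Y)
Y_hat = ml_elm_predict(X, *params)
```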

Fig. 7: Structural diagram of ML-ELM

3.2 ML-ELM feature mapping

The ML-ELM feature mapping (Fig. 8) is discussed further below:

  i. ML-ELM makes extensive use of ELM's universal approximation [54] and classification [50, 113] capabilities.

  ii. It is well known that ELM is not biased by a large number of hidden layer nodes (L) and performs well when L is greater than the dimension of the input vector [47]. This significantly improves the performance of ELM feature mapping.

  iii. ML-ELM employs the ELM feature mapping mechanism, which uses the sparse representation (n < L) property of ELM-AE.

  iv. The features are mapped to a higher-dimensional space in ML-ELM using (11).

    $$ h(\mathbf{x}) = \left[\begin{array}{c} h_{1}(\mathbf{x})\\ h_{2}(\mathbf{x})\\ \vdots \\ h_{L}(\mathbf{x}) \end{array} \right]^{T} = \left[\begin{array}{c} g(w_{1}, b_{1}, \mathbf{x}) \\ g(w_{2}, b_{2}, \mathbf{x}) \\ \vdots \\ g(w_{L}, b_{L}, \mathbf{x}) \end{array} \right]^{T} $$
    (11)

    where \(h_{i}(\mathbf{x}) = g(w_{i}\cdot\mathbf{x} + b_{i})\). The vector \(h(\mathbf{x}) = {[h_{1}(\mathbf{x}),\cdots, h_{i}(\mathbf{x}), \cdots, h_{L}(\mathbf{x})]}^{T}\) can be used directly for feature mapping [49, 51].

  v. Kernel techniques are expensive because they use dot products to measure similarity between features in a higher-dimensional space. ML-ELM achieves the same effect using ELM's classification capability without any kernel technique, which drastically reduces the computational cost (a short illustration follows the list below).
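The computational argument in item (v) can be illustrated with a small timing experiment: the explicit ELM feature map of (11) produces an N × L matrix and a solve in the L-dimensional space, whereas a kernel formulation builds an N × N Gram matrix of dot products and solves in the N-dimensional space. The linear kernel, matrix sizes, and regularization constant below are illustrative assumptions chosen only to make the contrast visible.

```python
import time
import numpy as np

rng = np.random.default_rng(0)
N, n, L, C = 3000, 20, 500, 1e3
X = rng.normal(size=(N, n))
y = rng.normal(size=(N, 1))

# Explicit ELM random feature mapping h(x) as in Eq. (11): an N x L matrix
W, b = rng.normal(size=(n, L)), rng.normal(size=L)
t0 = time.perf_counter()
H = 1.0 / (1.0 + np.exp(-(X @ W + b)))
beta = np.linalg.solve(np.eye(L) / C + H.T @ H, H.T @ y)   # solve in the L-dimensional space
t_map = time.perf_counter() - t0

# Kernel alternative: an N x N Gram matrix of pairwise dot products
t0 = time.perf_counter()
K = X @ X.T                                                # linear kernel, purely for illustration
alpha = np.linalg.solve(np.eye(N) / C + K, y)              # solve in the N-dimensional space
t_ker = time.perf_counter() - t0

print(f"explicit map: H {H.shape}, {t_map:.2f}s | kernel: K {K.shape}, {t_ker:.2f}s")
```

When N is much larger than L, both the memory footprint and the solve time of the kernel route grow quadratically or cubically in N, while the explicit mapping stays linear in N.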

Fig. 8: ML-ELM feature mapping

4 Variants of ML-ELM

This section presents an in-depth study of the different variants of ML-ELM, from their inception to date. The topology of ML-ELM variants is presented in Fig. 9.

Fig. 9: Topology of ML-ELM variants

4.1 ML-ELM variants for better feature learning

In ELM with subnetwork nodes, a single hidden node is formed from other hidden nodes, which constitute a subnetwork. Yang and Wu [156] put forward an efficient approach for feature learning using subnetwork nodes; the proposed method can be used for different representations, including Dimension Reduction, Expanded Dimension Representation, and Supervised/Unsupervised Learning Mode. Wong et al. [146] put forward a kernel version of ML-ELM named ML-KELM to eliminate the drawbacks of traditional ML-ELM, such as unstable performance due to the random projection carried out in each layer, the huge time consumed by manual tuning of the hidden node count, and the slow training speed caused by a larger number of hidden layers. The optimization function plays a significant role in deep feature learning methods. The optimization method adopted in ML-ELM fails to give efficient results when analyzing variance in the input, for example, noise disturbance, image deformation, etc. Jia et al. [59] proposed C-ML-ELM, which includes a penalty term in the optimization function whose aim is to minimize the first-order derivative of the output with respect to the input so as to improve the classification results. This is based on the principle of the Contractive Auto-Encoder (CAE) discussed by Rifai et al. [108]; the framework of C-ML-ELM consists of a stack of contractive ELM-AEs rather than traditional autoencoders at every layer to ensure robust generalization and faster learning. Mirza et al. [92] proposed an online sequential version of basic ML-ELM, namely MS-OSELM, which uses the online sequential ELM autoencoder (OS-ELM-AE), with random and orthogonal weights and biases, to learn robust features from sequential streaming data. Zhang et al. [160] proposed a variant of ML-ELM, namely Denoising ML-ELM, which uses a denoising autoencoder (ELM-DAE) to incorporate prior knowledge, further enhancing the performance of the classification model. ELM-DAE incorporates denoising criteria into the basic ELM-AE to ensure the robustness of the learned features. The inputs to ELM-DAE are the corrupted (noisy) samples, represented by \(\tilde {\mathbf {X}}\), and the outputs are the original training samples, represented by X, where \(\mathbf {X} = \{x_{i} \in R^{j}\}_{i=1}^{N}\). The framework of Denoising ML-ELM comprises a stack of ELM-DAEs that initialize the weights of the hidden layers, and the output weights β are computed using (7) and (8). Zhang et al. [160] also introduced a manifold regularization term into the original cost function of Denoising ML-ELM and named this variant Denoising Laplacian ML-ELM. Jiang et al. [61] proposed the Densely Connected Multilayer Kernel ELM (Dense-KELM) to address the enormous memory consumption and huge training time faced during the classification of remote sensing image scenes; the training process of Dense-KELM is identical to that of ML-KELM. Region-enhanced ML-ELM (RE-ML-ELM), proposed by Jia et al. [60], utilizes several input nodes to multiplex the locally significant region. The input data of RE-ML-ELM comprise two main parts: the source data and the data obtained from the locally significant region. The parameters of each layer are computed using ELM-AE, and the incorporation of additional information from the data helps to improve representation learning. Fei et al.
[29] further presented projective model (PM) based ML-KELM (ML-KELM-PM), which enhances the feature representation as well as classification accuracy of ML-KELM. Zhang et al. [166] presented Multilayer probability ELM (MP-ELM), which automatically enables the extraction of valuable information from the links in Device-free localization (DFL). MP-ELM makes use of ELM autoencoders to maintain the fast learning speed of ELM and returns probabilistic output to enable fast and accurate DFL. Nayak et al. [97] presented an ML-ELM variant which uses LReLU as the activation function. Another ML-KELM variant using combined kernels (ML-CK-ELM) was proposed by Rezaei et al. [107] for multi-label classification. Hernandez et al. [46] proposed Multilayer Fuzzy Extreme Learning Machine (ML-FELM), which consists of stacked Fuzzy Autoencoders to achieve high input feature representation. It also makes use of the Mamdani Fuzzy Logic System for performing classification.

A summary of ML-ELM variants proposed for better feature learning is presented in Table 2.

Table 2 Summary of ML-ELM variants for better feature learning

4.2 ML-ELM variants for handling outliers or noise

Liangjun et al. [77] proposed a correntropy-based ML-ELM, namely FC-ELM, to efficiently classify datasets corrupted by outliers or noise. Correntropy is a non-linear measure of the similarity between two random variables. The correntropy of two random variables S and T is defined as:

$$ C(S,T) = E[\kappa(S,T)] $$
(12)

where E denotes the expectation and κ is a kernel function satisfying Mercer's theorem. Using a correntropy reconstruction loss instead of the mean square error (MSE) makes the ELM autoencoder robust to noise. Dai et al. [21] proposed a multilayer architecture based on OC-ELM (ML-OCELM) and a kernel version of the same (MK-OCELM); the experimental results showed good generalization with little human intervention. Luo et al. [83] put forward a variant of ML-ELM that incorporates the kernel risk-sensitive loss (KRSL) criterion to ensure robust performance on training datasets with outliers or noise. KRSL is a local similarity measure that builds on the concept of correntropy [17]. For two random variables A and B, KRSL is computed using (13).

$$ \begin{array}{@{}rcl@{}} L_{\lambda}(A,B) &=& \frac{1}{\lambda}E[\exp(\lambda(1 - \kappa_{\sigma}(A-B)))] \\ &=& \frac{1}{\lambda}\int \exp(\lambda(1 - \kappa_{\sigma}(a-b)))dF_{AB}(a,b) \end{array} $$
(13)

where λ is the risk-sensitive parameter, σ is the kernel bandwidth, \(\kappa_{\sigma}(.)\) is the Mercer kernel function [115], E(.) is the mathematical expectation, and \(F_{AB}(a,b)\) is the joint distribution of (A,B). A stacked ELM, i.e., a multilayer neural network built from multiple sub-ELMs, that uses KRSL as the loss function (SELM-MKRSL) was also presented by Luo et al. [83]. The MSE-based loss function used in ML-ELM is quite sensitive to outliers and noise; Yu et al. [132] therefore presented a robust multilayer approach named ML-RELM, which incorporates model bias and variance into the loss function to reduce the influence of noise signals. The experimental results showed higher generalization and greater robustness to noise.
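The following short sketch computes empirical versions of the correntropy in (12) and the KRSL in (13), assuming a Gaussian Mercer kernel; the sample values, σ, and λ are illustrative. It also shows the intuition behind these robust measures: a single large outlier inflates the MSE but changes the correntropy-based quantities only slightly.

```python
import numpy as np

def gaussian_kernel(e, sigma=1.0):
    """Gaussian (Mercer) kernel applied to the error e = a - b."""
    return np.exp(-(e ** 2) / (2.0 * sigma ** 2))

def correntropy(a, b, sigma=1.0):
    """Empirical correntropy, Eq. (12): the expectation is replaced by a sample mean."""
    return np.mean(gaussian_kernel(a - b, sigma))

def krsl(a, b, lam=2.0, sigma=1.0):
    """Kernel risk-sensitive loss, Eq. (13), estimated from samples."""
    return np.mean(np.exp(lam * (1.0 - gaussian_kernel(a - b, sigma)))) / lam

# A large outlier barely moves the correntropy-based quantities, unlike the MSE
a = np.array([0.1, -0.2, 0.05, 8.0])   # last entry is an outlier
b = np.zeros(4)
print("MSE:", np.mean((a - b) ** 2))
print("1 - correntropy (correntropy-induced loss):", 1 - correntropy(a, b))
print("KRSL:", krsl(a, b))
```

In the variants above, such a robust loss replaces (or augments) the MSE reconstruction term of the ELM autoencoder, bounding the influence of corrupted samples.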

A summary of ML-ELM variants proposed for handling outliers or noise is presented in Table 3.

Table 3 Summary of ML-ELM variants for handling outliers or noise

4.3 ML-ELM variants for optimizing weights and bias

The Multilayer Multi-Objective ELM (MLMO-ELM) proposed by Lekamalage et al. [75] learns the hidden layer parameters through a multi-objective formulation using the Multi-Objective ELM-AE (MO-ELMAE). The framework of MLMO-ELM consists of a stack of MO-ELMAEs; the output \(H_{p}\) of the pth hidden layer is used to learn the parameters of the (p + 1)th hidden layer, and ridge regression is used to learn the parameters of the output layer. Vong et al. [134] presented an improved version of ML-KELM that encodes the hidden layer as an empirical kernel map (EKM), named ML-EKM-ELM, which makes it suitable for large-scale problems. Le et al. [71] put forward an ML-ELM variant, namely Incremental ML-ELM (IM-ELM), for efficiently determining the number of hidden layers K. The basic idea is to append a (K + 1)th layer to the last hidden layer of an ML-ELM having K hidden layers. The final output of the new network is given by (14).

$$ f_{K+1}(x) = \mathbf{H}_{K+1}\beta_{K+1} $$
(14)

This can aid in finding the most suitable value of K corresponding to the initial input weights \(W_{1}\) and initial bias \(B_{1}\). Wu et al. [147] proposed a novel approach called the Multilayer Incremental Hybrid Cost-Sensitive ELM with Multiple Hidden Output Matrix and Subnetwork Hidden Nodes (MIHCS-ELM); it uses ant clone and multiple greywolf optimization methods for computing optimal hidden node parameters. Zheng et al. [170] presented the use of a novel ant lion algorithm (NALO) for optimizing the random weights and biases involved in ML-ELM (NALO-MELM), which can affect its accuracy. He et al. [43] proposed a tree root algorithm based ML-ELM (TR-ML-ELM), which gives better classification accuracy than ML-ELM. Ma et al. [85] put forward a heuristic Kalman algorithm (HKA) based ML-ELM (HKA-ML-ELM), where HKA is used to optimize the parameters of ML-ELM, leading to an improvement in prediction accuracy.

A summary of ML-ELM variants proposed for optimizing weights and bias is presented in Table 4.

Table 4 Summary of ML-ELM variants for optimizing weights and bias

4.4 ML-ELM variants for reducing multicollinearity

Multicollinearity refers to a situation in which there is a high correlation between the independent variables of a regression problem. ML-ELM faces this problem because of the last hidden layer in the network. Su et al. [123] proposed PLS-ML-ELM, which uses partial least squares (PLS) to eradicate the problem of multicollinearity. Principal Component Analysis (PCA) is a method for dimensionality reduction; it identifies the most significant components of the data by eliminating redundancy [34, 119]. Su et al. [125] put forward a variant of ML-ELM incorporating the PCA method, PCA-ML-ELM, to improve the performance of basic ML-ELM; it utilizes ELM-AE to obtain the output matrix corresponding to each hidden layer. Zhang et al. [167] put forward a self-adaptive ML-ELM variant with a dynamic generative adversarial net (GAN) [35], called PGM-ELM, which finds application in the classification of biomedical data.

A summary of ML-ELM variants proposed for reducing multicollinearity is presented in Table 5.

Table 5 Summary of ML-ELM variants for reducing multicollinearity

4.5 ML-ELM variants for reducing overfitting

ML-ELM might face the problem of overfitting the training data. Su et al. [123] proposed an improved version of PLS-ML-ELM by combining it with an ensemble model [124]. Since different trials may generate different simulation results, the ensemble model was combined with the algorithm to produce EPLS-ML-ELM, which overcomes the above-mentioned problems and ensures better generalization. Zhang et al. [161] proposed an ML-ELM variant, namely radial basis function based ML-ELM (ML-ELM-RBF), for multi-label learning; it makes use of the weight-uncertainty ELM-AE (WuELM-AE) to solve the overfitting problem. WuELM-AE incorporates weight uncertainty into ELM-AE to ensure robust feature learning, and the framework of ML-WuELM consists of a stack of WuELM-AEs that learn the parameters of each layer. Further, Xu et al. [151] proposed the ML-AP-RBF-Lap-ELM algorithm for multi-label learning, which combines the Affinity Propagation (AP) clustering algorithm, ML-RBF, and Lap-ELM; the experimental results showed good stability on various datasets, but the accuracy and generalization capability still need improvement. Su et al. [126] proposed EAPSO-ML-ELM, which makes use of adaptive particle swarm optimization (APSO) [117] and an ensemble model to enhance ML-ELM's performance. APSO is used to optimize the input and hidden layer weights and biases that are selected randomly in ML-ELM, while the ensemble model contributes towards overcoming the overfitting problem faced in ML-ELM. The variant EAPSO-ML-ELM is a combination of K ML-ELMs optimized using APSO. The various steps followed are:

  Step 1: K optimized ML-ELMs are generated in the training phase.

  Step 2: The prediction output matrices \(O_{i}\), i = 1,⋯ ,K, are generated from the testing data.

  Step 3: The final result is computed as the average of the outputs of all K ML-ELMs, as in (15) (a minimal sketch follows the equation).

$$ O_{final} = \frac{1}{K} \sum\limits_{i=1}^{K}O_{i} $$
(15)
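Equation (15) is simply an element-wise average of the K individual prediction matrices. The fragment below shows the idea, with random placeholder arrays standing in for the predictions of K trained ML-ELMs; the names and sizes are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical prediction matrices from K independently trained ML-ELMs;
# in practice each entry would be O_i = ml_elm_predict(X_test, ...).
K = 5
outputs = [rng.normal(size=(100, 3)) for _ in range(K)]

# Eq. (15): the ensemble output is the element-wise mean of the K predictions
O_final = np.mean(np.stack(outputs), axis=0)
labels = O_final.argmax(axis=1)   # class decision for a classification task
```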

Su et al. [122] further presented an improved version of MS-OSELM based on variable forgetting factor (VFF) and ensemble model (EVFF-ML-OSELM) where VFF is incorporated to give more emphasis to new incoming data and the ensemble model is used to avoid overfitting.

A summary of ML-ELM variants proposed for reducing overfitting is presented in Table 6. Also, a comparative analysis of various ML-ELM variants based on suitable characteristics is illustrated in Table 7.

Table 6 Summary of ML-ELM variants for reducing overfitting
Table 7 Suitable characteristics of various ML-ELM variants

5 Comparative analysis of ML-ELM

5.1 ML-ELM vs. other ML techniques

Conventional ML classifiers include the SVM, k-nearest neighbors, decision trees, etc. Drawbacks of these classifiers include the massive resources required for operation, the need for huge, unbiased, good-quality datasets, high susceptibility to error, and a limited ability to approximate complex functions.

ELM is preferred over other ML techniques because it does not depend much on the number of hidden nodes present, requires no iterative tuning, has fast learning speed, exhibits good generalization performance, and ensures parallel and distributed computation which makes it appropriate for real-time problems.

SVM is one of the well-established classifiers which has been used for various applications by the research communities from time to time [90, 144, 158]. A comparative analysis of SVM, ELM, and ML-ELM is presented in Table 8.

Table 8 Comparative analysis of the properties of SVM, ELM and ML-ELM

5.2 ML-ELM vs. other DL techniques

This section discusses the limitations of different DL architectures [81] and the advantages of ML-ELM over them.

5.2.1 Limitations of existing DL techniques

The existing DL techniques involve long training times, high computational costs, and massive amounts of training data. Some of the limitations of state-of-the-art DL classifiers are:

  i. Limitations of the Convolutional Neural Network (CNN):

    • A CNN passes the details of all low-level neurons to higher-level neurons, and these higher-level neurons again perform convolutions, replicating the knowledge across the different neurons to check whether certain features are present. This is a time-consuming job.

    • CNNs depend on initial parameter tuning to avoid poor local optima; hence, they may need many initialization computations depending on the problem at hand.

    • As convolution is very slow in both the forward and backward passes, deep networks require a lot of time for training.

    • CNNs have many parameters and hence often overfit small datasets.

    • Hyperparameter tuning in CNNs is non-trivial.

    • A CNN is not spatially invariant to the input data.

    • Training CNNs on traditional CPUs is time-consuming and expensive, so good GPUs are needed for faster training.

  ii. Limitations of the Recurrent Neural Network (RNN):

    • RNNs cannot be stacked into very deep models and cannot sustain long-term dependencies.

    • RNNs are prone to vanishing and exploding gradients during backpropagation, which makes training difficult in two ways: a) with tanh as the activation function, very long sequences cannot be processed; b) with ReLU as the activation function, training becomes more unstable.

    • Due to their recurrent nature, computation is slow.

    • Like CNNs, RNNs also need GPUs for faster training.

  iii. Limitations of Long Short-Term Memory (LSTM):

    • As the data must be moved from one cell to another for evaluation, and the cells are rather complicated owing to extra components such as forget gates, LSTMs cannot eliminate the vanishing gradient problem.

    • Training requires a high volume of resources, such as a large number of tuned parameters and high memory bandwidth.

    • Random weight initialization affects the performance of LSTMs, which can then behave similarly to an FNN.

    • As LSTMs are vulnerable to overfitting, using dropout methods to control this issue is time-consuming.

  iv. Limitations of the Artificial Neural Network (ANN):

    • When training is finished, the network produces a specific result that reduces the error, but this result is not necessarily optimal.

    • Like CNNs and RNNs, ANNs also need good GPUs to avoid slow training.

    • There are no proper guidelines or rules for choosing an appropriate network structure in an ANN; this can only be achieved through trial and error and rich experience.

    • In the majority of cases, overfitting persists in the network.

    • The result generated by the network gives no clue as to how or why it was produced, which reduces trust in ANNs.

5.2.2 Advantages of ML-ELM over machine and DL techniques

A detailed comparison of ML-ELM with the above-mentioned state-of-the-art DL mechanisms is provided in Table 9. ML-ELM suitably addresses the complexities that arise in ML and DL techniques in the following ways:

  i. ML-ELM has fewer parameters, no backpropagation, and no fine-tuning of hidden node parameters. Moreover, an increase in the number of hidden nodes and layers affects ML-ELM far less than other techniques.

  ii. ML-ELM involves less training time and a fast learning speed during classification, so it does not require GPUs.

  iii. It performs well on large volumes of data.

  iv. The classification and approximation capabilities of ELM are the strength of ML-ELM. It can map large datasets into an extended feature space in which they can be separated linearly without any kernel technique, which saves substantial resources.

  v. Unsupervised training is carried out among the hidden layers using ELM-AE.

  vi. The architecture of ML-ELM is very simple to understand and less complex to implement.

Table 9 ML-ELM vs. other DL architectures

5.3 Open issues in ML-ELM

Some of the open issues in ML-ELM are described below:

  i. As the hidden layer parameters are generated randomly and there is no backpropagation, the training of ML-ELM is quite fast. However, some variants of ML-ELM are sensitive to the randomization process, and such changes may degrade performance. Further studies are required to handle these problems.

  ii. The behavior and the correct number of hidden units, the hidden layer activation function, and the parameters in each layer of ML-ELM are still debatable and deserve further review.

  iii. The feature mapping of ML-ELM, based on the universal approximation and classification capabilities of ELM, is very robust; however, more research is needed to justify this robustness.

  iv. DL methods such as CNN, RNN, and LSTM need a huge volume of data and many tuned parameters to train the network. Further investigation is required to determine whether combining such DL architectures with ML-ELM can reduce the number of required parameters without compromising performance.

6 Applications of ML-ELM

The advantages of ML-ELM, including training speed, accuracy, and generalization, make it suitable for various application areas such as medicine, economics, etc., as shown in Fig. 10. This section highlights some of the works in these domains.

Fig. 10: Application areas of ML-ELM

6.1 Medical applications

Timely detection of brain tumors can contribute towards saving the lives of a large number of patients. ELMs have been found quite helpful for classifying tumor images and exhibit better performance than various deep neural network classifiers. As ELM is not well suited to big data applications, Deepa and Rajalakshmi [25] proposed ML-ELM for classifying tumor images as tumorous or non-tumorous with higher accuracy. Various methods are available for EEG classification based on the Bayes classifier, SVM, etc., but most of these algorithms are limited in their ability to approximate complex functions. An ML-ELM variant proposed by Ding et al. [27], namely Deep ELM (DELM), combines the best of both approaches, i.e., ML-ELM and KELM; it proved successful for EEG classification owing to its short training time and high efficiency. The usage of wearable sensors has become quite crucial due to the emergence of smart health facilities, which calls for an efficient classification algorithm to recognize human actions and thereby enable a healthier lifestyle. Chen et al. [15] proposed an algorithm named S-ELM-KRSL for this purpose, which uses stacked ELM (S-ELM) and the KRSL similarity measure; the results achieved showed higher accuracy compared with other traditional algorithms. The recognition of EEG signals is an essential technology of the Brain-Computer Interface (BCI), involving feature extraction and classification; Duan et al. [28] proposed an ML-ELM based classification approach for EEG signals to achieve better performance. She et al. [116] employed a hierarchical semi-supervised ELM (HSS-ELM) for efficient motor imagery (MI) task classification; the experimental results showed higher classification accuracy and better generalization with minimal human intervention compared with other methods used for motor imagery EEG data. It is crucial to identify brain diseases from magnetic resonance (MR) images at an early stage to avoid serious problems. A variant of ML-ELM, PGM-ELM [167], has been used efficiently for classifying imbalanced medical data. Fei et al. [29] proposed the usage of ML-KELM-PM for breast tumor diagnosis using ultrasound; the algorithm effectively performs the transfer learning required in computer-aided diagnosis. An efficient approach for brain image classification was put forward by Nayak et al. [97], which uses ML-ELM to automate the process of feature extraction; the proposed system resulted in better generalization and more robustness. Ijaz et al. [57] presented a model using random forest as a classifier to perform early prediction of cervical cancer. Srinivasu et al. [120] used the computationally efficient AW-HARIS algorithm for automated segmentation of CT scan images to identify abnormalities in the human liver. Mandal et al. [87] proposed a tri-stage wrapper-filter-based feature selection method for saving time and cost in disease detection. Dash et al. [24] presented a hybrid method for blood vessel segmentation to improve the performance of the curvelet transform. Srinivasu et al. [121] proposed an efficient deep learning approach based on MobileNet V2 and LSTM for skin disease classification. Dash et al. [23] put forward a joint model having fast guided and matched filters for enhancing vessel extraction in abnormal retinal images. Kumar et al. [68] presented a comprehensive survey on various artificial intelligence techniques that can be used to diagnose diseases such as cancer, tuberculosis, etc.

6.2 Industrial applications

Wang et al. [136] put forward an efficient crack detection model based on ML-ELM, which does not require iterative tuning to learn its parameters and exhibits phenomenal performance in terms of training efficiency and model accuracy. Identifying coal quality from remote sensing images is a critical task in coal mining. Le et al. [71] proposed an incremental ML-ELM based algorithm named IM-ELM which can more appropriately determine the number of hidden layers, leading to an efficient classification model that proved better in terms of speed and accuracy at a lower cost. A variant of ML-ELM, EPLS-ML-ELM [124], was used to generate a data-driven prediction model for real blast furnace data; whether or not to adjust the burden distribution matrix is efficiently determined by EPLS-ML-ELM. Su et al. [125] put forward an approach utilizing ML-ELM, PCA, and the wavelet transform, named W-PCA-ML-ELM, for measuring the permeability index; the proposed method helped achieve better generalization and more stability in the prediction model. Reliable and efficient detection of faults is crucial for reducing maintenance cost and avoiding unplanned interruptions. Yang et al. [155] proposed an ML-ELM based fault diagnosis scheme for wind turbines; the experimental results showed better accuracy and efficiency compared with other approaches. A variant of ML-ELM put forward by Su et al. [126], namely EAPSO-ML-ELM, was used for hot metal temperature prediction in the blast furnace and proved helpful for better generalization and prediction accuracy. Timely and accurate identification of high-quality coal can reduce environmental pollution and increase production efficiency. A particular variant of ML-ELM based on the inertia weight artificial bee colony (ABC) algorithm, namely IAM-ELM, was put forward by Mao et al. [89]; it proved useful for coal classification, showing significantly better speed and accuracy than various other existing coal classification methods. An efficient approach named NALO-MELM, put forward by Zheng et al. [170], was used for fault diagnosis in rotary machines and showed higher accuracy than other methods. Lu et al. [82] applied an improved version of ML-OSELM using an evolutionary approach for real-time stencil printing optimization; the proposed approach contributed to increased prediction accuracy and printing performance. Ma et al. [85] proposed using HKA-ML-ELM to estimate the remaining useful life of lithium-ion batteries; the experimental results verified the effectiveness of the proposed approach. He et al. [43] used TR-ML-ELM to monitor coal mining areas; the proposed method resulted in high precision and fast speed. Zhao et al. [168] presented an early fault detection approach for analog circuits based on ML-ELM, which showed higher diagnosis accuracy and faster diagnosis speed. The traditional method of measuring temperature using disposable thermocouples is inefficient and costly for continuous data procurement; Su et al. [122] therefore proposed using EVFF-ML-OSELM to predict hot metal silicon content, and the simulation results exhibited better accuracy and generalization performance than other algorithms. Gupta et al. [40] presented an efficient content caching strategy for IoT applications which can aid in traffic management at cloud databases. Khan et al. [65] conducted a systematic literature review on security issues and challenges faced by software vendors' organizations. Rani et al.
[106] put forward an adapted fault-tolerant approach for wireless sensor network routing in industrial applications.

6.3 Academic applications

Automatic handwritten digit recognition is a task that has captured a lot of interest in academics and commerce. A novel classification approach based on ELM was put forward by Noushahr et al. [98], which exhibits better generalization and fast speed for learning. The authors have also proposed Multilayer Ensemble ELM (ML-EELM), based on the combination of concepts of CNN, ensemble models, and ELM. This further increases the accuracy of the classifier.

6.4 Security applications

Wang et al. [138] proposed an ML-ELM based approach, which works on the encrypted databases directly and returns the output after converting the multi-class classification problem at hand into the corresponding binary classification problem. This method results in secure and accurate image classification while eliminating the need for decryption. Yang et al. [154] presented a binary decision diagram (BDD) based DL algorithm, i.e., BDD-ML-ELM, for privacy protection in the finger-vein recognition systems. The proposed approach achieved adequate security and prediction accuracy. Traditional security measures, including passwords, tokens, etc., have some shortcomings which can lead to severe problems. Thus the authentication based on keystroke dynamics became more critical. Zhao et al. [169] presented an efficient keystroke dynamics identification approach using ML-ELM, reducing human interaction and manual feature extraction. The proposed method resulted in high accuracy and less time involvement which further justifies its usage for real-time applications. Panigrahi et al. [102] proposed a host-based intrusion detection system using a C4.5-based detector with Consolidated Tree Construction algorithm which works efficiently with class-imbalanced data as well. Panigrahi et al. [101] analyzed the current literature in the field of network intrusion detection, highlighting the various important parameters. Privacy protection is a requirement for applications dealing with confidential images. Nevertheless, performing decryption before the classification of images increases the computational complexity to a large extent. ML-ELM can be used for efficiently classifying the dataset of encrypted images without decrypting the encrypted image.

6.5 Transportation applications

Intelligent video surveillance (IVS) systems, involving the movement activity of humans, are quite helpful for various applications, including accident detection, patient monitoring, etc. Yu et al. [159] put forward an approach based on ML-ELM and object principal trajectory, namely PTM-ELM. The usage of ML-ELM makes it feasible for frame predictors to adapt to fast-changing features by holding necessary information. The proposed method resulted in excellent image quality and efficient generalization. Lee et al. [74] suggested the usage of ML-RBF-ELM for Terrain Referenced Navigation, which aids in real-time navigation operation. The experimental results justified its use for small unmanned aerial vehicles (UAVs), which don’t have huge memory space. Hernandez et al. [46] utilized ML-FELM to classify and transport objects using indoor UAVs. The results showed higher accuracy of ML-FELM for image classification.

6.6 IoT applications

DFL is widely used in the field of wireless localization. An ML-ELM variant, namely MP-ELM [166], proved to be helpful for implementing faster and more accurate DFL. MP-ELM has the advantage of less time and labor consumption to automatically learn important information from the links. Moreover, the validity of this method in DFL has been evaluated in indoor and outdoor environments.

6.7 Military and civil applications

Radar emitter identification mainly finds applications in the military and civil fields. Cao et al. [9] proposed an H-ELM based method for identifying radar emitter signals that is quite efficient and can be deployed in various real-life applications.

6.8 Other interdisciplinary application domains

Modality refers to a particular way of experiencing something, and a multimodal research problem is one that involves multiple modalities. Weakly paired multimodal data refers to the situation in which real-world data obtained from various sensors has every modality partitioned into various groups. Wen et al. [142] proposed an ML-ELM based framework to capture the non-linear transformations related to each modality in a simple yet effective manner. Online incremental tracking is a tracking technique that learns and adapts a representation to reflect changes in the target's appearance; modeling such variation is a critical task that requires efficient algorithms, and H-ELM [129] has been used successfully for incremental tracking. Clustering is an important task that can solve problems in many domains, including text classification, market research, etc. Wu et al. [148] proposed an evolutionary ML-ELM based approach for data clustering problems; the experimental results revealed better accuracy and robustness of the method. A summary of various ML-ELM applications is listed in Table 10.

Table 10 Summary of ML-ELM applications

7 Conclusion and future enhancement

This paper has reviewed the architecture of ML-ELM, emphasizing its variants and applications in various domains. The significant advantages of ML-ELM include less training time, random feature mapping, and higher learning accuracy, which can accelerate the development of DL. The topology of existing ML-ELM variants to date is also described, covering variants for better feature learning, handling outliers or noise, optimizing weights and bias, reducing multicollinearity, and reducing overfitting. A comparative analysis of ML-ELM with other ML and DL techniques has also been performed; the latter covers the shortcomings of prevailing state-of-the-art DL architectures such as ANN, CNN, RNN, and LSTM, and how ML-ELM can be utilized to handle these limitations. The open issues in ML-ELM have also been discussed. Since the variants of ML-ELM have shown promising results in various applications, this learning algorithm can be explored further for real-life applications such as intrusion detection, fault diagnosis, coal mine area monitoring, etc. Future studies on ML-ELM may include the following:

  i. The applications of ML-ELM in parallel and distributed computing remain open, and its effectiveness for big data applications can be investigated further.

  ii. More applications of ML-ELM can be studied to verify its generalization capability on massive, noisy datasets.

  iii. The variance of the hidden layer weights remains a topic for future research towards understanding the in-depth functionality of ELM and ML-ELM.