1 Introduction

Extreme learning machine (ELM) was recently proposed by Huang et al. (2006, 2011) for single-hidden layer feed-forward neural networks (SLFNNs). In ELM, the input weights and the hidden layer biases are chosen randomly, and the output weights are determined analytically with the Moore–Penrose generalized inverse H* of the hidden layer output matrix H. Unlike gradient descent-based learning algorithms [such as the back-propagation (BP) algorithm] for feed-forward networks, ELM does not require iterative adjustment of the input weights and hidden layer biases during training, so it is a simple learning method with extremely fast learning speed (Feng et al. 2009; Liang et al. 2006; Huang et al. 2010; Wu et al. 2011; Wang et al. 2011). Although ELM simplifies learning for SLFNNs by avoiding iterative descent steps, the following issues remain, especially when dealing with large datasets.

  1. Predictive instability caused by randomly selecting the input weights and the hidden layer biases;

  2. Over-fitting caused by the complexity of the distribution of input instances and by the much larger number of hidden nodes required on large datasets;

  3. The order of the matrix H is N × M, where N is the number of samples and M is the number of hidden layer nodes. For large datasets, the order of H is very high, so a large amount of memory is required to compute the Moore–Penrose generalized inverse H*.

Ensemble learning (EL), or combining classifiers (CC), can address the problems mentioned above (Hansen and Salamon 1990; Zhou et al. 2002; Kittler et al. 1998; Rogova 1994; Pal and Pal 2001; Zhang et al. 2011; Biggio et al. 2010). EL is a learning paradigm in which a collection of a finite number of base classifiers, such as neural networks or decision trees, is trained for the same task (Kittler et al. 1998; Pal and Pal 2001), and it can significantly improve the generalization ability of a classification system (Hansen and Salamon 1990; Kittler et al. 1998).

In general, an ensemble of classifiers is generated in two steps:

  1. Training a number of the base classifiers;

  2. Combining the predictions of classifiers.

The most prevalent approaches for training base classifiers are Bagging and AdaBoost (Breiman 1996; Freund and Schapire 1997). Bagging generates diverse classifiers by randomly selecting subsets of samples to train the classifiers (Kuncheva and Whitaker 2003; Brown et al. 2005; Mao et al. 2011); intuitively, classifiers trained on different sample subsets can be expected to exhibit different behaviors. AdaBoost (Freund and Schapire 1997; Wu et al. 2008) also uses subsets of samples to train classifiers, but not randomly: it maintains a set of weights over the original training set and adjusts these weights after each classifier is learned, increasing the weights of misclassified examples and decreasing the weights of correctly classified ones. Liu and Wang (2010) proposed an ensemble-based extreme learning machine (EN-ELM) to enhance generalization ability (Wang and Dong 2009; Wang et al. 2008, 2011). EN-ELM uses a cross-validation scheme to create an ensemble of ELM classifiers for classification, and it classifies a test sample with a static ensemble strategy (Woods et al. 1997), in which all base classifiers are considered equally important. In fact, because of their diversity, the base classifiers have different degrees of confidence for different test samples, and hence their importance differs. Wang and Li (2010) proposed a dynamic AdaBoost ensemble extreme learning machine, which regards the extreme learning machine as the weak learner, uses a dynamic AdaBoost ensemble algorithm to combine the outputs of the weak learners, and employs a fuzzy activation function in the extreme learning machine. In this paper, different from the works in Liu and Wang (2010) and Wang and Li (2010), we propose a dynamic ensemble (Ko et al. 2008) of ELM classifiers based on sample entropy, which alleviates the problems of instability and over-fitting and increases prediction accuracy.

In our method, AdaBoost is used to generate J training subsets from the training set, and one ELM classifier is trained on each subset, so that J classifiers are obtained in all; finally, an unseen instance is classified using the strategy of dynamic ensemble with sample entropy.

The remainder of this paper is organized as follows. Section 2 briefly reviews the ELM and AdaBoost algorithms. Section 3 describes the proposed method of dynamic ensemble of ELM classifiers based on sample entropy. Performance evaluation is presented in Sect. 4, and Sect. 5 concludes the paper.

2 Brief reviews of extreme learning machine and AdaBoost algorithm

In this section, we briefly review the extreme learning machine and the AdaBoost algorithm.

2.1 Extreme learning machine

Extreme learning machine, proposed by Huang et al. (2006), is an efficient and practical learning mechanism for SLFNNs, see Fig. 1. According to Theorem 2.1 of Huang et al. (2006), the input weights and biases do not need to be adjusted; the output weights can be determined analytically by finding the least-squares solution. The neural network is thus obtained in very few steps with very low computational cost. Since Huang's seminal work (Huang et al. 2006), many researchers have turned their attention to ELM. Wang et al. (2011) studied the effectiveness of the extreme learning machine; José et al. (2011) studied the regularized extreme learning machine for regression problems; Mohammed et al. (2011) applied ELM to face recognition; Chacko et al. (2011) successfully applied ELM to handwritten character recognition; and Emilio et al. (2011) proposed a Bayesian approach to ELM that allows the introduction of a priori knowledge and exhibits high generalization capability. As pointed out in Emilio et al. (2011), ELM is a suitable approach for obtaining models from databases, especially huge databases, within a reasonable time. An excellent survey of ELM can be found in Huang et al. (2011).

Fig. 1 The SLFNN trained with ELM algorithm

Given a training dataset \( L = \left\{ {(x_{i} ,t_{i} )|x_{i} \in R^{n} ,t_{i} \in R^{m} ,i = 1,2, \ldots ,N} \right\} \), where \( x_{i} = \left( {x_{i1} ,x_{i2} , \ldots ,x_{in} } \right)^{T} \) and \( t_{i} = \left( {t_{i1} ,t_{i2} , \ldots ,t_{im} } \right)^{T} \), an SLFNN with M hidden nodes is formulated as

$$ \sum\limits_{j = 1}^{M} {\beta_{j} g\left( {w_{j} \cdot x_{i} + b_{j} } \right)} = t_{i} ,\quad i = 1,2, \ldots ,N $$
(1)

where \( w_{j} = \left[ {w_{j1} ,w_{j2} , \ldots ,w_{jn} } \right]^{T} \) is the weight vector connecting the jth hidden node with the input nodes, \( \beta_{j} = \left[ {\beta_{j1} ,\beta_{j2} , \ldots ,\beta_{jm} } \right]^{T} \) is the weight vector connecting the jth hidden node with the output nodes, and \( b_{j} \) is the threshold of the jth hidden node. Equation (1) can be written in a more compact form as

$$ H\beta = T $$
(2)

where

$$ H = \left[ {\begin{array}{*{20}c} {g(w_{1} \cdot x_{1} + b_{1} )} & \cdots & {g(w_{M} \cdot x_{1} + b_{M} )} \\ \vdots & \ddots & \vdots \\ {g(w_{1} \cdot x_{N} + b_{1} )} & \cdots & {g(w_{M} \cdot x_{N} + b_{M} )} \\ \end{array} } \right]_{N \times M} $$
(3)
$$ \beta = \left[ {\begin{array}{*{20}c} {\beta_{1}^{T} } \\ \vdots \\ {\beta_{M}^{T} } \\ \end{array} } \right]_{M \times m} \quad {\text{and}}\quad T = \left[ {\begin{array}{*{20}c} {t_{1}^{T} } \\ \vdots \\ {t_{N}^{T} } \\ \end{array} } \right]_{N \times m} $$
(4)

H is the hidden layer output matrix of the network: the jth column of H is the output vector of the jth hidden node with respect to the inputs \( x_{1} , \ldots ,x_{N} \), and the ith row of H is the output vector of the hidden layer with respect to the input \( x_{i} \). If the number of hidden nodes equals the number of distinct training samples, the matrix H is square and the SLFNN can approximate these training samples with zero error. In general, however, the number of hidden nodes is much smaller than the number of training samples, so H is a non-square matrix and an exact solution of system (2) cannot be expected. Fortunately, it has been proved in Huang et al. (2006) and Huang and Chen (2007) that SLFNNs with random hidden nodes have the universal approximation capability, so the hidden nodes can be generated randomly. According to the definition of the Moore–Penrose generalized inverse, the smallest-norm least-squares solution of (2) is given in Huang et al. (2006) as:

$$ \hat{\beta } = H^{ * } T $$
(5)

where \( H^{ * } \) is the Moore–Penrose generalized inverse of the matrix H (Serre 2002). The ELM algorithm (Huang et al. 2006) is summarized below.

ELM algorithm

 

Input:

A training dataset \( \left\{ {\left( {x_{i} ,t_{i} } \right)|x_{i} \in R^{n} , \, t_{i} \in R^{m} ,i = 1,2, \ldots,\,N} \right\} \), an activation function g, and the number of hidden nodes M

Output:

The output weight matrix \( \hat{\beta } \)

  1. Randomly assign the input weights \( w_{j} \) and biases \( b_{j} \), j = 1, …, M;

  2. Calculate the hidden layer output matrix H;

  3. Calculate the output weight matrix \( \hat{\beta } = H^{ * } T \).
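For concreteness, the three steps above can be sketched in a few lines of NumPy. The sigmoid activation, the random initialization scheme, and all variable names below are our own illustrative choices; the paper only fixes the algebra \( \hat{\beta } = H^{ * } T \).

```python
import numpy as np

def elm_train(X, T, M, rng=np.random.default_rng(0)):
    """Basic ELM training sketch. X: (N, n) inputs, T: (N, m) targets, M: hidden nodes."""
    n = X.shape[1]
    W = rng.standard_normal((M, n))              # step 1: random input weights w_j
    b = rng.standard_normal(M)                   # step 1: random hidden biases b_j
    H = 1.0 / (1.0 + np.exp(-(X @ W.T + b)))     # step 2: hidden layer output matrix H (N x M)
    beta = np.linalg.pinv(H) @ T                 # step 3: beta = H* T via the Moore-Penrose inverse
    return W, b, beta

def elm_predict(X, W, b, beta):
    """Outputs of the trained SLFNN for inputs X."""
    H = 1.0 / (1.0 + np.exp(-(X @ W.T + b)))
    return H @ beta
```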

In this paper, we focus on the classification problem. Let \( Y = \left\{ {\omega_{1} , \, \omega_{2} , \ldots,\,\omega_{K} } \right\} \) be the set of class labels and let \( L = \left\{ {\left( {x_{i} ,y_{i} } \right)|x_{i} \in R^{n} , \, y_{i} \in Y, \, i = 1,2, \ldots,\,N} \right\} \) be the training set. We use the ELM algorithm, slightly modified so that the output nodes also use the sigmoid activation function (see Fig. 2), as the base classifier. The output of the SLFNN becomes

$$ p_{i} \left( x \right) = g\left( {\sum\limits_{j = 1}^{M} {\beta_{j} g\left( {w_{j} \cdot x_{i} + b_{j} } \right)} } \right) $$
(6)

where \( p_{i} \left( x \right) = \left( {p_{i1} \left( x \right), \, p_{i2} \left( x \right), \, \ldots,\, p_{iK} \left( x \right)} \right) \) and \( p_{ik} \left( x \right) \) denotes the probability (or membership degree) that sample x belongs to class k, with \( 0 \le p_{ik} \left( x \right) \le 1, \, k = 1,2, \ldots,\,K \).
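The modified output layer of Fig. 2 only adds one more application of g. A minimal sketch, reusing elm_train above and assuming the targets were one-hot encoded class labels:

```python
import numpy as np

def elm_classify(X, W, b, beta):
    """Per-class membership degrees in [0, 1] (as in Eq. 6) and hard class decisions."""
    sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
    H = sigmoid(X @ W.T + b)          # hidden layer outputs
    P = sigmoid(H @ beta)             # sigmoid also applied to the output nodes, shape (N, K)
    return P, P.argmax(axis=1)        # membership degrees and predicted class indices
```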

Fig. 2 The SLFNN trained with ELM, with a small change in the output layer

2.2 The AdaBoost algorithm

The boosting algorithm was originally proposed by Schapire (1990); AdaBoost is an improved version of it (Freund and Schapire 1997). In this paper, we use the AdaBoost algorithm to generate the subsets of the sample set used for training the base ELM classifiers. For convenience, the algorithm is listed below.

AdaBoost algorithm

 

Input:

A training set \( L = \left\{ {\left( {x_{i} ,y_{i} } \right)|x_{i} \in R^{n} ,y_{i} \in Y,i = 1,2, \ldots,\,N} \right\} \); J, the number of iterations; and BaseLearner, a base learning algorithm

Output:

The final hypothesis

Steps of the algorithm

  • Step 1: Initialize the distribution of weights over all samples \( x_{i} \left( {1 \le i \le N} \right) \): \( D_{1} \left( {x_{i} } \right) = \frac{1}{N} \)

  • Step 2: For j = 1 to J

  • Step 3: Train BaseLearner with \( D_{j} \): \( C_{j} = {\text{BaseLearner}}\left( {L,D_{j} } \right) \)

  • Step 4: Calculate the error of \( C_{j} ,\;\;e_{j} = \sum\nolimits_{{C_{j} \left( {x_{i} } \right) \ne y_{i} }} {D_{j} \left( {x_{i} } \right)} \)

  • Step 5: If \( e_{j} > \frac{1}{2} \), then set J = j − 1 and abort loop

  • Step 6: Set \( \beta_{j} = \frac{{e_{j} }}{{1 - e_{j} }} \)

  • Step 7: Update weights, \( D_{j + 1} \left( {x_{i} } \right) = \left\{ {\begin{array}{*{20}c} {D_{j} \left( {x_{i} } \right) \times \beta_{j} } & {\text{if}\,\left( {C_{j} \left( {x_{i} } \right) = y_{i} } \right)} \\ {D_{j} \left( {x_{i} } \right)} & {\text{otherwise} } \\ \end{array} } \right. \) (the weights of correctly classified samples are shrunk by \( \beta_{j} < 1 \), so after normalization the misclassified samples gain weight)

  • Step 8: Normalize weights, \( D_{j + 1} \left( {x_{i} } \right) = \frac{{D_{j + 1} \left( {x_{i} } \right)}}{{\sum\nolimits_{i = 1}^{N} {D_{j + 1} \left( {x_{i} } \right)} }} \)

  • Step 9: Output \( C^{\prime } \left( x \right) = \arg \max_{y \in Y} \sum\nolimits_{{j:C_{j} \left( x \right) = y}} {\log \frac{1}{{\beta_{j} }}} \)
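The listing translates into the sketch below, which uses the ELM classifier of Sect. 2.1 (elm_train and elm_classify above) as BaseLearner. Realizing "train BaseLearner with \( D_{j} \)" by resampling the training set according to \( D_{j} \), and encoding the class labels as integers 0, …, K−1, are our assumptions.

```python
import numpy as np

def adaboost_elm(X, y, J=10, M=50, rng=np.random.default_rng(0)):
    """AdaBoost.M1-style sketch with ELM base classifiers (resampling variant).
    X: (N, n) inputs, y: integer labels in {0, ..., K-1}. Returns classifiers and vote weights."""
    N, K = X.shape[0], int(y.max()) + 1
    D = np.full(N, 1.0 / N)                         # Step 1: uniform weight distribution D_1
    classifiers, votes = [], []
    for _ in range(J):                              # Step 2
        idx = rng.choice(N, size=N, p=D)            # Step 3: draw L_j according to D_j (assumption)
        T = np.eye(K)[y[idx]]                       # one-hot targets for the base ELM
        W, b, beta = elm_train(X[idx], T, M, rng)
        pred = elm_classify(X, W, b, beta)[1]
        e = max(D[pred != y].sum(), 1e-12)          # Step 4: weighted training error e_j
        if e > 0.5:                                 # Step 5: abort the loop
            break
        B = e / (1.0 - e)                           # Step 6: beta_j
        D = np.where(pred == y, D * B, D)           # Step 7: shrink weights of correct samples
        D /= D.sum()                                # Step 8: normalize D_{j+1}
        classifiers.append((W, b, beta))
        votes.append(np.log(1.0 / B))               # Step 9 uses log(1/beta_j) as the vote weight
    return classifiers, votes
```

A weighted vote of the returned classifiers with these log(1/β_j) factors reproduces the final hypothesis C′(x) of Step 9.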

3 Dynamic ensemble extreme learning machine based on sample entropy

In this section, we will present our method of dynamic ensemble ELM classifier based on sample entropy. We first give the basic concepts used in this paper.

Definition 1

Let \( L = \left\{ {\left( {x_{i} ,y_{i} } \right)|x_{i} \in R^{n} ,\,y_{i} \in Y,\,i = 1,2, \ldots,\,N} \right\} \) be a set of labeled samples, where \( Y = \left\{ {\omega_{1} ,\omega_{2} , \ldots,\,\omega_{K} } \right\} \) is the set of class labels. The entropy of L is defined as follows.

$$ E\left( L \right) = - \sum\limits_{i = 1}^{K} {p_{i} \log_{2} p_{i} } $$
(7)

where \( p_{i} \) is the proportion of examples in L that belong to class \( \omega_{i}\,\left( {i = 1,2, \ldots,\,K} \right) \).
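As a small worked example of Eq. (7) (the label values below are illustrative):

```python
import numpy as np

def dataset_entropy(labels):
    """Entropy E(L) of a set of class labels, Eq. (7)."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()                     # class proportions p_i
    return float(-(p * np.log2(p)).sum())

# Two classes split 4:2 -> E(L) = -(2/3)log2(2/3) - (1/3)log2(1/3) ~ 0.918
print(dataset_entropy([0, 0, 0, 0, 1, 1]))
```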

Definition 2

Let \( L = \left\{ {\left( {x_{i} ,y_{i} } \right)|x_{i} \in R^{n} ,y_{i} \in Y,\,\,i = 1,2, \ldots,\,N} \right\} \) be a set of labeled samples, where \( Y = \left\{ {\omega_{1} ,\omega_{2} , \ldots,\,\omega_{K} } \right\} \) is the set of class labels. The entropy of a sample \( x_{i} \) is defined as follows.

$$ {\text{Entropy}}\left( {x_{i} } \right) = - \sum\limits_{k = 1}^{K} {p_{k} \left( {x_{i} } \right)\log_{2} } p_{k} \left( {x_{i} } \right) $$
(8)

where \( p_{k} \left( {x_{i} } \right) \) is the probability (or membership degree) that instance \( x_{i} \) belongs to class \( k\left( {k = 1,2, \ldots,\,K} \right) \).

Definition 3

Let \( L = \left\{ {\left( {x_{i} ,y_{i} } \right)|x_{i} \in R^{n} ,y_{i} \in Y,\,i = 1,2, \ldots,\,N} \right\} \) be a set of labeled samples with class labels in \( Y = \left\{ {\omega_{1} ,\omega_{2} , \ldots,\,\omega_{K} } \right\} \), and let \( C = \left\{ {C_{1} ,C_{2} , \ldots,\,C_{J} } \right\} \) be J classifiers. The entropy of a sample \( x_{i} \) with respect to classifier \( C_{j} \left( {j = 1,2, \ldots,\,J} \right) \) is defined as follows.

$$ {\text{Entropy}}_{j} \left( {x_{i} } \right) = - \sum\limits_{k = 1}^{K} {p_{jk} \left( {x_{i} } \right)\log_{2} } p_{jk} \left( {x_{i} } \right) $$
(9)

Definition 4

Let \( L = \left\{ {\left( {x_{i} ,y_{i} } \right)|x_{i} \in R^{n} ,y_{i} \in Y,\,\,i = 1,2, \ldots,\,N} \right\} \) be a set of labeled samples with class labels in \( Y = \left\{ {\omega_{1} ,\omega_{2} , \ldots,\,\omega_{K} } \right\} \), and let \( C = \left\{ {C_{1} ,C_{2} , \ldots,\,C_{J} } \right\} \) be J classifiers. The normalized entropy of a sample \( x_{i} \) with respect to classifier \( C_{j} \left( {j = 1,2, \ldots,\,J} \right) \) is defined as follows.

$$ N{\text{Entropy}}_{j} \left( {x_{i} } \right) = \frac{{{\text{Entropy}}_{j} \left( {x_{i} } \right)}}{{\sum\nolimits_{j = 1}^{J} {{\text{Entropy}}_{j} \left( {x_{i} } \right)} }} $$
(10)

Definition 5

Given a test sample x, let \( C = \left\{ {C_{1} ,C_{2} , \ldots,\,C_{J} } \right\} \) be J base classifiers. The entropy threshold for sample x is defined as follows.

$$ \gamma = \frac{1}{2}\left( {\max_{1 \le j \le J} \left( {{\text{Entropy}}_{j} \left( x \right)} \right) - \min_{1 \le j \le J} \left( {{\text{Entropy}}_{j} \left( x \right)} \right)} \right) $$
(11)
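Definitions 2–5 translate directly into code. The sketch below takes the J × K matrix of membership degrees that the J base ELMs produce for a single test sample x and returns each classifier's sample entropy, the normalized entropies, and the threshold of Eq. (11). Renormalizing the sigmoid outputs into row-wise distributions and the small epsilon guarding log 0 are our additions.

```python
import numpy as np

def sample_entropies(P, eps=1e-12):
    """P: (J, K) array with P[j, k] = p_jk(x). Returns Entropy_j(x), NEntropy_j(x), gamma."""
    P = P / P.sum(axis=1, keepdims=True)          # treat each row as a distribution (assumption)
    H = -(P * np.log2(P + eps)).sum(axis=1)       # Eq. (9): entropy of x w.r.t. each classifier C_j
    NH = H / max(H.sum(), eps)                    # Eq. (10): normalized entropy
    gamma = 0.5 * (H.max() - H.min())             # Eq. (11): entropy threshold for x
    return H, NH, gamma
```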

Let \( C = \left\{ {C_{1} ,C_{2} , \ldots,\,C_{J} } \right\} \) be the trained ELM classifiers and \( D = \left( {D_{1} ,D_{2} , \ldots,\,D_{J} } \right) \) be the distributions of weights on the samples of L. The proposed algorithm is described as follows.

DE-ELM algorithm

 

(Algorithm listing, part a)

Train a base learner ELM_j using L_j; the output of ELM_j is a probability distribution (p_{j1}(x), p_{j2}(x), …, p_{jK}(x)), where p_{jk}(x) (1 ≤ j ≤ J; 1 ≤ k ≤ K) denotes the membership degree of instance x belonging to class k according to classifier ELM_j;

(Algorithm listing, part b)
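The DE-ELM listing itself appears only as a figure in the source, so the following is our reconstruction of its classification phase from Definitions 2–5 and the surrounding text, not the authors' exact pseudocode: each base ELM reports its membership degrees for the test sample, the classifiers whose sample entropy lies below the entropy threshold (the more confident ones) are retained, and their votes, weighted by one minus the normalized entropy, are combined. The selection rule and the weighting are assumptions.

```python
import numpy as np

def de_elm_predict(x, classifiers):
    """Dynamic-ensemble prediction for one test sample x (a reconstruction, see text).
    classifiers: list of (W, b, beta) tuples from adaboost_elm / elm_train."""
    P = np.vstack([elm_classify(x[None, :], W, b, beta)[0][0]   # p_j(x) for each base ELM, shape (J, K)
                   for (W, b, beta) in classifiers])
    H, NH, gamma = sample_entropies(P)
    keep = H <= H.min() + gamma                   # assumed rule: keep the low-entropy (confident) ELMs
    weights = 1.0 - NH[keep]                      # assumed weighting: lower entropy -> larger weight
    scores = weights @ (P[keep] / P[keep].sum(axis=1, keepdims=True))
    return int(scores.argmax())                   # predicted class index
```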

4 Experimental results and analysis

The effectiveness of the proposed method is demonstrated through numerical experiments carried out in Matlab 7.0 on a Pentium 4 PC. The experiments use eight UCI datasets: Statlog (Shuttle) (DB1), letter recognition (DB2), pen-based recognition of handwritten digits (DB3), mushroom (DB4), landsat satellite image (DB5), optical recognition of handwritten digits (DB6), waveform (DB7), and car evaluation (DB8). The basic information of the eight datasets is listed in Table 1.

Table 1 The basic information of the eight datasets used in our experiments

From Table 1, we can see that among the eight UCI datasets there are three large datasets, four medium datasets, and one small dataset. In our experiments, 50% of the samples, chosen randomly, are used as the training set, and the remaining 50% are used as the test set. We set \( \lambda = 0.60 \) and J = 10, i.e. ten sub-classifiers (SLFNNs trained with ELM) are generated with the AdaBoost algorithm. For each sub-classifier, the number of hidden nodes is determined using the method proposed in Feng et al. (2009). The performance of the proposed DE-ELM is compared with the original ELM in three aspects: the influence of the construction of sub-classifiers on the ensemble system, average testing accuracy, and stability.

Experiment 1

The influence of construction of ELM base classifiers (i.e. SLFNNs trained with ELM) on ensemble system.

There are two commonly used methods (Bagging and AdaBoost) for generating base classifiers with diversity. We compare the two ways of constructing ELM base classifiers on the landsat satellite image dataset (DB5). In the experiment, the number of hidden nodes of the SLFNNs is increased from 20 to 100; the results shown in Fig. 3 illustrate the relationship between testing accuracy and the number of hidden nodes. In addition, we also run the original ELM on the same dataset. In terms of testing accuracy, ELM outperforms Bagging when the number of hidden nodes is greater than 73. The results on the other datasets are similar. Based on these results, we conclude that, in the framework of ensembles of SLFNNs trained with ELM, AdaBoost is superior to Bagging, so our method uses AdaBoost rather than Bagging.

Fig. 3 The influence of construction of ELM sub-classifiers on ensemble system

Experiment 2

Comparison with original ELM in average testing accuracy

In this experiment, we compare DE-ELM with the original ELM in average testing accuracy on the eight datasets. Table 2 shows the comparison results. For each dataset, the experiment is run 10 times and the reported results are the averages of the ten runs. It can be seen from Table 2 that the average testing accuracies of DE-ELM are consistently higher than those of ELM, regardless of the size of the dataset. For example, for the large dataset Statlog (Shuttle) (DB1), the ten results obtained with the original ELM and with DE-ELM are shown in Fig. 4. Similarly, for the medium datasets landsat satellite image (DB5) and optical recognition of handwritten digits (DB6), the ten results are shown in Figs. 5 and 6, respectively. For the small dataset car evaluation (DB8), the ten results are shown in Fig. 7.

Table 2 Comparison between DE-ELM and ELM
Fig. 4 The experimental results in average testing accuracy on large dataset Statlog (Shuttle) (DB1)

Fig. 5 The experimental results in average testing accuracy on medium dataset landsat satellite image (DB5)

Fig. 6 The experimental results in average testing accuracy on medium dataset optical recognition of handwritten digits (DB6)

Fig. 7 The experimental results in average testing accuracy on small dataset car evaluation (DB8)

Experiment 3

Comparison with original ELM in stability

In this experiment, we compare DE-ELM with the original ELM in stability on three datasets: the large dataset letter recognition (DB2), the medium dataset landsat satellite image (DB5), and the small dataset car evaluation (DB8). For each dataset, the experiments are run ten times with the original ELM and with DE-ELM, with the number of hidden nodes set to 30, 40, 50, and 60. Figures 8 and 9 show the results on the large dataset letter recognition (DB2) with the original ELM and with DE-ELM, respectively. The curves of testing accuracy for the different numbers of hidden nodes are given in Figs. 8, 9, 10, 11, 12 and 13. From these curves, we observe the following:

Fig. 8 The experimental results in stability on large dataset letter recognition (DB2) with original ELM

Fig. 9 The experimental results in stability on large dataset letter recognition (DB2) with DE-ELM

  1. On all datasets, the testing accuracies of the two methods increase as the number of hidden nodes grows from 30 to 60, and the accuracies of our method are always higher than those of the original ELM.

  2. On the large dataset, our proposed method (Fig. 9) is more stable than the original ELM (Fig. 8) when the number of hidden nodes is 50 or 60, which indicates that our method improves stability as well as testing accuracy.

  3. On the medium dataset, our proposed method is more stable than the original ELM in all four settings (see Figs. 10, 11). The same holds for the small dataset (see Figs. 12, 13).

    Fig. 10 The experimental results in stability on medium dataset landsat satellite image (DB5) with original ELM

    Fig. 11 The experimental results in stability on medium dataset landsat satellite image (DB5) with DE-ELM

    Fig. 12 The experimental results in stability on small dataset car evaluation (DB8) with original ELM

    Fig. 13 The experimental results in stability on small dataset car evaluation (DB8) with DE-ELM

Thus, we conclude that our proposed method is more stable than the original ELM.

In order to further verify the effectiveness of our proposed method, we statistically analyze the experimental results using the Wilcoxon test and the paired t test (Demsar 2006). First, for each dataset, we run the original ELM and our method 10, 30, and 50 times, obtaining six accuracy vectors, denoted \( X_{i} \left( {1 \le i \le 3} \right) \) for the original ELM and \( X_{i}^{\prime } \left( {1 \le i \le 3} \right) \) for our proposed method; \( X_{1} \) and \( X_{1}^{\prime } \) are 10-dimensional vectors, \( X_{2} \) and \( X_{2}^{\prime } \) are 30-dimensional vectors, and \( X_{3} \) and \( X_{3}^{\prime } \) are 50-dimensional vectors. Next, we apply the Wilcoxon test to the experimental results by computing the values of the MATLAB function calls \( {\text{ranksum}}\left( {X_{1} ,X_{1}^{\prime } } \right) \), \( {\text{ranksum}}\left( {X_{2} ,X_{2}^{\prime } } \right) \), and \( {\text{ranksum}}\left( {X_{3} ,X_{3}^{\prime } } \right) \); similarly, we apply the paired t test by computing \( {\text{ttest2}}\left( {X_{1} ,X_{1}^{\prime } } \right) \), \( {\text{ttest2}}\left( {X_{2} ,X_{2}^{\prime } } \right) \), and \( {\text{ttest2}}\left( {X_{3} ,X_{3}^{\prime } } \right) \). The p values and h values of the Wilcoxon test are listed in Table 3, and the p values of the paired t test are listed in Table 4.
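For readers reproducing this analysis outside MATLAB, the sketch below uses the SciPy counterparts of the calls above (ranksum ↔ scipy.stats.ranksums, ttest2 ↔ scipy.stats.ttest_ind, both unpaired two-sample tests); the accuracy vectors are placeholders, not the recorded results.

```python
import numpy as np
from scipy import stats

# Placeholder accuracy vectors for 10 runs (X1: original ELM, X1p: DE-ELM)
X1 = np.array([0.91, 0.92, 0.90, 0.93, 0.91, 0.92, 0.90, 0.91, 0.92, 0.91])
X1p = np.array([0.94, 0.95, 0.94, 0.95, 0.94, 0.95, 0.94, 0.94, 0.95, 0.94])

w_stat, w_p = stats.ranksums(X1, X1p)    # Wilcoxon rank-sum test (MATLAB: ranksum)
t_stat, t_p = stats.ttest_ind(X1, X1p)   # two-sample t test (MATLAB: ttest2)
print(f"Wilcoxon p = {w_p:.4f}, t-test p = {t_p:.4f}")
```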

Table 3 p values and h values of Wilcoxon test
Table 4 p values of paired t test

The small p values cast doubt on the validity of the null hypothesis. Statistically, the p values of the Wilcoxon test and the paired t test (shown in Tables 3, 4) verify the effectiveness of our proposed method, and the h values of the Wilcoxon test further confirm that our method outperforms the original ELM.

Moreover, we analyze the stability of our method with the coefficient of variation (CV) of the test accuracy. In these experiments, the numbers of hidden nodes are 30, 40, 50, and 60, respectively. The coefficient of variation is calculated with the following formula.

$$ CV = \frac{\sigma }{\mu } $$

where \( \sigma = \left( {\frac{1}{n - 1}\sum\nolimits_{i = 1}^{n} {(x_{i} - \mu )^{2} } } \right)^{\frac{1}{2}} \) and \( \mu = \frac{1}{n}\sum\nolimits_{i = 1}^{n} {x_{i} } \). The results are listed in Table 5, where CV value 1 and CV value 2 correspond to our method and to the original ELM, respectively. The smaller the CV value, the better the stability of the method. The results listed in Table 5 confirm that our method is superior to the original ELM in stability to some extent.
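The CV of a vector of test accuracies can be computed directly (ddof=1 gives the sample standard deviation with the 1/(n − 1) factor used above):

```python
import numpy as np

def coefficient_of_variation(acc):
    """CV = sigma / mu of a vector of test accuracies."""
    acc = np.asarray(acc, dtype=float)
    return acc.std(ddof=1) / acc.mean()
```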

Table 5 Comparison of stability between our method and original ELM

5 Conclusions

In this paper, DE-ELM, a dynamic ensemble of extreme learning machines based on sample entropy, is proposed. It uses AdaBoost to train J SLFNNs with ELM as base classifiers and classifies a new sample with a dynamic ensemble strategy. DE-ELM not only overcomes shortcomings of traditional gradient-based learning algorithms, such as local minima, improper learning rates, and slow learning speed, but also alleviates the instability and over-fitting problems of the original ELM and increases prediction accuracy. The experimental results show that the proposed approach is robust and efficient.