1 Introduction

In many real-world classification problems, the number of samples in one class is far smaller than that in the other class (He and Ma 2013; Shen et al. 2018). Such data sets are called imbalanced data sets. Classifying imbalanced data is a challenging problem that has attracted growing attention from scholars. Many outstanding works, normally grouped into data-based and algorithm-based strategies (Liu et al. 2009), have been done on class imbalance learning. Data-based strategies weaken the imbalance degree of the data by over-sampling the minority class and under-sampling the majority class. The synthetic minority over-sampling technique (SMOTE) (Nitesh et al. 2002) and SMOTE-based improved sampling algorithms are popular data-based strategies (Ramentol 2012; Zeng et al. 2009; Gao et al. 2011). Algorithm-based strategies adapt traditional classification algorithms to the data characteristics in order to enhance classification performance. The cost-sensitive boosting methods (Li and Mao 2014) updated the weights of AdaBoost in terms of the classification cost. Kernel-based learning methods were employed to adjust the classification boundary between minority and majority samples (Wu and Chang 2005; Wu et al. 2016). Three approaches, including boundary movement, biased penalty and class-boundary alignment, were presented to adjust boundary skews (Wu and Chang 2005). In the kernel boundary alignment algorithm, the kernel matrix was adjusted in terms of the skewed distribution of the dataset (Wu et al. 2016). A weighted extreme learning machine (weighted-ELM) (Zong et al. 2013) was presented to strengthen the role of the minority class by assigning an extra weight to each minority sample. Xiao et al. (2017) introduced class-specific regulation costs to penalize misclassified data, and then proposed a class-specific cost regulation extreme learning machine (CCR-ELM), together with its kernel-based extension, to solve binary or multiclass classification problems with imbalanced distributions.

In recent years, extreme learning machine (ELM) has attracted much attention due to its fast speed and good generalization in regression and classification applications. It provides a unified learning platform with a wide range of feature mappings by employing generalized single hidden layer feed-forward networks (SLFNs) and a least-squares-based learning algorithm (Huang et al. 2005). The input weights and hidden biases of the SLFNs are chosen randomly, and the corresponding output weights are analytically determined by a generalized inverse operation on the hidden-layer output matrix. Several improved ELM algorithms have been proposed to solve different kinds of regression and classification problems. Online sequential extreme learning machines employed additive hidden nodes with radial basis functions in SLFNs for better generalization when dealing with an online data flow or a concept-drifting data stream (Rong et al. 2009; Zhao et al. 2012; Mirza et al. 2015). Incremental extreme learning machine (Yang et al. 2017) randomly added hidden nodes and adjusted the output weights linking the hidden layer and the output layer. Semi-supervised ELM and unsupervised ELM are suitable for classification problems in which collecting a large amount of labeled data is hard and time-consuming (Huang et al. 2014). For class imbalance problems, weighted ELM (W-ELM) (Zong et al. 2013), total error rate ELM (TER-ELM) (Mirza et al. 2013) and class-specific cost regulation extreme learning machine (CCR-ELM) (Xiao et al. 2017) were presented, among which CCR-ELM showed better performance.

All of the above ELM-based learning algorithms may suffer from ill-conditioning problems due to the randomly selected input weights and hidden biases. Non-optimal or redundant input weights and hidden biases also make them respond slowly (Huang et al. 2005) and lead to worse generalization. In view of these shortcomings, population-based intelligent optimization algorithms, such as the genetic algorithm (GA) (Ertam and Avci 2017), particle swarm optimization (PSO) (Han et al. 2013), fruit fly optimization (FOA) (Li and Zhang 2014) and the krill herd algorithm (KH) (Guo et al. 2017), have been introduced to find the optimal parameters of the activation function in ELM-based learning algorithms, with the purpose of improving the classification accuracy.

In this paper, we focus on the class imbalance problem, in which the majority class carries the negative label and the minority class the positive label. Class-specific cost regulation ELM is employed to weaken the influence of the imbalance and dispersion degree of the data on the classification accuracy. The structure of CCR-ELM, determined by the number of hidden nodes, the input weights, the hidden biases and two tradeoff factors, plays a direct role in its classification performance. In general, these parameters are pre-set manually or randomly generated. Inappropriate values may lead to useless or redundant neuron nodes and make the whole structure complex. Motivated by this, the brain storm optimization algorithm (BSO) is employed to find the optimal parameter values of CCR-ELM. Because the numbers of input weights and biases depend on the number of hidden nodes, networks with different numbers of hidden nodes are represented by individuals of different lengths. Novel mutation and merging operators for generating new individuals are proposed to effectively exchange evolutionary information between variable-length individuals. On this basis, an adaptive CCR-ELM based on a variable-length brain storm optimization algorithm is presented to solve class imbalance problems.

The rest of the paper is organized as follows. A brief review of CCR-ELM is given in Sect. 2. BSO-based CCR-ELM for classifying imbalanced data is proposed in Sect. 3. The classification performance of the proposed method is compared with other class imbalance learning algorithms in Sect. 4. Finally, the whole paper is concluded and future work is outlined.

2 Class-specific cost regulation extreme learning machine

In the extreme learning machine originally proposed by Huang et al. (2005), the input weights and biases of SLFNs are randomly generated, and the output weights are analytically calculated from the hidden-layer output matrix in a few steps. Thus, the computational cost of ELM is low.

Suppose there are N distinct samples, denoted by \((X_{i},y_{i}), i=1,2,\ldots ,N\), where \(X_{i}=[x_{i1},x_{i2},\ldots ,x_{in}]^{T}\in R^{n}\) and \(y_{i}=[y_{i1},y_{i2},\ldots ,y_{im}]^{T}\in R^{m}\). Let \({a_j}\) and \({{\beta }_j}\) be the input and output weights, respectively, and \({b_j}\) the bias of a hidden unit. An SLFN with L hidden nodes is modeled as follows.

$$\begin{aligned} \sum \limits _{j = 1}^L {{\beta _j}g({a_j},{b_j},{X_i})} = {o_i},i = 1,...,N \end{aligned}$$
(1)

In the above formula, \(g(\bullet )\) is the activation function, which normally employs a traditional nonlinear function such as the sigmoid, sine or radial basis function (Huang et al. 2012). The error between the estimated output \({o_i}\) and the actual output \({y_i}\) is zero if the SLFN can exactly approximate the data. That is,

$$\begin{aligned} \sum \limits _{j = 1}^L {{\beta _j}g({a_j},{b_j},{X_i}) = {y_i}} ,i = 1,...,N \end{aligned}$$
(2)

Let \(\beta =[{\beta _{1}}^{T},\ldots ,{\beta _{L}}^{T}]^{T}\) and \(Y=[{y_{1}}^{T},\ldots ,{y_{N}}^{T}]^{T}\). The above model can be written compactly as \({H\beta }=Y\), where

$$H = \left[ {\begin{array}{*{20}c} {g(a_{1} ,b_{1} ,X_{1} )} & \cdots & {g(a_{L} ,b_{L} ,X_{1} )} \\ \vdots & \ddots & \vdots \\ {g(a_{1} ,b_{1} ,X_{N} )} & \cdots & {g(a_{L} ,b_{L} ,X_{N} )} \\ \end{array} } \right]_{{N \times L}}$$
(3)

H is the so-called hidden-layer output matrix. \({h_{ij}}\) represents the output of the jth hidden node for the input \({X_i}\). During the training process, the parameters of a hidden node, including \({a_j}\) and \({b_j}\), are not adjusted after being initially generated. The corresponding output weights are estimated as follows.

$$\begin{aligned} {{\hat{\beta }}} = {H^\dag }Y = \left\{ \begin{array}{l} {\left( {{\textstyle {I \over C}} + {H^T}H} \right) ^{ - 1}}{H^T}Y,\quad L < N\\ {H^T}{\left( {{\textstyle {I \over C}} + H{H^T}} \right) ^{ - 1}}Y,\quad L \ge N \end{array} \right. \end{aligned}$$
(4)

\(H^{\dag }\) is the Moore–Penrose generalized inverse of H. C is a pre-set parameter that provides a tradeoff between minimizing the training error and maximizing the marginal distance. I is the identity matrix. The optimal output weights are obtained by minimizing the cost function \(\Vert O-Y\Vert\).
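To make the training procedure concrete, the following minimal numpy sketch implements the random hidden layer and the regularized solution of Eq. (4) for the common \(L<N\) case. It is an illustrative sketch, not the authors' code; the sigmoid activation and all function names are our own choices.

```python
import numpy as np

def elm_train(X, Y, L, C=1.0, rng=np.random.default_rng(0)):
    """Minimal ELM training sketch implementing Eq. (4) for L < N.

    X: (N, n) inputs, Y: (N, m) targets, L: number of hidden nodes.
    """
    N, n = X.shape
    a = rng.uniform(-1, 1, size=(L, n))       # random input weights a_j
    b = rng.uniform(-1, 1, size=L)            # random hidden biases b_j
    H = 1.0 / (1.0 + np.exp(-(X @ a.T + b)))  # (N, L) hidden-layer output matrix
    # Eq. (4), L < N: beta = (I/C + H^T H)^(-1) H^T Y
    beta = np.linalg.solve(np.eye(L) / C + H.T @ H, H.T @ Y)
    return a, b, beta

def elm_predict(X, a, b, beta):
    """Network output o = H(X) @ beta, cf. Eq. (1)."""
    H = 1.0 / (1.0 + np.exp(-(X @ a.T + b)))
    return H @ beta
```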

After introducing class-specific regulation costs, CCR-ELM was presented to solve class imbalance problems. Two tradeoff factors, \(C^{+}\) for the minority positive samples and \(C^{-}\) for the majority negative samples, are employed to rebalance the two classes (Xiao et al. 2017). Assume that the numbers of minority positive samples and majority negative samples are denoted by \(l_{1}\) and \(l_{2}\), respectively. CCR-ELM is modeled as follows.

$$\begin{aligned} \begin{aligned}&\min \, \left( \frac{1}{2}{\left\| \beta \right\| ^2} + \frac{1}{2}{C^ + }\sum \limits _{i = 1|{y_i} = + 1}^{{l_1}} {\xi _i^2} + \frac{1}{2}{C^ - }\sum \limits _{i = 1|{y_i} = - 1}^{{l_2}} {\xi _i^2} \right) \\&\mathrm {s.t.}\; h({x_i})\beta = {y_i} - {\xi _i},\; i = 1,\ldots ,N. \end{aligned} \end{aligned}$$
(5)

The corresponding output weights \({\hat{\beta }}\) are computed as:

$$\begin{aligned} {{\hat{\beta }}} = {H^\dag }Y = \left\{ \begin{array}{l} {\left( {{\textstyle {I \over {{C^ + }}}} + {\textstyle {I \over {{C^ - }}}} + {H^T}H} \right) ^{ - 1}}{H^T}Y,\quad L < N\\ {H^T}{\left( {{\textstyle {I \over {{C^ + }}}} + {\textstyle {I \over {{C^ - }}}} + H{H^T}} \right) ^{ - 1}}Y,\quad L \ge N \end{array} \right. \end{aligned}$$
(6)

For binary classification problems, the decision function of the CCR-ELM-based classifier is \(f(x)=\mathrm {sign}(h(x)\beta )\).

$$\begin{aligned} f(x) = \left\{ \begin{array}{l} \mathrm {sign}\, h(x){\left( {{\textstyle {I \over {{C^ + }}}} + {\textstyle {I \over {{C^ - }}}} + {H^T}H} \right) ^{ - 1}}{H^T}Y,\quad L < N\\ \mathrm {sign}\, h(x){H^T}{\left( {{\textstyle {I \over {{C^ + }}}} + {\textstyle {I \over {{C^ - }}}} + H{H^T}} \right) ^{ - 1}}Y,\quad L \ge N \end{array} \right. \end{aligned}$$
(7)

In CCR-ELM, five key parameters have a direct impact on the classification accuracy: the number of hidden nodes L, the input weights \({a_{j}}\), the biases \({b_{j}}\), \({C^{+}}\) for the minority positive samples and \({C^{-}}\) for the majority negative samples. The former three parameters determine the structure of the SLFN and are normally pre-set manually.
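As a concrete reading of Eqs. (6) and (7), the sketch below computes the CCR-ELM output weights and the binary decision, taking the printed formulas at face value (the regulation terms \(I/C^{+}\) and \(I/C^{-}\) are simply added to the Gram matrix). All names are illustrative.

```python
import numpy as np

def ccr_elm_beta(H, Y, C_pos, C_neg):
    """Output weights of CCR-ELM per Eq. (6), taken at face value.

    H: (N, L) hidden-layer output matrix, Y: (N, m) targets.
    C_pos, C_neg: tradeoff factors for the minority (+1) and majority (-1) classes.
    """
    N, L = H.shape
    if L < N:
        A = np.eye(L) / C_pos + np.eye(L) / C_neg + H.T @ H
        return np.linalg.solve(A, H.T @ Y)
    A = np.eye(N) / C_pos + np.eye(N) / C_neg + H @ H.T
    return H.T @ np.linalg.solve(A, Y)

def ccr_elm_decision(h_x, beta):
    """Binary decision per Eq. (7): sign of the network output."""
    return np.sign(h_x @ beta)
```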

3 Adaptive CCR-ELM with variable-length brain storm optimization algorithm

The brain storm optimization algorithm (BSO) proposed by Cheng et al. (2015) is a promising swarm intelligence algorithm simulating the human brainstorming process. The implicit parallel search ability of BSO gives the population superior diversity and makes it converge to the optimum faster, especially for multi-modal or large-scale optimization problems. Many improved BSO algorithms have been presented to avoid premature convergence and accelerate convergence. Cheng et al. (2014) proposed the population diversity enhanced brain storm optimization algorithm (en-BSO), in which two strategies for partially re-initializing solutions help the population jump out of local optima. The steps of en-BSO with the re-initialization strategy are shown in Algorithm 1 (Cheng et al. 2014).

In en-BSO, a new individual is formed by the following mutation or merging operators. Let \({\overline{{x_{ij}}}(t)}\) and \({x_{ij}(t)}\) be the jth gene of \({x_{i}}\) after and before the mutation operation, respectively. T and t denote the maximum number of iterations and the current generation, logsig is a logarithmic sigmoid transfer function, and c is a constant. The mutation operation corresponds to Step 3.1 and is shown as follows.

$$\begin{aligned} \overline{{x_{ij}}} (t)= & {} {x_{ij}}(t) + \xi (t) \times rand() \end{aligned}$$
(8)
$$\begin{aligned} \xi (t)= & {} \log {\mathrm{{sig}}}\left( {\frac{{0.5T - t}}{c}} \right) \times rand() \end{aligned}$$
(9)

The merging operation corresponding to Step 3.2 is performed on two selected parent individuals or cluster centers, denoted by \({x_{ij}(t)}\) and \({x_{kj}(t)}\). \({{\overline{\overline{{x_{ij}}}}(t)}}\) denotes the new individual, and \({\alpha \sim U(0,1)}\) is a random number.

$$\begin{aligned} \overline{\overline{{x_{ij}}}} (t) = \alpha {x_{ij}}(t) + (1 - \alpha ){x_{kj}}(t) \end{aligned}$$
(10)
Algorithm 1 en-BSO with the re-initialization strategy (Cheng et al. 2014)
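As a hedged illustration of Eqs. (8)–(10), the mutation and merging steps of en-BSO might look as follows in numpy; the default c = 20 is our assumption, since the text only calls c a constant.

```python
import numpy as np

def bso_mutate(x, t, T, c=20.0, rng=np.random.default_rng()):
    """Mutation per Eqs. (8)-(9): add a step-size-controlled random offset."""
    logsig = 1.0 / (1.0 + np.exp(-(0.5 * T - t) / c))  # logarithmic sigmoid
    xi = logsig * rng.random()                          # step size, Eq. (9)
    return x + xi * rng.random(x.shape)                 # Eq. (8), per gene

def bso_merge(x_i, x_k, rng=np.random.default_rng()):
    """Merging per Eq. (10): convex combination of two equal-length parents."""
    alpha = rng.random()
    return alpha * x_i + (1 - alpha) * x_k
```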

The above en-BSO is employed to find the optimal parameters of CCR-ELM. Among the five key parameters of CCR-ELM, the numbers of input weights \({a_{ij}} \in {R^{N \times L}}\) and biases \(b_{j}\in R^{L}\) vary with the number of hidden nodes L. Accordingly, networks with different L are described by individuals of different lengths, expressed as \(p_{i}=(C^{+},C^{-},a_{11},\ldots ,a_{1N},\ldots ,a_{L1},\ldots ,a_{LN},b_{1},b_{2},\ldots ,b_{L})\). The length of each individual is \(L(N+1)+2\). During the evolution, once two parent individuals have different lengths, the trailing genes of the longer individual have no matching genes in the shorter one and cannot be merged by Eq. (10). Therefore, a novel merging operator is proposed, yielding the variable-length en-BSO algorithm (VLen-BSO). The encoding is sketched below.
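A minimal sketch of this encoding, assuming N denotes the input dimension as in the expression for \(p_i\); encode and decode are hypothetical helper names.

```python
import numpy as np

def encode(C_pos, C_neg, a, b):
    """Flatten (C+, C-, a, b) into one individual of length L(N+1)+2.

    a: (L, N) input weights, b: (L,) biases, where N is the input dimension.
    """
    return np.concatenate([[C_pos, C_neg], a.ravel(), b])

def decode(p, N):
    """Recover (C+, C-, a, b); L is implied by the individual's length."""
    L = (p.size - 2) // (N + 1)
    C_pos, C_neg = p[0], p[1]
    a = p[2:2 + L * N].reshape(L, N)
    b = p[2 + L * N:]
    return C_pos, C_neg, a, b
```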

Assume that \({l_{i}}\) and \({l_{k}}\) are the lengths of the parent individuals. Only the genes in matching positions can be merged with each other; the new genes in the mismatching part are randomly generated based on the corresponding parent.

$$\begin{aligned} \overline{\overline{{x_{ij}}}} (t) = \left\{ {\begin{array}{ll} {\alpha {x_{ij}}(t) + (1 - \alpha ){x_{kj}}(t)}&j \le \min ({l_i},{l_k})\\ {x_{ij}}(t) + \alpha \xi (t)&\min ({l_i},{l_k}) < j \le \left\lceil {\max ({l_i},{l_k}) \times rand()} \right\rceil ,{l_i}> {l_k}\\ 0&j > \left\lceil {\max ({l_i},{l_k}) \times rand()} \right\rceil \end{array}} \right. \end{aligned}$$
(11)
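The following sketch is one possible reading of Eq. (11): matching genes are merged convexly, the mismatching genes of the longer parent are perturbed by \(\alpha \xi (t)\), and genes beyond \(\lceil \max (l_i,l_k)\times rand()\rceil\) are dropped, which determines the child's length. It ignores the constraint that a valid individual length must equal \(L(N+1)+2\); a repair step for that appears in the outline after Algorithm 2.

```python
import numpy as np

def vlen_merge(x_i, x_k, xi, rng=np.random.default_rng()):
    """Variable-length merging per Eq. (11); xi is the step size of Eq. (9)."""
    if x_i.size < x_k.size:                   # make x_i the longer parent
        x_i, x_k = x_k, x_i
    l_short, l_long = x_k.size, x_i.size
    alpha = rng.random()
    child_len = max(1, int(np.ceil(l_long * rng.random())))
    child = np.zeros(child_len)
    m = min(l_short, child_len)
    child[:m] = alpha * x_i[:m] + (1 - alpha) * x_k[:m]    # matching genes
    if child_len > l_short:                                 # mismatching part
        child[l_short:] = x_i[l_short:child_len] + alpha * xi
    return child
```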

The above variable-length en-BSO algorithm is introduced to find the optimal structure and tradeoff factors of CCR-ELM for class imbalance learning; we call the result adaptive CCR-ELM with VLen-BSO. In order to evaluate the classification accuracy of each individual, the fitness function is defined as the root mean square error (RMSE) on the training set \({X_{train}}\).

$$\begin{aligned} F = \sqrt{\frac{\sum \nolimits _{i = 1}^{N} \left\| \sum \nolimits _{j = 1}^{L} {{\hat{\beta }}}_j g({a_j},{b_j},{X_i}) - {y_i}\right\| _2^2}{N}} \end{aligned}$$
(12)
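Evaluating Eq. (12) for one individual then amounts to decoding it, training CCR-ELM, and scoring the training error; the sketch below reuses the hypothetical decode() and ccr_elm_beta() helpers from the earlier sketches.

```python
import numpy as np

def fitness(p, X_train, Y_train, N_in):
    """RMSE fitness of one individual per Eq. (12)."""
    C_pos, C_neg, a, b = decode(p, N_in)
    C_pos, C_neg = max(C_pos, 1e-8), max(C_neg, 1e-8)  # guard against nonpositive costs
    H = 1.0 / (1.0 + np.exp(-(X_train @ a.T + b)))     # (N, L) hidden outputs
    beta = ccr_elm_beta(H, Y_train, C_pos, C_neg)
    err = H @ beta - Y_train
    return np.sqrt(np.sum(err ** 2) / X_train.shape[0])
```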

The detailed algorithm steps of adaptive CCR-ELM with VLen-BSO are shown as follows.

Algorithm 2 Adaptive CCR-ELM with VLen-BSO
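Since the published algorithm figure is not reproduced here, the outline below sketches one plausible assembly of the whole procedure from the earlier helpers (bso_mutate, vlen_merge, fitness). Clustering is omitted, worst-replacement is used for selection, and the interpretation of \(p_g\) as the probability of mutating a single individual is our assumption.

```python
import numpy as np

def vlen_bso_ccr_elm(X, Y, N_in, pop=20, T=1500, p_g=0.6,
                     rng=np.random.default_rng(0)):
    """Skeleton of adaptive CCR-ELM with VLen-BSO (a sketch, not Algorithm 2)."""
    def random_ind():                              # L in {10, 20, ..., 1000}
        L = int(rng.choice(np.arange(10, 1001, 10)))
        C = 2.0 ** rng.integers(-24, 26, size=2)   # C+, C- in {2^-24, ..., 2^25}
        return np.concatenate([C, rng.uniform(-1, 1, size=L * (N_in + 1))])

    P = [random_ind() for _ in range(pop)]
    F = [fitness(p, X, Y, N_in) for p in P]
    for t in range(T):
        xi = (1.0 / (1.0 + np.exp(-(0.5 * T - t) / 20.0))) * rng.random()
        if rng.random() < p_g:                 # mutate one individual, Eqs. (8)-(9)
            child = bso_mutate(P[rng.integers(pop)], t, T, rng=rng)
        else:                                  # merge two parents, Eq. (11)
            i, k = rng.choice(pop, size=2, replace=False)
            child = vlen_merge(P[i], P[k], xi, rng=rng)
            min_len = 2 + (N_in + 1)
            if child.size < min_len:           # too short to decode; pad (sketch repair)
                child = np.concatenate(
                    [child, rng.uniform(-1, 1, min_len - child.size)])
            L = (child.size - 2) // (N_in + 1)
            child = child[:2 + L * (N_in + 1)]  # snap to a valid length L(N+1)+2
        f = fitness(child, X, Y, N_in)
        worst = int(np.argmax(F))              # replace the worst individual
        if f < F[worst]:
            P[worst], F[worst] = child, f
    return P[int(np.argmin(F))]                # best individual found
```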

4 Experimental results and discussion

Nine binary-class imbalanced datasets, preprocessed by Frank and Asuncion (2010), are chosen from the UCI machine learning repository for the experiments, with the purpose of fully analyzing and comparing the classification performance of the proposed method. In particular, the data in Wine1 record the chemical composition of wines from the same area of Italy and can in fact be classified into three categories. The task is transformed into a binary classification problem by combining the data of any two classes into majority samples and employing the remaining class as minority samples. Detailed information about these datasets is listed in Table 1. The imbalance ratio (IR) of the datasets varies from 0.1 to 0.4. In addition, a real-world class imbalance problem, the fault diagnosis of conveyor belts, is employed to further analyze the classification performance.

Table 1 Detailed information about the datasets from UCI

To evaluate the class imbalance algorithms, we examine not only the overall accuracy on all samples, but also the classification accuracy obtained within each class. Let TP, TN, FP and FN denote the numbers of true positives, true negatives, false positives and false negatives, respectively. The overall accuracy is defined in Eq. (13). However, this index is sensitive to the class distribution and cannot fully reflect the imbalance classification performance (Xiao et al. 2017). Therefore, the minority accuracy, the majority accuracy and the G-mean are also employed to evaluate the binary classification performance (Yu et al. 2016).

$$\begin{aligned} overall\,accuracy= & {} \frac{{TP + TN}}{{TP + FP + TN + FN}} \end{aligned}$$
(13)
$$\begin{aligned} minority\,accuracy= & {} \frac{{TP}}{{TP + FN}} \end{aligned}$$
(14)
$$\begin{aligned} majority\,accuracy= & {} \frac{{TN}}{{TN + FP}} \end{aligned}$$
(15)
$$\begin{aligned} G - mean= & {} \sqrt{\frac{{TP}}{{TP + FN}} \times \frac{{TN}}{{TN + FP}}} \end{aligned}$$
(16)
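For clarity, the four indices of Eqs. (13)–(16) can be computed directly from the confusion-matrix counts, as in this small sketch (function name illustrative):

```python
import numpy as np

def imbalance_metrics(tp, tn, fp, fn):
    """Evaluation indices of Eqs. (13)-(16) from confusion-matrix counts."""
    overall = (tp + tn) / (tp + tn + fp + fn)  # Eq. (13)
    minority = tp / (tp + fn)                  # Eq. (14), recall on positives
    majority = tn / (tn + fp)                  # Eq. (15), recall on negatives
    g_mean = np.sqrt(minority * majority)      # Eq. (16)
    return overall, minority, majority, g_mean
```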

4.1 The influence of parameters on the performance of CCR-ELM

In CCR-ELM, the sigmoid function \(g(a,b,X)=1/(1+\exp (-(a\cdot X+b)))\) is employed as the activation function, and the bounds of the input weights and biases are set as \({a_{j}},{b_{j}}\in [-1,1]\). \({C^{+}}\) and \({C^{-}}\) are discrete values chosen from \(\{2^{-24},2^{-23},\ldots ,2^{25}\}\), and the possible values of L are \(\{10,20,\ldots ,1000\}\).
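In code, the discrete search grids quoted above would simply be:

```python
import numpy as np

C_grid = 2.0 ** np.arange(-24, 26)   # C+, C- in {2^-24, 2^-23, ..., 2^25}
L_grid = np.arange(10, 1001, 10)     # L in {10, 20, ..., 1000}
```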

The statistical classification performances under different L are shown in Fig. 1. Taking the Car dataset as an example, the effect of L on G-mean is shown in Fig. 2. As L increases, more hidden nodes are contained in the network, bringing better testing accuracy and a relatively stable G-mean when \({L\ge 500}\) for most of the datasets. For large-scale datasets such as Poker hand and Dota2, however, G-mean keeps growing with the number of hidden nodes, reflecting the weak generalization of CCR-ELM on such data. For Iris, a small-scale dataset, the classification performance is apparently independent of L, because a larger L only introduces redundant nodes that have little impact on the algorithm performance. In summary, a classifier with a more complex structure favors classification accuracy, but it obviously increases the computational complexity and makes learning more time-consuming.

Fig. 1 The G-mean under different L

Fig. 2 G-mean of ACCR-ELM with different numbers of hidden nodes L

With L pre-set to 400, the G-mean under different \({C^{+}}\) and \({C^{-}}\) is shown in Fig. 3. Obviously, G-mean is sensitive to both \({C^{+}}\) and \({C^{-}}\). When \({C^{+}}\) and \({C^{-}}\) are less than \(2^{-10}\), the classification accuracy remains poor because the misclassified data are penalized too lightly.

Fig. 3 The G-mean under different \({C^{+}}\) and \({C^{-}}\)

4.2 Comparison of the classification performances of ACCR-ELM combined with different evolutionary optimization algorithms

In this paper, five swarm intelligence optimization algorithms, namely GA, PSO (Guo et al. 2016), variable-length PSO (VPSO) (Wang et al. 2018), VLen-BSO and group-based VLen-BSO (GVLen-BSO), are introduced to find the optimal parameters of CCR-ELM, with the purpose of improving the generalization and classification accuracy. Group-based VLen-BSO employs a simple grouping method (Guo et al. 2015) in the clustering operator instead of the k-means method to reduce the computational cost. In the experiments, the maximum number of iterations and the population size are set to 1500 and 20, respectively, for all methods. For ACCR-ELM with GA, the crossover probability is 0.9, the mutation probability is 0.01, and the generation gap is 0.9. For ACCR-ELM with PSO and VPSO, the learning factors are \({c_{1}}=1.5\) and \({c_{2}}=1.7\), and the inertia weight is \({\omega } =1\). For ACCR-ELM with VLen-BSO, \({p_{c}=0.2}\) and \({p_{g}=0.6}\). The number of clusters is set to 20, and half of the solutions are re-initialized every 200 iterations.

The classification performances of all the above adaptive CCR-ELMs are compared with the original CCR-ELM, and the statistical results over 30 runs are listed in Tables 2 and 3.

Table 2 Comparison of the overall accuracy \((({C^{+}},{C^{-}},L)\,Max/Average\,(\%))\) among CCR-ELM and ACCR-ELM with GA, PSO, VPSO, VLen-BSO and GVLen-BSO
Table 3 Comparison of G-mean \((({C^{+}},{C^{-}},L)\,Max/Average\,(\%))\) among CCR-ELM and ACCR-ELM with GA, PSO, VPSO, VLen-BSO and GVLen-BSO

The statistical results of overall accuracy over 30 runs shown in Table 2 indicate that ACCR-ELM with VLen-BSO and with GVLen-BSO achieves better classification accuracy than the other algorithms. Further comparing the G-mean shown in Table 3, we see that the traditional CCR-ELM with randomly generated parameters has the worst classification accuracy, especially for the Iris, Dota2 and Poker hand datasets. In contrast, the adaptive CCR-ELMs with optimized parameters achieve better classification performance for most of the datasets. In particular, VLen-BSO and GVLen-BSO provide more rational parameters for CCR-ELM, classifying the minority data more accurately.

4.3 The effect of the imbalance ratio on the algorithm performance

In real-world classification problems, the imbalance ratio, which reflects the quantitative difference between the majority and minority classes, depends on the properties of the problem. Experiments with imbalance ratios ranging from 0.1 to 0.4 are conducted on the above nine imbalanced datasets, with the purpose of analyzing the robustness of the class imbalance learning algorithms under varied imbalance ratios.

The statistical results of G-mean under different imbalance ratios are shown in Fig. 4. A comparison of the classification performances of ACCR-ELM with VLen-BSO, VPSO and GVLen-BSO indicates that, whether the imbalance ratio changes or not, ACCR-ELM with VLen-BSO shows relatively stable classification performance. The G-means of ACCR-ELM with GVLen-BSO and with VPSO both fluctuate around 0.5, indicating relatively worse robustness, although GVLen-BSO obtains more rational parameters and hence relatively better classification accuracy than VPSO.

Fig. 4 The classification robustness of adaptive CCR-ELMs under varied imbalance ratios

4.4 Application of the ACCR-ELM algorithm to fault diagnosis of conveyor belts

The conveyor belt is the most important transportation equipment in coal mines. Any fault, such as overcurrent, off-tracking or overload, may shut the belt down immediately, causing production delays and even reducing production efficiency. Given that faults occur only occasionally and the belt generally operates normally, the monitoring data of a conveyor belt are in fact imbalanced.

The experiment is conducted on monitoring data collected from two belts in a coal mine. Belt 1 is used for the underground transportation of coal; Belt 2 is a conveyor belt in a coal separating plant. Detailed information about the two belts is listed in Tables 4 and 5. In ACCR-ELM with VLen-BSO, the number of hidden nodes is set to 500, and the sigmoid activation function with \(C=1\) is employed. The learning factors and inertia weight of VPSO are set to \({c_{1}}=1.5\), \({c_{2}}=1.6\) and \(\omega =1\). The probabilities of VLen-BSO are preset to \({p_{c}}=0.2\) and \({p_{g}}=0.6\). The statistical results over 30 runs shown in Table 6 indicate that ACCR-ELM with VLen-BSO has better classification accuracy and stability than ACCR-ELM with VPSO because of the more reasonable structure of the ELM.

Table 4 Information about the two belts
Table 5 Detailed information on the fault of belt in coal mine
Table 6 Comparison of the classification performance for the belt fault between ACCR-ELM with VLen-BSO and VPSO

5 Conclusions

An adaptive CCR-ELM combined with VLen-BSO was proposed to deal with class imbalance problems, with the purpose of improving the generalization and classification accuracy. VLen-BSO is employed to find the optimal parameters of CCR-ELM, including the number of hidden nodes, the input weights, the biases and the tradeoff factors. All of these parameters compose an individual in VLen-BSO, and the length of an individual varies with the number of hidden nodes. In order to combine two parent individuals of different lengths, a novel merging operator was presented to generate a new individual. The experimental results on nine imbalanced datasets show that VLen-BSO can find better parameters for CCR-ELM, resulting in better classification accuracy than other evolutionary optimization algorithms such as GA, PSO and VPSO. In addition, the classification performance of the proposed adaptive algorithm is relatively stable under varied imbalance ratios. Applying the proposed algorithm to the fault diagnosis of conveyor belts also proves that ACCR-ELM with VLen-BSO has better classification accuracy and stability. Finding the optimal structure for ensemble imbalance learning with more than one CCR-ELM is our future work.