1 Introduction

The problem of classification for imbalanced data is one of the top ten studied areas in the data mining and machine learning field today [1]. Imbalanced data classification typically refers to a problem in which the classes are not equally represented. Applications of imbalanced data classification include a wide variety of “real-world” problems such as detecting fraudulent credit card transactions or characterizing the recurrence of breast cancer in patients. Most classic studies of imbalance focus on two-class (binary) classification, since a multi-class problem can be decomposed into multiple two-class problems. Commonly, in a two-class classification problem, the majority of the data are labeled as the “negative” class while the minority form the “positive” class. Because most standard classification learning algorithms [2,3,4] are designed for balanced datasets, the small “positive” class is often underrepresented when these algorithms are applied directly to imbalanced datasets. An excellent overall accuracy may be achieved simply because almost all samples are assigned to the negative class, but such accuracy merely reflects the underlying class distribution and can therefore be a misleading performance measure. Moreover, the positive class is usually of higher interest or importance than the negative class and often carries a higher misclassification cost.

Many techniques have been developed recently to combat imbalanced data classification in machine learning. These techniques can be categorized as either data resampling methods or algorithmic approaches. Data resampling methods aim to balance the data before applying standard classification learning algorithms. Two well-known strategies can be utilized: oversampling and undersampling. The synthetic minority over-sampling technique (SMOTE) [5] is an example of oversampling, in which new data samples are generated between the minority samples and their selected neighbors. The Tomek links method [6] and Wilson’s editing [7] are undersampling methods, in which a fraction of the majority samples are removed. Balancing data by resampling helps reduce misclassification error; however, its effectiveness is not consistent but problem dependent [8]. On the other hand, new machine learning classification algorithms, or modifications of existing ones, have also been developed, such as one-class classifiers [9] and cost-sensitive learning [10]. As a popular algorithmic approach, a cost-sensitive classifier incorporates cost sensitivity into the learning process, i.e., a different misclassification cost is assigned to each particular example [11]. However, in real-world applications the misclassification cost matrix may be difficult to generate.

Zong et al. [12] proposed a competitive cost-sensitive learning approach, known as weighted extreme learning machine (weighted ELM), which automatically generates the misclassification cost matrix in accordance with the class distribution. A weight inversely proportional to the number of samples in each class is assigned, which strengthens the impact of the positive class in imbalanced datasets. While maintaining the advantages of the original unweighted ELM, i.e., convenient implementation and easy application to multi-class data classification, weighted ELM demonstrates better performance on imbalanced datasets than unweighted ELM (i.e., ELM). However, the weighting schemes presented by Zong et al. misclassify a certain number of samples from the negative class in exchange for better accuracy in the positive class. Therefore, other ELM based approaches have been proposed. For instance, Zhang and Ji [13] developed fuzzy ELM (FELM), in which a fuzzy matrix is created to change the distribution of penalty factors; however, no unified design strategy for generating the fuzzy matrix is presented. In [14], Li et al. developed a Boosting weighted ELM, in which weighted ELM is embedded into a modified AdaBoost framework in order to obtain proper weights, but the computational complexity of the algorithm increases significantly due to the iterations. Lin et al. [15] combined a resampling technique with ELM; however, its performance depends largely on the adopted resampling technique.

In this paper, an improved weighted ELM (IW-ELM) algorithm is proposed, in which a voting based scheme is introduced to assign appropriate weights adaptively for imbalanced data classification. This paper is organized as follows. Section 2 briefly reviews the fundamental principles of weighted ELM and its issues. Section 3 elaborates the proposed algorithm. Experiment results on real-world imbalanced datasets are presented in Sect. 4. Section 5 compares the computational complexity of IW-ELM with other related classification algorithms. Section 6 draws a conclusion.

2 Related work

The proposed algorithm is developed based on ELM and weighted ELM. The basic principles of ELM and weighted ELM are reviewed in Sects. 2.1 and 2.2, respectively. Section 2.2 also discusses the problems that arise when weighted ELM is applied to imbalanced datasets.

2.1 Unweighted ELM

ELM [16], a least squares based learning algorithm, was originally proposed for single hidden layer feedforward networks (SLFNs): the input weights of the SLFN are randomly generated, and the output weights are trained with a batch least squares technique. It has been proved that SLFNs with randomly generated hidden neurons and tunable output weights possess universal approximation capability and excellent generalization performance [17]. More importantly, ELM outperforms most existing learning algorithms in training speed [18,19,20,21,22] and has been widely used in applications such as face recognition [23], image processing and classification [24], electricity price classification [25], energy commodity futures index forecasting [26], location fingerprinting [27], protein sequence classification [28], and location classification [29].

The basics of unweighted ELM are briefly outlined as follows:

For any input data \({\varvec{x}}\in R^{n}\), the output of a standard SLFN with L hidden nodes can be written as

$$\begin{aligned} f\left( {\varvec{x}} \right) =\sum \nolimits _{i=1}^L {\varvec{\beta }}_{i}G\left( {\varvec{w}}_{i},b_{i},{\varvec{x}} \right) ,\, {\varvec{w}}_{i}\in R^{n}, {\varvec{\beta }}_{i}\in R^{m}, \end{aligned}$$

where \({\varvec{\beta }}_{i}={[\beta _{i1},\beta _{i2},\ldots ,\beta _{im}]}^{T}\) is the weight vector connecting the ith hidden node to the output nodes, and \(G\left( {\varvec{w}}_{i},b_{i},{\varvec{x}} \right) \) is the output of the ith hidden node.

Consider a set of N training pairs with inputs \({\varvec{x}}_{1},{\varvec{x}}_{2},\ldots ,{\varvec{x}}_{N}\) (here \({\varvec{x}}_{i}\in R^{n})\) and target outputs \({\varvec{t}}_{1},{\varvec{t}}_{2},\ldots ,{\varvec{t}}_{N}\) (here \({\varvec{t}}_{i}\in R^{m})\). The mathematical model of an SLFN is

$$\begin{aligned} {\varvec{H}}{\varvec{\beta }}={\varvec{T}} \end{aligned}$$
(1)

with

$$\begin{aligned} {\varvec{H}}=\left( \begin{array}{*{20}c} G({\varvec{w}}_{1},b_{1},{\varvec{x}}_{1}) &{}\quad \cdots &{}\quad G({\varvec{w}}_{L},b_{L},{\varvec{x}}_{1})\\ \vdots &{} \quad \ddots &{} \quad \vdots \\ G({\varvec{w}}_{1},b_{1},{\varvec{x}}_{N}) &{} \quad \cdots &{}\quad G({\varvec{w}}_{L},b_{L},{\varvec{x}}_{N})\\ \end{array}\right) , \end{aligned}$$
(2)

where \({\varvec{H}}\) is the \(N\times L\) hidden-layer output matrix, \({\varvec{\beta }}={[{\varvec{\beta }}_{1},{\varvec{\beta }}_{2},\ldots ,{\varvec{\beta }}_{L}]}^{T}\) is the \(L\times m\) output weight matrix, and \({\varvec{T}}={[{\varvec{t}}_{1},{\varvec{t}}_{2},\ldots ,{\varvec{t}}_{N}]}^{T}\) is the \(N\times m\) target matrix. Similar to the least squares support vector machine (LS-SVM), the output weights \({\varvec{\beta }}\) can be obtained by solving the following optimization problem:

$$\begin{aligned}&\text {Minimize }\,\frac{1}{2}\left\| {\varvec{\beta }} \right\| ^{2}+\frac{C}{2}\sum \nolimits _{i=1}^{N}\left\| {\varvec{\varepsilon }}_{i} \right\| ^{2}, \end{aligned}$$
(3)
$$\begin{aligned}&\text {subject to }{\varvec{\varepsilon }} _{i}=H\left( {\varvec{x}}_{i} \right) {\varvec{\beta }}-{\varvec{t}}_{i},\, i=1,2,\ldots ,N, \end{aligned}$$
(4)

where \({\varvec{\varepsilon }}_{i}\) is the training error vector of the m output nodes with respect to the ith training sample, \(H\left( {\varvec{x}}_{i} \right) =\left[ G\left( {\varvec{w}}_{1},b_{1},{\varvec{x}}_{i} \right) ,\ldots ,G\left( {\varvec{w}}_{L},b_{L},{\varvec{x}}_{i} \right) \right] \) denotes the ith row of \({\varvec{H}}\), and C is a positive real regularization parameter.

By solving optimization problem (3), the least squares solution for \({\varvec{\beta }}\) can be determined as:

$$\begin{aligned} \text {when }N< & {} L:\, {\varvec{\beta }}=\tilde{{\varvec{H}}}^{+}{\varvec{T}}={\varvec{H}}^{T}\left( \frac{1}{C}{\varvec{I}}+{\varvec{H}}{\varvec{H}}^{T} \right) ^{-1}{\varvec{T}},\nonumber \\ \text {when } N\ge & {} L:\, {\varvec{\beta }}=\tilde{{\varvec{H}}}^{+}{\varvec{T}}=\left( \frac{1}{C}{\varvec{I}}+{\varvec{H}}^{T}{\varvec{H}} \right) ^{-1}{\varvec{H}}^{T}{\varvec{T}}. \end{aligned}$$
(5)
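To make the closed-form solution in Eq. (5) concrete, the following minimal Python sketch trains an unweighted ELM with sigmoid hidden nodes using NumPy. It is an illustrative sketch only, not the implementation used in the experiments; the function names and default parameters are assumptions made for the example.

```python
import numpy as np

def elm_train(X, T, L=100, C=1.0, seed=0):
    """Minimal unweighted ELM sketch: random hidden layer plus the regularized
    least squares solution of Eq. (5). X is N x n, T is the N x m target matrix."""
    rng = np.random.default_rng(seed)
    N, n = X.shape
    W_in = rng.standard_normal((L, n))              # random input weights w_i
    b = rng.standard_normal(L)                      # random biases b_i
    H = 1.0 / (1.0 + np.exp(-(X @ W_in.T + b)))     # N x L hidden-layer output matrix
    if N >= L:
        beta = np.linalg.solve(np.eye(L) / C + H.T @ H, H.T @ T)
    else:
        beta = H.T @ np.linalg.solve(np.eye(N) / C + H @ H.T, T)
    return W_in, b, beta

def elm_predict(X, W_in, b, beta):
    H = 1.0 / (1.0 + np.exp(-(X @ W_in.T + b)))
    return H @ beta                                  # row-wise argmax gives the predicted class
```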

Similar to other conventional learning algorithms, the performance of unweighted ELM is highly dependent on the class distribution. It performs well on balanced datasets, but may be problematic for imbalanced classification. Intuitively, the negative class tends to push the separating boundary towards the positive class side to obtain better classification accuracy for itself (see [12] for a more detailed analysis). The weighted ELM algorithm [12] was developed to overcome this problem.

2.2 Weighted ELM

As a cost-sensitive learning algorithm, weighted ELM assigns a different weight to each training example so that misclassifying positive samples incurs a higher cost. The optimization problem of weighted ELM becomes:

$$\begin{aligned}&\text {Minimize }\,\frac{1}{2}\left\| {\varvec{\beta }} \right\| ^{2}+\frac{C}{2}\sum \nolimits _{i=1}^{N}W_{i,i}\left\| {\varvec{\varepsilon }}_{i} \right\| ^{2}, \end{aligned}$$
(6)
$$\begin{aligned}&\text {subject to }\,{\varvec{\varepsilon }}_{i}=H\left( {\varvec{x}}_{i} \right) {\varvec{\beta }}-{\varvec{t}}_{i},\, i=1,2,\ldots ,N, \end{aligned}$$
(7)

where \({\varvec{W}}\) is an \(N\times N\) diagonal misclassification cost matrix whose diagonal entries \(W_{i,i}\) are set in accordance with the class distribution.

Similar to (5), solutions of \({\varvec{\beta }}\) can be obtained:

$$\begin{aligned} \text {when }N< & {} L{:} \, {\varvec{\beta }}={\varvec{H}}^{T}\left( \frac{1}{C}{\varvec{I}}+{\varvec{WH}}{\varvec{H}}^{T} \right) ^{-1}{\varvec{WT}},\nonumber \\ \text {when }N\ge & {} L{:} \,{\varvec{\beta }}=\left( \frac{1}{C}{\varvec{I}}+{\varvec{H}}^{T}{\varvec{WH}} \right) ^{-1}{\varvec{H}}^{T}{\varvec{WT}}. \end{aligned}$$
(8)

In [12], two cost matrices, \({\varvec{W}}_{{\varvec{1}}}\) and \({\varvec{W}}_{{\varvec{2}}}\), were proposed.

$$\begin{aligned} \text {Weighting scheme }{\varvec{W}}_{{\varvec{1}}}{:}\, W_{i,i}=\frac{1}{\# (t_{i})}, \quad i=1,2,\ldots ,N, \end{aligned}$$
(9)

where \(t_{i}\) denotes the class to which sample \({\varvec{x}}_{i}\) belongs and \(\# (t_{i})\) is the number of samples in class \(t_{i}\).

With scheme \({\varvec{W}}_{{\varvec{1}}}\), each class contributes an equal total weight, so the imbalanced dataset is rebalanced in terms of class cardinality.

In comparison, another weighting scheme

$$\begin{aligned} {\varvec{W}}_{{\varvec{2}}}{:}\, \left\{ {\begin{array}{cc} W_{i,i}=\frac{1}{\# (t_{i})},&{} \mathrm {if}\,\# (t_{i})>\mathrm {AVG},\\ &{} \\ W_{i,i}=\frac{0.618}{\# (t_{i})},&{}\mathrm {if}\,\# (t_{i})\le \mathrm {AVG},\\ \end{array} } \right. \end{aligned}$$

where 0.618 approximates the golden ratio and AVG denotes the average number of samples per class. In effect, \({\varvec{W}}_{{\varvec{2}}}\) is a trade-off between unweighted ELM and weighted ELM with \({\varvec{W}}_{{\varvec{1}}}\).
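Both schemes translate directly into per-sample weights computed from the class counts. The following sketch is illustrative only and assumes the labels are given as an array y; it returns the diagonal of \({\varvec{W}}\) for either scheme as stated above.

```python
import numpy as np

def weighting_diag(y, scheme="W1"):
    """Diagonal of W for weighting schemes W1 and W2, computed from the label array y."""
    classes, counts = np.unique(y, return_counts=True)
    size = dict(zip(classes, counts))
    avg = counts.mean()                                  # AVG: average sample size per class
    w = np.empty(len(y))
    for i, label in enumerate(y):
        if scheme == "W1" or size[label] > avg:          # W1, or W2 for classes above AVG
            w[i] = 1.0 / size[label]
        else:                                            # W2 for classes at or below AVG
            w[i] = 0.618 / size[label]
    return w
```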

Table 1 Performance results of binary problems with imbalance ratio \(\in (0.0078, 0.8605)\)

After applying weighting scheme \({\varvec{W}}_{{\varvec{1}}}\) or \({\varvec{W}}_{{\varvec{2}}}\) to unweighted ELM, weighted ELM reduces the misclassification error associated with the small positive class. In other words, the weighting scheme pulls the boundary back towards the negative class side so that additional positive class samples can be correctly classified.

However, assigning higher weights to the positive class effectively lowers the relative weights of the negative class. If the weights of the positive class are too high, the classification accuracy in the negative class drops; if they are not high enough, the classification accuracy in the positive class may be unsatisfactory. As noted in [12], a certain number of samples from the negative class are misclassified in exchange for better accuracy in the positive class. The key point is therefore to determine, prior to applying weighted ELM, an optimal weight, i.e., the most appropriate position on the negative class side towards which the boundary should be pushed. In general this is not possible, because the optimal weight is problem dependent. As shown in Table 1, different weights are evaluated when applying the weighted ELM algorithm. Out of 46 tested datasets, only 11 happen to have \({\varvec{W}}_{{\varvec{1}}}\) or \({\varvec{W}}_{{\varvec{2}}}\) as the optimal weight.

3 Description of IW-ELM

Since it is not practical to determine an optimal weight in advance when designing a weighted ELM classifier, in IW-ELM we propose to use a series of “appropriate” weights in place of a single \({\varvec{W}}_{{\varvec{1}}}\) or \({\varvec{W}}_{{\varvec{2}}}\) for imbalanced data classification. For convenience, only binary classification problems are considered. Assume \(t_{1}\) is the positive class and \(t_{2}\) is the negative class, and let
$$\begin{aligned} a=\frac{1}{\# (t_{1})},\qquad b_{i}=\frac{1}{\# (t_{1})+\left[ \# (t_{2})-\# (t_{1}) \right] \times \frac{1}{k-1}\times \left( i+\frac{k-1}{2} \right) },\quad i=0,1,\ldots ,k-1. \end{aligned}$$
The proposed method is implemented in three major steps.

Table 2 The values of trade-off constant C and the number of hidden nodes L corresponding to the results in Table 1

Step 1 k different weight matrices are first selected:

$$\begin{aligned} \tilde{{\varvec{W}}}_{i}={\varvec{W}}_{i}^{+}+{\varvec{W}}_{i}^{-}, \end{aligned}$$
(10)

where \({\varvec{W}}_{i}^{+}=\mathrm {diag} \left( {\mathop {\underbrace{0,0,\cdots ,0}_{\#(t_{2})}}}, {\mathop {\underbrace{a,a,\cdots ,a}_{\#(t_{1})}}}\right) \) and \({\varvec{W}}_{i}^{-}=\mathrm {diag}\left( {\mathop {\underbrace{b_{i},b_{i},\cdots ,b_{i}}_{\#(t_{2})}}}, {\mathop {\underbrace{0,0,\cdots ,0}_{\#(t_{1})}}}\right) , i=0,1,\ldots ,k-1\), assuming the training samples are ordered with the \(\#(t_{2})\) negative samples first, followed by the \(\#(t_{1})\) positive samples.

Note that when i is equal to \(\frac{k-1}{2}\) (k is chosen to be odd), \(\tilde{{\varvec{W}}}_{i}\) becomes the weight matrix \({\varvec{W}}_{{\varvec{1}}}\) in Eq. (9).
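Step 1 can be sketched as follows: the code computes a, the sequence \(b_{i}\), and the diagonals of the k candidate matrices \(\tilde{{\varvec{W}}}_{i}\) of Eq. (10). It assumes, as in the definition above, that the samples are ordered with the negative class first; the function name is illustrative.

```python
import numpy as np

def candidate_weight_diags(n_pos, n_neg, k=11):
    """Diagonals of the k candidate weight matrices in Eq. (10)."""
    a = 1.0 / n_pos                                       # weight for positive (t1) samples
    diags = []
    for i in range(k):
        b_i = 1.0 / (n_pos + (n_neg - n_pos) * (i + (k - 1) / 2.0) / (k - 1))
        diags.append(np.concatenate([np.full(n_neg, b_i), np.full(n_pos, a)]))
    return diags

# Sanity check of the note above: for i = (k-1)/2 the negative-class weight b_i
# equals 1/n_neg, so that candidate matrix coincides with scheme W1.
```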

Step 2 k weighted ELM classifiers are trained, corresponding to the k weight matrices selected in Step 1.

It is expected that some of the classifiers are not usable: they may produce a significant number of false-negative or false-positive errors. In order to eliminate these unusable classifiers, the number of samples classified as positive is recorded for each classifier. The \(\left\lceil \frac{k}{4} \right\rceil \) classifiers with the largest positive counts are rejected, since they are very likely biased towards the positive class (\(\lceil \cdot \rceil \) denotes the ceiling function). Similarly, based on the recorded number of samples classified as negative, the \(\left\lceil \frac{k}{4} \right\rceil \) classifiers with the largest negative counts are rejected as well, because they are very likely to misclassify too many positive samples.

Step 3 After the \(\left\lceil \frac{k}{2} \right\rceil \) unusable classifiers have been removed, an ensemble composed of the remaining classifiers is formed, and majority voting among them is used to classify samples.
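Steps 2 and 3 can be sketched as below, assuming each of the k trained classifiers has already produced a vector of {+1, -1} predictions for the samples to be classified; the array layout and function name are hypothetical.

```python
import numpy as np
from math import ceil

def prune_and_vote(predictions):
    """predictions: (k, N) array of {+1, -1} labels, one row per candidate classifier.
    Drop the ceil(k/4) classifiers predicting the most positives and the ceil(k/4)
    predicting the most negatives, then classify by majority vote of the rest."""
    k = predictions.shape[0]
    r = ceil(k / 4)
    pos_counts = (predictions == 1).sum(axis=1)
    neg_counts = (predictions == -1).sum(axis=1)
    drop = set(np.argsort(pos_counts)[-r:]) | set(np.argsort(neg_counts)[-r:])
    keep = [i for i in range(k) if i not in drop]
    votes = predictions[keep].sum(axis=0)
    return np.where(votes >= 0, 1, -1)   # a tie, if any, is resolved in favor of the positive class
```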

Note that the proposed algorithm and the voting based extreme learning machine (V-ELM) proposed by Cao et al. [30] are significantly different, even though both are voting based techniques. V-ELM is an algorithm targeted at balanced datasets, in which several classifiers are first trained with the unweighted ELM algorithm and the classification results are then combined by majority voting. V-ELM needs to be modified for imbalanced data classification. One intuitive technique is to combine weighted ELM with V-ELM. We name this method VW-ELM; in it, the results of several weighted ELM classifiers (scheme \({\varvec{W}}_{{\varvec{1}}}\) or \({\varvec{W}}_{{\varvec{2}}})\) are combined by majority voting. It is known that the classifiers in an effective ensemble need to be “good and different”; that is, every single classifier in the ensemble should have a certain level of accuracy while maintaining some dissimilarity from the others. Although each individual weighted ELM classifier in VW-ELM may perform well, the similarity among its classifiers makes VW-ELM unsatisfactory.

4 Performance evaluation

In this section, the performance of the proposed IW-ELM is compared with unweighted ELM, weighted ELM \({\varvec{W}}_{{\varvec{1}}}\), weighted ELM \({\varvec{W}}_{{\varvec{2}}}\), and VW-ELM using fivefold cross-validation. The results of unweighted ELM are taken directly from Tables 3 and 4 in [12]. For weighted ELM \({\varvec{W}}_{{\varvec{1}}}\) and weighted ELM \({\varvec{W}}_{{\varvec{2}}}\), the better of the two results reported in Tables 3 and 4 in [12] is chosen. All experiments are conducted in MATLAB (R2014, 64 bit) on a Windows 10 OS with an Intel Core i7-2620M 2.70 GHz CPU and 12 GB RAM.

4.1 Data specification

For comparison purposes, the proposed IW-ELM is evaluated on the same 46 binary classification datasets listed in Table 1 of [12], with the attributes normalized into [−1, 1] and imbalance ratios ranging from 0.0078 to 0.8605. The imbalance ratio (IR) of a binary problem is defined as:

$$\begin{aligned} \mathrm {IR}=\frac{\# (t_{1})}{\# (t_{2})}. \end{aligned}$$
(11)
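For completeness, the preprocessing mentioned above can be sketched as follows; the code is illustrative and assumes X is the attribute matrix and y the label vector, with the positive class coded as +1 and the negative class as -1.

```python
import numpy as np

def normalize_attributes(X):
    """Min-max scale every attribute (column) of X into [-1, 1]."""
    x_min, x_max = X.min(axis=0), X.max(axis=0)
    span = np.where(x_max > x_min, x_max - x_min, 1.0)   # guard against constant columns
    return 2.0 * (X - x_min) / span - 1.0

def imbalance_ratio(y):
    """IR of Eq. (11): #(t1) / #(t2), with t1 the positive (minority) class."""
    return np.sum(y == 1) / np.sum(y == -1)
```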

4.2 Parameter settings

For convenience, the proposed algorithm is tested with Sigmoid additive hidden nodes, whose activation function is \(G\left( {\varvec{w}},b,{\varvec{x}}\right) =\frac{1}{1+\mathrm {exp}\left( -\left( {\varvec{w}}\cdot {\varvec{x}}+b\right) \right) }\). There are two parameters to tune: the trade-off constant C and the number of hidden nodes L. A grid search of C over \(\left\{ 2^{-18},2^{-16},\ldots ,2^{48},2^{50} \right\} \) and L over \(\left\{ 10,20,\ldots ,990,1000 \right\} \) is conducted in order to find the optimal result. The value of k in Eq. (10) is set to 11, so the number of remaining classifiers is 5. To make a fair comparison, the number of classifiers used in VW-ELM is also 5. The values of C and L corresponding to the optimal results in our experiments are shown in Table 2.
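The grid search described above can be organized as in the following sketch; fit_and_score is a placeholder for a routine that trains IW-ELM on one training fold with the given C and L and returns the G-mean on the corresponding validation fold.

```python
import numpy as np
from itertools import product

C_grid = [2.0 ** p for p in range(-18, 52, 2)]   # 2^-18, 2^-16, ..., 2^50
L_grid = list(range(10, 1001, 10))               # 10, 20, ..., 1000

def grid_search(folds, fit_and_score):
    """folds: list of (train, validation) splits from fivefold cross-validation."""
    best_C, best_L, best_gmean = None, None, -1.0
    for C, L in product(C_grid, L_grid):
        mean_gmean = np.mean([fit_and_score(tr, va, C, L) for tr, va in folds])
        if mean_gmean > best_gmean:
            best_C, best_L, best_gmean = C, L, mean_gmean
    return best_C, best_L, best_gmean
```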

4.3 Evaluation metrics for imbalanced datasets

Due to the possible presence of significant class imbalance, i.e., negative samples outnumbering positive samples by a large margin, accuracy alone is not a reliable performance measure for imbalanced data classification. Other frequently used measures include the F-measure, the receiver operating characteristic (ROC) curve, and the geometric mean of the true rates (G-mean). To compare the proposed method with the weighted ELM presented in [12], G-mean is used in this paper. For a binary classification problem, the G-mean value is defined as the square root of (positive class accuracy \(\mathrm {\times }\) negative class accuracy):

$$\begin{aligned} \text {G-mean value}=\sqrt{\frac{{\textit{TP}}}{{\textit{TP}}+{\textit{FN}}}\times \frac{{\textit{TN}}}{{\textit{TN}}+{\textit{FP}}}}, \end{aligned}$$
(12)

where TP, TN, FP, and FN stand for true positive, true negative, false positive and false negative, respectively.
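As a small illustration, the G-mean of Eq. (12) can be computed from the true and predicted labels as follows, assuming the labels are coded as +1 for the positive class and -1 for the negative class.

```python
import numpy as np

def g_mean(y_true, y_pred):
    """G-mean of Eq. (12)."""
    tp = np.sum((y_true == 1) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == -1))
    tn = np.sum((y_true == -1) & (y_pred == -1))
    fp = np.sum((y_true == -1) & (y_pred == 1))
    sensitivity = tp / (tp + fn)     # positive class accuracy
    specificity = tn / (tn + fp)     # negative class accuracy
    return float(np.sqrt(sensitivity * specificity))
```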

4.4 Experiment results

4.4.1 IW-ELM versus unweighted ELM

From Table 1, it can be seen that the proposed IW-ELM algorithm outperforms unweighted ELM. For problems with an imbalance ratio close to 1, i.e., relatively balanced datasets, weighted ELM does not have a significant advantage over unweighted ELM, while IW-ELM still performs better than either of them, with higher G-mean values.

4.4.2 IW-ELM versus weighted ELM (\({\varvec{W}}_{{\varvec{1}}}, {\varvec{W}}_{{\varvec{2}}})\)

As mentioned before, only 11 out of the 46 datasets happen to have \({\varvec{W}}_{{\varvec{1}}}\) or \({\varvec{W}}_{{\varvec{2}}}\) as the optimal weight. Even for these 11 datasets, the results of IW-ELM and weighted ELM are still comparable because of the ensemble used in IW-ELM.

4.4.3 IW-ELM versus VW-ELM

As seen in Table 1, VW-ELM generally performs better than weighted ELM because of the voting strategy, and IW-ELM achieves even higher classification performance than VW-ELM for most of the imbalanced problems tested. This is due to the fact that, in IW-ELM, unusable classifiers are eliminated from the ensemble and the remaining classifiers are more diverse than those in VW-ELM.

5 Computational complexity

When using IW-ELM, there is a compromise between the high accuracy achieved and the computational complexity. Taking ELM as a baseline, weighted ELM has approximately the same computational complexity, while the complexities of VW-ELM and the proposed IW-ELM are higher because multiple classifiers must be trained. Since about \(\frac{k}{2}\) weighted ELM classifiers are trained in VW-ELM and k weighted ELM classifiers in IW-ELM, the execution time of IW-ELM is roughly k times that of the original ELM.

6 Conclusions

In this paper, an improved weighted ELM algorithm, IW-ELM, has been proposed for solving binary imbalanced classification problems. There are three major steps in the process: training k weighted ELM classifiers, removing \(\left\lceil \frac{k}{2} \right\rceil \) unusable classifiers, and determining the final result by majority voting among the remaining classifiers. The performance of IW-ELM (with k set to 11) is verified using 46 imbalanced datasets with imbalance ratios ranging from 0.0078 to 0.8605. Simulation results demonstrate that IW-ELM achieves higher G-mean values than the other ELM based algorithms considered.