1 Introduction

Pattern classification has attracted considerable attention in machine learning, with a wide range of real-world applications such as face recognition, anomaly detection, image classification, cancer classification, and medical diagnosis, among many others. The main challenge of pattern classification is the capability of the trained classifier to classify unseen patterns correctly. This challenge is most often associated with data complexities existing within the underlying datasets [1], such as class ambiguity, data sparsity and dimensionality, class boundary complexity, and the class imbalance problem.

The class imbalance problem is a situation where a dataset is characterized by an uneven class distribution, i.e., the proportion of instances of one class is much smaller than that of the other classes. In the real world, despite the vast availability of data, the class of interest (minority or positive class) in binary classification problems usually has far fewer instances than the opposite class (majority or negative class).

One clear complication arising from the class imbalance problem is the degradation of traditional machine learning algorithms and the misleading nature of their accuracy scores. Consider a dataset with an imbalance ratio of 99:1, where the majority class accounts for 99% of the instances. In such a situation, a naive classifier that always predicts the majority class achieves an accuracy of 99% while never identifying a single minority instance. This limitation affects most traditional machine learning algorithms, such as decision trees (DT) and k-nearest neighbors (kNN), when faced with imbalanced datasets, since most of them optimize accuracy-based loss functions and hence produce models similar to the naive model described above [2, 3].
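The following minimal sketch illustrates this accuracy paradox on a synthetic 99:1 dataset; the dataset, parameter values, and reported figures are purely illustrative and not part of the experiments reported later in this paper.

```python
# Illustrative sketch of the accuracy paradox on a synthetic 99:1 dataset.
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, recall_score

X, y = make_classification(n_samples=10_000, weights=[0.99, 0.01],
                           flip_y=0, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=42)

# A "classifier" that always predicts the majority class.
naive = DummyClassifier(strategy="most_frequent").fit(X_tr, y_tr)
pred = naive.predict(X_te)

print("accuracy:", accuracy_score(y_te, pred))  # ~0.99, yet the model is useless
print("recall  :", recall_score(y_te, pred))    # 0.0 -- no minority instance detected
```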

Moreover, in most practical situations, misclassifying the class of interest carries a heavy cost. For example, in cancer detection, only a few patients with tumors may emerge out of hundreds of records, but failing to identify a malignant tumor (a false negative) poses a serious threat to the patient, who may miss out on treatment. It is therefore of little use to have a model with a very high accuracy score that fails to detect the minority class (i.e., has low sensitivity), which is the class of interest. It is still possible to achieve a low misclassification cost with an effective traditional algorithm, but the cost component has to be addressed during the classification process through cost-sensitive learning. Cost-sensitive learning, which aims at minimizing errors, takes into account the cost of prediction errors, and potentially other costs, during the training process [4]. It is closely related to the field of imbalanced learning, which is concerned with the classification of imbalanced datasets. As a result, various cost-sensitive techniques have been proposed. Unfortunately, defining the costs remains a big challenge, since misclassification costs are often unknown [5].

As a result, the problem of class imbalance has emerged as one of the key challenges in the machine learning community and has attracted much attention from researchers in academia and industry, to the extent that two dedicated workshops were held: the AAAI 2000 Workshop on Learning from Imbalanced Datasets [6] and the ICML 2003 Workshop on Learning from Imbalanced Datasets [7]. Researchers have since studied the problem of class imbalance extensively, and various methods have been proposed to overcome it. These methods fall into three categories: data level, algorithm level, and hybrid/ensemble level.

Data level methods, also referred to as external techniques, first preprocess the data in an attempt to rebalance the class distribution. They can further be broken down into sampling techniques and feature selection techniques. For sampling techniques, instances from the minority class are replicated to balance the class distribution through oversampling [8], or instances of the majority class are discarded to balance the class distribution through undersampling [9]. Researchers in [10]–[13] have proposed various oversampling approaches for balancing class distributions; similarly, researchers in [14]–[16] have presented several undersampling approaches for solving the class imbalance problem. Feature selection techniques, on the other hand, try to neutralize the effects of class imbalance by selecting the most influential features, producing discriminative knowledge that more easily separates the classes. Researchers in [17] explored feature selection for the categorization of text with imbalanced data. Mladenic et al. [18] utilized feature subsets to develop a Naive Bayes (NB) classifier on imbalanced text data. It is important to note that feature selection techniques for addressing the class imbalance problem have not yet been fully explored, leaving a research gap in this area, although they have been widely used in balanced classification problems to improve performance [19]–[22].
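A minimal sketch of the two sampling families described above, using the random over- and undersamplers from the imbalanced-learn library; the toy dataset and parameter values are illustrative only.

```python
# Data-level rebalancing: replicate minority instances (oversampling) or
# discard majority instances (undersampling).
from collections import Counter
from sklearn.datasets import make_classification
from imblearn.over_sampling import RandomOverSampler
from imblearn.under_sampling import RandomUnderSampler

X, y = make_classification(n_samples=1_000, weights=[0.9, 0.1], random_state=0)
print("original     :", Counter(y))

X_over, y_over = RandomOverSampler(random_state=0).fit_resample(X, y)
print("oversampled  :", Counter(y_over))

X_under, y_under = RandomUnderSampler(random_state=0).fit_resample(X, y)
print("undersampled :", Counter(y_under))
```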

Algorithm level (internal) techniques modify the learning algorithm to take into consideration the significance of the minority instances by biasing the algorithm towards the minority class [23, 24]. The most common algorithm level technique is the cost-sensitive method [25], where the algorithm is modified to integrate varying penalties for each of the selected groups of examples. A higher cost is assigned to underrepresented instances to boost their significance during the learning process. Various cost-sensitive classification techniques have been proposed in the literature [4, 26]. A common strategy of these techniques is to deliberately increase the weight of instances with higher misclassification costs during the boosting process. However, the challenge with cost-sensitive classification is that the misclassification costs are often unknown and difficult to estimate.
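One common realization of the algorithm level idea (not the method proposed in this paper) is to pass class weights to the learner, as in the hedged sketch below; the 1:10 cost ratio is an arbitrary illustration, since, as noted, true misclassification costs are usually unknown.

```python
# Cost-sensitive adjustment: errors on the minority class (label 1) are
# penalized more heavily via per-class weights.
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1_000, weights=[0.9, 0.1], random_state=0)

# Illustrative 1:10 cost ratio; real misclassification costs are rarely known.
tree = DecisionTreeClassifier(class_weight={0: 1, 1: 10}, random_state=0).fit(X, y)

# class_weight="balanced" instead derives weights inversely proportional to
# class frequencies when explicit costs are unavailable.
```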

Finally, the Hybrid/Ensemble level techniques utilize the advantage of both data level and algorithm level techniques by combining them to handle the problem of class imbalance efficiently. For example, Wang et al. proposed the hybridization of both sampling and cost-sensitive learning in handling imbalanced data [27]. Furthermore, Zhang et al. in their work [28] proposed a method for handling imbalanced data by integrating ensemble and classification techniques in enhancing the performance of the ensemble. In another study [29], Paweł et al. proposed a hybrid ensemble method that combines the advantages of ensemble learning, deep learning, and evolutionary computation, to effectively classify cardiac arrhythmias using ECG signal segments.

Ensemble machine learning techniques combine more than one single learner (base learner) using a given combination rule to produce enhanced predictive models [30]. Base learners can be any machine learning algorithm (e.g., Decision Tree, Naïve Bayes, Artificial Neural Network, Linear Regression, etc.). The ensemble topology can be as simple as an independent collection of learners combined via a majority vote or using some other advanced mechanisms such as those indicated in [31], where ensembles consist of General Regression Neural Network and Geometric transformation model.

Ensemble techniques are among the methods most commonly combined with data level or algorithm level approaches, in particular data resampling, for the classification of imbalanced data. This work is motivated by the growing use of ensemble techniques in imbalanced classification, owing to their ability to improve classification performance by leveraging multiple base learners trained on different bootstraps of the training data, as compared to traditional classification algorithms. The main premise of ensemble techniques is that by combining various classifiers, the error of one classifier will most likely be compensated by another classifier in the ensemble, so that the overall prediction of the ensemble model is more effective than that of a single classifier [32].

Several reasons have been discussed for why ensemble methods often perform better than single classifiers [33, 34]. One reason is that the training data may not provide adequate information for selecting a single best algorithm; for instance, a range of algorithms may perform equally well on the training data. Instead of choosing one of them, combining their predictions can be a better choice. Another important reason is that, in some cases, the hypothesis space being searched may not include the true target function. In such a situation, combining various hypotheses can effectively broaden the space and provide a better approximation of the unknown predictor function. For instance, the classification boundaries of ordinary decision trees are hyperplanes parallel to the coordinate axes, so if the target classification boundary is of a different type, a single decision tree may not give a smooth estimate [35]. A combination of decision trees, on the other hand, can approximate smooth boundaries of arbitrary shape.

In the literature, a good number of comprehensive surveys on ensemble techniques have been published [8, 36]–[44]. Bootstrap aggregating (Bagging) [45] and Boosting [46, 47] are considered the most widely used ensemble methods. For Bagging, Breiman introduced the idea of bootstrap aggregation in 1996 to build an ensemble in which base algorithms are trained on bootstrap samples randomly drawn, with replacement and under a uniform probability distribution, from the original dataset. Boosting was introduced by Schapire [46] as a technique for boosting the performance of weak learners.

Since then, many variations of bagging- and boosting-based ensembles have been proposed in the literature to address the problem of class imbalance. Galar et al. [36] proposed a taxonomy for ensemble-based methods that utilize data preprocessing techniques to solve the class imbalance problem, clustering them into three groups: Boosting-based, Bagging-based, and Hybrid ensembles, as depicted in Fig. 1.

Fig. 1 Ensemble Methods utilizing Data Preprocessing

Despite the reported improvements in classification performance, the majority of these ensemble techniques cannot completely withstand the common problem of noise in machine learning [48]. The term noise applies to all types of anomalies in the training data, from errors to unusual cases of the observed domain, which make it harder to interpret the data. Noise in the training data can be attribute noise (errors or unusual values), class noise (incorrect class labels), or a mixture of both [49]. Instances that complicate the learning process and degrade the performance of learning algorithms are therefore referred to as noisy instances [50].

Most bagging-based ensemble methods apply existing data sampling techniques within the bootstrap aggregating process. However, they remain prone to the effects of noise: since bootstrap instances are selected under a uniform probability distribution with replacement, some bootstraps may by chance contain a high number of noisy, hard-to-classify instances, which can eventually degrade the overall classification performance of the ensemble.

One of the key ideas in our proposed method is to make the sampling probability distribution not uniform but a function of instance hardness. We aim to influence the process of picking instances: unlike most bagging methods that use a uniform probability to select instances, our approach selects instances based on their level of hardness, i.e., an instance's selection probability is determined by its hardness level. This ensures that each bootstrap/bag contains a representation of easy, normal, and hard instances. We discuss instance hardness in detail in Sect. 2.
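A minimal sketch of this sampling idea is given below; the hardness values are placeholders for pre-computed instance hardness scores, and the small floor that keeps very easy instances selectable is an implementation assumption rather than part of the method described later.

```python
# Bootstrap sampling with selection probabilities tied to instance hardness
# (IH) instead of a uniform distribution.
import numpy as np

rng = np.random.default_rng(0)
hardness = rng.random(100)                # placeholder IH values in [0, 1]

weights = hardness + 1e-3                 # small floor: IH = 0 instances stay selectable
probs = weights / weights.sum()           # selection probability as a function of hardness
bootstrap_idx = rng.choice(len(hardness), size=len(hardness), replace=True, p=probs)
```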

Furthermore, numerous studies have pointed out defects of existing ensemble methods and their underlying data sampling techniques for handling class imbalance problems [13, 51]. Most of the proposed solutions alter the original dataset, either by creating new data through oversampling, which may increase the possibility of overfitting, or by eliminating data from the majority class through undersampling, which may discard potentially useful data that could be important during the learning process. As discussed earlier, there are many widespread methods for building diverse ensemble classifiers, such as Bagging, AdaBoost, Random Forests, and Random Subspaces [52]. While each of these methods can be applied to datasets that have undergone sampling, this is not ideal, as it disregards the power of combining the ensemble generation method and sampling into a more structured approach. As a result, several ensemble techniques have been combined with sampling techniques to create ensemble methods that are more appropriate for handling the class imbalance problem, such as UnderBagging [53], OverBagging [54], SMOTEBagging [54], Balanced Bagging [55], and RUSBoost [12], among others. It is imperative to note that since most of these ensemble methods are based on sampling techniques that alter the original class distribution, they may inherit the defects of sampling-based methods discussed earlier [56].

In our method, we propose to generate balanced bags by using the split Balancing (sBal) technique and instance hardness (IH) as a weighting mechanism during the sampling process. Each of the generated bags will contain all the minority instances, and a mixture of majority instances with varying degrees of hardness (easy, normal, and hard). This will ensure that base learners are trained on balanced bags, containing diverse data segments with different levels of hardness to learn various patterns over a different portion of the data.

We hypothesize that generating several balanced bags containing instances with varying degrees of hardness will induce a set of base algorithms that learn patterns from data points representing the overall difficulty of dividing the input space into the different classes. We expect an ensemble constructed with such base algorithms to have better overall classification performance than an ensemble whose base algorithms are trained on bags, balanced or imbalanced, that are sampled uniformly, regardless of the importance of individual instances in identifying the classes.

We carry out an extensive empirical study of our proposed method and compare its performance with existing, widely used, state-of-the-art ensemble methods. We have structured our experimental framework in a way that enables us to draw justified conclusions. We used three sets of datasets for our experiments: the first is composed of 30 synthetic imbalanced datasets with controlled levels of noise obtained from the KEEL repository [57], the second of 29 real-world balanced datasets, and the final set of 41 real-world imbalanced datasets obtained from the KEEL and UCI repositories. We further conducted the nonparametric Friedman test and Bergmann's post hoc statistical test, at a significance level of p < 0.05, to validate our findings.

The results reveal that training base algorithms on balanced bags with varying degrees of hardness can bring substantial improvement in the classification performance of an ensemble. The findings demonstrate that the proposed method performed significantly better than the regular ensemble methods (Bagging, Wagging, Random Forest, and AdaBoost) on both synthetic and real-world balanced and imbalanced datasets, except for Random Forest, whose performance was comparable when evaluated on real-world balanced datasets. Furthermore, the proposed method performed better than ensemble methods specialized for class imbalance problems (Balanced Bagging, Balanced Random Forest, RUSBoost, and Easy Ensemble) on the majority of both balanced and imbalanced datasets.

In summary, the main contribution of this paper is twofold. First, we propose an ensemble method based on a data-level sampling approach to balance ensemble bags (sBal_IH), which takes data complexity into account in order to find good representative points within the bags, to improve prediction accuracy in imbalanced datasets.

Secondly, we conducted extensive experiments on 100 datasets and evaluated the proposed method in comparison with state-of-the-art methods, both standard ensembles and those specialized for handling class imbalance problems. The corresponding results, validated by statistical significance tests, demonstrate that our innovative method of split-balancing based on data complexity, proxied by instance hardness (IH), significantly outperformed most of the other compared methods as measured by Area Under the Curve (AUC) performance. We believe that this study will significantly contribute to the efforts of the machine learning community on addressing the challenge of data imbalance.

The remainder of this paper is organized as follows. We present our proposed method in Sect. 2, followed by a detailed experimental design in Sect. 3. In Sect. 4, we report and discuss the experimental and statistical results in detail. We conclude and propose future work in Sect. 5.

2 Proposed method

We base our proposed method on the idea of making balanced bags of instances, where each bag contains a mixture of varying degrees of data complexity. To achieve this, we utilize Instance Hardness as a measure of data complexity.

2.1 Instance hardness (IH)

It is well known in the machine learning community that the performance of most classifiers depends on both their parameters and the underlying training dataset [14, 59]. However, most researchers focus mainly on model parameter tuning as a way of achieving better model performance, while ignoring an understanding of the data being modeled by the classifier. As a result, it may be difficult to understand which instances are being misclassified and why, even assuming that the right parameters and the right evaluation metric are being used. Instance Hardness (IH) is a measure that specifies how difficult it is to classify a given instance in a dataset correctly [60]. This implies that each instance in a dataset has a property that indicates its probability of being classified incorrectly, regardless of the choice of classifier. For example, we anticipate high IH among outliers and mislabeled instances, since a classifier will most likely have to overfit in order to classify them correctly. IH examines classification problems at the instance level, whereas the majority of machine learning studies focus on the dataset level and are mostly concerned with maximizing \(p\left( f|s \right)\), where \(f: X \to Y\) is a function that maps input feature vectors in \(X\) to their corresponding labels in \(Y\), and \(s = \left\{ \left( x_{i}, y_{i} \right): x_{i} \in X \wedge y_{i} \in Y \right\}\) is the training set, on the assumption that the pairs in \(s\) are drawn independently and identically distributed (i.i.d.).

M. Smith et al. in their study [60] presented the notion of IH through the decomposition of \(p\left( f|s \right)\) using Bayes' theorem:

$$p\left( f|s \right) = \frac{p\left( s|f \right)\, p\left( f \right)}{p\left( s \right)}$$
$$= \frac{\prod_{i = 1}^{\left| s \right|} p\left( x_{i}, y_{i} |f \right)\, p\left( f \right)}{p\left( s \right)}$$
$$= \frac{\prod_{i = 1}^{\left| s \right|} p\left( y_{i} |x_{i}, f \right)\, p\left( x_{i} |f \right)\, p\left( f \right)}{p\left( s \right)}$$
(1)

Furthermore, for a training instance \(\left( x_{i}, y_{i} \right)\), \(p(y_{i} |x_{i}, f)\) measures the probability that \(f\) assigns the label \(y_{i}\) to the input feature vector \(x_{i}\). M. Smith et al. further state that the larger \(p(y_{i} |x_{i}, f)\) is, the more likely \(f\) will assign the right label to \(x_{i}\); conversely, the smaller \(p(y_{i} |x_{i}, f)\) is, the less likely \(f\) is to produce the right label for \(x_{i}\). Therefore, they define IH with respect to \(f\) as

$$IH_{f}\left( x_{i}, y_{i} \right) = 1 - p(y_{i} |x_{i}, f)$$

Under normal practice, \(f\) is induced by an algorithm \(c\) being trained on the training set \(s\). Hence, the hardness of an instance is reliant on the instances in the training set and the underlying algorithm used to produce \(f\).

In the literature, M. Smith et al. [60] proposed several IH measures, such as k-Disagreeing Neighbors (kDN), Disjunct Size, Disjunct Class Percentage (DCP), Class Likelihood (CL), and Class Balance (CB), among others. These measures capture several characteristics of the hardness level of a specific instance; they show why instances are misclassified, giving insight into why specific instances are hard to classify and how best to detect them. This has laid a foundation for research on dataset complexity, and many machine learning classification studies have since been proposed based on instance hardness.

Previous studies in the literature have utilized instance hardness in different ways to improve classification performance, such as noise and outlier filtering [61, 62] and boosting through weight adjustments. A. Kabir et al. in their study [63] proposed a mixed bagging technique for non-class-imbalance problems that incorporates IH in the bootstrap aggregating process. Furthermore, researchers in [64] incorporated hardness ordering into the learning process through filtering and boosting, and significantly improved generalization accuracy.

Our proposed method utilizes the concept of IH in the classification process in a fundamentally different way. While bagging-based methods have been reported to offer good classification performance, their bootstrap sampling process is still prone to the effects of outliers or noise: since instances are randomly sampled with uniform probability, some bootstraps may contain a high percentage of outliers or noisy instances, which can eventually degrade classification performance. In our case, we propose using IH as a weighting mechanism during the sampling process and incorporate IH information into the ensemble bootstrapping process. We ensure that each bootstrap has a representation of instances with varying degrees of hardness (IH estimation and implementation are discussed in detail in Sect. 3), which allows base algorithms to learn different patterns of the training data, so that an ensemble constructed with such base algorithms should have better overall classification performance.

2.2 Split balancing (sBal)

Most popular machine learning binary classification algorithms are designed to perform well on balanced datasets [65], and under such circumstances it is relatively easy to choose the right algorithm and the right performance evaluation metric to truly represent your optimal model [66, 67]. Challenges start to emerge when faced with an imbalanced dataset, since the majority of these algorithms tend to perform poorly and the cost of misclassifying the minority class is usually much higher than the cost of misclassifying the majority class [68]. Thus, several techniques that try to rebalance imbalanced datasets have been proposed and given much attention. However, numerous studies have pointed out the defects of the proposed solutions for handling the class imbalance problem [51]: they alter the original class distribution of the dataset, either by creating new data through oversampling, which may lead to overfitting [15], or by discarding potentially useful data from the majority class through undersampling [69]. Moreover, ensemble techniques have been combined with sampling techniques to create ensemble methods that are more appropriate for handling the class imbalance problem [8]. There is a variety of widespread approaches for building diverse ensembles, such as Random Forests [70], Random Subspaces [71], Bagging [45], and AdaBoost [72], among others. Whereas each of these approaches can be applied to datasets that have gone through sampling, this is not optimal, as it disregards the power of combining ensemble generation and sampling into a better-organized approach. Consequently, several ensemble methods have been combined with sampling techniques to create ensemble methods suitable for dealing with the class imbalance problem. Chawla et al. in their study [10] proposed SMOTEBoost, a novel approach for addressing the class imbalance problem based on the Synthetic Minority Oversampling TEchnique (SMOTE) [73] and boosting. In another study, Seiffert et al. proposed RUSBoost [12], a hybrid ensemble-based method that combines random undersampling (RUS) with boosting.

However, most of these ensemble-based methods rely on data sampling techniques and therefore alter the class distribution of the original dataset, either by eliminating majority class samples (undersampling) or by increasing minority class samples (oversampling). Furthermore, plain boosting and bagging ensembles may still suffer from class imbalance, because the class distribution in each sampled subset at a given iteration is the same as that of the original dataset. It is therefore prudent to convert the imbalanced dataset into multiple balanced datasets (bags) without creating new extra data or discarding potentially useful original data.

Our proposed ensemble-based method tackles the possible shortfalls of the conventional methods stated above by first converting a class imbalance problem into several balanced problems that no longer suffer from class imbalance, without creating new extra data or discarding potentially useful original data, which makes it distinct from the conventional methods for handling the class imbalance problem.

Our proposed method is based on a split balancing technique [74], dubbed sBal, where we generate balanced bags by randomly splitting the instances of the majority class into multiple bags, each of them containing all the minority instances and a sample of the majority instances. However, we introduce the additional constraint that the sampled majority instances should have varying degrees of hardness (easy, normal, and hard). These majority instances are sampled using IH as a weighting mechanism. This ensures that base learners are trained on data points with different levels of hardness, which we believe better represent the input space of the dataset. We then combine the binary classifiers into an ensemble to classify new, unseen data. The overall method is shown in Fig. 2.

Fig. 2 Proposed Method—sBal_IH
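The sketch below illustrates the bag-generation step just described. It is a simplified reading of the method, not the reference implementation: the bag size (equal to the number of minority instances), sampling without replacement, and the small smoothing constant are assumptions made for illustration.

```python
# Split balancing with IH weighting: every bag keeps all minority instances
# plus an IH-weighted sample of majority instances.
import numpy as np

def make_sbal_ih_bags(X, y, hardness, n_bags=10, minority_label=1, seed=0):
    rng = np.random.default_rng(seed)
    min_idx = np.flatnonzero(y == minority_label)
    maj_idx = np.flatnonzero(y != minority_label)

    # Majority selection weights derived from hardness; the small floor keeps
    # very easy (IH ~ 0) instances selectable.
    w = hardness[maj_idx] + 1e-3
    p = w / w.sum()

    bags = []
    for _ in range(n_bags):
        sampled_maj = rng.choice(maj_idx, size=len(min_idx), replace=False, p=p)
        idx = np.concatenate([min_idx, sampled_maj])
        bags.append((X[idx], y[idx]))
    return bags
```

A base classifier would then be trained on each balanced bag and the resulting classifiers combined, e.g., by majority voting, as in standard bagging.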

3 Experiment design

In this section, we present the details of the experiments used to evaluate the effectiveness of our proposed sBal_IH ensemble method and compare its performance against existing regular ensemble methods and state-of-the-art ensemble methods specialized for solving class imbalance problems.

The proposed method is first compared with regular ensemble methods that include: Bagging, AdaBoost, Random Forest (RF), and Wagging [75], on both balanced and imbalanced datasets to assess the performance of sBal_IH in situations of balanced and imbalanced problems. Thereafter, we compare sBal_IH against ensemble methods specialized for class imbalanced problems, namely Balanced Bagging (BB), Balanced Random Forest (BRF), Easy Ensemble (EE), and Random Undersampling Boosting (RUSBoost), on both synthetic and real-life public datasets, balanced and imbalanced. In Sect. 3.1, we briefly introduce all the ensemble methods studied in this paper alongside our proposed method.

We organized our experiments and their analysis into two stages. In the first stage, we conduct experiments on different groups of datasets to facilitate the analysis of results. In one group, experiments are performed on synthetic imbalanced datasets (Table 1) with controlled levels of noise (disturbance ratio) to examine the influence of noise on the performance of sBal_IH and the existing ensemble methods in terms of AUC. To further examine the performance of our proposed method in real-world situations, we performed experiments on another two groups of 29 balanced and 41 imbalanced real-world datasets. For evaluating all the methods, we used fivefold cross-validation repeated 5 times with different random seeds. For fairness and uniformity, we used an ensemble size of 10 (n_estimators) and a Decision Tree with its default parameters as the base estimator for all the ensemble methods, apart from Easy Ensemble, which uses AdaBoost as its base estimator. We included the Easy Ensemble method in our experiments to compare performance against a hybrid (ensemble of ensembles) method.
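The following sketch illustrates this evaluation protocol (repeated fivefold cross-validation, AUC scoring, ensemble size 10, decision-tree base learner) on a toy dataset, with a plain BaggingClassifier standing in for the compared methods; parameter names follow the scikit-learn 0.21 API used in the experiments.

```python
# Evaluation protocol sketch: 5-fold CV repeated 5 times with different seeds,
# scored by AUC; ensemble of 10 decision trees.
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score

X, y = make_classification(n_samples=500, weights=[0.9, 0.1], random_state=0)

model = BaggingClassifier(base_estimator=DecisionTreeClassifier(),
                          n_estimators=10, random_state=0)
cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=5, random_state=0)
scores = cross_val_score(model, X, y, scoring="roc_auc", cv=cv)
print(scores.mean(), scores.std())
```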

In the second stage, we performed a series of statistical comparisons of sBal_IH against existing ensemble techniques. The goal of this analysis is to validate whether there is a significant difference in the performance in terms of AUC, which will enable us to draw more confident conclusions.

All the experiments were carried out in Python 3.7 using scikit-learn library [76] version 0.21.3, and imbalanced-learn library [77] version 0.5. For statistical significance tests, we used KEEL’s nonparametric statistical analysis software [57], version 3.0. The computer used for conducting the experiments was running Windows 10 64 bit, with an Intel Xeon processor (2.5 GHz) and 12 GB of RAM.

3.1 Featured ensemble methods

We briefly introduce the 8 ensemble methods evaluated in this study. Four of them are regular ensemble methods, namely Bagging, AdaBoost, Random Forest (RF), and Wagging, and the other four are ensemble methods specialized for class imbalance problems, namely Balanced Bagging (BB), Balanced Random Forest (BRF), Easy Ensemble (EE), and Random Undersampling Boosting (RUSBoost). The motivation for selecting these methods is that they are widely used, and the majority of studies on ensemble methods and class imbalance problems include one or more of them. They are all discussed comprehensively in the literature, such as in the studies [36] and [39] mentioned above, and in [46] and [75], referred to later.

Bagging, also referred to as Bootstrap Aggregating, was first introduced by Breiman [45], and today is one of the most intuitive and possibly the simplest ensemble-based methods. It involves training different classifiers with bootstrapped copies of the data points randomly drawn with replacement from the original training dataset. The decisions of individual classifiers are then combined through majority voting.

Wagging (Weights Aggregation) is a variant of Bagging proposed by Bauer et al. [75]; it requires a base algorithm that can utilize training instances with different weights. Rather than using bootstrap samples to establish successive training sets, it randomly assigns weights to the instances available in each training set. It uses Gaussian noise to adjust the instance weights, which may reduce the weights of some instances to zero, effectively eliminating them from the training set.

Balanced Bagging [9], also referred to as Blagging, is an extension of bagging for solving class imbalance problems. The idea is to continuously undersample instances from the majority class in each of the bootstraps in order to get balanced bootstraps, on which individual decision trees are trained. This leads to an emphasis on the minority class and more balanced decisions.

Random Forest [70] consists of a large number of individual decision trees that work as an ensemble. The random trees are generated using bootstrap samples of the training data and a subset of randomly selected features during the tree induction process.

Balanced Random Forest (BRF) [78] is a variant of RF that was designed to deal with the problem of class imbalance. It utilizes the undersampling technique in each iteration of RF by drawing bootstrap instances from the minority class, and then drawing the same number of instances with replacement from the majority class, and then induces classification trees (CART algorithm) from the data in each iteration.

AdaBoost algorithm was developed by Schapire and Freund [46, 72]; it uses the whole training set to train multiple classifiers in a serial format. In each round, AdaBoost gives more attention to misclassified instances, with an aim to correctly classify them in subsequent iterations. This is achieved by maintaining a set of weights of the training instances, which are initially equal, and get updated according to correct or incorrect classification.

RUSBoost [79], a variation of the SMOTEBoost algorithm [10], works in a similar way to AdaBoost, but discards instances from the majority class by Random Undersampling (RUS) in each iteration. A previous study [80] showed that RUS often outperforms SMOTE, and RUSBoost is thus preferred as an alternative to SMOTEBoost due to its simplicity and lower computational cost.

Easy Ensemble (EE) was proposed by Liu et al. [14]. It generates balanced subsets of the training set by combining all minority instances with samples drawn from the majority instances. Boosted decision trees, specifically the AdaBoost algorithm, are then induced on each balanced subset.
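For reference, the four specialized methods above are available in the imbalanced-learn library used in our experiments; the sketch below simply instantiates them with the settings described in Sect. 3 (ensemble size 10), and the class names reflect recent imblearn releases.

```python
# Specialized ensemble methods for class imbalance, as provided by
# imbalanced-learn (imblearn.ensemble).
from imblearn.ensemble import (BalancedBaggingClassifier,
                               BalancedRandomForestClassifier,
                               EasyEnsembleClassifier,
                               RUSBoostClassifier)

specialized = {
    "BB":       BalancedBaggingClassifier(n_estimators=10, random_state=0),
    "BRF":      BalancedRandomForestClassifier(n_estimators=10, random_state=0),
    "EE":       EasyEnsembleClassifier(n_estimators=10, random_state=0),  # AdaBoost inside
    "RUSBoost": RUSBoostClassifier(n_estimators=10, random_state=0),
}
```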

3.2 Experiment datasets

As mentioned earlier (Sect. 3), we use three groups of datasets in our experiments, all prepared with binary classification tasks in mind. The first group, shown in Table 1, was obtained from the KEEL repository [57] and is composed of 30 synthetic imbalanced datasets with controlled levels of noise, i.e., a disturbance ratio (the last number in the dataset name) applied to create noisy examples in the dataset. This set was used in [81] to investigate the effect of noise and borderline instances from the minority class on classifier performance. We added this group to evaluate the effectiveness of our proposed method at handling imbalanced datasets in the presence of noise, compared to other methods. The second group is composed of 29 real-world balanced datasets, as listed in Table 2. We included this group to examine fairly whether our proposed method shows a distinctive effect when the dataset is not actually imbalanced. The final group is composed of 41 real-world imbalanced datasets, as shown in Table 3. Both groups were obtained from the KEEL [57] and UCI repositories. Tables 1, 2, and 3 summarize all the datasets used, highlighting some of their properties: Dataset id (ID), Dataset name (Dataset), total number of instances (#Inst), percentage of majority instances (%Maj), percentage of minority instances (%Min), and Imbalance Ratio (IR). In this study, we define IR as the ratio of the percentage of majority to minority instances, i.e., \(IR = \frac{\% Maj}{\% Min}\).

Table 1 Synthetic Imbalanced Datasets with Controlled Noise
Table 2 Summary of Balanced Real-world Datasets Sorted by IR

Table 3 Summary of Imbalanced Real-world Datasets Sorted by IR

3.3 Instance hardness estimation

As explained in Sect. 2, IH is the probability that an instance will be classified incorrectly by a classifier built from the other instances of the same dataset. M. Smith et al. in their study [60] described various metrics for estimating IH; they empirically analyzed IH on over 190,000 instances from multiple datasets, using different learning algorithms, and established that there is always a considerable number of instances that are hard to classify correctly.

The chosen metric used in our experiments estimates the hardness of a data instance as a percentage of a pre-selected set of classifiers that misclassify that instance. We considered the following algorithms for instance hardness estimation: Logistic Regression (LR), Decision Tree (DT), k-Nearest Neighbor (kNN), and Gaussian Naive Bayes (NB). This collection is a cocktail of both linear (LR) and nonlinear algorithms (kNN, DT, NB), and both parametric (NB, LR) and nonparametric algorithms (kNN, DT), as well as instance-based (kNN) and model-based (DT, NB, LR) ones. This enables a diverse and probably more reliable estimate of instance hardness.

We estimate Instance Hardness (IH) of each instance (data point) in a dataset as the percentage of incorrect classifications for that instance made by a pre-selected set of classifiers \(C = \left\{ {c_{1} , c_{2} ,..,c_{m} } \right\}\) which are built from other instances in the dataset.

Figure a summarizes the steps that can be taken to calculate IH for every instance in a given dataset. The steps are laid out in a Leave-One-Out (LOO) style to assess the hardness of each instance separately, using the remaining instances in the training set. Even though we use LOO in the layout of the algorithm for its simplicity, IH can be calculated with a less compute-intensive approach, such as a less extreme form of cross-validation. In our implementation, we used a variant based on fivefold cross-validation in order to save on computation time.
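The sketch below follows that description: the hardness of each instance is the fraction of a pre-selected set of learners (LR, DT, kNN, NB) that misclassify it, with predictions made out-of-fold under fivefold cross-validation so that no instance is scored by a model trained on it. Hyperparameters and other details are illustrative assumptions, not the exact settings of our implementation.

```python
# IH estimation: fraction of classifiers that misclassify each instance,
# using out-of-fold predictions from 5-fold cross-validation.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import cross_val_predict

def estimate_instance_hardness(X, y, n_folds=5, seed=0):
    learners = [LogisticRegression(max_iter=1000),
                DecisionTreeClassifier(random_state=seed),
                KNeighborsClassifier(),
                GaussianNB()]
    misclassified = np.zeros(len(y), dtype=float)
    for clf in learners:
        oof_pred = cross_val_predict(clf, X, y, cv=n_folds)  # out-of-fold labels
        misclassified += (oof_pred != y)
    return misclassified / len(learners)  # IH values in [0, 1]
```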

The hardness of an instance is represented by a real value between 0 and 1, where an instance with a hardness value close to 0 is more likely to be correctly classified, whereas the opposite is true for that instance with a hardness value close to 1. This implies that an instance with an estimated hardness value higher than a predefined threshold \(t_{h}\) is categorized as Hard (H), and with hardness value below the threshold \(t_{e}\) is categorized as Easy (E), or else as Normal (N) as shown in Fig. 3. In our study, we set the default IH thresholds for easy and hard instances as \(t_{e} = 0.33\) and \(t_{h} = 0.66\), respectively, to maintain equal ranges of hardness.

Fig. 3 Instance Hardness Threshold Ratio
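A small sketch of the thresholding just described (default \(t_{e} = 0.33\) and \(t_{h} = 0.66\)); the function name is ours and purely illustrative.

```python
# Map IH values to the Easy / Normal / Hard categories of Fig. 3.
import numpy as np

def categorize_hardness(ih, t_e=0.33, t_h=0.66):
    # IH > t_h -> Hard, IH < t_e -> Easy, otherwise Normal
    return np.where(ih > t_h, "H", np.where(ih < t_e, "E", "N"))
```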

Even though we use the term “training dataset” in IH estimation above, we emphasize that this does not refer to the original imbalanced datasets in our study, but rather to the training subset in a cross-validation fold. In other words, for rigorous experiment design, IH estimation is performed multiple times, once in every fold, for each of the original imbalanced datasets. This is done to avoid data leakage from the test sets to the training phase as shown in Fig. 2.

3.4 Evaluation metric

An obvious challenge in evaluating classifiers on imbalanced datasets is choosing the right performance metric. Accuracy is a well-known metric frequently used for classification problems [82]; however, as many studies have reported, it is not an appropriate metric in situations of imbalanced datasets [67, 83, 84]. Instead of accuracy, researchers have adopted other evaluation metrics for imbalanced problems, such as F-Measure, Recall, Precision, G-Mean, Area Under the Curve (AUC), and Matthews Correlation Coefficient (MCC). Regarding precision and recall, it has been reported in [85] that precision is sensitive to data distribution while recall is not. With many metrics in place, researchers face the challenge of choosing the right metric for imbalanced problems; most researchers in the machine learning community prefer AUC [2, 3] and MCC [86] over other metrics. Researchers in [87] empirically compared the performance of AUC against MCC on a series of real-world imbalanced datasets and established that AUC was statistically consistent and more discriminating than MCC, suggesting that AUC is a better measure than MCC for evaluating binary classification with imbalanced datasets. In that regard, we chose to use AUC as the evaluation metric throughout our experiments.
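As a reminder of how the metric is computed in practice, the toy example below uses scikit-learn's roc_auc_score on the classifier's scores for the positive (minority) class; the numbers are arbitrary.

```python
# AUC is computed from scores for the positive (minority) class,
# not from hard 0/1 predictions.
from sklearn.metrics import roc_auc_score

y_true = [0, 0, 0, 0, 1, 1]
y_score = [0.1, 0.2, 0.6, 0.3, 0.8, 0.4]
print(roc_auc_score(y_true, y_score))  # 0.875: 7 of the 8 positive-negative pairs are ranked correctly
```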

4 Results and discussion

In this section, we present and discuss the outcomes of our experiments, resulting from fivefold cross-validation runs repeated with 5 different random seeds (as explained in Sect. 3). We tested our proposed method against regular ensemble methods, namely Bagging (Bag), Wagging (Wag), Random Forest (RF), and AdaBoost (AdaB), and against ensemble methods specialized for imbalanced problems, namely Easy Ensemble (EE), Balanced Bagging (BB), Balanced Random Forest (BRF), and RUSBoost (RUSB). These methods were evaluated on 100 publicly available datasets: 30 synthetic imbalanced datasets (Table 1), 29 balanced real-world datasets (Table 2), and 41 imbalanced real-world datasets (Table 3).

In all the tables, results highlighted in bold indicate a higher AUC score than the others in the same row for a given dataset. In case of a tie, we count it as a win for each of the tied methods.

4.1 Results for synthetic datasets

In this subsection, we present and discuss the performance of our proposed method against regular and specialized ensemble methods for class imbalance problems when faced with imbalanced datasets perturbed with different levels of noise. This group of datasets helps us evaluate and understand the effectiveness of our proposed method, compared to the other methods, at handling noise in datasets as a proxy for underlying data complexities.

4.1.1 (a) sBal_IH against regular ensemble methods on synthetic datasets

We compared the proposed sBal_IH ensemble method with the regular ensemble methods in terms of AUC, evaluated on the 30 synthetic imbalanced datasets. Table 4 shows the mean AUC values of 5 repeated trials. The results highlighted in bold demonstrate the better performance of sBal_IH against all the other regular ensemble methods on 28 of the 30 datasets. AdaBoost, despite its efforts to add weight to difficult-to-classify instances, ranked second, performing better on only 2 of the 30 datasets.

We can also observe that as the noise level increases (disturbance ratio values increase every 5 rows in Table 4, from 0, 30, 50, 60, up to 70), the performance of all ensemble methods deteriorates; however, our proposed method is affected the least and still maintains higher performance. We believe the reason for this is the ability of sBal_IH to still capture most of the underlying complexities existing within the dataset, due to the use of learners trained on balanced bags that retain varying levels of hardness. This implies that our proposed method handles noise and class imbalance more effectively than the regular ensemble methods.

Table 4 AUC Score for sBal_IH Against Regular Ensemble Methods on 30 Synthetic Imbalance Datasets

4.1.2 (b) sBal_IH against specialized ensemble methods for class imbalance problems on synthetic datasets

Since the regular ensemble methods are not specially designed to handle class imbalance problems, we also evaluate the effectiveness of our proposed method in comparison with the methods specialized for class imbalance problems mentioned in Sect. 3. In Table 5, we present the performance of sBal_IH against these ensemble methods in terms of AUC on the 30 synthetic imbalanced datasets.

Table 5 AUC Score for sBal_IH against ensemble methods specialized for class imbalanced problems on 30 synthetic imbalanced datasets

The results highlighted in bold show that, out of the 30 datasets, sBal_IH outperformed the rest of the methods on 16 datasets, followed by BRF with only 6 wins. The results indicate that sBal_IH faced fair competition from the state-of-the-art ensemble methods specialized for handling imbalanced datasets.

This is because most of these specialized methods try to take care of the class imbalance problem and the underlying complexities existing within the dataset. For instance, Balanced Random Forest (BRF) and Balanced Bagging (BB) are similar to our proposed method in trying to balance bootstraps during the aggregating process. However, this happens in an interestingly different way; for BRF, they draw bootstrap instances from the minority class and then randomly draw the same number of instances, with replacement, from the majority class [78]. As for BB, in each bootstrap, they draw the same number of minority and majority instances based on the undersampling technique using a negative binomial distribution [55].

Nevertheless, even though most of the specialized methods try to balance the class distributions, there is still a likelihood of drawing more complex or noisy instances into a bag, or across many bags, since their sampling process is based on a uniform probability distribution. This appears to have a negative effect on the overall classification performance of the ensemble, compared to what sBal_IH was able to achieve when the balanced bags were composed of a mixture of instances with varying degrees of hardness.

4.2 Results for real-world datasets

We now compare the performance of sBal_IH against regular and specialized ensemble methods on real-world, multi-domain datasets to further understand its behavior in real-world settings.

4.2.1 (a) sBal_IH against regular ensemble methods on real-world datasets

Here, we compare sBal_IH with the regular ensemble methods in terms of AUC on the 29 balanced and 41 imbalanced real-world datasets. Tables 6 and 7 present the AUC performance scores on these dataset groups, respectively. In the experiments, we considered a group of balanced datasets in order to examine fairly whether our proposed method shows a distinctive effect when the datasets are actually balanced. As highlighted in bold in Table 6, sBal_IH outperformed the traditional ensemble methods on only 15 of the 29 balanced datasets, followed by Random Forest with 13 wins, with the remaining methods far behind. The results show comparable performance between sBal_IH and the other regular ensemble methods.

Table 6 AUC Score for sBal_IH Against Regular Ensemble Methods on 29 Balanced Datasets

On the other hand, with the 41 real-world imbalanced datasets, the highlighted results in Table 7 show a similar pattern to that of the synthetic imbalanced datasets (subsection 4.1.a), where sBal_IH exhibits good performance in 36 out of the 41 real-world imbalanced datasets. This is an endorsement of the effectiveness of our proposed method in addressing the class imbalance problem as compared to the regular ensemble methods.

Table 7 AUC Score for sBal_IH Against Regular Ensemble Methods on 41 Real-world Imbalanced Datasets

The observations above may indicate that when the data is already balanced, sBal_IH still benefits from the varying hardness in the generated bags, but that this benefit is magnified when the dataset is also imbalanced. This can be inferred from our finding that sBal_IH outperforms the regular ensemble methods when the datasets are imbalanced (synthetic and real-world), and it is consistent with what other researchers have found [88], namely that when the data is imbalanced, the negative effect of noise is magnified.

4.2.2 (b) sBal_IH against specialized ensemble methods for class imbalance problems on real-world datasets

Our proposed method was also compared against state-of-the-art ensemble methods specialized for handling imbalanced data on a group of 41 real-world multi-domain imbalanced datasets. The outcomes from this experiment, if good enough, will position the sBal_IH method among the best alternative ensemble methods for addressing the problem of class imbalance.

In Table 8, we observe a better performance of the proposed method (sBal_IH) across 18 out of the 41 imbalanced real-world datasets. This is followed by Easy Ensemble (EE) with 11 wins and Balanced Random Forest coming third with only 9 wins. RUSBoost and Balanced Bagging were the least performing in terms of AUC.

Table 8 AUC Score for sBal_IH Against Ensemble Methods Specialized for Class Imbalanced Problems on 41 Imbalanced Datasets

We note that even though there is reasonable competition in performance between sBal_IH and the EE method, sBal_IH still outperformed EE, despite EE being a hybrid ensemble (an ensemble of ensembles) that uses AdaBoost as its default base algorithm and therefore starts with an advantage over sBal_IH.

Lastly, we also assessed the performance of sBal_IH against the specialized ensemble methods in situations where the data is actually balanced, i.e., on the 29 balanced real-world datasets.

The results in Table 9 show a remarkable performance of sBal_IH against the other methods, with 16 wins out of 29 balanced datasets, followed by Easy Ensemble with 6 wins, and RUSBoost registering the least performance. The findings suggest that sBal_IH significantly outperforms specialized ensemble methods when there is no class imbalance problem.

Table 9 AUC Score for sBal_IH Against Specialized Methods for Class Imbalanced Problems on 29 Balanced Datasets

Throughout the experiments, we observe consistently good performance of the proposed method across the majority of the datasets. The intuitive explanation is that when a classifier is trained on a portion of instances with varying degrees of hardness, it has a better chance to learn the underlying pattern needed to classify the data while maintaining two properties: it learns from hard instances as well as easy ones, and at the same time it does not overfit to either. This produces a consistent and noise-tolerant (robust) model that can better deal with both balanced and imbalanced problems.

In order to draw more reliable conclusions, in the next subsection, we discuss Friedman’s nonparametric statistical test used to ascertain if there exists any significant difference in performance between sBal_IH and the other ensemble methods.

4.3 Statistical tests

Throughout our experiments, we observed a consistent trend in performance advantage that cuts across most of the studied methods. However, in order to draw reliable conclusions, it is important to determine whether there exists a statistically significant difference in classification performance as measured by AUC. For that purpose, we utilized the standard methodology proposed by Demšar [89] for testing statistical significance among multiple methods across various datasets. We carried out a nonparametric (distribution-free) statistical test on all mean AUC results. We chose Friedman's test [90] because we have no knowledge of the distribution of the values in our analyzed data.

The Friedman test is the nonparametric equivalent of the repeated-measures ANOVA. It ranks the algorithms for each dataset separately, assigning rank 1 to the best performing algorithm, rank 2 to the second best, and so on; in case of a tie, it assigns the average rank to the affected algorithms. Suppose \(r_{i}^{j}\) is the rank of the \(j\)-th algorithm (of \(k\) algorithms) on the \(i\)-th dataset (of \(N\) datasets). The Friedman test calculates the average rank of algorithm \(j\) as \(R_{j} = \frac{1}{N}\sum_{i} r_{i}^{j}\).

The Friedman statistic, \(\chi_{F}^{2}\), distributed with \(k - 1\) degrees of freedom, is calculated as shown in Eq. 2:

$$\chi_{F}^{2} = \frac{12N}{k\left( k + 1 \right)}\left[ \sum_{j} R_{j}^{2} - \frac{k\left( k + 1 \right)^{2}}{4} \right]$$
(2)
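For concreteness, the short sketch below evaluates Eq. 2 on a toy matrix of mean AUC scores (rows are datasets, columns are methods) and cross-checks the result against scipy's implementation, which additionally applies a tie correction; the numbers are arbitrary.

```python
# Friedman statistic of Eq. 2 computed from a (datasets x methods) AUC matrix.
import numpy as np
from scipy.stats import friedmanchisquare, rankdata

auc = np.array([[0.90, 0.85, 0.88],   # toy mean AUC scores
                [0.75, 0.70, 0.78],
                [0.80, 0.79, 0.83],
                [0.66, 0.60, 0.71]])
N, k = auc.shape

ranks = np.apply_along_axis(lambda row: rankdata(-row), 1, auc)  # rank 1 = best AUC
R = ranks.mean(axis=0)                                           # average ranks R_j

chi2_F = 12 * N / (k * (k + 1)) * (np.sum(R ** 2) - k * (k + 1) ** 2 / 4)
print(chi2_F)                          # 6.5 for this toy matrix
print(friedmanchisquare(*auc.T))       # same statistic here (no ties), plus a p value
```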

For this test, the null hypothesis (H0) states that all the algorithms are comparable and the observed differences in their ranks might be random, whereas the alternative hypothesis (H1) states that there is a statistically significant difference among the algorithms.

Our interest is to reject the null hypothesis; in case H0 is rejected, we go ahead and carry out a post hoc test in order to establish the specific pairs of algorithms that produce the differences. Many post hoc tests have been proposed in the literature [91], such as Nemenyi's, Holm's, Bergmann's, and Shaffer's tests, among others, as shown in Fig. 4. A detailed discussion of nonparametric tests and their corresponding post hoc tests is presented in [89] and [91].

Fig. 4 Nonparametric Tests and Post hoc Procedures for N × N Comparisons

For our study, we used Friedman's N × N statistical test to first determine whether there exists any statistically significant difference among the methods being studied across the datasets, followed by the Bergmann post hoc procedure to compute a probability value (p value) for the test on each pair of methods. We chose the Bergmann post hoc procedure because of its high statistical power for multiple comparisons; it has been recommended in the literature for this kind of study, since it exhaustively finds all the possible sets of hypotheses for the given comparisons and all those elementary hypotheses that cannot be rejected [91].

We carried out the statistical tests using KEEL’s nonparametric statistical analysis tool [57]. We present the outcome of the statistical analysis in the next subsection.

4.4 Statistical analysis results

As an outcome of the statistical analysis of the performance of our proposed method (sBal_IH) against existing popular ensemble methods, using the Friedman/Bergmann procedures with a significance level of \(\alpha = 0.05\), we present our results in Tables 10–15. We highlight in bold all pairs that include the sBal_IH method. In the results column, we indicate whether the null hypothesis is rejected or accepted; we are primarily interested in determining statistically significant performance differences between sBal_IH and the other ensemble methods across the series of datasets.

For the data resulting from Tables 4 and 5, where we studied sBal_IH against regular and specialized ensemble methods, respectively, on a series of synthetic imbalanced datasets with controlled levels of noise, we include the statistical analysis results in Tables 10 and 11.

It is observed in Table 10 that H0 is rejected across all pairs of algorithms involving sBal_IH. This shows that there is a significant difference in performance between sBal_IH and all traditional ensemble methods on the synthetic imbalanced datasets.

Table 10 Statistical Results for Data from Table 4—sBal_IH vs Regular Ensemble Method on Synthetic Imbalanced Datasets

When it comes to specialized ensemble methods for class imbalance problems, we observe in Table 11 a significant difference in performance between sBal_IH and the other specialized ensemble methods, except for sBal_IH vs Balanced Random Forest (BRF).

Table 11 Statistical Results for Data from Table 5—sBal_IH vs Specialized Ensemble Methods on Synthetic Imbalanced Datasets

In Table 12, sBal_IH is statistically analyzed against the regular ensemble methods on balanced datasets, using the results obtained from Table 6. For all pairs of comparisons containing sBal_IH, Bergmann's procedure rejects the null hypothesis except for RF vs sBal_IH.

Table 12 Statistical Results for Data from Table 6—sBal_IH vs Regular Ensemble Methods on Real-world Balanced Datasets

In Table 13, which shows the analysis of the results from Table 7, we observe that H0 is rejected for all pairs involving sBal_IH versus the regular ensemble methods. This confirms that there exists a statistically significant difference in performance between sBal_IH and all regular ensemble methods on imbalanced real-world datasets (in line with the same observation above on imbalanced synthetic datasets).

Table 13 Statistical Results for Table 7—sBal_IH vs Regular Ensemble Methods on Imbalanced Datasets

Finally, in Tables 14 and 15, we analyze the differences between sBal_IH and ensemble methods specialized for class imbalance problems on a series of imbalanced and balanced datasets, using results from Tables 8 and 9, respectively.

In Table 14, despite sBal_IH being ranked first among all methods by the average rankings of Friedman’s test, we observe that Bergmann’s procedure rejects H0 only for sBal_IH vs RUSBoost. It is important to note, however, that the Friedman test may report a significant difference in performance among algorithms while the post hoc test fails to detect it for particular pairs; this can be due to the lower statistical power of the post hoc test.

Table 14 Statistical Results for Data from Table 8—sBal_IH vs Specialized Ensemble Methods on Real-world Imbalanced Datasets

Interestingly, when it comes to balanced datasets, as indicated in Table 15, sBal_IH registered a significant difference in performance with Easy Ensemble, Balanced Bagging, and RUSBoost. This shows that our proposed method is comparable to the state-of-the-art specialized ensemble methods on imbalanced datasets, and superior on balanced ones.

Table 15 Statistical Results for Data from Table 9—sBal_IH vs Specialized Ensemble Methods on Real-world Balanced Datasets

4.5 Computational Cost

From the experimental results, despite the promising performance of the proposed method (sBal_IH), we note that this comes at the computational cost of pre-calculating the instance hardness of a given dataset. As observed in Table 16, the proposed method (sBal_IH) registered higher computational times than all of the other studied methods (2 runs with different random seeds). Due to space limitations, we report computational times for only 15 datasets, selecting 5 from each group (synthetic, real-world balanced, and real-world imbalanced), covering different dataset and feature sizes.

Table 16 Computational Time (in seconds) for all Ensemble Methods on Selected Datasets

While we recognize that the computation times for the sBal_IH method are much higher than those of the other methods, we believe that for many applications using small and medium-size datasets this can still be an acceptable compromise in exchange for the performance improvement. We also draw attention to the fact that these higher computation times are due to the IH estimation step. Table 17 demonstrates this clearly by showing (in the first 4 columns) how much time it took to estimate IH versus the time taken by the remaining steps of sBal_IH.

Table 17 Computational Time (in seconds) for the Subtasks of IH Estimation and Expected Parallelization Gains

Based on that observation, we emphasize that the IH estimation process is highly parallelizable, since it consists of multiple independent tasks performed within a cross-validation scheme (each fold can easily run in parallel). To demonstrate the expected efficiency improvements, we ran experiments estimating the performance gains that would be obtained if the IH process were run with parallelization. This is shown in Table 17 (last 4 columns), following the logic below (a minimal code sketch of such a scheme follows the list):

  • Currently, IH estimation is performed using a fivefold cross-validation (CV) scheme; running the 5 folds in parallel processes would therefore cost approximately 20% of the time of the sequential execution of this scheme.

  • In each of the above folds, we used 4 different learners to estimate IH. Our experiments show that KNN had the highest cost across all datasets; thus, if each learner were parallelized into a separate process, IH estimation would have to wait for the slowest learner (i.e., KNN) to finish.

  • Based on this slowest learner’s cost, running each of the 4 learners in parallel would give the estimated improvement shown in Table 17 (second-to-last column).

  • Lastly, the total running time of sBal_IH under such a parallelization approach is estimated to be as shown in the last column.
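The following is a minimal sketch of such a parallel IH estimation scheme, assuming scikit-learn and joblib. The four learners, the helper names, and the hardness definition used here (one minus the probability assigned to an instance’s true class, averaged over the learners that scored it) are illustrative assumptions, not necessarily the exact setup of our experiments.

```python
# Sketch: parallelized instance-hardness (IH) estimation.
# Each (fold, learner) pair is an independent job, so all 5 x 4 = 20 jobs can run in parallel.
import numpy as np
from joblib import Parallel, delayed
from sklearn.model_selection import StratifiedKFold
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression

# Illustrative learner pool (the paper uses 4 learners, with KNN the slowest).
LEARNERS = [KNeighborsClassifier, DecisionTreeClassifier, GaussianNB, LogisticRegression]

def _one_task(X, y, train_idx, test_idx, make_clf):
    """Fit one learner on one fold's training split and score the held-out instances."""
    clf = make_clf().fit(X[train_idx], y[train_idx])
    proba = clf.predict_proba(X[test_idx])
    # Probability the learner assigns to each held-out instance's true class.
    cols = np.searchsorted(clf.classes_, y[test_idx])
    true_class_proba = proba[np.arange(len(test_idx)), cols]
    return test_idx, 1.0 - true_class_proba   # hardness w.r.t. this learner

def estimate_instance_hardness(X, y, n_splits=5, n_jobs=-1):
    """Fivefold CV scheme with every (fold, learner) job run in parallel."""
    folds = list(StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=0).split(X, y))
    jobs = [(tr, te, make_clf) for tr, te in folds for make_clf in LEARNERS]
    results = Parallel(n_jobs=n_jobs)(
        delayed(_one_task)(X, y, tr, te, m) for tr, te, m in jobs
    )
    # Average the per-learner hardness scores for each instance.
    hardness_sum, counts = np.zeros(len(y)), np.zeros(len(y))
    for test_idx, h in results:
        hardness_sum[test_idx] += h
        counts[test_idx] += 1
    return hardness_sum / counts
```

Under this scheme the wall-clock time per fold is bounded by the slowest learner (KNN), which is exactly the assumption behind the estimates in the last columns of Table 17.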

As demonstrated by the above discussion, the computation time of the sBal_IH method can be reduced by roughly an order of magnitude (~10×) with a simple parallelization scheme. We do, however, recognize that future work is required to address the issue of computational cost, perhaps via faster alternatives for estimating IH. But even without this, the sBal_IH method still offers an acceptable compromise in exchange for the performance improvement.

5 Conclusion and future works

In this study, we have addressed the problem of class imbalance from the perspective of improving the performance of ensemble-based classifiers. Although many state-of-the-art ensemble-based methods have been proposed for this problem, we have introduced a new ensemble method, called sBal_IH, with a new bootstrapping approach that balances the class distributions of the bags based on instance hardness. In the proposed method, each bag contains all of the minority instances, and the selection of majority instances is guided by pre-calculated instance hardness categories (easy, normal, and hard), unlike most bagging-based methods, which select instances with uniform probability. This ensures that base learners are trained on balanced bags that also contain diverse levels of hardness, allowing a good representation of the input space that reflected well on the overall performance, as demonstrated by the experimental results.
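The construction of one such bag can be sketched as follows. This is a minimal illustration only: how the easy/normal/hard categories are defined (here, hardness tertiles of the majority class) and how many majority instances are drawn from each category (here, roughly equal shares) are illustrative assumptions rather than the exact rules of sBal_IH.

```python
# Sketch: building one class-balanced, hardness-aware bag.
import numpy as np

def make_balanced_bag(X, y, hardness, minority_label, rng):
    """All minority instances plus an equal number of majority instances,
    sampled according to their instance-hardness category."""
    min_idx = np.where(y == minority_label)[0]
    maj_idx = np.where(y != minority_label)[0]

    # Split majority instances into easy / normal / hard via hardness tertiles.
    q1, q2 = np.quantile(hardness[maj_idx], [1 / 3, 2 / 3])
    h = hardness[maj_idx]
    categories = [
        maj_idx[h <= q1],                 # easy
        maj_idx[(h > q1) & (h <= q2)],    # normal
        maj_idx[h > q2],                  # hard
    ]

    # Draw (with replacement) roughly equal numbers from each non-empty category
    # so the bag is class-balanced and mixes hardness levels.
    per_cat = int(np.ceil(len(min_idx) / 3))
    sampled = np.concatenate(
        [rng.choice(cat, size=per_cat, replace=True) for cat in categories if len(cat) > 0]
    )[: len(min_idx)]

    bag_idx = np.concatenate([min_idx, sampled])
    return X[bag_idx], y[bag_idx]

# Usage (one bag per base learner of the ensemble):
# rng = np.random.default_rng(0)
# X_bag, y_bag = make_balanced_bag(X, y, hardness, minority_label=1, rng=rng)
```

Repeating this sampling once per base learner yields an ensemble whose members see balanced but mutually diverse training bags.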

The proposed method was evaluated on 100 datasets, comprising 30 synthetic imbalanced datasets with controlled levels of noise, 29 balanced real-world datasets, and 41 imbalanced real-world datasets for binary classification. We used the nonparametric Friedman test and Bergmann’s post hoc procedure, at a significance level of \(\alpha = 0.05\), to statistically ascertain our findings.

The findings demonstrate that our proposed sBal_IH method performs statistically significantly better than the regular ensemble methods (Bagging, Wagging, Random Forest, and AdaBoost) on both synthetic and real-world imbalanced datasets, and better than these methods on balanced datasets (except for Random Forest, where performance was comparable). Furthermore, sBal_IH performed better than the ensemble methods specialized for class imbalance problems (Balanced Bagging, Balanced Random Forest, RUSBoost, and Easy Ensemble) on the majority of both balanced and imbalanced datasets. The analysis showed a statistically significant difference in performance between sBal_IH and the specialized methods on the balanced datasets (except for Balanced Random Forest, where performance was comparable), and comparable performance on the imbalanced datasets.

These findings suggest that the approach followed by the sBal_IH method is significantly better than the compared methods in the majority of cases, for both balanced and imbalanced datasets. This paves the way for similar extensions based on data complexity and instance hardness to further boost performance. We note that this improvement comes at the expense of the computational cost of pre-calculating the instance hardness of a given dataset. However, for many applications that use small and medium-size datasets, this might be acceptable in exchange for the promising improvement in performance.

In future work, we plan to investigate how to reduce the computational cost of sBal_IH by studying faster alternative approaches for estimating IH. We also propose to extend our experiments to cover multi-class imbalanced datasets by expanding the sBal_IH approach to take multiple classes into consideration. Finally, we plan to study the performance of sBal_IH when induced with heterogeneous base learners, since in the current experiments we considered only a homogeneous one for building the ensembles. We therefore anticipate that this study opens several research opportunities that utilize data balancing techniques based on instance hardness and data complexity to improve classification performance in situations of class imbalance, and possibly on balanced datasets as well.