1 Introduction

Pattern classification has attracted considerable attention in machine learning, with a wide range of real-world applications such as face recognition, anomaly detection, image classification, cancer classification, and medical diagnosis, among many others. The main challenge of pattern classification is the capability of the trained classifier to classify unseen patterns correctly. This challenge is most often associated with data complexities existing within the underlying datasets [1], such as class ambiguity, data sparsity and dimensionality, class boundary complexity, and the class imbalance problem.

The class imbalance problem is a situation where a dataset is characterized by an uneven class distribution, i.e., the proportion of instances of one class is much smaller than that of the other classes. In the real world, despite the vast availability of data, the class of interest (minority or positive class) in binary classification problems usually has far fewer instances than the opposite class (majority or negative class).

One clear complication arising from the class imbalance problem is the degradation of traditional machine learning algorithms and the misleading nature of their accuracy scores. Consider a dataset with an imbalance ratio of 99:1, where the majority class accounts for 99% of the instances. In such a situation, a naive classifier that always predicts the majority class achieves an accuracy of 99% while never identifying a single minority instance. This limitation affects most traditional machine learning algorithms, such as decision trees (DT) and k-nearest neighbors (kNN), when faced with imbalanced datasets, since most of them optimize accuracy-based loss functions and hence produce models similar to the naive model described above [2, 3].
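The following minimal sketch illustrates this accuracy paradox on a synthetic 99:1 dataset; the dataset, parameter values, and reported figures are purely illustrative and not part of the experiments reported later in this paper.

```python
# Illustrative sketch of the accuracy paradox on a synthetic 99:1 dataset.
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, recall_score

X, y = make_classification(n_samples=10_000, weights=[0.99, 0.01],
                           flip_y=0, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=42)

# A "classifier" that always predicts the majority class.
naive = DummyClassifier(strategy="most_frequent").fit(X_tr, y_tr)
pred = naive.predict(X_te)

print("accuracy:", accuracy_score(y_te, pred))  # ~0.99, yet the model is useless
print("recall  :", recall_score(y_te, pred))    # 0.0 -- no minority instance detected
```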

Moreover, in most practical situations, misclassifying the class of interest carries a heavy cost. For example, in cancer detection, only a few patients with tumors may emerge out of hundreds of records, but failing to identify a malignant tumor (a false negative) poses a serious threat to the patient, who may miss out on treatment. It is therefore of little use to have a model with a very high accuracy score that fails to detect the minority class (i.e., has low sensitivity), which is the class of interest. It is still possible to achieve a low misclassification cost with an effective traditional algorithm, but the cost component has to be addressed during the classification process through cost-sensitive learning. Cost-sensitive learning, which aims at minimizing errors, takes into account the cost of prediction errors, and potentially other costs, during the training process [4]. It is closely related to the field of imbalanced learning, which is concerned with the classification of imbalanced datasets. As a result, various cost-sensitive techniques have been proposed. Unfortunately, defining the costs remains a big challenge, since misclassification costs are often unknown [5].

As a result, the problem of class imbalance has emerged as one of the key challenges in the machine learning community and has attracted much attention from researchers in academia and industry, to the extent that two dedicated workshops were held: the AAAI 2000 Workshop on Learning from Imbalanced Datasets [6] and the ICML 2003 Workshop on Learning from Imbalanced Datasets [7]. Researchers have since studied the problem of class imbalance extensively, and various methods have been proposed to overcome it. These methods fall into three categories: data level, algorithm level, and hybrid/ensemble level.

Data level methods, also referred to as external techniques, first preprocess the data in an attempt to rebalance the class distribution. They can further be broken down into sampling techniques and feature selection techniques. For sampling techniques, instances from the minority class are replicated to balance the class distribution through oversampling [8], or instances of the majority class are discarded to balance the class distribution through undersampling [9]. Researchers in [10]–[13] have proposed various oversampling approaches for balancing class distributions; similarly, researchers in [14]–[16] have presented several undersampling approaches for solving the class imbalance problem. Feature selection techniques, on the other hand, try to neutralize the effects of class imbalance by selecting the most influential features, producing discriminative knowledge that more easily separates the classes. Researchers in [17] explored feature selection for the categorization of text with imbalanced data. Mladenic et al. [18] utilized feature subsets to develop a Naive Bayes (NB) classifier on imbalanced text data. It is important to note that feature selection techniques for addressing the class imbalance problem have not yet been fully explored, leaving a research gap in this area, although they have been widely used in balanced classification problems to improve performance [19]–[22].
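A minimal sketch of the two sampling families described above, using the random over- and undersamplers from the imbalanced-learn library; the toy dataset and parameter values are illustrative only.

```python
# Data-level rebalancing: replicate minority instances (oversampling) or
# discard majority instances (undersampling).
from collections import Counter
from sklearn.datasets import make_classification
from imblearn.over_sampling import RandomOverSampler
from imblearn.under_sampling import RandomUnderSampler

X, y = make_classification(n_samples=1_000, weights=[0.9, 0.1], random_state=0)
print("original     :", Counter(y))

X_over, y_over = RandomOverSampler(random_state=0).fit_resample(X, y)
print("oversampled  :", Counter(y_over))

X_under, y_under = RandomUnderSampler(random_state=0).fit_resample(X, y)
print("undersampled :", Counter(y_under))
```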

Algorithm level (internal) techniques modify the learning algorithm to take into consideration the significance of the minority instances by biasing the algorithm towards the minority class [23, 24]. The most common algorithm level technique is the cost-sensitive method [25], where the algorithm is modified to integrate varying penalties for each of the selected groups of examples. A higher cost is assigned to underrepresented instances to boost their significance during the learning process. Various cost-sensitive classification techniques have been proposed in the literature [4, 26]. A common strategy of these techniques is to deliberately increase the weight of instances with higher misclassification costs during the boosting process. However, the challenge with cost-sensitive classification is that the misclassification costs are often unknown and difficult to estimate.
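One common realization of the algorithm level idea (not the method proposed in this paper) is to pass class weights to the learner, as in the hedged sketch below; the 1:10 cost ratio is an arbitrary illustration, since, as noted, true misclassification costs are usually unknown.

```python
# Cost-sensitive adjustment: errors on the minority class (label 1) are
# penalized more heavily via per-class weights.
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1_000, weights=[0.9, 0.1], random_state=0)

# Illustrative 1:10 cost ratio; real misclassification costs are rarely known.
tree = DecisionTreeClassifier(class_weight={0: 1, 1: 10}, random_state=0).fit(X, y)

# class_weight="balanced" instead derives weights inversely proportional to
# class frequencies when explicit costs are unavailable.
```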

Finally, the Hybrid/Ensemble level techniques utilize the advantage of both data level and algorithm level techniques by combining them to handle the problem of class imbalance efficiently. For example, Wang et al. proposed the hybridization of both sampling and cost-sensitive learning in handling imbalanced data [27]. Furthermore, Zhang et al. in their work [28] proposed a method for handling imbalanced data by integrating ensemble and classification techniques in enhancing the performance of the ensemble. In another study [29], Paweł et al. proposed a hybrid ensemble method that combines the advantages of ensemble learning, deep learning, and evolutionary computation, to effectively classify cardiac arrhythmias using ECG signal segments.

Ensemble machine learning techniques combine more than one single learner (base learner) using a given combination rule to produce enhanced predictive models [30]. Base learners can be any machine learning algorithm (e.g., Decision Tree, Naïve Bayes, Artificial Neural Network, Linear Regression, etc.). The ensemble topology can be as simple as an independent collection of learners combined via a majority vote or using some other advanced mechanisms such as those indicated in [31], where ensembles consist of General Regression Neural Network and Geometric transformation model.

Ensemble techniques are among the methods most commonly combined with data level or algorithm level approaches, in particular data resampling, for the classification of imbalanced data. This work is motivated by the growing use of ensemble techniques in imbalanced classification, owing to their ability to improve classification performance by leveraging multiple base learners trained on different bootstraps of the training data, as compared to traditional classification algorithms. The main premise of ensemble techniques is that by combining various classifiers, the error of one classifier will most likely be compensated by another classifier in the ensemble, so that the overall prediction of the ensemble model is more effective than that of a single classifier [32].

Several reasons have been discussed for why ensemble methods often perform better than single classifiers [33, 34]. One reason is that the training data may not provide adequate information for selecting a single best algorithm; for instance, a range of algorithms may perform equally well on the training data. Instead of choosing one of them, combining their predictions can be a better choice. Another important reason is that, in some cases, the hypothesis space being searched may not include the true target function. In such a situation, combining various hypotheses can effectively broaden the space and provide a better approximation of the unknown predictor function. For instance, the classification boundaries of ordinary decision trees are hyperplanes parallel to the coordinate axes, so if the target classification boundary is of a different type, a single decision tree may not give a smooth estimate [35]. A combination of decision trees, on the other hand, can approximate smooth boundaries of arbitrary shape.

In the literature, a good number of comprehensive surveys on ensemble techniques have been published [8, 36]–[44]. Bootstrap aggregating (Bagging) [45] and Boosting [46, 47] are considered the most widely used ensemble methods. For Bagging, Breiman introduced the idea of bootstrap aggregation in 1996 to build an ensemble in which base algorithms are trained on bootstrap samples randomly drawn, with replacement and under a uniform probability distribution, from the original dataset. Boosting was introduced by Schapire [46] as a technique for boosting the performance of weak learners.

Since then, many variations of bagging- and boosting-based ensembles have been proposed in the literature to address the problem of class imbalance. Galar et al. [36] proposed a taxonomy for ensemble-based methods that utilize data preprocessing techniques to solve the class imbalance problem, clustering them into three groups: Boosting-based, Bagging-based, and Hybrid ensembles, as depicted in Fig. 1.

Fig. 1 Ensemble Methods utilizing Data Preprocessing

Despite the reported improvements in classification performance, the majority of these ensemble techniques cannot completely withstand the common problem of noise in machine learning [48]. The term noise applies to all types of anomalies in the training data, from errors to unusual cases of the observed domain, which make it harder to interpret the data. Noise in the training data can be attribute noise (errors or unusual values), class noise (incorrect class labels), or a mixture of both [49]. Instances that complicate the learning process and degrade the performance of learning algorithms are therefore referred to as noisy instances [50].

Most bagging-based ensemble methods apply existing data sampling techniques within the bootstrap aggregating process. However, they remain prone to the effects of noise: since bootstrap instances are selected under a uniform probability distribution with replacement, some bootstraps may by chance contain a high number of noisy, hard-to-classify instances, which can eventually degrade the overall classification performance of the ensemble.

One of the key ideas in our proposed method is to make the sampling probability distribution not uniform but a function of instance hardness. We aim to influence the process of picking instances: unlike most bagging methods that use a uniform probability to select instances, our approach selects instances based on their level of hardness, i.e., an instance's selection probability is determined by its hardness level. This ensures that each bootstrap/bag contains a representation of easy, normal, and hard instances. We discuss instance hardness in detail in Sect. 2.
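A minimal sketch of this sampling idea is given below; the hardness values are placeholders for pre-computed instance hardness scores, and the small floor that keeps very easy instances selectable is an implementation assumption rather than part of the method described later.

```python
# Bootstrap sampling with selection probabilities tied to instance hardness
# (IH) instead of a uniform distribution.
import numpy as np

rng = np.random.default_rng(0)
hardness = rng.random(100)                # placeholder IH values in [0, 1]

weights = hardness + 1e-3                 # small floor: IH = 0 instances stay selectable
probs = weights / weights.sum()           # selection probability as a function of hardness
bootstrap_idx = rng.choice(len(hardness), size=len(hardness), replace=True, p=probs)
```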

Furthermore, numerous studies have pointed out defects of existing ensemble methods and their underlying data sampling techniques for handling class imbalance problems [13, 51]. Most of the proposed solutions alter the original dataset, either by creating new data through oversampling, which may increase the possibility of overfitting, or by eliminating data from the majority class through undersampling, which may discard potentially useful data that could be important during the learning process. As discussed earlier, there are many widespread methods for building diverse ensemble classifiers, such as Bagging, AdaBoost, Random Forests, and Random Subspaces [52]. While each of these methods can be applied to datasets that have undergone sampling, this is not ideal, as it disregards the power of combining the ensemble generation method and sampling into a more structured approach. As a result, several ensemble techniques have been combined with sampling techniques to create ensemble methods that are more appropriate for handling the class imbalance problem, such as UnderBagging [53], OverBagging [54], SMOTEBagging [54], Balanced Bagging [55], and RUSBoost [12], among others. It is imperative to note that since most of these ensemble methods are based on sampling techniques that alter the original class distribution, they may inherit the defects of sampling-based methods discussed earlier [56].

In our method, we propose to generate balanced bags by using the split Balancing (sBal) technique and instance hardness (IH) as a weighting mechanism during the sampling process. Each of the generated bags will contain all the minority instances, and a mixture of majority instances with varying degrees of hardness (easy, normal, and hard). This will ensure that base learners are trained on balanced bags, containing diverse data segments with different levels of hardness to learn various patterns over a different portion of the data.

We hypothesize that generating several balanced bags containing instances with varying degrees of hardness will induce a set of base algorithms that learn patterns from data points representing the overall difficulty of dividing the input space into the different classes. We expect an ensemble constructed with such base algorithms to have better overall classification performance than an ensemble whose base algorithms are trained on bags, balanced or imbalanced, that are sampled uniformly, regardless of the importance of individual instances in identifying the classes.

We carry out an extensive empirical study of our proposed method and compare its performance with existing, widely used, state-of-the-art ensemble methods. We have structured our experimental framework in a way that enables us to draw justified conclusions. We used three sets of datasets for our experiments: the first is composed of 30 synthetic imbalanced datasets with controlled levels of noise obtained from the KEEL repository [57], the second of 29 real-world balanced datasets, and the final set of 41 real-world imbalanced datasets obtained from the KEEL and UCI repositories. We further conducted the nonparametric Friedman test and Bergmann's post hoc statistical test, at a significance level of p < 0.05, to validate our findings.

The results reveal that training base algorithms on balanced bags with varying degrees of hardness can bring substantial improvement in the classification performance of an ensemble. The findings demonstrate that the proposed method performed significantly better than the regular ensemble methods (Bagging, Wagging, Random Forest, and AdaBoost) on both synthetic and real-world balanced and imbalanced datasets, except for Random Forest, whose performance was comparable when evaluated on real-world balanced datasets. Furthermore, the proposed method performed better than ensemble methods specialized for class imbalance problems (Balanced Bagging, Balanced Random Forest, RUSBoost, and Easy Ensemble) on the majority of both balanced and imbalanced datasets.

In summary, the main contribution of this paper is twofold. First, we propose an ensemble method based on a data-level sampling approach to balance ensemble bags (sBal_IH), which takes data complexity into account in order to find good representative points within the bags, to improve prediction accuracy in imbalanced datasets.

Secondly, we conducted extensive experiments on 100 datasets and evaluated the proposed method in comparison with state-of-the-art methods, both standard ensembles and those specialized for handling class imbalance problems. The corresponding results, validated by statistical significance tests, demonstrate that our innovative method of split-balancing based on data complexity, proxied by instance hardness (IH), significantly outperformed most of the other compared methods as measured by Area Under the Curve (AUC) performance. We believe that this study will significantly contribute to the efforts of the machine learning community on addressing the challenge of data imbalance.

The remainder of this paper is organized as follows. We present our proposed method in Sect. 2, followed by a detailed experimental design in Sect. 3. In Sect. 4, we report and discuss the experimental and statistical results in detail. We conclude and propose future work in Sect. 5.

2 Proposed method

We base our proposed method on the idea of making balanced bags of instances, where each bag contains a mixture of varying degrees of data complexity. To achieve this, we utilize Instance Hardness as a measure of data complexity.

2.1 Instance hardness (IH)

It is well known in the machine learning community that the performance of most classifiers depends on both their parameters and the underlying training dataset [14, 59]. However, most researchers focus mainly on model parameter tuning as a way of achieving better model performance, while ignoring an understanding of the data being modeled by the classifier. As a result, it may be difficult to understand which instances are being misclassified and why, even assuming that the right parameters and the right evaluation metric are being used. Instance Hardness (IH) is a measure that specifies how difficult it is to classify a given instance in a dataset correctly [60]. This implies that each instance in a dataset has a property that indicates its probability of being classified incorrectly, regardless of the choice of classifier. For example, we anticipate high IH among outliers and mislabeled instances, since a classifier will most likely have to overfit in order to classify them correctly. IH examines classification problems at the instance level, whereas the majority of machine learning studies focus on the dataset level and are mostly concerned with maximizing \(p\left( f|s \right)\), where \(f: X \to Y\) is a function that maps input feature vectors in \(X\) to their corresponding labels in \(Y\), and \(s = \left\{ \left( x_{i}, y_{i} \right): x_{i} \in X \wedge y_{i} \in Y \right\}\) is the training set, on the assumption that the pairs in \(s\) are drawn independently and identically distributed (i.i.d.).

M. Smith et al. in their study [60] presented the notion of IH through the decomposition of \(p\left( f|s \right)\) using Bayes' theorem:

$$p\left( f|s \right) = \frac{p\left( s|f \right)\, p\left( f \right)}{p\left( s \right)}$$
$$= \frac{\prod_{i = 1}^{\left| s \right|} p\left( x_{i}, y_{i} |f \right)\, p\left( f \right)}{p\left( s \right)}$$
$$= \frac{\prod_{i = 1}^{\left| s \right|} p\left( y_{i} |x_{i}, f \right)\, p\left( x_{i} |f \right)\, p\left( f \right)}{p\left( s \right)}$$
(1)

Furthermore, for a training instance \(\left( x_{i}, y_{i} \right)\), \(p(y_{i} |x_{i}, f)\) measures the probability that \(f\) assigns the label \(y_{i}\) to the input feature vector \(x_{i}\). M. Smith et al. further state that the larger \(p(y_{i} |x_{i}, f)\) is, the more likely \(f\) will assign the right label to \(x_{i}\); conversely, the smaller \(p(y_{i} |x_{i}, f)\) is, the less likely \(f\) is to produce the right label for \(x_{i}\). Therefore, they define IH with respect to \(f\) as

$$IH_{f}\left( x_{i}, y_{i} \right) = 1 - p(y_{i} |x_{i}, f)$$

Under normal practice, \(f\) is induced by an algorithm \(c\) being trained on the training set \(s\). Hence, the hardness of an instance is reliant on the instances in the training set and the underlying algorithm used to produce \(f\).

In the literature, M. Smith et al. [60] proposed several IH measures, such as k-Disagreeing Neighbors (kDN), Disjunct Size, Disjunct Class Percentage (DCP), Class Likelihood (CL), and Class Balance (CB), among others. These measures capture several characteristics of the hardness level of a specific instance; they show why instances are misclassified, giving insight into why specific instances are hard to classify and how best to detect them. This has laid a foundation for research on dataset complexity, and many machine learning classification studies have since been proposed based on instance hardness.

Previous studies in the literature have utilized instance hardness in different ways to improve classification performance, such as noise and outlier filtering [61, 62] and boosting through weight adjustments. A. Kabir et al. in their study [63] proposed a mixed bagging technique for non-class-imbalance problems that incorporates IH in the bootstrap aggregating process. Furthermore, researchers in [64] incorporated hardness ordering into the learning process through filtering and boosting, and significantly improved generalization accuracy.

Our proposed method utilizes the concept of IH in the classification process in a fundamentally different way. While bagging-based methods have been reported to offer good classification performance, their bootstrap sampling process is still prone to the effects of outliers or noise: since instances are randomly sampled with uniform probability, some bootstraps may contain a high percentage of outliers or noisy instances, which can eventually degrade classification performance. In our case, we propose using IH as a weighting mechanism during the sampling process and incorporate IH information into the ensemble bootstrapping process. We ensure that each bootstrap has a representation of instances with varying degrees of hardness (IH estimation and implementation are discussed in detail in Sect. 3), which allows base algorithms to learn different patterns of the training data, so that an ensemble constructed with such base algorithms should have better overall classification performance.

2.2 Split balancing (sBal)

Most popular machine learning binary classification algorithms are designed to perform well on balanced datasets [65], and under such circumstances it is relatively easy to choose the right algorithm and the right performance evaluation metric to truly represent your optimal model [66, 67]. Challenges start to emerge when faced with an imbalanced dataset, since the majority of these algorithms tend to perform poorly and the cost of misclassifying the minority class is usually much higher than the cost of misclassifying the majority class [68]. Thus, several techniques that try to rebalance imbalanced datasets have been proposed and given much attention. However, numerous studies have pointed out the defects of the proposed solutions for handling the class imbalance problem [51]: they alter the original class distribution of the dataset, either by creating new data through oversampling, which may lead to overfitting [15], or by discarding potentially useful data from the majority class through undersampling [69]. Moreover, ensemble techniques have been combined with sampling techniques to create ensemble methods that are more appropriate for handling the class imbalance problem [8]. There is a variety of widespread approaches for building diverse ensembles, such as Random Forests [70], Random Subspaces [71], Bagging [45], and AdaBoost [72], among others. Whereas each of these approaches can be applied to datasets that have gone through sampling, this is not optimal, as it disregards the power of combining ensemble generation and sampling into a better-organized approach. Consequently, several ensemble methods have been combined with sampling techniques to create ensemble methods suitable for dealing with the class imbalance problem. Chawla et al. in their study [10] proposed SMOTEBoost, a novel approach for addressing the class imbalance problem based on the Synthetic Minority Oversampling TEchnique (SMOTE) [73] and boosting. In another study, Seiffert et al. proposed RUSBoost [12], a hybrid ensemble-based method that combines random undersampling (RUS) with boosting.

However, most of these ensemble-based methods rely on data sampling techniques and therefore alter the class distribution of the original dataset, either by eliminating majority class samples (undersampling) or by increasing minority class samples (oversampling). Furthermore, plain boosting and bagging ensembles may still suffer from class imbalance, because the class distribution in each sampled subset at a given iteration is the same as that of the original dataset. It is therefore prudent to convert the imbalanced dataset into multiple balanced datasets (bags) without creating new extra data or discarding potentially useful original data.

Our proposed ensemble-based method tackles the possible shortfalls of the conventional methods stated above by first converting a class imbalance problem into several balanced problems that no longer suffer from class imbalance, without creating new extra data or discarding potentially useful original data, which makes it distinct from the conventional methods for handling the class imbalance problem.

Our proposed method is based on a split balancing technique [74], dubbed sBal, where we generate balanced bags by randomly splitting the instances of the majority class into multiple bags, each of them containing all the minority instances and a sample of the majority instances. However, we introduce the additional constraint that the sampled majority instances should have varying degrees of hardness (easy, normal, and hard). These majority instances are sampled using IH as a weighting mechanism. This ensures that base learners are trained on data points with different levels of hardness, which we believe better represent the input space of the dataset. We then combine the binary classifiers into an ensemble to classify new, unseen data. The overall method is shown in Fig. 2.

Fig. 2 Proposed Method—sBal_IH
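The sketch below illustrates the bag-generation step just described. It is a simplified reading of the method, not the reference implementation: the bag size (equal to the number of minority instances), sampling without replacement, and the small smoothing constant are assumptions made for illustration.

```python
# Split balancing with IH weighting: every bag keeps all minority instances
# plus an IH-weighted sample of majority instances.
import numpy as np

def make_sbal_ih_bags(X, y, hardness, n_bags=10, minority_label=1, seed=0):
    rng = np.random.default_rng(seed)
    min_idx = np.flatnonzero(y == minority_label)
    maj_idx = np.flatnonzero(y != minority_label)

    # Majority selection weights derived from hardness; the small floor keeps
    # very easy (IH ~ 0) instances selectable.
    w = hardness[maj_idx] + 1e-3
    p = w / w.sum()

    bags = []
    for _ in range(n_bags):
        sampled_maj = rng.choice(maj_idx, size=len(min_idx), replace=False, p=p)
        idx = np.concatenate([min_idx, sampled_maj])
        bags.append((X[idx], y[idx]))
    return bags
```

A base classifier would then be trained on each balanced bag and the resulting classifiers combined, e.g., by majority voting, as in standard bagging.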

3 Experiment design

In this section, we present the details of the experiments used to evaluate the effectiveness of our proposed sBal_IH ensemble method and compare its performance against existing regular ensemble methods and state-of-the-art ensemble methods specialized for solving class imbalance problems.

The proposed method is first compared with regular ensemble methods that include: Bagging, AdaBoost, Random Forest (RF), and Wagging [75], on both balanced and imbalanced datasets to assess the performance of sBal_IH in situations of balanced and imbalanced problems. Thereafter, we compare sBal_IH against ensemble methods specialized for class imbalanced problems, namely Balanced Bagging (BB), Balanced Random Forest (BRF), Easy Ensemble (EE), and Random Undersampling Boosting (RUSBoost), on both synthetic and real-life public datasets, balanced and imbalanced. In Sect. 3.1, we briefly introduce all the ensemble methods studied in this paper alongside our proposed method.

We organized our experiments and their analysis into two stages. In the first stage, we conduct experiments on different groups of datasets to facilitate the analysis of results. In one group, experiments are performed on synthetic imbalanced datasets (Table 1) with controlled levels of noise (disturbance ratio) to examine the influence of noise on the performance of sBal_IH and the existing ensemble methods in terms of AUC. To further examine the performance of our proposed method in real-world situations, we performed experiments on another two groups of 29 balanced and 41 imbalanced real-world datasets. For evaluating all the methods, we used fivefold cross-validation repeated 5 times with different random seeds. For fairness and uniformity, we used an ensemble size of 10 (n_estimators) and a Decision Tree with its default parameters as the base estimator for all the ensemble methods, apart from Easy Ensemble, which uses AdaBoost as its base estimator. We included the Easy Ensemble method in our experiments to compare performance against a hybrid (ensemble of ensembles) method.
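The following sketch illustrates this evaluation protocol (repeated fivefold cross-validation, AUC scoring, ensemble size 10, decision-tree base learner) on a toy dataset, with a plain BaggingClassifier standing in for the compared methods; parameter names follow the scikit-learn 0.21 API used in the experiments.

```python
# Evaluation protocol sketch: 5-fold CV repeated 5 times with different seeds,
# scored by AUC; ensemble of 10 decision trees.
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score

X, y = make_classification(n_samples=500, weights=[0.9, 0.1], random_state=0)

model = BaggingClassifier(base_estimator=DecisionTreeClassifier(),
                          n_estimators=10, random_state=0)
cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=5, random_state=0)
scores = cross_val_score(model, X, y, scoring="roc_auc", cv=cv)
print(scores.mean(), scores.std())
```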

In the second stage, we performed a series of statistical comparisons of sBal_IH against existing ensemble techniques. The goal of this analysis is to validate whether there is a significant difference in the performance in terms of AUC, which will enable us to draw more confident conclusions.

All the experiments were carried out in Python 3.7 using scikit-learn library [76] version 0.21.3, and imbalanced-learn library [77] version 0.5. For statistical significance tests, we used KEEL’s nonparametric statistical analysis software [57], version 3.0. The computer used for conducting the experiments was running Windows 10 64 bit, with an Intel Xeon processor (2.5 GHz) and 12 GB of RAM.

3.1 Featured ensemble methods

We briefly introduce the 8 ensemble methods evaluated in this study. Four of them are regular ensemble methods, namely Bagging, AdaBoost, Random Forest (RF), and Wagging, and the other four are ensemble methods specialized for class imbalance problems, namely Balanced Bagging (BB), Balanced Random Forest (BRF), Easy Ensemble (EE), and Random Undersampling Boosting (RUSBoost). The motivation for selecting these methods is that they are widely used, and the majority of studies on ensemble methods and class imbalance problems include one or more of them. They are all discussed comprehensively in the literature, such as in the studies [36] and [39] mentioned above, and in [46] and [75], referred to later.

Bagging, also referred to as Bootstrap Aggregating, was first introduced by Breiman [45], and today is one of the most intuitive and possibly the simplest ensemble-based methods. It involves training different classifiers with bootstrapped copies of the data points randomly drawn with replacement from the original training dataset. The decisions of individual classifiers are then combined through majority voting.

Wagging (Weights Aggregation) is a variant of Bagging proposed by Bauer et al. [75]; it requires a base algorithm that can utilize training instances with different weights. Rather than using bootstrap samples to establish successive training sets, it randomly assigns weights to the instances available in each training set. It uses Gaussian noise to adjust the instance weights, which may reduce the weights of some instances to zero, effectively eliminating them from the training set.

Balanced Bagging [9], also referred to as Blagging, is an extension of bagging for solving class imbalance problems. The idea is to continuously undersample instances from the majority class in each of the bootstraps in order to get balanced bootstraps, on which individual decision trees are trained. This leads to an emphasis on the minority class and more balanced decisions.

Random Forest [70] consists of a large number of individual decision trees that work as an ensemble. The random trees are generated using bootstrap samples of the training data and a subset of randomly selected features during the tree induction process.

Balanced Random Forest (BRF) [78] is a variant of RF that was designed to deal with the problem of class imbalance. It utilizes the undersampling technique in each iteration of RF by drawing bootstrap instances from the minority class, and then drawing the same number of instances with replacement from the majority class, and then induces classification trees (CART algorithm) from the data in each iteration.

AdaBoost algorithm was developed by Schapire and Freund [46, 72]; it uses the whole training set to train multiple classifiers in a serial format. In each round, AdaBoost gives more attention to misclassified instances, with an aim to correctly classify them in subsequent iterations. This is achieved by maintaining a set of weights of the training instances, which are initially equal, and get updated according to correct or incorrect classification.

RUSBoost [79], a variation of the SMOTEBoost algorithm [10], works in a similar way to AdaBoost, but discards instances from the majority class by Random Undersampling (RUS) in each iteration. A previous study [80] showed that RUS often outperforms SMOTE, and RUSBoost is thus preferred as an alternative to SMOTEBoost due to its simplicity and lower computational cost.

Easy Ensemble (EE) was proposed by Liu et al. [14]. It generates balanced subsets of the training set by combining all minority instances with samples drawn from the majority instances. Boosted decision trees, specifically the AdaBoost algorithm, are then induced on each balanced subset.
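For reference, the four specialized methods above are available in the imbalanced-learn library used in our experiments; the sketch below simply instantiates them with the settings described in Sect. 3 (ensemble size 10), and the class names reflect recent imblearn releases.

```python
# Specialized ensemble methods for class imbalance, as provided by
# imbalanced-learn (imblearn.ensemble).
from imblearn.ensemble import (BalancedBaggingClassifier,
                               BalancedRandomForestClassifier,
                               EasyEnsembleClassifier,
                               RUSBoostClassifier)

specialized = {
    "BB":       BalancedBaggingClassifier(n_estimators=10, random_state=0),
    "BRF":      BalancedRandomForestClassifier(n_estimators=10, random_state=0),
    "EE":       EasyEnsembleClassifier(n_estimators=10, random_state=0),  # AdaBoost inside
    "RUSBoost": RUSBoostClassifier(n_estimators=10, random_state=0),
}
```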

3.2 Experiment datasets

As mentioned earlier (Sect. 3), we use three groups of datasets in our experiments, all prepared with binary classification tasks in mind. The first group, shown in Table 1, was obtained from the KEEL repository [57] and is composed of 30 synthetic imbalanced datasets with controlled levels of noise, i.e., a disturbance ratio (the last number in the dataset name) applied to create noisy examples in the dataset. This set was used in [81] to investigate the effect of noise and borderline instances from the minority class on classifier performance. We added this group to evaluate the effectiveness of our proposed method at handling imbalanced datasets in the presence of noise, compared to other methods. The second group is composed of 29 real-world balanced datasets, as listed in Table 2. We included this group to examine fairly whether our proposed method shows a distinctive effect when the dataset is not actually imbalanced. The final group is composed of 41 real-world imbalanced datasets, as shown in Table 3. Both groups were obtained from the KEEL [57] and UCI repositories. Tables 1, 2, and 3 summarize all the datasets used, highlighting some of their properties: Dataset id (ID), Dataset name (Dataset), total number of instances (#Inst), percentage of majority instances (%Maj), percentage of minority instances (%Min), and Imbalance Ratio (IR). In this study, we define IR as the ratio of the percentage of majority to minority instances, i.e., \(IR = \frac{\% Maj}{\% Min}\).

Table 1 Synthetic Imbalanced Datasets with Controlled Noise
Table 2 Summary of Balanced Real-world Datasets Sorted by IR

Table 3 Summary of Imbalanced Real-world Datasets Sorted by IR

3.3 Instance hardness estimation

As explained in Sect. 2, IH is the probability that an instance will be classified incorrectly by a classifier built from the other instances of the same dataset. M. Smith et al. in their study [60] described various metrics for estimating IH; they empirically analyzed IH on over 190,000 instances from multiple datasets, using different learning algorithms, and established that there is always a considerable number of instances that are hard to classify correctly.

The chosen metric used in our experiments estimates the hardness of a data instance as a percentage of a pre-selected set of classifiers that misclassify that instance. We considered the following algorithms for instance hardness estimation: Logistic Regression (LR), Decision Tree (DT), k-Nearest Neighbor (kNN), and Gaussian Naive Bayes (NB). This collection is a cocktail of both linear (LR) and nonlinear algorithms (kNN, DT, NB), and both parametric (NB, LR) and nonparametric algorithms (kNN, DT), as well as instance-based (kNN) and model-based (DT, NB, LR) ones. This enables a diverse and probably more reliable estimate of instance hardness.

We estimate Instance Hardness (IH) of each instance (data point) in a dataset as the percentage of incorrect classifications for that instance made by a pre-selected set of classifiers \(C = \left\{ {c_{1} , c_{2} ,..,c_{m} } \right\}\) which are built from other instances in the dataset.

Figure a summarizes the steps that can be taken to calculate IH for every instance in a given dataset. The steps are laid out in a Leave-One-Out (LOO) style to assess the hardness of each instance separately, using the remaining instances in the training set. Even though we use LOO in the layout of the algorithm for its simplicity, IH can be calculated with a less compute-intensive approach, such as a less extreme form of cross-validation. In our implementation, we used a variant based on fivefold cross-validation in order to save on computation time.
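The sketch below follows that description: the hardness of each instance is the fraction of a pre-selected set of learners (LR, DT, kNN, NB) that misclassify it, with predictions made out-of-fold under fivefold cross-validation so that no instance is scored by a model trained on it. Hyperparameters and other details are illustrative assumptions, not the exact settings of our implementation.

```python
# IH estimation: fraction of classifiers that misclassify each instance,
# using out-of-fold predictions from 5-fold cross-validation.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import cross_val_predict

def estimate_instance_hardness(X, y, n_folds=5, seed=0):
    learners = [LogisticRegression(max_iter=1000),
                DecisionTreeClassifier(random_state=seed),
                KNeighborsClassifier(),
                GaussianNB()]
    misclassified = np.zeros(len(y), dtype=float)
    for clf in learners:
        oof_pred = cross_val_predict(clf, X, y, cv=n_folds)  # out-of-fold labels
        misclassified += (oof_pred != y)
    return misclassified / len(learners)  # IH values in [0, 1]
```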

The hardness of an instance is represented by a real value between 0 and 1, where an instance with a hardness value close to 0 is more likely to be correctly classified, whereas the opposite is true for that instance with a hardness value close to 1. This implies that an instance with an estimated hardness value higher than a predefined threshold \(t_{h}\) is categorized as Hard (H), and with hardness value below the threshold \(t_{e}\) is categorized as Easy (E), or else as Normal (N) as shown in Fig. 3. In our study, we set the default IH thresholds for easy and hard instances as \(t_{e} = 0.33\) and \(t_{h} = 0.66\), respectively, to maintain equal ranges of hardness.

Fig. 3 Instance Hardness Threshold Ratio
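A small sketch of the thresholding just described (default \(t_{e} = 0.33\) and \(t_{h} = 0.66\)); the function name is ours and purely illustrative.

```python
# Map IH values to the Easy / Normal / Hard categories of Fig. 3.
import numpy as np

def categorize_hardness(ih, t_e=0.33, t_h=0.66):
    # IH > t_h -> Hard, IH < t_e -> Easy, otherwise Normal
    return np.where(ih > t_h, "H", np.where(ih < t_e, "E", "N"))
```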

Even though we use the term “training dataset” in IH estimation above, we emphasize that this does not refer to the original imbalanced datasets in our study, but rather to the training subset in a cross-validation fold. In other words, for rigorous experiment design, IH estimation is performed multiple times, once in every fold, for each of the original imbalanced datasets. This is done to avoid data leakage from the test sets to the training phase as shown in Fig. 2.

3.4 Evaluation metric

An obvious challenge in evaluating classifiers on imbalanced datasets is choosing the right performance metric. Accuracy is a well-known metric frequently used for classification problems [82]; however, as many studies have reported, it is not an appropriate metric in situations of imbalanced datasets [67, 83, 84]. Instead of accuracy, researchers have adopted other evaluation metrics for imbalanced problems, such as F-Measure, Recall, Precision, G-Mean, Area Under the Curve (AUC), and Matthews Correlation Coefficient (MCC). Regarding precision and recall, it has been reported in [85] that precision is sensitive to data distribution while recall is not. With many metrics in place, researchers face the challenge of choosing the right metric for imbalanced problems; most researchers in the machine learning community prefer AUC [2, 3] and MCC [86] over other metrics. Researchers in [87] empirically compared the performance of AUC against MCC on a series of real-world imbalanced datasets and established that AUC was statistically consistent and more discriminating than MCC, suggesting that AUC is a better measure than MCC for evaluating binary classification with imbalanced datasets. In that regard, we chose to use AUC as the evaluation metric throughout our experiments.
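As a reminder of how the metric is computed in practice, the toy example below uses scikit-learn's roc_auc_score on the classifier's scores for the positive (minority) class; the numbers are arbitrary.

```python
# AUC is computed from scores for the positive (minority) class,
# not from hard 0/1 predictions.
from sklearn.metrics import roc_auc_score

y_true = [0, 0, 0, 0, 1, 1]
y_score = [0.1, 0.2, 0.6, 0.3, 0.8, 0.4]
print(roc_auc_score(y_true, y_score))  # 0.875: 7 of the 8 positive-negative pairs are ranked correctly
```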

4 Results and discussion

In this section, we present and discuss the outcomes of our experiments, resulting from fivefold cross-validation runs repeated with 5 different random seeds (as explained in Sect. 3). We tested our proposed method against regular ensemble methods, namely Bagging (Bag), Wagging (Wag), Random Forest (RF), and AdaBoost (AdaB), and against ensemble methods specialized for imbalanced problems, namely Easy Ensemble (EE), Balanced Bagging (BB), Balanced Random Forest (BRF), and RUSBoost (RUSB). These methods were evaluated on 100 publicly available datasets: 30 synthetic imbalanced datasets (Table 1), 29 balanced real-world datasets (Table 2), and 41 imbalanced real-world datasets (Table 3).

In all the tables, results highlighted in bold indicate a higher AUC score than the others in the same row for a given dataset. In case of a tie, we count it as a win for each of the tied methods.

4.1 Results for synthetic datasets

In this subsection, we present and discuss the performance of our proposed method against regular and specialized ensemble methods for class imbalance problems when faced with imbalanced datasets perturbed with different levels of noise. This group of datasets helps us evaluate and understand the effectiveness of our proposed method, compared to the other methods, at handling noise in datasets as a proxy for underlying data complexities.

4.1.1 (a) sBal_IH against regular ensemble methods on synthetic datasets

We compared the proposed sBal_IH ensemble method with the regular ensemble methods in terms of AUC, evaluated on the 30 synthetic imbalanced datasets. Table 4 shows the mean AUC values of 5 repeated trials. The results highlighted in bold demonstrate the better performance of sBal_IH against all the other regular ensemble methods on 28 of the 30 datasets. AdaBoost, despite its efforts to add weight to difficult-to-classify instances, ranked second, performing better on only 2 of the 30 datasets.

We can also observe that as the noise level increases (disturbance ratio values increase every 5 rows in Table 4, from 0, 30, 50, 60, up to 70), the performance of all ensemble methods deteriorates; however, our proposed method is affected the least and still maintains higher performance. We believe the reason for this is the ability of sBal_IH to still capture most of the underlying complexities existing within the dataset, due to the use of learners trained on balanced bags that retain varying levels of hardness. This implies that our proposed method handles noise and class imbalance more effectively than the regular ensemble methods.

Table 4 AUC Score for sBal_IH Against Regular Ensemble Methods on 30 Synthetic Imbalance Datasets

4.1.2 (b) sBal_IH against specialized ensemble methods for class imbalance problems on synthetic datasets

Since the regular ensemble methods are not specially designed to handle class imbalance problems, we also evaluate the effectiveness of our proposed method in comparison with the methods specialized for class imbalance problems mentioned in Sect. 3. In Table 5, we present the performance of sBal_IH against these ensemble methods in terms of AUC on the 30 synthetic imbalanced datasets.

Table 5 AUC Score for sBal_IH against ensemble methods specialized for class imbalanced problems on 30 synthetic imbalanced datasets

The results highlighted in bold show that, out of the 30 datasets, sBal_IH outperformed the rest of the methods on 16 datasets, followed by BRF with only 6 wins. The results indicate that sBal_IH faced fair competition from the state-of-the-art ensemble methods specialized for handling imbalanced datasets.

This is because most of these specialized methods try to take care of the class imbalance problem and the underlying complexities existing within the dataset. For instance, Balanced Random Forest (BRF) and Balanced Bagging (BB) are similar to our proposed method in trying to balance bootstraps during the aggregating process. However, this happens in an interestingly different way; for BRF, they draw bootstrap instances from the minority class and then randomly draw the same number of instances, with replacement, from the majority class [78]. As for BB, in each bootstrap, they draw the same number of minority and majority instances based on the undersampling technique using a negative binomial distribution [55].

Nevertheless, even though most of the specialized methods try to balance the class distributions, there is still a likelihood of drawing more complex or noisy instances into a bag, or across many bags, since their sampling process is based on a uniform probability distribution. This appears to have a negative effect on the overall classification performance of the ensemble, compared to what sBal_IH was able to achieve when the balanced bags were composed of a mixture of instances with varying degrees of hardness.

4.2 Results for real-world datasets

We now compare the performance of sBal_IH against regular and specialized ensemble methods on real-world, multi-domain datasets to further understand its behavior in real-world settings.

4.2.1 (a) sBal_IH against regular ensemble methods on real-world datasets

Here, we compare sBal_IH with the regular ensemble methods in terms of AUC on the 29 balanced and 41 imbalanced real-world datasets. Tables 6 and 7 present the AUC performance scores on these dataset groups, respectively. In the experiments, we considered a group of balanced datasets in order to examine fairly whether our proposed method shows a distinctive effect when the datasets are actually balanced. As highlighted in bold in Table 6, sBal_IH outperformed the traditional ensemble methods on only 15 of the 29 balanced datasets, followed by Random Forest with 13 wins, with the remaining methods far behind. The results show comparable performance between sBal_IH and the other regular ensemble methods.

Table 6 AUC Score for sBal_IH Against Regular Ensemble Methods on 29 Balanced Datasets

On the other hand, with the 41 real-world imbalanced datasets, the highlighted results in Table 7 show a similar pattern to that of the synthetic imbalanced datasets (subsection 4.1.a), where sBal_IH exhibits good performance in 36 out of the 41 real-world imbalanced datasets. This is an endorsement of the effectiveness of our proposed method in addressing the class imbalance problem as compared to the regular ensemble methods.

Table 7 AUC Score for sBal_IH Against Regular Ensemble Methods on 41 Real-world Imbalanced Datasets

The observations above may indicate that when the data is already balanced, sBal_IH still benefits from the varying hardness in the generated bags, but that this benefit is magnified when the dataset is also imbalanced. This can be inferred from our finding that sBal_IH outperforms the regular ensemble methods when the datasets are imbalanced (synthetic and real-world), and it is consistent with what other researchers have found [88], namely that when the data is imbalanced, the negative effect of noise is magnified.

4.2.2 (b) sBal_IH against specialized ensemble methods for class imbalance problems on real-world datasets

Our proposed method was also compared against state-of-the-art ensemble methods specialized for handling imbalanced data on a group of 41 real-world multi-domain imbalanced datasets. The outcomes from this experiment, if good enough, will position the sBal_IH method among the best alternative ensemble methods for addressing the problem of class imbalance.

In Table 8, we observe a better performance of the proposed method (sBal_IH) across 18 out of the 41 imbalanced real-world datasets. This is followed by Easy Ensemble (EE) with 11 wins and Balanced Random Forest coming third with only 9 wins. RUSBoost and Balanced Bagging were the least performing in terms of AUC.

Table 8 AUC Score for sBal_IH Against Ensemble Methods Specialized for Class Imbalanced Problems on 41 Imbalanced Datasets

We note that even though there is reasonable competition in performance between sBal_IH and the EE method, sBal_IH still outperformed EE, despite EE being a hybrid ensemble (an ensemble of ensembles) that uses AdaBoost as its default base algorithm and therefore starts with an advantage over sBal_IH.

Lastly, we also assessed the performance of sBal_IH against the specialized ensemble methods in situations where the data is actually balanced, i.e., on the 29 balanced real-world datasets.

The results in Table 9 show a remarkable performance of sBal_IH against the other methods, with 16 wins out of 29 balanced datasets, followed by Easy Ensemble with 6 wins, and RUSBoost registering the least performance. The findings suggest that sBal_IH significantly outperforms specialized ensemble methods when there is no class imbalance problem.

Table 9 AUC Score for sBal_IH Against Specialized Methods for Class Imbalanced Problems on 29 Balanced Datasets

Throughout the experiments, we observe consistently good performance of the proposed method across the majority of the datasets. The intuitive explanation is that when a classifier is trained on a portion of instances with varying degrees of hardness, it has a better chance to learn the underlying pattern needed to classify the data while maintaining two properties: it learns from hard instances as well as easy ones, and at the same time it does not overfit to either. This produces a consistent and noise-tolerant (robust) model that can better deal with both balanced and imbalanced problems.

In order to draw more reliable conclusions, in the next subsection, we discuss Friedman’s nonparametric statistical test used to ascertain if there exists any significant difference in performance between sBal_IH and the other ensemble methods.

4.3 Statistical tests

Throughout our experiments, we observed a consistent trend in performance advantage that cuts across most of the studied methods. However, in order to draw reliable conclusions, it is important to determine whether there exists a statistically significant difference in classification performance as measured by AUC. For that purpose, we utilized the standard methodology proposed by Demšar [89] for testing statistical significance among multiple methods across various datasets. We carried out a nonparametric (distribution-free) statistical test on all mean AUC results. We chose Friedman's test [90] because we have no knowledge of the distribution of the values in our analyzed data.

The Friedman test is the nonparametric equivalent of the repeated-measures ANOVA. It ranks the algorithms for each dataset separately, assigning rank 1 to the best performing algorithm, rank 2 to the second best, and so on; in case of a tie, it assigns the average rank to the affected algorithms. Suppose \(r_{i}^{j}\) is the rank of the \(j\)-th algorithm (of \(k\) algorithms) on the \(i\)-th dataset (of \(N\) datasets). The Friedman test calculates the average rank of algorithm \(j\) as \(R_{j} = \frac{1}{N}\sum_{i} r_{i}^{j}\).

The Friedman statistic, \(\chi_{F}^{2}\), distributed with \(k - 1\) degrees of freedom, is calculated as shown in Eq. 2:

$$\chi_{F}^{2} = \frac{12N}{k\left( k + 1 \right)}\left[ \sum_{j} R_{j}^{2} - \frac{k\left( k + 1 \right)^{2}}{4} \right]$$
(2)
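For concreteness, the short sketch below evaluates Eq. 2 on a toy matrix of mean AUC scores (rows are datasets, columns are methods) and cross-checks the result against scipy's implementation, which additionally applies a tie correction; the numbers are arbitrary.

```python
# Friedman statistic of Eq. 2 computed from a (datasets x methods) AUC matrix.
import numpy as np
from scipy.stats import friedmanchisquare, rankdata

auc = np.array([[0.90, 0.85, 0.88],   # toy mean AUC scores
                [0.75, 0.70, 0.78],
                [0.80, 0.79, 0.83],
                [0.66, 0.60, 0.71]])
N, k = auc.shape

ranks = np.apply_along_axis(lambda row: rankdata(-row), 1, auc)  # rank 1 = best AUC
R = ranks.mean(axis=0)                                           # average ranks R_j

chi2_F = 12 * N / (k * (k + 1)) * (np.sum(R ** 2) - k * (k + 1) ** 2 / 4)
print(chi2_F)                          # 6.5 for this toy matrix
print(friedmanchisquare(*auc.T))       # same statistic here (no ties), plus a p value
```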

For this test, the null hypothesis (H0) states that all the algorithms are comparable and the observed differences in their ranks might be random, whereas the alternative hypothesis (H1) states that there is a statistically significant difference among the algorithms.

Our interest is to reject the null hypothesis; in case H0 is rejected, we go ahead and carry out a post hoc test in order to establish the specific pairs of algorithms that produce the differences. Many post hoc tests have been proposed in the literature [91], such as Nemenyi's, Holm's, Bergmann's, and Shaffer's tests, among others, as shown in Fig. 4. A detailed discussion of nonparametric tests and their corresponding post hoc tests is presented in [89] and [91].

Fig. 4 Nonparametric Tests and Post hoc Procedures for N × N Comparisons

For our study, we used Friedman's N × N statistical test to first determine whether there exists any statistically significant difference among the methods being studied across the datasets, followed by the Bergmann post hoc procedure to compute a probability value (p value) for the test on each pair of methods. We chose the Bergmann post hoc procedure because of its high statistical power for multiple comparisons; it has been recommended in the literature for this kind of study, since it exhaustively finds all the possible sets of hypotheses for the given comparisons and all those elementary hypotheses that cannot be rejected [91].

We carried out the statistical tests using KEEL’s nonparametric statistical analysis tool [57]. We present the outcome of the statistical analysis in the next subsection.

4.4 Statistical analysis results

As an outcome of the statistical analysis of the performance of our proposed method (sBal_IH) against existing popular ensemble methods, using the Friedman/Bergmann procedures with a significance level of \(\alpha = 0.05\), we present our results in Tables 10–15. We highlight in bold all pairs that include the sBal_IH method. In the results column, we indicate whether the null hypothesis is rejected or accepted; we are primarily interested in determining statistically significant performance differences between sBal_IH and the other ensemble methods across the series of datasets.

For the data resulting from Tables 4 and 5, where we studied sBal_IH against regular and specialized ensemble methods, respectively, on a series of synthetic imbalanced datasets with controlled levels of noise, we include the statistical analysis results in Tables 10 and 11.

It is observed in Table 10 that H0 is rejected across all pairs of algorithms involving sBal_IH. This shows that there is a significant difference in performance between sBal_IH and all traditional ensemble methods on the synthetic imbalanced datasets.

Table 10 Statistical Results for Data from Table 4—sBal_IH vs Regular Ensemble Method on Synthetic Imbalanced Datasets

When it comes to specialized ensemble methods for class imbalance problems, we observe in Table 11 a significant difference in performance between sBal_IH and the other specialized ensemble methods, except for sBal_IH vs Balanced Random Forest (BRF).

Table 11 Statistical Results for Data from Table 5—sBal_IH vs Specialized Ensemble Methods on Synthetic Imbalanced Datasets

In Table 12, sBal_IH is statistically analyzed against the regular ensemble methods on balanced datasets, using the results obtained from Table 6. For all pairs of comparisons containing sBal_IH, Bergmann's procedure rejects the null hypothesis except for RF vs sBal_IH.

Table 12 Statistical Results for Data from Table 6—sBal_IH vs Regular Ensemble Methods on Real-world Balanced Datasets

In Table 13, which shows the analysis of the results from Table 7, we observe that H0 is rejected for all pairs involving sBal_IH versus the regular ensemble methods. This confirms that there exists a statistically significant difference in performance between sBal_IH and all regular ensemble methods on imbalanced real-world datasets (in line with the same observation above on imbalanced synthetic datasets).

Table 13 Statistical Results for Table 7—sBal_IH vs Regular Ensemble Methods on Imbalanced Datasets

Finally, in Tables 14 and 15, we analyze the differences between sBal_IH and ensemble methods specialized for class imbalance problems on a series of imbalanced and balanced datasets, using results from Tables 8 and 9, respectively.

In Table 14, despite sBal_IH being ranked first among all methods by the average rankings of Friedman’s test, we observe that Bergmann’s procedure rejects H0 only for sBal_IH vs RUSBoost. It is important to note, however, that the Friedman test may report a significant difference in performance among algorithms while the post hoc test fails to detect it for particular pairs; this can be due to the lower statistical power of the post hoc test.

Table 14 Statistical Results for Data from Table 8—sBal_IH vs Specialized Ensemble Methods on Real-world Imbalanced Datasets

Interestingly, when it comes to balanced datasets, as indicated in Table 15, sBal_IH registered a significant difference in performance with Easy Ensemble, Balanced Bagging, and RUSBoost. This shows that our proposed method is comparable to the state-of-the-art specialized ensemble methods on imbalanced datasets, and superior on balanced ones.

Table 15 Statistical Results for Data from Table 9—sBal_IH vs Specialized Ensemble Methods on Real-world Balanced Datasets

4.5 Computational Cost

From the experimental results, despite the promising performance of the proposed method (sBal_IH), we note that this comes at the computational cost of pre-calculating the instance hardness of a given dataset. As observed in Table 16, the proposed method (sBal_IH) registered higher computational times than all of the other studied methods (2 runs with different random seeds). Due to space limitations, we report computational times for only 15 datasets, selecting 5 from each group (synthetic, real-world balanced, and real-world imbalanced), covering different dataset and feature sizes.

Table 16 Computational Time (in seconds) for all Ensemble Methods on Selected Datasets

While we recognize that the computation times for the sBal_IH method are much higher than those of the other methods, we believe that for many applications using small and medium-size datasets this can still be an acceptable compromise in exchange for the performance improvement. We also draw attention to the fact that these higher computation times are due to the IH estimation step. Table 17 demonstrates this clearly by showing (in the first 4 columns) how much time it took to estimate IH versus the time taken by the remaining steps of sBal_IH.

Table 17 Computational Time (in seconds) for the Subtasks of IH Estimation and Expected Parallelization Gains

Based on that observation, we emphasize that the IH estimation process is highly parallelizable, since it consists of multiple independent tasks performed within a cross-validation scheme (each fold can easily run in parallel). To demonstrate the expected efficiency improvements, we ran experiments estimating the performance gains that would be obtained if the IH process were run with parallelization. This is shown in Table 17 (last 4 columns), following the logic below (a minimal code sketch of such a scheme follows the list):

  • Currently, IH estimation is performed using a fivefold cross-validation (CV) scheme; running the 5 folds in parallel processes would therefore cost approximately 20% of the time of the sequential execution of this scheme.

  • In each of the above folds, we used 4 different learners to estimate IH. Our experiments show that KNN had the highest cost across all datasets; thus, if each learner were parallelized into a separate process, IH estimation would have to wait for the slowest learner (i.e., KNN) to finish.

  • Based on this slowest learner’s cost, running each of the 4 learners in parallel would give the estimated improvement shown in Table 17 (second-to-last column).

  • Lastly, the total running time of sBal_IH under such a parallelization approach is estimated to be as shown in the last column.
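The following is a minimal sketch of such a parallel IH estimation scheme, assuming scikit-learn and joblib. The four learners, the helper names, and the hardness definition used here (one minus the probability assigned to an instance’s true class, averaged over the learners that scored it) are illustrative assumptions, not necessarily the exact setup of our experiments.

```python
# Sketch: parallelized instance-hardness (IH) estimation.
# Each (fold, learner) pair is an independent job, so all 5 x 4 = 20 jobs can run in parallel.
import numpy as np
from joblib import Parallel, delayed
from sklearn.model_selection import StratifiedKFold
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression

# Illustrative learner pool (the paper uses 4 learners, with KNN the slowest).
LEARNERS = [KNeighborsClassifier, DecisionTreeClassifier, GaussianNB, LogisticRegression]

def _one_task(X, y, train_idx, test_idx, make_clf):
    """Fit one learner on one fold's training split and score the held-out instances."""
    clf = make_clf().fit(X[train_idx], y[train_idx])
    proba = clf.predict_proba(X[test_idx])
    # Probability the learner assigns to each held-out instance's true class.
    cols = np.searchsorted(clf.classes_, y[test_idx])
    true_class_proba = proba[np.arange(len(test_idx)), cols]
    return test_idx, 1.0 - true_class_proba   # hardness w.r.t. this learner

def estimate_instance_hardness(X, y, n_splits=5, n_jobs=-1):
    """Fivefold CV scheme with every (fold, learner) job run in parallel."""
    folds = list(StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=0).split(X, y))
    jobs = [(tr, te, make_clf) for tr, te in folds for make_clf in LEARNERS]
    results = Parallel(n_jobs=n_jobs)(
        delayed(_one_task)(X, y, tr, te, m) for tr, te, m in jobs
    )
    # Average the per-learner hardness scores for each instance.
    hardness_sum, counts = np.zeros(len(y)), np.zeros(len(y))
    for test_idx, h in results:
        hardness_sum[test_idx] += h
        counts[test_idx] += 1
    return hardness_sum / counts
```

Under this scheme the wall-clock time per fold is bounded by the slowest learner (KNN), which is exactly the assumption behind the estimates in the last columns of Table 17.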

As demonstrated by the above discussion, the computation time of the sBal_IH method can be reduced by roughly an order of magnitude (~10×) with a simple parallelization scheme. We do, however, recognize that future work is required to address the issue of computational cost, perhaps via faster alternatives for estimating IH. But even without this, the sBal_IH method still offers an acceptable compromise in exchange for the performance improvement.

5 Conclusion and future works

In this study, we have addressed the problem of class imbalance from the perspective of improving the performance of ensemble-based classifiers. Although many state-of-the-art ensemble-based methods have been proposed for this problem, we have introduced a new ensemble method, called sBal_IH, with a new bootstrapping approach that balances the class distributions of the bags based on instance hardness. In the proposed method, each bag contains all of the minority instances, and the selection of majority instances is guided by pre-calculated instance hardness categories (easy, normal, and hard), unlike most bagging-based methods, which select instances with uniform probability. This ensures that base learners are trained on balanced bags that also contain diverse levels of hardness, allowing a good representation of the input space that reflected well on the overall performance, as demonstrated by the experimental results.
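The construction of one such bag can be sketched as follows. This is a minimal illustration only: how the easy/normal/hard categories are defined (here, hardness tertiles of the majority class) and how many majority instances are drawn from each category (here, roughly equal shares) are illustrative assumptions rather than the exact rules of sBal_IH.

```python
# Sketch: building one class-balanced, hardness-aware bag.
import numpy as np

def make_balanced_bag(X, y, hardness, minority_label, rng):
    """All minority instances plus an equal number of majority instances,
    sampled according to their instance-hardness category."""
    min_idx = np.where(y == minority_label)[0]
    maj_idx = np.where(y != minority_label)[0]

    # Split majority instances into easy / normal / hard via hardness tertiles.
    q1, q2 = np.quantile(hardness[maj_idx], [1 / 3, 2 / 3])
    h = hardness[maj_idx]
    categories = [
        maj_idx[h <= q1],                 # easy
        maj_idx[(h > q1) & (h <= q2)],    # normal
        maj_idx[h > q2],                  # hard
    ]

    # Draw (with replacement) roughly equal numbers from each non-empty category
    # so the bag is class-balanced and mixes hardness levels.
    per_cat = int(np.ceil(len(min_idx) / 3))
    sampled = np.concatenate(
        [rng.choice(cat, size=per_cat, replace=True) for cat in categories if len(cat) > 0]
    )[: len(min_idx)]

    bag_idx = np.concatenate([min_idx, sampled])
    return X[bag_idx], y[bag_idx]

# Usage (one bag per base learner of the ensemble):
# rng = np.random.default_rng(0)
# X_bag, y_bag = make_balanced_bag(X, y, hardness, minority_label=1, rng=rng)
```

Repeating this sampling once per base learner yields an ensemble whose members see balanced but mutually diverse training bags.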

The proposed method was evaluated on 100 datasets, comprising 30 synthetic imbalanced datasets with controlled levels of noise, 29 balanced real-world datasets, and 41 imbalanced real-world datasets for binary classification. We used the nonparametric Friedman test and Bergmann’s post hoc procedure, at a significance level of \(\alpha = 0.05\), to statistically ascertain our findings.

The findings demonstrate that our proposed sBal_IH method performs statistically significantly better than the regular ensemble methods (Bagging, Wagging, Random Forest, and AdaBoost) on both synthetic and real-world imbalanced datasets, and better than these methods on balanced datasets (except for Random Forest, where performance was comparable). Furthermore, sBal_IH performed better than the ensemble methods specialized for class imbalance problems (Balanced Bagging, Balanced Random Forest, RUSBoost, and Easy Ensemble) on the majority of both balanced and imbalanced datasets. The analysis showed a statistically significant difference in performance between sBal_IH and the specialized methods on the balanced datasets (except for Balanced Random Forest, where performance was comparable), and comparable performance on the imbalanced datasets.

These findings suggest that the approach followed by the sBal_IH method is significantly better than the compared methods in the majority of cases, for both balanced and imbalanced datasets. This paves the way for similar extensions based on data complexity and instance hardness to further boost performance. We note that this improvement comes at the expense of the computational cost of pre-calculating the instance hardness of a given dataset. However, for many applications that use small and medium-size datasets, this might be acceptable in exchange for the promising improvement in performance.

In future work, we plan to investigate how to reduce the computational cost of sBal_IH by studying faster alternative approaches for estimating IH. We also propose to extend our experiments to cover multi-class imbalanced datasets by expanding the sBal_IH approach to take multiple classes into consideration. Finally, we plan to study the performance of sBal_IH when induced with heterogeneous base learners, since in the current experiments we considered only a homogeneous one for building the ensembles. We therefore anticipate that this study opens several research opportunities that utilize data balancing techniques based on instance hardness and data complexity to improve classification performance in situations of class imbalance, and possibly on balanced datasets as well.