
1 Introduction

The problem of learning from class-imbalanced data has been a topic of intensive research in recent years, motivated by the observation that standard learning methods are inherently biased toward the majority classes and fail to sufficiently recognize the minority class, which is often of particular interest in the given application. For reviews of new methods see, e.g., [6, 10].

Ensembles are among the most effective methods for improving recognition of the minority class. Most of them extend strategies from bagging or boosting. They either employ pre-processing methods before learning component classifiers or embed a cost-sensitive framework in the ensemble learning process; see the reviews in [9, 10]. Experimental comparisons [3, 11, 12] have shown that extensions of bagging work better than those of boosting and other solutions.

The basic idea behind extending bagging for imbalanced data is to modify the distribution of examples from the minority and majority classes in bootstraps [9]. This may be achieved in many ways, which usually balance the numbers of examples from both classes. Experimental studies show that under-sampling, i.e., reduction of examples from the majority class, performs better than over-sampling (i.e., multiplication of examples from the minority class) [3, 4, 11]. Improvements are observed even for the simplest extensions, like Exactly Balanced Bagging (EBBag) [7, 9]. Other studies [3, 11] demonstrate that Roughly Balanced Bagging (RBBag), which applies a more sophisticated random under-sampling, achieves the best G-mean among compared ensembles. We have proposed Neighbourhood Balanced Bagging (NBBag), which is competitive with RBBag. NBBag modifies bootstrap sampling by weighting examples [4]. The weight of a minority example depends on the number of majority class examples in the whole training set and in the example's neighbourhood.

However, these improvements may come with a decrease in recognition of majority class examples [4]. We claim that there is still room to better learn the trade-off between performance on both classes, which leads us to the research question of how to achieve it. To the best of our knowledge, current research focuses on various modifications of bootstrap samples in under-sampling bagging extensions. In this paper, we want to consider another hypothesis: given an already good technique for constructing a bagging extension, is it possible to perform additional steps of updating its bootstraps by selecting a limited number of learning examples that are important for improving performance on both classes?

This hypothesis directs our attention toward active learning methods [15]. Recall that, in active learning, a learning method may achieve better classification performance when it is allowed to choose the examples from which it learns. Usually it is applied to partially labeled data, i.e., after an initial step of fully supervised learning, the active method selects a limited number of examples to be labeled and further used to update the classifier. However, active learning can also be used to filter examples in a fully labeled data set [2]. In this way, active strategies have already been applied to imbalanced learning, although these attempts are still quite limited (see Sect. 2).

We introduce a new approach, called Actively Balanced Bagging (ABBag), which extends bagging for imbalanced data by an active learning modification of the bootstrap samples of a previously constructed under-sampling bagging extension. In this proposal, after the initial training, the ensemble (here we consider two algorithms: EBBag and NBBag) is re-trained on bootstraps enlarged by batches, i.e., small portions of learning examples, selected according to active selection strategies. Batch querying, although very practical, is rarely used by existing active learning approaches [5]. To the best of our knowledge, no similar extension has been considered in the literature so far.

The paper is organized as follows. The next section summarizes related works on bagging extensions proposed for imbalanced data and on active learning strategies. The following Sect. 3 describes the proposed actively balanced bagging extension. Experimental validation of the proposed methods is carried out in Sect. 4. The paper is concluded in Sect. 5.

2 Related Works

Besides working on new algorithms, some researchers have also studied the characteristics of imbalanced data and the sources of difficulties for learning classifiers. It has been noticed that the global imbalance ratio is not the only, or even the most important, factor that makes learning difficult. Other data difficulty factors, such as class overlapping, small disjuncts or lack of representativeness, significantly deteriorate the quality of an induced model even on exactly balanced data [13, 14]. They also inspire the development of new methods, see, e.g., [4].

Several extensions of the bagging approach for imbalanced data have been proposed [4, 9, 16]. In the simplest one, called Exactly Balanced Bagging (EBBag), while constructing a bootstrap sample, the entire minority class is copied and combined with a randomly chosen subset of the majority class, so as to exactly balance the cardinalities of both classes. The first bagging extension which uses knowledge of data difficulty factors is Neighbourhood Balanced Bagging (NBBag) [4]. NBBag shifts bootstrap sampling toward difficult minority examples by using a certain type of weights. The weight of a minority example depends on the analysis of class labels among its k nearest neighbours. A minority example is considered the more unsafe, the more majority examples it has in its neighbourhood. Thus, this part of the weight reflects a local balancing factor. Moreover, this weight is also aggregated with a global balancing factor, which takes into account the imbalance ratio between classes. Hence, the formula for the minority example weight is the following: \(w = 0.5 \times \left( \frac{(N_-')^{\psi }}{k} + 1\right) \), where \(N_-'\) is the number of majority examples among the k nearest neighbours of the example and \(\psi \) is a scaling factor. Setting \(\psi = 1\) causes a linear amplification of the example weight with increasing unsafeness, and setting \(\psi \) to values greater than 1 results in an exponential amplification. Each majority example is assigned a constant weight \(w = 0.5 \times \frac{N_+}{N_-}\).
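
To make these constructions concrete, the following minimal Python sketch illustrates one EBBag bootstrap and the NBBag weighting formulas. The function names and the numpy-based interface are ours, not from [4] or [9]; drawing the majority subset with replacement is one possible sampling variant.

```python
import numpy as np

def ebbag_bootstrap(X, y, minority=1, rng=None):
    # EBBag: copy the whole minority class and add an equally sized
    # random sample of the majority class.
    rng = rng or np.random.default_rng()
    min_idx = np.flatnonzero(y == minority)
    maj_idx = np.flatnonzero(y != minority)
    maj_sample = rng.choice(maj_idx, size=len(min_idx), replace=True)
    idx = np.concatenate([min_idx, maj_sample])
    return X[idx], y[idx]

def nbbag_minority_weight(n_maj_neighbours, k, psi=1.0):
    # NBBag minority weight w = 0.5 * ((N'_-)^psi / k + 1), where
    # n_maj_neighbours is the number of majority examples among the
    # k nearest neighbours; psi amplifies the unsafeness.
    return 0.5 * (n_maj_neighbours ** psi / k + 1.0)

def nbbag_majority_weight(n_minority, n_majority):
    # Constant majority weight w = 0.5 * N_+ / N_-.
    return 0.5 * n_minority / n_majority
```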

Active learning approaches for imbalanced data are rather limited. The active strategy proposed by Ertekin et al. [8] relies on iterations of selecting a single uncertain example and rebuilding the model. However, selecting more than one instance at a time, and rebuilding the model on a batch, can often greatly reduce computation time [5]. This is especially true when the cost of constructing a classifier is higher than for an online SVM [8]. A simple active extension of bagging has been proposed for unlabeled data in imbalanced medical problems [16].

3 Active Selection of Examples in Balanced Bagging

In this section we describe Actively Balanced Bagging (ABBag), which is composed of two phases. The first phase consists in learning an ensemble classifier by one of the approaches for constructing under-sampling extensions of bagging. Although one can choose any extension, we will further consider a simple yet effective one, Exactly Balanced Bagging (EBBag) [9], and a better performing yet more complex one, Neighbourhood Balanced Bagging (NBBag) [4]. For more information on constructing these ensembles refer to Sect. 2. Here, we focus on an explanation of the second phase, which is specific to ABBag and which we call active selection of examples. It consists in: (1) an iterative modification of the bootstrap samples constructed in the first phase, by adding selected examples from the training set; (2) re-training of component classifiers on the modified bootstraps. The examples in (1) are added to bootstraps in small portions called batches.

The proposed active selection of examples can be seen as a variant of the Query-by-Committee (QBC) approach [1]. In QBC, one constructs an ensemble of diverse component classifiers and then ranks all classified examples with respect to a selected committee disagreement measure (a decision margin). Originally, this margin is defined as the difference in the number of votes, or in class probabilities, between the two most supported classes. Note that QBC has already been successfully applied to active learning with an ensemble of diverse component classifiers [5]. Ensembles constructed in the first phase of ABBag are attractive to combine with QBC because, as we have previously shown in [4], they promote diversity of component classifiers. However, QBC does not take into account global (i.e., concerning the whole training set) properties of the distribution of examples, and as a result, it can focus on selecting outliers and examples from sparse regions [5].

Recall that single example selection is a typical strategy in active learning [2, 15]. In ABBag we promote selecting, in each iteration, a small batch instead of one example. This is motivated by a potential reduction of computation time as well as by an increased diversity of examples in the batch. As observed in [5], a greedy selection of examples with respect to a single criterion, e.g., the highest utility/uncertainty measure typical for active strategies [2, 15], does not provide the desired diversity. In our view, giving a chance to also randomly select some slightly sub-optimal examples besides the best ones may result in a higher diversity of the new bootstraps and an increased diversity of the re-trained component classifiers.

We address the above-mentioned issues in two ways. First, and foremost, the proposed active selection of examples considers multiple factors. Among them, we consider the uncertainty of example prediction by component classifiers and the prediction error of a single component classifier. In addition, we consider factors specific to imbalanced data, which reflect the more global (i.e., concerning the whole training set) and/or local (i.e., concerning the example neighbourhood) class distribution of examples. Second, we use a specific variant of rejection sampling to enforce diversity within the batch through randomization.

Algorithm 1. Learning the ABBag ensemble (pseudo-code).

The pseudo-code of the algorithm for learning the ABBag ensemble is presented as Algorithm 1. It starts with a training set LS and \(m_{bag}\) bootstrap samples \(\pmb {S}\), and the first phase (lines 1–3) results in an under-sampling extension of bagging. It makes use of initial balancing weights \(\pmb {w}\), which are calculated in accordance with the chosen under-sampling bagging extension. These initial balancing weights \(\pmb {w}\) allow us to direct sampling toward examples which are more difficult to learn, in comparison to the uniform sampling typical for bagging. In the case of EBBag, \(\pmb {w}\) reflects the global imbalance of examples in the training set. For NBBag, \(\pmb {w}\) expresses both the global and the local imbalance of examples in the training set [4].

In the second phase, the active selection of examples is performed in lines 4–10. All bootstraps from \(\pmb {S}\) are iteratively (\(m_{al}\) times) enlarged by adding batches, and new component classifiers are re-learned. In each iteration, new weights \(\pmb {w'}\) of examples are calculated according to the weights update method um (described in the next paragraph), and then they are sorted (lines 6–7). Each bootstrap is enhanced by \(n_{al}\) examples selected randomly with rejection sampling according to \(\alpha =w'(x_{n_{al}}) + \epsilon \), i.e., \(n_{al}\) random examples with weights \(w'\) higher than \(\alpha \) are selected (lines 8–9). Parameter \(\epsilon \) introduces an additional (after \(\alpha \)) level of randomness into the sampling. Finally, after each bootstrap sample is enlarged, a new component classifier \(C_i\) is trained, resulting in a new ensemble classifier \(\pmb {C}\) (line 10).
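
The following Python sketch illustrates one such enlargement step (lines 8–9 of Algorithm 1). It is only an interpretation under stated assumptions: we treat \(\epsilon \) as a fixed slack and assume a negative value so that the candidate pool extends beyond the top \(n_{al}\) examples, which is what gives slightly sub-optimal examples a chance to be selected; the fallback to the top-ranked examples is our guard rather than part of the algorithm.

```python
import numpy as np

def enlarge_bootstrap(S_i, w_prime, n_al, epsilon, rng):
    # Rank all training examples by their updated weights w'.
    order = np.argsort(w_prime)[::-1]               # best weight first
    # Threshold alpha = w'(x_{n_al}) + epsilon; a negative epsilon
    # lets more than n_al examples pass the threshold.
    alpha = w_prime[order[n_al - 1]] + epsilon
    candidates = np.flatnonzero(w_prime > alpha)
    if len(candidates) < n_al:                      # our guard: degenerate pool
        candidates = order[:n_al]
    # Rejection sampling here reduces to a uniform draw from the pool.
    batch = rng.choice(candidates, size=n_al, replace=False)
    # The enlarged bootstrap is then used to re-train classifier C_i.
    return np.concatenate([S_i, batch])
```

Repeating this step \(m_{al}\) times for each of the \(m_{bag}\) bootstraps, and re-training each component classifier after every enlargement, reproduces the loop of lines 4–10.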

We consider four different weights update methods. The simplest method, called margin (m), substitutes the initial weights of examples with a decision margin between the component classifiers in \(\pmb {C}\). For a given example it is defined as: \(m = 1 - \Big \vert \frac{V_{maj} - V_{min}}{m_{bag}} \Big \vert \), where \(V_{maj}\) is the number of votes for the majority class, \(V_{min}\) is the number of votes for the minority class, and \(m_{bag}\) is the number of component classifiers. Since the margin may not directly reflect the characteristics of imbalanced data (indeed, under-sampling should to some extent reduce the bias of the classifiers), we consider aggregating it with additional factors. As a result, three further variants of weights update methods are proposed. In the first variant, called margin and weight (mw), the new weight \(w'\) is a product of m and the initial balancing weight w. Additionally, we reduce the influence of w in subsequent iterations of active example selection, as the active selection iteration index l increases. The reasoning behind this reduction is that we expect the margin m to improve with subsequent iterations, so the initial weights w become less important. More precisely, \(mw = m \times w^{\big (\frac{m_{al} - l}{m_{al}}\big )}\).
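
In code, these two updates are one-liners; the sketch below (hypothetical helper names, continuing the numpy convention of the previous listings) also shows the intended behaviour: a unanimous ensemble vote yields a small weight, a split vote a large one.

```python
def margin_update(v_maj, v_min, m_bag):
    # m = 1 - |(V_maj - V_min) / m_bag|: close to 1 when the committee
    # disagrees, close to 0 when the vote is unanimous.
    return 1.0 - abs(v_maj - v_min) / m_bag

def mw_update(m, w, l, m_al):
    # mw = m * w^((m_al - l) / m_al): the exponent decays toward 0,
    # so the initial balancing weight w fades out over iterations.
    return m * w ** ((m_al - l) / m_al)

# For example, with m_bag = 50 component classifiers:
# margin_update(48, 2, 50) == 0.08   (confident ensemble, low priority)
# margin_update(26, 24, 50) == 0.96  (uncertain ensemble, high priority)
```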

Recall that both weights update methods considered so far produce bootstrap samples which, in the same iteration l, differ only due to the randomization introduced by rejection sampling, i.e., the weights \(\pmb {w'}\) are the same for each i. That is why we consider yet another modification of m and mw, which makes \(\pmb {w'}\), and consequently each bootstrap, dependent on the performance of the corresponding component classifier. These two new update methods, margin and component error (mce) and margin, weight and component error (mwce), are defined, respectively, as follows: \(mce = m + 1_e \times w\), and \(mwce = mw + 1_e \times w\). In this notation, \(1_e\) is an indicator function defined so that \(1_e = 1\) when the component classifier makes a prediction error on the example, and \(1_e = 0\) otherwise.
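
Extending the previous sketch, the error-aware updates only add the indicator term; err below stands for \(1_e\) and is computed per component classifier, which is what makes the weight vectors, and hence the bootstraps, differ between ensemble members.

```python
def mce_update(m, w, err):
    # mce = m + 1_e * w: boost the weight of examples the i-th
    # component classifier currently misclassifies (err is 0 or 1).
    return m + err * w

def mwce_update(m, w, l, m_al, err):
    # mwce = mw + 1_e * w: the same boost on top of the mw update.
    return mw_update(m, w, l, m_al) + err * w
```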

4 Experiments

We consider two aims of the experiments. First, we check whether the predictive performance of Actively Balanced Bagging is improved in comparison to known, well-performing under-sampling extensions of bagging. To examine this more generally, we have decided to choose two efficient extensions based on different principles: Exactly Balanced Bagging (EBBag) [7] and Neighbourhood Balanced Bagging (NBBag) [4]. Our second aim is to compare different variants of active selection methods, which result in different versions of ABBag.

In order to examine both aims we use several synthetic and real-world imbalanced data sets representing various difficulty factors in the distribution of the minority class. All considered synthetic data sets, i.e., flower and paw, were constructed as described in [17]. They have a controlled level of difficulty, i.e., decomposition of the minority class into sub-concepts, overlapping areas, and the presence of rare examples and outliers. The percentages of minority examples representing safe, borderline, rare and outlier examples are given in the suffix of the data set name. The examples from the minority class were generated randomly inside predefined spheres and the majority class examples were randomly distributed in an area surrounding them. Following similar motivations and analyses from works on difficulty factors in imbalanced data [4, 14], we selected UCI data sets and an additional scrotal-pain data set (Footnote 1). Due to page limits, we refer the reader to the description and characteristics of these data sets in [4]. Note that some of the data sets contain multiple majority classes, which were aggregated into one class.

The experiments were carried out in two variants with respect to the first phase of the active selection. In the first variant, standard EBBag or under-sampling NBBag was considered. In the other variant, the size of each of the classes in the bootstrap samples was reduced to \(50\%\) of the size of the minority class in the training set. The active selection parameters used in the second phase, \(m_{al}\) and \(n_{al}\), were chosen so that the bootstrap samples constructed in ABBag exceed the size of a standard under-sampling bootstrap by a factor of at most two. The size of the ensembles, \(m_{bag}\), in accordance with previous experiments [4], was always fixed to 50 J4.8 decision trees. The NBBag parameters were set the same as in [4]. All presented results are estimated by a stratified 10-fold cross-validation repeated three times to improve reproducibility. The number of repetitions is the same for all considered data sets for better comparability.
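
The evaluation protocol (stratified 10-fold cross-validation, three repeats) corresponds to the following scikit-learn sketch. Note that the paper uses WEKA's J4.8 trees, so this splitter is only a stand-in for the same protocol, with X and y assumed to hold an already loaded data set.

```python
from sklearn.model_selection import RepeatedStratifiedKFold

cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=42)
for train_idx, test_idx in cv.split(X, y):
    # Train the ABBag ensemble on X[train_idx], y[train_idx] and
    # accumulate G-mean on X[test_idx], y[test_idx]; results are
    # then averaged over the 30 folds.
    ...
```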

Table 1. G-mean of actively balanced \(50\%\) under-sampling EBBag

Due to page limits, we present the results of two main experiments, which are good representatives of the tendencies observed in the other analyzed settings (Footnote 2). In Tables 1 and 2, we present values of the G-mean measure, since we want to find a good trade-off between recognition in both classes [4]. More precisely, we show the best G-mean value which can be achieved with a limited number of active learning iterations (we considered \(m_{al} \le 10\)) for relatively small batches. The batch size in the experiments was set as a percentage of the size of the minority class (\(n_{al} = \{5\%,10\%\}\)). We show two variants of the first phase of the active selection, each of them appropriate for the analyzed type of ABBag. Table 1 presents results of active balancing of \(50\%\) under-sampling EBBag, while Table 2 presents results of active balancing of standard under-sampling NBBag. The last row of Tables 1 and 2 contains average ranks calculated as in the Friedman test: the lower the average rank, the better the classifier.
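
The paper does not restate the G-mean formula, so for completeness, here is a minimal sketch of the standard definition used in this line of work: the geometric mean of sensitivity (minority recall) and specificity (majority recall).

```python
import numpy as np

def g_mean(y_true, y_pred, minority=1):
    # G-mean = sqrt(sensitivity * specificity); it drops to 0 when
    # either class is not recognized at all, hence it rewards a
    # trade-off between performance on the two classes.
    pos = (y_true == minority)
    sensitivity = np.mean(y_pred[pos] == y_true[pos])
    specificity = np.mean(y_pred[~pos] == y_true[~pos])
    return np.sqrt(sensitivity * specificity)
```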

Table 2. G-mean of actively balanced under-sampling NBBag

The first, general conclusion resulting from our experiments is that ABBag performs better than the under-sampling extensions of bagging, both EBBag and NBBag. Let us treat EBBag and NBBag as baselines in Tables 1 and 2, respectively. The observed improvements of G-mean are statistically significant regardless of the version of ABBag. More precisely, following the Friedman statistical test, each actively balanced EBBag has a lower average rank than the baseline EBBag, and, similarly, each actively balanced NBBag has a lower average rank than the baseline NBBag. Moreover, the Friedman tests result in p-values \(\ll \) 0.00001. According to the Nemenyi post-hoc test, the critical difference CD between average ranks is around 1.272. The difference between the average ranks of each actively balanced EBBag and the baseline EBBag is thus higher than CD. An analogous observation holds for each actively balanced NBBag and the baseline NBBag. Some more detailed observations concerning the results presented in Tables 1 and 2 may be given. The most striking difference in performance between the baseline bagging extensions and ABBag is visible for synthetic data sets with the most difficult distribution of safe, borderline, rare and outlier examples (i.e., data sets with the suffix 10-20-35-35). In that case, both baseline EBBag and NBBag are unable to learn the minority class at all. ABBag shows significantly better performance for these data sets, regardless of the weights update strategy.
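
For readers who want to reproduce this style of comparison, a short sketch of the rank-based analysis follows; scores is assumed to be an N-data-sets by k-classifiers matrix of G-mean values, and the CD formula is the standard one from Demšar's methodology, with q_alpha being the Studentized-range-based constant for k classifiers.

```python
import numpy as np
from scipy import stats

def average_ranks(scores):
    # Rank classifiers within each data set (rank 1 = highest G-mean)
    # and average the ranks over data sets, as in the Friedman test.
    ranks = stats.rankdata(-scores, axis=1)
    return ranks.mean(axis=0)

def nemenyi_cd(k, n, q_alpha):
    # Critical difference of the Nemenyi post-hoc test: two classifiers
    # differ significantly if their average ranks differ by more than CD.
    return q_alpha * np.sqrt(k * (k + 1) / (6.0 * n))
```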

Moving to the next research question, on the role of the proposed active modifications, i.e., the weights update methods in the active selection, we make the following observations. First, if we consider actively balanced EBBag, the margin and component error weights update method (mce-EBBag) has the best average rank. Nevertheless, the margin, weight and component error weights update method (mwce-EBBag) has the best median value over all G-mean results in Table 1. These observations are not statistically significant according to the critical difference and the results of the Wilcoxon test for selected pairs of classifiers. On the other hand, in all experiments with actively balanced EBBag we note that one of the two weights update methods, mce or mwce, yields the best results.

Similar observations are valid for actively balanced NBBag. In this case, the best weights update method according to the average rank is margin, weight and component error (mwce-NBBag). On the other hand, the margin and component error weights update method (mce-NBBag) has the best median value. Again, these observations are not statistically significant. However, the paired Wilcoxon test between mwce-NBBag and mw-NBBag resulted in a p-value close to 0.045, and mwce-NBBag has the higher median value. We can take this as another indication that including the component error in the weights update method yields better G-mean results.

To sum up, we can interpret the results obtained with all four weights update methods applied in the active selection in the following way. First, they show that all considered elements of the weights update strategy, i.e., the margin of classifiers in the ensemble, the weight of an example, and the component error, are important for improving ABBag performance. Second, two combinations of the considered elements tend to give better results than the others: margin and component error (mce), and margin, weight and component error (mwce).

5 Conclusions

In this paper we examined the following research question: is it possible to improve the classification performance of an under-sampling bagging ensemble with an active learning strategy? To address this question we introduced a new approach, called Actively Balanced Bagging (ABBag). The proposed active selection of examples involves iterative updating of bootstraps with batches composed of examples selected from the training set. These examples were sampled according to a distribution of weights which expresses: a decision margin of ensemble votes, the balancing of the example class distribution in the training set and/or in its neighbourhood, and the prediction errors of component classifiers.

The results of the experiments have clearly shown that the active selection of examples improves the G-mean performance of the considered under-sampling bagging extensions, which are known to be very good classifiers for imbalanced data. Another important observation resulting from the experiments is that the active selection strategy performs best when it integrates the ensemble disagreement factor (typical for active learning) with information on the class distribution in imbalanced data and the prediction errors of component classifiers.

We hope that this study may open future research lines on adapting active learning strategies to improve ensemble classifiers.