1 Introduction

In the last decade, imbalanced data classification has drawn a significant amount of interests from academia, industry, and government funding agencies [16]. The imbalanced data distribution is characterized as some classes have more instances than others. In the binary classification case, the imbalance problem is that one class has more samples than the other. The former is called majority/negative class, while the latter is minority/positive class. Usually, minority samples have higher misclassification costs than majority samples have, and thus it’s more important to classify them correctly.

The fundamental issue with imbalanced data classification is the classification of imbalanced data has posed a significant drawback of the performance of most standard learning algorithms, which assume or expect balanced class distribution or equal misclassification costs. Boosting is a kind of method generally for effectively improving the accuracy of a given learning algorithm [2]. Naive boosting minimizes the overall error rate which is not suitable for imbalanced data. If the boosting framework can be improved to adapt to the imbalanced data classification, which is denoted as IDBoosting (Imbalanced-Data-Boosting) hereafter, then conventional learning algorithms can be integrated without further modifications.

So far, many tries of IDBoosting have been brought out and some of them have achieved success in various imbalanced data classification problems. This work gives a review of IDBoosting methods. The contributions are twofold. First, existing methods are catalogued and each class is displayed in detail in terms of design criteria, typical algorithms and performance analysis. Second, the essence of two IDBoosting methods is discovered from the view of Bayes optimal classification followed by experimental evidence.

The rest of the paper is organized as follows. Section 2 makes a brief introduction of the problem of imbalanced data classification. Section 3 introduces the boosting framework, its drawbacks for imbalanced data classification are also analysis. Section 4 reviews the state-of-the-art work of IDBoosting. Section 5 gives a theoretical illustration of two IDBoosting methods followed by experimental evidence in Sect. 6. Finally, conclusions and useful suggestions for future work are provided in Sect. 7.

2 The problem of imbalanced data classification

2.1 Examples of application domains

The problem of imbalanced data classification is pervasive in a large number of domains of pattern recognition, machine learning and data mining.

Fraud detection. Fraud, such as credit card fraud, insurance fraud and cellular fraud, is a costly problem for many business organizations. Companies attempt to detect fraud by analyzing different consuming patterns in their transaction databases. However, in their transaction collections, there are many more legitimate users than fraudulent examples. In practice, frauds with high transaction amounts are rare but their misclassifications are much more costly [38].

Medical diagnosis. Data mining techniques applied on clinical databases attempt to discover relationships and patterns among clinical and pathological data to understand the progression and features of certain diseases. In these databases, disease cases are fairly rare as compared with the normal populations. However, the error committed in diagnosing someone as healthy when one has a life-threatening disease is usually considered to be far more serious than the opposite type of error of diagnosing someone as ill when one is in fact healthy [23].

Network intrusion detection. As network-based computer systems play increasingly vital roles in modern society, attacks on computer systems and computer networks grow more commonplace. Learning prediction rules from network data is an effective detection approach to automate and simplify the manual development of intrusion signatures. In the collection of network connection records, the number of intrusions on the network is typically a very small fraction of the total network traffic [10].

Modern manufacturing. In a modern manufacturing plant, more and more processes are being handled by automated or semi-automated cells, each with a computer as its controller that renders automatic alarm when raw patterns are detected. In constructing an alarm system through supervised learning, the number of available defective cases is significantly fewer than that of ordinary procedures [6].

2.2 Nature of the problem

Research shows that the skewed data distribution is not the only parameter that influences the performance of standard learning algorithms. Other influential facts include (1) Small sample size. Given a fixed imbalance degree, the sample size plays a crucial role in determining the goodness of a classification model. Experimental observations reported in  [18] indicate that as the size of the training set increases, the large error rate caused by the imbalanced class distribution decreases. In the case that the sample size is limited, uncovering regularities inherent in small class is unreliable. (2) Within-class imbalance. Within-class imbalance refers to the phenomena that a single class is composed of various sub-clusters, or sub-concepts, which do not always contain the same number of examples [17]. The presence of within-class sub-concepts increases the learning concept complexity of the data set and worsens the imbalanced data classification problem. (3) Class separability. The difficulty in separating the minority class from the majority class is a key issue of the imbalanced data classification problem. Experiments conducted in [28] and  [18] have shown that linearly separable domains do not sensitive to any amount of imbalance and when the degree of class overlap increases, the system’s sensitivity to imbalance also increases.

2.3 Evaluation measures

Evaluation measures play a crucial role in both assessing the classification performance and guiding the classifier modeling. Traditionally, accuracy or error rate is the most commonly used measure for these purposes. Let true positive (TP) and true negative (TN) represent the number of correctly classified minority/positive class and majority/negative class respectively, false positive (FP) and false negative (FN) represent the number of misclassified negative class and positive class respectively. \(N_+=TP+FN\) and \(N_-=TN+FP\) represent the total number of positive and negative class respectively. Accuracy and error rate are defined as

$$\begin{aligned} \mathrm{{Accuracy}}=\frac{TP+TN}{N_++N_-}, \quad\mathrm{{Error\,Rate}}=1-\mathrm{{Accuracy}}. \end{aligned}$$
(1)

For classification with the class imbalance problem, accuracy is no longer a proper measure since the minority class has little impact on accuracy as compared to the majority class. Several evaluation metrics are frequently adopted in the research community to provide comprehensive assessments of imbalanced learning problems, which are defined as:

$$\begin{aligned} \mathrm{{Average}} \, \mathrm{{accuracy}}=\frac{1}{2} \left( \frac{TP}{N_+}+\frac{TN}{N_-} \right) \end{aligned}$$
(2)
$$\begin{aligned} G-\mathrm{{mean}}=\sqrt{\frac{TP}{N_+}\times \frac{TN}{N_-}}\end{aligned}$$
(3)
$$\begin{aligned} \mathrm{{Precision}}=\frac{TP}{TP+FP}\end{aligned}$$
(4)
$$\begin{aligned} \mathrm{{Recall}}=\frac{TP}{N_+}=\frac{TP}{TP+FN}\end{aligned}$$
(5)
$$\begin{aligned} F-\mathrm{{measure}}=\frac{2\mathrm{{Precision}}\cdot \mathrm{{Recall}}}{\mathrm{{Precision}}+\mathrm{{Recall}}} \end{aligned}$$
(6)

Average accuracy and G-mean are the arithmetic and geometric means of the accuracy of different classes respectively and can measure the balanced performance of a learning algorithm between all classes. Precision is a measure of exactness (i.e., of the examples labeled as positive, how many are actually labeled correctly), whereas recall is a measure of completeness (i.e., how many examples of the positive class were labeled correctly). Precision and recall are not both sensitive to changes in data distributions. And F-measure ensures that both recall and precision are high.

In the case that the problem of imbalanced data classification is a cost-sensitive classification problem in nature, the cumulative misclassification cost can be used to evaluate the classification performance, see Sect. 4.2 for details.

Besides, receiver operating characteristics (ROC) depicts relative trade-offs between benefits (true positives) and costs (false positives) across a range of thresholds of a classification model and gives a good summary of the performance of a classification model. The more the ROC curve bulges toward the top-left corner, the better the classification model. The area under a ROC curve (AUC) provides a single measure of a classifiers performance for evaluating which model is better on average.

2.4 The state of the art

A number of solutions to the class imbalance problem are reported in the literature. These solutions are developed at both data and algorithm levels [35].

At data level, the objective is to rebalance the class distribution by resampling the data space. Resampling approaches can be categorized into two tendencies: oversampling, that aims to replicate or generate new positive examples in order to gain importance; and undersampling, that consists of reducing the data by eliminating examples belonging to the majority class with the objective of equalizing the number of examples of each class [4]. Both oversampling and undersampling have many different forms, such as randomly oversampling, randomly undersampling, informatively oversampling (in which no new samples are created, but the choice of samples to resample is targeted rather than random), informatively undersampling (the choice of samples to eliminate is targeted), oversampling by generating new synthetic data [35].

Instead of creating balanced data distributions through different sampling strategies, algorithm-level approaches revise existing learning algorithms to make them perform better on the minority class, such as cost-sensitive learning, one-class learning and active learning [16]. Among them, cost-sensitive learning methods have attracted much attention in the research community. Cost-sensitive learning methods consider the costs associated with the classification of examples. Dealing with imbalanced data classification, cost-sensitive learning methods assume higher misclassification costs with minority samples and seek to minimize the high cost errors [43]. So far, many learning methods have had their cost-sensitive correspondings, such as cost-sensitive K-nearest, cost-sensitive decision tree, cost-sensitive neural network, cost-sensitive SVM. One-class learning only models the target class. In the case of highly imbalanced data and small minority class size, one-class learning can describe the concept of the minority class better than two-class learning [29]. Active learning methods are used to solve problems related to unlabeled training data. In most cases of imbalanced data classification, the imbalance ratio of data near the classification boundary is much smaller than the imbalance ratio of the entire data set. Then active learning aims to select those instances that are closest to the current hyper plane from the unseen training data in order to retrain the classifier model [8].

This work focuses on improved boosting methods for imbalanced data classification. This paper catalogues them into two main classes—data-sampling Indorsing and cost-sensitive Indorsing. Data-sampling Indorsing methods integrate data sampling techniques sampling techniques into boosting iterations to equalize the data distribution, thus they belong to both data level and algorithm Indorsing methods belong to cost-sensitive learning methods. Figure learning methods. Figure 1 to the problem of imbalanced data classification, where Indorsing classification, where IDBoosting solutions are highlighted in bold.

Fig. 1
figure 1

Solutions to the problem of imbalanced data classification. IDBoosting solutions are highlighted in bold

3 Boosting

3.1 Boosting framework

Boosting works by sequentially applying a weak learning algorithm to reweighed versions of the training data and taking a majority vote to aggregate weak classifiers thus produced. It has been proved to be a general and effective method for improving the accuracy of a given learning algorithm [11]. Since the generation of boosting, lots of theories are proposed to explain why this simple strategy results in dramatic improvements in performance, including generalization error analysis, game theory, VC theory, functional optimization and much beyond. Here a brief introduction of boosting is given from the statistical view [12].

Assume 1 represent positive class and -1 represent negative class. The logistic model \(F^*(x)\) has the following form

$$\begin{aligned} F^*(x)= \frac{1}{2}\log \frac{P(y=1|x)}{P(y=-1|x)}, \end{aligned}$$
(7)

where \(P(y=1|x)\) and \(P(y=-1|x)\) are posterior probabilities. To estimate \(F^*(x)\), boosting uses an exponential loss function \(L(y,F(x))=e^{-yF(x)}\), whose expectation is minimized by \(F^*(x)\),

$$\begin{aligned} F^*(x)=\arg \min _FE[L(y,F(x))]=\arg \min _FE[e^{-yF(x)}]. \end{aligned}$$
(8)

To fit \(F^*(x)\), boosting uses an additive model \(F(x)=\sum _{m=1}^M\alpha _mf_m(x)\), where \(m\) is the index of the iteration, \(\{f_m\}_{m=1}^M\) are a set of weak classifiers learned sequentially and \(\{\alpha _m\}_{m=1}^M\) are corresponding weights [12]. Initially, \(F(x)=F_0(x)=0\), for the \(m\)-th iteration round, given an imperfect estimate \(F_{m-1}(x)\), boosting seeks an increment \(\alpha _mf_m(x)\) to get an improved estimation \(F_{m}(x)=F_{m-1}(x)+\alpha _mf_m(x)\):

$$\begin{aligned}(\alpha _m, f_m)&=\arg \min _{\alpha ,f}E[L(y,F_{m-1}(x)+\alpha f(x))]\\ &=\arg \min _{\alpha ,f}E_{w_m}[L(y,\alpha f(x))] , \end{aligned}$$
(9)

where

$$\begin{aligned} w_m=w_m(x,y)=e^{-yF_{m-1}(x)}=L(y,F_{m-1}(x)) \end{aligned}$$
(10)

is the loss of sample \((x,y)\) at the moment, called as sample weight. For the next iteration, \(w_{m+1}\) can be updated by the following multiplication operation:

$$\begin{aligned} w_{m+1}(x,y)=e^{-yF_m(x)}&=e^{-yF_{m-1}(x)}e^{-y\alpha _mf_{m}(x)}\\&=w_m(x,y)\cdot L(y,\alpha _mf_m(x)). \end{aligned}$$
(11)

In summary, boosting is an iterative process where each round includes two steps:

  1. 1.

    Minimize the weighted expected loss over the training set to get the weak classifier \(\alpha _mf_m\),

    $$\begin{aligned} (\alpha _m, f_m)=\arg \min _{\alpha ,f}E_{w_m}[L(y,\alpha f(x))]. \end{aligned}$$
    (12)
  2. 2.

    Update example weights for next iteration

    $$\begin{aligned} w_{m+1}(x,y)=w_m(x,y)\cdot L(y,\alpha _mf_m(x)). \end{aligned}$$
    (13)

A concise description of boosting is given in Algorithm 1.

To solve weak classifier \(f_m(x)\) and its weight \(\alpha _m\) in formula (12), different numerical/analytical optimization methods are used and analytical solutions can be obtained, deducing different boosting algorithms [12]. For example, Discrete AdaBoost assumes \(\alpha >0, f\in \{-1,1\}\) and uses adaptive Newton updates. Real AdaBoost assumes \(\alpha =1, f\in R\) and adopts stationary point method. Gentle AdaBoost assumes \(\alpha =1, f\in R\) and uses Newton steps. Then, weak learning algorithms, such as decision trees, are used to fit \(f_m(x)\) over the training data set.

figure a

3.2 Why \(e^{-yF(x)}\)?

Currently the concept of boosting has been extended to greedy function approximation in function space and specific algorithms can be acquired for various loss functions [13]. Among them, exponential loss based boosting family is the most widely used and has achieved tremendous success in various pattern recognition problems [20, 40]. Compared with other loss functions such as square loss \((1-yF(x))^2\) or \((y-F(x))^2\), exponential loss \(e^{-yF(x)}\) has the following benefits. The deduced boosting algorithms are extremely modularized, i.e. requiring the retraining of a classifier on a weighted training database at each iteration and the sample weights can be updated by easy multiplication operation. Besides, exponential loss has a sensible population minimizer and increases exponentially for large increasingly negative margin [30].

3.3 The drawbacks of boosting in the problem of imbalanced data classification

Studies have shown that different learning algorithms are affected by the class imbalance problem to different extents. Decision tree is the most sensitive to class imbalance, followed by neural network, SVM that maximizes margins is the least prone to the class imbalance problem [18]. Like SVM, boosting is proved maximize the average margin [30], thus is less sensitive to the imbalanced data. Yet boosting minimizes the error rate in nature and still needs to be improved for imbalanced data classification. Proof is as follows.

Over the whole data distribution, error rate in (1) can be represented as the average probability of error

$$\begin{aligned} \mathrm{{Error\,Rate}}=P(\mathrm{{error}})=\int P(\mathrm{{error}}|x)p(x)dx, \end{aligned}$$
(14)

where the probability of error \(P(\mathrm{{error}}|x)\) is

$$\begin{aligned} &P(\mathrm{{error}}|x)\\&\quad=\left\{ \begin{array}{lll} P(y=1|x), &{}\quad{\rm if \, predict} \, x \,\text {belong \, to\, negative\, class,}\\ P(y=-1|x), &{}\quad{\rm if \, predict}\, x \,\text {belong\, to\, positive\, class.} \end{array} \right. \end{aligned}$$
(15)

To minimize the error rate, the probability of error \(P(\mathrm{{error}}|x)\) should be minimized, thus if and only if the probability of error of predicting +1 is not greater than that of predicting \(-\)1, the prediction output takes +1, mathematically described as follows:

$$\begin{aligned} P(y=-1|x)\,\le P(y=+1|x). \end{aligned}$$
(16)

Like other standard learning methods, boosting aims to minimize the error rate. Note that the classifier given by the logistic model \(F^*(x)\) (7) fitted by boosting, i.e., \(\mathrm{sgn}(F^*(x))\), is completely equal to (16). As discussed in Sect. 2.3, accuracy or error rate is sensitive to data imbalance. Thus boosting should be improved to adapt to imbalanced data classification.

4 IDBoosting

Boosting is a modular ensemble framework that can embody any weak learning algorithms. It has sound theoretical performance guarantees and strong experimental results. If the boosting framework can be improved to adapt the imbalanced data classification, then any conventional weak learning algorithm can be integrated without many modifications. Compared with other ensemble learning methods such as bagging, boosting is more successful in variance reduction [3]. With stumps (decision tree with single splitting node) as weak classifiers, boosting performs well and can realize feature selection at the same time, which guarantees that boosting can be used in high-dimensional data classification such as images.

So far, many IDBoosting algorithms have been proposed and this paper catalogues them into two main classes—data-sampling IDBoosting and cost-sensitive IDBoosting. The latter can be subdivided into two more classes—weight-modification IDBoosting and loss-minimization IDBoosting.

4.1 Data-sampling IDBoosting

Data-sampling IDBoosting applies resampling strategies, i.e. oversampling or undersampling, in each weak learning round in order to compensate for skewed distributions and make boosting learn better and broader decision regions for the minority class. A concise description of data-sampling IDBoosting is given in Algorithm 2. Two preprocessing steps, namely, data sampling and weight assignment, are added into naive boosting iterations. Weak classifier \(\alpha _mf_m\) is trained over the new training data set \(\{(\hat{x}_i,\hat{y}_i)\}_{i=1}^{\hat{N}}\) with assigned weights \(\hat{w}_m\).

figure b

SMOTEBoost: Synthetic Minority Oversampling Technique (SMOTE) [4] is a synthetic oversampling approach specifically designed for imbalanced data sets. SMOTE creates artificial data based on the feature space similarities between existing minority examples. Specifically, for each minority sample \(x_i\), randomly select one of the K-nearest neighbors, noted as \(\widehat{x}_i\), the synthetic sample \(x_{new}\) is a point along the line segment joining \(x_i\) under consideration and \(\widehat{x}_i\):

$$\begin{aligned} x_{new}=x_i+(\widehat{x}_i-x_i)\times \delta , \end{aligned}$$
(17)

where \(\delta \in [0,1]\) is a random number. By synthetically generating more instances of the minority class, the inductive learners, such as decision trees or rule-learners, are able to broaden their decision regions for the minority class. SMOTEBoost [5] combines SMOTE into the standard boosting procedure.

RUSBoost and EUSBoost: Random undersampling is a technique randomly removes examples from the majority class. RUSBoost [31] combines random undersampling and boosting, which is computationally less expensive than SMOTEBoost. EUSBoost [14] improves RUSBoost by the usage of the evolutionary undersampling approach, which takes into consideration the classification performance on the sampled data, such as AUC.

JOUS-Boost: The simplest oversample method is random oversampling, which appends replicated data to the original data set. In this case, multiple instances of certain examples become tied, leading to overfitting [25]. SMOTE can avoids ties in the oversampled minority class by moving the sampled predictor points toward near neighbors in the minority class. However, its data generation process is complex and computationally expensive. Mease et al. [25] propose a much simpler technique for breaking these ties: instead of generating new data from computational methods, use the duplicate data obtained from random oversampling and introduce perturbations (jittering) to this data to break ties. The resulting algorithm, over/ undersampling with jittering (JOUS-Boost), introduces independently and identically distributed (iid) noise at each iteration of boosting to minority examples for which oversampling creates replicates. This idea is relatively simple compared to its synthetic sampling counterparts and also incorporates the benefits of boosted ensembles to improve performance. It was shown to provide very efficient results in empirical studies, which suggests that synthetic procedures can be successful without jeopardizing runtime costs.

DataBoost-IM: Guo and Viktor [15] stated that one-class based sampling may sacrifice one class in favor of the other. In order to produce high predictions against both minority and majority classes, they proposed DataBoost-IM which oversamples both classes. In the DataBoost-IM method, hard examples from both the majority and minority classes are identified during execution of the boosting algorithm. Subsequently, the hard examples are used to separately generate synthetic examples for the majority and minority classes. The synthetic data are then added to the original training set, and the class distribution and the total weights of the different classes in the new training set are rebalanced. Results indicate that our approach does not sacrifice one class in favor of the other, but produces high predictions against both minority and majority classes.

The data-sampling IDBoosting algorithms mentioned above are summarized in Table 1.

Table 1 Typical data-sampling IDBoosting algorithms

Due to the introduction of resampling techniques, data-sampling IDBoosting methods have both advantages and disadvantages. The main advantages are:

  • data sampling procedure increases the diversity amongst the classifiers in the ensemble, as in each iteration it produce a different set.

  • When minority class examples are very limited and the data distribution is highly skewed, e.g., the task of mining rare classes, synthetic oversampling process brings along broadened and well-defined decision regions for the minority class.

The main disadvantages are:

  • The sampling operation is performed in each boosting iteration, which is computationally expensive and time-consuming, especially at handling large-scale problems.

  • The amount of sampling is not a prior and hard to set. The sampling degree is not really a feature of the class imbalance and always set by cross validation.

  • It is hard to make the sampling result accord with the true data distribution. Undersampling may loss information associated with deleting examples from the training data. Oversampling, on the other hand, can lead to overfitting or introduce noise.

  • It’s tough to generate reliable synthetic data for a task with very high dimensional data such as images.

4.2 Cost-sensitive IDBoosting

Dealing with imbalanced data classification, cost-sensitive learning methods assume higher misclassification costs with minority samples and seek to minimize the overall cost [43]. To do this, cost items are mixed into weight-upstate step or loss function of original boosting, corresponding to two methods, Weight-modification IDBoosting and loss-minimization IDBoosting.

4.2.1 Cost-sensitive learning

In a binary classification scenario, the cost matrix \(C\) is defined as follows:

Prediction/label

Minority/positive

Majority/negative

Minority/positive

\(c_{++}\)

\(c_{+-}\)

Majority/negative

\(c_{-+}\)

\(c_{-\,-}\)

Cost-sensitive learning minimizes the overall cost [7], which is the sum of the classification cost over the data set:

$$\begin{aligned} \mathrm{{Overall\,Cost}}=\sum _i\mathrm{{cost}}(x_i), \end{aligned}$$
(18)

where \(\mathrm{{cost}}(x)\) is the classification cost of sample \(x\). Assume the costs of correct classification are zero, i.e., \(c_{++}=c_{--}=0\). For a given discriminant function \(F(x)\), the prediction is given by \(\mathrm{sgn}(F(x))\), then overall cost degrades into cumulative misclassification cost (CMC)  [9],

$$\begin{aligned} CMC=\Sigma _ic_{y_i}1_{[y_iF(x_i)<0]}, \end{aligned}$$
(19)

where \(c_y\) is

$$\begin{aligned} c_y=\left\{ \begin{array}{lll} c_+=c_{-+}, &\quad{\rm if}\quad y=1,\\ c_-=c_{+-}, &\quad{\rm if} \quad y=-1.\\ \end{array} \right. \end{aligned}$$
(20)

According to the Bayesian decision theory, to minimize the overall cost, the expected cost should be minimized. For example, if and only if the expected cost of predicting \(+1\) is not greater than that of predicting \(-1\), the prediction output takes \(+1\), mathematically described as follows:

$$\begin{aligned} c_{++}P(y=+1|x)+c_{+-}P(y=-1|x)&\le c_{-+}P(y=+1|x)\\&\quad+c_{--}P(y=-1|x). \end{aligned}$$
(21)

Assume \(c_{++}=c_{--}=0\), it becomes

$$\begin{aligned} c_{+-}P(y=-1|x)\le c_{-+}P(y=+1|x). \end{aligned}$$
(22)

Set \(c_{-+}>c_{+-}\), i.e., the cost of misclassifying minority/positive examples is higher than the contrary case, the classification accuracy of minority class could be improved by minimizing the overall cost. In some cases, the costs can be specified from prior knowledge. For example, in a fraud detection application, prior experience indicates that there is an average cost of \(c_{+-}\) dollars per false positive, while a false negative (miss) will cost \(c_{-+}\) dollars on average [24]. In other cases, the costs are tuned to optimize certain criterion by cross-validation. Note that not the bigger \(\frac{c_{-+}}{c_{+-}}\) is the better. For example, in the fault detection system using soft computing techniques [32], a lot of false alarms would decrease the attraction of operators to the system, finally inducing its shut-off. Thus balancing the detection rate and false alarm rate is important.

4.2.2 Weight-modification IDBoosting

The weight updating strategy (13) of the original boosting treats different classes as the same by using a cost-insensitive multiplicative factor \(L(y,\alpha _mf_m(x))\)

$$\begin{aligned} L(y,\alpha _mf_m(x))=e^{-\alpha _myf_m(x)}\left\{ \begin{array}{lll} <1, &\quad {\rm if} \quad yf_m(x)>0,\\ >1, &\quad {\rm if}\quad yf_m(x)<0.\\ \end{array} \right. \end{aligned}$$
(23)

When an example is misclassified, i.e., \(yf_m(x)<0\), its weight increases by multiplying a factor \(>1\). When an example is correctly classified, i.e., \(yf_m(x)>0\), its weight decreases by multiplying a factor \(<1\). Then in the following process weak learner will focus on those misclassified samples.

To minimize the overall cost, specially, the cumulative misclassification cost in (19), weight-modification IDBoosting introduces cost items into weight updating by replacing the original multiplicative factor \(L(y,\alpha _mf_m(x))\) with a cost-sensitive one \(L_C(y,\alpha _mf_m(x))\), i.e.,

$$\begin{aligned} w_{m+1}(x,y)=w_m(x,y)\cdot L_C(y,\alpha _mf_m(x)). \end{aligned}$$
(24)

A concise description of weight-modification IDBoosting is presented in Algorithm 3. Adopting different cost-sensitive loss function \(L_C(y,\alpha _mf_m(x))\), various weight-modification IDBoostings are proposed.

figure c

AdaCost: In AdaCost [9], the cost item \(c_y\) is introduced inside the exponential function by an additional cost adjustment function \(\beta \)

$$\begin{aligned} AdaCost: L_C(y,\alpha _mf_m(x))=e^{-\beta (c_{y})\alpha _myf_m(x)}=\left\{ \begin{array}{lll} e^{-\beta _+(c_{y})\alpha _myf_m(x)}, &{}\quad{\rm if}\quad yf_m(x)>0,\\ e^{-\beta _-(c_{y})\alpha _myf_m(x)}, &{}\quad{\rm if} \quad yf_m(x)<0.\\ \end{array} \right. \end{aligned}$$
(25)

where \(\beta =\beta _+\) in the correct classification case, i.e., \(yf_m(x)>0\), and \(\beta =\beta _-\) in the misclassification case, i.e., \(yf_m(x)<0\). To emphasize the minority class instance with a higher cost, the weights should be increased more if the instance is misclassified, but decreased less otherwise. Therefore, \(\beta _-\) is non-decreasing and \(\beta _+\) is non-increasing. A recommended setting is \(\beta _+(c_y)=-0.5c_y+0.5\) and \(\beta _+(c_y)=0.5c_y+0.5\). AdaCost is proved to minimize an upper bound of cumulative misclassification cost (19), which is

$$\begin{aligned} \Sigma _ic_{y_i}e^{-y_iF'(x_i)}, \, \, F'(x_i)=\sum _{m=1}^M\beta (c_{y_i})\alpha _mf_m(x_i). \end{aligned}$$
(26)

An alternative of AdaCost is also proposed in  [9], where the cost item \(c_y\) is introduced outside the exponential function:

$$\begin{aligned} AdaCost(1): L_C(y,\alpha _mf_m(x))=\beta (c_{y})e^{-\alpha _myf_m(x)}. \end{aligned}$$
(27)

\(\beta \) is a non-decreasing function such as \(\beta (c_y)=c_y\). This strategy emphasizes minority class as does AdaCost does.

CSB0/1/2: To minimize high-cost error, i,e., error of minority class, several cost-sensitive adaptations of boosting are proposed in [37], named as CSB0, CSB1 and CSB2. CSB0 doesn’t use the exponential loss function in weight-update step, CSB1 and CSB2 utilize the cost item in the same form but CSB1 does not use \(\alpha _m\) in the formulation. The three algorithms are obtained by modifying (13) to:

$$\begin{aligned} CSB0: L_C(y,\alpha _mf_m(x))=\left\{ \begin{array}{lll} 1, &{}\quad{\rm if} \quad yf_m(x)>0,\\ c_{y}, &{}\quad{\rm if}\quad yf_m(x)<0. \end{array} \right. \end{aligned}$$
(28)
$$\begin{aligned} CSB1: L_C(y,\alpha _mf_m(x))=\left\{ \begin{array}{lll} e^{-yf_m(x)}, &{}\quad{\rm if} \quad yf_m(x)>0,\\ c_{y}e^{-yf_m(x)}, &{}\quad{\rm if}\quad yf_m(x)<0. \end{array} \right. \end{aligned}$$
(29)
$$\begin{aligned} CSB2: L_C(y,\alpha _mf_m(x))=\left\{ \begin{array}{lll} e^{-\alpha _myf_m(x)}, &{}\quad{\rm if} \quad yf_m(x)>0,\\ c_{y}e^{-\alpha _myf_m(x)}, &{}\quad{\rm if}\quad \quad yf_m(x)<0 \end{array} \right. \end{aligned}$$
(30)

where \(c_+>c_-\ge 1\). In the three modifications, the weights of misclassification samples increase more than original boosting, especially those samples belong to minority class.

AdaC1/2/3: In  [34], the cost items \(c_y\) are introduced into the weight update formula of boosting in three ways: inside the exponent, outside the exponent, and both inside and outside the exponent. Three modifications then become:

$$\begin{aligned} AdaC1: L_C(y,\alpha _mf_m(x))=e^{-c_y\alpha _myf_m(x)}. \end{aligned}$$
(31)
$$\begin{aligned} AdaC2: L_C(y,\alpha _mf_m(x))=c_ye^{-\alpha _myf_m(x)}.\end{aligned}$$
(32)
$$\begin{aligned} AdaC3: L_C(y,\alpha _mf_m(x))=c_ye^{-c_y\alpha _myf_m(x)}. \end{aligned}$$
(33)

AdaC1, AdaC2 and AdaC3 are proved to minimize an upper bound of the training error as the original boosting does. Specially, in  [34] AdaC2 is claimed to minimize the over cost-sensitive exponential loss

$$\begin{aligned} \Sigma _ic_{y_i}e^{-y_iF(x_i)}, \end{aligned}$$
(34)

which is an upper bound of the cumulative misclassification cost. However, this conclusion is proved not accurate in Sect. 5. In terms of F-measure, AdaC2 has furnished better results in most experiments compared with original boosting and other weight-modification IDBoosting methods, such as AdaCost, CSB2, AdaC1, AdaC3 [34]. However, it is observed to be sensitive to the cost setups, the classification precision is very low when a large positive-to-negative cost ratio \(c_+/c_-\) is set.

Table 2 gives a summary of the above algorithms. Weight-modification IDBoosting increases the weight of minority class samples to make the weak classifier focus more on the correct classification of those samples. The main benefits are:

  • Easy to implement. Most methods are compatible with the original boosting when \(c_+=c_-\). The modification is introduced into the weight updating step. There’s almost no additional computational cost.

  • Compared with data-sampling IDBoosting, most weight-modification methods have an provable optimization objective.

The main drawbacks are

  • The idea is somewhat intuitively and lacks sound theoretic guide. Instead of directly optimizing the overall cost or cumulative misclassification cost, the algorithms optimize certain loose upper bounds which are deduced from evolutionary derivation.

  • The strategy introducing the cost items outside the exponent is the most popular and widely utilized in the weight-modification IDBoosting methods, including an alternative of AdaCost [9], AsymBoost [39] (a cost-sensitive extension of Real AdaBoost), AdaC2 [34], cost-sensitive IDBoosting algorithms proposed in  [19], robust asymmetric Adaboost [26] et al. However, this strategy has the problem of instability, i.e., the performance of algorithm is sensitive to the cost setups and the classification precision is very low when a large positive-to-negative cost ratio \(c_+/c_-\) is set [34]. Theoretical illustration of the phenomenon lacks.

Table 2 Typical weight-modification IDBoosting algorithms

4.2.3 Loss-minimization IDBoosting

To minimize the overall cost, loss-minimization IDBoosting uses boosting-style stagewise optimization to minimize the expectation of certain cost-sensitive loss \(L_C(y,F(x))\) instead of the cost-insensitive one \(L(y,F(x))\). To realize the optimal prediction in (22), the expectation of \(L_C(y,F(x))\) is minimized by the Bayesian discriminant function \(F_C^*(x)\)

$$\begin{aligned} F_C^*(x)=\arg \min _FE[L_C(y,F(x))]. \end{aligned}$$
(35)

Like original boosting, \(\{\alpha _mf_m(x)\}_{m=1}^M\) are obtained sequentially by stage-wisely optimizing the expected cost-sensitive loss

$$\begin{aligned} (\alpha _m, f_m)=\arg \min _{\alpha ,f}E[L_C(y,F_{m-1}(x)+\alpha f(x))]. \end{aligned}$$
(36)

CSDA/CSRA/CSLB:Masnadi-Shirazi and Vasconcelos [24] proposed the cost-sensitive extensions of Discrete AdaBoost, Real AdaBoost and LogitBoost, named CSDA, CSRA and CSLB respectively. The cost-sensitive loss functions are subtly designed, whose expectation is minimized by the following discriminant function

$$\begin{aligned} F_C^*(x)= \frac{1}{2}\log \frac{c_+P(y=1|x)}{c_-P(y=-1|x)}. \end{aligned}$$
(37)

The prediction is given by \(\mathrm{sgn}(F_C^*(x))\), which is equal to the Bayes optimal prediction (22). For CSDA and CSRA, the cost-sensitive loss function is

$$\begin{aligned} L_C(y,F(x))=\log (1+e^{-c_yyF(x)}). \end{aligned}$$
(38)

For CSLB, the cost-sensitive loss function is

$$\begin{aligned} L_C(y,F(x))=e^{-c_yyF(x)}. \end{aligned}$$
(39)

L \(_p\) -CSB: Lozano [21] proposed L\(_p\)-CSB, which adopts a cost-sensitive \(p\)-norm loss as an approximation of the classification cost. In the bi-class case, the cost-sensitive loss function is

$$\begin{aligned} L_C(y,F(x))= \left\{ \begin{array}{ll} c_{+}F(x)^p, &{}\text {if} \quad y=1,\\ c_{-}(1-F(x))^p, &{}\text {if} \quad y=-1, \end{array} \right. \end{aligned}$$
(40)

where \(p\ge 1\). The expectation of (40) is minimized by

$$\begin{aligned} F_C^*(x)=\frac{1}{1+\left( \frac{c_-P(y=-1|x)}{c_+P(y=1|x)}\right) ^{p-1}}. \end{aligned}$$
(41)

Note that the classifier \(\mathrm{sgn}(F_C^*(x)-\frac{1}{2})\) is equal to (22).

Table 3 gives a summarization of loss-minimization IDBoosting algorithms. The main advantages of loss-minimization IDBoosting algorithms are

  • Loss-minimization IDBoosting minimizes the overall cost by learning the Bayesian optimal classifier. Theoretical guarantee makes it performs better in the problem of imbalanced data classification compared with weight-modification IDBoosting [21, 24].

The main disadvantages are

  • Unlike naive boosting algorithms, current loss-minimization IDBoosting algorithms cannot get benefits from the form of cost-sensitive loss function when solving (36). They cannot get analytical solutions of weak classifiers when using the same optimization methods as their naive boosting counterparts and have to resort to numerical optimization algorithms to tackle the problem, this makes the training process complex and time-consuming.

Table 3 Typical loss-minimization IDBoosting algorithms
Fig. 2
figure 2

The \(m\)-th iteration of two different IDBoostings (minority class is plotted as red circles and majority class as blue squares). a Data-Sampling IDBoosting. b Weight-modification IDBoosting (color figure online)

5 The essence of current IDBoosting methods

Apparently, there’s a strong connection between data-sampling IDBoosting and weight-modification IDBoosting. Weight manipulation can be considered as replicated sampling. However, in data-sampling IDBoosting, weight updating (step 4 of Algorithm 2) is performed on the original data set and the sampling result isn’t propagated to the following process, which we call “sampling-recovering” strategy. In replication-based weight-modification IDBoosting, weight updating (step 2 of Algorithm 3) is performed on the modified weights thus the replicating result directly moves to the next iteration, which we call ”replication-propagating” strategy. Figure 2a and b illustrate the two different strategies by displaying the \(m\)-th iteration of the two IDBoosting methods. Next we’ll give an in-depth study of the two different strategies from the view of Bayes optimal classification. Two results are deduced as follows.

RESULT 1

Replication-based data-sampling IDBoosting, i.e. using replication-oversampling in Algorithm 2, fits the Bayes optimal discriminant function

$$\begin{aligned} F_{C}^{*}(x)= \frac{1}{2}\log \frac{c_{+} P(y=1|x)}{c_{-} P(y=-1|x)}. \end{aligned}$$
(42)

Proof

Setting \(\frac{c_+}{c_-} = r\), introduce the following cost-sensitive loss function

$$\begin{aligned} L_C(y,F(x))=r^{y*}e^{-yF(x)},\quad y*=\frac{y+1}{2} \end{aligned}$$
(43)

whose expectation is minimized by (42). Like boosting, use an additive model \(F(x)=\sum _{m=1}^M\alpha _mf_m(x)\) to fit \(F_C^*(x)\). Initially, \(F(x)=F_0(x)=0\), for the \(m\)-th iteration round, given an imperfect estimate \(F_{m-1}(x)\), seek an increment \(\alpha _mf_m(x)\) to get an improved estimation \(F_{m}(x)=F_{m-1}(x)+\alpha _mf_m(x)\):

$$\begin{aligned} (\alpha _m, f_m)&=\arg \min _{\alpha ,f}E[L_C(y,F_{m-1}(x)+\alpha f(x))]\\ &=\arg \min _{\alpha ,f}E[r^{y*}e^{-yF_{m-1}(x)}\cdot e^{-y\alpha f(x))}]\\ &=\arg \min _{\alpha ,f}E[L_C(y,F_{m-1}(x))\cdot L(y,\alpha f(x))]\\ &=\arg \min _{\alpha ,f}E_{w_m^r}[L(y,\alpha f(x))] , \end{aligned}$$
(44)

where

$$\begin{aligned} w_m^r=L_C(y,F_{m-1}(x))=r^{y*}e^{-yF_{m-1}(x)}=r^{y*}w_m. \end{aligned}$$
(45)

\(w_m\) can be updated by multiplication operation as described in formula (13) just like boosting does. \(w_m^r\) is a kind of replication-oversampling strategy, i.e. replicating the minority samples in original training data set \(r\) times for each iteration. Thus replication-based data-sampling IDBoosting learns the Bayes-optimal classifier (42) in nature. \(\square \)

RESULT 2

Replication-based weight-modification IDBoosting such as AdaC2, i.e. using the weight update strategy

$$\begin{aligned} w_{m+1}(x,y)=w_m(x,y)L_C(y,\alpha _mf_m(x))=w_m(x,y)c_ye^{-\alpha _myf_m(x)} \end{aligned}$$
(46)

in Algorithm 2, fits the Bayes optimal discriminant function

$$\begin{aligned} F_C^*(x)= \frac{1}{2}\log \frac{c_+}{c_-}^m\frac{P(y=1|x)}{P(y=-1|x)} \end{aligned}$$
(47)

in the \(m\)-th iteration round.

PROOF

Setting \(\frac{c_+}{c_-} = r\), all weights should be normalized when learning weak classifiers, so (46) is equivalent to

$$\begin{aligned} w_{m+1}(x,y)=w_m(x,y)r^{y*}e^{-\alpha _myf_m(x)}=r^{y*}w_m(x,y)L(y,\alpha _{m}f_{m}(x)),y*=\frac{y+1}{2}. \end{aligned}$$
(48)

For the \(m\)-th round, according to (48), the weight \(w_m\) is

$$\begin{aligned} w_m&=r^{y*}w_{m-1}L(y,\alpha _{m-1}f_{m-1}(x))\\ &=r^{2y*}w_{m-2}L(y,\alpha _{m-1}f_{m-1}(x))L(y,\alpha _{m-2}f_{m-2}(x))\\ &=r^{my*}\Pi _{j=1}^{m-1}L(y,\alpha _{j}f_{j}(x)) \\ &=r^{my*}\Pi _{j=1}^{m-1}e^{-y\alpha _{j}f_{j}(x)} \\ &=r^{my*}e^{-yF_{m-1}(x)}\\ &=r^{my*}L(y,F_{m-1}(x)) . \end{aligned}$$
(49)

So the optimization problem to be solved in the \(m\)-th round becomes

$$\begin{aligned} (\alpha _m, f_m)&=\arg \min _{\alpha ,f}E_{w_m}[L(y,\alpha f(x))]\\ &=\arg \min _{\alpha ,f}E[r^{my*}L(y,F_{m-1}(x)) \cdot L(y,\alpha f(x))]\\ &=\arg \min _{\alpha ,f}E[r^{my*}L(y,F(x)+\alpha f(x))]\\ &=\arg \min _{\alpha ,f}E[L_C(y,F(x)+\alpha f(x))] , \end{aligned}$$
(50)

where

$$\begin{aligned} L_C(y,F(x)+\alpha f(x))=r^{my*}e^{-yF(x)}. \end{aligned}$$
(51)

Compared (51) with (43), apparently, replication-based weight-modification IDBoosting learns a Bayes optimal discriminant function with a cost factor \(r^m\) increasing exponentially with the iteration rounds \(m\). \(\square \)

So far, three kinds of IDBoosting are unified under the framework of Bayes optimal classification. Besides, the instability phenomenon of replication-based weight-modification IDBoosting can be explained - when a big cost factor \(r=\frac{c_+}{c_-}\) is set, the minority samples are excessively emphasized in the latter iterations so that the total accuracy is sacrificed. The instability problem is further proved by experiments in Sect. 6.

6 Experimental evaluation

With naive boosting algorithms as benchmark, experiments are designed in this section to compare the performance of replication-based data-sampling IDBoosting and replication-based weight-modification IDBoosting. The replication-based data-sampling IDBoosting algorithms extended from Discrete AdaBoost and Gentle AdaBoost are denoted as IDBoosting\(_C\).DA and IDBoosting\(_C\).GA respectively, while the replication-based weight-modification IDBoosting algorithms extended from Discrete AdaBoost and Gentle AdaBoost are denoted as IDBoosting\(_I\).DA and IDBoosting\(_I\).GA respectively.

We take decision tree as the weak learner and adopt a “best-first” strategy [12] to make the number of terminal nodes increased by one and then achieve the pre-set node number. The tree building criterion depends on the boosting algorithms. For Discrete AdaBoost and its IDBoosting extensions, where weak classifiers take a discrete \(\{-1,1\}\) value, splitting criteria based on classification error is applied while for Gentle AdaBoost and its IDBoosting extensions, where weak classifiers take a real value, minimum variance/least squares criteria is adopted.

Several imbalanced data sets from the UCI repository [1] and MIT CBCL face database [36] are subject to comparison. Measures suitable for the class-imbalance problem are used in our experiments, i.e., cumulated misclassification cost (CMC), the receiver operating characteristic (ROC) curve.

6.1 UCI datasets

Five imbalanced data sets are selected from the UCI repository, that is, sonar, wdbc, wpbc, bupa and pima, which have been listed in Table 4. Among them, wdbc and wpbc are breast cancer data, bupa is bupa liver disorders data and pima is diabetes data.

Table 4 UCI Data set summary

Assume the imbalanced data classification problem discussed here is actually a cost-sensitive classification problem. The cost matrix is known and the cost of correct classification is zero. Then the performance of learning algorithms can be evaluated by cumulated misclassification cost (CMC). To find the influence of different cost factors on algorithm performance, in our experiment a set of \(r=\frac{c_+}{c_-}\) are selected

$$\begin{aligned} \frac{1}{r}=[0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8], \end{aligned}$$
(52)

For each value of cost factor \(r\), we made a five-fold cross validation for each data set with node number 2, 4, and 8 separately. All algorithms run for 200 iterations. For a given cost factor \(r\), the \(m\)th data set and the node number \(n\), the average of the minimum of CMC on the 200 rounds for fivefold cross validation is calculated

$$\begin{aligned} \overline{CMC}(r,n,m) = \frac{\sum _{j=1}^5 \min _{i=1:200}CMC_{ij}(r,n,m)}{5}, \end{aligned}$$
(53)

where \(i\) is the iteration number, \(j\) is the fold number. Then compute the score \(score(r,n,m)\) for each algorithm according to the ranking order of \(\overline{CMC}(r,n,m)\), i.e., 3 is the best, 2 is following and 1 is the worst. Finally the average score \(\overline{score}(r)\) over all node number settings and data sets is used to measure the algorithm performance

$$\begin{aligned} \overline{score}(r)=\frac{\sum _{m=1}^5\sum _{n=2,4,8}score(r,n,m)}{3\times 5} \end{aligned}$$
(54)

The score curves of each algorithm as a function of cost factor \(r\) are drawn in Fig. 3 for Discrete AdaBoost and Gentle AdaBoost respectively. We can see that IDBoosting \(_C\) .DA and IDBoosting \(_C\) .GA with a constant cost factor are stable and perform the best for almost all cost factor settings. However, the performance of Indorsing \(_I\) .DA and Indorsing \(_I\) .DA IDBoosting \(_I\) .DA with an increasing cost factor decreases significantly and even worse than corresponding naive boosting algorithms when the cost factor increases. Thus the insatiability weight-modification Indorsing is proved by experiments. The method by experiments. The method loses its effectiveness when a big cost factor \(r\) is given, i.e. minority samples have much higher misclassification costs than majority samples.

Fig. 3
figure 3

Score of naive boosting and two cost-sensitive extensions as a function of cost factor \(r\)

6.2 Face detection

Data imbalance is inherent in tasks of face detection where rare face patterns need to be distinguished from enormous non-face patterns. Thus face detection has become an important area of applications for cost-sensitive boostings [39], i.e., to achieve a higher detection rate, the cost of missing a target should be higher than that of a false positive. To get a whole face detector, cascade classifier needs to be trained. The overall cascade design requires many parameters to be tuned. However, the goal here is not to compete with algorithms for cascade design, but simply compare cost-sensitive boosting algorithms in the application of face detection, which is considered as a large-scale, large-dimensional imbalanced data classification problem. Receiver operating characteristic (ROC) curve is usually used to evaluate the performance of face detector.

We use the MIT CBCL face database, where the size of each sample is \(19\times 19\). The training set contains 2,429 faces and 4,548 non-faces, the test set contains 472 faces and 23,573 non-faces. Haar features [40] are extracted from the original image samples and the weak learning algorithm outputs a stump/2-node tree. In this problem, the cost factor \(r\) is unknown in advance and should be set experimentally or empirically. Here \(r\) is set to a small value, \(\frac{1}{r} = 0.9\) and a large value, \(\frac{1}{r} = 0.1\) respectively. Each algorithm is trained for 200 rounds.

Figure 4 has drawn the ROC curve of different algorithms on test set. With a small cost factor \(\frac{1}{r} = 0.9\), both Indorsing \(_C\) .DA and Indorsing \(_I\) .DA are better than than the naive Discrete AdaBoost. With a large cost factor \(\frac{1}{0}.2\), the performance of Indorsing \(_I\) .DA decreases significantly while the performance of IDBoosting \(_C\) .DA is as good as \(\frac{1}{r} = 0.9\). For Gentle AdaBoost, the performance improvement brought by IDBoosting \(_C\) .GA is not so significant but is still stable when \(\frac{1}{r} = 0.2, 0.9\). Indorsing \(_I\) .GA performs worse than naive Gentle AdaBoost AdaBoost with a small cost factor \(\frac{1}{r} = 0.9\) and decreases significantly with a large cost factor \(\frac{1}{r} = 0.2\).

Fig. 4
figure 4

ROC of naive boosting and two cost-sensitive extensions

In practice, the cost factor \(r=\frac{c_+}{c_-}\) can be known in advance as described in Sect. 6.1 or tunable as described in Sect. 6.2. In both cases, it’s important to guarantee the algorithm performance stable. Replication-based data-sampling IDBoosting improves the performance of naive boosting and performs well over a wide range of cost factor. However, replication-based weight-modification IDBoosting, which learns a Bayes optimal discriminant function with a variable cost factor, is unstable with the cost factor increasing and performs seriously poor when using a big cost factor.

7 Conclusions

In this paper we have reviewed the state-of-the-art work of IDBoosting methods. When minority class examples are very limited and the data distribution is highly skewed, data-sampling IDBoosting with synthetic oversampling process can brings along broaden and well-defined decision regions for the minority class. For the task with very high dimensional data such as images, it’s tough to generate reliable synthetic data and cost-sensitive IDBoosting is a better choice.

Besides, we briefly discuss several aspects for the future work:

  • As summarized at the end of the introduction of each IDBoosting class, current IDBoosting methods have their own shortcomings that restrict their applications. Better IDBoosting solutions are expected to be proposed.

  • In practice, most applications have more than two classes where the imbalanced class distributions hinder the classification performance. Solutions for bi-class problems are not applicable directly to multi-class cases. Thus bi-class IDBoosting methods should be extended to their multi-class cases [33] or new multi-class IDBoosting algorithms should be proposed [41].

  • Current IDBoosting methods are limited to static learning, i.e., representative data sets are required to be available at training time. However, in many realistic application environments, raw data become available continuously over an indefinite learning lifetime. More attention should be paid to incremental IDBoosting learning for imbalanced data streams. In the on-line version of boosting, the weak classifiers are sequentially updated when a new sample arrives [27]. To save computational cost and annotation effort, a combination with active learning would be very helpful [22]. Samples containing important classification information are selected to be annotated and used to update the weak classifiers.

  • Studies have shown that loss functions imply certain assumptions of data distribution [42]. Though exponential loss has its benefits as mentioned in Sect. 3.2, it is not well-behaved when the posterior probability \(P(y=1|x)\) is close to 0 or 1 and has the problem of overfitting “noisy” samples. While least squares can lead to reliable estimation and is well-behaved when the posterior probability \(P(y=1|x)\) is close to 0 or 1. Thus more loss functions should be considered in the IDBoosting framework with somewhat sacrificing the algorithm modularity.