1 Introduction

Imbalanced classification is an important and widespread task in machine learning, and class imbalance typically deteriorates the performance of classifiers. Solving the problem of imbalanced learning therefore has far-reaching significance. The techniques commonly used to address this problem can be roughly divided into two categories: algorithm-level methods and data-level methods [1,2,3]. Algorithm-level methods either modify or design a new classification algorithm to improve learning performance, or design new evaluation indexes for classifier performance. A specific type of algorithm-level method is cost-sensitive learning [4, 5], which modifies the algorithm itself [6]. Instead of a standard loss function, this scheme introduces the concept of misclassification cost and minimizes the conditional risk: by heavily penalizing errors on certain classes, the importance of those classes is increased during training. For example, Datta et al. proposed a dictionary-based linear programming framework to minimize the maximum deviation of the minimum loss value of a specific class [7]. They also proposed a multi-objective optimization framework called radial boundary intersection, trained with support vector machines on binary and multi-class data sets, which overcomes the high cost of parameter tuning [8]. Data-level methods generally balance the majority class and the minority class by resampling the original data. Oversampling the minority class or undersampling the majority class is the most popular way of addressing the imbalance problem [9,10,11,12], and oversampling is more effective and popular than undersampling [13,14,15,16,17].

Fig. 1

The decision boundaries of a data set and its sampled results based on the SVM classifier. The blue dots denote the majority class, and the red dots denote the minority class. The dark areas represent the decision region of the minority class, and the remaining light areas represent the decision region of the majority class

Among oversampling techniques, the synthetic minority oversampling technique (SMOTE) is one of the most well-known data preprocessing methods for balancing the number of samples in each class; it proceeds by randomly interpolating between minority points, thereby expanding the minority class, bringing the data to a balanced state, and improving the classification accuracy [18, 19]. Many improved versions of SMOTE have followed, and they can be roughly divided into two categories. The first category filters out noise or dangerous points that are not suitable for expansion and then performs SMOTE. For example, Rivera et al. introduced a new oversampling technique that focuses on noise reduction and selective sampling of the minority group, which improves the prediction of minority group membership [20]. Saez et al. proposed an iterative ensemble-based noise filter that can overcome the problems produced by noisy and borderline examples in imbalanced data sets [21]. The second category improves the sampling strategy so that it generalizes to data sets with various distributions. In this category, Han et al. proposed two new minority oversampling methods, borderline-SMOTE1 and borderline-SMOTE2, in which only the minority examples near the borderline are oversampled [22]. Das et al. introduced two probabilistic oversampling approaches for synthetically generating and strategically selecting new minority class samples [23]. Abdi et al. suggested an elegant blend of the boosting and oversampling paradigms by introducing the Mahalanobis distance-based oversampling technique (MDO) [24]. In addition, some of the latest algorithms combine the two ideas. For example, Xie et al. proposed a minority oversampling technique based on local densities in low-dimensional space [25]. Zhou et al. proposed a weight-based oversampling algorithm that computes the positions at which new samples are synthesized [26]. He et al. proposed the adaptive synthetic (ADASYN) sampling method, which uses a weighted distribution over the minority examples according to their learning difficulty [27].
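To make the interpolation idea concrete, the following is a minimal sketch (our illustration, not any of the cited implementations) of the core SMOTE step: a minority seed point and one of its k nearest minority neighbors are chosen at random, and a synthetic point is placed at a random position on the segment between them.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def smote_interpolate(X_min, n_new, k=5, rng=None):
    """Generate n_new synthetic points by interpolating between randomly
    chosen minority points and their k nearest minority neighbors
    (a minimal sketch of the SMOTE idea, not a full SMOTE implementation)."""
    rng = np.random.default_rng(rng)
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_min)  # +1: each point is its own neighbor
    _, idx = nn.kneighbors(X_min)
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))            # random minority seed point
        j = idx[i][rng.integers(1, k + 1)]      # one of its k nearest minority neighbors
        gap = rng.random()                      # random position on the segment
        synthetic.append(X_min[i] + gap * (X_min[j] - X_min[i]))
    return np.asarray(synthetic)
```

In practice, one would typically rely on an existing implementation such as the SMOTE class in the imbalanced-learn package rather than a hand-rolled version.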

Although these oversampling methods have achieved satisfactory results on some imbalanced data sets, almost all of them take the balanced state as the final state of sampling by default. Nevertheless, balance is not necessarily the optimal state for classification. For example, Prati et al. conducted a detailed study of class imbalance and evaluated imbalance treatment methods through a large-scale experimental design, showing that the higher the degree of class imbalance, the greater the performance loss, and that imbalance treatment methods can only recover part of this loss [28]. Barella et al. identified a convex relationship between the class imbalance rate and the accuracy rate by studying the imbalance of the data in classification [29]. Thabtah et al. argue that when data complexity is high, imbalanced data makes classification considerably more difficult, and they therefore use a data complexity measure to estimate the optimal sample size for imbalance pre-processing techniques [30]. However, measuring data complexity is a difficult problem in itself, so this approach is limited to data sets with specific distributions and is not general. This paper highlights the fact that, as the data gradually become balanced, the difference between the distribution of the current data and that of the original data may grow increasingly large. When the sampling method is not suitable for data with a certain distribution, this negative effect can be considerable. The two contributions of this paper are as follows:

  • We analyze the relationship between the balance rate and classification performance during the sampling process from a novel perspective: oversampling may distort the original distribution even as the minority class is enhanced.

  • This paper proposes a novel cross-validation framework called “k-fold icross-validation” that can be used during sampling to find a better state than the balanced state. The framework is general and can be applied to various oversampling methods.

2 Relationship between prediction performance and imbalance rate

As pointed out by Chawla [19], an imbalance has a negative effect on a classifier because the majority class of data points always has a greater impact on the classifier than the minority points. Therefore, in imbalanced classification, current oversampling methods always aim at reaching an absolutely balanced state in the data set. However, in the process of sampling, the data distribution is likely to differ increasingly from the distribution of the original data set, and the difference in the distribution between the original data set and the sampled data set will obviously deteriorate the generalizability of the classifier. If the negative effect of the distribution difference is larger than the positive effect of balance, the classification performance can deteriorate. This phenomenon is particularly noticeable in some cases in which the distribution of a data set is very easy for a sampling method to change, such as when the data set contains noise [21].

Figure 1 shows the change in the decision boundary when borderline-SMOTE is used for oversampling on the benchmark data set fourclass. In Fig. 1b, the minority class of data points in the original imbalanced data set is very sparse, and many minority points are incorrectly classified when the support vector machine (SVM) classifier is applied directly to the original imbalanced data set. Therefore, the decision boundary is very unreasonable. As shown in Fig. 1c, when borderline-SMOTE runs to the 20th iteration, the decision boundary is very close to that of the original data set shown in Fig. 1a. Therefore, it performs well and can correctly classify many more of the minority points than in Fig. 1b. Furthermore, the decision boundary in Fig. 1c is also closer to that in Fig. 1a than the boundary in Fig. 1d, where the data set is balanced after borderline-SMOTE has run to the 60th iteration. The reason is that although the balanced state makes the number of minority points almost equal to that of majority points, excessive sampling occurs in the balanced state, and the resulting data distribution and decision boundary differ greatly from those of the original data set.

In imbalanced classification, the classification performance can be represented by an evaluation index such as the G-mean. The existing KL divergence can measure the distribution difference between two data sets, but unfortunately it is applicable only when the two data sets are of equal size. Therefore, drawing on the characteristics of KL divergence, we propose a novel measurement of the difference between two distributions containing different numbers of points. The first characteristic is that the distribution difference between two data sets containing the same data points is equal to zero; the second is that, for a data set D and its oversampled result \(D'\), the distribution difference is larger when the data points in \(D'\) are farther from those in D. Based on these characteristics, we define the distribution difference for oversampling as follows:

Definition 1 (Division): Given two data sets \(D=\left\{ x_i, i=1...n\right\}\) and \(D'\), the number of data points in D is denoted as \(\mid D \mid\). Assuming that \(\mid D'\mid >\mid D\mid\), a division of D on \(D'\) performs the initial assignment step of n-means clustering on \(D'\), with the points \(x_i (i=1...n)\) as cluster centers.

Definition 2 (Distribution difference): Given a data set \(D=\left\{ x_i, i=1...n\right\}\) and its oversampled result \(D'=\left\{ x'_i, i=1...m\right\}\), where n and m denote the numbers of data points in D and \(D'\), respectively, n clusters \(\left\{ C_i, i=1...n\right\}\) are generated by performing a division of D on \(D'\). The number of data points in a set P is denoted as \(\mid P \mid\). For two points A and B, their distance is denoted as dist(A, B). The distribution difference of \(C_i\) relative to \(x_i\) is defined as:

$$\begin{aligned} DD\left[ x_{i}\mid C_{i} \right] = \frac{1}{\mid C_{i}\mid } \sum _{x\in C_{i}}\textrm{dist}(x,{\bar{C}}_{i}), \end{aligned}$$
(1)

where \(\bar{C_{i}}\) represents the center of \(C_i\). The distribution difference of \(D'\) relative to D is defined as:

$$\begin{aligned} DD\left[ D\mid D' \right] =\frac{1}{n} \sum _{i=1}^{n} DD[x_{i}\mid C_{i}]. \end{aligned}$$
(2)
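The following is a minimal Python sketch of Eqs. (1)–(2), assuming Euclidean distances, nearest-center assignment for the division step, and that empty clusters contribute zero; the function name and these conventions are ours, not part of the paper's definitions.

```python
import numpy as np

def distribution_difference(D, D_prime):
    """Distribution difference DD[D | D'] of an oversampled set D_prime
    relative to the original set D (Eqs. (1)-(2)), assuming Euclidean
    distance and nearest-center assignment for the division step."""
    D = np.asarray(D, dtype=float)
    D_prime = np.asarray(D_prime, dtype=float)
    # Division: assign every point of D' to the nearest original point x_i.
    dists = np.linalg.norm(D_prime[:, None, :] - D[None, :, :], axis=2)  # shape (m, n)
    labels = dists.argmin(axis=1)
    dd_per_cluster = []
    for i in range(len(D)):
        C_i = D_prime[labels == i]
        if len(C_i) == 0:            # assumption: an empty cluster contributes zero
            dd_per_cluster.append(0.0)
            continue
        center = C_i.mean(axis=0)    # \bar{C}_i: centroid of the cluster
        dd_per_cluster.append(np.linalg.norm(C_i - center, axis=1).mean())
    return float(np.mean(dd_per_cluster))
```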
Fig. 2

Visualization results of the changing G-mean (i.e., classification performance) using various sampling methods on two real-world data sets. (a) Average G-mean on the data set vowel using various sampling methods. (b) Average G-mean on the ThoraricSurgery data set using various sampling methods. The square represents the optimal state, and the dots represent the balanced state

Fig. 3

Visualization results of changing the distribution difference using various sampling methods on two real-world data sets. (a) Average distribution difference on the vowel data set; (b) Average distribution difference on the ThoraricSurgery data set

Fig. 4

Distribution difference curve and classification evaluation curve. The orange curve shows the distribution difference as the balance rate changes, and the blue curve shows the classification evaluation as the balance rate changes. The blue points represent the classification evaluation results at different balance rates

The visualization results of the changing G-mean (i.e., classification performance) and the distribution difference under various sampling methods on two real-world data sets are shown in Figs. 2 and 3. Some general patterns can be extracted from them, as represented in Fig. 4. The balance rate used in this paper is defined in Definition 3, and the imbalance rate used in this paper is the reciprocal of the balance rate.

Definition 3 (Balance rate): Given a data set D consisting of a majority class \(D_{-}\) and a minority class \(D_{+}\) and given that the number of data points in a data set P is denoted as \(\mid P \mid\), the balance rate of D is defined as:

$$\begin{aligned} BR(D)=\frac{\mid D_{+}\mid }{\mid D_{-}\mid }. \end{aligned}$$
(3)

Assuming that the other factors affecting classification performance are equal, classification performance is better when the balance rate is closer to 1. The characteristics of Fig. 4 can be summarized as follows:

  • The orange curve rises as newly sampled points are added to the original data, because the distribution difference increases. When an increasing number of points are generated, as in Fig. 1d, the new points are confined to a local area, so the distribution difference stabilizes at the end of the orange curve.

  • The blue curve follows a parabola-like shape, where A is the point with the best classification performance. In the first half of the parabola (that is, between the origin and A), the original data points account for a high proportion of the data, so the sampling result is reliable, the distribution difference is low, and its negative impact on classification performance is small. Meanwhile, the balance rate keeps increasing from a small value toward 1, and its benefit to classification outweighs the negative effect of the distribution difference. Therefore, the classification performance improves.

  • In the second half of the blue curve (that is, after A), the original data points account for a small proportion of the data, so the reliability of the sampling result is low, the distribution difference is high, and its negative impact on classification performance is large. This negative impact outweighs the benefit of the increasing balance rate. Therefore, the classification performance decreases.

  • For the blue curve, consistent with the phenomenon in Fig. 2, point B, where the balance rate is equal to 1, lies in the second half of the parabola (that is, after A). In other words, the classification performance cannot increase once the balance rate has reached 1: beyond that point the balance rate moves away from 1 again, and the distribution difference continues to expand, so both effects are negative. If this process continues far enough, the classification performance becomes lower than that of the original data set and may even fall to 0.

To the best of our knowledge, there are no existing methods for finding the optimal state of a sampling method in imbalanced classification. We present a method for finding the optimal state in the sampling process in the next section.

Fig. 5

Schematic diagram of the proposed leave-one-out cross-validation for imbalanced classification

Fig. 6

Schematic diagram of the proposed k-fold cross-validation for imbalanced classification

3 Validation framework to optimize the balance rate

3.1 Motivation

To find a better state than balance, the most direct method is to use conventional cross-validation. However, conventional cross-validation may result in an unstable and unreliable sampled data set because the minority samples are likely to be unevenly distributed over the validation sets. Additionally, the minority samples are very important for the validation index in imbalanced classification, and conventional evaluation indexes, such as test accuracy, are seldom used in imbalanced classification. For example, consider a data set D that consists of two classes of points whose sizes are 1000 and 10. If 5-fold cross-validation is used, one possible case is that the numbers of minority samples in the five folds are 0, 0, 0, 0, and 10. In this case, the validation indexes computed on the first four folds are meaningless and harm the averaged validation result, as illustrated below.
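As a small illustration of this failure mode (a toy example of ours, not from the paper), plain k-fold splitting of a 1000-versus-10 data set stored with the classes grouped together can yield folds that contain no minority samples at all:

```python
import numpy as np
from sklearn.model_selection import KFold

y = np.array([0] * 1000 + [1] * 10)   # 1000 majority samples followed by 10 minority samples
for fold, (_, val_idx) in enumerate(KFold(n_splits=5).split(y)):
    print(f"fold {fold}: minority samples in validation set = {int(y[val_idx].sum())}")
# Without shuffling, the minority points all land in the last fold (0, 0, 0, 0, 10),
# so imbalance-aware indexes such as the G-mean are meaningless on the first four folds.
```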

3.2 A cross-validation method for imbalanced classification

To address the above problem, we design a validation framework for imbalanced classification by applying the leave-one-out strategy to the minority class samples. In addition, to preserve the majority samples and keep the overall distribution consistent, the majority samples are divided into as many sets as there are minority samples. As shown in Fig. 5, m denotes the number of majority samples, and n denotes the number of minority samples. \(x_i (i=1, 2...n)\) denotes each minority sample point, and the majority samples are divided into n equal parts \(\left\{ X_{i} \mid i=1...n\right\}\), so the number of samples in each majority set \(X_{i}\) is equal to \(\frac{m}{n}\). After the division is completed, we merge a majority set \(X_i\) and a minority sample point \(x_i\) into one set and use this set as one fold of the leave-one-out cross-validation. However, if the minority sample size is large, the cost of running this framework is quite high. Therefore, as shown in Fig. 6, we also propose a k-fold cross-validation framework for imbalanced classification, in which the majority and minority samples are each divided into k parts and cross-validation is then performed.

In Fig. 6, one majority set \(X_i(i=1...k)\) and one minority set \(X'_i(i=1...k)\) are combined into a cross-validation set, and the remaining points make up the cross-training set. In particular, when \(k=n\), each \(X'_i\) contains a single minority sample, and the framework reduces to the leave-one-out framework. In this way, the minority samples are evenly distributed across the validation sets, and the majority samples are allocated according to the balance rate. We compute G-mean\(_i(i=1...k)\) on each validation set using the proposed k-fold cross-validation and average the k G-mean values to obtain the final CG-mean; that is, CG-mean is the average of the G-mean\(_i\). We compute CG-mean for each sampling iteration, and the balance rate corresponding to the highest CG-mean is the optimal balance rate. We name the proposed framework for imbalanced classification k-fold icross-validation. A sketch of this fold construction and of the CG-mean computation is given below.
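The sketch below shows one way the icross-validation folds and the CG-mean could be computed; the helper names (icross_folds, cg_mean), the use of a scikit-learn style classifier, and the confusion-matrix form of the G-mean are our assumptions, not the paper's exact implementation.

```python
import numpy as np
from sklearn.base import clone
from sklearn.metrics import confusion_matrix

def icross_folds(n_maj, n_min, k, rng=None):
    """Split majority and minority indices into k parts each and pair them,
    so every fold keeps (roughly) the balance rate of the whole data set."""
    rng = np.random.default_rng(rng)
    maj_parts = np.array_split(rng.permutation(n_maj), k)
    min_parts = np.array_split(rng.permutation(n_min), k)
    return list(zip(maj_parts, min_parts))

def g_mean(y_true, y_pred):
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
    tpr = tp / (tp + fn) if tp + fn else 0.0
    tnr = tn / (tn + fp) if tn + fp else 0.0
    return np.sqrt(tpr * tnr)

def cg_mean(clf, X_maj, X_min, k=5, rng=0):
    """Average G-mean over the k icross-validation folds (CG-mean)."""
    scores = []
    for maj_val, min_val in icross_folds(len(X_maj), len(X_min), k, rng):
        maj_tr = np.setdiff1d(np.arange(len(X_maj)), maj_val)
        min_tr = np.setdiff1d(np.arange(len(X_min)), min_val)
        X_tr = np.vstack([X_maj[maj_tr], X_min[min_tr]])
        y_tr = np.r_[np.zeros(len(maj_tr)), np.ones(len(min_tr))]
        X_va = np.vstack([X_maj[maj_val], X_min[min_val]])
        y_va = np.r_[np.zeros(len(maj_val)), np.ones(len(min_val))]
        model = clone(clf).fit(X_tr, y_tr)
        scores.append(g_mean(y_va, model.predict(X_va)))
    return float(np.mean(scores))
```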

Fig. 7

Schematic diagram of searching for the optimal state with a given sampling algorithm

Algorithm 1 Pseudocode of the proposed k-fold icross-validation framework

Figure 7 shows the process of finding the optimal state for a specific oversampling method using k-fold icross-validation. In the \(j\)th \((1 \le j \le s)\) sampling iteration, \(D_j\) is generated from the original data set D, and each \(D_j (1 \le j \le s)\) contains the same number of samples, equal to the number of minority samples in D. In the k-fold icross-validation, CG-mean\(_j\) is computed on the data set consisting of D and all \(D_i (i \le j)\). The optimal value CG-mean\(^*\) corresponds to the CG-mean\(_j\) with the highest value. The pseudocode of k-fold icross-validation is provided in Algorithm 1. The proposed framework not only gives sufficient attention to the minority samples but also considers the majority samples by ensuring that the balance rate in the validation set is equal to that in the training set, i.e., that the data distributions of the validation set and the training set are as consistent as possible. A sketch of this search procedure is given below.
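The outer search over sampling iterations could then look roughly like the following sketch, which reuses the cg_mean helper sketched above and treats generate_batch as a stand-in for one iteration of whichever oversampling method is being wrapped (both names are hypothetical):

```python
import numpy as np

def find_optimal_state(clf, X_maj, X_min, generate_batch, s, k=5):
    """Run s sampling iterations; after each one, score the augmented data
    with CG-mean (k-fold icross-validation) and remember the best state."""
    X_min_aug = X_min.copy()
    best_score, best_state = cg_mean(clf, X_maj, X_min_aug, k), X_min_aug
    for j in range(s):
        D_j = generate_batch(X_maj, X_min_aug)      # one oversampling iteration
        X_min_aug = np.vstack([X_min_aug, D_j])     # D plus all D_i, i <= j
        score = cg_mean(clf, X_maj, X_min_aug, k)   # CG-mean_j
        if score > best_score:                      # keep the highest CG-mean
            best_score, best_state = score, X_min_aug
    return best_state, best_score                   # optimal state and CG-mean*
```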

4 Experiment

In this section, five widely used or state-of-the-art sampling methods, SMOTE, ADASYN, Borderline-SMOTE, MDO, and SVMSMOTE, are used to validate the effectiveness of the proposed framework on eight real-world data sets. Five classifiers are selected based on the “scikit-learn” package: GBDT, SVM, LR, DT, and LGB. The parameter k in these sampling methods is set to the default value of 5, and the other parameters are also kept at their default values; the parameter k in the k-fold icross-validation framework is likewise set to 5. Information on the imbalance of these data sets is shown in Table 1; the balance rate ranges between 1/6 and 1/27. Each data set is divided into a training set and a test set at a 3:1 ratio, and the majority class and minority class are split separately so that the balance rates of the training and test sets are equal.
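A split of this kind can be obtained, for example, by stratifying on the class label with scikit-learn, which divides the two classes separately and keeps the balance rates of the training and test sets equal (the synthetic data set below is only a stand-in for the benchmark data sets):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Toy imbalanced data set standing in for one of the benchmark data sets.
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)

# 3:1 train/test split; stratify=y splits each class separately, so the
# training and test sets keep the same balance rate as the original data.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)
```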

Table 1 The information on the imbalances of the data sets
Table 2 Comparison results of the optimal state and the balanced state under the five classifiers
Table 3 Comparison of average results in the optimal and balanced states

Under imbalanced learning conditions, classification accuracy is biased toward the majority class and therefore cannot fully evaluate the learning algorithm under study. In our experiments, a set of confusion-matrix-based evaluation metrics is used: accuracy, precision, recall, F1-measure, and G-mean, as listed in Eqs. (4)–(8); a small computational sketch follows the equations. The minority class is set as the positive class, and the majority class as the negative class. Accuracy is the ratio of correct predictions, over both classes, among all predictions. Precision is the ratio of correct predictions among all samples predicted as positive; it reflects how many of the predicted positive samples are true positives. Recall is the ratio of correctly predicted instances among all real positive instances, that is, how many positive instances are correctly classified. F1-measure is the harmonic mean of precision and recall; it is closer to the smaller of the two, so it becomes high only when precision and recall are both high. F1-measure is designed to measure minority class performance, while G-mean measures the performance on both classes; when G-mean is high, the overall performance of the classifier is robust. Another commonly used metric is the area under the receiver operating characteristic (ROC) curve, i.e., the AUC; the ROC curve is generated by plotting the true positive rate against the false positive rate.

$$\begin{aligned} \text{ Accuracy }= & {} \frac{TP+TN}{TP+FP+TN+FN} \end{aligned}$$
(4)
$$\begin{aligned} \text{ Precision }= & {} \frac{TP}{TP+FP} \end{aligned}$$
(5)
$$\begin{aligned} \text{ Recall }= & {} \frac{TP}{TP+FN} \end{aligned}$$
(6)
$$\begin{aligned} \textrm{F1}\!-\!\! \text{ measure }= & {} \frac{2 \times \text{ Recall } \times \text{ Precision }}{\text{ Recall }+\text{ Precision }} \end{aligned}$$
(7)
$$\begin{aligned} \textrm{G}\!-\!\! \text{ mean }= & {} \sqrt{\frac{TP}{TP+FN} \times \frac{TN}{TN+FP}} \end{aligned}$$
(8)
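Assuming the minority class is labeled 1 (positive) and the majority class 0 (negative), Eqs. (4)–(8) can be computed from a confusion matrix as in the following sketch (our own helper, not the authors' evaluation code):

```python
import numpy as np
from sklearn.metrics import confusion_matrix

def imbalance_metrics(y_true, y_pred):
    """Accuracy, precision, recall, F1-measure and G-mean (Eqs. (4)-(8)),
    with the minority class as the positive class (label 1)."""
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
    accuracy  = (tp + tn) / (tp + fp + tn + fn)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall    = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * recall * precision / (recall + precision)
          if recall + precision else 0.0)
    g_mean = np.sqrt(recall * (tn / (tn + fp) if tn + fp else 0.0))
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1, "g_mean": g_mean}
```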

Comparison results between the optimal state and the balanced state are provided in Tables 2 and 3. Table 2 shows the average results of the five classifiers on the eight data sets; for each algorithm, the upper row represents the optimal state and the lower row the balanced state. In particular, CG-mean is the average validation result for imbalanced classification obtained with icross-validation. Overall, the bold values in most cases appear in the rows of the optimal state found using the proposed cross-validation method, indicating that this optimal state is better than the balanced state in most cases. This trend is even more obvious in Table 3, which averages the results of Table 2 over the five classifiers. Across the various algorithms, the optimal state outperforms the balanced state in most cases, and in some cases its advantage is considerable; for example, the average precision under SVMSMOTE in the optimal state exceeds that of the balanced state by 7.36%. Due to space limitations, the remaining experimental results are provided in the supplementary materials.

5 Conclusion

Using the proposed “distribution difference” measurement, this paper analyzes the relationship between the balance rate and classification performance in the oversampling process from a novel perspective: sampling may distort the original distribution even while the minority class is enhanced. We also present an effective cross-validation framework to find a better state than the balanced state. The experimental results on real-world data sets demonstrate that the proposed cross-validation method achieves better classification performance than the balanced state in most cases. In the future, we aim to improve the proposed framework to optimize the balance rate more efficiently and effectively [31].