Abstract
Imbalanced classification has long been a popular research topic in machine learning, data mining and pattern recognition. Many techniques exist to reduce the negative impact of imbalance on classification performance, and oversampling is the most commonly used one. In this paper, we illustrate the relationship between the imbalance rate and classification performance during oversampling from a novel perspective: oversampling may distort the original data distribution while the minority class is enhanced. In addition, this paper proposes a novel cross-validation framework called "icross-validation" that can be used during sampling to find a better state than the balanced state. The framework is general and can be applied to various oversampling methods. In comparison with several state-of-the-art and widely used oversampling methods, experimental results on real data sets demonstrate the effectiveness of icross-validation. All code has been released in the open-source icross-validation library at https://github.com/syxiaa/icross-valiation.
1 Introduction
An imbalance is likely to deteriorate classifier performance in imbalanced classification, which is an important and widespread task in machine learning, so solving the imbalanced learning problem has far-reaching significance. The techniques commonly used to address it can be roughly divided into two categories: algorithm-level methods and data-level methods [1,2,3]. Algorithm-level methods either modify or design a new classification algorithm to improve learning performance, or design a new classifier performance evaluation index. A specific type of algorithm-level method is cost-sensitive learning [4, 5], a scheme of algorithm-level modification [6]. Instead of a standard loss function, it introduces the concept of misclassification cost to minimize conditional risk: by severely penalizing classification errors on certain categories, the importance of those categories is increased during classifier training. For example, Datta et al. proposed a lexicographic programming framework to minimize the maximum deviation of the minimum loss value of a specific category [7]. They also proposed a multi-objective optimization framework called radial boundary intersection by training support vector machines on binary and multi-class data sets; it overcomes the high cost of parameter tuning [8]. Data-level methods generally balance the majority and minority classes by resampling the original data. Resampling by oversampling the minority class or undersampling the majority class is the most popular way of addressing the imbalance problem [9,10,11,12], and oversampling is more effective and popular than undersampling [13,14,15,16,17].
Among oversampling techniques, the synthetic minority oversampling technique (SMOTE) is one of the most well-known data preprocessing methods for balancing the numbers of samples in each class; it randomly interpolates between minority points, extending the minority class, bringing the data to a balanced state and thereby improving classification accuracy [18, 19]. Many improved versions of SMOTE have followed, which can be roughly divided into two categories. The first category filters out noisy or dangerous points that are not suitable for expansion and then performs SMOTE. For example, Rivera et al. introduced a new oversampling technique that focuses on noise reduction and selective sampling of the minority class, which improves the prediction of minority class membership [20]. Saez et al. proposed an iterative ensemble-based noise filter that can overcome the problems produced by noisy and borderline examples in imbalanced data sets [21]. The second category improves the sampling strategy to generalize to data sets with various distributions. In this category, Han et al. proposed two new minority oversampling methods, borderline-SMOTE1 and borderline-SMOTE2, in which only the minority examples near the borderline are oversampled [22]. Das et al. introduced two probabilistic oversampling approaches for synthetically generating and strategically selecting new minority class samples [23]. Abdi et al. suggested an elegant blend of the boosting and oversampling paradigms by introducing the Mahalanobis distance-based oversampling technique (MDO) [24]. In addition, some of the latest algorithms combine the above two approaches. For example, Xie et al. proposed a minority oversampling technique based on local densities in low-dimensional space [25]. Zhou et al. proposed a weight-based oversampling algorithm to calculate the synthesis positions of new samples [26]. He et al. proposed the adaptive synthetic (ADASYN) sampling method, which uses a weighted distribution for different minority examples according to their learning difficulty [27].
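The interpolation step at the heart of SMOTE can be sketched in a few lines; the function name, the `k_neighbors` parameter and the helper logic below are illustrative, not the reference implementation:

```python
import numpy as np

def smote_interpolate(minority, n_new, k_neighbors=5, rng=None):
    """Generate n_new synthetic points by random interpolation
    between a chosen minority point and one of its k nearest
    minority neighbors (the core idea of SMOTE)."""
    rng = np.random.default_rng(rng)
    minority = np.asarray(minority, dtype=float)
    n = len(minority)
    k = min(k_neighbors, n - 1)
    # pairwise distances within the minority class only
    d = np.linalg.norm(minority[:, None, :] - minority[None, :, :], axis=2)
    # indices of the k nearest neighbors (column 0 is the point itself)
    nn = np.argsort(d, axis=1)[:, 1:k + 1]
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(n)             # pick a minority point
        j = nn[i, rng.integers(k)]      # pick one of its neighbors
        gap = rng.random()              # random position on the segment
        synthetic.append(minority[i] + gap * (minority[j] - minority[i]))
    return np.array(synthetic)
```

Because every synthetic point lies on a segment between two existing minority points, the oversampled set never leaves the convex hull of the minority class.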
Although these oversampling methods have achieved satisfactory results on some imbalanced data sets, almost all of them take the balanced state as the final state of sampling. Nevertheless, balance is not necessarily the optimal state for classification. For example, Prati et al. conducted a detailed study of class imbalance and evaluated imbalance treatment methods through a large-scale experimental design, showing that the higher the degree of class imbalance, the greater the performance loss, and that treatment methods can only recover part of that loss [28]. Barella et al. determined a convex relationship between the class imbalance rate and the accuracy rate by studying the influence of imbalance on classification [29]. Thabtah et al. argued that when the complexity of the data is high, imbalanced data brings greater difficulty to classification, and therefore used data complexity measures to estimate the optimal sample size for imbalance preprocessing techniques [30]. However, measuring data complexity is a difficult problem in itself, so this method is limited to data sets with specific distributions and is not general. This paper shows that, as the data gradually become balanced, the difference between the distribution of the current data and that of the original data may grow increasingly large. When the sampling method is not suitable for data with a certain distribution, this negative effect can be considerable. The two contributions of this paper are as follows:
- We analyze the relationship between the balance rate and classification performance during sampling from a novel perspective: oversampling may distort the original data distribution while the minority class is enhanced.
- This paper proposes a novel cross-validation framework called "k-fold icross-validation" that can be used during sampling to find a better state than the balanced state. The framework is general and can be applied to various oversampling methods.
2 Relationship between prediction performance and imbalance rate
As pointed out by Chawla [19], an imbalance has a negative effect on a classifier because the majority class of data points always has a greater impact on the classifier than the minority points. Therefore, in imbalanced classification, current oversampling methods always aim at reaching an absolutely balanced state in the data set. However, in the process of sampling, the data distribution is likely to differ increasingly from the distribution of the original data set, and the difference in the distribution between the original data set and the sampled data set will obviously deteriorate the generalizability of the classifier. If the negative effect of the distribution difference is larger than the positive effect of balance, the classification performance can deteriorate. This phenomenon is particularly noticeable in some cases in which the distribution of a data set is very easy for a sampling method to change, such as when the data set contains noise [21].
Figure 1 shows the change in the decision boundary when borderline-SMOTE is used for oversampling on fourclass (a benchmark data set). In Fig. 1b, the minority class of data points in the original imbalanced data set is very sparse, and many minority points are incorrectly classified when the support vector machine (SVM) classifier is applied directly to the original imbalanced data set; the decision boundary is therefore very unreasonable. As shown in Fig. 1c, when borderline-SMOTE runs to the 20th iteration, the decision boundary is very close to that of the original data set shown in Fig. 1a. It therefore performs well and correctly classifies many more minority points than in Fig. 1b. Furthermore, the decision boundary in Fig. 1c is also closer to that in Fig. 1a than the one in Fig. 1d, where the data set is balanced after borderline-SMOTE runs to the 60th iteration. The reason is that although the balanced state makes the number of minority points almost equal to the number of majority points, excessive sampling exists in the balanced state, and the data distribution and decision boundary differ greatly from those of the original data set.
The classification performance can be represented by an evaluation index for imbalanced classification, such as the G-mean. The existing KL divergence metric can calculate the distribution difference between two data sets, but unfortunately it is only applicable when the sizes of the two data sets are equal. Therefore, drawing on the characteristics of KL divergence, we propose a novel measurement of the difference between two distributions containing different numbers of points. The first characteristic is that the distribution difference between two data sets containing the same data points is zero; the second is that, for a data set D and its oversampled result \(D'\), the distribution difference is larger if the data points in \(D'\) are farther from those in D. Based on these characteristics, we define the distribution difference for oversampling as follows:
Definition 1 (Division): Given two data sets \(D=\left\{ x_i, i=1...n\right\}\) and \(D'\), the number of data points in D is denoted as \(\mid D \mid\). Assuming that \(\mid D'\mid >\mid D\mid\), a division of D on \(D'\) implements an initial operation of n-means clustering on \(D'\) with \(x_i (i=1...n)\) as centers.
Definition 2 (Distribution difference): Given a data set \(D=\left\{ x_i, i=1...n\right\}\) and its oversampled result \(D'=\left\{ x'_i, i=1...m\right\}\), where n and m denote the numbers of data points in D and \(D'\), respectively, n clusters \(\left\{ C_i, i=1...n\right\}\) are generated by a division of D on \(D'\). The number of data points in D is denoted as \(\mid D\mid\). For two points A and B, their distance is denoted as dist(A, B). The distribution difference of \(C_i\) relative to \(x_i\) is defined as:

$$d(C_i, x_i) = dist(x_i, \bar{C_{i}}) \quad (1)$$
where \(\bar{C_{i}}\) represents the center of \(C_i\). The distribution difference of \(D'\) relative to D is defined as:

$$DD(D', D) = \frac{1}{\mid D \mid} \sum_{i=1}^{n} dist(x_i, \bar{C_{i}}) \quad (2)$$
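Our reading of Definitions 1 and 2 can be sketched as follows; the function name and the averaging over the \(x_i\) are assumptions based on the two characteristics stated above:

```python
import numpy as np

def distribution_difference(D, D_prime):
    """Sketch of the 'distribution difference' of an oversampled set
    D_prime relative to the original set D: assign every point of
    D_prime to its nearest original point x_i (the initial step of
    n-means clustering with the x_i as centers), then average the
    distance from each x_i to the center of its cluster."""
    D = np.asarray(D, dtype=float)
    Dp = np.asarray(D_prime, dtype=float)
    # nearest original center for every point in D_prime
    dists = np.linalg.norm(Dp[:, None, :] - D[None, :, :], axis=2)
    labels = np.argmin(dists, axis=1)
    diffs = []
    for i in range(len(D)):
        cluster = Dp[labels == i]
        if len(cluster) == 0:
            diffs.append(0.0)  # empty cluster contributes nothing
            continue
        center = cluster.mean(axis=0)
        diffs.append(np.linalg.norm(D[i] - center))
    return float(np.mean(diffs))
```

When \(D'=D\), every point is its own cluster center, so the measure is zero (first characteristic); synthetic points placed far from the originals pull cluster centers away and increase it (second characteristic).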
The visualization results of the changing G-mean (i.e., classification performance) and distribution difference using various sampling methods on some real-world data sets are shown in Figs. 2 and 3. Some rules can be extracted from them, as shown in Fig. 4. The balance rate used in this paper is given in Definition 3, and the imbalance rate used in this paper is the reciprocal of the balance rate.
Definition 3 (Balance rate): Given a data set D consisting of a majority class \(D_{-}\) and a minority class \(D_{+}\), and given that the number of data points in a data set P is denoted as \(\mid P \mid\), the balance rate of D is defined as:

$$BR(D) = \frac{\mid D_{+} \mid}{\mid D_{-} \mid} \quad (3)$$
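Reading Definition 3 as the minority class size over the majority class size (consistent with the imbalance rate being its reciprocal), a minimal sketch is:

```python
def balance_rate(n_minority, n_majority):
    """Balance rate of a binary data set, read as |D+| / |D-|;
    the imbalance rate is its reciprocal."""
    return n_minority / n_majority

# e.g. 10 minority vs 60 majority points gives a balance rate of 1/6
print(balance_rate(10, 60))
```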
Supposing that the other factors affecting classification performance are equal, classification performance is better when the balance rate is closer to 1. The characteristics in Fig. 4 can be summarized as follows:
- The orange curve rises when new sampled points are added to the original data because the distribution difference increases. As an increasing number of points are generated, as shown in Fig. 1d, those points are limited to a local area, so the distribution difference stabilizes at the end of the orange curve.
- The blue curve changes along a parabola, where A is the point with the best classification performance. In the front part of the parabola (between the origin and A), the high proportion of original data points makes the sampling result reliable, resulting in a low distribution difference and a small negative impact on classification performance. Meanwhile, the balance rate keeps increasing from a small value toward 1, and its benefit to classification outweighs the negative effect of the distribution difference. Therefore, the classification performance improves.
- In the second half of the blue curve (after A), the small proportion of original data points makes the sampling results less reliable, resulting in a high distribution difference and a large negative impact on classification performance. This negative impact is larger than the benefit of the increasing balance rate, so the classification performance decreases.
- For the blue curve, similar to the phenomenon in Fig. 2, point B, where the balance rate equals 1, is located in the second half of the parabola (after A). In other words, the classification performance cannot increase beyond the point where the balance rate equals 1. The reason is that not only has the balance rate moved away from 1 but the distribution difference has also expanded, and both effects are negative. When this process continues far enough, the classification performance falls below that of the original data set and may even drop to 0.
To the best of our knowledge, there is no existing method for finding the optimal state of a sampling method in imbalanced classification. We present such a method in the next section.
3 Validation framework to optimize the balance rate
3.1 Motivation
To find a better state than the balanced one, the most direct method is conventional cross-validation. However, conventional cross-validation may result in an unstable and unreliable sampled data set because the minority samples are likely to be unevenly distributed across the validation sets. Additionally, the minority samples are very important for the validation index in imbalanced classification, and conventional evaluation indexes, such as test accuracy, are seldom used in imbalanced classification. For example, consider a data set D consisting of two classes of points, with 1000 and 10 points respectively. If 5-fold cross-validation is used, one possible case is that the numbers of minority samples in the five folds are 0, 0, 0, 0, and 10. In this case, the validation indexes of the first four folds are meaningless and harmful to the averaged validation result.
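The failure mode in this example is easy to reproduce with a plain, unshuffled 5-fold split:

```python
import numpy as np

# the example from the text: 1000 majority (0) and 10 minority (1) points,
# with the minority points grouped at the end of the array
y = np.array([0] * 1000 + [1] * 10)

# a plain (unshuffled) 5-fold split assigns consecutive blocks to folds
folds = np.array_split(np.arange(len(y)), 5)
minority_per_fold = [int(y[f].sum()) for f in folds]
print(minority_per_fold)  # -> [0, 0, 0, 0, 10]
```

Four of the five folds contain no minority samples at all, so any minority-sensitive validation index computed on them is meaningless.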
3.2 A cross-validation method for imbalanced classification
To address the above problem, we design a validation framework for imbalanced classification by applying the leave-one-out strategy to the minority class samples. In addition, to protect the majority samples and keep the overall distribution consistent, the majority samples are divided into as many sets as there are minority samples. As shown in Fig. 5, m denotes the number of majority samples, and n denotes the number of minority samples. \(x_i (i=1, 2...n)\) denotes each minority sample point, and the majority samples are divided into n equal parts \(\left\{ X_{i} \mid i=1...n\right\}\), so the number of samples in each majority set \(X_{i}\) equals \(\frac{m}{n}\). After the division is completed, we merge a majority set \(X_i\) and a minority sample point \(x_i\) into one set and use it as one fold of the leave-one-out cross-validation. However, if the minority sample size is large, the implementation cost of this framework is quite high. Therefore, as shown in Fig. 6, we propose a k-fold cross-validation framework for imbalanced classification, in which the majority and minority samples are each divided into k parts and cross-validation is then performed.
In Fig. 6, one majority set \(X_i(i=1...k)\) and one minority set \(X'_i(i=1...k)\) are combined as a cross-validation set, and the remaining points make up the cross-training set. In particular, when \(k=n\), each \(X'_i\) contains a single minority sample, and the framework reduces to the leave-one-out framework. In this way, the minority samples are evenly distributed across the validation sets, and the majority samples are also allocated according to the balance rate. Using the proposed k-fold cross-validation, we compute G-mean\(_i(i=1...k)\) on each validation set and average the k G-mean values to obtain the final CG-mean. We compute the CG-mean for each sampling iteration, and the balance rate corresponding to the highest CG-mean is the optimal balance rate. We name the proposed framework for imbalanced classification k-fold icross-validation.
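The fold construction described above can be sketched as follows; `icross_folds` and its arguments are illustrative names, not the authors' implementation:

```python
import numpy as np

def icross_folds(maj_idx, min_idx, k, rng=None):
    """Sketch of the k-fold icross-validation split: the majority and
    minority indices are each divided into k parts, and part i of each
    class is merged to form validation fold i.  Every fold then has
    (approximately) the same balance rate as the whole data set."""
    rng = np.random.default_rng(rng)
    maj = rng.permutation(maj_idx)
    mino = rng.permutation(min_idx)
    maj_parts = np.array_split(maj, k)
    min_parts = np.array_split(mino, k)
    return [np.concatenate([maj_parts[i], min_parts[i]]) for i in range(k)]

# 1000 majority indices, 10 minority indices, k = 5:
folds = icross_folds(np.arange(1000), np.arange(1000, 1010), k=5, rng=0)
# each validation fold holds 200 majority and 2 minority samples
print([len(f) for f in folds])  # -> [202, 202, 202, 202, 202]
```

The CG-mean is then the average of the k G-mean values computed on these folds.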
Figure 7 shows the process of finding the optimal state of a specific oversampling method using k-fold icross-validation. In the \(j{th}(1 \le j \le s)\) sampling iteration, \(D_j\) is generated from the original data set D, and each \(D_j(1 \le j \le s)\) contains the same number of samples, equal to the number of minority samples in D. In k-fold icross-validation, CG-\(mean_j\) is computed on the data set consisting of D and all \(D_i (i \le j)\). The optimal value CG-\(mean^*\) corresponds to the highest CG-\(mean_j\). The pseudocode of k-fold icross-validation is provided in Algorithm 1. The proposed framework not only gives sufficient attention to the minority samples but also considers the majority samples by ensuring that the balance rate in the validation set equals that in the training set, i.e., that the data distributions of the validation and training sets are as consistent as possible.
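The final selection step reduces to an argmax over the per-iteration CG-mean values; `find_optimal_state` is an illustrative name for this step, not the authors' code:

```python
def find_optimal_state(cg_means):
    """Given CG-mean_j computed after each sampling iteration j,
    return the index of the iteration with the highest CG-mean and
    that value; the balance rate at that iteration is the optimal one."""
    best = max(range(len(cg_means)), key=lambda j: cg_means[j])
    return best, cg_means[best]

# e.g. performance peaks at iteration 1, before full balance is reached
print(find_optimal_state([0.6, 0.8, 0.7]))  # -> (1, 0.8)
```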
4 Experiment
In this section, five widely used or state-of-the-art sampling methods, SMOTE, ADASYN, Borderline-SMOTE, MDO and SVMSMOTE, are used to validate the effectiveness of the proposed framework on eight real-world data sets. Five classifiers are selected based on the "scikit-learn" package: GBDT, SVM, LR, DT and LGB. The neighborhood parameter k is set to its default value of 5 in these sampling methods, and the other parameters are also set to their default values. The parameter k is set to 5 in the k-fold icross-validation framework. The imbalance information of these data sets is shown in Table 1; the balance rates lie between 1/27 and 1/6. Each data set is divided into a training set and a test set at a 3:1 ratio. During this process, the majority and minority classes are divided separately so that the balance rates of the two sets are equal.
Under imbalanced learning conditions, since traditional classification accuracy is biased toward the majority class, accuracy alone cannot fully evaluate a learning algorithm. In our experiments, the following evaluation metrics are used: accuracy, precision, recall, F1-measure and G-mean, listed in Eqs. (4)–(8). The minority class is set as the positive class, and the majority class as the negative class. Accuracy is the ratio of correct predictions, positive and negative, among all predictions. Precision is the ratio of correct predictions among all samples predicted positive; it reflects how many of the predicted positive samples are true positives. Recall is the ratio of correctly predicted instances among all real positive instances, that is, how many positive instances are correctly classified. F1-measure is the harmonic mean of precision and recall; it is closer to the smaller of the two, so F1-measure is high only when precision and recall are both high. F1-measure is designed to measure minority-class performance, while G-mean measures the performance of both classes; when G-mean is high, the overall performance of the classifier is robust. Another commonly used metric is the area under the receiver operating characteristic (ROC) curve, i.e., AUC; the ROC curve is generated by plotting the true positive rate against the false positive rate.
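For reference, the G-mean used here is the geometric mean of the per-class recalls, which can be computed directly from the predictions (a minimal sketch, not the paper's code):

```python
import numpy as np

def g_mean(y_true, y_pred):
    """G-mean: geometric mean of the recall on the positive (minority)
    class and the recall on the negative (majority) class."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tpr = np.mean(y_pred[y_true == 1] == 1)  # sensitivity
    tnr = np.mean(y_pred[y_true == 0] == 0)  # specificity
    return float(np.sqrt(tpr * tnr))

# missing half the minority class is penalized even if the majority
# class is classified perfectly:
print(g_mean([1, 1, 0, 0], [1, 0, 0, 0]))  # -> 0.7071067811865476
```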
Some comparison results between the optimal state and the balanced state are provided in Tables 2 and 3. The average results of the five classifiers on the eight data sets are shown in Table 2. For each algorithm, the upper row represents the optimal state and the lower row the balanced state. In particular, CG-mean is the average validation result for imbalanced classification using icross-validation. Overall, boldface appears on the optimal state found using the proposed cross-validation method in most cases, indicating that this optimal state is better than the balanced state in most cases. The trend is even more obvious in Table 3, which averages the results of Table 2 over the five classifiers. Across the algorithms, the optimal state performs better than the balanced state in most cases, and sometimes the advantage of the optimal state obtained with k-fold icross-validation is considerable: for example, under SVMSMOTE, the average precision in the optimal state exceeds that in the balanced state by 7.36%. Due to space limitations, other experimental results are placed in the supplementary materials.
5 Conclusion
This paper analyzes the relationship between the balance rate and classification performance in the oversampling process from a novel perspective, namely that sampling may distort the original data distribution while the minority class is enhanced, using the proposed "distribution difference" measurement. We also present an effective cross-validation framework to find a better state than the balanced one. The experimental results on real-world data sets demonstrate that the proposed cross-validation method achieves better classification performance than that of the balanced state in most cases. In the future, we aim to improve the proposed framework to optimize the balance rate more efficiently and effectively [31].
References
Chen B, Xia S, Chen Z, Wang B, Wang G (2021) Rsmote: a self-adaptive robust smote for imbalanced problems with label noise. Inf Sci 553:397–428
Douzas G, Bacao F, Last F (2018) Improving imbalanced learning through a heuristic oversampling method based on k-means and smote. Inf Sci 465:1–20
Alam TM, Shaukat K, Mahboob H, Sarwar MU, Iqbal F, Nasir A, Hameed IA, Luo S (2021) A machine learning approach for identification of malignant mesothelioma etiological factors in an imbalanced dataset. Comput J 65(7):1740–1751
Krawczyk B (2016) Learning from imbalanced data: open challenges and future directions. Prog Artif Intell 5(4):221–232
López V, Fernández A, Moreno-Torres JG, Herrera F (2012) Analysis of preprocessing vs. cost-sensitive learning for imbalanced classification. open problems on intrinsic data characteristics. Expert Syst Appl 39(7):6585–6608
Petrides G, Moldovan D, Coenen L, Guns T, Verbeke W (2022) Cost-sensitive learning for profit-driven credit scoring. J Oper Res Soc 73(2):338–350
Datta S, Nag S, Das S (2019) Boosting with lexicographic programming: addressing class imbalance without cost tuning. IEEE Trans Knowl Data Eng 32(5):883–897
Datta S, Das S (2018) Multiobjective support vector machines: handling class imbalance with pareto optimality. IEEE Trans Neural Netw Learn Syst 30(5):1602–1608
Maulidevi NU, Surendro K et al (2022) Smote-lof for noise identification in imbalanced data classification. J King Saud Univ Comput Inf Sci 34(6, Part B):3413–3423
Ren J, Wang Y, Cheung Y-M, Gao X-Z, Guo X (2023) Grouping-based oversampling in kernel space for imbalanced data classification. Pattern Recognit 133:108992
Sandhan T, Choi JY (2014) Handling imbalanced datasets by partially guided hybrid sampling for pattern recognition. In: 2014 22nd international conference on pattern recognition. IEEE, pp 1449–1453
Japkowicz N et al (2000) Learning from imbalanced data sets: a comparison of various strategies. In: AAAI workshop on learning from imbalanced data sets, vol 68. Menlo Park, CA, pp 10–15
Bunkhumpornpat C, Sinapiromsaran K, Lursinsap C (2012) Dbsmote: density-based synthetic minority over-sampling technique. Appl Intell 36(3):664–684
Zhai J, Qi J, Shen C (2022) Binary imbalanced data classification based on diversity oversampling by generative models. Inf Sci 585:313–343
Lunardon N, Menardi G, Torelli N (2014) Rose: a package for binary imbalanced learning. R J 6(1)
Barua S, Islam MM, Yao X, Murase K (2012) Mwmote-majority weighted minority oversampling technique for imbalanced data set learning. IEEE Trans Knowl Data Eng 26(2):405–425
Arafa A, El-Fishawy N, Badawy M, Radad M (2022) Rn-smote: reduced noise smote based on dbscan for enhancing imbalanced data classification. J King Saud Univ Comput Inf Sci 34(8, Part A):5059–5074
Soltanzadeh P, Hashemzadeh M (2021) Rcsmote: range-controlled synthetic minority over-sampling technique for handling the class imbalance problem. Inf Sci 542:92–111
Chawla NV, Bowyer KW, Hall LO, Kegelmeyer WP (2002) Smote: synthetic minority over-sampling technique. J Artif Intell Res 16:321–357
Rivera WA (2017) Noise reduction a priori synthetic over-sampling for class imbalanced data sets. Inf Sci 408:146–161
Sáez JA, Luengo J, Stefanowski J, Herrera F (2015) Smote-ipf: addressing the noisy and borderline examples problem in imbalanced classification by a re-sampling method with filtering. Inf Sci 291:184–203
Han H, Wang W-Y, Mao B-H (2005) Borderline-smote: a new over-sampling method in imbalanced data sets learning. In: International conference on intelligent computing. Springer, pp 878–887
Das B, Krishnan NC, Cook DJ (2014) Racog and wracog: two probabilistic oversampling techniques. IEEE Trans Knowl Data Eng 27(1):222–234
Abdi L, Hashemi S (2015) To combat multi-class imbalanced problems by means of over-sampling techniques. IEEE Trans Knowl Data Eng 28(1):238–251
Xie Z, Jiang L, Ye T, Li X (2015) A synthetic minority oversampling method based on local densities in low-dimensional space for imbalanced learning. In: International conference on database systems for advanced applications. Springer, pp 3–18
Zhou H, Dong X, Xia S, Wang G (2021) Weighted oversampling algorithms for imbalanced problems and application in prediction of streamflow. Knowl Based Syst 229:107306
He H, Bai Y, Garcia EA, Li S (2008) Adasyn: adaptive synthetic sampling approach for imbalanced learning. In: 2008 IEEE international joint conference on neural networks (IEEE World congress on computational intelligence). IEEE, pp 1322–1328
Prati RC, Batista GE, Silva DF (2015) Class imbalance revisited: a new experimental setup to assess the performance of treatment methods. Knowl Inf Syst 45(1):247–270
Barella V, Garcia L, de Carvalho A (2018) The influence of sampling on imbalanced data classification. In: 2019 8th Brazilian conference on intelligent systems (BRACIS). IEEE, pp 210–215
Thabtah F, Hammoud S, Kamalov F, Gonsalves A (2020) Data imbalance in classification: experimental evaluation. Inf Sci 513:429–441
He J, Zhang S, Yang M, Shan Y, Huang T (2020) Bi-directional cascade network for perceptual edge detection. In: 2019 IEEE/CVF conference on computer vision and pattern recognition (CVPR)
Acknowledgements
This work was supported in part by the National Natural Science Foundation of China under Grant Nos. 62176033 and 61936001, Key Cooperation Project of Chongqing Municipal Education Commission under Grant No. HZ2021008, and Natural Science Foundation of Chongqing under Grant No. cstc2019jcyj-cxttX0002, National Key Research and Development Program of China under Grant No. 2019QY(Y)0301.
Ethics declarations
Conflict of interest
All authors have no conflict of interest to declare.
Cite this article
Dai, Q., Li, D. & Xia, S. A cross-validation framework to find a better state than the balanced one for oversampling in imbalanced classification. Int. J. Mach. Learn. & Cyber. 14, 2877–2886 (2023). https://doi.org/10.1007/s13042-023-01804-x