Abstract
The problem of modeling binary responses from cross-sectional data has been addressed with a number of satisfying solutions that draw on both parametric and nonparametric methods. In many real situations, however, one of the two responses (usually the one of greatest interest for the analysis) is rare. It has been widely reported that such class imbalance severely compromises the learning process, because the model tends to focus on the prevalent class and to ignore the rare events. Moreover, a skewed class distribution affects not only the estimation of the classification model but also the evaluation of its accuracy, because the scarcity of data leads to poor accuracy estimates. In this work, the effects of class imbalance on model training and model assessment are discussed. A unified and systematic framework for dealing with imbalanced classification is then proposed, based on a smoothed bootstrap resampling technique. The proposed technique rests on a sound theoretical basis, and an extensive empirical study shows that it outperforms the main alternative remedies for imbalanced learning problems.
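The core idea of smoothed bootstrap resampling can be sketched as follows: instead of duplicating rare-class observations exactly (a plain bootstrap), each resampled point is perturbed by drawing from a kernel density centered on it, so the synthetic examples populate a neighborhood of the observed data rather than replicating it. The sketch below is an illustrative, minimal version, not the authors' exact formulation: the function name `smoothed_bootstrap_sample`, the Gaussian kernel, and the per-feature Silverman rule-of-thumb bandwidth are assumptions chosen for concreteness.

```python
import numpy as np

def smoothed_bootstrap_sample(X, y, n_new, target_class, h_mult=1.0, seed=None):
    """Generate n_new synthetic examples of `target_class` via a smoothed
    bootstrap: resample class members with replacement, then perturb each
    draw with Gaussian noise scaled by a Silverman-type bandwidth."""
    rng = np.random.default_rng(seed)
    Xc = X[y == target_class]
    n, d = Xc.shape
    # Silverman's rule-of-thumb bandwidth, applied per feature
    h = h_mult * (4.0 / ((d + 2) * n)) ** (1.0 / (d + 4)) * Xc.std(axis=0, ddof=1)
    idx = rng.integers(0, n, size=n_new)                 # ordinary bootstrap draw
    noise = rng.normal(0.0, 1.0, size=(n_new, d)) * h    # kernel smoothing step
    return Xc[idx] + noise

# Toy imbalanced data: 95 majority (class 0) vs 5 minority (class 1) points
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (95, 2)), rng.normal(3, 1, (5, 2))])
y = np.array([0] * 95 + [1] * 5)

# Oversample the minority class up to parity with the majority
X_new = smoothed_bootstrap_sample(X, y, n_new=90, target_class=1, seed=1)
X_bal = np.vstack([X, X_new])
y_bal = np.concatenate([y, np.ones(90, dtype=int)])
```

Because the synthetic points are drawn from an estimated density rather than copied, a classifier trained on the rebalanced sample sees genuinely new minority-class examples, which is what distinguishes this approach from naive random oversampling.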
Responsible editor: Chih-Jen Lin.
Menardi, G., Torelli, N. Training and assessing classification rules with imbalanced data. Data Min Knowl Disc 28, 92–122 (2014). https://doi.org/10.1007/s10618-012-0295-5