Abstract
Sampling methods are a direct approach to tackling the problem of class imbalance. These methods sample a data set in order to alter its class distribution, usually to obtain a more balanced one. An open question about sampling methods is which distribution, if any, provides the best results. In this work we conduct a broad empirical study aiming to provide more insight into this question. Our results suggest that altering the class distribution can improve classifier performance when AUC is used as the performance metric. Furthermore, as a general recommendation, random over-sampling to a balanced distribution is a good starting point for dealing with class imbalance.
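As an illustration of the recommendation above, random over-sampling to a balanced distribution can be sketched as follows. This is a minimal sketch and not the experimental setup of the study: the function name `random_oversample`, the `seed` parameter, and the toy data are assumptions introduced here for illustration only.

```python
import random
from collections import Counter


def random_oversample(X, y, seed=0):
    """Duplicate randomly chosen examples of each smaller class
    until every class has as many examples as the largest class.

    Hypothetical helper for illustration; not taken from the paper.
    """
    rng = random.Random(seed)
    counts = Counter(y)
    target = max(counts.values())  # size of the majority class

    X_out, y_out = list(X), list(y)
    for label, n in counts.items():
        # Indices of the original examples carrying this label.
        idx = [i for i, lab in enumerate(y) if lab == label]
        # Draw the missing examples with replacement (none for the majority).
        extra = rng.choices(idx, k=target - n)
        X_out.extend(X[i] for i in extra)
        y_out.extend(y[i] for i in extra)
    return X_out, y_out


# Toy data set (assumed): five majority examples, one minority example.
X = [[0.1], [0.2], [0.3], [0.4], [0.5], [0.9]]
y = [0, 0, 0, 0, 0, 1]

X_bal, y_bal = random_oversample(X, y)
balanced_counts = Counter(y_bal)
print(balanced_counts)
```

Over-sampling should be applied to the training data only, so that the test set keeps its original class distribution when a metric such as AUC is computed.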
Keywords
- Receiver Operating Characteristic
- Receiver Operating Characteristic Curve
- Receiver Operating Characteristic Analysis
- Class Distribution
- Minority Class
Copyright information
© 2008 International Federation for Information Processing
About this paper
Cite this paper
Prati, R.C., Batista, G.E.A.P.A., Monard, M.C. (2008). A Study with Class Imbalance and Random Sampling for a Decision Tree Learning System. In: Bramer, M. (eds) Artificial Intelligence in Theory and Practice II. IFIP AI 2008. IFIP – The International Federation for Information Processing, vol 276. Springer, Boston, MA. https://doi.org/10.1007/978-0-387-09695-7_13
DOI: https://doi.org/10.1007/978-0-387-09695-7_13
Publisher Name: Springer, Boston, MA
Print ISBN: 978-0-387-09694-0
Online ISBN: 978-0-387-09695-7
eBook Packages: Computer Science, Computer Science (R0)