A Study with Class Imbalance and Random Sampling for a Decision Tree Learning System

Prati, Ronaldo C.; Batista, Gustavo E. A. P. A.; Monard, Maria Carolina

doi:10.1007/978-0-387-09695-7_13

Part of the book series: IFIP – The International Federation for Information Processing ((IFIPAICT,volume 276))

Included in the following conference series:

IFIP International Conference on Artificial Intelligence in Theory and Practice

Abstract

Sampling methods are a direct approach to tackling the problem of class imbalance. These methods sample a data set in order to alter its class distribution, usually to obtain a more balanced one. An open question about sampling methods is which class distribution, if any, provides the best results. In this work we carry out a broad empirical study aimed at providing more insight into this question. Our results suggest that altering the class distribution can improve classification performance when AUC is used as the performance metric. Furthermore, as a general recommendation, random over-sampling to balance the class distribution is a good starting point for dealing with class imbalance.
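
As an illustration of the ideas in the abstract, the following is a minimal Python sketch of random over-sampling and random under-sampling to a chosen class distribution, together with AUC computed as a rank statistic. It is not the authors' code: the function names (`random_oversample`, `random_undersample`, `auc`) and the `target_fraction` parameter are hypothetical, chosen only for this example, and the sketch assumes a two-class problem with labels held in a plain list.

```python
import random
from collections import Counter


def _split_indices(labels, minority):
    """Return (minority indices, majority indices) for a two-class label list."""
    minority_idx = [i for i, y in enumerate(labels) if y == minority]
    majority_idx = [i for i, y in enumerate(labels) if y != minority]
    if not minority_idx or not majority_idx:
        raise ValueError("both classes must be present")
    return minority_idx, majority_idx


def random_oversample(labels, minority, target_fraction=0.5, seed=0):
    """Indices of a sample in which the minority class makes up roughly
    `target_fraction`, obtained by duplicating randomly chosen minority
    examples. All original examples are kept."""
    if not 0.0 < target_fraction < 1.0:
        raise ValueError("target_fraction must be in (0, 1)")
    rng = random.Random(seed)
    minority_idx, majority_idx = _split_indices(labels, minority)
    # Solve n_min / (n_min + n_maj) = target_fraction for n_min.
    wanted = round(target_fraction * len(majority_idx) / (1.0 - target_fraction))
    extra = max(0, wanted - len(minority_idx))
    duplicates = [rng.choice(minority_idx) for _ in range(extra)]
    return majority_idx + minority_idx + duplicates


def random_undersample(labels, minority, target_fraction=0.5, seed=0):
    """Indices of a sample in which the minority class makes up roughly
    `target_fraction`, obtained by discarding randomly chosen majority
    examples. All minority examples are kept."""
    if not 0.0 < target_fraction < 1.0:
        raise ValueError("target_fraction must be in (0, 1)")
    rng = random.Random(seed)
    minority_idx, majority_idx = _split_indices(labels, minority)
    # Solve n_min / (n_min + n_maj) = target_fraction for n_maj.
    wanted = round(len(minority_idx) * (1.0 - target_fraction) / target_fraction)
    kept = rng.sample(majority_idx, min(wanted, len(majority_idx)))
    return minority_idx + kept


def auc(scores, labels, positive):
    """AUC as the probability that a random positive example is scored
    higher than a random negative one (ties count as one half)."""
    pos = [s for s, y in zip(scores, labels) if y == positive]
    neg = [s for s, y in zip(scores, labels) if y != positive]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


if __name__ == "__main__":
    labels = ["neg"] * 90 + ["pos"] * 10
    over = random_oversample(labels, "pos", target_fraction=0.5)
    under = random_undersample(labels, "pos", target_fraction=0.5)
    print(Counter(labels[i] for i in over))
    print(Counter(labels[i] for i in under))
```

In a study of this kind the sampling would be applied to the training data only, with AUC measured on untouched test data; varying `target_fraction` is one way to explore which class distribution works best for a given learner.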

Author information

Authors and Affiliations

University of Sao Paulo, P. O. Box 668, ZIP Code 13560-970, São Carlos, SP, Brazil
Ronaldo C. Prati, Gustavo E. A. P. A. Batista & Maria Carolina Monard

Editor information

Editors and Affiliations

University of Portsmouth, UK
Max Bramer

Cite this paper

Prati, R.C., Batista, G.E.A.P.A., Monard, M.C. (2008). A Study with Class Imbalance and Random Sampling for a Decision Tree Learning System. In: Bramer, M. (eds) Artificial Intelligence in Theory and Practice II. IFIP AI 2008. IFIP – The International Federation for Information Processing, vol 276. Springer, Boston, MA. https://doi.org/10.1007/978-0-387-09695-7_13

DOI: https://doi.org/10.1007/978-0-387-09695-7_13
Publisher Name: Springer, Boston, MA
Print ISBN: 978-0-387-09694-0
Online ISBN: 978-0-387-09695-7
eBook Packages: Computer Science, Computer Science (R0)
