Abstract
The problem of imbalanced training sets in supervised pattern recognition methods is receiving growing attention. An imbalanced training sample is one in which one class is represented by a large number of examples while the other is represented by only a few. It has been observed that this situation, which arises in several practical domains, can produce a significant deterioration in classification accuracy, particularly for patterns belonging to the less represented classes. In this paper we present a study of the relative merits of several re-sizing techniques for handling the imbalance issue. We also assess the benefits of combining some of these techniques.
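The two basic re-sizing families the paper studies can be illustrated with a minimal sketch. This is not the authors' procedure, only a generic illustration: random under-sampling discards majority-class examples, and random over-sampling duplicates minority-class examples, until the two classes are the same size. The function names and the use of Python's `random` module are assumptions for illustration.

```python
import random

def undersample(majority, minority, seed=0):
    """Randomly discard majority-class examples until class sizes match."""
    rng = random.Random(seed)
    return rng.sample(list(majority), len(minority)) + list(minority)

def oversample(majority, minority, seed=0):
    """Randomly duplicate minority-class examples until class sizes match."""
    rng = random.Random(seed)
    extra = [rng.choice(list(minority))
             for _ in range(len(majority) - len(minority))]
    return list(majority) + list(minority) + extra

# Example: 100 majority examples vs. 5 minority examples.
majority = [("maj", i) for i in range(100)]
minority = [("min", i) for i in range(5)]

balanced_small = undersample(majority, minority)  # 10 examples total
balanced_large = oversample(majority, minority)   # 200 examples total
```

Under-sampling risks discarding informative majority examples, while over-sampling enlarges the training set and can encourage overfitting to duplicated minority points; the paper's comparison weighs exactly these trade-offs.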
© 2004 Springer-Verlag Berlin Heidelberg
Cite this paper
Barandela, R., Valdovinos, R.M., Sánchez, J.S., Ferri, F.J. (2004). The Imbalanced Training Sample Problem: Under or over Sampling?. In: Fred, A., Caelli, T.M., Duin, R.P.W., Campilho, A.C., de Ridder, D. (eds) Structural, Syntactic, and Statistical Pattern Recognition. SSPR /SPR 2004. Lecture Notes in Computer Science, vol 3138. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-27868-9_88
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-22570-6
Online ISBN: 978-3-540-27868-9