Abstract
The problem of imbalanced training sets in supervised pattern recognition methods is receiving growing attention. An imbalanced training sample is one in which one class is represented by a large number of examples while the other is represented by only a few. It has been observed that this situation, which arises in several practical domains, can produce a significant deterioration in classification accuracy, particularly for patterns belonging to the less represented classes. In this paper we present a study of the relative merits of several re-sizing techniques for handling the imbalance issue. We also assess the benefits of combining some of these techniques.
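The two basic re-sizing families the paper studies can be illustrated with a minimal sketch. This is not the authors' procedure, only a generic illustration: random under-sampling discards majority-class examples, and random over-sampling duplicates minority-class examples, until the two classes are the same size. The function names and the use of Python's `random` module are assumptions for illustration.

```python
import random

def undersample(majority, minority, seed=0):
    """Randomly discard majority-class examples until class sizes match."""
    rng = random.Random(seed)
    return rng.sample(list(majority), len(minority)) + list(minority)

def oversample(majority, minority, seed=0):
    """Randomly duplicate minority-class examples until class sizes match."""
    rng = random.Random(seed)
    extra = [rng.choice(list(minority))
             for _ in range(len(majority) - len(minority))]
    return list(majority) + list(minority) + extra

# Example: 100 majority examples vs. 5 minority examples.
majority = [("maj", i) for i in range(100)]
minority = [("min", i) for i in range(5)]

balanced_small = undersample(majority, minority)  # 10 examples total
balanced_large = oversample(majority, minority)   # 200 examples total
```

Under-sampling risks discarding informative majority examples, while over-sampling enlarges the training set and can encourage overfitting to duplicated minority points; the paper's comparison weighs exactly these trade-offs.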
© 2004 Springer-Verlag Berlin Heidelberg
Cite this paper
Barandela, R., Valdovinos, R.M., Sánchez, J.S., Ferri, F.J. (2004). The Imbalanced Training Sample Problem: Under or over Sampling?. In: Fred, A., Caelli, T.M., Duin, R.P.W., Campilho, A.C., de Ridder, D. (eds) Structural, Syntactic, and Statistical Pattern Recognition. SSPR /SPR 2004. Lecture Notes in Computer Science, vol 3138. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-27868-9_88
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-22570-6
Online ISBN: 978-3-540-27868-9