Abstract
Class imbalance is a common challenge in pattern classification of real-world medical data-sets. A counter-measure typically used is re-sampling. In this paper we implement an artificial neural network (ANN) with different re-sampling techniques and compare the resulting performance. The re-sampling strategies were a control, under-sampling, over-sampling, and a combination of the two. We found that over-sampling and the combination of under- and over-sampling both led to significantly superior classifier performance, compared to under-sampling only, in correctly predicting labelled classes.
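The three re-sampling strategies named above can be illustrated with a minimal sketch. It assumes plain random re-sampling with NumPy; the abstract does not say which variants the paper uses, so the function names, the per-class targets (smallest class, largest class, mean class size) and the toy data are all hypothetical choices for illustration only.

```python
import numpy as np


def resample(X, y, strategy, seed=0):
    """Randomly re-sample (X, y) so that every class ends up the same size.

    strategy: "under"    -> shrink every class to the smallest class size
              "over"     -> grow every class to the largest class size
              "combined" -> move every class to the mean class size
    These targets are illustrative assumptions, not the paper's settings.
    """
    rng = np.random.default_rng(seed)
    labels, counts = np.unique(y, return_counts=True)

    if strategy == "under":
        target = int(counts.min())
    elif strategy == "over":
        target = int(counts.max())
    elif strategy == "combined":
        target = int(round(counts.mean()))
    else:
        raise ValueError(f"unknown strategy: {strategy!r}")

    chosen = []
    for label in labels:
        idx = np.flatnonzero(y == label)
        if len(idx) >= target:
            # Under-sample: keep a random subset, drawn without replacement.
            chosen.append(rng.choice(idx, size=target, replace=False))
        else:
            # Over-sample: keep every original sample and add random duplicates.
            extra = rng.choice(idx, size=target - len(idx), replace=True)
            chosen.append(np.concatenate([idx, extra]))

    keep = np.concatenate(chosen)
    return X[keep], y[keep]


def class_counts(y):
    """Return {label: count} for a label vector."""
    labels, counts = np.unique(y, return_counts=True)
    return dict(zip(labels.tolist(), counts.tolist()))


if __name__ == "__main__":
    # Toy imbalanced data: 10 / 4 / 1 samples in classes 0 / 1 / 2.
    y = np.array([0] * 10 + [1] * 4 + [2] * 1)
    X = np.arange(len(y)).reshape(-1, 1)
    for strategy in ("under", "over", "combined"):
        _, y_new = resample(X, y, strategy)
        print(strategy, class_counts(y_new))
```

In a comparison like the one described, re-sampling would normally be applied to the training split only, so that the evaluation data keeps its original class distribution.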
Copyright information
© 2019 Springer Nature Switzerland AG
About this paper
Cite this paper
Saul, M.A., Rostami, S. (2019). A Comparison of Re-sampling Techniques for Pattern Classification in Imbalanced Data-Sets. In: Lotfi, A., Bouchachia, H., Gegov, A., Langensiepen, C., McGinnity, M. (eds) Advances in Computational Intelligence Systems. UKCI 2018. Advances in Intelligent Systems and Computing, vol 840. Springer, Cham. https://doi.org/10.1007/978-3-319-97982-3_20
DOI: https://doi.org/10.1007/978-3-319-97982-3_20
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-97981-6
Online ISBN: 978-3-319-97982-3
eBook Packages: Intelligent Technologies and Robotics (R0)