Abstract
This paper studies the problem of class imbalance in medical datasets. Modern machine learning techniques are increasingly popular for this type of problem, with applications in the health and medical domains. A major difficulty with these techniques is that the datasets involved are often highly imbalanced. Under-sampling and over-sampling techniques are commonly used to work around this problem. In this paper, we apply random forests, ensembles of decision trees fitted to subsamples of the data, built using under-sampling and over-sampling. We then compare fit metrics obtained across the various model specifications tested and evaluate their results both in-sample and out-of-sample. We observed that random forests built on imbalanced sub-samples smaller than the original sample achieved the best performance among the random forests tested and improved on the results previously obtained for the medical dataset.
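The under-sampling approach the abstract describes can be illustrated with a minimal sketch (not the authors' exact pipeline): the majority class is randomly under-sampled to the size of the minority class before a random forest is fitted. The dataset here is synthetic and the parameter choices are illustrative assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic imbalanced dataset standing in for a medical one: ~5% positives.
X, y = make_classification(n_samples=2000, n_features=10,
                           weights=[0.95, 0.05], random_state=0)

rng = np.random.default_rng(0)
minority = np.flatnonzero(y == 1)
majority = np.flatnonzero(y == 0)

# Random under-sampling: keep every minority case plus an equally sized
# random subset of the majority class, yielding a balanced sub-sample
# that is smaller than the original sample.
keep = np.concatenate(
    [minority, rng.choice(majority, size=minority.size, replace=False)])
X_bal, y_bal = X[keep], y[keep]

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_bal, y_bal)
```

Over-sampling works symmetrically (replicating or synthesizing minority cases, e.g. via SMOTE [Chawla et al., 2002]); repeating the under-sampling per tree recovers the balanced random forest of Chen, Liaw, and Breiman (2004).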
Copyright information
© 2020 Springer Nature Switzerland AG
Cite this paper
El-shafeiy, E., Abohany, A. (2020). Medical Imbalanced Data Classification Based on Random Forests. In: Hassanien, AE., Azar, A., Gaber, T., Oliva, D., Tolba, F. (eds) Proceedings of the International Conference on Artificial Intelligence and Computer Vision (AICV2020). AICV 2020. Advances in Intelligent Systems and Computing, vol 1153. Springer, Cham. https://doi.org/10.1007/978-3-030-44289-7_8
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-44288-0
Online ISBN: 978-3-030-44289-7
eBook Packages: Intelligent Technologies and Robotics (R0)