Abstract
This paper presents the results of a study comparing various techniques for multiple imputation of missing values by chained equations (MICE), using logistic regression at the model verification stage. Missing values complicate data processing and increase the risk of error when solving problems in the many areas where data science techniques are applied. The simulation was carried out using both the R and KNIME software tools. The Mammographic Mass dataset from the UCI Machine Learning Repository served as the experimental data. The step-by-step handling of missing values began with data analysis and visualization of the missing values. We then handled the missing values using the various techniques available in the mice package for R. At each step of this procedure, the quality of the data processing was assessed with a logistic regression model, using ROC analysis and the following quantitative criteria: AUC (area under the ROC curve), the Akaike information criterion, and the Bayesian information criterion. Finally, we compared the missing-value handling techniques in order to select the best variants according to these criteria.
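The three criteria named above can be illustrated with a minimal sketch. It assumes that a logistic regression has already been fitted to an imputed dataset and that its predicted probabilities are available; the function names (auc, log_likelihood, aic, bic) and the toy data are hypothetical and are not taken from the paper, which performs these steps in R. AUC is computed here as the share of positive/negative pairs that the model ranks correctly, while the Akaike and Bayesian criteria are computed from the Bernoulli log-likelihood in their standard forms.

```python
import math


def auc(labels, scores):
    """Area under the ROC curve, computed as the fraction of
    (positive, negative) pairs ranked correctly; ties count as 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def log_likelihood(labels, probs, eps=1e-12):
    """Bernoulli log-likelihood of binary labels given predicted probabilities."""
    total = 0.0
    for y, p in zip(labels, probs):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total


def aic(ll, k):
    """Akaike information criterion for k estimated parameters."""
    return 2 * k - 2 * ll


def bic(ll, k, n):
    """Bayesian information criterion for k parameters and n observations."""
    return k * math.log(n) - 2 * ll


# Toy data (hypothetical): true classes and predicted probabilities.
labels = [0, 0, 1, 1]
probs = [0.1, 0.4, 0.35, 0.8]

ll = log_likelihood(labels, probs)
k = 2  # assumed: intercept plus one predictor
print("AUC:", auc(labels, probs))
print("AIC:", aic(ll, k))
print("BIC:", bic(ll, k, len(labels)))
```

A higher AUC indicates better discrimination, whereas lower Akaike and Bayesian values indicate a better trade-off between fit and model complexity. Computing all three for the model fitted after each imputation technique gives one way to rank the techniques.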
Copyright information
© 2021 The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Nadraga, V., Smirnov, V., Boiko, O., Dereko, V. (2021). Comparison of Missing Values Handling Techniques Using MICE Package Tools of R Software and Logistic Regression Model. In: Babichev, S., Lytvynenko, V., Wójcik, W., Vyshemyrskaya, S. (eds) Lecture Notes in Computational Intelligence and Decision Making. ISDMCI 2020. Advances in Intelligent Systems and Computing, vol 1246. Springer, Cham. https://doi.org/10.1007/978-3-030-54215-3_3
DOI: https://doi.org/10.1007/978-3-030-54215-3_3
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-54214-6
Online ISBN: 978-3-030-54215-3
eBook Packages: Intelligent Technologies and Robotics, Intelligent Technologies and Robotics (R0)