
Comparison of Missing Values Handling Techniques Using MICE Package Tools of R Software and Logistic Regression Model

  • Conference paper
Lecture Notes in Computational Intelligence and Decision Making (ISDMCI 2020)

Abstract

This paper presents the results of a study comparing various techniques for multiple imputation of missing values by chained equations (MICE), using logistic regression at the model verification stage. Missing values complicate data processing and increase the risk involved in solving problems across the many areas where data science techniques are applied. The simulation was performed using both R and KNIME software tools. The Mammographic Mass dataset from the UCI Machine Learning Repository served as the experimental data. The step-by-step procedure for handling missing values began with data analysis and visualization of the missing values. We then handled the missing values using the various techniques available in the MICE package for R. The quality of the data processing at each step of the procedure was estimated using a logistic regression model and ROC analysis, with calculation of three quantitative criteria: AUC (area under the ROC curve), the Akaike information criterion, and the Bayesian information criterion. At the final step, we compared the missing-value handling techniques in order to select the best variants according to these criteria.
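The three criteria named in the abstract can be illustrated with a minimal sketch. It is written in Python with the standard library only, purely for illustration; the study itself used R and KNIME, and this is not the authors' code. The function names, the toy labels and probabilities, and the parameter count are all assumptions made for the example. It uses the standard textbook definitions: AUC as the fraction of positive/negative pairs ranked correctly, AIC = 2k − 2 ln L, and BIC = k ln n − 2 ln L.

```python
import math


def auc(labels, scores):
    """Area under the ROC curve via the pairwise-ranking (Mann-Whitney) identity."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative case")
    # A pair counts 1 if the positive case scores higher, 0.5 on a tie.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def log_likelihood(labels, probs, eps=1e-12):
    """Bernoulli log-likelihood of the labels under the predicted probabilities."""
    total = 0.0
    for y, p in zip(labels, probs):
        p = min(max(p, eps), 1.0 - eps)  # guard against log(0)
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total


def aic(loglik, n_params):
    """Akaike information criterion: 2k - 2 ln L (lower is better)."""
    return 2 * n_params - 2 * loglik


def bic(loglik, n_params, n_obs):
    """Bayesian information criterion: k ln n - 2 ln L (lower is better)."""
    return n_params * math.log(n_obs) - 2 * loglik


# Hypothetical toy data: true classes and probabilities from some fitted model.
labels = [0, 0, 1, 1]
probs = [0.1, 0.4, 0.35, 0.8]
n_params = 2  # assumed: intercept plus one predictor

auc_value = auc(labels, probs)
ll = log_likelihood(labels, probs)
aic_value = aic(ll, n_params)
bic_value = bic(ll, n_params, len(labels))
print(auc_value, aic_value, bic_value)
```

On the toy data, three of the four positive/negative pairs are ordered correctly, so the AUC is 0.75. When comparing imputation variants in this way, a higher AUC and lower AIC and BIC values would indicate the preferable variant, provided each model is fitted to the same set of observations.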




Author information

Corresponding author

Correspondence to Vasiliy Nadraga.


Copyright information

© 2021 The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper


Cite this paper

Nadraga, V., Smirnov, V., Boiko, O., Dereko, V. (2021). Comparison of Missing Values Handling Techniques Using MICE Package Tools of R Software and Logistic Regression Model. In: Babichev, S., Lytvynenko, V., Wójcik, W., Vyshemyrskaya, S. (eds) Lecture Notes in Computational Intelligence and Decision Making. ISDMCI 2020. Advances in Intelligent Systems and Computing, vol 1246. Springer, Cham. https://doi.org/10.1007/978-3-030-54215-3_3
