
Data Cleaning in Machine Learning: Improving Real Life Decisions and Challenges

  • Conference paper
International Conference on Advanced Intelligent Systems for Sustainable Development (AI2SD 2022)

Part of the book series: Lecture Notes in Networks and Systems (LNNS, volume 637)

  • 735 Accesses

Abstract

Real-world data is generally impure: it may contain outliers and incomplete, duplicated, or obsolete values. Unclean data can lead to false conclusions and incorrect decisions, so it is crucial to clean the data efficiently. It is widely known that data quality affects machine learning (ML) model performance, and data scientists spend considerable time on data cleaning before model training. In this work we shed light on some of the most important aspects of data cleaning, focusing on one particularly important problem: outlier detection. We classify outlier detection approaches into three categories: statistics-based, distance-based, and model-based techniques. We discuss each category, its sub-techniques, and the advantages and disadvantages of every technique. We also show how contextual outlier identification and subspace outlier detection techniques may be used to overcome the "curse of dimensionality" in high-dimensional outlier detection. We then benchmark the major outlier detection algorithms and select the best algorithms depending on the dataset. Finally, we propose a new algorithm based mainly on two techniques, k-Means clustering and Isolation Forest, to improve the results of outlier detection algorithms.
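To make the first two categories named in the abstract concrete, the following is a minimal sketch rather than the authors' method: a statistics-based check using the modified z-score (median and median absolute deviation) and a distance-based check using the distance to the k-th nearest neighbour. The function names, the thresholds (3.5 and 5.0), and the sample readings are illustrative assumptions that do not come from the paper.

```python
from statistics import median


def mad_outliers(values, threshold=3.5):
    """Statistics-based: flag indices whose modified z-score exceeds threshold.

    The threshold of 3.5 is a commonly used default, assumed here.
    """
    med = median(values)
    mad = median(abs(v - med) for v in values)  # median absolute deviation
    if mad == 0:
        return []  # no spread around the median; the score is undefined
    return [i for i, v in enumerate(values)
            if 0.6745 * abs(v - med) / mad > threshold]


def knn_distance_outliers(values, k=2, threshold=5.0):
    """Distance-based: flag indices whose k-th nearest neighbour is far away.

    k and threshold are hypothetical settings chosen for this toy data.
    """
    flagged = []
    for i, v in enumerate(values):
        # Sorted distances from point i to every other point.
        dists = sorted(abs(v - w) for j, w in enumerate(values) if j != i)
        if dists[k - 1] > threshold:
            flagged.append(i)
    return flagged


# Hypothetical sensor readings with one obvious anomaly at index 6.
readings = [10.1, 9.8, 10.3, 10.0, 9.9, 10.2, 25.0, 10.1]
print(mad_outliers(readings))
print(knn_distance_outliers(readings))
```

Both checks flag only the reading 25.0 on this toy input. The median-based score is used instead of the mean-based z-score because a single extreme value inflates the mean and standard deviation enough to mask itself in small samples.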




Corresponding author

Correspondence to Sanae Borrohou.


Copyright information

© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper


Cite this paper

Borrohou, S., Fissoune, R., Badir, H., Tabaa, M. (2023). Data Cleaning in Machine Learning: Improving Real Life Decisions and Challenges. In: Kacprzyk, J., Ezziyyani, M., Balas, V.E. (eds) International Conference on Advanced Intelligent Systems for Sustainable Development. AI2SD 2022. Lecture Notes in Networks and Systems, vol 637. Springer, Cham. https://doi.org/10.1007/978-3-031-26384-2_54
