Abstract
Real-world data is generally impure: it may contain outliers and incomplete, duplicated, or obsolete values. Unclean data can lead to false conclusions and incorrect decisions, so it is crucial to clean data efficiently. It is widely known that data quality affects machine learning (ML) model performance, and data scientists spend considerable time on data cleaning before model training. In this work we shed light on some of the most important aspects of data cleaning, focusing on one particularly important problem: outlier detection. We classified outlier detection approaches into three categories: statistics-based, distance-based, and model-based techniques. We discussed these techniques and their sub-techniques, along with the advantages and disadvantages of each. We also showed how contextual outlier identification and subspace outlier detection techniques may be used to overcome the “curse of dimensionality” in high-dimensional outlier detection. We then benchmarked the major outlier detection algorithms and selected the best algorithms depending on the dataset. Finally, we proposed a new algorithm based mainly on two techniques, k-Means clustering and Isolation Forest, to improve the results of outlier detection algorithms.
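To make the first two categories concrete, the following is a minimal sketch of a statistics-based detector (z-score) and a distance-based detector (distance to the k-th nearest neighbor) on one-dimensional data. It is an illustration only, not the algorithm proposed in this paper: the function names, the sample data, and the thresholds are assumptions chosen for the example. Model-based techniques such as Isolation Forest are omitted because they usually rely on an external library.

```python
import statistics


def zscore_outliers(values, threshold=2.0):
    """Statistics-based: return indices whose |z-score| exceeds threshold.

    The threshold is an illustrative assumption, not a value from the paper.
    """
    mean = statistics.fmean(values)
    stdev = statistics.pstdev(values)  # population standard deviation
    if stdev == 0:
        return []  # all values identical, nothing to flag
    return [i for i, v in enumerate(values) if abs(v - mean) / stdev > threshold]


def knn_distance_outliers(values, k=2, threshold=5.0):
    """Distance-based: return indices whose distance to the k-th nearest
    neighbor exceeds threshold (k and threshold are illustrative)."""
    flagged = []
    for i, v in enumerate(values):
        # Distances from this point to every other point, smallest first.
        dists = sorted(abs(v - w) for j, w in enumerate(values) if j != i)
        if len(dists) >= k and dists[k - 1] > threshold:
            flagged.append(i)
    return flagged


# Hypothetical sample: six readings near 10 and one far away.
data = [10.1, 9.8, 10.3, 10.0, 9.9, 10.2, 25.0]
print(zscore_outliers(data))
print(knn_distance_outliers(data))
```

On this sample both detectors flag only the last value. In practice the two approaches can disagree, for example when a single extreme value inflates the standard deviation, which is one reason to benchmark several algorithms per dataset.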
Copyright information
© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Borrohou, S., Fissoune, R., Badir, H., Tabaa, M. (2023). Data Cleaning in Machine Learning: Improving Real Life Decisions and Challenges. In: Kacprzyk, J., Ezziyyani, M., Balas, V.E. (eds) International Conference on Advanced Intelligent Systems for Sustainable Development. AI2SD 2022. Lecture Notes in Networks and Systems, vol 637. Springer, Cham. https://doi.org/10.1007/978-3-031-26384-2_54
DOI: https://doi.org/10.1007/978-3-031-26384-2_54
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-26383-5
Online ISBN: 978-3-031-26384-2
eBook Packages: Intelligent Technologies and Robotics (R0)