Data Cleaning in Machine Learning: Improving Real Life Decisions and Challenges

Borrohou, Sanae; Fissoune, Rachida; Badir, Hassan; Tabaa, Mohamed

doi:10.1007/978-3-031-26384-2_54

Part of the book series: Lecture Notes in Networks and Systems (LNNS, volume 637)

Included in the following conference series:

International Conference on Advanced Intelligent Systems for Sustainable Development

Abstract

Real-world data is generally impure: it may contain outliers as well as incomplete, duplicated, or obsolete values. Unclean data can lead to false conclusions and incorrect decisions, so it is crucial to clean data efficiently. It is widely known that data quality affects the performance of machine learning (ML) models, and data scientists spend considerable time on data cleaning before model training. In this work we shed light on some of the most important aspects of data cleaning, focusing on one particularly important problem: outlier detection. We classify outlier detection approaches into three categories: statistics-based, distance-based, and model-based techniques. We discuss each category and its sub-techniques, along with the advantages and disadvantages of each. We also show how contextual outlier identification and subspace outlier detection techniques may be used to overcome the “curse of dimensionality” in high-dimensional outlier detection. We then benchmark the major outlier detection algorithms and select the best algorithms depending on the dataset. Finally, we propose a new algorithm based mainly on two techniques, k-Means clustering and Isolation Forest, to improve the results of outlier detection algorithms.
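To make the first two categories named in the abstract concrete, the following is a minimal sketch of a statistics-based detector (z-score) and a distance-based detector (distance to the k-th nearest neighbour). It is an illustration only and is not taken from the chapter: the function names, thresholds, and sample data are assumptions. Model-based methods such as Isolation Forest would normally come from a library and are not shown here.

```python
import math
import statistics


def zscore_outliers(values, threshold=3.0):
    """Statistics-based: return indices whose |z-score| exceeds the threshold.

    The threshold of 3.0 is a common convention, assumed here for illustration.
    """
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)  # population standard deviation
    if std == 0:
        return []  # all values identical, nothing can be flagged
    return [i for i, v in enumerate(values) if abs(v - mean) / std > threshold]


def knn_distance_outliers(points, k=2, threshold=5.0):
    """Distance-based: return indices of points whose Euclidean distance
    to their k-th nearest neighbour exceeds the threshold.

    k and threshold are hypothetical values chosen for the toy data below.
    """
    flagged = []
    for i, p in enumerate(points):
        # Distances from p to every other point, smallest first.
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        if dists[k - 1] > threshold:
            flagged.append(i)
    return flagged


# Toy data (invented for this sketch): one obvious outlier in each set.
values = [10, 11, 9, 10, 12, 10, 11, 9, 10, 11, 10, 95]
points = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10)]

print(zscore_outliers(values))
print(knn_distance_outliers(points))
```

Both detectors depend on a hand-picked threshold, and the z-score itself is distorted by the outlier it is trying to find, which is one reason robust or model-based alternatives are often preferred.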

Author information

Authors and Affiliations

IDS Team, Abdelmalek Essaadi University, Tangier, Morocco
Sanae Borrohou, Rachida Fissoune & Hassan Badir
LPRI, EMSI, Casablanca, Morocco
Mohamed Tabaa

Corresponding author

Correspondence to Sanae Borrohou.

Editor information

Editors and Affiliations

Polish Academy of Sciences, Systems Research Institute, Warsaw, Poland
Janusz Kacprzyk
Abdelmalek Essaâdi University, Tangier, Morocco
Mostafa Ezziyyani
Department of Automatics and Applied Software, Aurel Vlaicu University of Arad, Arad, Romania
Valentina Emilia Balas

About this paper

Cite this paper

Borrohou, S., Fissoune, R., Badir, H., Tabaa, M. (2023). Data Cleaning in Machine Learning: Improving Real Life Decisions and Challenges. In: Kacprzyk, J., Ezziyyani, M., Balas, V.E. (eds) International Conference on Advanced Intelligent Systems for Sustainable Development. AI2SD 2022. Lecture Notes in Networks and Systems, vol 637. Springer, Cham. https://doi.org/10.1007/978-3-031-26384-2_54

DOI: https://doi.org/10.1007/978-3-031-26384-2_54
Published: 10 June 2023
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-26383-5
Online ISBN: 978-3-031-26384-2
eBook Packages: Intelligent Technologies and Robotics (R0)
