Abstract
Entity resolution is the process of matching records that refer to the same real-world entity across two or more databases. When the records come from a single database, the process is called deduplication. This article proposes a method for solving the deduplication problem using Scala on the Spark framework. Data quality is essential to good data governance, as it improves the interaction between the collaborators of the organizations concerned, and the presence of duplicate or near-duplicate records is a significant data quality concern. An overview of methods for computing similarity distances between records, as well as algorithms for eliminating similar data, is presented and compared.
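The similarity-based deduplication the abstract describes can be sketched in plain Scala. This is a minimal illustration, not the chapter's actual method: it assumes Levenshtein edit distance as the similarity measure, a normalized threshold of 0.8, and a naive pairwise scan (the chapter targets Spark for scale; `levenshtein`, `similarity`, and `dedup` are names invented here).

```scala
// Levenshtein edit distance: minimum number of single-character
// insertions, deletions, or substitutions turning s into t.
def levenshtein(s: String, t: String): Int = {
  val dp = Array.tabulate(s.length + 1, t.length + 1) { (i, j) =>
    if (i == 0) j else if (j == 0) i else 0
  }
  for (i <- 1 to s.length; j <- 1 to t.length) {
    val cost = if (s(i - 1) == t(j - 1)) 0 else 1
    dp(i)(j) = math.min(
      math.min(dp(i - 1)(j) + 1, dp(i)(j - 1) + 1),
      dp(i - 1)(j - 1) + cost)
  }
  dp(s.length)(t.length)
}

// Normalized similarity in [0, 1]; 1.0 means identical strings.
def similarity(s: String, t: String): Double =
  if (s.isEmpty && t.isEmpty) 1.0
  else 1.0 - levenshtein(s, t).toDouble / math.max(s.length, t.length)

// Naive deduplication: keep a record only if no previously kept
// record exceeds the similarity threshold.
def dedup(records: Seq[String], threshold: Double = 0.8): Seq[String] =
  records.foldLeft(Vector.empty[String]) { (kept, r) =>
    if (kept.exists(k => similarity(k, r) >= threshold)) kept
    else kept :+ r
  }
```

For example, `dedup(Seq("John Smith", "Jon Smith", "Alice"))` keeps `"John Smith"` and `"Alice"`, dropping the near-duplicate. On Spark, the same pairwise comparison would typically be preceded by a blocking step so that only records sharing a blocking key are compared.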
Copyright information
© 2021 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this chapter
Cite this chapter
El Abassi, M., Amnai, M., Choukri, A. (2021). Deduplication Over Big Data Integration. In: Gherabi, N., Kacprzyk, J. (eds) Intelligent Systems in Big Data, Semantic Web and Machine Learning. Advances in Intelligent Systems and Computing, vol 1344. Springer, Cham. https://doi.org/10.1007/978-3-030-72588-4_15
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-72587-7
Online ISBN: 978-3-030-72588-4
eBook Packages: Intelligent Technologies and Robotics (R0)