Deduplication Over Big Data Integration

El Abassi, M.; Amnai, Med.; Choukri, A.

doi:10.1007/978-3-030-72588-4_15

Part of the book series: Advances in Intelligent Systems and Computing (AISC, volume 1344)

Abstract

Entity resolution is the process of matching records from more than one database that refer to the same entity. When the records come from a single database, the process is called deduplication. This article proposes a method for solving the deduplication problem using Scala on the Spark framework. Data quality is central to good data governance, which aims to improve the interaction between the collaborators of the organization or organizations concerned. Duplicate or near-duplicate data creates significant data quality concerns. Methods for computing distance and similarity between records, together with algorithms for eliminating similar data, are presented and compared.
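
To make the idea of similarity-based deduplication concrete, the following is a minimal sketch in Python. It is not the authors' method: the chapter's implementation is written in Scala on Spark and is not shown in this preview. The sketch assumes Jaccard similarity over word tokens as the similarity measure and a greedy keep-first strategy; the function names (tokens, jaccard, deduplicate), the sample records, and the threshold values are all hypothetical choices made for illustration.

```python
from typing import List, Set


def tokens(record: str) -> Set[str]:
    """Lowercase a record and split it on whitespace into a set of word tokens."""
    return set(record.lower().split())


def jaccard(a: str, b: str) -> float:
    """Jaccard similarity of the two records' token sets (1.0 means identical sets)."""
    ta, tb = tokens(a), tokens(b)
    if not ta and not tb:
        # Two empty records are treated as identical (an assumption of this sketch).
        return 1.0
    return len(ta & tb) / len(ta | tb)


def deduplicate(records: List[str], threshold: float = 0.8) -> List[str]:
    """Keep a record only if it is below the threshold against every record already kept.

    The result is greedy and order-dependent, and the number of comparisons
    grows quadratically with the number of records in the worst case.
    """
    kept: List[str] = []
    for record in records:
        if all(jaccard(record, k) < threshold for k in kept):
            kept.append(record)
    return kept


if __name__ == "__main__":
    # Hypothetical sample data, invented for this illustration.
    people = [
        "John Smith 12 Main Street Boston",
        "john smith 12 main street boston",
        "Jon Smith 12 Main Street Boston",
        "Mary Jones 5 Oak Avenue Denver",
    ]
    for record in deduplicate(people, threshold=0.6):
        print(record)
```

Because every record is compared against every record kept so far, this approach only suits small inputs. On large datasets, a blocking step that restricts comparisons to records sharing a key is typically needed before any pairwise similarity is computed.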

Author information

Authors and Affiliations

Laboratory of Computer Sciences, Faculty of Sciences, Ibn Tofail University, Kenitra, Morocco
M. El Abassi, Med. Amnai & A. Choukri
S.A.R.S Group, ENSA Safi, Cadi Ayyad University, Safi, Morocco
A. Choukri

Corresponding author

Correspondence to M. El Abassi.

Editor information

Editors and Affiliations

ENSA, Sultan Moulay Slimane University, Khouribga, Morocco
Noreddine Gherabi
Systems Research Institute, Polish Academy of Sciences, Warsaw, Poland
Janusz Kacprzyk

About this chapter

Cite this chapter

El Abassi, M., Amnai, M., Choukri, A. (2021). Deduplication Over Big Data Integration. In: Gherabi, N., Kacprzyk, J. (eds) Intelligent Systems in Big Data, Semantic Web and Machine Learning. Advances in Intelligent Systems and Computing, vol 1344. Springer, Cham. https://doi.org/10.1007/978-3-030-72588-4_15

DOI: https://doi.org/10.1007/978-3-030-72588-4_15
Published: 29 May 2021
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-72587-7
Online ISBN: 978-3-030-72588-4
eBook Packages: Intelligent Technologies and Robotics (R0)
