Part of the book series: Advances in Intelligent Systems and Computing ((AISC,volume 1344))

Abstract

Entity resolution is the process of matching records from more than one database that refer to the same entity. When the records come from a single database, the process is called deduplication. This article proposes a method for solving the deduplication problem using Scala on the Spark framework. Data quality is very important for good data governance, which aims to improve the interaction between the collaborators of the organizations concerned. The presence of duplicate or similar data creates significant data quality concerns. An overview of methods for computing distance-based similarity between records, together with algorithms for eliminating similar data, is presented and compared.
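To make the idea of similarity-based deduplication concrete, here is a minimal sketch in plain Python. It is not the chapter's method, which targets Scala on Spark and is not shown in this preview. The choice of Levenshtein edit distance, the function names `levenshtein`, `similarity`, and `deduplicate`, and the `0.85` threshold are all assumptions made for illustration.

```python
from typing import List


def levenshtein(a: str, b: str) -> int:
    """Edit distance: minimum insertions, deletions, substitutions to turn a into b."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Normalize edit distance to [0, 1]; 1.0 means identical strings."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest


def deduplicate(records: List[str], threshold: float = 0.85) -> List[str]:
    """Keep a record only if it is not too similar to any record already kept.

    The threshold is a hypothetical value chosen for this example.
    """
    kept: List[str] = []
    for record in records:
        key = record.strip().lower()
        if all(similarity(key, k.strip().lower()) < threshold for k in kept):
            kept.append(record)
    return kept


if __name__ == "__main__":
    names = ["John Smith", "Jon Smith", "Mary Jones", "john smith "]
    print(deduplicate(names))  # prints ['John Smith', 'Mary Jones']
```

This sketch compares every record against every kept record, which is quadratic in the worst case. On large datasets the candidate pairs are usually narrowed first, for example by grouping records on a shared key, before any distance is computed.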

Author information

Corresponding author

Correspondence to M. El Abassi.

Copyright information

© 2021 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this chapter

Cite this chapter

El Abassi, M., Amnai, M., Choukri, A. (2021). Deduplication Over Big Data Integration. In: Gherabi, N., Kacprzyk, J. (eds) Intelligent Systems in Big Data, Semantic Web and Machine Learning. Advances in Intelligent Systems and Computing, vol 1344. Springer, Cham. https://doi.org/10.1007/978-3-030-72588-4_15
