EntityManager: Managing Dirty Data Based on Entity Resolution

Liu, Xue-Li; Wang, Hong-Zhi; Li, Jian-Zhong; Gao, Hong

doi:10.1007/s11390-017-1731-1

EntityManager: Managing Dirty Data Based on Entity Resolution

Regular Paper
Published: 12 May 2017

Volume 32, pages 644–662, (2017)
Cite this article

Download PDF

Access provided by Autonomous University of Puebla

Journal of Computer Science and Technology Aims and scope Submit manuscript

EntityManager: Managing Dirty Data Based on Entity Resolution

Download PDF

Xue-Li Liu¹,
Hong-Zhi Wang¹,
Jian-Zhong Li¹ &
…
Hong Gao¹

151 Accesses
3 Citations
Explore all metrics

Abstract

Data quality is important in many data-driven applications, such as decision making, data analysis, and data mining. Recent studies focus on data cleaning techniques by deleting or repairing the dirty data, which may cause information loss and bring new inconsistencies. To avoid these problems, we propose EntityManager, a general system to manage dirty data without data cleaning. This system takes real-world entity as the basic storage unit and retrieves query results according to the quality requirement of users. The system is able to handle all kinds of inconsistencies recognized by entity resolution. We elaborate the EntityManager system, covering its architecture, data model, and query processing techniques. To process queries efficiently, our system adopts novel indices, similarity operator and query optimization techniques. Finally, we verify the efficiency and effectiveness of this system and present future research challenges.

Artificial Intelligence

Use our pre-submission checklist

Avoid common mistakes on your manuscript.

References

Andritsos P, Fuxman A, Miller R J. Clean answers over dirty databases: A probabilistic approach. In Proc. the 22nd ICDE, April 2006, Article No. 30.
Fuxman A D, Miller R J. First-order query rewriting for inconsistent databases. In Proc. the 10th ICDT, January 2005, pp.337-351.
Fuxman A, Fazli E, Miller R J. Conquer: Efficient management of inconsistent databases. In Proc. SIGMOD, June 2005, pp.155-166.
Boulos J, Dalvi N, Mandhani B, Mathur S, Ré C, Suciu D. MYSTIQ: A system for finding more answers by using probabili31 ties. In Proc. SIGMOD, June 2005, pp.891-893.
Hassanzadeh O, Miller R J. Creating probabilistic databases from duplicated data. VLDB J., 2009, 18(5): 1141-1166.
Article Google Scholar
Widom J. Trio: A system for integrated management of data, accuracy, and lineage. In Proc. CIDR, Jan. 2005, pp.262-276.
Getoor L, Machanavajjhala A. Entity resolution: Theory, practice & open challenges. PVLDB, 2012, 5(12): 2018-2019.
Google Scholar
Waguih D A, Berti-Equille L. Truth discovery algorithms: An experimental evaluation. arXiv: 1409.6428, May 2014. https://arxiv.org/abs/1409.6428, Mar. 2017.
Lipner S B, Balenson D M, Ellison CM,Walker S T. System and method for data recovery, September 1996. US Patent 5,557,765. https://www.google.com/patents/us5557765, Apr. 2017.
Miles M B, Huberman A M. Qualitative Data Analysis: An Expanded Sourcebook. Sage Publications, Inc., 1994.
Rahm E, Do H H. Data cleaning: Problems and current approaches. IEEE Data Eng. Bull., 2000, 23(4): 3-13.
Google Scholar
Arasu A, Ganti V, Kaushik R. Efficient exact set-similarity joins. In Proc. the 32nd VLDB, September 2006, pp.918-929.
Behm A, Ji S, Li C, Lu J. Space-constrained gram-based indexing for efficient approximate string search. In Proc. ICDE, March 29-April 2, 2009, pp.604-615.
Goemans M X, Williamson D P. Improved approximation algorithms for maximum cut and satisfiability problems using semidefinite programming. Journal of the ACM, 1995, 42(6): 1115-1145.
Article MathSciNet MATH Google Scholar
Hadjieleftheriou M, Chandel A, Koudas N, Srivastava D. Fast indexes and algorithms for set similarity selection queries. In Proc. the 24th ICDE, April 2008, pp.267-276.
Hadjieleftheriou M, Koudas N, Srivastava D. Incremental maintenance of length normalized indexes for approximate string matching. In Proc. ACM SIGMOD, June 29-July 2, 2009, pp.429-440.
Xiao C, Wang W, Lin X. Ed-Join: An efficient algorithm for similarity joins with edit distance constraints. PVLDB, 2008, 1(1): 933-944.
MathSciNet Google Scholar
Xiao C, Wang W, Lin X, Yu J X, Wang G. Efficient similarity joins for near-duplicate detection. ACM Transactions on Database Systems, 2011, 36(3): 15:1-15:15.
Zhang Z, Hadjieleftheriou M, Ooi B C, Srivastava D. Bedtree: An all-purpose index structure for string similarity search based on edit distance. In Proc. SIGMOD, June 2010, pp.915-926.
Bayardo R J, Ma Y, Srikant R. Scaling up all pairs similarity search. In Proc. the 16th WWW, May 2007, pp.131-140.
Wang J, Li G, Feng J. Trie-join: Efficient trie-based string similarity joins with edit-distance constraints. PVLDB, 2010, 3(1): 1219-1230.
Google Scholar
Sarawagi S, Kirpal A. Efficient set joins on similarity predicates. In Proc. ACM SIGMOD, June 2004, pp.743-754.
Vernica R, Carey M J, Li C. Efficient parallel set-similarity joins using mapreduce. In Proc. ACM SIGMOD, June 2010, pp.495-506.
Li C, Wang B, Yang X. VGRAM: Improving performance of approximate queries on string collections using variablelength grams. In Proc. the 33rd VLDB, September 2007, pp.303-314.
Wang J, Li G, Feng J. Can we beat the prefix filtering?: An adaptive framework for similarity join and search. In Proc. ACM SIGMOD, May 2012, pp.85-96.
Ioannidis Y E. The history of histograms (abridged). In Proc. the 29th VLDB, Sept. 2003, pp.19-30.
Haas P J, Naughton J F, Seshadri S, Swami A N. Selectivity and cost estimation for joins based on random sampling. Journal of Computer and System Sciences, 1996, 52(3): 550-569.
Article MathSciNet MATH Google Scholar
Hou W C, Ozsoyoglu G, Dogdu E. Error-constrained COUNT query evaluation in relational databases. ACM SIGMOD Record, 1991, 20(2): 278-287.
Article Google Scholar
Olken F. Random sampling from databases [Ph.D. Thesis]. University of California, 1993.
Ngu A H, Harangsri B, Shepherd J. Query size estimation for joins using systematic sampling. Distributed and Parallel Databases, 2004, 15(3): 237-275.
Article Google Scholar
Lee H, Ng R T, Shim K. Similarity join size estimation using locality sensitive hashing. PVLDB, 2011, 4(6): 338-349.
Google Scholar
Tong X, Wang H. Fgram-Tree: An index structure based on feature grams for string approximate search. In Proc. the 13th WAIM, August 2012, pp.241-253.
Liu X,Wang H, Li J, Gao H. Similarity join algorithm based on entity. Journal of Software, 2015, 26(6): 1421-1437. (in Chinese)
Zhang Y, Yang L, Wang H. Range query estimation for dirty data management system. In Proc. the 13th WAIM, August 2012, pp.152-164.
Liu X, Wang H, Li J, Gao H. Multi-similarity join order selection in entity database. Journal of Frontiers of Computer Science and Technology, 2012, 6(10): 865-876.
Google Scholar
Garcia-Molina H, Ullman J D, Widom J. Database System Implementation. Prentice-Hall, 2000.
Abiteboul S, Hull R, Vianu V. Foundations of Databases. Addison-Wesley, 1995.
Ilyas I F, Beskales G, Soliman M A. A survey of top-k query processing techniques in relational database systems. ACM Computing Surveys (CSUR), 2008, 40(4): 11:1-11:58.
Zhang Y, Yang L, Wang H. Similarity join size estimation with threshold for dirty data. Journal of Computers, 2012, 35(10): 2159-2168. (in Chinese)
Xu R, Wunsch D. Survey of clustering algorithms. IEEE Transactions on Neural Networks, 2005, 16(3): 645-678.
Article Google Scholar
Clauset A, Newman M E, Moore C. Finding community structure in very large networks. Physical Review E, 2004, 70(6): 66-111.
Article Google Scholar
Li Y, Wang H, Gao H. Efficient entity resolution based on sequence rules. In Proc. CSIE, May 2011, pp.381-388.
Kuang D, Li X, Ling C X. A new search engine integrating hierarchical browsing and keyword search. In Proc. the 22nd IJCAI, July 2011, pp.2464-2469.

Download references

Author information

Authors and Affiliations

Massive Data Computing Laboratory, Harbin Institute of Technology, Harbin, 150001, China
Xue-Li Liu, Hong-Zhi Wang, Jian-Zhong Li & Hong Gao

Authors

Xue-Li Liu
View author publications
You can also search for this author in PubMed Google Scholar
Hong-Zhi Wang
View author publications
You can also search for this author in PubMed Google Scholar
Jian-Zhong Li
View author publications
You can also search for this author in PubMed Google Scholar
Hong Gao
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Hong-Zhi Wang.

Electronic supplementary material

Below is the link to the electronic supplementary material.

ESM 1

(PDF 79 kb)

Rights and permissions

Reprints and permissions

About this article

Cite this article

Liu, XL., Wang, HZ., Li, JZ. et al. EntityManager: Managing Dirty Data Based on Entity Resolution. J. Comput. Sci. Technol. 32, 644–662 (2017). https://doi.org/10.1007/s11390-017-1731-1

Download citation

Received: 29 February 2016
Revised: 10 January 2017
Published: 12 May 2017
Issue Date: May 2017
DOI: https://doi.org/10.1007/s11390-017-1731-1

Keywords

Use our pre-submission checklist

Avoid common mistakes on your manuscript.

EntityManager: Managing Dirty Data Based on Entity Resolution

Abstract

Article PDF

Similar content being viewed by others

EntityManager: An Entity-Based Dirty Data Management System

Entity Resolution in Big Data Era: Challenges and Applications

ERBlox: Combining Matching Dependencies with Machine Learning for Entity Resolution

References

Author information

Authors and Affiliations

Corresponding author

Electronic supplementary material

ESM 1

Rights and permissions

About this article

Cite this article

Keywords

Navigation

EntityManager: Managing Dirty Data Based on Entity Resolution

Abstract

Article PDF

Similar content being viewed by others

EntityManager: An Entity-Based Dirty Data Management System

Entity Resolution in Big Data Era: Challenges and Applications

ERBlox: Combining Matching Dependencies with Machine Learning for Entity Resolution

Explore related subjects

References

Author information

Authors and Affiliations

Corresponding author

Electronic supplementary material

ESM 1

Rights and permissions

About this article

Cite this article

Share this article

Keywords

Search

Navigation