Abstract
Entity resolution is a key aspect of data quality and data integration: it identifies which records in data sources correspond to the same real-world entity. Many existing approaches rely on manually designed match rules, which require domain knowledge and are time-consuming to create. We propose a novel genetic-algorithm-based entity resolution approach with active learning. It learns effective match rules by logically combining comparisons of several different attributes, each with an appropriate threshold. Active learning reduces the amount of manually labeled data needed and speeds up the learning process. An extensive evaluation shows that the proposed approach outperforms state-of-the-art entity resolution approaches in accuracy.
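To make the notion of a match rule concrete, the following is a minimal illustrative sketch of a rule that logically combines thresholded attribute comparisons. It is not the authors' implementation: the class names (`Comparison`, `And`, `Or`), the use of `difflib.SequenceMatcher` as the similarity measure, the attribute names, and the threshold values are all assumptions chosen for illustration. The genetic search over such rules and the active-learning step are not shown.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher
from typing import Dict, List, Union

Record = Dict[str, str]


def similarity(a: str, b: str) -> float:
    """Case-insensitive string similarity in [0, 1] (an assumed stand-in measure)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


@dataclass
class Comparison:
    """Leaf of a rule: one attribute compared against a threshold."""
    attribute: str
    threshold: float

    def matches(self, r1: Record, r2: Record) -> bool:
        return similarity(r1[self.attribute], r2[self.attribute]) >= self.threshold


@dataclass
class And:
    """Matches only if every child rule matches."""
    children: List["Rule"]

    def matches(self, r1: Record, r2: Record) -> bool:
        return all(child.matches(r1, r2) for child in self.children)


@dataclass
class Or:
    """Matches if at least one child rule matches."""
    children: List["Rule"]

    def matches(self, r1: Record, r2: Record) -> bool:
        return any(child.matches(r1, r2) for child in self.children)


Rule = Union[Comparison, And, Or]

# Hypothetical rule: names must be similar, and either city or phone must agree.
rule = And([
    Comparison("name", 0.8),
    Or([Comparison("city", 0.9), Comparison("phone", 1.0)]),
])

# Made-up records, used only to exercise the rule.
a = {"name": "Jonathan Smith", "city": "Boston", "phone": "555-0100"}
b = {"name": "Jonathon Smith", "city": "Boston", "phone": "555-0199"}
c = {"name": "Maria Garcia", "city": "Boston", "phone": "555-0100"}

same_ab = rule.matches(a, b)  # near-identical names, same city
same_ac = rule.matches(a, c)  # different names, so the And fails
```

In a sketch like this, a learning procedure would vary which attributes appear, how they are combined with `And` and `Or`, and what thresholds they use, then keep the rules that best agree with labeled record pairs.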
Acknowledgements
The authors thank the anonymous reviewers for their thought-provoking questions and helpful suggestions during the review process. This work was supported by the National Basic Research Program of China (973 Program) (2012CB316201), the Fundamental Research Funds for the Central Universities (N120816001), and the National Natural Science Foundation of China (Grant Nos. 61472070, 61402213).
Author information
Additional information
This work is an extended version of a paper presented at the 11th IEEE Web Information System and Application Conference.
Chenchen Sun is a PhD candidate in the College of Information Science and Engineering, Northeastern University, China. He received his BS and MS from the same university in 2010 and 2012, respectively. His research interest is entity resolution.
Derong Shen is a professor and PhD supervisor in the College of Information Science and Engineering, Northeastern University, China, from where she received her PhD in 2004. She received her BS and MS from Jilin University, China in 1987 and 1990, respectively. Her interests include distributed data management and data integration.
Yue Kou is an associate professor in the College of Information Science and Engineering, Northeastern University, China, from where she also received her BS, MS, and PhD in 2002, 2005, and 2009, respectively. Her interests include entity search and data mining.
Tiezheng Nie is an associate professor in the College of Information Science and Engineering, Northeastern University, China, from where he received his BS, MS, and PhD in 2002, 2005, and 2009, respectively. His interests include data quality and data integration.
Ge Yu is a professor and PhD supervisor in the College of Information Science and Engineering, Northeastern University, China, from where he received his BS and MS in 1982 and 1985, respectively. He received his PhD from Kyushu University, Japan in 1996. He is a senior member of the CCF and a member of the ACM and IEEE. His interests include databases and big data management.
Cite this article
Sun, C., Shen, D., Kou, Y. et al. A genetic algorithm based entity resolution approach with active learning. Front. Comput. Sci. 11, 147–159 (2017). https://doi.org/10.1007/s11704-015-5276-6
DOI: https://doi.org/10.1007/s11704-015-5276-6