Abstract
Entity resolution (ER) (also known as deduplication or merge-purge) is a process of identifying records that refer to the same real-world entity and merging them together. In practice, ER results may contain “inconsistencies,” either due to mistakes by the match and merge function writers or changes in the application semantics. To remove the inconsistencies, we introduce “negative rules” that disallow inconsistencies in the ER solution (ER-N). A consistent solution is then derived based on the guidance from a domain expert. The inconsistencies can be resolved in several ways, leading to accurate solutions. We formalize ER-N, treating the match, merge, and negative rules as black boxes, which permits expressive and extensible ER-N solutions. We identify important properties for the rules that, if satisfied, enable less costly ER-N. We develop and evaluate two algorithms that find an ER-N solution based on guidance from the domain expert: the GNR algorithm that does not assume the properties and the ENR algorithm that exploits the properties.
Article PDF
Similar content being viewed by others
Avoid common mistakes on your manuscript.
References
Benjelloun, O., Garcia-Molina, H., Kawai, H., Larson, T.E., Menestrina, D., Thavisomboon, S.: D-swoosh: A family of algorithms for generic, distributed entity resolution. In: ICDCS (2007)
Benjelloun, O., Garcia-Molina, H., Menestrina, D., Whang, S.E., Su, Q., Widom, J.: Swoosh: a generic approach to entity resolution. VLDB J. (2008). doi:10.1007/s00778-008-0098-x
Bhattacharya, I., Getoor, L.: Relational clustering for multi-type entity resolution. In: MRDM ’05: Proceedings of the 4th international workshop on multi-relational mining, pp. 3–12. ACM Press, New York (2005). doi:10.1145/1090193.1090195
Bohannon, P., Flaster, M., Fan, W., Rastogi, R.: A cost-based model and effective heuristic for repairing constraints by value modification. In: SIGMOD Conference, pp. 143–154 (2005)
Chaudhuri, S., Ganti, V., Motwani, R.: Robust identification of fuzzy duplicates. In: Proceedings of ICDE. Tokyo, Japan (2005)
Chaudhuri, S., Sarma, A.D., Ganti, V., Kaushik, R.: Leveraging aggregate constraints for deduplication. In: SIGMOD’07: Proceedings of the 2007 ACM SIGMOD international conference on management of data, pp. 437–448. ACM Press, New York (2007). doi:10.1145/1247480.1247530
Chomicki, J., Marcinkowski, J.: On the computational complexity of minimal-change integrity maintenance in relational databases. In: Inconsistency Tolerance, pp. 119–150 (2005)
Doan, A., Lu, Y., Lee, Y., Han, J.: Object matching for information integration: A profiler-based approach. In: IIWeb, pp. 53–58 (2003)
Doan A., Lu Y., Lee Y., Han J.: Profile-based object matching for information integration. IEEE Intell. Syst. 18(5), 54–59 (2003)
Dong, X., Halevy, A.Y., Madhavan, J.: Reference reconciliation in complex information spaces. In: SIGMOD Conference, pp. 85–96 (2005)
Elmagarmid A.K., Ipeirotis P.G., Verykios V.S.: Duplicate record detection: A survey. IEEE Trans. Knowl. Data Eng. 19(1), 1–16 (2007)
Eswaran, K.P., Chamberlin, D.D.: Functional specifications of subsystem for database integrity. In: VLDB, pp. 48–68 (1975)
Fellegi, I.P., Holt, D.: A systematic approach to automatic edit and imputation. J Am. Stat. Assoc. 71(353), 17–35 (1976). http://www.jstor.org/stable/2285726
Franconi, E., Palma, A.L., Leone, N., Perri, S., Scarcello, F.: Census data repair: A challenging application of disjunctive logic programming. In: LPAR’01: Proceedings of the Artificial Intelligence on Logic for Programming, pp. 561–578. Springer, London (2001)
Genesereth M.R., Nilsson N.J.: Logical Foundations of Artificial Intelligence. Morgan Kaufmann, Palo Alto (1988)
Ginsberg M.L.: Readings in Nonmonotonic Reasoning. Morgan Kaufmann, Los Altos (1987)
Gu, L., Baxter, R., Vickers, D., Rainsford, C.: Record linkage: Current practice and future directions. Tech. Rep. 03/83, CSIRO Mathematical and Information Sciences (2003)
Hammer, M., McLeod, D.: Semantic integrity in a relational data base system. In: VLDB, pp. 25–47 (1975)
Hernández, M.A., Stolfo, S.J.: The merge/purge problem for large databases. In: Proceedings of ACM SIGMOD, pp. 127–138 (1995)
Jaro M.A.: Advances in record-linkage methodology as applied to matching the 1985 census of tampa, florida. J. Am. Stat. Assoc. 84(406), 414–420 (1989)
McCallum, A.K., Nigam, K., Ungar, L.: Efficient clustering of high-dimensional data sets with application to reference matching. In: Proceedings of KDD, pp. 169–178. Boston, MA (2000)
Menestrina, D., Benjelloun, O., Garcia-Molina, H.: Generic entity resolution with data confidences. In: First International VLDB Workshop on Clean Databases. Seoul, Korea (2006)
Monge, A.E., Elkan, C.: An efficient domain-independent algorithm for detecting approximately duplicate database records. In: Proceedings of SIGMOD Workshop on Research Issues on Data Mining and Knowledge Discovery, pp. 23–29 (1997)
Newcombe H.B., Kennedy J.M., Axford S.J., James A.P.: Automatic linkage of vital records. Science 130(3381), 954–959 (1959)
Nilsson N.J.: Artificial Intelligence A New Synthesis. Morgan Kaufmann, San Francisco (1998)
Rowland, T.: Connected component. http://mathwold.wolfram.com/ConnectedComponent.html
Sarawagi, S., Bhamidipaty, A.: Interactive deduplication using active learning. In: Proceedings of ACM SIGKDD. Edmonton, Alberta (2002)
Shen, W., Li, X., Doan, A.: Constraint-based entity matching. In: AAAI, pp. 862–867 (2005)
Tejada S., Knoblock C.A., Minton S.: Learning object identification rules for information integration. Inf. Syst. J. 26(8), 635–656 (2001)
Whang, S.E., Benjelloun, O., Garcia-Molina, H.: Additional experiments on negative rules. Tech. rep., Stanford University. http://dbpubs.stanford.edu/pub/2005-5
Whang, S.E., Menestrina, D., Koutrika, G., Theobald, M., Garcia-Molina, H.: Entity resolution with iterative blocking. Tech. rep., Stanford University (2008). http://dbpubs.stanford.edu/pub/2008-19
Widom, J., Ceri, S. (eds.): Active Database Systems: Triggers and Rules For Advanced Database Processing. Morgan Kaufmann, San Francisco (1996)
Winkler, W.: Overview of record linkage and current research directions. Tech. rep., Statistical Research Division, US Bureau of the Census, Washington, DC (2006)
Winkler, W.E.: State of statistical data editing and current research problems. In: UN/ECE Work Session on Statistical Data Editing, Working Paper n.29, pp. 2–4 (1999)
Author information
Authors and Affiliations
Corresponding author
Rights and permissions
About this article
Cite this article
Whang, S.E., Benjelloun, O. & Garcia-Molina, H. Generic entity resolution with negative rules. The VLDB Journal 18, 1261–1277 (2009). https://doi.org/10.1007/s00778-009-0136-3
Received:
Revised:
Accepted:
Published:
Issue Date:
DOI: https://doi.org/10.1007/s00778-009-0136-3