Swoosh: a generic approach to entity resolution

Benjelloun, Omar; Garcia-Molina, Hector; Menestrina, David; Su, Qi; Whang, Steven Euijong; Widom, Jennifer

doi:10.1007/s00778-008-0098-x

Regular Paper
Published: 11 March 2008

Volume 18, pages 255–276 (2009)

The VLDB Journal

Omar Benjelloun¹,
Hector Garcia-Molina²,
David Menestrina²,
Qi Su²,
Steven Euijong Whang² &
…
Jennifer Widom²

1600 Accesses
268 Citations
3 Altmetric

Abstract

We consider the entity resolution (ER) problem (also known as deduplication, or merge–purge), in which records determined to represent the same real-world entity are successively located and merged. We formalize the generic ER problem, treating the functions for comparing and merging records as black-boxes, which permits expressive and extensible ER solutions. We identify four important properties that, if satisfied by the match and merge functions, enable much more efficient ER algorithms. We develop three efficient ER algorithms: G-Swoosh for the case where the four properties do not hold, and R-Swoosh and F-Swoosh that exploit the four properties. F-Swoosh in addition assumes knowledge of the “features” (e.g., attributes) used by the match function. We experimentally evaluate the algorithms using comparison shopping data from Yahoo! Shopping and hotel information data from Yahoo! Travel. We also show that R-Swoosh (and F-Swoosh) can be used even when the four match and merge properties do not hold, if an “approximate” result is acceptable.
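The abstract does not spell out the algorithms, so the following is only a minimal sketch of how a resolver might use black-box match and merge functions in the way described above. It is not the paper's pseudocode. The record representation (a frozenset of attribute/value pairs), the particular `match` and `merge` functions, the name `resolve`, and the sample data are all assumptions made for illustration. The loop compares each pending record against the records resolved so far and, on a match, puts the merged record back in the pending list so that it is compared again. Its result can only be relied on when the match and merge functions are well behaved, for example when merging is order-independent and a merged record still matches everything its sources matched.

```python
from typing import Callable, FrozenSet, Iterable, List, Tuple

# Hypothetical record model: a set of (attribute, value) pairs.
Record = FrozenSet[Tuple[str, str]]


def match(r1: Record, r2: Record) -> bool:
    """Illustrative match: records share a name or an email value."""
    identifying = {"name", "email"}
    return any(pair in r2 for pair in r1 if pair[0] in identifying)


def merge(r1: Record, r2: Record) -> Record:
    """Illustrative merge: keep every attribute/value pair from both."""
    return r1 | r2


def resolve(
    records: Iterable[Record],
    match_fn: Callable[[Record, Record], bool],
    merge_fn: Callable[[Record, Record], Record],
) -> List[Record]:
    """Repeatedly locate matching records and merge them."""
    pending = list(records)
    resolved: List[Record] = []
    while pending:
        current = pending.pop()
        # Look for any already-resolved record that matches the current one.
        buddy = next((r for r in resolved if match_fn(current, r)), None)
        if buddy is None:
            resolved.append(current)
        else:
            # The merged record may match records that neither source did,
            # so it goes back into the pending list to be compared again.
            resolved.remove(buddy)
            pending.append(merge_fn(current, buddy))
    return resolved


a = frozenset({("name", "J. Doe"), ("email", "jd@example.org")})
b = frozenset({("name", "John Doe"), ("email", "jd@example.org")})
c = frozenset({("name", "John Doe"), ("phone", "555-0100")})
d = frozenset({("name", "Alice"), ("email", "alice@example.org")})

result = resolve([a, b, c, d], match, merge)
```

In this sample, `a` and `c` do not match each other directly; they end up in the same output record only because each matches `b`, which is why merged records are re-examined rather than set aside.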

Author information

Authors and Affiliations

Google Inc., Mountain View, CA, 94043, USA
Omar Benjelloun
Computer Science Department, Stanford University, Stanford, CA, 94305, USA
Hector Garcia-Molina, David Menestrina, Qi Su, Steven Euijong Whang & Jennifer Widom

Corresponding author

Correspondence to Steven Euijong Whang.

About this article

Cite this article

Benjelloun, O., Garcia-Molina, H., Menestrina, D. et al. Swoosh: a generic approach to entity resolution. The VLDB Journal 18, 255–276 (2009). https://doi.org/10.1007/s00778-008-0098-x

Received: 22 May 2007
Revised: 09 January 2008
Accepted: 10 January 2008
Published: 11 March 2008
Issue Date: January 2009
DOI: https://doi.org/10.1007/s00778-008-0098-x
