Creating probabilistic databases from duplicated data

Hassanzadeh, Oktie; Miller, Renée J.

doi:10.1007/s00778-009-0161-2

Creating probabilistic databases from duplicated data

Special Issue Paper
Published: 20 August 2009

Volume 18, pages 1141–1166, (2009)
Cite this article

Download PDF

Access provided by CONRICYT-eBooks

The VLDB Journal Aims and scope Submit manuscript

Creating probabilistic databases from duplicated data

Download PDF

Oktie Hassanzadeh¹ &
Renée J. Miller¹

306 Accesses
38 Citations
3 Altmetric
Explore all metrics

Abstract

A major source of uncertainty in databases is the presence of duplicate items, i.e., records that refer to the same real-world entity. However, accurate deduplication is a difficult task and imperfect data cleaning may result in loss of valuable information. A reasonable alternative approach is to keep duplicates when the correct cleaning strategy is not certain, and utilize an efficient probabilistic query-answering technique to return query results along with probabilities of each answer being correct. In this paper, we present a flexible modular framework for scalably creating a probabilistic database out of a dirty relation of duplicated data and overview the challenges raised in utilizing this framework for large relations of string data. We study the problem of associating probabilities with duplicates that are detected using state-of-the-art scalable approximate join methods. We argue that standard thresholding techniques are not sufficiently robust for this task, and propose new clustering algorithms suitable for inferring duplicates and their associated probabilities. We show that the inferred probabilities accurately reflect the error in duplicate records.

References

Ailon, N., Charikar, M., Newman, A.: Aggregating inconsistent information: ranking and clustering. In: ACM Symp. on Theory of Computing (STOC), pp. 684–693 (2005)
Andritsos, P., Fuxman, A., Miller, R.J.: Clean answers over dirty databases: a probabilistic approach. In: IEEE Proc. of the Int’l Conf. on Data Eng., p. 30 (2006)
Antova, L., Koch, C., Olteanu, D.: Fast and simple relational processing of uncertain data. In: IEEE Proc. of the Int’l Conf. on Data Eng., pp. 983–992 (2008)
Arasu, A., Ganti, V., Kaushik, R.: Efficient exact set-similarity joins. In: Proc. of the Int’l Conf. on Very Large Data Bases (VLDB), pp. 918–929 (2006)
Arasu, A., Ré, C., Suciu, D.: Large-scale deduplication with constraints using dedupalog. In: IEEE Proc. of the Int’l Conf. on Data Eng., pp. 952–963 (2009)
Aslam J.A., Pelekhov E., Rus D.: The star clustering algorithm for static and dynamic information organization. J. Graph Algorithm. Appl. 8(1), 95–129 (2004)
MATH MathSciNet Google Scholar
Bansal N., Blum A., Chawla S.: Correlation clustering. Mach. Learn. 56(1–3), 89–113 (2004)
Article MATH Google Scholar
Bayardo, R.J., Ma, Y., Srikant, R.: Scaling up all pairs similarity search. In: Int’l World Wide Web Conference (WWW), pp. 131–140, Banff (2007)
Benjelloun O., Garcia-Molina H., Menestrina D., Su Q., Whang S.E., Widom J.: Swoosh: a generic approach to entity resolution. Int. J. Very Large Data Bases 18(1), 255–276 (2009)
Article Google Scholar
Beskales, G., Soliman, M.A., Ilyas, I.F., Ben-David, S.: Modeling and querying possible repairs in duplicate detection. In: Proc. of the Int’l Conf. on Very Large Data Bases (VLDB), 2009 (To Appear). Available as University of Waterloo, Tech. Report CS-2009-15, 2009
Bezdek J.C.: Pattern Recognition with Fuzzy Objective Function Algorithms. Kluwer Academic Publishers, Dordrecht (1981)
MATH Google Scholar
Bhattacharya, I., Getoor, L.: A latent dirichlet model for unsupervised entity resolution. In: Proc. of the SIAM International Conference on Data Mining (SDM), pp. 47–58, Bethesda (2006)
Bhattacharya I., Getoor L.: Collective entity resolution in relational data. IEEE Data Eng. Bull. 29(2), 4–12 (2006)
Google Scholar
Blei D.M., Ng A.Y., Jordan M.I.: Latent dirichlet allocation. J. Mach. Learn. Res. 3, 993–1022 (2003)
Article MATH Google Scholar
Boulos, J., Dalvi, N., Mandhani, B., Mathur, S., Re, C., Suciu, D.: MYSTIQ: a system for finding more answers by using probabilities. In: ACM SIGMOD Int’l Conf. on the Mgmt. of Data, pp. 891–893 (2005)
Chaudhuri, S., Ganjam, K., Ganti, V., Motwani, R.: Robust and efficient fuzzy match for online data cleaning. In: ACM SIGMOD Int’l Conf. on the Mgmt. of Data, pp. 313–324 (2003)
Chaudhuri, S., Ganti, V., Motwani, R.: Robust identification of fuzzy duplicates. In: IEEE Proc. of the Int’l Conf. on Data Eng., pp. 865–876, Washington (2005)
Chaudhuri, S., Das Sarma, A., Ganti, V., Kaushik, R.: Leveraging aggregate constraints for deduplication. In: ACM SIGMOD Int’l Conf. on the Mgmt. of Data, pp. 437–448 (2007)
Cheng R., Chen J., Xie X.: Cleaning uncertain data with quality guarantees. Proc. VLDB Endow. (PVLDB) 1(1), 722–735 (2008)
Google Scholar
Cohen, W.W., Ravikumar, P., Fienberg, S.E.: A comparison of string distance metrics for name-matching tasks. In: Proc. of IJCAI-03 Workshop on Information Integration on the Web (IIWeb-03), pp. 73–78, Acapulco (2003)
Dalvi N., Suciu D.: Efficient query evaluation on probabilistic databases. Int. J. Very Large Data Bases 16(4), 523–544 (2007)
Article Google Scholar
Dalvi, N., Suciu, D.: Management of probabilistic data: foundations and challenges. In: ACM SIGMOD Int’l Conf. on the Mgmt. of Data, pp. 1–12 (2007)
Demaine E.D., Emanuel D., Fiat A., Immorlica N.: Correlation clustering in general weighted graphs. Theor. Comput. Sci. 361(2), 172–187 (2006)
Article MATH MathSciNet Google Scholar
Dong, X.L., Halevy, A.Y., Yu, C.: Data integration with uncertainty. In: Proc. of the Int’l Conf. on Very Large Data Bases (VLDB), pp. 687–698 (2007)
Elmagarmid A.K., Ipeirotis P.G., Verykios V.S.: Duplicate record detection: a survey. IEEE Trans. Know. Data Eng. 19(1), 1–16 (2007)
Article Google Scholar
Fellegi I.P., Sunter A.B.: A theory for record linkage. J. Am. Stat. Assoc. 64(328), 1183–1210 (1969)
Article Google Scholar
Flake G.W., Tarjan R.E., Tsioutsiouliklis K.: Graph clustering and minimum cut trees. Internet Math. 1(4), 385–408 (2004)
MATH MathSciNet Google Scholar
Gravano, L., Ipeirotis, P.G., Jagadish, H.V., Koudas, N., Muthukrishnan, S., Srivastava, D.: Approximate string joins in a database (Almost) for free. In: Proc. of the Int’l Conf. on Very Large Data Bases (VLDB), pp. 491–500 (2001)
Gupta, R., Sarawagi, S.: Creating probabilistic databases from information extraction models. In: Proc. of the Int’l Conf. on Very Large Data Bases (VLDB), pp. 965–976 (2006)
Gusfield, D.: Algorithms on strings, trees, and sequences. In: Computer Science and Computational Biology. Cambridge University Press, Cambridge (1997)
Hassanzadeh, O.: Benchmarking Declarative Approximate Selection Predicates. Master’s thesis, University of Toronto, February 2007
Hassanzadeh, O., Chiang, F., Lee, H.C., Miller, R.J.: Framework for evaluating clustering algorithms in duplicate detection. In: Proc. of the Int’l Conf. on Very Large Data Bases (VLDB), 2009
Hassanzadeh, O., Sadoghi, M., Miller, R.J.: Accuracy of approximate string joins using grams. In: Proc. of the International Workshop on Quality in Databases (QDB), pp. 11–18, Vienna (2007)
Haveliwala, T.H., Gionis, A., Indyk, P.: Scalable techniques for clustering the web. In: Proc. of the Int’l Workshop on the Web and Databases (WebDB), pp. 129–134, Dallas (2000)
Hernández M.A., Stolfo S.J.: Real-world data is dirty: data cleansing and the merge/purge problem. Data Min. Know. Discov. 2(1), 9–37 (1998)
Article Google Scholar
Indyk, P., Motwani, R., Raghavan, P., Vempala, S.: Locality-preserving hashing in multidimensional spaces. In: ACM Symp. on Theory of Computing (STOC), pp. 618–625 (1997)
Kukich K.: Techniques for automatically correcting words in text. ACM Comput. Surv. 24(4), 377–439 (1992)
Article Google Scholar
Li, C., Wang, B., Yang, X.: VGRAM: Improving performance of approximate queries on string collections using variable-length grams. In: Proc. of the Int’l Conf. on Very Large Data Bases (VLDB), pp. 303–314, Vienna (2007)
McCallum, A., Nigam, K., Ungar, L.H.: Efficient clustering of high-dimensional data sets with application to reference matching. In: Proc. of the Int’l Conf. on Knowledge Discovery & Data Mining, pp. 169–178 (2000)
Miller, D.R.H., Leek, T., Schwartz, R.M.: A hidden Markov model information retrieval system. In: ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 214–221 (1999)
Monge, A.E., Elkan, C.: An efficient domain-independent algorithm for detecting approximately duplicate database records. In: Proc. of SIGMOD Workshop on Data Mining and Knowledge Discovery (DMKD) (1997)
Rahm E., Hai Do H.: Data cleaning: problems and current approaches. IEEE Data Eng. Bull. 23(4), 3–13 (2000)
Google Scholar
Re, C., Dalvi, N., Suciu, D.: Efficient Top-k query evaluation on probabilistic data. In: IEEE Proc. of the Int’l Conf. on Data Eng., pp. 886–895 (2007)
Robertson S.: Understanding inverse document frequency: on theoretical arguments for IDF. J. Doc. 60(5), 503–520 (2004)
Article Google Scholar
Sarawagi, S., Kirpal, A.: Efficient set joins on similarity predicates. In: ACM SIGMOD Int’l Conf. on the Mgmt. of Data, pp. 743–754, Paris (2004)
Sen, P., Deshpande, A., Getoor, L.: Representing tuple and attribute uncertainty in probabilistic databases. In: ICDM Workshops, pp. 507–512 (2007)
Slonim, N.: The Information Bottleneck: Theory and Applications. PhD thesis, The Hebrew University (2003)
Soliman, M.A., Ilyas, I.F., Chang, K.C.: Top-k query processing in uncertain databases. In: IEEE Proc. of the Int’l Conf. on Data Eng., pp. 896–905 (2007)
Soliman M.A., Ilyas I.F., Chang K.C.: Probabilistic top-k and ranking-aggregate queries. ACM Trans. Database Syst. (TODS) 33(3), 1–54 (2008)
Article Google Scholar
van Dongen, S.: Graph Clustering By Flow Simulation. PhD thesis, University of Utrecht (2000)
Widom J.: Trio: a system for integrated management of data, accuracy, and lineage. In: Proc. of the Conference on Innovative Data Systems Research (CIDR), pp. 262–276 (2005)

Download references

Author information

Authors and Affiliations

Department of Computer Science, University of Toronto, Toronto, Canada
Oktie Hassanzadeh & Renée J. Miller

Authors

Oktie Hassanzadeh
View author publications
You can also search for this author in PubMed Google Scholar
Renée J. Miller
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Oktie Hassanzadeh.

Additional information

Work supported in part by NSERC.

Rights and permissions

Reprints and permissions

About this article

Cite this article

Hassanzadeh, O., Miller, R.J. Creating probabilistic databases from duplicated data. The VLDB Journal 18, 1141–1166 (2009). https://doi.org/10.1007/s00778-009-0161-2

Download citation

Received: 14 September 2008
Revised: 10 June 2009
Accepted: 26 June 2009
Published: 20 August 2009
Issue Date: October 2009
DOI: https://doi.org/10.1007/s00778-009-0161-2

Keywords

Use our pre-submission checklist

Avoid common mistakes on your manuscript.

Creating probabilistic databases from duplicated data

Abstract

Article PDF

Similar content being viewed by others

String similarity search and join: a survey

Analytics over Probabilistic Unmerged Duplicates

DAW: Duplicate-AWare Federated Query Processing over the Web of Data

References

Author information

Authors and Affiliations

Corresponding author

Additional information

Rights and permissions

About this article

Cite this article

Keywords

Navigation

Creating probabilistic databases from duplicated data

Abstract

Article PDF

Similar content being viewed by others

String similarity search and join: a survey

Analytics over Probabilistic Unmerged Duplicates

DAW: Duplicate-AWare Federated Query Processing over the Web of Data

References

Author information

Authors and Affiliations

Corresponding author

Additional information

Rights and permissions

About this article

Cite this article

Share this article

Keywords

Search

Navigation