Abstract
A major source of uncertainty in databases is the presence of duplicate items, i.e., records that refer to the same real-world entity. However, accurate deduplication is a difficult task and imperfect data cleaning may result in loss of valuable information. A reasonable alternative approach is to keep duplicates when the correct cleaning strategy is not certain, and utilize an efficient probabilistic query-answering technique to return query results along with probabilities of each answer being correct. In this paper, we present a flexible modular framework for scalably creating a probabilistic database out of a dirty relation of duplicated data and overview the challenges raised in utilizing this framework for large relations of string data. We study the problem of associating probabilities with duplicates that are detected using state-of-the-art scalable approximate join methods. We argue that standard thresholding techniques are not sufficiently robust for this task, and propose new clustering algorithms suitable for inferring duplicates and their associated probabilities. We show that the inferred probabilities accurately reflect the error in duplicate records.
Article PDF
Similar content being viewed by others
Avoid common mistakes on your manuscript.
References
Ailon, N., Charikar, M., Newman, A.: Aggregating inconsistent information: ranking and clustering. In: ACM Symp. on Theory of Computing (STOC), pp. 684–693 (2005)
Andritsos, P., Fuxman, A., Miller, R.J.: Clean answers over dirty databases: a probabilistic approach. In: IEEE Proc. of the Int’l Conf. on Data Eng., p. 30 (2006)
Antova, L., Koch, C., Olteanu, D.: Fast and simple relational processing of uncertain data. In: IEEE Proc. of the Int’l Conf. on Data Eng., pp. 983–992 (2008)
Arasu, A., Ganti, V., Kaushik, R.: Efficient exact set-similarity joins. In: Proc. of the Int’l Conf. on Very Large Data Bases (VLDB), pp. 918–929 (2006)
Arasu, A., Ré, C., Suciu, D.: Large-scale deduplication with constraints using dedupalog. In: IEEE Proc. of the Int’l Conf. on Data Eng., pp. 952–963 (2009)
Aslam J.A., Pelekhov E., Rus D.: The star clustering algorithm for static and dynamic information organization. J. Graph Algorithm. Appl. 8(1), 95–129 (2004)
Bansal N., Blum A., Chawla S.: Correlation clustering. Mach. Learn. 56(1–3), 89–113 (2004)
Bayardo, R.J., Ma, Y., Srikant, R.: Scaling up all pairs similarity search. In: Int’l World Wide Web Conference (WWW), pp. 131–140, Banff (2007)
Benjelloun O., Garcia-Molina H., Menestrina D., Su Q., Whang S.E., Widom J.: Swoosh: a generic approach to entity resolution. Int. J. Very Large Data Bases 18(1), 255–276 (2009)
Beskales, G., Soliman, M.A., Ilyas, I.F., Ben-David, S.: Modeling and querying possible repairs in duplicate detection. In: Proc. of the Int’l Conf. on Very Large Data Bases (VLDB), 2009 (To Appear). Available as University of Waterloo, Tech. Report CS-2009-15, 2009
Bezdek J.C.: Pattern Recognition with Fuzzy Objective Function Algorithms. Kluwer Academic Publishers, Dordrecht (1981)
Bhattacharya, I., Getoor, L.: A latent dirichlet model for unsupervised entity resolution. In: Proc. of the SIAM International Conference on Data Mining (SDM), pp. 47–58, Bethesda (2006)
Bhattacharya I., Getoor L.: Collective entity resolution in relational data. IEEE Data Eng. Bull. 29(2), 4–12 (2006)
Blei D.M., Ng A.Y., Jordan M.I.: Latent dirichlet allocation. J. Mach. Learn. Res. 3, 993–1022 (2003)
Boulos, J., Dalvi, N., Mandhani, B., Mathur, S., Re, C., Suciu, D.: MYSTIQ: a system for finding more answers by using probabilities. In: ACM SIGMOD Int’l Conf. on the Mgmt. of Data, pp. 891–893 (2005)
Chaudhuri, S., Ganjam, K., Ganti, V., Motwani, R.: Robust and efficient fuzzy match for online data cleaning. In: ACM SIGMOD Int’l Conf. on the Mgmt. of Data, pp. 313–324 (2003)
Chaudhuri, S., Ganti, V., Motwani, R.: Robust identification of fuzzy duplicates. In: IEEE Proc. of the Int’l Conf. on Data Eng., pp. 865–876, Washington (2005)
Chaudhuri, S., Das Sarma, A., Ganti, V., Kaushik, R.: Leveraging aggregate constraints for deduplication. In: ACM SIGMOD Int’l Conf. on the Mgmt. of Data, pp. 437–448 (2007)
Cheng R., Chen J., Xie X.: Cleaning uncertain data with quality guarantees. Proc. VLDB Endow. (PVLDB) 1(1), 722–735 (2008)
Cohen, W.W., Ravikumar, P., Fienberg, S.E.: A comparison of string distance metrics for name-matching tasks. In: Proc. of IJCAI-03 Workshop on Information Integration on the Web (IIWeb-03), pp. 73–78, Acapulco (2003)
Dalvi N., Suciu D.: Efficient query evaluation on probabilistic databases. Int. J. Very Large Data Bases 16(4), 523–544 (2007)
Dalvi, N., Suciu, D.: Management of probabilistic data: foundations and challenges. In: ACM SIGMOD Int’l Conf. on the Mgmt. of Data, pp. 1–12 (2007)
Demaine E.D., Emanuel D., Fiat A., Immorlica N.: Correlation clustering in general weighted graphs. Theor. Comput. Sci. 361(2), 172–187 (2006)
Dong, X.L., Halevy, A.Y., Yu, C.: Data integration with uncertainty. In: Proc. of the Int’l Conf. on Very Large Data Bases (VLDB), pp. 687–698 (2007)
Elmagarmid A.K., Ipeirotis P.G., Verykios V.S.: Duplicate record detection: a survey. IEEE Trans. Know. Data Eng. 19(1), 1–16 (2007)
Fellegi I.P., Sunter A.B.: A theory for record linkage. J. Am. Stat. Assoc. 64(328), 1183–1210 (1969)
Flake G.W., Tarjan R.E., Tsioutsiouliklis K.: Graph clustering and minimum cut trees. Internet Math. 1(4), 385–408 (2004)
Gravano, L., Ipeirotis, P.G., Jagadish, H.V., Koudas, N., Muthukrishnan, S., Srivastava, D.: Approximate string joins in a database (Almost) for free. In: Proc. of the Int’l Conf. on Very Large Data Bases (VLDB), pp. 491–500 (2001)
Gupta, R., Sarawagi, S.: Creating probabilistic databases from information extraction models. In: Proc. of the Int’l Conf. on Very Large Data Bases (VLDB), pp. 965–976 (2006)
Gusfield, D.: Algorithms on strings, trees, and sequences. In: Computer Science and Computational Biology. Cambridge University Press, Cambridge (1997)
Hassanzadeh, O.: Benchmarking Declarative Approximate Selection Predicates. Master’s thesis, University of Toronto, February 2007
Hassanzadeh, O., Chiang, F., Lee, H.C., Miller, R.J.: Framework for evaluating clustering algorithms in duplicate detection. In: Proc. of the Int’l Conf. on Very Large Data Bases (VLDB), 2009
Hassanzadeh, O., Sadoghi, M., Miller, R.J.: Accuracy of approximate string joins using grams. In: Proc. of the International Workshop on Quality in Databases (QDB), pp. 11–18, Vienna (2007)
Haveliwala, T.H., Gionis, A., Indyk, P.: Scalable techniques for clustering the web. In: Proc. of the Int’l Workshop on the Web and Databases (WebDB), pp. 129–134, Dallas (2000)
Hernández M.A., Stolfo S.J.: Real-world data is dirty: data cleansing and the merge/purge problem. Data Min. Know. Discov. 2(1), 9–37 (1998)
Indyk, P., Motwani, R., Raghavan, P., Vempala, S.: Locality-preserving hashing in multidimensional spaces. In: ACM Symp. on Theory of Computing (STOC), pp. 618–625 (1997)
Kukich K.: Techniques for automatically correcting words in text. ACM Comput. Surv. 24(4), 377–439 (1992)
Li, C., Wang, B., Yang, X.: VGRAM: Improving performance of approximate queries on string collections using variable-length grams. In: Proc. of the Int’l Conf. on Very Large Data Bases (VLDB), pp. 303–314, Vienna (2007)
McCallum, A., Nigam, K., Ungar, L.H.: Efficient clustering of high-dimensional data sets with application to reference matching. In: Proc. of the Int’l Conf. on Knowledge Discovery & Data Mining, pp. 169–178 (2000)
Miller, D.R.H., Leek, T., Schwartz, R.M.: A hidden Markov model information retrieval system. In: ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 214–221 (1999)
Monge, A.E., Elkan, C.: An efficient domain-independent algorithm for detecting approximately duplicate database records. In: Proc. of SIGMOD Workshop on Data Mining and Knowledge Discovery (DMKD) (1997)
Rahm E., Hai Do H.: Data cleaning: problems and current approaches. IEEE Data Eng. Bull. 23(4), 3–13 (2000)
Re, C., Dalvi, N., Suciu, D.: Efficient Top-k query evaluation on probabilistic data. In: IEEE Proc. of the Int’l Conf. on Data Eng., pp. 886–895 (2007)
Robertson S.: Understanding inverse document frequency: on theoretical arguments for IDF. J. Doc. 60(5), 503–520 (2004)
Sarawagi, S., Kirpal, A.: Efficient set joins on similarity predicates. In: ACM SIGMOD Int’l Conf. on the Mgmt. of Data, pp. 743–754, Paris (2004)
Sen, P., Deshpande, A., Getoor, L.: Representing tuple and attribute uncertainty in probabilistic databases. In: ICDM Workshops, pp. 507–512 (2007)
Slonim, N.: The Information Bottleneck: Theory and Applications. PhD thesis, The Hebrew University (2003)
Soliman, M.A., Ilyas, I.F., Chang, K.C.: Top-k query processing in uncertain databases. In: IEEE Proc. of the Int’l Conf. on Data Eng., pp. 896–905 (2007)
Soliman M.A., Ilyas I.F., Chang K.C.: Probabilistic top-k and ranking-aggregate queries. ACM Trans. Database Syst. (TODS) 33(3), 1–54 (2008)
van Dongen, S.: Graph Clustering By Flow Simulation. PhD thesis, University of Utrecht (2000)
Widom J.: Trio: a system for integrated management of data, accuracy, and lineage. In: Proc. of the Conference on Innovative Data Systems Research (CIDR), pp. 262–276 (2005)
Author information
Authors and Affiliations
Corresponding author
Additional information
Work supported in part by NSERC.
Rights and permissions
About this article
Cite this article
Hassanzadeh, O., Miller, R.J. Creating probabilistic databases from duplicated data. The VLDB Journal 18, 1141–1166 (2009). https://doi.org/10.1007/s00778-009-0161-2
Received:
Revised:
Accepted:
Published:
Issue Date:
DOI: https://doi.org/10.1007/s00778-009-0161-2