Abstract
String similarity join (SSJ) is essential for many applications where near-duplicate objects need to be found. This paper targets SSJ with edit distance constraints. The existing algorithms usually adopt the filter-andrefine framework. They cannot catch the dissimilarity between string subsets, and do not fully exploit the statistics such as the frequencies of characters. We investigate to develop a partition-based algorithm by using such statistics. The frequency vectors are used to partition datasets into data chunks with dissimilarity between them being caught easily. A novel algorithm is designed to accelerate SSJ via the partitioned data. A new filter is proposed to leverage the statistics to avoid computing edit distances for a noticeable proportion of candidate pairs which survive the existing filters. Our algorithm outperforms alternative methods notably on real datasets.
Article PDF
Similar content being viewed by others
Avoid common mistakes on your manuscript.
References
Afrati, F.N., Sarma, A.D., Menestrina, D., et al., 2012. Fuzzy joins using MapReduce. Int. Conf. on Data Engineering, p.498–509. https://doi.org/10.1109/ICDE.2012.66
Arasu, A., Ganti, V., Kaushik, R., 2006. Efficient exact set-similarity joins. Int. Conf. on Very Large Data Bases, p.918–929.
Bayardo, R.J., Ma, Y., Srikant, R., 2007. Scaling up all pairs similarity search. Int. World Wide Web Conf., p.131–140. https://doi.org/10.1145/1242572.1242591
Chaudhuri, S., Ganjam, K., Ganti, V., et al., 2003. Robust and efficient fuzzy match for online data cleaning. Int. SIGMOD Conf. on Management of Data, p.313–324. https://doi.org/10.1145/872757.872796
Chaudhuri, S., Ganti, V., Kaushik, R., 2006a. Data debugger: an operator-centric approach for data quality solutions. IEEE Data Eng. Bull., 29(2): 60–66.
Chaudhuri, S., Ganti, V., Kaushik, R., 2006b. A primitive operator for similarity joins in data cleaning. Int. Conf. on Data Engineering, p.687–698. https://doi.org/10.1109/ICDE.2006.9
Dong, X., Halevy, A.Y., Yu, C., 2007. Data integration with uncertainty. Int. Conf. on Very Large Data Bases, p.687–698.
Feng, J., Wang, J., Li, G., 2012. Trie-join: a trie-based method for efficient string similarity joins. VLDB J., 21(4): 437–461. https://doi.org/10.1007/s00778-011-0252-8
Ge, T., Li, Z., 2011. Approximate substring matching over uncertain strings. Proc. VLDB Endow., 4(11): 772–782.
Gravano, L., Ipeirotis, P.G., Jagadish, H.V., et al., 2001. Approximate string joins in a database (almost) for free. Int. Conf. on Very Large Data Bases, p.491–500.
Hadjieleftheriou, M., Srivastava, D., 2010. Weighted Set-Based String Similarity. Bulletin of the IEEE Computer Society Technical Committee on Data Engineering. AT&T Lab-Research.
Henzinger, M.R., 2006. Finding near-duplicate web pages: a large-scale evaluation of algorithms. Int. ACM SIGIR Conf. on Research and Development in Information Retrieval, p.284–291. https://doi.org/10.1145/1148170.1148222
Ji, S., Li, G., Li, C., et al., 2009. Efficient interactive fuzzy keyword search. Int. World Wide Web Conf., p.371–380. https://doi.org/10.1145/1526709.1526760
Li, C., Lu, J., Lu, Y., 2008. Efficient merging and filtering algorithms for approximate string searches. Int. Conf. on Data Engineering, p.257–266. https://doi.org/10.1109/ICDE.2008.4497434
Li, G., Deng, D., Wang, J., et al., 2011. Pass-Join: a partition-based method for similarity joins. Proc. VLDB Endow., 5(3): 253–264. https://doi.org/10.14778/2078331.2078340
Metwally, A., Agrawal, D., Abbadi, A.E., 2007. Detectives: detecting coalition hit inflation attacks in advertising networks streams. Int. World Wide Web Conf., p.241–250. https://doi.org/10.1145/1242572.1242606
Navarro, G., Salmela, L., 2009. Indexing variable length substrings for exact and approximate matching. Int. Symp. on String Processing and Information Retrieval, p.214–221. https://doi.org/10.1007/978-3-642-03784-9_21
Qin, J., Wang, W., Xiao, C., et al., 2013. Vchunkjoin: an efficient algorithm for edit similarity joins. Trans. Knowl. Dat. Eng., 25(8): 1916–1929. https://doi.org/10.1109/TKDE.2012.79
Sarawagi, S., Kirpal, A., 2004. Efficient set joins on similarity predicates. Int. SIGMOD Conf. on Management of Data, p.743–754. https://doi.org/10.1145/1007568.1007652
Scott, D.W., 1979. On optimal and data-based histograms. Biometrika, 66: 605–610. https://doi.org/10.1093/biomet/66.3.605
Vernica, R., Carey, M.J., Li, C., 2010. Efficient parallel set-similarity joins using MapReduce. Int. SIGMOD Conf. on Management of Data, p.495–506. https://doi.org/10.1145/1807167.1807222
Wang, J., Li, G., Feng, J., 2011. Fast-join: an efficient method for fuzzy token matching based string similarity join. Int. Conf. on Data Engineering, p.458–469. https://doi.org/10.1109/ICDE.2011.5767865
Wang, W., Xiao, C., Lin, X., et al., 2009. Efficient approximate entity extraction with edit distance constraints. Int. SIGMOD Conf. on Management of Data, p.759–770. https://doi.org/10.1145/1559845.1559925
Xiao, C., Wang, W., Lin, X., 2008a. Ed-Join: an efficient algorithm for similarity joins with edit distance constraints. Proc. VLDB Endow., 1(1): 933–944. https://doi.org/10.14778/1453856.1453957
Xiao, C., Wang, W., Lin, X., et al., 2008b. Efficient similarity joins for near duplicate detection. Int. WorldWideWeb Conf., p.131–140. https://doi.org/10.1145/2000824.2000825
Author information
Authors and Affiliations
Corresponding author
Rights and permissions
About this article
Cite this article
Luo, Jz., Shi, Sf., Wang, Hz. et al. FrepJoin: an efficient partition-based algorithm for edit similarity join. Frontiers Inf Technol Electronic Eng 18, 1499–1510 (2017). https://doi.org/10.1631/FITEE.1601347
Received:
Accepted:
Published:
Issue Date:
DOI: https://doi.org/10.1631/FITEE.1601347