Improved Duplicate Record Detection Using ASCII Code Q-gram Indexing Technique

Elziky, Mayada A.; Ibrahim, Dina M.; Sarhan, Amany M.

doi:10.1007/s13369-018-3105-6

Improved Duplicate Record Detection Using ASCII Code Q-gram Indexing Technique

Research Article - Computer Engineering and Computer Science
Published: 15 February 2018

Volume 43, pages 7409–7420, (2018)
Cite this article

Download PDF

Access provided by Autonomous University of Puebla

Arabian Journal for Science and Engineering Aims and scope Submit manuscript

Improved Duplicate Record Detection Using ASCII Code Q-gram Indexing Technique

Download PDF

Mayada A. Elziky¹,
Dina M. Ibrahim¹ &
Amany M. Sarhan¹

91 Accesses
3 Citations
Explore all metrics

Abstract

With the aim of reducing duplicate records in databases, duplicate record detection (DRD) ensures the integrity of data. Its role is to identify records signifying same entities either in the same or in different compared to database. A diversity of indexing techniques has been proposed to support DRD. Q-gram is one of the common techniques used to index databases. This paper introduces modification to the Q-gram indexing technique. Such modification participates in improving the performance of the duplicate detection process and in reducing the time and number of comparisons. In the proposed work, in order to make the back-end computations easier, Q-gram strings are alternatively converted into numeric values using their corresponding ASCII code. Based on these numeric values, the indexing will decrease the complexity of Q-gram comparisons and speed up the DRD process as a whole. Unlike the existing approaches, the proposed technique is easier in implementation and requires less memory space. Two other variations of the proposed technique are introduced in this paper to decrease the matching process time; the first uses a range for matching, while the second sorts words alphabetically inside blocks. According to experimental results, the three proposed techniques perform much faster and are almost as accurate as the current Q-gram technique, meaning that they can be used in large-sized databases DRD.

Article PDF

Near Duplicate Text Detection Using Frequency-Biased Signatures

Efficient Approach for Near Duplicate Document Detection Using Textual and Conceptual Based Techniques

An Improved Data Cleaning Algorithm Based on SNM

Use our pre-submission checklist

Avoid common mistakes on your manuscript.

References

Issa, H.: Application of Duplicate Records Detection Techniques to Duplicate Payments in a Real Business Environment. Rutgers University, Rutgers Business School (2010)
Naderi, H.; Salehpour, N.; Farokhi, M.N.; Chegeni, B.H.: The search of new issues in the detection of near-duplicated documents. Int. J. Curr. Rev. 2(2), 25–34 (2014)
Google Scholar
Christen, P.: A survey of indexing techniques for scalable record linkage and deduplication. IEEE Trans. Knowl. Data Eng. 24, 1537–1555 (2012)
Article Google Scholar
Fellegi, I.P.; Sunter, A.B.: A theory for record linkage. J. Am. Stat. Soc. 64(328), 1183–1210 (1969)
Article Google Scholar
Hernandez, M.A.; Stolfo, S.J.: The merge/purge problem for large databases. In: Proceedings of the ACM SIGMOD’95, San Jose (1995)
Aizawa, A.; Oyama, K.: A fast linkage detection scheme for multi-source information integration. In: Proceedings of the IEEE International Workshop on Challenges in Web Information Retrieval and Integration WIRI’05, Tokyo, Japan (2005)
Cohen, W.W.; Richman, J.: Learning to Match and cluster large high-dimensional data sets for data integration. In: Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACMSIGKDD’02, Edmonton, pp. 475–480 (2002)
Gravano, L.; Ipeirotis, P.G.; Jagadish, H.V.; Koudas, N.; Muthukrishnan, S.; Srivastava. D.: Approximate string joins in a database (Almost) for free. VLDB (2001)
Adrian, B.; Christian, B.; Sean, R.; Rainer, S.: High quality linkage using multibit trees for privacy-preserving blocking. Int. J. Popul. Data Sci. (IJPDS) 1(1), 130 (2016)
Google Scholar
Kevin, Z.; Peter, A.: A Q-gram birthmarking approach to predicting reusable hardware. In: Design, automation & test in Europe conference and exhibition (DATE), 14–18 March (2016)
Jie, L.; Haiying, Z.: Research and implementation of finding duplicate science project based on dimension filtering of Q-gram index. Destech Transactions on Engineering and Technology Research (2016)
Christen, P.: FEBRL: An open source data cleaning, deduplication and record linkage system with a graphical user interface. In: Proceeding of the 14th ACM International Conference on Knowledge Discovery and Data Mining (SIGKDD’08), Las Vegas, USA, pp. 1065–1068, Aug. 24–27 (2008)
Elmagarmid, A.K.: Duplicate record detection: a survey. IEEE Trans. Knowl. Data Eng. 19(1), 1–16 (2007)
Article Google Scholar
Alnoory, M.K.: Performance evaluation of similarity functions for duplicate record detection. M.Sc. Thesis, Yarmouk University (2011)
Churches, T.; Christen, P.; Lim, K.; Zhu, J.X.: Preparation of name and address data for record linkage using hidden Markov models. BioMed Cent. Med. Inf. Decis. Mak. 2(1), 9 (2002)
Article Google Scholar
Rahm, E.; Do, H.H.: Data cleaning: problems and current approaches. IEEE Data Eng. Bull. 23(4), 3–13 (2000)
Google Scholar
Bilenko, M.; Mooney, R.J.: On evaluation and training set construction for duplicate detection. In: Proceedings of the ACM SIGKDD’03 Workshop on Data Cleaning, Record Linkage and Object Consolidation, Washington, DC, pp. 7–12 (2003)
Higazy, A.A.; Sarhan, A.M.; El Tobely, T.: Web-based Arabic/English duplicate record detection with nested blocking technique. In: Proceedings of the IEEE 8th International Conference on Computer Engineering and Systems (ICCES), Egypt, pp. 313–318 (2013)
Azman, S.: Efficient identity matching using static pruning Q-gram indexing approach. Decis. Support Syst. 73, 97–108 (2015)
Article Google Scholar
Ramadan, B.; Christen, P.: Unsupervised blocking key selection for real-time entity resolution. In: Advances in Knowledge Discovery and Data Mining Volume 9078 of the Series, Lecture Notes in Computer Science. Springer, pp. 574–585 (2015)
Kreft, S.; Navarro, G.: On compressing and indexing repetitive sequences. Theor. Comput. Sci. J. 483, 115–133 (2013)
Article MathSciNet Google Scholar
McCallum, A.; Nigam, K.; Ungar, L.H.: Efficient clustering of high-dimensional datasets with application to reference matching. In: Proceedings of the ACM International Conference Knowledge Discovery and Data Mining, ACM SIGKDD’00, Boston, pp. 169–178 (2000)
Christen, P.: A comparison of personal name matching: techniques and practical issues. In: Proceedings of the IEEE Workshop on Mining Complex Data, IEEE ICDM’06, Hong Kong (2006)
Kumar, A.; Ingle, Y.S.; Pande, A.; Dhule, P.: Canopy clustering: a review on pre-clustering approach to K-means clustering. Int. J. Innov. Adv. Comput. Sci. (IJIACS) 3(5), 22–29 (2014)
Google Scholar
Cohen, W.W.; Ravikumar, P.; Fienberg, S.: A comparison of string distance metrics for name-matching tasks. In: Proceedings of the Workshop on Information Integration on the Web, held at IJCAI’03, Acapulco (2003)
Christen, P.; Goiser, K.: Quality and complexity measures for data linkage and deduplication. In: Guillet, F., Hamilton, H. (eds.) Quality Measures in Data Mining Series. Studies in Computational Intelligence, pp. 127–151. Springer, Berlin (2007)
Chapter Google Scholar
Navarro, G.: A guided tour to approximate string matching. ACM Comput. Surv. 33(1), 31–88 (2001)
Article Google Scholar
Shannon, C.E.: A mathematical theory of communications. Bell Syst. Technol. 27, 379–423 (1948)
Article MathSciNet Google Scholar
Ukkonen, E.: Approximate string matching with q-grams and maximal matches. Theory Comput. Sci. 92, 191–211 (1992)
Article MathSciNet Google Scholar
Kukich, K.: Spelling correction for the telecommunications network for the deaf. Commun. ACM 35, 80–90 (1992)
Article Google Scholar
Gravano, L.; Ipeirotis, P.G.; Koudas, N.; Srivastava, D.: Text joins for data cleansing and integration in an RDBMS. In: Proceedings of the 19th IEEE International Conference on Data Engineering (ICDE) (2003)
Naumann, F.; Herschel, M.: An Introduction to Duplicate Detection. Morgan and Claypool Publishers, San Rafael (2010)
MATH Google Scholar
Christen, P.; Goiser, K.: A comparison of personal name matching: techniques and practical issues. In: Proceeding of Data Mining Workshops, ICDM Workshops (2006)

Download references

Author information

Authors and Affiliations

Department of Computers and Control Engineering, Faculty of Engineering, Tanta University, Tanta, Egypt
Mayada A. Elziky, Dina M. Ibrahim & Amany M. Sarhan

Authors

Mayada A. Elziky
View author publications
You can also search for this author in PubMed Google Scholar
Dina M. Ibrahim
View author publications
You can also search for this author in PubMed Google Scholar
Amany M. Sarhan
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Amany M. Sarhan.

Rights and permissions

Reprints and permissions

About this article

Cite this article

Elziky, M.A., Ibrahim, D.M. & Sarhan, A.M. Improved Duplicate Record Detection Using ASCII Code Q-gram Indexing Technique. Arab J Sci Eng 43, 7409–7420 (2018). https://doi.org/10.1007/s13369-018-3105-6

Download citation

Received: 29 December 2016
Accepted: 04 February 2018
Published: 15 February 2018
Issue Date: December 2018
DOI: https://doi.org/10.1007/s13369-018-3105-6

Keywords

Use our pre-submission checklist

Avoid common mistakes on your manuscript.

Improved Duplicate Record Detection Using ASCII Code Q-gram Indexing Technique

Abstract

Article PDF

Similar content being viewed by others

Near Duplicate Text Detection Using Frequency-Biased Signatures

Efficient Approach for Near Duplicate Document Detection Using Textual and Conceptual Based Techniques

An Improved Data Cleaning Algorithm Based on SNM

References

Author information

Authors and Affiliations

Corresponding author

Rights and permissions

About this article

Cite this article

Keywords

Navigation

Improved Duplicate Record Detection Using ASCII Code Q-gram Indexing Technique

Abstract

Article PDF

Similar content being viewed by others

Near Duplicate Text Detection Using Frequency-Biased Signatures

Efficient Approach for Near Duplicate Document Detection Using Textual and Conceptual Based Techniques

An Improved Data Cleaning Algorithm Based on SNM

References

Author information

Authors and Affiliations

Corresponding author

Rights and permissions

About this article

Cite this article

Share this article

Keywords

Search

Navigation