Abstract
To effectively support today’s global economy, database systems need to store and manipulate text data in multiple languages simultaneously. Current database systems do support the storage and management of multilingual data, but are not capable of querying or matching text data across different scripts. As a first step towards addressing this lacuna, we propose here a new query operator called LexEQUAL, which supports multiscript matching of proper names. The operator is implemented by first transforming matches in multiscript text space into matches in the equivalent phoneme space, and then using standard approximate matching techniques to compare these phoneme strings. The algorithm incorporates tunable parameters that impact the phonetic match quality and thereby determine the match performance in the multiscript space. We evaluate the performance of the LexEQUAL operator on a real multiscript names dataset and demonstrate that it is possible to simultaneously achieve good recall and precision by appropriate parameter settings. We also show that the operator run-time can be made extremely efficient by utilizing a combination of q-gram and database indexing techniques. Thus, we show that the LexEQUAL operator can complement the standard lexicographic operators, representing a first step towards achieving complete multilingual functionality in database systems.
A poster version of this paper appears in the Proc. of the 20th IEEE Intl. Conf. on Data Engineering, March 2004.
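The two-stage idea described in the abstract (map names in different scripts into a common phoneme space, then compare the phoneme strings approximately) can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration and is not taken from the paper: the lookup table `TOY_PHONEMES` stands in for a real text-to-phoneme converter and holds made-up placeholder strings rather than real phonetic transcriptions, the function `lex_equal_sketch` is hypothetical, and the normalised edit-distance `threshold` is just one plausible form a tunable match parameter could take.

```python
from typing import Dict

# Made-up placeholder "phoneme" strings, NOT real IPA transcriptions.
# A real system would call a text-to-phoneme converter for each language.
TOY_PHONEMES: Dict[str, str] = {
    "Nehru": "nehru",
    "नेहरू": "nehruu",
    "Nero": "nero",
}


def to_phonemes(name: str) -> str:
    """Hypothetical stand-in for a text-to-phoneme conversion step."""
    return TOY_PHONEMES[name]


def edit_distance(s: str, t: str) -> int:
    """Classic Levenshtein distance with unit costs, computed row by row."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        curr = [i]
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def lex_equal_sketch(a: str, b: str, threshold: float = 0.25) -> bool:
    """Return True when the phoneme strings of a and b are close enough.

    threshold is an assumed tunable parameter: the maximum edit distance
    allowed, expressed as a fraction of the longer phoneme string.
    """
    pa, pb = to_phonemes(a), to_phonemes(b)
    longest = max(len(pa), len(pb))
    if longest == 0:
        return True
    return edit_distance(pa, pb) / longest <= threshold


if __name__ == "__main__":
    print(lex_equal_sketch("Nehru", "नेहरू"))                # same name, two scripts
    print(lex_equal_sketch("Nehru", "Nero"))                 # different name, strict threshold
    print(lex_equal_sketch("Nehru", "Nero", threshold=0.5))  # looser threshold
```

In this toy setting, loosening the threshold admits more matches, including the unrelated name, which shows the kind of recall and precision trade-off that such a parameter controls. The q-gram and indexing techniques that the abstract mentions for run-time efficiency are not shown here.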
© 2004 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Kumaran, A., Haritsa, J.R. (2004). LexEQUAL: Supporting Multiscript Matching in Database Systems. In: Bertino, E., et al. Advances in Database Technology - EDBT 2004. EDBT 2004. Lecture Notes in Computer Science, vol 2992. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-24741-8_18
DOI: https://doi.org/10.1007/978-3-540-24741-8_18
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-21200-3
Online ISBN: 978-3-540-24741-8