Abstract
In the past we have conducted experiments that investigate the benefits and peculiarities attendant to alternative methods for tokenization, particularly overlapping character n-grams. This year we continued this line of work and report new findings reaffirming that the judicious use of n-grams can lead to performance surpassing that of word-based tokenization. In particular we examined: the relative performance of n-grams and a popular suffix stemmer; a novel form of n-gram indexing that approximates stemming and achieves fast run-time performance; various lengths of n-grams; and the use of n-grams for robust translation of queries using an aligned parallel text. For the CLEF 2003 evaluation we submitted monolingual and bilingual runs for all languages and language pairs and multilingual runs using English as a source language. Our key findings are that shorter n-grams (n=4 and n=5) outperform a popular stemmer in non-Romance languages, that direct translation of n-grams is feasible using an aligned corpus, that translated 5-grams yield superior performance to words, stems, or 4-grams, and that a combination of indexing methods is best of all.
Access this chapter
Tax calculation will be finalised at checkout
Purchases are for personal use only
Preview
Unable to display preview. Download preview PDF.
Similar content being viewed by others
References
Monz, C., Kamps, J., de Rijke, M.: The University of Amsterdam at CLEF 2002. In: Peters, C., Braschler, M., Gonzalo, J. (eds.) CLEF 2002. LNCS, vol. 2785, pp. 73–84. Springer, Heidelberg (2003)
Reidsma, D., Hiemstra, D., de Jong, F., Kraaij, W.: Cross-language Retrieval at Twente and TNO. In: Working Notes of the CLEF 2002 Workshop, pp. 111–114 (2002)
Savoy, J.: Cross-language information retrieval: experiments based on CLEF 2000 corpora. CLEF 2000 39(1), 75–115 (2003)
Miller, E., Shen, D., Liu, J., Nicholas, C.: Performance and Scalability of a Large-Scale N-gram Based Information Retrieval System. the Journal of Digital Information 1(5) (2000)
McNamee, P., Mayfield, J.: N-Grams for Translation and Retrieval in CL-SDR. In: Peters, C., Gonzalo, J., Braschler, M., Kluck, M. (eds.) CLEF 2003. LNCS, vol. 3237, pp. 658–663. Springer, Heidelberg (2004)
Pirkola, A., Hedlund, T., Keskusalo, H., Järvelin, K.: Dictionary-Based Cross-Language Information Retrieval: Problems, Methods, and Research Findings. Information Retrieval 4, 209–230 (2001)
Tomlinson, S.: Experiments in 8 European Languages with Hummingbird SearchServer at CLEF 2002. In: Peters, C., Braschler, M., Gonzalo, J. (eds.) CLEF 2002. LNCS, vol. 2785, pp. 203–214. Springer, Heidelberg (2003)
Miller, D., Leek, T., Schwartz, R.: A hidden Markov model information retrieval system. In: Proceedings of the 22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 214–221 (1999)
Hiemstra, D.: Using Language Models for Information Retrieval. Ph. D. Thesis. Center for Telematics and Information Technology, The Netherlands (2000)
Jelinek, F., Mercer, R.: Interpolated Estimation of Markov Source Parameters from Sparse Data. In: Gelsema, E., Kanal, L. (eds.) Pattern Recognition in Practice, pp. 381–402. North-Holland, Amsterdam (1980)
McNamee, P., Mayfield, J.: Scalable Multilingual Information Access. In: Peters, C., Braschler, M., Gonzalo, J. (eds.) CLEF 2002. LNCS, vol. 2785, Springer, Heidelberg (2003)
Zhai, C., Lafferty, J.: A Study of Smoothing Methods for Language Models Applied to Ad Hoc Information Retrieval. In: Proceedings of the 24th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 334–342 (2001)
McNamee, P., Mayfield, J.: Character N-gram Tokenization for European Language Text Retrieval. Information Retrieval 7(1-2), 73–97 (2004)
Porter, M.: Snowball: A Language for Stemming Algorithms (visited March 13, 2003), Available online at: http://snowball.tartarus.org/texts/introduction.html
Mayfield, J., McNamee, P.: Single N-gram Stemming. In: The Proceedings of the 26th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 415–416 (2003)
Kwok, K., Chan, M.: Improving Two-Stage Ad-Hoc Retrieval for Short Queries. In: The Proceedings of the 21st International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 250–256 (1998)
Church, K.: Char_align: A program for aligning parallel texts at the character level. In: Proceedings of the 31st Annual Meeting of the Association for Computational Linguistics, pp. 1–8 (1993)
McNamee, P., Mayfield, J.: Comparing Cross-Language Query Expansion Techniques by Degrading Translation Resources. In: Proceedings of the 25th Annual International Conference on Research and Development in Information Retrieval, pp. 159–166 (2002)
Author information
Authors and Affiliations
Editor information
Editors and Affiliations
Rights and permissions
Copyright information
© 2004 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
McNamee, P., Mayfield, J. (2004). JHU/APL Experiments in Tokenization and Non-word Translation. In: Peters, C., Gonzalo, J., Braschler, M., Kluck, M. (eds) Comparative Evaluation of Multilingual Information Access Systems. CLEF 2003. Lecture Notes in Computer Science, vol 3237. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-30222-3_8
Download citation
DOI: https://doi.org/10.1007/978-3-540-30222-3_8
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-24017-4
Online ISBN: 978-3-540-30222-3
eBook Packages: Springer Book Archive