Abstract
We report on a first attempt to perform cross-language spoken document retrieval. Having no prior experience with monolingual speech retrieval, we applied the same general approach we use for bilingual retrieval, which is typified by overlapping character n-grams for tokenization and a statistical language model of retrieval. We adopted an innovative approach for coping with out-of-vocabulary words and with misspelled or mistranscribed words: direct translation of individual n-grams was the sole mechanism for translating source language queries into target language terms. Though this approach shows promise, especially for non-speech retrieval, our performance appears to lag behind that of other teams participating in this novel evaluation.
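The overlapping character n-gram tokenization mentioned in the abstract can be illustrated with a minimal sketch. The n-gram length of 5, the lowercasing, and the choice to let n-grams span word boundaries are assumptions made for illustration; the abstract does not specify them, and the function name `char_ngrams` is hypothetical.

```python
def char_ngrams(text, n=5):
    """Return the overlapping character n-grams of `text`.

    Assumptions (not taken from the paper): text is lowercased,
    whitespace runs are collapsed to one space, and n-grams are
    allowed to span word boundaries.
    """
    # Normalize case and whitespace so equivalent strings tokenize alike.
    normalized = " ".join(text.lower().split())
    if not normalized:
        return []
    # Strings shorter than n yield a single, shorter token.
    if len(normalized) < n:
        return [normalized]
    # Slide a window of width n one character at a time.
    return [normalized[i:i + n] for i in range(len(normalized) - n + 1)]


if __name__ == "__main__":
    print(char_ngrams("Spoken doc"))
```

Because each window overlaps its neighbours, a single mistranscribed character affects only the few n-grams that contain it, while the remaining n-grams of the word can still match.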
Copyright information
© 2004 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
McNamee, P., Mayfield, J. (2004). N-Grams for Translation and Retrieval in CL-SDR. In: Peters, C., Gonzalo, J., Braschler, M., Kluck, M. (eds) Comparative Evaluation of Multilingual Information Access Systems. CLEF 2003. Lecture Notes in Computer Science, vol 3237. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-30222-3_63
DOI: https://doi.org/10.1007/978-3-540-30222-3_63
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-24017-4
Online ISBN: 978-3-540-30222-3
eBook Packages: Springer Book Archive