JHU/APL Experiments in Tokenization and Non-word Translation

  • Conference paper
Comparative Evaluation of Multilingual Information Access Systems (CLEF 2003)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 3237)

Abstract

In the past we have conducted experiments that investigate the benefits and peculiarities attendant to alternative methods for tokenization, particularly overlapping character n-grams. This year we continued this line of work and report new findings reaffirming that the judicious use of n-grams can lead to performance surpassing that of word-based tokenization. In particular we examined: the relative performance of n-grams and a popular suffix stemmer; a novel form of n-gram indexing that approximates stemming and achieves fast run-time performance; various lengths of n-grams; and the use of n-grams for robust translation of queries using an aligned parallel text. For the CLEF 2003 evaluation we submitted monolingual and bilingual runs for all languages and language pairs and multilingual runs using English as a source language. Our key findings are that shorter n-grams (n=4 and n=5) outperform a popular stemmer in non-Romance languages, that direct translation of n-grams is feasible using an aligned corpus, that translated 5-grams yield superior performance to words, stems, or 4-grams, and that a combination of indexing methods is best of all.
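
To make the idea of overlapping character n-grams concrete, the following is a minimal sketch in Python. It is not the authors' implementation: the function name `char_ngrams`, the lowercasing, the whitespace normalization, and the choice to pad the text with a single space on each side (so that n-grams can span word boundaries) are all assumptions made for illustration.

```python
def char_ngrams(text, n=4):
    """Return the overlapping character n-grams of `text`.

    Assumptions (illustrative only, not taken from the paper):
    - text is lowercased and runs of whitespace collapse to one space
    - one space is added at each end, so n-grams mark word starts/ends
    - n-grams are allowed to span word boundaries
    """
    normalized = " ".join(text.lower().split())
    padded = f" {normalized} "
    if len(padded) < n:
        # Text shorter than n: keep it as a single token rather than drop it.
        return [padded]
    # Slide a window of width n one character at a time.
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


if __name__ == "__main__":
    for n in (4, 5):
        print(n, char_ngrams("juggling", n))
```

With `n=4`, the word "juggling" yields the seven tokens `" jug"`, `"jugg"`, `"uggl"`, `"ggli"`, `"glin"`, `"ling"`, and `"ing "`. Because inflected forms such as "juggle" and "juggling" share several of these tokens, n-gram indexing can match morphological variants without an explicit stemmer, which is the property the abstract compares against suffix stemming.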

Copyright information

© 2004 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

McNamee, P., Mayfield, J. (2004). JHU/APL Experiments in Tokenization and Non-word Translation. In: Peters, C., Gonzalo, J., Braschler, M., Kluck, M. (eds) Comparative Evaluation of Multilingual Information Access Systems. CLEF 2003. Lecture Notes in Computer Science, vol 3237. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-30222-3_8

  • DOI: https://doi.org/10.1007/978-3-540-30222-3_8

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-24017-4

  • Online ISBN: 978-3-540-30222-3

  • eBook Packages: Springer Book Archive