JHU/APL Experiments in Tokenization and Non-word Translation

McNamee, Paul; Mayfield, James

doi:10.1007/978-3-540-30222-3_8

Part of the book series: Lecture Notes in Computer Science ((LNCS,volume 3237))

Included in the following conference series:

Workshop of the Cross-Language Evaluation Forum for European Languages

Abstract

In the past we have conducted experiments that investigate the benefits and peculiarities attendant to alternative methods for tokenization, particularly overlapping character n-grams. This year we continued this line of work and report new findings reaffirming that the judicious use of n-grams can lead to performance surpassing that of word-based tokenization. In particular we examined: the relative performance of n-grams and a popular suffix stemmer; a novel form of n-gram indexing that approximates stemming and achieves fast run-time performance; various lengths of n-grams; and the use of n-grams for robust translation of queries using an aligned parallel text. For the CLEF 2003 evaluation we submitted monolingual and bilingual runs for all languages and language pairs and multilingual runs using English as a source language. Our key findings are that shorter n-grams (n=4 and n=5) outperform a popular stemmer in non-Romance languages, that direct translation of n-grams is feasible using an aligned corpus, that translated 5-grams yield superior performance to words, stems, or 4-grams, and that a combination of indexing methods is best of all.
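The overlapping character n-grams mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not specify the normalization used, so the lowercasing, whitespace collapsing, and single-space padding below are assumptions, and the function name `char_ngrams` is hypothetical rather than taken from the authors' system.

```python
def char_ngrams(text: str, n: int) -> list[str]:
    """Return the overlapping character n-grams of `text`.

    Assumptions (not stated in the abstract): text is lowercased,
    runs of whitespace are collapsed to one space, and a single
    space is added at each end so word boundaries appear in n-grams.
    """
    normalized = " ".join(text.lower().split())
    padded = f" {normalized} "
    if len(padded) <= n:
        # Text too short to slide a window over: return it as one token.
        return [padded]
    # Slide a window of width n one character at a time.
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


# Morphological variants share n-grams without any stemming step.
shared = set(char_ngrams("juggling", 4)) & set(char_ngrams("jugglers", 4))
print(sorted(shared))
```

Because related word forms share several n-grams, a query term can match inflected variants in a document without a language-specific stemmer, which is one plausible reading of why n-grams are compared against stemming in the abstract.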

Author information

Authors and Affiliations

Applied Physics Laboratory, The Johns Hopkins University, 11100 Johns Hopkins Road, Laurel, MD, 20723-6099, USA
Paul McNamee & James Mayfield

Editor information

Editors and Affiliations

ISTI-CNR, Area di Ricerca, Pisa, Italy
Carol Peters
No affiliation listed
Julio Gonzalo & Martin Braschler
German Institute for International and Security Affairs, Stiftung Wissenschaft und Politik (SWP), Ludwigkirchplatz 3-4, 10719, Berlin, Germany
Michael Kluck

About this paper

Cite this paper

McNamee, P., Mayfield, J. (2004). JHU/APL Experiments in Tokenization and Non-word Translation. In: Peters, C., Gonzalo, J., Braschler, M., Kluck, M. (eds) Comparative Evaluation of Multilingual Information Access Systems. CLEF 2003. Lecture Notes in Computer Science, vol 3237. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-30222-3_8

DOI: https://doi.org/10.1007/978-3-540-30222-3_8
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-24017-4
Online ISBN: 978-3-540-30222-3
eBook Packages: Springer Book Archive
