A Language Modeling Approach for Extracting Translation Knowledge from Comparable Corpora

Rahimi, Razieh; Shakery, Azadeh

doi:10.1007/978-3-642-36973-5_51

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 7814)

Included in the following conference series: European Conference on Information Retrieval

Abstract

A main challenge in cross-language information retrieval (CLIR) is estimating a translation language model, as its quality directly affects retrieval performance. The translation language model is built from translation resources such as bilingual dictionaries, parallel corpora, or comparable corpora. High-quality resources are often unavailable for resource-scarce languages. For these languages, efficient exploitation of commonly available resources such as comparable corpora is all the more important. In this paper, we focus on using only comparable corpora to extract translation information more efficiently. We propose a language modeling approach for estimating the translation language model. The proposed method is based on probability distribution estimation and is easier to tune than previous work, which relies on heuristic adjustments. Experimental results show significant improvements in translation quality and CLIR performance over previous approaches.

Author information

Authors and Affiliations

School of Electrical and Computer Engineering, College of Engineering, University of Tehran, Tehran, Iran
Razieh Rahimi & Azadeh Shakery

Editor information

Editors and Affiliations

Yandex, Leo Tolstoy, 16, 119021, Moscow, Russia
Pavel Serdyukov & Ilya Segalovich
Kontur Labs and Ural Federal University, Fonvizina 3-27, 620078, Yekaterinburg, Russia
Pavel Braslavski
National Research University Higher School of Economics (HSE), Pokrovskii bd 11, 109028, Moscow, Russia
Sergei O. Kuznetsov
University of Amsterdam, Turfdraagsterpad 9, 1012 XT, Amsterdam, The Netherlands
Jaap Kamps
Knowledge Media Institute, The Open University, Walton Hall, MK7 6AA, Milton Keynes, UK
Stefan Rüger
Mathematics & Computer Science Department, Emory University, 400 Dowman Drive, 30329, Atlanta, GA, USA
Eugene Agichtein
Department of Computer Science, University College London, Gower Street, WC1E 6BT, London, UK
Emine Yilmaz

About this paper

Cite this paper

Rahimi, R., Shakery, A. (2013). A Language Modeling Approach for Extracting Translation Knowledge from Comparable Corpora. In: Serdyukov, P., et al. Advances in Information Retrieval. ECIR 2013. Lecture Notes in Computer Science, vol 7814. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-36973-5_51

DOI: https://doi.org/10.1007/978-3-642-36973-5_51
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-36972-8
Online ISBN: 978-3-642-36973-5
eBook Packages: Computer Science, Computer Science (R0)
