Abstract
A main challenge in cross-language information retrieval (CLIR) is estimating a translation language model, since its quality directly affects retrieval performance. The translation language model is built from translation resources such as bilingual dictionaries, parallel corpora, or comparable corpora. High-quality resources are often unavailable for resource-scarce languages, so for these languages it is all the more important to exploit commonly available resources such as comparable corpora efficiently. In this paper, we focus on using only comparable corpora to extract translation information more efficiently. We propose a language modeling approach for estimating the translation language model. The proposed method is based on probability distribution estimation and can be tuned more easily than previous work, which relies on heuristic adjustment. Experimental results show a significant improvement in translation quality and CLIR performance over previous approaches.
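To make the role of a translation language model concrete, the following minimal sketch shows one common way such a model is applied in CLIR: a source-language query model p(s | Q) is mapped to a target-language model through p(t | Q) = Σ_s p(t | s) · p(s | Q). This is a generic illustration under stated assumptions, not the estimation method proposed in the paper. The function names, the toy translation table, and its probabilities are hypothetical.

```python
from collections import Counter, defaultdict


def query_language_model(query_terms):
    """Maximum-likelihood unigram model p(s | Q) of a source-language query."""
    counts = Counter(query_terms)
    total = sum(counts.values())
    return {term: count / total for term, count in counts.items()}


def translate_query_model(source_model, translation_table):
    """Map p(s | Q) to p(t | Q) = sum_s p(t | s) * p(s | Q).

    translation_table[s] is a dict {t: p(t | s)}. Source terms without an
    entry contribute nothing, and the result is renormalized to sum to 1.
    """
    target_model = defaultdict(float)
    for s, p_s in source_model.items():
        for t, p_t_given_s in translation_table.get(s, {}).items():
            target_model[t] += p_t_given_s * p_s
    total = sum(target_model.values())
    if total == 0:
        return {}  # no query term could be translated
    return {t: p / total for t, p in target_model.items()}


# Hypothetical toy translation table (illustrative values only).
toy_table = {
    "ketab": {"book": 0.8, "volume": 0.2},
    "khane": {"house": 0.6, "home": 0.4},
}

source_model = query_language_model(["ketab", "khane"])
target_model = translate_query_model(source_model, toy_table)
```

In this sketch the quality of `toy_table` fully determines the translated query model, which is why the estimation of p(t | s) from the available resources matters for retrieval performance.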
Copyright information
© 2013 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Rahimi, R., Shakery, A. (2013). A Language Modeling Approach for Extracting Translation Knowledge from Comparable Corpora. In: Serdyukov, P., et al. Advances in Information Retrieval. ECIR 2013. Lecture Notes in Computer Science, vol 7814. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-36973-5_51
DOI: https://doi.org/10.1007/978-3-642-36973-5_51
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-36972-8
Online ISBN: 978-3-642-36973-5
eBook Packages: Computer Science, Computer Science (R0)