Abstract
The Web now spans the globe, and communication over the Internet has driven globalization, which in turn makes it necessary to find information in any language. No single language is understood by everyone; many people express their needs in regional languages, so language diversity becomes a significant barrier. Cross Lingual Information Retrieval (CLIR) addresses this barrier by allowing a user to pose a query in a native language and retrieve documents written in a different language. This paper discusses CLIR challenges, query translation techniques and approaches for many Indian and foreign languages, and briefly analyses CLIR tools.
Keywords
- CLIR
- Dictionary translation
- Wikipedia translation
- UNL
- Corpora
- Ontology
- NER
- Google translator
- Homonymy
- Polysemy
1 Introduction
Information Retrieval (IR) is a reasoning process for storing, searching and retrieving information that matches documents to user needs. These tasks are not restricted to monolingual settings but extend to multilingual ones. In classical IR, documents and sentences in other languages are treated as unwanted “noise” [1, 2]. CLIR deals with the situation where the user query and the relevant documents are in different languages, so the language barrier becomes a serious obstacle to worldwide communication. A CLIR approach overcomes this barrier with a translation mechanism followed by monolingual IR. There are two types of translation: query translation and document translation. Query translation is generally preferred because document translation consumes far more computation time and storage space [3]. Several workshops and forums have been established to boost research in CLIR. The Cross Language Evaluation Forum (CLEF) has dealt mainly with European languages since 2000. The NII Test Collection for IR Systems (NTCIR) workshop aims to advance research in Japanese and other Asian languages. The first evaluation exercise of the Forum for Information Retrieval Evaluation (FIRE) was completed in 2008 with three Indian languages: Hindi, Bengali and Marathi. The CLIA consortium comprises 11 Indian institutes working on the project “Development of Cross Lingual Information Access system (CLIA)”, funded by the Government of India. The objective of this project is to create a portal where user queries are answered in three ways: in the query language, in Hindi and in English [2]. The literature survey is presented in Sect. 2, issues and challenges in Sect. 3, and CLIR approaches in Sect. 4. Section 5 presents a comparative analysis and discussion of CLIR translation techniques and retrieval strategies, together with a brief analysis of CLIR tools.
2 Literature Survey
Makin et al. concluded that a bilingual dictionary combined with cognate matching and transliteration achieves better performance, whereas parallel corpora and Machine Translation (MT) approaches did not perform as well [4]. Pirkola et al. experimented with English and Spanish and extracted similar terms to develop transliteration rules [5]. Bajpai et al. developed a prototype model in which the query is translated using any one of the MT, dictionary-based or corpus-based techniques; Word Sense Disambiguation (WSD) with Boolean, vector space and probabilistic models was used for IR [6]. Chen et al. experimented with SMT and parallel corpora for translation [7]. Jagarlamudi et al. exploited a statistical machine translation (SMT) system and a transliteration technique for query translation, and used a language modeling algorithm to retrieve the relevant documents [8]. Chinnakotla et al. used a bilingual dictionary and a rule-based transliteration approach for query translation, with term-term co-occurrence statistics for disambiguation [9]. Gupta et al. used SMT and transliteration; the query-wise results were then mined and a new list of queries was created. The open-source Terrier search engine was used for information retrieval [10]. Yu et al. experimented with a domain ontology knowledge method, in which the knowledge is obtained from user queries and target documents [11]. Monti et al. developed an ontology-based CLIR system: a linguistic pre-processing step is first applied to the source-language query, followed by transformation routines (domain concept mapping and RDF graph matching) and translation routines (bilingual dictionary mapping and FSA/FST development) [12].
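To make the idea of co-occurrence-based disambiguation concrete, the following is a minimal sketch, not the method of [9] itself. It assumes a toy bilingual dictionary and a tiny target-language corpus, both hypothetical, and picks the combination of candidate translations whose terms co-occur in the most documents. Terms missing from the dictionary are passed through unchanged, which stands in for a transliteration step.

```python
from itertools import product


def cooccurrence(a, b, docs):
    """Count target-language documents that contain both terms."""
    return sum(1 for d in docs if a in d and b in d)


def disambiguate(query_terms, dictionary, docs):
    """Choose one translation per query term so that the chosen
    translations co-occur as often as possible in the target corpus."""
    # Out-of-vocabulary terms keep their surface form (placeholder for transliteration).
    candidates = [dictionary.get(t, [t]) for t in query_terms]
    best, best_score = None, -1
    for combo in product(*candidates):
        # Sum the co-occurrence counts over every pair of chosen translations.
        score = sum(
            cooccurrence(combo[i], combo[j], docs)
            for i in range(len(combo))
            for j in range(i + 1, len(combo))
        )
        if score > best_score:
            best, best_score = list(combo), score
    return best


# Hypothetical data: source terms are placeholders, not real dictionary entries.
toy_dictionary = {"bank_src": ["shore", "bank"], "loan_src": ["debt", "loan"]}
toy_docs = [
    {"bank", "loan", "interest"},
    {"river", "shore", "boat"},
    {"bank", "loan", "credit"},
    {"debt", "crisis"},
]

translated = disambiguate(["bank_src", "loan_src"], toy_dictionary, toy_docs)
print(translated)
```

The exhaustive search over all combinations is only practical for short queries; it is used here because it keeps the selection criterion easy to read.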
Chen-Yu et al. used a dictionary-based approach, with Wikipedia as a live dictionary for Out Of Vocabulary (OOV) terms; the standard OKAPI BM25 algorithm was then used for retrieval [13]. Sorg et al. used Wikipedia as a knowledge resource for CLIR. Both queries and documents are converted to an interlingual concept space consisting of either Wikipedia articles or categories. A bag-of-concepts model was built, and various vector-based retrieval models and term weighting strategies were tested in conjunction with Cross-Lingual Explicit Semantic Analysis (CL-ESA) [14]. Samantaray et al. discussed concept-based CLIR for the agriculture domain. They used Latent Semantic Analysis (LSA), Explicit Semantic Analysis (ESA), Universal Networking Language (UNL) and WordNet for CLIR and WSD [15]. Xiaoning et al. used Google translator because of its high performance on named entity translation. Chinese character bigrams were used as the indexing unit, a KL-divergence model was used for retrieval, and pseudo feedback was used to improve average precision [16]. Zhang et al. proposed a search-result-based approach in which the appropriate translation is selected using the inverse translation frequency (ITF) method, which reduces the impact of noisy symbols [17]. Pourmahmoud et al. exploited a phrase translation approach with a bilingual dictionary, and used query expansion techniques to retrieve documents [18].
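As an illustration of the retrieval step mentioned for [13], the sketch below scores tokenized documents against an already translated query with Okapi BM25. It is a simplified version under stated assumptions: the parameter values k1 = 1.2 and b = 0.75 are common defaults rather than values reported in [13], the IDF uses the non-negative variant log((N - df + 0.5) / (df + 0.5) + 1), and the toy documents are hypothetical.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Return one BM25 score per document for a tokenized query.

    `docs` must be a non-empty list of token lists.
    """
    n = len(docs)
    avgdl = sum(len(d) for d in docs) / n
    # Document frequency: number of documents containing each term.
    df = Counter(t for d in docs for t in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for t in query:
            if t not in tf:
                continue
            idf = math.log((n - df[t] + 0.5) / (df[t] + 0.5) + 1.0)
            # Term-frequency saturation with document-length normalization.
            norm = tf[t] + k1 * (1 - b + b * len(d) / avgdl)
            score += idf * tf[t] * (k1 + 1) / norm
        scores.append(score)
    return scores


# Hypothetical target-language collection and translated query.
toy_collection = [
    ["cross", "lingual", "retrieval"],
    ["machine", "translation", "system"],
    ["retrieval", "of", "documents", "by", "translation"],
]
scores = bm25_scores(["lingual", "retrieval"], toy_collection)
ranking = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
print(ranking)
```

Documents that contain none of the query terms receive a score of zero, so only the translated query terms influence the ranking.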
3 Issues and Challenges
Various issues and challenges are discussed in Table 1.
4 CLIR Approaches
Various CLIR approaches are discussed in Table 2.
5 Comparative Analysis and Discussion
A comparative analysis of CLIR approaches is presented in Table 3.
Mean Average Precision (MAP) is the evaluation measure. The MAP for a set of queries is the mean of the average precision scores of the individual queries, where precision is the fraction of retrieved documents that are relevant to the query. Google translator is effective because of its bias towards named entities, and achieves a MAP of 0.3889 for English-Chinese [16]. Machine translation combined with parallel corpora achieves a better MAP, 0.4694 for English-German [7], but suffers from a lack of resources because parallel corpora of sufficient size are not available for all languages. Most researchers use a bilingual dictionary because it is available for all languages and incurs only nominal computation cost. A bilingual dictionary with cohesion translation and query expansion achieves a MAP of 0.4337 for Persian-English [18]. Wikipedia is used to handle OOV terms, but a Wikipedia with sufficient data is available for only a limited number of languages. CLIR with Wikipedia achieves a MAP of 0.46 [14].
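The MAP measure defined above can be illustrated with a short sketch. The document identifiers and relevance judgments below are hypothetical and serve only to show the arithmetic; average precision is taken here as the mean of the precision values at the ranks where relevant documents appear, divided by the total number of relevant documents.

```python
def average_precision(ranked, relevant):
    """Average precision of one ranked result list.

    `ranked` is a list of document ids in retrieval order,
    `relevant` is the set of ids judged relevant for the query.
    """
    hits, total = 0, 0.0
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank  # precision at this rank
    return total / len(relevant) if relevant else 0.0


def mean_average_precision(runs):
    """Mean of the average precision over (ranked, relevant) pairs."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)


# Hypothetical judgments for two queries.
runs = [
    (["d1", "d2", "d3", "d4"], {"d1", "d3"}),  # AP = (1/1 + 2/3) / 2
    (["d5", "d6"], {"d6"}),                    # AP = 1/2
]
map_score = mean_average_precision(runs)
print(round(map_score, 4))
```

For the two toy queries the average precisions are 5/6 and 1/2, so the MAP is 2/3.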
Ontology, WordNet, UNL and co-occurrence translation are used to resolve term homonymy and polysemy. Dictionary coverage and quality, phrase translation, homonymy, polysemy and lack of resources are the major challenges for CLIR. Many comprehensive tools have been developed to address the language barrier, such as MT tools and CLIR tools [19]. A brief study of CLIR tools is summarized in Table 4. All of these tools use a bilingual dictionary because of its nominal computation time. MIRACLE, MULTI LEX EXPLORER and MULTI SEARCHER attempt to remove a common problem of user-assisted query translation. Automatic query translation suffers from homonymy and polysemy.
6 Conclusion
CLIR enables searching documents across the enduring diversity of the world's languages. It bridges the linguistic gap and allows a user to submit a query in a language different from that of the target documents. A CLIR method consists of a translation mechanism followed by monolingual retrieval. The analysis indicates that query translation is a more efficient choice than document translation. In this paper, various CLIR issues and challenges and query translation approaches with disambiguation are discussed. A comparative analysis of CLIR approaches is presented in Table 3. A CLIR approach with a bilingual dictionary, cohesion translation, query expansion and language modeling achieves a good MAP of 0.4337. Another CLIR approach with Wikipedia, bag of concepts and Cross-Lingual Explicit Semantic Analysis achieves a better MAP of 0.46. A CLIR approach using MT with parallel corpora achieves a MAP of 0.4694. A brief analysis of CLIR tools is presented in Table 4. Dictionary coverage and quality, unavailability of parallel corpora, phrase translation, homonymy and polysemy are identified as the major issues.
References
Nagarathinam A., Saraswathi S.: State of art: Cross Lingual Information Retrieval System for Indian Languages. In International Journal of computer application, Vol. 35, No. 13, pp. 15–21 (2006).
Nasharuddin N., Abdullah M.: Cross-lingual Information Retrieval State-of-the-Art. In Electronic Journal of Computer Science and Information Technology (eJCSIT), Vol. 2, No. 1, pp. 1–5 (2010).
Oard, D.W.: A Comparative Study of Query and Document Translation for Cross-language Information Retrieval. In Proceedings of the Third Conference of the Association for Machine Translation in the Americas on Machine Translation and the Information Soup, Springer-Verlag, pp. 472–483 (1998).
Makin R., Pandey N., Pingali P., Varma V.: Approximate String Matching Techniques for Effective CLIR. In International Workshop on Fuzzy Logic and Applications, Italy, Springer-Verlag, pp. 430–437 (2007).
Pirkola A., Toivonen J., Keskustalo H., Visala K., Jarvelin K.: Fuzzy translation of cross-lingual spelling variants. In: Proceedings of SIGIR’03, pp. 345–352 (2003).
Bajpai P., Verma P.: Cross Language Information Retrieval: In Indian Language Perspective. International Journal of Research in Engineering and Technology, Vol. 3, pp. 46–52 (2014).
Chen A., Gey F.C.: Combining Query Translation and Document Translation in Cross-Language Retrieval. In Comparative Evaluation of Multilingual Information Access Systems, Springer Berlin: Heidelberg, pp. 108–121 (2004).
Jagarlamudi J., Kumaran A.: Cross-Lingual Information Retrieval System for Indian Languages. In Advances in multilingual and multi modal information retrieval, pp. 80–87 (2008).
Chinnakotla M., Ranadive S., Dhamani O.P., Bhattacharyya P.: Hindi to English and Marathi to English Cross Language Information Retrieval Evaluation. In Advances in Multilingual and Multimodal Information Retrieval, Springer-Verlag, pp. 111–118 (2008).
Gupta S. Kumar, Sinha A., Jain M.: Cross Lingual Information Retrieval with SMT and Query Mining. In Advanced Computing: An International Journal (ACIJ), Vol.2, No.5, pp. 33–39 (2011).
Yu F., Zheng D., Zhao T., Li S., Yu H.: Chinese-English Cross-Lingual Information Retrieval based on Domain Ontology Knowledge. In International conference on Computational Intelligence and Security, Vol. 2, pp. 1460–1463 (2006).
Monti J., Monteleone M.: Natural Language Processing and Big Data An Ontology-Based Approach for Cross-Lingual Information Retrieval. In International Conference on Social Computing, pp. 725–731 (2013).
Chen-Yu S., Tien-Chien L., Shih-Hung W.: Using Wikipedia to Translate OOV Terms on MLIR. In Proceedings of NTCIR-6 Workshop Meeting, Tokyo, Japan, pp. 109–115 (2007).
Sorg P., Cimiano P.: Exploiting Wikipedia for Cross-Lingual and Multi-Lingual Information Retrieval. Elsevier, pp. 26–45 (2012).
Samantaray S. D.: An Intelligent Concept based Search Engine with Cross Linguility support. In 7th International Conference on Industrial Electronics and Applications, Singapore, pp. 1441–1446 (2012).
Xiaoning H., Peidong W., Haoliang Q., Muyun Y., Guohua L., Yong X.: Using Google Translation in Cross-Lingual Information Retrieval. In Proceedings of NTCIR-7 Workshop Meeting, Tokyo, Japan, pp. 159–161 (2008).
Zhang J., Sun L., Min J.: Using the Web Corpus to Translate the Queries in Cross-Lingual Information Retrieval. In Proceeding of NLP_KE, pp. 493–498 (2005).
Pourmahmoud S., Shamsfard M.: Semantic Cross-Lingual Information Retrieval. In International symposium on computer and information sciences, pp. 1–4 (2008).
Ahmed F., Nurnberger A.: Literature review of interactive cross language information retrieval tools. In The international Arab Journal of Information Technology, Vol. 9, No. 5, pp. 479–486 (2012).
Boretz, A., AppTek Launches Hybrid Machine Translation Software, in Speech Tag Online Magazine (2009).
Yuan, S., Yu S.: A new method for cross-language information retrieval by summing weights of graphs. In Fourth International Conference on Fuzzy Systems and Knowledge Discovery, IEEE Computer Society, pp. 326–330 (2007).
Nie, J., Simard M., Isabelle P., Durand R.: Cross-Language Information Retrieval Based on Parallel Texts and Automatic Mining of Parallel Texts from the Web. In Proc. of ACM-SIGIR, pp. 74–81 (1999).
Lu W., Chien L., Lee H.: Anchor Text Mining for Translation of Web Queries: A Transitive Translation Approach. ACM Transactions on Information Systems 22(2), pp. 242–269 (2004).
Navigli R.: Word Sense Disambiguation: A Survey. ACM Computing Surveys, Vol. 41, No. 2 (2009).
© 2016 Springer India
Cite this paper
Sharma, V.K., Mittal, N. (2016). Cross Lingual Information Retrieval (CLIR): Review of Tools, Challenges and Translation Approaches. In: Satapathy, S., Mandal, J., Udgata, S., Bhateja, V. (eds) Information Systems Design and Intelligent Applications. Advances in Intelligent Systems and Computing, vol 433. Springer, New Delhi. https://doi.org/10.1007/978-81-322-2755-7_72
DOI: https://doi.org/10.1007/978-81-322-2755-7_72
Publisher Name: Springer, New Delhi
Print ISBN: 978-81-322-2753-3
Online ISBN: 978-81-322-2755-7
eBook Packages: Engineering (R0)