Cross Lingual Information Retrieval (CLIR): Review of Tools, Challenges and Translation Approaches

Sharma, Vijay Kumar; Mittal, Namita

doi:10.1007/978-81-322-2755-7_72


Part of the book series: Advances in Intelligent Systems and Computing ((AISC,volume 433))


Abstract

The Web now spans the world, and global communication over the internet makes it necessary to find information in any language. No single language is understood by everyone, and many people express their information needs in regional languages, so language diversity becomes a major barrier. Cross Lingual Information Retrieval (CLIR) addresses this barrier by allowing a user to pose a query in a native language and retrieve documents written in a different language. This paper discusses CLIR challenges, query translation techniques and approaches for many Indian and foreign languages, and briefly analyses CLIR tools.


1 Introduction

Information Retrieval (IR) is a reasoning process for storing, searching and retrieving information that is relevant to a user's needs. These tasks are not restricted to monolingual settings but extend to multilingual ones. In classical IR, documents and sentences in other languages are treated as unwanted “noise” [1, 2]. CLIR deals with the situation where the user query and the relevant documents are in different languages, so that the language barrier becomes a serious obstacle to worldwide communication. A CLIR approach overcomes this barrier with a translation mechanism followed by monolingual IR. There are two types of translation: query translation and document translation. Query translation is generally preferred because document translation consumes far more computation time and storage space [3]. Several workshops and forums have been established to promote CLIR research. The Cross Language Evaluation Forum (CLEF) has dealt mainly with European languages since 2000. The NII Test Collection for IR Systems (NTCIR) workshop aims to advance research in Japanese and other Asian languages. The first evaluation exercise of the Forum for Information Retrieval Evaluation (FIRE) was completed in 2008 with three Indian languages: Hindi, Bengali and Marathi. The CLIA consortium comprises 11 Indian institutes working on the project “Development of Cross Lingual Information Access system (CLIA)”, funded by the Government of India. The objective of this project is to create a portal that answers user queries in three ways: in the query language, in Hindi and in English [2]. The literature survey is presented in Sect. 2. Issues and challenges are discussed in Sect. 3. Various CLIR approaches are described in Sect. 4. Section 5 contains a comparative analysis and discussion of CLIR translation techniques and retrieval strategies, together with a brief analysis of CLIR tools.
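The query translation pipeline described above (translation followed by monolingual IR) can be illustrated with a minimal Python sketch. The toy dictionary, the sample documents and the function names `translate_query`, `retrieve` and `clir_search` are hypothetical and are not taken from any system surveyed here; simple term counting stands in for a real retrieval model.

```python
from collections import Counter

# Hypothetical toy bilingual dictionary (romanized Hindi -> English).
# The entries are illustrative assumptions, not data from the paper.
TOY_DICTIONARY = {
    "pustak": ["book"],
    "nadi": ["river"],
    "bank": ["bank", "shore"],
}

# Hypothetical target-language (English) document collection.
DOCUMENTS = {
    "d1": "the river bank was flooded",
    "d2": "a book about the central bank",
    "d3": "weather report for tomorrow",
}


def translate_query(query, dictionary):
    """Replace each source term with all of its dictionary translations.

    Out-of-vocabulary terms are passed through unchanged.
    """
    translated = []
    for term in query.lower().split():
        translated.extend(dictionary.get(term, [term]))
    return translated


def retrieve(query_terms, documents):
    """Rank documents by the total number of query-term occurrences."""
    scored = []
    for doc_id, text in documents.items():
        counts = Counter(text.lower().split())
        score = sum(counts[term] for term in query_terms)
        if score > 0:
            scored.append((doc_id, score))
    # Highest score first; ties are broken by document id.
    return sorted(scored, key=lambda pair: (-pair[1], pair[0]))


def clir_search(query, dictionary=TOY_DICTIONARY, documents=DOCUMENTS):
    """Query translation followed by monolingual retrieval."""
    return retrieve(translate_query(query, dictionary), documents)
```

Calling `clir_search("nadi bank")` translates the query into `["river", "bank", "shore"]` and ranks `d1` ahead of `d2`. Keeping every translation of an ambiguous term, as this sketch does, is the simplest choice and is also the source of the homonymy and polysemy problems discussed later.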

2 Literature Survey

Makin et al. concluded that a bilingual dictionary combined with cognate matching and transliteration achieves better performance, whereas parallel corpora and Machine Translation (MT) approaches did not perform well [4]. Pirkola et al. experimented with English and Spanish and extracted similar terms to develop transliteration rules [5]. Bajpai et al. developed a prototype model in which the query is translated using any one of the MT, dictionary-based and corpus-based techniques; Word Sense Disambiguation (WSD) was combined with Boolean, vector space and probabilistic models for IR [6]. Chen et al. experimented with statistical machine translation (SMT) and parallel corpora for translation [7]. Jagarlamudi et al. used an SMT system and a transliteration technique for query translation, and a language modeling algorithm to retrieve the relevant documents [8]. Chinnakotla et al. used a bilingual dictionary and a rule-based transliteration approach for query translation, with term-term co-occurrence statistics for disambiguation [9]. Gupta et al. used SMT and transliteration; the query-wise results were then mined to create a new list of queries, and the open source Terrier search engine (see Note 1) was used for retrieval [10]. Yu et al. experimented with a domain ontology knowledge method, in which the knowledge is obtained from user queries and target documents [11]. Monti et al. developed an ontology-based CLIR system in which a linguistic pre-processing step is first applied to the source language query, followed by transformation routines (domain concept mapping and RDF graph matching) and translation routines (bilingual dictionary mapping and FSA/FST development) [12].
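Disambiguation with term-term co-occurrence statistics, as mentioned for [9], can be sketched as follows. This is only an illustrative assumption about how such a selection step might work, not the method of the cited authors: each source term has a list of candidate translations, and the candidate that co-occurs most often (in a hypothetical target-language corpus) with the candidates of the other query terms is kept. The function names are invented for this example.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_counts(corpus):
    """Count how many documents contain each unordered pair of distinct terms."""
    counts = Counter()
    for text in corpus:
        terms = sorted(set(text.lower().split()))
        for a, b in combinations(terms, 2):
            counts[(a, b)] += 1
    return counts


def cooc(counts, a, b):
    """Look up the co-occurrence count of two terms, ignoring their order."""
    return counts[tuple(sorted((a, b)))]


def disambiguate(candidates, counts):
    """Pick one translation per source term.

    `candidates` is a list with one list of candidate translations per
    source term. A candidate is scored by its summed co-occurrence with
    every candidate of the other source terms; ties keep the first one.
    """
    chosen = []
    for i, options in enumerate(candidates):
        others = [c for j, opts in enumerate(candidates) if j != i for c in opts]
        best = max(options, key=lambda cand: sum(cooc(counts, cand, o) for o in others))
        chosen.append(best)
    return chosen
```

With a small corpus in which "river" appears with "shore" more often than with "bank", `disambiguate([["river"], ["bank", "shore"]], counts)` keeps "shore" for the second term. Real systems typically use much larger corpora and association measures such as mutual information instead of raw counts.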

Chen-Yu et al. used a dictionary-based approach with Wikipedia as a live dictionary for Out-Of-Vocabulary (OOV) terms, and the standard Okapi BM25 algorithm for retrieval [13]. Sorg et al. used Wikipedia as a knowledge resource for CLIR. Both queries and documents are mapped to an interlingual concept space consisting of either Wikipedia articles or categories. A bag-of-concepts model was built, and various vector-based retrieval models and term weighting strategies were evaluated in conjunction with Cross-Lingual Explicit Semantic Analysis (CL-ESA) [14]. Samantaray et al. discussed concept-based CLIR for the agriculture domain, using Latent Semantic Analysis (LSA), Explicit Semantic Analysis (ESA), Universal Networking Language (UNL) and WordNet for CLIR and WSD [15]. Xiaoning et al. used Google Translate because of its high performance on named entity translation; Chinese character bigrams served as the indexing unit, a KL-divergence model was used for retrieval, and pseudo feedback was used to improve average precision [16]. Zhang et al. proposed a search-result-based approach in which the appropriate translation is selected using the inverse translation frequency (ITF) method, which reduces the impact of noisy symbols [17]. Pourmahmoud et al. used a phrase translation approach with a bilingual dictionary, together with query expansion techniques, to retrieve documents [18].
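Okapi BM25, the retrieval function named for [13], can be sketched in a few lines of Python. The parameter values `k1=1.2` and `b=0.75` and the non-negative IDF variant used here are common defaults chosen as assumptions for this illustration; the exact configuration used in [13] is not stated in this survey. Whitespace tokenization and a non-empty document list are also assumed.

```python
import math
from collections import Counter


def bm25_scores(query_terms, documents, k1=1.2, b=0.75):
    """Return one BM25 score per document for the given query terms.

    `documents` is a non-empty list of strings; tokens are split on
    whitespace. k1 and b are assumed default values, not tuned ones.
    """
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    avg_len = sum(len(doc) for doc in tokenized) / n_docs

    # Document frequency: number of documents containing each term.
    df = Counter()
    for doc in tokenized:
        df.update(set(doc))

    scores = []
    for doc in tokenized:
        tf = Counter(doc)
        score = 0.0
        for term in query_terms:
            if tf[term] == 0:
                continue
            # Non-negative IDF variant (the +1 inside the log avoids negative weights).
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Term-frequency saturation with document-length normalization.
            norm = tf[term] + k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores
```

In a CLIR setting the `query_terms` would be the translated query, so translation quality directly determines which terms contribute to the score.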

3 Issues and Challenges

Various issues and challenges are discussed in Table 1.

Table 1 List of CLIR issues and challenges


4 CLIR Approaches

Various CLIR approaches are discussed in Table 2.

Table 2 List of CLIR approaches with description


5 Comparative Analysis and Discussion

A comparative analysis of CLIR approaches is presented in Table 3.

Table 3 Comparative analysis of CLIR approaches


Mean Average Precision (MAP) is the evaluation measure. The MAP for a set of queries is the mean of the average precision scores of the individual queries, where precision is the fraction of retrieved documents that are relevant to the query. Google Translate is effective because of its strength on named entities, achieving a MAP of 0.3889 for English-Chinese [16]. Combining machine translation with parallel corpora achieves a better MAP of 0.4694 for English-German [7], but this approach suffers from a lack of resources, because parallel corpora of sufficient size are not available for all languages. Most researchers use a bilingual dictionary because it is available for all languages and has a nominal computation cost. A bilingual dictionary with cohesion translation and query expansion achieves a MAP of 0.4337 for Persian-English [18]. Wikipedia is used to translate OOV terms, but Wikipedia editions with sufficient data exist for only a limited number of languages. CLIR with Wikipedia achieves a MAP of 0.46 [14].
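The MAP definition above can be made concrete with a short Python sketch. The function names and the input layout (a ranked list of document ids plus a set of relevant ids per query) are assumptions made for this example; average precision is computed in its usual form, as the mean of the precision values at the ranks where relevant documents appear, divided by the total number of relevant documents.

```python
def average_precision(ranked_ids, relevant_ids):
    """Average precision of one ranked result list."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant:
            hits += 1
            # Precision at this rank: relevant documents seen so far / rank.
            precision_sum += hits / rank
    return precision_sum / len(relevant)


def mean_average_precision(runs):
    """Mean of the average precision over all queries.

    `runs` is a list of (ranked_ids, relevant_ids) pairs, one per query.
    """
    if not runs:
        return 0.0
    return sum(average_precision(ranked, rel) for ranked, rel in runs) / len(runs)
```

For example, a query whose relevant documents are found at ranks 1 and 3 has an average precision of (1/1 + 2/3) / 2, and a second query whose single relevant document is found at rank 2 has an average precision of 1/2; the MAP over both queries is the mean of these two values.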

Ontology, WordNet, UNL and co-occurrence translation are used to resolve term homonymy and polysemy. Dictionary coverage and quality, phrase translation, homonymy, polysemy and lack of resources are the major challenges for CLIR. Many comprehensive tools, such as MT tools and CLIR tools, have been developed to address the language barrier [19]. A brief study of CLIR tools is summarized in Table 4. All these tools use a bilingual dictionary because of its nominal computation time. MIRACLE, MULTI LEX EXPLORER and MULTI SEARCHER attempt to remove a common problem of user-assisted query translation. Automatic query translation suffers from the problems of homonymy and polysemy.

Table 4 Comparative analysis of CLIR tools


6 Conclusion

CLIR enables document search across the great diversity of languages in the world. It bridges the linguistic gap by allowing a user to submit a query in a language different from that of the target documents. A CLIR method consists of a translation mechanism followed by monolingual retrieval. The analysis indicates that query translation is a more efficient choice than document translation. In this paper, various CLIR issues and challenges and query translation approaches with disambiguation are discussed. A comparative analysis of CLIR approaches is presented in Table 3. A CLIR approach with a bilingual dictionary, cohesion translation, query expansion and language modeling achieves a good MAP of 0.4337. Another CLIR approach with Wikipedia, bag-of-concepts and Cross-Lingual Explicit Semantic Analysis achieves a better MAP of 0.46. The MT with parallel corpora approach achieves a MAP of 0.4694. A brief analysis of CLIR tools is presented in Table 4. Dictionary coverage and quality, unavailability of parallel corpora, phrase translation, homonymy and polysemy are identified as the major issues.

Notes

1. www.terrier.org

References

1. Nagarathinam A., Saraswathi S.: State of art: Cross Lingual Information Retrieval System for Indian Languages. In International Journal of Computer Applications, Vol. 35, No. 13, pp. 15–21 (2006).
2. Nasharuddin N., Abdullah M.: Cross-lingual Information Retrieval State-of-the-Art. In Electronic Journal of Computer Science and Information Technology (eJCSIT), Vol. 2, No. 1, pp. 1–5 (2010).
3. Oard D.W.: A Comparative Study of Query and Document Translation for Cross-language Information Retrieval. In Proceedings of the Third Conference of the Association for Machine Translation in the Americas on Machine Translation and the Information Soup, Springer-Verlag, pp. 472–483 (1998).
4. Makin R., Pandey N., Pingali P., Varma V.: Approximate String Matching Techniques for Effective CLIR. In International Workshop on Fuzzy Logic and Applications, Italy, Springer-Verlag, pp. 430–437 (2007).
5. Pirkola A., Toivonen J., Keskustalo H., Visala K., Jarvelin K.: Fuzzy translation of cross-lingual spelling variants. In Proceedings of SIGIR’03, pp. 345–352 (2003).
6. Bajpai P., Verma P.: Cross Language Information Retrieval: In Indian Language Perspective. In International Journal of Research in Engineering and Technology, Vol. 3, pp. 46–52 (2014).
7. Chen A., Gey F.C.: Combining Query Translation and Document Translation in Cross-Language Retrieval. In Comparative Evaluation of Multilingual Information Access Systems, Springer, Berlin, Heidelberg, pp. 108–121 (2004).
8. Jagarlamudi J., Kumaran A.: Cross-Lingual Information Retrieval System for Indian Languages. In Advances in Multilingual and Multimodal Information Retrieval, pp. 80–87 (2008).
9. Chinnakotla M., Ranadive S., Dhamani O.P., Bhattacharyya P.: Hindi to English and Marathi to English Cross Language Information Retrieval Evaluation. In Advances in Multilingual and Multimodal Information Retrieval, Springer-Verlag, pp. 111–118 (2008).
10. Gupta S. Kumar, Sinha A., Jain M.: Cross Lingual Information Retrieval with SMT and Query Mining. In Advanced Computing: An International Journal (ACIJ), Vol. 2, No. 5, pp. 33–39 (2011).
11. Yu F., Zheng D., Zhao T., Li S., Yu H.: Chinese-English Cross-Lingual Information Retrieval based on Domain Ontology Knowledge. In International Conference on Computational Intelligence and Security, Vol. 2, pp. 1460–1463 (2006).
12. Monti J., Monteleone M.: Natural Language Processing and Big Data: An Ontology-Based Approach for Cross-Lingual Information Retrieval. In International Conference on Social Computing, pp. 725–731 (2013).
13. Chen-Yu S., Tien-Chien L., Shih-Hung W.: Using Wikipedia to Translate OOV Terms on MLIR. In Proceedings of NTCIR-6 Workshop Meeting, Tokyo, Japan, pp. 109–115 (2007).
14. Sorg P., Cimiano P.: Exploiting Wikipedia for Cross-Lingual and Multi-Lingual Information Retrieval. Elsevier, pp. 26–45 (2012).
15. Samantaray S.D.: An Intelligent Concept based Search Engine with Cross Linguility support. In 7th International Conference on Industrial Electronics and Applications, Singapore, pp. 1441–1446 (2012).
16. Xiaoning H., Peidong W., Haoliang Q., Muyun Y., Guohua L., Yong X.: Using Google Translation in Cross-Lingual Information Retrieval. In Proceedings of NTCIR-7 Workshop Meeting, Tokyo, Japan, pp. 159–161 (2008).
17. Zhang J., Sun L., Min J.: Using the Web Corpus to Translate the Queries in Cross-Lingual Information Retrieval. In Proceedings of NLP-KE, pp. 493–498 (2005).
18. Pourmahmoud S., Shamsfard M.: Semantic Cross-Lingual Information Retrieval. In International Symposium on Computer and Information Sciences, pp. 1–4 (2008).
19. Ahmed F., Nurnberger A.: Literature review of interactive cross language information retrieval tools. In The International Arab Journal of Information Technology, Vol. 9, No. 5, pp. 479–486 (2012).
20. Boretz A.: AppTek Launches Hybrid Machine Translation Software. In Speech Tag Online Magazine (2009).
21. Yuan S., Yu S.: A new method for cross-language information retrieval by summing weights of graphs. In Fourth International Conference on Fuzzy Systems and Knowledge Discovery, IEEE Computer Society, pp. 326–330 (2007).
22. Nie J., Simard M., Isabelle P., Durand R.: Cross-Language Information Retrieval Based on Parallel Texts and Automatic Mining of Parallel Texts from the Web. In Proc. of ACM SIGIR, pp. 74–81 (1999).
23. Lu W., Chien L., Lee H.: Anchor Text Mining for Translation of Web Queries: A Transitive Translation Approach. In ACM Transactions on Information Systems, Vol. 22, No. 2, pp. 242–269 (2004).
24. Navigli R.: Word Sense Disambiguation: A Survey. In ACM Computing Surveys, Vol. 41, No. 2 (2009).


Author information

Authors and Affiliations

Department of Computer Science and Engineering, MNIT, Jaipur, India
Vijay Kumar Sharma & Namita Mittal


Corresponding author

Correspondence to Vijay Kumar Sharma.

Editor information

Editors and Affiliations

Department of CSE, Anil Neerukonda Ins. of Tech. & Sci., Vishakapatnam, India
Suresh Chandra Satapathy
Kalyani University, Nadia, West Bengal, India
Jyotsna Kumar Mandal
University of Hyderabad, Hyderabad, Andhra Pradesh, India
Siba K. Udgata
Dept. of ECE, Shri Ramswaroop Mem. Group of Prof. Clg, Lucknow, Uttar Pradesh, India
Vikrant Bhateja

Cite this paper

Sharma, V.K., Mittal, N. (2016). Cross Lingual Information Retrieval (CLIR): Review of Tools, Challenges and Translation Approaches. In: Satapathy, S., Mandal, J., Udgata, S., Bhateja, V. (eds) Information Systems Design and Intelligent Applications. Advances in Intelligent Systems and Computing, vol 433. Springer, New Delhi. https://doi.org/10.1007/978-81-322-2755-7_72


DOI: https://doi.org/10.1007/978-81-322-2755-7_72
Published: 06 February 2016
Publisher Name: Springer, New Delhi
Print ISBN: 978-81-322-2753-3
Online ISBN: 978-81-322-2755-7
eBook Packages: Engineering, Engineering (R0)
