1 Introduction

Information Retrieval (IR) is a reasoning process used for storing, searching, and retrieving the information in a document collection that is relevant to a user's needs. IR tasks are not restricted to the monolingual setting; they also arise in multilingual settings. In classical IR, documents and sentences in other languages are treated as unwanted “noise” [1, 2]. Cross-Language Information Retrieval (CLIR) deals with the situation where the user query and the relevant documents are in different languages, so that the language barrier becomes a serious obstacle to worldwide communication. A CLIR approach overcomes this barrier by applying a translation mechanism followed by monolingual IR. There are two types of translation: query translation and document translation. Query translation approaches are generally preferred because document translation consumes much more computation time and storage space [3].

Many workshops and forums have been established to boost research in CLIR. The Cross Language Evaluation Forum (CLEF) has dealt mainly with European languages since 2000. The NII Test Collection for IR Systems (NTCIR) workshop aims to advance research in Japanese and other Asian languages. The first evaluation exercise of the Forum for Information Retrieval Evaluation (FIRE) was completed in 2008 with three Indian languages: Hindi, Bengali, and Marathi. The CLIA consortium comprises 11 institutes in India working on the project “Development of Cross Lingual Information Access system (CLIA)”, funded by the Government of India. The objective of this project is to create a portal where a user query is answered in three ways: in the query language, in Hindi, and in English [2].

The literature survey is presented in Sect. 2. Issues and challenges are discussed in Sect. 3. Various CLIR approaches are described in Sect. 4. Section 5 presents a comparative analysis and discussion of CLIR translation techniques and retrieval strategies, together with a brief analysis of CLIR tools.

2 Literature Survey

Makin et al. concluded that a bilingual dictionary combined with cognate matching and transliteration achieves better performance, whereas parallel corpora and Machine Translation (MT) approaches did not perform well [4]. Pirkola et al. experimented with English and Spanish and extracted similar terms to develop transliteration rules [5]. Bajpai et al. developed a prototype model in which the query is translated using any one of three techniques: MT, dictionary-based translation, or corpus-based translation. Word Sense Disambiguation (WSD) together with Boolean, vector space, and probabilistic models was used for IR [6]. Chen et al. experimented with statistical machine translation (SMT) and parallel corpora for translation [7]. Jagarlamudi et al. exploited an SMT system and a transliteration technique for query translation; a language modeling algorithm was used to retrieve the relevant documents [8]. Chinnakotla et al. used a bilingual dictionary and a rule-based transliteration approach for query translation; term-term co-occurrence statistics were used for disambiguation [9]. Gupta et al. used SMT and transliteration; the query-wise results were mined and a new list of queries was created. The open-source Terrier search engine was used for information retrieval [10]. Yu et al. experimented with a method based on domain ontology knowledge obtained from user queries and target documents [11]. Monti et al. developed an ontology-based CLIR system. First, a linguistic pre-processing step was applied to the source-language query; then transformation routines (domain concept mapping and RDF graph matching) and translation routines (bilingual dictionary mapping and FSA/FST development) were applied [12].

Chen-Yu et al. used a dictionary-based approach, with Wikipedia serving as a live dictionary for Out-Of-Vocabulary (OOV) terms; the standard Okapi BM25 algorithm was then used for retrieval [13]. Sorg et al. used Wikipedia as a knowledge resource for CLIR. Both queries and documents are converted to an interlingual concept space consisting of either Wikipedia articles or Wikipedia categories. A bag-of-concepts model was built, and various vector-based retrieval models and term weighting strategies were then evaluated in conjunction with Cross-Lingual Explicit Semantic Analysis (CL-ESA) [14]. Samantaray et al. discussed concept-based CLIR for the agriculture domain. They used Latent Semantic Analysis (LSA), Explicit Semantic Analysis (ESA), Universal Networking Language (UNL), and WordNet for CLIR and WSD [15]. Xiaoninge et al. used the Google translator because of its high performance on named entity translation. Chinese character bigrams were used as the indexing unit, a KL-divergence model was used for retrieval, and pseudo feedback was used to improve average precision [16]. Zhang et al. proposed a search-result-based approach in which the appropriate translation is selected using the inverse translation frequency (ITF) method, which reduces the impact of noisy symbols [17]. Pourmahmoud et al. exploited a phrase translation approach with a bilingual dictionary, and query expansion techniques were used to retrieve documents [18].
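Several of the surveyed systems share a common pattern: look up each query term in a bilingual dictionary, pass OOV terms on to a fallback such as transliteration, and use term co-occurrence statistics to choose among competing translations. The following Python fragment is a minimal sketch of that pattern and not the implementation of any cited work. The dictionary entries, the co-occurrence counts, and the function names are hypothetical, and OOV terms are simply kept unchanged as a stand-in for a real transliteration step.

```python
from itertools import combinations, product
from typing import Dict, FrozenSet, List, Sequence

# Hypothetical toy German-to-English dictionary; "bank" is ambiguous.
BILINGUAL_DICT: Dict[str, List[str]] = {
    "bank": ["bank", "bench"],
    "geld": ["money"],
}

# Hypothetical target-language co-occurrence counts (unordered word pairs).
COOCCURRENCE: Dict[FrozenSet[str], int] = {
    frozenset({"bank", "money"}): 12,
    frozenset({"bench", "money"}): 1,
}


def candidates(term: str) -> List[str]:
    """Return dictionary translations; OOV terms are kept unchanged
    (a placeholder for transliteration)."""
    return BILINGUAL_DICT.get(term.lower(), [term])


def cohesion(translation: Sequence[str]) -> int:
    """Sum the co-occurrence counts over all pairs of chosen translations."""
    return sum(
        COOCCURRENCE.get(frozenset(pair), 0)
        for pair in combinations(translation, 2)
    )


def translate_query(query: str) -> List[str]:
    """Pick the combination of candidate translations with the highest cohesion.

    The search is exhaustive, so it is only practical for short queries.
    """
    options = [candidates(term) for term in query.split()]
    return list(max(product(*options), key=cohesion))


print(translate_query("bank geld berlin"))
```

In this toy example the pair ("bank", "money") co-occurs more often than ("bench", "money"), so the ambiguous source term is resolved to "bank", while the OOV term "berlin" passes through untouched. The translated terms would then be handed to a monolingual retrieval model.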

3 Issues and Challenges

Various CLIR issues and challenges are summarized in Table 1.

Table 1 List of CLIR issues and challenges

4 CLIR Approaches

Various CLIR approaches are described in Table 2.

Table 2 List of CLIR approaches with description

5 Comparative Analysis and Discussion

A comparative analysis of CLIR approaches is presented in Table 3.

Table 3 Comparative analysis of CLIR approaches

Mean Average Precision (MAP) is the evaluation measure used in this comparison. The MAP for a set of queries is the mean of the average precision scores of the individual queries, where precision is the fraction of retrieved documents that are relevant to the query. The Google translator is effective because of its bias towards named entities, and a MAP of 0.3889 was achieved for English-Chinese [16]. Machine translation combined with parallel corpora achieves a better MAP, 0.4694 for English-German [7], but it suffers from a lack of resources, because parallel corpora of sufficient size are not available for all languages. Most researchers use a bilingual dictionary because it is available for all languages and incurs only a nominal computation cost. A bilingual dictionary with cohesion translation and query expansion achieves a MAP of 0.4337 for Persian-English [18]. Wikipedia is used to identify OOV terms, but Wikipedia editions with sufficient data are available for only a limited number of languages. CLIR with Wikipedia achieves a MAP of 0.46 [14].
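To make the MAP definition concrete, the following Python fragment is a minimal sketch of how the measure is commonly computed. It assumes binary relevance judgments and averages precision over the ranks at which relevant documents appear, divided by the total number of relevant documents for the query. The function names and the toy document identifiers are illustrative and are not taken from the cited evaluations.

```python
from typing import Dict, List, Set


def average_precision(ranked: List[str], relevant: Set[str]) -> float:
    """Average precision of one ranked result list.

    Precision is taken at every rank where a relevant document occurs,
    and the sum is divided by the total number of relevant documents.
    """
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranked, start=1):
        if doc_id in relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / len(relevant)


def mean_average_precision(
    runs: Dict[str, List[str]], qrels: Dict[str, Set[str]]
) -> float:
    """Mean of the per-query average precision over all queries in `runs`."""
    if not runs:
        return 0.0
    total = sum(
        average_precision(ranked, qrels.get(query_id, set()))
        for query_id, ranked in runs.items()
    )
    return total / len(runs)


# Toy example with two queries (hypothetical document identifiers).
runs = {"q1": ["d1", "d2", "d3"], "q2": ["d4", "d5", "d6"]}
qrels = {"q1": {"d1", "d3"}, "q2": {"d5"}}
print(round(mean_average_precision(runs, qrels), 4))
```

For the toy data, the average precision is (1/1 + 2/3) / 2 ≈ 0.8333 for q1 and (1/2) / 1 = 0.5 for q2, so the MAP is about 0.6667.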

Ontology, WordNet, UNL, and co-occurrence translation are used to resolve term homonymy and polysemy. Dictionary coverage and quality, phrase translation, homonymy, polysemy, and lack of resources are the major challenges for CLIR. Many comprehensive tools, such as MT tools and CLIR tools, have been developed to overcome the language barrier [19]. A brief study of CLIR tools is summarized in Table 4. All of these tools use a bilingual dictionary because of its nominal computation time. MIRACLE, MULTI LEX EXPLORER, and MULTI SEARCHER attempt to remove a common problem of user-assisted query translation. Automatic query translation suffers from homonymy and polysemy.

Table 4 Comparative analysis of CLIR tools

6 Conclusion

CLIR enables searching documents across the great diversity of the world's languages. It removes the linguistic gap and allows a user to submit a query in a language different from that of the target documents. A CLIR method includes a translation mechanism followed by monolingual retrieval. The analysis indicates that query translation is a more efficient choice than document translation. In this paper, various CLIR issues and challenges and query translation approaches with disambiguation are discussed. A comparative analysis of CLIR approaches is presented in Table 3. A CLIR approach with a bilingual dictionary, cohesion translation, query expansion, and language modeling achieves a good MAP of 0.4337. Another CLIR approach with Wikipedia, bag of concepts, and Cross-Lingual Explicit Semantic Analysis achieves a better MAP of 0.46. The CLIR approach based on MT with parallel corpora achieves a MAP of 0.4694. A brief analysis of CLIR tools is presented in Table 4. Dictionary coverage and quality, unavailability of parallel corpora, phrase translation, homonymy, and polysemy are identified as the major issues.