Ontology-based Tamil–English cross-lingual information retrieval system

Thenmozhi, D; Aravindan, Chandrabose

doi:10.1007/s12046-018-0942-7

Published: 14 August 2018

Volume 43, article number 157 (2018)

Sādhanā

Abstract

Cross-lingual information retrieval (CLIR) systems enable users to query for information in one language and retrieve relevant documents in another language. In general, CLIR systems translate the query from the source language into the target language and retrieve documents in the target language based on the keywords present in the translated query. However, ambiguity in the source and translated queries reduces the performance of the system. An ontology can be used to address this problem. Current approaches to ontology-based CLIR use manually constructed multilingual ontologies, which are expensive to build. Moreover, many methods exist to automatically construct an ontology for any domain in English, but not in other languages such as Tamil. To address these issues, we propose a methodology for a Tamil–English CLIR system that translates the Tamil query into English and retrieves pages in English. Our approach uses a word sense disambiguation module to resolve the ambiguity in the Tamil query. An automatically constructed ontology in English is used to address the ambiguity of the English query. We have developed a morphological analyser for the Tamil language, a Tamil–English bilingual dictionary and a named entity database to translate a Tamil query into English. The translated query is reformulated using the ontology, and the reformulated queries are given to a search engine to retrieve English documents from the Internet. We have evaluated our methodology for the agriculture domain, and the evaluation results show that our approach outperforms other approaches in terms of precision.

1 Introduction

The Internet provides a rich source of information and is growing at an enormous rate. English is still the dominant language on the Internet and accounts for most of the information. However, world Internet usage statistics reveal that the number of non-English Internet users is steadily increasing, and not all of them are able to formulate queries in English. The number of Tamil users on the Internet, such as farmers and people working in small-scale industries who are not able to express their needs in English, is also growing. They generally search for information using Tamil search engines, but the content provided by these search engines is not adequate. Making the huge repository of English information on the Web accessible to non-English Internet users has become an essential challenge in recent times. When non-English users turn to the existing search engines, they often formulate improper English queries.

Cross-lingual information retrieval (CLIR) systems aim to solve the aforementioned problem by allowing users to express their information need in their native language, while the CLIR system takes care of matching it with the relevant documents in the target language. In general, CLIR systems translate the query from the source language into the target language and retrieve documents in the target language. When the translated query has multiple meanings, not all the retrieved documents may be relevant to the user. For example, the user query “payinkaal” is translated to “tiller”, which has multiple meanings, namely a part of a boat, a piece of agricultural equipment and the name of a person, so not all the retrieved documents are relevant to the user. Hence, it is necessary to include semantics in the search process so that only relevant pages are retrieved. The search process can also be improved by refining queries into more specific ones. It is difficult for Tamil users who are not able to express their needs in English to formulate such refined queries. We propose an ontology-based CLIR system that suggests possible refined queries and retrieves documents for all of them.

Many research works have been reported on handling semantics in information retrieval (IR) using ontology [1,2,3,4,5]. These works accept queries in formal languages like SPARQL, and it is difficult for users such as farmers to pose such queries. CLIR systems [6,7,8,9,10] allow non-English users to pose natural language queries in their own languages but fail to handle semantics. A few research works [11,12,13,14,15] have been reported on ontology-based CLIR systems that deal with semantics using bilingual ontologies. However, very few approaches have been developed to build multilingual ontologies [16,17,18] automatically from available resources like text documents, databases, etc. Also, many methods exist to automatically construct an ontology for any domain in English but not in other languages, and no such methodologies exist for learning a Tamil ontology. Hence, we use a word sense disambiguation (WSD) module to resolve the ambiguity in Tamil queries during translation.

We propose a CLIR system in the agricultural domain for Tamil farmers. The system retrieves relevant documents from an English corpus in response to a query expressed in Tamil. Here, the query given in Tamil is translated syntactically and semantically into English for the IR process. The ambiguity of the translated query can be resolved by reformulating the query using an ontology. However, the ambiguity still persists if we use a general-purpose ontology like WordNet. For example, when we use WordNet, the query “Tiller” is reformulated as “Tiller Shoot”, “Tiller Farmer”, “Tiller Lever”, “Tiller is part of Rudder”, “Harrow Tiller” and “Tiller Farm Machine”. Among these queries, “Tiller Shoot”, “Tiller Lever” and “Tiller is part of Rudder” will not retrieve any pages related to agricultural equipment. Hence, it is important to use a domain-specific ontology to reformulate the queries. We use an agriculture ontology that has been learnt automatically from text documents [19].
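As a rough illustration of the reformulation example above, the following Python sketch builds refined queries from subject-relation-object triples. The triples, the relation names `is_a` and `part_of`, and the `reformulate` function are assumptions made for this sketch; they are not the contents of WordNet or of the agriculture ontology learnt in [19].

```python
# Toy illustration only: these hand-written triples are assumptions chosen to
# reproduce the "Tiller" example; they are not taken from WordNet or from [19].
GENERAL_TRIPLES = [
    ("tiller", "is_a", "shoot"),
    ("tiller", "is_a", "farmer"),
    ("tiller", "is_a", "lever"),
    ("tiller", "part_of", "rudder"),
    ("harrow", "is_a", "tiller"),
    ("tiller", "is_a", "farm machine"),
]

# Hypothetical domain-specific ontology: it keeps only the three senses that
# the text does not rule out for the agriculture domain.
DOMAIN_TRIPLES = [
    ("tiller", "is_a", "farmer"),
    ("harrow", "is_a", "tiller"),
    ("tiller", "is_a", "farm machine"),
]


def reformulate(term, triples):
    """Return refined query strings built from triples that mention `term`."""
    term = term.lower()
    queries = []
    for subject, relation, obj in triples:
        if subject == term and relation == "is_a":
            # The term has a broader concept: "<term> <broader concept>".
            queries.append(f"{term} {obj}".title())
        elif subject == term and relation == "part_of":
            # The term is a part of something else.
            queries.append(f"{term.title()} is part of {obj.title()}")
        elif obj == term and relation == "is_a":
            # A narrower concept of the term: "<narrower concept> <term>".
            queries.append(f"{subject} {term}".title())
    return queries


general_queries = reformulate("Tiller", GENERAL_TRIPLES)
domain_queries = reformulate("Tiller", DOMAIN_TRIPLES)
```

With the general-purpose triples the sketch yields all six queries listed above, while the hypothetical domain triples yield only “Tiller Farmer”, “Harrow Tiller” and “Tiller Farm Machine”.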

Section 2 briefly describes various works related to ontology-based retrieval and CLIR systems. Section 3 elaborates our framework for a cross-lingual semantic retrieval system. Section 4 provides the details of the experiments conducted to analyse the performance of the proposed ontology-based CLIR system. Section 5 presents conclusions and future directions for this research.

2 Related work

IR is the process of extracting relevant information for a given query. The huge increase in the amount of information on the Internet and the difficulty of reaching such information have created a strong demand for tools and techniques that can handle data semantically [2]. Ontology-based retrieval is one such solution for the semantic web. However, many ontology-based retrieval systems do not deal with cross-language issues. Several approaches address cross-language issues but fail to deal with ambiguity problems. A few research methodologies deal with both cross-language and semantic issues but leave many issues open. This section reviews existing research works and open issues related to ontology-based retrieval, CLIR and ontology-based CLIR.

2.1 Ontology-based retrieval

Bhogal et al [20] and Jain and Singh [21] presented comprehensive surveys on query expansion using ontology for IR. Zimmermann et al [1] extended the RDF framework and the SPARQL language with annotations carrying additional information for representing, reasoning over and querying semantic web data. Kara et al [2] proposed a methodology for semantic retrieval based on a domain ontology. They showed that the methodology outperforms traditional keyword-based methods and query expansion methods. However, the queries are extended based only on the class hierarchy information of the ontology and not on its semantic relationships. Also, the method retrieves information only from a set of documents that are semantically indexed.

Mustafa et al [3] proposed an approach to ontology-based semantic IR. The query in triple form is matched with a triple in the ontology and is reformulated with the ontology terms for retrieval. They evaluated the approach on 300 manually collected documents in the research thesis domain. The approach does not handle incomplete and imprecise triples in the queries. The approach could also be extended to cross-lingual applications. Hogan et al [4] implemented a semantic web search engine that consists of the components of an IR system, such as crawling, data enhancing, indexing and a user interface for search, browsing and retrieval of information. This search engine operates on the RDF framework of the ontology.

Fernandez et al [5] introduced an ontology-based approach for semantically enhanced IR. In this approach, the query is accepted in the formal language SPARQL, lists of semantic entities are returned and documents that are indexed with these semantic entities are retrieved. This IR system requires the user to be familiar with formal languages like SPARQL. It is desirable to have a common IR system that can be used by any user, including those without formal language knowledge. Sy et al [22] developed a user-centred, ontology-based IR system in which the given query is reformulated by either adding concepts to or removing concepts from the query. This is done by having the user graphically mark documents as interesting or not interesting. This IR system is semi-automatic because the query refinement relies on an explicit specification of interest.

2.2 CLIR

Sujatha and Dhavachelvan [23] presented a survey on CLIR and multilingual information retrieval (MLIR) systems in Indian and foreign languages. Sorg and Cimiano [6] developed a CLIR system using the cross-language links of Wikipedia. The user can pose a query in English, French or German and retrieve documents from an English corpus or a German corpus. They developed a model that maps the bag of words representing a document to a bag of concepts using Wikipedia. They extended this approach [24] by analysing different strategies for exploiting the Wikipedia structure to define the concept space. Evaluations have been performed for both CLIR and MLIR systems for English, French, German and Spanish. However, the ambiguity of the query in the source and translated languages is not resolved in these approaches.

Several organizations in India are working on CLIR systems for Indian languages [25]. Bandyopadhyay et al [8] developed a Bengali, Hindi and Telugu–English CLIR system as part of the ad-hoc bilingual task. Chinnakotla et al [9] developed Hindi–English and Marathi–English CLIR systems. Pingali and Varma [10] developed a Hindi and Telugu–English CLIR system. Mandal et al [26] developed a CLIR system for the two most widely spoken Indian languages, Hindi and Bengali. All these works use bilingual dictionaries. Jagarlamudi and Kumaran [27] also worked on a Hindi–English cross-lingual system, which used a word alignment table learnt by a statistical machine translation (MT) system trained on aligned parallel sentences. All these research methodologies have been evaluated on the English corpus of LA Times 2002. Rao and Devi [28] developed a Tamil–English CLIR track for news articles taken from “The Telegraph”, an English news magazine in India. All these approaches use a word-by-word translation method in the news domain.

Sivakumar et al [7] developed a Hindi–English CLIR system that identifies the equivalent English document for a given Hindi document based on the cosine similarity measure. The features of the documents used to compute the similarity are reduced using latent semantic indexing. This approach requires a parallel corpus that contains documents in both languages. The system works well for document queries but not for user-generated queries.

Thenmozhi and Aravindan [29] developed a CLIR system for Tamil farmers using an MT approach. This approach translates the Tamil query into English using a morphological analyser, a bilingual dictionary and a named entity (NE) recognizer. WSD is incorporated to avoid ambiguity in the Tamil-to-English translation. However, the methodology does not handle the ambiguity in the translated query.

2.3 Ontology-based CLIR

Yu et al [11] developed a Chinese–English CLIR system based on a domain ontology. Abusalah et al [13] developed an Arabic–English CLIR system based on an ontology for travel and tourism. Yahya et al [12] developed English–Malay and Malay–English CLIR systems based on a Quran ontology. However, these methodologies require ontologies in both the source and target languages. Constructing a multilingual ontology is a time-consuming task, and the ontologies are built manually in these research works. Methodologies for constructing such ontologies automatically from existing resources like text documents, databases, etc. are not available. Also, the approaches do not consider the semantic relationships of the ontology to improve the retrieval performance.

Monti et al [14] proposed a methodology for ontology-based CLIR, which has been evaluated on Italian–English retrieval in the archaeological domain. This approach uses an ontology in the source language to refine the query, which is then translated into the target language. However, this approach does not resolve ambiguity in the translated query, which may occur frequently, especially in English. Pourmahmoud and Shamsfard [15] developed a Persian–English CLIR system using ontology. A bilingual ontology and a dictionary are used to translate the query into the target language. The ontology is used to disambiguate the source query when it has multiple meanings in the target language, and a probabilistic approach is used to disambiguate the target query. However, this retrieval system does not support suggesting more refined queries to the user.

Considering the issues discussed in this section, we propose a framework for a CLIR system that addresses the ambiguity in both the source and target languages to improve the retrieval performance.

3 Proposed methodology

The proposed Tamil–English CLIR system translates the given Tamil query to an English query and also suggests multiple reformulated queries for searching and retrieval using ontology. This process is depicted in figure 1.

3.1 Morphological analysis

The words present in the given query are transformed to their root form using a morphological analyser that applies several rules for handling plurals, case suffixes, oblique forms, etc. The morphological analyser identifies the root form of the word and its suffixes. In Tamil, “kaL” is the major plural suffix. It has variants like “tkaL”, “NGkaL”, “RkaL” and “KkaL”. After removing the suffix “kaL”, the morphological analyser modifies the root word to get its base form by replacing “NG” with “m”, “R” with “l”, etc. Postpositions, namely accusative, dative, genitive, locative and plain postpositions, come after case suffixes like ai, in, il, itam, etc. The morphological analyser removes these postpositions along with the case suffixes to bring the word to its root form. The different types of postpositions are listed as follows:

Accusative postpositions: vita, pola, kontu, nokki, patti, kuRittu, cuRRi, vittu, thavira, munnittu, venti, otti, poRuttu, poRuttavari
Dative postpositions: aaka, enRu, mun, pin, ul, itaiye, natuve, mattiyil, veliye, mel, kizh, etiril, pakkattil, arukil, patil, maaraaka, neRaaka, uriya, ulla, takunta
Genitive postpositions: mitu, mel, valiyaaka, mUlamaaka, vazhiyaaka, pEril, poRuttu
Locative postpositions: irunthu (occurs only after the case markers itam and il)
Plain postpositions: utan, kUta, utaiya, vacam, itam, varai, aaka, toRum, aara
Oblique suffix: ththu.

Table 1 shows some compound words in Tamil and their root words along with their suffixes. The analyser identifies multiple suffixes to convert a word to its root form. For example, the root word “maram” is obtained from the word “marangkaLinvazhiyaaka” (through trees) by removing multiple suffixes.
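The suffix-stripping steps described in this section can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the analyser developed in this work: the suffix lists are a small subset of those given above, the stripping order (postposition, then case suffix, then plural) is inferred from the description, and the lowercase “ng” in the worked example is assumed to correspond to the “NG” of the replacement rule. The names `strip_longest` and `analyse` are hypothetical.

```python
# Subset of the suffix inventories listed above (kept short for illustration).
POSTPOSITIONS = ["vazhiyaaka", "valiyaaka", "mUlamaaka", "irunthu", "utan", "mitu", "mel"]
CASE_SUFFIXES = ["itam", "ai", "in", "il"]
PLURAL_SUFFIX = "kaL"
# Stem-final repairs applied after removing "kaL": "NG" -> "m", "R" -> "l".
# Assumption: the lowercase "ng" in the example word is treated like "NG".
STEM_REPAIRS = [("NG", "m"), ("ng", "m"), ("R", "l")]


def strip_longest(word, suffixes):
    """Remove the longest matching suffix; return (stem, removed suffix or None)."""
    for suffix in sorted(suffixes, key=len, reverse=True):
        # Never strip a suffix that would consume the whole word.
        if word.endswith(suffix) and len(word) > len(suffix):
            return word[: -len(suffix)], suffix
    return word, None


def analyse(word):
    """Return (root, removed suffixes), stripping the outermost suffix first."""
    removed = []
    for suffixes in (POSTPOSITIONS, CASE_SUFFIXES, [PLURAL_SUFFIX]):
        word, suffix = strip_longest(word, suffixes)
        if suffix:
            removed.append(suffix)
    # Repair the stem ending only when the plural suffix was actually removed.
    if removed and removed[-1] == PLURAL_SUFFIX:
        for ending, replacement in STEM_REPAIRS:
            if word.endswith(ending):
                word = word[: -len(ending)] + replacement
                break
    return word, removed


root, suffixes = analyse("marangkaLinvazhiyaaka")
# root == "maram", suffixes == ["vazhiyaaka", "in", "kaL"]
```

A purely rule-based stripper like this one can over-strip root words that happen to end in a suffix string such as “in”; a practical analyser would presumably check candidate roots against a lexicon, which this sketch omits.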

Table 1 Examples for morphological analysis.
