1 Introduction

Word Sense Disambiguation (WSD) is a particularly challenging part of machine translation when the source and target languages differ in many respects. WSD is the process of selecting the correct sense of a word with respect to its context. It remains an open problem in Natural Language Processing (NLP) and is often described as an AI-complete problem. WSD is very helpful in machine translation. The problem was first formulated by Warren Weaver in his famous 1949 memorandum on translation. Machine Translation (MT) refers to the use of computers to automate some or all of the task of translating between human languages. It is a subfield of computational linguistics that investigates the use of computer software to translate text from a source language (SL) to a target language (TL). Many attempts are being made all over the world to develop MT systems for various languages using rule-based, example-based, dictionary-based, corpus-based, and statistical approaches. An MT system can be designed either for two particular languages, in which case it is called a bilingual system, or for more than one pair of languages, in which case it is called a multilingual system. A bilingual system may be either unidirectional, translating from one source language into one target language, or bidirectional. Multilingual systems are usually designed to be bidirectional, but most bilingual systems are unidirectional.

2 Literature Review

In India, one of the first attempts at WSD disambiguated the nouns of the Hindi language using the Hindi WordNet [11]. A corpus is necessary for many purposes, such as training and pattern identification. Corpora are available as different types of resources, such as speech, text, and image [12, 13]. The creation of dictionaries, thesauri, and corpora started in 1980. Corpora provide a vast amount of knowledge and information on various practical language parameters. WSD research has been ongoing for four decades. Word sense disambiguation is relevant when a word has multiple senses. The first experimental work on word sense disambiguation was carried out as part of machine translation activities (Hutchins 1999) [5, 25]. In that work, the machine looks up information and translates a word into the target language. Several Senseval workshops have been held, and they contributed many new ideas for WSD approaches. Word Sense Induction (WSI) [7], as described by Navigli, is an unsupervised technique aimed at automatically identifying the set of senses denoted by a word. This method induces word senses from text by clustering word occurrences, based on the idea that a given word used in a specific sense co-occurs with the same neighboring words [2]. Machine translation can be categorized into rule-based [16], statistical, example-based, and hybrid approaches [1]. Rule-based systems are still rated higher than statistical systems.

Much work has been done, and is ongoing, on machine translation from one natural language to others. Here we give a brief introduction to work on machine translation in India. Matra, an English to Hindi machine translation system (MTS), was started at CDAC Pune in 2004. Shakti, an English to Hindi, Marathi, and Telugu MTS that combines rule-based and statistical approaches, was started at IISc Bangalore and IIIT Hyderabad in 2004. At IIT Kanpur, Anglabharti (an English to Hindi and Tamil MTS) and AnglaHindi (which combines an example-based approach with the AnglaBharti approach) were started in 1991, and Anubharti (a Hindi-English MTS that makes use of syntax) was started in 2004. Anglabharti is a pattern-directed, rule-based system with a context-free-grammar-like structure for English (the source language) that generates a ‘pseudo-target’ applicable to a group of Indian languages (the target languages). Siva, an English–Hindi translation system, was started in 2004 at IISc Bangalore. Anusaaraka is a unique approach to machine translation based on the theory of information dynamics, inspired by the Paninian grammar formalism. MANTRA (MAchiNe assisted TRAnslation tool) translates English text into Hindi in the specific domain of personnel administration, namely gazette notifications, office orders, office memorandums, and circulars. MANTRA uses Lexicalized Tree Adjoining Grammar (LTAG) to represent both the English and the Hindi grammar. Anglabharti uses a pseudo-interlingua approach. Several free translators are available on the Internet, including Google, Babylon, Yahoo Babel Fish, PROMT, and Microsoft. Every machine translation system is based on one or more approaches.

3 Approaches to WSD

Many approaches are used to disambiguate the senses:

  • Supervised WSD: This approach requires supervision. It applies machine-learning techniques to a sense-annotated training data set [4, 8, 18].

  • Semi-supervised WSD: Semi-supervised, or minimally supervised, approaches do not need a fully tagged corpus. They use both labeled and unlabeled data [10].

  • Unsupervised WSD: This approach is based on unlabeled corpora and does not use any manually sense-tagged corpus to provide a sense choice for a word in context. Clustering falls under this approach [23, 24].

  • Other WSD: Example-based (or instance-based), bootstrapping, AI-based, and hybrid WSD (a combination of all or some of the WSD approaches).

Gathering information is necessary for word sense disambiguation, and every machine translation system is based on a different approach. No WSD approach can work without knowledge [6]. Table 1 compares the WSD approaches against some specific criteria [6, 9, 14, 15, 20].

Table 1. Comparison of supervised, semi-supervised and unsupervised WSD approaches
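To make the supervised setting listed above more concrete, the following is a minimal sketch of a sense classifier trained on a sense-tagged data set. It is an illustration only and not a method taken from the cited works: the training sentences, the sense labels ("jewellery", "physics"), and the choice of a Naive Bayes scorer with add-one smoothing are all assumptions made for this example.

```python
from collections import Counter, defaultdict
import math

# Hypothetical sense-tagged training data for the ambiguous word "ring":
# each item pairs a context sentence with a hand-assigned sense label.
TRAINING = [
    ("she wore a gold ring on her finger", "jewellery"),
    ("the wedding ring was made of silver", "jewellery"),
    ("the current flows through the conducting ring", "physics"),
    ("a charged ring produces an electric field", "physics"),
]


def train(examples):
    """Count how often each context word appears with each sense."""
    word_counts = defaultdict(Counter)
    sense_counts = Counter()
    for sentence, sense in examples:
        sense_counts[sense] += 1
        word_counts[sense].update(sentence.split())
    return word_counts, sense_counts


def disambiguate(context, word_counts, sense_counts):
    """Return the sense with the highest Naive Bayes log-score."""
    vocab = {w for counts in word_counts.values() for w in counts}
    total = sum(sense_counts.values())
    best_sense, best_score = None, float("-inf")
    for sense, n in sense_counts.items():
        # Log prior of the sense plus smoothed log likelihood of each context word.
        score = math.log(n / total)
        denom = sum(word_counts[sense].values()) + len(vocab)
        for w in context.split():
            score += math.log((word_counts[sense][w] + 1) / denom)
        if score > best_score:
            best_sense, best_score = sense, score
    return best_sense
```

Calling `disambiguate("find the electric field of the ring", *train(TRAINING))` selects the sense whose training contexts share the most vocabulary with the new context. An unsupervised approach would have to discover such groupings without the labels.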

4 External Knowledge Resources

The basic component of WSD is knowledge. Knowledge resources provide the data that are essential to associate senses with words. They range from corpora of texts, either unlabeled or annotated with word senses, to machine-readable dictionaries, thesauri, glossaries, ontologies, etc. [4]. External knowledge resources are of two types: structured and unstructured [8].

4.1 Structured Resources

Resources in which data are arranged in a definite or fixed order are known as structured resources. For example, a dictionary follows alphabetical order.

  • Machine-readable dictionaries (MRDs), which have become a popular source of knowledge for natural language processing since the 1980s, when the first dictionaries were made available in electronic format: among these, we cite the Collins English Dictionary, the Oxford Advanced Learner’s Dictionary of Current English, the Oxford Dictionary of English [26], and the Longman Dictionary of Contemporary English (LDOCE) [27].

  • Thesauri, which provide information about the relationships between words, such as synonymy, antonymy and, possibly, further relations.

  • Ontologies, which are specifications of conceptualizations of specific domains of interest [28], usually including a taxonomy and a set of semantic relations.

  • Glossaries, which are alphabetical lists of the terms of a particular domain; a glossary usually appears at the end of a book.

4.2 Unstructured Resources

Unstructured resources lack a definite structure of data. Wikipedia and corpora, for example, do not have any definite structure.

  • World Wide Web: It is a huge collection of online data.

    • Wikipedia: It is a huge collection of articles written in many different languages, including Indian languages. Wikipedia mainly distinguishes three classes of articles: Featured, Good, and normal articles [24].

    • Search Engine Results: These are collections of searched data.

  • Corpora: Corpora can be sense-annotated or raw (i.e., unlabeled). Both kinds of resources are used in WSD and are most useful in supervised and unsupervised approaches, respectively [22]. WordNet [29, 30] is a computational lexicon of English based on psycholinguistic principles, created and maintained at Princeton University. It encodes concepts in terms of sets of synonyms (synsets), and its latest version is WordNet 3.1; the English WordNet is thus a collection of English synsets. WordNet can be viewed as a graph whose nodes represent synsets and whose edges represent the semantic relations between synsets [19, 21].

    • Raw Corpus: It is a collection of data in text format. It can be anything, such as articles, stories, and poems.

    • Sense-Annotated Corpora: It is a set of data in which the words are tagged with their senses. This kind of corpus is very helpful for supervised learning.

    • Collocation resources, which register the tendency for words to occur regularly with others.

External knowledge resources are a vital part of machine translation; similarly, identifying the structure of the source and target languages is also important.
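The view of WordNet as a graph, with synsets as nodes and semantic relations as edges, can be sketched with a small data structure. This is only a toy illustration: the class name `SynsetGraph`, the synset identifiers, and the single relation shown are hypothetical and are not actual WordNet entries or the WordNet API.

```python
from collections import defaultdict


class SynsetGraph:
    """Toy lexicon: nodes are synsets, edges are labeled semantic relations."""

    def __init__(self):
        self.lemmas = {}                 # synset id -> set of synonymous words
        self.edges = defaultdict(list)   # synset id -> [(relation, target id)]

    def add_synset(self, synset_id, lemmas):
        self.lemmas[synset_id] = set(lemmas)

    def add_relation(self, source, relation, target):
        self.edges[source].append((relation, target))

    def senses_of(self, word):
        """All synsets that contain the word, i.e. its candidate senses."""
        return sorted(s for s, lemmas in self.lemmas.items() if word in lemmas)

    def related(self, synset_id, relation):
        """Synsets reachable from synset_id over edges with the given label."""
        return [t for r, t in self.edges.get(synset_id, []) if r == relation]


# Hypothetical identifiers and relations, for illustration only.
graph = SynsetGraph()
graph.add_synset("ring#jewellery", ["ring", "band"])
graph.add_synset("ring#sound", ["ring", "ringing"])
graph.add_synset("jewellery#1", ["jewellery", "jewelry"])
graph.add_relation("ring#jewellery", "hypernym", "jewellery#1")
```

Here `graph.senses_of("ring")` lists the candidate senses that a WSD method must choose between, and `graph.related(...)` follows one kind of semantic edge from a chosen synset.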

5 Question Classification

Question classification [3] is the process of identifying the type of a question sentence (such as wh-question or keyword-specific question), and it helps in machine translation. Interrogative sentences always end with a question mark, but not all question sentences follow the same pattern. Classifying question sentences helps to identify their structure, such as wh-question, keyword-specific question, and anomalous-verb-related question. Question classification is a vital part of WSD.

Questions that require reasoning, and thus long explanations, are identified by keywords such as WH-words, explain, discuss, justify, etc.

  • Questions related to quantity.

  • Questions related to time.

  • Questions related to a person or place.

  • Questions regarding different passages, marked by words such as kaun-kaun, kya-kya, and vibhinn.

Sentences that ask a question are called interrogative sentences. Many types of questions appear in exam question papers:

  • Yes/no interrogative question. Example: Did Ram go to the game on Friday night?

  • Alternative question. Example: Did Ram go to Patna or Lucknow on Friday night?

  • Wh-type interrogative question. Example: What is Ram doing?

  • Tag question: Tag questions are questions attached, or tagged, onto the end of a declarative statement. They transform a declarative sentence into an interrogative sentence. Examples of declarative sentences that become questions: Ram lives in the city, doesn’t he? The computer is not working?

  • Subjective questions: fill in the blank. Example: Delhi is the capital of ______

  • Objective questions: Many different types of questions come under this category, such as fill in the blank (with different notations such as —, …., ( ), ___), matching, passage, and true/false. Example: Who amongst the following is considered to be the Father of ‘Local Self-Government’ in India? (a) Lord Dalhousie, (b) Lord Canning, (c) Lord Curzon, (d) Lord Ripon

  • Keyword-specific questions: These questions contain particular keywords. We collected keywords from different resources, such as exam question papers, data available online, and book exercises. Some of the keywords are: explain, discuss, justify, solve, find out, perform. Example: Describe the potential method for amortized analysis of an algorithm with a suitable example.

  • Note-type question. Example: Write short notes on the following:

    • Cyber law in Indian context,

    • Miller–Rabin method.

In modern English, only the anomalous verbs are normally inverted with the subject to form the interrogative [17].
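The question types listed in this section can be approximated by simple surface rules. The sketch below is a hedged illustration and not the classifier of [3]: the function name `classify_question`, the label strings, the word lists, and the order of the rules are all assumptions chosen to cover the examples given above.

```python
import re

WH_WORDS = {"what", "who", "whom", "whose", "which", "when", "where", "why", "how"}
KEYWORDS = {"explain", "discuss", "justify", "solve", "find", "perform", "describe"}
# Anomalous (auxiliary) verbs that invert with the subject in questions.
ANOMALOUS_VERBS = {
    "am", "is", "are", "was", "were", "do", "does", "did", "has", "have", "had",
    "can", "could", "will", "would", "shall", "should", "may", "might", "must",
}
# A tag is assumed to look like ", <auxiliary> <pronoun>?" at the end of the sentence.
TAG_PATTERN = re.compile(r",\s*([a-z'’]+)\s+[a-z]+\s*\?$", re.IGNORECASE)


def is_tag_question(text):
    match = TAG_PATTERN.search(text)
    if not match:
        return False
    aux = match.group(1).lower().replace("’", "'")
    if aux.endswith("n't"):
        aux = aux[:-3]  # "doesn't" -> "does"; irregular forms like "won't" are not handled
    return aux in ANOMALOUS_VERBS


def classify_question(sentence):
    """Assign one of the (assumed) type labels using surface cues only."""
    text = sentence.strip()
    words = re.findall(r"[a-z']+", text.lower())
    if not words:
        return "unknown"
    if re.search(r"\([a-d]\)", text):                  # options such as (a), (b)
        return "objective"
    if re.search(r"_{2,}|\.{3,}|\(\s*\)", text):       # blank notations
        return "fill-in-the-blank"
    if text.lower().startswith("write short notes"):
        return "note"
    if is_tag_question(text):
        return "tag"
    first = words[0]
    if first in KEYWORDS:
        return "keyword-specific"
    if first in WH_WORDS:
        return "wh"
    if first in ANOMALOUS_VERBS:
        return "alternative" if " or " in text.lower() else "yes/no"
    return "declarative-question" if text.endswith("?") else "unknown"
```

Rules of this kind only look at the first word and a few patterns, so they will misclassify questions whose cue appears later in the sentence; they are meant to show how the categories differ in form, not to be a complete classifier.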

6 Discussion on Question Sentence Ambiguity

This paper analyzes many types of question sentences in the English language, which are discussed in the Question Classification section.

  1. Question sentences should be in text format: Text format provides uniformity for machine translation.

  2. The subject or area of the question sentences should be known: This is very helpful in disambiguation, because the meaning of a word varies from subject to subject and from context to context. For example, in general language the word “ring” simply means a finger ring (anguthi), but in physics it refers to a circular loop (chhalla). The part of speech also changes the meaning of a word. The Shabdkosh English to Hindi dictionary lists the meanings of the word “ring”.

  3. Declare non-translatable items (formulas, abbreviations): Non-translatable items, such as formulas and abbreviations, help in understanding the meaning of the context.

  4. Declare multiword expressions: A multiword expression (ME), or idiom, is a collection of two or more words that also changes the meaning of the context. Multiword expressions are a well-known difficulty in natural language processing (NLP). For example, “kick the bucket” means “to die (mar jaanaa)” as an ME, but its word-for-word translation is “Balti ko laat maarnaa”.

  5. Identify the question type (WH question or keyword-specific): Classifying question sentences helps to identify their structure, such as wh-question, keyword-specific question, and anomalous-verb-related question. Question classification is a vital part of WSD.
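Points 2 to 4 above can be combined into a small pre-processing step that runs before translation. The following sketch is an assumption-laden illustration: the dictionaries hold only the “ring” and “kick the bucket” examples from the text, the function name `annotate` is hypothetical, and the rule that all-capital abbreviations and tokens containing “=” are non-translatable is a simplification introduced here.

```python
import re

# Hypothetical toy resources; a real system would load these from a
# bilingual dictionary and a multiword-expression list.
DOMAIN_SENSES = {
    "ring": {"general": "anguthi", "physics": "chhalla"},
}
MULTIWORD_EXPRESSIONS = {
    "kick the bucket": "mar jaanaa",
}
# Assumed rule: abbreviations in capitals and simple formulas stay untranslated.
NON_TRANSLATABLE = re.compile(r"\b[A-Z]{2,}\b|\S+=\S+")


def annotate(question, subject="general"):
    """Return (kind, source, target) triples for parts needing special handling."""
    notes = []
    lowered = question.lower()
    # Point 3: mark items that must be copied unchanged.
    for item in NON_TRANSLATABLE.findall(question):
        notes.append(("keep", item, item))
    # Point 4: translate known multiword expressions as a unit.
    for phrase, meaning in MULTIWORD_EXPRESSIONS.items():
        if phrase in lowered:
            notes.append(("mwe", phrase, meaning))
    # Point 2: choose the sense that matches the declared subject.
    for word in re.findall(r"[a-z]+", lowered):
        senses = DOMAIN_SENSES.get(word)
        if senses:
            notes.append(("sense", word, senses.get(subject, senses["general"])))
    return notes
```

For instance, `annotate("Find the electric field of a charged ring.", "physics")` selects “chhalla” for “ring”, while the same call without a subject falls back to the general sense “anguthi”.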

7 Conclusion

Disambiguating words is easier in the machine translation of simple sentences, because the system can identify the sentence pattern; exam question sentences, however, do not follow a single pattern, so their translation is more difficult. This paper reviews many types of question sentences, tries to identify their patterns, and discusses how to disambiguate their meaning. Question translation is very helpful for students who are not comfortable with the English language.