Introduction

In every natural language there are many words that carry different senses in different contexts of use. Such words are recognized as ambiguous words, and finding the exact contextual sense of an ambiguous word in a piece of text is known as Word Sense Disambiguation (WSD) [1,2,3,4,5]. For example, the English words head, run, round, and manage have multiple senses depending on their contexts of use in texts. Finding the exact senses of words in a given context is the main challenge of WSD. To date, three major methodologies have been used to deal with this problem, namely the Supervised Method, the Knowledge-based Method, and the Unsupervised Method.

In the Supervised Method [4, 6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24], sense disambiguation of words is performed with the help of previously created learning data sets. These learning sets contain related sentences for a particular sense of an ambiguous word. The supervised method classifies new test sentences based on probability distributions calculated from these learning sets.

The Knowledge-based Method [25,26,27,28,29,30,31,32,33,34,35] depends on external knowledge resources such as online semantic dictionaries, thesauri, and machine-readable dictionaries to obtain sense definitions of the lexical components.

In the Unsupervised Method [34, 36, 37], sense disambiguation happens in two phases. First, sentences are clustered using a clustering algorithm, and the clusters are tagged with relevant senses with the help of a linguistic expert. Next, a distance-based similarity measure is used to find the closeness of a test instance to the sense-tagged clusters; the cluster at minimum distance assigns its sense to the new test instance.

The present work is based on the Naïve Bayes probabilistic model, which is used as a baseline method for sense classification. This baseline method produces 81% accuracy when tested on 900 instances of 19 ambiguous words. Next, two extensions are adopted to increase the level of accuracy: (a) incorporation of lemmatization into the system, which raises accuracy to 84%, and (b) application of Bootstrapping, which produces 83% accuracy.

The organization of the paper is as follows: Sect. 2 presents a brief survey of this research area; the Proposed Approach is described in Sect. 3; Results and Discussion are presented in Sect. 4; in Sect. 5, Extensions on the Baseline Methodology are described in detail. The paper concludes with future scope in Sect. 6.

Survey

In the case of the Supervised Method, manually created learning sets are used to train the model. The learning sets consist of example sentences relating to a particular sense of a word. Test instances are classified based on probability distributions calculated from the learning sets. Some commonly used approaches deployed in this method are discussed below:

Decision List

In the Decision List [35, 36] based approach, first a set of rules is formed for a target word. Next, a few example sentences are fed to the system to calculate decision parameters such as feature values and sense scores. When a test instance comes for classification, these parameters categorize it into a particular class.

Decision Tree

The Decision Tree [38,39,40] based approach frames the rules in the form of a tree structure, where the non-leaf nodes denote tests and the branches represent test results. The leaf nodes of the tree carry the different senses. If a set of rules guides an execution to a leaf node, then that sense is assigned to the word as the derived sense.

Naïve Bayes

The Naïve Bayes [41,42,43] probabilistic model classifies instances based on a few parameters, which determine the probability distribution of a particular instance with respect to the different classes. The class for which the probability value is maximum for a test instance is assigned to that instance. The formula for Naïve Bayes classification is as follows:

$$ \hat{S} = \mathop{\mathrm{argmax}}\limits_{S_i \in \mathrm{Senses}(w)} P(S_i \mid f_1, \ldots, f_m) = \mathop{\mathrm{argmax}}\limits_{S_i \in \mathrm{Senses}(w)} \frac{P(f_1, \ldots, f_m \mid S_i)\,P(S_i)}{P(f_1, \ldots, f_m)} $$

where Si ranges over the different senses of the ambiguous word w, the parameters f1, …, fm represent the features of w in its context, and m is the number of features.

Neural Network

In the Neural Network based approach [44,45,46,47], artificial neurons act as the data-processing units and categorize the features into a number of non-overlapping sets. While designing a network, the artificial neurons are arranged in different layers, and data is passed through these layers to reach the destination layer. In such a network, words are treated as nodes and relations among words are considered as links. As data proceeds through the network, only those links are activated whose two end-point words are semantically related.

Exemplar-Based Method

In the Exemplar-Based [48] method, examples are considered as points distributed over a feature space. When a new data point is to be categorized, a distance-based similarity measure is used to find the closeness of the point with respect to all the classes. The class at minimum distance represents the sense of the test instance.

Support Vector Machine

In the Support Vector Machine based [49,50,51] method, examples are treated as polarized points, either positive or negative. The goal of the methodology is to separate these positive and negative points with a hyperplane. A test instance is classified by determining which side of the hyperplane the point falls on.

Ensemble Methods

In the Ensemble Method based [52] approach, classifiers are combined after every execution for a better classification result. This combination can be performed according to different schemes, such as Majority Voting, Probability Mixture, Rank-Based Combination, and AdaBoost [53, 54].

Proposed Approach

The proposed approach adopts the Naïve Bayes (NB) probabilistic model as its baseline strategy. This model classifies instances based on a few predefined parameters.

Module 1: Training Module

Development of the training model depends on the following parameters:

  a. |V|, which represents the size of the vocabulary,

  b. P(ci), the prior probability of each class,

  c. ni, the total word-frequency count of each class,

  d. P(wi|ci), which represents the conditional probability of a keyword in a given class.

The “zero frequency” problem is resolved using the Laplace Estimation in the following way:

$$ P(w_i \mid c_i) = \frac{(\text{number of occurrences of } w_i \text{ in class } c_i) + 1}{n_i + |V|} $$

Module 2: Testing Module

A test instance W, consisting of the words w1, …, w|W|, is classified with the help of the “posterior” probability P(ci|W) with respect to each class, computed (up to the constant evidence term) using the following formula:

$$ P(c_i \mid W) \propto P(c_i) \times \prod_{j=1}^{|W|} P(w_j \mid c_i) $$

The highest probability measure assigns the test instance to a particular class.
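For concreteness, the two modules can be sketched in Python as follows. This is a minimal illustration, assuming the training data arrives as (token-list, sense) pairs; all names are illustrative rather than taken from the actual implementation.

```python
from collections import Counter, defaultdict
import math

def train(labeled_sentences):
    """Compute |V|, P(ci), ni and the per-class word frequencies."""
    class_counts = Counter()               # sentences per sense class
    word_counts = defaultdict(Counter)     # word frequencies per class
    vocabulary = set()
    for tokens, sense in labeled_sentences:
        class_counts[sense] += 1
        word_counts[sense].update(tokens)
        vocabulary.update(tokens)
    total = sum(class_counts.values())
    prior = {c: class_counts[c] / total for c in class_counts}  # P(ci)
    n = {c: sum(word_counts[c].values()) for c in word_counts}  # ni
    return prior, word_counts, n, len(vocabulary)

def classify(tokens, prior, word_counts, n, v):
    """Assign the class with the highest posterior, in log space."""
    best_sense, best_score = None, float("-inf")
    for c in prior:
        score = math.log(prior[c])
        for w in tokens:
            # Laplace estimation resolves the zero-frequency problem
            score += math.log((word_counts[c][w] + 1) / (n[c] + v))
        if score > best_score:
            best_sense, best_score = c, score
    return best_sense
```

Working in log space avoids numerical underflow when the product of many small conditional probabilities is taken.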

Flow Chart of the Baseline Method

The baseline method can be represented through the following diagram (Fig. 1).

Fig. 1

Flow chart of the proposed baseline approach

Results and Discussion

The following steps have been executed to run the system on the database:

Text Normalization

The texts stored in the TDIL Bangla corpus are non-normalized in nature. So, the very first task was to normalize the texts adequately by (a) removing uneven runs of spaces, new lines, etc., (b) discarding commas, colons, semicolons, double quotes, single quotes, and all other orthographic symbols, (c) converting the whole text into a single Unicode-compatible Bangla font (Vrinda in this work), and (d) treating all Bangla sentence-termination symbols, namely the exclamation mark, the question mark, and the purnacched (full stop, “।”), as sentence boundaries.
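A minimal sketch of such a normalizer is given below; steps (a), (b), and (d) are shown, while step (c), the font conversion, is corpus-specific and therefore omitted. The exact symbol inventory is an assumption, not the project's actual list.

```python
import re

BANGLA_TERMINATORS = "।!?"  # purnacched, exclamation, interrogation

def normalize(text):
    # (b) discard commas, colons, semicolons, quotes and similar
    #     orthographic symbols, keeping sentence terminators intact
    text = re.sub(r"[,;:'\"()\-]", " ", text)
    # (a) collapse uneven runs of spaces and new lines
    text = re.sub(r"\s+", " ", text).strip()
    # (d) treat every terminator uniformly as a sentence boundary
    sentences = re.split(f"[{BANGLA_TERMINATORS}]", text)
    return [s.strip() for s in sentences if s.strip()]
```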

Removal of Non-functional Words

In NLP there is no specific rule or process for differentiating between functional and non-functional words; the distinction depends more or less on the nature of the application. Although, in a practical sense, all Bangla words are useful in some context or other, a few Bangla words have been ignored while preparing the data sets for the present work to keep the number of words within a manageable limit. After lemmatization, all words except nouns, pronouns, adjectives, verbs, and adverbs (in Bangla, adverbs are also treated as a kind of adjective) are considered non-functional and removed.
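The filter itself reduces to a part-of-speech check, as in the hypothetical sketch below; the tag names are assumptions for illustration, not the tag set of the actual corpus.

```python
# Content categories retained for the experiment; everything else
# is discarded as non-functional for this application.
CONTENT_TAGS = {"NN", "PRP", "JJ", "VB", "RB"}  # noun, pronoun,
                                                # adjective, verb, adverb

def filter_tokens(tagged_tokens):
    """Keep only (word, POS) pairs whose POS is a content category."""
    return [word for word, pos in tagged_tokens if pos in CONTENT_TAGS]
```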

Selection of Ambiguous Word

Theoretically, it is possible to assume that any Bangla word can appear in a text with a certain level of ambiguity. In practice, however, computational linguists apply a few constraints when selecting ambiguous words for implementation. The Bangla text corpus used in this work consists of 3,589,220 inflected and non-inflected words, among which 199,245 words may be treated as distinct lexical units. These words are first arranged in decreasing order of their term frequency in the corpus. The most frequently used words are then selected for the experiment, subject to some necessary pre-requisite conditions discussed later.
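The ranking step amounts to a simple term-frequency count, sketched below under the assumption that the corpus is available as an iterable of tokenized sentences; the cut-off parameter is illustrative.

```python
from collections import Counter

def rank_candidates(sentences, top_k=50):
    """Return the top_k distinct lexical units by corpus frequency."""
    term_freq = Counter()
    for tokens in sentences:
        term_freq.update(tokens)
    # the pre-requisite conditions discussed later are then applied
    # to this frequency-ranked candidate list
    return [word for word, _ in term_freq.most_common(top_k)]
```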

Annotation of an Input Data

The sentences in the test data set are annotated in the following way:

  • The <Sentence x> tag at the beginning of each sentence gives the sentence number within the paragraph, and the <wsd_id = y, pos = z> tag carries the ambiguous-word number and the part of speech of the target word in that particular sentence (Fig. 2).

    Fig. 2

    Partial view of a sample input file

Preparation of a Reference Output Data

The reference output files have been generated with the help of a standard Bangla dictionary (Sansad Banglā Avidhān = Samsad Bangla Dictionary) (Fig. 3). These reference files are used to verify the system-generated outputs through a separate program.

Fig. 3

Partial view of a reference output data

In the first phase of the work, the baseline method is applied to 900 sentences containing the 19 most frequently used Bangla ambiguous words.

Selection of Senses of the Ambiguous Words for Evaluation

After retrieving the ambiguous words, a set of steps has been defined and executed to select their multiple senses for the experiment. The range of sense variation of Bangla words is so vast that selecting a few senses for the experiment appeared as a real challenge. For example, according to the Sansad Banglā Avidhān, the word “হাত” (hāt) can denote more than 80 (eighty) different senses in its singular and inflected forms, whereas the online Bangla WordNet lists only 14 (fourteen) distinct senses for the word. The TDIL Bangla text corpus, in turn, provides only 4 (four) different senses of the word with a sufficient number of sentences. Taking all these variations into consideration, the threshold has been set to 5 sentences per sense for the present work.

The following algorithm evaluates the multiple senses of an ambiguous word for the experiment:

Algorithm: Sense-Selection

  • Input: Sentences from a corpus containing an ambiguous word.

  • Output: Multiple senses of the ambiguous words.

  • Step 1: Sentences are classified based on contextual words.

  • Step 2: Misclassified sentences are rectified by an expert.

  • Step 3: A sense inventory is prepared for the ambiguous word based on the Sansad Bangla Abhidhan and the Bangla WordNet.

  • Step 4: Specific senses from the sense inventory are tagged to the sentence classes.

  • Step 5: Sense-tagged classes are rearranged in decreasing order of the number of sentences they contain.

  • Step 6: Classes containing more sentences than a threshold value are selected.

  • Step 7: Senses associated with the selected classes are considered for evaluation.
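The final selection steps (Steps 5-7) reduce to sorting and thresholding, as in the sketch below; `classes` is assumed to map each tagged sense to its list of sentences, and the threshold of 5 follows the discussion above.

```python
def select_senses(classes, threshold=5):
    """Steps 5-7: rank sense classes by size and keep the large ones."""
    ranked = sorted(classes.items(),
                    key=lambda kv: len(kv[1]), reverse=True)   # Step 5
    return [sense for sense, sentences in ranked
            if len(sentences) > threshold]                     # Steps 6-7
```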

The selected senses obtained by this algorithm are listed in Table 1.

Table 1 Selected senses of the ambiguous words

Parameters for Evaluating the Performance

The performances of the algorithms have been measured using the conventional parameters: Precision, Recall, and F-Measure.

$$ \begin{aligned} \text{Precision (P)} &= \frac{\text{number of correctly evaluated instances according to human decision}}{\text{total number of instances solved by the system}}, \\ \text{Recall (R)} &= \frac{\text{number of correctly evaluated instances according to human decision}}{\text{total number of data instances}}, \\ \text{F-Measure} &= \frac{2 \times P \times R}{P + R}. \end{aligned} $$

Throughout the work, the system assigned a sense, correctly or wrongly, to every test instance, which results in identical Precision and Recall values for each data set.
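The computation is straightforward, and the sketch below also shows why the two values coincide here: with every instance attempted, the denominators of Precision and Recall are equal. The worked figures are illustrative, assuming 729 correct instances out of the 900 reported below.

```python
def evaluate(correct, attempted, total):
    precision = correct / attempted
    recall = correct / total
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure

# With every instance attempted, attempted == total, so P == R == F:
print(evaluate(729, 900, 900))   # (0.81, 0.81, 0.81)
```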

Baseline Result

The typical Naïve Bayes algorithm has been developed as the baseline for this work. The algorithm has evaluated the 19 most frequently used Bangla ambiguous words with identical Precision and Recall values of 81% on average (Table 2).

Table 2 Execution of the baseline model

Extensions on the Baseline Methodology

To enhance the performance of the baseline methodology, the following two extensions have been adopted: (a) lemmatization of inflected word forms across the whole system, and (b) Bootstrapping.

Lemmatization of the Whole System

Since Bangla is morphologically very rich, lexical matching alone is not adequate for measuring the similarity of senses between words. To overcome this bottleneck, the whole system has been operated on the lemmatized forms of words. The expansion of lexical coverage due to lemmatization creates situations where more lexical similarity is observed between instances, which eventually makes the system far more robust and accurate. The lemmatization tool has been applied to the training sets, test data, and vocabulary (features) in a uniform manner, without any selectional bias. However, since the tool could not produce accurate results for all words, which is bound to happen given the complexities of the surface forms of many inflected Bangla words, manual intervention was necessary to rectify some of the errors in the resulting database. A glimpse of a sample lemmatized input file is presented below (Fig. 4), where the annotation follows the same strategy as in the baseline method, with the addition of the forms derived from lemmatization. Words are represented in the following format: “word-in-surface-level/stem-form/POS”.

Fig. 4

A sample lemmatized input data
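Reading this format back into features is a matter of splitting each token on the separator and keeping the stem, as in the hypothetical sketch below.

```python
def stems_from_annotated(sentence):
    """Extract stem-form features from 'surface/stem/POS' tokens."""
    features = []
    for token in sentence.split():
        parts = token.split("/")
        if len(parts) == 3:          # skip annotation tags
            surface, stem, pos = parts
            features.append(stem)    # match on the lemma, not the surface form
    return features
```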

This extension uses the same reference output files as the baseline experiment. Though the inputs have been prepared in lemmatized form, the outputs have been generated in the surface-level forms of the words to allow a like-for-like comparison with the baseline approach. The following table (Table 3) presents the performance of the algorithm on regular data and its corresponding lemmatized form.

Table 3 Performance of the algorithm on a regular data and its corresponding lemmatized form

It is observed that the overall accuracy has increased due to the expansion of the lexical coverage of the words. Since the data sets taken for the experiment are quite small, on several occasions the algorithm has returned the same accuracy. As mentioned earlier, both the Precision and Recall values are 84% in this phase, over a baseline accuracy of 82% on the same data set.

Bootstrapping

In this extended methodology, the sense-resolved test data from a particular phase of execution is inserted into the training sets to enrich the learning procedure. As the training sets become stronger with every execution, the system can produce better accuracy in subsequent executions. A small amount of manual intervention was mandatory in this phase as well: since the classification of a data set depends on probability measures computed from the training sets, the methodology requires a correctly populated training set for sense retrieval. As the proposed model could not produce perfectly accurate results in any single execution, the misclassified instances have been rectified manually to steer the system in the right direction (Fig. 5).

Fig. 5

Flowchart of the proposed bootstrapping method
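A minimal sketch of this loop is given below, reusing the train() and classify() functions sketched earlier; rectify() stands in for the manual correction step and is an assumption about the workflow, not actual project code.

```python
def bootstrap(training_set, batches, rectify):
    """Grow the training set with sense-resolved (and rectified) test data."""
    for batch in batches:                       # one batch per execution
        prior, wc, n, v = train(training_set)
        labeled = [(tokens, classify(tokens, prior, wc, n, v))
                   for tokens in batch]
        labeled = rectify(labeled)              # manual intervention
        training_set.extend(labeled)            # auto-increment the learning set
    return training_set
```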

In this phase, two consecutive executions have been considered. In the first execution, the module has been tested on a selected set of data from the Bangla corpus. After the training sets were auto-incremented, a new set of data was selected for the second execution. The accuracy of the results in both executions is presented in Table 4. The Precision and Recall values are both 83%, over a baseline accuracy of 81.5%.

Table 4 Result of bootstrapping method

It is observed that the extensions on the baseline methodology produce a better result in most cases (Tables 3, 4). However, in a few cases, the accuracy level has dropped slightly. On investigation, it is observed that the accuracy of the system depends on a few factors, such as wide variation in the sentence representations of a particular sense, the occurrence of the same lexical entries in semantically dissimilar sentences, and so on.

Conclusion and Future Scope

In this paper, an approach to Word Sense Disambiguation in the Bangla language has been proposed using the Naïve Bayes algorithm as a baseline method, supported with two extensions, namely lemmatization and bootstrapping. The results obtained from this work, although not fully up to our expectation, may be accepted for the time being on the ground that this is the first attempt of its kind, and the method may help us devise new strategies for achieving our goals. In reality, the complex linguistic nature of South Asian languages like Hindi, Bangla, Tamil, Telugu, Punjabi, Malayalam, and Marathi usually puts before us several challenges in the form of fonts, texts, morphological complexities, etc., due to which achieving even a slight breakthrough in the computation of these languages becomes a real challenge. At the same time, the variation of senses of words, the diversity of sentence structures, and the complex formation of functional and non-functional words demand additional attention for achieving better results from such experiments.