
1 Introduction

Word sense disambiguation (WSD) is the task of computationally identifying the correct meaning of words in context. This task is considered an AI-complete problem [1], which means that its difficulty is of the same order as that of such central problems in AI as, for example, the Turing Test [2]. The complexity of the WSD task is due to several reasons, among which are the lack of a unified representation for word senses, the use of different levels of granularity in sense inventories, and the strong dependence of the task on available knowledge sources. Currently, WordNet [3] is the standard sense inventory for WSD in English texts. It is a computational lexicon for English, created and maintained at Princeton University, which is based on psycholinguistic principles. WordNet encodes concepts in terms of sets of synonyms (called synsets); its latest version, WordNet 3.1, contains about 155,000 words organized in over 117,000 synsets. For the WSD task, WordNet is used to construct machine-readable dictionaries that relate each word (word lemma) to a set of WordNet sense identifiers.

The knowledge resources used form the basis for clustering the existing approaches to the WSD task into two main groups – supervised and knowledge-based. Supervised approaches exploit sense-annotated collections of texts (corpora). Since manual annotation of texts is very time consuming, most existing supervised WSD systems use the largest manually annotated corpus, SemCor [4], as their training set. SemCor comprises texts from the Brown Corpus and contains around 234,000 words that are manually annotated with part-of-speech (POS) tags and word senses from the WordNet inventory.

Semi-supervised WSD systems try to overcome the knowledge-acquisition bottleneck of supervised systems [5] by creating and using automatically sense-annotated corpora or by exploiting such corpora in conjunction with manually annotated ones. It has been shown that in some settings such semi-supervised systems outperform purely supervised WSD systems [6].

Knowledge-based approaches do not need a sense-annotated corpus and rely only on lexical resources such as a dictionary or a computational lexicon. One of the earliest approaches in this group [8] is based on computing the overlap between the context of the target word and its definitions in the sense inventory. Several approaches explore the distributional similarity between definitions and the context of the target word [8, 9]. An important branch of knowledge-based systems are graph-based systems, which use structural properties of semantic graphs to solve the WSD task [10, 11, 21]. Such systems first create a graph representation of the input text and then traverse the graph with different graph-based algorithms (e.g. PageRank) in order to determine the most probable senses in the context. Knowledge-based systems usually have lower accuracy than supervised systems, but they have the advantage of wider coverage, thanks to the use of large-scale knowledge resources [2].

The structure of the paper is as follows: in the next section we discuss some related work. Section 3 is devoted to the detailed description of the proposed approach. The experimental evaluation of the approach is presented in Sect. 4. The main characteristics of the approach and our future plans are summarized in the last section.

2 Related Work

WSD can be considered a classification task in which word senses are the classes and a classifier is used to assign each occurrence of a target word in a test text to one or more classes, based on features extracted from the context of that word. There are two variants of this task – lexical sample WSD and all-words WSD. In the first case only a restricted subset of target words has to be disambiguated; such words usually occur once per sentence. The all-words WSD task requires finding senses for all open-class words in a sentence (i.e. nouns, verbs, adjectives, and adverbs). In the supervised setting the complexity of this task is higher because of:

  1. A huge number of classes: a WordNet-based dictionary used for sense-annotating a text contains about 150,000 open-class words with an average of 1.4 senses per word (from 1 to 75). Thus, assuming that each sense of a word constitutes a unique class, the overall number of possible classes is about 210,000.

  2. Data sparseness: it is practically impossible to have a training set of adequate size that covers all open-class words in the dictionary, which makes it impossible to create a good classification model for disambiguating all words.

For these reasons, existing supervised WSD systems try to solve the all-words WSD task by constructing a set of classifiers (so-called word experts) – one specialized classifier per word. A training example for such a word expert is represented by a set of features extracted from the context of the corresponding word (a fixed number of words occurring to the left and right of the target word in a sentence) and labeled with a concrete sense of the word (class) selected from the set of possible word senses listed in the dictionary.

One of the state-of-the-art supervised all-words WSD systems is IMS [13]. It constructs a classifier for each type of open-class word w (noun, verb, adjective, or adverb). The representation of each example (an occurrence of w in the text) includes three sets of features extracted from the context of w (a window of three words to the left and right of w). The first set consists of the POS tags of the target word and of those context (surrounding) words that belong to the same sentence as the target word. The second set contains a special binary representation of the surrounding words in their lemma forms; the size of such a binary vector is equal to the number of all distinct context words found for each word type in the training set. The third feature set consists of 11 features – local collocations between context words. The representation of examples constructed by IMS can be used with different classifiers, and the best results have been achieved with a linear support vector machine (SVM).
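As an illustration, the three IMS-style feature groups can be sketched roughly as follows. This is a simplified sketch, not the actual IMS implementation; in particular, the collocation features here are a reduced stand-in for the 11 collocation features used by IMS.

```python
# Illustrative sketch of IMS-style feature extraction (not the actual IMS code).
# A tagged sentence is a list of (lemma, pos) pairs; the target sits at index i.

def extract_features(sentence, i, window=3):
    """Build the three IMS-style feature groups for the word at index i."""
    n = len(sentence)
    # 1. POS tags of the target and surrounding words within the sentence
    pos_tags = [sentence[j][1] if 0 <= j < n else None
                for j in range(i - window, i + window + 1)]
    # 2. Surrounding words in lemma form (a bag of words, later binarised
    #    against the vocabulary observed in the training set)
    surrounding = [sentence[j][0] for j in range(i - window, i + window + 1)
                   if 0 <= j < n and j != i]
    # 3. Local collocations: ordered lemma n-grams around the target
    #    (a reduced stand-in for the 11 collocation features of IMS)
    collocations = {
        "c_-1": sentence[i - 1][0] if i - 1 >= 0 else None,
        "c_+1": sentence[i + 1][0] if i + 1 < n else None,
        "c_-1,+1": ((sentence[i - 1][0] if i - 1 >= 0 else None),
                    (sentence[i + 1][0] if i + 1 < n else None)),
    }
    return {"pos": pos_tags, "bag": surrounding, "colloc": collocations}

sent = [("the", "DT"), ("bank", "NN"), ("raise", "VB"), ("interest", "NN"),
        ("rate", "NN")]
feats = extract_features(sent, 1)
```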

A current tendency in the development of advanced supervised WSD systems is the use of word embeddings instead of (or along with) the 'traditional' discrete representation of words. Word embeddings are a distributional representation of words that captures some semantic and syntactic information [14]. They have been used both to improve the performance of 'traditional' WSD systems and to construct specialized WSD systems based on recurrent neural network models. The first approach includes different variants of integrating word embeddings into the IMS system [15,16,17]. In most cases different combinations of word embeddings over the surrounding words were used as additional features. Extensive experiments have shown [17] that such integration can improve the accuracy of IMS on the all-words WSD task for several benchmark domains.

All IMS-based approaches for solving the all-words WSD task mentioned above suffer from the following shortcomings:

  • Necessity to construct separate classification models for each open word (or even for each word type).

  • Relying on rather complicated procedures for feature extraction and their integration with word embeddings.

  • Ignoring the order of words in the context.

The second approach [7, 18,19,20] can be seen as an attempt to overcome the last two shortcomings. It is based on a simple idea: first, the input context is transformed into vectors in the embedding space, and then such vectors are mapped to the possible sense vectors. Such systems usually use a bidirectional LSTM as a classifier and have achieved promising results on the lexical sample task. However, this approach still has to construct a different model (with a different set of parameters) for each target word.

3 The Approach

The approach proposed here tries to overcome all three deficiencies of the existing supervised WSD systems mentioned above. Similarly to the IMS system, we emphasize the feature extraction step, which aims at converting available text data into sets of training and testing examples that can then be used with different machine learning algorithms for solving the all-words WSD task.

We assume that three sources of knowledge are available: a WordNet-based dictionary, training text data, and a set of word embeddings. The dictionary relates open-class words (in their lemma forms) to their possible senses. Training text data may consist of one or several text files, where each file contains a set of sentences and each sentence is represented as a sequence of words in their lemma forms annotated with their POS categories. Additionally, all open-class words in a sentence are assumed to be annotated with their senses. The set of word embeddings relates words in their lemma forms to their distributional representations. In our approach we treat the file containing word embeddings as a parameter and are not concerned with how the embeddings were created. Testing data is assumed to have the same format as the training data.

3.1 Creating Sets of Training and Testing Examples

In order to represent the all-words WSD problem as a classification task, it is necessary to convert the training and test text data into sets of training and test examples relating each open-class word to its correct sense, as well as to a set of features extracted from the word's context. In our approach the creation of such examples is implemented as a three-phase process consisting of dictionary and text data analysis, generation of examples in a symbolic form, and integration of examples with word embeddings.

The first phase begins with analyzing the dictionary. First of all, we determine the list of monosemous words (i.e. words with only one meaning). It is clearly unnecessary to construct examples for such words, since their meaning does not depend on any context and can be determined directly from the dictionary. Then we analyze the available text data in order to infer some statistics, including the number of occurrences of polysemous words (i.e. words with several meanings) in the training and test data (which corresponds to the numbers of training and test examples), the frequencies of different word senses for each polysemous word in the training data, etc. Finally, for each word in the dictionary we rearrange the list of word senses – from most frequent to least frequent. Such rearrangement can be done based on the sense frequency data available in the dictionary, the sense frequency data calculated over the training set, or a mix of both. In the latter case the sense frequency data calculated over the training data is applied only for rearranging senses that have no frequency data in the dictionary.
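The sense-rearrangement step can be sketched as follows. This is a sketch that assumes a dictionary entry is a list of (sense_id, dict_freq) pairs, with dict_freq set to None when the dictionary carries no frequency data for a sense; the sense identifiers are hypothetical.

```python
def rearrange_senses(dict_entry, train_counts, mode="mixed"):
    """Order sense ids from most to least frequent.

    mode = "dictionary": use dictionary frequencies only;
    mode = "training":   use frequencies counted over the training set;
    mode = "mixed":      dictionary frequencies take precedence; training
                         counts are used only for senses without them.
    """
    def key(pair):
        sense_id, dict_freq = pair
        if mode == "training":
            return (0, -train_counts.get(sense_id, 0))
        if mode == "dictionary":
            return (0, -(dict_freq if dict_freq is not None else 0))
        if dict_freq is not None:          # mixed: dictionary data first
            return (0, -dict_freq)
        return (1, -train_counts.get(sense_id, 0))
    return [sense_id for sense_id, _ in sorted(dict_entry, key=key)]

# hypothetical entry: s2 has no dictionary frequency but occurs 7 times in
# the training data, so in "mixed" mode it is ranked after s1 and s3
entry = [("s1", 20), ("s2", None), ("s3", 5)]
order = rearrange_senses(entry, {"s2": 7}, mode="mixed")
```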

Generation of examples in a symbolic form is aimed at creating training and test examples for the all-words WSD task in an intermediate form readable and easily understandable by a human. Each example is represented as a pair <word_context, word_class>, where the vector word_context describes the target word (and its POS category) and its context, represented as the words (and their POS categories) occurring to the left and right of the target word in the context window, whose length is a system parameter:

$$ \mathbf{word\_context} = \left\{\, w_{\mathrm{target}}\text{-}\mathrm{POS}_{\mathrm{target}},\; w_{-L}\text{-}\mathrm{POS}_{-L},\; w_{-L+1}\text{-}\mathrm{POS}_{-L+1},\; \ldots,\; w_{1}\text{-}\mathrm{POS}_{1},\; \ldots,\; w_{R}\text{-}\mathrm{POS}_{R} \,\right\}, $$

where L and R are, respectively, the left and the right boundary of the context window. We use a special symbol (/s) for marking the end of a sentence, and the system allows specifying whether the context window may include words from adjacent sentences or not. In the latter case the content of the corresponding position of the context window is marked as 0-0.
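A minimal sketch of building the symbolic word_context vector, assuming a sentence is a list of (lemma, POS) pairs and the window is not allowed to cross sentence boundaries, so out-of-sentence positions become 0-0:

```python
def word_context(sentence, i, left=2, right=2):
    """Symbolic representation: target token first, then the context.

    Out-of-sentence positions are filled with "0-0" here; the full system
    can instead emit the /s marker and continue into adjacent sentences.
    """
    def slot(j):
        return "-".join(sentence[j]) if 0 <= j < len(sentence) else "0-0"
    target = "-".join(sentence[i])
    return ([target]
            + [slot(j) for j in range(i - left, i)]
            + [slot(j) for j in range(i + 1, i + right + 1)])

sent = [("interest", "NN"), ("rate", "NN"), ("rise", "VB")]
ctx = word_context(sent, 0)
# target "interest-NN", two null left positions, two right context tokens
```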

We consider our representation of the class of the target word to be the most important contribution to a new formulation of the all-words WSD task. Until now, each sense of a target word has been considered a unique class with its own encoding that makes sense only for that word. In our approach we propose a unified interpretation of a word sense class that does not depend on the concrete word: the class of a target word is interpreted as the position of the corresponding word sense in the ordered sequence of relative sense frequencies associated with the word in the modified dictionary. In other words, the most frequently used sense of each target word is marked as class 1, the next most frequent sense is marked as class 2, and so on. The overall number of classes is defined by the maximum number of possible senses of a word in the dictionary (which is equal to 75 for our WordNet-based dictionary).

It should be noted that, according to this interpretation, disambiguating an unknown open-class word is a two-step process: first, the class associated with the sense of the tested word is determined by a classifier, and then the proper sense is retrieved from the dictionary via the WordNet sense identifier that corresponds to that class for the tested word.
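The two-step scheme can be sketched as follows; the sense identifiers and the reordered dictionary are hypothetical:

```python
# Hypothetical reordered dictionary: senses are listed from most to least
# frequent, so a sense's 1-based position is its class label.
sense_order = {
    "bank": ["bank%s1", "bank%s2", "bank%s3"],
    "run":  ["run%s1", "run%s2"],
}

def encode_class(lemma, sense_id):
    # used when building training examples: sense -> frequency rank
    return sense_order[lemma].index(sense_id) + 1

def decode_class(lemma, cls):
    # step 2 of disambiguation: predicted rank -> sense identifier
    return sense_order[lemma][cls - 1]
```

Because the class labels are frequency ranks rather than word-specific sense codes, a single classifier can in principle be trained over examples of all words.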

The last phase of the data preprocessing in our approach is intended for integrating the examples created during the previous phase with word embeddings and transforming the examples into a format that can be used by different machine learning algorithms and environments. Our approach allows using word embeddings in two different forms – full and compressed. In the first case each word in the example (i.e. the context words and the target word itself) is replaced by its embedding. 'Null' words (i.e. those marked by 0 in the symbolic representation of the example) are replaced by 'null' embeddings – vectors of the same size as 'normal' embeddings but with all elements set to zero. In this way, if the size of a word embedding vector is d (e.g. 300) and the size of the context window is l (e.g. 10), the embedding-based representation of the target word and its context is a real-valued vector of dimension d × (l + 1) (e.g. 3300).
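The construction of the full-form representation can be sketched as follows; toy 4-dimensional embeddings stand in for real 300-dimensional ones:

```python
import numpy as np

def full_representation(tokens, embeddings, d):
    """tokens: target lemma followed by the l context lemmas (or "0-0").

    Each token is replaced by its d-dimensional embedding; null positions
    ("0-0") and out-of-vocabulary words get all-zero 'null' embeddings.
    """
    parts = [embeddings[t] if t != "0-0" and t in embeddings else np.zeros(d)
             for t in tokens]
    return np.concatenate(parts)      # final dimension: d * (l + 1)

# toy 4-dimensional embeddings standing in for real 300-dimensional ones
emb = {"bank": np.ones(4), "money": np.full(4, 2.0)}
vec = full_representation(["bank", "money", "0-0"], emb, d=4)
```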

Since not all existing machine learning environments are able to work with data of such dimensionality, we have developed an alternative approach for utilizing word embeddings. It creates a 'compressed' real-valued representation of the target word and its context, whose dimensionality does not depend on the dimensionality of the word embeddings used. To do this, we replace each word wi in the context that has a d-dimensional embedding Emb(wi) by a single value compr_emb(wi), which measures the cosine similarity between Emb(wi) and Emb(wtarget) – the d-dimensional embedding of the target word wtarget:

$$ compr\_emb(w_i) = \mathrm{sim}\left(\mathrm{Emb}(w_i), \mathrm{Emb}(w_{target})\right) = \frac{\sum\limits_{j=1}^{d} Emb_j(w_i)\, Emb_j(w_{target})}{\sqrt{\sum\limits_{j=1}^{d} Emb_j^2(w_i)}\; \sqrt{\sum\limits_{j=1}^{d} Emb_j^2(w_{target})}} $$

In such a way, the same context of different target words will have different ‘compressed’ embedding representations depending on the similarity between the concrete target word and the context words.

The compressed representation of the target word itself is created by measuring the cosine similarity between the embedding of the word and the 'unitary' embedding 1 – a d-dimensional vector whose components are all equal to one:

$$ compr\_emb(w_{target}) = \mathrm{sim}\left(\mathrm{Emb}(w_{target}), \mathbf{1}\right) = \frac{\sum\limits_{j=1}^{d} Emb_j(w_{target})}{\sqrt{d}\, \sqrt{\sum\limits_{j=1}^{d} Emb_j^2(w_{target})}} $$
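Both compressed encodings can be implemented directly from the two formulas above; a minimal sketch using toy 2-dimensional embeddings:

```python
import numpy as np

def cosine(a, b):
    # plain cosine similarity between two vectors
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def compressed_representation(target_emb, context_embs):
    # target word: similarity with the all-ones ('unitary') vector;
    # context words: similarity with the target word's embedding
    target_val = cosine(target_emb, np.ones_like(target_emb))
    return [target_val] + [cosine(e, target_emb) for e in context_embs]

t = np.array([1.0, 0.0])
ctx = [np.array([1.0, 0.0]), np.array([0.0, 1.0])]
vals = compressed_representation(t, ctx)
# a context word identical in direction to the target scores 1.0,
# an orthogonal one scores 0.0
```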

The encoding of the POS categories of the target and context words in an example can be done in two ways. The first is to represent the POS category of each word as a separate nominal attribute that can take k + 1 different nominal values, where k is the length of the list of POS categories to be included in the context (the (k + 1)-th value is the 'null' category assigned to 'null' words and to the symbol /s used to mark the end of a sentence). Another option (suitable for neural network based classifiers) is to represent the POS category of each word by a k-dimensional binary vector.
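Both POS-encoding options can be sketched as follows; the POS list and the mapping of the 'null' category to an all-zero vector are illustrative assumptions:

```python
# Illustrative POS list; k = 4 categories, so option 1 has k + 1 nominal
# values and option 2 uses k-dimensional binary vectors.
POS_LIST = ["NOUN", "VERB", "ADJ", "ADV"]

def pos_nominal(tag):
    # option 1: one nominal attribute; anything outside the list (null
    # words, the /s end-of-sentence symbol) maps to the 'null' category
    return tag if tag in POS_LIST else "NULL"

def pos_one_hot(tag):
    # option 2: a k-dimensional binary vector for neural network
    # classifiers; encoding the 'null' category as all zeros is an
    # assumption made for this sketch
    return [1 if tag == p else 0 for p in POS_LIST]
```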

3.2 Classification

As already mentioned, our goal was to develop a flexible feature extraction system that converts available training and test data into a format suitable for learning effective WSD classifiers, rather than to create a specialized classifier for solving the all-words WSD task. That is why the phase of learning and testing classification models based on the training and test sets of examples created by our system is not part of the system. However, the system makes it possible to (implicitly) define the number of classifiers needed for solving the all-words WSD task. At the final phase it is possible to select how the created training and test sets of examples should be saved – as a single file or as a set of files. Saving the data into a single file allows creating a single classifier that is able to disambiguate all target words occurring in the text. At the other extreme, it is possible to group examples by a concrete target word or even by the POS category of the target word. In this way we are able to create a set of word-specific training and test files that can be used for creating 'traditional' word expert classifiers.

The ability to group examples according to the POS categories of the target words lies between these two extremes. The resulting four sets of examples can be used for constructing four different classifiers specializing in solving the WSD task for nouns, verbs, adverbs and adjectives, respectively. Such a flexible approach allows generating different classification models for different POS word categories depending on the quality and quantity of the available examples for each category. It should be mentioned that all created data sets can be saved either as .txt or as .arff files, which makes it possible to use them directly in such machine learning environments as WEKA or Orange.

4 Experiments and Results

The conducted experiments were aimed at two main goals. The first was to evaluate the performance of different classifiers trained on sets of examples prepared according to the proposed approach against such commonly used baselines for evaluating WSD classifiers as the Most Frequent Sense (MFS) and WordNet First Sense (WNFS) baselines [7]. MFS is the accuracy of a classifier that for each target word selects the sense that occurs most frequently in the training data for that word. WNFS is the accuracy of a classifier that for each target word selects the most frequent sense of that word according to WordNet 3.0.

The second goal was to compare the behavior of such classifiers with a knowledge-based WSD system developed by our colleagues [21]. The system is based on different kinds of knowledge graphs and outperforms other knowledge-based WSD systems created by means of the UKB tool using standard knowledge graphs. The best reported accuracy achieved by their system (calculated on the whole data set, including monosemous words) is 68.77%.

We used the same training and test data, extracted from the SemCor 3.0 corpus – one third of the available files (from br-a01 to br-f44) were used for testing and the remaining two thirds for training. It should be noted that the files contain only open-class words. The sense annotation of words in the files was done in accordance with the WordNet dictionary wnet30_v203.lex, which contains 148,401 word lemmas with a minimum of 1 and a maximum of 75 senses per word (1.40 senses per word on average). The overall number of polysemous words in the dictionary is 27,009. Tables 1 and 2 summarize some features of these files at the level of words and their occurrences in the corresponding text files.

Table 1. Text data statistics.
Table 2. Some statistics on word occurrences.

As we have already mentioned, we construct examples only for polysemous words – each occurrence of such a word in the text data is converted to an example. The test data contains 1,272 distinct polysemous words that are not present in the training data (they occur 2,042 times in the test data – 5.8% of all test examples). Such 'test-only' words can be recognized by a classifier but cannot be correctly disambiguated by any classification model constructed only from the given set of training examples. The best alternative for such examples is to use the most frequent sense of the tested word obtained from the WordNet dictionary (WNFS). That is why we excluded these examples from the test set.

We call a word occurring in the test data 'unsolvable' if the word is present in the training data but its correct sense is not. It is clear that such words can never be disambiguated correctly by any classifier learnt from the given training data. Since 'unsolvable' words cannot be recognized by any classifier, we use such data for calculating a so-called 'Best Case' baseline that determines the upper bound of the classification accuracy that can theoretically be achieved on this test set by a classifier learnt from the given training set (for our training set this baseline is equal to 91.10%).
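The 'Best Case' bound can be computed by checking, for each test example, whether its (word, sense) pair also occurs among the training examples; a sketch with toy data:

```python
def best_case(train_examples, test_examples):
    """Upper bound on accuracy: the share of test examples whose
    (word, sense) pair also occurs in the training data."""
    seen = set(train_examples)
    solvable = sum(1 for ex in test_examples if ex in seen)
    return solvable / len(test_examples)

# toy data: ("bank", "s3") and ("run", "s2") are 'unsolvable' test
# occurrences, since those senses never appear in the training data
train = [("bank", "s1"), ("bank", "s2"), ("run", "s1")]
test = [("bank", "s1"), ("bank", "s3"), ("run", "s1"), ("run", "s2")]
```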

We used word embeddings with 300 dimensions that were trained over a pseudo corpus generated from an extended WordNet-based knowledge graph (PCWN) [21] by means of the Word2Vec tool with the following settings: context window of 5 words, 7 iterations, 5 negative samples, and frequency cut sampling set to 7.

We have experimented with two types of classifiers – neural network based and ensemble based. Neural networks with different architectures were built in TensorFlow, an open source machine learning framework. For the ensemble-based classifiers we selected Random Forest in its scikit-learn implementation and the GCForest implementation of the Deep Forest model [22].
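A minimal sketch of one of the POS-specific classifiers, using scikit-learn's RandomForestClassifier on placeholder data; the feature and class sizes here are illustrative, not the actual experimental settings:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
# placeholder features, e.g. a compressed 10-word context plus the target
X_train = rng.normal(size=(200, 11))
y_train = rng.integers(1, 5, size=200)   # frequency-rank classes 1..4
X_test = rng.normal(size=(50, 11))

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_train, y_train)
pred = clf.predict(X_test)               # predicted sense ranks
```

The predicted ranks would then be mapped back to WordNet sense identifiers via the reordered dictionary, as described in Sect. 3.1.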

The first set of experiments was conducted on data in which the ordering of classes was done according to the sense frequencies calculated over the training set. That is why the results were compared to the Most Frequent Sense (MFS) baseline calculated over the same training set. We experimented with both methods for integrating word embeddings: Table 3 presents the results with word embeddings in their full form, and Table 4 with the compressed form. The disambiguation was done by four classifiers trained on the training sets corresponding to the four POS categories (nouns, adverbs, adjectives and verbs). In all experiments with neural networks we applied the same set of parameters: the Softmax function, ReLU activations, a learning rate of 0.003, dropout of 0.5, and the Adam optimizer.

Table 3. Results from the experiments with ‘full’ word embeddings and class ordering based on the training set.
Table 4. Results from the experiments with ‘compressed’ word embeddings and class ordering based on the training set.

It can be seen that all classifiers easily 'beat' the MFS baseline, with the best results achieved by the ensemble-based classifiers. It should be noted that the overall accuracy of the classifiers trained on the 'compressed' representation of word embeddings is almost the same as that of the classifiers that used the 'full' form.

Tables 5 and 6 present results of similar experiments – the only difference is that the ordering of classes was done based on the sense frequencies specified in our WordNet-based dictionary. That is why we have compared the results with the WordNet First Sense (WNFS) baseline.

Table 5. Results from experiments with ‘full’ word embeddings and WordNet-based class ordering.
Table 6. Results from experiments with ‘compressed’ word embeddings and WordNet-based class ordering.

Although the baseline accuracy in this case is higher, all classifiers again outperform the baseline, and again the ensemble-based classifiers demonstrated better behavior than the neural network based ones.

In order to compare our supervised approach with the knowledge-based approach proposed in [21], we had to extend our classification schema to cover monosemous words and words from the test data that are not present in the training data. All such words are classified by means of the WNFS heuristic. As can be seen from Table 7, the supervised classifiers significantly outperform both the knowledge-based ones and the WNFS baseline.

Table 7. Classification accuracy of some classifiers on all test examples.

5 Conclusion and Future Work

In this paper we have described a new approach for solving the all-words WSD task in the supervised setting. The approach allows formulating this problem as a classification task that can be solved by means of one or several classifiers trained on the training set or on its different subsets. In all cases the training examples are represented in a unified way, based on a new interpretation of the notion of 'class' as the position of the concrete word sense in the sequence of possible word senses ordered by sense frequencies calculated over some corpora. At the moment the representation of examples includes information about the POS categories of the target word and the words from its context, as well as about the context itself. Both target and context words can be represented via their embeddings in two different ways. In the first case the full distributional representation of embeddings as vectors is used, while in the second each word is encoded by a single real-valued number reflecting the similarity between the target word and the given context word. The conducted experiments have shown that classifiers trained on examples created in this way outperform both the Most Frequent Sense and the WordNet First Sense baselines on the test data.

At the moment our efforts are concentrated mainly in two directions. The first is to evaluate the approach on a wider range of available data (e.g. Semeval-2, Semeval-3) and to compare the results with several state-of-the-art supervised WSD systems. The second is to extend the proposed approach by integrating more knowledge sources, for example information on context word collocations and embeddings over word senses.