
1 Introduction

Popular word embedding vectors, such as Word2Vec, represent a word’s semantic meaning and its syntactic role as a point in a vector space [1, 2]. As each word is given only one embedding, such methods are restricted to representing a single combined sense, or meaning, of the word. Word sense embeddings generalise word embeddings to handle polysemous and homonymous words. Often these sense embeddings are learnt through unsupervised Word Sense Induction (WSI) [3,4,5,6]. The induced sense embeddings are unlikely to coincide directly with any set of human-defined meanings, i.e. they will not match lexical senses such as those defined in a lexical dictionary, e.g. WordNet [7]. These induced senses may be more specific, broader, or include the meanings of jargon not in common use.

One may argue that WSI systems can capture better word senses than human lexicographers do manually. However, this does not mean that induced senses can replace standard lexical senses. It is important to appreciate the vast wealth of existing knowledge defined around lexical senses. Methods to link induced senses to lexical senses allow us to take advantage of both worlds.

We propose a refitting method to generate a sense embedding vector that matches a labelled lexical sense. Given an example sentence containing a particular word labelled with that lexical sense, the refitting method algorithmically combines the induced sense embeddings of the target word such that the likelihood of the example sentence is maximised. We find that, in doing so, the sense of the word in that sentence is captured. With refitting, the induced sense embeddings can be used in more general situations where standard senses, or user-defined senses, are desired.

Refitting word sense vectors to match a lexicographical sense inventory, such as WordNet or a translator’s dictionary, is possible if the sense inventory features at least one example of the target sense’s use. Our method allows this to be done very rapidly, and from only a single example of use; this has possible applications in low-resource languages.

Refitting can also be used to fit to a user-provided example, giving a specific sense vector for that use. This has strong applications in information retrieval. The user can provide an example of a use of the word they are interested in; for example, searching for documents about “banks” as in “the river banks were very muddy”. By generating an embedding for that specific sense, and comparing it with the embeddings generated for the indexed documents, we can not only pick up on suitable uses of other words, for example “beach” and “shore”, but also exclude different usages, for example those referring to a financial bank. The method we propose, using our refitted embeddings, has lower time complexity than AvgSimC [3], the current standard method for evaluating the similarity of words in context. This is detailed in Sect. 5.1.

We noted during refitting that a single induced sense would often dominate the refitted representation. It is rare in natural language for the meaning to be so unequivocal. Generally, a significant overlap exists between the meanings of different lexical senses, and there is often a high level of disagreement when humans are asked to annotate a corpus [8]. We would expect that during refitting there would likewise be contention over the most likely induced sense. To address this, we develop a smoothing method, which we call geometric smoothing, that de-emphasises the sharp decisions made by the (unsmoothed) refitting method. We found that this significantly improves the results. This suggests that the sharpness of sense decisions is an issue with the language model, which smoothing can correct. The geometric smoothing method is presented in Sect. 3.2.

We demonstrate the refitting method on sense embedding vectors induced using Adaptive Skip-Grams (AdaGram) [6], as well as our own simple greedy word sense embeddings. The method is applicable to any skip-gram-like language model that can take a sense vector as its input, and can output the probability of a word appearing in that sense’s context.

The rest of the paper is organised as follows: Sect. 2 briefly discusses two areas of related works. Section 3 presents our refitting method, as well as our proposed geometric smoothing method. Section 4 describes the WSI embedding models used in the evaluations. Section 5 defines the RefittedSim measure for word similarity in context, and presents its results. Section 6 shows how the refitted sense vectors can be used for lexical WSD. Finally, the paper concludes in Sect. 7.

2 Related Works

2.1 Directly Learning Lexical Sense Embeddings

In this area of research, the induction of word sense embeddings is treated as a supervised, or semi-supervised task, that requires sense labelled corpora for training.

Iacobacci et al. [9] use a Continuous Bag of Words language model [1], using word senses as the labels rather than words. This is a direct application of word embedding techniques. To overcome the lack of a large sense-labelled corpus, Iacobacci et al. use a third-party WSD tool, Babelfy [10], to add sense annotations to a previously unlabelled corpus.

Chen et al. [11] use a supervised approach to train sense vectors, with an unsupervised WSD labelling step. They partially disambiguate their training corpus using word sense vectors based on WordNet, and then use this relabelled data as training data for finding sense embeddings with skip-grams.

Our refitting method learns a new sense embedding as a weighted sum of existing induced sense embeddings of the target word. Refitting is a one-shot learning solution, as compared to the approaches used in the works discussed above. A notable advantage is the time taken to add a new sense. Adding a new sense is practically instantaneous, and replacing an entire sense inventory of several hundred thousand senses takes only a few hours, whereas the existing approaches would require repeating the training process, which often takes several days. Refitting is a process applied to existing word sense embeddings, rather than a method for finding sense embeddings from a large corpus.

2.2 Mapping Induced Senses to Lexical Senses

By defining a stochastic map between the induced and lexical senses, Agirre et al. [12] propose a general method for allowing WSI systems to be used for WSD. Their work was used in SemEval-2007 Task 02 [13] to evaluate all entries. Agirre et al. use a mapping corpus to find the probability of a lexical sense, given the induced sense according to the WSI system. This is more general than the approach we propose here, which only works for sense-embedding-based WSI. By exploiting the particular properties of sense-embedding-based WSI systems, we propose a system that can better facilitate the use of this subset of WSI systems for WSD.

3 Proposed Refitting Framework

The key contribution of this work is to provide a way to synthesise a word sense embedding given only a single example sentence and a set of pretrained sense embedding vectors. We term this refitting the sense vectors. By refitting the unsupervised vectors we define a new vector that lines up with the specific meaning of the word in the example sentence.

This can be looked at as a one-shot learning problem, analogous to regression. The training of the induced sense, and of the language model, can be considered an unsupervised pre-training step. The new word sense embedding should give a high value for the likelihood of the example sentence, according to the language model. It should also generalise to give a high likelihood of other contexts where this word sense occurs.

We initially attempted to directly optimise the sense vector to predict the example. We applied the L-BFGS [14] optimisation algorithm, with the sense vector as the parameter being optimised and the objective being to maximise the probability of the example sentence according to the language model. This was found to generalise poorly, due to over-fitting, and to be very slow. Rather than taking a direct approach, we instead take inspiration from the locally linear relationship between meaning and vector position that has been demonstrated for word embeddings [1, 15, 16].

To refit the induced sense embeddings to a particular meaning of a word, we express the new embedding as a weighted combination of the induced sense vectors. The weights are determined by the probability of each induced sense given the context.

Given a collection of induced (unlabelled) embeddings \(\textbf{u}=\{u_1,...,u_{n_u}\}\), and an example sentence \(\textbf{c}=(w_1,...,w_{n_c})\), we define a function \(l(\textbf{u}\mid \textbf{c})\) which determines the refitted sense vector from the unsupervised vectors and the context as:

$$\begin{aligned} l(\textbf{u}\mid \textbf{c}) = \sum _{\forall u_i \in \textbf{u}} u_i P(u_i \mid \textbf{c}) \end{aligned}$$
(1)

Bayes’ Theorem can be used to estimate the posterior predictive distribution \(P(u_i \mid \textbf{c})\).

Bengio et al. [17] describe a similar method to Eq. (1) for finding (single sense) word embeddings for words not found in their vocabulary. The formula they give is as per Eq. (1), but summing over the entire vocabulary of words (rather than just \(\textbf{u}\)).
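For concreteness, a minimal sketch of Eq. (1) is given below. It assumes a helper `sense_posterior` that returns \(P(u_i \mid \textbf{c})\) for every induced sense of the target word (computed as in Sect. 3.1); the function and variable names are illustrative and not taken from our implementation.

```python
# Minimal sketch of Eq. (1): the refitted vector is the posterior-weighted sum
# of the induced sense vectors of the target word.
import numpy as np

def refit(induced_senses, context, sense_posterior):
    """induced_senses: array of shape (n_u, dim); context: list of words."""
    induced_senses = np.asarray(induced_senses)
    posterior = np.asarray(sense_posterior(induced_senses, context))  # P(u_i | c), shape (n_u,)
    return posterior @ induced_senses  # refitted sense vector, shape (dim,)
```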

3.1 A General WSD Method

Using the language model and application of Bayes’ theorem, we define a general word sense disambiguation method that can be used for refitting (Eq. (1)), and for lexical word sense disambiguation (see Sect. 6). This is a standard approach of using Bayes’ theorem [5, 6]. We present it here for completeness.

The context is used to determine which sense is the most suitable for this use of the target word (the word being disambiguated). Let \(\textbf{s}=(s_{1},...,s_{n})\) be the collection of senses for the target word.

Let \(\textbf{c}=(w_{1},...,w_{n_c})\) be a sequence of words making up the context of the target word. For example, for the target word kid, the context could be \(\textbf{c}=(\)wow, the, wool, from, the, is, so, soft, and, fluffy), where kid is the central word, taken from between the and is.

For any particular sense, \(s_i\), the multiple sense skip-gram language model can be used to find the probability of a word \(w_j\) occurring in the context: \(P(w_j \mid s_i)\). By assuming the conditional independence of each word \(w_j\) in the context, given the sense embedding \(s_i\), the probability of the context can be calculated:

$$\begin{aligned} P(\textbf{c}\mid s_{i})=\prod _{\forall w_{j}\in \textbf{c}}P(w_{j} \mid s_{i}) \end{aligned}$$
(2)

The correctness of the conditional independence assumption depends on the quality of the representation – the ideal sense representation would fully capture all information about the contexts it can appear in – thus the other context elements would not provide any additional information, and so \(P(w_a \mid w_b,s_i)=P(w_a \mid s_i)\). Given this, we have an estimate of \(P(\textbf{c}\mid s_{i})\) which can be used to find \(P(s_i \mid \textbf{c})\). However, a false assumption of independence contributes towards overly sharp estimates of the posterior distribution [18], which we seek to address in Sect. 3.2 with geometric smoothing.

Bayes’ Theorem is applied to this context likelihood function \(P(\textbf{c}\mid s_{i})\) and a prior for the sense \(P(s_i)\) to allow the posterior probability to be found:

$$\begin{aligned} P(s_{i} \mid \textbf{c}) = \dfrac{P(\textbf{c}\mid s_{i})P(s_{i})}{\sum _{s_{j}\in \textbf{s}} P(\textbf{c}\mid s_{j})P(s_{j})} \end{aligned}$$
(3)

This is the probability of the sense given the context.
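A sketch of this general WSD step is given below, under the assumption that the language model exposes a function returning \(P(w_j \mid s_i)\) (the `word_given_sense` argument is hypothetical). The product of Eq. (2) is computed in log space for numerical stability.

```python
import numpy as np

def sense_posterior(sense_vecs, context, word_given_sense, prior=None):
    """Eqs. (2) and (3): P(s_i | c) for each sense vector in `sense_vecs`."""
    n = len(sense_vecs)
    prior = np.full(n, 1.0 / n) if prior is None else np.asarray(prior, dtype=float)
    # Eq. (2): log P(c | s_i) = sum_j log P(w_j | s_i)
    log_lik = np.array([sum(np.log(word_given_sense(w, s)) for w in context)
                        for s in sense_vecs])
    # Eq. (3): normalise likelihood * prior over all senses of the target word.
    log_unnorm = log_lik + np.log(prior)
    unnorm = np.exp(log_unnorm - log_unnorm.max())
    return unnorm / unnorm.sum()
```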

3.2 Geometric Smoothing for General WSD

During refitting, we note that often one induced sense is calculated as having a much higher probability of occurring than the others (according to Eq. 3). This level of certainty is not expected to occur in natural language; ambiguity is almost always possible. To resolve such dominance problems, we propose a new geometric smoothing method. This is suitable for smoothing posterior probability estimates derived from products of conditionally independent likelihoods. It smooths the resulting distribution by shifting all probabilities closer to the uniform distribution.

We hypothesize that the sharpness of probability estimates from Eq. (3) is a result of data sparsity and of the false independence assumption in Eq. (2). This is well known to occur for n-gram language models [18]. Word-embedding language models largely overcome the data sparsity problem due to weight-sharing effects [17]. We suggest that the problem remains for word sense embeddings, where there are many more classes; the training data must therefore be split further, between the senses of each word, than it was when split only between words. The power law distribution of word use [19] is compounded by the senses of each word also following a power law distribution [20]. Rare senses are liable to over-fit to the few contexts they do occur in, and so give disproportionately high likelihoods to contexts similar to those. We propose to handle these issues through additional smoothing.

We consider replacing the unnormalised posterior with its \(n_c\)-th root, where \(n_c\) is the length of the context. We replace the likelihood of Eq. (2) with \(P_S(\textbf{c}\mid s_{i})=\prod _{\forall w_{j}\in \textbf{c}}\root n_c \of {P(w_{j} \mid s_{i})}\). Similarly, we replace the prior with \(P_S(s_{i})= \root n_c \of {P(s_{i})}\). When these are substituted into Eq. (3), it becomes a smoothed version of \(P(s_{i} \mid \textbf{c})\).

$$\begin{aligned} P_S(s_{i}\mid \textbf{c}) =\dfrac{\root n_c \of {P(\textbf{c}\mid s_{i})P(s_{i})}}{\sum _{s_{j}\in \textbf{s}} \root n_c \of {P(\textbf{c}\mid s_{j})P(s_{j})}} \end{aligned}$$
(4)

The motivation for taking the \(n_c\)-th root comes from considering the case of a uniform prior. In this case \(P_S(\textbf{c}\mid s_{i})\) is the geometric mean of the individual word probabilities \(P(w_j \mid s_{i})\). Consider two context sentences, \(\textbf{c}=(w_1,...,w_{n_c})\) and \(\textbf{c}^\prime =(w_1^\prime ,...,w^\prime _{n_{c^\prime }})\), such that \(n_{c^\prime } > n_c\). Using Eq. (2) to calculate \(P(\textbf{c}\mid s_{i})\) and \(P(\textbf{c}^\prime \mid s_{i})\) gives incomparable results, as the additional probability terms dominate – often far more than the relative values of the probabilities themselves. The number of words that can occur in the context of any given sense is very large – a large portion of the vocabulary. Averaging across all words, we would expect each additional word in the context to decrease the probability by a factor of \(\frac{1}{V}\), where V is the vocabulary size. The expected value of \(P(\textbf{c}\mid s_{i})\) is thus \(\frac{1}{V^{n_c}}\), and that of \(P(\textbf{c}^\prime \mid s_{i})\) is \(\frac{1}{V^{n_{c^\prime }}}\). As \(n_{c^\prime } > n_c\), we expect \(P(\textbf{c}^\prime \mid s_{i}) \ll P(\textbf{c}\mid s_{i})\). Taking the \(n_{c}\)-th and \(n_{c^\prime }\)-th roots of \(P(\textbf{c}\mid s_{i})\) and \(P(\textbf{c}^\prime \mid s_{i})\) normalises these probabilities so that they have the same expected value, thus making a context-length independent comparison possible. When this normalisation is applied to Eq. (3), we get the smoothing effect.
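In implementation terms, geometric smoothing amounts to dividing the log unnormalised posterior by \(n_c\) before normalising. The sketch below is a smoothed variant of the hypothetical `sense_posterior` helper given earlier, again assuming a `word_given_sense` function supplying \(P(w_j \mid s_i)\).

```python
import numpy as np

def smoothed_sense_posterior(sense_vecs, context, word_given_sense, prior=None):
    """Eq. (4): geometrically smoothed P_S(s_i | c)."""
    n, n_c = len(sense_vecs), len(context)
    prior = np.full(n, 1.0 / n) if prior is None else np.asarray(prior, dtype=float)
    log_lik = np.array([sum(np.log(word_given_sense(w, s)) for w in context)
                        for s in sense_vecs])
    # Taking the n_c-th root of P(c|s_i)P(s_i) == dividing its log by n_c.
    log_unnorm = (log_lik + np.log(prior)) / n_c
    unnorm = np.exp(log_unnorm - log_unnorm.max())
    return unnorm / unnorm.sum()
```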

4 Experimental Sense Embedding Models

We trained two sense embedding models, AdaGram [6] and our own Greedy Sense Embedding method. During training we use the Wikipedia dataset as used by Huang et al. [4]. However, we do not perform the extensive preprocessing used in that work.

Most of our evaluations are carried out on Adaptive Skip-Grams (AdaGram) [6]. AdaGram is a non-parametric Bayesian extension of skip-gram. It learns as many different senses of each word as are required to properly model the language.

We use the implementation provided by the authors, with minor adjustments for Julia [21] v0.5 compatibility.

The AdaGram model was configured to have up to 30 senses per word, where each sense is represented by a 100-dimensional vector. The sense threshold was set to \(10^{-10}\) to encourage many senses. Only words with at least 20 occurrences are kept, giving a total vocabulary size of 497,537 words.

To confirm that our techniques are not merely a quirk of the AdaGram method or its implementation, we implemented a new simple baseline word sense embedding method. This method starts with a fixed number of randomly initialised embeddings, then greedily assigns each training case to the sense which predicts it with the highest probability (using Eq. (3)). The task remains the same: using skip-grams with hierarchical softmax to predict the context words for the input word sense. This is similar to [22]; however, it uses collocation probability, rather than distance in vector space, as the sense assignment measure. Our implementation is based on a heavily modified version of Word2Vec.jl.
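The greedy assignment step can be sketched roughly as follows; `sense_posterior` is the helper sketched in Sect. 3.1, and `skipgram_update` stands in for the usual skip-gram with hierarchical softmax gradient step. Both are assumptions for illustration, not our actual training code.

```python
import numpy as np

def greedy_train_step(sense_vecs, context, sense_posterior, skipgram_update,
                      word_given_sense):
    # Hard assignment: pick the sense that best predicts this context
    # (Eq. (3), uniform prior), then train only that sense on the usual
    # skip-gram prediction task.
    posterior = sense_posterior(sense_vecs, context, word_given_sense)
    winner = int(np.argmax(posterior))
    skipgram_update(sense_vecs[winner], context)
    return winner
```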

This method is intrinsically worse than AdaGram: nothing in the model encourages diversification and specialisation of the embeddings. Manual inspection reveals that a variety of senses are captured, though with significant repetition of common senses, and with rare senses being missed. Regardless of its low quality, it is fully independent of AdaGram, and so is suitable for checking the generalisation of the refitting techniques.

The vocabulary used is smaller than for the AdaGram model. Words with at least 20,000 occurrences are allocated 20 senses. Words with at least 250 occurrences (but fewer than 20,000) are restricted to a single sense. The remaining rare words are discarded. This results in a vocabulary size of 88,262, with 2,796 words having multiple senses. We always use a uniform prior, as the model does not facilitate easy calculation of the prior.

5 Similarity of Words in Context

Estimating word similarity with context is the task of determining how similar two words are, given the contexts in which they occur. The goal of this task is to match human judgements of word similarity. For each of the target words and contexts, we use refitting on the target word to create a word sense embedding specialised for the meaning in the context provided. The similarity of the refitted vectors can then be measured using cosine distance (or similar). By measuring similarity this way, we are defining a new similarity measure.

Reisinger and Mooney [3] define a number of measures for word similarity suitable for use with sense embeddings. The most successful was AvgSimC, which has become the gold standard method for use on similarity tasks. It has been used with great success in many works [4, 5, 11].

AvgSimC is defined using distance metric d (normally cosine distance) as:

$$\begin{aligned} \textrm{AvgSimC}((\textbf{u},\textbf{c}),(\textbf{u}^{\prime },\textbf{c}^{\prime })) = \frac{1}{n \times n^{\prime }} \sum _{u_{i}\in \textbf{u}} \sum _{u_{j}^{\prime }\in \textbf{u}^{\prime }} P(u_{i}\mid \textbf{c})\,P(u_{j}^{\prime }\mid \textbf{c}^{\prime })\,d(u_{i},u_{j}^{\prime }) \end{aligned}$$
(5)

for contexts \(\textbf{c}\) and \(\textbf{c}^\prime \), the contexts of the two words to be compared, and for \(\textbf{u}=\{u_1,...,u_n\}\) and \(\textbf{u}^\prime =\{u^\prime _1,...,u^\prime _{n^\prime }\}\) the respective sets of induced senses of the two words.

Fig. 1. Block diagram for RefittedSim similarity measure

5.1 A New Similarity Measure: RefittedSim

We define a new similarity measure, RefittedSim, as the distance between the refitted sense embeddings. As shown in Fig. 1 the example contexts are used to refit the induced sense embeddings of each word. This is a direct application of Eq. (1).

Using the same definitions as in Eq. (5), RefittedSim is defined as:

$$\begin{aligned} \textrm{RefittedSim}((\textbf{u},\textbf{c}),(\textbf{u}^{\prime },\textbf{c}^{\prime })) \,=\, d(l(\textbf{u}\mid \textbf{c}), l(\textbf{u}^\prime \mid \textbf{c}^\prime )) = d\left( \sum \nolimits _{u_{i}\in \textbf{u}}u_{i}P(u_{i}\mid \textbf{c}),\, \sum \nolimits _{u_{j}^{\prime }\in \textbf{u}^{\prime }}u_{j}^{\prime }P(u_{j}^{\prime }\mid \textbf{c}^{\prime })\right) \end{aligned}$$
(6)

AvgSimC is a probability-weighted average of pairwise distances between the sense vectors of the two words, whereas RefittedSim is a single distance measured between the two refitted vectors – which are themselves probability-weighted averages of the original unsupervised sense vectors.

There is a notable difference in time complexity between AvgSimC and RefittedSim. AvgSimC has time complexity \(O(n\left\| \textbf{c}\right\| +n^{\prime }\left\| \textbf{c}^{\prime }\right\| +n\times n^{\prime })\), while RefittedSim has \(O(n\left\| \textbf{c}\right\| +n^{\prime }\left\| \textbf{c}^{\prime }\right\| )\). The product of the numbers of senses of the two words, \(n \times n^\prime \), may be small for dictionary senses, but it is often large for induced senses. Dictionaries tend to define only a few senses per word – the average number of senses per word in WordNet is less than three [7]. For induced senses, however, it is often desirable to train many more senses, to get better results from the more fine-grained information. Reisinger and Mooney [3] found optimal results in several evaluations near 50 senses. In such cases the \(O(n \times n^\prime )\) term is significant; avoiding it makes RefittedSim more useful for information retrieval.
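The difference is easiest to see side by side. In the hedged sketch below (assuming the sense posteriors \(P(u_i \mid \textbf{c})\) have already been computed, e.g. with the helpers sketched in Sect. 3, and using cosine similarity in place of the generic distance d), AvgSimC loops over all \(n \times n^\prime \) sense pairs, while RefittedSim collapses each word to a single refitted vector first.

```python
import numpy as np

def cosine_sim(a, b):
    a, b = np.asarray(a), np.asarray(b)
    return float(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))

def avg_sim_c(senses, post, senses2, post2):
    """Eq. (5): probability-weighted average over all sense pairs, O(n * n')."""
    total = sum(p_i * p_j * cosine_sim(u_i, u_j)
                for u_i, p_i in zip(senses, post)
                for u_j, p_j in zip(senses2, post2))
    return total / (len(senses) * len(senses2))

def refitted_sim(senses, post, senses2, post2):
    """Eq. (6): one comparison between the two refitted vectors."""
    v = np.asarray(post) @ np.asarray(senses)      # l(u | c)
    v2 = np.asarray(post2) @ np.asarray(senses2)   # l(u' | c')
    return cosine_sim(v, v2)
```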

5.2 Experimental Setup

We evaluate our refitting method using Stanford’s Contextual Word Similarities (SCWS) dataset [4]. During evaluation, each context paragraph is limited to 5 words to either side of the target word, as in the training.
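For example, the context passed to the posterior calculation could be extracted with a small helper such as the following; this is an illustrative sketch, not the evaluation script itself.

```python
def context_window(tokens, target_idx, radius=5):
    """Return up to `radius` tokens either side of the target word."""
    left = tokens[max(0, target_idx - radius):target_idx]
    right = tokens[target_idx + 1:target_idx + 1 + radius]
    return left + right
```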

Table 1. Spearman rank correlation \(\rho \times 100\) when evaluated on the SCWS task.

5.3 Results

Table 1a shows the results of our evaluations on the SCWS similarity task. A significant improvement can be seen by applying our techniques.

The RefittedSim method consistently outperforms AvgSimC across all configurations. Similarly, geometric smoothing consistently improves performance, both for AvgSimC and for RefittedSim. The improvement is significantly larger for RefittedSim than for AvgSimC. In general, using the unsupervised sense prior estimate from the AdaGram model improves performance – particularly for AvgSimC. The exception to this is RefittedSim with smoothing, where it makes very little difference. Unsurprisingly, given their low quality, the Greedy embeddings are always outperformed by AdaGram. It is not clear if these improvements will transfer to clustering-based methods, due to the differences in how the sense probability is estimated compared to the language-model-based methods evaluated in Table 1a.

Table 1b compares our results with those reported in the literature using other methods. These results are not directly comparable, as each method uses a different training corpus, with different preprocessing steps, which can have significant effects on performance. It can be seen that by applying our techniques we bring the results of our AdaGram model from very poor (\(\rho \times 100 = 43.8\)) when using normal AvgSimC without smoothing, up to being competitive with other models when using RefittedSim with smoothing. The method of Chen et al. [11] has a significant lead over the other results presented. This can be attributed to its very effective semi-supervised fine-tuning method. This suggests a possible avenue for future development: using refitted sense vectors to relabel a corpus, and then performing fine-tuning similar to that done by Chen et al.

Fig. 2. Block diagram for performing WSD using refitting.

6 Word Sense Disambiguation

6.1 Refitting for Word Sense Disambiguation

Once refitting has been used to create sense vectors for lexical word senses, an obvious use of them is to perform word sense disambiguation. In this section we refer to the lexical word sense disambiguation problem, i.e. taking a word and finding its dictionary sense; whereas the methods discussed in Eqs. (3) and (4) consider the more general problem, applicable to disambiguating either lexical or induced word senses depending on the inputs. Our overall process, shown in Fig. 2, uses both: first disambiguating the induced senses as part of refitting, then using the refitted sense vectors to find the most likely dictionary sense.

First, refitting is used to transform the induced sense vectors into lexical sense vectors. We use the target word’s lemma (i.e. base form) and part-of-speech (POS) tag to retrieve all possible definitions of the word (glosses) from WordNet; there is one gloss per sense. These glosses are used as the example sentences to perform refitting (see Sect. 3). We find embeddings \(\textbf{l}=\{l_1,..., l_{n_l}\}\) for each of the lexical word senses using Eq. (1). These lexical sense vectors are still supported by the language model, which means one can apply the general WSD method to determine the posterior probability of a word sense given an observed context.
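A rough sketch of this first stage, using NLTK’s WordNet interface to fetch the glosses and the hypothetical `refit` and `sense_posterior` helpers sketched in Sect. 3, is given below; the tokenisation of the gloss is simplified for brevity.

```python
from nltk.corpus import wordnet as wn

def lexical_sense_vectors(lemma, pos, induced_senses, refit, sense_posterior):
    """One refitted embedding per WordNet sense of (lemma, pos)."""
    vectors = {}
    for synset in wn.synsets(lemma, pos=pos):
        gloss = synset.definition().lower().split()  # the gloss acts as the example sentence
        vectors[synset.name()] = refit(induced_senses, gloss, sense_posterior)
    return vectors
```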

When given a sentence \(\textbf{c}_{T}\), containing a target word to be disambiguated, the probability of each lexical word sense \(P(l_i \mid \textbf{c}_{T})\), can be found using Eq. (3) (or the smoothed version Eq. (4)), over the lexically refitted sense embeddings. Then, selecting the correct sense is simply selecting the most likely sense:

$$\begin{aligned} l^\star (\textbf{l}, \textbf{c}_T) = \mathop {\mathrm {arg\,max}}\limits _{\forall l_i \in \textbf{l}} P(l_i \mid \textbf{c}_T) = \mathop {\mathrm {arg\,max}}\limits _{\forall l_i \in \textbf{l}} \frac{P(\textbf{c}_T \mid l_i)P(l_i)}{\sum _{\forall l_j \in \textbf{l}} P(\textbf{c}_T \mid l_j)P(l_j)} \end{aligned}$$
(7)
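Putting the pieces together, Eq. (7) reduces to an argmax over the refitted lexical sense vectors. The sketch below reuses the hypothetical helpers from the earlier sections (a posterior function implementing Eq. (3) or (4), the assumed `word_given_sense` interface, and the prior of Sect. 6.2).

```python
import numpy as np

def disambiguate(lexical_vecs, sentence, posterior_fn, word_given_sense, prior=None):
    """lexical_vecs: dict mapping sense name -> refitted vector. Returns best sense name."""
    names = list(lexical_vecs)
    vecs = [lexical_vecs[name] for name in names]
    post = posterior_fn(vecs, sentence, word_given_sense, prior)  # Eq. (3) or (4)
    return names[int(np.argmax(post))]                            # Eq. (7)
```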

6.2 Lexical Sense Prior

WordNet includes frequency counts for each word sense, based on Semcor [24]. These counts form the prior \(P(l_i)\). The comparatively small size of Semcor means that many word senses do not occur at all. We apply add-one smoothing to remove any zero counts. This is in addition to our proposed geometric smoothing, which is used as an optional part of the general WSD method. Geometric smoothing serves a different (but related) purpose: decreasing the sharpness of the likelihood function, not removing impossibilities from the prior.
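Such a prior could be assembled roughly as below, using NLTK’s access to WordNet’s Semcor-derived lemma counts; the counts are real WordNet data, but the helper itself is an illustrative assumption.

```python
from nltk.corpus import wordnet as wn

def lexical_sense_prior(lemma, pos):
    """Add-one smoothed P(l_i) over the WordNet senses of (lemma, pos)."""
    synsets = wn.synsets(lemma, pos=pos)
    counts = []
    for synset in synsets:
        c = sum(l.count() for l in synset.lemmas() if l.name().lower() == lemma.lower())
        counts.append(c + 1)   # add-one smoothing removes zero counts
    total = float(sum(counts))
    return {synset.name(): c / total for synset, c in zip(synsets, counts)}
```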

6.3 Experimental Setup

The WSD performance is evaluated on the SemEval 2007 Task 7.

We use the weighted mapping method of Agirre et al. [12], (see Sect. 2.2) as a baseline alternative method for using WSI senses for WSD. We use Semcor as the mapping corpus, to derive the mapping weights.

The second baseline we use is the Most Frequent Sense (MFS). This method always disambiguates any word as having its most common meaning. Due to the power law distribution of word senses, this is a very effective heuristic [20]. We also consider the results when using a backoff to MFS: when a method is unable to determine the word sense, it reports the MFS instead of returning no result (a non-attempt).
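The backoff itself is a simple wrapper. In the sketch below (an illustrative assumption, not verbatim code from our pipeline), WordNet’s first listed sense stands in for the MFS, since WordNet orders senses by their Semcor frequency.

```python
from nltk.corpus import wordnet as wn

def disambiguate_with_mfs_backoff(attempt, lemma, pos, sentence):
    """`attempt` returns a sense name, or None for a non-attempt."""
    answer = attempt(lemma, pos, sentence)
    if answer is None:
        synsets = wn.synsets(lemma, pos=pos)
        answer = synsets[0].name() if synsets else None  # MFS backoff
    return answer
```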

Table 2. Results on SemEval 2007 Task 7 – coarse-all-words disambiguation. The -S marks results using geometric smoothing; results with MFS backoff are marked separately.

6.4 Word Sense Disambiguation Results

The results of employing our method for WSD are shown in Table 2. Our results using smoothed refitting, both with AdaGram and with Greedy embeddings with backoff, outperform the MFS baseline [25] – noted as a surprisingly hard baseline to beat [11].

The mapping method [12] was not up to the task of mapping unsupervised senses to lexical senses on this large-scale task; the refitting method works better. However, refitting is only usable for language-model-based sense embedding WSI, whereas the mapping method is suitable for all WSI systems.

While not directly comparable due to the difference in training data, we note that our refitted results are similar in performance, as measured by F1 score, to the results reported by Chen et al. [11]. AdaGram with smoothing and Greedy embeddings with backoff have close to the same result as reported for L2R with backoff – with AdaGram slightly better and the Greedy embeddings slightly worse. They are exceeded by the best method reported in that paper: the S2C method with backoff. Comparison to non-embedding-based methods is not discussed here for brevity. Historically, state-of-the-art systems have functioned very differently, normally approaching the WSD task by more direct means.

Our results are not strong enough for Refitted AdaGram to be used as a WSD method on its own, but do demonstrate that the senses found by refitting are capturing the information from lexical senses. It is now evident that the refitted sense embeddings are able to perform WSD, which was not possible with the unsupervised senses.

7 Conclusion

A new method is proposed for taking unsupervised word sense embeddings and adapting them to align with particular given lexical senses, or with user-provided usage examples. This refitting method thus allows us to find word sense embeddings with known meaning. The method can be seen as solving a one-shot learning task, where only a single labelled example of each class is available for training. We show how our method can be used to create embeddings to evaluate the similarity of words, given their contexts.

This allows us to propose a new similarity measure, RefittedSim. The performance of RefittedSim on AdaGram is comparable to the results reported for other sense embedding techniques using AvgSimC, but its time complexity is significantly lower. We also demonstrate how the same refitting principles can be used to create a set of vectors that are aligned to the meanings in a sense inventory, such as WordNet.

We show how this can be used for word sense disambiguation. On this difficult task, it performs marginally better than the hard-to-beat MFS baseline, and significantly better than a general mapping method used for working with WSI senses on lexical WSD tasks. As part of our refitting method, we present geometric smoothing to overcome the issue of overly dominant sense probability estimates. We show that this significantly improves performance. Our refitting method provides an effective bridge between the vector space representation of meaning and the traditional discrete lexical representation. More generally, it allows a sense embedding to be created to model the meaning of a word in any given sentence. Significant applications of sense embeddings in tasks such as more accurate information retrieval thus become possible.