1 Introduction

Word embedding is a widely used technique for boosting the performance of modern NLP systems; it learns a vector for each word as its semantic feature. The general idea of word embedding is to assign each word a dense vector such that, in a qualified word embedding model, vector similarity tends to reflect word semantic similarity. These vectors are either used directly as feature representations or further fine-tuned with training data from downstream supervised tasks [21, 23]. Although contextualized word embedding models like BERT [4] and ELMo [20] have achieved great success, pre-trained word embedding models like fastText [1] remain vital in scenarios where the text is too short to be fed into BERT. For example, state-of-the-art TextVQA models such as M4C [10] use fastText to generate embeddings for the OCR tokens in images.

Previous studies such as word2vec [16] mainly focus on generating word embeddings as a by-product of training a language model. Later, Levy showed that the skip-gram with negative-sampling (SGNS) training method in word2vec is equivalent to implicitly factorizing a word–context co-occurrence matrix into “word” vectors and “context” vectors [12]. The cells of this word–context co-occurrence matrix are the point-wise mutual information (PMI) values between words and contexts. The PMI value and its variants have long been considered among the best metrics for evaluating the semantic similarity between words [11, 22]. As reported in Levy’s paper, exact factorization with SVD can achieve solutions that are at least as good as SGNS’s solutions on word similarity tasks. Several variants of word2vec have since been proposed [17, 18].

However, this paper argues that factorizing the word–context matrix into two different parts creates a gap between the training and evaluation stages. During training, both the SGNS model and its variants use the inner product between “word” and “context” vectors to approximate the semantic similarity between words, while in the evaluation stage, people use only the “word” vectors and discard the “context” vectors, or vice versa. Such a gap leaves the vector similarity between “word” (“context”) vectors without a clear and unambiguous definition, and it weakens the connection between vector similarity and word semantic similarity. For example, when the inner product between the “word” vector of “happy” and the “context” vector of “birthday” approximates the PMI value of “happy birthday”, there is no reason to believe that the “word–word” inner product between “happy” and “birthday” also approximates the PMI value between them.

To close the above-mentioned gap and quantitatively define the information captured by vector similarity between words, this paper proposes a point-wise mutual information (PMI)-guided word embedding model. We propose three improvements. First, we normalize the “word” vectors and scale up the “context” vectors in the training stage. This forces the PMI value to be encoded mainly by the “context” vectors. Next, we initialize all “word” and “context” vectors such that the angle between any pair of them is no larger than \(30^\circ\). Finally, we introduce a new objective function that explicitly models the joint probability of word pairs conditioned on different contexts, in addition to the conventional uni-gram language models. In this way, our model can capture the word co-occurrence statistics better than current SG models. This paper shows that the resulting vector similarity between “context” vectors approximates the PMI value between words. Therefore, the vector similarity can better reveal the semantic relationship between words, as expected.

In summary, the contributions of this paper are as follows:

  • A new criterion based on point-wise mutual information (PMI) is introduced to describe human intuition about semantic similarity, which makes it possible to test whether the word vectors’ similarity reflects semantic similarity. To the best of our knowledge, this is the first attempt to explicitly define a quantitative criterion for measuring the quality of a word embedding model.

  • Guided by the PMI criterion, this paper develops a novel word embedding model called the PMIVec model, which significantly improves the performance of word embedding and better reveals the semantic relationship between words, as expected. Moreover, experiments on popular public data sets show that our model consistently outperforms state-of-the-art models across both word similarity tasks and sentence embedding tasks.

2 Related work

Mikolov et al. introduce the continuous bag-of-words (CBOW) and skip-gram algorithms to build a language model [16]. The skip-gram algorithm estimates the probability of a context word appearing around a given center word. This is a conditional probability and is therefore asymmetric, i.e., \(p(w_{i}|w_{j})\ne p(w_{j}|w_{i})\), where \(w_{i},w_{j}\) denote different words. To estimate the two different conditional probabilities separately, each word has two vector representations: the “context” vector \({\mathbf {O}}_{i}\) and the “word” vector \({\mathbf {I}}_{i}\). Finally, the skip-gram model sets \(p(w_{i}|w_{k})\propto \exp \left( \left<{\mathbf {O}}_{i},{\mathbf {I}}_{k}\right>\right)\) and \(p(w_{k}|w_{i})\propto \exp \left( \left<{\mathbf {O}}_{k},{\mathbf {I}}_{i}\right>\right)\).

The GloVe model learns word vectors by aggregating a global word–word co-occurrence matrix from a corpus [19]. Specifically, in the training stage, GloVe forces the inner product \(\left<{\mathbf {O}}_{i},{\mathbf {I}}_{j}\right>\) to fit the logarithm of \(w_i\) and \(w_j\)’s co-occurrence count \(C_{ij}\). However, GloVe only uses \(\left<{\mathbf {O}}_{i},{\mathbf {O}}_{j}\right>\) in the downstream tasks and throws \(\left<{\mathbf {I}}_{i},{\mathbf {I}}_{j}\right>\) away, or vice versa. The co-occurrence count information \(C_{ij}\) is not explicitly used in the downstream tasks because \(\left<{\mathbf {O}}_{i},{\mathbf {I}}_{j}\right>\) is never used. PMIVec explicitly avoids this problem by forcing \(\left<{\mathbf {O}}_{i},{\mathbf {O}}_{j}\right>\) to capture the PMI between \(w_i\) and \(w_j\), which is determined by the global word–word co-occurrence matrix.

The FastText model is proposed to enrich word vectors with subword information; that is, it represents each word with a bag of character n-grams, and the SG model learns vector representations not only for the whole word but also for each character n-gram [1]. Therefore, this model can compute word vectors for out-of-vocabulary (OOV) words by summing their character n-grams’ representations. The FastText model inherits most of the SG model’s problems.

Xing et al. point out that there is an inconsistency between the SG model’s objective function used to learn the word vectors (maximum likelihood based on the inner product) and the distance measure used for word vectors (cosine distance), and therefore they developed a normalization model [24]. By normalizing each word vector during training, the inner product becomes equivalent to the cosine distance. However, this model uses only one vector to represent each word, and therefore it fundamentally cannot estimate the asymmetric conditional probabilities well.

In recent years, contextualized word representation learning [20] and pre-trained language models [4] have achieved state-of-the-art results on multiple NLP tasks. In these models, one word has different vector representations depending on its context, so they can handle polysemy better than word embedding models. However, the contextualized models do not perform well on context-free tasks. For example, BERT’s score on the WordSim353 task is 0.477 while skip-gram’s score is 0.711, according to [15]. What is more, we argue that our work can shed light on these models. To be specific, BERT estimates a masked language model during training, and any language model relies on a softmax layer to make predictions. The weight for each word in this softmax layer corresponds to the “context” vector in our model. Therefore, with further analysis, one can extend our model’s conclusion to these models and explore the relationship between the softmax layer’s weights and the contextualized word representations. This topic goes beyond this paper’s scope.

3 Our model

We introduce the PMI’s definition and why it is important in the first subsection. Then, we briefly review the skip-gram model to reveal why its energy function is problematic in the second subsection. In the third subsection, we introduce our model and show why it can solve this problem. Finally, we describe the algorithm of our model in the fourth subsection.

3.1 Point-wise mutual information

PMI is an estimate of how much one word \(w_{i}\) tells us about another word \(w_{k}\) [3]. The notion of a word window is used to define when two words “co-occur”: \(w_{i}\) and \(w_{k}\) co-occur only when they appear in the same word window [16]. Let L denote the word window’s width. By moving the word window across the whole corpus, we can construct a co-occurrence matrix \([C_{ki}]_{V\times V}\), where V is the vocabulary size. Each entry \(C_{ki}\) counts how many times the word \(w_{i}\) appears in \(w_{k}\)’s word window over the whole corpus. Based on these counts, one can calculate how many times the word \(w_{k}\) appears in the corpus in total. Let \(C_{k}\) denote this quantity; then

$$\begin{aligned} C_{k} = 1/L \sum _{i=1}^{V}C_{ki}. \end{aligned}$$

Moreover, let T denote the total number of tokens in the corpus; then one can also calculate \(p(w_{k})\) and \(p(w_{i}|w_{k})\) by

$$\begin{aligned} p(w_{k})=\frac{C_{k}}{T};\ \ p(w_{i}|w_{k})=\frac{C_{ki}}{L C_{k}}. \end{aligned}$$
(1)

Finally, the PMI is defined as

$$\begin{aligned} \mathrm{PMI}(w_{i},w_{k})=\log \left( \frac{p(w_{i}|w_{k})}{p(w_{i})}\right) . \end{aligned}$$
(2)

The PMI evaluated from co-occurrence counts has a strong linear relationship with human semantic similarity judgments from survey data [8]. Therefore, it is reasonable to associate word embedding with the PMI.
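The construction above can be sketched in a few lines of code. The snippet below is a minimal illustration, not the script used later in Sect. 4.1; the toy corpus, the window width, and the whitespace tokenization are all placeholders.

```python
import math
from collections import Counter, defaultdict

def pmi_table(tokens, L=10):
    """Estimate PMI(w_i, w_k) from raw tokens with a symmetric word window of width L."""
    T = len(tokens)
    C = defaultdict(Counter)                      # C[k][i]: times w_i appears in w_k's window
    for pos, w_k in enumerate(tokens):
        for ctx in range(max(0, pos - L // 2), min(T, pos + L // 2 + 1)):
            if ctx != pos:
                C[w_k][tokens[ctx]] += 1
    C_k = {k: sum(row.values()) / L for k, row in C.items()}    # occurrence counts of w_k
    p = {k: C_k[k] / T for k in C_k}                            # p(w_k), Eq. (1)
    pmi = {}
    for k, row in C.items():
        for i, c_ki in row.items():
            p_i_given_k = c_ki / (L * C_k[k])                   # p(w_i | w_k), Eq. (1)
            pmi[(i, k)] = math.log(p_i_given_k / p[i])          # Eq. (2)
    return pmi

# toy usage; a real run would stream a large corpus instead
corpus = "the happy birthday song makes the happy crowd sing happy birthday".split()
print(pmi_table(corpus, L=4)[("birthday", "happy")])
```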

3.2 Skip-gram model revisit

Consider a sequence of training words \(w_{v_1},\ldots ,w_{v_T}\), in which each word \(w_{v_{k}}\) is drawn from a fixed vocabulary of size V. The original objective of the SG model is to maximize the average log-likelihood

$$\begin{aligned} \frac{1}{T}\sum _{k=1}^{T}\sum _{i=k-L/2}^{k+L/2}\log \left( p\left( w_{v_i}|w_{v_k}\right) \right) ,\ i\ne k. \end{aligned}$$
(3)

For simplicity, let \(w_{i}\) denote \(w_{v_i}\). Then, \(p(w_{i}|w_{k})\) is modeled by the Gibbs distribution with \(\left<{\mathbf {O}}_{i},{\mathbf {I}}_{k}\right>\) as its energy function

$$\begin{aligned} p(w_{i}|w_{k})=\frac{\exp (\left<{\mathbf {O}}_{i},{\mathbf {I}}_{k}\right>)}{\sum _{j=1}^{V}\exp (\left<{\mathbf {O}}_{j},{\mathbf {I}}_{k}\right>)}, \end{aligned}$$
(4)

where \({\mathbf {O}}_{i}\) is the “context” vector of word \(w_i\), and \({\mathbf {I}}_{k}\) is the “word” vector of word \(w_k\). The conditional probability \(p(w_i|w_k)\) describes how likely the word \(w_i\) is to appear around the word \(w_k\), and it is called a uni-gram language model.
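For concreteness, the following is a minimal sketch of Eq. (4) with randomly initialized vectors; the vocabulary size and dimension are illustrative, not the values used in the experiments.

```python
import numpy as np

V, d = 1000, 100                             # illustrative vocabulary size and dimension
rng = np.random.default_rng(0)
O = rng.normal(scale=0.1, size=(V, d))       # "context" vectors O_i
I = rng.normal(scale=0.1, size=(V, d))       # "word" vectors I_k

def p_sg(i, k):
    """p(w_i | w_k) under the skip-gram Gibbs distribution of Eq. (4)."""
    scores = O @ I[k]                        # <O_j, I_k> for every j in the vocabulary
    scores -= scores.max()                   # subtract the max for numerical stability
    exp_scores = np.exp(scores)
    return exp_scores[i] / exp_scores.sum()

print(p_sg(3, 7))
```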

However, optimizing function (3) is impractical because the denominator of Eq. (4), a summation over V terms (usually \(10^{5}\)–\(10^{7}\) terms), is difficult to calculate and optimize [16]. To solve this problem, the SG model introduces the famous negative-sampling technique by simplifying the noise contrastive estimation (NCE) method [7]. The following theorem shows that, based on the energy function used in (4), the negative-sampling method leads to a problematic solution.

Theorem 1

Assume that, for any conditional probability matrix \(\left[ p(w_i|w_k)\right] _{V\times V}\), there exist \(\{{\mathbf {O}}_i\in {\mathcal {R}}^d\}_{i=1,\ldots ,V}\) and \(\{{\mathbf {I}}_k\in {\mathcal {R}}^d\}_{k=1,\ldots ,V}\) such that Eq. (4) holds for all \(i,k\). Then, the optimal solution to the SG model’s negative-sampling loss has the following property:

$$\begin{aligned} \left<{\mathbf {O}}_{i},{\mathbf {I}}_{k}\right>=\mathrm{PMI}\left( w_{i},w_{k}\right) -\log K, \end{aligned}$$
(5)

where the constant K is the number of negative samples per positive sample.

The detailed definition of the negative-sampling loss and the proof of Theorem 1 can be found in the appendices. Although Levy derived Eq. (5) in [12] as well, they needed to assume that both \({\mathbf {O}}_i\) and \({\mathbf {I}}_k\) have infinite dimension, which is obviously unrealistic. Theorem 1 does not require such an assumption.

Although the inner product between \(w_{i}\)’s context vector and \(w_{k}\)’s word vector is meaningful, what people really use in practice is \(\left<{\mathbf {O}}_{i},{\mathbf {O}}_{k}\right>\) or \(\left<{\mathbf {I}}_{i},{\mathbf {I}}_{k}\right>\). We argue that this gap will limit the performance of word embedding.

3.3 PMIVec model

We propose to use the inner product between the “context” vectors of two words to approximate the PMI value between them. For any pair of words \(w_i\) and \(w_j\), their “context” vector representations, \({\mathbf {O}}_i\) and \({\mathbf {O}}_j\), should satisfy the following equation:

$$\begin{aligned} 2*\left<{\mathbf {O}}_{i},{\mathbf {O}}_{j}\right>=\mathrm{PMI}\left( w_{i},w_{j}\right) =\log \left( \frac{p(w_i,w_j)}{p(w_i)p(w_j)}\right) . \end{aligned}$$
(6)

Notice that both sides of Eq. (6) are symmetric in \(w_i\) and \(w_j\); this resolves the dilemma described in the Introduction. What is more, Eq. (6) also quantitatively defines which statistical information between a pair of words has been captured by their vector representations. To achieve Eq. (6), one key observation is that its left-hand side is determined by three vector norms, i.e.,

$$\begin{aligned} \mathbf {LHS} = \left\| {\mathbf {O}}_i+{\mathbf {O}}_j\right\| _2^2 - \left\| {\mathbf {O}}_i\right\| _2^2 - \left\| {\mathbf {O}}_j\right\| _2^2, \end{aligned}$$

and its right-hand side is also determined by three terms, i.e.,

$$\begin{aligned} \mathbf {RHS} = \log \left( p(w_i,w_j)\right) -\log \left( p(w_i)\right) -\log \left( p(w_j)\right) . \end{aligned}$$

Therefore, if one can design a Gibbs distribution and its energy function such that its optimal solution satisfies \(\Vert {\mathbf {O}}_i\Vert _2^2\propto \log (p(w_i))\), then Eq. (6) can be achieved.
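The decomposition of the left-hand side is simply the polarization identity, which can be checked numerically; the random vectors below are purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
O_i, O_j = rng.normal(size=100), rng.normal(size=100)

# 2<O_i, O_j> = ||O_i + O_j||^2 - ||O_i||^2 - ||O_j||^2
lhs = 2 * (O_i @ O_j)
decomposed = (np.linalg.norm(O_i + O_j) ** 2
              - np.linalg.norm(O_i) ** 2
              - np.linalg.norm(O_j) ** 2)
assert np.isclose(lhs, decomposed)
```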

This paper proposes to use the following Gibbs distribution to fit \(p(w_i|w_k)\):

$$\begin{aligned} p\left( w_{i}|w_{k}\right) =\frac{\exp \left( \Vert {\mathbf {O}}_{i}\Vert _2\left<{\mathbf {O}}_{i},\frac{{\mathbf {I}}_{k}}{\Vert {\mathbf {I}}_{k}\Vert _2}\right>\right) }{\sum _{j=1}^{V}\exp \left( \Vert {\mathbf {O}}_{j}\Vert _2\left<{\mathbf {O}}_{j},\frac{{\mathbf {I}}_{k}}{\Vert {\mathbf {I}}_{k}\Vert _2}\right>\right) }. \end{aligned}$$
(7)

A proof sketch of why (7) works better than (4) starts from the fact that its optimal solutions satisfy the following equation:

$$\begin{aligned} p(w_i|w_k)*Z_k = \exp \left( \left\| {\mathbf {O}}_i\right\| _2^2*\cos \left( \theta _{ik}\right) \right) , \end{aligned}$$
(8)

where \(Z_k\) is the denominator in (7), and \(\theta _{ik}\) is the angle between \({\mathbf {O}}_i\) and \({\mathbf {I}}_k\). Then, the expectation of both sides of (8) with respect to \(p(w_k)\) is

$$\begin{aligned} \sum \limits _{k=1}^{V}p\left( w_i,w_k\right) Z_k =\sum \limits _{k=1}^{V} \exp \left( \Vert {\mathbf {O}}_i\Vert _2^2\cos (\theta _{ik})\right) p(w_k). \end{aligned}$$
(9)

If \(\forall i,k\), \(\theta _{ik}\le 30^{\circ }\), then no matter what the distribution \(p(w_k)\) is, the RHS of (9) approximates \(\exp \left( \Vert {\mathbf {O}}_i\Vert _2^2\right)\), because \(\cos (\theta _{ik})\approx 1\). This assumption can be ensured by properly initializing and regularizing the angle between \({\mathbf {O}}_i\) and \({\mathbf {I}}_k\). The LHS of (9) is approximated by \(p(w_i){\hat{Z}}\), where \({\hat{Z}}\) is the mean value of all \(Z_k\). One important phenomenon about the partition functions \(Z_k\) is that they tend to concentrate around their mean value. The rigorous proof can be found in Chapter 7 of Ma’s Ph.D. thesis [14]. Finally, taking the \(\log\) of both sides of (9) gives

$$\begin{aligned} \log \left( p(w_i)\right) + \log ({\hat{Z}})\approx \left\| {\mathbf {O}}_i\right\| _2^2. \end{aligned}$$
(10)

The analysis above briefly shows why the energy function \(\Vert {\mathbf {O}}_{i}\Vert _2\left<{\mathbf {O}}_{i},\frac{{\mathbf {I}}_{k}}{\Vert {\mathbf {I}}_{k}\Vert _2}\right>\) and the proper initialization are necessary for property (6). The analysis of \(\Vert {\mathbf {O}}_i+{\mathbf {O}}_j\Vert _2^2\) is similar. To ensure property (6), one also needs to fit \(p(w_i,w_j|w_k)\) using the following Gibbs distribution

$$\begin{aligned} p\left( w_{i},w_{j}|w_{k}\right) =\frac{\exp \left( \Vert {\mathbf {O}}_{i}+{\mathbf {O}}_{j}\Vert _2^2\cos (\theta _{(ij)k})\right) }{Z_k^\prime }, \end{aligned}$$
(11)

where \(\theta _{(ij)k}\) is the angle between \({\mathbf {O}}_i+{\mathbf {O}}_j\) and \({\mathbf {I}}_k\), and \(Z_k^\prime\)’s definition is

$$\begin{aligned} Z_k^\prime =\sum \limits _{(m,n)}^{C_V^2+V}\exp \left( \left\| {\mathbf {O}}_{m}+{\mathbf {O}}_{n}\right\| _2^2\cos \left( \theta _{(mn)k}\right) \right) . \end{aligned}$$
(12)
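As a concrete illustration of the two Gibbs distributions in Eqs. (7) and (11), the sketch below evaluates both with randomly initialized vectors; the sizes are kept tiny so that the \(C_V^2+V\)-term partition function of Eq. (12) can be enumerated directly, and the vectors are clustered around a common direction so that the small-angle assumption roughly holds.

```python
import numpy as np

V, d = 50, 20                                 # tiny, illustrative sizes
rng = np.random.default_rng(0)
base = rng.normal(size=(1, d))
O = base + 0.1 * rng.normal(size=(V, d))      # "context" vectors, clustered so angles stay small
I = base + 0.1 * rng.normal(size=(V, d))      # "word" vectors

def p_word(i, k):
    """p(w_i | w_k) under Eq. (7): energy ||O_i|| <O_i, I_k / ||I_k||>."""
    u = I[k] / np.linalg.norm(I[k])
    energies = np.linalg.norm(O, axis=1) * (O @ u)
    energies -= energies.max()                # numerical stability
    e = np.exp(energies)
    return e[i] / e.sum()

def p_pair(i, j, k):
    """p(w_i, w_j | w_k) under Eq. (11): energy ||O_i + O_j||^2 cos(theta_(ij)k)."""
    u = I[k] / np.linalg.norm(I[k])
    pairs = [(m, n) for m in range(V) for n in range(m, V)]   # the C_V^2 + V unordered pairs
    sums = np.array([O[m] + O[n] for m, n in pairs])
    cos = (sums @ u) / np.linalg.norm(sums, axis=1)
    energies = np.linalg.norm(sums, axis=1) ** 2 * cos
    energies -= energies.max()
    e = np.exp(energies)
    return e[pairs.index((min(i, j), max(i, j)))] / e.sum()

print(p_word(3, 7), p_pair(3, 5, 7))
```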

The following theorem shows that, based on the three modifications proposed in this paper, there exist solutions that have property (6).

Theorem 2

Assuming that, for any conditional probability matrices \([p(w_i|w_k)]_{V\times V}\) and \([p(w_i,w_j|w_k)]_{(C_V^2+V)\times V}\), there exist \(\{{\mathbf {O}}_i\in {\mathcal {R}}^d\}_{i=1,\ldots ,V}\) and \(\{{\mathbf {I}}_k\in {\mathcal {R}}^d\}_{k=1,\ldots ,V}\), such that

  • \(\forall i,j,k\), the angles \(\theta _{ik},\theta _{jk},\theta _{(ij)k}\le 30^\circ\),

  • \(\forall i,j,k\), Eqs. (7) and (11) both hold,

then Eq. (6) holds for any pair of words \(w_i,w_j\).

The detailed proof of Theorem 2 can be found in the appendices.

3.4 PMIVec algorithm

To fit \(p(w_i|w_k)\) and \(p(w_i,w_j|w_k)\), the PMIVec algorithm adopts the NCE method [7] instead of the negative-sampling technique. The resulting loss function consists of two parts.

The first part of the loss fits \(p(w_i|w_k)\):

$$\begin{aligned} \log \sigma (s(i;k))+\sum _{t}^{K}\log \sigma (-s(t;k)), \end{aligned}$$
(13)

where \(s(i;k)=\frac{\Vert {\mathbf {O}}_i\Vert _2}{\Vert {\mathbf {I}}_k\Vert _2}\left<{\mathbf {O}}_i,{\mathbf {I}}_k\right>-\log Z_k-\log Kp_n(w_i)\). Here, \(Z_k\) is treated as a constant to be optimized, and \(p_n(w_i)\) is an arbitrary noise distribution. The first part of the loss is denoted as \(L_1\).

The second part of the loss fits \(p(w_i,w_j|w_k)\):

$$\begin{aligned} \log \sigma (s(i,j;k))+\sum _{m,n}^{K^2}\log \sigma (-s(m,n;k)), \end{aligned}$$
(14)

where

$$\begin{aligned} s(i,j;k)={}&\frac{\Vert {\mathbf {O}}_i+{\mathbf {O}}_j\Vert _2}{\Vert {\mathbf {I}}_k\Vert _2}\left<{\mathbf {O}}_i+{\mathbf {O}}_j,{\mathbf {I}}_k\right>\\ &-\log Z_k^\prime -\log K^2p_n(w_i)p_n(w_j). \end{aligned}$$

The second part of the loss is denoted as \(L_2\).
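The two score functions and loss terms can be written down directly; the sketch below uses numpy, treats the per-context constants \(\log Z_k\) and \(\log Z_k^\prime\) as given arrays, and uses a uniform noise distribution as a placeholder. Note that Eqs. (13) and (14) are objectives to be maximized.

```python
import numpy as np

def log_sigmoid(x):
    return -np.logaddexp(0.0, -x)                 # log sigma(x), numerically stable

def s1(O, I, logZ, p_n, i, k, K):
    """Score s(i;k) used in Eq. (13)."""
    return (np.linalg.norm(O[i]) / np.linalg.norm(I[k])) * (O[i] @ I[k]) \
        - logZ[k] - np.log(K * p_n[i])

def s2(O, I, logZp, p_n, i, j, k, K):
    """Score s(i,j;k) used in Eq. (14)."""
    Oij = O[i] + O[j]
    return (np.linalg.norm(Oij) / np.linalg.norm(I[k])) * (Oij @ I[k]) \
        - logZp[k] - np.log(K**2 * p_n[i] * p_n[j])

def objective_L1(O, I, logZ, p_n, i, k, negatives, K):
    """Eq. (13): one positive context word i plus K negative samples."""
    out = log_sigmoid(s1(O, I, logZ, p_n, i, k, K))
    for t in negatives:
        out += log_sigmoid(-s1(O, I, logZ, p_n, t, k, K))
    return out

def objective_L2(O, I, logZp, p_n, i, j, k, pair_negatives, K):
    """Eq. (14): one positive context pair (i, j) plus K^2 negative pairs."""
    out = log_sigmoid(s2(O, I, logZp, p_n, i, j, k, K))
    for m, n in pair_negatives:
        out += log_sigmoid(-s2(O, I, logZp, p_n, m, n, k, K))
    return out

# illustrative usage with random parameters
V, d, K = 100, 50, 5
rng = np.random.default_rng(0)
O, I = rng.normal(size=(V, d)), rng.normal(size=(V, d))
logZ, logZp, p_n = np.zeros(V), np.zeros(V), np.full(V, 1.0 / V)
print(objective_L1(O, I, logZ, p_n, 3, 7, rng.integers(0, V, size=K), K))
```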

Algorithm 1

In Algorithm 1, “\(\leftarrow\)” means querying a dictionary, and “\(\sim\)” represents sampling from a distribution. The definition of the Riemannian gradient update in line 22 can be found in [15].

In line 3 of Algorithm 1, PMIVec retrieves word \(w_k\)’s “word” vector \({\mathbf {I}}_k\). PMIVec performs the same retrieval operation for the “context” vectors \({\mathbf {O}}_:\) of words \(w_i\), \(w_j\), \(w_t\), \(w_m\), and \(w_n\) in lines 5, 6, 12, 18, and 19, respectively.

From lines 4 to 9, PMIVec samples the context word pair \((w_i, w_j)\) from \(w_k\)’s word window and performs the corresponding gradient updates. More specifically, PMIVec first iterates over the context words \(w_i\) for \(L_1\), and then samples another context word \(w_j\) randomly from \(w_k\)’s word window.

From lines 10 to 14, PMIVec samples K negative samples \(w_t\) from the whole vocabulary for \(L_1\) and updates the gradient. In line 11 of Algorithm 1, \(w_t\) is sampled from the noise distribution \(p_n(w_t)\).

From lines 15 to 21, PMIVec samples \(K^2\) negative sample pairs \((w_m,w_n)\) from the whole vocabulary for \(L_2\) and updates the gradient; this negative-sampling operation happens in line 17.
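The overall loop can be pictured with the rough PyTorch sketch below. It mirrors the score functions above, replaces the Riemannian gradient update of [15] with plain SGD, uses a uniform noise distribution, and runs on a random token stream; all names, sizes, and hyper-parameters here are illustrative rather than the paper’s actual settings.

```python
import math
import torch
import torch.nn.functional as F

V, d, K, L, lr = 1000, 100, 5, 10, 0.025                      # illustrative hyper-parameters
center = torch.randn(1, d)
O = torch.nn.Parameter(center + 0.1 * torch.randn(V, d))      # "context" vectors, clustered init
I = torch.nn.Parameter(O.detach().clone())                    # "word" vectors start aligned with O
logZ = torch.nn.Parameter(torch.zeros(V))                     # learnable log Z_k
logZp = torch.nn.Parameter(torch.zeros(V))                    # learnable log Z'_k
opt = torch.optim.SGD([O, I, logZ, logZp], lr=lr)
log_pn = torch.log(torch.full((V,), 1.0 / V))                 # log of a uniform noise distribution

def s1(i, k):
    return O[i].norm() / I[k].norm() * (O[i] @ I[k]) - logZ[k] - (math.log(K) + log_pn[i])

def s2(i, j, k):
    Oij = O[i] + O[j]
    return Oij.norm() / I[k].norm() * (Oij @ I[k]) - logZp[k] - (2 * math.log(K) + log_pn[i] + log_pn[j])

corpus = torch.randint(0, V, (200,)).tolist()                 # stand-in for the real token stream
for pos, k in enumerate(corpus):
    window = [corpus[c] for c in range(max(0, pos - L // 2), min(len(corpus), pos + L // 2 + 1)) if c != pos]
    for i in window:                                          # lines 4-9: positives for L1 and L2
        j = window[torch.randint(len(window), (1,)).item()]
        loss = -(F.logsigmoid(s1(i, k)) + F.logsigmoid(s2(i, j, k)))
        for t in torch.randint(0, V, (K,)).tolist():          # lines 10-14: K negatives for L1
            loss = loss - F.logsigmoid(-s1(t, k))
        for m, n in torch.randint(0, V, (K * K, 2)).tolist():  # lines 15-21: K^2 negative pairs for L2
            loss = loss - F.logsigmoid(-s2(m, n, k))
        opt.zero_grad(); loss.backward(); opt.step()          # plain SGD in place of line 22
```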

4 Experiments

In this section, we validate the effectiveness of our word embedding model on context-free tasks and compare it with state-of-the-art models, including the SG model, the FastText model, the GloVe model, and the JoSE model. An SG embedding trained with extra epochs is also reported for the sake of fairness. We also include BERT’s results reported in JoSE for comparison. Our model is denoted as PMIVec.

For all models, we set the word window width to 10, the number of negative samples to 5, the number of epochs to 5, and the word vector dimension to 100. Since our model updates 3 positive examples and 30 negative examples at each step, we also train an extra SG embedding with 15 epochs; the number of negative samples for this SG model is likewise set to 30. The other hyper-parameters are set to their default values.

The data set used in our experiments is the MBTA (Massachusetts Bay Transportation Authority) corpus, which is crawled from the web and is a subset of the GloWbE corpus. The MBTA corpus is well cleaned and contains about 700 million tokens. The minimum vocabulary count is set to 100, i.e., we discard words that appear fewer than 100 times in the corpus. Note that we did not evaluate our model on the famous Wiki training corpus as in some previous work [15, 16]. This is because the script provided to transform its format from XML to text is problematic: by looking carefully at the resulting vocabulary of the Wiki corpus, one can find that many non-English characters remain. We believe that these characters would introduce a lot of noise into the estimates of \(p(w_k),p(w_i|w_k)\), and \(p(w_i,w_j|w_k)\), and are harmful to the evaluation of word embedding models.

In Sect. 4.1, to directly check whether our model’s embedding fits the PMI between words better than the original word2vec model, we calculate the Pearson correlation coefficient between the word–word PMI and word–word vector similarities. Section 4.2 presents the performance of all embedding models in the word similarity tasks. Section 4.3 shows the performance of all embedding models in the sentence embedding tasks.

Fig. 1 Pearson correlation coefficient between the vector similarity and the PMI value in a large vocabulary

4.1 Quantitative model evaluation

In this subsection, we propose to judge the similarity between words according to the PMI between them; this quantity has a specific definition and can be calculated for any corpus. We implement a script to estimate the PMI between words from the raw corpus according to the definition in Eq. (2). Some words are so rare that they may never appear around another word, i.e., \(p(w_i|w_k)\) may be zero; we simply ignore such word pairs. Finally, we calculate the Pearson correlation coefficient between the vector similarities and the word pairs’ PMI values.
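A minimal sketch of this evaluation is shown below; it reuses a PMI table estimated as in Sect. 3.1, and the embedding matrix and word-to-index mapping are placeholders. The inner product is used as the vector similarity here, matching Eq. (6); cosine similarity can be substituted directly.

```python
import numpy as np
from scipy.stats import pearsonr

def pmi_vs_similarity(pmi, vectors, word2idx):
    """Pearson correlation between corpus-estimated PMI values (Eq. (2)) and vector similarities.

    `pmi` maps word pairs (w_i, w_k) to PMI estimates, `vectors` holds one embedding per row
    (e.g. the "context" vectors O), and `word2idx` maps each word to its row index.
    """
    pmi_vals, sims = [], []
    for (wi, wk), value in pmi.items():
        if wi != wk and wi in word2idx and wk in word2idx:
            pmi_vals.append(value)
            sims.append(vectors[word2idx[wi]] @ vectors[word2idx[wk]])   # inner product, cf. Eq. (6)
    return pearsonr(pmi_vals, sims)[0]
```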

The results are presented in Fig. 1. We can observe that the PMIVec model’s context vectors have the highest Pearson value, which means that they capture the PMI information between words better than the other vectors.

What is more, in the original SG model, the word vectors are more consistent with the PMI values than their context counterparts, while the situation is the opposite in the PMIVec model: the context vectors are the more consistent ones. This is not surprising because we have normalized the word vectors so that the context vectors can capture more of the co-occurrence information between words. As for the seemingly contradictory results between Table 1 and Fig. 1, we argue that they stem from the difference in vocabulary size. To be specific, the maximum vocabulary size in Table 1 is 1000 pairs of words, while Fig. 1’s vocabulary size is above 10,000. With more noise in the vocabulary, the Pearson correlation coefficient decreases dramatically and may appear to contradict some local results presented in Table 1. In fact, in the sentence embedding tasks, the context vectors are consistently better than the word vectors in the PMIVec model.

Table 1 Pearson correlation coefficient rank on word similarity evaluation

4.2 Word similarity tasks

To compare the quality of different models’ embeddings, previous papers tend to examine whether the word vector similarities agree with human judgments. For example, the test part of the MEN data set [2] contains 1000 pairs of words together with human-assigned similarity judgments; one such example is “bird-insect-37.0”, where “37.0” is the human-assigned similarity score. These papers calculate the vector similarity scores for the word pairs and then compute the Pearson correlation coefficient between the vector similarity scores and the human-assigned similarity scores.

We also validate the performance of different models on a series of popular word similarity tasks, including the WordSim353 data set [6], the MEN data set [2], the SimLex999 data set [9], the RW data set [13], and the RG65 data set [13]. These data sets are all human annotated but differ in scale and coverage. Some words in these test data sets do not appear in our training corpus, which means we cannot calculate the inner product between vectors for those words. To provide comparable results, we simply remove these words and calculate the Pearson correlation coefficient on the remaining words. What is more, for the BERT model, we simply adopt the results reported in JoSE. Even though we use a different training corpus, which is smaller than the Wiki corpus, these results still support our point that contextualized word representation models perform poorly on context-free tasks.
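A minimal sketch of this evaluation protocol is given below; the data file format (one “word1 word2 score” triple per line) and the helper names are assumptions for illustration, and pairs containing out-of-vocabulary words are skipped as stated above.

```python
import numpy as np
from scipy.stats import pearsonr

def evaluate_word_similarity(path, vectors, word2idx):
    """Pearson correlation between vector similarities and human-assigned scores,
    skipping word pairs that contain out-of-vocabulary words."""
    human, predicted = [], []
    with open(path) as f:
        for line in f:
            w1, w2, score = line.strip().split()            # e.g. "bird insect 37.0"
            if w1 in word2idx and w2 in word2idx:
                v1, v2 = vectors[word2idx[w1]], vectors[word2idx[w2]]
                predicted.append(v1 @ v2 / (np.linalg.norm(v1) * np.linalg.norm(v2)))
                human.append(float(score))
    return pearsonr(human, predicted)[0]
```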

We can observe that on most data sets our model outperforms the SOTA models. For the sake of fairness, we have included an extra version of the SG model with more epochs and more negative samples (see Tables 2 and 3).

Table 2 Pearson correlation coefficient rank on sentence similarity evaluation

4.3 Sentence embedding tasks

The purpose of this subsection is to explore whether our embedding can boost the performance of downstream NLP tasks better than the other word embedding models. Since our embedding shows superiority in the word similarity tasks, it is reasonable to believe that we can extend this advantage to sentence embedding tasks with a simple generation procedure. We choose the bag-of-words (BOW) approach to generate sentence embeddings from the word embeddings, i.e., by averaging the word embeddings. More complex sentence embedding generation methods go beyond this paper’s scope; therefore, BERT’s and ELMo’s results are not reported here. There are two kinds of sentence embedding tasks. One is the sentence relatedness task, and the other is the sentence classification task. For the relatedness tasks, we evaluate how the cosine distance between two sentences correlates with a human-labeled similarity score. For the classification tasks, we use a multi-layer perceptron that takes the sentence embedding as input.
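A minimal sketch of the bag-of-words sentence embedding and the relatedness score is shown below; the whitespace tokenization is an assumption for illustration, and out-of-vocabulary tokens are simply skipped.

```python
import numpy as np

def sentence_embedding(sentence, vectors, word2idx):
    """Average the embeddings of the in-vocabulary tokens (bag-of-words)."""
    rows = [vectors[word2idx[w]] for w in sentence.lower().split() if w in word2idx]
    if not rows:
        return np.zeros(vectors.shape[1])
    return np.mean(rows, axis=0)

def relatedness_score(s1, s2, vectors, word2idx):
    """Cosine similarity between two sentence embeddings, as used in the relatedness tasks."""
    e1 = sentence_embedding(s1, vectors, word2idx)
    e2 = sentence_embedding(s2, vectors, word2idx)
    return e1 @ e2 / (np.linalg.norm(e1) * np.linalg.norm(e2) + 1e-12)
```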

The test data sets for the relatedness tasks include STS 14, 15, 16, and SICK-Relatedness. Similar to the word similarity tasks, each record in these data sets contains a premise sentence and a hypothesis sentence. Take STS 14 as an example: one premise sentence is “Liquid ammonia leak kills 15 in Shanghai”, and the corresponding hypothesis sentence is “Liquid ammonia leak kills at least 15 in Shanghai”. The label for this pair of sentences is 4.6; the minimum score is 0 and the maximum score is 5.

For the classification tasks, we use the MR, CR, MPQA, and SUBJ data sets to test our model’s performance. The MR data set concerns sentiment classification of movie reviews. For example, the sentence “Too slow for a younger crowd, too shallow for an older one.” has a negative label. The remaining test data sets are similar but are collected from different domains.

As we can see, our model outperforms the other models on most data sets by a large margin, except for the STS 16 data set. In the sentence embedding tasks, the context vectors of our model are consistently better than their word vector counterparts, except on the MPQA data set. We argue that this is because each sample in this data set is a phrase rather than a sentence. For example, a positive record looks like “liberation and independence”, and a negative one looks like “suffering from some intoxication”.

Table 3 Precision rank on sentence classification evaluation

5 Conclusion

In this paper, we introduce PMI as a new criterion to describe human intuition about semantic similarity, which makes it possible to test whether the word vectors’ similarity reflects semantic similarity. Guided by the PMI criterion, this paper develops a novel word embedding model called the PMIVec model. Different from previous work, our model explicitly models the joint probability of word pairs conditioned on different contexts in addition to the conventional uni-gram language model, and thus captures word co-occurrence statistics better than current SG models. Besides, the resulting vector similarity between any two words is more consistent with human expectations. Experiments show that our model improves the performance of word embedding on both word similarity tasks and sentence embedding tasks.