
1 Introduction

Today word embeddings play an important role in many natural language processing (NLP) tasks, from predictive language models and machine translation to image annotation and question answering, where they are usually ‘plugged in’ to a larger model. An understanding of their properties is of interest as it may allow the development of better-performing embeddings and improved interpretability of models that use them. One of the most widely used word embedding models is the Skip-gram with negative sampling (SGNS) of Mikolov et al. (2013). Levy and Goldberg (2014) showed that SGNS implicitly factorizes a pointwise mutual information (PMI) matrix shifted by a global constant, and that a low-rank approximation of the PMI matrix by truncated singular value decomposition (SVD) can produce word vectors comparable to those of SGNS. However, truncated SVD is not the only way of finding a low-rank approximation of a matrix. It is optimal in the sense that it minimizes the approximation error in the Frobenius and 2-norms, but this does not mean that it produces optimal word embeddings, which are usually evaluated in downstream NLP tasks. The question is: is there any other method of low-rank matrix approximation that produces better word embeddings than truncated SVD? Our experiments show that truncated SVD is in fact a strong baseline, which we failed to beat with two other widely used low-rank approximation methods.

2 Low-Rank Approximations of the PMI Matrix

The simplest version of a PMI matrix is a symmetric matrix with each row and column indexed by words, and with elements defined as

$$\begin{aligned} \textrm{PMI}(i,j)=\log \frac{p(i,j)}{p(i)p(j)}, \end{aligned}$$
(1)

where p(i, j) is the probability that the words i and j appear within a window of a certain size in a large corpus, and p(i) is the unigram probability of the word i. For computational purposes, Levy and Goldberg (2014) suggest using the positive PMI (PPMI), defined as

$$\begin{aligned} \textrm{PPMI}(i,j)=\max (\textrm{PMI}(i,j),0). \end{aligned}$$
(2)

They also show empirically that a low-rank SVD of the PPMI matrix produces word vectors that are comparable in quality to those of SGNS.
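As an illustration, the PPMI matrix in (2) could be built from a word-word co-occurrence count matrix roughly as in the following sketch (a dense NumPy version for clarity; the helper name ppmi_matrix and the use of marginal co-occurrence counts as unigram probabilities are our assumptions, not the exact pipeline used in this work):

```python
import numpy as np

def ppmi_matrix(counts):
    """PPMI matrix from a symmetric word-word co-occurrence count matrix.

    counts[i, j] is the number of times words i and j co-occur within the window.
    """
    total = counts.sum()
    p_ij = counts / total                     # joint probabilities p(i, j)
    p_i = counts.sum(axis=1) / total          # marginals used as unigram probabilities p(i)
    with np.errstate(divide="ignore", invalid="ignore"):
        pmi = np.log(p_ij) - np.log(np.outer(p_i, p_i))
    pmi[~np.isfinite(pmi)] = 0.0              # zero counts give -inf/NaN; treat them as 0
    return np.maximum(pmi, 0.0)               # PPMI = max(PMI, 0)
```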

Low-rank matrix approximation is the problem of approximating a matrix by one whose rank is lower than that of the original matrix. The goal is to obtain a more compact representation of the data with a limited loss of information. In what follows we give a brief overview of the low-rank approximation methods used in our work. Since both PMI (1) and PPMI (2) are square matrices, we consider the approximation of square matrices. For a thorough and up-to-date review of low-rank approximation methods, see Kishore Kumar and Schneider (2017).

Singular Value Decomposition (SVD) factorizes \(\textbf{A}\in \mathbb {R}^{n\times n}\) into the matrices \(\textbf{U}\in \mathbb {R}^{n\times n}\), \(\textbf{S}\in \mathbb {R}^{n\times n}\) and \(\textbf{V}^\top \in \mathbb {R}^{n\times n}\):

$$\textbf{A}=\textbf{USV}^\top ,$$

where \(\textbf{U}\) and \(\textbf{V}\) are orthogonal matrices, and \(\textbf{S}\) is a diagonal matrix whose diagonal entries, arranged in descending order \(\sigma _1\ge \sigma _2\ge \cdots \ge \sigma _n\ge 0\), are known as the singular values of \(\textbf{A}\). The rank-d approximation (also called truncated or partial SVD) of \(\textbf{A}\), denoted \(\textbf{A}_d\) with \(d<{{\,\textrm{rank}\,}}\textbf{A}\), is obtained by zeroing out the \(n-d\) trailing singular values of \(\textbf{A}\), that is

$$\textbf{A}_d=\textbf{U}_{1:n,1:d}\textbf{S}_{1:d,1:d}\textbf{V}^\top _{1:d,1:n}.$$

By the Eckart-Young theorem (Eckart and Young 1936), \(\textbf{A}_d\) is the closest rank-d matrix to \(\textbf{A}\) in the Frobenius norm, i.e. \(\Vert \textbf{A}-\textbf{A}_d\Vert _F\le \Vert \textbf{A}-\textbf{B}\Vert _F\) for all \(\textbf{B}\in \mathbb {R}^{n\times n}\) with \({{\,\textrm{rank}\,}}(\textbf{B})=d\). Levy and Goldberg (2014) suggest factorizing the PPMI matrix with truncated SVD and taking the rows of \(\textbf{U}_{1:n,1:d}\textbf{S}_{1:d,1:d}^{1/2}\) as word vectors, and we follow their approach.
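A minimal sketch of this construction with SciPy's sparse truncated SVD is given below (the helper name svd_word_vectors is ours, and a sparse PPMI matrix is assumed):

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.linalg import svds

def svd_word_vectors(ppmi, d=250):
    """Rank-d truncated SVD of the PPMI matrix; rows of U_d S_d^{1/2} are the word vectors."""
    u, s, vt = svds(csr_matrix(ppmi), k=d)    # svds returns the d largest singular triplets
    order = np.argsort(-s)                    # reorder them to descending singular values
    u, s = u[:, order], s[order]
    return u * np.sqrt(s)                     # scale each column of U_d by sigma_i^{1/2}
```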

QR Decomposition with column pivoting of \(\textbf{A}\in \mathbb {R}^{n\times n}\) has the form \(\textbf{A P}=\textbf{Q R}\), where \(\textbf{Q}\in \mathbb {R}^{n\times n}\) is orthogonal, \(\textbf{R}\in \mathbb {R}^{n\times n}\) is upper triangular and \(\textbf{P}\in \mathbb {R}^{n\times n}\) is a permutation matrix. The rank-d approximation of \(\textbf{A}\) is then

$$\textbf{A}_d=\textbf{Q}_{1:n,1:d}[\textbf{RP}^\top ]_{1:d,1:n}$$

which is called the truncated QR decomposition of \(\textbf{A}\). After factorizing the PPMI matrix with this method, we suggest taking the rows of \(\textbf{Q}_{1:n,1:d}\) as word vectors.

However, we suspect that valuable information may be left in the \(\textbf{R}\) matrix. A promising alternative to SVD is the rank-revealing QR decomposition (RRQR). Consider the following partitioned QR factorization of the matrix \(\textbf{A}\):

$$ \textbf{AP}= \textbf{Q}\left[ \begin{array}{cc} \textbf{R}_{11} & \textbf{R}_{12} \\ \textbf{0} & \textbf{R}_{22} \end{array} \right] $$

where \(\textbf{R}_{11}\in \mathbb {R}^{d\times d}\), \(\textbf{R}_{12}\in \mathbb {R}^{d\times (n-d)}\), and \(\textbf{R}_{22}\in \mathbb {R}^{(n-d)\times (n-d)}\). For an RRQR factorization, the following conditions should be satisfied:

$$\begin{aligned} \sigma _{\min }(\textbf{R}_{11})&= \mathrm {\Theta }(\sigma _{d}(\textbf{A}))\\ \sigma _{\max }(\textbf{R}_{22})&= \mathrm {\Theta }(\sigma _{d+1}(\textbf{A})) \end{aligned}$$

which suggests that the most significant entries of \(\textbf{R}\) are in \(\textbf{R}_{11}\) and the least significant are in \(\textbf{R}_{22}\). Thus, we also suggest taking the columns of \([\textbf{RP}^\top ]_{1:d,1:n}\) as word vectors.
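Both QR-based embeddings described above could be extracted as in the following sketch, which uses SciPy's dense pivoted QR as an illustrative stand-in for the sparse RRQR routine used in our experiments (the helper name qr_word_vectors is ours):

```python
import numpy as np
from scipy.linalg import qr

def qr_word_vectors(ppmi, d=250):
    """Pivoted QR of the PPMI matrix, A P = Q R, and the two candidate embeddings."""
    q, r, piv = qr(ppmi, pivoting=True)       # piv encodes the column permutation P
    rp_t = np.empty_like(r)
    rp_t[:, piv] = r                          # undo the permutation: R P^T
    q_vectors = q[:, :d]                      # rows of Q_{1:n,1:d}, one row per word
    r_vectors = rp_t[:d, :].T                 # columns of [R P^T]_{1:d,1:n}, one row per word
    return q_vectors, r_vectors
```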

Nonnegative Matrix Factorization (NMF). Given a nonnegative matrix \(\textbf{A}\in \mathbb {R}^{n\times n}\) and a positive integer \(d<n\), NMF finds nonnegative matrices \(\textbf{W}\in \mathbb {R}^{n\times d}\) and \(\textbf{H}\in \mathbb {R}^{d\times n}\) which (locally) minimize the functional \(f(\textbf{W},\textbf{H})=\Vert \textbf{A}-\textbf{WH}\Vert ^2_F\). The rank-d approximation of \(\textbf{A}\) is simply

$$ \textbf{A}_d=\textbf{WH}. $$

When factorizing the PPMI matrix with NMF, we suggest taking the rows of \(\textbf{W}\) as word vectors.
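A minimal sketch with scikit-learn's NMF, using the nonnegative double SVD initialization that we also use in our experiments (the helper name and the iteration limit are illustrative):

```python
from sklearn.decomposition import NMF

def nmf_word_vectors(ppmi, d=250):
    """Rank-d NMF of the PPMI matrix; rows of W are taken as the word vectors."""
    model = NMF(n_components=d, init="nndsvd", max_iter=300)
    w = model.fit_transform(ppmi)             # W: n x d, nonnegative
    return w                                  # H is available as model.components_
```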

3 Experimental Setup

3.1 Corpus

All word vectors were trained on the enwik9 dataset, which was divided into two equal-sized splits. The PMI matrices for these splits were obtained using the hyperwords tool of Levy et al. (2015). All corpora were pre-processed by removing non-textual elements, sentence splitting, and tokenization. PMI matrices were derived using a window of two tokens to each side of the focus word, ignoring words that appeared fewer than 300 times in the corpus, resulting in vocabulary sizes of roughly 13,000 for both words and contexts. A repeated-measures design was used to assign the method of factorization (SVD, QR, NMF) and the dimensionality of the vectors (250, 500) to each PMI matrix replicate. We used two replicates per level combination.

3.2 Training

Low-rank approximations were performed using the following open-source implementations:

  • Sparse SVD from SciPy (Jones et al. 2014),

  • Sparse RRQR from SuiteSparse (Davis and Hu 2011), and

  • NMF from scikit-learn (Pedregosa et al. 2011).

For NMF we used the nonnegative double SVD initialization. We trained 250 and 500 dimensional word vectors with each method.

3.3 Evaluation

We evaluate word vectors on two tasks: similarity and analogy. Similarity is tested using the WordSim353 dataset of Finkelstein et al. (2002), which contains word pairs with human-assigned similarity scores. Each word pair is ranked by the cosine similarity of its vectors, and the evaluation metric is the Spearman correlation between those rankings and the human ratings. Analogies are tested using the Mixed dataset of 19,544 questions of the form “a is to b as c is to d”, where d is hidden and must be guessed from the entire vocabulary. We filter out questions with out-of-vocabulary words, as is standard. Accuracy is computed by comparing \(\mathop {\mathrm {arg\,min}}\limits _{\textbf{d}}\Vert \textbf{b} - \textbf{a} + \textbf{c} - \textbf{d}\Vert \) to the labelled answer.
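For concreteness, the two evaluation protocols could be sketched as follows (assuming vectors is the embedding matrix, vocab maps each word to its row index in row order, all question words are in the vocabulary, and question words are excluded from the candidate answers; the helper names are ours):

```python
import numpy as np
from scipy.stats import spearmanr

def evaluate_similarity(vectors, vocab, pairs, human_scores):
    """Spearman correlation between cosine similarities and human similarity ratings."""
    unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    cosines = [unit[vocab[a]] @ unit[vocab[b]] for a, b in pairs]
    return spearmanr(cosines, human_scores).correlation

def evaluate_analogy(vectors, vocab, questions):
    """Accuracy of answering 'a is to b as c is to ?' by the vector nearest to b - a + c."""
    words = list(vocab)                       # assumes insertion order matches row order
    correct = 0
    for a, b, c, d in questions:
        target = vectors[vocab[b]] - vectors[vocab[a]] + vectors[vocab[c]]
        dists = np.linalg.norm(vectors - target, axis=1)
        for w in (a, b, c):                   # exclude the question words from the candidates
            dists[vocab[w]] = np.inf
        correct += int(words[int(np.argmin(dists))] == d)
    return correct / len(questions)
```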

4 Results

The results of the evaluation are provided in Table 1, which we analyze using a two-factor ANOVA with the factors being (1) the low-rank approximation method and (2) the dimensionality of the word vectors, and the response being the performance in the similarity or analogy task. We analyze the tasks separately (Fig. 1).

Table 1. Results
Fig. 1. Test scores for different factorization methods on the Similarity and Analogy tasks.

Fig. 2. ANOVA residuals for the results on the Similarity task.

4.1 Similarity Task

Standard residual analysis is used to check whether the ANOVA assumptions are satisfied. From Fig. 2 we see that the residuals have constant variability around zero, are independent, and are normally distributed. Normality is confirmed by the Shapiro-Wilk test, \(p\text {-value} = 0.7923\).
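The normality check could be reproduced with SciPy as in the following sketch (assuming the ANOVA residuals are available as an array; this is not the exact script we used):

```python
from scipy.stats import shapiro

def residuals_are_normal(residuals, alpha=0.05):
    """Shapiro-Wilk test on the ANOVA residuals; True if normality is not rejected at level alpha."""
    return shapiro(residuals).pvalue > alpha
```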

Table 2. ANOVA table for the Similarity task results
Table 3. Main and interaction effects in the Similarity task
Table 4. ANOVA table for the Analogy task results
Fig. 3. ANOVA residuals for the Analogy task results.
Table 5. Main and interaction effects in the Analogy task

The SAS package was used to obtain the ANOVA table (Table 2), which shows the effects of the factors on the similarity score. An F-test for equality of the factor level means was conducted, giving \(F = 29.68\) and \(\text {p-value}<0.0001\). Hence, it can be concluded that at least one factor level mean is different from the others. \({R}^2 = 0.962925\) shows that more than 96% of the variation in the similarity score is explained by the factors considered.

Proceeding with the analysis of main and interaction effects, one can conduct an F-test for each of the factors and for the interaction between them. From Table 3 we see that the method of low-rank approximation affects the performance of word vectors in the similarity task, \(F = 66.23\), \(\text {p-value} < 0.0001\). The dimensionality of the word vectors has no effect on the performance in the similarity task, \(F = 0.06\) with \(\text {p-value} > 0.8\). Also, there is no interaction between the method of factorization and the dimensionality of the word vectors, \(F = 3.01\) with \(\text {p-value} = 0.0945\). Thus, SVD significantly outperforms the other factorization methods.
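For reference, an equivalent two-factor ANOVA could be run in Python with statsmodels instead of SAS; the sketch below assumes a data frame df with one row per replicate and columns method, dim and score (the column names and the helper are ours):

```python
import statsmodels.api as sm
from statsmodels.formula.api import ols

def two_way_anova(df):
    """Two-factor ANOVA with interaction for the factorization method and vector dimensionality.

    df has one row per replicate and columns: method (SVD/QR/NMF), dim (250/500), score.
    """
    model = ols("score ~ C(method) * C(dim)", data=df).fit()
    return sm.stats.anova_lm(model, typ=2)    # F statistics and p-values for main and interaction effects
```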

4.2 Analogy Task

Again, we first check whether the ANOVA assumptions are satisfied. From Fig. 3 we see that the residuals have constant variability around zero, are independent, and are normally distributed. Normality is confirmed by the Shapiro-Wilk test, \(\text {p-value} = 0.112\). The ANOVA table (Table 4) shows that at least one factor level mean is different from the others. \({R}^2\) is 0.997316; thus, more than 99% of the variation in the analogy score is explained by the considered factors (Table 5).

We proceed to the analysis of main and interaction effects. The method of low-rank approximation affects the performance of word vectors in the analogy task, \(F = 978.52\) with \(\text {p-value} < 0.0001\). The dimensionality of the word vectors has no effect on the performance in the analogy task, \(F = 1.34\) with \(\text {p-value} > 0.2\). Unlike in the similarity task, there is an interaction effect between the two factors, \(F = 11.78\) with \(\text {p-value} = 0.0026\).

5 Discussion

Why Is Dimensionality Critical for NMF in the Similarity Task? We obtained the best results in the similarity task with the SVD-based low-rank approximation, for which the dimensionality of the word vectors did not noticeably influence performance. In contrast, the performance of NMF in the similarity task is significantly affected by the dimensionality: 250-dimensional word vectors give significantly better results than 500-dimensional ones. This can be explained by a specific characteristic of NMF: the word vectors it produces contain many zeros, and increasing the dimensionality makes them even sparser. The similarity task is based on the cosine of the angle between two word vectors, and as the vectors become sparser, their dot product, from which the cosine is computed, involves fewer overlapping non-zero components and becomes smaller. The cosine similarity between two such sparse vectors is therefore much more likely to be close to zero than to 1, which leads to worse performance in the similarity task. Our suggestion is to decrease the dimensionality of the NMF embeddings to 100; we expect that this may give better results.
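This explanation could be checked with a simple diagnostic that measures how sparse an embedding matrix is and how small the pairwise cosine similarities become (a hypothetical helper, not part of our evaluation pipeline; for a large vocabulary one would subsample the rows first):

```python
import numpy as np

def sparsity_and_mean_cosine(vectors):
    """Fraction of zero entries and mean pairwise cosine similarity of the embedding rows."""
    zero_fraction = float(np.mean(vectors == 0))
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    norms[norms == 0] = 1.0                   # guard against all-zero rows
    unit = vectors / norms
    sims = unit @ unit.T
    n = sims.shape[0]
    mean_cosine = (sims.sum() - np.trace(sims)) / (n * (n - 1))
    return zero_fraction, mean_cosine
```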

Why Does NMF Perform Poorly in the Analogy Task? We provide a theoretical analysis of the poor performance of NMF in the analogy task. We model the word vectors produced by NMF as independent and identically distributed random vectors from an isotropic multivariate Gaussian distribution \(\mathcal {N}(\mathbf {4.5},\textbf{I})\), since a 500-dimensional \(\textbf{v}\sim \mathcal {N}(\mathbf {4.5},\textbf{I})\) is nonnegative with high probability:

$$ \Pr (\textbf{v}\in [0,+\infty )^{500})=[\Pr (4.5+Z>0)]^{500}\approx 0.9983, $$

where \(Z\sim \mathcal {N}(0,1)\) is a standard normal random variable. For a triplet of word vectors \(\textbf{a}\), \(\textbf{b}\) and \(\textbf{c}\) we have \(\textbf{b}-\textbf{a}+\textbf{c}\sim \mathcal {N}(\mathbf {4.5},3\textbf{I})\), and therefore

$$\begin{aligned}&\Pr (\textbf{b}-\textbf{a}+\textbf{c}\in [0,+\infty )^d)=[\Pr (4.5+\sqrt{3}Z\ge 0)]^d\\&=[\Pr (Z\ge -4.5/\sqrt{3})]^d\approx [0.9953]^d. \end{aligned}$$

When \(d=500\), this probability is \(\approx 0.1\), i.e. there is only a small chance that \(\textbf{b}-\textbf{a}+\textbf{c}\) is nonnegative, and thus we are unlikely to find a nearby nonnegative \(\textbf{d}\) when we minimize \(\Vert \textbf{b}-\textbf{a}+\textbf{c}-\textbf{d}\Vert \). This is confirmed empirically: for all word triplets (a, b, c) from the analogy task, the vector \(\textbf{b}-\textbf{a}+\textbf{c}\) has at least one negative component.
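The two probabilities above can be verified numerically with SciPy (a short check under the Gaussian model of this section):

```python
from scipy.stats import norm

d = 500
# Each coordinate of v ~ N(4.5, I) is 4.5 + Z, so Pr(v >= 0) = Phi(4.5)^d
p_vector_nonnegative = norm.cdf(4.5) ** d              # ~ 0.9983
# Each coordinate of b - a + c ~ N(4.5, 3I) is 4.5 + sqrt(3) Z, so Pr(>= 0) = Phi(4.5 / sqrt(3))^d
p_sum_nonnegative = norm.cdf(4.5 / 3 ** 0.5) ** d      # ~ 0.1
print(p_vector_nonnegative, p_sum_nonnegative)
```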

Why Is Using \(\textbf{R}\) Better than Using \(\textbf{Q}\) in the QR Decomposition? The \(\textbf{Q}\) matrix from the QR factorization gives the worst results in the similarity task, and this does not depend on the dimensionality of the vectors. The reason is that the necessary information is left in the \(\textbf{R}\) matrix. Truncating \(\textbf{RP}^\top \) gives a better approximation to the original matrix than the truncated \(\textbf{Q}\), because the most significant entries of \(\textbf{RP}^\top \) are in its top-left quarter and remain after truncation.

6 Conclusion

We analyzed the performance of word vectors obtained from a word-word PMI matrix by different low-rank approximation methods. As expected, truncated SVD provides a far better solution than NMF and truncated QR in both the similarity and analogy tasks. While the performance of NMF is relatively good in the similarity task, it is significantly worse in the analogy task. NMF produces only nonnegative, sparse vectors, and we showed how this deteriorates performance in both tasks. The \(\textbf{RP}^\top \) matrix from QR factorization with column pivoting gives better word embeddings than the \(\textbf{Q}\) matrix in both tasks.