
1 Introduction

Today word embeddings play an important role in many natural language processing (NLP) tasks, from predictive language models and machine translation to image annotation and question answering, where they are usually ‘plugged in’ to a larger model. An understanding of their properties is of interest as it may allow the development of better-performing embeddings and improved interpretability of models that use them. One of the most widely used word embedding models is the Skip-gram with negative sampling (SGNS) of Mikolov et al. (2013). Levy and Goldberg (2014) showed that SGNS implicitly factorizes a pointwise mutual information (PMI) matrix shifted by a global constant, and that a low-rank approximation of the PMI matrix by truncated singular value decomposition (SVD) can produce word vectors comparable to those of SGNS. However, truncated SVD is not the only way of finding a low-rank approximation of a matrix. It is optimal in the sense that it minimizes the approximation error in the Frobenius and 2-norms, but this does not mean that it produces optimal word embeddings, which are usually evaluated in downstream NLP tasks. The question is: is there any other method of low-rank matrix approximation that produces better word embeddings than truncated SVD? Our experiments show that truncated SVD is in fact a strong baseline, which we failed to beat with two other widely used low-rank approximation methods.

2 Low-Rank Approximations of the PMI Matrix

The simplest version of a PMI matrix is a symmetric matrix with each row and column indexed by words, and with elements defined as

$$\begin{aligned} \textrm{PMI}(i,j)=\log \frac{p(i,j)}{p(i)p(j)}, \end{aligned}$$
(1)

where p(i, j) is the probability that the words i and j appear within a window of a certain size in a large corpus, and p(i) is the unigram probability of the word i. For computational purposes, Levy and Goldberg (2014) suggest using the positive PMI (PPMI), defined as

$$\begin{aligned} \textrm{PPMI}(i,j)=\max (\textrm{PMI}(i,j),0). \end{aligned}$$
(2)

They also show empirically that a low-rank SVD of the PPMI matrix produces word vectors that are comparable in quality to those of SGNS.
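As an illustration, the PPMI matrix in (2) could be built from a word-word co-occurrence count matrix roughly as in the following sketch (a dense NumPy version for clarity; the helper name ppmi_matrix and the use of marginal co-occurrence counts as unigram probabilities are our assumptions, not the exact pipeline used in this work):

```python
import numpy as np

def ppmi_matrix(counts):
    """PPMI matrix from a symmetric word-word co-occurrence count matrix.

    counts[i, j] is the number of times words i and j co-occur within the window.
    """
    total = counts.sum()
    p_ij = counts / total                     # joint probabilities p(i, j)
    p_i = counts.sum(axis=1) / total          # marginals used as unigram probabilities p(i)
    with np.errstate(divide="ignore", invalid="ignore"):
        pmi = np.log(p_ij) - np.log(np.outer(p_i, p_i))
    pmi[~np.isfinite(pmi)] = 0.0              # zero counts give -inf/NaN; treat them as 0
    return np.maximum(pmi, 0.0)               # PPMI = max(PMI, 0)
```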

Low-rank matrix approximation is the problem of approximating a matrix by one whose rank is lower than that of the original matrix. The goal is to obtain a more compact representation of the data with a limited loss of information. In what follows we give a brief overview of the low-rank approximation methods used in our work. Since both PMI (1) and PPMI (2) are square matrices, we consider the approximation of square matrices. For a thorough and up-to-date review of low-rank approximation methods, see Kishore Kumar and Schneider (2017).

Singular Value Decomposition (SVD) factorizes \(\textbf{A}\in \mathbb {R}^{n\times n}\) into the matrices \(\textbf{U}\in \mathbb {R}^{n\times n}\), \(\textbf{S}\in \mathbb {R}^{n\times n}\) and \(\textbf{V}^\top \in \mathbb {R}^{n\times n}\):

$$\textbf{A}=\textbf{USV}^\top ,$$

where \(\textbf{U}\) and \(\textbf{V}\) are orthogonal matrices, and \(\textbf{S}\) is a diagonal matrix whose diagonal entries, arranged in descending order \(\sigma _1\ge \sigma _2\ge \cdots \ge \sigma _n\ge 0\), are known as the singular values of \(\textbf{A}\). The rank-d approximation (also called truncated or partial SVD) of \(\textbf{A}\), denoted \(\textbf{A}_d\) with \(d<{{\,\textrm{rank}\,}}\textbf{A}\), is obtained by zeroing out the \(n-d\) trailing singular values of \(\textbf{A}\), that is

$$\textbf{A}_d=\textbf{U}_{1:n,1:d}\textbf{S}_{1:d,1:d}\textbf{V}^\top _{1:d,1:n}.$$

By the Eckart-Young theorem (Eckart and Young 1936), \(\textbf{A}_d\) is the closest rank-d matrix to \(\textbf{A}\) in the Frobenius norm, i.e. \(\Vert \textbf{A}-\textbf{A}_d\Vert _F\le \Vert \textbf{A}-\textbf{B}\Vert _F\) for all \(\textbf{B}\in \mathbb {R}^{n\times n}\) with \({{\,\textrm{rank}\,}}(\textbf{B})=d\). Levy and Goldberg (2014) suggest factorizing the PPMI matrix with truncated SVD and taking the rows of \(\textbf{U}_{1:n,1:d}\textbf{S}_{1:d,1:d}^{1/2}\) as word vectors, and we follow their approach.
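A minimal sketch of this construction with SciPy's sparse truncated SVD is given below (the helper name svd_word_vectors is ours, and a sparse PPMI matrix is assumed):

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.linalg import svds

def svd_word_vectors(ppmi, d=250):
    """Rank-d truncated SVD of the PPMI matrix; rows of U_d S_d^{1/2} are the word vectors."""
    u, s, vt = svds(csr_matrix(ppmi), k=d)    # svds returns the d largest singular triplets
    order = np.argsort(-s)                    # reorder them to descending singular values
    u, s = u[:, order], s[order]
    return u * np.sqrt(s)                     # scale each column of U_d by sigma_i^{1/2}
```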

QR Decomposition with column pivoting of \(\textbf{A}\in \mathbb {R}^{n\times n}\) has the form \(\textbf{A P}=\textbf{Q R}\), where \(\textbf{Q}\in \mathbb {R}^{n\times n}\) is orthogonal, \(\textbf{R}\in \mathbb {R}^{n\times n}\) is upper triangular and \(\textbf{P}\in \mathbb {R}^{n\times n}\) is a permutation matrix. The rank-d approximation of \(\textbf{A}\) is then

$$\textbf{A}_d=\textbf{Q}_{1:n,1:d}[\textbf{RP}^\top ]_{1:d,1:n}$$

which is called the truncated QR decomposition of \(\textbf{A}\). After factorizing the PPMI matrix with this method, we suggest taking the rows of \(\textbf{Q}_{1:n,1:d}\) as word vectors.

However, we suspect that valuable information may be left in the \(\textbf{R}\) matrix. A promising alternative to SVD is the rank-revealing QR decomposition (RRQR). Consider the following partitioned QR factorization of the matrix \(\textbf{A}\):

$$ \textbf{AP}= \textbf{Q}\left[ \begin{array}{cc} \textbf{R}_{11} & \textbf{R}_{12} \\ \textbf{0} & \textbf{R}_{22} \end{array} \right] $$

where \(\textbf{R}_{11}\in \mathbb {R}^{d\times d}\), \(\textbf{R}_{12}\in \mathbb {R}^{d\times (n-d)}\), and \(\textbf{R}_{22}\in \mathbb {R}^{(n-d)\times (n-d)}\). For an RRQR factorization, the following conditions should be satisfied:

$$\begin{aligned} \sigma _{\min }(\textbf{R}_{11})&= \mathrm {\Theta }(\sigma _{d}(\textbf{A}))\\ \sigma _{\max }(\textbf{R}_{22})&= \mathrm {\Theta }(\sigma _{d+1}(\textbf{A})) \end{aligned}$$

which suggests that the most significant entries of \(\textbf{R}\) are in \(\textbf{R}_{11}\) and the least significant are in \(\textbf{R}_{22}\). Thus, we also suggest taking the columns of \([\textbf{RP}^\top ]_{1:d,1:n}\) as word vectors.
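Both QR-based embeddings described above could be extracted as in the following sketch, which uses SciPy's dense pivoted QR as an illustrative stand-in for the sparse RRQR routine used in our experiments (the helper name qr_word_vectors is ours):

```python
import numpy as np
from scipy.linalg import qr

def qr_word_vectors(ppmi, d=250):
    """Pivoted QR of the PPMI matrix, A P = Q R, and the two candidate embeddings."""
    q, r, piv = qr(ppmi, pivoting=True)       # piv encodes the column permutation P
    rp_t = np.empty_like(r)
    rp_t[:, piv] = r                          # undo the permutation: R P^T
    q_vectors = q[:, :d]                      # rows of Q_{1:n,1:d}, one row per word
    r_vectors = rp_t[:d, :].T                 # columns of [R P^T]_{1:d,1:n}, one row per word
    return q_vectors, r_vectors
```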

Nonnegative Matrix Factorization (NMF). Given a nonnegative matrix \(\textbf{A}\in \mathbb {R}^{n\times n}\) and a positive integer \(d<n\), NMF finds nonnegative matrices \(\textbf{W}\in \mathbb {R}^{n\times d}\) and \(\textbf{H}\in \mathbb {R}^{d\times n}\) which (locally) minimize the functional \(f(\textbf{W},\textbf{H})=\Vert \textbf{A}-\textbf{WH}\Vert ^2_F\). The rank-d approximation of \(\textbf{A}\) is simply

$$ \textbf{A}_d=\textbf{WH}. $$

When factorizing the PPMI matrix with NMF, we suggest taking the rows of \(\textbf{W}\) as word vectors.
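A minimal sketch with scikit-learn's NMF, using the nonnegative double SVD initialization that we also use in our experiments (the helper name and the iteration limit are illustrative):

```python
from sklearn.decomposition import NMF

def nmf_word_vectors(ppmi, d=250):
    """Rank-d NMF of the PPMI matrix; rows of W are taken as the word vectors."""
    model = NMF(n_components=d, init="nndsvd", max_iter=300)
    w = model.fit_transform(ppmi)             # W: n x d, nonnegative
    return w                                  # H is available as model.components_
```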

3 Experimental Setup

3.1 Corpus

All word vectors were trained on the enwik9 dataset, which was divided into two equal-sized splits. The PMI matrices for these splits were obtained using the hyperwords tool of Levy et al. (2015). All corpora were pre-processed by removing non-textual elements, sentence splitting, and tokenization. PMI matrices were derived using a window of two tokens to each side of the focus word, ignoring words that appeared fewer than 300 times in the corpus, resulting in vocabulary sizes of roughly 13,000 for both words and contexts. A repeated-measures design was used to assign the method of factorization (SVD, QR, NMF) and the dimensionality of the vectors (250, 500) to each PMI matrix replicate. We used two replicates per level combination.

3.2 Training

Low-rank approximations were performed using the following open-source implementations:

  • Sparse SVD from SciPy (Jones et al. 2014),

  • Sparse RRQR from SuiteSparse (Davis and Hu 2011), and

  • NMF from scikit-learn (Pedregosa et al. 2011).

For NMF we used the nonnegative double SVD initialization. We trained 250 and 500 dimensional word vectors with each method.

3.3 Evaluation

We evaluate word vectors on two tasks: similarity and analogy. Similarity is tested using the WordSim353 dataset of Finkelstein et al. (2002), which contains word pairs with human-assigned similarity scores. Each word pair is ranked by the cosine similarity of its vectors, and the evaluation metric is the Spearman correlation between those rankings and the human ratings. Analogies are tested using the Mixed dataset of 19,544 questions of the form “a is to b as c is to d”, where d is hidden and must be guessed from the entire vocabulary. We filter out questions with out-of-vocabulary words, as is standard. Accuracy is computed by comparing \(\mathop {\mathrm {arg\,min}}\limits _{\textbf{d}}\Vert \textbf{b} - \textbf{a} + \textbf{c} - \textbf{d}\Vert \) to the labelled answer.
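For concreteness, the two evaluation protocols could be sketched as follows (assuming vectors is the embedding matrix, vocab maps each word to its row index in row order, all question words are in the vocabulary, and question words are excluded from the candidate answers; the helper names are ours):

```python
import numpy as np
from scipy.stats import spearmanr

def evaluate_similarity(vectors, vocab, pairs, human_scores):
    """Spearman correlation between cosine similarities and human similarity ratings."""
    unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    cosines = [unit[vocab[a]] @ unit[vocab[b]] for a, b in pairs]
    return spearmanr(cosines, human_scores).correlation

def evaluate_analogy(vectors, vocab, questions):
    """Accuracy of answering 'a is to b as c is to ?' by the vector nearest to b - a + c."""
    words = list(vocab)                       # assumes insertion order matches row order
    correct = 0
    for a, b, c, d in questions:
        target = vectors[vocab[b]] - vectors[vocab[a]] + vectors[vocab[c]]
        dists = np.linalg.norm(vectors - target, axis=1)
        for w in (a, b, c):                   # exclude the question words from the candidates
            dists[vocab[w]] = np.inf
        correct += int(words[int(np.argmin(dists))] == d)
    return correct / len(questions)
```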

4 Results

The results of the evaluation are provided in Table 1, which we analyze using a two-factor ANOVA with the factors being (1) the low-rank approximation method and (2) the dimensionality of the word vectors, and the response being the performance in the similarity or analogy task. We analyze the tasks separately (Fig. 1).

Table 1. Results
Fig. 1. Test scores for different factorization methods on the Similarity and Analogy tasks.

Fig. 2. ANOVA residuals for the results on the Similarity task.

4.1 Similarity Task

Standard residual analysis is used to check whether the ANOVA assumptions are satisfied. From Fig. 2 we see that the residuals have constant variability around zero, are independent, and are normally distributed. Normality is confirmed by the Shapiro-Wilk test, \(p\text {-value} = 0.7923\).
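The normality check could be reproduced with SciPy as in the following sketch (assuming the ANOVA residuals are available as an array; this is not the exact script we used):

```python
from scipy.stats import shapiro

def residuals_are_normal(residuals, alpha=0.05):
    """Shapiro-Wilk test on the ANOVA residuals; True if normality is not rejected at level alpha."""
    return shapiro(residuals).pvalue > alpha
```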

Table 2. ANOVA table for the Similarity task results
Table 3. Main and interaction effects in the Similarity task
Table 4. ANOVA table for the Analogy task results
Fig. 3. ANOVA residuals for the Analogy task results.
Table 5. Main and interaction effects in the Analogy task

The SAS package was used to obtain the ANOVA table (Table 2), which shows the effects of the factors on the similarity score. An F-test for equality of the factor level means was conducted, giving \(F = 29.68\) and \(\text {p-value}<0.0001\). Hence, it can be concluded that at least one factor level mean is different from the others. \({R}^2 = 0.962925\) shows that more than 96% of the variation in the similarity score is explained by the factors considered.

Proceeding with the analysis of main and interaction effects, one can conduct an F-test for each of the factors and for the interaction between them. From Table 3 we see that the method of low-rank approximation affects the performance of word vectors in the similarity task, \(F = 66.23\), \(\text {p-value} < 0.0001\). The dimensionality of the word vectors has no effect on the performance in the similarity task, \(F = 0.06\) with \(\text {p-value} > 0.8\). Also, there is no interaction between the method of factorization and the dimensionality of the word vectors, \(F = 3.01\) with \(\text {p-value} = 0.0945\). Thus, SVD significantly outperforms the other factorization methods.
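For reference, an equivalent two-factor ANOVA could be run in Python with statsmodels instead of SAS; the sketch below assumes a data frame df with one row per replicate and columns method, dim and score (the column names and the helper are ours):

```python
import statsmodels.api as sm
from statsmodels.formula.api import ols

def two_way_anova(df):
    """Two-factor ANOVA with interaction for the factorization method and vector dimensionality.

    df has one row per replicate and columns: method (SVD/QR/NMF), dim (250/500), score.
    """
    model = ols("score ~ C(method) * C(dim)", data=df).fit()
    return sm.stats.anova_lm(model, typ=2)    # F statistics and p-values for main and interaction effects
```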

4.2 Analogy Task

Again, we first check whether the ANOVA assumptions are satisfied. From Fig. 3 we see that the residuals have constant variability around zero, are independent, and are normally distributed. Normality is confirmed by the Shapiro-Wilk test, \(\text {p-value} = 0.112\). The ANOVA table (Table 4) shows that at least one factor level mean is different from the others. \({R}^2\) is 0.997316; thus, more than 99% of the variation in the analogy score is explained by the considered factors (Table 5).

We proceed to the analysis of main and interaction effects. The method of low-rank approximation affects the performance of word vectors in the analogy task, \(F = 978.52\) with \(\text {p-value} < 0.0001\). The dimensionality of the word vectors has no effect on the performance in the analogy task, \(F = 1.34\) with \(\text {p-value} > 0.2\). Unlike in the similarity task, there is an interaction effect between the two factors, \(F = 11.78\) with \(\text {p-value} = 0.0026\).

5 Discussion

Why Is Dimensionality Critical for NMF in the Similarity Task? We obtained the best results in the similarity task with the SVD-based low-rank approximation, for which the dimensionality of the word vectors did not noticeably influence performance. In contrast, the performance of NMF in the similarity task is significantly affected by the dimensionality: 250-dimensional word vectors give significantly better results than 500-dimensional ones. This can be explained by a specific characteristic of NMF: the word vectors it produces contain many zeros, and increasing the dimensionality makes them even sparser. The similarity task is based on the cosine of the angle between two word vectors, and as the vectors become sparser, their dot product, from which the cosine is computed, involves fewer overlapping non-zero components and becomes smaller. The cosine similarity between two such sparse vectors is therefore much more likely to be close to zero than to 1, which leads to worse performance in the similarity task. Our suggestion is to decrease the dimensionality of the NMF embeddings to 100; we expect that this may give better results.
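This explanation could be checked with a simple diagnostic that measures how sparse an embedding matrix is and how small the pairwise cosine similarities become (a hypothetical helper, not part of our evaluation pipeline; for a large vocabulary one would subsample the rows first):

```python
import numpy as np

def sparsity_and_mean_cosine(vectors):
    """Fraction of zero entries and mean pairwise cosine similarity of the embedding rows."""
    zero_fraction = float(np.mean(vectors == 0))
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    norms[norms == 0] = 1.0                   # guard against all-zero rows
    unit = vectors / norms
    sims = unit @ unit.T
    n = sims.shape[0]
    mean_cosine = (sims.sum() - np.trace(sims)) / (n * (n - 1))
    return zero_fraction, mean_cosine
```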

Why Does NMF Perform Poorly in the Analogy Task? We provide a theoretical analysis of the poor performance of NMF in the analogy task. We model the word vectors produced by NMF as independent and identically distributed random vectors from an isotropic multivariate Gaussian distribution \(\mathcal {N}(\mathbf {4.5},\textbf{I})\), since a 500-dimensional \(\textbf{v}\sim \mathcal {N}(\mathbf {4.5},\textbf{I})\) is nonnegative with high probability:

$$ \Pr (\textbf{v}\in [0,+\infty )^{500})=[\Pr (4.5+Z>0)]^{500}\approx 0.9983, $$

where \(Z\sim \mathcal {N}(0,1)\) is a standard normal random variable. For a triplet of word vectors \(\textbf{a}\), \(\textbf{b}\) and \(\textbf{c}\) we have \(\textbf{b}-\textbf{a}+\textbf{c}\sim \mathcal {N}(\mathbf {4.5},3\textbf{I})\), and therefore

$$\begin{aligned}&\Pr (\textbf{b}-\textbf{a}+\textbf{c}\in [0,+\infty )^d)=[\Pr (4.5+\sqrt{3}Z\ge 0)]^d\\&=[\Pr (Z\ge -4.5/\sqrt{3})]^d\approx [0.9953]^d. \end{aligned}$$

When \(d=500\), this probability is \(\approx 0.1\), i.e. there is only a small chance that \(\textbf{b}-\textbf{a}+\textbf{c}\) is nonnegative, and thus we are unlikely to find a nearby nonnegative \(\textbf{d}\) when we minimize \(\Vert \textbf{b}-\textbf{a}+\textbf{c}-\textbf{d}\Vert \). This is confirmed empirically: for all word triplets (a, b, c) from the analogy task, the vector \(\textbf{b}-\textbf{a}+\textbf{c}\) has at least one negative component.
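The two probabilities above can be verified numerically with SciPy (a short check under the Gaussian model of this section):

```python
from scipy.stats import norm

d = 500
# Each coordinate of v ~ N(4.5, I) is 4.5 + Z, so Pr(v >= 0) = Phi(4.5)^d
p_vector_nonnegative = norm.cdf(4.5) ** d              # ~ 0.9983
# Each coordinate of b - a + c ~ N(4.5, 3I) is 4.5 + sqrt(3) Z, so Pr(>= 0) = Phi(4.5 / sqrt(3))^d
p_sum_nonnegative = norm.cdf(4.5 / 3 ** 0.5) ** d      # ~ 0.1
print(p_vector_nonnegative, p_sum_nonnegative)
```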

Why Is Using \(\textbf{R}\) Better than Using \(\textbf{Q}\) in the QR Decomposition? The \(\textbf{Q}\) matrix from the QR factorization gives the worst results in the similarity task, and this does not depend on the dimensionality of the vectors. The reason is that the necessary information is left in the \(\textbf{R}\) matrix. Truncating \(\textbf{RP}^\top \) gives a better approximation to the original matrix than the truncated \(\textbf{Q}\), because the most significant entries of \(\textbf{RP}^\top \) are in its top-left quarter and remain after truncation.

6 Conclusion

We analyzed the performance of word vectors obtained from a word-word PMI matrix by different low-rank approximation methods. As expected, truncated SVD provides a far better solution than NMF and truncated QR in both the similarity and analogy tasks. While the performance of NMF is relatively good in the similarity task, it is significantly worse in the analogy task. NMF produces only nonnegative, sparse vectors, and we showed how this deteriorates performance in both tasks. The \(\textbf{RP}^\top \) matrix from QR factorization with column pivoting gives better word embeddings than the \(\textbf{Q}\) matrix in both tasks.