1 Introduction

1.1 Context and Motivations

The motivation to build word representations as vectors in a Euclidean space is twofold. First, geometrical representations may enhance our understanding of a language. Second, such representations can be useful for information retrieval on large datasets, where semantic operations become algebraic operations. The first attempts to model natural language with simple vector space models go back to the 1970s, namely index terms [22] and term frequency-inverse document frequency (TF-IDF) [20], with corresponding software solutions SMART [21] and Lucene [10]. Recent work on word representations has emphasized that many analogies, such as king is to man what queen is to woman, yield almost parallel difference vectors in the space of the two most significant coordinates [15, 18], that is to say (if \(d=2\)):

$$\begin{aligned} &(u_i \;|\; 1\le i \le n) \in \mathbb {R}^{d} \,\, \text { being the word representations,} \\ &\text {(3,4) is an analogy of (1,2)} \Leftrightarrow \exists \epsilon \in \mathbb {R}^{d} \text { s.t. } u_2 - u_1 = u_4 - u_3 + \epsilon \\ &\text {where } || \epsilon || \ll \min (||u_2-u_1||, || u_4 - u_3||) \end{aligned}$$
(1)

In Eq. (1), \(|| x|| \ll ||y||\) means in practice that ||x|| is much smaller than ||y||. Equation (1) is stricter than mere parallelism, but we adopt this version because it corresponds to the one the scientific press has amplified, to the point that it now appears to be part of lay knowledge about word representations [5, 14, 23]. We hope that our paper will help clear up this misinterpretation.
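To make Eq. (1) concrete, the following minimal sketch (Python, with hypothetical toy vectors and a threshold chosen only for illustration) tags a quadruplet as an analogy when the residual \(\epsilon \) is small relative to both difference vectors:

```python
import numpy as np

def is_analogy(u1, u2, u3, u4, tau=0.1):
    """Check the quasi-parallelism condition of Eq. (1):
    ||(u2 - u1) - (u4 - u3)|| << min(||u2 - u1||, ||u4 - u3||),
    with "<<" replaced by an illustrative threshold tau."""
    eps = (u2 - u1) - (u4 - u3)
    return np.linalg.norm(eps) <= tau * min(np.linalg.norm(u2 - u1),
                                            np.linalg.norm(u4 - u3))

# Toy 2-dimensional vectors (hypothetical values, not real embeddings).
king, man = np.array([1.0, 3.0]), np.array([1.2, 1.0])
queen, woman = np.array([3.0, 3.1]), np.array([3.1, 1.05])
print(is_analogy(king, man, queen, woman))  # True for these toy values
```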

Recent work leads us to cast word representations into two families: static representations, where each word of the language is associated with a unique vector (the scope of this paper), and dynamic representations, where the entity representing each word may change with the context (we do not consider this case here).

1.2 Contributions

The attention devoted in the literature and the press to Eq. (1) might have been excessive, based on the following criteria:

\(\circ \) The proportion of analogies satisfying the geometric condition of Eq. (1) is small.

\(\circ \) Classifying analogies based on Eq. (1), or on parallelism, does not appear to be an easy task.

Second, we present a very simple propagation method on the graph of analogies which enforces our notion of parallelism from Eq. (1). Our code is available online.Footnote 1

2 Related Work

2.1 Word Embeddings

In the static representations family, after the first vector space models (index terms, TF-IDF, see SMART [21], Lucene [10]), Skip-gram and statistical log-bilinear regression models became very popular. The most famous are GloVe [18], Word2vec [15], and fastText [4]. Since such embeddings are computed once and for all for a given string, polysemy is an issue for these fixed embeddings. To overcome it, the family of dynamic representations has gained attention very recently with the rise of deep learning methods. ELMo [19] and BERT [9] representations take into account the context, letters, and n-grams of each word. We do not compare against these methods in this paper because their geometric properties have not yet been analyzed.

There have been attempts to evaluate the semantic quality of word embeddings [11], namely:

\(\circ \) Semantic similarity (Spearman correlation between the cosine similarity given by the model and human-rated similarity of word pairs)

\(\circ \) Semantic analogy (analogy prediction accuracy)

\(\circ \) Text categorisation (purity measure).

However, in practice, these semantic quality measures are not the preferred ones for applications: the quality of word embeddings is usually evaluated on very specific tasks, such as text classification or named entity recognition. In addition, recent work [17] has shown that the use of analogies to uncover human biases should be carried out very carefully, in a fair and transparent way. For example, [7] analyzed gender bias from language corpora, but balanced their results by checking against the actual distribution of jobs between genders.

2.2 Relation Embeddings for Named Entities

An entity is a real-world object denoted by a proper name. In the expression “Named Entity”, the word “Named” restricts the possible set of entities to those for which one or several rigid designators stand for the referent. Named entities play an important role in text information retrieval [16].

For the sake of completeness, we report work on the representation of relations between entities. Indeed, an entity relation can be seen as an instance of the relations we consider for analogies (example: Paris is to France what Madrid is to Spain, through the capital-of relation). There exist several attempts to model these relations, for example as translations [6, 24] or as hyperplanes [12].

2.3 Word Embeddings, Linear Structures and Pointwise Mutual Information

In this subsection, we focus on a recent analysis based on pointwise mutual information, which aims at providing a partial explanation of the linear structure behind analogies [1, 2]. This work provides a generative model with priors that yields closed-form expressions for word statistics. In the following, \(f = O(g)\) (resp. \(f = \tilde{O}(g)\)) means that f is bounded by g (resp. bounded by g ignoring logarithmic factors) in the neighborhood considered. The generation of sentences in a given text corpus is made under the following generative assumptions:

\(\circ \) Assumption 1: The ensemble of word vectors consists of i.i.d. samples generated by \(v = s \, \hat{v}\), where \(\hat{v}\) is drawn from the spherical Gaussian distribution in \(\mathbb {R}^{d}\) and s is an integrable random scalar, always upper bounded by a constant \(\kappa \in \mathbb {R}^{+}\).

\(\circ \) Assumption 2: The text generation process is driven by the random walk of a vector: if \(w_t\) is the word at step t, there exists a discourse vector \(c_{t}\) such that \(\mathsf {P}(w_t = w \mid c_t) \propto \exp (\langle c_t, v_w \rangle )\). Moreover, \(\exists \kappa \ge 0 \) and \(\epsilon _1 \ge 0\) such that \(\forall t\ge 0\):

$$\begin{aligned} \begin{aligned} |s|&\le \kappa \\ \mathbb {E}_{c_{t+1}} ( e^{\kappa \sqrt{d}|| c_{t+1} - c_t||_{2} } )&\le 1 + \epsilon _1 \end{aligned} \end{aligned}$$
(2)
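To fix ideas, here is a minimal sketch (Python/NumPy) of the word-vector prior in Assumption 1; the dimension, vocabulary size, and the uniform law chosen for s are arbitrary illustrative choices, not those of [1]:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 300, 5000          # hypothetical dimension and vocabulary size
kappa = 5.0               # arbitrary bound on the scalar s

# Assumption 1: v = s * v_hat, with v_hat drawn from the spherical Gaussian
# in R^d and s a bounded random scalar (here uniform on [0, kappa]).
v_hat = rng.standard_normal((n, d))
s = rng.uniform(0.0, kappa, size=n)
V = s[:, None] * v_hat    # one word vector per row

print(V.shape, bool(np.abs(s).max() <= kappa))
```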

In the following, \(\mathsf {P}(w,w')\) denotes the probability that two words w and \(w'\) occur in a window of size 2 (the result can be generalized to any window size), and \(\mathsf {P}(w)\) is the marginal probability of w. \(\mathsf {PMI}(w,w^{'})\) is the pointwise mutual information between the two words w and \(w^{'}\) [8]. Under these conditions, we have the following result [1]:
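As a reminder of these quantities, the following sketch (Python; the toy corpus and the adjacency convention for a window of size 2 are our own illustration) estimates \(\mathsf {P}(w,w')\), \(\mathsf {P}(w)\) and \(\mathsf {PMI}(w,w')\) from raw counts:

```python
import math
from collections import Counter

corpus = [["the", "king", "rules", "the", "land"],
          ["the", "queen", "rules", "the", "land"]]  # toy corpus, illustration only

unigrams, pairs, n_pairs = Counter(), Counter(), 0
for sentence in corpus:
    unigrams.update(sentence)
    # co-occurrences in a window of size 2, i.e. adjacent word pairs
    for w, w2 in zip(sentence, sentence[1:]):
        pairs[frozenset((w, w2))] += 1
        n_pairs += 1

n_words = sum(unigrams.values())

def pmi(w, w2):
    """Empirical PMI(w, w') = log( P(w, w') / (P(w) P(w')) )."""
    p_joint = pairs[frozenset((w, w2))] / n_pairs
    p_w, p_w2 = unigrams[w] / n_words, unigrams[w2] / n_words
    return math.log(p_joint / (p_w * p_w2)) if p_joint > 0 else float("-inf")

print(pmi("king", "rules"))
```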

Theorem 1

Let n denote the number of words and d the dimension of the representations. If Assumptions 1 and 2 are verified, then, using the same notation, the following holds for any two words w and \(w^{'}\):

$$\begin{aligned} \mathsf {PMI}(w,w^{'}) = \frac{\langle v_w, v_{w^{'}} \rangle }{d} + \tilde{O}(\epsilon ) \end{aligned}$$
(3)

Equation (3) shows that we can expect high cosine similarity for pairs of words with high pointwise mutual information (provided \(\epsilon \) is negligible).

The main aspect we are interested in is the relationship between linear structures and analogies. In [1], the subject is treated with an assumption following [18], stated in Eq. (4). Let \(\chi \) denote any word of the vocabulary, and let a and b be two words involved in a semantic relation \(\mathcal {R}\). Then there exist two scalars \(v_\mathcal {R}(\chi )\) and \(\xi _{ab\mathcal {R}}(\chi )\) such that:

$$\begin{aligned} \frac{\mathsf {P}(\chi | a)}{\mathsf {P}(\chi |b)} = v_{\mathcal {R}}(\chi ) \, \xi _{ab\mathcal {R}} (\chi ) \end{aligned}$$
(4)

We failed to fully understand the argument made in [1, 18] linking word vectors to their differences. However, if we assume Eq. (4), then Eq. (3) yields the following corollary.

Corollary 2

Let V be the \(n \times d\) matrix whose rows are the word vectors in dimension d. Let \(v_a\) and \(v_b\) be the vectors corresponding respectively to words a and b, and assume a and b are involved in a relation \(\mathcal {R}\). Let \(\log (v_\mathcal {R})\) be the element-wise log of the vector \(v_{\mathcal {R}} \in \mathbb {R}^{n}\) whose component associated with word x is \(v_{\mathcal {R}}(x)\). Then there exists a vector \(\xi '_{ab\mathcal {R}} \in \mathbb {R}^{n}\) such that:

$$\begin{aligned} V(v_a - v_b)&= d \log (v_{\mathcal {R}}) + \xi ^{'}_{ab\mathcal {R}} \end{aligned}$$
(5)

Proof

Let x be a word, and a, b two words sharing a relation \(\mathcal {R}\). On the one hand, taking the \(\log \) of Eq. (4):

$$\begin{aligned} \begin{aligned} \log (\frac{\mathsf {P}(x | a)}{\mathsf {P}(x|b)})&= \log (v_{\mathcal {R}}(x)) + \log (\xi _{ab\mathcal {R}}(x)) \end{aligned} \end{aligned}$$
(6)

On the other hand, using Eq. (3), \(\exists \, \epsilon _{abx} \in \mathbb {R}\) such that:

$$\begin{aligned} \log (\frac{\mathsf {P}(x|a)}{\mathsf {P}(x|b)})&= \log (\frac{\mathsf {P}(x,a)\mathsf {P}(b)}{\mathsf {P}(x,b)\mathsf {P}(a)}) \nonumber \\&= \log (\frac{\mathsf {P}(x,a)\mathsf {P}(b)\mathsf {P}(x)}{\mathsf {P}(x,b)\mathsf {P}(a)\mathsf {P}(x)}) \nonumber \\&= \mathsf {PMI}(x,a) - \mathsf {PMI}(x,b) \nonumber \\ \log (\frac{\mathsf {P}(x|a)}{\mathsf {P}(x|b)})&= \frac{\langle v_x, v_{a} - v_{b} \rangle }{d} + \epsilon _{abx} \end{aligned}$$
(7)

Combining Eqs. (6) and (7), for any x:

$$\begin{aligned} \begin{aligned} \langle v_x, v_{a} - v_{b} \rangle&= d\log (v_{\mathcal {R}}(x)) + d(\log (\xi _{ab\mathcal {R}}(x)) - \epsilon _{abx}) \end{aligned} \end{aligned}$$
(8)

Let V be the matrix whose rows are the word vectors. \(V(v_a - v_b)\) is a vector of \(\mathbb {R}^{n}\) whose component associated with word x is exactly \(\langle v_x, v_{a} - v_{b} \rangle \). Now let \(\log (v_{\mathcal {R}})\) be the element-wise \(\log \) of the vector \(v_\mathcal {R}\), and let \(\xi ^{'}_{ab\mathcal {R}}\) be the vector of components \( d(\log \xi _{ab\mathcal {R}}(x) - \epsilon _{abx})\). Then Eq. (8) is exactly Eq. (5).    \(\square \)

It is shown in [1] that \(||V^+\xi '_{ab\mathcal {R}}||\le ||\xi ^{'}_{ab\mathcal {R}}||\), where \(V^+\) is the pseudo-inverse of V. In other words, the “noise” factor \(\xi '_{ab\mathcal {R}}\) can be reduced. This reduction may not be sufficient, however, if \(\xi _{ab\mathcal {R}}\) is too large to start with. In the next section we shall propose an empirical analysis of existing embeddings with regard to analogies and the parallelism of vector differences.
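The effect of the pseudo-inverse on the noise term is easy to illustrate numerically. The following sketch (Python/NumPy) uses random Gaussian rows as stand-ins for actual word vectors; the reduction factor is at most \(1/\sigma_{\min}(V)\), so this is an illustration of the mechanism rather than a proof:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 2000, 100                     # hypothetical vocabulary size and dimension
V = rng.standard_normal((n, d))      # stand-in for the n x d embedding matrix
xi = rng.standard_normal(n)          # stand-in for the noise vector xi'

V_pinv = np.linalg.pinv(V)           # d x n pseudo-inverse
ratio = np.linalg.norm(V_pinv @ xi) / np.linalg.norm(xi)

# The reduction factor is at most 1 / sigma_min(V); with many independent
# random rows, sigma_min(V) >> 1, so the noise shrinks considerably.
print(ratio, 1.0 / np.linalg.svd(V, compute_uv=False).min())
```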

3 Experiments with Existing Representations

In this section, we present a list of experiments we ran on the most famous word representations.

3.1 Sanity Check

The exact meaning of the statement that analogies are geometrically characterized in word vectors is as follows [14, 18]. For each quadruplet of words involved in an analogy (a, b, c, d), consider the word vector triplet \((v_a,v_b,v_c)\) and the difference vector \(x_{ab}=v_b-v_a\). Then we run PCA on the set of word vectors to get representations in \(\mathbb {R}^2\). Find the k nearest neighbours of \(v_c+x_{ab}\) in the word embedding set (with k small). Finally, examine these k words and choose the most appropriate word d for the analogy \(a:b=c:d\). We ran this protocol in several dimensions with a corpus of analogies obtained from [13]. We display the results in Fig. 1.
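A minimal sketch of this protocol (Python, NumPy and scikit-learn; the embedding dictionary and the value of k are placeholders for illustration) is the following:

```python
import numpy as np
from sklearn.decomposition import PCA

def analogy_candidates(vectors, a, b, c, k=10, dim=2):
    """Return the k nearest words to v_c + (v_b - v_a) after a PCA projection.
    `vectors` is assumed to be a dict mapping words to 1-D NumPy arrays."""
    words = list(vectors)
    X = np.stack([vectors[w] for w in words])
    X = PCA(n_components=dim).fit_transform(X)     # project to `dim` dimensions
    idx = {w: i for i, w in enumerate(words)}
    target = X[idx[c]] + (X[idx[b]] - X[idx[a]])   # v_c + x_ab in the reduced space
    dists = np.linalg.norm(X - target, axis=1)
    return [words[i] for i in np.argsort(dists)[:k]]   # a human then picks d

# Hypothetical usage (assuming `glove` is a word -> vector dict loaded beforehand):
# print(analogy_candidates(glove, "king", "man", "queen", k=10))
```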

Fig. 1. Sanity check

3.2 Analogies Protocol

In this subsection we show that the protocol described in Sect. 3.1 for finding analogies does not really work in general. We ran it on 50 word triplets (a, b, c) as input, with \(k=10\) in the k-NN stage, but only obtained 35 valid analogies, namely those in Fig. 2.

Fig. 2. Some valid analogies following Protocol 3.2

3.3 Turning the Protocol into an Algorithm

The procedure described in Sect. 3.2 is termed a “protocol” rather than an “algorithm” because it involves human interaction when choosing the appropriate word out of the set of \(k=5\) nearest neighbours of \(v_c + (v_b-v_a)\). Since natural language processing tasks usually concern sets of words of higher cardinality than humans can handle, we are interested in an algorithm for finding analogies rather than a protocol. In this section we present an algorithm which takes the human decision out of the protocol sketched above, and we show that this algorithm has the same shortcomings as the protocol of Sect. 3.2.

We first remark that the obvious way to turn the protocol of Sect. 3.2 into an algorithm is to set \(k=1\) in the k-NN stage, which removes the need for a human choice. If we do this, however, we cannot even complete the famous “king:man = queen:woman” analogy: instead of “woman”, we actually get “king” using GloVe embeddings.

Following our first definition in Eq. (1), we instead propose the notion of strong parallelism in Eq. (9):

$$\begin{aligned} || v_d - v_c - (v_b - v_a) || \le \tau \min (||v_b - v_a||, ||v_d - v_c||) \end{aligned}$$
(9)

where \(\tau \) is a small scalar. Equation (9) is a sufficient condition for quasi-parallelism between \(v_d - v_c\) and \(v_b - v_a\). The algorithm is very simple: given a quadruplet (a, b, c, d) of words, tag it as a valid analogy if Eq. (9) is satisfied. We also generalize the PCA dimensionality reduction from 2D to higher dimensions. We ran this algorithm on a database of quadruplets corresponding to valid analogies, and obtained the results in Table 1. The fact that the results are surprisingly low was one of our initial motivations for this work. The failure of this algorithm indicates that the geometric relation of Eq. (1) for analogies may be more incidental than systematic.
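A sketch of this tagging algorithm (Python; the quadruplet list, the labels, and the value of \(\tau \) are placeholders for illustration, not the exact setup behind Table 1):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.metrics import f1_score

def eq9_predictions(vectors, quadruplets, tau=0.3, dim=None):
    """Tag each quadruplet (a, b, c, d) as an analogy when Eq. (9) holds,
    optionally after reducing the embeddings to `dim` dimensions with PCA."""
    words = list(vectors)
    X = np.stack([vectors[w] for w in words])
    if dim is not None:
        X = PCA(n_components=dim).fit_transform(X)
    idx = {w: i for i, w in enumerate(words)}
    preds = []
    for a, b, c, d in quadruplets:
        u, v = X[idx[b]] - X[idx[a]], X[idx[d]] - X[idx[c]]
        preds.append(np.linalg.norm(v - u) <= tau * min(np.linalg.norm(u),
                                                        np.linalg.norm(v)))
    return np.array(preds, dtype=int)

# Hypothetical usage: `quads` mixes known analogies (label 1) and random
# quadruplets (label 0); the F1 score is computed per PCA dimension.
# y_pred = eq9_predictions(glove, quads, tau=0.3, dim=10)
# print(f1_score(y_true, y_pred))
```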

Table 1. Analogies from Eq. (9), F1-score

3.4 Supervised Classification

The failure of an algorithm based on Eq. (9) to correctly label analogies (see Sect. 3.3) does not necessarily imply that analogies cannot be labeled correctly (at least approximately) by other means. In this section we use a very common supervised learning approach (a simple \(k\)-NN).

More precisely, we trained a \(5\)-NN classifier to predict analogies from vector differences, following Eq. (1). If (a, b, c, d) is an analogy quadruplet, we use the representation:

$$\begin{aligned} x_{abcd} = (v_b - v_a, v_d - v_c) \end{aligned}$$
(10)

to predict the class of the quadruplet (a, b, c, d) (either no relation, or being the capital of, plural, etc.). If the angles between the vectors \(v_b-v_a\) and \(v_d-v_c\) (a hint of parallelism) carry important information with respect to analogies, this representation should yield a good classification score. The dataset is composed of 13 types of analogies, with thousands of examples in total (see Footnote 1). We considered 1000 pairs of words sharing a relation, with 13 labels (1 to 13, respectively: capital-common-countries and capital-world (merged), currency, city-in-state, family, adjective-to-adverb, opposite, comparative, superlative, present-participle, nationality-adjective, past-tense, plural, plural-verbs), and 1000 pairs of words sharing no relation (label 0). In order to generate different random quadruplets, we ran 500 simulations. Average results are in Table 2.
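A minimal sketch of this experiment (Python, scikit-learn), where the feature construction follows Eq. (10) and the data loading and label assignment are placeholders:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import f1_score

def quadruplet_features(vectors, quadruplets):
    """Build x_abcd = (v_b - v_a, v_d - v_c) for each quadruplet, as in Eq. (10)."""
    feats = []
    for a, b, c, d in quadruplets:
        feats.append(np.concatenate([vectors[b] - vectors[a],
                                     vectors[d] - vectors[c]]))
    return np.stack(feats)

# Hypothetical usage: `quads` and `labels` (0 = no relation, 1..13 = relation
# type) are assumed to be built from the analogy dataset of Footnote 1.
# X = quadruplet_features(glove, quads)
# X_tr, X_te, y_tr, y_te = train_test_split(X, labels, test_size=0.3)
# clf = KNeighborsClassifier(n_neighbors=5).fit(X_tr, y_tr)
# print(f1_score(y_te, clf.predict(X_te), average="macro"))
```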

The results in Table 2 suggest that the representations of Eq. (10) allow a good classification of analogies in dimension 10, when Euclidean geometry is used with a \(5\)-NN. However, in the remaining dimensions, vector differences do not encode enough information with regard to analogies.

Table 2. Multi-class F1 score for the classification of analogies based on the representation of Eq. (10) (5-nearest neighbors)

4 Parallelism for Analogies with Graph Propagation

In this section we present an algorithm which takes an existing word embedding as input and outputs a modified word embedding in which analogies correspond to a notion of parallelism between vector differences. These new word embeddings will later be used (see Sect. 5) to confirm the hypothesis that having analogies correspond to parallel vector differences does not make the word embedding better for common classification tasks.

Let us consider a family of semantic relations \((\mathcal {R}_k \mid 1 \le k \le r)\). For instance, this family can contain the plural or the superlative relation. A relation \(\mathcal {R}_k\) creates the analogy \(a:b=c:d\) if and only if \(a \mathcal {R}_k b\) and \(c \mathcal {R}_k d\); in other words, semantic relations create analogy quadruplets in the following sense:

$$\begin{aligned} (a,b,c,d) \text { is an analogy quadruplet } \iff \exists k, \,\, a\mathcal {R}_kb \,\, \text { and } \,\, c\mathcal {R}_kd \end{aligned}$$
(11)

A sufficient condition for relation (1) to hold for such a quadruplet is that, for each pair (a, b) in the relation \(\mathcal {R}_k\):

$$\begin{aligned} \exists \mu _k \in \mathbb {R}^d{, ~ ~} a \mathcal {R}_{k} b \iff v_b = v_a + \mu _k \end{aligned}$$
(12)

Equation (12) can be generalized to functions other than the addition of a constant vector: it suffices that

$$\begin{aligned} \exists f_k: \mathbb {R}^{d}\longrightarrow \mathbb {R}^{d}\text {, } \quad a \mathcal {R}_{k} b \iff v_b = f_k(v_a) \end{aligned}$$
(13)

Other choices of \(f_k\) might be interesting, but are not considered in this work.

In order to generate word vectors satisfying Eq. (12), we propose a routine using propagation on graphs. The first step consists in building a directed graph of words (VE) encoding analogies:

$$\begin{aligned} (i,j) \in E \Leftrightarrow \exists k ~(i\mathcal {R}_{k}j) \end{aligned}$$
(14)

Therefore, we can label each edge with the type k of analogy involved (namely being the capital of, plural, etc.). Then, we use a graph propagation algorithm (Algorithm 1) based on the relation of Eq. (12). We remark that the propagation requires initial node representations.

Algorithm 1. Graph propagation enforcing Eq. (12)
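Since Algorithm 1 is given as a figure, here is a minimal Python sketch of the propagation it describes, reconstructed from the text (breadth-first traversal from a source node in each component, each child updated as \(v_{child} = v_{parent} + \mu_k\)); the traversal details are our assumptions, not the exact pseudocode:

```python
from collections import deque
import numpy as np

def propagate(vectors, edges, mu):
    """Breadth-first propagation of relation vectors on the analogy graph.
    `vectors`: dict word -> initial vector; `edges`: list of (i, j, k)
    meaning i R_k j; `mu`: dict k -> relation vector (with ||mu[k]|| >= d,
    see text). Reconstruction from the text, not the exact Algorithm 1."""
    children, indeg = {}, {w: 0 for w in vectors}
    for i, j, k in edges:
        children.setdefault(i, []).append((j, k))
        indeg[j] += 1
    new = {w: np.array(v, dtype=float) for w, v in vectors.items()}
    sources = [w for w in vectors if indeg[w] == 0]   # one root per tree if G is a forest
    queue, visited = deque(sources), set(sources)
    while queue:
        parent = queue.popleft()
        for child, k in children.get(parent, []):
            if child not in visited:                  # the first visited parent wins
                new[child] = new[parent] + mu[k]      # enforce Eq. (12)
                visited.add(child)
                queue.append(child)
    return new

# Hypothetical usage: mu[k] can be chosen as a family of independent vectors
# with norm at least d, as suggested in the text below.
```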

Proposition 1

Let G be the graph of analogies. If G is a forest, then the representations obtained with Algorithm 1 verify Eq. (12).

Proof

A forest structure implies the existence of a source node s for each component of G. In each component, every node visited by breadth-first search starting from s has exactly one parent, so the update defined at Line 9 of Algorithm 1 yields a representation that verifies Eq. (12) for the current node and its parent.    \(\square \)

However, if G is not a forest, words can have several parents. In this case, if \((parent_1, child)\) is visited before \((parent_2, child)\), our graph propagation method will not respect Eq. (12) for \((parent_2, child)\). This is the case with homonyms: for example, Peso is the currency of Argentina, but also the currency of Mexico. In practice, we selected \(\mu _1, \ldots , \mu _K\) as a family of linearly independent vectors in \(\mathbb {R}^{d}\). We found better results in our experiments with \(\forall i,~ ||\mu _i|| \ge d\). This can be explained by the fact that the relation vectors need to be non-negligible when compared to differences of word vectors.

5 Experiments with New Embeddings

In this section we present the results of the experiments described in Sect. 3, run on the updated embeddings obtained with the propagation Algorithm 1. We call X++ the new word embeddings obtained by applying the propagation algorithm to the word embeddings X.

5.1 Classification of Analogies

Analogies from “Parallelism”: As in Sect. 3.3 using Eq. (9). Results are in Table 3. F1-scores are almost perfect (by design) in all dimensions.

Table 3. Analogies from Eq. (9) with updated embeddings, F1-score

With Supervised Learning: Same experiments as in Sect. 3.4: 1000 pairs of words sharing a relation with 13 labels (1 to 13), and 1000 pairs of words sharing no relation (label 0). Results are in Table 4.

Table 4. Multi-class F1 score for the classification of analogies based on the representation of Eq. (10), with updated embeddings (5-nearest neighbors)

5.2 Text Classification: Comparison Using KNN

We used three datasets: one for binary classification (Subjectivity) and two for multi-class classification (WebKB and Amazon). For reasons of computation time, we used a subset of the WebKB and Amazon datasets (500 samples). The implementation and datasets are available online (see Footnote 1). Results are in Table 5.
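A sketch of this comparison (Python, scikit-learn). We assume here that each document is represented by the average of its word vectors, a common choice that is not stated explicitly in the text; the data loading is a placeholder:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

def doc_embedding(vectors, tokens, d=20):
    """Average of word vectors (our assumption); unknown words are skipped."""
    vs = [vectors[t] for t in tokens if t in vectors]
    return np.mean(vs, axis=0) if vs else np.zeros(d)

def knn_f1(vectors, docs, labels, d=20):
    """Mean F1 of a 5-NN classifier on averaged embeddings, as in Table 5."""
    X = np.stack([doc_embedding(vectors, doc, d) for doc in docs])
    clf = KNeighborsClassifier(n_neighbors=5)
    return cross_val_score(clf, X, labels, scoring="f1_macro", cv=5).mean()

# Hypothetical usage, once with the original embeddings and once with the
# "++" embeddings produced by the propagation algorithm:
# print(knn_f1(glove, docs, labels), knn_f1(glove_pp, docs, labels))
```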

Table 5. Text classification (\(d=20\)), F1-score

6 Conclusion

In this paper we discussed the well-advertised “geometrical property” of word embeddings w.r.t. analogies. By using a corpus of analogies, we showed that this property does not hold in general, in two or more dimensions. We conclude that the appearance of this geometrical property might be incidental rather than systematic or even likely.

This is somewhat in contrast to the theoretical findings of [1]. One possible way to reconcile these two views is that the concentration of measure argument in [1, Lemma 2.1] might yield high errors in vector spaces of dimension as low as \(\mathbb {R}^{300}\). Using very high-dimensional vector spaces might conceivably increase the occurrence of almost parallel differences for analogies. By the phenomenon of distance instability [3], however, algorithms based on finding closest vectors in high dimensions require computations with ever higher precision when the vectors are generated randomly. Moreover, the model of [1] only warrants approximate parallelism. So, even if high-dimensional word vector pairs were almost parallel with high probability, verifying this property might require considerable computational work related to floating point precision.

By creating word embeddings in which the geometrical property is enforced by design, we also showed empirically that the property appears to be irrelevant w.r.t. the performance of a common information retrieval algorithm (k-NN). So, whether it holds or not is probably a moot point, unless one is trying to find analogies by using the property itself. We are obviously grateful to this property for the (considerable, but unscientific) benefit of having attracted some attention of the general public to an important aspect of computational linguistics.