1 Introduction

Deep architectures based on the Multilayer Perceptron (MLP), the Convolutional Neural Network (CNN) and the Recurrent Neural Network (RNN) are the state of the art for many NLP tasks, such as machine translation [22], question answering [23], named-entity recognition [15], automatic text summarization [19] and sentiment analysis [16]. Most NLP solutions involving Deep Learning use word embeddings, which are vectorial word representations: each word is represented by its own vector of real numbers. There are three main families of techniques for learning word embeddings:

  1. Context-window based methods;

  2. Semantic relationship based methods;

  3. Graph distance based methods.

All three families of methods have disadvantages. Some of the methods in (1), such as [9] and [17], use only the local context of each word instead of its global context during training. The methods in (2) and (3) use the WordNet [18] and Freebase [3] knowledge bases to learn the word embeddings. The main disadvantage of the methods in (2) is that they use only a subpart of these knowledge bases and do not consider the Paraphrase dataset [12], which contains pairs of words that are written differently but share the same meaning. The methods in (3) use the Leacock-Chodorow [6] distance to capture the semantic information between two words; not considering other distance measures is a limitation.

In this work, we used the Paraphrase knowledge base to enrich the training base of our word embeddings and trained them with GloVe [20], which considers the global context of words. The hypothesis to be tested is that such a combination improves vectorial word representations.

In Sect. 53.2 we present some related works. In Sect. 53.3 we describe the method used to improve the training base using the Paraphrase knowledge base. Section 53.4 details the experiments and discusses the results. Finally, we conclude the work in Sect. 53.5.

2 Related Work

2.1 Context-Window Based Methods

Collobert et al. [9] implemented a Neural Language Model (NLM) in which each vocabulary word i is associated with a vector \(v_i \in R^{n}\) of dimension n, the word embedding of i. A sentence \(s = (s_1, s_2, \ldots, s_l)\) of length l is represented by a vector x obtained by concatenating the word embeddings of the words in s, \( x = [v_{s_1} \, ; \, v_{s_2} \, ; \, {\ldots } \, ; \, v_{s_l}], \: x \in R^{ln} \). The vector x is then propagated through a two-layer neural network to obtain a score indicating how plausible the sentence is.

$$\displaystyle \begin{aligned} Score(x) = u^{T}(\sigma(Ax + b)) \end{aligned} $$
(53.1)

A is a weight matrix such that \(A \in R^{h \times ln}\), and \(b \in R^{h}\) is the bias of the first layer. The parameter h indicates the number of units in the hidden layer. \(u \in R^{h}\) is the weight vector of the output layer. The weight matrix and the word embeddings of this model are trained using Noise Contrastive Estimation (NCE) [13]: for each training sequence s, we build a noise sequence \(s_c\) by choosing a word from s and replacing it with a randomly selected word from the vocabulary. Thus, we have a vector x for s and a vector \(x_c\) for \(s_c\). To train a neural network that assigns high scores to real sequences, we minimize the cost function in Eq. (53.2).

$$\displaystyle \begin{aligned} cost = max(0, 1 - Score(x) + Score(x_c)) \end{aligned} $$
(53.2)

The word embeddings and parameters A, b, and u are trained with backpropagation using Stochastic Gradient Descent (SGD) over a training corpus.
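For illustration only, the scoring network of Eq. (53.1) and the ranking cost of Eq. (53.2) can be sketched as follows; this is a minimal numpy sketch under our own variable names, with tanh standing in for the nonlinearity σ, and not the authors' implementation.

```python
import numpy as np

def score(x, A, b, u):
    """Eq. (53.1): two-layer scorer; tanh plays the role of the nonlinearity sigma."""
    return u @ np.tanh(A @ x + b)

def ranking_cost(x, x_c, A, b, u):
    """Eq. (53.2): hinge cost pushing the real sequence above the noise sequence by a margin of 1."""
    return max(0.0, 1.0 - score(x, A, b, u) + score(x_c, A, b, u))

# toy usage: l = 3 words, embeddings of dimension n = 4, h = 5 hidden units
rng = np.random.default_rng(0)
l, n, h = 3, 4, 5
x = rng.normal(size=l * n)      # concatenated embeddings of a real sequence
x_c = rng.normal(size=l * n)    # concatenated embeddings of its noise sequence
A, b, u = rng.normal(size=(h, l * n)), np.zeros(h), rng.normal(size=h)
print(ranking_cost(x, x_c, A, b, u))
```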

Mikolov et al. [17] present two architectures to learn word embeddings based on the context window of a word within a sentence: Skip-gram and Continuous Bag-of-Words (CBOW). The Skip-gram goal is, given a sentence s and a central word c of s, to predict the context words of c. The CBOW goal is to predict the central word c based on its context. Given a word sequence \(w_1, w_2, w_3, \ldots, w_T\), Skip-gram maximizes the objective \(E_{sk}\):

$$\displaystyle \begin{aligned} E_{sk} = \frac{1}{T}\sum_{t = 1}^{T}\sum_{-c \leq j \leq c, j \neq 0} \log p (w_{t + j} | w_{t}), \end{aligned} $$
(53.3)

where c is the size of the context window. The simplest Skip-gram formulation defines \(p(w_{t+j} | w_t)\) as a softmax over the vocabulary of size N:

$$\displaystyle \begin{aligned} p (w_{t + j} | w_{t}) = \frac{\exp ({v_{w_{t + j}}^{'}}^{T} v_{w_{t}})}{\sum_{n = 1}^{N}\exp ({v_{w_{n}}^{'}}^{T}v_{w_{t}})} \end{aligned} $$
(53.4)
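For concreteness, the softmax of Eq. (53.4) can be computed as in the sketch below; the input and output embedding matrices V_in and V_out are made-up names, and the full softmax is shown only for illustration, since in practice the authors use cheaper approximations such as hierarchical softmax or negative sampling.

```python
import numpy as np

def skipgram_prob(t_idx, j_idx, V_in, V_out):
    """Eq. (53.4): probability of the context word j given the central word t."""
    logits = V_out @ V_in[t_idx]           # scores v'_{w_n}^T v_{w_t} for every word n
    logits -= logits.max()                 # numerical stability
    probs = np.exp(logits) / np.exp(logits).sum()
    return probs[j_idx]

# toy vocabulary of N = 6 words with 4-dimensional embeddings
rng = np.random.default_rng(1)
V_in, V_out = rng.normal(size=(6, 4)), rng.normal(size=(6, 4))
print(skipgram_prob(t_idx=2, j_idx=5, V_in=V_in, V_out=V_out))
```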

2.2 Semantic Relationship Based Methods

There are knowledge bases that provide semantic information about words, such as Freebase [3], WordNet [18], DBpedia [2] and NELL [7]. Such knowledge is often represented by a tuple t = (wi, r, wj), where r indicates a semantic relationship between the words wi and wj. Some models, e.g. TransE [4] and the Neural Tensor Network [21], try to learn word representations from this semantic information: the tuple t is the input and the output is a score indicating how plausible the relationship r between wi and wj is.
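As an illustration of how such models score a tuple, TransE treats the relation as a translation in the embedding space; below is a minimal sketch with our own naming, using the negated L2 distance as the score.

```python
import numpy as np

def transe_score(w_i, r, w_j):
    """TransE-style score: a small distance ||w_i + r - w_j|| means the tuple (w_i, r, w_j) is plausible."""
    return -np.linalg.norm(w_i + r - w_j)

# toy 3-dimensional embeddings for two words and one relation
rng = np.random.default_rng(2)
w_i, r, w_j = rng.normal(size=3), rng.normal(size=3), rng.normal(size=3)
print(transe_score(w_i, r, w_j))
```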

2.3 Graph Distance-Based Methods

Fried and Duh [11] propose the Graph Distance (GD) model. The goal of GD is to train the word embeddings so that their similarity matches the LCH distance between the respective words in the WordNet database. Its objective function is:

$$\displaystyle \begin{aligned} L_{GD}(v_{i}, v_{j}) = \left(\frac{v_{i}^{T}v_{j}}{||v_{i}|| \, ||v_{j}||} - [a \times LCH(w_{i}, w_{j}) + b]\right)^2, \end{aligned} $$
(53.5)

where vi and vj are the word embeddings of the words wi and wj, respectively. GD uses the parameters a and b to put the LCH distance on the same scale as the cosine similarity between vi and vj.
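A minimal sketch of the objective in Eq. (53.5) is shown below; it assumes the LCH distance of the word pair has already been obtained from WordNet and is passed in as a plain number, and the values of a and b are illustrative only.

```python
import numpy as np

def gd_loss(v_i, v_j, lch, a, b):
    """Eq. (53.5): squared gap between cosine similarity and the rescaled LCH distance."""
    cos = v_i @ v_j / (np.linalg.norm(v_i) * np.linalg.norm(v_j))
    return (cos - (a * lch + b)) ** 2

rng = np.random.default_rng(3)
v_i, v_j = rng.normal(size=50), rng.normal(size=50)
print(gd_loss(v_i, v_j, lch=2.59, a=0.4, b=-0.3))   # illustrative parameter values
```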

3 Model

Paraphrasing is the task of rewriting a sentence p using different words while keeping the meaning of p. Word-level paraphrasing is rewriting a word w with different characters while keeping the meaning of w. Ganitkevitch et al. [12] present a paraphrase database (PPDB). This database has around 73 million paraphrases at the sentence level and 8 million paraphrases at the word level. The PPDB is divided into six sizes, in increasing order: S, M, L, XL, XXL, XXXL. The S subset, the smallest one, has the highest precision score. In this work, we selected the PPDB S subset and used its 473 thousand word-level paraphrases. We implemented the getparaphrase(word = w) method, which uses S and randomly selects one paraphrase of the word w.
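A possible implementation of getparaphrase is sketched below; it assumes the word-level pairs of PPDB S have already been parsed into a Python dictionary (the parsing of the PPDB files is omitted and all names are ours).

```python
import random
from collections import defaultdict

# word -> list of its word-level paraphrases, e.g. built by parsing the PPDB S lexical pairs
paraphrases = defaultdict(list)
paraphrases["car"].extend(["automobile", "vehicle"])   # illustrative entries only

def getparaphrase(word):
    """Randomly select one paraphrase of `word`, or None if PPDB S has no entry for it."""
    options = paraphrases.get(word)
    return random.choice(options) if options else None

print(getparaphrase("car"))
```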

In this work, we used the GloVe [20] model to train the word embeddings. GloVe is a context-window based method that uses the global context of words in the training corpus. During training, it relies on a word-to-word co-occurrence matrix X, where Xij indicates how many times word j appears in the context of word i in the training corpus; Eq. (53.6) defines the row sum Xi, the total number of co-occurrences involving word i.

$$\displaystyle \begin{aligned} X_{i} = \sum_{k}X_{ik} \end{aligned} $$
(53.6)
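For concreteness, a simplified construction of X from a tokenized corpus could look like the sketch below; it uses a symmetric window with unit weights, whereas the actual GloVe implementation additionally down-weights distant co-occurrences.

```python
from collections import defaultdict

def build_cooccurrence(tokens, vocab, window=10):
    """X[(i, j)] counts how many times word j appears within `window` words of word i."""
    X = defaultdict(float)
    ids = [vocab[t] for t in tokens if t in vocab]
    for pos, i in enumerate(ids):
        for ctx in ids[max(0, pos - window):pos]:   # look back; symmetry gives the forward counts
            X[(i, ctx)] += 1.0
            X[(ctx, i)] += 1.0
    return X

# toy usage with a 3-word vocabulary
vocab = {"the": 0, "cat": 1, "sat": 2}
print(build_cooccurrence("the cat sat on the mat".split(), vocab, window=2))
```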

After building the matrix X, GloVe's goal is to minimize the loss function J:

$$\displaystyle \begin{aligned} J = \sum_{i, j = 1}^{V}f(X_{ij})(w_{i}^{T}{\tilde w_{j}} + b_{i} + {\tilde b_{j}} - log(X_{ij}) )^{2}, \end{aligned} $$
(53.7)

where V is the vocabulary size, wi is the word embedding of the central word i and \(\tilde w_{j}\) is the word embedding of the context word j. Thus, we have two word embedding matrices, W and \(\tilde W\). Equation (53.8) defines the weighting function f(x).

$$\displaystyle \begin{aligned} f(x)= \begin{cases} (x/x_{max})^{\alpha}, & \text{if } x < x_{max}\\ 1, & \text{otherwise} \end{cases} \end{aligned} $$
(53.8)

In the original GloVe work, the authors used α = 3/4 and xmax = 100 to train the word embeddings and run the experiments.
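A minimal sketch of the weighting function of Eq. (53.8) and of a single summand of the loss in Eq. (53.7), under our own variable names:

```python
import numpy as np

def f(x, x_max=100.0, alpha=0.75):
    """Eq. (53.8): down-weights rare co-occurrences and caps frequent ones at 1."""
    return (x / x_max) ** alpha if x < x_max else 1.0

def glove_term(w_i, w_tilde_j, b_i, b_tilde_j, x_ij):
    """One summand of Eq. (53.7) for a nonzero co-occurrence count x_ij."""
    return f(x_ij) * (w_i @ w_tilde_j + b_i + b_tilde_j - np.log(x_ij)) ** 2

# toy usage with 200-dimensional embeddings
rng = np.random.default_rng(4)
w_i, w_tilde_j = rng.normal(size=200), rng.normal(size=200)
print(glove_term(w_i, w_tilde_j, b_i=0.0, b_tilde_j=0.0, x_ij=42.0))
```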

Algorithm 1: Algorithm used to enhance the X matrix

input: matrix X, int V
for i ← 1 to V do
    p = getparaphrase(i);
    for j ← 1 to V do
        if X_ij == 0 and X_pj != 0 then
            X_ij := X_pj
        end
    end
end

We use Algorithm 1 to enhance our co-occurrence matrix X before performing GloVe's training. For each vocabulary word v, it randomly selects a paraphrase x of v and uses the context of x to fill the empty slots of the row of v; in other words, it adds more information to our X matrix. This idea is valid because the words x and v have the same meaning, so the word v can be placed in the contexts of the word x.
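A runnable sketch of Algorithm 1 over the dictionary-based representation of X used in the sketches above; here getparaphrase is assumed to map a vocabulary index to the index of one of its paraphrases, returning None when PPDB S has no entry.

```python
def enhance_cooccurrence(X, V, getparaphrase):
    """Algorithm 1: copy the paraphrase's context counts into the empty slots of each word's row."""
    for i in range(V):
        p = getparaphrase(i)
        if p is None:
            continue
        for j in range(V):
            if X.get((i, j), 0.0) == 0.0 and X.get((p, j), 0.0) != 0.0:
                X[(i, j)] = X[(p, j)]
    return X
```

With a vocabulary of 100 thousand words, iterating over all V × V pairs is wasteful; in practice one would loop only over the nonzero entries of the row of p.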

4 Experiments

4.1 Evaluation Methods

We use three different benchmarks to evaluate the word embeddings: (1) SimLex999 [14], (2) MEN [5] and (3) WordSimilarity-353 (WS353) [1]. They all measure the semantic similarity between two words. Each dataset is a set of tuples t = (word1, word2), where each tuple has a score indicating how semantically related word1 and word2 are. This score is the arithmetic mean of a set of scores assigned by human annotators.

The main difference between SimLex999 and the other two is that it evaluates strict semantic similarity between two words, whereas MEN and WS353 also consider relatedness. For instance, the pair (Freud, Psychology) has a low score in SimLex999 but a high score in MEN and WS353, since the name Freud is strongly related to psychology.

For each dataset, we compute the cosine similarity between the word embeddings of word1 and word2 for every tuple t = (word1, word2), obtaining the semantic similarity according to the embeddings. We then calculate the Spearman correlation between the embedding-based similarities and the human-assigned similarities to measure how good our word embeddings are on that dataset.
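The evaluation reduces to a few lines; the sketch below uses scipy's Spearman correlation and assumes `emb` maps words to numpy vectors and `pairs` holds (word1, word2, human_score) tuples (both names are ours).

```python
import numpy as np
from scipy.stats import spearmanr

def evaluate(emb, pairs):
    """Spearman correlation between embedding cosine similarities and human similarity scores."""
    model_scores, human_scores = [], []
    for w1, w2, human in pairs:
        if w1 in emb and w2 in emb:        # skip out-of-vocabulary pairs
            v1, v2 = emb[w1], emb[w2]
            model_scores.append(v1 @ v2 / (np.linalg.norm(v1) * np.linalg.norm(v2)))
            human_scores.append(human)
    return spearmanr(model_scores, human_scores).correlation

# usage: score = evaluate(emb, benchmark_pairs) for each of the three benchmarks
```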

4.2 Corpora and Training Details

To perform the experiments, we use the 1 Billion Word Language Model Benchmark [8] corpus, which has approximately 1 billion tokens. We tokenize and lowercase every word in the corpus using the Stanford tokenizer. We build a vocabulary with the 100 thousand most frequent words and produce the X co-occurrence matrix using a context window of size 10.

For every experiment, we set xmax = 100 and α = 0.75, train the GloVe model using AdaGrad [10], and stochastically sample the nonzero elements of X. The initial learning rate is 0.05. We run 100 training iterations for each word embedding in every experiment. We use W + \(\tilde W\) as our final word embedding matrix. We use the original GloVe implementation to train our word embeddings and keep its default settings.
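Combining the two embedding matrices is a single addition; the sketch below assumes W and \(\tilde W\) have already been loaded as numpy arrays of shape (V, d) in the same word order (the file names are hypothetical, since the GloVe tool's own output format must be parsed first).

```python
import numpy as np

W = np.load("W.npy")               # center-word embeddings, shape (V, d); hypothetical file
W_tilde = np.load("W_tilde.npy")   # context-word embeddings, shape (V, d); hypothetical file

final_embeddings = W + W_tilde     # the W + W~ matrix used as our final word embeddings
```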

4.3 Results

In Table 53.1 we present the best results achieved. It can be observed that using the Paraphrase dataset to improve GloVe's co-occurrence matrix X yields an improvement in every scenario. Another important aspect is the dimension of the word embeddings: experiments with larger word embeddings also present better results.

Table 53.1 Best results for each model

Figures 53.1, 53.2, and 53.3 present the accuracy evolution of GloVe 200 and GloVe-P 200 on the SimLex999, MEN, and WS353 benchmarks, respectively. Besides the fact that GloVe-P 200 presents better results in every epoch and every evaluation, it is clear that GloVe-P 200 improves faster than GloVe 200 during the first epochs. This is important because high computational power is not always available for long training sessions.

Fig. 53.1 SimLex999 results

Fig. 53.2 MEN results

Fig. 53.3 WS353 results

5 Conclusion

Vectorial word representations are important for obtaining good results in NLP tasks with machine learning algorithms. Recently, some works have tried to incorporate information from knowledge bases to improve the learning of word embeddings. However, these works have not aimed at state-of-the-art results with their methods, and they usually restrict their experiments to self-tailored datasets.

In this work, we have proposed a modification of the GloVe method, the state of the art in word representation benchmarks. In particular, we presented a method to incorporate the knowledge of the Paraphrase dataset into GloVe's co-occurrence matrix. We used a universal dataset to train the word embeddings, and the results show improved word embeddings compared to GloVe's original approach.

As future work, we intend to come up with a method to incorporate similar knowledge into other relevant learning methods such as word2vec.