Abstract
Recently, the NLP community has focused on methods for learning good vectorial word representations. These representations must be good enough to capture semantic relationships between words using simple vector arithmetic. Currently, two methods stand out: GloVe and word2vec. We argue that the proper use of knowledge bases such as WordNet, Freebase and Paraphrase can further improve the results of such methods. Although attempts to incorporate information from knowledge bases into vectorial word representations are not new, their results have not been compared with those of GloVe or word2vec. In this paper, we propose a method to incorporate knowledge from the Paraphrase knowledge base into GloVe. Results show that this incorporation improves GloVe’s original results on at least three different benchmarks.
1 Introduction
Deep architectures such as the Multilayer Perceptron (MLP), Convolutional Neural Network (CNN) and Recurrent Neural Network (RNN) are the state of the art for many NLP tasks, such as automatic translation [22], question answering [23], named-entity recognition [15], automatic text summarization [19] and sentiment analysis [16]. Most NLP solutions involving Deep Learning use word embeddings, which are vectors of real numbers that represent words; each word has its own word embedding. There are three main families of techniques for learning word embeddings:
1. Context-window based methods;
2. Semantic Relationship based methods;
3. Graph distance based methods.
All three families have disadvantages. Some of the methods in (1), such as [9] and [17], use only the local context of each word instead of the global context for training. The methods in (2) and (3) use the WordNet [18] and Freebase [3] knowledge bases to learn the word embeddings. The main disadvantage of the methods in (2) is that they use only a subset of these knowledge bases and do not consider the Paraphrase dataset [12], which contains pairs of words that are written differently but share the same meaning. Methods in (3) use the Leacock-Chodorow distance [6] to capture the semantic information between two words; not considering other distance measures is a limitation.
In this work, we use the Paraphrase knowledge base to enrich the training base for word embeddings and train them using GloVe [20], which considers the global context of words. The hypothesis to be tested is that this combination improves vectorial word representations.
In Sect. 2 we present related work. In Sect. 3 we describe the method used to improve the training base using the Paraphrase knowledge base. Section 4 details the experiments and discusses the results. Finally, we conclude the work in Sect. 5.
2 Related Work
2.1 Context-Window Based Methods
Collobert et al. [9] implemented a Neural Language Model (NLM) in which each vocabulary word i is associated with a vector \( v_i \in \mathbb{R}^{n} \) of dimension n, the word embedding of i. A sentence \( s = (s_1, s_2, \ldots, s_l) \) of length l is represented by a vector x formed by concatenating the word embeddings of the words in s, \( x = [v_{s_1} \, ; \, v_{s_2} \, ; \, {\ldots } \, ; \, v_{s_l}], \: x \in \mathbb{R}^{ln} \). The vector x is then propagated through a two-layer neural network to obtain a score indicating how plausible the sentence is.
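A form of this score that is consistent with the parameters described next, given here as a sketch rather than as the exact expression of [9], is
\[ \mathrm{score}(x) = u^{T} f(Ax + b) \]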
Here, \( A \in \mathbb{R}^{h \times ln} \) is a weight matrix and \( b \in \mathbb{R}^{h} \) is the bias of the first layer. The parameter h indicates how many units are in the layer f, and \( u^{T} \in \mathbb{R}^{1 \times h} \) is the weight vector of the output layer. The weight matrix and word embeddings of this model are trained using Noise Contrastive Estimation (NCE) [13], where, for each training sequence s, we build a noise sequence \( s_c \). To build \( s_c \), we choose a word from s and replace it with a word selected at random from the vocabulary. Thus, we have a vector x for s and a vector \( x_c \) for \( s_c \). To train a neural network that assigns high scores to real sequences, the cost function in Eq. (53.2) is minimized.
The word embeddings and parameters A, b, and u are trained with backpropagation using Stochastic Gradient Descent (SGD) over a training corpus.
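As an illustration, the scoring step and a margin-based ranking cost can be sketched in plain Python. The tanh activation, the margin of 1, the toy values, and the names concat, score and ranking_cost are assumptions made for this sketch; they are not taken from [9] or from Eq. (53.2).

```python
import math

def concat(embeddings, sentence):
    """Concatenate the embeddings of the words in `sentence` into one vector x."""
    return [value for word in sentence for value in embeddings[word]]

def score(x, A, b, u):
    """Two-layer score u^T f(Ax + b); f = tanh is an assumption of this sketch."""
    hidden = [math.tanh(sum(a * xi for a, xi in zip(row, x)) + bias)
              for row, bias in zip(A, b)]
    return sum(ui * hi for ui, hi in zip(u, hidden))

def ranking_cost(x, x_c, A, b, u, margin=1.0):
    """Hinge cost: zero once the real sequence outscores the noise one by `margin`."""
    return max(0.0, margin - score(x, A, b, u) + score(x_c, A, b, u))

# Toy setup with n = 1, l = 2, h = 2 (all values hypothetical).
embeddings = {"good": [1.0], "movie": [1.0], "zebra": [0.0]}
A = [[1.0, 0.0], [0.0, 1.0]]
b = [0.0, 0.0]
u = [1.0, 1.0]

x = concat(embeddings, ["good", "movie"])    # real sequence s
x_c = concat(embeddings, ["good", "zebra"])  # noise sequence: one word replaced
cost = ranking_cost(x, x_c, A, b, u)
```

In a full training loop, the gradient of this cost with respect to A, b, u and the embeddings would be used for the SGD updates; that part is omitted here.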
Mikolov et al. [17] present two architectures for learning word embeddings based on the context window of a word within a sentence: Skip-gram and continuous bag-of-words (CBOW). The goal of Skip-gram is, given a sentence s and a central word c of s, to predict the context words of c. The goal of CBOW is to predict the central word c from its context. Given a word sequence \( w_1, w_2, w_3, \cdots, w_T \), the goal of Skip-gram is to maximize the function \( E_{sk} \).
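Following [17], this objective can be written as the average log-probability
\[ E_{sk} = \frac{1}{T} \sum_{t=1}^{T} \; \sum_{-c \le j \le c,\; j \ne 0} \log p(w_{t+j} \mid w_t) \]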
where c is the size of the context window. The simplest Skip-gram formulation defines \( p(w_{t+j} \mid w_t) \) as:
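Following [17], this is a softmax over the vocabulary,
\[ p(w_{t+j} \mid w_t) = \frac{\exp\left( {v'_{w_{t+j}}}^{T} v_{w_t} \right)}{\sum_{w=1}^{W} \exp\left( {v'_{w}}^{T} v_{w_t} \right)} \]
where \( v_w \) and \( v'_w \) are the input and output vector representations of the word w and W is the vocabulary size. These symbols follow [17] and are not used elsewhere in this paper.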
2.2 Semantic Relationship Based Methods
There are knowledge bases that provide semantic information about words, such as Freebase [3], WordNet [18], DBpedia [2] and NELL [7]. Knowledge is often represented as a tuple \( t = (w_i, r, w_j) \), where r indicates a semantic relationship between the words \( w_i \) and \( w_j \). Some models, e.g. TransE [4] and the Neural Tensor Network [21], try to learn word representations from this semantic information: they take a tuple t as input and output a score indicating how plausible the relationship r between \( w_i \) and \( w_j \) is.
2.3 Graph Distance-Based Methods
Fried and Duh [11] propose the Graph Distance (GD) model. The goal of GD is to train the word embeddings such that the similarity between two embeddings equals the LCH distance between the respective words in the WordNet database. Its objective function is:
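A form consistent with this description, given as a hedged reconstruction in which \( \mathrm{LCH}(w_i, w_j) \) denotes the Leacock-Chodorow measure between the two words in WordNet and the name \( E_{GD} \) is chosen only for this sketch, is
\[ E_{GD} = \sum_{(w_i, w_j)} \left( \frac{v_i \cdot v_j}{\lVert v_i \rVert \, \lVert v_j \rVert} - \left[ a \cdot \mathrm{LCH}(w_i, w_j) + b \right] \right)^{2} \]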
where \( v_i \) and \( v_j \) are the word embeddings of the words \( w_i \) and \( w_j \), respectively. GD uses the parameters a and b to put the LCH distance on the same scale as the cosine similarity between \( v_i \) and \( v_j \).
3 Model
Paraphrasing is the task of rewriting a sentence p using different words while keeping the meaning of p. Word-level paraphrasing replaces a word w with a differently written word that keeps the meaning of w. Ganitkevitch et al. [12] present a paraphrase database (PPDB). This database has around 73 million paraphrases at the sentence level and 8 million paraphrases at the word level. PPDB is divided into six sizes, in increasing order: S, M, L, XL, XXL and XXXL. The smallest subset, S, has a higher precision score. In this work, we selected the PPDB subset S and used its 473 thousand word-level paraphrases. We implemented the getparaphrase(word = w) method, which uses S; it randomly selects one paraphrase of the word w.
In this work, we used the GloVe [20] model to train the word embeddings. It is a context-window based method that uses the global context of words in the training corpus. During training, GloVe uses a word-to-word co-occurrence matrix X, where \( X_{ij} \) indicates how many times word j appears in the context of word i within the training corpus.
After building the matrix X, GloVe’s goal is to minimize the loss function J:
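As given by Pennington et al. [20], with \( b_i \) and \( \tilde b_j \) denoting scalar bias terms for the central and context words, this loss is
\[ J = \sum_{i,j=1}^{V} f(X_{ij}) \left( w_i^{T} \tilde w_j + b_i + \tilde b_j - \log X_{ij} \right)^{2} \]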
where V is the vocabulary size, \( w_i \) is the word embedding of the central word i and \(\tilde w_{j}\) is the word embedding of the context word j. Thus, we have two word embedding matrices, W and \(\tilde W\). Equation (53.8) defines the function f(x).
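In [20], this weighting function is defined piecewise as
\[ f(x) = \begin{cases} (x / x_{max})^{\alpha} & \text{if } x < x_{max} \\ 1 & \text{otherwise} \end{cases} \]
so that counts at or above \( x_{max} \) all receive weight 1, while rarer co-occurrences are down-weighted.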
In the original GloVe work, the authors use α = 3/4 and \( x_{max} = 100 \) to train the word embeddings and perform the experiments.
input: matrix X, int V
for i ← 1 to V do
    p = getparaphrase(i);
    for j ← 1 to V do
        if Xij == 0 and Xpj != 0 then
            Xij := Xpj
        end
    end
end
Algorithm 1: Algorithm used to enhance the X matrix
We use Algorithm 1 to enhance the co-occurrence matrix X before performing GloVe training. For each vocabulary word v, it randomly selects a paraphrase x and uses the context of x to fill the empty slots of v. In other words, it adds more information to the matrix X. This idea is valid because the words x and v have the same meaning, so the word v can be placed in the contexts of the word x.
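Algorithm 1 can be sketched in Python over a sparse matrix stored as nested dictionaries, where a missing key stands for a zero count. The storage layout, the names get_paraphrase and enhance_cooccurrence, and the toy paraphrase table are assumptions of this sketch; loading PPDB is omitted.

```python
import random

def get_paraphrase(word, paraphrases, rng):
    """Return one randomly chosen paraphrase of `word`, or None if it has none."""
    candidates = paraphrases.get(word)
    return rng.choice(candidates) if candidates else None

def enhance_cooccurrence(X, paraphrases, rng):
    """Fill the empty slots of each row of X from the row of a random paraphrase.

    X maps word -> {context word -> count} and is modified in place,
    as in Algorithm 1.
    """
    for i in list(X):
        p = get_paraphrase(i, paraphrases, rng)
        if p is None or p not in X:
            continue  # no usable paraphrase: leave the row untouched
        row_i = X[i]
        for j, count in list(X[p].items()):
            # Copy X[p][j] only where X[i][j] is still zero.
            if count != 0 and row_i.get(j, 0) == 0:
                row_i[j] = count
    return X

# Hypothetical toy data.
X = {
    "car": {"drive": 3},
    "automobile": {"drive": 1, "engine": 2},
}
paraphrases = {"car": ["automobile"]}
enhance_cooccurrence(X, paraphrases, random.Random(0))
```

In this toy run, the existing count for ("car", "drive") is kept, and only the empty slot ("car", "engine") is filled from the row of "automobile".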
4 Experiments
4.1 Evaluation Methods
We use three different benchmarks to evaluate the word embeddings: (1) SimLex999 [14], (2) MEN [5] and (3) WordSimilarity-353 (WS353) [1]. They all measure the semantic similarity between two words. Each dataset consists of tuples t = (word1, word2), each with a score indicating how semantically related word1 and word2 are. This score is the arithmetic mean of a set of scores assigned by humans.
The main difference between SimLex999 and the other two is that it explicitly evaluates the semantic similarity between two words, whereas MEN and WS353 also consider the relatedness between two words. For instance, the tuple (Freud, Psychology) has a low score in SimLex999 but a high score in MEN and WS353, since the name Freud is strongly related to psychology.
For each dataset, we compute the cosine similarity between the word embeddings of word1 and word2 for every tuple t = (word1, word2), which gives the semantic similarity according to the word embeddings. Finally, we calculate the Spearman correlation between the similarity given by the word embeddings and the similarity given by humans to measure how good the word embeddings are on that dataset.
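The evaluation procedure can be sketched in plain Python as follows. The function names, the toy embeddings and scores, and the choice to skip pairs containing out-of-vocabulary words are assumptions of this sketch, not details reported in this paper.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def average_ranks(values):
    """Ranks starting at 1; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda k: values[k])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        mean_rank = (pos + end) / 2 + 1
        for k in order[pos:end + 1]:
            ranks[k] = mean_rank
        pos = end + 1
    return ranks

def pearson(x, y):
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (std_x * std_y)

def spearman(x, y):
    """Spearman correlation computed as the Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))

def evaluate(embeddings, benchmark):
    """Correlate model similarities with human scores over (word1, word2, score)."""
    model, human = [], []
    for word1, word2, human_score in benchmark:
        if word1 in embeddings and word2 in embeddings:  # assumption: skip OOV pairs
            model.append(cosine(embeddings[word1], embeddings[word2]))
            human.append(human_score)
    return spearman(model, human)

# Hypothetical toy data.
embeddings = {"a": [1.0, 0.0], "b": [1.0, 0.1], "c": [0.0, 1.0], "d": [0.5, 0.5]}
benchmark = [("a", "b", 9.0), ("a", "d", 5.0), ("a", "c", 1.0)]
rho = evaluate(embeddings, benchmark)
```

In this toy example the model ranks the three pairs in the same order as the human scores, so the correlation is 1 up to floating-point rounding.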
4.2 Corpora and Training Details
To perform the experiments, we use the 1 Billion Word Language Model Benchmark [8] corpus, which has approximately 1 billion tokens. We tokenize the corpus using the Stanford tokenizer and lowercase every word. We build a vocabulary with the 100 thousand most frequent words and produce the co-occurrence matrix X. To build X, we use a context window of size 10.
For every experiment, we use \( x_{max} = 100 \) and α = 0.75, and train the GloVe model using Adagrad [10], stochastically sampling nonzero elements from X. The initial learning rate is 0.05. We run 100 training iterations for each word embedding in every experiment. In this work, we use \( W + \tilde W \) as our final word embedding matrix. We use the original GloVe implementation to train our word embeddings and keep its default settings.
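The vocabulary and co-occurrence steps can be sketched as follows for already tokenized and lowercased sentences. The sketch assumes a symmetric window and plain counts, matching the description of \( X_{ij} \) in Sect. 3; any distance weighting applied by a particular implementation is left out, and the function names and toy sentences are hypothetical.

```python
from collections import Counter, defaultdict

def build_vocab(sentences, size):
    """Keep the `size` most frequent tokens."""
    counts = Counter(token for tokens in sentences for token in tokens)
    return {word for word, _ in counts.most_common(size)}

def build_cooccurrence(sentences, vocab, window=10):
    """X[i][j] counts how often word j appears within `window` tokens of word i."""
    X = defaultdict(Counter)
    for tokens in sentences:
        for pos, center in enumerate(tokens):
            if center not in vocab:
                continue
            start = max(0, pos - window)
            stop = min(len(tokens), pos + window + 1)
            for ctx_pos in range(start, stop):
                if ctx_pos == pos:
                    continue  # a word is not its own context
                context = tokens[ctx_pos]
                if context in vocab:
                    X[center][context] += 1
    return X

# Hypothetical toy corpus.
sentences = [["the", "cat", "sat"], ["the", "cat", "ran"]]
vocab = build_vocab(sentences, size=2)
X = build_cooccurrence(sentences, vocab, window=10)
```

With a real corpus, size would be 100 thousand and window would be 10, as stated above.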
4.3 Results
In Table 53.1 we present the best achieved accuracy values. It can be observed that using the Paraphrase dataset to improve GloVe’s co-occurrence matrix X presents an improvement in every scenario. Another important aspect to be noted is the word embeddings’ dimension: experiments with bigger word embeddings also present better results.
Figures 53.1, 53.2, and 53.3 present the accuracy evolution for Glove 200 and Glove-P 200 on the SimLex999, MEN, and WS353 benchmarks, respectively. Besides presenting better results in every epoch and every evaluation, Glove-P 200 clearly improves faster than Glove 200 during the first epochs. This is important because the computational power needed for long training runs is not always available.
5 Conclusion
Vectorial word representations are important for obtaining good results in NLP tasks using machine learning algorithms. Recently, some works have tried to incorporate information from knowledge bases in order to improve the learning of word embeddings. However, these works have not tried to reach state-of-the-art results with their methods. Also, they usually limit their experiments to self-tailored datasets.
In this work, we have proposed a modification of the GloVe method, the state of the art on word representation benchmarks. In particular, we presented a method to incorporate the knowledge of the Paraphrase dataset into GloVe’s co-occurrence matrix. We have used a universal dataset to train the word embeddings, and results have shown improved word embeddings when compared to GloVe’s original approach.
As future work, we intend to come up with a method to incorporate similar knowledge into other relevant learning methods such as word2vec.
References
1. E. Agirre, E. Alfonseca, K. Hall, J. Kravalova, M. Paşca, A. Soroa, A study on similarity and relatedness using distributional and wordnet-based approaches, in Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics. Association for Computational Linguistics (2009), pp. 19–27
2. S. Auer, C. Bizer, G. Kobilarov, J. Lehmann, R. Cyganiak, Z. Ives, DBpedia: a nucleus for a web of open data, in The Semantic Web (Springer, Berlin, 2007), pp. 722–735
3. K. Bollacker, C. Evans, P. Paritosh, T. Sturge, J. Taylor, Freebase: a collaboratively created graph database for structuring human knowledge, in Proceedings of the 2008 ACM SIGMOD International Conference on Management of Data (ACM, New York, 2008), pp. 1247–1250
4. A. Bordes, N. Usunier, A. Garcia-Duran, J. Weston, O. Yakhnenko, Translating embeddings for modeling multi-relational data, in Advances in Neural Information Processing Systems (2013), pp. 2787–2795
5. E. Bruni, N.-K. Tran, M. Baroni, Multimodal distributional semantics. J. Artif. Intell. Res. 49, 1–47 (2014)
6. A. Budanitsky, G. Hirst, Semantic distance in wordnet: an experimental, application-oriented evaluation of five measures, in Workshop on WordNet and Other Lexical Resources, vol. 2 (2001), p. 2
7. A. Carlson, J. Betteridge, B. Kisiel, B. Settles, E.R. Hruschka Jr., T.M. Mitchell, Toward an architecture for never-ending language learning, in AAAI, vol. 5 (2010), p. 3
8. C. Chelba, T. Mikolov, M. Schuster, Q. Ge, T. Brants, P. Koehn, T. Robinson, One billion word benchmark for measuring progress in statistical language modeling (2013, preprint). arXiv:1312.3005
9. R. Collobert, J. Weston, L. Bottou, M. Karlen, K. Kavukcuoglu, P. Kuksa, Natural language processing (almost) from scratch. J. Mach. Learn. Res. 12, 2493–2537 (2011)
10. J. Duchi, E. Hazan, Y. Singer, Adaptive subgradient methods for online learning and stochastic optimization. J. Mach. Learn. Res. 12, 2121–2159 (2011)
11. D. Fried, K. Duh, Incorporating both distributional and relational semantics in word representations (2014, preprint). arXiv:1412.4369
12. J. Ganitkevitch, B. Van Durme, C. Callison-Burch, PPDB: the paraphrase database, in Proceedings of NAACL-HLT, Atlanta, GA, Association for Computational Linguistics (2013), pp. 758–764
13. M. Gutmann, A. Hyvärinen, Noise-contrastive estimation: a new estimation principle for unnormalized statistical models, in Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics (2010), pp. 297–304
14. F. Hill, R. Reichart, A. Korhonen, Simlex-999: evaluating semantic models with (genuine) similarity estimation. Comput. Linguist. 41, 665–695 (2016)
15. C. AEM Júnior, L.A. Barbosa, H.T. Macedo, S.E. Súo Cristóvão, Uma arquitetura híbrida lstm-cnn para reconhecimento de entidades nomeadas em textos naturais em língua portuguesa (2016)
16. H. Lakkaraju, R. Socher, C. Manning, Aspect specific sentiment analysis using hierarchical deep learning, in NIPS Workshop on Deep Learning and Representation Learning (2014)
17. T. Mikolov, I. Sutskever, K. Chen, G.S. Corrado, J. Dean, Distributed representations of words and phrases and their compositionality, in Advances in Neural Information Processing Systems (2013), pp. 3111–3119
18. G.A. Miller, Wordnet: a lexical database for english. Commun. ACM 38(11), 39–41 (1995)
19. R. Paulus, C. Xiong, R. Socher, A deep reinforced model for abstractive summarization (2017, preprint). arXiv:1705.04304
20. J. Pennington, R. Socher, C.D. Manning, Glove: global vectors for word representation, in EMNLP, vol. 14 (2014), pp. 1532–1543
21. R. Socher, D. Chen, C.D. Manning, A. Ng, Reasoning with neural tensor networks for knowledge base completion, in Advances in Neural Information Processing Systems (2013), pp. 926–934
22. Y. Wu, M. Schuster, Z. Chen, Q.V. Le, M. Norouzi, W. Macherey, M. Krikun, Y. Cao, Q. Gao, K. Macherey, et al., Google’s neural machine translation system: bridging the gap between human and machine translation (2016, preprint). arXiv:1609.08144
23. C. Xiong, V. Zhong, R. Socher, Dynamic coattention networks for question answering (2016, preprint). arXiv:1611.01604
Acknowledgements
The authors thank CAPES and FAPITEC-SE for the financial support [Edital CAPES/FAPITEC/SE No 11/2016 - PROEF, Processo 88887.160994/2017-00] and LCAD-UFS for providing a cluster for the execution of the experiments. The authors also thank FAPITEC-SE for granting a graduate scholarship to Flávio Santos, and CNPq for granting a productivity scholarship to Hendrik Macedo [DT-II, Processo 310446/2014-7].
© 2018 Springer International Publishing AG, part of Springer Nature
About this paper
Cite this paper
Santos, F.A.O., Macedo, H.T. (2018). Improving Word Representations Using Paraphrase Dataset. In: Latifi, S. (eds) Information Technology - New Generations. Advances in Intelligent Systems and Computing, vol 738. Springer, Cham. https://doi.org/10.1007/978-3-319-77028-4_53
DOI: https://doi.org/10.1007/978-3-319-77028-4_53
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-77027-7
Online ISBN: 978-3-319-77028-4
eBook Packages: Engineering, Engineering (R0)