
1 Introduction and Motivation

The choice of word embedding model is an important hyperparameter for many NLP tasks, since different embedding models have been observed to provide stronger representations for different types of downstream tasks [4]. It is also known that ensembles of machine learning models tend to perform better than their individual constituents. It is natural, then, to combine different embedding models in order to improve the performance of downstream NLP tasks.

While ensembles of downstream models seeded with different types of word embeddings had been tried before [1], the idea of combining word embeddings directly to form meta-embeddings originates with the work of [21]. In that work the authors form meta-embeddings by concatenation, by factorization of the concatenated vectors (SVD), and by a method called 1toN, which learns a meta-embedding together with learned projections onto each source embedding space, minimizing the mean squared error between the projected meta-embedding and the corresponding source embedding over all words. The simpler, often overlooked idea of averaging source embeddings is explored in [5]. In [3], autoencoders are employed to reduce the dimensionality of the concatenated (CAEME) and averaged (AAEME) meta-embeddings, as well as to reduce the dimensionality of the source embeddings before concatenating them (DAEME).
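As a rough sketch in our own notation (regularization terms used in [21] are omitted), 1toN learns a meta-embedding \(m_w\) for each word \(w\) and a projection matrix \(C_i\) for each of the \(N\) source embedding sets by solving

\[
\min_{\{m_w\},\,\{C_i\}} \; \sum_{i=1}^{N} \sum_{w} \big\lVert C_i\, m_w - e_w^{i} \big\rVert_2^{2},
\]

where \(e_w^{i}\) denotes the embedding of word \(w\) in source set \(i\).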

One of the best known word embedding models is word2vec [13, 14], which during training learns not only a word vector for each word in the training corpus but also a context vector for it. However, context vectors are typically discarded after training. In [16] it is briefly mentioned that adding word and context vectors may yield a small performance boost, but the idea is not thoroughly investigated. In this study we investigate it in detail by forming meta-embeddings of word and context vectors in several different ways and conducting detailed experiments. We observe that combining word and context embeddings into a meta-embedding yields higher performance on text classification, semantic similarity, and analogy tasks in several different settings.

In Sect. 2, we describe our novel approach and the meta-embedding types. In Sect. 3, we describe our experimental setup, our implementation, and the NLP tasks that we perform. In Sect. 4, we present the results of our meta-embedding methods on the text classification, semantic similarity, and word analogy tasks. In Sect. 5, we draw conclusions based on our results and discuss possible extensions as future work.

2 Approach

Our novel approach focuses on exploiting the otherwise ignored information encoded in context vectors. We formulate and experiment with seven different types of meta-embeddings. A total of nine results are given in our tables for comparison; the first two are the traditional word and context embeddings, which constitute the baselines.

In order to see whether including context vectors helps improve performance on several NLP tasks, we first create a meta-embedding by concatenating word and context embeddings, denoted simply by concat; this doubles the dimensionality. Our second approach is to average word and context embeddings, denoted by average. Our third approach applies an element-wise max pooling filter to the word and context embeddings, denoted by maxpool. The fourth is a more elaborate meta-embedding obtained by applying concatenation, averaging, and max pooling to the word and context embeddings, indicated as CAM in our result tables. In addition, we have three autoencoder-based meta-embeddings [3] of word and context embeddings, namely the Averaged Autoencoded Meta-Embedding (AAEME), the Concatenated Autoencoded Meta-Embedding (CAEME), and the Decoupled Autoencoded Meta-Embedding (DAEME). Note that the different meta-embedding approaches result in vectors of different dimensionality, as can be seen in Table 1.
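As a concrete illustration, the four non-autoencoder combinations can be written in a few lines of NumPy. The sketch below is ours, not code from the paper; in particular, it assumes that CAM stacks the concatenation, average, and max-pooling outputs into a single \(4d\)-dimensional vector (the autoencoder variants follow [3] and are omitted).

```python
import numpy as np

def combine(word_vec: np.ndarray, ctx_vec: np.ndarray, method: str) -> np.ndarray:
    """Combine a word vector and its context vector (both of dimension d)
    into a meta-embedding using one of the simple strategies above."""
    if method == "concat":    # 2d dimensions
        return np.concatenate([word_vec, ctx_vec])
    if method == "average":   # d dimensions
        return (word_vec + ctx_vec) / 2.0
    if method == "maxpool":   # d dimensions, element-wise maximum
        return np.maximum(word_vec, ctx_vec)
    if method == "cam":       # 4d dimensions: concat + average + maxpool stacked
        return np.concatenate([
            word_vec, ctx_vec,
            (word_vec + ctx_vec) / 2.0,
            np.maximum(word_vec, ctx_vec),
        ])
    raise ValueError(f"unknown method: {method}")
```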

In order to obtain word and context embeddings we use two of the most popular word embedding models: word2vec and fastText [2, 8]. For both models we use the skip-gram architecture with negative sampling, as it is the more popular choice. In the case of the fastText models, although character n-grams are trained alongside the word and context embeddings, we chose not to include them in our meta-embeddings for reasons of comparability.
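For reference, the sketch below shows one way to recover both sets of vectors from gensim's skip-gram implementations: the output-layer weights `syn1neg` hold the context vectors when negative sampling is used, and `vectors_vocab` holds the fastText word vectors without their character n-grams. The hyperparameters and variable names here are illustrative, not the exact training configuration.

```python
import gensim.downloader as api
from gensim.models import Word2Vec, FastText

corpus = list(api.load("text8"))   # pre-tokenised text8 corpus from gensim-data

# Skip-gram with negative sampling (sg=1, hs=0, negative>0) for both models.
w2v = Word2Vec(corpus, vector_size=200, sg=1, hs=0, negative=5)
ft = FastText(corpus, vector_size=200, sg=1, hs=0, negative=5)

word = "computer"

idx = w2v.wv.key_to_index[word]
w2v_word_vec = w2v.wv.vectors[idx]      # input (word) embedding
w2v_ctx_vec = w2v.syn1neg[idx]          # output (context) embedding

idx = ft.wv.key_to_index[word]
ft_word_vec = ft.wv.vectors_vocab[idx]  # word embedding without character n-grams
ft_ctx_vec = ft.syn1neg[idx]            # output (context) embedding
```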

Table 1. Embedding and dimension

3 Experiments

3.1 Datasets

We trained our embeddings on Text8 [11], a 100 MB corpus of cleaned text derived from the first \(10^9\) bytes of the English Wikipedia dump of March 3, 2006. For comparison we also trained our embeddings on a large Wikipedia dump; this corpus contains 19,251,790 articles and occupies approximately 16 GB of disk space.

For the text classification task we use the following datasets: AG’s News Corpus [22], consisting of 120,000 documents in 4 classes; WEBKB, a highly imbalanced dataset of 8,282 documents in 7 classes [12, 17]; Yelp Reviews Polarity [20, 22], consisting of 560,000 documents in 2 classes; and DBPedia [9, 22], also consisting of 560,000 documents but in 14 classes.

For the semantic similarity test we use the following datasets: WS [6] (353 word pairs), RG [19] (65 word pairs), RW [10] (2034 word pairs), SL [7] (999 word pairs).

For the analogy test we use the GL [14] dataset (19,557 analogy questions).

3.2 Experimental Setup

We use the gensim library [18] implementations of word2vec and fastText. For both we train vectors of dimension 200 and otherwise use the default hyperparameters. We also use the word similarity and analogy tests implemented in the gensim library. For the similarity tests we report the Spearman correlation between the cosine similarity of the word vectors and the human-assigned similarity scores.
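As an illustration of how this fits together, the snippet below (our own sketch, continuing the one in Sect. 2) wraps an average meta-embedding into a gensim KeyedVectors object so that the library's built-in similarity and analogy evaluations can be reused. WS-353 and the Google analogy questions ship with gensim's test data; the other similarity sets can be supplied in the same tab-separated format.

```python
from gensim.models import KeyedVectors
from gensim.test.utils import datapath

# Build the "average" meta-embedding from the word2vec model of the earlier sketch.
vocab = w2v.wv.index_to_key
meta = (w2v.wv.vectors + w2v.syn1neg) / 2.0

meta_kv = KeyedVectors(vector_size=meta.shape[1])
meta_kv.add_vectors(vocab, meta)

# Semantic similarity: returns (Pearson, Spearman, OOV ratio).
print(meta_kv.evaluate_word_pairs(datapath("wordsim353.tsv")))

# Word analogy: returns the overall accuracy and per-section details.
accuracy, _ = meta_kv.evaluate_word_analogies(datapath("questions-words.txt"))
print(f"analogy accuracy: {accuracy:.3f}")
```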

For the text classification experiments we use the Support Vector Machine (SVM) algorithm, more specifically the Linear Support Vector Classifier (LinearSVC), which is commonly used in this domain. We use the implementation in the scikit-learn library [15] with the default hyperparameters. Documents to be classified are represented as the averages of their words’ vectors.

The text classification experiments were run with 10-fold cross-validation. We report the average accuracy and the standard deviation for the classification experiments.
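A minimal sketch of this pipeline is shown below; `docs` (tokenised documents), `labels`, and the `meta_kv` KeyedVectors from the previous sketch are placeholders for the dataset loading, which we omit.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import LinearSVC

def doc_vector(tokens, kv):
    """Represent a document as the average of its in-vocabulary word vectors."""
    vecs = [kv[t] for t in tokens if t in kv]
    return np.mean(vecs, axis=0) if vecs else np.zeros(kv.vector_size)

# `docs` is a list of tokenised documents, `labels` the corresponding classes.
X = np.vstack([doc_vector(d, meta_kv) for d in docs])
y = np.asarray(labels)

# LinearSVC with default hyperparameters, evaluated with 10-fold cross-validation.
scores = cross_val_score(LinearSVC(), X, y, cv=10, scoring="accuracy")
print(f"accuracy: {scores.mean():.4f} +/- {scores.std():.4f}")
```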

In the text classification experiments, in order to determine whether the performance improvement of meta-embeddings such as concat is due to the increased (in fact doubled) number of dimensions, we conduct two sets of experiments. First, we compare meta-embeddings of size 200 (100 word + 100 context dimensions) with baseline word embedding vectors of dimension 200. In the second, analogous set of experiments we double the vector sizes.

4 Results and Discussion

4.1 Text Classification

As seen in Tables 2 and 3, for text classification the concatenation approach has a distinct advantage over all other approaches. The autoencoder meta-embeddings appear to perform better than the average meta-embedding and the baseline embeddings. However, for the WEBKB dataset, which is highly class-imbalanced, we observe a different pattern: on this dataset the autoencoder-based meta-embeddings perform poorly compared to the others.

The concatenation meta-embeddings of both word2vec and fastText models exceed the classification performance of the other meta-embeddings in all datasets.

Table 2. Performance of word2vec meta-embeddings trained on text8 for text classification task
Table 3. Performance of fastText meta-embeddings trained on text8 for text classification task

The improvement is most pronounced on the Yelp Reviews Polarity dataset, with an increase of 3.8 percentage points over the word embeddings for word2vec and 4.27 percentage points for fastText.

As the second set of experiments we run the same text classification tasks using our meta-embedding models trained on Wikipedia. As seen in Tables 4 and 5, the concatenation meta-embeddings of both the word2vec and fastText models again exceed the classification performance of the other meta-embeddings on all datasets.

Table 4. Performance of word2vec meta-embeddings trained on Wikipedia for text classification task
Table 5. Performance of fastText meta-embeddings trained on Wikipedia for text classification task

For text classification, as seen in Table 6, we also compare the performance of meta-embedding and baseline embedding vectors of the same size. We observe that the concatenation of word and context vectors still achieves higher accuracy than the word vectors alone, even though their dimensionalities are equal.

Table 6. Performance comparison of word2vec and fastText meta-embedding concat with word embeddings of the same dimensionality on the text classification task

4.2 Semantic Similarity and Word Analogy

In the word2vec meta-embedding semantic similarity and analogy results, which can be seen in Table 7, the average meta-embedding performs best on three of the five datasets. On the RW dataset the autoencoder-based meta-embedding DAEME slightly outperforms the average meta-embedding; interestingly, the other autoencoder-based meta-embeddings underperform the average meta-embedding on the same dataset. One outlier in the semantic similarity task is the SL dataset, on which concatenation outperforms all other methods by a large margin.

In the fastText meta-embedding semantic similarity and analogy results, which can also be seen in Table 7, we observe a pattern that differs from the word2vec counterpart: the picture is much better for the autoencoder-based meta-embedding methods. On four of the five datasets the autoencoder-based meta-embeddings visibly outperform all others. The only dataset where they do not is RG, the smallest dataset used in the semantic similarity task, on which the average meta-embedding performs better.

Of note is the fact that for every dataset except the SL dataset, the average meta-embeddings outperform the concatenation meta-embeddings at the semantic similarity task, and at the analogy task as well.

Table 7. Performance of word2vec/fastText meta-embeddings trained on text8 for semantic similarity and analogy tasks

We also see a difference in the performance of the fastText word and context embeddings. For instance, in the analogy task the context embeddings solve only 13.8% of the analogy questions, whereas the word vectors manage 40.6%. On the SL dataset of the semantic similarity task, the context vectors significantly outperform the word vectors (Spearman correlation of 0.3 versus 0.242). This is likely due to the difference in how fastText and word2vec vectors are trained: in the word2vec model the similarity score is a function of the dot product between the word and context vectors, whereas in the fastText model the word and character n-gram embeddings are summed before computing the dot product with the context vector.

Thus, while word and context vectors play symmetric roles in the word2vec model, they do not in the fastText model. We suspect the differences in performance are due to this fundamental asymmetry.
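In our notation, with \(u_w\) and \(v_c\) the word and context vectors, \(\mathcal{G}_w\) the set of character n-grams of \(w\) (including \(w\) itself), and \(z_g\) the n-gram embeddings, the two scoring functions can be written as

\[
s_{\text{word2vec}}(w, c) = u_w^{\top} v_c,
\qquad
s_{\text{fastText}}(w, c) = \Big( \sum_{g \in \mathcal{G}_w} z_g \Big)^{\top} v_c .
\]

In the first expression the word and context sides are structurally identical, while in the second the character n-gram composition appears only on the word side, which is the asymmetry referred to above.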

5 Conclusions and Future Work

By combining the word and context vectors of the word2vec and fastText models using several different meta-embedding approaches, we evaluate how much improvement context vectors can provide over word vectors alone in downstream NLP tasks such as text classification, semantic similarity, and analogy. Furthermore, we investigate which meta-embedding approaches are better suited to these tasks.

We show that even when a much larger training corpus is used for the embedding models, the resulting meta-embeddings show similar behavior: the concatenation of word and context embeddings usually leads to higher accuracy on the text classification task.

It is interesting to note that, just as the performance of word embedding models differs according to the task, so does that of the meta-embeddings of word and context vectors. In particular, concatenation meta-embeddings perform better on text classification tasks, while average meta-embeddings tend to perform better on semantic similarity and analogy tasks.

We plan to combine word and context embeddings using a greater variety of meta-embedding methods. In particular, we expect the averaging method to perform better if the word and context embeddings are first aligned via an orthogonal transformation. We would also like to evaluate the 1toN method [21] in this context. Another interesting direction is the inclusion of fastText’s character n-gram embeddings in the various combinations.

In future work we would also like to shed light on the performance differences among the autoencoder-based meta-embeddings through a more thorough analysis.