Abstract
Neural network-based word embeddings, which outperform traditional approaches in various Natural Language Processing tasks, have recently attracted a lot of interest. Despite this, Lithuanian word embeddings had never been obtained and evaluated before. Here we used a Lithuanian corpus of ~234 thousand running words and produced several word embedding models, varying the architecture (continuous bag-of-words and skip-gram), the training algorithm (softmax and negative sampling), and the number of dimensions (100, 300, 500, and 1,000). The word embeddings were evaluated using the Lithuanian WordNet as the resource for synonym search. We determined the superiority of the continuous bag-of-words architecture over skip-gram, while the training algorithm and dimensionality showed no significant impact on the results. The best results were achieved with the continuous bag-of-words architecture, negative sampling, and 1,000 dimensions.
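The intrinsic evaluation described above can be sketched as follows: for each query word, retrieve its top-k nearest neighbours by cosine similarity in the embedding space, and count a hit if any neighbour belongs to the word's WordNet synonym set. This is a minimal toy sketch, assuming hypothetical 3-dimensional vectors and made-up synonym sets (real models in the paper use 100-1,000 dimensions and the Lithuanian WordNet):

```python
import numpy as np

# Toy vocabulary and embedding matrix (hypothetical vectors, for
# illustration only; not taken from the trained Lithuanian models).
vocab = ["namas", "butas", "šuo", "katė", "gyvūnas"]
vecs = np.array([
    [0.90, 0.10, 0.00],  # namas (house)
    [0.85, 0.15, 0.05],  # butas (flat)
    [0.10, 0.90, 0.20],  # šuo (dog)
    [0.15, 0.85, 0.30],  # katė (cat)
    [0.20, 0.80, 0.40],  # gyvūnas (animal)
])

# Hypothetical synonym sets, standing in for Lithuanian WordNet synsets.
synsets = {"namas": {"butas"}, "šuo": {"gyvūnas"}}

def top_k_neighbours(word, k=1):
    """Return the k nearest vocabulary words by cosine similarity."""
    i = vocab.index(word)
    norms = np.linalg.norm(vecs, axis=1)
    sims = vecs @ vecs[i] / (norms * norms[i])
    sims[i] = -np.inf                        # exclude the query word itself
    order = np.argsort(sims)[::-1]           # most similar first
    return [vocab[j] for j in order[:k]]

def synonym_accuracy(k=1):
    """Fraction of query words whose top-k neighbours hit a known synonym."""
    hits = sum(bool(set(top_k_neighbours(w, k)) & syns)
               for w, syns in synsets.items())
    return hits / len(synsets)

print(synonym_accuracy(k=1))  # -> 0.5 on this toy data
print(synonym_accuracy(k=2))  # -> 1.0 once the neighbourhood is widened
```

Widening k trades precision for recall: with k=1 only the single nearest neighbour must be a synonym, while larger k gives the model more chances to surface one, which is why such evaluations typically report accuracy at several cut-offs.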
Notes
1. Described in detail at https://code.google.com/archive/p/word2vec/.
2. The Google word embeddings model can be downloaded from https://drive.google.com/file/d/0B7XkCwpI5KDYNlNUTTlSS21pQmM/edit.
3. Downloaded from https://dumps.wikimedia.org/ltwiktionary/.
4. The corpus STENOGRAMOS_INDV can be downloaded from http://dangus.vdu.lt/~jkd/eng/?page_id=16.
5. The whole Corpus of the Contemporary Lithuanian Language is available at http://clarin.vdu.lt:8080/xmlui/handle/20.500.11821/16.
6. The corpus of fiction texts GROŽINĖ_INDV can be downloaded from http://dangus.vdu.lt/~jkd/eng/?page_id=16.
7. Literary works downloaded from http://ebiblioteka.mkp.emokykla.lt/.
8. These embeddings were downloaded from https://fasttext.cc/docs/en/pretrained-vectors.html.
9. Downloaded from http://korpus.juls.savba.sk/ltskwn_en.html.
Copyright information
© 2019 Springer International Publishing AG, part of Springer Nature
About this paper
Cite this paper
Kapočiūtė-Dzikienė, J., Damaševičius, R. (2019). Intrinsic Evaluation of Lithuanian Word Embeddings Using WordNet. In: Silhavy, R. (eds) Artificial Intelligence and Algorithms in Intelligent Systems. CSOC2018 2018. Advances in Intelligent Systems and Computing, vol 764. Springer, Cham. https://doi.org/10.1007/978-3-319-91189-2_39
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-91188-5
Online ISBN: 978-3-319-91189-2
eBook Packages: Intelligent Technologies and Robotics (R0)