Intrinsic Evaluation of Lithuanian Word Embeddings Using WordNet

Conference paper

In: Artificial Intelligence and Algorithms in Intelligent Systems (CSOC 2018)

Part of the book series: Advances in Intelligent Systems and Computing (AISC, volume 764)

Abstract

Neural network-based word embeddings, which outperform traditional approaches in various Natural Language Processing tasks, have recently attracted a lot of interest. Despite this, Lithuanian word embeddings had never been obtained and evaluated before. Here we used a Lithuanian corpus of ~234 thousand running words and produced several word embedding models, varying the architecture (continuous bag-of-words or skip-gram), the training algorithm (softmax or negative sampling), and the number of dimensions (100, 300, 500, and 1,000). The word embeddings were evaluated using the Lithuanian WordNet as the resource for synonym search. We determined the superiority of the continuous bag-of-words architecture over skip-gram, whereas the training algorithm and the dimensionality showed no significant impact on the results. The best results were achieved with the continuous bag-of-words architecture, negative sampling, and 1,000 dimensions.
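
As a concrete illustration of the model grid described in the abstract, below is a minimal sketch in Python using the gensim implementation of word2vec; it is not necessarily the toolkit the authors used, and the corpus file name, the synonym lookup, and the top-n hit criterion are illustrative assumptions.

    # A minimal sketch of the model grid from the abstract, using gensim's
    # word2vec. Not the authors' code: the corpus file, preprocessing, and
    # the synonym-hit criterion are illustrative assumptions.
    from gensim.models import Word2Vec

    # Hypothetical corpus file: one whitespace-tokenised sentence per line.
    with open("lithuanian_corpus.txt", encoding="utf-8") as f:
        sentences = [line.split() for line in f]

    models = {}
    for sg in (0, 1):                  # 0 = continuous bag-of-words, 1 = skip-gram
        for hs in (0, 1):              # 1 = (hierarchical) softmax, 0 = negative sampling
            for dim in (100, 300, 500, 1000):
                models[(sg, hs, dim)] = Word2Vec(
                    sentences,
                    vector_size=dim,
                    sg=sg,
                    hs=hs,
                    negative=0 if hs else 5,  # draw negatives only when hs == 0
                )

    def synonym_hit(model, word, wordnet_synonyms, topn=10):
        """True if any WordNet synonym of `word` appears among its top-n
        nearest neighbours by cosine similarity (an assumed hit criterion)."""
        if word not in model.wv:
            return False
        neighbours = {w for w, _ in model.wv.most_similar(word, topn=topn)}
        return bool(neighbours & set(wordnet_synonyms))

Accuracy over a WordNet-derived test set would then be the fraction of words for which synonym_hit returns True; whether the paper scores hits this way or by rank is not specified by the abstract alone.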


Notes

  1. Described in detail at https://code.google.com/archive/p/word2vec/.

  2. The Google word embeddings model can be downloaded from https://drive.google.com/file/d/0B7XkCwpI5KDYNlNUTTlSS21pQmM/edit.

  3. Downloaded from https://dumps.wikimedia.org/ltwiktionary/.

  4. The corpus STENOGRAMOS_INDV can be downloaded from http://dangus.vdu.lt/~jkd/eng/?page_id=16.

  5. The whole Corpus of the Contemporary Lithuanian Language is at http://clarin.vdu.lt:8080/xmlui/handle/20.500.11821/16.

  6. The corpus of fiction texts GROŽINĖ_INDV can be downloaded from http://dangus.vdu.lt/~jkd/eng/?page_id=16.

  7. Literary works downloaded from http://ebiblioteka.mkp.emokykla.lt/.

  8. These embeddings were downloaded from https://fasttext.cc/docs/en/pretrained-vectors.html.

  9. Downloaded from http://korpus.juls.savba.sk/ltskwn_en.html.


Author information

Correspondence to Robertas Damaševičius.


Copyright information

© 2019 Springer International Publishing AG, part of Springer Nature

About this paper


Cite this paper

Kapočiūtė-Dzikienė, J., Damaševičius, R. (2019). Intrinsic Evaluation of Lithuanian Word Embeddings Using WordNet. In: Silhavy, R. (ed.) Artificial Intelligence and Algorithms in Intelligent Systems. CSOC 2018. Advances in Intelligent Systems and Computing, vol 764. Springer, Cham. https://doi.org/10.1007/978-3-319-91189-2_39

