Text Representations and Word Embeddings

Egger, Roman

doi:10.1007/978-3-030-88389-8_16

Roman Egger⁴

Part of the book series: Tourism on the Verge ((TV))

3703 Accesses
13 Citations

Abstract

Today, a vast amount of unstructured text data is consistently at our disposal. Owing to the rapid increase in user-generated content on the one hand and powerful, ready-to-use machine learning algorithms on the other hand, automated text analysis can now be carried out in a way that would have been unimaginable a few years ago. However, in order to be able to analyze textual data further, it first becomes necessary to extract features and transform the text into numerical values, which is required by many algorithms. As such, this chapter will present the basics of vectorizing text, beginning with the most simple approaches of text representation and increasing in complexity and performance as each subsequent algorithm is presented. The aim of this chapter is, thus, to convey the intuition behind each approach in a straightforward manner and to focus on the elements that are relevant for practical application.

Access provided by Autonomous University of Puebla. Download to read the full chapter text

Keywords

References

Afrizal, A. D., Rakhmawati, N. A., & Tjahyanto, A. (2019). New filtering scheme based on term weighting to improve object based opinion mining on tourism product reviews. Procedia Computer Science, 161, 805–812. https://doi.org/10.1016/j.procs.2019.11.186
Article Google Scholar
Aggarwal, C. C. (2018). Machine learning for text. Springer. Retrieved from https://springerlink.bibliotecabuap.elogim.com/content/pdf/10.1007/978-3-319-73531-3.pdf
Almeida, F., & Xexéo, G. (2019). Word Embeddings: A survey.
Google Scholar
Alsentzer, E., Murphy, J. R., Boag, W., Weng, W.-H., Di Jin, Naumann, T., & McDermott, M. B. A. (2019, April 6). Publicly available Clinical BERT Embeddings. Retrieved from http://arxiv.org/pdf/1904.03323v3
Anandarajan, M., Hill, C., & Nolan, T. (2019). Practical text analytics (Vol. 2). Springer International Publishing. https://doi.org/10.1007/978-3-319-95663-3
Book Google Scholar
Anibar, S. (2021, April 11). Text classification — From Bag-of-Words to BERT — Part 3(fastText). Retrieved from https://medium.com/analytics-vidhya/text-classification-from-bag-of-words-to-bert-part-3-fasttext-8313e7a14fce
Arefieva, V., & Egger, R. (2021). Tourism_Doc2Vec [computer software].
Google Scholar
Arefieva, V., Egger, R., & Yu, J. (2021). A machine learning approach to cluster destination image on Instagram. Tourism Management, 85, 104318. https://doi.org/10.1016/j.tourman.2021.104318
Article Google Scholar
Beltagy, I., Lo Kyle, & Cohan, A. (2019). SciBERT: A pretrained language model for scientific text. Retrieved from http://arxiv.org/pdf/1903.10676v3
Bender, E. M., & Lascarides, A. (2019). Linguistic fundamentals for natural language processing II: 100 essentials from semantics and pragmatics. Synthesis Lectures on Human Language Technologies, 12(3), 1–268. https://doi.org/10.2200/s00935ed1v02y201907hlt043
Article Google Scholar
Bengio, Y. (2008). Neural net language models. Scholarpedia, 3(1), 3881. https://doi.org/10.4249/scholarpedia.3881
Article Google Scholar
Chang, Y.-C., Ku, C.-H., & Chen, C.-H. (2020). Using deep learning and visual analytics to explore hotel reviews and responses. Tourism Management, 80, 104129. https://doi.org/10.1016/j.tourman.2020.104129
Article Google Scholar
Chang, C.-Y., Lee, S.-J., & Lai, C.-C. (2017). Weighted word2vec based on the distance of words. In Proceedings of 2017 International Conference on Machine Learning and Cybernetics: Crowne Plaza City center Ningbo, Ningbo, China, 9–12 July 2017. IEEE. https://doi.org/10.1109/icmlc.2017.8108974
Chapter Google Scholar
Chantrapornchai, C., & Tunsakul, A. (2020). Information extraction tasks based on BERT and SpaCy on tourism domain. ECTI Transactions on Computer and Information Technology (ECTI-CIT), 15(1), 108–122. https://doi.org/10.37936/ecti-cit.2021151.228621
Article Google Scholar
Conneau, A., & Kiela, D. (2018, March 14). SentEval: An evaluation toolkit for universal sentence representations. Retrieved from http://arxiv.org/pdf/1803.05449v1
Dhami, D. (2020). Understanding BERT - Word Embeddings. Retrieved from https://medium.com/@dhartidhami/understanding-bert-word-embeddings-7dc4d2ea54ca
Dong, G., & Liu, H. (Eds.). (2017). Chapman & Hall/CRC data mining & knowledge discovery series: No. 44. Feature engineering for machine learning and data analytics (1st ed.). CRC Press/Taylor & Francis Group.
Google Scholar
Duboue, P. (2020). The art of feature engineering. Cambridge University Press. https://doi.org/10.1017/9781108671682
Book Google Scholar
Dündar, E. B., Çekiç, T., Deniz, O., & Arslan, S. (2018). A hybrid approach to question-answering for a banking Chatbot on Turkish: Extending keywords with embedding vectors. In A. Fred & J. Filipe (Eds.), Proceedings: Volume 1, KDIR. [S. l.]: SCITEPRESS = science and technology publications. https://doi.org/10.5220/0006925701710177.
Ethayarajh, K. (2019). How contextual are contextualized word representations? Comparing the geometry of BERT, ELMo, and GPT-2 Embeddings.
Google Scholar
FastText.cc (2020, July 18). fastText – Library for efficient text classification and representation learning. Retrieved from https://fasttext.cc/
Goldberg, Y., & Levy, O. (2014). word2vec Explained: deriving Mikolov et al.’s negative-sampling word-embedding method.
Google Scholar
Gurjar, O., & Gupta, M. (2020, December 18). Should I visit this place? Inclusion and exclusion phrase mining from reviews. Retrieved from http://arxiv.org/pdf/2012.10226v1
Han, Q., Leid, Z., & Margarida Abreu, N. (2019). tourism2vec, Available at SSRN 3350125.
Google Scholar
Harris, Z. S. (1954). Distributional structure. WORD, 10(2–3), 146–162. https://doi.org/10.1080/00437956.1954.11659520
Article Google Scholar
Hayashi, T., & Yoshida, T. (2019). Development of a tour recommendation system using online customer reviews. In J. Xu, F. L. Cooke, M. Gen, & S. E. Ahmed (Eds.), Lecture notes on multidisciplinary industrial engineering. Proceedings of the twelfth international conference on management science and engineering management (pp. 1145–1153). Springer International Publishing. https://doi.org/10.1007/978-3-319-93351-1_90
Chapter Google Scholar
Horn, N., Erhardt, M. S., Di Stefano, M., Bosten, F., & Buchkremer, R. (2020). Vergleichende Analyse der Word-Embedding-Verfahren Word2Vec und GloVe am Beispiel von Kundenbewertungen eines Online-Versandhändlers. In R. Buchkremer, T. Heupel, & O. Koch (Eds.), FOM-edition. Künstliche Intelligenz in Wirtschaft & Gesellschaft (pp. 559–581). Springer Fachmedien Wiesbaden. https://doi.org/10.1007/978-3-658-29550-9_29
Chapter Google Scholar
Jang, B., Kim, I., & Kim, J. W. (2019). Word2vec convolutional neural networks for classification of news articles and tweets. PLoS One, 14(8), e0220976. https://doi.org/10.1371/journal.pone.0220976
Article Google Scholar
Jatnika, D., Bijaksana, M. A., & Suryani, A. A. (2019). Word2Vec model analysis for semantic similarities in English words. Procedia Computer Science, 157, 160–167. https://doi.org/10.1016/j.procs.2019.08.153
Article Google Scholar
Jurafsky, D., & Martin, J. H. (2000). Speech and language processing: An introduction to natural language processing, computational linguistics, and speech recognition. In Prentice Hall series in artificial intelligence. Prentice Hall.
Google Scholar
Karanikolas, N. N., Voulodimos, A., Sgouropoulou, C., Nikolaidou, M., & Gritzalis, S. (Eds.). (2020). 24th Pan-Hellenic Conference on Informatics. ACM.
Google Scholar
Kenyon-Dean, K., Newell, E., & Cheung, J. C. K. (2020). Deconstructing word embedding algorithms. In B. Webber, T. Cohn, Y. He, & Y. Liu (Eds.), Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP) (pp. 8479–8484). Association for Computational Linguistics. https://doi.org/10.18653/v1/2020.emnlp-main.681
Chapter Google Scholar
Kishore, A. (2018). Word2vec. In Pro machine learning algorithms (pp. 167–178). Apress.
Google Scholar
Krishna, K., Jyothi, P., & Iyyer, M. (2018). Revisiting the importance of encoding logic rules in sentiment classification.
Book Google Scholar
Kuntarto, G. P., Moechtar, F. L., Santoso, B. I., & Gunawan, I. P. (2015). Comparative study between part-of-speech and statistical methods of text extraction in the tourism domain. In G. Kuntarto, F. Moechtar, B. I. Santoso, & I. P. Gunawan (Eds.), 2015 International Conference on Information Technology Systems and Innovation (ICITSI) (pp. 1–6). IEEE. https://doi.org/10.1109/ICITSI.2015.7437675
Chapter Google Scholar
Landthaler, J. (2020). Improving semantic search in the German legal domain with word Embeddings. Technische Universität München. Retrieved from https://mediatum.ub.tum.de/1521744
Le, Q., & Mikolov, T. (2014). Distributed representations of sentences and documents. In International conference on machine learning (pp. 1188–1196) Retrieved from http://proceedings.mlr.press/v32/le14.html
Google Scholar
Lee, J., Yoon, W., Kim, S., Kim, D., Kim, S., So, C. H., & Kang, J. (2020). Biobert: A pre-trained biomedical language representation model for biomedical text mining. Bioinformatics (Oxford, England), 36(4), 1234–1240. https://doi.org/10.1093/bioinformatics/btz682
Article Google Scholar
Li, W., Guo, K., Shi, Y., Zhu, L., & Zheng, Y. (2018). DWWP: Domain-specific new words detection and word propagation system for sentiment analysis in the tourism domain. Knowledge-Based Systems, 146, 203–214. https://doi.org/10.1016/j.knosys.2018.02.004
Article Google Scholar
Li, Q., Li, S., Hu, J., Zhang, S., & Hu, J. (2018). Tourism review sentiment classification using a bidirectional recurrent neural network with an attention mechanism and topic-enriched word vectors. Sustainability, 10(9), 3313. https://doi.org/10.3390/su10093313
Article Google Scholar
Li, S., Li, G., Law, R., & Paradies, Y. (2020). Racism in tourism reviews. Tourism Management, 80, 104100. https://doi.org/10.1016/j.tourman.2020.104100
Article Google Scholar
Li, Q., Li, S., Zhang, S., Hu, J., & Hu, J. (2019). A review of text corpus-based tourism big data mining. Applied Sciences, 9(16), 3300. https://doi.org/10.3390/app9163300
Article Google Scholar
Li, W., Zhu, L., Guo, K., Shi, Y., & Zheng, Y. (2018). Build a tourism-specific sentiment lexicon via Word2vec. Annals of Data Science, 5(1), 1–7. https://doi.org/10.1007/s40745-017-0130-3
Article Google Scholar
Liu, Y., Che, W., Wang, Y., Zheng, B., Qin, B., & Liu, T. (2020). Deep contextualized word Embeddings for universal dependency parsing. ACM Transactions on Asian and Low-Resource Language Information Processing, 19(1), 1–17. https://doi.org/10.1145/3326497
Article Google Scholar
Luo, Y., He, J., Mou, Y., Wang, J., & Liu, T. (2021). Exploring China’s 5A global geoparks through online tourism reviews: A mining model based on machine learning approach. Tourism Management Perspectives, 37, 100769. https://doi.org/10.1016/j.tmp.2020.100769
Article Google Scholar
Memarzadeh, M., & Kamandi, A. (2020). Model-based location recommender system using geotagged photos on Instagram. In 2020 6th International Conference on Web Research (ICWR) (pp. 203–208). IEEE. https://doi.org/10.1109/ICWR49608.2020.9122274
Chapter Google Scholar
Mikolov, T., Chen, K., Corrado, G., & Dean, J. (2013, January 16). Efficient estimation of word representations in vector space. Retrieved from http://arxiv.org/pdf/1301.3781v3
Mikolov, T., Grave, E., Bojanowski, P., Puhrsch, C., & Joulin, A. (2017). Advances in pre-training distributed word representations.
Google Scholar
Mikolov, T., Sutskever, I., Chen, K., Corrado, G., & Dean, J. (2013). Distributed representations of words and phrases and their compositionality.
Google Scholar
Mishra, R., Lata, S., Llavoric, R. B., & Srinathand, K. (2019). Automatic tracking of tourism spots for tourists. SSRN Electronic Journal. Advance online publication. https://doi.org/10.2139/ssrn.3462982
Nathania, H. G., Siautama, R., Amadea Claire, I. A., & Suhartono, D. (2021). Extractive hotel review summarization based on TF/IDF and adjective-noun pairing by considering annual sentiment trends. Procedia Computer Science, 179, 558–565. https://doi.org/10.1016/j.procs.2021.01.040
Article Google Scholar
Pennington, J., Socher, R., & Manning, C. (2014). Glove: Global vectors for word representation. In M. Alessandro, P. Bo, & D. Walter (Eds.), Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (pp. 1532–1543). Association for Computational Linguistics. https://doi.org/10.3115/v1/D14-1162
Chapter Google Scholar
Peters, M. E., Neumann, M., Iyyer, M., Gardner, M., Clark, C., Lee, K., & Zettlemoyer, L. (2018). Deep contextualized word representations.
Google Scholar
Premakumara, N., Shiranthika, C., Welideniya, P., Bandara, C., Prasad, I., & Sumathipala, S. (2019). Application of summarization and sentiment analysis in the tourism domain. In 2019 IEEE 5th International Conference for Convergence in Technology (I2CT) (pp. 1–5). IEEE. https://doi.org/10.1109/I2CT45611.2019.9033569
Chapter Google Scholar
Putra, Y. A., & Khodra, M. L. (2016). Deep learning and distributional semantic model for Indonesian tweet categorization. In Proceedings of 2016 International Conference on Data and Software Engineering (ICoDSE): Udayana University, Denpasar, Bali, Indonesia, October 26th–27th 2016. IEEE. https://doi.org/10.1109/icodse.2016.7936108
Chapter Google Scholar
Ramos, J. (2003). Using TF-IDF to determine word relevance in document queries. In: Proceedings of the first instructional conference on machine learning.
Google Scholar
Ray, B., Garain, A., & Sarkar, R. (2021). An ensemble-based hotel recommender system using sentiment analysis and aspect categorization of hotel reviews. Applied Soft Computing, 98, 106935. https://doi.org/10.1016/j.asoc.2020.106935
Article Google Scholar
Sahlgren, M. (2008). The distributional hypothesis. Ialian Journal of Disability Studies. (20), 33–53. Retrieved from https://www.diva-portal.org/smash/get/diva2:1041938/fulltext01.pdf
Santos, J., Consoli, B., & Vieira, R. (Eds.) (2020). Word embedding evaluation in downstream tasks and semantic analogies.
Google Scholar
Shahbazi, H., Fern, X. Z., Ghaeini, R., Obeidat, R., & Tadepalli, P. (2019). Entity-aware ELMo: Learning contextual entity representation for entity disambiguation.
Google Scholar
Sieg, A. (2019a). FROM Pre-trained Word Embeddings TO Pre-trained Language Models: FROM Static Word Embedding TO Dynamic (Contextualized) Word Embedding. Retrieved from https://towardsdatascience.com/from-pre-trained-word-embeddings-to-pre-trained-language-models-focus-on-bert-343815627598
Simov, K., Boytcheva, S., & Osenova, P. (2017). Towards lexical chains for knowledge-graph-basedWord Embeddings. In RANLP 2017 – Recent Advances in Natural Language Processing Meet Deep Learning. Incoma Ltd. https://doi.org/10.26615/978-954-452-049-6_087
Chapter Google Scholar
Sun, Y., Liang, C., & Chang, C.-C. (2020). Online social construction of Taiwan's rural image: Comparison between Taiwanese self-representation and Chinese perception. Tourism Management, 76, 103968. https://doi.org/10.1016/j.tourman.2019.103968
Article Google Scholar
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., … Polosukhin, I. (2017). Attention is all you need.
Google Scholar
Wang, L., Wang, X., Peng, J., & Wang, J. (2020). The differences in hotel selection among various types of travellers: A comparative analysis with a useful bounded rationality behavioural decision support model. Tourism Management, 76, 103961. https://doi.org/10.1016/j.tourman.2019.103961
Article Google Scholar
C. Yuan, J. Wu, H. Li, & L. Wang (2018). Personality recognition based on user generated content. In 2018 15th International Conference on Service Systems and Service Management (ICSSSM).
Google Scholar
Zhang, X., Lin, P., Chen, S., Cen, H., Wang, J., Huang, Q., … Huang, P. (2016). Valence-arousal prediction of Chinese Words with multi-layer corpora. In M. Dong (Ed.), Proceedings of the 2016 International Conference on Asian Language Processing (IALP): 21–23 November 2016, Tainan, Taiwan. IEEE. https://doi.org/10.1109/ialp.2016.7875992
Chapter Google Scholar
Zheng, A., & Casari, A. (2018). Feature engineering for machine learning: Principles and techniques for data scientists (1st ed.). O’Reilly.
Google Scholar

Download references

Author information

Authors and Affiliations

Salzburg University of Applied Sciences, Innovation and Management in Tourism, Urstein (Puch), Salzburg, Austria
Roman Egger

Authors

Roman Egger
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Roman Egger .

Editor information

Editors and Affiliations

Innovation & Management in Tourism, Salzburg University of Applied Sciences, Urstein (Puch), Salzburg, Austria
Roman Egger

Rights and permissions

Reprints and permissions

Copyright information

About this chapter

Cite this chapter

Egger, R. (2022). Text Representations and Word Embeddings. In: Egger, R. (eds) Applied Data Science in Tourism. Tourism on the Verge. Springer, Cham. https://doi.org/10.1007/978-3-030-88389-8_16

Download citation

DOI: https://doi.org/10.1007/978-3-030-88389-8_16
Published: 31 January 2022
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-88388-1
Online ISBN: 978-3-030-88389-8
eBook Packages: Business and ManagementBusiness and Management (R0)

Publish with us

Policies and ethics

Text Representations and Word Embeddings

Abstract

Chapter PDF

Similar content being viewed by others

Representing Words in Vector Space and Beyond

A detailed review on word embedding techniques with emphasis on word2vec

A Systematic Literature Review on Word Embeddings

Keywords

References

Author information

Authors and Affiliations

Corresponding author

Editor information

Editors and Affiliations

Further Readings and Other Sources

Rights and permissions

Copyright information

About this chapter

Cite this chapter

Download citation

Publish with us

Navigation

Text Representations and Word Embeddings

Abstract

Chapter PDF

Similar content being viewed by others

Representing Words in Vector Space and Beyond

A detailed review on word embedding techniques with emphasis on word2vec

A Systematic Literature Review on Word Embeddings

Keywords

References

Author information

Authors and Affiliations

Corresponding author

Editor information

Editors and Affiliations

Further Readings and Other Sources

Further Readings and Other Sources

Rights and permissions

Copyright information

About this chapter

Cite this chapter

Download citation

Share this chapter

Publish with us

Search

Navigation