
Topic Model—Machine Learning Classifier Integrations on Geocoded Twitter Data

  • Chapter in: Biomedical and Other Applications of Soft Computing

Abstract

Topic models are unsupervised probabilistic models that explore the hidden semantic structure of a corpus of documents by generating latent topics. Latent topics are discrete distributions over words that need to be interpreted and labeled by humans. They can also be read as a summary of the subjects discussed in the corpus, so a topic model can also serve as a dimension-reduction method. The quality of the latent topics generated by topic models is difficult to measure [4], and their output is commonly used descriptively to gain insight into the topic distributions of chosen documents. A so far unexplored approach is the use of labeled data to assess the informational content of latent topics. In this paper, topic models are examined in terms of their informative value for classification problems on labeled text data. We use geocoded social media posts from Twitter, but our approach can be extended to other labeled documents. The output of Latent Dirichlet Allocation (LDA) [3] models and Structural Topic Models (STM) [29] is used as input for machine learning classifiers, after pooling tweets with the hashtag pooling algorithm of [24]. Their predictive power is compared with the performance of state-of-the-art artificial neural networks (ANNs) trained on a specifically optimized word embedding and all available tweet metadata. We find that the machine learning classifiers trained on topics can compete with the predictive performance of the ANNs, even for out-of-sample predicted topic distributions.
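The pipeline the abstract outlines can be illustrated in a minimal sketch: pool tweets that share a hashtag into one document (in the spirit of the hashtag pooling of [24]), fit a topic model on the pooled documents, and use the per-tweet topic distributions as classifier features. The sketch below uses scikit-learn's LDA and logistic regression on a few toy tweets; the toy data, hashtags, and labels are stand-ins for the geocoded corpus and do not reproduce the chapter's actual models or data.

```python
# Sketch of the topic-features-to-classifier pipeline (assumptions: scikit-learn
# available; toy tweets stand in for the geocoded Twitter corpus).
import re
from collections import defaultdict

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.linear_model import LogisticRegression

# Toy labeled tweets (text, label) -- hypothetical examples.
tweets = [
    ("rain flood storm warning #weather", "weather"),
    ("sunny heat drought forecast #weather", "weather"),
    ("goal match striker penalty #sports", "sports"),
    ("league finals referee stadium #sports", "sports"),
]

# Hashtag pooling: concatenate all tweets sharing a hashtag into one document,
# so the topic model sees longer, less sparse texts than single tweets.
pools = defaultdict(list)
for text, _label in tweets:
    for tag in re.findall(r"#\w+", text):
        pools[tag].append(text)
pooled_docs = [" ".join(texts) for texts in pools.values()]

# Fit LDA on the pooled documents.
vec = CountVectorizer()
lda = LatentDirichletAllocation(n_components=2, random_state=0)
lda.fit(vec.fit_transform(pooled_docs))

# Infer a topic distribution for each individual tweet; these low-dimensional
# vectors become the features of a supervised classifier.
X_topics = lda.transform(vec.transform([text for text, _ in tweets]))
y = [label for _, label in tweets]
clf = LogisticRegression().fit(X_topics, y)
print(clf.predict(X_topics))
```

The topic distributions act as a learned dimension reduction: the classifier sees only `n_components` features per tweet instead of the full vocabulary, which is what makes the comparison against ANNs trained on word embeddings meaningful.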


References

  1. D. Alvarez-Melis, M. Saveski, Topic modeling in Twitter: aggregating tweets by conversations, in Tenth International AAAI Conference on Web and Social Media (2016), pp. 519–522
  2. D.M. Blei, J.D. Lafferty, Dynamic topic models, in Proceedings of the 23rd International Conference on Machine Learning (2006), pp. 113–120
  3. D.M. Blei, A.Y. Ng, M.I. Jordan, Latent Dirichlet allocation. J. Mach. Learn. Res. 3, 993–1022 (2003)
  4. J. Boyd-Graber, Y. Hu, D. Mimno, Applications of topic models. Found. Trends Inf. Retr. 11, 143–296 (2017)
  5. Z. Cao, S. Li, Y. Liu, W. Li, H. Ji, A novel neural topic model and its supervised extension, in Twenty-Ninth AAAI Conference on Artificial Intelligence (2015), pp. 2210–2216
  6. J. Chang, S. Gerrish, C. Wang, J.L. Boyd-Graber, D.M. Blei, Reading tea leaves: how humans interpret topic models, in Advances in Neural Information Processing Systems (2009), pp. 288–296
  7. T. Chen, C. Guestrin, XGBoost: a scalable tree boosting system, in Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (2016), pp. 785–794
  8. T.A. Curry, M.P. Fix, May it please the twitterverse: the use of Twitter by state high court judges. J. Inf. Technol. Polit. 16(4), 379–393 (2019)
  9. D. Fischer-Preßler, C. Schwemmer, K. Fischbach, Collective sense-making in times of crisis: connecting terror management theory with Twitter user reactions to the Berlin terrorist attack. Comput. Hum. Behav. 100, 138–151 (2019)
  10. G. Forman, I. Cohen, Learning from little: comparison of classifiers given little training, in European Conference on Principles of Data Mining and Knowledge Discovery (2004), pp. 161–172
  11. J.H. Friedman, Greedy function approximation: a gradient boosting machine. Ann. Stat. 29(5), 1189–1232 (2001)
  12. T.L. Griffiths, M.I. Jordan, J.B. Tenenbaum, D.M. Blei, Hierarchical topic models and the nested Chinese restaurant process, in Advances in Neural Information Processing Systems (2004), pp. 17–24
  13. T. Hastie, R. Tibshirani, J. Friedman, The Elements of Statistical Learning: Data Mining, Inference and Prediction (Springer, 2009)
  14. M. Hoffman, F.R. Bach, D.M. Blei, Online learning for latent Dirichlet allocation, in Advances in Neural Information Processing Systems, vol. 23 (2010), pp. 856–864
  15. L. Hong, B.D. Davison, Empirical study of topic modeling in Twitter, in Proceedings of the First Workshop on Social Media Analytics (2010), pp. 80–88
  16. E. Ikonomakis, S. Kotsiantis, V. Tampakas, Text classification using machine learning techniques. WSEAS Trans. Comput. 4, 966–974 (2005)
  17. M. Imran, P. Mitra, C. Castillo, Twitter as a lifeline: human-annotated Twitter corpora for NLP of crisis-related messages, in Proceedings of the Tenth International Conference on Language Resources and Evaluation (2016), pp. 1638–1643
  18. M. Jin, X. Luo, H. Zhu, H.H. Zhuo, Combining deep learning and topic modeling for review understanding in context-aware recommendation, in Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Long Papers), vol. 1 (2018), pp. 1605–1614
  19. G. Kant, C. Weisser, B. Säfken, TTLocVis: a Twitter topic location visualization package. J. Open Source Softw. 5(54) (2020)
  20. F. Krasnov, A. Sen, The number of topics optimization: clustering approach. Mach. Learn. Knowl. Extr. 1(1), 416–426 (2019)
  21. C.-C. Lai, M.-C. Tsai, An empirical performance comparison of machine learning methods for spam e-mail categorization, in Fourth International Conference on Hybrid Intelligent Systems (2004), pp. 44–48
  22. J.H. Lau, T. Baldwin, T. Cohn, Topically driven neural language model, in Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (2017), pp. 355–365
  23. W. Lou, X. Wang, F. Chen, Y. Chen, B. Jiang, H. Zhang, Sequence based prediction of DNA-binding proteins based on hybrid feature selection using random forest and Gaussian naïve Bayes. PLoS One 9(1), e86703 (2014)
  24. R. Mehrotra, S. Sanner, W. Buntine, L. Xie, Improving LDA topic models for microblogs via tweet pooling and automatic labeling, in Proceedings of the 36th International ACM SIGIR Conference on Research and Development in Information Retrieval (2013), pp. 889–892
  25. T. Mikolov, I. Sutskever, K. Chen, G.S. Corrado, J. Dean, Distributed representations of words and phrases and their compositionality, in Advances in Neural Information Processing Systems (2013), pp. 3111–3119
  26. D. Mimno, H.M. Wallach, E. Talley, M. Leenders, A. McCallum, Optimizing semantic coherence in topic models, in Proceedings of the Conference on Empirical Methods in Natural Language Processing (2011), pp. 262–272
  27. A. Mishler, E.S. Crabb, S. Paletz, B. Hefright, E. Golonka, Using structural topic modeling to detect events and cluster Twitter users in the Ukrainian crisis, in International Conference on Human-Computer Interaction (2015), pp. 639–644
  28. J. Pennington, R. Socher, C.D. Manning, GloVe: global vectors for word representation, in Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (2014), pp. 1532–1543
  29. M.E. Roberts, B.M. Stewart, D. Tingley, et al., STM: R package for structural topic models. J. Stat. Softw. 10(2), 1–40 (2014)
  30. J. Roesslein, Tweepy: Twitter for Python! (2020). https://github.com/tweepy/tweepy
  31. P. Shrestha, S. Sierra, F.A. González, M. Montes, P. Rosso, T. Solorio, Convolutional neural networks for authorship attribution of short texts, in Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Short Papers, vol. 2 (2017), pp. 669–674
  32. A. Steinskog, J. Therkelsen, B. Gambäck, Twitter topic modeling by tweet aggregation, in Proceedings of the 21st Nordic Conference on Computational Linguistics (2017), pp. 77–86
  33. S. Vosoughi, P. Vijayaraghavan, D. Roy, Tweet2Vec: learning tweet embeddings using character-level CNN-LSTM encoder-decoder, in Proceedings of the 39th International ACM SIGIR Conference on Research and Development in Information Retrieval (2016), pp. 1041–1044
  34. C. Wang, J. Paisley, D. Blei, Online variational inference for the hierarchical Dirichlet process, in Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics (2011), pp. 752–760
  35. X. Wang, W. Jiang, Z. Luo, Combination of convolutional and recurrent neural network for sentiment analysis of short texts, in Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers (2016), pp. 2428–2437
  36. L. Yang, T. Sun, M. Zhang, Q. Mei, We know what @you #tag: does the dual role affect hashtag adoption? in Proceedings of the 21st International Conference on World Wide Web (2012), pp. 261–270
  37. W.X. Zhao, J. Jiang, J. Weng, J. He, E.-P. Lim, H. Yan, X. Li, Comparing Twitter and traditional media using topic models, in European Conference on Information Retrieval (2011), pp. 338–349

Author information

Correspondence to Christoph Weisser.

7 Appendix

Fig. 4 ANN model: an input layer for the categorical and numerical features, followed by dense and batch-normalization layers.

Fig. 5 An overview of the implemented workflow: data streaming and collection of tweets, data preprocessing, topic modeling with LDA and STM, training and test sets of per-tweet topic distributions, and ANN classification.


Copyright information

© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this chapter

Cite this chapter

Kant, G., Weisser, C., Kneib, T., Säfken, B. (2023). Topic Model—Machine Learning Classifier Integrations on Geocoded Twitter Data. In: Phuong, N.H., Kreinovich, V. (eds) Biomedical and Other Applications of Soft Computing. Studies in Computational Intelligence, vol 1045. Springer, Cham. https://doi.org/10.1007/978-3-031-08580-2_11
