Abstract
Topic models are unsupervised probabilistic models that uncover the hidden semantic structure of a corpus of documents by generating latent topics. Latent topics are discrete distributions over words that need to be interpreted and labeled by humans. They can also be viewed as a summary of the themes discussed in the corpus, so a topic model can additionally serve as a dimension-reduction method. The quality of latent topics is difficult to measure [4], and their outputs are commonly used in a descriptive manner to gain insight into the topic distributions of selected documents. A so far unexplored approach is the use of labeled data to assess the informational content of latent topics. In this paper, topic models are examined in terms of their informative value for classification problems on labeled text data. We use geo-coded social media posts from Twitter, but our approach can be extended to other labeled documents. The outputs of Latent Dirichlet Allocation (LDA) [3] models and Structural Topic Models (STM) [29] are used as input for machine learning classifiers, after pooling tweets with the hashtag pooling algorithm of [24]. Their predictive power is compared with the performance of state-of-the-art artificial neural networks (ANNs) trained on a specifically optimized word embedding and all available tweet metadata. We find that the machine learning classifiers trained on topics can compete with the predictive performance of the ANNs, even for out-of-sample predicted topic distributions.
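As a minimal illustration of the pipeline described above (not the authors' exact implementation), the sketch below pools tweets sharing a hashtag into pseudo-documents, fits an LDA model on the pooled documents with scikit-learn, and then feeds per-tweet topic distributions into a simple classifier. The toy tweets and labels are invented for the example, and the similarity-based assignment of hashtag-free tweets used in [24] is omitted.

```python
import re
from collections import defaultdict

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.linear_model import LogisticRegression


def pool_by_hashtag(tweets):
    """Simplified hashtag pooling: concatenate all tweets sharing a
    hashtag into one pseudo-document; tweets without hashtags remain
    their own documents ([24] additionally assigns these to pools by
    similarity, which is omitted here)."""
    pools = defaultdict(list)
    singles = []
    for t in tweets:
        tags = re.findall(r"#(\w+)", t.lower())
        if tags:
            for tag in tags:
                pools[tag].append(t)
        else:
            singles.append(t)
    return [" ".join(v) for v in pools.values()] + singles


# Invented toy data: crisis-related (1) vs. unrelated (0) tweets
tweets = [
    "Flooding near the river #flood #emergency",
    "Roads closed due to #flood water",
    "Lovely sunny day at the beach",
    "Emergency services on site #emergency",
]
labels = [1, 1, 0, 1]

# Fit LDA on pooled pseudo-documents, which tends to yield more
# coherent topics than fitting on individual short tweets
pooled = pool_by_hashtag(tweets)
vec = CountVectorizer(stop_words="english")
lda = LatentDirichletAllocation(n_components=2, random_state=0)
lda.fit(vec.fit_transform(pooled))

# Infer per-tweet topic distributions and use them as classifier features
theta = lda.transform(vec.transform(tweets))
clf = LogisticRegression().fit(theta, labels)
predictions = clf.predict(theta)
```

In the paper this idea is scaled up: the document-topic distributions from LDA and STM replace raw text as the feature matrix, so any standard classifier (e.g. gradient boosting [14] instead of the logistic regression used here) can be trained on them.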
References
D. Alvarez-Melis, M. Saveski, Topic modeling in twitter: aggregating tweets by conversations, in Tenth International AAAI Conference on Web and Social Media (2016), pp. 519–522
D.M. Blei, J.D. Lafferty, Dynamic topic models, in Proceedings of the 23rd International Conference on Machine Learning (2006), pp. 113–120
D.M. Blei, A.Y. Ng, M.I. Jordan, Latent Dirichlet allocation. J. Mach. Learn. Res. 3, 993–1022 (2003)
J. Boyd-Graber, Y. Hu, D. Mimno, Applications of topic models. Found. Trends Inf. Retr. 11, 143–296 (2017)
Z. Cao, S. Li, Y. Liu, W. Li, H. Ji, A novel neural topic model and its supervised extension, in Twenty-Ninth AAAI Conference on Artificial Intelligence (2015), pp. 2210–2216
J. Chang, S. Gerrish, C. Wang, J.L. Boyd-Graber, D.M. Blei, Reading tea leaves: how humans interpret topic models, in Advances in Neural Information Processing Systems (2009), pp. 288–296
T. Chen, C. Guestrin, Xgboost: a scalable tree boosting system, in Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (2016), pp. 785–794
T.A. Curry, M.P. Fix, May it please the twitterverse: the use of twitter by state high court judges. J. Inf. Technol. Polit. 16(4), 379–393 (2019)
D. Fischer-Preßler, C. Schwemmer, K. Fischbach, Collective sense-making in times of crisis: connecting terror management theory with Twitter user reactions to the Berlin terrorist attack. Comput. Hum. Behav. 100, 138–151 (2019)
G. Forman, I. Cohen, Learning from little: comparison of classifiers given little training, in European Conference on Principles of Data Mining and Knowledge Discovery (2004), pp. 161–172
J.H. Friedman, Greedy function approximation: a gradient boosting machine. Ann. Stat. 29(5), 1189–1232 (2001)
T.L. Griffiths, M.I. Jordan, J.B. Tenenbaum, D.M. Blei, Hierarchical topic models and the nested Chinese restaurant process, in Advances in Neural Information Processing Systems (2004), pp. 17–24
T. Hastie, R. Tibshirani, J. Friedman. The Elements of Statistical Learning: Data Mining, Inference and Prediction (Springer, 2009)
M. Hoffman, F.R. Bach, D.M. Blei, Online learning for latent Dirichlet allocation, in Advances in Neural Information Processing Systems, vol. 23 (2010), pp. 856–864
L. Hong, B.D. Davison, Empirical study of topic modeling in Twitter, in Proceedings of the First Workshop on Social Media Analytics (2010), pp. 80–88
E. Ikonomakis, S. Kotsiantis, V. Tampakas, Text classification using machine learning techniques. WSEAS Trans. Comput. 4, 966–974 (2005)
M. Imran, P. Mitra, C. Castillo, Twitter as a lifeline: human-annotated twitter corpora for NLP of crisis-related messages, in Proceedings of the Tenth International Conference on Language Resources and Evaluation (2016), pp. 1638–1643
M. Jin, X. Luo, H. Zhu, H.H. Zhuo, Combining deep learning and topic modeling for review understanding in context-aware recommendation, in Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Long Papers), vol. 1 (2018), pp. 1605–1614
G. Kant, C. Weisser, B. Säfken, TTLocVis: a Twitter topic location visualization package. J. Open Source Softw. 5(54) (2020)
F. Krasnov, A. Sen, The number of topics optimization: clustering approach. Mach. Learn. Knowl. Extr. 1(1), 416–426 (2019)
C.-C. Lai, M.-C. Tsai, An empirical performance comparison of machine learning methods for spam e-mail categorization, in Fourth International Conference on Hybrid Intelligent Systems (2004), pp. 44–48
J.H. Lau, T. Baldwin, T. Cohn, Topically driven neural language model, in Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (2017), pp. 355–365
W. Lou, X. Wang, F. Chen, Y. Chen, B. Jiang, H. Zhang, Sequence based prediction of DNA-binding proteins based on hybrid feature selection using random forest and Gaussian naïve Bayes. PLoS ONE 9(1), e86703 (2014)
R. Mehrotra, S. Sanner, W. Buntine, L. Xie, Improving LDA topic models for microblogs via tweet pooling and automatic labeling, in Proceedings of the 36th International ACM SIGIR Conference on Research and Development in Information Retrieval (2013), pp. 889–892
T. Mikolov, I. Sutskever, K. Chen, G.S. Corrado, J. Dean, Distributed representations of words and phrases and their compositionality, in Advances in Neural Information Processing Systems (2013), pp. 3111–3119
D. Mimno, H.M. Wallach, E. Talley, M. Leenders, A. McCallum, Optimizing semantic coherence in topic models, in Proceedings of the Conference on Empirical Methods in Natural Language Processing (2011), pp. 262–272
A. Mishler, E.S. Crabb, S. Paletz, B. Hefright, E. Golonka, Using structural topic modeling to detect events and cluster twitter users in the Ukrainian crisis, in International Conference on Human-Computer Interaction (2015), pp. 639–644
J. Pennington, R. Socher, C.D. Manning, Glove: global vectors for word representation, in Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (2014), pp. 1532–1543
M.E. Roberts, B.M. Stewart, D. Tingley, stm: an R package for structural topic models. J. Stat. Softw. 91(2), 1–40 (2019)
J. Roesslein, Tweepy: twitter for python! (2020). https://github.com/tweepy/tweepy
P. Shrestha, S. Sierra, F.A. González, M. Montes, P. Rosso, T. Solorio, Convolutional neural networks for authorship attribution of short texts, in Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Short Papers, vol. 2 (2017), pp. 669–674
A. Steinskog, J. Therkelsen, B. Gambäck, Twitter topic modeling by tweet aggregation, in Proceedings of the 21st Nordic Conference on Computational Linguistics (2017), pp. 77–86
S. Vosoughi, P. Vijayaraghavan, D. Roy, Tweet2vec: learning tweet embeddings using character-level CNN-LSTM encoder-decoder, in Proceedings of the 39th International ACM SIGIR Conference on Research and Development in Information Retrieval (2016), pp. 1041–1044
C. Wang, J. Paisley, D. Blei, Online variational inference for the hierarchical dirichlet process, in Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics (2011), pp. 752–760
X. Wang, W. Jiang, Z. Luo, Combination of convolutional and recurrent neural network for sentiment analysis of short texts, in Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers (2016), pp. 2428–2437
L. Yang, T. Sun, M. Zhang, Q. Mei, We know what @you #tag: does the dual role affect hashtag adoption? in Proceedings of the 21st International Conference on World Wide Web (2012), pp. 261–270
W.X. Zhao, J. Jiang, J. Weng, J. He, E.-P. Lim, H. Yan, X. Li, Comparing twitter and traditional media using topic models, in European Conference on Information Retrieval (2011), pp. 338–349
7 Appendix
© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG
Cite this chapter
Kant, G., Weisser, C., Kneib, T., Säfken, B. (2023). Topic Model—Machine Learning Classifier Integrations on Geocoded Twitter Data. In: Phuong, N.H., Kreinovich, V. (eds) Biomedical and Other Applications of Soft Computing. Studies in Computational Intelligence, vol 1045. Springer, Cham. https://doi.org/10.1007/978-3-031-08580-2_11
Print ISBN: 978-3-031-08579-6
Online ISBN: 978-3-031-08580-2