
Topic Model—Machine Learning Classifier Integrations on Geocoded Twitter Data

  • Chapter in: Biomedical and Other Applications of Soft Computing

Abstract

Topic models are unsupervised probabilistic models that explore the hidden semantic structure of a corpus of documents by generating latent topics. Latent topics are discrete distributions over words that need to be interpreted and labeled by humans. They can also be read as a summary of the subjects discussed in the corpus, so a topic model can also serve as a dimension-reduction method. The quality of the latent topics generated by topic models is difficult to measure [4], and their output is commonly used descriptively to gain insight into the topic distributions of chosen documents. A so far unexplored approach is the use of labeled data to assess the informational content of latent topics. In this paper, topic models are examined in terms of their informative value for classification problems on labeled text data. We use geocoded social media posts from Twitter, but our approach can be extended to other labeled documents. The output of Latent Dirichlet Allocation (LDA) [3] models and Structural Topic Models (STM) [29] is used as input for machine learning classifiers, after pooling tweets with the hashtag pooling algorithm of [24]. Their predictive power is compared with the performance of state-of-the-art artificial neural networks (ANNs) trained on a specifically optimized word embedding and all available tweet metadata. We find that the machine learning classifiers trained on topics can compete with the predictive performance of the ANNs, even for out-of-sample predicted topic distributions.
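The pipeline the abstract outlines can be illustrated in a minimal sketch: pool tweets that share a hashtag into one document (in the spirit of the hashtag pooling of [24]), fit a topic model on the pooled documents, and use the per-tweet topic distributions as classifier features. The sketch below uses scikit-learn's LDA and logistic regression on a few toy tweets; the toy data, hashtags, and labels are stand-ins for the geocoded corpus and do not reproduce the chapter's actual models or data.

```python
# Sketch of the topic-features-to-classifier pipeline (assumptions: scikit-learn
# available; toy tweets stand in for the geocoded Twitter corpus).
import re
from collections import defaultdict

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.linear_model import LogisticRegression

# Toy labeled tweets (text, label) -- hypothetical examples.
tweets = [
    ("rain flood storm warning #weather", "weather"),
    ("sunny heat drought forecast #weather", "weather"),
    ("goal match striker penalty #sports", "sports"),
    ("league finals referee stadium #sports", "sports"),
]

# Hashtag pooling: concatenate all tweets sharing a hashtag into one document,
# so the topic model sees longer, less sparse texts than single tweets.
pools = defaultdict(list)
for text, _label in tweets:
    for tag in re.findall(r"#\w+", text):
        pools[tag].append(text)
pooled_docs = [" ".join(texts) for texts in pools.values()]

# Fit LDA on the pooled documents.
vec = CountVectorizer()
lda = LatentDirichletAllocation(n_components=2, random_state=0)
lda.fit(vec.fit_transform(pooled_docs))

# Infer a topic distribution for each individual tweet; these low-dimensional
# vectors become the features of a supervised classifier.
X_topics = lda.transform(vec.transform([text for text, _ in tweets]))
y = [label for _, label in tweets]
clf = LogisticRegression().fit(X_topics, y)
print(clf.predict(X_topics))
```

The topic distributions act as a learned dimension reduction: the classifier sees only `n_components` features per tweet instead of the full vocabulary, which is what makes the comparison against ANNs trained on word embeddings meaningful.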


References

  1. D. Alvarez-Melis, M. Saveski, Topic modeling in Twitter: aggregating tweets by conversations, in Tenth International AAAI Conference on Web and Social Media (2016), pp. 519–522
  2. D.M. Blei, J.D. Lafferty, Dynamic topic models, in Proceedings of the 23rd International Conference on Machine Learning (2006), pp. 113–120
  3. D.M. Blei, A.Y. Ng, M.I. Jordan, Latent Dirichlet allocation. J. Mach. Learn. Res. 3, 993–1022 (2003)
  4. J. Boyd-Graber, Y. Hu, D. Mimno, Applications of topic models. Found. Trends Inf. Retr. 11, 143–296 (2017)
  5. Z. Cao, S. Li, Y. Liu, W. Li, H. Ji, A novel neural topic model and its supervised extension, in Twenty-Ninth AAAI Conference on Artificial Intelligence (2015), pp. 2210–2216
  6. J. Chang, S. Gerrish, C. Wang, J.L. Boyd-Graber, D.M. Blei, Reading tea leaves: how humans interpret topic models, in Advances in Neural Information Processing Systems (2009), pp. 288–296
  7. T. Chen, C. Guestrin, XGBoost: a scalable tree boosting system, in Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (2016), pp. 785–794
  8. T.A. Curry, M.P. Fix, May it please the twitterverse: the use of Twitter by state high court judges. J. Inf. Technol. Polit. 16(4), 379–393 (2019)
  9. D. Fischer-Preßler, C. Schwemmer, K. Fischbach, Collective sense-making in times of crisis: connecting terror management theory with Twitter user reactions to the Berlin terrorist attack. Comput. Hum. Behav. 100, 138–151 (2019)
  10. G. Forman, I. Cohen, Learning from little: comparison of classifiers given little training, in European Conference on Principles of Data Mining and Knowledge Discovery (2004), pp. 161–172
  11. J.H. Friedman, Greedy function approximation: a gradient boosting machine. Ann. Stat. 29(5), 1189–1232 (2001)
  12. T.L. Griffiths, M.I. Jordan, J.B. Tenenbaum, D.M. Blei, Hierarchical topic models and the nested Chinese restaurant process, in Advances in Neural Information Processing Systems (2004), pp. 17–24
  13. T. Hastie, R. Tibshirani, J. Friedman, The Elements of Statistical Learning: Data Mining, Inference and Prediction (Springer, 2009)
  14. M. Hoffman, F.R. Bach, D.M. Blei, Online learning for latent Dirichlet allocation, in Advances in Neural Information Processing Systems, vol. 23 (2010), pp. 856–864
  15. L. Hong, B.D. Davison, Empirical study of topic modeling in Twitter, in Proceedings of the First Workshop on Social Media Analytics (2010), pp. 80–88
  16. E. Ikonomakis, S. Kotsiantis, V. Tampakas, Text classification using machine learning techniques. WSEAS Trans. Comput. 4, 966–974 (2005)
  17. M. Imran, P. Mitra, C. Castillo, Twitter as a lifeline: human-annotated Twitter corpora for NLP of crisis-related messages, in Proceedings of the Tenth International Conference on Language Resources and Evaluation (2016), pp. 1638–1643
  18. M. Jin, X. Luo, H. Zhu, H.H. Zhuo, Combining deep learning and topic modeling for review understanding in context-aware recommendation, in Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Long Papers), vol. 1 (2018), pp. 1605–1614
  19. G. Kant, C. Weisser, B. Säfken, TTLocVis: a Twitter topic location visualization package. J. Open Source Softw. 5(54) (2020)
  20. F. Krasnov, A. Sen, The number of topics optimization: clustering approach. Mach. Learn. Knowl. Extr. 1(1), 416–426 (2019)
  21. C.-C. Lai, M.-C. Tsai, An empirical performance comparison of machine learning methods for spam e-mail categorization, in Fourth International Conference on Hybrid Intelligent Systems (2004), pp. 44–48
  22. J.H. Lau, T. Baldwin, T. Cohn, Topically driven neural language model, in Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (2017), pp. 355–365
  23. W. Lou, X. Wang, F. Chen, Y. Chen, B. Jiang, H. Zhang, Sequence based prediction of DNA-binding proteins based on hybrid feature selection using random forest and Gaussian naïve Bayes. PLoS One 9(1), e86703 (2014)
  24. R. Mehrotra, S. Sanner, W. Buntine, L. Xie, Improving LDA topic models for microblogs via tweet pooling and automatic labeling, in Proceedings of the 36th International ACM SIGIR Conference on Research and Development in Information Retrieval (2013), pp. 889–892
  25. T. Mikolov, I. Sutskever, K. Chen, G.S. Corrado, J. Dean, Distributed representations of words and phrases and their compositionality, in Advances in Neural Information Processing Systems (2013), pp. 3111–3119
  26. D. Mimno, H.M. Wallach, E. Talley, M. Leenders, A. McCallum, Optimizing semantic coherence in topic models, in Proceedings of the Conference on Empirical Methods in Natural Language Processing (2011), pp. 262–272
  27. A. Mishler, E.S. Crabb, S. Paletz, B. Hefright, E. Golonka, Using structural topic modeling to detect events and cluster Twitter users in the Ukrainian crisis, in International Conference on Human-Computer Interaction (2015), pp. 639–644
  28. J. Pennington, R. Socher, C.D. Manning, GloVe: global vectors for word representation, in Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (2014), pp. 1532–1543
  29. M.E. Roberts, B.M. Stewart, D. Tingley, et al., STM: R package for structural topic models. J. Stat. Softw. 10(2), 1–40 (2014)
  30. J. Roesslein, Tweepy: Twitter for Python! (2020). https://github.com/tweepy/tweepy
  31. P. Shrestha, S. Sierra, F.A. González, M. Montes, P. Rosso, T. Solorio, Convolutional neural networks for authorship attribution of short texts, in Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Short Papers, vol. 2 (2017), pp. 669–674
  32. A. Steinskog, J. Therkelsen, B. Gambäck, Twitter topic modeling by tweet aggregation, in Proceedings of the 21st Nordic Conference on Computational Linguistics (2017), pp. 77–86
  33. S. Vosoughi, P. Vijayaraghavan, D. Roy, Tweet2Vec: learning tweet embeddings using character-level CNN-LSTM encoder-decoder, in Proceedings of the 39th International ACM SIGIR Conference on Research and Development in Information Retrieval (2016), pp. 1041–1044
  34. C. Wang, J. Paisley, D. Blei, Online variational inference for the hierarchical Dirichlet process, in Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics (2011), pp. 752–760
  35. X. Wang, W. Jiang, Z. Luo, Combination of convolutional and recurrent neural network for sentiment analysis of short texts, in Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers (2016), pp. 2428–2437
  36. L. Yang, T. Sun, M. Zhang, Q. Mei, We know what @you #tag: does the dual role affect hashtag adoption? in Proceedings of the 21st International Conference on World Wide Web (2012), pp. 261–270
  37. W.X. Zhao, J. Jiang, J. Weng, J. He, E.-P. Lim, H. Yan, X. Li, Comparing Twitter and traditional media using topic models, in European Conference on Information Retrieval (2011), pp. 338–349

Author information

Correspondence to Christoph Weisser.

7 Appendix

Fig. 4 ANN model: an input layer for the categorical and numerical features, followed by dense and batch-normalization layers.

Fig. 5 An overview of the implemented workflow: data streaming and collection of tweets, data preprocessing, topic modeling with LDA and STM, training and test sets of per-tweet topic distributions, and ANN classification.


Copyright information

© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this chapter

Cite this chapter

Kant, G., Weisser, C., Kneib, T., Säfken, B. (2023). Topic Model—Machine Learning Classifier Integrations on Geocoded Twitter Data. In: Phuong, N.H., Kreinovich, V. (eds) Biomedical and Other Applications of Soft Computing. Studies in Computational Intelligence, vol 1045. Springer, Cham. https://doi.org/10.1007/978-3-031-08580-2_11
