Improving Microblog Clustering: Tweet Pooling Schemes

  • Conference paper
Data Management, Analytics and Innovation

Part of the book series: Advances in Intelligent Systems and Computing (AISC, volume 1174)

Abstract

Performing machine learning and natural language processing tasks on Twitter data is challenging because tweets are short and noisy. Methods for these tasks perform well on long documents such as news articles and research papers but poorly on short texts such as tweets. One way of improving the results is tweet pooling, i.e., combining related tweets into longer, more coherent input documents. In this work, several new tweet pooling schemes are proposed based on two kinds of auxiliary tweet information: user mentions and URLs. The proposed schemes are evaluated on the quality of the clusters/topics obtained with standard Latent Dirichlet Allocation (LDA) and Non-negative Matrix Factorization (NMF). Clustering quality is measured with purity and Normalized Mutual Information (NMI), computed against the tweet labels in the dataset. Empirical results show that the proposed tweet pooling schemes always outperform the existing schemes by a significant margin when more than one kind of auxiliary tweet information is used for pooling.
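
The abstract does not spell out the pooling schemes themselves, so the following is only a minimal sketch of the general idea, not the authors' method. It assumes that tweets sharing a user mention or a URL are concatenated into one pooled document, and that a tweet with neither stays as its own document. The function names (`pool_tweets`, `purity`, `nmi`) and the regular expressions are hypothetical, and the NMI shown normalizes by the mean of the two entropies, which is one common convention.

```python
import math
import re
from collections import Counter, defaultdict

# Assumed, simplified patterns for the two kinds of auxiliary information.
MENTION_RE = re.compile(r"@\w+")
URL_RE = re.compile(r"https?://\S+")


def pool_tweets(tweets):
    """Pool tweets by each mention or URL they contain (assumed scheme).

    A tweet with several keys joins several pools; a tweet with no key
    is kept as a single-tweet document.
    """
    pools = defaultdict(list)
    for i, text in enumerate(tweets):
        keys = [m.lower() for m in MENTION_RE.findall(text)]
        keys += URL_RE.findall(text)
        if not keys:
            keys = [f"tweet:{i}"]
        for key in keys:
            pools[key].append(text)
    return {key: " ".join(texts) for key, texts in pools.items()}


def purity(labels, clusters):
    """Fraction of items that belong to the majority label of their cluster."""
    by_cluster = defaultdict(Counter)
    for label, cluster in zip(labels, clusters):
        by_cluster[cluster][label] += 1
    return sum(max(c.values()) for c in by_cluster.values()) / len(labels)


def entropy(assignments):
    """Shannon entropy (natural log) of a list of group assignments."""
    n = len(assignments)
    return -sum((c / n) * math.log(c / n) for c in Counter(assignments).values())


def nmi(labels, clusters):
    """Mutual information divided by the mean of the two entropies."""
    n = len(labels)
    joint = Counter(zip(labels, clusters))
    label_counts = Counter(labels)
    cluster_counts = Counter(clusters)
    mi = sum(
        (c / n) * math.log(c * n / (label_counts[l] * cluster_counts[k]))
        for (l, k), c in joint.items()
    )
    denom = (entropy(labels) + entropy(clusters)) / 2
    # Both partitions have a single group: treated as perfect agreement here.
    return mi / denom if denom > 0 else 1.0
```

In a pipeline of the kind the abstract describes, the pooled documents would be passed to LDA or NMF in place of the raw tweets, and the resulting topic assignments would then be scored against the tweet labels with `purity` and `nmi`.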

Author information

Corresponding author

Correspondence to Nadeem Akhtar.

Copyright information

© 2021 Springer Nature Singapore Pte Ltd.

About this paper

Cite this paper

Akhtar, N., Beg, M.M.S. (2021). Improving Microblog Clustering: Tweet Pooling Schemes. In: Sharma, N., Chakrabarti, A., Balas, V., Martinovic, J. (eds) Data Management, Analytics and Innovation. Advances in Intelligent Systems and Computing, vol 1174. Springer, Singapore. https://doi.org/10.1007/978-981-15-5616-6_1
