Improving Microblog Clustering: Tweet Pooling Schemes

Akhtar, Nadeem; Beg, M. M. Sufyan

doi:10.1007/978-981-15-5616-6_1


Part of the book series: Advances in Intelligent Systems and Computing (AISC, volume 1174)


Abstract

Machine learning and natural language processing tasks are challenging on Twitter data because tweets are short and noisy. Methods that perform well on long documents such as news articles and research papers perform poorly on short texts such as tweets. One way to improve results is tweet pooling, i.e., combining related tweets into longer, coherent input documents. In this work, several new tweet pooling schemes are proposed based on two kinds of tweet auxiliary information: user mentions and URLs. The proposed schemes are evaluated on the quality of the clusters/topics obtained with standard Latent Dirichlet Allocation (LDA) and Non-negative Matrix Factorization (NMF). Using the labels of the tweet dataset, clustering quality is measured with purity and Normalized Mutual Information (NMI). Empirical results show that the proposed tweet pooling schemes always outperform the existing schemes by a significant margin when more than one kind of tweet auxiliary information is used for pooling.
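To make the pooling and evaluation ideas concrete, the following is a minimal sketch and not the chapter's implementation. It assumes one plausible pooling policy (a tweet joins one pool per user mention or URL it contains, and a tweet with neither stays a single-tweet document), and it uses the common definitions of purity and of NMI normalized by the mean of the two entropies. The function names `pool_tweets`, `purity`, and `nmi`, the regular expressions, and the sample tweets are all illustrative assumptions.

```python
import math
import re
from collections import Counter, defaultdict

# Assumed, simplified patterns for the two kinds of auxiliary information.
MENTION_RE = re.compile(r"@(\w+)")
URL_RE = re.compile(r"https?://\S+")


def pool_tweets(tweets):
    """Concatenate tweets that share a user mention or a URL.

    Assumed policy: a tweet joins one pool per key it contains;
    a tweet without any key becomes its own single-tweet document.
    """
    pools = defaultdict(list)
    for i, text in enumerate(tweets):
        keys = [("mention", m.lower()) for m in MENTION_RE.findall(text)]
        keys += [("url", u) for u in URL_RE.findall(text)]
        if not keys:
            keys = [("single", str(i))]
        for key in keys:
            pools[key].append(text)
    return {key: " ".join(texts) for key, texts in pools.items()}


def purity(clusters, labels):
    """Fraction of items carrying the majority label of their cluster."""
    by_cluster = defaultdict(Counter)
    for c, y in zip(clusters, labels):
        by_cluster[c][y] += 1
    return sum(max(cnt.values()) for cnt in by_cluster.values()) / len(labels)


def nmi(clusters, labels):
    """Mutual information divided by the mean of the two entropies."""
    n = len(labels)
    pc, py = Counter(clusters), Counter(labels)
    joint = Counter(zip(clusters, labels))
    mi = sum(v / n * math.log(v * n / (pc[c] * py[y]))
             for (c, y), v in joint.items())

    def entropy(counts):
        return -sum(v / n * math.log(v / n) for v in counts.values())

    denom = (entropy(pc) + entropy(py)) / 2
    return mi / denom if denom > 0 else 1.0


if __name__ == "__main__":
    tweets = [
        "@nasa launch today http://t.co/a",
        "watching the launch @NASA",
        "great goal http://t.co/b",
        "what a match http://t.co/b",
        "no keys here",
    ]
    for key, doc in pool_tweets(tweets).items():
        print(key, "->", doc)
    print(purity([0, 0, 1, 1, 1], ["a", "a", "b", "b", "a"]))
```

In a full pipeline, the pooled documents would be vectorized and passed to an LDA or NMF implementation, each tweet would be assigned to its dominant topic, and `purity` and `nmi` would compare those assignments with the dataset's tweet labels.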



Author information

Authors and Affiliations

Department of Computer Engineering, Zakir Husain College of Engineering & Technology, Aligarh Muslim University, Aligarh, India
Nadeem Akhtar & M. M. Sufyan Beg


Corresponding author

Correspondence to Nadeem Akhtar.

Editor information

Editors and Affiliations

Society for Data Science, Pune, Maharashtra, India
Neha Sharma
A.K.Choudhury School of Information Technology, West Bengal, India
Amlan Chakrabarti
Department of Automatics and Applied Software, Faculty of Engineering, University of Arad, Arad, Romania
Valentina Emilia Balas
IT4Innovations, VSB-Technical University of Ostrava, Ostrava, Czech Republic
Jan Martinovic


About this paper

Cite this paper

Akhtar, N., Beg, M.M.S. (2021). Improving Microblog Clustering: Tweet Pooling Schemes. In: Sharma, N., Chakrabarti, A., Balas, V., Martinovic, J. (eds) Data Management, Analytics and Innovation. Advances in Intelligent Systems and Computing, vol 1174. Springer, Singapore. https://doi.org/10.1007/978-981-15-5616-6_1


DOI: https://doi.org/10.1007/978-981-15-5616-6_1
Published: 19 August 2020
Publisher Name: Springer, Singapore
Print ISBN: 978-981-15-5615-9
Online ISBN: 978-981-15-5616-6
eBook Packages: Intelligent Technologies and Robotics, Intelligent Technologies and Robotics (R0)
