Abstract
Performing machine learning and natural language processing tasks on Twitter data is challenging because tweets are short and noisy. Methods for these tasks perform well on long documents such as news articles and research papers but poorly on short texts such as tweets. One way to improve the results is tweet pooling, i.e., combining related tweets into longer, coherent input documents. In this work, several new tweet pooling schemes are proposed based on two kinds of tweet auxiliary information: user mentions and URLs. The proposed schemes are evaluated for clustering quality on the clusters/topics obtained with standard Latent Dirichlet Allocation (LDA) and Non-negative Matrix Factorization (NMF). Using the labels of the tweet dataset, clustering quality is measured with purity and Normalized Mutual Information (NMI). Empirical results show that the proposed tweet pooling schemes always outperform the existing schemes by a significant margin when more than one kind of tweet auxiliary information is used for pooling.
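The pooling idea and the purity measure named in the abstract can be illustrated with a minimal Python sketch. It is not the chapter's implementation: the regular expressions, the function names `pool_tweets` and `purity`, and the handling of tweets that carry several keys or none are all assumptions made for illustration.

```python
import re
from collections import Counter, defaultdict

# Simplified patterns for the two kinds of auxiliary information (assumed).
MENTION = re.compile(r"@\w+")
URL = re.compile(r"https?://\S+")


def pool_tweets(tweets):
    """Merge tweets that share a user mention or a URL into one document.

    Assumptions: a tweet with several keys joins every matching pool, and a
    tweet with no key is kept as its own single-tweet document.
    """
    pools = defaultdict(list)
    for i, text in enumerate(tweets):
        # Mentions are case-insensitive on Twitter; URLs are left untouched.
        keys = [m.lower() for m in MENTION.findall(text)] + URL.findall(text)
        if not keys:
            keys = [f"_single_{i}"]
        for key in keys:
            pools[key].append(text)
    return {key: " ".join(texts) for key, texts in pools.items()}


def purity(cluster_ids, labels):
    """Fraction of items whose label is the majority label of their cluster."""
    by_cluster = defaultdict(Counter)
    for cluster, label in zip(cluster_ids, labels):
        by_cluster[cluster][label] += 1
    majority_total = sum(c.most_common(1)[0][1] for c in by_cluster.values())
    return majority_total / len(labels)
```

The pooled documents would then be passed to a topic model such as LDA or NMF in place of the raw tweets, and the resulting cluster assignments scored against the dataset labels with `purity` or an NMI implementation.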
Copyright information
© 2021 Springer Nature Singapore Pte Ltd.
About this paper
Cite this paper
Akhtar, N., Beg, M.M.S. (2021). Improving Microblog Clustering: Tweet Pooling Schemes. In: Sharma, N., Chakrabarti, A., Balas, V., Martinovic, J. (eds) Data Management, Analytics and Innovation. Advances in Intelligent Systems and Computing, vol 1174. Springer, Singapore. https://doi.org/10.1007/978-981-15-5616-6_1
DOI: https://doi.org/10.1007/978-981-15-5616-6_1
Publisher Name: Springer, Singapore
Print ISBN: 978-981-15-5615-9
Online ISBN: 978-981-15-5616-6
eBook Packages: Intelligent Technologies and Robotics (R0)