Abstract
In this article, we apply recent short-text embedding techniques to the automatic detection of events in a stream of tweets. We model this task as a dynamic clustering problem. Our experiments are conducted on a publicly available English corpus of tweets and on a similar French dataset annotated by our team. We show that recent techniques based on deep neural networks (ELMo, Universal Sentence Encoder, BERT, SBERT), although promising in many applications, are not well suited to this task, even on the English corpus. We also experiment with different types of fine-tuning in order to improve the results of these models on the French data. We provide a detailed analysis of the results obtained, showing the superiority of traditional tf-idf approaches for this type of task and corpus.
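To make the tf-idf baseline concrete, the following is a minimal self-contained sketch (not the pipeline used in the experiments): tweets are represented as sparse tf-idf vectors and compared by cosine similarity, the similarity measure that would drive a dynamic clustering of the stream. The toy tweets are illustrative only.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Compute simple tf-idf vectors for a list of tokenized documents."""
    n = len(docs)
    # Document frequency of each term.
    df = Counter()
    for doc in docs:
        df.update(set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        # Term frequency weighted by inverse document frequency.
        vec = {t: (tf[t] / len(doc)) * math.log(n / df[t]) for t in tf}
        vectors.append(vec)
    return vectors

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(u[t] * v.get(t, 0.0) for t in u)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

tweets = [
    "earthquake hits the city center".split(),
    "strong earthquake reported near the city".split(),
    "my cat is sleeping all day".split(),
]
vecs = tfidf_vectors(tweets)
# The two earthquake tweets score higher than the unrelated one,
# so a similarity-threshold clustering would group them as one event.
print(cosine(vecs[0], vecs[1]), cosine(vecs[0], vecs[2]))
```

In an event-detection setting, each incoming tweet would be compared against existing cluster centroids and attached to the nearest cluster above a similarity threshold, or opened as a new event otherwise.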
Notes
- 1.
See the results on the GLUE benchmark: https://gluebenchmark.com/leaderboard.
- 2.
More specifically, we are interested in the topic similarity of short texts, here tweets, which have the particularity of not always being grammatically correct. To simplify, we consider tweets as sentences in this article.
- 3.
For example, to compute a similarity score between sentences with BERT [Devlin et al., 2019], sentences must be processed in pairs rather than individually. Using the example proposed by [Reimers and Gurevych, 2019], finding the two most similar sentences in a corpus of n = 10,000 sentences would require \(\frac{n(n - 1)}{2} = 49{,}995{,}000\) comparisons, which represents approximately 65 h of processing with BERT on a V100 GPU.
- 4.
- 5.
This algorithm is implemented in the Python SciPy library.
- 6.
In accordance with Twitter’s terms of use, researchers sharing datasets of tweets do not share the content of the tweets, but only their ids. Using these identifiers, one can query the Twitter API to retrieve the full content of the tweets—but only if they are still online.
- 7.
github.com/loretoparisi/word2vec-twitter.
- 8.
code.google.com/archive/p/word2vec/.
- 9.
tfhub.dev/google/elmo/2.
- 10.
github.com/HIT-SCIR/ELMoForManyLangs.
- 11.
github.com/google-research/bert. Models: bert-large, uncased and bert-base, multilingual cased.
- 12.
tfhub.dev/google/universal-sentence-encoder-large/3.
- 13.
tfhub.dev/google/universal-sentence-encoder-multilingual-large/1.
- 14.
github.com/UKPLab/sentence-transformers. Model: bert-large-nli-stsb-mean-tokens.
- 15.
We used the fine-tuning architecture provided by the authors of Sentence-BERT (sbert.net/docs/training/overview.html#sentence_transformers.SentenceTransformer.fit) with a batch size of 16 and 4 epochs, keeping the default configuration for all other parameters.
- 16.
cloud.google.com/translate/docs/reference/rest/.
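The quadratic cost quoted in note 3 is easy to verify: scoring every pair of sentences in a corpus of n sentences with a pairwise model requires n(n − 1)/2 comparisons. A quick check of the arithmetic:

```python
def pair_count(n):
    """Number of unordered pairs among n sentences: n * (n - 1) / 2."""
    return n * (n - 1) // 2

# For the corpus of 10,000 sentences quoted in note 3:
print(pair_count(10_000))  # 49995000
```

This is what makes cross-encoder scoring with BERT impractical for large corpora, and what motivates sentence-level embeddings such as SBERT, where each sentence is encoded once and pairs are compared with a cheap vector similarity.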
References
Allan J, Lavrenko V, Malin D, Swan R (2000) Detections, bounds, and timelines: UMass and TDT-3. In Proc of Topic Detection and Tracking workshop, pp 167–174
Bank RE, Douglas CC (1993) Sparse matrix multiplication package (SMMP). Adv Comput Math 1(1):127–137
Becker H, Naaman M, Gravano L (2011) Beyond trending topics: real-world event identification on twitter. In Fifth international AAAI conference on weblogs and social media
Bowman SR, Angeli G, Potts C, Manning CD (2015) A large annotated corpus for learning natural language inference. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, EMNLP 2015, Lisbon, Portugal, pp 632–642. The Association for Computational Linguistics
Cagé J, Hervé N, Viaud M-L (2020) The production of information in an online world. The Review of Economic Studies
Cer D, Diab M, Agirre E, Lopez-Gazpio I, Specia L (2017) SemEval-2017 task 1: semantic textual similarity multilingual and cross-lingual focused evaluation. arXiv preprint arXiv:1708.00055
Cer D, Yang Y, Kong S-Y, Hua N, Limtiaco N, John RS, Constant N, Guajardo-Cespedes M, Yuan S, Tar C, et al (2018) Universal sentence encoder. arXiv preprint arXiv:1803.11175
Che W, Liu Y, Wang Y, Zheng B, Liu T (2018) Towards better UD parsing: deep contextualized word embeddings, ensemble, and treebank concatenation. In Proc. of the CoNLL 2018 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies, pp 55–64
Conneau A, Kiela D, Schwenk H, Barrault L, Bordes A (2017) Supervised learning of universal sentence representations from natural language inference data. arXiv preprint arXiv:1705.02364
Devlin J, Chang M, Lee K, Toutanova K (2019) BERT: pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2019, Minneapolis, MN, USA, Vol 1 (Long and Short Papers), pp 4171–4186. Association for Computational Linguistics
Fleuret F, Sahbi H (2003) Scale-invariance of support vector machines based on the triangular kernel. In 3rd International Workshop on Statistical and Computational Theories of Vision, pp 1–13
Godin F, Vandersmissen B, De Neve W, Van de Walle R (2015) Multimedia lab @ ACL WNUT NER shared task: named entity recognition for twitter microposts using distributed word representations. In Proc. of Workshop on Noisy User-generated Text, pp 146–153
Harris ZS (1954) Distributional structure. Word 10(2–3):146–162
Hasan M, Orgun MA, Schwitter R (2016) TwitterNews: real time event detection from the Twitter data stream. PeerJ PrePrints
Johnson J, Douze M, Jégou H (2019) Billion-scale similarity search with gpus. IEEE Transactions on Big Data
Kiros R, Zhu Y, Salakhutdinov RR, Zemel R, Urtasun R, Torralba A, Fidler S (2015) Skip-thought vectors. In Advances in neural information processing systems, pp 3294–3302
Le Q, Mikolov T (2014) Distributed representations of sentences and documents. In International conference on machine learning, pp 1188–1196
Li C, Duan Y, Wang H, Zhang Z, Sun A, Ma Z (2017) Enhancing topic modeling for short texts with auxiliary word embeddings. ACM Trans Inf Syst 36(2):11:1–11:30
Likhitha S, Harish B, Kumar HK (2019) A detailed survey on topic modeling for document and short text data. Intern J Comp Appl 975:8887
Mazoyer B, Cagé J, Hervé N, Hudelot C (2020) A french corpus for event detection on twitter. In Proceedings of the Twelfth International Conference on Language Resources and Evaluation (LREC 2020), pp 6220–6227
McMinn AJ, Moshfeghi Y, Jose JM (2013) Building a large-scale corpus for evaluating event detection on twitter. In Proc of ACM-CIKM, pp 409–418. ACM
Mikolov T, Chen K, Corrado G, Dean J (2013) Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781
Nguyen DQ, Billingsley R, Du L, Johnson M (2015) Improving topic models with latent feature word representations. Trans Assoc Comput Ling 3:299–313
Pennington J, Socher R, Manning C (2014) Glove: global vectors for word representation. In Proc of EMNLP, pp 1532–1543
Peters ME, Neumann M, Iyyer M, Gardner M, Clark C, Lee K, Zettlemoyer L (2018) Deep contextualized word representations. arXiv preprint arXiv:1802.05365
Petrović S, Osborne M, Lavrenko V (2010) Streaming first story detection with application to Twitter. In Proc of NAACL, pp 181–189
Reimers N, Gurevych I (2019) Sentence-BERT: sentence embeddings using siamese BERT-networks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing, EMNLP-IJCNLP 2019, Hong Kong, China, pp 3980–3990. Association for Computational Linguistics
Repp Ø, Ramampiaro H (2018) Extracting news events from microblogs. J Stat Manag Syst 21(4):695–723
Sankaranarayanan J, Samet H, Teitler BE, Lieberman MD, Sperling J (2009) Twitterstand: news in tweets. In Proc of ACM-GIS, pp 42–51
Sparck Jones K (1972) A statistical interpretation of term specificity and its application in retrieval. J Doc 28(1):11–21
Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez AN, Kaiser Ł, Polosukhin I (2017) Attention is all you need. In Advances in neural information processing systems, pp 5998–6008
Wang A, Singh A, Michael J, Hill F, Levy O, Bowman SR (2018) GLUE: a multi-task benchmark and analysis platform for natural language understanding. In Proceedings of the Workshop: Analyzing and Interpreting Neural Networks for NLP, BlackboxNLP@EMNLP 2018, Brussels, Belgium, pp 353–355. Association for Computational Linguistics
Yang Y, Pierce T, Carbonell JG (1998) A study of retrospective and on-line event detection. In Proc of ACM-SIGIR, pp 28–36
Yin J, Wang J (2014) A Dirichlet multinomial mixture model-based approach for short text clustering. In The 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’14, New York, NY, USA, pp 233–242
© 2024 The Author(s), under exclusive license to Springer Nature Switzerland AG
Cite this chapter
Mazoyer, B., Hervé, N., Hudelot, C., Cagé, J. (2024). Comparison of Short-Text Embeddings for Unsupervised Event Detection in a Stream of Tweets. In: Jaziri, R., Martin, A., Cornuéjols, A., Cuvelier, E., Guillet, F. (eds) Advances in Knowledge Discovery and Management. Studies in Computational Intelligence, vol 1110. Springer, Cham. https://doi.org/10.1007/978-3-031-40403-0_4
Print ISBN: 978-3-031-40402-3
Online ISBN: 978-3-031-40403-0