Abstract
In this article, we apply recent short-text embedding techniques to the automatic detection of events in a stream of tweets. We model this task as a dynamic clustering problem. Our experiments are conducted on a publicly available English corpus of tweets and on a similar French dataset annotated by our team. We show that recent techniques based on deep neural networks (ELMo, Universal Sentence Encoder, BERT, SBERT), although promising in many applications, are not well suited to this task, even on the English corpus. We also experiment with different types of fine-tuning in order to improve the results of these models on the French data. We provide a detailed analysis of the results obtained, showing the superiority of traditional tf-idf approaches for this type of task and corpus.
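To make the tf-idf baseline concrete, the following is a minimal self-contained sketch (not the pipeline used in the experiments): tweets are represented as sparse tf-idf vectors and compared by cosine similarity, the similarity measure that would drive a dynamic clustering of the stream. The toy tweets are illustrative only.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Compute simple tf-idf vectors for a list of tokenized documents."""
    n = len(docs)
    # Document frequency of each term.
    df = Counter()
    for doc in docs:
        df.update(set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        # Term frequency weighted by inverse document frequency.
        vec = {t: (tf[t] / len(doc)) * math.log(n / df[t]) for t in tf}
        vectors.append(vec)
    return vectors

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(u[t] * v.get(t, 0.0) for t in u)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

tweets = [
    "earthquake hits the city center".split(),
    "strong earthquake reported near the city".split(),
    "my cat is sleeping all day".split(),
]
vecs = tfidf_vectors(tweets)
# The two earthquake tweets score higher than the unrelated one,
# so a similarity-threshold clustering would group them as one event.
print(cosine(vecs[0], vecs[1]), cosine(vecs[0], vecs[2]))
```

In an event-detection setting, each incoming tweet would be compared against existing cluster centroids and attached to the nearest cluster above a similarity threshold, or opened as a new event otherwise.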
Notes
- 1.
See the results on the GLUE benchmark: https://gluebenchmark.com/leaderboard.
- 2.
More specifically, we are interested in the topic similarity of short texts, here tweets, which have the particularity of not always being grammatically correct. To simplify, we consider tweets as sentences in this article.
- 3.
For example, to compute a similarity score between sentences with BERT [Devlin et al., 2019], sentences must be processed in pairs rather than individually. Using the example proposed by [Reimers and Gurevych, 2019], finding the two most similar sentences in a corpus of n = 10,000 sentences would require \(\frac{n(n - 1)}{2} = 49{,}995{,}000\) comparisons, which represents approximately 65 h of processing with BERT on a V100 GPU.
- 4.
- 5.
This algorithm is implemented in the Python SciPy library.
- 6.
In accordance with Twitter’s terms of use, researchers sharing datasets of tweets do not share the content of the tweets, but only their ids. Using these identifiers, one can query the Twitter API to retrieve the full content of the tweets—but only if they are still online.
- 7.
github.com/loretoparisi/word2vec-twitter.
- 8.
code.google.com/archive/p/word2vec/.
- 9.
tfhub.dev/google/elmo/2.
- 10.
github.com/HIT-SCIR/ELMoForManyLangs.
- 11.
github.com/google-research/bert. Models: bert-large, uncased and bert-base, multilingual cased.
- 12.
tfhub.dev/google/universal-sentence-encoder-large/3.
- 13.
tfhub.dev/google/universal-sentence-encoder-multilingual-large/1.
- 14.
github.com/UKPLab/sentence-transformers. Model: bert-large-nli-stsb-mean-tokens.
- 15.
We used the fine-tuning architecture provided by the authors of Sentence-BERT (sbert.net/docs/training/overview.html#sentence_transformers.SentenceTransformer.fit) with a batch size of 16 and 4 epochs, keeping the default configuration for all other parameters.
- 16.
cloud.google.com/translate/docs/reference/rest/.
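The quadratic cost quoted in note 3 is easy to verify: scoring every pair of sentences in a corpus of n sentences with a pairwise model requires n(n − 1)/2 comparisons. A quick check of the arithmetic:

```python
def pair_count(n):
    """Number of unordered pairs among n sentences: n * (n - 1) / 2."""
    return n * (n - 1) // 2

# For the corpus of 10,000 sentences quoted in note 3:
print(pair_count(10_000))  # 49995000
```

This is what makes cross-encoder scoring with BERT impractical for large corpora, and what motivates sentence-level embeddings such as SBERT, where each sentence is encoded once and pairs are compared with a cheap vector similarity.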
References
Allan J, Lavrenko V, Malin D, Swan R (2000) Detections, bounds, and timelines: UMass and TDT-3. In Proc of Topic Detection and Tracking workshop, pp 167–174
Bank RE, Douglas CC (1993) Sparse matrix multiplication package (SMMP). Adv Comput Math 1(1):127–137
Becker H, Naaman M, Gravano L (2011) Beyond trending topics: real-world event identification on twitter. In Fifth international AAAI conference on weblogs and social media
Bowman SR, Angeli G, Potts C, Manning CD (2015) A large annotated corpus for learning natural language inference. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, EMNLP 2015, Lisbon, Portugal, pp 632–642. The Association for Computational Linguistics
Cagé J, Hervé N, Viaud M-L (2020) The production of information in an online world. The Review of Economic Studies
Cer D, Diab M, Agirre E, Lopez-Gazpio I, Specia L (2017) SemEval-2017 task 1: semantic textual similarity multilingual and cross-lingual focused evaluation. arXiv preprint arXiv:1708.00055
Cer D, Yang Y, Kong S-Y, Hua N, Limtiaco N, John RS, Constant N, Guajardo-Cespedes M, Yuan S, Tar C, et al (2018) Universal sentence encoder. arXiv preprint arXiv:1803.11175
Che W, Liu Y, Wang Y, Zheng B, Liu T (2018) Towards better UD parsing: deep contextualized word embeddings, ensemble, and treebank concatenation. In Proc. of the CoNLL 2018 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies, pp 55–64
Conneau A, Kiela D, Schwenk H, Barrault L, Bordes A (2017) Supervised learning of universal sentence representations from natural language inference data. arXiv preprint arXiv:1705.02364
Devlin J, Chang M, Lee K, Toutanova K (2019) BERT: pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2019, Minneapolis, MN, USA, Vol 1 (Long and Short Papers), pp 4171–4186. Association for Computational Linguistics
Fleuret F, Sahbi H (2003) Scale-invariance of support vector machines based on the triangular kernel. In 3rd International Workshop on Statistical and Computational Theories of Vision, pp 1–13
Godin F, Vandersmissen B, De Neve W, Van de Walle R (2015) Multimedia lab @ ACL WNUT NER shared task: named entity recognition for twitter microposts using distributed word representations. In Proc. of Workshop on Noisy User-generated Text, pp 146–153
Harris ZS (1954) Distributional structure. Word 10(2–3):146–162
Hasan M, Orgun MA, Schwitter R (2016) TwitterNews: real time event detection from the Twitter data stream. PeerJ PrePrints
Johnson J, Douze M, Jégou H (2019) Billion-scale similarity search with gpus. IEEE Transactions on Big Data
Kiros R, Zhu Y, Salakhutdinov RR, Zemel R, Urtasun R, Torralba A, Fidler S (2015) Skip-thought vectors. In Advances in neural information processing systems, pp 3294–3302
Le Q, Mikolov T (2014) Distributed representations of sentences and documents. In International conference on machine learning, pp 1188–1196
Li C, Duan Y, Wang H, Zhang Z, Sun A, Ma Z (2017) Enhancing topic modeling for short texts with auxiliary word embeddings. ACM Trans Inf Syst 36(2):11:1–11:30
Likhitha S, Harish B, Kumar HK (2019) A detailed survey on topic modeling for document and short text data. Intern J Comp Appl 975:8887
Mazoyer B, Cagé J, Hervé N, Hudelot C (2020) A french corpus for event detection on twitter. In Proceedings of the Twelfth International Conference on Language Resources and Evaluation (LREC 2020), pp 6220–6227
McMinn AJ, Moshfeghi Y, Jose JM (2013) Building a large-scale corpus for evaluating event detection on twitter. In Proc of ACM-CIKM, pp 409–418. ACM
Mikolov T, Chen K, Corrado G, Dean J (2013) Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781
Nguyen DQ, Billingsley R, Du L, Johnson M (2015) Improving topic models with latent feature word representations. Trans Assoc Comput Ling 3:299–313
Pennington J, Socher R, Manning C (2014) Glove: global vectors for word representation. In Proc of EMNLP, pp 1532–1543
Peters ME, Neumann M, Iyyer M, Gardner M, Clark C, Lee K, Zettlemoyer L (2018) Deep contextualized word representations. arXiv preprint arXiv:1802.05365
Petrović S, Osborne M, Lavrenko V (2010) Streaming first story detection with application to Twitter. In Proc of NAACL, pp 181–189
Reimers N, Gurevych I (2019) Sentence-BERT: sentence embeddings using siamese BERT-networks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing, EMNLP-IJCNLP 2019, Hong Kong, China, pp 3980–3990. Association for Computational Linguistics
Repp Ø, Ramampiaro H (2018) Extracting news events from microblogs. J Stat Manag Syst 21(4):695–723
Sankaranarayanan J, Samet H, Teitler BE, Lieberman MD, Sperling J (2009) Twitterstand: news in tweets. In Proc of ACM-GIS, pp 42–51
Sparck Jones K (1972) A statistical interpretation of term specificity and its application in retrieval. J Doc 28(1):11–21
Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez AN, Kaiser Ł, Polosukhin I (2017) Attention is all you need. In Advances in neural information processing systems, pp 5998–6008
Wang A, Singh A, Michael J, Hill F, Levy O, Bowman SR (2018) GLUE: a multi-task benchmark and analysis platform for natural language understanding. In Proceedings of the Workshop: Analyzing and Interpreting Neural Networks for NLP, BlackboxNLP@EMNLP 2018, Brussels, Belgium, pp 353–355. Association for Computational Linguistics
Yang Y, Pierce T, Carbonell JG (1998) A study of retrospective and on-line event detection. In Proc of ACM-SIGIR, pp 28–36
Yin J, Wang J (2014) A Dirichlet multinomial mixture model-based approach for short text clustering. In The 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’14, New York, NY, USA, pp 233–242
© 2024 The Author(s), under exclusive license to Springer Nature Switzerland AG
Cite this chapter
Mazoyer, B., Hervé, N., Hudelot, C., Cagé, J. (2024). Comparison of Short-Text Embeddings for Unsupervised Event Detection in a Stream of Tweets. In: Jaziri, R., Martin, A., Cornuéjols, A., Cuvelier, E., Guillet, F. (eds) Advances in Knowledge Discovery and Management. Studies in Computational Intelligence, vol 1110. Springer, Cham. https://doi.org/10.1007/978-3-031-40403-0_4
Print ISBN: 978-3-031-40402-3
Online ISBN: 978-3-031-40403-0