
Comparison of Short-Text Embeddings for Unsupervised Event Detection in a Stream of Tweets

  • Chapter
Advances in Knowledge Discovery and Management

Part of the book series: Studies in Computational Intelligence (SCI, volume 1110)


Abstract

In this article, we apply recent short-text embedding techniques to the automatic detection of events in a stream of tweets. We model this task as a dynamic clustering problem. Our experiments are conducted on a publicly available English corpus of tweets and on a similar French dataset annotated by our team. We show that recent techniques based on deep neural networks (ELMo, Universal Sentence Encoder, BERT, SBERT), although promising in many applications, are not well suited to this task, even on the English corpus. We also experiment with different types of fine-tuning in order to improve the results of these models on French data. We propose a detailed analysis of the results obtained, showing the superiority of traditional tf-idf approaches for this type of task and corpus.
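
As a rough illustration of this task framing, the sketch below implements a minimal first-story-detection-style dynamic clustering loop over tf-idf vectors in Python. The vectorizer settings, the similarity threshold, and the toy tweets are illustrative assumptions, not the configuration evaluated in the chapter.

    # Minimal sketch of threshold-based dynamic clustering of tweets.
    # The tf-idf settings and the 0.15 threshold are assumptions for
    # this toy example, not the chapter's actual configuration.
    import numpy as np
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    tweets = [
        "earthquake hits city center",
        "strong earthquake reported downtown",
        "team wins championship final",
    ]

    # Embed the (mini) stream with tf-idf; a real system would update
    # the vocabulary incrementally as new tweets arrive.
    vectors = TfidfVectorizer().fit_transform(tweets)

    threshold = 0.15   # assumed similarity threshold
    centroids = []     # first tweet of each detected event
    labels = []

    for i in range(vectors.shape[0]):
        v = vectors[i]
        sims = [cosine_similarity(v, c)[0, 0] for c in centroids]
        if sims and max(sims) >= threshold:
            labels.append(int(np.argmax(sims)))   # attach to closest event
        else:
            centroids.append(v)                   # first story: new event
            labels.append(len(centroids) - 1)

    print(labels)   # expected: [0, 0, 1] with this toy data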



Notes

  1.

    See the results on the GLUE benchmark: https://gluebenchmark.com/leaderboard.

  2.

    More specifically, we are interested in the topic similarity of short texts, here tweets, which are not always grammatically correct. To simplify, we treat tweets as sentences in this article.

  3.

    For example, to compute a similarity score between sentences with BERT [Devlin et al., 2019], sentences must be processed in pairs rather than individually. Using the example proposed by [Reimers and Gurevych, 2019], finding the two most similar sentences in a corpus of n = 10,000 sentences would require \(\frac{n(n-1)}{2} = 49{,}995{,}000\) comparisons, which represents approximately 65 h of processing with BERT on a V100 GPU.

  4.

    https://github.com/ina-foss/twembeddings.

  5.

    This algorithm is implemented in the Python SciPy library.

  6.

    In accordance with Twitter’s terms of use, researchers sharing datasets of tweets do not share the content of the tweets, but only their ids. Using these identifiers, one can query the Twitter API to retrieve the full content of the tweets—but only if they are still online.

  7.

    github.com/loretoparisi/word2vec-twitter.

  8.

    code.google.com/archive/p/word2vec/.

  9.

    tfhub.dev/google/elmo/2.

  10.

    github.com/HIT-SCIR/ELMoForManyLangs.

  11.

    github.com/google-research/bert. Models: bert-large, uncased and bert-base, multilingual cased.

  12.

    tfhub.dev/google/universal-sentence-encoder-large/3.

  13.

    tfhub.dev/google/universal-sentence-encoder-multilingual-large/1.

  14.

    github.com/UKPLab/sentence-transformers. Model: bert-large-nli-stsb-mean-tokens.

  15.

    We used the fine-tuning architecture provided by the authors of Sentence-BERT (sbert.net/docs/training/overview.html#sentence_transformers.SentenceTransformer.fit) with a batch size of 16 and 4 epochs, keeping the default configuration for all other parameters. A hedged sketch of such a call is given after these notes.

  16.

    cloud.google.com/translate/docs/reference/rest/.
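
For note 15, here is a minimal sketch of what such a fine-tuning call might look like with the sentence-transformers library (training API as documented for versions contemporary with this work). The model name and the toy training pairs are placeholders, not the checkpoint or data actually used in the chapter.

    # Hedged sketch of Sentence-BERT fine-tuning along the lines of
    # note 15: batch size 16, 4 epochs, other parameters left at their
    # defaults. Model name and training pairs are placeholders.
    from torch.utils.data import DataLoader
    from sentence_transformers import SentenceTransformer, InputExample, losses

    model = SentenceTransformer("bert-base-multilingual-cased")  # placeholder

    # Each InputExample pairs two tweets with a similarity label
    # (e.g. 1.0 for same event, 0.0 otherwise).
    train_examples = [
        InputExample(texts=["tremblement de terre à Paris",
                            "séisme ressenti à Paris"], label=1.0),
        InputExample(texts=["tremblement de terre à Paris",
                            "victoire en finale hier soir"], label=0.0),
    ]

    train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)
    train_loss = losses.CosineSimilarityLoss(model)

    # SentenceTransformer.fit, as referenced in note 15.
    model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=4)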

References

  1. Allan J, Lavrenko V, Malin D, Swan R (2000) Detections, bounds, and timelines: UMass and TDT-3. In: Proc of the Topic Detection and Tracking Workshop, pp 167–174

  2. Bank RE, Douglas CC (1993) Sparse matrix multiplication package (SMMP). Adv Comput Math 1(1):127–137

  3. Becker H, Naaman M, Gravano L (2011) Beyond trending topics: real-world event identification on Twitter. In: Fifth International AAAI Conference on Weblogs and Social Media

  4. Bowman SR, Angeli G, Potts C, Manning CD (2015) A large annotated corpus for learning natural language inference. In: Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, EMNLP 2015, Lisbon, Portugal, pp 632–642. Association for Computational Linguistics

  5. Cagé J, Hervé N, Viaud M-L (2020) The production of information in an online world. The Review of Economic Studies

  6. Cer D, Diab M, Agirre E, Lopez-Gazpio I, Specia L (2017) SemEval-2017 Task 1: Semantic textual similarity multilingual and cross-lingual focused evaluation. arXiv preprint arXiv:1708.00055

  7. Cer D, Yang Y, Kong S-Y, Hua N, Limtiaco N, John RS, Constant N, Guajardo-Cespedes M, Yuan S, Tar C, et al (2018) Universal sentence encoder. arXiv preprint arXiv:1803.11175

  8. Che W, Liu Y, Wang Y, Zheng B, Liu T (2018) Towards better UD parsing: deep contextualized word embeddings, ensemble, and treebank concatenation. In: Proc of the CoNLL 2018 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies, pp 55–64

  9. Conneau A, Kiela D, Schwenk H, Barrault L, Bordes A (2017) Supervised learning of universal sentence representations from natural language inference data. arXiv preprint arXiv:1705.02364

  10. Devlin J, Chang M, Lee K, Toutanova K (2019) BERT: pre-training of deep bidirectional transformers for language understanding. In: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2019, Minneapolis, MN, USA, Vol 1 (Long and Short Papers), pp 4171–4186. Association for Computational Linguistics

  11. Fleuret F, Sahbi H (2003) Scale-invariance of support vector machines based on the triangular kernel. In: 3rd International Workshop on Statistical and Computational Theories of Vision, pp 1–13

  12. Godin F, Vandersmissen B, De Neve W, Van de Walle R (2015) Multimedia Lab @ ACL WNUT NER shared task: named entity recognition for Twitter microposts using distributed word representations. In: Proc of the Workshop on Noisy User-generated Text, pp 146–153

  13. Harris ZS (1954) Distributional structure. Word 10(2–3):146–162

  14. Hasan M, Orgun MA, Schwitter R (2016) TwitterNews: real time event detection from the Twitter data stream. PeerJ Preprints

  15. Johnson J, Douze M, Jégou H (2019) Billion-scale similarity search with GPUs. IEEE Transactions on Big Data

  16. Kiros R, Zhu Y, Salakhutdinov RR, Zemel R, Urtasun R, Torralba A, Fidler S (2015) Skip-thought vectors. In: Advances in Neural Information Processing Systems, pp 3294–3302

  17. Le Q, Mikolov T (2014) Distributed representations of sentences and documents. In: International Conference on Machine Learning, pp 1188–1196

  18. Li C, Duan Y, Wang H, Zhang Z, Sun A, Ma Z (2017) Enhancing topic modeling for short texts with auxiliary word embeddings. ACM Trans Inf Syst 36(2):11:1–11:30

  19. Likhitha S, Harish B, Kumar HK (2019) A detailed survey on topic modeling for document and short text data. Intern J Comp Appl 975:8887

  20. Mazoyer B, Cagé J, Hervé N, Hudelot C (2020) A French corpus for event detection on Twitter. In: Proceedings of the Twelfth International Conference on Language Resources and Evaluation (LREC 2020), pp 6220–6227

  21. McMinn AJ, Moshfeghi Y, Jose JM (2013) Building a large-scale corpus for evaluating event detection on Twitter. In: Proc of ACM CIKM, pp 409–418. ACM

  22. Mikolov T, Chen K, Corrado G, Dean J (2013) Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781

  23. Nguyen DQ, Billingsley R, Du L, Johnson M (2015) Improving topic models with latent feature word representations. Trans Assoc Comput Linguist 3:299–313

  24. Pennington J, Socher R, Manning C (2014) GloVe: global vectors for word representation. In: Proc of EMNLP, pp 1532–1543

  25. Peters ME, Neumann M, Iyyer M, Gardner M, Clark C, Lee K, Zettlemoyer L (2018) Deep contextualized word representations. arXiv preprint arXiv:1802.05365

  26. Petrović S, Osborne M, Lavrenko V (2010) Streaming first story detection with application to Twitter. In: Proc of NAACL, pp 181–189

  27. Reimers N, Gurevych I (2019) Sentence-BERT: sentence embeddings using Siamese BERT-networks. In: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing, EMNLP-IJCNLP 2019, Hong Kong, China, pp 3980–3990. Association for Computational Linguistics

  28. Repp Ø, Ramampiaro H (2018) Extracting news events from microblogs. J Stat Manag Syst 21(4):695–723

  29. Sankaranarayanan J, Samet H, Teitler BE, Lieberman MD, Sperling J (2009) TwitterStand: news in tweets. In: Proc of ACM GIS, pp 42–51

  30. Spärck Jones K (1972) A statistical interpretation of term specificity and its application in retrieval. J Doc 28(1):11–21

  31. Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez AN, Kaiser Ł, Polosukhin I (2017) Attention is all you need. In: Advances in Neural Information Processing Systems, pp 5998–6008

  32. Wang A, Singh A, Michael J, Hill F, Levy O, Bowman SR (2018) GLUE: a multi-task benchmark and analysis platform for natural language understanding. In: Proceedings of the Workshop: Analyzing and Interpreting Neural Networks for NLP, BlackboxNLP@EMNLP 2018, Brussels, Belgium, pp 353–355. Association for Computational Linguistics

  33. Yang Y, Pierce T, Carbonell JG (1998) A study of retrospective and on-line event detection. In: Proc of ACM SIGIR, pp 28–36

  34. Yin J, Wang J (2014) A Dirichlet multinomial mixture model-based approach for short text clustering. In: Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '14, New York, NY, USA, pp 233–242


Author information


Corresponding author

Correspondence to Béatrice Mazoyer.


Copyright information

© 2024 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this chapter


Cite this chapter

Mazoyer, B., Hervé, N., Hudelot, C., Cagé, J. (2024). Comparison of Short-Text Embeddings for Unsupervised Event Detection in a Stream of Tweets. In: Jaziri, R., Martin, A., Cornuéjols, A., Cuvelier, E., Guillet, F. (eds) Advances in Knowledge Discovery and Management. Studies in Computational Intelligence, vol 1110. Springer, Cham. https://doi.org/10.1007/978-3-031-40403-0_4
