1 Introduction

Social media is a powerful tool that can provide valuable insights into a variety of topics. Using Twitter posts as a source for extracting narratives may yield information that a news article does not, since it comes from people who experience an event first-hand. Twitter is a very helpful tool for journalists [4, 7]. However, its colloquial language and the large volume of tweets (6000 tweets are posted every second, on average [12]) make it impractical to keep up with an event. For this reason, retrieving the most relevant tweets is of the utmost importance. To achieve this, researchers have proposed a variety of methods for the automatic summarization of tweet streams [1, 3, 6, 9, 10, 15], although none of these had narrative extraction in mind [11]. Recently, Campos et al. [2] proposed the Tweet2Story framework (footnote 1), which performs automatic narrative extraction from a bundle of tweets. However, this framework does not work in real time, requiring users to collect and process beforehand the tweets that will be given as input. In this paper, we present TweetStream2Story (footnote 2), an extension of Tweet2Story that fills this gap by incorporating the real-time collection of tweets on a given topic, as well as the automatic extraction of narratives from these tweets. As a further contribution to the research community, we make the source code of our project available, thus challenging researchers to use and expand it (footnote 3).

2 TweetStream2Story

Fig. 1. Architecture overview of TweetStream2Story

Figure 1 depicts the architecture of the TweetStream2Story framework. The first step of this pipeline is to issue a topic, i.e., a query (e.g. Denmark Shooting), in the user interface (the web client) to search for related tweets. The user must also provide the time period for the collection of tweets (e.g. July 4 2022, 4:30 pm to July 5 2022, 12:30 am), which will be divided into time windows of a specified duration (e.g. 2 h). The narrative is generated in two modes. In the global mode, each time window uses all tweets posted since the start of the topic ([4:30 pm–6:30 pm], [4:30 pm–8:30 pm], and so on). In the interval mode, instead, each time window uses only the tweets posted strictly during that time window ([4:30 pm–6:30 pm], [6:30 pm–8:30 pm], and so on). Once this information is defined, we collect the related tweets using either the Twitter API’s Filtered stream, if the user wants to follow tweets posted in real time, or the Full-archive search, to look for events in the past. Every collected tweet goes through a preprocessing stage that removes hashtags, hyperlinks and emojis. Near-duplicate tweets, i.e., those whose term-frequency cosine similarity with another tweet is higher than 80%, are also removed. The resulting set is then stored in Elasticsearch, a flexible document-oriented database. To reduce the number of tweets, we then apply a summarization-like step in which only the most relevant tweets are taken into account. To do this, we use, as in Rishab S. et al. [14], the Okapi BM25 function [8] as our IR model, a function that estimates the relevance of a document to a given search query, and with it retrieve the top-X most relevant tweets belonging to a particular time window, where X equals 50 (a trade-off between the number of tweets and their Precision). Finally, we use these tweets as input to the Text2Story narrative extraction pipeline.
In the coming section, we demonstrate how this pipeline is used to create a visual representation of the topic’s narrative.
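
The near-duplicate removal and the BM25 ranking described above can be illustrated with a minimal, self-contained sketch. It is not the framework's actual code, which delegates storage and retrieval to Elasticsearch. The function names, the simple alphanumeric tokenizer, the greedy keep-first deduplication strategy, and the BM25 parameters (k1 = 1.5, b = 0.75, with a smoothed non-negative IDF) are all assumptions made for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase and split into alphanumeric tokens (a simplifying assumption)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine_tf(a, b):
    """Cosine similarity between two term-frequency Counters."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def drop_near_duplicates(tweets, threshold=0.8):
    """Greedily keep a tweet only if its similarity to every kept tweet
    does not exceed the threshold (hypothetical strategy)."""
    kept, kept_tf = [], []
    for tweet in tweets:
        tf = Counter(tokenize(tweet))
        if all(cosine_tf(tf, other) <= threshold for other in kept_tf):
            kept.append(tweet)
            kept_tf.append(tf)
    return kept


def bm25_top(query, tweets, top_x=50, k1=1.5, b=0.75):
    """Rank tweets against the query with Okapi BM25 and return the top_x."""
    docs = [tokenize(t) for t in tweets]
    if not docs:
        return []
    n = len(docs)
    avgdl = sum(len(d) for d in docs) / n
    # Document frequency: in how many tweets each term occurs.
    df = Counter(term for d in docs for term in set(d))
    query_terms = tokenize(query)

    def score(doc):
        tf = Counter(doc)
        total = 0.0
        for term in query_terms:
            if term not in tf:
                continue
            # Smoothed IDF variant that never becomes negative (assumption).
            idf = math.log((n - df[term] + 0.5) / (df[term] + 0.5) + 1)
            length_norm = 1 - b + b * len(doc) / avgdl
            total += idf * tf[term] * (k1 + 1) / (tf[term] + k1 * length_norm)
        return total

    ranked = sorted(zip(tweets, docs), key=lambda pair: score(pair[1]), reverse=True)
    return [tweet for tweet, _ in ranked[:top_x]]


if __name__ == "__main__":
    sample = [
        "Shooting reported in Copenhagen mall",
        "shooting reported in copenhagen mall!",
        "Police confirm several people wounded in Denmark shooting",
        "Lovely weather in Lisbon today",
    ]
    unique = drop_near_duplicates(sample)
    for tweet in bm25_top("Denmark Shooting", unique, top_x=2):
        print(tweet)
```

In this sketch, deduplication runs before ranking, mirroring the order of the steps in the text; in a deployment backed by Elasticsearch the ranking would more naturally be issued as a search query restricted to the time window of interest.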

3 Demo

In this section, we describe the main features of this demo. Its live version can be used by anyone who wishes to extract the narrative of a specified topic from tweets posted either in real time or in the past. To generate a narrative, the user first inputs a topic of interest. After typing in a topic and clicking on the Extract Narrative button, a modal opens where the user can specify parameters such as the desired language, the duration of each time window (e.g. 2 h), and the mode for collecting tweets (e.g. streaming, past tweets). Currently, the only languages supported are English and Portuguese. Although the focus of this work is the retrieval of tweets posted in real time, our framework also allows retrieving past tweets. In this case, however, the user must provide their Twitter API credentials, which are not stored and are discarded as soon as they are used. Topics are automatically added to a private list owned by the user, who can then keep track of each topic’s status, visualize the corresponding narrative, or perform actions such as stopping the retrieval of tweets or deleting the topic from the list. Figure 2 shows the interface for this list of topics.

Fig. 2. Topics list

By clicking on a topic, users can visualize its narrative through a timeline, as shown in Fig. 3. In addition, they can choose between the two modes mentioned before: global view or interval view. In each mode, the timeline is divided into time windows with the duration previously specified by the user, and each window shows its respective narrative and information. The default visualization of a narrative is the knowledge graph, which shows actors as nodes and semantic relationships as the edges between the actors. It also highlights in yellow the nodes that were not present in the previous time window, so that the user can quickly spot new information. Other modes of visualization include the list of tweets that were used to generate the narrative, as well as the list of actors, as can be seen in Fig. 4. Further analysis can also be performed by viewing and downloading the information in formal representations, such as DRS annotations [5] or the Text2Story annotation [13].
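
As a rough illustration of how the timeline could be split into windows in the two modes, and of how newly appearing actors could be identified for highlighting, consider the sketch below. The function names `time_windows` and `new_actors`, the clipping of the last window to the end of the collection period, and the representation of actors as plain strings are hypothetical choices made for this example; they are not taken from the TweetStream2Story source code.

```python
from datetime import datetime, timedelta


def time_windows(start, end, duration, mode):
    """Return (window_start, window_end) pairs covering [start, end].

    mode="interval": consecutive windows, each spanning one `duration`.
    mode="global":   every window starts at `start` and grows by `duration`.
    The last window is clipped to `end` (an assumption of this sketch).
    """
    if mode not in ("interval", "global"):
        raise ValueError("mode must be 'interval' or 'global'")
    if duration <= timedelta(0):
        raise ValueError("duration must be positive")
    windows = []
    window_end = start
    while window_end < end:
        # In global mode the window always begins at the start of the topic;
        # in interval mode it begins where the previous window ended.
        window_start = start if mode == "global" else window_end
        window_end = min(window_end + duration, end)
        windows.append((window_start, window_end))
    return windows


def new_actors(previous, current):
    """Actors of the current window that were absent from the previous one."""
    return set(current) - set(previous)


if __name__ == "__main__":
    start = datetime(2022, 7, 4, 16, 30)
    end = datetime(2022, 7, 5, 0, 30)
    for mode in ("global", "interval"):
        print(mode)
        for begin, finish in time_windows(start, end, timedelta(hours=2), mode):
            print(f"  {begin:%b %d %H:%M} to {finish:%b %d %H:%M}")
    # Hypothetical actor sets for two consecutive windows.
    print(new_actors({"police", "gunman"}, {"police", "gunman", "hospital"}))
```

With the example period used earlier (July 4 2022, 4:30 pm to July 5 2022, 12:30 am) and 2 h windows, both modes yield four windows; they differ only in where each window starts.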

Fig. 3. Timeline representation of a topic

Fig. 4. Narrative representation of a time window

To demonstrate Twitter’s potential for narrative extraction, some example topics in both Portuguese and English are pre-loaded in the interface. Figure 4 shows a visual representation of the topic Denmark Shooting, an event that occurred in Copenhagen in 2022. This knowledge graph captures information about the number of deaths, critically wounded people, and previous shootings in the country. These examples illustrate Twitter’s usefulness as a news source, as the information contained in some of the extracted actors and relations can complement a news article.

4 Conclusions and Future Work

In this paper, we have presented TweetStream2Story, a framework that allows the automatic collection of tweets and the extraction of their narrative elements. This tool can be beneficial not only for journalists, but also for users interested in an ongoing event. Its limitations include the requirement for users to enter their Twitter API credentials when generating narratives from events in the past, and the long computational time needed to extract the narrative. In the future, we would like to improve the quality of the results by incorporating techniques such as irony detection and offensive speech detection, as a way to filter out some tweets. We also plan on improving the user-system interactions, as well as implementing an abstractive summarization approach, in order to use original content as the source of the narratives.