
1 Introduction

The inherent dynamism of Online Social Networks (OSNs) lies in the huge amount of varied information uploaded to OSN platforms every second in the form of multimedia data from different events [1]. Any recent happening or prolonged event occurring around the globe leaves its footprint in OSNs in some way or the other [2]. An event ‘E’ is defined as a happening which is probable to occur in the next time span or duration [3]. On Twitter, users describe an event using a #tag (hashtag) or the @ symbol, which further facilitates coupling different events with each other, directly or indirectly [4]. According to [5], both unplanned events like natural disasters and planned events such as the ICC World Cup Twenty20 can be bursty on Twitter, whether trendy or non-trendy. The bursty behavior of an event is directly proportional to the rate of information diffusion over Twitter or any other OSN [5]. These bursty events, which can be called ‘trends’, have the capability to catch the attention of a huge number of users almost instantly [6].

The real-time stream of data from Twitter helps researchers analyze real-world bursty events within a specific timeline. Every tweet is accompanied by its creation timestamp, username, and a biographical sketch, making it easier for researchers to take up the challenge of automatically detecting and analyzing bursty events. The real challenge lies in the analysis of tweet text during bursty events, when data arrives in large volumes at a high rate [7]. Under these circumstances, close-to-real-time detection of bursty events must match the speed of information diffusion, which demands efficient algorithms. Prediction of bursty events has important implications in social and political domains, for both planned and unplanned events.

A few algorithms have been proposed in the literature for the detection of bursty events. Most of the approaches use a fixed vocabulary of terms, require a set of query words, need the number of topics to discover specified in advance, and rely on a preset threshold value to define the bursty event cluster [8]. Additionally, most of the techniques use a vector-space model to represent the tweets, with the dimension of the vector equal to the size of the word vocabulary [9]. Researchers who have considered streaming data assume a dynamic word vocabulary for bursty event detection which changes with time [2]. Some recent studies have also used deep learning techniques, attention mechanisms, and network structures to detect bursty events [10,11,12]. To the best of our knowledge, none of them have studied the goodness of the topics produced across the different time windows. In this paper, a bursty event detection algorithm is proposed which considers a dynamic set of tweets in every time window and generates optimal topics per window of a bursty event.

The rest of the chapter is organized as follows: Sect. 2 elaborates the review of literature. Section 3 details the proposed burst detection framework and the corresponding algorithm. The experimental setup, description and preparation of the datasets, implementation, and the evaluation results and analysis are illustrated in Sect. 4. Discussion of the results follows in Sect. 5, with the conclusion in Sect. 6.

2 Review of Literature

A very traditional work used statistical techniques and tests on the data distribution to extract bursty keyword topics in an event [13]. For the online setting, the Twitter Monitor tool was designed by [6] to detect emerging topic trends in the Twitter stream. Trends were identified in two steps, based on the buzz around individual keywords: the occurrence of individual keywords in tweets is measured to identify bursts. This was modified by [7] through an algorithm named ‘Window Variation Keyword Burst Detection’, a scalable and fast online procedure for detecting bursty keywords. A study by [14] proposed a different approach for online bursty keyword detection, the EDCoW (Event Detection with Clustering of Wavelet-based signals) model. EDCoW treats individual words as signals by applying wavelet analysis to word frequencies. Emerging temporal trends were interpreted in a study by [15]: first, a taxonomy of trends was derived from a large Twitter dataset; second, the study identified the primary features through which trends can be categorized using the features of each trend category.

A segment-based system for the detection of bursty events, called ‘Twevent’, was introduced by [16]. Twevent first detects event segments from bursty tweet segments, and then clusters the event segments into events using their frequency distribution and content similarity. An interesting study by [17] researched the early identification of bursty topics in the timeline using large-scale real-time data from Twitter. TopicSketch, the tool proposed by the study, was an integrated solution consisting of two stages. In the first stage, the model maintains three measures: the total count of tweets, the occurrence of each word, and the respective word pairs. These measures were used as indicators of a sudden burst in user attention towards the tweets, which further facilitates bursty topic detection. In the second stage, a sketch-based topic model was used to depict the bursty topics and their surge based on the statistics monitored in the sketch of the data. Incremental clustering methodology was used by [18, 19] to detect bursty events, where the evolution of events was also examined and addressed [20]. In the incremental clustering technique, newly arriving tweets result in an update of the bursty topics. Study [21] proposed an incremental topic model that includes the temporal features of texts, named ‘Bursty Event dEtection (BEE)’, to detect bursty events.

EventRadar, proposed by [22], deals with activity bursts in a localized area. A geo-burst algorithm was proposed and implemented using geo-tagged tweets containing information on the location, time, and text of the tweets. Geographically tagged topic clusters/groups are created as candidate events per query window. A statistical approach was followed by a study by [23] on the Twitter platform. The study showed that a sudden spike in tweet frequency follows a log-normal distribution with respect to the arrival of data. The data or tweet burstiness of any event was mapped to the z-score of the tweet arrival rate. Real-time streaming text was used by [5] to understand the bursty behavior of events. This study explored various event features and used clustering to classify the features as per their similarity index.

A study by [24] considered cross-social-media influence and unsupervised clustering for a burst detection model. In this work, the time-series social media data were divided into time slices, and for each slice the burst word features in that time window were calculated. The burst degree of words was calculated by fusing the three burst features in the time window, after which the burst word set was generated. Finally, an agglomerative hierarchical clustering technique was applied to cluster the word set into events. A novel graph-based technique called KEvent was proposed by [25], where tweets were divided into separate bins to extract bursty keyphrases. The word2vec model was used to create a weighted keyphrase graph from the keyphrases. Final event detection was performed using Markov clustering.

Lately, deep learning algorithms [10, 11] coupled with attention networks have been used by researchers to handle the temporal dynamics of emerging keywords and detect events from tweets.

3 Proposed Burst Detection Model

Keywords, terms, words, and tokens are synonymous in our research work and are used interchangeably throughout. The stepwise burst detection framework is detailed in Fig. 1.

The proposed burst detection algorithm is an extension of the Window Variation Keyword Burst Detection algorithm given by [7]. The extended features are:

a) Threshold: A threshold is included twice: first for selecting the most frequent words per window in Algorithm 3, and then in Algorithm 4 for selecting the bursty keywords across two consecutive windows. This approach helps in detecting the appropriate bursty topics.

b) Topic Creation: After the list of bursty keywords is obtained in Algorithm 4, we generate the optimal k topics out of the bursty keywords per window. This approach helps in identifying the trending topics.

c) Coherence Scores of Topics: Algorithm 5 generates coherence scores of the optimal k topics in each time window across the bursty event. This approach helps in identifying bursty topics of similar context per window.

Fig. 1. Burst Detection Framework

The detailed explanation of each algorithm is given in a piecewise manner according to the modules of the framework, with their respective inputs, outputs, and the corresponding pseudocode. Table 1 provides a description of the important variables used in the pseudocode for better understanding. The algorithms should be read keeping the framework, the variable descriptions, and the pseudocode synced with one another.

i) Gathering Event Tweet Stream: The process starts with the collection of tweets (G_Stream1) using the Twitter streaming API and their conversion into a set of non-duplicated tweets (G_Stream2), as shown in Table 2. A tweet can be a regular tweet or a retweet. Each tweet is at most 140 or 280 characters long. The events selected for our research are natural disaster events.
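As a minimal illustration of this module, the sketch below de-duplicates a stream of tweets archived as JSON Lines (the format produced by tools such as Twarc). The file name and the use of the tweet `id` field are assumptions for illustration, not part of the proposed algorithm.

```python
import json

def gather_stream(path):
    """Read archived tweets (one JSON object per line) and drop duplicates.

    G_Stream1 is the raw stream; G_Stream2 keeps the first occurrence of
    every tweet id, mirroring the de-duplication step of Table 2.
    """
    g_stream2, seen_ids = [], set()
    with open(path, encoding="utf-8") as fh:
        for line in fh:                      # G_Stream1: raw incoming tweets
            tweet = json.loads(line)
            if tweet["id"] not in seen_ids:  # skip exact duplicates/re-fetches
                seen_ids.add(tweet["id"])
                g_stream2.append(tweet)
    return g_stream2

# Example (hypothetical file): tweets = gather_stream("harvey_tweets.jsonl")
```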

ii) Tweet Pre-processing: The next module in the pipeline is data pre-processing, shown in Table 3, which involves preparing the dataset to make it appropriate for generating the most frequent bag of keywords.
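The exact cleaning steps are those of Table 3; the sketch below shows only an assumed, typical pipeline for short texts (lower-casing, stripping URLs, mentions, and punctuation, and removing stop words) and is not claimed to be the authors' exact procedure.

```python
import re

# A tiny illustrative stop-word list; a full list (e.g. NLTK's) would be used in practice.
STOP_WORDS = {"the", "a", "an", "is", "are", "to", "of", "in", "and", "rt"}

def preprocess(text):
    """Clean one tweet and return its tokens."""
    text = text.lower()
    text = re.sub(r"https?://\S+", " ", text)  # drop URLs
    text = re.sub(r"@\w+", " ", text)          # drop user mentions
    text = text.replace("#", " ")              # keep hashtag words, drop the symbol
    text = re.sub(r"[^a-z\s]", " ", text)      # drop digits and punctuation
    return [tok for tok in text.split()
            if tok not in STOP_WORDS and len(tok) > 2]

# preprocess("RT @user: Flooding in #Houston https://t.co/x") -> ['flooding', 'houston']
```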

Table 1. Variable Names and Their Descriptions
iii) Generating Temporal Bag of Most Frequent Keywords: The aim of this module is to output the most frequent words/tokens appearing in the respective bag of tweets within a particular time window. The time window size, window_size, is decided based on an understanding of the dataset gained from the descriptive analysis. We check the total number of days of data available and the burst_datasize. The final number of time windows, window_num, depends on the window_size considered and the burst_datasize. The collection of pre-processed tweets is divided into bags of tweets, tweet_bag, as per the window_size. Every window starts with an initial timestamp init_time; for the first window, the timestamp is zero. Each subsequent time window has a duration determined by the window_size. The finishing timestamp of a window, end_time, is calculated by adding window_size to the init_time of that window. The init_time values of all the windows are stored in window_init_time for future use. A snapshot of the windowing system referred to in our algorithm is shown in Fig. 2. Here Tw, Tw + 1, … refer to the incoming sequential stream of tweets. For every window, the bag of tweets is created, and the tweets are further tokenized to get the bag of words total_win_words. For each word in the bag, the window-wise word frequency word_freq is calculated.

Table 2. Algorithm for Gathering Event Tweet Stream
Table 3. Algorithm for Tweet Pre-processing
Fig. 2. Time Windowing System of the Proposed Algorithm

The proposed algorithm applies a threshold when selecting the most frequent words per window (most_frequent_words). A threshold of 20% of the total number of tokens per window is used to select the most_frequent_words for a particular window; this threshold value is based on the state-of-the-art study by [26]. In this module, we record the set of most_frequent_words along with their respective frequencies of occurrence per window, the window-wise total number of tokens/words, and the total number of tweets (no_of_tweets) per window for further use in the rest of the modules. The pseudocode of the stated process is given in Table 4.
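A minimal sketch of this windowing and frequency-threshold step follows, using the variable names of Table 1. It assumes tweets carry a Unix creation timestamp and interprets the 20% threshold as 20% of the unique tokens per window; it illustrates the idea rather than reproducing the pseudocode of Table 4.

```python
from collections import Counter

def temporal_frequent_words(tweets, window_size, top_fraction=0.2):
    """Split (timestamp, tokens) pairs into windows and keep the top words.

    Timestamps are taken relative to the first tweet, so the first window's
    init_time is zero, as in the framework.
    """
    t0 = min(ts for ts, _ in tweets)
    windows = {}                                  # window index -> word_freq Counter
    for ts, tokens in tweets:
        win = int((ts - t0) // window_size)       # window's init_time = win * window_size
        windows.setdefault(win, Counter()).update(tokens)

    most_frequent_words = {}
    for win, word_freq in windows.items():
        # 20% of the unique tokens in this window, following [26]
        cutoff = max(1, int(top_fraction * len(word_freq)))
        most_frequent_words[win] = word_freq.most_common(cutoff)
    return most_frequent_words
```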

Table 4. Algorithm for Generating Temporal Bag of Most Frequent Keywords
iv) Bursty Keywords Detection: The input to this module is G_Queue2, consisting of the most frequent keywords per time window. The purpose is to find out how many of the most frequent keywords are eligible to become bursty keywords per time window and to model these bursty keywords into window-wise topics, as given in Table 5. A dictionary data structure, Hash_Dict, is initialized to store records of most_frequent_words occurring in two consecutive windows, window1 and window2. It is then checked whether the probability of occurrence of these words increases across the two consecutive windows. If it does, those words are considered eligible bursty keywords for the earlier window. All the eligible bursty keywords are sorted window-wise by decreasing probability and frequency, and stored in sorted_word_rank and sorted_word_freq respectively. In order to get meaningful topics, a threshold value of the average probability/frequency is calculated. All the bursty keywords having probability greater than or equal to the average probability and frequency greater than or equal to the average frequency are stored in sorted_word_rank_avg_cutoff and sorted_word_freq_avg_cutoff respectively. Finally, k topics are created from the list of bursty keywords per window in sorted_word_freq_avg_cutoff, where k is a user input greater than zero for the number of topics. In order to get meaningful topics, the value assigned to k should be optimal; the optimal number of topics can be calculated based on the coherence score. We have used the UMass coherence score for determining the optimal number of topics per dataset. All the generated optimal k topics per window are stored in all_optimal_k_topics_per_window for further processing in the next module.
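The core eligibility test, a word's occurrence probability rising from one window to the next, combined with the average-probability cutoff, can be sketched as below. It assumes a word's probability is its frequency divided by the window's total token count, and it is an illustration of the description above, not the exact pseudocode of Table 5.

```python
def bursty_keywords(word_freq_w1, word_freq_w2, total_w1, total_w2):
    """Return words from window w1 whose occurrence probability rises in w2.

    word_freq_w1 / word_freq_w2: dicts mapping word -> frequency per window.
    total_w1 / total_w2: total token counts of the two consecutive windows.
    """
    eligible = {}
    for word, freq in word_freq_w1.items():
        p1 = freq / total_w1
        p2 = word_freq_w2.get(word, 0) / total_w2
        if p2 > p1:                    # probability increases across the windows
            eligible[word] = p2

    if not eligible:
        return []
    avg_p = sum(eligible.values()) / len(eligible)   # average-probability cutoff
    kept = {w: p for w, p in eligible.items() if p >= avg_p}
    # Bursty keywords sorted by decreasing probability (cf. sorted_word_rank_avg_cutoff)
    return sorted(kept, key=kept.get, reverse=True)
```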

Table 5. Algorithm for Bursty Keywords Detection & Optimal k-Topics per Window
v) Generating Topic’s Coherence Scores per Window: The optimal topics generated per window are fed as input to this module, as shown in Table 6. The coherence scores of the topics are measured using the coherence frameworks UMass, UCI, NPMI, CV, and word2vec.
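A minimal sketch of scoring one window's topics is given below using gensim's CoherenceModel, which implements the UMass, UCI, NPMI, and CV measures (the word2vec-based score is omitted here for brevity). Whether the authors used gensim is an assumption.

```python
from gensim.corpora import Dictionary
from gensim.models.coherencemodel import CoherenceModel

def score_window_topics(topics, window_texts):
    """Score one window's topics with several coherence measures.

    topics: list of topics, each a list of top words, e.g. [['flood', 'rescue'], ...]
    window_texts: the tokenized tweets of that window, used as reference corpus.
    """
    dictionary = Dictionary(window_texts)
    scores = {}
    for measure in ("u_mass", "c_uci", "c_npmi", "c_v"):
        cm = CoherenceModel(topics=topics, texts=window_texts,
                            dictionary=dictionary, coherence=measure)
        scores[measure] = cm.get_coherence()
    return scores
```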

Table 6. Algorithm for Generating Topic’s Coherence Scores per Window

Important Aspects of the Framework:

The designed framework is suitable for different real-world high-impact events such as natural disasters, public-opinion events, or any emerging trends.

  • Dynamic threshold determination is utilized, which incorporates variability into the model and makes it more suitable for real-world scenarios.

  • The tweet and word vocabulary per window is not static but is obtained dynamically according to the size of the window.

  • The optimal number k of topics can be obtained per window using the coherence score, as per the user's choice of coherence measures.

  • A new module for generating topic coherence scores helps to identify bursty topics of similar context, which is believed to be highly significant during impactful events.

  • The designed approach is implemented on Twitter microblogs, but it can be applied universally to any short text messages.

4 Implementation, Results and Analysis

Experimental Set-Up:

Anaconda Jupyter Notebook and Google's Colab Pro environment were used as platforms for the study. The PC configuration was a 4-core Intel Core i7 processor with 16 GB memory. Python version 3.8 was used as the programming language to implement the models. Twarc, a Python library and command-line tool, was used for archiving Twitter JSON data; it was also used for rehydrating the dehydrated datasets, which consist only of lists of tweet ids.

Dataset Description:

We have used three natural disaster datasets collected from the Kaggle repository. All these datasets were released through a study by [27], where the data was collected by the authors through specific keyword query searches. The datasets were selected with respect to the volume of tweets, user engagement, and retweet count, showing the virality of each event. A brief snapshot of the datasets is given in Table 7.

The proposed algorithm is run across the three datasets and the results are obtained. A baseline comparison is done with the Latent Dirichlet Allocation (LDA) model [28], the Gibbs Sampling Dirichlet Mixture Model (GSDMM) [29], and the Gamma-Poisson Mixture topic model (GPM) [30] to show the perspectives in which the proposed model outperforms the baselines. The latter two algorithms are proven to be good for short-text topic modelling.
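For reference, a minimal gensim-based sketch of the LDA baseline on one window's tweets is given below; the hyperparameter values shown are placeholders, not the settings used in the experiments.

```python
from gensim.corpora import Dictionary
from gensim.models import LdaModel

def lda_baseline(window_texts, num_topics=3):
    """Fit LDA on one window's tokenized tweets and return its topics."""
    dictionary = Dictionary(window_texts)
    corpus = [dictionary.doc2bow(tokens) for tokens in window_texts]
    lda = LdaModel(corpus=corpus, id2word=dictionary,
                   num_topics=num_topics, passes=10, random_state=0)
    # Top 10 words per topic, comparable to the proposed model's topic lists
    return [[w for w, _ in lda.show_topic(t, topn=10)]
            for t in range(num_topics)]
```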

Table 7. Dataset Features

4.1 Proposed Algorithm Implementation

The implementation of the proposed algorithm is carried out on all the mentioned datasets after some preliminary analysis of the datasets required for the implementation. Table 8 summarizes the corresponding variable values found for all three datasets after this analysis.

  • Window Size: For Hurricane Harvey, the variation in topics was much better for a window size of 24 h (86400 s) than for window sizes of 6 h or 12 h. The burst in data happened only after the 5th day of the incident, so it was pointless to go below 24 h for those 5 days, as the incoming stream of tweets was very small. The window size is therefore taken as 24 h or 86400 s. In the case of Typhoon Hagupit, the window size is taken as 12 h or 43200 s; the variability in topics is better for this window size than for shorter or longer ones. This dataset also shows a good burst in incoming tweets from the very beginning, so dynamic topics are expected in every 12-h time window. Hurricane Sandy is a 3-day dataset with a burst of tweets within a very short period of time. Owing to the smaller number of days and the huge number of tweets streaming in, the window size is kept at 12 h.

Table 8. Important Findings from the Dataset
  • Number of Windows: The detection of bursty keywords compares two consecutive windows: to determine the set of bursty keywords for the current window, the current and the next immediate window are considered. Hence the total number of windows for which bursty keywords are detected is 6, 11, and 4 for the Harvey, Typhoon, and Sandy datasets respectively, one less than the actual number of windows in the main data frame, as in Table 8.

  • Optimal Number of Topics: While deciding on the number of topics for the events of the three datasets, overall coherence scores were calculated using the UMass coherence measure and plotted against a varying number of topics per window. The aim is to choose the number of topics for which the coherence score is optimized (a selection sketch follows below). For most of the windows, the coherence measure is stable at 3, 5, and 3 topics for the Harvey, Typhoon, and Sandy datasets respectively, as in Table 8. Ideally, Hurricane Sandy could have been the best dataset with respect to the burstiness of data, but the actual implementation of the proposed algorithm on this dataset showed the worst performance with respect to the topics generated, with no variation at all. At the same time, as the topics are the same across all the windows, there is no change in the coherence score with the change in topics. This clearly shows a disparity in the distribution of frequencies across the unique words over all the time windows.
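The selection of the optimal number of topics can be sketched as follows: for each candidate k, topics are built and their UMass coherence is computed, and the best-scoring k is kept. The topic-building helper make_topics is hypothetical and stands in for the per-window topic creation of Algorithm 4.

```python
from gensim.corpora import Dictionary
from gensim.models.coherencemodel import CoherenceModel

def optimal_k(window_texts, make_topics, k_range=range(2, 11)):
    """Pick the number of topics k with the best UMass coherence.

    make_topics(k) is a hypothetical helper returning k topics (lists of
    words) for the window, standing in for Algorithm 4's topic creation.
    """
    dictionary = Dictionary(window_texts)
    best_k, best_score = None, float("-inf")
    for k in k_range:
        cm = CoherenceModel(topics=make_topics(k), texts=window_texts,
                            dictionary=dictionary, coherence="u_mass")
        score = cm.get_coherence()
        if score > best_score:          # UMass: higher (closer to 0) is better
            best_k, best_score = k, score
    return best_k
```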

4.2 Evaluation Results and Analysis

The list of coherence or confirmation measures [31] considered in this research to evaluate the bursty topics generated by the proposed and baseline models is: UCI coherence (CUCI), UMass coherence (CUMASS), NPMI coherence (CNPMI), CV coherence (CV), and word2vec coherence (CW2V). Following the coherence framework, the aggregated score of each measure is obtained by calculating the arithmetic mean of all the coherence or confirmation scores. Based on these scores, the performance of the proposed model is compared with the three baseline models. During the evaluation, we experimented with different parameter settings to achieve the best possible results. The sliding window sizes were tweaked in the range (10, 150), and the context window was varied between (10, 100) for both the proposed algorithm and the LDA model. For GSDMM and GPM, the number of iterations (iters), the top words of the cluster considered (nTopWords), and the size of the document (N) were tweaked. The number of topics (K) in all the models for every dataset was determined with respect to the average coherence score.

Regarding the influence of hyperparameters, the Dirichlet priors and the gamma priors were tweaked for GSDMM and GPM. For GSDMM, the Dirichlet priors a and b were tried with a = 0.01, 0.25, 0.5, 0.75, 0.05 and b = 0.1, 0.5, 1.0, 2.5 respectively. Finally, with respect to the quality of the topics created in each case, we settled on a = 0.25 and b = 0.15. The same procedure was repeated for the GPM model with the gamma and Dirichlet priors. The evaluation results, the coherence scores obtained by implementing all the models including the proposed algorithm, are depicted in the following tables for the respective datasets. The rows highlighted in bold are measures on which our model outperformed the baselines.

Table 9. Proposed Bursty Model Evaluation in Hurricane Harvey
Table 10. Proposed Bursty Model Evaluation in Typhoon Hagupit
Table 11. Proposed Bursty Model Evaluation in Hurricane Sandy
  • The coherence evaluation measures for Hurricane Harvey and the baseline comparison are shown in Table 9. The proposed model has generated competitive scores for CV and word2vec.

  • The coherence evaluation measures for Typhoon Hagupit and the baseline comparison are shown in Table 10. For this dataset, the CV score is better for the proposed model than for the other three models.

  • The coherence evaluation measures for Hurricane Sandy and the baseline comparison are shown in Table 11. The proposed model has produced better results for all the coherence measures compared to LDA, GSDMM, and GPM for this dataset.

5 Discussions

The coherence score measures the quality of the topics generated per window. According to [31], the higher or closer the coherence score is to 1, the more coherent the topics are. The range of UMass coherence is −14 to +14, UCI and NPMI coherence lie between −1 and +1, and CV and CW2V both lie between 0 and 1. CV is proven to be the best measure in the baseline paper [31]; it is a combined measure, obtained by combining the indirect cosine confirmation measure with NPMI and the concept of a Boolean sliding window. CV and CW2V, which are semantic and contextual measures of the topics, have given the best scores across all the datasets. For all the datasets, the NPMI coherence measure has in general given better coherence values than its non-normalized UCI version. Overall, the CW2V measure has performed well compared to the other measures. One reason may be the length of the input text: in the baseline paper [31], the goodness of the coherence measures was demonstrated on long texts or articles, so the scores achieved by the proposed algorithm prove to be competitive. In our case, we are applying coherence measures to short texts. This points towards a required improvement of the algorithm that also takes the length of the document into consideration, which is immediate future work. Apart from that, the better performance of our model compared to the others with respect to CV and CW2V across all the datasets is a contribution of this study. As both these measures signify the semantic and contextual features of topics, our model succeeds in creating topics that are more semantically coherent and contextual than the other state-of-the-art techniques available in the field of topic modelling.

Practical Implications of this Research:

The proposed model can be used for modelling topics of any Twitter-based event. Additionally, researchers can measure the goodness of the topics through coherence measures, drawing inferences about the coherent topics in different time windows across events. This information can be further leveraged to understand the trends per time window. In the case of disaster events or other high-impact events, knowledge of the coherent topics per window can facilitate several support decisions for the disaster at that point in time.

6 Conclusion

This chapter detailed the complete work on the proposed burst model of a high-impact event. The proposed algorithm detects bursty optimal topics during high-impact events by comparing the bursty words across consecutive time windows. The algorithm further measures the window-wise coherence scores of the bursty optimal topics using a coherence framework. The coherence scores of the topics generated by the proposed algorithm are compared with state-of-the-art baseline topic modelling techniques. Through proper experimentation and analysis, our proposed model succeeds in creating better topics than the baseline models with respect to the contextual coherence features.