
1 Introduction

Twitter has become one of the most prominent platforms for news delivery and social interaction on the internet, and the number of its users keeps growing rapidly. A large amount of data in the form of tweets, images, videos and animations is generated very frequently, and handling demand at this scale requires dedicated technology. Social media platforms such as Facebook, Instagram and YouTube have likewise changed the traditional way people connect over the internet. Since Twitter is one of the most popular of these platforms, we use it for event detection. Tweets about any local or global event are written by users within seconds, so natural events such as earthquakes and disease outbreaks can be detected effectively by performing event detection on Twitter data [2]. Twitter makes it easy to figure out what is happening right now: tweets are real-time data covering different subjects from various sources, and detecting emerging events is most valuable when it is performed on such real-time data [9]. Different people are interested in different types of news and events. Some want to know about local events [18], while business-oriented organizations are interested in promoting their products and services to prospective customers [4, 5, 13]; event detection can fulfil both requirements. It can also help detect natural disasters and warn people faster than any other medium [7], and it is useful in crime detection, for example for bomb blasts or terrorist attacks.

Event detection on Twitter is, however, hampered by a large amount of meaningless data [15]. The data generated on Twitter is not of high quality, so highly scalable and efficient techniques are needed to pre-process it and make it usable for event detection. Users prefer to write short tweets and generally use distorted words [12]. Different events involve different numbers of participants, different time periods and different relationships [14]. Real-time data collection is itself a major challenge, because a large amount of data is created on Twitter at every moment, yet real-time event detection has a wide range of applications that current technology can support. In the present paper, we propose an approach for detecting events in live Twitter streams. A key graph is constructed using a TF-IDF scheme; communities are then detected in the key graph using the betweenness centrality score, and finally clusters are formed using k-means with cosine similarity. The rest of this paper is structured as follows: Sect. 2 introduces the most relevant related work. Section 3 describes the dataset used. Section 4 provides a detailed description of the proposed approach. The experimental setup and results are discussed in Sect. 5. Section 6 concludes the paper and outlines future work.

2 Related Work

New event detection models typically run a single-pass incremental clustering algorithm: when a new document arrives, its similarity to each known event is computed and the event with the maximum similarity is selected. If this similarity exceeds a threshold, the document is assigned to that known event; otherwise it is considered a new event. A key graph based approach was used by Hassan Sayyadi, Matthew Hurst and Alexey Markov [6]. They built a key graph based on keyword co-occurrence and used a betweenness centrality algorithm for clustering keywords. Their algorithm counts all keywords of a single community together, even though a subgroup of the keywords may be more appropriate. A modified version of TF-IDF was developed by Allen et al. [1], in which the similarity threshold is additionally penalized by the time distance between the event and the document. Calculating IDF in an online clustering algorithm requires knowledge of future document features; the same method is used by Sayyadi et al. in [14]. Allen et al. [1] used a supplementary data set to estimate IDF, whereas Yang et al. [16] proposed an incremental IDF factor. Yang et al. based the similarity between documents and events on their time difference, together with a time window and a decay factor. To extract past events, Yang et al. [16] put forward an agglomerative clustering method, GAC (augmented Group Average Clustering), and used iterative bucketing and re-clustering to maintain cluster quality. A probabilistic model was put forward by Li et al. (2005) for retrospective event detection (RED), in which the Expectation Maximization (EM) algorithm is used to maximize the log-likelihood of the distributions. Such an algorithm requires the number of events to be known in advance, which is not feasible in practice, so Li et al. approximated the event count from the article count–time distribution [11]. Event detection models usually use similar algorithms but many different document representations, distance or similarity metrics, and clustering algorithms, as described in the literature [3, 8, 10, 17].

3 Dataset Used

For this paper, we collected tweets from Twitter based on the search keyword (hashtag) "Facebook". The collected data is fed to the Spark client, where it is processed in batches. The batch interval used is 1 min, during which 80–90 tweets are collected in one Spark batch, so one iteration of our algorithm runs on approximately 80–90 tweets.

4 Proposed Approach

We collect the data from Twitter in the form of batches in real time using the Spark Streaming API. Pre-processing is performed on each batch of data, which includes tokenization, removal of stop words and stemming. Our approach comprises the following phases: (1) construction of the key graph; (2) identifying communities in the key graph; (3) document clustering using k-means with cosine similarity.
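As an illustration of this pipeline, the following is a minimal Python sketch of the ingestion and pre-processing step. It assumes the classic DStream-based Spark Streaming API with a 1-minute batch interval (as in Sect. 3) and that a separate Twitter-API client pushes raw tweets line by line to a local TCP socket; the socket address, the NLTK-based pre-processing and the helper names (`preprocess`, `process_batch`) are illustrative assumptions, not the exact implementation used in the experiments.

```python
# Minimal sketch of ingestion and pre-processing, assuming tweets are pushed
# line-by-line to a local TCP socket by a separate Twitter-API client.
# Requires: nltk.download('stopwords')
import re
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

STOP = set(stopwords.words("english"))
stemmer = PorterStemmer()

def preprocess(tweet):
    """Tokenize a raw tweet, drop stop words and short tokens, and stem."""
    tokens = re.findall(r"[a-z]+", tweet.lower())
    return [stemmer.stem(t) for t in tokens if t not in STOP and len(t) > 2]

def process_batch(time, rdd):
    # each element of `docs` is the keyword list of one tweet; the key graph
    # construction of Sect. 4.1 would be applied to this batch here
    docs = rdd.map(preprocess).filter(lambda toks: toks).collect()
    print(time, "tweets in batch:", len(docs))

sc = SparkContext(appName="TwitterEventDetection")
ssc = StreamingContext(sc, batchDuration=60)      # 1-minute batches, as in Sect. 3
tweets = ssc.socketTextStream("localhost", 9009)  # illustrative host/port
tweets.foreachRDD(process_batch)
ssc.start()
ssc.awaitTermination()
```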

4.1 Construction of Key Graph

The key graph is built from the pre-processed data. We first calculate the term frequency (TF), document frequency (DF) and inverse document frequency (IDF) of the keywords. Keywords with low DF are removed because they are not useful. The key graph is then constructed by taking each remaining keyword as a node. Co-occurrence of keywords represents the relationship between them: an edge is created between two keywords that co-occur in the same document. Some of these edges are noise, and they are removed according to the following two rules:

  1. If the co-occurrence count of two keywords is below a specified threshold value, then the edge between these two keywords is removed.

  2. For an edge between nodes 'X' and 'Y', the conditional probability of seeing 'X' in a document given that 'Y' is present, and the conditional probability of seeing 'Y' in a document given that 'X' is present, are both calculated; if both are smaller than the threshold, then the edge is removed.

Algorithm 1 summarizes the construction of the key graph; a code sketch is given after it.

Algorithm 1 Construction of key graph from tweets

Input: Keyword (hashtag) of tweets
Output: Key graph (kgi) of keywords ki

1  BEGIN
2  FOR all tweets ti extracted in real time (within the specified time limit) for the corresponding hashtag do
3      Pull out words wi, nouns ni, noun phrases npi, adverbs ai, adjectives adi and named entities nei from ti as keywords ki
4  END FOR
5  FOR all keywords ki extracted from tweets ti do
6      Calculate the document frequency df over all tweets ti
7  END FOR
8  Remove keywords ki with low document frequency df
9  Construct the keyword ki co-occurrence graph cgi as follows:
       – Generate a single vertex vk for each keyword ki
       – Generate a single link li,j for every pair of co-occurring keywords ki and kj
10 END
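As a concrete illustration of Sect. 4.1, the Python sketch below builds the co-occurrence graph from a batch of pre-processed tweets (lists of keywords, as produced by `preprocess` above) and applies the two edge-filtering rules. The threshold values and helper names (`build_key_graph`, `MIN_DF`, etc.) are illustrative assumptions, not the exact values used in the experiments.

```python
from collections import Counter
from itertools import combinations
import networkx as nx

MIN_DF = 2            # illustrative thresholds, not the values used in the paper
MIN_COOC = 2
MIN_COND_PROB = 0.15

def build_key_graph(docs):
    """docs: list of keyword lists, one per tweet (e.g. the output of preprocess)."""
    # document frequency of each keyword
    df = Counter(k for doc in docs for k in set(doc))
    keep = {k for k, c in df.items() if c >= MIN_DF}       # drop low-DF keywords

    # co-occurrence counts over the surviving keywords
    cooc = Counter()
    for doc in docs:
        kws = sorted(set(doc) & keep)
        for x, y in combinations(kws, 2):
            cooc[(x, y)] += 1

    g = nx.Graph()
    g.add_nodes_from(keep)
    for (x, y), c in cooc.items():
        # rule 1: drop rare co-occurrences
        if c < MIN_COOC:
            continue
        # rule 2: drop edges whose conditional probabilities are both low,
        # estimated here as co-occurrence count / document frequency
        p_x_given_y = c / df[y]
        p_y_given_x = c / df[x]
        if p_x_given_y < MIN_COND_PROB and p_y_given_x < MIN_COND_PROB:
            continue
        g.add_edge(x, y, weight=c)
    return g
```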

4.2 Identifying Communities in Key Graph

The edges of the graph represent relationships between keywords, so the whole key graph can be viewed as a social network of keywords. Communities of keywords appear as highly dense subgraphs: within a community there are many edges, because there are many relationships between its keywords, while between different communities there are only a few links. The betweenness centrality (BC) score is used to find the edges between communities. The betweenness centrality score of an edge is the number of shortest paths, over all pairs of nodes in the graph, that pass through that edge. Edges between communities lie on many shortest paths, so they have a high betweenness centrality score. Since these edges connect different communities, removing the edges with high betweenness centrality scores leaves separate clusters of keywords, each representing a community. Each community is a hypothesis of an event. The process is iterative: in each iteration we remove the edges with the highest BC scores. To do this, we first calculate the BC scores by finding the shortest paths between all node pairs of the graph, using BFS. If two edges have the same BC score, the edge with the lower conditional probability is removed. If a removed edge connects two different communities and has a high conditional probability, it is duplicated in both communities before being removed. The BC scores are then recomputed for the next iteration, and the procedure repeats until no edges with a high score remain. Finally, we obtain clusters of keywords, each representing an event.

Algorithm 2 Identifying topic features

Input: Set of key graphs (kgi)
Output: Topics (Ti = t1, t2, t3, ..., tn) with feature sets (Fi = f1, f2, f3, ..., fm)

1  Separate the key graph (kgi) into communities coi of strongly related keywords ki (ki ∈ coi), which are treated as topics Ti
2  FOR all topics ti ∈ T do
3      Place all keywords ki in the related feature vector (FVi = fv1, fv2, fv3, ..., fvn) of the related keyword partition fp
4  END FOR
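As a sketch of this community-identification step, the fragment below applies the Girvan–Newman idea directly with networkx: it repeatedly removes the edge with the highest betweenness centrality until the key graph splits into a chosen number of keyword communities. The stopping rule (a fixed target number of communities) and the function name are illustrative assumptions; the conditional-probability tie-breaking and edge duplication described above are omitted for brevity.

```python
import networkx as nx

def split_into_communities(key_graph, max_communities=5):
    """Iteratively remove the edge with the highest betweenness centrality
    until the graph splits into the desired number of keyword communities.
    Each resulting connected component is one event hypothesis."""
    g = key_graph.copy()
    while nx.number_connected_components(g) < max_communities and g.number_of_edges() > 0:
        bc = nx.edge_betweenness_centrality(g)           # BC score of every edge
        edge, _ = max(bc.items(), key=lambda kv: kv[1])  # edge with the highest score
        g.remove_edge(*edge)
    return [sorted(c) for c in nx.connected_components(g)]

# Example usage on the key graph built in Sect. 4.1:
# communities = split_into_communities(build_key_graph(docs))
```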

4.3 Document Clustering Using K-Mean

Each group of keywords forms a synthetic document called a key document. The documents in the original collection that are similar to a key document can then be grouped with it, yielding a cluster of documents. To discover these document clusters, k-means with cosine similarity is used. Sometimes a key document contains only very common keywords, or only a few keywords; such a key document describes a general class of documents rather than a single event and attracts documents related to higher-level topics. For example, a key document containing only "Modi", "Rahul", "Votes" and "Election" indicates the Indian general elections in general, which is not a separate event. Such key documents can be identified from the similarity of the documents assigned to them: treating documents as data points, the variance of the documents related to such a key document is very large. We therefore calculate the variance of the documents for every key document and filter out the key documents with large variance, which makes it easier to retain the key documents that truly represent events. Although documents are allocated to clusters based on their cosine similarity to the key documents, some documents are filtered later to reduce noise: the similarity between each document and the centroid of its cluster is also calculated, and documents with low cosine similarity to the centroid are filtered out.

Algorithm 3 Document clustering (k-means with cosine similarity) using topics

Input: Topics (Ti = t1, t2, t3, ..., tn) with feature sets (Fi = f1, f2, f3, ..., fm)
Output: Topics (Tl = t1, t2, t3, ..., tm) with their clustered documents (cd1, cd2, cd3, ..., cdn)

1  FOR all topics ti do
2      Initialize the value of k and select the initial centroid values ci and cj
3      Calculate the similarity s of topic ti for document di:
       $$s(t_i, d_i) = \frac{\mathrm{cos}(d_i, FV_i)}{\sum_{t_i \in T} \mathrm{cos}(d_i, FV_i)}$$
4      Recalculate the centroid of each cluster until the centroid values no longer change
5  END FOR
6  FOR every pair of topics ti and tl do
7      IF topics ti and tl overlap THEN
8          Combine ti and tl into a new topic NT with FVi = fvti + fvtl
9      END IF
10 END FOR
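A minimal sketch of the document-clustering step is given below. Tweets and the synthetic key documents are represented as TF-IDF vectors and L2-normalized, so that standard (Euclidean) k-means behaves like clustering by cosine similarity, with the key-document vectors used as the initial centroids; documents with low cosine similarity to their cluster centroid are filtered out afterwards. The use of scikit-learn, the similarity threshold and the helper names are assumptions for illustration, and the variance-based filtering of overly general key documents described above is omitted.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import normalize
from sklearn.cluster import KMeans

def cluster_documents(tweets, keyword_communities, min_sim=0.2):
    """tweets: list of pre-processed tweet strings.
    keyword_communities: output of split_into_communities(); each community
    becomes one synthetic 'key document'."""
    key_docs = [" ".join(comm) for comm in keyword_communities]

    vectorizer = TfidfVectorizer()
    X = normalize(vectorizer.fit_transform(tweets + key_docs))  # unit-length rows

    n_topics = len(key_docs)
    # initialize the centroids with the key-document vectors themselves
    km = KMeans(n_clusters=n_topics, init=X[-n_topics:].toarray(), n_init=1)
    labels = km.fit_predict(X[:len(tweets)])

    clusters = {i: [] for i in range(n_topics)}
    centroids = normalize(km.cluster_centers_)
    for idx, lab in enumerate(labels):
        # cosine similarity of the tweet to its centroid (unit vectors: dot product)
        sim = X[idx].toarray().ravel() @ centroids[lab]
        if sim >= min_sim:            # drop noisy, low-similarity documents
            clusters[lab].append(tweets[idx])
    return clusters
```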

4.4 Example of Key Graph Construction, Identifying Topic Features and Document Clustering (K-means with Cosine Similarity) Using Topics

Consider the graph in Fig. 1: the nodes represent keywords and each edge represents the co-occurrence of the two keywords it connects. Assume all edge weights are equal to 1. Table 1 gives the betweenness centrality score calculated for every edge using the breadth-first search algorithm.

Fig. 1 Initial keyword graph (nodes labelled 1 to 10)
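Since the exact edge list of Fig. 1 is not reproduced here, the following short sketch only illustrates how a per-edge betweenness centrality table such as Table 1 can be computed; the 10-node edge list is an arbitrary illustrative stand-in for the figure.

```python
import networkx as nx

# Arbitrary 10-node example graph standing in for Fig. 1; all edge weights are 1.
edges = [(1, 2), (1, 3), (2, 3), (3, 4), (4, 5), (4, 6), (5, 6),
         (6, 7), (7, 8), (7, 9), (8, 9), (9, 10)]
g = nx.Graph(edges)

# Edge betweenness centrality: the number (here unnormalized) of shortest paths
# between all node pairs that pass through each edge, computed via BFS internally.
bc = nx.edge_betweenness_centrality(g, normalized=False)
for edge, score in sorted(bc.items(), key=lambda kv: -kv[1]):
    print(edge, score)
```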

Cosine Similarity

Suppose we have two documents, doc1 and doc2, and we need to determine the similarity between them. A document can be represented as a bag of terms, i.e. a long vector in which each attribute records the frequency of a particular term (such as a word, keyword or phrase) in the document. This gives two term-frequency vectors, d1 and d2; Table 2 shows the term frequencies of doc1 and doc2.

$$\mathrm{cos}\,\theta =\frac{d1\cdot d2}{\left|\left|d1\right|\right|\left|\left|d2\right|\right|}$$

Finding the similarity between documents d1 and d2, where d1 = (5, 0, 3, 0, 2, 0) and d2 = (3, 0, 2, 0, 1, 1):

  1. First, calculate the dot product of the two documents:

     $$d1\cdot d2=5*3+0*0+3*2+0*0+2*1+0*1=23$$

  2. Then calculate $\left|\left|d1\right|\right|$ and $\left|\left|d2\right|\right|$:

     $$\left|\left|d1\right|\right|=\sqrt{5*5+0*0+3*3+0*0+2*2+0*0}=6.164$$
     $$\left|\left|d2\right|\right|=\sqrt{3*3+0*0+2*2+0*0+1*1+1*1}=3.873$$

  3. Calculate the cosine similarity:

     $$\mathrm{cos}\left(d1,d2\right)=\frac{23}{6.164*3.873}=0.96$$
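The same computation can be checked in a few lines of Python with NumPy:

```python
import numpy as np

d1 = np.array([5, 0, 3, 0, 2, 0])
d2 = np.array([3, 0, 2, 0, 1, 1])

# cosine similarity = dot product divided by the product of the vector norms
cos_sim = d1 @ d2 / (np.linalg.norm(d1) * np.linalg.norm(d2))
print(round(cos_sim, 2))   # 0.96
```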

5 Experimental Setup and Results

We have applied the algorithms presented above to batches of tweets formed from the live tweets arriving via the Spark Streaming API for a given search keyword. The batch interval used for the experiment is 1 min, in which 80–90 tweets are collected in one Spark batch, so one iteration of the algorithm runs on approximately 80–90 tweets. In the first step of the algorithm, a key graph is formed, containing a node for each keyword and an edge between any two keywords that co-occur; here we obtained approximately 40 keywords (nodes). After that, clusters of keywords are formed from the key graph, and the tweets are then categorized into these keyword clusters using cosine similarity (Tables 1 and 2).

Table 1 Betweenness centrality score
Table 2 Calculation example

5.1 Batch-wise Key Graphs and Categorization of Tweets

In this section, we show the results obtained by running the above approach in batches. The algorithm is applied to each batch, and the results are therefore presented batch-wise.

Batch 1 Results Figure 2 shows the initial and final key graphs of batch 1. The initial graph is formed after pre-processing the tweets and filtering the keywords based on document frequency; the keywords are the nodes of the graph and the edges between them show their co-occurrence. The final key graph is formed after the keyword-clustering phase of the algorithm and contains edges only between nodes that belong to the same cluster. Table 3 shows the tweets belonging to each category; for example, the tweets listed under number 2 belong to category 2.

Fig. 2 Initial and final key graph of batch 1 results

Table 3 Categorization of tweets, batch 1

Batch 2 Results Similarly, for batch 2, Fig. 3 shows the initial and final key graphs, and Table 4 shows the categorization of tweets.

Fig. 3 Initial and final key graph of batch 2 results

Table 4 Categorization of tweets, batch 2

6 Conclusion and Future Work

In this paper, we have used a key graph based approach for clustering keywords to detect events from live tweets. Presently, the batch interval used is small because of the amount of memory Spark requires. Further work can therefore be done to run the algorithm on a distributed system, so that it can handle a larger dataset and produce better results. In addition, the complexity of the algorithm becomes too high when it is used for live processing, so another approach needs to be formulated to reduce the complexity.