
1 Introduction

Twitter has become one of the most prominent platforms for news delivery and social interaction on the internet, and the number of its users keeps growing rapidly. A large amount of data in the form of tweets, images, videos and animations is generated very frequently, and handling demand at this scale requires dedicated technology. Social media platforms such as Facebook, Instagram and YouTube have likewise changed the traditional way people connect over the internet. Since Twitter is one of the most popular of these platforms, we use it for event detection. Tweets about any local or global event are written by users within seconds, so natural events such as earthquakes and disease outbreaks can be detected effectively by performing event detection on Twitter data [2]. Twitter makes it easy to figure out what is happening right now: tweets are real-time data covering different subjects from various sources, and detecting emerging events is most valuable when it is performed on such real-time data [9]. Different people are interested in different types of news and events. Some want to know about local events [18], while business-oriented organizations are interested in promoting their products and services to prospective customers [4, 5, 13]; event detection can fulfil both requirements. It can also help detect natural disasters and warn people faster than any other medium [7], and it is useful in crime detection, for example for bomb blasts or terrorist attacks.

Event detection on Twitter is, however, hampered by a large amount of meaningless data [15]. The data generated on Twitter is not of high quality, so highly scalable and efficient techniques are needed to pre-process it and make it usable for event detection. Users prefer to write short tweets and generally use distorted words [12]. Different events involve different numbers of participants, different time periods and different relationships [14]. Real-time data collection is itself a major challenge, because a large amount of data is created on Twitter at every moment, yet real-time event detection has a wide range of applications that current technology can support. In the present paper, we propose an approach for detecting events in live Twitter streams. A key graph is constructed using a TF-IDF scheme; communities are then detected in the key graph using the betweenness centrality score, and finally clusters are formed using k-means with cosine similarity. The rest of this paper is structured as follows: Sect. 2 introduces the most relevant related work. Section 3 describes the dataset used. Section 4 provides a detailed description of the proposed approach. The experimental setup and results are discussed in Sect. 5. Section 6 concludes the paper and outlines future work.

2 Related Work

New event detection models typically run a single-pass incremental clustering algorithm: when a new document arrives, its similarity to each known event is computed and the event with the maximum similarity is selected. If this similarity exceeds a threshold, the document is assigned to that known event; otherwise it is considered a new event. A key graph based approach was used by Hassan Sayyadi, Matthew Hurst and Alexey Markov [6]. They built a key graph based on keyword co-occurrence and used a betweenness centrality algorithm for clustering keywords. Their algorithm counts all keywords of a single community together, even though a subgroup of the keywords may be more appropriate. A modified version of TF-IDF was developed by Allen et al. [1], in which the similarity threshold is additionally penalized by the time distance between the event and the document. Calculating IDF in an online clustering algorithm requires knowledge of future document features; the same method is used by Sayyadi et al. in [14]. Allen et al. [1] used a supplementary data set to estimate IDF, whereas Yang et al. [16] proposed an incremental IDF factor. Yang et al. based the similarity between documents and events on their time difference, together with a time window and a decay factor. To extract past events, Yang et al. [16] put forward an agglomerative clustering method, GAC (augmented Group Average Clustering), and used iterative bucketing and re-clustering to maintain cluster quality. A probabilistic model was put forward by Li et al. (2005) for retrospective event detection (RED), in which the Expectation Maximization (EM) algorithm is used to maximize the log-likelihood of the distributions. Such an algorithm requires the number of events to be known in advance, which is not feasible in practice, so Li et al. approximated the event count from the article count–time distribution [11]. Event detection models usually use similar algorithms but many different document representations, distance or similarity metrics, and clustering algorithms, as described in the literature [3, 8, 10, 17].

3 Dataset Used

For this paper, we collected tweets from Twitter based on the search keyword (hashtag) "Facebook". The collected data is fed to the Spark client, where it is processed in batches. The batch interval used is 1 min, during which 80–90 tweets are collected in one Spark batch, so one iteration of our algorithm runs on approximately 80–90 tweets.

4 Proposed Approach

We collect the data from Twitter in the form of batches in real time using the Spark Streaming API. Pre-processing is performed on each batch of data, which includes tokenization, removal of stop words and stemming. Our approach comprises the following phases: (1) construction of the key graph; (2) identifying communities in the key graph; (3) document clustering using k-means with cosine similarity.
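As an illustration of this pipeline, the following is a minimal Python sketch of the ingestion and pre-processing step. It assumes the classic DStream-based Spark Streaming API with a 1-minute batch interval (as in Sect. 3) and that a separate Twitter-API client pushes raw tweets line by line to a local TCP socket; the socket address, the NLTK-based pre-processing and the helper names (`preprocess`, `process_batch`) are illustrative assumptions, not the exact implementation used in the experiments.

```python
# Minimal sketch of ingestion and pre-processing, assuming tweets are pushed
# line-by-line to a local TCP socket by a separate Twitter-API client.
# Requires: nltk.download('stopwords')
import re
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

STOP = set(stopwords.words("english"))
stemmer = PorterStemmer()

def preprocess(tweet):
    """Tokenize a raw tweet, drop stop words and short tokens, and stem."""
    tokens = re.findall(r"[a-z]+", tweet.lower())
    return [stemmer.stem(t) for t in tokens if t not in STOP and len(t) > 2]

def process_batch(time, rdd):
    # each element of `docs` is the keyword list of one tweet; the key graph
    # construction of Sect. 4.1 would be applied to this batch here
    docs = rdd.map(preprocess).filter(lambda toks: toks).collect()
    print(time, "tweets in batch:", len(docs))

sc = SparkContext(appName="TwitterEventDetection")
ssc = StreamingContext(sc, batchDuration=60)      # 1-minute batches, as in Sect. 3
tweets = ssc.socketTextStream("localhost", 9009)  # illustrative host/port
tweets.foreachRDD(process_batch)
ssc.start()
ssc.awaitTermination()
```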

4.1 Construction of Key Graph

The key graph is built from the pre-processed data. We first calculate the term frequency (TF), document frequency (DF) and inverse document frequency (IDF) of the keywords. Keywords with low DF are removed because they are not useful. The key graph is then constructed by taking each remaining keyword as a node. Co-occurrence of keywords represents the relationship between them: an edge is created between two keywords that co-occur in the same document. Some of these edges are noise, and they are removed according to the following two rules:

  1. If the co-occurrence count of two keywords is below a specified threshold value, then the edge between these two keywords is removed.

  2. For an edge between nodes 'X' and 'Y', the conditional probability of seeing 'X' in a document given that 'Y' is present, and the conditional probability of seeing 'Y' in a document given that 'X' is present, are both calculated; if both are smaller than the threshold, then the edge is removed.

Algorithm 1 summarizes the construction of the key graph; a code sketch is given after it.

Algorithm 1 Construction of key graph from tweets

Input: Keyword (hashtag) of tweets
Output: Key graph (kgi) of keywords ki

1  BEGIN
2  FOR all tweets ti extracted in real time (within the specified time limit) for the corresponding hashtag do
3      Pull out words wi, nouns ni, noun phrases npi, adverbs ai, adjectives adi and named entities nei from ti as keywords ki
4  END FOR
5  FOR all keywords ki extracted from tweets ti do
6      Calculate the document frequency df over all tweets ti
7  END FOR
8  Remove keywords ki with low document frequency df
9  Construct the keyword ki co-occurrence graph cgi as follows:
       – Generate a single vertex vk for each keyword ki
       – Generate a single link li,j for every pair of co-occurring keywords ki and kj
10 END
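As a concrete illustration of Sect. 4.1, the Python sketch below builds the co-occurrence graph from a batch of pre-processed tweets (lists of keywords, as produced by `preprocess` above) and applies the two edge-filtering rules. The threshold values and helper names (`build_key_graph`, `MIN_DF`, etc.) are illustrative assumptions, not the exact values used in the experiments.

```python
from collections import Counter
from itertools import combinations
import networkx as nx

MIN_DF = 2            # illustrative thresholds, not the values used in the paper
MIN_COOC = 2
MIN_COND_PROB = 0.15

def build_key_graph(docs):
    """docs: list of keyword lists, one per tweet (e.g. the output of preprocess)."""
    # document frequency of each keyword
    df = Counter(k for doc in docs for k in set(doc))
    keep = {k for k, c in df.items() if c >= MIN_DF}       # drop low-DF keywords

    # co-occurrence counts over the surviving keywords
    cooc = Counter()
    for doc in docs:
        kws = sorted(set(doc) & keep)
        for x, y in combinations(kws, 2):
            cooc[(x, y)] += 1

    g = nx.Graph()
    g.add_nodes_from(keep)
    for (x, y), c in cooc.items():
        # rule 1: drop rare co-occurrences
        if c < MIN_COOC:
            continue
        # rule 2: drop edges whose conditional probabilities are both low,
        # estimated here as co-occurrence count / document frequency
        p_x_given_y = c / df[y]
        p_y_given_x = c / df[x]
        if p_x_given_y < MIN_COND_PROB and p_y_given_x < MIN_COND_PROB:
            continue
        g.add_edge(x, y, weight=c)
    return g
```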

4.2 Identifying Communities in Key Graph

The edges of the graph represent relationships between keywords, so the whole key graph can be viewed as a social network of keywords. Communities of keywords appear as highly dense subgraphs: within a community there are many edges, because there are many relationships between its keywords, while between different communities there are only a few links. The betweenness centrality (BC) score is used to find the edges between communities. The betweenness centrality score of an edge is the number of shortest paths, over all pairs of nodes in the graph, that pass through that edge. Edges between communities lie on many shortest paths, so they have a high betweenness centrality score. Since these edges connect different communities, removing the edges with high betweenness centrality scores leaves separate clusters of keywords, each representing a community. Each community is a hypothesis of an event. The process is iterative: in each iteration we remove the edges with the highest BC scores. To do this, we first calculate the BC scores by finding the shortest paths between all node pairs of the graph, using BFS. If two edges have the same BC score, the edge with the lower conditional probability is removed. If a removed edge connects two different communities and has a high conditional probability, it is duplicated in both communities before being removed. The BC scores are then recomputed for the next iteration, and the procedure repeats until no edges with a high score remain. Finally, we obtain clusters of keywords, each representing an event.

Algorithm 2 Identifying topic features

Input: Set of key graphs (kgi)
Output: Topics (Ti = t1, t2, t3, ..., tn) with feature sets (Fi = f1, f2, f3, ..., fm)

1  Separate the key graph (kgi) into communities coi of strongly related keywords ki (ki ∈ coi), which are treated as topics Ti
2  FOR all topics ti ∈ T do
3      Place all keywords ki in the related feature vector (FVi = fv1, fv2, fv3, ..., fvn) of the related keyword partition fp
4  END FOR
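As a sketch of this community-identification step, the fragment below applies the Girvan–Newman idea directly with networkx: it repeatedly removes the edge with the highest betweenness centrality until the key graph splits into a chosen number of keyword communities. The stopping rule (a fixed target number of communities) and the function name are illustrative assumptions; the conditional-probability tie-breaking and edge duplication described above are omitted for brevity.

```python
import networkx as nx

def split_into_communities(key_graph, max_communities=5):
    """Iteratively remove the edge with the highest betweenness centrality
    until the graph splits into the desired number of keyword communities.
    Each resulting connected component is one event hypothesis."""
    g = key_graph.copy()
    while nx.number_connected_components(g) < max_communities and g.number_of_edges() > 0:
        bc = nx.edge_betweenness_centrality(g)           # BC score of every edge
        edge, _ = max(bc.items(), key=lambda kv: kv[1])  # edge with the highest score
        g.remove_edge(*edge)
    return [sorted(c) for c in nx.connected_components(g)]

# Example usage on the key graph built in Sect. 4.1:
# communities = split_into_communities(build_key_graph(docs))
```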

4.3 Document Clustering Using K-Mean

Each group of keywords forms a synthetic document called a key document. The documents in the original collection that are similar to a key document can then be grouped with it, yielding a cluster of documents. To discover these document clusters, k-means with cosine similarity is used. Sometimes a key document contains only very common keywords, or only a few keywords; such a key document describes a general class of documents rather than a single event and attracts documents related to higher-level topics. For example, a key document containing only "Modi", "Rahul", "Votes" and "Election" indicates the Indian general elections in general, which is not a separate event. Such key documents can be identified from the similarity of the documents assigned to them: treating documents as data points, the variance of the documents related to such a key document is very large. We therefore calculate the variance of the documents for every key document and filter out the key documents with large variance, which makes it easier to retain the key documents that truly represent events. Although documents are allocated to clusters based on their cosine similarity to the key documents, some documents are filtered later to reduce noise: the similarity between each document and the centroid of its cluster is also calculated, and documents with low cosine similarity to the centroid are filtered out.

Algorithm 3 Document clustering (k-means with cosine similarity) using topics

Input: Topics (Ti = t1, t2, t3, ..., tn) with feature sets (Fi = f1, f2, f3, ..., fm)
Output: Topics (Tl = t1, t2, t3, ..., tm) with their clustered documents (cd1, cd2, cd3, ..., cdn)

1  FOR all topics ti do
2      Initialize the value of k and select the initial centroid values ci and cj
3      Calculate the similarity s of topic ti for document di:
       $$s(t_i, d_i) = \frac{\mathrm{cos}(d_i, FV_i)}{\sum_{t_i \in T} \mathrm{cos}(d_i, FV_i)}$$
4      Recalculate the centroid of each cluster until the centroid values no longer change
5  END FOR
6  FOR every pair of topics ti and tl do
7      IF topics ti and tl overlap THEN
8          Combine ti and tl into a new topic NT with FVi = fvti + fvtl
9      END IF
10 END FOR
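A minimal sketch of the document-clustering step is given below. Tweets and the synthetic key documents are represented as TF-IDF vectors and L2-normalized, so that standard (Euclidean) k-means behaves like clustering by cosine similarity, with the key-document vectors used as the initial centroids; documents with low cosine similarity to their cluster centroid are filtered out afterwards. The use of scikit-learn, the similarity threshold and the helper names are assumptions for illustration, and the variance-based filtering of overly general key documents described above is omitted.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import normalize
from sklearn.cluster import KMeans

def cluster_documents(tweets, keyword_communities, min_sim=0.2):
    """tweets: list of pre-processed tweet strings.
    keyword_communities: output of split_into_communities(); each community
    becomes one synthetic 'key document'."""
    key_docs = [" ".join(comm) for comm in keyword_communities]

    vectorizer = TfidfVectorizer()
    X = normalize(vectorizer.fit_transform(tweets + key_docs))  # unit-length rows

    n_topics = len(key_docs)
    # initialize the centroids with the key-document vectors themselves
    km = KMeans(n_clusters=n_topics, init=X[-n_topics:].toarray(), n_init=1)
    labels = km.fit_predict(X[:len(tweets)])

    clusters = {i: [] for i in range(n_topics)}
    centroids = normalize(km.cluster_centers_)
    for idx, lab in enumerate(labels):
        # cosine similarity of the tweet to its centroid (unit vectors: dot product)
        sim = X[idx].toarray().ravel() @ centroids[lab]
        if sim >= min_sim:            # drop noisy, low-similarity documents
            clusters[lab].append(tweets[idx])
    return clusters
```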

4.4 Example of Key Graph Construction, Identifying Topic Features and Document Clustering (K-means with Cosine Similarity) Using Topics

Consider the graph in Fig. 1: the nodes represent keywords and each edge represents the co-occurrence of the two keywords it connects. Assume all edge weights are equal to 1. Table 1 gives the betweenness centrality score calculated for every edge using the breadth-first search algorithm.

Fig. 1 Initial keyword graph (nodes labelled 1 to 10)
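Since the exact edge list of Fig. 1 is not reproduced here, the following short sketch only illustrates how a per-edge betweenness centrality table such as Table 1 can be computed; the 10-node edge list is an arbitrary illustrative stand-in for the figure.

```python
import networkx as nx

# Arbitrary 10-node example graph standing in for Fig. 1; all edge weights are 1.
edges = [(1, 2), (1, 3), (2, 3), (3, 4), (4, 5), (4, 6), (5, 6),
         (6, 7), (7, 8), (7, 9), (8, 9), (9, 10)]
g = nx.Graph(edges)

# Edge betweenness centrality: the number (here unnormalized) of shortest paths
# between all node pairs that pass through each edge, computed via BFS internally.
bc = nx.edge_betweenness_centrality(g, normalized=False)
for edge, score in sorted(bc.items(), key=lambda kv: -kv[1]):
    print(edge, score)
```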

Cosine Similarity

Suppose we have two documents, doc1 and doc2, and we need to determine the similarity between them. A document can be represented as a bag of terms, i.e. a long vector in which each attribute records the frequency of a particular term (such as a word, keyword or phrase) in the document. This gives two term-frequency vectors, d1 and d2; Table 2 shows the term frequencies of doc1 and doc2.

$$\mathrm{cos}\,\theta =\frac{d1\cdot d2}{\left|\left|d1\right|\right|\left|\left|d2\right|\right|}$$

Finding the similarity between documents d1 and d2, where d1 = (5, 0, 3, 0, 2, 0) and d2 = (3, 0, 2, 0, 1, 1):

  1. First, calculate the dot product of the two documents:

     $$d1\cdot d2=5*3+0*0+3*2+0*0+2*1+0*1=23$$

  2. Then calculate $\left|\left|d1\right|\right|$ and $\left|\left|d2\right|\right|$:

     $$\left|\left|d1\right|\right|=\sqrt{5*5+0*0+3*3+0*0+2*2+0*0}=6.164$$
     $$\left|\left|d2\right|\right|=\sqrt{3*3+0*0+2*2+0*0+1*1+1*1}=3.873$$

  3. Calculate the cosine similarity:

     $$\mathrm{cos}\left(d1,d2\right)=\frac{23}{6.164*3.873}=0.96$$
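The same computation can be checked in a few lines of Python with NumPy:

```python
import numpy as np

d1 = np.array([5, 0, 3, 0, 2, 0])
d2 = np.array([3, 0, 2, 0, 1, 1])

# cosine similarity = dot product divided by the product of the vector norms
cos_sim = d1 @ d2 / (np.linalg.norm(d1) * np.linalg.norm(d2))
print(round(cos_sim, 2))   # 0.96
```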

5 Experimental Setup and Results

We have applied the algorithms presented above to batches of tweets formed from the live tweets arriving via the Spark Streaming API for a given search keyword. The batch interval used for the experiment is 1 min, in which 80–90 tweets are collected in one Spark batch, so one iteration of the algorithm runs on approximately 80–90 tweets. In the first step of the algorithm, a key graph is formed, containing a node for each keyword and an edge between any two keywords that co-occur; here we obtained approximately 40 keywords (nodes). After that, clusters of keywords are formed from the key graph, and the tweets are then categorized into these keyword clusters using cosine similarity (Tables 1 and 2).

Table 1 Betweenness centrality score
Table 2 Calculation example

5.1 Batch-wise Key Graphs and Categorization of Tweets

In this section, we show the results obtained by running the above approach in batches. The algorithm is applied to each batch, and the results are therefore presented batch-wise.

Batch 1 Results Figure 2 shows the initial and final key graphs of batch 1. The initial graph is formed after pre-processing the tweets and filtering the keywords based on document frequency; the keywords are the nodes of the graph and the edges between them show their co-occurrence. The final key graph is formed after the keyword-clustering phase of the algorithm and contains edges only between nodes that belong to the same cluster. Table 3 shows the tweets belonging to each category; for example, the tweets listed under number 2 belong to category 2.

Fig. 2 Initial and final key graph of batch 1 results

Table 3 Categorization of tweets, batch 1

Batch 2 Results Similarly, for batch 2, Fig. 3 shows the initial and final key graphs, and Table 4 shows the categorization of tweets.

Fig. 3 Initial and final key graph of batch 2 results

Table 4 Categorization of tweets, batch 2

6 Conclusion and Future Work

In this paper, we have used a key graph based approach for clustering keywords to detect events from live tweets. Presently, the batch interval used is small because of the amount of memory Spark requires. Further work can therefore be done to run the algorithm on a distributed system, so that it can handle a larger dataset and produce better results. In addition, the complexity of the algorithm becomes too high when it is used for live processing, so another approach needs to be formulated to reduce the complexity.