
1 Introduction

The advent of online social media and its continuously growing popularity have provided a new channel and arena for exchanging and sharing information [1, 2]. Online social media give people an open platform to share opinions, viewpoints, and information on any topic. Over the last few years, Twitter, a micro-blogging service, has become one of the most prominent channels for information dissemination and news. The exchange of messages on social media [3] increases considerably when an event occurs that is related to, for example, a social cause, a disaster, politics, or a particular person. Users sign in to their Twitter or other social media accounts either to spread information or to get updates about it. Because it is used by the public, Twitter has the potential to provide real-time information and can thus be used to analyze an ongoing situation. Content on Twitter supplies rich information about the activity that has occurred. However, such abundant information is often not trustworthy, since it may also contain fake information.

There has been a lot of interest in analyzing Twitter content [4], including, for instance, work on event detection, user selection, and classification of tweets. Besides knowing who is popular on Twitter, it may be useful to know what event caused the popularity. Such information tells users about the area in which a person is popular and about other attributes of the person, which can enhance users’ knowledge of famous personalities [5,6,7]. Moreover, a user may be interested in getting suggestions about people she may wish to follow on the basis of that area. Users would also like to be kept updated on ongoing events that may lead to the rise or fall of a known figure.

There are multiple sources of information, such as television, newspapers, social network sites, or word of mouth from friends and family [8]. A user may be interested in knowing who is on everybody’s mind in recent times and also the reason behind it. To achieve this, the currently popular persons are first obtained from the data. To find the reasons for popularity, the tweets are categorized, since a person may be popular for more than one reason at a time, although there would be one prime reason. By applying these techniques we can provide better information to the users.

The aim of this paper is to design a method for detecting the popularity of a person and the reason causing that popularity. We use the tweets of different users related to a particular person. We used the Twitter4j API in Java to collect the tweets, initially for user selection and later to get data about that user. The approach uses the nouns in a tweet as its keywords and combines tweets into a single reason when their match score is above some threshold. Tweets are assigned to a category (such as business, politics, or technology) by categorizing the keywords used in each tweet.

The paper is organized as follows. Section 2 describes the related work. Section 3 describes our method for popularity detection. Section 4 describes the implementation details and the results obtained by our method. Conclusions and future work are given in Sect. 5.

2 Related Work

A considerable amount of work has been done on classification of tweets, sentiment analysis, and detection of events from tweets. Different approaches have been proposed for sentiment analysis, finding sentiments in words, sentences, and topics. Some approaches use natural language processing, some use a pattern-based approach, and some rely on machine learning.

In [9], a technique is suggested for constructing a Key Graph from the keywords in the tweets in order to detect events. This approach depends on the interdependency between the keywords. The Key Graph comprises nodes and edges, where nodes correspond to keywords and an edge between two nodes represents the co-occurrence of the two keywords in a tweet. Clusters are created from the Key Graph by grouping nodes together using a community detection algorithm. In [10], the authors suggest an algorithm called NED (new event detection) to detect events. It consists of two subtasks, online and retrospective: online NED detects new events in a stream of text, while retrospective NED detects previously unidentified events.

Wavelet transformation is used for event detection in [11]. The problem of identifying events and their user-contributed social media documents has been treated as a clustering task, where documents have multiple features associated with domain-specific similarity metrics [12]; pheromone-based techniques have also been used [13,14,15]. A general online clustering framework suitable for the social media domain is proposed in [16]. Several techniques for learning a combination of the feature-specific similarity metrics are also given in [16]; these are used to indicate social media document similarity within the clustering framework, and the similarity metric learning technique is evaluated on two real-world datasets of social media event content.

Location is considered in [17] for every event, since incident location and event are strongly connected. The approach in [17] consists of the following steps. First, preprocessing is performed to remove stop words and irrelevant words. Second, clustering is done to automatically group the messages belonging to an event. Finally, a hotspot detection method is applied.

TwitInfo, a platform for exploring tweets regarding a particular topic, is presented in [18]. The user enters the keywords for an event, and TwitInfo provides the message frequency, a tweet map, related tweets, popular links [19, 20], and the overall sentiment of the event. The TwitInfo user interface contains the following elements: the user-defined name of the event with its keywords; a timeline whose y axis shows the volume of tweets; a map displaying the geolocation of tweets for the event; the current tweets of the selected event, colored red if the sentiment of the tweet is negative or blue if it is positive; and the aggregate sentiment of the currently selected event, shown as a pie chart.

The TwitterMonitor system presented in [21] detects real-time events in a defined time window. This is done in three steps. In the first step, bursty keywords are identified, i.e., keywords that occur at a very high rate compared to others. In the second step, the bursty keywords are grouped based on their occurrences. In the third and last step, additional information about the event is collected.

A news processing system for Twitter called TwitterStand is presented in [22]. Tweets are collected from 2000 handpicked users, called seeders. Seeders are mainly newspapers and television stations, because they are expected to publish news. Junk is then separated from news using a naïve Bayes classifier, and an online clustering algorithm called leader-follower clustering is used to cluster the tweets into events. A statistical method called MABED (mention-anomaly-based event detection) is proposed in [23]. The whole process of event detection is divided into three steps. In the first step, events are detected based on mention anomaly. Second, the words that best describe each event are selected. Third, duplicated events are deleted or merged, and a list of the top k events is generated.

In [24], a correlation between clustering and event detection is shown: an aggregate trend change is similar to event detection. To find popular events, the authors of [24] use algorithms based on community detection. In [26], the authors find clusters using hierarchical clustering of tweets along with dynamic cutting and rating of the resulting clusters; a similar technique has been applied in the systematic search for maximal length codes [27]. In [28], a technique for finding bursty words is used for detecting events, together with location recognition using modules.

In [25], it is stated that an event is associated not only with the message context but also with location information, since location is an important factor of an event. Localized events, such as an emergency or a public event, would be reported more accurately in the messages or tweets of users close to the event location than in those of other users. Hence such users can play the role of sensors (human sensors) for reporting an event.

A considerable amount of work has also been carried out in the field of sentiment analysis, which focuses on finding the sentiments in topics, sentences, and words. Various approaches have been suggested for sentiment analysis; these approaches make use of natural language processing, pattern-based processing, or machine learning.

In [29], the authors suggest a sentiment treebank approach to sentiment analysis that is based on a recursive neural network. It computes the parent node vectors in a bottom-up manner using a composition function, and uses each node vector as the features for that node. In [30], an approach is suggested for finding the sentiment score of short, informal text and of sentences that contain phrases.

Two methods for classifying Twitter trending topics are proposed in [31]: the first is based on textual information and the other on the network structure. In the text-based model, all hyperlinks are removed from the tweet, and then a tokenizer removes stop words and delimiter characters. Since a tweet is limited to 140 characters, people use acronyms for words, so a vocabulary containing the full forms of these acronyms is used (e.g., BR is used to represent best regards). The network-based approach uses a similarity model: for a trending topic, say X, it searches for five topics that are similar to X and computes the similarity index [5].

Most of the above works deal with sentiments, recommendation systems, trending topics, the temporal context of messages, and classification of tweets. However, these works do not discuss the rising or decreasing popularity of a person and the reasons behind it. Our approach differs from the others in that we first look for the popular person and then let users know the reason behind the popularity.

3 Proposed Methodology

To extract a popular person from tweets, we find the names of persons in the tweets and store the tweet count corresponding to each person. To find the reasons behind the popularity of a person, we use the keywords of the tweets corresponding to that person.

3.1 Architecture

Figure 1 shows the basic flow diagram of our method. First, we download tweets of different users from different countries and look for the person who has been most talked about in those tweets. Then we fetch the tweets about that specific person from our database. To detect the reason for popularity, we divide each tweet related to that person into keywords and hashtags. Keywords in a tweet are names of things (e.g., the name of a person or the name of a city). A hashtag is written as the symbol # followed by some meaningful word, such as ‘Olympics2016’. If two tweets have the same hashtag, the tweets are related and can be merged into one single tweet.

Fig. 1. Overview of the proposed method

For each tweet, we first check its hashtag against the events that have already been found. Then we pass the keywords of the tweet, together with the keywords of each event already found, to a function called similarity. Similarity is computed as the number of common keywords divided by the total number of distinct keywords. The tweet is added to the event for which the similarity is maximum, provided that this similarity is greater than the threshold. This procedure is performed for all tweets. In the end, we obtain the main reasons behind the popularity of the person. We then classify the tweets about that person to show the interest of users towards the popular person, that is, what general users think about that person. Here, a user is a Twitter user whose tweets are downloaded from Twitter.
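The similarity measure described above (common keywords divided by distinct keywords) corresponds to the Jaccard index of two keyword sets. The following is a minimal Python sketch of it, not the implementation used in this work; the helper name best_event, the representation of events as plain keyword sets, and the sample keywords are assumptions made for illustration.

```python
def similarity(tweet_keywords, event_keywords):
    """Common keywords divided by the number of distinct keywords overall."""
    a, b = set(tweet_keywords), set(event_keywords)
    union = a | b
    if not union:
        return 0.0  # two empty keyword sets share nothing
    return len(a & b) / len(union)


def best_event(tweet_keywords, events, m):
    """Return the index of the most similar event, or None.

    `events` is a list of keyword sets (an assumed representation).
    An index is returned only if the best score is greater than m.
    """
    best_idx, best_score = None, 0.0
    for idx, event_keywords in enumerate(events):
        score = similarity(tweet_keywords, event_keywords)
        if score > best_score:
            best_idx, best_score = idx, score
    return best_idx if best_score > m else None


# Illustrative, made-up keyword sets.
events = [{"olympics", "rio"}, {"trump", "election", "rally"}]
tweet = {"trump", "election", "debate"}
print(similarity(tweet, events[1]))      # 2 common / 4 distinct = 0.5
print(best_event(tweet, events, m=0.4))  # 1
```

With a stricter threshold such as m = 0.6, the same tweet would match no existing event in this example.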

3.2 Data Collection

We collected 218,490 tweets from 5 different countries between September 2016 and November 2016 using the Twitter4j API [33]. Tweets were downloaded by specifying the latitude and longitude values of the countries. We also took news channels (CanadaNews, bbcnews) into consideration because news channels are reliable sources of data and produce more data than ordinary Twitter users.

3.3 Extraction of Names of Persons from Tweet

We used the Stanford Named Entity Recognition (NER) tagger [32] for extracting the names of persons from tweets. NER labels sequences of words in a text that are names of things, such as the name of a person, a company, or a place. Every tweet is passed through the NER tagger, which returns the names of things found in the tweet. We store only the names of persons, and for this we used a hash function.
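To illustrate the step of keeping only person names, the sketch below post-processes tagger output in Python. It does not call the Stanford tagger itself; it assumes the tagger output is already available as (token, label) pairs with labels such as PERSON, LOCATION, and O, and the function name person_names and the sample tokens are hypothetical.

```python
def person_names(tagged_tokens):
    """Join runs of consecutive PERSON-tagged tokens into full names.

    `tagged_tokens` is assumed to be a list of (token, label) pairs,
    as produced by an NER tagger run over one tweet.
    """
    names, current = [], []
    for token, label in tagged_tokens:
        if label == "PERSON":
            current.append(token)
        elif current:
            # A non-person token closes the name collected so far.
            names.append(" ".join(current))
            current = []
    if current:
        names.append(" ".join(current))
    return names


# Made-up tagger output for one tweet.
tagged = [
    ("Malcolm", "PERSON"), ("Turnbull", "PERSON"), ("arrives", "O"),
    ("in", "O"), ("Sydney", "LOCATION"), ("with", "O"), ("Obama", "PERSON"),
]
print(person_names(tagged))  # ['Malcolm Turnbull', 'Obama']
```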

3.4 Fetching Top k Popular Persons

For storing the name of a person and the number of occurrences of the name, we use a hash table named h_table that has two fields: key and value. In the key field, we store the person name; in the value field, a tuple <tweet_id, count_name>. If the name of a person does not yet exist in the table, the count for the person is set to 1 and the corresponding tweet id is added. Otherwise, the count is incremented by one and the tweet id field is updated.
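The update rule for h_table can be sketched in Python with a dictionary. This is an illustration under stated assumptions rather than the actual implementation: the tweet id field is assumed to hold a list of all tweet ids seen for the name, the input is assumed to be a sequence of (tweet_id, person_name) pairs, and the function names build_h_table and top_k are hypothetical.

```python
import heapq


def build_h_table(S):
    """Map each person name to (list of tweet ids, occurrence count)."""
    h_table = {}
    for tweet_id, name in S:
        if name not in h_table:
            # First occurrence: count is 1 and the tweet id is stored.
            h_table[name] = ([tweet_id], 1)
        else:
            # Later occurrence: append the tweet id, increment the count.
            tweet_ids, count = h_table[name]
            tweet_ids.append(tweet_id)
            h_table[name] = (tweet_ids, count + 1)
    return h_table


def top_k(h_table, k):
    """Return the k (name, count) pairs with the highest counts."""
    counts = ((name, count) for name, (_, count) in h_table.items())
    return heapq.nlargest(k, counts, key=lambda item: item[1])


# Made-up (tweet_id, person) pairs.
S = [(1, "A"), (2, "B"), (3, "A"), (4, "C"), (5, "A"), (6, "B")]
h_table = build_h_table(S)
print(h_table["A"])       # ([1, 3, 5], 3)
print(top_k(h_table, 2))  # [('A', 3), ('B', 2)]
```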

3.5 Find the Basis of Popularity

For storing hashtags and keywords together with the corresponding tweet ids and the count of each reason for popularity, we use a hash table named H_table that has two fields: key and value. In the key field, we store a tuple <hashtag, keywords>; in the value field, a tuple <tweet_id, count_reason>.

The following symbols are used in the algorithm.

  • S: set of all tweets storing tweet ids along with person mentioned in tweet.

  • R: set of all reasons related to popular person. Initially this set is empty.

  • PT: set of all the tweets of popular persons along with their keywords and hashtags.

  • h_table: a hash table that is initially empty.

  • H_table: a hash table that is initially empty.

  • P: set of popular persons.

  • m: threshold value, 0 < m < 1.

    figure a
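As a rough illustration of how the symbols PT, R, and m could fit together with the hashtag check and the similarity test of Sect. 3.1, consider the Python sketch below. It is not the algorithm listing of this paper. Several details are assumptions: each reason is held in a small record (the class Reason plays the role of an H_table entry), a tweet that matches no existing reason starts a new one, a matched tweet contributes its keywords and hashtags to the reason, and PT is a sequence of (tweet_id, hashtags, keywords) triples.

```python
from dataclasses import dataclass, field


@dataclass
class Reason:
    """One reason for popularity (assumed stand-in for an H_table entry)."""
    hashtags: set = field(default_factory=set)
    keywords: set = field(default_factory=set)
    tweet_ids: list = field(default_factory=list)

    @property
    def count_reason(self):
        return len(self.tweet_ids)


def similarity(a, b):
    """Common keywords divided by the number of distinct keywords."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def group_reasons(PT, m):
    """Group tweets into reasons; return R sorted by tweet count."""
    R = []
    for tweet_id, hashtags, keywords in PT:
        hashtags, keywords = set(hashtags), set(keywords)
        # Step 1: a shared hashtag links the tweet to an existing reason.
        target = next((r for r in R if r.hashtags & hashtags), None)
        if target is None:
            # Step 2: otherwise take the most similar reason above m.
            best, best_score = None, 0.0
            for r in R:
                score = similarity(keywords, r.keywords)
                if score > best_score:
                    best, best_score = r, score
            if best_score > m:
                target = best
        if target is None:
            # Assumption: an unmatched tweet starts a new reason.
            target = Reason()
            R.append(target)
        target.hashtags |= hashtags
        target.keywords |= keywords  # assumption: keywords are merged
        target.tweet_ids.append(tweet_id)
    return sorted(R, key=lambda r: r.count_reason, reverse=True)


# Made-up tweets: (tweet_id, hashtags, keywords).
PT = [
    (1, {"#debate"}, {"trump", "clinton", "debate"}),
    (2, {"#debate"}, {"trump", "moderator"}),
    (3, set(), {"trump", "clinton", "debate", "stage"}),
    (4, set(), {"trump", "hotel", "opening"}),
]
for reason in group_reasons(PT, m=0.3):
    print(reason.tweet_ids, reason.count_reason)
```

In this example, tweets 1 to 3 end up in one reason (via the shared hashtag and a similarity of 0.6), while tweet 4 forms a separate reason because its similarity to the first reason is below the threshold.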

4 Implementation and Results

To implement the algorithm, we collected 218,490 tweets from 5 different countries using the Twitter API. First, a user provides the value of n, i.e., the number of top popular persons to report from the downloaded tweets. Table 1 shows the output when a user provides the value n = 4.

Table 1. Top n (n = 4) popular persons and their tweet count

Once the user gets the top n popular persons, she can select any one person from the results to get more details. On selecting a person, the interface shows all the tweets about that person. The user can get more information about the person from these tweets. Figure 2 shows the output of selecting one person.

Fig. 2. Tweets corresponding to the selected popular person

For the selected person, the reasons for popularity are given in Table 2. The table lists the person name, all the popularity reasons, and the corresponding tweet counts.

Table 2. Reasons of popularity of the selected person

The pie chart in Fig. 3 shows users’ interest towards the selected popular person (Donald Trump). Since a large percentage of tweets are related to politics, this indicates that users are showing interest in political aspects of the person.

Fig. 3. Pie chart representing classification of tweets according to users’ interest toward popular person Donald Trump
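The category shares plotted in such a pie chart can be derived by assigning each tweet to a category through its keywords. The Python sketch below shows one possible way to do this and is only an illustration: the lexicon CATEGORY_KEYWORDS, the fallback category "other", the tie-breaking by lexicon order, and the sample tweets are all assumptions and are not taken from the experiments reported here.

```python
from collections import Counter

# Hypothetical keyword lexicon; a real one would be far larger.
CATEGORY_KEYWORDS = {
    "politics": {"election", "vote", "campaign", "senate"},
    "business": {"hotel", "company", "stock"},
    "technology": {"app", "software"},
}


def classify(keywords):
    """Pick the category sharing the most keywords with the tweet."""
    keywords = set(keywords)
    scores = {
        category: len(keywords & vocabulary)
        for category, vocabulary in CATEGORY_KEYWORDS.items()
    }
    best = max(scores, key=scores.get)  # ties go to the first category
    return best if scores[best] > 0 else "other"


def category_shares(tweets_keywords):
    """Percentage of tweets per category, as plotted in a pie chart."""
    counts = Counter(classify(keywords) for keywords in tweets_keywords)
    total = sum(counts.values())
    return {category: 100.0 * n / total for category, n in counts.items()}


# Made-up keyword sets for four tweets.
tweets = [{"election", "vote"}, {"campaign"}, {"hotel", "company"}, {"weather"}]
print(category_shares(tweets))
# {'politics': 50.0, 'business': 25.0, 'other': 25.0}
```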

We can also compare users’ views of two different popular persons. Figure 4 shows users’ views of Donald Trump and Malcolm Turnbull. From this figure we can conclude that, in politics, users are more interested in Trump than in Turnbull.

Fig. 4. Comparison between the tweets related to Trump and Turnbull

5 Conclusion and Future Work

In this paper, we suggested an approach for finding the popular person in a set of gathered tweets and the reason behind the popularity of that person. In our approach, we first look for the names mentioned in the tweets, and the name that occurs with the highest frequency is suggested as the most popular person. To find the reason behind the popularity of the person, we developed an algorithm that looks for possible events in the tweets. For the implementation, we used data sets from different time frames to showcase the output, and the results obtained are very encouraging. In future, we would like to extend our system to compare the most popular persons with each other and to check whether they are interconnected by the same reason.