1 Introduction

In this paper, we describe a simple methodology for analyzing a set of Twitter hashtags. The method focuses on the temporal aspects of the data, that is, on the dynamics of hashtags. We identify groups of hashtags that exhibit similar temporal patterns, look at their linguistic descriptions, and recognize the hashtags that are most representative of these groups as well as those that do not fit the groups well. The method is based on a fuzzy clustering process. Once the clusters are created, we examine them in detail and draw conclusions regarding variations of hashtags over time. Further, we construct fuzzy signatures of political parties based on an analysis of hashtags and noun-phrases extracted from a set of tweets associated with the US elections of 2012. We use the obtained signatures to analyze similarities between the issues and opinions important for each party.

The paper is organized as follows. Section 2 gives a brief introduction to the concepts of tweets, fuzzy sets, and fuzzy clustering. Section 3 describes the data used: hashtags collected from Hashtagify.me and tweets associated with the US elections of 2012. There, we focus on the analysis of hashtags: we provide some examples of hashtag popularity and describe the data pre-processing that leads to a representation of popularity changes. Section 4 contains the discussion and conclusion.

2 Hashtags and Clustering

2.1 Tweets and Hashtags

Twitter – one of the most popular online message systems – allows its users to post short messages called tweets. According to dictionary.com [14], the definition of a tweet is:

“… 2. (Digital Technology) a very short message posted on the Twitter website: the message may include text, keywords, mentions of specific users, links to websites, and links to images or videos on a website.”

The users posting these messages include special words in the text. These words – hashtags – are easily recognizable and play the role of “connectors” between messages. An informal definition of hashtags – obtained from Wikipedia [15] – is as follows:

“A hashtag is a type of label or metadata tag used on social network and microblogging services which makes it easier for users to find messages with a specific theme or content. Users create and use hashtags by placing the hash character (or number sign) # in front of a word or unspaced phrase, either in the main text of a message or at the end. Searching for that hashtag will then present each message that has been tagged with it.”

As can be inferred, hashtags carry considerable weight in marking and identifying the topics that users want to talk about or draw attention to. The spontaneous way in which hashtags are created (there are no restrictions on what a hashtag can be) is their crucial feature. It allows a true image to be formed of the users’ interests, the things important to them, and the things that draw their attention. As a result, any type of analysis of hashtag data could lead to a better understanding of the users’ attitudes, as well as to the detection of events, incidents, and calamities.

2.2 Fuzzy Sets

Fuzzy set theory [13] aims at handling imprecise and uncertain information in various domains. Let D represent a universe of discourse. A fuzzy set F with respect to D is defined by a membership function \( \mu_{F} : D \to [0,1] \), which assigns a membership degree \( \mu_{F}(d) \) to each \( d \in D \). This membership degree represents the level of belonging of d to F. A fuzzy set can be represented as a collection of pairs:

$$ F = \left\{ \frac{\mu_{F}(d_{1})}{d_{1}}, \frac{\mu_{F}(d_{2})}{d_{2}}, \ldots \right\} $$

For more information on fuzzy sets and systems, please consult [17, 18].
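As a minimal sketch of the pair representation, a fuzzy set over a finite universe can be stored as a mapping from elements to membership degrees. The helper names `make_fuzzy_set` and `membership`, as well as the hashtag degrees below, are hypothetical and serve only as an illustration.

```python
def make_fuzzy_set(memberships):
    """Validate and return a fuzzy set given as {element: membership degree}."""
    for element, degree in memberships.items():
        if not 0.0 <= degree <= 1.0:
            raise ValueError(f"membership of {element!r} must lie in [0, 1]")
    return dict(memberships)


def membership(fuzzy_set, element):
    """Return the membership degree; unlisted elements are treated as degree 0."""
    return fuzzy_set.get(element, 0.0)


# Hypothetical degrees of belonging to a fuzzy set of "rising" hashtags
rising = make_fuzzy_set({"#callmebaby": 0.9, "#NepalQuake": 0.8, "#iphone": 0.1})
```

Treating unlisted elements as having degree 0 is a design choice of this sketch; it keeps the stored mapping small when most elements of D do not belong to F at all.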

2.3 Fuzzy Clustering

One of the most popular methods of analysis of data focuses on identifying clusters of data-points that exhibit substantial levels of similarity. There are multiple methods of clustering data that differ in their ability to find data clusters, and their complexity [4, 6, 7, 12].

Among the many clustering algorithms, there are ones that utilize fuzzy methodology [1, 2, 5]. In such a case, clusters of data-points do not have sharp borders; data-points belong to clusters to a degree. In fuzzy terminology, we talk about a degree of belonging (membership) of a data-point to a given cluster. As a result, there are points that fully belong to a given cluster (membership value of 1), as well as points that belong to a cluster only to a degree (membership values between 0 and 1). Such an approach provides a more realistic segregation of data: very rarely do we deal with a situation in which everything is clear and the data can be divided into “clean” sets, i.e., sets whose points simply belong or do not belong to clusters.

The method used here is based on a fuzzy clustering method called FANNY [9]. The optimization is performed via minimizing the following objective function

$$ \sum\limits_{v = 1}^{k} {\frac{{\sum\nolimits_{i = 1}^{n} {\sum\nolimits_{j = 1}^{n} {\mu_{iv}^{r} \mu_{jv}^{r} d(i,j)} } }}{{2\sum\nolimits_{j = 1}^{n} {\mu_{jv}^{r} } }}} $$

where n is the number of data-points, k is the number of clusters, \( \mu_{iv} \) is the membership value of data-point i in cluster v, r is the membership exponent, and d(i,j) is the distance (or difference) between data-points i and j.
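The objective function can be evaluated directly from a membership matrix and a distance matrix. The sketch below only computes the value of the objective for given memberships; it does not perform the minimization that FANNY carries out. The function name `fanny_objective` and the default exponent r = 2 are assumptions made for this illustration.

```python
def fanny_objective(memberships, distances, r=2.0):
    """Evaluate the objective for given memberships.

    memberships[i][v] is the membership of data-point i in cluster v,
    distances[i][j] is d(i, j), and r is the membership exponent.
    """
    n = len(memberships)
    k = len(memberships[0])
    total = 0.0
    for v in range(k):
        # mu_{iv}^r for every data-point i
        weights = [memberships[i][v] ** r for i in range(n)]
        numerator = sum(
            weights[i] * weights[j] * distances[i][j]
            for i in range(n)
            for j in range(n)
        )
        denominator = 2.0 * sum(weights)
        if denominator > 0.0:  # an empty cluster contributes nothing
            total += numerator / denominator
    return total
```

For two data-points at distance 1, placing each point crisply in its own cluster gives an objective of 0, whereas uniform memberships of 0.5 give 0.25, which illustrates that the objective rewards grouping nearby points.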

The choice of this approach was dictated by the fact that we do not want to create fictitious cluster centers, as happens in the widely popular fuzzy clustering method FCM [3]. Additionally, a recent implementation of FANNY in the R programming language [15] is available and is used here.

2.4 Cluster Quality and Visualization

Clusters contain multiple data-points that are distributed in the space enclosed by the clusters’ boundaries. Some of these points lie well inside a cluster and have high membership values, while others lie close to the boundaries: they have small membership values and, at the same time, comparable membership values in other clusters. An interesting measure of cluster quality, i.e., of how well the data-points that belong to a cluster fit into it, is the silhouette width [11]. For a given element i from a cluster k, this measure is defined by the following ratio:

$$ s(i,k) = \frac{OUT(i) - IN(i,k)}{\max \left( OUT(i), IN(i,k) \right)} $$

with

$$ OUT(i) = \mathop {\hbox{min} }\nolimits_{j \ne k} (\frac{{\sum\limits_{m = 1}^{{N_{j} }} {d(i,m)} }}{{N_{j} }}),\quad \quad IN(i,k) = \frac{{\sum\limits_{m = 1}^{{N_{k} }} {d(i,m)} }}{{N_{k} }} $$

where d(i,m) is the distance (or difference) between data-points i and m, \( N_{k} \) is the size of cluster k, and \( N_{j} \) is the size of any other cluster j. Computing OUT(i) also identifies the cluster, other than k, that is closest to point i. The value of s(i,k) lies between −1 and 1, and positive values indicate good separation of clusters.
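The silhouette width can be computed as in the following sketch, which follows the two formulas above literally (in particular, IN(i,k) averages over all \( N_{k} \) members of cluster k, including i itself). The function name `silhouette_width` and the representation of clusters as a dictionary of index lists are assumptions; for fuzzy clusters, each point is assumed to have been assigned to the cluster in which it has the highest membership.

```python
def silhouette_width(i, k, clusters, distances):
    """Silhouette width s(i, k) of data-point i assigned to cluster k.

    clusters maps a cluster label to the list of its data-point indices;
    distances[i][m] is d(i, m). At least two clusters are required.
    """
    def mean_distance(members):
        return sum(distances[i][m] for m in members) / len(members)

    inside = mean_distance(clusters[k])  # IN(i, k)
    outside = min(                       # OUT(i): closest other cluster
        mean_distance(members)
        for label, members in clusters.items()
        if label != k
    )
    largest = max(outside, inside)
    return 0.0 if largest == 0.0 else (outside - inside) / largest
```

For four points located at 0, 1, 10, and 11 on a line and clustered as {0, 1} and {10, 11}, the point at 0 has IN = 0.5 and OUT = 10.5, which gives a silhouette width of 10/10.5.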

Visualizing multi-dimensional clusters is fairly difficult. A possible solution is to project the clusters onto selected dimensions, but then the issue is which dimensions to choose. In this paper, we use the approach introduced in [10]. This approach, called CLUSPLOT, is based on reducing the dimension of the data by principal component analysis [8]. Clusters are plotted in the coordinates of the first two principal components and are graphically represented as ellipses. To be precise, each cluster is drawn as a spanning ellipse, i.e., the smallest ellipse that covers all its elements.
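The projection step can be illustrated with the sketch below, which maps data-points onto their first two principal components. It is not the CLUSPLOT implementation: it finds the components by power iteration with deflation (a technique chosen here only to keep the example dependency-free), it omits the spanning ellipses, and the function name `project_to_two_components` is hypothetical. The data are assumed to have at least two dimensions.

```python
import math


def project_to_two_components(rows, iterations=200):
    """Project data-points (lists of floats) onto the first two principal components."""
    n, dim = len(rows), len(rows[0])
    means = [sum(row[c] for row in rows) / n for c in range(dim)]
    centered = [[row[c] - means[c] for c in range(dim)] for row in rows]
    # Sample covariance matrix of the centered data
    cov = [
        [sum(r[a] * r[b] for r in centered) / (n - 1) for b in range(dim)]
        for a in range(dim)
    ]
    components = []
    for _component in range(2):
        # Power iteration converges to the leading eigenvector of cov
        start = [float(c + 1) for c in range(dim)]
        length = math.sqrt(sum(x * x for x in start))
        vec = [x / length for x in start]
        for _step in range(iterations):
            nxt = [sum(cov[a][b] * vec[b] for b in range(dim)) for a in range(dim)]
            norm = math.sqrt(sum(x * x for x in nxt))
            if norm == 0.0:
                break
            vec = [x / norm for x in nxt]
        # Rayleigh quotient gives the eigenvalue of the found direction
        eigenvalue = sum(
            vec[a] * sum(cov[a][b] * vec[b] for b in range(dim)) for a in range(dim)
        )
        components.append(vec)
        # Deflation removes the found direction before the next search
        for a in range(dim):
            for b in range(dim):
                cov[a][b] -= eigenvalue * vec[a] * vec[b]
    return [
        [sum(r[c] * comp[c] for c in range(dim)) for comp in components]
        for r in centered
    ]
```

The sign of each component is arbitrary, so a plot based on these coordinates may appear mirrored relative to one produced by another PCA routine.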

3 Collected Data

3.1 Hashtag Data

The data analysis is performed on real data representing the popularity of hashtags. The data were obtained from the website Hashtagify.me and contain information about 40 different hashtags. The popularity ratings were obtained for a period of nine weeks. A sample of the data for a few selected hashtags is shown in Table 1.

Table 1. Popularity of selected hashtags

The values presented in Table 1 show the popularity (as % relative to other hashtags) for every week. For example, #KCA was the most popular hashtag for the first five weeks. However, after the fifth week, its popularity started decreasing. The hashtag #callmebaby did not even exist for the first few weeks, then rapidly gained popularity, and after two weeks its popularity has been around 85%. Very similar behavior can be observed for #NepalQuake. Its popularity in the last few weeks has been in the range from 53% to 72% [13, 16]. The hashtag #iphone, on the other hand, is characterized by a steady level of popularity, with some small fluctuations: 80 to 84%.

In order to analyze the behavioral patterns of hashtags, a simple processing of the data has also been performed. Here, we are interested in the changes in the popularity of hashtags, expressed in percentage terms. The data are presented in Table 2. The calculations have been done using a very simple formula [19, 20]:

$$ change_{zero\,vs\,N} = popularity_{week:zero} - popularity_{week:N} $$

Table 2. Changes in popularity of selected hashtags

The calculated change represents the difference between the popularity value for the current week, week:zero, and the popularity value for a considered week, week:N. For example, the value \( change_{zero\,vs\,nine} = - 24.1 \) means that the popularity of #KCA in week zero is 24.1 lower than its popularity in week nine.
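The change formula can be applied to a whole series of weekly values as in the sketch below. The function name `popularity_changes`, the ordering of the input (index 0 for week:zero, index N for week:N), and the numbers in the usage example are assumptions made for illustration; they are not values from Table 1 or Table 2.

```python
def popularity_changes(weekly_popularity):
    """Return {N: change_zero_vs_N} for a list of weekly popularity values.

    weekly_popularity[0] is the popularity in week:zero (the current week)
    and weekly_popularity[N] is the popularity in week:N.
    """
    current = weekly_popularity[0]
    return {
        n: round(current - past, 1)  # one decimal, as in the example value -24.1
        for n, past in enumerate(weekly_popularity)
        if n > 0
    }


# Hypothetical series: week:zero = 75.9, week:one = 80.0, week:two = 100.0
changes = popularity_changes([75.9, 80.0, 100.0])
```

A negative entry means that the hashtag is less popular in week:zero than it was in the compared week, so a series of increasingly negative values describes a hashtag that has been losing popularity.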

3.2 Presidential Election 2012 Data

The created data set focuses on elections in the United States. We have selected the elections of 2012, mainly because of their importance and scope. The elections were a very large event in US history. They consisted of the following elections: (1) the 57th presidential election; (2) Senate elections; and (3) House of Representatives elections.

The first step in collecting the tweets of party members was to create a list of the Twitter accounts of members of parliament and of the most important party members. Twitter has a feature called a Twitter list, which lets a user create a collection of Twitter accounts for people to follow. Almost all parties share lists of party members, parliament members, or party-related accounts. Such lists help promote the party’s ideas and make it easy to follow news related to the party. Some websites also offer such lists for individuals to follow. To create our own list for each party, we merge all accounts that appear in that party’s lists into one list. The resulting list is used to collect tweets.

The collection process has been done using the Twitter Search API. We have constructed a program, the twitter search collector, that periodically collects and stores tweets using eight different API keys. The program requests only tweets that have an ID higher than that of the last tweet collected during the previous run of the collector.
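The incremental collection logic described above can be sketched as follows. This is not the authors' collector: the function `make_collector`, the rotation of keys in round-robin order, and the `fetch(api_key, since_id)` callable are all hypothetical, and the actual call to the Twitter Search API is deliberately left to the caller.

```python
from itertools import cycle


def make_collector(fetch, api_keys):
    """Build a function that returns only tweets newer than those already seen.

    fetch(api_key, since_id) is a hypothetical callable that returns a list of
    tweets, each a dict with an integer "id"; api_keys are used in rotation.
    """
    keys = cycle(api_keys)
    state = {"last_id": 0}  # highest tweet ID collected so far

    def collect_once():
        batch = fetch(next(keys), state["last_id"])
        # Keep only tweets with an ID higher than the last one collected
        fresh = [tweet for tweet in batch if tweet["id"] > state["last_id"]]
        if fresh:
            state["last_id"] = max(tweet["id"] for tweet in fresh)
        return fresh

    return collect_once
```

Filtering on the client side as well as passing the last ID to `fetch` guards against storing duplicates if the underlying request returns tweets that were already collected.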

The details regarding the number of tweets associated with each party are presented in Table 3.

Table 3. Collected tweets statistics

4 Conclusion

The analysis of temporal aspects of hashtags presented here (their popularity over time and the changes in this popularity) is an attempt to look at the dynamic nature of user-generated data. The application of fuzzy clustering provides a number of benefits related to the fact that the categorization of hashtags is not crisp. Further investigation of fuzzy-based measures leads to interesting conclusions.

The construction of fuzzy signatures based on the frequency of occurrence of hashtags is an interesting approach to expressing the importance of opinions and issues represented via the hashtags and noun-phrases of tweets. A simple process of constructing such signatures is presented here. Once the signatures are obtained, they are used to compare the importance of the opinions and issues articulated by the groups of individuals that the signatures represent. These processes have been applied to tweets representing the US elections of 2012.