Keywords

1 Introduction

Twitter is a popular web microblogging and social networking service, on which users interact and share information with short messages (tweets). A twitter hashtag is a meta-tag that the user creates for classifying tweets based on their meanings and subjects. For example, tweets tagged with the hashtag #Jobs are related to job opportunities. This high-level summary/label information explicitly provided by users makes hashtag promising in finding dynamic relationships of tweets [2]. Hashtag clustering that finds semantically related hashtags can be used for event discovery [13, 16], opinion extraction and synonym detection.

Most existing methods discover hashtag clusters using the tweet texts [13, 14]. These text-based methods represent hashtags using explicit words in tweet texts. For example, with the bag-of-words (BoW) representation [13], one hashtag is represented by the multiplicity of the co-occurring words in the tweets. However, the short tweet texts may not reveal the whole hashtag relationships. First, many tweets may report the same event but use totally different words. Second, hashtag co-occurrence is not always reliable since many tweets have no more than two hashtags. Affected by these problems, text-based methods only can extract part of the information revealed by hashtags. The recent method [12] uses the temporal pattern of hashtags related to known events, i.e., time series that record the frequencies of hashtag usages changing with the time, to discover other related hashtags. Inspired by this, we use hashtag time series (which show temporal pattern) to cluster hashtags, because the frequencies of certain hashtag usages are highly correlated with the popularity of the corresponding events/topics. For example, in Fig. 1, the hashtag time series of #Obama and #Nobel can help discover the event that Obama was awarded the Nobel Peace Prize.

Fig. 1.
figure 1

The similar time series of #Obama and #Nobel can help detect the event that Obama was awarded the Nobel Peace Prize on 9/10/2009.

In this paper, we propose a novel clustering hashtags by their temporal patterns (CHTP) method. In CHTP, hashtags are represented as hashtag time series, so, clusters can be discovered by clustering hashtag time series. However, clustering hashtag time series faces two challenges. First, it is impractical to manually determine the number of hashtag clusters, because a Twitter dataset includes world-wide events/topics. Second, these hashtag clusters are naturally shaped and follow no specific distributions. Therefore, density-based time series clustering is preferred in CHTP since it can find arbitrarily shaped clusters. Many density-based clustering methods require single distance thresholds to define density of data objects. Unfortunately, specifying a suitable distance threshold is difficult because there are clusters of different density levels in the complex Twitter datasets. Thus, we propose a novel parameter-free Density-Sensitive Clustering (DSC) algorithm for CHTP to find hashtag clusters of different density levels. DSC is a density-based recursive partition process, and it discovers a hashtag time series group as a hashtag cluster if it cannot be split into multiple subgroups in lower recursions. Comprehensive experiments have been conducted on Twitter datasets to demonstrate the effectiveness of the proposed CHTP method, which uses DSC for hashtag clustering. Therefore, the contributions of this paper are listed as follows:

  1. 1)

    We provide a novel CHTP method for hashtag clustering, which groups hashtags using the common temporal patterns, rather than the tweet texts.

  2. 2)

    We develop a new parameter-free Density-Sensitive Clustering (DSC) algorithm for CHTP to discover hashtag clusters of different densities.

  3. 3)

    Comprehensive experiments conducted with four Twitter datasets show that averagely 36% of meaningful hashtag clusters discovered by CHTP (using DSC) cannot be found by text-based approaches; and the proposed DSC algorithm is more effective than the counterpart algorithms to find hashtag clusters.

The rest of this paper is organized as follows. In Sect. 2, we review the related work. We detail the CHTP method in Sect. 3, and then develop the DSC algorithm in Sect. 4. The proposed method is evaluated in Sect. 5. The paper is concluded in Sect. 6.

2 Related Work

Hashtag is widely used in tweets and can represent summary information of tweets to extract useful information. Most hashtag clustering methods discover hashtag clusters by the tweet texts tagged with the hashtags (text-based methods). SMSC [13] uses the bag-of-words (BoW) representation of co-occurred tweet texts with Kmeans to find hashtag clusters. Meanwhile, topic model is used in HGTM [14] to cluster hashtags by analyzing a hashtag graph built with the co-occurrence information of hashtags. The method in [7] further integrates lexical and contextual text information to improve clustering performance. In addition to tweet texts, external knowledge, i.e. the semantics of hashtags obtained from WordNet [8] or Wikipedia [6], is utilized to improve the accuracy of hashtag clustering.

Other than the text-based methods, SAX* [12] uses temporal similarity as the semantic relatedness for hashtags, and it detects hashtag clusters with string patterns deduced from external sources (e.g. Wikipedia) [9]. We extend this idea and discover hashtag clusters from their temporal patterns by clustering hashtag time series. We briefly review the recent time series clustering methods, upon which the proposed method is built developed. Partition-based time series clustering, such as kshape [10] and KSC [15], find cluster representatives and minimize the distances of time series to nearest representatives. However, they adopt spherical-shape clusters that are sensitive to outliers and noise in the complex Twitter datasets. Model-based methods. e.g. Gaussian Mixture Model [4] and Gaussian Inverse Covariance [5], cluster time series by the optimizations of specific models, but a model that well-explain the large Twitter datasets may have impractical complexity. Density-based clustering methods are preferred since they can find natural-shaped clusters. For example, YADING [3] hierarchically adopts DBSCAN to find time series clusters of different densities; however, the inflection points (on the distance-to-nearest-neighbours curve) YADING used to determine densities levels are not significant in the sparse Twitter datasets. TADPole [1] groups time series into clusters using the density and the distance by DPC [11]; but that makes TADPole hard to find clusters of different density levels since a global distance threshold is applied in the entire dataset.

3 Problem Definition

In this section, we present the CHTP method, and it comprises two steps:

  1. 1)

    Represent hashtags as hashtag time series. (see Sect. 3.1)

  2. 2)

    Discover hashtag clusters by clustering hashtag time series. (see Sect. 3.2)

Fig. 2.
figure 2

Four hashtag time series (as shown in (a)) obtained from a Twitter corpus [15]. (b) shows single-linkage clustering with Euclidean distance on hashtag time series (top) and hashtag BoW representation (bottom), respectively.

3.1 Hashtag Time Series

CHTP discovers hashtag clusters by hashtag temporal patterns, and each hashtag is represented as a time series denoted as \(Z=\{z_{1},z_{2},...,z_{m}\}\), where \(z_{j}\) is the frequency of a hashtag at time bin j (for example, one hour [15]). A hashtag time series dataset DS is a collection of hashtag time series denoted as \(DS = \{Z_{1},Z_{2},...Z_{n}\}\). Each hashtag time series is preprocessed to be scale-invariant by applying z-normalization as follows:

$$\begin{aligned} x_{j}=\frac{z_{j}-\mu }{\delta }\ (1\le j\le m), \end{aligned}$$
(1)

where \(\mu =\frac{1}{m}\sum z_{j}\) and \(\delta =\sqrt{\frac{\sum (z_{j}\,-\,\mu )^{2}}{m\,-\,1}}\). The distance of two normalized hashtag time series, X and Y, is measured by Euclidean distance (ED(XY)).

We show four hashtag time series as an example in Fig. 2 (a). By applying single-linkage clustering with Euclidean distance, two clusters are correctly discovered (the top in Fig. 2 (b)). However, with text-based hashtag BoW representation, unrelated hashtags are grouped together (the bottom in Fig. 2 (b)).

Fig. 3.
figure 3

The depth-first recursive partition of example hashtag time series.

3.2 Clustering by Hashtag Time Series

Specifying the number or the shape of clusters in a Twitter dataset is difficult since tweets posted world-widely cover enormous events/topics, and thus density-based clustering that discovers natural-shaped clusters is favored. Existing density-based time series clustering algorithms, such as TADPole [1], face a major challenge to cluster hashtag time series, i.e. the global distance threshold they used to define density cannot differentiate density levels. That may undermine the clustering accuracy, because a large distance threshold may group irrelevant hashtags into clusters; while a small threshold cannot discover sparse clusters with small densities. Therefore, CHTP demands a density-based clustering algorithm that is adaptable to different densities of hashtag clusters.

4 The DSC Algorithm

In this section, we present the DSC algorithm used in CHTP to cluster hashtag time series. DSC discovers clusters of different density levels by adaptive distance thresholds. In general, DSC partitions the dataset recursively from the coarse-grained to the fine-grained (in a depth-first manner), with adaptive distance thresholds, and clusters of different density levels are discovered as \(full\ pure\ group\)s. A \(full\ pure\ group\) contains highly correlated time series that are less relevant with time series in other \(full\ pure\ group\)s. Before explaining \(full\ pure\ group\), we first define \(pure\ group\).

Definition 1

\(Pure\ group\) (PG): a group of time series that cannot be split into multiple subgroups by density, i.e., each partition produces at most one subgroup (contains more than one time series) during the recursive partitioning.

Examples of PG are shown in Fig. 3, which records the recursive partition process of six hashtags. Apparently, groups with 2 or 3 time series are PGs. Based on the definition, the partitioned subgroups of a PG are also PGs but locates at different hierarchies on the partition tree. PGs on the top hierarchies (as \(full\ pure\ group\)) are regarded as the expected clusters.

Definition 2

\(Full\ pure\ group\) (FPG): \(full\ pure\ group\) is the \(pure\ group\) of the top hierarchy, i.e., \(full\ pure\ group\) must be partitioned from a non-PG.

Based on Definition 2, a subgroup of time series (SG) is a FPG only if 1) SG is a PG and 2) S (the group that SG is partitioned from) is not a PG (see the examples in Fig. 3). DSC discovers clusters as FPGs by analyzing the recursive process that partitions the dataset, and now we detail the DSC algorithm.

4.1 Data Partition

In DSC, a recursive DataPartition algorithm is used to partition the dataset (DS) by density, and the input dataset for each partition process in the recursion is denoted as S. DataPartition has two components, i.e. the group forming function (FormGroup) that partitions S by forming subgroups and the cluster detector that indicates whether a subgroup is a FPG. FormGroup forms subgroups by density, and one time series, X, finds a set of neighbours (\(N_{X}\)) as follows:

$$\begin{aligned} N_{X} = \{Y: ED(X,Y)<d_{c}, Y \in S\}, \end{aligned}$$
(2)

where \(d_{c}\) is the adaptive distance threshold (will be discussed later). The density of X, \(\rho _{X}\), is calculated as follows:

$$\begin{aligned} \rho _{X} = \sum \left( 1 - \frac{ ED(X,Y)^{2}}{d_{c}^{2}}\right) , Y \in N_{X}. \end{aligned}$$
(3)

We borrow the idea of DPC [11] to group time series by density, that is, X is density connected to a specific neighbour (\(n_{X}\)) as follows:

(4)

X is the centre (C) of a group if \(n_{X} = null\), which means X has the local maximum density among its neighbours. Groups are developed by two steps: 1) each C is assigned a unique group label; 2) starting from centers, the group labels are spread from \(n_{X}\) to X in the decreasing order of density. After all time series acquire group labels from relative centers, S is partitioned into subgroups (SG) for further partitioning.

4.2 Adaptive Distance Threshold

We especially expect the adaptive \(d_{c}\) to satisfy the following two requirements:

  1. 1)

    FormGroup can always partition S into subgroups or individuals, which ensures the termination of the recursive partition.

  2. 2)

    Partitions are conducted from the coarse-grained to the fine-grained. This ensures sparse clusters are not omitted.

We use Minimum Spanning Tree (MST) built with S to find the adaptive \(d_{c}\). Nodes of MST are the time series in S and the edges are the time series distances. Then, \(d_{c}\) of FormGroup is assigned as the largest edge on the MST. We show that \(d_{c}\) satisfies the first requirement by proving that at least two time series are split after partitioning. Assume \(E_{ij}=ED(X_{i},X_{j})\) is the longest edge of MST built with S, and we assign \(d_{c}=E_{ij}\) for FormGroup.

figure a

Lemma 1

\(X_{i}\) and \(X_{j}\) do not stay in the same subgroup after S is partitioned by FormGroup, with \(d_{c}=E_{ij}\).

Proof

MST is split into \(LMST=\{X_{l_{1}},...,X_{l_{n}}\}\) and \(RMST=\{X_{r_{1}},...,X_{r_{n}}\}\) after removing \(E_{ij}\), and \(X_{i}\in LMST\) and \(X_{j}\in RMST\). Now assume \(X_{i}\) and \(X_{j}\) still stay in the same subgroup. Since \(d_{c}=E_{ij}\), \(X_{j}\notin N_{X_{i}}\), therefore, \(X_{i}\) and \(X_{j}\) must be connected through \(\{X_{\eta _{0}}=X_{i},X_{\eta _{1}},...,X_{\eta _{k-1}},X_{\eta _{k}}=X_{j}\}\), in which \(E_{\eta _{t-1},\eta _{t}}<d_{c}, \forall t\in \{1,...,k\}\). Since \(X_{l}\in LMST\) and \(X_{r}\in RMST\), \(\exists t\in \{1,...,k\}\ s.t.\ X_{\eta _{t-1}}\in LMST\) and \(X_{\eta _{t}}\in RMST\), and thus the spanning tree by connecting LMST with RMST with \(E_{\eta _{t-1}\eta _{t}}\) ( \(E_{ij}\)) has a smaller weights than MST, which is impossible and the assumption is wrong. It is proved.

Now we show that the second requirement is also satisfied due to Lemma 2.

Lemma 2

The largest edge on MST of SG (one subgroup of S grouped by FormGroup) is always smaller than the largest edge on MST of S.

Proof

MST\(_{SG}\) (the MST of SG) comprises several subtrees, i.e., MST\(_{SG}=\{ST_{1},...,ST_{k}\}\) (\(ST_{i}\in \) MST\(_{SG}\) and \(ST_{i}\cap ST_{j}=\emptyset , \forall i,j \in [1,k]\)). In MST\(_{SG}\), \(\{ST_{1},...,ST_{k}\}\) are connected with the \(k-1\) edges among them. Assume the largest edge on MST\(_{SG}\) (\(E_{ab}\), which is also \(d_{c}^{SG}\)) connects \(ST_{a}\) and \(ST_{b}\), thus \(E_{ab}=\min _{X_{l}\in ST_{a},X_{r}\in ST_{b}}ED(X_{l},X_{r})\). Since SG is grouped by FormGroup on S with \(d_{c}^{S}\) (the largest edge on MST\(_{S}\)), \(\exists ED(X_{l},X_{r})<d_{c}^{S}\ s.t.\ X_{l}\in ST_{a},X_{r}\in ST_{b}\). Therefore, \(d_{c}^{SG}=E_{ab}\le ED(X_{l},X_{r})<d_{c}^{S}\). It is proved.

The pseudo-code of DataPartition is shown in Algorithm 1. \(d_{c}\) is assigned as the longest edge on MST at lines 1–2. S is partitioned at line 3, and the obtained subgroups are further partitioned at line 8. The clusters are discovered as FPGs is shown at lines 8–10.

4.3 Complexity Analysis

The complexity of DataPartition for the initial dataset (DS) is \(O(n\log {n}+2n^{2})\), including building MST from DS for \(O(n^{2})\) and adopting FormGroup for \(O(n\log {n}+n^{2})\). Meanwhile, the recursion depth (\(\gamma \)) is usually much smaller than n, because FormGroup can partition the dataset efficiently with a well-designed \(d_{c}\). Therefore, the complexity of DSC is \(O(\gamma n\log {n}+\gamma n^{2})\).

5 Evaluation

In this section, we evaluate the performance of the proposed CHTP method (hashtag time series clustering by DSC) and the DSC clustering algorithm by answering the following two questions:

Table 1. Statistics of the 4 Twitter datasets.
  1. 1)

    Q1: Can DSC find hashtag clusters more effectively than the counterpart TADPole and YADING? (Sect. 5.1)

  2. 2)

    Q2: Is it necessary to use CHTP to cluster hashtag when the text-based approaches exist? (Sect. 5.2)

All algorithms are implemented with python 2.7, and the experiments are run on a Windows 10 platform with 3.4 GHz CPU and 16 GB RAM.

Datasets: We use a Twitter corpus [15] that comprises tweets posted from August to November 2009, for our experiments. These tweets are split into 4 subsets by the months they were posted. In each subset, we select important hashtags (with frequency larger than 1000) along with the tweets they appeared to generate the corresponding dataset. The time bin is specified as one hour following [15]. The statistics of the datasets are shown in Table 1.

Counterparts for DSC: TADPole [1] and YADING [3], two density-based time series algorithms as discussed in Sect. 2, are compared with DSC in experiment 1 (see Sect. 5.1) to answer Q1. To be fair, distance measurements used are unified as Euclidean distance in TADPole, YADING and DSC. Counterparts for CHTP: We explicitly choose SMSC [13], which adopts the hashtag BoW (text based) and Kmeans, as the counterpart of CHTP in experiment 2 to answer Q2. We also use the BoW representation with DSC, i.e., TextDSC, to directly compare with CHTP. Besides DSC, TextDSC and YADING, which are parameter-free, the results of TADPole and SMSC are obtained under the optimal parameters that maximize Silhouette Coefficient. Clusters that contain multiple hashtags are used for comparison.

Table 2. Clustering accuracy.
Fig. 4.
figure 4

Density distribution of hashtag clusters discovered by DSC, TADPole and YADING, respectively.

5.1 Accuracy and Effectiveness of DSC on Hashtag Time Series

In this experiment, we compare DSC with TADPole and YADING for clustering hashtag time series. We evaluate the clustering accuracy by validating each hashtag cluster by its contained hashtags since manually labeling the hashtags is impractical. A cluster is valid only if the contained hashtags are synonyms/abbreviations or represent an event/topic (see Fig. 1) in the search results from Google. The accuracy of hashtag clustering is measured by \(F_{1}\) score.

The clustering accuracy results are shown in Table 2. DSC, TADPole and YADING achieve fair clustering results in the datasets and the average accuracy are 0.77, 0.63 and 0.53, respectively. Specifically, DSC out-performs TADPole and YADING in all the 4 datasets, and the average improvements of accuracy to TADPole and YADING are around 22% and 45%, respectively.

Fig. 5.
figure 5

Relationships of hashtag clusters discovered by CHTP, SMSC and TextDSC, and shaded area indicates similar clusters shared by multiple methods.

We compare the densities of valid hashtag clusters discovered by DSC, TADPole and YADING, with the density of a cluster being the average nearest-neighbour-distance of the contained hashtags. The larger the average nearest-neighbour-distance, the smaller the density, and the results are shown in Fig. 4. Generally, DSC finds more valid hashtag clusters than TADPole and YADING in all the 4 Twitter datasets, and YADING discovers the least valid hashtag clusters, which results in its low recall.

Fig. 6.
figure 6

Examples of events (a), opinions (b) and synonyms (c) discovered by CHTP, but SMSC and TextDSC fail to find. The right column shows the nearest hashtags with hashtag BoW representation.

5.2 CHTP vs. Text-Based Approaches

In this experiment, we analyze the significance of hashtag clustering by temporal pattern through showing whether the text-based methods (TextDSC and SMSC) can discover similar valid hashtag clusters as CHTP. We regard two valid hashtag clusters (discovered by different methods) are similar if they shares more than three fourth hashtags. Then we summarize the relationship of valid hashtag clusters discovered by CHTP, TextDSC and SMSC as shown in Fig. 5.

The results show that, in all the 4 Twitter datasets, around 36% of the valid hashtag clusters discovered by CHTP cannot be discovered by TextDSC/SMSC, even when using the same clustering algorithm (DSC). This result suggests that the hashtag temporal pattern partially represent distinctive correlations of hashtags compared with the tweet texts, and CHTP can supplement hashtag clustering that only uses tweet texts. In contrast, TextDSC and SMSC share many hashtag clusters since they use the same clue, i.e., the tweet texts.

We show some examples of hashtag clusters only discovered by CHTP to further understand the hashtag clusters discovered by temporal patterns in Fig. 6. That includes two events discovered only by CHTP (Fig. 6 (a)), two examples of extracted opinion (Fig. 6 (a)) and two examples of discovered synonyms (Fig. 6 (c)). To compare with the text-based BoW representation, we show the three most similar hashtags (with BoW) of the hashtags in the right-side column of Fig. 6. The result shows that hashtags having similar temporal patterns are not quite similar in their BoW representations, and that is the reason that TextDSC/SMSC fails to discover the hashtag clusters.

6 Conclusion

In this paper, we propose a novel CHTP method to discover hashtag clusters by hashtag temporal patterns. CHTP represents hashtags as hashtag time series and uses the proposed DSC algorithm (which can discover clusters of different density levels) to effectively cluster hashtag time series. Experiments conducted on Twitter datasets show that DSC is more effective in discovering hashtag clusters than two counterpart algorithms, and CHTP (uses DSC) can discover 36% hashtag clusters that cannot be discovered by the text-based approaches. Therefore, we conclude that by using the temporal pattern of hashtags, a more complete understanding of the relationship of hashtags can be obtained.