
1 Introduction

Over the past several years, real-time social networks such as microblogs have become a powerful platform on which people vent their moods, share opinions, and even disseminate emerging news. Many scholars and researchers are drawn to detecting the underlying information, such as hot topics, on these online communication platforms because of how deeply interwoven they are with users' daily lives.

On microblogs, a topic is usually defined as an event on which people publicly focus, and hot topics are highly condensed summaries of enormous numbers of microblogs [1]. Detecting microblog hot topics is challenging because microblog posts are generally much shorter than conventional documents. Traditional topical clustering techniques based on lexical overlap are therefore weak, and it is difficult to construct an effective microblog representation model with traditional methods such as the Vector Space Model (VSM) [2], Latent Semantic Analysis (LSA) [3], or Latent Dirichlet Allocation (LDA) [4]. Many studies attempt to overcome the deficiencies of VSM [5, 6]. Tang et al. [7] devise an approach that enriches microblog representation by employing machine translation to increase the number of features drawn from different languages. Cheng et al. [8] propose a coupled term-term relation model for text representation, which considers both the intra-relation and inter-relation between a pair of words, but computing these word relationships is expensive. A probabilistic correlation-based similarity measure [9] can be introduced to avoid this complex calculation.

Most researchers prefer clustering methods when detecting hot topics in microblogs. Since it was proposed by Kennedy et al. [10], Particle Swarm Optimization (PSO) has been used to solve many clustering problems. Zhao et al. [11] find that PSO has clear advantages in both running time and complexity. Omran et al. [12] apply the PSO algorithm to image clustering. There are two types of clustering validation: external and internal. Traditional external validations [13], such as F-measure and information entropy, are often adopted to evaluate the quality of the final clusters, while internal validations, such as the Global Silhouette (GS) coefficient and the Expected Density Measure (EDM), not only evaluate a clustering's quality effectively but can also, thanks to their feedback characteristic, optimize the clustering result during the process. Building on this research, Leticia et al. [14, 15] present an efficient discrete PSO approach to cluster short texts and find that GS is more suitable than EDM as the fitness function in that algorithm.

Inspired by these observations, we present a hot topic detection approach for microblogs based on discrete PSO. First, we construct a new representation model for microblogs. Then, to reduce the time cost of discrete PSO, we propose a method to check whether particles of different forms represent the same clustering result. GS, which serves as the fitness function that guides the clustering toward a better result, is improved with the probabilistic correlation-based similarity measure mentioned above. Finally, we compare our algorithm with other classical methods and demonstrate its superiority.

The remainder of the paper is organized as follows. In Sect. 2, we describe our representation model for microblogs. Section 3 introduces the discrete PSO used in this work. In Sect. 4, experiments are conducted to show the effectiveness of this work from different angles. Section 5 concludes and discusses future work.

2 Representation Model for Microblogs

Though detecting microblog hot topics is a new domain of computer science research, it can be viewed as an instance of mining information from numerous short texts. Constructing an effective microblog representation model is one of the most significant steps for clustering, because the characteristics of microblogs may hinder the application of conventional text mining algorithms. In this section, a probabilistic correlation-based similarity measure is adopted to calculate the similarity between words. Furthermore, the intra-correlation and inter-correlation are utilized to construct the representation model and to calculate the similarity between microblogs.

2.1 Probabilistic Correlation Definition

Bag of words (BOW) is a classical model that maps documents to a matrix under the assumption that the words in a text are independent of each other, so the correlations between words are ignored. In practice, word correlations do exist and should not be ignored when detecting hot topics. Considering the conditional probability of word co-occurrence, we therefore develop a probabilistic term correlation model.

First, we assume that words occurring in the same microblog have latent relationships, and we model the probability of such a relationship with the conditional probability defined as follows

$$ Pr(w_i \mid w_j) = \frac{Pr(w_i w_j)}{Pr(w_j)} = \frac{df(w_i w_j)/N}{df(w_j)/N} , $$
(1)

where $Pr(w_i w_j)$ and $df(w_i w_j)$ respectively denote the probability that, and the number of microblogs in which, words $w_i$ and $w_j$ occur together, while $Pr(w_i)$ and $df(w_i)$ respectively denote the probability that, and the number of microblogs in which, word $w_i$ appears; N is the total number of microblogs in the corpus.

To ensure that the similarity between microblogs is symmetric, the probabilistic correlation between words is defined as

$$ cor(w_i, w_j) = Pr(w_i \mid w_j) \cdot Pr(w_j \mid w_i) . $$
(2)

The value of $cor(w_i, w_j)$ lies in the range [0, 1] and grows with the co-occurrence frequency of $w_i$ and $w_j$. When $w_i$ and $w_j$ co-occur in every microblog that contains either of them, $cor(w_i, w_j) = 1$.
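As a concrete illustration (ours, not code from the paper), the following Python sketch estimates the correlation of Eq. 2 from a tokenized corpus; note that the corpus size N in Eq. 1 cancels out once the two conditional probabilities are multiplied.

from itertools import combinations

def word_correlations(microblogs):
    """Estimate cor(w_i, w_j) of Eqs. 1-2 from tokenized microblogs."""
    df, df_pair = {}, {}
    for words in map(set, microblogs):
        for w in words:
            df[w] = df.get(w, 0) + 1
        for wi, wj in combinations(sorted(words), 2):
            df_pair[(wi, wj)] = df_pair.get((wi, wj), 0) + 1
    # Pr(wi|wj) * Pr(wj|wi) = (df(wi wj)/df(wj)) * (df(wi wj)/df(wi)).
    return {(wi, wj): (co / df[wj]) * (co / df[wi])
            for (wi, wj), co in df_pair.items()}

corpus = [["love", "flowers"], ["love", "chocolates"], ["flowers", "love"]]
print(word_correlations(corpus))
# {('flowers', 'love'): 0.667, ('chocolates', 'love'): 0.333}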

2.2 The Representation Model for Microblogs

To construct the representation model for microblogs, a microblog-word matrix $S_{N \times M}$ must first be initialized, where M is the number of distinct words occurring in the corpus and N denotes the total number of microblogs. Unlike traditional initialization methods such as tf-idf, we present a novel approach that captures both the intra-relation (explicit) and the inter-relation (implicit).

Definition 1

(intra-relation). Given a microblog $mb_i$ with words $\{w_{i1}, w_{i2}, \cdots, w_{iM}\}$, each word in $mb_i$ has an intra-relation with the other words in the same microblog; e.g., as shown in Fig. 1, love and flowers have an intra-relation in $mb_1$.

Fig. 1. One specific example of intra-relation and inter-relation; love is the linking word.

Definition 2

(inter-relation). Given two microblogs, $mb_i$ with words $\{w_{i1}, w_{i2}, \cdots, w_{iM}\}$ and $mb_n$ with words $\{w_{n1}, w_{n2}, \cdots, w_{nM}\}$, if one or more words appear in both $mb_i$ and $mb_n$, these words are called linking words, and each word in $mb_i$ has an inter-relation with the words in $mb_n$. E.g., in Fig. 1, flowers and chocolates have an inter-relation and love is the linking word.

The intra-relation between two words in a microblog is given by:

$$ IaR(w_i, w_k) = \begin{cases} 1, & i = k \\ \dfrac{cor(w_i, w_k)}{\sum_{j=1, j \ne k}^{m} cor(w_j, w_k)}, & i \ne k \end{cases} , $$
(3)

where m is the number of distinct words in the corpus, and it is easy to verify that $\sum_{i=1, i \ne k}^{m} IaR(w_i, w_k) = 1$. Note that all words mentioned here belong to the dictionary built from the corpus; in other words, the words are distinct from one another, which ensures that each word can be identified by its subscript.

However, Eq. 3 can only capture the co-occurrence frequency between $w_i$ and $w_k$, whereas many words are closely related even though they never co-occur in the same microblog. Therefore, we define the inter-relation as

$$ IeR(w_i, w_j) = \begin{cases} 0, & i = j \\ \dfrac{\sum_{\forall w_k \in \mathbf{L}} \min\{ IaR(w_i, w_k),\, IaR(w_j, w_k) \}}{|\mathbf{L}|}, & i \ne j \end{cases} , $$
(4)

where $|\mathbf{L}|$ is the size of the linking-word set $\mathbf{L}$; e.g., $|\mathbf{L}| = 1$ and $\mathbf{L} = \{love\}$ in Fig. 1. We let $IeR(w_i, w_j) = 0$ when $i = j$, since there is no point in computing the inter-relation between a word and itself.

Taking both the intra-relation and the inter-relation into consideration, the correlation between $w_i$ and $w_j$ can be defined as

$$ WR(w_i, w_j) = \begin{cases} 1, & i = j \\ \alpha \times IeR(w_i, w_j) + (1 - \alpha) \times IaR(w_i, w_j), & i \ne j \end{cases} , $$
(5)

where WR is the abbreviation for word relation, and $\alpha \in [0,1]$ balances the importance of the inter-relation against that of the intra-relation between $w_i$ and $w_j$. By definition, $WR(w_i, w_j) = 1$ when $i = j$.
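The following Python sketch (our reading of Eqs. 3-5, not the paper's code) computes the WR matrix from a precomputed correlation matrix; it assumes that the linking-word set L for a pair (w_i, w_j) consists of the words that co-occur with both, which matches the Fig. 1 example.

import numpy as np

def word_relation(cor, alpha=0.5):
    """Sketch of Eqs. 3-5: cor is an m x m symmetric matrix of cor(w_i, w_j)
    values with zeros on the diagonal; returns the WR matrix of Eq. 5."""
    m = cor.shape[0]
    IaR = np.zeros_like(cor)
    for k in range(m):                       # Eq. 3: normalize column k
        total = cor[:, k].sum()              # diagonal is zero, so j != k
        if total > 0:
            IaR[:, k] = cor[:, k] / total
        IaR[k, k] = 1.0
    IeR = np.zeros_like(cor)
    for i in range(m):                       # Eq. 4: average over linking words
        for j in range(i + 1, m):
            L = [k for k in range(m)
                 if k not in (i, j) and cor[i, k] > 0 and cor[j, k] > 0]
            if L:
                IeR[i, j] = IeR[j, i] = sum(
                    min(IaR[i, k], IaR[j, k]) for k in L) / len(L)
    WR = alpha * IeR + (1 - alpha) * IaR     # Eq. 5
    np.fill_diagonal(WR, 1.0)
    return WR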

Instead of term frequency (tf), for which $weight_i = 1$ in most cases, we propose a new local weighting scheme for the words in a microblog, namely the correlation weight.

Definition 3

(correlation weight). Given a microblog $mb_n$ with an initial weight $weight_{ni}$ for each word $w_{ni}$, the correlation weight of $w_{ni}$ in $mb_n$ is defined as

$$ cow(w_{ni}) = weight_{ni} + \frac{\sum_{w_{nj} \in mb_n} weight_{nj} \cdot WR(w_{ni}, w_{nj})}{|mb_n|} , $$
(6)

where $WR(w_{ni}, w_{nj})$ is the word relation between $w_{ni}$ and $w_{nj}$ in this microblog, and $|mb_n|$ is the total number of words in $mb_n$.

Finally, a new representation model for microblogs is obtained, in which each element $rm_{ij}$ of the matrix $\mathbf{RM}_{N \times M}$ is defined as

$$ rm_{ij} = cow(w_{ij}) \cdot idf(w_{ij}) , $$
(7)

where $i \in \{1, 2, \cdots, N\}$ indexes the microblogs in the corpus, $j \in \{1, 2, \cdots, M\}$ indexes the words in the dictionary, $idf(w_{ij}) = \log(N / df(w_{ij}))$, N is the number of microblogs, and M is the size of the dictionary.
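A minimal sketch of Eqs. 6-7 under our reading (illustrative only; the division in Eq. 6 is taken over the distinct words present in the microblog):

import numpy as np

def build_rm(weights, WR, df):
    """Sketch of Eqs. 6-7: weights is an N x M matrix of initial word weights
    per microblog (zero where a word is absent), WR the M x M word-relation
    matrix, and df the length-M document-frequency vector."""
    N, M = weights.shape
    idf = np.log(N / df)
    RM = np.zeros((N, M))
    for n in range(N):
        present = np.nonzero(weights[n])[0]          # words occurring in mb_n
        for j in present:
            # Eq. 6: base weight plus the correlation-weighted contribution
            # of the words in the same microblog.
            cow = weights[n, j] + weights[n, present].dot(
                WR[present, j]) / len(present)
            RM[n, j] = cow * idf[j]                  # Eq. 7
    return RM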

Thanks to the conditional probability between words and the intra/inter relations, the new matrix $\mathbf{RM}_{N \times M}$ can reflect the relationship between any two words, and the experimental section shows it to be denser and lower-dimensional than traditional models. Taking advantage of the inter-relation and intra-relation between words, the similarity function for microblogs is defined as follows [10]:

$$ sim(mb_x, mb_y) = \frac{\sum_{(w_i, w_j) \in \mathbf{D}} weight_i \, weight_j \, cow(w_i, w_j)}{\| mb_x \oplus \mathbf{D} \| \cdot \| mb_y \oplus \mathbf{D} \|} , $$
(8)

where $\mathbf{D}$ denotes the set of correlated word pairs between $mb_x$ and $mb_y$, and $\|mb_x \oplus \mathbf{D}\|$ and $\|mb_y \oplus \mathbf{D}\|$ denote the sizes of $mb_x$ and $mb_y$, which normalize the similarity value:

$$ \| mb_x \oplus \mathbf{D} \| = \sqrt{ \sum\nolimits_{(w_i, w_j) \in \mathbf{D}} weight_i^2 \, cow(w_i, w_j) + \sum\nolimits_{w_i \in mb_x \backslash mb_y} weight_i^2 } , $$
(9)

and $\|mb_y \oplus \mathbf{D}\|$ can be calculated in a similar way.

Thus, $sim(mb_x, mb_y) \in [0, 1]$.
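Since Eq. 8 leaves the pairing set D implicit, the following Python sketch is only our reading: D pairs each word of mb_x with each correlated word of mb_y, the pairwise value cow(w_i, w_j) is assumed to be stored in a symmetric dictionary, and all names are ours.

import math

def microblog_similarity(x_words, y_words, weight, cow):
    """Sketch of Eqs. 8-9: weight maps a word to its weight, cow maps a
    word pair to its correlation value (stored symmetrically)."""
    D = [(wi, wj) for wi in x_words for wj in y_words
         if cow.get((wi, wj), 0) > 0]
    num = sum(weight[wi] * weight[wj] * cow[wi, wj] for wi, wj in D)
    # Eq. 9: paired words carry the correlation, words exclusive to one
    # microblog keep their plain squared weight.
    nx = math.sqrt(sum(weight[wi] ** 2 * cow[wi, wj] for wi, wj in D)
                   + sum(weight[w] ** 2 for w in set(x_words) - set(y_words)))
    ny = math.sqrt(sum(weight[wj] ** 2 * cow[wi, wj] for wi, wj in D)
                   + sum(weight[w] ** 2 for w in set(y_words) - set(x_words)))
    return num / (nx * ny) if nx and ny else 0.0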

3 The DPSO Algorithm for Microblog Hot Topic Detection

Particle Swarm Optimization (PSO) is a population-based search algorithm inspired by the behavior of biological communities that exhibit both individual and social behavior, such as flocks of birds and swarms of bees. In PSO, each candidate solution to the problem at hand is called a particle, and each particle is a real-valued vector within the search space corresponding to one solution of the problem.

However, traditional PSO was originally developed for continuous spaces, while many problems are defined over discrete spaces where the domain of the variables is finite. In this section, we propose a discrete PSO approach and adapt it to clustering microblogs for hot topic detection. The fitness function of PSO is redefined using GS and the correlation similarity of microblogs. Finally, we present a method that reduces the time cost of the algorithm by canonicalizing particles.

3.1 DPSO Algorithm

The basic PSO algorithm maintains a swarm of particles, each of which encodes a potential solution. The particles fly through a multi-dimensional search space where, in each iteration, the position of every particle is adjusted according to its own experience and that of its neighbors. In each iteration, the velocity and position of every particle are updated by Eqs. 10 and 11. Besides, the current global best position (gbest) and each particle's historical best position (pbest) are recorded in the corresponding matrices.

$$ v_{id} \leftarrow \omega \left( v_{id} + \gamma_1 (pbest_{id} - par_{id}) + \gamma_2 (gbest_d - par_{id}) \right) , $$
(10)
$$ par_{id} \leftarrow par_{id} + v_{id} , $$
(11)

where $\omega$ is the inertia factor, whose goal is to balance global exploration and local exploitation, $\gamma_1$ is the personal learning factor, $\gamma_2$ is the social learning factor, $par_{id}$, $pbest_{id}$, and $v_{id}$ are the position, historical best position, and velocity of the $i$th particle in the $d$th dimension, respectively, and $gbest_d$ is the best position of the whole swarm in the $d$th dimension. In our context, these concepts are redefined in Definitions 4, 5 and 6 to make them easier to understand and calculate.
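For concreteness, a direct Python transcription of Eqs. 10-11 as written in this paper (note that this variant folds the whole update under the inertia factor and omits the random factors of canonical PSO):

def pso_step(par, vel, pbest, gbest, w=0.9, g1=1.0, g2=1.0):
    """One continuous PSO update following Eqs. 10-11 as written above."""
    for i in range(len(par)):              # each particle
        for d in range(len(par[i])):       # each dimension
            vel[i][d] = w * (vel[i][d]
                             + g1 * (pbest[i][d] - par[i][d])
                             + g2 * (gbest[d] - par[i][d]))   # Eq. 10
            par[i][d] += vel[i][d]                            # Eq. 11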

DPSO (Discrete Particle Swarm Optimization) is a discrete version of the basic PSO algorithm. In this method, each particle represents a clustering result of the corpus during hot topic detection, since the final aim of our algorithm is to obtain an excellent clustering result.

To detail the process of DPSO, we define the following matrices.

Definition 4

(particle cluster result matrix). The particle cluster result matrix, $\mathbf{P}_{K \times N}$, stores the current position of each particle; each element $p_{id}$ is the cluster of the $i$th particle in the $d$th dimension. E.g., $p_{12} = cluster1$ means that the second dimension of the first particle in the swarm is assigned to cluster1. In addition, K is the number of particles in this paper and is set manually.

Definition 5

(particle cluster quality matrix). The particle cluster quality matrix, $\mathbf{PAR}_{K \times N}$, records the quality of $\mathbf{P}$; each element $par_{id}$ is calculated by the fitness function described below. By comparing $par_{id}$ with $pbest_{id}$ and $gbest_d$, we can judge whether it is worth moving the corresponding element of P to a new cluster.

Definition 6

(particle cluster velocity matrix). The particle cluster velocity matrix, $\mathbf{V}_{K \times N}$, represents the probability of choosing the corresponding particle to optimize. The value of each element, $v_{id}$, is updated during each iteration of the optimization by Eq. 10.

In the DPSO algorithm, Eq. 11 is modified to

$$ par_{id} \leftarrow pbest_{id} . $$
(12)

During the initialization of P, there is a high probability that different particles refer to the same clustering result, as shown in Fig. 2.

Fig. 2. The situation in which different particles represent the same clustering result.

To avoid unnecessary time cost in the iterative process, we propose an effective method as follows.

By using Program 1, $P_i$ and $P_j$ in Fig. 2 are both reduced to the same form, e.g., {1,2,3,2,1,1} or {a,b,c,b,a,a}.
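Program 1 itself is not reproduced in the text; the following minimal Python sketch (our reconstruction) achieves the described behavior by relabeling clusters in order of first appearance.

def canonicalize(particle):
    """Relabel cluster assignments by order of first appearance so that
    particles encoding the same partition share an identical form."""
    mapping = {}
    for label in particle:
        if label not in mapping:
            mapping[label] = len(mapping) + 1   # next unused canonical label
    return tuple(mapping[label] for label in particle)

# Hypothetical labelings encoding the same partition as in Fig. 2:
p_i = [3, 1, 2, 1, 3, 3]
p_j = ["a", "b", "c", "b", "a", "a"]
assert canonicalize(p_i) == canonicalize(p_j) == (1, 2, 3, 2, 1, 1)

Storing the canonical forms in a hash set then makes duplicate particles cheap to detect.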

3.2 The Improved Fitness Function

The fitness function in PSO is used to measure the quality of particles. In other words, it is the means by which we judge whether a given clustering result is a better one during clustering.

As an internal validation that handles corpora of unknown structure and evaluates clustering quality based on the corpus alone, the Global Silhouette coefficient (GS) is a good choice for the fitness function, since it provides a succinct representation of how well each object lies within its cluster. Assume the corpus has been clustered into l clusters. For each microblog $mb_i$, let $a(mb_i)$ be the average dissimilarity between $mb_i$ and all other microblogs within the same cluster, which can be interpreted as how well $mb_i$ is assigned to its cluster, and let $b(mb_i)$ be the average dissimilarity between $mb_i$ and the microblogs in the neighboring cluster of $mb_i$. The GS value is then given by

$$ GS(mb_i) = \frac{b(mb_i) - a(mb_i)}{\max\{ a(mb_i), b(mb_i) \}} , $$
(13)

with $-1 \le GS(mb_i) \le 1$. From this formula it can be observed that negative values are undesirable and that values as close to 1 as possible are desirable.

Given a microblog $mb_i$ belonging to cluster $C_a$, with neighboring cluster $C_b$, the function to compute $a(mb_i)$ is defined as follows

$$ a(mb_i) = \frac{\sum_{mb_j \in C_a} (1 - sim(mb_i, mb_j))}{|C_a|} , $$
(14)

and b(mb i) is defined as

$$ b(mb_i) = \frac{\sum_{mb_j \in C_b} (1 - sim(mb_i, mb_j))}{|C_b|} , $$
(15)

where $|C_a|$ and $|C_b|$ are the numbers of microblogs in clusters $C_a$ and $C_b$, respectively.
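A compact Python sketch of Eqs. 13-15 (ours, not the paper's code; it takes the nearest other cluster as the neighboring cluster and therefore assumes at least two clusters):

def silhouette_fitness(sim, labels):
    """Sketch of Eqs. 13-15: sim is an N x N microblog similarity matrix
    in [0, 1], labels a cluster assignment; returns the mean silhouette."""
    N = len(labels)
    clusters = {c: [i for i in range(N) if labels[i] == c]
                for c in set(labels)}

    def avg_dissim(i, members):
        others = [j for j in members if j != i]
        return (sum(1 - sim[i][j] for j in others) / len(others)
                if others else 0.0)

    total = 0.0
    for i in range(N):
        a = avg_dissim(i, clusters[labels[i]])                    # Eq. 14
        b = min(avg_dissim(i, members)                            # Eq. 15
                for c, members in clusters.items() if c != labels[i])
        total += (b - a) / max(a, b) if max(a, b) > 0 else 0.0    # Eq. 13
    return total / N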

3.3 The Algorithm’s Framework and Details

Combined with the correlations (intra/inter relations) of words, a clustering algorithm based on DPSO is proposed to detect microblog hot topics. The algorithm is divided into three main steps: constructing a microblog representation model from the conditional probability of word co-occurrence and the correlations of words; using the DPSO algorithm to initialize and optimize the clustering of microblogs; and judging whether a clustering result needs further optimization and whether the algorithm should terminate by calculating the corresponding GS (Fig. 3).

Fig. 3. The algorithm's framework.

Note that V is initialized with random values in the range [0,1], where each element indicates the probability of choosing the corresponding particle to optimize, and P is initialized with random cluster subscripts at the beginning of DPSO.
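Because the paper specifies the loop only through Fig. 3 and Eq. 12, the following simplified Python sketch is our reconstruction of the main DPSO loop; it reuses silhouette_fitness and canonicalize from the sketches above, and the exact move rule of the original may differ.

import random

def dpso_cluster(sim, n_clusters, K=100, iters=10000):
    N = len(sim)
    # Initialization per Sect. 3.3: random cluster labels for P,
    # random values in [0, 1] for the velocity/probability matrix V.
    P = [[random.randrange(n_clusters) for _ in range(N)] for _ in range(K)]
    V = [[random.random() for _ in range(N)] for _ in range(K)]
    pbest = [p[:] for p in P]
    fit = [silhouette_fitness(sim, p) for p in P]
    best = max(range(K), key=lambda i: fit[i])
    gbest, gbest_fit = pbest[best][:], fit[best]
    for _ in range(iters):
        for i in range(K):
            for d in range(N):
                # Treat V as the probability of moving dimension d (Def. 6),
                # pulling it toward pbest or gbest (cf. Eq. 12).
                if random.random() < V[i][d]:
                    P[i][d] = random.choice([pbest[i][d], gbest[d]])
            f = silhouette_fitness(sim, P[i])
            if f > fit[i]:                  # update personal and global bests
                pbest[i], fit[i] = P[i][:], f
                if f > gbest_fit:
                    gbest, gbest_fit = P[i][:], f
    return gbest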

4 Experiments

In this section, we report our experimental results. Section 4.1 introduces the datasets, parameter settings, and effectiveness criteria used in the approach. Section 4.2 evaluates the performance of our approach under various parameters and compares our method with existing approaches: traditional k-means, the MicroBlog Hierarchical Dirichlet Process (MB-HDP) [16], and a topic detection method based on microblog weight named Weighted LDA (W-LDA) [17].

4.1 Data Sets, Parameter Settings and Effectiveness Criteria

Datasets.

We grab the first one hundred search results via the Twitter API from March 1, 2015 to October 31, 2015. After removing invalid and overly short tweets and merging identical content, the remaining dataset contains 148,090 microblogs in total.

Parameter Settings.

In our experiments, 50 independent runs are performed with 10,000 iterations per run; the swarm size is K = 100 particles, the dimension of each particle N equals the number of tweets, the inertia factor $\omega$ = 0.9, the personal and social learning factors $\gamma_1$ and $\gamma_2$ are both set to 1.0, and the number of clusters l = 10.

Effectiveness Criteria.

Two evaluation metrics, Purity and F1-measure, are used to assess performance. Purity is a simple and transparent measure of cluster quality, while F1-measure is the weighted harmonic mean of Precision and Recall (P, R).
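Both metrics take their standard forms; writing $\omega_k$ for the $k$th produced cluster and $c$ for a ground-truth topic class,

$$ Purity = \frac{1}{N} \sum\nolimits_{k} \max_{c} \left| \omega_k \cap c \right| , \qquad F1 = \frac{2 \cdot P \cdot R}{P + R} . $$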

4.2 Experimental Results and Analysis

In this section, experimental results concerning our algorithm are discussed: the influence of the regulatory factor $\alpha$, which determines the relative importance of the intra-relation and inter-relation between words; the size of the corpus processed; the number of iterations of the algorithm; and the effectiveness of the new similarity function. Since our algorithm has several parameters to tune, all of them are carefully tuned, and the parameters with the best performance are used to report the final results. We then compare our algorithm with other hot topic detection methods, namely W-LDA, traditional DPSO, and MB-HDP.

Figure 4(1) shows the F1-measure, purity, and GS values for different values of the regulatory factor $\alpha$. Our algorithm clearly performs best when $\alpha$ is in the vicinity of 0.5, which indicates the need for a balance between the intra-relation and the inter-relation.

Fig. 4. The influence of adjusting key parameters and the advantage of using the new similarity function based on the intra/inter relations.

Figure 4(2) shows the F1-measure, purity, and GS values as the number of iterations for optimizing particles increases. When the number of iterations exceeds 10,000, the performance of our algorithm becomes stable.

Figure 4(3) shows the F1-measure, purity, and GS values for different numbers of particles. Each horizontal coordinate denotes the corresponding multiple of the number of microblogs in the corpus.

Figure 4(4) shows the F1-measure values obtained with the new similarity function and with the cosine similarity function, respectively. The new similarity function performs better because it considers the correlations and simplifies the steps of calculating GS, which reduces the algorithm's time cost.

In summary, the above experimental results demonstrate that the intra/inter relations and the similarity function play significant roles in this algorithm, and that these aspects should not be ignored.

For effectiveness, we then compare our method with several existing methods. Overall, our method outperforms all the compared methods. The box plots in Fig. 5 show that our approach reaches the best performance, with a median value close to 0.86 and almost no dispersion apart from some outliers. This aspect is important because it shows that our approach finds very similar F-measure values across all runs. The worst performer is MB-HDP: its median value is the lowest and its box plot shows the highest dispersion. Both MB-HDP and W-LDA are based on LDA, which does not generalize well here. Compared with traditional DPSO, our algorithm adds the new representation model for microblogs and the improved fitness function, which together enable better optimization.

Fig. 5. The comparison of various topic detection approaches.

5 Conclusion

Hot topic detection for microblogs is a challenging research problem in the data mining domain. Unlike traditional documents or texts, microblogs are very short and sparse, and they spread rapidly online. In this work, we propose an effective algorithm based on discrete particle swarm optimization. A representation model for microblogs that considers the correlations and intra/inter relations between words, computed from the conditional probabilities of word co-occurrences, is constructed to capture the semantic associations between words. In addition, a new method to compute the similarity between microblogs is proposed to improve the fitness function, GS. Furthermore, a DPSO algorithm is developed to obtain the hot topics in the corpus. Besides, an effective program is designed to overcome the problem that different particles can denote the same clustering result. Finally, experiments have demonstrated the algorithm's effectiveness in mining microblogs. Future research can target particle dimension reduction and the selection of a more suitable fitness function.