Introduction

With the rapid development of Web 2.0, more and more people are acquiring knowledge from the Internet and actively expressing their personal opinions or attitudes about people and events. Microblogs carry messages of 140 characters or fewer and make it convenient to share information through multiple publishing tools; they have become an important social medium on the Internet and provide a suitable place for the generation and discussion of hot topics. Each microblog user can be an event reporter, using mobile devices to release news. Information on the microblog platform is displayed in a scattered and fragmented way; however, once many microblogs focus on a particular subject, the information flow forms a hot topic. Most microblogs contain elements of personal emotion, and we can thus use them as explicit indicators in identifying hot topics.

Recently, research on microblogs has become popular in the fields of natural language processing and text mining. The detection of hot topics in microblogs is a major area of interest, and the problems associated with topic detection need to be solved urgently. In this paper, we propose a new approach to detecting hot topics that employs an emotion distribution model and a topic model. This is different from previous work on topic detection, which has commonly used seed topic words to guide the clustering process and may suffer from the overfitting problem. According to social cognitive theories, current word-use behaviors relate to past and future behaviors of users [1]. By analyzing the differences in the distributions of emotion words in adjacent time intervals, we can detect hot events in an unsupervised way and avoid the overfitting problem at the same time. Additionally, people tend to repost a microblog that contains a detailed description of a hot topic. In our work, the repost degree as well as the content of the microblog is taken into account in estimating the importance of each microblog and generating topic models.

In summary, our paper makes the following contributions. First, we propose the emotion distribution language model (ELM) for the simulation of the emotion distribution in microblogs, and use it to detect the potential time intervals for hot-topic detection. Second, we analyze the topics in detected time intervals, and estimate the importance of each microblog according to its content and repost degree to extract a hot topic by applying the topic model. Third, we apply our two-step method to Sina Microblog; experiment results show that the approach is effective.

This paper is organized as follows. The first section introduces the background to and importance of our research. The second section briefly introduces related work. The third section presents our method of detecting and extracting hot topics. The fourth section describes the experimental process and analyzes the experimental results. The final section concludes the paper and proposes further research.

Related Work

Research and Trends in the Microblog Domain

Microblogs have recently received attention from researchers. More and more researchers are investigating public opinion about various topics or news. Kwak et al. [2] analyzed Twitter as a social network and new form of media. Weng et al. [3] proposed the method TwitterRank to find a sensitive topic among influential Twitter users. Marchetti-Bowick et al. [4] used microblogs to forecast political polls by distance supervision. Bollen et al. [5] attempted to predict the stock market by analyzing daily microblogs. Researchers have also investigated the trustworthiness of microblog content [6]. In addition to these studies, topic detection, tracking, and extraction problems in the microblog research domain remain an attractive area of research.

Topic Detection and Tracking

Topic detection and tracking technology [7] is widely applied to topic detection. Information retrieval, text mining, and information extraction focus on a particular topic rather than a wide range of information, while topic detection and tracking emphasize the discovery of new information [8]. Topic detection usually applies clustering algorithms to different events. However, most of the content of microblogs is about the feelings of users, and the proportion of emotion words in microblogs is higher than that in traditional text messages. Additionally, fragmentation, timeliness, and mobility are explicit features of the microblog and increase the difficulty of topic clustering. For example, some topic-related words may not occur in microblogs, and this strongly affects the clustering results. Therefore, traditional topic detection and tracking technology is not suitable for hot-topic detection on the microblog platform. Emotion elements, which are important indicators of hot topics, should be taken into account in topic detection and tracking.

Considering the temporal changes in public opinion, Ku and Liang [9] used language characteristics to capture opinions and presented the changes in the overall sentiment about presidential candidates in an election. Akcora et al. [10] proposed a method of finding public opinion on Twitter by analyzing the emotion centroid (EC) and set space model (SSM). The above works addressed the detection of hot topics and public opinion in different domains and provide a background for the present study. On the basis of these works, we propose an ELM and apply it to the detection of hot topics on the Sina Microblog platform.

Opinion Mining

Sentiment analysis or opinion mining, from coarse to fine grained (e.g., from the document level to concept level), has attracted much attention from researchers [11, 12]. Online opinions on products and services not only affect choices made by consumers but also help merchants improve their services. Currently, text opinion analysis is based on an annotated corpus, using machine learning algorithms to analyze words, sentences, and chapter polarity. Owing to the shortness of a microblog, each microblog is similar to a sentence in an article; therefore, the technology of sentence-level opinion mining provides great support for this paper. Pang and Lee [13, 14] adopted Bayesian and maximum entropy methods to analyze the tendency of movie reviews with manual tagging training data. Pandarachalil et al. [15] used an unsupervised approach to analyze sentiment of Twitter. Additionally, researchers have focused on human emotion [16] and Chinese text characteristics [17, 18], such as the hourglass of emotions and Chinese recognition. All of these studies serve as an important basis for analyzing emotion fluctuation on Chinese Microblogs.

Methodology

The daily collection of microblogs can be seen as a document d, and the whole corpus is the document set D; therefore, D = {d 1, d 2 ,…,d n}. Each microblog is a sentence s of document d, and d = {s 1, s 2,…,s n}. Therefore, each term in the microblog can be regarded as a word w in the language model. There are three major steps in our paper.

The first step is the construction and recognition of emotion ontology. The recognition of emotion ontology is the basis of an ELM. In this paper, we use DUTIR Emotion Ontology [19] to recognize emotion words in microblogs. The process involves DUTIR Emotion Ontology and words that are specifically used as emotion words on the microblog platform; e.g., “Ink,” which is not an emotion word when used generally but sounds like the word “humorous” in Chinese, which has an emotion factor.

The second step is the detection of hot topics using an ELM. When an event occurs, there is an expectation that the emotions of microblog users will fluctuate, and the distributions of emotion words will thus change. By constructing an ELM for each time interval, we compare emotion distributions between adjacent time intervals to detect hot topics.

The third step is the extraction of hot topics from the microblog platform. After detecting the potential hot-topic time interval, we consider the content and repost degree of each microblog and then generate topic keywords using the topic model, and use them as indicators in extracting the hot events.

Construction and Recognition of Emotion Ontology

The paper uses DUTIR Emotion Ontology [19] as the emotion lexicon resource. In emotion classification, there is still no standard for how many emotion classes there are. Presently, emotions can be divided into four, six, eight, ten or twenty categories. In DUTIR Emotion Ontology, emotions are classified into seven categories and 20 sub-categories, which can be used in coarse or fine emotion computing.

To recognize an emotion word, mutual information is calculated between the word and ontology in DUTIR Emotion Ontology as

$$MI(w,S_{ui} ) = \log \frac{{P(w,S_{ui} )}}{{P(w)P(S_{ui} )}}$$
(1)

where \(S_{ui}\) is the ith ontology in emotion u, \(P(w)\) is the probability of occurrence of word w, \(P(S_{ui})\) is the probability of occurrence of the ith ontology in emotion u, and \(P(w,S_{ui})\) is the probability that the two occur together.
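As a minimal sketch of how Eq. (1) could be estimated in practice, the probabilities can be replaced by relative frequencies over a set of contexts such as microblogs. The function name, its parameters, and the choice of the microblog as the counting unit are assumptions made for illustration; the paper does not state how the probabilities are estimated.

```python
import math


def mutual_information(n_joint, n_word, n_class, n_total):
    """Mutual information of Eq. (1), estimated from raw counts.

    n_joint: contexts containing both word w and an entry of class S_ui
    n_word:  contexts containing w
    n_class: contexts containing an entry of class S_ui
    n_total: total number of contexts (assumed here to be microblogs)
    """
    if min(n_joint, n_word, n_class) <= 0:
        return float("-inf")  # no co-occurrence evidence at all
    p_joint = n_joint / n_total
    p_word = n_word / n_total
    p_class = n_class / n_total
    return math.log(p_joint / (p_word * p_class))
```

A value near zero suggests that the word and the emotion class are roughly independent, while a large positive value suggests that the word is a candidate emotion word for that class.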

We also consider other rules, such as the co-occurrence rule, the part-of-speech rule, and the context rule, to expand the emotion ontology. A machine learning method is also adopted for automatic expansion of emotion ontology. In this paper, conditional random fields [20] are implemented for automatic collection according to

$$P_{\theta } (y|x) \propto \exp \left( {\sum\limits_{e \in E,k} {\lambda_{k} f_{k} (e,y|_{e} ,x)} + \sum\limits_{v \in V,k} {u_{k} g_{k} (v,y|_{v} ,x)} } \right)$$
(2)

where graph G = (V, E), with V being the set of vertices and E the set of edges, x is a data sequence, y is a label sequence, and \(y|_{S}\) is the set of components of y associated with the vertices in sub-graph S. The features \(f_{k}\) and \(g_{k}\) are given and fixed, and \(\lambda_{k}\) and \(u_{k}\) are weight parameters.

DUTIR Emotion Ontology is defined by 3-tuples as

$${\text{Lexicon}} = \left( {B,R,E} \right)$$
(3)

where B represents basic lexical information including the serial number, entry, English meaning, part of speech, editor, and version; R represents synonymous relations; and E represents emotion information, which is the most important part of 3-tuples.

For example, “pleasantly surprised” is described as follows.

The value “PA” in the <emotion> tag is the symbol of the “happy” emotion, and the <intensity> tag includes the emotion and intensity of the word. The value in the “intensity” field is a vector, and each vector component, which denotes a specific emotion, ranges from 0 to 9 (where zero indicates no such feeling). As an example, the words “pleasantly surprised” contain both “happiness” and “surprise” feelings. The intensity of emotions is given on a five-point scale having values of 1, 3, 5, 7 and 9. The “happiness” intensity is 7 and the “surprise” intensity is 5. The numbers 0, 1, 2 and 3 in the “polarity” field represent neutral, positive, negative and both positive and negative attributes, respectively. The framework includes a description of both the static and dynamic attributes, and emotion information is shown in quantitative and qualitative terms, which provides useful information for emotion analysis (Table 1).

Table 1 DUTIR emotion ontology

DUTIR Emotion Ontology is based on existing dictionaries such as Dictionary of Chinese Praise and Blame Words [21], Dictionary of Chinese Adjective [22], Dictionary of Chinese Idiomatic Phrases [23], Dictionary of Chinese Idiom [24], New Century Dictionary of Chinese New Words [25] and Chinese Classified Dictionary [26]. In addition, semantic network resources, including HowNet [27] and WordNet [28], are used. Network emotion words are also contained to improve precision. Therefore, DUTIR Emotion Ontology has a wide range of application in the emotion analysis of microblogs and similar network platforms, such as blogs and bulletin board systems. There are presently 27,243 entries in DUTIR Emotion Ontology, which is still being updated. We plan to introduce more semantic resources to enrich the ontology.

Hot-Topic Detection Based on an ELM

A statistical language model [29], based on statistical methods for natural language processing, can be estimated using a multinomial distribution. The model provides a statistical way of scoring and ranking documents. The use of a statistical language model is a two-stage method; a language model is generated for each document in the first stage, and documents are ranked according to the query relevance score in the second stage. The relevance score is calculated as

$$P(Q|D) = \prod\limits_{w \in V} {P(w|D)^{{q_{w} }} }$$
(4)

where Q is the query, D is the document, V is the word set, and \(q_{w}\) is the number of instances of the word w in Q.
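A product of many small probabilities underflows quickly, so Eq. (4) is usually evaluated in log space. The following sketch illustrates this; the function name and the dictionary representation of the document model are assumptions for illustration only.

```python
import math
from collections import Counter


def log_query_likelihood(query_tokens, doc_model):
    """log P(Q|D) = sum over w of q_w * log P(w|D), the logarithm of Eq. (4).

    doc_model maps each word to P(w|D); words missing from it have probability 0.
    """
    score = 0.0
    for word, q_w in Counter(query_tokens).items():
        p = doc_model.get(word, 0.0)
        if p == 0.0:
            return float("-inf")  # a single unseen word zeroes the whole product
        score += q_w * math.log(p)
    return score
```

The early return for unseen words shows why an unsmoothed model is fragile: one missing word makes the whole score degenerate.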

We then used relative entropy (Kullback–Leibler divergence) to measure the similarities between the two models. Their distance is a reflection of the difference between the learning model and real model. If the two models are the same, then the relative entropy is zero. Higher relative entropy corresponds to a greater difference between two models. The relative entropy calculation formula is defined as

$$D_{KL} ((P(w|Q)),(P(w|C))) = \sum\limits_{w \in V} {P(w|Q)\log \frac{P(w|Q)}{P(w|C)}}$$
(5)

where \(P(w|Q)\) is the probability that the word w occurs in query Q and \(P(w|C)\) is the probability that the word w occurs in the experiment dataset C.

In the domain of information retrieval, according to the “bag of words” assumption, each word is independent, and the word distribution of a text can be estimated from word frequencies. Emotion words on a microblog platform should likewise follow a distribution. On the basis of this distribution, and considering the fragmented nature of microblogs, the present paper proposes an ELM approach. This approach analyzes the differences in ELMs between adjacent time intervals to detect a hot topic. We define the ELM of time period \(T_{n}\) as

$$P(t|DT_{n} ) = \prod\limits_{t \in E} {P(t|C)^{{q_{t} }} }$$
(6)

where E is the set of emotion ontology, \(DT_{n}\) denotes the microblogs in time period \(T_{n}\), \(P(t|C)\) is the probability that emotion ontology t occurs in experiment dataset C, and \(q_{t}\) is the number of occurrences of emotion ontology t in time period \(T_{n}\).

Owing to the short and brief nature of microblogs, emotion words are sparse to some extent. We therefore apply Dirichlet smoothing to the experiment dataset [29]. The Dirichlet smoothing formula is defined as

$$P_{\mu } (w|d) = \frac{c(w;d) + \mu p(w|C)}{{\sum\nolimits_{w} {c(w;d) + \mu } }}$$
(7)

where \(P_{\mu}(w|d)\) is the probability of occurrence of word w after smoothing, c(w;d) is the number of occurrences of word w in document d, μ is a smoothing parameter, and p(w|C) is the probability of occurrence of word w in the experiment corpus C.
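Eq. (7) can be sketched as follows. The function name, the dictionary representation of the collection model, and the default value of μ are assumptions for illustration; the paper does not report the value of μ that was used.

```python
from collections import Counter


def dirichlet_smooth(doc_tokens, collection_model, mu=100.0):
    """Return P_mu(w|d) of Eq. (7) for every word in the collection vocabulary.

    collection_model maps each word to p(w|C) and is assumed to sum to 1.
    Tokens outside that vocabulary are ignored so the result also sums to 1.
    """
    counts = Counter(t for t in doc_tokens if t in collection_model)
    doc_len = sum(counts.values())  # sum over w of c(w;d)
    return {
        w: (counts[w] + mu * p_wc) / (doc_len + mu)
        for w, p_wc in collection_model.items()
    }
```

Every word in the collection vocabulary receives a non-zero probability, which keeps a later relative-entropy computation finite even when an emotion word is absent from a given time interval.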

Relative entropy is an important evaluation metric of a statistical language model. By calculating ELMs for adjacent time intervals T n and T n−1, relative entropy can be used to measure the differences, where higher relative entropy corresponds to greater differences between adjacent time intervals, which provides a basis for detecting a potential hot topic. The equation for relative entropy is

$$D_{KL} (p(t|DT_{n - 1} ),p(t|DT_{n} )) = \sum\limits_{t \in E} {p(t|DT_{n - 1} )\log \frac{{p(t|DT_{n - 1} )}}{{p(t|DT_{n} )}}}$$
(8)

where E is the microblog Emotion Ontology, t is one emotion ontology in E, \(DT_{n}\) denotes the microblog dataset in time period \(T_{n}\), and \(p(t|DT_{n})\) is the probability that emotion ontology t occurs in the microblog dataset in time period \(T_{n}\).
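The relative entropy of Eq. (8) can be sketched for two emotion distributions stored as dictionaries. The function name and data layout are assumptions for illustration, and the second distribution is assumed to be smoothed so that it contains no zero probabilities.

```python
import math


def kl_divergence(p, q):
    """D_KL(p || q) = sum over t of p(t) * log(p(t) / q(t)).

    p and q map each emotion entry t to its probability; q[t] must be > 0
    wherever p[t] > 0, which smoothing is assumed to guarantee.
    """
    total = 0.0
    for t, p_t in p.items():
        if p_t == 0.0:
            continue  # the term 0 * log 0 is taken to be 0
        total += p_t * math.log(p_t / q[t])
    return total
```

Relative entropy is not symmetric, so the order of the two arguments matters when adjacent intervals are compared.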

According to observations in [10], when a hot event occurs, individuals tend to post more microblogs, and the microblogs include more emotion words. Thus, the emotion expression patterns of these microblogs differ from the patterns that appeared in the previous period but are more similar to the patterns of the following period. According to the above analysis, if the following two criteria are satisfied, then a potential hot event occurs in time period \(T_{n}\).

$$D_{KL} (p(t|DT_{n - 1} ),p(t|DT_{n} )) > D_{KL} (p(t|DT_{n} ),p(t|DT_{n + 1} ))$$
(9)
$$D_{KL} (p(t|DT_{n - 1} ),p(t|DT_{n} )) > D_{KL} (p(t|DT_{n - 1} ),p(t|DT_{n - 2} ))$$
(10)

After defining the equation, we calculate the relative entropy between adjacent time intervals to detect a potential hot topic.
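Conditions (9) and (10) can be checked with a single pass over the per-interval models. In this sketch the function name is hypothetical and the divergence is passed in as a parameter; the first two intervals and the last one are skipped because the conditions need the models of periods n−2 and n+1.

```python
def detect_hot_intervals(models, divergence):
    """Return the indices n that satisfy conditions (9) and (10).

    models:     list with one emotion distribution per time interval
    divergence: function (p, q) -> float, for example relative entropy
    """
    hot = []
    for n in range(2, len(models) - 1):
        d_curr = divergence(models[n - 1], models[n])      # left side of (9), (10)
        d_next = divergence(models[n], models[n + 1])      # right side of (9)
        d_prev = divergence(models[n - 1], models[n - 2])  # right side of (10)
        if d_curr > d_next and d_curr > d_prev:
            hot.append(n)
    return hot
```

The argument order in each call mirrors the order written in conditions (9) and (10).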

Extracting Hot Topics from Microblogs

It has been observed that topic clustering can be helpful in the quick retrieval of desired information. Traditional text mining techniques make no special allowance for the short and sparse characteristics of microblog data, and the traditional topic clustering technique therefore cannot be applied directly to microblog posts. The topic detection process commonly uses the vector space model. However, for a short and sparse microblog, the vector space model (using words or terms as features) cannot be used for the accurate calculation of the similarities between texts. To reduce data sparsity, we apply the latent Dirichlet allocation (LDA) model [30] to the data modeling and extract the hidden microblog topics. The high-dimensional sparse text vector is then reduced to a low-dimensional hidden-topic space.

The LDA model is a probabilistic graphical model that has three levels as shown in Fig. 1. The boxes are “plates” representing replicates. The outer plate represents documents, while the inner plate represents the repeated choice of topics and words within a document.

Fig. 1 Representation of latent Dirichlet allocation

LDA assumes the following generative process for each document w in a corpus D.

1. Choose \(\theta_{i}\) ~ Dir(\(\alpha\)), where Dir(\(\alpha\)) is the Dirichlet distribution with parameter \(\alpha\) and i ∈ {1,…, M}.

2. Choose \(\varphi_{k}\) ~ Dir(\(\beta\)), where k ∈ {1,…, K}.

3. For each of the words \(w_{ij}\), where i ∈ {1,…, M} and j ∈ {1,…, \(N_{i}\)},

    (a) choose a topic \(z_{ij}\) ~ Multinomial(\(\theta_{i}\)),

    (b) choose a word \(w_{ij}\) ~ Multinomial(\(\varphi_{{z_{ij} }}\)).
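The generative process above can be simulated with the Python standard library. This is a sketch under stated assumptions: the function names, the symmetric hyperparameter values, and the fixed seed are all chosen for illustration and are not taken from the paper.

```python
import random


def sample_dirichlet(alpha, rng):
    """Draw from Dir(alpha) by normalizing independent Gamma(alpha_k, 1) draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def generate_corpus(doc_lengths, num_topics, vocab, alpha=0.5, beta=0.5, seed=0):
    """Generate one synthetic document per entry of doc_lengths."""
    rng = random.Random(seed)
    # Step 2: one word distribution phi_k per topic.
    phi = [sample_dirichlet([beta] * len(vocab), rng) for _ in range(num_topics)]
    corpus = []
    for n_i in doc_lengths:
        # Step 1: one topic distribution theta_i per document.
        theta = sample_dirichlet([alpha] * num_topics, rng)
        doc = []
        for _ in range(n_i):
            z = rng.choices(range(num_topics), weights=theta)[0]  # step 3(a)
            w = rng.choices(vocab, weights=phi[z])[0]             # step 3(b)
            doc.append(w)
        corpus.append(doc)
    return corpus
```

Inference runs this process in reverse: given only the words, it recovers plausible values of the hidden topic assignments and of the two sets of distributions.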

The Markov chain Monte Carlo method [31] is a general method of obtaining samples from a complex distribution, with Gibbs sampling being a special case. After deduction, the final Gibbs sampling equation is

$$p(z_{i} = k|z_{\neg i} ,w) \propto \frac{{n_{k,\neg i}^{(i)} + \beta_{i} }}{{\left[ {\sum\limits_{v = 1}^{V} {n_{k}^{(v)} + \beta_{v} } } \right] - 1}} \cdot \frac{{n_{m,\neg i}^{(k)} + \alpha_{k} }}{{\left[ {\sum\limits_{z = 1}^{K} {n_{m}^{(z)} + \alpha_{z} } } \right] - 1}}$$
(11)

where \(n_{k}^{(v)}\) denotes the number of occurrences of word w v in topic k and \(n_{m}^{(z)}\) denotes the number of occurrences of topic z in document m.
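One resampling step of Eq. (11) can be sketched as below. The sketch assumes symmetric priors (a single α and a single β) and assumes that the count tables passed in already exclude the token being resampled, so the "−1" terms of Eq. (11) do not appear explicitly. The function name and the list-based count tables are hypothetical.

```python
def topic_conditional(word, doc, n_kw, n_k, n_mk, n_m, alpha, beta, vocab_size):
    """Normalized p(z_i = k | z_-i, w) for one token, following Eq. (11).

    All counts must already exclude the token being resampled:
    n_kw[k][word]: times `word` is assigned to topic k
    n_k[k]:        tokens assigned to topic k
    n_mk[doc][k]:  tokens of document `doc` assigned to topic k
    n_m[doc]:      tokens in document `doc`
    """
    num_topics = len(n_k)
    weights = []
    for k in range(num_topics):
        word_term = (n_kw[k][word] + beta) / (n_k[k] + vocab_size * beta)
        topic_term = (n_mk[doc][k] + alpha) / (n_m[doc] + num_topics * alpha)
        weights.append(word_term * topic_term)
    total = sum(weights)
    return [w / total for w in weights]
```

A full sampler would decrement the counts for a token, draw a new topic from this distribution, increment the counts again, and repeat over all tokens for many sweeps.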

Once we have the label of topic z for each word, we can estimate the values of the other latent variables according to

$$\varphi_{k,i} = \frac{{n_{k}^{(i)} + \beta_{i} }}{{\sum\limits_{v = 1}^{V} {n_{k}^{(v)} + \beta_{v} } }},\quad \theta_{m,k} = \frac{{n_{m}^{(k)} + \alpha_{k} }}{{\sum\limits_{z = 1}^{K} {n_{m}^{(z)} + \alpha_{z} } }}$$
(12)

where \(\varphi_{k,i}\) is the probability of word w i being in topic k, and \(\theta_{m,k}\) is the probability of topic z k being in document m.
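Once sampling has converged, Eq. (12) reduces to normalizing the count tables. The sketch below assumes symmetric priors; the function name and the nested-list layout are illustrative choices.

```python
def estimate_phi_theta(n_kw, n_mk, alpha, beta):
    """Point estimates of Eq. (12).

    n_kw[k][i]: number of times word i is assigned to topic k
    n_mk[m][k]: number of tokens of document m assigned to topic k
    Returns (phi, theta) with phi[k][i] and theta[m][k].
    """
    vocab_size = len(n_kw[0])
    num_topics = len(n_kw)
    phi = [
        [(c + beta) / (sum(row) + vocab_size * beta) for c in row]
        for row in n_kw
    ]
    theta = [
        [(c + alpha) / (sum(row) + num_topics * alpha) for c in row]
        for row in n_mk
    ]
    return phi, theta
```

Each row of phi and each row of theta sums to one, so the rows can be read directly as word and topic distributions.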

Based on the above deduction, we can produce the following probability and its weight score s for the word w in microblog m. In our experiment, the repost degree R is an important factor used to estimate the importance of the word w:

$$p(w_{j} |m_{i} ) = p(w_{j} |\theta_{{m_{i} }} ,\beta ) = \sum\limits_{k = 1}^{K} {R_{{m_{i} }} p(w_{j} |z_{n} ,\beta )} p(z_{n} |\theta_{{m_{i} }} )$$
(13)
$$S(w_{j} |d_{{T_{n} }} ) = \sum\limits_{i = 1}^{N} {s(w_{j} |m_{i} )} = \sum\limits_{i = 1}^{N} {R_{{m_{i} }} } p(w_{j} |m_{i} )$$
(14)

In this part, we treat the microblogs posted in 1 h as a “document” d, and each microblog in that hour is a “sentence” in the “document” d. N is the number of microblogs in time period \(T_{n}\). Because we have already detected the time periods with potential hot topics using the ELM, the parameters α and β are dataset-level parameters, and they are sampled once in the process of generating a dataset. The variables \(\theta_{m_{i}}\) are document-related variables, sampled once per “document” d, and the variables \(z_{n}\) and \(w_{j}\) are word-related variables that are sampled once for each word in each “document,” which consists of microblogs published in a specific hour. After taking into account the repost degree, we can re-weight the score of each word in microblogs, and the results of the LDA model guide us in extracting the topic.
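The repost-weighted score can be sketched as follows. Two assumptions are made and should be read as such: the repost degree R is applied once per microblog when the per-microblog probabilities are summed, and R is supplied by the caller as a non-negative weight, since the paper does not state how it is computed from repost counts. The function name and data layout are hypothetical.

```python
def score_words(phi, thetas, repost_degrees, vocab):
    """Repost-weighted word scores for one time interval.

    phi[k][j]:         P(word j | topic k)
    thetas[i][k]:      P(topic k | microblog i)
    repost_degrees[i]: weight R of microblog i (its derivation is an assumption)
    vocab[j]:          the word with index j
    """
    scores = {w: 0.0 for w in vocab}
    for theta, r in zip(thetas, repost_degrees):
        for j, w in enumerate(vocab):
            # p(w_j | m_i): mix the topic-word probabilities by the topic weights.
            p_w_given_m = sum(theta[k] * phi[k][j] for k in range(len(phi)))
            scores[w] += r * p_w_given_m
    return scores
```

Sorting the returned dictionary by value gives candidate topic keywords, with words from heavily reposted microblogs ranked higher.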

Experiments and Analysis

Datasets and Work Flow

We create a dataset by collecting microblogs from Sina Microblog during the period from June 7, 2010 to June 13, 2010. A total of 52,500 microblogs are collected and stored in a uniform format. A microblog is defined as below.

Here <name> is the user name, <text> denotes the content that the user posts, <rt> is the content that the user reposts, and <time> is the timestamp of a microblog. As users sometimes post original messages without repost, the <rt> segment can be null.

To extract the ELM, we analyze the emotion words in a day’s corpus to detect a hot topic on the microblog platform. A flowchart of our work is shown in Fig. 2.

Fig. 2 Flowchart of our work

Results of Hot-Topic Detection

The length of the time intervals is another important factor in our work. If the time intervals are shorter than 1 h, too few meaningful microblogs are included, leading to biased results. In contrast, if the time intervals are longer than 1 h, more than one topic may occur in a single interval, which does not suit the present problem domain. Therefore, we set the time interval as 1 h. This is the most acceptable time interval for providing meaningful data, and furthermore, it allows us to detect hot topics with fine granularity.
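Grouping microblogs into 1-h intervals can be sketched as below. The function name and the representation of a microblog as a (timestamp, text) pair are assumptions for illustration; the stored format of the dataset may differ.

```python
from collections import defaultdict
from datetime import datetime


def bucket_by_hour(posts):
    """Group (timestamp, text) pairs into 1-h intervals keyed by the hour start."""
    buckets = defaultdict(list)
    for timestamp, text in posts:
        hour_start = timestamp.replace(minute=0, second=0, microsecond=0)
        buckets[hour_start].append(text)
    # Return the intervals in chronological order.
    return dict(sorted(buckets.items()))
```

Each value of the returned dictionary then plays the role of one “document” for both the ELM and the topic model.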

After we define the time interval, we use the Sina Microblog data crawled from June 7, 2010 to June 13, 2010 as the experiment dataset. We choose this time period because there were a number of hot topics in that week. Information relating to the dataset is presented in Table 2.

Table 2 Experimental dataset

According to the hot topics provided by the microblog platform, we compare our method with that used in Ref. [10]. Results are presented in Fig. 3, and a detailed description of the topic detection results is given in Table 3. In the table, a number outside parentheses is the number of time periods for which detection using the method was correct, while a number inside parentheses is the number of time periods for which detection using the method was incorrect.

Fig. 3 Experiment results of different methods

Table 3 Topic detection results

Figure 3 shows that all four methods can be used to find hot topics in our experiment dataset effectively and that the ELM has the highest precision and F1 score, demonstrating the effectiveness of our proposed approach.
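For reference, precision, recall, and the F1 score can be computed from the counts of correctly and incorrectly detected time periods. The sketch assumes the standard definitions and a hypothetical function name; the number of true hot-topic periods must be supplied separately, because it cannot be derived from the correct and incorrect counts alone.

```python
def precision_recall_f1(correct, incorrect, total_relevant):
    """Standard precision, recall, and F1 from detection counts.

    correct:        detected periods that are true hot-topic periods
    incorrect:      detected periods that are not
    total_relevant: number of true hot-topic periods in the reference list
    """
    detected = correct + incorrect
    precision = correct / detected if detected else 0.0
    recall = correct / total_relevant if total_relevant else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1
```

With made-up counts of 3 correct and 1 incorrect detection against 6 true periods, the function returns a precision of 0.75, a recall of 0.5, and an F1 score of 0.6.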

The EC method has the highest recall rate but the lowest accuracy. Because the EC method only focuses on emotion words, if fewer emotion words are contained in the adjacent time intervals for different topics, then the EC is misleading in result detection. For the SSM, emotion words are only a part of the whole word set, and the Jaccard similarity may not depend on emotion words. The recall rate therefore drops markedly, but the precision is higher than that of the EC method for all kinds of words.

The EC&SSM method considers the advantages and disadvantages of the emotion centroid and set space model. It uses the Jaccard similarity to validate the hot topic detected by cosine similarity, which guarantees the basic recall rates and increases the accuracy. Figure 3 and Table 3 show that EC&SSM has higher precision and a higher F1 score compared with the EC and SSM.
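The two similarity measures named above can be sketched in a few lines. These are the textbook definitions of cosine and Jaccard similarity, not a reproduction of the exact feature construction in Ref. [10]; the function names and input types are assumptions for illustration.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length numeric vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def jaccard_similarity(a, b):
    """Size of the intersection over size of the union of two word sets."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 1.0
```

Cosine similarity compares how strongly each emotion is expressed, whereas Jaccard similarity only compares which words are present, which is why the two measures can disagree on the same pair of intervals.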

The results of the ELM show that the ELM has not only higher precision but also higher recall and a higher F1 score when compared with EC&SSM.

The results reveal the importance of the relationship between language expression and the language model. Language expression, which is described by the language model, is more suitable for natural language processing, and the ELM therefore approximately reflects the facts of the experiment dataset and it has the highest precision and F1 score. The results show that the method that we proposed can be used to detect hot topics throughout the microblog dataset.

According to conditions (9) and (10), we find that the hot-topic time intervals 9, 13, and 20 provided by the platform were detected by the ELM, as shown in Fig. 4. The ELM also has disadvantages; e.g., it does not take the relationships between emotion words into account, and when emotion words occur rarely, as in narrative microblogs, the performance could be affected to some degree.

Fig. 4 Relative entropy of the ELM for adjacent time intervals on June 13, 2010

To further validate our results, we perform statistical analysis. We use the D-values of the emotion word ratio for adjacent time intervals to confirm our conclusion; i.e., emotion words play an important role in topic detection and more microblogs contain emotion words when a hot event occurs. In Fig. 5, for example, we find that there is an emotion burst between time intervals 9 and 10, and this burst is also detected using our proposed method.
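The D-value check can be sketched as below. Two assumptions are made: the emotion word ratio of an interval is taken to be the fraction of its microblogs that contain at least one emotion word, and a D-value is taken to be the difference between the ratios of adjacent intervals. The paper does not spell out either definition, so both should be treated as illustrative.

```python
def emotion_ratio_d_values(intervals, emotion_lexicon):
    """Differences in emotion word ratio between adjacent time intervals.

    intervals:       list of intervals, each a list of tokenized microblogs
    emotion_lexicon: set of emotion words
    Returns one value per adjacent pair: ratio[n] - ratio[n - 1].
    """
    ratios = []
    for posts in intervals:
        with_emotion = sum(
            1 for tokens in posts if any(t in emotion_lexicon for t in tokens)
        )
        ratios.append(with_emotion / len(posts) if posts else 0.0)
    return [ratios[n] - ratios[n - 1] for n in range(1, len(ratios))]
```

Under these assumptions, a large positive value marks an interval in which the share of emotional microblogs rises sharply relative to the preceding interval.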

Fig. 5 Emotion word ratio D-values for adjacent time intervals on June 13, 2010

Results of Hot-Topic Extraction

After detecting the time periods using the ELM, the next step is to extract the hot topic using the topic model. The LDA model generates the probabilities of topics in the “document,” which consists of microblogs published in 1 h, and provides a topic space for “documents.” The results show that the method provides good results.

The topic number is an important parameter of the LDA model. If the topic number is set too small, the model will not fully generate the “document,” whereas if it is too large, the hidden topics will be split apart, affecting the result. According to our analysis, we set the topic number to three, which gives a better experimental result. The results of topic extraction are given in Table 4.

Table 4 Topic extraction results

In this part of the experiment, we aim to detect and extract “new” topics that have not yet been detected. If a topic has been detected previously, we ignore it and select the next topic according to the probability ranking result. For example, one topic, which relates to the South Africa World Cup 2010, may continue all day, but if a match has already been detected, we focus on matches other than the detected one.

Two factors are taken into account. One is the content of the microblog, and the other is the repost degree. The former is the basis of detecting the topics, and the latter is used to re-weight the words related to the topics. People tend to repost the microblog with a detailed description about the hot topic, and therefore, the repost degree is used to estimate the importance of each microblog and its content. The hot topics in the extracted results for June 13, 2010 are listed in Table 5.

Table 5 Topic words and their probabilities for June 13

From the results in Table 5, we conclude that the ELM can detect a previously existing topic during the time period. It is clear that three topics are explicitly extracted, namely “the mistake made by England Goalkeeper Green in the match between England and the United States,” “Duan Wu Festival,” and “Fake Caocao Tomb” based on the topic words and their probability values.

The results are reasonable and encouraging. The model benefits from the potential advantages of LDA [32, 33] as a generative model for documents. We also considered the repost degree, which is a significant factor in the microblog research domain and indicates the microblog users’ behaviors in emphasizing specific topics. Using the repost degree as a factor of the word probability score, we can re-weight the score of each word in each microblog and use the re-weighted score to extract the topic precisely.

Conclusions and Future Work

In this paper, we proposed an approach of detecting and extracting hot topics using an ELM and topic model. The approach was used to analyze the differences in ELMs between adjacent time intervals in detecting hot topics.

According to each microblog’s content and repost degree, we estimated the importance of each microblog, generated topic models, and then used the topic keywords provided by the topic model to extract the hot topics. Experimental results show that the approach effectively detects and extracts hot topics on a microblog platform. The findings can be used to help Sina Microblog manage and monitor hot topics daily.

In our future work, we will continue to enrich DUTIR Emotion Ontology so that it can be widely used in Chinese text emotion analysis, and we will integrate social cognitive theory to quantify emotion intensity. Meanwhile, we will attempt to improve the performance of the ELM and hot-topic extraction with more semantic and statistical features on the microblog platform.