1 Introduction

As web service users’ online activities have expanded, the amount of information they generate and share online has also increased. In addition, social media is used as a communication tool for interactions between individuals and groups and to create interdependent relationships [1, 2]. Social media refers to the activities, practices, and behaviors of exchanging and sharing information, knowledge, and opinions. It is a form of communication based on Web 2.0 that individuals use to share opinions, experiences, and information to create and expand their relationships with others [3, 4]. Leading social media forms include blogs, social networks, message boards, podcasts, and wikis.

With the growing use of social media, social network services (SNSs) have drawn attention as tools for information sharing, building connections, and expressing one’s ideas and tastes [5, 6]. SNSs have evolved from information sharing through social networks to generating and consuming new information [7, 8]. Their structure has evolved to generate and share various types of information while reprocessing them for further sharing [9,10,11,12,13]. Companies use SNS data gathered from platforms such as Twitter and Facebook to encourage customer participation, sharing, and conversation [14,15,16,17]. In the marketing field, SNS data are used to promote products or to assess product reputation. In the manufacturing field, SNS analysis is used to develop new products or to predict product performance. In the CRM field, SNS data are used to analyze customer requirements or to identify customer trends. As information is generated and shared exponentially on SNSs, schemes are required to selectively provide information to individuals and groups [18,19,20]. Therefore, research on human network analysis, influencer identification, and personalized suggestions is underway [14, 21,22,23,24].

Hot topic detection has been studied to identify public opinion or customer trends in various industries [25,26,27,28]. A hot topic is an event or a core theme that becomes an issue or interest at a particular time [29,30,31,32,33,34,35]. Jeelani and Singh [34] proposed a scheme for detecting hot topics on Twitter using a machine-learning algorithm that classifies tweets as positive or negative. Yu et al. [35] proposed a topic detection scheme that applies temporal distances to measure similarities between news and topics. However, existing schemes can identify insignificant or unreliable keywords as hot topics because they focus on the keyword occurrence frequency at specific times and use documents created by the unspecified majority. Moreover, the existing schemes cannot identify which keywords will become hot topics in the near future because they use only data generated at the time of detection.

If we can predict hot topics in the near future, we can keep up with future issues and problems. For example, a company can promote or sell related products based on the changes in user trends derived from hot topic prediction. In disaster safety, it is possible to identify the aftereffects of events and accidents and to work on countermeasures to minimize losses. In this paper, we propose a new scheme for predicting future hot topics in social media. The proposed scheme incorporates user influence and expertise to identify which keywords will become hot topics in the near future. Since documents written by users with a high level of influence and expertise are continuously propagated, the keywords contained in those documents can become hot topics in the near future. We consider user influence and expertise to determine the propagation of documents written by each user. We extract candidate keywords using a modified Term Frequency-Inverse Document Frequency (TF-IDF) algorithm to determine changes in keywords across different time intervals. We incorporate user influence and expertise for the keywords identified using the modified TF-IDF to increase the hot topic prediction accuracy. Finally, we predict near-future hot topics using the change rate over time.

The rest of this paper is organized as follows. Section 2 describes existing schemes for hot topic detection. Section 3 provides a detailed description of the scheme proposed for hot topic prediction. Section 4 demonstrates the proposed scheme’s performance over existing schemes. Section 5 provides the conclusion.

2 Related works

TF-IDF is used to identify major keywords in a specific document in information search and text mining [36, 37]. Term Frequency (TF) indicates the frequency at which a particular keyword appears in a document. The higher the frequency, the more important the keyword is in the document. Document Frequency (DF) indicates the number of documents that include a specific keyword; its reciprocal is the Inverse Document Frequency (IDF). TF-IDF is the value obtained by multiplying TF by IDF. A high TF-IDF for a keyword indicates that the keyword appears frequently in the document of interest but infrequently in other documents.
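The TF-IDF computation described above can be sketched as follows. This is a minimal illustration, not the exact formulation used in [36, 37]: it assumes raw counts for TF and the common logarithmic form \(\log (N/DF)\) for IDF, and the function name `tf_idf` and the sample documents are hypothetical.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {keyword: score} dict per tokenized document.

    Assumption: IDF is taken as log(N / DF), a common variant;
    the source text only states that IDF is derived from the reciprocal of DF.
    """
    n_docs = len(docs)
    # DF: number of documents that contain each keyword at least once.
    df = Counter(word for doc in docs for word in set(doc))
    scores = []
    for doc in docs:
        tf = Counter(doc)  # TF: raw count of each keyword in this document
        scores.append({w: tf[w] * math.log(n_docs / df[w]) for w in tf})
    return scores


# Hypothetical tokenized documents used only for illustration.
docs = [["ferry", "rescue", "ferry"], ["easter", "egg"], ["ferry", "easter"]]
scores = tf_idf(docs)
```

In this toy corpus, "rescue" scores higher than "ferry" in the first document even though it occurs only once, because "ferry" also appears in another document and is therefore less distinctive.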

Yang et al. [25] presented a scheme for identifying emerging rumors on social media based on hot topics. Hot topic detection that combines bursty term identification and sentence modeling is performed for rumor identification. To determine bursty terms, a skewness score, a timeless score, and a periodicity score are used. The sentence modeling uses a bursty term vector and a named entity vector to calculate the similarity between sentences. The bursty term vector is composed of the bursty terms found by the bursty term identification, and the named entity vector is made up of the named entities contained in a sentence.

Zhu and Yu [31] presented a prerecognition model for detecting hot topics. The prerecognition model finds potential hot topics during a given period. It clusters the original microblog messages to obtain topics and their amounts, and calculates the velocity and acceleration of each topic. Topic clustering is performed to classify microblog messages into different topics. Three factors, namely topic amount, topic hot velocity, and hot acceleration, are used to detect hot topics. To extract the periodic characteristics of hot topics, the topic life cycle is defined.

Yu et al. [35] proposed a hot topic detection scheme based on the similarity between news and topics. Noting that users want to get information quickly, this scheme detects hot topics with sudden, frequent mentions in the following steps. First, it captures the title, source, publication date, and content of each news item, removes stopwords, and applies incremental TF-IDF. Second, it calculates the cosine similarity between the news content and a topic to determine the relationship between them, and treats news with a higher cosine similarity to the topic as a part of the topic. Third, it calculates the temporal distance between the news’ publish time and the topic’s updated time, and determines the news with a higher temporal distance between the two times as a part of the topic. Finally, it determines a topic’s status based on the combination of the cosine similarity and the temporal distance.
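The cosine similarity step mentioned above can be illustrated with a short sketch. It assumes that a news item and a topic are each represented as a sparse mapping from keywords to TF-IDF weights; this representation and the function name `cosine_similarity` are assumptions made for illustration, not details taken from [35].

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two sparse {keyword: weight} vectors."""
    # Only keywords present in both vectors contribute to the dot product.
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # assumption: an empty vector is similar to nothing
    return dot / (norm_a * norm_b)


# Hypothetical weight vectors for one news item and one topic.
news = {"ferry": 1.0}
topic = {"ferry": 1.0, "rescue": 1.0}
similarity = cosine_similarity(news, topic)
```

A value near 1 indicates that the news item and the topic share most of their weighted keywords, whereas a value of 0 indicates no shared keywords.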

Kim et al. [32] proposed a hot topic detection scheme based on the change in the keyword occurrence frequency over time on Twitter. Geographic information is used to classify geographic communities because such communities exhibit similar fluctuation patterns of word frequency. To detect each day’s hot topics that were not tweeted on the previous day, the rate of word frequency is calculated. Considering only the keyword occurrence frequency, however, causes frequent everyday keywords to be identified as hot topics. They addressed this problem by calculating the change rate of the keyword occurrence frequency over time. They identified keywords with a high change rate as hot topics because frequent everyday keywords have low change rates.

3 The proposed hot topic prediction scheme

3.1 Overall procedure

The existing hot topic detection schemes do not guarantee result precision because they detect hot topics based on the frequency of keyword occurrence. Moreover, they are incapable of predicting future hot topics because they detect hot topics at a specific time. This paper presents a hot topic prediction scheme based on user influence and expertise in social media. We use Twitter, a representative social media service, for predicting hot topics. The proposed scheme identifies a set of candidate keywords using a modified TF-IDF that incorporates a temporal factor. Documents written by influential users with expertise on social media are more likely to be continuously shared and reprocessed by other users. Therefore, the hot topic prediction involves determining user influence and expertise based on the analysis of various user activities and networks on social media. The hot topic prediction indices of candidate keywords are calculated by considering user influence and expertise, and hot topic predictions are made based on changes in the hot topic prediction indices.

Figure 1 shows the overall procedure of the proposed hot topic prediction. The data collector collects social documents generated in real time, the human network, and user activities. The candidate keyword extraction selects, from the collected documents, candidate keywords that suddenly start being mentioned frequently at a specific time using a modified TF-IDF. The user analysis analyzes the human network and social media activities and determines user influence and expertise. The hot topic prediction calculates the hot topic prediction index by applying weights to candidate keywords based on influence and expertise, and identifies hot topics by comparing the change rates of the indices over time.

Fig. 1 Overall procedure

3.2 Candidate keyword extraction

The first step in hot topic prediction is to extract keywords from documents generated on Twitter using a morphological analyzer, followed by creating a set of meaningful keywords because not all extracted keywords are useful as hot topics. A set of meaningful keywords for hot topic detection is typically generated using TF-IDF. However, TF-IDF cannot extract keywords that are suddenly mentioned frequently because it does not consider the temporal factor. Therefore, the proposed method generates a set of keywords by modifying TF-IDF.

The modified TF-IDF can extract a set of keywords that are suddenly mentioned frequently because it considers the temporal factor. It extracts keywords with a high occurrence frequency in a specific time span from all data on Twitter. As shown in Eq. (1), the modified TF-IDF of keyword \(w\) at time \(t\) is the product of \(MTF_{t,w}\) and \(MIDF_{t,w}\). \(MTF_{t,w}\) is obtained using Eq. (2), where \(TF_{t,w}\) denotes keyword \(w\)’s occurrence frequency at time \(t\). \(MIDF_{t,w}\) denotes the change rate in IDF between time points, as shown in Eq. (3), where \(IDF_{t,w}\) denotes the IDF for keyword \(w\) at time \(t\) and \(IDF_{t - 1,w}\) denotes the IDF for keyword \(w\) at time \(t - 1\).

$$MTFIDF_{t,w} = MTF_{t,w} \times MIDF_{t,w}$$
(1)
$$MTF_{t,w} = \log (TF_{t,w} + 1)$$
(2)
$$MIDF_{t,w} = \frac{{IDF_{t,w} }}{{IDF_{t - 1,w} }}$$
(3)
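Equations (1)–(3) can be sketched in a few lines. The sketch assumes the natural logarithm, since the base is not stated, and it does not cover the case \(IDF_{t - 1,w} = 0\), which the equations leave undefined; the function name `modified_tf_idf` is hypothetical.

```python
import math


def modified_tf_idf(tf_t, idf_t, idf_prev):
    """Modified TF-IDF of one keyword at time t, following Eqs. (1)-(3).

    tf_t     : occurrence frequency of the keyword at time t (TF_{t,w})
    idf_t    : IDF of the keyword at time t (IDF_{t,w})
    idf_prev : IDF of the keyword at time t-1 (IDF_{t-1,w}), must be nonzero
    """
    mtf = math.log(tf_t + 1)  # Eq. (2); natural log is an assumption
    midf = idf_t / idf_prev   # Eq. (3): change rate of IDF between time points
    return mtf * midf         # Eq. (1)


# Hypothetical values: a keyword seen 9 times whose IDF doubled since t-1.
example_score = modified_tf_idf(tf_t=9, idf_t=2.0, idf_prev=1.0)
```

Because of the \(+1\) inside the logarithm, a keyword that does not occur at time \(t\) receives a score of zero regardless of its IDF ratio.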

3.3 Influence and expertise

The proposed scheme predicts near-future hot topics rather than identifying present hot topics. Predicting hot topics utilizes the propagation of user-written documents on social media. Most activities on social media are performed by users. Therefore, the influence and expertise of the users who write documents on social media are used as indicators of the documents’ propagation, because documents written by users with a high level of influence and expertise will likely be shared and reprocessed continuously, becoming a part of hot topics. Twitter user activities are highly correlated with influence. User influence is determined by considering followers, retweets, and mentions. \(IF_{t,u}\), the influence of user \(u\) at time \(t\), is obtained using Eq. (4). \(FR_{t,u}\) denotes the follower-based influence index; \(RT_{t,u}\) denotes the retweet-based influence index; \(MT_{t,u}\) denotes the mention-based influence index; \(NFR\), \(NRT\), and \(NMT\) respectively denote the normalization constants for \(FR_{t,u}\), \(RT_{t,u}\), and \(MT_{t,u}\).

$$IF_{t,u} = \frac{{\frac{{FR_{t,u} }}{NFR} + \frac{{RT_{t,u} }}{NRT} + \frac{{MT_{t,u} }}{NMT}}}{3}$$
(4)

Documents written by users with many followers will likely be frequently shared and reprocessed by those followers. In other words, users with more followers are assumed to have greater influence because their tweets can be propagated more widely. \(FR_{t,u}\) denotes the influence index for a user based on the number of followers at time \(t\), as shown in Eq. (5). \(NFR_{t,u}\) denotes the number of followers of user \(u\); \(MNFR_{t}\) denotes the maximum number of followers.

$$FR_{t,u} = \log \left( {\frac{{NFR_{t,u} }}{{MNFR_{t} }} + 1} \right)$$
(5)

A large number of retweets for a tweet means that the user who wrote the tweet is receiving a lot of attention from other users. In addition, when a user has a large number of followers, a tweet may be continuously retweeted to other users. Therefore, the retweet-based influence index considers both the number of retweets and the number of followers. \(RT_{t,u}\) is the retweet-based influence index at time \(t\) calculated by Eq. (6). \(NT_{t,u}\) denotes the total number of tweets generated by user \(u\), \(NRT_{t,u}\) denotes the number of retweets of tweets written by user \(u\), \(MNFR_{t}\) denotes the maximum number of followers as in Eq. (5), and \(NFFR_{t,u}\) denotes the number of followers of user \(u\). \(RT_{t,u}\) considers the average number of retweets per tweet for a user and the propagation of their retweeting followers.

$$RT_{t,u} = \log \left( {\frac{{NRT_{t,u} }}{{NT_{t,u} }} \times \frac{{NFFR_{t,u} }}{{MNFR_{t} }} + 1} \right)$$
(6)

A mention refers to a comment on a specific tweet or a tweet sent to a specific user. A large number of mentions means that the tweets will continue to spread and that other users will use them. As with the number of followers, a large number of mentions indicates that a user has a large influence. \(MT_{t,u}\) is the mention-based influence index at time \(t\) calculated by Eq. (7). \(NT_{t,u}\) denotes the total number of tweets written by user \(u\) and \(NMT_{t,u}\) denotes the number of mentions for tweets by user \(u\).

$$MT_{t,u} = \log \left( {\frac{{NMT_{t,u} }}{{NT_{t,u} }} + 1} \right)$$
(7)
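The influence index of Eqs. (4)–(7) can be sketched as follows. Several points are assumptions rather than details given in the text: the natural logarithm is used, the normalization constants default to 1, \(NFFR_{t,u}\) is taken to be the user's own follower count as the definition above states, and the class and function names (`UserStats`, `influence`) are hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass
class UserStats:
    followers: int  # NFR_{t,u}, also used as NFFR_{t,u} (assumption)
    tweets: int     # NT_{t,u}, must be positive
    retweets: int   # NRT_{t,u}
    mentions: int   # NMT_{t,u}


def influence(u, max_followers, nfr=1.0, nrt=1.0, nmt=1.0):
    """User influence IF_{t,u} following Eqs. (4)-(7).

    max_followers is the maximum follower count over all users;
    nfr, nrt, nmt are the normalization constants (default values are placeholders).
    """
    follower_ratio = u.followers / max_followers
    fr = math.log(follower_ratio + 1)                            # Eq. (5)
    rt = math.log(u.retweets / u.tweets * follower_ratio + 1)    # Eq. (6)
    mt = math.log(u.mentions / u.tweets + 1)                     # Eq. (7)
    return (fr / nfr + rt / nrt + mt / nmt) / 3                  # Eq. (4)


# Hypothetical user: 500 followers, 10 tweets, 20 retweets, 5 mentions.
user = UserStats(followers=500, tweets=10, retweets=20, mentions=5)
user_influence = influence(user, max_followers=1000)
```

Each of the three indices is zero for a user with no followers, retweets, or mentions, so such a user contributes no influence weight.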

A document written by experts in a field will likely be continuously used by users with an interest in that field. A document written by a user with expertise will also likely be continuously spread on social media. Moreover, the number of related documents will increase as many users share and reprocess them. In other words, documents written by experts will likely be part of a hot topic. User expertise indicates a user’s expertise in the content of the documents they create and is calculated using the number of embedded tweets, reliability, and expertise in the user’s preferred topic. \(PF_{t,u,c}\) denotes the expertise of user \(u\) at time \(t\) calculated by Eq. (8). \(c\) denotes the preferred topic’s category, \(ET_{t,u}\) denotes the expert index based on tweet embedding, \(ST_{t,u}\) denotes the expert index based on user reliability, and \(CE_{t,u,c}\) denotes the expert index based on the user’s preferred topic. \(NET\), \(NST\), and \(NCE\) denote the respective normalization constants for \(ET_{t,u}\), \(ST_{t,u}\), and \(CE_{t,u,c}\).

$$PF_{t,u,c} = \frac{{\frac{{ET_{t,u} }}{NET} + \frac{{ST_{t,u} }}{NST} + \frac{{CE_{t,u,c} }}{NCE}}}{3}$$
(8)

Embedding a tweet refers to the act of quoting a user-written tweet and is considered a proactive user activity. A large number of embedded tweets would indicate that a tweet is both trusted and high quality. \(ET_{t,u}\) is the expert index based on the number of embedded tweets for user \(u\) at time \(t\) calculated by Eq. (9). \(NT_{t,u}\) denotes the total number of tweets written by user \(u\) and \(NET_{t,u}\) denotes the number of embedded tweets for the user.

$$ET_{t,u} = \log \left( {\frac{{NET_{t,u} }}{{NT_{t,u} }} + 1} \right)$$
(9)

Malicious users interfere with normal users’ information acquisition by disseminating incorrect information or including the URLs of malicious sites on social media. Therefore, determining user expertise involves determining user reliability to identify malicious users. Generally, users are connected in social networks through followers and followings. However, malicious users tend to have fewer followers due to their weak social network, which results from the dissemination of incorrect information. In other words, malicious users have relatively few followers compared with the number of users they follow. Therefore, the expert index based on user reliability considers the numbers of followers and followings. \(ST_{t,u}\) is the expert index based on user \(u\)’s reliability at time \(t\) calculated by Eq. (10). \(NFR_{t,u}\) denotes the number of followers of user \(u\) and \(NFG_{t,u}\) denotes the number of users that user \(u\) follows.

$$ST_{t,u} = \log \left( {\frac{{NFR_{t,u} }}{{NFR_{t,u} + NFG_{t,u} }} + 1} \right)$$
(10)

Experts are interested in specific topics and generate and share relevant information on social media; therefore, determining expertise involves determining whether a user often mentions a specific topic on Twitter. \(CE_{t,u,c}\), the expert index for user \(u\)’s preferred topic at time \(t\), is obtained using Eq. (11). \(c\) denotes the preferred topic category, \(NKW_{t,u}\) denotes the total number of keywords extracted from documents written by the user on social media, and \(CKW_{t,u,c}\) denotes the number of keywords extracted from documents created on each topic.

$$CE_{t,u,c} = \log \left( {\frac{{CKW_{t,u,c} }}{{NKW_{t,u} }} + 1} \right)$$
(11)
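The expertise index of Eqs. (8)–(11) follows the same pattern as the influence index and can be sketched as below. The natural logarithm, the default normalization constants of 1, and the function and parameter names are assumptions made for illustration; all denominators are assumed to be positive.

```python
import math


def expertise(embedded, tweets, followers, followings,
              topic_keywords, total_keywords,
              net=1.0, nst=1.0, nce=1.0):
    """User expertise PF_{t,u,c} following Eqs. (8)-(11).

    embedded       : number of embedded (quoted) tweets, NET_{t,u}
    tweets         : total number of tweets, NT_{t,u}
    followers      : number of followers, NFR_{t,u}
    followings     : number of users followed, NFG_{t,u}
    topic_keywords : keywords from the user's documents in category c, CKW_{t,u,c}
    total_keywords : all keywords from the user's documents, NKW_{t,u}
    net, nst, nce  : normalization constants (default values are placeholders)
    """
    et = math.log(embedded / tweets + 1)                      # Eq. (9)
    st = math.log(followers / (followers + followings) + 1)   # Eq. (10)
    ce = math.log(topic_keywords / total_keywords + 1)        # Eq. (11)
    return (et / net + st / nst + ce / nce) / 3               # Eq. (8)


# Hypothetical accounts that differ only in their follower/following balance.
regular_user = expertise(5, 10, 300, 100, 40, 80)
suspicious_user = expertise(5, 10, 10, 990, 40, 80)
```

With otherwise identical activity, the account that follows many users but has few followers receives the lower score, which reflects the reliability term of Eq. (10).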

3.4 Hot topic prediction

Once user influence and expertise are determined, we predict hot topics from the candidate keywords. Figure 2 shows the hot topic prediction algorithm. Here, \(m\) denotes the number of candidate keywords. The hot topic prediction index is calculated by applying weights for user influence and expertise to the candidate keywords extracted using the modified TF-IDF. The hot topic value for each keyword is calculated by comparing the change rates of the hot topic prediction indices over time. When the hot topic values are sorted from highest to lowest, the top-\(k\) keywords are predicted as near-future hot topics.

Fig. 2 The algorithm of hot topic prediction

The hot topic prediction index indicates the likelihood that a candidate keyword becomes a hot topic. \(HTP_{t,w}\), the hot topic prediction index for keyword \(w\) at time \(t\), is calculated by Eq. (12). \(MTFIDF_{t,w}\) denotes the modified TF-IDF for keyword \(w\) at time \(t\) and \(KW_{t,w}\) denotes the keyword weight based on user influence and expertise. \(KW_{t,w}\) is obtained using Eq. (13), where \(\alpha\) is a weight that balances user influence against user expertise. When there are \(n\) tweets that include keyword \(w\) at time \(t\), \(KW_{t,w}\) is the average of the weighted influence and expertise of the \(n\) users who tweeted.

$$HTP_{t,w} = MTFIDF_{t,w} \times KW_{t,w}$$
(12)
$$KW_{t,w} = \frac{{\sum_{u = 1}^{n} {\left( {\alpha IF_{t,u} + (1 - \alpha )PF_{t,u,c} } \right)} }}{n}$$
(13)
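Equations (12) and (13) can be sketched as follows. The sketch assumes that the summand of Eq. (13) is the whole weighted term \(\alpha IF_{t,u} + (1 - \alpha )PF_{t,u,c}\), that \(\alpha\) lies between 0 and 1, and that a keyword with no tweeting users receives a weight of zero; the default \(\alpha = 0.5\) and the function names are hypothetical.

```python
def keyword_weight(users, alpha=0.5):
    """Keyword weight KW_{t,w} following Eq. (13).

    users : list of (influence, expertise) pairs, one per user who tweeted the keyword
    alpha : balance between influence and expertise (0.5 is a placeholder value)
    """
    if not users:
        return 0.0  # assumption: no tweeting users means no weight
    total = sum(alpha * inf + (1 - alpha) * exp for inf, exp in users)
    return total / len(users)


def hot_topic_prediction_index(mtfidf, users, alpha=0.5):
    """Hot topic prediction index HTP_{t,w} following Eq. (12)."""
    return mtfidf * keyword_weight(users, alpha)


# Hypothetical keyword tweeted by two users with (influence, expertise) scores.
tweeting_users = [(0.4, 0.2), (0.6, 0.8)]
weight = keyword_weight(tweeting_users)
htp = hot_topic_prediction_index(2.0, tweeting_users)
```

Setting \(\alpha = 1\) in this sketch reduces the weight to the average influence alone, and \(\alpha = 0\) reduces it to the average expertise alone.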

Finally, we predict hot topics by comparing the change rates of the hot topic prediction indices over time. Equation (14) gives the hot topic value \(HT_{t,w}\) for keyword \(w\) at time \(t\). \(HTP_{t,w}\) denotes the hot topic prediction index for keyword \(w\) at time \(t\) and \(HTP_{t - 1,w}\) denotes the hot topic prediction index for keyword \(w\) at time \(t - 1\).

$$HT_{t,w} = \frac{{HTP_{t,w} - HTP_{t - 1,w} }}{{HTP_{t,w} + HTP_{t - 1,w} }}$$
(14)
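The final ranking step can be sketched as follows: Eq. (14) is evaluated for each candidate keyword and the \(k\) keywords with the highest values are returned. Treating a keyword with no index at time \(t - 1\) as having an index of zero, and returning zero when both indices are zero, are assumptions of this sketch; the function names are hypothetical.

```python
def hot_topic_value(htp_t, htp_prev):
    """Hot topic value HT_{t,w} following Eq. (14)."""
    total = htp_t + htp_prev
    if total == 0:
        return 0.0  # assumption: undefined case treated as no change
    return (htp_t - htp_prev) / total


def predict_hot_topics(htp_now, htp_before, k):
    """Return the k keywords with the highest hot topic values.

    htp_now, htp_before : {keyword: HTP} at time t and t-1
    """
    values = {
        w: hot_topic_value(htp_now[w], htp_before.get(w, 0.0))
        for w in htp_now
    }
    return sorted(values, key=values.get, reverse=True)[:k]


# Hypothetical prediction indices for three candidate keywords.
now = {"a": 3.0, "b": 1.0, "c": 2.0}
before = {"a": 1.0, "b": 3.0, "c": 2.0}
top_keywords = predict_hot_topics(now, before, k=2)
```

For nonnegative indices the value of Eq. (14) lies between \(-1\) and \(1\): it is positive when the index is rising and negative when it is falling.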

4 Performance evaluation

The proposed hot topic prediction scheme’s performance is demonstrated through comparison with the performance of an existing hot topic detection scheme [32]. The experimental evaluation was conducted on 1,215,342 data points collected from April 1 to May 31, 2015 using the Twitter Streaming API [38]. To determine user influence and expertise, information such as Twitter users’ social networks and the numbers of retweets, mentions, and embedded tweets was gathered. Keywords were extracted from the tweets using the HanNanum Korean Morphological Analyzer [39]. Table 1 shows the performance evaluation setup. To show the superiority of the proposed method, the precision, recall, and F-Measure were compared. Equations (15), (16), and (17) define the precision, recall, and F-Measure, respectively, where \(N_{pt}\) is the set of predicted hot topics and \(N_{rt}\) is the set of real hot topics at the current time.

$$Precision = \frac{{\left| {N_{pt} \cap N_{rt} } \right|}}{{\left| {N_{pt} } \right|}} \times 100$$
(15)
$$Recall = \frac{{\left| {N_{pt} \cap N_{rt} } \right|}}{{\left| {N_{rt} } \right|}} \times 100$$
(16)
$$F\text{-}Measure = \frac{2 \times Precision \times Recall}{Precision + Recall}$$
(17)
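The three measures can be computed from two keyword sets as in the following sketch. It assumes that precision and recall are expressed as percentages and that the F-Measure is the harmonic mean of those two percentages, so that it is already a percentage; returning zero for empty sets is also an assumption, and the function name `evaluate` is hypothetical.

```python
def evaluate(predicted, real):
    """Return (precision, recall, f_measure) in percent for two keyword collections."""
    predicted, real = set(predicted), set(real)
    hits = len(predicted & real)  # predicted keywords that are real hot topics
    precision = 100 * hits / len(predicted) if predicted else 0.0
    recall = 100 * hits / len(real) if real else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    # Harmonic mean of two percentages is itself a percentage.
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


# Hypothetical keyword sets: two of four predictions match the three real topics.
precision, recall, f_measure = evaluate(["a", "b", "c", "d"], ["a", "b", "e"])
```

In this example the precision is 50%, because two of the four predicted keywords are real hot topics, and the recall is about 66.7%, because two of the three real hot topics were predicted.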
Table 1 Performance evaluation setup

Hot topics at the present time are detected by excluding stopwords and frequent everyday keywords from the detected hot topics based on the keyword occurrence frequency using the TF-IDF algorithm. Tables 2 and 3 show sets of top-10 hot topic keywords for May 1–7, 2015 detected using existing and proposed schemes, respectively. Performance evaluation is conducted by comparing these sets of hot topic keywords.

Table 2 Hot topic keywords predicted by the existing scheme
Table 3 Hot topic keywords predicted by the proposed scheme

A comparison of the hot topic values obtained for specific keywords using the existing and proposed schemes demonstrates that our scheme, which is based on the temporal factor and user influence, yields more reliable detection results than the existing scheme. The proposed scheme uses the change rates of the hot topic prediction indices to address the existing scheme’s problem of incorrectly identifying commonly used everyday words as hot topics. In Fig. 3a, b, a negative value means that the hot topic prediction index has decreased at the current time point; that is, a keyword with a negative value is no longer meaningful as a hot topic. Figure 3a shows the hot topic values for the keyword “Easter.” Under the proposed scheme, the hot topic prediction indices increased gradually leading up to Easter day (April 5th) and then decreased drastically, unlike those based on the existing scheme, and fell far below them, so that “Easter” was no longer a hot topic after Easter day. For keywords that are suddenly mentioned around a specific event, the proposed scheme’s detection result reliability was up to 39% higher than that of the existing scheme. Figure 3b shows a graph of the change rates in hot topic prediction indices for the keyword “Sewol Ferry.” The proposed scheme identified “Sewol Ferry” as a hot topic while it was heavily and continuously tweeted during the analyzed period, and the detection result reliability was up to 22% higher than that of the existing scheme. Figure 3c shows a graph of the change rates in hot topic prediction indices for the keyword “April Fool’s Day.” This keyword was detected as a hot topic when it was most frequently mentioned on April Fool’s Day, and the change rate was 26% higher for the proposed scheme than for the existing scheme. The performance evaluation results suggest that the proposed scheme outperforms the existing scheme because the additional consideration of user influence increases the results’ reliability.

Fig. 3 Change rates in hot topic prediction indices

Figures 4, 5 and 6 show the recall, precision, and F-measure, respectively, between the hot topics predicted using the proposed scheme and the observed hot topics for the first, second, and third weeks of May. As the proposed scheme was designed to predict hot topics, its prediction performance is assessed using the degree of correspondence between predictions and observations for each week. The proposed scheme generally outperformed the existing scheme, and the predicted hot topics became more similar to those observed as the days passed from the first to the third week. Based on how the hot topics changed over the 3 weeks, the precision and recall improved by 3% on average and the F-measure improved by 4% for the proposed scheme.

Fig. 4 Recall in May

Fig. 5 Precision in May

Fig. 6 F-measure in May

Figures 7, 8 and 9 show the recall, precision, and F-measure, respectively, between the hot topics predicted using the proposed scheme and the observed hot topics for April and May. The existing scheme detects hot topics based on the change rate of keyword frequency over time; a keyword is detected as a hot topic when its frequency increases rapidly. However, the existing scheme detects hot topics at the current time and is limited in predicting near-future hot topics. Since documents written by users with a high level of influence and expertise are continuously propagated and shared among users, we consider the propagation of documents based on user influence and expertise. In the proposed scheme, the keywords contained in documents written by users with a high level of influence and expertise are predicted as hot topics. Therefore, the accuracy of near-future hot topic prediction increases. The results demonstrate that the proposed scheme outperformed the existing scheme, with 83.48% recall in April and 84.56% recall in May. Regarding precision and F-measure, the proposed scheme, which incorporates the temporal factor and user influence, also showed higher detection-result reliability than the existing scheme.

Fig. 7 Recall in April and May

Fig. 8 Precision in April and May

Fig. 9 F-measure in April and May

5 Conclusion

This paper proposed a hot topic prediction scheme based on user influence and expertise. The proposed scheme extracts a set of keywords that suddenly occur using a modified TF-IDF algorithm. The scheme incorporates the influence and expertise of the users who create documents on social media to predict near-future hot topics. User influence is determined using the numbers of followers, retweets, and mentions; user expertise is determined using the number of embedded tweets, reliability, and preferred topic. Hot topic predictions are made by weighting each candidate keyword extracted using the modified TF-IDF with the average of the influence and expertise indices of the users who tweeted with that keyword. Future research plans include grouping interrelated keywords around specific events.