Keywords

1 Introduction

With the development of Internet technology and the emerging forms of media, the Internet entered into a mass of information age [1]. At the same time a new type of information sharing and publishing platform (such as Weibo emergences, so the degree of Internet users’ participation and active in China is showing explosive growth. Through this platform, users can express their views by posting some original essays or sharing information [2]. Therefore, effective feature learning and interest prediction [3] is significant not only for users (e.g., looking for users with similar interests [4]), etc.), but also for service providers in a set of application scenarios (e.g., user behavior analysis, personalized recommendation).

Most existing research considers user interest prediction from mainly three aspects: user registration information [5], browsing and posting history [6], and social interaction and relationship [7]. But prediction performance based on these aspects is unsatisfactory. In this paper, we investigate the social network environment and propose a Markov chain model that is feasible and efficient in predicting both long-term and short-term user interests. The contribution of this paper can be concluded that: first, develop specific data crawler to collect dataset from Sina Weibo. After features extraction through a set of operations, each user could be converted and represented as a feature vector; second, obtain user interest eigenvalue sequence by establishing a sole–Markov chain model; implement the SOM algorithm to find similarities among users and construct enhanced–Markov chain model that merges users into specific predefined interest categories; finally, conduct experiments to validate the feasibility and efficiency of the proposed solution; validate that the proposed solution can be implemented for both long-term and short-term user interest predictions.

The rest of the paper is organized as follows. Section 2 presents the background of social network and Markov Chain and reviews existing research on user interest prediction. Section 3 introduces dataset preprocess and feature vector extraction. Section 4 describes the construction of sole–Markov chain and enhanced–Markov chain models for user clustering and interest prediction. Experiments and evaluations are conducted in Sect. 5. Finally, conclusions are drawn in Sect. 6.

2 Related Work

In recent years, a lot of research on how to predict user interest has been undertaken. In the view of industry, Twitter and Facebook mention in their reports that they are using AI (deep learning) to understand the significance behind users’ posting messages. However, the detailed techniques are proprietary and refuse to public. In academia, related works are: Attenberg et al. [8] propose a user interest prediction mechanism by analyzing the content of messages posted by users or by analyzing their interest eigenvalues. Xu et al. [9] propose a modified author-topic model to discover topics of interest on Twitter by filtering interest-unrelated tweets (noisy posts) from the aggregated user profiles. Yan et al. [10] establish a human dynamic model codriven by interest and social identity and show that user interest in sending posts is positively correlated with the number of comments on their previous posts. In the field of future interest prediction. Nori et al. [11] propose a new graphic representation (Action Graph) for modeling user multinomial with time-evolving actions and predict user interest by computing the similarity between each user and a set of resources. However, this ignores the influence of the user’s friends on his or her interests.

In summary, existing research explores user interest from mainly three aspects: user registration information, browsing and posting history, social interaction and relationships. Compared to existing research, our work contains a few distinguished points: (a) investigate and consider the factor of time and influence among friends or similar users, and propose a possible solution that combines Markov model with clustering technology; (b) through the construction of the sole–Markov chain and enhanced–Markov chain models, the proposed solution is capable of providing excellent performance with the true positive rate of clustering reaching 88.5 %.

3 Dataset Collection and Analysis

We develop a specific data crawlers and feature collection mechanisms for dataset collection: Firstly, Hundred normal users (celebrity, company, and government that post, repost, and comment frequently) with 20 interest categories of Weibo messages are manually selected as the data source. Secondly, specific data crawlers are developed for the ordinary user and for the hot Weibo category. And finally, 4,612 Weibo users, and 20 categories of hot Weibo messages are extracted. After that, for each user, the basic user information (e.g., username) is retrieved by the Weibo API. Through the username, it is possible to retrieve a set of message IDs through which the text messages can be obtained. Finally, more than 16 million messages are crawled. Finally, for each user, a feature vector is constructed according to the crawled user and the message information with the operations in the following section.

As soon as the dataset is crawled, it is preprocessed [12] and each user’s messages are converted into a vector which will be used in the establishment of the model.

  1. 1.

    Word Segment. This paper uses the Chinese Institute of Computing segmentation system (ICTCLAS) [13] divided user message content into separated words.

  2. 2.

    Frequency Statistics. The TF-IDF (term frequency–inverse document frequency) [14] algorithm to obtain the keywords appearing in 20 predefined interest categories is used. Finally, the top 50 keywords in each category are extracted. After de-duplication, 579 keywords as user interest eigenvalues are obtained.

  3. 3.

    Feature Vectors Generation. Based on these 579 keywords, it is possible to convert each user’s messages into a feature vector.

4 Markov Based User Interest Modeling

Figure 1 illustrates the system framework of proposed interest prediction solution. The concept is: After generation of a series of feature vectors (described in previous section), Markov chain model is implemented to construct the prediction model, and generates a series of user interest categories.

Fig. 1.
figure 1

Overview of interest prediction model

4.1 Sole-Markov Chain Based Interest Prediction (SMC)

According to user interest eigen values, a sole–Markov chain model [15] can be constructed.

A sole–Markov chain model can be represented as a triplet, \( MC = < X,A,\lambda >, \) where X is a discrete random variable in the range of \( \left\{ {x_{1} ,x_{2} ,\, \ldots ,x_{n} } \right\}, \) in which each \( x_{i} \) represents user interest eigenvalue. A is the transition rate matrix and λ is the initial state distribution represented as followed:

$$ A = \left( {p_{ij} } \right) = \left[ {\begin{array}{*{20}l} {P_{11} } \hfill & {P_{12} } \hfill & \ldots \hfill & {P_{1j} } \hfill & {} \hfill & {P_{1n} } \hfill \\ {P_{21} } \hfill & {P_{22} } \hfill & \ldots \hfill & {P_{2j} } \hfill & {} \hfill & {P_{2n} } \hfill \\ \ldots \hfill & \ldots \hfill & \ldots \hfill & \ldots \hfill & \ldots \hfill & \ldots \hfill \\ {P_{i1} } \hfill & {P_{i2} } \hfill & \ldots \hfill & {P_{ij} } \hfill & \ldots \hfill & {P_{in} } \hfill \\ \ldots \hfill & \ldots \hfill & \ldots \hfill & \ldots \hfill & \ldots \hfill & \ldots \hfill \\ {P_{n1} } \hfill & {P_{n2} } \hfill & \ldots \hfill & {P_{nj} } \hfill & \ldots \hfill & {P_{nn} } \hfill \\ \end{array} } \right] $$
(1)
$$ \lambda = \left( {p_{i} } \right) = \left( {p_{1} ,p_{2} ,\, \ldots ,p_{n} } \right) $$
(2)

Where \( p_{ij} = P\left( {X_{t} = x_{j} \,\left| {\,X_{t - 1} = x_{i} } \right.} \right) \) refers to the transition probability from state \( x_{i} \) to state \( x_{j}; \,p_{i} = P\left( {X_{t = 0} = x_{i} } \right). \)

Through the collection of user messages in a certain time period, it is possible to extract a sequence of user interest eigenvalues (variables x). After that, with the maximum likelihood estimation function, it is possible to estimate the value of the parameters in a SMC model, referred to in the following formulas:

$$ p_{ij} = \frac{{S_{ij} }}{{\sum\limits_{j = 1}^{n} {S_{ij} } }}\,\,\,\,\,\,\,p_{i} = \frac{{\sum\limits_{j = 1}^{n} {S_{ij} } }}{{\sum\limits_{i = 1}^{n} {\sum\limits_{j = 1}^{n} {S_{ij} } } }} $$
(3)

Where \( S_{ij} \) refers to the number of state pairs \( (x_{i} ,x_{j} ) \) appearing in users’ messages posted.

Let vector H(t) = [0, 0,…,1] refer to the user interest eigenvalue sequence in time point t and \( V\left( t \right) = \left[ {P\left( {X_{t} = x_{1} } \right),P\left( {X_{t} = x_{2} } \right),\, \ldots ,P\left( {X_{t} = x_{n} } \right)} \right] \) refer to the probability of each eigenvalue. Therefore, it is possible to predict user interest eigenvalue with the formula 4. And the most related user interest eigenvalue is the highest probability value in vector \( V\left( t \right) \). With the multistage weighted combination model (for considering historical user interest eigenvalues), prediction accuracy can be improved as represented in formulas 5 and 6:

$$ V(t) = H\left( {t - 1} \right) \times A $$
(4)
$$ \begin{aligned} V\left( t \right) & = w_{1} H\left( {t - 1} \right) \times A^{1} + w_{2} H\left( {t - 2} \right) \times A^{2} \\ & + \cdots + w_{h} H\left( {t - h} \right) \times A^{h} \\ \end{aligned} $$
(5)
$$ w_{1} + w_{2} + \cdots + w_{h} = 1 $$
(6)

Experimental results (Sect. 5.2) show that prediction accuracy increases with the higher value of h, until stabilized eventually.

4.2 Enhanced-Markov Chain Based Interest Prediction (EMC)

Based on the SMC, we further propose an Enhanced-Markov chain based interest prediction. First, two assumptions about user interest eigenvalue sequences are described:

Assume that there are K categories of interests, represented as \( C = \left\{ {c_{1} ,c_{2} ,\, \ldots ,c_{k} } \right\},P\,\left( {C = c_{k} } \right) \) refers to the probability of the i-th category the user belongs to, then, for each user:

$$ \sum\limits_{k = 1}^{K} {P\left( {C = c_{k} } \right) = 1} $$
(7)

Assume that users in the same interest category have similar behavior features and that the corresponding interest eigenvalues sequences are random process that follow the discrete homogeneous Markov chain.

With above two assumptions, it is possible to construct a user interest prediction classification model containing multiple Markov chains, known as the EMC model.

The EMC interest model is defined as a quaternion: \( < X,K,P\left( C \right),MC>, \) in which X is a discrete random variable in range \( \left\{ {x_{1} ,x_{2}^{{}} ,\, \ldots ,x_{n} } \right\}, \) where each xi represents an interest eigenvalue, \( C = \left\{ {c_{1} ,c_{2} ,\, \ldots ,c_{k} } \right\} \) represents a group of user interest categories with the number k, \( P\,\left( {C = c_{k} } \right) \) refers to the probability of the i-th category the user belongs to, \( MC = \left\{ {mc_{1} ,mc_{1} ,\, \ldots ,mc_{k} } \right\} \) expresses a set of Markov chains and each element \( mc_{k} \) is the Markov eigenvalue chain that belongs to a specific category \( c_{k} \). The transition rate matrix \( A_{k} \) of \( mc_{k} \) and the initial state distribution \( \lambda_{k} \) could be expressed as:

$$ A = \left( {p_{kij} } \right) = \left[ {\begin{array}{*{20}l} {P_{k11} } \hfill & {P_{k12} } \hfill & \ldots \hfill & {P_{k1j} } \hfill & {} \hfill & {P_{k1n} } \hfill \\ {P_{k21} } \hfill & {P_{k22} } \hfill & \ldots \hfill & {P_{k2j} } \hfill & {} \hfill & {P_{k2n} } \hfill \\ \ldots \hfill & \ldots \hfill & \ldots \hfill & \ldots \hfill & \ldots \hfill & \ldots \hfill \\ {P_{ki1} } \hfill & {P_{ki2} } \hfill & \ldots \hfill & {P_{kij} } \hfill & \ldots \hfill & {P_{kin} } \hfill \\ \ldots \hfill & \ldots \hfill & \ldots \hfill & \ldots \hfill & \ldots \hfill & \ldots \hfill \\ {P_{kn1} } \hfill & {P_{kn2} } \hfill & \ldots \hfill & {P_{knj} } \hfill & \ldots \hfill & {P_{knn} } \hfill \\ \end{array} } \right] $$
(8)
$$ \lambda_{k} = \left( {p_{ki} } \right) = \left( {p_{k1} ,p_{k2} ,\, \ldots ,p_{kn} } \right) $$
(9)

According to Definition 3, based on user interest eigenvalue sequences, it is possible to construct a set of Markov chains. Formula 10 are expressed for calculating \( p_{kij} \), with \( p_{ki} \) belonging to \( A_{k} \). Where k represents the number of interest categories; \( S_{kij} \) represents the number of status pairs \( \left( {x_{i} ,x_{j} } \right) \) appearing in user content; \( a_{kij} \) is a super parameter as formula 11 shows:

$$ p_{kij} = \frac{{S_{kij} + \alpha_{kij} }}{{\sum\limits_{j = 1}^{n} {\left( {S_{kij} + \alpha_{kij} } \right)} }}\,\,\,p_{ki} = \frac{{\sum\limits_{j = 1}^{n} {S_{kij} + \alpha_{kij} } }}{{\sum\limits_{i = 1}^{n} {\sum\limits_{j = 1}^{n} {\left( {S_{kij} + \alpha_{kij} } \right)} } }} $$
(10)
$$ \alpha_{kij} = \frac{\beta }{n \times n} $$
(11)

Where β is the constant value of the problem domain size n.

In cases where the transfer matrixes of two users have a high degree of similarity, \( \delta_{kl} , \) let the two matrixes merge together, with the calculation formulas:

$$ CE\left( {p_{ki} ,p_{li} } \right) = \sum\limits_{j = 1}^{n} {p_{kij} \log \frac{{p_{kij} }}{{p_{lij} }}} $$
(12)
$$ S{\text{imilarity}}\,\left( {A_{k} ,A_{l} } \right) = \sum\limits_{i = 1}^{n} {CE\left( {p_{ki} ,p_{li} } \right)} /n $$
(13)
$$ \begin{aligned} \delta_{kl} =\, & S{\text{imilarity}}\left( {mc_{k} ,mc_{l} } \right) \\ = & \frac{2}{{S{\text{imilarity}}\left( {mc_{k} ,mc_{l} } \right) + S{\text{imilarity}}\left( {mc_{l} ,mc_{k} } \right)}} \\ \end{aligned} $$
(14)

Where \( CE\left( {p_{ki} ,p_{li} } \right) \) is the cross entropy of \( p_{ki} \) and \( p_{li} \). In case the \( \delta_{kl} \) value falls between \( mc_{k} \) and \( mc_{l} \), the Markov chain is large enough or infinite and the corresponding two users are regarded to be in the same interest category, with merging formulas that follow:

$$ p_{{\left( {k + l} \right)ij}} = \frac{{S_{kij} + S_{lij} + \alpha_{{\left( {k + l} \right)ij}} }}{{\sum\limits_{j = 1}^{n} {\left( {S_{kij} + S_{lij} + \alpha_{(k + l)ij} } \right)} }} $$
(15)
$$ p_{{\left( {k + l} \right)i}} = \frac{{\sum\limits_{j = 1}^{n} {\left( {S_{kij} + S_{lij} + \alpha_{{\left( {k + l} \right)ij}} } \right)} }}{{\sum\limits_{i = 1}^{n} {\sum\limits_{j = 1}^{n} {\left( {S_{kij} + S_{lij} + \alpha_{{\left( {k + l} \right)ij}} } \right)} } }} $$
(16)

Step by step, user interest prediction with the EMC model can be generated.

5 Experiment Analysis

In this section, an experimental analysis of our proposed solution from four aspects is introduced: a user clustering experiment, prediction comparisons between the SMC and EMC models. Based on the collected dataset containing 4600 users, 3700 users are randomly selected with the Pareto principle as training data and the messages of the remaining 900 users are used as testing data. For simplification, the impact of the events that cause interruption is neglected. The experiment is carried out in MATLAB environment running in two Core i5-3470, 2 * 3.20 GHZ CPU.

5.1 User Clustering

The SOM neural network algorithm [16] is used to cluster users. The SOM algorithm reduces user n-dimensional original transfer matrixes into two-dimensional matrixes and keeps the original topology of the user transfer matrix. From the clustering result shown in Fig. 2, it can be seen that a set of clusters are formed, but these clusters have a lot of noise data. Further investigation reveals that this is caused by a group of spammers who might distribute spam messages in a lot of interest fields. These spam messages greatly reduce prediction accuracy. For noise filtering, the independent component analysis method (provided by Matlab) to remove spammers is imported and denoised user interest clusters are obtained, as shown in Fig. 3.

Fig. 2.
figure 2

User classification based on SOM neural network algorithm

Fig. 3.
figure 3

User classification after denoised

5.2 Prediction Comparisons Between the SMC and EMC Models

In this experiment, the selected user in Sect. 5.1 was reused and the prediction of accuracy between the SMC and EMC models was compared. The results in Fig. 4 show that (1) the prediction is independent of the number of order h (similar to the result of the EMC model described in Sect. 4.2) and (2) the EMC model is capable of separating interest categories with bigger intervals (from 0.03 to 0.35) than the SMC model and, therefore, capable of obtaining the most suitable interest category classification result.

Fig. 4.
figure 4

Prediction accuracy of sole and enhanced-Markov chain model

From the training dataset, 20 users from each category, with 400 users in total, are randomly selected. After that, both SMC and EMC based approaches are implemented, with the average prediction of accuracy results listed in Table 1. The results show that the average value of the SMC model is only 0.5249, with a variance value of 0.0460, while the corresponding values of the EMC model are 0.8699 and 0.0007. This further proves that the EMC based interest prediction model is capable of achieving better accuracy of prediction.

Table 1. Predictive accuracy of sole and enhanced-Markov chain model

6 Conclusions

Social user interest prediction has become an important topic in the social network research field. In this paper, interest prediction based on the Markov chain modeling on clustered users is introduced. The solution considers user content feature and obtains user interest eigenvalue sequences to a establish SMC model; implement user clustering algorithms to construct a EMC model that classifies different users into specific predefined interest categories. Through a multitude of analyses, experiments and evaluations, it can be concluded that the proposed solution is feasible, efficient, and capable of achieving a much higher accuracy of prediction than any of the other existing approaches.