1 Introduction

Interest prediction in social networks is important both for users (e.g., community participation, activity initiation, etc. [23, 27, 36]) and for social network service providers in a series of applications (e.g., behavior analysis, service recommendation, etc. [12, 37]). However, because of the 4 V characteristics of social multimedia data (huge volume, high variety, low value, and fast velocity), feasible and efficient user interest prediction is not a trivial research challenge [18, 29]. On the other hand, just as in everyday life, users in social networks differ from each other in many features. For instance, different users have different numbers of posted messages (e.g., text, images, videos, etc.), online time, and behavior histories. Hence, depending on the features or environment, social user interest prediction may require different soft computing approaches.

Existing research is mainly based on three types of information: user registration profiles [43], behavior history [3], and social relationships [13]. However, few of the existing approaches are efficient, complete, and open sourced. This paper considers clustering algorithms as a typical soft computing technology (or computational intelligence) and proposes a combination of Gaussian and Markov models (namely, GAM) for social user interest prediction. Clustering technologies are chosen for the following reasons. First, unsupervised machine learning algorithms are normally computationally efficient, especially in big data environments [4, 33]; second, clustering mechanisms take similarity calculation into consideration for better performance. In particular, we select the combination of Gaussian and Markov models as the detailed solution. As described in Section 5, the Gaussian content based approach provides accurate results with low computation, whereas the Markov status based approach is capable of providing higher availability.

The clustering approach proposed in this paper is a form of soft computing technology. Due to the specific implementation scenario of social multimedia data, clustering based interest prediction requires the participation of computational intelligence. In general, this paper makes three contributions to recent advances in soft computing technology:

  • We investigate Gaussian and Markov based clustering approaches (model description, complexity, etc.) for user interest prediction in social networks. Consequently, a compromised GAM model is proposed, which selects either the Gaussian or the Markov approach according to the key parameter “number of posted messages”.

  • A specific data crawler is developed to collect Sina Weibo data as the testing dataset. After that, clustering experiments, strategy selection, and performance evaluations are conducted to show the feasibility and efficiency of the proposed solution.

  • Through suitable data pre-processing and parameter adjustment, the proposed model achieves 94.3% prediction accuracy, which, to the best of our knowledge, is the highest reported. Additionally, performance results, model scalability, and computation efficiency are discussed to justify our contributions.

Please note that our approach is a generic solution applicable to other social data (Twitter, Facebook, and so on). The paper is organized as follows. Section 2 investigates social user interest prediction and existing research. Section 3 discusses dataset preprocessing, feature extraction, and user annotation. Section 4 investigates the GMM and MCM approaches respectively and introduces our proposed GAM model. Section 5 illustrates the experiments, analysis, and discussion. Finally, Section 6 summarizes the paper.

2 Social interest prediction

2.1 Social network and user interest

Online social networks have become major platforms for internet users to post multimedia messages (e.g., text, pictures, videos, etc.) and to discuss and share interesting topics [33]. In social networks, interest is usually represented by posted messages describing events or willingness, such as what users want to do or buy, where they want to go, or whom they want to meet, follow, or vote for [10]. Therefore, interest exploration in social networks is an important part of user behavior analysis, since it can support a series of extension services such as community detection, targeted advertisement, personalized recommendation, and so on [34].

This paper investigates and collects a dataset from Sina Weibo, which had over 350 million users and was the eighth most frequently visited social network in the world as of December 2017 [25, 31]. Based on this dataset, the investigation and further experiments are convincing and scalable. Therefore, this paper is highly relevant to soft computing research on social networks and multimedia big data.

2.2 Existing works

In industry, both Twitter and Facebook have initiated research projects that implement machine learning technologies for user behavior analysis, according to their annual reports [32]. However, the details are unknown to the public.

In academia, the initial attempts explore relevant messages entered by users as interest information in order to establish user interest prediction models [1, 6]. Abel et al. [2] extend users’ basic information through tagged user profiles and develop a cross-system user model to find user interests and improve recommendation quality. Xu et al. [35] filter interest-unrelated noisy posts according to aggregated user profiles and, to some extent, discover user interests. These approaches are based on users’ registration information; however, the results may not be accurate because the entered user information is incomplete.

Besides, there are other methods based on social relationship analysis. Xiaoling S et al. [28] propose an agent-based interest awareness model that considers social ties formed or reinforced between two individuals if they have similar interests. Xiao H et al. [15] capture various social features and investigate social inference based on interest similarity to realize user interest prediction. Fadaee et al. [11] convert a social network into a Bernoulli based unweighted structure model and predict user interest categories according to the structural differences between different categories of networks. Nori et al. [21] import graph theory to model users’ time-evolving behavior and predict user interest categories via similarity computation. However, these approaches are incomplete because the social relationship is only one part of the features for interest exploration.

Other approaches are based on feature exploration of social network content. For example, Attenberg et al. [5] predict user interests by analyzing the content of posted messages. Banerjee et al. [7] collect Twitter data and apply statistical and mining techniques to explore user interest distributions over categories, e.g., food, sport, movies, etc. Literature [19] considers the imbalanced data of social users and introduces an overall-distribution weighted ELM (ODW-ELM) model for predicting users’ future interests. [40] considers the evolution of user interests and utilizes semantic information from knowledge bases such as Wikipedia to predict users’ future interests and overcome the cold item problem. [24] proposes a multilevel deep belief network learning-based model of users’ consumption preferences, based on the interactions between users’ preferential behaviors. Our previous research [42] also proposes a Markov chain model on clustered users to predict user interests. However, the disadvantage of the above approaches is that most of them define complicated computation logic that causes a heavy system burden. Besides, these approaches only classify each user into a single interest category, while in reality each user may have multiple interests. Additionally, none of these approaches achieves excellent performance (most are between 60% and 80% in prediction accuracy).

Generally, this paper is an extension of our previous work [42] with a few significant improvements:

  1. This paper integrates Gaussian and Markov based approaches, which achieves lower computation complexity and better performance.

  2. Both theoretical analysis and experiments illustrate that, via inspection of only the parameter “number of posted messages”, our proposed solution is capable of selecting the optimal handling logic. This makes the model easy to implement.

  3. A prediction accuracy of 94.3% is obtained with suitable parameter adjustment (weeding out the influence of swing users), which, to the best of our knowledge, is the highest reported.

3 Dataset analysis

3.1 Dataset collection for social networks

Similar to most social media platforms, the public Weibo developer API (specifically, the user_timeline API) only provides downloading of the recent messages of authorized users. This is an obstacle to the data collection process. To solve this problem, a specific data crawler and feature collection mechanism are developed. Specifically, we manually select source data covering 20 interest categories, containing 100 normal users (who post, repost, or comment frequently), as the data source. After that, a specific data crawler is developed for dataset collection. The data crawler contains two classes: a WeiboCrawler class for collecting user related information, especially posted messages, followees’ IDs, etc.; and a FolloweeCrawler class for collecting followees’ posted messages. Finally, 30,116 Weibo users with around 17 million messages (from 20 January 2017 to 1 April 2017) are acquired.
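A minimal sketch of the two crawler classes is given below. It is illustrative only: the endpoint paths, parameter names, and paging logic are assumptions standing in for the actual calls made against the authorized Weibo API.

```python
import requests

API_BASE = "https://api.weibo.com/2"  # assumed API host; adjust to the authorized endpoint


class WeiboCrawler:
    """Collects user related information: posted messages and followee IDs."""

    def __init__(self, access_token):
        self.access_token = access_token

    def fetch_timeline(self, uid, pages=5, count=100):
        # user_timeline only returns the recent messages of authorized users,
        # so we page backwards as far as the API allows.
        messages = []
        for page in range(1, pages + 1):
            resp = requests.get(
                f"{API_BASE}/statuses/user_timeline.json",  # assumed path
                params={"access_token": self.access_token, "uid": uid,
                        "page": page, "count": count},
            )
            resp.raise_for_status()
            statuses = resp.json().get("statuses", [])
            if not statuses:
                break
            messages.extend(s["text"] for s in statuses)
        return messages

    def fetch_followee_ids(self, uid):
        resp = requests.get(
            f"{API_BASE}/friendships/friends/ids.json",  # assumed path
            params={"access_token": self.access_token, "uid": uid},
        )
        resp.raise_for_status()
        return resp.json().get("ids", [])


class FolloweeCrawler(WeiboCrawler):
    """Collects the posted messages of every followee of a given user."""

    def fetch_followee_messages(self, uid):
        return {fid: self.fetch_timeline(fid) for fid in self.fetch_followee_ids(uid)}
```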

Figure 1 illustrates the distribution of “the number of posted messages”. It shows that most normal users post/repost 250-550 messages (including text, image, video, etc.) in around 70 days.

Fig. 1 The distribution of “the number of posted messages”

3.2 Feature vector extraction

After dataset collection, the feature vectors can be generated according to the following steps [22]:

  1. Word Segmentation and Frequency Statistics. By filtering out image and video content and deploying the ICTCLAS (Institute of Computing Technology, Chinese Lexical Analysis System) word segmentation system [8, 30], separate words can be extracted from the Weibo messages. After that, using the affiliated TF-IDF (term frequency–inverse document frequency) API [9], the top 50 keywords for each of the 20 predefined interest categories can be obtained. Consequently, the total number of keywords is 20 * 50 = 1000.

  2. De-duplication and Feature Vector Generation. After manual re-inspection to remove redundancy, we obtain 579 keywords, based on which a feature vector of dimension 1 × 579 can be generated (see the sketch below).
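The two steps can be sketched as follows, with scikit-learn's TfidfVectorizer standing in for the segmentation and TF-IDF APIs cited above; the function names, and the assumption that messages are already segmented into whitespace-separated tokens, are illustrative.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer


def top_keywords_per_category(category_docs, top_k=50):
    """category_docs: dict mapping category name -> list of segmented messages
    (whitespace-separated tokens). Returns the de-duplicated union of the
    per-category top-k keywords (579 keywords after manual re-inspection)."""
    keywords = set()
    for _, docs in category_docs.items():
        vectorizer = TfidfVectorizer()
        tfidf = vectorizer.fit_transform(docs)             # documents x vocabulary
        scores = np.asarray(tfidf.mean(axis=0)).ravel()    # average TF-IDF per term
        terms = np.array(vectorizer.get_feature_names_out())
        keywords.update(terms[np.argsort(scores)[::-1][:top_k]])
    return sorted(keywords)


def feature_vector(user_messages, keywords):
    """Build a 1 x len(keywords) keyword-frequency vector for one user."""
    text = " ".join(user_messages)
    return np.array([text.count(kw) for kw in keywords], dtype=float)
```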

3.3 User annotations

Among the 30,116 users, we randomly select 4000 users and assign three volunteers to handle the annotation work, marking each user's interest category according to the message history. The three volunteers annotate independently, without interfering with each other. In case a user is marked with different categories, majority voting is applied to make the decision. Finally, the user numbers and corresponding categories are illustrated in Table 1.
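The majority voting rule can be written in a couple of lines; the sketch below assumes each user's three annotations are gathered into a list.

```python
from collections import Counter


def majority_vote(labels):
    """Return the interest category chosen by most volunteers."""
    return Counter(labels).most_common(1)[0][0]


# Example: two volunteers mark "Sport", one marks "Entertainment" -> "Sport".
print(majority_vote(["Sport", "Entertainment", "Sport"]))
```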

Table 1 User number and corresponding Category

4 Solution

Figure 2 illustrates the overview of the proposed solution. After feature vector generation (described in Section 3), clustering algorithms (e.g., the Markov chain model, the GMM model, and so on) are applied to construct the prediction model.

Fig. 2 System overview

4.1 GMM based prediction

4.1.1 Gaussian mixture model

According to [38], the Gaussian mixture model is described by the following formula:

$$ p(x)=\sum \limits_{k=1}^K{\pi}_kN\left(x|{\mu}_k,{\varSigma}_k\right) $$
(1)

where N(x| μk, Σk) is the Gaussian density function, and μk, Σk, and πk are the corresponding mean, covariance, and mixing coefficient, respectively. According to the sum and product rules, the marginal density is:

$$ p(x)=\sum \limits_{k=1}^Kp(k)p\left(x|k\right) $$
(2)

Suppose that the total number of messages a user has published is s, with s ~ N(μ, σ), and that the classification number k and s are independent of each other. We then have the following theorem:

  • Theorem 1: the prediction accuracy p(x) is a monotonically increasing function of s.

  • Proof: p(x| k) in formula (2) can be transformed to p(x| k, s), as follows:

$$ p\left(x|k,s\right)=\frac{p\left(x,k,s\right)}{p\left(k,s\right)} $$
(3)

Since the parameters k and s are independent of each other, p(k, s) = p(k) × p(s) and p(s| x, k) = p(s| x), so formula (3) can be transformed to:

$$ p\left(x|k,s\right)=\frac{p\left(x,k\right)\times p\left(s|x\right)}{p(k)\times p(s)}=\frac{p\left(x,k\right)\times \frac{p\left(s,x\right)}{p(x)}}{p(k)\times p(s)}=\frac{p\left(x,k\right)}{p(x)\times p(k)}\times p\left(x|s\right) $$
(4)

Since \( \frac{p\left(x,k\right)}{p(x)\times p(k)} \) is not affected by s and p(x| s) increases with s, the theorem is proved.

4.1.2 EM steps

The log-likelihood corresponding to formula (1) is given by:

$$ \ln p\left(X|\pi, \mu, \varSigma \right)=\sum \limits_{n=1}^N\ln \left\{\sum \limits_{k=1}^K{\pi}_kN\left({x}_n|{\mu}_k,{\varSigma}_k\right)\right\} $$
(5)

where X = {x1, ..., xN}.

Additionally, the EM algorithm [20, 39] is implemented with the following steps:

  1. Initialize μk, Σk, and πk, and calculate the initial likelihood.

  2. E-step:

$$ \gamma \left({z}_{nk}\right)=\frac{\pi_kN\left({x}_n|{\mu}_k,{\varSigma}_k\right)}{\sum \limits_{j=1}^K{\pi}_jN\left({x}_n|{\mu}_j,{\varSigma}_j\right)} $$
(6)
  3. M-step:

$$ {\mu}_k^{new}=\frac{1}{N_k}\sum \limits_{n=1}^N\gamma \left({z}_{nk}\right){x}_n, $$
(7)
$$ {\varSigma}_k^{new}=\frac{1}{N_k}\sum \limits_{n=1}^N\gamma \left({z}_{nk}\right)\left({x}_n-{\mu}_k^{new}\right){\left({x}_n-{\mu}_k^{new}\right)}^T $$
(8)
$$ {\pi}_k^{new}=\frac{N_k}{N}, $$
(9)

Where

$$ {N}_k=\sum \limits_{n=1}^N\gamma \left({z}_{nk}\right). $$
(10)
  4. Log-likelihood evaluation:

$$ \ln p\left(X|\pi, \mu, \varSigma \right)=\sum \limits_{n=1}^N\ln \left\{\sum \limits_{k=1}^K{\pi}_kN\left({x}_n|{\mu}_k,{\varSigma}_k\right)\right\} $$
(11)

The algorithm returns to the E-step (step 2) until the convergence criterion is satisfied. Consequently, the optimized parameters and resulting values are obtained.
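A compact NumPy sketch of the above EM loop (Eqs. (6)–(11)) is given below; the initialization scheme, the covariance regularization term, and the convergence tolerance are illustrative choices rather than the exact settings used in our experiments.

```python
import numpy as np
from scipy.stats import multivariate_normal


def _weighted_densities(X, pi, mu, sigma):
    # Column k holds pi_k * N(x_n | mu_k, Sigma_k) for all n.
    return np.column_stack([
        pi[k] * multivariate_normal.pdf(X, mean=mu[k], cov=sigma[k], allow_singular=True)
        for k in range(len(pi))])


def gmm_em(X, K, max_iter=100, tol=1e-6, seed=0):
    """X: (N, D) feature matrix; K: number of interest categories."""
    rng = np.random.default_rng(seed)
    N, D = X.shape
    # Step 1: initialize mu_k, Sigma_k, pi_k and the initial likelihood.
    mu = X[rng.choice(N, size=K, replace=False)].copy()
    sigma = np.stack([np.eye(D)] * K)
    pi = np.full(K, 1.0 / K)
    prev_ll = np.log(_weighted_densities(X, pi, mu, sigma).sum(axis=1)).sum()
    for _ in range(max_iter):
        # Step 2 (E-step): responsibilities gamma(z_nk), Eq. (6).
        dens = _weighted_densities(X, pi, mu, sigma)
        gamma = dens / dens.sum(axis=1, keepdims=True)
        # Step 3 (M-step): update the parameters, Eqs. (7)-(10).
        Nk = gamma.sum(axis=0)
        mu = (gamma.T @ X) / Nk[:, None]
        for k in range(K):
            diff = X - mu[k]
            sigma[k] = (gamma[:, k, None] * diff).T @ diff / Nk[k] + 1e-6 * np.eye(D)
        pi = Nk / N
        # Step 4: evaluate the log-likelihood, Eq. (11), and test convergence.
        ll = np.log(_weighted_densities(X, pi, mu, sigma).sum(axis=1)).sum()
        if abs(ll - prev_ll) < tol:
            break
        prev_ll = ll
    return pi, mu, sigma, gamma
```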

4.1.3 Time complexity of GMM approach

  • Theorem 2: the time complexity of the GMM algorithm for social interest prediction is O(n²k), given the number of interest categories k and the number of posted messages n.

  • Proof: Initializing the variables of the k categories takes O(k); the E-step calculation takes O(nk); the M-step calculation takes O(n²k); and evaluating the maximum likelihood function takes O(nk). Therefore, the GMM time complexity is O(n²k), and the theorem is proved.

4.1.4 Effect of the number of posted messages

Theoretically, with an increasing number of posted messages s, the value of p(x) increases (refer to the Matlab simulation result in Fig. 3). It is obvious that (1) GMM is capable of achieving high prediction results (for example, over 0.9 when the number of posted messages exceeds 375); but (2) GMM may not work efficiently when the number of posted messages is insufficient (for instance, the prediction accuracy falls below 0.7 when s is less than 175). Therefore, in order to further improve prediction accuracy, another method needs to be introduced.

Fig. 3 Simulation result of the effect of s on p(x)

4.2 Markov chain model (MCM)

GMM based interest prediction is a content based approach that requires as many posted messages as possible, which may not be effective for users whose posted messages are inadequate. On the other hand, the Markov model is a status based prediction approach that can generate reliable results as long as its status chain has been constructed [17]. Therefore, Markov based interest prediction can be implemented to further improve prediction accuracy.

4.2.1 Markov chain model

Our previous work [42] modeled user interest prediction in social networks as a triplet MC = <X, A, λ>, in which A is the transition probability matrix:

$$ A=\left({p}_{ij}\right)=\left[\begin{array}{cccccc}{P}_{11}& {P}_{12}& \cdots & {P}_{1j}& \cdots & {P}_{1n}\\ {P}_{21}& {P}_{22}& \cdots & {P}_{2j}& \cdots & {P}_{2n}\\ \vdots & \vdots & & \vdots & & \vdots \\ {P}_{i1}& {P}_{i2}& \cdots & {P}_{ij}& \cdots & {P}_{in}\\ \vdots & \vdots & & \vdots & & \vdots \\ {P}_{n1}& {P}_{n2}& \cdots & {P}_{nj}& \cdots & {P}_{nn}\end{array}\right] $$
(12)

where pij = P(Xt = xj| Xt − 1 = xi) denotes the transition probability from xi to xj, and λ refers to the initial state distribution:

$$ \lambda =\left({p}_i\right)=\left({p}_1,{p}_2,...,{p}_n\right) $$
(13)

After that, via maximum likelihood estimation, each parameter in the Markov model can be estimated:

$$ {p}_{ij}=\frac{S_{ij}}{\sum \limits_{j=1}^n{S}_{ij}} $$
(14)
$$ {p}_i=\frac{\sum \limits_{j=1}^n{S}_{ij}}{\sum \limits_{i=1}^n\sum \limits_{j=1}^n{S}_{ij}} $$
(15)

However, our previous method only classifies each user into a single interest category. To further predict multiple interests, this paper defines a multi-Markov chain model and its corresponding solution.

  • Definition 1: The multi-Markov chain (m-MCM) based user interest model is represented as a quaternion <X, K, P(C), MC>, where X is a discrete random variable over the range {x1, x2, ..., xn} and each xi refers to an interest eigenvalue; C = {c1, c2, ..., ck} refers to a group of k user interest categories; P(C = ck) refers to the probability of the k-th category; and MC = {mc1, mc2, ..., mck} represents multiple Markov chains, where each element mck belongs to category ck.

Via maximum likelihood estimation with Bayesian smoothing, the elements of the transition matrix Ak and the initial state distribution λk are estimated as follows:

$$ {p}_{kij}=\frac{S_{kij}+{\alpha}_{kij}}{\sum \limits_{j=1}^n\left({S}_{kij}+{\alpha}_{kij}\right)} $$
(18)
$$ {p}_{ki}=\frac{\sum \limits_{j=1}^n\left({S}_{ki j}+{\alpha}_{ki j}\right)}{\sum \limits_{i=1}^n\sum \limits_{j=1}^n\left({S}_{ki j}+{\alpha}_{ki j}\right)} $$
(19)

where k and Skij refer to the number of interest categories and the status-pair count, respectively, and αkij represents background knowledge in Bayesian estimation [16, 41].

After that, we calculate the similarity δkl between each pair of users' transition matrices using formula (20):

When the value of δkl is large or infinite, the two users are regarded as belonging to the same interest category, and their chains are merged using formulas (21) and (22):

$$ {\delta}_{kl}=\frac{2}{\mathrm{Similarity}\left(m{c}_k,m{c}_l\right)+\mathrm{Similarity}\left(m{c}_l,m{c}_k\right)} $$
(20)
$$ {p}_{\left(k+l\right) ij}=\frac{S_{kij}+{S}_{lij}+{\alpha}_{\left(k+l\right) ij}}{\sum \limits_{j=1}^n\left({S}_{kij}+{S}_{lij}+{\alpha}_{\left(k+l\right) ij}\right)} $$
(21)
$$ {p}_{\left(k+l\right)i}=\frac{\sum \limits_{j=1}^n\left({S}_{kij}+{S}_{lij}+{\alpha}_{\left(k+l\right) ij}\right)}{\sum \limits_{i=1}^n\sum \limits_{j=1}^n\left({S}_{kij}+{S}_{lij}+{\alpha}_{\left(k+l\right) ij}\right)} $$
(22)

Consequently, the multi-Markov chain for user interest prediction can be constructed.
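The estimation and merging steps can be sketched as follows; the uniform smoothing constant alpha stands in for the background knowledge αkij, and the integer encoding of the state sequence is an assumption for illustration.

```python
import numpy as np


def estimate_markov_chain(state_seq, n_states, alpha=1.0):
    """state_seq: interest eigenvalue indices (0..n_states-1) in time order.
    Returns the smoothed initial distribution and transition matrix, Eqs. (18)-(19)."""
    S = np.zeros((n_states, n_states))           # S[i, j]: count of transitions i -> j
    for i, j in zip(state_seq[:-1], state_seq[1:]):
        S[i, j] += 1
    A = (S + alpha) / (S + alpha).sum(axis=1, keepdims=True)   # Eq. (18)
    lam = (S + alpha).sum(axis=1) / (S + alpha).sum()          # Eq. (19)
    return lam, A


def merge_chains(S_k, S_l, alpha=1.0):
    """Merge the count matrices of two similar chains, Eqs. (21)-(22)."""
    S = S_k + S_l
    A = (S + alpha) / (S + alpha).sum(axis=1, keepdims=True)
    lam = (S + alpha).sum(axis=1) / (S + alpha).sum()
    return lam, A
```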

4.2.2 Computation complexity

  • Theorem 3: the time complexity of the MCM based approach is O(m³n²), given m as the number of user interest eigenvalues and n as the total number of user messages.

  • Proof: The MCM algorithm contains two parts: the initialization part and the circulation part. The initialization part, which calculates pij and pi and transforms the user data into Markov chains, takes O(mn²).

In the circulation part, the maximum number of cycles is m, because in every cycle either two Markov chains are merged or the loop exits and the algorithm ends. Additionally, calculating the similarity degrees between different pairs and listing them in descending order costs O(m²n²) (ignoring the sorting time). Merging the two Markov chains with the maximum similarity degree also takes O(m²n²). The execution time of the circulation part is therefore O(m·m²n²), that is, O(m³n²). In combination, the time complexity of the MCM based approach is O(m³n²), and the theorem is proved.

4.3 Gaussian and Markov approach (GAM)

From Sections 4.1 and 4.2, it is obvious that the GMM and Markov based approaches have their own advantages. The GMM model is a content based approach with lower computation complexity, while the Markov model is a status based approach that does not require a large number of user messages as long as the status matrix can be constructed and stabilized.

Therefore, one feasible combination is: (1) set a predefined threshold w; (2) when the number of a user's posted messages s > w, apply GMM based prediction, otherwise apply MCM based prediction; (3) adjust the value of w until the best prediction accuracy is achieved.
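A minimal sketch of this combination rule and the threshold sweep is given below; gmm_predict, mcm_predict, and the user object with a messages attribute are placeholders for the components described above.

```python
def gam_predict(user, w, gmm_predict, mcm_predict):
    """Select the predictor according to the number of posted messages s."""
    s = len(user.messages)
    return gmm_predict(user) if s > w else mcm_predict(user)


def tune_threshold(users, labels, candidates, gmm_predict, mcm_predict):
    """Step (3): adjust w over candidate values until accuracy is maximized."""
    def accuracy(w):
        preds = [gam_predict(u, w, gmm_predict, mcm_predict) for u in users]
        return sum(p == y for p, y in zip(preds, labels)) / len(labels)
    return max(candidates, key=accuracy)
```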

5 Experiment and evaluation

The experiment and evaluation work is described from four aspects. First, the clustering results of GMM and MCM are described; after that, the integration strategy is investigated and tested; additionally, the performance evaluation is conducted against several existing algorithms; and finally, we discuss the model implementation and scalability. From the collected dataset, 4000 out of the 30,116 users are randomly selected as experiment users. The experiment is conducted in the Matlab [14] environment.

5.1 Clustering result

Figure 4(a) and (b) show the clustering results of GMM and MCM respectively. These results contain noise from swing users (whose interest categories are difficult to determine). After noise filtering (using the component analysis method provided in MATLAB), we obtain clearer clustering results, illustrated in Fig. 5, which contain 3835 valid users.

Fig. 4 The clustering results

Fig. 5 The clustering results after denoising

It is obvious that the user classification result in Fig. 5(a) (where the boundaries among clusters are better separated) is better than that in Fig. 5(b) (where certain clusters are scattered and difficult to distinguish). Specifically, the users can be accurately divided into 20 categories in Fig. 5(a), whereas only 14 categories can be distinguished in Fig. 5(b). Therefore, the cluster graphs show that the GMM approach separates the boundaries of the interest categories better than the MCM approach.

In more detail, the clustering results for the 20 categories are listed in Table 2. Although some classifications might be wrong (for instance, some users classified in the ‘Entertainment’ category may not belong to it), the large gap between GMM and MCM indicates the advantage of adopting GMM for clustering analysis in crowd intelligence.

Table 2 The Clustering Results

On the other hand, the time consumption of GMM and MCM is also compared. As shown in Fig. 6, the MCM algorithm takes almost 16 times longer than GMM.

Fig. 6 Time consumption

5.2 Combination strategy

To explore an optimized integration, we further investigate the clustering errors of both the GMM and MCM methods, and find an obvious feature difference between the two approaches in the number of posted messages (including text, images, videos, etc.).

5.2.1 Clustering error analysis

According to the previous clustering results, the error data can be listed, as shown in Table 3. We check each error user individually and find that most error users produced by GMM have fewer posted messages (between 34 and 208 in each category) than the error users generated by MCM. Therefore, the number of posted messages is probably a distinguishing feature between the GMM and MCM approaches.

Table 3 Error Users Analysis in GMM and MCM Respectively

5.2.2 The effect of the number of posted messages

Furthermore, this section groups users by the total number of posted messages (0–50, 50–100, 100–150, 150–200, 200–250, 250–300, 300–350, 350–400, 400–450, 450–500, 500–550, and 550 plus) and discusses the effect of the number of posted messages on prediction accuracy. We randomly select 20 users in each interval and test the prediction accuracy of the GMM and MCM approaches respectively. The results in Fig. 7 show that (1) the GMM algorithm is more accurate than MCM when the number of posted messages is larger than 300; and (2) the MCM approach still achieves much better performance when the number of posted messages is between 50 and 150.

Fig. 7 Effect of different numbers of posted messages
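The interval analysis can be reproduced with a simple bucketing routine such as the sketch below; the interval edges follow the text, while the predictor callable and the user object are placeholders.

```python
import random

INTERVALS = [(0, 50), (50, 100), (100, 150), (150, 200), (200, 250), (250, 300),
             (300, 350), (350, 400), (400, 450), (450, 500), (500, 550), (550, float("inf"))]


def interval_accuracy(users, labels, predict, sample_size=20, seed=0):
    """Randomly sample users in each message-count interval and report accuracy."""
    rng = random.Random(seed)
    results = {}
    for lo, hi in INTERVALS:
        bucket = [(u, y) for u, y in zip(users, labels) if lo <= len(u.messages) < hi]
        sample = rng.sample(bucket, min(sample_size, len(bucket)))
        if not sample:
            continue
        results[(lo, hi)] = sum(predict(u) == y for u, y in sample) / len(sample)
    return results
```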

Therefore, we set the threshold value in this paper to 150, implement the GMM approach when the number of posted messages is above this threshold, and implement MCM when it is below. The prediction accuracy is further improved to around 95% in each of the 20 categories (see Fig. 8).

Fig. 8 Performance evaluation of different models

5.3 GAM performance

Furthermore, we compare the proposed GAM solution with the GMM and MCM based solutions, as well as with traditional solutions such as the LIBSVM and K-Means algorithms provided by the Weka tool [26]. Table 4 shows that the proposed solution achieves 93.49% true positives and 94.82% true negatives. Table 5 further reports the precision, recall, and F-measure values, which are always above 0.9. Finally, the comparison with SVM and the other classifiers illustrates that our GAM solution achieves the highest prediction accuracy (shown in Table 6).
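For reference, the confusion-matrix-derived metrics reported in Tables 4–6 can be computed as in the sketch below; scikit-learn is used here only as an illustrative stand-in for the Weka evaluation.

```python
from sklearn.metrics import accuracy_score, confusion_matrix, precision_recall_fscore_support


def evaluate(y_true, y_pred):
    """Confusion matrix, per-class precision/recall/F-measure, and overall accuracy."""
    precision, recall, f1, _ = precision_recall_fscore_support(y_true, y_pred, zero_division=0)
    return {"confusion_matrix": confusion_matrix(y_true, y_pred),
            "precision": precision, "recall": recall, "f_measure": f1,
            "accuracy": accuracy_score(y_true, y_pred)}
```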

Table 4 Confusion matrix evaluation
Table 5 Classification evaluation
Table 6 Comparison with traditional classifiers

5.4 Discussion

Three topics of discussion to justify our research contributions to recent advances in crowd intelligence are as follows:

  1. Model Feasibility and Scalability: the compromised GAM model (integrating Gaussian and Markov based clustering approaches) is theoretically feasible and is validated via a series of experiments. Additionally, the proposed solution, with few revisions, is scalable to other social networks (e.g., Facebook, Twitter, etc.).

  2. Computation Efficiency: As a compromise between GMM and MCM, the computation efficiency of GAM depends on the ratio of users whose number of posted messages is below or above the predefined threshold value. Since most normal users in social networks post more messages than the predefined threshold (as seen in Fig. 1), our proposed GAM solution incurs only slightly more computation burden than the GMM approach, which is regarded as acceptable.

  3. Performance Enhancement: Due to the different datasets and environment settings, it is difficult to directly compare performance with existing works. However, our solution achieves the highest result for two reasons: firstly, existing works take either tags / limited content, or only social relationships, into consideration, while our solution considers all posted messages; secondly, our proposed solution considers “the number of posted messages” as the only key feature and is capable of selecting the optimal handling mechanism. In summary, we have greatly improved the prediction accuracy from 60%–80% (see references [5, 11, 15, 21, 28, 35]) to 94.3% in our work.

6 Conclusions

User interest prediction in social networks has become a hot topic in both academia and industry. This paper introduces clustering approaches to achieve soft computing (or computational intelligence), specifically GMM and MCM, and finally proposes the GAM solution to predict user interest in social networks. We have conducted a series of experiments and analyses to show that our proposed GAM solution is feasible, efficient, and achieves higher prediction accuracy. Compared with other algorithms and existing work, our proposed solution also has acceptable computation complexity. We demonstrate our work on recent advances in soft computing and justify our research contributions by applying different methods to meet the prediction challenges of social intelligent multimedia systems.