1 Introduction

Interest prediction in social networks is important both for users (e.g., community participation, activity initiation, etc. [23, 27, 36]) and for social network service providers in a series of applications (e.g., behavior analysis, service recommendation, etc. [12, 37]). However, because of the 4 V characteristics of social multimedia data (huge volume, high variety, low value, and fast velocity), feasible and efficient user interest prediction is not a trivial research challenge [18, 29]. On the other hand, just as in everyday life, users in social networks differ from each other in many features. For instance, different users have different numbers of posted messages (e.g., text, images, videos, etc.), online time, and behavior histories. Hence, depending on the features or environment, social user interest prediction may require different soft computing approaches.

Existing research is mainly based on three types of information: user registration profiles [43], behavior history [3], and social relationships [13]. However, few of the existing approaches are efficient, complete, and open sourced. This paper considers clustering algorithms as a typical soft computing technology (or computational intelligence) and proposes a combination of Gaussian and Markov models (namely, GAM) for social user interest prediction. Clustering technologies are chosen for the following reasons. First, unsupervised machine learning algorithms are normally computationally efficient, especially in big data environments [4, 33]; second, clustering mechanisms take similarity calculation into consideration for better performance. In particular, we select the combination of Gaussian and Markov models as the detailed solution. As described in Section 5, the Gaussian content based approach provides accurate results with low computation, whereas the Markov status based approach is capable of providing higher availability.

The clustering approach proposed in this paper is a form of soft computing technology. Due to the specific implementation scenario of social multimedia data, clustering based interest prediction requires the participation of computational intelligence. In general, this paper makes three contributions to recent advances in soft computing technology:

  • We investigate Gaussian and Markov based clustering approaches (model description, complexity, etc.) for user interest prediction in social networks. Consequently, a compromised GAM model is proposed, which selects either the Gaussian or the Markov approach according to the key parameter “number of posted messages”.

  • A specific data crawler is developed to collect Sina Weibo data as the testing dataset. After that, clustering experiments, strategy selection, and performance evaluations are conducted to show the feasibility and efficiency of the proposed solution.

  • Through suitable data pre-processing and parameter adjustment, the proposed model achieves 94.3% prediction accuracy, which, to the best of our knowledge, is the highest reported. Additionally, performance results, model scalability, and computation efficiency are discussed to justify our contributions.

Please note that our approach is a generic solution applicable to other social data (Twitter, Facebook, and so on). The paper is organized as follows. Section 2 investigates social user interest prediction and existing research. Section 3 discusses dataset preprocessing, feature extraction, and user annotation. Section 4 investigates the GMM and MCM approaches respectively and introduces our proposed GAM model. Section 5 illustrates the experiments, analysis, and discussion. Finally, Section 6 summarizes the paper.

2 Social interest prediction

2.1 Social network and user interest

Online social networks have become major platforms for internet users to post multimedia messages (e.g., text, pictures, videos, etc.) and to discuss and share interesting topics [33]. In social networks, interest is usually represented by posted messages describing events or willingness, such as what users want to do or buy, where they want to go, or whom they want to meet, follow, or vote for [10]. Therefore, interest exploration in social networks is an important part of user behavior analysis, since it can support a series of extension services such as community detection, targeted advertisement, personalized recommendation, and so on [34].

This paper investigates and collects a dataset from Sina Weibo, which had over 350 million users and was the eighth most frequently visited social network in the world as of December 2017 [25, 31]. Based on this dataset, the investigation and further experiments are convincing and scalable. Therefore, this paper is highly relevant to soft computing research on social networks and multimedia big data.

2.2 Existing works

In industry, both Twitter and Facebook have initiated research projects that implement machine learning technologies for user behavior analysis, according to their annual reports [32]. However, the details are unknown to the public.

In academia, the initial attempts explore relevant messages entered by users as interest information in order to establish user interest prediction models [1, 6]. Abel et al. [2] extend users’ basic information through tagged user profiles and develop a cross-system user model to find user interests and improve recommendation quality. Xu et al. [35] filter interest-unrelated noisy posts according to aggregated user profiles and, to some extent, discover user interests. These approaches are based on users’ registration information; however, the results may not be accurate because the entered user information is incomplete.

Besides, there are other methods based on social relationship analysis. Xiaoling S et al. [28] propose an agent-based interest awareness model that considers social ties formed or reinforced between two individuals if they have similar interests. Xiao H et al. [15] capture various social features and investigate social inference based on interest similarity to realize user interest prediction. Fadaee et al. [11] convert a social network into a Bernoulli based unweighted structure model and predict user interest categories according to the structural differences between different categories of networks. Nori et al. [21] import graph theory to model users’ time-evolving behavior and predict user interest categories via similarity computation. However, these approaches are incomplete because the social relationship is only one part of the features for interest exploration.

Other approaches are based on feature exploration of social network content. For example, Attenberg et al. [5] predict user interests by analyzing the content of posted messages. Banerjee et al. [7] collect Twitter data and apply statistical and mining techniques to explore user interest distributions over categories, e.g., food, sport, movies, etc. Literature [19] considers the imbalanced data of social users and introduces an overall-distribution weighted ELM (ODW-ELM) model for predicting users’ future interests. [40] considers the evolution of user interests and utilizes semantic information from knowledge bases such as Wikipedia to predict users’ future interests and overcome the cold item problem. [24] proposes a multilevel deep belief network learning-based model of users’ consumption preferences, based on the interactions between users’ preferential behaviors. Our previous research [42] also proposes a Markov chain model on clustered users to predict user interests. However, the disadvantage of the above approaches is that most of them define complicated computation logic that causes a heavy system burden. Besides, these approaches only classify each user into a single interest category, while in reality each user may have multiple interests. Additionally, none of these approaches achieves excellent performance (most are between 60% and 80% in prediction accuracy).

Generally, this paper is an extension of our previous work [42] with a few significant improvements:

  1. This paper integrates Gaussian and Markov based approaches, which achieves lower computation complexity and better performance.

  2. Both theoretical analysis and experiments illustrate that, via inspection of only the parameter “number of posted messages”, our proposed solution is capable of selecting the optimal handling logic. This makes the model easy to implement.

  3. A prediction accuracy of 94.3% is obtained with suitable parameter adjustment (weeding out the influence of swing users), which, to the best of our knowledge, is the highest reported.

3 Dataset analysis

3.1 Dataset collection for social networks

Similar to most social media platforms, the public Weibo developer API (specifically, the user_timeline API) only provides downloading of the recent messages of authorized users. This is an obstacle to the data collection process. To solve this problem, a specific data crawler and feature collection mechanism are developed. Specifically, we manually select source data covering 20 interest categories, containing 100 normal users (who post, repost, or comment frequently), as the data source. After that, a specific data crawler is developed for dataset collection. The data crawler contains two classes: a WeiboCrawler class for collecting user related information, especially posted messages, followees’ IDs, etc.; and a FolloweeCrawler class for collecting followees’ posted messages. Finally, 30,116 Weibo users with around 17 million messages (from 20 January 2017 to 1 April 2017) are acquired.
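A minimal sketch of the two crawler classes is given below. It is illustrative only: the endpoint paths, parameter names, and paging logic are assumptions standing in for the actual calls made against the authorized Weibo API.

```python
import requests

API_BASE = "https://api.weibo.com/2"  # assumed API host; adjust to the authorized endpoint


class WeiboCrawler:
    """Collects user related information: posted messages and followee IDs."""

    def __init__(self, access_token):
        self.access_token = access_token

    def fetch_timeline(self, uid, pages=5, count=100):
        # user_timeline only returns the recent messages of authorized users,
        # so we page backwards as far as the API allows.
        messages = []
        for page in range(1, pages + 1):
            resp = requests.get(
                f"{API_BASE}/statuses/user_timeline.json",  # assumed path
                params={"access_token": self.access_token, "uid": uid,
                        "page": page, "count": count},
            )
            resp.raise_for_status()
            statuses = resp.json().get("statuses", [])
            if not statuses:
                break
            messages.extend(s["text"] for s in statuses)
        return messages

    def fetch_followee_ids(self, uid):
        resp = requests.get(
            f"{API_BASE}/friendships/friends/ids.json",  # assumed path
            params={"access_token": self.access_token, "uid": uid},
        )
        resp.raise_for_status()
        return resp.json().get("ids", [])


class FolloweeCrawler(WeiboCrawler):
    """Collects the posted messages of every followee of a given user."""

    def fetch_followee_messages(self, uid):
        return {fid: self.fetch_timeline(fid) for fid in self.fetch_followee_ids(uid)}
```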

Figure 1 illustrates the distribution of “the number of posted messages”. It shows that most normal users post/repost 250-550 messages (including text, image, video, etc.) in around 70 days.

Fig. 1 The distribution of “the number of posted messages”

3.2 Feature vector extraction

After dataset collection, the feature vectors can be generated according to the following steps [22]:

  1. Word Segmentation and Frequency Statistics. By filtering out image and video content and deploying the ICTCLAS (Institute of Computing Technology, Chinese Lexical Analysis System) word segmentation system [8, 30], separate words can be extracted from the Weibo messages. After that, using the affiliated TF-IDF (term frequency–inverse document frequency) API [9], the top 50 keywords for each of the 20 predefined interest categories can be obtained. Consequently, the total number of keywords is 20 * 50 = 1000.

  2. De-duplication and Feature Vector Generation. After manual re-inspection to remove redundancy, we obtain 579 keywords, based on which a feature vector of dimension 1 × 579 can be generated (see the sketch below).
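The two steps can be sketched as follows, with scikit-learn's TfidfVectorizer standing in for the segmentation and TF-IDF APIs cited above; the function names, and the assumption that messages are already segmented into whitespace-separated tokens, are illustrative.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer


def top_keywords_per_category(category_docs, top_k=50):
    """category_docs: dict mapping category name -> list of segmented messages
    (whitespace-separated tokens). Returns the de-duplicated union of the
    per-category top-k keywords (579 keywords after manual re-inspection)."""
    keywords = set()
    for _, docs in category_docs.items():
        vectorizer = TfidfVectorizer()
        tfidf = vectorizer.fit_transform(docs)             # documents x vocabulary
        scores = np.asarray(tfidf.mean(axis=0)).ravel()    # average TF-IDF per term
        terms = np.array(vectorizer.get_feature_names_out())
        keywords.update(terms[np.argsort(scores)[::-1][:top_k]])
    return sorted(keywords)


def feature_vector(user_messages, keywords):
    """Build a 1 x len(keywords) keyword-frequency vector for one user."""
    text = " ".join(user_messages)
    return np.array([text.count(kw) for kw in keywords], dtype=float)
```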

3.3 User annotations

Among the 30,116 users, we randomly select 4000 users and assign three volunteers to handle the annotation work, marking each user's interest category according to the message history. The three volunteers annotate independently, without interfering with each other. In case a user is marked with different categories, majority voting is applied to make the decision. Finally, the user numbers and corresponding categories are illustrated in Table 1.
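The majority voting rule can be written in a couple of lines; the sketch below assumes each user's three annotations are gathered into a list.

```python
from collections import Counter


def majority_vote(labels):
    """Return the interest category chosen by most volunteers."""
    return Counter(labels).most_common(1)[0][0]


# Example: two volunteers mark "Sport", one marks "Entertainment" -> "Sport".
print(majority_vote(["Sport", "Entertainment", "Sport"]))
```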

Table 1 User number and corresponding Category

4 Solution

Figure 2 illustrates the overview of the proposed solution. After feature vector generation (described in Section 3), clustering algorithms (e.g., the Markov chain model, the GMM model, and so on) are applied to construct the prediction model.

Fig. 2 System overview

4.1 GMM based prediction

4.1.1 Gaussian mixture model

According to [38], the Gaussian mixture model is described by the following formula:

$$ p(x)=\sum \limits_{k=1}^K{\pi}_kN\left(x|{\mu}_k,{\varSigma}_k\right) $$
(1)

where N(x| μk, Σk) is the Gaussian density function, and μk, Σk, and πk are the corresponding mean, covariance, and mixing coefficient, respectively. According to the sum and product rules, the marginal density is:

$$ p(x)=\sum \limits_{k=1}^Kp(k)p\left(x|k\right) $$
(2)

Suppose that the total number of messages a user has published is s, with s ~ N(μ, σ), and that the classification number k and s are independent of each other. We then have the following theorem:

  • Theorem 1: the prediction accuracy p(x) is a monotonically increasing function of s.

  • Proof: p(x| k) in formula (2) can be transformed to p(x| k, s), as follows:

$$ p\left(x|k,s\right)=\frac{p\left(x,k,s\right)}{p\left(k,s\right)} $$
(3)

Since the parameters k and s are independent of each other, p(k, s) = p(k) × p(s) and p(s| x, k) = p(s| x), so formula (3) can be transformed to:

$$ p\left(x|k,s\right)=\frac{p\left(x,k\right)\times p\left(s|x\right)}{p(k)\times p(s)}=\frac{p\left(x,k\right)\times \frac{p\left(s,x\right)}{p(x)}}{p(k)\times p(s)}=\frac{p\left(x,k\right)}{p(x)\times p(k)}\times p\left(x|s\right) $$
(4)

Since \( \frac{p\left(x,k\right)}{p(x)\times p(k)} \) is not affected by s and p(x| s) increases with s, the theorem is proved.

4.1.2 EM steps

The log-likelihood corresponding to formula (1) is given by:

$$ \ln p\left(X|\pi, \mu, \varSigma \right)=\sum \limits_{n=1}^N\ln \left\{\sum \limits_{k=1}^K{\pi}_kN\left({x}_n|{\mu}_k,{\varSigma}_k\right)\right\} $$
(5)

where X = {x1, ..., xN}.

Additionally, the EM algorithm [20, 39] is implemented with the following steps:

  1. Initialize μk, Σk, and πk, and calculate the initial likelihood.

  2. E-step:

$$ \gamma \left({z}_{nk}\right)=\frac{\pi_kN\left({x}_n|{\mu}_k,{\varSigma}_k\right)}{\sum \limits_{j=1}^K{\pi}_jN\left({x}_n|{\mu}_j,{\varSigma}_j\right)} $$
(6)
  3. M-step:

$$ {\mu}_k^{new}=\frac{1}{N_k}\sum \limits_{n=1}^N\gamma \left({z}_{nk}\right){x}_n, $$
(7)
$$ {\varSigma}_k^{new}=\frac{1}{N_k}\sum \limits_{n=1}^N\gamma \left({z}_{nk}\right)\left({x}_n-{\mu}_k^{new}\right){\left({x}_n-{\mu}_k^{new}\right)}^T $$
(8)
$$ {\pi}_k^{new}=\frac{N_k}{N}, $$
(9)

Where

$$ {N}_k=\sum \limits_{n=1}^N\gamma \left({z}_{nk}\right). $$
(10)
  4. Log-likelihood evaluation:

$$ \ln p\left(X|\pi, \mu, \varSigma \right)=\sum \limits_{n=1}^N\ln \left\{\sum \limits_{k=1}^K{\pi}_kN\left({x}_n|{\mu}_k,{\varSigma}_k\right)\right\} $$
(11)

The algorithm returns to the E-step (step 2) until the convergence criterion is satisfied. Consequently, the optimized parameters and resulting values are obtained.
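A compact NumPy sketch of the above EM loop (Eqs. (6)–(11)) is given below; the initialization scheme, the covariance regularization term, and the convergence tolerance are illustrative choices rather than the exact settings used in our experiments.

```python
import numpy as np
from scipy.stats import multivariate_normal


def _weighted_densities(X, pi, mu, sigma):
    # Column k holds pi_k * N(x_n | mu_k, Sigma_k) for all n.
    return np.column_stack([
        pi[k] * multivariate_normal.pdf(X, mean=mu[k], cov=sigma[k], allow_singular=True)
        for k in range(len(pi))])


def gmm_em(X, K, max_iter=100, tol=1e-6, seed=0):
    """X: (N, D) feature matrix; K: number of interest categories."""
    rng = np.random.default_rng(seed)
    N, D = X.shape
    # Step 1: initialize mu_k, Sigma_k, pi_k and the initial likelihood.
    mu = X[rng.choice(N, size=K, replace=False)].copy()
    sigma = np.stack([np.eye(D)] * K)
    pi = np.full(K, 1.0 / K)
    prev_ll = np.log(_weighted_densities(X, pi, mu, sigma).sum(axis=1)).sum()
    for _ in range(max_iter):
        # Step 2 (E-step): responsibilities gamma(z_nk), Eq. (6).
        dens = _weighted_densities(X, pi, mu, sigma)
        gamma = dens / dens.sum(axis=1, keepdims=True)
        # Step 3 (M-step): update the parameters, Eqs. (7)-(10).
        Nk = gamma.sum(axis=0)
        mu = (gamma.T @ X) / Nk[:, None]
        for k in range(K):
            diff = X - mu[k]
            sigma[k] = (gamma[:, k, None] * diff).T @ diff / Nk[k] + 1e-6 * np.eye(D)
        pi = Nk / N
        # Step 4: evaluate the log-likelihood, Eq. (11), and test convergence.
        ll = np.log(_weighted_densities(X, pi, mu, sigma).sum(axis=1)).sum()
        if abs(ll - prev_ll) < tol:
            break
        prev_ll = ll
    return pi, mu, sigma, gamma
```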

4.1.3 Time complexity of GMM approach

  • Theorem 2: the time complexity of the GMM algorithm for social interest prediction is O(n²k), given the number of interest categories k and the number of posted messages n.

  • Proof: Initializing the variables of the k categories takes O(k); the E-step calculation takes O(nk); the M-step calculation takes O(n²k); and evaluating the maximum likelihood function takes O(nk). Therefore, the GMM time complexity is O(n²k), and the theorem is proved.

4.1.4 Effect of the number of posted messages

Theoretically, with an increasing number of posted messages s, the value of p(x) increases (refer to the Matlab simulation result in Fig. 3). It is obvious that (1) GMM is capable of achieving high prediction results (for example, over 0.9 when the number of posted messages exceeds 375); but (2) GMM may not work efficiently when the number of posted messages is insufficient (for instance, the prediction accuracy falls below 0.7 when s is less than 175). Therefore, in order to further improve prediction accuracy, another method needs to be introduced.

Fig. 3 Simulation result of the effect of s on p(x)

4.2 Markov chain model (MCM)

GMM based interest prediction is a content based approach that requires as many posted messages as possible, which may not be effective for users whose posted messages are inadequate. On the other hand, the Markov model is a status based prediction approach that can generate reliable results as long as its status chain has been constructed [17]. Therefore, Markov based interest prediction can be implemented to further improve prediction accuracy.

4.2.1 Markov chain model

Our previous work [42] modeled user interest prediction in social networks as a triplet MC = <X, A, λ>, in which A is the transition probability matrix:

$$ A=\left({p}_{ij}\right)=\left[\begin{array}{cccccc}{P}_{11}& {P}_{12}& \cdots & {P}_{1j}& \cdots & {P}_{1n}\\ {P}_{21}& {P}_{22}& \cdots & {P}_{2j}& \cdots & {P}_{2n}\\ \vdots & \vdots & & \vdots & & \vdots \\ {P}_{i1}& {P}_{i2}& \cdots & {P}_{ij}& \cdots & {P}_{in}\\ \vdots & \vdots & & \vdots & & \vdots \\ {P}_{n1}& {P}_{n2}& \cdots & {P}_{nj}& \cdots & {P}_{nn}\end{array}\right] $$
(12)

where pij = P(Xt = xj| Xt − 1 = xi) denotes the transition probability from xi to xj, and λ refers to the initial state distribution:

$$ \lambda =\left({p}_i\right)=\left({p}_1,{p}_2,...,{p}_n\right) $$
(13)

After that, via maximum likelihood estimation, each parameter in the Markov model can be estimated:

$$ {p}_{ij}=\frac{S_{ij}}{\sum \limits_{j=1}^n{S}_{ij}} $$
(14)
$$ {p}_i=\frac{\sum \limits_{j=1}^n{S}_{ij}}{\sum \limits_{i=1}^n\sum \limits_{j=1}^n{S}_{ij}} $$
(15)

However, our previous method only classifies each user into a single interest category. To further predict multiple interests, this paper defines a multi-Markov chain model and its corresponding solution.

  • Definition 1: The multi-Markov chain (m-MCM) based user interest model is represented as a quaternion <X, K, P(C), MC>, where X is a discrete random variable over the range {x1, x2, ..., xn} and each xi refers to an interest eigenvalue; C = {c1, c2, ..., ck} refers to a group of k user interest categories; P(C = ck) refers to the probability of the k-th category; and MC = {mc1, mc2, ..., mck} represents multiple Markov chains, where each element mck belongs to category ck.

Via maximum likelihood estimation with Bayesian smoothing, the elements of the transition matrix Ak and the initial state distribution λk are estimated as follows:

$$ {p}_{kij}=\frac{S_{kij}+{\alpha}_{kij}}{\sum \limits_{j=1}^n\left({S}_{kij}+{\alpha}_{kij}\right)} $$
(18)
$$ {p}_{ki}=\frac{\sum \limits_{j=1}^n\left({S}_{ki j}+{\alpha}_{ki j}\right)}{\sum \limits_{i=1}^n\sum \limits_{j=1}^n\left({S}_{ki j}+{\alpha}_{ki j}\right)} $$
(19)

where k and Skij refer to the number of interest categories and the status-pair count, respectively, and αkij represents background knowledge in Bayesian estimation [16, 41].

After that, we calculate the similarity δkl between each pair of users' transition matrices using formula (20):

When the value of δkl is large or infinite, the two users are regarded as belonging to the same interest category, and their chains are merged using formulas (21) and (22):

$$ {\delta}_{kl}=\frac{2}{\mathrm{Similarity}\left(m{c}_k,m{c}_l\right)+\mathrm{Similarity}\left(m{c}_l,m{c}_k\right)} $$
(20)
$$ {p}_{\left(k+l\right) ij}=\frac{S_{kij}+{S}_{lij}+{\alpha}_{\left(k+l\right) ij}}{\sum \limits_{j=1}^n\left({S}_{kij}+{S}_{lij}+{\alpha}_{\left(k+l\right) ij}\right)} $$
(21)
$$ {p}_{\left(k+l\right)i}=\frac{\sum \limits_{j=1}^n\left({S}_{kij}+{S}_{lij}+{\alpha}_{\left(k+l\right) ij}\right)}{\sum \limits_{i=1}^n\sum \limits_{j=1}^n\left({S}_{kij}+{S}_{lij}+{\alpha}_{\left(k+l\right) ij}\right)} $$
(22)

Consequently, the multi-Markov chain for user interest prediction can be constructed.
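The estimation and merging steps can be sketched as follows; the uniform smoothing constant alpha stands in for the background knowledge αkij, and the integer encoding of the state sequence is an assumption for illustration.

```python
import numpy as np


def estimate_markov_chain(state_seq, n_states, alpha=1.0):
    """state_seq: interest eigenvalue indices (0..n_states-1) in time order.
    Returns the smoothed initial distribution and transition matrix, Eqs. (18)-(19)."""
    S = np.zeros((n_states, n_states))           # S[i, j]: count of transitions i -> j
    for i, j in zip(state_seq[:-1], state_seq[1:]):
        S[i, j] += 1
    A = (S + alpha) / (S + alpha).sum(axis=1, keepdims=True)   # Eq. (18)
    lam = (S + alpha).sum(axis=1) / (S + alpha).sum()          # Eq. (19)
    return lam, A


def merge_chains(S_k, S_l, alpha=1.0):
    """Merge the count matrices of two similar chains, Eqs. (21)-(22)."""
    S = S_k + S_l
    A = (S + alpha) / (S + alpha).sum(axis=1, keepdims=True)
    lam = (S + alpha).sum(axis=1) / (S + alpha).sum()
    return lam, A
```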

4.2.2 Computation complexity

  • Theorem 3: the time complexity of the MCM based approach is O(m³n²), given m as the number of user interest eigenvalues and n as the total number of user messages.

  • Proof: The MCM algorithm contains two parts: the initialization part and the circulation part. The initialization part, which calculates pij and pi and transforms the user data into Markov chains, takes O(mn²).

In the circulation part, the maximum number of cycles is m, because in every cycle either two Markov chains are merged or the loop exits and the algorithm ends. Additionally, calculating the similarity degrees between different pairs and listing them in descending order costs O(m²n²) (ignoring the sorting time). Merging the two Markov chains with the maximum similarity degree also takes O(m²n²). The execution time of the circulation part is therefore O(m·m²n²), that is, O(m³n²). In combination, the time complexity of the MCM based approach is O(m³n²), and the theorem is proved.

4.3 Gaussian and Markov approach (GAM)

From Sections 4.1 and 4.2, it is obvious that the GMM and Markov based approaches have their own advantages. The GMM model is a content based approach with lower computation complexity, while the Markov model is a status based approach that does not require a large number of user messages as long as the status matrix can be constructed and stabilized.

Therefore, one feasible combination is: (1) set a predefined threshold w; (2) when the number of a user's posted messages s > w, apply GMM based prediction, otherwise apply MCM based prediction; (3) adjust the value of w until the best prediction accuracy is achieved.
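A minimal sketch of this combination rule and the threshold sweep is given below; gmm_predict, mcm_predict, and the user object with a messages attribute are placeholders for the components described above.

```python
def gam_predict(user, w, gmm_predict, mcm_predict):
    """Select the predictor according to the number of posted messages s."""
    s = len(user.messages)
    return gmm_predict(user) if s > w else mcm_predict(user)


def tune_threshold(users, labels, candidates, gmm_predict, mcm_predict):
    """Step (3): adjust w over candidate values until accuracy is maximized."""
    def accuracy(w):
        preds = [gam_predict(u, w, gmm_predict, mcm_predict) for u in users]
        return sum(p == y for p, y in zip(preds, labels)) / len(labels)
    return max(candidates, key=accuracy)
```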

5 Experiment and evaluation

The experiment and evaluation work is described from four aspects. First, the clustering results of GMM and MCM are described; after that, the integration strategy is investigated and tested; additionally, the performance evaluation is conducted against several existing algorithms; and finally, we discuss the model implementation and scalability. From the collected dataset, 4000 out of the 30,116 users are randomly selected as experiment users. The experiment is conducted in the Matlab [14] environment.

5.1 Clustering result

Figure 4(a) and (b) show the clustering results of GMM and MCM respectively. These results contain noise from swing users (whose interest categories are difficult to determine). After noise filtering (using the component analysis method provided in MATLAB), we obtain clearer clustering results, illustrated in Fig. 5, which contain 3835 valid users.

Fig. 4 The clustering results

Fig. 5 The clustering results after denoising

It is obvious that the user classification result in Fig. 5(a) (where the boundaries among clusters are better separated) is better than that in Fig. 5(b) (where certain clusters are scattered and difficult to distinguish). Specifically, the users can be accurately divided into 20 categories in Fig. 5(a), whereas only 14 categories can be distinguished in Fig. 5(b). Therefore, the cluster graphs show that the GMM approach separates the boundaries of the interest categories better than the MCM approach.

In more detail, the clustering results for the 20 categories are listed in Table 2. Although some classifications might be wrong (for instance, some users classified in the ‘Entertainment’ category may not belong to it), the large gap between GMM and MCM indicates the advantage of adopting GMM for clustering analysis in crowd intelligence.

Table 2 The Clustering Results

On the other hand, the time consumption of GMM and MCM is also compared. As shown in Fig. 6, the MCM algorithm takes almost 16 times longer than GMM.

Fig. 6 Time consumption

5.2 Combination strategy

To explore an optimized integration, we further investigate the clustering errors of both the GMM and MCM methods, and find an obvious feature difference between the two approaches in the number of posted messages (including text, images, videos, etc.).

5.2.1 Clustering error analysis

According to the previous clustering results, the error data can be listed, as shown in Table 3. We check each error user individually and find that most error users produced by GMM have fewer posted messages (between 34 and 208 in each category) than the error users generated by MCM. Therefore, the number of posted messages is probably a distinguishing feature between the GMM and MCM approaches.

Table 3 Error Users Analysis in GMM and MCM Respectively

5.2.2 The effect of the number of posted messages

Furthermore, this section groups users by the total number of posted messages (0–50, 50–100, 100–150, 150–200, 200–250, 250–300, 300–350, 350–400, 400–450, 450–500, 500–550, and 550 plus) and discusses the effect of the number of posted messages on prediction accuracy. We randomly select 20 users in each interval and test the prediction accuracy of the GMM and MCM approaches respectively. The results in Fig. 7 show that (1) the GMM algorithm is more accurate than MCM when the number of posted messages is larger than 300; and (2) the MCM approach still achieves much better performance when the number of posted messages is between 50 and 150.

Fig. 7 Effect of different numbers of posted messages
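The interval analysis can be reproduced with a simple bucketing routine such as the sketch below; the interval edges follow the text, while the predictor callable and the user object are placeholders.

```python
import random

INTERVALS = [(0, 50), (50, 100), (100, 150), (150, 200), (200, 250), (250, 300),
             (300, 350), (350, 400), (400, 450), (450, 500), (500, 550), (550, float("inf"))]


def interval_accuracy(users, labels, predict, sample_size=20, seed=0):
    """Randomly sample users in each message-count interval and report accuracy."""
    rng = random.Random(seed)
    results = {}
    for lo, hi in INTERVALS:
        bucket = [(u, y) for u, y in zip(users, labels) if lo <= len(u.messages) < hi]
        sample = rng.sample(bucket, min(sample_size, len(bucket)))
        if not sample:
            continue
        results[(lo, hi)] = sum(predict(u) == y for u, y in sample) / len(sample)
    return results
```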

Therefore, we set the threshold value in this paper to 150, implement the GMM approach when the number of posted messages is above this threshold, and implement MCM when it is below. The prediction accuracy is further improved to around 95% in each of the 20 categories (see Fig. 8).

Fig. 8 Performance evaluation of different models

5.3 GAM performance

Furthermore, we compare the proposed GAM solution with the GMM and MCM based solutions, as well as with traditional solutions such as the LIBSVM and K-Means algorithms provided by the Weka tool [26]. Table 4 shows that the proposed solution achieves 93.49% true positives and 94.82% true negatives. Table 5 further reports the precision, recall, and F-measure values, which are always above 0.9. Finally, the comparison with SVM and the other classifiers illustrates that our GAM solution achieves the highest prediction accuracy (shown in Table 6).
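For reference, the confusion-matrix-derived metrics reported in Tables 4–6 can be computed as in the sketch below; scikit-learn is used here only as an illustrative stand-in for the Weka evaluation.

```python
from sklearn.metrics import accuracy_score, confusion_matrix, precision_recall_fscore_support


def evaluate(y_true, y_pred):
    """Confusion matrix, per-class precision/recall/F-measure, and overall accuracy."""
    precision, recall, f1, _ = precision_recall_fscore_support(y_true, y_pred, zero_division=0)
    return {"confusion_matrix": confusion_matrix(y_true, y_pred),
            "precision": precision, "recall": recall, "f_measure": f1,
            "accuracy": accuracy_score(y_true, y_pred)}
```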

Table 4 Confusion matrix evaluation
Table 5 Classification evaluation
Table 6 Comparison with traditional classifiers

5.4 Discussion

Three topics of discussion to justify our research contributions to recent advances in crowd intelligence are as follows:

  1. Model Feasibility and Scalability: the compromised GAM model (integrating Gaussian and Markov based clustering approaches) is theoretically feasible and is validated via a series of experiments. Additionally, the proposed solution, with few revisions, is scalable to other social networks (e.g., Facebook, Twitter, etc.).

  2. Computation Efficiency: As a compromise between GMM and MCM, the computation efficiency of GAM depends on the ratio of users whose number of posted messages is below or above the predefined threshold value. Since most normal users in social networks post more messages than the predefined threshold (as seen in Fig. 1), our proposed GAM solution incurs only slightly more computation burden than the GMM approach, which is regarded as acceptable.

  3. Performance Enhancement: Due to the different datasets and environment settings, it is difficult to directly compare performance with existing works. However, our solution achieves the highest result for two reasons: firstly, existing works take either tags / limited content, or only social relationships, into consideration, while our solution considers all posted messages; secondly, our proposed solution considers “the number of posted messages” as the only key feature and is capable of selecting the optimal handling mechanism. In summary, we have greatly improved the prediction accuracy from 60%–80% (see references [5, 11, 15, 21, 28, 35]) to 94.3% in our work.

6 Conclusions

User interest prediction in social networks has become a hot topic in both academia and industry. This paper introduces clustering approaches to achieve soft computing (or computational intelligence), specifically GMM and MCM, and finally proposes the GAM solution to predict user interest in social networks. We have conducted a series of experiments and analyses to show that our proposed GAM solution is feasible, efficient, and achieves higher prediction accuracy. Compared with other algorithms and existing work, our proposed solution also has acceptable computation complexity. We demonstrate our work on recent advances in soft computing and justify our research contributions by applying different methods to meet the prediction challenges of social intelligent multimedia systems.