Abstract
In this paper, we present an overlapping microblog community detecting algorithm using a new partition criterion. The new partition criterion considers the explicit and implicit interest of users, together with the user’s attention and practical application. Users containing certain tag in the whole community are extracted as a temporary group and the quality value is calculated under the current partition. The most appropriate core tag is selected and the corresponding group is then updated until certain requirements are satisfied. The community detected by this algorithm share common core tags and the partition results can be revised. Experimental results show that the method is effective and has practical significance.
Access provided by CONRICYT-eBooks. Download conference paper PDF
Similar content being viewed by others
Keywords
1 Introduction
Real world offers a wide variety of possible group organizations, such as families, working and friendship circle. With the diffusion of Internet, virtual groups, like online communities, have emerged on the Web [1, 2]. Social networks are defined as a set of actors and relationships that represent the interactions between entities in the network. Since social networks have important roles in the dissemination of information and innovation, the analysis of such networks, attracted much attention in the research area. Overlapping is a significant feature of the social network community structure and overlapping community detection has attracted an increasing attention [3,4,5,6]. Unlike traditional networks, microblog network is much more complex, which contains a huge amount of users and various kinds of inter-relationships [7, 8]. Thus, the community detection method for microblog users is different from the traditional community detection algorithms, which requires to carry out the relevant research in a systematic way.
User tags play an important role for personalized recommendation. These tags provide a powerful scheme to represent attributes or interests of microblog users. We exploit tag correlation to investigates both inter and intra correlation of tags, and the tags for users can therefore be expanded (following the ideas of reference [9]), which will increase tag vocabulary and alleviate tag sparsity on measuring user similarity and can eventually reveal implicit interests for users. Besides, following-behaviors form a social network of microblog users, which also indicates the interests of the user. What is more, it is believed that the nodes with a central position in their partitions may have an important function of control and stability within the group. Based on the above aspects, in this paper, a new partition criterion is proposed and a novel tag cut strategy is developed for microblog community detection. Figure 1 summarizes an overview of the proposed microblog community detection system.
The basic outline of this paper is as follows: The partition criterion is given in Sect. 2 and the community detection algorithm via core tag is given in Sect. 3. The experiments and results are demonstrated in Sect. 4. Lastly, we conclude our paper in Sect. 5.
2 The Partition Criterion
The microblog network is defined as a directed graph G, namely a triple \( G = \left( {V,E,T} \right) \), where V and E are the user set and the edge set of the graph respectively. \( T = \left\{ {t_{1} ,t_{2} ,t_{3} \ldots t_{m} } \right\} \) is a finite set of tags for users. Each user \( v_{i} \in V\;{\text{and}}\;{\text{E = }}\left\{ {{<}v_{i} ,v_{j} {>} \left| {v_{i} \in V,v_{j} \in V} \right.} \right\} \) denotes the following relationship from vi to vj. \( C = \left\{ {C_{1} ,C_{2} ,C_{3} \ldots C_{k} } \right\} \) is the partition result of G, TCi denotes tag set for users in the cluster Ci.
Intuitively, an optimal community partition satisfies the following three aspects for each user:
-
(1)
Users in the cluster Ci share more common tags and the following relationship should be focused within cluster Ci.
-
(2)
Users in different clusters Ci and Cj own different tags and the following relationship should be consistent with each other.
-
(3)
The partition result should be of practical significance.
In view of the above three conditions, an evaluation function is proposed to guarantee the partition result, and three evaluation aspects are introduced to make a comprehensive correction of the partition results.
2.1 Community Cohesion
Definition 1
(Tag Similarity). Let Ti denote tag set for user vi and Tj denote tag set for user vj respectively, then the Tag Similarity between users vi and user vj is defined as:
Definition 2
(Tag Cohesion for Cl). Let users vi and vj belong to cluster Cl, Ti denotes tag set for user vi and Tj denotes tag set for user vj respectively, then the Tag Cohesion for Cl is defined as:
Definition 3
(Tag Cohesion for C). Let \( C = \left( {C_{1} ,C_{2} ,C_{3} \cdots C_{k} } \right) \) be the partition result of \( G = \left( {V,\,E,\,T} \right) \), then the Tag Cohesion for C is defined as:
Community cohesion reflects the degree of tag overlap of users in the community. The higher the tag cohesion, the higher the overlap of tags and vice versa.
Definition 4
(Edge Cohesion for Cl). Let user vi belong to cluster Cl, eij is the edge from vi to vj, then the Edge Cohesion for Cl is defined as:
Edge Cohesion indicates the proportion of following users within their community.
Definition 5
(Edge Cohesive for C). Let \( C = \left( {C_{1} ,C_{2} ,C_{3} \cdots C_{k} } \right) \) be the partition result of \( G = \left( {V,E,T} \right) \), then the Edge Cohesion for C is defined as:
Definition 6
(Cohesive for C). Let \( C = \left( {C_{1} ,C_{2} ,C_{3} \cdots C_{k} } \right) \) be the partition result of \( G = \left( {V,E,T} \right) \), then the Cohesion for C is defined as:
2.2 Community Coupling Degree
Definition 7
(Tag Similarity of Ci and Cj). Let TCi be the tag set for users in cluster Ci and TCj is the tag set for users in cluster Cj, then the Similarity of Ci and Cj is defined as:
Definition 8
(Tag Similarity for C). Let \( C = \left( {C_{1} ,C_{2} ,C_{3} \cdots C_{k} } \right) \) be the partition result of \( G = \left( {V,E,T} \right) \), then the Tag Similarity for C is defined as:
Definition 9
(Following rate from Ci to Cj). Let \( C = \left( {C_{1} ,C_{2} ,C_{3} \cdots C_{k} } \right) \) be the partition result of \( G = \left( {V,E,T} \right) \), there are p users in Ci and pm users follow users in Cj, then the Following rate from Ci to Cj is defined as:
Definition 10
(Following entropy from Ci to Cj). Let \( C = \left( {C_{1} ,C_{2} ,C_{3} \cdots C_{k} } \right) \) be the partition result of \( G = \left( {V,E,T} \right) \), then the Following entropy from Ci to Cj is defined as:
Definition 11
(Edge entropy). Let \( C = \left( {C_{1} ,C_{2} ,C_{3} \cdots C_{k} } \right) \) be the partition result of \( G = \left( {V,E,T} \right) \), then the Edge entropy is defined as:
Definition 12
(Coupling Degree for C). Let \( C = \left( {C_{1} ,C_{2} ,C_{3} \cdots C_{k} } \right) \) be the partition result of \( G = \left( {V,E,T} \right) \), then Coupling Degree for C is defined as:
2.3 Community Division Quality Evaluation Function
There are different user groups with different influence in the network community and those with great prestige can form the so-called ‘opinion leaders community’, thus it is necessary to distinguish them with the ordinary users [10, 11]. Our work introduces the PageRank algorithm and rank users via their influence to make sure users’ influence are similar within the same group. Each group of users take the variance of PageRank to measure the practicality of community detection, and the smaller the variance, the similar the influence of the users in the group.
Definition 13
(Practicality for C). Let \( C = \left( {C_{1} ,C_{2} ,C_{3} \cdots C_{k} } \right) \) be the partition result of \( G = \left( {V,E,T} \right) \), \( v_{i } \in C_{i} \), VarPageRank(vi) denotes the variance of PageRank of Ci, then Practicality for C is defined as:
Combining the above three evaluation criterion, the overall community quality evaluation function is defined as:
Definition 14
(Normalized quality function). Let \( C = \left( {C_{1} ,C_{2} ,C_{3} \cdots C_{k} } \right) \) be the partition result of \( G = \left( {V,E,T} \right) \), the normalized quality function is defined as:
Cohesion reflects the cohesion degree from the perspective within the community itself, Coupling considers the degree of connections between the communities and Utility considered the practical application, which distinguish the influential users and common users separately. The three criterion are mutually restricted, thus the quality evaluation function of community division is a compromise among the three.
3 Community Detection via Core Tags
3.1 Finding Core Tags
The optimal tags for community division can be detected through minimizing the normalized quality function. The whole community network is initially taken as one cluster, users with certain tag t are extracted as one partition and the rest users are those without tag t. For each new partition, we calculate the current normalized quality value. The tag whose partition can minimize the normalized quality function is marked as the core tag. The pseudocode of finding core tags is listed in Table 1.
3.2 Microblog Community Detection Strategy Based on Core Tags
The basic idea of our microblog community detection strategy is to determine the optimal partition via core tags in an iterative way. In order to avoid the excessive overlap between communities, we introduce the following definition:
Definition 15
(Similarity between Ci and Cj). Let \( C = \left( {C_{1} ,C_{2} ,C_{3} \cdots C_{k} } \right) \) be the partition result of \( G = \left( {V,E,T} \right) \), \( V_{{C_{i} }} \) denotes users in Ci, the similarity between Ci and Cj is defined as:
It can be seen that the similarity between Ci and Cj denotes the number of common users in these two partitions. The higher the similarity, the more overlap the two partitions and vice versa, therefore the introduction of similarity can control the degree of overlap effectively. The pseudocode of tag cut strategy can be summarized in Table 2.
In the TagCut strategy, the algorithm first initializes all the needed variables, then the core tag is decided in each iteration through minimizing the quality criterion for the current partition. After a core tag is determined, a new group Cb with users containing the core tag is obtained. A new partition Cb is formed if its similarity is less than a certain predefined threshold with all the rest partitions, otherwise all users in Cb should be added into the partition Ci that is the most similar one. Thereafter, the new partition is updated and the partition is updated until certain termination condition is met. Two specific algorithms NSTC (Number Specified Tag Cut) and QBTC (Quality Based Tag Cut) are derived from the Tag cut strategy based on two different termination conditions. The former takes the number of communities as the termination condition while the latter algorithm will end if the quality becomes worse. To be more specific, as for NSTC, after each iteration a new group is established and the algorithm then detects if the current number of communities exceeds the predefined threshold η to control the size of the communities. Obviously, this algorithm relies heavily on η. As for QBTC algorithm, with the iteration continues, the evaluation quality will come to its minimum value. Thereafter, evaluation quality function is considered as the termination condition.
4 Experiments and Results Analysis
4.1 Dataset Description
In this section, we demonstrate the performance of our algorithm for microblog community detection on three different datasets obtained from Sina weibo API. The crawl starts from a student majoring in computer science as the core user, and broad first search principle is applied. For S1, we take the core user and his following users, which mainly includes his classmates and users in information technology. For S2, we expand S1 by adding all following users of S1. S3 is obtained based on S2 using the same strategy as constructing S2. Table 3 shows the statistics of datasets.
4.2 Experimental Methodology
In order to verify the effectiveness of our algorithm, we design two experiments. (1) We test the most suitable γ via the proper number of communities. (2) We make a deep analysis of algorithm QBTC and NSTC.
The Sensitivity of Parameter γ
We start to use all datasets for NSTC algorithm to get the most appropriate number of communities with the increase of parameter η from 1 to 20. The quality value is kept and η with the smallest quality value is regarded as the most reasonable number of communities as is shown in Fig. 2.
For tagcut strategy, tag similarity Jaccard(Ci, Cj) determines whether the user should be classified into the previous groups or a new group should be established. Parameter γ will greatly influence the QBTC algorithm since QBTC algorithm will continue to partition iteratively before the quality function getting bigger or all the users are classified into some groups. A small parameter γ will merge a great amount of communities, the overlap of communities will be too small to form less groups while a big parameter γ will divide too many communities with high overlap degree.
From Fig. 3 we can see that when γ = 0 the new partition will form a new community no matter if it is independent enough, therefore the overlap cannot be controlled and some partition will be overlapped too much which results in the worse partition result. When γ = 1 the partition has degenerated into an non-overlap community. And when γ = 0.8 we can get the desired result, which means the proportion of overlap is the most close to the real results.
Performance Analysis of NSTC Algorithm and QBTC Algorithm
The number of communities should be specified for NSTC algorithm and the evaluation function will tend to a certain value. In this part, we only deal with S3 as the number of communities for S1 and S2 are too limited which cannot show the variation clearly. We change the value of η from 1 to 10 and calculate the value evaluation function respectively, and the experimental result is demonstrated in Fig. 4.
QBTC algorithm takes advantage of the value of evaluation function to determine the termination of iteration, the value of evaluation function tends to be better from worse, and the result can be demonstrated in Fig. 5.
Besides, as the parameter η of NSTC algorithm is more close to the return value of ClusterNum of QBTC algorithm, the partition result is more similar with each other, this is because both algorithms take advantage of TagCut strategy, the termination condition of NSTC algorithm is to meet the needs of the number of communities while QBTC algorithm is to get the global optimal value.
5 Conclusions and Future Work
In this paper, we propose a strategy for microblog overlapping community detection based on core tags. We have established a new partition criterion from three aspects: the explicit and implicit interest of users, user’s attention as well as practical application, based on which a tag cut mechanism is constructed. Our experiments demonstrate the effectiveness of our approach. In the future, we are aiming at designing new mechanism for tag expansion and tag weighing.
References
Yang, B., Liu, D., Liu, J.: Discovering communities from social networks: methodologies and applications. In: Furht, B. (ed.) Handbook of Social Network Technologies and Applications, pp. 331–346. Springer, Boston (2010). https://doi.org/10.1007/978-1-4419-7142-5_16
Papadopoulos, S., Kompatsiaris, Y., Vakali, A., et al.: Community detection in social media. Data Min. Knowl. Discov. 24(3), 515–554 (2012)
Xiong, X., Zhou, G., Niu, X., et al.: Remodeling the network for microgroup detection on microblog. Knowl. Inf. Syst. 39(3), 643–665 (2013)
Yang, T., Chi, Y., Zhu, S., et al.: Detecting communities and their evolutions in dynamic social networks-a Bayesian approach. Mach. Learn. 82(2), 157–189 (2011)
Li, H.J., Zhan, B., Li, A., et al.: Fast and accurate mining the community structure: integrating center locating and membership optimization. IEEE Trans. Knowl. Data Eng. 24(6), 2349–2362 (2016)
Reula, A., Vargasb, J.M.: Consensus clustering in complex networks. Sci. Rep. 2(13), 5134 (2012)
Liang, S., Rijke, M.D.: Burst-aware data fusion for microblog search. Inf. Process. Manag. 51(2), 89–113 (2015)
Magdy, W., Elsayed, T.: Unsupervised adaptive microblog filtering for broad dynamic topics. Inf. Process. Manag. Int. J. 52(4), 513–528 (2016)
Ma, H., Jia, M., Zhang, D., et al.: Combining tag correlation and user social relation for microblog recommendation. Inf. Sci. Int. J. 385(C), 325–337 (2017)
Zhang, W.Z., Wang, B.L., Hui, H.E., et al.: Public opinion leader community mining based on the heterogeneous network. Tien Tzu Hsueh Pao/Acta Electron. Sin. 40(10), 1927–1932 (2012)
Peetz, M.H., Rijke, M.D., Kaptein, R.: Estimating reputation polarity on microblog posts. Inf. Process. Manag. 52(2), 193–216 (2016)
Acknowledgement
The authors would also like to thank the anonymous referees for their valuable comments and helpful suggestions. The work is supported by the National Natural Science Foundation of China (No. 61762078, 61363058, 61762079) and Guangxi Key Laboratory of Trusted Software (No. kx201705).
Author information
Authors and Affiliations
Corresponding author
Editor information
Editors and Affiliations
Rights and permissions
Copyright information
© 2018 Springer Nature Switzerland AG
About this paper
Cite this paper
Ma, H., Xie, M., Wei, J., He, T. (2018). An Overlapping Microblog Community Detection Method Using New Partition Criterion. In: Liu, W., Giunchiglia, F., Yang, B. (eds) Knowledge Science, Engineering and Management. KSEM 2018. Lecture Notes in Computer Science(), vol 11062. Springer, Cham. https://doi.org/10.1007/978-3-319-99247-1_28
Download citation
DOI: https://doi.org/10.1007/978-3-319-99247-1_28
Published:
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-99246-4
Online ISBN: 978-3-319-99247-1
eBook Packages: Computer ScienceComputer Science (R0)