An Overlapping Microblog Community Detection Method Using New Partition Criterion

Ma, Huifang; Xie, Meng; Wei, Jiahui; He, Tingnian

doi:10.1007/978-3-319-99247-1_28

Huifang Ma^16,17,
Meng Xie¹⁶,
Jiahui Wei¹⁶ &
…
Tingnian He¹⁶

Part of the book series: Lecture Notes in Computer Science ((LNAI,volume 11062))

Included in the following conference series:

International Conference on Knowledge Science, Engineering and Management

1378 Accesses
1 Citations

Abstract

In this paper, we present an overlapping microblog community detecting algorithm using a new partition criterion. The new partition criterion considers the explicit and implicit interest of users, together with the user’s attention and practical application. Users containing certain tag in the whole community are extracted as a temporary group and the quality value is calculated under the current partition. The most appropriate core tag is selected and the corresponding group is then updated until certain requirements are satisfied. The community detected by this algorithm share common core tags and the partition results can be revised. Experimental results show that the method is effective and has practical significance.

Access provided by CONRICYT-eBooks. Download conference paper PDF

Mining Microblog Community Based on Clustering Analysis

Detection of Microblog Overlapping Community Based on Multidimensional Information and Edge Distance Matrix

High-performance social networking: microblog community detection based on efficient interactive characteristic clustering

Article 18 February 2017

Keywords

1 Introduction

Real world offers a wide variety of possible group organizations, such as families, working and friendship circle. With the diffusion of Internet, virtual groups, like online communities, have emerged on the Web [1, 2]. Social networks are defined as a set of actors and relationships that represent the interactions between entities in the network. Since social networks have important roles in the dissemination of information and innovation, the analysis of such networks, attracted much attention in the research area. Overlapping is a significant feature of the social network community structure and overlapping community detection has attracted an increasing attention [3,4,5,6]. Unlike traditional networks, microblog network is much more complex, which contains a huge amount of users and various kinds of inter-relationships [7, 8]. Thus, the community detection method for microblog users is different from the traditional community detection algorithms, which requires to carry out the relevant research in a systematic way.

User tags play an important role for personalized recommendation. These tags provide a powerful scheme to represent attributes or interests of microblog users. We exploit tag correlation to investigates both inter and intra correlation of tags, and the tags for users can therefore be expanded (following the ideas of reference [9]), which will increase tag vocabulary and alleviate tag sparsity on measuring user similarity and can eventually reveal implicit interests for users. Besides, following-behaviors form a social network of microblog users, which also indicates the interests of the user. What is more, it is believed that the nodes with a central position in their partitions may have an important function of control and stability within the group. Based on the above aspects, in this paper, a new partition criterion is proposed and a novel tag cut strategy is developed for microblog community detection. Figure 1 summarizes an overview of the proposed microblog community detection system.

The basic outline of this paper is as follows: The partition criterion is given in Sect. 2 and the community detection algorithm via core tag is given in Sect. 3. The experiments and results are demonstrated in Sect. 4. Lastly, we conclude our paper in Sect. 5.

2 The Partition Criterion

The microblog network is defined as a directed graph G, namely a triple $ G = \left( {V,E,T} \right) $, where V and E are the user set and the edge set of the graph respectively. $ T = \left\{ {t_{1} ,t_{2} ,t_{3} \ldots t_{m} } \right\} $ is a finite set of tags for users. Each user $ v_{i} \in V\;{\text{and}}\;{\text{E = }}\left\{ {{<}v_{i} ,v_{j} {>} \left| {v_{i} \in V,v_{j} \in V} \right.} \right\} $ denotes the following relationship from v_i to v_j. $ C = \left\{ {C_{1} ,C_{2} ,C_{3} \ldots C_{k} } \right\} $ is the partition result of G, T_Ci denotes tag set for users in the cluster C_i.

Intuitively, an optimal community partition satisfies the following three aspects for each user:

(1)
Users in the cluster C_i share more common tags and the following relationship should be focused within cluster C_i.
(2)
Users in different clusters C_i and C_j own different tags and the following relationship should be consistent with each other.
(3)
The partition result should be of practical significance.

In view of the above three conditions, an evaluation function is proposed to guarantee the partition result, and three evaluation aspects are introduced to make a comprehensive correction of the partition results.

2.1 Community Cohesion

Definition 1

(Tag Similarity). Let T_i denote tag set for user v_i and T_j denote tag set for user v_j respectively, then the Tag Similarity between users v_i and user v_j is defined as:

$$ TagSim\,\left( {v_{i} ,v_{j} } \right) = \frac{{\left| {T_{i} \cap T_{j} } \right|}}{{\left| {T_{i} \cup T_{j} } \right|}} $$

(1)

Definition 2

(Tag Cohesion for C_l). Let users v_i and v_j belong to cluster C_l, T_i denotes tag set for user v_i and T_j denotes tag set for user v_j respectively, then the Tag Cohesion for C_l is defined as:

$$ TagCohesion\left( {C_{l} } \right) = \frac{{\sum\limits_{{v_{i} \in C_{l} }} {\sum\limits_{{v_{j} \in C_{l} }} {TagSim\left( {v_{i} ,v_{j} } \right)} } }}{{\left| {C_{l} } \right|^{2} }} $$

(2)

Definition 3

(Tag Cohesion for C). Let $ C = \left( {C_{1} ,C_{2} ,C_{3} \cdots C_{k} } \right) $ be the partition result of $ G = \left( {V,\,E,\,T} \right) $, then the Tag Cohesion for C is defined as:

$$ TagCohesion(C) = \frac{{\sum\limits_{{C_{l} \in C}} {TagCohesion(C_{l} )} }}{k} $$

(3)

Community cohesion reflects the degree of tag overlap of users in the community. The higher the tag cohesion, the higher the overlap of tags and vice versa.

Definition 4

(Edge Cohesion for C_l). Let user v_i belong to cluster C_l, e_ij is the edge from v_i to v_j, then the Edge Cohesion for C_l is defined as:

$$ EdgeDispersion\,\left( {C_{1} } \right) = \frac{{\left| {\sum\limits_{{v_{i} ,\,v_{j} \in C_{1} }} {e_{ij} } } \right|}}{{\left| {\sum\limits_{{v_{i} \in C_{1} }} {e_{ij} } } \right|}} $$

(4)

Edge Cohesion indicates the proportion of following users within their community.

Definition 5

(Edge Cohesive for C). Let $ C = \left( {C_{1} ,C_{2} ,C_{3} \cdots C_{k} } \right) $ be the partition result of $ G = \left( {V,E,T} \right) $, then the Edge Cohesion for C is defined as:

$$ EdgeCohesion(C) = \frac{{\sum\limits_{{C_{l} \in C}} {EdgeDispersion(C_{l} )} }}{k} $$

(5)

Definition 6

(Cohesive for C). Let $ C = \left( {C_{1} ,C_{2} ,C_{3} \cdots C_{k} } \right) $ be the partition result of $ G = \left( {V,E,T} \right) $, then the Cohesion for C is defined as:

$$ Cohesion(C) = TagCohesion(C) \times EdgeCohesion(C) $$

(6)

2.2 Community Coupling Degree

Definition 7

(Tag Similarity of C_i and C_j). Let T_Ci be the tag set for users in cluster C_i and T_Cj is the tag set for users in cluster C_j, then the Similarity of C_i and C_j is defined as:

$$ Sim\left( {C_{i} ,C_{j} } \right) = \frac{{\left| {T_{{c_{i} }} \cap T_{{c_{j} }} } \right|}}{{\left| {T_{{c_{i} }} \cup T_{{c_{j} }} } \right|}} $$

(7)

Definition 8

(Tag Similarity for C). Let $ C = \left( {C_{1} ,C_{2} ,C_{3} \cdots C_{k} } \right) $ be the partition result of $ G = \left( {V,E,T} \right) $, then the Tag Similarity for C is defined as:

$$ TagSim\left( C \right) = \frac{{\sum\limits_{{C_{i} \in C}} {\sum\limits_{{C_{j} \in C}} {Sim\left( {C_{i} ,C_{j} } \right)} } }}{{k^{2} - k}} $$

(8)

Definition 9

(Following rate from C_i to C_j). Let $ C = \left( {C_{1} ,C_{2} ,C_{3} \cdots C_{k} } \right) $ be the partition result of $ G = \left( {V,E,T} \right) $, there are p users in C_i and p_m users follow users in C_j, then the Following rate from C_i to C_j is defined as:

$$ Follow_{{C_{i} }} \left( {C_{j} } \right) = \frac{{p_{m} }}{p} $$

(9)

Definition 10

(Following entropy from Ci to Cj). Let $ C = \left( {C_{1} ,C_{2} ,C_{3} \cdots C_{k} } \right) $ be the partition result of $ G = \left( {V,E,T} \right) $, then the Following entropy from Ci to Cj is defined as:

$$ H_{{C_{j} }} \left( {C_{i} } \right) = \left\{ {\begin{array}{*{20}c} { - Follow_{{C_{i} }} \left( {C_{j} } \right)\log_{2} \;Follow_{{C_{i} }} \left( {C_{j} } \right),} & {if\;Follow_{{C_{i} }} \left( {C_{j} } \right) > 0} \\ {0,} & {else.} \\ \end{array} } \right. $$

(10)

Definition 11

(Edge entropy). Let $ C = \left( {C_{1} ,C_{2} ,C_{3} \cdots C_{k} } \right) $ be the partition result of $ G = \left( {V,E,T} \right) $, then the Edge entropy is defined as:

$$ EdgeEntropy\left( C \right) = \frac{{\sum\limits_{{C_{i} \in C}} {\sum\limits_{{C_{j} \in C}} {H_{{C_{j} }} \left( {C_{i} } \right)} } }}{{k^{2} - k}} $$

(11)

Definition 12

(Coupling Degree for C). Let $ C = \left( {C_{1} ,C_{2} ,C_{3} \cdots C_{k} } \right) $ be the partition result of $ G = \left( {V,E,T} \right) $, then Coupling Degree for C is defined as:

$$ Coupling\left( C \right) = TagSim\left( C \right) \times EdgeEntropy\left( C \right) $$

(12)

2.3 Community Division Quality Evaluation Function

There are different user groups with different influence in the network community and those with great prestige can form the so-called ‘opinion leaders community’, thus it is necessary to distinguish them with the ordinary users [10, 11]. Our work introduces the PageRank algorithm and rank users via their influence to make sure users’ influence are similar within the same group. Each group of users take the variance of PageRank to measure the practicality of community detection, and the smaller the variance, the similar the influence of the users in the group.

Definition 13

(Practicality for C). Let $ C = \left( {C_{1} ,C_{2} ,C_{3} \cdots C_{k} } \right) $ be the partition result of $ G = \left( {V,E,T} \right) $, $ v_{i } \in C_{i} $, VarPageRank(v_i) denotes the variance of PageRank of C_i, then Practicality for C is defined as:

$$ Utility\left( C \right) = \sum\limits_{{C_{i} \in C}} {\frac{{\left[ {VarPageRank\left( {v_{i} } \right)\left| {v_{i} \in C_{i} } \right.} \right]}}{{\hbox{max} \left[ {PageRank\left( {v_{i} } \right) - \overline{{PageRank\left( {v_{i} } \right)}} } \right]^{2} }}} $$

(13)

Combining the above three evaluation criterion, the overall community quality evaluation function is defined as:

Definition 14

(Normalized quality function). Let $ C = \left( {C_{1} ,C_{2} ,C_{3} \cdots C_{k} } \right) $ be the partition result of $ G = \left( {V,E,T} \right) $, the normalized quality function is defined as:

$$ Quality(C) = \frac{Coupling(C) \times Utility(C)}{Cohesion(C) + 1} $$

(14)

Cohesion reflects the cohesion degree from the perspective within the community itself, Coupling considers the degree of connections between the communities and Utility considered the practical application, which distinguish the influential users and common users separately. The three criterion are mutually restricted, thus the quality evaluation function of community division is a compromise among the three.

3 Community Detection via Core Tags

3.1 Finding Core Tags

The optimal tags for community division can be detected through minimizing the normalized quality function. The whole community network is initially taken as one cluster, users with certain tag t are extracted as one partition and the rest users are those without tag t. For each new partition, we calculate the current normalized quality value. The tag whose partition can minimize the normalized quality function is marked as the core tag. The pseudocode of finding core tags is listed in Table 1.

Table 1. Pseudocode of finding core tags

Full size table

3.2 Microblog Community Detection Strategy Based on Core Tags

The basic idea of our microblog community detection strategy is to determine the optimal partition via core tags in an iterative way. In order to avoid the excessive overlap between communities, we introduce the following definition:

Definition 15

(Similarity between C_i and C_j). Let $ C = \left( {C_{1} ,C_{2} ,C_{3} \cdots C_{k} } \right) $ be the partition result of $ G = \left( {V,E,T} \right) $, $ V_{{C_{i} }} $ denotes users in C_i, the similarity between C_i and C_j is defined as:

$$ Jaccard(C_{i} ,C_{j} ) = \frac{{\left| {V_{{C_{i} }} \cap V_{{C_{j} }} } \right|}}{{\left| {V_{{C_{i} }} \cup V_{{C_{j} }} } \right|}} $$

(15)

It can be seen that the similarity between C_i and C_j denotes the number of common users in these two partitions. The higher the similarity, the more overlap the two partitions and vice versa, therefore the introduction of similarity can control the degree of overlap effectively. The pseudocode of tag cut strategy can be summarized in Table 2.

Table 2. Pseudocode of tag cut strategy

Full size table

In the TagCut strategy, the algorithm first initializes all the needed variables, then the core tag is decided in each iteration through minimizing the quality criterion for the current partition. After a core tag is determined, a new group C_b with users containing the core tag is obtained. A new partition C_b is formed if its similarity is less than a certain predefined threshold with all the rest partitions, otherwise all users in C_b should be added into the partition C_i that is the most similar one. Thereafter, the new partition is updated and the partition is updated until certain termination condition is met. Two specific algorithms NSTC (Number Specified Tag Cut) and QBTC (Quality Based Tag Cut) are derived from the Tag cut strategy based on two different termination conditions. The former takes the number of communities as the termination condition while the latter algorithm will end if the quality becomes worse. To be more specific, as for NSTC, after each iteration a new group is established and the algorithm then detects if the current number of communities exceeds the predefined threshold η to control the size of the communities. Obviously, this algorithm relies heavily on η. As for QBTC algorithm, with the iteration continues, the evaluation quality will come to its minimum value. Thereafter, evaluation quality function is considered as the termination condition.

4 Experiments and Results Analysis

4.1 Dataset Description

In this section, we demonstrate the performance of our algorithm for microblog community detection on three different datasets obtained from Sina weibo API. The crawl starts from a student majoring in computer science as the core user, and broad first search principle is applied. For S₁, we take the core user and his following users, which mainly includes his classmates and users in information technology. For S₂, we expand S₁ by adding all following users of S₁. S₃ is obtained based on S₂ using the same strategy as constructing S₂. Table 3 shows the statistics of datasets.

Table 3. Data description

Full size table

4.2 Experimental Methodology

In order to verify the effectiveness of our algorithm, we design two experiments. (1) We test the most suitable γ via the proper number of communities. (2) We make a deep analysis of algorithm QBTC and NSTC.

The Sensitivity of Parameter γ

We start to use all datasets for NSTC algorithm to get the most appropriate number of communities with the increase of parameter η from 1 to 20. The quality value is kept and η with the smallest quality value is regarded as the most reasonable number of communities as is shown in Fig. 2.

For tagcut strategy, tag similarity Jaccard(C_i, C_j) determines whether the user should be classified into the previous groups or a new group should be established. Parameter γ will greatly influence the QBTC algorithm since QBTC algorithm will continue to partition iteratively before the quality function getting bigger or all the users are classified into some groups. A small parameter γ will merge a great amount of communities, the overlap of communities will be too small to form less groups while a big parameter γ will divide too many communities with high overlap degree.

From Fig. 3 we can see that when γ = 0 the new partition will form a new community no matter if it is independent enough, therefore the overlap cannot be controlled and some partition will be overlapped too much which results in the worse partition result. When γ = 1 the partition has degenerated into an non-overlap community. And when γ = 0.8 we can get the desired result, which means the proportion of overlap is the most close to the real results.

Performance Analysis of NSTC Algorithm and QBTC Algorithm

The number of communities should be specified for NSTC algorithm and the evaluation function will tend to a certain value. In this part, we only deal with S₃ as the number of communities for S₁ and S₂ are too limited which cannot show the variation clearly. We change the value of η from 1 to 10 and calculate the value evaluation function respectively, and the experimental result is demonstrated in Fig. 4.

QBTC algorithm takes advantage of the value of evaluation function to determine the termination of iteration, the value of evaluation function tends to be better from worse, and the result can be demonstrated in Fig. 5.

Besides, as the parameter η of NSTC algorithm is more close to the return value of ClusterNum of QBTC algorithm, the partition result is more similar with each other, this is because both algorithms take advantage of TagCut strategy, the termination condition of NSTC algorithm is to meet the needs of the number of communities while QBTC algorithm is to get the global optimal value.

5 Conclusions and Future Work

In this paper, we propose a strategy for microblog overlapping community detection based on core tags. We have established a new partition criterion from three aspects: the explicit and implicit interest of users, user’s attention as well as practical application, based on which a tag cut mechanism is constructed. Our experiments demonstrate the effectiveness of our approach. In the future, we are aiming at designing new mechanism for tag expansion and tag weighing.

References

Yang, B., Liu, D., Liu, J.: Discovering communities from social networks: methodologies and applications. In: Furht, B. (ed.) Handbook of Social Network Technologies and Applications, pp. 331–346. Springer, Boston (2010). https://doi.org/10.1007/978-1-4419-7142-5_16
Chapter Google Scholar
Papadopoulos, S., Kompatsiaris, Y., Vakali, A., et al.: Community detection in social media. Data Min. Knowl. Discov. 24(3), 515–554 (2012)
Article Google Scholar
Xiong, X., Zhou, G., Niu, X., et al.: Remodeling the network for microgroup detection on microblog. Knowl. Inf. Syst. 39(3), 643–665 (2013)
Article Google Scholar
Yang, T., Chi, Y., Zhu, S., et al.: Detecting communities and their evolutions in dynamic social networks-a Bayesian approach. Mach. Learn. 82(2), 157–189 (2011)
Article MathSciNet Google Scholar
Li, H.J., Zhan, B., Li, A., et al.: Fast and accurate mining the community structure: integrating center locating and membership optimization. IEEE Trans. Knowl. Data Eng. 24(6), 2349–2362 (2016)
Article Google Scholar
Reula, A., Vargasb, J.M.: Consensus clustering in complex networks. Sci. Rep. 2(13), 5134 (2012)
Google Scholar
Liang, S., Rijke, M.D.: Burst-aware data fusion for microblog search. Inf. Process. Manag. 51(2), 89–113 (2015)
Article Google Scholar
Magdy, W., Elsayed, T.: Unsupervised adaptive microblog filtering for broad dynamic topics. Inf. Process. Manag. Int. J. 52(4), 513–528 (2016)
Article Google Scholar
Ma, H., Jia, M., Zhang, D., et al.: Combining tag correlation and user social relation for microblog recommendation. Inf. Sci. Int. J. 385(C), 325–337 (2017)
Google Scholar
Zhang, W.Z., Wang, B.L., Hui, H.E., et al.: Public opinion leader community mining based on the heterogeneous network. Tien Tzu Hsueh Pao/Acta Electron. Sin. 40(10), 1927–1932 (2012)
Google Scholar
Peetz, M.H., Rijke, M.D., Kaptein, R.: Estimating reputation polarity on microblog posts. Inf. Process. Manag. 52(2), 193–216 (2016)
Article Google Scholar

Download references

Acknowledgement

The authors would also like to thank the anonymous referees for their valuable comments and helpful suggestions. The work is supported by the National Natural Science Foundation of China (No. 61762078, 61363058, 61762079) and Guangxi Key Laboratory of Trusted Software (No. kx201705).

Author information

Authors and Affiliations

College of Computer Science and Engineering, Northwest Normal University, Lanzhou, 730070, Gansu, China
Huifang Ma, Meng Xie, Jiahui Wei & Tingnian He
Guangxi Key Laboratory of Trusted Software, Guilin University of Electronic Technology, Guilin, 541004, China
Huifang Ma

Authors

Huifang Ma
View author publications
You can also search for this author in PubMed Google Scholar
Meng Xie
View author publications
You can also search for this author in PubMed Google Scholar
Jiahui Wei
View author publications
You can also search for this author in PubMed Google Scholar
Tingnian He
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Huifang Ma .

Editor information

Editors and Affiliations

University of Bristol, Bristol, United Kingdom
Weiru Liu
Università di Trento, Povo, Italy
Fausto Giunchiglia
Jilin University, Changchun, China
Bo Yang

Rights and permissions

Reprints and permissions

Copyright information

About this paper

Cite this paper

Ma, H., Xie, M., Wei, J., He, T. (2018). An Overlapping Microblog Community Detection Method Using New Partition Criterion. In: Liu, W., Giunchiglia, F., Yang, B. (eds) Knowledge Science, Engineering and Management. KSEM 2018. Lecture Notes in Computer Science(), vol 11062. Springer, Cham. https://doi.org/10.1007/978-3-319-99247-1_28

Download citation

DOI: https://doi.org/10.1007/978-3-319-99247-1_28
Published: 11 August 2018
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-99246-4
Online ISBN: 978-3-319-99247-1
eBook Packages: Computer ScienceComputer Science (R0)

Publish with us

Policies and ethics

An Overlapping Microblog Community Detection Method Using New Partition Criterion

Abstract

Similar content being viewed by others

Mining Microblog Community Based on Clustering Analysis

Detection of Microblog Overlapping Community Based on Multidimensional Information and Edge Distance Matrix

High-performance social networking: microblog community detection based on efficient interactive characteristic clustering

Keywords

1 Introduction

2 The Partition Criterion

2.1 Community Cohesion

Definition 1

Definition 2

Definition 3

Definition 4

Definition 5

Definition 6

2.2 Community Coupling Degree

Definition 7

Definition 8

Definition 9

Definition 10

Definition 11

Definition 12

2.3 Community Division Quality Evaluation Function

Definition 13

Definition 14

3 Community Detection via Core Tags

3.1 Finding Core Tags

3.2 Microblog Community Detection Strategy Based on Core Tags

Definition 15

4 Experiments and Results Analysis

4.1 Dataset Description

4.2 Experimental Methodology

5 Conclusions and Future Work

References

Acknowledgement

Author information

Authors and Affiliations

Corresponding author

Editor information

Editors and Affiliations

Rights and permissions

Copyright information

About this paper

Cite this paper

Download citation

Share this paper

Publish with us

Search

Navigation