Keywords

1 Introduction

Real world offers a wide variety of possible group organizations, such as families, working and friendship circle. With the diffusion of Internet, virtual groups, like online communities, have emerged on the Web [1, 2]. Social networks are defined as a set of actors and relationships that represent the interactions between entities in the network. Since social networks have important roles in the dissemination of information and innovation, the analysis of such networks, attracted much attention in the research area. Overlapping is a significant feature of the social network community structure and overlapping community detection has attracted an increasing attention [3,4,5,6]. Unlike traditional networks, microblog network is much more complex, which contains a huge amount of users and various kinds of inter-relationships [7, 8]. Thus, the community detection method for microblog users is different from the traditional community detection algorithms, which requires to carry out the relevant research in a systematic way.

User tags play an important role for personalized recommendation. These tags provide a powerful scheme to represent attributes or interests of microblog users. We exploit tag correlation to investigates both inter and intra correlation of tags, and the tags for users can therefore be expanded (following the ideas of reference [9]), which will increase tag vocabulary and alleviate tag sparsity on measuring user similarity and can eventually reveal implicit interests for users. Besides, following-behaviors form a social network of microblog users, which also indicates the interests of the user. What is more, it is believed that the nodes with a central position in their partitions may have an important function of control and stability within the group. Based on the above aspects, in this paper, a new partition criterion is proposed and a novel tag cut strategy is developed for microblog community detection. Figure 1 summarizes an overview of the proposed microblog community detection system.

Fig. 1.
figure 1

The main idea and the flow chart to detect communities in microblogs.

The basic outline of this paper is as follows: The partition criterion is given in Sect. 2 and the community detection algorithm via core tag is given in Sect. 3. The experiments and results are demonstrated in Sect. 4. Lastly, we conclude our paper in Sect. 5.

2 The Partition Criterion

The microblog network is defined as a directed graph G, namely a triple \( G = \left( {V,E,T} \right) \), where V and E are the user set and the edge set of the graph respectively. \( T = \left\{ {t_{1} ,t_{2} ,t_{3} \ldots t_{m} } \right\} \) is a finite set of tags for users. Each user \( v_{i} \in V\;{\text{and}}\;{\text{E = }}\left\{ {{<}v_{i} ,v_{j} {>} \left| {v_{i} \in V,v_{j} \in V} \right.} \right\} \) denotes the following relationship from vi to vj. \( C = \left\{ {C_{1} ,C_{2} ,C_{3} \ldots C_{k} } \right\} \) is the partition result of G, TCi denotes tag set for users in the cluster Ci.

Intuitively, an optimal community partition satisfies the following three aspects for each user:

  1. (1)

    Users in the cluster Ci share more common tags and the following relationship should be focused within cluster Ci.

  2. (2)

    Users in different clusters Ci and Cj own different tags and the following relationship should be consistent with each other.

  3. (3)

    The partition result should be of practical significance.

In view of the above three conditions, an evaluation function is proposed to guarantee the partition result, and three evaluation aspects are introduced to make a comprehensive correction of the partition results.

2.1 Community Cohesion

Definition 1

(Tag Similarity). Let Ti denote tag set for user vi and Tj denote tag set for user vj respectively, then the Tag Similarity between users vi and user vj is defined as:

$$ TagSim\,\left( {v_{i} ,v_{j} } \right) = \frac{{\left| {T_{i} \cap T_{j} } \right|}}{{\left| {T_{i} \cup T_{j} } \right|}} $$
(1)

Definition 2

(Tag Cohesion for Cl). Let users vi and vj belong to cluster Cl, Ti denotes tag set for user vi and Tj denotes tag set for user vj respectively, then the Tag Cohesion for Cl is defined as:

$$ TagCohesion\left( {C_{l} } \right) = \frac{{\sum\limits_{{v_{i} \in C_{l} }} {\sum\limits_{{v_{j} \in C_{l} }} {TagSim\left( {v_{i} ,v_{j} } \right)} } }}{{\left| {C_{l} } \right|^{2} }} $$
(2)

Definition 3

(Tag Cohesion for C). Let \( C = \left( {C_{1} ,C_{2} ,C_{3} \cdots C_{k} } \right) \) be the partition result of \( G = \left( {V,\,E,\,T} \right) \), then the Tag Cohesion for C is defined as:

$$ TagCohesion(C) = \frac{{\sum\limits_{{C_{l} \in C}} {TagCohesion(C_{l} )} }}{k} $$
(3)

Community cohesion reflects the degree of tag overlap of users in the community. The higher the tag cohesion, the higher the overlap of tags and vice versa.

Definition 4

(Edge Cohesion for Cl). Let user vi belong to cluster Cl, eij is the edge from vi to vj, then the Edge Cohesion for Cl is defined as:

$$ EdgeDispersion\,\left( {C_{1} } \right) = \frac{{\left| {\sum\limits_{{v_{i} ,\,v_{j} \in C_{1} }} {e_{ij} } } \right|}}{{\left| {\sum\limits_{{v_{i} \in C_{1} }} {e_{ij} } } \right|}} $$
(4)

Edge Cohesion indicates the proportion of following users within their community.

Definition 5

(Edge Cohesive for C). Let \( C = \left( {C_{1} ,C_{2} ,C_{3} \cdots C_{k} } \right) \) be the partition result of \( G = \left( {V,E,T} \right) \), then the Edge Cohesion for C is defined as:

$$ EdgeCohesion(C) = \frac{{\sum\limits_{{C_{l} \in C}} {EdgeDispersion(C_{l} )} }}{k} $$
(5)

Definition 6

(Cohesive for C). Let \( C = \left( {C_{1} ,C_{2} ,C_{3} \cdots C_{k} } \right) \) be the partition result of \( G = \left( {V,E,T} \right) \), then the Cohesion for C is defined as:

$$ Cohesion(C) = TagCohesion(C) \times EdgeCohesion(C) $$
(6)

2.2 Community Coupling Degree

Definition 7

(Tag Similarity of Ci and Cj). Let TCi be the tag set for users in cluster Ci and TCj is the tag set for users in cluster Cj, then the Similarity of Ci and Cj is defined as:

$$ Sim\left( {C_{i} ,C_{j} } \right) = \frac{{\left| {T_{{c_{i} }} \cap T_{{c_{j} }} } \right|}}{{\left| {T_{{c_{i} }} \cup T_{{c_{j} }} } \right|}} $$
(7)

Definition 8

(Tag Similarity for C). Let \( C = \left( {C_{1} ,C_{2} ,C_{3} \cdots C_{k} } \right) \) be the partition result of \( G = \left( {V,E,T} \right) \), then the Tag Similarity for C is defined as:

$$ TagSim\left( C \right) = \frac{{\sum\limits_{{C_{i} \in C}} {\sum\limits_{{C_{j} \in C}} {Sim\left( {C_{i} ,C_{j} } \right)} } }}{{k^{2} - k}} $$
(8)

Definition 9

(Following rate from Ci to Cj). Let \( C = \left( {C_{1} ,C_{2} ,C_{3} \cdots C_{k} } \right) \) be the partition result of \( G = \left( {V,E,T} \right) \), there are p users in Ci and pm users follow users in Cj, then the Following rate from Ci to Cj is defined as:

$$ Follow_{{C_{i} }} \left( {C_{j} } \right) = \frac{{p_{m} }}{p} $$
(9)

Definition 10

(Following entropy from Ci to Cj). Let \( C = \left( {C_{1} ,C_{2} ,C_{3} \cdots C_{k} } \right) \) be the partition result of \( G = \left( {V,E,T} \right) \), then the Following entropy from Ci to Cj is defined as:

$$ H_{{C_{j} }} \left( {C_{i} } \right) = \left\{ {\begin{array}{*{20}c} { - Follow_{{C_{i} }} \left( {C_{j} } \right)\log_{2} \;Follow_{{C_{i} }} \left( {C_{j} } \right),} & {if\;Follow_{{C_{i} }} \left( {C_{j} } \right) > 0} \\ {0,} & {else.} \\ \end{array} } \right. $$
(10)

Definition 11

(Edge entropy). Let \( C = \left( {C_{1} ,C_{2} ,C_{3} \cdots C_{k} } \right) \) be the partition result of \( G = \left( {V,E,T} \right) \), then the Edge entropy is defined as:

$$ EdgeEntropy\left( C \right) = \frac{{\sum\limits_{{C_{i} \in C}} {\sum\limits_{{C_{j} \in C}} {H_{{C_{j} }} \left( {C_{i} } \right)} } }}{{k^{2} - k}} $$
(11)

Definition 12

(Coupling Degree for C). Let \( C = \left( {C_{1} ,C_{2} ,C_{3} \cdots C_{k} } \right) \) be the partition result of \( G = \left( {V,E,T} \right) \), then Coupling Degree for C is defined as:

$$ Coupling\left( C \right) = TagSim\left( C \right) \times EdgeEntropy\left( C \right) $$
(12)

2.3 Community Division Quality Evaluation Function

There are different user groups with different influence in the network community and those with great prestige can form the so-called ‘opinion leaders community’, thus it is necessary to distinguish them with the ordinary users [10, 11]. Our work introduces the PageRank algorithm and rank users via their influence to make sure users’ influence are similar within the same group. Each group of users take the variance of PageRank to measure the practicality of community detection, and the smaller the variance, the similar the influence of the users in the group.

Definition 13

(Practicality for C). Let \( C = \left( {C_{1} ,C_{2} ,C_{3} \cdots C_{k} } \right) \) be the partition result of \( G = \left( {V,E,T} \right) \), \( v_{i } \in C_{i} \), VarPageRank(vi) denotes the variance of PageRank of Ci, then Practicality for C is defined as:

$$ Utility\left( C \right) = \sum\limits_{{C_{i} \in C}} {\frac{{\left[ {VarPageRank\left( {v_{i} } \right)\left| {v_{i} \in C_{i} } \right.} \right]}}{{\hbox{max} \left[ {PageRank\left( {v_{i} } \right) - \overline{{PageRank\left( {v_{i} } \right)}} } \right]^{2} }}} $$
(13)

Combining the above three evaluation criterion, the overall community quality evaluation function is defined as:

Definition 14

(Normalized quality function). Let \( C = \left( {C_{1} ,C_{2} ,C_{3} \cdots C_{k} } \right) \) be the partition result of \( G = \left( {V,E,T} \right) \), the normalized quality function is defined as:

$$ Quality(C) = \frac{Coupling(C) \times Utility(C)}{Cohesion(C) + 1} $$
(14)

Cohesion reflects the cohesion degree from the perspective within the community itself, Coupling considers the degree of connections between the communities and Utility considered the practical application, which distinguish the influential users and common users separately. The three criterion are mutually restricted, thus the quality evaluation function of community division is a compromise among the three.

3 Community Detection via Core Tags

3.1 Finding Core Tags

The optimal tags for community division can be detected through minimizing the normalized quality function. The whole community network is initially taken as one cluster, users with certain tag t are extracted as one partition and the rest users are those without tag t. For each new partition, we calculate the current normalized quality value. The tag whose partition can minimize the normalized quality function is marked as the core tag. The pseudocode of finding core tags is listed in Table 1.

Table 1. Pseudocode of finding core tags

3.2 Microblog Community Detection Strategy Based on Core Tags

The basic idea of our microblog community detection strategy is to determine the optimal partition via core tags in an iterative way. In order to avoid the excessive overlap between communities, we introduce the following definition:

Definition 15

(Similarity between Ci and Cj). Let \( C = \left( {C_{1} ,C_{2} ,C_{3} \cdots C_{k} } \right) \) be the partition result of \( G = \left( {V,E,T} \right) \), \( V_{{C_{i} }} \) denotes users in Ci, the similarity between Ci and Cj is defined as:

$$ Jaccard(C_{i} ,C_{j} ) = \frac{{\left| {V_{{C_{i} }} \cap V_{{C_{j} }} } \right|}}{{\left| {V_{{C_{i} }} \cup V_{{C_{j} }} } \right|}} $$
(15)

It can be seen that the similarity between Ci and Cj denotes the number of common users in these two partitions. The higher the similarity, the more overlap the two partitions and vice versa, therefore the introduction of similarity can control the degree of overlap effectively. The pseudocode of tag cut strategy can be summarized in Table 2.

Table 2. Pseudocode of tag cut strategy

In the TagCut strategy, the algorithm first initializes all the needed variables, then the core tag is decided in each iteration through minimizing the quality criterion for the current partition. After a core tag is determined, a new group Cb with users containing the core tag is obtained. A new partition Cb is formed if its similarity is less than a certain predefined threshold with all the rest partitions, otherwise all users in Cb should be added into the partition Ci that is the most similar one. Thereafter, the new partition is updated and the partition is updated until certain termination condition is met. Two specific algorithms NSTC (Number Specified Tag Cut) and QBTC (Quality Based Tag Cut) are derived from the Tag cut strategy based on two different termination conditions. The former takes the number of communities as the termination condition while the latter algorithm will end if the quality becomes worse. To be more specific, as for NSTC, after each iteration a new group is established and the algorithm then detects if the current number of communities exceeds the predefined threshold η to control the size of the communities. Obviously, this algorithm relies heavily on η. As for QBTC algorithm, with the iteration continues, the evaluation quality will come to its minimum value. Thereafter, evaluation quality function is considered as the termination condition.

4 Experiments and Results Analysis

4.1 Dataset Description

In this section, we demonstrate the performance of our algorithm for microblog community detection on three different datasets obtained from Sina weibo API. The crawl starts from a student majoring in computer science as the core user, and broad first search principle is applied. For S1, we take the core user and his following users, which mainly includes his classmates and users in information technology. For S2, we expand S1 by adding all following users of S1. S3 is obtained based on S2 using the same strategy as constructing S2. Table 3 shows the statistics of datasets.

Table 3. Data description

4.2 Experimental Methodology

In order to verify the effectiveness of our algorithm, we design two experiments. (1) We test the most suitable γ via the proper number of communities. (2) We make a deep analysis of algorithm QBTC and NSTC.

The Sensitivity of Parameter γ

We start to use all datasets for NSTC algorithm to get the most appropriate number of communities with the increase of parameter η from 1 to 20. The quality value is kept and η with the smallest quality value is regarded as the most reasonable number of communities as is shown in Fig. 2.

Fig. 2.
figure 2

The quality value result of NSTC algorithm on three data sets

For tagcut strategy, tag similarity Jaccard(CiCj) determines whether the user should be classified into the previous groups or a new group should be established. Parameter γ will greatly influence the QBTC algorithm since QBTC algorithm will continue to partition iteratively before the quality function getting bigger or all the users are classified into some groups. A small parameter γ will merge a great amount of communities, the overlap of communities will be too small to form less groups while a big parameter γ will divide too many communities with high overlap degree.

From Fig. 3 we can see that when γ = 0 the new partition will form a new community no matter if it is independent enough, therefore the overlap cannot be controlled and some partition will be overlapped too much which results in the worse partition result. When γ = 1 the partition has degenerated into an non-overlap community. And when γ = 0.8 we can get the desired result, which means the proportion of overlap is the most close to the real results.

Fig. 3.
figure 3

The number of communities with different γ

Performance Analysis of NSTC Algorithm and QBTC Algorithm

The number of communities should be specified for NSTC algorithm and the evaluation function will tend to a certain value. In this part, we only deal with S3 as the number of communities for S1 and S2 are too limited which cannot show the variation clearly. We change the value of η from 1 to 10 and calculate the value evaluation function respectively, and the experimental result is demonstrated in Fig. 4.

Fig. 4.
figure 4

Performance of NSTC algorithm

QBTC algorithm takes advantage of the value of evaluation function to determine the termination of iteration, the value of evaluation function tends to be better from worse, and the result can be demonstrated in Fig. 5.

Fig. 5.
figure 5

The iterative result of evaluation index of QBTC algorithm

Besides, as the parameter η of NSTC algorithm is more close to the return value of ClusterNum of QBTC algorithm, the partition result is more similar with each other, this is because both algorithms take advantage of TagCut strategy, the termination condition of NSTC algorithm is to meet the needs of the number of communities while QBTC algorithm is to get the global optimal value.

5 Conclusions and Future Work

In this paper, we propose a strategy for microblog overlapping community detection based on core tags. We have established a new partition criterion from three aspects: the explicit and implicit interest of users, user’s attention as well as practical application, based on which a tag cut mechanism is constructed. Our experiments demonstrate the effectiveness of our approach. In the future, we are aiming at designing new mechanism for tag expansion and tag weighing.