A Community Detection Algorithm Based on Jaccard Similarity Label Propagation

Wang, Meng; Cai, Xiaodong; Zeng, Yan; Liang, Xiaoxi

doi:10.1007/978-3-319-68935-7_6

Meng Wang²²,
Xiaodong Cai²²,
Yan Zeng²² &
…
Xiaoxi Liang²²

Part of the book series: Lecture Notes in Computer Science ((LNISA,volume 10585))

Included in the following conference series:

International Conference on Intelligent Data Engineering and Automated Learning

2123 Accesses
1 Citations

Abstract

Due to the randomness of label propagation algorithm, the stability is poor and the accuracy is low for community detection results in complex networks. In order to solve the problem, this paper proposes a novel community detection algorithm based on Jaccard similarity label propagation. Firstly, the Jaccard similarity is used to measure nodes importance. Then, the importance of nodes is utilized to reduce the randomness in label selection. Finally, nodes with the highest importance are selected to update labels in iteration, which improves the stability for community detection. Stability and accuracy of data sets of real networks are measured by modularity and normalized mutual information, respectively. Experimental results show that, the proposed algorithm for community detection results is more stable and more accurate than LPA, LPA_SI and KLPA algorithms in the cases of near linear time complexity.

Access provided by CONRICYT-eBooks. Download conference paper PDF

A Link-Based Similarity for Improving Community Detection Based on Label Propagation Algorithm

Article 16 November 2018

Efficient Community Detection Based on Label Propagation with Belonging Coefficient and Edge Probability

Graph-based label propagation algorithm for community detection

Article 13 December 2019

Keywords

1 Introduction

In nature and social activities, there are many universal connections and interactive relationships. This phenomenon is expressed with complex network models. Network nodes represent specific objects, while edges stand for links between objects, such as biology network [1], citation network [2] and social network [3]. It was found that community structures are a common feature of these networks. The community is composed of nodes with similar characteristics. Furthermore, the internal connection of communities is dense, the external correlation is relatively sparse [4]. Besides, it is important for research not only on the community structure in terms of biology, sociology and e-commerce, but also on applications in network security and criminal organizations recognition.

A large number of algorithms have been proposed with the widespread concern for community detection in complex networks. The classical community detection algorithms proposed graph segmentation, hierarchical clustering and modularity optimization method. In 2007, a fast label community detection algorithm is proposed based on label propagation [5]. The algorithm does not require any input parameters, such as the number and size of the community, and performs a linear time complexity. The efficiency of community detection results is improved extensively. It is suitable for large-scale complex networks. However, the label propagation algorithm is random in iterative update process, which leads to instability of the result for community detection. In order to solve the problem, the optimization and improvement methods were provided in many papers. In [6], a semi-synchronous label propagation algorithm overcoming the node oscillation problem in complex networks is proposed, but the stability is not improved for community detection. In LPA_SI [7], the stability of community detection is improved by integrating node importance and label influence. However, the computational complexity is large, which makes it difficult for large-scale complex networks to divide. In KLPA [8], the author combined the K value and local influence to achieve community detection, the transmission of label nodes is reduced. Nevertheless, the time complexity is increased, and the efficiency of large-scale networks for community detection is affected.

In this paper, a novel community detection method is proposed to solve the problem of randomness in label propagation algorithm. It is inspired by Jaccard similarity measure [9]. Firstly, the similarity is used to measure nodes importance. Then, the importance of node is utilized to reduce the randomness in label selection. Finally, the nodes with the highest importance are selected to update labels, which improves the stability for community detection.

2 Label Propagation Algorithm for Community Detection Based on Jaccard Similarity

2.1 Label Propagation Algorithm

In the literature [5], a community detection method is proposed based on Label Propagation Algorithm (LPA). Each node chooses the label with the largest number of neighbors as its own label by sending the label information between the node and its neighbors. After numerous iterations, node labels belong to the same community tend to be consistent. The main process of LPA is described as follows:

Input: Network model is $ {\text{G}} = \left( {{\text{V}},{\text{E}}} \right) $, where $ {\text{V}} = \left\{ {v_{i} ,{\text{i}} = 1,\,2,\,3\, \ldots \,{\text{n}}} \right\} $ represents the collection of nodes, $ {\text{set}}\,{\text{C}} = \left\{ {C_{1} ,C_{2} , \ldots ,C_{k} } \right\} $ denotes individual; where $ {\text{E}} = \left\{ {e_{i} ,{\text{i}} = 1,2,3, \ldots ,{\text{n}}} \right\} $ represents the collection of edges, $ e_{ij} \; = \;\left\{ {\left( {v_{i} ,\,v_{j} } \right),\,v_{i} \,{ \in }\,V,\,v_{j} \,{ \in }\,V} \right\} $ denotes relationship.

Output: $ {\text{Set}}\,{\text{C}} = \left\{ {C_{1} ,C_{2} , \ldots ,C_{k} } \right\} $, where k represents the number of community.

(1)
Each node $ v $ is assigned a unique label as the identity which stands for the community.
(2)
An ordered list is generated by the random ordering of node sets.
(3)
Following the ordered list, the node is updated. The label of each node $ v $ is updated by the label with the largest number of neighbors in the network. When there are multiple labels with the same maximum number, a label is chosen randomly as the label of the node.
(4)
After several iterations, the label tends to be stable in each node neighbors.
(5)
Nodes with the same label are classified to the same community.

In LPA, the order of each iteration is random. When a neighbor node has the same number of labels, a label is randomly selected to update, so that the stability of the community detection result is greatly affected. In the process of label propagation, the importance of the node itself is not considered, which leads to some less important nodes affect some of the more important nodes in turn, resulting in “counter current” phenomenon. In addition, the nearby small communities are swallowed by the first formation of community labels. Finally, a very large community is formed and even the labels of nodes in the entire network are the same.

2.2 Label Propagation Algorithm Based on Jaccard Similarity

In order to solve the “counter current” phenomenon in LPA, this paper considers the influence of the node importance and proposes an algorithm based on Jaccard similarity label propagation for community detection.

Related Definitions.

The proposed algorithm is based on Jaccard similarity of label propagation. The related definitions are as follows:

Definition 1: Node Neighborhood.

$$ T_{{\left( {v_{i} } \right)}} \; = \;\left\{ {v_{i} } \right\}\, \cup \,\left\{ {\left( {v_{i} ,v_{j} } \right)\,{ \in }\,V} \right\} $$

(1)

$ T_{{\left( {v_{i} } \right)}} $ denotes the set of neighbors of node $ i $.

Definition 2: Jaccard Similarity between Nodes.

$$ {\text{S}}\left( {v_{i} ,v_{j} } \right) = \frac{{\left| {T_{{\left( {v_{i} } \right)}} \cap T_{{\left( {v_{j} } \right)}} } \right|}}{{\left| {T_{{\left( {v_{i} } \right)}} \cup T_{{\left( {v_{j} } \right)}} } \right|}} $$

(2)

The degree of similarity is used to measure the tightness of any two neighboring nodes. In formula (2), $ \left| {T_{{\left( {v_{i} } \right)}} \cap T_{{\left( {v_{j} } \right)}} } \right| $ is the number of communal neighbors between node $ v_{i} $ and $ v_{j } $ in networks. $ \left| {T_{{\left( {v_{i} } \right)}} \cup T_{{\left( {v_{j} } \right)}} } \right| $ denotes the number of all neighbors between node $ v_{i} $ and $ v_{j } $. When node $ v_{i} $ and $ v_{j } $ have more jointed neighbors, while the similarity is bigger, the node is more important to others. The range of $ {\text{S}}\left( {v_{i} ,v_{j} } \right) $ is $ \left[ {0,1} \right] $.

Definition 3: Similarity between Nodes and Communities.

$$ {\text{S}}\left( {v_{i} ,C_{j} } \right) = \frac{{\left| {T_{{\left( {v_{i} } \right)}} \cap C_{j} } \right|}}{{\left| {T_{{\left( {v_{i} } \right)}} \cup C_{j} } \right|}} $$

(3)

The similarity between the node and community is used to measure the degree of the connection between nodes and adjacent communities. Where $ \left| {T_{{\left( {v_{i} } \right)}} \cap C_{j} } \right| $ represents the number of common nodes, $ \left| {T_{{\left( {v_{i} } \right)}} \; \cup \;C_{j} } \right| $ denotes the number of all neighbors between node $ v_{i} $ and community $ C_{j} $.

Definition 4: Node Importance.

$$ {\text{important}}\left( {v_{i} } \right) = argmax\,\left\{ {S\left( {v_{i} ,C_{j} } \right)\updelta\left( {v_{i} ,C_{j} } \right)} \right\} $$

(4)

Where $ \updelta\left( {v_{i} ,C_{j} } \right) $ is the Kronecker function, if $ v_{i} $ is identical to $ C_{j} $, $ \updelta\left( {v_{i} ,C_{j} } \right) $ equals to 1, otherwise equals to 0. When the node importance is bigger, the influence to others is larger. The label of node spreads more easily.

In the process of updating labels, the Jaccard similarity of adjacent nodes is firstly calculated according to formula (2). Based on the similarity descending sort, node labels are updated in order and the randomness is avoided. After several iterations, the node is updated when there are multiple adjacent label collections. The similarity between adjacent label collection is calculated according to formula (3). It represents the tightness of the association between label nodes and adjacent label sets. A higher tightness indicates that the node label is more important for the adjacent label sets. The probability is greater to be the label of the node updated. The most important label set is selected to update the node label by the formula (4) to avoid the “reverse flow” phenomenon in the process of label propagation. The formula of node label updated is as follows:

$$ {\text{LabelNew}}\left( {v_{i} } \right) = { \arg }\,\mathop {\hbox{max} }\limits_{l} \left\{ {important\left( {v_{i} } \right)} \right\} $$

(5)

Where $ l $ is the adjacent label collection of the node.

Algorithm Description.

The steps of the algorithm for community detection are as follows:

Input: Complex network $ {\text{G}} = \left( {{\text{V}},{\text{E}}} \right) $, set the maximum number of iterations $ t = maxIter $. Node $ v_{i} $ has n neighbors. $ {\text{L}}v_{i\left( t \right)} $ is the label of node $ v_{i} $ after the $ t^{th} $ iteration.

Output: $ {\text{SetC}} = \left\{ {C_{1} ,C_{2} , \ldots ,C_{k} } \right\} $, where k represents the number of communities.

1.
The label of each node $ {\text{v}}\,{ \in }\,{\text{V}} $ is initialized. Each node is assigned a unique label. The iteration time is initialized, $ {\text{t}} = 1 $.
2.
The similarity between the label nodes is calculated by formula (2). According to the similarity from high to low sort, an ordered list $ V^{{\prime }} = \left\{ {v_{1} ,v_{2} , \ldots ,v_{n} } \right\} $ is generated.
3.
According to the ordered list $ V^{{\prime }} $, the label of node is updated. According to formula (3), the node is updated to the most important label.
4.
If iteration $ t = maxIter $ and $ {\text{L}}v_{i\left( t \right)} $ no longer changes, the nodes with the same label are grouped into same community. If $ {\text{t}} = {\text{t}} + 1 $, return to step 3.

Compared with LPA for community detection, in the proposed algorithm, the randomness of label propagation is reduced by considering the transmission characteristics of node labels and the tightness of connection among nodes. In addition, the influence of node is integrated, which makes it more accurate in label selection. The stability is improved for community detection.

3 Experimental Results and Analysis

To evaluate the proposed algorithm, real network data sets by Mark Newman [10] and the information of data sets is shown in Table 1 are used. The modularity and normalized mutual information are used as the measurement standard for community detection of these structure known data sets. For unknown structure data sets Hep-th and Internet, the modularity is used to test. The stability and accuracy of the proposed algorithm are analyzed, and compared by using the LPA [5], LPA_SI [7] and KLPA [8]. The time complexity is also analyzed. The experimental platform configuration consists of Intel (R) Core (TM) i5-4460 CPU @ 3.20 GHz processor, 8 GB memory, Microsoft Windows 7 operating system and JDK1.8, and Java programming language. In the testing process, to reduce the random caused by different operations, the average of 50 runs is taken. The maximum number iteration of the algorithm is set to 100.

Table 1. Data sets information.

Full size table

3.1 Modularity

Modularity (Q) is the measure of quality for community structure proposed by Girvan and Newman [4], which can be used to measure the stability for community detection. It is defined as follows:

$$ {\text{Q}} = \sum\nolimits_{1}^{k} {\left[ {\frac{{link\left( {C_{j} } \right)}}{m} - \left( {\frac{{{{\text{deg}}}\left( {C_{j} } \right)}}{2m}} \right)^{2} } \right]} $$

(6)

Where k is the number of community, $ link\left( {C_{j} } \right) $ is the number of $ C_{j} $, and $ {{\text{deg}}}\left( {C_{j} } \right) $ is the sum of all nodes in the community $ C_{j} $, and m is the number of edges in the network. The value range of Q is [−1, 1]. If Q is larger, the quality is better for community detection and the result is more stable.

The result of Q with the proposed algorithm, LPA [5], LPA_SI [7] and KLPA [8] in different data sets are shown in Table 2. It can be seen that the value of Q is obtained by using the Karate, Dolphins, Polbooks, Footballs, Hep-th and Internet data sets. Since the algorithm considers the importance of the node itself, the value of Q is obviously higher than that of the LPA [5], and the stability of the algorithm is improved significantly. Compared with LPA_SI [7], it considers importance of nodes and influence of labels together, there is a slightly higher Q value of the three small data sets: Dolphins, Polbooks and Footballs. In addition, Karate data set of the Q value is similar and Hep-th is lower. However, the Q of the big data set Internet is higher. Compared with KLPA [8], the value of Q is higher than that of the LPA [5].

Table 2. Comparison the modularity Q of the algorithm.

Full size table

3.2 Normalized Mutual Information

The accuracy for community detection is evaluated in NMI [11, 12] by comparing the similarity between prediction and the true community structure. It is defined as follows:

$$ NMI\left( {p,r} \right) = \frac{{2MI\left( {p,r} \right)}}{H\left( p \right) + H\left( r \right)} $$

(7)

Where $ MI\left( {p,r} \right) $ denotes the mutual information between prediction community p and the real community r, and the formula is:

$$ {\text{MI}}\left( {{\text{p}},{\text{r}}} \right) = \sum\nolimits_{i} {\sum\nolimits_{j} {P\left( {C_{{p_{i} }} \cap C_{{r_{j} }} } \right)\,log\frac{{P\left( {C_{{p_{i} }} \cap C_{{r_{j} }} } \right)}}{{P\left( {C_{{p_{i} }} } \right)P\left( {C_{{r_{j} }} } \right)}}} } $$

(8)

$ {\text{H}}\left( {\text{p}} \right) $ represents the entropy of the community p, the formula is:

$$ {\text{H}}\left( {\text{p}} \right) = - \sum\nolimits_{j} {P\left( {C_{{p_{i} }} } \right)\,{ \log }\left( {P\left( {C_{{p_{i} }} } \right)} \right)} $$

(9)

Where $ P\left( {C_{{p_{i} }} } \right) $ and $ P\left( {C_{{r_{i} }} } \right) $ respectively represent the probability of a node in the network divided into the community p and in the real community r. The range of NMI is [0, 1]. The NMI is greater and the result is more accurate.

The result of NMI with the proposed algorithm, LPA [5], LPA_SI [7] and KLPA [8] in different data sets are shown in Fig. 1. The values of NMI is obtained by using the Karate, Dolphins and Footballs data sets. The NMI of the proposed are close to 1, which is larger than that of LPA, LPA_SI and KLPA algorithm. For the Dolphins data set, NMI does not reach 1. However, it is also higher than LPA, LPA_SI and KLPA. Experimental results show that, the proposed algorithm improves the accuracy of community detection.

3.3 Computational Complexity

If in a given network, n represents the number of nodes, and m represents the number of edges. All nodes in the network initialize the label, all nodes traverse once, the time complexity is $ O\left( n \right) $. When calculating the similarity between nodes, all edges traverse once and the time complexity is $ O\left( m \right) $. Similarity computing node according to importance and sorting node update, the time is required for most $ O\left( {n + m} \right) $. The time complexity of each node is divided into $ O\left( n \right) $, and the time complexity is $ O\left( {ms} \right) $ after iteration s times. Thus, the total time complexity is $ O\left( {n + ms} \right) $. Compared with the LPA algorithm, the time complexity increases, mainly on the ordering of nodes, but still close to the linearity. Compared with LPA_SI and KLPA, the time complexity is similar (Table 3).

Table 3. Computational complexity.

Full size table

4 Conclusion

The proposed algorithm integrates label propagation and the importance of nodes, and adopts the new node importance measurement for label iteration process. The improvement is shown by comparing the modularity and normalized mutual information of the LPA, LPA_SI and KLPA algorithms in small and large data sets of real networks. It can be confirmed that the stability and accuracy of the community detection are improved when the importance of node itself and the impact of labels in the label propagation algorithm are considered. With the increase of the number of nodes in networks, the performance of the proposed algorithm is further improved.

References

Mering, V.C., Krause, R., Snel, B., et al.: Automated inference of highly conserved protein interaction networks. J. Nat. 417(6998), 399–403 (2002)
Article Google Scholar
Xiao, X., Chen, Y., Deng, Y.: Development of community discovery in citation networks. J. Inf. 35(4), 125–130 (2016)
Google Scholar
Yuta, K., Ono, N., Fujiwara, Y.: A gap in the community-size distribution of a large-scale social networking site. Physics (2007)
Google Scholar
Newman, M.E.J., Girvan, M.: Finding and evaluating community structure in networks. J. Phys. Rev. E69(2), 026113 (2004)
Google Scholar
Raghavan, U.N., Albert, R., Kumara, S.: Near linear time algorithm to detect community structures in large - scale networks. J. Phys. Rev. E76(3), 036106 (2007)
Google Scholar
Cordasco, G., Gargano, I.: Community detection via semi synchronous label propagation algorithms (2011)
Google Scholar
Huang, J., Guo, K., Guo, H.: Label propagation algorithm for community detection based on vertex significance and label influence. J. Chin. Comput. Syst. 36(6), 1171–1175 (2015)
Google Scholar
Deng, G.: Community detection base on mixed k-influence. J. Inf. Commun. 2, 61–63 (2016)
Google Scholar
Pan, L., Lei, Y., Wang, C., et al.: Method on entity identification using similarity measure based on weight of Jaccard. J. Beijing Jiaotong Univ. 33(6), 141–145 (2009)
Google Scholar
Mark Newman network data. http://www-personal.umich.edu/~mejn/netdata
Danon, L., Diaz-Guilera, A., Duch, J., et al.: Comparing community structure identification. J. Stat. Mech. Theor. Exp. 2005(09), P09008 (2005)
Article Google Scholar
Lancichinetti, A., Fortunato, S., Kertész, J.: Detecting the overlapping and hierarchical community structure of complex networks. New J. Phys. 11(3), 033015 (2009)
Article Google Scholar

Download references

Acknowledgment

This work was supported by the 2016 Key Laboratory of Cognitive Radio and Information Processing, Ministry of Education Project No. CRKL160102 and 2016 Guangxi Science and Technology Project No. AB16380264.

Author information

Authors and Affiliations

School of Information and Communication, Guilin University of Electronic Technology, Guilin, 541004, Guangxi, China
Meng Wang, Xiaodong Cai, Yan Zeng & Xiaoxi Liang

Authors

Meng Wang
View author publications
You can also search for this author in PubMed Google Scholar
Xiaodong Cai
View author publications
You can also search for this author in PubMed Google Scholar
Yan Zeng
View author publications
You can also search for this author in PubMed Google Scholar
Xiaoxi Liang
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Xiaodong Cai .

Editor information

Editors and Affiliations

University of Manchester, Manchester, United Kingdom
Hujun Yin
School of Electronic and Electrical Engineering, Nanjing University, Nanjiing, China
Yang Gao
Nanjing University of Aeronautics and Astronautics, Nanjing, China
Songcan Chen
Guilin University of Electronic Technology, Guilin, China
Yimin Wen
Guilin University of Electronic Technology, Guilin, China
Guoyong Cai
Guilin University of Electronic Technology, Guilin, China
Tianlong Gu
Beijing University of Posts and Telecommunications, Beijing, China
Junping Du
University of Seville, Seville, Spain
Antonio J. Tallón-Ballesteros
Southeast University, Nanjing, China
Minling Zhang

Rights and permissions

Reprints and permissions

Copyright information

About this paper

Cite this paper

Wang, M., Cai, X., Zeng, Y., Liang, X. (2017). A Community Detection Algorithm Based on Jaccard Similarity Label Propagation. In: Yin, H., et al. Intelligent Data Engineering and Automated Learning – IDEAL 2017. IDEAL 2017. Lecture Notes in Computer Science(), vol 10585. Springer, Cham. https://doi.org/10.1007/978-3-319-68935-7_6

Download citation

DOI: https://doi.org/10.1007/978-3-319-68935-7_6
Published: 06 October 2017
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-68934-0
Online ISBN: 978-3-319-68935-7
eBook Packages: Computer ScienceComputer Science (R0)

Publish with us

Policies and ethics

A Community Detection Algorithm Based on Jaccard Similarity Label Propagation

Abstract

Similar content being viewed by others

A Link-Based Similarity for Improving Community Detection Based on Label Propagation Algorithm

Efficient Community Detection Based on Label Propagation with Belonging Coefficient and Edge Probability

Graph-based label propagation algorithm for community detection

Keywords

1 Introduction

2 Label Propagation Algorithm for Community Detection Based on Jaccard Similarity

2.1 Label Propagation Algorithm

2.2 Label Propagation Algorithm Based on Jaccard Similarity

Related Definitions.

Definition 1: Node Neighborhood.

Definition 2: Jaccard Similarity between Nodes.

Definition 3: Similarity between Nodes and Communities.

Definition 4: Node Importance.

Algorithm Description.

3 Experimental Results and Analysis

3.1 Modularity

3.2 Normalized Mutual Information

3.3 Computational Complexity

4 Conclusion

References

Acknowledgment

Author information

Authors and Affiliations

Corresponding author

Editor information

Editors and Affiliations

Rights and permissions

Copyright information

About this paper

Cite this paper

Download citation

Publish with us

Navigation

A Community Detection Algorithm Based on Jaccard Similarity Label Propagation

Abstract

Similar content being viewed by others

A Link-Based Similarity for Improving Community Detection Based on Label Propagation Algorithm

Efficient Community Detection Based on Label Propagation with Belonging Coefficient and Edge Probability

Graph-based label propagation algorithm for community detection

Keywords

1 Introduction

2 Label Propagation Algorithm for Community Detection Based on Jaccard Similarity

2.1 Label Propagation Algorithm

2.2 Label Propagation Algorithm Based on Jaccard Similarity

Related Definitions.

Definition 1: Node Neighborhood.

Definition 2: Jaccard Similarity between Nodes.

Definition 3: Similarity between Nodes and Communities.

Definition 4: Node Importance.

Algorithm Description.

3 Experimental Results and Analysis

3.1 Modularity

3.2 Normalized Mutual Information

3.3 Computational Complexity

4 Conclusion

References

Acknowledgment

Author information

Authors and Affiliations

Corresponding author

Editor information

Editors and Affiliations

Rights and permissions

Copyright information

About this paper

Cite this paper

Download citation

Share this paper

Publish with us

Search

Navigation