1 Introduction

Complex network analysis has a wide range of from physical, technological, biological, to social sciences [1,2,3,4,5,6,7,8,9]. Community detection is one of the most important topics in complex network analysis. The purpose of community detection is to group nodes into different clusters, where nodes in the same cluster are more linked with each other than with nodes outside the cluster.

Many researchers have put forward various methods to detect clusters in complex networks. Label propagation based community detection algorithms such as LPA [10], require only local information and have shown highly efficient. However, label propagation algorithms might produce unstable results due to its random tie breaking strategy, which is highly undesirable in practice and prohibits its extension to other applications.

In this paper, we propose a graph-based label propagation algorithm called GLPA for non-overlapping community detection based on graph-based representation of label propagation process and community detection process. The main steps of GLPA proceeds as follows. First, we define the node similarity between a node and its neighbors, then we change each node’s label to that of its most similar neighbor node. We construct graph-based representation for label propagation progress, and obtain connected components of the resulted graph. We call each connected component a candidate community. Second, we construct graph-based representation for community detection, in which each candidate community is contracted to a super node, and the edge weights between corresponding super nodes are defined as the number of edges between two communities in original network. We compute the merging factor of each node in the resulted network. Then, we get the final results of the communities by merging the small scale communities in decreasing order of their merging factors. The proposed algorithm GLPA reduce the instability of the original LPA. Compared with other classical community detection algorithm on some real networks and artificial networks, the proposed algorithm shows preferable performance on stability, NMI, ARI and modularity.

The organization of the rest of this paper is as follows. Section 2 gives a brief introduction of the basic concepts of graph theory and the definition of some existing nodes’ similarity measures. In Sect. 3, the proposed graph based label propagation algorithm (GLPA) is introduced in detail. Section 4 investigates the effectiveness of the proposed method based on 12 real networks. Section 5 is the conclusion of this paper.

2 Preliminaries

2.1 Definitions and notations

Considering an unweighted and undirected simple network \(G=(V,E)\) with node set V and edge set E. Let \(A=(a_{ij})_{|V|\times |V|}\) be the adjacency matrix of G. The open neighborhood of a node \(v\in V\) is written as \(N_v=\{u\in V|(u,v)\in E\}\). The degree of a node v in G, denoted by \(d_{v}\), is the number of nodes in \(N_v\), i.e., \(d_v=|N_v|\). The subgraph of G induced by U, denoted by G[U], is the subgraph with node set U and edge set \(\{(u,v)|u,v\in U\ \wedge \ (u,v)\ \in E\}\). So, G[U] contains all nodes of U and all edges of G whose end nodes are both in U. We write [U] to denote the induced subgraph by node subset U when without causing confusion.

A connected subgraph, if it is unconnected to the other parts of the same graph, is called a (connected) component of the graph G, denoted as C(G). We can also get the connected components of a subgraph.

The modularity Q [11], defined as Eq. (1), compares the real density of intra-module links and the expected density in a random network without any community structure called a null model.

$$\begin{aligned} Q=\frac{1}{2|E|}\sum _{ij} \left( a_{ij}-\frac{d_id_j}{2|E|}\right) \delta _{i,j} \end{aligned}$$
(1)

where \(a_{ij}\) is a component in the adjacency matrix, \(\frac{d_id_j}{2|E|}\) represents the total expected number of edges between nodes i and j. \(\delta _{i,j}=1\) if node \(v_{i}\) and \(v_{j}\) are in the same community, otherwise \(\delta _{i,j}=0\). A high value of Q means that the partition is markedly different from a random network, thus it has strong community structure.

The complementary entropy [12] is another index to measure the quality of communities in a complex network. Let \(V_r\)(\(1\le r\le k\)) denote community r in the interesting network, \(L_{i}(V_{r})\) be the number of nodes in \(V_{r}\) adjacent to node \(v_i\), and \(f_{i}(V_{r})=\frac{L_{i}(V_{r})}{\left| V_{r}\right| }\) is the fraction of nodes in \(V_r\) adjacent to \(v_i\). Then the the complementary entropy is defined as

$$\begin{aligned} F=\sum _{i=1}^{k}{h_i}{I_i} \end{aligned}$$
(2)

where \(h_i=\sqrt{\sum \nolimits _{r=1}^{k} \left( \frac{f_{i}(V_{r})}{\sum \nolimits _{j=1 f_{i}(V_{j})}^{k}} \right) ^{2}}\) represents the degree to which node \(v_i\) is separated from other communities, and \(I_{i}=\sum \nolimits _{r=1}^{k}L_{i}(V_{r})\times f_{i}(V_{r})\) evaluates how close the node \(v_{i}\) is to other communities.

For further details about concepts of graph theory, readers are referred to [6].

2.2 Related works

Researchers have proposed various methods for community detection recent years. Newman and Girvan defined a quality function modularity Q and optimized it to detect communities [11, 13]. Since then, modularity-based methods have been widely used to detect communities by maximizing the modularity value. For example, the fast unfolding algorithm proposed by Blondel et al. [14] is a fast heuristic method for the modularity optimization.

Raghavan et al. [10] proposed the framework of label propagation algorithm (LPA). LPA initializes every node with unique labels and propagates each node’s label to that of most of its neighbors belonging to. LPA is a simple and unsupervised near linear algorithm without any parameters, and does not need any information about the size and number of communities in advance. Hence, LPA may be suitable for partitioning large networks in real time.

Because a random factor exists in LPA when there is more than one most frequent label, one might obtain different results after multiple runs. This disadvantage may prevent LPA from being widely used in practice. In addition, LPA may produce a meaningless solution in which all nodes are assigned to one community. In order to solve this problem, Barber et al. [15] proposed a modularity-specialized LPA called LPAm. LPAm might stuck in a local optimum and lead to inaccurate partitions. LPAm+ [16] is an improved approach to obtain the highest modularity value and can effectively avoid local maxima. Li et al. proposed LPA-S [17], in which labels are propagated by similarity. Although LPA has distinct advantages for community detection, there have also been many superior LPA-based methods, and there is still room for partitioning real-world networks more accurately.

Raghavan et al. [10] pointed out that 95% of nodes or more are classified correctly in 5 iterations. However, determining the speed of LPA-based methods is still an open problem. Asynchronous LPA is not suitable for very large scale network, whilst updating labels synchronously might lead to oscillation of labels and prevent the update procedure from converging. Although there are some methods to avoid non-convergent behaviors, label oscillation still exists in some cases.

3 GLPA: a graph-based label propagation algorithm

Let G be the complex network whose nodes to be clustered. In this section, we propose a Graph-based Label Propagation Algorithm (GLPA) for detecting communities in G by graph-based representation of the label propagation process and the community detection process. The proposed GLPA modifies the label propagation pattern and is divided into two sections as follows.

  1. (1)

    Constructing label propagation graph L(G) to get candidate communities. We first define node similarity between adjacent node pairs and find the most similar neighbor node for each node. GLPA updates the labels of nodes based on the similarity measure. We construct a label propagation graph, in which the edges are designed to record the label propagating process. Then, GLPA calculates the connected components in the resulted label propagation graph. Each connected component is treated as a candidate community in the next step.

  2. (2)

    Constructing a weighted graph W(G) to obtain final communities. We treat each connected component of L(G) as a super-node. If there are edges between two components, then there is an edge between two corresponding super-nodes, and the weight of the edge is assigned as the number of edges of G lying between two components. For a given super-node \(v\in W(G)\), we choose one of its neighbors \(u\in W(G)\) as its most similar node, such that the weight on edge (uv) is the largest one among those of incident edges of v. We compute the merging factor for each node in W(G). Then we merge super nodes with the highest merging factor to its most similar node iteratively until the maximum complementary entropy [as is shown in Eq. (2)] does not increase.

To illustrate how the GLPA detects communities in complex networks, we use Dolphin Social network [18] to show the detail steps of GLPA as an example. The Dolphin Social network has 62 nodes and 159 edges.

3.1 Node similarity

The basic LPA adopts the label belonging to the majority of its neighbors, which means that it treats all neighbors equally. However, in practice, a node plays different roles depending on its location in the networks. In the real world, entities with high similarity tend to be gathered in the same group. For this perspective, community detection methods using an appropriate similarity metric may discover some valuable results in practice. The number of common neighbors [19] between two nodes can reflect their similarity. The more common neighbors between two nodes, the more similar they are in structure. On the other hand, the node degrees also play an important role in detecting communities, especially for linked nodes without common neighbors. Hence, in our method we use a similarity \(S_{i,j}\) between two adjacent nodes \(v_{i}\) and \(v_{j}\) based on their common neighbors \(N_i\cap N_j\) and the degrees of \(v_{i}\) and \(v_{j}\). The formula of the similarity \(S_{i,j}\) is given as Eq. (3).

$$\begin{aligned} S_{i,j}=\left\{ \begin{array}{ll} |N_i\cap N_j|+\frac{1}{d_id_j}, &{}\quad if\, (v_i,v_j)\in E; \\ 0, &{}\quad otherwise \end{array} \right. \end{aligned}$$
(3)

where \(N_i\cap N_j\) denotes the set of common neighbors of adjacent nodes \(v_i\) and \(v_j\). The similarity of two adjacent nodes \(S_{i,j}\) would be determined by the number of common neighbors and the node degrees. \(S_{i,j}\) tends to be large when the two adjacent nodes share many common neighbors. And \(S_{i,j}\) might be small when the degrees of \(v_i\) and \(v_j\) are large.

3.2 The steps of GLPA algorithm

3.2.1 Construction of label propagation graph L(G)

At beginning, each node is identified by a unique label which implies different community identifier. We construct the graph representation L(G) of the label propagation progress from a null graph, whose initial node set \(V(L)=V(G)\) and initial edge set \(E(L)=\emptyset\).

Then for each node \(v_{i}\in V(G)\), we find the most similar node of \(v_{i}\) according to Eq. (3). If there are more than one most similar neighbor node, we choose one randomly. Let node \(v_{j}\) be the most similar node of \(v_{i}\), then we add an edge \((v_{i},v_{j})\) between node \(v_{i}\) and \(v_{j}\) to L, i.e., \(E(L)=E(L)\cup \{(v_{i},v_{j})\}\). We can find the most similar node for each node simultaneously. And we will obtain a label propagation graph L with |V(G)| nodes and no more than |E(G)| edges.

Given the form of node similarity S as Eq. (3), it is reasonable to expect that the number of neighbors with high similarity will be much lower than the degree of a node. In other words, there are only a few choices of neighbor node for each node. This would reduce the randomness in original LPA algorithm for the reason that the number of most similar nodes is greatly reduced. And it can be easily seen that, if there is a path between two nodes in L, then the node pair should have a higher similarity.

To avoid the oscillation of the label propagation, we calculate the connected components of L. Let \(\{C_1,\ldots , C_k\}\) (\(k\le |V(L)|\)) be the set of connected components of L. The nodes in the same connected component share the same initial label. Figure 1a shows the resulted label propagation graph L(G) and connected components of Dolphin Social network. Figure 1b shows the initial communities in Dolphins network detected by GLPA.

For most of complex networks, the number of connected components might be much greater than the real number of communities in G. We need additional merging steps to deal with the over-segmentation of the network.

Fig. 1
figure 1

Communities detection of Dolphins network

3.2.2 Merging initial clusters

Although, the number of nodes contained in a connected component is small, the similarity among these nodes might be high. In the second phase, we will perform merging process to maximize the complementary entropy F of the partition by constructing a weighted graph W(G) from L(G). To obtain W(G), we treat each connected component in L(G) as a super-node.

Let \(c_i\) for \(1\le i\le k\) be the super-node in W corresponding to the connected component \(C_i\) of L, and k is the number of connected components. For super-nodes \(c_i\) and \(c_j\), if there exist edges in G that have one endpoint in \(C_i\) and the other endpoint in \(C_j\), then we add an edge \((c_i,c_j)\) to W whose weight equals to the number of edges between the two components in G.

Then we compute the merging factor for each super-node \(c_i\) in W as follows:

$$\begin{aligned} PC(c_i)=\frac{\sum _{r=1}^{k}(\omega _{ir})^2}{\left( \sum _{j=1}^{k}\omega _{ij}\right) ^{2}} \end{aligned}$$
(4)

where \(\omega _{ij}\) represents the edge weights between node \(c_i\) and node \(c_j\), thus \(\omega _{ij}=\sum \nolimits _{x\in V(C_i)}\sum \nolimits _{y\in V(C_j)}a_{xy}\) and \(\omega _{ii}=1\).

We use merging factor to determine whether a component should be merged into another component to form a larger community. If a component is dispersedly connected with many other components, then the component would has a low merging factor and might be unlikely to be merged with others. On the contrary, if a component is intensively connected with a few other components, then the component would has a high merging factor and might be merged with another component. In the second step, we treat a component as a super-node. For a given super-node \(v\in W(G)\), we choose one of its neighbors \(u\in W(G)\) as its most similar node, such that the weight on edge (uv) is the largest one among those of incident edges of v. We compute the merging factor for each node in W(G). We will merge super nodes with the highest merging factor to its most similar super-node iteratively to reach the maximum complementary entropy F [as is shown in Eq. (2)]. Figure 1c shows the final communities in Dolphins network detected by GLPA.

3.2.3 Time complexity analysis

In GLPA, suppose the node number of graph G is n. In the first step, we need calculate similarities between adjacent node pairs. The computational complexity for computing nodes similarity is \(O(n^2)\). Then we construct a label progress graph L to imitate the label propagation process, and the computational complexity for obtaining label propagation graph is \(O(n^2)\). It takes O(n) time to obtain connected components of L(G). In the second step, it takes \(O(k^{2})\) to determine the weights of edges in W(G), where k is the number of components in L(G) and \(k\ll n\). The computational complexity for updating labels in this phase is \(O(t\times (n+k))\), where \(t~(t\ll n)\) is the number of iterations. We compute the merging factors of each node of the resulted network W, which has a complexity \(O(k^2)\). Therefore, the total time complexity of the proposed algorithm GLPA is \(O(n^2)\).

4 Experiments and results

In this paper, we compare the performance of our method GLPA on both synthetic and real-world networks with 8 classical algorithms LPA, LPAm, LPAm+, LPA-S, BGLL, ISCD+, FMM and Infomap [20] in the literature.

4.1 Evaluation criteria

4.1.1 NMI and ARI

To evaluate the performance of community detection algorithm, normalized mutual information(NMI) [21], adjusted rand index(ARI) [22] and modularity Q are adopted.

Given a set V with n nodes, let \(O=\{O_1,O_2,\ldots ,O_k\}\) represent the communities obtained by an algorithm, and \(P=\{P_1,P_2,\ldots ,P_{k'}\}\) represent the ground truth communities. For \(O_i\) and \(P_j(1\le i\le k,1\le j\le k')\), let \(n_{ij}=|O_i\cap {P_j}|\), \(b_i=\sum \nolimits _{j=1}^{k'}{n_{ij}}\), \(t_j=\sum \nolimits _{j=1}^{k}{n_{ij}}\). NMI is defined as

$$\begin{aligned} NMI=\frac{-2\sum \nolimits _{i}\sum \nolimits _{j}n_{ij} \log (\frac{n_{ij}n}{b_{i} t_{j}})}{\sum \nolimits _{i}b_{i} \log (\frac{b_{i}}{n})+\sum \nolimits _{j}t_{j}\log (\frac{t_{j}}{n})} \end{aligned}$$
(5)

And ARI is defined as

$$\begin{aligned} ARI=\frac{\sum \nolimits _{ij}\left( {\begin{array}{c}n_{ij}\\ 2\end{array}}\right) -\left( \sum \limits _{i} \left( {\begin{array}{c}b_{i}\\ 2\end{array}}\right) \sum \limits _{j}\left( {\begin{array}{c}t_{j}\\ 2\end{array}}\right) \right) / \left( {\begin{array}{c}n\\ 2\end{array}}\right) }{\frac{1}{2}\left( \sum \nolimits _{i}\left( {\begin{array}{c}b_{i}\\ 2\end{array}}\right) + \sum \nolimits _{j}\left( {\begin{array}{c}t_{j}\\ 2\end{array}}\right) \right) -\left( \sum \nolimits _{i} \left( {\begin{array}{c}b_{i}\\ 2\end{array}}\right) +\sum \nolimits _{j}\left( {\begin{array}{c}t_{j}\\ 2\end{array}}\right) \right) /\left( {\begin{array}{c}n\\ 2\end{array}}\right) } \end{aligned}$$
(6)

The higher values of NMI and ARI indicate the better partition quality. If NMI and ARI reach their maximal value (which equals 1), then the algorithm obtains exactly the ground truth communities.

4.1.2 Stability coefficient

In order to evaluate the stability of the obtained results, we adopt the stability coefficient \(\sigma\) proposed in [23]. For network \(G=(V,E)\) with \(|V|=n\), we run an algorithm T times to show the stability of the algorithm. For \(1\le t\le T\), we construct an \(n\times n\) matrix \({\varOmega }_{t}\), whose elements are defined as

$$\begin{aligned} \delta _{ij}^{(t)}= {\left\{ \begin{array}{ll} 1,&{} \text{ if } i \text{ and } j \text{ in } \text{ the } \text{ same } \text{ community, } \\ 0, &{} \text{ otherwise. } \end{array}\right. } \end{aligned}$$

Then we calculate the variance matrix \({\mathcal {S}}\) of \({\varOmega }_{t}\) (\(1\le t\le T\)). The element of \({\mathcal {S}}\) is defined as

$$\begin{aligned} {\mathcal {S}}_{i,j}=\frac{1}{T}\sum \limits _{t=1}^{T} \left( \delta _{i,j}^{(t)}- \frac{1}{T}\sum \limits _{x=1}^{T}\delta _{i,j}^{(x)}\right) ^2 \end{aligned}$$

The stability coefficient\(\sigma\) of the algorithm is defined as

$$\begin{aligned} \sigma =\frac{1}{n^{2}}\sum \limits _{1\le i,j\le n}{\mathcal {S}}_{i,j} \end{aligned}$$
(7)

If an algorithm has a low stability coefficient \(\sigma\) value, then the algorithm has good stability.

4.2 Experiments on computer-generated networks

We use the LFR benchmark networks [24] as synthetic networks to test the performance of our algorithm. LFR benchmark network can create networks by several parameters: the number of nodes N, average degree \(\langle d\rangle\), maximum degree \(d_{max}\), minimum community size \(c_{min}\), maximum community size \(c_{max}\), exponent \(\gamma\) for the degree distribution, exponent \(\beta\) for community size distribution, and mixing parameter \(\mu\). We test the algorithms on the networks with \(N\in \{1000,5000\}\), community size varying in the range 10–50 and 20–100, \(\langle d\rangle =20\), \(d_{max}=50\), \(\gamma =2\) and \(\beta =1\). The mixing parameter \(\mu\) denotes the fraction of inter-community edges of the network. When \(\mu >0.5\), there are more inter-community links than intra-community links, which indicates that there might be no obvious communities in the network. When \(\mu <0.05\), there is few inter-community links, which indicates that the network might be disconnected. Hence, the LFR networks used here were generated by setting the mixing parameter \(\mu \in [0.05,0.5]\). We compare our GLPA with 6 classical algorithms LPA, LPAm, LPAm+, LPA-S, BGLL and ISCD+ in this section.

Fig. 2
figure 2

Comparison of NMI on LFR benchmark networks

Fig. 3
figure 3

Comparison of ARI on LFR benchmark networks

We performed each algorithm 30 times on each dataset and obtained the average performance. Figures 2 and 3 show the comparison results of GLPA and comparing algorithms in term of NMI and ARI, respectively. When the mixing parameters were low (\(\mu \in [0.05,0.3]\)), the community structure of these networks were relatively obvious, and the results of each algorithm were highly consistent with the ground truth communities in these networks. When \(\mu\) grows, the community structure of the networks were not obvious. The proposed GLPA chooses communities with higher participation coefficient to merge, which ensures the quality of community detection and stability of the algorithm.

4.3 Experiments on real-world networks

We test algorithms on 4 real world networks with ground truth communities: Zachary’s Karate Club [25], Dolphins Social Network [18], Polbooks [26], and US College Football Network [27]; and we also run algorithms on 8 real-world networks without ground truth communities: Les Misérables [28], NetScience [29], Email [30], Yeast [31], Web_spam [32], Router [32], Bio_dmela [32] and PGP [32]. Table 1 displays the characteristics of these real world networks. We compared our GLPA with LPA, LPAm, LPAm+, LPA-S, FMM, BGLL, Infomap and ISCD+ on these networks.

Table 1 Characteristics of 12 real networks

For 4 real-world networks with ground truth communities, we use modularity, NMI, ARI and stability coefficient to show the performance of the algorithms. We perform 30 times for each algorithm on each dataset. The results are shown in Table 2, where k is the number of communities obtained.

Table 2 Comparison of four real-world networks with ground truth communities

Zachary’s Karate Club network is one of the most commonly used benchmark networks in community mining [25]. This network consists of 34 nodes and 78 links. Each member denotes a node in the network, and a link exists between two nodes if these two members interact consistently outside the activities of the club. However, the administrator (node 1) and instructor (node 33) fell out, and the members were split into two groups with either administrator or instructor. The proposed algorithm GLPA outputs two communities, as is shown in Fig. 4. Only node 10 is misclassified, and the value of NMI is higher than 0.8, which is much higher than the average of LPA, LPAm and LPAm+.

Fig. 4
figure 4

Result of GLPA on Karate network

The second network is Dolphins Social Network which has been investigated by Lusseau [18] over several years. It is composed of 62 dolphins and 159 links. As shown in Fig. 5, GLPA detects two communities in Dolphins network, which is completely consistent with the ground truth. GLPA gets the best result on this network.

Fig. 5
figure 5

Result of GLPA on Dolphins network

US College Football network [27] is the social network with 115 nodes and 613 edges. All nodes in the network are divided into 12 communities corresponding to the “conferences”. Our result is shown in Fig. 6, which contains 13 communities.

Fig. 6
figure 6

Result of GLPA on Football network

Another benchmark is commonly called Polbooks which is a network of books about American politics compiled by Valdis Krebs [26]. This network involves 105 nodes, each of which represents a book about US politics sold on Amazon.com. All nodes in the network were classified into three groups according to their political inclination: liberal, conservative, or centrist. If customers frequently bought two books at the same time, the two corresponding nodes are connected by an edge. The proposed GLPA detects two communities, as is shown in Fig. 7a. The value of the NMI via our algorithm are higher than other algorithms. Because in the original network division as shown in Fig. 7b, the nodes corresponding to the “centrist” books(blue nodes) are not densely connected due to the lack of obvious political inclination. GLPA divides the network into two communities and both communities were densely connected internally and sparsely connected externally. We think it might be reasonable to divide the network into two communities.

Fig. 7
figure 7

Comparisons on the Polbooks network

Table 3 shows the experimental results on 8 real-world networks without ground truth communities. We compare the performance of each algorithm from the modularity and stability. It can be seen that the stability coefficient of GLPA is low, which indicates that it has good stability.

Table 3 Comparison of four real-world networks without ground truth communities

5 Conclusion

In this paper, we propose a graph-based label propagation algorithm called GLPA for non-overlapping community detection based on neighbor influence and connectivity information during the label propagation process. The main steps of GLPA proceed as follows. First, we define the node similarity between a node and its neighbors. Each node changes its label to that of its most similar neighbor node. We construct a label progress graph L to imitate the label propagation process and calculate the connected components of L. Next, we construct a weighted graph W by treating each connected component as a super node, and assign the number of edges between two communities in original network as their edge weights. We compute the merging factors of each node of the resulted network W. We get the final communities by merging the small scale communities with high merging factors. The proposed algorithm GLPA could reduce the instability of the original LPA. Compared with other classical community detection algorithms on some real networks and artificial networks, the proposed algorithm shows preferable performance on NMI, ARI and modularity.

As future work, the authors intend to develop the graph-based representation to detect large amounts of small-scale communities, whose size distribution is imbalance. In this case, authors aim to employ the local version of measure to fit the imbalance size distribution. Moreover, we need to develop an effective community detection algorithm for detecting special community structures in complex networks.