1 Introduction

In the real world, complex systems exist in all aspects of people’s lives, such as social networks, protein interaction networks and scientists’ collaborative networks [1]. Complex systems are generally modeled as graphs or complex networks. A node in the network represents an individual, while edges represent connections between individuals. The existence of communities is a crucial characteristic of complex networks. The internal nodes of the community are connected closely, whereas the connections between communities are relatively sparse [2]. Uncovering communities in complex networks and mining the hidden relationships between communities are of utmost importance for complex network analysis.

The Girvan and Newman’s method (GN) [3] is the first algorithm for uncovering communities in complex networks. In the last decade, numerous algorithms for uncovering communities in complex networks have been proposed [4]. These algorithms can be divided into non-overlapping community detection algorithms and overlapping community detection algorithms. Most of the early methods [5, 6] for community detection focus only on identifying non-overlapping communities in which each node belongs to a single community. However, nodes in the network usually belong to more than one community. For example, a person may be a member of his family and a personnel of his company. Palla et al. [7] proposed the first algorithm for uncovering overlapping communities in complex networks. Henceforth, numerous algorithms have been proposed for addressing the problem of uncovering overlapping communities in complex networks [8,9,10,11,12,13]. The algorithms are mainly based on local methods and global methods [14]. The global methods are not applicable to complex networks with large scale or lack of integrity because it requires the topology information of the entire network. Compared with the global methods, local methods can uncover communities without the integral structural information of the complex networks. However, the accuracy of local methods is generally affected by the initial node selection.

Seed-extension algorithms play an important role in existing overlapping community detection methods. In term of efficiency, the time complexity of seed-extension algorithms is linear to the number of nodes or edges when the network is sparse [15,16,17]. In term of effectiveness, algorithms based on seed-extension perform well in uncovering highly overlapping communities in many types of complex networks [18,19,20]. However, algorithms based on seed-extension still have weakness. They may select seeds with poor quality or cannot fully utilize the local information among nodes. In this paper, a local seed-extension algorithm based on internal force between nodes (InfoNode) is proposed. The main contributions of this paper are as follows:

  1. (1)

    InfoNode uses local degree central nodes and Jaccard coefficient to detect core members in the network. The detected core members are treated as seeds, thus guaranteeing that the selected seeds are central nodes of the communities.

  2. (2)

    In order to fully utilize the local information between nodes, the definition of internal force between nodes is proposed. InfoNode combines internal force between nodes with the fitness function in the community extension stage, which greatly improves the accuracy of community detection.

  3. (3)

    A community pre-extension process is set up before the community extension stage. InfoNode selects the top k nodes with the best performance in the community pre-extension process and then calculate them accurately by fitness function with internal force between nodes in community extension stage, which can reduce the algorithm’s running time.

The rest of the paper is organized as follows: In Section 2, we present the research related to seed-extension algorithms with various strategies in seed selection and community extension stage. Section 3 describes the proposed community detection algorithm in detail. In Section 4, a series of experimental results are given to verify the performance of our method. Finally, the conclusion is drawn in Section 5.

2 Related work

The algorithms based on seed-extension generally select a seed as the initial community and then extend it by continuously checking its neighboring nodes. The method include two essential stages: seed selection and community extension. In the stage of seed selection, its aim is to detect core members of communities based on node centrality index. In the stage of community extension, its goal is to build communities from seeds based on their influence or a greedy process with quality function. In the following, various strategies for selecting seeds and extending communities are introduced in detail, respectively.

2.1 Seed selection strategies

Increasing studies show that the formation of communities depends on core members [21, 22]. For algorithms based on seed-extension, the quality of seeds has direct affection on the algorithms’ performance. In last decade, various seed selection strategies have been proposed [14]. Lancichinetti et al. [23] proposed a simple method that depends on random seed selection. The randomness brings efficiency but also makes it unable to discover high-quality communities. Lee et al. [18] proposed an algorithm that considers k-cliques in the network as initial communities so that it can uncover highly overlapping communities. However, the algorithm may ignore isolated sub-networks whose sizes are too small. In literature [24] and [25], the local degree central nodes whose degrees are greater than or equal to all their neighbors’ degrees are selected as seeds. Intuitively, the core members of a community are generally local degree central nodes. The strategy can detect all the core members of communities. However, it may also select non-core members. Some algorithms [26, 27] calculate the conductance of each node in the network and select local minimum conductance nodes as seeds. This kind of method can detect high-quality seeds but with a low time efficiency.

2.2 Community extension strategies

In the stage of community extension, each seed is taken as an initial community and extended by spreading the influence of the seed throughout the network [26, 28] or running a greedy process with a quality function [18, 23, 24]. The latter method attracts more interests of researchers. Hence, various quality functions have been proposed in recent years [29]. The quality function is optimized continuously until it gets the maximum or minimum value.

Clauset [30] proposed a local extension optimization algorithm that defines the local community modularity R, which is calculated as follows:

$$ R = \frac{B_{in}}{B_{in}+B_{out}} $$
(1)

In (1), B is a local community. Bin represents the number of edges whose endpoints are all in B, while Bout is the number of edges those have one endpoint in B. The algorithm needs to pre-define the size of communities. It continuously adds the neighboring node that makes the largest increase of R to current community, until the current community reaches the predefined size.

Lou et al. [31] proposed the modularity M of community S, which is calculated as follows:

$$ M = \frac{E_{in}}{E_{out}} $$
(2)

In (2), Ein represents the number of internal edges of community S, while Eout represents the number of edges between the community boundary and the external nodes. The algorithm proposes three heuristic node search methods to partially address the problem of uncovering communities in complex networks. However, it must set different thresholds for networks with different sizes.

Lancichinetti et al. [23] proposed a fitness function Fc to measure the tightness of internal nodes of the community. The fitness function is defined as follows:

$$ F_{c} = \frac{f_{in}^{c}}{(f_{in}^{c} + f_{out}^{c})^{\alpha}} $$
(3)

In (3), \(f_{in}^{c}\) and \(f_{out}^{c}\) are the total values of the internal degrees and external degrees of community c, and α is the resolution parameter used to control the size of communities detected. The quality function can effectively measure the tightness of nodes within the community, but it cannot fully utilize the local information among nodes.

3 The proposed community detection algorithm

This section is organized as follows: first, the basic notations and definitions used in the paper are presented in Section 3.1; second, Section 3.2 describes the proposed algorithm in detail; finally, the complexity analysis of our method is given in Section 3.3.

3.1 Basic notations and definitions

A network is usually modeled as G = (V, E). V is the set of nodes and E is the set of edges in the network. In this paper, we only consider undirected and unweighted networks. The notations used in this paper are listed in Table 1 and basic definitions are given as follows.

Table 1 Formal notations used in this paper

Definition 1

(Jaccard coefficient) The Jaccard coefficient [32] between two nodes is defined as:

$$ J(u,v) = \frac{|N(u)\cap N(v)|}{|N(u)\cup N(v)|} $$
(4)

Jaccard coefficient is defined to measure the similarity between two nodes. The larger the value, more similar the two nodes.

Definition 2

(Internal force) The internal force between two nodes is defined as:

$$ F(u,v) = g \times \frac{d(u) \times d(v)}{[1-J(u,v)]^{2}} $$
(5)

In (5), g is a parameter used to control the magnitude of internal force between two nodes, and we make its value as 1/d2. When internal force between nodes is calculated, a case may occur where the Jaccard coefficient between two nodes is 1, which means the two nodes are too ”close”. Therefore, the Jaccard coefficient formula needs to be slightly modified as follows:

$$ J^{\prime}(u,v) = \frac{|N(u) \cap N(v)|}{|N(u) \cup N(v)|+1} $$
(6)
$$ F^{\prime}(u,v) = g \times \frac{d(u) \times d(v)}{[1-J^{\prime}(u,v)]^{2}} $$
(7)

Definition 3

(Fitness function with internal force) The fitness function with internal force is used to measure the tightness of a group of nodes. Its specific formula is defined as follows:

$$ F_{G} = \frac{F_{in}^{'G}}{(F_{in}^{'G} + F_{out}^{'G})^{\alpha}} $$
(8)

In (8), \(F_{in}^{'G}\) and \(F_{out}^{'G}\) are the total value of the internal force and the external force of community G, respectively. The larger the value of fitness function of a group nodes, the more likely it is that they form a community.

Definition 4

(Node fitness to community) The fitness of node v to community c is used to determine whether the node should be added to the community. It is defined as follows:

$$ {F_{c}^{v}} = F_{c\cup \{v\}} - F_{c} $$
(9)

In (9), if the value of \({F_{c}^{v}}\) is positive, it indicates that the addition of node v to community c can make its structure more compact; on the contrary, it means that the addition of the node v to community c will make its internal structure looser.

3.2 Algorithm description

Figure 1 shows the flowchart of InfoNode that is mainly composed of two stages: seed selection and community extension. In the seed selection stage, local degree central nodes and Jaccard coefficient are used to detected core members of communities. In the community extension stage, InfoNode combines internal force between nodes with the fitness function to extend communities. In the following, the seed selection and community extension stage are described in detail, respectively.

Fig. 1
figure 1

The flowchart of InfoNode

3.2.1 Seed selection

For algorithms based on seed-extension, the quality of seeds directly affects the accuracy of communities detected. In general, the selected seeds should have a considerable ”influence” in the local structure and be the core members of communities. The process of selecting seeds is given in Algorithm 1.

figure f

First, all local degree central nodes in the network are selected as candidate seeds (line 2-3). Obviously, these nodes have a considerable ”influence” on their neighboring nodes. Second, in order to measure the ”influence” of candidate seeds on their neighboring nodes, InfoNode calculates the sum of Jaccard coefficient of each candidate seed with its neighboring nodes (line 5-6). The larger the value, the greater ”influence” the candidate seed has on its surrounding nodes. Finally, InfoNode normalizes the sum of Jaccard coefficient of each candidate seed and sets a threshold ε to select seeds (line 8-14). When the candidate seed is normalized to a value greater than ε, it is defined as a core member and used as the seed in community extension stage.

Figure 2 shows an example of selecting seeds in a toy network ENZYMES_g50Footnote 1. The parameter ε in seed selection stage is defaulted by 0.5. In Fig. 3, we use the well-studied Karate network [33] to verify the performance of our seed selection method. The network has two communities, one of which is led by a class instructor (node 0), and the other is led by a club administrator (node 33). Figure 3 shows that our seed selection method can accurately find the seeds in the network.

Fig. 2
figure 2

Example showing the process of seed selection

Fig. 3
figure 3

The selected seeds in Karate network

3.2.2 Community extension

After obtaining seeds in the network, InfoNode extends them to detect communities. The process of community extension is given in Algorithm 2.

figure g

According to Algorithm 2, the specific community extension steps can be summarized as follows:

(1):

The seeds obtained in the seed selection stage are sorted via degree in non-ascending order. When the seed set is non-empty, we select the seed with maximum degree to be extended; otherwise, we randomly select a node that have not been extended as the seed (line 5-10). If all nodes in the network are extended, the community extension stage ends.

(2):

The value of fitness function of each neighboring node to the current community is calculated and the top k nodes those can maximize the value of fitness function of current community are selected (line 13-16).

(3):

The top k nodes with the best performance in step (2) are accurately calculated by the fitness function with internal force between nodes. The node that can maximize the fitness value of the current community is selected(line 17-20).

(4):

If the selected node increases the value of fitness function of current community, it will be added into the current community. We update neighboring nodes of the current community and the process will then return to step (2); otherwise, the current community is a final divided community and the process will return to step (1) (line 21-26).

Figure 4 shows the example of extending communities in the network. Considering the small size of the network, we take the value of k as 3. The community extension stage repeats this process until all nodes in the network are extended. In Fig. 5, we extend the seeds obtained in the seed selection stage in Karate network. The result shows that our algorithm accurately divides the network into two communities, and each node is divided into the right community.

Fig. 4
figure 4

Example showing the process of community extension

Fig. 5
figure 5

Communities in the Karate network

3.3 Algorithm complexity analysis

In seed selection stage, the time complexity of determining whether one node is a local degree central node is O(nd) and the time complexity of selecting seeds is O(q2d). Before community extension stage, we should calculate the internal force between nodes and the time complexity of this step is O(nd). For algorithms based on seed-extension with a quality function, the time complexity is almost linear to the number of nodes or edges when communities are small, but the worst-case complexity is O(n2) [23]. In summary, the time complexity of InfoNode is O(nd) in a general complex network. However, when the size of communities is close to n, the time complexity of InfoNode is O(n2).

For the space complexity of InfoNode, all local degree central nodes in the network need to be stored in seed selection stage and the required space is O(q + n). In community extension stage, the Jaccard coefficient between the nodes must be stored and the required storage space is O(nd + m). Therefore, the total space complexity of the InfoNode algorithm is O(q + n + nd + m) which can be reduced to O(m).

4 Experimental results and analysis

This section is organized as follows: first, the description of real and artificial dataset is given in Section 4.1; second, we introduce the experimental settings and evaluation metrics in Section 4.2; finally, experimental results and analysis are given in Section 4.3.

4.1 Dataset description

(1) Real datasets.

The real datasets used in the experiments are all well-known and widely used in community detection field. The real networks are extracted from various domains with different scales and degree distributions, which can not only verify the performance of community detection algorithms, but also challenge them in the terms of robustness and scalability. The specific information of real datasets is shown in Table 2.

(2) Artificial datasets.

We use the LFR-benchmark [40] to generate artificial datasets. The network generated by this program can well control the distribution of nodes’ degree and communities’ size. Four groups of artificial dataset are generated by the program. The basic parameters are set as follows:

  1. (1)

    D1: N= 5000, μ= 0.1-0.7, on= 0.1, om= 3;

  2. (2)

    D2: N= 5000, μ= 0.3, on= 0.1,0.3, om= 2-8;

  3. (3)

    D3: N= 5000, μ= 0.3, on= 0.1-0.6, om= 3;

  4. (4)

    D4: N= 2000-10000, μ= 0.3, on= 0.3, om= 3.

Table 2 Description of real datasets

The rest of the parameters are set by default: kmax= 50, minc= 20, maxc= 100. The specific description of the parameters in artificial datasets is shown in Table 3. In these experiments, we takes on the ratio of the number of overlapping nodes to the total number of nodes in the network.

Table 3 Parameter description of artificial datasets

4.2 Experimental settings and evaluation metrics

4.2.1 Experimental settings

In the experiments, InfoNode is compared with three state-of-the-art approaches and three algorithms based on seed-extension.

The three state-of-the-art approaches are Attractor [41], Ego-Splitting [42] and MUTILCOM [43]. Attractor regards the network as a dynamic system and proposes three intuitive interaction modes to dynamically discover communities by simulating the changing distance between nodes in the network. Ego-splitting is a framework for detecting communities by leveraging local structures to de-couple overlapping communities. MUTILCOM a method for detecting multiple local communities those may overlaps. The algorithm expands seeds those are selected by embedding local networks around the initial seed set into low dimensional space.

In addition, to show the advantages of seed selection and community extension strategies used in our method, we compared InfoNode with three algorithms based on seed-extension: LFM [23], DEMON [44], and LMD [24]. LFM randomly selects seeds in the network and extends communities by a greedy process with fitness function. Demon considers all nodes in the network as seeds and merges communities based on ego-networks and a community consolidation strategy. LMD selects the local degree central nodes as seeds and extends communities around the seeds.

The experimental contents include three parts: algorithms’ parameter experiments, algorithms’ accuracy experiments on real and artificial datasets and the scalability experiments of InfoNode.

4.2.2 Evaluation metrics

We chose two widely used evaluation metrics to verify our method: the extension of modularity (EQ) [45] and the Normalized Mutual Information (NMI) [46].

The specific calculation formula of the extended modularity is defined as follows:

$$ EQ=\frac{1}{2m}\sum\limits_{i}{\sum\limits_{v\in c_{i}, w\in c_{i}}{\frac{1}{O_{V}O_{W}}(A_{vw}-\frac{k_{v}k_{w}}{2m})\delta(c_{v},c_{w})}} $$
(10)

In (10), Ov is the number of communities to which the node v belongs, A is the adjacency matrix and kv is the degree of node v. When node v and node w is connected, the value of δ(v,w) is 1; otherwise, it is 0. The closer to 1 the value of EQ, the better quality of com7munity detection.

NMI uses information entropy to measure the difference between communities detected and the ground-truth communities. The larger the value of NMI, the better quality of community detection. The specific calculation formula is defined as follows:

$$ NMI=\frac{-2{\sum}_{i=1}^{C_{A}}{\sum}_{j=1}^{C_{B}}C_{ij}\log{\frac{C_{ij}N}{C_{i}C_{j}}}}{{\sum}_{i=1}^{C_{A}}{C_{i}\log{\frac{C_{i}}{N}}}+{\sum}_{j=1}^{C_{B}}{C_{ij}\log{\frac{C_{j}}{N}}}} $$
(11)

In (11), CA(CB) is the number of communities detected and ground-truth, respectively. Ci(Cj) is the sum of elements of community C in row i (column j) and N represents the number of nodes in the network.

4.3 Experimental results

4.3.1 Experimental results on algorithms’ parameters

There are three parameters used in InfoNode: α, k, and ε. Parameter α is used to control the scale of communities detected. When the value of α is large, the size of communities detected is small; vice versa. It is the same parameter α in LFM. Therefore, it is set to 0.8-1.2 as suggested in [23].

(1):

Experiments on parameter k

Parameter k is used to select the top k nodes those have the best performance in the pre-expansion process. Obviously, the larger the value of k, the better quality of community detection, but the longer running time of InfoNode. Therefore, we should choose a appropriate value of k.

Figure 6 shows the experimental results on parameter k on the D1 dataset. As seen in the figure, as the value of k increases, the NMI increases. When the community mixing parameter μ is less than 0.5 and the value of k is 5, the NMI can reach the maximum value because the community structure in the network is clear and easy to be identified at this time. When the community mixing parameter μ reaches 0.5 or higher, the value of NMI does not increase until k reaches approximately 15 because the community structure in th network is fuzzy at this time and more nodes should be calculated by fitness function with internal force between nodes. Therefore, parameter k is suitable when its value is 5 in the network with a low community mix. It is more appropriate to take the value of k as 15 in the network with a high community mix.

(2):

Experiments on parameter ε

Fig. 6
figure 6

Experiments on parameter k

The parameter ε is used to select seeds in the network. Obviously, when ε is taken as 0, all local degree central nodes in the network are selected as seeds. When ε is taken as 1, seeds are randomly selected from the nodes those are not extended in the network.

Figure 7 shows the experimental results on parameter ε on the D1 dataset. When the community mixing parameter μ is less than 0.4, the value of NMI decreases until ε is approximately 0.7 because the community structure in the network is relatively clear at this time, and the quality of seeds selected has little effect on the accuracy of communities detected. When the community mixing parameter μ reaches 0.4 or higher, the downward trend of the value of NMI occurs until ε is approximately 0.5. The quality of seeds selected has a considerable impact on the accuracy of the communities detected because the community structure in the network is relatively vague at this time. Obviously, when the value of ε is small, there are many seeds selected and the subsequent community extension stage will perform some unnecessary operations. Therefore, the suitable value of parameter ε in InfoNode is about 0.5.

Fig. 7
figure 7

Experiments on parameter ε

4.3.2 Experimental results on real datasets

Table 4 shows the experimental results of EQ on real datasets. The accuracy of InfoNode is only slightly lower than that of Attractor on polbooks, football and jazz, because the average degree of nodes in the networks is quite high. In addition, links in the networks are relatively compact and more suitable for the algorithms based on distance dynamics among nodes such as Attractor. In the other nine real datasets, InfoNode has the highest EQ result owing to our new seed selection method and the fitness function with internal force in community extension stage.

Table 4 EQ results on real datasets

On the other hand, Fig. 8 shows the NMI results on real networks whose ground-truth communities are known. It can be seen from the figure that our algorithm can detect communities more accurately than all the other algorithms on the four networks. Compared with other three algorithms based on seed-extension, the strategy in seed selection of InfoNode can obtain high-quality seeds in the network and InfoNode combines internal force with the fitness function for fully utilizing the local information between nodes.

Fig. 8
figure 8

NMI results on real datasets

4.3.3 Experimental results on artificial datasets

Figure 9 shows the experimental results of different algorithms’ accuracy on group D1 artificial simulation networks. The experimental results reveal that the accuracy of InfoNode is higher than that of other algorithms in all values of μ. Attractor algorithm is based on changing distance between nodes in the network. When communities in the network are not obvious, it is difficult for Attractor to simulate the changing distance between nodes. Ego-Splitting algorithm is based on the ego-nets in the network and it performs well when the value of μ is small. However, it is harder for Ego-Splitting to find ego-nets in the network as the value of μ increases. MUTILCOM algorithm selects seeds by embedding local networks around the initial seed set into low dimensional space. The quality of seeds selected by the method decreases as the value of μ increases. LFM randomly selects nodes as seeds and does not fully utilize local information between nodes. DEMON needs to fuse local communities to form optimal global communities. It tends to form multiple independent communities with the increase of the value of μ. The accuracy of community detection is then reduced. LMD performs well when the value of μ is small. The number of local degree central nodes increases when the increase of the value of μ. LMD selects all local degree central nodes as seeds so that the quality of the seeds is reduced. In addition, LMD cannot fully utilize the local information between nodes. In conclusion, the strategies used in InfoNode can improve the accuracy of community detection.

Fig. 9
figure 9

NMI results on artificial datasets

Figures 10 and 11 show the experimental results of each algorithm on group D2 artificial networks. The experimental results indicate that as the value of om increases, the accuracy of each algorithm will decrease. This outcome is attributed to the artificial simulation network becoming complicated and more difficult for each algorithm to uncover communities. Overall, the accuracy of InfoNode is better than that of other algorithms because of the strategies in seed selection and community extension stage in our method.

Fig. 10
figure 10

NMI results on parameter om(on= 0.1)

Fig. 11
figure 11

NMI results on parameter om(on= 0.3)

Figure 12 shows the experimental results of each algorithm on group D3 artificial networks. The experimental results reveal that as the value of on increases, that is, as the number of overlapping nodes in the network increases, the accuracy of each algorithm will decrease, but InfoNode remains superior to the comparison algorithms. The accuracy of DEMON is only second to that of InfoNode, because DEMON forms local communities based on label propagation which is suitable for networks with many overlapping nodes.

Fig. 12
figure 12

NMI results on parameter on

4.3.4 Scalability experimental results

Figure 13 shows the running time of InfoNode and other comparison algorithms on group D4 artificial networks. The time efficiency of InfoNode is slightly lower than that of LFM and LMD. LFM randomly selects seeds and does not take advantages of internal force between nodes, so it is more efficient than other algorithms based on seed-extension. LMD is slightly more efficient than InfoNode because it does not utilize the local information between nodes. DEMON needs to extend all nodes in the network and optimize communities detected, so it has a low time efficiency. The time complexity of Attractor is linear to the number of edges in the network, so it is slightly slower that other three algorithms based on seed-extension. Ego-Splitting algorithm can easily detect all the ego-nets in the networks whose structure is not too complicated. However, its time efficiency is still inferior to LFM, LMD and InfoNode. The time complexity of MUTILCOM is high because it selects the initial seeds by embedding the local networks around the seeds into low dimensional space.

Fig. 13
figure 13

Algorithms’ running time(s) experimental results

5 Conclusions

This study proposes a local community detection algorithm based on internal force between nodes, which can accurately and effectively uncover communities in the network. First, we select seeds through the local degree central nodes and Jaccard coefficient in the network. Second, the node with the maximum degree among seeds adopts the fitness function strategy for pre-extension every time. Finally, the top k nodes with the best performance in the pre-expansion process are extended by the fitness function with internal force between nodes. As experimental results show on the real and artificial datasets, our method can accurately and effectively uncover communities in complex networks.

Future work will compare InfoNode with other state-of-the-art methods. The parallelization based on MapReduce model will be considered to improve the time efficiency of InfoNode. At the same time, dynamic strategies will be introduced to adapt to uncover communities in dynamic networks, thus increasing the practicality of our method.