1 Introduction

Networks have become an indispensable tool for studying complex systems arising in diverse areas, including the biological, social and computational sciences. Network techniques are also widely used in data structures, data mining and artificial intelligence. The emergence of giant social networking sites such as Facebook and Twitter has made human–human interaction possible through complex social networks at an unprecedented level. The internet itself is a huge complex network in which nodes represent webpages and edges the hyperlinks from one page to another [9]. Molecular interactions in living organisms control the mechanisms of cellular functions [41]. Understanding the nature and dynamics of such large networks, specifically at the mesoscale level, depends on their successful decomposition into smaller modules, called communities, consisting largely of related nodes [6]. Most previous studies have focused on disjoint community structures, where each node belongs to no more than one community. However, in many real networks nodes belong to multiple groups simultaneously. Networks such as co-authorship networks and the internet are likely to have a large number of nodes whose interactions are spread across several groups. For example, influential scientists contribute to several disciplines of an interdisciplinary nature by collaborating with other influential scientists. Likewise, the most frequently searched webpages are well connected with several other websites. An individual in a social network may interact with multiple groups such as family, friends, peers and teachers. Such nodes belong to multiple groups. This phenomenon is called community structure with pervasive overlaps [28, 31, 56], and it cannot be detected by traditional methods of overlapping community detection. Although there are a few algorithms aimed at pervasive overlaps, such as CPM [40], LinkCom [2], SLPA [54], OSLOM [26] and COPRA [17], they are not fully satisfactory and suffer from slow speed. A detailed discussion on these and a few more algorithms is given in Sect. 2.

To overcome these limitations, we propose a novel algorithm called IC-LCD—Interaction Coefficient based Local Community Detection. As the name suggests, the IC-LCD algorithm is based on node and edge interaction coefficients, where the edge interaction coefficient is a newly defined parameter. Community detection through IC-LCD takes place in two phases. Phase-1 (Expansion Phase) expands communities starting from seeds consisting of single nodes, and Phase-2 (Refinement Phase) merges unstable or smaller communities into stable, larger ones. We have tested the proposed algorithm on LFR benchmarks and a range of real-world networks, including some large networks. To compare its performance, we choose DEMON, SLPA, OSLOM and COPRA as baseline algorithms, owing to their widespread use in the literature. The results indicate that IC-LCD is competitive with all these algorithms in terms of speed, accuracy and stability.

2 Related work

Overlapping community detection algorithms can be classified according to whether they employ a local or a global strategy for clustering nodes. Usually a global algorithm uses overall topological information about the network to optimise a global quality function such as modularity [34] or overlapping modularity [22, 39]. Generally, this process of community detection follows an agglomerative [35] or divisive [52] approach. Some examples of global methods are FCM by Zhang et al. [59], CONGA (Cluster-Overlap Newman Girvan Algorithm) by Gregory [16], and MOSES by McDaid and Hurley [31]. On the other hand, algorithms such as CPM [40] and [43] employ local optimisation strategies for producing overlapping communities. The algorithm FCM is based on a combination of three concepts, namely modularity [38], spectral relaxation [52] and the fuzzy c-means clustering method [7]; however, computational complexity is a key problem of FCM (see Table 1 for more details). CONGA provides good results, but it is slow, with approximately cubic time complexity, and so cannot cope with large networks. CONGA inherits its low speed from the GN algorithm [13]. Both CONGA and GN rely on betweenness, a global centrality measure: at each step, it counts the number of shortest paths between all pairs of vertices in the network. A fast and scalable algorithm requires a measure that can be computed locally. Hence, Gregory [16] introduced the concept of local betweenness and then described a new algorithm: CONGO (CONGA Optimized). Although CONGO works well for sparse networks and is slightly faster than CONGA, its accuracy is lower (see Table 1 for complexity). McDaid and Hurley [31] proposed the MOSES algorithm, which is based on the same global objective function (modularity) but incorporates a greedy maximisation strategy in which communities are created and deleted, and nodes are added to and removed from communities, in order to maximise the objective function. Despite being fast, MOSES is not able to attain high NMI scores on LFR benchmarks.

On the other hand, local strategy-based algorithms such as EAGLE [45] and OSLOM [26] utilise local expansion and optimisation-based strategies for growing a seed community (further details can be seen in Table 1). In these methods, a seed community is grown only through a local structure, such as the neighbourhood around the seed. In some cases, the seed community can start from a single node, as in LFM [25] and MONC [19], whereas for the GCE [28] algorithm the seed community is a k-clique (a complete subgraph on k vertices). Most local algorithms rely on a local benefit function that characterises the quality of a densely connected group of nodes. However, problems arise with both approaches: global methods usually suffer from resolution limits [11], whereas local methods are unable to scale with network size.

Node versus link clustering provides another classification of community detection algorithms. The structural information of a network lies in node degrees, clustering coefficients, etc., and node clustering methods usually exploit node degree or node centrality to extract the community structure [25]. A typical limitation of node-based algorithms is that they tend to produce small overlapping regions among communities [53]. The notion of link clustering was proposed by Palla et al. [40] with the presumption that an overlapping community structure can be recovered more effectively if the clustering is done through links rather than nodes.

Table 1 Details of some state-of-the-art overlapping community detection algorithms

They offered the Clique Percolation Method (CPM), based on the definition that a community, or more specifically a k-clique community, is the union of all k-cliques that can be reached from one another through a series of adjacent k-cliques. The output of CPM is, however, sensitive to the parameter k. Another interesting link clustering algorithm is LinkCom [2], proposed by Ahn et al., which works on the concept of link similarity. It has been argued that, in general, link clustering methods perform no better than node clustering methods [12]. To resolve some of the drawbacks of node and link clustering, an algorithm was recently proposed that can extract node, link and hybrid node-link communities through a probabilistic model [20]. Several other algorithms based on probabilistic modelling have been developed for community detection, with applications to social media, target events, storyline detection, protein function prediction, etc. DIM3 (dynamic infinite mixed-membership stochastic blockmodel) [8] is one such algorithm based on a stochastic mechanism; it is well suited to dynamic data sets in which the data change day by day. NOODLES (overlappiNg cOmmunities and rOles from noDe Links and messagES) [5] is another probabilistic community detection algorithm, and it can be applied to detect certain attributes from real data. This algorithm has been extensively tested on real-world social media data for community detection and link prediction. However, like many other algorithms, it still leaves room for improvement, for instance in eliminating unrealistic communities.

Many overlapping community detection algorithms use the idea of label propagation introduced by Raghavan et al. [42]. It has been extended to incorporate multiple memberships of vertices in methods such as COPRA [17], SLPA [54] and DEMON [4]. In COPRA, each node synchronously updates its label and belonging coefficients by averaging the coefficients of all its neighbours. SLPA is a general speaker–listener-based information propagation process that spreads label information between nodes according to pairwise interaction rules. In SLPA, each node has a memory space to store the received information, and the probability of observing a label in the memory of a node is interpreted as the membership strength; hence SLPA needs a post-processing parameter to generate communities, for which there is no effective setting strategy. The algorithm DEMON first extracts the ego-minus-ego network from the given network and then applies label propagation to discover communities. Recently, Sun et al. [47] proposed a link propagation-based algorithm called LinkLPA for overlapping community detection, but LinkLPA also performs poorly, in terms of overlapping modularity [39], on synthetic networks as well as on ground-truth networks such as DBLP. Beyond the algorithms discussed above, a few more attempts have been made at community detection in general or in specific networks [46, 48, 51, 58, 60].

To be specific, we are concerned with pervasive overlaps, where “nodes possess a large number of memberships”, rather than with communities having large overlapping regions. The task of node-community assignment in such networks is often hard. Algorithms such as DEMON, CPM and LinkCom can detect pervasive overlaps somewhat satisfactorily, but they suffer from slow speed [10, 55, 57], and they need reliable quality measures for proper evaluation. On LFR networks, SLPA, OSLOM and COPRA show better performance, but in most real networks they have detected fewer than 20% of the overlapping nodes [53]. In this direction, the performance of GCE and MOSES is noteworthy, especially on LFR benchmarks; in real networks, however, both fail to detect more than 10 memberships per node.

Accordingly, in this article, we study the problem of community detection with “pervasive overlaps”. Our efforts have led us to develop an algorithm called IC-LCD (Interaction Coefficient based Local Community Detection). The interaction coefficient, the notion underlying IC-LCD, can be defined for nodes as well as for edges with respect to a subgraph. IC-LCD exploits the node and edge interaction coefficients in combination, under a framework of local seed expansion, to reveal the community structure of real networks. We experiment with IC-LCD on a number of synthetic and real-world networks and compare the results with the algorithms DEMON, SLPA, OSLOM and COPRA. The results indicate that IC-LCD is competitive with the baseline algorithms in uncovering community structure with pervasive overlaps.

3 Methodology

3.1 Definitions and terminologies

We denote a graph by G, its vertex (node) set by \(V_G\), and its edge set by \(E_G\). We shall write \(C \subseteq G\) to indicate that C is an induced subgraph of G; the sets \(V_C\) and \(E_C\) are then the vertex set and the edge set of C, respectively. The notation \(N_v\) indicates the set of neighbours of a vertex v in G, and \(d_v\) is the degree of v. Let \(C \subseteq G\). Then \(N_C\) is the neighbourhood of C, defined as the set of all neighbours of the nodes of C that lie outside \(V_C\), i.e.,

$$\begin{aligned} N_C = \lbrace v \in V_G\backslash V_C : v \in N_u \text { for some } u\in C \rbrace . \end{aligned}$$

We shall write \(N_{uC}\) to denote the common neighbourhood of C and u, i.e., \(N_{uC}=N_u\cap N_C\). Traditionally, seed expansion-based algorithms use coefficients such as the one defined below.

Definition 1

(Node interaction coefficient) Given a subgraph \(C \subseteq G\) and a node \(v \in N_C\), the node interaction coefficient of v with C is the quantity:

$$\begin{aligned} \xi _\text {node}(v, C) = \frac{\left| N_v \cap V_C \right| }{d_v} \end{aligned}$$
(1)

Some authors define a node interaction coefficient as a weighted average of the quantities \(\left| N_v \cap V_C \right| \) and \(\left| N_v \cap N_C \right| \), see [23] for instance. Another approach is to define the interaction coefficient of an edge with a subgraph. So, consider a subgraph C of a graph G. Let \(e_{uv}\) be an edge of G such that its endpoints u and v are in \(N_C\). Note that

$$\begin{aligned} \left| N_u \cap V_C \right| \le d_u -1, \quad \left| N_v \cap V_C \right| \le d_v -1, \end{aligned}$$

which implies,

$$\begin{aligned} \min \lbrace \left| N_u \cap V_C \right| , \left| N_v \cap V_C \right| \rbrace \le \min \lbrace d_u-1, d_v-1 \rbrace \end{aligned}$$

Hence, \(1+\min \lbrace \left| N_u \cap V_C \right| , \left| N_v \cap V_C \right| \rbrace \le \min \lbrace d_u, d_v \rbrace \), which is smaller than or equal to \(\max \lbrace d_u, d_v \rbrace \). This leads to the following definition.

Definition 2

(Edge interaction coefficient) Let G be a graph and \(C \subseteq G\). Let \(e_{uv}\) be an edge of G with endpoints \(u,v \in N_C\). Then the edge interaction coefficient of \(e_{uv}\) with C is defined as

$$\begin{aligned} \xi _\text {edge}(e_{uv}, C) = \frac{1+\min \bigl \lbrace \left| N_u \cap V_C \right| , \left| N_v \cap V_C \right| \bigr \rbrace }{\max \lbrace d_u, d_v \rbrace } \end{aligned}$$
(2)
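
For concreteness, the following Python sketch shows how Eqs. (1) and (2) could be computed; networkx and set-based community representations are our own assumptions for illustration, since the article does not prescribe an implementation language. The helper neighbourhood() realises the set \(N_C\) defined above.

```python
# A minimal sketch of Definitions 1 and 2; networkx and set-based
# communities are illustrative assumptions, not part of the original text.
import networkx as nx

def neighbourhood(G: nx.Graph, C: set) -> set:
    """N_C: all neighbours of nodes of C lying outside C."""
    return set().union(*(set(G[u]) for u in C)) - C

def xi_node(G: nx.Graph, v, C: set) -> float:
    """Eq. (1): |N_v ∩ V_C| / d_v for a node v in N_C."""
    return len(set(G[v]) & C) / G.degree(v)

def xi_edge(G: nx.Graph, u, v, C: set) -> float:
    """Eq. (2) for an edge (u, v) with both endpoints in N_C."""
    ku = len(set(G[u]) & C)      # |N_u ∩ V_C|
    kv = len(set(G[v]) & C)      # |N_v ∩ V_C|
    return (1 + min(ku, kv)) / max(G.degree(u), G.degree(v))
```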

3.2 Seed expansion criterion

As described in the introduction, Phase-1 of IC-LCD expands seed communities, each of which starts from a single node. To expand a seed community, one essentially needs to know which nodes are to be included in it. This task is performed by the procedure GET-NEW-NODES() of our algorithm (Algorithm 1). The arguments of GET-NEW-NODES() are the graph G, a subgraph C of G, the neighbourhood \(N_C\) of C, and the parameters \(\xi _0\) and \(n_{\min }\). The subgraph C is known as a seed community, and its local expansion depends solely on how it interacts with its neighbourhood \(N_C\). The interaction between C and \(N_C\) can be measured in two ways—using the node interaction coefficient (\(\xi _{\text {node}}\)) and the edge interaction coefficient (\(\xi _{\text {edge}}\)). The threshold value for both coefficients is taken as \(\xi _0\). How this is done is explained in “Appendix A.1.” The parameter \(n_{\min }\) indicates the minimum size of the detected communities.

Let G be a graph whose nodes, for the sake of simplicity, are assumed to be positive integers, and let \(C \subseteq G\). In order to expand C into a community, we must have a criterion for picking new nodes from the neighbourhood of C. For this task, we maintain a set \(V_{\text {new}}\), which is initially empty. Then, for each \(u \in N_C\): if \(\xi _{\text {node}}(u,C) \ge \xi _0\), u is added to \(V_{\text {new}}\); otherwise, if \(\left| N_C\right| \ge \left| V_C\right| \), then for each \(v \in N_{uC}\) with \(v > u\), if \(\xi _{\text {edge}}((u,v),C) \ge \xi _0\), both u and v are added to \(V_{\text {new}}\). The condition \(v > u\) is imposed to avoid computing \(\xi _{\text {edge}}\) twice for the same edge.

If the steps above yield \(V_{\text {new}} = \varnothing \) while the size of C is still smaller than \(n_{\min }\), then all nodes \(u \in N_C\) that satisfy \(\xi _{\text {node}}(u,C \cup N_C) \ge \xi _0\) are added to \(V_{\text {new}}\). This last step, which we call the augmentation step, plays a pivotal role during the expansion phase of a community. The method for computing the new nodes is listed in Procedure GET-NEW-NODES (see Appendix A for an illustration), and a sketch in code follows the listing.

Procedure GET-NEW-NODES (pseudocode listing)
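
To make the selection criterion concrete, here is a hedged Python sketch of GET-NEW-NODES() assembled from the prose above and the coefficient functions of Sect. 3.1; the authors' listing remains authoritative, and node labels are assumed comparable (e.g. integers, matching the paper's convention) so that the test v > u is meaningful.

```python
def get_new_nodes(G, C: set, N_C: set, xi_0: float = 0.5, n_min: int = 5) -> set:
    V_new = set()
    for u in N_C:
        if xi_node(G, u, C) >= xi_0:
            V_new.add(u)
        elif len(N_C) >= len(C):
            # Common neighbourhood N_uC = N_u ∩ N_C; v > u avoids duplicates
            for v in (set(G[u]) & N_C):
                if v > u and xi_edge(G, u, v, C) >= xi_0:
                    V_new.update((u, v))
    if not V_new and len(C) < n_min:
        # Augmentation step: interaction measured against C ∪ N_C
        V_new = {u for u in N_C if xi_node(G, u, C | N_C) >= xi_0}
    return V_new
```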

3.3 Overlap and merging criterion

The strategy for determining the overlap function and the consequent merging criterion is the next important aspect of a seed expansion-based algorithm. A simple distance measure between two communities \(C_1\) and \(C_2\) is given below [28]:

$$\begin{aligned} \delta (C_1, C_2) = 1-\frac{\left| V_{C_1} \cap V_{C_2}\right| }{\min \lbrace \left| V_{C_1}\right| , \left| V_{C_2}\right| \rbrace }. \end{aligned}$$

Note that \(1-\delta (C_1, C_2)\) can be taken as the overlap between \(C_1\) and \(C_2\). However, this overlap depends only on the number of nodes in the common region of \(C_1\) and \(C_2\); it takes into account neither the edges between \(C_1\) and \(C_2\) nor the edges within them. It is quite possible that \(V_{C_1} \cap V_{C_2} = \varnothing \) while there are a large number of edges with one endpoint in \(C_1\) and the other in \(C_2\). In such cases \(1-\delta (C_1, C_2)\) would be zero, yet the two communities could still be merged to form a stable community.

To arrive at a more sensitive overlap function, let us concentrate on the quantity \(\left| \text {cut}(C_1, C_2)\right| \), the number of edges with one endpoint in \(C_1\) and the other in \(C_2\). To normalise it, we divide it by the minimum of the external degrees of \(C_1\) and \(C_2\), i.e., by the minimum of \(\left| N_{C_1}\right| \) and \(\left| N_{C_2}\right| \). Thus we have

$$\begin{aligned} 0 \le \frac{\left| \text {cut}(C_1, C_2)\right| }{\min \lbrace \left| N_{C_1}\right| , \left| N_{C_2}\right| \rbrace } \le 1 \end{aligned}$$
(3)

This indicates that \(C_1\) and \(C_2\) could be merged if either of them shares a large fraction of its external edges with the other. However, this argument may fail in certain cases. For example, consider two cliques \(C_1\) and \(C_2\) joined by a single edge, and suppose they have no other external edges; merging \(C_1\) and \(C_2\) would then not be a good idea. So, we need to look at some other factor that makes the merging of two communities necessary. In fact, it is the internal density (the ratio of the number of internal edges present to the maximum possible number of internal edges) of a community that makes it stable or unstable. So, consider the quantity

$$\begin{aligned} \alpha (C_1, C_2) = 1-\frac{\min \lbrace \Vert C_1\Vert , \Vert C_2\Vert \rbrace }{\max \lbrace \Vert C_1\Vert _{\max }, \Vert C_2\Vert _{\max }\rbrace }, \end{aligned}$$
(4)

which takes into account the combined effect of the internal densities of both \(C_1\) and \(C_2\). Here \(\Vert C\Vert \) denotes the number of edges in C, and \(\Vert C\Vert _{\max }\) the maximum possible number of edges in C, that is, \(\Vert C\Vert _{\max } = \binom{\left| C\right| }{2}\). Note that \(\alpha (C_1, C_2)\) lies between 0 and 1, where 1 is achieved when either \(\Vert C_1\Vert \) or \(\Vert C_2\Vert \) is zero. As \(\Vert C_1\Vert \) and \(\Vert C_2\Vert \) increase, which means that both \(C_1\) and \(C_2\) have a large number of internal edges, \(\alpha (C_1, C_2)\) decreases. So, \(\alpha \) can serve as the factor responsible for the amalgamation of \(C_1\) and \(C_2\).

Now we are ready to define our overlap function. Given a cover \(\mathcal {K}\) containing the initial communities, we define the overlap function between any \(C_1, C_2 \in \mathcal {K}\) as

$$\begin{aligned} \theta (C_1, C_2) = \frac{\alpha (C_1, C_2)\left| \text {cut}(C_1, C_2)\right| }{\min \lbrace \left| N_{C_1}\right| , \left| N_{C_2}\right| \rbrace } \end{aligned}$$
(5)
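
The pieces of the merging criterion, Eqs. (3)–(5), can be sketched as follows; the guard against degenerate denominators is our own defensive addition, and communities are represented as vertex sets as in the earlier sketches.

```python
# A sketch of the overlap machinery of Eqs. (3)-(5); cut_size() counts
# edges with one endpoint in each community (the initial communities from
# Phase-1 are treated as essentially disjoint here).
from math import comb

def cut_size(G, C1: set, C2: set) -> int:
    """|cut(C1, C2)|."""
    return sum(1 for u, v in G.edges()
               if (u in C1 and v in C2) or (u in C2 and v in C1))

def alpha(G, C1: set, C2: set) -> float:
    """Eq. (4): combined internal-density factor of C1 and C2."""
    m1 = G.subgraph(C1).number_of_edges()                # ||C1||
    m2 = G.subgraph(C2).number_of_edges()                # ||C2||
    denom = max(comb(len(C1), 2), comb(len(C2), 2), 1)   # guard for singletons
    return 1 - min(m1, m2) / denom

def theta(G, C1: set, C2: set) -> float:
    """Eq. (5): overlap function used by the merging criterion."""
    denom = min(len(neighbourhood(G, C1)), len(neighbourhood(G, C2)))
    return alpha(G, C1, C2) * cut_size(G, C1, C2) / denom if denom else 0.0
```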

We now present the criterion for merging two communities whose overlap exceeds a certain threshold, say \(\theta _0\). The method for merging communities is described in Procedure MERGE-COMS(), which takes as input a graph G, the initial communities of G stored in a container \(\mathcal {K}\), the corresponding membership container m, and the parameters \(\theta _0\) and \(n_{\min }\) (see Table 2 for details). The parameter \(n_{\min }\) is included to let the user have communities smaller than a specific size merged into another suitable, stable community.

The method works as follows. The labels of all the initial communities in \(\mathcal {K}\) are stored in a variable L. Then, in a repeat-until loop (lines 2–21), a label \(\ell \) is chosen from L at random. Let C be the community corresponding to \(\ell \). The labels of the communities that are at least as large as C and that share at least one edge with C are stored in \(L_N\) (see line 5).

Next, a variable \(L_H\) is required to store the labels of the communities into which C may be merged. There are two cases, depending on whether \(\left| C\right| < n_{\min }\) or \(\left| C\right| \ge n_{\min }\). In the first case, \(L_H\) stores the labels of those \(C' \in \mathcal {K}\) for which \(\theta (C, C')\) is maximum (see line 7). In the second case, \(L_H\) stores the labels of all those \(C' \in \mathcal {K}\) that satisfy \(\theta (C, C') \ge \theta _0\) (see line 9).

Lines 11–16 perform the task of merging C into each of the communities with labels in \(L_H\) and updating the memberships of all the vertices of C accordingly. Then, at line 17, \(\ell \) is removed from L. Lines 18–20 remove C from \(\mathcal {K}\) whenever \(L_H\) is nonempty. This means that when \(L_H = \varnothing \), C is not removed from \(\mathcal {K}\), even if its size is smaller than \(n_{\min }\).

Procedure MERGE-COMS (pseudocode listing)
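
A condensed Python sketch of MERGE-COMS() under the same assumptions follows; \(\mathcal {K}\) is modelled as a dict from labels to vertex sets and m as a dict from vertices to sets of labels. The line numbers cited in the text refer to the authors' listing, not to this sketch.

```python
import random

def merge_coms(G, K: dict, m: dict, theta_0: float = 0.4, n_min: int = 5):
    L = set(K)
    while L:                                     # repeat-until loop over labels
        l = random.choice(list(L))               # pick a label at random
        C = K[l]
        # Labels of communities at least as large as C sharing an edge with C
        L_N = [x for x in K if x != l and len(K[x]) >= len(C)
               and cut_size(G, C, K[x]) > 0]
        if len(C) < n_min:                       # case 1: maximise theta
            best = max((theta(G, C, K[x]) for x in L_N), default=0.0)
            L_H = [x for x in L_N if best > 0 and theta(G, C, K[x]) == best]
        else:                                    # case 2: threshold on theta
            L_H = [x for x in L_N if theta(G, C, K[x]) >= theta_0]
        for x in L_H:                            # merge C into each host
            K[x] |= C
            for v in C:
                m.setdefault(v, set()).add(x)
        L.discard(l)
        if L_H:                                  # C absorbed: drop it
            for v in C:
                m[v].discard(l)
            del K[l]
    return K, m
```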

3.4 Main algorithm and parameter selection

Having examined the different parts of IC-LCD, we are now ready to describe the full algorithm, which is given in Algorithm 1. The algorithm maintains primarily three containers—\(\mathcal {K}\), U, and m—as described in Table 2.

Table 2 Containers and parameters of IC-LCD

Initially, \(\mathcal {K}\) is empty, and U contains all the vertices of G, that is, \(U = V_G\). In the beginning \(m_u = \varnothing \) for each u, so m is also empty. The algorithm then works as follows. The counter i is initialised to 1 (line 5). Since U is nonempty at the start, a seed, say u, is selected from U at random (line 7), and C is initialised with u (line 8). Lines 9 through 18 update C as follows: \(V_{\text {new}}\) is computed for C through the procedure GET-NEW-NODES(); then, for each \(v \in V_{\text {new}}\), v is added to C, v is removed from \(N_C\), and any neighbour of v not in C is added to \(N_C\). When C can no longer be expanded, it is added to \(\mathcal {K}\) at the \(i^{\mathrm{th}}\) position (line 19). Lines 20–23 then update the memberships of each of the vertices in C, and C is removed from U. Line 24 increments i by 1. As long as U remains nonempty, lines 7 through 24 are repeated. At the end of line 25, \(\mathcal {K}\) contains the list of all the initial communities generated in the previous steps, and the procedure MERGE-COMS() is called to refine them. Finally, the algorithm prints \(\mathcal {K}\), containing the communities of G, and the set m of the memberships of the vertices of G.

We have run our algorithm several times on different networks over a range of parameter values. We found 0.5 to be the best choice for \(\xi _0\), so it is fixed internally at \(\xi _0 = 0.5\), meaning that it is not available for user manipulation (see line 10 of Algorithm 1). For \(n_{\min }\) we tried the values 2, 3, 4 and 5, and found 5 to be the most suitable choice. The next parameter is \(\theta _0\), which determines the maximum overlap allowed between any two communities; it gives the best results at 0.4. Both \(\theta _0\) and \(n_{\min }\) can be set by the user as desired; their default values are \(\theta _0 = 0.4\) and \(n_{\min } = 5\). With these settings we have compared IC-LCD with the other algorithms on all the networks considered in this article (see Table 3).

Algorithm 1 IC-LCD (pseudocode listing)
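
Putting the pieces together, a hedged end-to-end sketch of Algorithm 1 reads as follows; for simplicity it recomputes \(N_C\) wholesale after each expansion round, which is equivalent to the incremental updates of lines 9–18.

```python
def ic_lcd(G, theta_0: float = 0.4, n_min: int = 5, xi_0: float = 0.5):
    K, m = {}, {v: set() for v in G}
    U = set(G)                                   # pool of unused seeds
    i = 1
    while U:
        u = random.choice(list(U))               # random seed (line 7)
        C, N_C = {u}, set(G[u])
        while True:                              # expansion loop (lines 9-18)
            V_new = get_new_nodes(G, C, N_C, xi_0, n_min)
            if not V_new:
                break
            C |= V_new
            N_C = neighbourhood(G, C)
        K[i] = C                                 # store initial community
        for v in C:
            m[v].add(i)
        U -= C                                   # members no longer seed
        i += 1
    return merge_coms(G, K, m, theta_0, n_min)   # Phase-2 refinement
```

For instance, ic_lcd(nx.karate_club_graph()) would return a cover of Zachary's karate club under these default parameters.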

4 Evaluation on synthetic networks

The synthetic networks most often used for testing community detection algorithms are the Lancichinetti–Fortunato–Radicchi (LFR) networks [24], because they possess a number of parameters that allow users to control many network characteristics. The most prominent is the mixing parameter (\(\mu \)), the fraction of the neighbours of a node lying outside the node's community. Other important parameters are: N—the number of nodes, k—the average degree, \(k_{\max }\)—the maximum degree, \(c_{\min }\)—the minimum community size, \(c_{\max }\)—the maximum community size, \(O_n\)—the number of overlapping nodes, and \(O_m\)—the number of memberships per overlapping node. Before presenting results on the LFR benchmarks, we first describe the accuracy measures.
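
As a note on tooling: the overlapping LFR networks used below (with \(O_n\) and \(O_m\)) require the original generator of Lancichinetti et al., but for a quick non-overlapping stand-in networkx ships an LFR generator. The degree and community-size exponents below (tau1, tau2) are illustrative values not specified in this section, and the generator may need a few retries with different seeds to converge.

```python
import networkx as nx

# Non-overlapping LFR stand-in with parameters mirroring Sect. 4 where possible
G = nx.LFR_benchmark_graph(
    n=2000, tau1=2.0, tau2=1.5, mu=0.2,          # tau1, tau2 are assumed values
    average_degree=50, max_degree=100,
    min_community=10, max_community=150, seed=42)
planted = {frozenset(G.nodes[v]["community"]) for v in G}   # ground-truth cover
```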

4.1 Accuracy measures

Our criterion for measuring the accuracy of an algorithm is threefold. The first measure is the widely used normalized mutual information (NMI) given by Lancichinetti et al. [25]; the NMI score, varying between 0 and 1, indicates how similar two community covers are. Second, we study how well community detection algorithms reproduce the number of communities. To measure this we consider the ratio \(\overline{C}/C\), where C is the average number of communities present in the LFR networks over 100 different realizations and \(\overline{C}\) is the average number of communities detected by an algorithm in the same 100 networks. A value of \(\overline{C}/C\) closer to 1 indicates higher accuracy. As a third measure, we study how well the algorithms reproduce the overlapping nodes. To this end we compute the ratio \(\overline{OV}/OV\), where \(OV = 100\) is the number of overlapping nodes planted in each of the 100 LFR realizations and \(\overline{OV}\) is the average number of overlapping nodes detected by an algorithm in the same 100 networks.

We compute each of the three measures—NMI, \(\overline{C}/C\) and \(\overline{OV}/OV\), as independent functions of \(O_m\) and \(\mu \). For \(O_m\) we pick the range \(\{2,3,4, \ldots , 15\}\) and fix other LFR parameters as \(N = 2000, O_n = 100, \mu = 0.2, k = 50, k_{\max } = 100, c_{\min } = 10, c_{\max } = 150\). On the other hand for \(\mu \), we choose the range \(\{0.05, 0.10, \ldots , 0.50\}\) and take \(O_m = 8\) and other LFR parameters as above.

Table 3 Algorithm parameters

Fig. 1 Accuracy measures versus LFR parameters for the algorithms DEMON, IC-LCD, SLPA, OSLOM and COPRA

4.2 Accuracy results and comparison with competitors

To compare the results of IC-LCD, we have selected DEMON, SLPA, OSLOM and COPRA as the baseline algorithms, with parameters set as in Table 3. Note that the parameter \(\epsilon \) of DEMON is not fixed at a particular value; instead, we have taken several discrete values of \(\epsilon \) in the range [0.1, 0.4] and recorded the output for the best choice. The algorithms SLPA and DEMON are implemented in CDLib [44], a Community Discovery Library available for the Python language; we have used this library for running DEMON and SLPA, and also for computing the NMI measure. The results for all three accuracy measures are shown in Fig. 1.
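
In outline, the CDLib-based part of this pipeline looks as follows; the file name, the toy ground-truth cover, and the SLPA parameters are illustrative assumptions, while the function names follow the CDLib documentation.

```python
import networkx as nx
from cdlib import NodeClustering, algorithms, evaluation

G = nx.read_edgelist("lfr_network.dat", nodetype=int)    # assumed file name
# The planted cover would be parsed from the LFR community file; a toy
# two-community cover stands in here.
true_communities = [[1, 2, 3], [3, 4, 5]]
truth = NodeClustering(true_communities, G, "ground-truth")

demon_cover = algorithms.demon(G, epsilon=0.25)          # one of the tried epsilon values
slpa_cover = algorithms.slpa(G, t=100, r=0.01)           # t, r illustrative
nmi = evaluation.overlapping_normalized_mutual_information_LFK(demon_cover, truth)
print(nmi.score)
```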

First we plot the NMI scores between the produced and actual covers for the five algorithms as functions of \(O_m\) (see Fig. 1a) and \(\mu \) (see Fig. 1b). It can be observed that, as \(O_m\) varies, IC-LCD delivers more stable NMI scores than the rest of the algorithms. When NMI is seen as a function of \(\mu \), IC-LCD delivers the highest scores among all the algorithms until \(\mu \) reaches 0.3, after which its scores start degrading, whereas OSLOM and COPRA maintain high scores up to \(\mu = 0.5\). The worst NMI performance is shown by DEMON, indicating that it produces highly unstable covers.

Next we look at \(\overline{C}/C\) as a function of \(O_m\) and \(\mu \) (see Fig. 1c, d). The algorithm COPRA delivers the correct number of communities up to \(O_m = 13\), whereas OSLOM does so only up to \(O_m = 6\). IC-LCD, on the other hand, underestimates the number of communities until \(O_m\) crosses 4 and thereafter delivers the correct number.

The algorithm SLPA underestimates the number of communities over the whole range of \(O_m\). As \(\mu \) varies up to 0.35, the \(\overline{C}/C\) scores indicate that the number of communities is estimated correctly by both COPRA and IC-LCD, whereas it is overestimated by OSLOM and underestimated by SLPA. In this case also, DEMON exhibits the worst performance.

Finally, we consider how \(O_m\) and \(\mu \) affect \(\overline{OV}/OV\) (see Fig. 1e, f). As \(O_m\) increases, IC-LCD delivers \(\overline{OV}/OV\) scores that increase monotonically and stay above 0.8 once \(O_m\) crosses 7. The opposite behaviour is shown by SLPA, which begins with high \(\overline{OV}/OV\) scores that degrade monotonically as \(O_m\) increases. For OSLOM, the \(\overline{OV}/OV\) scores are close to 1 until \(O_m = 5\), after which they start oscillating with wide variations. Both COPRA and DEMON fail to detect the overlapping nodes as soon as \(O_m\) crosses 4. When \(\overline{OV}/OV\) is measured as a function of \(\mu \), SLPA produces slightly better estimates of the overlapping nodes than IC-LCD until \(\mu \) reaches 0.2, after which IC-LCD takes the lead; the other three algorithms deliver unrealistic estimates.

Overall, we find that IC-LCD delivers moderate performance when the overlapping nodes have few memberships (communities per node). As their memberships increase, IC-LCD's performance improves while that of the others deteriorates; the transition point usually occurs at \(O_m = 6\). This indicates that IC-LCD favours community structures with pervasive overlaps, i.e., those in which nodes possess memberships of several communities.

5 Evaluation on real-world networks

Having seen how IC-LCD performs on synthetic networks, we proceed to test it on real-world networks. The real-world networks chosen for experimenting with IC-LCD are listed in Table 4. The list covers a range of networks, with the number of nodes varying from very small to about 335 thousand and the number of edges up to about 1.05 million. The networks DBLP and AMAZON are quite large and are well suited for testing the scalability of algorithms. Since prior information about the community structure is usually absent in real networks, we must employ other measures to assess how well an algorithm performs on them.

Table 4 Details of networks studied in this paper

5.1 Quality measures

Many quality measures exist for this purpose [27, 30, 39], motivated in one way or another by the modularity of Newman [37]. We select two such measures. The first is the overlapping modularity measure (\(Q_{\text {ov}}\)) given by Nicosia et al. [39], which has been used quite extensively in the literature on overlapping community detection. The \(Q_{\text {ov}}\) score depends on a function \(f(x) = 2px - p\), where p is any real number; the code for computing \(Q_{\text {ov}}\) implements f with \(p = 30\). To compute the \(Q_{\text {ov}}\) scores we have used CDLib (Community Discovery Library). The second measure is the weighted overlapping modularity measure \(Q_{\text {wo}}\) [22], which is equally applicable to unweighted networks. Unlike traditional versions of modularity, \(Q_{\text {wo}}\) has a very simple formulation and can be computed very fast.

We take the parameters of the algorithms as given in Table 3, except the parameter r of SLPA, which we now set to \(r = 0.45\); this value of r was chosen by the authors of SLPA for experiments on real networks. We run each of the five algorithms 10 times and record the average values \(\overline{Q}_{\text {ov}}, \overline{Q}_{\text {wo}}, \overline{C}\) and \(\overline{OV}_{\text {frac}}\) of the corresponding quantities over the 10 runs.

5.2 Results and comparison with competitors

We applied IC-LCD, along with the four baseline algorithms, to all 12 networks listed in Table 4. The results for the selected measures for all the chosen algorithms can be seen in Table 5. First, let us analyse how the five algorithms perform on the modularity measures. We find that the score \(\overline{Q}_{\text {ov}} \ge 0.60\) is achieved on 7 networks by IC-LCD, on 5 networks by both OSLOM and COPRA, on 3 networks by SLPA, and on none by DEMON. Likewise, the score \(\overline{Q}_{\text {wo}} \ge 0.60\) is achieved on 11 networks by IC-LCD, on 4 networks by both OSLOM and DEMON, on 6 networks by COPRA, and on 7 networks by SLPA.

Table 5 Quality measure comparison of the algorithms IC-LCD, OSLOM, COPRA, SLPA and DEMON on real-world networks

Fig. 2 Community structure in the LES-MISERABLES network

This clearly indicates that, in terms of quality, IC-LCD performs better than the baseline algorithms. On the other two measures, i.e., \(\overline{C}\) and \(\overline{OV}_{\text {frac}}\), IC-LCD gives results broadly similar to those of the baseline algorithms.

Community Structure in the LES-MISERABLES Network After the quantitative tests on all the real-world networks listed in Table 4, we specifically select the network LES-MISERABLES to demonstrate the overlapping community structure found by IC-LCD and to compare the results with those of the baseline algorithms. It is the weighted network of co-appearances of characters in Victor Hugo's novel “Les Misérables” [21], with 77 nodes and 254 edges. Nodes represent characters, as indicated by the labels, and edges connect any pair of characters that appear in the same chapter of the book; the values on the edges are the numbers of such co-appearances. Since IC-LCD does not work for weighted networks, we have ignored the weights.
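
The setup of this demonstration can be mirrored with the sketch of Algorithm 1 given earlier; networkx ships Knuth's Les Misérables co-appearance graph, and the sketch simply ignores the edge weights, as prescribed above.

```python
import networkx as nx

G = nx.les_miserables_graph()                    # 77 nodes, 254 weighted edges
K, m = ic_lcd(G)                                 # hedged IC-LCD sketch; weights ignored
overlapping = [v for v, labels in m.items() if len(labels) > 1]
print(len(K), "communities;", len(overlapping), "overlapping characters")
```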

We ran IC-LCD on this network 10 times. The best cover produced by IC-LCD has 8 communities (see Fig. 2). We can see that the character ‘Valjean’ has a central role in the novel: he co-appears with almost all the groups, although prominently with 4 of them. Other characters that co-appear with two or more groups are ‘Marius’, ‘Cosette’, ‘Gavroche’, ‘Javert’, ‘Marguerite’, and ‘Simplice’. The modularity scores corresponding to this cover are \(\overline{Q}_{\text {ov}} = 0.66\) and \(\overline{Q}_{\text {wo}} = 0.52\).

When we ran OSLOM 10 times on this network, we found that its best cover contains 6 communities with only two overlapping nodes, namely ‘Valjean’ and ‘Marius’, each possessing the memberships of just two communities. This cover, however, obtains quality scores \(\overline{Q}_{\text {ov}} = 0.69\) and \(\overline{Q}_{\text {wo}} = 0.55\). Next we ran COPRA 10 times on this network; its best cover contains only 3 communities (see Fig. 2c), and no node has more than 2 memberships. The modularity scores for this cover are \(\overline{Q}_{\text {ov}} = 0.53\) and \(\overline{Q}_{\text {wo}} = 0.59\). Finally, we look at the cover produced by SLPA (Fig. 2d): it detects 6 communities, but with no overlapping node. The quality scores for this cover are \(\overline{Q}_{\text {ov}} = 0.73\) and \(\overline{Q}_{\text {wo}} = 0.58\). It may be noted that DEMON never produced more than one community on this network in 10 consecutive runs, for various values of its parameter \(\epsilon \) in the range [0.1, 0.4].

On this network, our algorithm produced the largest number of communities, with node memberships reaching as high as 4. The other three algorithms produced at most 6 communities, with node memberships of no more than 2.

5.3 Node membership distribution

To understand how node memberships are spread across multiple communities, we select three real-world networks, namely INTERNET, DBLP and AMAZON. The network INTERNET represents a symmetrised snapshot of the structure of the Internet at the level of autonomous systems, reconstructed from BGP tables posted at archive.routeviews.org; this snapshot was created by Mark Newman from data for July 22, 2006. From Fig. 3a, which depicts the vertex membership distribution, we find that DEMON detects the highest number of memberships for the overlapping nodes, while the rest of the algorithms, including IC-LCD, are unable to detect more than 5 memberships for any node.

The network DBLP represents a co-authorship network taken from the computer science bibliography database DBLP, where two authors are connected if they have published at least one paper together. This is a large network, with 317,080 nodes and 1,049,866 edges. The distribution of the node memberships in the community structures produced by IC-LCD and the baseline algorithms is shown in Fig. 3b. In this network, too, DEMON takes the lead and IC-LCD comes second.

The AMAZON network was collected by crawling the Amazon website; it is based on the “Customers Who Bought This Item Also Bought” feature of the site. If a product i is frequently co-purchased with product j, the graph contains an undirected edge joining i and j. On this network also we find that DEMON detects the highest number of memberships for the overlapping nodes, whereas IC-LCD remains in second position.

Apparently, DEMON is able to detect the highest number of memberships for the overlapping nodes. However, its performance on LFR networks does not make it a reliable candidate, and its output depends significantly on the choice of its parameter \(\epsilon \). On the other hand, despite remaining in second position, IC-LCD is more reliable in terms of the stability and quality of the covers it produces. Thus, in a sense, IC-LCD is capable of revealing the highly overlapping structure of networks.

Fig. 3 Relationships between node degree and shared communities in the INTERNET, DBLP and AMAZON networks

6 Scalability and time complexity

Finally, we study two important aspects of any algorithm: its scalability and time complexity. Scalability tells us how the run time of an algorithm varies with the number of nodes in the network. To compare the actual run times of IC-LCD with those of the baseline algorithms, we select random networks with the number of nodes ranging from 5000 to 100,000. These networks are generated using the IGRAPH package available for the R language; we simply used the command sample_gnp(n, p) with \(p = 5/n\), where n lies in \(\lbrace 5000, 10{,}000, \ldots , 100{,}000 \rbrace \). The parameters of the algorithms are again set as in Table 3. The run times of DEMON, IC-LCD, SLPA, OSLOM and COPRA on these random networks are plotted in Fig. 4a. Here COPRA is the fastest algorithm, whereas IC-LCD is the third fastest; however, as the network size reaches the 100,000 mark, the three algorithms COPRA, DEMON and IC-LCD appear to be on the same scale.
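
The timing experiment can be mirrored in Python as follows; nx.fast_gnp_random_graph is a sparse-friendly counterpart of igraph's sample_gnp, and the loop times our hedged ic_lcd sketch rather than the authors' implementation.

```python
import time
import networkx as nx

for n in range(5_000, 100_001, 5_000):           # n = 5000, 10000, ..., 100000
    G = nx.fast_gnp_random_graph(n, 5 / n, seed=1)
    t0 = time.perf_counter()
    ic_lcd(G)                                    # sketch from Sect. 3.4
    print(f"n = {n:>6}: {time.perf_counter() - t0:.2f} s")
```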

To observe the differences in running time between IC-LCD and the baseline algorithms on real data, we again considered the three real networks INTERNET, DBLP and AMAZON. While on INTERNET the running times are hardly distinguishable, the differences are clear on DBLP and AMAZON. On DBLP, IC-LCD takes the least time, under 40 min, whereas DEMON takes more than 120 min. On AMAZON, the least time is taken by COPRA, with IC-LCD second, again at around 40 min; DEMON again takes the longest time on this network.

Fig. 4 a Running time (in seconds) of IC-LCD and the baseline algorithms on a class of random networks generated by the command sample_gnp(n, p) of the igraph package, with n lying in \(\{5000, 10{,}000,\ldots ,100{,}000\}\) and \(p = 5/n\). b Running time (in minutes) of IC-LCD and the baseline algorithms on the real networks INTERNET, DBLP and AMAZON

We now estimate the theoretical complexity of IC-LCD, beginning with the Expansion Phase. For an arbitrary subgraph C of the input graph, let \(c = \left| C\right| \) and \(n_c = \left| N_C\right| \). First observe that the node or edge interaction coefficient of a node or an edge with C can be computed in \(\mathcal {O}(c)\) time. The procedure GET-NEW-NODES takes a subgraph C and its neighbourhood \(N_C\) and returns new nodes for the expansion of C; this can be done in \(\mathcal {O}(cn_c)\) time. Now consider the while loop (Lines 6–25) of Algorithm 1, which runs k times, where k is the number of (initial) communities; note that the number of final communities never exceeds k. Given a subgraph C, the repeat-until loop (Lines 9–18) runs at most \(\left| V_{\text {new}}\right| \le n_c\) times, so its complexity is \(\mathcal {O}(cn_c^2)\). The assignments at Lines 19–24 take \(\mathcal {O}(c)\) time. Thus the complexity of the while loop is \(\mathcal {O}(kcn_c^2+kc) = \mathcal {O}(kcn_c^2)\). Note that c and \(n_c\) are not constants but keep changing as C expands. Normally, C and \(N_C\) do not grow simultaneously throughout the expansion phase. Indeed, for every C we have \(1 = c \le n_c\) at the beginning, and at intermediate steps it is possible that \(c \le n_c\); towards the end of the expansion phase, however, two situations might emerge. In the first case, C does not grow after acquiring a few nodes, and so it is reasonable to assume that \(cn_c \le c_{\max }\), where \(c_{\max }\) is the maximum size of the final communities. In the second case, C grows much faster than \(N_C\), so that c becomes much larger than \(n_c\), i.e., \(n_c \ll c\); in this case we get \(n_c^2 \le c_{\max }\). As a result, \(cn_c^2 \le c_{\max }^2\), and hence the complexity of the Expansion Phase becomes \(\mathcal {O}(kc_{\max }^2)\).

Now we compute the complexity of the Refinement Phase, that is, of the MERGE-COMS() procedure. Observe that its repeat-until loop runs k times in the worst case. Each of Lines 5, 7 and 9 requires \(cc_i\) steps, where \(c_i = \left| C_i\right| \); clearly, \(cc_i \le c_{\max }^2\). Also observe that \(\left| L_H\right| \le \left| L_N\right| \le \left| N_C\right| = n_c\), so the nested for loop (Lines 11–16) runs \(cn_c\) times. Finally, the operation at Line 19 takes c steps. Thus the complexity of the Refinement Phase is \(\mathcal {O}(kc_{\max }^2 + kcn_c + c) = \mathcal {O}(kc_{\max }^2)\). Consequently, the complexity of IC-LCD is also \(\mathcal {O}(kc_{\max }^2)\). Further, if the input network has n nodes, then it can be seen that \(\mathcal {O}(kc_{\max }) = \mathcal {O}(n)\), and hence the complexity of IC-LCD reduces to \(\mathcal {O}(nc_{\max })\).

7 Discussion

In the previous sections we have examined the performance of IC-LCD on synthetic as well as real-world networks. On synthetic networks, IC-LCD gives more accurate results than the other algorithms when the node memberships are high. On real-world networks, the performance of IC-LCD is measured by the quality indicators \(Q_{\text {ov}}\) and \(Q_{\text {wo}}\), and we find that IC-LCD is competitive with the baseline algorithms on almost all of them. Assessed from the point of view of quality and stability, IC-LCD is a suitable choice for overlapping community detection. The running time plots on synthetic networks show that IC-LCD is slightly slower than COPRA and DEMON but much faster than both OSLOM and SLPA. Surprisingly, DEMON is the slowest algorithm on the real networks DBLP and AMAZON (see Fig. 4). Our theoretical estimate of the complexity of IC-LCD is thus in line with practice, and hence the algorithm scales to large networks.

Another interesting fact about IC-LCD is that its MERGE-COMS() procedure is stand-alone, and therefore it can be used as a refinement step for many seed expansion-based algorithms that are fast but produce unstable covers. However, this requires a systematic study.

8 Conclusion

We have proposed the algorithm IC-LCD, which employs the notion of node and edge interaction coefficients to reveal community structure with pervasive overlaps in complex networks. IC-LCD has time complexity \(\mathcal {O}(nc_{\max })\), where n is the number of nodes in the input network and \(c_{\max }\) is the maximum size of the communities detected. We have analysed the performance of IC-LCD on a number of quality indicators and compared the results with a few well-known algorithms for overlapping community detection. The results on synthetic networks suggest that IC-LCD performs moderately when the node memberships are low and begins to outperform the other algorithms as the node memberships increase. Tests on real networks suggest that IC-LCD gives quality competitive with the other algorithms. From the point of view of detecting community structure with pervasive overlaps, IC-LCD gives desirable outcomes; additionally, its high speed makes it appropriate for application to large networks.