1 Introduction

Component detection is a fundamental problem [18, 36] in the analysis of large-scale networks. Vertices in one component are more highly connected with each other than the rest of the network. Thus, finding highly connected components can benefit many real life applications. For example, groups of intimate entities discovered in social networks can be exploited to analyze their social behaviors [18, 32]; a set of Web pages with similar contents in Web data network can be indexed together to facilitate Web search [10]; clusters of interactive genetic markers discovered in genetic interaction networks can be utilized to infer the corresponding cause of diseases [46].

The existing methods of component detection can be roughly divided into two main categories, i.e. clique-based methods and clique-relaxed methods. According to differently relaxed constraints, clique-relaxed methods can be further divided into degree-relaxed [4, 9, 49], distance-relaxed [25, 33] and triangulation-relaxed [11, 22, 23, 42, 43] methods. Although succeeding to some extent, these methods still have respective drawbacks.

A toy co-friendship network is considered in Figure 1a. Degree-relaxed and distance-relaxed methods often consider the network as an indivisible whole, saying a k-core with k = 3 and a n-club with n = 3, since the degree of every vertex is no less than 3 and the maximum length of shortest pathes of all pairs of vertices is no larger than 3, respectively. This contradicts with the intuition that Figure 1a should be disconnected into two components by deleting edge e. The reason is that these two methods only concern about the degree and distance but ignore the connectivity of any pair of vertices. Although triangulation-relaxed methods can detect these two components, they do not work when there are no triangles in graph, e.g. bipartite graphs.

Figure 1
figure 1

A toy co-friendship network

Connectivity is measured by the number of disjoint paths between vertices. Intuitively, high connectivity would contribute to the robustness and steadiness of the component. For example, in the collaborative network, the vertices represent experts and the edges represent the relationships among them. The connectivity of a group of experts often measures their ability to finish a task together [26, 39]. Thus, the high connectivity of a component in this network guarantees its robustness because even if a few experts quit or relationships disappear, the task could still be completed. Similarly, in the road network, the vertices represent places and the edges represent the roads between places. The high connectivity among a set of places increases the intensity of road network. Even if some of the roads are congested, it can still be reached from other routes [29, 30, 50,51,52,53]. Also, in the genetic interaction network, the connectivity of a set of genetic markers ensures the successful expression of special gene functions. Therefore, the component of genetic markers with higher connectivity possess more stable functional characteristic [46].

A component with high connectivity could still be connected, even if losing a few relationships or entities. Therefore, in recent years, connectivity-based component detection such as k-edge connected components (k-ECC) [2, 6, 7, 21, 41, 54] has drawn a lot of attention. A k-ECC refers to a maximal subgraph, the remaining subgraph of which is still connected after any k − 1 edges are removed from it. A toy co-friendship network is considered in Figure 1a as an example. The network in Figure 1a is a 1-ECC. Since the low edge connectivity, it is naturally divided into two separate 3-ECCs, {Bob, David, Tony, Erik, Alice} and {Jack, Anna, Bell, Evan, Albert}.

However, k-ECC still has its own limitation. For example, if Bob is an alias of Jack, Figure 1a is equivalent to Figure 1b. In this case, Figure 1b is identified as a whole 3-ECC, although there are practically two separate components, since once we remove the vertex Bob/Jack, the network becomes disconnected. Thus, high edge connectivity does not necessarily indicate a component of strong connectivity.

In this paper, we study the k-vertex connected component (k-VCC) detection problem, which focuses on vertex connectivity instead of edge connectivity, of networks. Given a graph G, the goal is to find all such induced subgraphs, g, of G that g is still connected after removing any k − 1 vertices from it and no supergraph of g has the same property. According to this informal definition, the network in Figure 1b can be identified as two 3-VCCs, {Bob/Jack, David, Tony, Erik, Alice} and {Bob/Jack, Anna, Bell, Evan, Albert}. Unlike k-ECC, k-VCC has three unique advantages: (1) k-VCC captures more connectivity of networks than k-ECC. According to [15], a component of high vertex connectivity must be of high edge connectivity, but not vice versa; (2) k-VCC allows overlap among different components, say Bob/Jack appears in both the components shown in Figures 1b, which is more natural and reasonable for real-world networks [35]; (3) k-VCC can prevent the detected communities from including irrelevant subgraphs, i.e. free rider effect [45].

Unfortunately, the methods for k-ECC [2, 6] cannot be directly utilized to find k-VCCs. Because each vertex only belongs to at most one k-ECC [6], these methods could obtain k-ECCs by vertex contraction [40]. In our case, however, each vertex can be in more than one k-VCCs, which makes the former trick not work. To this end, we propose three novel frameworks to tackle the k-VCC detection problem, namely top-down, bottom-up and hybrid frameworks for k-VCC detection, respectively. The top-down iteratively divides the networks by finding minimum vertex cut set, which could find all exact k-VCCs. The bottom-up framework, instead of using the entire network structure, locally identifies the seed subgraphs, and obtains the heuristic k-VCCs by expanding and merging these seed subgraphs. Inspired by the above two frameworks, the hybrid framework further considers the group relationships of seed subgraphs, and thus can utilize the elaborate mixed graph structure to discover the exact k-VCCs by contracting the mixed graph in a top-down way. To the best of our knowledge, our conference paper [28] is the first work which discusses k-VCC detection problem. And this work is an extended version of it. All in all, our contributions are summarized as below:

  • A novel k-VCC detection problem is proposed from the perspective of vertex connectivity.

  • The top-down, bottom-up and hybrid frameworks are respectively developed to solve the k-VCC detection problem. The top-down and hybrid are exact approaches, while the bottom-up framework is heuristic.

  • Especially, in the bottom-up framework, a concept of local k-vertex connected subgraph is proposed to enable locally identifying seed subgraphs, which accelerates k-VCC detection. Further, in the hybrid framework, an upper bound of vertex connectivity is designed to speed up the contraction of the mixed graph. In addition, several optimization techniques are proposed to further reduce the search space;

  • Extensive experiments on both real and synthetic datasets demonstrate the efficiency and effectiveness of our frameworks and theories.

The rest of our paper is organized as follows. In Section 2 we discuss the related work. We give the notions and problem statement in Section 3. In Sections 45 and 6, we present the top-down, bottom-up and hybrid k-VCC detection frameworks, respectively. Extensive experiments are reported in Section 7. Section 8 concludes this work.

2 Related works

The related work is categorized into three main research lines, including graph connectivity, component detection, and overlapping component detection (OCD) respectively.

Graph connectivity

Graph connectivity has an extricably bound with minimum cut, since the minimum cut contributes to the graph connectivity. A large number of algorithms have been designed for computing the global minimum connectivity of the whole graphs [16, 17, 20, 40]. Recently, there exists several research on finding k-edge connected subgraphs, which concern about the local edge connectivity in the subgraphs [2, 6, 54]. Whereas, in this paper, we focus on the k-vertex connectivity of subgraphs.

Component detection

The existing component detection methods can be roughly divided into clique-based [36] and clique-relaxed-based methods. Since the definition of clique is too strict, the clique-relaxed based methods have recently drawn a great deal of attentions. It can be classified into the following three categories: (1) Distance-based relaxed methods. n-clique is defined a maximal subgraph such that the distance of each pair of its vertices is not larger than n in the whole network [25]. n-clan is an n-clique whose diameter is no larger than n [33]. n-club is a maximal subgraph whose diameter is no larger than n [33]. (2) Degree-based relaxed methods. k-plex is defined as a maximal subgraph in which each vertex is adjacent to all other vertices except at most k of them. Recently, efficient approaches are proposed to enumerate k-plexes in large networks [4, 13]. Similarly, k-core is a maximal subgraph in which each vertex is adjacent to at least k other vertices of the subgraph. Efficient global search methods [9, 39] and local search method [14] have been developed to discover the k-core communities or the community k-core containing given entities. Quasi-clique with a parameter γ is a subgraph with n vertices and \(\gamma \ast n\choose 2\) edges [49]. (3) Triangulation-based relaxed methods. The definition of DN-graph is proposed in [43], which is a connected subgraph in which the lower bound of shared neighborhood between any connected vertices is locally maximized. k-truss is first proposed by J Cohen [11], and defined as the largest subgraph in which every edge is contained in at least (k − 2) triangles within the subgraph. It has received extensive research attentions [22, 23, 42]. Different from aforementioned models, in this paper, we propose the k-VCC model, which possesses high vertex connectivity.

Overlapping community detection

Overlapping community structure exists in most real networks, which means vertices could belong to different communities [37]. Moreover, many methods have been proposed to solve the OCD problem. The clique percolation based method [1] defines a community as a k-clique component to detect the overlapping communities. The link-partition based method [31] first transforms the original graph G into line graph L(G), in which the edges in G correspond to vertices in L(G). Then, the node partition of L(G) can be converted to overlapping vertex communities of G. The label propagation based method [19] utilizes propagating labels between neighbouring vertices to detect communities that overlap. The k-VCC cannot only model community structure but is more aware of the connectivity of the communities. Furthermore, it naturally quantify the overlapping extent between two communities, such that two k-VCCs could at most overlap on k − 1 vertices.

The differences of this extended paper from our conference version [28] are as follows:

  • We give more analysis on complexity and property of this problem and present the paper with more clarity.

  • We propose the hybrid framework to discover exact k-VCCs by contracting groups of obtained k-vertex connected subgraphs in a top-down manner. An upper bound of vertex connectivity and other optimization techniques are designed to accelerate the efficiency of this framework. Section 6 is newly added.

  • We conduct more experiments on both real and synthetic datasets to evaluate the three proposed frameworks. Besides, we add another case study on protein interaction network to further determine the effectiveness of k-VCCs.

3 Notions and problem statement

In this paper, we focus on an unweighted undirected graph G(V, E), where V (G) is the set of vertices and E(G) is the set of edges. The number of vertices, |V | = n, and the number of edges, |E| = m. We use nbG(v) to denote the set of neighbors of a vertex v in G, i.e., nbG(v) = {u|(u, v) ∈ E}. The degree of a vertex vV (G), denoted by degG(v) = |nbG(v)| is the number of neighbors of v in G. dmax denotes the maximum vertex degree of G. If there is no ambiguity, we denote them as nb(v) and deg(v). Paths joining the same pair of vertices are called vertex disjoint paths if they have no other vertices in common. A graph G is a subgraph of G, denoted by GG if V (G) ⊆ V (G) and E(G) ⊆ E(G). Given a set of vertices SV, the induced subgraph of S, denoted by G[S], is a subgraph of G such that G[S] = (S, E(S)) where E(S) = {(u, v) ∈ E(G)|u, vS}.

Definition 1 (Vertex connectivity of two vertices)

Let u and v be two vertices in graph G. If (u, v)∉E, we define the vertex connectivity between u and v, denoted by κ(u, v), as the least number of vertices chosen from V −{u, v}, whose deletion from G will disconnect u and v (destroy every vertex disjoint path between u and v), and for (u, v) ∈ E, κ(u, v) = n − 1, because they can only be separated by deleting n − 1 vertices.

Next, we define the connectivity of a graph.

Definition 2 (Vertex connectivity of a graph)

The vertex connectivity of a graph G denoted as κ(G) is the least cardinality |S| of a vertex set SV such that G[VS] is either disconnected or trivial (graph with only one vertex). Such a vertex set S is called a minimum vertex cut set.

Obviously, κ(G) can be expressed in terms of κ(v, w) as follows: \(\kappa (G)=\min \limits \{\kappa (v,w)|\\unordered\ pair\ of\ vertices\ v,w\ in\ V\}\).

Definition 3 (k-vertex connected graph)

A graph G(V, E) is k-vertex connected if the remaining graph is still connected after the removal of any (k − 1) vertices from G, i.e., κ(G) ≥ k.

Specially, we define the graph with only one vertex is trivial and the vertex connectivity of a complete graph Kn is (n − 1). In other words, if a graph G is k-vertex connected, there are at least (k + 1) vertices in it.

Definition 4 (k-vertex connected component)

Given a graph G(V, E), a subgraph G[S] (SV) of G is a k-vertex connected component (k-VCC) if (1) G[S] is k-vertex connected, and (2) any supergraph G[S] (SSV) is not k-vertex connected.

For example, in Figure 2, graph G1, G2, G3 and G4 are all 3-vertex connected subgraphs, while only G3 and G4 are 3-VCCs, because G1 and G2 are contained in G4.

Figure 2
figure 2

An example of k-VCCs

Problem statement

Here, we give the formal problem statement.

Problem 1 (k-VCC detection problem)

Given a graph G(V, E) and an integer k, we study the problem of discovering all k-VCCs of G.

In theory, the value of k in k-VCC ranges from 1 to n − 1, however, it is unlikely to reach n − 1 in practice, because if k is large enough, the final result of k-VCCs is probably an empty set. Here, we give the upper bound of parameter k. It is highly related with k-core, denoted as G[Ck], which is a maximal induced subgraph of G, such that ∀vCk, \(deg_{G[C_{k}]}(v) \geq k\). The core number of a vertex vV, denoted as ψ(v), is the largest k such that v is in G[Ck]. In other words, ψ(v) = k means that vCk and vCk+ 1.

Lemma 1

All the k-VCCs in graph G are included in the k-core subgraph of G.

Definition 5 (Degeneracy of G)

The degeneracy \(\mathcal {D}\) of G(V, E) is the largest k for which G has a non-empty k-core, i.e., \(\mathcal {D} = \max \limits _{v\in V} \psi (v)\).

Theorem 1

The value of k ink-VCC is upper bounded by thedegeneracy\(\mathcal {D}\)ofG.

Theorem 1 could be directly induced from Lemma 1. Next, we analyze the complexity of the k-VCC detection problem shown in Theorem 2.

Theorem 2

Thek-VCC detection problem can be solved inO((nk) ⋅ (nδ − 1 + δ(δ − 1)/2) ⋅ mn2/3) polynomialtime complexity, whereδ = dmax.

Proof

As detailed in Section 4, we can compute the vertex cut Vcut of graph G in O((nδ − 1 + δ(δ − 1)/2) ⋅ mn2/3), where δ = dmax denoting the maximum vertex degree of G. If |Vcut| < k, in the extreme case, graph G could be divided into two subgraphs, i.e., g1 and g2 with size |Vcut| + 1 and n − 1, respectively. g1 and g2 overlaps on these Vcut vertices. Then, the larger subgraph g2 can be further divided into two subgraphs. In the following separations, we can see that the size of the larger subgraph at least decreases 1 in each iteration. Because the size of k-VCC is larger than k, the separating process conducts at most nk times and it will stop when the size of the larger subgraph is k. Therefore, the k-VCC detection problem can be solved in O((nk) ⋅ (nδ − 1 + δ(δ − 1)/2) ⋅ mn2/3) time complexity, which is polynomial. □

Discussion

Next, we discuss why k-VCC is a nice definition for components. We illustrate the reasons from the following three perspectives.

  1. (1)

    Robustness. Based on the definition of k-VCC, the components can only be disconnected by removing at least k vertices, which indicates k-VCC possesses high vertex connectivity and it is uneasy to destruct the structure (connectivity) of k-VCCs. Oppositely, like k-core or k-ECC, even if they inhere high vertex degree or edge connectivity, only deleting one vertex will disconnect the components. Therefore, high vertex connectivity contributes more to the robustness of components.

  2. (2)

    Cohesiveness. k-VCCs allow overlap between components. This is very natural and reasonable for component detection in large real-world networks, because a vertex can participate in a variety of components composed of different types of relationships. However, k-core and k-ECC only discover disjoint components (i.e., a vertex can only be contained in one component) and ignore the relationships that a component represents. This will bring in irrelevant subgraphs, called free rider effect [45]. On the contrary, due to allowing overlapping, k-VCCs could effectively prevent the free rider effect. Furthermore, the diameter of a k-VCC, G[Vk], denoted as diam(G[Vk]), is upper bounded by ⌊(|Vk|− 2)/κ(G[Vk])⌋ + 1 [24]. Thus, when the connectivity of G[Vk] is high, it has a low diameter, which also verifies the cohesiveness of k-VCCs.

  3. (3)

    Diversity. k-VCC could guarantee the diversity of components. Although k-VCCs allow overlap, any two k-VCCs can only overlap on less than k vertices; otherwise they can be merge into a new larger one. Thus, k-VCCs could both discover diversified overlapping components and decrease the redundancy among k-VCCs.

4 Top-down framework for k-VCCs detection

In this section, we detail the top-down framework for k-VCCs detection. The main idea behind this framework is to iteratively compute the minimum vertex cut set Vcut of the current subgraph G[C], if |Vcut|≥ k, then G[C] is a k-VCC; otherwise Vcut and their incident edges are copied to each remaining connected subgraph after deleting Vcut and their induced edges from G[C] and these newly constructed subgraphs are saved in a queue structure Q for further consideration.

Algorithm 1 summarizes this process. First, Lemma 1 enable us to exploit k-core to shrink the scale of G, which is possible to divide the original big graph G into several subgraphs of much smaller scale (line 1). In this way, the computation cost of the minimum vertex cut set Vcut of G is largely reduced. Now, the important problem is how to find the minimum vertex cut set (line 9). Unlike the min-cut [40] and Gomory–Hu tree [20] methods, which can efficiently find the minimum edge cut set in an undirected graph, the problem of computing κ(G) have to be reduced into a maximum flow problem in directed graph. Even and Tarjan [17] prove that κ(s, t) in an undirected graph G is equivalent to the maximal value of flow from s to t in the corresponding constructed directed graph G. And, κ(G) can be calculated in O(nδ − 1 + δ(δ − 1)/2) calls to maximum flow algorithm where δ = dmax [16]. The detail of constructing the directed graph can be found in [28].

figure a

Actually, we do not need to find the minimum vertex cut set every time. For the current subgraph G[C], once we discover a κ(u, v) < k where u, vG[C], G[C] can not be a k-VCC, we can safely terminate this process and use the vertex cut set corresponding to u, v to separate G[C]. Theorem 3 guarantees the top-down framework is correct.

Theorem 3

Given a graph G and a valuek, the top-down framework fork-VCC detectioncan correctly find all thek-VCCs.

Complexity analysis

Computing κ(v, w) on graph G[C] needs O(mn′2/3) time where n is the average size of C and m is the average size of E(G[C]). In the worst case, it needs to be invoked O(nδ − 1 + δ(δ − 1)/2) times. Let L represent the total number of G[C] detected in the algorithm. The overall running time of the top-down framework is O((nδ − 1 + δ(δ − 1)/2) ⋅ mn′2/3L).

5 Bottom-up framework for k-VCCs detection

In this section, we develop a bottom-up framework for k-VCCs detection. Instead of depending on the global structure of graph, it focuses on the microscopic structure when dealing with large networks and thus is able to find target components with computational cost proportional to their size. The idea is to locally find seed subgraphs around the neighborhood of vertices and obtain the k-VCCs heuristically by expanding and merging these subgraphs. The overall framework is summarized in Algorithm 2. Although this framework is heuristic, the experiment in Section 7.3 shows that the result is comparable to the real.

figure b

5.1 Identifying seed subgraphs

We propose the local k-vertex connected subgraph as seed subgraph, which only considers the neighborhood structure of a vertex. Moreover, unlike maximal clique, which is adopted as seed subgraph in [27], the local k-vertex connected subgraph is more generalized as seed than clique, whose structure is too strict [36].

In this section, Seeding() is developed to identify such subgraphs. And there exists two important problems: (1) how to find the seed subgraph for a start vertex; (2) how to efficiently identify seed subgraphs for the whole network. Correspondingly, we first give the formal definition of seed subgraph and propose the LkVCS method for its discovery, and then we devise two optimization strategies to elaborately select the start vertices in which we do not have to identify seed subgraphs for all the vertices, and thus to accelerate the process of identifying seed subgraphs.

Identifying seed subgraph for a start vertex

Here, we call the vertices used to obtain the seed subgraphs as start vertices. Assume u is a known start vertex, we study how to define and find the seed subgraph for it. We first give the definition of 2-ego neighborhood. We then define the local k-vertex connected subgraph as seed subgraph by exploiting 2-ego neighborhood to add additional constraint on the path length between vertices in the defined local k-vertex connected subgraph. This makes it possible to find the seed subgraph within the neighborhood of the start vertex.

Definition 6 (2-ego neighborhood)

Given a graph G(V, E) and a vertex u in G, the 2-ego neighborhood of u, N2(u), denotes the set of vertices in G whose distance to u is no more than 2, i.e., {v|dist(u, v) ≤ 2}. Specially, u also belongs to N2(u).

Definition 7 (Local k-vertex connected subgraph)

Given a graph G(V, E) and a start vertex uV, an induced subgraph G[S] is a local k-vertex connected subgraph if and only if (1) ∀v, wS, if (v, w)∉E(S), |nbG[S](v) ∩ nbG[S](w)|≥ k; (2) |S| > k; (3) uS.

From Definition 7, we can see the diameter of the local k-vertex connected subgraph is no more than 2 proved in Theorem 4. And the diameter of a graph is defined as the longest distance among all the distances between any pair of vertices in G, i.e., \(diam(G) = \max \limits _{u, v\in G}\{dist_{G}(u,v)\}\).

Theorem 4

The localk-vertex connected subgraphG[S] isk-vertex connected and its diameterdiam(G[S]) ≤ 2.

Proof

In Definition 7, condition (1) guarantees there are at least k independent paths and the path length is 2 for each pair of vertices that are not directly connected. Also, based on Definition 1, the connectivity between the pair of vertices that are connected in the network is |S|− 1 and the path length is 1. So, the path length between any pair of vertices in G[S] is no more than 2. In addition, condition (2) makes sure that G[S] has enough vertices to be a k-vertex connected subgraph. □

Based on [14], in real networks, community is usually existed in the neighborhood of vertices, hence it is rational to define the local k-vertex connected subgraph with path length constraint as seed subgraph. Furthermore, a k-VCC is usually composed of many adjacent local k-vertex connected subgraphs. Thus, if we can obtain these local subgraphs in advance, we probably retrieve the original k-VCC from them.

Further, we propose the LkVCS algorithm shown in Algorithm 3, which finds one of the local k-vertex connected subgraphs from the induced subgraph G[N2(u)] as the seed subgraph for a start vertex u. The LkVCS adopts the idea of locally enumerating and starts with every different subset of vertices of size k from the neighborhood of u, denoted as R and then continue bring vertices from PR into R until G[R] is a local k-vertex connected subgraph or there exists x, yR and (x, y)∉E such that \(|nb_{G[P^{\prime }]}(x)\cap nb_{G[P^{\prime }]}(y)| < k\) where P is the vertex set of k-core of \(G[N_{2}(u)]\). When the combination number \(|nb_{G[P^{\prime }]}(u)|\choose k\) is larger than a threshold α, α subsets are randomly tested. The pseudo code of the LkVCS algorithm is in the supplementary materials. Next, Example 1 illustrates how the LkVCS algorithm works.

figure c

Example 1

We apply LkVCS on the graph G in Figure 2 for u = v1. We set k = 3. First, we obtain the 3-core of G[N2(v1)] is G4. We arbitrarily select 3 vertices from \(nb_{G_{4}}(v_{1})\), that is {v2, v3, v4} and R = {v2, v3, v4}∪{v1}. Because G[R] is not a 3-vertex connected subgraph, we continue to add the common neighbor v5 of v1, v3 and v2, v4 into R. Now, G[R] is a 3-vertex connected subgraph. We output G[R] as the local 3-vertex connected subgraph of v1.

Identifying seed subgraphs for the whole network

A naive way is to view every vertex in the network as start vertex and compute the corresponding local k-vertex connected subgraph for them by LkVCS algorithm. However, it is very inefficient. In this part, we carefully devise two optimization strategies to select some of the vertices as start vertices to identify seed subgraphs, and thus further reduce the computational cost largely.

Optimization 1: :

Vertex order priority based strategy. In vertex order priority strategy, we assign the vertices with smaller degree have higher priority than that with larger degree. We observe that the value of \(({ deg(u) \atop k })\) is not large for a vertex u with small degree. Hence, if a vertex u having smaller degree, we can take less time to detect whether there exists a local k-vertex connected subgraph for u.

Theorem 5

Given a vertex u inG(V, E),it can be contained in at most 0k-VCCs whendeg(u) < kor\(1 + \lceil \frac {deg(u)-k}{k-1}\rceil \) k-VCCswhendeg(u) ≥ k,simultaneously.

Proof

Based on Definition 3, if the vertex u is in a k-vertex connected subgraph, it must have at least k neighbors. Thus, for u with deg(u) < k, it cannot be contained in any k-vertex connected subgraphs. Next, recall the property of the k-VCCs that given any two k-VCCs Gi and Gj, |ViVj| < k. Based on this property and Definition 3, for the vertex u with deg(u) ≥ k, besides u, there are at least k neighbors of u belonging to the same k-vertex connected subgraph. At the same time, any two k-vertex connected subgraph including u can overlap at most k − 1 vertices. Thus, u can be contained in at most \(1 + \lceil \frac {deg(u)-k}{k-1}\rceil \)k-VCCs. □

Theorem 5 gives the upper bound of number of k-VCCs, that a vertex could be contained in at the same time. We can see that the vertices with larger degree can be contained in more different k-VCCs and even larger amount of k-vertex connected subgraphs. In particular, for a specific vertex u with deg(u) = k, it can only be contained in at most 1 k-VCC. Recall that the LkVCS method will take more computational time for the input vertex u with larger vertex degree, because it will enumerate much more combinations of vertices than that of small degree to find the local k-vertex connected subgraphs. To avoid visiting the vertices with larger degree first, we set the vertices with larger degree having lower priority.

Optimization 2: :

Non-redundancy based strategy. We observe that if we find the seed subgraph for every vertex in the network, we will acquire many duplicate subgraphs or highly overlapping subgraphs. In order to reduce redundance, we design the non-redundancy based strategy such that for a given vertex, if it has already been contained in the discovered seed subgraphs of other vertices, there is no need to find its own seed subgraph. Note that a vertex can be included in different seed subgraphs, even if it is not processed.

Together with the vertex order priority strategy, we can find seed subgraphs for the uncovered vertices with smaller degree as soon as possible. Besides, based on the long-tail theory, the vertices with larger vertex degree are probably included in the discovered seed subgraphs, which means we do not need to detect seed subgraphs for these vertices. Thus, making use of these two strategies, we can find constraint number of seed subgraphs with higher efficiency.

In Seeding() shown in Algorithm 4, we visit all the vertices in the graph according to the non-decreasing order of their vertex degree. If a vertex v has not been visited, we find its local k-vertex connected subgraph using the LkVCS algorithm as its seed subgraph; otherwise we will continue. Finally, we obtain the set of seed subgraphs.

figure d

Example 2

We apply the above two strategies in seeding() on the graph shown in Figure 2 to identify the seed subgraphs. We rank all the vertices in G according to their non-decreasing order of their vertex degree denoted as, v12(3) ≼ v14(3) ≼… ≼ v9(5) ≼ v13(6) ≼ v4(7). The numbers in the brackets are their vertex degree. We first find the seed subgraph for v12, which is . Then, we successively visit the vertex according to the vertex priority. For example, we find the seed subgraph G3 for v14. As G3 contains {v13, v15, v16}, we do not need to detect the seed subgraphs for these vertices. At last, we identify three seed subgraphs including G3 for v14, G1 for v1 and G2 for v6.

5.2 Expanding and merging seed subgraphs

In this section, we focus on solving the problem of detecting k-VCCs from the discovered seed subgraphs. We observe that the k-VCCs do not satisfy the property of downward closeness. That is for a k-VCC denoted as G[S], ∃SS, the induced subgraph G[S] is not a k-vertex connected subgraph. Thus, we cannot simply expand the discovered seed subgraphs by adding a series of vertices adjacent to their neighborhood to obtain the target k-VCCs.

Menger’s Theorem [15] indicates that a graph is k-vertex connected if and only if it contains k vertex independent paths between any two non-adjacent vertices. Based on this relationship between independent path and vertex connectivity, we devise two algorithmic approaches, Expanding() and Merging(), in which, we could safely add vertices into the current subgraph and combine different subgraphs to form a bigger k-vertex connected subgraph, respectively. Next, we detail these two approaches.

Expanding

We first give some explanations of the notions used here. Given a graph G(V, E) and a vertex set SV, \(\overline {S}\) represents the complement of S in G, i.e., \(\overline {S} = V\setminus S\), and δS denotes the boundary of the induced subgraph G[S], which means for any vertex vδS, there exists a vertex unb(v) and uS.

In Expanding() shown in Algorithm 5, we add vertices that are connected to at least k vertices in G[S] into the current subgraph. The specific process is as follows. For each k-vertex connected subgraph G[S] now we have, if there exists vertex u in \(\delta \overline {S}\) that is adjacent to at least k vertices in δS, we add every vertex like this into the current S and update Su as the new S. We iteratively conduct this procedure until there is no such vertex in \(\delta \overline {S}\) that can be added into S. Theorem 6 guarantees the correctness of the Expanding() approach.

figure e

Theorem 6

SupposeG[S] isak-vertex connected subgraph. If vertex u is adjacent to at least k vertices inδS,G[Su] isalso ak-vertex connected subgraph.

Proof

We need to prove that for every pair of vertices a, b in G[Su], κ(a, b) ≥ k. Since G[S] is a k-vertex connected subgraph, we only need to prove for ∀aS, κ(u, a) ≥ k. We assume that there is a vertex cS making κ(u, c) < k. Based on the Menger’s Theorem, we know that there are k (k < k) independent paths between u and c. So, once we delete one vertex from each of the k paths, u will be disconnected from c. However, u is adjacent to at least k vertices in δS, even if we delete any k vertices from them, the subgraph is still connected, which is controversial to the assumption. Thus, G[Su] is a k-vertex connected subgraph. □

Example 3

We use the graph in Figure 2 as an example to illustrate Expanding(). Assume that we have already obtained the 3-vertex connected subgraph G[S] where G[S] = G2. We find that v11 is adjacent to three boundary vertices including v4, v6 and v7 in G[S]. Thus, we add v11 into G[S] and G[Sv11] is also a 3-vertex connected subgraph.

Merging

When the obtained k-vertex connected subgraphs cannot be further expanded by Expanding(), we expect to combine some of the adjacent subgraphs together to acquire larger subgraphs with k-vertex connectivity. Here, we develop Merging() shown in Algorithm 6 to integrate different k-vertex connected subgraphs with at least k direct independent paths into a new one. Specifically, we first detect whether the input subgraphs are k-VCCs. If so, we directly output them as k-VCCs; otherwise, we iteratively merge the subgraphs satisfying the condition in Theorem 7 that will be detailed later until no subgraphs can be merged. If meeting the conditions in Corollary 1, the subgraph after combined is already a k-VCC, otherwise it will be saved for further processing. In the implementation, we use the disjoint sets structure to accelerate the merging operation.

figure f

Formally, Theorem 7 provides the sufficient condition to guarantee Merging() is correct. If the sum of the number of overlapping vertices and length-1 disjoint paths between two k-vertex connected subgraphs is more than or equal to k, they can be combined together.

Theorem 7

Assume thatS andSaresets of vertices in a graph G, i.e.,SV,SV,and the induced subgraphG[S] andG[S] arek-vertex connected. We sayG[SS] isak-vertex connected subgraph, if the following condition issatisfied,\(|S\cap S^{\prime }| + \min \limits \{|nb_{S^{\prime }\setminus S}(S\setminus S^{\prime })|,|nb_{S\setminus S^{\prime }}(S^{\prime }\setminus S)|\} \geq k\).

Proof

Based on Definition 2, graph G[SS] is k-vertex connected if the vertex connectivity between any pair of vertices r, r in G[SS] are not smaller than k. Since G[S] and G[S] have already been k-vertex connected, we only need to prove that ∀sSS, ∀sSS, κ(s, s) ≥ k. A S-S path is a path from some sS to some sS that passes through no other vertex of S or S. If a vertex x belongs to both S and S, then x itself can be viewed as a S-S path. According to Menger’s Theorem, if the number of disjoint S-S pathes is not smaller than k, S and S are k-vertex connected. Thus, based on the conditions, if the sum of the number of sharing vertices and length 1 disjoint paths is not smaller than k, G[SS] is k-vertex connected. □

Furthermore, based on Theorem 6 and Theorem 7, we obtain Corollary 1. In Merging(), Corollary 1 can be exploited as an early termination condition. That is once finding some subgraphs which satisfy the conditions in Corollary 1, we can put these subgraphs into the result set, because it is impossible to be combined with any other subgraphs. This can significantly reduce the number of subgraph combining operations.

Corollary 1

LetG[S] beak-vertex connected subgraph. If the following two conditions are satisfied, we sayG[S] is ak-VCC:

  1. (1)

    \(\not \exists v \in \delta (\overline {S})\), |nb(v) ∩ δS| > k;

  2. (2)

    \(\min \limits \{|\delta S|, |\delta \overline {S}|\} < k\).

Example 4

A running example of Merging() is given using G in Figure 2. Suppose we already have three 3-vertex connected subgraphs G1, G2 and G3. We observe that G3 satisfies the conditions in Corollary 1. That is, v8, v9 and v10 are only adjacent to one boundary vertex v13 of G3, and \(\min \limits \{|\delta V(G_{3})|, |\delta \overline {V(G_{3})}|\} = \min \limits \{1, 3\} = 1 < 3\). Thus, G3 is a 3-VCC. For G1 and G2, we have that \(|V(G_{1})\cap V(G_{2})| + \min \limits \{|\delta V(G_{1})\cap \delta \overline {V(G_{2})}|, |\delta \overline {V(G_{1})}\cap \delta V(G_{2})|\} = 1 + \min \limits \{2, 2\} = 3 \geq 3\). Thus, we can merge G1 and G2 together based on Theorem 7.

Time complexity::

First, in the seed subgraph identification step, the Seeding() invokes n times of the LkVCS algorithm at the worst case, whose time complexity is O(α|Eavg|) where |Eavg| represents the average edge number in G[N2(u)]. Thus, the time complexity for Seeding() is O(nα|Eavg|). Then, the time complexity of Expanding() is O(Nsdavg|Bavg|), where Ns represents the number of seed subgraphs, davg is the average degree in G and |Bavg| represents the average number of vertices in the boundary of seed subgraphs. Also, the time complexity of Merging() is \(O({N_{s}^{2}}d_{avg}|B_{avg}|)\). Suppose that Algorithm 2 executes at most Niter iterations of expanding and merging, the overall time complexity of Algorithm 2 is \(O(n\alpha |E_{avg}|+N_{iter}(N_{s}d_{avg}|B_{avg}|+({N_{s}^{2}}d_{avg}|B_{avg}|))\).

Space complexity::

Algorithm 2 needs to record all the seed subgraphs and it takes O(Ns|Eavg|) space.

6 Hybrid framework for k-VCCs detection

When identifying the target k-VCCs, the bottom-up framework only exploits the pairwise relationship between a k-vertex connected subgraph and a vertex (or a k-vertex connected subgraph) corresponding to Expanding()(or Merging()). To fix this limitation, in this section, we propose a hybrid framework for k-VCCs detection.

The hybrid framework focuses on the group relationships among the subgraphs and vertices, and thus is able to detect larger k-vertex connected subgraphs building on the results obtained from the bottom-up framework. For example, in Figure 3a, A, B, C and D (enclosed by the dashed circles) represent four 3-vertex connected subgraphs that are discovered by the bottom-up framework and E denotes a single vertex. Intuitively, we can find that each pair of them cannot be merged, however, the whole graph is a 3-VCC, if we consider the group {A, B, C, D, E} together.

Figure 3
figure 3

An example of the construction of the mixed graph for k = 3

Algorithm 7 summarizes the hybrid framework. We first construct a novel mixed graph with the discovered k-vertex connected subgraphs and the vertices that are not assigned to any k-vertex connected subgraphs after the bottom-up framework (line 1), and then find larger k-vertex connected subgraphs by further iteratively contracting the mixed graph (line 2). In the end, we identify the target k-VCCs by a top-down way (line 3).

figure g

6.1 Constructing the mixed graph

We propose the mixed graph to facilitate the identification of the target k-VCCs. After processed by the bottom-up framework, the original graph is covered by the discovered k-vertex connected subgraphs and the vertices unassigned to any k-vertex connected subgraphs. Here, we also view the single vertex as a subgraph. In mixed graph, we model these subgraphs as vertices and the connectivity between them as weighted edges.

Specifically, we denote the discovered k-vertex connected subgraphs as type T1 vertices and the subgraphs of single vertex as type T2 vertices, respectively. To preserve the relationship lossless, according to Theorem 6 and 7, we use the number of adjacent vertices to measure the edge weight between vertices of type T1 and T2, while utilize the sum of overlapping vertices and length 1 independent paths to qualify the weights of edges connecting both two type T1 vertices. Definition 8 shows the formal definition of mixed graph.

Definition 8

(Mixed graph) Let Gmix(Vmix, Emix) represent the mixed graph, where Vmix is the set of vertices and Emix is the set of edges. Each vertex \(v_{i}^{t_{i}} \in V_{mix}\) corresponds to a subgraph in G, where ti ∈{T1, T2} denotes the type of vi. T1 and T2 represent vi to be a k-vertex connected subgraph and a single vertex in G, respectively. For each edge \(e = (v_{i}^{t_{i}}, v_{j}^{t_{j}}) \in E_{mix}\), its edge weight w(e) measures the vertex connectivity between vi and vj and can be computed as follows,

$$ w(v_{i}^{t_{i}},v_{j}^{t_{j}})= \left\{ \begin{array}{lll} &|\mathcal{Z}|+\min\{|nb_{\mathcal{Y}}(\mathcal{X})|,|nb_{\mathcal{X}}(\mathcal{Y})|\}\\ &\ \text{$t_{i} = T_{1}$, $t_{j} = T_{1}$},\\ &|nb_{map(v_{i})}(map(v_{j}))|\ \text{$t_{i} = T_{1}$, $t_{j} = T_{2}$},\\ &1\ \text{$t_{i} = T_{2}$, $t_{j} = T_{2}$} \end{array} \right. $$

where \(\mathcal {X}=map(v_{i})\setminus map(v_{j})\), \(\mathcal {Y}=map(v_{j})\setminus map(v_{i})\), \(\mathcal {Z}=map(v_{i})\cap map(v_{j})\) and the function map(⋅) maps the vertices to their original vertex sets in the original graph G.

Example 5

Figure 3 illustrates an example of the mixed graph when k = 3. Figure 3a is the original graph. A, B, C and D are four discovered 3-vertex connected subgraphs. E represents the unassigned vertex. Figure 3b shows the corresponding mixed graph, in which all these subgraphs/vertices in the original graph are treated as vertices and the values of connectivity between them are represented as weighted edges. In particular, we use ‘\(\square \)’ to represent the discovered k-vertex connected subgraphs (type T1) and ‘○’ to mark the unassigned vertices (type T2).

Upper bound of vertex connectivity

Let Gmix represent a mixed graph and v belongs to Gmix, we derive the upper bound of vertex connectivity between the corresponding subgraph G[map(v)] of v and the rest of graph G. We denote this upper bound as \(\overline {\mathsf {vc}}(v)\), which equals to the sum of the weights of edges that are adjacent to v, i.e.,\(\overline {\mathsf {vc}}(v) = {\sum }_{u\in nb_{G_{mix}}(v)}w(u, v)\). Next, we prove the correctness of this upper bound.

Theorem 8

\(\overline {\mathsf {vc}}(v)\geq \kappa (G[map(v)], G[V\backslash map(v)])\),whereκ(G[map(v)], G[Vmap(v)]) representsthe real vertex connectivity between subgraphsG[map(v)] andG[Vmap(v)] in G.

Proof

Because for different vertices such as \(u, h \in nb_{G_{mix}}(v)\), the independent paths between map(v) and map(u) and between map(v) and map(h) may incident on the same vertices, the real number of independent paths between map(v) and map(u) ∪ map(h) may be less than w(v, u) + w(v, h). Similarly, κ(G[map(v)], G[Vmap(v)]) may be also less than \(\overline {\mathsf {vc}}(v)\). And the equal sign is established when the independent paths between map(v) to other subgraphs are all incident on different vertices. □

Based on Theorem 8, for a vertex v, if \(\overline {\mathsf {vc}}(v) < k\), we can explicitly prune v from Gmix without exactly computing the vertex connectivity in G. This is because if v is a type T1 vertex, G[map(v)] is already a k-VCC, otherwise v is a type T2 vertex, and it cannot be in any of the k-VCCs. \(\overline {\mathsf {vc}}(v)\) is exploited as an early deletion condition in the following steps. In addition, \(\overline {\mathsf {vc}}(v)\) becomes tighter along with the contraction of the mixed graph (detailed in Section 6.2). This is because when the size of Gmix becoming smaller, the number of neighbors for a vertex v may also get smaller, which decrease the repeated counting of independent paths. For example, in Figure 4, \(\overline {\mathsf {vc}}(v)\) is originally equal to 3, and after S contracted to one vertex u, \(\overline {\mathsf {vc}}(v)\) becomes to 1, which is the same as the exact vertex connectivity between G[map(v)] and G[map(u)].

Figure 4
figure 4

An example of renewing the mixed graph

Implementation

When the bottom-up framework is conducted, we can easily map the discovered k-vertex connected subgraph to its corresponding vertex in Gmix. Meanwhile, in Expanding() and Merging(), the number of direct independent paths between subgraphs or vertices needed to be computed when merging two subgraphs or adding one vertex into a subgraph, which is just corresponding to the weight between two vertices in Gmix. Therefore, the construction of Gmix can naturally carry out along with the conducting of bottom-up framework and do not bring in extra computation.

6.2 Contracting the mixed graph

Due to the probably large size of the mixed graph, in this section, we study how to contract the mixed graph. This will largely reduce the computational cost for finding out target k-VCCs in a top-down manner, which has to exploit the global structure of the mixed graph.

To shrink the scale of the mixed graph, we design an iterative mixed graph contraction method, called Contraction(). In each iteration, (1) we try to contract groups of T1 vertices in the mixed graph by utilizing the efficient state-of-the-art k-ECC detection method KECC() [6]; (2) Due to the change of edge weight, we have to renew the graph after contracting to be a mixed graph again. In addition, we can safely remove the vertex if its connectivity to other vertices are less than k. The whole process ceases when the mixed graph cannot be further updated. Next, we detail the above two procedures.

Contracting type T1 vertices

To further reduce the size of Gmix, we design to contracting some of the type T1 vertices. Here, we observe that (1) each type T1 vertex can only belong to one k-ECC in \(G_{mix}[V_{mix}^{T_{1}}]\), which is detailed in Lemma 2; (2) only the vertices corresponding to groups of type T1 vertices in the same k-ECC could make up larger k-vertex connected subgraph in G. This is verified in Lemma 3.

Lemma 2

\(\forall v \in V_{mix}^{T_{1}}\) , v can only belong to one k-ECC in \(G_{mix}[V_{mix}^{T_{1}}]\) .

Proof

We prove this lemma by contradiction. Assume v belongs to different k-ECCs. Based on the definition of k-ECC, let Gmix[S] be a k-ECC in \(G_{mix}[V_{mix}^{T_{1}}]\) and after removing any k-1 edges from it, Gmix[S] is still connected. Thus, all these k-ECCs containing v is k-edge connected, which violates the k-ECC is a maximal subgraph. Thus, v can only belong to one k-ECC in \(G_{mix}[V_{mix}^{T_{1}}]\). □

Lemma 3

LetGmix[S] representak-ECC in\(G_{mix}[V_{mix}^{T_{1}}]\),only forSS,\(G[\bigcup _{v\in S^{\prime }}map(v)]\)couldmake up largerk-vertex connected subgraph inG.

Proof

We prove this lemma by contradiction. Assume v and v belong to different k-ECCs in \(G_{mix}[V_{mix}^{T_{1}}]\) and map(v) and map(v) are in the same k-vertex connected subgraph in G. Because of belonging to different k-ECCs in \(G_{mix}[V_{mix}^{T_{1}}]\), v and v are less than k-edge connected. Correspondingly, map(v) and map(v) are less than k-edge connected in G, and thus are less than k-vertex connected, which contradicts to the assumption. □

According to Lemma 2, we can first partition \(G_{mix}[V_{mix}^{T_{1}}]\) into different k-ECCs. Here we adopt the KECC() method, which only takes O(h × l|E|) time [6], where h is the height of the decomposition tree and l ≪|V |. This is much faster than the top-down framework for k-VCC detection.

Then, based on Lemma 3, we want to find out the set of type T1 vertices in the same k-ECC that can be further contracted to shrink Gmix. As we know, for every vertex v in the current k-ECC, i.e., Gmix[S], even if its upper bound \(\overline {\mathsf {vc}}(v) \geq k\), we still cannot determine whether its corresponding subgraph G[map(v)] is k-vertex connected to the rest of the induced subgraph corresponding to S in G, denoted by G[map(S) ∖ map(v)]. Here, we calculate the exact k-vertex connectivity between G[map(v)] and G[map(S) ∖ map(v)]. Moreover, we observe that if every G[map(v)] is k-vertex connected to the rest of the subgraph, then G[map(S)] is a larger k-vertex connected subgraph, which is proved in Theorem 9. Therefore, we can safely combine the type T1 vertices in S into a larger one.

Theorem 9

AssumeGmix[S] denoteak-ECC inGmix,\(S\subseteq V_{mix}^{T_{1}}\)andSS,G[map(S)] isak-vertex connected subgraph iffvS,G[map(v)] isk-vertex connected toG[map(S) ∖ map(v)].

Proof

First, we prove the sufficiency. If G[map(S)] is a k-vertex connected subgraph, then ∀v, vS, κ(v, v) ≥ k. Thus, ∀vS, G[map(v)] is k-vertex connected to G[map(S) ∖ map(v)]. Then, we prove the necessity. Because v is a type T1 vertex, G[map(v)] itself is a k-vertex connected subgraph, for every v, deleting any k − 1 vertices within G[map(v)] will not disconnect G[map(S)]. Also, G[map(v)] is k-vertex connected to G[map(S) ∖ map(v)] for every v in S. Therefore, even if removing any k − 1 vertices from different map(v)s, G[map(S)] is still connected. Overall, G[map(S)] is a k-vertex connected subgraph. □

In practice, we can iteratively remove v such that κ(G[map(v)], G[map(S)∖map(v)]) < k from S until all the vertices in the current S satisfy Theorem 9. Finally, we contract all the vertices in the current S into a larger one without destroying the connectivity of the corresponding subgraph in G.

Renewing the mixed graph

After contracting the type T1 vertices, we have to update the mixed graph. Let \(\mathcal {S}\) represent the set of type T1 vertices that can be grouped together. After contracting \(\mathcal {S}\) into a new type T1 vertex (e.g. u) in the mixed graph, we first need to duplicate all the vertex mapping information of S into u. Then, we reconnect and reweigh the edge weight between the neighbors of \(\mathcal {S}\) in Gmix (e.g., \(nb_{G_{mix}}(\mathcal {S})\)) and u. For a vertex \(v \in nb_{G_{mix}}(\mathcal {S})\), when computing w(v, u), we cannot easily use \({\sum }_{s\in \mathcal {S}} w(v, s)\) as its edge weight, because not all the paths between v and \(\mathcal {S}\) are independent. For example, in Figure 4, all the vertices in \(\mathcal {S}\) are adjacent to v at v0map(v), thus w(v, u) = 1.

After renewing u and the weights of its incident edges, there may exist edges with weight no less than k. Due to the weights of all edges in the mixed graph are less than k, it is necessary to merge the two parts of the edge with weight larger or equal to k until there is no such cases by Expanding() and Merging(). Notably, once a new type T1 vertex u is generated, it is necessary to utilize Expanding() to enlarge map(u) in G. This will ensure the maximality of the final results. For example, in Figure 4, if v0map(u), we should add v0 into map(u). The following example illustrates the complete process of contracting the mixed graph.

Example 6

In Figure 3b, the subgraph induced by the type T1 vertices including {A, B, C, D} is a 3-ECC for \(G_{mix}[V_{mix}^{T_{1}}]\) and each one of them is k-vertex connected to the rest. Hence, we can merge all these vertices together into a new type T1 vertex, denoted as s and reconnect all the edges originally connected to {A, B, C, D} to s, which is shown in Figure 3c. Because all the paths are independent, w(E, s) = 3. Finally, we continue to merge E and s as a whole 3-VCC, due to w(E, s) ≥ 3.

6.3 Identifying k-VCCs based on the mixed graph

In this section, we propose the Topdown_revise() method to find the target k-VCCs from the contracted Gmix. The main idea is to iteratively divide the mixed graph in a top-down manner until the separated parts themselves are corresponding to the k-VCCs in the original graph. Because the size of Gmix is much smaller than G, the Topdown_revise() is much more efficient than the top-down framework, which directly divides the original graph G.

First, we study how to determine the minimum vertex cut of Gmix. Recall that the top-down framework (detailed in Section 4) detects k-VCCs by iteratively computing κ(G[C]) for the current graph G[C] and reduce it to a maximum flow problem in directed graph. Correspondingly, Topdown_revise() identifies k-VCCs by computing κ(Gmix[C]) for the current Gmix[C]. Here, we only emphasize the differences when constructing the directed graph \(G^{\prime }_{mix}\) for Gmix. For each input Gmix and vertices s, tVmix, the directed flow network \(G^{\prime }_{mix}\) is constructed as below.

  1. 1.

    For each \(v\in V_{mix}^{T_{1}}\) (vs and vt), add two vertices v, v into \(V^{\prime }_{mix}\), and the directed edge(v, v) and (v, v) into \(E^{\prime }_{mix}\) with edge weights equal to k. And for each \(u\in V_{mix}^{T_{2}}\), add two vertices u, u into \(V^{\prime }_{mix}\), and the directed edge(u, u) and (u, u) into \(E^{\prime }_{mix}\) with edge weights equal to 1.

  2. 2.

    For each edge (s, v)∈ Emix, add edge (s, v) to \(E^{\prime }_{mix}\) with weight w(s, v); for each edge (v, t)∈ Emix, add edge (v, t) to \(E^{\prime }_{mix}\) with weight w(v, t); for each edge connecting two type T1 vertices, such as (u, v)∈ Emix, add two edges (u, v) and (v, u) to \(E^{\prime }_{mix}\) and each edge has capacity w(u, v); for each edge connecting type T1 and T2 vertices, such as (x, y)∈ Emix, add two edges (x, y) and (y, x) to \(E^{\prime }_{mix}\) and each edge has capacity \(\infty \).

  3. 3.

    Assign s as the source vertex and t as the sink vertex.

Next, we analyze the rationale to construct \(G^{\prime }_{mix}\) like above. When the maximum flow (minimum cut) between s and t in \(G^{\prime }_{mix}\) is less than k, \(G^{\prime }_{mix}\) needs to be separated. For type T1 vertices, the edges connecting two different type T1 vertices can belong to the edge cut set. And based on the connectivity information, we can locate its corresponding vertex cut set. However, for the same T1 vertex, e.g. v, by setting w(v, v) and w(v, v) to k, we find that v and v are inseparable due to these two edges cannot be included in the edge cut set. This observation is in accordance with Lemma 2. On the other hand, for type T2 vertices such as u, because w(u, u) and w(u, u) are equal to 1 in \(G^{\prime }_{mix}\), only these two edges are possible to be contained in the edge cut set. Further, the corresponding vertex u could be in the vertex cut set.

When the current induced subgraph Gmix[S] cannot be further separated, we have to check if G[map(S)] is a k-VCC in G. The process is similar to that in Section 6.2. Based on Theorem 9, we find out vS such that G[map(v)] is less k-vertex connected to G[map(S) ∖ map(v)] and iteratively remove it from S, until all the vertices left are k-vertex connected in G; that is the current G[map(S)] is a k-VCC. In addition, if the discarded v is a type T1 vertex, G[map(v)] is still a k-VCC.

By exploiting the top-down detection in Topdown_revise(), we can indicate that any super graph of the obtained subgraphs are not k-vertex connected, and the post process guarantees that they are k-vertex connected. Therefore, all these subgraphs are maximal k-vertex connected, i.e., k-VCCs.

6.4 Complexity analysis

We now analyze the computational complexity of the hybrid framework, which mainly consists of contracting the mixed graph (Section 6.2) and identifying k-VCCs from the contracted one (Section 6.3). For the former part, let Niter1 represent the number of iterations. In each iteration, \(|\overline {E}_{mix}|\), \(|\overline {E}_{mix}^{upt}|\) and \(|\overline {V}_{mix}^{map}|\) denote the average number of edges in the current Gmix, the average number of edges that need to be updated and the average number of vertices, to which each vertex in Vmix corresponds in G. Thus, it takes \(Time_{1} = O(N_{iter1}\cdot (h\cdot l|\overline {E}_{mix}| + |\overline {E}_{mix}^{upt}| \cdot |\overline {V}_{mix}^{map}|))\) time to contract Gmix, where l and h are mentioned after Lemma 3. For the later part, let Niter2 denote the total number of subgraphs detected when dividing \(G_{mix}^{cur}\), which represents the mixed graph after contracting. For a certain subgraph, \(|\overline {V}_{mix}^{cur}|\) and \(|\overline {E}_{mix}^{cur}|\) refer to its average number of vertices and edges. So, the computational cost is bounded by \(Time_{2} = O((|\overline {V}_{mix}^{cur}|-\delta -1+\delta (\delta -1)/2) \cdot |\overline {E}_{mix}^{cur}|\cdot |\overline {V}_{mix}^{cur}|^{2/3} \cdot N_{iter2})\) where δ represent the largest vertex degree in \(G_{mix}^{cur}\). Totally, the hybrid framework costs Time1 + Time2 time.

7 Experiments

We conduct extensive experiments to evaluate the effectiveness and efficiency of the proposed methods by using a variety of real and synthetic datasets. All algorithms are implemented in C++. All the experiments are conducted on a Linux Server with Intel Xeon 3.2 GHz CPU and 64 GB main memory.

7.1 Datasets and compared methods

The statistics of real networks used in the experiments are shown in Table 1. dmax denotes the maximum vertex degree of G. \(\mathcal {D}\) is the degeneracy of G in Definition 5. #C is the number of ground-truth communities. The first Yeast dataset is a protein-protein interaction network downloaded from BioGRID.Footnote 1 The other four datasets are networks with ground-truth communities.Footnote 2 We abbreviate these datasets as YA, AZ, DP, YT and LJ.

Table 1 Statistics of real networks (K = 103 and M = 106)

First, we compare our k-VCC with k-CC [3] and k-ECC [6] for effectiveness evaluation. Especially, we use k-VCC-B and k-VCC-H to represent the heuristic and exact k-VCCs obtained from the bottom-up and hybrid frameworks, respectively. Secondly, we evaluate the following algorithms for efficiency comparison:

  • TkVCC: the top-down framework for k-VCC detection shown in Algorithm 1, discussed in Section 4. This is also used as the baseline method.

  • BkVCC-Ran: the bottom-up framework for k-VCC detection shown in Algorithm 2 with random vertex order priority strategy in Section 5.1.

  • BkVCC-NI: the bottom-up framework for k-VCC detection shown in Algorithm 2 with non-increasing vertex order priority strategy in Section 5.1.

  • BkVCC-ND: the bottom-up framework for k-VCC detection shown in Algorithm 2 with non-decreasing vertex order priority strategy in Section 5.1.

  • HkVCC-ND: the hyrid framework for k-VCC detection shown in Algorithm 7 with non-decreasing vertex order priority strategy in Section 5.1.

In the end, we show the case studies of k-VCC for different applications.

7.2 Evaluation on real networks

Effectiveness evaluation

To evaluate the effectiveness of different component models, we compare the proposed k-VCC with k-CC [3] and k-ECC [6] on 4 real datasets including AZ, DP, YT and LJ with ground-truth communities [48] under different types of criteria. In the experiments, for different input k, we detect the heuristic k-VCCs (k-VCC-B) by BkVCC-ND, the exact k-VCCs (k-VCC-H) by HkVCC-ND, the k-CC by the method in [3] and the k-ECCs by the method in [6] as communities, respectively.

First, we use F-score to measure the accuracy of the detected communities with regard to the ground-truth communities. Given the discovered community G[S] and the ground-truth community G[T], F-score is defined as \(F(S, T) = 2 \ast \frac {prec(S, T)\ast rec(S, T)}{prec(S, T)+rec(S, T)}\) where \(prec(S, T) = \frac {|S\cap T|}{|S|}\) represents the precision and \(rec(S, T)= \frac {|S\cap T|}{|T|}\) represents the recall. We can see that higher F-score value means the detected community is more similar with the ground-truth.

For each discovered community Si, we compute the F-score with every ground-truth community Tj of the dataset and choose the largest F(Si, Tj) as the final F-score, Fi of Si. Further, we use the average value of all Fi, denote as \(\overline {F}\) to represent the F-score corresponding to a given dataset. Figure 5 shows the F-scores of the compared methods for different value of k. We find that the k-VCC-B and k-VCC-H have relatively higher F-score on AZ, YT and LJ datasets than k-ECC and k-CC. This is because in these datasets, the ground-truth communities are defined based on common interest or function, which is very cohesive. Also, the k-VCC-B has higher F-score than k-VCC-H, when k is smaller. This is because k-VCC-H obtains the exact k-VCCs, whose scales are usually larger than that of heuristic ones. Along with k becoming larger, the F-scores of k-VCC-B and k-VCC-H are almost the same, since the communities discovered with larger k-vertex connectivity are very similar.

Figure 5
figure 5

F-score on different real networks

However, the ground-truth community in DP is defined based on publication venues. The authors publishing papers in the same conference or journal may be not densely connected [48]. Thus, the trend we observe is opposite to the above three datasets, that is k-VCC-B has the lowest F-score on DP. Furthermore, k-VCC-B and k-VCC-H is almost the same along with the increasing of k.

Then, we use density and diameter to measure the goodness (cohesiveness) of the detected communities. Here, the density of a graph is defined as the fraction of the edges that appear between the vertices to that of all possible edges and the diameter is defined as the longest shortest paths between any pair of vertices in a network. Given a community G[S], the density and diameter of a subgraph G[S] is represented as \(dens(G[S]) = \frac {2|E(S)|}{|S||S-1|}\) and diam(G[S]) = maxu, vV (G[S])dist(u, v), respectively. We say a community is more cohesive than others when it possesses larger density and smaller diameter.

Figure 6 presents the density of the obtained k-CC, k-ECC, k-VCC-B and k-VCC-H for different k values on different datasets. we can observe that along with the increasing of k, the density of all these communities is becoming larger. The reason is that when k becoming larger, the vertices with smaller degree are continually removed. Thus, the communities with larger k values are more cohesive than that with smaller k values. In addition, for the same k-value, k-VCC has larger density than k-CC and k-ECC. This is because k-VCC is nested in k-CC and k-ECC, and the structure of k-VCC has more connectivity constraint and thus is more cohesive. We can also find that the detected k-VCC-B are more denser than k-VCC-H. This is because the exact k-VCC may be composed of different heuristic k-VCCs. Thus, when k is small, it is less cohesive than heuristic ones.

Figure 6
figure 6

Density on different real networks

Figure 7 shows the diameter of the detected k-CC, k-ECC, k-VCC-B and k-VCC-H for different k on different networks. We can see that, opposite to the density, with the increasing of k, the diameter becomes smaller for all kinds of communities with similar reasons as density. Moreover, for the same k value, the diameter of k-VCC-B and k-VCC-H is both smaller than that of k-CC and k-ECC. That means k-VCC is more cohesive and thus can effectively reduce the occurrence of free rider effect.

Figure 7
figure 7

Diameter on different real networks

Next, we present the number of k-vertex connected subgraphs (or k-VCCs) produced in different stages of various algorithms for different k. In particular, seed, heuristic and exact represents the number of seed subgraphs, heuristic k-VCCs and exact k-VCCs generated by the (BkVCC-ND) and (HkVCC-ND) methods, respectively. From Figure 8, we can see that seed, heuristic and exact are becoming smaller for the same k. This is because the seed subgraphs are expended and merged into heuristic k-VCCs and the heuristic are further contracted into the exact k-VCCs. However, for different k values, seed is getting smaller with the increasing of k, due to the number of local k-vertex connected subgraphs decreases when k raises. And the heuristic is first increases and then decreases. The reason is that when k is small, many seed subgraphs can be merged into the same larger subgraphs. At last, when k is large, the heuristic is very close to exact for at this moment, the target k-VCCs is very cohesive, and it is less possible that the heuristic k-VCCs far away from each other can be contracted into a larger one.

Figure 8
figure 8

The number of subgraphs generated in different stages on different real networks

Efficiency evaluation

In this section, we conduct experiments to evaluate the efficiency of top-down, bottom-up and hybrid frameworks for detecting k-VCCs on different real networks.

Figure 9 presents the overall running time of comparing methods, including TkVCC, BkVCC-Ran, BkVCC-NI, BkVCC-ND and HkVCC-ND for varying parameter k. We first observe that the TkVCC method always runs slowest on all the datasets. This is because that it exploits the structure of the entire graph to find the minimum vertex cut set. When the scale of the graph becoming larger, it will be more time-consuming. Thus, for large real network such as LJ shown in Figure 9d, the program even cannot end within the required time. For the bottom-up framework, BkVCC-ND method runs much faster than BkVCC-Ran and BkVCC-NI over all the datasets. Recall that in BkVCC-ND, we assign vertices with smaller vertex degree have higher priority, which reduce the combination number of the neighbors for a given vertex. When we visit vertices with large vertex degree, they are very probably having been included in the vertices with small degree, which reduces the running time a lot. We further evaluate the running time of hybrid framework (HkVCC-ND). Although the hybrid framework spends a little more time than the bottom-up framework (BkVCC-ND), it not only runs much faster than the top-down framework but can find out the exact k-VCCs. This is because the size of the constructed mixed graph is based on the results of bottom-up framework, which is much smaller than the original graph. Thus, the hybrid framework does not spend too much extra time in detecting the k-VCCs by the top-down manner.

Figure 9
figure 9

Runtime comparison between top-down and bottom-up frameworks on different real networks

On the other hand, along with the increasing of parameter k, the running time of these methods first increases. This is because it needs more time to compute minimum vertex cut for TkVCC and seed subgraphs for the bottom-up based methods. Then, when the k value reaches a turning point, the running time begins to decrease. The reason is that with k becoming larger, more and more vertices are pruned by the k-core component.

Next, we test the memory usage of the HkVCC-ND algorithm, when varying k on different real datasets. Here, we compare the methods based on whether using the disjoint-set structure when merging the k-vertex connected subgraphs. We denote the methods with (or without) using the disjoint-set as HkVCC-wDS and HkVCC-woDS, respectively. Figure 10 presents the corresponding results. We can see that HkVCC-wDS utilizes smaller memory storage than that of HkVCC-woDS. This is because by disjoint-set, we can merge multiple subgraphs simultaneously, instead of combining only two subgraphs at each time. To do like this, it can large reduce the number of intermediate results. On the other hand, we observe that the memory usage decreases with the increasing of k. This is because although the number of results may get larger, the scale of the graphs and the number of seed subgraphs become smaller, and thus consume less memory.

Figure 10
figure 10

The memory usage of the HkVCC-ND algorithm on different real networks

7.3 Evaluation on synthetic networks

We generate a set of synthetic bipartite networks to evaluate the performance of the selected methods. The number of vertices are balanced in each part of these bipartite networks. The degree of both parts follow the power-law distribution with exponent γ and dmax = n/2. Here, we set γ = 2 as usual and the vertices in the networks are linked according to [34].

We evaluate the efficiency and effectiveness of the TkVCC and BkVCC-ND methods. We set k = 4 for all the situations. Figure 11a shows the running time when varying the number of vertex in the network. We can see that BkVCC-ND method is much more efficient than TkVCC method, which is consistent with the results on real datasets. Figure 11b shows the F-score of the result of BkVCC-ND corresponding to that of TkVCC. Since the result of TkVCC are exact solution, the relative high values of F-score indicates that although the BkVCC-ND method is heuristic, it could generate results with high quality and hence proves its effectiveness.

Figure 11
figure 11

Results on synthetic network

7.4 Case studies

In this section, we explore two interesting networks, including the author collaboration network of DBLP and the genetic interaction network of hypertension (HT), that act as solid evidence of the utility of k-VCCs in the real world.

Case study on DBLP

We construct an author collaboration network on KDD conference extracted from the raw DBLP dataset (http://dblp.uni-trier.de/xml/) for case study. A vertex represents an author, and an edge between two authors indicates they have co-authored. Figure 12 presents one of the discovered 3-VCCs containing Prof. Jiawei Han. With the increasing of k, the current subgraph could be separated into different smaller subgraphs having higher vertex connectivity. For example, Figure 13a and b are two 4-VCCs derived from the graph in Figure 12. Based on the background knowledge, Figure 13a shows his cooperation with the group of his colleague, Prof. Chengxiang Zhai, when he began to work at UIUC. Figure 13b shows his research group at UIUC along with some often cooperative famous professors. In addition, Figure 13c presents another 4-VCC, which represents his co-authors when he worked at SFU. We also found that Prof. Jian Pei often cooperates with Prof. Jiawei Han in his research career. Oppositely, if we use 4-ECC or 4-Core, we can only acquire one community containing Prof. Jiawei Han. Thus, we say that the communities detected by k-VCCs are more reasonable and interpretable, which effectively reduces the free rider effect.

Figure 12
figure 12

One discovered 3-VCC in DBLP containing Prof. Jiawei Han

Figure 13
figure 13

Some of 4-VCCs containing Prof. Jiawei Han

Case study on HT

We also apply the k-VCC model in analyzing the genetic interaction network of HT. This network is produced based on the Welcome Trust Case Control Consortium (WTCCC) hypertension dataset [12]. Specifically, we first map all the genetic markers to their corresponding genes, and then calculate all the chi-square test values between genes. Because different genetic markers may map to the same gene, for each pair of genes, we choose the most significant test value among all the corresponding genetic marker pairs as its genetic interaction value. Here, we set the chi-square value threshold to 70. If the interaction values is larger than this threshold, we add an edge between the genes. Finally, the genetic interaction network of HT contains 8468 vertices (genes) and 17741 edges (interactions).

Figure 14a and b shows the identified k-VCCs with k = 10 and k = 15, respectively. From the figure, it is clear that with the increasing of k, the scale of the identified subgraph become smaller. We can obtain different size of candidate results based on our requirement by tuning the value of k. More importantly, we find out several genes (with gray background in Figure 14) in both of these subgraphs that haven been verified to be associated with HT. For example, MYO6 gene encodes a reverse-direction motor protein that moves toward the minus end of actin filaments and plays a role in intracellular vesicle and organelle transport. It has been reported to have association with HT in [38]. The CUBN is a gene locus for albuminuria, and hypertension is a major risk factor for albuminuria [5]. Also, STK39 gene encodes a serine/threonine kinase that is thought to function in the cellular stress response pathway. This gene is obviously associated with HT and has been reported many times [8, 44]. Besides these three genes, other genes such as C5, SCOC, SLC25A13 and KRT4 are potential HT candidate genes, because these genes appear in important signal transduction in related HT pathway, which have been reported in [47].

Figure 14
figure 14

Identified k-VCCs with k = 10 and k = 15 on genetic interaction network of HT

8 Conclusion

Component detection is a fundamental problem in network analysis and has attracted intensive interests. Most existing component detection methods suffer from the low connectivity issue, whose results will contain irrelevant subgraphs. In this paper, we propose the k-vertex connected component model, which focuses on the vertex connectivity of networks. We study the k-VCC detection problem. First, we develop exact top-down and heuristic bottom-up frameworks for k-VCC detection. Further, we propose the hybrid framework, which takes advantages of the above two frameworks. Although the hybrid framework is a little slower than the bottom-up framework, it can find the exact results. Through extensive experiments on large real and synthetic networks, we verify that the detected k-VCCs by both exact and heuristic frameworks are very effective, which largely reduce the free-ride effect, and the bottom-up and hybrid frameworks are much more efficient.