1 Introduction

Nodes in a network can have different importance with respect to different network measures and behaviors. Finding these nodes, called critical nodes, is an essential computational task. Critical nodes can also be approached from the perspective of the general node deletion problem [14], a large class that includes, among others, the vertex separator problem, the minimum vertex cover problem, and the critical node detection problem. Recently, the critical node detection problem (CNDP) has gained attention due to its wide applicability. A very important variant of the critical node detection problem is to identify a set of nodes of bounded size whose removal from the graph maximizes the number of connected components. Applications of this problem can be found in epidemic control and immunization strategies, social networks, biology, telecommunications, etc.

In general, the critical node detection problem consists in finding a set of nodes in a given graph \(G=(V,E)\) whose deletion maximally degrades the graph according to a given measure \(\sigma \). CNDP is a central problem in network analysis with applications in several research fields, such as biology [2], network vulnerability [6], and social network analysis [3]. Regarding the measure \(\sigma \), several studies focus on network centrality measures, such as betweenness centrality, closeness centrality, and PageRank [11, 16].

Although several variants of the CNDP exist, only a few studies deal with computational methods for the variant that removes k nodes in order to maximize the number of remaining components. The main goal of this paper is to approach this problem using a genetic algorithm with minimal problem-specific adaptations. A genetic algorithm was chosen first because of the natural binary encoding of an individual, but this is not the only reason: we believe that it is important to explore different methods and not to assume that a method will not work on a certain problem merely because it has not been tested on it. This also motivates the choice of operators: if there is no need for specific operators that use domain knowledge, we should not use them, keeping the approach as general and as flexible as possible.

The rest of the paper is organized as follows: the next section presents the problem and reviews some existing approaches. The third section describes the proposed genetic algorithm. In the fourth section, numerical experiments on synthetic and real-world networks are used to compare our results with existing ones. The article ends with conclusions and further work.

2 Related Work

Many variants of the critical node detection problem are studied in the literature, among which we mention: minimizing the pairwise connectivity by deleting k nodes (the most studied variant), minimizing the largest component size by deleting k nodes, and bounding the pairwise connectivity by a given threshold while deleting a minimal set of nodes. A recent survey of the problem can be found in [13].

There are several ways to classify the critical node detection problem (CNDP). In [21] a two-type classification is adopted: CNDP type 1 problems aim to minimize the network connectivity while keeping the number of removed nodes under a given threshold, and CNDP type 2 problems aim to minimize the number of removed nodes such that the network connectivity reaches a given threshold. The connectivity measure used depends on the envisaged application, the desired effect, or the type of network. Applications are numerous, as the CNDP is related to network sustainability and vulnerability [21]. Many practical approaches have been devised for wireless sensor networks [7, 8, 18].

In [21] an exact algorithm for the problem considering the largest connected component is proposed. The k-vertex cut problem, which consists in finding the minimum weight subset whose removal disconnects the graph into at least k components, is studied in [9]. The Component-Cardinality-Constrained Critical Node Problem (3C-CNP) is approached in [12]. A bi-objective design is presented in [25]. As for the type of network, weighted networks are studied, for example, in [5] and directed graphs in [19].

In [1] the two types of CNDP problems are studied in three versions, including kMaxComp, the problem of removing a set of at most k nodes to maximize the number of connected components in the remaining graph. This is one of the less studied CNDP variants, and it has been proven to be NP-hard [24]. In [24] a mixed integer linear programming approach is presented, while [27] presents a general integer programming framework. For special classes of graphs (trees and series-parallel graphs) a dynamic programming approach is presented in [23]. In [1] a genetic algorithm is designed to solve the problem. That genetic algorithm incorporates in the fitness function a penalization of solutions that are too close to the best solutions, combines a greedy strategy with variation operators, and employs a local search mechanism at the end in order to refine solutions.

In this paper we focus on the problem \(CDNP^3_a\), denoted here as kMaxComp, introduced in [23, 24]. The \(CDNP^3_a\) is by itself an interesting problem to study, with many possible applications. It has received less attention because it does not impose any conditions on the connected components. The problem consists in removing at most k nodes such that the number of remaining components is maximal. Formally, if S denotes the set of deleted nodes, and \(\mathcal {H}(G[V\setminus S])\) denotes the set of maximal connected components of graph G without the nodes in S, the optimization problem consists in

$$\begin{aligned} \max \ |\mathcal {H}(G[V\setminus S])|, \quad \text {such that } |S|\le k, \end{aligned} \tag{1}$$

where |A| denotes the cardinality of set A.
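To make Eq. (1) concrete, the following Python sketch counts the connected components left after deleting a set S and solves kMaxComp by exhaustive enumeration on a toy graph. It is an illustration only and not part of the method proposed here: the function names, the adjacency-dictionary representation, and the example graph are assumptions, and enumeration is feasible only for very small graphs.

```python
from itertools import combinations


def count_components(adj, removed):
    """Number of connected components of the graph after deleting `removed`."""
    removed = set(removed)
    seen = set()
    components = 0
    for start in adj:
        if start in removed or start in seen:
            continue
        components += 1  # `start` opens a new, not yet visited component
        seen.add(start)
        stack = [start]
        while stack:  # depth-first traversal restricted to surviving nodes
            node = stack.pop()
            for nb in adj[node]:
                if nb not in removed and nb not in seen:
                    seen.add(nb)
                    stack.append(nb)
    return components


def brute_force_kmaxcomp(adj, k):
    """Try every S with |S| <= k and keep the one leaving the most components."""
    best_set, best_value = (), count_components(adj, ())
    for size in range(1, k + 1):
        for subset in combinations(adj, size):
            value = count_components(adj, subset)
            if value > best_value:
                best_set, best_value = subset, value
    return set(best_set), best_value


# Hypothetical toy graph: a star with centre 0 and leaves 1..4.
star = {0: [1, 2, 3, 4], 1: [0], 2: [0], 3: [0], 4: [0]}
print(brute_force_kmaxcomp(star, 1))
```

On this star graph, deleting the centre alone isolates the four leaves, so a budget of k = 1 already yields four components.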

3 Maximum Components GA (MaxC-GA)

The goal of this work is to solve the kMaxComp problem by using a minimal amount of problem-specific information during the search. Because we search for a set of nodes from a network, some of which will be included in the critical set S and some not, a binary encoding of an individual of length \(N=|V|\) is natural, making a genetic algorithm the first choice for approaching this problem. We call this algorithm Maximum Components GA (MaxC-GA); it is outlined in Algorithm 1. MaxC-GA is a simple approach for the \(CDNP^3_a\) problem that combines a standard GA with a constraint handling method based on the marginal contribution of a node to the fitness of an individual. This concept is borrowed from game theory, where such marginal contributions are used to evaluate the contribution of a player to the value of a coalition when computing the Shapley value [22].

Encoding. An individual has length N equal to the number of nodes in the network. The value 1 on position i indicates that node i is included in S.

Variation Operators. Two point crossover and flip-bit mutation are used.
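As an illustration of these two standard operators on the binary encoding, a minimal Python sketch is given below. The function names, the `rng` argument, and the choice of two distinct cut points are assumptions made for the example; the text above does not fix these details.

```python
import random


def two_point_crossover(parent_a, parent_b, rng=random):
    """Swap the segment between two distinct cut points of the parents."""
    n = len(parent_a)
    i, j = sorted(rng.sample(range(n + 1), 2))  # cut points with i < j
    child_a = parent_a[:i] + parent_b[i:j] + parent_a[j:]
    child_b = parent_b[:i] + parent_a[i:j] + parent_b[j:]
    return child_a, child_b


def flip_bit_mutation(individual, p_bit, rng=random):
    """Flip each bit independently with probability `p_bit`."""
    return [1 - gene if rng.random() < p_bit else gene for gene in individual]


rng = random.Random(0)  # seeded only to make the example repeatable
a, b = [0] * 8, [1] * 8
c1, c2 = two_point_crossover(a, b, rng)
print(c1, c2, flip_bit_mutation(c1, 0.05, rng))
```

Both operators return new lists and leave the parents untouched, so they can be applied to individuals that are still referenced by the current population.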

Selection. Tournament selection is used to select individuals for recombination and mutation.
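Tournament selection can be sketched in a few lines of Python. The version below draws the contestants without replacement and returns the fittest one; this sampling scheme, the function name, and the arguments are assumptions for illustration, since the text only names the operator.

```python
import random


def tournament_select(population, fitnesses, tournament_size, rng=random):
    """Pick `tournament_size` random individuals and return the fittest."""
    contestants = rng.sample(range(len(population)), tournament_size)
    winner = max(contestants, key=lambda idx: fitnesses[idx])
    return population[winner]


# Hypothetical population of three individuals with made-up fitness values.
population = [[0, 1, 0], [1, 0, 0], [0, 0, 1]]
fitnesses = [1, 5, 3]
print(tournament_select(population, fitnesses, 2, random.Random(1)))
```

A larger tournament size increases selection pressure, because weak individuals are less likely to win a tournament.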

Fitness Function. The fitness of an individual is computed as the number of connected components that remain after removing its nodes with value 1. Thus, if individual x encodes the critical set \(S_x\), then the fitness f(x) of x is computed as

$$\begin{aligned} f(x)=|\mathcal {H}(G[V\setminus S_x])|. \end{aligned} \tag{2}$$
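Eq. (2) can be evaluated directly on the binary encoding. The Python sketch below uses a union-find structure over the surviving nodes; the edge-list representation, the node labels 0..N-1, and the function name are assumptions made for this example and may differ from the implementation used in the experiments.

```python
def fitness(individual, edges):
    """Eq. (2): components left after deleting the nodes flagged with 1."""
    n = len(individual)
    parent = list(range(n))

    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]  # path halving
            v = parent[v]
        return v

    components = individual.count(0)  # every surviving node starts alone
    for u, v in edges:
        if individual[u] == 1 or individual[v] == 1:
            continue  # edges touching a deleted node disappear
        root_u, root_v = find(u), find(v)
        if root_u != root_v:
            parent[root_u] = root_v  # merging two components
            components -= 1
    return components


# Hypothetical toy graph: the path 0-1-2-3-4.
path_edges = [(0, 1), (1, 2), (2, 3), (3, 4)]
print(fitness([0, 1, 0, 1, 0], path_edges))
```

Deleting nodes 1 and 3 of the path leaves nodes 0, 2, and 4 isolated, so the individual in the example has fitness 3.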

Constraint Handling. In order to ensure that the size of the corresponding set S does not exceed k, before evaluation each individual is constrained to have at most k nodes with value 1, by removing from S the nodes with the lowest marginal contribution to the fitness of the individual. The marginal contribution of a node to the fitness of the individual is computed as the difference between the fitness of the individual and the fitness of the individual with that node removed from its corresponding set S of critical nodes. For a node i with value 1 in individual x with corresponding critical set \(S_x\), the marginal contribution of node i to the fitness of x, denoted by \(u_i(x)\), is:

$$u_i(x)=f(x)-|\mathcal {H}(G[V\setminus (S_x\setminus \{i\})])|, $$

where f(x) is the fitness defined in Eq. (2).
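The repair step can be sketched in Python as follows. The sketch assumes that nodes are dropped one at a time, that marginal contributions are recomputed after each removal, and that ties are broken by the lowest node index; none of these details is specified above, and the function names and graph representation are likewise hypothetical.

```python
def count_components(adj, removed):
    """Number of connected components after deleting the nodes in `removed`."""
    seen, components = set(), 0
    for start in adj:
        if start in removed or start in seen:
            continue
        components += 1
        seen.add(start)
        stack = [start]
        while stack:
            node = stack.pop()
            for nb in adj[node]:
                if nb not in removed and nb not in seen:
                    seen.add(nb)
                    stack.append(nb)
    return components


def repair(individual, adj, k):
    """Return a copy of `individual` with at most k bits set to 1."""
    x = list(individual)
    while sum(x) > k:
        s_x = {i for i, bit in enumerate(x) if bit == 1}
        f_x = count_components(adj, s_x)
        # u_i(x) = f(x) - fitness of x with node i taken out of S_x
        u = {i: f_x - count_components(adj, s_x - {i}) for i in s_x}
        worst = min(u, key=lambda i: (u[i], i))  # assumed tie-break: lowest index
        x[worst] = 0
    return x


# Hypothetical toy graph: a star with centre 0 and leaves 1..4.
star = {0: [1, 2, 3, 4], 1: [0], 2: [0], 3: [0], 4: [0]}
print(repair([1, 1, 1, 0, 0], star, 1))
```

In the example, the leaves 1 and 2 have negative marginal contributions (putting a leaf back adds an isolated component once the centre is deleted), so they are dropped first and only the centre remains in the critical set.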

Algorithm 1. MaxC-GA.

Parameters. MaxC-GA is a standard GA, and uses typical GA parameters: maximum number of generations, crossover and mutation probabilities, probability to mutate a bit, and tournament size. The effect of these parameters on the search results of a GA has been widely documented [15].

4 Numerical Experiments

The behavior of MaxC-GA is illustrated by using several benchmarks and comparing its results with the best known results found in the literature for this problem.

Benchmarks. A set of synthetic benchmarks was proposed in [26]. The benchmark set contains four different types of graphs: Barabási-Albert (BA), Erdős-Rényi (ER), Forest-fire (FF), and Watts-Strogatz (WS) graphs. BA graphs are scale-free networks, ER graphs are random networks, FF graphs simulate how fire spreads through a forest, and WS graphs are small-world graphs with a dense structure.

Table 1 presents some basic measures of the benchmarks used for numerical experiments here: number of nodes (|V|), number of edges (|E|), average degree (\(\langle d \rangle \)), density of the graph (\(\rho \)), and average path length (\(l_G\)). In a similar manner, real networks are described in Table 2 with a reference added for each network.
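For reference, these measures can be computed as in the Python sketch below. It assumes an undirected simple graph, the usual definitions \(\langle d \rangle = 2|E|/|V|\) and \(\rho = 2|E|/(|V|(|V|-1))\), and an average path length taken over all ordered pairs of distinct, mutually reachable nodes; the exact definitions behind the tables are not stated here, so these are assumptions, as are the function name and the toy graph.

```python
from collections import deque


def basic_measures(adj):
    """Basic measures of an undirected simple graph given as adjacency lists."""
    n = len(adj)
    m = sum(len(nbs) for nbs in adj.values()) // 2  # each edge is listed twice
    total_distance, reachable_pairs = 0, 0
    for source in adj:  # breadth-first search from every node
        dist = {source: 0}
        queue = deque([source])
        while queue:
            node = queue.popleft()
            for nb in adj[node]:
                if nb not in dist:
                    dist[nb] = dist[node] + 1
                    queue.append(nb)
        total_distance += sum(dist.values())
        reachable_pairs += len(dist) - 1
    return {
        "nodes": n,
        "edges": m,
        "avg_degree": 2 * m / n,
        "density": 2 * m / (n * (n - 1)),
        "avg_path_length": total_distance / reachable_pairs if reachable_pairs else 0.0,
    }


# Hypothetical toy graph: the path 0-1-2.
measures = basic_measures({0: [1], 1: [0, 2], 2: [1]})
print(measures)
```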

Table 1. Synthetic benchmark test graphs and basic properties.
Table 2. Real-world graphs and basic properties.
Table 3. Maximum fitness values for the tested problems. The average over 10 runs is presented for MaxC-GA.

Parameter Settings. Several parameter settings are tested: population size 25 and 50, a maximum of 500 generations, crossover probability 0, 0.5, 0.8, and 1, and mutation rate 0, 0.01, 0.02, 0.03, 0.04, and 0.05.
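The tested values can be enumerated as a parameter grid, as in the Python sketch below. It assumes that every combination of the listed values is run (a full factorial design), which the text does not state explicitly; the dictionary keys are hypothetical names.

```python
from itertools import product

population_sizes = [25, 50]
crossover_probs = [0, 0.5, 0.8, 1]
mutation_rates = [0, 0.01, 0.02, 0.03, 0.04, 0.05]
max_generations = 500

# One dictionary per setting; the key names are chosen for this example only.
settings = [
    {
        "pop_size": pop_size,
        "p_crossover": p_crossover,
        "mutation_rate": mutation_rate,
        "max_generations": max_generations,
    }
    for pop_size, p_crossover, mutation_rate in product(
        population_sizes, crossover_probs, mutation_rates
    )
]
print(len(settings))
```

Under the full factorial assumption this gives 2 x 4 x 6 = 48 settings per benchmark.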

Fig. 1. Search evolution of MaxC-GA for the benchmarks: average best fitness over 10 runs for population sizes of 25 and 50.

Fig. 2. Box plots presenting results reported by MaxC-GA for the nine benchmarks and different parameter values.

Results and Discussion. MaxC-GA is compared with three algorithms described in [1]: two greedy algorithms, \(G_1\), based on node deletion from the candidate critical node set, and \(G_2\), based on node addition to the candidate critical node set, and a genetic algorithm from an evolutionary algorithm framework using greedy rules (denoted by GA). This genetic algorithm uses a specific fitness function that combines the number of connected components determined by the individual with previous search information, problem-specific variation operators, and a specifically designed local search technique. Since the problem has received little attention, we only have one GA-based approach to compare with, and its reported results come from a single run. The results presented in this paper are preliminary and promising, supporting the idea that this approach may be extended to larger data sets.

Because the results presented in [1] include only the maximum number of connected components obtained in one run, statistical comparisons with them are not possible. Table 3 includes these results as well as the best results reported by MaxC-GA. Results reported by MaxC-GA using different parameter settings are illustrated in Fig. 2. Furthermore, Fig. 1 illustrates the evolution of the search of MaxC-GA (average best solutions over 10 runs). We find that the evolution is steady, faster for a larger population size, and that MaxC-GA is capable of finding and maintaining the optimal solution. Because the behavior of MaxC-GA under different parameter settings is typical of a GA, with respect to convergence we present only graphs showing that it is capable of detecting and maintaining the optimal solution during one run. In all other ways it behaves as expected: a larger population size leads to earlier convergence at a higher computational cost, and a small population will eventually converge.

The effect of various parameter settings is presented in Fig. 2 as box plots of the ratio between the maximum fitness values reported in 10 runs for each parameter setting and the best known result for the benchmark (in order to keep all values between 0 and 1). We find that the algorithm is robust with respect to variation of parameters, with the notable exception that mutation plays an important role in the search, as setting the mutation rate to 0 significantly decreases the performance of the algorithm.
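The normalization used for the box plots amounts to a simple division, sketched below in Python. The function name and the run values are hypothetical and are not taken from the experiments; the ratios stay in [0, 1] only as long as no run exceeds the best known result.

```python
def fitness_ratios(max_fitness_per_run, best_known):
    """Ratio of each run's maximum fitness to the best known result."""
    if best_known <= 0:
        raise ValueError("best_known must be positive")
    return [value / best_known for value in max_fitness_per_run]


# Made-up maximum fitness values of 10 runs for one parameter setting.
runs = [40, 42, 42, 41, 42, 39, 42, 42, 40, 42]
ratios = fitness_ratios(runs, best_known=42)
print(min(ratios), max(ratios))
```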

5 Conclusions

The critical node detection problem is approached with MaxC-GA, a simple genetic algorithm that uses a node fitness based on marginal contributions for constraint handling. Numerical results show that this approach is as effective as other, more complex approaches that use more problem-specific information.

These results may also be used to advocate for the use of minimal problem-specific information in designing new evolutionary algorithms for real-world applications. Overusing problem-specific information decreases the adaptability of a method, as practitioners will rarely try to adapt an existing algorithm from the literature to a slightly different problem, mainly because the stochastic nature of these approaches does not guarantee direct portability to a different problem.