
1 Introduction

The Cluster Editing (CE) problem [5], also referred to in the literature as Correlation Clustering [2], asks to transform an undirected graph \( G \), by a minimum number of editions, i.e., insertions or deletions of edges, into a vertex-disjoint union of cliques. Figure 1 shows a CE instance where three editions are made on the graph, resulting in two disjoint cliques represented by the vertex sets {A, B, C, D} and {E, F, G}.

Fig. 1.

Cluster Editing instance. Solid lines represent edges that are maintained, dashed lines are removed edges and dotted lines are edges that were inserted.

This problem has been considered in the context of bioinformatics [4, 6], document clustering [2], image segmentation [14], consensus clustering [1, 10], and qualitative data clustering [10, 11]. The CE problem is NP-hard [2, 19]; thus, heuristics, approximations, and data reduction methods have been proposed in the literature [3, 9, 12, 18].

Grotschel and Wakabayashi [11] introduced an ILP model for CE. They derived it from a mathematical analysis of the corresponding problem polytope, proposing several partition inequalities for this purpose. As there is an exponential number of these inequalities, they followed a cutting plane approach in which an inequality is added to the linear program only if the current fractional solution violates it.

The ILP formulation creates models with a large number of constraints, which limits the size of the problems that can be solved optimally. Therefore, we propose a preprocessing technique to construct a reduced model. In comparison to the original model, the reduced model preserves the optimal solution and achieves a remarkable speedup in computational time in experiments performed on different datasets.

This work is organized as follows. In Sect. 2 the problem is defined and the integer linear programming formulation is presented. Section 3 presents the new preprocessing technique. The computational experiments are provided in Sect. 4. Finally, conclusions are drawn in Sect. 5.

2 Cluster Editing via ILP

The CE problem can be formulated as a maximization or minimization problem [7]. This paper considers a minimization version of graph clustering where the goal is to minimize the number of editions (edges deleted between clusters plus the number of edges inserted inside clusters).

The following notation is introduced to explain the proposed preprocessing technique. Let \( G = \left( {V,E} \right) \) be an undirected graph, where \( V \) is the set of vertices and \( E \) is the set of edges. The edge weight \( w_{ij} \) is 1 if the edge \( \{ i,j\} \) exists in the graph and –1 if it is missing. The number of vertices in the graph is \( n = \left| V \right| \), while \( m = \left| E \right| \) represents its number of edges.

The following ILP formulation can be used for cluster editing [6, 11]:

$$ \left( {CE_{ILP} } \right):{\text{ Minimize}}\;\sum\nolimits_{e \in E} {w\left( e \right)} {-}\sum\nolimits_{i < j} {w_{ij} x_{ij} } $$
$$ {\text{subject}}\;{\text{ to}}\quad x_{ij} + x_{jk} - x_{ik} \le 1\quad i < j < k $$
(1)
$$ x_{ij} - x_{jk} + x_{ik} \le 1 \quad i < j < k $$
(2)
$$ - x_{ij} + x_{jk} + x_{ik} \le 1 \quad i < j < k $$
(3)
$$ x_{ij} \in \left\{ {0,1} \right\} \quad i,j \in \left[ {1..n} \right] $$

where \( x_{ij} = 1 \) if vertices i and j belong to the same cluster in the solution and \( x_{ij} = 0 \) otherwise.

The edge editions are recovered from a (CEILP) solution in the following way: an edge is inserted if \( x_{ij} = 1 \) for \( w_{ij} = - 1 \), and an edge is removed if \( x_{ij} = 0 \) for \( w_{ij} = 1 \). Therefore, the objective function returns the minimum number of editions.
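This recovery step can be sketched in a few lines (an illustrative helper of ours, not part of the original implementation; `w` and `x` are dictionaries keyed by vertex pairs):

```python
def count_editions(w, x):
    """Count the edge editions implied by a (CEILP) solution.

    w and x are dicts keyed by vertex pairs (i, j) with i < j:
    w[(i, j)] is +1 for an existing edge and -1 for a missing one;
    x[(i, j)] is 1 if i and j end up in the same cluster, else 0."""
    inserted = sum(1 for e in w if w[e] == -1 and x[e] == 1)
    removed = sum(1 for e in w if w[e] == +1 and x[e] == 0)
    return inserted + removed
```

For a triangle with two graph edges and one missing edge, any feasible clustering forces exactly one edition, which is what this counter reports.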

Constraints (1–3) are called “transitivity constraints”. Constraint (1), for instance, enforces that if vertex i is in the same cluster as vertex j, and j is in the same cluster as k, then i is in the same cluster as k.
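To make the size of the constraint system concrete, the following sketch enumerates constraints (1)–(3) for every triple \( i < j < k \) (the function name and the coefficient representation are ours, not from the original implementation):

```python
from itertools import combinations

def transitivity_constraints(n):
    """Enumerate constraints (1)-(3) of (CEILP) for all triples i < j < k.

    Each constraint is returned as (coefficients, variables), meaning
    sum(c * x[v] for c, v in zip(coefficients, variables)) <= 1."""
    constraints = []
    for i, j, k in combinations(range(1, n + 1), 3):
        pairs = [(i, j), (j, k), (i, k)]
        constraints.append((( 1,  1, -1), pairs))  # constraint (1)
        constraints.append((( 1, -1,  1), pairs))  # constraint (2)
        constraints.append(((-1,  1,  1), pairs))  # constraint (3)
    return constraints
```

For a graph with n = 7 vertices, this yields 3 · C(7, 3) = 105 constraints, which matches the count reported for the Fig. 1 instance in Sect. 3.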

3 Preprocessing Technique

The (CEILP) model has \( O\left( {n^{2} } \right) \) variables and \( O\left( {n^{3} } \right) \) constraints, which creates models with a large number of redundancies. Though not critical to the correctness of the model, these redundant constraints affect the solver efficiency and might make the solver unusable for large instances.

Grotschel and Wakabayashi [11], in 1989, were the first to propose a strategy to deal with such redundancy. They created a cutting plane algorithm that identifies violated constraints while solving a relaxed version of the problem. It was confirmed experimentally that, for many instances, the cutting plane algorithm terminates with only a small fraction of the transitivity constraints. This fact evidences the great redundancy of transitivity constraints in the original model (CEILP).

Other techniques, introduced in the context of the clique partitioning problem [16] and of modularity optimization in complex networks [8], try to identify redundancies in the ILP model by analyzing the graph representation of the clustering instance. Recently, Nguyen et al. [17] generalized the approach of Dinh et al. [8] to some constrained clustering problems. All those techniques were capable of reducing the number of transitivity constraints from \( O\left( {n^{3} } \right) \) to \( O\left( {nm} \right) \).

In the context of Cluster Editing, Bocker et al. [5, 6] introduced techniques that focus on reducing the problem size instead of improving the ILP model. They identify patterns that can be removed from the problem instance without changing the groups obtained in the optimal solution. The reduced problem is solved exactly by using the cutting plane algorithm proposed by Grotschel and Wakabayashi [11] and a fixed-parameter branching algorithm. The experimental results show that those techniques were capable of solving large graphs with 1000 vertices and several thousand edge modifications.

Here a new approach is introduced, focusing on the identification of redundancy in the ILP model. As in [8, 16], redundancies are identified by analyzing the graph representation of the clustering instance. Our approach is based on the identification of triangles formed by edges (graph edges and missing edges) that correspond to the transitivity constraints of the model (CEILP).

Figure 2 presents the possible edge weight distributions within the triangles considered by the transitivity constraints of the (CEILP) model. The analysis of such triangles helps to identify constraints that do not need to be included in the model, as they do not result in editions.

Fig. 2.

Possible edge weight distributions.

Transitivity constraints corresponding to triangles of types T1, T3 and T4 do not need to be considered, for the following reasons:

  • T1: all edge weights are positive (\( w_{ij} = w_{jk} = w_{ik} = 1 \)); the optimization objective of (CEILP) drives all variables to \( x_{ij} = x_{jk} = x_{ik} = 1 \), and this assignment satisfies all transitivity constraints.

  • T3: only one edge weight is positive; the optimization objective of (CEILP) sets the variable relative to the positive edge weight to 1 and the remaining variables to 0, and this assignment satisfies all transitivity constraints.

  • T4: all edge weights are negative (\( w_{ij} = w_{jk} = w_{ik} = - 1 \)); the optimization objective of (CEILP) drives all variables to \( x_{ij} = x_{jk} = x_{ik} = 0 \), and this assignment trivially satisfies all transitivity constraints.

Constraints that deal with triangles of type T2 must be kept, as a transitivity constraint is violated when the two variables corresponding to the positive edge weights are set to 1 while the variable corresponding to the negative edge weight is set to 0. To satisfy this constraint, the variable corresponding to the negative edge weight can be set to 1, which leads to one edge edition (an insertion).

There is another possibility: removing an edge of the graph. These circumstances are considered in Fig. 3. There are three possible variants of such a triangle, depending on the vertex order:

Fig. 3.

Possible edge weight distributions for triangles of type T2.

  • T2A: considering the optimization objective of (CEILP), the best editing possibilities (one edition each) that satisfy all transitivity constraints are the following:

    • \( x_{ij} = x_{jk} = x_{ik} = 1\quad \left( {1\,{\text{edge inserted}}} \right) \)

    • \( x_{ij} = 1,\;x_{jk} = x_{ik} = 0\quad \left( {1\,{\text{edge removed}}} \right) \)

    • \( x_{jk} = 1,\;x_{ij} = x_{ik} = 0\quad \left( {1\,{\text{edge removed}}} \right) \)

  • T2B: considering the optimization objective of (CEILP), the best editing possibilities (one edition each) that satisfy all transitivity constraints are the following:

    • \( x_{ij} = x_{jk} = x_{ik} = 1\quad \left( {1\,{\text{edge inserted}}} \right) \)

    • \( x_{ij} = 1,\;x_{jk} = x_{ik} = 0\quad \left( {1\,{\text{edge removed}}} \right) \)

    • \( x_{ik} = 1,\;x_{ij} = x_{jk} = 0\quad \left( {1\,{\text{edge removed}}} \right) \)

  • T2C: considering the optimization objective of (CEILP), the best editing possibilities (one edition each) that satisfy all transitivity constraints are the following:

    • \( x_{ij} = x_{jk} = x_{ik} = 1\quad \left( {1\,{\text{edge inserted}}} \right) \)

    • \( x_{ik} = 1,\;x_{ij} = x_{jk} = 0\quad \left( {1\,{\text{edge removed}}} \right) \)

    • \( x_{jk} = 1,\;x_{ij} = x_{ik} = 0\quad \left( {1\,{\text{edge removed}}} \right) \)
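The case analysis above can be checked mechanically. A minimal sketch (the helper name is ours), using the T2A weight pattern \( w_{ij} = w_{jk} = + 1 \), \( w_{ik} = - 1 \):

```python
def violates_transitivity(x_ij, x_jk, x_ik):
    """Return True if the assignment violates any of constraints (1)-(3)."""
    return ( x_ij + x_jk - x_ik > 1 or
             x_ij - x_jk + x_ik > 1 or
            -x_ij + x_jk + x_ik > 1)

# Zero editions is infeasible: keeping both graph edges while leaving
# the missing edge out violates constraint (1).
assert violates_transitivity(1, 1, 0)
# Each of the three single-edition options listed above is feasible.
assert not violates_transitivity(1, 1, 1)  # 1 edge inserted
assert not violates_transitivity(1, 0, 0)  # 1 edge removed
assert not violates_transitivity(0, 1, 0)  # 1 edge removed
```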

T2 triangles are also known as conflict triples in the literature and are at the root of some data reduction methods [5, 6]. In this paper, the information on the edge weight distribution within triangles is used directly to identify which transitivity constraints should be included while constructing the ILP model. The data is not reduced or modified: only the transitivity constraints corresponding to T2 triangles need to be taken into account. Hence, the following reduced model is proposed:

$$ \left( {CER_{ILP} } \right): \, \,{\text{Minimize}}\,\sum\nolimits_{e \in E} {w\left( e \right)} {-}\sum\nolimits_{i < j} {w_{ij} x_{ij} } $$
$$ {\text{subject}}\,{\text{ to}}\quad x_{ij} + x_{jk} - x_{ik} \le 1 \quad i,\;j,\;k \in S1 $$
(4)
$$ x_{ij} - x_{jk} + x_{ik} \le 1 \quad i,\;j,\;k \in S2 $$
(5)
$$ - x_{ij} + x_{jk} + x_{ik} \le 1\quad i,\;j,\;k \in S3 $$
(6)
$$ x_{ij} \in \left\{ {0,1} \right\} \;i,\;j \in \left[ {1..n} \right] $$

where

$$ S1 = \{ i < j < k | w_{ij} = + 1 \wedge w_{jk} = + 1 \wedge w_{ik} = - 1\} $$
$$ S2 = \{ i < j < k | w_{ij} = + 1 \wedge w_{jk} = - 1 \wedge w_{ik} = + 1\} $$
$$ S3 = \{ i < j < k | w_{ij} = - 1 \wedge w_{jk} = + 1 \wedge w_{ik} = + 1\} $$

The sets S1, S2, and S3 enforce that only constraints corresponding to T2 triangles are considered while creating the model. This can be seen as a preprocessing technique that produces a model (CERILP) with a small number of constraints in comparison to the original model (CEILP). For instance, for the example depicted in Fig. 1, the model (CEILP) has 105 constraints, in contrast to the model (CERILP), which has only 11. Both models find the optimal number of editions, but the reduced model achieves a speedup of 1.89 in computational time.
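The construction of S1, S2, and S3 amounts to a single pass over all triples. The sketch below is illustrative (the tiny path instance is ours, not the Fig. 1 graph):

```python
from itertools import combinations
from math import comb

def reduced_constraint_triples(n, w):
    """Collect the conflict triples (T2 triangles) into S1, S2, S3.

    w maps each vertex pair (i, j), i < j, to +1 (edge) or -1 (missing)."""
    S1, S2, S3 = [], [], []
    for i, j, k in combinations(range(1, n + 1), 3):
        pattern = (w[(i, j)], w[(j, k)], w[(i, k)])
        if pattern == (1, 1, -1):
            S1.append((i, j, k))   # gets constraint (4)
        elif pattern == (1, -1, 1):
            S2.append((i, j, k))   # gets constraint (5)
        elif pattern == (-1, 1, 1):
            S3.append((i, j, k))   # gets constraint (6)
    return S1, S2, S3

# Tiny example: the path 1-2-3 plus an isolated vertex 4.
w = {p: -1 for p in combinations(range(1, 5), 2)}
w[(1, 2)] = w[(2, 3)] = 1
S1, S2, S3 = reduced_constraint_triples(4, w)
full = 3 * comb(4, 3)                    # (CEILP): 12 constraints
reduced = len(S1) + len(S2) + len(S3)    # (CERILP): 1 constraint
```

Here the only conflict triple is (1, 2, 3), so the reduced model keeps 1 of the 12 original transitivity constraints.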

4 Experimental Results

The experiments and algorithms were coded in C++14 and executed on a computer with the following configuration: Intel Core i7-6770HQ (3.5 GHz) with 32 GB RAM running Windows 10 64-bit. The commercial solver IBM ILOG CPLEX 12.7.1 [13] was used to solve the ILP models.

The following datasets were used to compare the performance and the quality of the solution of the proposed model (CERILP) against the original model (CEILP):

  • LFR benchmark networks. Networks created with the benchmark developed by Lancichinetti-Fortunato-Radicchi (LFR) [15]. The following parameters were used: number of vertices \( n = \left\{ {50,\;100,\;200} \right\} \), the average degree was set to 5 and the maximum degree was set to 10. The default values were used for the remaining parameters. Networks with increasing values for the mixing parameter (µ) were used to blur the community distinction.

  • Random unweighted graphs. Proposed by [6] in the following manner: given a number of vertices n and a parameter k, an integer \( i \in \left[ {1,n} \right] \) is uniformly selected, defining a cluster with \( i \) vertices. The process continues with the remaining \( n \leftarrow n - i \) vertices until \( n \le 5 \) holds; in this case, all remaining vertices are assigned to the last cluster. Finally, a value \( k^{{\prime }} \le k \) is used to perform uniform random editions (edge additions/removals). This dataset can be found onlineFootnote 1. Datasets with sizes \( n = \left\{ {100,\;200,\;300,\;1000,\;1500,\;2000} \right\} \) were selected for the experiments.
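The generation procedure described above can be sketched as follows. This is a loose interpretation of ours; the exact sampling details and the edition step in [6] may differ:

```python
import random
from itertools import combinations

def random_cluster_graph(n, k, seed=None):
    """Sketch of the generator: partition n vertices into clusters of
    uniformly drawn sizes, build the disjoint union of cliques, then
    apply up to k random editions (edge toggles)."""
    rng = random.Random(seed)

    # Draw cluster sizes until at most 5 vertices remain; the leftover
    # vertices are assigned to the last cluster.
    sizes, remaining = [], n
    while remaining > 5:
        size = rng.randint(1, remaining)
        sizes.append(size)
        remaining -= size
    if remaining > 0:
        if sizes:
            sizes[-1] += remaining
        else:
            sizes.append(remaining)

    # Build the edge set of the vertex-disjoint union of cliques.
    edges, start = set(), 0
    for size in sizes:
        edges.update(combinations(range(start, start + size), 2))
        start += size

    # Apply up to k editions: toggle the existence of random vertex pairs.
    all_pairs = list(combinations(range(n), 2))
    for pair in rng.sample(all_pairs, min(k, len(all_pairs))):
        edges.symmetric_difference_update({pair})
    return edges
```

With k = 0 the output is exactly a vertex-disjoint union of cliques, so it contains no conflict triples; each edition then perturbs this ideal cluster graph.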

4.1 Experiments with LFR Networks

LFR benchmark networks [15] were used to evaluate the performance of the proposed preprocessing technique. Given a number of vertices n, one network with predefined clusters was created for each value of the mixing parameter (µ). The number of edges between clusters increases proportionally to µ; as a consequence, the clusters become more interconnected and the clustering problem more difficult.

Table 1 presents the results obtained for the (CEILP) and (CERILP) models on the LFR benchmark networks. Column \( n \) represents the number of vertices of the graph; \( \mu \) is the mixing parameter; columns \( Obj \), \( \# C \) and \( Time \) are defined for both models and represent, respectively, the objective value, the number of constraints, and the computational time in seconds. Finally, column \( \% C \) represents the percentage of constraints removed from the original model and \( S \) corresponds to the computational time speedup obtained by (CERILP).

Table 1. Results obtained by (CEILP) and (CERILP) on LFR benchmark networks.

It can be observed from Table 1 that (CERILP) achieves better performance than (CEILP) while preserving the optimal number of editions for all the considered instances. This higher performance is due to the large number of redundant transitivity constraints (above 99%) disregarded by the preprocessing technique. Consequently, the computational times are drastically improved, with speedups ranging from 15 to 13755. All instances are solved by (CERILP) in less than 2 s.

Problems with \( n > 300 \) vertices were not tested because the (CEILP) fails to solve them due to lack of memory. It is worth noting that the (CERILP) can solve problems with a higher number of vertices, based on the results presented in Table 1.

4.2 Experiments with Random Unweighted Graphs

Proposed by Bocker et al. [6], these datasets were generated by disturbing an ideal cluster graph with random edge insertions and deletions. Given a number of vertices \( n \), 10 networks with predefined clusters were created for each corresponding \( k \). The values of \( k \) were selected according to the rule \( k = c \cdot n \) with \( c = \left\{ {0.25,\; 0.5, \;1, \;1.25, \;1.5, \;1.75, \;2} \right\}. \)

Datasets with \( n = \left\{ {100,\; 200, \;300} \right\} \) were considered initially for the experiments because (CEILP) failed to solve instances with \( n > 300 \) due to lack of memory. Table 2 presents the results obtained for the (CEILP) and (CERILP) models on such networks. Each row represents the average result for the 10 datasets considering each pair \( \left( {n, k} \right) \). Column \( n \) represents the number of vertices of the graph; \( k \) is the upper limit for the number of editions; columns \( \underline{Obj} \), \( \underline{\# C} \) and \( \underline{Time} \) are defined for both models and represent, respectively, the average objective value, average number of constraints, and computational times in seconds. Finally, column \( \underline{\% C} \) represents the average percentage of constraints removed from the original model and \( \underline{S} \) corresponds to the average computational time speedup obtained by (CERILP).

Table 2. Results obtained by (CEILP) and (CERILP) on random unweighted graphs.

Next, we ran a set of experiments with larger datasets, \( n = \left\{ {1000, \;1500, \;2000} \right\} \), to test the scalability of (CERILP). The objective value cannot be compared to (CEILP), but the percentage of constraint elimination can still be estimated.

Table 3 presents the results obtained for the (CERILP). Each row represents the average result for the 10 datasets and each pair \( \left( {n, k} \right) \). Column \( n \) represents the number of vertices of the graph; k is the upper limit for the number of editions; for (CEILP) only the average number of constraints is presented (\( \underline{\# C} \)). Columns \( \underline{Obj} \), \( \underline{\# C} \) and \( \underline{Time} \) represent the average objective value, average number of constraints and computational times in seconds for the (CERILP), respectively. Finally, column \( \underline{\% C} \) highlights the average percentage of constraints removed from the original model.

Table 3. Results obtained by (CEILP) and (CERILP) on large random unweighted graphs.

From Table 3, it can be verified that a large number of redundant transitivity constraints are disregarded by the preprocessing technique (above 99%). All instances are solved by (CERILP) in less than 23 min. This result shows the scalability of the proposed technique on the datasets proposed by Bocker et al. [6].

5 Conclusions

A novel preprocessing technique for the Cluster Editing problem was proposed in this work. The experimental results showed that the reduced model (CERILP) provided a considerable reduction in the number of transitivity constraints, preserved the optimal solution, and reduced the computational time compared to model (CEILP) for all considered instances.

This technique has the advantage of being complementary to other techniques, such as the cutting plane algorithm proposed by Grotschel and Wakabayashi [11]. The (CERILP) model can provide means to increase the size of the instances that the Integer Linear Programming approach can handle.

Our preprocessing technique might also be combined with other approaches, such as the method for reducing the input graph proposed in [6]. Thus, the preprocessed reduced graph may speed up the solution of the ILP problem even further.

Finally, we expect that the results obtained in the ILP context can be used to guide the construction of better heuristic techniques for the unweighted Cluster Editing problem.