
1 Introduction

Computing a similarity or a dissimilarity measure between graphs is a major challenge in pattern recognition. One of the best-known and most widely used approaches to compute a distance between two graphs is the Graph Edit Distance (GED) [12]. Computing the GED consists in finding a sequence of graph edit operations (insertions, deletions and substitutions of vertices and edges) that transforms a graph into another with a minimal cost. Such a sequence of edit operations is called an edit path, and the edit distance between two graphs G and H is defined by \(GED(G, H)=\min _{\gamma \in \varGamma (G, H)}\sum _{e\in \gamma }c(e)\), where \(\varGamma (G, H)\) denotes the set of edit paths between G and H and c(e) denotes the cost of an elementary operation e belonging to the edit path \(\gamma \).

If both graphs are simple and if the cost between vertices and edges remains fixed, one can show [3] that the edit distance between two graphs G and H of respective orders n and m may be formulated as the following quadratic problem: \(GED(G, H)=\min _{x\in \varPi _{n, m}}\frac{1}{2}x^t\varDelta x+c^tx\), where \(\varPi _{n, m}\) denotes the set of vectorized assignment matrices between \(V_G\) and \(V_H\). Such a matrix x encodes for each element of \(V_G\) one and only one operation (either a substitution or a deletion). In the same way, x encodes for each element of \(V_H\) either a substitution or an insertion. The matrix \(\varDelta \) encodes the cost of edge operations while c encodes the cost of operations on vertices. Computing the GED is NP-hard, so several heuristics have been proposed to compute approximate solutions in polynomial time.

The design of approximate solutions to the GED problem has been strongly stimulated by the introduction of an approximation of the GED problem as a Linear Sum Assignment Problem with Edition (LSAPE) [9]. This approximation consists in associating a substructure to each node of the two graphs G and H, and in populating a cost matrix encoding: the cost of matching two substructures, the cost of inserting a substructure of H and the cost of removing a substructure of G. Given such a cost matrix \(\tilde{c}\), the assignment matrix x minimizing \(\tilde{c}^tx\) provides a set of elementary operations on vertices from which an edit path may be deduced. The cost of this edit path provides an upper bound on the graph edit distance. This minimization step may be solved in polynomial time. This transformation of an \(\mathcal {NP}\)-hard problem into a minimization problem with polynomial complexity is the major advantage of the LSAPE approximation. However, the computation of the cost matrix \(\tilde{c}\) may require non-polynomial execution times. Different types of substructures have been defined in  [4, 8, 9, 14].

However, LSAPE is based on a linear approximation of a quadratic problem. In order to get a finer approximation of the graph edit distance, several methods use variants of local search, such as simulated annealing, to improve an initial estimation of the edit distance  [6, 10, 11]. A slightly different approach [3] consists in applying the Frank-Wolfe minimization scheme [7] to an initial guess. This algorithm converges by iterations towards a local minimum of the quadratic function, usually close to the initial guess. This heuristic provides close approximations of the graph edit distance on small graphs but is sensitive to the solution used to initialize the method. A heuristic to reduce the influence of the initial guess has been proposed in [5]. It is based on the use of multiple initial guesses, either deduced from a set of solutions of the LSAPE problem or obtained by generating random assignment matrices. The common drawback of these two heuristics is that the generation of initial solutions does not take into account the information provided by the runs of the Frank-Wolfe method that have already converged.

In this paper we propose a new heuristic that alternates between the generation of initial solutions and the determination of their associated local minima using the Frank-Wolfe method. This method is described in Sect. 3. The proposed method is evaluated in Sect. 4 through several experiments.

2 Preliminaries

Throughout the paper, we will use the same concepts and notations as those introduced in  [3]. Vertices of graphs G and H are numbered respectively from 1 to n and from 1 to m, and two virtual vertices indexed as \(n+1\) in G and as \(m+1\) in H are added. These virtual vertices, denoted by \(\epsilon _{G}\) and \(\epsilon _{H}\), correspond respectively to insertions and deletions. An assignment \(i\rightarrow \epsilon _{H}\) (resp. \(\epsilon _{G}\rightarrow j\)) corresponds to the deletion of vertex i of G (resp. insertion of vertex j of H).

In this context, a solution for GED will be described as an error-correcting assignment matrix x, where all vertices of G are assigned to a single element of \(H\cup \{\epsilon _{H}\}\), and all vertices of H are assigned to a single element of \(G\cup \{\epsilon _{G}\}\). The polytope \(\varPi _{n, m}^{\epsilon }\) of error-correcting assignment matrices thus contains all matrices of dimensions \((n+1)\times (m+1)\) with binary values and a single 1 in each line and in each column, except for the last line and the last column. We naturally extend this concept to matrices with fractional values: we call error-correcting bistochastic matrix any matrix with nonnegative values in which the sum of the cells of each line except the last one, and of each column except the last one, amounts exactly to 1.
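As an illustration, here is a minimal C++ sketch (C++ being the language of the paper's implementation; this is not code from [3]) of a feasibility check for an error-correcting assignment matrix, directly transcribing the constraints above:

```cpp
// Checks that a binary matrix x of dimensions (n+1) x (m+1) is a valid
// error-correcting assignment matrix: every line and column except the
// last ones contains exactly one 1.
#include <vector>

using Matrix = std::vector<std::vector<int>>;

bool isErrorCorrectingAssignment(const Matrix& x, int n, int m) {
    // Each of the first n lines encodes exactly one operation
    // (a substitution i -> j or a deletion i -> eps_H).
    for (int i = 0; i < n; ++i) {
        int rowSum = 0;
        for (int j = 0; j <= m; ++j) rowSum += x[i][j];
        if (rowSum != 1) return false;
    }
    // Each of the first m columns encodes exactly one operation
    // (a substitution i -> j or an insertion eps_G -> j).
    for (int j = 0; j < m; ++j) {
        int colSum = 0;
        for (int i = 0; i <= n; ++i) colSum += x[i][j];
        if (colSum != 1) return false;
    }
    return true;  // the (n+1, m+1) cell is unconstrained
}
```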

Given a cost matrix \(\varDelta \) and an initial continuous or discrete candidate solution x, Algorithm 1 describes the Frank-Wolfe algorithm \(\text {FW}(\varDelta ,\mathbf {x})\). In the following, we denote by IPFP the method that consists in running \(\text {FW}(\varDelta ,\mathbf {x})\) and projecting the returned solution onto the discrete space. See  [3] for more details.

Algorithm 1. The Frank-Wolfe algorithm \(\text {FW}(\varDelta ,\mathbf {x})\)
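Since the listing of Algorithm 1 is not reproduced here, the following minimal C++ sketch illustrates the Frank-Wolfe scheme for the quadratic objective \(\frac{1}{2}x^t\varDelta x+c^tx\) of Sect. 1, under simplifying assumptions: \(\varDelta \) is symmetric and stored densely, and the LSAPE linear minimization oracle is passed in as a hypothetical callback rather than implemented.

```cpp
// Sketch of FW(Delta, x) for f(x) = (1/2) x^t Delta x + c^t x,
// with Delta assumed symmetric so that grad f(x) = Delta x + c.
#include <algorithm>
#include <cstddef>
#include <functional>
#include <vector>

using Vec = std::vector<double>;
using Mat = std::vector<Vec>;

// `lsape` is a hypothetical oracle returning the vectorized binary
// assignment matrix b minimizing g^t b over the assignment polytope.
Vec frankWolfe(const Mat& Delta, const Vec& c, Vec x,
               const std::function<Vec(const Vec&)>& lsape,
               int maxIter = 100, double tol = 1e-9) {
    const std::size_t N = x.size();
    for (int it = 0; it < maxIter; ++it) {
        // Gradient at the current point: g = Delta x + c.
        Vec g(N);
        for (std::size_t i = 0; i < N; ++i) {
            g[i] = c[i];
            for (std::size_t j = 0; j < N; ++j) g[i] += Delta[i][j] * x[j];
        }
        Vec b = lsape(g);                        // direction-finding step
        Vec d(N);
        for (std::size_t i = 0; i < N; ++i) d[i] = b[i] - x[i];
        // Exact line search along x + t d, t in [0, 1], for a quadratic.
        double gap = 0.0, curv = 0.0;
        for (std::size_t i = 0; i < N; ++i) {
            gap -= g[i] * d[i];                  // Frank-Wolfe duality gap
            double Dd = 0.0;
            for (std::size_t j = 0; j < N; ++j) Dd += Delta[i][j] * d[j];
            curv += d[i] * Dd;
        }
        if (gap < tol) break;                    // (near-)stationary point
        const double t = (curv > 0.0) ? std::min(1.0, gap / curv) : 1.0;
        for (std::size_t i = 0; i < N; ++i) x[i] += t * d[i];
    }
    return x;  // continuous solution, projected onto the discrete space by IPFP
}
```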

3 RANDPOST Algorithm

The design of the RANDPOST algorithm finds its origin in the following intuition regarding the often conflicting criteria that a smart initial solution generator should fulfill. On the one hand, a smart generator should propose solutions that are well distributed inside the \(\varPi _{n, m}^{\epsilon }\) polytope, in order to increase the diversity among the local minima that are ultimately returned. We call this criterion the exploration criterion. On the other hand, we obviously wish that one of the initial solutions ultimately leads to a global minimum, so that a good generator should already generate solutions that include “smart” assignments. We call this criterion the quality criterion.

Building on the multistart IPFP (mIPFP) algorithm [5], we propose a new algorithm that explores the polytope and generates new solutions by taking advantage of the whole statistical information contained in the set of solutions returned by mIPFP. This algorithm is presented in Sect. 3.2. We also devise a new parameterized random generator, described in Sect. 3.1.

3.1 Generating Initial Solutions with Parameterized Number of Insertions and Deletions

With respect to the initial random generator used in  [5], where vertices were randomly assigned using the std::random_shuffle procedure of the C++ standard library, resulting in a random proportion of insertions and deletions, we decided to use a random generator with a parameterized proportion of insertions and deletions.

We focus our analysis on the number of deletions since, for given \(n=|G|\) and \(m=|H|\), the number of insertions is completely determined by it, namely \(\#\text {insertions} = \#\text {deletions}-n+m\).

Given a parameter \(\alpha \in (0,1)\), we use a new initial random generator RANDGEN\((\alpha )\) which generates solutions with an expected number of deletions equal to \(\alpha n + (1-\alpha )\max \{n-m,0\}\). In other words, \(\alpha \) denotes the expected proportion of “unnecessary” deletions of vertices in the set of initial solutions, recalling that if \(n>m\) any feasible assignment has to assign at least \(n-m\) vertices of G to the virtual vertex \(\epsilon _H\), which thus correspond to “necessary” deletions. One possible realization of such a generator is sketched below.
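The exact sampling procedure is not spelled out above; one way to meet the stated expectation (an assumption on our part, not necessarily the authors' implementation) is to keep each of the \(\min \{n,m\}\) potential substitutions independently with probability \(1-\alpha \), since then \(E[\#\text {deletions}] = n - (1-\alpha )\min \{n,m\} = \alpha n + (1-\alpha )\max \{n-m,0\}\).

```cpp
// One plausible realization of RANDGEN(alpha), with 0-based indices.
#include <algorithm>
#include <numeric>
#include <random>
#include <vector>

// Returns, for each vertex i of G, its image: a vertex of H in
// {0, ..., m-1}, or m, which stands for eps_H (deletion of i).
std::vector<int> randGen(int n, int m, double alpha, std::mt19937& rng) {
    // Number of substitutions ~ Binomial(min(n, m), 1 - alpha).
    std::binomial_distribution<int> numSub(std::min(n, m), 1.0 - alpha);
    const int k = numSub(rng);

    std::vector<int> gPerm(n), hPerm(m);
    std::iota(gPerm.begin(), gPerm.end(), 0);
    std::iota(hPerm.begin(), hPerm.end(), 0);
    std::shuffle(gPerm.begin(), gPerm.end(), rng);
    std::shuffle(hPerm.begin(), hPerm.end(), rng);

    std::vector<int> image(n, m);        // default: i -> eps_H (deletion)
    for (int t = 0; t < k; ++t)
        image[gPerm[t]] = hPerm[t];      // k random substitution pairs
    return image;  // vertices of H not hit by a substitution are insertions
}
```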

3.2 Stochastic Generation of New Initial Solutions Based on Several Refined Solutions

Algorithm 2. The RANDPOST algorithm

We describe here the functioning of RANDPOST (Algorithm 2). Given a pair of graphs G and H, the algorithm starts by running mIPFP, which outputs a set \(S^*\) of r solutions to the problem. Each of these r solutions is represented as an error-correcting assignment matrix x. A matrix \(\varPsi \) is then created by simply summing all of these r matrices and dividing all values by r. Hence, \(\forall i \in [1, n+1], j \in [1, m+1]\), \(\varPsi _{i, j}\) represents the proportion of solutions where i is assigned to j among the r solutions of \(S^*\). Matrix \(\varPsi \) (which is an error-correcting bistochastic matrix) is then used as a probability distribution to generate a new set of k solutions, which are subsequently refined using IPFP; the first r of these k parallel processes that converge are added to the set \(S^*\). The whole sequence that updates \(\varPsi \), generates new solutions and finally refines them is repeated l times.
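As an illustration, a minimal C++ sketch (not the paper's actual code) of the construction of \(\varPsi \) as the entrywise mean of the r error-correcting assignment matrices of \(S^*\):

```cpp
// Builds Psi from the refined solutions in S*: Psi[i][j] is the
// frequency of the assignment i -> j among the r solutions.
#include <cstddef>
#include <vector>

using Matrix = std::vector<std::vector<double>>;

Matrix buildPsi(const std::vector<Matrix>& solutions) {
    // All solutions share the same dimensions (n+1) x (m+1).
    const std::size_t rows = solutions.front().size();
    const std::size_t cols = solutions.front().front().size();
    Matrix psi(rows, std::vector<double>(cols, 0.0));
    for (const Matrix& x : solutions)
        for (std::size_t i = 0; i < rows; ++i)
            for (std::size_t j = 0; j < cols; ++j)
                psi[i][j] += x[i][j];
    const double r = static_cast<double>(solutions.size());
    for (auto& row : psi)
        for (double& v : row) v /= r;
    return psi;
}
```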

Parameters k and r make it possible to speed up the algorithm when \(k\gg r\): some of the k initial solutions might require many IPFP iterations to converge to a local minimum, so that launching k parallel IPFPs and stopping all of them as soon as r have converged is likely to be faster than launching r parallel IPFPs and waiting for all of them to converge.

The intuitive idea behind the RANDPOST algorithm is the following: if an assignment \(i \rightarrow j\) appears with a high frequency within the solutions of the set \(S^*\), then this assignment is likely to be part of many good solutions for the problem at hand. Hence, the algorithm generates k new solutions in a stochastic way, where the probability for a given assignment to be part of a solution is higher for assignments with a high \(\varPsi _{i, j}\) value, and thus a high frequency.

More precisely, the random generator assigns each vertex of G to a vertex of \(H\cup \{\epsilon _H\}\) following a greedy procedure. The matrix x of dimensions \((n+1)\times (m+1)\) (which will eventually contain the solution returned by the algorithm) is initialized with zeros, and whenever an assignment \(i\rightarrow j\) is made by the algorithm, this translates to \(x_{i, j}\leftarrow 1\). At the end of the algorithm, x should be an error-correcting permutation matrix. The random generator iteratively assigns vertices 1 to n based on the following probability distribution \(P_i\), where \(P_i(j)\) denotes the probability for vertex i of G to be assigned to vertex j of H by the generator, given a partial assignment of vertices 1 to \(i-1\):

$$\begin{aligned} \begin{array}{l} \forall j=1,\ldots ,m+1,\quad P_1(j)=\varPsi _{1,j}\\ \forall i=2,\ldots ,n,\ \forall j=1,\ldots ,m+1\\ P_i(j) = \left\{ \begin{array}{ll} 0 &{} \text {if } j\ne m+1 \text { and } \exists h < i \text { s.t. } x_{h,j}=1\\ \dfrac{\varPsi _{i,j}}{1-\sum _{j'=1}^{m}\left( \sum _{h=1}^{i-1}x_{h,j'}\varPsi _{i,j'}\right) } &{} \text {otherwise} \end{array} \right. \end{array} \end{aligned}$$
(1)

Finally, all vertices of H that have been left unassigned at the end of the procedure are assigned to \(\epsilon _{G}\). Let us first prove that the matrix x produced by the proposed random generator is always an error-correcting permutation matrix. This follows by putting together the two following facts: (1) by construction, each line of x but the last one has exactly one value set to 1 and all others set to 0 (a single assignment is made for a single line at each step of the generator); (2) the probability distribution described by (1) ensures that no vertex j of H is assigned twice (once assigned, its probability of being assigned again is zero), and the very last step ensures that each vertex of H is assigned at least once.

Let us briefly prove that (1) defines a proper probability distribution. It is easy to verify that \(\sum _{j=1}^{m+1}P_1(j)=1\) by simply recalling that \(\varPsi \) is an error-correcting bistochastic matrix. For \(i=2,\ldots , n\), consider a matrix \(\tilde{\varPsi }\) whose values are as follows:

$$\begin{aligned} \tilde{\varPsi }_{i, j} = \left\{ \begin{array}{ll} 0 &{} \text {if } j\ne m+1 \text { and } \exists h < i \text { s.t. } x_{h, j}=1\\ \varPsi _{i, j} &{} \text {otherwise} \end{array} \right. \end{aligned}$$

It is easy to verify that:

$$\begin{aligned} \forall i=2,\ldots , n \quad \sum _{j=1}^{m+1} \tilde{\varPsi }_{i, j} = 1-\sum _{j'=1}^{m}\left( \sum _{h=1}^{i-1}x_{h, j'}\varPsi _{i, j'}\right) \end{aligned}$$
(2)

We finally prove that (1) defines a proper probability distribution:

$$\begin{aligned} \forall i=2,\ldots , n \quad \sum _{j=1}^{m+1} P_i(j) = \dfrac{\sum _{j=1}^{m+1} \tilde{\varPsi }_{i, j}}{1-\sum _{j'=1}^{m}\left( \sum _{h=1}^{i-1}x_{h, j'}\varPsi _{i, j'}\right) }\underset{(2)}{=} 1 \end{aligned}$$
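Putting the pieces together, here is a minimal C++ sketch (with 0-based indices; not the paper's actual code) of the greedy generator defined by (1). It relies on the fact just proved: restricting row i of \(\varPsi \) to the still-free columns (plus the \(\epsilon _H\) column) and renormalizing yields exactly \(P_i\). We assume, as the text does implicitly, that some positive mass always remains in the restricted row.

```cpp
// Greedy stochastic generator following the distribution P_i of Eq. (1).
#include <random>
#include <vector>

using Matrix = std::vector<std::vector<double>>;

// psi has dimensions (n+1) x (m+1). Returns image[i] in {0, ..., m},
// where m stands for eps_H; unassigned vertices of H then go to eps_G.
std::vector<int> samplePost(const Matrix& psi, int n, int m,
                            std::mt19937& rng) {
    std::vector<bool> taken(m, false);   // vertices of H already assigned
    std::vector<int> image(n, m);
    std::vector<double> p(m + 1);
    for (int i = 0; i < n; ++i) {
        // Zero out the columns already used; column m (eps_H) stays available.
        for (int j = 0; j <= m; ++j)
            p[j] = (j < m && taken[j]) ? 0.0 : psi[i][j];
        // discrete_distribution renormalizes the weights, matching Eq. (1).
        std::discrete_distribution<int> dist(p.begin(), p.end());
        const int j = dist(rng);
        image[i] = j;
        if (j < m) taken[j] = true;      // no vertex of H is assigned twice
    }
    return image;
}
```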

Finally, whenever the stochastic generator produces a candidate solution that has already been generated earlier, that solution is discarded and a new one is produced using a slightly flatter (and thus more explorative) distribution.
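How the distribution is flattened is not specified above; one simple possibility, given purely as an assumption and not necessarily the authors' choice, is a convex mix of each row of \(\varPsi \) with the uniform distribution (only the rows of \(\varPsi \) are used by the generator, so row-wise mixing suffices):

```cpp
// Hypothetical flattening of Psi: convex mix of each row with uniform.
#include <vector>

using Matrix = std::vector<std::vector<double>>;

// lambda = 0 leaves Psi unchanged; lambda = 1 yields uniform rows.
Matrix flatten(const Matrix& psi, double lambda) {
    Matrix out = psi;
    for (auto& row : out) {
        const double u = 1.0 / static_cast<double>(row.size());
        for (double& v : row)
            v = (1.0 - lambda) * v + lambda * u;  // flatter, more explorative
    }
    return out;
}
```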

4 Experimental Results

In this section, we evaluate the proposed method through several experiments, in order to determine as clearly as possible the relevance and importance of the exploration and quality criteria described in Sect. 3.

4.1 Datasets and Protocol

Table 1 presents the chemical datasets that were used in our experiments. MAO, PAH, MUTA10-70 and MUTAmix were considered in ICPR 2016 – Graph Distance Contest [2]. We also extracted 25 graphs from ClinTox  [13], and 10 graphs with more than 100 vertices from MUTA.

Table 1. Characteristics of datasets

We evaluated the following four versions of RANDPOST(k, r, l): RANDPOST(40, 40, 0), RANDPOST(40, 20, 1), RANDPOST(40, 10, 3) and RANDPOST(40, 5, 7). This choice of parameters is driven by the following idea: considering two algorithms RANDPOST\((k, r_1, l_1)\) and RANDPOST\((k, r_2, l_2)\) such that \(r_1(l_1+1)=r_2(l_2+1)\) and \(r_1>r_2\), their relative performances can be compared without bias, as the overall number of candidate solutions is the same (in our case, all four algorithms generate exactly 40 candidates), while the latter algorithm puts more weight on the quality criterion and the former on the exploration one. We thus consider that r represents the exploration parameter, while l represents the quality one.

Regarding cost functions, we tested all algorithms with four different sets of costs: \(c_1\), \(c_2\) and \(c_3\) correspond to the costs used in  [2], while \(c_4\) is the cost function used in  [5] and references therein. Note that \(c_1\), \(c_2\) and \(c_4\) correspond to metric costs, where the cost of substituting two elements is lower than or equal to the cost of removing the first element plus the cost of inserting the second one. Conversely, \(c_3\) is an anti-metric cost violating this inequality. The main difference between these two classes of cost functions is that metric cost functions favor substitutions while anti-metric ones favor deletions and insertions.
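Written out for vertices (the same condition holds for edges), the metric condition satisfied by \(c_1\), \(c_2\) and \(c_4\), and violated by \(c_3\), reads:

$$\begin{aligned} c(u \rightarrow v) \;\le \; c(u \rightarrow \epsilon _H) + c(\epsilon _G \rightarrow v), \quad \forall u \in V_G,\ v \in V_H \end{aligned}$$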

All tests were performed on 4 AMD Opteron processors at 2.6 GHz with 512 GB of RAM. The number of parallel threads was limited to 40 (which corresponds to the parameter k). The algorithm is implemented in C++.

Fig. 1. Behavior of RANDPOST w.r.t. parameter \(\alpha \), metric vs. anti-metric costs

4.2 Behavior of the Algorithm w.r.t. Parameter \(\alpha \)

We tested three versions of RANDPOST with several values of the parameter \(\alpha \) of RANDGEN and several cost functions (see the previous section). The most significant results are presented in Fig. 1 for MAO, a dataset with enough relatively simple instances for interesting statistical tendencies to emerge. Contrasting tendencies can be observed with the metric cost function \(c_4\) and the anti-metric one \(c_3\). Interestingly, the algorithm performs better as the initial proportion of “unfavored” choices rises. We believe that this is due to the design of the IPFP gradient descent, which is likely to find a better local minimum when starting from a solution including a greater number of “neutral” (in the sense of easily improvable) assignments.

Unfortunately, the behavior that we observe on MAO does not emerge with the same clarity on more complex datasets containing bigger or unlabeled graphs. However, IPFP seems to require a medium value of \(\alpha \) (around 0.4) to perform best when dealing with unlabeled graphs. For bigger graphs (more than 40 vertices), high values of \(\alpha \) seem to produce better starting points for IPFP, independently of the cost function.

Table 2. Experimental results of RANDPOST(k, r, l), cost \(c_1\)

4.3 Performance of RANDPOST

Table 2 presents the performance of the four versions of RANDPOST(k, r, l) mentioned earlier, plus RANDPOST(1, 1, 0), which corresponds to a single run of IPFP starting from a random candidate solution. For each pair of graphs in each dataset, we extracted the best known GED among those returned by a set of 14 algorithms (the 9 algorithms of  [2] plus our 5 versions of RANDPOST), except for ClinTox and MUTA100+, which were not part of the benchmark in  [1]; for these two datasets, the best GED was extracted from our 5 algorithms alone. The “err.” column presents the mean error w.r.t. the best known solutions, while the “%best” column presents the proportion of pairs of graphs for which the best known GED was found. For each dataset, we selected the value of \(\alpha \) leading to the minimal mean GED over all 5 tested algorithms; the selected value is indicated in the table. Due to space restrictions, we present results for the metric cost \(c_1\) only. The same tendencies can be observed with all the other cost functions.

The tendencies that emerge from Table 2 are quite clear: the more quality-oriented versions of RANDPOST(k, r, l) perform better than all the algorithms presented in  [2] on datasets with labeled graphs containing at least 60 vertices. Below this threshold, the balance between exploration and quality that yields better GED estimations shifts towards more exploratory methods as the size of the graphs decreases. Further analysis shows that this phenomenon is deeply linked to the speed and quality of convergence of the algorithms: a more exploratory version of RANDPOST will ultimately converge to better GED estimations, but it will also converge at a slower rate. On the other hand, bigger graphs lead to slower overall convergence rates. These two phenomena are visible in Fig. 2. Both plots represent the improvement of the GED estimations over the successive loops of RANDPOST. Each stairstep measures the best GED computed in a loop, and since the x-axis represents the number of computed solutions, the length of each step equals r for algorithm RANDPOST(k, r, l).

Fig. 2. Convergence of RANDPOST on datasets MUTA-20 and MUTA-70

When dealing with smaller graphs, quality-oriented methods converge very rapidly to suboptimal solutions, while exploratory ones converge more slowly to better GED estimations. On the other hand, when dealing with bigger graphs, the fast convergence of quality-oriented methods becomes a strength rather than a flaw: Fig. 2b shows that when the number of computed solutions is limited to 40 (which corresponds to the results in Table 2), none of the algorithms has converged yet, so that the faster-converging algorithm yields better results. This phenomenon eventually reverses in the long run: as an example, Fig. 2b suggests that the limit on the number of computed solutions must be brought as high as 90 for RANDPOST(40, 10, 20) to outperform RANDPOST(40, 5, 40) on MUTA-70.

5 Conclusion

Using a new iterative IPFP-based algorithm relying on stochastically generated solutions, we investigated the relative importance of exploration and quality criteria when generating candidate solutions for a multistart version of IPFP. Our results suggest that the balance leading to better GED estimations depends mostly on some ratio between the dimension of the problem at hand and the overall number of generated solutions.