Keywords

1 Introduction

Labelled, undirected graphs can be used for modelling various kinds of objects, such as social networks, molecular structures, and many more. Because of this, labelled graphs have received increasing attention over the past years. One task researchers have focused on is the following: Given a database \(\mathcal {G}\) that contains labelled graphs, find all graphs \(G\in \mathcal {G}\) that are sufficiently similar to a query graph H or to find the k graphs from \(\mathcal {G}\) that are most similar to H. For approaching this task, a distance measure between undirected, labelled graphs G and H has to be defined. One of the most commonly used measures is the graph edit distance. Formally, a labelled, undirected graph G is a 4-tuple \(G=\langle V^G,E^G,\ell ^G_V,\ell ^G_E\rangle \), where \(V^G\) is a set of nodes, \(E^G\) is a set of undirected edges, and \(\ell ^G_V:V^G\rightarrow \varSigma _V\) and \(\ell ^G_E:E^G\rightarrow \varSigma _E\) are labelling functions that assign nodes an edges to labels from alphabets \(\varSigma _V\) and \(\varSigma _E\). Both \(\varSigma _V\) and \(\varSigma _E\) contain a special label \(\varepsilon \) reserved for dummy nodes and dummy edges. The graph edit distance \(\lambda (G,H)\) between graphs G and H on common label alphabets \(\varSigma _V\) and \(\varSigma _E\) is defined as the minimum cost of an edit path between G and H. An edit path is a sequence of labelled graphs starting with G and ending at a graph that is isomorphic to H. Each graph along the path can be obtained from its predecessor by applying one of the following edit operations: Deleting or inserting an \(\alpha \)-labelled edge, deleting or inserting an isolated \(\alpha \)-labelled node, changing a node’s or an edge’s label from \(\alpha \) to \(\beta \ne \alpha \). Edit operations on nodes and edges come with associated edit costs and , respectively. The cost of an edit path is defined as the sum of the costs of its edit operations. If the cost of each edit operation equals 1, we say that the edit costs are uniform. In many scenarios, it is natural to consider non-uniform metric edit costs. For instance, if the graphs model spacial objects and the node labels are Euclidean coordinates, the cost \(c_V(\alpha ,\beta )\) one has to pay for changing a node’s label from \(\alpha \) to \(\beta \) should probably be defined as the Euclidean distance between \(\alpha \) and \(\beta \).

It has been shown that, even for uniform edit costs, it is \( NP \)-hard to exactly compute the graph edit distance [14]. Exact algorithms that, if applied to large graphs, terminate within an acceptable amount of time are hence out of reach. Consequently, a substantial part of research on both uniform [14,15,16] and non-uniform [2,3,4, 6, 10, 12, 13] graph edit distance has focused on the task of devising heuristics that compute lower and/or upper bounds for \(\lambda (G,H)\). Nonetheless, efficient exact algorithms are still important. This is because some of the objects that are readily modelled by labelled, undirected graphs — for instance, some molecular compounds — induce graphs with very few nodes [9]. For these graphs, queries of the kind “find all \(G\in \mathcal {G}\) with \(\lambda (G,H)\le \tau \)” can in principle be answered. Of course, one would first use efficiently computable upper and lower bounds in order to filter out candidates from \(\mathcal {G}\). However, for the surviving candidates, \(\lambda (G,H)\le \tau \) has to be verified by means of an exact algorithm.

The standard approach \(\mathtt {A}^\star \)-\(\mathtt {GED}\) [11] for exactly computing \(\lambda (G,H)\) carries out a node-based best-first search in order to find the optimal edit path. It is very slow and has huge memory requirements. Recently, three better performing algorithms \(\mathtt {BLP}\)-\(\mathtt {GED}\) [8], \(\mathtt {DF}\)-\(\mathtt {GED}\) [1], and \(\mathtt {CSI\_GED}\) [5] have been proposed. \(\mathtt {BLP}\)-\(\mathtt {GED}\) formulates the problem of computing \(\lambda (G,H)\) as a binary linear program which is solved by calling the commercial solver CPLEX. It has been found to be faster and more memory-efficient than \(\mathtt {A}^\star \)-\(\mathtt {GED}\). \(\mathtt {DF}\)-\(\mathtt {GED}\) carries out a node-based depth-first search for finding the cheapest edit path. It has been found to be much more memory-efficient and slightly faster than \(\mathtt {A}^\star \)-\(\mathtt {GED}\). In contrast, \(\mathtt {CSI\_GED}\) carries out an edge-based depth-first search. It also has been found to be both faster and much more memory-efficient than \(\mathtt {A}^\star \)-\(\mathtt {GED}\). While \(\mathtt {A}^\star \)-\(\mathtt {GED}\), \(\mathtt {BLP}\)-\(\mathtt {GED}\), and \(\mathtt {DF}\)-\(\mathtt {GED}\) cover non-uniform metric edit costs, \(\mathtt {CSI\_GED}\) only works for uniform edit costs. A direct comparison between \(\mathtt {BLP}\)-\(\mathtt {GED}\), \(\mathtt {DF}\)-\(\mathtt {GED}\), and \(\mathtt {CSI\_GED}\) is lacking.

Our paper contains the following contributions: In Sect. 2, we present a speed-up \(\mathtt {DF}\)-\(\mathtt {GED^{u}}\) of \(\mathtt {DF}\)-\(\mathtt {GED}\)for uniform edit costs. \(\mathtt {DF}\)-\(\mathtt {GED^{u}}\) exploits the fact that, in the uniform case, a subroutine that \(\mathtt {DF}\)-\(\mathtt {GED}\) employs at each node of its search tree can be implemented to run in linear rather than cubic time. In Sect. 3, we propose a generalisation \(\mathtt {CSI\_GED^{nu}}\) of \(\mathtt {CSI\_GED}\) that also covers non-uniform metric edit costs. This generalisation comes at the price of a slightly increased runtime. However, this increase is very moderate, as the computational complexity is increased only at the initialisation of \(\mathtt {CSI\_GED^{nu}}\) and at the leafs of its search tree. In Sect. 4, we experimentally evaluate the performance of the newly proposed algorithms. The experiments show that, for uniform edit costs, our speed-up \(\mathtt {DF}\)-\(\mathtt {GED^{u}}\) clearly outperforms \(\mathtt {DF}\)-\(\mathtt {GED}\), while \(\mathtt {CSI\_GED}\) and our generalisation \(\mathtt {CSI\_GED^{nu}}\) perform similarly. They also indicate that, neither for uniform nor for non-uniform edit costs, there is a clear winner between \(\mathtt {DF}\)-\(\mathtt {GED^{u}}\) and \(\mathtt {DF}\)-\(\mathtt {GED}\), on the one side, and \(\mathtt {CSI\_GED}\) and \(\mathtt {CSI\_GED^{nu}}\), on the other side. Finally, the experiments suggest that \(\mathtt {CSI\_GED^{nu}}\) is the most versatile algorithm: It covers both uniform and non-uniform edit costs and runs very stable even on datasets where other algorithms perform better. Section 5 concludes the paper.

2 \(\mathtt {DF}\)-\(\mathtt {GED^{u}}\): Fast \(\mathtt {DF}\)-\(\mathtt {GED}\) for Uniform Edit Costs

In this section, we show how to speed-up the node-based depth-first search \(\mathtt {DF}\)-\(\mathtt {GED}\) for uniform edit costs. We first summarise \(\mathtt {DF}\)-\(\mathtt {GED}\) and then describe our speed-up \(\mathtt {DF}\)-\(\mathtt {GED^{u}}\).

The Baseline Approach. \(\mathtt {DF}\)-\(\mathtt {GED}\) builds upon the following observation: If edit costs are metric, then \(\lambda (G,H)\) can be defined equivalently as the minimum cost of an edit path that is induced by a node map [6]. Let \(V^{G+|H|}\) and \(V^{G+|H|}\) be the sets that are obtained from \(V^G\) and \(V^H\) by adding \(|V^H|\) respectively \(|V^G|\) isolated dummy nodes. A node map is an injective partial function \(\pi :V^{G+|H|}\rightarrow V^{H+|G|}\), whose domain contains \(V^G\) and whose image contains \(V^H\). For a given node map \(\pi \), its induced edit path is defined as follows: If \(\pi \) maps a real node \(i\in V^G\) to a dummy node \(j_\varepsilon \), i is deleted. Conversely, if a dummy node \(i_\varepsilon \) is mapped to a real node \(k\in V^H\), k is inserted. If a real node \(i\in V^G\) is mapped to a real node \(k\in V^H\), i’s label is changed from \(\ell ^G_V(i)\) to \(\ell ^H_V(k)\). If \(ij\in E^G\) but \(\pi (i)\pi (j)\notin E^H\), the edge ij is deleted. If \(kl\in E^H\) but \(\pi ^{-1}(k)\pi ^{-1}(l)\notin E^G\), the edge kl is inserted. Finally, if an edge \(ij\in E^G\) is mapped to an edge \(kl\in E^H\), ij’s label is changed from \(\ell ^G_E(ij)\) to \(\ell ^H_E(kl)\). The cost of the edit path induced by \(\pi \) is denoted by \(g(\pi )\).

\(\mathtt {DF}\)-\(\mathtt {GED}\) performs a depth-first search on the set of all partial node maps between \(V^{G+|H|}\) and \(V^{H+|G|}\) starting with the empty node map. The tree’s leafs correspond to complete node maps and its inner nodes correspond to incomplete node maps. \(\mathtt {DF}\)-\(\mathtt {GED}\) starts with sorting the nodes of \(V^G\) such that evident nodes will be processed first [3]. It also initialises an upper bound \( UB \) for \(\lambda (G,H)\), using a fast sub-optimal heuristic [10]. For each visited node \(\pi \) of the search tree, values \(g(\pi )\) and \(h(\pi )\) are maintained. The value \(g(\pi )\) denotes the cost of the corresponding incomplete induced edit path, and \(h(\pi )\) is a lower bound for the cost from \(\pi \) to a leaf, i.e., complete node map, in \(\pi \)’s down-shadow. Assume that all nodes in \(V^G\) up to node i have already been assigned by \(\pi \). If i is the last node in \(V^G\), \(\pi \) is extended to a complete node map by assigning a dummy node to each of the yet unassigned nodes \(j\in V^G\), and \( UB \) is updated to \(g(\pi )\) if \(g(\pi )< UB \). Otherwise, \(\pi \)’s children \(\pi ^\prime \in \{\pi \cup (i+1,j):j\in V^H\) unassigned by \(\pi \}\cup \{\pi \cup (i+1,j_\varepsilon )\}\) are considered in order of non-decreasing \(g(\pi ^\prime )+h(\pi ^\prime )\). If \(g(\pi ^\prime )+h(\pi ^\prime )< UB \), \(\pi \) is updated to \(\pi ^\prime \) and the process iterates. Otherwise, the branch rooted at \(\pi ^\prime \) is pruned. At termination, \( UB \) is returned.

Note that, for each visited partial node map \(\pi \), the lower bound \(h(\pi )\) has to be recomputed. \(\mathtt {DF}\)-\(\mathtt {GED}\) computes \(h(\pi )\) as follows: For a given partial node map \(\pi \), let \(V^{G+|H|-\pi }\) and \(V^{H+|G|-\pi }\) be the sets of unassigned nodes and \(E^{G-\pi }\) and \(E^{H-\pi }\) be the sets of unassigned edges filled up with dummy edges to ensure \(|E^{G-\pi }|=|E^{H-\pi }|\). Furthermore, let \(\ell ^G_V(V^{G+|H|-\pi })\), \(\ell ^H_V(V^{H+|G|-\pi })\), \(\ell ^G_E(E^{G-\pi })\), and \(\ell ^H_E(E^{H-\pi })\) denote the multisets of labels of the unassigned nodes or edges contained in these sets. Then \(h(\pi )\) is defined as \(h(\pi )=h_V(\pi )+h_E(\pi )\), where \(h_V(\pi )\) is the minimum cost of a linear assignment between \(\ell ^G_V(V^{G+|H|-\pi })\) and \(\ell ^H_V(V^{H+|G|-\pi })\) with assignment costs \(c_V\), and \(h_E(\pi )\) is the minimum cost of a linear assignment between \(\ell ^G_E(E^{G-\pi })\) and \(\ell ^H_E(E^{H-\pi })\) with assignment costs \(c_E\). Since a minimum linear assignment can be computed in cubic time, e.g., by using the Hungarian Algorithm [7], the runtime complexity of computing \(h(\pi )\) is thus cubic in n and m, where \(n=|V^G|+|V^H|\) and \(m=\max \{|E^G|,|E^H|\}\).

Our Speed-Up for Uniform Edit Costs. Our speed-up \(\mathtt {DF}\)-\(\mathtt {GED^{u}}\) builds upon the observation that, for uniform edit cost, \(h(\pi )\) can be computed in linear time. For showing this, we need the following lemma:

Lemma 1

Let A and B be two equally sized multisets and be uniform in the sense that c(ab) equals 1 if \(a\ne b\) and 0 otherwise. Then the cost of a minimum linear assignment between A and B for the assignment cost c equals \(|A|-|A\cap B|\).

Proof

Let \((a_i)^{|A|}_{i=1}\) and \((b_i)^{|B|}_{i=1}\) be orderings of A and B such that, for all \(i\le |A\cap B|\), it holds that \(a_i=b_i\). Note that this implies \(a_i\ne b_i\) for all \(i>|A\cap B|\). We define \(f:A\rightarrow B\) as \(f(a_i)=b_i\). It is easy to see that f is a minimum linear assignment between A and B for the uniform assignment cost c. Its cost is \(\sum _{a\in A}c(a,f(a))=\sum ^{|A\cap B|}_{i=1}c(a_i,b_i)+\sum ^{|A|}_{i=|A\cap B|+1}c(a_i,b_i)=|A|-|A\cap B|\).    \(\square \)

It has been shown that, if A and B are sorted multisets, the size of their intersection can be computed in linear time [14]. Together with Lemma 1, this immediately implies that, if \(c_V\) and \(c_E\) are uniform, \(h_V(\pi )\) and \(h_E(\pi )\) can be computed in \(\mathcal {O}(n\log n)\) and \(\mathcal {O}(m\log m)\) time, respectively: We first sort the labels of the nodes and the edges that have not been assigned by \(\pi \) in \(\mathcal {O}(n\log n)\) and \(\mathcal {O}(m\log m)\) time, respectively. Then, we compute the intersection sizes of the resulting sorted multisets in linear time. In order to further reduce the complexity of the computation of \(h_V(\pi )\) and \(h_E(\pi )\), we proceed as follows. When initialising \(\mathtt {DF}\)-\(\mathtt {GED}\), we once sort \(\ell ^G_V(V^{G+|H|})\), \(\ell ^H_V(V^{H+|G|})\), \(\ell ^G_E(E^{G})\), and \(\ell ^H_E(E^{H})\), i.e., the multisets containing the labels of all nodes and edges. For each L of the resulting sorted multisets and each partial node map \(\pi \), we maintain a boolean vector that indicates if the node or edge with label \(L_i\) is still unassigned by \(\pi \). This vector can be updated in constant additional time when updating the cost \(g(\pi )\) of the partial edit path induced by \(\pi \). For each partial node map \(\pi \), \(h_V(\pi )\) and \(h_E(\pi )\) can then be computed in linear time by using a variation of the algorithm for multiset intersection presented in [14].

3 \(\mathtt {CSI\_GED^{nu}}\): \(\mathtt {CSI\_GED}\) for Non-uniform Metric Edit Costs

In this section, we show how to generalise the edge-based depth-first search \(\mathtt {CSI\_GED}\) to non-uniform metric edit costs. We first summarise \(\mathtt {CSI\_GED}\) and then describe our generalisation \(\mathtt {CSI\_GED^{nu}}\).

The Baseline Approach. While \(\mathtt {DF}\)-\(\mathtt {GED}\) enumerates the space of all node maps, \(\mathtt {CSI\_GED}\) considers valid edge maps \(\phi :\overrightarrow{E^G}\rightarrow \overleftrightarrow {E^H}\cup \{e_\varepsilon \}\). The set \(\overrightarrow{E^G}\) contains one arbitrarily oriented edge (ij) for each undirected edge \(ij\in E^G\), \(\overleftrightarrow {E^H}\) contains two directed edges (kl) and (lk) for each \(kl\in E^H\), and \(e_\varepsilon \) denotes a dummy edge. An edge map \(\phi \) induces a relation \(\pi _\phi \) on \(V^G\times V^H\): If \(\phi (i,j)=(k,l)\), then \((i,k)\in \pi _\phi \) and \((j,l)\in \pi _\phi \). Since nodes cannot be assigned twice, \(\phi \) is called valid if and only if \(\pi _\phi \) is a partial injective function. A valid edge map \(\phi \) also induces a partial edit path between G and H: If \(\phi (i,j)=(k,l)\), ij’s label is changed from \(\ell ^G_E(ij)\) to \(\ell ^H_E(kl)\). If \(\phi (i,j)=e_\varepsilon \), the edge ij is deleted. If \(\phi ^{-1}[\{(k,l),(l,k)\}]=\emptyset \) holds for an edge \(kl\in E^H\), kl is inserted. And if \(\pi _\phi (i)=k\), i’s label is changed from \(\ell ^G_V(i)\) to \(\ell ^H_V(k)\). The cost of the partial edit path induced by \(\phi \) is denoted by \(g(\phi )\). In general, \(\phi \)’s induced edit path is incomplete, since the sets \(V^{G-\pi _\phi }\subseteq V^G\) and \(V^{H-\pi _\phi }\subseteq V^H\) containing the nodes that are left unassigned by \(\pi _\phi \) are in general non-empty. The following theorem constitutes the backbone of \(\mathtt {CSI\_GED}\):

Theorem 1

(Cf. Theorem 1 in [5]). If the edit costs \(c_V\) and \(c_E\) are uniform, then, for each node map \(\pi :V^{G+|H|}\rightarrow V^{H+|G|}\), there is a valid edge map \(\phi :\overrightarrow{E^G}\rightarrow \overleftrightarrow {E^H}\cup \{e_\varepsilon \}\) with \(g(\pi )\ge g(\phi )+\varGamma (V^{G-\pi _\phi },V^{H-\pi _\phi })\), where \(\varGamma (V^{G-\pi _\phi },V^{H-\pi _\phi })=\max \{|V^{G-\pi _\phi }|,|V^{H-\pi _\phi }|\}-|\ell ^G_V(V^{G-\pi _\phi })\cap \ell ^H_V(V^{H-\pi _\phi })|\). Moreover, \(g(\phi )+\varGamma (V^{G-\pi _\phi },V^{H-\pi _\phi })\ge \lambda (G,H)\) holds for each valid each map \(\phi \).

Theorem 1 implies that, for uniform edit costs, one can compute the graph edit distance by enumerating the space of all valid edge maps. To this purpose, \(\mathtt {CSI\_GED}\) carries out a depth-first search on the set of all valid partial edge maps starting with the empty edge map. \(\mathtt {CSI\_GED}\) maintains an upper bound for the graph edit distance, which is initialised as \( UB =\infty \), and considers the edges \(e_r\in \overrightarrow{E^G}\) in an arbitrary but fixed order. For each visited incomplete edge map \(\phi \), the current induced cost \(g(\phi )\) and a lower bound \(g^\prime (\phi )\) for the induced cost of a complete edge map in \(\phi \)’s down-shadow are maintained. Assume that all edges in \(\overrightarrow{E^G}\) up to \(e_r\) have already been assigned by \(\phi \). If \(e_r\) is the last edge in \(\overrightarrow{E^G}\), \(\phi \) is a complete valid edge map, and \( UB \) is updated to \(g(\phi )+\varGamma (V^{G-\pi _\phi },V^{H-\pi _\phi })\) if \(g(\phi )+\varGamma (V^{G-\pi _\phi },V^{H-\pi _\phi })< UB \). Otherwise, \(\phi \)’s children \(\phi ^\prime \in \{\phi \cup (e_{r+1},e): e\in \overleftrightarrow {E^H}\) unassigned by \(\phi \) and \(\phi \cup (e_{r+1},e)\) valid\(\}\cup \{\phi \cup (e_{r+1},e_\varepsilon )\}\) are considered in order of non-decreasing \(\mathcal {C}(e_{r+1},e)\). \(\mathcal {C}(e_{r+1},e)\) is an estimate of the graph edit distance under the constraint that the edge \(e_{r+1}\) is mapped to e. Note that the estimated cost matrix \(\mathcal {C}\) only has to be computed once at initialisation. If \(g^\prime (\phi ^\prime )< UB \), \(\phi \) is updated to \(\phi ^\prime \) and the process iterates. Otherwise, the branch rooted at \(\phi ^\prime \) is pruned. At termination, \( UB \) is returned.

Our Generalisation to Non-uniform Metric Edit Costs. The key-ingredient of our extension \(\mathtt {CSI\_GED^{nu}}\) is the following generalised version of Theorem 1:

Theorem 2

If the edit costs \(c_V\) and \(c_E\) are metric, then, for each node map \(\pi :V^{G+|H|}\rightarrow V^{H+|G|}\), there is a valid edge map \(\phi :\overrightarrow{E^G}\rightarrow \overleftrightarrow {E^H}\cup \{e_\varepsilon \}\) with \(g(\pi )\ge g(\phi )+\varGamma ^{\mathtt {nu}}(V^{G-\pi _\phi },V^{H-\pi _\phi })\), where \(\varGamma ^{\mathtt {nu}}(V^{G-\pi _\phi },V^{H-\pi _\phi })\) is defined as the cost of a minimum linear assignment between \(\ell ^G_V(V^{G+|H|-\pi _\phi })\) and \(\ell ^H_V(V^{H+|G|-\pi _\phi })\) for the assignment cost \(c_V\). Moreover, \(g(\phi )+\varGamma ^{\mathtt {nu}}(V^{G-\pi _\phi },V^{H-\pi _\phi })\ge \lambda (G,H)\) holds for each valid edge map \(\phi \).

Proof

Given a node map \(\pi \), we construct a valid edge map \(\phi \) as follows: Let \((i,j)\in \overrightarrow{E^G}\). If the corresponding undirected edge ij is preserved under \(\pi \), i.e., if \(\pi (i)\pi (j)\in E^H\), we define \(\phi (i,j)=(\pi (i),\pi (j))\). Otherwise, we set \(\phi (i,j)=e_\varepsilon \). By construction, \(\pi _\phi \) equals the restriction of \(\pi \) to those real nodes \(i\in V^G\) that are incident with an edge that is preserved under \(\pi \). This implies that \(\phi \) is valid. Next, we compare the complete edit path \(P_\pi \) that is induced by \(\pi \)and the partial edit path \(P_\phi \) that is induced by \(\phi \). We observe that \(P_\phi \) contains all edge-deletions, -insertions, and -relabelings that appear in \(P_\pi \), as well as all relabelings of nodes that are incident with a preserved edge. Apart from these edit operations, \(P_\pi \) also contains deletions and relabelings of nodes that are not incident with a preserved edge, as well as node-insertions. These latter operations can be viewed as a linear assignment between \(\ell ^G_V(V^{G+|H|-\pi _\phi })\) and \(\ell ^H_V(V^{H+|G|-\pi _\phi })\) for the assignment cost \(c_V\), which, together with the observation above, implies \(g(\pi )\ge g(\phi )+\varGamma ^{\mathtt {nu}}(V^{G-\pi _\phi },V^{H-\pi _\phi })\). For showing the second part of the theorem, we fix a valid edge map \(\phi \). Let \(\pi ^\prime _\phi \) be a minimum linear assignment between \(\ell ^G_V(V^{G+|H|-\pi _\phi })\) and \(\ell ^H_V(V^{H+|G|-\pi _\phi })\) for the assignment cost \(c_V\). Then \(\pi =\pi _\phi \cup \pi ^\prime _\phi \) is a complete node map. By construction, we have \(g(\phi )+\varGamma ^{\mathtt {nu}}(V^{G-\pi _\phi },V^{H-\pi _\phi })\ge g(\pi )\ge \lambda (G,H)\), where the last inequality follows from the fact that, for metric edit costs, the graph edit distance can be defined as the minimum cost of an edit path that is induced by a node map.     \(\square \)

Theorem 2 indicates how to extend \(\mathtt {CSI\_GED}\) to non-uniform metric edit costs: We just have to replace all occurrences of \(\varGamma \) by \(\varGamma ^{\mathtt {nu}}\). As presented above, during the depth-first search carried out by \(\mathtt {CSI\_GED}\), \(\varGamma \) has to be computed at the leafs of the search tree. At initialisation, further computations of \(\varGamma \) are required for computing the estimated cost matrix \(\mathcal {C}\) and a constant that is required for the computation of \(g^\prime \) (cf. [5] for these details of \(\mathtt {CSI\_GED}\)). Note that computing \(\varGamma \) requires linear time, whereas computing \(\varGamma ^{\mathtt {nu}}\) needs cubic time (cf. Sect. 2). This implies that our generalisation leads to an increased runtime of \(\mathtt {CSI\_GED}\). However, the increase is very moderate, as \(\varGamma ^{\mathtt {nu}}\) does not need to be computed at the inner nodes of the (exponentially large) search tree.

4 Empirical Evaluation

The aim of our experiments is to compare the performance of the algorithms \(\mathtt {CSI\_GED}\), \(\mathtt {CSI\_GED^{nu}}\), \(\mathtt {DF}\)-\(\mathtt {GED}\), and \(\mathtt {DF}\)-\(\mathtt {GED^{u}}\) for both uniform and non-uniform metric edit costs. We implemented all algorithms in C++ making them employ the same data structures and subroutines. All tests were carried out on a machine with two Intel Xeon E5-2667 v3 processors with 8 cores each and 98 GB of main memory running GNU/Linux. We conducted tests on the datasets Aids and Fingerprints [9], which are widely used in the research community [5, 10,11,12,13,14,15,16]. Both datasets contain graphs with both node and edge labels for which non-uniform metric relabelling costs \(c_V\) and \(c_E\) are naturally induced by the domain [10]. For defining non-uniform metric edit costs, we thus only had to specify the deletion/insertion costs \(c_V(\alpha ,\varepsilon )\) and \(c_E(\alpha ,\varepsilon )\). This was done by setting \(c_V(\alpha ,\varepsilon )=\max \{c_V(\beta ,\gamma )\mid \beta ,\gamma \in \varSigma _V\}\) for all \(\alpha \in \varSigma _V\smallsetminus \{\varepsilon \}\), and \(c_E(\alpha ,\varepsilon )=\max \{c_E(\beta ,\gamma )\mid \beta ,\gamma \in \varSigma _E\}\) for all \(\alpha \in \varSigma _E\smallsetminus \{\varepsilon \}\), i.e., deleting and inserting nodes and edges was defined to be as expensive as the most expensive relabelling operations. Since both \(\mathtt {DF}\)-\(\mathtt {GED}\) and \(\mathtt {CSI\_GED}\) fail to compute the exact graph edit distance for graphs with more than 25 nodes within reasonable time [1, 5], we excluded larger graphs from Aids. Fingerprints only contains small graphs, anyway. We then used the experimental setup suggested in [5]: For both considered datasets \(\mathcal {D}\) and all \(i\in \{3,6,\ldots ,\max _{G\in \mathcal {D}}|G|\}\), we defined a size-constrained test-group \(\mathcal {G}_i\) that contains four randomly selected graphs \(G\in \mathcal {D}\) satisfying \(|V^G|=i\pm 1\). For each tested algorithm \(\mathtt {ALG}\)and each test-group \(\mathcal {G}_i\), all six pairwise comparisons between graphs contained in \(\mathcal {G}_i\) were carried out. We set a time limit of 1000 s and recorded the metrics \( timeouts \), \( t \), and \( dev \). Since pretesting showed that the main memory demand of all tested algorithms is negligible, we did not record memory usage.

  • \( timeouts (\mathtt {ALG},i)\): The number of timeouts on \(\mathcal {G}_i\), i.e., of pairwise comparisons between graphs in \(\mathcal {G}_i\) where \(\mathtt {ALG}\)did not finish within 1000 s.

  • \( t (\mathtt {ALG},i)\): \(\mathtt {ALG}\)’s average runtime across all six pairwise comparisons between graphs in \(\mathcal {G}_i\).

  • \( dev (\mathtt {ALG},i)\): \(\mathtt {ALG}\)’s average percentual deviation from the best tested algorithm as introduced in [1], i.e., the average of \(100\cdot [ UB (\mathtt {ALG})- UB ^\star ]/ UB ^\star \) across all six pairwise comparisons between graphs in \(\mathcal {G}_i\). \( UB (\mathtt {ALG})\) denotes the value of the upper bound \( UB \) maintained by \(\mathtt {ALG}\)after 1000 s and \( UB ^\star \) is defined as \( UB ^\star =\min \{ UB (\mathtt {ALG^\prime })\mid \mathtt {ALG^\prime }\) is tested algorithm\(\}\).

Figure 1 shows the outcomes of our experiments for uniform edit costs. We observe that, on both datasets, our speed-up \(\mathtt {DF}\)-\(\mathtt {GED^{u}}\) outperforms \(\mathtt {DF}\)-\(\mathtt {GED}\) in terms of all recorded metrics, while \(\mathtt {CSI\_GED}\) and our generalisation \(\mathtt {CSI\_GED^{nu}}\) perform similarly. For instance, on Fingerprints, \(\mathtt {DF}\)-\(\mathtt {GED^{u}}\) is on average 4.75 times faster than \(\mathtt {DF}\)-\(\mathtt {GED}\), while \({{\mathrm{avg}}}_i t (\mathtt {CSI\_GED^{nu}},i)/ t (\mathtt {CSI\_GED},i)\approx 1.32\). On Aids, we have \({{\mathrm{avg}}}_i t (\mathtt {DF}{\text {-}}\mathtt {GED},i)/ t (\mathtt {DF}{\text {-}}\mathtt {GED^{u}},i)\approx 2.05\) and \({{\mathrm{avg}}}_i t (\mathtt {CSI\_GED^{nu}},i)/ t (\mathtt {CSI\_GED},i)\approx 1.31\). These results are readily explained by the fact that \(\mathtt {DF}\)-\(\mathtt {GED}\) has to carry out the cubic computation of the lower bound h at each node of its search tree, whereas \(\mathtt {CSI\_GED^{nu}}\) has to carry out the cubic computation of \(\varGamma ^{\mathtt {nu}}\) only at the leafs and at initialisation. Secondly, we see that, on Fingerprints (cf. Fig. 1a), the node-based approaches \(\mathtt {DF}\)-\(\mathtt {GED^{u}}\) and \(\mathtt {DF}\)-\(\mathtt {GED}\) outperform the edge-based algorithms \(\mathtt {CSI\_GED}\) and \(\mathtt {CSI\_GED^{nu}}\), while, on Aids (cf. Fig. 1b), the opposite is the case. Finally, we note that the edge-based algorithms are more stable: While their deviation never exceeds \(2\%\), the node-based approaches’ deviation explodes on comparisons between large graphs contained in Aids.

Fig. 1.
figure 1

Results for uniform edit costs.

The results for non-uniform edit metric costs are displayed in Fig. 2. Note that the algorithms \(\mathtt {DF}\)-\(\mathtt {GED^{u}}\) and \(\mathtt {CSI\_GED}\) do not appear in the evaluation, as they are designed only for uniform edit costs. The first observation is that, just like for uniform edit costs, the node-based approach \(\mathtt {DF}\)-\(\mathtt {GED}\) performs better on Fingerprints (cf. Fig. 2a), while our edge-based generalisation \(\mathtt {CSI\_GED^{nu}}\) performs better on Aids (cf. Fig. 2b). Secondly, we again note that the edge-based algorithm runs much more stable than the node-based approach: On Fingerprints, i.e., the dataset where \(\mathtt {DF}\)-\(\mathtt {GED}\) performs better, we have \(\max _i dev (\mathtt {CSI\_GED^{nu}},i)\approx 1.48\); whereas on Aids, i.e., the dataset where \(\mathtt {CSI\_GED^{nu}}\) performs better, we observe \(\max _i dev ({\mathtt {DF}}\)-\({\mathtt {GED}},i)\approx 47.97\).

Fig. 2.
figure 2

Results for non-uniform metric edit costs.

5 Conclusions and Future Work

Our experiments show that, for uniform edit costs, our speed-up \(\mathtt {DF}\)-\(\mathtt {GED^{u}}\) always outperforms \(\mathtt {DF}\)-\(\mathtt {GED}\), while \(\mathtt {CSI\_GED}\) and our generalisation \(\mathtt {CSI\_GED^{nu}}\) perform similarly. We also observed that, neither for uniform nor for non-uniform metric edit costs, there is a clear winner between the node-based approaches \(\mathtt {DF}\)-\(\mathtt {GED^{u}}\) and \(\mathtt {DF}\)-\(\mathtt {GED}\), on the one side, and the edge-based algorithms \(\mathtt {CSI\_GED}\) and \(\mathtt {CSI\_GED^{nu}}\), on the other side. On Fingerprints, the former two algorithms outperformed the latter in terms of runtime and timeouts, while on Aids, the opposite outcome was observed. However, \(\mathtt {CSI\_GED^{nu}}\) and \(\mathtt {CSI\_GED}\) turned out to be more stable than \(\mathtt {DF}\)-\(\mathtt {GED}\) and \(\mathtt {DF}\)-\(\mathtt {GED^{u}}\): While \(\mathtt {CSI\_GED^{nu}}\)’s and \(\mathtt {CSI\_GED}\)’s deviation is small across all test-runs, \(\mathtt {DF}\)-\(\mathtt {GED}\)’s and \(\mathtt {DF}\)-\(\mathtt {GED^{u}}\)’s deviation explodes for comparisons between large graphs contained in the Aids dataset. A global assessment of these observations indicates that, if there is no prior knowledge about the dataset and the graph edit distance has to be computed for both uniform and non-uniform metric edit costs, our generalisation \(\mathtt {CSI\_GED^{nu}}\) is the algorithm of choice. For future research, it might be interesting to individuate graph-properties that indicate if the node-based approaches \(\mathtt {DF}\)-\(\mathtt {GED^{u}}\) and \(\mathtt {DF}\)-\(\mathtt {GED}\) or the edge-based algorithms \(\mathtt {CSI\_GED}\) and \(\mathtt {CSI\_GED^{nu}}\) perform better. A meta-algorithm could then first compute these properties and select node-based or edge-based algorithms accordingly.