1 Introduction

The use of feature vectors for pattern representation entails two limitations. First, vectors are based on a predefined set of features, and thus all vectors in a given application must have the same length regardless of the size or complexity of the corresponding pattern. Second, there is no direct way to describe binary (or higher-order) relationships that might exist among different parts of a pattern. These two drawbacks are severe, particularly when the patterns under consideration are characterized by complex structural relationships rather than by the statistical distribution of a fixed set of pattern features.

Graphs are able to overcome both limitations and are thus regarded as a versatile alternative to feature vectors. Due to their power and flexibility, graphs have found widespread application in pattern recognition and related fields [10, 17]. However, one drawback of graphs, when compared to feature vectors, is the significant increase in the complexity of many algorithms. Consider, for instance, the algorithmic comparison of two patterns (actually a basic requirement for many pattern recognition algorithms). Due to the homogeneous nature of feature vectors, a pairwise comparison is straightforward and can be accomplished in linear time with respect to the length of the two vectors. The same task for graphs, however, is much more complex, as one has to identify common parts of the graphs by considering all of their subsets of nodes.

With the rise of graph kernels [18] as well as graph embedding methods [39], the gap between vector-based and graph-based pattern recognition has been bridged. However, both approaches crucially depend on similarity or dissimilarity computation on graphs, commonly referred to as graph matching. The overall aim of graph matching is to find a correspondence between the nodes and edges of two graphs that satisfies some, more or less stringent, constraints. In the last four decades a huge number of procedures for graph matching have been proposed in the literature (e.g. [11, 20, 47]).

Graph edit distance [8, 48], introduced about 30 years ago, is still one of the most flexible graph distance models available and the topic of various recent research projects. In fact, the concept of graph edit distance is particularly interesting because it is able to cope with directed and undirected, as well as with labeled and unlabeled, graphs. If there are labels on nodes, edges, or both, no constraints on the respective label alphabets have to be imposed.

In order to compute the graph edit distance, tree search techniques endowed with some heuristics are often employed (e.g. [16]). However, this type of search algorithm is exponential in the number of nodes of the involved graphs. In [38] the authors of this paper introduced an algorithmic framework for a suboptimal computation of graph edit distance. The basic idea of this approach is to reduce the difficult problem of graph edit distance to a linear sum assignment problem (LSAP), for which an arsenal of efficient (i.e. cubic time) algorithms exists [9]. In two recent papers [41, 42] the optimal algorithm for the LSAP has been replaced with a suboptimal greedy algorithm which runs in quadratic time. Due to the lower complexity of this suboptimal assignment process, a substantial speed-up of the complete approximation procedure has been observed. However, it was also reported that the distance accuracy of this extension is slightly worse than that of the original algorithm. The major contribution of this paper is to improve the overall distance accuracy of this recent procedure by means of an elaborated transformation of the underlying cost model.

This paper is based on a preliminary contribution presented in [45]. The current paper extends this work with respect to both the theoretical foundation and the underlying method. Moreover, the description of the novel approach as well as the underlying concepts is more detailed than in the preliminary paper. Last but not least, compared to the preliminary contribution, the experimental evaluation as well as the discussion have been substantially extended. In particular, two additional real world data sets are employed for the empirical investigations.

The remainder of this paper is organized as follows. Next, in Sect. 2, the computation of graph edit distance is reviewed. In particular, it is shown how the graph edit distance problem can be reduced to a linear sum assignment problem. In Sect. 3, the transformation of the cost model into a utility model is thoroughly described. Eventually, in Sect. 4, we empirically confirm the benefit of this transformation in a classification experiment on five graph data sets. Finally, in Sect. 5, we conclude the paper and outline possible future research activities.

2 Graph Edit Distance

Originally, the paradigm of edit distance has been proposed for sequential data structures such as strings [28, 58]. Eventually, the idea of edit distance has been extended to more general data structures such as trees [49] and graphs [8, 13, 48, 56, 57].

Definition 1

(Graph) A graph g is defined as \(g = (V,E,\mu ,\nu )\), where V refers to the finite set of nodes, \(E\subseteq V \times V\) is the set of edges, and \(\mu :V\rightarrow L_V\) and \(\nu : E \rightarrow L_E\) are the node and edge labeling functions, respectively.

In this work we consider graphs to be labeled via the functions \(\mu \) and \(\nu \). The label alphabets \(L_V\) and \(L_E\) for both nodes and edges are generally not restricted to any domain. That is, the alphabets can be given by the set of integers \(L = \{1,2,3,\ldots \}\), the vector space \(L = {\mathbb {R}}^n\), a set of symbolic labels \(L = \{\alpha , \beta , \gamma ,\ldots \}\), or a combination of various label alphabets from different domains. Note that unlabeled graphs can be seen as special cases of labeled graphs by assigning the same (empty) label \(\varnothing \) to all nodes and edges, i.e. \(L_V = L_E = \{\varnothing \}\).

The idea of graph edit distance is to define a dissimilarity measure based on the number as well as the strength of the so-called edit operations that have to be applied to transform a graph \(g_1 = (V_1, E_1, \mu _1,\nu _1)\) into another graph \(g_2 = (V_2, E_2, \mu _2,\nu _2)\). These edit operations are commonly given by insertions, deletions, and substitutions of both nodes and edges. Other edit operations, such as merging or splitting, might be useful in some applications but are not considered in the current work (we refer to [3] for an application of additional edit operations). We denote the substitution of two nodes \(u \in V_1\) and \(v\in V_2\) by \((u \rightarrow v)\), the deletion of node \(u\in V_1\) by \((u \rightarrow \varepsilon )\), and the insertion of node \(v\in V_2\) by \((\varepsilon \rightarrow v)\), where \(\varepsilon \) refers to the empty node. For edge edit operations we use a similar notation.

Definition 2

(Edit Path) A set \(\{e_1, \ldots , e_k\}\) of k edit operations \(e_i\) that transform \(g_1\) completely into \(g_2\) is called an edit path \(\lambda (g_1,g_2)\) between \(g_1\) and \(g_2\).

Note that edge edit operations are unambiguously defined via the node edit operations. That is, whether an edge \((u,u')\) is substituted, deleted, or inserted depends on the edit operations actually performed on both adjacent nodes u and \(u'\). Formally, let \(u,u' \in V_1 \cup \{\varepsilon \}\) and \(v,v' \in V_2 \cup \{\varepsilon \}\), and assume that both edit operations \((u \rightarrow v)\) and \((u' \rightarrow v')\) are present in the edit path \(\lambda (g_1,g_2)\) under consideration. Depending on whether or not there is an edge \((u,u') \in E_1\) and/or an edge \((v,v') \in E_2\), the following three cases can be distinguished.

  1. If there are edges \(e_1 = (u,u') \in E_1\) and \(e_2 = (v,v') \in E_2\), the edge substitution \((e_1 \rightarrow e_2)\) is implied by \((u \rightarrow v)\) and \((u' \rightarrow v')\).

  2. If there is an edge \(e_1 = (u,u') \in E_1\) but no edge \(e_2 = (v,v') \in E_2\), the edge deletion \((e_1 \rightarrow \varepsilon )\) is implied by \((u \rightarrow v)\) and \((u' \rightarrow v')\). Obviously, if v or \(v'\) refers to the empty node \(\varepsilon \) there cannot be any edge \((v,v') \in E_2\), and thus the edge deletion \((e_1 \rightarrow \varepsilon )\) is necessary.

  3. If there is no edge \(e_1 = (u,u') \in E_1\) but an edge \(e_2 = (v,v') \in E_2\), the edge insertion \((\varepsilon \rightarrow e_2)\) is implied by \((u \rightarrow v)\) and \((u' \rightarrow v')\). Again, if \(u= \varepsilon \) or \(u' = \varepsilon \) there cannot be any edge \((u,u') \in E_1\).

Thus, it is in general sufficient that an edit path \(\lambda (g_1,g_2)\) covers the nodes from \(V_1\) and \(V_2\) only. From now on we assume that an edit path \(\lambda (g_1,g_2)\) explicitly describes the correspondences found between the graphs’ nodes \(V_1\) and \(V_2\), while the edge edit operations are implicitly given by these node correspondences.

Commonly, one introduces a cost c(e) for every edit operation e, measuring the strength of the corresponding operation. The idea of such a cost is to define whether or not an edit operation e represents a strong modification of the graph. Clearly, between two similar graphs, there should exist an inexpensive edit path, representing low cost operations, while for dissimilar graphs an edit path with high cost is needed. Consequently, the edit distance of two graphs is defined as follows.

Definition 3

(Graph Edit Distance) Let \(g_1 = (V_1, E_1, \mu _1,\nu _1)\) and \(g_2 = (V_2, E_2, \mu _2,\nu _2)\) be two graphs. The graph edit distance \(d_{\lambda _{\min }}(g_1,g_2)\), or \(d_{\lambda _{\min }}\) for short, between \(g_1\) and \(g_2\) is defined by

$$\begin{aligned} d_{\lambda _{\min }}(g_1,g_2) = \underset{\lambda \in \Upsilon (g_1, g_2)}{\min }\sum _{e_i \in \lambda } c(e_i)~~~ , \end{aligned}$$
(1)

where \(\Upsilon (g_1, g_2)\) denotes the set of all complete edit paths transforming \(g_1\) into \(g_2\), c denotes the cost function measuring the strength \(c(e_i)\) of node edit operation \(e_i\) (including the cost of all edge edit operations implied by the operations applied on the adjacent nodes of the edges), and \(\lambda _{\min }\) refers to the minimal cost edit path found in \(\Upsilon (g_1, g_2)\).

There might be two (or more) edit paths with equal minimal cost in \(\Upsilon (g_1, g_2)\). That is, the minimal cost edit path \(\lambda _{\min } \in \Upsilon (g_1, g_2)\) is not necessarily unique. Moreover, graph edit distance is not a metric in general. In practice, however, imposing some weak conditions on the cost function c is sufficient for the graph edit distance to become a metric [8].

An adequate definition of cost functions is not only important for building a metric, but also for the effectiveness of edit distance based pattern recognition (see [52] for an extensive review on different cost functions for graph edit distance). In case of unlabeled graphs, the cost is usually defined via unit cost for all deletions and insertions of both nodes and edges, while substitutions are free of cost. Formally,

$$\begin{aligned} c(u \rightarrow \varepsilon )&= c(\varepsilon \rightarrow u') = c((u,v) \rightarrow \varepsilon ) = c(\varepsilon \rightarrow (u',v')) = 1\\ c(u \rightarrow u')&= c((u,v) \rightarrow (u',v')) = 0 \end{aligned}$$

for all nodes \(u,v \in V_1\) and \(u',v' \in V_2\) as well as all edges \((u,v) \in E_1\) and \((u',v') \in E_2\).

In general, however, the cost c(e) of a particular edit operation e is defined with respect to the underlying label alphabets \(L_V\) and \(L_E\). For instance, for numerical node and edge labels, i.e. for label alphabets \(L_V = L_E = {\mathbb {R}}^n\), a Minkowski distance can be used to model the cost of a substitution operation. The Minkowski cost function defines the substitution cost as proportional to the Minkowski distance of the two corresponding labels. The basic intuition behind this approach is that the more dissimilar two labels are, the stronger is the distortion associated with the corresponding substitution.

In other applications the node and/or edge labels might not be numerical, and thus non-numerical distance functions have to be employed to measure the cost of a particular substitution operation. For instance, the label alphabet can be given by the set of all strings of arbitrary size over a finite set of symbols. In this case a distance model for strings, as for instance the string edit distance [28, 58], could be used for measuring the cost of a substitution. In other problem domains, the label alphabet might be given by a finite set of n symbolic labels \(L_{V/E} = \{\alpha _1, \alpha _2, \ldots , \alpha _n \}\). In such a case a substitution cost model using a Dirac function, which returns zero when the involved labels are identical and a non-negative constant otherwise, could be the method of choice.
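To make these two cost models concrete, the following minimal Python sketch implements both variants (the parameter names alpha and const are our own illustrative choices and not prescribed by any particular cost model):

```python
import numpy as np

def minkowski_substitution_cost(label_u, label_v, p=2, alpha=1.0):
    """Cost of substituting two numeric labels, proportional to their
    Minkowski distance of order p (p=2 yields the Euclidean distance)."""
    diff = np.abs(np.asarray(label_u, dtype=float) - np.asarray(label_v, dtype=float))
    return alpha * np.power(np.sum(diff ** p), 1.0 / p)

def dirac_substitution_cost(label_u, label_v, const=1.0):
    """Cost for symbolic labels: zero for identical labels and a
    constant otherwise."""
    return 0.0 if label_u == label_v else const
```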

The definition of application specific cost functions, which can be adapted to the peculiarities of the underlying label alphabets, accounts for the flexibility of graph edit distance. However, prior knowledge about the labels and their meaning has to be available. If in a particular case this prior knowledge is not available, automatic procedures for learning the cost model from a set of sample graphs are available as well [12, 30,31,32, 53].

2.1 Computation of Graph Edit Distance

The exact computation of graph edit distance \(d_{\lambda _{\min }}(g_1,g_2)\) is often based on A* [5, 16, 21, 40]. A* is a best-first search algorithm [22] which always finds an optimal solution if the underlying heuristic function is admissible.

However, for graphs with m and n nodes, the time complexity of these complete and admissible search algorithms is in \(O(m^n)\). Hence, the computation of exact edit distance is limited to graphs of rather small size. In fact, graph edit distance belongs to the family of quadratic assignment problems (QAPs) [26], which in turn belong to the class of NP-complete problems (see [7, 36] for details on formulating the GED problem as a QAP). That is, an exact and efficient algorithm for the graph edit distance problem cannot be developed unless \(P=NP\).

Various methods address the high complexity of graph edit distance computation. Local optimization criteria [6, 33, 54], for instance, are used to solve the error-tolerant graph matching problem in a more efficient way. Another idea for efficient graph edit distance is to prune the underlying search tree and consequently reduce both the search space and the matching time [34]. Linear programming for computing the edit distance of graphs with unlabeled edges is proposed in [25]. Finding an optimal match between the sets of subgraphs by means of dynamic programming [13, 14] is another possibility for speeding up the computation of graph edit distance.

2.1.1 Cubic Time Approximation of Graph Edit Distance

The authors of this paper introduced an algorithmic framework which allows the approximate computation of graph edit distance in a substantially faster way than traditional methods on general graphs [38]. The basic idea of this approach is to reduce the quadratic assignment problem of graph edit distance computation to an instance of a Linear Sum Assignment Problem (LSAP). LSAPs are similar to QAPs in the sense that they also formulate an assignment problem between entities. However, in contrast with QAPs, LSAPs optimize the assignment with respect to a linear term only.

For solving LSAPs a large number of algorithms exist (see [9] for an exhaustive survey). They range from primal–dual combinatorial algorithms [24, 27, 29], to simplex-like methods [2, 35] and other approaches [1, 55]. The time complexity of the best performing exact algorithms for LSAPs is cubic in the size of the problem. Hence, LSAPs can be—in contrast with QAPs—quite efficiently solved.

LSAPs are concerned with the problem of finding the best bijective assignment between the independent entities of two sets \(S_1 = \{s^{(1)}_1, \ldots , s^{(1)}_n \}\) and \(S_2 = \{s^{(2)}_1, \ldots , s^{(2)}_n \}\) of equal size. In order to assess the quality of an assignment of two entities, a cost \(c_{ij}\) is commonly defined that measures the suitability of assigning the i-th element \(s^{(1)}_i \in S_1\) to the j-th element \(s^{(2)}_j \in S_2\) (resulting in \(n\times n\) cost values \(c_{ij}\) (\(i,j = 1, \ldots , n\))).

Definition 4

(Linear Sum Assignment Problem (LSAP)) Given two disjoint sets \(S_1 = \{s^{(1)}_1, \ldots , s^{(1)}_n \}\) and \(S_2 = \{s^{(2)}_1, \ldots , s^{(2)}_n \}\) and a cost \(c_{ij}\) for every pair of entities \((s^{(1)}_i,s^{(2)}_j) \in S_1 \times S_2\), the Linear Sum Assignment Problem (LSAP) is given by finding

$$\begin{aligned} \min _{(\varphi _1, \ldots , \varphi _{n})\in {\mathscr {S}}_n}\sum _{i=1}^{n} c_{i\varphi _i} \end{aligned}$$

where \({\mathscr {S}}_n\) refers to the set of all n! possible permutations of n integers.
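For illustration, the following sketch solves a small LSAP with scipy's optimal (Hungarian-style) solver; the cost values are arbitrary:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

# Arbitrary 3x3 cost matrix c_ij for assigning entities of S1 to S2.
C = np.array([[4.0, 1.0, 3.0],
              [2.0, 0.0, 5.0],
              [3.0, 2.0, 2.0]])

row_ind, col_ind = linear_sum_assignment(C)  # optimal permutation
print(list(zip(row_ind, col_ind)))           # [(0, 1), (1, 0), (2, 2)]
print(C[row_ind, col_ind].sum())             # 5.0, the minimum total cost
```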

When reformulating the graph edit distance problem as an instance of an LSAP, three major issues have to be resolved.

  1. First, LSAPs are generally stated on independent sets with equal cardinality. However, in our case the elements to be assigned to each other are given by the sets of nodes (and edges), which in general have unequal cardinality.

  2. Second, solutions to LSAPs refer to assignments of elements in which every element of the first set is assigned to exactly one element of the second set and vice versa (i.e. a solution to an LSAP corresponds to a bijective assignment of the underlying entities). However, graph edit distance is a more general assignment problem as it explicitly allows both deletions and insertions to occur on the basic entities (rather than only substitutions).

  3. Third, graphs do not only consist of independent sets of entities (i.e. nodes) but also of structural relationships between these entities (i.e. edges that connect pairs of nodes). LSAPs are not able to consider these relationships in a global and consistent way.

The first two issues can be simultaneously resolved by adding an appropriate number of empty nodes \(\varepsilon \) to both graphs \(g_1\) and \(g_2\). Assuming that \(|V_1| = n\) and \(|V_2|=m\), we extend \(V_1\) and \(V_2\) according to

$$\begin{aligned}V_1^{+} = V_1 \cup \overbrace{\{\varepsilon _1 ,\ldots , \varepsilon _m\} }^{\textit{m empty nodes}} \end{aligned}$$

and

$$\begin{aligned}V_2^{+} = V_2 \cup \underbrace{\{\varepsilon _1 ,\ldots , \varepsilon _n\} }_{\textit{n empty nodes}}. \end{aligned}$$

The LSAP can now be carried out on these extended node sets. Formally, based on the extended node sets

$$\begin{aligned}V^+_1 = \{u_{1}, \ldots , u_{n}, \varepsilon _1,\ldots , \varepsilon _m\} \text { and } V^+_2 = \{v_{1}, \ldots , v_{m}, \varepsilon _1,\ldots , \varepsilon _n\}\end{aligned}$$

of \(g_1\) and \(g_2\), respectively, a cost matrix \(\mathbf {C}\) can be established as follows.

$$\begin{aligned} \mathbf {C} = \left[ \begin{array}{cccc|cccc} c_{11} & c_{12} & \cdots & c_{1m} & c_{1\varepsilon } & \infty & \cdots & \infty \\ c_{21} & c_{22} & \cdots & c_{2m} & \infty & c_{2\varepsilon } & \ddots & \vdots \\ \vdots & \vdots & \ddots & \vdots & \vdots & \ddots & \ddots & \infty \\ c_{n1} & c_{n2} & \cdots & c_{nm} & \infty & \cdots & \infty & c_{n\varepsilon } \\ \hline c_{\varepsilon 1} & \infty & \cdots & \infty & 0 & 0 & \cdots & 0 \\ \infty & c_{\varepsilon 2} & \ddots & \vdots & 0 & 0 & \ddots & \vdots \\ \vdots & \ddots & \ddots & \infty & \vdots & \ddots & \ddots & 0 \\ \infty & \cdots & \infty & c_{\varepsilon m} & 0 & \cdots & 0 & 0 \end{array} \right] \end{aligned}$$
(2)

Entry \(c_{ij}\) thereby denotes the cost \(c(u_i \rightarrow v_j)\) of the node substitution \((u_i \rightarrow v_j)\), \(c_{i \varepsilon }\) denotes the cost \(c(u_i \rightarrow \varepsilon )\) of the node deletion \((u_i \rightarrow \varepsilon )\), and \(c_{\varepsilon j}\) denotes the cost \(c(\varepsilon \rightarrow v_j)\) of the node insertion \((\varepsilon \rightarrow v_j)\).

Obviously, the left upper part of the cost matrix \(\mathbf {C}= (c_{ij})\) represents the costs of all possible node substitutions, the diagonal of the right upper part the costs of all possible node deletions, and the diagonal of the bottom left part the costs of all possible node insertions. Every node can be deleted or inserted at most once. Therefore any non-diagonal element of the right-upper and left-lower part can be set to \(\infty \). The bottom right part of the cost matrix is set to zero since substitutions of the form \((\varepsilon \rightarrow \varepsilon )\) should not cause any cost. In [46, 50, 51] alternative definitions of a cost matrix \(\mathbf {C}\) have been proposed in order to decrease the size of the assignment problem.
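A minimal numpy sketch of this construction (the function and parameter names are ours, not taken from [38]):

```python
import numpy as np

def build_cost_matrix(n, m, c_sub, c_del, c_ins):
    """Build the (n+m) x (n+m) cost matrix C of Eq. (2).

    c_sub: (n, m) array of node substitution costs c(u_i -> v_j)
    c_del: length-n array of node deletion costs  c(u_i -> eps)
    c_ins: length-m array of node insertion costs c(eps -> v_j)
    """
    C = np.zeros((n + m, n + m))
    C[:n, :m] = c_sub                      # upper left: substitutions
    C[:n, m:] = np.inf                     # upper right: deletions ...
    np.fill_diagonal(C[:n, m:], c_del)     # ... only on the diagonal
    C[n:, :m] = np.inf                     # lower left: insertions ...
    np.fill_diagonal(C[n:, :m], c_ins)     # ... only on the diagonal
    # lower right block stays zero: (eps -> eps) causes no cost
    return C
```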

The third issue stated above concerns the edge structure of both graphs, which cannot be considered by LSAPs. In fact, so far the cost matrix \(\mathbf {C} =(c_{ij})\) considers the nodes of both graphs only, and thus the assignment algorithm does not take any structural constraints into account. In order to integrate knowledge about the graph structure, the minimum sum of edge edit operation costs implied by the corresponding node operation is added to each entry \(c_{ij}\), i.e. to each cost of a node edit operation \((u_i \rightarrow v_j)\).

This particular encoding of the minimum matching cost arising from the local edge structure enables the LSAP to consider information about the local, yet not global, edge structure of a graph. Hence, this heuristic procedure partially resolves the third issue (a complete solution for this third problem would be equivalent to an exact computation of graph edit distance, which would be unreasonable of course). Several other attempts have been made to include more adjacency information into the assignment process of [38] (e.g. [19, 44]). However, in this paper we make use of the original version of the algorithm.
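As an illustration of this local encoding, the following sketch estimates, for a node substitution \((u_i \rightarrow v_j)\), the minimum cost of matching the edges incident to \(u_i\) with those incident to \(v_j\) by solving a small LSAP; the unit deletion/insertion costs and all names are our simplifying assumptions:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def local_edge_cost(deg_u, deg_v, c_es, c_ed=1.0, c_ei=1.0):
    """Minimum cost of matching the edges incident to u with those
    incident to v, stated as a small LSAP on a (deg_u + deg_v) square
    matrix; c_es is a (deg_u, deg_v) matrix of edge substitution costs."""
    k = deg_u + deg_v
    D = np.zeros((k, k))
    D[:deg_u, :deg_v] = c_es
    D[:deg_u, deg_v:] = np.inf
    np.fill_diagonal(D[:deg_u, deg_v:], c_ed)   # edge deletions
    D[deg_u:, :deg_v] = np.inf
    np.fill_diagonal(D[deg_u:, :deg_v], c_ei)   # edge insertions
    rows, cols = linear_sum_assignment(D)
    return D[rows, cols].sum()
```

Each entry \(c_{ij}\) of \(\mathbf {C}\) is then augmented by the corresponding local edge cost before the assignment is computed.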

The permutation \((\varphi _1, \ldots , \varphi _{(n+m)})\) that minimizes the sum of cost

$$\begin{aligned}\sum _{i=1}^{(n+m)} c_{i\varphi _i}\end{aligned}$$

actually corresponds to a bijective assignment

$$\begin{aligned}\psi = \{(u_{1}\rightarrow v_{\varphi _1}), (u_{2}\rightarrow v_{\varphi _2}), \ldots , (u_{m+n}\rightarrow v_{\varphi _{m+n}})\}\end{aligned}$$

of the extended node set \(V_1^{+}\) of \(g_1\) to the extended node set \(V_2^{+}\) of \(g_2\). That is, the assignment \(\psi \) includes node edit operations of the form \((u_i \rightarrow v_j)\), \((u_i \rightarrow \varepsilon )\), \((\varepsilon \rightarrow v_j)\), and \((\varepsilon \rightarrow \varepsilon )\) (the latter can be dismissed, of course). In other words, the permutation \((\varphi _1, \ldots , \varphi _{(n+m)})\) corresponds to an admissible and complete edit path between the graphs under consideration, i.e. \(\psi \in \Upsilon (g_1,g_2)\). The sum of costs of \(\psi \) gives us an approximation value \(d_{ \psi }(g_1, g_2)\), or \(d_{ \psi }\) for short, for the graph edit distance.

Note that the edge operations are only added to the edit path \(\psi \) after the optimization process has been terminated. This is because LSAP solving algorithms are not able to take information about assignments of adjacent nodes into account during run time. In other words, for finding the edit path \(\psi \in \Upsilon (g_1,g_2)\) based on the cost matrix \(\mathbf {C} =(c_{ij})\) the structural information of the graphs is considered in an isolated way only (single nodes and their adjacent edges). Hence, \(\psi \in \Upsilon (g_1,g_2)\) is a suboptimal edit path with cost greater than, or equal to, the optimal edit path \(\lambda _{\min }\) (see [43] for a formal proof).

For the remainder of this paper we denote this graph edit distance approximation algorithm by BP-GED (Bipartite Graph Edit Distance).

2.1.2 Quadratic Time Approximation of Graph Edit Distance

From a high level perspective, the algorithmic framework presented in [38] consists of the following three major steps.

  1. In a first step, the graphs to be matched are subdivided into individual nodes including local structural information.

  2. Next, in step 2, an algorithm that solves the LSAP is employed in order to find an optimal assignment of the nodes (plus local structures) of both graphs.

  3. Finally, in step 3, an approximate graph edit distance is derived from the assignment of step 2.

For the second step of BP-GED, Munkres' algorithm [29], also referred to as the Kuhn–Munkres or Hungarian algorithm, is deployed in the existing framework [38]. The time complexity of this particular algorithm (as well as of the other best performing algorithms for LSAPs) is cubic in the size of the problem, i.e. \(O((n+m)^3)\) in our case. There exist several studies in which different LSAP solving algorithms for this particular graph matching procedure have been compared with each other [15, 23, 51].

Recently, it has been proposed to solve the LSAP stated on \(\mathbf {C}\) with an approximation rather than with an exact algorithm [41, 42]. While for the optimal solution of LSAPs quite an arsenal of algorithms is available, only a few works are concerned with the suboptimal solution of general LSAPs (see [4] for an early survey).

A basic greedy algorithm that suboptimally solves the LSAP stated on cost matrix \(\mathbf {C}\) is formalized in Algorithm 1. This algorithm iterates through \(\mathbf {C}\) from top to bottom over all rows and greedily assigns every row to the minimum unused column. More formally, for each row i in the cost matrix \(\mathbf {C} = (c_{ij})\) the minimum cost entry \(\varphi _i = \underset{\forall j }{\arg \min }~c_{ij}\) is determined and the corresponding node edit operation \((u_i \rightarrow v_{\varphi _i})\) is added to \(\psi \). By removing column \(\varphi _i\) from \(\mathbf {C}\) it is ensured that every column of the cost matrix is considered exactly once (i.e. \(\forall j\) refers to the available columns in \(\mathbf {C}\)). Clearly, the complexity of this suboptimal assignment algorithm is \(O((n+m)^2)\).

[Algorithm 1: Greedy assignment of the rows of \(\mathbf {C}\) to its columns]
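A minimal Python sketch of this greedy procedure (our own rendering of Algorithm 1, not the original pseudo-code; C is a square numpy array):

```python
def greedy_assignment(C):
    """Greedy suboptimal LSAP solver: scan the rows of C from top to
    bottom and assign each row to the cheapest still unused column.
    Runs in O((n+m)^2) time."""
    k = C.shape[0]
    available = list(range(k))        # columns not yet used
    psi = []                          # resulting assignment (i, phi_i)
    total = 0.0
    for i in range(k):
        j = min(available, key=lambda col: C[i, col])  # cheapest free column
        psi.append((i, j))
        total += C[i, j]
        available.remove(j)           # every column is used exactly once
    return psi, total
```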

In contrast with optimal LSAP solvers, the quadratic time assignment method operates in a greedy manner. That is, this approach is not able to undo a certain node assignment once it has been added to \(\psi \). Hence, this particular method crucially depends on the order in which the nodes are processed. However, the nodes of a graph and thus the first n rows (and m columns) in \(\mathbf {C}\) are arbitrarily ordered.

In order to take this observation into account, we reorder the first n rows of \(\mathbf {C}\) in ascending order with respect to the minimum cost entry per row (before the assignment is carried out). Formally, we sort the cost matrix \(\mathbf {C}\) such that

$$\begin{aligned}\underset{\forall j }{\min } ~c_{1j} \le \underset{\forall j }{\min }~c_{2j} \le \ldots \le \underset{\forall j }{\min }~c_{nj}~~~.\end{aligned}$$

That is, in the first row the overall minimum cost assignment can be found, while the second row contains the second smallest row minimum, the third row the third smallest row minimum, and so on.

This heuristic ensures that the most evident assignments, i.e. the assignments with overall lowest cost, are considered first by our greedy assignment. For the remainder of this paper we denote the graph edit distance approximation where the basic node assignment (i.e. step 2 of BP-GED) is computed by means of this greedy procedure with GR-GED.
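The reordering can be sketched as follows (again with illustrative names; the rows below row n are left in place):

```python
import numpy as np

def sort_rows_by_min(C, n):
    """Reorder the first n rows of C in ascending order of their row
    minima so that the most evident (cheapest) assignments are processed
    first; a full implementation would also keep the permutation `order`
    to map rows back to the original nodes."""
    order = np.argsort(C[:n].min(axis=1))   # row indices, cheapest first
    return np.vstack([C[:n][order], C[n:]])
```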

3 Building the Utility Matrix

The major contribution of this paper is the introduction of a novel matrix for solving the graph matching problem. Similar to [41, 42], we aim at solving the basic LSAP in quadratic time in order to approximate the graph edit distance with a greedy assignment. However, in contrast with this previous approach, which considers the cost matrix \(\mathbf {C}=(c_{ij})\) directly as its basis, we transform the given cost matrix into a utility matrix with the same dimensions as \(\mathbf {C}\) and work with this matrix instead.

Basically, for each assignment \((u_i \rightarrow v_j)\) the cost \(c_{ij}\) is assessed with respect to all cost entries in the i-th row and j-th column of \(\mathbf {C}\). That is, we define a utility of this particular edit operation in relation to all other possible assignments that involve \(u_i\) or \(v_j\).

Let us consider the i-th row of the cost matrix \(\mathbf {C}\) and let \(\textit{row-min}_{i}\) and \(\textit{row-max}_{i}\) denote the minimum and maximum value occurring in this row, respectively. Formally, we have

$$\begin{aligned} \textit{row-min}_{i}= \min _{j=1,\ldots , (n+m)} c_{ij} \end{aligned}$$

and

$$\begin{aligned} \textit{row-max}_{i} = \max _{j=1,\ldots , (n+m)} c_{ij}~~~.\end{aligned}$$

If the node edit operation \((u_i \rightarrow v_j)\) is selected, one might interpret the quantity

$$\begin{aligned} \textit{row-win}_{ij} = \frac{\textit{row-max}_{i} - c_{ij}}{\textit{row-max}_{i} - \textit{row-min}_{i} }\end{aligned}$$

as a win for \((u_i \rightarrow v_j)\), when compared to the locally worst case situation where \(v_k\) with \(k = \arg \max _{j=1,\ldots , (n+m)} c_{ij}\) is chosen as target node for \(u_i\). Likewise, we might interpret

$$\begin{aligned} \textit{row-loss}_{ij} = \frac{c_{ij} - \textit{row-min}_{i}}{\textit{row-max}_{i} - \textit{row-min}_{i}}\end{aligned}$$

as a loss for \((u_i \rightarrow v_j)\), when compared to selecting the minimum cost assignment which would be possible in this row. Note that both \(\textit{row-win}_{ij}\) and \(\textit{row-loss}_{ij}\) are normalized to the interval [0, 1]. That is, when \(c_{ij} = \textit{row-min}_{i}\) we have a maximum win of 1 and a minimum loss of 0. Likewise, when \(c_{ij} = \textit{row-max}_{i}\) we observe a minimum win of 0 and a maximum loss of 1.

Overall we define the utility of the node edit operation \((u_i \rightarrow v_j)\) with respect to row i as

$$\begin{aligned} \textit{row-utility}_{ij} &= \textit{row-win}_{ij} - \textit{row-loss}_{ij} \\ &= \frac{\textit{row-max}_{i} + \textit{row-min}_{i} - 2c_{ij}}{\textit{row-max}_{i} - \textit{row-min}_{i}}~~~. \end{aligned}$$

Clearly, when \(c_{ij} = \textit{row-min}_{i}\) we observe a row utility of \(+\,1\), and vice versa, when \(c_{ij} = \textit{row-max}_{i}\) we have a row utility of \(-\,1\).

So far the utility of a node edit operation \((u_i \rightarrow v_j)\) is quantified with respect to the i-th row only. In order to take information about the j-th column into account, we determine the minimum and maximum values that occur in column j by

$$\begin{aligned} \textit{col-min}_{j}= \min _{i=1,\ldots , (n+m)} c_{ij} \end{aligned}$$

and

$$\begin{aligned} \textit{col-max}_{j} = \max _{i=1,\ldots , (n+m)} c_{ij}~~~.\end{aligned}$$

Eventually, we define

$$\begin{aligned} \textit{col-win}_{ij} = \frac{\textit{col-max}_{j} - c_{ij}}{\textit{col-max}_{j} - \textit{col-min}_{j} } \end{aligned}$$

and

$$\begin{aligned} \textit{col-loss}_{ij} = \frac{c_{ij} - \textit{col-min}_{j}}{\textit{col-max}_{j} - \textit{col-min}_{j}} ~~~.\end{aligned}$$

Similarly to the utility of the node edit operation \((u_i \rightarrow v_j)\) with respect to row i we may define the utility of the same edit operation with respect to column j as

$$\begin{aligned} \textit{col-utility}_{ij} &= \textit{col-win}_{ij} - \textit{col-loss}_{ij} \\ &= \frac{\textit{col-max}_{j} + \textit{col-min}_{j} - 2c_{ij}}{\textit{col-max}_{j} - \textit{col-min}_{j}}~~~. \end{aligned}$$

To finally estimate the utility \(u_{ij}\) of a node edit operation \((u_i \rightarrow v_j)\) we apply one of the following three rules.

  • Min:

    $$\begin{aligned}u_{ij} = \min (\textit{row-utility}_{ij},\textit{col-utility}_{ij})\end{aligned}$$
  • Max:

    $$\begin{aligned}u_{ij} = \max (\textit{row-utility}_{ij},\textit{col-utility}_{ij})\end{aligned}$$
  • Sum:

    $$\begin{aligned}u_{ij} = \textit{row-utility}_{ij}+\textit{col-utility}_{ij}~~~\end{aligned}$$

Since both \(\textit{row-utility}_{ij}\) and \(\textit{col-utility}_{ij}\) lie in the interval \([-\,1,1]\), we observe \(u_{ij} \in [-\,1,1]\) for the first and second rule, while \(u_{ij} \in [-\,2,2]\) holds for the third rule. We denote the final utility matrix by \(\mathbf {U}_{\text {min}}\), \(\mathbf {U}_{\text {max}}\), or \(\mathbf {U}_{\text {sum}}\) (depending on the rule actually employed) from now on.
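The following numpy sketch condenses the definitions above into a single transformation (the eps guard against constant rows or columns is our own addition and not part of the formal definition):

```python
import numpy as np

def utility_matrix(C, rule="sum", eps=1e-12):
    """Transform a cost matrix C into a utility matrix U.
    Note: entries of the extended cost matrix that are set to infinity
    (forbidden assignments) would need special treatment beforehand."""
    rmin = C.min(axis=1, keepdims=True)
    rmax = C.max(axis=1, keepdims=True)
    cmin = C.min(axis=0, keepdims=True)
    cmax = C.max(axis=0, keepdims=True)
    row_util = (rmax + rmin - 2 * C) / np.maximum(rmax - rmin, eps)
    col_util = (cmax + cmin - 2 * C) / np.maximum(cmax - cmin, eps)
    if rule == "min":
        return np.minimum(row_util, col_util)
    if rule == "max":
        return np.maximum(row_util, col_util)
    return row_util + col_util        # the "sum" rule
```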

We aim at employing the same greedy assignment procedure on \(\mathbf {U}\) as used for GR-GED (see Algorithm 1). Hence, we also reorder the first n rows of \(\mathbf {U}\), this time, however, in descending order with respect to the maximum utility in each row. That is, in the first row the overall maximum utility can be found, the second row contains the second highest row utility, and so on. Formally, we sort the utility matrix such that

$$\begin{aligned}\underset{\forall j }{\max } ~u_{1j} \ge \underset{\forall j }{\max }~u_{2j} \ge \ldots \ge \underset{\forall j }{\max }~u_{nj}~~~.\end{aligned}$$

This heuristic again ensures that the most evident assignments, i.e. the assignments with highest utility, are considered first by our greedy assignment.

The rationale behind the transformation of \(\mathbf {C}\) to \(\mathbf {U}\) is based on the following observation. When picking the minimum element \(c_{ij}\) from cost matrix \(\mathbf {C}\), i.e. when assigning node \(u_i\) to \(v_j\), we exclude both nodes \(u_i\) and \(v_j\) from any future assignment. However, it may happen that node \(v_j\) is not only the best choice for \(u_i\) but also for another node \(u_k\). Because \(v_j\) is no longer available, we may be forced to map \(u_k\) to another, very expensive node \(v_l\), such that the total assignment cost becomes higher than mapping node \(u_i\) to some node that is (slightly) more expensive than \(v_j\). In order to take such situations into account, we incorporate additional information in the utility matrix about the minimum and maximum value in each row and each column.

Example 1

Let us consider a simple toy example with a \(3\times 3\) cost matrix.

With a greedy algorithm we would find the assignment

$$\begin{aligned}\psi = \{u_1 \rightarrow v_1, u_2 \rightarrow v_2,u_3 \rightarrow v_3\}\end{aligned}$$

with a total assignment cost of 14.

The row and column utilities computed on \(\mathbf {C} \) are given by

Then, the final utility matrix (using the sum rule) is defined as

resulting in the greedy assignment

$$\begin{aligned}\psi = \{u_1 \rightarrow v_1, u_2 \rightarrow v_3,u_3 \rightarrow v_2\} \end{aligned}$$

with a lower total assignment cost, viz. 9, than directly achieved on \(\mathbf {C}\).
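Since the matrices of this example are not reproduced above, the following sketch runs the complete transformation on a hypothetical \(3\times 3\) cost matrix of our own choosing (not the one from the original example) that exhibits the same behavior; it reuses the greedy_assignment and utility_matrix sketches given earlier:

```python
import numpy as np

# Hypothetical cost matrix (ours, not the paper's) reproducing the effect.
C = np.array([[1.0, 6.0, 4.0],
              [7.0, 5.0, 5.5],
              [9.0, 2.5, 8.0]])

psi_c, cost_c = greedy_assignment(C)
print(psi_c, cost_c)    # [(0, 0), (1, 1), (2, 2)] 14.0

U = utility_matrix(C, rule="sum")
psi_u, _ = greedy_assignment(-U)   # maximizing utility = minimizing -U
print(psi_u, sum(C[i, j] for i, j in psi_u))   # [(0, 0), (1, 2), (2, 1)] 9.0
```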

4 Experimental Evaluation

4.1 Setup and Data Sets

In the experimental evaluation we aim at investigating the benefit of using the utility matrix \(\mathbf {U}\) instead of the cost matrix \(\mathbf {C}\) in the GR-GED framework. In particular, we assess the quality of the different distance approximations by comparing the sums of distances and by means of a distance based classifier, viz. a nearest-neighbor classifier (NN). There are various other approaches to graph classification that make use of graph edit distance in some form. However, the nearest neighbor paradigm is particularly interesting for the present evaluation because it directly uses the distances without any additional classifier training.
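Such a classifier reduces to a single argmin over precomputed distances, as in the following sketch (function and variable names are ours):

```python
import numpy as np

def nn_classify(D, train_labels):
    """1-NN classification: D is a (num_test x num_train) matrix of
    (approximate) graph edit distances; each test graph receives the
    label of its nearest training graph."""
    return np.asarray(train_labels)[np.argmin(D, axis=1)]
```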

Table 1 The mean and max number of nodes and edges per data set

For the experimental evaluations three data sets from the IAM graph database repository [37] and two data sets from GREYC's data set repository are used. In Table 1 the mean and max number of nodes and edges for each data set are given.

Four data sets consist of graphs representing molecular compounds from different applications, viz. AIDS, MUTA, MAO, and PAH (all of these data sets represent two class problems) and one data set consists of graphs that represent proteins stemming from six different classes (PROT).

The molecular structures, which consist of atoms and covalent bonds, are converted into graphs in a very natural and straightforward manner by representing atoms as nodes and covalent bonds as edges. Nodes are labeled with their corresponding chemical symbol and edges with the valence of the linkage.

The proteins are converted into graphs by representing the structure, the sequence, and chemical properties of a protein by nodes and edges. Nodes represent secondary structure elements (SSE) within the protein structure, labeled with their type (helix, sheet, or loop) and their amino acid sequence. Every pair of nodes is connected by an edge if they are neighbors along the amino acid sequence (sequential edges) or if they are neighbors in space within the protein structure (structural edges). Every node is connected to its three nearest spatial neighbors. In case of sequential relationships, the edges are labeled with their length in amino acids, while in case of structural edges a distance measure in Ångstroms is used as a label.

For the molecular compounds the following cost model has been employed. For node and edge deletions/insertions constant positive costs \(\tau _{\textit{node}}\) and \(\tau _{\textit{edge}}\) have been used. The node substitution cost is measured via a Dirac function returning 0 if the two symbols are equal, and \(2\tau _{\textit{node}}\) otherwise. Edge substitutions are free of cost.

For the protein graphs a cost model based on the amino acid sequences is used. For node substitutions the types of the involved nodes are compared first. If the two types are identical, the amino acid sequences of the nodes to be substituted are compared by means of the string edit distance [58]. For edge substitutions, we measure the dissimilarity with a Dirac function returning 0 if the two edge types are equal, and \(2 \tau _{\textit{edge}}\) otherwise.
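For the comparison of amino acid sequences, a unit-cost string edit (Levenshtein) distance can be sketched as follows (the actual cost model in [58] may weight the individual operations differently):

```python
def string_edit_distance(s, t):
    """Levenshtein distance with unit costs, computed row by row
    via dynamic programming in O(|s| * |t|) time."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        curr = [i]
        for j, ct in enumerate(t, start=1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (cs != ct)))   # substitution
        prev = curr
    return prev[-1]
```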

Table 2 The mean run time for one matching (\(\varnothing t\)), the relative increase/decrease of the sum of distances compared with BP-GED, and the recognition rate (rr) of a nearest-neighbor classifier using a specific graph edit distance algorithm

4.2 Results and Discussion

In Table 2 the results obtained with five different graph edit distance approximations are shown. The first algorithm is BP-GED(\(\mathbf {C}\)), which solves the LSAP on \(\mathbf {C}\) in an optimal manner in cubic time [38]. The second algorithm is GR-GED(\(\mathbf {C}\)), which solves the LSAP on \(\mathbf {C}\) in a greedy manner in quadratic time [41, 42]. These two systems act as reference systems for our novel approach. The remaining algorithms are GR-GED(\(\mathbf {U}\)), which employ the greedy algorithm on a utility matrix \(\mathbf {U}\) instead of \(\mathbf {C}\) using the three rules for the definition of the utilities (Min, Max, and Sum).

We first focus on the mean run time for one matching in ms (\(\varnothing t\)) and compare BP-GED with GR-GED, both operating on the original cost matrix \(\mathbf {C}\). On all data sets substantial speed-ups of the greedy approach can be observed. On the AIDS data set, for instance, the greedy approach GR-GED(\(\mathbf {C}\)) is approximately three times faster than BP-GED. On the MUTA data set the mean matching time is decreased from 33.9 to 4.8 ms (seven times faster), and on the PROT data the greedy approach approximately halves the matching time (25.6 vs. 14.1 ms). Also on PAH and MAO slightly decreased run times are observed.

Comparing GR-GED(\(\mathbf {C}\)) with GR-GED(\(\mathbf {U}\)) we observe no, or only negligible, increases of the matching time when the latter approach is used (we show only one mean time for all utility matrices as no differences are observable between the three strategies). The slight increase of the run time, which is actually observable on MUTA and PAH only, is due to the computational overhead that is necessary for transforming the cost matrix \(\mathbf {C}\) to the utility matrix \(\mathbf {U}\).

Next, we focus on the distance quality of the greedy approximation algorithms. All of the employed algorithms return an upper bound on the true edit distance, and thus, the lower the sum of distances of a specific algorithm is, the better is its approximation quality. For our evaluation we take the sum of distances returned by BP-GED as reference point and measure the relative increase or decrease of the sum of distances when compared with BP-GED (sod).

For instance, we observe that GR-GED(\(\mathbf {C}\)) increases the sum of distances by 2.1% on the AIDS data when compared with BP-GED. On three other data sets, viz. MUTA, PROT, and MAO, the sum of distances is also increased (by 1.3, 9.9 and 18.1%, respectively). On PAH we observe a decrease of the sum of distances by 4.5%. By using the utility matrix \(\mathbf {U}\) rather than the cost matrix \(\mathbf {C}\) in the greedy assignment algorithm, we observe smaller sums of distances on the MUTA, PAH, and PROT data sets. On the other data sets our novel approach leads to higher approximation errors (on MAO, for instance, the approximation error is substantially increased). Note, however, that increased distances are not necessarily disadvantageous. For instance, increasing the distances between two graphs stemming from different classes might be beneficial for distance based classifiers. Thus, we finally focus on the recognition rate (rr) of an NN-classifier that uses the different distance approximations.

We observe that the NN-classifier that is based on the distances returned by GR-GED(\(\mathbf {C}\)) achieves lower recognition rates than the same classifier that uses distances from BP-GED(\(\mathbf {C}\)) (on all data sets but PAH). This loss in recognition accuracy may be attributed to the fact that the approximations in GR-GED are coarser than those in BP-GED. However, our novel procedure, i.e. GR-GED(\(\mathbf {U}\)), improves the recognition accuracy on four out of five data sets when compared to GR-GED(\(\mathbf {C}\)). For instance, on the MUTA data set the recognition rate is increased from 69.6 to 71.7% when \(\mathbf {U_{\text {min}}}\) rather than \(\mathbf {C}\) is used. In fact, this result on the MUTA data set refers to the overall best recognition rate among all competing algorithms. Also on the PAH data set our novel approach (using \(\mathbf {U}_{\text {max}}\)) achieves the overall best result.

Comparing the three different utility matrices \(\mathbf {U_{\text {min}}}\), \(\mathbf {U_{\text {max}}}\), and \(\mathbf {U_{\text {sum}}}\) with each other, we observe that the Max rule achieves the best result on three out of five data sets (PROT, PAH, and MAO), while Min and Sum achieve the best result on two and one data set, respectively.

5 Conclusions and Future Work

In this paper we propose to use a utility matrix instead of a cost matrix for the assignment of local substructures in a graph. The motivation for this transformation is based on the greedy behavior of the basic assignment algorithm. More precisely, with the transformation of the cost matrix into a utility matrix we aim at increasing the probability of selecting correct node edit operations during the optimization process.

In an experimental evaluation on five real world data sets, we empirically confirm that our novel approach is able to increase the accuracy of a distance based classifier, while the run time is nearly unaffected (or even decreased). In particular, our novel approach is up to seven times faster than BP-GED(\(\mathbf {C}\)) and improves the classification accuracy on four out of five data sets when compared with the same algorithm operating on cost (rather than utility) matrices [GR-GED(\(\mathbf {C}\))].

In future work we aim at testing other (greedy) assignment algorithms on the utility matrix \(\mathbf {U}\). Moreover, there seems to be room for developing and researching variants of the utility matrix with the aim of integrating additional information about the trade-off between wins and losses of individual assignments. Last but not least, we plan to evaluate this as well as other transformations not only on real world but also on artificial graphs in order to better understand the benefits and limitations of matrix transformations in the context of graph matching.