Graph Edit Distance or Graph Edit Pseudo-Distance?

Serratosa, Francesc; Cortés, Xavier; Moreno, Carlos-Francisco

doi:10.1007/978-3-319-49055-7_47

Francesc Serratosa¹⁸,
Xavier Cortés¹⁸ &
Carlos-Francisco Moreno¹⁸

Part of the book series: Lecture Notes in Computer Science ((LNIP,volume 10029))

Included in the following conference series:

Joint IAPR International Workshops on Statistical Techniques in Pattern Recognition (SPR) and Structural and Syntactic Pattern Recognition (SSPR)

1355 Accesses
2 Citations

Abstract

Graph Edit Distance has been intensively used since its appearance in 1983. This distance is very appropriate if we want to compare a pair of attributed graphs from any domain and obtain not only a distance, but also the best correspondence between nodes of the involved graphs. In this paper, we want to analyse if the Graph Edit Distance can be really considered a distance or a pseudo-distance, since some restrictions of the distance function are not fulfilled. Distinguishing between both cases is important because the use of a distance is a restriction in some methods to return exact instead of approximate results. This occurs, for instance, in some graph retrieval techniques. Experimental validation shows that in most of the cases, it is not appropriate to denominate the Graph Edit Distance as a distance, but a pseudo-distance instead, since the triangle inequality is not fulfilled. Therefore, in these cases, the graph retrieval techniques not always return the optimal graph.

This research is supported by projects DPI2013-42458-P, TIN2013-47245-C2-2-R, and by Consejo Nacional de Ciencia y Tecnología (CONACyT México).

You have full access to this open access chapter, Download conference paper PDF

An Edit Distance Between Graph Correspondences

Graph Similarity Using Tree Edit Distance

Graph Edit Distance in the Exact Context

Keywords

1 Introduction

Attributed graphs have been of crucial importance in pattern recognition throughout more than four decades [1, 2] since they have been used to model several kinds of problems. Interesting reviews of techniques and applications are [3–5]. If elements in pattern recognition are modelled through attributed graphs, error-tolerant graph-matching algorithms are needed that aim to compute a matching between nodes of two attributed graphs that minimizes some kind of objective function. To that aim, one of the most widely used methods to evaluate an error correcting graph isomorphism is the Graph Edit Distance [1, 2, 6].

Graph Edit Distance needs two main input parameters, which are the pair of attributed graphs to be compared and also other calibration parameters. These parameters have to be tuned in order to maximise a recognition ratio in a classification scenario or simply to minimise the Hamming distance between a ground-truth correspondence between nodes of both graphs and the obtained correspondence. Unfortunately, little research has been done to analyse if really the Graph Edit Distance is a distance or simply a dissimilarity function that could be classified as a pseudo-distance, since some distance restrictions are not fulfilled. Reference [7] is the only paper related on this idea, and it shows in which conditions of these calibration parameters the Graph Edit Distance is really a distance.

The importance of Graph Edit Distance being indeed a distance has an influence on some applications. As an example, in [8–10], authors present methods to retrieve graphs in a database. They suppose that given three graphs, the triangle inequality is fulfilled and thanks to this assumption, some comparisons are not needed to be performed. It turns out that if the Graph Edit Distance is not a distance, then the triangle inequality is not guaranteed, and then some graphs that would have to be explored are not considered, making the methods sub-optimal.

The aim of this paper is to empirically analyse if the distance definition is hold when the recognition ratio is maximised or the obtained correspondence is close to the ground truth. The outline of the paper is as follows; in Sect. 2, we define the attributed graphs and the Graph Edit Distance. In Sects. 3 and 4, we explain the restrictions that a distance needs to fulfil and we relate these restrictions on the specific case of the Graph Edit Distance. In Sects. 5 and 6, we show the experimental validation and conclude the paper.

2 Graphs and Graph Edit Distance

Let $ \Delta _{\text{v}} $ and $ \Delta _{\text{e}} $ denote the domains of possible values for attributed vertices and arcs, respectively. An attributed graph (over $ \Delta _{\text{v}} $ and $ \Delta _{\text{e}} $) is defined by a tuple $ {\text{G}} = (\Sigma _{\upnu} ,\Sigma _{\text{e}} ,\upgamma_{\text{v}} ,\upgamma_{\text{e}} ) $, where $ \Sigma _{\text{v}} = \{ {\text{v}}_{\text{k}} \,|\,{\text{k }} = 1, \ldots ,{\text{R}}\} $ is the set of vertices (or nodes), $ \Sigma _{\text{e}} = \left\{ {{\text{e}}_{\text{ij}} \left| {{\text{i}},{\text{j }} \in \left\{ {1, \ldots ,{\text{R}}} \right\}} \right.} \right\} $ is the set of edges (or arcs), $ \upgamma_{\text{v}} :\,\Sigma _{\text{v}} \to\Delta _{\text{v}} $ assigns attribute values to vertices and $ \upgamma_{\text{e}} :\,\Sigma _{\text{e}} \to\Delta _{\text{e}} $ assigns attribute values to edges.

Let $ {\text{G}}^{\text{p}} = (\Sigma _{\text{v}}^{\text{p}} ,\Sigma _{\text{e}}^{\text{p}} ,\upgamma_{\text{v}}^{\text{p}} ,\upgamma_{\text{e}}^{\text{p}} ) $ and $ {\text{G}}^{\text{q}} = (\Sigma _{\text{v}}^{\text{q}} ,\Sigma _{\text{e}}^{\text{q}} ,\upgamma_{\text{v}}^{\text{q}} ,\upgamma_{\text{e}}^{\text{q}} ) $ be two attributed graphs of order $ {\text{R}}^{\text{p}} $ and $ {\text{R}}^{\text{q}} $. To allow maximum flexibility in the matching process, graphs can be extended with null nodes [1] to be of order $ {\text{R}}^{\text{p}} + {\text{R}}^{\text{q}} $. We refer to null nodes of $ {\text{G}}^{\text{p}} $ and $ {\text{G}}^{\text{q}} $ by $ \hat \Upsigma_{\rm{v}}^{\rm{p}} \subseteq \Upsigma_{\rm{v}}^{\rm{p}} $ and $ \hat \Upsigma_{\rm{v}}^{\rm{q}} \subseteq \Upsigma_{\rm{v}}^{\rm{q}} $ respectively. Let $ {\text{T}} $ be a set of all possible correspondences between two node sets $ \Sigma _{\text{v}}^{\text{p}} $ and $ \Sigma _{\text{v}}^{\text{q}} $. Correspondence $ f^{{{\text{p}},{\text{q}}}} :\,\Sigma _{\text{v}}^{\text{p}} \to\Sigma _{\text{v}}^{\text{q}} $, assigns each node of $ {\text{G}}^{\text{p}} $ to only one node of $ {\text{G}}^{\text{q}} $. The correspondence between edges, denoted by $ f_{\text{e}}^{{{\text{p}},{\text{q}}}} $, is defined accordingly to the correspondence of their terminal nodes.

$$ \eqalign{ & f_e^{p,q}\left( {e_{ab}^p} \right) = e_{ij}^q \Rightarrow {f^{p,q}}\left( {v_a^p} \right) = v_i^q \wedge {f^{p,q}}\left( {v_b^p} \right) = v_j^q \cr & \;\;\;v_a^p,v_b^p \in \Upsigma_{\rm{v}}^{\rm{p}} - \hat \Upsigma_{\rm{v}}^{\rm{p}}\;{\rm{and}}\;v_i^q,v_j^q \in \Upsigma_{\rm{v}}^{\rm{q}} - \hat \Upsigma_{\rm{v}}^{\rm{q}} \cr} $$

(1)

The basic idea behind the Graph Edit Distance is to define a dissimilarity measure between two graphs. This dissimilarity is defined as the minimum amount of distortion required to transform a graph into another. To this end, a number of distortion or edit operations, consisting of insertion, deletion and substitution of both nodes and edges are defined. Then, for every pair of graphs ($ {\text{G}}^{\text{p}} $ and $ {\text{G}}^{\text{q}} $), there is a sequence of edit operations that transforms $ {\text{G}}^{\text{p}} $ into $ {\text{G}}^{\text{q}} $. In general, several edit paths may exist between two given graphs and to evaluate which edit path is the best, edit cost functions are introduced. The basic idea is to assign a penalty cost to each edit operation according to the amount of distortion that it introduces into the transformation.

Each edit path can be related to an univocal correspondence $ f^{{{\text{p}},{\text{q}}}} \in {\text{T}} $ between the involved graphs. This way, each edit operation assigns a node of the first graph to a node of the second graph. Deletion and insertion operations are transformed to assignations of a non-null node of the first or second graph to a null node of the second or first graph. Substitutions simply indicate node-to-node assignations. Using this transformation, given two graphs, $ {\text{G}}^{\text{p}} $ and $ {\text{G}}^{\text{q}} , $ and a correspondence between their nodes, $ f^{{{\text{p}},{\text{q}}}} $, the graph edit cost is given by [1]:

$$ \eqalign{ & \quad \quad \quad \quad \quad \quad \quad \quad EditCost\left( {{G^p},{G^q},{f^{p,q}}} \right) = \cr & \mathop \sum \limits_{\matrix{ {v_a^p \in \varSigma_v^p - \hat \Upsigma_v^p} \cr {v_i^q \in \Upsigma_v^q - \hat \Upsigma_v^q} \cr } } {C_{ns}}\left( {v_a^p,v_i^q} \right) + \mathop \sum \limits_{\matrix{ {e_{ab}^p \in \varSigma_e^p - \hat \Upsigma_e^p} \cr {e_{ij}^q \in \Upsigma_e^q - \hat \Upsigma_e^q} \cr } } {C_{es}}\left( {e_{ab}^p,e_{ij}^q} \right) + \mathop \sum \limits_{\matrix{ {v_a^p \in \varSigma_v^p - \hat \Upsigma_v^p} \cr {v_i^q \in \hat \Upsigma_v^q} \cr } } {C_{nd}}\left( {v_a^p,v_i^q} \right) + \cr & \mathop \sum \limits_{\matrix{ {v_a^p \in \hat \Upsigma_v^p} \cr {v_i^q \in \Upsigma_v^q - \hat \Upsigma_v^q} \cr } } {C_{ni}}\left( {v_a^p,v_i^q} \right) + \mathop \sum \limits_{\matrix{ {e_{ab}^p \in \varSigma_e^p - \hat \Upsigma_e^p} \cr {e_{ij}^q \in \hat \Upsigma_e^q} \cr } } {C_{ed}}\left( {e_{ab}^p,e_{ij}^q} \right) + \mathop \sum \limits_{\matrix{ {e_{ab}^p \in \hat \Upsigma_e^p} \cr {e_{ij}^q \in \Upsigma_e^q - \hat \Upsigma_e^q} \cr } } {C_{ei}}\left( {e_{ab}^p,e_{ij}^q} \right) \cr & {\rm{being}}\;{f^{p,q}}\left( {v_a^p} \right) = v_i^q\;{\rm{and}}\;f_e^{p,q}\left( {e_{ai}^p} \right) = e_{ij}^q \cr} $$

(2)

where $ {\text{C}}_{\text{ns}} $ is the cost of substituting node $ {\text{v}}_{\text{a}}^{\text{p}} $ of $ {\text{G}}^{\text{p}} $ by node $ {\text{f}}^{{{\text{p}},{\text{q}}}} \left( {{\text{v}}_{\text{a}}^{\text{p}} } \right) $ of $ {\text{G}}^{\text{q}} $, $ {\text{C}}_{\text{nd}} $ is the cost of deleting node $ {\text{v}}_{\text{a}}^{\text{p}} $ of $ {\text{G}}^{\text{p}} $ and $ {\text{C}}_{\text{ni}} $ is the cost of inserting node $ {\text{v}}_{\text{i}}^{\text{q}} $ of $ {\text{G}}^{\text{q}} $. Equivalently for edges, $ {\text{C}}_{\text{es}} $ is the cost of substituting edge $ {\text{e}}_{\text{ab}}^{\text{p}} $ of graph $ {\text{G}}^{\text{p}} $ by edge $ {\text{f}}_{\text{e}}^{{{\text{p}},{\text{q}}}} \left( {{\text{e}}_{\text{ab}}^{\text{p}} } \right) $ of $ {\text{G}}^{\text{q}} $, $ {\text{C}}_{\text{ed}} $ is the cost of assigning edge $ {\text{e}}_{\text{ab}}^{\text{p}} $ of $ {\text{G}}^{\text{p}} $ to a non-existing edge of $ {\text{G}}^{\text{q}} $ and $ {\text{C}}_{\text{ei}} $ is the cost of assigning edge $ {\text{e}}_{\text{ab}}^{\text{q}} $ of $ {\text{G}}^{\text{q}} $ to a non-existing edge of $ {\text{G}}^{\text{p}} $.

Finally, the Graph Edit Distance is defined as the minimum cost under any correspondence in $ {\text{T}} $:

$$ GED\left( {G^{p} ,G^{q} } \right) = \mathop {\hbox{min} }\limits_{{f^{p,q} \in T}} EditCost\left( {G^{p} ,G^{q} ,f^{p,q} } \right) $$

(3)

Using this definition, the Graph Edit Distance essentially depends on $ {\text{C}}_{\text{ns}} , {\text{C}}_{\text{nd}} ,{\text{C}}_{\text{ni}} , {\text{C}}_{\text{es}} , {\text{C}}_{\text{ed}} $ and $ {\text{C}}_{\text{ei}} $ functions. Several definitions of these functions exist. Table 1 summarises the five different configurations that have been presented so far.

Table 1. Examples of Graph Edit Costs in the literature.

Full size table

The first options [11–16] are the ones where the whole costs are defined as functions that depend on the involved attributes and also on either learned or general knowledge. Attributes are density functions instead of vectors of attributes. The second option makes the Graph Edit Distance to be directly related to the maximal common sub-graph. That is, in [17], authors demonstrate that computing the Graph Edit Distance is exactly the same than deducting the maximal common sub-graph. In the third option, [18], authors assume that the graphs are complete, and a non-existing edge is an edge with a “null” attribute. In this case, the cost of deleting and inserting an edge is encoded in the edge substitution cost. Inserting and deleting nodes have a constant cost, $ K_{n} $. With this definition, authors describe several classes of costs that Eq. 3 deducts the same correspondence. The fourth option might be the most used one [1, 19, 20]. Substitution costs are defined as distances between vectors of attributes, usually the Euclidean distance. Insertion and deletion costs are constants, $ K_{n} $ and $ K_{e} $, that have been manually tested or automatically learned [21, 22]. Finally, the last option is used in fingerprint recognition [23]. It is similar to the previous option, except from the substitution costs that are constants. Nodes represent minutiae and edges are the relations between them. If a specific distance between minutiae is lower than a threshold, then a zero is imposed as a substitution cost. Otherwise, this cost takes a constant value $ K_{ns} $. The same happens with the edges that take a constant value $ K_{es} $.

It is worth noting that for all of the cases except for the first one, the insertion and deletion costs on nodes are considered to be the same, $ K_{n} $. The same happens for edges, $ K_{e} $. Nevertheless, in the string edit distance, also known as Levenshtein distance [24], insertion and deletion costs might be considered different depending on the application. The most usual application is an automatic writing correction scenario, in which the possibility of missing a character is different than accidentally adding an extra character [25].

The optimal computation of the Graph Edit Distance is usually carried out by means of a tree search algorithm, which explores the space of all possible mappings of the nodes and edges of the first graph to the nodes and edges of the second graph. A widely used method is based on the A* algorithm, for instance [18]. Unfortunately, the computational complexity of this algorithm is exponential in the number of nodes of the involved graphs. This means that the running time may be non-admissible in some applications, even for reasonably small graphs. This is why bipartite graph matching [26–29] has appeared.

3 Restrictions on the Graph Edit Distance

A distance, also called a metric, is a function that defines a dissimilarity between elements of a set, such as $ x $, $ y $ or $ z $. The domain is $ [0,\infty ) $ and it holds the following restrictions for all elements in the set [30]:

$$ \eqalign{ & {\rm{(1)}}\;{\text{Non-negativity:}}\;dist\left( {x,y} \right) \ge 0. \cr & {\rm{(2)}}\;{\text{Identity of indiscernible elements:}}\;dist\left( {x,y} \right) = 0 \Leftrightarrow x = y. \cr & {\rm{(3)}}\;{\text{Symmetry:}}\;dist\left( {x,y} \right) = dist\left( {y,x} \right) \cr & {\rm{(4)}}\;{\text{Triangle inequality:}}\;dist\left( {x,y} \right) \le dist\left( {x,z} \right) + dist\left( {z,y} \right) \cr} $$

(4)

In some cases, there is a need to relax these restrictions and thus the resulting functions are not called distance but pseudo-distance, quasi-distance, meta-distance or semi-distance, depending on which restriction is violated and how it is violated [30]. In the case of the Graph Edit Distance, this relaxation comes from the imposition of a ground truth correspondence. This ground truth is deducted from an independent method and it is not related on the edit costs.

If we wish the Graph Edit Distance to be defined as a true distance function, it is needed to assure the whole edit operations in the ground truth correspondence fulfil the four properties in the following Eq. 5. In these equations, we suppose that the ground truth correspondence is $ f^{p,q} $ such that $ f^{p,q} \left( {v_{a}^{p} } \right) = v_{i}^{q} $ and $ f^{p,q} \left( {v_{b}^{p} } \right) = v_{j}^{q} $.

$$ \begin{aligned}&\left({\rm{1}}\right)\;{\text{Non-negativity:}}\;{{\rm{C}}_{{\rm{ns}}}} \ge 0\;{\rm{and}}\;{{\rm{C}}_{{\rm{es}}}} \ge 0. \\ &\left( {\rm{2}} \right)\;{\text{Identity of indiscernible elements:}}\\ &\quad\;\;\; {C_{ns}}\left( {v_a^p,v_i^q} \right) = 0 \Leftrightarrow {\upgamma_{\rm{v}}}\left( {v_a^p} \right) = {\upgamma_{\rm{v}}}\left( {v_i^q}\right) \\ &\quad\;\;\; {C_{es}}\left({e_{ab}^p,e_{ij}^q} \right) = 0 \Leftrightarrow {\upgamma_{\rm{e}}}\left( {e_{ab}^p} \right) = {\upgamma_{\rm{e}}}\left({e_{ij}^q} \right) \\ & \left( {\rm{3}}\right)\;{\text{Symmetry:}} \\ &\quad\;\;\;{{\rm{C}}_{{\rm{nd}}}}\left({v_a^p,v_i^q} \right) = {{\rm{C}}_{{\rm{ni}}}}\left({v_{a^{\prime}}^p,v_{i^{\prime}}^q} \right) \Leftrightarrow {\upgamma_{\rm{v}}}\left({v_a^p} \right) = {\upgamma_{\rm{v}}}\left({v_{i^{\prime}}^q} \right) \\ &{\rm{where}}\;v_a^p \in \varSigma_v^p - \hat \Upsigma_v^p,\;v_i^q \in \hat \Upsigma_v^q,\;v_{a^{\prime}}^p \in \hat\Upsigma_v^p\;{\rm{and}}\;v_{i^{\prime}}^q \in \varSigma_v^q - \hat \Upsigma_v^q \\ &\quad\;\;\;{{\rm{C}}_{{\rm{ed}}}}\left( {e_{ab}^p,e_{ij}^q} \right) = {{\rm{C}}_{{\rm{ei}}}}\left( {e_{a^{\prime}b^{\prime}}^p,e_{i^{\prime}j^{\prime}}^q} \right) \Leftrightarrow {\upgamma_{\rm{e}}}\left( {e_{ab}^p} \right) = {\upgamma_{\rm{e}}}\left( {e_{i^{\prime}j^{\prime}}^q} \right) \\ &{\rm{where}}\;e_{ab}^p \in \varSigma_e^p - \hat \Upsigma_e^p,\;e_{ij}^q\in \hat \Upsigma_e^q,\;e_{a^{\prime}b^{\prime}}^p \in \hat \Upsigma_e^p\;{\rm{and}}\;e_{i^{\prime}j^{\prime}}^q \in \varSigma_e^q - \hat \Upsigma_e^q \\ &\left( {\rm{4}} \right)\;{\text{Triangle inequality:}} \\ &\quad\;\;\;{{\rm{C}}_{{\rm{ns}}}}\left( {v_a^p,v_i^q} \right) \le {{\rm{C}}_{{\rm{nd}}}}\left( {v_a^p,v_{i^{\prime}}^q} \right) + {{\rm{C}}_{{\rm{ni}}}}\left( {v_{a^{\prime}}^p,v_i^q} \right) \\ &{\rm{where}}\;v_a^p \in \varSigma_v^p - \hat \Upsigma_v^p,\;v_i^q \in \varSigma_v^q - \hat \Upsigma_v^q\;v_{i^{\prime}}^q \in \hat \Upsigma_v^q\;{\rm{and}}\;v_{a^{\prime}}^p \in \hat \Upsigma_v^p \\ &\quad\;\;\;{{\rm{C}}_{{\rm{es}}}}\left( {e_{ab}^p,e_{ij}^q} \right) \le {{\rm{C}}_{{\rm{ed}}}}\left( {e_{ab}^p,e_{i^{\prime}j^{\prime}}^q} \right) + {{\rm{C}}_{{\rm{ei}}}}\left( {e_{a^{\prime}b^{\prime}}^p,e_{ij}^q} \right) \\ &{\rm{where}}\;e_{ab}^p \in \varSigma_e^p - \hat \Upsigma_e^p,\;e_{ij}^q\in \varSigma_e^q - \hat \Upsigma_e^q\;e_{i^{\prime}j^{\prime}}^q \in \hat \Upsigma_e^q\;{\rm{and}}\;e_{a^{\prime}b^{\prime}}^p \in \hat \Upsigma_e^p\end{aligned} $$

(5)

For all cited references, functions in Table 1 are defined as distances, and constants as real positive numbers. For this reason, if the Graph Edit Distance cannot be defined as a true distance, it is due to the relations between these functions and constants. Considering the five options proposed in Table 1, it is really difficult to analyse the first option since being a distance or not depends on the specific distance values. We realise that the second and third ones do not hold the triangle inequality and therefore cannot be considered as distances. The fourth option is a distance only if it is guaranteed that the whole substitution operations in the edit path hold:

$$ d_{n} \left( {v_{a}^{p} ,v_{i}^{q} } \right) \le 2\,\cdot\,K_{n} \;{\text{and}}\;d_{e} (e_{ab}^{p} ,e_{ij}^{q} ) \le 2\,\cdot\,K_{e} $$

(6)

That is, we only have to analyse if the triangle inequality of Eq. 5 is fulfilled. Finally, the last option is almost the same than the third one and it is a true distance if constant costs are defined such that,

$$ K_{ns} \le 2\,\cdot\,K_{n} \;{\text{and}}\;K_{es} \le 2\,\cdot\,K_{e} $$

(7)

Since the fourth option is both the most used and the one that can be defined as distance or not, depending on the costs, from now on, we concretise on this specific case.

4 Defining the Graph Edit Distance as a True Distance

Note that given a pair of graphs and an optimal correspondence (the one that minimise $ EditCost $ in Eq. 3), we can analyse if the used edit costs make the Graph Edit Distance to be a true distance or not. Moreover, each combination of edit costs generates a different optimal correspondence and a Graph Edit Distance value. For this reason, the problem of knowing which are the edit costs that make the Graph Edit Distance to be a true distance is a chicken egg problem. Given some edit costs, we need to compute the optimal correspondence to deduct if the four distance restrictions are violated (Eq. 5), but to deduct the proper edit costs, we need the optimal correspondence.

To solve this problem, we propose to use a ground truth correspondence. That is, given a pair of attributed graphs, and independently of the edit costs, a human or another method deducts which is the “best” correspondence. Thus, we consider that the Graph Edit Distance is a true distance if the four properties in Eq. 5 are fulfilled assuming $ f^{p,q} $ is the ground truth correspondence.

Given an application that involves an attributed graph database of $ M $ graphs in which the computation of the Graph Edit Distance is needed, the same edit costs have to be used in the whole process and graphs. Thus, we generalise Eq. 6 considering that we have several graphs and also introducing the ground truth concept. We conclude that the Graph Edit Distance is a true distance given some specific insertion and deletion costs for nodes if the following equation holds,

$$ \eqalign{ & \quad \forall\, v_a^p \in {\varSigma}_v^p - {{\hat {\Upsigma} }}_v^p \;{\rm{given}}\;p :\, 1..M\;{\rm{and}}\;a :\,1..({{\rm{R}}^{\rm{p}}}{\rm{ }} + {\rm{ }}{{\rm{R}}^{\rm{q}}}) \cr & {\text{such that}}\; {f^{p,q}}{\rm{ }}\left( {v_a^p} \right) = v_i^q\,\&\,v_i^q \in {\varSigma}_v^q - {{\hat {\Upsigma} }}_v^q \cr & \quad {\text{leads to}}\; {d_n}\left( {v_a^p,v_i^q} \right) \le 2\cdot{K_n} \cr & \quad {\rm{being}}\;{f^{p,q}}\;{\text{the ground-truth correspondence}} \cr} $$

(8)

Similarly for the edges,

$$ \eqalign{ & \quad \forall\,e_{ab}^p \in {\varSigma}_e^p - {{\hat {\Upsigma} }}_e^p \;{\rm{given}}\;p :\,1..M\;{\rm{and}}\;a,b :\,1..({{\rm{R}}^{\rm{p}}}{\rm{ }} + {\rm{ }}{{\rm{R}}^{\rm{q}}}) \cr & {\text{such that}}\; f_{\rm{e}}^{{\rm{p}},{\rm{q}}}{\rm{ }}\left( {e_{ab}^p} \right) = e_{ij}^q\,\&\,e_i^{jq} \in {\varSigma}_e^q - {{\hat {\Upsigma} }}_e^q \cr & \quad {\text{leads to}}\; {d_e}\left( {e_{ab}^p,e_{ij}^q} \right) \le 2\cdot{K_e} \cr & \quad {\rm{being}}\;f_{\rm{e}}^{{\rm{p}},{\rm{q}}}\;{\text{the edge correspondence deducted from the }} \cr & {\text{ground-truth correspondence}}\;{f^{p,q}}. \cr} $$

(9)

5 Experimental Validation

We used four graph databases that are organised in registers such that each register is composed of a pair of graphs and a ground truth correspondence between their nodes. These databases were initially used to automatically learn insertion and deletion edit costs in [21, 22], and are publically available in [31]. These databases do not have attributes on the edges and therefore, we only analyse the insertion and deletion costs on nodes. Nonetheless, what can be deducted on nodes could be easily extrapolated to edges. Graphs in the first two databases, Letter Low and Letter High, represent hand written characters, which nodes have as only attribute the (x,y) position of the junctions of strokes in the character, and edges being the strokes. Graphs in House Hotel database and Tarragona Rotation Zoom database have been extracted from images. Their nodes represent salient points in the images with their attributes being the features obtained by the point extractor. Edges have been deducted by the Delaunay triangulation.

Table 2 shows the position of the quartiles, the mean and also half of the maximum values of the node substitution costs $ d_{n} \left( {v_{a}^{p} ,v_{i}^{q} } \right) $ given the whole correspondences. Clearly, if we want Eq. 8 to hold, the insertion and deletion costs have to be defined such that $ K_{n} \ge \frac{1}{2}Max. $

Table 2. Average node substitution costs given the ground truth correspondences.

Full size table

For the sake of clarification, Fig. 1 shows the histograms of $ d_{n} \left( {v_{a}^{p} ,v_{i}^{q} } \right) $ given all databases with the quartiles and the mean values.

We have used an error-tolerant graph-matching algorithm called Fast Bipartite [27] available in [32] to compute the automatically-deducted correspondence and the distance between the attributed graphs.

Table 3 shows the average Hamming distance between the ground-truth correspondence and the automatically obtained correspondence when $ K_{n} = Q1 $, $ K_{n} = Q2 $, $ K_{n} = Q3 $, $ K_{n} = Mean $, and $ K_{n} = \frac{1}{2}Max $. Specific values are shown in Table 2. The Hamming distance is computed as the number of node mappings that are different between both correspondences. Therefore, the lower these values, the better the performance.

Table 3. Average Hamming distance.

Full size table

We realise that the lowest Hamming distances are achieved in the positions of the insertion and deletion edit costs such that the triangle inequality is not hold, since these lowest Hamming distances are achieved in the first three quartiles, which are always smaller than $ \frac{1}{2}Max $.

Table 4 shows the classification ratio using the same conditions than the previous experiments. To compute the classification ratio, we have used the reference and test set of each database and the 1-Nearest Neighbour classification algorithm. Recall that the House Hotel database does not have classes. It seems as the classification ratio performs similar to the Hamming distance. That is, the best values are achieved when the insertion and deletion edit costs are smaller than $ \frac{1}{2}Max $.

Table 4. Classification ratio.

Full size table

The relation of the recognition ratio with the Hamming distance between the ground truth and the obtained correspondences was explored in [22] while learning the edit costs. In that paper, it was empirically demonstrated that decreasing the Hamming distance leads to the recognition ratio to increase. We have validated this dependence again. Moreover, the experimental validation in that paper shows that the presented optimisation method converged to some negative insertion and deletion costs. Again, these values make the Graph Edit Distance not to be a truly defined distance.

Finally, in Table 5 we show the average runtime (in milliseconds) to compute one graph-to-graph comparison. We have used a Windows, I7 and Matlab. We appreciate there is no relation, in general, between the insertion and deletion edit costs and the runtime.

Table 5. Average runtime to match a pair of graphs.

Full size table

6 Conclusions

Graph Edit Distance is nowadays the most widely used function to compare two graphs and to obtain both a distance and a graph correspondence. This function does not only depend on a pair of graphs, but also on the definition of the insertion and deletion edit costs on nodes and edges. These costs are usually defined as constants, and depending on their definition, we can consider the Graph Edit Distance to be a true distance or not. The fact of not being a true distance can influence on the performance in some applications. Experimental validation has shown that the insertion and deletion costs that obtain the lowest Hamming distances between the ground truth correspondence and the optimal correspondence and also the highest classification ratios are the ones where the triangle inequality does not hold when considering the ground truth correspondence. For instance, the assumption that $ EditCost\left( {G^{p} ,G^{q} ,f^{p,q} } \right) \le {\text{GED}}\left( {{\text{G}}^{\text{p}} ,{\text{G}}^{\text{t}} } \right) + {\text{GED}}\left( {{\text{G}}^{\text{t}} ,{\text{G}}^{\text{q}} } \right) $, where $ f^{p,q} $ is the ground truth correspondence, a fact which is commonly assumed on some applications such as graph retrieval or learning processes. This is because in a properly tuned system, $ EditCost\left( {G^{p} ,G^{q} ,f^{p,q} } \right) = {\text{GED}}\left( {{\text{G}}^{\text{p}} ,{\text{G}}^{\text{q}} } \right) $.

References

Solé, A., Serratosa, F., Sanfeliu, A.: On the graph edit distance cost: properties and applications. Int. J. Pattern Recogn. Artif. Intell. 26(5), 1260004 (2012). (21 pages)
Article MathSciNet Google Scholar
Sanfeliu, A., Fu, K.-S.: A distance measure between attributed relational graphs for pattern recognition. IEEE Trans. Syst. Man Cybern. 13(3), 353–362 (1983)
Article MATH Google Scholar
Conte, D., Foggia, P., Sansone, C., Vento, M.: Thirty years of graph matching in pattern recognition. Int. J. Pattern Recogn. Artif. Intell. 18(3), 265–298 (2004)
Article Google Scholar
Vento, M.: A long trip in the charming world of graphs for pattern recognition. Pattern Recogn. 48, 291–301 (2015)
Article Google Scholar
Foggia, P., Percannella, G., Vento, M.: Graph matching and learning in pattern recognition in the last 10 years. Int. J. Pattern Recogn. Artif. Intell. 28, 1450001 (2013)
Article MathSciNet Google Scholar
Gao, X., et al.: A survey of graph edit distance. Pattern Anal. Appl. 13(1), 113–129 (2010)
Article MathSciNet Google Scholar
Bunke, H., Allermann, G.: Inexact graph matching for structural pattern recognition. Pattern Recogn. Lett. 1(4), 245–253 (1983)
Article MATH Google Scholar
He, L., et al.: Graph matching for object recognition and recovery. Pattern Recogn. Lett. 37(7), 1557–1560 (2004)
Article Google Scholar
Serratosa, F., Cortés, X., Solé, A.: Component retrieval based on a database of graphs for hand-written electronic-scheme digitalisation. Expert Syst. Appl. 40, 2493–2502 (2013)
Article Google Scholar
Berretti, S., Del Bimbo, A., Vicario, E.: Efficient matching and indexing of graph models in content-based retrieval. IEEE Trans. Pattern Anal. Mach. Intell. 23(10), 1089–1105 (2001)
Article Google Scholar
Wong, A., You, M.: Entropy and distance of random graphs with application to structural pattern recognition. Trans. Pattern Anal. Mach. Intell. 7(5), 599–609 (1985)
Article MATH Google Scholar
Neuhaus, M., Bunke, H.: Automatic learning of cost functions for graph edit distance. Inf. Sci. 177(1), 239–247 (2006)
Article MathSciNet MATH Google Scholar
Serratosa, F., Alquézar, R., Sanfeliu, A.: Function-described graphs for modelling objects represented by attributed graphs. Pattern Recogn. 36(3), 781–798 (2003)
Article Google Scholar
Serratosa, F., Alquézar, R., Sanfeliu, A.: Efficient algorithms for matching attributed graphs and function-described graphs. In: Proceedings of 15th International Conference on Pattern Recognition, ICPR 2000, Barcelona, Spain, vol. 2, pp. 871–876 (2000)
Google Scholar
Sanfeliu, A., Serratosa, F., Alquézar, R.: Second-order random graphs for modelling sets of attributed graphs and their application to object learning and recognition. Int. J. Pattern Recogn. Artif. Intell. 18(3), 375–396 (2004)
Article Google Scholar
Alquézar, R., Sanfeliu, A., Serratosa, F.: Synthesis of function-described graphs. In: Amin, A., Dori, D., Pudil, P., Freeman, H. (eds.) SSPR’98 and SPR’98. LNCS, vol. 1451, pp. 112–121. Springer, Heidelberg (1998)
Chapter Google Scholar
Bunke, H.: On a relation between graph edit distance and maximum common subgraph. Pattern Recogn. Lett. 18(8), 689–694 (1998)
Article Google Scholar
Bunke, H.: Error correcting graph matching: on the influence of the underlying cost function. Trans. Pattern Anal. Mach. Intell. 21(9), 917–922 (1999)
Article Google Scholar
Caetano, T., et al.: Learning graph matching. Trans. Pattern Anal. Mach. Intell. 31(6), 1048–1058 (2009)
Article Google Scholar
Ferrer, M., Serratosa, F., Riesen, K.: Improving bipartite graph matching by assessing the assignment confidence. Pattern Recogn. Lett. 65, 29–36 (2015)
Article Google Scholar
Cortés, X., Serratosa, F.: Learning graph matching substitution weights based on the ground truth node correspondence. Int. J. Pattern Recogn. Artif. Intell. 30(2), 1650005 (2016). (22 pages)
Article MathSciNet Google Scholar
Cortés, X., Serratosa, F.: Learning graph-matching edit-costs based on the optimality of the Oracle’s node correspondences. Pattern Recogn. Lett. 56, 22–29 (2015)
Article Google Scholar
Jain, A.K., Maltoni, D.: Handbook of Fingerprint Recognition. Springer, New York (2003)
MATH Google Scholar
Levenshtein, V.I.: Binary codes capable of correcting deletions, insertions and reversals. Sov. Phys. Dokl. Cybern. Control Theory 10, 707–710 (1966)
MathSciNet MATH Google Scholar
Serratosa, F., Sanfeliu, A.: Signatures versus histograms: definitions, distances and algorithms. Pattern Recogn. 39(5), 921–934 (2006)
Article MATH Google Scholar
Riesen, K., Bunke, H.: Approximate graph edit distance computation by means of bipartite graph matching. Image Vis. Comput. 27(7), 950–959 (2009)
Article Google Scholar
Serratosa, F.: Fast computation of bipartite graph matching. Pattern Recogn. Lett. 45, 244–250 (2014)
Article Google Scholar
Serratosa, F.: Computation of graph edit distance: reasoning about optimality and speed-up. Image Vis. Comput. 40, 38–48 (2015)
Article Google Scholar
Serratosa, F.: Speeding up fast bipartite graph matching through a new cost matrix. Int. J. Pattern Recogn. Artif. Intell. 29(2), 1550010 (2015)
Article MathSciNet Google Scholar
Arkhangel’skii, A.V., Pontryagin, L.S.: General Topology I: Basic Concepts and Constructions Dimension Theory. Encyclopaedia of Mathematical Sciences. Springer, Heidelberg (1990)
Book Google Scholar
http://deim.urv.cat/~francesc.serratosa/databases/
http://deim.urv.cat/~francesc.serratosa/SW

Download references

Author information

Authors and Affiliations

Universitat Rovira i Virgili, Tarragona, Catalonia, Spain
Francesc Serratosa, Xavier Cortés & Carlos-Francisco Moreno

Authors

Francesc Serratosa
View author publications
You can also search for this author in PubMed Google Scholar
Xavier Cortés
View author publications
You can also search for this author in PubMed Google Scholar
Carlos-Francisco Moreno
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Francesc Serratosa .

Editor information

Editors and Affiliations

Data 61 - CSIRO , Canberra, Australia
Antonio Robles-Kelly
Pattern Recognition Laboratory, Technical University of Delft Pattern Recognition Laboratory, CD Delft, The Netherlands
Marco Loog
Electrical and Electronic Engineering, University of Cagliari Electrical and Electronic Engineering, Cagliari, Italy
Battista Biggio
Computación e IA, Universidad de Alicante Computación e IA, Alicante, Spain
Francisco Escolano
Computer Science, University of York Computer Science, York, United Kingdom
Richard Wilson

Rights and permissions

Reprints and permissions

Copyright information

About this paper

Cite this paper

Serratosa, F., Cortés, X., Moreno, CF. (2016). Graph Edit Distance or Graph Edit Pseudo-Distance?. In: Robles-Kelly, A., Loog, M., Biggio, B., Escolano, F., Wilson, R. (eds) Structural, Syntactic, and Statistical Pattern Recognition. S+SSPR 2016. Lecture Notes in Computer Science(), vol 10029. Springer, Cham. https://doi.org/10.1007/978-3-319-49055-7_47

Download citation

DOI: https://doi.org/10.1007/978-3-319-49055-7_47
Published: 05 November 2016
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-49054-0
Online ISBN: 978-3-319-49055-7
eBook Packages: Computer ScienceComputer Science (R0)

Publish with us

Policies and ethics

Societies and partnerships

The International Association for Pattern Recognition (opens in a new tab)

Graph Edit Distance or Graph Edit Pseudo-Distance?

Abstract

Similar content being viewed by others

An Edit Distance Between Graph Correspondences