Keywords

1 Introduction

Because of their expressiveness and versatility, labeled graphs are widely used to model various kinds of objects such as molecules, street networks, and images. Many pattern recognition problems defined over these domains presuppose the availability of a (dis-)similarity measure for labeled graphs. Despite the fact that its exact computation is \(\mathcal {NP}\)-hard [31], one of the most widely used measures is the graph edit distance (\(\mathrm {GED}\)). Given two labeled graphs G and H, it is defined as , where \(\varPsi \) is the set of all edit paths between G and H and c(P) denotes the cost of an edit path P. An edit path is a sequence of edit operations that transforms G into H. There are six edit operations: substituting a node or an edge in G by a node or an edge in H, deleting an edge or an isolated node from G, and inserting an edge or an isolated node into H. Each edit operation comes with an associated non-negative edit cost defined in terms of the node or edge labels involved in the operation; and the cost of an edit path is defined as the sum over the costs of its edit operations.

Over the past years, some exact and a lot of approximate algorithms for computing \(\mathrm {GED}\) have been suggested. As the hardness of \(\mathrm {GED}\) does not allow for a theoretical evaluation of approximate algorithms (the existence of any \(\alpha \)-approximation algorithm for \(\mathrm {GED}\) would imply that the graph isomorphism problem, a prime candidate for an \(\mathcal {NP}\)-intermediate problem, is in \(\mathcal {P}\)), these algorithms are typically evaluated empirically. In order for such a comparison to be fair, it is highly desirable that the compared algorithms be implemented within the same environment. However, to the best of our knowledge, no software is available that can be used for this purpose.

In this paper, we present the C++ template library GEDLIB which is intended to fill this gap. GEDLIB is available on GitHub:

https://github.com/dbblumenthal/gedlib

In its current version, GEDLIB contains implementations of 24 different \(\mathrm {GED}\) algorithms and 9 different edit cost functions. Further algorithms and edit costs can be implemented easily by implementing abstract classes contained in GEDLIB. For this, the user has access to standard libraries for blackbox optimization, mixed integer linear programming, the linear sum assignment problem with and without error-correction, deep neural networks, and support vector machines. GEDLIB provides a parser to load graphs given in the GXL file format. Alternatively, graphs with user-specified node ID, node, and edge label types can be constructed from within GEDLIB. Internally, GEDLIB uses the Boost Graph Library [22] for representing the graphs and Eigen [19] for matrix operations.

The remainder of this paper is organized as follows: In Sect. 2, the overall architecture of GEDLIB is sketched. In Sect. 3, the user interface is presented. In Sects. 4 and 5, the abstract classes for implementing \(\mathrm {GED}\) algorithms and edit cost functions are described. Section 6 concludes the paper. Details, examples, and installation instructions can be found in the documentation.

2 Overall Architecture

Figure 1 shows the overall architecture of GEDLIB in a UML diagram. The entire library is contained in the namespace ged. The template parameters UserNodeID, UserNodeLabel, and UserEdgeLabel correspond to the types of the node IDs, the node labels, and the edge labels of the graphs provided by the user.

Fig. 1.
figure 1

The overall architecture of GEDLIB shown in a UML class diagram.

  • The class template ged::GEDEnv provides the user interface. Via its public member functions, graphs can be constructed or loaded from GXL files, edit costs can be set, the algorithms implemented in GEDLIB can be run, and the results of the runs can be obtained. For users who do not want to provide extensions for GEDLIB, it suffices to get familiar with this class template.

  • The abstract class template ged::GEDMethod provides a generic interface for implementing algorithms that exactly or approximately compute \(\mathrm {GED}\).

  • The abstract class templates ged::LSBasedMethod, ged::MIPBasedMethod, and ged::LSAPEBasedMethod are derived from the generic interface provided by ged::GEDMethod. They yield more specialized interfaces for implementing methods using local search, mixed integer linear programming, and transformations to the linear sum assignment problem with error-correction.

  • The abstract class template ged::MLBasedMethod is derived from the interface ged::LSAPEBasedMethod. It can be used to implement algorithms that use deep neural networks or support vector machines for transforming \(\mathrm {GED}\) to the linear sum assignment problem with error-correction.

  • The class template ged::GEDData contains the normalized input data on which all \(\mathrm {GED}\) algorithms contained in GEDLIB operate. Via the public member functions of ged::GEDData, derived classes of ged::GEDMethod have access to the graphs that have been added to the environment and to the edit cost functions selected by the user.

  • The abstract class template ged::EditCosts provides a generic interface for implementing edit cost functions.

3 User Interface

In Fig. 2, the class template ged::GEDEnv, which constitutes the user interface of GEDLIB, is displayed in detail. By calling add_graph(), add_node(), and add_edge(), the user can add labeled graphs to the environment. Alternatively, load_gxl_graphs() can be used to load graphs given in the GXL file format. For this, the template parameter UserNodeID must be set to ged::GXLNodeID a. k. a. std::string, and the template parameters UserNodeLabel and UserEdgeLabel must be set to ged::GXLLabel a. k. a. .

Calls to set_edit_costs() add edit cost functions to the environment. The user can either select one of the predefined edit cost functions or use her own implementation of ged::EditCosts. Calls to \(\texttt {init()}\) initialize the environment eagerly or lazily. If eager initialization is chosen, all edit costs between graphs contained in the environment are precomputed. Otherwise, the edit cost functions are evaluated on the fly. The member function set_method() selects one of the \(\mathrm {GED}\) algorithms available in GEDLIB. Some algorithms accept options, which can be passed to set_method() as a string of the form . Calls to init_method() initialize the selected method for runs between graphs contained in the environment, and calls to run_method() run the method between two specified graphs. The results of the runs (lower and upper bounds, runtimes, etc.) can be accessed via various getter member function.

Fig. 2.
figure 2

The user interface ged::GEDEnv.

4 Abstract Classes for Implementing \(\mathrm {GED}\) Algorithms

Generic Interface. Figure 3 details the abstract class template ged::GEDMethod, which provides the generic interface for implementing \(\mathrm {GED}\). The interface is defined by the virtual member functions starting with the prefix ged_. We here describe only the most important virtual member functions; the remaining ones are detailed in the documentation: ged_run_() runs the method between two input graphs, ged_init_() initializes the methods for the graphs that have been added to the environment, and ged_parse_option_() parses the options of the method. The following existing algorithms already implemented in GEDLIB are directly derived classes of ged::GEDMethod: ged::BranchTight [2], ged::HED [17], ged::Partition [32], ged::Hybrid [32], ged::SimulatedAnnealing [30], ged::BranchCompact [32], ged::AnchorAwareGED [14].

Fig. 3.
figure 3

The generic interface ged::GEDMethod.

Interface for Methods Based on the Linear Sum Assignment Problem with Error-Correction. A popular approach for approximating \(\mathrm {GED}\) is to use transformations to the linear sum assignment problem with error-correction (\(\mathrm {LSAPE}\)). An instance of \(\mathrm {LSAPE}\) consists of a cost matrix \(\mathbf {C} =(c_{i,k})\in \mathbb {R} _{\ge 0}^{(n+1)\times (m+1)}\). The task is to compute a mapping \(\pi \) from rows to columns, such that each row except for \(n+1\) and each column expect for \(m+1\) is covered exactly once and is minimized. \(\mathrm {LSAPE}\) can be solved optimally in cubic time [10]; in GEDLIB, we use the \(\mathrm {LSAPE}\) toolbox [8] for solving \(\mathrm {LSAPE}\).

If \(\mathrm {LSAPE}\) is used for approximating \(\mathrm {GED} (G,H)\), n and m are set to \(|V^G|\) and \(|V^H|\), the first \(|V^G|\) rows of \(\mathbf {C}\) are associated with the nodes of G, the first \(|V^H|\) columns of \(\mathbf {C}\) are associated with the nodes of H, and the last rows and columns are associated with dummy nodes used for codifying node insertions and deletions. With this setup, each \(\mathrm {LSAPE}\) solution \(\pi \) corresponds to a node map between G and H, which, in turn, induces an edit path and hence an upper bound for \(\mathrm {GED} (G,H)\) [6]. \(\mathrm {LSAPE}\) based heuristics for \(\mathrm {GED}\) try to achieve tight upper bounds by encoding structural information of the input graphs into \(\mathbf {C}\). Moreover, some of them construct \(\mathbf {C}\) such that \(\min _{\pi }\mathbf {C} (\pi )\) lower bounds \(\mathrm {GED}\).

Figure 4 shows the abstract class template ged::LSAPEBasedMethod, which provides the interface for implementing heuristics of this kind. The interface is defined by the virtual member functions starting with the prefix lsape_. The most important one is lsape_populate_instance_(), which populates the \(\mathrm {LSAPE}\) instance \(\mathbf {C}\). The following algorithms implemented in GEDLIB are directly derived classes of ged::LSAPEBasedMethod: ged::Bipartite [26], ged::Branch [2], ged::BranchFast [2], ged::Node [21], ged::BranchUniform [32], ged::Ring [3], ged::Subgraph [12], ged::Walks [18]. Additionally, all derived classes of ged::LSAPEBasedMethod can be run with the node centralities suggested in [27].

Fig. 4.
figure 4

The interface ged::LSAPEBasedMethod for methods based on \(\mathrm {LSAPE}\).

Interface for Methods Based on Machine Learning. Recently, it has been suggested to use deep neural networks or support vector machines for carrying out the transformation from \(\mathrm {GED}\) to \(\mathrm {LSAPE}\). Given two graphs G and H, feature vectors are constructed for all node substitutions, deletions, and insertions, and the matrix \(\mathbf {C}\) is defined as . Here, \(p^\star (i,k)\) is the confidence of a machine learning framework (either a deep neural network or a support vector machine) that the feature vector associated to the node edit operation corresponding to row i and column k is contained in an optimal node map.

Figure 5 details the abstract class template ged::MLBasedMethod, which provides the interface for algorithm adopting this paradigm. For implementing the interface, it suffices to override the virtual member functions starting with the prefix ml_. The most important ones are the three virtual member functions of the form ml_populate_*_feature_vector_(), which construct the feature vectors associated to the node edit operations. Derived classes of ged::MLBasedMethod do not have to implement the machine learning frameworks, as ged::MLBasedMethod offers support for artificial deep neural networks (using FANN [24]) and support vector machines (using LIBSVM [13]). The following algorithms implemented in GEDLIB are directly derived classes of ged::MLBasedMethod: ged::BipartiteML [28], ged::RingML [4].

Fig. 5.
figure 5

The interface ged::MLBasedMethod for \(\mathrm {LSAPE}\) based methods that use machine learning techniques for populating their \(\mathrm {LSAPE}\) instances.

Interface for Methods Based on Mixed Integer Programming. Another approach for exactly or approximately computing \(\mathrm {GED}\) is to rephrase the problem of computing \(\mathrm {GED} (G,H)\) as a mixed integer programming (MIP) problem. \(\mathrm {GED} (G,H)\) can then be computed exactly by calling an MIP solver. Alternatively, lower bounds for \(\mathrm {GED} (G,H)\) can be obtained by solving the linear programming (LP) relaxations of the MIP formulations.

Figure 6 shows the abstract class template ged::MIPBasedMethod, which provides the interface for \(\mathrm {GED}\) algorithms that use MIP formulations. The virtual member functions that define the interface start with the prefix mip_. The most important one is mip_populate_model_(), which constructs the employed MIP formulation and must be overridden by all derived classes. In GEDLIB, we use Gurobi [20] as our MIP and LP solver. Gurobi is commercial software but offers a free academic license. For users who cannot obtain a license for Gurobi, the installation script distributed with GEDLIB offers the option to install GEDLIB without ged::MIPBasedMethod and its derived classes. The following algorithms implemented in GEDLIB are directly derived classes of ged::MIPBasedMethod: ged::F1 [23], ged::F2 [23], ged::CompactMIP [6], ged::BLPNoEdgeLabels [21].

Fig. 6.
figure 6

The interface ged::MIPBasedMethod for methods based on MIP.

Interface for Methods Based on Local Search. Another popular approach for upper bounding \(\mathrm {GED}\) is to use variants of local search to systematically vary a previously computed or randomly generated node map, such that the cost of the induced edit path decreases. Figure 7 shows the abstract class template ged::LSBasedMethod, which provides the interface for algorithms using local search. The prefix ls_ marks the virtual member functions defining the interface. The most important one is ls_run_from_initial_solution_(), which runs the local search from an initial node map. The following algorithms implemented in GEDLIB are directly derived classes of ged::LSBasedMethod: ged::IPFP [5, 9, 11], ged::BPBeam [16, 29], ged::Refine [31]. Moreover, ged::LSBasedMethod provides support for running all derived classes with parallel multi-start as suggested in [15], and stochastic generators as suggested in [7].

Fig. 7.
figure 7

The interface ged::LSBasedMethod for methods based on local search.

5 Abstract Class for Implementing Edit Costs

Figure 8 shows the abstract class template ged::EditCosts, which provided the interface for implementing edit cost functions. The virtual member functions *_del_cost_fun() compute the cost of deleting a node or an edge with a given label, the functions *_ins_cost_fun() compute the insertions costs, and the functions *_rel_cost_fun() compute the costs for relabeling a node or an edge. The functions vectorize_*_label() return vector representations of the node and the edge labels, which are required by some methods. In GEDLIB, edit costs are available for the datasets aids, fingerprint, grec, letter, mutagenicity, and protein from the IAM Graph Database [25], for the datasets acyclic, alkane, pah, and mao from GREYC’s Chemistry Dataset (available at https://brunl01.users.greyc.fr/CHEMISTRY/), and for the dataset cmu-ged from the Graph Data Repository for Graph Edit Distance [1]. We also provide constant edit cost functions that can be used with any data.

Fig. 8.
figure 8

The interface ged::EditCosts for implementing edit costs.

6 Conclusions and Future Work

In this paper, we have presented GEDLIB, a C++ library for \(\mathrm {GED}\) computations. GEDLIB currently implements 24 different \(\mathrm {GED}\) algorithms and 9 different edit cost functions designed for datasets which are widely used in the research community. In the future, we will provide Python and MATLAB bindings for better usability. Moreover, we would like to encourage authors of algorithms and edit costs that are not implemented in GEDLIB to commit their work to GEDLIB.