1 Introduction

Graph matching has become a very active field of research [13]. In its most general form, graph matching refers to the problem of finding a mapping f from the nodes of one given graph g1 to the nodes of another given graph g2, that satisfies some constraints or optimality criteria. For example, in graph isomorphism detection [4], mapping f is a bijection that preserves all edges and labels. In subgraph isomorphism detection [5], mapping f has to be injective such that all edges of g1 are included in g2 and all labels are preserved. Other graph matching problems that require the construction of a mapping f with particular properties are maximum common subgraph (MCS) detection [6, 7] and graph edit distance (GED) computation [8, 9].

The main problem with graph matching is its high computational complexity, which arises from the fact that it is usually very costly to find mapping f for a pair of given graphs. It is a known fact that the detection of a subgraph isomorphism or a maximum common subgraph (MCS), as well as the computation of GED are NP-complete problems. If the graphs in the application are small, optimal algorithms can be used. These algorithms are usually based on an exhaustive enumeration of all possible mappings f between two graphs. Sometimes, application-dependent heuristics can be found that allow us to eliminate significant portions of the search space (i.e. the space of all possible functions f), but still guarantee the correct, or optimal, solution being found. Such heuristics can be used in conjunction with look-ahead techniques and constraint satisfaction [5, 10, 11]. For the matching of large graphs, one needs to resort to suboptimal matching strategies. Methods of this type are characterised by a (often low-order) polynomial time complexity, but they are no longer guaranteed to find the optimal solution for a given problem. A large variety of such suboptimal approaches have been proposed in the literature, based on a multitude of different computational paradigms. Examples include probabilistic relaxation [12, 13], genetic algorithms [14, 15], expectation maximisation [16], eigenspace methods [17, 18] and quadratic programming [19].

Another possibility to overcome the problem arising from the exponential complexity of graph matching is to focus on classes of graphs with an inherently lower computational complexity of the matching task. Some examples of such classes are given in [2022]. Most recently, in the field of pattern recognition and computer vision, the class of trees has received considerable attention [23, 24].

In this paper, another special class of graphs will be introduced. The graphs belonging to this class are characterised by the existence of unique node labels, which means that each node in a graph possesses a node label that is different from all other node labels in that graph. This condition implies that, whenever two graphs are being matched with each other, each node has at most one candidate for possible assignment under function f in the other graph. This candidate is uniquely defined through its node label. Consequently, the most costly step in graph matching, which is the exploration of all possible mappings between the nodes of the two graphs under consideration, is no longer needed. Moreover, we introduce matching algorithms for this special class of graphs and analyse their computational complexity. Particular attention is directed to the computation of graph isomorphism, subgraph isomorphism, MCS, GED and median graph computation.

If constraints are imposed on any class of graphs, we usually lose some representational power. The class of graphs considered in this paper is restricted by the requirement of each node label being unique. Despite this restriction, there exist some interesting applications for this class of graphs. From the general point of view, graphs with unique node labels seem to be appropriate whenever the objects from the problem domain, which are modelled through nodes, possess properties that can be used to uniquely identify them. We review one particular application of this class of graphs in the domain of computer network monitoring. Another application of graphs with unique node labels is Web document analysis. In this case, each unique term (word) that occurs in a document is represented by a node. Multiple occurrences of a term are represented by the same node. In this application, the considered class of graphs has been used for the tasks of classification and clustering of Web pages, and has shown superior performance over traditional vector-based approaches. For further details, see [2527].

The remainder of this paper is organised as follows. In Sect. 2, we introduce our basic concepts and terminology. Graphs with unique node labels and related matching strategies are presented in Sect. 3. Potential applications of this class of graphs are discussed in Sect. 4. In Sect. 5, we present the results of an experimental study where the run time of some of the proposed algorithms was measured. Finally, conclusions from this work are drawn in Sect. 6. An earlier version of this paper appeared in [28]. The present paper has been significantly extended in both theory and experimental evaluation.

2 Basic concepts and notation

In this section, the basic concepts and terminology used throughout the paper will be introduced. We consider directed graphs with labelled vertices (nodes) and edges (links). Let LV and LE denote the sets of node and edge labels, respectively. A graph g=(V, E, α, β) is a 4-tuple where V is the finite set of vertices, \( E \subseteq V \times V \) is the set of edges, α: VLV is a function assigning labels to the nodes and β: ELE is a function assigning labels to edges. Edge (x, y)∈E originates at node xV and terminates at node yV. An undirected graph is obtained as a special case if there exists an edge (y, x)∈E for every edge (x, y)∈E with β(x, y)=β(y, x).

Let g=(V, E, α, β) and g′=(V′, E′, α′ , β′) be graphs; g′ is a subgraph of g, \( {g}\ifmmode{'}\else$'$\fi \subseteq g\;{\text{if}}\;{V}\ifmmode{'}\else$'$\fi \subseteq V,\;{E}\ifmmode{'}\else$'$\fi \subseteq E,\;\alpha {\left( x \right)} = {\alpha }\ifmmode{'}\else$'$\fi{\left( x \right)} \) for all xV′ and β(x, y)=β′(x, y) for all (x, y)∈E′. Let \( g \subseteq {g}\ifmmode{'}\else$'$\fi\;{\text{and}}\;g \subseteq {g}\ifmmode{''}\else$''$\fi. \) Then, g is called a common subgraph of g′ and g″. Furthermore, g is called a maximum common subgraph (notation: MCS) of g′ and g″ if there exists no other common subgraph of g′ and g″ that has more nodes and, for a given number of nodes, more edges than g.

For graphs g and g′, a graph isomorphism is any bijection f: VV′ such that:

  1. 1.

    α(x)=α′(x) for all xV; and

  2. 2.

    For any edge (x, y)∈E, there exists (f(x), f(y))∈E′ with β(x, y)=β′(f(x), f(y)), and for any edge (x′, y′)∈E′, there exists an edge (f−1(x′), f−1(y′))∈E with β′(x′, y′)=β(f−1(x′), f−1(y′)).

If f: VV′ is a graph isomorphism between graphs g and g′, and g′ is a subgraph of another graph g″, i.e. \( {g}\ifmmode{'}\else$'$\fi \subseteq {g}\ifmmode{''}\else$''$\fi, \) then f is called a subgraph isomorphism from g to g″.

Next, we introduce the concept of GED, which is based on graph edit operations. We consider six types of edit operations: substitution of a node label, substitution of an edge label, insertion of a node, insertion of an edge, deletion of a node and deletion of an edge. A cost (i.e. a non-negative real number) is assigned to each edit operation. Let e be an edit operation and c(e) its cost. The cost of a sequence of edit operations, s=e1,..., e n , is given by the sum of all its individual costs, i.e. \( c{\left( s \right)} = {\sum\nolimits_{i = 1}^n {c{\left( {e_{i} } \right)}} }. \) The edit distance d(g1, g2) of two graphs g1 and g2 is equal to the minimum cost, taken over all sequences of edit operations, that transform g1 into g2. Procedures for GED computation are discussed in [8].

Finally, we introduce the median of a set of graphs [29]. Let G={g1,..., g N } be a set of graphs and U be the set of all graphs with labels from LV and LE. The median \( \ifmmode\expandafter\bar\else\expandafter\=\fi{g} \) of G is a graph that satisfies the condition:

$$ {\sum\limits_{i = 1}^N {d{\left( {\ifmmode\expandafter\bar\else\expandafter\=\fi{g},g_{i} } \right)}} } = \min {\left\{ {{\sum\limits_{i = 1}^N {d{\left( {g,g_{i} } \right)}|g} } \in U} \right\}} $$

It follows that the median is a graph which has the minimum average edit distance to the graphs in set G. It is a useful concept to represent a set of graphs by a single prototype. Intuitively, the median can be understood as an optimal representative of a set of graphs since it has, in the universe of all graphs, the smallest average distance to the given graphs. In many instances, the median of a given set G is not unique, neither is it always a member of G. The use of the term “median graph” seems appropriate because of its analogy to the median of a set of numbers; both minimise the average absolute difference to the members in the set.

3 Graphs with unique node labels

In this section, we introduce a special class of graphs that are characterised by the existence of unique node labels. Formally, we require that, for any graph g and any pair x, yV, the condition α(x)≠α(y) holds if xy. Furthermore, we assume that the underlying alphabet of node labels is an ordered set, for example, the integers, i.e. LV={1, 2, 3,...}. Throughout the rest of this paper, we consider graphs from this class only, unless otherwise mentioned.

Definition 1

Let g=(V, E, α, β) be a graph. The label representation ρ(g) of g is given by ρ(g)=(L, C, λ), where:

  1. 1.

    L={α(x)|xV}

  2. 2.

    C={(α(x), α(y))|(x, y)∈E}; and

  3. 3.

    λ: CLE with λ(α(x), α(y))=β(x, y) for all (x, y)∈E.

According to this definition, the label representation of a graph g is obtained by representing each node of g by its (unique) label and dropping set V. From the formal point of view, ρ(g) defines the equivalence class of all graphs that are isomorphic to g. The individual members of this class are obtained by assigning an arbitrary node, or, more precisely, an arbitrary node name, to each unique node label, i.e. to each element from L.

Example 1

Let LV={1, 2, 3, 4, 5} and g=(V, E, α, β) where V={a, b, c, d, e}, E={(a, b), (b, e), (e, d), (d, a), (a, c), (b, c), (d, c), (e, c), (a, e), (b, d)}, α: a ↦ 1, b ↦ 2, c ↦ 5, d ↦ 4, e ↦ 3, β: (x,y) ↦ 1 for all (x, y)∈E. A graphical illustration of g is shown in Fig. 1a, where the node names (i.e. the elements of V) appear inside the nodes and the corresponding labels appear outside. Because all edge labels are identical, they have been omitted. The label representation ρ(g) of g is then given by the following quantities: L={1, 2, 3, 4, 5}, C={(1, 2), (2, 3), (3, 4), (4, 1), (1, 5), (2, 5), (4, 5), (3, 5), (1, 3), (2, 4)}, \( \lambda :{\left( {i,j} \right)} \mapsto 1 \) for all (i, j)∈C.

Fig. 1a, b
figure 1

Example graph g and its label representation. a Example graph g. b Label representation of g

Intuitively, we can interpret the label representation ρ(g) of any graph g, as a graph identical to g up to the fact that all node names are left unspecified. Hence, ρ(g) can be conveniently graphically represented in the same way as g is represented. For example, a graphical representation of ρ(g), where g is shown in Fig. 1a, is given in Fig. 1b.

Lemma 1

Let g1=(V1, E1, α1, β1), g2=(V2, E2, α2, β2) be two graphs and ρ(g1)=(L1, C1, λ1), ρ(g2)=(L2, C2, λ2) be their label representations. Graph g1 is isomorphic to graph g2 if and only if ρ(g1)=ρ(g2) (i.e. L1=L2, C1=C2 and λ1=λ2).

Proof

Assume that there exists a graph isomorphism f: V1V2. Then, α1(x)=α2(f(x)) for all xV1. As f is bijective, it follows that L1=L2. Furthermore, because of the conditions on the edges that are imposed by graph isomorphism f, we conclude that C1=C2 and λ1=λ2. Conversely, assume that ρ(g1)=ρ(g2). Construct now the mapping f: V1V2 such that f(x)=y if and only if α1(x)=α2(y). Because L1=L2 and the node labels in both g1 and g2 are unique, this mapping is a bijection that satisfies the conditions of graph isomorphism imposed on the edges and edge labels in g1 and g2.

Based on this lemma, we can examine two graphs for isomorphism by simply generating their label representations and checking the conditions L1=L2, C1=C2 and λ1=λ2. Assume n=max{|V1| , |V2| }. Then, \( {\left| {L_{1} } \right|} = {\left| {L_{2} } \right|} = \mathcal{O}{\left( n \right)},\;{\left| {E_{1} } \right|} = {\left| {C_{1} } \right|} = \mathcal{O}{\left( {n^{2} } \right)} \) and \( {\left| {E_{2} } \right|} = {\left| {C_{2} } \right|} = \mathcal{O}{\left( {n^{2} } \right)}. \) Testing two ordered sets for identity is an operation that is linear in the number of elements. Hence, the computational complexity of testing two graphs with unique node labels for isomorphism amounts to \( \mathcal{O}{\left( {n^{2} } \right)}. \)

Lemma 2

Let g1, g2, ρ(g1) and ρ(g2) be the same as in Lemma 1. Then, g1 is a subgraph isomorphic to g2 if and only if \( L_{1} \subseteq L_{2} ,\;C_{1} \subseteq C_{2} \) and λ1(i, j)=λ2(i, j) for all (i, j)∈C1.

Proof

Firstly, we assume that there exists a subgraph isomorphism f: V1V2. Then, α1(x)=α2(f(x)) for all xV1. As f is injective, it follows that \( L_{1} \subseteq L_{2} . \) Similarly to the proof of Lemma 1, we conclude \( C_{1} \subseteq C_{2} \) and λ1(i, j)=λ2(i, j) for all (i, j)∈C1. Conversely, assume \( L_{1} \subseteq L_{2} ,\;C_{1} \subseteq C_{2} \) and λ1(i, j)=λ2(i, j) for all (i, j)∈C1. Then, we can construct an injective mapping f: V1V2 such that f(x)=y if and only if α1(x)=α2(y). Similarly to the proof of Lemma 1, it follows that this mapping is a subgraph isomorphism from g1 to g2.

Using Lemma 2, testing two graphs for subgraph isomorphism reduces to examining the corresponding label representations for the three conditions \( L_{1} \subseteq L_{2} ,\;C_{1} \subseteq C_{2} \) and λ1(i, j)=λ2(i, j) for all (i, j)∈C1. The third condition can be checked in \( \mathcal{O}{\left( {{\left| {C_{1} } \right|}} \right)} = \mathcal{O}{\left( {n^{2} } \right)} \) time. Checking whether an ordered set is a subset of another ordered set is linear in the size of the larger of the two sets. Hence, the computational complexity of subgraph isomorphism of graphs with unique node labels equals \( \mathcal{O}{\left( {n^{2} } \right)}. \)

Lemma 3

Let g1, g2, ρ(g1) and ρ(g2) be the same as in Lemma 1. Let g be a graph with ρ(g)=(L, C, λ) such that L=L1L2, C={(i, j)|(i, j)∈C1C2 and λ1(i, j)=λ2(i, j)} and λ(i, j)=λ1(i, j) for all (i, j)∈C. Then, g is an MCS of g1 and g2.

Proof

First, we note that \( L \subseteq L_{1} \;{\text{and}}\;L \subseteq L_{2} . \) Hence, \( V \subseteq V_{1} \;{\text{and}}\;V \subseteq V_{2} . \) for any graph g with label representation ρ(g)=(L, C, λ). Similarly, because C includes a pair (i, j) if and only if a corresponding edge with identical label exists in both g1 and g2, we observe \( E \subseteq E_{1} \;{\text{and}}\;E \subseteq E_{2} . \) for any such graph g. Thirdly, the labels of edges (x, y) occurring in both g1 and g2 are preserved under λ. Hence, g is a subgraph of both g1 and g2. Now assume that g is not an MCS. In this case, there must exist another subgraph g′ of both g1 and g2 with either more nodes than g, or the same number of nodes, but with more edges. The first case contradicts the way set L is constructed; if g′ has more nodes than g, then LL1L2. The second case is in conflict with the construction of C and λ, i.e. if g′ has the same number of nodes as g, but more edges, then C and λ must be different from their values stated in Lemma 3. Hence, g must be indeed an MCS of g1 and g2. The proof follows.

Possible computational procedures implied by Lemma 3 are again based on the intersection of two ordered sets. Hence, the complexity of computing the MCS of two graphs with unique node labels is \( \mathcal{O}{\left( {n^{2} } \right)}. \)

In [30], a detailed analysis was provided showing how GED depends on the costs associated with the individual edit operations. A set of edit operations together with their cost is also called a cost function. In this paper, we focus our attention on the following cost function: cnd(x)=cni(x)=1, cns(x)=∞, ced(x, y)=cei(x,y)=ces(x, y)=1, where cnd(x), cni(x) and cns(x) denote the cost associated with the deletion, insertion and substitution of node x, while ced(x, y), cei(x, y) and ces(x, y) denote the cost associated with the deletion, insertion and substitution of edge (x, y), respectively. This cost function is simple in the sense that each edit operation has a cost equal to 1, except for node substitutions, which have infinite cost. It is easy to see that, for any two graphs, g1 and g2, there always exists a sequence of edit operations that transforms g1 into g2 with a finite total cost (for example, a sequence that deletes all nodes and edges from g1, and inserts all nodes and edges in g2). Hence, edit operations with infinite cost will never be applied in the computation of any actual GED. This means that node substitutions will never be applied and may be considered non-admissible under the cost function introduced above, while all other edit operations can be applied and have the same cost. The exclusion of node substitutions for graphs with unique node labels makes sense since node label substitutions may generate graphs with non-unique node labels, i.e. graphs that do not belong to the class of graphs under consideration.

Lemma 4

Let g1, g2, ρ(g1) and ρ(g2) be the same as in Lemma 1. Furthermore, let C0={(i, j)|(i, j)∈C1C2} and C 0 ={(i, j)|(i, j)∈C1C2} and λ1(i, j)≠λ2(i, j). Then \( d{\left( {g_{1} ,g_{2} } \right)} = {\left| {L_{1} } \right|} + {\left| {L_{2} } \right|} - 2{\left| {L_{1} \cap L_{2} } \right|} + {\left| {C_{1} } \right|} + {\left| {C_{2} } \right|} - 2{\left| {C_{0} } \right|} + {\left| {{C}\ifmmode{'}\else$'$\fi_{0} } \right|}. \)

Proof

Because node substitutions can be regarded non-admissible, the minimum cost sequence of edit operations transforming g1 into g2 assigns each node xV1 with label α1(x) to node yV2 with α1(x)=α2(y). If no node yV2 exists with this property, node x is deleted from g1. Similarly, all nodes yV2 for which no node xV1 exists with α1(x)=α2(y) will be inserted in g2. This leads to \( {\left| {L_{1} } \right|} - {\left| {L_{1} \cap L_{2} } \right|} \) node deletions in graph g1 and \( {\left| {L_{2} } \right|} - {\left| {L_{1} \cap L_{2} } \right|} \) node insertions in graph g2, each having a cost equal to 1. Hence, the total cost arising from edit operations on the nodes of g1 and g2 amounts to \( {\left| {L_{1} } \right|} - {\left| {L_{1} \cap L_{2} } \right|} + {\left| {L_{2} } \right|} - {\left| {L_{1} \cap L_{2} } \right|} = {\left| {L_{1} } \right|} + {\left| {L_{2} } \right|} - 2{\left| {L_{1} \cap L_{2} } \right|}. \)

We now consider the edges. There exist \( {\left| {C_{1} } \right|} - {\left| {C_{0} } \right|} \) edges in g1 that do not occur in g2, and need to be deleted. Similarly, there exist \( {\left| {C_{2} } \right|} - {\left| {C_{0} } \right|} \) edges in g2 that do not have a counterpart in g1, and need to be inserted. Furthermore, there are two types of edges corresponding to the set \( {\left| {C_{1} \cap C_{2} } \right|}. \) The first type are edges (i, j)∈C0, for which λ1(i, j)=λ2(i, j). No edit operations are needed for edges of this kind. The second type are edges (i, j)∈C 0 , for which λ1(i, j)≠λ2(i, j). An edge substitution with a cost of 1 is needed for each such edge. Hence, the total cost of the edit operations on the edges of g1 and g2 is equal to \( {\left| {C_{1} } \right|} - {\left| {C_{0} } \right|} + {\left| {C_{2} } \right|} - {\left| {C_{0} } \right|} + {\left| {{C}\ifmmode{'}\else$'$\fi_{0} } \right|} = {\left| {C_{1} } \right|} + {\left| {C_{2} } \right|} - 2{\left| {C_{0} } \right|} + {\left| {{C}\ifmmode{'}\else$'$\fi_{0} } \right|}. \) This concludes the proof.

Possible computational procedures for GED computation implied by Lemma 4 are based again on the intersection of two ordered sets. Hence, similarly to all other graph matching procedures considered before, the complexity of edit distance computation of graphs with unique node labels is \( \mathcal{O}{\left( {n^{2} } \right)}. \)

Finally, we turn to the problem of computing a graph g that is the median of a set of graphs G={g1,..., g N } with unique node labels. In the remainder of this section, we assume, for the purpose of notational convenience, and without restricting generality, that all graphs under consideration are complete. That is, there is an edge (x, y)∈E between any pair of nodes x, yV for any considered graph g. “Real” edges can be easily distinguished from “virtual” edges by including a special null symbol in the edge label alphabet LE and defining β(x, y)=null for any virtual edge. The benefit we get from considering complete graphs is that the only necessary edit operations on the edges are substitutions. In other words, any edge deletion or insertion now becomes a substitution that involves the null label. No conflicts will arise from this simplification because the cost of edge substitutions, deletions and insertions are the same.

Let ρ(g1),..., ρ(g N ) be the label representations of g1,..., g N . Define \( L_{{\text{U}}} = {\bigcup\nolimits_{i = n}^N {L_{i} } }\;{\text{and}}\;C_{{\text{U}}} = {\bigcup\nolimits_{i = n}^N {C_{i} } }. \) Furthermore, let γ(i) be the total number of occurrences of node label iLU in L1,..., L N . Note that (1≤γ(i)≤N). Formally, γ(i) can be defined through the following procedure:

  • γ(i)=0;

  • \( \begin{array}{*{20}l} {{{\text{for}}} \hfill} & {{k = 1\;{\text{to}}\;N\;{\text{do}}} \hfill} \\ {{} \hfill} & {{{\text{if}}\;i \in L_{k} \;{\text{then}}\;\gamma {\left( i \right)} = \gamma {\left( i \right)} + 1} \hfill} \\ \end{array} \)

Next, we define ρ(g)=(L, C, λ) such that:

  1. 1.

    L={i|iLU and γ(i)≥N/2};

  2. 2.

    C={(i, j)|i, jL}; and

  3. 3.

    λ(i, j)=max_label(i, j),

where the function max_label(I,j) returns the label λ k (i, j)∈LE that has the maximum number of occurrences on edge (i, j) in C1,..., C N . In case of a tie, any of the competing labels λ k (i, j) may be returned.

Lemma 5

Let G and ρ(g) be as above. Then, any graph g with a label representation ρ(g) is a median graph of G.

Proof

The smallest potential median graph candidate is the graph with an empty set of nodes, while the largest potential candidate corresponds to the case L=LU. The second observation is easy to verify because any graph g* that includes more node labels will have at least one label k* that does not occur in any of the L i ’s. Hence, the node with label k* will be deleted in all of the distance computations for d(g*, g), i=1,..., N. Therefore, dropping the node with label k* from g* will produce a graph with a smaller average edit distance to the members of G. It follows that, for any median graph g with node label representation ρ(g), the set L must be necessarily a subset of LU.

If we substitute the expression derived in Lemma 4 into the definition of a median graph given in Sect. 2, we recognise that any median graph g with node label representation ρ(g) must minimise the following expression:

$$ \begin{array}{*{20}l} {{\Delta = } \hfill} & {{{\left| L \right|} + {\left| {L_{1} } \right|} - 2{\left| {L \cap L_{1} } \right|} + {\left| C \right|} + {\left| {C_{1} } \right|} - 2{\left| {C_{{01}} } \right|} + {\left| {{C}\ifmmode{'}\else$'$\fi_{{01}} } \right|} + \ldots + } \hfill} \\ {{} \hfill} & {{{\left| L \right|} + {\left| {L_{N} } \right|} - 2{\left| {L \cap L_{N} } \right|} + {\left| C \right|} + {\left| {C_{N} } \right|} - 2{\left| {C_{{0N}} } \right|} + {\left| {{C}\ifmmode{'}\else$'$\fi_{{0N}} } \right|}} \hfill} \\ \end{array} $$

where we use the following notation (see Lemma 4):

$$ \begin{aligned} & C_{{0k}} = {\left\{ {{\left( {i,j} \right)}|{\left( {i,j} \right)} \in C \cap C_{k} } \right\}}\;{\text{and}}\;\lambda {\left( {i,j} \right)} = \lambda _{k} {\left( {i,j} \right)} \\ & {C}\ifmmode{'}\else$'$\fi_{{0k}} = {\left\{ {{\left( {i,j} \right)}|{\left( {i,j} \right)} \in C \cap C_{k} } \right\}}\;{\text{and}}\;\lambda {\left( {i,j} \right)} \ne \lambda _{k} {\left( {i,j} \right)} \\ \end{aligned} $$

Clearly, Δ can be rewritten as:

$$ \Delta = N{\left| L \right|} + {\sum\limits_{i = 1}^N {{\left| {L_{i} } \right|}} } - 2{\sum\limits_{i = 1}^N {{\left| {L \cap L_{i} } \right|}} } + N{\left| C \right|} + {\sum\limits_{i = 1}^N {{\left| {C_{i} } \right|}} } - 2{\sum\limits_{i = 1}^N {{\left| {C_{{0i}} } \right|}} } + {\sum\limits_{i = 1}^N {{\left| {{C}\ifmmode{'}\else$'$\fi_{i} } \right|}} } $$

Note that all quantities are non-negative integers. As all L i ’s and C i ’s are given, the minimisation of Δ is equivalent to minimising:

$$ N{\left| L \right|} - 2{\sum\limits_{i = 1}^N {{\left| {L \cap L_{i} } \right|}} } + N{\left| C \right|} - 2{\sum\limits_{i = 1}^N {{\left| {C_{{0i}} } \right|}} } + {\sum\limits_{i = 1}^N {{\left| {{C}\ifmmode{'}\else$'$\fi_{{0i}} } \right|}} } $$

First, we analyse the term \( \Delta _{1} = N{\left| L \right|} - 2{\sum\limits_{i = 1}^N {{\left| {L \cap L_{i} } \right|}} }. \) It is obvious that N|L| will become smaller if we include fewer nodes in the median graph. On the other hand, this will also make the term \( 2{\sum\limits_{i = 1}^N {{\left| {L \cap L_{i} } \right|}} } \) smaller, which leads to an increase of Δ1. To find the optimal number of nodes to be included in the median graph, we consider each element of L individually and decide whether it must be included in the median graph or not. From the definition of Δ1, it follows that, if a node with label i is included in the median graph, its contribution to Δ1 will be N−2γ(i). Conversely, if that node is not included, its contribution will be zero. Hence, in order to minimise Δ1, we include a node with label i in the median graph if N−2γ(i)≤0, which is equivalent to γ(i)≥N/2.

Now consider the term \( \Delta _{2} = N{\left| C \right|} - 2{\sum\limits_{i = 1}^N {{\left| {C_{{0i}} } \right|}} } + {\sum\limits_{i = 1}^N {{\left| {{C}\ifmmode{'}\else$'$\fi_{{0i}} } \right|}} }. \) Assume for the moment that |C| is a constant that is defined through the choice of L. Then, we have to minimise \( - 2{\sum\limits_{i = 1}^N {{\left| {C_{{0i}} } \right|}} } + {\sum\limits_{i = 1}^N {{\left| {{C}\ifmmode{'}\else$'$\fi_{{0i}} } \right|}} }. \) As \( {\left| {C_{{0i}} } \right|} + {\left| {{C}\ifmmode{'}\else$'$\fi_{{0i}} } \right|} = {\left| {C \cap C_{0} } \right|}, \) this is equivalent to maximising |C0i|. However, such a maximisation is exactly what is accomplished by the function max_label. This function chooses, for edge (i, j), the label that most often occurs on edge (i, j) in all the given graphs.

So far, we have treated the terms Δ1 and Δ2 independently of each other. In fact, they are not independent because the exclusion of a node with label i from the median graph implies exclusion of any of its incident edges (i, j) or (j, i). Therefore, the question arises whether this dependency can lead to an inconsistency in the minimization of Δ=Δ12 in the sense that decreasing Δ1 leads to an increase of Δ2 by a larger amount, and vice versa. It is easy to see that such an inconsistency can never happen. First of all, exclusion of an edge (i, j) for the sake of minimizing Δ2 does not imply any constraints on inclusion or exclusion of any of the incident nodes i and j. Second, if node i is not included because γ(i)<N/2, the function max_label will surely return the null label for any edge (i, j) or (j, i). This is equivalent to not including (i, j) or (j, i) in the median graph. In other words, if a node i is not included in the median graph because γ(I)<N/2, the dependency between Δ1 and Δ2 leads to also not including all incident edges, which is exactly what is required to minimize Δ2. This concludes the proof of Lemma 5.

In order to derive a practical computational procedure for the computation of a median of a set of graphs with unique node labels, we need to implement functions γ(i) and max_label(i, j). It is easy to verify that the complexity of these two functions is \( \mathcal{O}{\left( {nN} \right)} \) and \( \mathcal{O}{\left( {n^{2} N} \right)}, \) respectively. It follows that the median graph computation problem can be solved in \( \mathcal{O}{\left( {n^{2} N} \right)} \) time for graphs with unique node labels.

So far, we have assumed that there are \( \mathcal{O}{\left( {n^{2} } \right)} \) edges in a graph with n nodes. There are, however, applications where the graphs are of bounded degree, i.e. the maximum number of edges incident to a node is bounded by a constant κ. In this case, all of the expressions \( \mathcal{O}{\left( {n^{2} } \right)} \) reduce to \( \mathcal{O}{\left( n \right)}. \)

The following theorem summarises all of the results derived in this section.

Theorem 1

For the class of graphs with unique node labels, there exist computational procedures that solve the following problems in quadratic time with respect to the number of nodes in the underlying graph:

  1. 1.

    Graph isomorphism

  2. 2.

    Subgraph isomorphism

  3. 3.

    Maximum common subgraph

  4. 4.

    Graph edit distance under the cost function introduced earlier in this section

The median graph computation problem can be solved in \( \mathcal{O}{\left( {n^{2} N} \right)} \) time where n is the number of nodes in the largest graph and N is the number of given graphs.

4 Application to computer network monitoring

The performance management of computer networks is becoming increasingly important as computer networks grow in size and complexity. Not surprisingly, this growth has resulted in an increase in frequency, type and severity of network problems [31, 32]. In addition, the dynamic behaviour exhibited by computer networks, in terms of their connectivity and traffic distributions, has significantly complicated the role of network performance management. Unusual activity patterns of network users is often a major contributor to the dynamic behaviour of these networks. Techniques are required to improve network monitoring and provide proactive detection of network anomalies so that problems can be corrected before they result in a disruption to services.

In [33, 34], graph similarity measures for network monitoring and abnormal change detection were proposed. The basic idea is to represent a computer network by a graph where the nodes represent clients or servers, and the edges represent physical connections between clients or servers. A time series of graphs g1, g2,...,g t , gt+1,..., g n is obtained by measuring the state of a network at regular time intervals and representing each state by a graph. Measures of network difference that use graph matching algorithms, such as MCS and GED, were introduced in [35]. These algorithms are applied to pairs of consecutive graphs g t and gt+1 to measure network change. Change is classified as anomalous if the network change d(g t , gt+1) exceeds a given threshold α. Another measure of network change can be achieved using the median graph. Similarly to classical time series analysis, it can be expected that using the median, computed over a time window of a certain length, rather than an individual graph, will lead to more robustness against outliers in the time series of graphs. In [33], four procedures, based on median graphs, were defined for abnormal change detection. Essentially, the median graph \( \ifmmode\expandafter\bar\else\expandafter\=\fi{g}_{t} \) is computed for a subsequence of graphs (gt-L+1,..., g t ) of length L. Then, \( d{\left( {\ifmmode\expandafter\bar\else\expandafter\=\fi{g}_{t} ,g_{{t + 1}} } \right)} \) can be used to measure the anomalous network change. We classify the change between \( \ifmmode\expandafter\bar\else\expandafter\=\fi{g}_{t} \) and gt+1 as anomalous if \( d{\left( {\ifmmode\expandafter\bar\else\expandafter\=\fi{g}_{t} ,g_{{t + 1}} } \right)} \) is larger than a given threshold. The average deviation φ occurring within the median window is used to assign a more robust threshold (i.e. \( d{\left( {\ifmmode\expandafter\bar\else\expandafter\=\fi{g}_{t} ,g_{{t + 1}} } \right)} \geqslant \alpha \varphi \)).

The class of graphs considered in this paper have the requirement that each node label must be unique. This constraint does not pose a problem when dealing with graphs constructed from data collected from computer networks. It is common in these networks that each node, such as a client, server or router, be uniquely identified. For example, in an intranet employing ethernet technology on a local area network (LAN) segment, either the media access control (MAC) or the internet protocol (IP) address could be used to uniquely identify nodes on the local segment. As a consequence, the efficient graph matching algorithms described in Theorem 1 can be applied to computer networks to assist in network management functions.

Using Definition 1, the time series of graphs given above has label representation ρ(g1), ρ(g2),..., ρ(g t ), ρ(gt+1),..., ρ(g n ). In order to gain a significant saving in the computation time of MCS and GED, we use the label representations ρ(g t ) and ρ(gt+1), and apply the matching algorithms defined in Lemmas 3 and 4. If the threshold is exceeded, it can be concluded that a significant network change has occurred at time t+1. A network administrator can then use other network management tools to determine whether the change represents an abnormal event. The implementation of the median graph detection strategies, using the label representation, is achieved using the algorithm defined in Lemma 4.

Given the diverse range of networks that must be managed by network administrators (e.g. LAN, wide area networks and the Internet) it is important that techniques for measuring network change perform independently of the type of network topology. By inspection of the algorithms defined in Sect. 3, it can be deduced that, as long as the graphs have the same number of vertices and edges, the computation time for each method will be the same regardless of network topology. Thus, the techniques described are independent of network topology. This finding is verified in Sect. 5 and means that an accurate prediction of computation times can be determined for real network data based on measurements achieved for synthetic network data of equivalent size.

5 Experimental results

The aim of the experiments described in this section is to verify the low computational complexity for the theoretical results derived in Sect. 3. The time taken to compute isomorphism, subgraph isomorphism, MCS and GED are measured for graphs ranging in size from hundreds of nodes to tens of thousands of nodes, and with different edge densities. In addition, we validate the linear dependency of the time taken to compute a median graph to the number of graphs from which the median is derived. Computation times are measured for synthetic data sets and real network data. Real network data was acquired from a link in the core of a wide area computer network. A test for similarity of computation time measurements for real and synthetic data sets is made to verify that the results achieved for simulated networks can be repeated for real-world implementations. An experiment was conducted to verify that the time taken to compute algorithms in this paper are independent of network topology. Two graph generators were used to produce synthetic data sets having different network topologies. The real network data set was used as a third sample having different topology. The graphs in each data set had to be equivalent in the number of nodes and links for this test.

The hardware platform used to measure computation times was a SUN Fire V880 with 4×750 MHz UltraSparc3 processors and 8 GB of RAM. The specific hardware platform used to perform the experiments is not important and has been provided for completeness only. Only relative computation times with respect to graph dimensions are important.

5.1 Synthetic network data

Synthetic data sets are used to validate the computational complexity of the procedures defined in Sect. 3. These data sets are also used to verify that the procedures are independent of network topology.

Two data sets have been produced using normally distributed random edges with edge densities of 2.5% and 10%, respectively. An edge density of 2.5% was used so that graphs with 20,000 nodes could be synthesised without exceeding the computer memory of the computer platform used for the experiments. The data set with 10% edge density was chosen to mimic the characteristics of the real data network. The maximum number of nodes possible for graphs in this data set was 10,000.

An additional single synthetic data set having edge density of 2.5% was created using a Fan Chung algorithm [36]. This graph generator produced graphs having vertex degrees with a power-law distribution. The resultant topology of graphs produced using this method is quite different to those of graphs having normally distributed random edges. In fact, graphs having degree distribution that are power-laws are characteristic of large networks, such as the Internet [37].

For each synthetic data set, we first obtain a series S of graphs, comprising 100, 1,000, 3,000, 5,000, 7,000 and 10,000 nodes. For data sets with an edge density of 2.5%, we obtain an additional graph in the series that has 20,000 nodes. The resulting graphs have directed edges with Poisson distributed edge weights. A second series S′ was produced as a counterpart, using the same procedure, for measurements of computation times for MCS and GED.

A further set of graphs was created to verify the linear increase in computation time with an increase in edge density, for a fixed number of nodes. The graph generator assigned edges using a normal distribution. For this data set, graphs had 5,000 vertices and edge densities ranging from 1% to 10% in steps of 1%. A counterpart was created for each graph to be used for MCS and GED computations.

To compare the computation times of algorithms measured for synthetic data against real data sets, we created two randomly distributed graphs having the same number of vertices and edges as each of the real data sets in Sect. 5.2.

Finally, for the validation of computation times for median graphs, we created a series of 100 graphs using randomly distributed edges. In this series, the average number of vertices and the edge density is matched to our business domain network data set (i.e. comprises graphs having on average 70 vertices with edge density of 10%) as described in Sect. 5.2.

5.2 Real network data

Real network data was acquired from a core link in a large enterprise data network using network performance monitoring tools. The data network employs static IP addresses, hence, its suitability for representation by the class of graphs defined in this paper. Graphs were produced from traffic traversing the link at intervals of one day. This resulted in a time series of 100 graphs representing 100 days of network traffic.

Two levels of abstraction have been used to produce the time series of real network data. Both have quite different characteristics. The first data set has graph vertices that represent IP addresses, whilst the second has vertices that represent business domains. In both data sets, edges represent logical links, and edge weights represent the total number of bytes communicated between vertices in one day. The business domain abstraction is created by coalescing IP addresses belonging to a predefined business entity into a single node. This resulted in graphs that comprises on average 70 nodes with edge densities of 10%. The IP network abstraction has graphs that have on average 9,000 nodes with an edge density of 0.04%. The low edge density is a result of the near bipartite nature of the graphs arising from data collected at a single point in the core of the enterprise data network. The business domain and IP network abstractions are of interest to network administrators as they provide both coarse and fine network performance data, respectively.

Two consecutive graphs were chosen from each of the real network data set abstractions to be used in comparisons of computation times of algorithms with times measured for synthetic data. The two graphs chosen for the business domain abstraction comprised approximately 90 vertices with an edge density of 10%, whilst the graphs chosen from the IP abstraction comprised 9,000 vertices with an edge density of 0.04%.

To verify median graph computation times, the whole 100-day time series of graphs of business domain data was used.

5.3 Verification of \( \mathcal{O}{\left( {n^{2} } \right)} \) theoretical computational complexity for isomorphism, subgraph isomorphism, MCS and GED

To measure the time taken to compute a test for graph isomorphism, we select the first graph g1 from S, comprising 100 unique nodes, and make an exact copy g2. The fact that g2=g1 guarantees that the graphs tested are in fact isomorphic to each other. The computation time measurement does not include the time taken to derive the label representations ρ(g1) and ρ(g2) for graphs g1 and g2. This is true for all computation times measured for each algorithm. For the measurement of computation time for the subgraph isomorphism test, we use the same graph g1 together with graph g3, obtained by removing 20% of the edges from g1. The graph g3 is obviously a subgraph of g1. The measurements of time to compute both MCS and GED required both graph series S and S′. To measure the time taken to execute these algorithms, we again use g1 from S and select the equivalent size graph from S′. The procedures outlined above were repeated for all three synthetic data sets for graph sizes 1,000, 3,000, 5,000, 7,000, 10,000 and 20,000 (where present).

The results of all computation time measurements are shown in Tables 1, 2, 3 and 4. As expected, the measured computational complexity of all matching algorithms is \( \mathcal{O}{\left( {n^{2} } \right)}. \) Figure 2 graphically illustrates the quadratic dependence of computation time on the number of nodes for GED; the x-axis corresponds to the number of nodes in a graph and the y-axis represents the time, in seconds, to compute the GED algorithm. Greater computation times can be observed for larger edge densities. This result was anticipated due to the dependency on graph elements. Computation times to test for graph isomorphism were the longest. Testing for subgraph isomorphism required the least time to compute. This was a consequence of removing 20% of the edges from g1 to produce a subgraph g2. The smaller the size of g2 with respect to g1, the shorter the time taken to compute the subgraph isomorphism. The computation times for both MCS and GED, as observed in Figs. 3 and 4, are almost indistinguishable. This is not surprising since the computational steps, proposed in Lemmas 3 and 4, are nearly identical. In all cases, the computation times measured for both randomly distributed edges and those with power-law degree distributions, for an edge density of 2.5%, are nearly identical. The results would be identical if the number of edges in the graphs from both data sets were equal. Since the graph generator used to produce graphs with randomly distributed edges create, on average, graphs with a specified edge density, the actual number of edges can vary. The closeness of the results verifies the independence of the algorithms to network topology.

Table 1 Computation times for isomorphism
Table 2 Computation times for subgraph isomorphism
Table 3 Computation times for MCS
Table 4 Computation times for GED
Fig. 2
figure 2

Computation times for GED

Fig. 3
figure 3

Plot of linear dependency of computation times with respect to edge density

Further experimentation was performed to show the linear dependency of computational complexity in the number of edges in a graph with a fixed number of vertices. Figure 3 shows results for the four graph algorithms. Observation of these results reveals the linear relationship.

5.4 Comparison of computation times for real and synthetic data sets

In this section, measurements of computation times of isomorphism, subgraph isomorphism, MCS and GED on the two real network data sets (i.e. business domain and IP-level abstractions) and their synthetic counterparts were performed. The aim was to confirm that synthetic data measurements are consistent with those measured for real network data.

The label representation is first derived for the two graphs in each data set. Isomorphism and subgraph isomorphism computation requires only one of the graphs from each set. Both graphs are required for the MCS and GED computations. The results are given in Table 5. It can be seen that all measurements between the real and equivalent synthetic data sets agree. This infers that results obtained for synthetic data are consistent to those for real data.

Table 5 Comparison of computation times for real and synthetic network data

5.5 Verification of theoretical computation times for the median graph

The computation time of the median graph algorithm described in Sect. 3 is measured for both real and synthetic data sets. Both comprise a time series of 100 graphs. The sizes of the graphs within each time series and between time series are similar. We wish to verify that the time taken to compute a median graph increases linearly as the number of graphs in the median computation increases.

Measurements commenced by taking the first 10 graphs in the time series (i.e. {g1,..., g10}) of each data set and computing the median graph. The procedure was repeated using the first 20, 30, 40,..., 90 and 100 graphs. The results are given in Table 6 and Fig. 4. Both real and synthetic data sets show a linear dependency of computation time with respect to the number of graphs. The plot of computation times for the real data set deviates from a straight line because the edge counts in this data set had a greater standard deviation than those of the synthetic data set. The number of vertices and edges in graphs belonging to real and synthetic data sets can been seen in Fig. 5.

Table 6 Computation times for median graph
Fig. 4
figure 4

Computation times for median graph algorithm

Fig. 5
figure 5

Vertex and edge counts for real and synthetic data sets

6 Summary and conclusions

Graph matching is finding many applications in the fields of science and engineering. In this paper, we considered a special class of graphs, characterised by unique node labels. A label representation is given for graphs in this class. For a given graph, it comprises a set of unique vertex labels of the graph, an edge set based on vertex labels and a set of edge weights. A number of computationally efficient matching algorithms were derived for this class of graphs. The suitability of applying these matching algorithms to computer network monitoring was addressed.

The matching algorithms that have been derived for graphs having a label representation are detection of graph isomorphism and subgraph isomorphism, computation of MCS, GED and the median graph. The theoretical computational complexity of these algorithms, for graphs having n nodes, is \( \mathcal{O}{\left( {n^{2} } \right)}. \) It was also shown that the time taken to compute a median graph increases linearly with the number of graphs in the set from which the median is computed. Theoretical results were verified using real and synthetic data sets.

It is possible to apply the derived matching algorithms to computer network monitoring as the constraint for unique node labels can be satisfied. In computer networks, nodes can be uniquely identified by means of the MAS or IP addresses. The matching algorithms proposed can be used to measure a change that occurs in a computer network over time. Measures of network change provide good indicators of when abnormal network events have occurred. Such techniques greatly enhance computer network management, especially in the field of performance monitoring.

The theoretical computational complexity of the matching algorithms were verified through experimentation using synthetic and real network data. Synthetic data sets of graphs with a specified number of nodes and edge densities were used for this purpose. In addition, synthetic data sets having different network topologies were used to show that the computation times for derived algorithms are independent of network topology. A comparison of results achieved for synthetic data sets with those obtained using data acquired from a large wide area computer network of equal dimension were shown to agree. This outcome, along with the knowledge that the algorithms are independent of network topology, means that the simulation of performance of the algorithms on synthetic data can be used to accurately predict the performance that will be achieved for real networks.

In conclusion, graph matching algorithms for uniquely labelled graphs having a label representation provide a significant computational saving compared to the generalised class of graphs where such matching algorithms have an exponential computational complexity. In this paper, we have shown that, for this class of graphs, we have been able to apply matching algorithms to graphs having many thousands of nodes. This is a significant improvement in the pattern recognition community, where graphs with a few hundred nodes is considered very large.

The application considered in this paper, i.e. computer network analysis, is not a traditional application of pattern recognition, such as detection or classification of objects in an image. Nevertheless, it surely belongs to the discipline of pattern recognition because the considered task, i.e. abnormal event detection, is cast as a classification problem, where the change occurring in the network between time t and time t+1 is to be categorised as normal or abnormal. It remains an open problem to identify more problems in pattern recognition and artificial intelligence, in addition to the ones described in [2527], where graphs with unique node labels are suitable object representatives.

7 Originality and contribution

The problem of graph matching, such as the computation of GED edit distance, is NP complete. In this paper, we consider a class of graphs with an inherently lower computational cost for graph matching tasks. These graphs are characterised by the existence of unique node labels. The definition of a label representation is given for graphs in this class, and \( \mathcal{O}{\left( {n^{2} } \right)} \) matching algorithms are devised for isomorphism, subgraph-isomorphism, MCS, GED and the median graph. Whilst graph matching is not in itself novel, the algorithmic framework developed in this paper for the special class of graphs is new and offers significant computational improvements, especially when dealing with very large graphs. The paper introduces some interesting applications for this class of graphs and verifies the theoretical low computational complexity with practical results.

We explore the application of this class of graphs to computer network monitoring. Nodes in the computer network under investigation have static address allocations and, hence, can be considered to have unique identities. The networks are very large in size and are dynamic in behaviour. This has lead to increasing difficulties in network performance management. The proactive detection of network anomalies is of specific importance. In this application, we use GED and median graph computation to detect abnormal changes in network behaviour. This is very much a useful contribution to the discipline of pattern recognition as computer network monitoring is a classification problem where we want to categorise network changes as either normal or abnormal.

8 Biographies of authors

Arek Dadej is an Associate Professor and leader of the Telecommunication Networks and Services research group at the Institute for Telecommunications Research (ITR), University of South Australia. His current research interests comprise protocols for mobility support in IP networks, QoS control in wireless/mobile network environment, routing and network configuration, and ad-hoc network service discovery and user access control. His teaching areas include many courses in telecommunication networks and computer systems engineering. Arek has led major industry-sponsored research projects in telecommunication networks and protocols, like high-capacity 802.11 WLAN design with guaranteed QoS, a study of multicasting and scheduling techniques in satellite broadcast systems and a study of ad-hoc networking technologies. He also led two projects within the Cooperative Research Centre for Satellite Systems, focussing on the network architectures, protocols and on-board processing for ATM and IP-based service delivery via networks of small satellites.

Horst Bunke received his MSc and PhD degrees in Computer Science from the University of Erlangen, Germany. In 1984, he joined the University of Bern, Switzerland, where he is a professor in the Computer Science Department. He was Department Chairman from 1992 to 1996 and Dean of the Faculty of Science from 1997 to 1998. His research interests comprise computer vision, image analysis, pattern recognition and artificial intelligence. From 1998 to 2000, Horst Bunke was the first Vice-President of the International Association for Pattern Recognition (IAPR). He is a fellow of the IAPR, former editor-in-charge of the International Journal of Pattern Recognition and Artificial Intelligence, editor-in-chief of Electronic Letters of Computer Vision and Image Analysis journal, editor-in-chief of the book series on Machine Perception and Artificial Intelligence by World Scientific Publ. Co., associate editor of Acta Cybernetica, the International Journal of Document Analysis and Recognition and Pattern Analysis and Applications. Horst Bunke was on the program and organisation committee of many conferences and served as a referee for numerous journals and scientific organisations. He has more than 450 publications, including 28 books and special editions of journals.

Dr. Miro Kraetzl is a Principal Research Scientist in the Communications Analysis Group of the Defence Science and Technology Organisation (DSTO). He received BSc and MSc degrees in pure mathematics from the University of Zagreb (Croatia) and a PhD in graph theory from Curtin University of Technology in Perth (Western Australia). His current research interests comprise analysis of telecommunication, information and social network dynamics, parallel computer interconnection architectures and information geometry. Miro holds adjunct positions at the School of Applied Mathematics at the University of Adelaide and at the Institute for Telecommunications Research at the University of South Australia. He is a member of AFCEA.

Peter Dickinson is a Senior Engineer in the Communications Analysis Group of the Defence Science and Technology Organisation (DSTO). He received BE in Electronic Engineering from the South Australian Institute of Technology in Adelaide. His research interests comprise telecommunication and computer network analysis. Peter is currently working towards a PhD in performance monitoring of large dynamic intranets with the Telecommunication Networks and Services research group at the Institute for Telecommunications Research (ITR), University of South Australia.