
1 Introduction

The task of applying Data Mining methods [38] to web-based hypertexts is often referred to as Web Mining [16]. In view of the steadily increasing complexity of web data sources and the huge amount of information available online, Web Mining has been an important and fruitful research topic [16, 46]. Generally, Web Mining can be divided into the following categories:

  1. Web Content Mining: Web Content Mining provides methods for automatically extracting information from web-based data sources. Important problems are data extraction and analysis by using, e.g., Text Mining methods [53].

  2. Web Structure Mining: Web Structure Mining deals with exploring structural properties of web-based hypertexts, e.g., investigating internal and external link structures of web-based documents [16] or exploring hypertext structure types using graph-based models [55]. Moreover, there are many earlier contributions rooted in complex network theory [29] that analyze mathematical growth properties of the web graph and web subgraphs by using stochastic models [1, 34, 40, 48, 63]. Often, these methods aim to improve web-search and information extraction algorithms in Web Mining [14, 45].

  3. Web Usage Mining: Web Usage Mining [73] deals with exploring and analyzing patterns derived from web logs in order to analyze the behavior of hypertext users. Such an analysis can be particularly useful to optimize business websites, to analyze their quality and to detect effectiveness features, see, for example [64].

In this chapter, we focus on methods from Web Structure Mining for analyzing graph-based hypertext patterns. To this end, we discuss a graph-theoretic framework for exploring graph-based patterns that represent web-based hypertext structures. Besides modeling document structures as graphs [57], we apply a method for measuring the structural similarity of graphs (see Section 11.4.3) to approach problems in Web Structure Mining, for example:

  1. Computing the cumulative similarity distribution Θ of a web genre [50] corpus containing graph-based documents (see Section 11.5). A possible interpretation of Θ addresses the important question of how structurally different the graph-based documents in the given corpus are.

  2. Structural filtering of web-based units: By measuring the structural similarity of the document structures and then applying clustering techniques, we obtain clusters which contain structurally similar web-based units.

The main contribution of this conceptual chapter is to shed light on the task of automatically analyzing web genre data by using a method for structurally comparing graph-based hypertexts [18, 22, 27]. We use the terms “web genre” and “web genre data” in the sense of Mehler et al. [59], where web genres are considered as hypertext types, see, e.g. [59, 65]. Also, we want to emphasize that we do not use the vector space model [31, 52] to represent a web-based document structure [18, 24]. Instead, we use a special graph class called generalized trees (GTs) [25, 57] for modeling our web-based documents [57].

Mehler does not approach webgenres from the point of view of a bag-of-features model [56]. Rather, this approach conceives instances of webgenres as complex signs that have a characteristic structure due to their membership in a certain genre. This contrasts genre modeling with topic modeling in Information Retrieval [2], where a topic is represented by a set of lexical units that are typically used to manifest that topic. Mehler’s approach is linguistic in the sense that instances of a certain text type are seen to have a characteristic topical structure and a characteristic generic structure. Take the example of a newspaper article in contrast to, say, a personal letter: although in both cases the universe of topics is certainly open, we can nevertheless expect that instances of both types differ with respect to the topical areas they typically deal with. Moreover, the differences between these text types are also manifested in structural terms: the structure of a letter differs significantly from that of most newspaper articles. So why not explore text structure [28], document structure [62] or even layout structure [75] to gain insights into the webgenre (or hypertext type) of a webpage or of a website?

Interestingly, many webgenre models overlook this structural source of the characteristics of webgenres. Consequently, they tend to rely on some extension or simply on some application of the bag-of-features or vector space model. However, such an approach disregards a central characteristic of web units as instances of webgenres, that is, their hyperlink structure, which is genuinely web-based. From this point of view, a website is seen to be identifiable as an instance of a webgenre by means of its hypertextual structure, beyond its textual structure. Mehler et al. [59] have shown that, because of many aspects of informational uncertainty, this hypertextual structure is (by analogy to its textual counterpart) not immediately accessible: neither can we simply read out this structure from HTML tags or URLs, nor is it manifested by hyperlinks only. Rather, this hidden hypertext document structure first needs to be explored, as is done with its counterpart in the form of document structure [62].

In this chapter, we propose a structural approach to webgenres and webgenre classification that builds upon a webgenre-related hypertext structure model. More specifically, we utilize a certain graph model (in the form of generalized trees) that has been found to be the structural kernel of many complex linguistic aggregates [54]. Our task is to add a computational model that deals with this class of graphs as a model of webgenre structure. In this sense, we propose an algorithmic model that integrates a recent structural model of linguistic units, exemplified by webgenres, with their computational processing.

The graph similarity-based approach we want to discuss in this chapter operates on generalized trees representing hierarchical and directed graphs. Note that generalized trees are more general than ordinary rooted trees because a generalized tree contains an ordinary rooted tree as a special case. For practical applications, this implies that a generalized tree captures more structural information of the underlying document structure than a usual DOM-tree [15] represented by a directed rooted tree. The classical DOM-tree model has also been applied for measuring the structural similarity of underlying hypertext structures by [13, 42].

The chapter is organized as follows: Section 11.2 presents some mathematical preliminaries. In Section 11.3, we briefly discuss the problem of deriving structural properties of graphs to characterize them structurally. Besides outlining existing methods for measuring the similarity of web-based document structures in Section 11.4, this section also discusses a graph similarity-method that operates on generalized trees. In Section 11.5, we outline resulting applications in Web Structure Mining and Web Usage Mining. The chapter finishes with a short summary in Section 11.6.

2 Mathematical Preliminaries

First, we introduce some mathematical preliminaries [25, 37, 39].

Definition 1

\(G=(V, E), |V| < \infty, E \subseteq {V \choose 2}\) is called a finite undirected graph. \(G=(V, E), |V| < \infty\), \(E \subseteq {V \times V}\) represents a finite directed graph.

Definition 2

Let \(G=(V,E)\) be a graph. \(\tilde{G}=(\tilde{V},\tilde{E})\) is called a subgraph iff \(\tilde{V} \subseteq V\) and \(\tilde{E} \subseteq E\). Moreover, if it holds \(\tilde{E}=E \cap (\tilde{V} \times \tilde{V})\), then we call \(\tilde{G}\) the induced subgraph of G.

Definition 3

An isomorphism class denotes the set of graphs which are isomorphic to a given graph G.

Definition 4

A tree is a connected, acyclic undirected graph. A tree \(T=(V,E)\) with a distinguished vertex \(r \in V\) is a rooted tree. r is called the root of the tree. The level of a vertex v in a rooted tree T equals the length of the path from r to v. The maximum path length d from the root r to any vertex in the tree is called the depth of T. A leaf is a vertex incident to exactly one edge in a tree.

Definition 5

Let \(G = (V,E)\) be a finite, directed graph. Then, we define the following sets and quantities:

$$\begin{array}{lll}{\mathcal{N}}^{+}(v) & = & \{w \in V\backslash \{v\} \,|\, (v, w) \in E \}, \\ {\mathcal{N}}^{-}(v) & = & \{w \in V\backslash \{v\} \,|\, (w, v) \in E \}, \\ {\delta_{\mathrm{out}}}(v) & = & |{\mathcal{N}}^{+}(v)|, \\ {\delta_{\mathrm{in}}}(v) & = & |{\mathcal{N}}^{-}(v)|. \end{array}$$

We call \({\delta_{\mathrm{out}}}(v)\) and \({\delta_{\mathrm{in}}}(v)\) out-degree and in-degree of \(v\in V\), respectively.
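The quantities of Definition 5 can be illustrated by a minimal Python sketch. It assumes that a directed graph is given as a set of ordered vertex pairs; the function names `out_neighbors`, `in_neighbors`, `out_degree` and `in_degree` are hypothetical and chosen for illustration only.

```python
def out_neighbors(edges, v):
    """N^+(v): all w != v with (v, w) in E."""
    return {w for (u, w) in edges if u == v and w != v}


def in_neighbors(edges, v):
    """N^-(v): all w != v with (w, v) in E."""
    return {u for (u, w) in edges if w == v and u != v}


def out_degree(edges, v):
    """delta_out(v) = |N^+(v)|."""
    return len(out_neighbors(edges, v))


def in_degree(edges, v):
    """delta_in(v) = |N^-(v)|."""
    return len(in_neighbors(edges, v))


# Small example graph; the self-loop ("b", "b") is ignored because
# Definition 5 excludes v itself from both neighborhoods.
edges = {("a", "b"), ("a", "c"), ("c", "b"), ("b", "b")}
print(out_degree(edges, "a"), in_degree(edges, "b"))
```

Scanning the whole edge set for every query keeps the sketch close to the set notation of the definition; an adjacency-list representation would be preferable for larger graphs.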

Fig. 11.1 A generalized tree with its edge types

Definition 6

A directed acyclic graph T is called a directed rooted tree if there is a unique vertex r satisfying \({\delta_{\mathrm{in}}}(r)=0\) from which any other vertex of T is reachable by a unique path.

Definition 7

Let \(T=(V, E_1)\) be a directed rooted tree. The vertex set is defined by

$$V:=\{ v_{0,1}, v_{1,1}, v_{1,2}, \ldots, v_{1,|V_1|}, v_{2,1}, v_{2,2}, \ldots, v_{2,|V_2|}, \ldots, v_{d,1}, v_{d,2}, \ldots, v_{d,|V_d|}\},$$
(11.1)

and we assume \(|V| < \infty\). \(|L|\) denotes the cardinality of the level set \(L=\{l_0,l_1,\ldots,l_d\}\). The surjective mapping \(\mathcal{L}: V \longrightarrow L\) is called a multi level function that assigns to every vertex an element of the level set L. It holds \(d =|L|-1\). \(v_{i,j}\) denotes the j-th vertex on the i-th level, \(0 \leq i \leq d, 1 \leq j \leq |V_i|\). \(|V_i|\) denotes the number of vertices on level i. The edge set \(E_{GT}:= E_1 \cup E_2 \cup E_3 \cup E_4\) of a finite generalized tree \(H=(V,E_{GT})\) is defined as [57]:

  • \(E_1\) forms the edge set of the underlying directed rooted tree T. These edges are called Kernel-edges.

  • \(E_2\): Up-edges associate vertices of the tree hierarchy with one of their (dominating) predecessor vertices in terms of that tree hierarchy.

  • \(E_3\): Down-edges associate vertices of the tree hierarchy with one of their (dominated) successor vertices in terms of that tree hierarchy.

  • \(E_4\): Across-edges associate vertices of the tree hierarchy, none of which is an (immediate) predecessor of the other in terms of the tree hierarchy.

Figure 11.1 shows an example of a generalized tree.
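The four edge types of Definition 7 can be illustrated by the following Python sketch. It is only one possible reading of the definition: the underlying directed rooted tree is assumed to be given as a child-to-parent dictionary `parent`, an edge whose target is a proper ancestor of its source is treated as an up-edge, an edge whose target is a non-child descendant is treated as a down-edge, and every other edge between distinct vertices is treated as an across-edge. The names `ancestors` and `classify_edge` are hypothetical.

```python
def ancestors(parent, v):
    """All proper ancestors of v in the kernel tree (child -> parent map)."""
    result = set()
    while v in parent:
        v = parent[v]
        result.add(v)
    return result


def classify_edge(parent, u, v):
    """Classify the directed edge (u, v), u != v, relative to the kernel tree."""
    if parent.get(v) == u:
        return "kernel"   # edge of the underlying directed rooted tree
    if v in ancestors(parent, u):
        return "up"       # points to a dominating predecessor
    if u in ancestors(parent, v):
        return "down"     # points to a dominated, non-child successor
    return "across"       # neither vertex dominates the other


# Kernel tree: r -> a, r -> b, a -> c
parent = {"a": "r", "b": "r", "c": "a"}
for edge in [("r", "a"), ("c", "r"), ("r", "c"), ("c", "b")]:
    print(edge, classify_edge(parent, *edge))
```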

Definition 8

We define some metrical properties of graphs. \(d(u,v)\) denotes the distance between \(u\in V\) and \(v\in V\), i.e., the minimum length of a path between u and v. Note that \(d(u,v)\) is an integer metric. We call the quantity \(\sigma(v)= \max_{u\in V}d(u,v)\) the eccentricity of \(v \in V\). \(\rho(G)= \max_{v\in V}\sigma(v)\) and \(r(G)= \min_{v\in V}\sigma(v)\) are called the diameter and radius of G, respectively.
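As an illustration of Definition 8, the following Python sketch computes eccentricity, diameter and radius by breadth-first search. It assumes a connected undirected graph stored as an adjacency dictionary; the function names are hypothetical.

```python
from collections import deque


def bfs_distances(adj, source):
    """Shortest-path distances d(source, v) for all v reachable from source."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    return dist


def eccentricity(adj, v):
    """sigma(v) = max_u d(u, v); assumes the graph is connected."""
    return max(bfs_distances(adj, v).values())


def diameter(adj):
    """rho(G) = max_v sigma(v)."""
    return max(eccentricity(adj, v) for v in adj)


def radius(adj):
    """r(G) = min_v sigma(v)."""
    return min(eccentricity(adj, v) for v in adj)


# Path graph a - b - c - d
path = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"]}
print(eccentricity(path, "a"), diameter(path), radius(path))
```

For the path on four vertices, the end vertices have eccentricity 3 and the inner vertices have eccentricity 2, so the diameter is 3 and the radius is 2.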

3 Structural Graph Measures

Graphs can be considered as powerful and generic models to describe complex relational objects which appear in a large number of scientific areas, e.g., computer science, chemistry, sociology, cognitive sciences and biology [17, 33, 76]. Apart from using graphs for modeling real world problems, an important problem is to quantify structural information by inferring structural properties of a given graph. This problem addresses the task of characterizing graphs based on graph measures. To give a short overview of such structural network measures, we list the following:

  1. Degree distributions \(P(i)\), e.g., see [29].

  2. Exponent of degree distributions, i.e., it holds \(P(i) \sim i^{-\gamma}\), e.g., see [29].

  3. Total number of vertices \(|V|\) and edges \(|E|\).

  4. Distance matrix \((d(v_i,v_j))_{v_i,v_j \in V}\).

  5. Metrical properties of graphs, e.g., \(\sigma(v)\), \(\rho(G)\) and \(r(G)\), e.g., see [70].

  6. Clustering coefficient, modularity and network motifs, e.g., see [3, 8].

  7. Vertex centrality measures, e.g., see [9, 51, 76].

  8. Eigenvector measures, e.g., see [47, 51].

Another method to characterize graphs is based on quantifying structural information using information-theoretic measures. This problem relates to determining the structural complexity of a graph. Entropic measures to determine the so-called structural information content of a graph have been developed in [6, 7, 19, 20, 30]. A related task is to identify stylistic properties of graphs. For example, a stylistic property can be understood as a characteristic structural feature of a graph that manifests a graph class, e.g., a hierarchy, an undirected edge set, or a directed edge set. To illustrate such features, we consider Fig. 11.2. The depicted graphs from different application domains manifest different styles of graphs. More precisely, graph (A) represents a directed rooted tree that models a DOM-structure. Graph (B) shows a more complex website structure represented by a generalized tree. Graph (C) is a chemical structure represented by an undirected and vertex-labeled graph. A different definition of a style, which aims to compare such styles structurally and leads to a generalization of the classical graph similarity problem, has already been given in [26]. There, a style was defined as a set of graphs with imposed structural properties, and the styles were compared by using a method based on the definition of a median graph [26, 58].

Fig. 11.2 Graph styles from different application domains

4 Graph Similarity Measures for Web Mining

4.1 Classical Similarity and Distance Measures for Graphs

The problem of measuring the similarity (or distance) between structures representing networks occurs in numerous scientific disciplines [5, 13, 22, 68]. Usually, graph similarity measures are based on incorporating structural features of given graphs, e.g., degree sequences, subgraphs, and other metrical properties of graphs [70]. Also, the task of measuring the structural similarity of graphs is often referred to as graph matching [12]. There are two major paradigms for matching graphs structurally which have been intensely discussed in the scientific literature: exact graph matching and inexact graph matching [12].

Exact graph matching is mainly based on the principle of finding a graph or a subgraph of a given graph that exactly matches a graph or subgraph structure of another graph. In other words, one has to determine whether two graphs are isomorphic [39], i.e., structurally equivalent. Classical graph similarity measures belonging to the exact graph matching paradigm are based on determining isomorphic and subgraph isomorphic relations, see, e.g., [43, 71, 72, 77]. A prominent example of a classical graph metric is the well-known Zelinka-distance [77]: the larger the common induced (isomorphic) subgraph of two graphs, the more similar they are. This implies that graphs which have a large common induced subgraph have a small distance and vice versa. It is worth mentioning that Zelinka [77] was the first to introduce such a measure for unlabeled graphs of the same order. The key result is as follows [71, 72, 77].

Theorem 1

Let \(H=(V_{H},E_{H})\) and \(G=(V_{G},E_{G})\) be unlabeled graphs without reflexive and multiple edges such that \(|V_H|=|V_G|=n\). \({\overline{SUB}}_m(H)\) denotes the set of induced subgraphs of H of order m. For \(\tilde{H} \in {\overline{SUB}}_m(H)\), \(\tilde{H}^{\star}\) denotes the isomorphism class in which \(\tilde{H}\) lies, and let

$$SUB_m(H) :=\{ \tilde{H}^{\star} \,|\, \tilde{H} \in {\overline{SUB}}_m(H) \}.$$
(11.2)

\(SUB_m(H)\) is just the set of isomorphism classes in which the induced subgraphs of H with order m lie. Then,

$$d_{Z}(H, G ):= n - SIM(H,G),$$
(11.3)

is a graph metric, where

$$SIM(H,G ):= \max\{m| SUB_m(H) \cap SUB_m(G ) \not= \emptyset \}.$$
(11.4)
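To make the quantities of Theorem 1 concrete, the following Python sketch evaluates \(SIM(H,G)\) and \(d_Z(H,G)\) by brute force. It is a didactic sketch only: isomorphism classes are represented by a canonical edge list obtained by trying all vertex relabelings, so the running time is exponential and the code is meant for very small graphs. The function names are hypothetical, and graphs are assumed to be undirected, unlabeled and of equal order.

```python
from itertools import combinations, permutations


def canonical_form(vertices, edges):
    """Smallest sorted edge list over all relabelings; equal for isomorphic graphs."""
    vertices = list(vertices)
    best = None
    for perm in permutations(range(len(vertices))):
        relabel = dict(zip(vertices, perm))
        form = tuple(sorted(tuple(sorted((relabel[u], relabel[v])))
                            for (u, v) in edges))
        if best is None or form < best:
            best = form
    return best


def sub_classes(vertices, edges, m):
    """SUB_m: isomorphism classes of all induced subgraphs of order m."""
    classes = set()
    for subset in combinations(vertices, m):
        chosen = set(subset)
        induced = [(u, v) for (u, v) in edges if u in chosen and v in chosen]
        classes.add(canonical_form(subset, induced))
    return classes


def zelinka_distance(vh, eh, vg, eg):
    """d_Z(H, G) = n - SIM(H, G) for two graphs of equal order n."""
    n = len(vh)
    if n != len(vg):
        raise ValueError("both graphs must have the same order")
    for m in range(n, 0, -1):
        if sub_classes(vh, eh, m) & sub_classes(vg, eg, m):
            return n - m
    return n


v = [0, 1, 2]
path3 = [(0, 1), (1, 2)]
triangle = [(0, 1), (1, 2), (0, 2)]
print(zelinka_distance(v, path3, v, triangle))
```

For the path and the triangle on three vertices, the largest common induced subgraph is a single edge, so \(SIM = 2\) and the distance is 1.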

A more general version of this theorem was introduced by Sobik [71, 72]. The following assertion states that the measure \(d_{S}(H,G)\) for determining the structural similarity of arbitrary and also labeled graphs represents a graph metric.

Theorem 2

Let \(H:=(V,E, f_V, f_E, A_V, A_E)\) be a finite and labeled graph. \(A_V, A_E\) denote finite, non-empty vertex and edge alphabets and \(f_V: V \rightarrow A_V\), \( f_E: E \rightarrow A_E\) the associated vertex and edge labeling functions. Now, let H and G be finite, labeled graphs of arbitrary orders, respectively. Then,

$$d_{S}(H,G):= \max\{|H|, |G|\} - SIM(H,G)$$
(11.5)

is a graph metric.

Now, we want to briefly discuss inexact graph matching. The most prominent measure from inexact graph matching is the so-called graph edit distance (GED) developed by Bunke [10]. It can be considered as a powerful extension of the Levenshtein-distance [49]. GED is mainly based on the idea to define graph edit operations such as insertion or deletion of an edge/vertex or relabeling of a vertex along with costs associated with each such operation [10]. Moreover, Bunke [10] calls an optimal inexact match a sequence of edit operations which transforms a graph G into H by producing minimal transformation costs. If \(m_1, m_2, \ldots, m_n\) are assumed to be all possible transformations mapping G to H, then the optimal inexact match [10] \(m'\) is defined by

$$c(m')=\min\{c(m_i)|\, 1 \leq i \leq n\}.$$
(11.6)

Finally, the graph edit distance between two graphs is the minimum cost associated with a sequence of edit operations. Further, the optimal error-correcting graph isomorphism is defined as the resulting isomorphism after obtaining this optimal sequence of edit operations [10]. The original result of Bunke [10] can be now expressed as follows.

Theorem 3

Let \(d(H,G)\) be the costs for determining the optimal inexact match between H and G. Then, \(d(H,G)\) is a graph metric.

Many other graph similarity or distance measures and methods can be found in, e.g. [4, 17, 44, 60, 67, 71, 72].

4.2 Graph Similarity Measures Based on Trees

In this section, we outline graph similarity measures applied to web-based document structures. The following graph similarity measures have been applied to DOM-trees [13]:

  1. Similarity measures which are based on tree edit measures, e.g., see [41, 69, 74].

  2. Similarity measures based on the frequency of tag labels, e.g., see [13].

  3. Similarity measures based on Fourier transformation, e.g., see [32].

  4. Similarity measures based on path similarity, e.g., see [42].

A major problem of these measures is that they only operate on ordinary rooted trees, which do not properly capture the structural information represented by a complex hyperlink structure associated with a graph-based document. In particular, the measures based on tag frequencies, see, e.g., [13], are of limited interpretability because a rearrangement of the tag order does not necessarily imply a change in the value of the corresponding similarity measure. Moreover, the sketched measures do not provide the option to emphasize certain structural properties when measuring the structural similarity of graphs because the measures are non-parameterized. In contrast, parameterized similarity measures would make it possible to learn the parameters by using appropriate data sets. In Section 11.4.3, we present the definition of such a parameterized measure for determining the structural similarity of generalized trees. An in-depth treatment of graph similarity measures can be found in [11, 12, 18, 22].

4.3 Structural Similarity of Generalized Trees

This section recalls the construction principle of a method for measuring the structural similarity of generalized trees, see, e.g. [18, 22, 27]. The main construction steps can be stated as follows [18, 22, 27]:

  • We start with two generalized trees, \(H^{1}\) and \(H^{2}\).

  • Derive their formal string representations and transform them into linear integer strings which are called property strings.

  • Perform string alignments of the derived property strings by using a dynamic programming (DP) algorithm. From each such alignment (on each level i), a similarity score will be obtained.

  • By accumulating the derived similarity scores, a final graph similarity measure can be obtained. Hence, the problem of comparing two generalized trees structurally is equivalent to determining optimal property string alignments.

These key steps are also visualized in Fig. 11.3. We begin recalling the construction by stating some definitions [18, 21, 22].

Definition 9

Let X be a set. A positive function \(s: X \times X \longrightarrow [0,1]\) is called similarity measure if

  • \(s(x,y) >0 \forall\, x,y \in X \).

  • \(s(x,y)= s(y,x) \forall\, x,y \in X \).

  • \(s(x,y) \leq s(x,x)=1 \forall\, x,y \in X \).

Definition 10

Let X be a set. A positive function \(\omega: X \times X \longrightarrow [0,1]\) is called distance measure if

  • \(\omega(x,y) \geq 0 \forall\, x,y \in X \).

  • \(\omega(x,y)= \omega(y,x) \forall\, x,y \in X \).

  • \(\omega(x,x)= 0 \forall\, x \in X \).

Definition 11

Let H be a generalized tree. We call the set

$$S^{H}:= \left\{v^{H}_{0,1},\ v^{H}_{1,1} \circ v^{H}_{1,2} \circ \cdots \circ v^{H}_{1,|V_1|},\ \ldots,\ v^{H}_{d,1} \circ v^{H}_{d,2} \circ \cdots \circ v^{H}_{d,|V_{d}|} \right\},$$
(11.7)

the formal string representation of H. The symbol \(\circ\) denotes usual string concatenation.

Fig. 11.3 Key steps to infer a graph similarity measure for generalized trees

Definition 12

Let H be a generalized tree. We call

$$\begin{aligned}S^{H}_{\mathrm{out}}:= \big\{&\delta_{\mathrm{out}}\left(v^{H}_{0,1}\right),\ \delta_{\mathrm{out}}\left(v^{H}_{1,1}\right) \circ \delta_{\mathrm{out}}\left(v^{H}_{1,2}\right) \circ \cdots \circ \delta_{\mathrm{out}}\left(v^{H}_{1,|V_1|}\right),\ \ldots,\\ &\delta_{\mathrm{out}}\left(v^{H}_{d,1}\right) \circ \delta_{\mathrm{out}}\left(v^{H}_{d,2}\right) \circ \cdots \circ \delta_{\mathrm{out}}\left(v^{H}_{d,|V_{d}|}\right) \big\},\end{aligned}$$
(11.8)

the set of out-degree property strings and

$$\begin{aligned}S^{H}_{\mathrm{in}}:= \big\{&\delta_{\mathrm{in}}\left(v^{H}_{0,1}\right),\ \delta_{\mathrm{in}}\left(v^{H}_{1,1}\right) \circ \delta_{\mathrm{in}}\left(v^{H}_{1,2}\right) \circ \cdots \circ \delta_{\mathrm{in}}\left(v^{H}_{1,|V_1|}\right),\ \ldots,\\ &\delta_{\mathrm{in}}\left(v^{H}_{d,1}\right) \circ \delta_{\mathrm{in}}\left(v^{H}_{d,2}\right) \circ \cdots \circ \delta_{\mathrm{in}}\left(v^{H}_{d,|V_{d}|}\right) \big\},\end{aligned}$$
(11.9)

the set of in-degree property strings of H.
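The property strings of Definition 12 can be derived as in the following Python sketch. It assumes that a generalized tree is given by its vertices grouped per level (in the order \(v_{i,1}, v_{i,2}, \ldots\)) together with the full edge set \(E_{GT}\) as ordered pairs; the function name `property_strings` and this input format are assumptions made for illustration.

```python
def property_strings(levels, edges):
    """Return (S_out, S_in): one list of degrees per level of the generalized tree."""
    out_deg = {v: 0 for level in levels for v in level}
    in_deg = dict(out_deg)
    for (u, w) in set(edges):   # a set, so parallel duplicates count once
        if u != w:              # neighborhoods exclude the vertex itself
            out_deg[u] += 1
            in_deg[w] += 1
    s_out = [[out_deg[v] for v in level] for level in levels]
    s_in = [[in_deg[v] for v in level] for level in levels]
    return s_out, s_in


# Kernel edges r->a, r->b, a->c; one up-edge c->r; one across-edge a->b
levels = [["r"], ["a", "b"], ["c"]]
edges = [("r", "a"), ("r", "b"), ("a", "c"), ("c", "r"), ("a", "b")]
s_out, s_in = property_strings(levels, edges)
print(s_out)
print(s_in)
```

Each inner list corresponds to one element of \(S^{H}_{\mathrm{out}}\) or \(S^{H}_{\mathrm{in}}\), i.e., to the concatenated degrees of one level.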

Define \(r^{H^{k}}_k:= v^{H^{k}}_{0,1}, k \in \{1,2\}\). Let \(H^{1}\) be a given GT, where \(v^{H^{1}}_{i,j},\,0\leq i\leq d_1,\, 1 \leq j \leq \sigma_{i}\), denotes the j-th vertex on the i-th level of \(H^{1}\). The analogous notation holds for \(v^{H^{2}}_{i,j} \in H^{2}\). As mentioned above, the task of measuring the structural similarity between \(H^{1}\) and \(H^{2}\) is equivalent to determining the optimal alignment of

$$\begin{array}{lll}S_{1} &=& v^{H^{1}}_{0,1} \circ v^{H^{1}}_{1,1} \circ v^{H^{1}}_{1,2} \circ \cdots \circ v^{H^{1}}_{d_1,\sigma_{d_1}},\\ S_{2} &=& v^{H^{2}}_{0,1} \circ v^{H^{2}}_{1,1} \circ v^{H^{2}}_{1,2} \circ \cdots \circ v^{H^{2}}_{d_2,\sigma_{d_2}}, \end{array}$$

with respect to their associated property strings and to a cost function α. \(S_{k}[i]\) denotes the i-th position of the sequence \(S_k\), and it holds \(S_{1}[n]=v^{H^{1}}_{d_1,\sigma_{d_1}}, S_{2}[m]=v^{H^{2}}_{d_2,\sigma_{d_2}},\, \mathbb{N} \ni n,m \geq 1,\, S_{k}[1]=r^{H^{k}}_k,\, k\in \{1,2\}\). The algorithm for finding the optimal alignment of \(S_1\) and \(S_2\) generates a matrix \((\mathcal{M}(i,j))_{ij}, \, 0 \leq i\leq n, \, 0 \leq j\leq m\). Its time complexity is \(O(|\hat{V}_1|\cdot |\hat{V}_2|)\), see [18, 23]. To determine an optimal alignment of the derived property strings, we state the following algorithm [18, 23]:

$$\begin{array}{lll}\mathcal{M}(0,0) &:=& 0,\\ \mathcal{M}(i,0) &:=& \mathcal{M}(i-1,0) + {\alpha(S_1[i],-)} \ : \ 1 \leq i \leq n, \\ \mathcal{M}(0,j) &:=& \mathcal{M}(0,j-1)+ {\alpha(-, S_2[j])} \ : \ 1 \leq j \leq m, \end{array}$$
(11.10)

and

$$\mathcal{M}(i,j):= \min \begin{cases} \mathcal{M}(i-1,j) + {\alpha(S_1[i],-)} \\ \mathcal{M}(i,j-1)+ {\alpha(-,S_2[j])} \\ \mathcal{M}(i-1,j-1) + {\alpha(S_1[i],S_2[j])} \end{cases}$$

for \(1 \leq i\leq n,\, 1 \leq j \leq m\). Here, the derived property strings will be aligned on two levels: globally and locally. To evaluate the alignments, we need the preliminary assertion as follows.

Lemma 1

Let \({\omega}(x,y):= 1-e^{-\frac{1}{2}\frac{(x-y)^2}{\sigma^2}}.\) \(\omega:\mathbb{R}\times \mathbb{R}\longrightarrow [0,1]\) is a distance measure.

Proof

From the definition of \({\omega}(x,y)\) we infer \({\omega}(x,y) \in [0,1], \,\, \forall\, x,y \in \mathbb{R} \) and \({\omega}(x,x)=1-1=0, \,\, \forall\, x \in \mathbb{R}\). Since \((x-y)^2\) = \((y-x)^2, \,\, \forall\, x,y \in \mathbb{R} \), the symmetry condition holds.
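The recurrence for \(\mathcal{M}(i,j)\) stated before Lemma 1 can be sketched in Python as follows. The sketch is generic in the cost function: `alpha` is any callable taking two symbols, where the hypothetical marker `GAP` stands for the gap symbol “–”. For a quick plausibility check, it is run here with a simple unit-cost function (an assumption for demonstration only, not the cost function of the chapter), in which case the final matrix entry equals the classical Levenshtein distance.

```python
GAP = None  # hypothetical marker for the gap symbol "-"


def alignment_matrix(s1, s2, alpha):
    """Fill M(i, j), 0 <= i <= n, 0 <= j <= m, by dynamic programming."""
    n, m = len(s1), len(s2)
    M = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):                  # first column: only gaps in s2
        M[i][0] = M[i - 1][0] + alpha(s1[i - 1], GAP)
    for j in range(1, m + 1):                  # first row: only gaps in s1
        M[0][j] = M[0][j - 1] + alpha(GAP, s2[j - 1])
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            M[i][j] = min(
                M[i - 1][j] + alpha(s1[i - 1], GAP),
                M[i][j - 1] + alpha(GAP, s2[j - 1]),
                M[i - 1][j - 1] + alpha(s1[i - 1], s2[j - 1]),
            )
    return M


def unit_cost(x, y):
    """Demonstration cost: 1 for a gap or a mismatch, 0 for a match."""
    if x is GAP or y is GAP:
        return 1.0
    return 0.0 if x == y else 1.0


M = alignment_matrix("kitten", "sitting", unit_cost)
print(M[-1][-1])
```

The two nested loops visit every cell once, which matches the quadratic time bound quoted above; the strings are indexed from 0 in Python, whereas \(S_k[i]\) is indexed from 1.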

Now, we define

$${\alpha}^{\mathrm{out}}\left(v^{H^{1}}_{i_1,j_1},v^{H^{2}}_{i_2,j_2}\right):= \left\{ \begin{array}{r@{\quad:\quad}l} {\omega}^{\mathrm{out}} \left(\delta_{\mathrm{out}}\left(v^{H^{1}}_{i_1,j_1}\right),\delta_{\mathrm{out}}\left(v^{H^{2}}_{i_2,j_2}\right),\sigma^{1}_{\mathrm{out}}\right) & i_1=i_2 \\ +\infty & {\mathrm{else}}\, , \end{array} \right.$$

\(0 \leq i_k \leq d_k,\, 1 \leq j_k \leq \sigma_{i_k},\, k\in \{1,2\}\), where \({\omega}^{\mathrm{out}}(x,y, \sigma^{k}_{\mathrm{out}}):= 1-e^{\!-\frac{1}{2}(x-y)^2/(\sigma^{k}_{\mathrm{out}})^2}, x,y,\sigma^{k}_{\mathrm{out}} \in \mathbb{R},\) and

$$\begin{array}{lll}{\alpha}^{\mathrm{out}}\left(v^{H^{1}}_{i,j_1},- \right)&:=& {\omega}^{\mathrm{out}}\left(\delta_{\mathrm{out}}\left(v^{H^{1}}_{i,j_1}\right),\xi,\sigma^{2}_{\mathrm{out}}\right),\\ {\alpha}^{\mathrm{out}}\left(-,v^{H^{2}}_{i,j_2} \right)&:=& {\omega}^{\mathrm{out}}\left(\xi,\delta_{\mathrm{out}}\left(v^{H^{2}}_{i,j_2}\right),\sigma^{2}_{\mathrm{out}}\right). \end{array}$$

The constant \(\xi>0\) prevents an alignment between a leaf and a gap (“–”) from being evaluated as favorably as an alignment between two leaves [22]. By \({\omega}^{\mathrm{in}}\left(x,y, \sigma^{k}_{\mathrm{in}}\right):= 1-e^{\!{-\frac{1}{2}(x-y)^2/\left(\sigma^{k}_{\mathrm{in}}\right)^2}}\), we define analogously \({\alpha}^{\mathrm{in}}\left(v^{H^{1}}_{i_1,j_1},v^{H^{2}}_{i_2,j_2}\right)\), \({\alpha}^{\mathrm{in}}\left(v^{H^{1}}_{i,j_1},- \right)\) and \({\alpha}^{\mathrm{in}}\left(-,v^{H^{2}}_{i,j_2} \right)\).
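The cost functions defined above can be sketched in Python as follows. The sketch makes two assumptions for illustration: a vertex is represented as a pair `(level, out_degree)`, and a gap is represented by `None`. The names `omega`, `alpha_out`, `sigma1`, `sigma2` and `xi` mirror the symbols \(\omega^{\mathrm{out}}\), \(\alpha^{\mathrm{out}}\), \(\sigma^{1}_{\mathrm{out}}\), \(\sigma^{2}_{\mathrm{out}}\) and \(\xi\), but the function signatures themselves are hypothetical.

```python
import math


def omega(x, y, sigma):
    """omega(x, y, sigma) = 1 - exp(-(x - y)^2 / (2 sigma^2)), a value in [0, 1)."""
    return 1.0 - math.exp(-0.5 * (x - y) ** 2 / sigma ** 2)


def alpha_out(v1, v2, sigma1, sigma2, xi):
    """Cost of aligning two vertices, or a vertex with a gap (None)."""
    if v1 is None:                       # gap aligned with a vertex of H^2
        return omega(xi, v2[1], sigma2)
    if v2 is None:                       # vertex of H^1 aligned with a gap
        return omega(v1[1], xi, sigma2)
    if v1[0] != v2[0]:                   # vertices on different levels
        return math.inf
    return omega(v1[1], v2[1], sigma1)   # same level: compare out-degrees


leaf = (2, 0)  # a leaf on level 2 with out-degree 0
print(alpha_out(leaf, leaf, 1.0, 1.0, 1.0))   # leaf vs. leaf
print(alpha_out(leaf, None, 1.0, 1.0, 1.0))   # leaf vs. gap
```

With \(\xi = 1\), aligning two leaves costs 0, whereas aligning a leaf with a gap has a strictly positive cost, which illustrates the role of \(\xi\) described above.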

To evaluate the alignments of the property strings locally (i.e., on each generalized tree level), we define the mapping [18, 22]

$${\mathrm{align}}\left(v^{H^{1}}_{i,j_1}\right) := \left\{ \begin{array}{r@{\quad:\quad}l} v^{H^{2}}_{i,j_2} & {{\mathrm{align}}}^{-1}\left(v^{H^{2}}_{i,j_2}\right)= v^{H^{1}}_{i,j_1} \\ - & {\mathrm{else}}. \end{array} \right.$$

For \(v^{H^{1}}_{i,j_1}\), the mapping determines the vertex \(v^{H^{2}}_{i,j_2}\) during the trace-back [18]. Moreover, we define the functions

$$\begin{array}{lll}{\gamma}^{\mathrm{out}}_{{H^{k}}}(i)&:=&\frac{\sum_{j=1}^{\sigma^{k}_{i}}{{\hat{\alpha}}^{\mathrm{out}}\left(v^{H^{k}}_{i,j}, {\mathrm{align}}\left(v^{H^{k}}_{i,j}\right)\right) }}{\sigma^{k}_{i}},\\ {\gamma}^{\mathrm{in}}_{{H^{k}}}(i)&:=&\frac{\sum_{j=1}^{\sigma^{k}_{i}}{{\hat{\alpha}}^{\mathrm{in}}\left(v^{H^{k}}_{i,j}, {\mathrm{align}}\left(v^{H^{k}}_{i,j}\right)\right) }}{\sigma^{k}_{i}}, \end{array}$$

\(k\in \{1,2\}\), which provide similarity values of the alignments of out-degree and in-degree property strings. Finally, by analogously defining the functions \({\hat{\alpha}}^{\mathrm{out}}\) and \({\hat{\alpha}}^{\mathrm{in}}\), we obtain the normalized and cumulative functions

$$\begin{array}{lll}{\gamma}^{\mathrm{out}}\left(i,{\hat{\sigma}}^{1}_{\mathrm{out}}, {\hat{\sigma}}^{2}_{\mathrm{out}}\right) &:=& 1- \frac{1}{ \sigma^{1}_{i} + \sigma^{2}_{i}} \cdot \left\{ \sum_{j=1}^{\sigma^{1}_{i}}{ {\hat{\alpha}}^{\mathrm{out}}\left(v^{H^{1}}_{i,j}, {\mathrm{align}}\left(v^{H^{1}}_{i,j}\right)\right)}\right\} \\ && - \frac{1}{ \sigma^{1}_{i} + \sigma^{2}_{i}} \cdot \left\{\sum_{j=1}^{\sigma^{2}_{i}}{{\hat{\alpha}}^{\mathrm{out}}\left(v^{H^{2}}_{i,j}, {\mathrm{align}}\left(v^{H^{2}}_{i,j}\right)\right)}\right\},\end{array}$$
(11.10)

and

$$\begin{array}{lll}{\gamma}^{\mathrm{in}}\left(i,{\hat{\sigma}}^{1}_{\mathrm{in}}, {\hat{\sigma}}^{2}_{\mathrm{in}}\right) &:=& 1- \frac{1}{ \sigma^{1}_{i} + \sigma^{2}_{i}} \cdot \left\{ \sum_{j=1}^{\sigma^{1}_{i}}{ {\hat{\alpha}}^{\mathrm{in}}\left(v^{H^{1}}_{i,j}, {\mathrm{align}}\left(v^{H^{1}}_{i,j}\right)\right)}\right\} \\ && - \frac{1}{ \sigma^{1}_{i} + \sigma^{2}_{i}} \cdot \left\{\sum_{j=1}^{\sigma^{2}_{i}}{{\hat{\alpha}}^{\mathrm{in}}\left(v^{H^{2}}_{i,j}, {\mathrm{align}}\left(v^{H^{2}}_{i,j}\right)\right)}\right\},\end{array}$$
(11.11)

which measure the similarity of an out-degree and in-degree alignment on a level i. \({\hat{\sigma}}^{1}_{\mathrm{out}}, {\hat{\sigma}}^{2}_{\mathrm{out}}\) and \({\hat{\sigma}}^{1}_{\mathrm{in}}, {\hat{\sigma}}^{2}_{\mathrm{in}}\) are the parameters of \({\hat{\alpha}}^{\mathrm{out}}\) and \({\hat{\alpha}}^{\mathrm{in}}\), respectively. By using the defined quantities, it can be proven that the resulting comparative measure is a graph similarity measure (i.e., the measure satisfies the properties of Definition 9) [18, 22].

Theorem 4

Let \(H_{1}, H_{2}\) be two generalized trees, \(\, 0 \leq i \leq \mu,\, \mu:=\max(d_1, d_2)\). Then,

$$s(H_{1}, H_{2}):= \frac{(\mu +1)}{{\sum_{i=0}^{\mu}{{\gamma}^{\mathrm{fin}}\left(i,{\hat{\sigma}}^{1}_{\mathrm{out}}, {\hat{\sigma}}^{2}_{\mathrm{out}}, {\hat{\sigma}}^{1}_{\mathrm{in}}, {\hat{\sigma}}^{2}_{\mathrm{in}}\right)}}}\prod_{i=0}^{\mu}{{\gamma}^{\mathrm{fin}}\left(i,{\hat{\sigma}}^{1}_{\mathrm{out}}, {\hat{\sigma}}^{2}_{\mathrm{out}}, {\hat{\sigma}}^{1}_{\mathrm{in}}, {\hat{\sigma}}^{2}_{\mathrm{in}}\right)},$$
(11.12)

is a graph similarity measure where \({\gamma}^{\mathrm{fin}}\) is defined by

$$\begin{array}{lll}{\gamma}^{\mathrm{fin}}&=&{\gamma}^{\mathrm{fin}}\left(i,{\hat{\sigma}}^{1}_{\mathrm{out}}, {\hat{\sigma}}^{2}_{\mathrm{out}}, {\hat{\sigma}}^{1}_{\mathrm{in}}, {\hat{\sigma}}^{2}_{\mathrm{in}}\right)\\ &:=& \zeta \cdot {\gamma}^{\mathrm{out}} +(1-\zeta) \cdot {\gamma}^{\mathrm{in}}, \zeta\in[0,1]. \end{array}$$
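The final measure of Theorem 4 can be evaluated as in the following Python sketch, once the level-wise values \(\gamma^{\mathrm{out}}(i)\) and \(\gamma^{\mathrm{in}}(i)\), \(0 \leq i \leq \mu\), are available. The function names `gamma_fin` and `similarity` are hypothetical, the level-wise values are simply passed in as lists, and returning 0 when all \(\gamma^{\mathrm{fin}}\) values vanish is an assumption of this sketch that the source does not address.

```python
import math


def gamma_fin(gamma_out, gamma_in, zeta):
    """Level-wise convex combination zeta * gamma_out + (1 - zeta) * gamma_in."""
    if not 0.0 <= zeta <= 1.0:
        raise ValueError("zeta must lie in [0, 1]")
    return [zeta * g_out + (1.0 - zeta) * g_in
            for g_out, g_in in zip(gamma_out, gamma_in)]


def similarity(gamma_out, gamma_in, zeta=0.5):
    """s = (mu + 1) / sum_i gamma_fin(i) * prod_i gamma_fin(i)."""
    fin = gamma_fin(gamma_out, gamma_in, zeta)
    total = sum(fin)
    if total == 0.0:
        return 0.0  # assumption of this sketch: avoid division by zero
    return len(fin) * math.prod(fin) / total


# Perfect alignments on all three levels give the maximal similarity 1.
print(similarity([1.0, 1.0, 1.0], [1.0, 1.0, 1.0]))
# A weaker alignment on the second level lowers the score.
print(similarity([1.0, 0.5], [1.0, 0.5], zeta=0.5))
```

Here `len(fin)` plays the role of \(\mu + 1\). Because of the product, a single poorly aligned level reduces the overall score considerably.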

5 Applications

In the following, we outline existing and future applications of the approach presented in Section 11.4.3. Here, we represent websites by a graph-based model [57] in which each document structure is mapped to a generalized tree.

In [22], a family of graph similarity measures was evaluated based on a corpus containing 500 conference websites from mathematics and computer science created by Mehler et al. [57]. The conference websites were extracted from the web and transformed into generalized trees by using the tool HyGraph [35, 36].

One of the main ideas is to apply a comparative analysis to a corpus consisting of graph-based web units. Now, for automatically analyzing web genre data, we propose the following evaluation steps:

  1. Because the graph similarity measure outlined in Section 11.4.3 is parameterized, one can emphasize structural features of the graphs under consideration when measuring their structural similarity [22, 27]. This can be done by varying the parameters \(\left(\zeta, {\hat{\sigma}}^{1}_{\mathrm{out}}, {\hat{\sigma}}^{2}_{\mathrm{out}}, {\hat{\sigma}}^{1}_{\mathrm{in}}, {\hat{\sigma}}^{2}_{\mathrm{in}}\right)\). For example, in [22, 27] we have shown that by setting ζ equal to 1 or 0, we consider the alignments of either out-degree or in-degree property strings only. Setting \(\zeta=\frac{1}{2}\) means that we weight the out-degree and in-degree property strings equally [22, 27].

  2. We calculate the complete similarity matrix by computing the pairwise similarity scores of the given generalized trees. For this, we use the graph similarity measure presented in Section 11.4.3 with a fixed parameter set [22]. Moreover, we can compute the so-called cumulative similarity distribution Θ, usually depicted as a two-dimensional plot. Θ can be used for expressing the percentage of generalized trees which possess a similarity value less than or equal to \(s\in [0,1]\) and, hence, to answer the question of how structurally different the document structures of a given corpus are [22, 27]. Generally, we consider the study of Θ as a preliminary step for automatically analyzing web genre data that has already led to a better understanding of the problem of comparing web-based hypertexts structurally [18, 22, 27].

  3. Starting from a computed similarity matrix, one can additionally apply multivariate analysis methods, e.g., clustering techniques, to filter web-based documents. By determining such clusters, one identifies websites of similar structure, i.e., these clusters contain structurally similar web pages [18].
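The second evaluation step can be illustrated by a short Python sketch that derives a cumulative similarity distribution from a symmetric similarity matrix. The sketch rests on one reading of Θ, namely the fraction of unordered pairs of generalized trees whose similarity is at most s; the function name and this pairwise interpretation are assumptions made for illustration.

```python
def cumulative_similarity_distribution(sim, thresholds):
    """Fraction of unordered pairs (i < j) with sim[i][j] <= s, for each s."""
    n = len(sim)
    values = [sim[i][j] for i in range(n) for j in range(i + 1, n)]
    if not values:
        raise ValueError("at least two graphs are required")
    return [sum(1 for value in values if value <= s) / len(values)
            for s in thresholds]


# Toy similarity matrix of three generalized trees (values are made up).
sim = [
    [1.0, 0.2, 0.5],
    [0.2, 1.0, 0.9],
    [0.5, 0.9, 1.0],
]
theta = cumulative_similarity_distribution(sim, [0.0, 0.5, 1.0])
print(theta)
```

A curve that rises early indicates that most pairs have low similarity, i.e., that the corpus is structurally heterogeneous; a curve that rises late indicates a structurally homogeneous corpus.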

From the steps just outlined, it should be clear that this approach can also be used for analyzing data sets of hypertext structures inferred from other Web Mining areas. For example, if it were possible to transform weblog data sets into sets of generalized trees, we could apply the approach analogously. This would result in novel applications in Web Usage Mining. In [18, 27] it has been sketched that the focus of such a study would be to analyze the navigation behavior of hypertext users [61, 66]. Generally, navigation patterns can be described by graphs [61, 66]. In our case, we would describe them by generalized trees. Each cluster determined by using the above stated approach then contains generalized trees which reflect a similar navigation behavior of a specific user. As we have already outlined in [18, 27], an interpretation of these clusters could lead to studying psychological features of hypertext users.

6 Conclusion

The main goal of this conceptual chapter was to present an approach for automatically analyzing web genre data representing graphs. Instead of using the well-known vector space model for modeling document structures, we applied a graph-based representation model proposed by Mehler et al. [57]. A notable feature of this model is that the document structures represented by generalized trees capture more structural information than DOM-trees [18, 36, 57]. In Section 11.4.2, we briefly reviewed methods to measure the structural similarity of web-based documents which operate on tree structures only. In contrast to this, in Section 11.4.3 we repeated an approach for measuring the structural similarity of generalized trees. A key feature of this method is that the graphs will be transformed into linear integer strings. By applying a string alignment algorithm, we weighted these alignments and finally derived a graph similarity measure for generalized trees. Hence, we solved a graph similarity problem by transforming it into a string similarity problem. Section 11.5 presented an overview of possible evaluation steps for automatically analyzing web genre data representing graphs. Moreover, existing applications of this approach were discussed.