1 Introduction

The Web is an enormous, growing source of information that is provided by a variety of websites. Most of them are intended for human consumption, and many of them do not provide any means to extract their information automatically, which makes it difficult to feed automated business processes [5, 46]. This has triggered significant interest in automating Web information extraction [11, 52, 56, 62].

Our focus is on supervised rule-based information extractors that work on semi-structured Web documents. They rely on a generic algorithm that executes extraction rules learnt from annotated documents provided by a user. The documents are written in HTML, and the information of interest is buried in formatting tags that specify how to render it in list, tabular, or other such regular formats. The documents must be annotated, which means that a user must have labelled the pieces of text or the nodes to be extracted; a machine learner is then applied to learn a rule that generalises the annotations and is expected to work as accurately as possible on new, unseen documents.

Most existing proposals build on ad hoc machine-learning techniques that were specifically tailored to the problem of extracting Web information [1, 2, 10, 12–14, 32, 34, 38, 44, 49, 50, 53, 55, 58, 61]. This makes it difficult to adapt them as the Web evolves because the features of the documents on which they rely and the techniques used to analyse them are built-in. This, in turn, implies that they cannot easily benefit from the many advances in machine learning. This has made Web information extraction quite an active research field for years.

A few authors have explored representing documents as relational data [8, 9, 20, 24, 35, 59], that is, as collections of vectors that represent the attributive features of the pieces of text or the nodes to be extracted, e.g., HTML tag, rendering coordinates, or ratio of letters, plus a number of relational features that help establish a neighbourhood around each vector, e.g., right token, left sibling, or parent node. They have all devised proposals that drew inspiration from inductive logic programming, that is, they learn first-order rules that constrain the attributive features of the information to be extracted as well as the attributive features of the context that can be reached by means of relational features. These proposals can learn very expressive rules that are very effective [37], but the learning process is usually costly and degrades as the number of examples or features increases [7, 8, 23, 24, 48].

In the general field of machine learning, there are a number of so-called propositio-relational proposals [40] that can deal with relational data using existing propositional techniques, that is, techniques that were originally designed to work on attributive features only. Unfortunately, such proposals have seldom been explored for learning information extraction rules; the only exception is the work by Sleiman and Corchuelo [57], who introduced an approach that combines automata and neural networks.

In this article, we introduce Roller, which is a proposal to learn Web information extraction rules. Our contributions to the field are the following: we have devised a new propositio-relational technique that relies on a search procedure that uses a dynamic flattening technique to explore the context of the nodes that provide the information to be extracted; it needs to be configured with an open catalogue of features, which helps it adapt as the Web evolves, plus a base learner and a rule scorer, which help it leverage the continuous advances in the general field of machine learning. We have conducted an extensive experimental analysis that proves that our proposal outperforms other state-of-the-art proposals regarding effectiveness; regarding efficiency, our results prove that it is comparable to the best ones. The conclusions that we have drawn from our experimental analysis have been confirmed using standard statistical hypothesis tests from the literature.

The rest of the article is organised as follows: Sect. 2 describes the details of our proposal; Sect. 3 reports on how we have configured it so that it can achieve its best results; then, the results of our experimental analysis are presented in Sect. 4; Sect. 5 presents the related work and a detailed comparison with our proposal; Sect. 6 summarises our conclusions. “Appendices 1 and 2” report, respectively, on our experimentation environment and the performance measures that we have used.

2 Description of our proposal

In this section, we describe Roller, which is our proposal to learn rules that can be used to extract information from semi-structured Web documents. Such documents are written in HTML and can then be naturally represented as DOM nodes. For the sake of brevity, we use the terms document and node to refer to the previous concepts, since no ambiguity arises.

Our proposal works on a set of documents and an annotation. The documents provide examples of how the information to extract is encoded and the annotation assigns each of their nodes to a slot that classifies the information that it provides. (There is an implicit null slot to which the nodes that do not provide any information to be extracted are assigned by default.) The documents are assumed to provide information on a given topic and to have regularities that help learn the rule.

The main algorithm first computes a number of attributive and relational features on the input documents. Such features are not intrinsic to our proposal; on the contrary, we assume that the user provides a procedure called \({ \textsc {featureBuilder}}\) to compute them; in other words, our proposal relies on an open catalogue of features that allows it to evolve as the Web evolves. The attributive features are then used to assemble a training set from which a rule is learnt. Neither is the base learner intrinsic to our proposal; on the contrary, any technique in the literature that can work with multi-class problems using both numeric and categoric features can be plugged into our proposal using a user-provided procedure to which we refer as \({ \textsc {baseLearner}}\). The initial rule is then evaluated on the previous training set using a user-defined rule scorer to which we refer as \({ \textsc {ruleScorer}}\). The main algorithm in Roller loops as long as a perfect rule is not found and the current rule can be expanded to a new rule that provides some score gain. The expansion procedure explores the context of every node, that is, the neighbouring nodes according to the available relational features, and then selects the one whose attributive features help learn a better rule.

Formally speaking, the problem that we address can be formulated as follows:

  • Requirements (a) a procedure called \({ \textsc {featureBuilder}}\) to compute a catalogue of features from a set of documents; (b) a procedure called \({ \textsc {baseLearner}}\) to learn a rule from a multi-class training set in which features can be both numeric and categoric; (c) a procedure called \({ \textsc {ruleScorer}}\) that returns a score for a rule based on how well it performs in a test set.

  • Inputs (a) a set of documents D; (b) an annotation A.

  • Assumptions (a) the documents in D have a regular structure; (b) if a node is not included in annotation A, then it is implicitly assumed to belong to the null slot.

  • Problem find a rule r that characterises the information to be extracted using some attributive features of a subset of nodes that are related by means of some relational features; that rule must have the best possible score.

Regarding the requirements, we have performed quite an exhaustive experimentation from which we have drawn the following conclusions: a good catalogue of features must include the following attributive features: the standard W3C HTML features, the standard W3C rendering features, and user-defined features to characterise the contents of the information to be extracted; it must also include the standard W3C DOM features to fetch the neighbours of every node; furthermore, we have found that JRip and Kappa seem to be the best combination of base learner and rule scorer. The previous recommendations can be used as a default to configure Roller. Consult Sect. 3 for further details on our catalogue of features and how to select the best combination of base learner and rule scorer.

In the following subsections, we first present the notation and the core concepts that we use, then introduce the main procedures in our proposal, and, finally, describe some ancillary procedures to deal with training sets and feature vectors.

2.1 Notation and core concepts

We use the standard mathematical notation to represent variables, sets, and logical formulae. There are only a few pieces of notation for which there is not a standard in the literature, namely: given elements \(x_1, x_2, \dots , x_n\), then \(\langle x_1, x_2, \dots , x_n \rangle \) denotes a sequence of them; given two sequences \(s_1\) and \(s_2\), we denote their concatenation as ; we denote the tuples of which a map is composed as ; given a map M, we denote its domain as \(\mathop {\mathrm {dom}}M\) and its range as \(\mathop {\mathrm {ran}}M\); maps are applied using the usual functional notation, e.g., \(y = M(x)\); given a map M, we denote its inverse as \(M^{-1}\).

Fig. 1 Sample documents

Next, we define the core concepts of our proposal, namely: documents, nodes, features, annotations, slots, contexts, bindings, datasets, rules, base learners, and rule scorers.

Definition 1

(Documents and nodes) Documents are character strings that adhere to the HTML syntax and can then be represented as DOM nodes [31, 63].

Example 1

Figure 1 illustrates a collection with documents \(\{d_1, d_2, d_3\}\). We show a partial view of document \(d_1\), which we use as a running example throughout the rest of this section. The set of nodes includes \(\{n_1, n_2, \dots , n_{15}\}\), plus the children of the head element and the nodes that correspond to documents \(d_2\) and \(d_3\), which are not shown. Note that this example is fictitious, since space constraints prevent us from showing an actual collection of documents, but it suffices for illustration purposes.

Table 1 Sample features

Definition 2

(Attributive and relational features) An attributive feature is a function that maps a node onto a value that represents either an HTML attribute [31], which is specified in the HTML code of a document, a rendering attribute [63], which is computed by a browser, or a user-defined attribute. A relational feature is a function that maps a node onto a set of nodes with which the former is related by means of a neighbouring relationship.

Example 2

Table 1 illustrates some of the features of the nodes of which the documents in Fig. 1 are composed. node represents the node being examined; tag and class represent its HTML tag and its CSS class, respectively; and represent the ordinate and the abscissa of the corresponding rendering box, respectively; len and represent the number of tokens in the text that is associated with the node and whether it is a number or not, respectively.

Definition 3

(Annotations and slots) An annotation is a function that maps a node onto a slot. A slot is a label that provides a meaning to the nodes with which it is associated. There is a special slot called null that indicates that a node does not provide any information to be extracted. The nodes that belong to the null slot are referred to as negative examples and the others as positive examples.

Example 3

Table 2 presents the annotation that corresponds to document \(d_1\) in Fig. 1. The set of slots is \(\{Record, country, code, null\}\), where Record labels the records to be extracted, which are composed of a country name that is denoted as country and a phone code that is denoted as code.

Table 2 Sample annotation

Definition 4

(Contexts and bindings) A context is a sequence of tuples of the form \((t, rf, s)\), where t denotes a target variable, rf denotes a relational feature, and s denotes a source variable. If s and rf are not null, then it binds t to the result of applying relational feature rf to s, that is, \(t = rf(s)\); if both rf and s are null, then it is an initial context tuple that indicates that t is bound to the set of all of the nodes in the input documents. Simply put, a context is a symbolic representation of a binding; the binding itself is a map in which the variables in a context are bound to their corresponding nodes.

Example 4

Regarding the documents in Fig. 1, context \(\langle (node_0, null, null), (node_1, parent, node_0) \rangle \) sets variable \(node_0\) to the nodes in the input documents and then variable \(node_1\) to their parents. The corresponding binding is the following: \(\{node_0 \mapsto \{n_1, n_2, \dots , n_{15}, \dots \}, node_1 \mapsto \{n_1, n_3, n_5, n_6, n_7, n_8, n_9, \dots \}\}\).
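
To make the previous definitions concrete, the following Python sketch shows one possible way to represent a context and to compute its binding; it is provided for illustration only, and the node identifiers, the parent map, and the relational feature are hypothetical rather than part of our actual implementation.

def compute_binding(context, all_nodes, relational_features):
    """Bind every variable in a context to the set of nodes that it denotes."""
    binding = {}
    for target, rf, source in context:
        if rf is None and source is None:
            # Initial context tuple: the target is bound to every node.
            binding[target] = set(all_nodes)
        else:
            # The target is bound to rf applied to every node bound to the source.
            feature = relational_features[rf]
            binding[target] = {m for n in binding[source] for m in feature(n)}
    return binding

parent_of = {"n10": "n3", "n11": "n3", "n3": "n1"}
rfs = {"parent": lambda n: {parent_of[n]} if n in parent_of else set()}
context = [("node_0", None, None), ("node_1", "parent", "node_0")]
print(compute_binding(context, ["n1", "n3", "n10", "n11"], rfs))
# node_0 is bound to every node and node_1 to their parents, that is, {'n1', 'n3'}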

Table 3 Sample dataset and rule

Definition 5

(Datasets and rules) A dataset is a function that maps every node in the input documents onto a vector with its attributive features and the attributive features of some neighbours (possibly none); such neighbours are fetched by means of relational features, and they are introduced by means of a context. The datasets that are used to learn rules are referred to as training sets and the datasets that are used to score rules are referred to as test sets. A rule is a function that maps a vector onto a slot that is expected to provide its correct meaning.

Example 5

Table 3a shows a sample dataset in which the context involves the nodes in the input documents and their parents. Columns \(node_0\) and \(node_1\) present the corresponding bindings. Table 3b shows a sample rule that was learnt from the previous dataset; it is of the form \(\langle r_1, r_2, \dots , r_n \rangle \) (\(n \ge 1\)), where each component \(r_i\) consists of a conjunction of simple conditions \(c_{i, 1} \wedge c_{i, 2} \wedge \dots \wedge c_{i, k_i}\) plus the slot that it assigns to the nodes that fulfil them; each \(c_{i, j}\) is a simple condition of the form \(f(n) \mathrel {\theta } v\), where n denotes a target variable in a context, f denotes an attributive feature, \(\theta \) is a comparator, and v is a value. Given a node to classify, it is first transformed into its corresponding vector, and then the components of the rule are applied in sequence; the last component assigns a default slot to nodes that cannot be better classified by the previous components.
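
The following Python sketch, provided for illustration only, shows one possible representation of such rules and how their components are applied in sequence; the feature names, comparators, and slots are hypothetical.

import operator

OPS = {"=": operator.eq, "!=": operator.ne, "<=": operator.le, ">=": operator.ge}

def apply_rule(rule, vector):
    """Return the slot assigned by the first component whose conditions all hold."""
    for conditions, slot in rule:
        if all(OPS[op](vector.get((var, feat)), value)
               for var, feat, op, value in conditions):
            return slot

# A vector maps (variable, feature) pairs onto values.
vector = {("node_0", "tag"): "span", ("node_0", "isNumber"): True,
          ("node_1", "tag"): "td"}
rule = [([("node_0", "isNumber", "=", True), ("node_1", "tag", "=", "td")], "code"),
        ([], "null")]               # the last, condition-free component is the default
print(apply_rule(rule, vector))     # code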

Definition 6

(Base learners and rule scorers) A base learner is a procedure that takes a training set and an annotation as input and returns a rule. A rule scorer is a function that takes a rule and a test dataset and returns a value that indicates how good the rule is at predicting the correct slot for each node in the dataset. The scores must be normalised in the range \([0.00, 1.00]\), so that the closer to the lower bound, the worse, and the closer to the upper bound, the better.

Example 6

Regarding the rule in Table 3b, a rule scorer might score it at 0.94 when it is evaluated on the dataset in Table 3a. Note that it is not generally possible to assess a score in isolation unless it is 0.00 or 1.00; that is, a score of 0.94 does not mean that the rule works well in \(94\,\%\) of the examples to which it is applied or something like that; it simply means that a rule that scores at, say, 0.90 is worse and a rule that scores at, say, 0.98 is better.
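
As an illustration, the following Python sketch shows a rule scorer based on Cohen's Kappa, one of the scorers that we examine in Sect. 3.3; it compares the slots predicted by a rule with the annotated slots, and rescaling the coefficient into \([0.00, 1.00]\) is just one possible normalisation.

from collections import Counter

def kappa_score(predicted, annotated):
    """Cohen's Kappa between predicted and annotated slots, rescaled into [0, 1]."""
    total = len(predicted)
    observed = sum(p == a for p, a in zip(predicted, annotated)) / total
    pred_freq, true_freq = Counter(predicted), Counter(annotated)
    expected = sum(pred_freq[s] * true_freq[s] for s in true_freq) / total ** 2
    kappa = 1.0 if expected == 1.0 else (observed - expected) / (1.0 - expected)
    return (kappa + 1.0) / 2.0

print(kappa_score(["code", "null", "null", "country"],
                  ["code", "null", "country", "country"]))   # roughly 0.82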

Fig. 2 Roller’s main procedure

2.2 Main procedures

Figure 2 presents the main procedure in Roller. It works on a set of documents D and an annotation A; it returns a tuple \((C, r)\), where C is a context and r is a rule. It consists of three steps that run in sequence, namely:

  • The first step consists in computing the attributive and the relational features of the nodes of which the input documents are composed. This is performed by a user-provided procedure called \({ \textsc {featureBuilder}}\), which works on a set of documents D and returns a tuple \(({ AF }, { RF })\), where \({ AF }\) denotes the set of attributive features and \({ RF }\) the set of relational features that it has computed. It is quite a simple procedure from a conceptual point of view: it loads the input documents, parses them into DOM trees, iterates over the resulting nodes, and computes the features that are provided in a catalogue. Note, however, that it is a little more involved from a technology point of view since it requires interacting with DOM-specific, browser-specific, and user-defined APIs to compute the HTML, the rendering, and the user-defined features, respectively. Such technology details are beyond the scope of this article, in which our focus is on presenting the proposal, not on delving into the technological intricacies of implementing it. Working on tuples of the form \(({ AF }, { RF }, A)\), where \({ AF }\) is a set of attributive features, \({ RF }\) is a set of relational features, and A is an annotation, is very common in our proposal; for the sake of brevity, we refer to such tuples as input configurations.

  • The second step consists in learning an initial rule that builds exclusively on the attributive features of the nodes in the input documents. To do so, we have to create an initial context of the form \(C = \langle c \rangle \), where \(c = (node_0, null, null)\). The corresponding binding B simply maps variable \(node_0\) onto the set of all nodes in the input documents, which can be very easily computed from the domain of the input annotation A. Then, a training set \({ TS }\) is created; it maps every node bound to \(node_0\) in the initial context tuple c onto a vector that represents its attributive features. The base learner is finally invoked on training set \({ TS }\) and the input annotation A in order to learn a rule r. Working on tuples of the form \((C, B, { TS }, r)\), where C denotes a context, B its corresponding binding, \({ TS }\) a training set for context C, and r a rule that was learnt from that training set, is very common in our proposal; for the sake of brevity, we refer to such tuples as rule configurations.

  • The third step consists in finding an expansion, which is a term that we use to refer to a rule configuration that results from exploring some neighbours of the nodes in the context of the best rule configuration found so far. Ideally, such an expansion should provide a rule that scores better than the current rule. To achieve such a goal, we combine the attributive features of the nodes in the current context with the attributive features of the nodes that are explored in the expansion. If an expansion that achieves a better score is found, this step is repeated; otherwise, it stops and the procedure returns the best rule found and its associated context. A compact sketch of this main loop is shown below.
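
The following Python skeleton, which is provided for illustration only, summarises the previous three steps; feature_builder, base_learner, and rule_scorer correspond to the user-provided procedures, create_training_set and find_expansion stand for the ancillary procedures that are sketched in the following subsections, and the map of inverse relational features is an assumption of the sketch, not part of our proposal.

def roller(documents, annotation, feature_builder, base_learner, rule_scorer,
           inverse_of):
    """Sketch of Fig. 2: learn a context and a rule from documents and an annotation."""
    # Step 1: compute the attributive and the relational features.
    AF, RF = feature_builder(documents)
    # Step 2: learn an initial rule from the attributive features only.
    context = [("node_0", None, None)]
    binding = {"node_0": set(annotation)}          # the domain of the annotation
    ts = create_training_set(context[0], (AF, RF, annotation))
    rule = base_learner(ts, annotation)
    score = rule_scorer(rule, ts, annotation)
    # Step 3: expand the context while the rule is not perfect and there is a gain.
    while score < 1.00:
        expansion = find_expansion((context, binding, ts, rule),
                                   (AF, RF, annotation),
                                   base_learner, rule_scorer, inverse_of)
        new_score = rule_scorer(expansion[3], expansion[2], annotation)
        if new_score <= score:
            break                                  # no expansion provides a gain
        context, binding, ts, rule = expansion
        score = new_score
    return context, rule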

Fig. 3 Procedure to find an expansion

The procedure to find an expansion is presented in Fig. 3. It works on the rule configuration \((C, B, { TS }, r)\) that corresponds to the best rule found so far and an input configuration \(({ AF }, { RF }, A)\); it returns a rule configuration \((C^*, B^*, { TS }^*, r^*)\) such that \(r^*\) improves or equals the score achieved by r. It consists of the following steps:

  • The first step initialises a rule configuration of the form \((C^*, B^*, { TS }^*, r^*)\) to the input rule configuration; the procedure searches for candidate expansions and stores the best one that it finds in this starred rule configuration. The criterion used to determine whether an expanded configuration is better than another is based on the score achieved by the corresponding rule; we use variable \(g^*\) to save the score gain that is achieved when the current rule is replaced by the best expansion found so far. It is initialised to 0.00 because the first configuration coincides with the input configuration.

  • The second step is a loop that explores candidate expansions. It iterates over the set of pairs \((c, rf)\) of the Cartesian product of the context tuples in C and the relational features in \({ RF }\), as long as the best expansion found is not perfect. For each such pair, a new context tuple of the form \((x, rf, y)\) is created, where x denotes a new variable that is not used in context C and y denotes the target variable in context tuple c; simply put, the new context tuple binds a new variable to the result of applying relational feature rf to the nodes that are currently bound to the target of context tuple c. This makes it possible to explore the neighbourhood of every node in the current context C. Note, however, that only context tuples that are not redundant with regard to the current context must be explored. Such context tuples are then used to create a new rule configuration \((C', B', { TS }', r')\). The score gain of \(r'\) with respect to the input rule r is then computed; if it is greater than the score gain of the best expansion found so far, then it means that the new expansion must be saved since it has resulted in a better rule. This step iterates until a perfect rule is found or the whole Cartesian product is explored; in both cases, the best rule configuration found is returned.

The check for redundancy is implemented by means of predicate redundant, which, given a context tuple \((t, rf, s)\) and a context C, holds as long as there is a context tuple \((t', rf', s')\) in C such that either \(s = s'\) and \(rf = rf'\), or \(t' = s\) and \(rf' = rf^{-1}\). The first condition is trivial since it amounts to saying that context tuples \((t, rf, s)\) and \((t', rf, s)\), where \(t \not = t'\), are redundant because they bind the same nodes to different variables, which does not help explore new neighbours. The second condition is a little more involved; it amounts to saying that context tuples \((t, rf, s)\) and \((t', rf^{-1}, t)\), where \(t \not = t'\), are redundant because the second one binds \(t'\) to the same nodes that are bound to s. Formally speaking, \(rf\) is the inverse of \(rf'\) if \(\forall n_1 \cdot rf'(n_1) = N \Rightarrow \{n_1\} = \bigcup \{rf(n_2) \mid n_2 \in N\}\).
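
The following Python sketch, again for illustration only, renders the redundancy check and the expansion loop using the representations assumed in the previous sketches; inverse_of is a hypothetical map that pairs each relational feature with its inverse, and expand_training_set stands for the procedure that is sketched in Sect. 2.3.

from itertools import product

def redundant(candidate, context, inverse_of):
    """True if the candidate tuple would bind a set of nodes that is already explored."""
    t, rf, s = candidate
    return any((s2 == s and rf2 == rf) or
               (t2 == s and rf2 is not None and rf2 == inverse_of.get(rf))
               for t2, rf2, s2 in context)

def find_expansion(rule_cfg, input_cfg, base_learner, rule_scorer, inverse_of):
    context, binding, ts, rule = rule_cfg
    AF, RF, annotation = input_cfg
    best, best_score = rule_cfg, rule_scorer(rule, ts, annotation)
    for c, rf in product(context, RF):
        fresh = "node_%d" % len(context)        # a variable that is not used yet
        candidate = (fresh, rf, c[0])           # bind it to rf of c's target
        if redundant(candidate, context, inverse_of):
            continue
        binding2 = dict(binding)
        binding2[fresh] = {m for n in binding[c[0]] for m in RF[rf](n)}
        ts2 = expand_training_set(ts, candidate, binding2, input_cfg)
        rule2 = base_learner(ts2, annotation)
        score2 = rule_scorer(rule2, ts2, annotation)
        if score2 > best_score:
            best, best_score = (context + [candidate], binding2, ts2, rule2), score2
        if best_score >= 1.00:
            break                               # a perfect rule has been found
    return best

# The redundancy check alone, on the running example:
ctx = [("node_0", None, None), ("node_1", "parent", "node_0")]
inv = {"parent": "child", "child": "parent"}
print(redundant(("node_2", "parent", "node_0"), ctx, inv))   # True (same nodes)
print(redundant(("node_2", "child", "node_1"), ctx, inv))    # True (inverse feature)
print(redundant(("node_2", "left", "node_0"), ctx, inv))     # False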

Example 7

Assume that Roller is executed on the input documents and the annotation that are sketched in Fig. 1 and Table 2, respectively. It first uses the user-provided \({ \textsc {featureBuilder}}\) procedure to compute the sets of attributive and relational features that are sketched in Table 1. These features and the annotation are used to create an initial training set that corresponds to context tuple \((node_0, null, null)\), which is sketched in Table 3a. Note that this table sketches a training set that corresponds to two context tuples, namely, the initial context tuple \((node_0, null, null)\) and another context tuple that explores the parents of the nodes that are bound to \(node_0\), that is, \((node_1, parent, node_0)\). The initial training set corresponds to the part of the table that refers to the initial context tuple. If we apply a base learner to learn a rule from this training set, then we might get the following rule:

which is scored at 0.90. That means that it is reasonably good at classifying each node in the input documents into the appropriate slot, but it is not perfect. Thus, it makes sense to explore the neighbouring nodes in order to find out if there is one whose attributive features can contribute to producing a better rule. Since the context currently has the initial context tuple \((node_0, null, null)\) only and the relational features are parent, left, right, and child, then the procedure to find an expansion has to explore the following additional contexts:

Exploring the first context amounts to creating a new training set in which the attributive features of each node are combined with the attributive features of their parents. In this case, the resulting training set is sketched in Table 3a. If the base learner is applied to this training set, then we might get the following new rule:

which is scored at 0.94. Exploring the remaining context tuples results in similar rules, none of which is scored better. That means that we now have to explore the following contexts:

Note that there are two contexts that need not be explored, namely: context \(\langle (node_0, null, null), (node_1, parent, node_0), (node_2, parent, node_0) \rangle \) is not explored because it does not provide any additional data to the training set and would result in the same rule; context \(\langle (node_0, null, null), (node_1, parent, node_0), (node_2, child, node_1) \rangle \) is not explored because context tuples \((node_1, parent, node_0)\) and \((node_2, child, node_1)\) are redundant because relational feature child is the inverse of relational feature parent, so exploring it would result in the same rule because \(node_2 = node_0\); that is, both \(node_2\) and \(node_0\) would be bound to exactly the same nodes. Note, however, that the new contexts are allowed to include context tuples that have been explored previously; for instance, a context of the form \(\langle (node_0, null, null), (node_1, parent, node_0), (node_2, left, node_0) \rangle \) explores the left sibling of every node again, but in a different context: previously, we explored the nodes and their left siblings and now we explore the nodes, their parents, and their left siblings together.

In this case, the context that results in the best rule is \(\langle (node_0, null, null), (node_1, parent, node_0), (node_2, parent, node_1) \rangle \), namely:

This rule is scored at 1.00, which means that it is a perfect rule, that is, it assigns every node in the training set to the correct slot. So the search for a rule finishes here. Note that the resulting rule takes into account nodes \(node_0\) and \(node_2\) only; \(node_1\) was used just to reach \(node_2\), but its attributive features do not provide any classification power in this example.

Fig. 4 Procedures to deal with training sets

2.3 Working with training sets

Training sets associate nodes with the vectors that describe their attributive features within a given context. Figure 4 presents the two ancillary procedures that we propose to deal with training sets.

The first procedure is createTrainingSet, which works on an initial context tuple c and an input configuration \(({ AF }, { RF }, A)\); it returns a training set in which every node in the domain of the annotation is mapped onto a singleton that provides its representation as a vector. Initially, every node is associated with a unique vector, but if the training set is expanded using a multi-valued relational feature, then the initial vectors need to be combined with the vectors that correspond to several neighbours. This is the reason why training sets associate nodes with sets of vectors.

The second procedure is expandTrainingSet, which works on a training set \({ TS }\), a context tuple c, a binding B, and an input configuration \(({ AF }, { RF }, A)\); it returns a training set in which every node in \({ TS }\) is mapped onto a set of expanded vectors that represent the attributive features that are already present in training set \({ TS }\) plus the attributive features that correspond to the nodes bound in B by context tuple c.
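
The following Python sketch, provided for illustration only, renders both procedures in a simplified form: vectors are stored in lists instead of sets, since Python dictionaries are not hashable, and compute_vectors and expand_vectors stand for the procedures that are sketched in Sect. 2.4.

def create_training_set(c, input_cfg):
    """Map every annotated node onto a singleton list with its own vector."""
    AF, RF, annotation = input_cfg
    return {n: [compute_vectors(n, AF, c)] for n in annotation}

def expand_training_set(ts, c, binding, input_cfg):
    """Expand every node's vectors with the features of the neighbours that c binds."""
    return {n: expand_vectors(n, vectors, c, binding, input_cfg)
            for n, vectors in ts.items()}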

Example 8

Table 3a illustrates a training set that is created from the attributive features in Table 1a. The initial training set consists of the vectors that correspond to context tuple \((node_0, null, null)\); the same table illustrates how this training set is expanded to take into account the features of the parents of every node.

Fig. 5 Procedures to deal with vectors

2.4 Working with vectors

Vectors represent the attributive features of a subset of nodes in the input documents in a format that is suitable to learn a rule using a propositional base learner. We need two ancillary procedures to deal with them, which are presented in Fig. 5.

The first procedure is computeVectors. It works on a node n, a set of attributive features \({ AF }\), and a context tuple c. It computes a vector that is implemented as a map in which each attributive feature is associated with its corresponding value on node n regarding context c.

The second procedure is expandVectors. It works on a node n, a set of vectors V that is associated with n in a given training set, a context tuple c, a binding B, and an input configuration \(({ AF }, { RF }, A)\). It first computes the set of nodes N that correspond to the target of context tuple c and, after getting the relational feature in c and initialising the result \(V'\) to the empty set, it iterates over the set of pairs (mv) in which m denotes a node in N and v is one of the vectors in V; note that only pairs in which \(m \in rf(n)\) are considered, that is, pairs in which node m is a neighbour of node n regarding relational feature rf. For every such pair, we first compute the vector w that corresponds to m using the set of attributive features \({ AF }\) and the context tuple c; that vector is then merged with vector v, that is, vector v is expanded with the attributive features of node m.
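
The following Python sketch, again for illustration only, renders both procedures under the assumption that the attributive and the relational features are given as maps from names onto functions and that vectors are dictionaries keyed by (variable, feature) pairs; keeping the previous vectors when a node has no such neighbour is also an assumption of the sketch.

def compute_vectors(n, AF, c):
    """Vector of node n: every attributive feature, keyed by c's target variable."""
    target = c[0]
    return {(target, name): feature(n) for name, feature in AF.items()}

def expand_vectors(n, V, c, binding, input_cfg):
    """Combine every vector of n with the vectors of its neighbours under c."""
    AF, RF, annotation = input_cfg
    target, rf_name, source = c
    rf = RF[rf_name]
    N = binding[target]                    # nodes bound to the new target variable
    expanded = []
    for m in N:
        if m not in rf(n):                 # keep only the actual neighbours of n
            continue
        w = compute_vectors(m, AF, c)
        for v in V:
            expanded.append({**v, **w})    # merge the two vectors
    # Assumption: a node without such neighbours keeps its previous vectors.
    return expanded or list(V)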

Example 9

Let us examine node \(n_{10}\) in the document in Fig. 1 and the initial context tuple \(c = (node_0, null, null)\). Recall that Table 1a reports on the attributive features of the nodes in our running example. The vector that is associated with this node is the following: . If this vector is expanded with context tuple \(c' = (node_1, parent, node_0)\), then it becomes .

3 Configuring Roller

Roller has three variation points, namely: procedure \({ \textsc {featureBuilder}}\), which computes a catalogue of features, \({ \textsc {baseLearner}}\), which learns a rule from a propositional training set, and \({ \textsc {ruleScorer}}\), which assesses how good a rule is. There are several alternatives to implement these procedures; the decision must obviously be made on the basis of an experimental study that shows that the chosen combination of alternatives performs well.

The details regarding our experimentation environment and the performance measures that we used are described in “Appendices 1 and 2”, respectively. In the following subsections, we first present a method that helped us decide which combination of alternatives was the best; then, we report on the feature builder, the base learner, and the rule scorers that we examined; finally, we report on the results of the experimental analysis that we carried out to find the best combination of alternatives.

3.1 A ranking method

Since we have to compare many alternatives regarding the variation points in Roller, we need a method to rank them building on a number of performance measures that we denote as M. Without loss of generality, we can assume that M can be partitioned as \(M = A \uplus B\), where A denotes a set of performance measures whose value must be maximised and B denotes a set of performance measures whose value must be minimised.

For the method to work well, we first need to normalise the values of the measures so that they all range within the same interval and the interpretation of the lower and the upper bounds is homogeneous. Assume that we are dealing with a performance measure m, that we have gathered a set of values W regarding it, that a denotes the minimum value in set W, and that b denotes the maximum value. If \(m \in A\), then we define its set of normalised values as \(W' = \{w' \mid \exists w \cdot w \in W \wedge w' = (w - a) \div (b - a)\}\); if \(m \in B\), then we define its set of normalised values as \(W' = \{w' \mid \exists w \cdot w \in W \wedge w' = 1.00 - (w - a) \div (b - a)\}\). (\(a \div b\) equals a / b if \(b \not = 0.00\); otherwise, it equals 1.00.) Note that the values in \(W'\) range in the interval \([0.00, 1.00]\), so that the closer a value is to the lower bound, the worse, and the closer it is to the upper bound, the better.
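
The following Python sketch, provided for illustration only, implements the previous normalisation, including the degenerate case of the division.

def normalise(values, maximise=True):
    """Map a set of raw values onto [0.00, 1.00]; flip them for measures to minimise."""
    a, b = min(values), max(values)
    def div(x, y):
        return x / y if y != 0.0 else 1.0   # the degenerate case defined in the text
    scaled = [div(w - a, b - a) for w in values]
    return scaled if maximise else [1.0 - w for w in scaled]

print(normalise([0.80, 0.95, 0.90]))                 # a measure to maximise, e.g. F1
print(normalise([12.0, 3.0, 7.0], maximise=False))   # a measure to minimise, e.g. time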

We now have to transform the normalised values of the measures into a rank. We assume that the experimenter provides a map \(\beta \) that assigns a relative weight in the range \([0.00, 1.00]\) to every measure in M; obviously, the weights must sum up to 1.00 so that they are consistent. The idea behind weighting the performance measures is that the researcher can provide a hint regarding the measures on which he or she is most interested. Note that we need to perform several experiments to rank a number of alternatives, which means that we actually compute a set of values for every measure. To work with that set, we need to compute its mean value, but we also have to take into account its deviation, since a high deviation means that the mean value of a variable is not actually representative, that is, it is not a good basis for estimating a rank. To deal with this problem, we have devised the following formula:

$$\begin{aligned} K^p= & {} \sum _{m' \in M'} \beta (m) \, K^p_{m'} \\ K^p_{m'}= & {} \frac{\mathop {\mathrm {mdr}_{m'}^{p}}}{\mathop {\mathrm {mdr}_{m'}^{max}}} \end{aligned}$$

where p denotes an alternative to be compared, \(M'\) denotes a set of new measures that are in one-to-one correspondence with the measures in M, but are normalised according to the previous procedure, and \(\mathop {\mathrm {mdr}_{m'}^{p}}\) denotes the mean-to-deviation ratio of alternative p with regard to normalised performance measure \(m'\) and \(\mathop {\mathrm {mdr}_{m'}^{\max }}\) denotes the maximum mean-to-deviation ratio of performance measure \(m'\) across all of the alternatives to compare. This ratio is defined as follows:

$$\begin{aligned} \mathop {\mathrm {mdr}_{m}^{p}} = \left\{ \begin{array}{ll} \frac{\mu _m^p}{\sigma _m^p} \, \mu _m^p &{}\quad \text{ if } \sigma _m^p \not = 0.00 \\ \mu _m^p &{}\quad \text{ otherwise } \end{array} \right. \end{aligned}$$

where m denotes an arbitrary performance measure, \(\mu _m^p\) denotes its mean value regarding alternative p, and \(\sigma _m^p\) its standard deviation regarding alternative p. Note that this ratio maps every measure onto a value that weights its mean value with the inverse of the coefficient of variation. Intuitively, the smaller the coefficient of variation, the more representative the mean value; in other words, the smaller the coefficient of variation of a measure, the more that measure can contribute to the rank of an alternative.
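
The following Python sketch, which is merely illustrative and uses invented figures, puts the previous formulae together: it computes the mean-to-deviation ratio of every normalised measure, divides it by the maximum ratio across the alternatives, and combines the results using the weights in \(\beta \).

from statistics import mean, pstdev

def mdr(values):
    """Mean-to-deviation ratio of a set of normalised values."""
    mu, sigma = mean(values), pstdev(values)
    return mu if sigma == 0.0 else (mu / sigma) * mu

def rank(results, beta):
    """results maps alternatives onto {measure: [normalised values]}; beta holds the weights."""
    measures = list(beta)
    max_mdr = {m: max(mdr(results[p][m]) for p in results) for m in measures}
    return {p: sum(beta[m] * mdr(results[p][m]) / max_mdr[m] for m in measures)
            for p in results}

results = {"JRip+Kappa":   {"F1": [0.95, 0.96, 0.94], "ET": [0.90, 0.88, 0.91]},
           "PART+Support": {"F1": [0.90, 0.70, 0.95], "ET": [0.60, 0.65, 0.62]}}
print(rank(results, {"F1": 0.80, "ET": 0.20}))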

Summing up, \(K^p\) provides a rank for alternative p in which both the mean value of the performance measures and their deviation are taken into account. We think that this is a good approach since it blurs the contribution of measures that are not stable enough, but emphasises the others. Please, note that it was not our purpose to devise a general-purpose ranking method, but an ad hoc method that has proven to guide our search for a good alternative very well.

Before concluding, we would like to mention that our experimental analysis has revealed that some alternatives fail when they are applied to some datasets. Sometimes, the reason is that they consume too much memory; sometimes, they cannot learn a rule in a reasonable time (we set a deadline of 15 CPU minutes); sometimes, the dataset has some characteristics that make it impossible to execute the base learner on it. That means that we also need to compute a failure ratio for every alternative under consideration, which is defined as follows:

$$\begin{aligned} { FR } = \frac{F}{D}, \end{aligned}$$

where F denotes the number of datasets on which an alternative did not work, and D denotes the number of datasets on which the alternative was run. Intuitively, the closer to 0.00 the better and the closer to 1.00 the worse. We obviously are not willing to accept an alternative whose failure ratio is different from 0.00, since that means that it is not generally applicable.

Table 4 Partial catalogue of user-defined features

3.2 Our feature builder

Our feature builder computes the standard HTML features and the standard rendering features of the input documents, as they are defined in the corresponding W3C recommendations [31, 63]. Additionally, it computes some user-defined features; Table 4 shows only the user-defined features that have proven to be useful in our experiments.

The attributive features can be classified according to the prefixes of their names into the following groups: (a) prefix beginsWith identifies some features that check whether the text of a node begins with a token that belongs to a given lexical class, e.g., a number or a punctuation symbol; (b) prefix countOf identifies some features that count the number of tokens in the text of a node that fulfil a given property, e.g., the count of alphanumeric tokens or the count of lowercase tokens; (c) prefix endsWith denotes features that check if the text in a node ends with a token that belongs to a given lexical class, e.g., a number or a punctuation symbol; (d) prefix first denotes features that return a prefix of the text in a node, e.g., the first token or the first two tokens (which are commonly referred to as bigrams); (e) prefix has identifies features that check if there is a subsequence of tokens in the text of a node that fulfils a given property, e.g., there is a bracketed number or a question mark; (f) prefix is denotes a feature that checks if the text in a node matches a given pattern, e.g., whether it is capitalised or a phone number; (g) finally, prefix last denotes features that return a suffix of the text in a node, e.g., the last token or the last bigram.
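
As an illustration, the following Python snippets sketch a few user-defined features in the style of Table 4; the implementations are merely indicative of how such features can be computed from the text of a node.

import re

def begins_with_number(text):
    """True if the first token of the text is a number."""
    tokens = text.split()
    return bool(tokens) and bool(re.fullmatch(r"\d+", tokens[0]))

def count_of_uppercase(text):
    """Number of tokens that are written in uppercase."""
    return sum(1 for token in text.split() if token.isupper())

def is_capitalised(text):
    """True if the text starts with an uppercase letter."""
    return bool(text) and text[:1].isupper()

def last_bigram(text):
    """The last two tokens of the text."""
    return " ".join(text.split()[-2:])

print(begins_with_number("34 Spain"))   # True
print(is_capitalised("France"))         # True
print(last_bigram("phone code 34"))     # code 34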

The catalogue of relational features provides common features to navigate from a node to its neighbours in a DOM tree, namely: ancestor, children, first sibling, last sibling, left sibling, right sibling, and parent.

Recall that Roller is not bound to a particular choice of features. It is open to work with the features that a user thinks are the most appropriate for a given problem. The previous features provide just a catalogue that has proven to work very well in our experiments.

Table 5 Experimental results regarding several variants of Roller

3.3 Our base learner and rule scorer

Regarding the base learner, we have explored Conjunctive Rule, Decision Table, JRip, NNge, PART, and Ridor [22]. They are available in Weka and can deal with multi-class problems and both numeric and categoric attributive features. A problem with them is that they do not work well with training sets that are unbalanced, which is the case in our context. The reason is that input documents are composed of hundreds of nodes, most of which are negative examples, cf. Table 8 in “Appendix 1”. Thus, the base learner must balance the training sets on which it works. We have explored several alternatives in the literature [4, 30], and our conclusion was that the one that performs best consists in computing the number of examples of the majority slot and then replicating as many examples of the other slots as needed to assemble a training set that has approximately the same number of examples for every slot.
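
The following Python sketch, provided for illustration only, implements the previous balancing strategy: the examples of every minority slot are replicated at random until each slot has approximately as many examples as the majority slot.

import random
from collections import defaultdict

def balance(examples, seed=0):
    """examples is a list of (vector, slot) pairs; returns an oversampled copy."""
    by_slot = defaultdict(list)
    for example in examples:
        by_slot[example[1]].append(example)
    majority = max(len(group) for group in by_slot.values())
    rnd = random.Random(seed)
    balanced = []
    for group in by_slot.values():
        balanced.extend(group)
        # Replicate randomly chosen examples until the slot reaches the majority size.
        balanced.extend(rnd.choice(group) for _ in range(majority - len(group)))
    return balanced

sample = [({"len": 2}, "null")] * 8 + [({"len": 1}, "code")] * 2
print(len(balance(sample)))   # 16: eight null examples and eight code examples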

Regarding the rule scorer, Information Content is the most common in practice [51]. It has proven to guide the search process very well when dealing with classical inductive logic programming problems. It relies exclusively on the number of true positives and false positives that a rule produces when it is evaluated. We wished to explore some rule scorers that also take into account the number of true negatives and false negatives. We have surveyed the literature and we have found several alternatives [27], namely: Collective Strength, Confidence, Jaccard, Kappa, Laplace, Leverage, Odds Ratio, Phi Coefficient, Satisfaction, Support, and Yule’s Q.

The Cartesian product of base learners and rule scorers resulted in 72 variations of Roller. Table 5 summarises the results that we obtained when we ran each variation on our datasets, including the mean and standard deviations of precision (P), recall (R), the \(F_1\) score (\(F_1\)), learning time (\({ LT }\)), and extraction time (\({ ET }\)), as well as our rank (K) and the failure ratio (\({ FR }\)).

In our experimentation, we set the relative weights of the performance measures to \(\beta (F_1) = 0.70\), \(\beta ({ LT }) = 0.10\), and \(\beta ({ ET }) = 0.20\). In other words, we think that a good proposal must be able to learn extraction rules that are very effective, that is, that achieve a high precision and recall, and, consequently, a high \(F_1\) score. Note that learning a rule is a process that is executed every now and then, when a new site needs to be analysed or when a rule breaks because the corresponding site has undergone a change to its layout; since our experimental analysis confirmed that Roller is quite effective and can learn in a matter of seconds, we did not think that the learning time could make a big difference between two alternatives. Contrarily, once a rule is learnt, it must be executed as quickly as possible in a production environment, so the extraction time is much more important than the learning time.

Note that the best variations achieve \(K = 0.98\) and \(K = 0.97\); they all rely on Ridor as the base learner and Jaccard, Laplace, Satisfaction, or Support as the rule scorers; unfortunately, all of them have a failure ratio of 0.16, which means that they cannot deal with some datasets. The problem is that Ridor is a learner that uses a technique called Reduced Error Pruning to prune the resulting rules; unfortunately, there are a number of datasets that do not provide enough data for this technique to work, which means that it cannot be applied to relatively small documents. As a conclusion, we have to refrain from using Ridor, even though it works well with sufficiently large documents.

Thus, the best variations seem to be those that achieve \(K = 0.92\) with a 0.00 failure ratio. They all correspond to using JRip as the base learner and Confidence, Jaccard, Kappa, Laplace, Satisfaction, or Support as the rule scorer. Since there are multiple ties, we decided to select JRip and Kappa because this is the variation that achieves the minimum extraction time.

4 Experimental analysis

In this section, we first report on the results of our experimental analysis regarding effectiveness and then regarding efficiency. The details regarding our experimentation environment, including the proposals with which we have compared ours, and the performance measures that we have used are described in “Appendices 1 and 2”, respectively. In every case, we have conducted a statistical analysis to make sure that the differences in rank that our experiments have found are statistically significant at the standard significance level \(\alpha = 0.05\).

Following the results by Demšar [16] and García and Herrera [25], we have used Iman–Davenport’s test to find out whether there are statistically significant differences in the empirical ranks and then Hommel’s test to compare the best-ranked technique to the others. Note that we have to resort to nonparametric tests because the distribution of the performance measures is not normal and they are not homoscedastic [54]. Regarding the normality of Roller’s precision, for instance, Kolmogorov–Smirnov’s test returns \(D = 0.78\) with a p value less than \(2.20 \times 10^{-16}\), Shapiro–Wilk’s test returns \(W = 0.79\) with p value \(2.39 \times 10^{-07}\), and Anderson–Darling’s test returns \(AD = 49.85\) with p value \(1.11 \times 10^{-05}\); regarding homoscedasticity, the comparison between Roller’s and SoftMealy’s precision, for instance, returns \(F = 49.64\) (with one degree of freedom) and p value \(1.94 \times 10^{-10}\) using Levene’s test, \(K = 69.66\) (with one degree of freedom) and p value less than \(2.20 \times 10^{-16}\) using Bartlett’s test, and \(F = 0.08\) with p value less than \(2.20 \times 10^{-16}\) using the F test. Note that the previous results provide a strong indication that the data does not behave normally and is not homoscedastic, which precludes using parametric tests like the well-known t test or ANOVA tests.
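
For illustration purposes only, the following Python sketch shows how such an analysis can be reproduced with SciPy and statsmodels, following the procedure that Demšar [16] describes; the figures are invented placeholders, not our experimental results.

import numpy as np
from scipy import stats
from statsmodels.stats.multitest import multipletests

# One row per dataset, one column per proposal (e.g. F1 scores); invented figures.
scores = np.array([[0.97, 0.90, 0.88],
                   [0.95, 0.92, 0.80],
                   [0.99, 0.85, 0.91],
                   [0.96, 0.89, 0.84]])
n, k = scores.shape

# Friedman's statistic, corrected as suggested by Iman and Davenport.
chi2, _ = stats.friedmanchisquare(*scores.T)
f_stat = (n - 1) * chi2 / (n * (k - 1) - chi2)
p_value = stats.f.sf(f_stat, k - 1, (k - 1) * (n - 1))
print("Iman-Davenport p value:", p_value)

# Post-hoc comparison of the best-ranked proposal (column 0) with the others,
# adjusted with Hommel's method.
ranks = stats.rankdata(-scores, axis=1).mean(axis=0)
se = np.sqrt(k * (k + 1) / (6.0 * n))
z = (ranks[1:] - ranks[0]) / se
raw = 2.0 * stats.norm.sf(np.abs(z))
print("Hommel-adjusted p values:", multipletests(raw, method="hommel")[1])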

Table 6 Effectiveness results regarding precision (P), recall (R), and the \(F_1\) score
Fig. 6 Summary of results regarding precision

Fig. 7 Summary of results regarding recall

Fig. 8 Summary of results regarding the \(F_1\) score

4.1 Effectiveness analysis

Table 6 reports on the raw effectiveness data that we got from our experimentation. For each proposal, we report on its precision (P), recall (R), and \(F_1\) score (\(F_1\)) regarding our datasets. The first two lines also provide a summary of the results in terms of mean value and the standard deviation of each measure. Since it is difficult to spot a trend in this table, we decided to summarise the data using boxplots.

Figure 6 summarises the results regarding precision. By precision, we mean the ability of a proposal to learn a rule that makes as few classification mistakes as possible; simply put, it is the ratio of true positives to the total number of true positives and false positives. Empirically, Roller seems to be the proposal that achieves the best precision, and it is, indeed, the one that is most stable regarding this effectiveness measure; the other proposals can achieve precisions that are as high as Roller’s, but their deviation with respect to the mean is larger. Iman–Davenport’s test returns a p value that is nearly zero, which is a strong indication that there are differences in rank amongst the proposals that we have compared. We then have to compare Roller, which ranks the first regarding precision, and the other techniques. Hommel’s test confirms that the differences in rank amongst Roller and the other techniques are statistically significant because it returns adjusted p values that are very small with regard to the significance level. In other words, our experimental data provides enough evidence to reject the hypothesis that Roller behaves similarly to the other proposals regarding precision, that is, it supports the idea that Roller can learn rules that are more precise than the rules learnt by the other proposals.

Figure 7 summarises the results regarding recall. By recall, we mean the ability of a proposal to learn a rule that assigns as many nodes as possible to their correct slots; simply put, it is the ratio of true positives to the total number of true positives and false negatives. Empirically, Roller seems to be the proposal that can achieve the highest recall and it is the one that seems most stable regarding this measure because its deviation is the smallest and its inter-quartile range is also the smallest. Note, however, that the other techniques can achieve results that are very good, too, chiefly Trinity and Aleph. Iman–Davenport’s test returns a p value that is very close to zero, which is a clear indication that there are differences in rank amongst the proposals that we have compared. Hommel’s test confirms that the differences in rank between Roller, which ranks the first from an empirical point of view, and Aleph, RoadRunner, FiVaTech, Wien, and SoftMealy are statistically significant at the standard significance level; note, however, that the adjusted p value that corresponds to the comparison between Roller and Trinity is greater than the standard significance level, which means that the difference in empirical rank between these two proposals is not statistically significant. As a conclusion, the experimental data does not provide enough evidence to reject the hypothesis that Roller and Trinity behave similarly regarding recall, that is, they both rank statistically at the first position; however, it provides enough evidence to reject the hypothesis that Roller behaves similarly to Aleph, RoadRunner, FiVaTech, Wien, and SoftMealy, that is, they rank worse than Roller and Trinity.

Figure 8 summarises the results regarding the \(F_1\) score. This score is the harmonic mean of precision and recall; thus, it rewards proposals that achieve both a high degree of precision and recall and penalises those that do not. Empirically, Roller seems to be the proposal that can achieve the best \(F_1\) score, and it is, again, the most stable. Iman–Davenport’s test returns a p value that is nearly zero, which strongly supports the hypothesis that there are statistically significant differences in rank. Hommel’s test returns adjusted p values that are clearly smaller than the significance level in every case, which supports the hypothesis that the differences in rank between Roller and every other proposal are statistically significant, too.

Since Roller works on the tree representation of the input documents, we need to parse them and correct the errors in their HTML code. Such errors are very common, cf. Table 8 in “Appendix 1”. Consequently, it was also necessary to carry out a statistical analysis to find out if our experiments provide enough evidence to conclude that the presence of errors in the input documents has an impact on the effectiveness of our proposal. We have used Kendall’s Tau test, which returned \(\tau = -0.09\) with p value 0.37. Note that \(\tau \) is very close to zero and that the p value is clearly greater than the standard significance level, which means that the experimental data does not provide enough evidence to reject the hypothesis that the correlation is zero. In other words, our experiments do not provide any evidence that the effectiveness of our proposal may be biased by the presence of errors in the HTML code of the input documents.

Our conclusions are that Roller outperforms the other proposals regarding effectiveness and that it is the proposal whose results are more stable. The statistical tests that we have performed have found enough evidence in our experimental data to support the hypothesis that the differences in the empirical rank amongst Roller and the other proposals are significant at the standard significance level, except for the case of recall, in which case the experimental data does not provide enough evidence to conclude that Roller and Trinity perform differently. Note, too, that proposals like RoadRunner and FiVaTech cannot deal with all of our datasets; in Table 6 such situations are indicated with a dash. The reason is that they took more than 15 CPU minutes to learn a rule or that they raised an exception; in both cases, we could not compute effectiveness measures for the corresponding datasets.

Table 7 Efficiency results regarding learning time (\({ LT }\)) and extraction time (\({ ET }\))

4.2 Efficiency analysis

Table 7 reports on the raw efficiency data that we got from our experimentation. For each proposal, we report on its mean learning time (\({ LT }\)) and its mean extraction time (\({ ET }\)) regarding our datasets. The first two lines also provide a summary of the results in terms of mean value and standard deviation of each measure. Since it is difficult to spot a trend in such a table, we decided to summarise the data using boxplots.

Figure 9 summarises the results regarding learning times, that is, the mean CPU time that each proposal took to learn a rule. Experimentally, it seems that Trinity is the proposal that takes the least time to learn a rule; in most cases, it does not take more than a tenth of a second. It is followed by RoadRunner, SoftMealy, and Wien, whose learning times are very similar; Roller seems to rank at the fifth position, ahead of FiVaTech and Aleph, which are the most inefficient. Iman–Davenport’s test returns a p value that is very close to zero, which clearly supports the hypothesis that there are differences in rank amongst these proposals. Hommel’s test also returns adjusted p values that are very small with respect to the significance level, which also reveals that the experimental data provides enough evidence to support the hypothesis that Trinity is the proposal that performs the best and that the others rank below it. Note that we do not think that this is a serious shortcoming since our learning times still lie within the range of a few seconds in most cases and we assume that learning rules is not a task that must be executed continuously in a production scenario.

Fig. 9 Summary of results regarding learning times

Fig. 10 Summary of results regarding extraction times

Figure 10 summarises the results regarding extraction times, that is, the mean CPU time that it took to apply a rule to a dataset. Aleph, SoftMealy, and Wien seem to be the proposals that have the worst performance; RoadRunner and Trinity seem to be very similar in both mean extraction time and deviation since their inter-quartile ranges are identical. FiVaTech seems to be a little worse than RoadRunner and Trinity since its mean extraction time is larger, but note that it is a little more stable since its inter-quartile range is smaller. The timings regarding Roller are good since most rules do not take more than a tenth of a second to extract information, and its inter-quartile range is also very small with regard to the best-ranked proposals, but its mean time is slightly larger than in the case of RoadRunner and Trinity. Iman–Davenport’s test returns a p value that is nearly zero, which clearly indicates that there are statistically significant differences in the empirical rank. Hommel’s test returns adjusted p values that are not smaller than the standard significance level regarding the comparisons of Trinity, which is the best-ranked proposal according to the empirical ranking, with Roller, RoadRunner, and FiVaTech. This means that the experimental data does not provide enough evidence to conclude that there is a statistically significant difference between Trinity, Roller, RoadRunner, and FiVaTech regarding extraction times, that is, they all rank at the first position. The test, however, finds enough evidence to reject the hypothesis that the previous proposals and the others behave similarly regarding extraction time. These results are very important, because they confirm that the rules that Roller learns are very competitive regarding efficiency.

As a conclusion, our experiments support the idea that Roller is very efficient. It is not the best performing regarding learning times, but it still lies within the range of seconds, which we do not think is a serious shortcoming from a practical point of view. However, the rules that it learns are as efficient as the rules that other state-of-the-art proposals can learn, which makes them competitive from a practical point of view. The reason why Roller takes a little more time to learn a rule than other proposals is that it has to create several training sets and then apply the base learner several times; its efficiency clearly depends on how effective the base learner is. Anyway, we think that the efficiency results are quite reasonable and that its superiority regarding effectiveness clearly compensates for its slightly worse performance.

5 Related work

In this section, we first provide an overview of the proposals in the literature that were specifically tailored to learning Web information extraction rules; we then delve into propositio-relational machine-learning proposals, which are closely related to ours, but have not been explored so far in our context; finally, we compare them with ours from a conceptual point of view.

5.1 Specific-purpose proposals

There are literally hundreds of proposals that were specifically tailored to learning Web information extraction rules in the context of semi-structured Web documents [11, 56]. Thus, we restrict our attention to those that pioneered a new research path.

Many authors have devised techniques that work on the text of the input documents, namely: Kushmerick et al. [44] presented a proposal that learns two patterns of tokens that characterise the left and the right context of the information to extract; Hsu and Dung [34] presented a proposal that relies on using automata to model the structure of the information and regular patterns to control the transitions amongst states; Chidlovskii [12] and Muslea et al. [49] also explored the idea of learning automata and patterns; Crescenzi and Mecca [13] and Crescenzi and Merialdo [14] explored learning regular expressions to extract information; Chang and Kuo [10] explored a multiple string alignment technique; Arasu and Garcia-Molina [2] presented other proposals to learn regular expressions; and Sleiman and Corchuelo [55, 58] presented two proposals that are based on multi-string alignment techniques.

There are also many authors who have devised techniques that work on the DOM tree representation of the input documents, namely: Hogue and Karger [32] presented a proposal that is based on tree similarity; Park and Barbosa [50] devised a technique that combines tree matching and clustering; Shen and Karger [53] devised a heuristic-based proposal; Álvarez et al. [1] devised a proposal that relies on clustering, tree matching, string matching, and string alignment; Su et al. [61] presented a proposal that is based on aligning DOM trees using a maximum entropy model; and Kayed and Chang [38] introduced a technique that first learns an information schema and then a context-free grammar using a tree similarity and a tree alignment technique.

The previous techniques work on the documents themselves, that is, on their tokens or their nodes. A few authors have explored transforming the tokens or the nodes into vectors of attributive features that are related to others by means of relational features. (See Reference [19] for details on an alternate first-order representation of documents, not necessarily semi-structured Web documents.) Such a representation makes it possible to use techniques that got inspiration from inductive logic programming. Soderland [59] and Califf and Mooney [9] pioneered this research path with two proposals that learn ground first-order rules that work on the textual representation of the input documents; Bădică et al. [8] presented a technique that learns first-order rules with variables by applying the FOIL system to a first-order tree-based representation of the input documents. The previous techniques rely on quite a limited catalogue of built-in features; Freitag [24], Irmak and Suel [35], and Fernández-Villamor et al. [20] worked on proposals that learn first-order rules using open catalogues of features.
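To make this representation concrete, the following is an illustrative Python sketch of how a DOM node might be encoded as a vector of attributive features linked to its neighbours by relational features; the feature names are hypothetical and do not correspond to the actual catalogue of any of the surveyed proposals.

    # Illustrative relational representation of DOM nodes; all feature
    # names are hypothetical examples, not an actual catalogue.
    from dataclasses import dataclass, field
    from typing import Optional

    @dataclass
    class Node:
        # Attributive features: describe the node itself.
        tag: str                      # HTML tag, e.g., 'td'
        depth: int                    # depth in the DOM tree
        ratio_letters: float          # ratio of alphabetic characters
        text_length: int
        # Relational features: link the node to its neighbourhood.
        parent: Optional['Node'] = None
        left_sibling: Optional['Node'] = None
        right_sibling: Optional['Node'] = None
        children: list = field(default_factory=list)
        # Annotation: the slot to extract, or None for negative examples.
        label: Optional[str] = None

A first-order rule can then constrain both a node's own attributive features and the attributive features of the nodes that are reached through its relational features, e.g., "extract n if n.tag is 'td' and n.parent.tag is 'tr'".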

5.2 Exploring propositio-relational learning

Inductive logic programming is a natural approach to deal with relational data. Unfortunately, it is inefficient when datasets grow in the number of examples or features because the search space is typically huge [7, 8, 23, 24, 48]. This has motivated some authors to work on adapting efficient propositional techniques so that they can work on relational data. The proposals in the literature can be broadly classified as follows [29, 40]: upgrading, flattening (aka. propositionalisation), and multiple view.

Upgrading proposals rely on a conventional propositional learner that is upgraded to deal with relational features. Some proposals upgrade a propositional learner with the ability to learn first-order rules, namely: TILDE [6] upgrades C4.5, SCART [41] upgrades CART, RIBL [18] and RIBL2 [33] upgrade k-NN, Cumby and Roth [15] and Gärtner et al. [26] upgrade kernel methods, PRM [28, 36] upgrades Bayesian networks, SLP [47] upgrades stochastic grammars, and 1BC and 1BC2 [21] upgrade Bayesian classifiers. Unfortunately, these proposals did not prove to be efficient enough [29], which motivated other authors to work on so-called relational database proposals that transform the original problems into SQL representations that can be handled more efficiently with commodity database management systems. There are two approaches in the literature: the first builds on the selection graph model and includes MRDTL [3], which builds on TILDE but represents the data in SQL, and MRDTL2 [3], which is an optimised version of MRDTL that can also handle missing attributes using a proposal based on Naive Bayes classifiers; the second builds on a technique called tuple ID propagation, which basically attempts to join related vectors virtually; for instance, CrossMine [64] and GraphNB [65] follow this approach by extending FOIL [51] and a Bayesian classification algorithm, respectively.

Flattening proposals convert relational data into table-based representations to which standard propositional techniques can be applied. There are two approaches in the literature: creating universal vectors that join all of the data in the training sets, which was pioneered by LINUS [17], DINUS [45], and SINUS [43], or creating vectors that summarise and/or aggregate the data in the neighbourhood of every vector, e.g., RollUp [39] and RELAGGS [42].
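As a rough illustration of the second approach, the following sketch aggregates the numeric features of each vector's neighbourhood in the spirit of RELAGGS; the table layout and column names are hypothetical and do not reproduce any particular system.

    # Hypothetical example of aggregation-based flattening: the children of
    # every parent node are summarised into fixed-length columns.
    import pandas as pd

    parents = pd.DataFrame({'node_id': [1, 2],
                            'tag': ['tr', 'tr']})
    children = pd.DataFrame({'parent_id': [1, 1, 2],
                             'text_length': [12, 7, 30],
                             'tag': ['td', 'td', 'th']})

    aggregated = children.groupby('parent_id').agg(
        text_length_mean=('text_length', 'mean'),
        text_length_max=('text_length', 'max'),
        child_count=('tag', 'count'))        # categoric features: counts only

    flattened = parents.merge(aggregated, left_on='node_id',
                              right_index=True, how='left')

Note how the categoric features of the children survive only as counts, which is precisely the loss of classification power that we discuss below.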

Guo and Viktor [29] devised the only multiple view proposal of which we are aware. It relies on a meta-learning approach that can learn from multiple views of the data, that is, multiple subsets of data that result from projecting the data using different feature subsets, and then integrates the results using a novel technique that avoids the complex pre-processing required by flattening proposals.

Although propositio-relational proposals seem very well-suited to the problem of learning information extraction rules, this remains an almost unexplored research path. The only exception is the work by Sleiman and Corchuelo [57], who devised a proposal that hybridises finite automata and neural networks; the states of the automata represent the information to be extracted and the transitions represent the next-token relational feature; the transitions are controlled by means of neural networks that recognise token patterns building on simple features such as their HTML tags or their lexical classes.

5.3 Discussion

Typically, researchers who are interested in Web information extraction have designed ad hoc proposals that are specifically tailored to this problem, which has led to a variety of alternatives. Although many of them were proven to be very effective and efficient, the problem is that they cannot leverage the many advances in the field of machine learning; neither can the general machine-learning field easily benefit from them. Furthermore, many of them have faded away quickly as their inherent assumptions about the structure of documents have become obsolete as the Web has evolved. Unfortunately, they could not be easily adapted to deal with such evolution because this would have required reworking them, that is, devising completely new proposals. Some of the ad hoc proposals that we have surveyed work on the textual representation of the input documents and their goal is to characterise the left and the right context of the information to extract; others work on their DOM tree representations and their goal is to characterise the path from the root node to the nodes that provide the information to be extracted.

Contrary to the previous proposals, Roller can leverage many machine-learning techniques in the literature and benefit from the advances in this field. Furthermore, it is based on an open catalogue of features that can be easily extended and adapted as the Web evolves, without changing the proposal itself. Neither does Roller attempt to characterise the left or the right context of the information to extract, nor the path from the root to the nodes that provide it; instead, it tries to characterise a context in the DOM tree. Note that this context may involve tokens in disparate positions, not necessarily on the left or the right, as well as tokens that are not on the same path to the root node, e.g., siblings or children of siblings.

A few authors have explored using techniques that got inspiration from inductive logic programming, since the tokens or the DOM nodes of semi-structured documents can be naturally represented as relational data. Their proposals are expected to be easier to adapt as the Web evolves, since it is their catalogues of features, not the proposals themselves, that need to be adapted. In general, they can achieve high effectiveness at the cost of efficiency. They explore an unbounded context, which does not restrict them to the left or the right context or to nodes within a given path, as was the case for the ad hoc proposals. Unfortunately, they use the same heuristic to guide the search through both attributive and relational features; furthermore, although all of the features are analysed together at every step, only one of them is selected to grow the rule being learnt, which typically results in a problem called myopia. The reason is that when a relational feature is selected, the attributive features of the target node are not taken into account; in other words, there are cases in which a decision to explore a neighbouring node may lead to a local minimum. Except for L-Wrappers, none of the proposals that we have surveyed can backtrack to explore other choices. Note, too, that L-Wrappers is the only proposal that advocates transforming the problem of Web information extraction into a first-order knowledge base and then learning extraction rules using an inductive logic programming system like FOIL. This proved not to be efficient enough, even with relatively simple documents. This problem was first pointed out by Freitag, who suggested that learning from a first-order representation would simply be too inefficient. Our experimental results using Aleph prove that the problem can be tackled using a pure inductive logic programming technique, but the rules learnt are not the most effective and they are inefficient.

Roller also works on a relational representation of the input documents that builds on an open catalogue of features that can easily evolve as the Web evolves, without making a change to the proposal itself. Furthermore, it relies on a propositional base learner that can be integrated into our proposal without changes; that is, it can benefit from the advances in the general field of machine learning. Our experiments prove that Roller is very effective and efficient. This is because it relies on a propositional learner to analyse the attributive features of the nodes to extract and then explores their context using relational features in an attempt to find neighbouring nodes whose attributive features can contribute to learning a better rule. Furthermore, two different search heuristics are involved: one is provided by the base learner, which is ad hoc and was designed to guide the search through attributive features as effectively and efficiently as possible; the other was designed to guide the search through the relational features and helps explore the context as effectively and efficiently as possible. Roller also reduces myopia because it deals with all of the attributive features at the same time, not one after the other as was the case for the existing proposals; furthermore, the decision on which relational feature to explore next does not depend only on that feature itself, but also on the attributive features of the target nodes. Obviously, this does not solve myopia completely, but our experiments prove that it reduces the odds of making wrong decisions; we explored using backtracking, but our experiments proved that the mechanism was not actually necessary, so we decided not to include it in the final version of Roller.

Since information extraction problems can be naturally represented using relational data, one might think that it would be easy to leverage a proposal from the field of propositio-relational learning. Unfortunately, few such proposals exist in the literature since there are a number of intrinsic problems.

According to Guo and Viktor [29], upgrading proposals are generally not scalable enough, chiefly those that rely on inductive logic programming approaches, and they cannot generally achieve high effectiveness when they deal with numeric data, which is very common in our context, e.g., depth of a node, number of children, font size, ratio of letters or figures, text length, coordinates, and the like. Relational database proposals are more efficient because they rely on a database management system, but they do not seem easy to adapt to the problem of information extraction because they rely on a full-fledged relational schema, that is, they were designed to deal with actual relational databases that build on a rich data schema that includes information about every attribute, primary keys, foreign keys, and so on; in other words, they are schema-driven proposals. In our context, there is no such schema, which requires the proposals to be instance-driven, that is, they must explore the context of every instance individually, without an explicit schema.

The existing multi-view proposal in the literature improves on efficiency, but it does not seem appropriate in our context because it is based on aggregating neighbouring vectors. Numeric features are aggregated using the standard SQL aggregation functions (sum, average, minimum, maximum, standard deviation, and count), but categoric features are aggregated using counts only. That means that the classification power that those features can provide is lost, and they are very common in our context, e.g., font family, colour, horizontal alignment, floating specification, and the like; furthermore, it does not take into account that attributes in disparate nodes can contribute to obtaining a good rule.

Flattening proposals require much computation to flatten the datasets to be analysed, and the resulting vectors may have an arbitrarily large number of components, which hinders the applicability of many learners in practice; some of the proposals require data to be duplicated, which increases statistical skewness, whereas others require data to be aggregated, which implies that data distributions are neglected; furthermore, they need to put a limit on the amount of context that can be explored because the context of the data is not explored on demand, but pre-computed.

Roller naturally fits within the category of flattening proposals, but it differs significantly from the existing ones: instead of pre-processing the vectors in an attempt to make the context of every node explicit, it first tries to learn a perfect rule building solely on the attributive features of the nodes to be extracted; if no such rule can be learnt, then it explores the context by means of the available relational features, which involves flattening the vectors that correspond to the nodes being analysed together with the vectors that correspond to their neighbours. This results in a dynamic flattening proposal that has proven to work very well in practice according to our experiments. Note that, in contrast to other existing proposals, no aggregation of data is required; Roller works on the attributive features themselves, which implies that no classification power is lost in the flattening process.
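The following is a minimal sketch that conveys this dynamic flattening idea under simplifying assumptions. The helper functions (attributive_vector, neighbour), the greedy search loop, and the perfect-rule check are hypothetical simplifications intended only to illustrate the flow, not Roller's actual implementation.

    # Hedged sketch of dynamic flattening; all helpers are hypothetical.
    def learn_rule(nodes, labels, base_learner, scorer, relational_features,
                   attributive_vector, neighbour, max_hops=3):
        # nodes: candidate DOM nodes; labels: user-provided annotations.
        # base_learner: callable that trains a propositional classifier ("rule").
        # scorer: callable that scores a rule on a training set (1.0 = perfect).
        vectors = [attributive_vector(n) for n in nodes]   # attributive only
        rule = base_learner(vectors, labels)
        context = []                                       # committed relational hops

        while scorer(rule, vectors, labels) < 1.0 and len(context) < max_hops:
            best = None
            for feature in relational_features:            # e.g., parent, left sibling
                # Dynamic flattening: extend every vector, on demand, with the
                # attributive features of the neighbour reached through `feature`.
                extended = [v + attributive_vector(neighbour(n, context + [feature]))
                            for v, n in zip(vectors, nodes)]
                candidate = base_learner(extended, labels)
                score = scorer(candidate, extended, labels)
                if best is None or score > best[0]:
                    best = (score, feature, extended, candidate)

            _, feature, vectors, rule = best
            context.append(feature)                        # commit to the best hop

        return rule, context

Note that the flattening happens only along the relational features that the search actually decides to explore, which is what distinguishes this dynamic scheme from the pre-computed flattening performed by the proposals surveyed above.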

6 Conclusions

In this article, we have introduced Roller, which is a new proposal to learn Web information extraction rules in the context of semi-structured Web documents.

It is a highly configurable proposal: it relies on an open catalogue of both attributive and relational features, which helps adapt it as the Web evolves; furthermore, it does not commit to a specific base learner or rule scorer, but can leverage many proposals in the literature and thus benefit from the continuous advances in the general field of machine learning. This clearly deviates from the many existing ad hoc proposals in the literature and from the few existing proposals that are based on inductive logic programming techniques. Technically, the learner that underlies our proposal relies on a search procedure that uses a new dynamic flattening technique to explore the context of the nodes that provide the information to be extracted; our survey of the literature proves that this is a novel approach to the problem.

We have conducted a series of experiments on a collection of 54 real-world datasets. The experiments confirm that our proposal is very effective and efficient in practice. It can outperform state-of-the-art proposals in terms of effectiveness and it is very competitive in terms of efficiency; although it is a little less efficient than other proposals regarding learning times, it can still learn a rule in a matter of seconds, which we do not think is a serious shortcoming; the rules that it learns can, however, be executed as efficiently as the rules learnt by other state-of-the-art proposals.

Our results clearly support our idea that using standard machine-learning techniques to learn Web information extraction rules is a promising approach. Note that this clearly deviates from the existing proposals in the literature, which build on ad hoc machine-learning techniques that were specifically tailored to the problem of learning Web information extraction rules. They have proved to be very effective, but the problem is that they tend to fade away because their learning components are not clearly differentiated, which makes it difficult to evolve them as the Web evolves and precludes reusing the many advances that are published in the general field of machine learning. In contrast, Roller relies on a standard machine-learning technique and a standard rule scorer, which are reused as is, and an open catalogue of features, which can be easily extended. This proves that it makes sense to keep working on general-purpose machine-learning techniques instead of devising new ad hoc techniques.