1 Introduction

Web tables are concise and effective knowledge containers, rich in structured and semi-structured knowledge. A Google study shows that more than 150 million high-quality relational tables exist on the Web [1]. A large-scale survey of tables on a large crawl of the Web found that 75 % of the pages contain at least one table, with an average of 9.1 tables per document [2].

Knowledge extraction from Web tables is an important way to obtain high-quality knowledge, with many applications in knowledge acquisition, information retrieval, question answering, Web mining and so on. It has therefore become a hot topic in the field of knowledge acquisition in recent years.

An important feature of Web tables is that they are presented in the form of a DOM tree, composed of various combinations of HTML tags. An HTML table usually begins with a <table> tag and uses tags such as <tr>, <td> and <th> as its child nodes. Therefore, if we obtain an HTML subtree that begins with <table> and ends with </table>, we obtain a table that may carry knowledge. However, these tags are sometimes problematic, because they are often used for purely decorative purposes. A number of researchers have focused on this problem of identifying genuine Web tables [3, 4].

Once a genuine table has been obtained, extracting knowledge from it becomes another important research question. This article provides a new unsupervised method that, based on the characteristics of the DOM tree, uses tree similarity to measure the similarity between tables and then clusters the tables into classes. Finally, the method extracts knowledge from each table class separately.

The rest of this paper is organized as follows. Section 2 presents an overview of the related work. Section 3 gives a comprehensive analysis of Web tables from the perspectives of our experience and practical applications. Section 4 describes our knowledge extraction method based on DOM tree similarity. Section 5 presents the experimental results, and Sect. 6 concludes the paper.

2 Related Work

Structured and semi-structured tables have long been used as an important source of knowledge for constructing large-scale knowledge bases and graphs. DBpedia [5] and YAGO [6] are built on Wikipedia infoboxes and other structured data sources. Google’s Knowledge Vault also uses HTML tables as a source for knowledge fusion [7].

As with knowledge extraction from unstructured data, we want to obtain terms of entities, concepts and semantic classes, along with their relationships, attributes and attribute values, from Web tables. Table knowledge extraction can be roughly divided into two tasks: one is to detect the critical elements of the table, and the other is to discover the content structure of the table.

By detecting the critical elements of a table, we can infer the distribution of knowledge within the table. With the help of a universal probabilistic taxonomy called Probase, Wang [8] presented a framework based on a table header detector and an entity detector to harvest knowledge from HTML tables. Nagy [9] provided a trainable critical cell location algorithm for extracting and factoring header paths and RDFs. Dalvi [10] extracted concept-instance pairs by clustering similar terms found in HTML tables and then assigned concept names to these clusters using Hearst patterns. Limaye [11] proposed a system that uses an existing catalog and type hierarchy to annotate table columns and cells. Muñoz and Hogan [12] described a novel method to extract facts by using an existing Linked Data knowledge base to find existing relations between entities in Wikipedia tables, suggesting that the same relations may hold for other entities in analogous columns on different rows.

By discovering the content structure of a table, we can infer the semantic structure of the table. The content structure is mainly reflected in the schema of the critical elements of the table. Chen [13] used the similarity between cells, rows and columns, based on three metrics (string similarity, named entity similarity and number category similarity), to identify the schema of the table and captured attribute-value relationships among table cells. Pivk and Cimiano et al. presented the TARTAR model, which can normalize a table into a logical matrix [14]. In their earlier work [15], they distinguished between two functional types of cells in HTML tables, i.e. attribute cells and instance cells, through the core steps of their methodology: (1) cleaning and normalization, (2) structure detection, (3) building of the Functional Table Model (FTM), and (4) semantic enriching of the FTM. Cafarella described the WebTables system, which extracts relational information from HTML tables using attribute correlation statistics and provides schema auto-completion and attribute synonym finding for retrieving more similar tables [1]. A similar work was presented in [17]. Adelfio extracted the schemas of HTML tables (attribute names, values, data types, etc.) using a classification technique based on conditional random fields [18].

The above two approaches are directed at one specific table in isolation; some other methods also use the information around the table to improve extraction performance. Wang and Phillips [16] presented a table structure understanding algorithm based on optimization methods and improved the table detection result by optimizing the whole-page segmentation probability. Govindaraju [19], inspired by a simple human study, experimentally evaluated the idea that context information around the table can be combined with the table content to extract more, and higher-quality, entity relationships.

Our approach focuses more on the formal structure of tables. We define the DOM tree similarity between tables and use it to cluster tables that have similar semantic structures. This makes our method domain-independent. Unlike most of the methods mentioned above, our method does not need external knowledge bases to understand a table.

3 Analysis of Web Tables

Before diving into the technical details, we present some analysis of Web tables first.

3.1 Web Tables Classification: An Experiential Analysis

Webpages contain a large number of tables, and unlike Web texts, tables are relatively structured information that organizes knowledge in a concise form. A well-structured relational table, such as a Wikipedia infobox, is organized by entities and their relationships, and extracting knowledge from it is much easier.

In this paper, we define a piece of knowledge as a triple of the form <S, P, O>, where S is the subject, P is a predicate or relation and O is its object. For example, triples such as <China, Capital, Beijing> and <China, Largest city, Shanghai> can be extracted from the infobox table in Fig. 1.

Fig. 1. An example of Wikipedia infoboxes (https://en.wikipedia.org/wiki/China)

However, in many cases the tables we deal with in this paper are much more complicated than infobox tables consisting of “Attribute-Value” pairs. Figure 2 shows an example of a common relational table. Multiple relationships and facts can be extracted from it, but not in the same way as from infoboxes.

Fig. 2. An example of complicated Web tables

In a good summary, [20] provided a taxonomy of table types along with their frequencies on the Web, as shown in Fig. 3. A similar taxonomy was presented in [2]. However, this taxonomy is of limited practical use.

Fig. 3. A taxonomy of Web tables

Visually, a table reveals the organizational structure of its data. There is a certain degree of structural similarity between different tables, and tables with a similar structure contain similar kinds of entity relationships. So if we know that a group of tables share the same structure, we can extract the knowledge from all of them in the same way.

3.2 Formal Structure and Similarity of Web Tables

Different styles of table structure come from “rowspan” and “colspan”, but the structure of the DOM tree does not change; the trees differ only in HTML tag attributes, such as <td colspan="3">. More importantly, tags such as <th> and <caption> also affect the form of the semantic structure, and the distribution of these tags in the tree has a great impact on the similarity of DOM trees.

As shown in Fig. 4, the semantic structure of tables (a) and (b) is the same from the table-style perspective. But when we use the traditional methods [21–23] to measure the similarity between tables (a) and (b), we get the results shown in Table 1.

Fig. 4. Some common Web tables

Table 1. Traditional methods to measure the similarity between three tables in Fig. 4

Actually, as shown in Fig. 4, table (b) has three more rows than table (a), and table (c) has two more columns than table (a); that is to say, DOM tree (b) has three more <td> leaf nodes than DOM tree (a). Likewise, DOM tree (c) has two more <th> and eight more <td> leaf nodes than DOM tree (a).

From Fig. 5, we can see that the difference between the various DOM trees lies in the number of repeated or identical substructures, such as the content marked with the red border. The methods in the literature [21–23] cannot handle repeated substructures, which is why the similarity results in Table 1 range from 0.4 to 0.6.

Fig. 5. DOM trees for tables (a), (b) and (c) in Fig. 4, respectively

From the analysis above, we can assume that the number of identical substructures does not affect the inherent global structure of a table; it is merely a continuation of the function of some elements of the structure. Intuitively, the continuation of the function of HTML tags is like the extension of phonemes in speech. When we repeat an utterance, the speech rate may vary, but the phonemes the utterance contains stay the same. In other words, if we describe the same utterance spoken at different speeds by phoneme and time features, obtaining different speech waveform sequences, and then compute the similarity of these sequences, the result should be that they are very similar or identical. As shown in Table 1, our method measures the similarities between tables (a) and (b), between (a) and (c), and between (b) and (c) all as 1.0, as we expect.

4 Knowledge Extraction from Web Tables Based on DOM Tree Similarity

4.1 Acquiring Web Tables

A table is presented in the form of a DOM tree, which is represented by DOM tags. Usually, we can obtain a table through the tag <table>, whether it is a genuine table or not. We believe that tables used for decorative or layout purposes may also have a certain fixed structure, so they can be handled by classification without separately analyzing whether each table is genuine. We use an HTML parser to get the tables and retain their tree structure.
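
As an illustration, a minimal sketch of this acquisition step is given below; BeautifulSoup is an assumed choice, since the paper does not name a concrete parser:

```python
from bs4 import BeautifulSoup

# A minimal sketch of table acquisition, assuming BeautifulSoup as the
# HTML parser (the paper does not name a concrete implementation).
def acquire_tables(html: str):
    """Return every <table> element on the page, keeping its DOM subtree."""
    soup = BeautifulSoup(html, "html.parser")
    # Each returned element keeps its full subtree, so child tags such as
    # <tr>, <td> and <th> remain navigable for the later similarity steps.
    return soup.find_all("table")
```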

4.2 Defining the Similarity

Dynamic time warping (DTW) is a sequence alignment method for comparing the similarity of two sequences [24]. It is widely used in speech recognition to handle differing speech rates. To some extent, the diversification of tables is the result of such “warping”. Therefore, based on our observations, we introduce the DTW method to measure the similarity between two DOM trees.

Let \( Table_{i} \) and \( Table_{j} \) be two Web tables, represented by DOM \( Tree_{i} \) and \( Tree_{j} \), respectively. We let the tag <table> be the root node of the tree; traversing from the root to a leaf node \( i \) then yields a \( path_{i} \). For example, a path can be represented as <table> ~ <tbody> ~ <tr> ~ <td> ~ <text>, where <text> represents only the text content of the leaf node. The number of leaf nodes of a tree equals the number of paths it has. Comparing the similarity of two trees is thus a comprehensive comparison of the similarity between their paths.
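
Continuing the BeautifulSoup assumption above, the following sketch collects these root-to-leaf tag paths; the leaf-tag set follows the coarse-grained choice of Sect. 4.3:

```python
from bs4.element import Tag

LEAF_TAGS = {"caption", "td", "th"}  # coarse-grained leaf nodes (Sect. 4.3)

def leaf_paths(table: Tag, leaf_tags=LEAF_TAGS):
    """Collect one tag path per leaf, e.g. ['table', 'tbody', 'tr', 'td', 'text']."""
    paths = []

    def walk(node, prefix):
        prefix = prefix + [node.name]
        if node.name in leaf_tags:
            paths.append(prefix + ["text"])  # 'text' stands for the cell content
            return
        for child in node.children:
            if isinstance(child, Tag):       # skip raw text between tags
                walk(child, prefix)

    walk(table, [])
    return paths
```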

Suppose \( Tree_{i} \) has \( m \) paths and \( Tree_{j} \) has \( n \) paths. Let \( \upgamma\left( {path_{i} ,path_{j} } \right) \) be the string edit distance [21] of two paths, and let \( \left| {path_{i} } \right| \) and \( \left| {path_{j} } \right| \) be the lengths of \( path_{i} \) and \( path_{j} \). Let \( \uppsi\left( {path_{i} ,path_{j} } \right) \) be the similarity between \( path_{i} \) and \( path_{j} \), which we obtain by normalization:

$$ \uppsi\left( {path_{i} ,path_{j} } \right) = \frac{{\upgamma\left( {path_{i} ,path_{j} } \right) - Min\left( \gamma \right)}}{Max\left( \gamma \right) - Min\left( \gamma \right)} $$
(1)

where \( {\text{Min}}\left(\upgamma \right) = - {\text{a}} \times \hbox{min} \left\{ {\left| {path_{i} } \right|,\left| {path_{j} } \right|} \right\} - {\text{b}} \times \left| {\left| {path_{i} } \right| - \left| {path_{j} } \right|} \right| \) and \( {\text{Max}}\left(\upgamma \right) = {\text{a}} \times \hbox{min} \left\{ {\left| {path_{i} } \right|,\left| {path_{j} } \right|} \right\} \). Here \( {\text{a}} \) denotes the weight of the string edit operation “match”, and \( {\text{b}} \) denotes the weight of the operations “delete” and “insert”. When comparing two paths, we only need to determine whether each tag of \( path_{i} \) matches a tag of \( path_{j} \) or not.
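
Below is a sketch of Formula (1) under one plausible reading of the alignment score \( \upgamma \): a matching tag contributes \( +a \) and an insertion or deletion contributes \( -b \). The concrete weights are assumptions, as the paper does not report values:

```python
# Sketch of Formula (1): normalized similarity psi between two tag paths.
def path_similarity(p, q, a=1.0, b=1.0):
    m, n = len(p), len(q)
    score = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        score[i][0] = -b * i                      # delete all of p[:i]
    for j in range(n + 1):
        score[0][j] = -b * j                      # insert all of q[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            best = max(score[i - 1][j] - b,       # delete p[i-1]
                       score[i][j - 1] - b)       # insert q[j-1]
            if p[i - 1] == q[j - 1]:              # tags either match or they do not
                best = max(best, score[i - 1][j - 1] + a)
            score[i][j] = best
    gamma = score[m][n]
    g_min = -a * min(m, n) - b * abs(m - n)       # Min(gamma), as in the text
    g_max = a * min(m, n)                         # Max(gamma), as in the text
    psi = (gamma - g_min) / (g_max - g_min)
    return max(0.0, min(1.0, psi))                # clamp degenerate cases to [0, 1]
```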

After calculating the similarity of two paths, we convert it into the distance between the two paths. By applying DTW with the minimum cumulative distance, we can achieve the best match between two DOM trees, each composed of its leaf paths. Formula (2) below shows how we calculate the cumulative distance between two DOM trees:

$$ \Phi \left[ {i,j} \right] = \phi \left( {path_{i} ,path_{j} } \right) + min\{\Phi \left[ {i,j - 1} \right],\Phi \left[ {i - 1,j} \right],\Phi \left[ {i - 1,j - 1} \right]\} $$
(2)

In the formula, \( \Phi \left[ {i,j} \right] \) denotes an \( \left( {m + 1} \right) \times \left( {n + 1} \right) \) matrix, where \( m \) and \( n \) denote the numbers of paths of \( Tree_{i} \) and \( Tree_{j} \), respectively. \( \Phi \left[ {i,0} \right] \) and \( \Phi \left[ {0,j} \right] \) are initialized to \( \infty \), \( \Phi \left[ {0,0} \right] = 0 \), and \( \phi \left( {path_{i} ,path_{j} } \right) = 1 -\uppsi\left( {path_{i} ,path_{j} } \right) \). Thus \( \Phi \left[ {m,n} \right] \) denotes the minimum cumulative distance, i.e. the distance between the two trees. Let \( N \) be the number of nodes on the warping path. To avoid the influence of the tree size and the warping-path length, we obtain the similarity between two trees by normalizing:

$$ {\text{Sim}}\left( {Tree_{i} ,Tree_{j} } \right) = 1 -\Phi \left[ {m,n} \right]/N $$
(3)
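
A sketch of Formulas (2) and (3) follows, reusing path_similarity from the sketch above; the backtracking step recovers \( N \), the number of nodes on the optimal warping path:

```python
import math

# Sketch of Formulas (2) and (3): DTW over the two trees' leaf-path
# sequences, with local distance phi = 1 - psi.
def tree_similarity(paths_i, paths_j):
    m, n = len(paths_i), len(paths_j)
    Phi = [[math.inf] * (n + 1) for _ in range(m + 1)]
    Phi[0][0] = 0.0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            d = 1.0 - path_similarity(paths_i[i - 1], paths_j[j - 1])
            Phi[i][j] = d + min(Phi[i][j - 1], Phi[i - 1][j], Phi[i - 1][j - 1])
    # Backtrack to count N, the number of nodes on the optimal warping path.
    i, j, N = m, n, 1
    while (i, j) != (1, 1):
        steps = [(i - 1, j - 1), (i - 1, j), (i, j - 1)]
        i, j = min((s for s in steps if s[0] >= 1 and s[1] >= 1),
                   key=lambda s: Phi[s[0]][s[1]])
        N += 1
    return 1.0 - Phi[m][n] / N                    # Sim(Tree_i, Tree_j)
```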

4.3 Table Clustering

Having established a similarity measure between two trees, we can obtain the similarity between two tables and use it for cluster analysis. We adopt the method described in [25], which is based on the idea that cluster centers are characterized by a higher density than their neighbors and by a relatively large distance from points of higher density. For each table (i.e. \( Tree_{i} \)), two quantities are computed: its local density \( \rho_{i} \) and its distance \( \delta_{i} \) from points of higher density. Both quantities depend only on the distances \( d_{ij} \) between any two tables. The local density \( \rho_{i} \) of table \( i \) is defined as \( \rho_{i} = \mathop \sum \limits_{j} \chi \left( {d_{ij} - d_{c} } \right) \), where \( \chi \left( x \right) = 1 \) if \( x < 0 \) and \( \chi \left( x \right) = 0 \) otherwise, and \( d_{c} \) is a cutoff distance. The main contribution of our paper is the table similarity of Formula (3), from which we obtain the distance \( d_{ij} = 1 - Sim\left( {Tree_{i} ,Tree_{j} } \right) =\Phi \left[ {m,n} \right]/N \). We first compute the similarity between every pair of tables to be clustered and convert it into the distance \( d_{ij} \), then apply the clustering algorithm of [25] to obtain the table clusters, and finally use the NCMDS method to visualize the result.
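
A sketch of the two per-table quantities of [25], computed from our pairwise distances; the cutoff \( d_{c} \) is a tunable parameter whose default here is only illustrative:

```python
import numpy as np

# D is the symmetric matrix of pairwise table distances
# d_ij = 1 - Sim(Tree_i, Tree_j), with zeros on the diagonal.
def density_peaks_quantities(D: np.ndarray, d_c: float = 0.2):
    n = D.shape[0]
    rho = (D < d_c).sum(axis=1) - 1            # chi(d_ij - d_c), excluding j = i
    delta = np.empty(n)
    for i in range(n):
        higher = np.flatnonzero(rho > rho[i])  # tables of higher density
        delta[i] = D[i, higher].min() if higher.size else D[i].max()
    return rho, delta
```

Cluster centers are then the tables for which both \( \rho_{i} \) and \( \delta_{i} \) are anomalously large, and every remaining table is assigned to the cluster of its nearest neighbor of higher density, as in [25].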

To capture the structure of a table well, we choose the HTML tags <caption>, <td> and <th> as leaf nodes, which carry the corresponding cell content of the table. In practice, however, each cell may be further divided by other tags, such as <span>, <li> and <div>. If we used these tags as leaf nodes too, the number of paths in a DOM tree would increase dramatically, greatly increasing the cost of the \( {\rm O}\left( {mn} \right) \) dynamic programming algorithm. Therefore, we implement a two-stage clustering strategy (see the sketch after the list below):

  1. We use the tags <caption>, <td> and <th> as leaf nodes, and obtain the result of the coarse-grained clustering;

  2. For the tables in each cluster, we use tags such as <span>, <li> and <div> as leaf nodes, appropriately increase the weight of <th> during path alignment to highlight the headers, and obtain the result of the fine-grained clustering.
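
A minimal sketch of the two-stage strategy follows; the cluster() callable is an assumed wrapper around the DTW distance of Sect. 4.2 and the clustering algorithm of [25], and its name and signature are ours, not the paper's:

```python
COARSE_TAGS = {"caption", "td", "th"}
FINE_TAGS = COARSE_TAGS | {"span", "li", "div"}

def two_stage_clustering(tables, cluster):
    # Stage 1: coarse leaf nodes keep the path counts, and hence the
    # O(mn) DTW cost, manageable over the whole corpus.
    coarse_clusters = cluster(tables, leaf_tags=COARSE_TAGS)
    # Stage 2: within each coarse cluster, finer leaf nodes (plus a higher
    # <th> weight during path alignment) separate the semantic structures.
    return [cluster(group, leaf_tags=FINE_TAGS) for group in coarse_clusters]
```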

4.4 Knowledge Extraction

After clustering the tables with similar semantic structures according to their DOM tree similarity, we extract knowledge from each class separately.

Consider a table from one class of Wikipedia tables, shown in Fig. 6; horizontal relational tables such as this are the most common tables in Wikipedia. We know that “Name” and “Position” are labeled with the tag <th>, which marks the header of the table, while all the remaining cells are labeled with the tag <td>. So we can extract knowledge (i.e. facts) such as <David Sablan, Position, National Committeeman>.

Fig. 6. Example of a horizontal relation table (https://en.wikipedia.org/wiki/Republican_Party_of_Guam)

Unlike traditional methods, our method, shown in Fig. 7, does not rely on any external resource such as Linked Data, an existing entity library or a concept-instance base. After manually judging which class a table belongs to, we extract knowledge from it and from the remaining tables of that class with the corresponding extraction method, which was designed based on the semantic structure of the class. Figure 7 shows the extraction method for the class that contains only horizontal relation tables. The algorithm has a complexity of \( {\rm O}\left( {MN} \right) \).

Fig. 7. Algorithm for knowledge extraction
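
Since the paper's exact procedure is given only in Fig. 7, the following is a minimal reconstruction of such an extractor for horizontal relation tables, reusing the BeautifulSoup assumption from Sect. 4.1; it assumes the first row holds the <th> headers and the first cell of each data row is the Subject:

```python
from bs4.element import Tag

# Hypothetical reconstruction of the horizontal-relation extractor
# (cf. Figs. 6 and 7); not the paper's verbatim algorithm.
def extract_horizontal(table: Tag):
    rows = table.find_all("tr")
    if not rows:
        return []
    headers = [th.get_text(strip=True) for th in rows[0].find_all("th")]
    triples = []
    for row in rows[1:]:
        cells = [td.get_text(strip=True) for td in row.find_all("td")]
        if len(cells) != len(headers):
            continue                          # skip malformed rows
        subject = cells[0]                    # e.g. the "Name" column
        for predicate, obj in zip(headers[1:], cells[1:]):
            triples.append((subject, predicate, obj))
    return triples
```

Applied to the table in Fig. 6, this yields facts such as <David Sablan, Position, National Committeeman>.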

5 Experimental Results

Using 5000 Wiki tables extracted at random from Wikipedia as the source corpus, we obtain the clustering results shown in Fig. 8.

Fig. 8. Table clustering results of 5000 tables

As Fig. 8(a) shows, the 5000 Wiki tables are classified into two clusters (4484 and 516) in the first stage. The tables in the first cluster (the red points) are then clustered into three clusters (2679, 1023 and 782) in the second stage, as shown in Fig. 8(b), while the tables in the other cluster (the green points) are clustered into two clusters (408 and 108), as shown in Fig. 8(d). By manual inspection of the results, the 5000 Wiki tables are clustered into relational tables and nested tables in the first stage; the relational tables (4484) are then classified into horizontal relation tables (2679), matrix relation tables (1023) and no-header tables (782), and the nested tables (516) into nested relational tables (408) and decorated tables (108) in the second stage. In particular, the vertical relation tables cannot be well separated from the matrix relation tables by table clustering. We can see that most Wikipedia tables are relational tables, and that tables with decorative purposes can be filtered out, so we do not need to judge the genuineness of tables, as done in [3, 4]. To some extent, the clustering result is close to the empirical taxonomies of [2, 20].

Because each class of tables has a corresponding extraction method based on the semantic structure of that class, the quality of table clustering directly affects the quality of the extracted knowledge. We randomly choose 100 tables from the class of horizontal relation tables as the test corpus and run the preliminary extraction described in Fig. 7; the result shows that roughly 65.5 % of the facts are correct and 34.5 % are incorrect.

An analysis shows that the incorrect facts mainly fall into three cases:

  1. Case 1 (35 %): the table was assigned to the wrong class due to a clustering error. Tables that have a caption (or multi-row headers) are mixed into the horizontal relation tables that have a one-row header, so the Subject (or Predicate) is incorrect. As Fig. 9 shows, the Predicates “2011” and “1970” should be augmented with the captions of their tables (“Ten largest metropolitan areas by population” and “Arizona Racial Breakdown of Population”, respectively).

     Fig. 9. Some facts (<S, P, O>) of the result of knowledge extraction

  2. Case 2 (33 %): the Subject does not appear in the table at all; it may be the title of the Wikipedia article or appear in the context around the table. As Fig. 9 shows, the Subject “Portuguese” should be combined with the table’s context (“Indonesian Language origin”), and the Subject “Jennifer Lewis” should be transformed by the table’s context (“Duke Women’s Soccer program received NSCAA All-American honors”); in other words, these Subjects and the table’s context stand in part-whole relations. The Subject “US Billboard Adult Top 40” should be augmented with the title of the Wikipedia article (“Save Me (Remy Zero song)”).

  3. Case 3 (32 %): the content of a cell is multi-valued or has a complex form. As Fig. 9 shows, the Subject (“att tala, to speak”) and the Objects (“West: 72,600.00 €/East: 62,400.00 €”; “talar, speak(s)”) contain multiple values.

6 Conclusion

With a detailed analysis of table structures of various forms, we design a DTW-based method for calculating the DOM tree similarity between different Web tables. By analyzing the characteristics of DOM trees, we use tree similarity to measure the similarity between tables, cluster the tables into classes, and then extract knowledge from each class separately. Experimental results show that the quality of the extracted knowledge is satisfactory.

Through unsupervised clustering, we can obtain the various types of tables from a corpus of any form, which has greater practical value. Meanwhile, we need neither to ponder the genuineness of the tables nor to rely on existing resources to support the knowledge extraction.

As future work, we will improve the performance of the clustering algorithm to scale to massive data, and we will make full use of the already-extracted knowledge for mutual verification in order to achieve higher accuracy.