Keywords

1 Introduction

A big volume of arbitrary tables (e.g. cross-tabulations, invoices, roadmaps, and data collection forms) circulates in spreadsheet-like formats. Mitlöhner et al. [28] estimate that about 10% of the resources in Open Data portals are labeled as CSV (Comma-Separated Values), a spreadsheet-like format. Barik et al. [2] extracted 0.25M unique spreadsheets from Common CrawlFootnote 1 archive. Chen and Cafarella [6] reported about 0.4M spreadsheets of ClueWeb09 CrawlFootnote 2 archive.

Spreadsheets can be considered as a general form for representing tabular data with an explicitly presented layout (cellular structure) and style (graphical formatting). For example, HTML tables presented on web pages can be easily converted to spreadsheet formats. The arbitrary spreadsheet tables can be a valuable data source in business intelligence and data-driven research. However, difficulties that inevitably arise with extraction and integration of the tabular data often hinder the intensive use of them in the mentioned areas.

Fig. 1.
figure 1

A fragment of a source arbitrary spreadsheet table—a; a fragment of a target table in the relational (canonical) form generated from the source table—b.

Many of arbitrary spreadsheet tables are an instance of weakly-structured and non-standardized data (Fig. 1a). Unlike relational tables (Fig. 1, b), they are not organized in a predefined manner. They lack explicit semantics required for high-level computer interpretation such as SQL queries. To be accessible for data analysis and visualization, their data need to be extracted, transformed, and loaded (ETL) into databases. Since arbitrary tables can have various complex cell layouts (structures), often the familiar industrial ETL-tools are not sufficient to populate automatically a database with their data.

Researchers and developers faced with the tasks of spreadsheet data integration often use general-purpose tools. They often offer their own implementations of the same tasks. In such cases, domain-specific tools can reduce the complexity of software development in the domain of the spreadsheet data integration. This is especially important when a custom software for data extraction and transformation from heterogeneous arbitrary spreadsheet tables is implemented for a short time and with a lack of resources.

1.1 Related Work

There are several recent studies dealing with spreadsheet data converting. The tools [14,15,16, 18, 27, 29, 31] are devoted to issues of converting data presented in spreadsheets or web tables to RDF (Resource Description Framework) or OWL (Web Ontology Language) formats. The solutions for spreadsheet data extraction and transformation [1, 3, 17, 19, 22] are based on programming by examples. Some of the solutions also include own domain-specific languages: XLWrap [27], M\(^{2}\) [31], TableProg [19], and Flare [3].

Hung et al. [20] propose TranSheet, a spreadsheet-like formula language for specifying mappings between source spreadsheet data and a target schema. Embley et al. [13] propose an algorithmic end-to-end solution for transforming “header-indexed” tables in CSV format to a relational form (“category tables”) based on header indexing. Senbazuru, a spreadsheet database management system proposed by [5], provides extracting relational data from spreadsheets (“data frames”). HaExcel framework [10] enables migrating normalized data among spreadsheets and relational databases. MDSheet framework [8, 9] also implements a technique that automatically infers relational schemes from spreadsheets. The framework [7] constructs trained spreadsheet property (e.g. aggregation rows or hierarchical header) detectors based on rule-assisted active learning.

DeExcelerator project [12] aims at the development of a framework for information extraction from partially structured documents such as spreadsheets and HTML tables. The framework exploits a set of heuristics based on features of tables published as Open Data. DeExcelerator works as a predefined pipeline without any user interaction. Koci et al. [23,24,25] expand the covered spectrum of spreadsheets. They proposes a machine learning approach for table layout inference [24] and TIRS, a heuristic-based framework for automatic table identification and reconstruction in spreadsheets [25]. Their recent paper [23] introduces a novel approach to recognition of the functional regions in single- and multi-table spreadsheets.

The recent papers [4, 39] develop domain-specific solutions. The work [39] proposes algorithms and accompanying software for automatic annotation of natural science spreadsheet tables. The tool proposed in [4] extracts RDF data from French government statistical spreadsheets and populates instances of their conceptual model. The system CACheck [11] automatically detects and repairs “smelly” cell arrays by recovering their computational semantics. TaCLe [26] automatically identifies constraints (formulas and relations) in spreadsheets. The papers [41, 42] suggest an approach to rule-based semantic extraction from tabular documents.

1.2 Contribution

Our work shows new possibilities in spreadsheet data transformation from arbitrary to relational tables based on rule-based programming for the following table understanding [21] stages:

  • Role analysis extracts functional data items, entries (values) and labels (keys), from cell content.

  • Structural analysis recovers relationships of functional data items (i.e. entry-label and label-label pairs).

  • Interpretation binds recovered labels with categories (domains).

Our contribution consists in the following results:

  • The two-layered table object model combines the physical (syntactic) and logical (semantic) table structure. Unlike others models, it is not based on using functional cell regions but determines that functional items can be placed anywhere in a table.

  • The domain-specific language, CRL (Cells Rule Language), is intended for programming table analysis and interpretation rules. In contrast to the existing mapping languages, it expresses the spreadsheet data conversion in terms of table understanding.

  • The tool for spreadsheet data extraction and transformation from an arbitrary (Fig. 1a) to the relational (canonical) form (Fig. 1b), TabbyXLFootnote 3, implements both the model and the language. Compared to the mentioned solutions it draws up this process as consecutive steps: role analysis, structural analysis, and interpretation.

  • The CRL interpreter (CRL2J) provides the translation of CRL rules to Java code in the imperative style. It allows automatically generating Java source code from CRL rules and compile it to Java bytecode, and then runs generated Java programs. The interpreter uses CRL grammar implemented by ANTLRFootnote 4. This ensures the correctness of CRL language grammar.

  • The ruleset for transforming arbitrary tables of the same genre (government statistical websites) is implemented in three formats: CRL, DSLR (DroolsFootnote 5), and CLP (JESSFootnote 6). The experimental results are reproduced and validated by using three different options: (i) CRL-to-Java translation; (ii) Drools rule engine; (iii) JESS rule engine. The results (recovered relational tables) are the same for all of these three options. This confirms the applicability of CRL language for expressing table analysis and interpretation rules.

This paper continues our series of works devoted the issues of table understanding in spreadsheets [33, 34, 37, 38]. It combines and significantly expands our approach to table understanding based on executing rules for table analysis and interpretation with a business rule engine [34]. We briefly introduced the preliminary version of our domain-specific rule language first in [33]. The prototype of our tool for canonicalization of arbitrary tables in spreadsheets is discussed in [37]. This work extends the results presented in the paper [38] by adding the CRL interpreter for the CRL-to-Java translation, as well as by the implementation and validation of the ruleset by using the different options: CRL2J, Drools, and JESS.

The novelty of our current work consists in providing two rule-based ways to implement workflows of spreadsheet data extraction and transformation. In the first case, a ruleset for table analysis and interpretation is expressed in a general-purpose rule language and executed by a JSR-94-compatible rule engine (e.g. Drools or Jess). In the second case, our interpreter translates a ruleset expressed in CRL to Java source code that is complicated and executed by the Java development kit. This CRL-to-Java translation allows us to express rulesets without any instructions for management of the working memory such as updates of modified facts or blocks on the rule re-activation. The end-users can focus more on the logic of table analysis and interpretation than on the logic of the rule management and execution.

2 Table Object Model

The table object model is designed for representing both a physical structure and logical data items of an arbitrary table in the process of its analysis and interpretation (Fig. 2). Our model adopts the terminology of Wang’s table model [40]. It includes two interrelated layers: physical (syntactic) represented by the collection of cells (Sect. 2.1) and logical (semantic) that consists of three collections of entries (values), labels (keys), and categories (concepts) (Sect. 2.2). We deliberately resort to the two-way references between the layers to provide convenient access to their objects in table analysis and interpretation rules.

Fig. 2.
figure 2

Two-layered table object model.

2.1 Physical Layer

object represents common features of a cell that can be presented in tagged documents of well-known formats, such as Excel, Word, or HTML. We define object as a set of the following features:

  • Location: —left column, —top row, —right column, and —bottom row. A cell located on several consecutive rows and columns covers a few grid tiles, which always compose a rectangle. Moreover, two cells cannot overlap each other.

  • Style: —font features including: , , etc.; and —horizontal and vertical alignment; and —background and foreground colors; , , , and —border features;  —text rotation.

  • Content: —textual content, —indentation, and —its literal data type ( , , , etc.).

  • Annotation: —a user-defined word or phrase to annotate the cell.

  • Logical layer references: (a set of entries) and  (a set of labels) originated from this cell. Thus, a cell can contain several entries and labels.

For example, Fig. 3 presents an initial state of some cells ( ,..., ) shown in Fig. 1a.

2.2 Logical Layer

object serves as a representation a data value of a table. It consists of the following attributes: —a value (text),  —a set of labels associated with this entry, and  —the physical layer reference to a cell as its origin that serves as data provenance. An entry can be associated with only one label in each category.

represents a label (key) that addresses one or more entries (data values). It is defined as follows: —a value (text),  —a set of labels which are children of this label,  —its parent label, —an associated category, —the physical layer reference to a cell as its origin (data provenance).

models a category of labels as follows: —an internal name,  —a uniform resource identifier representing this category (concept) in an external vocabulary, —a set of its labels. Each label is associated with only one category. Labels combined into a category can be organized as one or more trees.

This layer allows representing items differently depending on target requirements of the table transformation. For example, Fig. 4 demonstrates a possible target state of the entry ( ), labels ( ,...,  ), and categories ( ,...,  ) recovered from the initial cells shown in Fig. 3. They can be presented as a tuple of a target relational Table 1b.

Fig. 3.
figure 3

Some initial facts (cells) for the table shown in Fig. 1a.

Fig. 4.
figure 4

Some recovered facts (functional items) for the table shown in Fig. 1a.

3 CRL Language

The rules expressed in our language are intended to map explicit features (layout, style, and text of cells) of an arbitrary table into its implicit semantics (entries, labels, and categories). Figure 5 demonstrates the grammar of CRL, our domain-specific language, in Extended Backus-Naur form. This grammar is also presented in ANTLR formatFootnote 7.

Fig. 5.
figure 5

Grammar of CRL language.

A rule begins with the keyword and ends with  . A number that follows the keyword determines the order of executing this rule.

figure ba

The left hand side ( ) of a rule consists of one or more conditions that enable to query available facts which are cells, entries, labels, and categories of a table. Each of the conditions listed in the left hand side of a rule has to be true to execute its right hand side ( ) that contains actions to modify the existed or to generate new facts about the table.

3.1 Conditions

We use two kinds of conditions. The first requires that there exists at least one fact of a specified data type, which satisfies a set of constraints:

figure bd

The condition consists of three parts. In their order of occurrence, the first is a keyword that denotes one of the following fact types: , , or . The second is , a variable of the specified fact type. The third optional part begins with the colon character. It defines constraints for restricting the requested facts and assignments for binding additional variables with values. A constraint is a boolean expression in Java. The comma character separating the constraints is the logical conjunction of them. An assignment ( ) sets a value (Java expression) to a variable. A condition without constraints allows querying all facts of specified type. The second kind of conditions determines that there exist no facts satisfied to specified constraints: The first part of these conditions is a keyword for satisfying a type of facts. The second part contains constraints on the facts.

3.2 Cell Cleansing

In practice, hand-coded tables often have messy layout (e.g. improperly split or merged cells) and content (e.g. typos, homoglyphs, or errors in indents). We address several actions to the issues of cell cleansing that can be used as the preprocessing stage.

  • , the action combines two adjacent cells when they share one border.

  • , the action divides a merged cell that spans n-tiles into n-cells. Each of the n-cells completely copies content and style from the merged cell and coordinates from the corresponding tile.

  • , the action provides modifying textual content of a cell. Some string processing (e.g. regular expressions and string matching algorithms) implemented as Java-methods can be involved in the action.

  • , the action modifies an indentation of a cell.

3.3 Role Analysis

This stage aims to recover entries and labels as functional data items presented in tables. We also enable associating cells with user-defined tags (marks) that can assist in both role and structural analysis.

  • , the action annotates a cell with a word or phrase. The assigned tag can substitute the corresponding constraints in subsequent rules. The typical practice is to set a tag to all cells, which play the same role or are located in the same table functional region. Thereafter, we can use these tags in subsequent rules instead of repeating constraints on cell location in the regions.

  • , the action generates an entry, using a specified cell as its origin. Usually, a value of the created entry is an expression obtained as a result of string processing for its textual content of the cell.

  • , the action generates a label in a similar way.

3.4 Structural Analysis

The next stage recovers pairs of two kinds: entry-label and label-label.

  • , the action binds an entry with an added label. A label can be specified as a value of a category indicated by its name.

  • , the action connects two labels as a parent and its child.

3.5 Interpretation

The stage includes actions for recovering label-category pairs.

  • , the action associates a label with a category.

  • , the action places two labels in one group. Arbitrary tables often place all labels of one category in the same row or column. Consequently, we can suppose that the labels belong to a category without defining its name. In the cases, grouping two or more labels means that they all belong to an undefined category. All labels of a group can be associated with only one category.

Fig. 6.
figure 6

Source tables—a and b; target canonicalized tables—c and d.

Fig. 7.
figure 7

A reference ruleset for transforming the source tables (Fig. 6ac) to the target canonical forms (Fig. 6bd).

3.6 Illustrative Example

Figure 6 depicts an example of a transformational task that consists in converting tables similar to ones (a and b) into the relational form (c and d). These tables satisfy the following assumptions: \(1,\dots ,n\) are entries, \(a_1,\dots ,a_m\) are column labels of the category A, \(b_1,\dots ,b_k\) are row labels of the category B. Figure 7 presents a reference ruleset implemented for this task. It contains only 7 rules that are executed in the following order: (1) data cleansing, (2) entry generation, (3) label generation, (4) associating entries with column labels, (5) associating entries with row labels, (6) categorizing column labels, and (7) categorizing row labels.

4 Implementation with Options

TabbyXL implements both the presented table object model and the rule-based approach to spreadsheet data extraction and transformation. It can process data, using one of the following options:

  • CRL2J-option automatically generates Java source code from CRL rules and compile it to Java bytecode, and then runs the generated program. This option requires that ruleset is implemented in CRL language. We use ANTLR, the parser generator, to implement the CRL-to-Java translator. This allows to parse CRL rules and to build their object model which is then translated to Java source code.

  • Drools-option relies on Drools Expert rule engine. The rules can be expressed in DRL (the general-purpose rule language that is native for Drools) or in a dialect of CRL that is implemented as a domain-specific language (DSL) in corresponding of Drools requirements. In the last case, CRL rules presented in DSLR format are automatically translated into DRL format through the DSL-specification that defines CRL-to-DRL mappings. Unlike the pure CRL, this dialect supports DRL attributes in rule declarations.

  • JESS-option executes a ruleset with JESS rule engine. This option requires that a ruleset is represented in the well-known CLP (CLIPS) format.

Moreover, the current version of TabbyXL supports rule engines that are compatible with Java Rule Engine API (JSR94Footnote 8). Therefore, it is possible to use not only Drools or JESS, but also others rule engines supporting Java Rule Engine API, and to represent the rules in their native formats. Any case, TabbyXL builds an instance of the table object model from source spreadsheet data. The cells of this instance are asserted as facts into the working memory of a rule engine. Optionally, some user-defined categories specified in YAML format can also be loaded and presented as facts. The rule engine matches asserted facts against the rules. While rules are executed, the instance of the table object model is augmented by recovered facts (entries, labels, and categories). At the end of the transformation process, the instance is exported as a flat file database.

5 Experimental Results

The purpose of the experiment is to show a possibility of using our tool for tables, which originate from various sources produced by different authors but pertain to the same document genre. The experiment includes two parts: (i) designing and implementing an experimental ruleset for tables of the same genre, and (ii) evaluating the performance of the ruleset on a set of these tables.

Table 1. Experiment results on the role and structural analysis stages.

We used Troy200 [30], the existing dataset of tables, for the performance evaluation. It contains 200 arbitrary tables as CSV files collected from 10 different sources of the same genre, government statistical websites predominantly presented in English language.

We designed and implemented a ruleset that transforms Troy200 arbitrary tables to the relational form, using two formats: CRL and CLP (JESS). We also prepared ground-truth data, including reference entries, labels, entry-label, and label-label pairs extracted from Troy200 tables. Their form is designed for human readability. Moreover, they are independent of the presence of critical cells in tables. The performance evaluation is based on comparing the ground-truth data with the tables generated by executing the presented ruleset for the experiment dataset.

All tables of the dataset were automatically transformed into the relational form, using TabbyXL with three different options: CRL2J, DROOLS, and JESS. The results are the same for all of these three options.

We used the standard metrics: recall, precision, and F-score, to evaluate our ruleset for both role and structural analysis. We adapted them as follows. When R is a set of instances in a target table and S is a set of instances in the corresponding source table, then:

$$\begin{aligned} \begin{aligned} \text{ recall } = \frac{\left| R \cap S\right| }{\left| S\right| } \end{aligned} \quad \begin{aligned} \text{ precision } = \frac{\left| R \cap S\right| }{\left| R\right| } \end{aligned} \end{aligned}$$

An instance refers to an entry, label, entry-label pair, or label-label pair. These metrics were separately calculated for each type of instances (entries, labels, entry-label pairs, and label-label pairs).

The experiment results are shown in Table 1. Among 200 tables of the dataset, only 25 are processed with errors (1256 false negatives in 25 tables, and 498 false positives in 14 ones). Only one table is not processed. This results in 948 false negative and 316 false positive errors, which amount to about 72% of all errors. In this case, entries are not recovered because they are not numeric as it is assumed in the used ruleset.

Table 2. Comparison of the running time by using the different options.

Additionally, Table 2 presents a comparison of the running time for the ruleset translation and execution processes, using the implemented options. The ruleset translation time (\(t_1\)) is a time of translating an original ruleset into an executable form. The ruleset execution time (\(t_2\)) is a total time of executing the ruleset presented in the executable form that is required to process all 200 tables of the dataset. In the case of CRL2J option, \(t_1\) is a time of parsing and compiling the original ruleset into a Java program, while \(t_2\) is a time of executing the generated Java program. In the case of the use of a rule engine (Drools or JESS option), \(t_1\) is a time of parsing the original ruleset and adding the result into a rule engine session. \(t_2\) consists of a time of asserting initial facts into the working memory and a time of executing the ruleset by the rule engine. The results shown in Table 2 were obtained with the following computer: 3.2 GHz 4-core CPU and 8 GB RAM. The use of CRL2J option requires slightly more time for the translation of the ruleset. However, CRL2J is faster for the execution of the program (ruleset) compared to Drools or JESS rule engine.

All data and steps to reproduce the experiment results are publicly available as a datasetFootnote 9 [35]. It contains the following items: the relational tables obtained by TabbyXL; the ground-truth data; the mentioned CRL and CLP rulesets; the detailing of the performance evaluation. Some comparison of TabbyXL in its previous version with others solutions is discussed in the paper [38].

This experiment exemplifies the use of our language for developing task-specific rulesets. The performance evaluation confirms the applicability of the implemented ruleset in accomplishing the stated objectives of this application.

6 Conclusions and Further Work

The presented approach can be applied to develop software for extraction and transformation data of arbitrary spreadsheet tables. We expect TabbyXL to be useful in cases when data from a large number of tables appertaining to a few table types are required for populating a database.

We showed experimentally that rules for table analysis and interpretation can be expressed in a general-purpose rule-based language and executed with a rule engine [34]. But only a few of its possibilities are sufficient for developing our rules. The exploration of these possibilities has led to development of CRL, a domain-specific rule language, which specializes in expressing only rules for table analysis and interpretation. Our language hides details which are inessential for us and allows to focus on the logic of table analysis and interpretation. There are two ways to use CRL rules. They can be executed with a rule engine (e.g. Drools or JESS) or be translated to programs presented in the imperative style (e.g. in Java language).

The experiment demonstrates that the tool can be used for developing programs for transformation of spreadsheet data into the relational form. One ruleset can process a wide range of tables of the same genre, e.g. government statistical websites. Our tool can be used for populating databases from arbitrary tables, which share common features.

The work focuses rather on table analysis than on issues of interpretation. We only recover categories as sets of labels, without binding them with an external taxonomy of concepts (e.g. Linked Open Data). The further work on table content conceptualization can overcome this limitation. Moreover, we observe that arbitrary tables can contain messy (e.g. non-standardized values or typos) and useless (e.g. aggregations or padding characters) data. It seems to be interesting for the further work to incorporate additional techniques of data cleansing in our tool. Another direction of development is to integrate the presented results with the tools for extracting tables from documents (e.g. untagged PDF documents [32, 36]) in end-to-end systems for the table understanding.