
1 Introduction

The continuous explosion of resource description framework (RDF) data opens the door to new innovations in big data, social network analysis, and semantic web initiatives. The semantic web [1] is one of the most important research fields aiming to construct a web of data based on the RDF [2] data model, allowing data to be shared and reused across application, enterprise, and community boundaries. Relational databases (RDBs) are the primary sources of web data, the so-called "deep web" [3]. One study [3] showed that internet-accessible databases contain up to 500 times more data than the static web, and that roughly 70% of websites are backed by RDBs. The W3C RDB2RDF Working Group recently recommended specifications for languages to map RDBs (data and schemas) to RDF and OWL, namely Direct Mapping (DM) [4] and R2RML (Relational Database to RDF Mapping Language) [5]. However, the Working Group does not prescribe any particular implementation of DM or R2RML. The DM provides a set of automatic mapping rules to construct an ontology schema (RDF(S) and OWL) from an RDB schema and to convert relational data into RDF graphs according to that schema [6]. The constructed ontology reflects the structure and content of the relational database. Nevertheless, the DM method may not always be sufficient or optimal, especially when mapping a relational database to an existing ontology. R2RML is a customized mapping language that allows users to define mappings manually. In this approach, an expert user expresses the RDB schema in terms of an existing target ontology in order to convert the relational data into RDF datasets.

The R2RML specification is accompanied by the DM specification [4], which represents a standard approach for converting an RDB into RDF without a customized mapping definition. Thus, the RDF generated using DM can also be expressed in R2RML, and R2RML provides more flexibility than the DM specification. However, manually creating R2RML rules is a complex, time-consuming, cumbersome, error-prone, and costly process that requires the support of domain experts in knowledge acquisition. Moreover, users who want to apply R2RML to generate RDF from an RDB must learn how to create an R2RML mapping document, and there is a significant gap between the structure of an RDB and the R2RML mapping specification. One way to address these problems and to ease the creation of an R2RML document is to automatically generate an initial R2RML mapping document from the RDB schema that reflects the behavior of the DM specification. Users can then modify that document in a text editor or user interface. An RDF generation engine (such as morph-RDB [7], nkons [8], or RDF-RDB2RDF) then takes the R2RML mapping document and the RDB data as input and produces the corresponding RDF dataset (triples) as output. This is done by automatically mapping RDB concepts to an ontology vocabulary, which serves as a base for generating RDF triples from the RDB data. Recently, a survey report [9] and the W3C RDB2RDF implementation report [10] discussed and listed existing tools and ongoing projects that support the task of mapping generation. However, some of those tools either create mappings in other RDB2RDF languages, such as the ODEMapster GUI (which creates R2O mappings), or only provide syntactic sugar (form-based tools) to users who still need a good knowledge of R2RML, which limits their usability.

In this paper, we design and implement algorithms that automatically generate R2RML mapping documents reflecting the behavior of the Direct Mapping specification, which can then serve as a basis for generating RDF triples from RDB data. First, we design and implement an algorithm that takes an RDB schema as input and extracts a DBsInfo class (holding all the information about the RDB schema) as output. Second, we present algorithms that automatically generate an R2RML mapping document from the DBsInfo class. Finally, we generate an RDF dataset by integrating our work with an R2RML processor, which takes the R2RML mapping document and the RDB data as inputs and produces the corresponding RDF triples as output. The experimental results identify the factors that are important for building R2RML mappings and show their influence on the mapping generation time and on the sizes of the R2RML and RDF files. Together, these results reflect the effectiveness of our algorithms and their implementation in Java with the Jena API.

The rest of the paper is organized as follows. Section 2 provides an overview of related work. Basic concepts, giving a brief overview of R2RML and DM and the relationship between them, are described in Sect. 3. Section 4 presents the approach and the algorithms. The architecture of our processor prototype, its implementation, and experimental results with a discussion of the effectiveness and run-time efficiency of the proposed algorithms are presented in Sect. 5. Finally, Sect. 6 concludes the paper and outlines future work.

2 Related Work

Several approaches (automatic or manual) have been proposed for integrating RDBs and the semantic web, mainly concerning the creation and maintenance of mappings between RDF and RDBs. Mapping RDBs to RDF is a domain in which quite a few works have been proposed over recent years [11]. Generally, the objective is to express the RDB contents using an ontology (RDF graph) in a way that allows queries submitted against the RDF schema to be answered with data stored in the RDB. In addition, several automated or semi-automated methods for representing ontology schemas have been created for bringing data residing in RDBs into the semantic web [12–14].

Currently, there are two main approaches recommended by the W3C RDB2RDF Working Group for mapping RDBs into RDF, as mentioned previously: DM [4] and R2RML [5]. In the DM approach, the ontology model is constructed from the RDB model, and the contents of the RDB are transformed to generate ontology instances [6, 12, 15, 16]. The approach in [6, 12] proposed automatic direct-mapping rules by investigating several cases of RDB schemas to be directly mapped into an ontology represented in RDF(S)/OWL, and transformed RDB data into ontological instances (represented as RDF triples) based on the structure of the database schema. The approach in [16] presented the RDB2OWL language, a tool for mapping a database into an ontology using a compact notation within the ontology class and property annotations. This tool was implemented by converting the RDB2OWL mappings into executable D2RQ mappings to produce an RDF dump of the source RDB or to expose it as a SPARQL endpoint.

On the other side, customized mapping approaches such as ODEMapster [17], Triplify [18], D2R Server [19], and OpenLink Virtuoso [20] let a domain expert create a mapping between the relational schema and an existing target ontology, which is then used to convert RDB content to RDF. However, early surveys of RDB-to-RDF tools [21] revealed that such tools typically adopt proprietary mapping languages. Triplify [18] offers a Linked Data publishing interface and provides a simplistic approach to publishing RDF from RDBs. D2R Server [19] is an engine that directly maps the RDB into RDF and uses D2RQ mappings to translate requests from external applications into SQL queries on the RDB; this implementation was first available for the D2R language and later for R2RML. Moreover, there are tools implementing DM and R2RML, such as r2rml4net and db2triples. r2rml4net is a library for processing R2RML mapping documents; it provides functions to load an R2RML mapping document and to convert relational data into an RDF dump. db2triples is a Java implementation by Antidot of the DM and R2RML specifications for extracting data from RDBs and loading it into an RDF triple store. Recent efforts include the MIRROR system [22], which produces mappings in the R2RML language, and the RML mapping language [23], an extension of R2RML for non-relational sources and the integration of heterogeneous data formats that supports XML and JSON data sources in the mappings. In this work, we focus on the RDB schema. Meanwhile, other research introduced a semi-automatic mapping approach for generating R2RML mappings based on a set of correspondence assertions (mappings between relational metadata and the vocabulary of a domain ontology) defined by domain experts [24, 25]. In that approach, the user still needs to draw correspondence assertions (CAs) between the input sources (the source RDB schema and the target ontology/RDF schema) to specify the mapping between them.

Based on the previous literature, mapping generation remains far from well understood and needs to be explored further. Therefore, automatically generating R2RML mapping documents from RDBs is an important challenge: it avoids mistakes in R2RML mappings, reduces the generation time, and removes the need for domain experts.

3 Basic Concepts

This section gives a brief overview of R2RML and DM and the relationship between them. The W3C has recently standardized the RDB-to-RDF (RDB2RDF) mapping mechanisms and languages to bridge the gap between RDBs and the semantic web. These standards are the Direct Mapping (DM) of relational data to RDF [4] and R2RML, the RDB to RDF mapping language [5]. The mapping engine of such approaches/tools generates an RDF dataset from an RDB schema and its instances. The main step in this engine is to decide how to represent RDB schema concepts, i.e., tables and columns, in terms of RDF classes and properties. This is done by mapping RDB concepts to an ontology vocabulary, to be used as the base for generating a set of RDF triples from the relational data.

3.1 R2RML Standard

R2RML is a language for describing customized mappings from a relational database to an RDF dataset. The input of an R2RML mapping is an RDB schema and its instance; the output is an RDF graph. The mapping definition itself is represented as an RDF graph using the R2RML vocabulary and serialized in the Turtle syntax (Terse RDF Triple Language) [26], which is the recommended syntax for writing R2RML mapping documents. An R2RML mapping document consists of one or more triples maps, each of which contains a logical table, a subject map, and a number of predicate-object maps. The logical table can be an SQL table, an SQL view, or an SQL query statement. A triples map specifies a rule for mapping each row of a logical table to a set of RDF triples. The subject map contains the rules for generating the subject of each row, usually represented as an IRI. The predicate-object maps contain the rules for generating predicate maps and object maps (or referencing object maps) from the values in the table row. A referencing object map allows the subjects of another triples map to be used as objects. Since the two triples maps may be based on different logical tables, a join condition between the logical tables may be required.

In summary, a triples map specifies the RDF triples corresponding to a logical table, while the subject map and the predicate-object maps specify how those triples are formed. RDF triples are created by combining the subject map with a predicate map and a (referencing) object map, and applying these three to each logical table row. A minimal example of this structure is sketched below.
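As an illustration (not taken from the R2RML specification examples or from our implementation), the following sketch builds such a triples map for a hypothetical EMPLOYEE table with the Jena API, which is also the library used in our prototype; the base IRI, table name, and column names are assumptions.

```java
import org.apache.jena.rdf.model.*;

/**
 * Illustrative sketch: constructing a minimal R2RML triples map
 * (logical table, subject map, one predicate-object map) for a
 * hypothetical EMPLOYEE table and printing it in Turtle.
 */
public class TriplesMapSketch {
    static final String RR = "http://www.w3.org/ns/r2rml#";
    static final String BASE = "http://example.com/base/";   // assumed base IRI

    public static void main(String[] args) {
        Model m = ModelFactory.createDefaultModel();
        m.setNsPrefix("rr", RR);

        Resource triplesMap = m.createResource(BASE + "TriplesMapEMPLOYEE");

        // rr:logicalTable [ rr:tableName "EMPLOYEE" ]
        triplesMap.addProperty(m.createProperty(RR, "logicalTable"),
                m.createResource().addProperty(m.createProperty(RR, "tableName"), "EMPLOYEE"));

        // rr:subjectMap [ rr:template "...EMPLOYEE/EMP_ID={EMP_ID}" ; rr:class base:EMPLOYEE ]
        triplesMap.addProperty(m.createProperty(RR, "subjectMap"),
                m.createResource()
                        .addProperty(m.createProperty(RR, "template"),
                                BASE + "EMPLOYEE/EMP_ID={EMP_ID}")
                        .addProperty(m.createProperty(RR, "class"),
                                m.createResource(BASE + "EMPLOYEE")));

        // rr:predicateObjectMap [ rr:predicate base:EMPLOYEE#NAME ; rr:objectMap [ rr:column "NAME" ] ]
        triplesMap.addProperty(m.createProperty(RR, "predicateObjectMap"),
                m.createResource()
                        .addProperty(m.createProperty(RR, "predicate"),
                                m.createResource(BASE + "EMPLOYEE#NAME"))
                        .addProperty(m.createProperty(RR, "objectMap"),
                                m.createResource().addProperty(m.createProperty(RR, "column"), "NAME")));

        m.write(System.out, "TURTLE");   // serialize the mapping in Turtle syntax
    }
}
```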

3.2 Direct Mapping (DM)

The DM is a W3C recommendation [4]. It is the default method for translating a relational database (schema and data) into an ontology (OWL/RDF(S) and RDF triples) automatically through direct mapping, without user interaction. The resulting RDF reflects the exact data model of the relational data rather than the domain of the data. A direct mapping typically works by transforming each table into a class, each column into a property, and each relationship into an object property. Each row in a table is transformed into an individual that becomes a member of the table's class. Each foreign key is transformed into a property that links one individual to another. The ranges of the remaining properties are literals.

Furthermore, during the mapping process, the IRIs (with a namespace prefix) for the triples of the RDB schema and data (tables, columns, and rows) are produced as follows: the IRI of a table combines the base IRI with the table name; the IRI of a column combines the base IRI, the table name, and the column name(s); and the IRI of a row combines the base IRI, the table name, and the primary key column(s) and their values.
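As a small illustration of these conventions, the following sketch composes table, column, and row IRIs in Java; the base IRI, table, column, and key values are hypothetical, and a concrete DM implementation may percent-encode values or lay out the IRIs slightly differently.

```java
/**
 * Illustrative sketch of DM-style IRI construction for a hypothetical
 * EMPLOYEE table; values are not percent-encoded here for brevity.
 */
public final class DirectMappingIris {

    // table IRI, e.g. http://example.com/base/EMPLOYEE
    static String tableIri(String base, String table) {
        return base + table;
    }

    // column (property) IRI, e.g. http://example.com/base/EMPLOYEE#NAME
    static String columnIri(String base, String table, String column) {
        return base + table + "#" + column;
    }

    // row IRI, e.g. http://example.com/base/EMPLOYEE/EMP_ID=7
    static String rowIri(String base, String table, String pkColumn, String pkValue) {
        return base + table + "/" + pkColumn + "=" + pkValue;
    }

    public static void main(String[] args) {
        String base = "http://example.com/base/";   // assumed base IRI
        System.out.println(tableIri(base, "EMPLOYEE"));
        System.out.println(columnIri(base, "EMPLOYEE", "NAME"));
        System.out.println(rowIri(base, "EMPLOYEE", "EMP_ID", "7"));
    }
}
```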

Therefore, DM is the default, automatic way to translate RDBs into RDF without any input from the user, while R2RML is a mapping language that allows users to define mappings manually. The behavior of DM can thus be represented in R2RML: the R2RML specification is accompanied by the DM specification, which describes a standard method for generating RDF from an RDB without a customized mapping definition.

4 Approach and Algorithm

In this section, we introduce RML-BDM (R2RML based on direct mapping rules from RDB to RDF(S)/OWL), an approach for automatically generating an R2RML mapping document from an RDB schema. Any R2RML engine (e.g., the nkons-r2rml parser) can then be used to create an RDF dataset that follows the DM specification.

4.1 RDB Metadata Generation (DBsInfo Class)

In this section, we introduce the DBsInfo class, a representation of an RDB's metadata that serves as the source of information for RML-BDM generation. The basic information needed includes table names, view names, and column properties, namely column names, data types, sizes, and whether a column is nullable, indexed, a primary key (PK), a unique key (UK), and/or a foreign key (FK). Moreover, the most important information needed when a column is an FK is its Ref_to_Table (reference to table) and Ref_to_Column (reference to column). The processor for producing the DBsInfo class, which contains all the information about the database, is shown in Fig. 1. This processor has three levels containing four algorithms (classes) that extract all the information about database tables, views, columns, data types, column properties, PKs, FKs, UKs, column indexes, and relationships between tables through foreign keys. These algorithms are FillDBsInfo, FillTablesInfo, FillColumnsInfo, and FillTableRelationships.

Briefly, FillDBsInfo is the main algorithm; it extracts the general information about the database and invokes the FillTablesInfo algorithm to handle tables and views. FillTablesInfo extracts all the information about each table and view, including its columns and its relationships with other tables, by invoking the FillColumnsInfo and FillTableRelationships algorithms, respectively.
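To give a flavor of how such metadata can be collected, the following sketch uses the standard JDBC DatabaseMetaData API (the same mechanism our implementation relies on); the connection URL and credentials are assumptions, and the real Fill* algorithms populate a DBsInfo object rather than printing to the console.

```java
import java.sql.*;

/**
 * Illustrative sketch: reading the kind of metadata held by the DBsInfo
 * class (tables, views, columns, PKs, FKs) via JDBC DatabaseMetaData.
 */
public class MetadataExtractionSketch {

    public static void main(String[] args) throws SQLException {
        String url = "jdbc:mysql://localhost:3306/rdblab";   // assumed connection parameters
        try (Connection conn = DriverManager.getConnection(url, "user", "password")) {
            DatabaseMetaData md = conn.getMetaData();

            // tables and views (FillDBsInfo / FillTablesInfo level)
            try (ResultSet tables = md.getTables(null, null, "%", new String[]{"TABLE", "VIEW"})) {
                while (tables.next()) {
                    String table = tables.getString("TABLE_NAME");
                    System.out.println("Table/view: " + table);

                    // columns with data type, size, and nullability (FillColumnsInfo level)
                    try (ResultSet cols = md.getColumns(null, null, table, "%")) {
                        while (cols.next()) {
                            System.out.printf("  column %s %s(%d) nullable=%s%n",
                                    cols.getString("COLUMN_NAME"),
                                    cols.getString("TYPE_NAME"),
                                    cols.getInt("COLUMN_SIZE"),
                                    cols.getString("IS_NULLABLE"));
                        }
                    }

                    // primary keys
                    try (ResultSet pks = md.getPrimaryKeys(null, null, table)) {
                        while (pks.next()) {
                            System.out.println("  PK: " + pks.getString("COLUMN_NAME"));
                        }
                    }

                    // foreign keys: Ref_to_Table / Ref_to_Column (FillTableRelationships level)
                    try (ResultSet fks = md.getImportedKeys(null, null, table)) {
                        while (fks.next()) {
                            System.out.printf("  FK %s -> %s.%s%n",
                                    fks.getString("FKCOLUMN_NAME"),
                                    fks.getString("PKTABLE_NAME"),
                                    fks.getString("PKCOLUMN_NAME"));
                        }
                    }
                }
            }
        }
    }
}
```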

The DBsInfo class provides an image of the metadata obtained from an existing RDB. The main purpose of constructing the DBsInfo class is to read the essential metadata into memory, outside the database's secondary storage. In this study, the DBsInfo class is designed to raise the semantic level of the RDB and to act as an intermediate stage for database migration from RDB to RDF on both levels: schema translation and data conversion.

Fig. 1. Processor of algorithms for extracting RDB metadata (DBsInfo).

4.2 Rules of Approach: R2RML-Based Direct Mapping

This section defines the algorithms that map an RDB schema to an R2RML file based on the DM approach (RML-BDM). The RDB2RDF algorithm is the core of an R2RML engine. Following the algorithmic ideas proposed in the W3C recommendations R2RML [5] and DM [4] and in our previous works [6, 12], we have designed a group of mapping algorithms, R2RML Generator (Algorithm 1), GreateMapClass (Algorithm 2), GenerateLogicalTable (Algorithm 3), GenerateSubjectMap (Algorithm 4), GenerateTemplate (Algorithm 5), GeneratepredicateObject (Algorithm 6), and GenerateRefObjMap (Algorithm 7), to produce the RML-BDM mapping file. Concisely, R2RML Generator is the main algorithm; it invokes the other algorithms to implement the functionality of an R2RML triples map. The algorithms are summarized below, and a condensed code sketch of how they fit together is given after the list.

  • R2RML Generator: This is the main algorithm; it generates the R2RML mapping file based on the direct mapping approach.

  • GreateMapClass: This algorithm creates a map class name from the table/view name. The map class is a triples map used to translate each row of the logical table into a number of RDF triples and to link triples map classes (classes corresponding to tables that have relationships with each other). The output is a map class name corresponding to the table/view name.

  • GenerateLogicalTable: This algorithm maps the table/view to a logical table in R2RML format following the DM method, where the RDB table name becomes rr:tableName in the logical table. This specifies which table/view is mapped so that an R2RML parser can generate triples from its rows. The algorithm returns the corresponding triples in R2RML format.

  • GenerateSubjectMap: This is one of the most important algorithms. It generates the unique IRI used as the subject of all RDF triples generated from a row of the table named in rr:tableName. It invokes the GenerateTemplate algorithm to form the primary-key-based identifier of each row of the table/view. The algorithm's input is a table/view entry of the DBsInfo.TablesInfo class, which stores all the columns of the table/view, the properties of the table and of its columns, and its relationships to other tables.

  • GenerateTemplate: This algorithm generates the IRI template for all triples from the base IRI, the table name, and table columns, preferably primary keys (PKs) or unique keys (UKs), or a set of non-null columns when the table has no PK or UK defined. It thereby forms a base IRI key that is unique across all the triples generated from the table rows: each row maps to a set of triples referring to the same subject (the row IRI key), and all the triples of the table rows refer to the table name in rr:class. The algorithm returns the corresponding triples in R2RML format.

  • GeneratepredicateObject: This algorithm maps a table/view column to a predicate-object map, which contains a predicate map and an object map; these generate the RDF terms for the predicate and object of a triple, respectively. The value of rr:predicate is an IRI composed of the base IRI, the table name, and the column name, which are the algorithm's inputs; the object map refers to the column name. The algorithm returns the corresponding triples in R2RML format, which are associated with a subject generated by the GenerateSubjectMap algorithm.

  • GenerateRefObjMap: This algorithm maps all the table relationships to reference triples, generated as referencing object maps with an rr:joinCondition to another table, analogously to the local triples. A referencing object map allows the subjects of another triples map to be used as objects (produced by a predicate-object map). Since the two triples maps may be based on different logical tables, a join between the logical tables may be required. All relationships of a table are stored in DBsInfo.TableRelationShips, including FK_Table_Name, FK_Column_Name, Ref_To_TableName, and Ref_To_Column_Name, which are the inputs of the algorithm. The algorithm returns the triples in R2RML format that correspond to the relationships of the table with other tables; this output is associated with the object map generated by a predicate-object map.
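The following condensed sketch, written against the Jena API on a current JDK, illustrates how these algorithms could fit together (one triples map per table, one predicate-object map per column, one referencing object map per foreign key). TableInfo, ColumnInfo, and RelationshipInfo are hypothetical stand-ins for the DBsInfo classes, and the code is a simplified approximation of the algorithms rather than their actual implementation.

```java
import org.apache.jena.rdf.model.*;

/** Simplified approximation of the R2RML Generator control flow. */
public class R2rmlGeneratorSketch {
    static final String RR = "http://www.w3.org/ns/r2rml#";

    // hypothetical stand-ins for the corresponding DBsInfo structures
    record ColumnInfo(String name) {}
    record RelationshipInfo(String fkColumn, String refTable, String refColumn) {}
    record TableInfo(String name, String pkColumn,
                     java.util.List<ColumnInfo> columns,
                     java.util.List<RelationshipInfo> relationships) {}

    static Property rr(Model m, String local) { return m.createProperty(RR, local); }

    static Model generate(String base, java.util.List<TableInfo> tables) {
        Model m = ModelFactory.createDefaultModel();
        m.setNsPrefix("rr", RR);
        for (TableInfo t : tables) {
            // GreateMapClass: one triples map resource per table/view
            Resource map = m.createResource(base + "TriplesMap" + t.name());
            // GenerateLogicalTable: rr:logicalTable [ rr:tableName "..." ]
            map.addProperty(rr(m, "logicalTable"),
                    m.createResource().addProperty(rr(m, "tableName"), t.name()));
            // GenerateSubjectMap + GenerateTemplate: row IRI template and rr:class
            map.addProperty(rr(m, "subjectMap"), m.createResource()
                    .addProperty(rr(m, "template"),
                            base + t.name() + "/" + t.pkColumn() + "={" + t.pkColumn() + "}")
                    .addProperty(rr(m, "class"), m.createResource(base + t.name())));
            // GeneratepredicateObject: one predicate-object map per column
            for (ColumnInfo c : t.columns()) {
                map.addProperty(rr(m, "predicateObjectMap"), m.createResource()
                        .addProperty(rr(m, "predicate"),
                                m.createResource(base + t.name() + "#" + c.name()))
                        .addProperty(rr(m, "objectMap"),
                                m.createResource().addProperty(rr(m, "column"), c.name())));
            }
            // GenerateRefObjMap: referencing object map with a join condition per FK
            for (RelationshipInfo r : t.relationships()) {
                map.addProperty(rr(m, "predicateObjectMap"), m.createResource()
                        .addProperty(rr(m, "predicate"),
                                m.createResource(base + t.name() + "#ref-" + r.fkColumn()))
                        .addProperty(rr(m, "objectMap"), m.createResource()
                                .addProperty(rr(m, "parentTriplesMap"),
                                        m.createResource(base + "TriplesMap" + r.refTable()))
                                .addProperty(rr(m, "joinCondition"), m.createResource()
                                        .addProperty(rr(m, "child"), r.fkColumn())
                                        .addProperty(rr(m, "parent"), r.refColumn()))));
            }
        }
        return m;
    }
}
```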


5 Prototype Implementation

5.1 Architecture

Figure 2 shows the architecture of our RML-BDM processor prototype. Based on the proposed algorithms, we have implemented the RML-BDM processor prototype and integrated it with the nkons-r2rml parser [8]. The processor takes a system configuration, a DB connection to the relational database, and a base IRI as inputs, and automatically produces the R2RML mapping document and the resulting RDF dataset as outputs, which are shown in the screen display.

The architecture and process flow of the RML-BDM processor prototype are illustrated in Fig. 2; the functional modules are briefly described as follows.

Fig. 2. A general overview of the R2RML mapping generation process in the RML-BDM system.

  • System config module: This module configures the execution environment for the RML-BDM processor according to the user-specified settings listed below (a minimal configuration sketch in Java is given after the module list):

    1. DB config: specifies all parameters for the connection to the database.

    2. R2RML file type: specifies the file name and format of the R2RML document used to store the R2RML mapping generated from the RDB schema, which is later used with any R2RML parser to produce RDF triples from the RDB data.

    3. RDF triples output type: specifies the file name and format of the RDF graph used to store the RDB data.

    4. Base IRI (NS prefix): specifies the namespace prefix of the IRIs for all generated RDF triples.

  • DB connection: This module connects to the database (using a JDBC driver in Java) and makes it ready for reading. The input is the DB config parameters and the output is a DB connection object.

  • DB analysis processor: This module implements the algorithms for extracting the RDB metadata (DBsInfo, Fig. 1) from the RDB. The metadata is extracted using the JDBC driver in Java. The output is the DBsInfo class, which contains nested classes storing all the information about the RDB schema, such as tables, views, columns, data types, sizes, constraints, PKs, FKs, relationships, indexes, unique keys, and nullability.

  • R2RML mapping file generator: This module automatically generates an R2RML mapping file that reflects the behavior of the DM specification from the DBsInfo class, according to our algorithms. Its input is the DBsInfo class (containing all the information about the RDB schema) and its output is an R2RML mapping file in RDF (or TTL) format. The output file encapsulates all mapping results as a standard input for any R2RML processor, which can later produce a set of RDF triples similar to those resulting from DM.

  • R2RML parser (RDF processor): This module generates the actual RDF triples file from the RDB data according to the R2RML mapping file, making the data accessible to an RDF store. This stage also completes our approach for generating the R2RML mapping file. We used the open source tool nkons-r2rml-parser (any other R2RML processor could be used), integrated with our RML-BDM system, to generate a set of triples corresponding to those generated by the DM approach.

  • Screen display (user interface): Through the user interface, the user can specify the configuration settings for the execution environment (system config module) and display the database information schema (DBsInfo class), the R2RML mapping file, and the resulting RDF triples on the tool screen.
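For illustration only, the user-specified settings enumerated in the system config module could be captured in a simple properties file, as in the following sketch; the file name and property keys are assumptions and do not correspond to the tool's actual configuration format.

```java
import java.io.File;
import java.io.FileInputStream;
import java.util.Properties;

/** Illustrative sketch of loading RML-BDM-style configuration settings. */
public class SystemConfigSketch {

    public static void main(String[] args) throws Exception {
        Properties cfg = new Properties();
        File file = new File("rml-bdm.properties");   // assumed file name
        if (file.exists()) {
            try (FileInputStream in = new FileInputStream(file)) {
                cfg.load(in);
            }
        }

        // 1. DB config: connection parameters
        String jdbcUrl = cfg.getProperty("db.url", "jdbc:mysql://localhost:3306/rdblab");
        String dbUser  = cfg.getProperty("db.user", "root");

        // 2. R2RML mapping file name and serialization format
        String mappingFile   = cfg.getProperty("r2rml.file", "mapping.ttl");
        String mappingFormat = cfg.getProperty("r2rml.format", "TURTLE");

        // 3. RDF triples output file name and format
        String rdfFile   = cfg.getProperty("rdf.file", "output.nt");
        String rdfFormat = cfg.getProperty("rdf.format", "N-TRIPLES");

        // 4. Base IRI (namespace prefix) for all generated triples
        String baseIri = cfg.getProperty("base.iri", "http://example.com/base/");

        System.out.printf("DB %s as %s; mapping -> %s (%s); RDF -> %s (%s); base IRI %s%n",
                jdbcUrl, dbUser, mappingFile, mappingFormat, rdfFile, rdfFormat, baseIri);
    }
}
```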

5.2 Implementation

The algorithms described in this paper have been implemented in the RML-BDM processor prototype and integrated with the nkons-r2rml-parser [8]. The prototype was implemented on the Netbeans IDE 7.3.1 (J2SE, JDK 1.7) platform. The inputs of the processor are the user-specified system configuration, an SQL connection to the relational database, and a base IRI. The outputs are produced automatically and include the DBsInfo class, an R2RML mapping document, and the resulting RDF dataset. These outputs are shown in the screen display, and the R2RML mapping document and RDF triples can be saved in RDF files in different syntax formats (RDF/XML, N-TRIPLES, TURTLE (or TTL), and N3). Moreover, the R2RML mapping file can be used with any R2RML parser for converting relational data to RDF triples. The current system prototype supports SQL connections to MySQL Server and already includes drivers for major commercial and open source databases, including PostgreSQL, SQL Server, and Oracle.
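As an illustration of the serialization step (assuming the mapping is available as a Jena model and using hypothetical file names), the same model can be written in each of the formats mentioned above:

```java
import org.apache.jena.rdf.model.*;

/** Illustrative sketch: serializing one mapping model in several RDF syntaxes. */
public class SerializationSketch {

    public static void main(String[] args) throws Exception {
        Model mapping = ModelFactory.createDefaultModel();
        mapping.read("mapping.ttl", "TURTLE");   // assumed mapping file produced earlier

        String[][] outputs = {
                {"mapping.rdf", "RDF/XML"},
                {"mapping.nt",  "N-TRIPLES"},
                {"mapping.n3",  "N3"},
        };
        for (String[] o : outputs) {
            try (java.io.FileOutputStream out = new java.io.FileOutputStream(o[0])) {
                mapping.write(out, o[1]);   // Jena chooses the writer by language name
            }
        }
    }
}
```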

Table 1. A list of RDB schema and data sizes
Table 2. A list of results for our approach
Fig. 3. Important factors of RDBs to build R2RML mapping file. (Color figure online)

Fig. 4. Dataset sizes in RDBs (schema and data). (Color figure online)

Fig. 5. Schema sizes in RDBs and R2RML mapping files. (Color figure online)

Fig. 6. The running time of algorithmic routines in RML-BDM.

5.3 Experimental Results and Discussion

We carried out RDB metadata and R2RML mapping extraction experiments with our RML-BDM tool on a laptop running Windows 7 (32-bit) with an Intel(R) Core i5-2410M 2.30 GHz CPU and 6 GB of RAM. The prototype for these experiments was implemented using MySQL, the Java programming language, Netbeans IDE 7.3.1, and the Apache Jena toolkit. Experimental tests of the effectiveness and validity of our RDB2RDF mapping algorithms were conducted on five RDBs of different schema sizes created with MySQL Server. These RDBs are rdblab [15], iswc, tracker, sakila, and Northwind, which cover various important RDB concepts such as tables, views, columns, constraints, single or composite primary keys, one or many foreign keys per table, and all types of relationships. Moreover, some of the databases encode parent-child relationships in which one parent table is related to many child tables.

First, the database concepts automatically extracted by our system are shown in Table 1. These concepts are the important factors for building R2RML mapping files, and they affect the performance of the algorithms for extracting RDB metadata, generating the R2RML mapping file, and converting RDB data to RDF datasets. Second, the resulting R2RML schema tuples and RDF triples are shown in Table 2. The two fields SizeDBschema (Table 1) and SizeR2RML (Table 2) compare the sizes of the RDB schemas with those of the extracted R2RML mapping files. From these two fields, we can observe that the ontology is a compact way to store the schema and to infer knowledge from it. Figures 3 and 4 are bar-chart analyses of Table 1. Figure 4 shows the schema sizes and row counts of the RDBs tested in our experiments. Figure 5 shows that storing the schema as an ontology compares favorably with storing it in the RDB. The X-axis shows the different domain datasets, while the Y-axis shows the size in kilobytes (KB). Moreover, the performance analysis for the different databases is shown in Fig. 6. The execution time for creating the R2RML mapping file includes the time for extracting the RDB metadata (DBsInfo, with the RDBs containing data) and for generating the R2RML mapping file in the R2RML Generator algorithm. Therefore, from the analyses of the tables and figures, we can identify the most important factors for extracting an R2RML mapping document from an RDB and their effect on the extraction time. Although the tracker database is larger than the sakila and Northwind databases, the execution time of the R2RML Generator algorithm is smaller, because factors other than database size also influence the execution time of the R2RML mapping file generator.

6 Conclusion and Future Work

We introduced an approach and a tool for automatically generating an R2RML mapping document from an RDB schema; any R2RML engine can use this mapping document to create an RDF dataset that follows the DM specification. RML-BDM enables both domain experts and non-experts to automatically create R2RML mapping files from RDBs in the R2RML format. RML-BDM has been integrated with the nkons-r2rml parser, which can export RDB contents as RDF graphs based on an R2RML mapping document.

The process was tested using five RDBs with different schema sizes, covering several RDB concepts and types of relationships between tables. In future work, we will add support for expressing customized mappings from other types of data sources, such as XML, NoSQL, and object-oriented databases, to RDF triples.