1 Introduction

Knowledge graphs have become a cornerstone of artificial intelligence. The construction and publication of large-scale knowledge graphs in various domains pose new challenges for the management of those graphs.

Currently, there are two mainstream data models for knowledge graphs, namely the RDF (Resource Description Framework) model [8] and the property graph model. The former has been standardized by the W3C (World Wide Web Consortium), and the latter, although not yet standardized, has been widely adopted in the graph database industry. Unlike in the relational database community, however, the two models and their query languages have not yet been unified. The RDF model has a solid mathematical foundation and relatively complete model characteristics, and with the Linked Data initiative [11], an increasingly large amount of RDF data has been published on the Semantic Web. The property graph model, in turn, has built-in support for properties and is served by several query languages, including Cypher [9], Gremlin [10], and PGQL [12]. Because RDF graphs have a hypergraph structure, it has been demonstrated that they are more expressive than property graphs. How to effectively manage both RDF and property graphs in a unified storage schema has therefore become an urgent problem. In this paper, we focus first on the integration of RDF and property graphs at the storage layer.

The relational model has matured over several decades. It has concise and universal relational structures, and it expresses operations and constraints on relations using relational algebra with strict mathematical definitions. Therefore, it can provide a solid theoretical foundation for storing RDF and property graphs.

Our contributions can be summarized as follows:

  1. We propose a unified relational storage schema that can seamlessly accommodate both RDF and property graphs.

  2. We then implement the storage schema on an open-source database, AgensGraph, to verify its effectiveness and efficiency.

  3. To some extent, we manage RDF and property graphs in an interchangeable way and realize the interoperability between the two models.

The remainder of this paper is organized as follows. Sect. 2 introduces related work, and Sect. 3 gives the formal definitions of RDF and property graphs. Sect. 4 presents the proposed unified storage schema, and Sect. 5 describes its implementation on an open-source graph database together with the experimental results. Finally, Sect. 6 concludes the paper and discusses future research directions.

2 Related Work

The knowledge graph data model is based on the graph structure, with vertices representing entities and edges representing the relationships between those entities. This kind of general data representation can naturally depict the extensive connections between things in the real world.

2.1 The RDF Storage Schema

There are two typical approaches to designing RDF data management systems: relational approaches and graph-based approaches [18]. Relational approaches map RDF data to a tabular representation and then execute SPARQL queries on it, whereas graph-based approaches model both the RDF data and the SPARQL query as graphs and execute the query by subgraph matching using homomorphism [16].

Relation-Based Knowledge Graph Storage Management. Relational databases are still the most widely used database management systems, and storage schemes based on relational databases remain a main storage method for knowledge graph data [1]. The triple table storage scheme directly stores RDF data as triples; the horizontal table storage scheme [3, 7, 15] records all predicates and objects of a subject in one row; the property table storage scheme is a subdivision of the horizontal table in which subjects of the same type are stored in the same table, which solves the problem of too many columns in the table; the vertical partitioning storage scheme creates a two-column table for each predicate [2]; the sextuple indexing storage scheme stores all six permutations of a triple in six tables [14]. Last but not least, in recent years DB2RDF [6] has improved query performance by creating entity-oriented storage structures that reduce the Cartesian product operations in queries.
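To make the difference between two of these layouts concrete, the following minimal sketch builds a triple table and a vertically partitioned layout for the same data. It uses Python's built-in `sqlite3` module purely for illustration; the sample triples, the `vp_` table prefix, and the function names are assumptions and do not come from the cited systems.

```python
import sqlite3
from collections import defaultdict

# Hypothetical sample triples (subject, predicate, object); not from the paper.
TRIPLES = [
    ("alice", "knows", "bob"),
    ("alice", "name", "Alice"),
    ("bob", "name", "Bob"),
]

def build_triple_table(conn, triples):
    """Triple table: one three-column table holding every triple."""
    conn.execute("CREATE TABLE triples (s TEXT, p TEXT, o TEXT)")
    conn.executemany("INSERT INTO triples VALUES (?, ?, ?)", triples)

def build_vertical_partitions(conn, triples):
    """Vertical partitioning: one two-column (s, o) table per predicate."""
    by_predicate = defaultdict(list)
    for s, p, o in triples:
        by_predicate[p].append((s, o))
    for p, rows in by_predicate.items():
        # Predicate names are used as quoted identifiers; real URIs would
        # need proper escaping or a name-mapping table.
        conn.execute(f'CREATE TABLE "vp_{p}" (s TEXT, o TEXT)')
        conn.executemany(f'INSERT INTO "vp_{p}" VALUES (?, ?)', rows)
    return sorted(by_predicate)

conn = sqlite3.connect(":memory:")
build_triple_table(conn, TRIPLES)
predicates = build_vertical_partitions(conn, TRIPLES)

# The same lookup ("which subjects have a name?") against both layouts.
from_triples = conn.execute(
    "SELECT s, o FROM triples WHERE p = 'name' ORDER BY s").fetchall()
from_vp = conn.execute('SELECT s, o FROM "vp_name" ORDER BY s').fetchall()
```

In the triple table the predicate is a filter on one large table, while in the vertically partitioned layout it selects a small table directly; both return the same rows here.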

Graph-Based Knowledge Graph Storage Management. The advantage of the graph-based approach is that it maintains the original representation of the RDF data and enforces the intended semantics of SPARQL. The disadvantage, however, is that subgraph matching by graph homomorphism is NP-complete [18]. Systems such as the one proposed by Bönström et al. [5], gStore [15, 17], and chameleon-db [4] follow this approach.
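As an illustration of what matching by homomorphism means, the sketch below enumerates variable bindings for a small basic graph pattern with naive backtracking. It is not the algorithm of any of the cited systems; the `?`-prefix convention for variables and the sample graph are assumptions made for this example.

```python
def match(pattern, graph, binding=None):
    """Yield every homomorphic binding of query variables to graph terms.

    pattern: list of triple patterns; terms starting with '?' are variables.
    graph:   list of (s, p, o) triples.
    """
    binding = binding or {}
    if not pattern:
        yield dict(binding)
        return
    first, rest = pattern[0], pattern[1:]
    for triple in graph:
        extended = dict(binding)
        for query_term, graph_term in zip(first, triple):
            if query_term.startswith("?"):
                # Bind a new variable, or check an existing binding.
                if extended.setdefault(query_term, graph_term) != graph_term:
                    break
            elif query_term != graph_term:
                break
        else:
            # All three positions matched; continue with the remaining patterns.
            yield from match(rest, graph, extended)

GRAPH = [("a", "knows", "b"), ("b", "knows", "a"), ("a", "knows", "a")]
PATTERN = [("?x", "knows", "?y"), ("?y", "knows", "?x")]
results = list(match(PATTERN, GRAPH))
```

Because the matching is a homomorphism rather than an isomorphism, two distinct variables may map to the same node: the binding in which both `?x` and `?y` are `a` is a valid answer. The nested enumeration also hints at why the worst-case cost grows quickly with the pattern size.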

2.2 The Property Graph Storage Scheme

A property graph is a directed, labeled, and attributed multigraph. That is, the edges of a property graph are directed, both vertices and edges can be labeled and can have any number of properties, and there can be multiple edges between any two vertices [13]. Neo4j is a native graph database that supports transactional applications and graph analytics, and it is currently the most popular property graph database. Neo4j is also based on a network-oriented model where relationships are first-class objects.

At present, neither the data model nor the query language of knowledge graphs is unified. The main reason for the success of relational databases is that they have a precisely defined relational data model and a unified query language, SQL. A unified data model and query language not only reduce the development and maintenance costs of database management systems, but also lower the learning barrier for users. Therefore, building on existing work, we propose a unified relational storage scheme for the RDF and property graph models.

3 Preliminaries

In this section, we provide the formal definitions of RDF triple, RDF graph, and property graph, which form the basis for the transformations into relational tables in this paper.

Definition 1

(RDF triple). Let U, B and L be disjoint sets of URIs, blank nodes and literals, respectively. An RDF triple \((s,p,o) \in (U \cup B) \times U \times (U \cup B \cup L)\) states the fact that the resource s has the relationship p to the resource \(o \in U\), or the resource s has the property p with the value \(o \in L,\) where s is called the subject, p the predicate (or property), and o the object.

Definition 2

(RDF graph). A finite set of RDF triples is called an RDF graph. Given an RDF graph T, we use S(T), P(T), and O(T) to denote the set of subjects, predicates, and objects in T, respectively. For a certain subject \(s_i \in S(T)\), we refer to the triples with the same subject \(s_i\) collectively as the entity \(s_i\), denoted by .

We can use RDF Schema (RDFS) to define classes of entities and the relationships between these classes. For example, the triple \((s, \text{rdf:type}, C)\) declares that the entity s is an instance of the class C. Given an RDF graph T, we assume that for each subject \(s \in S(T)\) there exists at least one triple \((s, \text{rdf:type}, C)\), denoted by \(s \in C\). We believe that this assumption is reasonable since every entity should belong to at least one type in the real world.
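Definitions 1 and 2 and the typing assumption can be sketched in code as follows. This is only an illustrative model under stated assumptions: the class names `URI`, `BlankNode`, and `Literal` and the helper functions are hypothetical, and `rdf:type` is written as a plain prefixed name rather than a full URI.

```python
from collections import defaultdict
from dataclasses import dataclass

@dataclass(frozen=True)
class URI:
    value: str

@dataclass(frozen=True)
class BlankNode:
    label: str

@dataclass(frozen=True)
class Literal:
    value: str

RDF_TYPE = URI("rdf:type")

def is_valid_triple(s, p, o):
    """Check that (s, p, o) lies in (U ∪ B) × U × (U ∪ B ∪ L)."""
    return (isinstance(s, (URI, BlankNode))
            and isinstance(p, URI)
            and isinstance(o, (URI, BlankNode, Literal)))

def entities(graph):
    """Group triples by subject; each group is the entity of that subject."""
    groups = defaultdict(set)
    for s, p, o in graph:
        groups[s].add((s, p, o))
    return dict(groups)

def untyped_subjects(graph):
    """Subjects that violate the assumption of having an rdf:type triple."""
    typed = {s for s, p, _ in graph if p == RDF_TYPE}
    return {s for s, _, _ in graph} - typed

alice, bob = URI("alice"), URI("bob")
graph = {
    (alice, RDF_TYPE, URI("Person")),
    (alice, URI("name"), Literal("Alice")),
    (bob, URI("name"), Literal("Bob")),
}
```

In this sample graph, `untyped_subjects(graph)` reports `bob`, which is exactly the kind of subject the assumption above rules out.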

Definition 3

(Property graph). Let \(\mathcal{L}\) and \(\mathcal{T}\) be countable sets of node labels and relationship types, respectively [16]. Further, let \(\mathcal{N}\) and \(\mathcal{R}\) be countable sets of node and relationship identifiers, \(\mathcal{K}\) the set of property keys, and \(\mathcal{V}\) the set of values. A property graph is a tuple \(G = (N, R, src, tgt, l, \lambda , \tau )\) where:

  • N is a finite subset of \(\mathcal{N}\), whose elements are referred to as the nodes of G.

  • R is a finite subset of \(\mathcal{R}\), whose elements are referred to as the relationships of G.

  • src: \(R \rightarrow N\) is a function that maps each relationship to its source node.

  • tgt: \(R \rightarrow N\) is a function that maps each relationship to its target node.

  • l: \((N \cup R) \times \mathcal{K} \rightarrow \mathcal{V}\) is a finite partial function that maps a (node or relationship) identifier and a property key to a value.

  • \(\lambda \): \(N \rightarrow 2^{\mathcal{L}}\) is a function that maps each node identifier to a finite (possibly empty) set of labels.

  • \(\tau \): \(R \rightarrow \mathcal{T}\) is a function that maps each relationship identifier to a relationship type.
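The tuple in Definition 3 can be mirrored one-to-one in a small data structure. The sketch below is a hypothetical illustration: the class name `PropertyGraph`, its field names, and the `is_well_formed` check are assumptions, with each function of the definition stored as a Python dictionary.

```python
from dataclasses import dataclass

@dataclass
class PropertyGraph:
    nodes: set    # N: node identifiers
    rels: set     # R: relationship identifiers
    src: dict     # src: R -> N
    tgt: dict     # tgt: R -> N
    props: dict   # l: partial function, keyed by (identifier, property key)
    labels: dict  # lambda: N -> finite set of labels
    types: dict   # tau: R -> relationship type

    def is_well_formed(self):
        """Check the domain and range conditions of Definition 3."""
        total_on_rels = all(
            set(f) == self.rels for f in (self.src, self.tgt, self.types))
        endpoints_ok = all(
            n in self.nodes for f in (self.src, self.tgt) for n in f.values())
        labels_ok = set(self.labels) == self.nodes
        props_ok = all(
            ident in self.nodes | self.rels for ident, _key in self.props)
        return total_on_rels and endpoints_ok and labels_ok and props_ok

g = PropertyGraph(
    nodes={1, 2},
    rels={10, 11},
    src={10: 1, 11: 1},
    tgt={10: 2, 11: 2},   # two parallel edges from node 1 to node 2
    props={(1, "name"): "Alice", (10, "since"): 2019},
    labels={1: {"Person"}, 2: set()},  # a node may have an empty label set
    types={10: "KNOWS", 11: "LIKES"},
)
```

The example deliberately contains two relationships between the same pair of nodes and a node without labels, both of which the definition permits.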

4 The Unified Relational Storage Schema

First, we propose a unified relational storage schema for both RDF and property graphs. Then we elaborate on the specific rules for transforming RDF and property graphs into relational tables to realize the storage integration effectively.

4.1 Integration of RDF and Property Graphs in Relational Tables

As representations of knowledge graph models, RDF and property graphs are relatively independent and differ in expressivity, which makes a direct mapping between them difficult. As shown in Fig. 1, we therefore select the mature relational model as the physical storage model to realize the integration of RDF and property graphs.

Fig. 1. The unified relational storage schema

4.2 Transforming RDF Graphs into Relational Tables

Since an RDF graph is defined as a finite set of triples, an RDF graph can be mapped into multiple relational tables. The mapping rules from an RDF graph to relational tables are defined as follows.

RDF triples are, by definition, formalized as \((s,p,o) \in (U \cup B) \times U \times (U \cup B \cup L)\). For simplicity, the namespace prefixes of resource and predicate URI names are omitted in this paper (RDF built-in names, such as rdf:type, are not omitted). Since blank nodes do not fundamentally change the RDF data management method, blank nodes in the RDF graph are treated as URIs in this paper.

For the three different forms of RDF triples, we define the basic mapping rules from RDF to relational tables as follows:

Rule 1. An RDF triple of the form \(\left\langle U_1 \right\rangle \) rdf:type \(\left\langle U_2 \right\rangle \), whose predicate is rdf:type, is expressed as a row with id (primary key) and properties in the relational table \(U_2\).

Rule 2. An RDF triple of the form \(\left\langle U_1 \right\rangle \left\langle U_2 \right\rangle \left\langle L \right\rangle \), whose object is a literal, is expressed as a property \(\left\{ U_2:L \right\} \) in the properties of \(U_1\).

Rule 3. An RDF triple of the form \(\left\langle U_1 \right\rangle \left\langle U_2 \right\rangle \left\langle U_3 \right\rangle \), whose subject, predicate, and object are all URIs, is expressed as a row in the relational table \(U_2\) with id (primary key), start (a foreign key referencing the id of \(U_1\)), end (a foreign key referencing the id of \(U_3\)), and properties.

As shown in Fig. 2, most RDF graphs can be mapped to relational schemata according to the above basic rules.

Fig. 2. The basic mapping from RDF graphs to relational tables
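The three basic rules can be sketched as three passes over the triples. The following is a minimal in-memory illustration, not the implementation used in the experiments. It assumes that every subject has exactly one rdf:type triple, that the subject URI serves as the vertex id, and that each property is single-valued; the `Lit` marker class and the function name are hypothetical.

```python
RDF_TYPE = "rdf:type"

class Lit(str):
    """Marks an object as a literal; plain str values are treated as URIs."""

def map_rdf_to_tables(triples):
    vertex_tables = {}  # class name -> {vertex id: properties}
    edge_tables = {}    # predicate  -> list of edge rows
    home = {}           # subject    -> class whose table holds its row

    # Rule 1: <U1> rdf:type <U2> becomes a row for U1 in vertex table U2.
    for s, p, o in triples:
        if p == RDF_TYPE:
            vertex_tables.setdefault(o, {})[s] = {}
            home.setdefault(s, o)

    # Rule 2: <U1> <U2> L becomes the property {U2: L} on the row of U1.
    # A repeated key overwrites the earlier value (single-valued assumption).
    for s, p, o in triples:
        if p != RDF_TYPE and isinstance(o, Lit):
            vertex_tables[home[s]][s][p] = str(o)

    # Rule 3: <U1> <U2> <U3> becomes a row (id, start, end, properties)
    # in edge table U2.
    next_id = 1
    for s, p, o in triples:
        if p != RDF_TYPE and not isinstance(o, Lit):
            edge_tables.setdefault(p, []).append(
                {"id": next_id, "start": s, "end": o, "properties": {}})
            next_id += 1
    return vertex_tables, edge_tables

triples = [
    ("alice", RDF_TYPE, "Person"),
    ("bob", RDF_TYPE, "Person"),
    ("alice", "name", Lit("Alice")),
    ("alice", "knows", "bob"),
]
vertex_tables, edge_tables = map_rdf_to_tables(triples)
```

For the sample input, the result is one vertex table `Person` with two rows and one edge table `knows` with a single row from `alice` to `bob`.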

In particular, the sets of vertices and edges in RDF graphs are not disjoint: a predicate can also act as the subject or the object of another RDF triple. We therefore propose a solution for implementing RDF reification. In the relational schema, we create an additional relational table called “Edge_Vertex” with the column Vertexid (primary key), the column Edgeid (a foreign key referencing the id of the edge), and the column properties. The Edge_Vertex table stores edges as vertices so that relationships between edges and vertices, or between edges and edges, can be represented. As presented in Fig. 3, this dual storage preserves the complete information of RDF in the relational model.

Fig. 3. The complete mapping from RDF graphs to relational tables
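One possible relational layout for the Edge_Vertex idea is sketched below with Python's built-in `sqlite3` module. Only the Edge_Vertex columns (Vertexid, Edgeid, properties) come from the text; the tables `Person`, `knows`, and `statedBy`, the sample rows, and the restriction of Edgeid to a single edge table are assumptions made to keep the example small.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
CREATE TABLE Person (id INTEGER PRIMARY KEY, properties TEXT);
CREATE TABLE knows (
    id INTEGER PRIMARY KEY,
    "start" INTEGER REFERENCES Person(id),
    "end" INTEGER REFERENCES Person(id),
    properties TEXT
);
-- Each row lets one edge of `knows` be addressed as a vertex.
CREATE TABLE Edge_Vertex (
    Vertexid INTEGER PRIMARY KEY,
    Edgeid INTEGER REFERENCES knows(id),
    properties TEXT
);
-- A hypothetical edge table whose start point is an edge-as-vertex.
CREATE TABLE statedBy (
    id INTEGER PRIMARY KEY,
    "start" INTEGER REFERENCES Edge_Vertex(Vertexid),
    "end" INTEGER REFERENCES Person(id),
    properties TEXT
);
""")
conn.executemany("INSERT INTO Person VALUES (?, ?)",
                 [(1, "{}"), (2, "{}"), (3, "{}")])
conn.execute("INSERT INTO knows VALUES (10, 1, 2, '{}')")
conn.execute("INSERT INTO Edge_Vertex VALUES (100, 10, '{}')")
conn.execute("INSERT INTO statedBy VALUES (1, 100, 3, '{}')")

# Follow statedBy -> Edge_Vertex -> knows to recover the statement that
# person 3 is attached to: (start, end) of the knows edge plus the stater.
rows = conn.execute("""
    SELECT k."start", k."end", sb."end"
    FROM statedBy sb
    JOIN Edge_Vertex ev ON sb."start" = ev.Vertexid
    JOIN knows k ON ev.Edgeid = k.id
""").fetchall()
```

The edge is stored twice, once as a row of `knows` and once as a row of `Edge_Vertex`, which is the dual storage described above.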

4.3 Transforming Property Graphs into Relational Tables

Property graphs also play a considerable role in knowledge graphs. In property graphs, an entity is represented as a vertex. Vertices and edges can have an arbitrary number of properties and can be categorized with labels, which group vertices and edges of the same category. Furthermore, each edge is directed and connects a start vertex to an end vertex.

We explore the transformation from property graphs to relational tables. For vertices and edges, we define the mapping rules for property graphs to relational tables as follows:

Rule 1. Each label is represented as a relational table that holds the vertices or edges of the same category.

Rule 2. Vertex tables have two columns, namely id (primary key) and properties.

Rule 3. Edge tables have four columns, namely id (primary key), start and end (both foreign keys referencing the id of vertex tables), and properties.

Rule 4. A vertex or an edge can be expressed as a row of the relational table.

According to the above rules, Fig. 4 visually shows the mapping from property graphs to relational tables.

Fig. 4. The mapping from property graphs to relational tables
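The four rules can be sketched with Python's built-in `sqlite3` and `json` modules. This is an illustration under stated assumptions rather than AgensGraph's actual storage layer: properties are serialized as JSON text, labels are used directly as table names, and the start and end columns are plain integers because a declared foreign key would require fixing the endpoint labels per edge table. The function name and sample data are hypothetical.

```python
import json
import sqlite3

def store_property_graph(conn, vertices, edges):
    """vertices: (id, label, props); edges: (id, label, start, end, props)."""
    for vid, label, props in vertices:
        # Rules 1 and 2: one table per vertex label with id and properties.
        conn.execute(f'CREATE TABLE IF NOT EXISTS "{label}" '
                     '(id INTEGER PRIMARY KEY, properties TEXT)')
        # Rule 4: each vertex is one row.
        conn.execute(f'INSERT INTO "{label}" VALUES (?, ?)',
                     (vid, json.dumps(props)))
    for eid, label, start, end, props in edges:
        # Rules 1 and 3: one table per edge label with four columns.
        conn.execute(f'CREATE TABLE IF NOT EXISTS "{label}" '
                     '(id INTEGER PRIMARY KEY, "start" INTEGER, '
                     '"end" INTEGER, properties TEXT)')
        # Rule 4: each edge is one row.
        conn.execute(f'INSERT INTO "{label}" VALUES (?, ?, ?, ?)',
                     (eid, start, end, json.dumps(props)))

conn = sqlite3.connect(":memory:")
store_property_graph(
    conn,
    vertices=[(1, "Person", {"name": "Alice"}), (2, "Person", {"name": "Bob"})],
    edges=[(10, "KNOWS", 1, 2, {"since": 2019})],
)

# Traversing an edge becomes a join between the edge table and vertex tables.
rows = conn.execute("""
    SELECT p1.properties, p2.properties
    FROM "KNOWS" k
    JOIN "Person" p1 ON k."start" = p1.id
    JOIN "Person" p2 ON k."end" = p2.id
""").fetchall()
names = [(json.loads(a)["name"], json.loads(b)["name"]) for a, b in rows]
```

The resulting tables have the same shape as those produced from RDF triples in Sect. 4.2, which is what allows both models to share one storage schema.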

5 Experiments

We conducted experiments on synthetic RDF datasets to verify the effectiveness and efficiency of our method. The database was deployed on a desktop computer with an Intel i54520 CPU (2 cores at 2.31 GHz), 8 GB of memory, a 512 GB disk, and 64-bit CentOS 7.0 as the OS.

We implemented the storage schema on AgensGraph v2.1.1. AgensGraph is a multi-model graph database for complex data environments that is described as robust, fully featured, and ready for enterprise use. AgensGraph supports both relational tables and property graphs, and it already realizes the mapping from property graphs to relational tables. Consequently, only RDF graphs need to be imported into AgensGraph as relational tables.

As shown in Fig. 5, we extended the existing storage mechanism of AgensGraph to accommodate RDF storage without affecting the original storage of relational tables and property graphs. With this extension, RDF and property graphs can be stored and managed independently and compatibly in AgensGraph.

Fig. 5. The complete mapping from RDF graphs to relational tables

We generated five synthetic datasets using LUBM (Lehigh University Benchmark) and imported them into AgensGraph as test samples. LUBM was developed to facilitate the evaluation of Semantic Web repositories in a standard and systematic way. It consists of a university domain ontology, customizable and repeatable synthetic data, a set of test queries, and several performance metrics. The characteristics of each dataset are shown in Table 1.

Table 1. Characteristics of experimental datasets

Experiment 1: Storage Performance Analysis. We considered two indicators to evaluate the storage performance, namely storage time and storage space.

Storage time overhead is a significant indicator to evaluate the performance of the storage schema for importing RDF triples. Figure 6(a) shows the storage time to store RDF datasets of different sizes.

Fig. 6. The storage time and space of RDF

Table 2. Number of vertices and edges

Storage space overhead is also important for measuring storage performance. The numbers of vertices and edges created by importing RDF into AgensGraph are shown in Table 2. Figure 6(b) plots the storage space of the different RDF datasets in AgensGraph. From Fig. 6(b) we can see that, as the size of the RDF data grows, the storage scheme significantly reduces the storage space of knowledge graphs and the redundancy of data storage.

Experiment 2: Interoperability of RDF and Property Graphs. LUBM provides 14 SPARQL queries to measure performance. We tested them on AgensGraph to examine the interoperability of RDF and property graphs. For instance, the query number, answers, and query time for LUBM50 are shown in Table 3. From Table 3, we found that the storage schema can effectively achieve interoperability.

Table 3. Query Results of LUBM50

Experiment 3: Comparison Between Import Methods. To verify the effectiveness of the storage schema, we compared the unified relational storage schema (Our-Method) with importing RDF graphs as property graphs (AgensGraph) in terms of storage time. As shown in Fig. 7, the storage time and storage space are positively correlated with the size of the datasets. The proposed relational storage schema is hundreds of times more efficient while using roughly equivalent storage space, which makes it suitable for large-scale RDF storage.

Fig. 7. The comparison between our method and AgensGraph

6 Conclusion and Outlook

In this paper, we have developed a unified relational storage schema for RDF and property graphs. On the one hand, we have solved the large-scale knowledge graph storage problem to some extent. On the other hand, the unified storage schema promotes the integration of the two mainstream data models of knowledge graphs, which plays an important role in the establishment of dominant knowledge graph databases.

A unified data model not only lowers the development and maintenance costs of database management systems, but also lowers the learning barrier for users. Based on the unified storage schema, a unified query schema for Cypher and SPARQL still needs to be proposed to achieve genuine interoperability between RDF and property graphs. Developing a unified knowledge graph query language with precise grammar and semantics is therefore an important direction for future research. Furthermore, research and development on the distributed storage of large-scale knowledge graph data is still in its infancy, and efficient algorithms for distributed queries remain to be improved.