Keywords

These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.

1 Introduction

With the introduction of its Knowledge Graph [20], Google has coined the name for a new generation of knowledge-based systems that go beyond what was previously expected of areas that include graph databases, knowledge bases, machine-learning systems and rule-based logical reasoners. Beyond Google, companies are recognizing the need for making use of their vast amounts of data and knowledge in the form of enterprise knowledge graphs [20]. In the same way that databases created the need for Database Management Systems (DBMS) and knowledge bases fostered the creation of Knowledge Base Management Systems (KBMS), the interest in Knowledge Graphs creates a need for academia and industry to understand and develop knowledge graph management systems (KGMS).

There are a variety of requirements for a KGMS. For a detailed discussion of such requirements, we refer to [4, 13]. Yet at its core, a KGMS must be expressive enough to represent knowledge beyond mere facts, but at the same time provide scalability in the presence of very large amounts of data and complex reasoning tasks. The most wide-spread form of knowledge that has been adopted over the last decades has been in the form of rules, be it in rule-based systems, ontology-based systems or other forms. The nature of graphs makes the presence of recursion in these rules a particularly important aspect. The need to refer to unknown or missing information makes the presence of existential quantification essential. At the same time, a KGMS has to provide methods and tools for data analytics and machine learning. Moreover, a KGMS needs interfaces to many heterogeneous data sources, including: corporate RDBMS, NoSQL stores, the web, machine-learning and analytics packages. We represent a short summary of a KGMS reference architecture in Fig. 1.

Fig. 1.
figure 1

KGMS reference architecture [4].

The search for knowledge representation and reasoning formalisms that satisfy some or all of these criteria has led the community to develop a wide range of languages. In this paper, we give a short introduction to the Vadalog language and system, a KGMS that satisfies the requirements summarized above. The logical core of Vadalog is Warded Datalog\(^\pm \) [2, 15]. The prominent feature of Warded Datalog\(^\pm \) is that it (1) includes all of Datalog, (2) is able to express SPARQL under set semantics and the OWL 2 QL entailment regime and (3) still has polynomial time data complexity. The vadalog system [5,6,7] follows the reference architecture shown in Fig. 1 and exploits the theoretical underpinnings of relevant Datalog\(^\pm \) languages, combining them with existing and novel techniques from database engineering and AI practice. A more detailed introduction to the Vadalog language and system is given in [4]. Yet, PTIME data complexity, while sufficient for many conventional applications, can be prohibitive for “Big Data” scenarios. One such example is towards building knowledge graphs that consider huge elections in the area of computational social choice [12]. Many more could be named. Therefore, we describe sub-languages, in particular strongly warded Datalog\(^\pm \), which are suited for such scenarios.

This survey includes, in abbreviated form, material from a number of previous papers on the topic [4,5,6,7]. The vadalog system is Oxford’s contribution to VADA [17], a joint project of the universities of Edinburgh, Manchester, and Oxford. We reported first work on the overall VADA approach to data wrangling in [13]. In this paper, we focus on the vadalog system at its core. Our system currently fully implements the core language and is already in use for a number of industrial applications.

2 The VADALOGvadalog Core Language

vadalog is a Datalog-based language. It belongs to the Datalog\(^\pm \) family of languages that extend Datalog by existential quantifiers in rule heads, as well as by other features, and restricts at the same time its syntax in order to achieve decidability and data tractability; see, e.g., [8,9,10,11]. The logical core of the vadalog language corresponds to Warded Datalog\(^\pm \) [2, 15], which captures plain Datalog as well as SPARQL queries under the entailment regime for OWL 2 QL [14], and is able to perform ontological reasoning tasks. Reasoning with the logical core of vadalog is computationally efficient. vadalog is obtained by extending Warded Datalog\(^\pm \) with additional features of practical utility. We now illustrate the logical core of vadalog, more details about extensions can be found in [4].

The logical core of vadalog relies on the notion of wardedness, which applies a restriction on how the “dangerous” variables of a set of existential rules are used. Note that existential rules are also known as tuple-generating dependencies (tgds), i.e., Datalog rules were existential quantification is allowed in the head. Intuitively, a “dangerous” variable is a body-variable that can be unified with a labeled null value when the chase algorithm is applied, and it is also propagated to the head of the rule. For example, given the set \(\varSigma \) consisting of the rules

$$ P(x) \rightarrow \exists z \, R(x,z) \quad \text {and}\quad R(x,y) \rightarrow P(y), $$

the variable y in the body of the second rule is “dangerous” (w.r.t. \(\varSigma \)) since starting, e.g., from the database \(D = \{P(a)\}\), the chase will apply the first rule and generate \(R(a,\nu )\), where \(\nu \) is a null that acts as a witness for the existentially quantified variable z, and then the second rule will be applied with the variable y being unified with \(\nu \) that is propagated to the obtained atom \(P(\nu )\).

The goal of wardedness is to tame the way null values are propagated during the construction of the chase instance by posing the following conditions: (i) all the “dangerous” variables should coexist in a single body-atom \(\alpha \), called the ward; (ii) the ward can share only “harmless” variables with the rest of the body, i.e., variables that are unified only with database constants during the construction of the chase.

Warded Datalog\(^\pm \) consists of all the (finite) sets of warded existential rules. As an example of a warded set of rules, the following rules encode part of the OWL 2 direct semantics entailment regime for OWL 2 QL (see [2, 15]):

$$\begin{aligned} \underline{\mathrm{Type}(x,y)}, \mathrm{Restriction}(y,z)\rightarrow & {} \exists w \, \mathrm{Triple}(x,z,w)\\ \underline{\mathrm{Type}(x,y)}, \mathrm{SubClass}(y,z)\rightarrow & {} \mathrm{Type}(x,z)\\ \underline{\mathrm{Triple}(x,y,z)}, \mathrm{Inverse}(y,w)\rightarrow & {} \mathrm{Triple}(z,w,x)\\ \underline{\mathrm{Triple}(x,y,z)}, \mathrm{Restriction}(w,y)\rightarrow & {} \mathrm{Type}(x,w). \end{aligned}$$

It is easy to verify that the above set is warded, where the underlined atoms are the wards. Indeed, a variable that occurs in an atom of the form \(\mathrm{Restriction}(\cdot ,\cdot )\), or \(\mathrm{SubClass}(\cdot ,\cdot )\), or \(\mathrm{Inverse}(\cdot ,\cdot )\), is trivially harmless. However, variables that appear in the first position of \(\mathrm{Type}\), or in the first/third position of \(\mathrm{Triple}\) can be dangerous. Thus, the underlined atoms are indeed acting as the wards.

3 Efficient Sub-languages in VADALOGvadalog

Reasoning in Warded Datalog\(^\pm \) is PTIME-complete in data complexity [2, 15]. Although polynomial time data complexity is desirable for conventional applications, PTIME-hardness can be prohibitive for “Big Data” applications; in fact, this is true even for linear time data complexity. This raises the question whether there are fragments of Warded Datalog\(^\pm \) that guarantee lower data complexity, but at the same time maintain the favourable properties discussed above. Of course, such a fragment should be weaker than full Datalog since Datalog itself is already ptime-complete in data complexity. On the other hand, such a fragment should be powerful enough to compute the transitive closure of a binary relation, which is a crucial feature for reasoning over graphs, and, in particular, for capturing SPARQL queries under the entailment regime for OWL 2 QL. Therefore, the complexity of such a refined fragment is expected to be nlogspace-complete.

Such a fragment of Warded Datalog\(^\pm \), dubbed Strongly-Warded, can be defined by carefully restricting the way recursion is employed. Before giving the definition, let us recall the standard notion of the predicate graph of a set \(\varSigma \) of existential rules, which essentially encodes how the predicates in \(\varSigma \) interact. The predicate graph of \(\varSigma \), denoted \( PG (\varSigma )\), is a directed graph (VE), where the node set V consists of all the predicates occurring in \(\varSigma \), and we have an edge from a predicate P to a predicate R iff there exists \(\sigma \in \varSigma \) such that P occurs in the body of \(\sigma \) and R occurs in the head of \(\sigma \). Consider a set of nodes \(S \subseteq V\) and a node \(R \in V\). We say that R is \(\varSigma \)-reachable from S if there exists at least one node \(P \in S\) that can reach R via a path in \( PG (\varSigma )\). We are now ready to introduce strong-wardedness.

A set of existential rules \(\varSigma \) is called strongly-warded if \(\varSigma \) is warded, and, for each \(\sigma \in \varSigma \) of the form

$$ \varphi (\bar{x},\bar{y})\ \rightarrow \ \exists \bar{z}\, (P_1(\bar{x},\bar{z}) \wedge \ldots \wedge P_n(\bar{x},\bar{z})), $$

there exists at most one atom in \(\varphi (\bar{x},\bar{y})\) whose predicate is \(\varSigma \)-reachable from \(\{P_1,\ldots ,P_n\}\). Strongly-Warded Datalog\(^\pm \) consists of all the (finite) sets of existential rules that are strongly-warded. Intuitively, in a strongly-warded set of existential rules, each rule \(\sigma \) is either non-recursive, or it employs a mild form of recursion in the sense that an atom generated by \(\sigma \) during the construction of the chase instance can affect exactly one body-atom of \(\sigma \). Let us clarify that the additional syntactic condition posed on warded existential rules in order to obtain strongly-warded existential rules, is the same as the condition underlying Piecewise Linear Datalog; see, e.g., [1].

It can be shown that our main reasoning task of tuple inference under Strongly-Warded Datalog\(^\pm \) is nlogspace-complete in the data complexity. Moreover, this refined language remains powerful enough for capturing OWL 2 QL, and, extended by a mild form of negation, can express every SPARQL query under the entailment regime for OWL 2 QL. As already explained above, the nlogspace data complexity immediately excludes full Datalog. However, Strongly-Warded Datalog\(^\pm \) includes some important and well-studied fragments of Datalog: (i) Non-Recursive Datalog, where the underlying predicate graph is acyclic, and (ii) IDB-Linear Datalog, where each rule can have at most one intensional predicate (i.e., it appears in the head of at least one rule) in its body, while all the other predicates are extensional.

A lightweight fragment of Strongly-Warded Datalog\(^\pm \) that is FO-Rewritable is Linear Datalog\(^\pm \), where each existential rule can have exactly one body-atom [9]. FO-Rewritability means that, given a pair \(Q = (\varSigma ,\mathrm{Ans})\), we can construct a (finite) first-order query \(Q_{ FO }\) such that, for every database D, Q(D) coincides with the evaluation of \(Q_{ FO }\) over D. This immediately implies that tuple inference under Linear Datalog\(^\pm \) is in ac\(_0\) in data complexity. Despite its simplicity, Linear Datalog\(^\pm \) is expressive enough for expressing every OWL 2 QL axiom. However, it cannot compute the transitive closure of a binary relation, which is unavoidable to ensure FO-Rewritability. This makes it unsuitable for querying RDF graphs under the entailment regime for OWL 2 QL.

4 The VADALOGvadalog System

The functional architecture of the vadalog system, our KGMS, is depicted in Fig. 1. The knowledge graph is organized as a repository, a collection of vadalog rules. The external sources are supported by means of transducers, intelligent adapters that integrate the sources into the reasoning process.

The Big Data characteristics of the sources and the complex functional requirements of reasoning are tackled by leveraging the underpinnings of the core language, which are turned into practical execution strategies. In particular, in the reasoning algorithms devised for Warded Datalog\(^\pm \), after a certain number of chase steps (which, in general, depends on the input database), the chase graph [9] (a directed acyclic graph where facts are represented as nodes and the applied rules as edges) exhibits specific periodicities and no new information, relevant to query answering, is generated. The vadalog system adopts an aggressive recursion and termination control strategy, which detects such redundancy as early as possible by combining compile-time and runtime techniques. In combination with a highly engineered architecture, the vadalog system achieves high performance and an efficient memory footprint.

At compile time, as wardedness limits the interaction between the labeled nulls, the engine rewrites the program in such a way that joins on specific values of labeled nulls will never occur. This exploits work on schema mapping composition and optimization [16, 19]. More details on the Vadalog system can be found in [7]. The system includes many other features, such as data extraction with OXPath, which is in use with our collaborators at dblp [18].

5 Conclusion

The vadalog system is already in use for a number of industrial applications. We believe that the vadalog system is a well-suited platform for knowledge graph applications that integrate machine learning (ML) and data analytics with logical reasoning. We are currently implementing applications of this type and will report about them soon. Other extensions that we envision in the scope of efficient sub-languages are in the area of inconsistency, in particular towards efficient fragments of consistent query answering [3].