1 Introduction

Enterprises increasingly depend on intelligent information systems that operationalise corporate knowledge as a unified source across system boundaries. Such systems crucially rely on insights produced by data scientists, who use advanced data and graph analytics together with machine learning and statistical models to create predictive actionable knowledge from suitably preprocessed corporate data by means of data wrangling. To maintain their competitive edge, companies need to incorporate multiple heterogeneous sources of information, including streams of structured or unstructured data from internal systems (e.g., Enterprise Resource Planning, Workflow Management, and Supply Chain Management), external streams of unstructured data (e.g., news and social media feeds, and Common Crawl), publicly available and proprietary sources of semi-structured data (e.g., DBpedia [11], Wikidata [46], UniProt [19], data.gov.uk), structured data extracted from web pages using web data extraction techniques [24], as well as internal and external knowledge bases/ontologies (e.g., ResearchCyc, DBpedia [11], Wikidata [46], FIBO). The integration of such diverse information is a non-trivial task that presents data scientists with a number of challenges including: the extraction and handling of big data with frequently changing content and structure; dealing with uncertainty of the extracted data; and finding ways of unifying the information from different sources.

Following the trend of large technological companies such as Google, Amazon, Facebook, and LinkedIn, it is becoming common for enterprises to integrate their internal and external sources of information into a unified knowledge graph. A knowledge graph typically consists of graph-structured data, to allow for smooth accommodation of changes in the structure of the data, and knowledge layers, which encode business logic used for the validation and enrichment of data and the uncovering of critical insights from it. Graph-structured data may stem from data directly exposed as graphs (e.g., RDF used by triple stores such as GraphDB, Property Graphs used by graph databases like neo4j and JanusGraph) or from relational or semi-structured data that exhibits graph structure. The consolidated and enriched knowledge graph is then processed using the standard data science toolkit for graph analytics (including languages such as Cypher, SPARQL, and Gremlin), statistical analysis (using the R statistical framework), and machine learning (using toolkits such as Weka, scikit-learn, and TensorFlow).

The creation of a coherent knowledge graph from multiple sources of unstructured, semi-structured, and structured data is a challenging task that requires techniques from multiple disciplines. Entity resolution [18] is used to combine multiple sources of (semi-)structured data that do not share common identifiers. The goal is to identify pairs of entities that refer to the same real-world object and merge them into a single entity. The matching is performed using noisy, semi-identifying information (e.g., names, addresses) and relationships, and employs specialised similarity functions for strings, numbers, and dates to determine the overall similarity of two entities. Information extraction [43] is used for automatically extracting structured data from unstructured sources (e.g., news and social media feeds). Thus, for example, the news item “PayPal buys Hyperwallet for $400M” could result in the structured statement “acquire(PayPal, Hyperwallet)”. Information extraction is typically combined with entity resolution to correctly incorporate the extracted information within an existing knowledge graph.

Publicly available datasets are often equipped with ontologies that describe relationships between entities. In such cases, ontological reasoning needs to be applied to validate whether the results of entity resolution and information extraction violate any of the constraints imposed by the ontology, as well as to enrich the data with new information stemming from the newly produced facts. Further note that, unsurprisingly, the use of machine learning is pervasive throughout the stages of the data scientist’s workflow: from semantically annotating web page elements during web data extraction, through deciding whether entities should be matched during entity resolution, to predicting numerical trends during data analytics over the knowledge graph. Finally, observe that although uncertainty is intrinsic to many of the tasks in the data scientist’s workflow, it is typically resolved by means of a threshold. For example, during entity resolution, the similarity of the attributes of two entities is typically converted into a probability that the two entities are the same, and they are matched if the probability exceeds a certain threshold. Similarly, the information extraction stage typically associates output facts with a level of uncertainty stemming from the extraction process, but, as in the case of entity resolution, the uncertainty is converted into a probability that a fact holds, and a hard decision is made on whether it should be included or not. Interestingly, one can do better than that. One may want to impose levels of uncertainty using business rules to better inform the decision of whether and how the knowledge graph should be updated. One such rule, for example, could be that public companies are much more likely to acquire private companies than vice versa (the so-called reverse takeover). Such rules can be produced by a domain expert or learned from the data using rule learning [7]. Furthermore, instead of discarding the uncertainty after it has been used to decide, for example, whether to accept a fact or a match, one could incorporate it into the knowledge graph and propagate it into the later stages of the data wrangling and data analytics workflow.
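
As a preview, such a business rule could be sketched in the code-like rule notation used later in the paper, with the weight prefix of Sect. 6; the predicate names and the weights here are purely illustrative assumptions:

  0.9 :: acquire(X,Y) :- deal(X,Y), publicCompany(X), privateCompany(Y).
  0.1 :: acquire(Y,X) :- deal(X,Y), publicCompany(X), privateCompany(Y).

Intuitively, a deal between a public company X and a private company Y is most likely an acquisition of Y by X, and only rarely a reverse takeover; the weights are kept in the knowledge graph rather than being collapsed into a hard yes/no decision.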

To carry out the different stages of the described workflow, data scientists need to use and coordinate a number of tools, languages, and technologies: for data access they require tools for web data extraction, various database management systems, triple stores, and graph databases; during knowledge graph construction they require tools for entity resolution, information extraction, ontological reasoning, and uncertainty management; and during the analysis stage they require tools for graph analytics, machine learning, and statistical modelling. The coordination of all these tools can be very challenging.

In this paper we present the Vadalog engine: a state-of-the-art Knowledge Graph Management System (KGMS) that provides a unified framework for integrating the various tools and technologies used by data scientists. Its language Vadalog is an extension of the rule-based language Datalog [1], and can naturally capture SQL (through support for the SQL operators), ontological reasoning in OWL 2 QL and SPARQL (through the use of existential quantifiers), and graph analytics (through non-trivial support for recursion and aggregation). The declarative nature of the language makes the code concise, manageable, and self-explanatory. The engine is fully extensible through its bindings to different data sources and libraries. Data extensions provide access to relational data stored in Postgres or MySQL, for example, or to graph data stored in neo4j or Janus, or to web data using OXPath [24]. Library extensions allow the integration of state-of-the-art machine learning tools such as Weka, scikit-learn, or TensorFlow. Additional integration with libraries for string similarities and regular expressions allows for defining complex entity resolution workflows. The engine also supports reasoning with probabilistic data and probabilistic rules, which makes it ideal for handling uncertainty stemming from the different stages of the data scientist’s workflow. Finally, the Vadalog engine seamlessly integrates with Jupyter: a well-known platform for data analysts and scientists with a convenient interface for data processing and visualisation.

The paper is organised as follows. Section 2 provides an overview of the core language. Section 3 provides a system overview of the Vadalog engine. Section 4 describes the various features of the system within a typical data scientist’s workflow in Jupyter. Section 5 demonstrates the engine’s integration with machine learning on typical use cases. Finally, Sect. 6 describes in more detail the support of the system for probabilistic reasoning.

This paper includes, in abbreviated form, material from a number of previous papers on the topic [7,8,9,10]. The Vadalog system is Oxford’s contribution to VADA [34], a joint project of the universities of Edinburgh, Manchester, and Oxford. We reported first work on the overall VADA approach to data wrangling in [25]. In this paper, we focus on the Vadalog system at its core. Currently, our system fully implements the core language and is already in use for a number of industrial applications.

2 Core Language

Vadalog is a Datalog-based language. It belongs to the Datalog\(^\pm \) family of languages that extends Datalog by existential quantifiers in rule heads, as well as by other features, and at the same time restricts its syntax in order to achieve decidability and data tractability; see, e.g., [14,15,16,17]. The logical core of the Vadalog language corresponds to Warded Datalog\(^\pm \) [4, 29], which captures plain Datalog as well as SPARQL queries under the entailment regime for OWL 2 QL [28] and is able to perform ontological reasoning tasks. Reasoning with the logical core of Vadalog is computationally efficient. Vadalog is obtained by extending Warded Datalog\(^\pm \) with additional features of practical utility. We now illustrate the logical core of Vadalog; more details about the extensions can be found in [7].

The logical core of Vadalog relies on the notion of wardedness, which applies a restriction on how the “dangerous” variables of a set of existential rules are used. Note that existential rules are also known as tuple-generating dependencies (tgds), i.e., Datalog rules where existential quantification is allowed in the head. Intuitively, a “dangerous” variable is a body-variable that can be unified with a labelled null value when the chase algorithm is applied, and it is also propagated to the head of the rule. For example, given the set \(\varSigma \) consisting of the rules

$$ P(x) \rightarrow \exists z \, R(x,z) \quad \text {and}\quad R(x,y) \rightarrow P(y), $$

the variable y in the body of the second rule is “dangerous” (w.r.t. \(\varSigma \)) since starting, e.g., from the database \(D = \{P(a)\}\), the chase will apply the first rule and generate \(R(a,\nu )\), where \(\nu \) is a null that acts as a witness for the existentially quantified variable z, and then the second rule will be applied with the variable y being unified with \(\nu \) that is propagated to the obtained atom \(P(\nu )\).

Note that, throughout this paper, we will mix the “logical” notation shown above, which is often used in papers, with the “code”-like notation used in systems such as the Vadalog system. The above example would be given as follows in Vadalog notation:

  r(X,Z) :- p(X).
  p(Y) :- r(X,Y).

The goal of wardedness is to tame the way null values are propagated during the construction of the chase instance by posing the following conditions: (i) all the “dangerous” variables should coexist in a single body-atom \(\alpha \), called the ward; (ii) the ward can share only “harmless” variables with the rest of the body, i.e., variables that are unified only with database constants during the construction of the chase.

Warded Datalog\(^\pm \) consists of all the (finite) sets of warded existential rules. As an example of a warded set of rules, the following rules encode part of the OWL 2 direct semantics entailment regime for OWL 2 QL (see [4, 29]):

$$\begin{aligned} \underline{\mathrm{Type}(x,y)}, \mathrm{Restriction}(y,z) &\rightarrow \exists w \, \mathrm{Triple}(x,z,w)\\ \underline{\mathrm{Type}(x,y)}, \mathrm{SubClass}(y,z) &\rightarrow \mathrm{Type}(x,z)\\ \underline{\mathrm{Triple}(x,y,z)}, \mathrm{Inverse}(y,w) &\rightarrow \mathrm{Triple}(z,w,x)\\ \underline{\mathrm{Triple}(x,y,z)}, \mathrm{Restriction}(w,y) &\rightarrow \mathrm{Type}(x,w). \end{aligned}$$

It is easy to verify that the above set is warded, where the underlined atoms are the wards. Indeed, a variable that occurs in an atom of the form \(\mathrm{Restriction}(\cdot ,\cdot )\), or \(\mathrm{SubClass}(\cdot ,\cdot )\), or \(\mathrm{Inverse}(\cdot ,\cdot )\), is trivially harmless. However, variables that appear in the first position of \(\mathrm{Type}\), or in the first/third position of \(\mathrm{Triple}\) can be dangerous. Thus, the underlined atoms are indeed acting as the wards.
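
In the code-like notation introduced above, this warded set would read roughly as follows (a direct transcription; the existentially quantified variable W simply does not occur in the corresponding rule body):

  triple(X,Z,W) :- type(X,Y), restriction(Y,Z).
  type(X,Z) :- type(X,Y), subClass(Y,Z).
  triple(Z,W,X) :- triple(X,Y,Z), inverse(Y,W).
  type(X,W) :- triple(X,Y,Z), restriction(W,Y).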

Reasoning in Warded Datalog\(^\pm \) is PTIME-complete in data complexity [4, 29]. Although polynomial-time data complexity is desirable for conventional applications, PTIME-hardness can be prohibitive for “Big Data” applications. One such example is the construction of knowledge graphs over huge elections in the area of computational social choice [20]. In fact, this can be true even for linear-time data complexity. This is discussed in more detail in [7].

This core language has a number of extensions that make it practical, among them data types, arithmetic, (monotonic) aggregation, bindings of predicates to external data sources, bindings of function symbols to external functions, and more.

We will discuss monotonic aggregation here. Vadalog supports aggregation (min, max, sum, prod, count) by means of an extension to the notion of monotonic aggregation [44], which allows the use of aggregation even in the presence of recursion while preserving monotonicity w.r.t. set containment. Such functionality is crucial for performing graph analytics, an example of which is shown in Sect. 4.

We will discuss some of these extensions throughout this paper. Among the planned extensions is more support for consistency, in particular consistent query answering [3, 5] as well as view updates [13, 31].

Fig. 1. KGMS reference architecture [7]

3 Core System

The functional architecture of the Vadalog system, our KGMS, is depicted in Fig. 1. The knowledge graph is organised as a repository, a collection of Vadalog rules. The external sources are supported by means of transducers, intelligent adapters that integrate the sources into the reasoning process.

The Big Data characteristics of the sources and the complex functional requirements of reasoning are tackled by leveraging the theoretical underpinnings of the core language, which are turned into practical execution strategies. In particular, in the reasoning algorithms devised for Warded Datalog\(^\pm \), after a certain number of chase steps (which, in general, depends on the input database), the chase graph [15] (a directed acyclic graph where facts are represented as nodes and the applied rules as edges) exhibits specific periodicities and no new information relevant to query answering is generated. The Vadalog system adopts an aggressive recursion and termination control strategy, which detects such redundancy as early as possible by combining compile-time and runtime techniques. In combination with a highly engineered architecture, the Vadalog system achieves high performance and a small memory footprint.

At compile time, as wardedness limits the interaction between the labelled nulls, the engine rewrites the program in such a way that joins on specific values of labelled nulls will never occur. This exploits work on schema mapping composition and optimisation [32, 33, 38, 42].

The Vadalog system uses a pull stream-based approach (or pipeline approach), where facts are actively requested by the output nodes from their predecessors, and so on down to the input nodes, which eventually fetch the facts from the data sources. The stream approach is essential to limit memory consumption, or at least to make it predictable, so that the system is effective for large volumes of data. Our setting is made more challenging by the presence of multiple interacting rules in a single rule set and the wide presence of recursion. We address this by means of a specialised buffer management technique. We adopt pervasive local caches in the form of wrappers to the nodes of the access plan, where the facts produced by each node are stored. The local caches work particularly well in combination with the pull stream-based approach, since facts requested by one successor of a node can be immediately reused by all the other successors, without triggering further backward requests. This combination also realises an extreme form of multi-query optimisation, where each rule exploits the facts produced by the others, whenever applicable. To limit memory occupation, the local caches are flushed with an eager eviction strategy that detects when a fact has been consumed by all the possible requestors and then drops it from memory. Cases of actual cache overflow are managed by resorting to standard disk swap heuristics (e.g., LRU, LFU).

More details on the Vadalog system can be found in [10]. The system includes many other features, such as data extraction with OXPath, which is in use with our collaborators at dblp [36].

4 Supporting the Data Science Workflow

As the importance of data science constantly increases, the Vadalog system can support, to a certain extent, the entire spectrum of data science tasks and processes. It does not, however, replace the tools specialists like to use, but rather provides a universal platform for integrating various approaches and tools into a unified framework. All integrations are realised in terms of data binding primitives and functions.

One key example is the choice of UI/development platform: Jupyter was chosen as a platform that data scientists are familiar with. The Vadalog system integrates seamlessly with JupyterLab through a Vadalog extension and kernel (see Fig. 2). JupyterLab is a well-known platform for data analysts and scientists with a convenient interface for data processing and visualisation. It has multi-user support, in which dedicated resources and environments are associated with a concrete user. The Vadalog extension and kernel for JupyterLab allow data scientists to evaluate the correctness of a program, run it, and analyse the derivation process of interesting output facts. All output is rendered in JupyterLab’s output area.

Fig. 2. Example of the Vadalog program for inferring a company control indicator

Data Binding Primitives. Bindings make it possible to connect an automated reasoning workflow with external systems for data exchange. An external system can be a database, framework, library, or information system. Currently Vadalog supports relational databases, such as Postgres and MySQL, and graph databases, such as neo4j. It also has seamless integration with machine learning tools, e.g., Weka and scikit-learn (see Sect. 5.1), and with a web data extraction tool, OXPath [24] (see Fig. 3). Other integrations are included or can easily be added. Data sources and targets are declared using the @input and @output annotations. Annotations are special facts that augment sets of existential rules with specific behaviours. @input and @output define the direction of facts into and out of the Vadalog program, respectively. An additional @bind annotation defines the means for interacting with an external system. A query bind annotation, @qbind, is a special modification of @bind: it supports binding predicates to queries against inputs and outputs in the external language (e.g., SQL queries for a data source or target that supports SQL). The first parameter of @bind and @qbind specifies the predicate the external resource is bound to; the second parameter defines the type of the target (e.g., “postgres”). In case the schema of an external resource cannot be derived automatically, or should be overridden, an additional @mapping annotation can be used to define the mapping strategy for tuples exchanged between Vadalog and the external system. A minimal sketch of how these annotations fit together is shown below.
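
The following sketch declares an input relation bound to Postgres and an output relation written back to Postgres; the connection parameters after the target type, as well as the table and attribute names, are assumptions made purely for illustration:

  @input("ownsDirectly").
  @bind("ownsDirectly", "postgres", "company_db", "owns_directly").
  @output("controls").
  @bind("controls", "postgres", "company_db", "company_control").
  @mapping("controls", 0, "controller", "string").
  @mapping("controls", 1, "controlled", "string").

Facts for ownsDirectly are streamed in from the bound table, while derived controls facts are streamed out; the @mapping annotations override the target schema with explicit attribute names and types.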

In Fig. 2, we give a synthetic example of a Vadalog program that infers a company control indicator. It can be formulated as follows: a company A “controls” a company B if A owns directly or indirectly (i.e., via shares in other companies) more than 50% of B’s shares (lines 18–25). As we can see, various strategies for binding external resources can be used in the Vadalog program. For example, data tuples ownsDirectly can be propagated into the program from the parametric @qbind (lines 6–7 for Postgres, via tuples ownsDirectlyDB) or @bind (line 11 for CSV, via tuples ownsDirectlyCSV). For @qbind, the SQL query is instantiated with the parameter from the predicate relevant_country (line 8). The query instantiation is realised within the join, in which the parameter C from the relevant_country predicate is propagated into the fourth term of the predicate ownsDirectlyDB. In contrast, in the case of @bind, all data is streamed into the Vadalog system and filtered on the fly by only selecting information regarding the “relevant country” (line 16). ownsDirectly tuples can also be specified within the program in terms of facts (line 3). During the evaluation of the program, each derived controls tuple is streamed into a Postgres database, as specified in lines 26–27. The core recursion of the example is sketched below.
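
The control rules themselves can be sketched roughly as follows; this is not the verbatim content of Fig. 2, and the predicate company as well as the monotonic aggregate msum are assumptions in the spirit of the aggregation features of Sect. 2:

  controls(X,X) :- company(X).
  controls(X,Z) :- controls(X,Y), ownsDirectly(Y,Z,S), T = msum(S), T > 0.5.

Intuitively, the shares of Z held via all companies Y already controlled by X are accumulated monotonically, and controls(X,Z) is derived as soon as the accumulated share exceeds 50%.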

Fig. 3. Integration of OXPath, a web data extraction tool

In Fig. 3, we illustrate an example of binding with OXPath. OXPath [24] is a web data extraction language, an extension of XPath 1.0 for interacting with web applications and extracting data from them. In this example, the OXPath binding streams all articles of Georg Gottlob from the dblp website into the Vadalog program. Extracted articles can be represented as a relation article(authors, title, publication, pages); a sketch of the corresponding declarations is given below. Integration with machine learning tools is discussed in the next section.
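
On the Vadalog side, such a binding could be declared roughly as follows; the target type name “oxpath”, the placement of the (elided) OXPath expression, and the attribute types are assumptions:

  @input("article").
  @bind("article", "oxpath", "...").
  @mapping("article", 0, "authors", "string").
  @mapping("article", 1, "title", "string").
  @mapping("article", 2, "publication", "string").
  @mapping("article", 3, "pages", "string").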

Functions. Besides bindings, functions provide the data scientist with a rich set of value transformations and operations for the different data types supported in Vadalog. A user can write expressions of varying complexity, using operators and functions to perform arithmetic, manipulate strings and dates, and compare values. Examples of supported data types are string, integer, double, date, boolean, set, and a special identifier for unknown values, the marked null. A data scientist can also extend the set of supported functions with functions written in Python, which is enabled in the Vadalog framework. Functions can be combined into libraries. For example, @library("sim:", "simmetrics"). enables the “simmetrics” library in a Vadalog program, whose methods can then be invoked with the prefix sim:, as in sim:removeDiacritics(Text) to remove diacritics from Text; a small example is given below. We also provide libraries for building regression or classification models on the fly and applying them to the data derived during automatic reasoning (see Sect. 5.1).
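
For instance, an entity-resolution-style matching rule could combine such functions as follows; the predicate names, the similarity function sim:jaroWinkler, and the threshold are illustrative assumptions, while sim:removeDiacritics is the function mentioned above:

  @library("sim:", "simmetrics").
  match(X,Y) :- candidate(X,NameX), candidate(Y,NameY),
                S = sim:jaroWinkler(sim:removeDiacritics(NameX), sim:removeDiacritics(NameY)),
                S > 0.9.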

Fig. 4. A screenshot depicting code analysis for an altered Vadalog program in the company control example

Code Analysis. The correctness of a program can be assessed with the code analysis functionality (see Fig. 4), which checks whether the program contains common, well-known error patterns. For example, in Fig. 4, we altered the original program illustrated in Fig. 2: the parameter Share of the condition in line 18 was replaced with Share2, lines 10–15 were commented out, leaving ownsDirectlyCSV without a binding, and the output controls was changed to controls2.

Fig. 5. A screenshot of the output depicting a “yes”-explanation for the fact controls("A", "C") in the company control example

Fact Derivation Analysis. The analysis of derivations can be performed with the use of explanations (see Fig. 5), which show how a certain fact has been derived within the program and which rules have been triggered.

Bindings and functions make data analytics both more effective and more efficient. Vadalog interacts directly with various data sources regardless of their nature, be it a database or the Web. Furthermore, with its rich reasoning capabilities it can lift the analysis from basic values, tuples, or relations within databases to semantically rich structures, e.g., from property graphs such as those of neo4j to concepts of a domain ontology. This makes the code more concise and self-explanatory.

The Vadalog system is a universal tool that can reconcile the two opposite paradigms of data scientists and domain experts, the so-called “inductive” (or bottom-up) and “deductive” (or top-down) approaches. The inductive paradigm is captured by the statement that “patterns emerge before reasons for them become apparent” [21]. It refers to the data mining and machine learning approaches used for deriving new knowledge and relations from data. As all data scientists experience in practice, “all models are wrong and some are useful” [12, p. 208], which captures the difficulty of finding the best model for a given dataset. Furthermore, limitations related to labour-intensive labelling for some machine learning algorithms can also lead to incorrect or incomplete results. Thus, the knowledge of a domain expert, applied deductively, is important for correcting potential errors propagated from learned models.

Fig. 6. Schematic view of the interaction between machine learning and reasoning

5 Integrating Machine Learning

In this section, we discuss how machine learning can be integrated directly. We focus on one approach to machine learning integration, schematically illustrated in Fig. 6. In the first subsection, we concretely describe the Weka and scikit-learn integration; the system’s TensorFlow integration is similar in style to the scikit-learn integration. This is followed in Subsect. 5.2 by a case study on feature engineering. We conclude in Subsect. 5.3 with how custom ML algorithms can be implemented directly in the system.

Fig. 7. A snippet of Vadalog code, which demonstrates training a J48 Weka model

Fig. 8. A snippet of Vadalog code, which demonstrates the classification phase with a trained J48 Weka model

5.1 Direct Integration

Weka. Integration with the machine learning framework Weka is demonstrated in Figs. 7 and 8. Figure 7 illustrates the generation of a J48 model for the Iris dataset. Training data is propagated to the bound decision tree classifier associated with the predicate j48. Mapping annotations specify the attributes and the class of the tuples streamed into the underlying machine learning algorithm. Figure 8 depicts an example of the classification process given a model M. Attributes of the data tuples to be classified and the generated model are streamed into the underlying Weka framework via the predicate j48. The results of the classification are instantiated in a relation classified_data. In the @qbind expression, the third parameter defines the nominal attributes, in our case the class, which had index 4 in the training phase. The fourth parameter of @qbind defines the parameter propagation template from the predicate j48 into the underlying model.
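
A rough sketch of the training side might look as follows; this is not the content of Fig. 7, and the binding target name, its parameters, and the attribute names are assumptions:

  @bind("j48", "weka", "iris_j48_model").
  @mapping("j48", 0, "sepal_length", "double").
  @mapping("j48", 1, "sepal_width", "double").
  @mapping("j48", 2, "petal_length", "double").
  @mapping("j48", 3, "petal_width", "double").
  @mapping("j48", 4, "class", "string").
  j48(SL,SW,PL,PW,C) :- iris_training(SL,SW,PL,PW,C).

The classification phase of Fig. 8 follows the same pattern, with @qbind propagating the model and the attributes of the tuples to be classified.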

SciPy Toolkits Machine Learning. An external Python library such as scikit-learn can be utilised for machine learning tasks over predicates through the Vadalog library framework. A basic linear regression example is shown below. The input consists of predicates of the form \(training\_set(ID, X, Y)\). The sk:fit function feeds the input data in one by one and returns the current training set size. Once a sufficient training set size is reached, the sk:train function is called, which has a boolean return value. The last rule takes prediction inputs one by one and retrieves the output from the trained model. \(\#T\) stands for the boolean value true.

figure a (code listing)
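
The listing is not reproduced above; based on the description, a program of this kind might look roughly as follows, where the library registration, the training-set size threshold of 100, and the prediction function name sk:predict are assumptions:

  @library("sk:", "sklearn_linear_regression").
  fed(C) :- training_set(ID,X,Y), C = sk:fit(X,Y).
  trained(B) :- fed(C), C >= 100, B = sk:train().
  result(ID,P) :- trained(#T), predict(ID,X), P = sk:predict(X).
  @output("result").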

5.2 Case Study: Feature Engineering

We consider a case study of implementing a supervised machine learning framework and post-classification reasoning with Vadalog. Our implementation consists of three phases: (i) feature extraction with Vadalog, (ii) interaction between Vadalog and a serialised classifier, and (iii) post-classification reasoning. We assume that the classifier has already been trained and serialised and, for reasons of brevity, omit the description of representing a training corpus and training the classifier with Vadalog, as this can be done through a simple extension of the framework. The schematic view of the framework we implement in this case study is given in Fig. 6.

Feature Extraction with Vadalog. Consider the problem of identifying semantic blocks on a web page, such as pagination bars, navigation menus, headers, footers, and sidebars [26, 35]. The page is represented by the DOM tree and the CSS model. We represent all information contained in both the DOM and the CSS as DOM facts, which are Vadalog EDB predicates. An example of three DOM facts representing (i) the font size of a DOM tree element with ID 100, (ii) its background colour, and (iii) the coordinates, width, and height of the corresponding CSS box is listed below.

figure b (code listing)
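
The listing is not reproduced above; the three facts might look roughly as follows, where the predicate names and the concrete values are assumptions and only their informal meaning is taken from the description:

  dom__css__fontSize(e_100, 16.0).
  dom__css__backgroundColor(e_100, "#ffffff").
  dom__css__box(e_100, 0, 120, 960, 40).

The last fact would carry the x and y coordinates, the width, and the height of the CSS box of the element with ID 100.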

In the code snippet below we extract a feature used in the navigation menu classifier: the average font size of the sub-tree rooted at a given DOM node N, i.e., the average font size computed over the set consisting of N and all of its descendant nodes (determined through the Start and End indices of DOM nodes).

figure c (code listing)
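
The snippet is not reproduced above; a sketch of such a feature rule, assuming a dom__treeNode(Node, Start, End) predicate exposing the Start/End indices and the monotonic aggregates msum and mcount (in practice, the final aggregate values would be selected, as in the mcount/max pattern of Sect. 5.3), could be:

  subtreeFontSize(N,D,F) :- dom__treeNode(N,S,E), dom__treeNode(D,SD,ED),
                            SD >= S, ED <= E, dom__css__fontSize(D,F).
  fontSum(N,T) :- subtreeFontSize(N,D,F), T = msum(F).
  fontCount(N,C) :- subtreeFontSize(N,D,F), C = mcount(F).
  feature__avgFontSize(N,A) :- fontSum(N,T), fontCount(N,C), A = T / C.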

Note that we use the feature namespace for the predicate that computes this particular feature, as well as for all other features used by classifiers. The feature predicates are the output of the feature extraction phase of the framework, so that they can be passed on as input to a serialised classifier.

Interaction with a Serialised Classifier. All extracted features are passed on to a serialised classifier through the @bind operator. For the case of web block classification, we use Weka as the machine learning library and a J48 decision tree as the classifier, but the implementation of the framework in Vadalog is both library and classifier agnostic; e.g., we can seamlessly integrate Vadalog with scikit-learn, as demonstrated in Subsect. 5.1, and the J48 decision tree classifier can be seamlessly replaced by any other classifier, e.g., an SVM. The classifications produced by the classifier are then passed back to Vadalog, also through the @bind operator. These classifications are in the classification namespace, e.g., classification(e_200, "navigation_menu"), which classifies the DOM node with ID 200 as a navigation menu.

Post-classification Reasoning. We can now apply, to the classifications computed in the previous phase, post-classification reasoning that cannot easily be expressed by machine learning classifiers. For example, given serialised header and footer classifiers and the classifications computed in the previous phase, we can impose the constraint that a header and a footer cannot overlap.

figure d (code listing)
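
The listing is not reproduced above; a sketch of such a post-classification rule, which reuses the (assumed) box predicate from the DOM facts and flags violations explicitly, could be:

  overlaps(X,Y) :- dom__css__box(X,XL,XT,XW,XH), dom__css__box(Y,YL,YT,YW,YH),
                   XL < YL + YW, YL < XL + XW, XT < YT + YH, YT < XT + XH.
  conflict(X,Y) :- classification(X,"header"), classification(Y,"footer"), overlaps(X,Y).
  @output("conflict").

A derived conflict fact indicates that the header and footer classifications overlap and cannot both be accepted.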

5.3 Direct Use of Algorithms

In case no external support is available, or users want to adapt their algorithms and tie them closer to the knowledge graph, a number of machine learning algorithms can be implemented directly in Vadalog. Note that this is a complementary alternative: if algorithms can be used out of the box based on existing systems and approaches, and no modification or close interaction with the knowledge graph is required, it is certainly a good idea to use such external systems and algorithms as described in Sect. 5.1. Taking advantage of the declarative programming paradigm, one only needs to express the logic of the definition concisely, instead of describing the exact algorithm explicitly. As a result, the program is easy to modify, verify, or execute in parallel. The application areas include, but are not limited to, clustering, anomaly detection, and weakly supervised learning.

We will use the DBSCAN (density-based spatial clustering of applications with noise) algorithm as a simple example [22]. The two main parameters of DBSCAN are eps (distance threshold) and minPts (minimal number of points for a dense region). The input is a set of points p(ID, X, Y), where ID is a sequential number serving as an identifier.

$$\begin{aligned}&eps(0.11), \ minPts(5),\\&p(1, 0.697, 0.460), \ p(2, 0.774, 0.376), \ {\ldots } \end{aligned}$$

Two points are in a neighbourhood if their Euclidean distance is at most eps. The neighbourhood number is obtained through aggregation as below.

$$\begin{aligned}&p(A, X_A, Y_A), p(B, X_B, Y_B), C=\sqrt{(X_A-X_B)^2+(Y_A-Y_B)^2} \\&\rightarrow point\_pairs (A, B, C). \\&point\_pairs(A, B, C), eps(E), C<=E \rightarrow neighbourhood(A, B).\\&neighbourhood(A, B), J=mcount(B) \rightarrow neighbourhood\_count(A, J). \\&neighbourhood\_count(A, J), K=max(J) \rightarrow neighbourhood\_number(A, K). \end{aligned}$$

Different types of points, i.e., core, border and noise, are defined as follows.

$$\begin{aligned}&neighbourhood\_number(A, K), minPts(M), K>=M \rightarrow core\_point(A).\\&\lnot core\_point(A), core\_point(B), neighbourhood(A, B) \rightarrow border\_point(A).\\&neighbourhood\_number(A, K), \lnot core\_point(A), \lnot border\_point(A) \\&\rightarrow noise\_point(A). \end{aligned}$$

Notions of density reachability and connectivity are defined below.

$$\begin{aligned}&core\_point(A), neighbourhood(A, B) \rightarrow directly\_reachable(A, B).\\&directly\_reachable(A, B) \rightarrow reachable(A, B).\\&reachable(A, C), directly\_reachable(C, B) \rightarrow reachable(A, B).\\&reachable(C, A), reachable(C, B) \rightarrow connected(A, B). \end{aligned}$$

The goal of the density clustering process is to find point pairs that satisfy the connectivity and maximality properties, respectively:

$$\begin{aligned} connected(A, X) \rightarrow cluster(A, X). \\ reachable(A, X) \rightarrow cluster(A, X). \end{aligned}$$

The cluster is identified by the point (from this cluster) which has the minimal ID number. This is achieved by the post-processing instruction, @post, which takes the minimum value for the second term (position) of the relation cluster, grouping by the first term (position).

@output("cluster"). @post("cluster", "min(2)").

Output Example: cluster(1,1). cluster(2,1). cluster(3,3).

6 Probabilistic Reasoning

In the design of winning data science solutions, it is increasingly clear that completely neglecting domain knowledge and blindly relying only on inductive models (i.e., with parameters learnt from data) easily leads to sub-optimal results, prone to overfitting if not to outright wrong conclusions. Thus, data scientists tend to integrate inductive reasoning with deductive approaches, complementing and, where appropriate, overruling machine learning models with domain knowledge.

In the Vadalog system, we introduce probabilistic knowledge graphs, a valuable tool for crafting a new kind of data science solution in which statistical models incorporate, and are driven by, the description of the domain knowledge.

Combining uncertainty and logic to describe rich uncertain relational structures is not new and has been the primary focus of Statistical Relational Learning (SRL) [27, 40]. One prominent representative of this area is Markov Logic Networks (MLN) [41], which describe relational structures in terms of first-order logic. A number of algorithms for exact and approximate reasoning in MLNs and other SRL models [6, 23] have been proposed, and systems such as Alchemy [41], Tuffy [37], and SlimShot [30] have been built. MLNs have been successfully applied in natural language processing [39], ontology matching [2], record linkage [45], and so on. Yet, one common limitation of SRL models is their logical reasoning side: logic in SRL is not utilised for deducing new knowledge, but rather serves the role of a constraint language. Systems built on top of these models are hence of very limited applicability in data science tasks.

Consider the following example.

Example 1

Let G be a knowledge graph, which contains the following facts about the ownership and link relationships between companies, augmented with a Vadalog program composed of rules (1) and (2):

$$\begin{aligned}&\mathrm{Own}(a,b,0.4), \mathrm{Own}(b,c,0.5), \mathrm{Own}(a,d,0.6), \mathrm{Own}(d,c,0.5),\\&\mathrm{Linked}(a,b),\mathrm{Linked}(a,d),\mathrm{Linked}(b,c),\mathrm{Linked}(d,c).\\ (1)&\mathrm{Own}(x,y,s), s>0.2 \rightarrow \mathrm{Linked}(x,y). \\ (2)&0.8\,\, {:}{:}\,\, \mathrm{Own}(x,y,s), \mathrm{Own}(y,z,t), w=\mathrm{sum}(s \cdot t) \rightarrow \mathrm{Own}(x,z,w) . \end{aligned}$$

Rule (1) expresses that company \(x\) is linked to \(y\) if \(x\) owns directly or indirectly more than \(20\%\) of \(y\)’s shares. Rule (2) is a recursive rule with an aggregate operator and expresses indirect shareholding: when \(x\) owns a share \(s_y\) of each of a number of companies \(y\), each holding a share \(t_y\) of \(z\), then \(x\) owns \(\sum _y(s_y \cdot t_y)\) of \(z\). An example of a “traditional” logical reasoning task is answering the following question over G: “which companies are linked to a?”. The result of the reasoning task is the companies b and d, as directly specified by G, and, additionally, c, which is implied by the program. Indeed, by Rule (2) we first derive the fact \(\mathrm{Own}(a,c,0.5)\), as \(0.4 \times 0.5 + 0.6 \times 0.5 = 0.5\), and thus, by Rule (1), we deduce \(\mathrm{Linked}(a,c)\).

However, here we are in an uncertain setting: Rule (2) is not definitive but holds with a certain probability. We say that G is a probabilistic knowledge graph. Probabilistic reasoning on G would then consist in answering queries over such uncertain logic programs, i.e., when we can only access a distribution of the entailed facts. The answer to the question —which companies are linked to a— would contain companies b and d with probability one and c with some probability p depending on the “ownership distance” between \(a\) and \(c\).

In spite of its high relevance, surprisingly, none of the existing KGMSs allows for uncertain reasoning, which is crucial in many contexts. The Vadalog system aims at filling this gap.

The Vadalog system provides a form of hybrid logic-probabilistic reasoning, where logical inference is driven and aided by statistical inference. We adopt the novel notion of a probabilistic knowledge graph, and propose Soft Vadalog, an extension of Vadalog with soft, weighted rules (such as the ones used in Example 1) for representing and supporting uncertain reasoning in the Vadalog system. A Soft Vadalog program is a template for a reasoning-tailored statistical model, namely the chase tree, whose semantics is based on a probabilistic version of the chase procedure, a family of algorithms used in databases to enforce logic rules by generating the entailed facts.

In particular, the system adopts the MCMC-chase algorithm: a combination of a Markov chain Monte Carlo method with the chase. The application of the chase is guided by the MCMC, so that logical and statistical inference are performed in the same process. We will report on these achievements soon.