1 Introduction

Enterprises increasingly depend on intelligent information systems that operationalise corporate knowledge as a unified source across system boundaries. Such systems crucially rely on insights produced by data scientists, who use advanced data and graph analytics together with machine learning and statistical models to create predictive actionable knowledge from suitably preprocessed corporate data by means of data wrangling. To maintain their competitive edge, companies need to incorporate multiple heterogeneous sources of information, including streams of structured or unstructured data from internal systems (e.g., Enterprise Resource Planning, Workflow Management, and Supply Chain Management), external streams of unstructured data (e.g., news and social media feeds, and Common Crawl), publicly available and proprietary sources of semi-structured data (e.g., DBpedia [11], Wikidata [46], UniProt [19], data.gov.uk), structured data extracted from web pages using web data extraction techniques [24], as well as internal and external knowledge bases/ontologies (e.g., ResearchCyc, DBpedia [11], Wikidata [46], FIBO). The integration of such diverse information is a non-trivial task that presents data scientists with a number of challenges including: the extraction and handling of big data with frequently changing content and structure; dealing with uncertainty of the extracted data; and finding ways of unifying the information from different sources.

Following the trend of large technological companies such as Google, Amazon, Facebook, and LinkedIn, it is becoming common for enterprises to integrate their internal and external sources of information into a unified knowledge graph. A knowledge graph typically consists of graph-structured data, to allow for smooth accommodation of changes in the structure of the data, and knowledge layers, which encode business logic used for the validation and enrichment of data and the uncovering of critical insights from it. Graph-structured data may stem from data directly exposed as graphs (e.g., RDF used by triple stores such as GraphDB, Property Graphs used by graph databases like neo4j and JanusGraph) or from relational or semi-structured data that exhibits graph structure. The consolidated and enriched knowledge graph is then processed using the standard data science toolkit for graph analytics (including languages such as Cypher, SPARQL, and Gremlin), statistical analysis (using the R statistical framework), and machine learning (using toolkits such as Weka, scikit-learn, and TensorFlow).

The creation of a coherent knowledge graph from multiple sources of unstructured, semi-structured, and structured data is a challenging task that requires techniques from multiple disciplines. Entity resolution [18] is used to combine multiple sources of (semi-)structured data that do not share common identifiers. The goal is to identify pairs of entities that refer to the same real-world object and merge them into a single entity. The matching is performed using noisy, semi-identifying information (e.g., names, addresses) and relationships, and employs specialised similarity functions for strings, numbers, and dates to determine the overall similarity of two entities. Information extraction [43] is used for automatically extracting structured data from unstructured sources (e.g., news and social media feeds). Thus, for example, the news item “PayPal buys Hyperwallet for $400M” could result in the structured statement “acquire(PayPal, Hyperwallet)”. Information extraction is typically combined with entity resolution to correctly incorporate the extracted information within an existing knowledge graph.

Publicly available datasets are often equipped with ontologies that describe relationships between entities. In such cases, ontological reasoning needs to be applied to validate whether the results of entity resolution and information extraction violate any of the constraints imposed by the ontology, as well as to enrich the data with new information stemming from the newly produced facts. Further note that, unsurprisingly, the use of machine learning is pervasive throughout the stages of the data scientist’s workflow: from semantically annotating web page elements during web data extraction, through deciding whether entities should be matched during entity resolution, to predicting numerical trends during data analytics over the knowledge graph. Finally, observe that although uncertainty is intrinsic to many of the tasks in the data scientist’s workflow, it is typically resolved by means of a threshold. For example, during entity resolution, the similarity of the attributes of two entities is typically converted into a probability that the two entities are the same, and they are matched if the probability exceeds a certain threshold. Similarly, the information extraction stage typically associates output facts with a level of uncertainty stemming from the extraction process, but, as in the case of entity resolution, the uncertainty is converted into a probability that a fact holds, and a hard decision is made on whether it should be included or not. Interestingly, one can do better than that. One may want to impose levels of uncertainty using business rules to better inform the decision of whether and how the knowledge graph should be updated. One such rule, for example, could be that public companies are much more likely to acquire private companies than vice versa (the so-called reverse takeover). Such rules can be produced by a domain expert or learned from the data using rule learning [7]. Furthermore, instead of discarding the uncertainty after it has been used to decide, for example, whether to accept a fact or a match, one could incorporate it into the knowledge graph and propagate it into the later stages of the data wrangling and data analytics workflow.
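
As a preview, such a business rule could be sketched in the code-like rule notation used later in the paper, with the weight prefix of Sect. 6; the predicate names and the weights here are purely illustrative assumptions:

  0.9 :: acquire(X,Y) :- deal(X,Y), publicCompany(X), privateCompany(Y).
  0.1 :: acquire(Y,X) :- deal(X,Y), publicCompany(X), privateCompany(Y).

Intuitively, a deal between a public company X and a private company Y is most likely an acquisition of Y by X, and only rarely a reverse takeover; the weights are kept in the knowledge graph rather than being collapsed into a hard yes/no decision.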

To carry out the different stages of the described workflow, data scientists need to use and coordinate a number of tools, languages, and technologies: for data access they require tools for web data extraction, various database management systems, triple stores, and graph databases; during knowledge graph construction they require tools for entity resolution, information extraction, ontological reasoning, and uncertainty management; and during the analysis stage they require tools for graph analytics, machine learning, and statistical modelling. The coordination of all these tools can be very challenging.

In this paper we present the Vadalog engine: a state-of-the-art Knowledge Graph Management System (KGMS) that provides a unified framework for integrating the various tools and technologies used by data scientists. Its language Vadalog is an extension of the rule-based language Datalog [1], and can naturally capture SQL (through support for the SQL operators), ontological reasoning in OWL 2 QL and SPARQL (through the use of existential quantifiers), and graph analytics (through non-trivial support for recursion and aggregation). The declarative nature of the language makes the code concise, manageable, and self-explanatory. The engine is fully extensible through its bindings to different data sources and libraries. Data extensions provide access to relational data stored in Postgres or MySQL, for example, or to graph data stored in neo4j or Janus, or to web data using OXPath [24]. Library extensions allow the integration of state-of-the-art machine learning tools such as Weka, scikit-learn, or TensorFlow. Additional integration with libraries for string similarities and regular expressions allows for defining complex entity resolution workflows. The engine also supports reasoning with probabilistic data and probabilistic rules, which makes it ideal for handling uncertainty stemming from the different stages of the data scientist’s workflow. Finally, the Vadalog engine seamlessly integrates with Jupyter: a well-known platform for data analysts and scientists with a convenient interface for data processing and visualisation.

The paper is organised as follows. Section 2 provides an overview of the core language. Section 3 provides a system overview of the Vadalog engine. Section 4 describes the various features of the system within a typical data scientist’s workflow in Jupyter. Section 5 demonstrates the engine’s integration with machine learning on typical use cases. Finally, Sect. 6 describes in more detail the support of the system for probabilistic reasoning.

This paper includes, in abbreviated form, material from a number of previous papers on the topic [7,8,9,10]. The Vadalog system is Oxford’s contribution to VADA [34], a joint project of the universities of Edinburgh, Manchester, and Oxford. We reported first work on the overall VADA approach to data wrangling in [25]. In this paper, we focus on the Vadalog system at its core. Currently, our system fully implements the core language and is already in use for a number of industrial applications.

2 Core Language

Vadalog is a Datalog-based language. It belongs to the Datalog\(^\pm \) family of languages that extends Datalog by existential quantifiers in rule heads, as well as by other features, and at the same time restricts its syntax in order to achieve decidability and data tractability; see, e.g., [14,15,16,17]. The logical core of the Vadalog language corresponds to Warded Datalog\(^\pm \) [4, 29], which captures plain Datalog as well as SPARQL queries under the entailment regime for OWL 2 QL [28] and is able to perform ontological reasoning tasks. Reasoning with the logical core of Vadalog is computationally efficient. Vadalog is obtained by extending Warded Datalog\(^\pm \) with additional features of practical utility. We now illustrate the logical core of Vadalog; more details about the extensions can be found in [7].

The logical core of Vadalog relies on the notion of wardedness, which applies a restriction on how the “dangerous” variables of a set of existential rules are used. Note that existential rules are also known as tuple-generating dependencies (tgds), i.e., Datalog rules where existential quantification is allowed in the head. Intuitively, a “dangerous” variable is a body-variable that can be unified with a labelled null value when the chase algorithm is applied, and it is also propagated to the head of the rule. For example, given the set \(\varSigma \) consisting of the rules

$$ P(x) \rightarrow \exists z \, R(x,z) \quad \text {and}\quad R(x,y) \rightarrow P(y), $$

the variable y in the body of the second rule is “dangerous” (w.r.t. \(\varSigma \)) since starting, e.g., from the database \(D = \{P(a)\}\), the chase will apply the first rule and generate \(R(a,\nu )\), where \(\nu \) is a null that acts as a witness for the existentially quantified variable z, and then the second rule will be applied with the variable y being unified with \(\nu \) that is propagated to the obtained atom \(P(\nu )\).

Note that, throughout this paper, we will mix the “logical” notation shown above, which is often used in papers, with the “code”-like notation used in systems such as the Vadalog system. The above example would be given as follows in Vadalog notation:

  r(X,Z) :- p(X).
  p(Y) :- r(X,Y).

The goal of wardedness is to tame the way null values are propagated during the construction of the chase instance by posing the following conditions: (i) all the “dangerous” variables should coexist in a single body-atom \(\alpha \), called the ward; (ii) the ward can share only “harmless” variables with the rest of the body, i.e., variables that are unified only with database constants during the construction of the chase.

Warded Datalog\(^\pm \) consists of all the (finite) sets of warded existential rules. As an example of a warded set of rules, the following rules encode part of the OWL 2 direct semantics entailment regime for OWL 2 QL (see [4, 29]):

$$\begin{aligned} \underline{\mathrm{Type}(x,y)}, \mathrm{Restriction}(y,z) &\rightarrow \exists w \, \mathrm{Triple}(x,z,w)\\ \underline{\mathrm{Type}(x,y)}, \mathrm{SubClass}(y,z) &\rightarrow \mathrm{Type}(x,z)\\ \underline{\mathrm{Triple}(x,y,z)}, \mathrm{Inverse}(y,w) &\rightarrow \mathrm{Triple}(z,w,x)\\ \underline{\mathrm{Triple}(x,y,z)}, \mathrm{Restriction}(w,y) &\rightarrow \mathrm{Type}(x,w). \end{aligned}$$

It is easy to verify that the above set is warded, where the underlined atoms are the wards. Indeed, a variable that occurs in an atom of the form \(\mathrm{Restriction}(\cdot ,\cdot )\), or \(\mathrm{SubClass}(\cdot ,\cdot )\), or \(\mathrm{Inverse}(\cdot ,\cdot )\), is trivially harmless. However, variables that appear in the first position of \(\mathrm{Type}\), or in the first/third position of \(\mathrm{Triple}\) can be dangerous. Thus, the underlined atoms are indeed acting as the wards.
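
In the code-like notation introduced above, this warded set would read roughly as follows (a direct transcription; the existentially quantified variable W simply does not occur in the corresponding rule body):

  triple(X,Z,W) :- type(X,Y), restriction(Y,Z).
  type(X,Z) :- type(X,Y), subClass(Y,Z).
  triple(Z,W,X) :- triple(X,Y,Z), inverse(Y,W).
  type(X,W) :- triple(X,Y,Z), restriction(W,Y).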

Reasoning in Warded Datalog\(^\pm \) is PTIME-complete in data complexity [4, 29]. Although polynomial-time data complexity is desirable for conventional applications, PTIME-hardness can be prohibitive for “Big Data” applications. One such example is the construction of knowledge graphs over huge elections in the area of computational social choice [20]. In fact, this can be true even for linear-time data complexity. This is discussed in more detail in [7].

This core language has a number of extensions that make it practical, among them data types, arithmetic, (monotonic) aggregation, bindings of predicates to external data sources, bindings of function symbols to external functions, and more.

We will discuss monotonic aggregation here. Vadalog supports aggregation (min, max, sum, prod, count) by means of an extension to the notion of monotonic aggregation [44], which allows the use of aggregation even in the presence of recursion while preserving monotonicity w.r.t. set containment. Such functionality is crucial for performing graph analytics, an example of which is shown in Sect. 4.

We will discuss some of these extensions throughout this paper. Among the planned extensions is more support for consistency, in particular consistent query answering [3, 5] as well as view updates [13, 31].

Fig. 1. KGMS reference architecture [7]

3 Core System

The functional architecture of the Vadalog system, our KGMS, is depicted in Fig. 1. The knowledge graph is organised as a repository, a collection of Vadalog rules. The external sources are supported by means of transducers, intelligent adapters that integrate the sources into the reasoning process.

The Big Data characteristics of the sources and the complex functional requirements of reasoning are tackled by leveraging the theoretical underpinnings of the core language, which are turned into practical execution strategies. In particular, in the reasoning algorithms devised for Warded Datalog\(^\pm \), after a certain number of chase steps (which, in general, depends on the input database), the chase graph [15] (a directed acyclic graph where facts are represented as nodes and the applied rules as edges) exhibits specific periodicities and no new information relevant to query answering is generated. The Vadalog system adopts an aggressive recursion and termination control strategy, which detects such redundancy as early as possible by combining compile-time and runtime techniques. In combination with a highly engineered architecture, the Vadalog system achieves high performance and a small memory footprint.

At compile time, as wardedness limits the interaction between the labelled nulls, the engine rewrites the program in such a way that joins on specific values of labelled nulls will never occur. This exploits work on schema mapping composition and optimisation [32, 33, 38, 42].

The Vadalog system uses a pull stream-based approach (or pipeline approach), where facts are actively requested by the output nodes from their predecessors, and so on down to the input nodes, which eventually fetch the facts from the data sources. The stream approach is essential to limit memory consumption, or at least to make it predictable, so that the system is effective for large volumes of data. Our setting is made more challenging by the presence of multiple interacting rules in a single rule set and the wide presence of recursion. We address this by means of a specialised buffer management technique. We adopt pervasive local caches in the form of wrappers to the nodes of the access plan, where the facts produced by each node are stored. The local caches work particularly well in combination with the pull stream-based approach, since facts requested by one successor of a node can be immediately reused by all the other successors, without triggering further backward requests. This combination also realises an extreme form of multi-query optimisation, where each rule exploits the facts produced by the others, whenever applicable. To limit memory occupation, the local caches are flushed with an eager eviction strategy that detects when a fact has been consumed by all the possible requestors and then drops it from memory. Cases of actual cache overflow are managed by resorting to standard disk swap heuristics (e.g., LRU, LFU).

More details on the Vadalog system can be found in [10]. The system includes many other features, such as data extraction with OXPath, which is in use with our collaborators at dblp [36].

4 Supporting the Data Science Workflow

As the importance of data science constantly increases, the Vadalog system can support, to a certain extent, the entire spectrum of data science tasks and processes. It does not, however, replace the tools specialists like to use, but rather provides a universal platform for integrating various approaches and tools into a unified framework. All integrations are realised in terms of data binding primitives and functions.

One key example is the choice of UI/development platform: Jupyter was chosen as a platform that data scientists are familiar with. The Vadalog system integrates seamlessly with JupyterLab through a Vadalog extension and kernel (see Fig. 2). JupyterLab is a well-known platform for data analysts and scientists with a convenient interface for data processing and visualisation. It has multi-user support, in which dedicated resources and environments are associated with a concrete user. The Vadalog extension and kernel for JupyterLab allow data scientists to evaluate the correctness of a program, run it, and analyse the derivation process of interesting output facts. All output is rendered in JupyterLab’s output area.

Fig. 2. Example of the Vadalog program for inferring a company control indicator

Data Binding Primitives. Bindings make it possible to connect an automated reasoning workflow with external systems for data exchange. An external system can be a database, framework, library, or information system. Currently Vadalog supports relational databases, such as Postgres and MySQL, and graph databases, such as neo4j. It also has seamless integration with machine learning tools, e.g., Weka and scikit-learn (see Sect. 5.1), and with a web data extraction tool, OXPath [24] (see Fig. 3). Other integrations are included or can easily be added. Data sources and targets are declared using the @input and @output annotations. Annotations are special facts that augment sets of existential rules with specific behaviours. @input and @output define the direction of facts into and out of the Vadalog program, respectively. An additional @bind annotation defines the means for interacting with an external system. A query bind annotation, @qbind, is a special modification of @bind: it supports binding predicates to queries against inputs and outputs in the external language (e.g., SQL queries for a data source or target that supports SQL). The first parameter of @bind and @qbind specifies the predicate the external resource is bound to; the second parameter defines the type of the target (e.g., “postgres”). In case the schema of an external resource cannot be derived automatically, or should be overridden, an additional @mapping annotation can be used to define the mapping strategy for tuples exchanged between Vadalog and the external system. A minimal sketch of how these annotations fit together is shown below.
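
The following sketch declares an input relation bound to Postgres and an output relation written back to Postgres; the connection parameters after the target type, as well as the table and attribute names, are assumptions made purely for illustration:

  @input("ownsDirectly").
  @bind("ownsDirectly", "postgres", "company_db", "owns_directly").
  @output("controls").
  @bind("controls", "postgres", "company_db", "company_control").
  @mapping("controls", 0, "controller", "string").
  @mapping("controls", 1, "controlled", "string").

Facts for ownsDirectly are streamed in from the bound table, while derived controls facts are streamed out; the @mapping annotations override the target schema with explicit attribute names and types.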

In Fig. 2, we give a synthetic example of a Vadalog program that infers a company control indicator. It can be formulated as follows: a company A “controls” a company B if A owns directly or indirectly (i.e., via shares in other companies) more than 50% of B’s shares (lines 18–25). As we can see, various strategies for binding external resources can be used in the Vadalog program. For example, data tuples ownsDirectly can be propagated into the program from the parametric @qbind (lines 6–7 for Postgres, via tuples ownsDirectlyDB) or @bind (line 11 for CSV, via tuples ownsDirectlyCSV). For @qbind, the SQL query is instantiated with the parameter from the predicate relevant_country (line 8). The query instantiation is realised within the join, in which the parameter C from the relevant_country predicate is propagated into the fourth term of the predicate ownsDirectlyDB. In contrast, in the case of @bind, all data is streamed into the Vadalog system and filtered on the fly by only selecting information regarding the “relevant country” (line 16). ownsDirectly tuples can also be specified within the program in terms of facts (line 3). During the evaluation of the program, each derived controls tuple is streamed into a Postgres database, as specified in lines 26–27. The core recursion of the example is sketched below.
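
The control rules themselves can be sketched roughly as follows; this is not the verbatim content of Fig. 2, and the predicate company as well as the monotonic aggregate msum are assumptions in the spirit of the aggregation features of Sect. 2:

  controls(X,X) :- company(X).
  controls(X,Z) :- controls(X,Y), ownsDirectly(Y,Z,S), T = msum(S), T > 0.5.

Intuitively, the shares of Z held via all companies Y already controlled by X are accumulated monotonically, and controls(X,Z) is derived as soon as the accumulated share exceeds 50%.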

Fig. 3. Integration of OXPath, a web data extraction tool

In Fig. 3, we illustrate an example of binding with OXPath. OXPath [24] is a web data extraction language, an extension of XPath 1.0 for interacting with web applications and extracting data from them. In this example, the OXPath binding streams all articles of Georg Gottlob from the dblp website into the Vadalog program. Extracted articles can be represented as a relation article(authors, title, publication, pages); a sketch of the corresponding declarations is given below. Integration with machine learning tools is discussed in the next section.
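
On the Vadalog side, such a binding could be declared roughly as follows; the target type name “oxpath”, the placement of the (elided) OXPath expression, and the attribute types are assumptions:

  @input("article").
  @bind("article", "oxpath", "...").
  @mapping("article", 0, "authors", "string").
  @mapping("article", 1, "title", "string").
  @mapping("article", 2, "publication", "string").
  @mapping("article", 3, "pages", "string").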

Functions. Besides bindings, functions provide the data scientist with a rich set of value transformations and operations for the different data types supported in Vadalog. A user can write expressions of varying complexity, using operators and functions to perform arithmetic, manipulate strings and dates, and compare values. Examples of supported data types are string, integer, double, date, boolean, set, and a special identifier for unknown values, the marked null. A data scientist can also extend the set of supported functions with functions written in Python, which is enabled in the Vadalog framework. Functions can be combined into libraries. For example, @library("sim:", "simmetrics"). enables the “simmetrics” library in a Vadalog program, whose methods can then be invoked with the prefix sim:, as in sim:removeDiacritics(Text) to remove diacritics from Text; a small example is given below. We also provide libraries for building regression or classification models on the fly and applying them to the data derived during automatic reasoning (see Sect. 5.1).
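
For instance, an entity-resolution-style matching rule could combine such functions as follows; the predicate names, the similarity function sim:jaroWinkler, and the threshold are illustrative assumptions, while sim:removeDiacritics is the function mentioned above:

  @library("sim:", "simmetrics").
  match(X,Y) :- candidate(X,NameX), candidate(Y,NameY),
                S = sim:jaroWinkler(sim:removeDiacritics(NameX), sim:removeDiacritics(NameY)),
                S > 0.9.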

Fig. 4. A screenshot depicting code analysis for an altered Vadalog program in the company control example

Code Analysis. The correctness of a program can be assessed with the code analysis functionality (see Fig. 4), which checks whether the program contains common, well-known error patterns. For example, in Fig. 4, we altered the original program illustrated in Fig. 2: the parameter Share of the condition in line 18 was replaced with Share2, lines 10–15 were commented out, leaving ownsDirectlyCSV without a binding, and the output controls was changed to controls2.

Fig. 5. A screenshot of the output depicting a “yes”-explanation for the fact controls("A", "C") in the company control example

Fact Derivation Analysis. The analysis of derivations can be performed with the use of explanations (see Fig. 5), which show how a certain fact has been derived within the program and which rules have been triggered.

Bindings and functions make data analytics both more effective and more efficient. Vadalog interacts directly with various data sources regardless of their nature, be it a database or the Web. Furthermore, with its rich reasoning capabilities it can lift the analysis from basic values, tuples, or relations within databases to semantically rich structures, e.g., from property graphs such as those of neo4j to concepts of a domain ontology. This makes the code more concise and self-explanatory.

The Vadalog system is a universal tool that can reconcile the two opposite paradigms of data scientists and domain experts, the so-called “inductive” (or bottom-up) and “deductive” (or top-down) approaches. The inductive paradigm is captured by the statement that “patterns emerge before reasons for them become apparent” [21]. It refers to the data mining and machine learning approaches used for deriving new knowledge and relations from data. As all data scientists experience in practice, “all models are wrong and some are useful” [12, p. 208], which captures the difficulty of finding the best model for a given dataset. Furthermore, limitations related to labour-intensive labelling for some machine learning algorithms can also lead to incorrect or incomplete results. Thus, the knowledge of a domain expert, applied deductively, is important for correcting potential errors propagated from learned models.

Fig. 6. Schematic view of the interaction between machine learning and reasoning

5 Integrating Machine Learning

In this section, we discuss how machine learning can be integrated directly. We focus on one approach to machine learning integration, schematically illustrated in Fig. 6. In the first subsection, we concretely describe the Weka and scikit-learn integration; the system’s TensorFlow integration is similar in style to the scikit-learn integration. This is followed in Subsect. 5.2 by a case study on feature engineering. We conclude in Subsect. 5.3 with how custom ML algorithms can be implemented directly in the system.

Fig. 7. A snippet of Vadalog code, which demonstrates training a J48 Weka model

Fig. 8. A snippet of Vadalog code, which demonstrates the classification phase with a trained J48 Weka model

5.1 Direct Integration

Weka. Integration with the machine learning framework Weka is demonstrated in Figs. 7 and 8. Figure 7 illustrates the generation of a J48 model for the Iris dataset. Training data is propagated to the bound decision tree classifier associated with the predicate j48. Mapping annotations specify the attributes and the class of the tuples streamed into the underlying machine learning algorithm. Figure 8 depicts an example of the classification process given a model M. Attributes of the data tuples to be classified and the generated model are streamed into the underlying Weka framework via the predicate j48. The results of the classification are instantiated in a relation classified_data. In the @qbind expression, the third parameter defines the nominal attributes, in our case the class, which had index 4 in the training phase. The fourth parameter of @qbind defines the parameter propagation template from the predicate j48 into the underlying model.
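
A rough sketch of the training side might look as follows; this is not the content of Fig. 7, and the binding target name, its parameters, and the attribute names are assumptions:

  @bind("j48", "weka", "iris_j48_model").
  @mapping("j48", 0, "sepal_length", "double").
  @mapping("j48", 1, "sepal_width", "double").
  @mapping("j48", 2, "petal_length", "double").
  @mapping("j48", 3, "petal_width", "double").
  @mapping("j48", 4, "class", "string").
  j48(SL,SW,PL,PW,C) :- iris_training(SL,SW,PL,PW,C).

The classification phase of Fig. 8 follows the same pattern, with @qbind propagating the model and the attributes of the tuples to be classified.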

SciPy Toolkits Machine Learning. An external Python library such as scikit-learn can be utilised for machine learning tasks over predicates through the Vadalog library framework. A basic linear regression example is shown below. The input consists of predicates of the form \(training\_set(ID, X, Y)\). The sk:fit function feeds the input data in one by one and returns the current training set size. Once a sufficient training set size is reached, the sk:train function is called, which has a boolean return value. The last rule takes prediction inputs one by one and retrieves the output from the trained model. \(\#T\) stands for the boolean value true.

figure a (code listing)
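
The listing is not reproduced above; based on the description, a program of this kind might look roughly as follows, where the library registration, the training-set size threshold of 100, and the prediction function name sk:predict are assumptions:

  @library("sk:", "sklearn_linear_regression").
  fed(C) :- training_set(ID,X,Y), C = sk:fit(X,Y).
  trained(B) :- fed(C), C >= 100, B = sk:train().
  result(ID,P) :- trained(#T), predict(ID,X), P = sk:predict(X).
  @output("result").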

5.2 Case Study: Feature Engineering

We consider a case study of implementing a supervised machine learning framework and post-classification reasoning with Vadalog. Our implementation consists of three phases: (i) feature extraction with Vadalog, (ii) interaction between Vadalog and a serialised classifier, and (iii) post-classification reasoning. We assume that the classifier has already been trained and serialised and, for reasons of brevity, omit the description of representing a training corpus and training the classifier with Vadalog, as this can be done through a simple extension of the framework. The schematic view of the framework we implement in this case study is given in Fig. 6.

Feature Extraction with Vadalog. Consider the problem of identifying semantic blocks on a web page, such as pagination bars, navigation menus, headers, footers, and sidebars [26, 35]. The page is represented by the DOM tree and the CSS model. We represent all information contained in both the DOM and the CSS as DOM facts, which are Vadalog EDB predicates. An example of three DOM facts representing (i) the font size of a DOM tree element with ID 100, (ii) its background colour, and (iii) the coordinates, width, and height of the corresponding CSS box is listed below.

figure b (code listing)
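
The listing is not reproduced above; the three facts might look roughly as follows, where the predicate names and the concrete values are assumptions and only their informal meaning is taken from the description:

  dom__css__fontSize(e_100, 16.0).
  dom__css__backgroundColor(e_100, "#ffffff").
  dom__css__box(e_100, 0, 120, 960, 40).

The last fact would carry the x and y coordinates, the width, and the height of the CSS box of the element with ID 100.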

In the code snippet below we extract a feature used in the navigation menu classifier: the average font size of the sub-tree rooted at a given DOM node N, i.e., the average font size computed over the set consisting of N and all of its descendant nodes (determined through the Start and End indices of DOM nodes).

figure c (code listing)
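
The snippet is not reproduced above; a sketch of such a feature rule, assuming a dom__treeNode(Node, Start, End) predicate exposing the Start/End indices and the monotonic aggregates msum and mcount (in practice, the final aggregate values would be selected, as in the mcount/max pattern of Sect. 5.3), could be:

  subtreeFontSize(N,D,F) :- dom__treeNode(N,S,E), dom__treeNode(D,SD,ED),
                            SD >= S, ED <= E, dom__css__fontSize(D,F).
  fontSum(N,T) :- subtreeFontSize(N,D,F), T = msum(F).
  fontCount(N,C) :- subtreeFontSize(N,D,F), C = mcount(F).
  feature__avgFontSize(N,A) :- fontSum(N,T), fontCount(N,C), A = T / C.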

Note that we use the feature namespace for the predicate that computes this particular feature, as well as for all other features used by classifiers. The feature predicates are the output of the feature extraction phase of the framework, so that they can be passed on as input to a serialised classifier.

Interaction with a Serialised Classifier. All extracted features are passed on to a serialised classifier through the @bind operator. For the case of web block classification, we use Weka as the machine learning library and a J48 decision tree as the classifier, but the implementation of the framework in Vadalog is both library and classifier agnostic; e.g., we can seamlessly integrate Vadalog with scikit-learn, as demonstrated in Subsect. 5.1, and the J48 decision tree classifier can be seamlessly replaced by any other classifier, e.g., an SVM. The classifications produced by the classifier are then passed back to Vadalog, also through the @bind operator. These classifications are in the classification namespace, e.g., classification(e_200, "navigation_menu"), which classifies the DOM node with ID 200 as a navigation menu.

Post-classification Reasoning. We can now apply, to the classifications computed in the previous phase, post-classification reasoning that cannot easily be expressed by machine learning classifiers. For example, given serialised header and footer classifiers and the classifications computed in the previous phase, we can impose the constraint that a header and a footer cannot overlap.

figure d (code listing)
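
The listing is not reproduced above; a sketch of such a post-classification rule, which reuses the (assumed) box predicate from the DOM facts and flags violations explicitly, could be:

  overlaps(X,Y) :- dom__css__box(X,XL,XT,XW,XH), dom__css__box(Y,YL,YT,YW,YH),
                   XL < YL + YW, YL < XL + XW, XT < YT + YH, YT < XT + XH.
  conflict(X,Y) :- classification(X,"header"), classification(Y,"footer"), overlaps(X,Y).
  @output("conflict").

A derived conflict fact indicates that the header and footer classifications overlap and cannot both be accepted.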

5.3 Direct Use of Algorithms

In case no external support is available, or users want to adapt their algorithms and tie them closer to the knowledge graph, a number of machine learning algorithms can be implemented directly in Vadalog. Note that this is a complementary alternative: if algorithms can be used out of the box based on existing systems and approaches, and no modification or close interaction with the knowledge graph is required, it is certainly a good idea to use such external systems and algorithms as described in Sect. 5.1. Taking advantage of the declarative programming paradigm, one only needs to express the logic of the definition concisely, instead of describing the exact algorithm explicitly. As a result, the program is easy to modify, verify, or execute in parallel. The application areas include, but are not limited to, clustering, anomaly detection, and weakly supervised learning.

We will use the DBSCAN (density-based spatial clustering of applications with noise) algorithm as a simple example [22]. The two main parameters of DBSCAN are eps (distance threshold) and minPts (minimal number of points for a dense region). The input is a set of points p(ID, X, Y), where ID is a sequential number serving as an identifier.

$$\begin{aligned}&eps(0.11), \ minPts(5),\\&p(1, 0.697, 0.460), \ p(2, 0.774, 0.376), \ {\ldots } \end{aligned}$$

Two points are in a neighbourhood if their Euclidean distance is at most eps. The neighbourhood number is obtained through aggregation as below.

$$\begin{aligned}&p(A, X_A, Y_A), p(B, X_B, Y_B), C=\sqrt{(X_A-X_B)^2+(Y_A-Y_B)^2} \\&\rightarrow point\_pairs (A, B, C). \\&point\_pairs(A, B, C), eps(E), C<=E \rightarrow neighbourhood(A, B).\\&neighbourhood(A, B), J=mcount(B) \rightarrow neighbourhood\_count(A, J). \\&neighbourhood\_count(A, J), K=max(J) \rightarrow neighbourhood\_number(A, K). \end{aligned}$$

Different types of points, i.e., core, border and noise, are defined as follows.

$$\begin{aligned}&neighbourhood\_number(A, K), minPts(M), K>=M \rightarrow core\_point(A).\\&\lnot core\_point(A), core\_point(B), neighbourhood(A, B) \rightarrow border\_point(A).\\&neighbourhood\_number(A, K), \lnot core\_point(A), \lnot border\_point(A) \\&\rightarrow noise\_point(A). \end{aligned}$$

Notions of density reachability and connectivity are defined below.

$$\begin{aligned}&core\_point(A), neighbourhood(A, B) \rightarrow directly\_reachable(A, B).\\&directly\_reachable(A, B) \rightarrow reachable(A, B).\\&reachable(A, C), directly\_reachable(C, B) \rightarrow reachable(A, B).\\&reachable(C, A), reachable(C, B) \rightarrow connected(A, B). \end{aligned}$$

The goal of the density clustering process is to find point pairs that satisfy the connectivity and maximality properties, respectively:

$$\begin{aligned} connected(A, X) \rightarrow cluster(A, X). \\ reachable(A, X) \rightarrow cluster(A, X). \end{aligned}$$

The cluster is identified by the point (from this cluster) which has the minimal ID number. This is achieved by the post-processing instruction, @post, which takes the minimum value for the second term (position) of the relation cluster, grouping by the first term (position).

@output("cluster"). @post("cluster", "min(2)").

Output Example: cluster(1,1). cluster(2,1). cluster(3,3).

6 Probabilistic Reasoning

In the design of winning data science solutions, it is increasingly clear that completely neglecting domain knowledge and blindly relying only on inductive models (i.e., with parameters learnt from data) easily leads to sub-optimal results, prone to overfitting if not to outright wrong conclusions. Thus, data scientists tend to integrate inductive reasoning with deductive approaches, complementing and, where appropriate, overruling machine learning models with domain knowledge.

In the Vadalog system, we introduce probabilistic knowledge graphs, a valuable tool for crafting a new kind of data science solution in which statistical models incorporate, and are driven by, the description of the domain knowledge.

Combining uncertainty and logic to describe rich uncertain relational structures is not new and has been the primary focus of Statistical Relational Learning (SRL) [27, 40]. One prominent representative of this area is Markov Logic Networks (MLN) [41], which describe relational structures in terms of first-order logic. A number of algorithms for exact and approximate reasoning in MLNs and other SRL models [6, 23] have been proposed, and systems such as Alchemy [41], Tuffy [37], and SlimShot [30] have been built. MLNs have been successfully applied in natural language processing [39], ontology matching [2], record linkage [45], and so on. Yet, one common limitation of SRL models is their logical reasoning side: logic in SRL is not utilised for deducing new knowledge, but rather serves the role of a constraint language. Systems built on top of these models are hence of very limited applicability in data science tasks.

Consider the following example.

Example 1

Let G be a knowledge graph, which contains the following facts about the ownership and link relationships between companies, augmented with a Vadalog program composed of rules (1) and (2):

$$\begin{aligned}&\mathrm{Own}(a,b,0.4), \mathrm{Own}(b,c,0.5), \mathrm{Own}(a,d,0.6), \mathrm{Own}(d,c,0.5),\\&\mathrm{Linked}(a,b),\mathrm{Linked}(a,d),\mathrm{Linked}(b,c),\mathrm{Linked}(d,c).\\ (1)&\mathrm{Own}(x,y,s), s>0.2 \rightarrow \mathrm{Linked}(x,y). \\ (2)&0.8\,\, {:}{:}\,\, \mathrm{Own}(x,y,s), \mathrm{Own}(y,z,t), w=\mathrm{sum}(s \cdot t) \rightarrow \mathrm{Own}(x,z,w) . \end{aligned}$$

Rule (1) expresses that company \(x\) is linked to \(y\) if \(x\) owns directly or indirectly more than \(20\%\) of \(y\)’s shares. Rule (2) is a recursive rule with an aggregate operator and expresses indirect shareholding: when \(x\) owns a share \(s_y\) of each of a number of companies \(y\), each holding a share \(t_y\) of \(z\), then \(x\) owns \(\sum _y(s_y \cdot t_y)\) of \(z\). An example of a “traditional” logical reasoning task is answering the following question over G: “which companies are linked to a?”. The result of the reasoning task is the companies b and d, as directly specified by G, and, additionally, c, which is implied by the program. Indeed, by Rule (2) we first derive the fact \(\mathrm{Own}(a,c,0.5)\), as \(0.4 \times 0.5 + 0.6 \times 0.5 = 0.5\), and thus, by Rule (1), we deduce \(\mathrm{Linked}(a,c)\).

However, here we are in an uncertain setting: Rule (2) is not definitive but holds with a certain probability. We say that G is a probabilistic knowledge graph. Probabilistic reasoning on G would then consist in answering queries over such uncertain logic programs, i.e., when we can only access a distribution of the entailed facts. The answer to the question —which companies are linked to a— would contain companies b and d with probability one and c with some probability p depending on the “ownership distance” between \(a\) and \(c\).

In spite of its high relevance, surprisingly, none of the existing KGMSs allows for uncertain reasoning, which is crucial in many contexts. The Vadalog system aims at filling this gap.

The Vadalog system provides a form of hybrid logic-probabilistic reasoning, where logical inference is driven and aided by statistical inference. We adopt the novel notion of a probabilistic knowledge graph, and propose Soft Vadalog, an extension of Vadalog with soft, weighted rules (such as the ones used in Example 1) for representing and supporting uncertain reasoning in the Vadalog system. A Soft Vadalog program is a template for a reasoning-tailored statistical model, namely the chase tree, whose semantics is based on a probabilistic version of the chase procedure, a family of algorithms used in databases to enforce logic rules by generating the entailed facts.

In particular, the system adopts the MCMC-chase algorithm: a combination of a Markov chain Monte Carlo method with the chase. The application of the chase is guided by the MCMC, so that logical and statistical inference are performed in the same process. We will report on these achievements soon.