1 Introduction

The participatory paradigm underlying the Data Web has led to more than 150 billion facts on more than 3 billion things being published on the Web in more than 10,000 RDF knowledge graphs.Footnote 1 For example, DBpedia [2], YAGO [20] and Wikidata [13] contain information about millions of entities and comprise billions of facts about these entities. These facts are used in the backend of a growing number of applications, including in-flight applications [13], community-support systems [1] and even personal assistants such as Apple’s Siri [13]. Ensuring the veracity of the facts contained in knowledge graphs is hence of critical importance for an increasing number of end users and applications. Manually validating the veracity of facts is clearly impractical due to the volume and velocity of the data on the Data Web.Footnote 2 Consequently, automated solutions to this computation, dubbed fact validation [11, 17] (also called fact checking in some of the literature, e.g., [8]), have been devised over the last years.

The goal of fact validation can be summarized as follows: Given a fact, compute the likelihood that the given fact is true. Two main families of approaches have been devised to address this problem (see Sect. 2 for more details). The first family of approaches encompasses solutions which verbalize the input fact and use textual evidence (e.g., large corpora such as the Web or Web crawls) to find statements which support or refute the input fact [8, 22, 24]. We focus on the second family of approaches. These approaches use a knowledge graph \(\mathcal {G}\) as background knowledge and use the facts contained therein to evaluate the likelihood that the given fact is true [5, 9, 18]. These approaches use sets of facts as evidence to compute the likelihood of a given fact. For example, when using DBpedia version 2016-10 as background knowledge, they might use facts such as (Barack_Obama, birthPlace, Hawaii) and (Hawaii, country, United_States_of_America) to conclude that (Barack_Obama, nationality, United_States_of_America) holds—a fact which is not to be found in the background knowledge base.

Our work is based on the following observation: While most approaches which use a knowledge graph as background knowledge have been deployed on RDF knowledge graphs, none has made full use of the semantics of the accompanying RDFS schema. In particular, none of the state-of-the-art approaches makes use of the combination of domain, range and subsumption hierarchy expressed in the RDFS schema of most RDF datasets. However, the RDFS schema contains crucial information (e.g., type information) necessary to detect facts which can be used to validate or invalidate other facts.

In this paper, we address this research gap by presenting an unsupervised fact validation approach for RDF knowledge graphs which identifies paths that support a given fact (s, p, o). This approach is based on the insight that the predicate p (e.g., nationality) carries mutual information with a set of other paths (e.g., paths pertaining to birthPlace and country) in the background knowledge graph \(\mathcal {G}\). Hence, the presence of certain sets of paths in \(\mathcal {G}\) that begin in s and end in o can be regarded as evidence which corroborates the veracity of (s, p, o). Our approach is the first to take the domain and range information of p, the types of s and o as well as the subsumption relations between types in the RDFS schema of \(\mathcal {G}\) into consideration while identifying these paths. Our results show conclusively that using this information leads to significantly higher AUC-ROC results on 17 benchmark datasets.

Our approach has several advantages over the state of the art: (i) It uses data which can be directly queried via SPARQL from \(\mathcal {G}\), i.e., there is no need to alter the representation mechanism of \(\mathcal {G}\) or to use an internal representation of \(\mathcal {G}\) in our implementation. Moreover, our approach can exploit the large body of work on scaling up triple stores to competitive runtimes. (ii) The proposed co-occurrence measure for the similarity calculation between predicates and paths is not bound to path lengths and can hence be exploited to detect paths of any finite length. (iii) Our approach is completely unsupervised and neither training nor labeled data is required.

The rest of the paper is organized as follows: Sect. 2 presents details pertaining to related fact validation approaches. In Sect. 3, we give a brief overview of the formal notation used in this paper and introduce the formal specification we use throughout this work. Section 4 details the formal model underlying our approach. In particular, it gives a formal specification of corroborative paths and shows how they can be used to measure the likelihood of a fact being true. Section 5 provides the details of our implementation. We present our experimental setup in Sect. 6 and discuss our results in Sect. 7. Finally, we conclude in Sect. 8.

2 Related Work

Approaches to fact validation can be broadly classified into two categories: (i) approaches that use unstructured textual sources [8, 22, 24] and (ii) approaches that use structured information sources [3, 17,18,19]. The latter—in particular approaches that use a given knowledge graph for fact validation—are more relevant to the work presented herein. Several approaches view a given knowledge graph as a labeled graph connecting nodes (entities) via edges (relations). Given an input triple (s, p, o), the goal is then to search for paths of length up to a given threshold k and use them to validate or invalidate the given input triple. For instance, in [5, 18] a knowledge graph is viewed as an undirected network of paths. The task is then to find shortest paths that connect s and o and are semantically related to p. These approaches are unsupervised and do not require prior training data. However, they do not take the terminological information (in particular the semantics of RDFS) of the input knowledge graph into consideration when defining semantic proximity metrics. Other approaches view KBs as graphs and search for metapaths to extract features [9, 21, 25]. These features are then used to train a classification model to label unseen facts as true or false. However, these approaches require training data in the form of labeled metapaths and hence require significantly more human effort than the approach presented herein. In PredPath [17], the authors propose a novel method to automatically extract metapaths—called anchored predicate paths—given a set of labeled examples. To achieve this goal, PredPath uses the rdf:type information contained in the input knowledge graph. However, the anchored predicate paths used for learning features are selected based on the type information of subject and object irrespective of the predicate connecting them. This means that PredPath does not consider the domain, range and class subsumption provided by the RDFS schema of the given knowledge graph. Consequently, its ability to generalize over paths is limited, as shown in Sect. 7 of this paper. Additionally, PredPath requires labeled training data. Hence, porting it to previously unseen predicates is significantly more demanding than porting our approach, which is fully unsupervised.

Table 1. List of symbols

As an alternative to graph models, several approaches encode the entities and relations in a KB using vector embeddings [3, 12, 19, 23]. The fact validation problem is then formulated as calculating the similarity between the entities and the predicate of a given input triple. Embedding-based methods for link prediction address a related but different problem: Given a KG \(\mathcal {G}\), they compute a score function which expresses how likely it is that any triple whose subject, predicate and object belong to the input graph \(\mathcal {G}\) should belong to \(\mathcal {G}\) [14]. Fact validation approaches address a different goal: Given a graph \(\mathcal {G}\) and a triple t, they aim to compute the likelihood that t is true [8, 18, 22]. A core repercussion of these two problem formulations lies in the runtimes and applications of link prediction and fact checking: while fact validation algorithms are used in online scenarios, embedding-based algorithms are often used offline. Approaches such as [6, 7] mine Horn rules that can be used for knowledge base completion tasks. However, they often fail to scale to large knowledge graphs.

Our approach is inspired by approaches that discover metapaths. We propose a novel approach for finding paths which corroborate a given triple (s, p, o). In addition, we present a novel measure to calculate the association strength between these paths and the input triple. In contrast to approaches based on metapaths, our approach does not need training examples and does not require any supplementary effort to be deployed on previously unseen relations.

3 Preliminaries

Throughout this paper, we consider RDF knowledge graphs with RDFS semantics. We use the notation presented in Table 1.

3.1 Knowledge Graph

Definition 1

An RDF knowledge graph \(\mathcal {G}\) is a set of RDF triples, i.e.,

$$\begin{aligned} \mathcal {G} = \{ (s, p, o) | s \in \mathbb {E} \cup \mathbb {B}, p \in \mathbb {P}, o \in \mathbb {E} \cup \mathbb {B} \cup \mathbb {L} \}, \end{aligned}$$
(1)

where \(\mathbb {E}\) is the set of all RDF resources, \(\mathbb {B}\) is the set of all blank nodes, \(\mathbb {P} \subseteq \mathbb {E}\) is the set of all RDF predicates and \(\mathbb {L}\) represents the set of all literals.

Intuitively, an RDF knowledge graph can be understood as an edge-labeled directed graph in which the node s is connected to the node o via an edge with the label p iff the triple \((s, p, o) \in \mathcal {G}\). This is the approach we use to display knowledge graphs graphically (see, e.g., Fig. 1). We use the notation \(s \xrightarrow {p}o\) to denote that \((s, p, o) \in \mathcal {G}\). We denote the set of all RDFS classes as \(\mathbb {C}\) (with \(\mathbb {C} \subseteq \mathbb {E}\)). For \(A \in \mathbb {C}\) and \(B \in \mathbb {C}\), we write \(A \sqsubseteq B\) to signify that \(A^\mathcal {I} \subseteq B^\mathcal {I}\) for any interpretation \(\cdot ^\mathcal {I}\).
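To make this abstraction concrete, the following minimal sketch (our own illustration; the resource names are local names standing in for full IRIs) represents a knowledge graph as a set of triples in Python:

# A toy RDF knowledge graph as a set of (s, p, o) triples; full IRIs are
# abbreviated to local names for readability.
G = {
    ("BarackObama", "birthPlace", "Hawaii"),
    ("Hawaii", "country", "USA"),
    ("BarackObama", "party", "DemocraticParty"),
    ("DemocraticParty", "country", "USA"),
}

# s --p--> o holds iff (s, p, o) is an element of G.
assert ("BarackObama", "birthPlace", "Hawaii") in G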

Example 1

An excerpt of an example RDF knowledge graph—which we will use as a running example—is displayed in Fig. 1. The example shows a subgraph extracted from DBpediaFootnote 3 consisting of nodes (resources, e.g., Barack Obama and United States) and edges (relations, e.g., birthPlace) connecting these entities either directly or via intermediate nodes.

Fig. 1. A subgraph of DBpedia version 2016-10.

Definition 2

Path: A path of length k in a knowledge graph \(\mathcal {G}\) is a cycle-free sequence of triples from \(\mathcal {G}\) of the form \((v_{0}, p_{1}, v_{1}), (v_{1}, p_{2}, v_{2}), ..., (v_{k-1}, p_{k}, v_{k})\).

This means in particular that \(\forall i,j \in [0, k], i \ne j \rightarrow v_i \ne v_j\). We use \(\pi ^k(v_0, v_k)\) to denote paths between \(v_0\) and \(v_k\). For the sake of legibility, we use the notation \(v_0 \xrightarrow {p_1} \ldots \xrightarrow {p_k} v_k\) to denote paths. Note that several paths can exist between \(v_0\) and \(v_k\). For example, BarackObama \(\xrightarrow {\texttt {birthPlace}}\) Hawaii \(\xrightarrow {\texttt {country}}\) USA and BarackObama \(\xrightarrow {\texttt {party}}\) DemocraticParty \(\xrightarrow {\texttt {country}}\) USA are both paths of length 2 between the resources BarackObama and USA in our running example.

Definition 3

Undirected path: An undirected path of length k in a graph \(\mathcal {G}\) is a cycle-free sequence of triples of the form \((v_{0}, p_{1}, v_{1}), (v_{1}, p_{2}, v_{2}), ..., (v_{k-1}, p_{k}, v_{k})\) where \(\forall i \in [0, k-1]\ (v_i, p_{i+1}, v_{i+1}) \in \mathcal {G} \vee (v_{i+1}, p_{i+1}, v_{i}) \in \mathcal {G}\).

Again, this means that \(\forall i,j \in [0, k], i \ne j \rightarrow v_i \ne v_j\). We denote undirected paths with \(\mu ^k(v_0, v_k)\). For example, BarackObama \(\xleftarrow {\texttt {alumni}}\) GreenwichHighSchool \(\xrightarrow {\texttt {country}}\) USA is an undirected path of length 2 between BarackObama and USA in our example.
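Both notions can be enumerated with a simple depth-first search over the triple set. The sketch below is our own illustration over the set-of-triples representation introduced above (the actual implementation described in Sect. 5 uses SPARQL instead):

def paths(G, start, end, k, undirected=False):
    """Enumerate cycle-free (un)directed paths of length <= k from start to end.

    Each path is returned as a list of steps (s, p, o, forward), where forward
    indicates whether the triple was traversed in its original direction."""
    results = []

    def extend(node, visited, acc):
        if acc and node == end:
            results.append(list(acc))
            return  # end is now visited, so no extension can reach it again
        if len(acc) == k:
            return
        for (s, p, o) in G:
            if s == node and o not in visited:
                acc.append((s, p, o, True))
                extend(o, visited | {o}, acc)
                acc.pop()
            elif undirected and o == node and s not in visited:
                acc.append((s, p, o, False))  # edge traversed against its direction
                extend(s, visited | {s}, acc)
                acc.pop()

    extend(start, {start}, [])
    return results

G = {("BarackObama", "birthPlace", "Hawaii"), ("Hawaii", "country", "USA"),
     ("GreenwichHighSchool", "alumni", "BarackObama"),
     ("GreenwichHighSchool", "country", "USA")}
print(len(paths(G, "BarackObama", "USA", 2)))                   # 1 directed path
print(len(paths(G, "BarackObama", "USA", 2, undirected=True)))  # 2, incl. the alumni path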

4 Corroborative Paths

4.1 Intuition

In this paper, we address the following problem: Given an RDF knowledge graph \(\mathcal {G}\) and a triple (s, p, o), compute the likelihood that (s, p, o) is true. For example, we would have good reasons to believe that BarackObama is a citizen of the USA given that BarackObama was born in Hawaii and Hawaii is located in the USA. Clearly, we cannot formally infer that x is a national of z by virtue of the existence of \(x \xrightarrow {\texttt {birthplace}} y \xrightarrow {\texttt {country}} z\). Still, this path is a strong indicator for (i.e., strongly corroborates) triples of the form \(x \xrightarrow {\texttt {nationality}} z\). The basic intuition behind our work is correspondingly that the existence of certain paths \(\pi ^k (s, o)\) between s and o is a strong indicator for the correctness of (s, p, o) (i.e., corroborates its existence) and can hence be used to compute its likelihood.

4.2 Formal Model

Let \(\gamma \) be a function which maps each element of \(\mathbb {E} \cup \mathbb {P} \cup \mathbb {B} \cup \mathbb {L}\) to its type. For example, \(\gamma \)(BarackObama) = Person \(\sqcap \) Agent \(\sqcap \) Politician \(\sqcap \) President and \(\gamma \)(UnitedStates) = Place \(\sqcap \) Location \(\sqcap \) Country \(\sqcap \) PopulatedPlace in our running example.Footnote 4 Further, let \(\lambda \) be a function which maps a given type \(t_x\) to the set of resources that are instances of \(t_x\) by virtue of RDFS semantics. Extending the formal model in [17], we now define the set \(\varPi ^{k}_{(t_{x}, t_{y})}\) of typed paths of length k between pairs of resources of type \(t_x\) and \(t_y\) in a knowledge graph \(\mathcal {G}\) as follows:

$$\begin{aligned} \varPi ^{k}_{(t_{x}, t_{y})} = \{\pi ^{k}(v_0, v_k)\ |\ \gamma (v_{0}) \sqsubseteq t_{x} \wedge \gamma (v_{k}) \sqsubseteq t_{y} \}. \end{aligned}$$
(2)

For \(t_x=\{\texttt {Person}\}\) and \(t_y=\{\texttt {Place}\}\), the path BarackObama \(\xrightarrow {\texttt {birthPlace}}\) Hawaii \(\xrightarrow {\texttt {country}}\) USA is an element of the set \(\varPi ^{2}_{(t_{x}, t_{y})}\) in our running example. We define the set \(M^{k}_{(t_{x}, t_{y})}\) of typed undirected paths analogously.

Let \(\vec {q} = (q_1,\ldots ,q_k)\) be a vector of properties of length k. We define the set of \(\vec {q}\)-restricted typed paths \(\varPi ^{k}_{(t_{x}, t_{y}),\vec q} \subseteq \varPi ^{k}_{(t_{x}, t_{y})}\) as follows:

$$\begin{aligned} \begin{aligned} \varPi ^{k}_{(t_{x}, t_{y}),\vec {q}} =&\bigl \{\pi ^{k}(v_0, v_k) \;\big |\; \pi ^{k}(v_0, v_k) \in \varPi ^{k}_{(t_{x}, t_{y})},\\&\,\,\,\forall i \in [0, k-1] : (v_{i},p_{i+1},v_{i+1}) \in \pi ^{k}(v_0, v_k) \rightarrow p_{i+1} = q_{i+1} \bigr \} . \end{aligned} \end{aligned}$$
(3)

Put simply, this is the set of typed paths such that the sequence of properties in each path is exactly \(\vec {q}\). For example, let \(t_x\)=\(\{\texttt {Person}\}\), \(t_y=\{\texttt {Place}\}\) and \(\vec {q}\) = (birthPlace, country). Then the path BarackObama \(\xrightarrow {\texttt {birthPlace}}\) Hawaii \(\xrightarrow {\texttt {country}}\) USA is the only element of \(\varPi ^{2}_{(t_{x}, t_{y}),\vec {q}}\) in our running example. We call the elements of \(\varPi ^{k}_{(t_{x}, t_{y}),\vec {q}}\) similar as they share a sequence of predicates (i.e., \(\vec {q}\)). We define sets of \(\vec {q}\)-restricted undirected typed paths \(M^{k}_{(t_{x}, t_{y}),\vec q}\) analogously to \(\varPi ^{k}_{(t_{x}, t_{y}),\vec {q}}\).
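In terms of the sketches above, grouping the discovered paths by their predicate sequence yields exactly these \(\vec q\)-restricted sets. The helper below is a hypothetical illustration (the type constraints \(\gamma (v_0) \sqsubseteq t_x\) and \(\gamma (v_k) \sqsubseteq t_y\) are assumed to be checked separately against the schema):

from collections import defaultdict

def group_by_q(discovered_paths):
    """Group paths by their predicate sequence q (cf. Eq. 3); each group is one
    q-restricted path set. Paths are step lists as returned by paths() above."""
    groups = defaultdict(list)
    for path in discovered_paths:
        q = tuple(p for (_, p, _, _) in path)  # the predicate sequence vector q
        groups[q].append(path)
    return groups

# In the running example, the path via birthPlace and country falls into the
# group keyed by ("birthPlace", "country").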

We can now use restricted typed paths to compute how well a predicate is corroborated in a knowledge graph as follows: Let D(p) be the domain of p and R(p) be its range. Given that we assume RDF knowledge graphs, we can safely assume the existence of an RDFS class hierarchy for the said graph (with instances assigned to classes via the \(\texttt {rdf:type}\) predicate). Consequently, we can derive the following important condition on paths \(\pi ^k (s, o)\) which are to corroborate the correctness of (s, p, o): Only typed paths in \(\varPi ^k_{(D(p), R(p))}\) can corroborate facts with the predicate p. This particular insight is one of the major differences between this and previous works (see Sect. 2), in which the consequences of RDFS semantics were not taken into consideration. In particular, while previous approaches [17] used at most \(\gamma (s)\) and \(\gamma (o)\) to measure the strength of the association between paths and predicates, we use D(p) and R(p) as well as the RDFS class hierarchy in the input knowledge graph \(\mathcal {G}\) to determine the degree to which a path \(\pi ^k(s,o)\) corroborates a predicate p.

Given an RDF knowledge graph \(\mathcal {G}\), we hence define the corroborative paths for a predicate p formally as follows:

$$\begin{aligned} \varPi ^{k}(p) = \bigcup \limits _{j=1}^k \varPi ^{j}_{(D(p), R(p))}. \end{aligned}$$
(4)

Simply put, corroborative paths in \(\varPi ^{k}(p)\) are paths of length at most k that carry similar information to p.

4.3 Association Strength

We base our computation of the strength of the association between \(\varPi ^{j}_{(t_{x}, t_{y}),\vec q}\) and p on their normalized pointwise mutual information [4]. To this end, we define the probability \(\mathcal {P}(\varPi ^{j}_{(t_{x},t_{y}),\vec q})\) of pairs of instances of \(t_x\) resp. \(t_y\) being connected via a \(\vec {q}\)-restricted path of length j as follows:

$$\begin{aligned} \frac{\left| \{(a, b): \gamma (a) \sqsubseteq t_x \wedge \gamma (b) \sqsubseteq t_y \wedge (\exists \pi ^j(a, b) \in \varPi ^{j}_{(t_{x},t_{y}),\vec q})\} \right| }{|\lambda (t_x)|\cdot |\lambda (t_y)|}. \end{aligned}$$
(5)

The probability \(\mathcal {P}(p)\) of the predicate p linking resources of type \(t_x\) and \(t_y\) is

$$\begin{aligned} \frac{\left| \left\{ (a, p, b): \gamma (a) \sqsubseteq t_x \wedge \gamma (b) \sqsubseteq t_y \wedge (a, p, b) \in \mathcal {G} \right\} \right| }{|\lambda (t_x)|\cdot |\lambda (t_y)|} \end{aligned}$$
(6)

Finally, the joint probability \(\mathcal {P}(\varPi ^{j}_{(t_{x},t_{y}),\vec q}, p)\) is defined as

$$\begin{aligned} \frac{\left| \{(a, b): \gamma (a) \sqsubseteq t_x \wedge \gamma (b) \sqsubseteq t_y \wedge (\exists \pi ^j(a, b) \in \varPi ^{j}_{(t_{x},t_{y}),\vec q}) \wedge (a, p, b) \in \mathcal {G}\} \right| }{|\lambda (t_x)|\cdot |\lambda (t_y)|}. \end{aligned}$$
(7)

We could now compute the NPMI of \(\varPi ^{j}_{(t_{x}, t_{y}),\vec q}\) and p as defined in [4]. However, a direct implementation of the original definition of the NPMI would be expensive, as it would require deduplicating the sets of pairs (a, b) connected by the paths in \(\varPi ^j_{(t_x, t_y)}\).Footnote 5 Hence, our approach implements an approximation of the NPMI based on counting the number of paths which connect pairs (a, b) instead of the pairs themselves. We hence end up with the following approximations (note that these values are not probabilities):

$$\begin{aligned} \widehat{\mathcal {P}}(\varPi ^{j}_{(t_{x},t_{y}),\vec q}) = \frac{| \varPi ^{j}_{(t_{x},t_{y}), \vec q}|}{|\lambda (t_x)|\cdot |\lambda (t_y)|} \end{aligned}$$
(8)
$$\begin{aligned} \widehat{\mathcal {P}}(\varPi ^{j}_{(t_{x},t_{y}),\vec q}, p) = \frac{|\{\pi ^j(a, b) \in \varPi ^{j}_{(t_{x},t_{y}),\vec q}: (a, p, b) \in \mathcal {G}\}|}{|\lambda (t_x)|\cdot |\lambda (t_y)|}. \end{aligned}$$
(9)

These approximations can be computed by using SPARQL queries without a DISTINCT clause, which makes the computation an order of magnitude faster (see Table 7 for some of the scores returned by this function). Note that \(\mathcal {P}(p)\) remains unchanged, as the number of paths \(a \xrightarrow {p} b\) is exactly equal to the number of pairs (a, b) connected by p. Based on these approximations, we can now approximate the NPMI of \(\varPi ^{j}_{(t_{x}, t_{y}),\vec q}\) and p as follows:

$$\begin{aligned} \widehat{\text {NPMI}}(\varPi ^{j}_{(t_{x}, t_{y}),\vec q}, p) = \frac{\log {\Bigg (\frac{ \widehat{\mathcal {P}}\left( \varPi ^{j}_{(t_{x}, t_{y}),\vec q}, p\right) }{\widehat{\mathcal {P}}\left( \varPi ^{j}_{(t_{x},t_{y}),\vec q}\right) \cdot {\mathcal {P}}(p)}\Bigg )}}{-\log \bigg (\widehat{\mathcal {P}}\left( \varPi ^{j}_{(t_{x},t_{y}),\vec q}, p\right) \bigg )} \end{aligned}$$
(10)
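For illustration, this approximation can be computed from four counts, each obtainable via a single SPARQL count query. The following minimal sketch is our own illustration (in particular, returning \(-1\) for a zero joint count is our convention, not taken from the paper):

import math

def npmi_hat(path_count, cooc_count, pred_count, n_tx, n_ty):
    """Approximate NPMI (Eq. 10) between a q-restricted typed path set and p.

    path_count -- numerator of Eq. (8): |Pi^j_{(tx,ty),q}|
    cooc_count -- numerator of Eq. (9): paths whose endpoints are also linked by p
    pred_count -- numerator of Eq. (6): triples (a, p, b) with matching types
    n_tx, n_ty -- |lambda(tx)| and |lambda(ty)|"""
    norm = n_tx * n_ty
    p_path, p_joint, p_pred = path_count / norm, cooc_count / norm, pred_count / norm
    if p_joint == 0:
        return -1.0  # no co-occurrence at all: minimal association (our convention)
    return math.log(p_joint / (p_path * p_pred)) / -math.log(p_joint)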

5 Method and Implementation

This section details our implementation of the formal model introduced above. In particular, we show how some of the core computations of our model can be implemented using SPARQL queries, ensuring practicable runtimes for our approach. As above, we explain the approach using directed paths for the sake of legibility. The approach was also implemented using undirected paths. An evaluation of the performance of the approach with directed and undirected paths is presented in Sect. 7.

Algorithm 1. (Pseudocode listing not reproduced in this rendering; an illustrative sketch is given below.)

5.1 Algorithm

Given an input triple \(t=(s, p, o)\), a knowledge graph \(\mathcal {G}\) and a maximum path length k, our implementation begins by identifying a set of paths of varying lengths connecting s and o. For each path, it calculates a score which explicates the degree to which the path corroborates t. Finally, the scores are amalgamated into a single score \(\tau \) which expresses the veracity of t. The complete algorithm is shown in Algorithm 1 and can be separated into the four steps (i) initialization, (ii) path discovery, (iii) path scoring and (iv) veracity calculation.
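Since the original pseudocode listing is not reproduced above, the following sketch outlines the four steps under our own assumptions; it is not the authors' code. All kg.* methods are hypothetical placeholders for the SPARQL queries described in the remainder of this section, and npmi_hat is the function sketched in Sect. 4.3:

def validate(s, p, o, k, kg):
    """Compute the veracity score tau of the input triple (s, p, o)."""
    # (i) Initialization: domain/range of p and the associated counts.
    t_x, t_y = kg.domain(p), kg.range(p)
    n_tx, n_ty = kg.count_instances(t_x), kg.count_instances(t_y)
    pred_count = kg.count_triples(p, t_x, t_y)

    scores = []
    for j in range(1, k + 1):
        # (ii) Path discovery: q-restrictions of paths of length j between s and o.
        for q in kg.discover_q_restrictions(s, o, j):
            # (iii) Path scoring via the NPMI approximation of Eq. (10).
            path_count = kg.count_typed_paths(q, t_x, t_y)
            cooc_count = kg.count_typed_paths_with_p(q, p, t_x, t_y)
            scores.append(npmi_hat(path_count, cooc_count, pred_count, n_tx, n_ty))

    # (iv) Veracity calculation as in Eq. (11); clipping negative scores to 0 is
    # our assumption about how non-supporting paths are treated.
    tau = 1.0
    for zeta in scores:
        tau *= 1.0 - max(zeta, 0.0)
    return 1.0 - tau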

Initialization. First, we prune \(\mathcal {G}\) by removing all nodes from namespaces outside the union of (i) the base namespace(s) of \(\mathcal {G}\) and (ii) the namespaces of RDF, RDFS and OWL. We carry out this preprocessing because we are interested in relations (edges) that are defined by the ontology of the given \(\mathcal {G}\) (line 1). Thereafter, the domain D(p) and range R(p) of the given triple’s predicate p are determined.Footnote 6 The number of instances of these two types as well as the number of triples containing p as predicate are retrieved via SPARQL count queries (lines 2–4).
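For illustration, these counts can be obtained with queries of the following shape. This is a sketch under our own assumptions, in particular the use of the property path rdf:type/rdfs:subClassOf* to emulate RDFS subsumption rather than relying on inference in the triple store:

PREFIXES = (
    "PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> "
    "PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#> "
)

def count_instances_query(cls_iri: str) -> str:
    # |lambda(t)|: instances of a class, including instances of its subclasses.
    return (PREFIXES + "SELECT (COUNT(DISTINCT ?x) AS ?c) "
            f"WHERE {{ ?x rdf:type/rdfs:subClassOf* <{cls_iri}> . }}")

def count_predicate_query(p_iri: str) -> str:
    # Number of triples with predicate p.
    return PREFIXES + f"SELECT (COUNT(*) AS ?c) WHERE {{ ?a <{p_iri}> ?b . }}"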

Path Discovery. In the second step, the properties of all paths \(\pi ^j(s, o)\) (i.e., their \(\vec q\)-restrictions) of length \(j \in [1, k]\) between s and o are retrieved. To this end, we generate SPARQLFootnote 7 query templates (line 7). The query template which retrieves directed paths of length \(j=2\) between ?v0 and ?v2 from an RDF graph is given in the reconstruction below.

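The original listing is not preserved in this rendering; the following sketch reconstructs it from the surrounding description. The generator function and the use of DISTINCT are our own illustration rather than the authors' exact code:

def path_template(j: int) -> str:
    """Build a SPARQL template that retrieves the predicate sequences (i.e., the
    q-restrictions) of all directed paths of length j between ?v0 and ?vj."""
    patterns = " . ".join(f"?v{i} ?p{i + 1} ?v{i + 1}" for i in range(j))
    predicates = " ".join(f"?p{i + 1}" for i in range(j))
    return f"SELECT DISTINCT {predicates} WHERE {{ {patterns} . }}"

# path_template(2) yields:
# SELECT DISTINCT ?p1 ?p2 WHERE { ?v0 ?p1 ?v1 . ?v1 ?p2 ?v2 . }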

Note that the query has to be modified with UNION clauses to cover undirected paths. Still, a single query suffices to retrieve all paths of a given length. Hence, our approach generates k queries in this step.

After generating all necessary query templates up to the given length k, we replace the variables ?v0 and ?vj with s and o, respectively (line 9). We prune the results of the query (line 11) by removing results containing predicates which define the terminology (e.g., class membership through rdf:type, class hierarchy through rdfs:subClassOf).Footnote 8 The remaining \(\vec q\)-restrictions are stored as pairs together with the template which was used to retrieve them in the list \(\mathcal {Q}\) (lines 12–14). We store the template to ensure that we can reconstruct the direction of the predicates in case undirected paths are used.

Path Scoring. The pairs in \(\mathcal {Q}\) are used to define the \(\vec q\)-restricted typed path sets \(\varPi ^{j}_{(D(p),R(p)), \vec q}\). For each of these pairs, a score \(\zeta \) is calculated based on the NPMI approximation in Eq. 10 (lines 16–23). For the sake of efficiency, we use SPARQL queries to obtain the necessary counts of typed paths; these queries are generated based on the query template and \(\vec q\) (line 18). However, as pointed out in [7], a direct translation of the needed counts into queries leads to time-consuming computationsFootnote 9 which require optimization. Therefore, we generate the SPARQL queries needed for our counts with a recursive structure. Listing 1.1 shows a sample query used to count the number of paths \(\texttt {?v0}\xrightarrow {\texttt {birthPlace}}{} \texttt {?v1}\xrightarrow {\texttt {country}}{} \texttt {?v2}\) between entities with the types Person and Country, respectively.

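Listing 1.1 itself is not preserved in this rendering. The sketch below illustrates the idea behind a recursively structured count query: grouping by the intermediate node ?v1 lets the triple store multiply partial counts instead of materializing every path. It is our reconstruction under that assumption, not necessarily the authors' exact query (filters enforcing cycle-freeness are omitted for brevity):

# Counts the paths ?v0 --birthPlace--> ?v1 --country--> ?v2 with ?v0 a Person
# and ?v2 a Country by summing, over each intermediate ?v1, the product of the
# number of matching subjects and objects.
COUNT_PATHS_QUERY = """
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX dbo: <http://dbpedia.org/ontology/>
SELECT (SUM(?c1 * ?c2) AS ?paths) WHERE {
  { SELECT ?v1 (COUNT(?v0) AS ?c1) WHERE {
      ?v0 rdf:type dbo:Person .
      ?v0 dbo:birthPlace ?v1 . } GROUP BY ?v1 }
  { SELECT ?v1 (COUNT(?v2) AS ?c2) WHERE {
      ?v1 dbo:country ?v2 .
      ?v2 rdf:type dbo:Country . } GROUP BY ?v1 }
}
"""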

Veracity Calculation. We treat the association strength of each \(\vec q\)-restricted typed path set as the confidence with which these paths support the existence of the input predicate p. We hence combine the \(\zeta \) values by checking whether at least one path supports p. Let Z be the set of scores of all single paths; the veracity score \(\tau \) can then be calculated with the following equation (see lines 23–28):

$$\begin{aligned} \tau = 1 - \prod _{\zeta \in Z} \left( 1 - \zeta \right) . \end{aligned}$$
(11)
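For example, if two paths corroborate the input triple with scores \(\zeta _1 = 0.8\) and \(\zeta _2 = 0.5\), then \(\tau = 1 - (1-0.8)(1-0.5) = 0.9\): the evidence of several weaker paths accumulates, while a single strong path already suffices for a high veracity score.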

6 Experiments and Results

In this section, we provide details of the data and hardware used in our experiments. We compare the results of our approach with those achieved by state-of-the-art approaches in the subsequent section.

6.1 Setup

Knowledge Graph. For our experiments, we chose DBpedia version 2016-10 as background knowledge. We chose this dataset because it is the reference dataset of a large number of fact validation benchmarks. We used the latest dumpsFootnote 10 of the ontology, instance types, mapping-based objects and infobox properties. We filtered out triples that (i) contain literals and datatypes or (ii) link the entities in DBpedia to external sources. The final graph contains 44 million triples, which we stored in an instance of OpenLink Virtuoso v7.2.5.1 hosted on a VM with 16 GB of memory and 256 GB of disk space. To ensure the comparability of our results, we ran our evaluation using GERBIL [16]—a benchmarking platform that facilitates the evaluation of fact validation systems across different datasets.Footnote 11 We used AUC-ROC as our evaluation metric and set \(k=2\) for the sake of comparability with previous works.

Competing Approaches. We compare our approach (COPAAL) to three state-of-the-art graph-based fact validation approaches: (i) Knowledge Stream (KS), (ii) its variant Relational Knowledge Linker (KL-REL) [18] and (iii) Discriminative Path Mining (PredPath) [17]. For all these approaches, we use the implementations provided by the authors [18].Footnote 12 We considered the configurations suggested in the original papers: (i) PredPath [17] uses the top-100 features when learning from positive and negative facts. (ii) KS [18] and KL-REL [18] use the top-5 paths and the single best path, respectively, for validating input triples.

6.2 Benchmarks

We evaluated all approaches using two publicly available sets of benchmarks: (i) the Real-World and (ii) the Synthetic datasetsFootnote 13 made available by the authors of [18]. In addition, we generated a new set of benchmarks dubbed FactBench-DBpedia from the FactBenchFootnote 14 dataset. All facts in FactBench are automatically extracted from DBpedia and Freebase for 10 different relationsFootnote 15 and stored in the form of RDF models. In FactBench, the positive facts are generated by querying DBpedia and Freebase and selecting the top-150 results returned for each relation. The negative facts are generated by modifying the positive facts while still adhering to domain and range restrictions. The positive and negative facts are collected into 6 different benchmarks dubbed Domain, Range, Domain-Range, Mix, Random and Property. FactBench-DBpedia restricts the generation process of FactBench to DBpedia by extracting all facts belonging to DBpedia as well as the facts from Freebase whose resources can be mapped to resources in DBpedia. Table 2 summarizes the statistics of the different datasets.

Table 2. Summary of benchmark datasets

7 Results

7.1 Comparison of Directed and Undirected Paths

We first aimed to determine the type of paths for which our approach performs best. We hence compared the AUC-ROC achieved by both variations of our approach on FactBench-DBpedia (see Table 3). The results are clear: Using undirected paths (average AUC-ROC = 0.87) always outperforms using directed paths (average AUC-ROC = 0.66), with an improvement of 0.21 AUC-ROC on average. We studied the results achieved using the two types of paths. It quickly became evident that using undirected paths allows our approach to detect significantly more corroborative evidence. Therewith, undirected paths achieve a better approximation of the probability of a triple being true (see Table 7 for examples). Consequently, we only consider our approach with undirected paths in the following.

Table 3. Comparison of AUC-ROC achieved using directed and undirected paths
Table 4. AUC-ROC results of all approaches on Real-World datasets

7.2 Comparison with Other Approaches

Tables 4 and 5 show the AUC-ROC results of all approaches on the benchmarks contained in the Real-World and Synthetic datasets, respectively. Our approach outperforms the other approaches on most of these datasets. In the best case, we are roughly 4.5% (absolute value, Birth Place benchmark) better than PredPath and more than 20% (absolute value, Birth Place benchmark) better than KS on real data. A careful study of our results reveals that the anchored predicate paths used by PredPath for learning features are restricted by the types of subject and object irrespective of the predicate of the input triple. Hence, they sometimes fail to generalize well. On the other hand, KL-REL uses the single best path, which sometimes limits its ability to validate facts, namely when it fails to rank the path which conveys the most evidence for the input triple in the first position. This is made evident by the examples shown in Table 7: We computed the union of the top-3 paths identified by our approach and all other approaches on the three datasets for which the differences in AUC values were the largest. We also computed the weights assigned by each of the approaches (i.e., \(\widehat{\text {NPMI}}\) for our approach, average flow values of paths for KS and KL-REL [18] and weights learned by the classifier for PredPath [17]). While our approach finds all paths and allocates weights to them, the other approaches sometimes fail to detect relevant paths (marked by dashes in Table 7) and are hence not able to use them in their evidence computation. Having a large number of paths available, however, also means that our scores are (even if rarely) overoptimistic w.r.t. the evidence for a triple, which explains the marginally lower scores we achieve on Death Place and Oscars.

Table 5. ROC-AUC results of all approaches on Synthetic datasets
Table 6. ROC-AUC results of all approaches on FactBench-DBpedia datasets

The results on FactBench-DBpedia (see Table 6) confirm the insights we gained on the previous two sets of datasets. Our approach outperforms the state of the art and achieves a better AUC-ROC on most datasets. We ran a Wilcoxon signed rank test (significance level = 99%) on all results we collected. The results indicate that our approach is significantly better than the state of the art.

One could assume that our approach is slower than the state of the art due to the larger amount of evidence it collects. Hence, we measured the average throughput of all approaches, including all phases of the processing. The average throughput of our approach was 21.02 triples/min. KS, which follows an approach similar to ours, achieves an average throughput of 10.05 triples/min, while its counterpart KL-REL achieves 29.78 triples/min. PredPath’s average throughput was 21.67 triples/min. Overall, our results show that our approach scales as well as the state of the art while achieving significantly better results.

Table 7. Union of the top-3 paths identified by the different approaches and their weighting. The weights allocated by each of the approaches are given in the corresponding column. A dash (-) means that the approach was not able to find the said path.

8 Conclusion and Future Work

In this paper, we presented a novel unsupervised approach for the validation of facts using an RDF knowledge graph \(\mathcal {G}\) as background knowledge. Our approach uses the domain, range and class subsumption information found in the schema of \(\mathcal {G}\) to outperform both supervised and unsupervised fact validation approaches. We evaluated our approach on 17 datasets against three state-of-the-art approaches. Our results show that our approach outperforms the state of the art significantly (Wilcoxon signed rank test, \(p < 0.01\)). We studied the difference between the approaches and concluded that our approach performs better because it is able to score corroborative paths more accurately, as it uses more information from the schema of \(\mathcal {G}\). These results point to the importance of using the semantics of the data contained in RDF knowledge graphs when aiming to validate them. Another advantage of our approach is that it allows us to verbalize the evidence found to support a given input triple.

The main limitation of our approach lies in its reliance on the existence of type information. Well-defined ontologies are not always available for real-world datasets, and our approach cannot be applied to datasets lacking them. Previous works have aimed at improving type information in noisy knowledge graphs [15]. We will evaluate whether combining our approach with such algorithms leads to better corroborative paths in future work. Additionally, the approaches evaluated herein are limited to evidence found in a single RDF graph. In future work, we will consider performing fact validation at a larger scale. In particular, we will use the linked nature of Linked Data sets to detect paths across several knowledge graphs. We will focus on the scalability and the distributed execution of this novel solution. Moreover, we will consider relaxing the type requirements used in the definition of \(\varPi ^k_{(t_x, t_y), \vec {q}}\) by using well-defined semantic similarities [10].