1 Introduction

The participatory paradigm underlying the Data Web has led to more than 150 billion facts on more than 3 billion things being published on the Web in more than 10,000 RDF knowledge graphs.Footnote 1 For example, DBpedia [2], YAGO [20] and Wikidata [13] contain information about millions of entities and comprise billions of facts about these entities. These facts are used in the backend of a growing number of applications, including in-flight applications [13], community-support systems [1] and even personal assistants such as Apple’s Siri [13]. Ensuring the veracity of the facts contained in knowledge graphs is hence of critical importance for an increasing number of end users and applications. Manually validating the veracity of facts is clearly impractical due to the volume and velocity of the data on the Data Web.Footnote 2 Consequently, automated solutions to this computation, dubbed fact validation [11, 17] (also called fact checking in some of the literature, e.g., [8]), have been devised over the last years.

The goal of fact validation can be summarized as follows: Given a fact, compute the likelihood that the given fact is true. Two main families of approaches have been devised to address this problem (see Sect. 2 for more details). The first family of approaches encompasses solutions which verbalize the input fact and use textual evidence (e.g., large corpora such as the Web or Web crawls) to find statements which support or refute the input fact [8, 22, 24]. We focus on the second family of approaches. These approaches use a knowledge graph \(\mathcal {G}\) as background knowledge and use the facts contained therein to evaluate the likelihood that the given fact is true [5, 9, 18]. These approaches use sets of facts as evidence to compute the likelihood of a given fact. For example, when using DBpedia version 2016-10 as background knowledge, they might use facts such as (Barack_Obama, birthPlace, Hawaii) and (Hawaii, country, United_States_of_America) to conclude that (Barack_Obama, nationality, United_States_of_America) holds—a fact which is not to be found in the background knowledge base.

Our work is based on the following observation: While most approaches which use a knowledge graph as background knowledge have been deployed on RDF knowledge graphs, none has made full use of the semantics of the accompanying RDFS schema. In particular, none of the state-of-the-art approaches makes use of the combination of domain, range and subsumption hierarchy expressed in the RDFS schema of most RDF datasets. However, the RDFS schema contains crucial information (e.g., type information) necessary to detect facts which can be used to validate or invalidate other facts.

In this paper, we address this research gap by presenting an unsupervised fact validation approach for RDF knowledge graphs which identifies paths that support a given fact (s, p, o). This approach is based on the insight that the predicate p (e.g., nationality) carries mutual information with a set of other paths (e.g., paths pertaining to birthPlace and country) in the background knowledge graph \(\mathcal {G}\). Hence, the presence of certain sets of paths in \(\mathcal {G}\) that begin in s and end in o can be regarded as evidence which corroborates the veracity of (s, p, o). Our approach is the first to take the domain and range information of p, the types of s and o as well as the subsumption relations between types in the RDFS schema of \(\mathcal {G}\) into consideration while identifying these paths. Our results show conclusively that using this information leads to significantly higher AUC-ROC results on 17 benchmark datasets.

Our approach has several advantages over the state of the art: (i) It uses data which can be directly queried via SPARQL from \(\mathcal {G}\), i.e., there is no need to alter the representation mechanism of \(\mathcal {G}\) or to use an internal representation of \(\mathcal {G}\) in our implementation. Moreover, our approach can exploit the large body of work on scaling up triple stores to competitive runtimes. (ii) The proposed co-occurrence measure for the similarity calculation between predicates and paths is not bound to path lengths and can hence be exploited to detect paths of any finite length. (iii) Our approach is completely unsupervised and neither training nor labeled data is required.

The rest of the paper is organized as follows: Sect. 2 presents details pertaining to related fact validation approaches. In Sect. 3, we give a brief overview of the formal notation used in this paper and introduce the formal specification we use throughout this work. Section 4 details the formal model underlying our approach. In particular, it gives a formal specification of corroborative paths and shows how they can be used to measure the likelihood of a fact being true. Section 5 provides the details of our implementation. We present our experimental setup in Sect. 6 and discuss our results in Sect. 7. Finally, we conclude in Sect. 8.

2 Related Work

Approaches to fact validation can be broadly classified into two categories: (i) approaches that use unstructured textual sources [8, 22, 24] and (ii) approaches that use structured information sources [3, 17,18,19]. The latter—in particular approaches that use a given knowledge graph for fact validation—are more relevant to the work presented herein. Several approaches view a given knowledge graph as a labeled graph connecting nodes (entities) via edges (relations). Given an input triple (s, p, o), the goal is then to search for paths of length up to a given threshold k and use them to validate or invalidate the given input triple. For instance, in [5, 18] a knowledge graph is viewed as an undirected network of paths. The task is then to find shortest paths that connect s and o and are semantically related to p. These approaches are unsupervised and do not require prior training data. However, they do not take the terminological information (in particular the semantics of RDFS) of the input knowledge graph into consideration when defining semantic proximity metrics. Other approaches view KBs as graphs and search for metapaths to extract features [9, 21, 25]. These features are then used to train a classification model to label unseen facts as true or false. However, these approaches require training data in the form of labeled metapaths and hence require significantly more human effort than the approach presented herein. In PredPath [17], the authors propose a novel method to automatically extract metapaths—called anchored predicate paths—given a set of labeled examples. To achieve this goal, PredPath uses the rdf:type information contained in the input knowledge graph. However, the anchored predicate paths used for learning features are selected based on the type information of subject and object irrespective of the predicate connecting them. This means that PredPath does not consider the domain, range and class subsumption provided by the RDFS schema of the given knowledge graph. Consequently, its ability to generalize over paths is limited, as shown in Sect. 7 of this paper. Additionally, PredPath requires labeled training data. Hence, porting it to previously unseen predicates is significantly more demanding than porting our approach, which is fully unsupervised.

Table 1. List of symbols

As an alternative to graph models, several approaches encode the entities and relations in a KB using vector embeddings [3, 12, 19, 23]. The fact validation problem is then formulated as calculating the similarity between the entities and the predicate of a given input triple. Embedding-based methods for link prediction address a related but different problem: Given a KG \(\mathcal {G}\), they compute a score function which expresses how likely it is that any triple whose subject, predicate and object belong to the input graph \(\mathcal {G}\) should belong to \(\mathcal {G}\) [14]. Fact validation approaches address a different goal: Given a graph \(\mathcal {G}\) and a triple t, they aim to compute the likelihood that t is true [8, 18, 22]. A core repercussion of these two problem formulations lies in the runtimes and applications of link prediction and fact checking: while fact validation algorithms are used in online scenarios, embedding-based algorithms are often used offline. Approaches such as [6, 7] mine Horn rules that can be used for knowledge base completion tasks. However, they often fail to scale to large knowledge graphs.

Our approach is inspired by approaches that discover metapaths. We propose a novel approach for finding paths which corroborate a given triple (s, p, o). In addition, we present a novel measure to calculate the association strength between these paths and the input triple. In contrast to approaches based on metapaths, our approach does not need training examples and does not require any supplementary effort to be deployed on previously unseen relations.

3 Preliminaries

Throughout this paper, we consider RDF knowledge graphs with RDFS semantics. We use the notation presented in Table 1.

3.1 Knowledge Graph

Definition 1

An RDF knowledge graph \(\mathcal {G}\) is a set of RDF triples, i.e.,

$$\begin{aligned} \mathcal {G} = \{ (s, p, o) | s \in \mathbb {E} \cup \mathbb {B}, p \in \mathbb {P}, o \in \mathbb {E} \cup \mathbb {B} \cup \mathbb {L} \}, \end{aligned}$$
(1)

where \(\mathbb {E}\) is the set of all RDF resources, \(\mathbb {B}\) is the set of all blank nodes, \(\mathbb {P} \subseteq \mathbb {E}\) is the set of all RDF predicates and \(\mathbb {L}\) represents the set of all literals.

Intuitively, an RDF knowledge graph can be understood as an edge-labeled directed graph in which the node s is connected to the node o via an edge with the label p iff the triple \((s, p, o) \in \mathcal {G}\). This is the approach we use to display knowledge graphs graphically (see, e.g., Fig. 1). We use the notation \(s \xrightarrow {p}o\) to denote that \((s, p, o) \in \mathcal {G}\). We denote the set of all RDFS classes as \(\mathbb {C}\) (with \(\mathbb {C} \subseteq \mathbb {E}\)). For \(A \in \mathbb {C}\) and \(B \in \mathbb {C}\), we write \(A \sqsubseteq B\) to signify that \(A^\mathcal {I} \subseteq B^\mathcal {I}\) for any interpretation \(\cdot ^\mathcal {I}\).
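To make this abstraction concrete, the following minimal sketch (our own illustration; the resource names are local names standing in for full IRIs) represents a knowledge graph as a set of triples in Python:

# A toy RDF knowledge graph as a set of (s, p, o) triples; full IRIs are
# abbreviated to local names for readability.
G = {
    ("BarackObama", "birthPlace", "Hawaii"),
    ("Hawaii", "country", "USA"),
    ("BarackObama", "party", "DemocraticParty"),
    ("DemocraticParty", "country", "USA"),
}

# s --p--> o holds iff (s, p, o) is an element of G.
assert ("BarackObama", "birthPlace", "Hawaii") in G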

Example 1

An excerpt of an example RDF knowledge graph—which we will use as a running example—is displayed in Fig. 1. The example shows a subgraph extracted from DBpediaFootnote 3 consisting of nodes (resources, e.g., Barack Obama and United States) and edges (relations, e.g., birthPlace) connecting these entities either directly or via intermediate nodes.

Fig. 1. A subgraph of DBpedia version 2016-10.

Definition 2

Path: A path of length k in a knowledge graph \(\mathcal {G}\) is a cycle-free sequence of triples from \(\mathcal {G}\) of the form \((v_{0}, p_{1}, v_{1}), (v_{1}, p_{2}, v_{2}), ..., (v_{k-1}, p_{k}, v_{k})\).

This means in particular that \(\forall i,j \in [0, k], i \ne j \rightarrow v_i \ne v_j\). We use \(\pi ^k(v_0, v_k)\) to denote paths between \(v_0\) and \(v_k\). For the sake of legibility, we use the notation \(v_0 \xrightarrow {p_1} \ldots \xrightarrow {p_k} v_k\) to denote paths. Note that several paths can exist between \(v_0\) and \(v_k\). For example, BarackObama \(\xrightarrow {\texttt {birthPlace}}\) Hawaii \(\xrightarrow {\texttt {country}}\) USA and BarackObama \(\xrightarrow {\texttt {party}}\) DemocraticParty \(\xrightarrow {\texttt {country}}\) USA are both paths of length 2 between the resources BarackObama and USA in our running example.

Definition 3

Undirected path: An undirected path of length k in a graph \(\mathcal {G}\) is a cycle-free sequence of triples of the form \((v_{0}, p_{1}, v_{1}), (v_{1}, p_{2}, v_{2}), ..., (v_{k-1}, p_{k}, v_{k})\) where \(\forall i \in [0, k-1]\ (v_i, p_{i+1}, v_{i+1}) \in \mathcal {G} \vee (v_{i+1}, p_{i+1}, v_{i}) \in \mathcal {G}\).

Again, this means that \(\forall i,j \in [0, k], i \ne j \rightarrow v_i \ne v_j\). We denote undirected paths with \(\mu ^k(v_0, v_k)\). For example, BarackObama \(\xleftarrow {\texttt {alumni}}\) GreenwichHighSchool \(\xrightarrow {\texttt {country}}\) USA is an undirected path of length 2 between BarackObama and USA in our example.
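Both notions can be enumerated with a simple depth-first search over the triple set. The sketch below is our own illustration over the set-of-triples representation introduced above (the actual implementation described in Sect. 5 uses SPARQL instead):

def paths(G, start, end, k, undirected=False):
    """Enumerate cycle-free (un)directed paths of length <= k from start to end.

    Each path is returned as a list of steps (s, p, o, forward), where forward
    indicates whether the triple was traversed in its original direction."""
    results = []

    def extend(node, visited, acc):
        if acc and node == end:
            results.append(list(acc))
            return  # end is now visited, so no extension can reach it again
        if len(acc) == k:
            return
        for (s, p, o) in G:
            if s == node and o not in visited:
                acc.append((s, p, o, True))
                extend(o, visited | {o}, acc)
                acc.pop()
            elif undirected and o == node and s not in visited:
                acc.append((s, p, o, False))  # edge traversed against its direction
                extend(s, visited | {s}, acc)
                acc.pop()

    extend(start, {start}, [])
    return results

G = {("BarackObama", "birthPlace", "Hawaii"), ("Hawaii", "country", "USA"),
     ("GreenwichHighSchool", "alumni", "BarackObama"),
     ("GreenwichHighSchool", "country", "USA")}
print(len(paths(G, "BarackObama", "USA", 2)))                   # 1 directed path
print(len(paths(G, "BarackObama", "USA", 2, undirected=True)))  # 2, incl. the alumni path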

4 Corroborative Paths

4.1 Intuition

In this paper, we address the following problem: Given an RDF knowledge graph \(\mathcal {G}\) and a triple (s, p, o), compute the likelihood that (s, p, o) is true. For example, we would have good reasons to believe that BarackObama is a citizen of the USA given that BarackObama was born in Hawaii and Hawaii is located in the USA. Clearly, we cannot formally infer that x is a national of z by virtue of the existence of \(x \xrightarrow {\texttt {birthplace}} y \xrightarrow {\texttt {country}} z\). Still, this path is a strong indicator for (i.e., strongly corroborates) triples of the form \(x \xrightarrow {\texttt {nationality}} z\). The basic intuition behind our work is correspondingly that the existence of certain paths \(\pi ^k (s, o)\) between s and o is a strong indicator for the correctness of (s, p, o) (i.e., corroborates its existence) and can hence be used to compute its likelihood.

4.2 Formal Model

Let \(\gamma \) be a function which maps each element of \(\mathbb {E} \cup \mathbb {P} \cup \mathbb {B} \cup \mathbb {L}\) to its type. For example, \(\gamma \)(BarackObama) = Person \(\sqcap \) Agent \(\sqcap \) Politician \(\sqcap \) President and \(\gamma \)(UnitedStates) = Place \(\sqcap \) Location \(\sqcap \) Country \(\sqcap \) PopulatedPlace in our running example.Footnote 4 Further, let \(\lambda \) be a function which maps a given type \(t_x\) to the set of resources that are instances of \(t_x\) by virtue of RDFS semantics. Extending the formal model in [17], we now define the set \(\varPi ^{k}_{(t_{x}, t_{y})}\) of typed paths of length k between pairs of resources of type \(t_x\) and \(t_y\) in a knowledge graph \(\mathcal {G}\) as follows:

$$\begin{aligned} \varPi ^{k}_{(t_{x}, t_{y})} = \{\pi ^{k}(v_0, v_k)\ |\ \gamma (v_{0}) \sqsubseteq t_{x} \wedge \gamma (v_{k}) \sqsubseteq t_{y} \}. \end{aligned}$$
(2)

For \(t_x=\{\texttt {Person}\}\) and \(t_y=\{\texttt {Place}\}\), the path BarackObama \(\xrightarrow {\texttt {birthPlace}}\) Hawaii \(\xrightarrow {\texttt {country}}\) USA is an element of the set \(\varPi ^{2}_{(t_{x}, t_{y})}\) in our running example. We define the set \(M^{k}_{(t_{x}, t_{y})}\) of typed undirected paths analogously.

Let \(\vec {q} = (q_1,\ldots ,q_k)\) be a vector of properties of length k. We define the set of \(\vec {q}\)-restricted typed paths \(\varPi ^{k}_{(t_{x}, t_{y}),\vec q} \subseteq \varPi ^{k}_{(t_{x}, t_{y})}\) as follows:

$$\begin{aligned} \begin{aligned} \varPi ^{k}_{(t_{x}, t_{y}),\vec {q}} =&\bigl \{\pi ^{k}(v_0, v_k) \;\big |\; \pi ^{k}(v_0, v_k) \in \varPi ^{k}_{(t_{x}, t_{y})},\\&\,\,\,\forall i \in [0, k-1] : (v_{i},p_{i+1},v_{i+1}) \in \pi ^{k}(v_0, v_k) \rightarrow p_{i+1} = q_{i+1} \bigr \} . \end{aligned} \end{aligned}$$
(3)

Put simply, this is the set of typed paths such that the sequence of properties in each path is exactly \(\vec {q}\). For example, let \(t_x\)=\(\{\texttt {Person}\}\), \(t_y=\{\texttt {Place}\}\) and \(\vec {q}\) = (birthPlace, country). Then the path BarackObama \(\xrightarrow {\texttt {birthPlace}}\) Hawaii \(\xrightarrow {\texttt {country}}\) USA is the only element of \(\varPi ^{2}_{(t_{x}, t_{y}),\vec {q}}\) in our running example. We call the elements of \(\varPi ^{k}_{(t_{x}, t_{y}),\vec {q}}\) similar as they share a sequence of predicates (i.e., \(\vec {q}\)). We define sets of \(\vec {q}\)-restricted undirected typed paths \(M^{k}_{(t_{x}, t_{y}),\vec q}\) analogously to \(\varPi ^{k}_{(t_{x}, t_{y}),\vec {q}}\).
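In terms of the sketches above, grouping the discovered paths by their predicate sequence yields exactly these \(\vec q\)-restricted sets. The helper below is a hypothetical illustration (the type constraints \(\gamma (v_0) \sqsubseteq t_x\) and \(\gamma (v_k) \sqsubseteq t_y\) are assumed to be checked separately against the schema):

from collections import defaultdict

def group_by_q(discovered_paths):
    """Group paths by their predicate sequence q (cf. Eq. 3); each group is one
    q-restricted path set. Paths are step lists as returned by paths() above."""
    groups = defaultdict(list)
    for path in discovered_paths:
        q = tuple(p for (_, p, _, _) in path)  # the predicate sequence vector q
        groups[q].append(path)
    return groups

# In the running example, the path via birthPlace and country falls into the
# group keyed by ("birthPlace", "country").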

We can now use restricted typed paths to compute how well a predicate is corroborated in a knowledge graph as follows: Let D(p) be the domain of p and R(p) be its range. Given that we assume RDF knowledge graphs, we can safely assume the existence of an RDFS class hierarchy for the said graph (with instances assigned to classes via the \(\texttt {rdf:type}\) predicate). Consequently, we can derive the following important condition on paths \(\pi ^k (s, o)\) which are to corroborate the correctness of (s, p, o): Only typed paths in \(\varPi ^k_{(D(p), R(p))}\) can corroborate facts with the predicate p. This particular insight is one of the major differences between this and previous works (see Sect. 2), in which the consequences of RDFS semantics were not taken into consideration. In particular, while previous approaches [17] used at most \(\gamma (s)\) and \(\gamma (o)\) to measure the strength of the association between paths and predicates, we use D(p) and R(p) as well as the RDFS class hierarchy in the input knowledge graph \(\mathcal {G}\) to determine the degree to which a path \(\pi ^k(s,o)\) corroborates a predicate p.

Given an RDF knowledge graph \(\mathcal {G}\), we hence define the corroborative paths for a predicate p formally as follows:

$$\begin{aligned} \varPi ^{k}(p) = \bigcup \limits _{j=1}^k \varPi ^{j}_{(D(p), R(p))}. \end{aligned}$$
(4)

Simply put, corroborative paths in \(\varPi ^{k}(p)\) are paths of length at most k that carry similar information to p.

4.3 Association Strength

We base our computation of the strength of the association between \(\varPi ^{j}_{(t_{x}, t_{y}),\vec q}\) and p on their normalized pointwise mutual information [4]. To this end, we define the probability \(\mathcal {P}(\varPi ^{j}_{(t_{x},t_{y}),\vec q})\) of pairs of instances of \(t_x\) resp. \(t_y\) being connected via a \(\vec {q}\)-restricted path of length j as follows:

$$\begin{aligned} \frac{\left| \{(a, b): \gamma (a) \sqsubseteq t_x \wedge \gamma (b) \sqsubseteq t_y \wedge (\exists \pi ^j(a, b) \in \varPi ^{j}_{(t_{x},t_{y}),\vec q})\} \right| }{|\lambda (t_x)|\cdot |\lambda (t_y)|}. \end{aligned}$$
(5)

The probability \(\mathcal {P}(p)\) of the predicate p linking resources of type \(t_x\) and \(t_y\) is

$$\begin{aligned} \frac{\left| \left\{ (a, p, b): \gamma (a) \sqsubseteq t_x \wedge \gamma (b) \sqsubseteq t_y \wedge (a, p, b) \in \mathcal {G} \right\} \right| }{|\lambda (t_x)|\cdot |\lambda (t_y)|} \end{aligned}$$
(6)

Finally, the joint probability \(\mathcal {P}(\varPi ^{j}_{(t_{x},t_{y}),\vec q}, p)\) is defined as

$$\begin{aligned} \frac{\left| \{(a, b): \gamma (a) \sqsubseteq t_x \wedge \gamma (b) \sqsubseteq t_y \wedge (\exists \pi ^j(a, b) \in \varPi ^{j}_{(t_{x},t_{y}),\vec q}) \wedge (a, p, b) \in \mathcal {G}\} \right| }{|\lambda (t_x)|\cdot |\lambda (t_y)|}. \end{aligned}$$
(7)

We could now compute the NPMI of \(\varPi ^{j}_{(t_{x}, t_{y}),\vec q}\) and p as defined in [4]. However, a direct implementation of the original definition of the NPMI would be expensive, as it would require deduplicating the sets of pairs (a, b) connected by the paths in \(\varPi ^j_{(t_x, t_y)}\).Footnote 5 Hence, our approach implements an approximation of the NPMI based on counting the number of paths which connect pairs (a, b) instead of the pairs themselves. We hence end up with the following approximations (note that these values are not probabilities):

$$\begin{aligned} \widehat{\mathcal {P}}(\varPi ^{j}_{(t_{x},t_{y}),\vec q}) = \frac{| \varPi ^{j}_{(t_{x},t_{y}), \vec q}|}{|\lambda (t_x)|\cdot |\lambda (t_y)|} \end{aligned}$$
(8)
$$\begin{aligned} \widehat{\mathcal {P}}(\varPi ^{j}_{(t_{x},t_{y}),\vec q}, p) = \frac{|\{\pi ^j(a, b) \in \varPi ^{j}_{(t_{x},t_{y}),\vec q}: (a, p, b) \in \mathcal {G}\}|}{|\lambda (t_x)|\cdot |\lambda (t_y)|}. \end{aligned}$$
(9)

These approximations can be computed by using SPARQL queries without a DISTINCT clause, which makes the computation an order of magnitude faster (see Table 7 for some of the scores returned by this function). Note that \(\mathcal {P}(p)\) remains unchanged, as the number of paths \(a \xrightarrow {p} b\) is exactly equal to the number of pairs (a, b) connected by p. Based on these approximations, we can now approximate the NPMI of \(\varPi ^{j}_{(t_{x}, t_{y}),\vec q}\) and p as follows:

$$\begin{aligned} \widehat{\text {NPMI}}(\varPi ^{j}_{(t_{x}, t_{y}),\vec q}, p) = \frac{\log {\Bigg (\frac{ \widehat{\mathcal {P}}\left( \varPi ^{j}_{(t_{x}, t_{y}),\vec q}, p\right) }{\widehat{\mathcal {P}}\left( \varPi ^{j}_{(t_{x},t_{y}),\vec q}\right) \cdot {\mathcal {P}}(p)}\Bigg )}}{-\log \bigg (\widehat{\mathcal {P}}\left( \varPi ^{j}_{(t_{x},t_{y}),\vec q}, p\right) \bigg )} \end{aligned}$$
(10)
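For illustration, this approximation can be computed from four counts, each obtainable via a single SPARQL count query. The following minimal sketch is our own illustration (in particular, returning \(-1\) for a zero joint count is our convention, not taken from the paper):

import math

def npmi_hat(path_count, cooc_count, pred_count, n_tx, n_ty):
    """Approximate NPMI (Eq. 10) between a q-restricted typed path set and p.

    path_count -- numerator of Eq. (8): |Pi^j_{(tx,ty),q}|
    cooc_count -- numerator of Eq. (9): paths whose endpoints are also linked by p
    pred_count -- numerator of Eq. (6): triples (a, p, b) with matching types
    n_tx, n_ty -- |lambda(tx)| and |lambda(ty)|"""
    norm = n_tx * n_ty
    p_path, p_joint, p_pred = path_count / norm, cooc_count / norm, pred_count / norm
    if p_joint == 0:
        return -1.0  # no co-occurrence at all: minimal association (our convention)
    return math.log(p_joint / (p_path * p_pred)) / -math.log(p_joint)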

5 Method and Implementation

This section details our implementation of the formal model introduced above. In particular, we show how some of the core computations of our model can be implemented using SPARQL queries, ensuring practicable runtimes for our approach. As above, we explain the approach using directed paths for the sake of legibility. The approach was also implemented using undirected paths. An evaluation of the performance of the approach with directed and undirected paths is presented in Sect. 7.

Algorithm 1. (Pseudocode listing not reproduced in this rendering; an illustrative sketch is given below.)

5.1 Algorithm

Given an input triple \(t=(s, p, o)\), a knowledge graph \(\mathcal {G}\) and a maximum path length k, our implementation begins by identifying a set of paths of varying lengths connecting s and o. For each path, it calculates a score which explicates the degree to which the path corroborates t. Finally, the scores are amalgamated into a single score \(\tau \) which expresses the veracity of t. The complete algorithm is shown in Algorithm 1 and can be separated into the four steps (i) initialization, (ii) path discovery, (iii) path scoring and (iv) veracity calculation.
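Since the original pseudocode listing is not reproduced above, the following sketch outlines the four steps under our own assumptions; it is not the authors' code. All kg.* methods are hypothetical placeholders for the SPARQL queries described in the remainder of this section, and npmi_hat is the function sketched in Sect. 4.3:

def validate(s, p, o, k, kg):
    """Compute the veracity score tau of the input triple (s, p, o)."""
    # (i) Initialization: domain/range of p and the associated counts.
    t_x, t_y = kg.domain(p), kg.range(p)
    n_tx, n_ty = kg.count_instances(t_x), kg.count_instances(t_y)
    pred_count = kg.count_triples(p, t_x, t_y)

    scores = []
    for j in range(1, k + 1):
        # (ii) Path discovery: q-restrictions of paths of length j between s and o.
        for q in kg.discover_q_restrictions(s, o, j):
            # (iii) Path scoring via the NPMI approximation of Eq. (10).
            path_count = kg.count_typed_paths(q, t_x, t_y)
            cooc_count = kg.count_typed_paths_with_p(q, p, t_x, t_y)
            scores.append(npmi_hat(path_count, cooc_count, pred_count, n_tx, n_ty))

    # (iv) Veracity calculation as in Eq. (11); clipping negative scores to 0 is
    # our assumption about how non-supporting paths are treated.
    tau = 1.0
    for zeta in scores:
        tau *= 1.0 - max(zeta, 0.0)
    return 1.0 - tau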

Initialization. First, we prune \(\mathcal {G}\) by removing all nodes from namespaces outside the union of (i) the base namespace(s) of \(\mathcal {G}\) and (ii) the namespaces of RDF, RDFS and OWL. We carry out this preprocessing because we are interested in relations (edges) that are defined by the ontology of the given \(\mathcal {G}\) (line 1). Thereafter, the domain D(p) and range R(p) of the given triple’s predicate p are determined.Footnote 6 The number of instances of these two types as well as the number of triples containing p as predicate are retrieved via SPARQL count queries (lines 2–4).
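For illustration, these counts can be obtained with queries of the following shape. This is a sketch under our own assumptions, in particular the use of the property path rdf:type/rdfs:subClassOf* to emulate RDFS subsumption rather than relying on inference in the triple store:

PREFIXES = (
    "PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> "
    "PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#> "
)

def count_instances_query(cls_iri: str) -> str:
    # |lambda(t)|: instances of a class, including instances of its subclasses.
    return (PREFIXES + "SELECT (COUNT(DISTINCT ?x) AS ?c) "
            f"WHERE {{ ?x rdf:type/rdfs:subClassOf* <{cls_iri}> . }}")

def count_predicate_query(p_iri: str) -> str:
    # Number of triples with predicate p.
    return PREFIXES + f"SELECT (COUNT(*) AS ?c) WHERE {{ ?a <{p_iri}> ?b . }}"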

Path Discovery. In the second step, the properties of all paths \(\pi ^j(s, o)\) (i.e., their \(\vec q\)-restrictions) of length \(j \in [1, k]\) between s and o are retrieved. To this end, we generate SPARQLFootnote 7 query templates (line 7). The query template which retrieves directed paths of length \(j=2\) between ?v0 and ?v2 from an RDF graph is given in the reconstruction below.

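The original listing is not preserved in this rendering; the following sketch reconstructs it from the surrounding description. The generator function and the use of DISTINCT are our own illustration rather than the authors' exact code:

def path_template(j: int) -> str:
    """Build a SPARQL template that retrieves the predicate sequences (i.e., the
    q-restrictions) of all directed paths of length j between ?v0 and ?vj."""
    patterns = " . ".join(f"?v{i} ?p{i + 1} ?v{i + 1}" for i in range(j))
    predicates = " ".join(f"?p{i + 1}" for i in range(j))
    return f"SELECT DISTINCT {predicates} WHERE {{ {patterns} . }}"

# path_template(2) yields:
# SELECT DISTINCT ?p1 ?p2 WHERE { ?v0 ?p1 ?v1 . ?v1 ?p2 ?v2 . }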

Note that the query has to be modified with UNION clauses to cover undirected paths. Still, a single query suffices to retrieve all paths of a given length. Hence, our approach generates k queries in this step.

After generating all necessary query templates up to the given length k, we replace the variables ?v0 and ?vj with s and o, respectively (line 9). We prune the results of the query (line 11) by removing results containing predicates which define the terminology (e.g., class membership through rdf:type, class hierarchy through rdfs:subClassOf).Footnote 8 The remaining \(\vec q\)-restrictions are stored as pairs together with the template which was used to retrieve them in the list \(\mathcal {Q}\) (lines 12–14). We store the template to ensure that we can reconstruct the direction of the predicates in case undirected paths are used.

Path Scoring. The pairs in \(\mathcal {Q}\) are used to define the \(\vec q\)-restricted typed path sets \(\varPi ^{j}_{(D(p),R(p)), \vec q}\). For each of these pairs, a score \(\zeta \) is calculated based on the NPMI approximation in Eq. 10 (lines 16–23). For the sake of efficiency, we use SPARQL queries to obtain the necessary counts of typed paths; these queries are generated based on the query template and \(\vec q\) (line 18). However, as pointed out in [7], a direct translation of the needed counts into queries leads to time-consuming computationsFootnote 9 which require optimization. Therefore, we generate the SPARQL queries needed for our counts with a recursive structure. Listing 1.1 shows a sample query used to count the number of paths \(\texttt {?v0}\xrightarrow {\texttt {birthPlace}}{} \texttt {?v1}\xrightarrow {\texttt {country}}{} \texttt {?v2}\) between entities with the types Person and Country, respectively.

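Listing 1.1 itself is not preserved in this rendering. The sketch below illustrates the idea behind a recursively structured count query: grouping by the intermediate node ?v1 lets the triple store multiply partial counts instead of materializing every path. It is our reconstruction under that assumption, not necessarily the authors' exact query (filters enforcing cycle-freeness are omitted for brevity):

# Counts the paths ?v0 --birthPlace--> ?v1 --country--> ?v2 with ?v0 a Person
# and ?v2 a Country by summing, over each intermediate ?v1, the product of the
# number of matching subjects and objects.
COUNT_PATHS_QUERY = """
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX dbo: <http://dbpedia.org/ontology/>
SELECT (SUM(?c1 * ?c2) AS ?paths) WHERE {
  { SELECT ?v1 (COUNT(?v0) AS ?c1) WHERE {
      ?v0 rdf:type dbo:Person .
      ?v0 dbo:birthPlace ?v1 . } GROUP BY ?v1 }
  { SELECT ?v1 (COUNT(?v2) AS ?c2) WHERE {
      ?v1 dbo:country ?v2 .
      ?v2 rdf:type dbo:Country . } GROUP BY ?v1 }
}
"""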

Veracity Calculation. We treat the association strength of each \(\vec q\)-restricted typed path set as the confidence with which these paths support the existence of the input predicate p. We hence combine the \(\zeta \) values by checking whether at least one path supports p. Let Z be the set of scores of all single paths; the veracity score \(\tau \) can then be calculated with the following equation (see lines 23–28):

$$\begin{aligned} \tau = 1 - \prod _{\zeta \in Z} \left( 1 - \zeta \right) . \end{aligned}$$
(11)
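For example, if two paths corroborate the input triple with scores \(\zeta _1 = 0.8\) and \(\zeta _2 = 0.5\), then \(\tau = 1 - (1-0.8)(1-0.5) = 0.9\): the evidence of several weaker paths accumulates, while a single strong path already suffices for a high veracity score.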

6 Experiments and Results

In this section, we provide details of the data and hardware used in our experiments. We compare the results of our approach with those achieved by state-of-the-art approaches in the subsequent section.

6.1 Setup

Knowledge Graph. For our experiments, we chose DBpedia version 2016-10 as background knowledge. We chose this dataset because it is the reference dataset of a large number of fact validation benchmarks. We used the latest dumpsFootnote 10 of the ontology, instance types, mapping-based objects and infobox properties. We filtered out triples that (i) contain literals and datatypes or (ii) link the entities in DBpedia to external sources. The final graph contains 44 million triples, which we stored in an instance of OpenLink Virtuoso v7.2.5.1 hosted on a VM with 16 GB of memory and 256 GB of disk space. To ensure the comparability of our results, we ran our evaluation using GERBIL [16]—a benchmarking platform that facilitates the evaluation of fact validation systems across different datasets.Footnote 11 We used AUC-ROC as our evaluation metric and set \(k=2\) for the sake of comparability with previous works.

Competing Approaches. We compare our approach (COPAAL) to three state-of-the-art graph-based fact validation approaches: (i) Knowledge Stream (KS), (ii) its variant Relational Knowledge Linker (KL-REL) [18] and (iii) Discriminative Path Mining (PredPath) [17]. For all these approaches, we use the implementations provided by the authors [18].Footnote 12 We considered the configurations suggested in the original papers: (i) PredPath [17] uses the top-100 features when learning from positive and negative facts. (ii) KS [18] and KL-REL [18] use the top-5 paths and the single best path, respectively, for validating input triples.

6.2 Benchmarks

We evaluated all approaches using two publicly available sets of benchmarks: (i) the Real-World and (ii) the Synthetic datasetsFootnote 13 made available by the authors of [18]. In addition, we generated a new set of benchmarks dubbed FactBench-DBpedia from the FactBenchFootnote 14 dataset. All facts in FactBench are automatically extracted from DBpedia and Freebase for 10 different relationsFootnote 15 and stored in the form of RDF models. In FactBench, the positive facts are generated by querying DBpedia and Freebase and selecting the top-150 results returned for each relation. The negative facts are generated by modifying the positive facts while still adhering to domain and range restrictions. The positive and negative facts are collected into 6 different benchmarks dubbed Domain, Range, Domain-Range, Mix, Random and Property. FactBench-DBpedia restricts the generation process of FactBench to DBpedia by extracting all facts belonging to DBpedia as well as the facts from Freebase whose resources can be mapped to resources in DBpedia. Table 2 summarizes the statistics of the different datasets.

Table 2. Summary of benchmark datasets

7 Results

7.1 Comparison of Directed and Undirected Paths

We first aimed to determine the type of paths for which our approach performs best. We hence compared the AUC-ROC achieved by both variations of our approach on FactBench-DBpedia (see Table 3). The results are clear: Using undirected paths (average AUC-ROC = 0.87) always outperforms using directed paths (average AUC-ROC = 0.66), with an improvement of 0.21 AUC-ROC on average. We studied the results achieved using the two types of paths. It quickly became evident that using undirected paths allows our approach to detect significantly more corroborative evidence. Therewith, undirected paths achieve a better approximation of the probability of a triple being true (see Table 7 for examples). Consequently, we only consider our approach with undirected paths in the following.

Table 3. Comparison of AUC-ROC achieved using directed and undirected paths
Table 4. AUC-ROC results of all approaches on Real-World datasets

7.2 Comparison with Other Approaches

Tables 4 and 5 show the AUC-ROC results of all approaches on the benchmarks contained in the Real-World and Synthetic datasets, respectively. Our approach outperforms the other approaches on most of these datasets. In the best case, we are roughly 4.5% (absolute value, Birth Place benchmark) better than PredPath and more than 20% (absolute value, Birth Place benchmark) better than KS on real data. A careful study of our results reveals that the anchored predicate paths used by PredPath for learning features are restricted by the types of subject and object irrespective of the predicate of the input triple. Hence, they sometimes fail to generalize well. On the other hand, KL-REL uses the single best path, which sometimes limits its ability to validate facts, namely when it fails to rank the path which conveys the most evidence for the input triple in the first position. This is made evident by the examples shown in Table 7: We computed the union of the top-3 paths identified by our approach and all other approaches on the three datasets for which the differences in AUC values were the largest. We also computed the weights assigned by each of the approaches (i.e., \(\widehat{\text {NPMI}}\) for our approach, average flow values of paths for KS and KL-REL [18] and weights learned by the classifier for PredPath [17]). While our approach finds all paths and allocates weights to them, the other approaches sometimes fail to detect relevant paths (marked by dashes in Table 7) and are hence not able to use them in their evidence computation. Having a large number of paths available, however, also means that our scores are (even if rarely) overoptimistic w.r.t. the evidence for a triple, which explains the marginally lower scores we achieve on Death Place and Oscars.

Table 5. ROC-AUC results of all approaches on Synthetic datasets
Table 6. ROC-AUC results of all approaches on FactBench-DBpedia datasets

The results on FactBench-DBpedia (see Table 6) confirm the insights we gained on the previous two sets of datasets. Our approach outperforms the state of the art and achieves a better AUC-ROC on most datasets. We ran a Wilcoxon signed rank test (significance level = 99%) on all results we collected. The results indicate that our approach is significantly better than the state of the art.

One could assume that our approach is slower than the state of the art due to the larger amount of evidence it collects. Hence, we measured the average throughput of all approaches, including all phases of the processing. The average throughput of our approach was 21.02 triples/min. KS, which follows an approach similar to ours, achieves an average throughput of 10.05 triples/min, while its counterpart KL-REL achieves 29.78 triples/min. PredPath’s average throughput was 21.67 triples/min. Overall, our results show that our approach scales as well as the state of the art while achieving significantly better results.

Table 7. Union of the top-3 paths identified by the different approaches and their weighting. The weights allocated by each of the approaches are given in the corresponding column. A dash (-) means that the approach was not able to find the said path.

8 Conclusion and Future Work

In this paper, we presented a novel unsupervised approach for the validation of facts using an RDF knowledge graph \(\mathcal {G}\) as background knowledge. Our approach uses the domain, range and class subsumption information found in the schema of \(\mathcal {G}\) to outperform both supervised and unsupervised fact validation approaches. We evaluated our approach on 17 datasets against three state-of-the-art approaches. Our results show that our approach outperforms the state of the art significantly (Wilcoxon signed rank test, \(p < 0.01\)). We studied the difference between the approaches and concluded that our approach performs better because it is able to score corroborative paths more accurately, as it uses more information from the schema of \(\mathcal {G}\). These results point to the importance of using the semantics of the data contained in RDF knowledge graphs when aiming to validate them. Another advantage of our approach is that it allows us to verbalize the evidence found to support a given input triple.

The main limitation of our approach lies in its reliance on the existence of type information. Well-defined ontologies are not always available for real-world datasets, and our approach cannot be applied to datasets lacking them. Previous works have aimed at improving type information in noisy knowledge graphs [15]. We will evaluate whether combining our approach with such algorithms leads to better corroborative paths in future work. Additionally, the approaches evaluated herein are limited to evidence found in a single RDF graph. In future work, we will consider performing fact validation at a larger scale. In particular, we will use the linked nature of Linked Data sets to detect paths across several knowledge graphs. We will focus on the scalability and the distributed execution of this novel solution. Moreover, we will consider relaxing the type requirements used in the definition of \(\varPi ^k_{(t_x, t_y), \vec {q}}\) by using well-defined semantic similarities [10].