Abstract
Navigational graph queries are an important class of queries that can extract implicit binary relations over the nodes of input graphs. Most of the navigational query languages used in the RDF community, e.g. property paths in W3C SPARQL 1.1 and nested regular expressions in nSPARQL, are based on the regular expressions. It is known that regular expressions have limited expressivity; for instance, some natural queries, like same generation-queries, are not expressible with regular expressions. To overcome this limitation, in this paper, we present cfSPARQL, an extension of SPARQL query language equipped with context-free grammars. The cfSPARQL language is strictly more expressive than property paths and nested expressions. The additional expressivity can be used for modelling graph similarities, graph summarization and ontology alignment. Despite the increasing expressivity, we show that cfSPARQL still enjoys a low computational complexity and can be evaluated efficiently.
You have full access to this open access chapter, Download conference paper PDF
Similar content being viewed by others
Keywords
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.
1 Introduction
The Resource Description Framework (RDF) [30] recommended by World Wide Web Consortium (W3C) is a standard graph-oriented model for data interchange on the Web [6]. RDF has a broad range of applications in the semantic web, social network, bio-informatics, geographical data, etc. [1]. Typical access to graph-structured data is its navigational nature [12, 16, 21]. Navigational queries on graph databases return binary relations over the nodes of the graph [9]. Many existing navigational query languages for graphs are based on binary relational algebra such as XPath (a standard navigational query language for trees [25]) or regular expressions such as RPQ (regular path queries) [24].
SPARQL [32] recommended by W3C has become the standard language for querying RDF data since 2008 by inheriting classical relational languages such as SQL. However, SPARQL only provides limited navigational functionalities for RDF [28, 37]. Recently, there are several proposed languages with navigational capabilities for querying RDF graphs [3–5, 7, 11, 19, 26, 28, 35]. Roughly, Versa [26] is the first language for RDF with navigational capabilities by using XPath over the XML serialization of RDF graphs. SPARQLeR proposed by Kochut et al. [19] extends SPARQL by allowing path variables. SPARQL2L proposed by Anyanwu et al. [7] allows path variables in graph patterns and offers good features in nodes and edges such as constraints. PSPARQL proposed by Alkhateeb et al. [5] extends SPARQL by allowing regular expressions in general triple patterns with possibly blank nodes and CASPAR further proposed by Alkhateeb et al. [3, 4] allows constraints over regular expressions in PSPARQL where variables are allowed in regular expressions. nSPARQL proposed by Pérez et al. [28] extends SPARQL by allowing nested regular expressions in triple patterns. Indeed, nSPARQL is still expressible in SPARQL if the transitive closure relation is absent [37]. In March 2013, SPARQL 1.1 [33] recommended by W3C allows property paths which strengthen the navigational capabilities of SPARQL1.0 [11, 35].
However, those regular expression-based extensions of SPARQL are still limited in representing some more expressive navigational queries which are not expressed in regular expressions. Let us consider a fictional biomedical ontology mentioned in [31] (see Fig. 1). We are interested in a navigational query about those paths that confer similarity (e.g., between Gene(B) and Gene(C)), which suggests a causal relationship (e.g., between Gene(S) and Phenotype(T)). This query about similarity arises from the well-known same generation-query [2], which is proven to be inexpressible in any regular expression. To express the query, we have to introduce a query embedded with a context-free grammar (CFG) for expressing the language of \(\{ww^{T} |w is a string \}\)[31] where \(w^{T}\) is the converse of w. For instance, if w = “abcdfe” then \(w^{T}\) = “\(e^{-1}f^{-1}d^{-1}c^{-1}b^{-1}a^{-1}\)”. As we know, CFG has more expressive power than any regular expression [18]. Moreover, the context-free grammars can provide a simplified more user-friendly dialect of Datalog [1] which still allows powerful recursion [18]. Besides, the context-free graph queries have also practical query evaluation strategies. For instance, there are some applications in verification [20]. So it is interesting to introduce a navigational query embedded with context-free grammars to express more practical queries like the same generation-query.
A proposal of conjunctive context-free path queries (written by Helling’s CCFPQ) for edge-labeled directed graphs has been presented by Helling [14] by allowing context-free grammars in path queries. A naive idea to express same generation-queries is transforming this RDF graph to an edge-labeled directed graph via navigation axes [28] and then using Helling’s CCFPQ since an RDF graph can be intuitively taken as an edge-labeled directed graph. However, this transformation is difficult to capture the full information of this RDF graph since there exist some slight differences between RDF graphs and edge-labeled directed graphs, particularly regarding the connectivity [13], thus it could not express some regular expression-based path queries on RDF graphs. For instance, a nested regular expression (nre) of the form \( axis {:}{:}[e]\) on RDF graphs in nSPARQL [28], is always evaluated to the empty set over any edge-labeled directed graph. That is to say, an nre of the form “\( axis {:}{:}[e]\)” is hardly expressible in Helling’s CCFPQ.
To represent more expressive queries with efficient query evaluation is a renewed interest topic in the classical topic of graph databases [2]. Hence, in this paper, we present a context-free extension of path queries and SPARQL queries on RDF graphs which can express both nre and nSPARQL [28]. Furthermore, we study several fundamental properties of the proposed context-free path queries and context-free SPARQL queries. The main contributions of this paper can be summarized as follows:
-
We present context-free path queries (CFPQ) (including conjunctive context-free path queries (CCFPQ), union of simple conjunctive context-free path queries (UCCFPQ\(^{s}\)), and union of conjunctive context-free path queries (UCCFPQ) for RDF graphs and find that CFPQ, CCFPQ, and UCCFPQ have efficient query evaluation where the query evaluation has the polynomial data complexity and the NP-complete combined complexity. Finally, we implement our CFPQs and evaluate experiments on some popular ontologies.
-
We discuss the expressiveness of CFPQs by referring to nested regular expressions (nre). We show that CFPQ, CCFPQ, UCCFPQ\(^{s}\), and UCCFPQ exactly express four fragments of nre, basic nre “nre\(_{0}\)”, union-free nre “nre\(_{0}(\mathrm {N})\)”, nesting-free nre “nre\(_{0}(|)\)”, and full nre, respectively (see Fig. 2). The query evaluation of cfSPARQL has the same complexity as SPARQL.
-
We propose context-free SPARQL (cfSPARQL) and union of conjunctive context-free SPARQL (uccfSPARQL) based on CFPQ and UCCFPQ, respectively. It shows that cfSPARQL has the same expressiveness as that of uccfSPARQL. Furthermore, we prove that cfSPARQL can strictly express both SPARQL and nSPARQL (even nSPARQL\(^\lnot \): a variant of nSPARQL by allowing nre with negation “nre\(^{\lnot }\)) (see Fig. 3).
Organization of the Paper. Section 2 recalls nSPARQL and context-free grammar. Section 3 defines CFPQ. Section 4 discusses the expressiveness of CFPQ. Section 5 presents cfSPARQL and Sect. 6 discusses the relations on nre with negation. Section 7 evaluates experiments. We conclude in Sect. 8. Due to the space limitation, all proofs and some further preliminaries are omitted but they are available in an extended technical report in arXiv.org [36].
2 Preliminaries
In this section, we introduce the language nSPARQL and context-free grammar.
2.1 The Syntax and Semantics of nSPARQL
In this subsection, we recall the syntax and semantics of nSPARQL, largely following the excellent expositions [27, 28].
RDF Graphs. An RDF statement is a subject-predicate-object structure, called RDF triple which represents resources and the properties of those resources. For the sake of simplicity similar to [28], we assume that RDF data is composed only IRIsFootnote 1. Formally, let U be an infinite set of IRIs. A triple \((s, p, o) \in {U} \times {U} \times {U}\) is called an RDF triple. An RDF graph G is a finite set of RDF triples. We use \( adom (G)\) to denote the active domain of G, i.e., the set of all elements from U occurring in G.
For instance, a biomedical ontology shown in Fig. 1 can be modeled in an RDF graph named as \(G_{\mathrm {bio}}\) where each labeled-edge of the form \(a \overset{p}{\rightarrow } b\) is directly translated into a triple (a, p, b).
Paths and Traces. Let G be an RDF graph. A path \(\pi = (c_1c_2 \ldots c_m)\) in G is a non-empty finite sequence of constants from G, where, for every \(i \in \{1, \ldots , m-1\}, c_i\) and \(c_{i+1}\) exactly occur in the same triple of G (i.e., \((c_{i}, c, c_{i+1}), (c_{i}, c_{i+1}, c)\), and \((c, c_{i}, c_{i+1})\) etc.). Note that the precedence between \(c_i\) and \(c_{i+1}\) in a path is independent of the positions of \(c_i, c_{i+1}\) in a triple.
In nSPARQL, three different navigation axes, namely, \( next , edge \), and \( node \), and their inverses, i.e., \( next ^{-1}, edge ^{-1}\), and \( node ^{-1}\), are introduced to move through an RDF triple (s, p, o) [28].
Let \(\Sigma =\{ axis , { axis {:}{:}c} |c \in {U}\}\) where \( axis \in \{ self , next , edge , node , next ^{-1}, edge ^{-1}, node ^{-1}\}\). Let G be an RDF graph. We use \(\Sigma (G)\) to denote the set of all symbols \(\{ axis , { axis {:}{:}c} |c \in adom (G)\}\) occurring in G.
Let \(\pi = (c_1\ldots c_m)\) be a path in G. A trace of path \(\pi \) is a string over \(\Sigma (G)\) written by \(\mathcal {T}(\pi ) = l_1\ldots l_{m-1}\) where, for all \( i \in \{1, \ldots , m-1\}, (c_ic_{i+1})\) is labeled by \(l_i\) and \(l_i\) is of the form \( axis , axis {:}{:}c, axis ^{-1}\), or \( axis ^{-1}{:}{:}c\) [28]. We use \( Trace (\pi )\) to denote the set of all traces of \(\pi \).
Note that it is possible that a path has multiple traces since any two nodes possibly occur in the multiple triples. For example, consider an RDF graph \(G = \{(a, b, c), (a, c, b)\}\) and given a path \(\pi = (abc)\), both \(( edge {:}{:}c)( node {:}{:}a)\) and \(( next {:}{:}c)( node ^{-1}{:}{:}a)\) are traces of \(\pi \).
For instance, in the RDF graph \(G_{\mathrm {bio}}\) (see Fig. 1), a path from Gene(B) to Gene(C) has a trace: \(( next {:}{:}{} locatedIn )( next ^{-1}{:}{:}{} linkedTo )( next {:}{:}{} linkedTo )( next ^{-1}{:}{:}{} locatedIn )\).
Nested Regular Expressions. Nested regular expressions (nre) are defined by the following formal syntax:
Here the nesting nre is of the form \( axis {:}{:}[e]\).
For simplification, we denote some interesting fragments of nre as follows:
-
nre\(_{0}\): basic nre, i.e., nre only consisting of “\( axis \)”, “ / ”, and “\(*\)”;
-
nre\(_{0}(|)\): basic nre by adding the operator “\(|\)”;
-
nre\(_{0}(\mathrm {N})\) to basic nre by adding nesting nre \( axis {:}{:}[e]\).
Patterns. Assume an infinite set V of variables, disjoint from U. A nested regular expression triple (or nre-triple) is a tuple of the form (?x, e, ?y), where \(?x, ?y \in V\) and e is an nreFootnote 2.
Formally, nSPARQL (graph) patterns are recursively constructed from nre-triples:
-
An nre-triple is an nSPARQL pattern;
-
All \(P_1\, \mathbin {\mathrm {UNION}}\, P_2, P_1\, \mathbin {\mathrm {AND}}\, P_2\), and \(P_1\, \mathbin {\mathrm {OPT}}\, P_2\) are nSPARQL patterns if \(P_1\) and \(P_2\) are nSPARQL patterns;
-
\(P\, \mathbin {\mathrm {FILTER}}\, C\) if P is an nSPARQL pattern and C is a constraint;
-
\(\mathop {\mathrm {SELECT}}_{S}(P)\) if P is an nSPARQL pattern and S is a set of variables.
Semantics. Given an RDF graph G and an nre e, the evaluation of e on G, denoted by \(\llbracket e \rrbracket _{G}\), is a binary relation. More details can be found in [28]. Here, we recall the semantics of nesting nre of the form \( axis {:}{:}[e]\) as follows:
The semantics of nSPARQL patterns is defined in terms of sets of so-called mappings, which are simply total functions \(\mu :S \rightarrow U\) on some finite set S of variables. We denote the domain S of \(\mu \) by \(\mathrm {dom}(\mu )\).
Basically, the semantics of an nre-triple (u, e, v) is defined as follows:
Here, for any mapping \(\mu \) and any constant \(c \in {U}\), we agree that \(\mu (c)\) equals c itself.
Let P be an nSPARQL pattern, the semantics of P on G, denoted by \(\llbracket P \rrbracket _{G}\), is analogously defined as usual following the semantics of SPARQL [27, 28].
Query Evaluation. A SPARQL (SELECT) query is an nSPARQL pattern. Given a RDF graph G, a pattern P, and a mapping \(\mu \), the query evaluation problem of nSPARQL is to decide whether \(\mu \) is in \(\llbracket P \rrbracket _{G}\). The complexity of query evaluation problem is PSpace-complete [27].
2.2 Context-Free Grammar
In this subsection, we recall context-free grammar. For more details, we refer the interested readers to some references about formal languages [18].
A context-free grammar (COG) is a 3-tuple \(\mathcal {G}= (N, A, R)\) Footnote 3 where
-
N is a finite set of variables (called non-terminals);
-
A is a finite set of constants (called terminals);
-
R is a finite set of production rules r of the form \(v \rightarrow S\), where \(v \in N\) and \(S \in (N \cup A)^{*}\) (the asterisk \(*\) represents the Kleene star operation). We write \(v \rightarrow \epsilon \) if \(\epsilon \) is the empty string.
A string over \(N \cup A\) can be written to a new string over \(N \cup A\) by applying production rules. Consider a string avb and a production rule \(r: v \rightarrow avb\), we can obtain a new string aavbb by applying this rule r one time and another new string aaavbbb by applying the rule r twice. Analogously, strings with increasing length can be obtained in this rule.
Let \(S,T \in (N \cup A)^{*}\). We write \((S \mathop {\rightarrow }\limits ^{\mathcal {G}} T)\) if T can be obtained from S by applying production rules of \(\mathcal {G}\) within a finite number of times.
The language of grammar \(\mathcal {G}= (N, A, R)\) w.r.t. start non-terminal \(v \in N\) is defined by \(\mathcal {L}(\mathcal {G}_{v}) = \{S\) a finite string over A \(|v \mathop {\rightarrow }\limits ^{\mathcal {G}} S\}\).
For example, \(\mathcal {G}= (N, A, R)\) where \(N = \{v\}, A = \{a, b\}\), and \(R = \{v \rightarrow ab, v \rightarrow avb\}\). Thus \(\mathcal {L}(\mathcal {G}_{v}) = \{a^nb^n |n \ge 1\}\).
3 Context-Free Path Queries
In this section, we introduce context-free path queries on RDF graphs based on context-free path queries on directed graphs [14] and nested regular expressions [28].
3.1 Context-Free Path Queries and Their Extensions
In this subsection, we firstly define conjunctive context-free path queries on RDF graphs and then present some variants (it also can been seen as extensions).
Conjunctive Context-Free Path Queries. In this paper, we assume that \(N \cap V = \emptyset \) and \(A \subseteq \Sigma \) for all CFG \(\mathcal {G}= (N, A, R)\).
Definition 1
Let \(\mathcal {G}= (N, A, R)\) be a CFG and m a positive integer. A conjunctive context-free path query (CCFPQ) is of the form \(\mathbf {q}(?x,?y)\) Footnote 4, where,
where
-
\(\alpha _i\) is a triple pattern either of the form (?x, ?y, ?z) or of the form v(?x, ?y);
-
\(\{?x, ?y\} \subseteq vars (\mathbf {q})\) where \( vars (\mathbf {q})\) denotes a collection of all variables occurring in the body of \(\mathbf {q}\);
-
\(\{v_1, \ldots , v_m\} \subseteq N\).
We regard the name of query \(\mathbf {q}(?x, ?y)\) as \(\mathbf {q}\) and call the right of Eq. (1) as the body of \(\mathbf {q}\).
Remark 1
In our CCFPQ, we allow a triple pattern of the form (?x, ?y, ?z) to characterize those queries w.r.t. ternary relationships such as nre-triple patterns of nSPARQL [28] to be discussed in Sect. 4. The formula v(?x, ?y) is used to capture context-free path queries [14].
We say a simple conjunctive context-free path query (written by \(CCFPQ^{s}\)) if only the form v(?x, ?y) is allowed in the body of a CCFPQ. We also say a context-free path query (written by CFPQ) if \(m=1\) in the body of a CCFPQ\(^{s}\).
Semantically, let \(\mathcal {G}= (N, A, R)\) be a CFG and G an RDF graph, given a CCFPQ \(\mathbf {q}(?x, ?y)\) of the form (1), \(\llbracket \mathbf {q}(?x, ?y) \rrbracket _{G}\) is defined as follows:
where the semantics of v(?x, ?y) over G is defined as follows:
Intuitively, \(\llbracket v(?x, ?y) \rrbracket _{G}\) returns all pairs connected by a path in G which contains a trace belonging to the language generated from this CFG starting at non-terminal v.
Example 1
Let \(\mathcal {G}= (N, A, R)\) be a CFG where \(N = \{u, v\}, A = \{ next {:}{:}{} locatedIn , next ^{-1}{:}{:}{} locatedIn , next {:}{:}{} linkedTo , next ^{-1}{:}{:}{} linkedTo \}\), and \(P = \{v \rightarrow ( next {:}{:}{} locatedIn )\, u\,( next ^{-1}{:}{:}{} locatedIn ), u \rightarrow ( next ^{-1}{:}{:}{} linkedTo )\, u\,( next {:}{:}{} linkedTo ), u \rightarrow \epsilon \}\). Consider a CFPQ \(\mathbf {q}\) be of the form v(?x, ?y). The query \(\mathbf {q}\) represents the relationship of similarity (between two genes) since \(\mathcal {L}(\mathcal {G}_{v}) = \{{( next ^{-1}{:}{:}{} locatedIn )}^n( next ^{-1}{:}{:}{} linkedTo )( next {:}{:}{} linkedTo ){( next {:}{:}{} locatedIn )}^n |n\ge 1\}\). Consider the RDF graph \(G_{\mathrm {bio}}\) in Fig. 1, \(\llbracket \mathbf {q}(?x, ?y) \rrbracket _{G_{\mathrm {bio}}} = \{(?x =Gene(B) , ?y = Gene(C) )\}\). Clearly, the query \(\mathbf {q}\) returns all pairs with similarity.
Query Evaluation. Let \(\mathcal {G}= (N, A, R)\) be a CFG and G an RDF graph. Given a CCFPQ \(\mathbf {q}(?x, ?y)\) and a tuple \(\mu = (?x = a, ?y = b)\), the query evaluation problem is to decide whether \(\mu \in \llbracket \mathbf {q}(?x, ?y) \rrbracket _{G}\), that is, whether the tuple \(\mu \) is in the result of the query \(\mathbf {q}\) on the RDF graph G. There are two kinds of computational complexity in the query evaluation problem [1, 2]:
-
the data complexity refers to the complexity w.r.t. the size of the RDF graph G, given a fixed query \(\mathbf {q}\); and
-
the combined complexity refers to the complexity w.r.t. the size of query \(\mathbf {q}\) and the RDF graph G.
A CFG \(\mathcal {G}= (N, A, R)\) is said to be in norm form if all of its production rules are of the form \(v \rightarrow uw, v \rightarrow a\), or \(v \rightarrow \epsilon \) where \(v, u, w \in N\) and \(a \in A\). Note that this norm form deviates from the usual Chomsky Normal Form [22] where the start non-terminals are absent. Indeed, every CFG is equivalent to a CFG in norm form, that is, for every CFG \(\mathcal {G}\), there exists some CFG \(\mathcal {G}'\) in norm form constructed from \(\mathcal {G}\) in polynominal time such that \(\mathcal {L}(\mathcal {G}_{v}) = \mathcal {L}(\mathcal {G}'_{v})\) for every \(v \in N\) [14].
Let G be an RDF graph and \(\mathcal {G}= (N, A, R)\) a CFG. Given a non-terminal \(v \in N\), let \(\mathcal {R}_v(G)\) be the context-free relation of G w.r.t. v can be defined as follows:
Conveniently, the query evaluation of CCFPQ over an RDF graph can be reduced into the conjunctive first-order query over the context-free relations. Based on the conjunctive context-free recognizer for graphs presented in [14], we directly obtain a conjunctive context-free recognizer (see Algorithm 1) for RDF graphs by adding a convertor to transform an RDF graph into an edge-labeled directed graph (see Algorithm 2).
Given a path \(\pi \) and a context-free grammar \(\mathcal {G}\), Algorithm 1 is sound and complete to decide whether the path \(\pi \) in RDF graphs has a trace generated from the grammar \(\mathcal {G}\).
Proposition 1
Let G be an RDF graph and \(\mathcal {G}= (N, A, R)\) a CFG in norm form. For every \(v \in N\), let \(\varTheta \) be the result computed in Algorithm 1, \((v, a, b) \in \varTheta \) if and only if \((a, b) \in \mathcal {R}_v(G)\).
Moreover, we can easily observe the worst-case complexity of Algorithm 1 since the complexity of Algorithm 2 is \(\mathcal {O}(|G|)\).
Proposition 2
Let G be an RDF graph and \(\mathcal {G}= (N, A, R)\) a CFG. Algorithm 1 applied to G and \(\mathcal {G}\) has a worst-case complexity of \(\mathcal {O}((|N||G|)^3)\).
As a result, we can conclude the following proposition.
Proposition 3
The followings hold:
-
1.
The query evaluation of CCFPQ has polynomial data complexity;
-
2.
The query evaluation of CCFPQ has NP-complete combined complexity.
Union of CCFPQ. An extension of CCFPQ capturing more expressive power such as disjunctive capability is introducing the union of CCFPQ. For instance, given a gene (e.g., Gene(B)) in the biomedical ontology (see Fig. 1), we wonder to find those genes which are relevant to this gene, that is, those genes either are similar to it (e.g., Gene(C)) or belong to the same pathway (e.g., Gene(S)).
A union of conjunctive context-free path query (UCCFPQ) is of the form
where \(\mathbf {q}_i(?x, ?y)\) is a CCFPQ for all \(i = 1, \ldots , m\).
Analogously, we can define union of simple conjunctive context-free path query written by \(UCCFPQ^{s}\).
Semantically, let G be an RDF graph, we define
where \(\llbracket \mathbf {q}_i(?x, ?y) \rrbracket _{G}\) is defined as the semantics of CCFPQ for all \(i = 1, \ldots , m\).
In Example 1, based on \(\mathcal {G}= (N, A, R)\), we construct a CFG \(\mathcal {G}' = (N', A', R')\) where \(N' = N \cup \{s\}, A = A\cup \{ next {:}{:}{} belongsTo , next ^{-1}{:}{:}{} belongsTo \}\), and \(R' = R\cup \{s \rightarrow ( next {:}{:}{} belongsTo )s( next ^{-1}{:}{:}{} belongsTo )\}\). Consider a UCCFPQ \(\mathbf {q}(?x, ?y)\, {:}{=}\, v(?x, ?y)\, \vee \, s(?x,?y), \llbracket \mathbf {q}(?x, ?y) \rrbracket _{G_{\mathrm {bio}}} = \{(?x =Gene(B) , ?y = Gene(C) ), (?x =Gene(B) , ?y =Gene(S) )\}\). That is, \(\llbracket \mathbf {q}(?x, ?y) \rrbracket _{G_{\mathrm {bio}}}\) returns all pairs where the first gene is relevant to the latter.
Note that the query evaluation of UCCFPQ has the same complexity as that of the evaluating of CCFPQ since we can simply evaluate a number (linear in the size of a UCCFPQ) of CCFPQs in isolation [2].
4 Expressivity of (U)(C)CFPQ
In this section, we investigate the expressivity of (U)(C)CFPQ by referring to nested regular expressions [28] and fragments of nre.
We discuss the relations between variants of UCCFPQ and variants of (nested) regular expressions and obtain the following results:
-
1.
nre\(_{0}\)-triples can be expressed in CFPQ;
-
2.
nre\(_{0}(\mathrm {N})\)-triples can be expressed in CCFPQ;
-
3.
nre\(_{0}(|)\)-triples can be expressed in UCCFPQ\(^{s}\);
-
4.
nre-triples can be expressed in UCCFPQ.
1. nre \(_{0}\) in CFPQ. The following proposition shows that CFPQ can express nre\(_{0}\)-triples.
Proposition 4
For every nre\(_{0}\)-triple (?x, e, ?y), there exist some CFG \(\mathcal {G}=(N, A, R)\) and some CFPQ \(\mathbf {q}(?x, ?y)\) such that for every RDF graph G, we have \(\llbracket (?x, e, ?y) \rrbracket _{G} =\llbracket \mathbf {q}(?x, ?y) \rrbracket _{G}\).
2. nre \(_{0}(\mathrm {N})\) in CCFPQ. Let \(\mathcal {G}\) be a CFG. A CCFPQ \(\mathbf {q}(?x, ?y)\) is in nested norm form if the following holds:
where
-
\(\{?x, ?y\} \cap \{?x', ?y', ?z'\} \ne \emptyset \);
-
\(\{?x', ?y', ?z'\} \cap \{?u, ?w\} \ne \emptyset \);
-
\(\mathbf {q}_1(?u, ?w)\) is a CCFPQ.
Note that \((?x', ?y', ?z')\) is used to express a nested nre of the form \( axis {:}{:}[e]\) and v(?x, ?y) is necessary to express a nested nre of the form \( self {:}{:}[e]\).
The following proposition shows that CCFPQ can express nre\(_{0}(\mathrm {N}\))-triples.
Proposition 5
For every nre\(_{0}(\mathrm {N}\))-triple (?x, e, ?y), there exist a CFG \(\mathcal {G}=(N, A, R)\) and a CCFPQ \(\mathbf {q}(?x, ?y)\) in nested norm form (6) such that for every RDF graph G, we have \(\llbracket (?x, e, ?y) \rrbracket _{G} =\llbracket \mathbf {q}(?x, ?y) \rrbracket _{G}\).
3. nre \(_{0}(|)\) in UCCFPQ \(^{s}\) . Let e be an nre. We say e is in union norm form if e is of the following form \(e_1 | e_2 | \ldots | e_m\) where \(e_i\) is an nre\(_{0}(\mathrm {N}\)) for all \(i=1,\ldots , m\).
We can conclude that each nre-triple is equivalent to an nre in union norm form.
Proposition 6
For every nre-triple (?x, e, ?y), there exists some \(e'\) in union norm form such that \(\llbracket (?x, e, ?y) \rrbracket _{G} = \llbracket (?x, e', ?y) \rrbracket _{G}\) for every RDF graph G.
The following proposition shows that UCCFPQ\(^{s}\) can express nre\(_{0}(|)\).
Proposition 7
For every nre\(_{0}(|)\)-triple (?x, e, ?y), there exists some CFG \(\mathcal {G}=(N, A, R)\) and some UCCFPQ\(^{s}\) \(\mathbf {q}(?x, ?y)\) in nested norm form such that for every RDF graph G, we have \(\llbracket (?x, e, ?y) \rrbracket _{G} =\llbracket \mathbf {q}(?x,?y) \rrbracket _{G}\).
4. nre in UCCFPQ. By Propositions 5 and 7, we can conclude that
Proposition 8
For every nre-triple (?x, e, ?y), there exists some CFG \(\mathcal {G}=(N, A, R)\) and some UCCFPQ \(\mathbf {q}(?x, ?y)\) in nested norm form such that for every RDF graph G, we have \(\llbracket (?x, e, ?y) \rrbracket _{G} =\llbracket \mathbf {q}(?x,?y) \rrbracket _{G}\).
However, those results above in this subsection are not vice versa since the context-free language is not expressible in any nre.
Proposition 9
CFPQ is not expressible in any nre.
5 Context-Free SPARQL
In this section, we introduce an extension language context-free SPARQL (for short, cfSPARQL) of SPARQL by using context-free triple patterns, plus SPARQL basic operators \(\mathbin {\mathrm {UNION}}, \mathbin {\mathrm {AND}}, \mathbin {\mathrm {OPT}}, \mathbin {\mathrm {FILTER}}\), and \(\mathop {\mathrm {SELECT}}\) and its expressiveness.
A context-free triple pattern (cftp) is of the form \((?x, \mathbf {q}, ?y)\) where \(\mathbf {q}(?x, ?y)\) is a CFPQ. Analogously, we can define union of conjunctive context-free triple pattern (for short, uccftp) by using UCCFPQ.
cfSPARQL and Query Evaluation. Formally, cfSPARQL (graph) patterns are then recursively constructed from context-free triple patterns:
-
A cftp is a cfSPARQL pattern;
-
A triple pattern of the form (?x, ?y, ?z) is a cfSPARQL pattern;
-
All \(P_1\, \mathbin {\mathrm {UNION}}\, P_2, P_1\, \mathbin {\mathrm {AND}}\, P_2\), and \(P_1\, \mathbin {\mathrm {OPT}}\, P_2\) are cfSPARQL patterns if \(P_1, P_2\) are cfSPARQL patterns;
-
\(P\, \mathbin {\mathrm {FILTER}}\, C\) if P is a cfSPARQL pattern and C is a contraint;
-
\(\mathop {\mathrm {SELECT}}_{S}(P)\) if P is a cfSPARQL pattern and S is a set of variables.
Remark 2
In cfSPARQL, we allow triple patterns of form (?x, ?y, ?z) (see Item 2), which can express any SPARQL triple pattern together with \(\mathbin {\mathrm {FILTER}}\) [38], to ensure that SPARQL is still expressible in cfSPARQL while SPARQL is not expressible in nSPARQL since any triple pattern (?x, ?y, ?z) is not expressible in nSPARQL [28]. Our generalization of nSPARQL inherits the power of queries without more cost and maintains the coherence between CFPQ and “nested” nre of the form \( axis {:}{:}[e]\). Moreover, this extension in cfSPARQL coincides with our proposed CCFPQ where triple patterns of the form (?x, ?y, ?z) are allowed.
Semantically, let P be a cfSPARQL pattern and G an RDF graph, \(\llbracket (?x, \mathbf {q}, ?y) \rrbracket _{G}\) is defined as \(\llbracket \mathbf {q}(?x, ?y) \rrbracket _{G}\) and other expressive cfSPARQL patterns are defined as normal [27, 28].
Proposition 10
SPARQL is expressible in cfSPARQL but not vice versa.
A cfSPARQL query is a pattern.
We can define union of conjunctive context-free SPARQL query (for short, uccfSPARQL) by using uccftp in the analogous way.
At the end of this subsection, we discuss the complexity of evaluation problem of uccfSPARQL queries.
For a given RDF graph G, a uccftp P, and a mapping \(\mu \), the query evaluation problem is to decide whether \(\mu \) is in \(\llbracket P \rrbracket _{G}\).
Proposition 11
The evaluation problem of uccfSPARQL queries has the same complexity as the evaluation problem of SPARQL queries.
As a direct result of Proposition 8, we can conclude
Corollary 1
nSPARQL is expressible in uccfSPARQL but not vice versa.
On the Expressiveness of cfSPARQL. In this subsection, we show that cfSPARQL has the same expressiveness as uccfSPARQL. In other words, cfSPARQL is enough to express UCCFPQ on RDF graphs.
Since every cfSPARQL pattern is a uccfSPARQL pattern, we merely show that uccfSPARQL is expressible in cfSPARQL.
Proposition 12
For every uccfSPARQL pattern P, there exists some cfSPARQL pattern Q such that \(\llbracket P \rrbracket _{G} = \llbracket Q \rrbracket _{G}\) for any RDF graph G.
6 Relations on (Nested) Regular Expressions with Negation
In this section, we discuss both the relation between UCCFPQ and nested regular expressions with negation and the relation between cfSPARQL and variants of nSPARQL.
Nested Regular Expressions with Negation. A nested regular expression with negation (nre \(^{\lnot }\)) is an extension of nre by adding two new operators “difference (\(e_1 - e_2\))” and “negation (\(e^{c}\))” [37].
Semantically, let \(e, e_1, e_2\) be three nre\(^{\lnot }\)s and G an RDF graph,
-
\(\llbracket e_{1} - e_{2} \rrbracket _{G} = \{(a, b) \in \llbracket e_{1} \rrbracket _{G} |(a, b) \not \in \llbracket e_{2} \rrbracket _{G}\}\);
-
\(\llbracket e^{c} \rrbracket _{G} = \{(a, b) \in adom (G) \times adom (G) |(a, b) \not \in \llbracket e \rrbracket _{G}\}\).
Analogously, an nre\(^{\lnot }\)-triple pattern is of the form (?x, e, ?y) where e is an nre\(^{\lnot }\). Clearly, nre\(^{\lnot }\)-triple pattern is non-monotone.
Since nre is monotone, nre is strictly subsumed in nre\(^{\lnot }\) [37]. Though property paths in SPARQL 1.1 [29, 33] are not expressible in nre since property paths allow the negation of IRIs, property paths can be still expressible in the following subfragment of nre\(^{\lnot }\): let \(c, c_1, \ldots , c_{n+m} \in U\),
Note that \(e^{+}\) can be expressible as the expression \(e^{*} - self \).
Proposition 13
uccftp is not expressible in any nre\(^{\lnot }\)-triple pattern.
Due to the non-monotonicity of nre\(^{\lnot }\), we have that nre\(^{\lnot }\) is beyond the expressiveness of any union of conjunctive context-free triple patterns even the star-free nre\(^{\lnot }\) (for short, sf-nre\(^{\lnot }\)) where the Kleene star (\(*\)) is not allowed in nre\(^{\lnot }\).
Proposition 14
sf-nre\(^{\lnot }\)-triple pattern is not expressible in any uccftp.
In short, nre\(^{\lnot }\)-triple pattern and uccftp cannot express each other. Indeed, negation could make the evaluation problem hard even allowing a limited form of negation such as property paths [23].
cfSPARQL Can Express nSPARQL \(^{\lnot }\) . Following nSPARQL, we can analogously construct the language nSPARQL\(^{\lnot }\) which is built on nre\(^{\lnot }\), by adding SPARQL operators \(\mathbin {\mathrm {UNION}}, \mathbin {\mathrm {AND}}, \mathbin {\mathrm {OPT}}, \mathbin {\mathrm {FILTER}}\), and \(\mathop {\mathrm {SELECT}}\).
Though uccftps cannot express nre\(^{\lnot }\)-triple patterns by Proposition 13, cfSPARQL can express nSPARQL\(^{\lnot }\) since nSPARQL\(^{\lnot }\) is still expressible in nSPARQL [37].
Corollary 2
nSPARQL\(^{\lnot }\) is expressible in cfSPARQL.
6.1 Overview
Finally, Figs. 2 and 3 provide the implication of the results on RDF graphs for the general relations between variants of CFPQ and nre and the general relations between cfSPARQL and nSPARQL where \(\mathcal {L}_1 \rightarrow \mathcal {L}_2\) denotes that \(\mathcal {L}_1\) is expressible in \(\mathcal {L}_2\) and \(\mathcal {L}_1 \leftrightarrow \mathcal {L}_2\) denotes that \(\mathcal {L}_1 \rightarrow \mathcal {L}_2\) and \(\mathcal {L}_2 \rightarrow \mathcal {L}_1\). Analogously, nSPARQL\(^{\mathrm {sf}}\) is an extension of SPARQL by allowing star-free nre\(^\lnot \)-triple patterns.
7 Implementation and Evaluation
In this section, we have implemented the two algorithms for CFPQs without any optimization. Two context-free path queries over RDF graphs were evaluated and we found some results which cannot be captured by any regular expression-based path queries from RDF graphs.
The experiments were performed under Windows 7 on a Intel i5-760, 2.80GHz CPU system with 6GB memory. The program was written in Java 7 with maximum 2GB heap space allocated for JVM. Ten popular ontologies like foaf, wine, and pizza were used for testing.
Query 1. Consider a CFG \(\mathcal {G}_1 = (N, A, R)\) where \(N = \{S\}, A = \{ next ^{-1}{:}{:}{} \textit{subClassOf}, next {:}{:}{} \textit{subClassOf}, next ^{-1}{:}{:}{} \textit{type}, next {:}{:}{} \textit{type}\}\), and \(R = \{S \rightarrow ( next ^{-1}{:}{:}{} \textit{subClassOf})\,S\,( next {:}{:}{} \textit{subClassOf}),S \rightarrow ( next ^{-1}{:}{:}{} \textit{type})\,S\,( next {:}{:}{} \textit{type}),S \rightarrow \varepsilon \}\). The query \(Q_1\) based on the grammar \(\mathcal {G}_1\) can return all pairs of concepts or individuals at the same layer of the hierarchy of RDF graphs. Table 1 shows the experimental results of \(Q_1\) over the testing ontologies. Note that \(\#results\) denotes that number of pairs of concepts or individuals corresponding to \(Q_1\).
Taking the ontology foaf, for example, the query \(Q_1\) over foaf returns pairs of concepts like \((\textit{foaf:Document}, \textit{foaf:Person})\), which shows that the two concepts, Document and Person, are at the same layer of the hierarchy of foaf, where the top concept (\(\textit{owl:Thing}\)) is at the first layer.
Query 2. Similarly, consider a CFG \(\mathcal {G}_2 = (N, A, R)\) where \(N = \{S,B\}, A = \{ next ^{-1}{:}{:}{} \textit{subClassOf}, next {:}{:}{} \textit{subClassOf}\}\), and \(R = \{S \rightarrow BS,B \rightarrow ( next {:}{:}{} \textit{subClassOf})\,B\,( next ^{-1}{:}{:}{} \textit{subClassOf}),B \rightarrow B( next ^{-1}{:}{:}{} subClassOf )\,B \rightarrow ( next {:}{:}{} \textit{subClassOf})( next ^{-1}{:}{:}{} \textit{subClassOf}),S \rightarrow \varepsilon \}\). The query \(Q_2\) based on the grammar \(\mathcal {G}_2\) can return all pairs of concepts which are at adjacent two layers of the hierarchy of RDF graphs. We also take the ontology foaf, for example, the query \(Q_2\) over foaf returns pairs of concepts like (\(\textit{foaf:Person}, \textit{foaf:PersonalProfileDocument}\)), which denotes that Person is at higher layer than PersonalProfileDocument, since PersonalProfileDocument is a subclass of Document. Table 1 shows the experimental results of \(Q_2\) over the testing ontologies.
8 Conclusions
In this paper, we have proposed context-free path queries (including some variants) to navigate through an RDF graph and the context-free SPARQL query language for RDF built on context-free path queries by adding the standard SPARQL operators. Some investigation about some fundamental properties of those context-free path queries and their context-free SPARQL query languages has been presented. We proved that CFPQ, CCFPQ, UCCFPQ\(^{s}\), and UCCFPQ strictly express basic nested regular expression (nre\(_{0}\)), nre\(_{0}(\mathrm {N})\), nre\(_{0}(|)\), and nre, respectively. Moreover, uccfSPARQL has the same expressiveness as cfSPARQL; and both SPARQL and nSPARQL are expressible in cfSPARQL. Furthermore, we looked at the relationship between context-free path queries and nested regular expressions with negation (which can express property paths in SPARQL1.1) and the relationship between cfSPARQL queries and nSPARQL queries with negation (nSPARQL\(^{\lnot }\)). We found that neither CFPQ nor nre\(^{\lnot }\) can express each other while nSPARQL\(^{\lnot }\) is still expressible in cfSPARQL. Finally, we discussed the query evaluation problem of CFPQ and cfSPARQL on RDF graphs. The query evaluation of UCCFPQ maintains the polynomial time data complexity and NP-complete combined complexity the same as conjunctive first-order queries and the query evaluation of cfSPARQL maintains the complexity as the same as SPARQL. These results provide a starting point for further research on expressiveness of navigational languages for RDF graphs and the relationships among regular path queries, nested regular path queries, and context-free path queries on RDF graphs.
There are a number of practical open problems. In this paper, we restrict that RDF data does not contain blank nodes as the same treatment in nSPARQL. We have to admit that blank nodes do make RDF data more expressive since a blank node in RDF is taken as an existentially quantified variable [17]. An interesting future work is to extend our proposed (U)(C)CFPQ for general RDF data with blank nodes by allowing path variables which are already valid in some extensions of SPARQL such as SPARQ2L [7], SPARQLeR [19], PSPARQL [5], and CPSPARQL [3, 4], which are popular in querying general RDF data with blank nodes.
Notes
- 1.
A standard RDF data is composed of IRIs, blank nodes, and literals. For the purposes of this paper, the distinction between IRIs and literals will not be important.
- 2.
In nSPARQL [28], nre-triples allow a general form (v, e, u) where \(u, v \in U \cup V\). In this paper, we mainly consider the case \(u, v\in V\) to simplify our discussion.
- 3.
We deviate from the usual definition of context-free grammar by not including a special start non-terminal following [14].
- 4.
In this paper, we simply write a conjunctive query as a Datalog rule.
References
Abiteboul, S., Buneman, P., Suciu, D.: Data on the Web: From Relations to Semistructured Data and XML. Morgan Kaufmann, Burlington (2000)
Abiteboul, S., Hull, R., Vianu, V.: Foundations of Databases. Addison-Wesley, Boston (1995)
Alkhateeb, F., Baget, J.-F., Euzenat, J.: Constrained regular expressions in SPARQL. In: Proceedings of SWWS 2008, pp. 91–99 (2008)
Alkhateeb, F., Euzenat, J.: Constrained regular expressions for answering RDF-path queries modulo RDFS. Int. J. Web Inf. Syst. 10(1), 24–50 (2014)
Alkhateeb, F., Baget, J.-F., Euzenat, J.: Extending SPARQL with regular expression patterns (for querying RDF). J. Web Semant. 7(2), 57–73 (2009)
Angles, R., Gutierrez, C.: Survey of graph database models. ACM Comput. Surv. 40(1), 1 (2008)
Anyanwu, K., Maduko, A., Sheth, A.P.: SPARQ2L: towards support for subgraph extraction queries in RDF databases. In: Proceedings of WWW 2007, pp. 797–806 (2007)
Arenas, M., Pérez, J., Gutierrez, C.: On the semantics of SPARQL. In: De Virgilio, R., Giunchiglia, F., Tanca, L. (eds.) Semantic Web Information Management - A Model-Based Perspective, pp. 281–307. Springer, Berlin (2009)
Barceló, P.: Querying graph databases. In: Proceedings of PODS 2013, pp. 175–188 (2013)
Bischof, S., Martin, C., Polleres, A., Schneider, P.: Collecting, integrating, enriching and republishing open city data as linked data. In: Arenas, M., et al. (eds.) ISWC 2015. LNCS, vol. 9367, pp. 57–75. Springer, Heidelberg (2015). doi:10.1007/978-3-319-25010-6_4
Fionda, V., Pirrò, G., Consens, M.P.: Extended property paths: writing more SPARQL queries in a succinct way. In: Proceedings of AAAI 2015, pp. 102–108 (2015)
Fletcher, G.H.L., Gyssens, M., Leinders, D., Surinx, D., den Bussche, J.V., Gucht, D.V., Vansummeren, S., Wu, Y.: Relative expressive power of navigational querying on graphs. Inf. Sci. 298, 390–406 (2015)
Hayes, J., Gutierrez, C.: Bipartite graphs as intermediate model for RDF. In: McIlraith, S.A., Plexousakis, D., van Harmelen, F. (eds.) ISWC 2004. LNCS, vol. 3298, pp. 47–61. Springer, Heidelberg (2004)
Hellings, J.: Conjunctive context-free path queries. In: Proceedings of ICDT 2014, pp. 119–130 (2014)
Hellings, J., Fletcher, G.H.L., Haverkort, H.J.: Efficient external-memory bisimulation on DAGs. In: Proceedings of SIGMOD 2012, pp. 553–564 (2012)
Hellings, J., Kuijpers, B., Van den Bussche, J., Zhang, X.: Walk logic as a framework for path query languages on graph databases. In: Proceedings of ICDT 2013, pp. 117–128 (2013)
Hogan, A., Arenas, M., Mallea, A., Polleres, A.: Everything you always wanted to know about blank nodes. J. Web Semant. 27, 42–69 (2014)
Hopcroft, J., Ullman, J.: Introduction to Automata Theory, Languages, and Computation. Addison-Wesley, Boston (1979)
Kochut, K.J., Janik, M.: SPARQLeR: extended SPARQL for semantic association discovery. In: Franconi, E., Kifer, M., May, W. (eds.) ESWC 2007. LNCS, vol. 4519, pp. 145–159. Springer, Heidelberg (2007)
Lange, M.: Model checking propositional dynamic logic with all extras. J. Appl. Log. 4(1), 39–49 (2006)
Libkin, L., Reutter, J.L., Vrgoc, D.: Trial for RDF: adapting graph query languages for RDF data. In Proceedings of PODS 2013, pp. 201–212 (2013)
Linz, P.: An Introduction to Formal Languages and Automata, 5th edn. Jones & Bartlett Publishers, Burlington (2012)
Losemann, K., Martens, W.: The complexity of regular expressions and property paths in SPARQL. ACM Trans. Database Syst. 38(4), 24 (2013)
Reutter, J.L., Romero, M., Vardi, M.Y.: Regular queries on graph databases. In: Proceedings of ICDT 2015, pp. 177–194 (2015)
Marx, M., de Rijke, M.: Semantic characterizations of navigational XPath. SIGMOD Rec. 34(2), 41–46 (2005)
Olson, M., Ogbuij, U.: The Versa Specification, October 2001
Pérez, J., Arenas, M., Gutierrez, C.: Semantics and complexity of SPARQL. ACM Trans. Database Syst. 34(3), 16 (2009)
Pérez, J., Arenas, M., Gutierrez, C.: nSPARQL: a navigational language for RDF. J. Web Semant. 8(4), 255–270 (2010)
Polleres, A., Wallner, J.P.: On the relation between SPARQL1.1 and answer set programming. J. Appl. Non-Class. Log. 23(1–2), 159–212 (2013)
RDF primer. W3C Recommendation, Febraury 2004
Sevon, P., Eronen, L.: Subgraph queries by context-free grammars. J. Integr. Bioinform. 5(2), 100 (2008)
SPARQL query language for RDF. W3C Recommendation, January 2008
SPARQL 1.1 query language. W3C Recommendation, March 2013
Tian, Y., Hankins, R.A., Patel, J.M.: Efficient aggregation for graph summarization. In: Proceedings of SIGMOD 2008, pp. 567–580 (2008)
Kostylev, E.V., Reutter, J.L., Romero, M., Vrgoč, D.: SPARQL with property paths. In: Arenas, M., et al. (eds.) ISWC 2015. LNCS, vol. 9366, pp. 3–18. Springer, Berlin (2015)
Zhang, X., Feng, Z., Wang, X., Rao, G., Wu, W.: Context-free path queries on RDF graphs. Revised version (2016). arXiv:1506.00743
Zhang, X., den Bussche, J.V.: On the power of SPARQL in expressing navigational queries. Comput. J. 58(11), 2841–2851 (2015)
Zhang, X., den Bussche, J.V., Picalausa, F.: On the satisfiability problem for SPARQL patterns. J. Artif. Intel. Res. 56, 403–428 (2016)
Acknowledgments
The authors thank Jelle Hellings and Guohui Xiao for their helpful and constructive comments. This work is supported by the program of the National Key Research and Development Program of China (2016YFB1000603) and the National Natural Science Foundation of China (NSFC) (61502336, 61373035). Xiaowang Zhang is supported by Tianjin Thousand Young Talents Program.
Author information
Authors and Affiliations
Corresponding author
Editor information
Editors and Affiliations
Rights and permissions
Copyright information
© 2016 Springer International Publishing AG
About this paper
Cite this paper
Zhang, X., Feng, Z., Wang, X., Rao, G., Wu, W. (2016). Context-Free Path Queries on RDF Graphs. In: Groth, P., et al. The Semantic Web – ISWC 2016. ISWC 2016. Lecture Notes in Computer Science(), vol 9981. Springer, Cham. https://doi.org/10.1007/978-3-319-46523-4_38
Download citation
DOI: https://doi.org/10.1007/978-3-319-46523-4_38
Published:
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-46522-7
Online ISBN: 978-3-319-46523-4
eBook Packages: Computer ScienceComputer Science (R0)