1 Introduction

With the increasing popularity of knowledge graphs, Resource Description Framework (RDF) has been widely recognized as a flexible graph-like data model to represent large-scale knowledge bases. It has become essential to realize efficient and scalable query processing for big RDF graphs in various domains, such as social networking [23] and bioinformatics [14, 27], stored in distributed clusters. As one of the fundamental operations for querying graph data [6], regular path queries (RPQs) can explore RDF graphs in a navigational manner, which is an indispensable building block in most graph query languages. The latest version of the standard query language of RDF, SPARQL 1.1 [13], has provided the property path [16] feature which is actually an implementation of RPQ semantics. In particular, answering an RPQ Q = (x,r,y) over an RDF graph T is to find a set of pairs of resources (v0,vn) such that there exists a path ρ in T from v0 to vn, where the label of ρ, denoted by λ(ρ), satisfies the regular expression r in Q.

However, from the above standard semantics of RPQs, we cannot tell what such a path ρ from v0 to vn looks like. To provide the provenance of why a pair of resources in an RDF graph satisfies Q, we focus on the provenance-aware semantics of RPQs, which returns a subgraph of the RDF graph consisting of all the “witness triples”. For example, Figure 1a depicts an RDF graph T1 excerpted from DBpedia [17], which shows predecessor and father relationships among seven British monarchs [25]. The RPQ Q1 = (x,(predecessor|father)+,y) asks for pairs of monarchs (v0,vn) such that v0 can navigate to vn via one or more predecessor or father edges. The answers to Q1 under the standard semantics are shown in Figure 1b. In contrast, the provenance-aware answer to Q1 is a subgraph that contains all the paths whose labels satisfy Q1. In this example, the subgraph (i.e., answer) is exactly T1, which can efficiently encode the conventional answers to Q1 in Figure 1b.

Figure 1 An example RDF graph T1 and answers to RPQ Q1

Currently, there have been some research works on RPQs over RDF graphs under both standard and provenance-aware semantics. To answer RPQs under the standard semantics, some approaches leverage views [9] or other auxiliary structures, such as “rare labels” [15]. The RPQ evaluation system Vertigo [20] is implemented based on Brzozowski’s derivatives using the Giraph parallel framework [2]. Wang et al. [26] employ partial evaluation to obtain partial answers to RPQs in parallel and assemble the partial answers using an automata-based algorithm. However, the above methods may produce large intermediate results and suffer from performance bottlenecks when evaluating RPQs on large-scale RDF graphs. Although Dey et al. [11] were the first to investigate provenance-aware RPQs, they translate RPQs into standard Datalog queries, which hardly scales to large RDF graph data. Another representative work [24] evaluates provenance-aware RPQs based on product automata, which may incur a costly construction of the product automata and excessive communication when handling large-scale RDF graphs.

A variety of parallel models and systems have been developed in recent years, such as Neo4j, Trinity, and BSP. (1) Neo4j is a graph database optimized for graph traversal, but it performs badly in a large distributed environment; (2) Trinity is a distributed system based on hypergraphs, and does not support regular path queries; (3) BSP models parallel computations in supersteps to synchronize communication among workers. Pregel [19] implements BSP with vertex-centric programming, where a superstep executes a user-defined function at each vertex in parallel. This vertex-centric approach works well with a series of iterative programs on graph data. Further, Pregel is widely implemented in mainstream big data platforms, such as Spark and Giraph, which mostly rely on shared-nothing architectures. In summary, Pregel is a state-of-the-art distributed graph computing framework.

Pregel is a computational model suitable for programs that can be expressed as a sequence of iterations, in each of which a vertex can receive messages sent in the previous iteration, send messages to other vertices, and modify its own state and that of its outgoing edges or mutate the graph topology. Since our proposed method for answering provenance-aware RPQs needs to traverse the graph, we can reasonably implement graph processing algorithms in a sequence of superstep iterations using Pregel.

To this end, in this paper, we propose a distributed Pregel-based parallel approach DP2RPQ to answering provenance-aware RPQs using Glushkov automata, which consists of a series of supersteps. The query processing starts with the vertices in an RDF graph to match against the states in the corresponding automaton of the RPQ; in each superstep, one hop of edges in the paths of the RDF graph are matched forward to obtain the intermediate partial answer to the provenance-aware RPQ.

In addition, we design several optimization strategies in three aspects: (1) to reduce vertex-computation cost, we design edge-filtering and candidate-states techniques, which filter out those edges whose labels do not occur in r and avoid traversals via outgoing edges of the vertex, respectively; (2) to reduce message-communication cost, pruning-sending-messages and variable-length-byte encoding techniques are proposed to reduce intermediate results and communication overhead significantly; (3) to address the counting-paths problem [1], we further propose another two techniques, which combine multiple equivalent messages into a single message and compress the messages that are sent via different outgoing edges. Although our method is devised for answering provenance-aware RPQs over RDF graphs, it can be well adapted to RPQs under the standard semantics. Actually, the answer to a provenance-aware RPQ can be regarded as a subgraph of the RDF graph. In contrast, an RPQ under the standard semantics finds a set of pairs of vertices, and these vertices are completely included in the subgraph of provenance-aware answers. Further, our method can be easily extended to handle RPQs over general (un)directed labeled graphs.

Our main contributions include: (1) we propose an automata-based distributed algorithm, called DP2RPQ, for RPQs under the provenance-aware semantics using the Pregel graph parallel computing framework; (2) several optimization strategies in three aspects are presented to reduce the overhead of the basic DP2RPQ algorithm and alleviate the counting-paths problem; and (3) extensive experiments are conducted to verify the efficiency and scalability of the proposed method on both synthetic and real-world datasets.

The rest of this paper is organized as follows. Section 2 reviews related work. In Section 3, we introduce preliminary definitions of RPQs. In Section 4, we describe in detail the DP2RPQ algorithm for answering provenance-aware RPQs. We then present the optimization techniques in Section 5. Section 6 shows experiment results, and we conclude in Section 7.

2 Related work

Most of the existing approaches aim to evaluate RPQs under the standard semantics, but relatively fewer works focus on RPQs under the provenance-aware semantics. Currently, we are not aware of any distributed Pregel-based approach to evaluating provenance-aware RPQs. We classify the existing approaches into the following two categories.

2.1 Standalone RPQs evaluations on a single machine

Standard semantics of RPQs

The approach proposed in [9] answers RPQs using views, which can be interpreted as checking whether a pair of nodes is one of the answers. The view-based approach for RPQs has been extensively investigated, while the types of data and queries that this approach can handle are restricted under certain assumptions. Koschmieder et al. [15] propose a rare-labels-based approach that decomposes RPQs into a series of smaller RPQs. The rare labels denote the elements in RPQs that have few matches by utilizing the labels and their frequencies in data graph. However, the performance of the method highly depends on a specific query decomposition and selectivity of rare labels.

Provenance-aware semantics of RPQs

Dey et al. [11] first translate the provenance-aware RPQs into standard Datalog queries or SQL queries, in which auxiliary predicates are introduced to evaluate queries represented by Datalog. In this work, two evaluators for RPQs and provenance-aware RPQs are both implemented on the relational DBMS. However, from the experimental results, we can observe that the approach is hardly scalable for large-scale RDF graphs.

2.2 Distributed RPQs evaluations in parallel

Standard semantics of RPQs

A distributed algorithm for evaluating RPQs on large-scale RDF graphs is proposed in [26], which is the first work to investigate RPQs using partial evaluation. It employs a dynamic programming method to compute partial answers in parallel, which are then assembled to obtain the final results using an automata-based algorithm. Nevertheless, no experiments on real-world datasets are shown in the paper. Sartiani et al. [20] exploit Brzozowski’s derivatives [8] of regular expressions to evaluate RPQs in a vertex-centric and message-passing-based manner, which is implemented on top of the Giraph framework. However, the approach is only evaluated on Erdős-Rényi models and power-law graphs, lacking experiments on synthetic and real-world RDF graphs to verify the algorithm. A system for processing GXPath queries on large data graphs is proposed in [21], in which GXPath [18] is the most powerful extension of RPQ. In this system, built on top of Hadoop MapReduce, a query is compiled into an acyclic graph of MapReduce jobs, similar in spirit to a database query plan. However, each graph must be indexed before becoming available for querying.

Provenance-aware semantics of RPQs

Wang et al. [24] propose an automata-based approach, which employs product automata for evaluating RPQs under the provenance-aware semantics in parallel. The product automaton is constructed using two NFA converted from the regular expression of an RPQ and the RDF data graph, respectively. Then the answer paths are extracted by running the product automaton recursively. Nevertheless, the product automata construction in this method may incur high overhead and excessive communication cost when dealing with large-scale RDF graphs. RDFPath [22] implements an expressive RDF path query language using the MapReduce framework. A query in RDFPath is translated as a sequence of location steps, in which the predicate in query is specified by the next adjacent edge attribute and separated by “>”. Since the location steps in RDFPath are deterministic predicates, queries are executed only on triples in graphs associated with these predicates. The execution plan of query is generated by dividing these location steps, which corresponds to a join between an intermediate set of paths and the corresponding RDF graph partition. However, it cannot implement the complete expressiveness of regular path queries, especially the Kleene closure operation. A new query language on the graph is presented in [3, 4], called G-Path, which focuses on complex path pattern query processing on a very large graph. Further, a system called Para-G [5] is introduced to process G-Path queries, which is based on a BSP-like model as well as MapReduce model [10], and can effectively handle distributed graph data operations and queries. Nevertheless, the experimental results are only evaluated on synthetic datasets, lacking experiments on real-world datasets.

Unlike the above previous works, we propose a Pregel-based algorithm for evaluating provenance-aware RPQs on big RDF graphs. To the best of our knowledge, it is the first work to implement an efficient and scalable evaluation of provenance-aware RPQs using the Pregel parallel graph computing model.

3 Preliminaries

We start by formally defining the background knowledge. This section also serves to establish the notation we will use throughout the rest of the paper.

Definition 1

RDF graph. Let U and L be the disjoint infinite sets of URIs and literals, respectively. A tuple (s,p,o) ∈ U × U × (U ∪ L) is called an RDF triple, where s is the subject, p is the predicate (a.k.a. property), and o is the object. A finite set of RDF triples is called an RDF graph.

An RDF graph T can also be written as T = (V,E,Σ), where V, E, and Σ denote the sets of vertices, edges, and edge labels in T, respectively. Formally, V = {s∣(s,p,o) ∈ T}∪{o∣(s,p,o) ∈ T}, E = {(s,o)∣(s,p,o) ∈ T}, and Σ = {p∣(s,p,o) ∈ T}. In addition, we define an infinite set Var of variables that is disjoint from U and L. An example RDF graph T2 is shown in Figure 2a, which consists of 13 triples (i.e., edges). For instance, (v1,a,v2) is an RDF triple as well as an edge with label a in T2, and \(V_{T_{2}}=\{v_{i} \mid 1 \le i \le 8\}\), \({\Sigma }_{T_{2}}=\{\texttt {a,b,c,d,e,f,g,h}\}\).
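The derivation of V, E, and Σ from a set of triples can be sketched in a few lines of Python. This is only an illustration: the function name `graph_components` and the toy triples are assumptions (only the triple (v1,a,v2) is taken from the text, and the toy graph is not T2).

```python
def graph_components(triples):
    """Derive (V, E, Sigma) from a set of (s, p, o) triples."""
    vertices = {s for s, _, _ in triples} | {o for _, _, o in triples}
    edges = {(s, o) for s, _, o in triples}
    labels = {p for _, p, _ in triples}
    return vertices, edges, labels


# Hypothetical toy triples; only (v1, a, v2) appears in the text.
toy_triples = {("v1", "a", "v2"), ("v2", "b", "v3"), ("v3", "c", "v1")}
V, E, Sigma = graph_components(toy_triples)
```

Note that E forgets the predicate, so two triples between the same pair of vertices with different labels collapse into one element of E, exactly as in the formal definition.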

Definition 2

Regular path queries. Let Q = (x,r,y) be a regular path query over an RDF graph T = (V,E,Σ), where x, y ∈ Var are variables, and r is a regular expression over the alphabet Σ. Regular expression r is recursively defined as r ::= ε | p | r/r | r|r | r*, where p ∈ Σ and /, |, and * are concatenation, alternation, and the Kleene closure, respectively. The shorthands r+ for r/r* and r? for ε|r are also allowed. L(r) denotes the language expressed by r and λ(ρ) is the label of path ρ. The answer set of Q over T under the standard semantics is defined as {(x,y)∣∃ a path ρ in T from x to y s.t. λ(ρ) ∈ L(r)}.

Figure 2 An RDF graph T2 and provenance-aware answer of Q2 on T2

Given a regular expression r, let \(Pos(r)=\{1, 2, \dots , |r|\}\) be the set of positions in r, where |r| is the length of r. Thus, the symbols in r can be denoted as \(r[1],r[2],\dots ,r[|r|]\).

Definition 3

Automata of RPQs. Given an RDF graph T and an RPQ Q = (x,r,y) over T, the automaton of the RPQ Q is the Glushkov automaton AQ converted from the regular expression r by using Glushkov’s construction algorithm [7]. The function first(r) (resp. last(r)) is the set of positions in r that can match the first (resp. last) symbol of some string in L(r), and the function follow(r,i) is the set of positions in r that can follow position i when matching some string in L(r). AQ is defined as a 5-tuple (St,Σ,δ,q0,F), where (1) St = {0}∪ Pos(r) is a finite set of states, (2) Σ is the alphabet of r, (3) \(\delta : St \times {\Sigma } \rightarrow \mathcal {P}(St)\) is the transition function, (4) q0 = 0 is the initial state, and (5) F is the set of final states. Here, δ and F are further defined as follows:

$$ \begin{array}{@{}rcl@{}} \delta(q, a) &=& \left\{\begin{array}{llll} \{ i \mid i \in \texttt{first}(r) \land r[i] = a \} & \text{if } q = q_{0} \\ \{ i\mid i \in \texttt{follow}(r, q) \land r[i] = a \} & \text{if } q \in Pos(r) \end{array}\right. \\ F &=& \left\{\begin{array}{llll} \{ q_{0} \} \cup \texttt{last}(r) & \text{if } \varepsilon \in L(r) \\ \texttt{last}(r) & \text{otherwise } \end{array}\right. \end{array} $$

Example 1

Given an RPQ Q2 = (x,r,y) with r = a/b/c/(d|e), we build \(A_{Q_{2}}=(St,{\Sigma },\delta ,q_{0},F)\) based on r, where St = {0,1,2,3,4,5}, Σ = {a,b,c,d,e}, q0 = 0, F = {4,5}, and the transition function δ is represented in the form of the transition graph shown in Figure 3a.
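To make Definition 3 concrete, the following minimal Python sketch computes first, last, and follow for a regular expression and derives δ and F from them. The tuple-based syntax tree, the function names, and the left-to-right numbering of positions are illustrative assumptions rather than the paper's implementation.

```python
def glushkov(regex):
    """Return (symbols, first, last, follow, nullable) for a regex syntax tree.

    Nodes: ("eps",), ("sym", a), ("cat", r1, r2), ("alt", r1, r2), ("star", r1).
    Positions 1..|r| are assigned to symbols from left to right.
    """
    symbols = {}   # position i -> symbol r[i]
    follow = {}    # position i -> follow(r, i)

    def visit(node):
        kind = node[0]
        if kind == "eps":
            return set(), set(), True
        if kind == "sym":
            pos = len(symbols) + 1
            symbols[pos] = node[1]
            follow[pos] = set()
            return {pos}, {pos}, False
        if kind == "star":
            first, last, _ = visit(node[1])
            for p in last:              # a last position can loop back to a first one
                follow[p] |= first
            return first, last, True
        f1, l1, n1 = visit(node[1])
        f2, l2, n2 = visit(node[2])
        if kind == "alt":
            return f1 | f2, l1 | l2, n1 or n2
        for p in l1:                    # concatenation: last(r1) is followed by first(r2)
            follow[p] |= f2
        first = f1 | f2 if n1 else f1
        last = l1 | l2 if n2 else l2
        return first, last, n1 and n2

    first, last, nullable = visit(regex)
    return symbols, first, last, follow, nullable


def delta(symbols, first, follow, q, a):
    """Transition function of the Glushkov automaton (state 0 is q0)."""
    candidates = first if q == 0 else follow[q]
    return {i for i in candidates if symbols[i] == a}


def final_states(last, nullable):
    """F contains q0 = 0 only if the empty string is in L(r)."""
    return (last | {0}) if nullable else set(last)


# r = a/b/c/(d|e) from Example 1
R = ("cat", ("cat", ("cat", ("sym", "a"), ("sym", "b")), ("sym", "c")),
     ("alt", ("sym", "d"), ("sym", "e")))
symbols, first, last, follow, nullable = glushkov(R)
```

For this r the sketch yields first(r) = {1} and F = {4,5}, which agrees with the automaton described in Example 1.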

Figure 3 The automaton and the provenance-aware answer of Q2

Definition 4

Provenance-aware answer set of RPQs. Given an RDF graph T = (V,E,Σ) and an automaton AQ = (St,Σ,δ,q0,F) of an RPQ Q, the provenance-aware answer set of Q over T is defined as a 5-tuple \((V_{\mathfrak {p}}, E_{\mathfrak {p}}, L_{\mathfrak {p}}, I_{\mathfrak {p}}, F_{\mathfrak {p}})\), where (1) \(V_{\mathfrak {p}} \subseteq V \times St\) is a set of vertices, (2) \(E_{\mathfrak {p}} \subseteq V_{\mathfrak {p}} \times V_{\mathfrak {p}}\) is a set of edges, (3) \(L_{\mathfrak {p}}\) is a function that assigns each edge a label in Σ, and (4) \(I_{\mathfrak {p}} = \{(v, q_{0}) \mid v \in V\} \subseteq V_{\mathfrak {p}}\) and \(F_{\mathfrak {p}} = \{(v, q_{f}) \mid v \in V \land q_{f} \in F\} \subseteq V_{\mathfrak {p}}\) are the sets of start and final vertices, respectively. Here, the answer set is constructed by the following process: for each path \(v_{0}a_{0}v_{1}\cdots v_{n-1}a_{n-1}v_{n}\) in T such that there exists a sequence of states \(st_{0}, st_{1}, \dots , st_{n}\) in St satisfying \(st_{0} = q_{0}\), \(st_{i+1} \in \delta (st_{i},a_{i})\) for \(i \in \{0, \dots , n-1\}\), and \(st_{n} \in F\), a path \(\rho =v_{\mathfrak {p}_{0}}a_{0}v_{\mathfrak {p}_{1}}\cdots v_{\mathfrak {p}_{n-1}}a_{n-1}v_{\mathfrak {p}_{n}}\) in the answer set is constructed, where \(v_{\mathfrak {p}_{i}} = (v_{i}, st_{i})\) for \(i \in \{0, \dots , n\}\) and by definition \(v_{\mathfrak {p}_{0}} \in I_{\mathfrak {p}} \land v_{\mathfrak {p}_{n}} \in F_{\mathfrak {p}}\).

Example 2

The provenance-aware answer set of Q2 over T2 as defined in Definition 4 is shown in Figure 3b, which can be considered as the extended version of the provenance-aware answer to Q2 (as a subgraph of T2 marked in green in Figure 2b), which attaches the matched states of \(A_{Q_{2}}\) to the corresponding vertices.
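On a single machine, the edges of the provenance-aware answer set in Definition 4 can be obtained by a forward pass over pairs (v, st) followed by a backward pass that discards pairs from which no final state is reachable. The sketch below is an assumption-laden illustration: it hard-codes the automaton of r = a/b/c/(d|e), uses a hypothetical toy graph instead of T2, and is not the distributed DP2RPQ algorithm.

```python
from collections import deque

# Hand-coded Glushkov automaton for r = a/b/c/(d|e)
SYMBOLS = {1: "a", 2: "b", 3: "c", 4: "d", 5: "e"}
FIRST = {1}
FOLLOW = {1: {2}, 2: {3}, 3: {4, 5}, 4: set(), 5: set()}
FINAL = {4, 5}


def delta(q, a):
    candidates = FIRST if q == 0 else FOLLOW[q]
    return {i for i in candidates if SYMBOLS[i] == a}


def provenance_answer(triples):
    """Return the labeled edges ((u, q), a, (v, q2)) lying on accepting paths."""
    out = {}
    for s, p, o in triples:
        out.setdefault(s, []).append((p, o))
    # Forward pass: explore every pair reachable from some (v, q0).
    starts = {(v, 0) for v in out}
    seen, queue, edges = set(starts), deque(starts), set()
    while queue:
        v, q = queue.popleft()
        for a, u in out.get(v, []):
            for q2 in delta(q, a):
                edges.add(((v, q), a, (u, q2)))
                if (u, q2) not in seen:
                    seen.add((u, q2))
                    queue.append((u, q2))
    # Backward pass: keep only pairs that can still reach a final state.
    alive = {pair for pair in seen if pair[1] in FINAL}
    changed = True
    while changed:
        changed = False
        for src, _, dst in edges:
            if dst in alive and src not in alive:
                alive.add(src)
                changed = True
    return {e for e in edges if e[2] in alive}


# Hypothetical toy graph (not the paper's T2)
toy = [("v1", "a", "v2"), ("v2", "b", "v3"), ("v3", "c", "v4"),
       ("v4", "d", "v5"), ("v4", "e", "v6"), ("v1", "f", "v7"),
       ("v2", "a", "v3")]
answer = provenance_answer(toy)
witness_triples = {(src[0], a, dst[0]) for src, a, dst in answer}
```

Projecting the answer edges back to triples, as in the last line, gives the witness subgraph; the triple (v2,a,v3) is dropped because no b edge leaves v3.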

Pregel is a vertex-centric parallel model for graph computation. The computation in Pregel is composed of a sequence of iterations, i.e., supersteps, conforming to the Bulk Synchronous Parallel (BSP) model [12].

There are several functions involved in the process of Pregel-based algorithms, as shown in Table 1. In particular, for each vertex v, we define Val(v) as the set of values associated with v. Within a superstep, each vertex executes the user-defined vertex computation vertexCompute(T,M) to get and update Val(v) in parallel.

Table 1 The functions in Pregel framework

The notations introduced in the definitions are described in Table 2, and we will use them in the rest of the paper.

Table 2 The notations in the definition

Definition 5

Pregel framework. Given an RDF graph T = (V,E,Σ) as the input data, in the first superstep, all the vertices are active. The entire computation terminates when all vertices are inactive. Let M be the set of messages. Within a superstep, the user-defined function vertexCompute(T,M) is executed on each active vertex in parallel. An inactive vertex will be reactivated by the incoming messages sent to it. When vertexCompute(T,M) is invoked on each active vertex v, it (1) gets the number of the current superstep by superStep; (2) receives messages (i.e., each m ∈ M) sent to v in the previous superstep; (3) obtains and/or updates Val(v); (4) modifies M to generate the set of new messages \(M^{\prime }\); (5) invokes \(\texttt {sendMsg}(v, v^{\prime }, M^{\prime })\) to send \(M^{\prime }\) to the adjacent vertex \(v^{\prime }\); and (6) invokes voteToHalt.
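The superstep loop of Definition 5 can be imitated in a single process as follows. This is a rough sketch under stated assumptions: `run_pregel` and its callback signature are hypothetical, every vertex implicitly votes to halt at the end of each superstep, and the vertex program shown is the classic maximum-value propagation rather than DP2RPQ.

```python
def run_pregel(out_edges, values, compute, max_supersteps=50):
    """Single-process imitation of the BSP superstep loop (illustrative only)."""
    active = set(values)                  # all vertices are active in superstep 1
    inbox = {v: [] for v in values}
    superstep = 0
    while active and superstep < max_supersteps:
        superstep += 1
        outbox = {v: [] for v in values}

        def send_msg(target, message):
            # Messages become visible to the target in the next superstep.
            outbox[target].append(message)

        for v in active:
            values[v] = compute(superstep, values[v], inbox[v],
                                out_edges.get(v, []), send_msg)
        # Barrier: every vertex has voted to halt; incoming messages reactivate it.
        inbox = outbox
        active = {v for v, messages in inbox.items() if messages}
    return values, superstep


def propagate_max(superstep, value, messages, neighbours, send_msg):
    """Vertex program: keep the largest value seen and forward it when it changes."""
    new_value = max([value] + messages)
    if superstep == 1 or new_value > value:
        for u in neighbours:
            send_msg(u, new_value)
    return new_value


final_values, supersteps = run_pregel(
    {"a": ["b"], "b": ["c"], "c": ["a"]},
    {"a": 3, "b": 6, "c": 2},
    propagate_max,
)
```

The computation stops once a superstep produces no messages, which mirrors the termination condition that all vertices are inactive.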

Definition 6

Matching pair. Given an RDF graph T = (V,E,Σ) and an automaton AQ of an RPQ Q = (x,r,y), the query processing in Pregel is matched forward by generating matching pairs in each superstep. A matching pair is a pair (v,q) satisfying \(\exists a \in \mathcal {S}(v) \wedge q \in St \setminus \{q_{0}\} \wedge v \in V \wedge a=r[q]\), where \(\mathcal {S}(v)\) is the set of symbols labeled on the incoming edges of v.

In particular, (v,q0), where q0 is the initial state of AQ, is also called a matching pair. The matching pairs generated in query processing correspond to the vertices \(V_{\mathfrak {p}} \subseteq V \times St\) of the provenance-aware answer set in Definition 4. In general, an answer path in the provenance-aware answer set can be denoted as a sequence of matching pairs.

4 The Pregel-based algorithm

In this section, we propose the Pregel-based algorithm for answering provenance-aware RPQs, which employs the automata introduced in Section 3. First, we describe the overall evaluation, then elaborate the computation in each vertex of each superstep. Finally, we discuss the cycle detection mechanism to avoid loop or infinite matching when evaluating on cyclic RDF graphs.

4.1 Architecture overview

The architecture of the Pregel-based algorithm for evaluating provenance-aware RPQs is shown in Figure 4. ① Given an RDF graph T and an RPQ Q = (x,r,y) over T, the automaton of the RPQ Q is the Glushkov automaton AQ converted from the regular expression r by using Glushkov’s construction algorithm; ② The query processor, which starts with the vertices in an RDF graph to match against the states in the corresponding automaton of Q, operates using the Pregel parallel computing framework to compute the matched intermediate results; ③ Several optimization strategies are presented to reduce the overhead of the basic DP2RPQ algorithm and alleviate the counting-paths problem; ④ After the entire computation completes, we can obtain the final provenance-aware results.

Figure 4 The architecture of DP2RPQ

4.2 Provenance-aware RPQs based on Pregel

The overall evaluation DP2RPQ is shown in Algorithm 1, in which we construct an automaton AQ = (St,Σ,δ,q0,F) of an RPQ Q = (x,r,y) (line 1). In each superstep, each active vertex v invokes vertexCompute in parallel (line 4) to match against a state q ∈ St, while the updated partial answers are maintained in Val(v). The matching process is executed repeatedly in the following supersteps until the computation terminates. When the entire computation completes, we combine Val(v) of each vertex v to obtain the final provenance-aware answer set (line 5).

Algorithm 1 The overall evaluation DP2RPQ

The process of the vertex computation vertexCompute is shown in Algorithm 2. It is executed at each vertex v ∈ V in parallel and has the following phases:

  1) In the first superstep (lines 2-7), v is considered to be matched against the initial state q0 only if there exists an outgoing edge \((v, v^{\prime })\) such that the label of \((v, v^{\prime })\) is the same as r[q] for some q ∈ first(r) (lines 3-4). Then, the first matched message m is generated, which formally is a set consisting of the matching pair (v,q0). Further, the message set Ms is a set of matched messages, which is sent to the adjacent vertices by invoking sendMsg (line 7). Finally, v becomes inactive, i.e., getState(v) = inactive, by invoking voteToHalt (line 21).

  2) In the remaining supersteps (lines 8-20), if the set Mr of messages received from the adjacent vertices in the previous superstep is empty, v is deactivated by voteToHalt (line 21); otherwise, each active vertex is matched forward based on the messages in Mr (lines 10-19). First, the set \(R_{q^{\prime }}\) of the next possible states is computed by follow w.r.t. the current matched state q (line 12). Next, if v has an outgoing edge labeled with the same symbol as \(r[q^{\prime }]\) (\(q^{\prime } \in R_{q^{\prime }}\)), a new message m is built by appending \((v,q^{\prime })\) to \(m^{\prime }\) and then added to the message set Ms to be sent (lines 13-16). Finally, \(\texttt {sendMsg}(v,v^{\prime },M_{s})\) is invoked to send Ms from v to \(v^{\prime }\) (line 20). Meanwhile, v checks whether the current matched state \(q^{\prime }\) of the new set of matching pairs m is a final state (i.e., \(q^{\prime } \in \texttt {last}(r)\)) (line 17). If it is, then m is regarded as an answer path ρf in the form of a sequence of matching pairs, which is added to the partial provenance-aware answer set, denoted by Val(v) (lines 18-19).

Algorithm 2 The vertex computation vertexCompute

In \(\texttt {sendMsg}(v,v^{\prime },M_{s})\) (lines 7 and 20), the condition for sending a message m ∈ Ms from v to \(v^{\prime }\) is that the current matched state q satisfies \(\exists q^{\prime } \in \texttt {follow}(r,q) \wedge r[q^{\prime }]=\lambda ((v,v^{\prime }))\). In addition, when \(v^{\prime }\) receives messages from different adjacent vertices, it merges all the messages into Mr.
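The message flow described above can be imitated in a single process. The following sketch is a simplification under several assumptions, not a transcription of Algorithm 2: the automaton of r = a/b/c/(d|e) is hard-coded, each message carries the label of the edge it travels along so that the receiver can determine its next state, and the names `dp2rpq_sketch` and `next_states` as well as the toy graph are hypothetical.

```python
# Hand-coded Glushkov automaton for r = a/b/c/(d|e)
SYMBOLS = {1: "a", 2: "b", 3: "c", 4: "d", 5: "e"}
FIRST = {1}
FOLLOW = {1: {2}, 2: {3}, 3: {4, 5}, 4: set(), 5: set()}
LAST = {4, 5}


def next_states(q, label):
    """States reachable from q by reading label (state 0 is q0)."""
    candidates = FIRST if q == 0 else FOLLOW[q]
    return {i for i in candidates if SYMBOLS[i] == label}


def dp2rpq_sketch(triples):
    """Return Val(v) for every vertex: the answer paths that end at v."""
    out = {}
    for s, p, o in triples:
        out.setdefault(s, []).append((p, o))
    val = {}
    # Superstep 1: match v against q0 and send along edges usable from q0.
    inbox = {}
    for v, edges in out.items():
        for label, u in edges:
            if next_states(0, label):
                inbox.setdefault(u, []).append((((v, 0),), label))
    # Remaining supersteps: extend every received message by one hop.
    while inbox:
        outbox = {}
        for v, received in inbox.items():
            for path, label in received:
                q = path[-1][1]                     # current matched state
                for q2 in next_states(q, label):
                    if (v, q2) in path:             # cycle detection (Section 4.3)
                        continue
                    new_path = path + ((v, q2),)
                    if q2 in LAST:                  # answer path found
                        val.setdefault(v, set()).add(new_path)
                    for out_label, u in out.get(v, []):
                        if next_states(q2, out_label):   # sending condition
                            outbox.setdefault(u, []).append((new_path, out_label))
        inbox = outbox                              # barrier between supersteps
    return val


# Hypothetical toy graph (not the paper's T2)
toy = [("v1", "a", "v2"), ("v2", "b", "v3"), ("v3", "c", "v4"),
       ("v4", "d", "v5"), ("v4", "e", "v6")]
val = dp2rpq_sketch(toy)
```

The loop terminates because a message never contains the same matching pair twice, so its length is bounded by |V| · |St|.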

The correctness of the DP2RPQ algorithm is guaranteed by the following theorem.

Theorem 1

Given an RPQ Q = (x,r,y) over an RDF graph T, .

Proof

(Sketch)

  (i) “If” direction: for , ∃ a path ρ1 in T from v0 to vn and a path ρ2 in AQ from q0 to qn. It can be observed that \(q_{i} \in \delta (q_{i-1},\lambda ((v_{i-1},v_{i})))\), for \(i \in \{1,\dots ,n\}\), holds in DP2RPQ. The label of ρ1 is the same as that of ρ2, i.e., λ(ρ1) ∈ L(r). Therefore, .

  (ii) “Only if” direction: for , assume a path of the form \(v_{0}a_{0}\cdots v_{n-1}a_{n-1}v_{n}\) in T such that there exists a sequence of states \(st_{0} {\dots } st_{n}\) in St of AQ in DP2RPQ satisfying \(st_{0} = q_{0}\), \(st_{i+1} \in \delta (st_{i},a_{i})\) for \(i \in \{0,\dots ,n-1\}\), and \(st_{n} \in F\). The path \(\rho = (v_{0},q_{0})a_{0}(v_{1},q_{1})\cdots (v_{n-1},q_{n-1})a_{n-1}(v_{n},q_{n})\) is constructed in DP2RPQ, i.e., .

Theorem 2

The complexity of the DP2RPQ algorithm is bounded by \(O(|deg^{+}_{m}|^{k}\cdot |r|^{k}\cdot |deg^{-}_{m}|^{k-1})\), where |r| is the length of the regular expression r in Q, k is the total number of supersteps, and \(|deg^{+}_{m}|\) (resp. \(|deg^{-}_{m}|\)) is the maximum outdegree (resp. indegree) of the vertices in T.

Proof

(Sketch)

  (i) Basis: When k = 1, in the first superstep, there exists a vertex that is matched \(O(|deg^{+}_{m}|\cdot |r|)\) times since at most \(|deg^{+}_{m}|\) outgoing edges of the vertex are matched against the states in first(r). Thus, the complexity is \(O(|deg^{+}_{m}|\cdot |r|)\).

  (ii) Induction step: For k (k ≥ 1) supersteps, the complexity is \(O(|deg^{+}_{m}|^{k}\cdot |r|^{k}\cdot |deg^{-}_{m}|^{k-1})\) as the induction hypothesis. Thus, there exists a vertex that is matched \(O(|deg^{+}_{m}|^{k}\cdot |r|^{k}\cdot |deg^{-}_{m}|^{k-1})\) times, and all these matches are sent as messages via the outgoing edges. Then, for (k + 1) supersteps, since the maximum number of incoming edges of a vertex may be \(|deg^{-}_{m}|\), the set of messages received by the vertex includes \(O(|deg^{+}_{m}|^{k}\cdot |r|^{k}\cdot |deg^{-}_{m}|^{k})\) messages. Next, each message received by the vertex is matched against at most |r| states in the automaton based on the labels of the \(|deg^{+}_{m}|\) outgoing edges of the vertex to generate \(O(|deg^{+}_{m}|^{k+1}\cdot |r|^{k+1}\cdot |deg^{-}_{m}|^{k})\) messages. Thus, the vertex is matched \(O(|deg^{+}_{m}|^{k+1}\cdot |r|^{k+1}\cdot |deg^{-}_{m}|^{k})\) times, which is the complexity for (k + 1) supersteps.

In particular, \(|deg^{+}_{m}|\) and \(|deg^{-}_{m}|\) can be further reduced to |r| by our optimization techniques in Section 5. Thus, the complexity of the optimized DP2RPQ algorithm is \(O(|r|^{3k-1})\). In general, |r| is short in length and the number of total supersteps k is also limited.
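The exponent in the optimized bound follows by substituting |r| for both degree terms in the bound of Theorem 2 and adding the exponents, as the short derivation below spells out:

$$ O\left (|deg^{+}_{m}|^{k}\cdot |r|^{k}\cdot |deg^{-}_{m}|^{k-1}\right ) \;\longrightarrow \; O\left (|r|^{k}\cdot |r|^{k}\cdot |r|^{k-1}\right ) = O\left (|r|^{k+k+(k-1)}\right ) = O\left (|r|^{3k-1}\right ) $$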

4.3 Cycle detection

Loop matching may be generated in the matching process of RPQs when evaluating on cyclic RDF graphs. A cycle in an RDF graph is a path of edges and vertices wherein a vertex is reachable from itself.

Definition 7 (Loop matching)

Given a provenance-aware answer set of an RPQ Q over an RDF graph T, an answer path \(\rho _{f} = \{(v_{s},q_{0}),\dots ,(v_{i},q_{m}),\dots ,(v_{j},q_{n}),\dots \}\) in the answer set is called a loop matching if and only if j = i.

In Definition 7, a matching is regarded as a loop matching only if a vertex is matched more than once in a path. Taking the edges and states into account as well, loop matching can be further refined into four cases.

  (1) only vertex: (vi,qm) and (vi,qk) (m ≠ k) occur in the same path

  (2) vertex and state: (vi,qm) and (vi,qk) (m = k) occur in the same path

  (3) vertex and edge: (vi,qm)(vj,qn) and (vi,qk)(vj,ql) (m ≠ k ∧ n ≠ l ∧ r[qn] = r[ql]) occur in the same path

  (4) vertex, edge and state: (vi,qm)(vj,qn) and (vi,qk)(vj,ql) (m = k ∧ n = l ∧ r[qn] = r[ql]) occur in the same path

In particular, when the closure operations * and/or + occur in r, evaluating on cyclic RDF graphs may lead to an infinite number of matches. Therefore, we introduce a cycle detection mechanism in DP2RPQ to ensure that a vertex is not matched with the same state twice in intermediate partial answers, i.e., considering vertex and state. For a message \(m^{\prime }\) in a set of received messages, if a 2-tuple matching pair \((v,q) \in m^{\prime }\), then (v,q) cannot be added to \(m^{\prime }\) again for generating a new message in the following supersteps.
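A minimal sketch of this check, assuming a message is represented as a tuple of matching pairs (the function name `extend_message` is hypothetical):

```python
def extend_message(message, pair):
    """Append the matching pair (v, q) unless it already occurs in the message."""
    if pair in message:
        return None          # v would be matched with the same state twice
    return message + (pair,)


# A message travelling around the cycle v1 -> v2 -> v1 for r = a+ (state 1 loops).
m = (("v1", 0), ("v2", 1), ("v1", 1))
blocked = extend_message(m, ("v2", 1))     # same vertex and same state: rejected
allowed = extend_message(m, ("v3", 1))     # new vertex: accepted
```

Note that (v1,1) is accepted after (v1,0) in the example message because only the combination of vertex and state is checked, which corresponds to case (2) above.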

5 Optimization strategies

To improve the efficiency of our method, we evaluate the cost of the Pregel computation. Further, three optimization strategies are devised according to the cost model, which can reduce the cost of vertex-computation, reduce the intermediate results of the basic DP2RPQ algorithm dramatically, and address the counting-paths problem to some extent, respectively.

5.1 Cost estimation

The cost of the Pregel computation is determined as the sum of the cost of all supersteps. The cost of each superstep consists of the following terms: (1) wi is the maximum cost of vertex computation among all vertices in the i-th superstep; (2) hi is the maximum number of messages sent or received by each vertex; and (3) l is the cost of the barrier synchronization at the end of a superstep. Thus, the cost of a Pregel-based algorithm is \({\sum }_{i=1}^{k}w_{i}+g{\sum }_{i=1}^{k}h_{i}+kl\), where g is the cost to deliver a message and k is the number of total supersteps. Therefore, the cost of DP2RPQ is determined by the cost of vertex computation, the cost of passing messages, and the number of the supersteps.
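As a worked instance of this cost model, the sketch below evaluates the formula for given per-superstep values. The function name and the numbers are purely illustrative assumptions, not measurements.

```python
def pregel_cost(w, h, g, l):
    """Evaluate sum_i w_i + g * sum_i h_i + k * l, where k = len(w)."""
    if len(w) != len(h):
        raise ValueError("w and h must have one entry per superstep")
    k = len(w)
    return sum(w) + g * sum(h) + k * l


# Three supersteps with made-up costs: 16 + 0.5 * 70 + 3 * 2
cost = pregel_cost(w=[5, 8, 3], h=[10, 40, 20], g=0.5, l=2)
```

The three terms make the optimization targets explicit: smaller w_i (vertex computation), smaller h_i (message passing), and fewer supersteps k.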

5.2 Vertex-computation optimization

In this section, we design two techniques to reduce the cost of vertex-computation in each superstep.

5.2.1 Edge filtering

Generally, the input graph is stored into main memory in Pregel, which may result in an inefficient memory utilization. In order to improve the efficiency and scalability of the DP2RPQ algorithm, we design the edge-filtering technique, which only loads the edges labeled with the symbols that occur in r of Q = (x,r,y) by filtering other edges. We use Σr to denote the subset of the alphabet that appears in r. Then, an edge (v,a,u) in the input RDF graph T is loaded if and only if a∈Σr. In Example 2, with edge filtering, only the edges labeled with the symbols in Σr = {a,b,c,d,e} are involved in the processing, while the edges labeled with f, g, and h are filtered.
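Edge filtering amounts to a single predicate on the edge label at load time. The following lines sketch it on hypothetical triples (the vertices and edges shown are not those of T2; only the label sets follow the example above).

```python
def filter_edges(triples, sigma_r):
    """Keep only the triples whose label occurs in the regular expression r."""
    return [(s, p, o) for (s, p, o) in triples if p in sigma_r]


sigma_r = {"a", "b", "c", "d", "e"}          # symbols of r = a/b/c/(d|e)
toy = [("v1", "a", "v2"), ("v2", "f", "v3"), ("v3", "b", "v4"),
       ("v4", "g", "v5"), ("v5", "h", "v6")]
loaded = filter_edges(toy, sigma_r)
```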

5.2.2 Candidate states

In Algorithm 2, the traversal operations over the incoming edges of each vertex (line 13) are not efficient when evaluating RPQs on large-scale RDF graphs. Thus, we leverage the prior knowledge in a given RPQ Q over an RDF graph T to construct an auxiliary structure called candidate states, denoted by Rc, which keeps, for each vertex v, the set of states q of the RPQ automaton such that (v,q) is a matching pair.

Rc is precomputed before processing and calculated only once. The construction of the candidate states consists of two steps. First, we construct a set \(\{(v_{1},q_{0}),\dots ,(v_{i},q_{0}),\dots ,(v_{n},q_{0})\}\) (n ≤|V |) and check whether some outgoing edge of each vertex v is labeled with the same symbol as r[q] (q ∈ first(r)). If so, q0 is put into Rc of v. Then, at each vertex v, we check whether some incoming edge of v is labeled with the same symbol as r[q] (q ∈ St). If so, q is put into Rc of v. With the candidate-states technique, we traverse the edges of each vertex only once, which avoids the excessive cost of traversal operations in each superstep of the Pregel-based processing.

In vertex computation, we compare the states in Rc with those in \(R_{q^{\prime }}\) instead of the costly iteration over all adjacent vertices. Algorithm 3 is an optimized version of lines 13-16 in Algorithm 2 using the candidate-states technique, in which the modified matching process is: (1) \(R_{q^{\prime }}\) is computed by function follow; (2) when v receives the message \(m^{\prime } \in M_{r}\), if \(q^{\prime } \in R_{c} \cap R_{q^{\prime }}\), a new message m is built by appending \((v,q^{\prime })\) to \(m^{\prime }\); otherwise, no new message is generated for this particular message \(m^{\prime }\).


Example 3

Consider the RDF graph T2 = (V,E,Σ), the RPQ Q2 = (x,r,y), and the automaton \(A_{Q_{2}}=\{St,{\Sigma },\delta ,q_{0},F\}\) in Example 2. First, q0 is appended to \(R_{c_{v_{1}}}\) and \(R_{c_{v_{2}}}\) since v1 and v2 have outgoing edges labeled with a, which satisfies first(r) = {1} and r[1] = a. Then, for example, since there exists an incoming edge of v2 labeled with a and r[1] = a, the candidate states for v2 are \(R_{c_{v_{2}}}=\{0,1\}\), as shown in Figure 5. If v2 receives a message \(m^{\prime }=((v_{1},0))\) from v1, \(R_{q^{\prime }_{0}}=\texttt {follow}(r,0)=\{1\}\) is computed. Since 1 is also in \(R_{c_{v_{2}}}\), (v2,1) is appended to \(m^{\prime }\) to generate a new matched message.

Figure 5 The RDF graph T2 with the candidate-states technique
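The two construction cases and the modified matching step can be sketched in Python as follows. This is a minimal illustration rather than Algorithm 3 itself: the function names, the dictionary encoding of r[q], first(r), and follow, and the four-vertex graph are hypothetical, and the sketch mirrors the text literally without re-checking the label of the edge that delivered the message.

```python
def candidate_states(vertices, edges, symbol_at, first):
    """Precompute Rc for every vertex in a single pass over the edges.

    symbol_at[q] plays the role of r[q]; first is first(r); 0 stands for q0.
    """
    rc = {v: set() for v in vertices}
    first_symbols = {symbol_at[q] for q in first}
    for (v, a, u) in edges:
        # Case 1: an outgoing edge of v carries a symbol r[q], q in first(r).
        if a in first_symbols:
            rc[v].add(0)
        # Case 2: an incoming edge of u carries the symbol r[q].
        for q, symbol in symbol_at.items():
            if a == symbol:
                rc[u].add(q)
    return rc

def extend(v, message, rc, follow):
    """Build the new messages at v for one received message m'."""
    _, last_state = message[-1]
    next_states = rc[v] & follow[last_state]   # Rc ∩ R_q'
    return [message + [(v, q)] for q in sorted(next_states)]

# Hypothetical query r = a/b with positions 1 (a) and 2 (b).
symbol_at = {1: "a", 2: "b"}
first = {1}
follow = {0: {1}, 1: {2}, 2: set()}

vertices = ["v1", "v2", "v3", "v4"]
edges = [("v1", "a", "v2"), ("v2", "a", "v3"), ("v3", "b", "v4")]

rc = candidate_states(vertices, edges, symbol_at, first)
print(rc)
print(extend("v2", [("v1", 0)], rc, follow))
```

In this toy graph v2 has both an outgoing and an incoming edge labeled a, so its candidate states are {0, 1}, which parallels the situation of v2 in Example 3.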

5.3 Message-communication optimization

As the number of supersteps increases, each message to be sent becomes larger. Meanwhile, the complexity analysis of the algorithm shows that the number of messages to be sent in the k-th superstep can reach exponential complexity with respect to the maximum outdegree/indegree of the RDF graph. When dealing with large-scale RDF knowledge graphs, sending such a large number of messages is time-consuming. In this section, we therefore consider reducing the communication cost. On the one hand, we prune unnecessary messages by exploiting the features of provenance-aware RPQs and the Kleene closure operations, which reduces the message-passing cost and the number of supersteps. On the other hand, we encode the messages that have to be sent with variable-length-byte encoding to reduce their sizes.

5.3.1 Pruning sending messages

To prune matching pairs from the messages to be sent, we remove the matching pairs that have already been added to the partial provenance-aware answer set. Further, we remove the duplicate matching pairs that are incurred by the Kleene closure operations.

Pruning answer matches

In the matching process of the DP2RPQ algorithm, when the current matched state of a message received at a vertex is an accept state, the message is converted into an equivalent answer path; meanwhile, it continues to be matched forward by appending a matching pair. Obviously, as the number of supersteps increases, the length of this message increases accordingly. To this end, we prune the matching pairs that have already been added into the partial provenance-aware answer set to reduce the length of the message.

For example, consider an RPQ Q3 = (x,r3,y) with \(r_{3}=\mathsf {a}/\mathsf {b}/\mathsf {c}^{*}\), evaluated on the RDF graph T3 shown in Figure 6. In the third superstep, v3 receives the message \(m^{\prime }=(v_{1},0)(v_{2},1)\) from v2, appends (v3,2) to \(m^{\prime }\), and keeps the new message m = (v1,0)(v2,1)(v3,2) as an answer path in the partial answer set Val(v3). The message m is also sent to v4 to be matched forward, although its prefix (v1,0)(v2,1) is already kept in Val(v3). Thus, the matching pairs (v1,0) and (v2,1) in m are pruned, and only m = (v3,2) is sent to v4 as the message. This pruning has no effect on the final result and reduces the length of the messages sent.

Figure 6 An RDF graph T3
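The pruning of answer matches in the example above can be sketched as a single vertex-side step. The function name `match_and_prune`, the list-of-pairs message encoding, and the dictionary `val` standing for Val(v) are hypothetical; the accept states {2, 3} are assumed from r3 = a/b/c*, where positions 2 (b) and 3 (c) can end a match.

```python
def match_and_prune(v, q_new, message, accept_states, val):
    """Append (v, q_new) to a received message and return what is sent on.

    If q_new is an accept state, the full message is stored in Val(v) as an
    answer path and only the last matching pair is forwarded.
    """
    m = message + [(v, q_new)]
    if q_new in accept_states:
        val.setdefault(v, []).append(m)   # keep the answer path at v
        return [(v, q_new)]               # the prefix is already in Val(v)
    return m                              # nothing stored yet: forward all

accept_states = {2, 3}   # assumed accept positions of a/b/c*
val = {}

# v3 receives (v1,0)(v2,1) from v2 and matches state 2.
sent = match_and_prune("v3", 2, [("v1", 0), ("v2", 1)], accept_states, val)
print(sent)
print(val["v3"])
```

The forwarded message stays at length one however long the stored answer path is, which is where the saving in message size comes from.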

Pruning duplicate matches

When evaluating RPQs, the number of supersteps required to process different queries may vary considerably. For example, in DP2RPQ, given Q4 = (x,a/b/c/d,y) with |r| = 4, at most (|r| + 1) supersteps are needed; given Q5 = (x,(a/a)+,y) with |r| = 2, (|r| ⋅ k + 1) (k ∈ {1,...,n}) supersteps are needed, where k is proportional to the size of the RDF graph.

When evaluating Q5 on T3, five supersteps are needed. The longest answer path of Q5 is of the form (v1,0)(v2,1)(v7,2)(v8,1)(v9,2), which is generated at v9 in the fifth superstep. In the matching process, at the third superstep, v7 generates a new message (v1,0)(v2,1)(v7,2), which is to be sent and is maintained in the partial provenance-aware answer set of v7, i.e., Val(v7). Meanwhile, v9 generates a new message (v7,0)(v8,1)(v9,2) and maintains it in Val(v9) at the third superstep. Next, in the fourth superstep, v8 receives the message (v1,0)(v2,1)(v7,2) from v7 and, without pruning, would append (v8,1) to it to generate a new message. However, the matching pair (v8,1) is a duplicate match, so it is not appended to the received message. Thus, no new message is generated at v8 in the fourth superstep. With pruning of duplicate matches, the total number of supersteps decreases to |r| + 1, which depends only on |r|. Since |r| can generally be considered a constant, this reduces both the total number of supersteps and the number of intermediate results.
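As a quick check of the two superstep bounds on this running example, take Q5 with |r| = 2 on T3. The value k = 2 is not stated explicitly in the text; it is inferred here from the five supersteps reported for the unpruned evaluation.

$$ \underbrace{|r|\cdot k + 1}_{\text{without pruning}} = 2\cdot 2 + 1 = 5, \qquad \underbrace{|r| + 1}_{\text{with pruning}} = 2 + 1 = 3. $$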

In summary, pruning matches reduces the size of the messages to be sent. The maximum number of messages sent in the k-th superstep decreases from exponential to polynomial complexity, i.e., \(O(|deg^{+}_{m}|\cdot |r|\cdot |deg^{-}_{m}|)\).

5.3.2 Variable-length-byte encoding

The cost of passing messages has a significant impact on query performance. Therefore, we reduce the storage space of messages by encoding them.

A message consists of a series of matching pairs (v,q), where (1) v corresponds to the vertex label or vertex ID, both of which are unique identifiers; and (2) q is the matched state of the automaton of an RPQ. In general, v and q can each be represented by a 4-byte integer variable. However, when the values are small, such a fixed-length byte encoding wastes a lot of space. Therefore, we use the variable-length-byte encoding method, as shown in Algorithm 4. The storage space varbyteEncoding(x) occupied by the integer variable x is as follows:

$$ \texttt{varbyteEncoding}(x) = \left\{\begin{array}{ll} 1B & \text{if } x < 2^{7} \\ 2B & \text{if } 2^{7} \leq x < 2^{14} \\ 3B & \text{if } 2^{14} \leq x < 2^{21} \end{array}\right. $$
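The byte counts in the formula above match the common 7-bits-per-byte scheme, which the following Python sketch illustrates. It is not a transcription of Algorithm 4: the function names, the use of the high bit as a continuation flag, and the low-order-group-first byte order are assumptions made for this example.

```python
def varbyte_encode(x):
    """Encode a non-negative integer using 7 payload bits per byte.

    The high bit of a byte is set when more bytes follow (assumed layout).
    """
    if x < 0:
        raise ValueError("only non-negative integers are supported")
    out = bytearray()
    while True:
        low = x & 0x7F
        x >>= 7
        if x:
            out.append(low | 0x80)   # more bytes follow
        else:
            out.append(low)          # last byte of this integer
            return bytes(out)

def varbyte_decode(data):
    """Decode a byte string produced by concatenated varbyte_encode calls."""
    values, x, shift = [], 0, 0
    for b in data:
        x |= (b & 0x7F) << shift
        if b & 0x80:
            shift += 7
        else:
            values.append(x)
            x, shift = 0, 0
    return values

def encode_message(pairs):
    """Encode a message given as a list of (vertex id, state) pairs."""
    return b"".join(varbyte_encode(v) + varbyte_encode(q) for v, q in pairs)

message = [(1, 0), (300, 1), (20000, 2)]   # hypothetical vertex ids and states
encoded = encode_message(message)
print(len(encoded), "bytes instead of", 4 * 2 * len(message))
```

Automaton states are bounded by |r| and so almost always fit in one byte, which means the fixed 4-byte representation of q is where most of the space is recovered.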

5.4 Counting-paths alleviation strategy

Under the provenance-aware semantics, the counting-paths problem often arises in DP2RPQ. To address it, we attempt to reduce the number of matches and compress the messages to be sent.

5.4.1 Counting-paths problem

For a distributed Pregel-based algorithm that generates provenance-aware answers to RPQs, the counting-paths problem is inevitable and may incur prohibitively expensive overhead [1].

Theorem 3

In distributed provenance-aware RPQ evaluation, the counting-paths problem is inevitable in most cases.

Proof

(Sketch) For example, given the RDF graph T4 in Figure 7, the vertex x has k incoming edges labeled with a and n outgoing edges labeled with b. If x receives a set Mr of k messages via the incoming edges (v1,x), \((v_{2},x),\dots ,(v_{k},x)\), there exists a subset \(\{m^{\prime }_{j},\dots ,m^{\prime }_{p}\} \subseteq M_{r}\) (j ≥ 1 ∧ p ≤ k) such that the last element (vi,qi) (j ≤ i ≤ p) of each of these messages satisfies \(R_{q^{\prime }}=\texttt {follow}(q_{i})\) ∧ \(r[q^{\prime }]=\texttt {a}\) ∧ \(q^{\prime } \in R_{q^{\prime }}\). Then, for the messages \(m^{\prime }\) ∈ \(\{m^{\prime }_{j},\dots ,m^{\prime }_{p}\}\), |r| ⋅ k new messages may be built by appending \((x,q^{\prime })\) to \(m^{\prime }\) and sent to the adjacent vertices via the n outgoing edges labeled with b when \(R_{q^{\prime \prime }}=\texttt {follow}(q^{\prime })\) ∧ \(r[q^{\prime \prime }]=\texttt {b}\) ∧ \(q^{\prime \prime } \in R_{q^{\prime \prime }}\). Obviously, |r| ⋅ k × n messages need to be sent in total, which is actually the Cartesian product of the k received messages and the n outgoing edges. The Cartesian product is known to be the key factor causing the counting-paths problem. □

Figure 7 RDF graph T4 with the Cartesian product

5.4.2 Message selection

To partly address the counting-paths problem, we reduce the number of matches between a vertex and a state, which is the dominant cost in the vertex computation of each superstep. Let a message set \(M=\{m^{\prime }_{1},\dots ,m^{\prime }_{k}\}\) be a subset of Mr. If the next matched states of the last elements of all messages in M are the same, we select an arbitrary message \(m^{\prime }\) from M and append the element \((v,q^{\prime })\) to \(m^{\prime }\) to build the new message, while the remaining message set \(M \setminus \{m^{\prime }\}\) is cached in v. If there exists an answer path that contains \(m^{\prime }\), then \(M \setminus \{m^{\prime }\}\) is added to the final provenance-aware answer set. This optimization technique is referred to as message selection, which avoids the Cartesian product by reducing the number of messages from O(k ⋅ |r| ⋅ n) to O(|r| ⋅ n).

In Figure 8, there are |r| ⋅ k messages generated at x satisfying \(q^{\prime } \in R_{q_{i}}\) ∧ \(R_{q_{i}}=\texttt {follow}(q_{i})\) ∧ \(R_{q_{i}}=R_{q_{j}}\) ∧ 1 ≤ i,j ≤ k. Then we select just one of these |r| ⋅ k messages to be matched.

Figure 8 Matching processing with message selection
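Message selection can be sketched as grouping the received messages by the set of next matched states and forwarding only one representative per group. The function name `select_messages`, the grouping key, and the values k = 4 and n = 3 below are hypothetical choices for illustration; how the cached messages are later merged into the answer set is not shown.

```python
from collections import defaultdict

def select_messages(received, follow):
    """Split received messages into one representative per group and a cache.

    Messages whose last states have the same follow set would be extended
    by the same matching pairs, so only the first of each group is kept.
    """
    groups = defaultdict(list)
    for m in received:
        _, last_state = m[-1]
        groups[frozenset(follow[last_state])].append(m)
    selected, cached = [], []
    for next_states, group in groups.items():
        if not next_states:
            continue                 # no state can follow: nothing to send
        selected.append(group[0])    # representative to be matched forward
        cached.extend(group[1:])     # kept at the vertex for the answer set
    return selected, cached

follow = {0: {1}, 1: {2}, 2: set()}   # hypothetical query a/b
k, n = 4, 3                           # k incoming messages, n outgoing b-edges
received = [[("v%d" % i, 0)] for i in range(1, k + 1)]

selected, cached = select_messages(received, follow)
print("sent without selection:", len(received) * n)
print("sent with selection:   ", len(selected) * n)
```

With a single next state, the number of messages leaving x drops from k ⋅ n to n, in line with the reduction from O(k ⋅ |r| ⋅ n) to O(|r| ⋅ n) stated above.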

5.4.3 Message compression

In Algorithm 2, a message m ∈ Ms generated at a vertex v may be sent several times via different edges when v has more than one outgoing edge labeled with the same symbol, which may incur excessive message-passing cost. Since messages are sent via the edges from one vertex to another in Pregel, sending some messages multiple times is inevitable.

Definition 8 (Duplicated-passing message)

In DP2RPQ evaluation, if a generated message can be sent from one vertex to multiple adjacent vertices via more than one edge labeled with the same symbol, it is called a duplicated-passing message.

To reduce the cost of message passing, we compress the duplicated-passing messages in Algorithm 5, which is an optimized version of line 20 in Algorithm 2. We use a sequence Sm(v), attached to v, to keep the original uncompressed messages. A duplicated-passing message m is then compressed into a message \(((C_{m},i),(v,q^{\prime }))\) to be sent at v, which consists of only two elements, where Cm is a flag indicating that the message is compressed, i denotes the index of the original message in Sm(v), and \(q^{\prime }\) is the matched state of v in the current superstep. Then, the compressed message is appended to the message set Ms (line 9). Finally, when transforming a message into an equivalent path in line 18 of Algorithm 2, we uncompress the partial message \(((C_{m},i),(v,q^{\prime }))\) by an index-searching strategy, in which i is used as the search index to look up the uncompressed message in Sm(v). With the message-compression technique, the matching process is as follows.

  • I. Construction of the compressed messages. For the RDF graph T2 in Example 2, in the third superstep, there exist messages that can be compressed at v4. When v4 receives a message \(m^{\prime }=((v_{2},0),(v_{3},1))\), a new message m = ((v2,0),(v3,1),(v4,3)) is built and then sent via the two outgoing edges (v4,d,v7) and (v4,d,v8) in Figure 9. In the figure, the matching pairs in the message are represented as rectangle nodes, which are connected by the edges between the vertices. Next, the original message is compressed into ((Cm,0),(v4,3)), where the index i = 0 since m is the first element of Sm(v). For the matching in the next superstep, (v4,3) denotes the cached position of the compressed message. Meanwhile, the original message m is appended to the sequence Sm(v) of v4. The message-passing cost can be reduced dramatically when the original message is long.

  • II. Uncompression of the answer paths. After Algorithm 2 completes, we collect all the answer paths by traversing Val(v) over each vertex v, as shown in line 5 of Algorithm 1. In Example 2, the set of answer paths, i.e., Val(v), is non-empty only at v7 and v8, as shown in Table 3. Taking the answer ((Cm,0),(v4,3),(v7,4)) in Val(v7) as an example, Figure 10 shows how the answer is uncompressed into the original answer path for presenting the provenance-aware answer set. First, the message is recognized as compressed by the flag Cm. Next, since the index i = 0, the original message is the first element of the sequence Sm(v) of v4. Finally, we uncompress the answer path using the original message (v2,0),(v3,1),(v4,3).
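The two steps above can be traced with a small Python sketch that reproduces the v4/v7 example. It is an illustration rather than Algorithm 5: the `Vertex` class, the string flag `"Cm"`, and the recursive handling of a stored message that is itself compressed are assumptions of this sketch.

```python
C_M = "Cm"   # flag marking a compressed message (assumed representation)

class Vertex:
    def __init__(self, name):
        self.name = name
        self.sm = []   # S_m(v): original uncompressed messages kept at v

    def compress(self, message):
        """Store the message in S_m(v) and return ((Cm, i), (v, q'))."""
        self.sm.append(message)
        index = len(self.sm) - 1
        return [(C_M, index), message[-1]]

def uncompress(path, vertices):
    """Expand a path that may start with a compressed prefix."""
    if path and path[0][0] == C_M:
        _, index = path[0]
        owner, _ = path[1]                  # vertex whose S_m holds the original
        original = vertices[owner].sm[index]
        # The stored message already ends with path[1], so skip both elements.
        return uncompress(original, vertices) + list(path[2:])
    return list(path)

v4 = Vertex("v4")
m = [("v2", 0), ("v3", 1), ("v4", 3)]
compressed = v4.compress(m)             # sent via both d-edges of v4
answer_at_v7 = compressed + [("v7", 4)]

print(compressed)
print(uncompress(answer_at_v7, {"v4": v4}))
```

Only the two-element message crosses the network on each of the outgoing edges, while the full prefix is stored once at v4 and looked up by index when the answer path is presented.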

Figure 9 The building of the compressed message in RDF graph T2

Table 3 The set of answer paths Val(v) cached in each vertex

Figure 10 The uncompressing of the answer path in RDF graph T2

In addition, we also employ the message selection technique in Section 5.4.2 to reduce the cost of message passing to help alleviate the counting-paths problem even further.

6 Experimental evaluation

In this section, we evaluate the performance of our method. We conducted extensive experiments to verify the efficiency and scalability of our proposed algorithms on both synthetic and real-world datasets.

6.1 Experimental settings

The proposed algorithms were implemented in Scala using Spark GraphX and deployed on a 10-site cluster in the Tencent Cloud. Each site in this cluster runs a 64-bit CentOS 7.3 Linux operating system, with a 4-core CPU and 16GB memory. Our algorithms were executed on Java 1.8, Scala 2.11, Hadoop 2.7.4, and Spark 2.2.0.

To verify our algorithms, we use three datasets, including two benchmark datasets (LUBM and WatDiv) and a real-world dataset (DBpedia), which are listed in Table 4. At present, there is no benchmark for RPQs. We therefore designed twelve RPQs based on the characteristics of the operators in regular expressions, including simple queries and complex queries, denoted by Q1 to Q12 in Table 5. An RPQ Q is a complex query if it contains the closure operators ∗ and/or +; otherwise, it is a simple one. Since a complex query contains closure operations, the length of the paths matching r is not fixed, which is likely to increase the number of supersteps in the query process. The above queries combine the typical operators in regular expressions and can cover all typical patterns of RPQs that match paths of different lengths.

Table 4 Datasets
Table 5 Regular path queries

6.2 Experimental results

We compared the performance of the basic algorithm and three optimized algorithms, denoted as DP2RPQ, DP2RPQopt (vertex-computation optimization), DP2RPQmsg (message-communication reduction), and DP2RPQcnt (counting-paths problem alleviation), respectively, using different queries over the three datasets. In addition, we evaluated the scalability of DP2RPQ and DP2RPQopt. Finally, we compared our approach with RDFPath [22].

6.2.1 Efficiency of the algorithms

  I. Different queries over the same dataset. We evaluate twelve queries (Q1 to Q12) over the LUBM100 and DBpedia datasets, respectively.

The results are shown in Figure 11. Both the basic and the optimized algorithm, i.e., DP2RPQ and DP2RPQopt, return answers to the queries in limited time, which verifies the efficiency of the algorithms. Although LUBM100 is smaller than the DBpedia dataset, the query time of an RPQ over LUBM100 is not always less than that over DBpedia, which is due to the large number of results when evaluating Q1 and Q6.

In particular, the experimental results on the LUBM100 and DBpedia datasets indicate that DP2RPQopt performs better than DP2RPQ in all cases. On the LUBM100 dataset, the query time of Q6 is longer than that of the other queries; in fact, the number of intermediate partial results reaches millions of paths in the query processing of Q6. On the DBpedia dataset, the most significant improvement of DP2RPQopt is for the most complex query Q11, which takes 52.44% of the query time of DP2RPQ. Meanwhile, the average improvement ratio is 40.62%.

  II. Same queries over datasets of different sizes. We evaluate the queries in Table 5 over the LUBM and WatDiv datasets with different scale factors, i.e., 10, 100, and 200, respectively.

  (1) LUBM dataset. Both DP2RPQ and DP2RPQopt are executed on the LUBM datasets, and the results are illustrated in Figure 12. It can be observed that the time of the algorithms scales linearly with the size of the data. However, the time of Q1, Q3, and Q6 increases rapidly with the size of the data because the query results reach a relatively large scale (i.e., millions of paths). In most cases, the query time of DP2RPQopt is reduced significantly compared with DP2RPQ, which verifies the effectiveness of our optimization strategies. However, when Q3 and Q11 are evaluated on LUBM10, DP2RPQopt takes slightly longer than DP2RPQ. The reason is that Σr of these queries contains more distinct symbols than that of the other queries and LUBM10 is relatively small, so fewer useless edges are filtered out than for the other queries. As a general rule, the larger the dataset is, the better DP2RPQopt performs. The optimization effect of DP2RPQopt for all queries on LUBM200 is much more significant than that on the other datasets.

  (2) WatDiv dataset. Four representative queries (Q4, Q6, Q7, and Q10) are selected from Table 5 and evaluated on the WatDiv datasets of varying scale factors (SF), i.e., 10, 100, and 200, respectively. In Figure 13, it can be observed that DP2RPQopt outperforms DP2RPQ for all queries. The query time of DP2RPQopt is on average 41.59% of that of DP2RPQ. Due to the diversified predicates in WatDiv, DP2RPQopt reduces the number of traversals and filters out more useless edges than in the evaluation on the LUBM datasets.

Figure 11 The experimental results on the LUBM100 and DBpedia datasets

Figure 12 The experimental results of efficiency on LUBM datasets

Figure 13 The experimental results of efficiency on WatDiv datasets

6.2.2 Scalability of the algorithms

  I. Varying site number. To evaluate the scalability of DP2RPQ and DP2RPQopt, we use the LUBM100 and DBpedia datasets, with four representative queries (Q2, Q4, Q5, and Q11) selected from Table 5.

The query time on different numbers of sites, varying from 4 to 10, is shown in Figure 14. The query time of DP2RPQ and DP2RPQopt decreases as the number of sites increases, which confirms that our algorithms can take full advantage of the vertex-centric Pregel framework for graph-parallel computing. Moreover, the average speedup ratio of DP2RPQ is 1.21 times that of DP2RPQopt.

Figure 14 Scalability by varying the number of sites

  II. Maximum time in vertex computation. Apart from the message-passing cost, the maximum execution time of vertex computation is the dominant cost. We selected four queries (Q2, Q3, Q6, and Q8) and evaluated them over LUBM datasets of different sizes (LUBM10, LUBM100, and LUBM200).

The optimization techniques, i.e., the edge-filtering and candidate-states strategies, reduce the maximum execution time in vertex computation, as shown in Figure 15. It can be observed that the maximum time of vertex computation scales linearly with the size of the data. Generally, DP2RPQopt performs better than DP2RPQ in terms of the maximum time in vertex computation.

Figure 15 The experimental results of the maximum time in vertex computation

6.2.3 Efficiency of the message-communication optimization

Based on DP2RPQopt, the optimization strategies of sending-message pruning and variable-length-byte encoding are further implemented to reduce the communication cost; the resulting algorithm is called DP2RPQmsg. To verify the impact of sending-message pruning on the message size, we compare the maximum length of the messages sent over all supersteps between DP2RPQ and DP2RPQmsg. Eight queries (Q1 to Q7, and Q9) are selected from Table 5 and evaluated on LUBM datasets of different sizes, i.e., LUBM10, LUBM100, and LUBM200, respectively.

Table 6 presents the maximum length of the messages sent, where Rmsg is the reduction rate of the message length after message-communication optimization. As can be seen, the maximum length of the messages sent is reduced significantly by the message-pruning strategy. Moreover, the reduction rate is more than 50% in all cases, which indicates that DP2RPQmsg reduces the message length effectively.

Table 6 The experimental results of the maximum length of sending messages

6.2.4 Effectiveness of the counting-paths alleviation strategy

Another optimized algorithm, called DP2RPQcnt, was implemented with the message-compression and message-selection techniques to partly address the counting-paths problem. Due to the limited lengths of the answers to the queries in Table 5, DP2RPQcnt cannot reach its full potential on them. To this end, we generate RDF graphs w.r.t. the data model of WatDiv by constructing structures like T4 in Figure 7. Meanwhile, we design an RPQ \(Q_c=x_1/x_2/\dots /x_{10}\) that covers the predicates that may generate the Cartesian product. The answer paths to Qc are much longer than those of the previous queries.

In Figure 16, it can be observed that DP2RPQ and DP2RPQopt cannot finish within the time limit (\(10^{4}\) s), denoted by INF, while DP2RPQcnt returns the answers in 78.39 s and 377.56 s over the RDF graphs that contain 1 million and 10 million triples, respectively. Thus, DP2RPQcnt can effectively alleviate the counting-paths problem.

Figure 16 The experimental results of the counting-paths alleviation

6.2.5 Performance comparison between DP2RPQ and RDFPath

Existing distributed methods for provenance-aware regular path queries are rare. To the best of our knowledge, RDFPath is the only method for answering RPQs in a distributed setting. Thus, in this section, we compare our approach DP2RPQmsg with RDFPath. Although RDFPath is designed based on the MapReduce framework, it is implemented using Spark. RDFPath is deployed on a 10-site cluster in the Tencent Cloud and executed on Hadoop 2.6.0 and Spark 1.6.1.

  I. Efficiency. We first evaluated the efficiency of DP2RPQmsg and RDFPath with four simple queries (Q1 to Q4) and four complex queries (Q5, Q6, Q7, and Q9). In RDFPath, a path query is composed of a sequence of basic navigational components, denoted as location steps, so it is well suited to handling concatenation operations rather than the closure operators ∗ and/or +. In fact, the regular path query is converted into an SQL query in the processing of RDFPath, so the number of join operations is proportional to the length of the path in the query. In particular, the semantics of the queries that RDFPath can handle are limited. If a query is not supported by RDFPath, its query response time is left empty.

  (1) LUBM dataset. We evaluated DP2RPQmsg and RDFPath with four simple queries over LUBM datasets; the results are shown in Figure 17. The results for Q2 indicate that RDFPath performs better than DP2RPQmsg, since Q2 contains only the concatenation operator, which RDFPath is suited to handling. However, RDFPath supports only Q2, since the other queries all contain the alternation operator, which is not supported by RDFPath.

Figure 17 The experimental contrast results of simple queries on LUBM datasets

The number of edges can be specified in an RDFPath query, so a fixed-length join operation can be used to approximately simulate the closure operations. For Q7 = a+, we convert it to an approximate query in RDFPath, denoted as a<k>, where k is the number of occurrences of the label a. Obviously, the expressive power of a<k> is weaker than that of a+ in RPQs. Moreover, as the value of k increases, the query time increases rapidly, as shown in Figure 18. With k = 15, we compare the response time of complex queries between DP2RPQmsg and RDFPath. It can be observed that DP2RPQmsg performs better than RDFPath, as shown in Figure 19.

  (2) DBpedia dataset. The real-world dataset DBpedia has richer predicates and a more complex structure, which can be used to design RPQs with long paths. Since RDFPath can only perform two of the queries, Q2 and Q7, we design two additional queries \(Q_{13}=a/b/\dots /k/l\) and \(Q_{14}=a/b/\dots /o/p\) with lengths of 12 and 16, respectively, to compare the performance of DP2RPQmsg and RDFPath. Figure 20 shows that RDFPath can only execute Q2, Q7, Q13, and Q14, and that our method outperforms RDFPath in most cases. Although the query time of DP2RPQmsg on Q2 is longer than that of RDFPath, the semantics of RPQs are far richer than those of RDFPath on the whole.

  II. Scalability. To evaluate the scalability with different numbers of sites, we used LUBM100 and DBpedia as the datasets and varied the number of sites from 4 to 10. The experimental results of four queries (Q1, Q2, Q7, and Q9) on LUBM100 are shown in Figure 21a. It can be seen that the query time of DP2RPQopt decreases as the number of sites increases, whereas the query time of RDFPath on the two queries it supports (Q2 and Q7) changes little. Figure 21b depicts the scalability evaluation of DP2RPQmsg and RDFPath on the DBpedia dataset for the queries Q2, Q7, Q9, and Q14, in which the query time of DP2RPQmsg decreases more significantly than that of RDFPath as the number of sites increases. This confirms that our algorithm takes fuller advantage of the graph-parallel computing model than RDFPath.

Figure 18 Influence of different k values on query performance of RDFPath

Figure 19 The experimental contrast results of complex queries on LUBM datasets

Figure 20 The experimental contrast results on DBpedia datasets

Figure 21 Scalability by varying the number of sites

7 Conclusion

In this paper, we propose a novel method for answering provenance-aware RPQs over large RDF knowledge graphs using the Pregel parallel graph computing framework. We also devise three groups of optimization techniques: the edge-filtering and candidate-states techniques significantly improve the performance of RPQ evaluation, sending-message pruning and variable-length-byte encoding greatly reduce intermediate results and communication overhead, and the message-compression and message-selection strategies alleviate the counting-paths problem. Extensive experiments on both synthetic and real-world datasets have verified the effectiveness, efficiency, and scalability of our method.