1 Introduction

A tremendous amount of information stored in the LOD can be inspected by leveraging the already mature query capabilities of SPARQL, relational, and graph databases [14]. However, arbitrarily complex queries [2, 3, 7], entailing rather intricate, possibly recursive, graph patterns, prove difficult to evaluate, even on small-sized graph datasets [4, 5]. At the same time, the usage of such queries has increased radically in real-world query logs, as shown by recent empirical studies on SPARQL queries from large-scale Wikidata and DBpedia corpora [8, 17]. As a tangible example of this growth, the percentage of SPARQL property paths in user-specified Wikidata queries has increased from \(15\%\) to \(40\%\) between 2017 and the beginning of 2018 [17]. In this paper, we focus on regular path queries (RPQs), which identify paths labeled with regular expressions, and aim to offer an approximate query evaluation solution. In particular, we consider counting queries with regular paths, a notable fragment of graph analytical queries. The exact evaluation of counting queries on graphs is \(\#P\)-complete [21], a result that builds on the complexity of enumerating simple graph paths. Due to this intractability, an efficient and highly accurate approximation of these queries is desirable; this is the problem we address in this paper.

Approximate query processing on relational data and the related sampling methods are not applicable to graphs, since the adopted techniques rely on the linearity assumption [15], i.e., the existence of a linear relationship between sample size and execution time, which is typical of relational query processing. We therefore design a novel query-driven graph summarization approach tailored for property graphs. These significantly differ from the RDF and relational data models, as they attach data values to property lists on both nodes and edges [7].

To the best of our knowledge, ours is the first work on approximate property graph analytics addressing counting estimation on top of navigational graph queries. We illustrate our query fragment with the running example below.

Example 1 (Social Network Advertising)

Let \(\mathcal {G}_{SN}\) (see Fig. 1) be a property graph (see Sect. 2) encoding a social network, whose schema is inspired by the LDBC benchmark [12]. Entities are people (type Person, \(P_i\)) that know (\(l_0\)) and/or follow ( ) either each other or certain forums (type Forum, \(F_i\)). These are moderated ( ) by specific persons and can contain ( ) messages/ads (type Message, \(M_i\)), to which persons can author ( ) other messages in reply ( ).

We focus on an RPQ [3, 23] dialect with counting, capturing the following query types (\(Q_1 - Q_7\)) (see Fig. 2):

(1) Simple/Optional Label. The number of pairs satisfying \(Q_1\), i.e., , counts the ad reactions, while that for \(Q_2\), i.e., , indicates the number of potential moderators. (2) Kleene Plus/Kleene Star. The number of connected/potentially connected acquaintances is the count of node pairs satisfying \(Q_3\), i.e., \(() \leftarrow {l^{+}_0} ()\), respectively, \(Q_4\), i.e., \(() \leftarrow {l^{*}_0} ()\). (3) Disjunction. The number of targeted subscribers is the sum of the counts of all node pairs satisfying \(Q_5\), i.e., or . (4) Conjunction. The direct reach of a company via its page ads is the count of node pairs satisfying \(Q_6\), i.e., . (5) Conjunction with Property Filters. Recommendation systems can further refine the \(Q_6\) estimates. Thus, one can compute the direct demographic reach and target people within an age group, e.g., 18–24, by counting all node pairs that satisfy \(Q_7\), i.e., , s.t. \(x.age \ge 18\) and \(x.age \le 24\).

Fig. 1. Example social graph \(\mathcal {G}_{SN}\)

Fig. 2. Targeted advertising queries

Contributions. Our paper provides the following main contributions:

  • We design a property graph summarization algorithm for approximately evaluating counting regular path queries (Sect. 3).

  • We prove the intractability of the optimal graph summarization problem under the conditions of our summarization algorithm (Sect. 3).

  • We define a query translation module, ensuring that queries on the initial and summary property graphs are expressible in the same fragment (Sect. 4).

  • Based on this, we experimentally demonstrate the small relative errors achieved on various workloads, within the expressive query fragment of Example 1. We measure the relative response time between estimating counting recursive queries on summaries and on the original graphs. For non-recursive queries, we compare with SumRDF [19], a baseline graph summary for RDF datasets (Sect. 5).

In Sect. 2, we revisit the property graph model and query language. We present related work in Sect. 6 and conclude the paper in Sect. 7.

2 Preliminaries

Graph Model. We take the property graph model (PGM) [7] as our foundation. Graph instances are multi-edge digraphs, whose objects are represented by typed data vertices, and whose relationships, by typed, labeled edges. Vertices and edges can have any number of properties (key/value pairs). Let \(L_V\) and \(L_E\) be disjoint sets of vertex and edge labels, respectively, and \(\mathcal {G} = (V,E)\), with \(E \subseteq V \times L_E \times V\), a graph instance. Vertices \(v \in V\) have an id label, \(l_v\), and a set of property labels (attributes, \(l_i\)), each with a (potentially undefined) term value. For \(e \in E\), we use the binary notation \(e = l_e(v_1,v_2)\) and abbreviate \(v_1\) as e.1 and \(v_2\) as e.2. We denote the number of occurrences of \(l_e\) by \(\#l_e\), and the set of all edge labels in \(\mathcal {G}\) by \(\varLambda (\mathcal {G})\). Other key notations used henceforth are given in Table 1.
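For concreteness, the model above can be sketched in a few lines of Python; the class and field names below are our own illustration, not part of the paper's formalism:

```python
from collections import namedtuple

# A property graph: typed vertices and labeled edges, each carrying
# key/value properties, following the PGM definitions above.
Vertex = namedtuple("Vertex", ["id", "label", "props"])      # l_v and attributes
Edge = namedtuple("Edge", ["src", "label", "dst", "props"])  # e = l_e(v1, v2)

class PropertyGraph:
    def __init__(self):
        self.V = {}   # vertex id -> Vertex
        self.E = []   # list of Edge

    def add_vertex(self, vid, label, **props):
        self.V[vid] = Vertex(vid, label, props)

    def add_edge(self, src, label, dst, **props):
        self.E.append(Edge(src, label, dst, props))

    def label_count(self, l):
        # #l_e: number of occurrences of edge label l
        return sum(1 for e in self.E if e.label == l)

    def labels(self):
        # Lambda(G): the set of all edge labels in G
        return {e.label for e in self.E}
```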

Table 1. Notation table

Fig. 3. Graph query language

Graph Query Language. To query the above property graph model, we rely on an RPQ [10, 11] fragment with aggregate operators (see Fig. 3). RPQs correspond to SPARQL 1.1 property paths and are a well-studied query class, tailored to express graph patterns of one or more label-constrained reachability paths. For labels \(l^{i}_e\) and vertices \(v_i\), the labeled path \(\pi \), corresponding to \(v_1 \xrightarrow {l^{1}_{e}} v_2 \ldots v_{k-1} \xrightarrow {l^{k}_{e}} v_k\), is the concatenation \(l^{1}_{e} \cdot \ldots \cdot l^{k}_{e}\). In their full generality, RPQs allow one to select vertices connected via such labeled paths in a regular language over \(L_E\). We restrict RPQs to handle atomic paths – bi-directional, optional, single-labeled (\(l_e\), \(l_e?\), and \(l_e^{-}\)) and transitive single-labeled (\(l_e^{*}\)) – and composite paths – conjunctive and disjunctive compositions of atomic paths (\(l_e \cdot l_e\) and \(\pi + \pi \)). While not as general as SPARQL, our fragment already captures more than 60% of the property paths found in practice in SPARQL query logs [8]. Moreover, it captures property path queries as found in the large Wikidata corpus studied in [9]. Indeed, almost all the property paths in the considered logs contain Kleene-star expressions over single labels. In our work, we enrich the above query classes with the count operator and support basic graph reachability estimates.
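As a point of reference for the queries being approximated, a naive exact evaluator for the single-label Kleene-plus fragment (\(l^{+}\)) can be written as a BFS from every source node; this is the kind of costly exact computation that the summary-based estimates of Sect. 4 replace. The function name and edge encoding are ours:

```python
from collections import defaultdict, deque

def count_kleene_plus_pairs(edges, label):
    """Exactly count node pairs (u, v) such that v is reachable from u by a
    path of length >= 1 using only `label`-labeled edges (the l^+ query).
    `edges` is a list of (src, label, dst) triples."""
    adj = defaultdict(list)
    for src, lbl, dst in edges:
        if lbl == label:
            adj[src].append(dst)
    total = 0
    for start in list(adj):
        # BFS over label-restricted edges from each source node
        seen, queue = set(), deque(adj[start])
        while queue:
            v = queue.popleft()
            if v not in seen:
                seen.add(v)
                queue.extend(adj[v])
        total += len(seen)   # one pair (start, v) per reachable v
    return total
```

On large graphs this per-source BFS is exactly the cost the summarization avoids, since the summary answers the count from precomputed properties instead.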

3 Graph Summarization

We introduce a novel algorithm that summarizes any property graph into one tailored for approximately counting reachability queries. The key idea is that, as nodes and edges are compressed, informative properties are iteratively added to the corresponding newly formed structures, to enable accurate estimations.

The grouping phase (Sect. 3.1) computes \(\varPhi \), a label-driven \(\mathcal {G}\)-partitioning into subgroupings, following the connectivity on the most frequent labels in \(\mathcal {G}\). A first summarization collapses the vertices and inner-edges of each subgrouping into s-nodes, and the edges connecting s-nodes into s-edges. The merge phase (Sect. 3.2), based on further label-reachability conditions specified by a heuristic mode m, collapses s-nodes into h-nodes and s-edges into h-edges.

3.1 Grouping Phase

For each frequently occurring label l in \(\mathcal {G}\), in descending order, we iteratively partition \(\mathcal {G}\) into \(\varPhi \), containing components that are connected on l, as below.

Definition 1 (Maximal L-Connectivity)

A \(\mathcal {G}\)-subgraph \(\mathcal {G'} = (V', E')\) is maximally l-connected, i.e., \(\lambda (\mathcal {G'}) = l\), iff: (1) \(\mathcal {G'}\) is weakly-connected; (2) removing any l-labeled edge from \(E'\) leaves some \(V'\) node pair not connected by an \(l^{+}\)-labeled undirected path; (3) no l-labeled edge connects a \(V'\) node to \(V \setminus V'\).

Example 2

In Fig. 1, \(\mathcal {G}_1\) is maximally \(l_0\)-connected, since it is weakly-connected, not connected by an \(l_0\)-labeled edge to the rest of \(\mathcal {G}\), and such that, by removing \(P_8 \xrightarrow {l_0} P_9\), no undirected, \(l_0^{+}\)-labeled path unites \(P_8\) and \(P_9\).

We call each such component a subgrouping. The procedure (see Algorithm 1) computes, as the first grouping, all the subgroupings for the most frequent label, \(l_1\), and then identifies those corresponding to the rest of the graph and to \(l_2\). At the end, all remaining nodes are collected into a final subgrouping. We illustrate this in Fig. 4, on the running example below.
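The grouping phase just described can be sketched as follows. This is a simplified reconstruction of Algorithm 1, under our own encoding assumptions: labels are processed in descending frequency, each subgrouping is one weakly-connected component on the current label among still-unassigned vertices, and leftovers form a final subgrouping:

```python
from collections import Counter, defaultdict

def grouping(vertices, edges):
    """Sketch of the grouping phase: partition `vertices` into subgroupings,
    driven by edge-label frequency. `edges` are (src, label, dst) triples.
    Returns a list of (label, vertex_set) pairs."""
    remaining = set(vertices)
    phi = []
    for label, _ in Counter(l for _, l, _ in edges).most_common():
        # undirected adjacency restricted to this label and unassigned nodes
        adj = defaultdict(set)
        for u, l, v in edges:
            if l == label and u in remaining and v in remaining:
                adj[u].add(v)
                adj[v].add(u)
        for start in list(adj):
            if start not in remaining:
                continue  # already swallowed by an earlier component
            comp, stack = set(), [start]
            while stack:
                x = stack.pop()
                if x in remaining:
                    remaining.discard(x)
                    comp.add(x)
                    stack.extend(adj[x])
            if comp:
                phi.append((label, comp))
    if remaining:
        phi.append((None, remaining))  # final subgrouping of leftover nodes
    return phi
```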

Example 3 (Grouping)

In Fig. 1, , and , as allows arbitrary ordering. We add the maximal \(l_0\)-connected subgraph, \(\mathcal {G}_1\), to \(\varPhi \). Hence, \(V = \{R_{i \in \overline{1,7}}, M_{i \in \overline{1,6}}, F_1, F_2\}\). Next, we add \(\mathcal {G}_2\), regrouping the maximal -connected subgraph. Hence, \(V = \{F_1, F_2\}\); we add \(\mathcal {G}_3\) and output \(\varPhi = \{\mathcal {G}_1, \mathcal {G}_2, \mathcal {G}_3\}\).

Fig. 4. Summarization phases for \(\mathcal {G}_{SN}\)

A \(\mathcal {G}\)-partitioning \(\varPhi \) (see Fig. 4a) is transformed into an s-graph \(\mathcal {G}^{*} = (V^{*}, E^{*})\) (see Fig. 4b). Each s-node gathers all the nodes and inner edges of a \(\varPhi \)-subgrouping, \(\mathcal {G}^{*}_j\), and each s-edge, all same-labeled cross-edges (edges between pairwise distinct s-nodes). During this phase, we compute analytics concerning the regrouped entities. We leverage the PGM's expressivity to internalize these as properties, e.g., Fig. 5 (right). Hence, to every s-edge, \(e^{*}\), we attach \(EWeight\), its number of compressed edges; e.g., in Fig. 4b, all s-edges have weight 1, except \(e^{*}(v^{*}_4,v^{*}_1)\), with weight 2. To every s-node, \(v^{*}\), we attach properties concerning: (1) Compression. \(VWeight\) and \(EWeight\) store its number of inner vertices/edges. (2) Inner-Connectivity. The percentage of its l-labeled inner edges is \(LPercent\), and the number of its vertex pairs connected by an l-labeled edge is \(LReach\). These first two types of properties will be useful in Sect. 4, for estimating Kleene paths, as the labels of inner-edges in s-nodes are not unique, e.g., both \(l_0\) and appear in \(v^{*}_1\). (3) Outer-Connectivity. For pairs of labels and direction indices with respect to \(v^{*}\) (\(d=1\) for incoming edges, \(d=2\) for outgoing ones), we compute cross-connectivity, \(CReach\), as the number of binary cross-edge paths that start/end in \(v^{*}\). Analogously, we record the number of binary traversal paths, i.e., paths formed of an inner \(v^{*}\) edge and a cross-edge, as \(TReach\). Also, for a label l and a given direction, we store, as \(V_F\), the number of frontier vertices on l, i.e., the number of \(v^{*}\) nodes at either endpoint of an l-labeled s-edge.

We can thus record traversal connectivity information, \(LPart\), by dividing the number of traversal paths by that of the frontier vertices on the cross-edge label. Intuitively, traversal connectivity, as opposed to cross connectivity, also needs to account for the “dispersion”, within the s-node, of the inner-edge label of the path. For example, for a traversal path \(l_c \cdot l_i\), formed of a cross-edge, \(l_c\), and an inner one, \(l_i\), not all frontier nodes on \(l_c\) are endpoints of \(l_i\)-labeled inner-edges, as we will see in the example below.
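A minimal sketch of the compression and inner-connectivity properties attached to an s-node follows. The property names mirror the paper's, but the dictionary layout and the transitive reading of \(LReach\) (chosen to be consistent with its use for the Kleene-plus estimate in Sect. 4) are our assumptions:

```python
from collections import defaultdict

def snode_properties(inner_vertices, inner_edges):
    """Compute per-s-node properties from its inner vertices and its
    inner (src, label, dst) edges: VWeight, EWeight, and per-label
    LPercent and LReach."""
    props = {"VWeight": len(inner_vertices), "EWeight": len(inner_edges)}
    for l in {lbl for _, lbl, _ in inner_edges}:
        l_edges = [(u, v) for u, lbl, v in inner_edges if lbl == l]
        # LPercent: share (in %) of inner edges carrying label l
        props[("LPercent", l)] = 100.0 * len(l_edges) / len(inner_edges)
        # LReach: vertex pairs connected by an l-labeled (possibly
        # multi-step) inner path -- our transitive reading
        adj = defaultdict(set)
        for u, v in l_edges:
            adj[u].add(v)
        reach = 0
        for s in list(adj):
            seen, stack = set(), list(adj[s])
            while stack:
                x = stack.pop()
                if x not in seen:
                    seen.add(x)
                    stack.extend(adj[x])
            reach += len(seen)
        props[("LReach", l)] = reach
    return props
```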

Fig. 5. Selected properties for Fig. 4b (right); frontier vertices (left)

Example 4 (Outer-Connectivity)

Figure 5 (left) depicts a stand-alone example, in which circles denote s-nodes, labeled arrows denote the s-edges relating them, and crosses represent nameless vertices, as we only label the relevant ones, for simplicity. We use this configuration to illustrate analytics regarding cross and traversal connectivity on labels \(l_1\) and \(l_2\). For instance, as we will see in Sect. 4, when counting \(l_1 \cdot l_2^{-}\) cross-edge paths, we look at the CReach s-node properties mentioning these labels and note that there is a single such path, i.e., the one corresponding to \(l_1\) and \(l_2\) appearing on edges incoming to \(v^{*}_1\), i.e., \(CReach(v^{*}_1, l_1, l_2, 1, 1) = 1\). When counting \(l_1 \cdot l_2\) traversal paths, for the case when \(l_1\) appears on the cross-edge, we look at the properties of s-nodes containing \(l_2\) inner-edges. Hence, for \(v^{*}_2\), we note that there is a single such path, formed by an outgoing \(l_2\) edge and an incoming \(l_1\) edge, as \(TReach(v^{*}_2, l_1, l_2, 1, 1) = 1\). To estimate the traversal connectivity, we divide this by the number of frontier vertices on incoming \(l_1\) edges. As \(V_F(v^{*}_2, l_1, 1) = \{v_2, v_3\}\), we have that \(LPart (v^{*}_2, l_1, l_2, 1, 1) = 0.5\).

3.2 Merge Phase

The merge phase takes as input the graph computed by Algorithm 1 and a label set, and outputs a compressed graph, \(\hat{\mathcal {G}} = (\hat{V}, \hat{E})\). During this phase, sets of h-nodes, \(\hat{V}\), and h-edges, \(\hat{E}\), are created. At each step, as previously, \(\hat{\mathcal {G}}\) is enriched with approximation-relevant precomputed properties (see Sect. 4).

Each h-node, \(\hat{v}\), merges all s-nodes, \(v^{*}_i, v^{*}_j \in V^{*}\), that are maximally label connected on the same label, i.e., \(\lambda (v^{*}_i) = \lambda (v^{*}_j)\), and that have either the same set of incoming (source-merge) or outgoing (target-merge) edge labels, i.e., \(\varLambda _{d}(v^{*}_i) = \varLambda _{d}(v^{*}_j)\), \(d \in \{1,2\}\) (see Algorithm 2). Each h-edge, \(\hat{e}\), merges all s-edges in \(E^{*}\) with the same label and orientation, i.e., \(e^{*}_i.d = e^{*}_j.d\), for \(d \in \{1,2\}\).


To each h-node, we attach properties whose values, except for LPercent, are the sums of those of its constituent s-nodes; for the label percentage, we record the weighted mean of the s-node percentages. Next, we merge s-edges into h-edges, if they have the same label and endpoints, and attach to each h-edge its number of compressed s-edges, EWeight. We also record the average s-node weight, \(\mathcal {V}^{*}Weight\), to estimate how many nodes an h-node compresses.
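The s-node merging criterion can be sketched as follows, assuming s-nodes are given with their connectivity label \(\lambda \) and s-edges as labeled triples; the encoding and function name are ours:

```python
from collections import defaultdict

def merge_snodes(snodes, sedges, d=1):
    """Sketch of the merge criterion: s-nodes with the same connectivity
    label lambda and the same set of direction-d edge labels collapse into
    one h-node (d=1: incoming labels, d=2: outgoing labels, following the
    direction convention of Sect. 3.1).
    `snodes` maps s-node id -> lambda label; `sedges` are (src, label, dst)."""
    # Lambda_d: the direction-d label set of each s-node
    dir_labels = defaultdict(set)
    for src, l, dst in sedges:
        dir_labels[dst if d == 1 else src].add(l)
    # group s-nodes by (lambda, Lambda_d); each group becomes one h-node
    groups = defaultdict(list)
    for sid, lam in snodes.items():
        groups[(lam, frozenset(dir_labels[sid]))].append(sid)
    return list(groups.values())
```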

To formally characterize the graph transformation corresponding to our summarization technique, we first define the following function.

Definition 2 (Valid Summarization)

For \(\mathcal {G} = (\mathcal {V},E)\), a valid summarization function \(\chi _{\varLambda }: \mathcal {V} \rightarrow \mathbb {N}\) assigns vertex identifiers such that any two vertices with the same identifier are either in the same maximally l-connected \(\mathcal {G}\)-subgraph, or in different ones that are not connected by an l-labeled edge.

A valid summary is thus obtained from \(\mathcal {G}\), by collapsing vertices with the same \(\chi _{\varLambda }\) into h-nodes and edges with the same (depending on the heuristic, ingoing/outgoing) label into h-edges. We illustrate this below.

Example 5 (Graph Compression)

The graphs in Fig. 4c are obtained from \(\mathcal {G}^{*} = (V^{*},E^{*})\), after the merge phase. Each h-node contains the s-nodes (see Fig. 4b) collapsed via the target-merge (left) and source-merge (right) heuristics.

We study our summarization's optimality, i.e., the size of the obtained compressed graph, to gauge its tractability. Specifically, we investigate the following MinSummary problem, to establish whether one can always minimize the number of nodes of an input graph when constructing its valid summary.

Problem 1 (Minimal Summary)

Let MinSummary be the problem that, for a graph \(\mathcal {G}\) and an integer \(k' \ge 2\), decides if there exists a label-driven partitioning \(\varPhi \) of \(\mathcal {G}\), \(|\varPhi | \le k'\), such that \(\chi _{\varLambda }\) is a valid summarization.

Each MinSummary h-node is thus intended to regroup as many nodes from the original graph as possible, while ensuring these are connected by frequently occurring labels. This condition (see Definition 2) reflects the central idea of our framework, namely that the connectivity of such prominent labels can serve to both compress a graph and to approximately evaluate label-constrained reachability queries. Next, we establish the difficulty of solving MinSummary.

Theorem 1 (MinSummary NP-completeness)

Even for undirected graphs, \(|\varLambda (\mathcal {G})| \le 2\), and \(k'=2\), MinSummary is NP-complete.

The intractability of constructing an optimal summary thus justifies our search for heuristics with good performance in practice.

4 Approximate Query Evaluation

Query Translation. For \(\mathcal {G}\) and a counting reachability query Q, we approximate \([\![Q ]\!]_\mathcal {G}\), the evaluation of Q over \(\mathcal {G}\). We translate Q into a query \(Q^{T}\), evaluated over the summarization \(\hat{\mathcal {G}}\) of \(\mathcal {G}\), s.t. \([\![Q^{T} ]\!]_{\hat{\mathcal {G}}} \approx [\![Q ]\!]_{\mathcal {G}}\). The translations, by input query type, are given in Fig. 6, with PGQL as concrete syntax. (1) Simple and Optional Label Queries. A label l occurs in \(\hat{\mathcal {G}}\) either within an h-node or on a cross-edge. Thus, we either cumulate the number of l-labeled h-node inner-edges or the l-labeled cross-edge weights. To account for the potential absence of l, in the optional-label queries we also estimate the number of nodes in \(\hat{\mathcal {G}}\), by cumulating those in each h-node. (2) Kleene Plus and Kleene Star Queries. To estimate \(l^{+}\), we cumulate the counts within h-nodes containing l-labeled inner-edges and the weights on l-labeled cross-edges. For the former, we distinguish whether the \(l^{+}\) reachability is due to: (1) inner-connectivity – we use the property counting the inner l-paths; (2) incoming cross-edges – we cumulate the l-labeled in-degrees of h-nodes; or (3) outgoing cross-edges – we cumulate the number of outgoing l-paths. To handle the \(\epsilon \)-label in \(l^{*}\), we also estimate the number of nodes in \(\hat{\mathcal {G}}\). (3) Disjunction. We treat each possible configuration on both labels. Hence, we either cumulate the number of h-node inner-edges or that of cross-edge weights, with either label. (4) Binary Conjunction. We distinguish whether the label pair appears on an inner h-node path, on a cross-edge path, or on a traversal one.
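For instance, the simple-label translation can be sketched as follows, assuming each h-node stores the EWeight and per-label LPercent properties of Sect. 3 (so its l-labeled inner-edge count is EWeight scaled by LPercent), and each h-edge stores its label and weight; the dictionary encoding is our assumption:

```python
def estimate_label_count(hnodes, hedges, label):
    """Sketch of the simple-label translation Q^T(l): cumulate the
    estimated l-labeled inner edges of h-nodes with the l-labeled
    cross-edge weights."""
    # inner edges: scale each h-node's edge count by its stored l-percentage
    inner = sum(h["EWeight"] * h.get(("LPercent", label), 0.0) / 100.0
                for h in hnodes)
    # cross edges: sum the weights of l-labeled h-edges
    cross = sum(e["EWeight"] for e in hedges if e["label"] == label)
    return inner + cross
```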

Example 6

We illustrate the approximate evaluation of these query types on Fig. 4. To evaluate the number of single-label atomic paths, e.g., , as only occurs inside h-node \(\hat{v}_2\), is the amount of -labeled inner edges in \(\hat{v}_2\), i.e., . To estimate the number of optional label atomic paths, e.g., , we add to the total number of graph vertices, \(\sum _{\hat{v} \in \hat{\mathcal {V}}} \mathcal {V}^{*}Weight(\hat{v}) * \mathcal {V}Weight(\hat{v})\) (empty case). As only appears on an h-edge of weight 2 and there are 25 initial vertices, is 27. To estimate Kleene-plus queries, e.g., \(Q_P^{T}(l_0)\), as no h-edge has label \(l_0\), we return \(LReach(\hat{v}_1, l_0)\), i.e., the number of \(l_0\)-connected vertex pairs. Thus, \([\![l_0^{+} ]\!]_{\hat{\mathcal {G}}}\) is 15. For Kleene-star, we add to this the previously computed total number of vertices and obtain that \([\![l_0^{*} ]\!]_{\hat{\mathcal {G}}}\) is 40. For disjunction queries, e.g., , we cumulate the single-labeled atomic paths on each label, yielding 14. For binary conjunctions, e.g., , we rely on the traversal connectivity, , as appears on an h-edge and, , inside h-nodes; we thus count 7 node pairs.

Fig. 6. Query translations onto the graph summary.

5 Experimental Analysis

In this section, we present an empirical evaluation of our graph summarization, recording (1) the succinctness of our summaries and the efficiency of the underlying algorithm and (2) the suitability of our summaries for approximate evaluation of counting label-constrained reachability queries.

Setup, Datasets and Implementation. The summarization and approximation modules are implemented in Java using OpenJDK 1.8. As the underlying graph database backend, we have used Oracle Labs PGX 3.1, which is the only property graph engine allowing for the evaluation of complex RPQs.

To implement the intermediate graph analysis operations (e.g., weakly connected components), we used the Green-Marl domain-specific language and modified the methods to fit the construction of node properties required by our summarization algorithm. We base our analysis on the graph datasets in Fig. 7, encoding: a Bibliographic network (bib), the LDBC social network schema [12] (social), Uniprot knowledge graphs (uniprot), and the WatDiv schema [1] (shop).

We obtained these datasets using gMark [5], a synthetic graph instance and query workload generator. As gMark tries to construct the instance that best fits the size parameter and schema constraints, the resulting sizes vary (especially for the very dense graphs social and shop). Next, on the same datasets, we generated workloads of varying sizes, for each type in Sect. 2. These datasets and related query workloads have been chosen since they provide the most recent benchmarks for recursive graph queries and also to ensure a comparison with SumRDF [19] (as shown next) on a subset of those supported by the latter. Studies [8, 17] have shown that practical graph pattern queries formulated by users in online query endpoints are often small: \(56.5\%\) of real-life SPARQL queries consist of a single edge (RDF triple), whereas \(90.8\%\) use 6 edges at most. Hence, we select small-sized template queries with frequently occurring topologies, such as chains [8], and formulate them on our datasets, for workloads of \(\sim \)600 queries.

Experiments ran on a cloud VM with Intel Xeon E312xx, 4 cores, 1.80 GHz CPU, 128 GB RAM, and Ubuntu 16.04.4 64-bit. Each data point corresponds to repeating an experiment 6 times, removing the first value from the average.

Fig. 7. Datasets: no. of vertices |V|, edges |E|, vertex \(|L_V|\) and edge labels \(|L_E|\).

Summary Compression Ratios. First, we evaluate the effect that using the source-merge and target-merge heuristics has on the summary construction time (SCT). We also assess the compression ratio (CR) on the original graph's vertices and edges, by measuring \((1- |\hat{\mathcal {V}}|/|\mathcal {V}|) * 100\) and, respectively, \((1- |\hat{\mathcal {E}}|/|\mathcal {E}|) * 100\).

Fig. 8. CRs for vertices and edges, along with SCT runtime for various dataset sizes, for both source-merge (a-c-e) and target-merge (b-d-f).

Next, we compare the results for source and target merge. In Fig. 8(a-d), the most homogeneous datasets, bib and uniprot, achieve very high CRs (close to \(100\%\)) and steadily maintain them across varying graph sizes. As heterogeneity grows significantly for shop and social, the CR becomes sensitive to the dataset size, starting with low values for smaller graphs and stabilizing between 85% and 90% for larger ones. Notice also that the most heterogeneous datasets, shop and social, although similar, display a symmetric behavior for the vertex and edge CRs: the former better compresses vertices, while the latter, edges. Concerning the SCT runtime in Fig. 8(e-f), all datasets maintain reasonable performance at larger sizes, even the most heterogeneous one, shop. The runtime is, in fact, not affected by heterogeneity, but is rather sensitive, for larger sizes, to |E| variations (up to 450K and 773K, for uniprot and social). Also, while the source-merge and target-merge SCT runtimes are similar, the latter achieves better CRs for social. Overall, the dataset with the worst CR under both heuristics is shop, with the lowest CR at smaller sizes. This is also due to the high number of labels in the initial shop instances and, hence, to the high number of properties its summary needs: on average, across all considered sizes, 62.33 properties, against 17.67 for social, 10.0 for bib, and 14.0 for uniprot. These experiments show that, despite its high complexity, our summarization provides high CRs and low SCT runtimes, even for large, heterogeneous graphs.

Approximate Evaluation Accuracy. We assess the accuracy and efficiency of our engine with the relative error and time gain measures, respectively. The relative error (per query \(Q_i\)) is \(1 - min(Q_i(\mathcal {G}),Q^T_i(\hat{\mathcal {G}}))/max(Q_i(\mathcal {G}),Q^T_i(\hat{\mathcal {G}}))\) (in %), where \(Q_i(\mathcal {G})\) computes (with PGX) the counting query \(Q_i\) on the original graph, and \(Q^T_i(\hat{\mathcal {G}})\) computes (with our engine) the translated query \(Q^T_i\) on the summary. The time gain is \((t_{\mathcal {G}} - t_{\hat{\mathcal {G}}})/max(t_{\mathcal {G}},t_{\hat{\mathcal {G}}})\) (in %), where \(t_{\mathcal {G}}\) and \(t_{\hat{\mathcal {G}}}\) are the query evaluation times of \(Q_i\) on the original graph and on the summary.
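The two measures above are straightforward to compute; for clarity, under our naming:

```python
def relative_error(exact, approx):
    """Per-query relative error, in %: 1 - min/max of the two counts."""
    return (1 - min(exact, approx) / max(exact, approx)) * 100

def time_gain(t_full, t_summary):
    """Time gain, in %: (t_G - t_Ghat) / max(t_G, t_Ghat)."""
    return (t_full - t_summary) / max(t_full, t_summary) * 100
```

Note that the time gain is negative when evaluation on the summary is slower than on the original graph, as observed below for some disjunction workloads.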

For the Disjunction, Kleene-plus, Kleene-star, Optional and Single Label query types, we have generated workloads of different sizes, bound by the number of labels in each dataset. For the concatenation workloads, we considered binary conjunctive queries (CQs) without disjunction, recursion, or optionality. Note that, currently, our summaries do not support compositionality.

Fig. 9. Rel. Error (a), Time Gain (b) per Workload, per Dataset, 200K nodes.

Figure 9(a) and (b) show the relative error and average time gain for the Disjunction, Kleene-plus, Kleene-star, Optional, and Single Label workloads. In Fig. 9(a), we note that the average relative error is kept low in all cases and is bounded by \(5.5\%\), for the Kleene-plus and Kleene-star workloads of the social dataset. In all other cases, including the Kleene-plus and Kleene-star workloads of the shop dataset, the error is relatively small (near \(0\%\)). This confirms the effectiveness of our graph summaries for the approximate evaluation of graph queries. In Fig. 9(b), we study the efficiency of approximate evaluation on our summaries by reporting the time gain (in \(\%\)) compared with query evaluation on the original graphs, for the four datasets. We notice a positive time gain (\(\ge \)75%) in most cases, except for disjunction. While the relative approximation error remains advantageous for disjunction, disjunctive queries are time-consuming to approximate on our summaries, especially for extremely heterogeneous datasets, such as shop (having the most labels). This is due to the overhead introduced by considering all possible connectivity combinations on the disjunctive labels. The problem of scaling our method, without prohibitive accuracy loss, to queries involving multiple labels and further compositionality, e.g., Kleene-star over disjunctions [22], is challenging and falls under the scope of future work.

Fig. 10. Performance comparison: SumRDF vs. APP (our approach): approx. eval. of binary CQs, \(\texttt {SELECT}~\texttt {COUNT(*)}~\texttt {MATCH} \; Q_i\), on the summaries of a shop graph instance (31K nodes, 56K edges); comparing estimated cardinality (no. of computed answers), rel. error w.r.t. the original graph results, and query runtime.

Baseline for Approximate Query Evaluation Performance. The closest system to ours is SumRDF [19] (see Sect. 6), which, however, operates on a simpler edge-labeled model rather than on property graphs and is tailored for estimating the results of conjunctive queries only. As a performance baseline, we considered the shop dataset in gMark [5], simulating the WatDiv benchmark [1] (also a benchmark in [19]). From this dataset, with 31K nodes and 56K edges, we generated the corresponding SumRDF summary and ours. We obtained a better CR than SumRDF, with 2737 nodes vs. 3480 resources and 17430 edges vs. 29621 triples. This comparison is, however, tentative, as our approach compresses vertices independently of the edges, while SumRDF returns triples. We then considered the CQ types in Fig. 10. Comparing our approach vs. SumRDF (see Fig. 10), we recorded an average relative estimation error of only 0.15% vs. 2.5%, and an average query runtime of only 27.55 ms vs. 427.53 ms. As SumRDF does not support disjunctions, Kleene-star/plus queries, or optional queries, further comparisons were not possible.

6 Related Work

Preliminary work on approximate graph analytics in a distributed setting has recently been pursued in [15]. They rather focus on a graph sparsification technique and small samples, in order to approximate the results of specific graph algorithms, such as PageRank and triangle counting on undirected graphs. In contrast, our approach operates in a centralized setting and relies on query-driven graph summarization for graph navigational queries with aggregates.

RDF graph summarization for cardinality estimation has been tackled in [19], albeit for a less expressive data model than ours (plain RDF vs. property graphs). They focus on Basic Graph Patterns (BGP), hence their considered query fragment has limited overlap with ours. As shown in Sect. 5, our approximate evaluation is faster and more accurate on a common set of (non recursive) queries.

An algorithm for answering graph reachability queries, using graph simulation based pattern matching, is given in [13], to construct query preserving summaries. However, it does not consider property graphs or aggregates.

Aggregation-based graph summarization [16] is at the heart of previous approaches, the most notable of which is SNAP [20]. This method is mainly devoted to discovery-driven graph summarization of heterogeneous networks and is unsuitable for approximate query evaluation.

More recently, Rudolf et al. [18] have introduced a graph summary suitable for property graphs, based on a set of input summarization rules. However, it does not support the label-constrained reachability queries considered in this paper. Graph summaries for answering keyword queries that return subgraphs of large networks are studied in [24]. Our query classes significantly differ from theirs.

7 Conclusion

Our paper presents a novel graph summarization method suitable for property graph querying. As the underlying MinSummary decision problem is NP-complete, this technique builds on a heuristic that compresses label frequency information in the nodes of the graph summary. We show the practical effectiveness of our approach in terms of compression ratios, error rates, and query evaluation times. As future work, we plan to investigate the feasibility of our graph summary for other query classes, such as those described in [22]. Also, we aim to apply formal methods, as described in [6], to ascertain the correctness of our approximation algorithm, with provably tight error bounds.