1 Introduction

The ability to efficiently analyze changing data is a key requirement in stream processing [39], complex event recognition [16], business intelligence [35], and machine learning [44]. Generally, the analyses that need to be kept up-to-date, or at least their basic elements, are specified in a query language. The main task is then to efficiently update the query results under data updates.

In this context, we tackle the problem of dynamic query evaluation, where a given query Q has to be evaluated against a database that is constantly changing. Concretely, when database \(\textit{db}\) is updated to database \(\textit{db}+{ u }\) under update \({ u }\), the objective is to efficiently compute \(Q(\textit{db}+{ u })\), taking into consideration that \(Q(\textit{db})\) was already evaluated and recomputations could be avoided. Dynamic query evaluation is of utmost importance if response time requirements for queries under concurrent data updates have to be met or if data volumes are so large that full re-evaluation is prohibitive.

In this paper, we focus on the problem of dynamic evaluation for conjunctive queries that feature multi-way \(\theta \)-joins. The following example illustrates our setting. Assume that we wish to detect potential credit card fraud. Credit card transactions specify their timestamp (ts), account number (acc), and amount (amnt). A typical fraud pattern is that, in a short period of time, a criminal tests a stolen credit card with a few small purchases to then make larger purchases (cf. [37]). Assuming that the short period of time is 1 h, this pattern could be detected by dynamically evaluating the query in Fig. 1a. Queries like this may exhibit arbitrary local predicates and multi-way joins with equality as well as inequality predicates.

Fig. 1: a Example query for detecting fraudulent credit card activity. b Delta query of (a), to be executed upon insertion of a new high-amount transaction tuple t

Dynamic query evaluation has a rich history in data management and has been researched in the context of incremental view maintenance (IVM) [22, 29, 30], stream join processing [4, 20, 34, 43], and complex event recognition (CER, also known as complex event processing) [11, 15, 31, 46, 49]. All of the existing techniques are based on recomputation of query (sub)results and/or on their materialization. We next illustrate the issues with recomputation and result materialization; a detailed literature review is given in Sect. 2.

The most extreme form of recomputation is of course full recomputation: simply evaluate Q on \(\textit{db}+ { u }\) from scratch whenever an update \({ u }\) arrives on \(\textit{db}\). Clearly, this incurs the highest possible update processing cost, since it completely ignores the fact that \(Q(\textit{db})\) was already evaluated and certain shared computations were performed. This can be solved by introducing the simplest form of materialization: simply store the current result \(Q(\textit{db})\) and, whenever an update \({ u }\) arrives, evaluate the delta query \(\Delta Q\) associated with Q [22]. \(\Delta Q\) takes as input \(\textit{db}\) and \({ u }\) and computes the update that needs to be applied to the materialized \(Q(\textit{db})\) to obtain \(Q(\textit{db}+ { u })\). While this exploits certain shared computation, it unfortunately does so only to a limited extent.

Let us illustrate this by means of our example fraud query in Fig. 1a. Assume that \({ u }\) inserts a single new high-amount transaction t. Then, \(\Delta Q(\textit{db},{ u })\) amounts to computing the join shown in Fig. 1b. While we can expect that this is more efficient than full recomputation of Q (since the join with \(T_3\) is replaced by the join with a single tuple), observe that if we now get another update \({ u }'\) that inserts another high-amount transaction \(t'\) then we are essentially fully recomputing the join of Fig. 1b, but now with \(t'\) instead of t. Hence, by fully recomputing \(\Delta Q\) whenever a new update \({ u }'\) arrives, we are ignoring the fact that certain computation—most notably, the join between \(T_1\) and \(T_2\) common to t and \(t'\)—is shared and in principle need not be redone.

This can be solved by introducing more materialization: in addition to storing \(Q(\textit{db})\), also materialize the result of the join between \(T_1\) and \(T_2\) indicated by the shaded area in Fig. 1a and use additional delta queries to maintain this result as well as \(Q(\textit{db})\). In the IVM literature, this approach is known as higher-order IVM (HIVM). This method is highly effective in practice and formally lowers the update processing complexity [29].

While more materialization can imply less recomputation, it has a serious drawback in terms of memory overhead: materialization of \(Q(\textit{db})\) requires \(\Omega (|{Q(\textit{db})}|)\) space, where \(|{\cdot }|\) denotes size in number of tuples. Therefore, when \(Q(\textit{db})\) is large, which is often the case in CER as well as in data preparation scenarios for training statistical models, materializing \(Q(\textit{db})\) quickly becomes impractical, especially for main memory-based systems. Note that \(|{Q(\textit{db})}|\) can be polynomial in \(|{\textit{db}}|\). HIVM is even more affected by this problem since it not only materializes the result of Q but also the results of partial joins, which can be larger than both \(\textit{db}\) and \(Q(\textit{db})\). For example, the shaded area of Fig. 1a builds the table of all pairs of small transactions that could be part of a credit card fraud. If we assume that there are N small transactions, all of the same account, this materialization will take \(\Theta (N^2)\) space. This naturally becomes impractical when N grows.

In summary, in traditional techniques for dynamic query evaluation, there is a trade-off between recomputation and materialization: more materialization can mean less recomputation and hence faster update processing, but more memory consumption. In previous works [25, 26], we have shown that this trade-off can be avoided by taking a different approach to dynamic query evaluation: instead of materializing \(Q(\textit{db})\), we can build a succinct data structure that (1) supports updates efficiently and (2) represents \(Q(\textit{db})\) in the sense that from it we can generate \(Q(\textit{db})\) as efficiently as if it was materialized. In particular, the representation is equipped with index structures so that we can enumerate \(Q(\textit{db})\) with bounded delay [38]: one tuple at a time, while spending only a small amount of work to produce each new tuple. This makes the enumeration competitive with enumeration from materialized query results.

In essence, we hence separate dynamic query processing into two stages: (1) an update stage where we only maintain under updates the (small) information that is necessary for result enumeration and (2) an enumeration stage where the query result is efficiently enumerated.

The main insight of [25, 26] is that a practical family of algorithms for dynamic query evaluation based on this idea can naturally be obtained by modifying Yannakakis’ seminal algorithm for processing acyclic joins in the static setting [47]. In particular, instead of materializing \(Q(\textit{db})\) and its subjoins (which can both be large), it suffices to materialize semijoins (which are of size linear in \(\textit{db}\)) and build indexes on these semijoins as well as the base relations. The most general form of this modification, called general dynamic Yannakakis (GDyn), supports the class of acyclic generalized conjunctive queries (GCQs), which are acyclic conjunctive queries with \(\theta \)-joins, that are evaluated under multiset semantics and support certain forms of aggregation. The representation of query results that underlies \({\textsc {GDyn}} \) has several desirable properties:

  • It allows enumeration of \(Q(\textit{db})\) with bounded delay.

  • It requires only \({\mathcal {O}}(|{\textit{db}}|)\) space and is hence independent of the size of \(Q(\textit{db})\).

  • It features efficient maintenance under updates. When Q is a conjunctive query (with equijoins only), then we can update the representation of \(Q(\textit{db})\) to a representation of \(Q(\textit{db}+ { u })\) in time \({\mathcal {O}}(|{\textit{db}}| + |{u}|)\). In contrast, techniques based on (H)IVM may require \(\Omega (|{u}| + |{Q(\textit{db}+u)}|)\) time in the worst case. For the subclass of q-hierarchical queries [8], our update time is \({\mathcal {O}}(|{u}|)\). When Q consists of both equality and inequality joins (\(<, \le \)), the update time increases. If Q has at most one inequality per pair of relations, the update time is \({\mathcal {O}}(M \log M)\), where \(M = |{\textit{db}}| + |{{ u }}|\); otherwise, it is \({\mathcal {O}}(M^2)\). In contrast, existing techniques may require \(\Omega (|{\textit{db}}|^{k-1})\) time in the worst case, where k is the number of relations to be joined.

In this paper, we consolidate and expand our development of GDyn. Our contributions are as follows.

  1. We give an intuitive, concise, and stand-alone description of \({\textsc {GDyn}} \) and explain the main components behind its efficiency. In addition, we provide, for the first time, a full formal proof of GDyn's correctness and complexity (Sect. 4).

  2. Like most query evaluation algorithms, \({\textsc {GDyn}} \)'s operation is driven by the availability of a query plan. In our previous work, we have always assumed query plans to be explicitly given. In this paper, in contrast, we present, for the first time, an algorithm for efficiently computing GDyn query plans (Sect. 5).

  3. Finally, we present our implementation of \({\textsc {GDyn}} \) (Sect. 6) and an extended experimental comparison of \({\textsc {GDyn}} \) to state-of-the-art IVM and CER engines. We explore the full design space of queries with up to three joins. GDyn performs consistently better than the competitor systems, with improvements of multiple orders of magnitude in both time and memory consumption (Sect. 7).

We discuss related work in Sect. 2, introduce the required background in Sect. 3, and conclude in Sect. 8. Because of space restrictions, certain supporting material and proofs of auxiliary statements are given in the Appendix.

2 Related work

IVM The trade-off between materialization and recomputation explained in the Introduction is at the core of IVM [12, 21, 22, 29, 30]. IVM hence differs from GDyn as already explained in the Introduction.

CER There are two approaches to CER: relational and automaton-based. Relational approaches (e.g., [31]) are similar to IVM. In contrast to the relational approaches, automaton approaches assume that event tuples are endowed with a timestamp and that the arrival order of event tuples corresponds to the timestamp order (i.e., there are no out-of-order events). They build an automaton to recognize the desired temporal patterns in the input stream. Broadly speaking, there are two automaton-based recognition approaches. In the first approach, followed by [3, 46], events are cached per state and once a final state is reached, a search through the cached events is done to recognize the complex events. While it is no longer necessary to check the temporal constraints (e.g., \(T_1.{\text {ts}} < T_2.\text {ts}\)) during the search, the additional constraints (in our fraud query example \(T_1.\text {acc} = T_2.\text {acc} = T_3.\text {acc}\)) must still be verified. In essence, this corresponds to fully recomputing a delta query since each event triggering a transition to a final state may cause re-evaluation of a subjoin on the cached data. In the second approach, followed by [11, 14, 15, 49], partial runs are materialized according to the automaton's topology. For our example query, this means that, just like HIVM, the join in the shaded area of Fig. 1a is materialized and maintained, so it is available when a large-amount transaction arrives. This approach hence shares with HIVM its high memory overhead.

Stream Joins The goal of stream join processing is to produce join results incrementally as new tuples are added to the input or old tuples retracted. To see how GDyn differs from stream joins, we discern two classes of stream join algorithms.

Algorithms in the first class are designed to produce output tuples as soon as possible, without blocking for more input to become available. They are based on the symmetric hash join [45] and its variants [41, 43]. Algorithms in this class favor full recomputation of delta queries and hence ignore the opportunity to reduce redundant recomputation by additional materialization. Moreover, these algorithms are limited to processing equijoins only.

Algorithms in the second class focus on window-based stream joins [20, 27, 34, 40]. Like the automaton-based approaches in CER, they assume that each tuple is endowed with a timestamp attribute and are restricted to the setting where tuples follow a FIFO paradigm: new tuples arrive in increasing timestamp order and tuples with the oldest timestamp are deleted first. This property is crucial for these algorithms' proposed optimizations to work. GDyn, in contrast, makes no FIFO assumption and can deal with arbitrary updates, including out-of-order updates. Furthermore, we note that window-based joins are a strict subclass of the class of all inequality joins, since in a window-based join, only a single temporal attribute will be compared across all relations. As such, queries like \(R_1 \bowtie _{R_1.\textit{amnt}< R_2.\textit{amnt}} R_2 \bowtie _{R_2.\textit{ts} < R_3.\textit{ts}} R_3\) that apply inequality joins across multiple unrelated attributes are not considered by [20, 27, 34, 40]. GDyn, in contrast, processes such queries intelligently. Finally, we note that, in contrast to GDyn, support for multi-way stream joins is limited: [27, 34, 40] consider only binary joins, while [20] treats multi-way joins, but makes the simplifying assumption that all comparisons are on the same, single attribute.

Because of the increasingly widespread use of distributed compute engines such as Flink, Spark, and Storm in contemporary data analysis scenarios, there has also been much research on how to support stream joins in such engines (e.g., [44]). To the best of our knowledge, this work builds upon the above-mentioned centralized stream join algorithms (while tackling additional challenges such as distribution and fault tolerance) and hence similarly differs from GDyn as described above. We leave the extension of GDyn to parallel and distributed settings as future work.

Query evaluation with constant delay enumeration has gained increasing attention in the last decade [6, 7, 8, 10, 25, 32, 33, 36, 38]. This setting, however, deals with equijoins only.

Inequality joins Also related, although restricted to the static setting, is the practical evaluation of binary [17, 18, 23] and multi-way [9, 48] inequality joins. Our work, in contrast, considers dynamic processing of multi-way \(\theta \)-joins, with a specialization to inequality joins. Khayyat et al. [28] proposed fast multi-way inequality join algorithms based on sorted arrays and space-efficient bit arrays. They focus on the case where there are exactly two inequality conditions per pairwise join. While they also present an incremental algorithm for pairwise joins, their algorithm makes no effort to minimize the update cost of multi-way joins. As a result, they either materialize subresults (implying a space overhead that can be more than linear) or recompute subresults.

3 Preliminaries

Query Language Throughout the paper, let \(x, y, z, \ldots \) denote variables (also commonly called column names or attributes). A hyperedge is a finite set of variables. We use \(\overline{x}, \overline{y}, \ldots \) to denote hyperedges. A generalized conjunctive query (GCQ) is an expression of the form

$$\begin{aligned} Q = \pi _{\overline{y}} \big (r_1(\overline{x_1}) \bowtie \ldots \bowtie r_n(\overline{x_n})\mid \bigwedge _{i=1}^m \theta _i(\overline{z_i})\big ). \end{aligned}$$
(1)

Here \(r_1,\ldots , r_n\) are relation symbols; \(\overline{x_1},\ldots , \overline{x_n}\) are hyperedges (of the same arity as \(r_1,\ldots , r_n\)); \(\theta _1,\ldots ,\theta _m\) are predicates over \(\overline{z_1},\ldots ,\overline{z_m}\), respectively; and both \(\overline{y}\) and \(\bigcup _{i=1}^m\overline{z_i}\) are subsets of \(\bigcup _{i=1}^n \overline{x_i}\). We treat predicates abstractly: for our purpose, a predicate over \(\overline{x}\) is a (not necessarily finite) decidable set \(\theta \) of tuples over \(\overline{x}\). For example, \(\theta (x,y)=x < y\) is the set of all tuples \((a,b)\) satisfying \(a<b\). We indicate that \(\theta \) is a predicate over \(\overline{x}\) by writing \(\theta (\overline{x})\). Throughout the paper, we consider only non-nullary predicates, i.e., predicates with \(\overline{x} \not = \emptyset \).

Example 1

The following query is hence a GCQ.

$$\begin{aligned} \pi _{x,y,z} \big (r(x,y) \bowtie s(z,u) \bowtie t(v,w) \mid x< u, z < w\big ). \end{aligned}$$

Since, as usual, the natural join between relations that have disjoint schemas is simply their Cartesian product, this query asks to take the Cartesian product of \(r(x,y)\), \(s(z,u)\), and \(t(v,w)\); subsequently select those tuples that satisfy \(x < u\) and \(z < w\); and finally project on x, y, z. Likewise, the GCQ

$$\begin{aligned} \pi _{y,z} \big (r(x,y)\bowtie s(y,z)\bowtie t(z,v) \mid y<v\big ) \end{aligned}$$

asks to take the natural join of \(r(x,y)\), \(s(y,z)\), and \(t(z,v)\) (where \(r(x,y)\) and \(s(y,z)\) equijoin on y, and \(s(y,z)\) and \(t(z,v)\) equijoin on z), subsequently select those tuples that satisfy \(y < v\), and project on y, z.

We call \(\overline{y}\) the output variables of Q and denote it by \({\textit{out}}(Q)\). If \(\overline{y} = \overline{x_1} \cup \cdots \cup \overline{x_n}\), then Q is called a full query and we may omit the symbol \(\pi _{\overline{y}}\) altogether for brevity. We denote by \(\textit{full}(Q)\) the full \(GCQ \) obtained from Q by setting \({\textit{out}}(Q)\) to \(\overline{x_1} \cup \cdots \cup \overline{x_n}\). The elements \(r_i(\overline{x_i})\) are called atoms. \(\textit{at}(Q)\) denotes the set of all atoms in Q and \({\textit{pred}}(Q)\) the set of all predicates in Q. A conjunctive query (or CQ) is a GCQ where \({\textit{pred}}(Q) = \emptyset \).

Semantics We evaluate GCQs over generalized multiset relations (GMRs for short) [25, 29, 30]. Let \({{\,\mathrm{dom}\,}}(x)\) denote the domain of variable x. As usual, a tuple over \(\overline{x}\) is a function \(\mathbf {t}\) that assigns a value from \({{\,\mathrm{dom}\,}}(x)\) to every \(x \in \overline{x}\). \({\mathbb {T}}[\overline{x}]\) denotes the set of all tuples over \(\overline{x}\). A GMR over \(\overline{x}\) is a function \(R:{\mathbb {T}}[\overline{x}]\rightarrow {\mathbb {Z}}\) mapping tuples over \(\overline{x}\) to integers such that \(R(\mathbf {t}) \not = 0\) for finitely many tuples \(\mathbf {t}\). In contrast to classical multisets, the multiplicity of a tuple in a GMR can hence be negative, allowing insertions and deletions to be treated uniformly. We write \(\textit{var}(R)\) for \(\overline{x}\); \({{\,\mathrm{supp}\,}}(R)\) for the set of all tuples with nonzero multiplicity in R; \(\mathbf {t} \in R\) to indicate \(\mathbf {t} \in {{\,\mathrm{supp}\,}}(R)\); and \(|{R}|\) for \(|{{{\,\mathrm{supp}\,}}(R)}|\).

Figure 2 illustrates the operations of GMR union (\(R+S\)), minus \((R - S)\), projection (\(\pi _{\overline{z}} R\)), natural join (\(R \bowtie T\)), and selection (\(\sigma _P(R)\)), which are defined as in relational algebra with multiset semantics. See [25, 30] for formal semantics. We stress that, as usual, if R and T have disjoint schemas, then \(R \bowtie T\) is simply their Cartesian product.
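For concreteness, the following minimal sketch (in Python; the dictionary encoding and the gmr_* names are ours, not the paper's) models a GMR as a map from tuples to nonzero signed multiplicities and implements union and projection; minus, join, and selection follow the same pattern.

```python
from collections import defaultdict

# A tuple over variables {x, y, ...} is encoded as a sorted tuple of
# (variable, value) pairs; a GMR is a dict from such tuples to nonzero
# integer multiplicities (negative multiplicities encode deletions).

def gmr_add(R, S):
    """Union R + S: multiplicities are summed; tuples summing to 0 vanish."""
    out = dict(R)
    for t, m in S.items():
        n = out.get(t, 0) + m
        if n == 0:
            out.pop(t, None)
        else:
            out[t] = n
    return out

def gmr_project(R, keep):
    """Projection pi_keep(R): multiplicities of tuples agreeing on the
    variables in `keep` are summed, so opposite signs may cancel."""
    out = defaultdict(int)
    for t, m in R.items():
        out[tuple((x, v) for x, v in t if x in keep)] += m
    return {t: m for t, m in out.items() if m != 0}
```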

A GMR R is positive if \(R(\mathbf {t}) > 0\) for all \(\mathbf {t} \in {{\,\mathrm{supp}\,}}(R)\). A database over a set \({\mathcal {A}}\) of atoms is a function \(\textit{db}\) that maps every atom \(r(\overline{x}) \in {\mathcal {A}}\) to a positive GMR \(\textit{db}_{r(\overline{x})}\) over \(\overline{x}\). Given a database \(\textit{db}\) over the atoms occurring in query Q, the evaluation of Q over \(\textit{db}\), denoted \(Q(\textit{db})\), is the GMR over \(\overline{y}\) constructed in the expected way: take the natural join of the GMRs in the database, do a selection over the result w.r.t. each predicate and finally project on \(\overline{y}\). It is instructive to note that after evaluation, each result tuple has an associated multiplicity that counts the number of derivations for the tuple. In other words, the query language has built-in support for COUNT aggregations. We note that, in their full generality, GMRs can carry multiplicities that are taken from an arbitrary algebraic semiring structure (cf., [29]), which can be useful to describe the computation of more advanced aggregations over the result of a GCQ [2]. To keep the notation and discussion simple, we fix the ring \({\mathbb {Z}}\) of integers throughout the paper, but our results generalize to arbitrary semirings and their associated aggregations.

Fig. 2: Operations on GMRs

Semijoins If \(\Theta \) is a set of predicates, then we write \(\sigma _{\Theta } R\) for \(\sigma _{\bigwedge _{\theta \in \Theta } \theta } R\) and \(R \bowtie _{\Theta } S\) for \(\sigma _{\Theta }(R \bowtie S)\). If \(\overline{z} \subseteq \textit{var}(R)\) or \(\overline{z} \subseteq \textit{var}(S)\), then \(\pi _{\overline{z}}(R \bowtie _{\Theta } S)\) is called a semijoin. We write \(R \ltimes _{\Theta } S\) for the subGMR of R consisting of all tuples that have a joining tuple in S:

$$\begin{aligned}&R \ltimes _{\Theta } S \in {\mathbb {T}}[\textit{var}(R)] \rightarrow {\mathbb {Z}}:\\&\quad \mathbf {t} \mapsto {\left\{ \begin{array}{ll} R(\mathbf {t})&{}\quad \text { if }\, \mathbf {t} \in \pi _{\textit{var}(R)}(R \bowtie _{\Theta } S)\\ 0 &{}\quad \text { otherwise} \end{array}\right. } \end{aligned}$$

Often, S will contain only a single tuple \(\mathbf {t}\) with multiplicity 1. In that case, we simply write \(R \ltimes _{\Theta } \mathbf {t}\).

Updates An update to a GMR R is simply a GMR \(\Delta {R}\) over the same variables as R. Applying update \(\Delta {R}\) to R yields the GMR \(R + \Delta {R}\). An update to a database \(\textit{db}\) is a collection u of (not necessarily positive) GMRs, one GMR \(u_{r({\overline{x}})}\) for every atom \(r(\overline{x})\) of \(\textit{db}\), such that \(\textit{db}_{r(\overline{x})} + { u }_{r(\overline{x})}\) is positive. We write \(\textit{db}+ u\) for the database obtained by applying u atom-wise to \(\textit{db}\).

Computational Model We focus on dynamic query evaluation in main memory. We assume a model of computation where the space used by tuple values and integers, the time of arithmetic operations on integers, the time of operations on tuples (such as projecting a tuple on a subset of its variables, or taking the union of two tuples), and the time of memory lookups are all \({\mathcal {O}}(1)\). We further assume that hash tables have \({\mathcal {O}}(1)\) access and update times while requiring linear space. While it is well known that real hash table access is \({\mathcal {O}}(1)\) expected time and updates are \({\mathcal {O}}(1)\) amortized, complexity results that we establish for this simpler model can be expected to translate to average (amortized) complexity in real-life implementations [13].

A direct consequence of these assumptions is that, using standard database implementation techniques, every GMR R can be represented in our model by a data structure that allows (1) enumeration of R with delay \({\mathcal {O}}(1)\) (as defined in Sect. 4.1); (2) multiplicity lookups \(R(\mathbf {t})\) in \({\mathcal {O}}(1)\) time given \(\mathbf {t}\); (3) single-tuple insertions and deletions in \({\mathcal {O}}(1)\) time; while (4) having size that is proportional to \(|{R}|\). In addition, we assume it possible to sort a GMR by a given order on its tuples in \({\mathcal {O}}(|{R}| \log |{R}|)\) time, after which it allows enumeration in the given order with \({\mathcal {O}}(1)\) delay. Single-tuple insertions that keep the GMR sorted become \({\mathcal {O}}(\log |{R}|)\) in this case.

4 General dynamic Yannakakis

In this section, we formulate \({\textsc {GDyn}} \), a dynamic version of the Yannakakis algorithm [47] that focuses on the evaluation of \(GCQs \). GDyn takes a nonstandard approach to dynamic query evaluation: instead of materializing \(Q(\textit{db})\) and subjoins of Q, GDyn builds a succinct, efficiently updatable data structure that represents \(Q(\textit{db})\) in the sense that from it we can enumerate \(Q(\textit{db})\). Formally, a data structure D supports enumeration of a set E if there is a routine \(\textsc {enum}\) such that \(\textsc {enum}(D)\) outputs each element of E exactly once. Such enumeration occurs with delay d if the time until the first element is output, the time between any two consecutive elements, and the time between the last element and the termination of \(\textsc {enum}(D)\) are all bounded by d. D supports enumeration of a GMR R if it supports enumeration of the set \(E_R = \{ (\mathbf {t}, R(\mathbf {t})) \mid \mathbf {t} \in {{\,\mathrm{supp}\,}}(R)\}\).

When evaluating a GCQ Q over a database \(\textit{db}\), we will be interested in representing the elements of \(Q(\textit{db})\) by means of a data structure \(D_{\textit{db}}\), such that we can enumerate \(Q(\textit{db})\) from \(D_{\textit{db}}\). If, for every \(\textit{db}\), the delay to enumerate \(Q(\textit{db})\) from \(D_{\textit{db}}\) is sublinear in \(|{\textit{db}}|\), then we say that the enumeration occurs with sublinear delay. Similarly, if the delay is independent of \(|{\textit{db}}|\), then we say that the enumeration occurs with constant delay [38]. Note that this is sublinear/constant in data complexity [42]: the delay may still depend on the size of the query Q. This is reasonable since Q specifies the arity of the query result, and a larger arity inherently implies a longer delay between elements. Note that enumeration with constant delay is what we typically obtain by materializing \(Q(\textit{db})\). For example, if we store the elements of \(Q(\textit{db})\) in an array, then enumerating \(Q(\textit{db})\) amounts to scanning the array where each element access is \({\mathcal {O}}(1)\).

We start the development of GDyn by first giving some intuition into constant delay enumeration (henceforth: CDE) in Sect. 4.1. The algorithm itself is stated in Sect. 4.2, while proofs of its correctness and an analysis of its complexity are given in Sect. 4.3. Finally, we specialize \({\textsc {GDyn}} \) to the case where all \(\theta \)-joins are inequality joins in Sect. 4.4.

4.1 Intuition

In this section, we discuss how we can obtain CDE of the result \(Q(\textit{db})\) of a GCQ Q. Of course, the simplest way to obtain this is simply to materialize \(Q(\textit{db})\). Unfortunately, this requires memory proportional to \(|{Q(\textit{db})}|\) which, depending on Q, can be of size polynomial in \(|{\textit{db}}|\). We hence desire other data structures to represent \(Q(\textit{db})\) using less space, while still allowing CDE. Let us build some intuition on how this can be done by considering, in turn, three queries of increasing complexity:

$$\begin{aligned} Q_1&= r(x,y) \bowtie s(y,z), \\ Q_2&= s(y,z) \bowtie t(z,v) \mid y< v, \quad \text {and} \\ Q_3&= r(x,y) \bowtie s(y,z) \bowtie t(z,v) \mid y < v. \end{aligned}$$

Throughout our discussion, assume that the GMRs assigned to \(r(x,y)\), \(s(y,z)\), and \(t(z,v)\) by input database \(\textit{db}\) are R, S, and T, respectively.

It is instructive to start with the simple binary equijoin query \(Q_1 = r(x,y) \bowtie s(y,z)\) and analyze why traditional join processing algorithms do not yield CDE. Suppose that we evaluate \(Q_1\) using a simple in-memory hash join with R as probe relation and S as build relation. Assume that the corresponding hash index of S on y has already been computed. Concretely, this hash index allows us to retrieve, for every y-tuple \(\mathbf {t}\), in \({\mathcal {O}}(1)\) time a pointer to the GMR \(S \ltimes \mathbf {t}\) of all S-tuples that join with \(\mathbf {t}\). Now observe that, when iterating over R to probe the index, we may have to visit an unbounded number of R-tuples that do not join with any S-tuple. Consequently, the delay between consecutive outputs may be as large as \(|{R}|\). A similar analysis shows that other join algorithms, such as the sort-merge join, do not yield CDE.

How then can we obtain CDE for \(r(x,y) \bowtie s(y,z)\)? Intuitively, if we ensure that we only iterate over those R-tuples that have matching S-tuples, we trivially obtain constant delay, since then every probe will yield at least one output tuple. As such, the key is to first compute the semijoin \(\rho _{x,y} = \pi _{x,y}(R \bowtie S)\). We can then iterate over the elements of \(\rho _{x,y}\), probing S in each iteration to generate the output with constant delay. Note that, because \(\rho _{x,y}\) is a semijoin, the space needed to store \(\rho _{x,y}\) is linear in \(|{\textit{db}}|\).
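The following sketch (Python, continuing the dictionary encoding of GMRs from Sect. 3; the names are ours) makes this concrete for \(Q_1\): after the semijoin, every probe of the hash index is guaranteed to produce output.

```python
from collections import defaultdict

def cde_q1(R, S):
    """Enumerate r(x,y) |><| s(y,z) with constant delay. R maps (x, y)
    pairs to multiplicities; S maps (y, z) pairs to multiplicities."""
    # Hash index of S on y: y-value -> list of ((y, z), multiplicity).
    idx = defaultdict(list)
    for (y, z), m in S.items():
        idx[y].append(((y, z), m))
    # Semijoin rho_{x,y} = pi_{x,y}(R |><| S): only R-tuples with a match.
    rho_xy = {t: m for t, m in R.items() if t[1] in idx}
    # Every probe yields at least one output tuple, so the delay between
    # consecutive outputs is O(1).
    for (x, y), mr in rho_xy.items():
        for (_, z), ms in idx[y]:
            yield (x, y, z), mr * ms
```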

CDE for queries that feature \(\theta \)-joins can be obtained similarly. Consider \(Q_2 = (s(y,z) \bowtie t(z,v) \mid y < v)\), which is a combination of an equijoin on z and an inequality join on \(y < v\). To obtain CDE for \(Q_2\), first compute the semijoin \(\rho _{y,z} = \pi _{y,z}( S(y,z) \bowtie _{y<v} T(z,v) )\), which consists of all tuples in S that have a matching tuple in T. Assume for a moment that we have a more powerful index structure I that allows, for any \(\{y,z\}\)-tuple \(\mathbf {s}\), to enumerate \(T \ltimes _{y < v} \mathbf {s}\) with constant delay. We can then easily enumerate \(Q_2(\textit{db})\) with constant delay by iterating over \(\mathbf {s} \in \rho _{y,z}\), and for each such \(\mathbf {s}\), probe I to produce the tuples \(\mathbf {t} \in T \ltimes _{{y < v}} \mathbf {s}\), outputting \((\mathbf {s} \cup \mathbf {t}, S(\mathbf {s}) \times T(\mathbf {t}))\) for each such \(\mathbf {s}\) and \(\mathbf {t}\). Since \(T \ltimes _{{y < v}} \mathbf {s}\) allows CDE and multiplicity lookups are \({\mathcal {O}}(1)\), the entire procedure is CDE. The key question then is how we can build this more powerful index structure I. The solution is to group T on z; subsequently, sort each group in descending order on v; and create a normal (hash-based) index J that allows finding the group for each z-value. This now supports CDE of \(T\ltimes _{y < v} \mathbf {s}\): first, use J to get a pointer to the group with z-value \(\mathbf {s}(z)\) in \({\mathcal {O}}(1)\) time, and then enumerate this group with constant delay and in decreasing order on v. Yield the current tuple \(\mathbf {t}\) that is being enumerated in this fashion, provided that \(\mathbf {s}(y) < \mathbf {t}(v)\). As soon as \(\mathbf {s}(y) \ge \mathbf {t}(v)\), we know that all subsequent \(\mathbf {t}\) will fail the inequality, and we can hence terminate. The lower left of Fig. 3 illustrates S, T, \(\rho _{y,z}\), and J.
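A minimal sketch of this index J on T (Python, illustrative names; in an actual implementation the sorted groups would be maintained incrementally under updates):

```python
from collections import defaultdict

def build_ineq_index(T):
    """Index T (a dict from (z, v) to multiplicity) for probes of the
    form T |x_{y<v} s: group on z, sort each group descending on v."""
    groups = defaultdict(list)
    for (z, v), m in T.items():
        groups[z].append((v, m))
    for g in groups.values():
        g.sort(reverse=True)  # decreasing on v
    return groups

def probe(groups, z, y):
    """Enumerate T |x_{y<v} s for s with s(z) = z and s(y) = y, with
    constant delay: walk the z-group in decreasing v, stop once v <= y."""
    for v, m in groups.get(z, ()):
        if y < v:
            yield (z, v), m
        else:
            break  # all remaining tuples fail the inequality
```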

Fig. 3: Illustration of query \(Q_3\)

CDE for queries that join more than two relations can be obtained similarly, but now by computing nested semijoins. Figure 3 illustrates how to obtain CDE for \(Q_3\). Concretely, we first ensure CDE of the subquery \(Q_2\) of \(Q_3\) as already explained above by computing the semijoin \(\rho _{y,z} = \pi _{y,z}( S(y,z) \bowtie _{y<v} T(z,v) )\) and suitably indexing T. Then CDE of \(Q_3(\textit{db})\) is obtained by observing that \(Q_3(\textit{db}) \equiv R \bowtie Q_2(\textit{db})\), where \(R \bowtie Q_2(\textit{db})\) is a binary equijoin, which can hence be treated completely analogously to \(Q_1\). Concretely, compute the nested semijoin \(\rho _{x,y} = \pi _{x,y}(R \bowtie \rho _{y,z})\), and build a hash index of \(\rho _{y,z}\) on y. (This index is depicted as \(I_{\rho _{y,z}}\) in Fig. 3). Enumeration of \(Q_3(\textit{db})\) is done by iterating over the tuples \(\mathbf {r} \in \rho _{x,y}\), and for each such tuple \(\mathbf {r}\), use \(I_{\rho _{y,z}}\) to get a pointer to \(\rho _{y,z} \ltimes \mathbf {r}\), which consists of all \(\mathbf {s} \in \rho _{y,z}\) that equijoin with \(\mathbf {r}\). Iterate over these \(\mathbf {s}\) with constant delay, and finally use the more advanced index on T to enumerate all tuples \(\mathbf {t} \in T \ltimes _{y < v} \mathbf {s}\). For each such \(\mathbf {r}, \mathbf {s}\), and \(\mathbf {t}\), we output \(\left( \mathbf {r} \cup \mathbf {s} \cup \mathbf {t}, R(\mathbf {r}) \times S(\mathbf {s}) \times T(\mathbf {t})\right) \). By construction of \(\rho _{x,y}\), we are ensured that a matching \(\mathbf {s}\) exists for every \(\mathbf {r}\). Similarly, a matching \(\mathbf {t}\) exists for every \(\mathbf {s}\) by construction of \(\rho _{y,z}\). Therefore, each tuple that we iterate over will produce a new output, and the entire enumeration of \(Q(\textit{db})\) is CDE.
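Combining the two previous sketches gives CDE for \(Q_3\) (a sketch only: here the semijoins are recomputed up front, whereas GDyn maintains them incrementally under updates):

```python
from collections import defaultdict

def cde_q3(R, S, T):
    """Enumerate Q3 = r(x,y) |><| s(y,z) |><| t(z,v) with y < v, with
    constant delay, reusing build_ineq_index/probe from the sketch above."""
    J = build_ineq_index(T)
    # rho_{y,z}: S-tuples with at least one T-partner under y < v.
    rho_yz = {(y, z): m for (y, z), m in S.items()
              if next(probe(J, z, y), None) is not None}
    # Hash index I_{rho_{y,z}} on y (cf. Fig. 3).
    I = defaultdict(list)
    for (y, z), m in rho_yz.items():
        I[y].append((z, m))
    # rho_{x,y}: R-tuples with an equijoin partner in rho_{y,z}.
    rho_xy = {t: m for t, m in R.items() if t[1] in I}
    for (x, y), mr in rho_xy.items():   # every iteration step yields output
        for z, ms in I[y]:
            for (_, v), mt in probe(J, z, y):
                yield (x, y, z, v), mr * ms * mt
```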

In conclusion As the examples above illustrate, we can obtain CDE for GCQs by computing (nested) semijoins, and suitably indexing both base relations and semijoin results for enumeration. Because the only additional relations that we compute are obtained by semijoining existing relations, the size of all additional GMRs that are stored is linear in the input \(\textit{db}\). Contrast this to techniques that materialize subjoin results, whose size may become polynomial in the database.

Updates We finish this section by remarking that, in the presence of updates, this approach is only valid if we materialize and maintain all required semijoin results. To speed up the maintenance of semijoin results under updates, it is sometimes beneficial to create additional indexes that help in incremental computation of the semijoins, as we illustrate next. Reconsider \(Q_3\) as illustrated in Fig. 3. If we receive an update \(\Delta T\) to T, then we need to correspondingly update \(\rho _{y,z}\) from \(\pi _{y,z}( S \bowtie _{y<v} T)\) to \(\pi _{y,z}( S \bowtie _{y<v} (T + \Delta T))\). To that end, it suffices to compute \(\Delta \rho _{y,z} = \pi _{y,z}( S \bowtie _{y<v} \Delta T)\) and add this to \(\rho _{y,z}\). Computing \(\Delta \rho _{y,z}\) by means of a nested loop join has \(\Omega (|{S}| \times |{\Delta T}|)\) complexity. We can do better if we index S by sorting S lexicographically on (z, y), in decreasing order (this is actually how S is depicted in Fig. 3). \(\Delta \rho _{y,z}\) can then be computed by means of a hybrid form of sort-merge and index nested loop join. First, group \(\Delta T\) on z and, per group, sort tuples in decreasing order on variable v. Create a hash index on \(\Delta T\) to be able to quickly find each group by a given z-value. Second, iterate over the tuples in S in the given lexicographic order. For each z-group in S, find the corresponding group in \(\Delta T\) by passing the z-value to the hash table. Let \(\mathbf {s}\) be the first tuple in the S-group. Iterate over the tuples of the \(\Delta T\) group in decreasing order on v, summing up their multiplicities as long as \(\mathbf {s}(y) < v\). Add \(\mathbf {s}\) to \(\Delta \rho _{y,z}\), with its original multiplicity in S multiplied by the found sum (provided that the sum is nonzero). Then consider the next tuple in the S-group, continue summing from the current tuple in the \(\Delta T\) group as long as \(\mathbf {s}(y) < v\), and add the result tuple with the correct multiplicity. Repeat this process for each tuple in the S-group, and for each group in S. Assuming that the index on S already exists, the total cost of computing \(\Delta \rho _{y,z}\) in this way is \({\mathcal {O}}(|{S}| + |{\Delta T}| + |{\Delta T}| \log |{\Delta T}|)\): we scan S and \(\Delta T\) only once, sort \(\Delta T\), and create a hash table on its groups. This is much better than the \({\mathcal {O}}(|{S}| \times |{\Delta T}|)\) complexity of a nested loop.
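The following sketch implements this hybrid merge (Python, illustrative names; it assumes S is kept as a list sorted lexicographically decreasing on (z, y), as in Fig. 3):

```python
from collections import defaultdict

def delta_rho_yz(S_sorted, dT):
    """Compute Delta rho_{y,z} = pi_{y,z}(S |><|_{y<v} Delta T) in
    O(|S| + |dT| log |dT|) time. S_sorted: list of ((z, y), mult) sorted
    decreasing; dT: dict from (z, v) to (possibly negative) multiplicity."""
    groups = defaultdict(list)          # group dT on z, ...
    for (z, v), m in dT.items():
        groups[z].append((v, m))
    for g in groups.values():
        g.sort(reverse=True)            # ... each group decreasing on v
    delta, cur_z, pos, acc = {}, object(), 0, 0
    for (z, y), ms in S_sorted:
        if z != cur_z:                  # entering a new z-group of S
            cur_z, pos, acc = z, 0, 0
        g = groups.get(z, [])
        while pos < len(g) and g[pos][0] > y:
            acc += g[pos][1]            # running sum over all v > y
            pos += 1
        if acc != 0:
            delta[(y, z)] = ms * acc    # S(s) times the found sum
    return delta
```

Since y decreases within each z-group of S, the running sum only ever grows, so S and each \(\Delta T\) group are scanned once.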

4.2 The algorithm

We now turn to the general formulation of the dynamic Yannakakis algorithm. As exemplified in Sect. 4.1, GDyn obtains CDE by computing (nested) semijoins and indexing both these semijoins and the base relations. The order in which semijoins are computed and how they are indexed is recorded in a dynamic query plan, which is introduced next.

Dynamic Query Plans To simplify notation, we denote the set of all variables (resp. atoms, resp. predicates) that occur in an object X (such as a query) by \(\textit{var}(X)\) (resp. \(\textit{at}(X)\), resp. \(\textit{pred}(X)\)). In particular, if X is itself a set of variables (a hyperedge), then \(\textit{var}(X) = X\). We extend this notion uniformly to labeled trees, e.g., if n is a node in tree T, then \(\textit{var}_T(n)\) denotes the set of variables occurring in the label of n and similarly for edges and trees themselves. If T is clear from the context, we omit subscripts from our notation.

Definition 1

A dynamic query plan (or simply: plan) is a tuple (T, N) where T is a binary generalized join tree, and N is a sibling-closed connex subset of T. A generalized join tree (GJT) is a node-labeled and edge-labeled directed tree \(T = (V,E)\) such that:

  • Every leaf is labeled by an atom.

  • Every interior node n is labeled by a hyperedge and has at least one child c such that \(\textit{var}(n)\subseteq \textit{var}(c)\). Such a child is called a guard of n.

  • Whenever the same variable x occurs in the label of two nodes m and n of T, then x occurs in the label of each node on the unique path linking m and n. This condition is called the connectedness condition.

  • Every edge \(p \rightarrow c\) from parent p to child c in T is labeled by a set \({\textit{pred}}(p \rightarrow c)\) of predicates. It is required that for every predicate \(\theta (\overline{z}) \in {\textit{pred}}(p \rightarrow c)\), we have \(\textit{var}(\theta ) = \overline{z} \subseteq \textit{var}(p) \cup \textit{var}(c)\).

T is binary if every node in T has at most two children. A connex subset of T is a set \(N \subseteq V\) that includes the root of T such that the subgraph of T induced by N is a tree. N is sibling closed if for every node \(n \in N\) with a sibling m in T, m is also in N. The frontier of a connex set N is the subset \(F \subseteq N\) consisting of those nodes in N that are leaves in the subtree of T induced by N.

Fig. 4: Two example plans. The connex sets are indicated by the shaded areas

Figure 4 shows two plans \((T_1,N_1)\) and \((T_2,N_2)\). The set \(N_1\) contains all nodes of \(T_1\) and is therefore a sibling-closed connex subset of \(T_1\). Its frontier is \(\{ r(x,y), s(y,z), t(z,v)\}\). If we remove \(\{y,z\}\) from \(N_1\), the set is no longer connex. If instead we remove \(r(x,y)\), the set is still connex, but no longer sibling closed. The set \(N_2\) is a sibling-closed connex subset of \(T_2\), and its frontier is \(\{\{y,z,w\}, \{u\}\}\). Removing either \(\{y,z,w\}\) or \(\{u\}\) makes \(N_2\) no longer sibling closed.
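These conditions on N are easy to check operationally; a minimal sketch (Python, hypothetical node representation: each node exposes .children, and `parent` maps each node to its parent, or None at the root):

```python
def is_plan_set(root, N, parent):
    """Check that N is a sibling-closed connex subset of T: it contains
    the root, every non-root member's parent is also a member (so N
    induces a subtree), and every member's siblings are members too."""
    if root not in N:
        return False
    for n in N:
        p = parent[n]
        if n is not root and p not in N:
            return False                       # not connex
        if p is not None and any(c not in N for c in p.children):
            return False                       # not sibling-closed
    return True

def frontier(N):
    """Frontier of N: those members none of whose children are in N."""
    return {n for n in N if not any(c in N for c in n.children)}
```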

Definition 2

Let T be a GJT and let N be a connex subset of T. Assume that \(\{\!|r_1(\overline{x_1}), \ldots , r_n(\overline{x_n})|\!\}\) is the multiset of atoms occurring as labels in the leaves of T. Then, the query associated with T is the full join

$$\begin{aligned} {\mathcal {Q}}{\texttt {[}}T{\texttt {]}} = r_1(\overline{x_1}) \bowtie \cdots \bowtie r_n(\overline{x_n}) \mid \bigwedge _{\theta (\overline{z}) \in \textit{pred}(T)} \theta (\overline{z}), \end{aligned}$$

and the query associated with (TN) is the GCQ defined as \({\mathcal {Q}}{\texttt {[}}T,N{\texttt {]}} = \pi _{\textit{var}(N)}({\mathcal {Q}}{\texttt {[}}T{\texttt {]}})\).

To illustrate, referring to Fig. 4, we have that \({\mathcal {Q}}{\texttt {[}}T_1,N_1{\texttt {]}} = \pi _{x,y,z,v}\left( r(x,y) \bowtie s(y,z) \bowtie t(z,v) \mid y<v\right) \), which is \(Q_3\) from Sect. 4.1.

The data structure Following the intuition of Sect. 4.1, the GJT T of a plan (T, N) specifies the semijoin results that need to be maintained and indexed, while the connex set N drives the enumeration of query results. We formalize this next. Because in this section we are introducing \({\textsc {GDyn}} \) for arbitrary GCQs (with arbitrary join predicates \(\theta \)), we first need to introduce general notions of an index.

Definition 3

(Enumeration index) Let R be a GMR, \(\theta \) a predicate, and \(\overline{y}\) be a hyperedge. A data structure I that is of size linear in R and that allows, for any given \(\overline{y}\)-tuple \(\mathbf {t}\), enumeration of \((R \ltimes _{\theta } \mathbf {t})\) with delay \({\mathcal {O}}(f(|{R}|))\) is called an enumeration index of R by \((\theta , \overline{y})\) with delay \(f:{\mathbb {N}}\rightarrow {\mathbb {N}}\).

For example, in Sect. 4.1, we have discussed how, by means of grouping and sorting, we can obtain an enumeration index of \(T(z,v)\) by \((y < v, \{y,z\})\) with constant delay.

Definition 4

(Join Index) Let R be a GMR, \(\theta \) be a predicate, and \(\overline{y},\overline{z}\) be hyperedges such that \(\overline{z} \subseteq \textit{var}(R)\) or \(\overline{z} \subseteq \overline{y}\). A data structure I that is of size linear in R and that allows, for any GMR S over \(\overline{y}\), computation of \(\pi _{\overline{z}}(R \bowtie _{\theta } S)\) in time \({\mathcal {O}}(g(|{R}|,|{S}|))\) is called a join index of R by \((\theta , \overline{y}, \overline{z})\) with access time \(g:{\mathbb {N}}^2 \rightarrow {\mathbb {N}}\).

For example, in Sect. 4.1, we have discussed how, by means of grouping and sorting, we can obtain a join index of \(R(y,z)\) by \((y < v, \{z,v\},\{y,z\})\) whose access time is \({\mathcal {O}}(|{R}| + |{S}| \log |{S}|)\).

For both enumeration and join indexes, the update time is the time required to update the index to a corresponding new index on \(R + \Delta R\), given update \(\Delta R\).

Definition 5

Let (T, N) be a plan and let \(\textit{db}\) be a database over \(\textit{at}(T)\). The T-reduct (or semijoin reduction) of \(\textit{db}\) is a collection \(\rho \) of GMRs, one GMR \(\rho _n\) for each node \(n \in T\), defined inductively as follows. Let \({\textit{pred}}(n)\) denote the set of all predicates on the edges from n to its children in T.

  • If \(n = r(\overline{x})\) is an atom, then \(\rho _n = \textit{db}_{r(\overline{x})}\).

  • If n has a single child c, then \(\rho _n = \pi _{\textit{var}(n)}\sigma _{{\textit{pred}}(n)}\rho _c\).

  • Otherwise, n has two children \(c_1\) and \(c_2\). In this case, we have \(\rho _n = \pi _{\textit{var}(n)} \sigma _{{\textit{pred}}(n)} \left( \rho _{c_1} \bowtie \rho _{c_2}\right) \). Note that, because n has a guard child, this is actually a semijoin.

A T-reduct needs to be augmented by suitable index structures to be used for both enumeration and maintenance under updates. Concretely, for each node n with parent p in T, the following indexes are created:

  • If n belongs to N, then we store an enumeration index \(P_n\) on \(\rho _n\) by \(({\textit{pred}}(p\rightarrow n),\textit{var}(p))\).

  • If n is a node with a sibling m, then we store a join index \(S_n\) on \(\rho _n\) by \(({\textit{pred}}(p), \textit{var}(m), \textit{var}(p))\).

The T-reduct \(\rho \) together with these indexes is called a (T, N)-representation for \(\textit{db}\), or (T, N)-rep for short.

Reconsider the plan \((T_1,N_1)\) from Fig. 4. Figure 3 depicts an example \((T_1,N_1)\)-representation \(\rho \) for the database \(\textit{db}\) composed of the GMRs shown at the leaves of the tree. \(I_{\rho _{y,z}}\) and \(I_{\rho _t}\) illustrate the enumeration indexes; the join indexes are not illustrated.

It is important to observe that, since a T-reduct constructs only semijoins of database GMRs, and projections thereof, each \(|{\rho _n}|\) is linear in the size of \(\textit{db}\). Consequently, the indexes are also of size linear in \(\textit{db}\) and hence the entire (T, N)-rep is linear in \(\textit{db}\).
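To make Definition 5 concrete, here is a minimal bottom-up sketch (Python, reusing gmr_project from the GMR sketch in Sect. 3; the attribute names .label and .preds are ours, and the quadratic join stands in for the indexed joins that GDyn actually uses):

```python
def gmr_select(R, thetas):
    """Selection: keep tuples satisfying every predicate; predicates are
    modeled as callables on a dict from variable to value."""
    return {t: m for t, m in R.items()
            if all(th(dict(t)) for th in thetas)}

def gmr_join(R, S):
    """Natural join of two GMRs (tuples as sorted (var, value) pairs)."""
    out = {}
    for t, mr in R.items():
        dt = dict(t)
        for s, ms in S.items():
            if all(dt.get(x, v) == v for x, v in s):
                u = tuple(sorted({**dt, **dict(s)}.items()))
                out[u] = out.get(u, 0) + mr * ms
    return out

def t_reduct(n, db, rho):
    """Fill rho (node -> GMR) bottom-up per Definition 5; n.label is the
    atom at leaves and var(n) at interior nodes, n.preds is pred(n)."""
    for c in n.children:
        t_reduct(c, db, rho)
    if not n.children:
        rho[n] = dict(db[n.label])
    else:
        j = rho[n.children[0]]
        for c in n.children[1:]:
            j = gmr_join(j, rho[c])      # a semijoin, thanks to the guard
        rho[n] = gmr_project(gmr_select(j, n.preds), n.label)
    return rho
```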

Proposition 1

\(|{\rho _n}| \le \max _{r(\overline{x}) \in \textit{at}(T)} |{\textit{db}_{r(\overline{x})}}|\), for every \(n \in T\).

Given these definitions, the enumeration and maintenance algorithms that form GDyn are shown in Algorithm 1. They operate as follows.

Algorithm 1: the \(\textsc {enum}\) and \(\textsc {update}\) routines

Enumeration To enumerate from a (T, N)-rep, we iterate over the reductions \(\rho _n\) with \(n \in N\) in a nested fashion, starting at the root and proceeding top down. When n is the root, we iterate over all tuples in \(\rho _n\). For every such tuple \(\mathbf {t}\), we iterate only over the tuples in the children c of n that are compatible with \(\mathbf {t}\) (i.e., tuples in \(\rho _c\) that join with \(\mathbf {t}\) and satisfy \(\textit{pred}(n \rightarrow c)\)). Note that such tuples can be enumerated directly using the enumeration index \(P_c\). This procedure continues until we reach nodes in the frontier of N at which time the output tuple can be constructed. The pseudocode is given by the routine \(\textsc {enum}\) in Algorithm 1.
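A sketch of this recursion (Python; index construction is elided, and P[(n, c)](t) stands for probing child c's enumeration index \(P_c\) with \(\mathbf {t}\), i.e., enumerating \(\rho _c \ltimes _{\textit{pred}(n \rightarrow c)} \mathbf {t}\)):

```python
def merge(t, u):
    """Union of two compatible tuples (sorted (var, value) pairs)."""
    return tuple(sorted({**dict(t), **dict(u)}.items()))

def enum(n, t, N, rho, P):
    """Yield the pairs of Q[T_n, N_n](db) |x t (cf. Lemma 2)."""
    kids = [c for c in n.children if c in N]
    if not kids:                              # n is in the frontier of N
        yield t, rho[n][t]
    elif len(kids) == 1:
        c = kids[0]
        for s, _ in P[(n, c)](t):             # compatible tuples of rho_c
            for u, m in enum(c, s, N, rho, P):
                yield merge(t, u), m
    else:
        c1, c2 = kids
        for s1, _ in P[(n, c1)](t):
            for u1, m1 in enum(c1, s1, N, rho, P):
                for s2, _ in P[(n, c2)](t):
                    for u2, m2 in enum(c2, s2, N, rho, P):
                        yield merge(merge(t, u1), u2), m1 * m2

def enum_all(root, N, rho, P):
    """Top level: iterate over all tuples of rho at the root."""
    for t in rho[root]:
        yield from enum(root, t, N, rho, P)
```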

Update processing To maintain a (T, N)-rep under update \({ u }\), we traverse the nodes of T in a bottom-up fashion. At each node n, we have to compute the update \(\Delta _n\) to apply to \(\rho _n\) and its associated indexes. For leaf nodes, this update is given by the update \({ u }\) itself. For interior nodes, \(\Delta _n\) can be computed from the update and the original reduct of its children. \(\textsc {update}\) in Algorithm 1 gives the pseudocode. Here, line 21 is implemented by using the join index \(S_m\) on \(\rho _m\) by \((\textit{pred}(p), \textit{var}(n), \textit{var}(p))\). Line 23 can be implemented by a straightforward hash-based aggregation. We assume here that, as a side effect of the presented modifications, the associated indexes are also updated.
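A sketch of this bottom-up pass (Python, reusing the gmr_* helpers from the earlier sketches; index maintenance is elided, and the join marked below is what the join index \(S_m\) accelerates in the real algorithm):

```python
def update(postorder, rho, u, parent, preds):
    """Apply update u (a dict from leaf atom to update GMR) to the
    T-reduct rho. `postorder` lists T's nodes bottom-up; `parent` maps a
    node to its parent (None at the root); `preds[p]` holds p's
    predicates as callables."""
    delta = {n: (dict(u.get(n.label, {})) if not n.children else {})
             for n in postorder}
    for n in postorder:
        if not delta[n]:
            continue
        rho[n] = gmr_add(rho[n], delta[n])       # cf. line 19
        p = parent[n]
        if p is None:
            continue                             # n is the root
        sibs = [m for m in p.children if m is not n]
        if sibs:                                 # cf. line 21: join with
            d = gmr_join(rho[sibs[0]], delta[n]) # the sibling's reduct
        else:                                    # cf. line 23
            d = delta[n]
        contrib = gmr_project(gmr_select(d, preds.get(p, [])), p.label)
        delta[p] = gmr_add(delta[p], contrib)
    return rho
```

Note that when n is visited before its sibling m, \(\rho _m\) is still the old reduct, while once m is visited, \(\rho _n\) has already been updated; this asymmetry is exactly what makes the delta computation correct (cf. the proof of Proposition 5).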

4.3 Correctness and complexity

We next prove correctness of the enumeration and update procedures, and bound their complexity. We start with enumeration. Throughout this section, let (T, N) be a plan, let \(\textit{db}\) be a database over \(\textit{at}(T)\), and assume that we have a (T, N)-rep of \(\textit{db}\) with T-reduct \(\rho \). Given a node \(n\in T\), we denote the subtree of T rooted at n by \(T_n\), and the subset of all nodes of N that are in \(T_n\) by \(N_n\). The following lemma relates the GMRs at each node n of a T-reduct to the query induced by the subtree of T at n. Here, we write \({\mathcal {Q}}{\texttt {[}}T_n,n{\texttt {]}}\) as a shorthand for \({\mathcal {Q}}{\texttt {[}}T_n,\{n\}{\texttt {]}}\). Recall from Definition 2 that this is the query that joins all atoms in \(T_n\) (w.r.t. all predicates in \(T_n\)), and subsequently projects on \(\textit{var}(n)\).

Lemma 1

\(\rho _n= {\mathcal {Q}}{\texttt {[}}T_n,n{\texttt {]}}(\textit{db})\), for every node \(n \in T\).

The proof by induction is detailed in “Appendix A”.

To show correctness of enumeration, we need the following additional lemma regarding the subroutine of Algorithm 1 (line 3). The proof is again by induction and detailed in “Appendix A”.

Lemma 2

For every node \(n\in N\) and every tuple \(\mathbf {t}\) in \(\rho _n\), \(\textsc {enum}_{T,N}(n, \mathbf {t},\rho )\) enumerates \({\mathcal {Q}}{\texttt {[}}T_n, N_n{\texttt {]}}(\textit{db}) \ltimes \mathbf {t}\).

Finally, we require the following insights, also proved in “Appendix A”.

Lemma 3

  1. \(Q(\textit{db})\) is a positive GMR, for any \(GCQ \) Q and any database \(\textit{db}\).

  2. If R is a positive GMR over \(\overline{x}\) and \(\overline{y} \subseteq \overline{x}\), then \(\mathbf {t}[\overline{y}] \in \pi _{\overline{y}} R\) for every tuple \(\mathbf {t} \in R\).

We note that item (2) is not true when R has both positive and negative multiplicities, since multiplicities of opposite sign could cancel each other out when projecting, thereby removing \(\mathbf {t}[\overline{y}]\) from \(\pi _{\overline{y}}(R)\).

Proposition 2

If \(\rho \) is a T-reduct of \(\textit{db}\), then \(\textsc {enum}_{T,N}(\rho )\) enumerates \({\mathcal {Q}}{\texttt {[}}T,N{\texttt {]}}(\textit{db})\).

Proof

Let r be the root of T. By Lemma 1, we have \(\rho _r= {\mathcal {Q}}{\texttt {[}}T_r,r{\texttt {]}}(\textit{db}) = {\mathcal {Q}}{\texttt {[}}T, r{\texttt {]}}(\textit{db}) = \pi _{\textit{var}(r)} {\mathcal {Q}}{\texttt {[}}T{\texttt {]}}(\textit{db})\). Furthermore, \(\pi _{\textit{var}(r)} {\mathcal {Q}}{\texttt {[}}T{\texttt {]}}(\textit{db}) = \pi _{\textit{var}(r)} \pi _{\textit{var}(N)} {\mathcal {Q}}{\texttt {[}}T{\texttt {]}}(\textit{db})\) since \(\textit{var}(r) \subseteq \textit{var}(N)\) as \(r \in N\). Therefore, \(\rho _r = \pi _{\textit{var}(r)} \pi _{\textit{var}(N)} {\mathcal {Q}}{\texttt {[}}T{\texttt {]}}(\textit{db}) = \pi _{\textit{var}(r)} {\mathcal {Q}}{\texttt {[}}T,N{\texttt {]}}(\textit{db})\). We conclude that \(\rho _r\) is a projection of \({\mathcal {Q}}{\texttt {[}}T,N{\texttt {]}}(\textit{db})\), and hence by Lemma 3 that every tuple in \({\mathcal {Q}}{\texttt {[}}T,N{\texttt {]}}(\textit{db})\) has a compatible tuple in \(\rho _r\). As such, \({\mathcal {Q}}{\texttt {[}}T,N{\texttt {]}}(\textit{db})\) equals the disjoint union \(\bigcup _{\mathbf {t}\in \rho _r} {\mathcal {Q}}{\texttt {[}}T,N{\texttt {]}}(\textit{db})\ltimes \mathbf {t}\). By Lemma 2, this is exactly what \(\textsc {enum}_{T,N}(\rho )\) enumerates. \(\square \)

We now analyze the complexity of \(\textsc {enum}_{T,N}\). First, observe that by definition of T-reducts, it is the case that for every parent node p with a child n and every \(\mathbf {t} \in \rho _p\) there exists a tuple in \(\rho _n \ltimes _{{\textit{pred}}(p\rightarrow n)} \mathbf {t}\). Hence, every tuple that we iterate over will eventually produce a new output tuple. This ensures that we do not risk wasting time in iterating over tuples that in the end yield no output. As such, the time needed for \(\textsc {enum}_{T,N}(\rho )\) to produce a single new output is dominated by the time taken to iterate over the tuples in \(\rho _n \ltimes _{{\textit{pred}}(p \rightarrow n)}\mathbf {t}\), where p is the parent of n. Since we can use the enumeration index \(P_n\) to do so efficiently, the efficiency of the entire enumeration will depend on the delay incurred by accessing the enumeration indexes. The following proposition formalizes this insight.

Proposition 3

Assume that every enumeration index has enumeration delay f, where f is a monotone function. Then, using these indexes, \(\textsc {enum}_{T,N}(\rho )\) enumerates \({\mathcal {Q}}{\texttt {[}}T,N{\texttt {]}}(\textit{db})\) with delay \({\mathcal {O}}(|N| f(M))\) where \(M = \max _{r(\overline{x}) \in \textit{at}(T)} |{\textit{db}_{r(\overline{x})}}|\). Thus, the total time required to execute \(\textsc {enum}_{T,N}(\rho )\) is \({\mathcal {O}}(|{{\mathcal {Q}}{\texttt {[}}T,N{\texttt {]}}(\textit{db})}| |{N}| f(M))\).

Proof

\(\textsc {enum}_{T,N}(\rho )\) correctly enumerates \({\mathcal {Q}}{\texttt {[}}T,N{\texttt {]}}(\textit{db})\) by Proposition 2. As such, it suffices to show that the delay satisfies the given bounds. To that end, we show that for every node \(n\in N\) and \(\mathbf {t}\in \rho _n\), the call \(\textsc {enum}_{T,N}(n,\mathbf {t},\rho )\) will produce outputs with delay \({\mathcal {O}}(|N_n| f(M))\). We proceed by induction on \(|N_n|\). If \(|N_n|=1\), then n is in the frontier of N and the delay is clearly constant as the algorithm will only yield \((\mathbf {t},\rho _n(\mathbf {t}))\) (line 4). Now assume that \(|N_n|>1\). Then, n is not in the frontier of N. If n has a single child c, then line 6 is executed, and the enumeration index \(P_c\) allows us to iterate over \(\rho _c\ltimes _{{\textit{pred}}(n)} \mathbf {t}\) with delay \({\mathcal {O}}(f(|\rho _c|))\), which is \({\mathcal {O}}(f(M))\) by Proposition 1. For each element \(\mathbf {s}\) of this enumeration, the algorithm calls \(\textsc {enum}_{T, N}(c, \mathbf {s}, \rho )\), which by induction hypothesis produces output elements with delay \({\mathcal {O}}(|N_c| f(M))\). Hence, the maximum delay between two outputs is \({\mathcal {O}}\big (f(M) + |{N_c}| f(M)\big ) = {\mathcal {O}}\big ((|N_c| + 1) f(M)\big ) = {\mathcal {O}}\big (|N_n| f(M)\big )\). For the case in which n has two children \(c_1\) and \(c_2\), lines 7–12 are executed. By similar reasoning, it is easy to show that the maximum delay between outputs is

$$\begin{aligned}&{\mathcal {O}}(f(M))+{\mathcal {O}}(|N_{c_1}|f(M)) \\&\quad +{\mathcal {O}}(f(M))+{\mathcal {O}}(|N_{c_2}| f(M)) \end{aligned}$$

which is bounded by \({\mathcal {O}}((|N_{c_1}| + |N_{c_2}| + 2)f(M)) ={\mathcal {O}}(2 |N_n| f(M)) = {\mathcal {O}}(|N_n| f(M))\). \(\square \)

In particular, if all enumeration indexes are with constant delay (i.e., \(f(M) = {\mathcal {O}}(1))\), then GDyn enumerates \({\mathcal {Q}}{\texttt {[}}T,N{\texttt {]}}(\textit{db})\) with delay \({\mathcal {O}}(|{N}|)\), which is also constant in data complexity.

Update processing We next turn our attention to the update procedure \(\textsc {update}\) of Algorithm 1. Since it is straightforward to prove correct, we focus on its complexity. Because \(\textsc {update}\) uses the join indexes available in the (T, N)-rep during its execution, we bound the running time of \(\textsc {update}\) in terms of the join index access and update times. We first require the following insight, which bounds the running time under the condition that the GMRs computed by \(\textsc {update}\) have a certain bounded size. The proof is in “Appendix A”.

Proposition 4

Assume that all join indexes in the (T, N)-rep have access time g, and that all indexes (join and enumeration) have update time h, where g and h are monotone functions. Further assume that, during the entire execution of \(\textsc {update}\), K and U bound the size of \(\rho _n\) and \(\Delta _n\), respectively, for all n. Then, \(\textsc {update}_{T,N}(\rho ,{ u })\) runs in time \({\mathcal {O}}\left( |{T}| \cdot \left( U + h(K, U) + g(K,U) \right) \right) \).

We next bound the size of \(\rho _n\) and \(\Delta _n\) throughout the execution of \(\textsc {update}\).

Proposition 5

During the entire execution of \(\textsc {update}\), we have \(|{\rho _n}| \le M\) and \(|{\Delta _n}| \le 4M\) for every \(n\in T\), where \(M = \max _{r(\overline{x}) \in \textit{at}(T)} |{\textit{db}_{r(\overline{x})}}| + |{{ u }_{r(\overline{x})}}|\).

Proof

We first establish the bound on \(|{\rho _n}|\) during execution. Before execution, \(\rho \) is a T-reduct of \(\textit{db}\). Hence, by Proposition 1, \(|{\rho _n}| \le \max _{r(\overline{x})} |{\textit{db}_{r(\overline{x})}}| \le M\) before execution starts. Now note that the only line that updates \(\rho _n\) is line 19, executed while visiting node n in the bottom-up traversal of T. This line is only applied once for every node n. Hence, since at the end of execution the collection of modified GMRs \(\rho _m\), \(m \in T\), forms a T-reduct of \(\textit{db}+ { u }\), we know that after executing line 19, \(\rho _n\) contains exactly the content for the T-reduct of \(\textit{db}+ { u }\). Hence, by Proposition 1, \(|{\rho _n}| \le \max _{r(\overline{x})} |{(\textit{db}+ { u })_{r(\overline{x})}}| \le \max _{r(\overline{x})} |{\textit{db}_{r(\overline{x})}}| + |{{ u }_{r(\overline{x})}}| = M\).

We are now ready to establish the bounds on \(|{\Delta _n}|\). Clearly, \(|{\Delta _n}| \le M\) during the initialization of \(\Delta _n\) done in lines 14–17. Now consider that we are executing the bottom-up traversal of T in lines 18–23 and that n is the currently visited node. We have already established that both before and after applying the update \(\Delta _n\) to \(\rho _n\), we have \(|{\rho _n}| \le M\) (line 19). This implies that \(|{\Delta _n}| \le 2M\): in the worst case, \(\Delta _n\) deletes all existing tuples in \(\rho _n\) and adds M new ones. To see that \(|{\Delta _p}| \le 4M\) after executing line 21, we consider two cases. If n is visited before its sibling m in the bottom-up traversal of T, then \(\Delta _p\) is necessarily empty before executing line 21 and hence \(|{\Delta _p}| = |{\pi _{\textit{var}(p)}(\rho _m \bowtie _{{\textit{pred}}(p)} \Delta _n)}|\). Because, by definition of GJTs, p has either m or n as a guard, it follows that every tuple in \(\pi _{\textit{var}(p)}(\rho _m \bowtie _{{\textit{pred}}(p)} \Delta _n)\) is either a projected version of some tuple in \(\rho _m\), or a projected version of some tuple in \(\Delta _n\). As such, \(|{\Delta _p}| \le \max \left( |{\rho _m}|, |{\Delta _n}|\right) \le 2M\). If, on the other hand, m is visited before n in the bottom-up traversal of T, then \(\Delta _p\) necessarily contains the result computed during the execution of line 21 while visiting m. By the reasoning of the previous case, \(|{\Delta _p}| \le 2M\) before executing line 21 while visiting n. Furthermore, by the same reasoning, we know that \(|{\pi _{\textit{var}(p)}(\rho _m \bowtie _{{\textit{pred}}(p)} \Delta _n)}| \le 2M\). Hence, at most 2M new tuples can be added to \(\Delta _p\). As such, \(|{\Delta _p}| \le 4M\) after executing line 21 when visiting n. To see that also \(|{\Delta _p}| \le 4M\) after executing line 23 when visiting n, it suffices to observe that \(\Delta _p\) starts out empty, and \(|{\Delta _n}| \le 2M\) (as already established). \(\square \)

Combining Propositions 4 and 5, we obtain

Theorem 1

Assume that all join indexes in the (T, N)-rep have access time g and that all indexes (join and enumeration) have update time h, where g and h are monotone functions. \(\textsc {update}_{T,N}(\rho ,{ u })\) runs in time

$$\begin{aligned} {\mathcal {O}}\left( |{T}| \cdot \left( 4M + h(M, 4M) + g(M,4M)\right) \right) , \end{aligned}$$

where \(M = \max _{r(\overline{x}) \in \textit{at}(T)} |{\textit{db}_{r(\overline{x})}}| + |{{ u }_{r(\overline{x})}}|\).

4.4 IEDyn

GDyn provides a general framework for dynamic query processing in the presence of arbitrary \(\theta \)-joins. In this section, we instantiate GDyn to the specific setting where queries mention only inequality predicates (\(<, \le , >, \ge \)). We refer to this instantiation as IEDyn. Concretely, IEDyn uses the following data structures for its enumeration and join indexes. Let R be a GMR over \(\overline{x}\), \(\theta \) be a conjunction of inequalities, and \(\overline{y}, \overline{z}\) be hyperedges such that \(\overline{z} \subseteq \overline{x}\) or \(\overline{z} \subseteq \overline{y}\). The data structure underlying the enumeration and join index of R by \((\theta , \overline{y})\) resp. \((\theta , \overline{y}, \overline{z})\) depends on the number of inequalities in \(\theta \).

(1) No inequality In this case, \(\theta \) is hence equivalent to true, and the enumeration and join index only have to deal with equijoins. Concretely, the enumeration index of R on \((\theta , \overline{y})\) is obtained by creating a traditional (hash-based) index of R on the variables that \(\overline{x}\) and \(\overline{y}\) have in common. Then, for any \(\overline{y}\)-tuple \(\mathbf {t}\), \(R \ltimes \mathbf {t}\) can be enumerated with constant delay by first using the hash index to find the corresponding (\(\overline{x} \cap \overline{y}\))-group of R and enumerating the elements of that group.

To obtain the join index by \((\theta , \overline{y}, \overline{z})\), we discern two cases. If \(\overline{z} \subseteq \overline{y}\), then the same index as for enumeration is reused except that in addition, for each \((\overline{x} \cap \overline{y})\)-group of R, we cache the sum of all multiplicities in that group. This allows us to evaluate \(\pi _{\overline{z}}(R \bowtie S)\) in time \({\mathcal {O}}(|{S}|)\), independently of \(|{R}|\), as follows. Initialize an empty GMR to hold the join result. Group S on the variables in \(\overline{x} \cap \overline{y}\) in \(O(|{S}|)\) time (using hashing). For each group in S, use the index on R to locate the corresponding group of R in \({\mathcal {O}}(1)\) time and retrieve the cached sum-of-multiplicities \(\mu \) of that group. Then, iterate over the tuples \(\mathbf {s}\) of the S-group one by one and add \(S(\mathbf {s}) \times \mu \) to the multiplicity of \(\mathbf {s}[\overline{z}]\) in the result GMR. (If \(\mathbf {s}[\overline{z}]\) has not been added to the result before, this multiplicity is zero.) Repeat this process for all tuples in the S-group, and for every group in S. Since we scan S only once, the total time is \({\mathcal {O}}(|{S}|)\).

If \(\overline{z} \subseteq \overline{x}\), no special data structure is required: we can compute \(\pi _{\overline{z}}(R \bowtie S)\) in \({\mathcal {O}}(|{R}| + |{S}|)\) time by first computing \(\pi _{\overline{x} \cap \overline{y}} S\) in \({\mathcal {O}}(|{S}|)\) time and then repeating the above process with the roles of R and S reversed.

We conclude that in this case the enumeration delay is \({\mathcal {O}}(1)\), the access time is \({\mathcal {O}}(|{S}|)\) if \(\overline{z} \subseteq \overline{y}\) and \({\mathcal {O}}(|{R}| + |{S}|)\) otherwise, and the update time is \({\mathcal {O}}(|{\Delta R}|)\).
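For concreteness, here is a minimal Scala sketch of the equijoin case just described. It assumes that tuples are maps from variable names to values and that GMRs map tuples to multiplicities; the names (EquiJoinIndexSketch, GroupedIndex, joinProject) are ours for illustration and are not taken from the actual implementation.

```scala
object EquiJoinIndexSketch {
  type Tuple = Map[String, Int]
  type GMR   = Map[Tuple, Long] // tuple -> multiplicity

  // Hash-based index of a GMR R on the variables key = x̄ ∩ ȳ. Each group
  // additionally caches the sum of its multiplicities, as the join index requires.
  final class GroupedIndex(r: GMR, key: Set[String]) {
    private def proj(t: Tuple): Tuple = t.filter { case (v, _) => key.contains(v) }
    private val groups: Map[Tuple, Vector[(Tuple, Long)]] =
      r.toVector.groupBy { case (t, _) => proj(t) }
    private val sums: Map[Tuple, Long] =
      groups.map { case (k, g) => k -> g.map(_._2).sum }

    // Enumeration index: R ⋉ t with constant delay (locate group, stream it).
    def semijoin(t: Tuple): Iterator[(Tuple, Long)] =
      groups.getOrElse(proj(t), Vector.empty).iterator

    // Cached Σ of multiplicities of the group matching t (0 if no group exists).
    def groupSum(t: Tuple): Long = sums.getOrElse(proj(t), 0L)
  }

  // π_z̄(R ⋈ S) in O(|S|) time when z̄ ⊆ ȳ: a single pass over S.
  def joinProject(idx: GroupedIndex, s: GMR, z: Set[String]): GMR = {
    val out = scala.collection.mutable.Map.empty[Tuple, Long].withDefaultValue(0L)
    for ((t, mult) <- s) {
      val mu = idx.groupSum(t)
      if (mu != 0L) out(t.filter { case (v, _) => z.contains(v) }) += mult * mu
    }
    out.toMap
  }
}
```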

(2) Single inequality Assume that \(\theta = x > y\) with \(x \in \overline{x}\) and \(y \in \overline{y}\) (the other cases \(x < y, x \le y\), \(x\ge y\) are similar). Then, we build a hash-based index I of R on \(\overline{x} \cap \overline{y}\), sorting each group in descending order on x. In Sect. 4.1, we have illustrated, by means of an example, that this realizes an enumeration index by \((\theta , \overline{y})\) with constant delay. At the end of the same section, we have also illustrated that, when \(\overline{z} \subseteq \overline{x}\), this also realizes a join index of R by \((\theta , \overline{y}, \overline{z})\) with access time \({\mathcal {O}}(|{R}| + |{S}| + |{S}| \log |{S}|) = {\mathcal {O}}(|{R}| + |{S}| \log |{S}|)\). When \(\overline{z} \subseteq \overline{y}\), the same procedure, but using S for the outer loop, and R for the inner loop, can be used to realize a join index of R by \((\theta , \overline{y}, \overline{z})\) with access time \({\mathcal {O}}(|{R}| + |{S}| \log |{S}|)\).

Note that, because we need to keep the data sorted, the update time of these indexes is \({\mathcal {O}}\big (|{\Delta R}| \log (|{R}| + |{\Delta R}|)\big )\).

We conclude that in this case the enumeration delay is \({\mathcal {O}}(1)\), the access time is \({\mathcal {O}}(|{R}| + |{S}| \log |{S}|)\), and the update time is \({\mathcal {O}}\big (|{\Delta R}| \log (|{R}| + |{\Delta R}|)\big )\).
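Analogously, a minimal Scala sketch of the single-inequality case: every (x̄ ∩ ȳ)-group is kept sorted in descending order on x, so that the answers to \(R \ltimes _{x > y} \mathbf {t}\) form a prefix of the matching group. Again, all names are illustrative only.

```scala
object SingleInequalityIndexSketch {
  type Tuple = Map[String, Int]

  // Index of R on key = x̄ ∩ ȳ for θ = (x > y): every group sorted descending on x.
  final class SortedGroupIndex(r: Vector[(Tuple, Long)], key: Set[String], x: String) {
    private def proj(t: Tuple): Tuple = t.filter { case (v, _) => key.contains(v) }
    private val groups: Map[Tuple, Vector[(Tuple, Long)]] =
      r.groupBy { case (t, _) => proj(t) }
        .map { case (k, g) => k -> g.sortWith((p, q) => p._1(x) > q._1(x)) }

    // Enumerate {s ∈ R | s agrees with t on key and s(x) > t(y)} with constant
    // delay: the answers are a prefix of the descending-sorted group, so we
    // stream the group and stop at the first tuple failing the inequality.
    def thetaSemijoin(t: Tuple, y: String): Iterator[(Tuple, Long)] =
      groups.getOrElse(proj(t), Vector.empty).iterator
        .takeWhile { case (s, _) => s(x) > t(y) }
  }
}
```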

(3) Multiple inequalities Assume that \(\theta = x_1 > y_1 \wedge x_2 > y_2 \wedge \cdots \wedge x_k > y_k\) with \(x_1,\ldots ,x_k \in \overline{x}\) and \(y_1,\ldots , y_k \in \overline{y}\). (The reasoning where some of the > are replaced by < is completely analogous.) Then, as in the single-inequality case, we build a hash-based index I of R on \(\overline{x} \cap \overline{y}\) but now sort each group lexicographically on \((x_1,\ldots , x_k)\) (each \(x_i\) in descending order). In addition, for each group and each i (\(1 \le i \le k\)), we record the smallest \(x_i\)-value present in the group.

The fact that we have multiple inequalities complicates matters, in the sense that enumeration delay becomes logarithmic instead of constant. We can see this as follows. To enumerate \(R \ltimes _{\theta } \mathbf {t}\) given \(\overline{y}\)-tuple \(\mathbf {t}\), we first use I to obtain a pointer to \(R \ltimes \mathbf {t}[\overline{x} \cap \overline{y}]\) in \({\mathcal {O}}(1)\) time. Initialize \((m_1, \ldots , m_k)\) such that \(m_i\) is the smallest \(x_i\)-value in the group. Then, start enumerating \(R \ltimes \mathbf {t}[\overline{x} \cap \overline{y}]\) with constant delay and in decreasing lexicographic order. Yield the current pair \((\mathbf {s}, \mu )\) that is being enumerated, provided that \(\mathbf {s}[x_i] > \mathbf {t}[y_i]\) for all i. In contrast to the case where there is a single inequality, however, we cannot deduce that all subsequent \(\mathbf {s}\) will fail \(\theta \). The problem is then to find the next tuple in the group that occurs after \(\mathbf {s}\) in sorted order but still satisfies \(\theta \) with respect to \(\mathbf {t}\). Concretely, let i be the smallest index such that \(\mathbf {s}(x_i) \le \mathbf {t}(y_i)\) yet \(\mathbf {s}(x_j) > \mathbf {t}(y_j)\) for all \(j < i\). Let \(\mathbf {s'}\) be the tuple obtained from \(\mathbf {s}\) by setting \(\mathbf {s}(x_j) := m_j\) for all \(j \ge i\). Then, using binary search, skip all tuples of the group that are lexicographically larger than or equal to \(\mathbf {s'}\)—all such tuples agree with \(\mathbf {s}\) on \(x_1,\ldots ,x_{i-1}\) and hence also fail \(\theta \)—and continue enumeration from the first tuple after them. This binary search takes \({\mathcal {O}}(\log |{R}|)\) time, which causes the logarithmic delay.

Having multiple inequalities also complicates the update processing, since the sorted order can no longer be exploited to speed up join computation. In this case, therefore, we simply do a nested loop join per group, which yields a total access time of \({\mathcal {O}}(|{R}| \times |{S}|)\). Designing a join index with better access time is an interesting avenue for future work.

We conclude that in this case the enumeration delay is \({\mathcal {O}}(\log |{R}|)\), the access time is \({\mathcal {O}}(|{R}| \times |{S}|)\), and the update time is \({\mathcal {O}}\big (|{\Delta R}| \log (|{R}| + |{\Delta R}|) \big )\).
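The following Scala sketch illustrates this enumeration procedure on a single group, abstracting tuples as vectors of their \((x_1, \ldots , x_k)\)-values; multiplicities and the surrounding hash index are omitted, and all names are ours.

```scala
object MultiInequalityEnumSketch {
  // Is a lexicographically greater than b?
  private def lexGt(a: Vector[Int], b: Vector[Int]): Boolean =
    a.indices.find(j => a(j) != b(j)).exists(j => a(j) > b(j))

  // One (x̄ ∩ ȳ)-group, sorted descending lexicographically on (x_1, ..., x_k).
  final class SortedGroup(tuples: Seq[Vector[Int]]) {
    private val group: Vector[Vector[Int]] = tuples.toVector.sortWith(lexGt)
    private val mins: Vector[Int] =            // m_i = smallest x_i-value in the group
      if (group.isEmpty) Vector.empty
      else Vector.tabulate(group.head.length)(i => group.map(_(i)).min)

    // First position whose tuple is lexicographically strictly below sPrime;
    // in descending order these positions form a suffix (binary lower bound).
    private def firstBelow(sPrime: Vector[Int]): Int = {
      var lo = 0; var hi = group.length
      while (lo < hi) {
        val mid = (lo + hi) / 2
        if (lexGt(sPrime, group(mid))) hi = mid else lo = mid + 1
      }
      lo
    }

    // Enumerate all s in the group with s(i) > ys(i) for every i,
    // with O(log |group|) delay, skipping failing runs via binary search.
    def enumerate(ys: Vector[Int]): Iterator[Vector[Int]] = new Iterator[Vector[Int]] {
      private var idx = 0
      private def advance(): Unit = {
        var searching = true
        while (searching && idx < group.length) {
          val s = group(idx)
          s.indices.find(j => s(j) <= ys(j)) match {
            case None => searching = false   // s satisfies every x_i > y_i
            case Some(i) =>
              // s fails first at i; all later tuples sharing s's prefix on
              // x_1..x_{i-1} fail too, so jump past s' = (s_1..s_{i-1}, m_i..m_k).
              val sPrime = Vector.tabulate(s.length)(j => if (j >= i) mins(j) else s(j))
              idx = math.max(idx + 1, firstBelow(sPrime))
          }
        }
      }
      advance()
      def hasNext: Boolean = idx < group.length
      def next(): Vector[Int] = { val s = group(idx); idx += 1; advance(); s }
    }
  }
}
```

For instance, on the group \(\{(5,9), (5,3), (4,8)\}\) with \((\mathbf {t}(y_1), \mathbf {t}(y_2)) = (4,5)\), the iterator yields (5, 9) and then skips the remaining tuples using two binary searches.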

Complexity of IEDyn By plugging the above-mentioned delay into Proposition 3 and the above-mentioned access and update times into Corollary 1, we obtain the following complexity of IEDyn.

Theorem 2

Assume that (T, N) is a plan in which all predicates are inequalities and let all enumeration and join indexes be as described above. Then, \(\textsc {enum}\) enumerates with delay \({\mathcal {O}}(|N| \log (\max _{r(\overline{x})}|{\textit{db}_{r(\overline{x})}}|))\) and \(\textsc {update}\) processes updates in time \({\mathcal {O}}(|{T}| M^2)\), where \(M = \max _{r(\overline{x}) \in \textit{at}(T)} |{\textit{db}_{r(\overline{x})}}| + |{{ u }_{r(\overline{x})}}|\). If T is such that each edge is labeled by at most one predicate, then the enumeration delay is \({\mathcal {O}}(|N|)\) and the update time is \({\mathcal {O}}(|T| M \log M )\). If T has no inequalities, the update time is \({\mathcal {O}}(|{T}| M)\).

A simple GJT is a GJT without predicates where \(\textit{var}(p) \subseteq \textit{var}(n)\) for every node n with parent p. In a simple GJT, every child is hence a guard of its parent. For simple GJTs, the update processing time is optimal, in the following sense.

Theorem 3

If T is simple, then \(\textsc {update}\) processes updates in time \({\mathcal {O}}(|{T}| \max _{r(\overline{x})} |{{ u }_{r(\overline{x})}}|)\), which is independent of \(|{\textit{db}}|\).

Proof

Using the fact that every node is a guard of its parent, it is straightforward to prove by induction on the height of a node n in T (defined as the length of the shortest path from n to a leaf in T) that \(|{\Delta _n}| \le \max _{r(\overline{x})} |{{ u }_{r(\overline{x})}}|\), for each n. Since T does not contain any predicates, all join indexes that are created are as described in the paragraph “(1) No inequality” above. In particular, since every node is a guard of its parent, every join index \(S_m\) by \(({\textit{pred}}(p), \textit{var}(n), \textit{var}(p))\) that is created satisfies \(\textit{var}(p) \subseteq \textit{var}(m)\). For this particular case, the access time to execute the semijoin \(\pi _{\textit{var}(p)}(\rho _m \bowtie \Delta _n)\) is \({\mathcal {O}}(|{\Delta _n}|) = {\mathcal {O}}(\max _{r(\overline{x})} |{{ u }_{r(\overline{x})}}|)\). By now plugging in this access time and the linear index update time into Proposition 4, the result follows. \(\square \)

Berkholz et al. [8] show that, unless the Online Matrix-Vector Multiplication conjecture [24] is false, the class of conjunctive queries that allow both (1) constant delay enumeration of query results (in data complexity) and (2) update processing time that is linear in \(|{u}|\) (again in data complexity) for every update \({ u }\), is exactly the class of so-called q-hierarchical queries. While we forgo a formal definition of this class, we show in [25] that a CQ Q is q-hierarchical if, and only if, there exists a plan (T, N) for Q such that T is simple. Since, by the results above, \({\textsc {GDyn}} \) has both constant delay enumeration and update time \({\mathcal {O}}(|{{ u }}|)\) (in data complexity) for exactly these queries, \({\textsc {GDyn}} \) hence meets the theoretical lower bound.

5 Computing query plans

We say that (T, N) is a plan for GCQ Q, or that Q has plan (T, N), if Q and \({\mathcal {Q}}{\texttt {[}}T,N{\texttt {]}}\) are the same query, up to reordering of atoms and predicates, i.e., if \(\#\textit{at}(Q) = \#\textit{at}(T)\), \(\textit{pred}(Q) = \textit{pred}(T)\), and \({\textit{out}}(Q) = \textit{var}(N)\). Here, \(\#\textit{at}(X)\) denotes the multiset of atoms occurring in object X.

By the results of Sect. 4, it follows that we can dynamically process a given GCQ Q by first computing a plan for Q and subsequently applying GDyn on that plan. In this section we show how to compute a plan for Q by describing two algorithms.

  1. The first algorithm computes a GJT pair for Q. Here, a GJT pair is a pair \((T', N')\) defined exactly like a query plan, except that \(T'\) need not be binary and \(N'\) need not be sibling closed. A query plan is hence a particular kind of GJT pair. We call \((T',N')\) a GJT pair for Q if \(\#\textit{at}(Q) = \#\textit{at}(T')\), \(\textit{pred}(Q) = \textit{pred}(T')\), and \({\textit{out}}(Q) = \textit{var}(N')\).

  2. The second algorithm transforms this GJT pair into an equivalent query plan. Here, two GJT pairs (T, N) and \((T',N')\) are equivalent if \(\#\textit{at}(T) = \#\textit{at}(T')\), \(\textit{pred}(T) = \textit{pred}(T')\), and \(\textit{var}(N) = \textit{var}(N')\).

Clearly, the plan resulting from the composition of the two algorithms must be a plan for Q.

Before describing these two algorithms, it is important to emphasize that there are GCQs for which no GJT pair exists (and, consequently, for which no query plan exists). This is illustrated in Example 2 below. In particular, for full conjunctive queries (i.e., GCQs without \(\theta \)-joins and projections), the results of [25] imply that a GJT pair exists for a full CQ Q if, and only if, Q is acyclic, a well-studied class of queries [1, 47]. Similarly, the results imply that for conjunctive queries (with projections, but still without \(\theta \)-joins) a GJT pair exists if and only if the query is free-connex acyclic, another well-studied class [6]. The existing definitions of acyclicity and free-connex acyclicity are given for CQs only. Given the previous discussion, we extend these notions to GCQs as follows.

Definition 6

A GCQ Q is free-connex acyclic if it has a GJT pair. It is acyclic if \(\textit{full}(Q)\) has a GJT pair. A GCQ that is not acyclic is cyclic.

In particular, every full GCQ that is acyclic is also free-connex acyclic. Also note that, since \({\textit{out}}(\textit{full}(Q)) = \textit{var}(Q)\), a GJT pair will exist for \(\textit{full}(Q)\) if and only if there exists a GJT T for Q, i.e., a GJT with \(\#\textit{at}(Q) = \#\textit{at}(T)\) and \(\textit{pred}(Q) = \textit{pred}(T)\). Indeed, if T is a GJT for Q, then (T, N) with N the set of all nodes in T is a GJT pair for \(\textit{full}(Q)\): clearly N is connex and \(\textit{var}(N) = \textit{var}(T) = {\textit{out}}(\textit{full}(Q))\). For this reason, free-connex acyclicity is a stronger requirement than acyclicity: acyclicity only requires that a GJT for Q exists, while free-connex acyclicity requires, in addition, that there exists a connex subset N with \({\textit{out}}(Q) = \textit{var}(N)\).

Example 2

The trees \(T_1\) and \(T_2\) depicted in Fig. 4 are GJTs for the full GCQs

$$\begin{aligned} Q_1&= \big (r(x,y) \bowtie s(y,z) \bowtie t(z,v) \mid y< v \big ), \quad \text { and } \\ Q_2&= \big (r(x,y)\bowtie s(y,z,w)\bowtie t(u,v) \mid x<z \wedge w<u\big ), \end{aligned}$$

respectively. These queries are hence acyclic. In contrast, \(r(x,y) \bowtie s(y,z) \bowtie t(x,z)\) (also known as the triangle query) is the prototypical cyclic join query.

Let \(Q'_2 = \pi _{y,z,w,u}(Q_2)\). \(Q'_2\) is free-connex acyclic since the pair \((T_2, N_2)\) of Fig. 4 is a GJT pair for \(Q'_2\). In contrast, there is no GJT pair for \(Q'_1 = \pi _{y,z}(Q_1)\) that contains tree \(T_1\). Indeed, observe that any connex set of \(T_1\) must include the root, which includes \(x \not \in {\textit{out}}(Q'_1)\). Finally, it can be verified that there is no GJT pair for \(\pi _{x,v}(Q_1)\); this query is hence not free-connex acyclic.

Given that plans only exist for free-connex acyclic queries, it is desirable to be able to check free-connex acyclicity. In this respect, we develop an algorithm in Sect. 5.1 that checks whether Q is free-connex acyclic and if so computes a GJT pair for Q. This algorithm hence realizes step (1) above. Subsequently, step (2) above is realized in Sect. 5.2 where we discuss how to transform GJT pairs into equivalent query plans.

5.1 Computing GJT pairs

The canonical algorithm for checking acyclicity of normal conjunctive queries is the GYO algorithm [1]. The algorithm described in this section is a generalization of the GYO algorithm that checks free-connex acyclicity in addition to normal acyclicity and deals with GCQs featuring \(\theta \)-join predicates instead of CQs that have equality joins only. We first recall the classical GYO algorithm and then formulate its extension.

5.1.1 Classical GYO

The GYO algorithm operates on hypergraphs. A hypergraph H is a set of nonempty hyperedges. Recall from Sect. 3 that a hyperedge is just a finite set of variables. Every GCQ is associated with a hypergraph as follows.

Definition 7

Let Q be a GCQ. The hypergraph of Q, denoted \(\textit{hyp}(Q)\), is the hypergraph

$$\begin{aligned} \textit{hyp}(Q) = \{ \overline{x} \mid r(\overline{x}) \text { is an atom of } Q \text { with } \overline{x} \not = \emptyset \}. \end{aligned}$$

The GYO algorithm checks acyclicity of a normal conjunctive query Q by constructing \(\textit{hyp}(Q)\) and repeatedly removing ears from this hypergraph. If ears can be removed until only the empty hypergraph remains, then the query is acyclic; otherwise, it is cyclic.

An ear in a hypergraph H is a hyperedge e for which we can divide its variables into two groups: (1) those that appear exclusively in e and (2) those that are contained in another hyperedge \(\ell \) of H. A variable that appears exclusively in a single hyperedge is also called an isolated variable. Thus, ear removal corresponds to executing the following two reduction operations.

  • Remove isolated variables: select a hyperedge e in H and remove isolated variables from it; if e becomes empty, remove it altogether from H.

  • Subset elimination: remove hyperedge e from H if there exists another hyperedge \(\ell \) for which \(e \subseteq \ell \).

The GYO reduction of a hypergraph is the hypergraph that is obtained by executing these operations until no further operation is applicable. The following result is standard; see, e.g., [1] for a proof.

Proposition 6

A CQ Q is acyclic if and only if the GYO reduction of \(\textit{hyp}(Q)\) is the empty hypergraph.
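For concreteness, the following self-contained Scala sketch implements this classical reduction and the test of Proposition 6; hyperedges are plain sets of variable names, and all names are ours.

```scala
object GyoSketch {
  type Edge = Set[String]

  def gyoReduce(hyp0: List[Edge]): List[Edge] = {
    var hyp = hyp0.filter(_.nonEmpty)
    var changed = true
    while (changed) {
      changed = false
      // Remove isolated variables: a variable occurring in exactly one hyperedge.
      val counts = hyp.flatten.groupBy(identity).map { case (v, occ) => v -> occ.size }
      val noIso = hyp.map(_.filter(v => counts(v) >= 2)).filter(_.nonEmpty)
      if (noIso != hyp) { hyp = noIso; changed = true }
      // Subset elimination: drop a hyperedge contained in another hyperedge.
      hyp.indices.find(i => hyp.indices.exists(j => j != i && hyp(i).subsetOf(hyp(j))))
        .foreach { i => hyp = hyp.patch(i, Nil, 1); changed = true }
    }
    hyp
  }

  def isAcyclic(hyp: List[Edge]): Boolean = gyoReduce(hyp).isEmpty

  def main(args: Array[String]): Unit = {
    // Join structure of Q1 from Example 2 (ignoring its predicate, since the
    // classical GYO algorithm is defined for plain CQs), and the triangle query.
    val chain    = List(Set("x", "y"), Set("y", "z"), Set("z", "v"))
    val triangle = List(Set("x", "y"), Set("y", "z"), Set("x", "z"))
    println(isAcyclic(chain))    // true
    println(isAcyclic(triangle)) // false
  }
}
```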

5.1.2 GYO reduction for GCQs

In order to extend the GYO reduction to check free-connex acyclicity (not simply acyclicity) of GCQs (not simply standard CQs), we will: (1) redefine the notion of being an ear to take into account the predicates and (2) transform the GYO reduction into a two-stage procedure. The first stage checks whether a connex set N with \(\textit{var}(N) = {\textit{out}}(Q)\) can exist, while the first and second stages combined check that the query is acyclic.

Our algorithm operates not on hypergraphs but on hypergraph triplets, which are defined as follows.

Definition 8

A hypergraph triplet is a triple

$$\begin{aligned} \mathcal {H} = (\textit{hyp}(\mathcal {H}), {\textit{out}}(\mathcal {H}), \textit{pred}(\mathcal {H})) \end{aligned}$$

with \(\textit{hyp}(\mathcal {H})\) a hypergraph, \({\textit{out}}(\mathcal {H})\) a hyperedge, and \(\textit{pred}(\mathcal {H})\) a set of predicates.

Intuitively, the variables in \({\textit{out}}(\mathcal {H})\) will correspond to the output variables of a query and the set \(\textit{pred}(\mathcal {H})\) will contain predicates that need to be taken into account when removing ears. Every GCQ is, therefore, naturally associated with a hypergraph triplet as follows.

Definition 9

The hypergraph triplet of a GCQ Q, denoted \({\mathcal {H}}(Q)\), is the triplet \((\textit{hyp}(Q), {\textit{out}}(Q), \textit{pred}(Q))\).

In order to extend the notion of an ear, we require the following definitions. Let \(\mathcal {H}\) be a hypergraph triplet. Variables that occur in \({\textit{out}}(\mathcal {H})\) or in at least two hyperedges in \(\textit{hyp}(\mathcal {H})\) are called equijoin variables of \(\mathcal {H}\). We denote the set of all equijoin variables of \(\mathcal {H}\) by \({\textit{jv}}(\mathcal {H})\) and abbreviate \({\textit{jv}}_{\mathcal {H}}(e) = e \cap {\textit{jv}}(\mathcal {H})\). A variable x is isolated in \(\mathcal {H}\) if it is not an equijoin variable and is not mentioned in any predicate, i.e., if \(x \not \in {\textit{jv}}(\mathcal {H})\) and \(x \not \in \textit{var}(\textit{pred}(\mathcal {H}))\). We denote the set of isolated variables of \(\mathcal {H}\) by \({\textit{isol}}(\mathcal {H})\) and abbreviate \({\textit{isol}}_{\mathcal {H}}(e) = e \cap {\textit{isol}}(\mathcal {H})\). The extended variables of hyperedge e in \(\mathcal {H}\), denoted \(\textit{ext}_{\mathcal {H}}(e)\), form the set of all variables of predicates that mention some variable in e, except the variables in e themselves:

$$\begin{aligned} \textit{ext}_{\mathcal {H}}(e) = \bigcup \{ \textit{var}(\theta ) \mid \theta \in \textit{pred}(\mathcal {H}), \textit{var}(\theta ) \cap e \not = \emptyset \} {\setminus } e. \end{aligned}$$

Finally, a hyperedge e is a conditional subset of hyperedge \(\ell \) w.r.t. \(\mathcal {H}\), denoted \(e \sqsubseteq _{\mathcal {H}} \ell \), if \({\textit{jv}}_{\mathcal {H}}(e) \subseteq \ell \) and \(\textit{ext}_{\mathcal {H}}(e {\setminus } \ell ) \subseteq \ell \). We omit subscripts from our notation if the triplet is clear from the context.

Fig. 5 Illustration of GYO reduction for GCQs. Colored regions depict hyperedges. Variables in \({\textit{out}}\) are underlined. Variables occurring in the same predicate are connected by dashed lines

Example 3

In Fig. 5, we depict several hypergraph triplets. There, hyperedges in \(\mathcal {H}\) are depicted by colored regions and variables in \({\textit{out}}(\mathcal {H})\) are underlined. We use dashed lines to connect variables that appear together in a predicate. So, in \(\mathcal {H}_1\), we have predicates \(\theta _1, \theta _2\) with \(\textit{var}(\theta _1) = \{t,v\}\) and \(\textit{var}(\theta _2) = \{x,y\}\). Now consider triplet \(\mathcal {H}_1\) in particular. It is the hypergraph triplet \({\mathcal {H}}(Q)\) for the following GCQ Q:

$$\begin{aligned} Q&= \pi _{t,u,z,w}(r_1(s, t, u) \bowtie r_2(t, u) \bowtie r_3(u, w, x)\\&\quad \bowtie r_4(s, v) \bowtie r_5(w, z, y) \mid t< v \wedge x < y). \end{aligned}$$

Moreover, \({\textit{jv}}(\mathcal {H}_1) =\{s,t,u,w,z\} \) and \({\textit{isol}}(\mathcal {H}_1) = \emptyset \). Furthermore, \(\textit{ext}_{\mathcal {H}_1}(\{v\}) = \{t\}\) since \(\theta _1 = t < v\) shares variables with \(\{v\}\). Finally \({\textit{jv}}_{\mathcal {H}_1}(\{s,v\}) = \{s\} \subseteq \{s,t,u\}\) and \(\textit{ext}_{\mathcal {H}_1}(\{s,v\} {\setminus } \{s,t,u\}) = \textit{ext}_{\mathcal {H}_1}(\{v\}) = \{t\} \subseteq \{s,t,u\}\). Therefore, \(\{s, v\} \sqsubseteq _{\mathcal {H}_1} \{s,t,u\}\). Similarly, \(\{t,u\} \sqsubseteq _{\mathcal {H}_1} \{s,t,u\}\).

We define ears in our context as follows.

Definition 10

A hyperedge e is an ear in a hypergraph triplet \(\mathcal {H}\) if \(e \in \textit{hyp}(\mathcal {H})\) and either

  1. we can divide its variables into two groups: (a) those that are isolated and (b) those that form a conditional subset of another hyperedge \(\ell \in \textit{hyp}(\mathcal {H}) {\setminus } \{e\}\); or

  2. e consists only of nonjoin variables, i.e., \({\textit{jv}}(e) = \emptyset \) and \(\textit{ext}(e) = \emptyset \).

Note that case (2) allows for \(\theta \in \textit{pred}(\mathcal {H})\) with \(\textit{var}(\theta ) \subseteq e\). We call predicates that are covered by a hyperedge in this sense filters because they correspond to filtering a single GMR instead of \(\theta \)-joining two GMRs. If, in case (2), there is no filter \(\theta \) with \(\textit{var}(\theta ) \subseteq e\), then \(e = {\textit{isol}}_{\mathcal {H}}(e)\). Similar to the classical GYO reduction, we can view ear removal as a rewriting process on triplets, where we consider the following reduction operations.

  • (ISO) Remove isolated variables: select a hyperedge \(e \in \textit{hyp}(\mathcal {H})\) and remove a nonempty set \(X \subseteq {\textit{isol}}_{\mathcal {H}}(e)\) from it. If e becomes empty, remove it from \(\textit{hyp}(\mathcal {H})\).

  • (CSE) Conditional subset elimination: remove hyperedge e from \(\textit{hyp}(\mathcal {H})\) if it is a conditional subset of another hyperedge f in \(\textit{hyp}(\mathcal {H})\). Also update \(\textit{pred}(\mathcal {H})\) by removing all predicates \(\theta \) with \(\textit{var}(\theta ) \cap (e {\setminus } f) \not = \emptyset \).

  • (FLT) Filter elimination: select \(e \in \textit{hyp}(\mathcal {H})\) and a nonempty subset of predicates \(\Theta \subseteq \textit{pred}(\mathcal {H})\) with \(\textit{var}(\Theta ) \subseteq e\). Remove all predicates in \(\Theta \) from \(\textit{pred}(\mathcal {H})\).

We write \(\mathcal {H} \rightsquigarrow \mathcal {I}\) to denote that triplet \(\mathcal {I}\) is obtained from triplet \(\mathcal {H}\) by applying a single such operation, and \(\mathcal {H} \rightsquigarrow ^* \mathcal {I}\) to denote that \(\mathcal {I}\) is obtained by a sequence of zero or more of such operations.

Example 4

For the hypergraph triplets illustrated in Fig. 5, we have \(\mathcal {H}_1 \rightsquigarrow \mathcal {H}_2 \rightsquigarrow \mathcal {H}_3 \rightsquigarrow \mathcal {H}_4\) and \(\mathcal {H}_5 \rightsquigarrow \mathcal {H}_6 \rightsquigarrow \mathcal {H}_7 \rightsquigarrow \mathcal {H}_8 \rightsquigarrow \mathcal {H}_9 \rightsquigarrow \mathcal {H}_{10} \rightsquigarrow \mathcal {H}_{11}\). For each reduction, it is illustrated in the figure which set of isolated variables is removed, or which conditional subset is removed.

We write \({\mathcal {H}}\!\!\downarrow \) to denote that \(\mathcal {H}\) is in normal form, i.e., that no operation is applicable on triplet \(\mathcal {H}\). Note that, because each operation removes at least one variable, hyperedge, or predicate, we will always reach a normal form after a finite number of operations. Furthermore, while multiple different reduction steps may be applicable on a given triplet \(\mathcal {H}\), the order in which we apply them does not matter:

Proposition 7

(Confluence) Whenever \(\mathcal {H} \rightsquigarrow ^* \mathcal {I}_1\) and \(\mathcal {H} \rightsquigarrow ^* \mathcal {I}_2\), there exists \(\mathcal {J}\) such that \(\mathcal {I}_1 \rightsquigarrow ^* \mathcal {J}\) and \(\mathcal {I}_2 \rightsquigarrow ^* \mathcal {J}\).

Because the proof is technical but not overly enlightening, we defer it to “Appendix B.1”. A direct consequence is that normal forms are unique: if \(\mathcal {H} \rightsquigarrow ^* {\mathcal {I}_1}\!\!\downarrow \) and \(\mathcal {H} \rightsquigarrow ^* {\mathcal {I}_2}\!\!\downarrow \), then \(\mathcal {I}_1 = \mathcal {I}_2\).

Let \(\mathcal {H}\) be a triplet. The residual of \(\mathcal {H}\), denoted \(\tilde{\mathcal {H}}\), is the triplet \((\textit{hyp}(\mathcal {H}), \emptyset , \textit{pred}(\mathcal {H}))\), i.e., the triplet where \({\textit{out}}(\mathcal {H})\) is set to \(\emptyset \). A triplet is empty if it equals \((\emptyset , \emptyset , \emptyset )\).

Our main result in this section states that to check whether a GCQ Q is free-connex acyclic, it suffices to start from \({\mathcal {H}}(Q)\) and do a two-stage reduction: the first from \({\mathcal {H}}(Q)\) until a normal form \({\mathcal {I}}\!\!\downarrow \) is reached, and the second from the residual of \({\mathcal {I}}\!\!\downarrow \), until another normal form \(\mathcal {J}\) is reached.

Theorem 4

Let Q be a GCQ. Assume \({\mathcal {H}}(Q) \rightsquigarrow ^* {\mathcal {I}}\!\!\downarrow \) and \(\tilde{\mathcal {I}} \rightsquigarrow ^* {\mathcal {J}}\!\!\downarrow \). Then, the following hold.

  1. Q is acyclic if, and only if, \(\mathcal {J}\) is the empty triplet.

  2. Q is free-connex acyclic if, and only if, \(\mathcal {J}\) is the empty triplet and \(\textit{var}(\textit{hyp}(\mathcal {I})) = {\textit{out}}(Q)\).

  3. For every GJT T of Q and every connex subset N of T, it holds that \(\textit{var}(\textit{hyp}(\mathcal {I})) \subseteq \textit{var}(N)\).

We devote Sect. 5.1.3 to the proof of this theorem.
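Before turning to the proof, we note that Theorem 4 directly yields an executable decision procedure. The following self-contained Scala sketch implements the three reduction operations and the two-stage check; predicates are represented simply by their variable sets (which is all the reduction inspects), and all names are ours. On the query Q of Example 3, it reports acyclic but not free-connex acyclic.

```scala
object TripletReductionSketch {
  type Edge = Set[String]
  case class Triplet(hyp: List[Edge], out: Edge, pred: List[Edge])

  // Equijoin variables: in out(H) or in at least two hyperedges.
  private def jv(h: Triplet): Set[String] =
    h.out ++ h.hyp.flatten.groupBy(identity).collect { case (v, occ) if occ.size >= 2 => v }

  // ext_H(e): variables of predicates touching e, minus e itself.
  private def ext(h: Triplet, e: Edge): Edge =
    h.pred.filter(_.intersect(e).nonEmpty).flatten.toSet -- e

  private def isolated(h: Triplet, e: Edge): Edge =
    e -- jv(h) -- h.pred.flatten.toSet

  // e ⊑_H ℓ : jv_H(e) ⊆ ℓ and ext_H(e \ ℓ) ⊆ ℓ.
  private def condSubset(h: Triplet, e: Edge, l: Edge): Boolean =
    (e & jv(h)).subsetOf(l) && ext(h, e -- l).subsetOf(l)

  // Try to apply a single (ISO), (CSE), or (FLT) step.
  private def step(h: Triplet): Option[Triplet] = {
    val n = h.hyp.length
    val isoOpt = (0 until n).find(i => isolated(h, h.hyp(i)).nonEmpty).map { i =>
      val e = h.hyp(i) -- isolated(h, h.hyp(i))
      h.copy(hyp = if (e.isEmpty) h.hyp.patch(i, Nil, 1) else h.hyp.updated(i, e))
    }
    def cseOpt = (for { i <- 0 until n; j <- 0 until n
                        if i != j && condSubset(h, h.hyp(i), h.hyp(j)) } yield (i, j))
      .headOption.map { case (i, j) =>
        val gone = h.hyp(i) -- h.hyp(j) // drop predicates touching removed variables
        Triplet(h.hyp.patch(i, Nil, 1), h.out, h.pred.filter(_.intersect(gone).isEmpty))
      }
    def fltOpt = {
      val filters = h.pred.filter(p => h.hyp.exists(e => p.subsetOf(e)))
      if (filters.nonEmpty) Some(h.copy(pred = h.pred.diff(filters))) else None
    }
    isoOpt.orElse(cseOpt).orElse(fltOpt)
  }

  private def normalForm(h0: Triplet): Triplet = {
    var h = h0; var next = step(h)
    while (next.isDefined) { h = next.get; next = step(h) }
    h
  }

  // Theorem 4: returns (acyclic, free-connex acyclic).
  def classify(q: Triplet): (Boolean, Boolean) = {
    val i = normalForm(q)                        // first stage
    val j = normalForm(i.copy(out = Set.empty))  // second stage, on the residual
    val acyclic = j.hyp.isEmpty && j.pred.isEmpty
    (acyclic, acyclic && i.hyp.flatten.toSet == q.out)
  }

  def main(args: Array[String]): Unit = {
    // The GCQ Q of Example 3.
    val q = Triplet(
      hyp  = List(Set("s","t","u"), Set("t","u"), Set("u","w","x"),
                  Set("s","v"), Set("w","z","y")),
      out  = Set("t","u","z","w"),
      pred = List(Set("t","v"), Set("x","y")))
    println(classify(q)) // (true, false)
  }
}
```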

Example 5

Figure 5 illustrates the two-stage sequence of reductions starting from \({\mathcal {H}}(Q)\) with Q the GCQ of Example 3. Note that \({\mathcal {H}}(Q) = \mathcal {H}_1\) and \(\mathcal {H}_5\) is the residual of \(\mathcal {H}_4\). Because we end with the empty triplet, Q is acyclic, but it is not free-connex acyclic since \({\textit{out}}(Q) \subsetneq \textit{var}(\textit{hyp}(\mathcal {H}_4))\).

Theorem 4 gives us a decision procedure for checking free-connex acyclicity of GCQ Q. From its proof in Sect. 5.1.3, we can actually derive an algorithm for constructing a GJT pair for Q. At its essence, this algorithm starts with the set of atoms appearing in Q and subsequently uses the sequence of reduction steps from Theorem 4 to construct a GJT from it, at the same time checking free-connex acyclicity. Every reduction step causes new nodes to be added to the partial GJT constructed so far. We will refer to such partial GJTs as generalized join forests (GJF).

Definition 11

A generalized join forest is a set F of pairwise disjoint GJTs s.t. for distinct trees \(T_1, T_2 \in F\), we have \(\textit{var}(T_1) \cap \textit{var}(T_2) = \textit{var}(n_1) \cap \textit{var}(n_2)\) where \(n_1\) and \(n_2\) are the roots of \(T_1\) and \(T_2\).

Every GJF encodes a hypergraph as follows.

Definition 12

The hypergraph \(\textit{hyp}(F)\) associated with GJF F is the hypergraph that has one hyperedge for every nonempty root node in F,

$$\begin{aligned} \textit{hyp}(F) = \{ \textit{var}(n) \mid n \text { root node in } F, \textit{var}(n) \not = \emptyset \}. \end{aligned}$$

The GJT construction algorithm does not manipulate hypergraph triplets directly. Instead, it manipulates GJF triplets. A GJF triplet is defined like a hypergraph triplet, except that it has a GJF instead of a hypergraph.

Definition 13

A GJF triplet is a triple \(\mathbb {F} = ({\textit{forest}}(\mathbb {F}), {\textit{out}}(\mathbb {F}), \Theta _{\mathbb {F}})\) with \({\textit{forest}}(\mathbb {F})\) a GJF, \({\textit{out}}(\mathbb {F})\) a hyperedge, and \(\Theta _{\mathbb {F}}\) a set of predicates. Every GJF triplet \(\mathbb {F}\) induces a hypergraph triplet \({\mathcal {H}}(\mathbb {F}) = (\textit{hyp}({\textit{forest}}(\mathbb {F})), {\textit{out}}(\mathbb {F}), \Theta _{\mathbb {F}})\).

The algorithm for constructing a GJT pair for a given GCQ Q is now shown in Algorithm 2. It starts in line 2 by initializing the GJF triplet \(\mathbb {F}\) to \(\mathbb {F} = ({\textit{forest}}(Q), {\textit{out}}(Q), {\textit{pred}}(Q))\). Here, \({\textit{forest}}(Q)\) is the GJF obtained by creating, for every atom \(r({\overline{x}})\) that occurs \(k > 0\) times in Q, k corresponding leaf nodes labeled by \(r({\overline{x}})\). In lines 3–4, Algorithm 2 then performs the first phase of reduction steps of Theorem 4. To this end, it checks whether a reduction operation is applicable to \({\mathcal {H}}(\mathbb {F})\) and, if so, enacts this operation by modifying \(\mathbb {F}\) as follows (a code sketch of the first enactment is given after the list).

  • (ISO) If the reduction operation on the hypergraph triplet \({\mathcal {H}}(\mathbb {F})\) is to remove a nonempty subset X of isolated variables from hyperedge e, then \(\mathbb {F}\) is modified as follows. Let \(n_1,\ldots , n_k\) be all the root nodes in \({\textit{forest}}(\mathbb {F})\) that are labeled by e. Merge the corresponding trees into one tree by creating a new node n with \(\textit{var}(n) = e\) and attaching \(n_1,\ldots , n_k\) as children to it with \({\textit{pred}}(n \rightarrow n_i) = \emptyset \) for \(1 \le i \le k\). Then, enact the removal of X by creating a new node p with \(\textit{var}(p) = e {\setminus } X\) and attaching n as child to it with \({\textit{pred}}(p \rightarrow n) = \emptyset \).

  • (CSE) If the reduction operation on \({\mathcal {H}}(\mathbb {F})\) is to remove a hyperedge e because it is a conditional subset of another hyperedge \(\ell \), then \(\mathbb {F}\) is modified as follows. Let \(n_1,\ldots , n_k\) (resp. \(m_1,\ldots , m_l\)) be all the root nodes in \({\textit{forest}}(\mathbb {F})\) that are labeled by e (resp. \(\ell \)) and let \(T_1,\ldots , T_k\) (resp. \(U_1,\ldots , U_l\)) be their corresponding trees. Similar to the previous case, merge the \(T_i\) (resp. \(U_j\)) into a single tree with new root n labeled by e (resp. m labeled by \(\ell \)). Then, enact the removal of e by creating a new node p with \(\textit{var}(p) = \ell \) and attaching n and m as children with \({\textit{pred}}(p \rightarrow n) = \{ \theta \in \Theta _{\mathbb {F}} \mid \textit{var}(\theta ) \cap (e {\setminus } \ell ) \not = \emptyset \}\) and \({\textit{pred}}(p \rightarrow m) = \emptyset \).

  • (FLT) If the reduction operation on \({\mathcal {H}}(\mathbb {F})\) is to remove a nonempty set of predicates \(\Theta \) because there exists a hyperedge e with \(\textit{var}(\Theta ) \subseteq e\), then \(\mathbb {F}\) is modified as follows. Let \(n_1,\ldots , n_k\) be all the root nodes in \({\textit{forest}}(\mathbb {F})\) that are labeled by e. Merge the corresponding trees into one tree by creating a new root n labeled by e and attaching \(n_1,\ldots , n_k\) as children with \({\textit{pred}}(n \rightarrow n_i) = \Theta \). Enact the removal of \(\Theta \) by removing all \(\theta \in \Theta \) from \(\Theta _{\mathbb {F}}\).
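To illustrate the tree-building side, the following Scala fragment sketches one possible representation of GJF triplets and the enactment of an (ISO) step; (CSE) and (FLT) are enacted analogously. Types and names are illustrative only, not those of an actual implementation.

```scala
object GjfEnactmentSketch {
  type Edge = Set[String]
  case class Pred(vars: Edge)                                     // a θ, by its variables
  case class Node(label: Edge, children: List[(Node, Set[Pred])]) // edge-labeled GJT node

  case class GjfTriplet(forest: List[Node], out: Edge, preds: Set[Pred])

  // Enact (ISO): remove the nonempty set x of isolated variables from hyperedge e.
  def enactIso(f: GjfTriplet, e: Edge, x: Edge): GjfTriplet = {
    val (roots, rest) = f.forest.partition(_.label == e)
    // Merge all roots labeled e under a fresh node n with var(n) = e, with
    // empty predicate sets on the new edges ...
    val n = Node(e, roots.map(r => (r, Set.empty[Pred])))
    // ... then enact the removal of x via a new parent p with var(p) = e \ x.
    val p = Node(e -- x, List((n, Set.empty[Pred])))
    f.copy(forest = p :: rest)
  }
}
```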


It is straightforward to check that these modifications of the forest triplet \(\mathbb {F}\) faithfully enact the corresponding operations on \({\mathcal {H}}(\mathbb {F})\), in the following sense.

Lemma 4

Let \(\mathbb {F}\) be a forest triplet and assume \({\mathcal {H}}(\mathbb {F}) \rightsquigarrow \mathcal {I}\). Let \(\mathbb {G}\) be the result of enacting this reduction operation on \(\mathbb {F}\). Then \(\mathbb {G}\) is a valid forest triplet and \({\mathcal {H}}(\mathbb {G}) = \mathcal {I}\).

We continue the explanation of Algorithm 2. In line 5, Algorithm 2 records the set of root nodes obtained after the first stage of reductions. It then sets \({\textit{out}}(\mathbb {F}) = \emptyset \) in line 6 and continues with the second stage of reductions in lines 7–8. Next, it employs Theorem 4 to check acyclicity of Q. If Q is not acyclic, it reports this in lines 9–10. If Q is acyclic, then we know by Theorem 4 that \({\mathcal {H}}(\mathbb {F})\) has become the empty triplet. Note that \({\mathcal {H}}(\mathbb {F})\) can be empty only if all the roots of \(\mathbb {F}\)’s join forest are labeled by the empty set of variables. As such, we can transform this forest into a join tree T by linking all of these roots to a new unique root, also labeled \(\emptyset \). This is done in line 12. In line 13, the set of nodes N is computed; it consists of all nodes identified at the end of the first stage (line 5) plus all of their ancestors in T.

We will prove in Sect. 5.1.3 that Algorithm 2 is correct, in the following sense.

Theorem 5

Given a GCQ Q, Algorithm 2 reports an error if Q is cyclic. Otherwise, it returns a GJT pair (T, N) with T a GJT for Q. If Q is free-connex acyclic, then (T, N) is a GJT pair for Q. Otherwise, \({\textit{out}}(Q) \subsetneq \textit{var}(N)\), but \(\textit{var}(N)\) is minimal in the sense that for every other GJT pair \((T', N')\) with \(T'\) a GJT for Q, we have \(\textit{var}(N) \subseteq \textit{var}(N')\).

It is straightforward to check that this algorithm runs in polynomial time in the size of Q.

Fig. 6 GJT construction by GYO reduction

Example 6

In Fig. 6, we show a GJT T and use this GJT to illustrate a number of GJFs \(F_1,\ldots , F_{10}\) in the following way: let level 1 be the leaf nodes, level 2 the parents of the leaves, and so on. Then, we take GJF \(F_i\) to be the set of all trees rooted at nodes at level i, for \(1 \le i \le 10\), and with each level i we mention the set of remaining predicates \(\theta _i\) for \(1 \le i \le k\), where k is the number of predicates in Q. Nodes (resp. predicates) labeled by “\(\bullet \)” in Fig. 6 indicate that the node (and hence its tree, resp. the predicate) was already present in \(F_{i-1}\) and did not change; they should hence not be interpreted as new nodes (resp. changed predicates). With this coding of forests, it is easy to see that \(\textit{hyp}(F_i) = \textit{hyp}(\mathcal {H}_i)\) for all \(1 \le i \le 9\), with \(\mathcal {H}_i\) illustrated in Fig. 5 (note that the hypergraph of the residual of \(\mathcal {H}_4\), i.e., of \(\mathcal {H}_5\), is the same as that of \(\mathcal {H}_4\); hence, we do not show the corresponding \(F_5\)). Furthermore, \(\textit{pred}(F_i) = \textit{pred}(Q) {\setminus } \textit{pred}(\mathcal {H}_i)\) with Q the GCQ from Example 3. As such, the tree illustrates the sequence of GJF triplets that is obtained by enacting the hypergraph reductions illustrated in Fig. 5. For example, let \(\mathbb {F}_1 = (F_1, {\textit{out}}(Q), \textit{pred}(Q))\). After enacting the removal of hyperedge \(\{t,u\}\) from \(\mathcal {H}_1\) to obtain \(\mathcal {H}_2\), we obtain \(\mathbb {F}_2 = (F_2, {\textit{out}}(Q), \textit{pred}(Q))\). Here, \(F_2\) is obtained by merging the single-node trees (i.e., those labeled by atoms of Q) \(\{s,t,u\}\) and \(\{t,u\}\) into a single tree with root \(\{s,t,u\}\). The shaded area illustrates the nodes in the connex subset N computed by Algorithm 2.

We stress that Algorithm 2 is nondeterministic in the sense that the pair (TN) returned depends on the order in which the reduction operations are performed.

5.1.3 Correctness

To prove Theorems 4 and 5, we first establish a few auxiliary propositions.

Proposition 8

Let Q be a GCQ. Assume \({\mathcal {H}}(Q) \rightsquigarrow ^* {\mathcal {I}}\!\!\downarrow \) and \(\tilde{\mathcal {I}} \rightsquigarrow ^* {\mathcal {J}}\!\!\downarrow \). If \(\mathcal {J}\) is the empty triplet, then, when run on Q, Algorithm 2 returns a pair (T, N) s.t. T is a GJT for Q and \(\textit{var}(N) = \textit{var}(\textit{hyp}(\mathcal {I}))\).

Proof

Assume that \(\mathcal {J}\) is the empty triplet. Algorithm 2 starts in line 2 by initializing \(\mathbb {F} = ({\textit{forest}}(Q),{\textit{out}}(Q), \textit{pred}(Q))\). Clearly, \({\mathcal {H}}(\mathbb {F}) = {\mathcal {H}}(Q)\) at this point. Algorithm 2 subsequently modifies \(\mathbb {F}\) throughout its execution. Let \(\mathbb {H}\) denote the initial version of \(\mathbb {F}\); let \(\mathbb {I}\) denote the version of \(\mathbb {F}\) when executing line 5; let \(\tilde{\mathbb {I}}\) denote the version of \(\mathbb {F}\) after executing line 6 and let \(\mathbb {J}\) denote the version of \(\mathbb {F}\) when executing line 9. By repeated application of Lemma 4, we know that \({\mathcal {H}}(Q) = {\mathcal {H}}(\mathbb {H}) \rightsquigarrow ^* {\mathcal {H}}(\mathbb {I})\). Furthermore, \({\mathcal {H}}(\mathbb {I})\) is in normal form. Since also \({\mathcal {H}}(Q) \rightsquigarrow ^* {\mathcal {I}}\!\!\downarrow \) and normal forms are unique, \({\mathcal {H}}(\mathbb {I}) = \mathcal {I}\). Therefore, \({\mathcal {H}}(\tilde{\mathbb {I}}) = \tilde{\mathcal {I}}\). Again by repeated application of Lemma 4, we know that \(\tilde{\mathcal {I}} = {\mathcal {H}}(\tilde{\mathbb {I}}) \rightsquigarrow ^* {\mathcal {H}}(\mathbb {J})\). Moreover, \({\mathcal {H}}(\mathbb {J})\) is in normal form. Since also \(\tilde{\mathcal {I}} \rightsquigarrow ^* {\mathcal {J}}\!\!\downarrow \) and normal forms are unique, \({\mathcal {H}}(\mathbb {J}) = \mathcal {J}\). As \(\mathcal {J}\) is empty, we will execute lines 12–14. Since \(\mathcal {J}\) is the empty hypergraph triplet, every root of every tree in \({\textit{forest}}(\mathbb {J})\) must be labeled by \(\emptyset \). By definition of join forests, no two distinct trees in \({\textit{forest}}(\mathbb {J})\) hence share variables. As such, the tree T obtained in line 12 by linking all of these roots to a new unique root, also labeled \(\emptyset \), is a valid GJT.

We claim that T is a GJT for Q. Indeed, observe that \(\textit{at}(T) = \textit{at}(Q)\) and the number of times that an atom occurs in Q equals the number of times that it occurs as a label in T. This is because initially \({\textit{forest}}(\mathbb {H}) = {\textit{forest}}(Q)\) and by enacting reduction steps, we never remove nor add nodes labeled by atoms. Furthermore, \(\textit{pred}(T) = \textit{pred}(Q)\). This is because initially \(\Theta _{\mathbb {H}} = \textit{pred}(Q)\) yet \(\Theta _{\mathbb {J}}\) is empty. This means that, for every \(\theta \in \textit{pred}(Q)\), there was some reduction step that removed \(\theta \) from the set of predicates of the current GJF triplet \(\mathbb {F}\). However, when enacting reduction steps, we only remove predicates after we have added them to \({\textit{forest}}(\mathbb {F})\). Therefore, every predicate in \(\textit{pred}(Q)\) must occur in T. Conversely, during enactment of reduction steps, we never add predicates to \({\textit{forest}}(\mathbb {F})\) that are not in \(\Theta _{\mathbb {F}}\), so all predicates in T are also in \(\textit{pred}(Q)\). Thus, T is a GJT for Q.

It remains to show that N is a connex subset of T and \( \textit{var}(N) = \textit{var}(\textit{hyp}(\mathcal {I}))\). To this end, let X be the set of all root nodes of \({\textit{forest}}(\mathbb {I})\), as computed in line 5. Since \(\mathbb {J}\) is obtained from \(\tilde{\mathbb {I}}\) by a sequence of reduction enactments, and since such enactments only add new nodes and never delete them, X is a subset of nodes of \({\textit{forest}}(\mathbb {J})\) and therefore also of T. As computed in line 13, N consists of X and all ancestors of nodes of X in T. Then, N is a connex subset of T by definition. Furthermore, since \({\mathcal {H}}(\mathbb {I}) = \mathcal {I}\), \(\textit{hyp}({\textit{forest}}(\mathbb {I})) = \textit{hyp}(\mathcal {I})\). Thus, \(\textit{var}(X) = \textit{var}(\textit{hyp}({\textit{forest}}(\mathbb {I}))) = \textit{var}(\textit{hyp}(\mathcal {I}))\). Hence, to establish that \(\textit{var}(N) = \textit{var}(\textit{hyp}(\mathcal {I}))\), it suffices to show that \(\textit{var}(X) = \textit{var}(N)\). Since \(X \subseteq N\), the inclusion \(\textit{var}(X) \subseteq \textit{var}(N)\) is immediate. To also establish \(\textit{var}(N) \subseteq \textit{var}(X)\), let n be an arbitrary but fixed node in N. If \(n \in X\), then clearly \(\textit{var}(n) \subseteq \textit{var}(X)\). If \(n \not \in X\), then n was created during the sequence of reduction enactments that transform \(\tilde{\mathbb {I}}\) into \(\mathbb {J}\). Now note that, whenever a new node m is created during a reduction enactment on a GJF \(\mathbb {G}\), there exists a root node of \({\textit{forest}}(\mathbb {G})\) that contains all variables of m. From this observation and the fact that n was created during a sequence of reduction enactments that start from \(\tilde{\mathbb {I}}\), it follows that there is some root node r in \(\tilde{\mathbb {I}}\) with \(\textit{var}(n) \subseteq \textit{var}(r)\). Then, because X contains all root nodes of \(\tilde{\mathbb {I}}\), also \(\textit{var}(n) \subseteq \textit{var}(X)\). Therefore, \(\textit{var}(N) = \textit{var}(X) = \textit{var}(\textit{hyp}(\mathcal {I}))\). \(\square \)

Corollary 1

(Soundness) Let Q be a GCQ and assume that \({\mathcal {H}}(Q) \rightsquigarrow ^* {\mathcal {I}}\!\!\downarrow \) and \(\tilde{\mathcal {I}} \rightsquigarrow ^* {\mathcal {J}}\!\!\downarrow \). Then:

  1. If \(\mathcal {J}\) is the empty triplet, then Q is acyclic.

  2. If \(\mathcal {J}\) is the empty triplet and \(\textit{var}(\textit{hyp}(\mathcal {I})) = {\textit{out}}(Q)\), then Q is free-connex acyclic.

To also show completeness, we will interpret a GJT T for a GCQ Q as a “parse tree” that specifies the two-stage sequence of reduction steps that can be done on \({\mathcal {H}}(Q)\) to reach the empty triplet. Not all GJTs allow us to do so easily, however, and we will, therefore, restrict our attention to those GJTs that are canonical.

Definition 14

(Canonical) A GJT T is canonical if:

  1. its root is labeled by \(\emptyset \);

  2. every leaf node n is the child of an internal node m with \(\textit{var}(n) = \textit{var}(m)\);

  3. for all internal nodes n and m with \(n \not = m\), we have \(\textit{var}(n) \not = \textit{var}(m)\); and

  4. for every edge \(m \rightarrow n\) and all \(\theta \in \textit{pred}(m \rightarrow n)\), we have \(\textit{var}(\theta ) \cap (\textit{var}(n) {\setminus } \textit{var}(m)) \not = \emptyset \).

A connex subset N of T is canonical if every node in it is an interior node of T. A GJT pair (T, N) is canonical if both T and N are canonical.

The following proposition, proven in “Appendix B”, shows that we may restrict our attention to canonical GJT pairs without loss of generality.

Proposition 9

For every GJT pair, there exists an equivalent canonical pair.

We also require the following auxiliary notions and insights. First, if (T, N) is a GJT pair, then define the hypergraph associated with (T, N), denoted \(\textit{hyp}(T,N)\), to be the hypergraph formed by node labels in N,

$$\begin{aligned} \textit{hyp}(T,N) = \{ \textit{var}_T(n) \mid n \in N, \textit{var}_T(n) \not = \emptyset \}. \end{aligned}$$

Further, define \(\textit{pred}(T,N)\) to be the set of all predicates occurring on edges between nodes in N. For a hyperedge \(\overline{z}\), define the hypergraph triplet of (T, N) w.r.t. \(\overline{z}\), denoted \({\mathcal {H}}(T,N,\overline{z})\), to be the hypergraph triplet \((\textit{hyp}(T,N), \overline{z}, \textit{pred}(T,N))\).

The following technical lemma shows that we can use canonical pairs as “parse” trees to derive a sequence of reduction steps. Its proof can be found in “Appendix B”.

Lemma 5

Let \((T,N_1)\) and \((T,N_2)\) be canonical GJT pairs with \(N_2 \subseteq N_1\). Then, \({\mathcal {H}}(T,N_1, \overline{z}) \rightsquigarrow ^* {\mathcal {H}}(T,N_2, \overline{z})\) for every \(\overline{z} \subseteq \textit{var}(N_2)\).

We require the following additional lemma, proven in “Appendix B”:

Lemma 6

Let \(H_1\) and \(H_2\) be two hypergraphs such that for all \(e \in H_2\) there exists \(\ell \in H_1\) such that \(e \subseteq \ell \). Then, \((H_1 \cup H_2, \overline{z}, \Theta ) \rightsquigarrow ^* (H_1, \overline{z}, \Theta )\), for every hyperedge \(\overline{z}\) and set of predicates \(\Theta \).

With these tools in hand, we can prove completeness.

Proposition 10

Let Q be a GCQ, let T be a GJT for Q and let N be a connex subset of T with \({\textit{out}}(Q) \subseteq \textit{var}(N)\). Assume that \({\mathcal {H}}(Q) \rightsquigarrow ^* {\mathcal {I}}\!\!\downarrow \) and \(\tilde{\mathcal {I}} \rightsquigarrow ^* {\mathcal {J}}\!\!\downarrow \). Then, \(\mathcal {J}\) is the empty triplet and \(\textit{var}(\textit{hyp}(\mathcal {I})) \subseteq \textit{var}(N)\).

Proof

By Proposition 9, we may assume without loss of generality that (TN) is a canonical GJT pair. Let A be the set of all of T’s interior nodes. Clearly, A is a connex subset of T and \(\textit{var}(A) \subseteq \textit{var}(Q)\). Furthermore, because for every atom \(r({\overline{x}})\) in Q there is a leaf node l in T labeled by \(r({\overline{x}})\) (as T is a GJT for Q), which has a parent interior node \(n_l\) labeled \({\overline{x}}\) (because T is canonical), also \(\textit{var}(Q) \subseteq \textit{var}(A)\). Therefore, \(\textit{var}(A) = \textit{var}(Q)\). By the same reasoning, \(\textit{hyp}(Q) \subseteq \textit{hyp}(T,A)\). Therefore, \(\textit{hyp}(T, A) = \textit{hyp}(T,A) \cup \textit{hyp}(Q)\). Furthermore, because every interior node in a GJT has a guard descendant, and the leaves of T are all labeled by atoms in Q, we know that for every node \(n \in A\) there exists some hyperedge \(f \in \textit{hyp}(Q)\) such that \(\textit{var}(n) \subseteq \textit{var}(f)\). In addition, we claim that \(\textit{pred}(T,A) = \textit{pred}(Q)\). Indeed, \(\textit{pred}(T,A) \subseteq \textit{pred}(Q)\) since T is a GJT for Q. The converse inclusion follows from canonicality properties (2) and (4): because leaf nodes in a canonical GJT have a parent labeled by the same hyperedge, there can be no predicates on edges to leaf nodes in T. Thus, all predicates in T are on edges between interior nodes, i.e., in \(\textit{pred}(T,A)\). Then, because every predicate in Q appears somewhere in T (since T is a GJT for Q), we have \(\textit{pred}(Q) \subseteq \textit{pred}(T,A)\). From all of the observations made so far and Lemma 6, we obtain:

$$\begin{aligned}&{\mathcal {H}}(T, A, {\textit{out}}(Q)) \\&\quad = (\textit{hyp}(T, A), {\textit{out}}(Q), \textit{pred}(T, A)) \\&\quad = (\textit{hyp}(T, A) \cup \textit{hyp}(Q), {\textit{out}}(Q), \textit{pred}(T, A)) \\&\quad \rightsquigarrow ^* (\textit{hyp}(Q), {\textit{out}}(Q), \textit{pred}(T,A)) \\&\quad = (\textit{hyp}(Q), {\textit{out}}(Q), \textit{pred}(Q)) = {\mathcal {H}}(Q) \end{aligned}$$

Thus \({\mathcal {H}}(T, A, {\textit{out}}(Q)) \rightsquigarrow ^* {\mathcal {H}}(Q) \rightsquigarrow ^* \mathcal {I}\). Furthermore, because (T, N) is also canonical with \(N \subseteq A\) and \({\textit{out}}(Q) \subseteq \textit{var}(N)\), we have \({\mathcal {H}}(T, A, {\textit{out}}(Q)) \rightsquigarrow ^* {\mathcal {H}}(T, N, {\textit{out}}(Q))\) by Lemma 5. Then, because reduction is confluent (Proposition 7), we obtain that \({\mathcal {H}}(T, N,{\textit{out}}(Q))\) and \(\mathcal {I}\) can be reduced to the same triplet. Because \(\mathcal {I}\) is in normal form, necessarily \({\mathcal {H}}(T, N, {\textit{out}}(Q)) \rightsquigarrow ^* \mathcal {I}\). Since reduction steps can only remove variables and hyperedges (and never add them), \(\textit{var}(\textit{hyp}(\mathcal {I})) \subseteq \textit{var}(N)\).

It remains to show that \(\mathcal {J}\) is the empty triplet. Hereto, first verify the following: for any hypergraph triplets \(\mathcal {U}\) and \(\mathcal {V}\), if \(\mathcal {U} \rightsquigarrow ^* \mathcal {V}\) then also \(\tilde{\mathcal {U}} \rightsquigarrow ^* \tilde{\mathcal {V}}\). From this, \({\mathcal {H}}(T, A, {\textit{out}}(Q)) \rightsquigarrow ^* \mathcal {I}\), and the fact that \({\mathcal {H}}(T,A, \emptyset )\) is the residual of \({\mathcal {H}}(T,A, {\textit{out}}(Q))\), we conclude \({\mathcal {H}}(T,A,\emptyset ) \rightsquigarrow ^* \tilde{\mathcal {I}}\). Then, because \(\tilde{\mathcal {I}} \rightsquigarrow ^* \mathcal {J}\), it follows that \({\mathcal {H}}(T, A, \emptyset ) \rightsquigarrow ^* \mathcal {J}\). Let r be T’s root node, which is labeled by \(\emptyset \) since T is canonical. Then, \(\{r\}\) is a connex subset of T. By Lemma 5, \({\mathcal {H}}(T, A, \emptyset ) \rightsquigarrow ^* {\mathcal {H}}(T, \{r\}, \emptyset )\). Now observe that the hypergraph of \({\mathcal {H}}(T, \{r\}, \emptyset )\) is empty and its predicate set is also empty. Therefore, \({\mathcal {H}}(T, \{r\}, \emptyset )\) is the empty hypergraph triplet. In particular, it is in normal form. But, since \(\mathcal {J}\) is also in normal form and normal forms are unique, \(\mathcal {J}\) must also be the empty triplet. \(\square \)

Corollary 2

(Completeness) Let Q be a GCQ. Assume that \({\mathcal {H}}(Q) \rightsquigarrow ^* {\mathcal {I}}\!\!\downarrow \) and \(\tilde{\mathcal {I}} \rightsquigarrow ^* {\mathcal {J}}\!\!\downarrow \).

  1. If Q is acyclic, then \(\mathcal {J}\) is the empty triplet.

  2. If Q is free-connex acyclic, then \(\mathcal {J}\) is the empty triplet and \(\textit{var}(\textit{hyp}(\mathcal {I})) = {\textit{out}}(Q)\).

  3. For every GJT T of Q and every connex subset N of T it holds that \(\textit{var}(\textit{hyp}(\mathcal {I})) \subseteq \textit{var}(N)\).

Proof

  (1) Since Q is acyclic, there exists a GJT T for Q. Let N be the set of all of T’s nodes. Then, N is a connex subset of T and \({\textit{out}}(Q) \subseteq \textit{var}(N) = \textit{var}(Q)\). The result then follows from Proposition 10.

  (2) Since Q is free-connex acyclic, there exists a GJT pair (T, N) for Q. In particular, \(\textit{var}(N) = {\textit{out}}(Q)\). By Proposition 10, \(\mathcal {J}\) is the empty triplet, and \(\textit{var}(\textit{hyp}(\mathcal {I})) \subseteq \textit{var}(N) = {\textit{out}}(Q)\). It remains to show \({\textit{out}}(Q) \subseteq \textit{var}(\textit{hyp}(\mathcal {I}))\). First, verify the following: a reduction step on a hypergraph triplet \(\mathcal {H}\) never removes any variable in \({\textit{out}}(\mathcal {H})\) from \(\textit{hyp}(\mathcal {H})\), nor does it modify \({\textit{out}}(\mathcal {H})\). Then, since \({\textit{out}}({\mathcal {H}}(Q)) = {\textit{out}}(Q) \subseteq \textit{var}(Q) \subseteq \textit{var}(\textit{hyp}({\mathcal {H}}(Q)))\), and \({\mathcal {H}}(Q) \rightsquigarrow ^* \mathcal {I}\), we obtain \({\textit{out}}(Q) \subseteq \textit{var}(\textit{hyp}(\mathcal {I}))\).

  (3) Follows directly from Proposition 10.

\(\square \)

Theorem 4 follows directly from Corollaries 1 and 2. Theorem 5 follows from Theorem 4 and Proposition 8.

5.2 Transforming GJT pairs to query plans

Let us call a GJT pair (T, N) binary if T is binary and sibling closed if N is sibling closed. A query plan is hence a binary and sibling-closed GJT pair. In this section, we prove the following result.

Proposition 11

Every GJT pair can be transformed in polynomial time into an equivalent plan.

We prove Proposition 11 in two steps. First, we show that any pair (T, N) can be transformed in polynomial time into an equivalent sibling-closed pair. Next, we show that any sibling-closed GJT pair (T, N) can be converted in polynomial time into an equivalent plan. Proposition 11 hence follows by composing these two transformations. Throughout this section, let \({{\,\mathrm{ch}\,}}_T(n)\) denote the set of children of n in T.

Sibling-closed transformation We say that \(n \in T\) is a violator node in a GJT pair (T, N) if \(n \in N\) and some, but not all, children of n are in N. A violator is of type 1 if some node in \({{\,\mathrm{ch}\,}}_T(n) \cap N\) is a guard of n. It is of type 2 otherwise. We now define two operations on (T, N) that remove violators of type 1 and type 2, respectively. The sibling-closed transformation is then obtained by repeatedly applying these operators until all violators are removed.

The first operator is applicable when n is a type 1 violator. It returns the pair \((T',N')\) obtained as follows:

  • Since n is a type 1 violator, some \(g \in {{\,\mathrm{ch}\,}}_T(n) \cap N\) is a child guard of n (i.e., \(\textit{var}(n) \subseteq \textit{var}(g)\)).

  • Because every node has a guard, there is some leaf node l that is a descendant guard of g (i.e., \(\textit{var}(g) \subseteq \textit{var}(l)\)). Possibly, l is g itself.

  • Now create a new node p between node l and its parent with label \(\textit{var}(p)=\textit{var}(l)\). Since l is a descendant guard of n and g, p becomes a descendant guard of n and g as well. Detach all nodes in \({{\,\mathrm{ch}\,}}_T(n) {\setminus } N\) from n and attach them as children to p, preserving their edge labels. This effectively moves all subtrees rooted at nodes in \({{\,\mathrm{ch}\,}}_T(n) {\setminus } N\) from n to p. Denote by \(T'\) the final result.

  • If l was not in N, then \(N' = N\). Otherwise, \(N' = N {\setminus } \{l\} \cup \{p\}\).

We write \((T,N) \xrightarrow {1,n} (T',N')\) to indicate that \((T',N')\) can be obtained by applying the above-described operation on node n.

Example 7

Consider the GJT pair (T, N) from Fig. 7, where N is indicated by the nodes in the shaded area. Let us denote the root node by n and its guard child with label \(\{y,z,w\}\) by g. The node \(l = h(y,z,w,t)\) is a descendant guard of g. Since \(s(y,z,m)\) is not in N, n is a violator of type 1. After applying operation 1 for this choice of guard node g and descendant guard node l, we obtain the sibling-closed pair \((T',N')\) shown in the figure.

Lemma 7

Let n be a violator of type 1 in (T, N) and assume \((T,N) \xrightarrow {1,n} (T',N')\). Then, \((T',N')\) is a GJT pair and it is equivalent to (T, N). Moreover, the number of violators in \((T',N')\) is strictly smaller than the number of violators in (T, N).

We prove this lemma in “Appendix C”. The second operator is applicable when n is a type 2 violator. When applied to n in (T, N), it returns the pair \((T',N')\) obtained as follows:

  • Since n is a type 2 violator, no node in \({{\,\mathrm{ch}\,}}_T(n) \cap N\) is a guard of n. Since every node has a guard, there is some \(g \in {{\,\mathrm{ch}\,}}_T(n) {\setminus } N\) which is a guard of n.

  • Create a new child p of n with label \(\textit{var}(p) = \textit{var}(n)\); detach all nodes in \({{\,\mathrm{ch}\,}}_T(n){\setminus } N\) (including g) from n, and add them as children of p, preserving their edge labels. This moves all subtrees rooted at nodes in \({{\,\mathrm{ch}\,}}_T(n) {\setminus } N\) from n to p. Denote by \(T'\) the final result.

  • Set \(N' = N \cup \{p\}\).

We write \((T,N) \xrightarrow {2,n} (T',N')\) to indicate that \((T',N')\) was obtained by applying this operation on n.
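In the same hypothetical style, and reusing Node and the imports from the previous sketch, operation 2 and the overall sibling-closed transformation loop might look as follows. The loop terminates because each application strictly decreases the number of violators (Lemmas 7 and 8).

```scala
// Operation 2: n is a type-2 violator; every child of n outside N (including
// a guard g of n) is re-parented under a fresh node p with var(p) = var(n).
def removeType2Violator(n: Node, N: MSet[Node]): Unit = {
  val p = new Node(n.vars)
  val toMove = n.children.filterNot(N.contains).toList // includes the guard g
  toMove.foreach { c => n.detach(c); p.addChild(c) }
  n.addChild(p)
  N += p // N' = N ∪ {p}
}

// c is a guard of n iff var(n) ⊆ var(c).
def isGuardOf(c: Node, n: Node): Boolean = n.vars.subsetOf(c.vars)

// Follow guard children downward from g until a leaf is reached; the result
// is a leaf descendant guard of g (possibly g itself).
def pickGuardLeaf(g: Node): Node =
  g.children.find(isGuardOf(_, g)) match {
    case Some(c) => pickGuardLeaf(c)
    case None    => g
  }

def nodesOf(root: Node): Iterator[Node] =
  Iterator.single(root) ++ root.children.iterator.flatMap(nodesOf)

// Driver: repeatedly remove violators until the pair is sibling closed.
def siblingClosedTransform(root: Node, N: MSet[Node]): Unit = {
  def findViolator: Option[Node] =
    nodesOf(root).find(n => N.contains(n) &&
      n.children.exists(N.contains) && !n.children.forall(N.contains))
  var v = findViolator
  while (v.isDefined) {
    val n = v.get
    n.children.find(c => N.contains(c) && isGuardOf(c, n)) match {
      case Some(g) => removeType1Violator(n, g, pickGuardLeaf(g), N) // type 1
      case None    => removeType2Violator(n, N)                      // type 2
    }
    v = findViolator
  }
}
```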

Example 8

Consider the GJT pair \((T,N)\) in Fig. 8. Let us denote the root node by n. Since its guard child \(h(y,z,w,t)\) is not in N, n is a violator of type 2. Applying operation 2 on n yields the pair \((T',N')\) shown in Fig. 8, which is a valid sibling-closed GJT pair.

Fig. 7 Illustration of the sibling-closed transform: removal of a type-1 violator. The connex sets N and \(N'\) are indicated by the shaded areas

Fig. 8 Illustration of the sibling-closed transform: removal of a type-2 violator. The connex sets N and \(N'\) are indicated by the shaded areas

Lemma 8

Let n be a violator of type 2 in \((T,N)\) and assume \((T,N) \xrightarrow {2,n} (T',N')\). Then, \((T',N')\) is a GJT pair and it is equivalent to \((T,N)\). Moreover, the number of violators in \((T',N')\) is strictly smaller than the number of violators in \((T,N)\).

The proof can be found in “Appendix C”.

Proposition 12

Every GJT pair can be transformed in polynomial time into an equivalent sibling-closed pair.

Proof

The two operations introduced above remove violators, one at a time. By repeatedly applying them until no violator remains, we obtain an equivalent pair without violators, which must hence be sibling closed. Each operation can clearly be executed in polynomial time and, since every application strictly decreases the number of violators (Lemmas 7 and 8), the number of applications is bounded by the number of nodes in T. The whole transformation hence takes polynomial time. \(\square \)

Fig. 9 Binarizing a k-ary node n

Binary transformation Next, we show how to transform a sibling-closed pair \((T,N)\) into an equivalent binary and sibling-closed pair \((T', N')\). The idea here is to “binarize” each node n with \(k > 2\) children as shown in Fig. 9. There, we assume without loss of generality that \(c_1\) is a guard child of n. The binarization introduces \(k-2\) new intermediate nodes \(m_1,\ldots , m_{k-2}\), all with \(\textit{var}(m_i) = \textit{var}(n)\). Since \(c_1\) is a guard of n and \(\textit{var}(m_i) = \textit{var}(n)\), it is straightforward to see that \(c_1\) is a guard of \(m_1\), which is a guard of \(m_2\), which is a guard of \(m_3\), and so on. Finally, \(m_{k-2}\) is a guard of n. The connex set N is updated as follows. If none of n’s children are in N, i.e., n is a frontier node, set \(N'=N\). Otherwise, since N is sibling closed, all children of n are in N, and we set \(N'=N\cup \{m_1,\ldots ,m_{k-2}\}\). Clearly, \(N'\) remains a sibling-closed connex subset of \(T'\) and \(\textit{var}(N') = \textit{var}(N)\).
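Continuing the hypothetical Scala sketches above (same Node class and imports), the binarization could be realized as follows; the guard child \(c_1\) is assumed to be the first element of the child list.

```scala
// Binarize one node n with k > 2 children c1..ck as in Fig. 9: fresh nodes
// m_1..m_{k-2} all carry var(n); m_1 joins c1 and c2, m_i joins m_{i-1} and
// c_{i+1}, and n keeps m_{k-2} and ck as its only two children.
def binarizeNode(n: Node, N: MSet[Node]): Unit = {
  val cs = n.children.toList // c1 .. ck; cs.head is a guard child of n
  require(cs.length > 2, "n must have more than two children")
  cs.foreach(n.detach)
  var m = new Node(n.vars) // m_1
  m.addChild(cs(0)); m.addChild(cs(1))
  val fresh = ListBuffer(m)
  for (c <- cs.slice(2, cs.length - 1)) { // c3 .. c_{k-1}
    val next = new Node(n.vars) // m_2 .. m_{k-2}
    next.addChild(m); next.addChild(c)
    fresh += next; m = next
  }
  n.addChild(m); n.addChild(cs.last)
  // If n is a frontier node, N is unchanged; otherwise, by sibling-closedness,
  // all of n's children were in N and the fresh m_i are added to N.
  if (cs.exists(N.contains)) fresh.foreach(N += _)
}
```

We may hence conclude: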

Lemma 9

By binarizing a single node in a sibling-closed GJT pair \((T,N)\) as shown in Fig. 9, we obtain an equivalent GJT pair \((T', N')\) that has strictly fewer nonbinary nodes than \((T,N)\).

Binarizing a single node is a polynomial-time operation. By iteratively binarizing nonbinary nodes until all nodes are binary, we hence obtain:

Proposition 13

Every sibling-closed GJT pair can be transformed in polynomial time into an equivalent plan.

6 Implementation

We have implemented IEDyn, the instantiation of GDyn to the setting where all \(\theta \)-joins are inequality joins as described in Sect. 4.4, as a query compiler that generates executable code in the Scala programming language. The generated code instantiates a \((T,N)\)-rep and defines trigger functions that maintain the \((T,N)\)-rep under updates. Off-the-shelf Scala collection libraries are used to implement the required indexes, and we take care to share data structures between the join and enumeration indexes whenever possible. An important optimization in our implementation stems from the observation that, for the nodes in the connex set N that are not in the frontier of N, the multiplicities stored in \(\rho _n\) are never used during enumeration. As such, we also do not need to compute these multiplicities during update processing; a sketch of this idea follows below. While our theoretical framework supports batch updates, our implementation is currently limited to single-tuple updates.
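As an illustration, here is one plausible realization of this optimization (hypothetical classes, not IEDyn's actual code): frontier nodes keep exact multiplicities, while for non-frontier nodes in N it suffices to track a cheap support count whose only purpose is to detect when a tuple enters or leaves \(\rho _n\).

```scala
import scala.collection.mutable

// Frontier nodes of N: exact multiplicities, maintained under signed deltas.
final class ExactGMR[T] {
  private val mult = mutable.Map.empty[T, Long]
  def update(t: T, delta: Long): Unit = {
    val c = mult.getOrElse(t, 0L) + delta
    if (c == 0L) mult -= t else mult(t) = c
  }
  def multiplicity(t: T): Long = mult.getOrElse(t, 0L)
}

// Non-frontier nodes of N: enumeration only tests membership, so a count of
// supporting derivations replaces the exact multiplicity, avoiding the
// multiplications needed to maintain the latter.
final class SupportGMR[T] {
  private val support = mutable.Map.empty[T, Int]
  def add(t: T): Unit = support(t) = support.getOrElse(t, 0) + 1
  def remove(t: T): Unit = {
    val c = support(t) - 1
    if (c == 0) support -= t else support(t) = c
  }
  def contains(t: T): Boolean = support.contains(t)
}
```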

Our implementation supports two modes of operation: push based and pull based. In both modes, the system maintains the \((T,N)\)-rep under updates. In the push-based mode, the system generates, on its output stream, the delta result \(\Delta {Q}(\textit{db},{ u })\) after each single-tuple update \({ u }\). To do so, it uses a modified version of enumeration (Algorithm 1) that we call delta enumeration. Similarly to how Algorithm 1 enumerates \(Q(\textit{db})\), delta enumeration enumerates \(\Delta Q(\textit{db},{ u })\) with constant delay (if Q has at most one inequality per pair of atoms), resp. logarithmic delay (otherwise). To do so, it uses both (1) the T-reduct GMRs \(\rho _n\) and (2) the delta GMRs \(\Delta \rho _n\) that are computed by Algorithm 1 when processing u. In this case, however, one also needs to index the \(\Delta \rho _n\) similarly to \(\rho _n\). In the pull-based mode, in contrast, the system only maintains the \((T,N)\)-rep under updates but does not generate any output stream. Nevertheless, at any time a user can invoke \(\textsc {enum}\) (Algorithm 1) to obtain the current output.
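The shape of the two modes can be pictured with the following hypothetical interface (a sketch of the code shape only; the actual compiled code inlines and specializes all of this per query).

```scala
// A compiled (T,N)-rep exposes update triggers plus two enumeration entry
// points; deltaEnumerate walks the Δρ_n left behind by the last update.
trait TNRep[Tup] {
  def applyUpdate(u: Tup): Unit                // maintains the ρ_n (and Δρ_n)
  def enumerate(emit: Tup => Unit): Unit       // Q(db), constant/log delay
  def deltaEnumerate(emit: Tup => Unit): Unit  // ΔQ(db, u) of the last update
}

// Push mode: emit the delta result after every single-tuple update.
def runPush[Tup](rep: TNRep[Tup], updates: Iterator[Tup], out: Tup => Unit): Unit =
  updates.foreach { u => rep.applyUpdate(u); rep.deltaEnumerate(out) }

// Pull mode: only maintain the rep; results are produced on demand.
def runPull[Tup](rep: TNRep[Tup], updates: Iterator[Tup]): Unit =
  updates.foreach(rep.applyUpdate)
```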

We have described in Sect. 4 how IEDyn can process free-connex acyclic GCQs under updates. It should be noted that our implementation also supports the processing of general acyclic GCQs that are not necessarily free-connex. This is done using the following simple strategy. Let Q be acyclic but not free-connex. First, compute a free-connex acyclic approximation \(Q_F\) of Q. \(Q_F\) can always be obtained from Q by extending the set of output variables of Q. In the worst case, we need to add all variables, and \(Q_F\) becomes the full join underlying Q. Then, use IEDyn to maintain a \((T,N)\)-rep for \(Q_F\). When operating in push-based mode, for each update \({ u }\), we use the \((T,N)\)-rep to delta-enumerate \(\Delta {Q_F}(\textit{db}, { u })\) and project each resulting tuple to materialize \(\Delta {Q}(\textit{db},{ u })\) in an array. Subsequently, we copy this array to the output. Note that the materialization of \(\Delta {Q}(\textit{db}, { u })\) is necessary here since delta enumeration can produce duplicate tuples after projection. When operating in pull-based mode, we materialize \(Q(\textit{db})\) in an array and use delta enumeration of \(Q_F\) to maintain this array under updates. Of course, under this strategy, we require \(\Omega (|{Q(\textit{db})}|)\) space in the worst case, just like (H)IVM would, but we avoid the (partial) materialization of delta queries. Note the distinction between the two modes: in push-based mode \(\Delta {Q}(\textit{db}, { u })\) is materialized (and discarded once the output is generated), while in pull-based mode \(Q(\textit{db})\) is kept materialized so that it can be read upon request.
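A sketch of the push-mode post-processing just described (hypothetical helper names; we use a count map where the implementation uses an array, to make the duplicate handling explicit):

```scala
// Produce ΔQ(db, u) from ΔQ_F(db, u): project each delta tuple of Q_F onto
// Q's output variables and aggregate duplicates before emitting.
def projectedDelta[T, P](deltaEnumerate: (T => Unit) => Unit, // from Q_F's rep
                         project: T => P,                     // drop non-output vars
                         out: (P, Long) => Unit): Unit = {
  val acc = scala.collection.mutable.LinkedHashMap.empty[P, Long]
  deltaEnumerate { t =>
    val p = project(t)
    acc(p) = acc.getOrElse(p, 0L) + 1L // duplicates can arise after projection
  }
  acc.foreach { case (p, m) => out(p, m) } // copy the materialized delta out
}
```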

Our query compiler computes query plans using the algorithm of Sect. 5. Whenever we have the choice between enacting multiple reduction steps, we first enact using (ISO), then using (FLT), and finally using (CSE). This corresponds to the usual heuristics of pushing down projections and selections. If multiple applications of (CSE) are possible, we prefer those where the hyperedge to be eliminated has no extended variables. This corresponds to pushing down equi-semijoins so that inequality semijoins are hopefully executed over GMRs of reduced cardinality.

7 Experimental evaluation

In this section, we experimentally compare IEDyn against competing state-of-the-art HIVM and CER systems.

Table 1 Benchmark queries

Queries Because the effectiveness of Dyn for equijoin queries has already been documented [25], we focus on inequality join queries in our experimental analysis. Since there is no established industry-strength benchmark suite for such queries, we perform a systematic and in-depth exploration of the design space of inequality joins of up to three relations. Concretely, we evaluate IEDyn on the queries listed in Table 1. Here, \(Q_1\)–\(Q_7\) are full join queries (i.e., queries without projections). Among these, \(Q_1\), \(Q_3\), and \(Q_4\) contain only inequality join predicates, while \(Q_2\) and \(Q_5\)–\(Q_7\) additionally contain at least one equality join. Queries \(Q_1\) and \(Q_2\) are binary joins, while \(Q_3\)–\(Q_7\) are multi-way join queries over three relations. The inequality predicates in \(Q_3\) are unrelated (\(a< d \wedge e < g\)), while they form a chain (\(a< d < g\)) in \(Q_4\)–\(Q_7\). Note that \(Q_5\)–\(Q_7\) are variants of \(Q_4\) with equijoins added. We have similarly experimented with variants of \(Q_3\) with equijoins added, but the trends are similar to what we obtain for \(Q_4\)–\(Q_7\). Collectively, \(Q_3\)–\(Q_7\) (and the omitted variants) cover all possible ways in which three relations can be inequality-joined in an acyclic manner.

Queries \(Q_8\)–\(Q_{15}\) project over the results of queries \(Q_4\)–\(Q_7\). Among these, \(Q_8\)–\(Q_{11}\) are free-connex acyclic while \(Q_{12}\)–\(Q_{15}\) are acyclic but not free-connex.

Updates We evaluate on streams of synthetic updates where each update consists of a single-tuple insertion. We focus on this setting for the following reasons. First, single-tuple updates stress-test dynamic query processing, since query results must be kept up-to-date after every single tuple, in contrast to the batch setting, where results may be out-of-sync for the duration of the batch. Second, since batch updates can always be processed by executing all updates in the batch individually (using the single-tuple update triggers), performance measurements for single-tuple updates yield an upper bound on the time needed to process batch updates. Third, while in principle batch updates can be processed faster than single-tuple updates (e.g., by amortizing index accesses and data structure updates over the entire batch instead of per tuple), neither our implementation nor most of the competing systems described below implement such optimizations (see footnote 6). As such, we restrict attention to single-tuple updates. We focus on insertions because these are supported by all of the systems that we consider, whereas explicit deletions are not. While we have experimented with mixed (insert and delete) update streams for IEDyn, performance is similar to that for insert-only streams. This is expected, since IEDyn treats insertions and deletions uniformly.

The database is always empty when we start processing the update stream. We synthetically generate two kinds of update streams: randomly ordered and temporally ordered. In randomly ordered update streams, insertions can occur in any order. In contrast, temporally ordered update streams guarantee that every attribute that participates in an inequality in the query has a larger value than the same attribute in all previously inserted tuples. Randomly ordered update streams are useful for comparing against systems that can process out-of-order tuples; temporally ordered update streams are useful for comparing against systems that assume events always arrive with increasing timestamps, such as automaton-based CER systems.
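For concreteness, generators of the two stream kinds could be sketched as follows (hypothetical code; our actual generator additionally controls the value distributions that determine join selectivity).

```scala
import scala.util.Random

// A single-tuple insertion into `relation`; `ineqAttr` is the attribute
// used in the query's inequality predicates.
final case class Insert(relation: String, ineqAttr: Long)

// Randomly ordered: inequality-attribute values arrive in arbitrary order.
def randomStream(relations: Vector[String], n: Int, rng: Random): Iterator[Insert] =
  Iterator.fill(n)(Insert(relations(rng.nextInt(relations.size)), rng.nextLong()))

// Temporally ordered: every inserted tuple carries a strictly larger
// inequality-attribute value than all previously inserted tuples.
def temporalStream(relations: Vector[String], n: Int, rng: Random): Iterator[Insert] = {
  var clock = 0L
  Iterator.fill(n) {
    clock += 1 + rng.nextInt(10)
    Insert(relations(rng.nextInt(relations.size)), clock)
  }
}
```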

Competitors We compare IEDyn with the following state-of-the-art HIVM and CER engines: DBToaster (DBT) [30], Esper (E) [19], SASE (SE) [3, 46, 49], Tesla (T) [14, 15], and ZStream (Z) [31]. The competing systems differ in their mode of operation (push based vs. pull based, cf. Sect. 6) and some of them only support temporally ordered streams. The capabilities of each system are summarized in Table 2 and discussed in detail in “Appendix D”.

Table 2 Competing systems capabilities overview

Setup Our experiments are run on an 8-core 3.07 GHz machine running Ubuntu with GNU/Linux 3.13.0-57-generic. To compile the different systems and the generated trigger programs, we used GCC version 4.8.2, Java 1.8.0_101, and Scala version 2.12.4. Each query is evaluated ten times to measure update processing delay and twice to measure memory footprint; we report the average over those runs. Each time a query is evaluated, 20 GB of main memory are freshly allocated to the program. To measure the memory footprint of Scala/Java-based systems, we query the JVM's memory usage every 10 updates and record the maximum value. For C/C++-based systems we use the GNU/Linux time command to measure memory usage. Experiments that measure memory footprint are always run separately from the experiments that measure processing time.

7.1 Results

We compare using total update processing time, memory footprint, and enumeration delay as metrics. Here, the total update processing time is the time required to process the entire update stream, where updates are fed into the systems as fast as they can process them; this hence measures maximum system performance. While measuring update processing time, we take care to compare fairly by consistently running IEDyn in the same operation mode as the one supported by the competitor. Concretely, for push-based systems, we report the time required to process the entire update stream while generating the delta result \(\Delta Q(\textit{db},u)\) after each update (cf. Sect. 6). When comparing against a pull-based system, the measured time includes only the processing of the update stream. For these systems, we later separately report the enumeration delay, i.e., the speed with which the result can be generated from the underlying representation of the output (a T-representation in the case of IEDyn). SASE, ZStream, and Tesla only support temporally ordered streams, and hence we only compare with them on such streams. While DBToaster and Esper support both randomly and temporally ordered streams, we only report comparisons on randomly ordered streams. We have also measured temporally ordered streams for these systems, but their throughput changes only modestly (by 3% to 12%), while that of IEDyn improves significantly (by 35% to 50%) because insertions into sorted lists become constant- rather than logarithmic-time. We omit these experiments due to lack of space.

Some executions of the competing systems failed, either because they required more than 20 GB of main memory or because they took more than 1500 s. If an execution required more than 20 GB, we report the processing time elapsed until the out-of-memory exception was raised. If an execution was still running after 1500 s, we stopped it and report its maximum memory usage up to that point.

Table 3 (Left) Maximum update stream and result sizes, \(k=10^3\). (Right) Relative performance of IEDyn, expressed as a percentage of the corresponding resources used by competing systems
Fig. 10 Processing time in seconds on full join queries for various stream sizes. (*: \(DBT\) out of memory, +: \(Z\) out of memory, \('\): \(T\) stopped after 1500 s)

Full join queries We first analyze the performance of the full join queries \(Q_1\)–\(Q_7\) in the setting where the join selectivity is relatively large. Here, the selectivity of \(R \bowtie S\) is defined as \(\frac{|{R \bowtie S}|}{ |{R}| \times |{S}|}\). High-selectivity joins are common in CER scenarios. We refer to Table 3 (Left) for output sizes. Figure 10 compares the update processing time of IEDyn against the competing systems for full join queries \(Q_1\)–\(Q_7\), grouped per system capability (push/pull, random/temporal). We observe that all of the competing systems have large processing times even for very small update stream sizes, indicating poor scalability. In particular, \(DBT\) runs out of memory on streams of size \(\ge \!18k\) for query \(Q_5\), and on streams of size \(\ge \!15k\) for query \(Q_6\). \(Z\) runs out of memory for \(Q_5\) and \(Q_6\) on streams of size 21k. Similarly, \(T\) took more than 1500 s for \(Q_5\) on streams of size \(\ge \!12k\), for \(Q_6\) on streams of size \(\ge \!18k\), and for \(Q_7\) on streams of size \(\ge \!9k\), and was aborted at that time. All of these behaviors are due to the large selectivity of the joins on this dataset. In all cases, IEDyn scales satisfactorily with increasing stream sizes and significantly outperforms the competitors, often by orders of magnitude. This is confirmed in Table 3 (Right), where we show the processing time and memory footprint of IEDyn as a percentage of the corresponding usage by the competing systems (for the largest stream size of each query). Writing “(x, y) oom” to indicate x orders of magnitude improvement in processing time and y orders of magnitude improvement in memory consumption, we see that IEDyn improves up to (1, 2) oom w.r.t. Z; almost (3, 1) oom w.r.t. T; almost (2, 2) oom w.r.t. SE; and up to (4, 3) oom w.r.t. DBT. Moreover, for these queries, even in push-based mode IEDyn can support the enumeration of query results from its data structures at any time, while competing push-based systems have no such support. Hence, IEDyn is not only more efficient but also provides more functionality.

Drilling deeper into the specific queries, we see from Fig. 10 that while existing systems already perform poorly on the inequality-only binary join query \(Q_1\), this is further aggravated when moving to the corresponding three-way join queries \(Q_3\) and \(Q_4\): note that the plot for \(Q_1\) shows stream sizes up to 12k, whereas the plots for \(Q_3\) and \(Q_4\) only go up to 2.7k while exhibiting much larger runtimes. Because \(Q_2\) and \(Q_5\)–\(Q_7\) modify \(Q_1\), resp. \(Q_4\), by adding equality join predicates, they have a smaller join selectivity. As a result, performance on these queries is much better across all systems. (To see this, note that for \(Q_5\)–\(Q_6\) the first data point plotted has a significantly lower processing time despite processing a stream three times the size.) Similarly, we note that, while not visually apparent from Fig. 10, all systems are 10%–100% slower on \(Q_3\) than on \(Q_4\). This is due to the fact that the two unrelated inequality predicates (\(a< d \wedge e < g\)) of \(Q_3\) are linearly ordered in \(Q_4\) (\(a< d \wedge d < g\)), which gives \(Q_4\) a smaller join selectivity, improving performance. From these observations we may conclude that the performance of all systems increases as the join selectivity decreases. Nevertheless, in contrast to existing systems, IEDyn continues to scale and perform satisfactorily even for large selectivities.

Fig. 11 IEDyn as percentage of (\(E\), \(DBT\), \(SE\), \(Z\), \(T\)) for higher join selectivities. The x-axis shows queries with tuples per relation and selectivities

To confirm this trend in the setting of very low selectivities, where processing is hence less output-size dependent, we have also generated datasets with probability distributions that are parametrized by a selectivity s, such that the expected number of output tuples is s percent of the Cartesian product of all relations in the query. The results in Fig. 11 show that IEDyn consistently continues to perform better, also on very selective inequality joins. For extremely selective inequality joins, the measurements become similar to what we observe for equality joins, which we investigated in detail in [25].

Projections The trends observed for full join queries are confirmed for the queries \(Q_8\)–\(Q_{15}\) with projections, as shown in Table 3 (Right). In particular, IEDyn improves up to almost (4, 1) oom w.r.t. T; up to almost (1, 1) oom w.r.t. E; and up to (4, 3) oom w.r.t. DBT. Again, we observe that IEDyn consistently outperforms the competitors and that its performance gain grows with the join selectivity (and hence the output size) of the queries. In addition, we note that the performance gain of IEDyn is bigger on queries that are free-connex acyclic (namely \(Q_8\)–\(Q_{11}\)) as opposed to those that are not (\(Q_{12}\)–\(Q_{15}\)). This is to be expected, since our implementation evaluates non-free-connex queries by approximating them by free-connex variants, requiring additional materialization (cf. Sect. 6).

Fig. 12 Enumeration of query results: IEDyn vs. \(DBT\) for various input stream sizes. Different bars for \(Q_4,Q_5,Q_6\) show their projected versions

Result enumeration CDE is a theoretical notion that hides a constant factor which could, in practice, decrease performance when compared to enumerating from a fully materialized representation. In Fig. 12, we examine the practical behavior of CDE in IEDyn and compare against \(DBT\), which materializes the full query result. We plot the time required to enumerate the result from IEDyn’s T-rep as a fraction of the time required to enumerate the result from \(DBT\)’s materialized views. As can be seen from the figure, both enumeration times are comparable on average. Note that we do not compare enumeration times for push-based systems, since for these systems the time required for delta enumeration is already included in the update processing time.

8 Conclusions

Traditional techniques for dynamic query evaluation are based on a trade-off between materialization of join subresults (to avoid recomputing these subresults) and their recomputation (to avoid the space overhead induced by materialization). We have shown that instead of materializing full join subresults, it suffices to maintain and index semijoin subresults. This methodology, called General Dynamic Yannakakis, allows us to maintain a data structure that, like recomputation, has low memory overhead, and yet supports all operations one commonly expects from full materialization: enumeration with bounded delay and efficient processing of updates. In addition, the framework supports bounded-delay delta enumeration under single-tuple updates, hence allowing it to operate in a push-based mode similar to streaming systems. Our experiments against state-of-the-art engines in the IVM and CER domains show that Dynamic Yannakakis can improve performance by orders of magnitude in both time and space. The performance gap with existing systems is largest for output-dominated queries (i.e., queries where the join result is large). While the gap decreases for more selective joins, GDyn continues to consistently outperform existing systems. In addition, while GDyn's theory is developed only for free-connex acyclic GCQs, our experiments show that GDyn's adaptation to acyclic GCQs that are not free-connex (by means of free-connex approximation followed by post-processing) is equally effective compared to existing systems.

From a theoretical viewpoint, it would be satisfying to establish lower bounds on the complexity of processing inequality joins dynamically. For equijoins, such lower bounds were recently established by Berkholz et al. [8]. Since GDyn meets these lower bounds on queries with equijoins only [25], it would be interesting to know whether it is similarly optimal for inequality joins.