Knowledge Compilation Meets Database Theory: Compiling Queries to Decision Diagrams

Jha, Abhay; Suciu, Dan

doi:10.1007/s00224-012-9392-5

Published: 06 March 2012

Volume 52, pages 403–440, (2013)
Theory of Computing Systems

Abstract

The goal of Knowledge Compilation is to represent a Boolean expression in a format in which it can answer a range of “online-queries” in PTIME. The online-query of main interest to us is model counting, because of its application to query evaluation on probabilistic databases, but other online-queries can be supported as well such as testing for equivalence, testing for implication, etc. In this paper we study the following problem: given a database query q, decide whether its lineage can be compiled efficiently into a given target language. We consider four target languages, of strictly increasing expressive power (when the size of compilation is restricted to be polynomial in the data size): read-once Boolean formulae, OBDD, FBDD and d-DNNF. For each target, we study the class of database queries that admit polynomial size representation: these queries can also be evaluated in PTIME over probabilistic databases. When queries are restricted to conjunctive queries without self-joins, it was known that these four classes collapse to the class of hierarchical queries, which is also the class of PTIME queries over probabilistic databases. Our main result in this paper is that, in the case of Unions of Conjunctive Queries (UCQ), these classes form a strict hierarchy. Thus, unlike conjunctive queries without self-joins, the expressive power of UCQ differs considerably with respect to these target compilation languages. Moreover, we give a complete characterization of the first two target languages, based on the query’s syntax.

1 Introduction

The goal of Knowledge Compilation [6, 13, 27] is to represent a Boolean expression in a language in which it can answer a range of problems, also called “online-queries”, in PTIME. Typical problems are satisfiability, validity, implication, model counting, substitution with constants, and substitution with functions. For example, the model counting problem asks for the number of satisfying assignments to a Boolean expression; the more general probability computation problem asks for the probability of that expression being true, if every variable is true/false independently with some probability. If one compiles the Boolean expression into (say) an FBDD, then the model counting problem and the probability computation problem can be solved in linear time in the size of the FBDD. Different compilation languages can solve efficiently different classes of problems, in time polynomial in the size of compiled expression [13]. This motivates the need to know if an expression can be compiled into a small-sized or compact representation in a given language.

The provenance of a query on a relational database is an expression that describes how the answer was derived from the tuples in the database [17]. In this paper, we are interested in the flavor of provenance called PosBool in [25] (see also [18]), which we will refer to as lineage. The lineage is a Boolean expression over Boolean variables corresponding to tuples in the input database. Our goal in this paper is to identify queries whose lineage admits a compact compilation. Our main motivation comes from (but is not limited to) probabilistic databases, where the problem is the following: given a query and a probabilistic database (i.e. each tuple has a given probability), compute the probability of each query answer [10]. If the lineage has been compiled into a compact format that supports the probability computation, then one can compute the output probabilities efficiently. In this paper we study queries whose lineage always admits a compact compilation, on any database instance. We are only interested in the data complexity i.e., we assume the query to be fixed (and, in particular its size is constant). Our query language is that of unions of conjunctive queries, UCQ, and, as usual, we restrict our discussion to Boolean queries.

We consider four compilation targets. For each target T, we denote by UCQ(T) the class of UCQ queries whose lineage admits a compact compilation in T for all input databases: the precise definition of the term “compact” depends on the compilation target, but usually means that the target has polynomial size. There are two different ways of defining a compact compilation: (i) Uniform: A compact compilation can be found in polynomial time, (ii) Non-Uniform: A compact compilation exists, but there are no restrictions on how to find it. The first interpretation is more strict, but makes more practical sense if one is looking for tractable algorithms for compilation; all our upper bounds are for uniform compilation. The second interpretation is more useful in complexity theory, since the results show the expressibility, and limitation of different models of compilation/computation; all our lower bounds are for non-uniform compilation. Unless stated otherwise, we always assume that the compilation is uniform, except in Sect. 7 where we focus exclusively on non-uniform compilation.

Our first target is the class of read-once expressions, denoted RO. A read-once Boolean expression is an expression built from the ∧, ∨, ¬ operators in which every input variable is used only once. A read-once Boolean formula is one that can be represented by a read-once expression. Read-once formulas admit an elegant characterization due to Gurvich [19] (see [16]). Thus, UCQ(RO) is the class of queries q such that for every input database, the lineage of q on that database is a read-once formula. Our second and third targets are Ordered and Free Binary Decision Diagrams. A Binary Decision Diagram (BDD)^{Footnote 1} is a rooted DAG where each internal node is labeled with a variable and has two outgoing edges labeled 0 and 1, and each sink node is labeled either 0 or 1. A BDD represents a Boolean function, as follows. Given an assignment to the Boolean variables, the value of the function is obtained by traversing the BDD starting at the root node, and at each internal node following either the 0 or the 1 edge, according to the value of that node’s variable. The unique sink node reached at the end of the traversal gives the value (0 or 1) of the Boolean function under that assignment. A BDD is free (hence FBDD) if any path from the root to a sink node reads every variable at most once. An FBDD is ordered (hence OBDD) if there exists a total order on the Boolean variables s.t. any path from the root to a sink node reads the variables in this order (it may skip some variables). Thus, UCQ(OBDD) and UCQ(FBDD) denote the classes of queries q s.t. for any database instance D, one can construct an OBDD (respectively FBDD) for the lineage of q on D, in time polynomial in the size of D; in particular, the resulting OBDD or FBDD also has size polynomial in the size of D.
Finally, our fourth target is the language of deterministic Decomposable Negation Normal Form (d-DNNF) introduced by Darwiche [12] (see also [13]). A d-DNNF is a DAG whose leaves are labeled with literals (Boolean variables or negated Boolean variables), and whose internal nodes are labeled either with an independent-∧ (where the two children must have disjoint sets of Boolean variables), or with a disjoint-∨ (where the two children must be mutually exclusive Boolean formulas). We also allow one more type of internal node: not (¬). UCQ(dDNNF) is the class of queries s.t. one can construct a d-DNNF of the lineage in PTIME for any input database.
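
The three node types of a d-DNNF each correspond to one arithmetic rule for probabilities, which the following sketch makes explicit. The tuple encoding (`("lit", var, polarity)`, `("and", l, r)`, `("or", l, r)`, `("not", c)`) and the name `ddnnf_probability` are assumptions made for illustration only. For brevity the sketch recurses over a tree; on a genuine DAG with shared nodes one would memoize per node to stay linear in the size of the d-DNNF.

```python
def ddnnf_probability(node, prob):
    """Probability of a d-DNNF given independent variable probabilities."""
    kind = node[0]
    if kind == "lit":
        _, var, positive = node
        return prob[var] if positive else 1.0 - prob[var]
    if kind == "and":
        # independent-and: children share no variables, so probabilities multiply
        return ddnnf_probability(node[1], prob) * ddnnf_probability(node[2], prob)
    if kind == "or":
        # disjoint-or: children are mutually exclusive, so probabilities add
        return ddnnf_probability(node[1], prob) + ddnnf_probability(node[2], prob)
    if kind == "not":
        return 1.0 - ddnnf_probability(node[1], prob)
    raise ValueError(f"unknown node type: {kind!r}")


# X or Y, rewritten as the disjoint disjunction X or (not X and Y).
x_or_y = ("or",
          ("lit", "X", True),
          ("and", ("lit", "X", False), ("lit", "Y", True)))
result = ddnnf_probability(x_or_y, {"X": 0.3, "Y": 0.5})
```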

In addition to these four classes defined by a compilation target, we also consider UCQ(P), the class of queries q with the property that, for every probabilistic database D, the probability of q on D can be computed in PTIME in the size of D. It follows from known results that these five classes form an increasing hierarchy: UCQ(RO)⊆UCQ(OBDD)⊆UCQ(FBDD)⊆UCQ(dDNNF)⊆UCQ(P).

Dalvi and Suciu [9, 10] have studied the evaluation problem over probabilistic databases for conjunctive queries without self-joins, denoted here CQ ⁻, and showed that the class of queries computable in PTIME, CQ ⁻(P), consists precisely of hierarchical queries (reviewed in Sect. 2). Olteanu and Huang [20] have shown a remarkable result: for any hierarchical query without self-joins, its lineage is a read-once formula. In other words, they explained that hierarchical queries can be computed in PTIME because their lineage is read-once. This immediately implies (assuming FP≠#P) that the following five classes collapse: CQ ⁻(RO)=CQ ⁻(OBDD)=CQ ⁻(FBDD)=CQ ⁻(dDNNF)=CQ ⁻(P).

In this paper we show that, over unions of conjunctive queries (UCQ), these classes no longer collapse. In fact they form a strict hierarchy:

$$\mathit{UCQ}(\mathit{RO}) \subsetneq \mathit{UCQ}(\mathit{OBDD}) \subsetneq \mathit{UCQ}(\mathit{FBDD}) \subsetneq \mathit{UCQ}(\mathit{dDNNF}) \subseteq \mathit{UCQ}(P)$$

This means that the reason why certain queries can be computed in PTIME over probabilistic databases is no longer their read-once-ness, or any other efficient compilation method (We were not able to separate UCQ(dDNNF) from UCQ(P) but we conjecture that they are also separated). Instead, each notion of efficiency is distinct. We refer to Table 1 to discuss our results.

Table 1 Several representative queries defined in Table 2. All queries are hierarchical, and have the additional syntactic properties shown. $\hat{0}$ denotes the minimal element of the query’s CNF-lattice; μ its Mobius function. Queries q ₂, q _V, q _W separate the corresponding classes. We conjecture that q ₉ separates UCQ(dDNNF) from UCQ(P). ^∗: assuming FP≠#P; ^?: conjectured

Table 2 Important queries used throughout the paper

Our results make use of three syntactic properties of a query, called inversion [8], separator [11], and hierarchical queries [10], reviewed in Sect. 2. The following strict implications hold: inversion-free implies existence of separators at all levels, which implies the query is hierarchical.

We give a complete characterization of UCQ(RO) and UCQ(OBDD). First, UCQ(OBDD) coincides with inversion-free queries. UCQ(RO) coincides with queries that are both inversion-free and can be written using ∧,∨,∃ such that every relation symbol occurs only once. For example, consider the query q ₁ in Table 1, q ₁=∃x ₁.∃y ₁.R(x ₁),S(x ₁,y ₁)∨∃x ₂.∃y ₂.T(x ₂),S(x ₂,y ₂); in this paper we drop existential quantifiers when they are clear from the context, and write the query as q ₁=R(x ₁),S(x ₁,y ₁)∨T(x ₂),S(x ₂,y ₂). The query can also be written as ∃x.((R(x)∨T(x))∧∃y.(S(x,y))): here each symbol R,S,T occurs only once and, since q ₁ is also inversion-free, it follows that it is in UCQ(RO). Note that our characterization of UCQ(RO) is unrelated to Gurvich’s characterization of read-once Boolean expressions [16, 19], or to algorithms for checking read-once-ness in [21, 23]: these results apply to the Boolean formula, while our results apply directly to the query.

For UCQ(FBDD) and UCQ(dDNNF), we only give sufficient conditions by making use of the CNF-lattice associated to a query (introduced in [11]), where each lattice element x is labeled by a subquery, denoted λ(x). A sufficient condition for a query to be in UCQ(FBDD) is for every lattice element to have a separator and to satisfy certain additional conditions (see Sect. 6). A sufficient condition for UCQ(dDNNF) is that every lattice element must have a separator, except those lattice elements that can be erased (a notion we define in Sect. 6). For comparison, the necessary and sufficient condition for UCQ(P) is that every lattice element must have a separator, except those lattice elements where the Mobius function is 0 (μ=0) [11]. If an element can be erased, then its Mobius function is 0, but the converse is not true, as illustrated by q ₉ in Table 1. We conjecture that q ₉ is not in UCQ(dDNNF).

The most difficult results in this paper are the separation results UCQ(OBDD)⊊UCQ(FBDD)⊊UCQ(dDNNF); they are separated by the queries q _V and q _W respectively in Table 1. In each case we prove that the query does not belong to the smaller class, but that it belongs to the larger class. Both separations extend even to the non-uniform definitions of these classes. More precisely, we show q _V∉UCQ(OBDD), even if one assumes the non-uniform definition of UCQ(OBDD), and that q _V∈UCQ(FBDD); similarly q _W∉UCQ(FBDD), even for a non-uniform definition of this class, and q _W∈UCQ(dDNNF). These results are significant for the following reason. All lineage expressions for queries in UCQ are very simple: they are monotone, and have a DNF expression of polynomial size. This also applies to the lineage expressions of q _V and q _W. Thus, our lower bounds make a contribution to the general separation problem of polynomial-size OBDD, FBDD, and d-DNNF. Early lower bounds for FBDD were for non-monotone formulas, with exponential size DNFs. The first “simple” Boolean formula shown to have exponential FBDD was given by Gál in [14], followed by a “very simple” formula given by Bollig and Wegener [1]. But it is not very surprising that the “very simple” formula has no polynomial size FBDD, since computing its probability is #P-hard: the “very simple” formula is precisely the lineage of the non-hierarchical query R(x),S(x,y),T(y), for which computing the probability is #P-hard. In contrast, for both q _V and q _W one can compute the probability in polynomial time: hence, the fact that they do not admit polynomial size OBDDs or polynomial size FBDDs respectively is more surprising. On the other hand, we can use Bollig and Wegener’s result to prove that, for every non-hierarchical query, its lineage has no polynomial size FBDD.

The lineage of the query q _V, that we use for the first major separation between UCQ(OBDD) and UCQ(FBDD) is, to the best of our knowledge, the first “simple” Boolean formula separating polynomial-size OBDD from FBDD. Previous Boolean formulas separating the two classes are non-monotone, and do not have polynomial size DNFs. The classic example is the Weighted Bit Addressing problem (WBA), defined as $F(X_{1}, \ldots, X_{n}) =X_{\sum_{i=1,n} X_{i}}$ (where X ₀=0). Bryant [5] has shown that it has no polynomial size OBDD, while Gergov and Meinel [15] and independently Sieling and Wegener [24] have shown that WBA has a polynomial sized FBDD. More examples are given in [26]. Our characterization of UCQ(OBDD) and UCQ(FBDD) allows one to give a class of simple Boolean expressions that separate polynomial-size OBDD from FBDD.
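
For concreteness, the Weighted Bit Addressing function defined above can be written in a few lines. The name `weighted_bit_addressing` and the 0-based list encoding (so that `bits[i-1]` holds X_i) are assumptions of this sketch. It shows that the output bit is selected by the number of ones in the input, which is what makes the function non-monotone.

```python
def weighted_bit_addressing(bits):
    """F(X_1, ..., X_n) = X_s where s = X_1 + ... + X_n, with X_0 = 0."""
    s = sum(bits)
    # bits[i - 1] stores X_i; the convention X_0 = 0 covers the all-zero input.
    return 0 if s == 0 else bits[s - 1]


# Flipping X_3 from 0 to 1 moves the address from 1 to 2 and changes the
# output from 1 to 0, so F is not monotone.
before = weighted_bit_addressing([1, 0, 0])
after = weighted_bit_addressing([1, 0, 1])
```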

The lineage of the query q _W that we use for our second major separation between UCQ(FBDD) and UCQ(dDNNF) is also, to the best of our knowledge, the first “simple” Boolean formula separating polynomial-size FBDD from d-DNNF. The previous separation relies on a result due to Bollig and Wegener [2]: they give an example of two Boolean formulas Φ ₁,Φ ₂ that have polynomial size OBDD, Φ ₁∧Φ ₂≡false, yet Φ ₁∨Φ ₂ cannot have polynomial size FBDD. Hence Φ ₁∨Φ ₂ separates d-DNNF from FBDD.

Finally, we note that no formula with exponential lower bound on d-DNNF size is presently known. In particular, we leave open the question whether UCQ(dDNNF)⊊UCQ(P). However, our algorithm in Sect. 6 suggests how d-DNNF may be constructed for general queries, which further suggests that this is not possible for q ₉. We conjecture that q ₉ is not in UCQ(dDNNF), and, hence, that its lineage has no polynomial size d-DNNF.

The paper is organized as follows. We give the basic definitions and review the relevant results in [11] in Sect. 2, then discuss read-once, OBDD, FBDD, and d-DNNF in Sect. 3, Sect. 4, Sect. 5, Sect. 6. We discuss results for non-uniform setting in Sect. 7 and conclude in Sect. 8.

2 Background and Definitions

In this paper we discuss unions of conjunctive queries (UCQ), which are expressions defined by the following grammar:

$$Q ::= R(\bar{x}) \mid \exists x.Q_{1} \mid Q_{1} \wedge Q_{2} \mid Q_{1} \vee Q_{2} \qquad (1)$$

$R(\bar{x})$ is a relational atom with variables and/or constants, whose relation symbol R is from a fixed vocabulary. We replace ∧ with comma, and drop ∃, when no confusion arises. Comma or ∧ operator takes precedence over ∨. For example we write R(x),S(x,y)∨T(y) for ∃x.(R(x)∧∃y.S(x,y))∨∃y.T(y).

A query is an expression as defined by (1), up to logical equivalence. We consider only Boolean queries in this paper. A conjunctive query (CQ) is a query that can be written without ∨. A conjunctive query admits an alternative representation, as a set of atoms, $R_{1}(\bar{x}_{1}), \ldots, R_{m}(\bar{x}_{m})$. Given two conjunctive queries q,q′, the logical implication q⇒q′ holds iff there exists a homomorphism q′→q between their representations as sets of atoms [7].
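
The homomorphism criterion can be checked by exhaustive search, as in the following sketch. The encoding of a conjunctive query as a list of `(relation, variables)` pairs and the name `implies` are assumptions made here; the search is exponential in the number of variables of q′, which is acceptable because the queries are fixed and small. Constants are not handled.

```python
from itertools import product


def implies(q, q_prime):
    """True iff q => q_prime, i.e. there is a homomorphism q_prime -> q."""
    target = set(q)
    src_vars = sorted({v for _, args in q_prime for v in args})
    tgt_vars = sorted({v for _, args in q for v in args})
    # Try every mapping from the variables of q_prime to those of q.
    for image in product(tgt_vars, repeat=len(src_vars)):
        h = dict(zip(src_vars, image))
        if all((rel, tuple(h[v] for v in args)) in target
               for rel, args in q_prime):
            return True
    return False


path2 = [("R", ("x", "y")), ("R", ("y", "z"))]   # R(x,y),R(y,z)
edge = [("R", ("u", "v"))]                       # R(u,v)
forward = implies(path2, edge)    # a path of length 2 contains an edge
backward = implies(edge, path2)   # a single edge need not extend to a path
```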

Let D be a database instance. Denote X _t a distinct Boolean variable for each tuple t∈D. If t∉D, X _t≡false. Let Q be a UCQ. The lineage of Q on D is the Boolean expression $\varPhi ^{D}_{Q}$, or simply Φ _Q if D is understood from the context, defined inductively as follows, where ADom(D) denotes the active domain of the database instance:

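$$\varPhi ^{D}_{R(\bar{a})} = X_{R(\bar{a})}, \qquad \varPhi ^{D}_{\exists x.Q} = \bigvee_{a \in \mathit{ADom}(D)} \varPhi ^{D}_{Q[a/x]} \qquad (2)$$

$$\varPhi ^{D}_{Q_{1} \wedge Q_{2}} = \varPhi ^{D}_{Q_{1}} \wedge \varPhi ^{D}_{Q_{2}}, \qquad \varPhi ^{D}_{Q_{1} \vee Q_{2}} = \varPhi ^{D}_{Q_{1}} \vee \varPhi ^{D}_{Q_{2}} \qquad (3)$$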

Given a probability p(X _t)∈[0,1] for each Boolean variable X _t, we denote $P(\varPhi _{Q}^{D})$ the probability that the Boolean formula $\varPhi _{Q}^{D}$ is true, when each Boolean variable X _t is set to 1 independently, with probability p(X _t).

A probabilistic database is a pair (D,p) where D is a database and p(t)∈[0,1] assigns a probability to each tuple t∈D. Given a Boolean query Q and a probabilistic database (D,p), the query probability P(Q) is defined as $P(Q) =P(\varPhi _{Q}^{D})$, where in the latter expression each Boolean variable X _t has a probability equal to that of its corresponding tuple, p(X _t)=p(t).
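
As a small worked instance of these definitions, the sketch below computes the lineage of a Boolean conjunctive query as a monotone DNF and then evaluates P(Q) by enumerating all possible worlds. The data layout (a dictionary from relation names to sets of tuples, and a dictionary `p` from tuples to probabilities) and the function names are assumptions of this sketch. The enumeration is exponential in the number of tuples and is meant only as a reference semantics, not as an evaluation method.

```python
from itertools import product


def lineage_dnf(cq, db):
    """Lineage of a Boolean conjunctive query as a set of conjuncts."""
    adom = sorted({c for tuples in db.values() for t in tuples for c in t})
    variables = sorted({v for _, args in cq for v in args})
    dnf = set()
    for values in product(adom, repeat=len(variables)):
        env = dict(zip(variables, values))
        facts = [(rel, tuple(env[v] for v in args)) for rel, args in cq]
        # A tuple that is absent from D has X_t = false, so the conjunct vanishes.
        if all(t in db.get(rel, set()) for rel, t in facts):
            dnf.add(frozenset(facts))
    return dnf


def probability(dnf, p):
    """Brute-force P(lineage): add up the weights of all satisfying worlds."""
    tuples = sorted({t for conj in dnf for t in conj})
    total = 0.0
    for world in product([False, True], repeat=len(tuples)):
        present = {t for t, bit in zip(tuples, world) if bit}
        if any(conj <= present for conj in dnf):
            weight = 1.0
            for t, bit in zip(tuples, world):
                weight *= p[t] if bit else 1.0 - p[t]
            total += weight
    return total


db = {"R": {("a",)}, "S": {("a", "b"), ("a", "c")}}
cq = [("R", ("x",)), ("S", ("x", "y"))]          # R(x),S(x,y)
dnf = lineage_dnf(cq, db)
p = {("R", ("a",)): 0.5, ("S", ("a", "b")): 0.5, ("S", ("a", "c")): 0.5}
result = probability(dnf, p)
```

The lineage of a UCQ in DNF is obtained by taking the union of the conjunct sets of its conjunctive queries.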

The query evaluation problem on probabilistic databases is the following: given a query Q and a probabilistic database (D,p), compute P(Q). Usually we are interested in the data complexity of the query evaluation problem: for a fixed Q, determine the complexity of computing P(Q) as a function of the input database (D,p).

Definition 1

UCQ(P) is the class of UCQ queries Q s.t. for any probabilistic database (D,p), the probability P(Q) can be computed in PTIME in the size of D.

A complete characterization of the class UCQ(P) was given in [11]. We review it here, since we will reuse some of the same concepts that characterize the class UCQ(P) to characterize various compilation targets.

We start by discussing connected queries. Consider a conjunctive query q given by the set of its atoms $R_{1}(\bar{x}_{1}), \ldots,R_{m}(\bar{x}_{m})$, and assume this representation is minimal, i.e., removing any atom results in an inequivalent query; it is known that this minimal representation is unique up to isomorphism [7]. Define the following undirected graph G: there is one node for each atom, and there is an edge from atom i to atom j if $R_{i}(\bar{x}_{i})$ and $R_{j}(\bar{x}_{j})$ share a common variable. We say that the query q is connected if the graph G is connected.
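
The connectivity test on the graph G is a plain graph traversal, sketched below. The atom encoding as `(relation, variables)` pairs and the name `is_connected` are assumptions; the sketch also assumes that the list of atoms it receives is already the minimal representation, since it performs no minimization itself.

```python
def is_connected(atoms):
    """True iff the atoms form a connected graph under shared variables."""
    if not atoms:
        return True
    seen = {0}
    stack = [0]
    while stack:
        i = stack.pop()
        for j in range(len(atoms)):
            # Edge between atoms i and j when they share at least one variable.
            if j not in seen and set(atoms[i][1]) & set(atoms[j][1]):
                seen.add(j)
                stack.append(j)
    return len(seen) == len(atoms)


connected = is_connected([("R", ("x",)), ("S", ("x", "y"))])   # R(x),S(x,y)
split = is_connected([("R", ("x",)), ("T", ("y",))])           # R(x),T(y)
```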

Lemma 1

Suppose we restrict all conjunctive queries to be without constants. Let q be a conjunctive query. Then the following conditions are equivalent. (1) The query q is connected. (2) For every two conjunctive queries q ₁,q ₂, if q ₁∧q ₂⇒q then either q ₁⇒q or q ₂⇒q. (3) For every two conjunctive queries q ₁,q ₂, if q≡q ₁∧q ₂ then either q≡q ₁ or q≡q ₂.

Proof

(1) implies (2). Assuming q ₁∧q ₂⇒q we obtain a homomorphism q→q ₁∧q ₂. Since neither q ₁ nor q ₂ have constants, the homomorphism must map every variable in q to a variable in q ₁∧q ₂. Since q is connected, the image of this homomorphism must be a connected graph, and, therefore, it is included either in q ₁ or in q ₂; this means that the homomorphism is either q→q ₁ or q→q ₂, implying either q ₁⇒q or q ₂⇒q.

(2) implies (3). Assume q ₁∧q ₂≡q. In particular, q ₁∧q ₂⇒q, and by property (2) we have q ₁⇒q or q ₂⇒q. Assuming the former, we derive q ₁⇒q ₁∧q ₂, which further implies q ₁≡q ₁∧q ₂≡q. The latter case is symmetric.

(3) implies (1). Suppose that q is not connected, and that its minimal representation has m atoms. Let G be the graph corresponding to this minimal representation. Since G is not connected, we can partition its nodes into two non-empty sets, each with strictly fewer than m atoms, s.t. atoms in different sets do not share any variables. Thus, we have written q=q ₁∧q ₂ where q ₁,q ₂ share no common variables and each has strictly fewer than m atoms. By condition (3) it follows that either q≡q ₁ or q≡q ₂. In the former case q is equivalent to a query with fewer than m atoms, contradicting the fact that the minimal representation of q has m atoms. The latter case is symmetric. □

The restriction to conjunctive queries without constants is necessary for the lemma to hold. Otherwise, consider the connected query q=R(x,y),S(y,z), and q ₁=R(x,a), q ₂=S(a,z), where a is a constant and x,y,z are variables: we have q ₁∧q ₂⇒q but neither q ₁⇒q nor q ₂⇒q holds.

We now define the key notions needed to characterize UCQ(P), and which we need throughout this paper:

A component, c, is a conjunctive query that is connected.
Every conjunctive query can be written as a conjunction of components. That is, q=c ₁,c ₂,…,c _k, s.t. c _i and c _j do not share any common variables, for all i≠j. If q=c ₁,c ₂,… and $q' = c_{1}', c_{2}', \ldots$ are two conjunctive queries given as conjunction of components, and if they do not have any constants, then the logical implication q⇒q′ holds iff ∀j.∃i s.t. $c_{i}\Rightarrow c_{j}'$.
A disjunctive query is a disjunction of components, d=c ₁∨⋯∨c _k. Given two disjunctive queries d=c ₁∨c ₂∨⋯ and $d' = c_{1}' \vee c_{2}' \vee \cdots{}$, the logical implication d⇒d′ holds iff ∀i.∃j s.t. $c_{i} \Rightarrow c_{j}'$.
A UCQ in DNF is a disjunction of conjunctive queries, Q=q ₁∨⋯∨q _m. Given two queries in DNF, Q=q ₁∨q ₂∨⋯ and $Q' = q_{1}' \vee q_{2}' \vee \cdots{}$, the logical implication Q⇒Q′ holds iff ∀i.∃j s.t. $q_{i} \Rightarrow q_{j}'$.
A UCQ in CNF is a conjunction of disjunctive queries, Q=d ₁∧⋯∧d _m. Given two queries in CNF, Q=d ₁∧d ₂∧⋯ and $Q' = d_{1}' \wedge d_{2}' \wedge \cdots{}$, if they do not have any constants, then the implication Q⇒Q′ holds iff ∀j.∃i s.t. $d_{i}\Rightarrow d'_{j}$.

Obviously, any component is both a conjunctive query and a disjunctive query; also, every conjunctive query is a UCQ in DNF, and every disjunctive query is a UCQ in CNF.

The containment condition for DNF is due to Sagiv and Yannakakis [22]. The containment condition for CNF is from [11], and only holds if the queries have no constants. To see that this requirement is needed, consider the following three disjunctive queries: d ₁=R(x,a), d ₂=S(a,z), and d=R(x,y),S(y,z), where a is a constant and x,y,z are variables. Define the following two UCQ’s: Q=d ₁,d ₂ and Q′=d. Both are in CNF, and Q⇒Q′, yet neither d ₁⇒d nor d ₂⇒d holds.

Following [11] we first perform the following transformations on the query. They preserve the lineage of the query and hence membership in UCQ(P) and all the classes considered in this paper.

Remove constants :: Every query with constants is rewritten into an equivalent query without constants, over an extended vocabulary, by repeatedly substituting a relation R(A ₁,…,A _k) with 2 relations: $R_{1} = \sigma_{A_{i} \neq a}(R)$ and $R_{2} = \varPi _{A_{1}\ldots A_{i-1} A_{i+1} \ldots A_{k}}(\sigma_{A_{i}=a}(R))$, for every attribute position i and every constant a that occurs in the query. For example, R(x,a),S(x)∨R(x,y),T(x) is rewritten as R ₂(x),S(x)∨R ₂(x),T(x)∨R ₁(x,y),T(x), where R ₁(x,y)=σ _y≠a(R(x,y)), and R ₂(x)=π _x(σ _y=a(R(x,y))).
Ranking :: Assume an ordered domain. A query is ranked if it remains consistent after adding all predicates of the form x<y, for all pairs of variables x,y that co-occur in some atom, such that x occurs before y. For example, R(x,y),R(y,z),R(x,z) is ranked because x<y∧y<z∧x<z is consistent, while R(x,y),S(y,x) is not ranked (x<y∧y<x is inconsistent), and R(x,x,y) is not ranked (x<x∧x<y is inconsistent). Every query is rewritten into an equivalent, ranked query, over an extended vocabulary, by repeatedly substituting a relation R(A ₁,…,A _k) with three relations $R_{1} = \sigma_{A_{i} < A_{j}}(R)$, $R_{2} = \varPi _{A_{1} \ldots A_{j-1} A_{j+1} \ldots A_{k}}(\sigma_{A_{i} =A_{j}}(R))$, $R_{3} = \varPi _{A_{1} \ldots A_{j} \ldots A_{i} \ldots A_{k}}(\sigma_{A_{i} > A_{j}}(R))$, for every two attributes A _i,A _j s.t. i<j. We give here the main intuition by illustrating with q=R(x,y),R(y,x), and refer to [11] for further details. Denoting R ₁(x,y)=σ _x<y(R), R ₂(x)=π _x(σ _x=y(R(x,y))), R ₃(y,x)=π _yx(σ _x>y(R)), we rewrite the query as R ₂(x)∨R ₁(x,y),R ₃(x,y). The new query is ranked.
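
Whether a query is ranked can be tested mechanically: the added predicates x<y are jointly consistent exactly when the directed graph they induce on the variables has no cycle (a repeated variable in one atom yields a self-loop). The following sketch applies a standard topological-sort test to the three examples given above. The atom encoding and the name `is_ranked` are assumptions, and the sketch assumes the ordered domain is large enough to realize any acyclic set of strict inequalities.

```python
def is_ranked(atoms):
    """True iff the predicates x<y (x before y in some atom) are consistent."""
    edges = {}
    for _, args in atoms:
        for i in range(len(args)):
            for j in range(i + 1, len(args)):
                edges.setdefault(args[i], set()).add(args[j])
                edges.setdefault(args[j], set())
    # Kahn's algorithm: the constraints are consistent iff the graph is acyclic.
    indegree = {v: 0 for v in edges}
    for v in edges:
        for w in edges[v]:
            indegree[w] += 1
    ready = [v for v in edges if indegree[v] == 0]
    removed = 0
    while ready:
        v = ready.pop()
        removed += 1
        for w in edges[v]:
            indegree[w] -= 1
            if indegree[w] == 0:
                ready.append(w)
    return removed == len(edges)


triangle = is_ranked([("R", ("x", "y")), ("R", ("y", "z")), ("R", ("x", "z"))])
swapped = is_ranked([("R", ("x", "y")), ("S", ("y", "x"))])
repeated = is_ranked([("R", ("x", "x", "y"))])
```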

The reason for the first transformation is to ensure that the implication criteria for CNF expressions hold. As a consequence, every UCQ has a unique, minimal representation in DNF, and a unique, minimal representation in CNF. The reason for the second transformation will become clear below. We will assume throughout the paper that a CNF or DNF expression of a query is minimized.

The first step in characterizing UCQ(P) is to describe a class of disjunctive queries that are hard for #P, using the notion of a separator. Consider a query, and a subexpression of the form ∃w.Q (see grammar equation (1)): the scope of the variable w is the subexpression Q.

Definition 2

A variable w is called a root variable if it occurs in all atoms in its scope.

For a simple illustration, consider ∃x.∃y.R(x)∧S(x,y). Then x is a root variable, but y is not. However, we can write the query equivalently as ∃x.R(x)∧(∃y.S(x,y)): now both x and y are root variables.

Definition 3

A disjunctive query d has a separator if it can be written as d≡∃w.Q, such that w is a root variable, and for every two atoms g,g′ using the same relational symbol R, the variable w occurs in the same position in g and in g′. In this case the variable w is called a separator variable in the expression ∃w.Q.

The hardness part of characterizing UCQ(P) consists of showing that, if a disjunctive query has no separator then it is hard for #P: hence, it cannot be in UCQ(P) unless FP=#P. Recall that we assume all queries to be without constants, ranked, and minimized.

Theorem 1

([11])

Let d be a disjunctive query s.t. each component has at least one variable. If d has no separator, then d is hard for #P.

If d has any component without variables then it trivially has no separator. For example, consider d=R()∨S(x): the first component, R(), has no variables, and clearly d has no separator, e.g. if we write it as ∃x.(R()∨S(x)) then x is not a root variable. However, it is always easy to get rid of the components without variables, then apply Theorem 1. Indeed, write the disjunctive query as d=d ₀∨d′ where d ₀ contains all components without variables and d′ contains all components with variables. Thus, d ₀ is a disjunction R ₁()∨R ₂()∨⋯ of zero-ary relational symbols, and d′ is a disjunction of components c ₁∨c ₂∨⋯, each having at least one variable. None of the symbols R _i() occurs in any component c _j, otherwise c _j would not be connected. Thus, d ₀ and d′ are independent probabilistic events, and P(d)=1−(1−P(d ₀))(1−P(d′)), in other words computing P(d) reduces to computing P(d′), and this is the reason why the theorem focuses only on the latter. Note that the theorem holds only if the query is ranked: for a counter-example, R(x,y),R(y,x) has no separator, yet is in UCQ(P) (this follows from the ranking shown above, and from Theorem 2 below); this is the reason why we rank queries.

Conversely, if d has a separator, d=∃w.Q, then its probability can be computed as P(d)=1−∏_i(1−P(Q[a _i/w])), where a ₁,…,a _n is the active domain of the database, because no two queries among Q[a ₁/w],…,Q[a _n/w] have any tuple in common. Furthermore, this can be computed efficiently, provided that each query Q[a _i/w] is in UCQ(P). Although we disallowed constants in queries, the expression Q[a _i/w] is acceptable because all occurrences of a relational symbol have the constant a _i in the same position; we simply remove a _i from all atoms, renaming all relational symbols, and decreasing their arity by 1.

Example 1

Query q ₁ in Table 1 has a separator, because^{Footnote 2} q ₁≡∃w.(R(w),S(w,y ₁)∨T(w),S(w,y ₂)). We can compute its probability as P(q ₁)=1−∏_i(1−P(R(a _i),S(a _i,y ₁)∨T(a _i),S(a _i,y ₂))). Query h ₁, on the other hand, does not have a separator: if we write it as ∃w.(R(w),S(w,y ₁)∨S(w,y ₂),T(y ₂)) then w is not a root variable, and if we write it as ∃w.(R(w),S(w,y ₁)∨S(x ₂,w),T(w)) then w occurs in different positions in S(w,y ₁) and S(x ₂,w). Therefore, h ₁ is hard for #P.
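
The separator formula for q ₁ can be checked numerically on a small instance. In the sketch below, each residual query is evaluated through the equivalent form (R(a)∨T(a))∧∃y.S(a,y) noted in the Introduction, and the result is compared against a brute-force sum over possible worlds. The dictionaries `pR`, `pS`, `pT`, the function names, and the sample probabilities are all assumptions made for illustration.

```python
from itertools import product


def q1_probability(pR, pS, pT):
    """P(q1) via the separator w: 1 - prod_a (1 - P(Q[a/w]))."""
    domain = set(pR) | set(pT) | {a for a, _ in pS}
    miss_all = 1.0
    for a in domain:
        r_or_t = 1.0 - (1.0 - pR.get(a, 0.0)) * (1.0 - pT.get(a, 0.0))
        no_s = 1.0
        for (x, _), p in pS.items():
            if x == a:
                no_s *= 1.0 - p
        # P(Q[a/w]) = P(R(a) or T(a)) * P(exists y. S(a,y))
        miss_all *= 1.0 - r_or_t * (1.0 - no_s)
    return 1.0 - miss_all


def q1_brute_force(pR, pS, pT):
    """Reference value: enumerate every possible world."""
    facts = [("R", k, p) for k, p in pR.items()]
    facts += [("S", k, p) for k, p in pS.items()]
    facts += [("T", k, p) for k, p in pT.items()]
    total = 0.0
    for world in product([False, True], repeat=len(facts)):
        present = {(rel, k) for (rel, k, _), bit in zip(facts, world) if bit}
        holds = any(
            rel == "S" and (("R", k[0]) in present or ("T", k[0]) in present)
            for rel, k in present
        )
        if holds:
            weight = 1.0
            for (_, _, p), bit in zip(facts, world):
                weight *= p if bit else 1.0 - p
            total += weight
    return total


pR = {"a": 0.5}
pT = {"a": 0.5, "b": 0.5}
pS = {("a", "c"): 0.5, ("b", "c"): 0.5, ("b", "d"): 0.5}
fast = q1_probability(pR, pS, pT)
slow = q1_brute_force(pR, pS, pT)
```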

Consider a UCQ in CNF: Q=d ₁∧⋯∧d _k. For each subset s⊆[k] denote d _s=⋁_i∈s d _i. The inclusion/exclusion formula gives us P(Q)=−∑_s≠∅(−1)^|s| P(d _s) and, therefore, if all d _s are in UCQ(P) (in particular, they have separators), then so is Q. The formula is exponential in the size of the query, but this does not affect data complexity. However, the condition d _s∈UCQ(P) is not necessary for all s: some terms in the inclusion/exclusion formula may cancel out, and Q may be in UCQ(P) even if some disjunctive queries d _s are hard.

To characterize precisely when Q is in UCQ(P), [11] defines the CNF lattice (L,≤) for Q. Each element x∈L corresponds to a distinct disjunctive query, denoted λ(x)=d _s, for some s⊆[k], up to logical equivalence; that is, if $d_{s_{1}} \equiv d_{s_{2}}$ then they correspond to the same element x∈L. The order relation ≤ is reversed logical implication: x≤y iff λ(y)⇒λ(x).

The maximal element in the lattice is denoted $\hat{1}$, and corresponds to d _∅≡false: all other elements correspond to non-trivial disjunctive queries d _s. The minimal element of the lattice is denoted $\hat{0}$, and corresponds to $\lambda(\hat{0}) = d_{1} \vee \cdots \vee d_{k}$. Three examples are shown in Fig. 1.

We say x covers y if y≤x and there is no z∈L s.t. y<z<x. We call x an atom if it covers $\hat{0}$; it is a co-atom if it is covered by $\hat{1}$. Denote by L ^∗ the set of co-atoms. The meet closure of S⊆L is the lattice $\overline{S} = \{ \bigwedge T \mid T \subseteq S \}$. Since $\bigwedge \emptyset = \hat{1}$, the meet closure of any set S contains the maximal element $\hat{1}$.

The Mobius function of a lattice (L,≤) is the function μ _L:L×L→Z defined by μ _L(x,x)=1, μ _L(x,y)=−∑_x<z≤y μ _L(z,y). Note that μ _L(x,y)=0 whenever $x \not \leq y$. We will drop the L, i.e. denote μ _L by simply μ, henceforth when it is clear from the context. Mobius’ inversion formula applied to P(Q) is: $P(Q) = - \sum_{x < \hat{1}} \mu(x,\hat{1}) P(\lambda(x))$. Now it becomes obvious that we only need to compute P(d _s) for those queries for which $\mu(x, \hat{1}) \neq 0$. This justifies:

Definition 4

(Safe queries) ([11])

(1) Let Q=d ₁∧⋯∧d _k, and k≥2. Then Q is safe if for every element x in its CNF lattice, if $\mu(x,\hat{1})\neq 0$, then the disjunctive query λ(x) is safe (recursively). (2) Let d=d ₀∨d ₁, be a disjunctive query where d ₀ contains all components without variables, and d ₁ contains all components with at least one variable. Then d is safe if d ₁ has a separator w and d ₁[a/w] is safe (recursively), for a constant a.
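
The values μ(x,1̂) used in Definition 4 follow directly from the recursive definition of the Mobius function. The sketch below computes them for a small hand-built lattice; the lattice itself (element names `"1"`, `"u"`, `"v"`, `"w"`, `"x"`, `"y"`, `"0"` and the table `up` of upper sets) is a hypothetical example chosen so that the bottom element receives μ=0, and is not claimed to be one of the lattices in Fig. 1.

```python
from functools import lru_cache


def mobius_to_top(elements, leq, top):
    """Return {x: mu(x, top)} using mu(x,x)=1, mu(x,y) = -sum_{x<z<=y} mu(z,y)."""
    @lru_cache(maxsize=None)
    def mu(x):
        if x == top:
            return 1
        return -sum(mu(z) for z in elements
                    if z != x and leq(x, z) and leq(z, top))
    return {x: mu(x) for x in elements}


# up[a] lists every element that is >= a (including a itself).
up = {
    "1": {"1"},
    "u": {"u", "1"}, "v": {"v", "1"}, "w": {"w", "1"},
    "x": {"x", "u", "v", "1"},
    "y": {"y", "v", "w", "1"},
    "0": {"0", "x", "y", "u", "v", "w", "1"},
}
mu = mobius_to_top(tuple(up), lambda a, b: b in up[a], "1")
# Elements with mu == 0 drop out of the Mobius inversion formula for P(Q).
needed = sorted(x for x, value in mu.items() if x != "1" and value != 0)
```

In this example the bottom element has μ=0, so a query with this lattice could be safe even if the disjunctive query labeling the bottom element had no separator.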

The characterization of UCQ(P) is:

Theorem 2

([11])

Any safe query is in UCQ(P). Any unsafe query is hard for #P.

The first part of the theorem follows from our discussion so far. The second part is proven in [11] by using Theorem 1.

This completes the characterization of UCQ(P) from [11]. We still need to introduce two more notions that we use in the rest of the paper: hierarchical queries and inversion-free queries.

Hierarchical Queries

Let q be a conjunctive query, and denote Vars(q) the set of variables used in the query and at(x) the set of atoms containing a variable x∈Vars(q). We say that q is hierarchical if for any two variables x,y, we have at(x)⊆at(y) or at(x)⊇at(y), or at(x)∩at(y)=∅. A UCQ query Q is hierarchical if it is the union of hierarchical conjunctive queries. We give an alternative definition next:

Definition 5

Let Q be a query expression given by the grammar equation (1). We say that it is a hierarchical expression if every variable is a root variable.

It is easy to check that a query is hierarchical iff it can be written as a hierarchical expression. For example, the query R(x,y),S(x,z) is hierarchical, because it can be written as ∃x.(∃y.R(x,y)∧∃z.S(x,z)). Examples of non-hierarchical queries are R(x),S(x,y),T(y) and R(x,y),R(y,z),R(x,z). The following is easy to see:

Proposition 1

If Q is safe, then it is hierarchical.

Proof

By induction on the structure of Q. If Q=d ₁∧⋯∧d _k, then each d _i corresponds to a co-atom x in the CNF lattice, hence $\mu(x,\hat{1}) = -1 \neq 0$, and therefore d _i must be safe, hence it is hierarchical by induction, hence Q is hierarchical. If Q=d ₀∨d ₁ and d ₁ has a separator, d ₁=∃w.Q ₁, then Q ₁[a/w] is safe, hence it is hierarchical by induction, hence ∃w.Q ₁ is hierarchical because w occurs in all atoms of Q ₁. □

The converse is not true: for example h ₁ in Table 1 is hierarchical, but unsafe. Thus, all non-hierarchical queries are #P-hard, but the converse fails in general.
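The set-based definition of a hierarchical conjunctive query translates almost literally into code. The sketch below assumes a query is encoded as a list of atoms `(relation, variables)` whose arguments are all variables; the encoding and the function names are illustrative assumptions.

```python
from itertools import combinations

def atoms_of(query):
    """Map each variable x to at(x), the set of indices of atoms containing x."""
    at = {}
    for idx, (_, args) in enumerate(query):
        for v in args:
            at.setdefault(v, set()).add(idx)
    return at

def is_hierarchical(query):
    """Check that at(x), at(y) are nested or disjoint for every pair x, y."""
    at = atoms_of(query)
    for x, y in combinations(at, 2):
        a, b = at[x], at[y]
        if not (a <= b or a >= b or a.isdisjoint(b)):
            return False
    return True

# The three example queries discussed in the text.
q_hier = [("R", ("x", "y")), ("S", ("x", "z"))]
q_path = [("R", ("x",)), ("S", ("x", "y")), ("T", ("y",))]
q_triangle = [("R", ("x", "y")), ("R", ("y", "z")), ("R", ("x", "z"))]
```

On these inputs `is_hierarchical` accepts R(x,y),S(x,z) and rejects the other two queries, matching the examples given earlier.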

Inversions

Inversions were first defined in [8]. In this paper we show how to use inversions to characterize UCQ(RO) and UCQ(OBDD). Let Q=q ₁∨⋯∨q _k be a query in DNF. The unification graph G has as nodes all pairs of variables (x,y) that co-occur in some atom, and has an edge between (x,y) and (x′,y′) if the following holds: x,y co-occur in some atom g , x′,y′ co-occur in some atom g′, the atoms g and g′ are over the same relation symbol and x,y appear at the same positions in g as x′,y′ in g′. In other words, g and g′ are unifiable, and the unification equates x=x′ and y=y′. Given x,y∈Vars(q _i), denote x≻y if $at(x)\not\subseteq at(y)$.

Definition 6

(Inversion) ([8])

An inversion in Q is a path of length ≥0 in G from a node (x,y) to a node (x′,y′) s.t. x≻y and x′≺y′. If no such path exists, we say Q is inversion-free.

If a query is non-hierarchical then it has an inversion. Indeed, let x,y be two variables occurring in the same non-hierarchical conjunctive query, such that at(x)∩at(y)≠∅ and neither of the two sets at(x), at(y) contains the other. Consider the node (x,y) in the unification graph (such a node exists because at(x)∩at(y)≠∅). Since we have both x≻y and x≺y, the empty path starting and ending at (x,y) is an inversion. The converse fails: h ₁ in Table 1 is hierarchical, yet has an inversion, from (x ₁,y ₁) to (x ₂,y ₂).

We now give an alternative, syntactic characterization of an inversion-free query, which we need later. Consider a query expression Q given by the grammar equation (1). Let g be an atom in Q, over the relation symbol R of arity k; thus g contains k distinct variables. Assume the existential quantifiers of these k variables are in the following order: ∃x ₁,∃x ₂,…,∃x _k. In other words, each variable $x_{i+1}$ is within the scope of x _i. Define π _g to be the permutation for which $g = R(x_{\pi_{g}(1)}, \ldots, x_{\pi_{g}(k)})$.

Definition 7

A query expression Q given by the grammar equation (1) is an inversion-free expression if it is a hierarchical expression, and for any two atoms g ₁, g ₂ with the same relational symbol, $\pi_{g_{1}} =\pi_{g_{2}}$.

If Q is a hierarchical expression and R a relational symbol, then we write π _R for the common permutation π _g of all atoms g with symbol R. We have the following equivalence:

Proposition 2

Q is inversion free iff it can be written as an inversion-free expression.

Proof

We first show how to write an inversion-free query as an inversion-free expression. Let Q be inversion free. We will define an order relation x⋙y on Q’s variables s.t. (a) for every atom g, ⋙ is total over the set of variables occurring in g, and (b) if g, g′ are two atoms with the same relation symbol R then the order imposed by ⋙ on the attributes of R is the same in g and g′. The order ⋙ gives us immediately the permutation π _g for every atom g; since Q can be written as a union of conjunctive queries, each of which is hierarchical, it follows that Q can be written as an inversion-free expression. To define ⋙, we first define a weaker relation ≫: x≫y if there exists a path in the unification graph from a node (x,y) to a node (x′,y′) s.t. x′≻y′. Clearly ≫ is antisymmetric, because if we have both x≫y and x≪y then the graph has an inversion. We prove that ≫ is transitive. Indeed, suppose x≫y and y≫z. By definition there exists a unification path (y,z),(y ₁,z ₁),(y ₂,z ₂),…,(y _k,z _k) s.t. y _k≻z _k. Consider the first edge of this path: there exist two atoms g,g ₁ with the same relation name, such that g contains y,z, and g ₁ contains y ₁,z ₁ at the same positions. Then g must contain x as well (otherwise x≺y, contradicting x≫y). Denote by x ₁ the variable at the same position in g ₁: since x≫y and y≫z we have x ₁≫y ₁ and y ₁≫z ₁. Repeating the same argument we find variables x _i s.t. x _i≫y _i and y _i≫z _i, for i=1,k. Since y _k≻z _k, it also follows that x _k≻z _k (because x _k≫y _k implies at(x _k)⊇at(y _k) hence $\mathit{at}(x_{k}) \not\subseteq \mathit{at}(z_{k})$), proving that x≫z. Therefore, ≫ defines a partial order on the set of variables. It is not a total order yet, because it may leave pairs of variables unordered. To make it a total order, we use the fact that the query Q is ranked, and define x⋙y to be: x≫y or ($x \not\ll y$ and there exists an atom g containing both x and y s.t. x occurs before y). It is easy to check that ⋙ is a partial order, and for any atom g it is total over its set of variables.

Now, suppose Q can be written as an inversion-free expression and still has an inversion from a node (x,y) to a node (x′,y′) s.t. x≻y and x′≺y′. Let π be the order in which variables are introduced in the inversion-free expression. Then x,y appear in the same order in π as x′,y′. W.l.o.g., assume x occurs before y. Then x′ occurs before y′. But we have y′≻x′, hence x′ couldn’t have been a root variable when it was introduced, which violates the fact that every inversion-free expression is also a hierarchical expression, a contradiction. This completes the proof. □

For example, consider q ₁ in Table 1. On one hand we can write it as a union of conjunctive queries, q ₁=R(x ₁),S(x ₁,y ₁)∨T(x ₂),S(x ₂,y ₂). The unification graph has four nodes, (x ₁,y ₁),(y ₁,x ₁),(x ₂,y ₂),(y ₂,x ₂), and two edges ((x ₁,y ₁),(x ₂,y ₂)) and ((y ₁,x ₁),(y ₂,x ₂)). We have both x ₁≻y ₁ (because $\mathit{at}(x_{1}) = \{R(x_{1}),S(x_{1},y_{1})\}\not\subseteq \mathit{at}(y_{1}) = \{S(x_{1},y_{1})\}$), and similarly x ₂≻y ₂. Hence, there is no inversion in the graph, and the query is inversion free. The proposition gives us an alternative way to see that, by writing the query as q ₁=∃x ₁.R(x ₁),∃y ₁.S(x ₁,y ₁)∨∃x ₂.T(x ₂),∃y ₂.S(x ₂,y ₂): in both S-atoms the existential variables x _i,y _i are introduced in the same order, for i=1,2.

On the other hand, consider the query h ₁=R(x ₁),S(x ₁,y ₁)∨S(x ₂,y ₂),T(y ₂). Here x ₁≻y ₁ and x ₂≺y ₂, hence the edge ((x ₁,y ₁),(x ₂,y ₂)) forms an inversion in the unification graph. One can see that we cannot write h ₁ in a way that satisfies Definition 7: if we write it hierarchically as ∃x ₁.R(x ₁),∃y ₁.S(x ₁,y ₁)∨∃y ₂.T(y ₂).∃x ₂.S(x ₂,y ₂), then the variables in S(x ₂,y ₂) are introduced in a different order from those of S(x ₁,y ₁).

We end with a simple remark. If d is a disjunctive query that is inversion free, then it has a separator. Indeed, write d=⋁_i c _i, and write each component as a hierarchical expression, c _i=∃x _i.Q _i. Re-write d as ∃w.(⋁_i Q _i[w/x _i]). Then w is a separator variable: it obviously occurs in all atoms, and in every atom with relation symbol R, it must occur in position π _R(1).
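The inversion test of Definition 6 can be prototyped on small queries such as q ₁ and h ₁. The sketch below is a simplified illustration under stated assumptions: atoms contain only variables, no variable is repeated inside an atom, and two atoms are treated as unifiable exactly when they share a relation symbol. Since edges of the unification graph are symmetric, a path exists precisely when two nodes lie in the same connected component, which is tracked here with a union-find structure. The encoding and the name `has_inversion` are hypothetical.

```python
from itertools import permutations

def has_inversion(ucq):
    """ucq: list of conjunctive queries; each is a list of (relation, variables)."""
    # Scope every variable by the index of its conjunctive query.
    atoms = [(rel, tuple((i, v) for v in args))
             for i, cq in enumerate(ucq) for rel, args in cq]

    at = {}  # at[x] = indices of the atoms that contain x
    for idx, (_, args) in enumerate(atoms):
        for v in args:
            at.setdefault(v, set()).add(idx)

    def succ(x, y):
        # x > y in the sense of the text: at(x) is not a subset of at(y)
        return not at[x] <= at[y]

    parent = {}  # union-find over the nodes (x, y) of the unification graph

    def find(n):
        parent.setdefault(n, n)
        while parent[n] != n:
            parent[n] = parent[parent[n]]
            n = parent[n]
        return n

    by_rel = {}
    for rel, args in atoms:
        by_rel.setdefault(rel, []).append(args)
    for group in by_rel.values():
        first = group[0]
        for args in group:
            # Same positions in two atoms over the same symbol give an edge.
            for i, j in permutations(range(len(args)), 2):
                parent[find((first[i], first[j]))] = find((args[i], args[j]))

    up, down = set(), set()
    for x, y in list(parent):
        root = find((x, y))
        if succ(x, y):
            up.add(root)
        if succ(y, x):
            down.add(root)
    # An inversion connects a node with x > y to a node with x' < y'.
    return bool(up & down)

q1 = [[("R", ("x1",)), ("S", ("x1", "y1"))], [("T", ("x2",)), ("S", ("x2", "y2"))]]
h1 = [[("R", ("x1",)), ("S", ("x1", "y1"))], [("S", ("x2", "y2")), ("T", ("y2",))]]
```

Under these assumptions the function reports no inversion for `q1` and an inversion for `h1`, in agreement with the two worked examples; a non-hierarchical query is also flagged, through the empty path.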

3 Queries with Read-Once Lineage

A Boolean expression Φ is read once (RO) if it can be written using the connectors ∨,∧,¬ such that every Boolean variable occurs at most once. We consider only positive Boolean expressions in this paper, and therefore will use only ∨ and ∧. The probability of a read-once Boolean expression can be computed in linear time, because of independence: P(Φ ₁∧Φ ₂)=P(Φ ₁)⋅P(Φ ₂) and P(Φ ₁∨Φ ₂)=1−(1−P(Φ ₁))(1−P(Φ ₂)); this justifies our interest in this class of expressions. In this section we characterize the queries that have read-once lineages. An elegant characterization of read-once Boolean expressions was given by Gurvich [19] (see [16]), but we will not use that characterization. Note that our characterization is of queries, while Gurvich’s characterization is of Boolean expressions.
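The two independence rules yield a one-pass evaluator. The following sketch assumes a read-once expression is encoded as nested tuples `("var", name)`, `("and", ...)`, `("or", ...)` and that tuple probabilities are given in a dictionary; this encoding is an illustrative choice.

```python
def ro_probability(expr, p):
    """Probability of a positive read-once expression.

    expr: ("var", name) | ("and", e1, e2, ...) | ("or", e1, e2, ...)
    p:    dict mapping each variable name to its marginal probability
    Correct only if every variable occurs at most once in expr.
    """
    kind = expr[0]
    if kind == "var":
        return p[expr[1]]
    probs = [ro_probability(e, p) for e in expr[1:]]
    if kind == "and":
        out = 1.0
        for q in probs:
            out *= q              # P(A and B) = P(A) * P(B)
        return out
    if kind == "or":
        out = 1.0
        for q in probs:
            out *= 1.0 - q        # P(A or B) = 1 - (1 - P(A)) * (1 - P(B))
        return 1.0 - out
    raise ValueError(f"unknown connector: {kind}")

# (X1 or X2) and X3, every variable with probability 0.5
example = ("and", ("or", ("var", "X1"), ("var", "X2")), ("var", "X3"))
prob = ro_probability(example, {"X1": 0.5, "X2": 0.5, "X3": 0.5})
```

Each subexpression is visited once, which is the linear-time bound mentioned above; here `prob` evaluates to 0.375.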

Definition 8

UCQ(RO) is the class of queries Q s.t. for every database instance D, the lineage of Q on D is a read once Boolean expression.

Recall that CQ ⁻ denotes the set of conjunctive queries without self-joins. Dalvi and Suciu [9, 10] showed that CQ ⁻(P) is precisely the class of hierarchical queries. Olteanu and Huang [20] showed that all hierarchical queries in CQ ⁻ have read-once lineages, implying CQ ⁻(RO)=CQ ⁻(P)= “hierarchical queries”. In this section we characterize the class UCQ(RO).

Definition 9

Let Q be a query expression given by the grammar equation (1). We say that Q is hierarchical-read-once if it is hierarchical (see Definition 5), and every relational symbol occurs at most once. A query is hierarchical-read-once if it is equivalent to a hierarchical-read-once expression.

Obviously, every hierarchical CQ ⁻ query is also hierarchical-read-once; our definition is more interesting when applied to UCQ. The following is a necessary condition for hierarchical-read-once-ness:

Proposition 3

If Q is a hierarchical read-once expression then it is also an inversion-free expression.

The proof is immediate, since no two distinct atoms in Q may refer to the same relational symbol, hence the condition $\pi_{g_{1}} =\pi_{g_{2}}$ is satisfied vacuously.

For a simple example, consider query q ₁ in Table 1. It is equivalent to the expression ∃x.(R(x)∨T(x))∧∃y.S(x,y), which is both hierarchical and read-once. Notice that in the definition we require Q to be at the same time hierarchical and read-once. Sometimes we can achieve these two goals separately, but not simultaneously: for example h ₁=R(x ₁),S(x ₁,y ₁)∨S(x ₂,y ₂),T(y ₂) is hierarchical, and can also be written as ∃x.∃y.(R(x)∨T(y))∧S(x,y), which is read-once. Since h ₁ has an inversion, by Proposition 3 it cannot be written simultaneously as a hierarchical and read-once expression.

Theorem 3

Q∈UCQ(RO) iff it is hierarchical-read-once.

The “if” direction is a straightforward extension of the technique used in [20] to prove that hierarchical queries in CQ ⁻ are read-once. For the “only-if”, we construct one database instance D that is “large enough” (depending only on the query), and prove the following: if Q’s lineage on D is read-once, then Q is hierarchical-read-once.

Proof

If: We prove by induction on the structure of a hierarchical-read-once expression Q that its lineage is read-once; this proof extends that of [20]. If Q=Q ₁∨/∧Q ₂, then Q ₁,Q ₂ have no relation symbols in common, and their lineage given by (3) is also read-once. If Q=∃x.Q ₁ then x must be a root variable, which implies that the lineages of Q ₁[a ₁/x], …, Q ₁[a _n/x] do not share any Boolean variables; hence, the lineage given by (2) is also read-once.

Only if: Assume Q∈UCQ(RO); thus $\varPhi _{Q}^{D}$ is read-once, for every database instance D; we show that Q must be equivalent to a hierarchical-read-once expression. First, we use Theorem 4 to argue that, if Q∈UCQ(RO) then Q∈UCQ(OBDD), hence Q is inversion-free. Referring to Definition 7, for every relation symbol R we denote π _R the permutation mapping the variable nesting order to the order in which they occur in an atom with relation symbol R.

We will construct a special database instance D: from the read-once-ness of $\varPhi ^{D}_{Q}$, we will extract a hierarchical-read-once expression for Q. Let k be the total number of variables plus the total number of atoms in Q. We first construct the active domain for D. Start by choosing k constants a ₁,a ₂,…,a _k called “root constants”: these will be used to populate the attribute π _R(1) of the relation R, for each relation symbol R. Next, choose k ² constants a _ij,1≤i,j≤k: these will populate the attribute π _R(2) of each relation R, such that a _ij occurs only in those tuples that also contain a _i. Next, choose k ³ constants for the next level of the hierarchy, etc. This way we construct k ^a tuples for a relation of arity a: note that the functional dependencies π _R(i+1)→π _R(i) hold for every relation R and every i=1,…,arity(R)−1.
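To make the shape of this instance concrete, the sketch below generates the tuples of one relation, listing attributes in nesting order (root attribute first) and therefore ignoring the permutation π _R. The constant naming scheme and both function names are assumptions made for illustration.

```python
from itertools import product

def hierarchical_tuples(k, arity):
    """Return the k**arity tuples of one relation, root attribute first.

    The constant at depth d is named after the index prefix (i1, ..., id),
    so a deeper constant occurs only together with its ancestor constants.
    """
    rows = []
    for idx in product(range(1, k + 1), repeat=arity):
        rows.append(tuple("a_" + "_".join(str(i) for i in idx[:d])
                          for d in range(1, arity + 1)))
    return rows

def satisfies_fds(rows):
    """Check the functional dependencies: attribute i+1 determines attribute i."""
    for i in range(len(rows[0]) - 1):
        seen = {}
        for row in rows:
            if seen.setdefault(row[i + 1], row[i]) != row[i]:
                return False
    return True

binary = hierarchical_tuples(2, 2)
```

For k=2 and arity 2 this produces the four tuples (a_1, a_1_1), (a_1, a_1_2), (a_2, a_2_1), (a_2, a_2_2), and `satisfies_fds` confirms the dependencies stated above.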

Thus, we have fixed the database D. Next, we prove the following statement by induction: Let Q be any inversion-free query where the total number of variables plus atoms is at most k, and consider its lineage over our fixed database D, $\varPhi ^{D}_{Q}$: if the lineage is read-once, then Q can be written as a hierarchical-read-once expression. Our induction proceeds on the structure of the read-once expression $\varPhi ^{D}_{Q}$, abbreviated Φ _Q.

Case 1: Suppose Φ _Q=ϕ ₁∧ϕ ₂. Then ϕ ₁,ϕ ₂ have no common Boolean variables. We prove something stronger: that the variables come from disjoint sets of relations. Assume the contrary, that both have a Boolean variable over the relational symbol R. To simplify the discussion we will assume R is unary; our argument extends to the general case too. Note that if m ₁ is a minterm of ϕ ₁ and m ₂ is a minterm of ϕ ₂, then m ₁ m ₂ must be a minterm of Φ _Q. This is because ϕ ₁,ϕ ₂ have no tuples in common, hence if another minterm $m'_{1}m'_{2} \Rightarrow m_{1}m_{2}$, then $m'_{1} \Rightarrow m_{1}$ and hence m ₁ couldn’t have been a minterm of ϕ ₁. Now, suppose $X_{R(a_{1})}$ occurs in ϕ ₁ and not in ϕ ₂, and $X_{R(a_{2})}$ occurs in ϕ ₂ and not in ϕ ₁. Then $X_{R(a_{1})}X_{R(a_{2})}$ occurs in some minterm in Φ _Q. Since the lineage is invariant under permutations of the active domain, for all 1≤i<j≤k the term $X_{R(a_{i})}X_{R(a_{j})}$ occurs in some minterm of Φ _Q. Consider a third tuple of R, say R(a ₃). It must occur in either ϕ ₁ or ϕ ₂: assume w.l.o.g. it occurs in ϕ ₂, and since Φ _Q has a minterm that contains $X_{R(a_{2})}X_{R(a_{3})}$, ϕ ₂ must have a minterm that contains it. Hence, after conjoining with ϕ ₁, we obtain a minterm in Φ _Q that contains $X_{R(a_{1})}X_{R(a_{2})}X_{R(a_{3})}$, and, therefore, for every 1≤i<j<l≤k there exists a minterm in Φ _Q containing $X_{R(a_{i})}X_{R(a_{j})}X_{R(a_{l})}$. Repeating this argument leads us to conclude that Φ _Q has a minterm containing $X_{R(a_{1})}\ldots X_{R(a_{k})}$: this is a contradiction because the minterms of Φ _Q cannot have more variables than the number of atoms in Q.

Thus, ϕ ₁ contains only tuples over the relations R ₁,R ₂,… and ϕ ₂ contains only tuples over the relations S ₁,S ₂,… Denote Q ₁=Q[S ₁=S ₂=⋯=true] the query obtained from Q by replacing all atoms referring to S _i with true; similarly denote Q ₂=Q[R ₁=R ₂=⋯=true]. The lineage of Q ₁ on D is ϕ ₁: hence, by induction hypothesis, Q ₁ is equivalent to a hierarchical-read-once expression. Similarly Q ₂. We prove now that Q≡Q ₁∧Q ₂. Since every atom logically implies true, we obtain immediately Q⇒Q ₁ and Q⇒Q ₂, hence Q⇒Q ₁∧Q ₂. For the converse, write Q ₁∧Q ₂ as a union of conjunctive queries ⋁_i q _i, and let $D_{q_{i}}$ be the canonical database for q _i: it suffices to prove that Q is true on $D_{q_{i}}$ for every q _i. Both Q ₁ and Q ₂ are inversion-free, hence q _i is inversion free and the order π _R must be same in q _i and Q. Therefore the canonical database $D_{q_{i}}$ satisfies all functional dependencies that hold in D and we can find an isomorphic copy of $D_{q_{i}}$ in D, since we have chosen D “large enough”. Set all Boolean variables corresponding to this copy to true and all others to false: we have ϕ ₁∧ϕ ₂=true (because Q ₁∧Q ₂ is true on $D_{q_{i}}$), which implies Φ _Q=true, implying that Q is true on $D_{q_{i}}$.

Case 2: Φ _Q=ϕ ₁∨ϕ ₂. We distinguish two cases:

Case 2.1: Every minterm in Φ _Q consists of tuples that have the same root constant. Thus, it may contain variables like $X_{R(a_{1},a_{13})}X_{R(a_{1},a_{15})}X_{S(a_{1})}$ (same root constant a ₁) but not $X_{R(a_{1},a_{13})}X_{S(a_{2})}$ (distinct root constants a ₁,a ₂). Then we claim Q must be a disjunctive sentence. Indeed, consider the DNF expression for Q=q ₁∨q ₂∨⋯ and assume w.l.o.g. that q ₁ is not connected, hence q ₁=c∧c′ where c, c′ are two components. Then the lineage of q ₁ includes minterms with mixed root constants, contradiction. Hence, Q is a disjunctive sentence. Now we use the fact that Q is inversion-free; in particular it has a separator, Q=∃x.Q ₁, and its lineage is $\varPhi _{Q} =\bigvee_{i=1,k} \varPhi _{Q_{1}[a_{i}/x]}$. Since Φ _Q is read-once, so is each $\varPhi _{Q_{1}[a_{i}/x]}$ (since the latter is obtained from Φ _Q by setting to false all tuples with root constant a _j, for j≠i). Hence, we apply induction hypothesis to Q ₁[a _i/x] and obtain a hierarchical-read-once expression: this proves that ∃x.Q ₁ is hierarchical-read-once.

Case 2.2: Φ _Q=ϕ ₁∨ϕ ₂ and there is at least one mixed minterm, containing tuples with two distinct root constants, say $X_{R_{1}(a_{1},\bar{b})}X_{R_{2}(a_{2},\bar{c})}$, and assume w.l.o.g. that this minterm appears in ϕ ₁. We will show that all R ₂-tuples occur in ϕ ₁, i.e. ϕ ₂ does not have R ₂ tuples. We consider the case when R ₁ and R ₂ are distinct relational symbols: the case when they are the same symbol is similar. We use again the fact that Q is invariant under permutations of D to argue that Φ _Q must contain minterms that contain the tuples $X_{R_{1}(a_{1},\bar{b})}X_{R_{2}(a_{j},\bar{c}')}$, for any j≤k. All these minterms must be in ϕ ₁, since ϕ ₂ may not contain $X_{R_{1}(a_{1},\bar{b})}$. Thus, ϕ ₁ must contain all tuples over R ₂. With a similar argument, it must also contain all tuples over R ₁.

Denote R ₁,R ₂,… the relation symbols that occur only in ϕ ₁ and S ₁,S ₂,… the other symbols. By our assumption, at least one symbol is in the first list (because there is a mixed minterm in ϕ ₁) and at least one symbol is in the second list (because ϕ ₂ is not empty). We prove that no minterm contains tuples with relation symbols from both lists. Indeed, if the minterm is mixed, then we have seen that its symbols appear either only in ϕ ₁ (hence they are all R _i symbols), or only in ϕ ₂ (hence they are all S _j symbols). Suppose the minterm contains a unique root constant, say a ₁, i.e. the minterm contains $X_{R_{i}(a_{1}, \ldots)}X_{S_{j}(a_{1},\ldots)}$. Since the minterms are closed under isomorphisms of the domain, there are minterms containing $X_{R_{i}(a_{2}, \ldots)}X_{S_{j}(a_{2},\ldots)}$, $X_{R_{i}(a_{3},\ldots)}X_{S_{j}(a_{3},\ldots)}$, etc. All these must belong to ϕ ₁ (because they contain R _i); hence ϕ ₁ contains all tuples over S _j, and therefore S _j must also be in the list R ₁,R ₂,…

Define Q ₁=Q[R ₁=R ₂=⋯=false] i.e. formula obtained from Q by setting all relation symbols R _i to false. Similarly, define Q ₂=Q[S ₁=S ₂=⋯=false]. We show that Q=Q ₁∨Q ₂, and since by induction hypothesis Q ₁,Q ₂ have a hierarchical-read-once expression, so does Q. First note that Q _i⇒Q, i=1,2, since false implies anything, hence Q ₁∨Q ₂⇒Q. We will now show Q⇒Q ₁∨Q ₂, and here we use an argument similar to the above. Let Q=⋁q _i and let $D_{q_{i}}$ be a canonical database for q _i. Since D was chosen large enough, there exists an isomorphic copy of $D_{q_{i}}$ in D, and, consequently, a minterm in Φ _Q consisting of the conjunction of its tuples. This minterm either consists of R _i tuples, hence $D_{q_{i}} \models Q_{2}$, or of S _j tuples, hence $D_{q_{i}} \models Q_{1}$. □

It is decidable if a given query Q is hierarchical-read-once, because for a fixed vocabulary there are only finitely many hierarchical-read-once expressions: simply iterate over all of them and check equivalence to Q. This implies that it is decidable whether Q∈UCQ(RO). For example, one can check that q ₂ in Table 1 is not in UCQ(RO), by enumerating all hierarchical-read-once expressions over the vocabulary R, S, T; we will return to q ₂ in the next section.

4 Queries and OBDD

OBDD were introduced by Bryant [3] and studied extensively in the context of model checking and knowledge representation. A good survey can be found in [27]; we give here a quick overview. A BDD is a rooted DAG with two kinds of nodes. A sink node or output node is a node without any outgoing edges, which is labeled either 0 or 1. An inner node, decision node, or branching node is labeled with a Boolean variable X and has two outgoing edges, labeled 0 and 1 respectively. Every node u uniquely defines a Boolean expression Φ _u as follows: Φ _u=false and Φ _u=true for a sink node labeled 0 or 1 respectively, and $\varPhi _{u} = \neg X \wedge \varPhi _{u_{0}} \vee X \wedge \varPhi _{u_{1}}$ for an inner node labeled with X and with successors u ₀,u ₁ respectively. The BDD represents a Boolean expression Φ: Φ≡Φ _u where u is the root of the BDD. A Free BDD, or FBDD, is one in which every path from the root to a sink node contains any variable X at most once. Given an FBDD that represents Φ, one can compute the probability P(Φ) in time linear in the size of the FBDD: this justifies our interest in FBDD.
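The linear-time probability computation follows the definition of Φ _u bottom-up: because no variable repeats along a path, the two branches of a node are independent of the variable tested there. A minimal sketch, assuming sinks are the Python constants `False` and `True` and inner nodes are tuples `(variable, low, high)`; this node encoding is an assumption, not a standard library format.

```python
def fbdd_probability(node, p, memo=None):
    """P(Phi_u) for an FBDD node u.

    node: True / False for sinks, or (var, low, high) for an inner node
    p:    dict mapping each variable to its marginal probability
    Memoization by node identity keeps the work linear in the DAG size.
    """
    if memo is None:
        memo = {}
    if node is True:
        return 1.0
    if node is False:
        return 0.0
    key = id(node)
    if key not in memo:
        var, low, high = node
        # Phi_u = (not X and Phi_u0) or (X and Phi_u1), with disjoint events
        memo[key] = ((1.0 - p[var]) * fbdd_probability(low, p, memo)
                     + p[var] * fbdd_probability(high, p, memo))
    return memo[key]

# An OBDD for (X1 or X2) and X3 with order X1, X2, X3; node n3 is shared.
n3 = ("X3", False, True)
n2 = ("X2", False, n3)
n1 = ("X1", n2, n3)
prob = fbdd_probability(n1, {"X1": 0.5, "X2": 0.5, "X3": 0.5})
```

With all probabilities equal to 0.5, `prob` is 0.375, the fraction of the eight assignments that satisfy the formula.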

While it is trivial to construct a large FBDD for Φ (e.g. as a tree of size 2ⁿ that checks exhaustively all n variables X ₁,…,X _n), it is not trivial at all to construct a compact FBDD. To simplify the construction problem, Bryant [4] introduced the notion of Ordered BDD, OBDD, which is an FBDD such that there exists a total order Π on the set of variables s.t. on each path from the root to a sink, the variables X ₁,…,X _n are tested in the order Π (variables may be skipped). One also writes Π-OBDD, to emphasize that the OBDD has order Π. Therefore, the OBDD construction problem has been reduced to the problem of finding a variable order Π.

One can construct an OBDD for any read-once formula Φ in time linear in Φ, by an inductive argument: if Φ=Φ ₁∧Φ ₂ first construct OBDDs for Φ ₁ and Φ ₂, and replace every sink-node labeled 1 in Φ ₁ with (an edge to) the root of Φ ₂; for Φ ₁∨Φ ₂, replace every sink-node labeled 0 in Φ ₁ with the root of Φ ₂.
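The sink-redirection argument can be written as a short recursive procedure. The sketch below reuses an assumed encoding (expressions as nested tuples `("var", name)`, `("and", ...)`, `("or", ...)`; BDD nodes as `(variable, low, high)` with sinks `False` and `True`) and passes the two redirection targets as parameters instead of rewriting sinks afterwards.

```python
def ro_to_obdd(expr, on_true=True, on_false=False):
    """Build a BDD for a read-once expression.

    on_true / on_false are the nodes that replace the 1-sink and the 0-sink,
    which is how the redirection described in the text is realized.
    """
    kind = expr[0]
    if kind == "var":
        return (expr[1], on_false, on_true)
    children = expr[1:]
    if kind == "and":
        node = on_true
        for child in reversed(children):
            # the 1-sink of each conjunct leads to the root of the next one
            node = ro_to_obdd(child, node, on_false)
        return node
    if kind == "or":
        node = on_false
        for child in reversed(children):
            # the 0-sink of each disjunct leads to the root of the next one
            node = ro_to_obdd(child, on_true, node)
        return node
    raise ValueError(f"unknown connector: {kind}")

def evaluate(node, assignment):
    """Follow the path selected by a truth assignment down to a sink."""
    while node is not True and node is not False:
        var, low, high = node
        node = high if assignment[var] else low
    return node

formula = ("and", ("or", ("var", "X1"), ("var", "X2")), ("var", "X3"))
bdd = ro_to_obdd(formula)
```

Every variable yields exactly one node, so the result has as many inner nodes as the formula has variables, and the variables are tested in their left-to-right order in the expression.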

Definition 10

UCQ(OBDD) is the class of queries Q s.t. for every database D, one can construct an OBDD for $\varPhi _{Q}^{D}$ in time polynomial in |D|.

We show an example in Fig. 2. In this section we prove the following:

Theorem 4

Q∈UCQ(OBDD) iff it is inversion-free.

We have seen that q ₂ from Table 1 is not read-once. However, q ₂∈UCQ(OBDD), because it is inversion-free, therefore we obtain the following separation:

Proposition 4

q ₂∈UCQ(OBDD)−UCQ(RO).

The significance of this result is the following. Olteanu and Huang [20] showed that for any hierarchical query Q, one can construct an OBDD for $\varPhi _{Q}^{D}$ in time O(|D|), proving that CQ ⁻(RO)=CQ ⁻(OBDD). Our proposition shows that these classes no longer collapse over UCQ.

We also note that all inversion-free queries are hierarchical (Sect. 2), therefore any non-hierarchical query is not in UCQ(OBDD).

In the remainder of the section we prove Theorem 4, in two stages: first we show that one can construct in PTIME an OBDD for inversion-free formulae, and then that for every query with an inversion there is a database over which any OBDD has exponential size.

4.1 Tractable Queries

Given an OBDD of Φ over variables $\bar{x}=\{ x_{1},x_{2}, \ldots, x_{n} \}$ with variable order Π, the width at level k, k≤n is the number of distinct subformulae that result after checking first k variables in the order Π, i.e. $|\{{\varPhi _{x_{\pi(1)}\ldots x_{\pi(k)} = \bar{b}}}\mid {\bar{b} \in \{ 0,1\}^{k}}\}|$. The width of an OBDD is the maximum width at any level. If the width is w, then a trivial upper bound on the size of the OBDD is nw. In what follows, we give a variable ordering for inversion-free queries under which the width is always constant (exponential in query size) and hence the size of the OBDD is linear. Note that if the size of OBDD is polynomial, then the construction of the OBDD can be done in PTIME in our setting, since the lineage of a UCQ is always a monotone formula and checking the equivalence of monotone formulas is in PTIME.

We first need to define the notion of a shared BDD. A shared BDD for a set of formulas Φ ₁,Φ ₂,…,Φ _m is a BDD where the sink nodes are labeled with {0,1}^m i.e. they give the valuation for each of the Φ _i, 1≤i≤m. This means a node reached by following the assignments $\bar{x}$ from the root can be thought of as representing a set of subformulae $\varPhi _{1\bar{x}},\varPhi _{2\bar{x}},\ldots,\varPhi _{m\bar{x}}$. Shared BDD evaluate a set of formulae simultaneously: this enables us to compute any combination function of the formulae. So, for instance, one can derive the OBDD of Φ ₁⊗Φ ₂ for any Boolean operation ⊗ from the shared OBDD for Φ ₁,Φ ₂.

The following is a well-known lemma for OBDD synthesis.

Lemma 2

(cf. [27])

Let Φ ₁,Φ ₂ be two Boolean functions and consider a fixed variable order Π. If there exist Π-OBDD of width w ₁, w ₂ for Φ ₁, Φ ₂ respectively, then there exists a shared Π-OBDD of width w ₁ w ₂ for Φ ₁,Φ ₂.

Proposition 5

If Q is inversion-free, then for every database D its lineage has an OBDD with width w=2^g, where g is the number of atoms in the query. Therefore, the size of the OBDD is linear in the size of the database.

We give a simple proof, using Lemma 2, that constructs the OBDD inductively on the hierarchical expression for Q: the resulting OBDD has size O(|D|).

Proof

Consider a hierarchical expression for Q, and let π _R be the permutation associated to the symbol R (Definition 7). Let D be a database, and assume that its active domain ADom(D) is an ordered domain. We start by defining a linear order Π on all tuples in D. Fix any linear order on the relational symbols, R ₁<R ₂<⋯. We add all relation symbols to ADom(D), placing them at the beginning of the order. We associate to each tuple in D a string in (ADom(D))^∗, as follows: tuple $R(a_{\pi_{R}(1)}, a_{\pi_{R}(2)}, \ldots,a_{\pi_{R}(k)})$ is associated to the string a ₁ a ₂…a _k R. That is, the first element is the constant on the root attribute position; the second element is the constant on the attribute position corresponding to a quantifier depth 2, etc. We add the relation name at the end. Next, we order the Boolean variables in the lineage expression $\varPhi _{Q}^{D}$ lexicographically by their string, and denote Π the resulting order. We prove that the Π-OBDD has width w=2^g, inductively on the structure of the inversion-free expression Q. If Q=Q ₁∨/∧Q ₂ then we use Lemma 2. If Q=∃x.Q ₁, then $\varPhi _{Q} = \bigvee_{a \in ADom(D)} \varPhi _{Q_{1}[a/x]}$. Let the active domain consist of a ₁<a ₂<⋯<a _n, in this order. The OBDDs for $\varPhi _{Q_{1}[a_{1}/x]},\ldots, \varPhi _{Q_{1}[a_{n}/x]}$ are over disjoint sets of Boolean variables (because x is a root variable); assume that their width is w. The OBDD for Φ _Q consists of their union, where we redirect the 0 sink nodes of $\varPhi _{Q_{1}[a_{i}/x]}$ to the root node of $\varPhi _{Q_{1}[a_{i+1}/x]}$: the width is still w. The OBDD of a single ground atom, say $R(\bar{a})$, has width only 2. This completes the proof. □
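The variable order Π used in this proof amounts to sorting tuples by a derived key. The sketch below assumes tuples are given as `(relation, constants)` and that a hypothetical dictionary `nesting` lists, for each relation, its 0-based attribute positions from the root attribute downwards (the inverse of π _R). Relation symbols are ordered alphabetically here, one possible choice of the fixed linear order.

```python
def pi_order(tuples, nesting):
    """Sort database tuples by the string a1 a2 ... ak R from the proof.

    tuples:  list of (relation, constants)
    nesting: dict relation -> attribute positions in nesting order, root first
    Relation symbols are tagged 0 and constants 1, so that every relation
    symbol precedes every constant in the lexicographic comparison.
    """
    def key(t):
        rel, args = t
        return tuple((1, args[pos]) for pos in nesting[rel]) + ((0, rel),)
    return sorted(tuples, key=key)

# A small instance for q1, whose atoms are R(x), T(x), S(x, y) with root x.
db = [("S", ("a2", "b1")), ("S", ("a1", "b2")), ("R", ("a2",)),
      ("T", ("a1",)), ("S", ("a1", "b1")), ("R", ("a1",))]
order = pi_order(db, {"R": [0], "T": [0], "S": [0, 1]})
```

All tuples that share a root constant end up contiguous in `order`, which is what allows the OBDDs for the different root constants to be chained one after the other.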

Corollary 1

If a set of components c ₁,c ₂,…,c _m is inversion-free, then for every database D, they have a shared-OBDD with size linear in the size of the database.

4.2 Hard Queries

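For k≥1, define the following queries (see also Fig. 1):

$$\begin{aligned} h_{k0} &= R(x_{0}), S_{1}(x_{0},y_{0}) \\ h_{ki} &= S_{i}(x_{i},y_{i}), S_{i+1}(x_{i},y_{i}), \quad i=1,k-1 \\ h_{kk} &= S_{k}(x_{k},y_{k}), T(y_{k}) \end{aligned}$$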

Denote $h_{k} = \bigvee_{i=0,k} h_{ki}$. The queries h _k were shown in [8, 11] to be hard for #P and are used to prove the hardness of a much larger class of unsafe queries. We show here that they have a remarkable property w.r.t. OBDD: if the same variable order Π is used to compute all queries h _k0, h _k1, …, h _kk, then at least one of these k+1 OBDDs has exponential size. Note that each query is inversion-free, hence it admits an efficient OBDD, e.g. Fig. 2 illustrates h _k0: what we prove is that there is no common order under which all have an efficient OBDD. This tool is quite powerful, allowing us to give a rather simple proof that queries with inversion have exponential size OBDD (Proposition 7). There is no analogous tool for proving #P-hardness: all queries h _ki are in PTIME, for i=0,k, and this tells us nothing about the larger query where they occur.

The complete bipartite graph of size n is the following database D over the vocabulary of h _k: relation R has n tuples R(a ₁),…,R(a _n), relation T has n tuples T(b ₁),…,T(b _n), and each relation S _i has n ² tuples S _i(a _j,b _l), for i=1,k, and j,l=1,n.
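For experimentation, this instance is easy to generate. The sketch below assumes relations are stored as lists of tuples of string constants in a dictionary keyed by relation name; the naming of constants and relations is illustrative.

```python
def complete_bipartite_db(n, k):
    """Build the complete bipartite graph of size n over R, S1..Sk, T."""
    a = [f"a{j}" for j in range(1, n + 1)]
    b = [f"b{l}" for l in range(1, n + 1)]
    db = {"R": [(x,) for x in a], "T": [(y,) for y in b]}
    for i in range(1, k + 1):
        # every S_i contains all n * n pairs (a_j, b_l)
        db[f"S{i}"] = [(x, y) for x in a for y in b]
    return db

db = complete_bipartite_db(3, 2)
size = sum(len(rows) for rows in db.values())
```

The instance has 2n + k·n² tuples in total, so its lineage for each h _ki has that many Boolean variables at most.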

Proposition 6

Let D be the complete bipartite graph of size n, and fix any ordering Π on the corresponding Boolean variables. For any i=0,k, let n _i be the size of some Π-OBDD for the lineage of h _ki on D. Then $\sum_{i=0}^{k} n_{i} > k \cdot 2^{\frac{n}{2k}}$.

Proof

Denote the Boolean variables associated to the tuples R(a _i), i=1,n with X ₁,X ₂,…; those associated to the tuples S _p(a _i,b _j) with $Z^{p}_{ij}$; and those associated to the tuples T(b _j) with Y _j. We will refer generically to any variable as v _i, and assume the order Π is v ₁,v ₂,… Denote Φ _kp the lineage of h _kp on D; by assumption, we have Π-OBDD for each of them. Assume w.l.o.g. that each OBDD is complete i.e. every path from root to sink contains every variable exactly once.

In any OBDD of a Boolean expression Φ, the number of nodes at level h (i.e. after first h variables v ₁…v _h have been eliminated) is the size of the set $\{{ \varPhi [(v_{1}\ldots v_{h}) =\bar{b}]}\mid {\bar{b} \in \{0,1\}^{h} }\}$. This is because every distinct subformula will result in a new separate node. A standard technique in proving lower bounds on the size of OBDD is to find a level where the number of distinct formulae must be exponential. This immediately gives the same exponential lower bound on the size of OBDD for that ordering.
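This counting argument can be checked by brute force on small formulas. The sketch below enumerates, for every level h of a given order, the distinct residual functions, each represented by its truth table over the remaining variables. The test formula is a simplified one-index relative of the lineages studied here, namely X ₁Z ₁∨X ₂Z ₂∨X ₃Z ₃, chosen only for illustration; the function names are assumptions.

```python
from itertools import product

def level_widths(f, order):
    """Number of distinct subfunctions of f after fixing the first h variables.

    f:     function taking a dict {variable: bool} and returning a bool
    order: list of all variables of f, in the order in which they are tested
    Returns a list whose entry h is the width at level h (0 <= h <= n).
    """
    n = len(order)
    widths = []
    for h in range(n + 1):
        set_vars, free_vars = order[:h], order[h:]
        residuals = set()
        for b in product((False, True), repeat=h):
            fixed = dict(zip(set_vars, b))
            # truth table of the residual function over the unset variables
            table = tuple(f({**fixed, **dict(zip(free_vars, c))})
                          for c in product((False, True), repeat=n - h))
            residuals.add(table)
        widths.append(len(residuals))
    return widths

def phi(a, n=3):
    return any(a[f"X{i}"] and a[f"Z{i}"] for i in range(1, n + 1))

separated = level_widths(phi, ["X1", "X2", "X3", "Z1", "Z2", "Z3"])
interleaved = level_widths(phi, ["X1", "Z1", "X2", "Z2", "X3", "Z3"])
```

Testing all X variables before any Z variable forces 2³=8 distinct subfunctions at level 3, one per assignment to the X variables, whereas the interleaved order never exceeds width 3. The enumeration is exponential in the number of variables and is meant for small sanity checks only.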

For any level h, denote h ₁, h ₂ the number of X, and of Y variables respectively in the initial sequence v ₁,v ₂,…,v _h of Π. Define h to be the first level for which h ₁+h ₂=n. Denote X ^set={X _i∣X _i∈{v ₁,…,v _h}} and X ^unset=X∖X ^set, and similarly Y ^set, Y ^unset, Z ^set, Z ^unset. W.l.o.g. assume h ₁≥n/2.

Consider the OBDD for $\varPhi _{k0} = \bigvee_{ij} X_{i}Z^{1}_{ij}$. Suppose there exists j s.t. $\forall i. ( X_{i} \in \mathbf{X}^{set} \Rightarrow Z^{1}_{ij} \in \mathbf{Z}^{unset} ) $; then for each assignment $\bar{b}$ to X ^set, we get a different subformula $\varPhi _{k0}[\mathbf{X}^{set} = \bar{b}]$. Since the number of such formulae is $2^{h_{1}}\geq 2^{n/2}$, we obtain $n_{0} > 2^{n/2}$, which proves the claim. Hence we can assume there is no such j. This means ∀j, ∃i s.t. X _i∈X ^set and $Z^{1}_{ij} \in \mathbf{Z}^{set}$.

Define S to be a set of pairs (i,j) as follows. For each j s.t. Y _j∈Y ^unset, choose some i s.t. $Z^{1}_{ij} \in \mathbf{Z}^{set}$: then include (i,j) in S. Note that the cardinality of S is n−h ₂=h ₁.

For each p=1,…,k−1, denote C _p the subset of S consisting of indices (i,j) s.t. $Z_{ij}^{1}, \ldots, Z_{ij}^{p} \in \mathbf{Z}^{set}$ and $Z_{ij}^{p+1} \in \mathbf{Z}^{unset}$; and let $C_{k} = S - \bigcup_{p=1,k-1} C_{p}$. Thus, C ₁,…,C _k form a partition of S. Denoting c ₁,…,c _k their cardinalities we have c ₁+⋯+c _k=h ₁.

Next, for each p=1,…,k−1, consider the OBDD for $\varPhi _{kp}= \bigvee_{ij} Z^{p}_{ij} Z^{p+1}_{ij}$. Forall (i,j)∈C _p we have $Z^{p}_{ij} \in \mathbf{Z}^{set}$ and $Z^{p+1}_{ij} \in \mathbf{Z}^{unset}$. Each assignment of the former variables leads to a different expression over the latter variables: hence there are at least $2^{c_{p}}$ distinct expressions, therefore the number of nodes in this OBDD is $n_{p} \geq 2^{c_{p}}$.

Finally, consider the OBDD for $\varPhi _{kk} = \bigvee_{ij}Z^{k}_{ij}Y_{j}$. Forall (i,j)∈C _k we have $Z^{k}_{ij} \in \mathbf{Z}^{set}$ and Y _j∈Y ^unset. Using the same argument, we obtain $n_{k}\geq 2^{c_{k}}$.

Putting everything together we obtain:
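Spelled out, and assuming the intended chain sums the bounds $n_{p} \geq 2^{c_{p}}$ for p=1,…,k and then applies the arithmetic-geometric mean inequality together with c ₁+⋯+c _k=h ₁≥n/2 (a reconstruction from the preceding steps, not a quotation), the estimate reads:

$$\sum_{p=1}^{k} n_{p} \;\geq\; \sum_{p=1}^{k} 2^{c_{p}} \;\geq\; k \cdot 2^{(c_{1}+\cdots+c_{k})/k} \;=\; k \cdot 2^{h_{1}/k} \;\geq\; k \cdot 2^{n/2k}$$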

Notice that n ₀ does not appear above, but we used it in order to construct the set S. This proves our claim. □

Proposition 7

Let Q be a query, and suppose it has an inversion of length k>0. Let D ₀ be a complete bipartite graph of size n (i.e. a database over the vocabulary of h _k). Then there exists a database D for Q s.t. |D|=O(|D ₀|) and any OBDD for Q has size $\Omega(k 2^{n/2k})$.

We use the inversion of length k to construct a database D that mimics the query h _k over a complete bipartite graph. Assuming an OBDD for Q on this database, we show that one can set the Boolean variables to 0 or 1, to obtain a lineage for each h _ki. What is interesting is that this construction cannot be used to prove #P-hardness of Q by reduction from h _k: in other words, Q over D is not equivalent to h _k over D ₀. But we make Q equivalent to each h _ki, and by Proposition 6 this is sufficient to prove that Q has no compact OBDD.

Proof

Write Q=⋁q _j in DNF, and let (x ₀,y ₀),(x ₁,y ₁),…,(x _k,y _k) be an inversion in Q. Assume w.l.o.g. that the inversion is of minimal length: this implies there exist atoms $r, s_{1}, s_{1}', \ldots, s_{k},s_{k}', t$ with the following properties: r∈at(x ₀)−at(y ₀), t∈at(y _k)−at(x _k), and for every i=1,k, s _i contains $x_{i-1},y_{i-1}$, $s_{i}'$ contains x _i,y _i, they unify, and the unification equates $x_{i-1}=x_{i}$ and $y_{i-1}=y_{i}$. In particular, the atoms s _i and $s_{i}'$ have the same relation symbol. Assume that x _i,y _i are variables in the query $q_{j_{i}}$, for i=0,k. We assume that these k+1 queries are distinct: if not, simply create a fresh copy of the query, creating new copies of its variables. Thus, $q_{j_{0}}$ contains the atoms r,s ₁, query $q_{j_{1}}$ contains the atoms $s_{1}',s_{2}$ and so on.

Next, we perform variable substitutions in the queries $q_{j_{0}}, \ldots, q_{j_{k}}$ in order to equate all variables in s _i and $s_{i}'$, except for $x_{i-1},y_{i-1},x_{i},y_{i}$. In other words, all atoms along the inversion path have the same variables, except for the variables forming the actual inversion. For example, if the queries were R(x ₀,u ₀),S ₁(x ₀,y ₀,u ₀); S ₁(x ₁,y ₁,u ₁),S ₂(x ₁,y ₁,u ₁,v ₁); S ₂(x ₂,y ₂,u ₂,v ₂),… then we equate u ₀=u ₁=u ₂=⋯ and v ₁=v ₂=⋯ This is possible in general because Q is ranked: we only equate variables between $q_{j_{i}}$ and $q_{j_{l}}$, but not within the same $q_{j_{i}}$.

We now construct the database D as follows. Its active domain consists of all constants a ₁,…,a _n,b ₁, …,b _n and all variables $z\in \mathit{Vars}(q_{j_{i}})$ s.t. z≠x _i, z≠y _i, for i=0,k. For each i=0,k, and each j=1,n, l=1,n, let $q_{j_{i}}[a_{j},b_{l}]$ denote the set of tuples obtained by substituting x _i with a _j and y _i with b _l. Define D to be the union of all these sets: $D = \bigcup_{i,j,l} q_{j_{i}}[a_{j}, b_{l}]$. Because of our earlier variable substitutions, s _i[a _j,b _l] and $s_{i}'[a_{j},b_{l}]$ are the same tuple: this tuple corresponds to the tuple S _i(a _j,b _l) in the bipartite graph. Similarly, r(a _j) and t(b _l) correspond to the tuples R(a _j) and T(b _l) in the bipartite graph. Thus, the bipartite graph D ₀ is isomorphic to a subset of the database D.

Consider now any OBDD for $\varPhi _{Q}^{D}$, over a fixed variable ordering Π. We can obtain an OBDD for h _ki for every i=0,k as follows. Assume 0<i<k. Then we keep unchanged the Boolean variables corresponding to S _i(a _j,b _l) and $S_{i+1}(a_{j},b_{l})$ (that is, the atoms $s_{i}'[a_{j},b_{l}]$ and $s_{i+1}[a_{j},b_{l}]$). All other Boolean variables corresponding to tuples in $q_{j_{i}}[a_{j},b_{l}]$ are set to true; all remaining Boolean variables are set to false. Then the lineage $\varPhi _{Q}^{D}$ becomes the lineage $\varPhi ^{D_{0}}_{h_{ki}}$. The case i=0 is similar (here we keep unchanged the Boolean variables corresponding to R(a _j) and S ₁(a _j,b _l)), and so is the case i=k. Thus, we obtain k+1 OBDDs for all queries h _ki, and all use the same variable order Π. The claim follows now from Proposition 6. □

If Q has an inversion of length 0, then it is non-hierarchical and, as we discuss later in Theorem 7, Q∉UCQ(FBDD), and hence Q∉UCQ(OBDD) either.

5 Queries and FBDD

We now turn to FBDD, also known as read-once Branching Programs. Unlike OBDD, here we no longer require the same variable order on different paths. FBDD are known to be strictly more expressive than OBDD over arbitrary (non-monotone) Boolean expressions: for example, the Weighted Bit Addressing problem admits a polynomial size FBDD, but no polynomial size OBDD [5, 15, 24]. On the other hand, to the best of our knowledge no monotone formula was known to separate these two classes. Moreover, over conjunctive queries without self-joins, FBDD are no more expressive than OBDD, since the latter already capture CQ ⁻(P). In this section we show that FBDD are strictly more expressive than OBDD over UCQ. In particular, we give a simple (!) monotone Boolean expression for which one can construct an FBDD in PTIME, but no polynomial size OBDD exists.

Definition 11

UCQ(FBDD) is the class of queries Q s.t. for any database D, one can construct an FBDD of $\varPhi _{Q}^{D}$ in time polynomial in |D|.

Clearly UCQ(OBDD)⊆UCQ(FBDD): we prove now that the inclusion is strict, using a simple example.

Example 2

Consider q _V in Table 1. This query has an inversion between S(x ₁,y ₁) and S(x ₂,y ₂), hence it does not admit a compact OBDD. We show how to construct a compact FBDD. Write it in CNF:
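Taking q _V to be R(x ₁),S(x ₁,y ₁)∨S(x ₂,y ₂),T(y ₂)∨R(x ₃),T(y ₃) (our reading of Table 1, stated here as an assumption), distributing ∨ over ∧ and dropping redundant components gives a CNF of the following shape, which agrees with the way d ₁ and d ₂ are used in the rest of the example:

$$q_{V} = \bigl(R(x_{1}),S(x_{1},y_{1}) \vee T(y_{3})\bigr) \wedge \bigl(R(x_{3}) \vee S(x_{2},y_{2}),T(y_{2})\bigr) = d_{1} \wedge d_{2}$$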

Its CNF lattice is shown in Fig. 1. The minimal element of the lattice is:
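Assuming d ₁=R(x ₁),S(x ₁,y ₁)∨T(y ₃) and d ₂=R(x ₃)∨S(x ₂,y ₂),T(y ₂) (our reading of the CNF of q _V, not a quotation), the minimal element is the union of the two conjuncts. Since R(x ₁),S(x ₁,y ₁) implies R(x ₃) and S(x ₂,y ₂),T(y ₂) implies T(y ₃), it simplifies to:

$$d_{3} = d_{1} \vee d_{2} = R(x_{3}) \vee T(y_{3})$$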

Each of d ₁,d ₂,d ₃ is inversion-free, hence they have OBDDs, denote them F ₁, F ₂, F ₃. Of course, F ₁ and F ₂ use different variable orderings and cannot be combined into an OBDD for q _V. Consider the database given by the bipartite graph (Sect. 4) and assume the following order on the active domain: a ₁<⋯<a _n<b ₁<⋯<b _n. Our FBDD starts by computing d ₃. If d ₃=0, then q _V=0; this is a sink node. If d ₃=1, then, depending on which sink node in F ₃ we have reached, either d ₁=1 or d ₂=1, and we need to continue with either F ₂ or F ₁ respectively. This way, no path goes through both F ₁ and F ₂. Note that the FBDD is not ordered, since some paths use the order in F ₁, others that in F ₂. Figure 3 illustrates the construction. The FBDD inspects the variables in this order: $X_{R(a_{1})}, X_{R(a_{2})}, \ldots, X_{R(a_{n})},X_{T(b_{1})}, \ldots, X_{T(b_{n})}$. (This is F ₃.) Each edge $X_{R(a_{i})} = 0$ leads to the next node, $X_{R(a_{i+1})}$, etc. Consider now an edge $X_{R(a_{i})} = 1$. Here we know d ₂ is true, but we still need to evaluate d ₁. We create a new copy of F ₁ where we set all variables $X_{R(a_{1})}, \ldots, X_{R(a_{i-1})}$ to 0 (i.e. eliminate these nodes and redirect their incoming edge to their 0-child) and set $X_{R(a_{i})}$ to 1. Then we connect the edge $X_{R(a_{i})} = 1$ to the root node of this copy of F ₁. Similarly, we connect an edge $X_{T(b_{j})}=1$ to the root of a copy of F ₂ where we set $X_{R(a_{1})} = \cdots = X_{R(a_{n})} = X_{T(b_{1})} = \cdots = X_{T(b_{j-1})} = 0$ and $X_{T(b_{j})} = 1$. The result is an FBDD of size^{Footnote 3} O(n ³) (since F ₁,F ₂ have sizes O(n ²)).
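The branching logic of this construction can be checked by brute force on a tiny instance. The sketch below is not the FBDD itself: it only follows the same decision order (first F ₃, then a copy of F ₁ or F ₂) and compares the outcome with q _V on every database over a 2×2 bipartite graph. The forms of q _V, d ₁ and d ₂ in the comments, and all function names, are assumptions made for illustration.

```python
from itertools import product

N = 2
A = [f"a{i}" for i in range(1, N + 1)]
B = [f"b{j}" for j in range(1, N + 1)]
PAIRS = [(a, b) for a in A for b in B]

def d1(R, S, T):
    """Assumed d1: R(x1),S(x1,y1) v T(y3)."""
    return any(R[a] and S[a, b] for a, b in PAIRS) or any(T[b] for b in B)

def d2(R, S, T):
    """Assumed d2: R(x3) v S(x2,y2),T(y2)."""
    return any(R[a] for a in A) or any(S[a, b] and T[b] for a, b in PAIRS)

def q_v_dnf(R, S, T):
    """Assumed q_V: R(x1),S(x1,y1) v S(x2,y2),T(y2) v R(x3),T(y3)."""
    rs = any(R[a] and S[a, b] for a, b in PAIRS)
    st = any(S[a, b] and T[b] for a, b in PAIRS)
    rt = any(R[a] for a in A) and any(T[b] for b in B)
    return rs or st or rt

def branch_like_fbdd(R, S, T):
    """Follow the branching order of the construction: F3 first, then F1 or F2."""
    for a in A:                      # F3 reads X_R(a1), X_R(a2), ...
        if R[a]:
            return d1(R, S, T)       # d2 is already true; only d1 remains
    for b in B:                      # ... then X_T(b1), X_T(b2), ...
        if T[b]:
            return d2(R, S, T)       # d1 is already true; only d2 remains
    return False                     # d3 = 0, hence q_V = 0

def all_databases():
    """Enumerate every 0/1 assignment to the tuple variables."""
    for bits in product((0, 1), repeat=2 * N + N * N):
        R = dict(zip(A, bits[:N]))
        T = dict(zip(B, bits[N:2 * N]))
        S = dict(zip(PAIRS, bits[2 * N:]))
        yield R, S, T

cnf_ok = all(bool(q_v_dnf(R, S, T)) == bool(d1(R, S, T) and d2(R, S, T))
             for R, S, T in all_databases())
branching_ok = all(bool(q_v_dnf(R, S, T)) == bool(branch_like_fbdd(R, S, T))
                   for R, S, T in all_databases())
```

The check says nothing about size; it only confirms that deciding d ₃ first and then finishing with a single one of d ₁, d ₂ never gives a wrong answer on this small instance.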

Thus:

Proposition 8

q _V∈UCQ(FBDD)−UCQ(OBDD).

The significance of this result is the following. The lineage of q _V is, to the best of our knowledge, the first “simple” Boolean expression (i.e. monotone, and with polynomial size DNF) that has a polynomial size FBDD but no polynomial size OBDD. Previous examples separating these classes were the Weighted Bit Addressing problem (WBA) [5, 15, 24] and other examples given in [26], and these were not “simple”. Our result also contrasts UCQ with CQ ⁻: for the latter it follows from [20] that CQ ⁻(OBDD)=CQ ⁻(FBDD).

In the remainder of this section we will give a partial characterization of UCQ(FBDD), by providing a sufficient condition and a necessary condition for membership. We start with the sufficient condition.

Definition 12

Let d=⋁c _i and $d' = \bigvee c_{j}'$ be two disjunctive queries, s.t. the logical implication d′⇒d holds. We say that d dominates d′ if for every component $c_{j}'$ in d′ and for every atom g in $c_{j}'$ one of the following conditions holds: (a) the relation symbol of g does not occur in d, or (b) there exists a component c _i and a homomorphism $c_{i} \rightarrow c_{j}'$ whose image contains g.

In Example 2, d ₃ dominates d ₁: if one considers the component R(x ₁),S(x ₁,y ₁) in d ₁, then the atom R(x ₁) is the image of a homomorphism, while the atom S(x ₁,y ₁) does not occur at all in d ₃. Similarly d ₃ dominates d ₂.

In analogy to the definition of safe queries (Definition 4), we define here rf-safe queries ^{Footnote 4}:

Definition 13

(1) Let Q=d ₁∧⋯∧d _k, and k≥2. Then Q is rf-safe if for every element x in its CNF lattice the disjunctive query λ(x) is rf-safe, and for every two lattice elements x≤y, λ(x) dominates λ(y). (2) Let d=d ₀∨d ₁, be a disjunctive query, where d ₀ contains all components c _i without variables, and d ₁ contains all components c _i with at least one variable. Then d is rf-safe if d ₁ has a separator w and d ₁[a/w] is rf-safe, for a constant a.

For example, query q _V is rf-safe, since d ₃ dominates both d ₁ and d ₂. Our sufficient characterization of UCQ(FBDD) is:

Theorem 5

Every rf-safe query is in UCQ(FBDD).

Proof of Theorem 5

We start with a definition. Consider a monotone, Boolean expression, written as a disjunction of minterms: $\varPhi = \bigvee_{i=1,n} T_{i}$. We also view Φ as a set of minterms, writing T _i∈Φ. Each minterm T _i is a set of variables: thus, T _i⇒T _j means that, as sets, T _j⊆T _i.

Definition 14

Let Φ, Φ′ be two monotone, Boolean expressions, s.t. Φ′⇒Φ. We say that Φ dominates Φ′ if for every minterm T′∈Φ′, and for every Boolean variable X∈T′, one of the following conditions holds: (a) either X does not occur in Φ, or (b) there exists a minterm T∈Φ s.t. X∈T and T⊆T′.
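The definition translates directly into a check on sets of minterms. The following sketch assumes a monotone DNF is represented as a collection of `frozenset`s of variable names; the helper names `implies` and `dominates`, and the small lineages used as examples, are hypothetical.

```python
def implies(phi_prime, phi):
    """Monotone DNF implication: each minterm of phi_prime contains a minterm of phi."""
    return all(any(t <= tp for t in phi) for tp in phi_prime)

def dominates(phi, phi_prime):
    """Check whether phi dominates phi_prime in the sense of the definition above."""
    if not implies(phi_prime, phi):
        return False
    phi_vars = set().union(*phi)
    for tp in phi_prime:
        for x in tp:
            if x not in phi_vars:
                continue                      # condition (a): X does not occur in phi
            if not any(x in t and t <= tp for t in phi):
                return False                  # condition (b) fails for X
    return True

# Lineage of R(x) v T(y) on a one-by-one bipartite database (illustrative names).
lin_r_or_t = [frozenset({"R(a)"}), frozenset({"T(b)"})]
# Lineage of R(x),S(x,y) v T(y) on the same database.
lin_rs_or_t = [frozenset({"R(a)", "S(a,b)"}), frozenset({"T(b)"})]

# An implication without dominance: Y occurs in phi, but only inside {Y, Z}.
phi = [frozenset({"X"}), frozenset({"Y", "Z"})]
phi_prime = [frozenset({"X", "Y"})]
```

In the last pair Φ′⇒Φ holds, yet the variable Y of the minterm {X, Y} occurs in Φ only in a minterm that is not contained in {X, Y}, so neither condition applies to it.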

The following is easy to check:

Lemma 3

If the disjunctive query d dominates d′, then for any database D, the lineage $\varPhi _{d}^{D}$ dominates $\varPhi _{d'}^{D}$.

Consider an FBDD for Φ, and a node x. Any path from the root to x corresponds to an assignment of a subset of Boolean variables; we call that an assignment at x.

Call an FBDD for Φ greedy if each sink node x labeled 1 has an additional label consisting of (a) a set s⊆[n] and (b) an index i, such that, for any assignment at x, the following properties hold: (1) T _i=1 (i.e. all variables in T _i are set to 1 by the assignment). (2) For any j∈s, T _j=0. (3) For any k∉s, if the assignment sets a value of a variable in T _k, then that variable is in T _i (hence it is set to 1).

A greedy FBDD does exactly what the name says: it evaluates the Boolean DNF expression greedily. If a variable X=0 then it skips all minterms that contain X: these minterms now belong to the set s. If a variable X=1, then it continues to read only variables Y that co-occur with X in some minterm.

Let $\varPhi = \bigwedge_{i=1,m} \varPhi_{i}$, where each Φ _i is a monotone, Boolean expression. Let (L,≤) be the CNF-lattice for Φ constructed as follows. Its elements x are in one-to-one correspondence with Boolean expressions $\varPhi_{s} = \bigvee_{i \in s} \varPhi_{i}$, where s⊆[m], up to logical equivalence (i.e. if $\varPhi _{s_{1}} \equiv \varPhi _{s_{2}}$ then they correspond to the same element x); λ(x)=Φ _s denotes the Boolean expression associated to x. And x≤y if λ(y)⇒λ(x).

Lemma 4

Let $\varPhi = \bigwedge_{i=1,m} \varPhi_{i}$, and (L,≤) be its CNF lattice. Suppose that, for all x≤y, λ(x) dominates λ(y). Let F _x be a greedy FBDD of size n _x for λ(x), for each x∈L. Then there exists a greedy FBDD for Φ, of size $O(m \cdot \prod_{x} n_{x})$.

Proof

(Sketch) We construct the FBDD by generalizing the idea of Example 2. Let $x_{0} = \hat{0}$ be the smallest element in the lattice. The FBDD starts with $F_{x_{0}}$. Consider a sink node. If it is labeled 0, then we leave it labeled 0: we know that Φ ₁=⋯=Φ _m=0, hence so is Φ. If it is labeled 1, then we have two additional labels: a minterm T _i (known to be 1), and a set of minterms, TT (all known to be 0). Let s⊆[m] be the set s.t. j∈s iff T _i does not imply Φ _j: thus, from that sink node we have to continue to evaluate Φ _s. We assume, inductively, to have an FBDD for Φ _s. Then we modify it, by setting all variables in T _i to 1, and setting all other variables occurring in the set of minterms TT to 0: dominance ensures that the new FBDD still computes Φ _s correctly. □

The proof of Theorem 5 follows now directly from the last two lemmas. □

Being rf-safe is not a complete characterization of UCQ(FBDD). The following query q _T is not rf-safe, but one can construct a polynomial-size FBDD for it.

Next, we present our separation result. Recall the query q _W from Table 1. We prove here:

Theorem 6

q _W∉UCQ(FBDD).

We will return to this query in the next section.

Proof

Consider the following three queries:
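Read off from the lineages G ₁, G ₂, G ₃ as they are quoted later in this proof and in the proof of Lemma 8, the three queries appear to be the following (the variable names are our choice, and the display is a reconstruction rather than a quotation):

$$\begin{aligned} d_{1} &= R(x_{1}),S_{1}(x_{1},y_{1}) \vee S_{2}(x_{2},y_{2}),S_{3}(x_{2},y_{2}) \\ d_{2} &= R(x_{1}),S_{1}(x_{1},y_{1}) \vee S_{3}(x_{3},y_{3}),T(y_{3}) \\ d_{3} &= S_{1}(x_{4},y_{4}),S_{2}(x_{4},y_{4}) \vee S_{3}(x_{3},y_{3}),T(y_{3}) \end{aligned}$$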

Thus, q _W=d ₁∧d ₂∧d ₃. We first prove the surprising fact that, given an FBDD for q _W, we can construct a shared FBDD for d ₁,d ₂,d ₃. Note that in general it is not possible to split an FBDD into a shared FBDD: we exploit the properties of d ₁,d ₂,d ₃ to do this.

Lemma 5

Given any FBDD for q _W of size N, there exists a shared FBDD for d ₁,d ₂,d ₃ of size polynomial in N and in the size of the data instance.

Call a node shared if for any two paths $\bar{x},\bar{y}$ from root to the node, $d_{i \bar{x}} = d_{i \bar{y}}$, for i=1, 2, 3. Hence if all nodes were shared, then the FBDD would also be shared. To prove the lemma, we show that if a node is not shared then the subformula represented by that node is actually an inversion-free query, so we could just replace the FBDD below that node with a shared OBDD. Hence, one can make every node shared, and therefore the FBDD shared.

Proof of Lemma 5 (Transformation into a shared FBDD)

Each node x in an FBDD for F represents a Boolean expression F _x, obtained by setting some variables in F. If two paths P1,P2, lead to the same x, then F[P1]=F[P2], where F[P1] denotes the formula obtained by applying the partial assignment P1. Let F be the lineage of q _W=d ₁∧d ₂∧d ₃, and write it as F=G ₁∧G ₂∧G ₃, where G _i is the lineage of d _i. In general, two paths P1,P2 in the FBDD(q _W) that lead to the same node do not have to equate d ₁, i.e. we may have G ₁[P1]≠G ₁[P2].

Definition 15

An FBDD for q _W is called shared if for any two paths P1,P2, if F[P1]=F[P2] then for all i=1,2,3, G _i[P1]=G _i[P2].

This lemma is significant, because it says that we can transform FBDD(q _W) to compute each of the three queries d ₁, d ₂, d ₃ separately. In general, this is not possible for an arbitrary conjunction of Boolean formulas. What makes this work in our case is that, whenever a node x fails to keep track separately of the three subqueries, the formula at x depends only on d ₁,d ₂ or only on d ₂,d ₃. Both these queries are inversion-free, hence in both cases we can construct a shared OBDD for the remaining computation of the two queries, replacing the rest of the FBDD(q _W).

Call a node x shared if for any two paths P1,P2 leading to x, we have G _i[P1]=G _i[P2], for i=1,2,3. We will leave shared nodes unchanged. If x is a non-shared node, then we show how to replace it with a shared OBDD for the remaining formulas. (Some of x’s descendants may become unreachable, and they can be removed later.)

Suppose x is a non-shared node. Let P1,P2 be two paths leading to a node x in the FBDD(F). Then F _x=F[P1]=F[P2]. Suppose G ₁[P1]≠G ₁[P2].

Case 1: For every pair of constants a,b, the path P1 sets either S ₃(a,b)=0 or T(b)=0. Then G ₂[P1] has only terms R(a′),S ₁(a′,b′) and similarly G ₃[P1] has only terms S ₁(a′,b′),S ₂(a′,b′). Denote the three queries below:

Let D ₁,D ₂,D ₃ be the set of tuples that are unset in G ₁[P1], G ₂[P1], and G ₃[P1]. Then F _x is the conjunction of the lineages of d ₁,e ₂,e ₃ on these three databases. Since d ₁,e ₂,e ₃ are inversion-free, we can compute a shared OBDD that evaluates all three of them in parallel on D ₁,D ₂,D ₃ (Corollary 1): thus, we have separated the FBDD at x and below x.

Case 2: There exists a pair of tuples X=S ₃(a,b), Y=T(b) s.t. P1 leaves them either unset, or set to 1. Set X=Y=1. Then G ₂, G ₃ become true, and therefore G ₁[P,X=1,Y=1]=F[P,X=1,Y=1], for P∈{P1,P2}. Since F[P1]=F[P2] we obtain G ₁[P1,X=1,Y=1]=G ₁[P2,X=1,Y=1]. Hence, the only way in which G ₁[P1] and G ₁[P2] may differ is that either one has the term S ₂(a,b) and the other has S ₂(a,b),S ₃(a,b), or one is true and the other has the term S ₃(a,b). The first case is impossible because it means that one of the two paths has set S ₃(a,b) to true. The second case implies F _x=G ₂[P]∧G ₃[P] for P∈{P1,P2} and we repeat the argument above.

This completes the case when G ₁ differs on two paths. The case when G ₃ differs is similar and omitted. So suppose G ₁[P1]=G ₁[P2] and G ₃[P1]=G ₃[P2]. Then, we prove that we also have G ₂[P1]=G ₂[P2]. Indeed, this follows immediately by inspecting the three query lineages: G ₁ is ⋁_a,b R(a)S ₁(a,b)∨S ₂(a,b),S ₃(a,b) and G ₂ is ⋁_a,b R(a),S ₁(a,b)∨S ₃(a,b),T(b). Thus, the subformula of R(a),S ₁(a,b) in G ₁[P1] is the same as that in G ₂[P1]; since G ₁[P1]=G ₁[P2], it means that this part of G ₂ is the same in P1 and in P2. Similarly, from G ₃[P1]=G ₃[P2] we conclude that the part S ₃(a,b),T(b) in G ₂ is the same in P1 and P2. Hence, G ₂[P1]=G ₂[P2]. □

Let $h_{1}' = R(x_{1}),S(x_{1},y_{1}),S(x_{2},y_{2}),T(y_{2})$. We can show that

Lemma 6

$h_{1}' \notin \mathit{UCQ}(\mathit{FBDD})$.

We start at the root and choose 2ⁿ⁻¹ paths (our database is bipartite as in Sect. 4) as follows: for each node we go in both directions if the given node isn’t a prime implicant in the subformula at that node, otherwise we choose only the 0-edge. Then we exploit the fact that each of these paths has only set a limited number of variables to show that any two paths can differ on the assignment of only a few variables, hence most of them must end in distinct subformulas.

Proof of Lemma 6

Define Φ ₁ to be $\bigvee_{i,j=1}^{n} r_{i} s_{i,j}$, Φ ₂ as $\bigvee_{i,j=1}^{n} s_{i,j}t_{j} $ and F_n to be Φ ₁∧Φ ₂. Then we show that any FBDD for F_n has size $n^{\Omega(\log n)}$. We start at the root and choose 2ⁿ⁻¹ paths $\mathcal{P}$ as follows: for each node we go in both directions if the given node isn’t a prime implicant in either Φ ₁ or Φ ₂: we call such nodes branching; otherwise we choose only the 0-edge. We ignore redundant nodes (i.e. nodes whose two branches for 0, 1 point to the same node). We stop a path after we have branched on n−1 nodes. We can always find n−1 nodes to branch on since our function cannot become 0 before we branched on n−1 nodes. This is because each branching variable can set at most n ³ minterms to 0. We have n ⁴ minterms to set to 0 and that can’t be done with n−1 branching nodes. The set of variables on a branch p is denoted by Vars(p) and p(v) denotes the assignment of the variable v in p. The Boolean formula f _p=F_n[p(x)/x], x∈Vars(p) is obtained by applying the assignments on path p to F_n.

Definition 16

Given a Boolean formula f, call a variable v determined, if there exists b∈{0,1} s.t. for any $p \in \mathcal{P}$ f _p=f implies v∈Vars(p) and p(v)=b. Conversely any variable x∉Vars(f) that is not determined is called undetermined.

Now the essence of the proof is that two paths that result in the same function can only differ on the undetermined variables. The following lemma characterizes the variables that can be undetermined.

Lemma 7

Given a path p and a variable x∉Vars(f _p); x is undetermined for f _p iff x=s _i,j for some 1≤i,j≤n and

1. r _i=0 or $s_{i,j_{1}}=1$ for some $1 \leq j_{1} \leq n$ on p, and
2. t _j=0 or $s_{i_{1},j}=1$ for some $1 \leq i_{1} \leq n$ on p.

Proof

Follows immediately by doing a simple case analysis. □

Note that for any f _p, the number of paths q s.t. f _q=f _p is bounded above by $2^{\# \textrm{undetermined variables in} f_{p}}$. This is because q and p can only differ on the assignment of undetermined variables. We now bound the number of undetermined variables for any path p. Let n _r,n _t be the number of r,t variables set to 0 on path p; n _s be the number of s variables set to 1. Then by Lemma 7 the number of undetermined variables is at most (n _r+n _s)(n _t+n _s). Let m _r,m _t,m _s similarly be the number of branching variables amongst n _r,n _t,n _s. Then, m _s=n _s and n _r≤m _r+m _s, n _t≤m _t+m _s. The first equality m _s=n _s holds because non-branching variables are always set to 0. Also a non-branching variable from r,t is set to 0 iff it becomes a prime implicant; which can only happen if one of the variables from s is set to 1 and conversely setting one variable from s makes at most one prime implicant. This proves the second and third inequality.

Hence the number of undetermined variables is at most (m _r+2m _s)(m _t+2m _s)≤4(m _r+m _s+m _t)². Now consider all paths p where $l=m_{r} +m_{s} + m_{t} = \frac{\log(n)}{8}$. Consider the complete binary tree of depth n−1 on the branching variables. Flip the edges coming out of r,t nodes in the tree, i.e. 0 to 1 and vice-versa. We map any such path p to a path in this tree by choosing opposite assignment to variables from r,t. This is a 1-1 mapping. And the set of such paths in our tree is exactly the set of paths where the number of nodes tested to be 1 are l, which is ${n-1 \choose l}$. Hence the size of the FBDD is at least $\frac{{n-1 \choose l}}{2^{4l^{2}}}$ which for $l=\frac{\log(n)}{8}$ gives the required bound. □
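For completeness, the last step can be written out. The following is a sketch of the arithmetic only, assuming logarithms to base 2, ignoring the rounding of l to an integer, and using the standard estimate ${n-1 \choose l} \geq \left(\frac{n-1}{l}\right)^{l}$; none of these conventions is stated in the proof itself.

$$\log \frac{{n-1 \choose l}}{2^{4l^{2}}} \;\geq\; l \log\frac{n-1}{l} - 4l^{2} \;=\; \frac{\log n}{8}\bigl(\log(n-1) - \log\log n + 3\bigr) - \frac{\log^{2} n}{16} \;=\; \frac{\log^{2} n}{16} - O(\log n \cdot \log\log n)$$

Hence the size is at least $2^{\Omega(\log^{2} n)} = n^{\Omega(\log n)}$.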

Now to finish the proof, we finally show a reduction from a shared FBDD of d ₁,d ₂,d ₃ to $h'_{1}$.

Lemma 8

Given a shared FBDD for q _W of size N, there exists an FBDD for $h_{1}'$ of size at most N.

This is, perhaps, the most surprising step, because it seems to reduce a query that is hard for #P ($h_{1}'$) to a query that is in PTIME (q _W). However, what we describe below is not a reduction: instead it is a transformation of an FBDD.

The goal of the reduction is to set S ₁(a,b)=¬S ₂(a,b)=S ₃(a,b) in FBDD(q _W): it can be easily seen (by inspecting the definition of the two queries) that the new FBDD computes $h_{1}'$. Since we have shown in Lemma 6 that $h_{1}'$ has no polynomial size FBDD, this completes the proof.
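The claimed effect of the substitution can be confirmed by brute force on a small instance. The sketch below assumes q _W=d ₁∧d ₂∧d ₃ with d ₁=R,S ₁∨S ₂,S ₃, d ₂=R,S ₁∨S ₃,T and d ₃=S ₁,S ₂∨S ₃,T, as suggested by the lineages quoted in the proofs of this section; the function names and the 2×2 database are illustrative choices.

```python
from itertools import product

A, B = ["a1", "a2"], ["b1", "b2"]
PAIRS = [(a, b) for a in A for b in B]

def q_w(R, S1, S2, S3, T):
    """Assumed q_W = (h0 v h2) ^ (h0 v h3) ^ (h1 v h3)."""
    h0 = any(R[a] and S1[a, b] for a, b in PAIRS)     # R(x),S1(x,y)
    h1 = any(S1[p] and S2[p] for p in PAIRS)          # S1(x,y),S2(x,y)
    h2 = any(S2[p] and S3[p] for p in PAIRS)          # S2(x,y),S3(x,y)
    h3 = any(S3[a, b] and T[b] for a, b in PAIRS)     # S3(x,y),T(y)
    return (h0 or h2) and (h0 or h3) and (h1 or h3)

def h1_prime(R, S, T):
    """h1' = R(x1),S(x1,y1),S(x2,y2),T(y2)."""
    return (any(R[a] and S[a, b] for a, b in PAIRS)
            and any(S[a, b] and T[b] for a, b in PAIRS))

def substitution_agrees():
    """Check q_W[S1 := S, S2 := not S, S3 := S] == h1' on every database."""
    for bits in product((0, 1), repeat=len(A) + len(B) + len(PAIRS)):
        R = dict(zip(A, bits[:2]))
        T = dict(zip(B, bits[2:4]))
        S = dict(zip(PAIRS, bits[4:]))
        not_S = {p: 1 - v for p, v in S.items()}
        if bool(q_w(R, S, not_S, S, T)) != bool(h1_prime(R, S, T)):
            return False
    return True
```

Under the substitution the components S ₁,S ₂ and S ₂,S ₃ can never be satisfied by the same tuple, so the first conjunct collapses to R,S and the third to S,T, which is why the conjunction becomes $h_{1}'$.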

The difficulty of this step is to show that the FBDD has enough memory to remember if any of S ₁(a,b), S ₂(a,b), or S ₃(a,b) was set. We show that it does have enough memory, by using the fact that it is a shared FBDD, i.e. it computes the queries d ₁,d ₂,d ₃ simultaneously.

Proof of Lemma 8

Start with the shared FBDD for q _W, given by Lemma 5. We want to enforce

$$S_{1}(a,b) = \neg S_{2}(a,b) = S_{3}(a,b)$$

(4)

for every tuple S ₁(a,b).

Modify each node x as follows. If it is a test for R(a) or for T(b), then make no change: it will continue to be a test for the same variable. If it is a test for S _i(a,b), i=1,3, then we will either replace it with a test for S(a,b), or will “know” the value of S _i(a,b) and will follow only the 0 or the 1 branch. We give the details next.

Suppose node x tests for S ₁(a,b). Denote G _i,x, i=1,2,3 the three Boolean expressions at x: the FBDD is shared, so it can keep track of them separately.

Step 1: Inspect G _3,x. Recall that the original G ₃ contained S ₁(a,b),S ₂(a,b)∨S ₃(a,b),T(b). There are a few cases: (a) G _3,x does not contain S ₁(a,b) at all: then we know S ₂(a,b)=0 on all paths leading to x. (b) G _3,x contains S ₁(a,b) (a prime implicant). Then all paths to x must set S ₂(a,b)=1. In both (a) and (b) we know how to set S ₁(a,b) according to (4). (c) G _3,x contains S ₁(a,b),S ₂(a,b): then we know S ₂(a,b) is unset, and continue with Step 2.

Step 2: At this point we know S ₂(a,b) is unset. Inspect G ₁. The original G ₁ was R(a),S ₁(a,b)∨S ₂(a,b),S ₃(a,b). We know S ₂(a,b) is unset, hence the cases are: (a) G _1,x does not contain S ₂(a,b): then we know S ₃(a,b)=0 on all paths to x. (b) G _1,x contains S ₂(a,b) (a prime implicant). Then on all paths to x, S ₃(a,b)=1. In cases (a) and (b) we know how to set S ₁(a,b) according to the constraint equation (4). (c) G _1,x contains S ₂(a,b),S ₃(a,b). Then we know S ₃(a,b) is also unset, and it means we can read S(a,b).

So far we have assumed that neither G _1,x nor G _3,x are true. Suppose G _3,x is true. Let x be a maximal node where G ₃ becomes true, i.e. the same doesn’t hold for any of the children of x. Due to our previous construction when we transformed the FBDD into a shared one, the entire subgraph reachable from x is isolated. First, we consider all variables S ₁,S ₂,S ₃ already set at x, and update the subgraph under x accordingly. We will need to cope with the variables read within this subgraph. Here, we first rewrite the query d ₁,d ₂=R(x),S ₁(x,y)∨S ₂(x ₁,y ₁),S ₃(x ₁,y ₁),S ₃(x ₂,y ₂),T(y ₂). Given our constraint equation (4), this query is equivalent to R(x),S ₁(x,y). We simply replace the entire subtree at x with an OBDD for R(x),S(x,y). This completes the proof. □

Our hardness result for FBDD is more limited in scope than that for OBDD; in particular it says nothing about non-hierarchical queries. This, however, follows from a very strong result by Bollig & Wegener [1]. They showed that, for arbitrarily large n, there exists a bipartite graph G s.t. the formula $\varPhi = \bigvee_{(i,j) \in G} X_{i} Y_{j}$ has no polynomial size FBDD.^{Footnote 5} This immediately implies that the query Q=R(x),S(x,y),T(y) is not in UCQ(FBDD), because from any FBDD for Q on the complete bipartite graph one can obtain an FBDD for Φ by setting all variables $X_{S(i,j)}=1$ for (i,j)∈G and setting $X_{S(i,j)}=0$ for $(i,j) \not \in G$. In particular, this implies:

Theorem 7

(cf. [1])

If Q is non-hierarchical, then Q∉UCQ(FBDD).

6 Queries and d-DNNFs

d-DNNFs were introduced by Darwiche [12]; a good survey is [13]. We review them here briefly. A Negation Normal Form (NNF) is a rooted DAG whose internal nodes are labeled with ∨ or ∧, and whose leaves are labeled with either a Boolean variable X or its negation ¬X. Each node x in an NNF represents a Boolean expression Φ _x, and the NNF is said to represent Φ _z, where z is the root node. A Decomposable NNF, or DNNF, is one where for every ∧ node, the expressions of its children are over disjoint sets of Boolean variables. A Deterministic DNNF, or d-DNNF, is a DNNF where for every ∨ node, the expressions of its children are mutually exclusive. Given a d-DNNF one can compute its probability in polynomial time, by applying the rules P(Φ _x∧Φ _y)=P(Φ _x)P(Φ _y) and P(Φ _x∨Φ _y)=P(Φ _x)+P(Φ _y) (and similarly for nodes with out-degree greater than 2); this justifies our interest in d-DNNF. Any FBDD of size n can be converted to a d-DNNF of size 5n [13]: for any interior node labeled with variable X in the FBDD, write its formula as (¬X)∧Φ _y∨X∧Φ _z, where y and z are the 0-child and the 1-child: obviously, the ∨ is “deterministic”, and the ∧’s are “decomposable”.
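Both rules can be illustrated with a few lines of code. The sketch below is a minimal illustration under its own assumptions: an FBDD node is a hypothetical tuple `(var, low, high)` with Boolean sinks, a d-DNNF node is a tagged tuple, and the recursion does not share subgraphs, so it shows the two rules rather than the size bound.

```python
def fbdd_to_ddnnf(node):
    """Rewrite each FBDD node as (not X and low) or (X and high)."""
    if isinstance(node, bool):
        return ("const", node)
    var, low, high = node
    return ("or", [
        ("and", [("lit", var, False), fbdd_to_ddnnf(low)]),   # X = 0 branch
        ("and", [("lit", var, True), fbdd_to_ddnnf(high)]),   # X = 1 branch
    ])

def prob(node, p):
    """Probability of a d-DNNF: multiply at and-nodes, add at or-nodes."""
    kind = node[0]
    if kind == "const":
        return 1.0 if node[1] else 0.0
    if kind == "lit":
        _, var, positive = node
        return p[var] if positive else 1.0 - p[var]
    values = [prob(child, p) for child in node[1]]
    if kind == "and":
        result = 1.0
        for v in values:
            result *= v        # children are over disjoint variables
        return result
    return sum(values)         # children are mutually exclusive

# FBDD for X or Y: test X; on 0 test Y, on 1 accept.
fbdd_x_or_y = ("X", ("Y", False, True), True)
ddnnf = fbdd_to_ddnnf(fbdd_x_or_y)
p_x_or_y = prob(ddnnf, {"X": 0.5, "Y": 0.5})
```

The addition at ∨ nodes is sound only because the two branches disagree on X, and the multiplication at ∧ nodes only because an FBDD never tests X again below the node.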

It is open whether d-DNNFs are closed under negation [13, p. 14]; NNFs are obviously closed under negation, but d-DNNFs impose asymmetric restrictions on ∧ and ∨, so by switching them during negation, the resulting NNF is no longer a d-DNNF. For that reason, we extend here d-DNNFs with ¬-nodes, and denote the result d-DNNF^¬: probability computation can still be done in polynomial time on a d-DNNF^¬.

Definition 17

UCQ(dDNNF) is the class of queries Q s.t. for any database D, one can construct a d-DNNF for $\neg \varPhi _{Q}^{D}$ in time polynomial in |D|. UCQ(dDNNF ^¬) is the class of queries Q s.t. one can construct a d-DNNF^¬ for $\varPhi _{Q}^{D}$ in time polynomial in |D|.

UCQ(FBDD)⊆UCQ(dDNNF)⊆UCQ(dDNNF ^¬)⊆UCQ(P); the first inclusion can be shown to be strict.

Proposition 9

q _W∈UCQ(dDNNF)−UCQ(FBDD).

The significance of this result is the following. This is, to the best of our knowledge, the first example of a “simple” Boolean expression (meaning monotone, and with a polynomial size DNF) that has a polynomial size d-DNNF but not FBDD. The previous separation of FBDD and d-DNNF is based on a result by Bollig and Wegener [2], which we review briefly. Consider a Boolean matrix of variables X _ij. Let Φ ₁ denote the formula “there are an even number of 1’s and there is a row consisting only of 1’s”. Let Φ ₂ denote the formula “there are an odd number of 1’s and there is a column consisting only of 1’s”; in [2] the authors show that Φ ₁∨Φ ₂ does not have a polynomial size FBDD. However, this formula has a polynomial size d-DNNF, because each of Φ ₁,Φ ₂ has polynomial size OBDD and Φ ₁∧Φ ₂≡false. Note, however, that these formulas are non-monotone and have exponential size DNF’s (they are not in AC⁰). By contrast, the lineage of q _W is monotone, has polynomial size DNF, and separates FBDD from d-DNNF.

In the rest of the section, we give a sufficient criterion for a query Q=d ₁∧⋯∧d _m, to be in UCQ(dDNNF ^¬), which is quite interesting because it explains the border between d-DNNF and PTIME in terms of lattice-theoretic concepts. We need to define some lattice theoretic concepts first. Let the CNF lattice of Q be (L,≤).

We now describe the construction algorithm. If m=1 then Q is a disjunctive query; in this case it must have a separator (assuming FP ≠ #P (Theorem 1)), d ₁=∃w.Q ₁ and we write: $\neg Q = \bigwedge_{a \in ADom(D)} \neg Q_{1}[a/w]$. The ∧ operator is “decomposable”, i.e. its children are independent.

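If m≥2, we express Q=Q ₁∧Q ₂, and consider the following derivation for ¬Q, where we write ∨^d to indicate that a ∨ operation is disjoint:

$$\neg Q = \neg Q_{1} \vee \neg Q_{2} = \neg Q_{1} \vee^{d} [Q_{1} \wedge \neg Q_{2}] = \neg Q_{1} \vee^{d} \neg[\neg Q_{1} \vee Q_{2}] = \neg Q_{1} \vee^{d} \neg[\neg(Q_{1} \vee Q_{2}) \vee^{d} Q_{2}]$$

(5)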
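A decomposition of this kind can be sanity-checked mechanically. The sketch below assumes the rewriting ¬(Q ₁∧Q ₂)=¬Q ₁ ∨^d ¬[¬(Q ₁∨Q ₂) ∨^d Q ₂], which uses only Q ₁, Q ₂ and Q ₁∨Q ₂; it verifies over all truth values that the rewriting is valid and that both ∨ operations have mutually exclusive operands, and then reads off the resulting probability rule. The function names are hypothetical.

```python
from itertools import product

def check_decomposition():
    """True iff the rewriting of not(Q1 and Q2) is valid and both ORs are disjoint."""
    for q1, q2 in product((False, True), repeat=2):
        inner_left, inner_right = not (q1 or q2), q2       # not(Q1 v Q2) v^d Q2
        outer_left = not q1
        outer_right = not (inner_left or inner_right)      # equals Q1 and not Q2
        if inner_left and inner_right:
            return False                                   # inner OR not disjoint
        if outer_left and outer_right:
            return False                                   # outer OR not disjoint
        if (outer_left or outer_right) != (not (q1 and q2)):
            return False                                   # rewriting changes the value
    return True

def prob_not_q(p_q1, p_q2, p_q1_or_q2):
    """P(not Q) from the rewriting: disjoint OR adds, negation complements."""
    inner = (1.0 - p_q1_or_q2) + p_q2
    return (1.0 - p_q1) + (1.0 - inner)
```

Simplifying `prob_not_q` gives 1 − P(Q ₁) − P(Q ₂) + P(Q ₁∨Q ₂), i.e. the usual inclusion-exclusion formula for P(Q ₁∧Q ₂), which is one way to see why exactly these three subqueries are needed.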

The effect of the decomposition above is that it reduces Q to three subqueries, namely Q ₁, Q ₂, and Q ₁∨Q ₂, whose CNF lattices are meet-sublattices of L, obtained as follows. Let Q=Q ₁∧Q ₂, where Q ₁=d _l1∧d _l2∧⋯ and Q ₂=d _o1∧d _o2∧⋯. Denote by v ₁,…,v _m,u ₁,…,u _k the co-atoms of this lattice, such that v ₁,v ₂,… are the co-atoms corresponding to d _l1,d _l2,… and u ₁,u ₂,… are the co-atoms for d _o1,d _o2,….

The CNF lattice of Q ₁ is $\overline{M}$, where M={v ₁,…,v _m}.
The CNF lattice of Q ₂ is $\overline{K}$, where K={u ₁,…,u _k}.
The CNF lattice of Q ₁∨Q ₂=⋀_i,j(d _li∨d _oj) is $\overline{N}$, where N={v _i∧u _j∣i=1,m;j=1,k}. Here v _i∧u _j denotes the lattice-meet, and corresponds to the query-union.

Note that each of the three lattices above, $\bar{M}, \bar{K},\bar{N}$ is a strict subset of L.

This justifies the following definition of d-safe queries, analogous to safe queries (Definition 4).

Definition 18

(1) Let Q=Q ₁∧Q ₂. Then Q is d-safe if Q ₁,Q ₂,Q ₁∨Q ₂ are all d-safe. (2) Let d=c ₁∨⋯∨c _k be a disjunctive query, and let d=d ₀∨d ₁, where d ₀ contains all components c _i without variables, and d ₁ contains all components c _i with at least one variable. Then d is d-safe if d ₁ has a separator w and d ₁[a/w] is d-safe.

Theorem 8

If Q is d-safe, then it is in UCQ(dDNNF ^¬).

To illustrate this algorithm, we now show how to construct a d-DNNF for q _W.

Proof of Proposition 9

Consider q _W=d ₁∧d ₂∧d ₃ in Fig. 1.

Denote the three lower points in the lattice as:

We express q _W as d ₁∧(d ₂∧d ₃), and using (5) get
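Instantiating the decomposition ¬(Q ₁∧Q ₂)=¬Q ₁ ∨^d ¬[¬(Q ₁∨Q ₂) ∨^d Q ₂] with Q ₁=d ₁ and Q ₂=d ₂∧d ₃, and writing d ₁₂ for d ₁∨(d ₂∧d ₃) as in the next paragraph, a plausible form of the resulting expression (a reconstruction, not a quotation) is:

$$\neg q_{W} = \neg d_{1} \vee^{d} \neg\bigl[\neg(d_{1} \vee (d_{2} \wedge d_{3})) \vee^{d} (d_{2} \wedge d_{3})\bigr] = \neg d_{1} \vee^{d} \neg\bigl[\neg d_{12} \vee^{d} (d_{2} \wedge d_{3})\bigr]$$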

d ₁ is hierarchical-read-once, and d ₂∧d ₃ is inversion-free; hence they both have compact d-DNNF. d ₁∨(d ₂∧d ₃)=d ₁₂ is inversion-free and hence it also admits a compact d-DNNF. □

We prove now that every d-safe query is also safe. Fix a lattice L. Every non-empty subset $S \subseteq L- \{\hat{1}\}$ corresponds to a query, ⋀_u∈S λ(u). We define a nondeterministic function NE that maps a non-empty set $S \subseteq L - \{\hat{1}\}$ to a set of elements $\mathit{NE}(S) \subseteq \overline{S}$, as follows. If S={v} is a singleton set, then NE(S)={v}. Otherwise, partition S non-deterministically into two disjoint, non-empty sets S=M∪K, define N={v∧u∣v∈M,u∈K}, and define NE(S)=NE(M)∪NE(K)∪NE(N). Thus, NE(S) is non-deterministic, because it depends on our choice for partitioning S. The intuition is the following: in order for the query ⋀_u∈S λ(u) to be d-safe, all lattice points in NE(S) must also be d-safe: they are “non-erasable”.

Call an element z∈L erasable if there exists a non-deterministic choice for NE(L ^∗) that does not contain z. Recall that L ^∗ is the set of co-atoms of L. The intuition is that, if z is erasable, then there exists a sequence of applications of rules from Definition 18, which avoids computing z; in other words, it “erases” z from the list of queries in the lattice for which it needs to compute the d-DNNF ^¬, and therefore Q _z is not required to be d-safe. We prove that only queries Q _z where $\mu_{L}(z,\hat{1})=0$ can be erased:

Lemma 9

If z is erasable in L, then $\mu_{L}(z, \hat{1}) = 0$.

Proof

We prove the following claim, by induction on the size of the set S: if z∉NE(S), $z \neq \hat{1}$, then $\mu_{\overline{S}}(z, \hat{1}) = 0$ (if $z \notin \overline{S}$, then we define $\mu_{\overline{S}}(z, \hat{1}) = 0$). The lemma follows by taking S=L ^∗ (the set of all co-atoms in L).

If S={v}, then NE(S)={v} and $\overline{S} = \{v,\hat{1}\}$: therefore, the claim holds vacuously. Otherwise, let S=M∪K, and define N={v∧u∣v∈M,u∈K}. We have NE(S)=NE(M)∪NE(K)∪NE(N). If z∉NE(S), then z∉NE(M), z∉NE(K), and z∉NE(N). By induction hypothesis $\mu_{\overline{M}}(z, \hat{1}) =\mu_{\overline{K}}(z, \hat{1}) = \mu_{\overline{N}}(z, \hat{1}) = 0$. Next, we notice that (1) $\overline{M}, \overline{K}, \overline{N}\subseteq \overline{S}$, (2) $\overline{S} = \overline{M} \cup \overline{K} \cup \overline{N}$ and (3) $\overline{M} \cap \overline{K} = \overline{N}$. Then, we apply the definition of the Möbius function directly, using a simple inclusion-exclusion formula:

□
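The function NE and the notion of erasability can be explored by brute force on a small lattice. The following is a minimal sketch, not the authors' procedure. It assumes the lattice of q ₉ has the nine elements listed in the proof of Proposition 10, and it encodes each element as the set of indices of the co-atoms above it (an encoding chosen here for illustration, valid when every element is a meet of co-atoms). The names `meet`, `ne_outcomes` and `erasable` are hypothetical.

```python
from functools import lru_cache
from itertools import combinations, product

# Assumed encoding: an element is the frozenset of co-atom indices above it,
# so u <= v in the lattice iff u is a superset of v.
TOP = frozenset()                                  # top element (1-hat)
D1, D2, D3, D4 = (frozenset({i}) for i in (1, 2, 3, 4))
D12, D13, D23 = frozenset({1, 2}), frozenset({1, 3}), frozenset({2, 3})
ZERO = frozenset({1, 2, 3, 4})                     # bottom element (0-hat)
LATTICE = frozenset({ZERO, D12, D13, D23, D1, D2, D3, D4, TOP})
COATOMS = frozenset({D1, D2, D3, D4})


def meet(u, v):
    """Greatest element of LATTICE that lies below both u and v."""
    below_both = [w for w in LATTICE if w >= (u | v)]
    return min(below_both, key=len)


@lru_cache(maxsize=None)
def ne_outcomes(s):
    """All sets NE(s) reachable over every choice of partition s = M u K."""
    if len(s) == 1:
        return frozenset({s})
    first, *rest = s
    results = set()
    # Fix `first` in M so each unordered partition is visited once;
    # r < len(rest) keeps K non-empty.
    for r in range(len(rest)):
        for extra in combinations(rest, r):
            m = frozenset((first, *extra))
            k = s - m
            n = frozenset(meet(v, u) for v in m for u in k)
            for a, b, c in product(ne_outcomes(m), ne_outcomes(k), ne_outcomes(n)):
                results.add(a | b | c)
    return frozenset(results)


def erasable(coatoms):
    """Elements (other than TOP) missing from at least one outcome of NE(coatoms)."""
    outcomes = ne_outcomes(coatoms)
    return {z for z in LATTICE - {TOP} if any(z not in o for o in outcomes)}


print(erasable(COATOMS))
```

Under these assumptions the bottom element never appears in the printed set, which agrees with the claim in the proof of Proposition 10 that $\hat{0}$ cannot be erased for q ₉. The enumeration is exponential in the number of co-atoms and is meant only for hand-sized lattices.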

The lemma implies immediately:

Proposition 10

For any UCQ query Q, if Q is d-safe, then it is safe. The converse does not hold in general: query q ₉ is safe but is not d-safe.

It is conjectured that q ₉∉UCQ(dDNNF ^¬). Note that the proposition only states that q ₉ is not d-safe, but it is not known whether d-safety is a complete characterization of UCQ(dDNNF ^¬).

Proof

We prove the statement by induction on Q. We show only the key induction step, which is when Q=⋀_i d _i, and L is its CNF lattice. Let Z⊆L denote the nodes corresponding to d-unsafe queries: if Q is d-safe, then all elements in Z are erasable. This implies that ∀z∈Z, $\mu(z, \hat{1}) = 0$. Hence, we can apply Möbius’ inversion formula to the lattice L, and refer only to queries that are d-safe; by induction hypothesis, these queries are also safe, implying that Q is safe.

We show that q ₉ is safe, but is not d-safe. We will denote the lattice points with query d _i∨d _j∨⋯ in Fig. 1 as d _ij…. The query at $\hat{0}$ is the only hard query (since it is equivalent to h ₃), and $\mu(\hat{0},\hat{1}) = 0$. On the other hand, we prove that $\hat{0}$ cannot be erased. Indeed, the co-atoms of the lattice are L ^∗={d ₁,d ₂,d ₃,d ₄}; given the symmetry of d ₁,d ₂,d ₃, there are only three ways to partition the co-atoms into two disjoint sets L ^∗=M∪K:

M={d ₁,d ₂,d ₃}, K={d ₄}. In this case the lattice $\overline{M}$ is $\{\hat{0}, d_{12}, d_{13}, d_{23}, d_{1}, d_{2}, d_{3}, \hat{1}\}$, and $\mu_{\overline{M}}(\hat{0}, \hat{1}) = -1$, proving that this query is unsafe, and, therefore, d-unsafe.
M={d ₁,d ₂}, K={d ₃,d ₄}. In this case the lattice $\overline{K}$ is $\{\hat{0}, d_{3}, d_{4}, \hat{1}\}$, and has $\mu_{\overline{K}}(\hat{0}, \hat{1}) = 1$, hence, by the same argument, is d-unsafe.
M={d ₁}, K={d ₂,d ₃,d ₄}. Here, too, $\overline{K} = \{\hat{0}, d_{23}, d_{2}, d_{3}, d_{4}, \hat{1}\}$, and $\mu_{\overline{K}}(\hat{0}, \hat{1})= 1$.

□
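The Möbius values quoted in the three cases above can be checked mechanically. The sketch below is an illustration under stated assumptions rather than part of the original proof: it uses the standard recursion $\mu(\hat{1},\hat{1}) = 1$ and $\mu(z,\hat{1}) = -\sum_{u > z} \mu(u,\hat{1})$, and it encodes each lattice element as the set of co-atom indices above it, so that u lies strictly above z exactly when u is a proper subset of z. The function name `mobius_to_top` and this encoding are hypothetical choices.

```python
def mobius_to_top(elements, top):
    """Return {z: mu(z, top)} for a finite poset ordered by reverse inclusion."""
    mu = {}

    def compute(z):
        if z not in mu:
            if z == top:
                mu[z] = 1
            else:
                # Sum over all elements strictly above z (proper subsets of z).
                mu[z] = -sum(compute(u) for u in elements if u < z)
        return mu[z]

    for z in elements:
        compute(z)
    return mu


# Assumed encoding of the lattice of q9, read off the proof of Proposition 10.
TOP = frozenset()
D1, D2, D3, D4 = (frozenset({i}) for i in (1, 2, 3, 4))
D12, D13, D23 = frozenset({1, 2}), frozenset({1, 3}), frozenset({2, 3})
ZERO = frozenset({1, 2, 3, 4})

FULL = [ZERO, D12, D13, D23, D1, D2, D3, D4, TOP]   # the whole lattice L
CASE_1 = [ZERO, D12, D13, D23, D1, D2, D3, TOP]     # closure of {d1, d2, d3}
CASE_2 = [ZERO, D3, D4, TOP]                        # closure of {d3, d4}
CASE_3 = [ZERO, D23, D2, D3, D4, TOP]               # closure of {d2, d3, d4}

for name, lattice in [("L", FULL), ("case 1", CASE_1),
                      ("case 2", CASE_2), ("case 3", CASE_3)]:
    print(name, mobius_to_top(lattice, TOP)[ZERO])
```

With this encoding the script reports 0 for the full lattice and −1, 1, 1 for the three sub-lattices, matching the values stated in the proof.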

7 Results on Non-uniform Classes

In this section we look at the non-uniform compact classes for different targets T.

Definition 19

For target T∈{OBDD,FBDD,d-DNNF}, denote by UCQ ⁿ(T) the class of all queries Q∈UCQ s.t. $\varPhi _{D}^{Q}$ has a compilation in T of size polynomial in |D| for all D.

Note that in the case of RO, a formula is either RO or not RO. Furthermore, thanks to a result due to Gurvich [19], this can be decided in PTIME for the monotone formulas that form the lineages of UCQ. On the other hand, a formula may have a compact OBDD, FBDD, or d-DNNF, but our algorithm may not be able to construct it. Hence UCQ ⁿ(T)⊇UCQ(T).

For OBDD, though, it follows from the results of Sect. 4 that

Proposition 11

UCQ ⁿ(OBDD)=UCQ(OBDD).

The proof follows from Proposition 7, which implies that queries not in UCQ(OBDD) do not have a polynomial size OBDD, i.e., UCQ−UCQ(OBDD)⊆UCQ−UCQ ⁿ(OBDD). Hence UCQ ⁿ(OBDD)⊆UCQ(OBDD), which means the two sets must be equal.

We do not have a full characterization for FBDD and d-DNNF, so we do not know whether the same is true for them. The main separation results, though, still hold for the non-uniform classes as well.

Proposition 12

Proof

We can construct an FBDD for q _V in PTIME (Theorem 5), but q _V does not always admit a compact OBDD (Proposition 7). This proves the first result. Similarly, one can construct a d-DNNF for q _W in PTIME (Proposition 9), but q _W admits no polynomial size FBDD (Theorem 6). This proves the last two separations. □

8 Conclusion

We have studied the problem of compiling the query lineage into compact representations. We considered four compilation targets: read-once, OBDD, FBDD, and d-DNNF. We showed that over the query language of unions of conjunctive queries, these four classes form a strict hierarchy. For the first two classes we gave a complete characterization based on the query’s syntax. For the last two classes we gave sufficient characterizations.

Our two main separation results, between UCQ(OBDD) and UCQ(FBDD), and between UCQ(FBDD) and UCQ(dDNNF), are the first examples of “simple” Boolean expressions (meaning: monotone, and with polynomial size DNFs) that separate those two classes.

We leave three open problems: complete characterizations of FBDD and d-DNNF, and separation of the latter from PTIME. Also, as future work, it would be interesting to investigate compact representations of lineages in other semirings described in [17].

Notes

BDDs are also known as branching programs (BPs) in the literature.
We omitted the inner quantifiers ∃y ₁ and ∃y ₂.
In this particular example one could reduce the size to O(n ²) by sharing nodes among the multiple copies of F ₁ and similarly for F ₂.
r is for restricted, since we do not have a full characterization yet.
Their graph is the following: fix n=p ² where p is a prime number. Then G={(a+bp,c+dp)∣c≡(a+bd)mod p}.
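The graph in the last note can be built directly from its defining congruence. The sketch below assumes that a, b, c and d each range over 0,…,p−1, which the note does not state explicitly; the function name `build_graph` is hypothetical.

```python
def build_graph(p):
    """Edge set {(a + b*p, c + d*p) : c = (a + b*d) mod p}, a, b, c, d in range(p).

    p is expected to be prime, as in the note; this is not checked here.
    """
    edges = set()
    for a in range(p):
        for b in range(p):
            for d in range(p):
                c = (a + b * d) % p        # c is determined by a, b and d
                edges.add((a + b * p, c + d * p))
    return edges


G = build_graph(3)
print(len(G))
```

Under this reading each of the n=p ² left nodes a+bp has exactly p neighbours (one for each d), so the graph has p ³ edges.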

References

[1] Bollig, B., Wegener, I.: A very simple function that requires exponential-size read-once branching programs. Inf. Process. Lett. 66, 53–57 (1998)
[2] Bollig, B., Wegener, I.: Complexity theoretical results on partitioned (nondeterministic) binary decision diagrams. Theory Comput. Syst. 32, 487–503 (1999). doi:10.1007/s002240000128
[3] Bryant, R.E.: Symbolic manipulation of boolean functions using a graphical representation. In: DAC, pp. 688–694 (1985)
[4] Bryant, R.E.: Graph-based algorithms for boolean function manipulation. IEEE Trans. Comput. 35(8), 677–691 (1986)
[5] Bryant, R.E.: On the complexity of VLSI implementations and graph representations of boolean functions with application to integer multiplication. IEEE Trans. Comput. 40(2), 205–213 (1991). doi:10.1109/12.73590
[6] Cadoli, M., Donini, F.M.: A survey on knowledge compilation. AI Commun. 10(3,4), 137–150 (1997)
[7] Chandra, A., Merlin, P.: Optimal implementation of conjunctive queries in relational data bases. In: Proceedings of 9th ACM Symposium on Theory of Computing, Boulder, Colorado, pp. 77–90 (1977)
[8] Dalvi, N., Suciu, D.: The dichotomy of conjunctive queries on probabilistic structures. In: PODS, pp. 293–302 (2007)
[9] Dalvi, N., Suciu, D.: Efficient query evaluation on probabilistic databases. VLDB J. 16(4), 523–544 (2007)
[10] Dalvi, N., Suciu, D.: Management of probabilistic data: foundations and challenges. In: PODS, pp. 1–12. ACM Press, New York (2007)
[11] Dalvi, N.N., Schnaitter, K., Suciu, D.: Computing query probability with incidence algebras. In: PODS, pp. 203–214 (2010)
[12] Darwiche, A.: On the tractable counting of theory models and its application to belief revision and truth maintenance. CoRR cs.AI/0003044 (2000)
[13] Darwiche, A., Marquis, P.: A knowledge compilation map. J. Artif. Intell. Res. 17(1), 229–264 (2002)
[14] Gál, A.: A simple function that requires exponential size read-once branching programs. Inf. Process. Lett. 62(1), 13–16 (1997). doi:10.1016/S0020-0190(97)00041-0. http://www.sciencedirect.com/science/article/B6V0F-3SNV288-T/2/39afb175413bd7ee03397bb582be0161
[15] Gergov, J., Meinel, C.: Efficient boolean manipulation with obdd’s can be extended to fbdd’s. IEEE Trans. Comput. 43(10), 1197–1209 (1994)
[16] Golumbic, M.C., Mintz, A., Rotics, U.: Factoring and recognition of read-once functions using cographs and normality and the readability of functions associated with partial k-trees. Discrete Appl. Math. 154(10), 1465–1477 (2006)
[17] Green, T., Karvounarakis, G., Tannen, V.: Provenance semirings. In: PODS, pp. 31–40 (2007)
[18] Green, T.J.: Containment of conjunctive queries on annotated relations. In: ICDT, pp. 296–309 (2009)
[19] Gurvich, V.: Repetition-free boolean functions. Usp. Mat. Nauk 32, 183–184 (1977)
[20] Olteanu, D., Huang, J.: Using OBDDs for efficient query evaluation on probabilistic databases. In: SUM, pp. 326–340 (2008)
[21] Roy, S., Perduca, V., Tannen, V.: Faster query answering in probabilistic databases using read-once functions. In: Proceedings of the 14th International Conference on Database Theory, ICDT’11, pp. 232–243. ACM, New York (2011). doi:10.1145/1938551.1938582
[22] Sagiv, Y., Yannakakis, M.: Equivalences among relational expressions with the union and difference operators. J. ACM 27, 633–655 (1980)
[23] Sen, P., Deshpande, A., Getoor, L.: Read-once functions and query evaluation in probabilistic databases. In: VLDB (2010)
[24] Sieling, D., Wegener, I.: Graph driven bdds—a new data structure for boolean functions. Theor. Comput. Sci. 141(1&2), 283–310 (1995)
[25] Tannen, V.: Provenance for database transformations. In: EDBT, p. 1 (2010)
[26] Wegener, I.: Branching Programs and Binary Decision Diagrams: Theory and Applications. SIAM, Philadelphia (2000)
[27] Wegener, I.: BDDs–design, analysis, complexity, and applications. Discrete Appl. Math. 138(1–2), 229–251 (2004)

Acknowledgements

This work was supported by IIS-0713576 and IIS-0627585.

Author information

Authors and Affiliations

University of Washington, Seattle, WA, USA
Abhay Jha & Dan Suciu

Corresponding author

Correspondence to Abhay Jha.

About this article

Cite this article

Jha, A., Suciu, D. Knowledge Compilation Meets Database Theory: Compiling Queries to Decision Diagrams. Theory Comput Syst 52, 403–440 (2013). https://doi.org/10.1007/s00224-012-9392-5

Published: 06 March 2012
Issue Date: April 2013
DOI: https://doi.org/10.1007/s00224-012-9392-5
