1 Introduction

Static analysis is an area of database theory that focuses on deciding properties of syntactic objects, like queries, integrity constraints, or data dependencies. The unifying paradigm is that because these objects are mostly user-generated, they tend to be small; hence, higher complexities are tolerable. The fundamental problems include satisfiability, validity, containment, and equivalence of queries [9, 25], as well as consistency and entailment of integrity constraints [16, 28]. More specialized tasks include query rewriting in data integration scenarios [24], and manipulating schema mappings in data exchange and schema evolution scenarios [1, 15]. Many of these problems are equivalent to satisfiability of fragments of first order logic, possibly over a restricted class of structures, but they are rarely presented this way, because the involved fragments are tailored for specific applications, and usually do not form natural sublogics. As satisfiability over arbitrary structures is undecidable even for relatively simple fragments of first order logic, in static analysis undecidability is always close [19, 20].

In this paper we present a decidability result (with tight complexity bounds) for a problem in static analysis for tree-structured data. The specific model we consider is that of data trees: finite ordered unranked trees whose nodes have a label from a finite alphabet and store a data value from an infinite data domain. The problem has three possible interpretations:

  • consistency modulo schema for a class of integrity constraints;

  • validity modulo schema for a class of queries; and

  • consistency for a class of schema mappings.

The more general problems of entailment (or implication) of constraints and containment of queries are—as is often the case—very close to their restricted counterparts listed above, and can be solved by easy modifications of our decision procedure.

Our basic setting is that of consistency of integrity constraints; it seems best suited for proofs and—in combination with entailment—the most appealing. We consider non-mixing constraints of the forms

$$\alpha (\bar{x}) \Rightarrow \eta_{\sim}(\bar{x}) \quad \text{ and} \quad \alpha (\bar{x}) \Rightarrow \eta_{\nsim} (\bar{x}) $$

that require each tuple \(\bar {x}\) of nodes selected by α to satisfy, respectively, a positive combination of equalities \(\eta _{\sim }\) or a positive combination of inequalities \(\eta _{\nsim }\) over the stored data values. As tuple selectors \(\alpha (\bar {x})\) we use conjunctive queries over the signature including label tests and the usual navigational axes. For example, the constraint

$$a(x) \;\land \;x \downarrow y \;\land\; x \downarrow y^{\prime} \; \land \;y \rightarrow^+\! y^{\prime} \;\Rightarrow \;y \nsim y^{\prime} $$

expresses that different children of the same a-labelled node store different data values, and the constraint

$$a(x) \;\land \;x \downarrow y \; \land \;x \downarrow y^{\prime}\; \land\; x \downarrow y^{\prime\prime}\;\Rightarrow\; y \sim y^{\prime}\; \lor \; y^{\prime}\sim y^{\prime\prime} \lor y^{\prime\prime}\sim y $$

expresses that at most two different data values are used by children of each a-labelled node. The consistency problem is to decide if there exists an instance of a given schema that satisfies a given set of constraints. In the example above, there exists an instance satisfying both constraints if and only if the schema allows trees without a-labelled nodes with more than two children.
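The interplay of the two example constraints can be checked by brute force. The following is a minimal sketch, not part of the decision procedure: the `Node` class and all function names are hypothetical helpers introduced only for illustration, and data values are modelled as small integers.

```python
from itertools import combinations, product

class Node:
    """A data-tree node: a label, a stored data value, and ordered children."""
    def __init__(self, label, value, children=()):
        self.label = label
        self.value = value
        self.children = list(children)

def nodes(t):
    """Yield all nodes of the tree rooted at t, in document order."""
    yield t
    for c in t.children:
        yield from nodes(c)

def children_pairwise_distinct(t, label="a"):
    """First constraint: different children of a `label` node store different values."""
    return all(y.value != y2.value
               for x in nodes(t) if x.label == label
               for y, y2 in combinations(x.children, 2))

def children_at_most_two_values(t, label="a"):
    """Second constraint: children of a `label` node use at most two data values."""
    return all(len({y.value for y in x.children}) <= 2
               for x in nodes(t) if x.label == label)

def both_satisfiable(n_children):
    """Brute force: can n children of a single a-node satisfy both constraints?

    Only the equality pattern of the values matters, so n values suffice.
    """
    for vals in product(range(n_children), repeat=n_children):
        t = Node("a", 0, [Node("b", v) for v in vals])
        if children_pairwise_distinct(t) and children_at_most_two_values(t):
            return True
    return False
```

Under these assumptions, `both_satisfiable(2)` holds while `both_satisfiable(3)` fails, which matches the observation that a-labelled nodes may have at most two children.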

What is the expressive power of non-mixing constraints? Let us first look at what they cannot do. Being first-order constraints, they cannot compare full subtrees, unlike some other formalisms [21, 22]. They have purely universal character (can be written as universal sentences of first order logic), so they cannot express general inclusion dependencies nor foreign keys, as these need quantifier alternation. Finally, the inability to mix freely data equalities and inequalities within a single constraint makes them unable to express general functional dependencies. What can they do, then?

Non-mixing integrity constraints can be seen as a special case of the general framework of XML-to-relational constraints (X2R constraints) introduced by Niewerth and Schwentick [27]. Within this framework they cover a wide subclass of functional dependencies, dubbed XKFDs, which are particularly well suited for tree-structured data. They include W3C XML Schema key and unique constraints [18], as well as absolute and relative XML keys by Arenas et al. [2], and XFDs by Arenas and Libkin [3]. XKFDs can be expressed with non-mixing constraints of the form \(\alpha (\bar {x})\Rightarrow \eta _{\nsim }(\bar {x})\); that is, using only data inequalities.

Constraints of the form \(\alpha (\bar {x})\Rightarrow \eta _{\sim }(\bar {x})\)—that is, using only equalities—can express all sorts of finite data domain restrictions, either to a specific set of constants or to a set of data values taken from the data tree (the latter can be seen as a limited variant of inclusion constraints), as well as cardinality restrictions over data values (like in the example above).

The novelty of our work is that we allow these two kinds of constraints simultaneously. Unrestricted mixing of data equalities and inequalities in constraints would immediately lead to undecidability [6], but for non-mixing constraints we can show decidability of the consistency problem, and a slight extension of the proof gives decidability for entailment (with the same complexity bounds).

Our approach leads through a simple model property, which asserts that a set of constraints is satisfiable if and only if it has a model of bounded data cut [7]; that is, the number of data values shared by any subforest of the model and its complement is bounded. This property can be seen as a strengthening of the bounded clique-width property [11], in which decompositions must follow the structure of data trees. The robustness of our approach is witnessed by the fact that it can be naturally extended to constraints in which tuple selectors \(\alpha (\bar {x})\) are expressed in monadic second order logic (MSO) using label tests and navigational predicates. At the core of our argument lies a simple lemma of geometric nature.

Under the second interpretation our result shows decidability of validity and containment for unions of conjunctive queries where each conjunctive query can use either data equality or inequality, but never both. Seen this way, our result is a uniform extension of decidability results for UCQs using only data equality, and UCQs using only data inequality by Björklund et al. [6] (see also [12]). However, it cannot be obtained via a combination of techniques used in these cases, as they are virtually contradictory: they require assuming that almost all data values in counter-examples are, respectively, different and equal. If data equalities and inequalities are mixed freely in UCQs, even validity is undecidable [6].

In its third incarnation, our result gives decidability of the consistency problem for XML schema mappings with source integrity constraints, which asks to decide if there exists a source instance which satisfies the integrity constraints and admits a target instance satisfying the requirements imposed by the schema mapping.

In all three cases (excluding the unsurprisingly non-elementary MSO extension), the decision procedure is doubly exponential. This bound is tight, as already validity modulo schema for UCQs over trees without data values is 2ExpTime-complete [6]. We show that restricting the CQs to tree patterns does not help. However, the complexity does drop to ExpTime-complete when we replace CQs with tree patterns and bound the number of variables used in data comparisons.

A broader context for our work is the rich landscape of results on static analysis for the popular XML query language XPath [5, 26] and related formalisms like alternating register automata [17, 23] or the two-variable fragment of first order logic with data comparisons [8]. These formalisms do not compare easily with ours. Arbitrary alternation of quantifiers (implicit, in the case of XPath) lets them reach far beyond conjunctive queries. But the restriction on the number of registers or variables (reflected in the syntax of XPath) limits data comparisons: one cannot compare data values from too many nodes at the same time. In their basic form, our results imply decidability (with the same tight complexity bounds) of the containment problem in the presence of a schema for unions of XPath queries without negation, where each query uses either equality or inequality, but never both. The extension to MSO constraints allows free use of negation as long as data comparisons are not used under negation.

The remainder of the paper begins with a precise definition of non-mixing constraints and a short discussion of their scope (Section 2). Then we present the decision procedure for consistency of non-mixing constraints and show its optimality (Section 3). We continue with a potpourri of extensions and connections: the entailment problem (Section 4.1), the lower-complexity fragment (Section 4.2), the relationships with existing constraint formalisms (Section 4.4), the two alternative interpretations of our results (Sections 4.3 and 4.5), a comparison with clique-width (Section 4.6), and the MSO extension (Section 4.7). We conclude with a brief discussion of further possible extensions and open questions (Section 5).

This is an extended version of an 18-page paper under the same title presented at ICDT 2016. The new material includes full proofs of all results, as well as the comparison with clique-width and the MSO extension. There is also a major difference in the way the proof of the main result is presented. In the conference version, register tree automata are used to recognize witnesses for consistency of bounded data cut. Here, we encode such witnesses as trees over a finite alphabet and use ordinary tree automata. The new formulation encapsulates reasoning about data values within the encoding, and harmonizes with the clique-width approach and the MSO extension.

2 Non-Mixing Constraints

2.1 Preliminaries

Let us fix a finite labelling alphabet Γ and a countably infinite set of data values \({\mathbb {D}}\). A data tree t is a finite ordered unranked tree whose nodes are labelled with elements of Γ by function \(\mathsf {lab}_{t}: \mathsf {dom}_{t} \to {\Gamma }\), and with elements of \({\mathbb {D}}\) by function \(\mathsf {val}_{t}: \mathsf {dom}_{t} \to {\mathbb {D}}\); here, \(\mathsf {dom}_{t}\) stands for the domain of tree t, that is, the set of its nodes. If \(\mathsf {lab}_{t}(v) = a\) and \(\mathsf {val}_{t}(v) = d\), we say that node v has label a and stores data value d. A data forest f is a sequence of data trees whose roots are considered siblings (with the inherited order); \(\mathsf {lab}_{f}\), \(\mathsf {val}_{f}\), and \(\mathsf {dom}_{f}\) are defined naturally. While each data tree contains at least the root, a data forest can be empty. For a node v of t, we write \(t_{v}\) for the data forest consisting of subtrees of t rooted at v itself and at all preceding siblings of v; by slight abuse of notation we write \(t - t_{v}\) for the remaining part of t (see Fig. 1 for illustration). For a forest f we use the analogous notation, \(f_{v}\) and \(f - f_{v}\).

Fig. 1 A tree t and a subforest \(t_{v}\) associated with a node v

We abstract schemas as tree automata in the “previous sibling, last child” variant. A tree automaton \(\mathcal {A}\) is a tuple \((Q, q_{0}, F, \delta )\), where Q is a finite set of states, \(q_{0} \in Q\) is an initial state, \(F \subseteq Q\) is a set of accepting states, and \(\delta \subseteq Q \times Q \times {\Gamma } \times Q\) is a set of transitions. During the computation the automaton assigns a state to each node v of the input tree t, based on the accumulated information about \(t_{v}\). More precisely, the state for the node v depends on the label of v and the states from the previous sibling and the last child of v. In leftmost siblings and in leaves we resort to imaginary nodes outside of the actual tree t, which are always assigned the initial state \(q_{0}\). Formally, let \(\mathsf {dom}^{\mathit {cl}}_{t}\) be the set containing each node of t, an artificial previous sibling for each leftmost sibling in t, and an artificial (last) child for each leaf in t. A run of \(\mathcal {A}\) on t is a function \(\rho :\mathsf {dom}^{\mathit {cl}}_{t}\to Q\) such that \(\rho (v) = q_{0}\) for every node \(v\in \mathsf {dom}^{\mathit {cl}}_{t}-\mathsf {dom}_{t}\), and for every node \(v \in \mathsf {dom}_{t}\) with previous sibling \(v_{ps}\) and last child \(v_{lc}\) there is a transition \((\rho (v_{ps}), \rho (v_{lc}), \mathsf {lab}_{t}(v), \rho (v)) \in \delta \). A run ρ is accepting if it assigns a state from F to the root of t, and a tree t is accepted by \(\mathcal {A}\) if it admits an accepting run. Runs on forests are defined entirely analogously; acceptance is based on the state in the root of the last tree (if the forest is empty, we take the initial state \(q_{0}\)). An automaton is deterministic if δ is a function Q × Q × Γ → Q. Each deterministic automaton has a unique run on each tree (and forest), and can be complemented (negated) simply by replacing the set of accepting states F with its complement \(Q - F\).
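To make the “previous sibling, last child” convention concrete, here is a minimal sketch of the unique run of a deterministic automaton. It is only an illustration: the `Node` class, the dictionary encoding of δ, and the example automaton (which rejects trees in which some a-labelled node has more than two children) are assumptions made for this sketch.

```python
class Node:
    """A tree node with a label and ordered children (data values are irrelevant here)."""
    def __init__(self, label, children=()):
        self.label = label
        self.children = list(children)

def run_forest(delta, q0, forest):
    """State assigned to the root of the last tree of `forest` (q0 for the empty forest)."""
    state = q0  # state of the imaginary previous sibling of the leftmost root
    for tree in forest:
        # The state of the last child; for a leaf this is q0 (imaginary child).
        last_child_state = run_forest(delta, q0, tree.children)
        state = delta[(state, last_child_state, tree.label)]
    return state

def accepts(delta, q0, final, tree):
    """A tree is accepted if the state assigned to its root is accepting."""
    return run_forest(delta, q0, [tree]) in final

# Example automaton: a numeric state counts the siblings seen so far (capped at 3);
# "bad" records that some a-labelled node has more than two children.
STATES = [0, 1, 2, 3, "bad"]
LABELS = ["a", "b"]

def step(prev, last, label):
    if prev == "bad" or last == "bad":
        return "bad"
    if label == "a" and last == 3:
        return "bad"
    return min(prev + 1, 3)

DELTA = {(p, c, l): step(p, c, l) for p in STATES for c in STATES for l in LABELS}
Q0, FINAL = 0, {1, 2, 3}
```

Because δ is a total function here, complementing the automaton amounts to replacing `FINAL` with `set(STATES) - FINAL`.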

To facilitate the use of the standard first order semantics, we model data trees and data forests as relational structures over signature

$$\mathsf{sig}_{dt}=\{\downarrow, \downarrow^{+}, \rightarrow, \rightarrow^{+}, \sim, \nsim\} \cup {\Gamma} \cup \mathbb{D} \cup \check{\mathbb{D}} $$

with \(\check {\mathbb {D}} = \big \{\check d \bigm | d\in {\mathbb {D}}\big \}\); that is, we have

  • binary relations: child \(\downarrow \), descendant \(\downarrow ^+\), next sibling \(\rightarrow \), and following sibling \(\rightarrow ^+\);

  • data equality relation ∼ and data inequality relation \(\nsim \) that contain pairs of nodes storing, respectively, the same data value and different data values;

  • unary relation a for each label a ∈ Γ;

  • unary relations d and \(\check d\) for each data value \(d\in {\mathbb {D}}\) that contain nodes storing, respectively, data value d and any data value different from d.

Signature \(\mathsf {sig}_{dt}\) is infinite (because of \({\mathbb {D}}\) and \(\check {\mathbb {D}}\)), but queries use only finite fragments. We include \(\check {\mathbb {D}}\) in the signature to keep negation out of the syntax.

A conjunctive query \(\alpha (x_{1}, \dots , x_{n})\) over a signature sig is a first order formula of the form

$$\exists y_{1} {\dots} \exists y_{m} \, \beta(x_{1}, \dots, x_{n}, \, y_{1}, \dots, y_{m})\,, $$

where \(\beta (x_{1}, \dots , x_{n}, y_{1}, \dots , y_{m})\) is a conjunction of atoms over signature sig and variables \(x_{1}, \dots , x_{n}, y_{1}, \dots , y_{m}\).

2.2 Definition

In their most general form, non-mixing integrity constraints σ are formulas of the form

$$\alpha(\bar{x}) \Rightarrow \eta_{\sim}(\bar{x}) \land \eta_{\nsim}(\bar{x}) $$

where

  • \(\alpha (\bar {x})\) is a conjunctive query over the signature \(\mathsf {sig}_{nav} = \{\downarrow , \downarrow ^{+}, \rightarrow , \rightarrow ^{+}\} \cup {\Gamma }\);

  • \(\eta _{\sim }(\bar {x})\) is a finite positive Boolean combination of atoms over the signature \(\mathsf {sig}_{\sim } = \{\sim \}\cup \mathbb {D}\) and variables \(\bar {x}\);

  • \(\eta _{\nsim }(\bar {x})\) is a finite positive Boolean combination of atoms over the signature \(\mathsf {sig}_{\nsim }=\{\nsim \}\cup \check {\mathbb {D}}\) and variables \(\bar {x}\).

Query α is called the selector of σ, and \(\eta _{\sim }, \eta _{\nsim }\) are its assertions. Non-mixing constraints have the usual semantics of first order logic formulas: a data tree t satisfies constraint σ, denoted \(t \models \sigma \), if each tuple \(\bar v\) of nodes of t selected by α satisfies both \(\eta _{\sim }\) and \(\eta _{\nsim }\); that is,

$$t \models \alpha(\bar v) \quad \text{implies} \quad t \models \eta_{\sim}(\bar v) \land \eta_{\nsim}(\bar v)\,. $$

For a set Σ of non-mixing constraints, we write \(t \models {\Sigma }\) if \(t \models \sigma \) for all σ ∈ Σ.

Note that \(\alpha \Rightarrow \eta _{\sim }\land \eta _{\nsim }\) is equivalent to \(\{\alpha \Rightarrow \eta _{\sim }\,, \; \alpha \Rightarrow \eta _{\nsim }\}\). Consequently, each set Σ of non-mixing constraints is equivalent to \({\Sigma }_{\sim } \cup {\Sigma }_{\nsim }\), where \({\Sigma }_{\sim }\) is a set of constraints of the form \(\alpha \Rightarrow \eta _{\sim }\), \({\Sigma }_{\nsim }\) is a set of constraints of the form \(\alpha \Rightarrow \eta _{\nsim }\), and the sizes of \({\Sigma }_{\sim }\) and \({\Sigma }_{\nsim }\) are bounded by the size of Σ. Thus, without loss of generality, we restrict our attention to sets of constraints of the form \({\Sigma }_{\sim } \cup {\Sigma }_{\nsim }\), which do not mix \(\mathsf {sig}_{\sim }\) and \(\mathsf {sig}_{\nsim }\) (hence “non-mixing”). One can also assume that α is quantifier free: \(\exists \bar y \, \alpha (\bar {x}, \bar y) \Rightarrow \eta (\bar {x})\) is equivalent to \(\alpha (\bar {x}, \bar y) \Rightarrow \eta (\bar {x})\).

2.3 Scope

Using non-mixing constraints one can express a variety of useful constraints. Let us consider a database storing information about banks, each in a separate sub-document. We want each bank to be identified by its BIC number. This key constraint can be expressed as

$$q_{\textsc{bic}} (x, x^{\prime}) \land q_{\textsc{bic}}(y,y^{\prime}) \land x\neq y \;\Rightarrow \; x^{\prime}\nsim y^{\prime} $$

where \(q_{\textsc {bic}}\) selects the root of the sub-document for a bank, and the node storing the BIC number. Depending on the schema, query \(q_{\textsc {bic}}\) could be for instance \(q_{\textsc {bic}}(x, x^{\prime }) = \texttt {bank}(x) \land x \downarrow x^{\prime } \land \texttt {BIC}(x^{\prime })\). Node inequality ≠ is not part of the signature, but can be expressed using \(\mathsf {sig}_{nav}\). Assuming that the roots of the sub-documents for banks are siblings, \(x \neq y\) can be replaced by \(x \rightarrow ^+ y\). In general, we also need to consider four other possible ways in which two different nodes x and y can be positioned in a tree (up to swapping x and y):

$$x \downarrow^+ y\,, \; x \!\rightarrow^+ \!z \land z \!\downarrow^+ \!y\,, \; z\! \downarrow^+ \!x \land z \!\rightarrow^+ \!y\,, \; \textrm{ and } \; z\! \downarrow^+\! x \land z\! \rightarrow^+\! z^{\prime} \land z^{\prime} \!\downarrow^+\! y\,, $$

which means that we need five non-mixing constraints to express a single key constraint.
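The claim that these five patterns capture node inequality can be sanity-checked on a concrete tree. The sketch below is illustrative only: nodes are encoded as paths from the root (tuples of child indices), an encoding chosen here for convenience, and all names are hypothetical.

```python
def paths(shape, prefix=()):
    """Enumerate the nodes of a tree given as nested lists, each node as its path from the root."""
    yield prefix
    for i, child in enumerate(shape):
        yield from paths(child, prefix + (i,))

def desc(x, y):
    """x ↓+ y: x is a proper ancestor of y."""
    return len(x) < len(y) and y[:len(x)] == x

def foll(x, y):
    """x →+ y: y is a following sibling of x."""
    return len(x) == len(y) > 0 and x[:-1] == y[:-1] and x[-1] < y[-1]

def differ(x, y, dom):
    """True iff the pair (x, y) matches one of the five patterns from the text."""
    return (foll(x, y)
            or desc(x, y)
            or any(foll(x, z) and desc(z, y) for z in dom)
            or any(desc(z, x) and foll(z, y) for z in dom)
            or any(desc(z, x) and foll(z, z2) and desc(z2, y)
                   for z in dom for z2 in dom))

def patterns_express_inequality(shape):
    """Check that, up to swapping x and y, the patterns hold exactly for distinct nodes."""
    dom = list(paths(shape))
    return all((differ(x, y, dom) or differ(y, x, dom)) == (x != y)
               for x in dom for y in dom)
```

For instance, `patterns_express_inequality([[[], []], [[]], []])` examines all pairs of nodes of a seven-node tree.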

Another natural constraint is that account numbers should be different for every account within the same bank, but different banks may use the same account numbers. Such a relative key constraint can also be expressed as

$$\texttt{bank}(z) \land z\!\downarrow^+\!x \land z\!\downarrow^+\!y \land q_{\textsc{acc}}(x,x^{\prime}) \land q_{\textsc{acc}}(y,y^{\prime}) \land x\!\neq\!y \; \Rightarrow\; x^{\prime}\nsim y^{\prime}\,, $$

where \(q_{\textsc {acc}}(x, x^{\prime })\) selects account x and its number \(x^{\prime }\), similarly to \(q_{\textsc {bic}}\).

We can also express multi-attribute keys (i.e. keys using composite fields). For example

$$\begin{array}{@{}rcl@{}} q_{\textsc{bic}}(u, u^{\prime}) \land q_{\textsc{bic}}(v, v^{\prime}) \land u\!\downarrow^+\!x \land v\!\downarrow^+\!y \land q_{\textsc{acc}}(x,x^{\prime}) \!\!\!&\land&\!\! q_{\textsc{acc}}(y,y^{\prime}) \land x\!\neq\!y \; \Rightarrow\; \\ &\Rightarrow&\; u^{\prime}\!\nsim \!v^{\prime} \lor x^{\prime}\!\nsim \!y^{\prime} \end{array} $$

asserts that BIC and account number form an absolute key, not relative to bank sub-document.

If, as a result of redundancy, BIC appears in several places within a bank sub-document, using the singleton constraint

$$\texttt{bank}(x) \land x \downarrow^+\! x^{\prime} \land \texttt{BIC}(x^{\prime}) \land x \downarrow^+\! x^{\prime\prime} \land \texttt{BIC}(x^{\prime\prime}) \;\Rightarrow \; x^{\prime} \sim x^{\prime\prime} $$

we can guarantee that each time it gives the same value (for the same bank).

Assume now that each bank has a director and several branches, each of them having a team of employees among which one is the manager of the branch. The information about each employee is stored in a sub-document of its branch’s sub-document. Each employee reports either to the manager of the branch or directly to the director of the bank. Using a conjunctive query \(q_{\textsc {super}}(x,y,z)\), we can select the director’s ID node x, the branch manager’s ID node y and the node z storing the supervisor’s ID for an employee of the same branch. The constraint on employee’s supervisor can be encoded as

$$q_{\textsc{super}}(x,y,z) \; \Rightarrow \; x\sim z \lor y\sim z\,.$$

Following this idea we can express inclusion constraints of a restricted form, where the intended superset is a tuple of values that can be selected by a conjunctive query. This includes enumerative domain restrictions, like the constraint

$$\begin{array}{@{}rcl@{}} \texttt{creditCard}(x) \land x &\downarrow^{+}&\!x^{\prime} \land \texttt{brand}(x^{\prime}) \Rightarrow \\ &\Rightarrow& \textit{Visa}(x^{\prime}) \lor \textit{MasterCard}(x^{\prime}) \lor \textit{AmericanExpress}(x^{\prime})\,, \end{array} $$

ensuring that banks issue only Visa, Master Card, and American Express cards. Unrestricted inclusion constraints are beyond the scope of our formalism. Indeed, non-mixing constraints cannot be violated by removing nodes, which is not the case even for the simplest unary inclusion constraints, like each value stored in an a node is also stored in a b node.

Our formalism is also capable of expressing cardinality constraints. Assume, for instance, that banks support charity projects by delegating their employees to help. The projects are organized by category (culture, education, environment, etc.) and each project sub-document carries the list of involved employees. For the sake of balance, we want each category to involve at most ten different employees in total. This can be imposed by selecting eleven employee nodes below a single category node and requiring at least two of them to carry the same data value by means of a long disjunction of data equalities. We can also ensure that no employee is involved in more than three different projects: the conjunctive query selects four different project nodes and an employee for each of them; the assertion requires at least two of the four employees to have different IDs.

Let us remark that while these constraints look clumsy expressed as non-mixing constraints, one can easily imagine a syntactic-sugar layer on top of our formalism. The point is that all these constraints can be rewritten as non-mixing constraints of linear size (except for the cardinality constraints, where the size would grow by a factor proportional to the numerical bounds).

In Section 4 we examine the expressive power of non-mixing constraints further by comparing them to other existing formalisms.

3 Consistency Problem

Our main result is decidability of the consistency problem for non-mixing constraints:

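Problem: Consistency of non-mixing constraints
Input: a set Σ of non-mixing constraints and a tree automaton \(\mathcal {A}\)
Question: is there a data tree t accepted by \(\mathcal {A}\) such that \(t \models {\Sigma }\)?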

More precisely, we show the following theorem, establishing tight complexity bounds.

Theorem 1

Consistency of non-mixing constraints is 2ExpTime-complete.

The remainder of this section is devoted to the proof of Theorem 1. The proof is based on a simple idea with a geometric flavour, but does not require any specialist knowledge from geometry or linear algebra. Consider a family of finite unions of affine subspaces of a Euclidean space. The intersection of this family can also be represented as a finite union of affine subspaces, and we show that their number can be bounded independently of the cardinality of the family. From this bound we infer a “bounded data cut” model property for non-mixing constraints, where by data cut of a data tree t, denoted by datacut(t), we mean the maximum over nodes \(v \in \mathsf {dom}_{t}\) of the number of data values shared by \(t_{v}\) and \(t - t_{v}\). With bounded data cut, we can reduce the consistency problem to the emptiness problem for tree automata (over a finite alphabet). In the final subsection we prove the lower bound.
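The definition of data cut can be read off directly as a (deliberately naive, quadratic) computation. The sketch below assumes a hypothetical `Node` class for data trees and is meant only to illustrate the definition, not the way data cut is handled in the proof.

```python
class Node:
    """A data-tree node: a label, a stored data value, and ordered children."""
    def __init__(self, label, value, children=()):
        self.label = label
        self.value = value
        self.children = list(children)

def subtree(node):
    """All nodes of the subtree rooted at `node`."""
    result = [node]
    for child in node.children:
        result.extend(subtree(child))
    return result

def datacut(root):
    """Maximum over nodes v of the number of data values shared by t_v and t - t_v."""
    all_nodes = subtree(root)
    best = 0

    def visit(siblings):
        nonlocal best
        inside = set()  # identities of the nodes of t_v, grown sibling by sibling
        for v in siblings:
            inside.update(id(n) for n in subtree(v))
            inside_values = {n.value for n in all_nodes if id(n) in inside}
            outside_values = {n.value for n in all_nodes if id(n) not in inside}
            best = max(best, len(inside_values & outside_values))
            visit(v.children)

    visit([root])
    return best
```

For the root v the forest \(t_{v}\) is the whole tree, so the root never contributes to the maximum.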

3.1 Intersecting Unions of Subspaces

By a subspace of \({\mathbb {D}}^{\ell }\) we mean a subset of \({\mathbb {D}}^{\ell }\) defined by equating pairs of coordinates and fixing coordinates; that is, it is a set of points \((x_{1}, x_{2}, \dots , x_{\ell })\) in space \({\mathbb {D}}^{\ell }\) defined by a conjunction of equalities of the form \(x_{i} = x_{j}\) or \(x_{i} = d\) where \(d\in {\mathbb {D}}\). By simple rewriting of equalities, each nonempty subspace of \({\mathbb {D}}^{\ell }\) can be defined with a canonical set of at most ℓ equalities such that

  • for each coordinate i we have either \(x_{i} = x_{j}\) with i < j, or \(x_{i} = d\) with \(d\in {\mathbb {D}}\), or nothing;

  • each coordinate j occurs at most once on the right side of an equality; and

  • no data value d is used in more than one equality.

A subspace of \({\mathbb {D}}^{\ell }\) has dimension ℓ − k if its canonical definition consists of k equalities. In other words, each equality that does not follow from the others decreases the dimension by one. To enhance intuitions, let us remark that if we equip \({\mathbb {D}}^{\ell }\) with the structure of linear space by assuming that \({\mathbb {D}}\) is a field, this notion of dimension coincides with the classical notion of dimension for affine subspaces (of which the subspaces above are a special case).

An intersection \(X \cap Y\) of subspaces X, Y is also a subspace, defined by the conjunction of conditions defining X and Y. If \(X \not \subseteq Y\) and \(X \cap Y\) is nonempty, then the canonical definition of \(X \cap Y\) contains at least one more equation than that of X; consequently, the dimension of \(X \cap Y\) is strictly smaller than the dimension of X. Similarly, intersecting unions of subspaces, we obtain a union of subspaces; the following lemma gives a bound on the size of such a union.

Lemma 1

Let \(Z_{1}, Z_{2}, \dots , Z_{m}\subseteq {\mathbb {D}}^{\ell }\) be such that each \(Z_{i}\) is a union of at most n subspaces of \({\mathbb {D}}^{\ell }\). Then, \(Z_{1} \cap Z_{2} \cap \dots \cap Z_{m}\) can be represented as a union of at most \(n^{\ell }\) subspaces of \({\mathbb {D}}^{\ell }\).

Proof

Assume that \(Z_{1} \cap Z_{2} \cap \dots \cap Z_{i-1}\) is a union \(X_{1} \cup X_{2} \cup \dots \cup X_{p}\) of subspaces of \({\mathbb {D}}^{\ell }\). We can write \(Z_{i}\) as \(Y_{1} \cup Y_{2} \cup \dots \cup Y_{n}\), where some of the subspaces \(Y_{k}\) may be empty. We have

$$\begin{array}{@{}rcl@{}} Z_{1} \cap Z_{2} \cap {\dots} \cap Z_{i} &= & \, (X_{1} \cup X_{2} \cup {\dots} \cup X_{p} ) \cap Z_{i} = \\ &=&\, (X_{1} \cap Z_{i}) \cup (X_{2}\cap Z_{i}) \cup {\dots} \cup (X_{p} \cap Z_{i})\,. \end{array} $$

Let us examine a single \(X_{j} \cap Z_{i}\). If \(X_{j} \subseteq Y_{k}\) for some k, then \(X_{j} \cap Z_{i} = X_{j}\). Otherwise, \(X_{j} \cap Z_{i}\) is a union of n subspaces, \(X_{j} \cap Y_{1}, X_{j} \cap Y_{2}, \dots , X_{j} \cap Y_{n}\), where each \(X_{j} \cap Y_{k}\) is either empty or has dimension strictly smaller than \(X_{j}\). Thus, when \(X_{1} \cup X_{2} \cup \dots \cup X_{p}\) is intersected with \(Z_{i}\), each \(X_{j}\) either does not change, or is split into at most n subspaces of strictly smaller dimension; if \(X_{j}\) is a point, in the second possibility it disappears.

Now, consider the following process: begin with \({\mathbb {D}}^{\ell }\), a single subspace of dimension ℓ, and then intersect with \(Z_{i}\) for i from 1 to m, one by one. Since the dimension strictly decreases with each split, each non-empty subspace in the resulting union is obtained in the course of at most ℓ splits. Since each split generates at most n subspaces, we cannot obtain more than \(n^{\ell }\) subspaces in this process. □

We remark that the bound in Lemma 1 is tight, as shown by the following example.

Example 1

Assume \(0,1\in {\mathbb {D}}\) and let \(Z_{i} =\left \{ \bar {x}\in {\mathbb {D}}^{\ell } \bigm | x_{i} =0 \lor x_{i} =1 \right \}\) for i = 1, 2, …, ℓ. Then \(Z_{1} \cap Z_{2} \cap \dots \cap Z_{\ell } = \{0,1\}^{\ell }\) is a union of \(2^{\ell }\) (disjoint) subspaces of \({\mathbb {D}}^{\ell }\) of dimension 0.
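Lemma 1 and Example 1 can be replayed mechanically on small instances. The following sketch is an illustration under assumptions of our own: coordinates are indexed from 0, a subspace is stored as a tuple in which a fixed coordinate appears as `("d", c)` and a free coordinate as the largest index it is equated with, and subspaces contained in other subspaces of the union are discarded. None of these encoding choices comes from the proof itself.

```python
def canon(ell, eqs):
    """Canonical form of the subspace of D^ell given by `eqs`, or None if it is empty.

    `eqs` holds pairs (i, j) meaning x_i = x_j and pairs (i, ("d", c)) meaning x_i = c.
    """
    parent = list(range(ell))

    def find(i):
        while parent[i] != i:
            i = parent[i]
        return i

    fixed = []
    for i, rhs in eqs:
        if isinstance(rhs, tuple):      # equality x_i = c with a data value c
            fixed.append((i, rhs[1]))
        else:                           # equality x_i = x_j; the larger root wins
            a, b = find(i), find(rhs)
            parent[min(a, b)] = max(a, b)
    const = {}
    for i, c in fixed:
        if const.setdefault(find(i), c) != c:
            return None                 # one class forced to two different values
    return tuple(("d", const[find(i)]) if find(i) in const else find(i)
                 for i in range(ell))

def intersect(ell, X, Y):
    """Intersection of two subspaces given in canonical form."""
    eqs = [(i, e) for form in (X, Y) for i, e in enumerate(form) if e != i]
    return canon(ell, eqs)

def dimension(X):
    """Number of classes of free (not fixed) coordinates."""
    return len({e for e in X if not isinstance(e, tuple)})

def intersect_unions(ell, family):
    """Intersect unions of subspaces one by one, keeping only maximal subspaces."""
    current = {canon(ell, [])}          # start with the whole space D^ell
    for Z in family:
        pieces = {intersect(ell, X, Y) for X in current for Y in Z} - {None}
        current = {X for X in pieces
                   if not any(X != Y and intersect(ell, X, Y) == X for Y in pieces)}
    return current

def example_family(ell):
    """The sets Z_i of Example 1: x_i = 0 or x_i = 1."""
    return [[canon(ell, [(i, ("d", 0))]), canon(ell, [(i, ("d", 1))])]
            for i in range(ell)]
```

With ℓ = 3, `intersect_unions(3, example_family(3))` yields eight subspaces of dimension 0, and repeating the family any number of times does not change the result, in line with the bound being independent of m.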

3.2 Bounding the Data Cut

Based on the geometric fact we have proved in the previous subsection, in Lemma 3 we bound the data cut of data trees witnessing consistency of non-mixing constraints. The proof relies on a simple compositionality property for conjunctive queries over trees, shown in Lemma 2.

Lemma 2

Let \(\alpha (\bar {x}, \bar y)\) be a conjunction of atoms over \(\mathsf {sig}_{nav}\), where \(\bar {x}\) and \(\bar y\) are disjoint, and let w be a node of a data tree t. For all tuples \(\bar u, \bar u^{\prime }\) of nodes from \(t_{w}\) and tuples \(\bar v, \bar v^{\prime }\) of nodes from \(t - t_{w}\), if

$$t\models \alpha(\bar u, \bar v)\quad \textrm{ and } \quad t\models \alpha(\bar u^{\prime}, \bar v^{\prime})\,, $$

then

$$t\models \alpha(\bar u, \bar v^{\prime}) \quad \textrm{ and } \quad t\models \alpha(\bar u^{\prime},\bar v)\,. $$

Proof

Let \(\bar u, \bar u^{\prime }, \bar v, \bar v^{\prime }\) be as in the statement of the lemma. Since \(\alpha (\bar {x}, \bar y)\) is a conjunction of atoms, we only need to check that each atom of \(\alpha (\bar u, \bar v^{\prime })\) and \(\alpha (\bar u^{\prime }, \bar v)\) is satisfied in t. Given that \(t\models \alpha (\bar u, \bar v)\) and \(t\models \alpha (\bar u^{\prime }, \bar v^{\prime })\), it is enough to examine atoms using variables from both \(\bar {x}\) and \(\bar y\). That excludes unary relations and leaves us with atoms of the forms \(x_{i} \downarrow y_{j}\), \(x_{i} \downarrow ^+ y_{j}\), \(x_{i} \rightarrow y_{j}\), \(x_{i} \rightarrow ^+ y_{j}\), and the symmetrical ones. Given that variables \(\bar {x}\) are matched within \(t_{w}\), and variables \(\bar y\) are matched within \(t - t_{w}\), atoms \(x_{i} \downarrow y_{j}\), \(x_{i} \downarrow ^+ y_{j}\), \(y_{j} \rightarrow x_{i}\), and \(y_{j} \rightarrow ^+ x_{i}\) are excluded by the combination of two things: the way \(t_{w}\) and \(t - t_{w}\) are positioned within tree t (see Fig. 1), and the fact that \(t\models \alpha (\bar u, \bar v)\). That is, it remains to consider \(y_{j} \downarrow x_{i}\), \(y_{j} \downarrow ^+ x_{i}\), \(x_{i} \rightarrow y_{j}\), and \(x_{i} \rightarrow ^+ y_{j}\). Suppose \(y_{j} \downarrow x_{i}\) occurs in α. We know that \(v_{j} \downarrow u_{i}\) and \(v^{\prime }_{j} \downarrow u^{\prime }_{i}\). Since nodes \(u_{i},u^{\prime }_{i}\) are from \(t_{w}\) and nodes \(v_{j},v^{\prime }_{j}\) are from \(t - t_{w}\), it follows immediately that \(v_{j}\) and \(v^{\prime }_{j}\) are equal to the parent of node w, and \(u_{i}, u^{\prime }_{i}\) are siblings of w or w itself. Consequently, \(v_{j} \downarrow u^{\prime }_{i}\) and \(v^{\prime }_{j} \downarrow u_{i}\). For the remaining three kinds of atoms the reasoning is similar. If \(v_{j} \downarrow ^+ u_{i}\) and \(v^{\prime }_{j} \downarrow ^+ u^{\prime }_{i}\), then \(v_{j}\), \(v^{\prime }_{j}\) are ancestors of w and \(u_{i},u_{i}^{\prime }\) are nodes in \(t_{w}\), so \(v_{j} \downarrow ^+ u^{\prime }_{i}\) and \(v^{\prime }_{j} \downarrow ^+ u_{i}\) follow. If \(u_{i} \rightarrow v_{j}\) and \(u^{\prime }_{i} \rightarrow v^{\prime }_{j}\), then \(u_{i}=u^{\prime }_{i}=w\) and \(v_{j}=v^{\prime }_{j}\) is w’s next sibling. Finally, if \(u_{i} \rightarrow ^+ v_{j}\) and \(u^{\prime }_{i} \rightarrow ^+ v^{\prime }_{j}\), then \(u_{i}, u^{\prime }_{i}\) are preceding siblings of w (or w itself) and \(v_{j}, v^{\prime }_{j}\) are following siblings of w, so \(u_{i} \rightarrow ^+ v^{\prime }_{j}\) and \(u^{\prime }_{i} \rightarrow ^+ v_{j}\). □

Lemma 3

If \({\Sigma }_{\sim }\cup {\Sigma }_{\nsim }\) is satisfied in a data tree t, it is also satisfied in some data tree \(t^{\prime }\) obtained from t by changing data values, such that

$$\textit{datacut}(t^{\prime}) \leq \ell \cdot 2^{\ell}\cdot (\ell +m)^{\ell^{2}} \cdot |{\Sigma}_{\sim}|\,, $$

where ℓ and m are the maximal numbers of, respectively, variables and predicates from \({\mathbb {D}}\cup \check {\mathbb {D}}\) in the constraints from Σ.

Proof

Assume that \(t\models {\Sigma }_{\sim } \cup {\Sigma }_{\nsim }\). We show that for each node w of the data tree t, one can replace all but \(\ell \cdot 2^{\ell }\cdot (\ell +m)^{\ell ^{2}} \cdot |{\Sigma }_{\sim }|\) data values used in \(t_{w}\) with distinct fresh data values without violating \({\Sigma }_{\sim } \cup {\Sigma }_{\nsim }\). After this operation is performed for a node w, the number of data values used both in \(t_{w}\) and \(t - t_{w}\) is bounded by \(\ell \cdot 2^{\ell }\cdot (\ell +m)^{\ell ^{2}} \cdot |{\Sigma }_{\sim }|\). Moreover, as the fresh data values are to be distinct, the new ∼ relation over nodes of t is a subset of the old one. In consequence, the operation does not increase the number of data values shared by \(t_{w^{\prime }}\) and \(t-t_{w^{\prime }}\) for other nodes \(w^{\prime }\). Consequently, applying this operation for each node of t in an arbitrary order, we obtain a model of bounded data cut.

Let us fix a node w of t. As long as the fresh values are pairwise different, the obtained tree will still satisfy \({\Sigma }_{\nsim }\). Hence, we only need to ensure that \({\Sigma }_{\sim }\) is not violated. Consider a constraint \(\alpha \Rightarrow \eta _{\sim }\) in \({\Sigma }_{\sim }\). Recall that we assume that α is quantifier free. Let \(\bar {x}\), \(\bar y\) be a partition of variables used in α (one of the tuples \(\bar {x}\), \(\bar y\) may be empty). We shall indicate the partition of variables by writing the constraint as \(\alpha (\bar {x}, \bar y) \Rightarrow \eta _{\sim }(\bar {x}, \bar y)\). The intended meaning is that the variables \(\bar {x}\) refer to nodes in \(t_{w}\), and the variables \(\bar y\) refer to nodes outside of \(t_{w}\). Directly from the definition it follows that \(t \models \alpha \Rightarrow \eta _{\sim }\) if and only if for each partition \(\bar {x}, \bar y\) of variables in α, for each tuple \(\bar u\) of nodes from \(t_{w}\) and each tuple \(\bar v\) of nodes from \(t - t_{w}\), if \(t\models \alpha (\bar u, \bar v)\), then \(t\models \eta _{\sim }(\bar u, \bar v)\).

Fix a partition \(\bar {x}\), \(\bar y\). By Lemma 2, the condition above is equivalent to: for all tuples \(\bar u, \bar u^{\prime }\) of nodes from \(t_{w}\) and all tuples \(\bar v, \bar v^{\prime }\) of nodes from \(t-t_{w}\), if \(t\models \alpha (\bar u, \bar v)\) and \(t\models \alpha (\bar u^{\prime }, \bar v^{\prime })\), then \(t\models \eta _{\sim }(\bar u, \bar v^{\prime })\). Let us turn this into a condition on stored data values. Define \(\eta (\bar {x}, \bar y)\) as the formula obtained from \(\eta _{\sim }(\bar {x}, \bar y)\) by replacing ∼ with =, and d(z) with z = d for all variables z and all \(d\in {\mathbb {D}}\). Reformulating the condition above we obtain: for each tuple \(\bar u\) of nodes from \(t_{w}\) such that \(t\models \alpha (\bar u, \bar v)\) for some tuple \(\bar v\) of nodes from \(t-t_{w}\), the tuple \(\mathsf {val}_{t}(\bar u)\) of data values belongs to the set

$$Z_{\alpha(\bar{x}, \bar y)\Rightarrow \eta_{\sim}(\bar{x}, \bar y)} = \bigcap\limits_{\bar v^{\prime}} \big \{ \bar c \in {\mathbb{D}}^{|\bar{x}|}\bigm | \eta(\bar c, \mathsf{val}_{t}(\bar v^{\prime}))\big\}\,, $$

where \(\bar v^{\prime }\) ranges over tuples of nodes from \(t-t_{w}\) satisfying \(t\models \alpha (\bar u^{\prime }, \bar v^{\prime })\) for some tuple \(\bar u^{\prime }\) of nodes from \(t_{w}\).

Writing \(\eta (\bar {x},\mathsf {val}_{t}(\bar v^{\prime }))\) in disjunctive normal form, we see that the set \(\left \{ \bar {c} \in {\mathbb {D}}^{|\bar {x}|}\bigm | \eta (\bar c, \mathsf {val}_{t}(\bar v^{\prime }))\right \}\) is a union of subspaces of \({\mathbb {D}}^{|\bar {x}|}\). How many subspaces? The canonical definition of each nonempty subspace has for each coordinate i either an equality \(x_{i} = x_{j}\) for some j > i, or an equality \(x_{i} = d\) for some \(d\in {\mathbb {D}}\), or nothing. In our case, d is a data value used explicitly in η or occurring in the data tuple \(\mathsf {val}_{t}(\bar v^{\prime })\). Consequently, the number of these subspaces can be bounded by \((N + |\bar {x}| + |\bar y|)^{|\bar {x}|}\), where N is the number of data values used explicitly in η. That is, \(Z_{\alpha (\bar {x}, \bar y)\Rightarrow \eta _{\sim }(\bar {x}, \bar y)}\) is an intersection of unions of at most \((N + |\bar {x}| + |\bar y|)^{|\bar {x}|}\) subspaces of \({\mathbb {D}}^{|\bar {x}|}\). By Lemma 1, it can be represented as a union of at most \((N + |\bar {x}| + |\bar y|)^{|\bar {x}|^{2}}\) subspaces. In the canonical definition of each of these subspaces, there are at most \(|\bar {x}|\) equalities of the form \(x_{i} = d\) for \(d\in {\mathbb {D}}\). That is, we can define \(Z_{\alpha (\bar {x}, \bar y) \Rightarrow \eta _{\sim }(\bar {x}, \bar y)}\) using explicitly at most \(|\bar {x}|\cdot (N + |\bar {x}| + |\bar y|)^{|\bar {x}|^{2}}\) data values. From this we shall derive a bound on the number of important data values in \(t_{w}\) that ensure satisfaction of \({\Sigma }_{\sim }\), and conclude that we can safely replace the others with fresh ones.

Let \(\mathsf {val}^{\prime } : \mathsf {dom}_{t} \to {\mathbb {D}}\) be a new data labelling of t. As we are only going to change data values in \(t_{w}\), keeping a constraint \(\alpha \Rightarrow \eta _{\sim }\) satisfied requires only that for each partition \(\bar {x}, \bar y \) of its variables, for each tuple \(\bar u\) of nodes from \(t_{w}\), if \(t\models \alpha (\bar u, \bar v)\) for some tuple \(\bar v\) of nodes from \(t-t_{w}\), then \(\mathsf {val}^{\prime }(\bar u) \in Z_{\alpha (\bar {x}, \bar y)\Rightarrow \eta _{\sim }(\bar {x}, \bar y)}\). Moreover, replacing all occurrences of a data value d in \(t_{w}\) with a given data value \(d^{\prime }\) does not affect equalities of the form \(x_{i} = x_{j}\) in the canonical definition of the set \(Z_{\alpha (\bar {x}, \bar y)\Rightarrow \eta _{\sim }(\bar {x}, \bar y)}\). We only need to ensure that equalities of the form \(x_{i} = d\) are not violated. Let \(D\subseteq {\mathbb {D}}\) be the set of data values occurring in these equalities for all sets \(Z_{\alpha (\bar {x}, \bar y) \Rightarrow \eta _{\sim }(\bar {x}, \bar y)}\), with \(\alpha (\bar {x}, \bar y) \Rightarrow \eta _{\sim }(\bar {x}, \bar y)\) ranging over constraints from \({\Sigma }_{\sim }\) with all possible partitions of variables. A labelling \(\mathsf {val}^{\prime }\) that replaces each data value from \({\mathbb {D}} - D\) used in \(t_{w}\) with a fresh data value does not violate \({\Sigma }_{\sim }\). For each constraint \(\alpha \Rightarrow \eta _{\sim }\) there are at most \(2^{\ell }\) partitions \(\bar {x}, \bar y\) of variables; each partition corresponds to a set \(Z_{\alpha (\bar {x}, \bar y) \Rightarrow \eta _{\sim }(\bar {x}, \bar y)}\), which contributes at most \(\big (\ell \cdot (m + \ell )^{\ell ^{2}}\big )\) data values, where ℓ and m are the maximal numbers of variables and predicates from \({\mathbb {D}}\cup \check {\mathbb {D}}\) in constraints from \({\Sigma }_{\sim }\). Hence, we have \(|D|\leq |{\Sigma }_{\sim }|\cdot 2^{\ell } \cdot \big (\ell \cdot (m + \ell )^{\ell ^{2}}\big ) \). □
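To get a feeling for how quickly the bound grows, it can be evaluated for small parameters. The following is only an illustrative sketch; the function name and its parameter names are our own and do not come from the paper.

```python
def data_cut_bound(num_vars: int, num_preds: int, num_constraints: int) -> int:
    """Evaluate l * 2^l * (l + m)^(l^2) * |Sigma_~| for given l, m, |Sigma_~|.

    num_vars        -- l, maximal number of variables in a constraint
    num_preds       -- m, maximal number of data predicates in a constraint
    num_constraints -- |Sigma_~|, number of constraints using ~
    """
    l, m = num_vars, num_preds
    return l * 2**l * (l + m) ** (l * l) * num_constraints


if __name__ == "__main__":
    # The bound is singly exponential in the size of the constraints,
    # but doubly sensitive to the number of variables (exponent l^2).
    for l in range(1, 5):
        print(l, data_cut_bound(l, 1, 3))
```

The exponent \(\ell ^{2}\) dominates: already for a handful of variables the bound is far larger than any realistic document, which matches its role as a purely theoretical witness bound.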

3.3 From Bounded Data Cut to Automata

In order to use automata, we need to encode data trees of bounded data cut as trees over a finite alphabet. Let \(C, D\subseteq {\mathbb {D}}\) be two disjoint finite sets of data values; we shall call elements of C colours and elements of D distinguished data values. As encodings we shall use trees over the alphabet

$${\Gamma} \times (C \cup D )\times {\mathbb{P}(C)}\,; $$

we shall refer to the values in the third component of the label as refresh sets. The distinguished data values (corresponding to data values used explicitly in the constraints) are represented explicitly in the encoding. The remaining data values are represented implicitly by colours and refresh sets: two nodes u, v store the same data value if and only if they have the same colour c and there is no node w such that \(u \in t_{w}\) and \(v \in t - t_{w}\) (or symmetrically), and the refresh set for w contains c. We define the semantics of the encoding slightly more generally: to every forest f over the alphabet \({\Gamma } \times (C\cup D) \times {\mathbb {P}(C)}\) we associate a data forest \(\widehat {f}\). It has the same domain as f, the structure and labelling with elements of Γ is inherited from f, and the assignment of data values is defined inductively based on the remaining two components of the labelling of f. If the forest f is empty, so is \(\widehat {f}\). Otherwise, let us decompose f into a forest \(f^{\prime }\) followed by a tree further decomposed into the root and a forest \(f^{\prime \prime }\) (see Fig. 2); both \(f^{\prime }\) and \(f^{\prime \prime }\) may be empty. Assume the root has label (a,d,R).

Fig. 2 Forest f decomposed into a forest \(f^{\prime }\) followed by a tree further decomposed into the root and a forest \(f^{\prime \prime }\)

The forest \(\widehat {f}\) is obtained by plugging \(\widehat {f^{\prime \prime }}\) under a root with label a and data value d, appending the resulting tree to \(\widehat {f^{\prime }}\), and then replacing each colour from the refresh set R used in the resulting forest with a globally fresh data value from \({\mathbb {D}} - (C\cup D)\). Note that \(\widehat {f}\) is unique up to permutations of \({\mathbb {D}} - (C \cup D)\). By construction, \(\textit {datacut}(\widehat {f}\,) \leq |C\cup D|\).
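The inductive definition of \(\widehat {f}\) can be sketched in a few lines of Python. This is a minimal illustration under our own assumptions: the class names `Enc` and `Data`, the representation of colours and distinguished values as strings, and the representation of fresh values as pairs `("fresh", k)` are all hypothetical choices, not part of the paper.

```python
from dataclasses import dataclass, field
from itertools import count


@dataclass
class Enc:
    """Node of an encoding: label (a, d, R) and a list of children."""
    a: str
    d: str
    R: frozenset = frozenset()
    children: list = field(default_factory=list)


@dataclass
class Data:
    """Node of the decoded data forest: label a and data value val."""
    a: str
    val: object
    children: list = field(default_factory=list)


def _nodes(forest):
    """Iterate over all nodes of a forest."""
    for tree in forest:
        yield tree
        yield from _nodes(tree.children)


def decode(forest, fresh=None):
    """Compute the data forest associated with an encoded forest."""
    fresh = fresh if fresh is not None else count()
    out = []                                   # decoded prefix f'
    for node in forest:                        # trees from left to right
        sub = decode(node.children, fresh)     # decoded f''
        out.append(Data(node.a, node.d, sub))  # plug under the root, append
        for c in node.R:                       # refresh colours from R
            new = ("fresh", next(fresh))       # globally fresh data value
            for v in _nodes(out):
                if v.val == c:
                    v.val = new
    return out
```

For instance, in a forest of three single-node trees that all carry the colour `"c1"`, where only the second root has refresh set `{"c1"}`, the first two nodes end up with the same fresh value, while the third keeps `"c1"` and hence stores a different value.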

Lemma 4

For each data forest f of data cut n and all finite disjoint sets \(C, D \subseteq {\mathbb {D}}\) such that

$$|C| >\frac{3}{2}\cdot n \,,$$

there exists a forest g over \({\Gamma }\times (C\cup D) \times {\mathbb {P}(C)}\) such that \(\widehat {g} = f\) up to a permutation of \({\mathbb {D}}-D\).

Proof

Since we only aim at equality modulo a permutation of \({\mathbb {D}}-D\), we may assume that no data value from C is used in f. As the structure of g must be identical to that of f, we only need to define the labelling. Moreover, in the label (a,d,R) of a node w, we must always have \(a = \mathsf {lab}_{f}(w)\). The remaining two components, d and R, are defined in the course of a procedure processing nodes in the usual bottom-up, left-to-right order, maintaining the following invariants:

  1. \(\widehat {g_{w}} = f_{w}\) up to a bijection \(i_{w}\) between the colours from C used in \(\widehat {g_{w}}\) and the data values from \({\mathbb {D}} - D\) shared by \(f_{w}\) and \(f - f_{w}\);

  2. if v is the last child of the next sibling of w, then \(i_{v}\) and \(i_{w}\) coincide over \(\mathsf {dom}(i_{v}) \cap \mathsf {dom}(i_{w})\), and \(|\mathsf {dom}(i_{v}) \cup \mathsf {dom}(i_{w}) | \leq \frac {3}{2}\cdot n\).

For a node w the procedure first sets the values d and R in such a way that the first invariant is satisfied, and then applies a permutation of C to the whole \(g_{w}\) to ensure the second invariant. Note that applying such a permutation affects \(\widehat {g_{w}}\), but does not violate the first invariant.

Let \(w^{\prime }\) and \(w^{\prime \prime }\) be the previous sibling and the last child of w (if some of these do not exist, the argument is adjusted easily). For d there are three cases:

  • if \(\mathsf {val}_{f}(w) \in D\), set \(d = \mathsf {val}_{f}(w)\);

  • if \(\mathsf {val}_{f}(w) =i_{w^{\prime }}(c)\) or \(\mathsf {val}_{f}(w) = i_{w^{\prime \prime }}(c)\) for some colour \(c \in C\), set d = c;

  • otherwise, set d = c for an arbitrary colour \(c\in C-\big (\mathsf {dom}(i_{w^{\prime }}) \cup \mathsf {dom}(i_{w^{\prime \prime }})\big )\), which exists by the second invariant.

The refresh set R contains each colour \(c \in C\) currently used in \(\widehat {g_{w}}\) that represents a data value occurring in \(f_{w}\) but not in \(f - f_{w}\). After these colours have been refreshed, all colours \(C_{0} \subseteq C\) used in \(\widehat {g_{w}}\) represent different data values shared by \(f_{w}\) and \(f - f_{w}\); the bijection \(i_{w}\) can be defined by restricting to \(C_{0}\) the union of \(i_{w^{\prime }}\), \(i_{w^{\prime \prime }}\), and \(\{(d,\mathsf {val}_{f}(w))\}\) if \(d \in C_{0} -\big (\mathsf {dom}(i_{w^{\prime }}) \cup \mathsf {dom}(i_{w^{\prime \prime }})\big )\).

If w has no next sibling or the next sibling has no children, we are done. Otherwise, let v be the last child of the next sibling u of w. We need a permutation π of C such that \(i_{v}\) and the updated bijection \(i_{w}\circ (\pi \upharpoonright C_{0})^{-1}\) satisfy the second invariant. Let \(W, V, U \subseteq {\mathbb {D}} - D\) be the sets of data values used, respectively, in the fragments \(f_{w}\), \(f_{v}\), and \(f - (f_{w} \cup f_{v})\) shown in Fig. 3, and let \(k = |(W\cap U) - V|\), \(\ell = |(V\cap U) - W|\), \(m = |(W\cap V) - U|\), and \(r = |W\cap V\cap U|\).

Fig. 3 The positioning of \(f_{w}\), \(f_{v}\), and \(f-(f_{w} \cup f_{v})\) in a forest f, when v is the last child of the next sibling u of w

By the definition of data cut applied to w, v, and u, we have

$$k + m + r \leq n\,,\qquad \ell + m + r \leq n\,, \qquad k + \ell + r + e \leq n \,, $$

where e = 1 if the data value in u is used in \(f - f_{u}\) and does not belong to \(V\cup W\), and e = 0 otherwise. In order to represent values in \(W\cap U\), \(V\cap U\), and \(W\cap V\), we need exactly \(k + \ell + m + r\) colours. By adding the three inequalities above, we obtain that \(k+\ell + m + r \leq \frac {3n - r - e}{2}\). Hence, \(\frac {3}{2} \cdot n \) colours are sufficient to accommodate the domains of \(i_{v}\) and the updated bijection \(i_{w}\circ (\pi \upharpoonright C_{0})^{-1}\).

As the bounds in the three inequalities above can be attained simultaneously, with r = 0 and e = 0, \(\frac {3}{2} \cdot n\) colours are also necessary. Consequently, the assumption in the lemma is tight, because if the data value in the node w does not occur anywhere else, we need one more colour to represent it. □
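The counting argument at the end of the proof can be sanity-checked by brute force for small n. The sketch below enumerates all values of k, ℓ, m, r, e satisfying the three inequalities and reports the largest number \(k + \ell + m + r\) of colours that may be needed; the function name is our own, and the check is only an illustration, not part of the proof.

```python
from itertools import product


def max_colours(n: int) -> int:
    """Largest k + l + m + r subject to the three data-cut inequalities."""
    best = 0
    for k, l, m, r in product(range(n + 1), repeat=4):
        for e in (0, 1):
            if k + m + r <= n and l + m + r <= n and k + l + r + e <= n:
                total = k + l + m + r
                # Summing the three inequalities gives 2*total + r + e <= 3n.
                assert 2 * total <= 3 * n - r - e
                best = max(best, total)
    return best


if __name__ == "__main__":
    for n in range(1, 7):
        print(n, max_colours(n), (3 * n) // 2)
```

For every n tried, the maximum is the integer part of \(\frac {3}{2}\cdot n\), attained with r = 0 and e = 0, in line with the tightness remark above.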

Let \({\Sigma }_{\sim }\cup {\Sigma }_{\nsim }\) be a set of non-mixing integrity constraints and let \(\mathcal {A}\) be a tree automaton. By Lemma 3, it is enough to test satisfiability of \({\Sigma }_{\sim }\cup {\Sigma }_{\nsim }\) over trees of data cut bounded by a number n, singly exponential in the total size of constraints in \({\Sigma }_{\sim }\cup {\Sigma }_{\nsim }\). Let \(D\subseteq {\mathbb {D}}\) be the set of data values used explicitly in \({\Sigma }_{\sim }\cup {\Sigma }_{\nsim }\), and let \(C\subseteq {\mathbb {D}}-D\) be a fixed set such that \(|C|=\left \lfloor \frac {3}{2}\cdot n \right \rfloor + 1\). By Lemma 4, each tree of data cut bounded by n can be encoded as a tree over \({\Gamma }\times (C \cup D) \times {\mathbb {P}(C)}\) up to a permutation of \({\mathbb {D}} - D\). Since such permutations do not affect relations used in \({\Sigma }_{\sim }\cup {\Sigma }_{\nsim }\), there exists a data tree accepted by \(\mathcal {A}\) and satisfying \({\Sigma }_{\sim }\cup {\Sigma }_{\nsim }\) if and only if there exists a tree t such that the data tree \(\widehat {t}\) is accepted by \(\mathcal {A}\) and satisfies \({\Sigma }_{\sim }\cup {\Sigma }_{\nsim }\). We reduce consistency of \({\Sigma }_{\sim }\cup {\Sigma }_{\nsim }\) to the emptiness problem for tree automata, by constructing an automaton that recognizes such trees t.

The automaton is obtained by taking the product of two automata, testing acceptance by \(\mathcal {A}\) and satisfaction of \({\Sigma }_{\sim }\cup {\Sigma }_{\nsim }\), respectively. The first automaton is just the automaton \(\mathcal {A}\) lifted to the product alphabet \({\Gamma }\times (C\cup D) \times {\mathbb {P}(C)}\): it looks only at the first component of each label. Note that already this automaton has doubly exponential size, because of the size of the alphabet. The second automaton is the product over all constraints \(\sigma \in {\Sigma }_{\sim }\cup {\Sigma }_{\nsim }\) of automata \(\mathcal {B}_{\sigma }\) recognizing trees t such that \(\widehat {t} \models \sigma \), which will be constructed in the next subsection. Each automaton \(\mathcal {B}_{\sigma }\) is doubly exponential and so is the whole construction. As the emptiness problem for tree automata is in PTime, we can conclude that the consistency problem is in 2ExpTime.

3.4 Translating Constraints to Automata Over Encodings

To complete the proof of the upper bound of Theorem 1, it remains to construct an automaton recognizing encodings of trees that satisfy a given constraint.

Let us fix a constraint \(\alpha (\bar {x}) \Rightarrow \eta (\bar {x})\) with \(\bar {x}= (x_{1}, x_{2}, \dots , x_{\ell })\); for the present construction, it need not be non-mixing. Let \(D\subseteq {\mathbb {D}}\) be a finite set of data values, containing each data value used explicitly in \(\eta (\bar {x})\), and let \(C\subseteq {\mathbb {D}} - D\) be an arbitrary finite set. We shall construct an automaton \(\mathcal {B}\) over the alphabet \({\Gamma }\times (C\cup D) \times {\mathbb {P}(C)}\) recognizing the language

$$\left\{t \; \middle | \; \widehat{t}\models \alpha(\bar{x}) \Rightarrow \eta(\bar{x})\right\}\,. $$

It will read a tree t, compute a representation of tuples selected from the associated data tree \(\widehat {t}\) by the selector query \(\alpha (\bar {x})\), and accept if all these tuples satisfy the assertion \(\eta (\bar {x})\). The representation of the selected tuples will be computed based on the maintained information about partial matchings of the selector query in the forest encoded by the processed part of the tree. This can be done in the usual way, except that we need to systematically refresh the colours, as specified in t. To explain the details, we need some auxiliary notions.

Recall that t and \(\widehat {t}\) have the same domain, structure, and labelling with elements of Γ; the only difference lies in the way data values are represented: encoded in t and explicit in \(\widehat {t}\). Consequently, as long as we do not care about data values, we can blur the distinction between the encoded and decoded data tree. Similarly, \(t_{w}\), \(\widehat {t_{w}}\), and \(\big (\,\widehat {t}\:\big )_{w}\) are the same forest, up to the representation of data values. A partial valuation of variables \(x_{1}, x_{2}, \dots , x_{\ell }\) is a function

$${g} \colon \{ x_{1}, x_{2}, \dots, x_{\ell}\} \to \mathsf{dom}_{t} \cup \{\bot\}\,.$$

If \(g(x_{i}) \neq \bot \), we say that \(x_{i}\) is matched at \(g(x_{i})\), and if \(g(x_{i}) = \bot \) we say that \(x_{i}\) is not matched. Two partial valuations of \(x_{1}, x_{2}, \dots , x_{\ell }\) are disjoint if no variable is matched by both of them. The union of disjoint partial valuations g, h of variables \(x_{1}, x_{2}, \dots , x_{\ell }\) is given as

$$({g \cup h})(x_{i}) = \left\{\begin{array}{ll} {g}(x_{i}) & \text{if } {g}(x_{i}) \neq \bot\,,\\ {h}(x_{i}) & \text{otherwise}\,. \end{array}\right. $$
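As a small illustration, partial valuations can be modelled as dictionaries from variable names to nodes, with `None` playing the role of ⊥. The encoding of nodes as tuples of child indices is a hypothetical choice made only for this sketch.

```python
BOT = None  # stands for the symbol "not matched"


def disjoint(g: dict, h: dict) -> bool:
    """No variable is matched by both valuations."""
    return all(g[x] is BOT or h[x] is BOT for x in g)


def union(g: dict, h: dict) -> dict:
    """Union of two disjoint partial valuations over the same variables."""
    assert g.keys() == h.keys() and disjoint(g, h)
    return {x: (g[x] if g[x] is not BOT else h[x]) for x in g}


if __name__ == "__main__":
    g = {"x1": (0,), "x2": BOT, "x3": BOT}
    h = {"x1": BOT, "x2": (1, 0), "x3": BOT}
    print(union(g, h))  # x1 and x2 matched, x3 still unmatched
```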

Recall that \(\alpha (\bar {x})\) is a conjunction of atoms. A partial matching of \(\alpha (\bar {x})\) in \(\widehat {t_{w}}\) is a partial valuation g of variables \(\bar {x}\) such that

  • variables are matched only in the nodes of \(t_{w}\);

  • each atom in \(\alpha ({g}(\bar {x}))\) that does not contain ⊥ holds true in \(\widehat {t_{w}}\); and

  • each atom that contains both a node from \(t_{w}\) and ⊥ is of the form

    $$w \rightarrow \bot, \quad w^{\prime} \rightarrow^+ \bot,\quad \bot \downarrow w^{\prime}, \quad \text{or} \quad \bot \downarrow^+ v\,, $$

    where \(w^{\prime }\) is a preceding sibling of w or w itself, and v is an arbitrary node of \(t_{w}\).

The last condition means that each such atom can be made true (independently of others) by replacing ⊥ with a node from \(t - t_{w}\), unless w has no following siblings or no ancestors in t.

If \(\widehat {t}\models \alpha (\bar u)\), each partial valuation matching a subset of variables \(x_{i}\) at nodes \(u_{i}\) from \(\widehat {t_{w}}\) is a partial matching of α. Conversely, if a partial matching g matches all variables \(\bar {x}\), then \(\widehat {t}\models \alpha ({g}(\bar {x}))\). Note, however, that not every partial matching can be extended so that it matches all variables: remaining atoms may be satisfiable on their own, but not together.

The automaton collects information about tuples selected by \(\alpha (\bar {x})\) node by node: when it is in a node w of the input tree t, it has information corresponding to partial matchings of \(\alpha (\bar {x})\) in \(\widehat {t_{w}}\). More precisely, the states of the automaton \(\mathcal {B}\) are subsets Δ of

$$\left( C \cup D \cup \left\{\top_{1}, \top_{2}, \dots, \top_{\ell}, \bot\right\} \right)^{\ell}. $$

Each such tuple represents a partial matching of \(\alpha (\bar {x})\) in \(\widehat {t_{w}}\), and the whole Δ represents a set of such partial matchings. The intended meaning of the symbolic values is as follows:

  • \(c \in C \cup D\) in the coordinate j of the tuple means that the variable \(x_{j}\) is matched in a node of \(\widehat {t_{w}}\) storing the data value c;

  • \(\top _{i}\) in the coordinate j means that the variable \(x_{j}\) is matched in a node storing some data value \(d_{i} \in {\mathbb {D}} - (C\cup D)\), where \(d_{1}, d_{2}, \dots , d_{\ell }\) are distinct and depend on the tuple;

  • ⊥ in the coordinate j means that the variable \(x_{j}\) has not been matched yet.

The initial state is {(⊥,⊥,⋯,⊥)}. The accepting states are the ones whose each tuple either contains ⊥ or satisfies the assertion \(\eta (\bar {x})\).

Let us describe the transition relation. Assume that automaton \(\mathcal {B}\) is about to determine the state in a node w. Let \(w^{\prime }\) and \(w^{\prime \prime }\) be, respectively, the previous sibling and the last child of w. The set of partial matchings of \(\alpha (\bar {x})\) in \(\widehat {t_{w}}\) depends only on the sets of partial matchings in \(\widehat {t_{w^{\prime }}}\) and \(\widehat {t_{w^{\prime \prime }}}\), and the label of w. Indeed, a partial valuation of \(\bar {x}\) is a partial matching of \(\alpha (\bar {x})\) in \(\widehat {t_{w}}\) if it is the union of disjoint partial matchings of \(\alpha (\bar {x})\) in \(\widehat {t_{w^{\prime }}}\) and \(\widehat {t_{w^{\prime \prime }}}\) possibly extended by matching some (yet unmatched) variables at node w, respecting two conditions. For all atoms \(x_{i} \rightarrow x_{j}\), \(x_{i} \rightarrow ^{+} x_{j}\) in \(\alpha (\bar {x})\), either \(x_{i}, x_{j}\) are both matched in \(\widehat {t_{w^{\prime \prime }}}\) or none is; and the new matching of variables at w does not violate the definition of partial matching. The latter can be expressed as follows:

  • if \(\alpha (\bar {x})\) contains \(x_{i} \downarrow x_{j}\) or \(x_{i} \downarrow ^{+} x_{j}\), we may match \(x_{i}\) at w only if \(x_{j}\) is matched in \(t_{w^{\prime \prime }}\); for \(x_{i} \downarrow x_{j}\), if \(x_{j}\) is matched, we must match \(x_{i}\), unless it is matched in \(t_{w^{\prime \prime }}\) already;

  • if \(\alpha (\bar {x})\) contains \(x_{i} \rightarrow x_{j}\) or \(x_{i} \rightarrow ^{+} x_{j}\), we may match \(x_{j}\) at w only if \(x_{i}\) is matched in \(t_{w^{\prime }}\); for \(x_{i} \rightarrow x_{j}\), if \(x_{i}\) is matched, we must match \(x_{j}\), unless it is matched in \(t_{w^{\prime }}\) already;

  • if \(\alpha (\bar {x})\) contains \(a(x_{i})\), we may match \(x_{i}\) at w only if \(\mathsf {lab}_{t}(w) = (a,c,R)\) for some c, R.

Checking the conditions above requires only information about which variables are matched in \(\widehat {t_{w^{\prime }}}\) and \(\widehat {t_{w^{\prime \prime }}}\); the used tree nodes are not relevant. Consequently, one can determine the set of tuples representing partial matchings in \(\widehat {t_{w}}\) based on the sets of tuples representing partial matchings in \(\widehat {t_{w^{\prime }}}\) and \(\widehat {t_{w^{\prime \prime }}}\), and the label (a,c,R) of the current node w. Notice that the symbolic values \(\top _{i}\) represent different data values in \(\widehat {t_{w^{\prime }}}\) and \(\widehat {t_{w^{\prime \prime }}}\), so before combining two tuples we rename these values to guarantee that none is used in both tuples (ℓ values are always sufficient for this). The final step is to refresh colours: in each tuple we replace all occurrences of each \(c \in R\) with some \(\top _{i}\) not yet used in this tuple.
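The two bookkeeping steps on tuples, renaming the symbolic values apart before taking a union and refreshing colours afterwards, can be sketched as follows. The representation is an assumption made for illustration only: colours and distinguished values are strings, \(\top _{i}\) is the pair `("top", i)`, and ⊥ is `None`. Matching new variables at the current node is deliberately left out.

```python
BOT = None  # variable not matched yet


def top(i):
    """Symbolic value number i (a data value outside C and D)."""
    return ("top", i)


def is_top(v):
    return isinstance(v, tuple) and v[0] == "top"


def combine(t1, t2):
    """Union of two disjoint tuples after renaming symbolic values apart."""
    renamed, nxt = [], 1
    for t in (t1, t2):
        mapping, row = {}, []
        for v in t:
            if is_top(v):
                if v not in mapping:          # first occurrence in this tuple
                    mapping[v] = top(nxt)
                    nxt += 1
                v = mapping[v]
            row.append(v)
        renamed.append(row)
    a, b = renamed
    assert all(x is BOT or y is BOT for x, y in zip(a, b)), "not disjoint"
    return tuple(x if x is not BOT else y for x, y in zip(a, b))


def refresh(t, R):
    """Replace every colour from R by a symbolic value unused in the tuple."""
    out = list(t)
    for c in sorted(R):
        if c in out:
            used = {v[1] for v in out if is_top(v)}
            i = next(j for j in range(1, len(out) + 1) if j not in used)
            out = [top(i) if v == c else v for v in out]
    return tuple(out)
```

For example, combining `("c1", top(1), None, None)` with `(None, None, top(1), "c1")` keeps the two symbolic values apart, and refreshing `"c1"` afterwards replaces both of its occurrences by one and the same new symbolic value, so the equality between the first and the last coordinate is preserved.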

3.5 Lower Bound

Lemma 5

Consistency of non-mixing constraints is 2ExpTime -hard.

Proof

Relying on the fact that 2ExpTime = AExpSpace, we will be using alternating Turing machines. Such machines can be defined in multiple similar ways, and we use a definition that is most convenient for our encoding. We do not divide states of our machine into existential and universal; we only distinguish accepting states. Instead, we use the following notion of a run tree, requiring that from every configuration two different transitions can be applied. A run tree of an alternating Turing machine M on input word w is a tree labelled by configurations of M, where

  • the root is labelled by the initial configuration for the input word w;

  • every node not labelled by an accepting configuration has exactly two children, labelled by successors of this configuration, reached by applying to it two different transitions;

  • every node labelled by an accepting configuration is a leaf.

We say that an input word w is accepted by M if there is a finite run tree of M for w.

To turn a standard machine with existential and universal states into a machine of the form above, one simply ensures that in universal states the machine has exactly two available transitions, and duplicates transitions available in existential states.

Consider an alternating Turing machine M (of the form described above) that works in space bounded by \(2^{|w|}\), where w is the input word. Note that we limit the space to \(2^{|w|}\) instead of considering any exponential function, but already among such machines there is one solving an AExpSpace-hard problem. We show that for every input word w we can construct (in polynomial time) a tree automaton \(\mathcal {A}\) and a set Σ of non-mixing constraints such that \(\mathcal {A}\) and Σ are consistent if and only if M accepts w. More precisely, every tree \(t\in L(\mathcal {A})\) such that t ⊧ Σ describes a run tree of M on w. Below we specify how such a tree t encodes a run tree of M on w, simultaneously saying how these properties are ensured by \(\mathcal {A}\) or Σ.

Nodes labelled by s form a prefix of t that is a binary tree: the parent of every s-labelled node (if exists) is s-labelled, and every s-labelled node has zero or two s-labelled children. This is ensured by the automaton \(\mathcal {A}\). This part of t is called the skeleton, and will have the same shape as the run tree.

Additionally, each node of the skeleton has a c-labelled child (in addition to the zero or two children from the skeleton). The subtree rooted in this child forms a path, whose labels match the regular expression

$$\mathsf{c}\,A\, Q\, A\, Q\,(\mathsf{l}+\mathsf{n}+\mathsf{r})\,(\$\,(\mathsf{h}+\mathsf{n})\,(\mathsf{p}+\mathsf{n})\,\mathsf{c}^{2n+1})^{+}\,\#\,,$$

where b, l, n, r, h, p are new alphabet symbols, Q and A are the state set and the tape alphabet of M, and n = |w|. Again, this is ensured by \(\mathcal {A}\). Such a path, called a configuration path, describes a configuration of M assigned to the corresponding node of the skeleton (which is also a node of the run tree).

At the beginning of each configuration path we have the transition used to reach this configuration: the second node is labelled by the letter present on the tape under the head in the previous configuration; the third node is labelled by the previous state; the fourth by the letter written on the tape; the fifth by the new state; the sixth by the direction in which the head was moved (left, no move, right). The automaton ensures that this is indeed a valid transition of M (except for the configuration path directly below the root, where we only ensure that the fifth node contains the initial state); that the label of the third node (previous state) is equal to the label of the fifth node (current state) of the parent configuration; that the transitions assigned to sibling configurations are different (as required in the definition of a run tree); that states are accepting in leaf configurations and not accepting in non-leaf configurations.

The next part of a configuration path consists of multiple blocks of length 2n + 4; each of them describes a single letter on the tape. To identify a block, we use the first n c-labelled nodes for a binary counter encoding the position in the tape, using data values \(0,1\in {\mathbb {D}}\). We assign data values 0,⋯,0,0 to these nodes in the first block, 0,⋯,0,1 in the second block, and so on, until 1,⋯,1,1 in the last block (we have \(2^{n}\) blocks, which equals the length of the tape). The next n nodes of the block also contain such a counter, but going back: we assign 1,⋯,1,1 to these nodes in the first block, and 0,⋯,0,0 in the last block. Notice that when one counter of a block contains bits \(b_{1}, \dots , b_{n}\), then the other counter contains their inverses \(1 - b_{1}, \dots , 1 - b_{n}\). This double encoding of the position is the key trick that allows using non-mixing integrity constraints to check correctness of the run between two consecutive configurations. To enforce this behaviour of counters we use constraints. In the constraints we shall use queries matching tuples of variables

$$\bar{x}=(x_\$,x_{h},x_{p},x_{1},\dots,x_{n},x^{\prime}_{1},\dots, x^{\prime}_{n},x_{d}) $$

to the 2n + 4 consecutive nodes of blocks in configuration paths. It is easy to write a conjunctive query \(\alpha _{\mathit {fb}}(\bar {x})\) that matches \(\bar {x}\) to the first block in any configuration path. We include in Σ the constraint

$$\alpha_{\mathit{fb}}(\bar{x})\Rightarrow 0(x_{1})\land\dots\land 0(x_{n})\land 1({x^{\prime}_{1}})\land\dots\land 1({x^{\prime}_{n}})\,. $$

We deal analogously with the last block. Then, using a conjunctive query \(\alpha _{\mathit {cb}}(\bar {x},\bar {y})\) with \(\bar y\) defined as \(\bar {x}\) above, that matches two consecutive blocks, we include the constraint

$$\alpha_{\mathit{cb}}(\bar{x},\bar{y})\Rightarrow\eta_{\sim}(\bar{x},\bar{y}) $$

ensuring that the counters in these two blocks encode consecutive numbers. It is a standard task to express this property as a positive Boolean combination \(\eta _{\sim }(\bar {x},\bar {y})\) of atoms over {∼,0,1}, of a quadratic size.
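One standard way of expressing "y is the successor of x" as a positive combination of quadratic size is a disjunction over the position j of the lowest bit that flips from 0 to 1. The sketch below is our own illustration of that idea for the forward counter only, and need not coincide with the formula the authors have in mind; equality of bits stands for ∼, and tests against 0 and 1 stand for the predicates 0(·) and 1(·).

```python
def bits(p: int, n: int) -> list:
    """n-bit binary representation of p, most significant bit first."""
    return [(p >> (n - 1 - i)) & 1 for i in range(n)]


def successor(x: list, y: list) -> bool:
    """Positive formula: for some j, the bits before j agree (x_i ~ y_i),
    bit j goes from 0 to 1, and all later bits go from 1 to 0."""
    n = len(x)
    return any(
        all(x[i] == y[i] for i in range(j))                        # x_i ~ y_i
        and x[j] == 0 and y[j] == 1                                # 0(x_j), 1(y_j)
        and all(x[i] == 1 and y[i] == 0 for i in range(j + 1, n))  # 1(x_i), 0(y_i)
        for j in range(n)
    )


if __name__ == "__main__":
    n = 3
    ok = all(
        successor(bits(p, n), bits(q, n)) == (q == p + 1)
        for p in range(2 ** n) for q in range(2 ** n)
    )
    print(ok)
```

Each of the n disjuncts contains a linear number of atoms, which gives the quadratic size; no negated atom is needed.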

The second node of each block is marked by h if the head of M is placed over this position of the tape, and the third node is marked by p if the head was placed over this position in the previous configuration. The automaton ensures that in each configuration path exactly one block is marked by h and exactly one block is marked by p; that in the initial configuration the head is over the first letter; that the relation between the p and h markers on a configuration path is as described by the sixth node of that path (l, n, or r). To ensure that the position of p corresponds to the position of h in the previous configuration we use the constraint

$$\alpha_{\mathit{ch}}(\bar{x},\bar{y})\Rightarrow x_{1}\sim y_{1}\land\dots\land x_{n}\sim y_{n}\,, $$

where \(\alpha _{\mathit {ch}}(\bar {x},\bar {y})\) matches \(\bar {x}\) to the h-marked block of a configuration and \(\bar y\) to the p-marked block of a child configuration.

The last node of each block carries the tape letter (from A) in the data value. To ensure that the initial configuration starts with the input word, we write a constraint

$$\alpha_{\mathit{ini}}(x_{1},\dots,x_{n})\Rightarrow \eta_{\sim}(x_{1},\dots,x_{n})\,, $$

where \(\alpha _{\mathit {ini}}(x_{1},\dots ,x_{n})\) selects the last node from each of the first n blocks of the topmost configuration path (to make sure that only the topmost configuration path is selected, we can check for the presence of the initial state, assuming w.l.o.g. that M cannot reach the initial state in any transition). Another constraint

$$\alpha_{\mathit{bl}}(x)\Rightarrow \mathsf{b}(x) $$

ensures that the rest of the initial tape contains blanks \(\mathsf {b} \in A\), where \(\alpha _{\mathit {bl}}(x)\) matches the last node of a block of the topmost configuration path other than the first n blocks. Next |A| constraints ensure that the p-marked block contains the letter written in the fourth node of the configuration path (letter written under the head), and another |A| constraints that the h-marked block of the previous configuration contains the letter written in the second node of the configuration path (letter seen under the head).

Finally, we have to ensure that the content of the tape is preserved (except the single letter under the head). Let \(\alpha _{\mathit {2b}}(\bar {x},\bar {y})\) be a conjunctive query matching some blocks on consecutive configuration paths, where the first of them is not marked by h. For every such pair \(\bar {x},\bar {y}\), we want to enforce that either the two corresponding blocks carry the same letter or they represent two different positions in the tape. Using the double complementary encoding of the position in the blocks, this can be enforced using only ∼ in the following constraint:

$$\alpha_{\mathit{2b}}(\bar{x},\bar{y})\Rightarrow x_{1}\sim {y^{\prime}_{1}}\lor\dots\lor x_{n}\sim{y^{\prime}_{n}}\lor x_{d}\sim y_{d}\,.$$

Notice that the property “\(\bar {x}\) and \(\bar {y}\) encode different positions in the tape” seems to require \(\nsim \), but thanks to the inverted counter stored in each block we may use ∼ instead, avoiding the illegal mixture of ∼ and \(\nsim \) in the assertion of the constraint.
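The effect of the inverted counter can be checked exhaustively for small n: two positions differ exactly when some bit of the forward counter of one block equals the corresponding bit of the backward counter of the other. The helper names below are our own; the check merely illustrates the trick.

```python
from itertools import product


def forward(p: int, n: int) -> list:
    """Forward counter of the block at position p (most significant bit first)."""
    return [(p >> (n - 1 - i)) & 1 for i in range(n)]


def backward(p: int, n: int) -> list:
    """Backward counter: bitwise inverse of the forward counter."""
    return [1 - b for b in forward(p, n)]


def positions_differ(x: list, y_inv: list) -> bool:
    """x_1 ~ y'_1 or ... or x_n ~ y'_n, using only equality of data values."""
    return any(a == b for a, b in zip(x, y_inv))


def check(n: int) -> bool:
    """The disjunction holds exactly for pairs of different positions."""
    return all(
        positions_differ(forward(p, n), backward(q, n)) == (p != q)
        for p, q in product(range(2 ** n), repeat=2)
    )


if __name__ == "__main__":
    print(check(3))
```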

By construction, witnesses for the obtained automaton \(\mathcal {A}\) and set of constraints Σ correspond to run trees of the machine M, which ensures correctness of the reduction. □

We remark that all conjunctive queries used in the above proof could be written using tree patterns (see Section 4.2 for the definition), and that the set \({\Sigma }_{\nsim }\) was empty. Thus the 2ExpTime-hardness result holds already for constraints of this form. If we only allow tree patterns as selectors and \({\Sigma }_{\sim }\) is empty, the complexity might be lower. In Section 4.3 we shall see a different hardness argument, showing that there is no hope for lower complexity without restricting selectors.

4 Extensions, Connections, and Applications

4.1 Entailment of Non-mixing Constraints

A static analysis problem more general than consistency is entailment. Recall that a set of constraints \({\Sigma }^{\prime }\) is entailed by a set of constraints Σ modulo a tree automaton \(\mathcal {A}\), written as \({\Sigma } \models _{\mathcal {A}} {\Sigma }^{\prime }\), if for each data tree t accepted by automaton \(\mathcal {A}\),

$$t\models {\Sigma} \quad \text{implies} \quad t \models {\Sigma}^{\prime}\,. $$

The entailment problem is then defined as follows:

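Problem: Entailment of constraints
Input: a tree automaton \(\mathcal {A}\) and sets of constraints Σ, \({\Sigma }^{\prime }\)
Question: does \({\Sigma } \models _{\mathcal {A}} {\Sigma }^{\prime }\) hold?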

Entailment is a more general problem than consistency, but for non-mixing constraints the results on consistency generalize to entailment almost effortlessly.

Theorem 2

Entailment of non-mixing constraints is 2ExpTime -complete.

Proof

Inconsistency is a special case of entailment: Σ is inconsistent with respect to an automaton \(\mathcal {A}\) if and only if \({\Sigma } \models _{\mathcal {A}} \bot \), where ⊥ is an inconsistent set of constraints, say {a(x) ⇒ 0(x) ∧ 1(x)|a ∈ Γ}. Thus, the lower bound follows.

Lemma 3 shows that witnesses for consistency can have bounded data cut. The same is true for counter-examples to entailment. Suppose \(t\models {\Sigma }\) and \(t \not \models {\Sigma }^{\prime }\). Then, \(t\models \alpha ^{\prime }(\bar u) \land \lnot \eta ^{\prime }(\bar u)\) for some constraint \(\alpha ^{\prime }(\bar {x}) \Rightarrow \eta ^{\prime }(\bar {x})\) from \({\Sigma }^{\prime }\) and some tuple \(\bar u\) of nodes of t. Let \(D_{0}\) be the set of data values used in the nodes \(\bar u\). We can repeat the construction of the tree \(t^{\prime }\) word for word, except that we replace the set D of values not to be touched by \(D \cup D_{0}\). This increases \(\textit {datacut}(t^{\prime })\) by the maximal number of variables in the constraints of \({\Sigma }^{\prime }\).

The automata construction in Section 3.3 is modified similarly: the set D contains also the data values used explicitly in \({\Sigma }^{\prime }\), in the product automaton we include additionally the automata \(\mathcal {B}_{\sigma }\) for \(\sigma \in {\Sigma }^{\prime }\), and we let it accept if at least one of these components rejects and all previously described components accept. As the automata \(\mathcal {B}_{\sigma }\) are deterministic, this does not involve any additional cost.

Note that the argument above works also if \({\Sigma }^{\prime }\) mixes predicates from \(\mathsf {sig}_{\sim }\) and \(\mathsf {sig}_{\nsim }\). □

4.2 A Singly Exponential Fragment

A closer look at the complexity of our algorithm reveals that it is doubly exponential only in the maximal number \(\ell \) of variables in the constraints. This number appears in three roles: in the exponent in the factors \((\ell +m)^{\ell ^{2}}\) and \(2^{\ell }\) of the bound on the data cut, and as the length of tuples representing partial matchings of selectors. A slightly more detailed analysis of the proof of Lemma 3 shows that in the first role, \(\ell \) could be replaced by the maximal number of variables actually used in the assertions. Indeed, since data equalities involve only variables occurring in the atoms of the assertions, everything is in fact happening in a space of dimension at most that number. While limiting the size of selector queries to lower the complexity makes little sense, limiting the number of variables actually used in assertions seems acceptable. But what about the other two roles of \(\ell \)?

Concerning the third role, the need to represent all partial matchings (up to data equality type) comes from the fact that the automaton is essentially evaluating conjunctive queries. The standard technique to lower the complexity in such cases is to replace conjunctive queries with tree patterns, which are essentially tree-structured conjunctive queries. In the most basic form, with only \(\downarrow \) and \(\downarrow ^+\) axes allowed, a tree pattern is a conjunctive query α over signature \(\{\downarrow , \downarrow ^+\}\cup {\Gamma }\), such that the graph

$$(A_{\alpha}, \downarrow_{\alpha} \cup \downarrow^+_{\alpha})$$

is a directed tree, where

$$\mathbb{A}_{\alpha}=(A_{\alpha}, \downarrow_{\alpha}, \downarrow^+_{\alpha}, \{a_{\alpha} \}_{a\in{\Gamma}}) $$

is the canonical relational structure associated to query α in the usual way: the universe \(A_{\alpha }\) is the set of variables of α, and relations are given by the respective atoms in α.
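As an illustration of this definition, the following Python sketch checks the directed-tree condition on the graph \((A_{\alpha}, \downarrow_{\alpha} \cup \downarrow^+_{\alpha})\). The encoding (variables as strings, binary atoms as pairs) and the name `is_tree_pattern` are assumptions made for this example only. "Directed tree" is read here as: there is a unique root, every variable has at most one incoming edge, and every variable is reachable from the root.

```python
def is_tree_pattern(variables, child_edges, desc_edges):
    """Check that (variables, child_edges ∪ desc_edges) is a directed tree.

    variables   -- iterable of variable names (hypothetical encoding)
    child_edges -- pairs (x, y) for atoms x ↓ y
    desc_edges  -- pairs (x, y) for atoms x ↓+ y
    """
    variables = list(variables)
    edges = set(child_edges) | set(desc_edges)
    parents = {x: [] for x in variables}
    children = {x: [] for x in variables}
    for x, y in edges:
        parents[y].append(x)
        children[x].append(y)

    # Exactly one variable without incoming edges, and no variable with two.
    roots = [x for x in variables if not parents[x]]
    if len(roots) != 1 or any(len(p) > 1 for p in parents.values()):
        return False

    # Every variable must be reachable from the unique root.
    seen, stack = set(), [roots[0]]
    while stack:
        x = stack.pop()
        if x not in seen:
            seen.add(x)
            stack.extend(children[x])
    return seen == set(variables)
```

For instance, the query with atoms x ↓ y and x ↓⁺ z passes the check, while adding the atom y ↓ z gives z two incoming edges and the check fails.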

Finally, we also have the factor \(2^{\ell }\) in the bound on the data cut. This factor appears because in the proof of Lemma 3, we consider separately each partition of variables into two tuples. As we shall see, the number of partitions can also be reduced for tree patterns. To this end, we prove the following analogue of Lemma 2.

Lemma 6

Let \(\alpha (\bar {x}, \bar y, \bar z)\) be a tree pattern, where \(\bar {x}\), \(\bar y\), and \(\bar z\) are pairwise disjoint, and in \(\downarrow _{\alpha } \cup \downarrow ^+_{\alpha }\) there are no edges from variables in \(\bar z\) to variables in \(\bar {x}\), \(\bar y\). Let w be a node of a data tree t. For all tuples \(\bar u, \bar u^{\prime }\) of nodes from \(t_{w}\), and tuples \(\bar v, \bar v^{\prime }\) of nodes from \(t - t_{w}\), if

$$t\models \exists \bar z \, \alpha(\bar u, \bar v, \bar z)\quad \textrm{ and } \quad t\models \exists \bar z\, \alpha(\bar u^{\prime}, \bar v^{\prime}, \bar z)\,, $$

then

$$t\models \exists \bar z\,\alpha(\bar u, \bar v^{\prime},\bar z) \quad \textrm{ and } \quad t\models \exists \bar z\,\alpha(\bar u^{\prime},\bar v,\bar z)\,. $$

Proof

We only prove that \(t\models \exists \bar z\,\alpha (\bar u, \bar v^{\prime }, \bar z)\), as the other part is symmetric. Let \(\bar w\) and \(\bar w^{\prime }\) be tuples of nodes from t such that \(t\models \alpha (\bar u, \bar v, \bar w)\) and \(t\models \alpha (\bar u^{\prime }, \bar v^{\prime }, \bar w^{\prime })\). The claim holds trivially if both \(\bar {x}\) and \(\bar y\) are empty. Assume that at least one of these tuples is nonempty. Then the root of the tree pattern belongs to \(\bar {x}\) or to \(\bar y\). Let \(\bar z=(z_{1},\dots ,z_{k})\), \(\bar w=(w_{1},\dots ,w_{k})\), \(\bar w^{\prime }=(w^{\prime }_{1},\dots ,w^{\prime }_{k})\). For \(i \in \{1,\dots ,k\}\) we look at the nearest ancestor of \(z_{i}\) that is in \(\bar {x}\) or in \(\bar {y}\). If it is in \(\bar {x}\), we take \(w^{\prime \prime }_{i}=w_{i}\), otherwise \(w^{\prime \prime }_{i}=w_{i}^{\prime }\), and we define \(\bar w^{\prime \prime }=(w^{\prime \prime }_{1},\dots ,w^{\prime \prime }_{k})\).

We need to check that every atom of \(\alpha (\bar u, \bar v^{\prime },\bar w^{\prime \prime })\) is satisfied in t. This is clear for unary atoms. Assume a binary atom involves a variable from \(\bar z\) that is valuated as in \(\bar u, \bar v, \bar w\). From the definition of \(\bar w^{\prime \prime }\) it then follows that the other variable, being its child or its parent in the tree pattern, is also valuated as in \(\bar u, \bar v, \bar w\). It follows that the atom is satisfied, because \(t\models \alpha (\bar u, \bar v, \bar w)\). We argue analogously for binary atoms with a variable from \(\bar z\) valuated as in \(\bar u^{\prime }, \bar v^{\prime }, \bar w^{\prime }\). It remains to consider binary atoms involving only variables from \(\bar {x}\) and \(\bar y\), and this can be done as in the proof of Lemma 2. □

We remark that for non-mixing integrity constraints, restricting selectors to tree patterns alone does not suffice to lower the complexity: the reduction in Lemma 5 uses only such constraints (and no assertions over \(\mathsf {sig}_{\nsim }\)). But together with the bound on the number of variables in assertions, it does suffice.

Proposition 1

For non-mixing constraints whose selectors are tree patterns and whose assertions use constantly many variables, consistency and entailment are ExpTime-complete.

Proof

We first complete the proof that the data cut can be bounded polynomially. We have already argued that in the factor \(\ell \cdot (\ell +m)^{\ell ^{2}}\) of the bound given by Lemma 3 we can replace \(\ell \) by the maximal number of variables used in the assertions, which is assumed to be constant. It remains to deal with the factor \(2^{\ell }\). We show that the number of considered partitions can be limited by a polynomial.

Fix a tree t, its node w, and a constraint \(\alpha \Rightarrow \eta _{\sim }\) in Σ. We say that a variable x used in α is important if either x or some descendant of x (in the sense of \(\downarrow _{\alpha } \cup \downarrow ^+_{\alpha }\)) is used in \(\eta _{\sim }\); otherwise x is unimportant. We shall partition only important variables: we write the tree pattern as \(\alpha (\bar {x}, \bar y, \bar z)\), where \(\bar {x}, \bar y\) is a partition of important variables, and \(\bar z\) contains all unimportant variables. Notice that in \(\downarrow _{\alpha } \cup \downarrow ^+_{\alpha }\) there are no edges from unimportant variables to important variables, and thus Lemma 6 can be used.

We additionally restrict ourselves to tame partitions, defined as follows: a partition \(\bar {x}, \bar y\) of important variables is tame if in \(\downarrow _{\alpha } \cup \downarrow ^+_{\alpha }\) there are no edges from variables in \(\bar {x}\) to variables in \(\bar y\). This way we only prune empty cases, because in the proof of Lemma 3 we only valuate variables from \(\bar {x}\) with nodes from \(t_{w}\), and variables from \(\bar y\) with nodes from \(t - t_{w}\).

We thus have the following statement: \(t\models \alpha \Rightarrow \eta _{\sim }\) if and only if for each tame partition \(\bar {x}, \bar y\) of important variables in α, for each tuple \(\bar u\) of nodes from \(t_{w}\), each tuple \(\bar v\) of nodes from \(t - t_{w}\), and each tuple \(\bar w\) of nodes of t, if \(t\models \exists \bar z\, \alpha (\bar u, \bar v, \bar z)\), then \(t\models \eta _{\sim }(\bar u, \bar v)\). This allows us to continue as in the proof of Lemma 3; the unimportant variables do not appear in \(\eta _{\sim }\), so it is irrelevant whether they are valuated in \(t_{w}\) or in \(t - t_{w}\). Finally, we observe that the number of tame partitions of important variables is polynomial in the size of the tree pattern, assuming that the number of variables used in \(\eta _{\sim }\) is constant. Indeed, for each of the constantly many variables used in \(\eta _{\sim }\), we only have to decide how many of its closest ancestors (including itself) are to be taken to \(\bar {x}\) (the ancestors being farther are then taken to \(\bar y\)).
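The notions of important variables and tame partitions can be made concrete with a small Python sketch. The encoding of a tree pattern as a list of parent-to-child edge pairs and the function names are assumptions made for this illustration. The enumeration is deliberately brute force (exponential in the number of important variables); it is meant only to let one inspect the tame partitions of small patterns, not to reproduce the polynomial counting argument above.

```python
from itertools import combinations

def important(edges, assertion_vars):
    """Variables that are used in the assertion or have a descendant used there."""
    parent = {y: x for x, y in edges}   # tree pattern: at most one parent each
    imp = set()
    for v in assertion_vars:
        # Walk up from each assertion variable, marking all its ancestors.
        while v is not None and v not in imp:
            imp.add(v)
            v = parent.get(v)
    return imp

def tame_partitions(edges, assertion_vars):
    """All partitions (xs, ys) of the important variables with no edge xs -> ys."""
    imp = important(edges, assertion_vars)
    result = []
    for r in range(len(imp) + 1):
        for chosen in combinations(sorted(imp), r):
            xs = set(chosen)
            ys = imp - xs
            # Tame: no pattern edge leads from a variable in xs to one in ys.
            if all(not (p in xs and c in ys) for p, c in edges):
                result.append((xs, ys))
    return result

# A chain a -> b -> c -> d with an unimportant side branch b -> e;
# only d is used in the assertion.
chain = [("a", "b"), ("b", "c"), ("c", "d"), ("b", "e")]
parts = tame_partitions(chain, ["d"])
```

In this example the tame partitions are exactly those in which \(\bar {x}\) consists of d together with some number of its closest ancestors (or is empty), matching the counting argument in the text.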

We have thus proved that the bound on the data cut is polynomial. Hence, the size of the set of colours C is also polynomial. It remains to optimize the automaton \(\mathcal {B}_{\sigma }\) verifying a single constraint σ, assuming that the selector of σ is a tree pattern.

We use the standard method relying on the fact that subtrees of a tree pattern can be matched independently. By definition, the domain of a partial matching of a tree pattern is a collection of disjoint full subtrees of the pattern. Such a collection can be matched, if each of its elements can be matched independently; the information sufficient to represent all possible matchings is a set of subtrees that can be matched. For our purposes this is insufficient: we are interested not in just matching the selector, but in all tuples of data values that can be associated with the variables used in the assertion. This information cannot be stored separately for each subtree, as we are interested in the equalities and inequalities between data values assigned to variables in different subtrees; this is a property of a set of matched subtrees, and there can be ways of matching the same set that yield different equalities and inequalities. The solution is to treat subtrees with variables used in the assertion in a special way. The automaton remembers in each state a collection of subtrees without assertion variables and a collection of pairs consisting of

  • a set of pairwise disjoint subtrees with assertion variables, and

  • a tuple representing the associated data values (like before).

As the number of assertion variables is constant, the number of such sets and such tuples is polynomial. Hence, the whole automaton is singly exponential. This shows that both consistency and entailment are in ExpTime.

The lower bound follows immediately from ExpTime-hardness of consistency of schema mappings with trivially unsatisfiable right hand sides of dependencies [1, Proposition 18.2], which can be also reinterpreted as validity of unions of tree patterns modulo a given tree automaton. □

4.3 Static Analysis of Unions of Conjunctive Queries

Our results can be reinterpreted in the framework of static analysis of unions of conjunctive queries (UCQs). Note that

$$t \not\models \alpha(\bar{x})\Rightarrow \eta(\bar{x}) \qquad \text{if and only if} \qquad t\models \exists \bar{x}\, \alpha(\bar{x})\land \lnot \eta(\bar{x})\,. $$

It follows immediately that the problem of validity of UCQs over signature \(\mathsf {sig}_{\textit {dt}}\) that never mix predicates from \(\mathsf {sig}_{\sim }\) and \(\mathsf {sig}_{\nsim }\) (call them non-mixing UCQs) reduces in polynomial time to inconsistency of non-mixing constraints. Similarly, containment of such queries reduces to entailment of non-mixing constraints. The converse reduction is also possible, but it involves exponential blow-up when arbitrary Boolean combinations in assertions are rewritten in disjunctive normal form. This correspondence brings our results very close to the work by Björklund, Martens, and Schwentick on static analysis for UCQs over signature \(\mathsf {sig}_{\textit {nav}}\cup \{\sim , \nsim \}\) [6].

On one hand, our results immediately give the following new decidability result for the setting considered by Björklund, Martens, and Schwentick (constraints used in the lower bound of Lemma 5 can be rewritten without blow-up).

Theorem 3

Over \(\mathsf {sig}_{\textit {nav}}\cup \{\sim , \nsim \}\), both validity of non-mixing UCQs and containment of UCQs in non-mixing UCQs (with respect to a given tree automaton) are 2ExpTime-complete.

Results of Björklund, Martens, and Schwentick give a 2ExpTime upper bound for containment (with respect to a tree automaton) in UCQs over \(\mathsf {sig}_{\textit {nav}}\cup \{\sim \}\) and UCQs over \(\mathsf {sig}_{\textit {nav}}\cup \{\nsim \}\). The original work is on CQs, but arguments for UCQs are the same [12]. Essentially, they amount to an observation that in counter-examples to containment of a query p in a query q, all data values can be set equal (in the case with \(\nsim \)) or different (in the case with ∼), except for a bounded number of them needed to witness satisfaction of p; such counter-examples can be easily encoded as trees over a finite alphabet, and recognized by an automaton evaluating p and q in the usual way. Theorem 3 extends both these results. Since we have both ∼ and \(\nsim \) in query q, we cannot assume that all data values are equal, nor that all are different; our more involved approach seems necessary. The third relevant result of [6] is that containment of p in q is 2ExpTime-complete under the assumption that p is a CQ over \(\mathsf {sig}_{\textit {nav}}\cup \{\sim \}\) and q is a CQ over \(\mathsf {sig}_{\textit {nav}}\cup \{\sim ,\nsim \}\). It looks stronger than ours because query q can mix ∼ and \(\nsim \). In fact, it is much weaker, depending entirely on the fact that q is a single CQ, not a UCQ. More specifically, the argument is as follows: if q uses \(\nsim \), the answer is yes if and only if p is not satisfiable with respect to the tree automaton (otherwise p is satisfiable in a tree with all data values equal, and no such tree can satisfy q because of its \(\nsim \) atoms); if q does not use \(\nsim \), we are back in the case of UCQs over \(\mathsf {sig}_{\textit {nav}}\cup \{\sim \}\).

On the other hand, some results of Björklund, Martens, and Schwentick give a broader context to our results. They show that validity with respect to a given automaton is already 2ExpTime-complete for unions of conjunctive queries over signature \(\mathsf {sig}_{\textit {nav}}\), that is, for trees without data. Consequently, restricting only assertions of non-mixing constraints would not lower the complexity. This is complementary to our lower bound of Lemma 5, which shows 2ExpTime-hardness for constraints using tree patterns as selectors. Hence, the only way to lower the complexity is to restrict both selectors and assertions. Björklund, Martens, and Schwentick also show that for UCQs over \(\mathsf {sig}_{\textit {nav}}\cup \{\sim , \nsim \}\) validity is undecidable; this means that we cannot go beyond non-mixing assertions.

4.4 XML Constraints

Non-mixing constraints form an instance of the general framework of XML-to-relational (X2R) constraints proposed by Niewerth and Schwentick [27], where selectors are arbitrary queries defining relations by selecting tuples of nodes and data values (in separate columns), and assertions are arbitrary relational constraints over the defined relations; the considered problem is entailment modulo schema. Our setting corresponds to a fragment in which selectors are conjunctive queries over \(\mathsf {sig}_{\textit {nav}}\) interpreted as queries selecting tuples of data values, assertions are positive quantifier-free formulas using constants and either = or ≠, and schemas are tree automata. Niewerth and Schwentick investigate two classes of assertions: functional dependencies (FDs) and XML-key FDs (XKFDs). In an FD

$$A_{1}\,A_{2}\,{\dots} \,A_{m} \to B\,, $$

\(A_{1},A_{2},\dots ,A_{m},B\) are arbitrary columns of the relation defined by the selector (each referring either to nodes or to data values); in an XKFD, B is required to be a node column. Our setting captures XKFDs, but not general FDs. Consider an X2R constraint given by a CQ \(\alpha (x_{1},\dots ,x_{n})\) populating a table with tuples \((x_{1},\dots ,x_{n},@x_{1},\dots ,@x_{n})\), where \(@x_{i}\) stands for the data value stored in the node represented by variable \(x_{i}\), and an XKFD \(x_{1},\dots ,x_{j},@x_{j+1},\dots ,@x_{n-1} \to x_{n}\) (it makes no sense to use both \(x_{i}\) and \(@x_{i}\) in the same constraint). Such a constraint can be rewritten as

$$\begin{array}{@{}rcl@{}} \alpha(x_{1}, \dots, x_{n}) &\land& \alpha(x_{1},\dots, x_{j}, x^{\prime}_{j+1},\dots, x^{\prime}_{n}) \land x_{n}\neq x^{\prime}_{n} \; \Rightarrow \; \\ &\Rightarrow& \; x_{j+1} \nsim x^{\prime}_{j+1} \lor {\dots} \lor x_{n-1} \nsim x^{\prime}_{n-1}\, \end{array} $$

which can be turned into a set of five non-mixing constraints by replacing \(x_{n}\neq x^{\prime }_{n}\) with simple subqueries describing possible ways of arranging two different nodes in a tree, as explained in Section 2.3. Note that these constraints do not use ∼. Hence, for XKFDs with UCQs over \(\mathsf {sig}_{\textit {nav}}\) as tuple selectors decidability of entailment follows already from the results on containment of UCQs over \(\mathsf {sig}_{\textit {nav}}\cup \{\sim \}\), discussed in the previous subsection; the challenge tackled by Niewerth and Schwentick is to determine the exact complexity and identify tractable fragments.

If we replace the XKFD above with an FD \(x_{1},\dots ,x_{j},@x_{j+1},\dots ,@x_{n-1} \to @x_{n}\) we have

$$\begin{array}{@{}rcl@{}} \alpha(x_{1}, \dots, x_{n}) &\land& \alpha(x_{1}, \dots, x_{j}, x^{\prime}_{j+1},\dots, x^{\prime}_{n}) \; \Rightarrow \; \\ &\Rightarrow& \; x_{j+1} \nsim x^{\prime}_{j+1} \lor {\dots} \lor x_{n-1} \nsim x^{\prime}_{n-1} \lor x_{n} \sim x^{\prime}_{n}\,, \end{array} $$

which cannot be expressed without mixing ∼ and \(\nsim \). As we have explained, consistency and entailment are undecidable for such constraints, but one can investigate fragments with restricted schemas and tuple-selectors. This is what Niewerth and Schwentick do.

As XKFDs with tree patterns as tuple-selectors can express XML Schema key and unique constraints [18], XML keys by Arenas, Fan, and Libkin [2], and XFDs by Arenas and Libkin [3], so can non-mixing constraints. A technical subtlety is that some of these classes of constraints apply to nodes of a specified type (playing the role of a state in XML Schemas). As proposed by Niewerth and Schwentick, we can deal with it by annotating tree nodes with types (verified by the automaton encoding the schema), and let the patterns refer to types and labels. This slight extension does not affect our complexity bounds. Also, XML Schema unique constraints demand that each field path selects at most one node, and XML Schema key constraints demand exactly one node; the latter can be checked by the automaton too. In practice, one often wants at most (or exactly) one data value, not tree node. This may or may not be equivalent. To express that at most one data value is selected, we can use the singleton constraints discussed in Section 2.3. Note that this requires assertions over \(\mathsf {sig}_{\sim }\).

4.5 Consistency of XML Schema Mappings

Schema mappings are a formalism used in data exchange scenarios to specify relations between instances of two database schemas, a source schema and a target schema [1, 14]. In the basic setting for XML [4], schemas can be abstracted as tree automata, and the relation between source and target instances can be defined by a set Σ of dependencies of the form

$$\alpha(\bar{x}) \Rightarrow \alpha^{\prime}(\bar{x}) $$

where \(\alpha , \alpha ^{\prime }\) are conjunctive queries over \(\mathsf {sig}_{\textit {nav}}\), treated as queries selecting data values, not nodes. That is, a pair of data trees \((t,t^{\prime })\) satisfies dependency σ of the form above, written as \((t,t^{\prime })\models \sigma \), if

$$\big \{\mathsf{val}_{t}(\bar u) \bigm| t \models \alpha(\bar u) \big\} \subseteq \big\{\mathsf{val}_{t^{\prime}}(\bar u) \bigm| t^{\prime} \models \alpha^{\prime}(\bar u) \big\}\,. $$

The consistency problem for XML schema mappings [4] is to decide for a given schema mapping \(\mathcal {M}=(\mathcal {A}, \mathcal {A}^{\prime }, {\Sigma })\), whether there exist a tree t accepted by automaton \(\mathcal {A}\) and a tree \(t^{\prime }\) accepted by automaton \(\mathcal {A}^{\prime }\) such that \((t,t^{\prime })\models {\Sigma }\). This problem is known to be decidable: without loss of generality one may assume that all data values in t and \(t^{\prime }\) are equal, and use standard automata techniques ignoring data values. This is not only uninspiring theoretically, but also not very practical: an instance with all data values equal is not a convincing witness that the mapping makes sense. What if the source schema includes constraints, say XML Schema key or unique constraints? We cannot assume that all data values are equal any more. As we have argued in the previous subsection, such constraints can be expressed with non-mixing constraints, which leads us to the problem of consistency with source constraints, a common generalization of consistency of constraints and schema mappings: given a schema mapping \(\mathcal {M}=(\mathcal {A}, \mathcal {A}^{\prime }, {\Sigma })\) and a set of non-mixing constraints \({\Sigma }_{\textit {src}}\), decide if there exist a tree t accepted by automaton \(\mathcal {A}\) and a tree \(t^{\prime }\) accepted by automaton \(\mathcal {A}^{\prime }\) such that \(t\models {\Sigma }_{\textit {src}}\) and \((t,t^{\prime })\models {\Sigma }\).

The following lemma gives the connection between XML schema mappings and non-mixing constraints that allows us to apply our decidability result. It was proved in a slightly different but equivalent form in [13]. A non-mixing constraint with free data value predicates uses additional unary predicate symbols in the assertions. A data tree t satisfies a set Σ of such constraints (possibly sharing some additional predicate symbols) if it satisfies \({\Sigma }^{\prime }\) obtained from Σ by replacing each additional predicate symbol with some \(d\in {\mathbb {D}}\). Free data value predicates are not problematic for the consistency algorithm, as it can guess the data values to replace them; up to equality type with respect to data values already used in Σ, there are only exponentially many possibilities.

Lemma 7

For each schema mapping \(\mathcal {M} = (\mathcal {A}, \mathcal {A}^{\prime }, {\Sigma })\) one can compute in doubly exponential time sets \({\Sigma }^{1}_{\sim }, {\Sigma }^{2}_{\sim }, \dots , {\Sigma }^{m}_{\sim }\) of non-mixing constraints with free data value predicates, each obtained from Σ by replacing target-side queries \(\alpha ^{\prime }(\bar {x})\) with assertions \(\eta _{\sim }(\bar {x})\) of exponential size, such that for each data tree t, \(t\models {\Sigma }^{i}_{\sim }\) for some \(i \in \{1,\dots ,m\}\) if and only if \((t,t^{\prime })\models {\Sigma }\) for some data tree \(t^{\prime }\) accepted by automaton \(\mathcal {A}^{\prime }\).

Thus, mapping \(\mathcal {M}\) is consistent with source constraints \({\Sigma }_{\textit {src}}\) if and only if at least one of the sets \({\Sigma }^{i}_{\sim } \cup {\Sigma }_{\textit {src}}\) obtained via Lemma 7 is consistent with respect to automaton \(\mathcal {A}\). Since the number of variables in each involved constraint is linear, the latter can be tested in 2ExpTime, as the algorithm from Section 3 is doubly exponential only in the maximal number of variables. As Lemma 7 translates mappings into constraints with assertions over \(\mathsf {sig}_{\sim }\), even if \({\Sigma }_{\textit {src}}\) is just a set of key constraints (expressible with assertions over \(\mathsf {sig}_{\nsim }\)), we need the full power of non-mixing constraints, allowing assertions over \(\mathsf {sig}_{\sim }\) and \(\mathsf {sig}_{\nsim }\).

4.6 Data Cut and Clique-Width

A classical measure of simplicity for relational structures is that of clique-width [11]. As has been noticed before for unordered data trees, clique-width and data cut are related [7]. We shall now reexamine briefly this relationship for ordered data trees, and in the following subsection we shall see how it can be used to extend our decidability results to constraints with much more expressive selector queries.

Let \(\tau = \{R_{1},\dots ,R_{\ell }\}\) be a relational signature, that is, a set of predicate symbols with arities \({\text {ar}}(R_{i})\). A (finite) τ-structure \({\mathbb {A}}\) is a tuple \(\langle A, R_{1}^{{\mathbb {A}}}, \dots , R_{\ell }^{{\mathbb {A}}} \rangle \) consisting of a finite universe A and relations \(R_{i}^{{\mathbb {A}}} \subseteq A^{{\text {ar}}(R_{i})}\) (interpretations of the predicates). A k-coloured τ-structure is a pair \(({\mathbb {A}}, \gamma )\), consisting of a τ-structure \({\mathbb {A}}\) and a mapping \(\gamma : A \to \{1,\dots ,k\}\), assigning colours to elements of the universe of \({\mathbb {A}}\).

Clique-width of structures is defined by means of an appropriate notion of decomposition, traditionally known as k-expression (over τ). It is defined as a term over the following set of operations (function symbols) Op(τ,k):

  • \({\text {new}}(i)\) for \(1 \leq i \leq k\), nullary,

  • \({\text {col}}(i,j)\) for \(1 \leq i,j \leq k\), unary,

  • \(R(i_{1},\dots ,i_{r})\) for predicates \(R \in \tau \) of arity r and \(1 \leq i_{1},\dots ,i_{r} \leq k\), unary,

  • ⊕, binary.

With each k-expression e we associate a k-coloured τ-structure \([\![ e ]\!]\):

  • \([\![ {\text {new}}(i) ]\!]\) is a structure with a single element, coloured i, and empty relations;

  • \([\![ {\text {col}}(i,j)(e) ]\!]\) is obtained from \([\![ e ]\!]\) by recolouring all elements of colour i to j;

  • \([\![ R(i_{1},\dots ,i_{r})(e) ]\!]\) is obtained from \([\![ e ]\!] = ({\mathbb {A}}, \gamma )\) by adding to \(R^{[\![ e ]\!]}\) all tuples \((a_{1},\dots ,a_{r})\) such that \(a_{j} \in A\) and \(\gamma (a_{j}) = i_{j}\) for \(1 \leq j \leq r\);

  • \([\![ e \oplus e^{\prime } ]\!]\) is the disjoint union of \([\![ e ]\!]\) and \([\![ e^{\prime } ]\!]\).

A k-expression for \({\mathbb {A}}\) is any k-expression e such that \([\![ e ]\!] = ({\mathbb {A}}, \gamma )\) for some γ. The clique-width of \({\mathbb {A}}\) is the least k such that there exists a k-expression for \({\mathbb {A}}\).
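The semantics of k-expressions can be sketched as a small evaluator in Python. The representation is an assumption made for this illustration: a coloured structure is a pair of a dictionary from elements (integers) to colours and a dictionary from relation names to sets of tuples, and the four functions mirror the four kinds of operations. The helper `clique` and the relation name `"E"` are hypothetical and serve only as a usage example.

```python
from itertools import product

def new(i):
    """[[new(i)]]: a single element of colour i, all relations empty."""
    return ({0: i}, {})

def col(i, j, e):
    """[[col(i,j)(e)]]: recolour every element of colour i to colour j."""
    colour, rels = e
    return ({a: (j if c == i else c) for a, c in colour.items()}, rels)

def add(R, colours, e):
    """[[R(i_1,...,i_r)(e)]]: add to R all tuples whose j-th entry has colour i_j."""
    colour, rels = e
    by_colour = [[a for a, c in colour.items() if c == i] for i in colours]
    new_rels = {name: set(ts) for name, ts in rels.items()}
    new_rels.setdefault(R, set()).update(product(*by_colour))
    return (colour, new_rels)

def union(e1, e2):
    """[[e1 ⊕ e2]]: disjoint union; elements of e2 are shifted to stay fresh."""
    (c1, r1), (c2, r2) = e1, e2
    shift = len(c1)                      # elements of e1 are 0 .. len(c1) - 1
    colour = dict(c1)
    colour.update({a + shift: c for a, c in c2.items()})
    rels = {name: set(ts) for name, ts in r1.items()}
    for name, ts in r2.items():
        rels.setdefault(name, set()).update(
            tuple(a + shift for a in t) for t in ts)
    return (colour, rels)

def clique(n):
    """A 2-expression for the complete directed graph on n >= 1 vertices."""
    e = new(1)
    for _ in range(n - 1):
        e = union(e, new(2))             # add a fresh vertex of colour 2
        e = add("E", (2, 1), add("E", (1, 2), e))
        e = col(2, 1, e)                 # merge it into the old colour class
    return e
```

The `clique` example uses only two colours for every n, which illustrates why cliques have clique-width at most 2 even though they are far from tree-like.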

Example 2

Consider a data tree \(t_{n}\) consisting of a root w and two branches \(u_{1},u_{2},\dots ,u_{n}\) and \(v_{1},v_{2},\dots ,v_{n}\), in which all nodes have label a and the data values correspond to the node’s depth in the tree, as shown in Fig. 4. Then, \(t_{u_{1}}\) and \(t - t_{u_{1}}\) share n different data values, and the data cut of \(t_{n}\) is n.

Fig. 4: A tree of data-cut n and clique width bounded by 7

Let us see \(t_{n}\) as a relational structure over the signature \(\mathsf {sig}_{\textit {nav}}\cup \{\sim \}\), where ∼ is interpreted as the equivalence relation with equivalence classes {w} and \(\{u_{i},v_{i}\}\) for i = 1,2,⋯,n. We claim that the clique width of \(t_{n}\) is bounded by 7: if we construct \(t_{n}\) top-down, level by level, at any point of the construction it is enough to distinguish between the root, the internal nodes on two branches, the two current leaves, and the two new nodes.

To see this, begin with a node of colour \(\textsf {root}\) and set its label with \(a(\textsf {root})\), add two nodes of colours \(\textsf {leaf}_{1}\) and \(\textsf {leaf}_{2}\) with relations specified by

$$\begin{array}{@{}rcl@{}} && a(\textsf{leaf}_{i}), \;\downarrow(\textsf{root},\textsf{leaf}_{i}), \; \downarrow^+(\textsf{root},\textsf{leaf}_{i}),\;\rightarrow(\textsf{leaf}_{1},\textsf{leaf}_{2}),\;\rightarrow^+(\textsf{leaf}_{1},\textsf{leaf}_{2}), \\ &&\sim(\textsf{leaf}_{1},\textsf{leaf}_{2}) \end{array} $$

for i = 1,2 and then repeat the following n − 1 times: add two nodes of colours \(\textsf {new}_{1}\) and \(\textsf {new}_{2}\) with relations specified by

$$\begin{array}{@{}rcl@{}} && a(\textsf{new}_{i}),\; \downarrow(\textsf{leaf}_{i},\textsf{new}_{i}), \; \downarrow^+(\textsf{root},\textsf{new}_{i}),\; \downarrow^+(\textsf{internal}_{i},\textsf{new}_{i}),\; \downarrow^+(\textsf{leaf}_{i},\textsf{new}_{i}), \\ && \sim(\textsf{new}_{1},\textsf{new}_{2}) \end{array} $$

and recolour using \({\text {col}}(\textsf {leaf}_{i},\textsf {internal}_{i})\), \({\text {col}}(\textsf {new}_{i},\textsf {leaf}_{i})\) for i = 1,2.
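The construction of Example 2 can be replayed in a few lines of Python to check that seven colours suffice. This is a sketch under assumptions: the class `Coloured`, the relation names `"child"`, `"desc"`, `"next"`, `"next+"` and `"sim"`, and the node names are all hypothetical, and the structure is updated in place rather than built as a term. As in the example, only the pairs \((u_{d}, v_{d})\) are added to ∼; the symmetric and reflexive pairs would need analogous operations.

```python
from itertools import product

class Coloured:
    """A coloured structure under construction (mutable, for brevity)."""
    def __init__(self):
        self.colour = {}    # element -> current colour
        self.rels = {}      # relation name -> set of tuples
        self.used = set()   # every colour used at some point

    def new(self, name, colour):     # disjoint union with new(colour)
        self.colour[name] = colour
        self.used.add(colour)

    def add(self, rel, *colours):    # the operation rel(colours...)
        groups = [[a for a, c in self.colour.items() if c == i] for i in colours]
        self.rels.setdefault(rel, set()).update(product(*groups))

    def col(self, i, j):             # the operation col(i, j)
        for a, c in self.colour.items():
            if c == i:
                self.colour[a] = j
        self.used.add(j)

def build(n):
    """Build t_n level by level, following the operations listed in the example."""
    s = Coloured()
    s.new("w", "root")
    s.add("a", "root")
    s.new("u1", "leaf1")
    s.new("v1", "leaf2")
    for i in ("1", "2"):
        s.add("a", "leaf" + i)
        s.add("child", "root", "leaf" + i)
        s.add("desc", "root", "leaf" + i)
    s.add("next", "leaf1", "leaf2")
    s.add("next+", "leaf1", "leaf2")
    s.add("sim", "leaf1", "leaf2")
    for d in range(2, n + 1):
        s.new(f"u{d}", "new1")
        s.new(f"v{d}", "new2")
        for i in ("1", "2"):
            s.add("a", "new" + i)
            s.add("child", "leaf" + i, "new" + i)
            for anc in ("root", "internal" + i, "leaf" + i):
                s.add("desc", anc, "new" + i)
        s.add("sim", "new1", "new2")
        for i in ("1", "2"):
            s.col("leaf" + i, "internal" + i)   # old leaves become internal
            s.col("new" + i, "leaf" + i)        # new nodes become the leaves
    return s
```

Calling `build(n)` for any n ≥ 2 touches exactly the seven colours named in the example, independently of n, while the number of shared data values between the two branches grows with n.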

The example shows that trees of bounded clique width can have arbitrarily large data cut. We shall now see that bounded data cut implies bounded clique width.

For each set \(D\subseteq {\mathbb {D}}\), data trees can be seen as relational structures over the signature \(\mathsf {sig}_{\textit {nav}} \cup \mathsf {sig}_{\sim }^{D}\), where \(\mathsf {sig}_{\sim }^{D} = \{\sim \} \cup D\); that is, we restrict the unary predicates in \(\mathsf {sig}_{\sim }\) to those associated to data values from D.

Proposition 2

For each finite \(D\subseteq {\mathbb {D}}\) and each data tree t seen as a relational structure over \(\mathsf {sig}_{\textit {nav}} \cup \mathsf {sig}_{\sim }^{D}\),

$$\textit{cliquewidth}(t) \leq 4\cdot \left( \frac{3}{2}\cdot\textit{datacut}(t) + 2 + |D|\right)\,. $$

Proof

Let C = {1,2,⋯,N} with \(N = \left \lfloor \frac {3}{2}\cdot \textit {datacut}(t)\right \rfloor + 1\). By Lemma 4, there exists a tree s over the alphabet \({\Gamma }\times (C\cup D) \times {\mathbb {P}(C)}\) such that the encoded data tree \(\widehat {s}\) equals t up to a permutation of \({\mathbb {D}} - D\). That is, \(\widehat {s}\) and t are equal when seen as relational structures over \(\mathsf {sig} _{\textit {nav}} \cup \mathsf {sig} _{\sim }^{D}\). We shall turn s into a 4(|C| + |D| + 1)-expression for \(\widehat {s}\), interpreting colours as elements of the set

$$\{ \textsf{last-root},\textsf{other-root}, \textsf{new-root}, \textsf{not-root}\}\times \big(C \cup D\cup \{\bot\} \big)\,. $$
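For orientation, the size of this colour set can be checked against the bound claimed in Proposition 2. The following is only a restatement of the arithmetic, using \(|C| = N\) as fixed above:

$$4\cdot \big(|C| + |D| + 1\big) = 4\cdot \left( \left\lfloor \frac{3}{2}\cdot \textit{datacut}(t)\right\rfloor + 2 + |D|\right) \leq 4\cdot \left( \frac{3}{2}\cdot \textit{datacut}(t) + 2 + |D|\right)\,. $$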

Processing the nodes of s in the usual order (bottom-up and left-to-right), for each node w we construct a 4(|C| + |D| + 1)-expression \(e_{w}\) such that

$$[\![ e_{w} ]\!] = (\widehat{s_{w}}, \gamma) $$

and for each node u the colour γ(u) satisfies the following properties:

  1. the first coordinate describes the status of the node u in the forest \(\widehat {s_{w}}\): the last root, one of the other roots, or not a root (the value new-root will be used later);

  2. the second coordinate is the data value stored in the node in \(\widehat {s_{w}}\) if this value belongs to \(C \cup D\), or ⊥ if it does not.

Let w be a node of s, labelled with (a,c,R). To build \(e_{w}\), we begin by creating a new node and specifying the unary relations for it (label and data value) with operations

$${\text{new}}\big((\textsf{new-root}, c)\big)\,,\; a\big((\textsf{new-root}, c)\big)\,, \; c\big((\textsf{new-root}, c)\big)\,,$$

where the last operation is included only if \(c \in D\) (that is, c is in the signature). If w has children, let \(w^{\prime \prime }\) be its last child. Then, the expression \(e_{w^{\prime \prime }}\) is already constructed and we incorporate it into the expression \(e_{w}\) as follows:

  • combine the expression built so far with \(e_{w^{\prime \prime }}\) using the operation ⊕;

  • specify structural relations between the two parts using the operations

    $$\begin{array}{@{}rcl@{}}\begin{array}{ll} \downarrow\big((\textsf{new-root}, c), (\textsf{last-root}, d)\big)\,, & \downarrow^+\big((\textsf{new-root}, c), (\textsf{last-root},d) \big)\,,\\ \downarrow\big((\textsf{new-root}, c), (\textsf{other-root},d) \big)\,, & \downarrow^+\big((\textsf{new-root}, c), (\textsf{other-root},d) \big)\,, \\ & \downarrow^+\big((\textsf{new-root}, c), (\textsf{not-root},d) \big) \end{array} \end{array} $$

    for all \(d \in C \cup D \cup \{\bot \}\);

  • change last-root and other-root to not-root with the operations

    $$\begin{array}{ll} {\text{col}}\big((\textsf{last-root}, d), (\textsf{not-root}, d)\big)\,, & \quad{\text{col}}\big((\textsf{other-root}, d), (\textsf{not-root}, d)\big) \end{array} $$

    for all \(d \in C \cup D \cup \{\bot \}\).

Similarly, if \(w^{\prime }\) is the previous sibling of w, we incorporate the expression \(e_{w^{\prime }}\) as follows:

  • combine the expression built so far with \(e_{w^{\prime }}\) using the operation ⊕;

  • specify structural relations between the two parts using the operations

    $$\begin{array}{@{}rcl@{}} \begin{array}{ll} \rightarrow\big((\textsf{last-root}, d), (\textsf{new-root}, c)\big)\,, & \quad \rightarrow^+\big((\textsf{last-root}, d), (\textsf{new-root},c) \big)\,,\\ &\quad \rightarrow^+\big((\textsf{other-root}, d), (\textsf{new-root},c) \big) \end{array} \end{array} $$

    for all \(d \in C \cup D \cup \{\bot \}\);

  • change last-root to other-root with the operations

    $${\text{col}}\big((\textsf{last-root}, d), (\textsf{other-root}, d)\big) $$

    for all \(d \in C \cup D \cup \{\bot \}\).

Finally, we take care of data equalities and clean up the colours:

  • specify data equalities between the combined parts using the operations

    $$\sim\left( (\xi, d), (\zeta, d)\right) $$

    for all ξ, ζ ∈ {new-root, other-root, not-root} and d ∈ C ∪ D;

  • change new-root to last-root with the operation

    $${\text{col}}\big((\textsf{new-root}, c), (\textsf{last-root}, c)\big)\,; $$

  • refresh the colours with the operations

    $${\text{col}}\left( (\xi, d), (\xi, \bot)\right) $$

    for all ξ ∈ {last-root, other-root, not-root} and d ∈ R.

By construction, the resulting expression \(e_{w}\) satisfies properties 1 and 2. □

Thus, bounded data cut is a stronger property than bounded clique-width. It can be seen as a strengthening of bounded clique-width for data trees, in which decompositions must closely follow the structure of data trees.
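The navigational part of the construction in the proof above can be traced on a small example. The following is a minimal Python sketch, not the authors' implementation: it ignores data values (colours are reduced to their first component, and the operations ∼ and the refresh step are omitted), and the names `Structure`, `connect`, `recolour` and `build` are hypothetical. It only illustrates how the operations ⊕, ↓, ↓⁺, →, →⁺ and col recover the tree structure when nodes are processed bottom-up and left to right.

```python
from itertools import product


class Structure:
    """Coloured relational structure: a colour per node plus four edge sets."""

    def __init__(self, colour_of=None):
        self.colour_of = dict(colour_of or {})
        self.edges = {"child": set(), "desc": set(), "next": set(), "foll": set()}

    def union(self, other):
        # The operation ⊕: disjoint union (node names are assumed to be distinct).
        out = Structure({**self.colour_of, **other.colour_of})
        for rel in out.edges:
            out.edges[rel] = self.edges[rel] | other.edges[rel]
        return out

    def connect(self, rel, src, dst):
        # Add a rel-edge from every src-coloured node to every dst-coloured node.
        for u, v in product(list(self.colour_of), repeat=2):
            if self.colour_of[u] == src and self.colour_of[v] == dst:
                self.edges[rel].add((u, v))

    def recolour(self, old, new):
        # The operation col(old, new).
        for u, c in list(self.colour_of.items()):
            if c == old:
                self.colour_of[u] = new


def build(siblings):
    """Structure for the forest `siblings`, a non-empty list of (name, children) trees."""
    name, kids = siblings[-1]
    e = Structure({name: "new-root"})
    if kids:  # incorporate the expression for the last child
        e = e.union(build(kids))
        for rel, dst in [("child", "last-root"), ("desc", "last-root"),
                         ("child", "other-root"), ("desc", "other-root"),
                         ("desc", "not-root")]:
            e.connect(rel, "new-root", dst)
        e.recolour("last-root", "not-root")
        e.recolour("other-root", "not-root")
    if len(siblings) > 1:  # incorporate the expression for the previous sibling
        e = e.union(build(siblings[:-1]))
        e.connect("next", "last-root", "new-root")
        e.connect("foll", "last-root", "new-root")
        e.connect("foll", "other-root", "new-root")
        e.recolour("last-root", "other-root")
    e.recolour("new-root", "last-root")
    return e


# Sample tree: r has children a and d; a has children b and c.
tree = ("r", [("a", [("b", []), ("c", [])]), ("d", [])])
e = build([tree])
print(sorted(e.edges["child"]))
print(sorted(e.edges["next"]))
```

On the sample tree, the edges produced by the sketch coincide with the child, descendant, next-sibling and following-sibling relations of the tree, and only three colours are ever needed at the same time in this data-free setting.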

4.7 MSO Constraints

Our decidability results for consistency and entailment of non-mixing constraints can be naturally extended by allowing selectors expressed in monadic second-order logic (MSO), a powerful extension of first-order logic in which quantification over subsets of the universe is available. However, as is usually the case when MSO is involved, the complexity will be non-elementary.

The syntax of MSO formulae over \(\mathsf{sig}_{\text{nav}}\) is

$$\varphi, \psi ::= \exists X \,\varphi \,\big|\, \exists x \,\varphi \,\big|\, \varphi \land \psi\,\big|\, \lnot \varphi \,\big|\, x \in X \,\big|\, x \downarrow y \,\big|\, x \downarrow^+ y \,\big|\, x \rightarrow y \,\big|\, x \rightarrow^+ y \,\big|\, a(x) $$

for a ∈ Γ; the semantics is the natural one, with the usual distinction between first-order variables (lower case) referring to elements of the universe and second-order variables (upper case) referring to subsets of the universe.

We consider MSO constraints of the form

$$\varphi(\bar{x}) \Rightarrow \eta_{\sim}(\bar{x})\land \eta_{\nsim}(\bar{x})\,, $$

where the selector \(\varphi (\bar {x})\) is an MSO formula over \(\mathsf{sig}_{\text{nav}}\) in which all free variables are first-order.

As a first step, we reprove the bound on the data cut, shown in Lemma 3. Instead of using Lemma 2 we rely on the compositionality of MSO. For a forest f over Γ, a tuple \(\bar v = (v_{1}, \dots , v_{m})\) of nodes of f, and a tuple \(\bar V = (V_{1}, \dots , V_{n})\) of sets of nodes of f, let

$$\langle f;\bar v;\bar V\rangle $$

be the forest over the alphabet \({\Gamma} \times \{0,1\}^{m+n}\) obtained from f by extending labels with binary vectors encoding \(\bar v\) and \(\bar V\): a node w is labelled with

$$(\mathsf{lab}_{f}(w), e_{1}, \dots, e_{m}, E_{1}, \dots, E_{n})\,, $$

where \(e_{i} = 1\) if and only if \(w = v_{i}\), and \(E_{j} = 1\) if and only if \(w \in V_{j}\). If \(\bar v\) or \(\bar V\) is empty, we skip it and write, for instance, \(\langle f ; \bar v \rangle \). It is well known that for a given MSO formula

$$\varphi(x_{1}, \dots, x_{m}, X_{1}, \dots, X_{n} ) $$

one can effectively construct a deterministic automaton \(\mathcal {A}_{\varphi }\) (of non-elementary size) recognizing the tree language

$$\left\{ \langle t; \bar v; \bar V\rangle \, \middle | \, t\models \varphi(\bar v, \bar V) \right\}\,. $$
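As a concrete reading of the encoding \(\langle f;\bar v;\bar V\rangle \) defined above, here is a minimal Python sketch. The representation of a forest as a list of `(node_id, label, children)` triples and the function name `annotate` are assumptions made only for this illustration.

```python
def annotate(forest, vs, Vs):
    """Extend each label with one bit per node in `vs` and one bit per set in `Vs`.

    forest: list of (node_id, label, children) triples, children being a forest again
    vs:     tuple of node ids (the first-order part of the valuation)
    Vs:     tuple of sets of node ids (the second-order part of the valuation)
    """
    def go(node):
        node_id, label, children = node
        bits = tuple(int(node_id == v) for v in vs) + tuple(int(node_id in V) for V in Vs)
        return (node_id, (label,) + bits, [go(child) for child in children])

    return [go(tree) for tree in forest]


# A tree with root 0 (label a) and two leaves 1 (label b) and 2 (label a).
forest = [(0, "a", [(1, "b", []), (2, "a", [])])]
encoded = annotate(forest, vs=(1,), Vs=({0, 2},))
print(encoded)
```

The automaton \(\mathcal {A}_{\varphi }\) mentioned above reads forests whose labels are extended in this way.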

The construction follows the syntactic structure of MSO formulas: it begins with explicit automata for atomic formulas, and then turns logical connectives into Boolean operations on automata, and existential quantification into projecting out the corresponding binary coordinate from the alphabet. Defining the φ-type of a forest f over \({\Gamma} \times \{0,1\}^{m+n}\) as the state of the automaton \(\mathcal {A}_{\varphi }\) in the root of the last tree of f (in the unique run over f), we obtain the following analogue of Lemma 2.

Lemma 8

Let \(\varphi (\bar {x}, \bar {y})\) be an MSO formula over \(\mathsf{sig}_{\text{nav}}\), where \(\bar {x}\) and \(\bar y\) are disjoint tuples of first-order variables, and let w be a node of a data tree t. For all tuples \(\bar u, \bar u^{\prime }\) of nodes from \(t_{w}\) and tuples \(\bar v, \bar v^{\prime }\) of nodes from \(t - t_{w}\), if

$$t\models \varphi(\bar u, \bar v) \quad \text {and } \quad t \models \varphi(\bar u^{\prime}, \bar v^{\prime})\,, $$

and the φ-types of \(\langle t_{w}; \bar u\rangle \) and \(\langle t_{w}; \bar u^{\prime }\rangle \) are equal, then

$$t\models \varphi(\bar u, \bar v^{\prime}) \quad \textrm{ and } \quad t\models \varphi(\bar u^{\prime},\bar v)\,. $$

Proof

As \(t\models \varphi (\bar u, \bar v)\) and \(t\models \varphi (\bar u^{\prime }, \bar v^{\prime })\), the trees \(\langle t; \bar u,\bar v \rangle \) and \(\langle t; \bar u^{\prime }, \bar v^{\prime }\rangle \) are accepted by the deterministic automaton \(\mathcal {A}_{\varphi }\). Moreover, the state in the node w in the unique runs of \(\mathcal {A}_{\varphi }\) over these trees is the same, because \(\langle t; \bar u,\bar v \rangle _{w} = \langle t_{w}; \bar u\rangle \) and \(\langle t; \bar u^{\prime },\bar v^{\prime } \rangle _{w} = \langle t_{w}; \bar u^{\prime }\rangle \), and the φ-types of \(\langle t_{w}; \bar u\rangle \) and \(\langle t_{w}; \bar u^{\prime }\rangle \) are equal. Hence, swapping \(\langle t; \bar u,\bar v \rangle _{w}\) and \(\langle t; \bar u^{\prime },\bar v^{\prime } \rangle _{w}\) does not affect acceptance by \(\mathcal {A}_{\varphi }\). That is, the resulting trees \(\langle t; \bar u^{\prime },\bar v \rangle \) and \(\langle t; \bar u, \bar v^{\prime }\rangle \) are accepted by \(\mathcal {A}_{\varphi }\). Consequently, \(t\models \varphi (\bar u^{\prime }, \bar v)\) and \(t\models \varphi (\bar u, \bar v^{\prime })\). □
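The swapping argument can be observed on a toy instance. The sketch below is only an illustration under simplifying assumptions: it hard-codes a deterministic bottom-up automaton for the sample formula φ(x, y) saying that x and y carry the same letter, it takes the type of a subtree to be the state at its root, and it assumes that each of x and y marks at most one node. All names (`step`, `run`, `encode`, `phi_type`, `holds`) are hypothetical.

```python
def step(label, child_states):
    """Deterministic transition; a state is (letter of the x-node, letter of the y-node),
    with None for a marked node that does not occur in the subtree."""
    letter, is_x, is_y = label
    x = letter if is_x else None
    y = letter if is_y else None
    for cx, cy in child_states:
        x = x or cx
        y = y or cy
    return (x, y)


def run(tree):
    label, children = tree
    return step(label, [run(child) for child in children])


def encode(tree, x=None, y=None):
    """Extend labels with two bits marking the nodes chosen for x and y."""
    node, letter, children = tree
    return ((letter, node == x, node == y), [encode(c, x, y) for c in children])


def subtree(tree, w):
    node, _, children = tree
    if node == w:
        return tree
    for child in children:
        found = subtree(child, w)
        if found is not None:
            return found
    return None


def phi_type(tree, w, u):
    """State reached at w when only the x-node u inside the subtree of w is marked."""
    return run(encode(subtree(tree, w), x=u))


def holds(tree, u, v):
    x, y = run(encode(tree, x=u, y=v))
    return x is not None and x == y


# Nodes 2, 3, 6 lie in the subtree of w = 1; nodes 4, 5, 7 lie outside it.
t = (0, "r", [(1, "c", [(2, "a", []), (3, "a", []), (6, "b", [])]),
              (4, "a", []), (5, "a", []), (7, "b", [])])

same_type = phi_type(t, 1, 2) == phi_type(t, 1, 3)
print(same_type, holds(t, 2, 5), holds(t, 3, 4))
print(phi_type(t, 1, 2) == phi_type(t, 1, 6), holds(t, 2, 7))
```

In this example φ holds for the pairs (2, 4) and (3, 5), the types for u = 2 and u′ = 3 coincide, and φ also holds for the swapped pairs (2, 5) and (3, 4). For u′ = 6 the type differs, and indeed φ(2, 7) fails although φ(6, 7) holds, so the assumption on types cannot simply be dropped.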

Now, we can show a bound on the data cut for non-mixing MSO constraints. Unlike in Lemma 3, the bound is non-elementary: it is proportional to the maximal size of the automata for the MSO formulas used in the constraints.

Lemma 9

If a set \({\Sigma }_{\sim }\cup {\Sigma }_{\nsim }\) of non-mixing MSO constraints is satisfied in a data tree t, it is also satisfied in some data tree t′ obtained from t by changing data values, such that

$$\textit{datacut}(t^{\prime}) \leq S \cdot \ell \cdot 2^{\ell}\cdot (\ell +m)^{\ell^{2}} \cdot |{\Sigma}_{\sim}|\,,$$

where ℓ and m are the maximal numbers of, respectively, free variables and predicates from \({\mathbb {D}}\cup \check {\mathbb {D}}\) in the constraints from \({\Sigma}_{\sim}\), and S is the maximal number of types for the selector formulas in the constraints from \({\Sigma}_{\sim}\).

The same is true for counter-examples to entailment, except that the bound on the data cut needs to be increased by the number of variables in the violated assertion.

Proof

We proceed just like for Lemma 3. Let us take a node w of the data tree t and an MSO constraint \(\varphi (\bar {x}, \bar y) \Rightarrow \eta _{\sim }(\bar {x}, \bar y)\) from \({\Sigma}_{\sim}\) with a fixed partition \(\bar {x}\), \(\bar y\) of the free variables of φ. Using Lemma 8, we arrive at the following condition: for all tuples \(\bar u, \bar u^{\prime }\) of nodes from \(t_{w}\) and all tuples \(\bar v, \bar v^{\prime }\) of nodes from \(t - t_{w}\), if \(t\models \varphi (\bar u, \bar v)\), \(t\models \varphi (\bar u^{\prime }, \bar v^{\prime })\), and the φ-types of \(\langle t_{w}; \bar u\rangle \) and \(\langle t_{w}; \bar u^{\prime }\rangle \) are equal, then \(t\models \eta _{\sim }(\bar u, \bar v^{\prime })\). This can be reformulated as follows: for each tuple \(\bar u\) of nodes from \(t_{w}\) such that the φ-type of \(\langle t_{w}; \bar u \rangle \) is q and \(t\models \varphi (\bar u, \bar v)\) for some tuple \(\bar v\) of nodes from \(t - t_{w}\), the tuple \(\mathsf {val}_{t}(\bar u)\) of data values belongs to the set

$$Z^{q}_{\varphi(\bar{x}, \bar y)\Rightarrow \eta_{\sim}(\bar{x}, \bar y)} = \bigcap_{\bar v^{\prime}} \big \{ \bar c \in {\mathbb{D}}^{|\bar{x}|}\bigm | \eta(\bar c, \mathsf{val}_{t}(\bar v^{\prime}))\big\}\,, $$

where \(\bar v^{\prime }\) ranges over tuples of nodes from \(t - t_{w}\) satisfying \(t\models \varphi (\bar u^{\prime }, \bar v^{\prime })\) for some tuple \(\bar u^{\prime }\) of nodes from \(t_{w}\) such that the φ-type of \(\langle t_{w}; \bar u^{\prime } \rangle \) is q.

Like before, we modify the tree t by changing to a fresh one each data value used in \(t_{w}\), except for those from the set \(D\subseteq {\mathbb {D}}\) of data values used in the definitions of the sets \(Z^{q}_{\varphi (\bar {x}, \bar y)\Rightarrow \eta _{\sim }(\bar {x}, \bar y)}\), with \(\varphi (\bar {x}, \bar y)\Rightarrow \eta _{\sim }(\bar {x}, \bar y)\) ranging over constraints from \({\Sigma}_{\sim}\) with all possible partitions of free variables, and q ranging over all possible φ-types. As the bound on the number of data values used in the canonical definition of a single set \(Z^{q}_{\varphi (\bar {x}, \bar y)\Rightarrow \eta _{\sim }(\bar {x}, \bar y)}\) remains unchanged, we have \(|D|\leq |{\Sigma }_{\sim }|\cdot 2^{\ell }\cdot S \cdot \left (\ell \cdot (m + \ell )^{\ell ^{2}}\right ) \). Performing this modification for each node w, we guarantee the bound on data cut as stated in the lemma.

For the second claim, extend the set D with the data values used in the tuple of nodes violating the assertion, as described in Section 4.1 for constraints with CQ selectors. □
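To get a feel for how the bound of Lemma 9 grows, it can be evaluated for small parameters. The following sketch simply computes the expression from the lemma; the function name `datacut_bound` and the sample values are arbitrary choices for illustration, and in general S itself is non-elementary in the size of the selector formulas.

```python
def datacut_bound(S, l, m, n_sim):
    """Evaluate S * l * 2^l * (l + m)^(l^2) * |Sigma_sim|.

    S:     maximal number of types for the selector formulas
    l:     maximal number of free variables in a constraint
    m:     maximal number of data predicates in a constraint
    n_sim: number of constraints in Sigma_sim
    """
    return S * l * 2**l * (l + m) ** (l**2) * n_sim


# Even with a single type and a single constraint, the bound grows quickly with l.
for l in (1, 2, 3):
    print(l, datacut_bound(S=1, l=l, m=1, n_sim=1))
```

For instance, with S = 1, m = 1 and a single constraint, the value is 4 for ℓ = 1 and 648 for ℓ = 2.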

As each set of MSO constraints can be rewritten as a single MSO formula over the signature \(\mathsf{sig}_{\text{nav}} \cup \mathsf{sig}_{\sim}\), by Lemma 9 and Proposition 2 from Section 4.6, the consistency problem and the entailment problem reduce to satisfiability of MSO over structures of bounded clique-width (one has to ensure that the structure is indeed a data tree, but this can be easily expressed in MSO). As the latter is known to be decidable [10], we immediately obtain decidability of consistency and entailment. For completeness, we give a direct proof, avoiding the notion of clique-width.

Theorem 4

Consistency and entailment of non-mixing MSO constraints are decidable.

Proof

Let \({\Sigma }_{\sim }\cup {\Sigma }_{\nsim }\) be a set of non-mixing MSO constraints and let \(\mathcal {A}\) be a tree automaton. By Lemma 9, it is enough to test satisfiability of \({\Sigma }_{\sim }\cup {\Sigma }_{\nsim }\) over trees of data cut bounded by a number N, computable from \({\Sigma }_{\sim }\cup {\Sigma }_{\nsim }\). Let \(D\subseteq {\mathbb {D}}\) be the set of data values used explicitly in \({\Sigma }_{\sim }\cup {\Sigma }_{\nsim }\), and let \(C\subseteq {\mathbb {D}}-D\) be a fixed set such that \(|C|=\left \lfloor \frac {3}{2}\cdot N\right \rfloor + 1\). Like before, by Lemma 4, the proof boils down to constructing an automaton recognizing the set of trees t over \({\Gamma }\times (C\cup D) \times {\mathbb {P}(C)}\) such that the data tree \(\widehat {t}\) satisfies \({\Sigma }_{\sim }\cup {\Sigma }_{\nsim }\). Each MSO constraint \(\varphi (\bar {x}) \Rightarrow \eta (\bar {x})\) is equivalent to a closed MSO formula \(\forall \bar {x}\, \big (\varphi (\bar {x}) \rightarrow \eta (\bar {x})\big )\) over the signature \(\mathsf{sig}_{\text{nav}} \cup \mathsf{sig}_{\sim}\) (in the presence of negation, \(\mathsf {sig}_{\nsim }\) is redundant). Hence, a finite set of MSO constraints is equivalent to a conjunction of such formulas. Thus, it suffices to construct an automaton accepting trees t over \({\Gamma }\times (C \cup D) \times {\mathbb {P}(C)}\) such that \(\widehat {t}\) satisfies φ, where φ is an MSO formula over \(\mathsf{sig}_{\text{nav}} \cup \mathsf{sig}_{\sim}\), using only predicates associated with data values from the set D.

We modify the standard construction of the automaton \(\mathcal {A}_{\varphi }\) for a formula φ of MSO over \(\mathsf{sig}_{\text{nav}}\). As the structure of the tree and the labelling with elements of Γ are the same in t and \(\widehat {t}\), we only need to provide explicit constructions for the atomic formulas over \(\mathsf{sig}_{\sim}\).

For formulas of the form d(x), we have d ∈ D, so the data value d is represented explicitly in t. Hence, the automaton simply identifies the node with value 1 in the corresponding binary coordinate of the label, and accepts if and only if the non-binary component of this label is (a, d, R) for some a and R.

For formulas of the form x ∼ x′, the automaton also identifies the nodes x and x′ in the input tree t, and then accepts if they are labelled with (a, d, R) and (a′, d, R′) for some a, a′, R, R′, and additionally, if d ∈ C, then d is not refreshed before reaching the first node w such that \(t_{w}\) contains both x and x′.

For entailment, we use the additional claim of Lemma 9 and include in D also the data values used explicitly in the second set of constraints, \({\Sigma }^{\prime }_{\sim }\cup {\Sigma }^{\prime }_{\nsim }\). As the fact that \({\Sigma }_{\sim }\cup {\Sigma }_{\nsim }\) holds and \({\Sigma }^{\prime }_{\sim }\cup {\Sigma }^{\prime }_{\nsim }\) does not hold can also be expressed with a single closed MSO formula, we can use directly the construction described above. □

Both approaches give non-elementary complexity, as already the bound of Lemma 9 is non-elementary. This cannot be improved, as the satisfiability problem for MSO over \(\mathsf{sig}_{\text{nav}}\), well known to be non-elementary, easily reduces to consistency of MSO constraints: the negation ¬φ of a closed formula φ is satisfiable if and only if

$$\big \{\varphi \land a(x) \,\Rightarrow\, 0(x)\land 1(x) \;\big| \; a\in{\Gamma}\big\} $$

is consistent with respect to the trivial automaton accepting all trees. Indeed, as every tree has at least one node and no node stores both 0 and 1, a tree satisfies all these constraints exactly when it does not satisfy φ.

5 Conclusions

We have shown that consistency and entailment of non-mixing constraints are decidable. Both problems are 2ExpTime-complete, but become ExpTime-complete when we restrict selector queries to tree patterns and bound the number of variables in assertions; decidability can be pushed further to constraints with selector queries defined in monadic second-order logic over the signature \(\mathsf{sig}_{\text{nav}}\), but the complexity becomes non-elementary. We have reinterpreted these results in terms of validity and containment of conjunctive queries, as well as consistency of schema mappings. The latter setting best illustrates the benefits of combining assertions over \(\mathsf{sig}_{\sim}\) and \(\mathsf {sig}_{\nsim }\). Indeed, equalities are involved even in the simplest schema mappings, and inequalities allow us to cover key constraints over the source database.

We worked with ordered trees, but all discussed results immediately carry over to unordered trees: as long as the signature does not contain the horizontal axes, one can freely move back and forth between ordered and unordered trees by forgetting the sibling order or introducing it arbitrarily. As both 2ExpTime lower bounds, the one from Lemma 5 and the one from [6], do not use the horizontal axes, they also hold for unordered trees. The same is true of the undecidability for the settings that mix equality and inequality [6]. Similarly, restricting to ranked trees does not change the picture: the upper bounds carry over immediately, and the lower bounds only use trees of bounded branching. The reductions can also be adapted to the case of unlabelled trees: one can simulate labels with unique small tree gadgets attached to the main nodes of the tree and use the automaton to ensure that each main node has exactly one gadget attached. However, referring to the gadgets with selector queries requires either the next sibling or the following sibling relation. For unordered unlabelled trees the complexity might drop.

One might also ask how the presence of the schema affects the complexity. The fact that we model schemas as tree automata is inessential: all lower bounds can be adjusted to the setting where the schema language is restricted to DTDs [6]. When there is no schema at all, the consistency problem trivializes, because if a tree satisfies a set of constraints, so does any tree obtained by removing nodes. Hence, it suffices to look for witnesses among trees with a single node, which leads to a polynomial-time algorithm. The question is more interesting for the entailment problem, because there the counter-example must contain enough nodes to falsify the non-entailed constraint. It is plausible that the complexity is lower than with a schema.