1 Introduction

When XML is used for document-centric applications, the relative order among elements is typically important, e.g., the relative order of paragraphs and chapters in a book. On the other hand, in data-centric XML applications, the order among elements may be unimportant [1]. In this paper we focus on the latter use case. As an example, take the trivialized fragment of the DBLP repository in Fig. 1. While the order of the elements title, author, and year may differ from one publication to another, it has no impact on the semantics of the data stored in this semi-structured database.

Fig. 1

A trivialized DBLP repository

Typically, a schema for XML defines for every node its content model, i.e., the children it must, may, and cannot contain. For instance, in the DBLP example, one would require every article to have exactly one title, exactly one year, and one or more authors. A book may additionally contain one publisher and may have one or more editors instead of authors. A schema has numerous important uses. For instance, it allows one to validate a document and identify potential errors. A schema also serves as a reference for a user who does not yet know the structure of the XML document and attempts to query or modify its content.
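For illustration, the intuitive content model for article can be checked directly on the multiset of children labels. The sketch below is ours and is not part of any schema formalism discussed in this paper; `valid_article` is a hypothetical helper.

```python
from collections import Counter

# Hypothetical check of the intuitive content model for "article":
# exactly one title, exactly one year, one or more authors, nothing else.
def valid_article(children):
    c = Counter(children)
    return (c["title"] == 1 and c["year"] == 1 and c["author"] >= 1
            and set(c) <= {"title", "year", "author"})

print(valid_article(["author", "title", "year", "author"]))  # True
print(valid_article(["title", "title", "year"]))             # False: two titles, no author
```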

The Document Type Definition (DTD), the most widespread XML schema formalism for (ordered) XML [8, 23], is essentially a set of rules associating with each label a regular expression that defines the admissible sequences of children. DTDs are best suited for ordered content because they use regular expressions, a formalism that defines sequences of labels. However, when an unordered content model needs to be defined, there is a tendency to use over-permissive regular expressions. For instance, the DTD below corresponds to the one used in practice for the DBLP repository:

$$\begin{array}{@{}rcl@{}} \textsl{dblp} \rightarrow &&(\textsl{article} \mid \textsl{book})^{*}\\ \textsl{article} \rightarrow && (\textsl{title} \mid \textsl{year} \mid \textsl{author})^{*} \\ \textsl{book} \rightarrow && (\textsl{title} \mid \textsl{year} \mid \textsl{author} \mid \textsl{editor} \mid \textsl{publisher})^{*} \end{array} $$

This DTD allows an article to contain any number of titles, years, and authors. A book may also have any number of titles, years, authors, editors, and publishers. These regular expressions are clearly over-permissive because they allow XML documents that do not follow the intuitive guidelines set out earlier, e.g., an XML document containing an article with two titles and no author should not be valid.

While it is possible to capture unordered content models with regular expressions, a simple pumping argument shows that their size may need to be exponential in the number of possible labels of the children. In the case of the DBLP repository, this number reaches values up to 12, which basically precludes any practical use of such regular expressions. This suggests that over-permissive regular expressions may be employed for reasons of conciseness and readability, a consideration of great practical importance.
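The blow-up can be made concrete: a plain regular expression for "exactly one occurrence of each of n labels, in any order" essentially enumerates all n! orders. A small illustrative sketch (the naive permutation-based construction, not a lower-bound proof):

```python
from itertools import permutations

# Illustrative sketch: expressing "exactly one occurrence of each of n labels,
# in any order" by naively enumerating all n! orders yields an expression
# whose size grows factorially with n.
def unordered_regex(labels):
    return "|".join("".join(p) for p in permutations(labels))

for n in range(1, 7):
    labels = [chr(ord("a") + i) for i in range(n)]
    print(n, len(unordered_regex(labels)))  # length is n * n! + (n! - 1)
```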

The use of over-permissive regular expressions, apart from allowing documents that do not follow the guidelines, has other negative consequences, e.g., in static analysis tasks that involve the schema. Take for example the following two twig queries [3, 47]:

$$\begin{array}{@{}rcl@{}} &/\textsl{dblp}/\textsl{book}[\textsl{author}=\text{``\(\mathit{C.\,Papadimitriou}\)''}]\\ &/\textsl{dblp}/\textsl{book}[\textsl{author}=\text{``\(\mathit{C.\,Papadimitriou}\)''}][\textsl{title}] \end{array} $$

The first query selects the elements labeled book, children of dblp and having an author containing the text “C. Papadimitriou.” The second query additionally requires that book has a title. Naturally, these two queries should be equivalent because every book should have a title. However, the DTD above does not properly capture this requirement, and consequently the two queries are not equivalent w.r.t. this DTD.

In this paper, we investigate schema languages for unordered XML. First, we study languages of unordered words, where an unordered word can be seen as a multiset of symbols. We consider unordered regular expressions (UREs), which are essentially regular expressions with unordered concatenation “∥” instead of standard concatenation. The unordered concatenation can be seen as union of multisets, and consequently, the star “∗” can be seen as the Kleene closure of unordered languages.

Similarly to a DTD which associates to each label a regular expression to define its (ordered) content model, an unordered schema uses UREs to define for each label its unordered content model. For instance, take the following schema (satisfied by the tree in Fig. 1):

$$\begin{array}{@{}rcl@{}} \textsl{dblp} \rightarrow && \textsl{article}^{*} \parallel \textsl{book}^{*}\\ \textsl{article} \rightarrow && \textsl{title} \parallel \textsl{year} \parallel \textsl{author}^{+}\\ \textsl{book} \rightarrow && \textsl{title} \parallel \textsl{year} \parallel \textsl{publisher}^? \parallel (\textsl{author}^{+}\mid\textsl{editor}^{+}) \end{array} $$

The above schema uses UREs and captures the intuitive requirements for the DBLP repository. In particular, an article must have exactly one title, exactly one year, and at least one author. A book may additionally have a publisher and may have one or more editors instead of authors. Note that, unlike the DTD defined earlier, this schema does not allow documents having an article with several titles or without any author.

Using UREs is equivalent to using DTDs with regular expressions interpreted under the commutative closure [4, 34]: essentially, a word matches the commutative closure of a regular expression if there exists a permutation of the word that matches the regular expression in the standard way. This problem is known to be NP-complete [26] for arbitrary regular expressions. We show that testing the membership of an unordered word to the language of a URE is NP-complete even for a restricted subclass of UREs that allows only unordered concatenation and the option operator “?”. Not surprisingly, testing the containment of two UREs is also intractable. These results are of particular interest because they are novel and do not follow from complexity results for regular expressions, where order typically plays an essential role [31, 46]. Consequently, we focus on finding restrictions that render UREs tractable and capable of capturing practical languages in a simple and concise manner.
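The permutation-based definition of the commutative closure translates directly into a brute-force check; the sketch below enumerates permutations (exponential in the word length, in line with the NP-completeness cited above). The single-letter encoding t, y, a for title, year, author is ours.

```python
import re
from itertools import permutations

# Brute-force check of the commutative closure: w matches iff some permutation
# of w matches the regular expression in the standard way. This enumeration is
# exponential in |w|; the underlying problem is NP-complete in general.
def commutative_match(w, regex):
    pattern = re.compile(regex)
    return any(pattern.fullmatch("".join(p)) is not None
               for p in set(permutations(w)))

# Single letters t, y, a stand for title, year, author (our encoding).
print(commutative_match("ayta", "tya+"))  # True: permutation "tyaa" matches
print(commutative_match("tt",   "tya+"))  # False: no permutation matches
```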

The first restriction is to disallow repetitions of a symbol in a URE, thus banning expressions of the form \(a\parallel a^{?}\) because the symbol a is used twice. Instead, we add general interval multiplicities \(a^{[1,2]}\), which offer a way to specify a range of occurrences of a symbol in an unordered word without repeating the symbol in the URE. While the complexity of the membership of an unordered word to the language of a URE with interval multiplicities and without symbol repetitions has recently been shown to be in PTIME [11], testing the containment of two such UREs remains intractable. We therefore add limitations on the nesting of the disjunction and unordered concatenation operators and on the use of intervals, which yields the proposed class of disjunctive interval multiplicity expressions (DIMEs). DIMEs enjoy good computational properties: both the membership and the containment problems become tractable. Also, we believe that despite the imposed restrictions DIMEs remain a practical class of UREs. For instance, all UREs used in the schema for the DBLP repository above are DIMEs.

Next, we employ DIMEs to define languages of unordered trees and propose two schema languages: disjunctive interval multiplicity schema (DIMS), and its restriction, disjunction-free interval multiplicity schema (IMS). Naturally, the above schema for the DBLP repository is a DIMS. We study the complexity of several basic decision problems: schema satisfiability, membership of a tree to the language of a schema, containment of two schemas, twig query satisfiability, implication, and containment in the presence of schema. We present in Table 1 a summary of the complexity results and we observe that DIMSs and IMSs enjoy the same computational properties as general DTDs and disjunction-free DTDs, respectively.

Table 1 Summary of complexity results

The lower bounds for the decision problems for DIMSs and IMSs are generally obtained with easy adaptations of their counterparts for general DTDs and disjunction-free DTDs. To obtain the upper bounds, we develop several new tools. We propose to represent DIMEs with characterizing tuples that can be efficiently computed and allow deciding in polynomial time the membership of a tree to the language of a DIMS and the containment of two DIMSs. Also, we develop dependency graphs for IMSs and a generalized definition of an embedding of a query. These two tools help us reason about query satisfiability, query implication, and query containment in the presence of IMSs. Our constructions and results for IMSs also allow us to characterize the complexity of query implication and query containment in the presence of disjunction-free DTDs, which, to the best of our knowledge, have not been previously studied.

Finally, we compare the expressive power of the proposed schema languages with yardstick languages of unordered trees (FO, MSO, and Presburger constraints) and DTDs under commutative closure. We show that the proposed schema languages are capable of expressing many practical languages of unordered trees.

It is important to mention that this paper is a substantially extended version of a preliminary work presented in [10]. More precisely, in this paper we show novel intractability results for some subclasses of unordered regular expressions and we extend the expressibility of the tractable subclasses. While in [10] we have considered only simple multiplicities (∗,+,?), in this paper we deal with arbitrary interval multiplicities of the form [n,m].

Organization

In Section 2 we introduce some preliminary notions. In Section 3 we study the reasons for the intractability of unordered regular expressions, while in Section 4 we present the tractable subclass of disjunctive interval multiplicity expressions (DIMEs). In Section 5 we define two schema languages, the disjunctive interval multiplicity schemas (DIMSs) and their restriction, the disjunction-free interval multiplicity schemas (IMSs), as well as the related problems of interest. In Sections 6 and 7 we analyze the complexity of the problems of interest for DIMSs and IMSs, respectively. In Section 8 we discuss the expressiveness of the proposed formalisms. In Section 9 we present related work. In Section 10 we summarize our results and outline further directions.

2 Preliminaries

Throughout this paper we assume an alphabet Σ that is a finite set of symbols. We also assume that Σ has a total order <Σ that can be tested in constant time.

Trees

We model XML documents with unordered labeled trees. Formally, a tree t is a tuple \((N_{t}, root_{t}, lab_{t}, child_{t})\), where \(N_{t}\) is a finite set of nodes, \(root_{t}\in N_{t}\) is a distinguished root node, \(lab_{t}:N_{t}\rightarrow {\Sigma }\) is a labeling function, and \(child_{t}\subseteq N_{t}\times N_{t}\) is the parent-child relation. We assume that the relation \(child_{t}\) is acyclic and require every non-root node to have exactly one predecessor in this relation. By \(Tree\) we denote the set of all trees.

Queries

We work with the class of twig queries, which are essentially unordered trees whose nodes may additionally be labeled with a distinguished wildcard symbol ⋆∉Σ and that use two types of edges, child (/) and descendant (//), corresponding to the standard XPath axes. Note that the semantics of the //-edge is that of a proper descendant (and not that of descendant-or-self). Formally, a twig query q is a tuple \((N_{q}, root_{q}, lab_{q}, child_{q}, desc_{q})\), where \(N_{q}\) is a finite set of nodes, \(root_{q}\in N_{q}\) is the root node, \(lab_{q}:N_{q}\rightarrow {\Sigma }\cup \{\star \}\) is a labeling function, \(child_{q}\subseteq N_{q}\times N_{q}\) is a set of child edges, and \(desc_{q}\subseteq N_{q}\times N_{q}\) is a set of descendant edges. We assume that \(child_{q}\ \cap \ desc_{q}= \varnothing \) and that the relation \(child_{q} \ \cup \ desc_{q}\) is acyclic, and we require every non-root node to have exactly one predecessor in this relation. By \(Twig\) we denote the set of all twig queries. Twig queries are often presented using the abbreviated XPath syntax [47], e.g., the query \(q_{0}\) in Fig. 2a can be written as r/⋆[⋆]//a.

Fig. 2

A tree and a twig query

Embeddings

We define the semantics of twig queries using the notion of embedding, which is essentially a mapping of the nodes of a query to the nodes of a tree that respects the semantics of the edges of the query. Formally, for a query \(q\in Twig\) and a tree \(t\in Tree\), an embedding of q in t is a function \(\lambda : N_{q} \rightarrow N_{t}\) such that:

  1. \(\lambda (root_{q})=root_{t}\),

  2. for every \((n,n^{\prime })\in child_{q}\), \((\lambda (n),\lambda (n^{\prime }))\in child_{t}\),

  3. for every \((n,n^{\prime })\in desc_{q}\), \((\lambda (n),\lambda (n^{\prime }))\in (child_{t})^{+}\) (the transitive closure of \(child_{t}\)),

  4. for every \(n\in N_{q}\), \(lab_{q}(n)=\star \) or \(lab_{q}(n)=lab_{t}(\lambda (n))\).

We write \(t\preccurlyeq q\) if there exists an embedding of q in t. Later on, in Section 7.2, we generalize this definition of embedding as a tool that permits us to characterize the problems of interest.

As already mentioned, we use the notion of embedding to define the semantics of twig queries. In particular, we say that t satisfies q if there exists an embedding of q in t, i.e., \(t\preccurlyeq q\). By L(q) we denote the set of all trees satisfying q.

Note that we do not require the embedding to be injective, i.e., two nodes of the query may be mapped to the same node of the tree. Figure 3 presents all embeddings of the query \(q_{0}\) in the tree \(t_{0}\) from Fig. 2.

Fig. 3

Embeddings of \(q_{0}\) in \(t_{0}\)
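The embedding conditions above can be sketched in code. Since a twig query is tree-shaped, its branches are independent, so the number of embeddings rooted at a given tree node is a product of per-branch sums. The encodings below and the small tree are our own assumptions (the tree t0 of Fig. 2 is not reproduced here).

```python
# A sketch of counting embeddings, under assumed encodings: a tree node is
# (label, [children]); a query node is (label, [(edge, subquery)]) with edge
# "/" (child) or "//" (proper descendant); "*" stands for the wildcard.

def descendants(t):
    # all proper descendants of a tree node
    for c in t[1]:
        yield c
        yield from descendants(c)

def count_embeddings(q, t):
    # number of embeddings of q in t that map the root of q to the root of t;
    # branches of a twig query are independent, so their counts multiply
    if q[0] not in ("*", t[0]):
        return 0
    total = 1
    for edge, sub in q[1]:
        targets = t[1] if edge == "/" else list(descendants(t))
        total *= sum(count_embeddings(sub, s) for s in targets)
    return total

# The query r/*[*]//a from the text over a small hypothetical tree.
q = ("r", [("/", ("*", [("/", ("*", [])), ("//", ("a", []))]))])
t = ("r", [("b", [("c", []), ("a", [])])])
print(count_embeddings(q, t))  # 2: the embedding need not be injective
```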

Unordered Words

An unordered word is essentially a multiset of symbols, i.e., a function \(w:{\Sigma }\rightarrow {\mathbb {N}}_{0}\) mapping symbols from the alphabet to natural numbers. We call w(a) the number of occurrences of the symbol a in w. We also write \(a\in w\) as a shorthand for w(a)≠0. The empty word ε is the unordered word that has 0 occurrences of every symbol, i.e., ε(a)=0 for every a∈Σ. We often use a simple representation of unordered words, writing each symbol in the alphabet the number of times it occurs in the unordered word. For example, when the alphabet is Σ={a,b,c}, \(w_{0}=aaacc\) stands for the function \(w_{0}(a)=3\), \(w_{0}(b)=0\), and \(w_{0}(c)=2\). Additionally, we may write \(w_{0}=a^{3}c^{2}\) instead of \(w_{0}=aaacc\).

We use unordered words to model collections of children of XML nodes. As is usually done in the context of XML validation [41, 42], we assume that the XML document is encoded in unary, i.e., every node takes the same amount of memory. Thus, we use a unary representation of unordered words, where each occurrence of a symbol occupies the same amount of space. However, we point out that none of the results presented in this paper changes with a binary representation. In particular, the intractability of the membership of an unordered word to the language of a URE (Theorem 1) also holds with a binary representation of unordered words.

Consequently, the size of an unordered word w, denoted |w|, is the sum of the numbers of occurrences in w of all symbols in the alphabet. For instance, the size of \(w_{0}=aaacc\) is \(|w_{0}|=5\).

The (unordered) concatenation of two unordered words \(w_{1}\) and \(w_{2}\) is defined as the multiset union \(w_{1}\uplus w_{2}\), i.e., the function defined as \((w_{1}\uplus w_{2})(a)=w_{1}(a)+w_{2}(a)\) for every a∈Σ. For instance, \(aaacc \uplus abbc = aaaabbccc\). Note that ε is the identity element of the unordered concatenation: \(\varepsilon \uplus w = w\uplus \varepsilon = w\) for every unordered word w. Also, given an unordered word w, by \(w^{i}\) we denote the concatenation \(w\uplus \ldots \uplus w\) (i times).

A language is a set of unordered words. The unordered concatenation of two languages \(L_{1}\) and \(L_{2}\) is the language \(L_{1}\uplus L_{2}=\{w_{1}\uplus w_{2}\mid w_{1}\in L_{1},\ w_{2}\in L_{2}\}\). For instance, if \(L_{1}=\{a,\ aac\}\) and \(L_{2}=\{ac,\ b,\ \varepsilon \}\), then \(L_{1}\uplus L_{2}=\{a,\ ab,\ aac,\ aabc,\ aaacc\}\).
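These definitions translate directly into code; the sketch below encodes unordered words as sorted strings (our encoding choice, so that words can be elements of Python sets) and reproduces the example above.

```python
# Unordered words encoded as sorted strings (so they can be set elements);
# the unordered concatenation adds the occurrence counts of each symbol.
def uconcat(w1, w2):
    return "".join(sorted(w1 + w2))

def lang_concat(L1, L2):
    return {uconcat(w1, w2) for w1 in L1 for w2 in L2}

L1 = {"a", "aac"}
L2 = {"ac", "b", ""}  # "" encodes the empty word ε
print(sorted(lang_concat(L1, L2)))  # ['a', 'aaacc', 'aabc', 'aac', 'ab']
```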

Unordered Regular Expressions

Analogously to regular expressions, which are used to define languages of ordered words, we propose unordered regular expressions to define languages of unordered words. Essentially, an unordered regular expression (URE) defines unordered words by using the Kleene star “∗”, disjunction “∣”, and unordered concatenation “∥”. Formally, we have the following grammar:

$$E ::= \epsilon \mid a \mid E^{*} \mid (E\text{``\(\mid\)''} E) \mid (E\text{``\(\parallel \)''} E), $$

where a∈Σ. The semantics of UREs is defined as follows:

$$\begin{array}{@{}rcl@{}} &&L(\epsilon) = \{ \varepsilon \}, \\ &&L(a) = \{a\}, \\ &&L(E_{1}\mid E_{2}) = L(E_{1}) \cup L(E_{2}),\\ &&L(E_{1}\parallel E_{2}) = L(E_{1}) \uplus L(E_{2}),\\ &&L(E^{*}) = \{w_{1}\uplus\ldots\uplus w_{i}\mid w_{1},\ldots,w_{i}\in L(E)\wedge i\geq0\}. \end{array} $$

For instance, the URE \((a\parallel (b\mid c))^{*}\) accepts the unordered words having the number of occurrences of a equal to the total number of b’s and c’s.
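For small inputs, the semantics above can be checked by brute-force enumeration; the sketch below is ours (membership is NP-complete in general, see Section 3). It enumerates L(E) truncated to words no longer than the input, with words encoded as sorted strings.

```python
# Brute-force membership sketch for UREs; exponential in general (the problem
# is NP-complete). Expressions are nested tuples: "eps", ("sym", a),
# ("or", E1, E2), ("cat", E1, E2) for "∥", and ("star", E). Unordered words
# are sorted strings; L(E) is enumerated up to the size of the input word.
def lang(E, bound):
    if E == "eps":
        return {""}
    op = E[0]
    if op == "sym":
        return {E[1]} if bound >= 1 else set()
    if op == "or":
        return lang(E[1], bound) | lang(E[2], bound)
    if op == "cat":
        return {"".join(sorted(u + v))
                for u in lang(E[1], bound) for v in lang(E[2], bound)
                if len(u) + len(v) <= bound}
    # op == "star": closure of L(E1) under multiset union, size-bounded
    words, base = {""}, lang(E[1], bound)
    while True:
        new = {"".join(sorted(u + v)) for u in words for v in base
               if len(u) + len(v) <= bound} - words
        if not new:
            return words
        words |= new

def member(w, E):
    return "".join(sorted(w)) in lang(E, len(w))

# the expression above: as many a's as b's and c's combined
E = ("star", ("cat", ("sym", "a"), ("or", ("sym", "b"), ("sym", "c"))))
print(member("abac", E), member("abc", E))  # True False
```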

The grammar above uses only the multiplicity ∗, and we introduce macros for two other standard and commonly used multiplicities:

$$\begin{array}{@{}rcl@{}} E^{+} := E\parallel E^{*}, \qquad E^? := E\mid\epsilon. \end{array} $$

The URE \((a\parallel b^{?})^{+}\parallel (a\mid c)^{?}\) accepts the unordered words having at least one a, at most one c, and a number of b’s less than or equal to the number of a’s.

Interval Multiplicities

While the multiplicities ∗, +, and ? allow one to specify unordered words with multiple occurrences of a symbol, we additionally introduce interval multiplicities to allow specifying a range of allowed occurrences of a symbol in an unordered word. More precisely, we extend the grammar of UREs by allowing expressions of the form \(E^{[n,m]}\) and \(E^{[n,m]^?}\), where \(n\in \mathbb N_{0}\) and \(m\in \mathbb N_{0}\cup \{\infty \}\). Their semantics is defined as follows:

$$\begin{array}{@{}rcl@{}} L(E^{[n,m]}) &=& \{w_{1}\uplus\ldots\uplus w_{i}\mid {} w_{1},\ldots,w_{i}\in L(E) \land n \leq i\leq m\}, \\ L(E^{[n,m]^?})& =& L(E^{[n,m]}) \cup \{\varepsilon\}. \end{array} $$

In the rest of the paper, we write simply interval instead of interval multiplicity. Furthermore, we view the following standard multiplicities as macros for intervals:

$$\begin{array}{@{}rcl@{}} * := [0,\infty], \qquad + := [1,\infty], \qquad ? := [0,1]. \end{array} $$

Additionally, we introduce the single occurrence multiplicity 1 as a macro for the interval [1,1].
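For the special case of a single symbol, the interval semantics reduces to a count check, which also makes the macro view of ∗, +, and ? concrete. A minimal sketch for this single-symbol case only (our own helper, not a general URE evaluator):

```python
import math

# For a single symbol a, the interval expression a^[n,m] accepts exactly the
# words a^i with n <= i <= m; the standard multiplicities are then the
# intervals * = [0,inf], + = [1,inf], ? = [0,1].
def interval_member(w, a, n, m):
    return set(w) <= {a} and n <= len(w) <= m

INF = math.inf
print(interval_member("",   "a", 0, INF))  # *  accepts the empty word: True
print(interval_member("",   "a", 1, INF))  # +  rejects the empty word: False
print(interval_member("aa", "a", 0, 1))    # ?  rejects aa: False
print(interval_member("aa", "a", 1, 2))    # a^[1,2] accepts aa: True
```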

Note that the intervals do not add expressive power to general UREs, but they become useful if we impose some restrictions. For example, if we disallow repetitions of a symbol in a URE and ban expressions of the form \(a\parallel a^{?}\), we can instead write \(a^{[1,2]}\) to specify a range of occurrences of a symbol in an unordered word without repeating the symbol in the URE.

3 Intractability of Unordered Regular Expressions

In this section, we study the reasons for the intractability of UREs w.r.t. the following two fundamental decision problems: membership and containment. In Section 3.1 we show that membership is NP-complete even under significant restrictions on the UREs, while in Section 3.2 we show that containment is \({\Pi }_{2}^{\mathrm P}\)-hard (and in 3-EXPTIME). We notice that the proofs of both results rely on UREs allowing repetitions of the same symbol. Consequently, we disallow such repetitions and show that this restriction does not avoid the intractability of the containment (Section 3.3). We observe that the proof of this result employs UREs with arbitrary use of disjunction and intervals, and therefore, in Section 4 we impose further restrictions and define the disjunctive interval multiplicity expressions (DIMEs), a subclass for which we show that the two problems of interest become tractable.

3.1 Membership

In this section, we study the problem of deciding the membership of an unordered word to the language of a URE. First of all, note that this problem can be easily reduced to testing the membership of a vector in the Parikh image of a regular language, known to be NP-complete [26], and vice versa. We show that deciding the membership of an unordered word to the language of a URE remains NP-complete even under significant restrictions on the class of UREs, a result which does not follow from [26].

Theorem 1

Given an unordered word w and an expression E of the grammar \(E ::= a\mid E^{?}\mid (E\parallel E)\), deciding whether w∈L(E) is NP-complete.

Proof

To show that this problem is in NP, we point out that a nondeterministic Turing machine can guess a permutation of w and check whether it is accepted by the NFA corresponding to E with the unordered concatenation replaced by standard concatenation. We recall that w is given in unary representation.

Next, we prove the NP-hardness by reduction from SAT1-in-3, i.e., given a 3CNF formula, determine whether there exists a valuation such that each clause has exactly one true literal (and exactly two false literals). The SAT1-in-3 problem is known to be NP-complete [38]. The reduction works as follows. We take a 3CNF formula \(\varphi =c_{1}\wedge \ldots \wedge c_{k}\) over the variables \(\{x_{1},\ldots ,x_{n}\}\). We take the alphabet \(\{d_{1},\ldots ,d_{k},v_{1},\ldots ,v_{n}\}\). Each \(d_{i}\) corresponds to a clause \(c_{i}\) (for \(1\leqslant i\leqslant k\)) and each \(v_{j}\) corresponds to a variable \(x_{j}\) (for \(1\leqslant j\leqslant n\)). We construct the unordered word \(w_{\varphi }=d_{1}{\ldots } d_{k}v_{1}{\ldots } v_{n}\) and the expression \(E_{\varphi } = X_{1}\parallel {\ldots } \parallel X_{n}\), where for \(1\leqslant j\leqslant n\):

$$ X_{j} = (v_{j}\parallel d_{t_{1}}\parallel {\ldots} \parallel d_{t_{l}})^? \parallel (v_{j} \parallel d_{f_{1}} \parallel {\ldots} \parallel d_{f_{m}})^?, $$

and \(d_{t_{1}},\ldots ,d_{t_{l}}\) (with \(1\leqslant t_{1},\ldots ,t_{l}\leqslant k\)) correspond to the clauses that use the literal \(x_{j}\), and \(d_{f_{1}},\ldots ,d_{f_{m}}\) (with \(1\leqslant f_{1},\ldots ,f_{m}\leqslant k\)) correspond to the clauses that use the literal \(\neg x_{j}\). For example, for the formula \(\varphi _{0}=(x_{1}\vee \neg x_{2}\vee x_{3})\wedge (\neg x_{1}\vee x_{3}\vee \neg x_{4})\), we construct \(w_{\varphi _{0}}=d_{1}d_{2}v_{1}v_{2}v_{3}v_{4}\) and

$$ E_{\varphi_{0}}=(v_{1} \parallel d_{1})^? \parallel (v_{1} \parallel d_{2})^? \parallel v_{2}^? \parallel (v_{2} \parallel d_{1})^? \parallel (v_{3} \parallel d_{1}\parallel d_{2})^?\parallel v_{3}^?\parallel v_{4}^?\parallel (v_{4}\parallel d_{2})^?. $$

We claim that φ∈SAT1-in-3 iff \(w_{\varphi }\in L(E_{\varphi })\). For the only if case, let \(V:\{x_{1},\ldots ,x_{n}\}\rightarrow \{\mathit {true},\mathit {false}\}\) be a SAT1-in-3 valuation of φ. We use V to construct the derivation of \(w_{\varphi }\) in \(L(E_{\varphi })\): for \(1\leqslant j\leqslant n\), we take \((v_{j}\parallel d_{t_{1}}\parallel \ldots \parallel d_{t_{l}})\) from \(X_{j}\) if \(V(x_{j})=\mathit{true}\), and \((v_{j}\parallel d_{f_{1}}\parallel \ldots \parallel d_{f_{m}})\) from \(X_{j}\) otherwise. Since V is a SAT1-in-3 valuation of φ, each \(d_{i}\) (with \(1\leqslant i\leqslant k\)) occurs exactly once, hence \(w_{\varphi }\in L(E_{\varphi })\). For the if case, we assume that \(w_{\varphi }\in L(E_{\varphi })\). Since \(w_{\varphi }(v_{j}) = 1\), we infer that the derivation of \(w_{\varphi }\) uses exactly one of the two expressions of the form \((v_{j}\parallel \ldots )^{?}\) in each \(X_{j}\). Moreover, since \(w_{\varphi }(d_{i})=1\), we infer that the valuation encoded in the derivation of \(w_{\varphi }\) in \(L(E_{\varphi })\) validates exactly one literal of each clause in φ, and therefore, φ∈SAT1-in-3. Clearly, the described reduction works in polynomial time. □

3.2 Containment

In this section, we study the problem of deciding the containment of two UREs. It is well known that regular expression containment is PSPACE-complete [46], but we cannot adapt this result to characterize the complexity of the containment of UREs because the order plays an essential role in the reduction. We prove that deciding the containment of UREs is \({\Pi }_{2}^{\mathrm P}\)-hard and show an upper bound that follows from the complexity of deciding the satisfiability of Presburger logic formulas [36, 44].

Theorem 2

Given two UREs \(E_{1}\) and \(E_{2}\), deciding \(L(E_{1})\subseteq L(E_{2})\) is 1) \({\Pi }_{2}^{\mathrm P}\)-hard and 2) in 3-EXPTIME.

Proof

1) We prove the \({\Pi }_{2}^{\mathrm P}\)-hardness by reduction from the problem of deciding the validity of ∀QBF formulas, a classical \({\Pi }_{2}^{\mathrm P}\)-complete problem. We take a ∀QBF formula

$$ \psi = \forall x_{1},\ldots,x_{n}.\ \exists y_{1},\ldots,y_{m}.\ \varphi, $$

where \(\varphi =c_{1}\wedge \ldots \wedge c_{k}\) is a quantifier-free CNF formula. We call the variables \(x_{1},\ldots ,x_{n}\) universal and the variables \(y_{1},\ldots ,y_{m}\) existential.

We take the alphabet \(\{d_{1},\ldots ,d_{k},t_{1},f_{1},\ldots ,t_{n},f_{n}\}\) and we construct two expressions, \(E_{\psi }\) and \(E_{\psi }^{\prime }\). First, \(E_{\psi } = d_{1}\parallel \ldots \parallel d_{k}\parallel X_{1}\parallel \ldots \parallel X_{n}\), where, for \(1\leqslant i\leqslant n\), \(X_{i} = ((t_{i}\parallel d_{a_{1}}\parallel \ldots \parallel d_{a_{l}})\mid (f_{i}\parallel d_{b_{1}}\parallel \ldots \parallel d_{b_{s}}))\), and \(d_{a_{1}},\ldots ,d_{a_{l}}\) (with \(1\leqslant a_{1},\ldots ,a_{l}\leqslant k\)) correspond to the clauses which use the literal \(x_{i}\), and \(d_{b_{1}},\ldots ,d_{b_{s}}\) (with \(1\leqslant b_{1},\ldots ,b_{s}\leqslant k\)) correspond to the clauses which use the literal \(\neg x_{i}\). For example, for the formula

$$ \psi_{0} = \forall x_{1},x_{2}.\ \exists y_{1},y_{2}.\ (x_{1}\vee\neg x_{2}\vee y_{1})\wedge (\neg x_{1}\vee y_{1}\vee\neg y_{2})\wedge (x_{2}\vee\neg y_{1}), $$

we construct:

$$ E_{\psi_{0}} = d_{1}\parallel d_{2}\parallel d_{3}\parallel ((t_{1}\parallel d_{1})\mid(f_{1}\parallel d_{2}) )\parallel ((t_{2}\parallel d_{3})\mid(f_{2}\parallel d_{1})). $$

Note that there is a one-to-one correspondence between the unordered words in \(L(E_{\psi })\) and the valuations of the universal variables. For example, given the formula \(\psi _{0}\), the unordered word \({d_{1}^{3}}d_{2}d_{3}t_{1}f_{2}\) corresponds to the valuation V such that \(V(x_{1})=\mathit{true}\) and \(V(x_{2})=\mathit{false}\).

Next, we construct \(E_{\psi }^{\prime } = X_{1}\parallel \ldots \parallel X_{n}\parallel Y_{1}\parallel \ldots \parallel Y_{m}\), where:

  • \(X_{i} = ((t_{i}\parallel d_{a_{1}}^{*}\parallel \ldots \parallel d_{a_{l}}^{*})\mid (f_{i}\parallel d_{b_{1}}^{*}\parallel \ldots \parallel d_{b_{s}}^{*}))\), and \(d_{a_{1}},\ldots ,d_{a_{l}}\) (with \(1\leqslant a_{1},\ldots ,a_{l}\leqslant k\)) correspond to the clauses which use the literal \(x_{i}\), and \(d_{b_{1}},\ldots ,d_{b_{s}}\) (with \(1\leqslant b_{1},\ldots ,b_{s}\leqslant k\)) correspond to the clauses which use the literal \(\neg x_{i}\) (for \(1\leqslant i\leqslant n\)),

  • \(Y_{j} = ((d_{a_{1}}^{*}\parallel \ldots \parallel d_{a_{l}}^{*})\mid (d_{b_{1}}^{*}\parallel \ldots \parallel d_{b_{s}}^{*}))\), and \(d_{a_{1}},\ldots ,d_{a_{l}}\) (with \(1\leqslant a_{1},\ldots ,a_{l}\leqslant k\)) correspond to the clauses which use the literal \(y_{j}\), and \(d_{b_{1}},\ldots ,d_{b_{s}}\) (with \(1\leqslant b_{1},\ldots ,b_{s}\leqslant k\)) correspond to the clauses which use the literal \(\neg y_{j}\) (for \(1\leqslant j\leqslant m\)).

For example, for ψ 0 above we construct:

$$E_{\psi_{0}}^{\prime} = ((t_{1}\parallel d_{1}^{*})\mid(f_{1}\parallel d_{2}^{*}))\parallel ((t_{2}\parallel d_{3}^{*})\mid(f_{2}\parallel d_{1}^{*}))\parallel ((d_{1}^{*}\parallel d_{2}^{*})\mid d_{3}^{*})\parallel (\epsilon\mid d_{2}^{*}).$$

We claim that ⊧ψ iff \(L(E_{\psi })\subseteq L(E_{\psi }^{\prime })\). For the only if case, for each valuation of the universal variables, we take the corresponding unordered word \(w\in L(E_{\psi })\). Since there exists a valuation of the existential variables which satisfies φ, we use this valuation to construct a derivation of w in \(L(E_{\psi }^{\prime })\). For the if case, for every unordered word from \(L(E_{\psi })\), we take its derivation in \(L(E_{\psi }^{\prime })\) and we use it to construct a valuation of the existential variables which satisfies φ. Clearly, the described reduction works in polynomial time.

2) The membership of the problem in 3-EXPTIME follows from the complexity of deciding the satisfiability of Presburger logic formulas, which is in 3-EXPTIME [36]. Given two UREs \(E_{1}\) and \(E_{2}\), we compute in linear time [44] two existential Presburger formulas for their Parikh images, \(\varphi _{E_{1}}\) and \(\varphi _{E_{2}}\), respectively. Next, we test the satisfiability of the following closed Presburger logic formula: \(\forall \overline {x}.\ \varphi _{E_{1}}(\overline {x})\Rightarrow \varphi _{E_{2}}(\overline {x})\). □

While the complexity gap for the containment of UREs (cf. Theorem 2) is currently quite large, we believe that it may be reduced by working on quantifier elimination for the Presburger formula obtained by translating the containment of UREs (as shown in the second part of the proof of Theorem 2). Although we believe that this problem is \({\Pi }_{2}^{\mathrm P}\)-complete, its exact complexity remains an open question.

3.3 Disallowing Repetitions

The proofs of Theorems 1 and 2 rely on UREs allowing repetitions of the same symbol, which might be one of the causes of the intractability. Consequently, from now on we disallow repetitions of the same symbol in a URE. Similar restrictions are commonly used for regular expressions for practical reasons: single occurrence regular expressions (SOREs) [7], conflict-free types [17, 18, 22], and duplicate-free DTDs [33]. While the complexity of the membership of an unordered word to the language of a URE without symbol repetitions has recently been shown to be in PTIME [11], testing the containment of two such UREs remains intractable.

Theorem 3

Given two UREs \(E_{1}\) and \(E_{2}\) not allowing repetitions of symbols, deciding \(L(E_{1})\subseteq L(E_{2})\) is coNP-hard.

Proof

We show the coNP-hardness by reduction from the complement of 3SAT. Take a 3CNF formula \(\varphi =c_{1}\wedge \ldots \wedge c_{k}\) over the variables \(\{x_{1},\ldots ,x_{n}\}\). We assume w.l.o.g. that each variable occurs at most once in a clause. Take the alphabet \(\{a_{ij}\mid 1\leqslant i\leqslant k,\ 1\leqslant j\leqslant n,\ c_{i}\ \text{uses}\ x_{j}\ \text{or}\ \neg x_{j}\}\). We construct the expression \(E_{\varphi } = X_{1}\parallel \ldots \parallel X_{n}\), where \(X_{j}=((a_{t_{1}j}\parallel {\ldots } \parallel a_{t_{l}j})\mid (a_{f_{1}j}\parallel \ldots \parallel a_{f_{m}j}))\) (for \(1\leqslant j\leqslant n\)), and \(c_{t_{1}},\ldots ,c_{t_{l}}\) (with \(1\leqslant t_{1},\ldots ,t_{l}\leqslant k\)) are the clauses which use the literal \(x_{j}\), and \(c_{f_{1}},\ldots ,c_{f_{m}}\) (with \(1\leqslant f_{1},\ldots ,f_{m}\leqslant k\)) are the clauses which use the literal \(\neg x_{j}\). Next, we construct \(E_{\varphi }^{\prime }= (C_{1}\mid \ldots \mid C_{k})^{[0,k-1]}\), where \(C_{i}=(a_{ij_{1}}\mid \ldots \mid a_{ij_{p}})^{+}\) (for \(1\leqslant i\leqslant k\)), and \(x_{j_{1}},\ldots ,x_{j_{p}}\) (with \(1\leqslant j_{1},\ldots ,j_{p}\leqslant n\)) are the variables used by the clause \(c_{i}\). For example, for

$$ \varphi_{0}=(x_{1}\vee \neg x_{2}\vee x_{3})\wedge (\neg x_{1}\vee x_{3}\vee \neg x_{4})\wedge (x_{2}\vee \neg x_{3}\vee \neg x_{4}), $$

we obtain:

$$\begin{array}{@{}rcl@{}} E_{\varphi_{0}} =&(a_{11}\mid a_{21})\parallel (a_{32}\mid a_{12})\parallel ((a_{13}\parallel a_{23})\mid a_{33})\parallel (\epsilon\mid (a_{24}\parallel a_{34})),\\ E_{\varphi_{0}}^{\prime} =& ((a_{11}\mid a_{12}\mid a_{13})^{+}\mid (a_{21}\mid a_{23}\mid a_{24})^{+}\mid (a_{32}\mid a_{33}\mid a_{34})^{+})^{[0,2]}. \end{array} $$

Note that there is a one-to-one correspondence between the unordered words \(w_{V}\) in \(L(E_{\varphi })\) and the valuations V of the variables \(x_{1},\ldots ,x_{n}\) (*). For example, for the above \(\varphi_{0}\) and the valuation V such that V(x 1)=V(x 2)=V(x 3)=t r u e and V(x 4)=f a l s e, the unordered word w V =a 11 a 32 a 13 a 23 a 24 a 34 is in \(L(E_{\varphi _{0}})\). Moreover, given a \(w_{V}\in L(E_{\varphi })\), one can easily recover the valuation V.

We observe that the interval [0,k−1] is applied to a disjunction of k expressions of the form C i and there is no repetition of symbols among the expressions of the form C i . This allows us to state an instrumental property (**): \(w\in L(E_{\varphi }^{\prime })\) iff there exists an i∈{1,…,k} such that none of the symbols used in C i occurs in w. From (*) and (**), we infer that given a valuation V, \(V\models \varphi \) iff \(w_{V}\in L(E_{\varphi })\,\backslash \,L(E_{\varphi }^{\prime })\), which yields φ∈3SAT iff \(L(E_{\varphi })\nsubseteq L(E_{\varphi }^{\prime })\). Clearly, the described reduction works in polynomial time. □

Theorem 3 shows that disallowing repetitions of symbols in a URE does not avoid the intractability of the containment.
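The correspondence underlying the reduction can be checked exhaustively for \(\varphi_{0}\). The sketch below is illustrative only; it encodes each symbol a i j as the pair (i, j) and uses property (**) from the proof as the membership test for \(L(E_{\varphi }^{\prime })\), verifying that V satisfies φ 0 exactly when \(w_{V}\in L(E_{\varphi _{0}})\setminus L(E_{\varphi _{0}}^{\prime })\):

```python
from itertools import product

# Clauses of phi_0 as lists of (variable index, polarity) literals.
clauses = [
    [(1, True), (2, False), (3, True)],   # c1 = x1 or not-x2 or x3
    [(1, False), (3, True), (4, False)],  # c2 = not-x1 or x3 or not-x4
    [(2, True), (3, False), (4, False)],  # c3 = x2 or not-x3 or not-x4
]

def w_V(val):
    # w_V contains a_ij iff the literal of x_j in c_i is true under val,
    # i.e., X_j takes the corresponding branch in E_phi.
    return {(i, j) for i, c in enumerate(clauses, 1)
            for (j, pol) in c if val[j] == pol}

def in_E_prime(w):
    # Property (**): w is in L(E'_phi) iff some C_i contributes no
    # symbol to w (the interval [0, k-1] forbids touching all k C_i's).
    return any(all((i, j) not in w for (j, _) in c)
               for i, c in enumerate(clauses, 1))

for bits in product([False, True], repeat=4):
    val = dict(enumerate(bits, 1))
    sat = all(any(val[j] == pol for (j, pol) in c) for c in clauses)
    assert sat == (not in_E_prime(w_V(val)))  # V |= phi_0 iff w_V avoids E'
```

Each w V belongs to \(L(E_{\varphi _{0}})\) by construction, so the check above exercises exactly the set difference used in the proof.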

Additionally, we observe that the proof of Theorem 3 employs UREs with arbitrary use of disjunction and intervals. Consequently, in the next section we impose further restrictions that yield a class of UREs with desirable computational properties.

4 Disjunctive Interval Multiplicity Expressions (DIMEs)

In this section, we present the DIMEs, a subclass of UREs for which membership and containment become tractable. First, we present an intuitive representation of DIMEs with characterizing tuples (Section 4.1). Next, we formally define DIMEs and show that they are precisely captured by their characterizing tuples (Section 4.2). Finally, we use a compact representation of the characterizing tuples to show the tractability of DIMEs (Section 4.3).

4.1 Characterizing Tuples

In this section, we introduce the notion of a characterizing tuple, an alternative and more intuitive representation of DIMEs, the subclass of UREs that we formally define in Section 4.2. Recall that by \(a\in w\) we denote \(w(a)\neq 0\). Given a DIME E, the characterizing tuple Δ E =(C E ,N E ,P E ,K E ) is defined as follows.

  • The conflicting pairs of siblings C E consisting of all pairs of symbols in Σ such that E defines no word using both symbols simultaneously:

    $$ C_{E} = \{(a,b)\in{\Sigma}\times{\Sigma}\mid \nexists w\in L(E).\ a\in w\wedge b\in w\}. $$
  • The extended cardinality map N E capturing for each symbol in the alphabet the possible numbers of its occurrences in the unordered words defined by E:

    $$ N_{E} = \{(a, w(a))\in {\Sigma}\times\mathbb N_{0}\mid w\in L(E)\}. $$
  • The collections of required symbols P E capturing symbols that must be present in every word; essentially, a set of symbols X belongs to P E if every word defined by E contains at least one element from X:

    $$ P_{E}=\{X\subseteq{\Sigma}\mid\forall w\in L(E).\ \exists a\in X.\ a \in w\}. $$
  • The counting dependencies K E consisting of pairs of symbols (a,b) such that in every word defined by E, the number of b’s is at most the number of a’s. Note that if both (a,b) and (b,a) belong to K E , then all unordered words defined by E must have the same number of a’s and b’s.

    $$\begin{array}{@{}rcl@{}} K_{E}=\{(a,b)\in{\Sigma}\times{\Sigma}\mid\forall w\in L(E).\ w(a)\geqslant w(b)\}. \end{array} $$

As an example we take \(E_{0}= a^{+} \parallel ((b \parallel c^?)^{+}\mid d^{[5,\infty ]})\) and we illustrate its characterizing tuple \({\Delta }_{E_{0}}\). Because P E is closed under supersets, we list only its minimal elements:

$$\begin{array}{@{}rcl@{}}&&{} C_{E_{0}} = \{ (b,d), (c,d), (d,b), (d,c) \}, \\ &&{} N_{E_{0}}= \{(a,i)\mid i\geqslant 1\} \cup \{(b,i)\mid i\geqslant 0\} \cup{}\{(c,i)\mid i\geqslant 0\} \cup \{(d,i)\mid i = 0 \lor i\geqslant 5\},\\ &&{} P_{E_{0}} =\{\{a\},\{b,d\},\ldots\},\\ &&{} K_{E_{0}} = \{(b,c)\}. \end{array} $$

We point out that N E may be infinite and P E exponential in the size of E. Later on we discuss how to represent both sets in a compact manner while allowing efficient manipulation.

Then, an unordered word w satisfies a characterizing tuple Δ E corresponding to a DIME E, denoted w⊧Δ E , if the following conditions are satisfied:

  1. \(w\models C_{E}\) i.e., \(\forall (a, b) \in C_{E}.\ (a\in w \Rightarrow b \notin w) \wedge (b\in w \Rightarrow a\notin w)\),

  2. \(w\models N_{E}\) i.e., \(\forall a\in {\Sigma }.\ (a,w(a))\in N_{E}\),

  3. \(w\models P_{E}\) i.e., \(\forall X\in P_{E}.\ \exists a\in X.\ a\in w\),

  4. \(w\models K_{E}\) i.e., \(\forall (a,b)\in K_{E}.\ w(a)\geqslant w(b)\).

For instance, the unordered word a a b b c satisfies the characterizing tuple \({\Delta }_{E_{0}}\) corresponding to the aforementioned DIME \(E_{0}= a^{+} \parallel ((b \parallel c^?)^{+}\mid d^{[5,\infty ]})\) since it satisfies all four conditions imposed by \({\Delta }_{E_{0}}\). On the other hand, note that the following unordered words do not satisfy \({\Delta }_{E_{0}}\):

  • a b d d d d d because it contains at the same time b and d, and \((b,d)\in C_{E_{0}}\),

  • a d d because it has two d’s and \((d,2)\notin N_{E_{0}}\),

  • a a because it does not contain any b or d and \(\{b,d\}\in P_{E_{0}}\),

  • a b b c c c because it has more c’s than b’s and \((b,c)\in K_{E_{0}}\).
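The four conditions are mechanical to check. The following sketch transcribes the components of \({\Delta }_{E_{0}}\) from above (since \(N_{E_{0}}\) is infinite, it is given as a per-symbol predicate) and accepts a a b b c while rejecting the four counterexamples:

```python
from collections import Counter

# Characterizing tuple of E0 = a+ || ((b || c?)+ | d[5,inf]),
# transcribed from the text; N is a per-symbol predicate on counts.
C = {('b', 'd'), ('c', 'd'), ('d', 'b'), ('d', 'c')}
N = {'a': lambda i: i >= 1,
     'b': lambda i: i >= 0,
     'c': lambda i: i >= 0,
     'd': lambda i: i == 0 or i >= 5}
P = [{'a'}, {'b', 'd'}]   # minimal required-symbol sets
K = {('b', 'c')}          # at least as many b's as c's

def satisfies(word):
    w = Counter(word)
    conflicts = all(not (w[a] and w[b]) for (a, b) in C)
    counts = all(N.get(s, lambda i: i == 0)(w[s]) for s in set(w) | set(N))
    required = all(any(w[a] for a in X) for X in P)
    counting = all(w[a] >= w[b] for (a, b) in K)
    return conflicts and counts and required and counting

assert satisfies('aabbc')        # in L(E0)
assert not satisfies('abddddd')  # conflict (b,d)
assert not satisfies('add')      # (d,2) not in N
assert not satisfies('aa')       # misses the required set {b,d}
assert not satisfies('abbccc')   # more c's than b's
```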

In the next section, we define the DIMEs and show that they are precisely captured by characterizing tuples.

4.2 Grammar of DIMEs

An atom is \((a_{1}^{I_{1}}\parallel \ldots \parallel a_{k}^{I_{k}})\), where all I i ’s are ? or 1. For example, \((a\parallel b^?\parallel c)\) is an atom, but \((a^{[3,4]}\parallel b)\) is not an atom. A clause is \((A_{1}^{I_{1}}\mid \ldots \mid A_{k}^{I_{k}})\), where all A i ’s are atoms and all I i ’s are intervals. A clause is simple if all I i ’s are ? or 1. For example, \((a^{[2,3]}\mid (b^?\parallel c))\) is a clause (which is not simple), \(((a^?\parallel b)\mid c^?)\) is a simple clause, while \(((a^?\parallel b^{+})\mid c)\) is not a clause.

A disjunctive interval multiplicity expression (DIME) is \((D_{1}^{I_{1}}\parallel \ldots \parallel D_{k}^{I_{k}})\), where for \(1\leqslant i\leqslant k\) either 1) D i is a simple clause and I i ∈{+,∗}, or 2) D i is a clause and I i ∈{1,?}. Moreover, a symbol can occur at most once in a DIME. For example, \((a\mid (b\parallel c^?)^{+})\parallel (d^{[3,4]}\mid e^{*})\) is a DIME while \((a\parallel b^?)^{+}\parallel (a\mid c)\) is not a DIME because it uses the symbol a twice. A disjunction-free interval multiplicity expression (IME) is a DIME which does not use the disjunction operator. An example of an IME is \(a^{*}\parallel (b\parallel c^?)^{+}\parallel d^{[3,4]}\). For more practical examples of DIMEs see Examples 3 and 4 in Section 5.

We have tailored DIMEs to be able to capture them with characterizing tuples that permit deciding membership and containment in polynomial time (cf. Section 4.3). As we have already pointed out in Section 3.3, a slightly more relaxed restriction on the nesting of disjunction and intervals leads to intractability of the containment (Theorem 3). Even though DIMEs may look very complex, the imposed restrictions are necessary to obtain lower complexity while considering fragments with practical relevance (cf. Section 8).

Next, we show that each DIME can be rewritten as an equivalent reduced DIME. Reduced DIMEs may also seem complex, but they are a building block for (i) proving that the language of a DIME is precisely captured by its characterizing tuple (Lemma 1), and (ii) computing the compact representation of the characterizing tuples that yield the tractability of DIMEs (cf. Section 4.3).

Before defining the reduced DIMEs, we need to introduce some additional notation. Given an atom A (resp. a clause D), we denote by Σ A (resp. Σ D ) the set of symbols occurring in A (resp. D). Given a DIME E, by \({I_{E}^{a}}\) (resp. \({I_{E}^{A}}\) or \({I_{E}^{D}}\)) we denote the interval associated in E with the symbol a (resp. atom A or clause D). Because we consider only expressions without repetitions, this interval is well-defined. Moreover, if E is clear from the context, we write simply I a (resp. I A or I D) instead of \({I_{E}^{a}}\) (resp. \({I_{E}^{A}}\) or \({I_{E}^{D}}\)). Furthermore, given an interval I which can be either [n,m] or [n,m]?, by I ? we understand the interval [n,m]?. In a reduced DIME E, each clause with interval \(D^{I}\) has one of the following three types:

  1. \(D^{I}=(A_{1}\mid \ldots \mid A_{k})^{+}\), where \(k\geqslant 2\) and, for every i∈{1,…,k}, A i is an atom such that there exists \(a\in {\Sigma }_{A_{i}}\) with I a=1.

     For example, \(((a\parallel b^?)\mid c)^{+}\) has type 1, but \(a^{+}\) and \(((a^?\parallel b^?)\mid c)^{+}\) do not.

  2. \((A_{1}^{I_{1}}\mid \ldots \mid A_{k}^{I_{k}})\), where for every i∈{1,…,k}: 1) A i is an atom such that there exists \(a\in {\Sigma }_{A_{i}}\) with I a=1, and 2) 0 does not belong to the set represented by the interval I i .

     For example, \((a\mid (b^?\parallel c)^{[5,\infty ]})\) and \(a^{+}\) have type 2, but \((a\mid (b^?\parallel c^?)^{[5,\infty ]})\) and \((a^{*}\mid (b^?\parallel c)^{[5,\infty ]})\) do not.

  3. \((A_{1}^{I_{1}}\mid \ldots \mid A_{k}^{I_{k}})\), where for every i∈{1,…,k}, A i is an atom and I i is an interval such that 0 belongs to the set represented by I i .

     For example, \((a^{*}\mid (b\parallel c)^{[3,4]^?})\) and \((a^?\parallel b^?)^{*}\) have type 3, but \((a^?\parallel b^?)^{[3,4]}\) does not.

Reduced DIMEs easily yield the construction of their characterizing tuples. Take a clause with interval \(D^{I}\) from a reduced DIME E and observe that the symbols from Σ D are present in the characterizing tuple Δ E as follows.

  • If D I is of type 1, then there is no symbol in Σ D that occurs in a conflict in C E . Otherwise, C E consists of all pairs of distinct symbols (a,b) from Σ D that appear in different atoms from D I.

  • If D I is of type 1, then we have (a,n)∈N E for every \((a,n)\in {\Sigma }_{D}\times {\mathbb {N}_{0}}\). Otherwise, the possible number of occurrences of every symbol a from Σ D can be obtained directly from the two intervals above it: the interval of D and the interval of the atom containing a. We explain in Section 4.3 how to precisely construct a compact representation of the potentially infinite set N E .

  • If D I is of type 1 or 2, then every unordered word defined by E contains at least one of the symbols a from Σ D having interval I a=1. More precisely, P E contains all sets of symbols \(X\subseteq {\Sigma }\) containing, for every atom of D, at least one symbol a with I a=1. For example, for \(((a\parallel b\parallel c^?)\mid (d\parallel e))^{+}\), P E consists of the sets {a,d},{a,e},{b,d},{b,e} and all their supersets. Otherwise, if D I is of type 3, then there is no set in P E containing only symbols from Σ D .

  • Regardless of the type of D I, the counting dependencies K E consist of all pairs of symbols (a,b) such that they appear in the same atom in D and I a=1.

To obtain reduced DIMEs, we use the following rules:

  • Take a simple clause \((A_{1}^{I_{1}}\mid \ldots \mid A_{k}^{I_{k}})\).

    • \((A_{1}^{I_{1}}\mid \ldots \mid A_{k}^{I_{k}})^{*}\) goes to \(A_{1}^{*}\parallel \ldots \parallel A_{k}^{*}\) (k clauses of type 3). Essentially, we distribute the ∗ of a disjunction of atoms with intervals to each of the atoms. For example, \((a\mid (b\parallel c^?))^{*}\) goes to \(a^{*}\parallel (b\parallel c^?)^{*}\).

    • \((A_{1}^{I_{1}}\mid \ldots \mid A_{k}^{I_{k}})^{+}\) goes to \(A_{1}^{*}\parallel \ldots \parallel A_{k}^{*}\) (k clauses of type 3) if there exists an atom with interval \(A_{i}^{I_{i}}\) (i∈{1,…,k}) that defines the empty word i.e., I i =? or I a=? for every symbol \(a\in {\Sigma }_{A_{i}}\). If the empty word is defined, then we can basically transform the + into ∗ and then distribute the ∗ as in the previous case. For example, \(((a\parallel b^?)\mid (c\parallel d)^?)^{+}\) goes to \((a\parallel b^?)^{*}\parallel (c\parallel d)^{*}\).

  • Take a clause \((A_{1}^{I_{1}}\mid \ldots \mid A_{k}^{I_{k}})\).

    • \((A_{1}^{I_{1}}\mid \ldots \mid A_{k}^{I_{k}})^?\) goes to \((A_{1}^{I_{1}^?}\mid \ldots \mid A_{k}^{I_{k}^?})\) (type 3). We essentially distribute the ? of a disjunction of atoms with intervals to each of the atoms. For example, \((a^{[2,3]}\mid b^{+})^?\) goes to \((a^{[2,3]^?}\mid b^{*})\).

    • \((A_{1}^{I_{1}}\mid \ldots \mid A_{k}^{I_{k}})\) goes to \((A_{1}^{I_{1}^?}\mid \ldots \mid A_{k}^{I_{k}^?})\) (type 3) if there exists an atom with interval \(A_{i}^{I_{i}}\) (i∈{1,…,k}) that defines the empty word i.e., 0 belongs to the set represented by I i or I a=? for every symbol \(a\in {\Sigma }_{A_{i}}\). If the empty word is defined by one of the atoms, then we can basically distribute ? to all of them. For example, \((a\mid (b\parallel c)^{[0,5]})\) goes to \((a^?\mid (b\parallel c)^{[0,5]})\).

  • Take an atom \((a_{1}^?\parallel \ldots \parallel a_{k}^?)\) and an interval I. Then, \((a_{1}^?\parallel \ldots \parallel a_{k}^?)^{I}\) goes to \((a_{1}^?\parallel \ldots \parallel a_{k}^?)^{[0,\max (I)]}\), where by \(\max (I)\) we denote the maximum value from the set represented by the interval I. This step may be combined with one of the previous ones to rewrite a clause with interval as one of type 3. For example, \(((a^?\parallel b^?)^{[3,6]}\mid c)\) goes to \(((a^?\parallel b^?)^{[0,6]}\mid c^?)\).

  • Remove symbols a (resp. atoms A or clauses D) such that I a (resp. I A or I D) is [0,0].

Note that each of the rewriting steps gives an equivalent reduced expression.

Next, we assume that we work with reduced DIMEs only and show that the language defined by a DIME E consists of all unordered words satisfying the characterizing tuple Δ E .

Lemma 1

Given an unordered word w and a DIME E, w∈L(E) iff w⊧Δ E .

Proof

The only if part follows from the definition of the satisfiability of Δ E . For the if part, we take the tuple Δ E corresponding to a DIME \(E=D_{1}^{I_{1}}\parallel \ldots \parallel D_{k}^{I_{k}}\) and an unordered word w such that w⊧Δ E . Let \(w=w_{1}\uplus \ldots \uplus w_{k}\uplus w^{\prime }\), where each w i contains all occurrences in w of the symbols from \({\Sigma }_{D_{i}}\) (for \(1\leqslant i\leqslant k\)). Since \(w\models N_{E}\), we infer that there is no symbol \(a\in {\Sigma }\setminus ({\Sigma }_{D_{1}}\cup \ldots \cup {\Sigma }_{D_{k}})\) such that \(a\in w\), which implies \(w^{\prime }=\varepsilon \). Thus, proving \(w\in L(E)\) reduces to proving that \(w_{i}\models D_{i}^{I_{i}}\) (for \(1\leqslant i\leqslant k\)). Since E is a reduced DIME, each derivation can be constructed by reasoning on the three possible types of the \(D_{i}^{I_{i}}\) (for \(1\leqslant i\leqslant k\)).

Case 1

Take \(D_{i}^{I_{i}}=(A_{1}\mid \ldots \mid A_{k})^{+}\) of type 1. From the semantics of the UREs, we observe that proving \(w_{i}\models D_{i}^{I_{i}}\) is equivalent to proving that (i) w i is non-empty and (ii) w i can be split as \(w_{i}= w_{1}^{\prime }\uplus \ldots \uplus w_{p}^{\prime }\), where every \(w_{j}^{\prime }\) (\(1\leqslant j\leqslant p\)) satisfies an atom A l (\(1\leqslant l\leqslant k\)). First, we point out that since w satisfies the collections of required symbols P E , we infer that w i is non-empty, which implies (i). Then, since w satisfies the extended cardinality map N E and the counting dependencies K E , we infer that (ii) is also satisfied.

Case 2

Take \(D_{i}^{I_{i}} = (A_{1}^{I_{1}}\mid \ldots \mid A_{k}^{I_{k}})\) of type 2. From the semantics of UREs, we observe that proving \(w_{i}\models D_{i}^{I_{i}}\) is equivalent to proving that (i) w i is non-empty and (ii) there exists an atom with interval \(A_{j}^{I_{j}}\) (\(1\leqslant j\leqslant k\)) such that \(w_{i}\models A_{j}^{I_{j}}\). Since wP E , we infer that w i is non-empty hence (i) is satisfied. Then, since wC E , we infer that only the symbols from one atom A j of D i are present in w i . Moreover, since wN E and wK E , we infer that the number of occurrences of each symbol from \({\Sigma }_{A_{j}}\) are such that \(w_{i}\models A_{j}^{I_{j}}\). Hence, the condition (ii) is also satisfied.

Case 3

Take \(D_{i}^{I_{i}} = (A_{1}^{I_{1}}\mid \ldots \mid A_{k}^{I_{k}})\) of type 3. The only difference w.r.t. the previous case is that w i may also be empty, hence proving \(w_{i}\models D_{i}^{I_{i}}\) reduces to proving only that there exists an atom with interval \(A_{j}^{I_{j}}\) (\(1\leqslant j\leqslant k\)) such that \(w_{i}\models A_{j}^{I_{j}}\), which follows similarly to the previous case. □

Moreover, we define the subsumption of two characterizing tuples, which captures the containment of DIMEs. Given two DIMEs E and \(E^{\prime }\), we write \({\Delta }_{E^{\prime }}\preccurlyeq {\Delta }_{E}\) if \(C_{E}\subseteq C_{E^{\prime }}\), \(N_{E^{\prime }}\subseteq N_{E}\), \(P_{E}\subseteq P_{E^{\prime }}\), and \(K_{E}\subseteq K_{E^{\prime }}\). Then, we obtain the following.

Lemma 2

Given two DIMEs E and \(E^{\prime }\) , \({L}(E^{\prime })\subseteq {L}(E)\) iff \({\Delta }_{E^{\prime }}\preccurlyeq {\Delta }_{E}\).

Proof

First, we claim that given two DIMEs E and \(E^{\prime }\): \({\Delta }_{E^{\prime }}\preccurlyeq {\Delta }_{E}\) iff \(w\models {\Delta }_{E^{\prime }}\) implies w⊧Δ E for every w (*). The only if part of (*) follows directly from the definitions while the if part can be easily shown by contraposition. From Lemma 1 and (*) we infer the correctness of Lemma 2. □

Example 1

For each of the following pairs of DIMEs, it holds that \(L(E)\nsubseteq L(E^{\prime })\), which is witnessed by a difference between their characterizing tuples:

  • Take \(E=a^{*}\parallel b^{*}\) and \(E^{\prime } = (a\parallel b^?)^{*}\). Note that \(K_{E}=\varnothing \) and \(K_{E^{\prime }} = \{(a,b)\}\). For instance, the unordered word b belongs to L(E), but does not belong to \(L(E^{\prime })\).

  • Take \(E=(a^{[3,6]^?}\mid b^{*})\) and \(E^{\prime } = (a^{[3,6]}\mid b^{+})\). Note that \(P_{E} =\varnothing \), and \(P_{E^{\prime }} = \{\{a,b\}\}\). For instance, the unordered word ε belongs to L(E), but does not belong to \(L(E^{\prime })\).

  • Take \(E=(a\parallel b^?)^{*}\) and \(E^{\prime } = (a\parallel b^?)^{[0,5]}\). Note that (a,6) belongs to N E , but not to \(N_{E^{\prime }}\). For instance, the unordered word a 6 belongs to L(E), but does not belong to \(L(E^{\prime })\).

  • Take \(E=(a\parallel b)^{+}\) and \(E^{\prime } = a^{+}\mid b^{+}\). Note that \(C_{E} = \varnothing \), and \(C_{E^{\prime }} = \{(a,b),(b,a)\}\). For instance, the unordered word ab belongs to L(E), but does not belong to \(L(E^{\prime })\).

Lemma 2 shows that two equivalent DIMEs yield the same characterizing tuple, and hence, the tuple Δ E can be viewed as a “canonical form” for the language defined by a DIME E. Formally, we obtain the following.

Corollary 1

Given two DIMEs E and \(E^{\prime }\) , \(L(E) = L(E^{\prime })\) iff \({\Delta }_{E} = {\Delta }_{E^{\prime }}\).

In the next section, we show that the characterizing tuple has a compact representation that permits us to decide the problems of membership and containment in polynomial time.

4.3 Tractability of DIMEs

We now show that the characterizing tuple admits a compact representation that yields the tractability of deciding membership and containment of DIMEs.

Given a reduced DIME E, note that C E and K E are quadratic in |Σ| and can be easily constructed. The set C E consists of all pairs of distinct symbols (a,b) such that they appear in different atoms in the same clause of type 2 or 3. Moreover, K E consists of all pairs of distinct symbols (a,b) such that they appear in the same atom and I a=1.

While N E may be infinite, it can be easily represented in a compact manner using intervals: for every symbol a, the set \(\{i\in \mathbb N_{0}\mid (a,i)\in N_{E}\}\) is representable by an interval. Given a symbol a∈Σ, by \(\hat N_{E}(a)\) we denote the interval representing the set \(\{i\in \mathbb N_{0}\mid (a,i)\in N_{E}\}\) that can be easily obtained from E:

  • \(\hat N_{E}(a) = [0,0]\) if a appears in no clause in E,

  • \(\hat N_{E}(a) = [0,\infty ]\) (or simply ∗) if a appears in a clause of type 1 in E,

  • \(\hat N_{E}(a)= I^{A}\) if I a=1, A is the atom containing a, and A is the unique atom of a clause of type 2 or 3,

  • \(\hat N_{E}(a) = {I^{A}}^?\) if I a=1, A is the atom containing a, and A appears in a clause of type 2 or 3 containing at least two atoms,

  • \(\hat N_{E}(a)= [0,\max (I^{A})]\) if I a=?, A is the atom containing a, and A appears in a clause of type 2 or 3.

For example, for \(E_{0}= a^{+} \parallel ((b \parallel c^?)^{+}\mid d^{[5,\infty ]})\), we obtain the following \(\hat N_{E_{0}}\):

$$\begin{array}{@{}rcl@{}}\hat N_{E_{0}}(a)=+, \qquad \hat N_{E_{0}}(b)=*, \qquad \hat N_{E_{0}}(c)=*, \qquad \hat N_{E_{0}}(d)=[5,\infty]^?. \end{array} $$

Naturally, testing \(N_{E^{\prime }}\subseteq N_{E}\) reduces to a simple test on \(\hat N_{E^{\prime }}\) and \(\hat N_{E}\).
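For instance, representing each \(\hat N_{E}(a)\) as a triple (lo, hi, opt), where opt records a trailing ?, interval inclusion can be tested directly. The encoding below is our own; the finite probing is sufficient only when the probe bound exceeds every finite endpoint, which holds for the intervals used here:

```python
import math

# Encode an interval: the set [lo, hi], plus {0} when opt is True.
# '*' is (0, inf, False); '+' is (1, inf, False); '[5,inf]?' is (5, inf, True).
def allowed(iv, i):
    lo, hi, opt = iv
    return (opt and i == 0) or lo <= i <= hi

def subset(iv1, iv2, probe=20):
    # iv1 included in iv2: compare the unbounded tails, then probe
    # the finite prefix (valid while probe exceeds every finite endpoint).
    ok_inf = not (iv1[1] == math.inf and iv2[1] != math.inf)
    return ok_inf and all(allowed(iv2, i)
                          for i in range(probe) if allowed(iv1, i))

# N-hat for E0 = a+ || ((b || c?)+ | d[5,inf]), as computed above.
N_hat = {'a': (1, math.inf, False),
         'b': (0, math.inf, False),
         'c': (0, math.inf, False),
         'd': (5, math.inf, True)}

assert allowed(N_hat['d'], 0) and not allowed(N_hat['d'], 2)
assert subset(N_hat['a'], N_hat['b'])      # + is included in *
assert not subset(N_hat['b'], N_hat['a'])  # * is not included in +
```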

Representing P E in a compact manner is trickier. A natural idea would be to store only its \(\subseteq \)-minimal elements since P E is closed under supersets. Unfortunately, there exist DIMEs having an exponential number of \(\subseteq \)-minimal elements. For instance, for the DIME \(E_{1}=((a\parallel b)\mid (c\parallel d))^{+}\parallel ((e\parallel f)^{[2,5]}\mid g^{[1,3]})\parallel (h^{*}\mid i^{[0,9]})\), the set \(P_{E_{1}}\) has 6 \(\subseteq \)-minimal elements: {a,c}, {a,d}, {b,c}, {b,d}, {e,g}, and {f,g}. The example easily generalizes to arbitrary numbers of atoms used in the clauses.

However, we observe that the exponentially-many \(\subseteq \)-minimal elements may contain redundant information that is already captured by other elements of the characterizing tuple. For instance, for the above DIME E 1, if we know that {a,c} belongs to P E , we can easily see that other \(\subseteq \)-minimal elements also belong to P E . More precisely, we observe that for every unordered word w defined by E it holds that w(a)=w(b), w(c)=w(d) and w(e)=w(f), which is captured by the counting dependencies K E ={(a,b),(b,a),(c,d),(d,c),(e,f),(f,e)}. Hence, for the unordered words defined by E, the presence of an a implies the presence of a b, the presence of a c implies the presence of a d, etc. Consequently, if {a,c} belongs to P E , then {b,c}, {a,d}, and {b,d} also belong to P E . Similarly, if {e,g} belongs to P E , then {f,g} also belongs to P E .

Next, we use the aforementioned observation to define a compact representation of P E . For this purpose, we introduce the auxiliary notion of symbols implied by a DIME E in the presence of a set of symbols X, denoted i m p l E (X):

$$\mathit{impl}_{E}(X)=X\cup\{a\in{\Sigma}\mid \exists b\in X.\ (a,b)\in K_{E} \text{ and } (b,a)\in K_{E}\}. $$

For example, for the above E 1, we have i m p l E ({a,c})={a,b,c,d}.

Moreover, given a DIME E, by \(\mathit {P^{\subseteq _{\tiny \min }}_{E}}\) we denote the set of all \(\subseteq \)-minimal elements of P E . Given a subset \(P\subseteq \mathit {P^{\subseteq _{\tiny \min }}_{E}}\), we say that P is:

  • non-redundant if \(\forall X\in P.\ \not \exists Y\in P.\ X\subseteq \mathit {impl}_{E}(Y)\),

  • covering if \(\forall X\in \mathit {P^{\subseteq _{\tiny \min }}_{E}}.\ \exists Y\in P.\ X\subseteq \mathit {impl}_{E}(Y)\).

For example, take the above \(E_{1}=((a\parallel b)\mid (c\parallel d))^{+}\parallel ((e\parallel f)^{[2,5]}\mid g^{[1,3]})\parallel (h^{*}\mid i^{[0,9]})\) and recall that \(\mathit {P^{\subseteq _{\tiny \min }}_{E_{1}}}=\{\{a,c\},\{a,d\},\{b,c\},\{b,d\},\{e,g\},\{f,g\}\}\). Then, we have the following:

  • {{b,c},{f,g}} is non-redundant and covering,

  • {{b,c}} is non-redundant and it is not covering,

  • {{a,c},{b,c},{f,g}} is redundant and covering,

  • {{a,c},{b,c}} is redundant and not covering.
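These definitions translate directly into code. Below is a sketch over the data of E 1, with \(K_{E_{1}}\) and the \(\subseteq \)-minimal elements of \(P_{E_{1}}\) transcribed from above (helper names are ours):

```python
# K and the subset-minimal elements of P for
# E1 = ((a||b)|(c||d))+ || ((e||f)[2,5] | g[1,3]) || (h* | i[0,9]).
K = {('a','b'), ('b','a'), ('c','d'), ('d','c'), ('e','f'), ('f','e')}
P_min = [{'a','c'}, {'a','d'}, {'b','c'}, {'b','d'}, {'e','g'}, {'f','g'}]

def impl(X):
    # impl_E(X): X plus every a mutually counting-dependent on some b in X.
    return X | {a for b in X for (a, b2) in K if b2 == b and (b, a) in K}

def non_redundant(P):
    # No element of P is already implied by another element of P.
    return all(not (X <= impl(Y)) for X in P for Y in P if X is not Y)

def covering(P):
    # Every subset-minimal element of P_E is implied by some element of P.
    return all(any(X <= impl(Y) for Y in P) for X in P_min)

assert impl({'a', 'c'}) == {'a', 'b', 'c', 'd'}
assert non_redundant([{'b','c'}, {'f','g'}]) and covering([{'b','c'}, {'f','g'}])
assert non_redundant([{'b','c'}]) and not covering([{'b','c'}])
assert not non_redundant([{'a','c'}, {'b','c'}, {'f','g'}])
assert not covering([{'a','c'}, {'b','c'}])
```

The assertions reproduce exactly the four example subsets discussed above.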

Given a DIME E, the compact representation of the collections of required symbols P E is naturally a non-redundant and covering subset of \(\mathit {P^{\subseteq _{\tiny \min }}_{E}}\). Since there may exist many non-redundant and covering subsets of \(\mathit {P^{\subseteq _{\tiny \min }}_{E}}\), we use the total order <Σ on the alphabet Σ to propose a deterministic construction of the compact representation \(\hat P_{E}\). For this purpose, we define first some additional notations.

Given an atom A, by Φ(A) we denote the smallest label from Σ w.r.t. <Σ that is present in A and has interval 1:

$$ {\Phi}(A)=\min_{<_{\Sigma}}\{a\in{\Sigma}_{A}\mid I^{a} = 1\}. $$

For example, \({\Phi }(a\parallel b)=a\). Then, given a clause with interval \(D^{I}\), by Φ(D I) we denote the set of all symbols Φ(A) for every atom A in D:

$$ {\Phi}(D^{I})=\{\Phi(A)\mid A\text{ is an atom in }D\}. $$

For example, \({\Phi }(((a\parallel b)\mid (c\parallel d))^{+})=\{a,c\}\) and \({\Phi }(((e\parallel f)^{[2,5]}\mid g^{[1,3]}))=\{e,g\}\). Then, \(\hat P_{E}\) consists of all such sets for the clauses with intervals of type 1 or 2:

$$ \hat P_{E} = \{\Phi(D^{I})\mid D^{I} \text{ is a clause with interval of type 1 or 2 in }E\}. $$

For example, \(\hat P_{E_{1}}=\{\{a,c\},\{e,g\}\}\). Notice that the set {a,c} is due to the clause with interval \(((a\parallel b)\mid (c\parallel d))^{+}\) of type 1 and the set {e,g} is due to the clause with interval \(((e\parallel f)^{[2,5]}\mid g^{[1,3]})\) of type 2. Also notice that the clause with interval \((h^{*}\mid i^{[0,9]})\) is of type 3, none of its symbols is required, and consequently, no set in \(\hat P_{E_{1}}\) contains symbols from it.
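Given a structural representation of a reduced DIME, Φ and \(\hat P_{E}\) are immediate to compute. A sketch for E 1 follows; the encoding of clauses as (type, atoms) pairs is ours, <Σ is taken to be alphabetical order, and atom-level intervals are elided since they are not needed here:

```python
# E1's three clauses: (type, atoms), each atom mapping its symbols to
# their interval, with 1 standing for "exactly one occurrence".
E1 = [
    (1, [{'a': 1, 'b': 1}, {'c': 1, 'd': 1}]),  # ((a||b)|(c||d))+
    (2, [{'e': 1, 'f': 1}, {'g': 1}]),          # ((e||f)[2,5] | g[1,3])
    (3, [{'h': 1}, {'i': 1}]),                  # (h* | i[0,9])
]

def phi(atom):
    # Smallest symbol (w.r.t. alphabetical <_Sigma) with interval 1.
    return min(s for s, iv in atom.items() if iv == 1)

def p_hat(clauses):
    # One set Phi(D^I) per clause of type 1 or 2; type 3 contributes none.
    return [{phi(A) for A in atoms} for (t, atoms) in clauses if t in (1, 2)]

assert p_hat(E1) == [{'a', 'c'}, {'e', 'g'}]
```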

We are now ready to define the compact representation of a characterizing tuple. Given a DIME E, we say that \(\hat {\Delta }_{E}=(C_{E},\hat N_{E},\hat P_{E}, K_{E})\) is the compact representation of its characterizing tuple Δ E . Then, an unordered word w satisfies \(\hat {\Delta }_{E}\), denoted \(w\models \hat {\Delta }_{E}\), if

  • \(w\models C_{E}\) and \(w\models K_{E}\), defined as when we introduced w⊧Δ E ,

  • \(w\models \hat N_{E}\) i.e., \(\forall a\in {\Sigma }.\ w(a)\in \hat N_{E}(a)\),

  • \(w\models \hat P_{E}\) i.e., \(\forall X\in \hat P_{E}.\ \exists a\in X.\ a\in w\). Notice that we use exactly the same definition as for \(w\models P_{E}\) and recall that \(\hat P_{E}\) is in fact a non-redundant and covering subset of \(\mathit {P^{\subseteq _{\tiny \min }}_{E}}\).

Next, we show that given a DIME E, its compact characterizing tuple \(\hat {\Delta }_{E}\) defines precisely the same set of unordered words as its characterizing tuple Δ E .

Lemma 3

Given an unordered word w and a DIME E, w⊧Δ E iff \(w\models \hat {\Delta }_{E}\).

Proof

The only if part follows directly from the definitions. For the if part, proving w⊧Δ E reduces to proving that wP E , which moreover, reduces to proving that for every X from \(\mathit {P^{\subseteq _{\tiny \min }}_{E}}\) there is a symbol a in X that occurs in w (*). Since \(\hat P_{E}\) is a covering subset of \(\mathit {P^{\subseteq _{\tiny \min }}_{E}}\), we know that for every \(X\in \mathit {P^{\subseteq _{\tiny \min }}_{E}}\) there exists a set \(Y\in \hat P_{E}\) such that \(X\subseteq \mathit {impl}_{E}(Y)\). Since \(w\models \hat P_{E}\) and wK E , we infer that (*) is satisfied. □

Additionally, we define the subsumption of the compact representations of two characterizing tuples. Given two DIMEs E and \(E^{\prime }\), we write \(\hat {\Delta }_{E^{\prime }}\preccurlyeq \hat {\Delta }_{E}\) if

  • \(C_{E}\subseteq C_{E^{\prime }}\) and \(K_{E}\subseteq K_{E^{\prime }}\) (as for the subsumption of characterizing tuples),

  • \(\forall a\in {\Sigma }.\ \hat N_{E^{\prime }}(a)\subseteq \hat N_{E}(a)\),

  • \(\forall X\in \hat P_{E}.\ \exists Y\in \hat P_{E^{\prime }}.\ Y\subseteq \mathit {impl}_{E^{\prime }}(X)\).

Next, we show that the subsumption of compact representations of characterizing tuples captures the subsumption of characterizing tuples.

Lemma 4

Given two DIMEs E and \(E^{\prime }\) , \({\Delta }_{E^{\prime }}\preccurlyeq {\Delta }_{E}\) iff \(\hat {\Delta }_{E^{\prime }}\preccurlyeq \hat {\Delta }_{E}\).

Proof

First, since P E is closed under supersets, we observe that

$$ P_{E}\subseteq P_{E^{\prime}} \text{ iff } \forall X\in\mathit{P^{\subseteq_{\tiny\min}}_{E}}.\ \exists Y\in\mathit{P^{\subseteq_{\tiny\min}}_{E^{\prime}}}.\ Y\subseteq X. $$

Moreover, the conditions \(C_{E}\subseteq C_{E^{\prime }}\) and \(K_{E}\subseteq K_{E^{\prime }}\) are part of both \({\Delta }_{E^{\prime }}\preccurlyeq {\Delta }_{E}\) and \(\hat {\Delta }_{E^{\prime }}\preccurlyeq \hat {\Delta }_{E}\). Consequently, proving \({\Delta }_{E^{\prime }}\preccurlyeq {\Delta }_{E}\) iff \(\hat {\Delta }_{E^{\prime }}\preccurlyeq \hat {\Delta }_{E}\) reduces to proving that, if \(C_{E}\subseteq C_{E^{\prime }}\) and \(K_{E}\subseteq K_{E^{\prime }}\), then

$$\forall X\in\mathit{P^{\subseteq_{\tiny\min}}_{E}}.\ \exists Y\in\mathit{P^{\subseteq_{\tiny\min}}_{E^{\prime}}}.\ Y\subseteq X \text{ iff } \forall X\in \hat P_{E}.\ \exists Y \in\hat P_{E^{\prime}}.\ Y\subseteq \mathit{impl}_{E^{\prime}}(X). $$

For the only if part, take a set X from \(\hat P_{E}\). Since X also belongs to \(\mathit {P^{\subseteq _{\tiny \min }}_{E}}\), we know by hypothesis that there exists a set Y in \(\mathit {P^{\subseteq _{\tiny \min }}_{E^{\prime }}}\) such that \(Y\subseteq X\). Then, we construct a set \(Y^{\prime }\) from Y by replacing each symbol b from Y with the smallest a w.r.t. <Σ such that (a,b) and (b,a) belong to \(K_{E^{\prime }}\). Since \(K_{E}\subseteq K_{E^{\prime }}\), we infer that \(Y^{\prime }\subseteq \mathit {impl}_{E^{\prime }}(X)\). For the if part, take an X from \(\hat P_{E}\) and a Y from \(\hat P_{E^{\prime }}\) s.t. \(Y\subseteq \mathit {impl}_{E^{\prime }}(X)\). To construct the corresponding \(X^{\prime }\) in \(\mathit {P^{\subseteq _{\tiny \min }}_{E}}\) and \(Y^{\prime }\) in \(\mathit {P^{\subseteq _{\tiny \min }}_{E^{\prime }}}\) such that \(Y^{\prime }\subseteq X^{\prime }\), we replace symbols a from X and \(a^{\prime }\) from Y with symbols b in \(X^{\prime }\) and \(b^{\prime }\) in \(Y^{\prime }\) such that (a,b) and (b,a) belong to K E , and \((a^{\prime },b^{\prime })\) and \((b^{\prime },a^{\prime })\) belong to \(K_{E^{\prime }}\). Since \(K_{E}\subseteq K_{E^{\prime }}\), we know that such \(X^{\prime }\) and \(Y^{\prime }\) do exist. □

Example 2

Take \(E=a^{*}\parallel (b\mid c)^{+}\parallel d^{*}\) and \(E^{\prime }=(a\parallel b)^{+}\mid (c\parallel d)^{+}\). Notice that \(L(E^{\prime })\subseteq L(E)\), \({\Delta }_{E^{\prime }}\preccurlyeq {\Delta }_{E}\), and \(\hat {\Delta }_{E^{\prime }}\preccurlyeq \hat {\Delta }_{E}\). In particular, we have the following.

  • \(C_{E}=\varnothing \) is included in \(C_{E^{\prime }}=\{(a,c),(a,d),(b,c),(b,d),(c,a),(c,b),(d,a),(d,b)\}\),

  • \(\hat N_{E}(a) = \hat N_{E^{\prime }}(a) = *,\ldots , \hat N_{E}(d) = \hat N_{E^{\prime }}(d) = *\),

  • \(K_{E}=\varnothing \) is included in \(K_{E^{\prime }} = \{(a,b),(b,a),(c,d),(d,c)\}\),

  • \(\hat P_{E} = \{\{b,c\}\}\) and \(\hat P_{E^{\prime }} = \{\{a,c\}\}\) that compactly represent P E ={{b,c},…} and \(P_{E^{\prime }}=\{\{a,c\}, \{a,d\}, \{b,c\}, \{b,d\},\ldots \}\), respectively (we have listed only the \(\subseteq \)-minimal sets). Then, take X={b,c} from \(\hat P_{E}\) and notice that there exists Y={a,c} in \(\hat P_{E^{\prime }}\) such that \( Y\subseteq \mathit {impl}_{E^{\prime }}(X)\) because \(\mathit {impl}_{E^{\prime }}(\{b,c\}) = \{a,b,c,d\}\).
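The three subsumption conditions can be checked directly on Example 2. The sketch below transcribes the sets from above and omits the \(\hat N\) condition, since every \(\hat N\) interval here is ∗ (helper names are ours):

```python
# Example 2: E = a* || (b|c)+ || d*  and  E' = (a||b)+ | (c||d)+.
C_E, K_E, P_hat_E = set(), set(), [{'b', 'c'}]
C_Ep = ({(x, y) for x in 'ab' for y in 'cd'} |
        {(x, y) for x in 'cd' for y in 'ab'})
K_Ep = {('a', 'b'), ('b', 'a'), ('c', 'd'), ('d', 'c')}
P_hat_Ep = [{'a', 'c'}]

def impl(X, K):
    # Symbols mutually counting-dependent on some symbol of X.
    return X | {a for b in X for (a, b2) in K if b2 == b and (b, a) in K}

def subsumed(C1, K1, P1, C2, K2, P2):
    # Delta-hat(E') <= Delta-hat(E) for E = (C1, K1, P1), E' = (C2, K2, P2),
    # without the N-hat condition (all N-hat intervals are * here).
    return (C1 <= C2 and K1 <= K2 and
            all(any(Y <= impl(X, K2) for Y in P2) for X in P1))

assert impl({'b', 'c'}, K_Ep) == {'a', 'b', 'c', 'd'}
assert subsumed(C_E, K_E, P_hat_E, C_Ep, K_Ep, P_hat_Ep)
```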

Next, we show that the compact representation is of polynomial size.

Lemma 5

Given a DIME E, the compact representation \(\hat {\Delta }_{E}=(C_{E},\hat N_{E},\hat P_{E},K_{E})\) of its characterizing tuple Δ E is of size polynomial in the size of the alphabet Σ.

Proof

By construction, the sizes of C E and K E are quadratic in |Σ| while the sizes of \(\hat P_{E}\) and \(\hat N_{E}\) are linear in |Σ|. □

The use of compact representation of characterizing tuples allows us to state the main result of this section.

Theorem 4

Given an unordered word w and two DIMEs E and \(E^{\prime }\):

  1. deciding whether w∈L(E) is in PTIME,

  2. deciding whether \(L(E^{\prime })\subseteq L(E)\) is in PTIME.

Proof

The first part follows from Lemma 1, Lemma 3, and Lemma 5. The second part follows from Lemma 2, Lemma 4, and Lemma 5. □

5 Interval Multiplicity Schemas

In this section, we employ DIMEs to define schema languages and we present the related problems of interest.

Definition 1

A disjunctive interval multiplicity schema (DIMS) is a tuple \(S=(\mathit{root}_{S},R_{S})\), where \(\mathit{root}_{S}\in {\Sigma }\) is a designated root label and R S maps symbols in Σ to DIMEs. By DIMS we denote the set of all disjunctive interval multiplicity schemas. A disjunction-free interval multiplicity schema (IMS) \(S=(\mathit{root}_{S},R_{S})\) is a restricted DIMS, where R S maps symbols in Σ to IMEs. By IMS we denote the set of all disjunction-free interval multiplicity schemas.

We define the language captured by a DIMS S in the following way. Given a tree t, we first define the unordered word \(\mathit{ch}_{t}^{n}\) of children of a node \(n\in N_{t}\) of t, i.e., \({\mathit {ch}_{t}^{n}}(a) = |\{m\in N_{t}\mid (n,m)\in \mathit {child}_{t}\wedge \mathit {lab}_{t}(m)=a\}|\). Now, a tree t satisfies S, in symbols \(t\models S\), if \(\mathit{lab}_{t}(\mathit{root}_{t})=\mathit{root}_{S}\) and, for every node \(n\in N_{t}\), \(\mathit{ch}_{t}^{n}\in L(R_{S}(\mathit{lab}_{t}(n)))\). By \(L(S)\subseteq {\mathit {Tree}}\) we denote the set of all trees satisfying S.
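For illustration, the satisfaction relation can be sketched in a few lines of Python. The encoding is an assumption of this sketch, not part of the formalism: a tree is a (label, children) pair, and each \(R_S(a)\) is abstracted by a predicate on the multiset \(\mathit{ch}_t^n\) of children labels.

```python
from collections import Counter

def ch(node):
    """Unordered word (Parikh image) ch_t^n of the children of a node.
    A node is a pair (label, list_of_children)."""
    _, children = node
    return Counter(label for (label, _) in children)

def satisfies(tree, root_label, rules):
    """t |= S: the root carries root_label and, at every node n,
    ch_t^n is accepted by the rule for the node's label.
    `rules` maps a label to a predicate over Counters."""
    label, _ = tree
    if label != root_label:
        return False
    def check(node):
        lbl, kids = node
        return rules[lbl](ch(node)) and all(check(k) for k in kids)
    return check(tree)

# An illustrative rule set in the spirit of the DBLP example:
# article -> title || year || author^+ (children of leaves must be empty)
rules = {
    "article": lambda w: w["title"] == 1 and w["year"] == 1
               and w["author"] >= 1 and set(w) <= {"title", "year", "author"},
    "title": lambda w: not w, "year": lambda w: not w, "author": lambda w: not w,
}
t = ("article", [("author", []), ("title", []), ("year", []), ("author", [])])
```

Note that the sibling order in `t` is irrelevant: only the Counter of labels is inspected, mirroring the unordered-word semantics.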

In the sequel, we present a schema \(S=(\mathit{root}_{S},R_{S})\) as a set of rules of the form \(a\rightarrow R_{S}(a)\), for every \(a\in{\Sigma}\). If \(L(R_{S}(a))=\{\epsilon\}\), then we write \(a\rightarrow \epsilon \) or we simply omit such a rule.

Example 3

Take the content model of a semi-structured database storing information about a peer-to-peer file sharing system, with the following rules: 1) a peer may download at most as many files as it uploads, and 2) peers are split into two groups: a peer is a vip if it uploads at least 100 files; otherwise, it is a simple user:

$$\begin{array}{@{}rcl@{}} peers \rightarrow~ && user^{*}\parallel vip^{*},\\ user \rightarrow~ && (upload\parallel download^?)^{[0,99]},\\ vip \rightarrow~ && (upload\parallel download^?)^{[100,\infty]}. \end{array} $$
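Under the rules above, matching \((\mathit{upload}\parallel \mathit{download}^{?})^{[n,m]}\) reduces to counting: a word with u uploads and d downloads matches iff \(n\leq u\leq m\) and \(d\leq u\), since each repetition of the atom contributes exactly one upload and at most one download. A small sketch (the function names are ours, introduced only for illustration):

```python
def matches_interval_atom(uploads, downloads, lo, hi):
    """An unordered word with `uploads` upload symbols and `downloads`
    download symbols matches (upload || download?)^[lo,hi] iff the atom
    is repeated `uploads` times, each repetition yielding exactly one
    upload and at most one download."""
    return lo <= uploads <= hi and downloads <= uploads

def classify_peer(uploads, downloads):
    """Return which rule (user / vip) the peer's children satisfy, if any."""
    if matches_interval_atom(uploads, downloads, 0, 99):
        return "user"
    if matches_interval_atom(uploads, downloads, 100, float("inf")):
        return "vip"
    return None
```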

Example 4

Take the content model of a semi-structured database storing information about two types of cultural events: plays and movies. Every event has a date when it takes place. If the event is a play, then it takes place in a theater while a movie takes place in a cinema.

$$\begin{array}{@{}rcl@{}} events\rightarrow~ && event^{*},\\ event\rightarrow~ && date\parallel ((play\parallel theater)\mid (movie\parallel cinema)). \end{array} $$

Problems of Interest

We next define the problems of interest and formally state the corresponding decision problems, parameterized by a class of schemas \(\mathcal S\) and, when appropriate, by a class of queries \(\mathcal Q\).

  • Schema satisfiability – checking if there exists a tree satisfying the given schema:

    $$ \text{SAT}_{\mathcal{S}} = \{ S \in\mathcal{S} \mid \exists t\in Tree.\ t\models S \}. $$
  • Membership – checking if the given tree satisfies the given schema:

    $$ \text{MEMB}_{\mathcal{S}} = \{(S,t) \in\mathcal{S}\times Tree \mid t\models S\}. $$
  • Schema containment – checking if every tree satisfying one given schema satisfies another given schema:

    $$ \text{CNT}_{\mathcal{S}} = \{(S_{1},S_{2})\in\mathcal{S}\times\mathcal{S}\mid L(S_{1})\subseteq L(S_{2})\}. $$
  • Query satisfiability by schema – checking if there exists a tree that satisfies the given schema and the given query:

    $$ \text{SAT}_{\mathcal{S},\mathcal{Q}} = \{(S,q)\in \mathcal{S}\times\mathcal{Q}\mid \exists t\in L(S).\ t\models q\}. $$
  • Query implication by schema – checking if every tree satisfying the given schema satisfies also the given query:

    $$ \text{IMPL}_{\mathcal{S},\mathcal{Q}} = \{(S, q)\in \mathcal{S}\times\mathcal{Q}\mid \forall t\in L(S).\ t\models q\}. $$
  • Query containment in the presence of schema – checking if every tree satisfying the given schema and one given query also satisfies another given query:

    $$ \text{CNT}_{\mathcal{S},\mathcal{Q}}= \{(p, q, S) \in \mathcal{Q}\times\mathcal{Q}\times\mathcal{S} \mid \forall t\in L(S).\ t \models p\Rightarrow t\models q\}. $$

We study these problems for DIMSs and IMSs in Sections 6 and 7 of the paper.

6 Complexity of Disjunctive Interval Multiplicity Schemas (DIMSs)

In this section, we present the complexity results for DIMSs. First, we show the tractability of schema satisfiability and containment. Then, we provide a streaming algorithm for deciding membership, i.e., one that processes an XML document in a single pass, using memory depending on the height of the tree rather than its size. Finally, we point out that the complexity of query satisfiability, implication, and containment in the presence of the schema follows from existing results.

First, we show the tractability of schema satisfiability and schema containment.

Proposition 1

\(\text{SAT}_{\mathit{DIMS}}\) and \(\text{CNT}_{\mathit{DIMS}}\) are in PTIME.

Proof

A simple algorithm based on dynamic programming can decide the satisfiability of a DIMS. More precisely, given a schema \(S=(\mathit{root}_{S},R_{S})\), one has to determine for every symbol a of the alphabet Σ whether there exists a (finite) tree t that satisfies \(S^{\prime }=(a,R_{S})\). Then, the schema S is satisfiable if there exists such a tree for the root label \(\mathit{root}_{S}\).

Moreover, testing the containment of two DIMSs reduces to testing, for each symbol in the alphabet, the containment of the associated DIMEs, which is in PTIME (Theorem 4). □
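For illustration, the dynamic programming for satisfiability can be sketched as a least fixpoint over the alphabet. The sketch below rests on a deliberately simplified abstraction we assume here: each DIME is represented only by the set of child symbols it makes mandatory, which suffices to convey the idea of the computation (the actual algorithm must of course test satisfiability of the full expression).

```python
def satisfiable_labels(required):
    """Least-fixpoint sketch: a label admits a finite tree iff every
    symbol it requires among its children admits one. `required` maps
    each label to the set of child symbols that must occur below it
    (a crude abstraction of its DIME)."""
    sat = set()
    changed = True
    while changed:
        changed = False
        for a, req in required.items():
            if a not in sat and req <= sat:
                sat.add(a)
                changed = True
    return sat

# r requires a child a; a requires b; b requires nothing;
# c requires itself, so no finite tree rooted at c exists.
required = {"r": {"a"}, "a": {"b"}, "b": set(), "c": {"c"}}
```

Running `satisfiable_labels(required)` marks b, then a, then r, while c is never added; a schema with root label r would thus be satisfiable.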

Next, we provide a streaming algorithm for deciding membership, i.e., one that processes an XML document in a single pass and uses memory depending on the height of the tree rather than its size. Our notion of streaming has been employed in [42] as a relaxation of constant-memory XML validation against DTDs, which can be performed only for some DTDs [41, 42]. In general, validation against DIMSs cannot be performed with constant memory, due to the same observations as in [41, 42] concerning the use of recursion in the schema. Hence, we have chosen our notion of streaming so as to obtain an algorithm that works for the entire class of DIMSs. We assume that the input tree is given in XML format, with an arbitrary ordering of sibling nodes. Moreover, the proposed algorithm performs earliest rejection, i.e., if the given tree does not satisfy the given schema, the algorithm outputs the result as early as possible. For a tree t, \(\mathit{height}(t)\) is the height of t defined in the usual way. We employ the standard RAM model and assume that subsequent natural numbers are used as labels in Σ.

Proposition 2

\(\text{MEMB}_{\mathit{DIMS}}\) is in PTIME. There exists an earliest-rejection streaming algorithm that checks membership of a tree t in a DIMS S in time \(O(|t|\times |{\Sigma }|^{2})\), using space \(O(\mathit{height}(t)\times |{\Sigma }|^{2})\).

Proof

We propose Algorithm 1 for deciding the membership of a tree t in the language of a DIMS S. The input tree t is given in XML format, with some arbitrary ordering of sibling nodes. We assume a well-formed stream \(\widetilde {t} \subset \{\mathit {open},\mathit {close}\}\times {\Sigma }\) representing a tree t and a procedure \(\mathit{read}(\widetilde {t})\) that returns the next pair \((\theta,b)\) in the stream, where \(\theta\in\{\mathit{open},\mathit{close}\}\) and \(b\in{\Sigma}\). The algorithm works for every ordering of sibling nodes. To validate a tree t against a DIMS \(S=(\mathit{root}_{S},R_{S})\), one has to run Algorithm 1 after reading the opening tag of the root.

For a given node, the algorithm constructs the compact representation of the characterizing tuple of its label (line 1), which requires space \(O(|{\Sigma}|^{2})\) (cf. Lemma 5). The algorithm also stores, for a given node, the number of occurrences of each label in Σ among its children. This is done using the array \(\mathit{count}\), which requires space \(O(|{\Sigma}|)\). Initially, all values in the array \(\mathit{count}\) are set to 0 (lines 2-3) and they are updated after reading the opening tag of each child (lines 4-6). During the execution, the algorithm maintains a stack whose height is the depth of the currently visited node. Naturally, the bound on the space required is \(O(\mathit{height}(t)\times |{\Sigma }|^{2})\).

The algorithm performs earliest rejection since it rejects a tree as early as possible. More precisely, rejection can happen after reading the opening tag of a node that violates the maximum allowed cardinality for its label (lines 7-8) or that forms a conflicting pair of siblings (lines 9-10). Otherwise, the algorithm recursively validates the corresponding subtree (lines 11-12). After reading all children of the current node, the algorithm checks whether the components of the characterizing tuple are satisfied: the extended cardinality map (lines 14-15), the collections of required symbols (lines 16-17), and the counting dependencies (lines 18-19). Notice that, since we have checked the conflicting pairs of siblings after reading each opening tag, we do not need to check them again after reading all children. However, we still need to check the extended cardinality map at this point, to see whether the number of occurrences of each label falls in the allowed interval: when reading an opening tag, we were able to reject only if the maximum allowed number of occurrences had already been violated. As for the collections of required symbols and the counting dependencies, we can establish whether they are satisfied only after reading all children. If none of the constraints imposed by the characterizing tuple is violated, the algorithm returns true (line 20). As we have already shown with Lemma 1 and Lemma 3, the compact representation of the characterizing tuple captures precisely the language of a given DIME. Consequently, the algorithm returns true after reading the root node iff the given tree satisfies the given schema. □

Algorithm 1 (streaming membership checking for DIMSs)
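To convey the structure of Algorithm 1, here is a simplified streaming validator in Python. It is a sketch, not the algorithm itself: the full characterizing tuple is replaced by plain per-label occurrence intervals (the `rules` encoding is our assumption), but the stack discipline, the single pass, and the earliest rejection on exceeded maxima are the same.

```python
def validate_stream(events, root_label, rules):
    """Single-pass validation of a well-formed stream of (op, label)
    events against simplified per-label rules {label: {child: (lo, hi)}}.
    A child label absent from the parent's rule defaults to (0, 0),
    i.e., it is forbidden. Memory is proportional to the number of
    currently open nodes, hence to the height of the tree."""
    stack = []  # open nodes: (label, child_counts)
    for op, label in events:
        if op == "open":
            if not stack:
                if label != root_label:
                    return False
            else:
                parent_label, counts = stack[-1]
                lo, hi = rules.get(parent_label, {}).get(label, (0, 0))
                counts[label] = counts.get(label, 0) + 1
                if counts[label] > hi:
                    return False  # earliest rejection: maximum exceeded
            stack.append((label, {}))
        else:  # close: all children of this node have been read
            lbl, counts = stack.pop()
            for child, (lo, hi) in rules.get(lbl, {}).items():
                if counts.get(child, 0) < lo:
                    return False  # a lower bound is violated
    return True

# Illustrative rules: article -> title || year || author^+
rules = {"article": {"title": (1, 1), "year": (1, 1), "author": (1, 10**9)},
         "title": {}, "year": {}, "author": {}}
```

As in the paper's algorithm, upper bounds are checked eagerly on opening tags, while lower bounds can only be checked once all children have been read.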

We continue with complexity results that follow from known facts. Query satisfiability for DTDs is NP-complete [5], and we adapt this result to DIMSs.

Proposition 3

\(\text{SAT}_{\mathit{DIMS},\mathit{Twig}}\) is NP-complete.

Proof

Proposition 4.2.1 from [5] implies that satisfiability of twig queries in the presence of DTDs is NP-hard. We adapt the proof to obtain the following reduction from SAT to \(\text{SAT}_{\mathit{DIMS},\mathit{Twig}}\): we take a CNF formula \(\varphi =\bigwedge _{i=1}^{n}C_{i}\) over the variables \(x_{1},\ldots,x_{m}\), where each \(C_{i}\) is a disjunction of literals. We take \({\Sigma}=\{r,t_{1},f_{1},\ldots,t_{m},f_{m},c_{1},\ldots,c_{n}\}\) and we construct:

  • The DIMS S having the root label r and the rules:

    • \(r \rightarrow (t_{1} \mid f_{1}) \parallel \dots \parallel (t_{m}\mid f_{m})\),

    • \(t_{j} \rightarrow c_{j_{1}}\parallel \ldots \parallel c_{j_{k}}\), where \(c_{j_{1}},\ldots ,c_{j_{k}}\) correspond to the clauses using \(x_{j}\) (for \(1\leqslant j\leqslant m\)),

    • \(f_{j} \rightarrow c_{j_{1}}\parallel \ldots \parallel c_{j_{k}}\), where \(c_{j_{1}},\ldots ,c_{j_{k}}\) correspond to the clauses using \(\neg x_{j}\) (for \(1\leqslant j\leqslant m\)).

  • The twig query \(q = r[{/\!/} c_{1}]\dots [{/\!/} c_{n}]\).

For example, for the formula \(\varphi _{0}=(x_{1}\vee \neg x_{2} \vee x_{3})\wedge (\neg x_{1}\vee x_{3}\vee \neg x_{4})\) we obtain the DIMS S containing the rules:

$$\begin{array}{@{}rcl@{}}r \rightarrow (t_{1}\mid f_{1}) \parallel (t_{2}\mid f_{2}) \parallel (t_{3}\mid f_{3}) \parallel (t_{4}\mid f_{4}), \\ t_{1} \rightarrow c_{1}, \qquad f_{1}\rightarrow c_{2}, \qquad t_{2}\rightarrow \epsilon, \qquad f_{2}\rightarrow c_{1},\\ t_{3}\rightarrow c_{1}\parallel c_{2}, \qquad f_{3} \rightarrow \epsilon, \qquad t_{4} \rightarrow \epsilon, \qquad f_{4} \rightarrow c_{2}. \end{array} $$

and the query \(q=r[{/\!/} c_{1}][{/\!/} c_{2}]\). The formula φ is satisfiable iff \((S,q)\in \text{SAT}_{\mathit{DIMS},\mathit{Twig}}\). The described reduction works in time polynomial in the size of the input formula.

For the NP upper bound, we reduce \(\text{SAT}_{\mathit{DIMS},\mathit{Twig}}\) to \(\text{SAT}_{\mathit{DTD},\mathit{Twig}}\) (i.e., the problem of satisfiability of twig queries in the presence of DTDs), known to be in NP (Theorem 4.4 from [5]). Given a DIMS S, we construct a DTD D having the same root label as S and whose rules are obtained from the rules of S by replacing the unordered concatenation with standard (ordered) concatenation. Then, take a twig query q. We claim that there exists an (unordered) tree satisfying q and S iff there exists an (ordered) tree satisfying q and D. For the if part, take an ordered tree t satisfying q and D, remove the order to obtain an unordered tree \(t^{\prime }\), and observe that \(t^{\prime }\) satisfies S. For the only if part, take an unordered tree t satisfying q and S. From the construction of D, we infer that there exists an ordered tree \(t^{\prime }\) (obtained via some ordering of the sibling nodes of t) satisfying both q and D. We recall that twig queries disregard the relative order among siblings. □
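The construction of the schema in the reduction can be sketched programmatically. The encoding below is an assumption of this sketch: clauses are given in DIMACS style (the integer j stands for \(x_j\), −j for \(\neg x_j\)), and each rule is represented as a plain Python set of clause symbols (or, under the root, a list of \((t_j, f_j)\) disjunction pairs) rather than as an actual DIME.

```python
def cnf_to_dims(clauses, m):
    """Build the rules of the DIMS used in the NP-hardness reduction.
    clauses: list of clauses, each a list of non-zero ints (DIMACS style).
    m: number of variables. The root rule lists the (t_j, f_j) choices;
    every other rule is the set of clause symbols placed below it."""
    rules = {"r": [(f"t{j}", f"f{j}") for j in range(1, m + 1)]}
    for j in range(1, m + 1):
        rules[f"t{j}"] = {f"c{i+1}" for i, cl in enumerate(clauses) if j in cl}
        rules[f"f{j}"] = {f"c{i+1}" for i, cl in enumerate(clauses) if -j in cl}
    return rules

# φ0 = (x1 ∨ ¬x2 ∨ x3) ∧ (¬x1 ∨ x3 ∨ ¬x4)
phi0 = [[1, -2, 3], [-1, 3, -4]]
rules_phi0 = cnf_to_dims(phi0, 4)
```

On φ0 this reproduces exactly the rules listed in the example: \(t_1\rightarrow c_1\), \(f_1\rightarrow c_2\), \(t_2\rightarrow\epsilon\), \(f_2\rightarrow c_1\), and so on.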

The complexity results for query implication and query containment in the presence of DIMSs follow from the EXPTIME-completeness proof from [35] for twig query containment in the presence of DTDs.

Proposition 4

\(\text{IMPL}_{\mathit{DIMS},\mathit{Twig}}\) and \(\text{CNT}_{\mathit{DIMS},\mathit{Twig}}\) are EXPTIME-complete.

Proof

The EXPTIME-hardness proof of twig containment in the presence of DTDs (Theorem 4.5 from [35]) uses a reduction from the Two-player corridor tiling problem and a technique introduced in [32]. In the proof from [35], when testing the containment \(p\subseteq _{S}q\), p is chosen such that it is satisfied by every tree in L(S), hence \(\text{IMPL}_{\mathit{DTD},\mathit{Twig}}\) is also EXPTIME-complete. Furthermore, Lemma 3 in [32] can be adapted to twig queries and DIMSs: for every \(S\in\mathit{DIMS}\) and twig queries \(q_{0},q_{1},\ldots,q_{m}\) there exist \(S^{\prime }\in \mathit{DIMS}\) and twig queries q and \(q^{\prime }\) such that \(q_{0} \subseteq _{S} q_{1}\cup \ldots \cup q_{m}\) iff \(q \subseteq _{S^{\prime }} q^{\prime }\). Moreover, the DTD in [35] can be captured with a DIMS constructible in polynomial time: take the same reduction as in [35] and then replace the standard concatenation with unordered concatenation. Hence, we infer that \(\text{CNT}_{\mathit{DIMS},\mathit{Twig}}\) and \(\text{IMPL}_{\mathit{DIMS},\mathit{Twig}}\) are also EXPTIME-hard.

For the EXPTIME upper bound, we reduce \(\text{CNT}_{\mathit{DIMS},\mathit{Twig}}\) to \(\text{CNT}_{\mathit{DTD},\mathit{Twig}}\) (i.e., the problem of twig query containment in the presence of DTDs), known to be in EXPTIME (Theorem 4.4 from [35]). Given a DIMS S, we construct a DTD D having the same root label as S and whose rules are obtained from the rules of S by replacing the unordered concatenation with standard (ordered) concatenation. Then, take two twig queries p and q. We claim that \(p\subseteq _{S} q\) iff \(p\subseteq _{D} q\) and show the two parts by contraposition. For the if part, assume \(p\not \subseteq _{S} q\), hence there exists an unordered tree t that satisfies p and S, but not q. From the construction of D, we infer that there exists an ordered tree \(t^{\prime }\) (obtained via some ordering of the sibling nodes of t) that satisfies p and D, but not q. For the only if part, assume \(p\not \subseteq _{D} q\), hence there exists an ordered tree t that satisfies p and D, but not q. By removing the order of t, we obtain an unordered tree \(t^{\prime }\) that satisfies p and S, but not q. We recall that twig queries disregard the relative order among siblings. The membership of \(\text{CNT}_{\mathit{DIMS},\mathit{Twig}}\) in EXPTIME yields that \(\text{IMPL}_{\mathit{DIMS},\mathit{Twig}}\) is also in EXPTIME (it suffices to take as p the universal query). □

7 Complexity of Disjunction-Free Interval Multiplicity Schemas (IMSs)

Although query satisfiability and query implication in the presence of schemas are intractable for DIMSs, we prove that they become tractable for IMSs (Section 7.4). We also show a considerably lower complexity for query containment in the presence of schemas: coNP-completeness for IMSs instead of EXPTIME-completeness for DIMSs (Section 7.4). Additionally, we point out that our results for IMSs also allow us to characterize the complexity of query implication and query containment in the presence of disjunction-free DTDs (i.e., restricted DTDs using regular expressions without the disjunction operator), which, to the best of our knowledge, have not been previously studied (Section 7.5). To prove our results, we develop a set of tools that we present next: dependency graphs (Section 7.1), a generalized definition of embedding (Section 7.2), and a family of characteristic graphs (Section 7.3).

7.1 Dependency Graphs

Recall that IMSs use IMEs, which are essentially expressions of the form \(A_{1}^{I_{1}}\parallel \ldots \parallel A_{k}^{I_{k}}\), where \(A_{1},\ldots,A_{k}\) are atoms and \(I_{1},\ldots,I_{k}\) are intervals. Given an IME E, let \({\mathit{symbols}^{\forall}}(E)\) be the set of symbols present in all unordered words in L(E), and \({\mathit{symbols}^{\exists}}(E)\) the set of symbols present in at least one unordered word in L(E):

$$\begin{array}{@{}rcl@{}} {\mathit{symbols}^{\forall}}(E) = \{a\in{\Sigma}\mid \forall w\in L(E).\ a\in w\},\\ { \mathit{symbols}^{\exists}}(E) = \{a\in{\Sigma}\mid \exists w\in L(E).\ a\in w\}. \end{array} $$

Given an IME E, notice that \({\mathit {symbols}^{\forall }}(E)\subseteq { \mathit {symbols}^{\exists }}(E)\), and moreover, the sets \({\mathit{symbols}^{\forall}}(E)\) and \({\mathit{symbols}^{\exists}}(E)\) can be easily constructed from E. For example, given \(E_{0}=(a\parallel b^{?})^{[5,6]}\parallel c^{+}\), we have \({\mathit{symbols}^{\forall}}(E_{0})=\{a,c\}\) and \({\mathit{symbols}^{\exists}}(E_{0})=\{a,b,c\}\).
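The construction of the two sets can be sketched as follows, under a flat encoding of IMEs we assume for illustration: an expression is a list of (atom, interval) pairs, and an atom is a list of (symbol, required) pairs, where required = False marks the ? multiplicity inside the atom.

```python
import math

def symbols_exists(ime):
    """A symbol occurs in some word iff its atom can be repeated
    at least once (upper bound of the interval >= 1)."""
    return {s for atom, (lo, hi) in ime for s, _ in atom if hi >= 1}

def symbols_forall(ime):
    """A symbol occurs in every word iff it is mandatory inside an
    atom whose interval forces at least one repetition (lo >= 1)."""
    return {s for atom, (lo, hi) in ime for s, req in atom if req and lo >= 1}

# E0 = (a || b?)^[5,6] || c^+
E0 = [([("a", True), ("b", False)], (5, 6)),
      ([("c", True)], (1, math.inf))]
```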

Definition 2

Given an IMS \(S=(\mathit{root}_{S},R_{S})\), the existential dependency graph of S is the directed rooted graph \(G_{S}^{\exists } = ({\Sigma },\mathit{root}_{S}, E_{S}^{\exists })\) with the node set Σ, the distinguished root node \(\mathit{root}_{S}\), and the set of edges \(E_{S}^{\exists }\) such that \((a,b)\in E_{S}^{\exists }\) if \(b\in { \mathit {symbols}^{\exists }}(R_{S}(a))\). Furthermore, the universal dependency graph of S is the directed rooted graph \(G_{S}^{\forall } = ({\Sigma },\mathit{root}_{S}, E_{S}^{\forall })\) such that \((a,b)\in E_{S}^{\forall }\) if \(b\in {\mathit {symbols}^{\forall }}(R_{S}(a))\).

Example 5

Take the IMS S containing the rules:

$$\begin{array}{@{}rcl@{}} r \rightarrow (a^?\parallel b)^{[1,10]}\parallel c, \qquad a \rightarrow d^?, \qquad b \rightarrow a^{[2,3]}\parallel c^{*}\parallel d^{+}. \end{array} $$

In Fig. 4 we present the existential dependency graph of S and the universal dependency graph of S.

Fig. 4
figure 4

Existential dependency graph \(G_{S}^{\exists }\) and universal dependency graph \(G_{S}^{\forall }\) for Example 5

Given an IMS S and a symbol a, we say that a is reachable (or useful) in S if there exists a tree in L(S) which has a node labeled by a. Moreover, we say that an IMS is trimmed if it contains rules only for the reachable symbols. For every satisfiable IMS S, there exists an equivalent trimmed version which can be obtained by removing the rules for the symbols involved in unreachable components in \(G_{S}^{\forall }\) (in the spirit of [2]). Notice that the unreachable components of \(G_{S}^{\forall }\) correspond in fact to cycles in \(G_{S}^{\forall }\). In the sequel, we assume w.l.o.g. that all IMSs that we manipulate are satisfiable and trimmed.
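Reachability, and hence trimming, can be sketched with a standard graph traversal. In the sketch below, the adjacency sets are assumed to be precomputed from the rules (here we hard-code the existential edges of Example 5 for illustration):

```python
def reachable(root, edges):
    """Labels reachable from the root in a dependency graph given as an
    adjacency dictionary; a trimmed IMS keeps rules only for these."""
    seen, todo = set(), [root]
    while todo:
        a = todo.pop()
        if a not in seen:
            seen.add(a)
            todo.extend(edges.get(a, ()))
    return seen

# Existential edges for the IMS of Example 5:
# r -> (a? || b)^[1,10] || c,  a -> d?,  b -> a^[2,3] || c* || d+
exists_edges = {"r": {"a", "b", "c"}, "a": {"d"}, "b": {"a", "c", "d"},
                "c": set(), "d": set()}
```

For this schema every symbol is reachable from r, so trimming removes nothing.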

7.2 Generalizing the Embedding

We generalize the notion of embedding previously defined in Section 2. Note that in the rest of the section we use the term dependency graph to refer to both existential and universal dependency graphs. First, an embedding of a query q in a dependency graph \(G=({\Sigma},\mathit{root},E)\) is a function \(\lambda : N_{q} \rightarrow {\Sigma }\) such that:

  1. \(\lambda(\mathit{root}_{q})=\mathit{root}\),

  2. for every \((n,n^{\prime })\in \mathit {child}_{q}\), \((\lambda (n),\lambda (n^{\prime }))\in E\),

  3. for every \((n,n^{\prime })\in \mathit {desc}_{q}\), \((\lambda (n),\lambda (n^{\prime }))\in E^{+}\) (the transitive closure of E),

  4. for every \(n\in N_{q}\), \(\mathit{lab}_{q}(n)=\star\) or \(\mathit{lab}_{q}(n)=\lambda(n)\).

If there exists an embedding of q in G, we write \(G\preccurlyeq q\). Next, a simulation of a dependency graph \(G=({\Sigma},\mathit{root},E)\) in a tree t is a relation \(R\subseteq {\Sigma }\times N_{t}\) such that:

  1. \((\mathit{root},\mathit{root}_{t})\in R\),

  2. for every \((a, n)\in R\) and \((a, a^{\prime })\in E\), there exists \(n^{\prime }\in N_{t}\) such that \((n,n^{\prime })\in \mathit {child}_{t}\) and \((a^{\prime },n^{\prime })\in R\),

  3. for every \((a,n)\in R\), \(\mathit{lab}_{t}(n)=a\).

Note that R is a total relation on the nodes of the graph reachable from the root, i.e., for every \(a\in{\Sigma}\) reachable from \(\mathit{root}\) in G, there exists a node \(n\in N_{t}\) such that \((a,n)\in R\). If there exists a simulation of G in t, we write \(t\preccurlyeq G\). Additionally, note that for a graph containing cycles reachable from the root, there does not exist any (finite) tree in which it can be simulated. However, we point out that in the remainder we use the notion of simulation only for universal dependency graphs of trimmed IMSs, hence they do not have such cycles.

Given two dependency graphs \(G_{1}=({\Sigma},\mathit{root},E_{1})\) and \(G_{2}=({\Sigma},\mathit{root},E_{2})\), \(G_{1}\) is a subgraph of \(G_{2}\) if \(E_{1}\subseteq E_{2}\). For a dependency graph \(G=({\Sigma},\mathit{root},E)\), we define the partial order \(\leq_{G}\) on the subgraphs of G: given two subgraphs \(G_{1}\) and \(G_{2}\) of G, \(G_{1}\leq_{G} G_{2}\) if \(G_{1}\) is a subgraph of \(G_{2}\). Note that the relation \(\leq_{G}\) is reflexive, antisymmetric, and transitive, thus being an ordering relation. Moreover, it is well-founded and it has a minimal element \(G_{0} = ({\Sigma },\mathit{root}, \varnothing )\). The following result can be easily shown by structural induction using the order \(\leq_{G}\).

Lemma 6

For every IMS S, its universal dependency graph can be simulated in every tree t which belongs to the language of S.

A path in a dependency graph \(G=({\Sigma},\mathit{root},E)\) is a non-empty sequence of vertices starting at \(\mathit{root}\) such that for every two consecutive vertices in the sequence there is a directed edge between them in G. By \({\mathit {Paths}}(G)\subseteq {\Sigma }^{+}\) we denote the set of all paths in G. The set of paths is finite only for graphs without cycles reachable from the root. For instance, the paths of the graph \(G_{1}\) in Fig. 5b are \(\mathit{Paths}(G_{1})=\{r,\ ra,\ rb,\ rc,\ rbd,\ rcd,\ rbde,\ rcde\}\).

Fig. 5
figure 5

A tree and two graphs with their corresponding unfoldings

Similarly, a path in a tree t is a non-empty sequence of nodes starting at \(\mathit{root}_{t}\) such that every two consecutive nodes in the sequence are in the relation \(\mathit{child}_{t}\). By \({\mathit {Paths}}(t)\subseteq N_{t}^{+}\) we denote the set of all paths in t. Then, we define \({\mathit {LabPaths}}(t)\subseteq {\Sigma }^{+}\) as the set of sequences of labels of nodes from all paths in t. For instance, for the tree \(t_{1}\) from Fig. 5b we have \(\mathit{Paths}(t_{1})=\{n_{0},\ n_{0}n_{1},\ n_{0}n_{1}n_{2},\ n_{0}n_{3},\ n_{0}n_{3}n_{4}\}\) and \(\mathit{LabPaths}(t_{1})=\{r,\ ra,\ rab\}\). Note that \(|\mathit{LabPaths}(t)|\leq |\mathit{Paths}(t)|\). The unfolding of a dependency graph \(G=({\Sigma},\mathit{root},E)\), denoted \(u_{G}\), is a tree \(u_{G} = (N_{u_{G}}, root_{u_{G}}, lab_{u_{G}}, child_{u_{G}})\) such that:

  • \(N_{u_{G}} = {\mathit {Paths}}(G)\),

  • \(root_{u_{G}}\in N_{u_{G}}\) is the root of \(u_{G}\),

  • \((p, p\cdot a)\in \mathit {child}_{u_{G}}\), for every pair of paths \(p,\ p\cdot a\in\mathit{Paths}(G)\) (note that “ ⋅” stands for standard ordered concatenation),

  • \(\mathit {lab}_{u_{G}}(root_{u_{G}}) = \mathit{root}\), and \(\mathit {lab}_{u_{G}}(p\cdot a) = a\), for every path \(p\cdot a\in\mathit{Paths}(G)\).

The unfolding of a graph is finite only when the graph has no cycle reachable from the root, because otherwise \(\mathit{Paths}(G)\) is infinite, hence \(u_{G}\) is infinite. In the remainder, we use the unfolding only for graphs having no cycle reachable from the root (in order to have finite unfoldings). In such a case, the unfolding can be seen as the smallest tree \(u_{G}\) (w.r.t. the number of nodes) having \(\mathit{LabPaths}(u_{G})=\mathit{Paths}(G)\). The idea of the unfolding is to transform the dependency graph G into a tree having the \(\mathit{child}\) relation instead of directed edges. Some nodes are duplicated in order to avoid nodes with more than one incoming edge. For instance, in Fig. 5b we take the graph \(G_{1}\) and construct its unfolding \(u_{G_{1}}\). Moreover, notice that the size of the unfolding may be exponential in the size of the graph, as for the graph \(G_{2}\) from Fig. 5c.
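Path enumeration, whose result forms the node set of the unfolding, can be sketched as follows. We assume here that the graph is given as an adjacency dictionary and represent a path simply by the concatenation of its labels; the diamond-chain graph at the end is our own example of the exponential blow-up.

```python
def paths(root, edges):
    """All paths of an acyclic rooted graph, as concatenated label
    sequences starting at the root. These are exactly the nodes of
    the unfolding u_G (finite only without reachable cycles)."""
    result = []
    def walk(prefix, node):
        result.append(prefix + node)
        for succ in sorted(edges.get(node, ())):
            walk(prefix + node, succ)
    walk("", root)
    return set(result)

# A graph with the same paths as G1 of Fig. 5b: r -> a, b, c; b, c -> d; d -> e
G1 = {"r": {"a", "b", "c"}, "b": {"d"}, "c": {"d"}, "d": {"e"}}

def diamond_chain(k):
    """A chain of k diamonds: each level doubles the number of maximal
    paths, so the unfolding grows exponentially in the graph size."""
    edges = {}
    for i in range(k):
        edges[f"v{i}"] = {f"x{i}", f"y{i}"}
        edges[f"x{i}"] = {f"v{i+1}"}
        edges[f"y{i}"] = {f"v{i+1}"}
    return edges
```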

We also extend the definition of embedding to embeddings between trees, i.e., given two trees t and \(t^{\prime }\), we say that \(t^{\prime }\) can be embedded in t (denoted \(t\preccurlyeq t^{\prime }\)) if the query \((N_{t^{\prime }},root_{t^{\prime }},lab_{t^{\prime }},child_{t^{\prime }},\varnothing )\) can be embedded in t. Similarly, we can define the embedding of a tree in a dependency graph. Note that two embeddings can be composed, for example:

  • \(\forall t,t^{\prime }\in {\mathit {Tree}}.\ \forall q\in {\mathit {Twig}}.\ (t\preccurlyeq t^{\prime }\wedge t^{\prime }\preccurlyeq q\Rightarrow t\preccurlyeq q)\),

  • \(\forall S\in \mathit {IMS}.\ \forall t\in {\mathit {Tree}}.\ \forall q\in {\mathit {Twig}}.\ (G_{S}^{\forall /\exists }\preccurlyeq t\wedge t\preccurlyeq q\Rightarrow G_{S}^{\forall /\exists }\preccurlyeq q)\).

We state next two auxiliary lemmas that can be easily proven by structural induction on the dependency graphs (using the order \(\leq_{G}\)):

Lemma 7

A dependency graph G can be simulated in a tree t iff its unfolding u G can be embedded in t.

Lemma 8

A query q can be embedded in a dependency graph G iff q can be embedded in the unfolding tree of G.

In Fig. 6 we present the operations fuse and add. Given two trees t and \(t^{\prime }\), we say that \(t\lhd _{0} t^{\prime }\) if \(t^{\prime }\) is obtained from t by applying one of the operations from Fig. 6. The fuse operation takes two siblings with the same label and creates a single node having below it the subtrees corresponding to each of the siblings. The add operation consists simply of adding a subtree at some place in the tree. By \(\unlhd \) we denote the reflexive and transitive closure of \(\lhd _{0}\).

Fig. 6
figure 6

Operations fuse and add

Note that the fuse and add operations preserve the embedding, i.e., given a twig query q and two trees t and \(t^{\prime }\), if \(t\preccurlyeq q\) and \(t\unlhd t^{\prime }\), then \(t^{\prime }\preccurlyeq q\). Furthermore, if we can embed a query q in a tree t that can itself be embedded in the existential dependency graph of an IMS S, then we can perform a sequence of operations that transforms t into another tree \(t^{\prime }\) satisfying both S and q. Formally, we have the following.

Lemma 9

Given an IMS S, a query q, and a tree t, if \(G_{S}^{\exists }\preccurlyeq t\) and \(t\preccurlyeq q\), then there exists a tree \(t^{\prime }\in L(S)\cap L(q)\). The tree \(t^{\prime }\) can be constructed from the tree t by a sequence of fuse and add operations (consistently with the schema S), and we denote this by \(t\unlhd _{S} t^{\prime }\).

7.3 Family of Characteristic Graphs

Given a schema S and a query q, we can capture all trees satisfying both S and q with the characteristic graphs that we introduce next.

More formally, a characteristic graph G is a tuple \((V_{G},\mathit{root}_{G},\mathit{lab}_{G},E_{G})\), where \(V_{G}\) is a finite set of vertices, \(\mathit{root}_{G}\in V_{G}\) is the root of the graph, \(\mathit {lab}_{G}:V_{G}\rightarrow {\Sigma }\) is a labeling function (with \(\mathit{lab}_{G}(\mathit{root}_{G})=\mathit{root}_{S}\)), and \(E_{G}\subseteq V_{G}\times V_{G}\) is the set of edges. Let us assume that \(G_{S}^{\exists }\preccurlyeq q\) and take such an embedding \(\lambda :N_{q}\rightarrow {\Sigma }\). By \({\Lambda}(q,S,\lambda)\) we denote the set of all characteristic graphs for q and S w.r.t. λ. To construct such a graph, we start with \(G=(V_{G},\mathit{root}_{G},\mathit{lab}_{G},E_{G})\), where \(V_{G}\) and \(E_{G}\) are empty, and perform the four steps described below.

  1. For every n in \(N_{q}\), add a node \(n^{\prime }\) to \(V_{G}\) such that \(\mathit {lab}_{G}(n^{\prime })=\lambda (n)\). Let \(\mathit{root}_{G}\) be the node such that \(\mathit{lab}_{G}(\mathit{root}_{G})=\mathit{root}_{S}\).

  2. For every \((n_{1},n_{2})\) in \(\mathit{child}_{q}\), add \((n_{1}^{\prime }, n_{2}^{\prime })\) to \(E_{G}\), where \(n_{1}^{\prime }\) and \(n_{2}^{\prime }\) are the nodes corresponding to \(n_{1}\) and \(n_{2}\), respectively, as constructed at step 1.

  3. For every \((n_{1},n_{2})\) in \(\mathit{desc}_{q}\), choose an acyclic path \(a_{0},\ldots,a_{k}\) in \(G_{S}^{\exists }\) where \(\lambda(n_{1})=a_{0}\) and \(\lambda(n_{2})=a_{k}\). Notice that, since \(n_{1}\) and \(n_{2}\) belong to \(N_{q}\), we have already added to \(V_{G}\) two nodes \(n_{1}^{\prime }\) and \(n_{2}^{\prime }\), respectively, corresponding to them at step 1. Then, for every \(a_{i}\) (with \(1\leq i\leq k-1\)), we add to \(V_{G}\) a node \(n_{i}^{\prime \prime }\) such that \(\mathit {lab}_{G}(n_{i}^{\prime \prime }) = a_{i}\). Also, add to \(E_{G}\) the edges \((n_{1}^{\prime },n_{1}^{\prime \prime }), (n_{1}^{\prime \prime },n_{2}^{\prime \prime }),\ldots , (n_{k-1}^{\prime \prime },n_{2}^{\prime })\).

  4. For every n in \(V_{G}\), take from \(G_{S}^{\forall }\) the subgraph \((V^{\prime },\mathit {lab}_{G}(n),E^{\prime })\) rooted at \(\mathit{lab}_{G}(n)\). Then, for every \(a\neq \mathit{lab}_{G}(n)\) in \(V^{\prime }\) add a node \(n^{\prime }\) to \(V_{G}\) such that \(\mathit {lab}_{G}(n^{\prime })=a\). Also, for every \((a_{1},a_{2})\in E^{\prime }\), add to \(E_{G}\) an edge \((n_{1},n_{2})\) where \(n_{1}\) and \(n_{2}\) are the nodes corresponding to \(a_{1}\) and \(a_{2}\), respectively.

The following example illustrates the construction of such a graph.

Example 6

Take in Fig. 7a an existential dependency graph \(G_{S}^{\exists }\), a twig query q, and an embedding \(\lambda :N_{q}\rightarrow {\Sigma}\). Notice that in \(G_{S}^{\exists }\) we have drawn the universal edges with a full line and the edges that are existential without being universal with a dotted line. Then, in Fig. 7b we present an example of a graph G from \({\Lambda}(q,S,\lambda)\). Notice that in G we have represented in boxes the nodes corresponding to the images λ(n) of the query nodes \(n\in N_{q}\).

Fig. 7
figure 7

An embedding from a query q to an existential dependency graph \(G_{S}^{\exists }\) and a graph \(G\in \mathcal G(q,S)\). In \(G_{S}^{\exists }\), the universal edges are drawn with a full line and those that are existential without being universal with a dotted line

Next, we define the set of all characteristic graphs for q and S w.r.t. all embeddings λ of q in \(G_{S}^{\exists }\):

$$\mathcal G(q,S)=\{G\in{\Lambda}(q,S,\lambda)\mid \lambda \text{ is an embedding of } q \text{ in } G_{S}^{\exists}\}. $$

Note that \(G\preccurlyeq q\) and that the size of G is polynomially bounded by \(|q|\times|{\Sigma}|^{2}\) for every G in \(\mathcal G(q,S)\). Indeed, after step 1 of the construction, a characteristic graph G has |q| nodes. Then, after steps 2 and 3, since at step 3 we allow only acyclic paths of \(G_{S}^{\exists }\), we add at most |Σ| nodes for each already existing node, hence G has at most \(|q|\times|{\Sigma}|\) nodes. Finally, after step 4, since we again add at most |Σ| nodes for each already existing node, G has at most \(|q|\times|{\Sigma}|^{2}\) nodes.

Furthermore, let \({\Lambda}^{*}(q,S,\lambda)\) and \(\mathcal G^{*}(q,S)\) be the sets of characteristic graphs constructed similarly to \({\Lambda}(q,S,\lambda)\) and \(\mathcal G(q,S)\), respectively, the only difference being that we allow cyclic paths at step 3 of the aforementioned construction. While the size of the graphs in \(\mathcal G(q,S)\) is polynomial, notice that the size of the graphs in \(\mathcal G^{*}(q,S)\) is not necessarily polynomial, since the cyclic paths chosen at step 3 can be arbitrarily long. Additionally, note that \(|\mathcal G(q,S)|\) is finite and may be exponential, while \(|\mathcal G^{*}(q,S)|\) may be infinite if the existential dependency graph \(G_{S}^{\exists }\) contains cycles reachable from the root.

Next, we extend the previous definition of the unfolding to characteristic graphs. Given an IME E and a symbol a, by \(\mathit{min\_nb}(E, a)\) we denote the minimum number of occurrences of the symbol a in every unordered word defined by E. Given a query q, an IMS S, and a characteristic graph \(G\in \mathcal G^{*}(q,S)\), we construct its unfolding as follows:

  • Let \(u_{G}\) be the unfolding of G, obtained as defined in Section 7.2.

  • Update \(u_{G}\) such that, for every \(n\in N_{u_{G}}\) and every \(a\in{\Sigma}\), letting \(t_{a}\) be the subtree having as root a child of n labeled by a, copies of \(t_{a}\) are added as children of n until n has \(\mathit {min\_nb}(R_{S}(\mathit {lab}_{u_{G}}(n)), a)\) children labeled by a.

Notice that every graph G in \(\mathcal G^{*}(q,S)\) is acyclic. Indeed, when constructing such a graph G, after steps 1, 2, and 3, G is basically shaped as a tree. Then, the subgraphs that we fuse in at step 4 are all acyclic, since they are subgraphs of the universal dependency graph \(G_{S}^{\forall }\), which we assume trimmed (cf. Section 7.1). Since every graph G in \(\mathcal G^{*}(q,S)\) is acyclic, it has a finite unfolding, which naturally belongs to the language of S.

7.4 Complexity Results

In this section, we use the above defined tools to show the complexity results for IMSs. First, the dependency graphs and embeddings capture satisfiability and implication of queries by IMSs.

Lemma 10

Given a twig query q and an IMS S:

  1. q is satisfiable by S iff \(G_{S}^{\exists }\preccurlyeq q\),

  2. q is implied by S iff \(G_{S}^{\forall }\preccurlyeq q\).

Proof

1) For the if part, we know that \(G_{S}^{\exists }\preccurlyeq q\), thus the family of graphs \(\mathcal G(q, S)\) is not empty. The unfolding of every graph from \(\mathcal G(q, S)\) satisfies S and q at the same time, hence q is satisfiable by S. For the only if part, we know that there exists a tree \(t\in L(S)\cap L(q)\), and we assume w.l.o.g. that it is the unfolding of a graph G from \(\mathcal G^{*}(q,S)\). Since \(t\preccurlyeq q\), we obtain \(u_{G}\preccurlyeq q\), hence \(G\preccurlyeq q\) (by Lemma 8), which, from the construction of G, implies that \(G_{S}^{\exists }\preccurlyeq q\).

2) For the if part, we know that \(G_{S}^{\forall }\preccurlyeq q\), which implies by Lemma 8 that \(u_{G_{S}^{\forall }}\preccurlyeq q\). On the other hand, take a tree t∈L(S). By Lemma 6 we have \(t\preccurlyeq G_{S}^{\forall }\), which implies by Lemma 7 that \(t\preccurlyeq u_{G_{S}^{\forall }}\). From the last embedding and \(u_{G_{S}^{\forall }}\preccurlyeq q\) we infer that \(t\preccurlyeq q\). Since t was chosen arbitrarily in the language of S, we conclude that q is implied by S. For the only if part, we know that for every t∈L(S), \(t\preccurlyeq q\). Consider the tree t obtained as follows: we take \(u_{G_{S}^{\forall }}\) and duplicate some subtrees so that every node \(n\in N_{t}\) has, for every a∈Σ, \(\mathit{min\_nb}(R_{S}(\mathit{lab}_{t}(n)),a)\) children labeled by a. Naturally, t is in the language of S, hence \(t\preccurlyeq q\) by the hypothesis. From the definition of the unfolding, we infer that \(G_{S}^{\forall }\preccurlyeq t\), which implies that \(G_{S}^{\forall }\preccurlyeq q\). □

For instance, the twig query q = r[a]/b//d can be embedded in the existential dependency graph of the IMS S from Example 5, thus q is satisfiable by S. In Fig. 8 we present embeddings of q in \(G_{S}^{\exists }\) and in a tree t satisfying both S and q. Additionally, notice that the twig query q = r[a]/b//d cannot be embedded in \(G_{S}^{\forall }\) from Example 5, and therefore, q is not implied by S. On the other hand, the twig query \(q^{\prime }= r/b{/\!/} d\) can be embedded in \(G_{S}^{\forall }\), thus \(q^{\prime }\) is implied by S.

Fig. 8
figure 8

Embeddings of q in \(G_{S}^{\exists }\) and in a tree t which satisfies S and q at the same time

Moreover, we point out that testing the embedding of a query in a dependency graph can be done in polynomial time with a simple bottom-up algorithm. From this observation and Lemma 10 we obtain the following.
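As an illustration of such a bottom-up test, the following sketch (our own, not the authors' algorithm, and under an assumed encoding: query nodes are (label, axis, children) tuples with axis '/' for child and '//' for descendant, and the graph is an adjacency dict over labels) checks whether a twig query embeds in a dependency graph; memoization keeps it polynomial.

```python
# Illustrative sketch (assumed encoding): embedding a twig query in a
# dependency graph, bottom-up with memoization.

def descendants(graph):
    """For every node, all nodes reachable via one or more edges
    (cycle-safe transitive closure by DFS from each node)."""
    desc = {}
    for g in graph:
        seen, stack = set(), list(graph.get(g, ()))
        while stack:
            h = stack.pop()
            if h not in seen:
                seen.add(h)
                stack.extend(graph.get(h, ()))
        desc[g] = seen
    return desc

def embeds(qnode, gnode, graph, desc, memo=None):
    """Can the twig rooted at qnode embed with its root mapped to gnode?
    Memoization over (query node, graph node) pairs keeps this polynomial."""
    memo = {} if memo is None else memo
    key = (id(qnode), gnode)
    if key not in memo:
        label, _axis, children = qnode
        ok = (label == gnode)  # dependency-graph nodes are labels
        for child in (children if ok else ()):
            _clabel, caxis, _ = child
            # child axis '/' uses direct successors, '//' uses reachability
            cands = graph.get(gnode, set()) if caxis == '/' else desc.get(gnode, set())
            if not any(embeds(child, g, graph, desc, memo) for g in cands):
                ok = False
                break
        memo[key] = ok
    return memo[key]

# Hypothetical dependency graph (not Example 5) and the query r[a]/b//d:
graph = {'r': {'a', 'b'}, 'a': set(), 'b': {'c'}, 'c': {'d'}, 'd': set()}
q = ('r', '/', [('a', '/', []), ('b', '/', [('d', '//', [])])])
assert embeds(q, 'r', graph, descendants(graph))
```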

Theorem 5

SAT IMS,Twig and IMPL IMS,Twig are in PTIME.

Next, we present the complexity of query containment in the presence of IMSs. The coNP-completeness of the containment of twig queries [32] implies the coNP-hardness of the containment of twig queries in the presence of IMSs. Proving the membership of the problem to coNP is, however, not trivial. Given an instance (p,q,S), the set of all trees satisfying p and S can be characterized with a set \(\mathcal {G}({p,S})\) containing an exponential number of polynomially-sized graphs, and p is contained in q in the presence of S iff the query q can be embedded into all graphs in \(\mathcal {G}({p,S})\). The complement of this condition, the existence of a graph in \(\mathcal {G}({p,S})\) in which q cannot be embedded, can be checked by a non-deterministic Turing machine in polynomial time.

Theorem 6

CNT IMS,Twig is coNP-complete.

Proof

The coNP-completeness of the containment of twig queries (Theorem 4 in [32]) implies that CNT IMS,Twig is coNP-hard. Next, we prove the membership of the problem to coNP. Given an instance (p,q,S), a witness is a function \(\lambda :N_{p}\rightarrow {\Sigma }\). Testing whether λ is an embedding from p to \(G_{S}^{\exists }\) requires polynomial time. If λ is an embedding, a non-deterministic polynomial algorithm chooses a graph G from Λ(p,S,λ) and checks whether q can be embedded in G. We claim that \(p\nsubseteq _{S} q\) iff there exists a graph G in \(\mathcal G(p,S)\) such that \(G\not \preccurlyeq q\).

For the if case, we assume that there exists a graph \(G\in \mathcal G(p,S)\) such that \(G\not \preccurlyeq q\). We know that \(G\preccurlyeq p\), thus \(u_{G}\preccurlyeq p\) (by Lemma 8), hence there exists a tree t∈L(S) such that \(t\preccurlyeq p\) and \(u_{G}\unlhd _{S} t\) (by Lemma 9). If we assume towards a contradiction that \(t\preccurlyeq q\), we have \(u_{G}\preccurlyeq q\), thus \(G\preccurlyeq q\), which is a contradiction. We thus infer that there exists a tree \(t\in L(S)\cap L(p)\) such that t∉L(q), and consequently, \(p\nsubseteq _{S} q\).

For the only if case, we assume that \(p\nsubseteq _{S} q\), hence there exists a tree \(t\in L(S)\cap L(p)\) such that t∉L(q). Because \(t\in L(S)\cap L(p)\), we know that there exists a graph \(G\in \mathcal G^{*} (p,S)\) such that \(u_{G}\unlhd _{S} t\). We know that \(t\not \preccurlyeq q\), thus \(u_{G}\not \preccurlyeq q\) (by Lemma 8), which yields \(G\not \preccurlyeq q\). Furthermore, by using a simple pumping argument, we have \(\forall q\in {\mathit {Twig}}.\ \forall G\in \mathcal G^{*}(q,S).\ (G\not \preccurlyeq q\Rightarrow \exists G^{\prime }\in \mathcal G(q,S).\ G^{\prime }\not \preccurlyeq q)\), which implies that there exists a graph \(G^{\prime }\in \mathcal G(p,S)\) such that \(G^{\prime }\not \preccurlyeq q\). □

7.5 Extending the Complexity Results to Disjunction-Free DTDs

We also point out that the complexity results for implication and containment of twig queries in the presence of IMSs can be adapted to disjunction-free DTDs. This allows us to state results which, to the best of our knowledge, are novel. Similarly to the IMSs, we represent a disjunction-free DTD as a tuple \(S=(root_{S}, R_{S})\), where \(root_{S}\) is a designated root label and \(R_{S}\) maps symbols to regular expressions using no disjunction, basically regular expressions of the grammar:

$$E ::= \epsilon\mid a\mid E^{*}\mid E^?\mid E^{+}\mid (E\cdot E), $$

where a∈Σ and “ ⋅” stands for the standard concatenation operator. Given such an expression E, let \({\mathit{symbols}^{\forall}}(E)\) be the set of symbols present in all words from L(E), and \({\mathit{symbols}^{\exists}}(E)\) the set of symbols present in at least one word from L(E):

$$\begin{array}{@{}rcl@{}} { \mathit{symbols}^{\forall}}(E) = \{a\in{\Sigma}\mid \forall w\in L(E).\ \exists w_{1},w_{2}.\ w = w_{1}\cdot a\cdot w_{2}\},\\ {\mathit{symbols}^{\exists}}(E) = \{a\in{\Sigma}\mid \exists w\in L(E).\ \exists w_{1},w_{2}.\ w = w_{1}\cdot a\cdot w_{2}\}. \end{array} $$
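These sets can be computed by structural recursion on E. The sketch below is purely illustrative; the tuple-based AST encoding of expressions is our own assumption.

```python
# Illustrative sketch (assumed AST): symbols∀(E) and symbols∃(E) for
# disjunction-free regular expressions. Node shapes (our assumption):
#   ('eps',), ('sym', a), ('star', e), ('opt', e), ('plus', e), ('cat', e1, e2)

def symbols_exists(e):
    """Symbols occurring in at least one word of L(E)."""
    kind = e[0]
    if kind == 'eps':
        return set()
    if kind == 'sym':
        return {e[1]}
    if kind == 'cat':
        return symbols_exists(e[1]) | symbols_exists(e[2])
    return symbols_exists(e[1])          # star / opt / plus

def symbols_forall(e):
    """Symbols occurring in every word of L(E)."""
    kind = e[0]
    if kind in ('eps', 'star', 'opt'):
        return set()                     # zero iterations are allowed
    if kind == 'sym':
        return {e[1]}
    if kind == 'plus':
        return symbols_forall(e[1])      # at least one iteration of E
    return symbols_forall(e[1]) | symbols_forall(e[2])  # concatenation

# E = a · (b)? · (c)+
E = ('cat', ('sym', 'a'), ('cat', ('opt', ('sym', 'b')), ('plus', ('sym', 'c'))))
assert symbols_exists(E) == {'a', 'b', 'c'}
assert symbols_forall(E) == {'a', 'c'}
```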

As pointed out for the IMEs, note that the sets \({\mathit{symbols}^{\forall}}(E)\) and \({\mathit{symbols}^{\exists}}(E)\) can be easily constructed from E. Next, we adapt the notions of existential and universal dependency graphs to disjunction-free DTDs. The existential dependency graph of a disjunction-free DTD S is a directed rooted graph \(G_{S}^{\exists }=({\Sigma }, root_{S}, E_{S}^{\exists })\), where

$$E_{S}^{\exists} = \{(a,a^{\prime})\mid a^{\prime}\in{\mathit{symbols}^{\exists}}(R_{S}(a))\}. $$

Similarly, the universal dependency graph of a disjunction-free DTD S is a directed rooted graph \(G_{S}^{\forall }=({\Sigma }, root_{S}, E_{S}^{\forall })\), where

$$E_{S}^{\forall} = \{(a,a^{\prime})\mid a^{\prime}\in{\mathit{symbols}^{\forall}}(R_{S}(a))\}. $$

Analogously to the IMSs, we assume w.l.o.g. that we manipulate only disjunction-free DTDs having no cycle reachable from the root in the universal dependency graph. Otherwise, if there is such a cycle, no tree consistent with the schema can contain any of the symbols involved in that cycle. Moreover, similarly to IMSs, for a symbol a∈Σ and a disjunction-free regular expression E, by \(\mathit{min\_nb}(E,a)\) we denote the minimum number of occurrences of the symbol a in every word defined by E.
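For illustration, \(\mathit{min\_nb}\) for disjunction-free regular expressions also admits a straightforward recursive sketch; the tuple-based AST encoding used here is our own assumption.

```python
# Illustrative sketch (assumed AST): min_nb(E, a) for disjunction-free
# regular expressions E ::= eps | a | E* | E? | E+ | (E . E).
# Node shapes (our assumption):
#   ('eps',), ('sym', a), ('star', e), ('opt', e), ('plus', e), ('cat', e1, e2)

def min_nb(e, a):
    """Minimum number of occurrences of `a` in every word of L(E)."""
    kind = e[0]
    if kind in ('eps', 'star', 'opt'):
        return 0                          # zero iterations are allowed
    if kind == 'sym':
        return 1 if e[1] == a else 0
    if kind == 'plus':
        return min_nb(e[1], a)            # at least one iteration
    return min_nb(e[1], a) + min_nb(e[2], a)   # concatenation

# E = a · a · (b)* · (c)+
E = ('cat', ('sym', 'a'),
     ('cat', ('sym', 'a'),
      ('cat', ('star', ('sym', 'b')), ('plus', ('sym', 'c')))))
assert min_nb(E, 'a') == 2
assert min_nb(E, 'b') == 0
assert min_nb(E, 'c') == 1
```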

Next, we state our complexity results for disjunction-free DTDs.

Theorem 7

IMPL disj-free-DTD,Twig is in PTIME and CNT disj-free-DTD,Twig is coNP-complete.

Proof

We claim that a query q is implied by a disjunction-free DTD S iff \(G_{S}^{\forall }\preccurlyeq q\), and since the embedding of a query in a graph can be computed in polynomial time, this implies that IMPL disj-free-DTD,Twig is in PTIME. The proof follows from the proof of Lemma 10.2. The coNP-completeness of the containment of twig queries (Theorem 4 in [32]) implies that CNT disj-free-DTD,Twig is coNP-hard. Theorem 6 states the coNP-completeness of the query containment in the presence of IMSs, and an easy adaptation of its proof technique yields the membership of CNT disj-free-DTD,Twig to coNP. The mentioned proofs can be adapted because, given a disjunction-free regular expression E, the unordered word \(w = \biguplus _{a\in {\Sigma }} a^{min\_nb(E,a)}\) can in fact be ordered into a word u∈L(E). Moreover, the order imposed by the DTD on the siblings is not important because the twig queries are order-oblivious. □

8 Expressiveness of DIMS

First, we compare the expressive power of DIMSs with yardstick languages of unordered trees. We begin with FO logic that uses only the binary child predicate and the unary label predicates \(P_{a}\) with a∈Σ. It is easy to show that DIMSs are not comparable with FO. With a simple rule \(a\rightarrow (b\parallel c)^{*}\) a DIMS can express the language of trees where every node labeled by a has as children only nodes labeled by b and c such that the number of b’s is equal to the number of c’s. Such a language cannot be captured with FO for reasons similar to those for which it cannot be expressed in FO whether the cardinality of the universe is even. There are languages of unordered trees expressible by FO, but not expressible by DIMSs e.g., the language of trees that contain exactly two nodes labeled b. Such languages are not expressible by DIMSs for reasons similar to those for which they cannot be expressed by DTDs, more precisely they are not closed under substitution of subtrees with the same root type (cf. Lemma 2.10 in [37]). By using exactly the same examples, note that DIMSs and MSO are also incomparable. MSO with Presburger constraints [12, 13, 43, 44] is essentially an extension of MSO that additionally allows elements of arithmetic (numerical variables and value comparisons) and unary functions \(\#_{a}\) that return the number of children of a node having a given label a∈Σ. This extension is very powerful and can express Parikh images of arbitrary regular languages. DIMSs are strictly less expressive than Presburger MSO as they use a strict restriction of unordered regular expressions.

Next, we compare the expressive power of DIMSs and DTDs. For this purpose, we introduce a simple tool for comparing regular expressions with DIMEs. Given a regular expression R, the language L(R) of unordered words is obtained by removing the relative order of symbols from every ordered word defined by R. A DIME E captures R if L(E)=L(R). This tool is equivalent to considering DTDs under commutative closure [4, 34]. We believe that this simple comparison is adequate because if a DTD is to be used in a data-centric application, then supposedly the order between siblings is not important. Therefore, a DIME that captures a regular expression defines basically the same admissible content model of a node, without imposing an order among the children.

Naturally, by using the above notion to compare the expressive powers of DTDs and DIMSs, DTDs are strictly more expressive than DIMSs. For example, the commutative closure of the regular expression \((a\cdot (b\mid c))^{*}\) cannot be expressed by a DIME. Various classes of regular expressions have been reported in widespread use in real-world schemas and have been studied in the literature: simple regular expressions [8, 29], single occurrence regular expressions (SOREs) [7], and chain regular expressions (CHAREs) [7]. DIMEs are strictly more expressive than CHAREs and incomparable to the other mentioned classes of regular expressions.

Finally, we investigate how many real-life DTDs can be captured with DIMSs and use the comparison on the XMark benchmark [39] and the University of Amsterdam XML Web Collection [23]. All 77 regular expressions of the XMark benchmark are captured by DIMEs, and among them 76 by IMEs. As for the DTDs from the University of Amsterdam XML Web Collection, 92 % of regular expressions are captured by DIMEs and among them 78 % by IMEs. We also point out that CHAREs, captured by DIMEs, are reported to represent up to 90 % of regular expressions used in real-life DTDs [7]. These numbers give a generally positive coverage, but should be interpreted with caution, as we do not know which of the considered DTDs were indeed intended for data-centric applications.

9 Related Work

Languages of unordered trees can be expressed by logic formalisms or by tree automata. Boneva et al. [12, 13] survey such formalisms and compare their expressiveness. The fundamental difference resides in the kind of constraints that can be expressed for the allowed collections of children of a node. We mention here only formalisms introduced in the context of XML. Presburger automata [43], sheaves automata [20], and the TQL logic [15] can express Presburger constraints on the numbers of occurrences of the different symbols among the children of a node. Suitable restrictions yield the same expressiveness as the Presburger MSO logic on unordered trees [12, 13], strictly more expressive than DIMSs. Additionally, we believe that DIMSs are better suited for use as schema languages, as they were designed as such, in particular owing to their more user-friendly DTD-like syntax.

Languages of unordered trees can also be expressed by considering DTDs under commutative closure [4, 34]. We assume DTDs using arbitrary regular expressions, not necessarily one-unambiguous [14] as required by the W3C. We also point out that it has been recently shown that it is PSPACE-complete to decide whether a given regular expression can be rewritten as an equivalent one-unambiguous one [19]. Given a DTD using arbitrary regular expressions, we say that an (ordered) tree matches such a DTD under commutative closure iff every tree obtained by reordering sibling nodes also matches the DTD. However, it is PSPACE-complete to test whether a DTD defines a commutatively-closed set of trees [34] and, moreover, such a DTD may be of exponential size w.r.t. the size of the alphabet, which makes such DTDs infeasible. Another consequence of the high expressive power of DTDs under commutative closure is that the membership problem is NP-complete [26]. Therefore, these formalisms have not been extensively used in practice. From a different point of view, Martens et al. [27, 28] investigate DTDs equipped with formulas from the \(\mathcal {SL}\) logic that specifies unordered languages and obtain complexity improvements for typechecking XML transformations.

The unordered concatenation operator “ ∥” should not be confused with the shuffle (interleaving) operator “ &” used in a restricted form in XML Schema and RELAX NG to define order-oblivious, yet still ordered, content. On the one hand, \(a^{*}\,\&\, b\) defines all ordered words with an arbitrary number of a’s and exactly one occurrence of b, and analogously, \(a^{*}\parallel b\) defines all unordered words with exactly the same characteristic. On the other hand, \((a\,\&\, b)^{*}\) defines ordered words of the form \(w_{1}\cdot \ldots \cdot w_{n}\), where the factors \(w_{1},\ldots ,w_{n}\) are either ab or ba, while \((a\parallel b)^{*}\) defines unordered words having the same number of a’s and b’s. For instance, \((a\,\&\, b)^{*}\) does not accept the ordered word aabb although it has the same number of a’s and b’s. Adding the shuffle and interval multiplicities to the regular expressions increases the computational complexity of fundamental decision problems such as: membership [6, 25], inclusion, equivalence, and intersection [21]. Colazzo et al. [17, 18, 22] propose efficient algorithms for membership and inclusion of conflict-free types, a class of regular expressions with shuffle and numerical constraints using intervals. Their approach is based on capturing a language with a set of constraints, similar to our characterizing tuples for DIMEs. While conflict-free types and DIMEs both forbid repetitions of symbols, they differ in the restrictions imposed on the use of the operators and the interval multiplicities. Consequently, they are incomparable.
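To make the contrast concrete, here is a small illustrative sketch (our own; the function names are hypothetical) deciding membership in (a&b)* over ordered words and in (a∥b)* over multisets of symbols.

```python
# Illustrative sketch: shuffle star (a&b)* on ordered words versus
# unordered concatenation star (a||b)* on multisets.
from collections import Counter

def in_shuffle_star(w):
    """w in L((a&b)*): w must split into consecutive factors 'ab' or 'ba'."""
    i = 0
    while i < len(w):
        if w[i:i + 2] in ('ab', 'ba'):
            i += 2
        else:
            return False
    return True

def in_unordered_star(w):
    """The multiset of w is in L((a||b)*): equally many a's and b's."""
    c = Counter(w)
    return set(c) <= {'a', 'b'} and c['a'] == c['b']

# (a&b)* rejects aabb even though #a = #b; (a||b)* accepts its multiset
assert not in_shuffle_star('aabb')
assert in_shuffle_star('abba')
assert in_unordered_star('aabb')
assert in_unordered_star('abba')
```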

We finally point out that the static analysis problems involving twig queries i.e., twig query satisfiability [5], implication [9, 24], and containment [35] in the presence of schema have been extensively studied in the context of DTDs. However, to the best of our knowledge, these problems have not been previously studied either for the mentioned unordered schema languages or for DTDs using classes of regular expressions extended with counting and interleaving.

10 Conclusions and Future Work

We have studied schema languages for unordered XML. First, we have investigated languages of unordered words and we have proposed disjunctive interval multiplicity expressions (DIMEs), a subclass of unordered regular expressions for which two fundamental decision problems, membership of an unordered word to the language of a DIME and containment of two DIMEs, are tractable. Next, we have employed DIMEs to define languages of unordered trees and have proposed disjunctive interval multiplicity schema (DIMS) and its restriction, disjunction-free interval multiplicity schema (IMS). DIMSs and IMSs can be seen as DTDs using restricted classes of regular expressions and interpreted under commutative closure to define unordered content models. These restrictions keep the computational complexity of basic static analysis problems relatively low while capturing a significant part of the expressive power of practical DTDs.

As future work, we want to study whether the restrictions imposed by the grammar of DIMEs can be relaxed while maintaining the tractability of the problems of interest. Moreover, we would like to investigate learning algorithms for the unordered schema languages proposed in this paper. We have already proposed learning algorithms for restrictions of DIMSs and IMSs [16] and we want to extend them to cover their full expressive power. We also aim to apply the unordered schemas to query minimization [3] i.e., given a query and a schema, find a smaller yet equivalent query in the presence of the schema. Furthermore, we want to use unordered schemas and optimization techniques to boost the learning algorithms for twig queries [45].