1 Introduction

Trees are a fundamental data structure in computer science and are used in many application areas like natural language processing [12], database theory [1], and compiler construction [17]. All the mentioned applications as well as others [6, 7] require effective representations of sets of trees, also called tree languages. These requirements triggered detailed investigations of various classes of tree languages since the 1960s and by now there exists an abundance of models [5].

The most robust of those classes of tree languages are the regular tree languages [6, 7], which are generated by finite-state tree automata, a natural extension of the finite-state string automata that generate the regular string languages [18]. Most standard problems are decidable for the regular tree languages and they generally enjoy the same nice algorithmic properties as the regular string languages. The main feature of those automata is their finite set of states, which enables most of the positive properties. However, these states are not exhibited directly in the trees generated. In application areas like natural language processing, in which representations of tree languages have to be inferred from finite sets of trees, practitioners often resorted to simpler models, in which the representation can more readily be induced from the sample.

Tree substitution grammars were originally introduced as a special case of tree-adjoining grammars [9, 11], in which no adjunction is allowed. This restriction proved useful in the lexicalization of context-free grammars [10]. However, tree substitution grammars soon became popular in the parsing community [15] under the approach called data-oriented parsing [3] and were the formal model of many state-of-the-art parsers [16]. Similarly, synchronous tree substitution grammars, which are the same as the syntax-directed translation schemes of [2], are used in many statistical machine translation models [4, 8, 13, 14]. Despite the multitude of applications, a fundamental study of their expressive power is missing. Rather, they are credited with properties like “extended domain of locality”, which provides some intuition but has no formal definition.

A tree substitution grammar G is essentially a finite set F of tree fragments together with a set R of permissible root labels. Those tree fragments can be arbitrarily tall or large, which distinguishes tree substitution grammars from local tree grammars [6, 7]. In addition, the fragments can contain leaves that are labeled by internal symbols. Leaves with such labels are called open and can be expanded further by fragments of F that have the same symbol as root label. Indeed G generates trees from a permissible root label of R by successively expanding open leaves with fragments of F until no open leaves remain. The set of all trees derivable in this manner is called the tree language generated by G. The tree languages that can be generated by some tree substitution grammar are called the tree substitution languages.

In this contribution we start a fundamental study of the expressive power of tree substitution grammars. We show that tree substitution grammars are strictly more expressive than local tree grammars [6, 7], but strictly less expressive than finite-state tree automata (see Corollary 10). This, in particular, yields that most standard decision problems are also decidable for tree substitution languages because they are regular. In addition, it is decidable to determine whether a given tree substitution language is local (see Theorem 8). The decidability status of the related question whether a given regular tree language is a tree substitution language remains open. It is interesting to note that all finite and co-finite tree languages are tree substitution languages (see Theorem 6), which makes them much more useful for the approximation of finite samples of trees than the local tree languages, which do not contain all finite tree languages.

We also investigate the closure properties of the tree substitution languages. Unfortunately, they are neither closed under union (see Theorem 9), nor under intersection (see Theorem 13), nor under complement (see Theorem 14). In fact, unions of tree substitution languages even form a strict hierarchy (see Theorem 11), so unions of k tree substitution languages are strictly less expressive than unions of \(k+1\) tree substitution languages. A similar hierarchy is significantly more difficult to prove for intersections and remains an open problem because intersections break the “extended domain of locality” (as shown in the proof of Theorem 13) and can manage a non-explicit information transport over unbounded distances in the trees. Indeed, the trivial union construction, which just takes the union of the fragments of the individual tree substitution grammars \(G_{1}, \dotsc , G_{n}\), does yield a tree substitution grammar G that can generate each tree that can be generated by some \(G_i\). However, G might over-generalize in the sense that it may also generate trees that cannot be generated by any of \(G_{1}, \dotsc , G_{n}\). This property is utilized in grammar induction to generalize beyond the seen data. Overall, the expressive power of tree substitution grammars is interesting and offers new challenging problems because they are used extensively in real-world applications despite their brittle expressive power. It is exactly this absence of good closure properties that requires separate arguments for each individual problem and thus makes several problems challenging, as outlined in the open problems section.

2 Preliminaries

We denote the set of nonnegative integers (including 0) by \(\mathbb {N}\). For every \(k \in \mathbb {N}\), we use the subset \([k] = \{i \in \mathbb {N}\mid 1 \le i \le k\}\). An alphabet A is simply a finite set and \(A^* = \bigcup _{k \in \mathbb {N}} A^k\) is the set of all finite words over A, where \(A^k = A \times \cdots \times A\) containing k factors A and \(A^0 = \{\varepsilon \}\), of which \(\varepsilon \) is called the empty word. The length \(\left| w\right| \) of a word \(w = a_{1} \cdots a_{k} \in A^*\) with \(a_{1}, \dotsc , a_{k} \in A\) is \(\left| w\right| = k\); i.e. the number of symbols making up w. Given words \(v, w \in A^*\), their concatenation is written v.w or simply vw. We write \(v \preceq w\) provided that there exists \(u \in A^*\) such that \(vu = w\). The relation \(\preceq \) is actually a partial order, called the prefix order.

Let S be a set and \(R \subseteq S \times S\) be a relation. The identity on S is the relation \(\mathrm {id}_S = \{(s, s) \mid s \in S\}\). Given another relation \(R' \subseteq S \times S\), the composition \(R \mathbin ; R'\) is given by \(R \mathbin ; R' = \{(s_1, s_3) \mid \exists s_2 \in S :(s_1, s_2) \in R,\, (s_2, s_3) \in R'\}\). The relation R is reflexive if \(\mathrm {id}_S \subseteq R\), and it is transitive if \(R \mathbin ; R \subseteq R\). The reflexive, transitive closure of R is \(R^* = \bigcup _{k \in \mathbb {N}} R^k\) and the transitive closure of R is \(R^+ = \bigcup _{k \ge 1} R^k\), where \(R^0 = \mathrm {id}_S\) and \(R^k = R \mathbin ; \cdots \mathbin ; R\) containing k times the relation R.

A ranked alphabet \((\varSigma , {{\,\mathrm{rk}\,}})\) is a pair consisting of an alphabet \(\varSigma \) and a mapping \({{\,\mathrm{rk}\,}}:\varSigma \rightarrow \mathbb {N}\) that assigns a rank to each symbol of \(\varSigma \). We usually denote a ranked alphabet \((\varSigma , {{\,\mathrm{rk}\,}})\) by just \(\varSigma \) alone when the ranks are clear. We also write \(\sigma ^{(k)}\) to indicate that \({{\,\mathrm{rk}\,}}\!(\sigma ) = k\). Moreover, for every \(k \in \mathbb {N}\), we let \(\varSigma _k = \{ \sigma \in \varSigma \mid {{\,\mathrm{rk}\,}}\!(\sigma ) = k\}\). Given a ranked alphabet \(\varSigma \) and a set Z, the set \(T_\varSigma (Z)\) of \(\varSigma \)-trees indexed by Z is the smallest set T such that \(Z \subseteq T\) and \(\sigma (t_{1}, \dotsc , t_{k}) \in T\) for every \(k \in \mathbb {N}\), \(\sigma \in \varSigma _k\), and \(t_{1}, \dotsc , t_{k} \in T\). We abbreviate \(T_\varSigma (\emptyset )\) simply to \(T_\varSigma \), and any subset \(L \subseteq T_\varSigma \) is called a tree language. It is co-finite if \(T_\varSigma \setminus L\) is finite.

Next, we recall some common notions and notations for trees. In the following, let \(t \in T_\varSigma (Z)\) be a tree for a ranked alphabet \(\varSigma \) and a set Z. The set \({{\,\mathrm{pos}\,}}\!(t)\) of positions of t is inductively defined by \({{\,\mathrm{pos}\,}}\!(z) = \{\varepsilon \}\) for all \(z \in Z\), and \({{\,\mathrm{pos}\,}}\!(\sigma (t_{1}, \dotsc , t_{k})) = \{\varepsilon \} \cup \{i.p \mid i \in [k],\, p \in {{\,\mathrm{pos}\,}}\!(t_i)\}\) for every \(k \in \mathbb {N}\), \(\sigma \in \varSigma _k\), and \(t_{1}, \dotsc , t_{k} \in T_\varSigma (Z)\). The height of t is defined by \({{\,\mathrm{ht}\,}}\!(t) = \max _{p \in {{\,\mathrm{pos}\,}}\!(t)} \left| p\right| \), and the size of t is defined by \(\left| t\right| = \left| {{\,\mathrm{pos}\,}}\!(t)\right| \). A leaf is a position \(p \in {{\,\mathrm{pos}\,}}\!(t)\) such that \(p.1 \notin {{\,\mathrm{pos}\,}}\!(t)\). We denote the subset of leaves of \({{\,\mathrm{pos}\,}}\!(t)\) by \({{\,\mathrm{leaf}\,}}\!(t)\). Given a position \(p \in {{\,\mathrm{pos}\,}}\!(t)\), the label t(p) of t at p and the subtree \(t|_p\) of t at p are defined by \(z(\varepsilon ) = z|_\varepsilon = z\) for all \(z \in Z\), and

$$\begin{aligned} \bigl ( \sigma (t_{1}, \dotsc , t_{k}) \bigr ) (p)&= {\left\{ \begin{array}{ll} \sigma &{} \text {if }p = \varepsilon \\ t_i(p') &{} \text {if } p = i.p' \text { with } i \in \mathbb {N}\text { and } p' \in {{\,\mathrm{pos}\,}}\!(t_i) \end{array}\right. } \\ \sigma (t_{1}, \dotsc , t_{k})|_p&= {\left\{ \begin{array}{ll} \sigma (t_{1}, \dotsc , t_{k}) &{} \text {if } p = \varepsilon \\ t_i|_{p'} &{} \text {if }p = i.p' \text { with } i \in \mathbb {N}\text { and } p' \in {{\,\mathrm{pos}\,}}\!(t_i) \end{array}\right. } \end{aligned}$$

for all \(k \in \mathbb {N}\), \(\sigma \in \varSigma _k\), and \(t_{1}, \dotsc , t_{k} \in T_\varSigma (Z)\). Finally, the replacement \(t[u]_p\) of the leaf \(p \in {{\,\mathrm{leaf}\,}}\!(t)\) by another tree \(u \in T_\varSigma (Z)\) is given by \(\alpha [u]_\varepsilon = u\) for every \(\alpha \in Z \cup \varSigma _0\), and \(\sigma (t_{1}, \dotsc , t_{k})[u]_{i.p'} = \sigma (t_{1}, \dotsc , t_{i-1}, t_i[u]_{p'}, t_{i+1}, \dotsc , t_{k})\) for every \(k \in \mathbb {N}\), \(i \in [k]\), \(\sigma \in \varSigma _k\), \(t_{1}, \dotsc , t_{k} \in T_\varSigma (Z)\), and \(p' \in {{\,\mathrm{pos}\,}}\!(t_i)\).
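The notions recalled above can be made concrete with a small sketch. The encoding is an assumption made only for illustration: a tree is a nested Python tuple `(label, child_1, ..., child_k)`, a single node is `(label,)`, and a position is a tuple of positive integers. None of the function names come from the text.

```python
from typing import List, Tuple

# Assumed encoding: a tree is a nested tuple (label, child_1, ..., child_k);
# a single node is (label,). Positions are tuples of positive integers.
Tree = tuple
Pos = Tuple[int, ...]

def positions(t: Tree) -> List[Pos]:
    """pos(t): the root is the empty word (), the i-th child adds the prefix i."""
    result: List[Pos] = [()]
    for i, child in enumerate(t[1:], start=1):
        result.extend((i,) + p for p in positions(child))
    return result

def height(t: Tree) -> int:
    """ht(t): the length of the longest position."""
    return max(len(p) for p in positions(t))

def size(t: Tree) -> int:
    """|t|: the number of positions."""
    return len(positions(t))

def leaves(t: Tree) -> List[Pos]:
    """leaf(t): positions p such that p.1 is not a position of t."""
    pos = set(positions(t))
    return [p for p in positions(t) if p + (1,) not in pos]

def subtree(t: Tree, p: Pos) -> Tree:
    """t|_p: follow the child indices of p one by one."""
    for i in p:
        t = t[i]
    return t

def label(t: Tree, p: Pos) -> str:
    """t(p): the label at position p."""
    return subtree(t, p)[0]

def replace(t: Tree, p: Pos, u: Tree) -> Tree:
    """t[u]_p: replace the leaf at position p by the tree u."""
    if not p:
        return u
    i = p[0]
    return t[:i] + (replace(t[i], p[1:], u),) + t[i + 1:]

# The tree sigma(alpha, delta(alpha, beta)) in the assumed encoding.
t = ("sigma", ("alpha",), ("delta", ("alpha",), ("beta",)))
print(positions(t))
print(height(t), size(t), leaves(t))
print(replace(t, (1,), ("gamma", ("alpha",))))
```

Tuples are used because they are immutable and hashable, so trees can later be stored in sets.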

We reserve the use of the special symbol \({\scriptstyle \Box }\). A tree \(t \in T_\varSigma (Z \cup \{{\scriptstyle \Box }\})\) is a context, if there exists exactly one \(p \in {{\,\mathrm{pos}\,}}\!(t)\) with \(t(p) = {\scriptstyle \Box }\); i.e., there is exactly one occurrence of \({\scriptstyle \Box }\) in t. The set of all such contexts is denoted by \(C_\varSigma (Z)\). Given a context \(c \in C_\varSigma (Z)\) and a tree \(t \in T_\varSigma (Z \cup \{{\scriptstyle \Box }\})\), the substitution c[t] of t into c yields the tree \(c[t]_p\), where p is the unique position \(p \in {{\,\mathrm{pos}\,}}\!(c)\) with \(c(p) = {\scriptstyle \Box }\). Note that given \(c, c' \in C_\varSigma (Z)\), also \(c[c'] \in C_\varSigma (Z)\). Similarly, we write \(c^k[t]\) for \(c[c[\cdots c[t] \cdots ]]\) containing the context c a total of k times.

Finally, let us recall regular tree grammars (RTGs) [6, 7]. An RTG is a tuple \(G = (Q, \varSigma , Q_0, P)\), where Q is a finite set of states such that \(Q \cap \varSigma = \emptyset \), \(\varSigma \) is a ranked alphabet of input symbols, \(Q_0 \subseteq Q\) is a set of initial states, and \(P \subseteq Q \times T_\varSigma (Q)\) is a finite set of productions. We also write a production \((q, t)\) as \(q \rightarrow t\). The derivation relation \(\Rightarrow _G\) is defined for every \(\xi , \zeta \in T_\varSigma (Q)\) by \(\xi \Rightarrow _G \zeta \) if and only if there exist a production \(q \rightarrow t \in P\) and a context \(c \in C_\varSigma (Q)\) such that \(\xi = c[q]\) and \(\zeta = c[t]\). The tree language generated by G is \(L(G) = \bigcup _{q \in Q_0} \{t \in T_\varSigma \mid q \Rightarrow ^+_G t\}\). A tree language L is regular if there exists an RTG G such that \(L(G) = L\). The class of regular tree languages is denoted by \(\text {RTL}\). We note that \(\text {RTL}\) coincides with the class of tree languages generated by tree automata [6, 7].

Fig. 1. Fragments of the TSG of Example 2.

3 Tree Substitution Grammars

Let us start with the formal definition of tree substitution grammars (TSGs) taken essentially from the natural language processing community [10, 11]. TSGs have been applied to various tasks including parsing [16] and machine translation [19]. Consequently, the definitions of TSGs vary, but our definition captures the essence of the notion, while still being convenient to work with.

Definition 1

A tree substitution grammar (TSG) is a tuple \(G = (\varSigma , R, F)\), in which \(\varSigma \) is a ranked alphabet of input symbols, \(R \subseteq \varSigma \) is a set of root labels, and \(F \subseteq T_\varSigma (\varSigma ) \setminus \varSigma \) is a finite set of fragments. The TSG G is a local tree grammar (LTG) if \({{\,\mathrm{ht}\,}}\!(f) \le 1\) for all \(f \in F\).

Example 2

Consider the ranked alphabet \(\varSigma =\{\sigma ^{(2)}, \delta ^{(2)}, \alpha ^{(0)}, \beta ^{(0)}\}\) and the TSG \(G = (\varSigma , \{\sigma \}, F)\) with the fragments displayed in Fig. 1. Clearly, this TSG is not an LTG due to the third and fourth fragment.

Next we present the derivation semantics for a TSG \(G = (\varSigma , R, F)\). Essentially we start the derivation process with a tree consisting solely of a root label of R and then iteratively replace a leaf by a fragment of F with the same root label. This process can be repeated until no further replacements are possible. If the tree t obtained in this way contains only leaves that are labeled by nullary symbols, then t is part of the tree language generated by G.

Definition 3

Let \(G = (\varSigma , R, F)\) be a TSG. For any two trees \(\xi , \zeta \in T_\varSigma (\varSigma )\), we write \(\xi \Rightarrow _G \zeta \) if there exists a fragment \(f \in F\) and a context \(c \in C_\varSigma (\varSigma )\) such that \(\xi = c[f(\varepsilon )]\) and \(\zeta = c[f]\). The TSG G generates the tree language \(L(G) = \{ t \in T_\varSigma \mid \exists \sigma \in R :\sigma \Rightarrow _G^* t \}\).
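The derivation semantics can be explored with a bounded enumeration. The following is a minimal sketch under stated assumptions: trees are nested tuples `(label, child_1, ..., child_k)`, the ranked alphabet is a dictionary `rank`, and the helper names and the size bound `max_size` are illustrative. The sample grammar is \(G_1\) from the proof of Theorem 9. The sketch always expands the leftmost open leaf, which should not lose any tree because replacements at different leaves do not interfere with each other.

```python
from collections import deque

def size(t):
    """Number of nodes of t."""
    return 1 + sum(size(child) for child in t[1:])

def first_open_leaf(t, rank, prefix=()):
    """Position of the leftmost leaf whose label has positive rank, or None."""
    if len(t) == 1:
        return prefix if rank[t[0]] > 0 else None
    for i, child in enumerate(t[1:], start=1):
        p = first_open_leaf(child, rank, prefix + (i,))
        if p is not None:
            return p
    return None

def subtree(t, p):
    for i in p:
        t = t[i]
    return t

def replace(t, p, u):
    """Replace the leaf at position p of t by the tree u."""
    if not p:
        return u
    i = p[0]
    return t[:i] + (replace(t[i], p[1:], u),) + t[i + 1:]

def language(rank, roots, fragments, max_size):
    """All trees of L(G) with at most max_size nodes (breadth-first search)."""
    start = [(r,) for r in roots]
    seen, done = set(start), set()
    queue = deque(start)
    while queue:
        t = queue.popleft()
        p = first_open_leaf(t, rank)
        if p is None:                      # no open leaves: t is a generated tree
            done.add(t)
            continue
        for f in fragments:
            if f[0] == subtree(t, p)[0]:   # fragment root matches the open leaf
                u = replace(t, p, f)
                # each step adds nodes, so pruning by size is safe
                if size(u) <= max_size and u not in seen:
                    seen.add(u)
                    queue.append(u)
    return done

rank = {"sigma": 2, "gamma": 1, "alpha": 0, "beta": 0}
F1 = [("sigma", ("gamma",), ("alpha",)), ("gamma", ("gamma",)), ("gamma", ("alpha",))]
result = language(rank, {"sigma"}, F1, max_size=5)
for t in sorted(result, key=size):
    print(t)
```

The bound is needed because the generated language is infinite in general; the search only reports the trees up to that size.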

Fig. 2. Example derivation steps using the TSG G of Example 2.

Fig. 3. Fragments of the TSG G of Example 4 and example derivation steps.

Example 4

Let \(\varSigma = \{\sigma ^{(2)}, \gamma ^{(1)}, \alpha ^{(0)}\}\) and consider the TSG \(G = (\varSigma , \{\sigma \}, F)\) with the fragments displayed in Fig. 3. The derivation presented in Fig. 3 illustrates that a derived tree can contain several leaves that still need to be independently replaced. More precisely, both occurrences of \(\gamma \) in the tree \(\sigma (\gamma , \gamma )\) are independently replaced in the displayed derivation.

Example 5

Consider the TSG G from Example 2. A few derivation steps are displayed in Fig. 2. Let \(c_\alpha = \delta (\alpha , \delta (\alpha , {\scriptstyle \Box }))\) and \(c_\beta = \delta (\beta , \delta (\beta , {\scriptstyle \Box }))\). Overall, this TSG generates the tree language

$$ \bigl \{\sigma (x, c_1[\cdots c_n[\delta (y, \alpha )] \cdots ]) \mid x,y \in \{\alpha , \beta \},\, n \in \mathbb {N},\, \forall i \in [n] :c_i \in \{c_\alpha , c_\beta \} \bigr \}. $$

Two TSGs G and \(G'\) are equivalent if \(L(G) = L(G')\). A tree language L is a tree substitution language if there exists a TSG G such that \(L = L(G)\), and it is local [6, 7] if there exists a local tree grammar G such that \(L = L(G)\). The classes of all tree substitution languages and all local tree languages are denoted by \(\text {TSL}\) and \(\text {LTL}\), respectively.

4 Expressive Power

In this section, we investigate the expressive power of tree substitution grammars and start with some simple tree languages that are contained in \(\text {TSL}\). To this end, let \(\text {FIN}\) and \(\text {co-FIN}\) be the classes of all finite and all co-finite tree languages, respectively.

Theorem 6

\(\text {FIN} \cup \text {co-FIN} \subseteq \text {TSL}\).

Proof

Every finite tree language \(L \subseteq T_\varSigma \) is trivially a tree substitution language via the TSG \((\varSigma , R, L)\) with \(R = \{t(\varepsilon ) \mid t \in L\}\).

Now, let \(L \subseteq T_\varSigma \) be a co-finite tree language and \(T_\varSigma \setminus L = \{t_{1}, \dotsc , t_{k}\}\) be the finitely many trees outside L. Moreover, let \(n > \max _{i \in [k]} {{\,\mathrm{ht}\,}}\!(t_i)\) be larger than the height of the tallest tree from \(\{t_{1}, \dotsc , t_{k}\}\). We construct the TSG \(G = (\varSigma , R, F)\) with

  • \(R = \{t(\varepsilon ) \mid t \in L\}\) and

  • \(F = \{t \in L \mid {{\,\mathrm{ht}\,}}\!(t) \le 2n\} \cup \{t \in T_\varSigma (\varSigma ) \mid n \le {{\,\mathrm{ht}\,}}\!(t) \le 2n\}\).

Clearly, F is finite. Now we prove \(L(G) = L\). For \(L(G) \subseteq L\) it is sufficient to show that \(t_i \notin L(G)\) for every \(i \in [k]\). Obviously, the fragments of F are either in L or have height at least n, which proves \(L(G) \subseteq L\). We prove the converse \(L \subseteq L(G)\) by contradiction, so suppose that there exists \(t \in L\) with \(t \notin L(G)\). Then there also exists a smallest \(t' \in L\) with \(t' \notin L(G)\). Since all trees \(t' \in L\) with \({{\,\mathrm{ht}\,}}\!(t') \le 2n\) can be generated directly using a single fragment from F, we must have \({{\,\mathrm{ht}\,}}\!(t') > 2n\). Let

$$ P = \{p \in {{\,\mathrm{pos}\,}}\!(t') \mid \left| p\right| \le n,\, \exists p' \in {{\,\mathrm{pos}\,}}\!(t') :p \preceq p',\, \left| p'\right| > 2n\} $$

be the short positions that are prefixes to long positions, and let \(C = \max _{\preceq } P\) be the maximal (with respect to \(\preceq \)) elements of P. We construct the unique tree \(f \in T_\varSigma (\varSigma )\) with positions

$$ {{\,\mathrm{pos}\,}}\!(f) = \{p \in {{\,\mathrm{pos}\,}}\!(t') \mid \left| p\right| \le 2n \} \setminus \{p \in {{\,\mathrm{pos}\,}}\!(t') \mid \exists c \in C :c \prec p\} $$

and labels \(f(p) = t'(p)\) for all \(p \in {{\,\mathrm{pos}\,}}\!(f)\). In other words, we obtain f by cutting all paths in \(t'\) that have length more than 2n at length n. Obviously, \(f \in F\). In addition, we observe that \({{\,\mathrm{ht}\,}}\!(t'|_p) > n\) for all \(p \in C\). For every \(p \in C\), we thus obtain \(t'|_p \in L\) and \(t'|_p \in L(G)\) since \(\left| t'|_p\right| < \left| t'\right| \) and \(t'\) is the smallest counterexample. However, this yields that \(t'(\varepsilon ) \Rightarrow _G f\) as well as \(f(p) \Rightarrow _G^* t'|_p\) for all \(p \in C\). Altogether \(t'(\varepsilon ) \Rightarrow _G^* t'\), which proves that \(t' \in L(G)\) contradicting the assumption.    \(\square \)

Next we relate the class of tree substitution languages to the well-known classes of local and regular tree languages, respectively. Unsurprisingly, they are situated strictly between them, but the second strictness will be established later (see Corollary 10).

Theorem 7

\(\text {LTL} \subsetneq \text {TSL} \subseteq \text {RTL}\).

Proof

The first inclusion holds by definition. For the latter, let \(G = (\varSigma , R, F)\) be a TSG and \(S \notin \varSigma \) a new symbol. We construct an RTG \(G' = (\overline{\varSigma } \cup \{S\}, \varSigma , \{S\}, P)\) such that \(L(G') = L(G)\). To this end, we use copies \(\overline{\varSigma } = \{ \overline{\sigma } \mid \sigma \in \varSigma \}\) of the input symbols of \(\varSigma \) as states. The productions are given by \(P = P_S \cup P'\) with

$$\begin{aligned} P_S&= \{ S \rightarrow \text {rel}(\sigma ) \mid \sigma \in R\} \\ P'&= \{\overline{f(\varepsilon )} \rightarrow \text {rel}(f) \mid f \in F \}, \end{aligned}$$

where \(\text {rel} :T_\varSigma (\varSigma ) \rightarrow T_\varSigma (\overline{\varSigma })\) is inductively defined by

$$ \text {rel}(\sigma ) = {\left\{ \begin{array}{ll} \sigma &{} \text {if } \sigma \in \varSigma _0 \\ \overline{\sigma }&{} \text {otherwise} \end{array}\right. } $$

for every \(\sigma \in \varSigma \) and \(\text {rel}(\sigma (t_{1}, \dotsc , t_{k})) = \sigma (\text {rel}(t_1), \dotsc , \text {rel}(t_k))\) for all \(k \in \mathbb {N}\setminus \{0\}\), \(\sigma \in \varSigma _k\), and \(t_{1}, \dotsc , t_{k} \in T_\varSigma (\varSigma )\). Clearly any derivation \(\xi _0 \Rightarrow _G \xi _1 \Rightarrow _G \cdots \Rightarrow _G \xi _n\) of G yields a corresponding derivation \(\text {rel}(\xi _0) \Rightarrow _{G'} \text {rel}(\xi _1) \Rightarrow _{G'} \cdots \Rightarrow _{G'} \text {rel}(\xi _n)\) of \(G'\). Together with \(\text {rel}(t) = t\) for all \(t \in T_\varSigma \) and the new initial states, we obtain \(L(G) \subseteq L(G')\). The converse is proved similarly.

The first inclusion is strict because \(\text {FIN} \subseteq \text {TSL}\) by Theorem 6, but it is well-known [6, 7] that \(\text {FIN} \not \subseteq \text {LTL}\).    \(\square \)
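The construction of the RTG \(G'\) in this proof is mechanical and can be sketched in a few lines. The encoding is assumed for illustration only: trees are nested tuples `(label, child_1, ..., child_k)`, the copy \(\overline{\sigma }\) of a symbol is modelled by the hypothetical string `"q_" + symbol`, and the new initial state is the string `"S"`. Both are assumed not to occur in the alphabet. The sample input is \(G_1\) from the proof of Theorem 9.

```python
def bar(symbol):
    """Hypothetical name of the state that is the copy of an input symbol."""
    return "q_" + symbol

def rel(t, rank):
    """Relabel every leaf that carries a non-nullary symbol by its state copy."""
    if len(t) == 1:
        return t if rank[t[0]] == 0 else (bar(t[0]),)
    return (t[0],) + tuple(rel(child, rank) for child in t[1:])

def tsg_to_rtg(rank, roots, fragments, start="S"):
    """Return (states, input symbols, initial states, productions)."""
    states = {bar(symbol) for symbol in rank} | {start}
    # P_S: the new initial state produces the relabeled root labels
    productions = {(start, rel((r,), rank)) for r in roots}
    # P': one production per fragment, from the copy of its root label
    productions |= {(bar(f[0]), rel(f, rank)) for f in fragments}
    return states, set(rank), {start}, productions

rank = {"sigma": 2, "gamma": 1, "alpha": 0, "beta": 0}
F1 = [("sigma", ("gamma",), ("alpha",)), ("gamma", ("gamma",)), ("gamma", ("alpha",))]
states, symbols, initial, productions = tsg_to_rtg(rank, {"sigma"}, F1)
for lhs, rhs in sorted(productions):
    print(lhs, "->", rhs)
```

Each production is stored as a pair `(q, t)`, matching the definition of productions as elements of \(Q \times T_\varSigma (Q)\).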

The inclusion \(\text {TSL} \subseteq \text {RTL}\) immediately yields that most interesting problems are decidable for tree substitution languages. For example, the emptiness, finiteness, inclusion, and equivalence problems are all decidable because they are decidable for regular tree languages [6, 7]. We proceed with a subclass definability problem: Is it decidable whether an effectively presented tree substitution language is local? Whenever we speak about an effectively presented tree substitution language L, we assume that we are actually given a tree substitution grammar G such that \(L(G) = L\). Let \(G = (\varSigma , R, F)\) be a TSG. A fragment \(f \in F\) is useless if G and \((\varSigma , R, F \setminus \{f\})\) are equivalent. The TSG G is reduced if no fragment \(f \in F\) is useless. Clearly, for every TSG we can construct an equivalent reduced TSG.

Theorem 8

For every effectively presented \(L \in \mathrm {TSL}\), it is decidable whether \(L \in \mathrm {LTL}\).

Proof

Let \(G = (\varSigma , R, F)\) be a reduced tree substitution grammar such that \(L(G) = L\). We construct the local tree grammar \(G' = (\varSigma , R, F')\) with

$$ F' = \{ f(p)(f(p.1), \dotsc , f(p.k)) \mid f \in F,\, k \in \mathbb {N},\, p \in {{\,\mathrm{pos}\,}}\!(f) \setminus {{\,\mathrm{leaf}\,}}\!(f),\, f(p) \in \varSigma _k\} . $$

Obviously, \(L = L(G) \subseteq L(G')\) and all fragments of \(F'\) are essential for this property. Consequently, L is local if and only if \(L(G') \subseteq L\). Since both \(L(G')\) and L are regular by Theorem 7 and inclusion is decidable for regular tree languages [6, 7], we obtain the desired statement.    \(\square \)
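The set \(F'\) used in this proof can be computed directly from the fragments. The sketch below assumes the nested-tuple encoding `(label, child_1, ..., child_k)`; the function name `localize` and the sample grammar are hypothetical and chosen only for illustration. The sketch covers just the construction of \(F'\); the inclusion test for regular tree languages that completes the decision procedure is not implemented here.

```python
def localize(fragments):
    """F': for every internal position p of every fragment f, the height-1
    tree f(p)(f(p.1), ..., f(p.k))."""
    local = set()

    def visit(t):
        if len(t) > 1:                                   # internal position
            local.add((t[0],) + tuple((child[0],) for child in t[1:]))
            for child in t[1:]:
                visit(child)

    for f in fragments:
        visit(f)
    return local

# Hypothetical reduced TSG with root label sigma and two fragments of height 2;
# it generates exactly the two trees sigma(gamma(alpha), alpha) and
# sigma(gamma(beta), beta).
F = [
    ("sigma", ("gamma", ("alpha",)), ("alpha",)),
    ("sigma", ("gamma", ("beta",)), ("beta",)),
]
F_local = localize(F)
for f in sorted(F_local):
    print(f)
```

For this sample input, \(F'\) contains both \(\sigma (\gamma , \alpha )\) and \(\gamma (\beta )\), so the local tree grammar can derive \(\sigma (\gamma (\beta ), \alpha )\), which is not among the two generated trees. By the criterion of the proof, this finite tree language is therefore not local.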

5 Closure Properties

In this section, we investigate the closure properties of the class of tree substitution languages. More specifically, we investigate the Boolean operations and the hierarchy for union. Unfortunately, the results are all negative, but they and, in particular, their proofs shed additional light on the expressive power of tree substitution languages. Let us start with union.

Fig. 4. The tree languages \(L(G_1)\) and \(L(G_2)\) used in the proof of Theorem 9.

Theorem 9

\(\mathrm {TSL}\) is not closed under union.

Proof

Consider the ranked alphabet \(\varSigma =\{ \sigma ^{(2)}, \gamma ^{(1)}, \alpha ^{(0)}, \beta ^{(0)} \}\) and the LTGs

$$\begin{aligned} G_1&= \bigl (\varSigma , \{\sigma \}, \{ \sigma (\gamma , \alpha ),\, \gamma (\gamma ),\, \gamma (\alpha ) \} \bigr ) \\ G_2&= \bigl (\varSigma , \{\sigma \}, \{ \sigma (\gamma , \beta ),\, \gamma (\gamma ),\, \gamma (\beta ) \} \bigr ), \end{aligned}$$

which generate the local tree languages (see Fig. 4)

$$ L(G_1) = \bigl \{\sigma \bigl (c^n[\alpha ], \alpha \bigr ) \mid n \in \mathbb {N}\setminus \{0\}\bigr \} \qquad \text { and } \qquad L(G_2) = \bigl \{\sigma \bigl (c^n[\beta ], \beta \bigr ) \mid n \in \mathbb {N}\setminus \{0\}\bigr \} $$

with \(c = \gamma ({\scriptstyle \Box })\). Now suppose that their union \(L = L(G_1) \cup L(G_2)\) is a tree substitution language; i.e., \(L \in \text {TSL}\). Hence there exists a TSG \(G = (\varSigma , R, F)\) such that \(L(G) = L\). Let \(n \in \mathbb {N}\) be such that \(n > \max _{f \in F} {{\,\mathrm{ht}\,}}\!(f)\). Since \(t = \sigma (c^n[\alpha ], \alpha ) \in L\), there must exist a derivation \(\sigma \Rightarrow _G^* t\) and \(\sigma \in R\). Since \({{\,\mathrm{ht}\,}}\!(t) > n\) at least two derivation steps are required, so \(\sigma \Rightarrow _G \sigma (c^k[\gamma ], \alpha ) \Rightarrow _G^+ t\) for some \(0 \le k < n\), which yields the subderivation \(\gamma \Rightarrow _G^+ c^{n-k}[\alpha ]\). In the same manner we consider the tree \(t' = \sigma (c^n[\beta ], \beta ) \in L\), for which the derivation \(\sigma \Rightarrow _G \sigma (c^\ell [\gamma ], \beta ) \Rightarrow _G^+ t'\) for some \(0 \le \ell < n\) and the subderivation \(\gamma \Rightarrow _G^+ c^{n-\ell }[\beta ]\) must exist. However, exchanging the subderivations yields the derivation

$$ \sigma \Rightarrow _G \sigma (c^k[\gamma ], \alpha ) \Rightarrow _G^+ \sigma (c^k[c^{n-\ell }[\beta ]], \alpha ) , $$

which shows \(\sigma (c^{n-\ell +k}[\beta ], \alpha ) \in L(G) = L\) contradicting \(L = L(G_1) \cup L(G_2)\).    \(\square \)
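The exchange of subderivations can be observed on the simplest candidate for the union, namely the TSG whose fragments are the union of the fragments of \(G_1\) and \(G_2\). The following sketch is only an illustration of that one candidate and not a replacement for the proof, which rules out every TSG. It assumes the nested-tuple encoding `(label, child_1, ..., child_k)` and uses a recursive membership test whose function names are hypothetical.

```python
def matches(f, t, rank, fragments):
    """True if fragment f fits the top of t and every subtree of t below an
    open leaf of f can itself be derived from its root label."""
    if f[0] != t[0]:
        return False
    if len(f) == 1:                          # leaf of the fragment
        if rank[f[0]] == 0:
            return len(t) == 1               # nullary symbol: t must end here
        return derivable(t, rank, fragments)  # open leaf: derive the rest
    return len(f) == len(t) and all(
        matches(fc, tc, rank, fragments) for fc, tc in zip(f[1:], t[1:])
    )

def derivable(t, rank, fragments):
    """True if the complete tree t can be derived from its own root label."""
    if len(t) == 1:
        return rank[t[0]] == 0               # zero derivation steps
    return any(matches(f, t, rank, fragments) for f in fragments)

def member(t, rank, roots, fragments):
    """True if t belongs to the language of the TSG (rank, roots, fragments)."""
    return t[0] in roots and derivable(t, rank, fragments)

rank = {"sigma": 2, "gamma": 1, "alpha": 0, "beta": 0}
a, b = ("alpha",), ("beta",)
F1 = [("sigma", ("gamma",), a), ("gamma", ("gamma",)), ("gamma", a)]
F2 = [("sigma", ("gamma",), b), ("gamma", ("gamma",)), ("gamma", b)]

mixed = ("sigma", ("gamma", ("gamma", b)), a)     # sigma(c^2[beta], alpha)
print(member(mixed, rank, {"sigma"}, F1))
print(member(mixed, rank, {"sigma"}, F2))
print(member(mixed, rank, {"sigma"}, F1 + F2))
```

The recursion terminates because every recursive call works on a proper subtree of its argument. In this sketch the mixed tree is rejected by both grammars and accepted by the grammar with the combined fragments, which is the over-generalization exploited in the proof.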

Since the class of regular tree languages is closed under union [6, 7], we obtain the following corollary from Theorems 7 and 9.

Corollary 10

\(\text {LTL} \subsetneq \text {TSL} \subsetneq \text {RTL}\).

We demonstrated that the union of two tree substitution languages need not be a tree substitution language. Next, we ask ourselves whether additional unions increase the expressive power even further. For every \(k \in \mathbb {N}\) let

$$ \cup _k\text {-TSL} = \{L_1 \cup \cdots \cup L_k \mid L_{1}, \dotsc , L_{k} \in \text {TSL} \} $$

be the class of those tree languages that can be presented as unions of k tree substitution languages. Since \(\emptyset \in \text {TSL}\) (see Theorem 6), we obtain \(\cup _0\text {-TSL} = \{\emptyset \}\), \({\cup _1}\text {-TSL} = \text {TSL}\), and \(\cup _k\text {-TSL} \subseteq \cup _{k+1}\text {-TSL}\) for every \(k \in \mathbb {N}\). Next, we show that the mentioned inclusion is actually strict, so that we obtain an infinite hierarchy.

Fig. 5. Illustration of the tree substitution languages used in the proof of Theorem 11.

Theorem 11

\(\cup _k\)-TSL \(\subsetneq \cup _{k+1}\)-TSL for all \(k \in \mathbb {N}\).

Proof

The statement is clear for \(k = 0\), so let \(k \ge 1\). Consider the ranked alphabet \(\varSigma =\{ \sigma ^{(2)}, \delta ^{(2)}, \alpha ^{(0)} \}\) and the TSG \(G_i = (\varSigma , \{\sigma \}, F_i)\) for every \(i \in [k+1]\), where

$$ F_i = \{\sigma (\delta , s_i),\, s_i,\, \delta (\delta , \alpha ),\, \delta (s_i, \alpha ) \} $$

and \(s_i = c_r^i[\alpha ]\) with \(c_r = \delta (\alpha , {\scriptstyle \Box })\). Clearly, \(L(G_i) = \{ \sigma (c_\ell ^n[s_i], s_i) \mid n \in \mathbb {N}\}\) with \(c_\ell = \delta ({\scriptstyle \Box }, \alpha )\). The tree substitution language \(L(G_i)\) and the tree \(s_i\) are illustrated in Fig. 5.

Obviously, \(L = L(G_1) \cup \cdots \cup L(G_{k + 1}) \in \cup _{k+1}\)-TSL and the individual tree languages \(L_j = L(G_j)\) with \(j \in [k+1]\) are infinite and pairwise disjoint. For the sake of a contradiction, assume that \(L \in \cup _k\)-TSL; i.e., there exist \(L'_{1}, \dotsc , L'_{k} \in \text {TSL}\) such that \(L = L'_1 \cup \cdots \cup L'_k\). The pigeonhole principle establishes that there exist \(i \in [k]\) and \(m, n \in [k+1]\) with \(m \ne n\) such that \(L_m \cap L'_i\) and \(L_n \cap L'_i\) are infinite. Let \(G = (\varSigma , R, F)\) be a TSG such that \(L(G) = L'_i\). Let \(z > \max _{f \in F} {{\,\mathrm{ht}\,}}\!(f)\). Since \(L_m \cap L(G)\) is infinite, there exists \(x > z\) such that \(\sigma (c_\ell ^x[s_m], s_m) \in L(G)\). Similarly, there exists \(y > z\) such that \(\sigma (c_\ell ^y[s_n], s_n) \in L(G)\) because \(L_n \cap L(G)\) is infinite. Inspecting the derivations for those trees, we find \(x', y' \in \mathbb {N}\) such that

$$\begin{aligned} \sigma&\Rightarrow _G \sigma (c_\ell ^{x'}[\delta ], s_m) \Rightarrow _G^* \sigma (c_\ell ^x[s_m], s_m) \;&\text { with subderivation }&\;&\delta&\Rightarrow _G^+ c_\ell ^{x-x'}[s_m] \\ \sigma&\Rightarrow _G \sigma (c_\ell ^{y'}[\delta ], s_n) \Rightarrow _G^* \sigma (c_\ell ^y[s_n], s_n) \;&\text { with subderivation }&\;&\delta&\Rightarrow _G^+ c_\ell ^{y-y'}[s_n] \end{aligned}$$

Exchanging the subderivations we obtain

$$ \sigma \Rightarrow _G \sigma (c_\ell ^{x'}[\delta ], s_m) \Rightarrow _G^* \sigma (c_\ell ^{x'+y-y'}[s_n], s_m) $$

and thus \(\sigma (c_\ell ^{x'+y-y'}[s_n], s_m) \in L(G) \subseteq L\), which is a contradiction because \(m \ne n\).    \(\square \)

Corollary 12

(of Theorem 11).

$$ \cup _0\hbox {-}\mathrm {TSL} \subsetneq \cup _1\hbox {-}\mathrm {TSL} \subsetneq \cup _2\hbox {-}\mathrm {TSL} \subsetneq \cup _3\hbox {-}\mathrm {TSL} \subsetneq \cup _4\hbox {-}\mathrm {TSL} \subsetneq \cdots $$
Fig. 6. Fragments of the TSG \(G'\) used in the proof of Theorem 13.

Let us move on to intersection. Unfortunately, \(\text {TSL}\) is not closed under intersection, but intersections of \(\text {TSL}\) become quite powerful. In particular, they allow information to be transported over unbounded distances, which can be observed from the proof.

Fig. 7. Tree substitution languages L(G) and \(L(G')\) used in the proof of Theorem 13.

Theorem 13

\(\mathrm {TSL}\) is not closed under intersection.

Proof

Recall the ranked alphabet \(\varSigma = \{\sigma ^{(2)}, \delta ^{(2)}, \alpha ^{(0)}, \beta ^{(0)}\}\) and the TSG G of Example 2 as well as the contexts \(c_\alpha = \delta (\alpha , \delta (\alpha , {\scriptstyle \Box }))\) and \(c_\beta = \delta (\beta , \delta (\beta , {\scriptstyle \Box }))\) from Example 5. Additionally, let \(G' = (\varSigma , \{\sigma \}, F')\) with \(F'\) displayed in Fig. 6. The generated tree substitution languages L(G) and \(L(G')\) are

$$\begin{aligned}&\bigl \{\sigma (x, c_1[\cdots c_n[\delta (y, \alpha )] \cdots ]) \mid x, y \in \{\alpha , \beta \},\, n \in \mathbb {N},\, \forall i \in [n] :c_i \in \{c_\alpha , c_\beta \} \bigr \} \\&\bigl \{\sigma (x, \delta (x, c_1[\cdots c_n[\alpha ] \cdots ])) \mid x \in \{\alpha , \beta \},\, n \in \mathbb {N},\, \forall i \in [n] :c_i \in \{c_\alpha , c_\beta \} \bigr \} \end{aligned}$$

respectively, which are also illustrated in Fig. 7. Their intersection

$$ L(G) \cap L(G') = \bigl \{\sigma (x, c_x^n[\delta (x, \alpha )]) \mid x \in \{\alpha , \beta \},\, n \in \mathbb {N}\bigr \} $$

contains only trees in which all left children along the spine carry the same label. This tree language is not a tree substitution language, which can be proved using the subderivation exchange technique used in the proof of Theorem 9.    \(\square \)

Note how the intersection achieves a global synchronization in the proof of Theorem 13. This power makes the investigation of the intersection hierarchy difficult. We leave the strictness of the intersection hierarchy as an open problem and conclude by considering the complement.

Fig. 8. Trees used in the proof of Theorem 14.

Theorem 14

\(\mathrm {TSL}\) is not closed under complements.

Proof

Consider the ranked alphabet \(\varSigma = \{\gamma ^{(1)}, A^{(1)}, B^{(1)}, \alpha ^{(0)},\beta ^{(0)} \}\) and the LTG \(G = (\varSigma , \{\gamma \}, F)\) with fragments

$$ F = \{\gamma (A),\, A(A),\, A(\alpha )\} \cup \{\gamma (B),\, B(B),\, B(\beta )\} . $$

The generated tree language is illustrated in Fig. 8. Now suppose that its complement \(L = T_\varSigma \setminus L(G)\) is a tree substitution language; i.e., \(L \in \text {TSL}\). Hence there exists a TSG \(G' = (\varSigma , R, F')\) such that \(L = L(G')\). Let \(n \in \mathbb {N}\) be such that \(n > \max _{f \in F'} {{\,\mathrm{ht}\,}}\!(f)\). Since \(t = \gamma (A^n(\beta )) \in L\) (see Fig. 8), there must exist a derivation \(\gamma \Rightarrow _{G'}^* t\) and \(\gamma \in R\). Since \({{\,\mathrm{ht}\,}}\!(t) > n\), at least two derivation steps are required, so \(\gamma \Rightarrow _{G'} \gamma (A^k(A)) \Rightarrow _{G'}^+ t\) for some \(0 \le k < n\), which yields the subderivation \(A \Rightarrow _{G'}^+ A^{n-k}(\beta )\). Similarly, we consider the tree \(t' = \gamma (B(A^n(\alpha ))) \in L\) (see Fig. 8), for which the derivation \(\gamma \Rightarrow _{G'}^+ \gamma (B(A^\ell (A))) \Rightarrow _{G'}^+ t'\) for some \(0 \le \ell < n\) and the subderivation \(A \Rightarrow _{G'}^+ A^{n-\ell }(\alpha )\) must exist. However, exchanging the subderivations yields the derivation

$$ \gamma \Rightarrow _{G'} \gamma (A^k(A)) \Rightarrow _{G'}^+ \gamma (A^k(A^{n-\ell }(\alpha ))) , $$

which shows \(\gamma (A^k(A^{n-\ell }(\alpha ))) \in L(G') = L\). This contradicts \(L = T_\varSigma \setminus L(G)\) because this tree is generated by G.    \(\square \)

6 Open Problems

We showed that it is decidable whether a given tree substitution language is local. It remains open if we can also decide whether a given regular tree language is a tree substitution language. Progress on this problem will probably provide additional fine-grained insight into the expressive power of tree substitution grammars in comparison to the regular tree grammars.

Another open problem concerns the intersection hierarchy. We showed that unions of tree substitution languages can progressively express more and more tree languages. A similar hierarchy can be defined for intersections of tree substitution languages, and we showed that the intersection of two tree substitution languages is not necessarily a tree substitution language. However, it remains open whether there is an infinite intersection hierarchy or whether it collapses at some level.