1 Introduction

The notion of a rational index of a language was introduced by Boasson, Courcelle and Nivat [4] as a complexity measure for context-free languages. The rational index \(\rho _L\) of a language L is an integer function, where \(\rho _L(n)\) is the maximum length of the shortest string in a language of the form \(L \cap R\), where R is a regular language recognized by an n-state nondeterministic finite automaton (NFA), and the maximum is taken over all such languages R with \(L \cap R \ne \emptyset \).

Besides its theoretical value as a measure of complexity of a language, the rational index is useful in determining the parallel complexity of practical problems, such as the CFL-reachability problem and the more general Datalog query evaluation. The CFL-reachability problem is stated as follows: for a context-free grammar G, given an NFA A over the same alphabet, determine whether \(L(G) \cap L(A)\) is non-empty. With A regarded as a labelled graph, this is a kind of graph reachability problem with path constraints defined by a context-free grammar. This is an important problem used in static code analysis [29] and graph database query evaluation [39].

Reachability problems for more general families of languages have been studied as well. To increase the precision of the analysis, two types of sensitivity can be used simultaneously. For instance, some problems of data flow analysis are representable as the reachability problem for the so-called Interleaved Dyck language: and although Reps [30] proved this problem to be undecidable, its various approximations have been studied [35, 40]. In particular, useful special cases of this problem can be approximated using a well-known generalization of the context-free grammars known as multiple context-free grammars (MCFG) [33], and as linear context-free rewriting systems [37], and hereinafter referred to as multi-component grammars. These grammars can capture the balanced property of Interleaved Dyck languages, represented by an abstract language known as the n-dimensional origin-crossing language [14], and, thus, are a good candidate to approximate them. This motivates the study of the reachability problem for multi-component grammars.

The CFL-reachability problem is P-complete already for some fixed context-free grammars, that is, there exists such a grammar G that testing whether a given NFA accepts any string in L(G) is a P-complete problem: this result can be found in the book by Greenlaw, Hoover and Ruzzo [15] under the name “labelled graph accessibility problem” (LGAP), and as G one can take, for instance, a grammar for the Dyck language on two pairs of brackets. In the special case when the fixed grammar defines a language with a polynomial rational index, the problem becomes computationally easier. The question of the parallel complexity of this problem was investigated by Ullman and Van Gelder [36] in a much more general case, for a rich logic for database queries instead of grammars, and it was proved that under an assumption called the polynomial fringe property the problem is decidable in NC [36]. In the special case of grammars, the result of Ullman and Van Gelder [36] gives an NC\(^2\) algorithm for the CFL-reachability problem, under the assumption that the language’s rational index is polynomial; see also Rubtsov and Vyalyi [31] for an elementary proof. The same result applies to multi-component grammars, as long as the language defined has polynomial rational index.

Theoretical properties of the rational index have received some attention in the literature. Pierre and Farinone [27] proved that for every algebraic number \(\gamma \geqslant 1\), there is a context-free grammar with a rational index of the order \(\Theta (n^\gamma )\). An upper bound on the rational index of a context-free language, shown by Pierre [26], is \(2^{\Theta (n^2/\ln n)}\), and this bound is reached on the Dyck language on two pairs of parentheses. For several important subfamilies of grammars, such as the linear and the one-counter languages, there are polynomial upper bounds on the rational index, which imply that the CFL-reachability problem is in NC\(^2\); they can be proved to lie in NL by direct methods not involving the rational index [17, 19].

Other problems on the length of shortest strings have received some attention in literature. Chistikov et al. [6] investigated the length of shortest strings in one-counter languages, and, in particular, proved that their rational index is \(O(n^2)\). Ellul et al. [9] studied the length of the shortest string which is not accepted by an NFA. Alpoge et al. [1] found upper and lower bounds on the length of shortest strings in regular languages specified in various ways. The maximum length of shortest strings for deterministic two-way finite automata (2DFA) has been investigated in some recent papers [8, 20, 23].

This paper investigates the rational index of a generalization of linear languages: the languages of bounded tree dimension, that is, those defined by context-free grammars with a certain limit on branching in the parse trees. The notion of tree dimension is well-known in the literature and appears under different names, as well as differs by one under variants of the definition: Chytil and Monien [7] use the term k-caterpillar trees, Lohrey et al. [21] call this Horton–Strahler number, Esparza et al. [12] use the term Strahler number of a tree and mention numerous applications and alternative names for this notion; Strahler’s number for k-caterpillar trees is always \(k-1\). Luttenberger and Schlund [22] use the term tree dimension, which is adopted in this paper. Furthermore, this paper uses the definition compatible with Chytil and Monien [7]: our trees of dimension d are exactly d-caterpillar, and trees in linear grammars are deemed to have dimension 1.

In the literature, languages of bounded tree dimension are also known as quasi-rational languages (Qrt), defined as the substitution closure of the linear languages [2, Sect. 6.2], as non-expansive languages, defined by grammars that admit no rewriting of the form \(A \Longrightarrow ^* uAvAw\), and as finite-index languages, that is, languages defined by grammars in which the number of nonterminals in a sentential form is bounded by a constant [32]. Quasi-rational languages form a hierarchy with respect to k, the number of levels of substitution, denoted by Qrt(k), whereas finite-index languages form a hierarchy with respect to k, the bound on number of nonterminal symbols, called index k. More precisely, it is known that Qrt(k) and index k languages are the same class [2, Prop. 6.6]. The relation between this parameter k and the maximum tree dimension is as follows: for grammars with at most m nonterminal symbols on the right-hand side of rules, and with tree dimension bounded by d, the index is at most \((m-1)(d-1)+1\) [11, Lem. 2.2]; in particular, it is d for grammars in the Chomsky normal form.

The easiest languages from this class are the linear languages: these are languages of tree dimension 1, or languages of index 1, or the class Qrt(1). Their rational index is known to be \(O(n^2)\) [4]. A lower bound on the rational index of languages from the entire class was established by Boasson, Courcelle and Nivat [4], who showed that there is a language from Qrt(k) with rational index \(\Omega (n^{2k})\).

This paper begins with adapting the known results to the setting of grammars with bounded tree dimension. First, it is derived from the work of Chytil and Monien [7] that languages of tree dimension bounded by d have rational index \(O(n^{2d})\): this is explained in Section 3 of this paper. In Section 4, the argument of Boasson, Courcelle and Nivat [4] is adapted to proving a more precise bound: that, for every d, there is a language of tree dimension bounded by d with rational index \(\frac{1}{2^{d^2 + d - 3} 3^{2d}} n^{2d}\).

The results in Sections 34 serve as a model for a similar investigation of multi-component grammars. Instead of defining substrings, these grammars define k-tuples of substrings, for bounded k, and otherwise are the same as ordinary (context-free) grammars. In particular, they have similar parse trees, to which the notion of the tree dimension is equally applicable. The relevant definitions are given in Section 5. In the subsequent Section 6, the Chytil–Monien lemma is extended to multi-component grammars of tree dimension bounded by d: they are proved to have rational index \(O(n^{2kd})\). In Section 7, a matching lower bound on the rational index is established, thus demonstrating that it is of the order \(\Theta (n^{2kd})\) in the worst case.

Some implications of these results are presented in Section 8. The maximum order of magnitude of the rational index is determined for superlinear languages [5], and some bounds are obtained for languages of bounded oscillation [13, 38], and for the linear subclass of multi-component grammars [10, 18].

In the final Section 9, the results of this paper are adapted to LL(1)-grammars in the Greibach normal form, and their potential application to conjunctive and Boolean grammars is discussed.

2 Definitions

A (context-free) grammar is a quadruple \(G = (\Sigma , N, R, S)\), where \(\Sigma \) is an alphabet; N is a set of nonterminal symbols; R is a set of rules, each of the form \(A \rightarrow \alpha \), with \(A \in N\) and \(\alpha \in (\Sigma \cup N)^*\); and \(S \in N\) is the start symbol. A parse tree is a tree, in which every leaf is labelled with a symbol from \(\Sigma \), while every internal node is labelled with a nonterminal symbol \(A \in N\) and has an associated rule \(A \rightarrow X_1 \ldots X_\ell \in R\), so that the node has \(\ell \) ordered children labelled with \(X_1, \ldots , X_\ell \). The language defined by each nonterminal symbol \(A \in N\), denoted by \(L_G(A)\), is the set of all strings \(w \in \Sigma ^*\), for which there exists a parse tree, with A as a root and with the leaves forming the string w. The language defined by the grammar is \(L(G)=L_G(S)\).

A grammar G is said to be is in the Chomsky normal form, if all rules of R are of the form \(A \rightarrow BC\), with \(B, C \in N\), or of the form \(A \rightarrow a\), with \(a \in \Sigma \).

A grammar is linear if every rule is either of the form \(A \rightarrow uBv\), with \(u,v \in \Sigma ^*\) and \(B \in V\), or of the form \(A \rightarrow w\), with \(w \in \Sigma ^*\).

A nondeterministic finite automaton (NFA) is a quintuple \(\mathcal {A}=(\Sigma , Q, Q_0, \delta , F)\), where Q is a finite set of states, \(\Sigma \) is a finite set of input symbols, \(Q_0 \subseteq Q\) is the set of initial states, \(\delta :Q \times \Sigma \rightarrow 2^Q\) is the transition function, \(F \subseteq Q\) is the set of accepting states. It accepts a string \(w=a_1 \ldots a_n\) if there is a sequence of states \(q_0, \ldots , q_n \in Q\) with \(q_0 \in Q_0\), \(q_i \in \delta (q_{i-1}, a_i)\) for all i, and \(q_n \in F\). The language of all strings accepted by \(\mathcal {A}\) is denoted by \(L(\mathcal {A})\).

An NFA is said to be a partial deterministic finite automaton (DFA), if \(|Q_0|=1\) and \(|\delta (q, a)| \leqslant 1\) for all \(q \in Q\) and \(a \in \Sigma \). In this paper, all DFA are partial.

For a language L over an alphabet \(\Sigma \), its rational index \(\rho _L\) is a function defined as follows:

$$\begin{aligned} \rho _L(n) = \max _{\begin{array}{c} \mathcal {A} \text {: NFA with }n\text { states} \\ L \cap L(\mathcal {A}) \ne \emptyset \end{array}} \; \min _{w \in L \cap L(\mathcal {A})}|w| \end{aligned}$$
Fig. 1
figure 1

A parse tree with marked dimensions of its subtrees

Tree dimension. The definition excludes all leaves labelled with symbols from \(\Sigma \) and with the empty string, leaving only nodes labelled with nonterminal symbols. For each node v labelled with a nonterminal symbol, its dimension \(\dim v\) is an integer representing the amount of branching in its subtree. It is defined inductively on the structure of subtrees. if v has no children labelled with nonterminal symbols, its dimension is 1. Otherwise, let \(v_1, v_2, \ldots , v_k\), with \(k \geqslant 1\), be all its children labelled with nonterminal symbols. If one of them has a greater dimension than all the others, then v has the same dimension, and if there are multiple children of maximum dimension, then the dimension of v is greater by one.

$$\begin{aligned} \dim v = {\left\{ \begin{array}{ll} \max _{i \in \{1, \ldots , k\}} \dim v_i &{}\text {if there is a unique maximum or k=1} \\ \max _{i \in \{1, \ldots , k\}} \dim v_i +1 &{}\text {otherwise} \end{array}\right. } \end{aligned}$$

The dimension of a parse tree t, denoted by \(\dim t\), is the dimension of its root. An example of a parse tree with marked dimensions of its nodes is given in Fig. 1, it uses a grammar for the Dyck language (\(S \rightarrow SS \ | \ aSb \ | \ \epsilon \)).

Note that the above definition gives the same number as the definition of d-caterpillar parse trees by Chytil and Monien [7]. On the other hand, the definition by Esparza et al. [12] produces a number that is less by one than \(\dim t\) defined in this paper.

Definition 1

(Grammars of bounded tree dimension) Let \(d \geqslant 1\). A grammar G is said to be of tree dimension bounded by d, if every parse tree t of G has \(\dim t \leqslant d\). The least such constant d is called the dimension of G, denoted by \(\dim G=d\).

3 Upper Bound on the Rational Index

The first result of this paper is that, if the dimension of trees in a grammar is bounded by a constant d, then the rational index of its language is bounded by \(O(n^{2d})\), where the constant factor depends upon the grammar.

Theorem 1

Let G be a grammar of tree dimension bounded by d, and let \(\mathcal {A}\) be an NFA with n states, with non-empty intersection \(L(G) \cap L(\mathcal {A})\). Then the length of the shortest string in \(L(G) \cap L(\mathcal {A})\) is in \(O(n^{2d})\).

The main component of the proof is the following lemma by Chytil and Monien [7], which they used in their study of unambiguous grammars of finite index.

Lemma 1

(Chytil and Monien [7, Lem. 7]) Let \(G=(\Sigma , N, R, S)\) be a grammar, let m be the maximal length of the right-hand side of its rules, and assume that there exists a parse tree of some dimension \(d \geqslant 1\) in this grammar. Then the grammar defines some string of length at most \((|N|(m-1) + 1)^d\).

Proof

A proof is included for completeness; later in Section 6 it will be generalized for multi-component grammars.

For each nonterminal A that has at least one parse tree of some dimension d, it is proved that A defines some string of length at most \((|N|(m-1) + 1)^d\). The proof proceeds by induction on d.

Base case: \(d=1\). Let A define some trees of dimension 1, and of all these trees, consider the one that defines the shortest string. The tree is a path of one or more nonterminal nodes, illustrated in Fig. 2(left), in which every node except the last one spawns at most \(m-1\) terminal leaves to both sides, whereas the bottom node has at most m terminal leaves. All nodes on the path have dimension 1. Its length is at most |N|, because if two nodes on it are labeled with the same nonterminal symbol X, then the larger subtree labeled with X can be replaced with a smaller subtree labeled with the same nonterminal. Overall, the total number of leaves in such tree is bounded by \((|N|-1)(m-1) + m = |N|(m-1) + 1\).

Induction step: \(d -1 \rightarrow d\). Let w be the shortest string defined by A with a parse tree of dimension at most d, and among all such parse trees, let t be the one with the fewest nodes. Consider the path in t that proceeds from its root and passes through nodes of dimension d. If all children of the root have dimension \(d-1\), then this path consists of a single node; otherwise, one of the children of the root has dimension d, and other children have dimension less than d, and the path continues to the child of dimension d, etc. Such a path is illustrated in Fig. 2(right).

Let \(A_1, \ldots , A_h\) be the nonterminals in the labels of nodes on this path, with \(A_1=A\). No nonterminal symbol is repeated twice in this sequence, because if \(A_i=A_j\) for some \(i<j\), then the segment of this path from \(A_{i+1}\) to \(A_j\) could be contracted, and the resulting parse tree would either define a string shorter than w, or would be a smaller tree of w, while still having dimension at most d. Therefore, \(h \leqslant |N|\).

Fig. 2
figure 2

The path passing through nodes of the same dimension in the proof of the Chytil–Monien lemma: (left) base case, \(d=1\); (right) induction step

For each node on this path except the last one, at most \(m-1\) subtrees may spawn off to the left and to the right, and each of them has dimension less than d. The last node on this path has at most m children, all of dimension less than d. Overall, there are at most \((m-1) \cdot (h-1) + m \leqslant (m-1)(|N|-1)+m = |N|(m-1) + 1\) subtrees, each of dimension at most \(d-1\). Each of these subtrees, with some root \(X \in \Sigma \cup N\), defines one of the shortest strings in \(L_G(X)\), and therefore, by the induction hypothesis, the substring in each subtree is of length at most \((|N|(m-1) + 1)^{d-1}\). Since w is the concatenation of these substrings, its length is at most \((|N|(m-1) + 1)^{d-1} \cdot (|N|(m-1) + 1) = (|N|(m-1) + 1)^d\), as claimed.\(\square \)

Proof

(of Theorem 1) Let N be the set of nonterminal symbols in G. A grammar \(G'\) for the language \(L(G) \cap L(\mathcal {A})\) is obtained from G and \(\mathcal {A}\) by the classical construction by Bar-Hillel et al. [3], which produces \(|N| \cdot n^2 + 1\) nonterminal symbols: these are all triples of the form (Apq), where \(A \in N\) and pq are two states of the automaton, as well as a new start symbol. Furthermore, each parse tree in the grammar \(G'\) has the same structure as some parse tree in G, and differs only in the labelling of internal nodes; in particular, it has the same dimension.

Since \(G'\) defines at least one string, the parse tree of that string has the same dimension as some parse tree in G, and its dimension is therefore at most d. Let m be the maximum length of the right-hand sides of rules in \(G'\). Then, by Lemma 1, the length of the shortest string defined by \(G'\) is at most \(((|N| \cdot n^2 + 1) (m-1) + 1)^d = |N| \cdot (m-1) \cdot n^{2d} + o(n^{2d}) = O(n^{2d})\).\(\square \)

4 Lower Bound on the Rational Index

The upper bound \(O(n^{2d})\) on the rational index of a language defined by a grammar with tree dimension bounded by d has a matching lower bound \(\Omega (n^{2d})\).

This lower bound resembles the lower bound on the rational index of Qrt(k) established by Boasson, Courcelle and Nivat [4], which is \(\Omega (n^{2k})\). Although their construction does not explicitly use grammars, it is actually easy to modify to grammars of tree dimension k. In this section, the construction by Boasson et al. [4] is reimplemented in the notation of this paper, so that later it could be extended to multi-component grammars. Also, reimplementing the construction allows a more precise lower bound without the \(\Omega \)-notation to be proved.

The lower bound is first established for a convenient infinite set of values of n, to be extended to arbitrary n in the following.

Lemma 2

For every \(d \geqslant 1\), there is a grammar G of tree dimension bounded by d, such that for every \(n \geqslant 2^{d+1}\) divisible by \(2^d\) there is an n-state partial DFA \(\mathcal {B}\), such that the shortest string w in \(L(G) \cap L(\mathcal {B})\) is of length at least \(\frac{1}{2^{d^2 + 3d - 3}} n^{2d}\).

Proof

The proof is carried out by induction on d. A grammar is constructed for each d, and then, for every n divisible by \(2^d\), an n-state DFA with the stated property is defined. Each constructed automaton shall have a unique initial state, which is also the unique accepting state.

Basis: \(dim(G) = 1\). The family of languages having dimension \(d = 1\) coincides with the family of linear languages. Let G be a linear grammar with the rules \(S \rightarrow aSb \ | \ ab\), which defines the language \(L(G) = \{ \, a^i b^i \mid i \geqslant 1 \, \}\).

For every \(n \geqslant 4\) divisible by \(2^d=2\), let \(\ell = \frac{n}{2}\) and \(m=\frac{n}{2} + 1\). Then \(\ell \) and m are coprime integers. Define a DFA \(\mathcal {B}\) over the alphabet \(\{a, b\}\), which consists of two cycles sharing one node, \(q_0\), which is both the initial and the unique accepting state. The cycle of length \(\ell \) has all transitions by a, and the other by b, as shown in Fig. 3. The automaton has \(\ell +m-1=n\) states.

Every string in \(L(G) \cap L(\mathcal {B})\) is of the form \(a^i b^i\), with \(i \geqslant 1\). For the automaton to accept it, i must be divisible both by \(\ell \) and by m. Since the cycle lengths are relatively prime, the shortest string w with this property has \(i=\ell m\), and is accordingly of length \(2\ell m\). Its growth with n is estimated as follows.

$$\begin{aligned} |w| = 2\ell m = 2 \frac{n}{2} \cdot \Big (\frac{n}{2} + 1\Big ) = \frac{1}{2} n^2 + n \end{aligned}$$

This example is well-known to the community [16, 39].

Fig. 3
figure 3

DFA \(\mathcal {B}\) defined in Lemma 2 for \(d=1\)

Induction step: \(dim(G) = d\). By the induction hypothesis, there is a grammar \(\widehat{G} = (\widehat{\Sigma }, \widehat{N}, \widehat{R}, \widehat{S})\) of bounded tree dimension \(dim(\widehat{G}) = d-1\), which satisfies the statement of the lemma. The new grammar \(G = (\Sigma , N, R, S)\) of tree dimension at most d is defined over the alphabet \(\Sigma = \widehat{\Sigma } \cup \{a, b, c\}\), where \(a, b, c \not \in \widehat{\Sigma }\) are new symbols. It uses nonterminal symbols \(N = \widehat{N} \cup \{S, A\}\), adding two new nonterminals \(A, S \not \in \widehat{N}\) to those in \(\widehat{G}\), where S is the new initial symbol. Its set of rules includes all rules from \(\widehat{G}\) and the following new rules.

$$\begin{aligned} S&\rightarrow A S c \ | \ A c \\ A&\rightarrow a A b \ | \ a \widehat{S} b \end{aligned}$$

Here the nonterminal symbol A defines all substrings of the form \(a^i u b^i\), with \(i \geqslant 1\) and \(u \in L(\widehat{G})\), and hence the grammar defines the following language.

$$\begin{aligned} L(G)=\{ \, a^{i_1} w_1 b^{i_1} \ldots a^{i_t} w_t b^{i_t} c^t \mid t \geqslant 1, \, i_1, \ldots , i_t \geqslant 1, \, w_1, \ldots , w_t \in L(\widehat{G}) \, \} \end{aligned}$$

To see that trees in the new grammar have dimension at most d, first consider the dimension of any parse tree t with the root labeled by the nonterminal A, which is of the form shown in Fig. 4(right). The dimension of the \(\widehat{S}\)-subtree at the bottom is at most \(d-1\) by the properties of \(\widehat{G}\). This dimension is inherited by all A-nodes in the tree, because their remaining children are leaves.

Now consider the dimension of a complete parse tree t with the start symbol S in the root, as in Fig. 4(left). All A-subtrees in this tree have dimension at most \(d-1\). Then the bottom S-subtree, which uses the rule \(S \rightarrow Ac\), also has dimension at most \(d-1\). Every S-subtree higher up in the tree uses a rule \(S \rightarrow ASc\), and its dimension is at most d, because getting a higher dimension would require two subtrees of dimension d, which is never the case.

Fig. 4
figure 4

Parse trees for S and for A, annotated with dimensions of their nodes

Now, for every \(n \geqslant 2^{d+1}\) divisible by \(2^d\), the goal is to construct an n-state DFA over the alphabet \(\Sigma \), so that the shortest string w in \(L(G) \cap L(\mathcal {B})\) is of length at least \(\frac{1}{2^{d^2 + 3d - 3}} n^{2d}\). Since the number \(\frac{n}{2}\) is at least \(2^d\) and is divisible by \(2^{d-1}\), the induction hypothesis for the grammar \(\widehat{G}\) asserts that there is a DFA \(\widehat{\mathcal {B}} = (\widehat{Q}, \widehat{\Sigma }, \widehat{\delta }, \widehat{q}_0, \{\widehat{q}_0\})\), with \(\frac{n}{2}\) states, with the shortest string \(\widehat{w}\) in \(L(\widehat{G}) \cap L(\widehat{\mathcal {B}})\) of length at least \(\frac{1}{2^{(d-1)^2 + 3(d-1) - 3}} (\frac{n}{2})^{2(d-1)}\).

The desired n-state DFA \(\mathcal {B} = (\Sigma , Q, q_0, \delta , \{q_0\})\) is constructed as follows. Let \(\ell = \frac{n}{4}\) and \(m=\frac{n}{4} + 1\), these are two coprime integers. The set of states of \(\mathcal {B}\) contains all \(\frac{n}{2}\) states from \(\widehat{Q}\), in which \(\mathcal {B}\) operates as \(\widehat{\mathcal {B}}\), and \(m+\ell -1=\frac{n}{2}\) new states forming a cycle of length \(\ell \) and a chain of length m, which share a common state \(q_0\).

$$\begin{aligned} Q = \widehat{Q} \cup \{p_1, \ldots , p_{\ell -1}, q_0, \ldots , q_{m-1}\} \end{aligned}$$

The new initial state \(q_0\) has a transition by a leading to the initial state of \(\widehat{\mathcal {B}}\), from where one can return to \(q_1\) by b.

$$\begin{aligned} \delta (q_0, a)&= \widehat{q}_0 \\ \delta (\widehat{q}_0, b)&= q_1 \end{aligned}$$

There is a chain of transitions by a from \(q_{m-1}\) to \(q_0\), and another chain b in the opposite direction, from \(q_1\) to \(q_{m-1}\) and back to \(q_0\).

$$\begin{aligned} \delta (q_i, a)&= q_{i-1},{} & {} \text {with } 1 \leqslant i \leqslant m-1 \\ \delta (q_i, b)&= q_{i+1 \bmod m},{} & {} \text {with } 1 \leqslant i \leqslant m-1 \end{aligned}$$

There is a cycle by c in the states \(q_0, p_1, \ldots , p_{\ell -1}\); for uniformity, denote \(p_0=q_0\).

$$\begin{aligned} \delta (p_i, c)&= p_{i+1 \bmod \ell },{} & {} \text {with } 0 \leqslant i \leqslant \ell -1 \end{aligned}$$

The general form of \(\mathcal {B}\) is shown in Fig. 5.

Fig. 5
figure 5

DFA \(\mathcal {B}\) defined in Lemma 2 for d, which incorporates DFA \(\widehat{\mathcal {B}}\) for \(d-1\). A computation on a block \(a^{i_s} w_s b^{i_s}\) that starts in \(q_{s-1}\) and ends in \(q_s\) is illustrated

Let w be any string in \(L(G) \cap L(\mathcal {B})\). Since w is defined by G, it is of the form \(w = a^{i_1} w_1 b^{i_1} \ldots a^{i_t} w_t b^{i_t} c^t\), for some \(t \geqslant 1\), \(i_1, \ldots , i_t \geqslant 1\) and \(w_1, \ldots , w_t \in L(\widehat{G})\). At the same time, w must be accepted by \(\mathcal {B}\), and the automaton’s behaviour on w is explained in the following claim.

Claim. After reading each prefix \(a^{i_1} w_1 b^{i_1} \cdots a^{i_s} w_s b^{i_s}\) of w, with \(s \in \{0, \ldots , t\}\),the automaton comes to the state \(q_{s \bmod m}\).

The claim is proved by an induction on s. For the base case, \(s=0\), it holds true, because the initial state is \(q_0\). For the induction step, assume that the automaton is in the state \(q_{(s-1) \text {mod}~m}\) after reading \(a^{i_1} w_1 b^{i_1} \cdots a^{i_{s-1}} w_{s-1} b^{i_{s-1}}\) and it reads the next block \(a^{i_s} w_s b^{i_s}\); it is claimed that it must finish reading this block in the state \(q_{s \text {mod}~m}\). For the automaton to read any symbols of \(w_s\), it must come to the state \(\widehat{q}_0\), which is possible only if \(i_s = ((s-1) \text {mod}~m) + 1\); hence, \(i_s\) is at most m. Similarly, for the automaton to continue reading \(b^{i_s}\) after processing \(w_s\), it must finish reading \(w_s\) in the state \(\widehat{q}_0\). Then, by reading the string \(b^{i_s}\), it gets to the state \(q_{i_s \text {mod} m}\). If \(i_s < m\), then \((s-1) \text {mod} m < m-1\) and \(i_s = s {mod} m\), and the state reached is \(q_{s \text {mod}~m}\), as claimed (this is the case illustrated in Fig. 5). If \(i_s = m\), then \((s-1) \text {mod}~m = m-1\), and the automaton reaches \(q_0\); since \(s \text {mod}~m = 0\), this is the claimed state. The proof of the claim is complete.\(\square \)

Resuming the proof of the lemma, it is now known that the automaton starts reading the suffix \(c^t\) of w in the state \(q_{t \mod m}\). Unless this is the state \(q_0\), symbols c cannot be read. Therefore, t must be divisible by m. At the same time, in order to finish reading \(c^t\) in the accepting state, t must also be divisible by \(\ell \). Furthermore, in such a string, each substring \(w_i\) must be accepted by \(\widehat{\mathcal {B}}\), because \(\mathcal {B}\) starts and finishes reading it in the state \(\widehat{q}_0\), and therefore lies in \(L(\widehat{G}) \cap L(\widehat{\mathcal {B}})\), and its length is at least \(|\widehat{w}|\), where \(\widehat{w}\) is the shortest string in \(L(\widehat{G}) \cap L(\widehat{\mathcal {B}})\).

Then, the shortest string w with these properties must have \(t=\ell m\), because \(\ell \) and m are co-prime, and must have each \(w_i\) of length exactly \(|\widehat{w}|\). This gives the following lower bound on the length of w.

$$\begin{aligned} |w| > \ell m |\widehat{w}| \geqslant \frac{n}{4} \cdot \frac{n}{4} \cdot \frac{1}{2^{(d-1)^2 + 3(d-1) - 3}}&\Big (\frac{n}{2}\Big )^{2(d-1)}= \\&=\frac{n^2}{16} \cdot \frac{1}{2^{d^2 + d - 5}} \cdot \frac{n^{2d-2}}{2^{2d-2}} = \frac{1}{2^{d^2 + 3d -3}} n^{2d} \end{aligned}$$

The estimation uses that \(\ell \) and m are at least \(\frac{n}{4}\), and invokes the induction hypothesis to get a lower bound on \(|\widehat{w}|\).

Theorem 2

For every \(d \geqslant 1\), there is a grammar G of bounded tree dimension d, such that for every \(n \geqslant 2^{d+1}\) there is an n-state partial DFA \(\mathcal {B}\), such that the shortest string w in \(L(G) \cap L(\mathcal {B})\) is of length greater than \(\frac{1}{2^{d^2 + d - 3} 3^{2d}} n^{2d}\).

Proof

Let G be the grammar given for d by Lemma 2. Let \(2^d r \leqslant n < 2^d (r+1)\), for some integer r. Then \(r \geqslant 2\) (for otherwise \(n < 2^{d+1}\)), and \(2^d r \geqslant 2^{d+1}\).

Since \(2^d r\) is divisible by \(2^d\), by Lemma 2, there is a DFA \(\mathcal {B}\) with \(2^d r \leqslant n\) states, such that the length of the shortest string w in \(L(G) \cap L(\mathcal {B})\) is at least \(\frac{1}{2^{d^2 + 3d - 3}} (2^d r)^{2d}\). This is the desired n-state DFA.

The inequality \(n < 2^d (r+1)\) implies that \(n < 2^d \frac{3r}{2}\), because \(r+1\) is at most \(\frac{3r}{2}\) for \(r \geqslant 2\). Then \(2^d r > \frac{2}{3}n\), and the lower bound on the length of w is expressed as a function of n as follows.

$$\begin{aligned} |w| \geqslant \frac{1}{2^{d^2 + 3d - 3}} (2^d r)^{2d} > \frac{1}{2^{d^2 + 3d - 3}} \big (\frac{2}{3} n\big )^{2d} = \frac{1}{2^{d^2 + d - 3} 3^{2d}} n^{2d} \square \end{aligned}$$

For finite automata with fewer than \(2^{d+1}\) states, no lower bounds are given, as the construction in the proof relies on having sufficiently long cycles in the automata.

Overall, the rational index of grammars with tree dimension bounded by d is \(\Theta (n^{2d})\) in the worst case.

5 Tree Dimension in Multi-Component Grammars

Multi-component grammars define the structure of a sentence w by splitting it into constituents that are finite k-tuples of disjoint substrings of w, for k bounded by a constant. In other words, these are substrings with \(k-1\) gaps, and these constituents can be joined to each other by concatenating their components from any sides. In all other respects, these grammars are the same as ordinary (context-free) grammars.

Definition 2

(Vijay-Shankar, Weir, Joshi [37]; Seki, Matsumura, Fujii, Kasami [33]) A multi-component grammar is a quintuple \(G=(\Sigma , N, \mathop {\textrm{rank}}, R, S)\), where

  • \(\Sigma \) is the alphabet of the language being described;

  • N is the set of nonterminal symbols;

  • \(\mathop {\textrm{rank}}:N \rightarrow \mathbb {N}\) is a function that defines the number of components in each nonterminal symbol, so that if \(\mathop {\textrm{rank}}A=k\), then A describes k-tuples of substrings;

  • R is a set of grammar rules, each of the form

    $$\begin{aligned} A(\alpha _1, \ldots , \alpha _{\mathop {\textrm{rank}}A}) \leftarrow B_1(x_{1,1}, \ldots , x_{1, \mathop {\textrm{rank}}B_1}), \ldots , B_\ell (x_{\ell ,1}, \ldots , x_{\ell , \mathop {\textrm{rank}}B_\ell }), \end{aligned}$$
    (1)

    where \(\ell \geqslant 0\), the variables \(x_{i,j}\) are pairwise distinct, \(\alpha _1, \ldots , \alpha _{\mathop {\textrm{rank}}A}\) are strings over symbols from \(\Sigma \) and variables \(x_{i,j}\), and each variable \(x_{i,j}\) occurs in \(\alpha _1 \ldots \alpha _{\mathop {\textrm{rank}}A}\) exactly once;

  • a nonterminal symbol \(S \in N\) of dimension 1 is the “initial symbol”, that is, the category of all well-formed sentences defined by the grammar.

A grammar is seen as a logical system for proving elementary propositions of the form \(A(w_1, \ldots , w_k)\), with \(k=\mathop {\textrm{rank}}A\) and \(w_1, \ldots , w_k \in \Sigma ^*\), meaning that the given k-tuple of strings has the property A. A proof proceeds using the rules in R, with each rule (*) treated as a schema for derivation rules, for any strings substituted for all variables \(x_{i,j}\).

$$\begin{aligned} \frac{B_1(x_{1,1}, \ldots , x_{1, \mathop {\textrm{rank}}B_1})\ldots B_\ell (x_{\ell ,1}, \ldots , x_{\ell , \mathop {\textrm{rank}}B_\ell })}{A(\alpha _1, \ldots , \alpha _{\mathop {\textrm{rank}}A})} \end{aligned}$$

The language generated by the grammar, denoted by L(G), is the set of all such strings w that the proposition S(w) can be derived in one or more such steps.

The rank of a grammar, \(\mathop {\textrm{rank}}G\), is the largest rank of a nonterminal symbol. A multi-component grammar of rank k shall be called a k-component grammar.

Whenever a string w is generated by G, the derivation of a proposition S(w) forms a parse tree. Each node in the tree is labelled with a proposition \(A(w_1, \ldots , w_k)\), where \(k=\mathop {\textrm{rank}}A\) and \(w_1, \ldots , w_k\) are substrings of w. Every node has a corresponding rule (*), by which the proposition is derived, and the direct successors of this node are labelled with \(B_1(x_{1,1}, \ldots , x_{1, \mathop {\textrm{rank}}B_1})\), ..., \(B_\ell (x_{\ell ,1}, \ldots , x_{\ell , \mathop {\textrm{rank}}B_\ell })\), as in the definition of a derivation step.

Example 1

The language \(L = \{ \, a^m b^n c^m d^n \mid m, n \in \mathbb {N} \, \}\) is defined by the following 2-component grammar.

$$\begin{aligned} S(xy)&\leftarrow A(x, y) \\ A(x, byd)&\leftarrow A(x, y) \\ A(x, y)&\leftarrow B(x, y) \\ B(\epsilon , \epsilon )&\leftarrow \\ B(ax, cy)&\leftarrow B(x, y) \end{aligned}$$

Here B defines all pairs \((a^m, c^m)\), with \(m \geqslant 0\), and A defines all pairs \((a^m, b^n c^m d^n)\), with \(m,n \geqslant 0\). Finally S concatenates every such pair to a string in L.

Tree dimension. The conventional definition of parse trees for multi-component grammars conveniently does not include the terminal symbols in the tree structure, and this makes it compatible with ordinary parse trees in terms of tree dimension. The dimension of every node is defined inductively on the dimensions of its children, as in Section 2.

Definition 3

(Grammars of bounded tree dimension) A multi-component grammar G is said to be of tree dimension bounded by d if every parse tree has dimension at most d. The least such constant d is called the dimension of G, denoted by \(\dim G=d\).

6 The Chytil–Monien Lemma for Multi-Component Grammars

The rational index of every language defined by a multi-component grammar of a bounded tree dimension is bounded by a polynomial that depends both on the tree dimension and on the maximum number of components in the grammar. This is proved by a generalization of the proof in Section 3, from what was essentially the one-component case to the case of up to k components.

The first step is to adapt the Chytil–Monien lemma [7, Lem. 7].

Lemma 3

Let \(G=(\Sigma , N, \mathop {\textrm{rank}}, R, S)\) be a multi-component grammar of rank k, let r be the maximum number of nonterminal symbols on the right-hand side of a rule, let \(\ell \) be the maximal number of terminal symbols on the left-hand side of a rule, and assume that there exists a parse tree of dimension \(d \geqslant 1\) in this grammar. Then the grammar defines some string of length less than \(\frac{\ell }{r-1} (|N|(r-1) + 1)^d\) for \(d>1\) (which implies \(r>1\)), and \(|N|\ell \) for \(d = 1\).

Proof

For every nonterminal symbol A and for every tree dimension \(d \geqslant 1\), it is claimed that if there exists at least one parse tree of some string from A, which is of dimension d, then A defines some string of length at most f(d), where \(f(1) = |N|\ell \) and \(f(d) = \big (|N|(r-1)+1\big ) \cdot f(d-1) + |N| \cdot \ell \).

Base case:\(d=1\). Consider the shortest string defined by A with a parse tree of dimension 1, and among these trees, let t be the one with the fewest nodes. A parse tree of dimension 1 is a path formed of nodes of dimension 1. The maximal length of this path is bounded by |N| (otherwise there would be a tree of a shorter string or a tree with fewer nodes). At each node on the path, at most \(\ell \) terminal symbols are appended. Therefore, there are at most \((|N|-1)\ell + \ell = |N|\ell \) symbols in the string.

Induction step. Let \(d > 1\). Note that this implies \(r>1\): indeed, if \(r=1\), then the only form of parse trees are paths with no branching, and hence no trees of dimension greater than 1 can be defined. Again, consider the shortest string defined by A with a tree of dimension d. Recall the proof of Lemma 1 and consider the path from the root of the parse tree which passes through the nodes of dimension d, where every node except the last one has one child of dimension d and other children of dimension \(d-1\). As in the proof of Lemma 1, the length of the path is bounded by |N| (otherwise there is a shorter string). Each node on the path (except the last one) has at most \(r-1\) subtrees of dimension \(d-1\) attached. The last node of the path has at most r children of dimension \(d-1\). By the induction hypothesis, the shortest string defined by a subtree of dimension \(d-1\) is of length at most \(f(d-1)\). Additionally, at each node, at most \(\ell \) terminal symbols listed on the left-hand side of a rule are appended; in total, there are at most \(|N|\ell \) such symbols.

Putting all together, the length of the shortest string is bounded by the following number.

$$\begin{aligned} \big (|N|(r-1)+1\big ) \cdot f(d-1) + |N| \cdot \ell \end{aligned}$$

This is exactly f(d), which completes the proof of the claim.

Now, the length of the shortest string defined by the grammar by a parse tree of dimension d is at most f(d). This number is represented by a recurrence relation of the form \(x_1=|N|\cdot \ell \), \(x_d=ax_{d-1} + b\), where \(a=|N|(r-1)+1\) and \(b=|N| \cdot \ell \). The upper bound on the solution of this recurrence relation is as follows.

$$\begin{aligned} x_d = b\frac{a^d-1}{a-1} = |N|\ell \frac{{(|N|(r-1)+1)}^d-1}{|N|(r-1)} \leqslant \frac{\ell }{r-1} (|N|(r-1) + 1)^d \end{aligned}$$

\(\square \)

The above lemma is used to prove an upper bound similar to the one in Theorem 1. The bound gets raised into the power of k, the maximum number of components in a grammar, because the construction for the intersection with a regular languages produces more nonterminal symbols.

Theorem 3

Let G be an multi-component grammar of rank k and of tree dimension bounded by d, and let \(\mathcal {A}\) be an NFA with n states, with non-empty intersection \(L(G) \cap L(\mathcal {A})\). Then the length of the shortest string in \(L(G) \cap L(\mathcal {A})\) is at most \(O(n^{2kd})\).

Proof

Denote the set of nonterminal symbols in G by N. As proved by Seki et al. [33, Thm. 3.9], for a k-component grammar G and an NFA \(\mathcal {A}\), the intersection \(L(G) \cap L(\mathcal {A})\) is defined by another k-component grammar with at most \(|N| \cdot n^{2k} + 1\) nonterminal symbols. Their construction generalizes that by Bar-Hillel et al. [3], and produces nonterminal symbols of the form \((A,p_1, q_1, \ldots , p_{\mathop {\textrm{rank}}A}, q_{\mathop {\textrm{rank}}A})\), where \(A \in N\), \(\mathop {\textrm{rank}}A\) is a rank of the nonterminal A, and \(p_i,q_i\) are states of the automaton, which it enters before and after reading the corresponding substrings. The transformation preserves the structure of the parse trees: each tree in \(G'\) has the corresponding tree in G that differs only in the labels of its nodes, and has the same dimension.

Since the intersection is non-empty, \(G'\) defines at least one parse tree, which has the same dimension as some parse tree in G, that is, at most d. If \(d \geqslant 2\), then, by Lemma 3, the shortest string defined by \(G'\) is of length at most \(\frac{\ell }{r-1} ((|N| \cdot n^{2k} + 1)(r-1) + 1)^d\), where \(\ell \) and r are constants depending on \(G'\). This number is estimated as follows.

$$\begin{aligned} \frac{\ell }{r-1} \big ((|N| \cdot n^{2k} + 1)(r-1) + 1\big )^d = |N| \cdot \ell \cdot n^{2kd} + o(n^{2kd}) = O(n^{2kd}) \end{aligned}$$

If \(d=1\), then Lemma 3 asserts that the length of the shortest string is bounded by

$$\begin{aligned} (|N| \cdot n^{2k} + 1) \cdot \ell = O(n^{2k}), \end{aligned}$$

which completes the proof.\(\square \)

7 A Lower Bound for Multi-Component Grammars

This section establishes a lower bound on the rational index of multi-component grammars that asymptotically matches the bound in Theorem 3. The main part of the argument is the following lemma that generalizes Lemma 2 to the multi-component case.

Lemma 4

For every \(d \geqslant 1\), \(k \geqslant 2\), there is a k-component grammar G of bounded tree dimension d, such that for every \(n \geqslant 8 \cdot 2^d k^2 \ln 8k\) divisible by \(2^dk\) there is an n-state partial DFA \(\mathcal {B}\), such that the shortest string w in \(L(G) \cap L(\mathcal {B})\) is of length at least \(\frac{1}{k^{2kd-1} \cdot 2^{kd^2 + 5kd -2k -1}} n^{2kd}\).

Proof

For every fixed k, the proof is by an induction on d. For every d, a grammar is constructed first. Next, for each n divisible by \(2^d \cdot k\), an n-state DFA with a unique initial state and a unique accepting state is constructed.

Basis: \(d = 1\). Consider the following k-component grammar G which defines a language \(L = \{ \, a_1^i a_2^i \ldots a_{2k}^i \mid i \geqslant 1 \, \}\).

$$\begin{aligned} S(x_1\ldots x_k)&\leftarrow A(x_1, \ldots , x_k) \\ A(a_1x_1a_2, \ldots , a_{2k-1}x_ka_{2k})&\leftarrow A(x_1, \ldots , x_k) \\ A(a_1a_2, \ldots , a_{2k-1}a_{2k} )&\leftarrow \end{aligned}$$

Every parse tree according to this grammar is a path, and hence the dimension of G is 1 for every k.

The automaton \(\mathcal {B}\) is constructed as a generalization of the automaton \(\mathcal {B}\) for context-free grammars of dimension 1 described in Section 4.

Let \(n \geqslant 16 k^2 \ln 8k\) and n be divisible by 2k, and let \(R_{2k}\) be 2kth Ramanujan prime [28], that is the least number for which there are at least \(R_{2k}\) primes between \(\frac{x}{2}\) and x for all \(x \geqslant R_{2k}\). According to Sondow [34], \(R_{2k}\) is at most \(8k \ln 8k\).

Let \(\ell _1, \ldots , \ell _{2k}\) be different primes, whose values are between \(\frac{n}{4k}\) and \(\frac{n}{2k}\); such primes exist because \(\frac{n}{2k} \geqslant 8k \ln 8k\).

Clearly, \(\sum _{i=1}^{2k} \ell _i\) is at most \(2k \cdot \frac{n}{2k} = n\). Then the automaton \(\mathcal {B}\) over the alphabet \(\{a_1, \ldots , a_{2k}\}\) is constructed as follows. It consists of 2k cycles sharing a common node, which is initial and final state at the same time. Every ith cycle is a cycle of length \(\ell _i\) having transitions by \(a_i\). The automaton has \(1 + \sum _{i=1}^{2k} (\ell _i-1) \leqslant n\) states. DFA \(\mathcal {B}\) is illustrated in Fig. 6.

Fig. 6
figure 6

DFA \(\mathcal {B}\) defined in Lemma 4 for \(d=1\) and fixed \(k \geqslant 0\)

Consider the length of the shortest string in \(w \in L(G) \cap L(\mathcal {B})\). Every such string is of the form \(a^i_1 \ldots a^i_{2k}\) for \(i \geqslant 1\). The automaton accepts the string of this form if and only if i is a multiple of all lengths of its cycles \(\ell _1, \ldots , \ell _{2k}\). All lengths of the cycles are primes by construction, thus the least such i is equal to \(\prod \limits _{i=1}^{2k}\ell _i\). Then the length of w is estimated as follows.

$$\begin{aligned} |w| = 2k\prod \limits _{i=1}^{2k} \ell _i > 2k \prod \limits _{i=1}^{2k} \frac{n}{4k} = 2k \frac{n^{2k}}{(4k)^{2k}} = \frac{1}{k^{2k-1}\cdot 2^{4k-1}} n^{2k} \end{aligned}$$

Induction step: \(dim(G) = d\). By the induction hypothesis, there exists a k-component grammar \(\hat{G}=(\widehat{\Sigma }, \widehat{N}, \mathop {\textrm{rank}}, \widehat{R}, \widehat{S})\) of tree dimension bounded by \(d-1\) that satisfies the condition of the lemma. A k-component grammar of tree dimension bounded by d is then defined over the alphabet \(\Sigma = \widehat{\Sigma } \cup \{a, b, c_1, \ldots , c_{2k-1}\}\), where \(a, b, c_1, \ldots , c_{2k-1} \not \in \widehat{\Sigma }\). The set of nonterminal symbols of new grammar is \(N = \widehat{N} \cup \{S, A, C\}\), where all nonterminal symbols from \(\widehat{N}\) preserve their original rank, while new nonterminals \(A, C, S \not \in \widehat{N}\) have \(\mathop {\textrm{rank}}S=\mathop {\textrm{rank}}A=1\) and \(\mathop {\textrm{rank}}C=k\). The rules of the grammar G include all rules from \(\widehat{G}\) and the following additional rules.

$$\begin{aligned} S(x_1 \ldots x_k)&\leftarrow C(x_1, \ldots , x_k) \\ C(yx_1c_1, c_2 x_2 c_3, \ldots , c_{2k-2}x_kc_{2k-1})&\leftarrow A(y), C(x_1, \ldots , x_k) \\ C(y c_1, c_2 c_3, \ldots , c_{2k-2} c_{2k-1})&\leftarrow A(y) \\ A(ayb)&\leftarrow A(y) \\ A(ayb)&\leftarrow \widehat{S}(y). \end{aligned}$$

Here, the nonterminal A defines all substrings of the form \(a^i u b^i\), where \(i \geqslant 1\) and \(u \in L(\widehat{G})\), as in the grammar from Lemma 2. The nonterminal symbol C generates all k-tuples of the form \((v_1 \ldots v_t c_1^t, c_2^t c_3^t, \ldots , c_{2k-2}^t c_{2k-1}^t)\), where \(t \geqslant 1\) and \(v_1, \ldots , v_t\) are strings defined by nonterminal A. Finally, the nonterminal S defines concatenations of all such k-tuples, so that the language generated by grammar G is of the following form.

$$\begin{aligned} \{ \, a^{i_1} w_1 b^{i_1} \ldots a^{i_t} w_t b^{i_t} c_1^t \ldots c_{2k-1}^t \mid t \geqslant 1, \, i_1, \ldots , i_t \geqslant 1, \, w_1, \ldots , w_t \in L(\widehat{G}) \, \} \end{aligned}$$

Next it is claimed that the dimension of the grammar G is at most d. The dimension of every tree with the root labelled by a nonterminal \(\widehat{S}\) is at most \(d-1\), that is the maximal dimension of grammar \(\widehat{G}\).

Then each subtree with the root labelled by A has dimension bounded by \(d-1\). The dimension of subtrees with the root labelled by C is at most d, because the lowest such subtree has the same dimension \(d-1\) as subtree labelled by A, and each next subtree has one child labelled by A of dimension \(d-1\) and one child C of dimension at most d. The dimension of the root labelled by S is equal to the dimension of the topmost subtree labelled by nonterminal C, and, therefore, is at most d.

Let \(n \geqslant 8 \cdot 2^d k^2 \ln 8k\) and n be divisible by \(2^dk\). As \(\frac{n}{2}\) is greater than \(8 \cdot 2^{d-1} k^2 \ln 8k\) and is divisible by \(2^{d-1}k\), by the induction hypothesis there exists an \(\frac{n}{2}\)-state automaton \(\widehat{\mathcal {B}} = (\widehat{Q}, \widehat{\Sigma }, \widehat{\delta }, \widehat{q}_0, \{\widehat{q}_0\})\) for grammar \(\widehat{G}\) such that the length of the shortest string \(\widehat{w}\) in the intersection \(L(\widehat{G}) \cap L(\widehat{\mathcal {B}})\) has the length at least \(\frac{1}{k^{2k(d-1)-1} \cdot 2^{k(d-1)^2 + 5k(d-1) -2k -1}} n^{2k(d-1)}\).

Then n-state DFA \(\mathcal {B} = (\Sigma , Q, q_0, \delta , \{q_0\})\) is constructed as follows. It consists of DFA \(\widehat{\mathcal {B}}\) with at most \(\frac{n}{2}\) states, and the second half of the states is used to construct cycles. Let \(\ell _1, \ldots , \ell _{2k}\) be primes, such that \(\frac{n}{8k} \leqslant \ell _i \leqslant \frac{n}{4k}\) for every i. These numbers define the lengths of the cycles, and the set of states of the new automaton is as follows.

$$\begin{aligned} Q = \widehat{Q} \cup \{q_0, \underbrace{p^1_1, \cdots , p^1_{\ell _1-1}}_{\text {cycle by}\, c_1}, \cdots , \underbrace{p^{2k-1}_1, \ldots , p^{2k-1}_{\ell _{2k-1}-1}}_{\text {cycle by}\, c_{2k-1}}, \underbrace{q_1, \cdots , q_{\ell _{2k}-1}}_{\text {chain by} \,a \, \text {and}\, b}\} \end{aligned}$$

Overall, at most \(2k \frac{n}{4k} = \frac{n}{2}\) states are used for constructing of cycles.

The transitions are similar to those in the proof of Lemma 2, with the only difference that instead of one cycle by c the automaton in this proof has \(2k-1\) cycles by \(c_1\), ..., \(c_{2k-1}\). The initial state is \(q_0\), it has a transition by a to the initial state of \(\widehat{\mathcal {B}}\), continued by a transition by b to \(q_1\).

$$\begin{aligned} \delta (q_0, a)&= \widehat{q}_0 \\ \delta (\widehat{q}_0, b)&= q_1 \end{aligned}$$

A chain of transitions by a and another chain by b are as in Lemma 2; they use states from \(q_1\) to \(q_{\ell _{2k}-1}\).

$$\begin{aligned} \delta (q_i, a)&= q_{i-1},{} & {} \text {with } 1 \leqslant i \leqslant \ell _{2k}-1 \\ \delta (q_i, b)&= q_{i+1 \bmod \ell _{2k}},{} & {} \text {with } 1 \leqslant i \leqslant \ell _{2k}-1 \end{aligned}$$

For each i, with \(1 \leqslant i \leqslant 2k-1\), there is a cycle by \(c_i\) in the states \(q_0=p^i_0\), \(p^i_1\), ..., \(p^i_{\ell _i-1}\).

$$\begin{aligned} \delta (p^i_j, c_i)&= p^i_{j+1 \bmod \ell _i},{} & {} \text {with } 0 \leqslant j \leqslant \ell _i-1 \end{aligned}$$

This automaton \(\mathcal {B}\) is illustrated in Fig. 7.

Fig. 7
figure 7

DFA \(\mathcal {B}\) defined in Lemma 4 for \(d \geqslant 2\) and fixed k, which incorporates DFA \(\widehat{\mathcal {B}}\) for \(d-1\) and k

Every string w in \(L(G) \cap L(\mathcal {B})\) is of the form \(w = a^{i_1} w_1 b^{i_1} \ldots a^{i_t} w_t b^{i_t} c_1^t \ldots c_{2k-1}^t\), with \(t \geqslant 1\), \(i_1, \ldots , i_t \geqslant 1\) and \(w_1, \ldots , w_t \in L(\widehat{G})\), because it is defined by G. The string w is also accepted by \(\mathcal {B}\), and, up to the first \(c_1\), the automaton’s computation proceeds as in the proof of Lemma 2.

Claim. After reading each prefix \(a^{i_1} w_1 b^{i_1} \cdots a^{i_s} w_s b^{i_s}\) of w, with \(s \in \{0, \ldots , t\}\), the automaton comes to the state \(q_{s \bmod \ell _{2k}}\).

Thus, the automaton \(\mathcal {B}\) reaches the first \(c_1\) in the state \(q_{t \bmod \ell _{2k}}\). For the automaton to proceed further, this state must be \(q_0\), and hence t is divisible by \(\ell _{2k}\). Next, the automaton reads each of the blocks \(c_1^t\), ..., \(c_{2k-1}^t\), and must begin and end reading each of them in the same state \(q_0\). Therefore, t must also be divisible by all numbers \(\ell _1\), ..., \(\ell _{2k-1}\). As all \(\ell _i\) are primes, t is divisible by their product \(\prod \limits _{i=1}^{2k}\ell _i\).

As in the proof of Lemma 2, the automaton begins and ends reading each substring \(w_i\) in the state \(\widehat{q}_0\), and hence all these strings are in \(L(\widehat{G}) \cap L(\widehat{\mathcal {B}})\).

The shortest w satisfying the above constraints then has \(t=\prod \limits _{i=1}^{2k}\ell _i\), and its length is at least \(t \cdot |\widehat{w}|\), where \(\widehat{w}\) is the shortest string in \(L(\widehat{G})\). Using a lower bound on \(|\widehat{w}|\) given by the induction hypothesis, the lower bound on the length of w is estimated as follows.

$$\begin{aligned} |w|\geqslant & {} \bigg (\prod \limits _{i=1}^{2k}\ell _i\bigg ) \cdot |\widehat{w}|\geqslant \\\geqslant & {} \bigg (\prod \limits _{i=1}^{2k}\frac{n}{8k}\bigg ) \cdot \frac{1}{k^{2k(d-1)-1} \cdot 2^{k(d-1)^2 + 5k(d-1) -2k -1}} \Big (\frac{n}{2}\Big )^{2k(d-1)}=\\= & {} \frac{n^{2k}}{2^{6k}k^{2k}} \cdot \frac{1}{k^{2kd-2k-1} \cdot 2^{k(d-1)^2 + 5k(d-1) -2k -1 + 2k(d-1)}} n^{2k(d-1)} =\\{} & {} \qquad \qquad \qquad \qquad \qquad \qquad \qquad \qquad \qquad \quad =\frac{1}{k^{2kd-1} \cdot 2^{kd^2 + 5kd -2k -1}} n^{2kd}. \end{aligned}$$

\(\square \)

Theorem 4

For every \(d \geqslant 1\), \(k \geqslant 2\) there is a grammar G of bounded tree dimension d, such that for every \(n \geqslant 8 \cdot 2^d k^2 \ln 8k\) there is an n-state partial DFA \(\mathcal {B}\), such that the shortest string w in \(L(G) \cap L(\mathcal {B})\) is of length at least \(\frac{k}{2^{kd^2 - kd -2k -1} \cdot (8k+1)^{2kd}} n^{2kd}\).

Proof

Let G be the grammar given for d by Lemma 4. Let \(2^d k r \leqslant n < 2^d k (r+1)\), for some integer r. Then \(r \geqslant 8k \ln 8k\) (for otherwise n would be less than \(8 \cdot 2^d k^2 \ln 8k\)), and \(2^d k r \geqslant 2^{d+1} k\).

Since \(2^d k r\) is divisible by \(2^dk\) and is at least \(8 \cdot 2^d k^2 \ln 8k\), by Lemma 4, there is a DFA \(\mathcal {B}\) with \(2^d k r \leqslant n\) states, such that the length of the shortest string w in \(L(G) \cap L(\mathcal {B})\) is at least \(\frac{1}{k^{2kd-1} \cdot 2^{kd^2 + 5kd -2k -1}} \cdot (2^d k r)^{2kd}\). This is the desired n-state DFA.

Since \(r > 8k\), the inequality \(n < 2^d k (r+1)\) implies that \(n < 2^d k r \frac{8k+1}{8k}\) and accordingly \(2^d k r > n \frac{8k}{8k+1}\). Then, the lower bound on the length of w is expressed as a function of n as follows.

$$\begin{aligned} |w|&\geqslant \frac{1}{k^{2kd-1} \cdot 2^{kd^2 + 5kd -2k -1}} (2^d k r)^{2kd}> \\ {}&> \frac{1}{k^{2kd-1} \cdot 2^{kd^2 + 5kd -2k -1}} \Big (n \cdot \frac{8k}{8k+1}\Big )^{2kd} = \\ {}&= \frac{1}{k^{2kd-1} \cdot 2^{kd^2 + 5kd -2k -1}} \cdot \frac{2^{6kd} \cdot k^{2kd}}{(8k+1)^{2kd}} \cdot n^{2kd} = \\ {}&= \frac{k}{2^{kd^2 - kd - 2k - 1} (8k+1)^{2kd}} \cdot n^{2kd} \end{aligned}$$

\(\square \)

The final conclusion on the rational index of k-component grammars with tree dimension bounded by d is that it is in the worst case of the order \(\Theta (n^{2kd})\).

8 Rational Indices for Some Language Families

For a few families of grammars known in the literature, the results of this paper imply some bounds on their rational index.

Superlinear languages. A grammar \(G = (\Sigma , N, R, S)\) is superlinear (Brzozowski [5]) if its nonterminal symbols split into two classes, \(N = N_{lin} \cup N_{nonlin}\), where rules for each nonterminal \(A \in N_{lin}\) are of the form \(A \rightarrow uBv\) or \(A \rightarrow w\), with \(B \in N_{lin}\), \(u,v,w \in \Sigma ^*\), while rules for a nontermial \(A \in N_{nonlin}\) are of the form \(A \rightarrow \alpha B \beta \), with \(B \in N\) and \(\alpha ,\beta \in (\Sigma \cup N_{lin})^*\). A language is superlinear if it is generated by some superlinear grammar.

Corollary 1

For every superlinear grammar G, the rational index \(\rho _{L(G)}\) is at most \(O(n^4)\).

Proof

Parse trees in a superlinear grammar G have dimension at most 2. Then, by Theorem 1, the rational index \(\rho _{L(G)}\) is bounded by \(O(n^4)\).

Turning to a lower bound, note that the grammar constructed in Theorem 2 for \(d=2\) is actually superlinear.

Corollary 2

There exists a superlinear grammar G with rational index \(\rho _{L(G)}(n) \geqslant \frac{1}{648} n^4\).

Bounded-oscillation languages. The notion of oscillation of runs in pushdown automata, applicable to Turing machines with auxiliary pushdown tape, was introduced by Wechsung [38]. Languages with oscillation bounded by k are then a generalization of the linear languages (as one-turn pushdown automata are those with oscillation bounded by \(k=1\)).

This family was later studied by Ganty and Valput [13], who introduced the corresponding notion of oscillation in parse trees of grammars. Among other results, they prove that oscillation of a parse tree is closely related to its dimension.

Lemma 5

(Ganty and Valput [13]) Let \(G = (\Sigma , N, R, S)\) be a grammar in the Chomsky normal form, and let t be a parse tree in G. Then, osc \(t - 1 \leqslant \dim t \leqslant 2 \,\,\text {osc} \, t\).

Thus, k-bounded-oscillation grammars have dimension of parse trees bounded by 2k, and Theorem 1 gives the following upper bound on the rational index of these languages.

Corollary 3

Let L be a k-bounded-oscillation language. Then \(\rho _{L}(n) = O(n^{4k})\).

Linear (non-branching) multi-component grammars. A multi-component grammar is called linear [10], or, alternatively, non-branching [18] if each rule has no more than r occurrences of nonterminals in its body. A unary branching grammar is often called non-branching or linear.

Corollary 4

For every linear k-component grammar G, the rational index \(\rho _{L(G)}\) is at most \(O(n^{2k})\).

Proof

Parse trees in a non-branching k-component grammar G have dimension at most 1. Then, by Theorem 3, the rational index \(\rho _{L(G)}\) is bounded by \(O(n^{2k})\).\(\square \)

9 Adaptation to Further Grammar Families

If the branching of parse trees in an ordinary (context-free) grammar is restricted, then the rational index of the language is polynomial of a certain degree that depends on the bound on the branching. The result has been generalized to multi-component grammars, where the degree of the polynomial additionally depends on the rank of the grammar. The question is, could this kind of results hold for any other grammar families of interest, under suitable restrictions on the structure of their parse trees?

LL(k), LR(k) and unambiguous grammars. LL(k) grammars, LR(k) grammars and unambiguous grammars are classical subfamilies of grammars that are notable for their beautiful theoretical properties and diverse practical applications. Under the restriction of d-bounded dimension of parse trees, the rational index of all these grammars is subject to the upper bound \(O(n^{2d})\) given in Theorem 1. It turns out that the lower bound \(\Omega (n^{2d})\) holds as well, because all grammars constructed in Theorem 2 can be transformed to the most restricted of these subfamilies: LL(1)-grammars in the Greibach normal form, also known as simple grammars.

A grammar \(G = (\Sigma , N, R, S)\) is LL(1) in the Greibach normal form if every rule is of the form \(A \rightarrow a \alpha \), with \(a \in \Sigma \) and \(\alpha \in (\Sigma \cup N)^*\), and if, furthermore, there is at most one such rule for every pair (Aa). Both grammars in the proof of Theorem 2 can be rewritten as LL(1) grammars in the Greibach normal form: the first language \(\{ \, a^i b^i \mid i \geqslant 1 \, \}\) is defined by a grammar with the rules

$$\begin{aligned} S&\rightarrow a T \\ T&\rightarrow a T b \ | \ b \end{aligned}$$

The second language consists of all strings \(a^{i_1} w_1 b^{i_1} \ldots a^{i_t} w_t b^{i_t} c^t\), with \(t \geqslant 1\), \(i_1, \ldots , i_t \geqslant 1\), and with \(w_1, \ldots , w_t\) defined by a nonterminal symbol \(\widehat{S}\) using some rules that already satisfy the requirement of being LL(1) in the Greibach normal form. This language is defined by the following grammar.

$$\begin{aligned} S&\rightarrow a A T \\ A&\rightarrow a A b \\ A&\rightarrow \widehat{S} b{} & {} (\widehat{S} \rightarrow \sigma \in \widehat{R}) \\ T&\rightarrow c \ | \ a A T c \end{aligned}$$

An interested reader can verify that the dimensions of parse trees in the proof of Theorem 2 are the same for this grammar, and hence the argument remains valid with this grammar substituted.

Conjunctive and Boolean grammars. Conjunctive grammars are an extension of ordinary (context-free) grammars with a conjunction operation, so that each rule is of the form \( A \rightarrow \alpha _1 \& \ldots \& \alpha _m\), with \(m \geqslant 1\) and \(\alpha _1, \ldots , \alpha _m \in (\Sigma \cup N)^*\); such a rule states that every string defined by each \(\alpha _i\) is then defined by A. Boolean grammars further extend the model with a conjunction operation. These grammars preserve the notion of a parse tree, which becomes an acyclic graph, and are primarily notable for generalizations of classical parsing algorithms to handle these grammars. An interested reader is referred to an up-to-date survey of conjunctive grammars [25] and to an earlier survey of Boolean grammars [24] for more details.

The question is, could any restriction on the structure of branching in parse trees for these grammars lead to any upper bound on the rational index? Unfortunately, no way of adapting the results of this paper to these models is anticipated. Consider that the language of valid accepting computations of a Turing machine (VALC) is an intersection of two linear languages; a conjunctive grammar can define such an intersection by including two linear grammars, with initial symbols A and B, and adding a rule \( S \rightarrow A \& B\). Parse trees in this grammar will have the simplest possible structure for a conjunctive grammar, with two linear trees coming out of the root, and with each leaf shared between these two trees. However, since checking the emptiness of VALC is undecidable, there is no a priori recursive upper bound on the rational index.