1 Introduction

Formal grammars are both a classical subject, presented in all computer science curricula, and a topic of ongoing theoretical and applied research. Over half a century, the classroom presentation of grammars has stabilized to a certain collection of essential facts on context-free grammars, all established in the 1960s. However, the research on formal grammars did not end in the 1960s: many results on context-free grammars were established, and also quite a few new grammar models have been introduced over the years. Some of the new models did not work out well: indeed, when there are no examples of sensible language specifications by the proposed grammars, no efficient algorithms and even no theoretical results of any value, such a model deserves oblivion. On the other hand, a few models were found to share some useful properties of context-free grammars, which confirmed their value, and the research on such models has carried on.

The aim of this paper is to present several grammar families with good properties together, emphasizing their common underlying principles, and tracing what happens to the fundamental ideas of formal grammar theory as they are applied to these families. Even putting these grammar families together already requires a certain reappraisal of foundations. At the dawn of computer science, the study of formal grammars was dominated by the string-rewriting approach of Chomsky’s [17] early work. By now, the Chomsky hierarchy of rewriting systems retains a purely historical value, and modern presentations of the basics of formal grammars, such as the one in Sipser’s [78] textbook, omit it altogether. Rather than try fitting all kinds of grammars into this framework, one should look for the actual common ground of the existing formal grammar families, and present them in light of the current state of knowledge.

This common ground is the understanding of formal grammars as a logic for describing syntax, with grammars defined through inference rules. For instance, the string-rewriting approach is to rewrite S into \(\text {NP} \ \text {VP}\) by a rule \(S \rightarrow \text {NP} \ \text {VP}\), and then proceed with rewriting \(\text {NP} \ \text {VP}\) into, e.g., Every man is mortal; using inference rules, a proposition \(S(\texttt {Every man is mortal})\) is instead inferred from \(\text {NP}(\texttt {Every man})\) and \(\text {VP}(\texttt {is mortal})\). This more modern understanding of grammars has led to many developments in formal grammar theory that would not be possible under the string-rewriting approach. In particular, it is essential in the definitions of the multi-component grammars of Seki et al. [77], it led to the important representation of formal grammars in the FO(LFP) logic, given by Rounds [74], and to a uniform description of parsing algorithms by Pereira and Warren [70].

Once the grammar families are uniformly defined, this paper aims to trace the main recurring ideas in the field, all originating from the context-free grammar theory, as they are applied to various new grammar models. It turns out that almost none of these ideas are exclusive to context-free grammars, and their applicability to other grammar families confirms that the formal grammar theory is larger than just the context-free grammar theory. As well put by Chytil [19], “A tour through the theory of context-free languages reveals an interesting feature of many celebrated results—something like ‘stability’ of their proofs, or ‘buried reserves’ contained in them”.

Before proceeding any further, there is a certain small detail to mention that will likely arouse some controversy. When Chomsky [17] proposed the term “context-free grammar”, he considered the idea of a phrase-structure rule applicable only in specified contexts, and attempted to implement this idea by a string-rewriting system. Thus, the ordinary kind of grammars got a name “context-free” to distinguish them from the attempted new model. However, it was soon found that context-sensitive string rewriting does not implement any sensible syntactic descriptions, and, as a formal grammar model, it makes no sense. As of today, even though the term “context-free” has been repeated for decades, no major model in the theory of formal grammars uses contexts of any kind. Thus, this name is no longer suitable for distinguishing the main model of syntax from other related models. Due to the significance of this model in computer science and its central position in the theory of formal grammars, the suggested alternative name is

an ordinary grammar.

This name is used throughout this paper. All other grammar families are named after the feature that makes them different from the ordinary grammars: linear grammars, unambiguous grammars, conjunctive grammars, multi-component grammars, etc.

2 Grammar Families

A formal grammar is a mathematically precise syntactical description, which formalizes natural definitions, such as “A noun phrase followed by a verb phrase is a sentence” or “An arithmetical expression enclosed in brackets is also an arithmetical expression”. The following standard example of a grammar illustrates this kind of definitions.

Example 1

The set of well-nested strings of brackets over the alphabet \(\varSigma =\{a,b\}\), known as the Dyck language, is defined by the following conditions.

  • The empty string \(\varepsilon \) is well-nested.

  • If w is a well-nested string, then so is awb.

  • If u and v are well-nested strings, then so is uv.

In the notation of formal grammars, this definition is expressed as follows, where S is the syntactic category of well-nested strings.

$$\begin{aligned} S \rightarrow \varepsilon \ | \ aSb \ | \ SS \end{aligned}$$

The S on the left-hand side, followed by an arrow, means that this is the definition of strings with the property S. The right-hand side defines the possible forms of such strings, with alternatives separated by vertical lines. Any occurrence of S on the right-hand sides stands for an arbitrary string with the property S.

Actually, there is a bizarre detail: a rule in a grammar means a logical implication from its right-hand side to its left-hand side, and therefore there is every reason to write the arrow in the opposite direction, as in \(S \leftarrow \varepsilon \ | \ aSb \ | \ SS\). This paper follows the traditional, incorrect direction of arrows.

The general form of this kind of syntactic descriptions is universally known.

Definition 1

(Chomsky [17]). An ordinary grammar (Chomsky’s “context-free”) is a quadruple \(G=(\varSigma , N, R, S)\), where

  • \(\varSigma \) is the alphabet of the language being defined;

  • N is the set of syntactic categories defined in the grammar, which are typically called nonterminal symbols;

  • R is the set of rules, each of the form \(A \rightarrow u_0 B_1 u_1 \ldots B_\ell u_\ell \), with \(\ell \geqslant 0\), \(u_0, u_1, \ldots , u_\ell \in \varSigma ^*\) and \(B_1, \ldots , B_\ell \in N\).

  • the nonterminal symbol S represents the syntactic category of grammatically correct strings (“sentences” of the language).

Each rule \(A \rightarrow u_0 B_1 u_1 \ldots B_\ell u_\ell \) defines a possible structure of strings with the property A; it means that if each string \(v_i\) has the property \(B_i\), then the string \(u_0 v_1 u_1 \ldots v_\ell u_\ell \) has the property A. If there are multiple rules for A, this means that strings with the property A may be of any of the given forms.

All other grammar families are obtained by modifying some elements of this definition. For that reason, in order to see what kinds of grammar families there can be, one should first take note of what this definition is comprised of.

Definition 2

A family of grammars is characterized by the following three elements.  

  • Constituents, that is, the fragments from which a sentence is formed. Fragments are joined together to form larger fragments, until the entire sentence is ultimately obtained. In most grammar families, the constituents are substrings of the sentence.

  • Operations on constituents, used to express other constituents.

  • Logical operations, used to define syntactic conditions.

  A grammar family is thus a specialized logic for reasoning about the properties of constituents.

In ordinary grammars, the constituents are substrings, and the only operation on constituents is concatenation, that is, writing two substrings one after another to form a longer substring. The only logical operation is disjunction of syntactical conditions.

How can one modify this model? One of its special cases, the linear grammars, is obtained by restricting the operations on constituents: instead of concatenation of arbitrary substrings, as in a rule \(S \rightarrow SS\), linear grammars may only concatenate fixed symbols to a substring from both sides, as in a rule \(S \rightarrow aSb\).

Another special case, the unambiguous grammars, restricts both the concatenation and the disjunction. In an unambiguous grammar, whenever a concatenation of two languages, K and L, is expressed, it is required that every string w has at most one partition \(w=uv\) into a string u in K and a string v in L. Furthermore, whenever a disjunction is used in an unambiguous grammar, at most one alternative must be true.
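For instance, the Dyck language of Example 1 also has an unambiguous grammar: every nonempty well-nested string is uniquely factored as aubv, where aub is its shortest nonempty well-nested prefix and v is the remaining well-nested suffix.

$$\begin{aligned} S \rightarrow aSbS \ | \ \varepsilon \end{aligned}$$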

Several grammar families are obtained by splitting the alphabet \(\varSigma \) into left brackets and right brackets, restricting the constituents to be well-nested substrings, and then restricting the concatenation, ensuring, in particular, that only well-nested strings could be expressed. Such models were proposed by McNaughton [53] (“parenthesis grammars”), by Ginsburg and Harrison [30] (“bracketed grammars”), by Berstel and Boasson [9] (“balanced grammars”) and by Alur and Madhusudan [3] (“visibly pushdown grammars”).

One possible direction for extending ordinary grammars is to augment the set of allowed logical operations. Adding the conjunction of any syntactical conditions leads to conjunctive grammars, defined as follows.

Definition 3

[57, 65]. A conjunctive grammar is a quadruple \(G=(\varSigma , N, R, S)\), where \(\varSigma \), N and S are as in an ordinary grammar, and each rule in R is of the form \( A \rightarrow \alpha _1 \mathop { \& }\ldots \mathop { \& }\alpha _m\), where \(m \geqslant 1\) and \(\alpha _1, \ldots , \alpha _m \in (\varSigma \cup N)^*\).

Such a rule asserts that if a string w can be represented according to each conjunct \(\alpha _i\), then w has the property A. More precisely, let \(\alpha _i = u_{i,0} B_{i,1} u_{i,1} B_{i,2} \ldots u_{i,\ell _i-1} B_{i,\ell _i} u_{i,\ell _i}\), with \(B_{i,1}, \ldots , B_{i,\ell _i} \in N\), \(\ell _i \geqslant 0\) and \(u_{i,0}, u_{i,1}, \ldots , u_{i,\ell _i} \in \varSigma ^*\). Then, if, for each \(i \in \{1, \ldots , m\}\), the string w is representable as \(w = u_{i,0} v_{i,1} u_{i,1} v_{i,2} \ldots u_{i,\ell _i-1} v_{i,\ell _i} u_{i,\ell _i}\), where each \(v_{i,j}\) has the property \(B_{i,j}\), then w has the property A.

Thus, conjunctive grammars use substrings as constituents, concatenation as the only operation on substrings, and unrestricted disjunction and conjunction as logical operations. Years before this model was studied by the author [57], exactly the same model was defined by Szabari [81] in his unpublished Master’s thesis. Closely related models were considered by Boullier [13] and by Lange [50]. Clark et al. [20] and Yoshinaka [90] defined an equivalent grammar model for their learning algorithm.

Example 2

The following conjunctive grammar describes the language \(\{ \, a^n b^n c^n \mid n \geqslant 0 \, \}\).

$$ \begin{aligned}\begin{array}{rcl} S &{}\rightarrow &{} AB \mathop { \& }DC \\ A &{}\rightarrow &{} aA \ | \ \varepsilon \\ B &{}\rightarrow &{} bBc \ | \ \varepsilon \\ C &{}\rightarrow &{} cC \ | \ \varepsilon \\ D &{}\rightarrow &{} aDb \ | \ \varepsilon \end{array}\end{aligned}$$

The rules for the nonterminal symbols A, B, C and D do not use conjunction, and have the same meaning as in an ordinary grammar. In particular, A and C define the languages \(a^*\) and \(c^*\), B defines \(\{ \, b^n c^n \mid n \geqslant 0 \, \}\) and D defines \(\{ \, a^k b^k \mid k \geqslant 0 \, \}\). Then the rule for S represents the strings in \(a^* b^* c^*\) that have the same number of symbols b and c (ensured by AB), as well as the same number of symbols a and b (specified by DC). These are exactly all strings of the form \(a^n b^n c^n\).

Boolean grammars are a further extension of conjunctive grammars that additionally allows the negation.

Definition 4

[58, 65]. A Boolean grammar is a quadruple \(G=(\varSigma , N, R, S)\), where each rule in R is of the form \( A \rightarrow \alpha _1 \mathop { \& }\ldots \mathop { \& }\alpha _m \mathop { \& }\lnot \beta _1 \mathop { \& }\ldots \mathop { \& }\lnot \beta _n\), with \(m,n \geqslant 0\), \(m+n \geqslant 1\) and \(\alpha _1, \ldots , \alpha _m, \beta _1, \ldots , \beta _n \in (\varSigma \cup N)^*\).

This rule asserts that if a string w can be represented according to each positive conjunct \(\alpha _i\), and cannot be represented according to any negative conjunct \(\beta _j\), then w has the property A.

Restricting conjunctive grammars and Boolean grammars to use linear concatenation yields linear conjunctive grammars and linear Boolean grammars, which are known to be equivalent in power [59].

Extended grammar models of another kind feature more complicated constituents, such as substrings with a gap. A substring with a gap is a pair (u, v), with \(u, v \in \varSigma ^*\), which stands for any substring of the form \(u\textsc {-gap-}v\), that is, uxv with \(x \in \varSigma ^*\). Using these constituents facilitates describing syntactical links between two remote parts of the same sentence. A more complicated type of constituents are substrings with multiple gaps, that is, k-tuples of the form \((u_1, u_2, \ldots , u_k)\), representing substrings \(u_1 \textsc {-gap-} u_2 \textsc {-gap-} \ldots \textsc {-gap-} u_k\) with \(k-1\) gaps. Grammars in which every nonterminal symbol defines a set of such k-tuples, for some k, were independently defined by Seki et al. [77] (as “multiple context-free grammars”) and by Vijay-Shanker et al. [89] (as “linear context-free rewriting systems”). In these grammars, which shall be called multi-component grammars in this paper, there is the following operation on constituents: given a k-tuple and an \(\ell \)-tuple, the substrings therein may be concatenated with each other in any combinations, forming an m-tuple, with \(1 \leqslant m \leqslant k+\ell \). The only logical operation is disjunction.

Definition 5

(Seki et al. [77]; Vijay-Shanker et al. [89]). A multi-component grammar is a quintuple \(G=(\varSigma , N, \mathop {\mathrm {dim}}, R, S)\), where \(\varSigma \) and N are as in an ordinary grammar, the function \(\mathop {\mathrm {dim}}:N \rightarrow \mathbb {N}\) defines the dimension of each nonterminal symbol, that is, the number of components it refers to, and each rule in R defines a possible structure of \((\mathop {\mathrm {dim}}A)\)-tuples with the property A as a composition of \(\ell \) \((\mathop {\mathrm {dim}}B_i)\)-tuples with the property \(B_i\), for some \(B_1, \ldots , B_\ell \in N\).

For each i, let \((x_{i,1}, \ldots , x_{i, \mathop {\mathrm {dim}}B_i})\) be variables representing the components of a \((\mathop {\mathrm {dim}}B_i)\)-tuple with the property \(B_i\). Let \(\alpha _1, \ldots , \alpha _{\mathop {\mathrm {dim}}A}\) be strings comprised of symbols from \(\varSigma \) and the variables \(x_{i,j}\), with the condition that every variable \(x_{i,j}\) occurs in \(\alpha =\alpha _1 \ldots \alpha _{\mathop {\mathrm {dim}}A}\) exactly once, and that, for every i, the variables \(x_{i,1}, \ldots , x_{i, \mathop {\mathrm {dim}}B_i}\) occur in \(\alpha \) in their original order. These strings represent the form of the desired \((\mathop {\mathrm {dim}}A)\)-tuple, which is written down as the following rule in R.

$$\begin{aligned} A(\alpha _1, \ldots , \alpha _{\mathop {\mathrm {dim}}A}) \rightarrow B_1(x_{1,1}, \ldots , x_{1, \mathop {\mathrm {dim}}B_1}), \ldots , B_\ell (x_{\ell ,1}, \ldots , x_{\ell , \mathop {\mathrm {dim}}B_\ell }) \end{aligned}$$

The initial symbol has \(\mathop {\mathrm {dim}}S=1\). The dimension of the grammar is the maximum dimension of a nonterminal symbol: \(\mathop {\mathrm {dim}}G=\max _{A \in N} \mathop {\mathrm {dim}}A\).

In this notation, a rule \(A \rightarrow BC\) in an ordinary grammar is written down as \(A(xy) \rightarrow B(x), C(y)\).

Example 3

The following multi-component grammar, with \(\mathop {\mathrm {dim}}S=1\) and \(\mathop {\mathrm {dim}}A=2\), defines the language \(\{ \, a^m b^n c^m d^n \mid m,n \geqslant 0 \, \}\).

$$\begin{aligned} S(xy)&\rightarrow A(x, y) \\ A(axb, cyd)&\rightarrow A(x, y) \\ A(\varepsilon , \varepsilon )&\rightarrow \end{aligned}$$

An interesting special case of multi-component grammars are the well-nested multi-component grammars, in which the constituents are the same, and the operations on the constituents are restricted as follows. For each rule in R, for any variables \(x_{i,j}\) and \(x_{i,k}\) of some \(B_i\), and for any variables \(x_{i',m}\) and \(x_{i',n}\) of some \(B_{i'}\), their occurrences in \(\alpha _1, \ldots , \alpha _{\mathop {\mathrm {dim}}A}\) may not cross each other, as in \(\ldots x_{i,j} \ldots x_{i',m} \ldots x_{i,k} \ldots x_{i',n} \ldots \). The grammar in Example 3 is well-nested.

Well-nested 2-component grammars have received particular attention in the literature. In these grammars, constituents are substrings with at most one gap, and the operations are: wrapping a pair (u, v) around a pair (x, y), producing a pair (ux, yv); creating a pair \((w, \varepsilon )\) or \((\varepsilon , w)\) out of a substring w; removing the gap in a pair (u, v), obtaining the string uv. This model was defined under the somewhat cryptic name of head grammars [71]; the definitions were later restated in the modern form by Rounds [74]. These grammars were then found to be equivalent to the earlier studied tree-adjoining grammars [88].

Fig. 1. Hierarchy of grammar families: inclusions.

The hierarchy of grammar families described in this paper is given in Fig. 1, where each arrow indicates containment; all inclusions without a question mark are known to be proper. The hierarchy is centered at the ordinary grammars (Ordinary), Chomsky’s “context-free”; other families are defined in reference to them. Their special cases are: the unambiguous grammars (Unamb); the LR and the LL grammars (LR, LL); and their subcases with linear concatenation (Lin, UnambLin, LRLin, LLLin). A family of grammars dealing with well-nested strings is labelled with the name of the corresponding automata: the input-driven pushdown automata (IDPDA). Then, there are the generalizations of ordinary grammars: the well-nested 2-component grammars, labelled according to the name “tree-adjoining grammars” (TAG); conjunctive grammars (Conj); Boolean grammars (Bool); and their unambiguous subclasses (UnambTAG, UnambConj, UnambBool). Finally, there is the full family of multi-component grammars (Multi) and the linear conjunctive grammars (LinConj). The regular languages stand at the bottom of the hierarchy.

3 Inference Rules and Parse Trees

So far, grammars have been presented as an intuitively defined formalism for language specification, as in Example 1. Several mathematically precise definitions of grammars are known. Chomsky’s [17] string rewriting is one possible way of formalizing these syntactic descriptions. Although it is sufficient to reason about ordinary grammars, it does not explicitly follow the intuition behind Example 1, and is unsuitable for defining the extensions of ordinary grammars.

A much clearer definition, representing the right outlook on formal grammars, was presented, for instance, in a monograph by Kowalski [46, Chap. 3]. This definition regards a grammar as a logic, in which the properties of any string can be inferred from the properties of its substrings by means of logical inference. For an ordinary grammar, the logic deals with propositions of the form “a string \(w \in \varSigma ^*\) has the property \(A \in N\)”, denoted by A(w). For instance, a rule \(A \rightarrow BC\) allows the following deductions to be made, for all \(u, v \in \varSigma ^*\): from the propositions B(u) and C(v), the proposition A(uv) is deduced.

Example 4

For the ordinary grammar in Example 1, the well-nestedness of the string abaabb is proved by the following logical derivation.
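Written out as a sequence of inferred propositions, one such derivation is the following.

$$\begin{aligned}&S(\varepsilon ) \quad \text {by the rule } S \rightarrow \varepsilon , \\ &S(ab) \quad \text {by the rule } S \rightarrow aSb \text {, from } S(\varepsilon ), \\ &S(aabb) \quad \text {by the rule } S \rightarrow aSb \text {, from } S(ab), \\ &S(abaabb) \quad \text {by the rule } S \rightarrow SS \text {, from } S(ab) \text { and } S(aabb). \end{aligned}$$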

This object is essentially the parse tree.

In a conjunctive grammar, a proposition A(w) can be deduced by a rule \( A \rightarrow BC \mathop { \& }DE\) as follows: for any two partitions \(w=uv=xy\), with \(u,v,x,y \in \varSigma ^*\), if the propositions B(u), C(v), D(x) and E(y) have all been deduced, then A(w) is deduced.

Example 5

For the conjunctive grammar in Example 2, the following derivation establishes that the string \(w=abc\) is a well-formed sentence.
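Written out in the same manner as in Example 4, the derivation first establishes the propositions needed for each conjunct, and then applies the rule for S.

$$\begin{aligned}&A(\varepsilon ), \ A(a), \qquad B(\varepsilon ), \ B(bc), \qquad D(\varepsilon ), \ D(ab), \qquad C(\varepsilon ), \ C(c), \\ &S(abc) \quad \text {by the rule } S \rightarrow AB \mathop { \& }DC\text {, using the partitions } abc = a \cdot bc \text { and } abc = ab \cdot c. \end{aligned}$$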

This inference, as well as any such inference for a conjunctive grammar, represents a parse tree with shared leaves: indeed, there are only three symbols in the string, which are shared by two subtrees corresponding to the two conjuncts in the rule for S. This sharing represents multiple structures for the same substring.

The definition by logical inference works equally well for multi-component grammars.

Example 6

The multi-component grammar in Example 3 defines the string \(w=aabbccdd\) by the following logical derivation.
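Written out as a sequence of inferred propositions, the derivation proceeds as follows.

$$\begin{aligned}&A(\varepsilon , \varepsilon ) \quad \text {by the rule } A(\varepsilon , \varepsilon ) \rightarrow , \\ &A(ab, cd) \quad \text {by the rule } A(axb, cyd) \rightarrow A(x, y)\text {, with } x=y=\varepsilon , \\ &A(aabb, ccdd) \quad \text {by the same rule, with } x=ab \text { and } y=cd, \\ &S(aabbccdd) \quad \text {by the rule } S(xy) \rightarrow A(x, y). \end{aligned}$$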

The definitions by logical derivation cannot implement the negation in the rules, such as in Boolean grammars. Negation can be formalized within the more general approach to defining grammars explained in the next section.

Conclusions. Grammar families without negation can be defined by logical derivations. This is a formalization of a most intuitive object, a parse tree.

4 Language Equations

Another approach to defining the language described by a grammar is based upon a variant of informal definitions, such as the one in Example 1, this time written down as “if and only if” conditions.

Example 7

A string \(w \in \{a, b\}^*\) is well-nested if and only if

  • either \(w=\varepsilon \),

  • or \(w=aub\), for some well-nested string u,

  • or \(w=uv\), for some well-nested strings u and v.

This can be written down as the following language equation, with the set of well-nested strings X as the unknown.

$$\begin{aligned} X = \{\varepsilon \} \cup \big (\{a\} \cdot X \cdot \{b\}\big ) \cup \big (X \cdot X\big ) \end{aligned}$$

The Dyck language is among the solutions of this equation, actually the least solution with respect to inclusion.

The representation of grammars by language equations was discovered by Ginsburg and Rice [31]. An ordinary grammar \(G=(\varSigma , N, R, S)\), with \(N=\{X_1, \ldots , X_n\}\) is represented by a system of equations with the following form, where each nonterminal symbol \(X_i\) becomes a variable.

$$\begin{aligned} \left\{ \begin{array}{rcl} X_1 &{}=&{} \varphi _1(X_1, \ldots , X_n) \\ &{}\vdots &{} \\ X_n &{}=&{} \varphi _n(X_1, \ldots , X_n) \end{array}\right. \end{aligned}$$
(*)

For each \(X_i\), the right-hand side of its equation is a union of concatenations, representing the rules for \(X_i\), with each occurrence of a symbol \(a \in \varSigma \) represented by a constant language \(\{a\}\), as shown in the above Example 7.

Language equations corresponding to conjunctive grammars represent the conjunction operator by intersection of languages on the right-hand sides.
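For instance, the conjunctive grammar in Example 2 corresponds to the following system, in which the conjunction in the rule for S becomes an intersection of the two concatenations.

$$\begin{aligned} \left\{ \begin{array}{rcl} S &=& (A \cdot B) \cap (D \cdot C) \\ A &=& \big (\{a\} \cdot A\big ) \cup \{\varepsilon \} \\ B &=& \big (\{b\} \cdot B \cdot \{c\}\big ) \cup \{\varepsilon \} \\ C &=& \big (\{c\} \cdot C\big ) \cup \{\varepsilon \} \\ D &=& \big (\{a\} \cdot D \cdot \{b\}\big ) \cup \{\varepsilon \} \end{array}\right. \end{aligned}$$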

Language equations can be naturally extended to the cases of tree-adjoining grammars and multi-component grammars. These equations will use sets of pairs or k-tuples of strings as unknowns. The operations on the right-hand sides are: the set-theoretic union, and the operations on constituents extended to sets.

Language equations are particularly essential for defining Boolean grammars [58], but there are a few non-trivial details to take care of. First, the negation in the rules is implemented by a complementation operation on sets. However, in this case the equations might have no solutions, such as the equation \(S=\overline{S}\) corresponding to the grammar \(S \rightarrow \lnot S\). One possibility of handling this problem is to impose a certain condition on grammars, under which the grammar \(S \rightarrow \lnot S\) is dismissed as ill-formed [58]. An improved definition of Boolean grammars, given in terms of three-valued logic, was presented by Kountouriotis et al. [45]: under their definition, every grammar defines a three-valued language, with each string having a “well-formed”, “ill-formed” or “undefined” status; in particular, the grammar \(S \rightarrow \lnot S\) defines a language with all strings undefined.

Language equations of the general form, with unrestricted left-hand sides, can define computationally universal sets by their solutions [41, 47, 60, 63], which makes them completely useless for the purpose of defining grammar models.

Conclusions. All grammar families have definitions by language equations; this is a common underlying principle.

5 Expressibility of Operations

Closure properties are among the main indicators of the expressive power of a grammar family: closure under some operation means that this operation can be expressed in those grammars. Some closure results are immediate, because the operation is a part of the formalism of rules: for instance, union and concatenation are expressible both in ordinary grammars and in conjunctive grammars, and in the latter, intersection is expressible as well. There are numerous closure results for various grammar families, and only a brief account of some recurring properties can be given in this paper.

Perhaps the most fundamental closure result for ordinary grammars is their closure under intersection with a regular language, proved by Bar-Hillel et al. [6]: in their construction, an ordinary grammar \(G=(\varSigma , N, R, S)\) and a finite automaton \(\mathcal {A}=(\varSigma , Q, q_0, \delta , F)\) are combined into a new grammar \(G'\) with the nonterminal symbols of the form \(A_{p, q}\), with \(A \in N\) and \(p,q \in Q\), which defines all strings with the property A, on which the finite automaton moves from state p to state q. Parse trees in \(G'\) have the same structure as parse trees in G, with the extra information on the computation of \(\mathcal {A}\).
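In the special case of a grammar in Chomsky normal form (an assumption made here for brevity; the general case is handled analogously), the rules of \(G'\) can be sketched as follows, with a new initial symbol \(S'\) added.

$$\begin{aligned} A_{p,q} &\rightarrow B_{p,r} \, C_{r,q} \quad \text {for each rule } A \rightarrow BC \text { and all } p, r, q \in Q, \\ A_{p,q} &\rightarrow a \quad \text {for each rule } A \rightarrow a \text {, whenever } \mathcal {A} \text { can move from } p \text { to } q \text { by reading } a, \\ S' &\rightarrow S_{q_0, q} \quad \text {for each accepting state } q \in F. \end{aligned}$$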

By a similar method, one can implement a nondeterministic finite transduction (NFT) on an ordinary grammar G, producing a grammar \(G'\) for the set of all strings that can be emitted by the transducer while processing a string defined by G [29]. This general closure result has important special cases, such as homomorphisms, inverse homomorphisms, inverse deterministic finite transductions (DFT), the set of prefixes, etc.

All the above results apply to linear grammars and to multi-component grammars. For unambiguous grammars, the construction for intersection with a regular language still applies, but the construction simulating an NFT is no longer applicable. In fact, this family is not closed even under homomorphisms. However, the special case of the latter construction involving an inverse DFT applies to unambiguous grammars as well.

For conjunctive and Boolean grammars, similarly, there is a non-closure under homomorphisms [57], but the construction for the closure under inverse DFT can still be generalized [51].

6 Normal Forms

Numerous normal form theorems exist for ordinary grammars, and some of them have been extended to other grammar families.

The most well-known normal form is the Chomsky normal form for ordinary grammars, which requires all rules to be of the form \(A \rightarrow BC\), with \(B, C \in N\), or \(A \rightarrow a\), with \(a \in \varSigma \). The known transformation to the normal form proceeds by first ensuring that the right-hand sides are of length at most two (this incurs a linear increase in the size of the grammar), then eliminating null rules of the form \(A \rightarrow \varepsilon \) (linear increase), and finally eliminating chain rules of the form \(A \rightarrow B\) (quadratic increase). A lower bound of the order \(n^{\frac{3}{2}-o(1)}\) has been established by Blum [11].
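For a small illustration, consider the grammar \(S \rightarrow aSb \ | \ ab\), describing \(\{ \, a^n b^n \mid n \geqslant 1 \, \}\); one of its possible Chomsky normal forms, with the new nonterminal symbols chosen here for illustration, is as follows.

$$\begin{aligned} S \rightarrow AX \ | \ AB, \qquad X \rightarrow SB, \qquad A \rightarrow a, \qquad B \rightarrow b \end{aligned}$$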

The Chomsky normal form exists for several subclasses of ordinary grammars: namely, for unambiguous grammars, for LL grammars and for LR grammars.

For conjunctive grammars, there is a direct generalization, called the binary normal form [57]. A conjunctive grammar in the binary normal form has all rules of the form \( A \rightarrow B_1 C_1 \mathop { \& }\ldots \mathop { \& }B_m C_m\), with \(m \geqslant 1\) and \(B_i,C_i \in N\), or of the form \(A \rightarrow a\), with \(a \in \varSigma \). The transformation follows the same plan as for ordinary grammars, consecutively eliminating null conjuncts in rules of the form \( A \rightarrow \varepsilon \mathop { \& }\ldots \), and unit conjuncts in rules of the form \( A \rightarrow B \mathop { \& }\ldots \). However, the elimination of unit conjuncts incurs an exponential blow-up. No lower bound on the complexity of this transformation is known.

In the Greibach normal form [32] for ordinary grammars, every rule is of the form \(A \rightarrow a \alpha \), with \(a \in \varSigma \) and \(\alpha \in N^*\). The best known transformation to the Greibach normal form was developed by Rosenkrantz [73] and by Urbanek [85]: it transforms a grammar in the Chomsky normal form to a grammar in the Greibach normal form with a cubic blow-up. An \(O(n^2)\) lower bound on the complexity of the latter transformation was proved by Kelemenová [44].
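For illustration, the language \(\{ \, a^n b^n \mid n \geqslant 1 \, \}\) from the example above also has a grammar in Greibach normal form, which in this simple case can be given directly rather than by the general transformation.

$$\begin{aligned} S \rightarrow aSB \ | \ aB, \qquad B \rightarrow b \end{aligned}$$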

The definition of the Greibach normal form can be naturally extended to conjunctive grammars, which would have all rules of the form \( A \rightarrow a \alpha _1 \mathop { \& }\ldots \mathop { \& }a \alpha _m\), with \(a \in \varSigma \), \(m \geqslant 1\) and \(\alpha _1, \ldots , \alpha _m \in N^*\). However, it is not known whether every conjunctive grammar can be transformed to this form [65, Problem 5].

A stronger version of the Greibach normal form for ordinary grammars was defined by Rosenkrantz [73]. In the Rosenkrantz normal form (also known as double Greibach normal form), every rule is of the form \(A \rightarrow a \alpha d\), with \(a, d \in \varSigma \) and \(\alpha \in (\varSigma \cup N)^*\), or \(A \rightarrow a\). The transformation from the Chomsky normal form to the Rosenkrantz normal form, as stated by Engelfriet [24], produces a grammar of size \(O(n^{10})\); no lower bounds are known.

In the operator normal form for ordinary grammars, defined by Floyd [27], each rule is of the form \(A \rightarrow u_0 B_1 u_1 B_2 u_2 \ldots u_{k-1} B_k u_k\), with \(k \geqslant 0\), \(u_0, u_k \in \varSigma ^*\), \(B_1, \ldots , B_k \in N\) and \(u_1, \ldots , u_{k-1} \in \varSigma ^+\). This normal form extends to conjunctive grammars [68]: every grammar can be transformed to one with all rules of the form \( A \rightarrow B_1 a_1 C_1 \mathop { \& }\ldots \mathop { \& }B_m a_m C_m\), with \(m \geqslant 1\), \(B_i,C_i \in N\) and \(a_i \in \varSigma \), or \(A \rightarrow a\), or \(S \rightarrow a A\), as long as S never appears on the right-hand sides of any rules.

There is a generalized normal form theorem by Blattner and Ginsburg [10], in which all rules are either of the form \(A \rightarrow w\), with \(w \in \varSigma ^*\), or of the form \(A \rightarrow u_0 B_1 u_1 \ldots B_\ell u_\ell \), where \(\ell \) and the lengths of the strings \(u_i\) may be fixed almost arbitrarily.

Conclusions. Ordinary grammars can be normalized in many different ways. Some normal form theorems are extended to conjunctive grammars; other normal forms could be generalized as well, but it is open whether every grammar can be transformed to those forms. For multi-component grammars, such normal forms would be hard to formulate: apparently, one has to deal with grammars of more or less the general form.

7 Parsing

Most parsing algorithms applicable to grammars of the general form are based upon the dynamic programming method. The most well-known is the Cocke–Kasami–Younger algorithm, which, for an ordinary grammar \(G=(\varSigma , N, R, S)\) in the Chomsky normal form, given a string \(w=a_1 \ldots a_n\), constructs, for each substring \(a_{i+1} \ldots a_j\), the sets \(T_{i,j}=\{ \, A \mid a_{i+1} \ldots a_j \in L_G(A) \, \}\). Its running time is \(\Theta (n^3)\) and it uses \(\Theta (n^2)\) space. The algorithm applies to conjunctive grammars [57] and to Boolean grammars [58] without any changes.
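To make the procedure concrete, here is a sketch of the algorithm in Python for an ordinary grammar in Chomsky normal form; the representation of the grammar by two rule sets and the function name are illustrative choices rather than any standard interface.

```python
# A sketch of the Cocke--Kasami--Younger algorithm for an ordinary grammar
# in Chomsky normal form (illustrative code, not a library interface).

def cky(w, terminal_rules, binary_rules, start):
    """terminal_rules: set of pairs (A, a) for rules A -> a;
       binary_rules: set of triples (A, B, C) for rules A -> BC;
       returns True if the string w has the property `start`."""
    n = len(w)
    if n == 0:
        return False  # a grammar in Chomsky normal form cannot define the empty string
    # T[i][j] is the set of nonterminal symbols defining the substring w[i:j]
    T = [[set() for _ in range(n + 1)] for _ in range(n + 1)]
    for i in range(n):
        T[i][i + 1] = {A for (A, a) in terminal_rules if a == w[i]}
    for length in range(2, n + 1):          # length of the substring
        for i in range(n - length + 1):
            j = i + length
            for k in range(i + 1, j):        # partition point
                for (A, B, C) in binary_rules:
                    if B in T[i][k] and C in T[k][j]:
                        T[i][j].add(A)
    return start in T[0][n]

# Example: S -> AX | AB, X -> SB, A -> a, B -> b defines { a^n b^n | n >= 1 }
terminal_rules = {("A", "a"), ("B", "b")}
binary_rules = {("S", "A", "X"), ("S", "A", "B"), ("X", "S", "B")}
print(cky("aaabbb", terminal_rules, binary_rules, "S"))   # True
print(cky("aabbb", terminal_rules, binary_rules, "S"))    # False
```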

Valiant [86] proved that the same data structure can be constructed in time \(O(n^\omega )\), where \(O(n^\omega )\) is the number of operations needed to multiply two \(n \times n\) matrices. Valiant’s algorithm was originally presented in a generalized algebraic form, which complicates the creation of any derivative algorithms. However, it can be restated more simply and in elementary terms, and then it directly applies to conjunctive and Boolean grammars [66], with the same time complexity \(O(n^\omega )\).

The ideas of the Cocke–Kasami–Younger algorithm are directly extended to multi-component grammars [77, Sect. 3.2], obtaining an algorithm working in time \(O(n^k)\), where k depends on the number of components and on the complexity of rules. In particular, for tree-adjoining grammars, the basic algorithm works in time \(O(n^6)\). Both algorithms can be accelerated using fast matrix multiplication [54, 72].

A variant of the Cocke–Kasami–Younger algorithm, the Kasami–Torii algorithm, constructs a different data structure encoding the same sets \(T_{i,j}\): for each position j and for each nonterminal symbol A, this is the sorted list of all positions i with \(A \in T_{i,j}\). The resulting algorithm works in time \(O(n^3)\) in the worst case, and in time \(O(n^2)\) for unambiguous grammars, as well as for unambiguous conjunctive and unambiguous Boolean grammars [62]. Using the same idea, parsing for tree-adjoining grammars can be accelerated to \(O(n^4)\).

There are well-known subclasses of ordinary grammars that have linear-time parsing algorithms: the LL grammars and the LR grammars. The LL grammars have a generalization for Boolean grammars [61], the LR grammars have an extension for conjunctive grammars [2].

Fig. 2. Hierarchy of grammar families: parsing time.

Conclusions. Each grammar family has a simple dynamic programming parsing algorithm that works in polynomial time, with the degree of the polynomial depending on the grammar family. The presence of Boolean operations does not affect the complexity, whereas for multi-component grammars, the degree of the polynomial is determined by the structure of rules. For all known grammar families, the algorithm can be accelerated by using fast matrix multiplication, and with the unambiguity assumption, the algorithm can be accelerated even further. Linear-time algorithms for subclasses of ordinary grammars are well-developed, but their extensions to more powerful grammar families need further study. The running time for different families is compared in Fig. 2.

8 Representation in the FO(LFP) Logic

In 1988, Rounds [74] identified a previously unknown foundation for formal grammars: the FO(LFP) logic. The fundamental theoretical property of this logic, discovered by Immerman [37] and by Vardi [87], is that it can describe exactly all problems decidable in polynomial time: the complexity class P. As it turned out, not only can all grammar families be described within this logic, they can be described exactly according to their definition! Then, each grammar family becomes a clearly defined special case of the FO(LFP) logic.

Definition 6

Let \(\varSigma \) be an alphabet, let N be a finite set of predicate symbols, with each \(A \in N\) having a finite number of arguments, denoted by \(\mathop {\mathrm {dim}}A\).

The logic uses first-order variables referring to positions in the string. Positions are given by terms, defined as follows.

  • The first position, the last position and any variables are terms.

  • If t is a term, then so are \(t+1\) and \(t-1\).

Next, a formula is defined as follows.

  • If \(A \in N\) is a predicate symbol with \(\mathop {\mathrm {dim}}A=k\) and \(t_1, \ldots , t_k\) are terms, then \(A(t_1, \ldots , t_k)\) is a formula;

  • If \(a \in \varSigma \) is a symbol and t is a term, then a(t) is a formula;

  • If t and \(t'\) are terms, then \(t<t'\) and \(t=t'\) are formulae;

  • If \(\varphi \) and \(\psi \) are formulae, then so are \(\varphi \vee \psi \) and \(\varphi \wedge \psi \);

  • If \(\varphi \) is a formula and x is a free variable in \(\varphi \), then \((\exists x) \varphi \) and \((\forall x) \varphi \) are formulae as well.

An FO(LFP)-definition is a quintuple \(G=(\varSigma , N, \mathop {\mathrm {dim}}, \langle \varphi _A\rangle _{A \in N}, \sigma )\), where each predicate \(A \in N\) is defined by a formula \(\varphi _A\) with \(\mathop {\mathrm {dim}}A\) free variables, and \(\sigma \) is a formula with no free variables that defines the condition of being a syntactically well-formed sentence.

Similarly to a grammar, an FO(LFP)-definition describes a language of well-formed strings. For each string \(w \in \varSigma ^*\), there is a least assignment of sets of \((\mathop {\mathrm {dim}}A)\)-tuples of positions in w to each predicate A, which satisfies the system of equations \(A=\varphi _A\), for all A. This is essentially a generalization of language equations of Ginsburg and Rice [31]. Then, if \(\sigma \) is true under this assignment, the string w is considered well-formed.

Example 8

The grammar in Example 1 is transcribed as the FO(LFP)-definition \(G=(\varSigma , \{S\}, \mathop {\mathrm {dim}}, \langle \varphi _S\rangle , \sigma )\), with \(\mathop {\mathrm {dim}}S=2\) and with S defined by the following formula.

$$\begin{aligned} S(x, y) = \underbrace{\big [(\exists z) (S(x, z) \wedge S(z, y))\big ] \vee (a(x+1) \wedge S(x+1, y-1) \wedge b(y)) \vee x=y}_{\varphi _S} \end{aligned}$$

The condition of being a well-formed sentence is \(\sigma =S(\text {first}, \text {last})\).

All other grammar families can be similarly expressed in the FO(LFP) logic: conjunctive grammars are represented using the conjunction, multi-component grammars require predicates with more than two arguments.

The following decision procedure for the FO(LFP) logic can be regarded as the mother of all parsing algorithms.

Theorem 1

Let \(G=(\varSigma , N, \mathop {\mathrm {dim}}, \langle \varphi _A\rangle _{A \in N}, \sigma )\) be an FO(LFP)-definition, let k be the largest dimension of a predicate, let m be the largest number of nested quantifiers in a definition of a predicate. Then there exists an algorithm, which, given an input string \(w \in \varSigma ^*\) of length n, determines whether w is in L(G), and does so in time \(O(n^{2k+m})\), using space \(O(n^k)\).

The algorithm calculates the least model by gradually proving all true elementary propositions. There are \(O(n^k)\) propositions in total, and at each step, at least one new proposition is proved, which bounds the number of steps by \(O(n^k)\). At each step, the algorithm cannot know which propositions it is already able to prove, so it tries proving each of the \(O(n^k)\) propositions. Each nested quantifier requires considering n possibilities for the bound variable, and thus an attempted proof of each proposition requires \(O(n^m)\) steps.
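The following Python sketch runs this naive fixpoint computation for the FO(LFP)-definition of Example 8; the encoding chosen here is an assumption made for the illustration: positions are the \(n+1\) boundaries between symbols, numbered 0 to n, with first\(\,=0\) and last\(\,=n\), a proposition S(x, y) refers to the substring between boundaries x and y, and a(t) means that the t-th symbol (counting from 1) is a.

```python
# Naive least-fixpoint evaluation of the FO(LFP)-definition from Example 8
# (the Dyck language); illustrative code under the position convention above.

def dyck_by_least_fixpoint(w):
    n = len(w)
    S = set()                     # current approximation of the predicate S
    changed = True
    while changed:                # repeat until no new propositions can be proved
        changed = False
        for x in range(n + 1):
            for y in range(n + 1):
                if (x, y) in S:
                    continue
                holds = (
                    x == y
                    or any((x, z) in S and (z, y) in S for z in range(n + 1))
                    or (x + 1 <= n and y >= 1
                        and w[x] == 'a' and w[y - 1] == 'b'
                        and (x + 1, y - 1) in S)
                )
                if holds:
                    S.add((x, y))
                    changed = True
    return (0, n) in S            # sigma = S(first, last)

print(dyck_by_least_fixpoint("abaabb"))   # True
print(dyck_by_least_fixpoint("aba"))      # False
```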

Conclusions. All grammar families are representable in FO(LFP) and have a polynomial-time parsing algorithm provided by Theorem 1. The degree of the polynomial is usually not as good as that provided by the specialized algorithms given in Sect. 7.

9 Equivalent Models

The classical representation of ordinary grammars by nondeterministic pushdown automata [18] gives rise to several related results. First, the LR grammars are similarly characterized by deterministic pushdown automata [28], and the linear grammars are characterized by one-turn pushdown automata.

A particularly important special case of pushdown automata are the input-driven pushdown automata, also known as visibly pushdown automata. This model was known already in 1980: von Braunmühl and Verbeek [14] proved that its deterministic and nondeterministic variants are equal in power. Later, Alur and Madhusudan [3] reintroduced the model under the names “visibly pushdown automata” and “nested word automata”, carried out a systematic study of its properties, and inspired further work on the closure properties of input-driven automata and on their descriptional complexity [69].

A generalization of pushdown automata characterizing conjunctive grammars was defined by Aizikowitz and Kaminski [1]. Their model, the synchronized alternating pushdown automata, are pushdown automata with a tree-structured stack, with bottom of the stack as the root of the tree and with the top of the stack formed by all the leaves.

Linear conjunctive grammars have an automaton representation of an entirely different kind [59]. They are characterized by the simplest kind of one-dimensional cellular automata: the one-way real-time cellular automata, also known as trellis automata, studied, in particular, by Ibarra and Kim [36] and by Terrier [82,83,84]. These automata work in real time, making \(n-1\) parallel steps on an input of length n, and the next value of each cell is determined only by its own value and the value of its right neighbour.

An important alternative representation of ordinary grammars as categorial grammars was established by Bar-Hillel et al. [5]. An extension of categorial grammars, the combinatory categorial grammars, similarly characterizes tree-adjoining grammars [88]. A different extension augmented with conjunction was introduced by Kuznetsov [48], and its equivalence to conjunctive grammars was proved by Kuznetsov and Okhotin [49].

Conclusions. Representations by pushdown automata and by categorial grammars are among the recurring ideas of formal grammars. A representation by cellular automata has so far been found only for linear conjunctive grammars.

10 Homomorphic Characterizations

The Chomsky–Schützenberger theorem [18] was one of the first theoretical results on grammars. In its original form, it asserts that every language \(L \subseteq \varSigma ^*\) described by an ordinary grammar is a homomorphic image of an intersection of a Dyck language \(D_k\) on k pairs of brackets with a regular language M.

$$\begin{aligned} L=h(D_k \cap M) \end{aligned}$$

The classical proofs of this result rely on using erasing homomorphisms.

There are several stronger forms of this theorem that use non-erasing homomorphisms. The simplest of them assumes that all strings in L are of even length; then it is sufficient to use only symbol-to-symbol homomorphisms.

Theorem 2

(Chomsky and Schützenberger [18]; Okhotin [64]; Crespi-Reghizzi and San Pietro [22]). For each alphabet \(\varSigma \), there exists a number \(k \geqslant 1\) such that a language \(L \subseteq (\varSigma ^2)^*\) is described by an ordinary grammar if and only if there exist a regular language M over an alphabet of k pairs of brackets and a symbol-to-symbol homomorphism h mapping each bracket to a symbol of \(\varSigma \), such that \(L = h(D_{k} \cap M)\).

In plain words, the theorem asserts that if a language is defined by an ordinary grammar, then it is obtained from a nested bracketed structure checked by a finite automaton by renaming the brackets to symbols in \(\varSigma \).

Yoshinaka et al. [91] extended the Chomsky–Schützenberger theorem (in its erasing form) to multi-component grammars, using a suitable generalization of the Dyck language. Salomaa and Soittola [76] and Droste and Vogler [23] established a variant for weighted grammars.

The theorem cannot be extended to conjunctive grammars, regardless of which language would be used instead of the Dyck language, for the reason that every recursively enumerable set is representable as a homomorphic image of a language described by a conjunctive grammar.

11 Hardest Languages

A famous theorem by Greibach [33] states that there exists a fixed language \(L_0\) described by an ordinary grammar \(G_0\), with the property that every language L over any alphabet \(\varSigma \) that is described by an ordinary grammar G is reducible to \(L_0\) by a homomorphic reduction. In other words, L is representable as an inverse homomorphic image \(h^{-1}(L_0)\), for some homomorphism \(h :\varSigma \rightarrow \varSigma _0^*\). In the proof, the image h(a) of each symbol \(a \in \varSigma \) encodes basically the entire grammar G, and for each string \(w \in \varSigma ^*\), its image h(w) is defined by the “universal” grammar \(G_0\) if and only if \(w \in L\).

This theorem is similar in spirit to the results on the existence of complete sets in several complexity classes, such as NP-complete sets. In Greibach’s theorem, a homomorphism is a reduction function, cf. polynomial-time reductions in the definitions of NP-complete problems.

For conjunctive grammars and for Boolean grammars, there are hardest language theorems with exactly the same statement [67]. Likely, Greibach’s theorem could also hold for certain classes of multi-component grammars, such as for multi-component grammars of maximum dimension k, for each k.

Turning to special cases of ordinary grammars, Greibach [34] demonstrated that the LR grammars cannot have a hardest language under homomorphic reductions, and Boasson and Nivat [12] proved the same result for linear grammars. One could expect that there is no hardest language for the unambiguous grammars; however, as Boasson and Nivat [12] rightfully remarked, proving that “seems to be a hard problem”. Whether Greibach’s theorem holds for linear conjunctive grammars is another open problem.

Conclusions. Grammar families without any special restrictions, such as determinism, unambiguity or linearity of concatenation, tend to have hardest languages. For various combinations of special restrictions, the existence of hardest languages has either been disproved or remains open.

12 Limitations of Grammars

There are several known methods for proving that a language is not described by any grammar from a certain class.

For ordinary grammars, there is the classical pumping lemma of Bar-Hillel et al. [6] that exploits the possibility of inserting repetitive structure into any sufficiently large parse tree. The same idea yields stronger versions of the pumping lemma: Ogden’s lemma [55] featuring distinguished positions and the Bader–Moura lemma [4] that further allows some positions to be excluded from pumping. There are two special results based on exchanging subtrees between parse trees of different strings: Sokołowski’s lemma [79] and the interchange lemma by Ogden et al. [56].
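For reference, the classical statement can be given as follows: for every language L described by an ordinary grammar there exists a constant \(p \geqslant 1\), such that every string \(w \in L\) with \(|w| \geqslant p\) admits the following factorization.

$$\begin{aligned} w = x u y v z, \quad \text {with } |uyv| \leqslant p \text { and } uv \ne \varepsilon , \quad \text {such that } x u^i y v^i z \in L \text { for all } i \geqslant 0 \end{aligned}$$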

In general, the idea behind the pumping lemma equally applies to multi-component grammars: one can also insert repetitive structure into their parse trees. However, the inserted fragments may scatter between the components, and, as demonstrated by Kanazawa et al. [43], a direct analogue of the ordinary pumping lemma does not hold for multi-component grammars of dimension 3 or more; it is only known that some strings can be pumped, but not necessarily all of them. A standard pumping lemma exists for well-nested multi-component grammars [42].

Parikh’s theorem states that if a language is described by an ordinary grammar, then it can be transformed into a regular language by changing the order of symbols in its strings. Numerous proofs of this theorem are known, including the recent constructive proof by Esparza et al. [25]. Parikh’s theorem directly applies to multi-component grammars, and does not apply to conjunctive grammars, which can, for instance, describe some non-regular unary languages [39, 40].

Negative results for unambiguous grammars can be proved using the pumping lemma and its variants [55], or using the methods of analytic combinatorics [26]: if the generating function for a language is transcendental, it cannot be described by an unambiguous grammar. These methods extend to unambiguous multi-component grammars, but are not applicable to unambiguous conjunctive grammars.

For linear conjunctive grammars, several methods for proving non-representability of languages have been developed using their cellular automaton representation [16, 82, 84]. There are no known methods for proving that there is no conjunctive grammar for a certain language [65], and this remains a substantial gap in the knowledge on that family.

13 Complexity

In the late 1950s and early 1960s, when nothing was yet known about computational complexity, the Chomsky hierarchy by itself served as a prehistoric hierarchy of complexity classes, consisting of DSPACE(const), grammars, NSPACE(n) and the recursively enumerable sets. It answered the natural question on the complexity of ordinary grammars by placing them between DSPACE(const) and NSPACE(n), and put them in the context of the emerging theoretical computer science. Complexity theory has advanced since that time, and it is important to relate formal grammars to the modern complexity classes, as well as to compare different grammar families according to their complexity. The resulting picture is presented in Fig. 3.

First of all, different grammar families are special cases of the complexity class P due to their representation in the FO(LFP) logic; and linear conjunctive grammars can describe some P-complete languages [36, 65]. At the lower end of the hierarchy, there are families contained in the logarithmic space (L), the largest of them are the LR(1) linear grammars (LRLin); also, there exists an LR(1) linear grammar that describes an L-complete language [35]. The whole family of linear grammars (Lin) is similarly contained in NL and can describe an NL-complete language [80].

Fig. 3. Hierarchy of grammar families: complexity.

Turning to the complexity of ordinary grammars, in 1979, Cook [21] wrote: “I see no way of showing \(\mathrm {DCFL} \subseteq \mathrm {NC}\)”. Yet, a few years later, Brent and Goldschlager [15] and Rytter [75] described a circuit of depth \((\log n)^2\) with \(O(n^6)\) gates that recognizes the membership of a string in the language in time \(O((\log n)^2)\). There was a much earlier algorithm based on the same idea: the recognition procedure for ordinary grammars by Lewis, Stearns and Hartmanis [52], which uses only \(O((\log n)^2)\) bits of memory, at the expense of super-polynomial running time. The underlying idea of these algorithms is to use an augmented logical system, in which the height of proof trees is logarithmic in the length of the string. The small-space algorithm and the fast parallel algorithm both work by finding a shallow proof in the augmented system.

14 Towards Further Models

Some good grammar families may still remain undiscovered, and it would be interesting to identify such models.

In the early days of formal language theory, when, following Chomsky [17], grammars were regarded as rewriting systems, any new kinds of grammars were defined by modifying the rewriting rules. These attempts usually resulted in models that do not correspond to any intuitive syntactic descriptions. It is unlikely that any useful model can be defined in this way.

Judging by today’s knowledge, since other grammar families are naturally expressed in FO(LFP), there is every reason to expect some further well-chosen fragments of FO(LFP) to give rise to interesting models. Likely, an interesting model would express some condition that could be naturally used in informal definitions of syntax, and at the same time would maintain some of the recurring ideas of formal grammars. For instance, grammars with context operators [7, 8] were defined by taking the old idea of a rule applicable in a context and expressing it in FO(LFP).

Instead of using FO(LFP) as the base model, one could also consider the related logic theories studied in the field of descriptive complexity [38]; perhaps some of them could be useful as a source of inspiration in the search for new grammar families.