
1 Introduction

It is well known that regular languages are closed w.r.t. all basic operations and are characterized also in terms of classic monadic second-order (MSO) logic [8, 17, 36], whereas the same properties do not hold for general context-free (CF) languages, with the exception of structured CF languages. By this term we mean those families of languages whose typical tree structure is immediately visible in their sentences: two early equivalent examples are parenthesis languages [27] and tree languages [34]. More recently, input-driven languages (IDLs) [7], later renamed visibly pushdown languages (VPLs) [2], and height-deterministic languages [29] have also been shown to share many important properties of regular languages. Tree languages and VPLs, in particular, are closed w.r.t. Boolean operations, concatenation, and Kleene *, and are characterized in terms of some MSO logic, although such operations and the adopted logic language are rather different in the two cases. For a more complete analysis of structured languages and how they extend various properties of regular languages, see [26].

In this paper we are interested in an important subfamily of regular languages and its extension to various types of (structured) CF languages, namely noncounting (NC) or aperiodic languages. Intuitively, aperiodicity is a property of a recognizing device which prevents separating strings that differ only in the number of repetitions of some substring, e.g., odd versus even. It is well known [28] that NC regular languages coincide with those expressible by means of star-free regular expressions or by the first-order (FO) fragment of MSO. FO definability has a strong impact on the success of model checking, thanks to the first-order completeness of linear temporal logic [32].

Various attempts have been made to extend the notion of aperiodicity beyond regular languages, specifically to some kinds of structured CF languages. In fact, linguists as well as language designers have repeatedly observed that modulo-counting features are neither present nor needed in natural or technical languages.

NC parenthesis languages were first introduced in [10]; an equivalent definition thereof was later given in [35]. The latter paper, however, showed that the equivalences holding for regular languages do not extend to tree languages: e.g., there are counting star-free tree languages. Further investigations (e.g., [18, 21, 22, 31]) obtained partial results on special subclasses of the involved families, but the complete set of equivalences was lost.

In this paper we pursue a different approach to achieve the goal of restating the above equivalences in the larger context of structured CF languages. In essence, we go back to string languages, as opposed to the approaches based on tree languages, but we choose a family of languages where the tree structure is somewhat implicit in the string, namely operator precedence languages (OPLs). OPLs were invented by Floyd to support efficient deterministic parsing [19]. We classify them as “structured but semivisible” languages, since their structure is implicitly assigned by precedence relations between terminal characters, which were inspired by the precedence rules between arithmetic operators: as an early intuition for readers who are not familiar with OPLs, the expression \(a+b \cdot c\) “hides” the parenthetic structure \(a+(b \cdot c)\), which is implied by the fact that multiplicative operations are applied before additive ones. Subsequent investigations characterized OPLs as the largest known family of structured CFLs that is closed under all fundamental language operations and can be defined through a natural extension of the classic MSO logic [26].

Our new results on the relations between aperiodicity, star-freeness, and FO-definability of OPLs are the following. We define operator precedence expressions (OPE) as a simple extension of regular expressions and show that they define exactly the OPLs. We prove closure properties of NC OPLs and derive therefrom that the languages defined by star-free OPEs are NC. We show that star-free OPEs and FO formulas define exactly the same subfamily of OPLs. We conjecture that all NC OPLs are FO-definable.

2 Background

For brevity, we just list our notations for the basic concepts we use from formal language and automata theory. The terminal alphabet is denoted by \(\varSigma \), and the empty string is \(\varepsilon \). We also take for granted the traditional logic and set theoretic abbreviations. For a string x, |x| denotes its length. The character \(\#\), not present in the terminal alphabet, is used as string delimiter, and we define the alphabet \(\varSigma _\# = \varSigma \cup \{\# \}\).

Definition 1

(Regular expression and language). A regular expression (RE) over an alphabet \(\varSigma \) is a well-formed formula made with the characters of \(\varSigma \), \(\emptyset \), \(\varepsilon \), the Boolean operators \(\cup , \lnot , \cap \), the concatenation \(\cdot \), the Kleene star \(^*\) and plus \(^+\) operators. When neither \(^*\) nor \(^+\) are used, the RE is called star-free (SF). An RE E defines a language over \(\varSigma \), denoted by L(E). REs define the language family of regular languages.

A regular language L over \(\varSigma \) is called noncounting or aperiodic if there exists an integer \(n\ge 1\) such that for all \(x,y,z \in \varSigma ^*\), \(x y^n z\in L\) iff \(x y^{n+m} z\in L\), \(\forall m \ge 0\).
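Aperiodicity of a regular language can be tested mechanically: a language is NC iff every element t of the transformation monoid of its minimal DFA satisfies \(t^k = t^{k+1}\) for some k, i.e., the powers of t stabilize without entering a nontrivial cycle. A minimal Python sketch of this test (the two example DFAs below are illustrative, not taken from the text):

```python
def transformations(states, delta, alphabet):
    """All state-to-state maps induced by input strings (the
    transformation monoid of the DFA), as tuples indexed by state."""
    gens = {a: tuple(delta[q][a] for q in states) for a in alphabet}
    monoid = {tuple(states)}              # identity map (empty string)
    frontier = set(gens.values())
    while frontier:
        monoid |= frontier
        frontier = {tuple(g[t[q]] for q in states)
                    for t in frontier for g in gens.values()} - monoid
    return monoid

def is_aperiodic(states, delta, alphabet):
    """The DFA's language is NC iff every monoid element t eventually
    satisfies t^k = t^(k+1): no nontrivial cycle among its powers."""
    for t in transformations(states, delta, alphabet):
        seen, cur = [], t
        while cur not in seen:
            seen.append(cur)
            cur = tuple(t[q] for q in cur)   # next power of t
        if cur != seen[-1]:                  # re-entered a cycle of length > 1
            return False
    return True
```

For instance, the two-state DFA for “even number of a’s” is rejected (it counts modulo 2), while the DFA for \(a^*b^*\), a star-free language, passes the test.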

Proposition 2

The family of aperiodic regular languages coincides with the family of languages defined by star-free REs.

Definition 3

(Grammar and language). A (CF) grammar is a tuple \(G=(\varSigma , V_N, P, S)\) where \(\varSigma \) and \(V_N\), with \( \varSigma \cap V_N = \emptyset \), are resp. the terminal and the nonterminal alphabets, the total alphabet is \(V = \varSigma \cup V_N\), \(P\subseteq V_N \times V^*\) is the rule (or production) set, and \(S \subseteq V_N\), \(S \ne \emptyset \), is the axiom set. For a generic rule \(B \rightarrow \alpha \), where B and \(\alpha \) are resp. called the left/right hand sides (lhs/rhs), the following forms are relevant:

  • axiomatic: \(B \in S\),    terminal: \(\alpha \in \varSigma ^+\),    empty: \(\alpha =\varepsilon \),

  • renaming: \(\alpha \in V_N\),

  • operator: \(\alpha \not \in V^* V_N V_N V^*\),

  • parenthesized: \(\alpha = \llbracket \beta \rrbracket \), with \(\llbracket , \rrbracket \not \in V\) new terminals.

G is called backward deterministic (or BD-grammar) if \((B \rightarrow \alpha , C \rightarrow \alpha \in P)\) implies \(B=C\).

If all rules of G are in operator form, G is called an operator grammar or O-grammar.

G is a parenthesis grammar (Par-grammar) if the rhs of every rule is parenthesized. \(\tilde{G}\) is called the parenthesized version of G if \(\tilde{P}\) consists of all rules \(B \rightarrow \llbracket \beta \rrbracket \) such that \(B \rightarrow \beta \) is in P. For brevity we take for granted the usual definition of derivation; the language defined by a grammar starting from a nonterminal A is \(L_G(A) = \{x \in \varSigma ^* \mid A {\mathop {\Longrightarrow }\limits ^{*}} x\}\). The subscript G will be omitted whenever clear from the context. The string x is derivable from A; we call x a sentence if \(A \in S\). The union of \(L_G(A)\) for all \(A\in S\) is the language L(G) defined by G. Two grammars defining the same language are equivalent. Two grammars whose parenthesized versions are equivalent are structurally equivalent. The language generated by a Par-grammar is called a parenthesis language, and its sentences are well-parenthesized strings.

From now on we only consider, w.l.o.g. [5, 20], BD O-grammars (unless otherwise stated) without renaming rules and without empty rules, except, if the empty string is in the language, the axiomatic rule \(B \rightarrow \varepsilon \), where B does not appear in the rhs of any rule.

Definition 4

(Backward deterministic reduced grammar [27, 33]). A context over an alphabet \(\varSigma \) is a string in \(\varSigma ^*\{-\}\varSigma ^*\), where the character ‘−’ \(\not \in \varSigma \) and is called a blank. We denote by \(\alpha [x]\) the context \(\alpha \) with its blank replaced by the string x. Two nonterminals B and C of a grammar G are termed equivalent if, for every context \(\alpha \), \(\alpha [B]\) is derivable exactly in case \(\alpha [C]\) is derivable (not necessarily from the same axiom). A nonterminal B is useless if there is no context \(\alpha \) such that \(\alpha [B]\) is derivable or B generates no terminal string. A terminal b is useless if it does not appear in any sentence of L(G). A grammar is clean if it has no useless nonterminals and terminals. A grammar is reduced if it is clean and no two nonterminals are equivalent. A BDR-grammar is both backward deterministic and reduced.

From [27], every parenthesis language is generated by a unique, up to an isomorphism of its nonterminal alphabet, Par-grammar that is BDR.

2.1 Operator-Precedence Languages

Intuitively, operator precedence grammars (OPG) are based on three precedence relations, called equal, yield and take, included in \(\varSigma _\# \times \varSigma _\#\). For a given O-grammar, a character a is equal in precedence to b iff some rhs contains as substring ab or a string aBb, where B is a nonterminal; in fact, when evaluating the relations between terminal characters for OPG, nonterminals are “transparent”. A character a yields precedence to b iff a can occur immediately to the left of a syntax subtree whose leftmost terminal character is b. Symmetrically, a takes precedence over b iff a can occur as the rightmost terminal character of a subtree and b is the immediately following terminal character.

Definition 5

[19]. Let \(G=(\varSigma , V_N, P, S)\) be an O-grammar. Let a, b denote elements of \(\varSigma \), A, B elements of \(V_N\), C an element of \(V_N\) or \(\varepsilon \), and let \(\alpha , \beta \) range over \((V_N \cup \varSigma )^*\). The left and right terminal sets of nonterminals are respectively:

$$ \mathcal {L}_G(A) = \{a \in \varSigma \mid A {\mathop {\Longrightarrow }\limits ^{*}} C a \alpha \}, \qquad \mathcal {R}_G(A) = \{a \in \varSigma \mid A {\mathop {\Longrightarrow }\limits ^{*}} \alpha a C \}. $$

(The grammar name will be omitted unless necessary to prevent confusion.)

The operator precedence relations (OPRs) are defined over \(\varSigma _\# \times \varSigma _\# \) as follows:

  • equal in precedence: \( a\doteq b \iff \exists A\rightarrow \alpha aCb\beta \in P\); moreover, \(\# \,\doteq \, \#\)

  • takes precedence: \( a\gtrdot b \iff \exists A\rightarrow \alpha B b\beta \in P, a\in \mathcal {R}(B); a\gtrdot \# \iff a\in \mathcal {R}(B), B \in S\)

  • yields precedence: \( a\lessdot b \iff \exists A\rightarrow \alpha aB\beta \in P, b\in \mathcal {L}(B); \# \lessdot b \iff b\in \mathcal {L}(B), B \in S. \)

The OPRs can be collected into a \(|\varSigma _\#| \times |\varSigma _\#|\) array, called the operator precedence matrix of the grammar, \({{\,\mathrm{OPM}\,}}(G)\): for each (ordered) pair \((a,b)\in \varSigma _\# \times \varSigma _\#\), \({{\,\mathrm{OPM}\,}}_{a,b}(G)\) contains the OPRs holding between a and b.

Consider a square matrix \(M=\{M_{a,b}\subseteq \) \(\{\doteq , \lessdot , \gtrdot \}\) \(\mid \) \(a, b \in \varSigma _\# \}\). Such a matrix is called conflict-free iff \(\forall a,b \in \varSigma _\#\), \(|M_{a,b}|\le 1\). A conflict-free matrix is called total or complete iff \(\forall a,b \in \varSigma _\#\), \(M_{a,b} \ne \emptyset \). A matrix is \(\dot{=}\)-acyclic if there is no \(a \in \varSigma \) such that \(a \,\dot{=}\, \ldots \,\dot{=}\, a\).
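The three matrix predicates just defined translate directly into code. A sketch over a dictionary representation (keys are ordered pairs of characters, values are sets of relation symbols, with '<', '=', '>' standing for \(\lessdot , \doteq , \gtrdot \)):

```python
def is_conflict_free(M):
    """Each cell holds at most one precedence relation."""
    return all(len(rels) <= 1 for rels in M.values())

def is_total(M, sigma_hash):
    """Every ordered pair of characters has exactly one relation."""
    return is_conflict_free(M) and all(
        M.get((a, b)) for a in sigma_hash for b in sigma_hash)

def is_eq_acyclic(M, sigma):
    """No chain a = ... = a: the directed graph
    {a -> b | '=' in M[a,b]} must be acyclic."""
    succ = {a: {b for b in sigma if '=' in M.get((a, b), set())}
            for a in sigma}
    visited, on_path = set(), set()
    def dfs(a):
        if a in on_path:
            return False                  # a cycle of = was found
        if a in visited:
            return True
        visited.add(a)
        on_path.add(a)
        ok = all(dfs(b) for b in succ[a])
        on_path.discard(a)
        return ok
    return all(dfs(a) for a in sigma)
```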

In this paper we assume that OPMs are \(\dot{=}\)-acyclic. This hypothesis is stated for simplicity, despite the fact that, rigorously speaking, it affects the expressive power of OPLs. It could be avoided by adopting OPGs extended with regular expressions in production rhs [12, 14], which however would require a much heavier notation.

We extend the set inclusion relations and the Boolean operations, in the obvious cell-by-cell way, to any two matrices having the same terminal alphabet. Two matrices are compatible iff their union is conflict-free.

Definition 6

(Operator precedence grammar). An O-grammar G is an operator precedence grammar (OPG) iff the matrix \({{\,\mathrm{OPM}\,}}(G)\) is conflict-free, i.e., the three OPRs are pairwise disjoint. Then the language generated by G is an operator precedence language (OPL). An OPG is \(\dot{=}\)-acyclic if \({{\,\mathrm{OPM}\,}}(G)\) is so.

It is known that the OPL family is strictly included within the deterministic and reverse-deterministic CF family and strictly includes the VPL one [12].

Example 7

For the grammar \(GAE_1\) (see Fig. 1, left), the left and right terminal sets of nonterminals E, T and F are, respectively: \(\mathcal {L}(E) = \{+, *,e\}\), \(\mathcal {L}(T) = \{*,e\}\), \(\mathcal {L}(F) = \{e\}\), \(\mathcal {R}(E) = \{+, *,e\}\), \(\mathcal {R}(T) = \{*,e\}\), and \(\mathcal {R}(F) = \{e\}\).

Figure 1 (center) displays the conflict-free OPM associated with the grammar \(GAE_1\); for instance \({{\,\mathrm{OPM}\,}}_{*,e} = \lessdot \) tells that \(*\) yields precedence to e.

Fig. 1. \(GAE_1\) (left) and its OPM (center); the string \(e + e * e + e\), with relation \(\curvearrowright \) (right).

Unlike the arithmetic relations having similar typography, the OPRs do not enjoy any of the transitive, symmetric, reflexive properties.

A conflict-free matrix associates with every string at most one structure, i.e., a unique parenthesization. This aspect, paired with a way of deterministically choosing the rules’ rhs to be reduced, is the basis of Floyd’s natural bottom-up deterministic parsing algorithm. E.g., the following BD version of \(GAE_1\) (axioms and OPM are unchanged) drives the parsing of the string \(e + e * e + e \) to the unique structure \(\llbracket \llbracket \llbracket e \rrbracket + \llbracket \llbracket e \rrbracket * \llbracket e \rrbracket \rrbracket \rrbracket + \llbracket e \rrbracket \rrbracket \):

$$ \begin{array}{l} E \rightarrow E + T \ \mid \ E + F \ \mid \ T + T \mid F + F \ \mid \ F + T \ \mid \ T + F\\ T \rightarrow T * F \ \mid F * F \qquad F \rightarrow e. \end{array} $$
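The left/right terminal sets and the OPM of this BD grammar can be computed by a least-fixpoint pass over the rules, following Definition 5 (border relations with # omitted). A sketch, where uppercase letters are nonterminals and '<', '=', '>' stand for \(\lessdot , \doteq , \gtrdot \):

```python
# BD version of GAE_1 (uppercase = nonterminals, axiom E)
RULES = [('E', r) for r in ['E+T', 'E+F', 'T+T', 'F+F', 'F+T', 'T+F']] \
      + [('T', 'T*F'), ('T', 'F*F'), ('F', 'e')]
NTS = {'E', 'T', 'F'}

def terminal_sets(rules, nts):
    """Least fixpoint for the left/right terminal sets of Definition 5."""
    L = {A: set() for A in nts}
    R = {A: set() for A in nts}
    changed = True
    while changed:
        changed = False
        for A, rhs in rules:
            lnew = L[rhs[0]] | {rhs[1]} if rhs[0] in nts else {rhs[0]}
            rnew = R[rhs[-1]] | {rhs[-2]} if rhs[-1] in nts else {rhs[-1]}
            if not (lnew <= L[A] and rnew <= R[A]):
                L[A] |= lnew
                R[A] |= rnew
                changed = True
    return L, R

def opm(rules, nts, L, R):
    """Collect the three OPRs from the rules (borders with # omitted)."""
    M = {}
    add = lambda a, b, rel: M.setdefault((a, b), set()).add(rel)
    for _, rhs in rules:
        for i, x in enumerate(rhs):
            if x in nts:
                continue
            if i + 1 < len(rhs) and rhs[i + 1] not in nts:
                add(x, rhs[i + 1], '=')          # pattern a b
            if i + 2 < len(rhs) and rhs[i + 1] in nts and rhs[i + 2] not in nts:
                add(x, rhs[i + 2], '=')          # pattern a C b
            if i + 1 < len(rhs) and rhs[i + 1] in nts:
                for b in L[rhs[i + 1]]:
                    add(x, b, '<')               # a yields to L(B)
            if i > 0 and rhs[i - 1] in nts:
                for a in R[rhs[i - 1]]:
                    add(a, x, '>')               # R(B) takes over b
    return M
```

Running it reproduces the sets of Example 7 and a conflict-free matrix with, e.g., \({{\,\mathrm{OPM}\,}}_{*,e} = \lessdot \) and \({{\,\mathrm{OPM}\,}}_{e,+} = \gtrdot \).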

Various formal properties of OPGs and languages are documented in the literature, chiefly in [12, 13, 26]. For convenience, we just recall and collect the ones that are relevant for this article in the next proposition.

Proposition 8

(Relevant properties of OPGs and OPLs).

Let M be a conflict-free OPM over \(\varSigma _\# \times \varSigma _\#\).

  1.

    The class of OPGs and OPLs compatible with M are: \(\ \mathscr {C}_M = \{G \mid G \text { is an OPG}, {{\,\mathrm{OPM}\,}}(G) \subseteq M \}\), \(\ \mathscr {L}_M = \{ L(G) \mid G \in \mathscr {C}_M\}\).

  2.

    The class \(\mathscr {C}_M\) contains a unique grammar, called the maxgrammar of M, denoted by \(G_{max, M}\), such that for all grammars \(G \in \mathscr {C}_M\) the inclusion \(L(G) \subseteq L(G_{max, M})\) holds. \(L(G_{\max ,M})\) is called the max-language. If M is total, then \(L(G_{\max ,M})= \varSigma ^*\).

  3.

    Let M be total. With a natural overloading, we define the function M(x) that associates with each string \(x \in \varSigma ^*\) its unique parenthesized version, determined by the corresponding derivations of the maxgrammar. E.g., with M such that \(a \lessdot a\), \(a \doteq b\), \(b \gtrdot b\), we have \(M(aabb) = \llbracket a \llbracket a b \rrbracket b \rrbracket \).

  4.

    Let M be total.

    • \(\mathscr {L}_M\) is closed under all set operations, therefore it is a Boolean algebra.

    • \(\mathscr {L}_M\) is closed under concatenation and Kleene star.

In summary, an OPM assigns a universal structure to strings in \(\varSigma ^*\), thus we call the pair \((\varSigma ,M)\) an OP alphabet. The following characterizations of OPLs, in terms of logic and expressions, are bound to the OP alphabet.

Logic Characterization. In [25] the traditional monadic second order logic (MSO) characterization of regular languages by Büchi, Elgot, and Trakhtenbrot [8, 17, 36] is extended to the case of OPL. To deal with the typical tree structure of CF languages the original MSO syntax is augmented with the predicate \(\curvearrowright \), based on the OPL precedence relations: informally, \(\varvec{x} \curvearrowright \varvec{y}\) holds between the rightmost and leftmost positions of the context encompassing a subtree, i.e., respectively, of the character that yields precedence to the subtree’s leftmost leaf, and of the one over which the subtree’s rightmost leaf takes precedence.

Unlike similar but simpler relations introduced, e.g., in [23] and [2], the \(\curvearrowright \) relation is not one-to-one. For instance, Fig. 1 (right) displays the \(\curvearrowright \) relation holding for the sentence \(e + e * e + e \) generated by grammar \(GAE_1\): we have \(0 \curvearrowright 2\), \(2 \curvearrowright 4\), \(4 \curvearrowright 6\), \(6 \curvearrowright 8\), \(2 \curvearrowright 6\), \(0 \curvearrowright 6\), and \(0 \curvearrowright 8\). Such pairs correspond to contexts where a reduce operation is executed during the parsing of the string (they are listed according to their execution order).
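The pairs above can be recomputed by a Floyd-style shift-reduce pass: whenever a handle is reduced, the positions of the terminal below the handle and of the current lookahead form a \(\curvearrowright \) pair. A simplified sketch (the precedence table is the one of \(GAE_1\) extended with #; positions count from 0 at the left delimiter; nonterminals are transparent, so only terminal positions are stacked):

```python
# precedence relations of GAE_1, extended with the # delimiters
PREC = {('#', 'e'): '<', ('#', '+'): '<', ('+', 'e'): '<', ('+', '*'): '<',
        ('*', 'e'): '<', ('e', '+'): '>', ('e', '*'): '>', ('e', '#'): '>',
        ('*', '+'): '>', ('*', '#'): '>', ('+', '+'): '>', ('+', '#'): '>'}

def chain_pairs(word):
    """Return the set of pairs (x, y) such that x ~> y holds, i.e.
    positions x and y delimit the frontier of a subtree of word."""
    s = ['#'] + list(word) + ['#']
    stack = [0]                     # positions of shifted terminals
    pairs = set()
    i = 1
    while not (len(stack) == 1 and i == len(s) - 1):
        rel = PREC[(s[stack[-1]], s[i])]
        if rel in ('<', '='):
            stack.append(i)         # shift the lookahead
            i += 1
        else:                       # '>': reduce the topmost handle
            while len(stack) > 1 and PREC[(s[stack[-2]], s[stack[-1]])] == '=':
                stack.pop()         # pop the body of an =-chain
            stack.pop()
            pairs.add((stack[-1], i))
    return pairs
```

On \(e + e * e + e\) this yields exactly the seven pairs listed in the text.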

Formally, we define a countable infinite set of first-order variables \({\varvec{{x}}}, {\varvec{{y}}}, \dots \) and a countable infinite set of monadic second-order (set) variables \({\varvec{{X}}},{\varvec{{Y}}}, \dots \). We adopt the convention to denote first and second-order variables in boldface font.

Definition 9

(Monadic Second-order Logic over \((\varSigma ,M)\)). Let \(\mathcal {V}_1\) be a set of first-order variables, and \(\mathcal {V}_2\) be a set of second-order (or set) variables. The MSO\(_{\varSigma ,M}\) (monadic second-order logic over \((\varSigma , M)\)) is defined by the following syntax (symbols \(\varSigma , M\) will be omitted unless necessary to prevent confusion), where \(c \in \varSigma _\#\), \({\varvec{{x}}}, {\varvec{{y}}} \in \mathcal {V}_1\), and \({\varvec{{X}}} \in \mathcal {V}_2\):

$$ \varphi := c({\varvec{{x}}}) \mid {\varvec{{x}}}\, \in {\varvec{{X}}} \mid {\varvec{{x}}} \,< {\varvec{{y}}} \mid {\varvec{{x}}}\, \curvearrowright {\varvec{{y}}} \mid \lnot \varphi \mid \varphi \vee \varphi \mid \exists {\varvec{{x}}}.\varphi \mid \exists {\varvec{{X}}}.\varphi . $$

A MSO formula is interpreted over a \((\varSigma , M)\) string w, with respect to assignments \(\nu _1: \mathcal {V}_1 \rightarrow \{0, 1, \ldots , |w|+1\}\) and \(\nu _2: \mathcal {V}_2 \rightarrow \mathcal {P}(\{0, 1, \ldots , |w|+1\})\), in this way:

  • \(\# w \#, M, \nu _1, \nu _2 \models c({\varvec{{x}}})\) iff \(\# w \# = w_1 c w_2\) and \(|w_1| = \nu _1({\varvec{{x}}})\).

  • \(\# w \#, M, \nu _1, \nu _2 \models {\varvec{{x}}} \in {\varvec{{X}}}\) iff \(\nu _1({\varvec{{x}}}) \in \nu _2({\varvec{{X}}})\).

  • \(\# w \#, M, \nu _1, \nu _2 \models {\varvec{{x}}} < {\varvec{{y}}}\) iff \(\nu _1({\varvec{{x}}}) < \nu _1({\varvec{{y}}})\).

  • \(\# w \#, M, \nu _1, \nu _2 \models {\varvec{{x}}} \curvearrowright {\varvec{{y}}}\) iff \(\# w \# = w_1 a w_2 b w_3\), \(|w_1| = \nu _1({\varvec{{x}}})\), \(|w_1 a w_2| = \nu _1({\varvec{{y}}})\), and \(w_2\) is the frontier of a subtree of the syntax tree of w.

  • \(\# w \#, M, \nu _1, \nu _2 \models \lnot \varphi \) iff \(\# w \#, M, \nu _1, \nu _2 \not \models \varphi \).

  • \(\# w \#, M, \nu _1, \nu _2 \models \varphi _1 \vee \varphi _2\) iff \(\# w \#, M, \nu _1, \nu _2 \models \varphi _1\) or \(\# w \#, M, \nu _1, \nu _2 \models \varphi _2\).

  • \(\# w \#, M, \nu _1, \nu _2 \models \exists {\varvec{{x}}}.\varphi \) iff \(\# w \#, M, \nu '_1, \nu _2 \models \varphi \), for some \(\nu '_1\) with \(\nu '_1({\varvec{{y}}}) = \nu _1({\varvec{{y}}})\) for all \({\varvec{{y}}} \in \mathcal {V}_1 - \{{\varvec{{x}}}\}\).

  • \(\# w \#, M, \nu _1, \nu _2 \models \exists {\varvec{{X}}}.\varphi \) iff \(\# w \#, M, \nu _1, \nu '_2 \models \varphi \), for some \(\nu '_2\) with \(\nu '_2({\varvec{{Y}}}) = \nu _2({\varvec{{Y}}})\) for all \({\varvec{{Y}}} \in \mathcal {V}_2 - \{{\varvec{{X}}}\}\).

To improve readability, we drop M, \(\nu _1\), \(\nu _2\) and the delimiters # from the notation whenever there is no risk of ambiguity; furthermore we use some standard abbreviations in formulas, e.g., \(\wedge \), \(\forall \), \({\varvec{{x}}} + 1\), \({\varvec{{x}}} -1\), \({\varvec{{x}}} = {\varvec{{y}}}\), \({\varvec{{x}}} \le {\varvec{{y}}}\).

A sentence is a formula without free variables. The language of all strings \(w \in \varSigma ^*\) such that \(w \models \varphi \) is \( L(\varphi ) = \{ w \in \varSigma ^* \mid w \models \varphi \}. \)

The above MSO logic describes exactly the OPL family [25]. We denote the restriction of the MSO logic to first-order variables as FO. Whenever we deal with logic definitions of languages we implicitly exclude from such languages the empty string, according to the traditional convention adopted in the literature (e.g., [28]); thus, when talking about MSO- or FO-definable languages we exclude empty rules from their grammars.

2.2 Parenthesis and OPLs and the Noncounting Property

We briefly recall the standard definition of the NC property for CF parenthesis languages [10] and its decidability, then we apply it to OPLs.

Definition 10

(NC parenthesis language and grammar [10]). A parenthesis language L is NC (or aperiodic) iff \(\exists n > 1\) such that, for all strings x, u, w, v, y, where w and uwv are well parenthesized, \(xu^n wv^n y \in L\) iff \(xu^{n+m} wv^{n+m} y \in L\), \(\forall m \ge 0\).

A derivation of a Par-grammar is counting iff it has the form \(B {\mathop {\Longrightarrow }\limits ^{*}}u^m B v^m\), with \(m >1\), and there is no derivation \(B {\mathop {\Longrightarrow }\limits ^{*}}u B v\). A Par-grammar is NC iff none of its derivations is counting.

The next theorem ensures the decidability of the NC property.

Theorem 11

[NC language and grammar (Th. 1 of [10])] A parenthesis language is NC iff its BDR grammar has no counting derivation.

Definition 12

(NC OPLs and grammars). For a given OPL L with OPM M, \(L_p\) is the language of the parenthesized strings \(x_p\) uniquely associated to L’s strings x by M. An OPL L is NC iff its corresponding parenthesized language \(L_p\) is NC.

A derivation of a grammar G is counting iff the corresponding derivation of the associated Par-grammar \(G_p\) is counting.

Thus, an OPL is NC iff its BDR OPG (unique up to an isomorphism of nonterminal alphabets) has no counting derivations.

In the following, unless parentheses are explicitly needed, we refer to unparenthesized strings since their correspondence to parenthesized strings is one-to-one. It is also worth recalling [11] the following peculiar property of OPLs: such languages are NC or not independently of their OPM. In other words, although the NC property is defined for structured languages (parenthesis or tree languages [27, 34]), in the case of OPLs this property does not depend on the structure given to the sentences by the OPM.

It is important to stress, however, that, despite the above peculiarity of OPLs, aperiodicity remains a property that makes sense only with reference to the structured version of languages. Consider the following languages, with the same OPM consisting of \(\{c \lessdot c, c \doteq a,c \doteq b, a \,\gtrdot \, b, b \,\gtrdot \, a \}\) besides the implicit relations w.r.t. \(\#\): \(L_1 = \{ c^{2n}(ab)^n \mid n \ge 1\}\), \(L_2 = (ab)^+\). They are both clearly NC, and so is their concatenation \(L_1 \cdot L_2\) according to Definition 12 (see also Theorem 20); however, if we applied Definition 10 to \(L_1 \cdot L_2\) without considering parentheses, we would obtain that, for every n, \(c^{2n}(ab)^{2n} \in L_1 \cdot L_2\) but not so for \(c^{2n+1}(ab)^{2n+1}\).
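The flat-string mismatch is easy to check mechanically, since the unparenthesized strings of \(L_1 \cdot L_2\) are exactly \(c^{2n}(ab)^m\) with \(m > n \ge 1\). A sketch:

```python
import re

def in_L1L2(w):
    """Flat membership in L1.L2 = { c^(2n) (ab)^m | m > n >= 1 }."""
    m = re.fullmatch(r'(c+)((?:ab)+)', w)
    if not m:
        return False
    n_c, n_ab = len(m.group(1)), len(m.group(2)) // 2
    return n_c % 2 == 0 and n_c >= 2 and n_ab > n_c // 2
```

Pumping u = c and v = ab in lockstep then succeeds for the even exponents \(c^{2n}(ab)^{2n}\) and fails for the odd ones \(c^{2n+1}(ab)^{2n+1}\), exactly as stated.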

3 Expressions for OPLs

Operator Precedence Expressions (OPE) extend traditional REs in a similar way as other extensions in the literature, e.g., REs for tree languages [35]. We show that OPEs define exactly the OPL family, so that, by joining this result with the previous characterizations of OPLs in terms of MSO definability and recognizability by Operator Precedence Automata (OPA) [25], the class of OPLs can be defined equivalently as the class of languages generated by OPGs, described through MSO formulas, recognized by OPAs, or defined by OPEs.

As for MSO logic and OPAs, the definition of OPEs is based on an OP alphabet.

Definition 13

(OPE). Given an OP alphabet \((\varSigma , M)\), where M is complete, an OPE E and its language \(L_M(E) \subseteq \varSigma ^*\) are defined as follows. The meta-alphabet of OPE uses the same symbols as REs, together with the two symbols ‘[’, and ‘]’. Let \(E_1\) and \(E_2\) be OPEs:

  1.

    \(a \in \varSigma \) is an OPE with \(L_M(a) = \{a\}\).

  2.

    \(\lnot E_1\) is an OPE with \(L_M(\lnot E_1) = \varSigma ^* - L_M(E_1)\).

  3.

    \(a[E_1]b\), called the fence operation (we say \(E_1\) is in the fence a, b), is an OPE; its language consists of the strings in which a string \(x \in L_M(E_1)\) spans the frontier of a whole subtree delimited by the context a, b, i.e., a yields precedence to the leftmost terminal of x and the rightmost terminal of x takes precedence over b. The strings have the form \(a \cdot x \cdot b\) if \(a, b \in \varSigma \), the form \(x \cdot b\) if \(a = \#, b \in \varSigma \), and the form \(a \cdot x\) if \(a \in \varSigma , b = \#\); in all cases \(E_1\) must not contain #.

  4.

    \(E_1 \,\cup \, E_2\) is an OPE with \(L_M(E_1 \,\cup \, E_2) = L_M(E_1) \,\cup \, L_M(E_2)\).

  5.

    \(E_1 \cdot E_2\) is an OPE with \(L_M(E_1 \cdot E_2) = L_M(E_1) \cdot L_M(E_2)\), where \(E_1\) does not contain \(a[E_3]\#\) and \(E_2\) does not contain \(\#[E_3]a\), for some OPE \(E_3\), and \(a \in \varSigma \).

  6.

    \(E_1^*\) is an OPE defined by \(E_1^* := \bigcup ^{\infty }_{n = 0} E_1^n\), where \(E_1^0 := \{\varepsilon \}\), \(E^1_1 = E_1\), \(E_1^n := E_1^{n-1} \cdot E_1\); \(E_1^+ := \bigcup ^{\infty }_{n = 1} E_1^n\).

Among the operations defining OPEs, concatenation has the maximum precedence; set-theoretic operations have the usual precedences, the fence operation is dealt with as a normal parenthesis pair. A star-free (SF) OPE is one that does not use \(^*\) and \(^+\).

The conditions on # are due to the peculiarities of the closure of OPLs w.r.t. concatenation (see also Theorem 20). In 5., the # is not permitted within, say, the left factor \(E_1\) because delimiters are necessarily positioned at the two ends of a string.

Besides the usual abbreviations for set operations (e.g., \(\cap \) and −), we also use the following derived operators: \(a \varDelta b := a [ \varSigma ^+ ] b\), and \(a \nabla b := \lnot ( a \varDelta b ) \,\cap \,a \cdot \varSigma ^+ \cdot b\). It is trivial to see that the identity \(a[E]b = a \varDelta b \,\cap \, a \cdot E \cdot b\) holds.

The fact that in Definition 13 matrix M is complete is w.l.o.g.: to state that for two terminals a and b, \(M_{a,b} = \emptyset \) (i.e. that there should be a “hole” in the OPM for them), we can use the short notations: \( {{\,\mathrm{hole}\,}}(a,b) := \lnot (\varSigma ^* (a b \cup a \varDelta b) \varSigma ^*), \ {{\,\mathrm{hole}\,}}(\#,b) := \lnot (\# \varDelta b \varSigma ^*), \ {{\,\mathrm{hole}\,}}(a,\#) := \lnot (\varSigma ^* a \varDelta \#) \) and intersect them with the OPE.

The following examples illustrate the meaning of the fence operation, the expressiveness of OPLs w.r.t. less powerful classes of CF languages, and how OPEs naturally extend REs to the OPL family.

Example 14

Let \(\varSigma \) be \(\{a, b \}\), \(\{a \,\lessdot \,a, a\, \doteq \, b, b \,\gtrdot \, b\} \subseteq M\). The OPE \(a[a^{*}b^{*} ]b\) defines the language \(\{a^{n}b^n \mid n \ge 1 \}\). In fact the fence operation imposes that any string \(x \in a^{*}b^{*} \) embedded within the context ab is well-parenthesized according to M.

The OPEs \(a[a^{*}b^{*} ]\#\) and \(a^{+}a[a^{*}b^{*} ]b \,\cup \, a^{+}\), instead, both define the language \(\{ a^n b^m \mid n > m \ge 0\}\), since the matrix M allows for, e.g., the string aaabb parenthesized as \(a \llbracket a \llbracket a b \rrbracket b \rrbracket \).

It is also easy to define Dyck languages with OPEs, as their parenthesis structure is naturally encoded by the OPM. Consider the Dyck language \(L_{\text {Dyck}}\) with two pairs of parentheses denoted by \(a, a'\) and \(b, b'\). This language can be described simply through an incomplete OPM, reported in Fig. 2 (left); in other words, \(L_{\text {Dyck}} = L(G_{max, M})\) where M is the matrix of the figure. Given that, for technical simplicity, we use only complete OPMs, we must refer to the one in Fig. 2 (center) and state in the OPE that some OPRs are not wanted, such as the one between a and \(b'\), where the open and closed parentheses are of the wrong kind, or the one between a and \(\#\), i.e., an open a must have a matching \(a'\).

Fig. 2. The incomplete OPM defining \(L_{\text {Dyck}}\) (left), a possible completion \(M_\text {complete}\) (center), and the OPM \(M_\text {int}\) for the OPE describing an interrupt policy (right).

The following OPE defines \(L_{\text {Dyck}}\) by suitably restricting the “universe”

\(L(G_{max, M_\text {complete}})\):

$$ {{\,\mathrm{hole}\,}}(a, b') \,\cap \, {{\,\mathrm{hole}\,}}(b, a') \,\cap \, {{\,\mathrm{hole}\,}}(\#, a') \,\cap \, {{\,\mathrm{hole}\,}}(\#, b') \,\cap \, {{\,\mathrm{hole}\,}}(a, \#) \,\cap \, {{\,\mathrm{hole}\,}}(b, \#). $$

Example 15

For a more application-oriented case, consider the classical LIFO policy managing procedure calls and returns but assume also that interrupts may occur: in such a case the stack of pending calls is emptied and computation is resumed from scratch.

This policy is already formalized by the incomplete OPM of Fig. 2 (right), with \(\varSigma = \{call, ret, int \}\) with the obvious meaning of symbols. For example, the string call call ret call call int represents a run where only the second call returns, while the other ones are interrupted. On the contrary, call call int ret is forbidden, because a return is not allowed when the stack is empty. If we further want to say that there must be at least one terminating procedure, we can use the OPE: \(\varSigma ^* \cdot call \varDelta ret \cdot \varSigma ^*\).

Another example is the following, where we state that the run must contain at least one sub-run where no procedures are interrupted: \(\varSigma ^* \,\cdot \, {{\,\mathrm{hole}\,}}(call, int) \,\cdot \, \varSigma ^*\). Notice that the language defined by the above OPE is not a VPL since VPLs only allow for unmatched returns and calls at the beginning or at the end of a string, respectively.
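Independently of its OPM encoding, the interrupt policy itself can be simulated with an explicit stack. A sketch of the stated semantics (a ret pops a pending call and is forbidden on an empty stack, an int empties the stack; the second function is an illustrative check for “at least one terminating procedure”):

```python
def legal_run(events):
    """Check the call/ret/int policy: a ret needs a pending call,
    an int discards all pending calls."""
    stack = []
    for e in events:
        if e == 'call':
            stack.append(e)
        elif e == 'ret':
            if not stack:
                return False        # return on an empty stack: forbidden
            stack.pop()
        elif e == 'int':
            stack.clear()           # interrupt empties the stack
    return True

def has_terminated_procedure(events):
    """At least one call is matched by a ret before any int wipes it."""
    stack, matched = [], False
    for e in events:
        if e == 'call':
            stack.append(e)
        elif e == 'ret' and stack:
            stack.pop()
            matched = True
        elif e == 'int':
            stack.clear()
    return matched
```

It accepts call call ret call call int (only the second call returns) and rejects call call int ret, as described above.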

Theorem 16

For every OPE E on an OPM M, there is an OPG G, compatible with M, such that \(L_M(E) = L(G)\).

Proof

By induction on E’s structure. The operations \(\cup , \lnot , \cdot \), and \(^*\) come from the closures of OPLs (Proposition 8). The only new case is a[E]b which is given by the following grammar. The function \(\eta : \varSigma _\# \rightarrow \varSigma \), such that \(\eta (\#) = \varepsilon \), \(\eta (a) = a\) otherwise, is used to take borders into account. If, by induction, G defines the same language as E, with axiom set \(S_E\), then build the grammar \(G'\) from G by adding the following rules, where A and \(A'\) are new nonterminals of \(G'\) not in G, A is an axiom of \(G'\), and \(B \in S_E\):

  • \(A \rightarrow \eta (a) B \eta (b)\), if \(a \doteq b\) in M;

  • \(A \rightarrow \eta (a) A'\) and \(A' \rightarrow B \eta (b)\), if \(a \lessdot b\) in M;

  • \(A \rightarrow A' \eta (b)\) and \(A' \rightarrow \eta (a) B\), if \(a \gtrdot b\) in M.

The grammar for a[E]b is then obtained by applying the construction for \(L(G')\, \cap \,L(G_{max, M})\). This intersection is to check that \(a\, \lessdot \, \mathcal {L}(B)\) and \(\mathcal {R}(B) \,\gtrdot \, b\); if it is not the case, according to the semantics of a[E]b, the resulting language is empty.   \(\square \)
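The three cases of the construction in the proof can be spelled out operationally. A sketch (the encoding of rules as pairs of strings and the nonterminal names are illustrative), where eta erases the # delimiter as in the proof:

```python
def eta(c):
    """Map # to the empty string, leave other terminals unchanged."""
    return '' if c == '#' else c

def fence_rules(a, b, rel, B='B', A='A', A1="A'"):
    """New rules added for a[E]b, according to the relation rel
    between a and b in M ('=', '<' or '>'); B is an axiom of the
    grammar for E, A and A' are fresh nonterminals."""
    if rel == '=':
        return [(A, eta(a) + B + eta(b))]
    if rel == '<':
        return [(A, eta(a) + A1), (A1, B + eta(b))]
    if rel == '>':
        return [(A, A1 + eta(b)), (A1, eta(a) + B)]
    raise ValueError('empty OPM cell')
```

For instance, with \(a \doteq b\) a single rule \(A \rightarrow aBb\) is produced, while \(a \lessdot b\) yields the two rules \(A \rightarrow aA'\) and \(A' \rightarrow Bb\).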

Next, we show that OPEs can express any language that is definable through an MSO formula as defined in Sect. 2.1. Thanks to the fact that the same MSO logic can express exactly OPLs [25] and to Theorem 16 we obtain our first major result, i.e., the equivalence of MSO, OPG, OPA (see e.g., [26]), and OPE.

To construct an OPE from a given MSO formula we follow the traditional path adopted for regular languages (as explained, e.g., in [30]) and augment it to deal with the \(\curvearrowright \) relation. For a MSO formula \(\varphi \), let \(\pmb {x}_1, \pmb {x}_2, \ldots , \pmb {x}_r\) be the set of first-order variables occurring in \(\varphi \), and \(\pmb {X}_1, \pmb {X}_2, \ldots , \pmb {X}_s\) be the set of second order variables. We use the new alphabet \(B_{p,q} = \varSigma \times \{0,1\}^p \times \{0,1\}^q\), where \(p \ge r\) and \(q \ge s\). The main idea is that the \(\{0,1\}^p\) part of the alphabet is used to encode the value of the first-order variables (e.g. for \(p=r=4\), (1, 0, 1, 0) stands for both the positions \(\pmb {x}_1\) and \(\pmb {x}_3\)), while the \(\{0,1\}^q\) part of the alphabet is used for the second order variables. Hence, we are interested in the language \(K_{p,q}\) formed by all strings where the components encoding the first-order variables contain exactly one occurrence of 1. We also use this definition \(C_k := \{ c \in B_{p,q} \mid \text { the }(k+1)\text {-st component of }c = 1\}\).
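The constraint defining \(K_{p,q}\) is easy to state operationally: each of the p components encoding first-order variables must carry exactly one 1 across the whole string. A sketch with an illustrative tuple encoding (the first tuple component is the \(\varSigma \) character, the next p components are the first-order bits):

```python
def in_K(word, p):
    """word: list of tuples (sigma_char, bit_1, ..., bit_p, ...).
    Check that each first-order component holds exactly one 1."""
    return all(sum(c[1 + k] for c in word) == 1 for k in range(p))

def decode(word, k):
    """Position assigned to the (k+1)-th first-order variable."""
    return next(i for i, c in enumerate(word) if c[1 + k] == 1)
```

E.g., the string aba with the assignment \(\pmb {x}_1 = 0\), \(\pmb {x}_2 = 2\) (p = 2, q = 0) is encoded as [('a',1,0), ('b',0,0), ('a',0,1)] and belongs to \(K_{2,0}\).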

Theorem 17

For every MSO formula \(\varphi \) on an OP alphabet \((\varSigma , M)\) there is an OPE E on M such that \(L_M(E) = L(\varphi )\).

Proof

By induction on \(\varphi \)’s structure; the construction is standard for regular operations, the only real novelty being \(\pmb {x}_i \curvearrowright \pmb {x}_j\). \(B_{p,q}\) is used to encode interpretations of free variables. The set \(K_{p,q}\) of strings in which each component encoding a first-order variable contains exactly one 1 is given by: \( K_{p,q} = \bigcap _{1 \le i \le p} (B_{p,q}^* C_i B_{p,q}^* - B_{p,q}^* C_i B_{p,q}^* C_i B_{p,q}^*). \) Disjunction and negation are naturally translated into \(\cup \) and \(\lnot \); as in Büchi’s theorem, for \(\exists \pmb {x}_i \psi \) (resp. \(\exists \pmb {X}_j \psi \)), first the expression \(E_\psi \) for \(\psi \) over the alphabet \(B_{p,q}\) is built, then the expression E for \(\exists \pmb {x}_i \psi \) is obtained from \(E_\psi \) by erasing component i (resp. j) from \(B_{p,q}\); \(\pmb {x}_i < \pmb {x}_j\) is represented by \(K_{p,q} \cap B^*_{p,q} C_i B^*_{p,q} C_j B^*_{p,q}\). Last, the OPE for \(\pmb {x}_i \curvearrowright \pmb {x}_j\) is \(B_{p,q}^* C_i [ B_{p,q}^+ ] C_j B_{p,q}^*\).    \(\square \)

4 Closure Properties of Noncounting OPLs and Star-Free OPEs

Thanks to the fact that an OPM implicitly defines the structure of an OPL, i.e., its parenthesization, aperiodic OPLs inherit from the general class the same closure properties w.r.t. the basic algebraic operations. Such properties are proved in this section under the same assumption as in the general case (see Proposition 8), i.e., that the involved languages have compatible OPMs. As a major consequence we derive that star-free OPEs define aperiodic OPLs.

Theorem 18

Counting and NC parenthesis languages are closed w.r.t. complement. Thus, for any OPM M, counting and NC OPLs in the family \(\mathscr {L}_M\) (Proposition 8) are closed w.r.t. complement w.r.t. the max-language defined by M.

Proof

We give the proof for counting languages which also implies the closure of NC ones.

By definition of counting parenthesis language and from Theorem 11, if L is counting there exist strings x, u, v, z, y and integers n, m with \(n> 1, m > 1\) such that \(xv^{n+r}zu^{n+r}y \in L\) for all \(r = km\), \(k>0\), but not for all \(r>0\). Thus, the complement of L contains infinitely many strings \(xv^{n+i}zu^{n+i}y\), but not all of them, since those with \(i = km\) belong to L. Hence, for \(\lnot L\) too there is no n such that \(xv^{n}zu^{n}y \in \lnot L\) iff \(xv^{n+r}zu^{n+r}y \in \lnot L\) for all \(r \ge 0\).   \(\square \)
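Although the theorem concerns parenthesis languages, the underlying regular notion of counting is effectively checkable. A minimal Python sketch (ours, not from the paper; it assumes DFA states are numbered 0..n-1) tests aperiodicity via the transition monoid of a DFA: the language is NC iff every monoid element t satisfies \(t^n = t^{n+1}\) for some n, i.e., the powers of t enter a cycle of length 1.

```python
def transition_monoid(states, alphabet, delta):
    # All state transformations induced by input words, each represented
    # as a tuple t with t[s] = state reached from s (states are 0..n-1).
    identity = tuple(range(len(states)))
    gens = [tuple(delta[(s, a)] for s in states) for a in alphabet]
    monoid = {identity}
    frontier = [identity]
    while frontier:
        t = frontier.pop()
        for g in gens:
            u = tuple(g[t[s]] for s in states)  # word for t, then one letter
            if u not in monoid:
                monoid.add(u)
                frontier.append(u)
    return monoid

def is_aperiodic(monoid):
    # Aperiodic iff every element t eventually satisfies t^n = t^(n+1),
    # i.e. the cycle entered by the powers of t has length 1.
    for t in monoid:
        seen = []
        power = t
        while power not in seen:
            seen.append(power)
            power = tuple(t[power[s]] for s in range(len(t)))
        if len(seen) - seen.index(power) != 1:
            return False
    return True
```

The language "even number of a's" (the classic modulo-2 example of the introduction) fails the test, while "contains at least one a" passes it.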

Theorem 19

NC parenthesis languages and NC OPLs in the same family \(\mathscr {L}_M\) are closed w.r.t. union and therefore w.r.t. intersection.

Proof

Let \(L_1, L_2\) be two NC parenthesis languages (resp. OPLs). Assume by contradiction that \(L = L_1 \cup L_2\) is counting. Thus, there exist strings x, u, v, z, y such that for infinitely many m, \(xv^mzu^my \in L\), but there is no n such that \(xv^nzu^ny \in L\) iff \(xv^{n+r}zu^{n+r}y \in L\) for all \(r \ge 0\). Hence, the same property must hold for at least one of \(L_1\) and \(L_2\), which would therefore be counting.    \(\square \)

Notice that, unlike the case of complement, counting languages are not closed w.r.t. union and intersection, whether they are regular, parenthesis, or OP languages.

Fig. 3.
figure 3

An example of paired derivations combined by the concatenation construction. In this case the last character of u is in \(\doteq \) relation with the first character of v.

Theorem 20

NC OPLs are closed w.r.t. concatenation.

Proof

Let \(L_i=L(G_i)\), \(i=1,2\), be NC OPLs with \(G_i = (\varSigma , V_{Ni}, P_i, S_i)\) BDR OPGs. Let also \(L_{pi}=L(G_{pi})\) be the parenthesized languages and grammars. We exploit the proof in [12] that OPLs with compatible OPM are closed w.r.t. concatenation. In general the parenthesized version \(L_p\) of \(L = L_1 \cdot L_2\) is not the parenthesized concatenation of the parenthesized versions of \(L_1\) and \(L_2\), i.e., \(L_p\) may differ from \(L_{p1} \cdot L_{p2}\), because the concatenation may cause the syntax trees of \(L_1\) and \(L_2\) to coalesce.

The construction given in [12] builds a grammar G whose nonterminal alphabet includes \(V_{N1}\), \(V_{N2}\) and a set of pairs \([A_1, A_2]\) with \(A_1 \in V_{N1}\), \(A_2 \in V_{N2}\); the axioms of G are the pairs \([X_1, X_2]\) with \(X_1 \in S_1\), \(X_2 \in S_2\).Footnote 1 In essence (Lemmas 18 through 21 of [12]), whenever G derives a string \(x u y\) with u generated by a pair nonterminal \([A_1, A_2]\), then \(u = w \cdot z\) for some w, z such that \(x \cdot w\) is derived within the \(G_1\) components and \(z \cdot y\) within the \(G_2\) components. Notice that some substrings of \(x \cdot w \), resp. \(z \cdot y \), may be derived from nonterminals belonging to \(V_{N1}\), resp. \(V_{N2}\), as a consequence of rules of type \([A_1, A_2] \rightarrow \alpha _1 [B_1, B_2] \beta _2\) with \(\alpha _1 \in V_1^*\), \(\beta _2 \in V_2^*\), where \([B_1, B_2]\) could be missing; also, any string \(\gamma \) derivable in G contains at most one nonterminal of type \([A_1,A_2]\).Footnote 2

Suppose, by contradiction, that G has a counting derivation \([A_1, A_2] \Rightarrow ^+ u^m [A_1, A_2] v^m\)Footnote 3 (one of \(u^m\), \(v^m\) could be empty either in L or in \(L_p\)) whereas \([A_1, A_2]\) does not derive \(u[A_1,A_2]v\): this would imply corresponding derivations in \(G_1\) and \(G_2\), which would be counting since they would involve the same nonterminals occurring in the pairs \([A_i, A_j]\). If instead the counting derivation of G were derived from nonterminals belonging to \(V_{N1}\) (resp. \(V_{N2}\)), that derivation would exist identical in \(G_1\) (resp. \(G_2\)). Figure 3 shows a counting derivation of G obtained by the concatenation of two counting derivations of \(G_1\) and \(G_2\); in this case neither \(u^m\) nor \(v^m\) is empty.   \(\square \)

Theorem 21

The OPLs defined through star-free OPEs are NC.

Proof

We need to show that if the language defined by the star-free expression E is NC, so is the language defined by a[E]b. This follows from the identity \(a[E]b = a \varDelta b\, \cap \, a E b = a [\varSigma ^+] b \,\cap \, a E b\).   \(\square \)

5 FO-Definable OPLs and SF OPEs

We now prove that SF OPE-definable languages coincide with FO-definable OPLs, which are therefore NC as well, a result in sharp contrast with the negative results for NC tree-languages [35]. In Sect. 5.1 we show that NC linear OPLs are FO-definable too.

Lemma 22

(Flat Normal Form). Any star-free OPE can be written in the following form, called flat normal form: \( \bigcup _i \bigcap _j \ t_{i,j}, \) where the elements \(t_{i,j}\) have either the form \(L_{i,j} a_{i,j} \varDelta b_{i,j} R_{i,j}\), or \(L_{i,j} a_{i,j} \nabla b_{i,j} R_{i,j}\), or \(H_{i,j}\), for \(a_{i,j}\), \(b_{i,j} \in \varSigma _\#\), and \(L_{i,j}\), \(R_{i,j}\), \(H_{i,j}\) star-free REs.

Proof

The lemma is a consequence of the distributive and De Morgan properties, together with the following identities, where \(\circ _1, \circ _2 \in \{\varDelta , \nabla \}\), and \(L_k\), \(1 \le k \le 3\) are star-free REs:

$$ a[E]b = a \varDelta b \cap a E b $$
$$ L_1 a_1 \circ _1 a_2 L_2 a_3 \circ _2 a_4 L_3 = (L_1 a_1 \circ _1 a_2 L_2 a_3 \varSigma ^+ a_4 L_3) \,\cap \, (L_1 a_1 \varSigma ^+ a_2 L_2 a_3 \circ _2 a_4 L_3) $$
$$ \lnot (L_1 a_1 \varDelta a_2 L_2) = L_1 a_1 \nabla a_2 L_2 \,\cup \, \lnot (L_1 a_1 \varSigma ^+ a_2 L_2) $$
$$ \lnot (L_1 a_1 \nabla a_2 L_2) = L_1 a_1 \varDelta a_2 L_2 \,\cup \, \lnot (L_1 a_1 \varSigma ^+ a_2 L_2) $$

The first two identities are immediate, while the last two are based on the idea that the only non-regular constraints of the left-hand negations are respectively \(a_1 \nabla a_2\) or \(a_1 \varDelta a_2\), that represent strings that are not in the set only because of their structure.   \(\square \)
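The Boolean part of the normalization can be sketched in Python (our illustration, not part of the paper). Here `('atom', ...)` stands for any of the terms \(L a \varDelta b R\), \(L a \nabla b R\), or H, and `('natom', ...)` for a negated term already rewritten via the identities above; the sketch pushes negations to the leaves by De Morgan and distributes intersections over unions, yielding the union-of-intersections shape \(\bigcup_i \bigcap_j t_{i,j}\):

```python
def nnf(e):
    # Push negations down to (possibly negated) atoms, using De Morgan;
    # negated Delta/nabla atoms are assumed rewritten by the identities above.
    if e[0] in ('atom', 'natom'):
        return e
    if e[0] == 'not':
        s = e[1]
        if s[0] == 'atom':
            return ('natom', s[1])
        if s[0] == 'natom':
            return ('atom', s[1])
        if s[0] == 'not':
            return nnf(s[1])
        dual = 'and' if s[0] == 'or' else 'or'   # De Morgan
        return (dual, nnf(('not', s[1])), nnf(('not', s[2])))
    return (e[0], nnf(e[1]), nnf(e[2]))

def flat(e):
    # Flat normal form: a list of conjuncts, each a list of literals.
    e = nnf(e)
    if e[0] in ('atom', 'natom'):
        return [[e]]
    if e[0] == 'or':
        return flat(e[1]) + flat(e[2])
    return [l + r for l in flat(e[1]) for r in flat(e[2])]  # distribute
```

For instance, \((T_1 \cup T_2) \cap T_3\) flattens to \((T_1 \cap T_3) \cup (T_2 \cap T_3)\).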

Theorem 23

For every FO formula \(\varphi \) on an OP alphabet \((\varSigma , M)\) there is a star-free OPE E on M such that \(L_M(E) = L(\varphi )\).

Proof

Consider the formula \(\varphi \) and its set of first-order variables: like in Sect. 3, \(B_p = \varSigma \times \{0,1\}^p\) (the q components are absent, \(\varphi \) being a first-order formula), and \(K_p\) is the set of strings in which each component encoding a variable contains exactly one 1.

First, \(K_p\) is star-free: \(K_p = \bigcap _{1 \le i \le p} (B_p^* C_i B_p^* - B_p^* C_i B_p^* C_i B_p^*).\)

Disjunction and negation are naturally translated into \(\cup \) and \(\lnot \); \(\pmb x_i < \pmb x_j\) is covered by the star-free OPE \(K_p \,\cap \, B^*_p C_i B^*_p C_j B^*_p\). The \(\pmb x_i \curvearrowright \pmb x_j\) formula is like in the second order case, i.e. is translated into \(B_{p}^* C_i [ B_{p}^+ ] C_j B_{p}^*\), which is star-free.

For the existential quantification, the problem is that star-free (OP and regular) languages are not closed under projections. Like in the regular case, the idea is to leverage the encoding of the evaluation of first-order variables: since there is only one position in which the quantified component is 1 (see \(K_p\)), we can use the bijective renamings \(\pi _0(a,v_1, v_2, ..., v_{p-1}, 0) = (a,v_1, v_2, ..., v_{p-1})\) and \(\pi _1(a,v_1, v_2, ..., v_{p-1}, 1) = (a,v_1, v_2, ..., v_{p-1})\), where the last component is the one encoding the quantified variable. Notice that the bijective renaming does not change the \(\varSigma \) component of the symbol, thus maintaining all the OPRs.
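A sketch of this step in Python (ours; the helper name `exists_project` is hypothetical): the renamings \(\pi_0, \pi_1\) simply drop the last component, and existential quantification keeps exactly the encodings where that component holds at one position:

```python
def exists_project(words):
    # Semantics of "exists x": keep the words whose last component (the one
    # encoding x) is 1 at exactly one position, then erase it via the
    # bijective renamings pi_0 / pi_1. The Sigma component is untouched,
    # so all operator precedence relations are preserved.
    return [
        [sym[:-1] for sym in w]          # pi_0 / pi_1: drop last component
        for w in words
        if sum(sym[-1] for sym in w) == 1
    ]
```

For example, an encoding whose last components are (0, 1) survives and is renamed, while one with all zeros (no position for the quantified variable) is discarded.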

Let \(E_\varphi \) be the star-free OPE on the alphabet \(B_p\) for the formula \(\varphi \), with x a free variable in it. Let us assume w.l.o.g. that the evaluation of x is encoded by the last component of \(B_p\); let \(B = \varSigma \,\times \, \{0,1\}^{p-1} \,\times \, \{0\}\), and \(A = \varSigma \, \times \,\{0,1\}^{p-1} \,\times \, \{1\}\).

The OPE for \(\exists x \varphi \) is obtained from the OPE for \(\varphi \) through the bijective renaming \(\pi \), and considering all the cases in which the symbol from A can occur.

First, let \(E'\) be an OPE in flat normal form, equivalent to \(E_\varphi \) (Lemma 22). The FO semantics is such that \(L(\varphi ) = L_M(E') = L_M(E') \,\cap \, B^* A B^*\).

By construction, \(E'\) is a union of intersections of elements \(L_{i,j} a_{i,j} \varDelta b_{i,j} R_{i,j}\), or \(L_{i,j} a_{i,j} \nabla b_{i,j}\) \(R_{i,j}\), or \(H_{i,j}\), where \(a_{i,j}\), \(b_{i,j} \in \varSigma \), and \(L_{i,j}\), \(R_{i,j}\), \(H_{i,j}\) are star-free regular languages.

In the intersection of \(E'\) and \(B^* A B^*\), all the possible cases in which the symbol in A can occur in \(E'\)’s terms must be considered: e.g. in \(L_{i,j} a_{i,j} \varDelta b_{i,j} R_{i,j}\) it could occur in the \(L_{i,j}\) prefix, or in \(a_{i,j} \varDelta b_{i,j}\), or in \(R_{i,j}\). More precisely, \(L_{i,j} a_{i,j} \varDelta b_{i,j} R_{i,j} \cap B^* A B^*\) \(=\) \((L_{i,j} \cap B^* A B^*) a_{i,j} \varDelta b_{i,j} R_{i,j}\) \(\cup \) \(L_{i,j}\) \((a_{i,j} \varDelta b_{i,j}\) \(\,\cap \, B^* A B^*)\) \(R_{i,j} \cup L_{i,j} a_{i,j} \varDelta \) \(b_{i,j} (R_{i,j} \cap B^* A B^*)\) (the \(\nabla \) case is analogous, \(H_{i,j}\) is immediate, being regular star-free).

The cases in which the symbol from A occurs in \(L_{i,j}\) or \(R_{i,j}\) are easy, because they are by construction regular star-free languages, hence we can use one of the standard regular approaches found in the literature (e.g. by using the splitting lemma in [16]). The only differences are in the factors \(a_{i,j} \varDelta b_{i,j}\), or \(a_{i,j} \nabla b_{i,j}\).

Let us consider the case \(a_{i,j} \varDelta b_{i,j} \cap B^* A B^*\). The cases \(a_{i,j} \in A\) or \(b_{i,j} \in A\) are like \((L_{i,j} \cap B^* A B^*)\) and \((R_{i,j} \cap B^* A B^*)\), respectively, because \(L_{i,j} a_{i,j}\) and \(b_{i,j} R_{i,j}\) are also regular star-free (\(\nabla \) is analogous).

The remaining cases are \(a_{i,j} \varDelta b_{i,j} \cap B^+ A B^+\) and \(a_{i,j} \nabla b_{i,j} \cap B^+ A B^+\). By definition of \(\varDelta \), \(a_{i,j} \varDelta b_{i,j} \cap B^+ A B^+ = a_{i,j} [B^* A B^*] b_{i,j}\), and its bijective renaming is \(\pi _0( a_{i,j} ) [ \pi _0(B^*) \pi _1(A)\) \(\pi _0( B^* )] \pi _0(b_{i,j})\) \( = a'_{i,j} [B_{p-1}^+] b'_{i,j}\), where \(\pi _0(a_{i,j}) = a'_{i,j}\), and \(\pi _0(b_{i,j}) = b'_{i,j}\), which is a star-free OPE. By definition of \(\nabla \), \(a_{i,j} \nabla b_{i,j} \cap B^+ A B^+ =\) \(\lnot ( a_{i,j} [B_p^+] b_{i,j} ) \cap a_{i,j} B_p^+ b_{i,j} \cap B^+ A B^+ = \) \(\lnot ( a_{i,j} [B_p^+] b_{i,j} ) \,\cap \, a_{i,j} B^* A B^* b_{i,j}\). Hence, its renaming is \(\lnot ( \pi _0(a_{i,j})[\pi _0(B_p^*) \pi _1(B_p)\pi _0(B_p^*) ]\) \(\pi _0(b_{i,j}) )\) \( \,\cap \,\pi _0(a_{i,j} B^*) \pi _1(A) \pi _0(B^*\) \(b_{i,j}) = \) \(\lnot ( a'_{i,j} [B_{p-1}^+] b'_{i,j} ) \,\cap a'_{i,j} B_{p-1}^+ b'_{i,j}\), a star-free OPE.   \(\square \)

Theorem 24

For every star-free OPE E on an OP alphabet \((\varSigma , M)\), there is a FO formula \(\varphi \) on \((\varSigma , M)\) such that \(L_M(E) = L(\varphi )\).

Proof

The proof is by induction on E’s structure. Of course, singletons are easily first-order definable; negation and union are rendered by \(\lnot \) and \(\vee \) in the natural way. Like in the case of star-free regular languages, concatenation is less immediate, and it is based on formula relativization. Consider two FO formulas \(\varphi \) and \(\psi \), assume w.l.o.g. that their variables are disjoint, and let \(\pmb x\) be a variable not used in either of them. To construct a relativized variant of \(\varphi \), called \(\varphi _{< \pmb x}\), proceed from the outermost quantifier, going inward, and replace every subformula \(\exists \pmb z \lambda \) with \(\exists \pmb z ((\pmb z < \pmb x) \wedge \lambda )\). Variants \(\varphi _{\ge \pmb x}\) and \(\varphi _{> \pmb x}\) are analogous. We also call \(\varphi _{\pmb x, \pmb y}\) the relativization where quantifications \(\exists \pmb z \lambda \) are replaced by \(\exists \pmb z ((\pmb x< \pmb z < \pmb y) \,\wedge \,\lambda )\). The language \(L(\varphi ) \cdot L(\psi )\) is defined by \(\exists \pmb x (\varphi _{< \pmb x} \,\wedge \, \psi _{\ge \pmb x})\) if \(\varepsilon \not \in L(\psi )\); otherwise by \(\exists \pmb x (\varphi _{< \pmb x} \,\wedge \,\psi _{\ge \pmb x}) \vee \varphi \).   \(\square \)
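The relativization step is purely syntactic and can be sketched in Python (our illustration; formulas are represented as nested tuples, a hypothetical encoding not used in the paper):

```python
def relativize(phi, x, cmp):
    # Replace every quantification ('exists', z, body) with
    # ('exists', z, ('and', (cmp, z, x), body)), where cmp is '<'
    # for phi_{<x} or '>=' for psi_{>=x}.
    if phi[0] == 'exists':
        _, z, body = phi
        return ('exists', z, ('and', (cmp, z, x), relativize(body, x, cmp)))
    if phi[0] in ('and', 'or'):
        return (phi[0], relativize(phi[1], x, cmp), relativize(phi[2], x, cmp))
    if phi[0] == 'not':
        return ('not', relativize(phi[1], x, cmp))
    return phi  # atomic formula: left unchanged

def concat_formula(phi, psi, x='x'):
    # FO formula for L(phi) . L(psi), assuming epsilon not in L(psi).
    return ('exists', x, ('and', relativize(phi, x, '<'), relativize(psi, x, '>=')))
```

Thus \(\exists z\, a(z)\) relativized by \(< x\) becomes \(\exists z\, (z < x \wedge a(z))\), exactly as prescribed above.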

The last part we need to consider is the fence operation, i.e. a[E]b. Let \(\varphi \) be a FO formula such that \(L(\varphi ) = L_M(E)\), for a star-free OPE E. Let \(\pmb x\) and \(\pmb y\) be two variables unused in \(\varphi \). Then the language L(a[E]b) is the one defined by \(\exists \pmb x \exists \pmb y (a(\pmb x) \,\wedge \, b(\pmb y) \,\wedge \,\pmb x \curvearrowright \pmb y \,\wedge \, \varphi _{\pmb x, \pmb y})\).   \(\square \)

5.1 Linear Noncounting OPLs Are First-Order Definable

Definition 25

Let \(G = (\varSigma , V_N, P, S)\) be a linear grammar, i.e., a grammar where all rule rhs \( \in \varSigma ^+ V_N \varSigma ^* \,\cup \, \varSigma ^* V_N \varSigma ^+\, \cup \,\varSigma ^+\). The finite state stencil automaton associated to G is \(\mathcal {A}^\Xi _G = (Q, \Xi , \delta , S, \{q_F\})\), where \(Q = V_N \cup \{ q_F \}\) are its states, \(\Xi \subseteq (\varSigma ^* \times \varSigma ^*) \cup \varSigma ^+\), its input alphabet, is the set of stencils of G,Footnote 4 and the transition relation \(\delta \subseteq Q \,\times \, \Xi \,\times \, Q\) is defined as follows: \((A, (x, y), B) \in \delta \) iff \(A \rightarrow x B y \in P\), and \((A, z, q_F) \in \delta \) iff \(A \rightarrow z \in P\).

Lemma 26

Let G be a linear OPG, with OPM M, and \(\Xi \) the set of stencils of G. L(G) is NC iff \(L(\mathcal {A}^\Xi _G)\) is a NC regular language.

Proof

By construction, every derivation \(A_0 \Rightarrow x_1 A_1 y_1 \Rightarrow x_1 x_2 A_2 y_2 y_1 \Rightarrow \ldots \Rightarrow x_1 \ldots x_k A_k\ y_k y_{k-1} \ldots y_1 \Rightarrow x_1 \ldots x_k z y_k y_{k-1} \ldots y_1\) of G corresponds to a run \( A_0 \vdash A_1 \vdash \ldots \vdash A_k \vdash q_F \) of \(\mathcal {A}^\Xi _G\), reading the word \((x_1, y_1) (x_2, y_2) \ldots (x_k, y_k) z\), and vice versa.

According to Definitions 10 and 12, if \(x_1 x_2 \ldots x_k z y_k y_{k-1} \ldots y_1 = o u^n wv^n p\) for some strings o, u, w, v, p, then \(o u^{n+1} wv^{n+1} p \in L(G)\); i.e., assuming that \(x_i \ldots x_j = u\) and \(y_j \ldots y_i = v\), \(o u^{n+1} wv^{n+1} p = x_1 x_2 \ldots (x_i \ldots x_j)^2 x_{j+1} \ldots x_k z y_k y_{k-1} \ldots (y_j \ldots y_i)^2 y_{i-1} \ldots y_1\).

Hence, \((x_1, y_1) (x_2, y_2) \ldots (x_k, y_k) z = o' u'^{n} w'\), where \(u' = (x_i, y_i) \ldots (x_j, y_j)\), and \((x_1, y_1) \ldots ( (x_{i},y_{i}) \ldots (x_j,y_j) )^2 (x_{j+1}, y_{j+1}) \ldots (x_k, y_k) z = o' u'^{n+1} w' \in L(\mathcal {A}^\Xi _G)\). The other direction is analogous.   \(\square \)
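The derivation-to-run correspondence underlying the lemma can be sketched in Python (our hypothetical encoding: a linear derivation is a list of pair rules \((A, x, B, y)\), meaning \(A \rightarrow x B y\), closed by a terminal rule \((A, z)\)):

```python
def derivation_to_run_input(derivation):
    # From a linear derivation, compute both the derived terminal string
    # x1 ... xk z yk ... y1 and the stencil word (x1,y1)...(xk,yk) z read
    # by the corresponding run of the stencil automaton.
    left, right, stencils = '', '', []
    for (_, x, _, y) in derivation[:-1]:
        left += x            # x parts accumulate on the left
        right = y + right    # y parts accumulate, innermost first
        stencils.append((x, y))
    z = derivation[-1][1]
    return left + z + right, stencils + [z]
```

E.g. the derivation \(S \Rightarrow aAb \Rightarrow acAdb \Rightarrow aczdb\) yields the stencil word \((a,b)(c,d)z\), making the pumping correspondence of the proof visible: repeating a factor of the stencil word repeats the matching \(u\) and \(v\) factors of the derived string.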

Theorem 27

Let G be a linear NC OPG with OPM M; then L(G) is expressible in FO\(_{(\varSigma ,M)}\).

Proof

From Lemma 26, we can build \(\mathcal {A}^\Xi _G\), which defines a NC regular language. It is known from [28] that there exists an equivalent first-order formula \(\varphi (\mathcal {A}^\Xi _G)\) on the alphabet \(\Xi \).

We can build a FO\(_{(\varSigma ,M)}\) formula defining L(G) in this way. First, for each \(p \in \Xi \), where \(p = (u, v)\), \(u = u_1 u_2 \ldots u_m\), \(v = v_1 v_2 \ldots v_n\), \(m, n \ge 0\), \(m+n>0\), we introduce the following formula \(\sigma _p\):

$$ \sigma _p({\varvec{{x}}}) := \left( \begin{array}{c} u_1({\varvec{{x}}}) \,\wedge \,u_2({\varvec{{x}}} + 1) \,\wedge \, \\ \ldots \wedge \, u_m({\varvec{{x}}} + m - 1) \end{array} \right) \,\wedge \, \exists {\varvec{{y}}} \left( \begin{array}{c} {\varvec{{x}}}+m-1 \curvearrowright {\varvec{{y}}} \,\wedge \, v_1({\varvec{{y}}}) \,\wedge \, v_2({\varvec{{y}}} + 1) \,\wedge \\ \ldots \wedge \,v_n({\varvec{{y}}} + n - 1) \end{array} \right) $$

(The case \(p \in \varSigma ^+\) is trivial.) E.g., for \(p = (\varepsilon , ab)\), \(\sigma _p({\varvec{{x}}}) = \exists {\varvec{{y}}} ({\varvec{{x}}} \curvearrowright {\varvec{{y}}} \wedge a({\varvec{{y}}}) \wedge b({\varvec{{y}}} + 1))\).

It is easy to see that \(\sigma _p\) is a straightforward way to encode a stencil p into a FO\(_{(\varSigma ,M)}\) formula defining its structure.

Let us consider \(\varphi (\mathcal {A}^\Xi _G)\), and obtain a trivially equivalent formula \(\varphi '(\mathcal {A}^\Xi _G)\) by substituting each quantified subformula \(\exists {\varvec{{x}}} (\rho )\) in it with \(\exists {\varvec{{x}}} \left( \left( \bigvee _{p \in \Xi } p({\varvec{{x}}}) \right) \,\wedge \,\rho \right) .\) Hence, we are stating that each quantified variable in \(\varphi (\mathcal {A}^\Xi _G)\) corresponds to a position in which there is a stencil.

To obtain a FO\(_{(\varSigma ,M)}\) formula \(\xi \), such that \(L(\xi ) = L(G)\), we take \(\varphi '(\mathcal {A}^\Xi _G)\), and substitute in it every subformula \(p({\varvec{{x}}})\) with \(\sigma _p({\varvec{{x}}})\), for each \(p \in \Xi \), and variable \({\varvec{{x}}}\).   \(\square \)
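The mechanical construction of \(\sigma_p\) can be sketched in Python (our illustration; the formula is built as a plain string, and, as in the paper's example, when u is empty the \(\curvearrowright\) relation is anchored at \({\varvec{x}}\) itself):

```python
def pos(base, off):
    # Render a position term: "x" for offset 0, "x+k" otherwise.
    return base if off == 0 else f"{base}+{off}"

def sigma_p(u, v, x='x', y='y'):
    # Build sigma_p for a stencil p = (u, v): the letters of u label
    # consecutive positions from x, and a position y with x+|u|-1 ↷ y
    # (just x when u is empty) starts the v part.
    parts = [f"{u[i]}({pos(x, i)})" for i in range(len(u))]
    if v:
        anchor = pos(x, len(u) - 1) if u else x
        vparts = " ∧ ".join(f"{v[j]}({pos(y, j)})" for j in range(len(v)))
        parts.append(f"∃{y} ({anchor} ↷ {y} ∧ {vparts})")
    return " ∧ ".join(parts)
```

For the stencil \(p = (\varepsilon, ab)\) this reproduces the example above, \(\exists y\,(x \curvearrowright y \wedge a(y) \wedge b(y+1))\).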

6 Conclusion

To the best of our knowledge OPLs are the largest family among the many families of structured CFLs to enjoy closure w.r.t. most fundamental language operations and to be characterized in terms of an MSO logic that naturally extends the classic one for regular languages. In this paper we have introduced OPEs as an extension of Kleene’s REs. We have shown that languages defined by OPEs coincide with OPLs and that FO-definable OPLs are those defined by SF OPEs and are NC, in sharp contrast with comparable results framed in the context of tree-languages [21, 35]. Together with previous partial results [15, 24] and the fact that linear NC OPLs are first-order definable (Sect. 5.1), these results support our conjecture that OPLs, jointly with our MSO and FO logics, perfectly extend the classic results on regular languages. Figure 4 summarizes past, present and “future” results on OPLs and their logics.

Fig. 4.
figure 4

The relations among the various characterizations of OPLs and their aperiodic subclass.

The figure immediately suggests a first further research step, i.e., making the internal triangle a square, as well as the external one: we conjecture that, once the concept of NC OPLs has been put in the appropriate framework, a further characterization thereof in terms of a suitable subclass of OPAs should be possible, but we have not yet pursued this option.

The further goal that we wish to pursue is the complete reproduction of the historical path that, for regular languages, led from the first characterization in terms of MSO logic, to the restricted FO characterization of NC regular languages, to temporal logic, which in turn is first-order complete, and, ultimately, to the success of model checking techniques.

Some proposals extending the classical linear- or branching-time temporal logics to cope with the typical nesting structure of CF languages have already been offered in the literature. E.g., [1, 4, 6] present different temporal logics extended to deal with VPLs; they also prove FO-completeness of such logics but do not address the relation between the FO and MSO versions of their logics; see also [3]. A first example of temporal logic for OPLs and a related model checking algorithm have also been provided in [9].

Given that most, if not all, of the CF languages for practical applications are aperiodic, the final goal of building verification tools that cover a much wider application field than that of regular languages (and of VPLs too) does not seem unreachable.