1 Introduction

Since being introduced by Kleene [23] in 1956, regular expressions have developed into a central device of theoretical and applied computer science. On one side, research into the theoretical properties of regular expressions, in particular various aspects of their complexity, is still a very active area of investigation (see Holzer and Kutrib [20] for a survey with numerous recent references). On the other side, almost all modern programming languages offer regular expression matching in their standard libraries or application frameworks, and most text editors allow the use of regular expressions for search and replacement functionality.

But, due to practical considerations (cf. Friedl [16]), most modern matching engines have evolved to use an extension to regular expressions that allows the user to specify non-regular languages. In addition to the features of regular expressions as they are mostly studied in theory (which we, from now on, call proper regular expressions), and apart from the (regularity preserving) “syntactic sugar” that most implementations use, these extended regular expressions contain backreferences, also called variables, which specify repetitions that increase the expressive power beyond the class of regular languages. For example, the (non-regular) language L={ww∣w∈{a,b}∗} is generated by the extended regular expression \(\alpha :=( (\mathtt {a}\mid \mathtt {b})^{*} ) \%x\: x\).

This expression can be understood as follows (for a more formal treatment, see Definition 4): For any expression β, (β)%x matches the same words as β, and binds the matched word to the variable x. In the case of this example, the subexpression ((a∣b)∗)%x can be matched to any word w∈{a,b}∗, and when it is matched to w, the variable x is assigned the value w. Any further occurrence of x repeats w, leading to the language of all words of the form ww with w∈{a,b}∗. Analogously, the expression \(( (\mathtt {a}\mid \mathtt {b})^{*} ) \%x\: xx\) generates the language of all www with w∈{a,b}∗.
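The behavior of such an expression can be tried out with a real-world matching engine; the following sketch uses Python's re module, whose named groups play a role analogous to %x (the correspondence is made precise in Sect. 2.4):

```python
import re

# (?P<x>...) binds the match to x, and (?P=x) repeats it: the analogue
# of ((a|b)*)%x x from the example above.
ww = re.compile(r"(?P<x>(?:a|b)*)(?P=x)")

print(bool(ww.fullmatch("abab")))  # True: w = ab
print(bool(ww.fullmatch("aba")))   # False: odd length, not of the form ww
```

Note that fullmatch is used so that the expression must generate the entire input word, matching the language-generator semantics used in this paper.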

Although this ability to specify repetitions is used in almost every modern matching engine (e.g., the programming languages PERL and Python), the implementations differ in various details, even between two versions of the same implementation of a programming language (for some examples, see Câmpeanu and Santean [7]). Nonetheless, there is a common core to these variants, which was first formalized by Aho [2]. Later, Câmpeanu et al. [9] introduced a different formalization that is closer to the real world syntax, which addresses some questions of semantics that were implicitly left open in [2]. In addition to this, the pattern expressions by Câmpeanu and Yu [8] and the H-expressions by Bordihn et al. [5] use comparable repetition mechanisms and possess similar expressive power.

Still, theoretical investigation of extended regular expressions has been comparatively rare (in particular when compared to their more prominent subclass); see e.g., Larsen [25], Della Penna et al. [14], Câmpeanu and Santean [7], Carle and Narendran [10], and Reidenbach and Schmid [30].

In contrast to their widespread use in various applications, extended regular expressions have some undesirable properties. Most importantly, their membership problem (the question whether an expression matches a word) is NP-complete (cf. Aho [2]); the exponential part in the best known upper bounds depends on the number of different variables in the expression. Of course, this compares unfavorably to the efficiently decidable membership problem of proper regular expressions (cf. Aho [2]). On the other hand, there are cases where extended regular expressions express regular languages far more succinctly than proper regular expressions. Consider the following example:

Example 1

For n≥1, let L n :={www∣w∈{a,b}+,|w|=n}. These languages L n are finite and, hence, regular. For the sake of this example, we define the length of an extended regular expression α as the total number of symbols that occur in α (in the literature, this measure is often called size, cf. [21]).

With some effort, one can prove that every proper regular expression for L n has length at least exponential in n, e.g., by using the technique of Glaister and Shallit [17] to prove that every NFA for L n requires at least 2^n states. Due to the construction used in the proof of Theorem 2.3 in Hopcroft and Ullman [22], this also gives a lower bound on the length of regular expressions for L n .

In contrast to this, L n is generated by the extended regular expression

$$\alpha_n:=(\underbrace{(\mathtt {a}\mid \mathtt {b})\cdots(\mathtt {a}\mid \mathtt {b})}_{\text{$n$ times $(\mathtt {a}\mid \mathtt {b})$}})\%x\: xx,$$

which is of a length that is linear in n.
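The linear growth of α n can be observed directly by building an analogous real-world regex programmatically (a sketch using Python's numbered backreferences in place of %x; the helper name alpha_n is ours):

```python
import re

def alpha_n(n: int) -> re.Pattern:
    # n copies of (a|b), captured as group 1, followed by two repetitions:
    # the analogue of ((a|b)...(a|b))%x xx.  The pattern string has length
    # linear in n.
    return re.compile("(" + "(?:a|b)" * n + ")" + r"\1\1")

p = alpha_n(3)
print(bool(p.fullmatch("abaabaaba")))  # True: www with w = aba
print(bool(p.fullmatch("abaabaabb")))  # False: third block differs
```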

Due to the repetitive nature of the words of languages L n in Example 1, it is not surprising that the use of variables provides a shorter description of L n . The following example might be considered less straightforward:

Example 2

Consider the expression \(\alpha :=(\mathtt {a}\mid \mathtt {b})^{*} ((\mathtt {a}\mid \mathtt {b})^{+})\%x\:x (\mathtt {a}\mid \mathtt {b})^{*}\). It is a well-known fact that every word w∈{a,b}∗ with |w|≥4 can be expressed in the form w=uxxv, with u,v∈{a,b}∗ and x∈{a,b}+ (as is easily verified by examining all four-letter words). Thus, the expression α matches all but finitely many words; hence, its language L(α) is regular.
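The combinatorial fact used here, that every binary word of length at least four contains a square xx, can be verified by brute force; the following sketch checks all sixteen words of length four (which suffices, as every longer word contains a length-four factor):

```python
from itertools import product

def contains_square(w: str) -> bool:
    # Does w contain a factor of the form xx with x nonempty?
    return any(w[i:i + k] == w[i + k:i + 2 * k]
               for k in range(1, len(w) // 2 + 1)
               for i in range(len(w) - 2 * k + 1))

# Every word of length 4 over {a, b} contains a square ...
assert all(contains_square("".join(w)) for w in product("ab", repeat=4))
# ... while some shorter words, such as aba, are square-free.
assert not contains_square("aba")
```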

Example 2 demonstrates that the use of variables can lead to languages that are (non-trivially) regular. The phenomenon that an expression like α can generate a cofinite language is strongly related to the notion of avoidable patterns (cf. Cassaigne [11]), and involves some very hard combinatorial questions (in particular, Example 2 illustrates this connection for the pattern xx over a binary alphabet).

We observe that extended regular expressions can be used to express regular languages more succinctly than proper regular expressions do, and that it might be hard to convert an extended regular expression into a proper regular expression for the same language.

The two central questions studied in the present paper are as follows: First, how hard is it to minimize extended regular expressions (both with respect to their length, and with respect to the number of variables they contain), and second, how succinctly can extended regular expressions describe regular languages? These natural questions are also motivated by practical concerns: If a given application reuses an expression many times, it might pay off to invest resources in the search for an expression that is shorter, or uses fewer variables, and thus can be matched more efficiently.

We approach this question through related decidability problems (e.g., the universality problem) and by studying lower bounds on the tradeoff between the size of extended regular expressions and proper regular expressions.

The main contribution of the present paper is the proof that all these decision problems are undecidable (some are not even semi-decidable), even for extended regular expressions that use only a single variable. Thus, while bounding the number of variables in extended regular expressions (or, more precisely, the number of variable bindings) reduces the complexity of the membership problem from NP-complete to polynomial (cf. Aho [2]), we show that extending proper regular expressions with only a single variable already results in undecidability of various problems.

As a consequence, extended regular expressions cannot be minimized effectively, and the tradeoff between extended and proper regular expressions is not bounded by any recursive function (a so-called non-recursive tradeoff, cf. Sect. 2.3 for additional context). Thus, although the use of the “right” extended regular expression for a regular language might offer arbitrary advantages in size (and, hence, parsing speed), these optimal expressions cannot be found effectively. These results highlight the power of the variable mechanism, and demonstrate that different restrictions than the number of variables ought to be considered.

The structure of the further parts of the paper is as follows: In Sect. 2, we introduce most of the technical preliminaries. Section 3 consists of Theorem 10 (the main undecidability result), its proof, and the required additional technical preliminaries, while Sect. 4 discusses the consequences and some extensions of Theorem 10.

2 Preliminaries

This paper is largely self-contained. Unexplained notions can be found in Hopcroft and Ullman [22], Cutland [13], and Minsky [27].

2.1 Basic Definitions

Let ℕ be the set of natural numbers, including 0. The function \(\operatorname {div}\) denotes integer division, and \(\operatorname {mod}\) denotes its remainder (e.g., \(5\operatorname {div}3=1\) and \(5\operatorname {mod}3=2\)). The symbol ∞ denotes infinity.

The symbols ⊆, ⊂, ⊇ and ⊃ refer to the subset, proper subset, superset and proper superset relation, respectively. The symbol ∅ denotes the empty set, and ∖ denotes the set difference (defined by A∖B:={x∈A∣x∉B}). For every set A, \(\mathcal{P}(A)\) denotes the power set of A.

We denote the empty string by λ. For the concatenation of two strings w 1 and w 2, we write w 1⋅w 2 or simply w 1 w 2. We say a string v∈A∗ is a factor of a string w∈A∗ if there are u 1,u 2∈A∗ such that w=u 1 vu 2. The notation |K| stands for the size of a set K or the length of a string K.

If A is an alphabet, a (one-sided) infinite word over A is an infinite sequence \(w=(w_{i})_{i=0}^{\infty}\) with w i ∈A for every i≥0. We denote the set of all one-sided infinite words over A by A ω and, for every a∈A, let a ω denote the word \(w=(w_{i})_{i=0}^{\infty}\) with w i =a for every i≥0. We shall only deal with infinite words w∈A ω that have the form \(w=u\:a^{\omega}\) with u∈A∗ and a∈A. Concatenation of words and infinite words is defined canonically: For every u∈A∗ and every v∈A ω with \(v=(v_{i})_{i=0}^{\infty }\), uv:=w∈A ω, where w 0⋅⋯⋅w |u|−1=u and w i+|u|=v i for every i≥0, while vu is undefined. In particular, note that \(a\:a^{\omega}=a^{\omega}\) for every a∈A.

2.2 Extended Regular Expressions

We now introduce syntax and semantics of extended regular expressions. Apart from some changes in terminology, the following definition of syntax is due to Aho [2]:

Definition 3

Let Σ be an infinite set of terminals, let X be an infinite set of variables, and let the set of metacharacters consist of λ, (, ), ∣, ∗, and %, where all three sets are pairwise disjoint. We define the set of extended regular expressions to be the smallest set that satisfies the following conditions:

  1. 1.

    Every a∈Σ∪X∪{λ} is an extended regular expression.

  2. 2.

    If α 1 and α 2 are extended regular expressions, then

    1. (a)

      (α 1)(α 2) (concatenation),

    2. (b)

      (α 1)∣(α 2) (alternation),

    3. (c)

      (α 1)∗ (Kleene star)

    are extended regular expressions.

  3. 3.

    For every extended regular expression α and every variable xX such that %x is not a factor of α, (α)%x is an extended regular expression (variable binding).

We denote the set of all extended regular expressions by \(\operatorname {RegEx}\). A proper regular expression is an extended regular expression that contains neither %, nor any variable (hence, proper regular expressions are those expressions that are commonly called “regular expressions” in theoretical computer science).

If an extended regular expression β is a factor of an extended regular expression α, we call β a subexpression of α. We denote the set of all subexpressions of α by \(\operatorname {SUB}(\alpha)\).

We shall use the notation (α)+ as a shorthand for α(α)∗, and freely omit parentheses whenever the meaning remains unambiguous. When doing so, we assume the following precedence of operations: ∗ and + bind more tightly than concatenation, which in turn binds more tightly than the alternation operator ∣.

In Aho [2], an informal definition of the semantics of extended regular expressions is given. In Aho’s approach, extended regular expressions are interpreted as language generators in the following way: An extended regular expression α is interpreted from left to right. A subexpression of the form (β)%x generates the same language as the expression β; in addition to this, the variable x is bound to the word w that was generated from β (if x already has a value, that value is overwritten). Every occurrence of x that is not in the context of a variable binding is then replaced with w.

When following this approach, there are some cases where the semantics are underspecified. For example, Aho [2] does not explicitly address the rebinding of variables (cf. Example 5, further down), and the semantics of expressions like ((a)%xb)x are unclear.

Although the proofs in the present paper are not affected by the ambiguities that arise from the informal approach, we include a formal definition of the semantics, which is an adaption of the semantics of Câmpeanu et al. [9] to the syntax from Definition 3:

Definition 4

A match tree of an extended regular expression α is a finite (directed, ordered) tree T α , where the nodes of T α are labeled with elements of \(\varSigma^{*}\times \operatorname {SUB}(\alpha )\), and T α is constructed according to the following rules:

  1. 1.

    The root of T α is labeled with some (w,α), w∈Σ∗.

  2. 2.

    If a node v of T α is labeled by some (w,a) with a∈Σ∪{λ}, then v is a leaf, and w=a holds.

  3. 3.

    If a node v of T α is labeled by some (w,β) with β=(β 1)(β 2), then v has exactly two children v 1 and v 2 (as left and right child, respectively), with respective labels (w 1,β 1) and (w 2,β 2), where w 1,w 2∈Σ∗, and w=w 1 w 2.

  4. 4.

    If a node v of T α is labeled by some (w,β) with β=(β 1)∣(β 2), then v has exactly one child v′, that is labeled (w,β 1) or (w,β 2).

  5. 5.

    If a node v of T α is labeled by some (w,β) with β=(β 1)∗, we distinguish two cases:

    1. (a)

      If w=λ, v has exactly one child v 1 that is labeled (λ,β), and v 1 is a leaf of T α .

    2. (b)

      If w≠λ, v has k≥1 children that are labeled by (w 1,β 1),…,(w k ,β 1) (from left to right), where w 1,…,w k ∈Σ+ and w=w 1⋯w k .

  6. 6.

    If a node v of T α is labeled by some (w,β) with β=(β 1)%x, then v has exactly one child that is labeled (w,β 1).

  7. 7.

    If a node v of T α is labeled by some (w,x), where x is a variable, we let ≺ denote the post-order on the nodes of T α (that results from a left-to-right, depth-first traversal), and distinguish the following two cases:

    1. (a)

      If there is no node v 1 of T α with v 1≺v such that v 1 is labeled with some (w 1,(β 1)%x), then v is a leaf, and w=λ.

    2. (b)

      Otherwise, let v 1 denote the node with v 1≺v that is ≺-maximal among the nodes that have some (β′)%x as the second component of their label. Then v is a leaf, and w=w 1, where w 1 is the first component of the label of v 1.

We define the language L(α) that is generated by an extended regular expression α as

$$L(\alpha):=\bigl\{w\in\varSigma^*\mid \text{$(w,\alpha)$ labels the root of some match tree $T_{\alpha}$ of $\alpha$}\bigr\}.$$

Example 5

Consider the following extended regular expressions:

These expressions generate the following languages:

Note that in the case of α 4, which was pointed out by one of the anonymous referees, the semantics given in Definition 4 might be considered counterintuitive, as one could expect L(α 4)={a,aa}. The results in the present paper do not rely on such pathological cases, which shall be excluded by Definition 6 a little further down.

In general, the membership problem for \(\operatorname {RegEx}\) is NP-complete, as shown in Theorem 6.2 in Aho [2]. As explained in that proof, this problem is solvable in polynomial time if the number of different variables is bounded. It is not clear how (or if) Aho’s reasoning applies to expressions like α 2 in our Example 5; therefore, and in order to exclude problematic expressions like α 4 in Example 5, we formalize a slightly stronger restriction than Aho, and consider the following subclasses of \(\operatorname {RegEx}\):

Definition 6

For k≥0, let \(\operatorname {RegEx}(k)\) denote the class of all extended regular expressions α that satisfy the following properties:

  1. 1.

    α contains at most k occurrences of the metacharacter %,

  2. 2.

    if α contains a subexpression (β)∗, then the metacharacter % does not occur in β,

  3. 3.

    for every xX that occurs in α, α contains exactly one occurrence of %x.

Intuitively, these restrictions on extended regular expressions in \(\operatorname {RegEx}(k)\) limit not only the number of different variables, but also the total number of possible variable bindings, to at most k.

Note that \(\operatorname {RegEx}(0)\) is equivalent to the class of proper regular expressions; furthermore, observe that \(\operatorname {RegEx}(k)\subset \operatorname {RegEx}(k+1)\) for every k≥0.

Referring to the extended regular expressions given in Example 5, we observe that, as %x occurs twice in α 1, α 1 is not an element of any \(\operatorname {RegEx}(k)\) with k≥0, but the extended regular expression \(\alpha'_{1}:=( (\mathtt {a}\mid \mathtt {b})^{*} ) \%x\:xx\: ( (\mathtt {a}\mid \mathtt {b})^{*} ) \%y\:y\) generates the same language as α 1, and \(\alpha'_{1}\in(\operatorname {RegEx}(2)\setminus \operatorname {RegEx}(1))\). In contrast to this, \(\alpha_{2}\notin \operatorname {RegEx}(k)\) for all k≥0, as % occurs inside a (⋯)∗ subexpression (recall that we defined + through ∗).

For any k≥0, we say that a language L is a \(\operatorname {RegEx}(k)\) -language if there is some \(\alpha\in \operatorname {RegEx}(k)\) with L(α)=L.

We also consider the class \(\operatorname {FRegEx}\) of all extended regular expressions that do not use the operator ∗ (or +), and its subclasses

$$\operatorname {FRegEx}(k):=\operatorname {FRegEx}\cap \operatorname {RegEx}(k)$$

for k≥0. Thus, \(\operatorname {FRegEx}\) contains exactly those expressions that generate finite (and, hence, regular) languages. Analogously, for every k≥0, we define a class \(\operatorname {CoFRegEx}(k)\) as the class of all \(\alpha \in \operatorname {RegEx}(k)\) such that L(α) is cofinite. Unlike the classes \(\operatorname {FRegEx}(k)\), these classes have no straightforward syntactic definition—as we shall prove in Theorem 10, cofiniteness is not semi-decidable for \(\operatorname {RegEx}(k)\) (if k≥1).

2.3 Decision Problems and Descriptional Complexity

Most of the technical reasoning in the present paper is centered around the following decision problems:

Definition 7

Let Σ denote a fixed terminal alphabet. For all k,l≥0, we define the following decision problems for \(\operatorname {RegEx}(k)\):

Universality::

Given \(\alpha\in \operatorname {RegEx}(k)\), is L(α)=Σ∗?

Cofiniteness::

Given \(\alpha\in \operatorname {RegEx}(k)\), is Σ∗∖L(α) finite?

RegEx(l)-ity::

Given \(\alpha\in \operatorname {RegEx}(k)\), is there a \(\beta\in \operatorname {RegEx}(l)\) with L(α)=L(β)?

As we shall see, Theorem 10—one of our main technical results—states that these problems are undecidable (to various degrees). We use the undecidability of the universality problem to show that there is no effective procedure that minimizes extended regular expressions with respect to their length, and the undecidability of \(\operatorname {RegEx}(l)\)-ity to conclude the same for minimization with respect to the number of variables. Furthermore, cofiniteness and \(\operatorname {RegEx}(l)\)-ity help us to obtain various results on the relative succinctness of proper and extended regular expressions. As a side note, cofiniteness for \(\operatorname {RegEx}\) is a more general case of the question whether a pattern is avoidable over a fixed terminal alphabet, an important open problem in pattern avoidance (cf. Currie [12]).

By definition, \(\operatorname {RegEx}(l)\)-ity holds trivially for all \(\operatorname {RegEx}(k)\) with kl. If l=0, we mostly use the more convenient term regularity (for \(\operatorname {RegEx}(k)\)), instead of \(\operatorname {RegEx}(0)\)-ity. Note that, even for \(\operatorname {RegEx}(0)\), universality is already PSPACE-complete (see Aho et al. [3]).

In order to examine the relative succinctness of \(\operatorname {RegEx}(1)\) in comparison to \(\operatorname {RegEx}(0)\), we build on well-established notions from descriptional complexity (see Holzer and Kutrib [21] and the less recent Goldstine et al. [18] for survey articles on the area). In particular, we use the following notion of complexity measures, which is based on the more general notion of complexity measures for descriptional systems from [24]:

Definition 8

Let \(\mathcal{R}\) be a class of extended regular expressions. A complexity measure for \(\mathcal{R}\) is a total recursive function \(c: \mathcal{R}\to \mathbb {N}\) such that, for every alphabet Σ, the set of all \(\alpha\in\mathcal{R}\) with L(α)⊆Σ∗

  1. 1.

    can be effectively enumerated in order of increasing c(α), and

  2. 2.

    does not contain infinitely many extended regular expressions with the same value c(α).

This definition includes the canonical concept of the length (as used in Example 1), as well as most of its natural extensions—for example, in our context, one could define a complexity measure that gives additional weight to the number or distance of occurrences of variables, or their nesting level. Kutrib [24] provides more details on (and an extensive motivation of) complexity measures.

Using this definition, we are able to define the notion of tradeoffs between classes of extended regular expressions, which is again based on the more general definition from [24]:

Definition 9

Let k>l≥0 and let c be a complexity measure for \(\operatorname {RegEx}(k)\) (and thereby also for \(\operatorname {RegEx}(l)\)). A recursive function f c :ℕ→ℕ is said to be a recursive upper bound for the tradeoff between \(\operatorname {RegEx}(k)\) and \(\operatorname {RegEx}(l)\) if, for all those \(\alpha\in \operatorname {RegEx}(k)\) for which L(α) is a \(\operatorname {RegEx}(l)\)-language, there is a \(\beta\in \operatorname {RegEx}(l)\) with L(β)=L(α) and c(β)≤f c (c(α)).

If no recursive upper bound for the tradeoff between \(\operatorname {RegEx}(k)\) and \(\operatorname {RegEx}(l)\) exists, we say that the tradeoff between \(\operatorname {RegEx}(k)\) and \(\operatorname {RegEx}(l)\) is non-recursive.

The first non-recursive tradeoffs were demonstrated by Meyer and Fischer [26]; since then, there has been a considerable amount of results on a wide range of non-recursive tradeoffs between various description mechanisms. For a survey on non-recursive tradeoffs, see Kutrib [24].

2.4 Extended Regular Expressions in the Real World

In this section, we take a closer look at the relation between our definition of extended regular expressions and the variable-like mechanisms in the real-world regex dialects that inspired that definition. This section is meant to provide wider context for the theoretical model discussed in the present paper and can be skipped without loss of continuity.

While the syntax in Câmpeanu et al. [9] is based on the backreferences that can be found in Perl Compatible Regular Expressions, POSIX, and various related implementations (cf. Friedl [16]), our definition is syntactically closer to the named capture groups that originated from Python (ibid.). As an example, consider the extended regular expression

$$\alpha :=(\mathtt {a})\mid \bigl(\bigl(\mathtt {a}\mathtt {a}^+\bigr)\%x\:x^+ \bigr),$$

which is due to Abigail [1]. It is easy to see that this expression generates the language

$$L(\alpha)=\bigl\{\mathtt {a}^n\mid n\geq1, \text{ $n$ is not a prime number}\bigr\}.$$

In Python, the same language can be expressed using the expression

a|(?P<x>aa+)(?P=x)+

where the (prefix) operator ?P<x> acts like our (postfix) variable binding operator %x, while (?P=x) repeats that variable and corresponds to our use of x (without %). A similar syntax is used in .NET, see p. 137 in Friedl [16].
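As a sketch, this expression can be tested directly with Python's re module, confirming that it matches exactly the unary words of non-prime length:

```python
import re

# a^n for composite n (>= 4) is matched by the second branch: the named
# group x captures a^k with k >= 2, and (?P=x)+ appends one or more
# further copies of it; the first branch handles n = 1.
nonprime = re.compile(r"a|(?P<x>aa+)(?P=x)+")

matched = {n for n in range(1, 20) if nonprime.fullmatch("a" * n)}
print(sorted(matched))  # the non-primes below 20: [1, 4, 6, 8, 9, ...]
```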

Although this method of explicit naming and referencing of capture groups is also supported in newer versions of PERL (from version 5.10 onwards), traditionally, PERL allows only implicit naming of capture groups, using the aforementioned backrefences. Using these, α would be written

(a)|((aa+)\3+)

Here, the backreference \3 repeats the match of the third pair of parentheses (as defined by the third opening parenthesis when reading from the left). To the author’s knowledge, there is no valid way of expressing pathological examples like α 1 from Example 5 in these programming languages—in particular, the reuse of capture group names is forbidden. Nonetheless, all expressions from \(\operatorname {RegEx}(1)\) can be easily converted to each of these dialects. Hence, all “negative” results in the present paper apply not only to our theoretical model of \(\operatorname {RegEx}\), but to each of these real world regex dialects.
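The prohibition on reusing capture group names can be observed directly; the following sketch mirrors the structure of the \(\operatorname {RegEx}(2)\) expression α′ 1 from Sect. 2.2 and shows that renaming its second binding back to x is rejected at compile time:

```python
import re

# Two (?P<x>...) groups with the same name, as in the pathological
# expression alpha_1, are rejected by the compiler.
try:
    re.compile(r"(?P<x>(?:a|b)*)(?P=x)(?P=x)(?P<x>(?:a|b)*)(?P=x)")
    reuse_allowed = True
except re.error:
    reuse_allowed = False

print(reuse_allowed)  # False: group names cannot be redefined

# The variant with distinct names x and y (the analogue of alpha_1')
# compiles without complaint.
alpha1_prime = re.compile(r"(?P<x>(?:a|b)*)(?P=x)(?P=x)(?P<y>(?:a|b)*)(?P=y)")
```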

As an unrelated side note, the author wishes to point out that L(α) is an example of an extended regular expression over a unary terminal alphabet that generates a non-regular language (another example of such a language can be found in Carle and Narendran [10]).

3 The Main Theorem and Its Proof

As mentioned in Sect. 1, the central questions of the present paper are whether we can minimize extended regular expressions under any complexity measure as defined in Definition 8, or with respect to the number of variables, and whether there is a recursive upper bound on the tradeoff between extended and proper regular expressions. We approach these questions by proving various degrees of undecidability for the decision problems given in Definition 7, as shown in the main theorem of this section:

Theorem 10

For \(\operatorname {RegEx}(1)\), universality is not semi-decidable; and regularity and cofiniteness are neither semi-decidable, nor co-semi-decidable.

The proof of Theorem 10 requires considerable technical preparation and takes up the remainder of the present section. Readers who are more interested in implications and applications of Theorem 10 are invited to skip over to Sect. 4.

On a superficial level, we prove Theorem 10 by using Theorem 14 (which we introduce further down in the present section) to reduce various undecidable decision problems for Turing machines to appropriate problems for extended regular expressions (the problems from Definition 7). This is done by giving an effective procedure that, given a Turing machine \(\mathcal {M}\), returns an extended regular expression that generates the complement of a language that encodes all accepting runs of \(\mathcal {M}\).

On a less superficial level, this approach needs to deal with certain technical peculiarities that make it preferable to study a variation of the Turing machine model. Most importantly, when applied to “standard” Turing machines, the construction procedure and the encoding we shall use in the proof do not preserve the finiteness of the domain of the encoded machine. As the distinction between Turing machines with finite and infinite domain is a central element of the proofs further down, we introduce the notion of an extended Turing machine as an intermediate step in the construction of extended regular expressions from Turing machines.

An extended Turing machine is a 3-tuple \(\mathcal {X}=(Q,q_{1},\delta)\), where Q and q 1 denote the state set and the initial state. All extended Turing machines operate on the tape alphabet Γ:={0,1} and use 0 as the blank letter. The transition function δ is a function \(\delta: \varGamma\times Q\to(\varGamma\times\{L,R\}\times Q)\cup \{\operatorname {HALT}\} \cup(\{\operatorname {CHECK}_{\mathrm {R}}\}\times Q)\). The movement instructions L and R and the \(\operatorname {HALT}\)-instruction are interpreted canonically—if δ(a,q)=(b,M,p) for some M∈{L,R} (and a,bΓ, p,qQ), the machine replaces the symbol under the head (a) with b, moves the head to the left if M=L (or to the right if M=R), and enters state p. If \(\delta (a,q)=\operatorname {HALT}\), the machine halts and accepts.

The command \(\operatorname {CHECK}_{\mathrm {R}}\) works as follows: If \(\delta(a,q)=(\operatorname {CHECK}_{\mathrm {R}},p)\) for some pQ, \(\mathcal {X}\) immediately checks (without moving the head) whether the right side of the tape (i.e., the part of the tape that starts immediately to the right of the head) contains only the blank symbol 0. If this is the case, \(\mathcal {X}\) enters state p; but if the right side of the tape contains any occurrence of 1, \(\mathcal {X}\) remains in q. As the tape is never changed during a \(\operatorname {CHECK}_{\mathrm {R}}\)-instruction, this leads \(\mathcal {X}\) into an infinite loop, as it will always read a in q, and will neither halt, nor change its state, head symbol, or head position. Although it might seem counterintuitive to include an instruction that allows our machines to search the whole infinite side of a tape in a single step and without moving the head, this command is expressible in the construction we use in the proof of Theorem 14, and it is needed for the intended behavior.
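The semantics of \(\operatorname {CHECK}_{\mathrm {R}}\) can be made concrete with a small simulator (a sketch; all names are ours, and the tape is represented by the set of positions that hold 1, which is finite in every reachable configuration):

```python
HALT, CHECK_R, L, R = "HALT", "CHECK_R", "L", "R"

def run(delta, t_r, max_steps=1000):
    """Simulate an extended Turing machine with transition function delta,
    started in the configuration (0^omega, t_r 0^omega, 0, q1).
    Returns True if the machine halts (and accepts)."""
    ones = {i + 1 for i, c in enumerate(t_r) if c == "1"}  # cells holding 1
    head, state = 0, "q1"
    for _ in range(max_steps):
        a = "1" if head in ones else "0"
        move = delta[(a, state)]
        if move == HALT:
            return True
        if move[0] == CHECK_R:
            # Enter the target state only if the right tape side is blank;
            # otherwise the configuration never changes again.
            if any(i > head for i in ones):
                return False  # provably loops forever in the current state
            state = move[1]
        else:
            b, direction, state = move
            if b == "1":
                ones.add(head)
            else:
                ones.discard(head)
            head += 1 if direction == R else -1
    return False  # step bound reached (possible non-termination)

# A machine whose domain is exactly {0^omega}: it halts iff the right
# tape side is initially blank.
delta = {("0", "q1"): (CHECK_R, "q2"), ("1", "q1"): (CHECK_R, "q2"),
         ("0", "q2"): HALT, ("1", "q2"): HALT}
print(run(delta, ""))    # True: right side blank, CHECK_R succeeds
print(run(delta, "01"))  # False: a 1 on the right side forces a loop
```

The example machine has a finite domain, which, as argued below, would be impossible without the \(\operatorname {CHECK}_{\mathrm {R}}\)-instruction.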

We partition the tape of an extended Turing machine \(\mathcal {X}\) into three disjoint areas: The head symbol, which is (naturally) the tape symbol at the position of the head, the right tape side, which contains the tape word that starts immediately to the right of the head symbol and extends rightward into infinity, and the left tape side, which starts immediately to the left of the head symbol and extends infinitely to the left. When speaking of a configuration, we denote the head symbol by a and refer to the contents of the left or right tape side as the left tape word t L or the right tape word t R , respectively. For an illustration and further explanations, see Fig. 1.

Fig. 1
figure 1

An illustration of tape words of an extended Turing machine (as defined in Sect. 3). The arrow below the tape symbolizes the position of the head, while the dashed lines show the borders between the left tape side, the head position and the right tape side. Assuming that all tape cells that are not shown contain 0, we observe the left tape word \(t_{L}=10111 \:0^{\omega}\), the right tape word \(t_{R}=1001\:0^{\omega}\), and the head letter a=1

A configuration of an extended Turing machine \(\mathcal {X}=(Q,q_{1},\delta )\) is a tuple (t L ,t R ,a,q), where t L ,t R ∈Γ∗0ω are the left and right tape word, a∈Γ is the head symbol, and q∈Q denotes the current state. The symbol \(\vdash _{\mathcal {X}}\) denotes the successor relation on configurations of \(\mathcal {X}\), i.e., \(C\vdash _{\mathcal {X}}C'\) if \(\mathcal {X}\) enters C′ immediately after C.

We define \(\operatorname {dom}_{\mathrm {X}}(\mathcal {X})\), the domain of an extended Turing machine \(\mathcal {X}=(Q,q_{1},\delta)\), to be the set of all tape words t R ∈Γ∗0ω such that \(\mathcal {X}\), if started in the configuration (0ω,t R ,0,q 1), halts after finitely many steps.

The definition of \(\operatorname {dom}_{\mathrm {X}}\) is motivated by the properties of the encoding that we shall use. Usually, definitions of the domain of a Turing machine rely on the fact that the end of the input is marked by a special letter \(\mathdollar \) or an encoding thereof (cf. Minsky [27]). As we shall see, our use of extended regular expressions does not allow us to express the fact that every input is ended by exactly one \(\mathdollar \) symbol. Without the \(\operatorname {CHECK}_{\mathrm {R}}\)-instruction in an extended Turing machine \(\mathcal {X}\), we would then have to deal with the unfortunate side effect that a nonempty \(\operatorname {dom}_{\mathrm {X}}(\mathcal {X})\) could never be finite: Assume w∈Γ∗ such that \(w\:0^{\omega}\in \operatorname {dom}_{\mathrm {X}}(\mathcal {X})\). The machine can only see a finite part of the right side of the tape before accepting. Thus, there is a v∈Γ∗ such that both \(wv1\:0^{\omega}\in \operatorname {dom}_{\mathrm {X}}(\mathcal {X})\) and \(wv0\:0^{\omega}\in \operatorname {dom}_{\mathrm {X}}(\mathcal {X})\), as \(\mathcal {X}\) will not reach the part where wv1 and wv0 differ. This observation leads to \(wvx\:0^{\omega}\in \operatorname {dom}_{\mathrm {X}}(\mathcal {X})\) for every x∈Γ∗, and applies to various other extensions of the Turing machine model. As Lemma 13—and thereby most of the main results in Sect. 4—crucially depends on the fact that there are extended Turing machines with a finite domain, we use \(\operatorname {CHECK}_{\mathrm {R}}\) to allow our machines to perform additional sanity checks on the input and to overcome the limitations that arise from the lack of the input markers ¢ and \(\mathdollar \).

Using a classical coding technique for two-symbol Turing machines (see Minsky [27]) and the corresponding undecidability results, we establish the following negative results on decision problems for extended Turing machines:

Lemma 11

Consider the following decision problems for extended Turing machines:

Emptiness::

Given an extended Turing machine \(\mathcal {X}\), is \(\operatorname {dom}_{\mathrm {X}}(\mathcal {X})\) empty?

Finiteness::

Given an extended Turing machine \(\mathcal {X}\), is \(\operatorname {dom}_{\mathrm {X}}(\mathcal {X})\) finite?

Then emptiness is not semi-decidable, and finiteness is neither semi-decidable, nor co-semi-decidable.

Proof

We show these results on extended Turing machines by reducing each of these problems for “non-extended” Turing machines or, as we call them, general Turing machines, to its counterpart for extended Turing machines. A general Turing machine is a 7-tuple \(\mathcal {M}=(Q,q_{1},\hat {\varGamma },0,\triangleleft,\mathdollar,\delta)\), where Q is a finite set of states, q 1 is the initial state, \(0\in\hat {\varGamma }\) is the blank tape symbol, \(\triangleleft,\mathdollar\in\hat {\varGamma }\) are distinct special symbols (with \(\triangleleft\neq 0\) and \(\mathdollar\neq 0\)) that are used to mark the beginning and end (respectively) of an input of \(\mathcal {M}\), and

$$\delta:\hat {\varGamma }\times Q\to\bigl(\hat {\varGamma }\times\{L,R\}\times Q\bigr )\cup\{\operatorname {HALT}\}$$

is the transition function. We interpret δ as for extended Turing machines, and use the same notions of tape words and configurations as for extended Turing machines.

The domain \(\operatorname {dom}_{\mathrm {T}}(\mathcal {M})\) of a general Turing machine \(\mathcal {M}\) is defined to be the set of all \(\hat {w}\in(\hat {\varGamma }\setminus\{\triangleleft,\mathdollar\})^{*}\) such that \(\mathcal {M}\), if started in the configuration \((0^{\omega},t_{R},0,q_{1})\) with \(t_{R}=\triangleleft\,\hat {w}\,\mathdollar\,0^{\omega}\), halts after finitely many steps.

The definition of \(\operatorname {dom}_{\mathrm {T}}(\mathcal {M})\) corresponds to the definition of the language of a Turing machine as given by Hopcroft and Ullman [22] and Minsky [27]. As for extended Turing machines, we consider the following decision problems for general Turing machines:

Emptiness::

Given a general Turing machine \(\mathcal {M}\), is \(\operatorname {dom}_{\mathrm {T}}(\mathcal {M})\) empty?

Finiteness::

Given a general Turing machine \(\mathcal {M}\), is \(\operatorname {dom}_{\mathrm {T}}(\mathcal {M})\) finite?

Emptiness of \(\operatorname {dom}_{\mathrm {T}}\) is undecidable due to Rice’s Theorem; as it is obviously co-semi-decidable, it cannot be semi-decidable. Furthermore, due to the Rice-Shapiro Theorem, finiteness of \(\operatorname {dom}_{\mathrm {T}}\) is neither semi-decidable, nor co-semi-decidable (cf. Cutland [13], Hopcroft and Ullman [22]—in the latter reference, the Rice-Shapiro Theorem is called “Rice’s Theorem for recursively enumerable index sets” (Chap. 8.4)).

In order to prove the present lemma’s claims on extended Turing machines, we now define an effective procedure that, given a general Turing machine \(\mathcal {M}\), returns an extended Turing machine \(\mathcal {X}\) such that \(\operatorname {dom}_{\mathrm {T}}(\mathcal {M})\) is empty (or finite) if and only if \(\operatorname {dom}_{\mathrm {X}}(\mathcal {X})\) is empty (or finite).

First, assume that \(\mathcal {M}\) is defined over some tape alphabet \(\hat {\varGamma }\). Using the common technique to simulate Turing machines with larger tape alphabets on Turing machines with a binary tape alphabet (cf. Chap. 6.3.1 in Minsky [27]), we choose a k≥1 with \(2^{k}\geq|\hat {\varGamma }|\) and fix any injective function \(b_{k}:\hat {\varGamma }\to\varGamma^{k}\) (every letter from \(\hat {\varGamma }\) is encoded by a block of k letters from Γ), with b k (0)=0k (the blank symbol of \(\mathcal {M}\) is mapped to k successive blank symbols of \(\mathcal {X}\)). We extend this function b k canonically to an injective morphism \(b_{k}: \hat {\varGamma }^{*}0^{\omega}\to\varGamma^{*}0^{\omega}\). Moreover, we partition the tape of \(\mathcal {X}\) into non-overlapping blocks of k tape cells, each representing a single tape cell of \(\mathcal {M}\) as encoded by b k .
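As an illustration of this block encoding (a sketch, not the paper's fixed coding; the four-letter alphabet and the marker letters `<` and `$` below are hypothetical stand-ins), one possible way to choose k and an injective b k is:

```python
# Sketch (illustrative, not the paper's construction): choosing k and an
# injective block code b_k for a small alphabet; '<' and '$' stand in for
# the begin and end markers.

def make_block_code(alphabet, blank="0"):
    """Return (k, b_k) with 2**k >= len(alphabet) and b_k(blank) = 0^k."""
    k = max(1, (len(alphabet) - 1).bit_length())
    letters = [blank] + [a for a in alphabet if a != blank]  # blank listed first
    b = {a: format(idx, f"0{k}b") for idx, a in enumerate(letters)}
    return k, b

k, b = make_block_code(["0", "1", "<", "$"])
assert k == 2 and b["0"] == "00"              # blank -> k blank cells
assert len(set(b.values())) == len(b)         # b_k is injective
# extending b_k to words: concatenate the blocks letter by letter
assert len("".join(b[a] for a in ["<", "1", "$"])) == 3 * k
```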

The main idea of the construction is that \(\mathcal {X}\) works in two phases. First, it checks that its right tape word is \(b_{k}(\triangleleft\,\hat {w}\,\mathdollar\,0^{\omega})\) for some \(\hat {w}\in(\hat {\varGamma }\setminus\{\triangleleft,\mathdollar\})^{*}\). If this is the case, \(\mathcal {X}\) simulates \(\mathcal {M}\), always reading blocks of k letters at a time and interpreting every block b k (a) as input a for \(\mathcal {M}\).

More explicitly, the first phase works as follows: If started on an input \(w\:0^{\omega}\) with \(w\in\varGamma^{*}\), \(\mathcal {X}\) scans w and checks whether the first block of k letters is \(b_{k}(\triangleleft)\) (using its finite control to store the k−1 letters of the block, and evaluating the whole block after reading its k-th letter). If this is not the case, \(\mathcal {X}\) enters an infinite loop (and thus, rejects implicitly). Otherwise, \(\mathcal {X}\) continues scanning to the right, evaluating every block of k letters until a block with \(b_{k}(\mathdollar )\) is encountered. On its way to the right, \(\mathcal {X}\) performs the following checks: If a block contains some b k (a) with \(a\in\hat {\varGamma }\setminus\{\triangleleft,\mathdollar\}\), \(\mathcal {X}\) examines the next block. If a block contains \(b_{k}(\triangleleft)\) or some sequence of k letters that is not an image of any letter from \(\hat {\varGamma }\), \(\mathcal {X}\) enters an infinite loop. If a k-letter block containing \(b_{k}(\mathdollar )\) is found, \(\mathcal {X}\) moves the head to the last letter of this block and executes the \(\operatorname {CHECK}_{\mathrm {R}}\)-command. This leads the machine to enter an infinite loop if any occurrence of the non-blank symbol 1∈Γ follows.

Thus, if there is no \(\hat {w}\in(\hat {\varGamma }\setminus\{\triangleleft,\mathdollar\})^{*}\) such that \(w\:0^{\omega}=b_{k}(\triangleleft\,\hat {w}\,\mathdollar\,0^{\omega})\), \(\mathcal {X}\) will never find a block \(b_{k}(\mathdollar )\) that passes all of these checks, and will never halt. Intuitively, \(\mathcal {X}\) (implicitly) rejects any input that does not satisfy its sanity criteria by refusing to halt.

But if \(w\:0^{\omega}=b_{k}(\triangleleft\,\hat {w}\,\mathdollar\,0^{\omega})\) for some \(\hat {w}\in(\hat {\varGamma }\setminus\{\triangleleft,\mathdollar\})^{*}\), no tape cell containing 1 is found by \(\operatorname {CHECK}_{\mathrm {R}}\). Then \(\mathcal {X}\) enters its second phase: The machine returns to the left side of w (which it recognizes by the unique block containing \(b_{k}(\triangleleft)\)), and simulates \(\mathcal {M}\) on the corresponding input \(\hat {w}\) with \(t_{R}=\triangleleft\,\hat {w}\,\mathdollar\,0^{\omega}\), always using the finite control to read blocks b k (a) of length k which represent a tape letter \(a\in \hat {\varGamma }\) as input for \(\mathcal {M}\), and halting if and only if \(\mathcal {M}\) halts. By definition, the left tape side is initially empty; hence, due to b k (0)=0k, and due to the sanity check using \(\operatorname {CHECK}_{\mathrm {R}}\), we do not even need to keep track of which part of the tape \(\mathcal {X}\) has already seen.

Thus, if \(w\:0^{\omega}\in \operatorname {dom}_{\mathrm {X}}(\mathcal {X})\) for some \(w\in\varGamma^{*}\), there is exactly one \(\hat {w}\) with \(\hat {w}\in \operatorname {dom}_{\mathrm {T}}(\mathcal {M})\) and \(w\:0^{\omega}=b_{k}(\triangleleft\,\hat {w}\,\mathdollar\,0^{\omega})\). Likewise, for every \(\hat {w}\in \operatorname {dom}_{\mathrm {T}}(\mathcal {M})\) (which, by definition, implies that \(\hat {w}\) does not contain any \(\triangleleft\) or \(\mathdollar \)), \(b_{k}(\triangleleft\,\hat {w}\,\mathdollar\,0^{\omega})\in \operatorname {dom}_{\mathrm {X}}(\mathcal {X})\). Thus, \(\operatorname {dom}_{\mathrm {T}}(\mathcal {M})=\emptyset\) if and only if \(\operatorname {dom}_{\mathrm {X}}(\mathcal {X})=\emptyset\), and likewise, \(\operatorname {dom}_{\mathrm {T}}(\mathcal {M})\) is finite if and only if \(\operatorname {dom}_{\mathrm {X}}(\mathcal {X})\) is finite.

As the whole construction process can be realized effectively, any algorithm that (semi-)decides any of these two problems for extended Turing machines could be converted into an algorithm that (semi-)decides the corresponding problem for general Turing machines.

Due to the fact that emptiness of the domain for general Turing machines is not semi-decidable, and as finiteness is neither semi-decidable nor co-semi-decidable, the claim follows. □

Those who are interested in these problems’ exact position in the arithmetical hierarchy (cf. Odifreddi [28]) can use Propositions X.9.5 and X.9.6 from Odifreddi [29] and the canonical reasoning on the order of quantifiers for the respective levels to observe that—for general and for extended Turing machines—emptiness of the domain is \(\varPi^{0}_{1}\)-complete, while finiteness of the domain is \(\varSigma^{0}_{2}\)-complete (hence, its complement is \(\varPi^{0}_{2}\)-complete).

In order to simplify some technical aspects of our further proofs below, we adopt the following convention on extended Turing machines:

Convention 12

Every extended Turing machine

  1. has the state set Q={q 1,…,q ν } for some ν≥1, where q 1 is the initial state,

  2. has δ(0,q 1)=(0,L,q 2),

  3. has \(\delta(a,q)=\operatorname {HALT}\) for at least one pair (a,q)∈Γ×Q.

Obviously, every extended Turing machine can be straightforwardly (and effectively) adapted to satisfy these criteria.

As every tape word contains only finitely many occurrences of 1, we can interpret tape sides as natural numbers in the following (canonical) way: For sequences \(t=(t_{i})_{i=0}^{\infty}\) over Γ, define \(\operatorname {e}(t):=\sum_{i=0}^{\infty}2^{i} \operatorname {e}(t_{i})\), where \(\operatorname {e}(0):=0\) and \(\operatorname {e}(1):=1\). Most of the time, we will not distinguish between single letters and their values under \(\operatorname {e}\), and simply write a instead of \(\operatorname {e}(a)\) for all \(a\in\varGamma\). It is easily seen that \(\operatorname {e}\) is a bijection between ℕ and \(\varGamma^{*}0^{\omega}\), the set of all tape words over Γ. Intuitively, every tape word is read as a binary number, starting with the cell closest to the head as the least significant bit, extending toward infinity.
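The encoding \(\operatorname {e}\) and its inverse can be sketched as follows (an illustration; a tape side is given as the finite list of its non-blank prefix cells, the cell closest to the head first):

```python
# Sketch: e reads a tape side as a binary number, least significant bit first.

def e(cells):
    """e(t) = sum_i 2**i * t_i; `cells` is the finite non-blank prefix of t."""
    return sum(2 ** i * bit for i, bit in enumerate(cells))

def e_inverse(n):
    """The finite part of the unique tape word t with e(t) = n (blanks omitted)."""
    cells = []
    while n > 0:
        cells.append(n % 2)
        n //= 2
    return cells

assert e([1, 0, 1]) == 5        # 1*1 + 0*2 + 1*4
assert e([]) == 0               # the all-blank tape side 0^omega
assert e(e_inverse(42)) == 42   # e is a bijection
```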

Expressing the three parts of the tape (left and right tape word and head symbol) as natural numbers allows us to compute the tape parts of successor configurations using elementary integer operations. The following straightforward observation shall be a very important tool in the proof of Theorem 14:

Observation 1

Assume that an extended Turing machine \(\mathcal {X}=(Q,q_{1},\delta)\) is in some configuration C=(t L ,t R ,a,q i ), and δ(a,q i )=(b,M,q j ) for some \(b\in\varGamma\), some M∈{L,R} and some \(q_{j}\in Q\). For the (uniquely defined) successor configuration \(C'=(t'_{L},t'_{R},a',q_{j})\) with \(C\vdash _{\mathcal {X}}C'\), the following holds: If M=L, then

$$\operatorname {e}\bigl(t'_{L}\bigr)=\operatorname {e}(t_{L})\operatorname {div}2,\qquad \operatorname {e}\bigl(t'_{R}\bigr)=2 \operatorname {e}(t_{R})+b,\qquad a'=\operatorname {e}(t_{L})\operatorname {mod}2;$$

and if M=R, then

$$\operatorname {e}\bigl(t'_{R}\bigr)=\operatorname {e}(t_{R})\operatorname {div}2,\qquad \operatorname {e}\bigl(t'_{L}\bigr)=2 \operatorname {e}(t_{L})+b,\qquad a'=\operatorname {e}(t_{R})\operatorname {mod}2.$$

These equations are fairly obvious—when moving the head in direction M, \(\mathcal {X}\) turns the tape cell that contained the least significant bit of \(\operatorname {e}(t_{M})\) into the new head symbol, while the other tape side gains the tape cell containing the new letter b that was written over the head symbol as new least significant bit.
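The tape arithmetic described above can be sketched as follows (an illustration of Observation 1, with both tape sides represented by their values under \(\operatorname {e}\)):

```python
# Sketch: one movement step delta(a, q_i) = (b, M, q_j), acting on the
# e-values of the tape sides by div/mod-2 arithmetic.

def step_tapes(tL, tR, b, move):
    """Return (e(t'_L), e(t'_R), a') after writing b and moving in direction `move`."""
    if move == "L":
        # the least significant bit of the left side becomes the new head symbol;
        # the written letter b becomes the new least significant bit on the right
        return tL // 2, 2 * tR + b, tL % 2
    else:
        return 2 * tL + b, tR // 2, tR % 2

# Convention 12: delta(0, q_1) = (0, L, q_2), started on (0^omega, t_R, 0, q_1):
assert step_tapes(0, 13, 0, "L") == (0, 26, 0)   # e(t'_L)=0, e(t'_R)=2 e(t_R), a'=0
assert step_tapes(5, 2, 1, "R") == (11, 1, 0)
```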

Using the encoding \(\operatorname {e}\), we define an encoding \(\operatorname {enc}\) of configurations of \(\mathcal {X}\) by

$$\operatorname {enc}(t_L,t_R,a,q_i ):=0 0^{\operatorname {e}(t_L)}\texttt {\#}0 0^{\operatorname {e}(t_R)}\texttt {\#}0 0^{\operatorname {e}(a)}\texttt {\#}0^i$$

for every configuration (t L ,t R ,a,q i ) of \(\mathcal {X}\). We extend \(\operatorname {enc}\) to an encoding of finite sequences \(C=(C_{i})^{n}_{i=1}\) (where every C i is a configuration of \(\mathcal {X}\)) by

$$\operatorname {enc}(C):=\texttt {\#} \texttt {\#} \operatorname {enc}(C_1)\texttt {\#} \texttt {\#} \operatorname {enc}(C_2)\texttt {\#} \texttt {\#}\cdots \texttt {\#} \texttt {\#} \operatorname {enc}(C_n)\texttt {\#} \texttt {\#}.$$
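For illustration, the encodings \(\operatorname {enc}\) of configurations and of configuration sequences can be sketched as follows (each component is passed as a number, the state q i as its index i):

```python
# Sketch of enc: unary blocks over {0, #}, separated by # and ##.

def enc_config(tL, tR, a, i):
    """enc(t_L, t_R, a, q_i) = 0 0^{e(t_L)} # 0 0^{e(t_R)} # 0 0^{e(a)} # 0^i."""
    return f"0{'0' * tL}#0{'0' * tR}#0{'0' * a}#{'0' * i}"

def enc_sequence(configs):
    """## enc(C_1) ## enc(C_2) ## ... ## enc(C_n) ##."""
    return "##" + "##".join(enc_config(*c) for c in configs) + "##"

assert enc_config(0, 2, 0, 1) == "0#000#0#0"
assert enc_sequence([(0, 2, 0, 1), (0, 4, 0, 2)]) == "##0#000#0#0##0#00000#0#00##"
```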

A valid computation of \(\mathcal {X}\) is a sequence \(C=(C_{i})^{n}_{i=1}\) of configurations of \(\mathcal {X}\) where C 1 is an initial configuration (i.e. some configuration (0ω,w,0,q 1) with \(w\in\varGamma^{*}0^{\omega}\)), C n is a halting configuration, and for every i<n, \(C_{i}\vdash _{\mathcal {X}}C_{i+1}\). Thus, let

$$\operatorname {VALC}(\mathcal {X}):=\bigl\{\operatorname {enc}(C)\mid C \text{ is a valid computation of } \mathcal {X}\bigr\},\qquad \operatorname {INVALC}(\mathcal {X}):=\{0,\texttt {\#}\}^{*}\setminus \operatorname {VALC}(\mathcal {X}).$$

The main part of the proof of Theorem 10 is Theorem 14 (still further down), which states that, given an extended Turing machine \(\mathcal {X}\), one can effectively construct an expression from \(\operatorname {RegEx}(1)\) that generates \(\operatorname {INVALC}(\mathcal {X})\). Note that in \(\operatorname {enc}(C)\), ## serves as a boundary between the encodings of individual configurations, which will be of use in the proof of Theorem 14. Building on Convention 12, we observe the following fact on the regularity of \(\operatorname {VALC}(\mathcal {X})\) for a given extended Turing machine \(\mathcal {X}\):

Lemma 13

For every extended Turing machine \(\mathcal {X}\), \(\operatorname {VALC}(\mathcal {X})\) is regular if and only if \(\operatorname {dom}_{\mathrm {X}}(\mathcal {X})\) is finite.

Proof

The if direction follows immediately: As \(\mathcal {X}\) is deterministic and accepts by halting, every word in \(\operatorname {VALC}(\mathcal {X})\) corresponds to exactly one word from \(\operatorname {dom}_{\mathrm {X}}(\mathcal {X})\) (and the computation of \(\mathcal {X}\) on that word). Thus, if \(\operatorname {dom}_{\mathrm {X}}(\mathcal {X})\) is finite, \(\operatorname {VALC}(\mathcal {X})\) is also finite, and thus, regular.

For the only if direction, let \(\mathcal {X}=(Q,q_{1},\delta)\), and assume that \(\operatorname {dom}_{\mathrm {X}}(\mathcal {X})\) is infinite, while \(\operatorname {VALC}(\mathcal {X})\) is regular. The main idea of the proof is to show that this assumption implies the regularity of the language

$$L_{\mathcal {X}} :=\bigl\{ 0^{\operatorname {e}(t_R)}\texttt {\#}0^{\operatorname {e}(t_R)}\mid t_R\in \operatorname {dom}_{\mathrm {X}}(\mathcal {X})\bigr\}.$$

Due to \(L_{\mathcal {X}}\) being an infinite subset of \(\{0^{n}\texttt {\#}0^{n}\mid n\geq0\}\), we can then obtain a contradiction using the Pumping Lemma. In order to achieve this result, we use our convention that \(\mathcal {X}\) does not halt on the very first configuration (cf. Convention 12).

As \(\mathcal {X}\) is deterministic, every word \(w\in \operatorname {VALC}(\mathcal {X})\) corresponds to exactly one tape word \(t_{R} \in \operatorname {dom}_{\mathrm {X}}(\mathcal {X})\) and its accepting computation. This means that w has a prefix that encodes the initial configuration (t L ,t R ,a,q 1) with t L =0ω and a=0, and its successor configuration \((t'_{L},t'_{R},a',q_{j})\). Recall that, by Convention 12, δ(0,q 1)=(0,L,q 2). Using the first equation in Observation 1, we conclude \(\operatorname {e}(t'_{L})=0\), \(\operatorname {e}(t'_{R})=2\operatorname {e}(t_{R})\), and a′=0. This means that there are \(w_{p},w'\in\{0,\texttt {\#}\}^{*}\) such that w=w p w′, and

$$ w_p=\texttt {\#} \texttt {\#}0\underbrace{}_{\operatorname {e}(t_L)}\texttt {\#}0\underbrace{0^{\operatorname {e}(t_R)}}_{\operatorname {e}(t_R)}\texttt {\#}0\underbrace{}_{a} \texttt {\#}\underbrace{0}_{q_1} \texttt {\#} \texttt {\#}0\underbrace{ }_{\operatorname {e}(t'_L)}\texttt {\#}0\underbrace{0^{\operatorname {e}(t'_R)}}_{\operatorname {e}(t'_R)}\texttt {\#}0\underbrace{}_{a'}\texttt {\#}\underbrace{0^2}_{q_2}\texttt {\#} \texttt {\#}.$$
(1)

We now define a GSM (generalized sequential machine, cf. Hopcroft and Ullman [22]) M to transform \(\operatorname {VALC}(\mathcal {X})\) into the language \(L_{\mathcal {X}}\). Basically, generalized sequential machines can be understood as an extension to nondeterministic finite automata. In addition to the usual behavior of an NFA, a GSM supplements every transition with an output; i.e., whenever a GSM reads a symbol, it also emits a string (as specified by its transition relation). Thus, every path through a GSM also yields the concatenation of the emitted strings as an output. Applying a GSM M to a word w yields the language M(w) that consists of every string emitted by M along an accepting path for w. Likewise, for every language L, M(L):=⋃ wL M(w). As M maps L to M(L), this process is called a GSM mapping. As regular languages are closed under GSM mappings, the regularity of \(L_{\mathcal {X}}\) then follows from the assumed regularity of \(\operatorname {VALC}(\mathcal {X})\).

The GSM M is defined by the transition diagram in Fig. 2. Compared to Hopcroft and Ullman [22], this definition of M uses a slightly streamlined notation by allowing M to read multiple letters in the transition between q 1 and q 2 and between q 3 and q 4. By introducing additional states, one can easily convert M into a GSM that reads one letter after the other.

Fig. 2

The GSM M that is used in the proof of Lemma 13. Every transition shows the string that is read to the left of the | symbol, and the emitted string to the right. First, M erases ##0#0 and keeps the following continuous block of 0s and the # after it. It then erases 0#0##0#0 and halves the number of 0s in the next continuous block of 0s (using the loop between q 4 and q 5). After that block (as recognizable by #), all following letters are erased. Note that M relies on the fact that it is only used on words w=w p w′, where w p is of the form that is described in (1). For all such words, w′ is completely erased in the loop in q 6.

It is easily seen that \(M( \operatorname {VALC}(\mathcal {X}))=L_{\mathcal {X}}\). By our initial assumption, \(\operatorname {VALC}(\mathcal {X})\) is regular, and as the class of regular languages is closed under GSM mappings, \(L_{\mathcal {X}}\) must be regular as well. Also by our initial assumption, \(\operatorname {dom}_{\mathrm {X}}(\mathcal {X})\) is infinite, which means that \(L_{\mathcal {X}}\) is an infinite subset of \(\{0^{n}\texttt {\#}0^{n}\mid n\geq0\}\). Using the Pumping Lemma (cf. Hopcroft and Ullman [22]), we obtain the intended contradiction, as pumping any sufficiently long word from \(L_{\mathcal {X}}\) would lead to a word that is not contained in \(0^{*}\texttt {\#}0^{*}\), or to a word \(0^{m}\texttt {\#}0^{n}\) with m≠n. □
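The mapping computed by M can be illustrated by a small sketch that applies the steps from the caption of Fig. 2 to a word w p w′ of the form described in (1) (a plain function for illustration, not the GSM formalism itself):

```python
# Sketch (not the GSM formalism): the transformation of Fig. 2 as a function
# on words w = w_p w' with w_p of the form (1).

def gsm_M(w):
    assert w.startswith("##0#0")                 # erase '##0#0'
    rest = w[5:]
    n = len(rest) - len(rest.lstrip("0"))        # keep the block 0^{e(t_R)} ...
    rest = rest[n:]
    assert rest.startswith("#0#0##0#0")          # ... and the '#'; erase '0#0##0#0'
    rest = rest[9:]
    m = len(rest) - len(rest.lstrip("0"))        # block 0^{2 e(t_R)}
    return "0" * n + "#" + "0" * (m // 2)        # halve it; erase the remainder

# w_p for e(t_R)=3, followed by an arbitrary suffix w' (erased by M):
w_p = "##0#0" + "000" + "#0#0##0#0" + "000000" + "#0#00##"
assert gsm_M(w_p + "0#0#0##") == "000#000"      # yields 0^{e(t_R)} # 0^{e(t_R)}
```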

We are now ready to state the central part of our proof of Theorem 10:

Theorem 14

For every extended Turing machine \(\mathcal {X}\), one can effectively construct an extended regular expression \(\alpha_{\mathcal {X}}\in \operatorname {RegEx}(1)\) such that \(L(\alpha_{\mathcal {X}})= \operatorname {INVALC}(\mathcal {X})\).

Proof

Let \(\mathcal {X}=(Q,q_{1},\delta)\) be an extended Turing machine. Let ν≥2 denote the number of states of \(\mathcal {X}\); by Convention 12, Q={q 1,…,q ν } for some ν≥2. Intuitively, each element w of \(\operatorname {INVALC}(\mathcal {X})\) contains at least one error that prevents w from being an encoding of a valid computation of \(\mathcal {X}\). We distinguish two kinds of errors:

  1. structural errors, where a word is not an encoding of any sequence \((C_{i})_{i=1}^{n}\) over configurations of \(\mathcal {X}\) for some n, or the word is such an encoding, but C 1 is not an initial, or C n is not a halting configuration, and

  2. behavioral errors, where a word is an encoding of some sequence of configurations \((C_{i})_{i=1}^{n}\) of \(\mathcal {X}\), but there is an i<n such that \(C_{i}\vdash _{\mathcal {X}}C_{i+1}\) does not hold.

The extended regular expression \(\alpha_{\mathcal {X}}\) is defined by

$$\alpha_{\mathcal {X}} :=\alpha _{struc} \mid \alpha _{beha},$$

where the subexpressions α struc and α beha describe all structural and all behavioral errors, respectively. Both expressions shall be defined later. Note that the variable reference mechanism shall be used only for some extended regular expressions in α beha ; most of the encoding of \(\operatorname {INVALC}(\mathcal {X})\) can be achieved with proper regular expressions. In order to define α struc , we take a short detour and consider the language

$$S_{\mathcal {X}}:=L\Bigl(\bigl(\texttt {\#} \texttt {\#}0^{+} \texttt {\#}0^{+} \texttt {\#}(0\mid 00) \texttt {\#}\alpha_{Q}\bigr)^{+} \texttt {\#} \texttt {\#}\Bigr)\cap L\bigl(\texttt {\#} \texttt {\#}0 \texttt {\#}0^{+} \texttt {\#}0 \texttt {\#}0 \texttt {\#} \texttt {\#}(0\mid \texttt {\#})^{*}\bigr)\cap\bigcup_{\delta(a,q_{i})=\operatorname {HALT}}L\bigl((0\mid \texttt {\#})^{*} \texttt {\#}00^{a} \texttt {\#}0^{i} \texttt {\#} \texttt {\#}\bigr),$$

where \(\alpha_{Q}:=(0\mid 00\mid \cdots \mid 0^{\nu})\).

Note that \(S_{\mathcal {X}}\) is exactly the set of all \(\operatorname {enc}(C)\), where \(C =(C_{i})_{i=1}^{n}\) (with n≥2) is a sequence of configurations of \(\mathcal {X}\), with C 1 being an initial configuration (where neither the left tape side nor the head cell contains any 1), and C n being a halting configuration (n≥2 follows from our Convention 12 that \(\mathcal {X}\) cannot halt in the first step). In other words, all that distinguishes \(S_{\mathcal {X}}\) from \(\operatorname {VALC}(\mathcal {X})\) is that for \(S_{\mathcal {X}}\), we do not require that \(C_{i} \vdash _{\mathcal {X}}C_{i+1}\) holds for all i<n. Thus, \(\operatorname {VALC}(\mathcal {X})\subseteq S_{\mathcal {X}}\).

Furthermore, \(S_{\mathcal {X}}\) is a regular language, as it is obtained by an intersection of three regular languages. Thus, \(\{0,\texttt {\#}\}^{*}\setminus S_{\mathcal {X}}\) is also a regular language, and we define α struc to be any proper regular expression with \(L(\alpha _{struc})=\{0,\texttt {\#}\}^{*}\setminus S_{\mathcal {X}}\). It is easy to see that such an α struc can be constructed effectively solely from \(\mathcal {X}\), for example by constructing a deterministic finite automaton A for \(S_{\mathcal {X}}\), complementing A (by turning accepting into non-accepting states, and vice versa), and converting the resulting nondeterministic automaton into a proper regular expression. The DFA A depends only on ν and the halting instructions occurring in δ and can be constructed effectively, as can all the conversions that lead to α struc (again, cf. Hopcroft and Ullman [22]). The exact shape of α struc is of no significance to this proof, as we require only that the expression is a proper regular expression, and can be obtained effectively.

As mentioned above, \(\operatorname {VALC}(\mathcal {X})\subseteq S_{\mathcal {X}}\), and thus, \(\operatorname {INVALC}(\mathcal {X})\supseteq L(\alpha _{struc})\). Furthermore, all elements of \(\operatorname {INVALC}(\mathcal {X})\setminus L(\alpha _{struc})\) are elements of \(S_{\mathcal {X}}\) and encode a sequence \((C_{i})_{i=1}^{n}\) (n≥2) of configurations of \(\mathcal {X}\) such that \(C_{i}\vdash _{\mathcal {X}}C_{i+1}\) does not hold for at least one i, 1≤i<n.

Thus, \(\operatorname {INVALC}(\mathcal {X})\setminus L(\alpha _{struc})\) contains exactly those words from \(S_{\mathcal {X}}\) that encode a sequence of configurations with at least one behavioral error. Therefore, when defining α beha to describe all these remaining errors, we can safely assume that the word in question is an element of \(S_{\mathcal {X}}\), as otherwise, it is already contained in L(α struc ). This allows us to reason about the yet to be defined elements of \(\operatorname {INVALC}(\mathcal {X})\) purely in terms of the execution of \(\mathcal {X}\), as the encoding is already provided by the structure of \(S_{\mathcal {X}}\), and to understand all errors that are yet to be defined as incorrect transitions between configurations.

We distinguish three kinds of behavioral errors in the transition between a configuration C=(t L ,t R ,a,q i ) and a configuration \(C'=(t'_{L},t'_{R},a',q_{j})\), where \(C\vdash _{\mathcal {X}}C'\) does not hold:

  1. state errors, where q j has a wrong value,

  2. head errors, where a′ is wrong, and

  3. tape side errors, where \(t'_{L}\) or \(t'_{R}\) contains an error (characterized by \(\operatorname {e}(t'_{L})\) or \(\operatorname {e}(t'_{R})\) being different from the value that is expected according to Observation 1).

Each of these types of errors shall be handled by an expression α state , α head or α tape (respectively, of course), and we define

$$\alpha _{beha} :=(\alpha _{state} \mid \alpha _{head} \mid \alpha _{tape}).$$

Basically, each of these expressions lists all combinations of aΓ and q i Q, and describes the corresponding errors of the respective kind. The error that \(\mathcal {X}\) continues its computation after encountering a \(\operatorname {HALT}\)-instruction is considered a state error and handled in α state (thus, we do not need to consider \(\operatorname {HALT}\)-instructions in α head and α tape ). We can already note that α head and α state are proper regular expressions, as variables and the % metacharacter occur only in α tape (recall that, as \(\alpha _{\mathcal {X}}\in \operatorname {RegEx}(1)\), we are only allowed to use % once in the whole expression).

State errors: We begin with the definition of α state . For every aΓ and every i with q i Q, we define a proper regular expression \(\alpha^{state}_{a,i}\), and let

$$\alpha _{state} :=\bigl(\alpha^{state}_{0,1}\mid \alpha^{state}_{1,1}\mid \alpha^{state}_{0,2}\mid \alpha^{state}_{1,2}\mid \cdots \mid \alpha^{state}_{0,\nu} \mid \alpha^{state}_{1,\nu}\bigr),$$

where each \(\alpha^{state}_{a,i}\) lists all ‘forbidden’ follower states for q i on a. More formally, if \(\delta(a,q_{i})=\operatorname {HALT}\), let

$$\alpha^{state}_{a,i}:=(0\mid \texttt {\#})^* \texttt {\#}00^{a} \texttt {\#}0^i \texttt {\#} \texttt {\#}0 (0\mid \texttt {\#})^*.$$

For all words in \(S_{\mathcal {X}}\), this expression describes all cases where \(\mathcal {X}\) reads a in state q i , and continues instead of halting. First, as mentioned above, we only need to consider words from \(S_{\mathcal {X}}\), as all other words are already matched by α struc . Due to the definition of \(\operatorname {enc}\), every ## in words from \(S_{\mathcal {X}}\) marks the boundary between two encoded configurations, and every string #0i immediately to the left of such a ## encodes a state q i . Likewise, when continuing to the left, #00a encodes the head letter a. Thus, whenever a word from \(S_{\mathcal {X}}\) contains a string #00a #0i ##, there is a configuration where \(\mathcal {X}\) is in state q i and reads a. As \(\delta(a,q_{i})=\operatorname {HALT}\), there may not be a succeeding configuration, and this definition of \(\alpha^{state}_{a,i}\) describes all cases where \(\mathcal {X}\) continues after reading a in q i . Note that we do not need to deal with cases where ## is followed by yet another #, as such words are not contained in \(S_{\mathcal {X}}\) and, thus, contained in L(α struc ).

For those cases where δ(a,q i )=(b,M,q j ) for some M∈{L,R}, some bΓ, and some q j Q, we define

$$\alpha^{state}_{a,i}:=(0\mid \texttt {\#})^* \texttt {\#}00^{a} \texttt {\#}0^i \texttt {\#} \texttt {\#}0^+ \texttt {\#}0^+ \texttt {\#}0^+ \texttt {\#}\alpha^{not}_{j} \texttt {\#} \texttt {\#}(0\mid \texttt {\#})^*,$$

where \(\alpha^{not}_{j}\) is any proper regular expression with

$$L\bigl(\alpha^{not}_{j}\bigr)=\bigl\{0^k\mid1\leq k \leq\nu\text{ and }k\neq j\bigr\}.$$

Again, we use #00a #0i ## to identify an encoding of a configuration with head letter a in state q i . To the right of ##, the subexpression 0+ #0+ #0+ # is used to skip over the encodings of \(t'_{L}\), \(t'_{R}\) and a′, as we only deal with state errors (for now). By definition, the invalid successor states are exactly all states from Q∖{q j }, and these are described by \(\alpha ^{not}_{j}\). Thus, if a word from \(S_{\mathcal {X}}\) contains any state error when reading a in q i , the whole word belongs to \(\alpha^{state}_{a,i}\), and \(\alpha^{state}_{a,i}\) only matches such words.
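Since \(\alpha^{state}_{a,i}\) is a proper regular expression, it translates directly into classical regex syntax. The following sketch (with illustrative values a=0, i=1, correct successor state q 2, and ν=3; Python's re module as a stand-in) checks a word with and without a state error:

```python
# Sketch with illustrative values: a=0, i=1, delta(0, q_1) = (b, M, q_2), nu=3.
import re

nu, a, i, j = 3, 0, 1, 2
alpha_not_j = "|".join("0" * k for k in range(1, nu + 1) if k != j)  # 0^k, k != j
pattern = re.compile(rf"^[0#]*#0{'0' * a}#{'0' * i}##0+#0+#0+#({alpha_not_j})##[0#]*$")

bad  = "##0#000#0#0##0#0000#0#000##"  # successor switches to q_3 instead of q_2
good = "##0#000#0#0##0#0000#0#00##"   # successor in the correct state q_2
assert pattern.match(bad) is not None
assert pattern.match(good) is None
```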

Finally, if \(\delta(a,q_{i})=(\operatorname {CHECK}_{\mathrm {R}},q_{j})\) for some q j Q, we define

$$\begin{aligned}\alpha^{state}_{a,i}:={}&(0\mid \texttt {\#})^{*} \texttt {\#}0 \texttt {\#}00^{a} \texttt {\#}0^{i} \texttt {\#} \texttt {\#}0^{+} \texttt {\#}0^{+} \texttt {\#}0^{+} \texttt {\#}\alpha^{not}_{j} \texttt {\#} \texttt {\#}(0\mid \texttt {\#})^{*}\\ {}\mid{}&(0\mid \texttt {\#})^{*} \texttt {\#}00^{+} \texttt {\#}00^{a} \texttt {\#}0^{i} \texttt {\#} \texttt {\#}0^{+} \texttt {\#}0^{+} \texttt {\#}0^{+} \texttt {\#}\alpha^{not}_{i} \texttt {\#} \texttt {\#}(0\mid \texttt {\#})^{*},\end{aligned}$$

where \(\alpha^{not}_{j}\) is defined as in the preceding paragraph. This expression is slightly more complicated, as it needs to distinguish two cases. Recall that the \(\operatorname {CHECK}_{\mathrm {R}}\)-instruction is to be interpreted as follows: If t R =0ω, \(\mathcal {X}\) is supposed to change into state q j ; and if t R ≠0ω, \(\mathcal {X}\) is supposed to remain in q i , which will lead to an infinite loop. The first line of the definition handles all cases where t R =0ω, while the second handles those where t R ≠0ω. Again, both cases use #00a #0i ## to identify configurations where \(\mathcal {X}\) is in state q i reading a.

In the first case, the string #0#00a #0i ## contains the additional information that \(\operatorname {e}(t_{R})=0\), and thus, t R =0ω. The correct successor state would be q j , and the expression skips over the encodings of \(t'_{L}\), \(t'_{R}\) and a′ (using \(0^{+}\texttt {\#}0^{+}\texttt {\#}0^{+}\texttt {\#}\)) and then uses \(\alpha^{not}_{j}\) to match all states but q j .

Likewise, in the second case, #00+ #00a #0i ## matches all cases where (when reading a in q i ) \(\operatorname {e}(t_{R})>0\), which is equivalent to t R ≠0ω. Again, the expression skips over the encodings of \(t'_{L}\), \(t'_{R}\) and a′ and uses \(\alpha^{not}_{i}\) to identify all states that are not the correct successor state q i .

Head errors: As α state handles all cases where a halting configuration is followed by any other configuration, we can restrict our definition of the various head errors to cases where a non-halting instruction should be executed. We define

$$\alpha _{head} :=\bigl(\alpha^{head}_{0,1}\mid \alpha^{head}_{1,1}\mid \alpha^{head}_{0,2}\mid \alpha^{head}_{1,2}\mid \cdots \mid \alpha^{head}_{0,\nu} \mid \alpha^{head}_{1,\nu}\bigr),$$

omitting those \(\alpha^{head}_{a,i}\) with \(\delta(a,q_{i})=\operatorname {HALT}\). For all aΓ, q i Q with \(\delta(a,q_{i})\neq \operatorname {HALT}\), we define \(\alpha^{head}_{a,i}\) as follows.

If δ(a,q i )=(b,L,q j ) (for some q j Q), let

$$\begin{aligned}\alpha^{head}_{a,i}:={}&(0\mid \texttt {\#})^{*} \texttt {\#}0(00)^{*} \texttt {\#}0^{+} \texttt {\#}00^{a} \texttt {\#}0^{i} \texttt {\#} \texttt {\#}0^{+} \texttt {\#}0^{+} \texttt {\#}00 \texttt {\#}(0\mid \texttt {\#})^{*}\\ {}\mid{}&(0\mid \texttt {\#})^{*} \texttt {\#}00(00)^{*} \texttt {\#}0^{+} \texttt {\#}00^{a} \texttt {\#}0^{i} \texttt {\#} \texttt {\#}0^{+} \texttt {\#}0^{+} \texttt {\#}0 \texttt {\#}(0\mid \texttt {\#})^{*}.\end{aligned}$$

According to the first equation in Observation 1, after a left movement of the head, \(a'=\operatorname {e}(t_{L})\operatorname {mod}2\) must hold. The two lines in the \(\alpha^{head}_{a,i}\) distinguish the two possible cases for \(\operatorname {e}(t_{L})\operatorname {mod}2\). In both cases, we once again identify a and q i in the encoding using #00a #0i #. In the first line, the expression ignores \(\operatorname {e}(t_{R})\) (using the 0+ to the left of #00a #), and describes all cases where \(\operatorname {e}(t_{L})\) is even (by the (00) part of #0(00)). To the right of ##, the expression skips \(t'_{L}\) and \(t'_{R}\) and finds a′=1, thus exactly those cases where \(\operatorname {e}(t_{L})\) is even, but a′=1. Likewise, the second line handles the cases where \(\operatorname {e}(t_{L})\) is odd, but a′=0. As a′∈{0,1} is ensured by \(S_{\mathcal {X}}\), these expressions describe exactly the head errors after L-movements.

Likewise, if δ(a,q i )=(b,R,q j ) (for some q j Q), let

$$\begin{aligned}\alpha^{head}_{a,i}:={}&(0\mid \texttt {\#})^{*} \texttt {\#}0(00)^{*} \texttt {\#}00^{a} \texttt {\#}0^{i} \texttt {\#} \texttt {\#}0^{+} \texttt {\#}0^{+} \texttt {\#}00 \texttt {\#}(0\mid \texttt {\#})^{*}\\ {}\mid{}&(0\mid \texttt {\#})^{*} \texttt {\#}00(00)^{*} \texttt {\#}00^{a} \texttt {\#}0^{i} \texttt {\#} \texttt {\#}0^{+} \texttt {\#}0^{+} \texttt {\#}0 \texttt {\#}(0\mid \texttt {\#})^{*}.\end{aligned}$$

This expression uses the second equation in Observation 1, \(a'=\operatorname {e}(t_{R})\operatorname {mod}2\), and works like the expression for L-moves, the only difference being that it does not skip over the encoding of t R .
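The parity tests underlying these expressions are ordinary regular languages over the unary blocks; a small sketch (using Python's re module for illustration):

```python
# Sketch: parity of e-values is a regular property of the unary blocks 0 0^e.
import re

even_block = re.compile(r"^0(00)*$")   # blocks 0 0^e with e even
odd_block  = re.compile(r"^00(00)*$")  # blocks 0 0^e with e odd

def block(e):
    """The block 0 0^e that encodes a tape-side value e."""
    return "0" * (e + 1)

assert even_block.match(block(4)) and not even_block.match(block(3))
assert odd_block.match(block(3)) and not odd_block.match(block(4))
```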

Finally, if \(\delta(a,q_{i})=\operatorname {CHECK}_{\mathrm {R}}(q_{j})\) for some q j Q, we define

$$\alpha^{head}_{a,i}:=(0\mid \texttt {\#})^{*} \texttt {\#}00^{a} \texttt {\#}0^{i} \texttt {\#} \texttt {\#}0^{+} \texttt {\#}0^{+} \texttt {\#}00^{1-a} \texttt {\#}(0\mid \texttt {\#})^{*}.$$

As \(\operatorname {CHECK}_{\mathrm {R}}\)-instructions do not change the tape or the head symbol, we just need to describe the case where a′≠a. The expression identifies an encoding of a configuration with head symbol a in state q i (again using ## as a navigation tool), skips over \(t'_{L}\) and \(t'_{R}\), and finds a head symbol a′=1 if a=0, or a′=0 if a=1. As a is fixed within every \(\alpha^{head}_{a,i}\), we can use the shorthand notation 01−a without any formal problems (as it is just another notation for 0 or λ, depending on a).

Tape side errors: As mentioned above, α tape shall be the only expression in this proof that uses variables and variable bindings. In fact, as we operate in \(\operatorname {RegEx}(1)\), we are only allowed to use a single variable (which shall be called x), and bind it only once in all of α tape .

In order to increase the readability, we shall define α tape using numerous subexpressions. As most of these expressions contain the binding operator %, simply connecting them with ∣ (as we did with the proper regular expressions in the previous cases) would force us out of \(\operatorname {RegEx}(1)\). The main idea of this part of the proof is that, in all these expressions, the binding occurs only in a prefix that they all have in common. This allows us to ‘factor out’ the variable binding, and to capture all tape side errors without leaving \(\operatorname {RegEx}(1)\). Therefore, we do not need to worry about the fact that most of the following definitions contain %x; as we shall see, the resulting expression α tape contains only a single %x.

In this section, we do not follow our usual order, as we discuss tape side errors for L- and R-instructions after the tape side errors for \(\operatorname {CHECK}_{\mathrm {R}}\)-instructions. If δ(a,q i ) is a \(\operatorname {CHECK}_{\mathrm {R}}\)-instruction, we define

$$\begin{aligned}\alpha^{a,i}_{L,>}&:=(0\mid \texttt {\#})^{*} \texttt {\#}0\bigl(0^{*}\bigr)\%x \texttt {\#}0^{+} \texttt {\#}00^{a} \texttt {\#}0^{i} \texttt {\#} \texttt {\#}0x0^{+} \texttt {\#}(0\mid \texttt {\#})^{*},\\ \alpha^{a,i}_{L,<}&:=(0\mid \texttt {\#})^{*} \texttt {\#}0\bigl(0^{*}\bigr)\%x\,0^{+} \texttt {\#}0^{+} \texttt {\#}00^{a} \texttt {\#}0^{i} \texttt {\#} \texttt {\#}0x \texttt {\#}(0\mid \texttt {\#})^{*},\\ \alpha^{a,i}_{R,>}&:=(0\mid \texttt {\#})^{*} \texttt {\#}0\bigl(0^{*}\bigr)\%x \texttt {\#}00^{a} \texttt {\#}0^{i} \texttt {\#} \texttt {\#}0^{+} \texttt {\#}0x0^{+} \texttt {\#}(0\mid \texttt {\#})^{*},\\ \alpha^{a,i}_{R,<}&:=(0\mid \texttt {\#})^{*} \texttt {\#}0\bigl(0^{*}\bigr)\%x\,0^{+} \texttt {\#}00^{a} \texttt {\#}0^{i} \texttt {\#} \texttt {\#}0^{+} \texttt {\#}0x \texttt {\#}(0\mid \texttt {\#})^{*}.\end{aligned}$$

Intuitively, for M∈{L,R}, \(\alpha^{a,i}_{M,>}\) is used to describe all successor configurations (after reading a in q i ), where \(\operatorname {e}(t'_{M})>\operatorname {e}(t_{M})\). Likewise, \(\alpha^{a,i}_{M,<}\) handles all those cases where \(\operatorname {e}(t'_{M})<\operatorname {e}(t_{M})\). We discuss the correctness of these expressions using \(\alpha^{a,i}_{L,>}\) as an example; the three other expressions behave analogously. If \(\alpha^{a,i}_{L,>}\) matches a word from \(S_{\mathcal {X}}\), x is bound to some word 0n with n≥0 (due to \((0^{*})\%x\)). As this 0n belongs to the fourth block of 0s when counting from ## to the left, 00n corresponds to \(00^{\operatorname {e}(t_{L})}\) for a configuration (t L ,t R ,a,q i ). Analogously, the subexpression ##0x0+ # matches the block that encodes \(\operatorname {e}(t'_{L})\). As ##00n0+ # is expanded to some ##00n0m (with m≥1), we know that \(\operatorname {e}(t'_{L})=n+m>n=\operatorname {e}(t_{L})\). Likewise, for every \(\operatorname {e}(t'_{L})>\operatorname {e}(t_{L})\), we can find an appropriate m≥1 and expand 0+ to 0m. Thus, \(\alpha^{a,i}_{L,>}\) matches exactly those cases in which \(\mathcal {X}\) reads a in q i , and the resulting \(\operatorname {e}(t'_{L})\) is larger than it should be (i.e., larger than \(\operatorname {e}(t_{L})\), as \(\operatorname {CHECK}_{\mathrm {R}}\)-instructions do not change the tape).

The three other expressions behave analogously; but note that for \(\alpha^{a,i}_{R,>}\) and \(\alpha^{a,i}_{R,<}\), the subexpression 0(0^*)%x matches the coding of \(\operatorname {e}(t_{R})\) instead of \(\operatorname {e}(t_{L})\), as can be seen from the location of ##.

Handling tape side errors for configurations that lead to a left or right movement of the head follows the same basic principle, but is a little more complicated, as we have to deal with changes to the tape. First, recall that the equations given in Observation 1 allow us to compute \(\operatorname {e}(t'_{L})\) and \(\operatorname {e}(t'_{R})\) from \(\operatorname {e}(t_{L})\), \(\operatorname {e}(t_{R})\) and δ(a,q i ).

If δ(a,q i )=(b,L,q j ) for some q j ∈Q, we define

These expressions fulfill the same purpose as their identically named counterparts that we defined to handle tape side errors for configurations in which a \(\operatorname {CHECK}_{\mathrm {R}}\)-instruction is executed, i.e., they describe the cases where \(\operatorname {e}(t'_{L})\) or \(\operatorname {e}(t'_{R})\) contains too much or too little. For technical reasons that shall be explained a little later, we also use a proper regular expression \(\alpha^{a,i}_{mod}\) to describe the errors where \(\operatorname {e}(t'_{R})\) has the wrong parity.

We begin with the expressions that handle errors on the left tape side. According to the first equation in Observation 1, the correct \(t'_{L}\) is characterized by \(\operatorname {e}(t'_{L})=\operatorname {e}(t_{L})\operatorname {div}2\).

First, we consider \(\alpha^{a,i}_{L,>}\). As before, a and q i are matched in #00^a#0^i##. To the left of that block, the expression skips the encoding of \(\operatorname {e}(t_{R})\), and matches \(00^{\operatorname {e}(t_{L})}\) with the expression 0(0^*)%x x(0∣λ). Note that x is bound to \(0^{\operatorname {e}(t_{L})\operatorname {div}2}\), and \(0^{\operatorname {e}(t_{L})\operatorname {mod}2}\) matches 0 or λ in (0∣λ). To the right of ##, 0x0^+ matches the encoding \(00^{\operatorname {e}(t'_{L})}\). It is easily seen that this describes exactly those cases where \(\operatorname {e}(t'_{L})>(\operatorname {e}(t_{L})\operatorname {div}2)\), as the 0^+ is used to match all possible values of \(\operatorname {e}(t'_{L})-(\operatorname {e}(t_{L})\operatorname {div}2)\).

Next, consider \(\alpha^{a,i}_{L,<}\). Note that if \(\operatorname {e}(t'_{L})= (\operatorname {e}(t_{L})\operatorname {div}2 )\), either \(\operatorname {e}(t_{L})=2\operatorname {e}(t'_{L})\) or \(\operatorname {e}(t_{L})=2\operatorname {e}(t'_{L})+1\) holds. Thus, \(\operatorname {e}(t'_{L})\) is too small if and only if \(\operatorname {e}(t_{L})>2\operatorname {e}(t'_{L})+1\), which holds if and only if \(\operatorname {e}(t_{L})=2\operatorname {e}(t'_{L})+2+m\) for some m≥0. We can easily see that (for words from \(S_{\mathcal {X}}\)), \(\alpha^{a,i}_{L,<}\) matches exactly those encodings of successive configurations (t L ,t R ,a,q i ) and \((t'_{L},t'_{R},a',q_{j})\), where \(\operatorname {e}(t_{L})= 2n+2+m\) and \(\operatorname {e}(t'_{L})=n\) for some m,n≥0, as x binds to 0^n, and 0(0^*)%x x000^* corresponds to \(00^{n} 0^{n} 00 0^{m}=00^{\operatorname {e}(t_{L})}\). Thus, this expression handles exactly those cases where \(\operatorname {e}(t'_{L})\) is too small.

Hence, \(\alpha^{a,i}_{L,>}\) and \(\alpha^{a,i}_{L,<}\) handle all errors on the left tape side (for given a,q i with δ(a,q i )=(b,L,q j )). As we still need to handle errors on the right tape side, recall that (according to the first equation in Observation 1), the correct right tape word \(t'_{R}\) is characterized by \(\operatorname {e}(t'_{R})=2 \operatorname {e}(t_{R})+b\).

It is easy to see that \(\alpha^{a,i}_{R,>}\) handles exactly those words (from \(S_{\mathcal {X}}\), with \(\mathcal {X}\) reading a in q i ) where \(\operatorname {e}(t'_{R})\) is too large. First, x is bound to \(0^{\operatorname {e}(t_{R})}\). For \(t'_{R}\), we have \(0\:xx0^{b}0^{+}\), and thus, \(\operatorname {e}(t'_{R})= 2\operatorname {e}(t_{R})+b+n\), with n≥1, where 0^+ expresses the difference between \(\operatorname {e}(t'_{R})\) and its intended value.

The final case, where \(t'_{R}\) is too small, is more complicated. We handle this case with two expressions. First, note that \(\operatorname {e}(t'_{R})\operatorname {mod}2 = b\) must hold. The expression \(\alpha^{a,i}_{mod}\) describes those cases where this condition is not satisfied; i.e., \(\mathcal {X}\) reads a in q i , but \(\operatorname {e}(t'_{R})\operatorname {mod}2 \neq b\). Therefore, we can restrict our definition of \(\alpha^{a,i}_{R,<}\) to those cases where \(\operatorname {e}(t'_{R})\) and b have the same parity. In these cases, \(\operatorname {e}(t'_{R})=2n+b\) for some n≥0, but \(\operatorname {e}(t_{R})> n\), which holds if and only if there is an m≥0 with \(\operatorname {e}(t_{R})=n+m+1\). The words from \(S_{\mathcal {X}}\) that match \(\alpha^{a,i}_{R,<}\) (but not \(\alpha^{a,i}_{mod}\)) are exactly those that satisfy this condition: The variable x is bound to 0^n, while 0^+ matches 0^{m+1}. Thus, \(\alpha^{a,i}_{mod}\mid \alpha ^{a,i}_{R,<}\) handles the cases where \(\operatorname {e}(t'_{R})\) is too small (but not exactly those cases, even when restricted to \(S_{\mathcal {X}}\), as \(\alpha ^{a,i}_{mod}\) also matches encodings of computations where \(\operatorname {e}(t'_{R})\) is too large and of the wrong parity).

As we have seen, these five expressions describe all tape side errors occurring during left movements of the head. Analogously, if δ(a,q i )=(b,R,q j ) for some q j ∈Q, we define

As already suggested by the similarities (of the equations in Observation 1, and of the definitions of the five expressions), tape side errors for R-movements can be handled analogously to tape side errors for L-movements. Thus, it can be easily verified that these expressions describe exactly the tape side errors for all transitions where \(\mathcal {X}\) reads a in state q i (again assuming that we only consider words from \(S_{\mathcal {X}}\)).

We can now combine all these expressions to define α tape . First, note that whenever it is defined, \(\alpha^{a,i}_{mod}\) is a proper regular expression. We define

$$\alpha_{mod}:=\alpha^{0,1}_{mod}\mid \alpha^{1,1}_{mod}\mid \alpha^{0,2}_{mod}\mid \alpha^{1,2}_{mod}\mid \cdots \mid \alpha^{0,\nu}_{mod}\mid \alpha^{1,\nu}_{mod},$$

omitting those \(\alpha^{a,i}_{mod}\) that are undefined (i.e., δ(a,q i ) is a \(\operatorname {HALT}\)- or a \(\operatorname {CHECK}_{\mathrm {R}}\)-instruction).

Next, note that all other expressions for tape side errors use exactly one variable binding, and start with the common prefix (0∣#)^* #0(0^*)%x. For every \(\alpha^{a,i}_{M,c}\) (with M∈{L,R} and c∈{>,<}), let \(\hat{\alpha}^{a,i}_{M,c}\) be the (uniquely defined) extended regular expression that satisfies

$$\alpha^{a,i}_{M,c} =(0\mid \texttt {\#})^* \texttt {\#}0 (0^*)\%x\:\hat{\alpha}^{a,i}_{M,c}.$$

In other words, \(\hat{\alpha}^{a,i}_{M,c}\) is obtained from \(\alpha ^{a,i}_{M,c}\) by factoring out the prefix that contains the variable binding. We then combine all these expressions into a single expression \(\alpha_{var}\in \operatorname {RegEx}(1)\) by

omitting all subexpressions that refer to (a,q i ) where \(\delta (a,q_{i})=\operatorname {HALT}\). Finally, we set

$$\alpha _{tape} :=\alpha_{mod}\mid \alpha_{var}.$$

As discussed before, it is easy to see that L(α tape ) is the union of all the languages that are generated by the various regular expressions we defined to handle tape side errors, and thus, L(α tape ) contains exactly those words from \(S_{\mathcal {X}}\) that encode a tape side error at some point of the encoded computation. Therefore, for every word w∈{0,#}^*, \(w\in L(\alpha_{\mathcal {X}})\) if and only if

$$w \in L(\alpha _{struc})\cup L(\alpha _{state})\cup L(\alpha _{head})\cup L(\alpha _{tape}),$$

and this holds if and only if w contains a structural or behavioral error. Hence, we observe that \(w\in L(\alpha_{\mathcal {X}})\) if and only if \(w\in \operatorname {INVALC}(\mathcal {X})\), and thereby, \(L(\alpha_{\mathcal {X}})= \operatorname {INVALC}(\mathcal {X})\).

Finally, as this proof defines \(\alpha_{\mathcal {X}}\) constructively, using only ν and δ, it also describes an effective procedure to compute \(\alpha_{\mathcal {X}}\) from \(\mathcal {X}\). This concludes the proof of Theorem 14. □

Theorem 10 follows almost immediately from Theorem 14, and Lemmas 11 and 13:

Proof of Theorem 10

We prove each of the claims by reduction from one of three problems for extended Turing machines that are listed in Lemma 11 (these being emptiness, finiteness, and the complement of finiteness). Each reduction uses the same construction: Given an extended Turing machine \(\mathcal {X}\), we construct an extended regular expression \(\alpha_{\mathcal {X}}\in \operatorname {RegEx}(1)\) with \(\operatorname {INVALC}(\mathcal {X})=L(\alpha_{\mathcal {X}})\) (this is possible according to Theorem 14).

Then \(\operatorname {dom}_{\mathrm {X}}(\mathcal {X})=\emptyset\) if and only if \(\operatorname {VALC}(\mathcal {X})=\emptyset\), which holds if and only if \(\operatorname {INVALC}(\mathcal {X})=\{0,\texttt {\#}\}^{*}\), which holds if and only if \(L(\alpha_{\mathcal {X}})=\{0,\texttt {\#}\}^{*}\), which holds if and only if \(L(\alpha_{\mathcal {X}})\supseteq \{0,\texttt {\#}\}^{*}\). Thus, any algorithm that decides universality for \(\operatorname {RegEx}(1)\) could be used to decide the emptiness of the domain for extended Turing machines, which is undecidable according to Lemma 11.

Furthermore, \(\operatorname {dom}_{\mathrm {X}}(\mathcal {X})\) is finite if and only if \(\operatorname {VALC}(\mathcal {X})\) is finite, which holds if and only if \(\operatorname {VALC}(\mathcal {X})\) is regular (according to Lemma 13), which holds if and only if \(\operatorname {INVALC}(\mathcal {X})\) is regular (as the class of regular languages is closed under complementation), which holds if and only if \(L(\alpha_{\mathcal {X}})\) is regular. Hence, semi-decidability of regularity for \(\operatorname {RegEx}(1)\) would lead to semi-decidability of finiteness of \(\operatorname {dom}_{\mathrm {X}}\), a problem that is not semi-decidable (according to Lemma 11).

Likewise, as \(\operatorname {dom}_{\mathrm {X}}(\mathcal {X})\) is finite if and only if \(L(\alpha_{\mathcal {X}})\) is regular, \(\operatorname {dom}_{\mathrm {X}}(\mathcal {X})\) is infinite if and only if \(L(\alpha_{\mathcal {X}})\) is not regular. Therefore, semi-decidability of non-regularity for \(\operatorname {RegEx}(1)\) contradicts the fact that the complement of finiteness of \(\operatorname {dom}_{\mathrm {X}}\) is not semi-decidable (see Lemma 11).

As \(\operatorname {INVALC}(\mathcal {X})\) is cofinite if and only if \(\operatorname {INVALC}(\mathcal {X})\) is regular, the results for regularity and non-regularity also show that neither cofiniteness nor non-cofiniteness is semi-decidable for \(\operatorname {RegEx}(1)\). □

Those who are interested in the exact position of these problems in the arithmetical hierarchy (cf. Odifreddi [28]) can conclude that universality is \(\varPi^{0}_{1}\)-complete, while regularity and co-finiteness are \(\varSigma^{0}_{2}\)-complete. For each of these problems, hardness for the respective class follows from the respective completeness of each of the problems on extended Turing machines used in the proof of Theorem 10 (see the remark after the proof of Lemma 11). Membership in the respective level of the hierarchy is easily proved using the appropriate representation for that class; e.g., universality of some L(α) can be expressed as \(\forall w\in\varSigma^{*}:w\in L(\alpha)\). As the membership problem for extended regular expressions is decidable, this proves that universality is in \(\varPi^{0}_{1}\). Likewise, non-regularity can be expressed as \(\forall\beta\in \operatorname {RegEx}(0):\exists w\in\varSigma ^{*}:[w\in L(\alpha)\Leftrightarrow w\notin L(\beta)]\), which shows that non-regularity is in \(\varPi^{0}_{2}\). Actually, under a strict interpretation of the definition given by Odifreddi [28], we would need to quantify over natural numbers; but as there are computable bijections between ℕ and \(\varSigma^{*}\), as well as between ℕ and \(\operatorname {RegEx}(0)\), we can omit this technical detail.

4 Consequences of Theorem 10

In Sect. 3, we introduced and proved Theorem 10, which states that for \(\operatorname {RegEx}(1)\), universality is not semi-decidable; and regularity and cofiniteness are neither semi-decidable, nor co-semi-decidable.

Of course, all these undecidability results also hold for every \(\operatorname {RegEx}(k)\) with k≥2, and for the whole class \(\operatorname {RegEx}\) of extended regular expressions (as \(\operatorname {RegEx}(1)\) is contained in all these classes).

Theorem 10 also demonstrates that inclusion and equivalence are undecidable for \(\operatorname {RegEx}(1)\) (and, hence, all of \(\operatorname {RegEx}\)). We also see, as an immediate consequence of Theorem 10, that there is no algorithm that minimizes the number of variables in an extended regular expression, as such an algorithm could be used to decide regularity.

Note that in the proof of Theorem 10, the single variable x is bound only to words that match the expression 0^*. This shows that the “negative” properties of extended regular expressions we derive from Theorem 10 hold even if we restrict \(\operatorname {RegEx}(1)\) by requiring that the variable can only be bound to a very restricted proper regular expression. Furthermore, the proof also applies to the extension of proper regular expressions through numerical parameters that is proposed by Della Penna et al. [14]. We discuss an adaptation to another model in Sect. 4.2.

In addition to this, the construction from Theorem 14 (which we used to prove Theorem 10, and consequently, all other results in the present paper) can be refined to also include bounds on the number of occurrences of the single variable—see Sect. 4.1.

From the undecidability of universality, we can immediately conclude that \(\operatorname {RegEx}(1)\) cannot be minimized effectively:

Corollary 15

Let c be a complexity measure for \(\operatorname {RegEx}(1)\). Then there is no recursive function m c that, given an expression \(\alpha\in \operatorname {RegEx}(1)\), returns an expression \(m_{c}(\alpha)\in \operatorname {RegEx}(1)\) with (1) L(m c (α))=L(α), and (2) c(β)≥c(m c (α)) for every \(\beta\in \operatorname {RegEx}(1)\) with L(β)=L(α).

Proof

Let c be a complexity measure for \(\operatorname {RegEx}(1)\) and assume there is such a function m c . Let Σ be any finite alphabet with |Σ|≥2, and let

$$U_c:=\bigl\{m_c(\alpha)\mid L(\alpha)=\varSigma^* \bigr\}.$$

By the choice of m c , every element of U c has the complexity min{c(β)∣L(β)=Σ^*}; as c is a complexity measure, only finitely many expressions share this complexity value. Hence, U c is a finite set, and therefore recursive. As L(α)=Σ^* if and only if m c (α)∈U c , m c and the characteristic function of U c can be used to decide universality for \(\operatorname {RegEx}(1)\), a problem that is not decidable (cf. Theorem 10). This is a contradiction. □

Following the classic proof method of Hartmanis [19] (cf. Kutrib [24]), we can use the fact that non-regularity is not semi-decidable to obtain a result on the relative succinctness of extended and proper regular expressions:

Corollary 16

There are non-recursive tradeoffs between \(\operatorname {RegEx}(1)\) and \(\operatorname {RegEx}(0)\). This holds even if we consider only the tradeoffs between \(\operatorname {CoFRegEx}(1)\) and \(\operatorname {CoFRegEx}(0)\), using a complexity measure for \(\operatorname {RegEx}(1)\).

Proof

The result for \(\operatorname {RegEx}(1)\) and \(\operatorname {RegEx}(0)\) follows immediately from Theorem 10 and Theorem 4 in Hartmanis [19]: As non-regularity is not semi-decidable for \(\operatorname {RegEx}(1)\), the tradeoff between \(\operatorname {RegEx}(1)\) and \(\operatorname {RegEx}(0)\) is non-recursive (see also Kutrib [24] for more detailed explanations than in Hartmanis [19]).

Applied to \(\operatorname {RegEx}(1)\) and \(\operatorname {RegEx}(0)\), Hartmanis’ proof scheme gives the following proof: As non-regularity is not semi-decidable for \(\operatorname {RegEx}(1)\), the set

$$\varDelta :=\bigl\{\alpha\in \operatorname {RegEx}(1)\mid L(\alpha) \text{ is not regular}\bigr\}$$

is not partially recursive. Now assume that, for a given complexity measure c, the tradeoff from \(\operatorname {RegEx}(1)\) to \(\operatorname {RegEx}(0)\) is recursively bounded by some recursive function f c . We can then use f c to construct a semi-decision procedure for Δ as follows: Given some \(\alpha\in \operatorname {RegEx}(1)\), we compute n:=f c (c(α)), and let

$$F_n:=\bigl\{\beta\in \operatorname {RegEx}(0)\mid c(\beta)\leq n\bigr\}.$$

As c is a complexity measure, F n is finite, and we can effectively list all its elements (as we can effectively list all \(\beta\in \operatorname {RegEx}(1)\) with c(β)≤n, and we can decide whether \(\beta\in \operatorname {RegEx}(0)\) by searching β for occurrences of variables or the metacharacter %). For each β∈F n , we then semi-decide L(β)≠L(α) by checking w∈L(α) and w∈L(β) for all w∈Σ^* successively.

If L(α) is not regular, then for every β∈F n , we find some w β in finite time that proves L(β)≠L(α) (as w β is contained in one of the two languages, but not in the other), and we can proceed to the next expression in F n . Once every β∈F n has been eliminated in this way, we know that α∈Δ, and the procedure can return 1. If, on the other hand, L(α) is regular, and f c is a bound on the tradeoff from \(\operatorname {RegEx}(1)\) to \(\operatorname {RegEx}(0)\), there is a β∈F n with L(β)=L(α), and the procedure will never terminate. Thus, we can construct a semi-decision procedure for Δ, which contradicts the established fact that Δ is not partially recursive.
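The dovetailing search at the core of this procedure, semi-deciding L(β)≠L(α) by enumerating Σ^* and testing membership on both sides, can be sketched as follows (a toy Python sketch with decidable membership predicates standing in for the membership tests; all names are ours):

```python
from itertools import count, product

def words(alphabet):
    # Enumerate Sigma* in length-lexicographic order.
    for n in count(0):
        for letters in product(alphabet, repeat=n):
            yield ''.join(letters)

def find_separating_word(in_l1, in_l2, alphabet):
    # Halts (returning a witness) iff the two languages differ;
    # runs forever if they are equal -- a semi-decision procedure.
    for w in words(alphabet):
        if in_l1(w) != in_l2(w):
            return w

# Toy instance: L1 = words with an even number of a's, L2 = words without a.
witness = find_separating_word(lambda w: w.count('a') % 2 == 0,
                               lambda w: 'a' not in w,
                               'ab')
print(witness)  # 'aa': the first enumerated word on which the languages differ
```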

Likewise, we can obtain the same result if we restrict the claim to expressions from \(\operatorname {CoFRegEx}(1)\) and \(\operatorname {CoFRegEx}(0)\). According to Theorem 10, non-cofiniteness for \(\operatorname {RegEx}(1)\) is not semi-decidable. Thus, given a complexity measure c for \(\operatorname {RegEx}(1)\), a bound f c on the tradeoff from \(\operatorname {CoFRegEx}(1)\) to \(\operatorname {CoFRegEx}(0)\) could be used to give a semi-decision algorithm for the set

$$\varDelta _C:=\bigl\{\alpha\in \operatorname {RegEx}(1)\mid\text{$L(\alpha)$ is not cofinite}\bigr\}.$$

First, note that the use of a complexity measure for \(\operatorname {RegEx}(1)\) instead of \(\operatorname {CoFRegEx}(1)\) is intentional and serves to avoid complications with the fact that cofiniteness is not decidable for \(\operatorname {RegEx}(1)\) (recall Theorem 10). We can construct a semi-decision procedure for Δ C from f c using almost the same reasoning as above: Given an \(\alpha\in \operatorname {RegEx}(1)\), we define n:=f c (c(α)), and let

$$F_{C,n}:=\bigl\{\beta\in \operatorname {RegEx}(0)\mid c(\beta)\leq n, \text{$L(\beta)$ is cofinite}\bigr\}.$$

Enumerating all elements of F C,n is slightly more difficult than enumerating all elements of F n (see above), as we also need to decide whether L(β) is cofinite for every β∈F n . Luckily, this is not a problem, as cofiniteness is decidable for \(\operatorname {RegEx}(0)\) (e.g., given a \(\beta\in \operatorname {RegEx}(0)\), one could convert β into an equivalent DFA, compute its complement, and check the resulting DFA for cycles that are reachable from the initial state and from which a final state is reachable; the language of the complement is infinite if and only if such a cycle exists). From here on, the proof continues as above, mutatis mutandis. □
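The cycle test mentioned above rests on the fact that a DFA accepts an infinite language if and only if some cycle is reachable from the initial state and can itself reach a final state. A minimal Python sketch (the transition-table representation and all names are ours):

```python
def accepts_infinitely_many(delta, start, finals):
    """delta: state -> (letter -> state), a partial DFA transition table.
    Returns True iff the accepted language is infinite."""
    # Forward reachability from the start state.
    reach, stack = {start}, [start]
    while stack:
        for r in delta.get(stack.pop(), {}).values():
            if r not in reach:
                reach.add(r)
                stack.append(r)
    # Backward reachability from the final states.
    co, changed = set(finals), True
    while changed:
        changed = False
        for q, trans in delta.items():
            if q not in co and any(r in co for r in trans.values()):
                co.add(q)
                changed = True
    useful = reach & co  # states lying on some accepting path
    # Cycle detection among useful states (iterative three-color DFS).
    WHITE, GRAY, BLACK = 0, 1, 2
    color = {q: WHITE for q in useful}
    for s in useful:
        if color[s] != WHITE:
            continue
        color[s] = GRAY
        frames = [(s, iter([r for r in delta.get(s, {}).values() if r in useful]))]
        while frames:
            q, succs = frames[-1]
            r = next(succs, None)
            if r is None:
                color[q] = BLACK
                frames.pop()
            elif color[r] == GRAY:
                return True  # back edge: a cycle on an accepting path
            elif color[r] == WHITE:
                color[r] = GRAY
                frames.append((r, iter([t for t in delta.get(r, {}).values() if t in useful])))
    return False

print(accepts_infinitely_many({'q0': {'a': 'q0'}}, 'q0', {'q0'}))            # True: accepts a*
print(accepts_infinitely_many({'q0': {'a': 'q1'}, 'q1': {}}, 'q0', {'q1'}))  # False: accepts only 'a'
```

To decide cofiniteness of some L(β), this test would be applied to the complement DFA: L(β) is cofinite if and only if the complement accepts only finitely many words.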

Thus, no matter which complexity measure and which computable upper bound we assume for the tradeoff, there is always a regular language L that can be described by an extended regular expression from \(\operatorname {RegEx}(1)\) so much more succinctly that every proper regular expression for L has to break that bound. Obviously, this also has implications for the complexity of matching regular expressions: Although membership is “easier” for proper regular expressions than for extended regular expressions, there are regular languages that can be expressed far more efficiently through extended regular expressions than through proper regular expressions.

Recall Example 1, where we consider extended regular expressions that describe finite languages. In this restricted case, there exists an effective conversion procedure—hence, the tradeoffs are recursive:

Lemma 17

For every k≥1, the tradeoff between \(\operatorname {FRegEx}(k)\) and \(\operatorname {FRegEx}(0)\) is recursive (even when considering complexity measures for \(\operatorname {RegEx}(k)\) instead of \(\operatorname {FRegEx}(k)\)).

Proof

Let k≥1, and let c be a complexity measure for \(\operatorname {RegEx}(k)\). By definition, no \(\alpha\in \operatorname {FRegEx}(k)\) contains a Kleene star (or Kleene plus). Thus, given an \(\alpha\in \operatorname {FRegEx}(k)\), we can effectively compute the finitely many words in L(α) by exhausting all possible combinations of choices for each alternation symbol in α, and computing the corresponding word for each combination, handling all bindings accordingly.

For example, the expression \(\alpha :=(\mathtt {a}\mid \mathtt {b})(\mathtt {a}\mid \mathtt {b})\%x (\mathtt {a}\mid \mathtt {b})\%y\:x\:y\) contains three alternation symbols, totaling 2^3=8 possible combinations of choices, each corresponding to one of the words in L(α). More generally, if \(\alpha\in \operatorname {FRegEx}(k)\) contains n alternation symbols, L(α) contains at most 2^n words w 1,…,w i (i≤2^n), and we can compute an \(\hat{\alpha }\in \operatorname {FRegEx}(0)\) with \(L(\hat{\alpha})=L(\alpha)\) simply by computing these words w 1,…,w i and defining \(\hat{\alpha} :=w_{1}\mid \cdots \mid w_{i}\).
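For the example expression above, this enumeration can be carried out directly (a small Python sketch of the exhaustion argument; all names are ours):

```python
from itertools import product

# L(alpha) for alpha = (a|b)(a|b)%x(a|b)%y x y: one choice per
# alternation symbol; the bindings %x and %y capture the second and
# third letters, and the references x, y repeat them.
def language_of_alpha():
    words = set()
    for c1, c2, c3 in product('ab', repeat=3):
        x, y = c2, c3                     # values bound by (..)%x and (..)%y
        words.add(c1 + c2 + c3 + x + y)   # the references repeat the bindings
    return words

L = language_of_alpha()
print(len(L))  # 8 = 2^3: here, every combination of choices yields a distinct word
```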

Fixing an effective procedure that computes \(\hat{\alpha}\), we can straightforwardly define the recursive bound f c :ℕ→ℕ as follows: For n≥0, we define

$$F_n:=\bigl\{\hat{\alpha}\in \operatorname {FRegEx}(0)\mid\alpha\in \operatorname {FRegEx}(k), c(\alpha)=n\bigr\}.$$

By definition of complexity measures, every F n is finite, and given any n, we can effectively list all expressions from F n (again, as c is a complexity measure, and membership in \(\operatorname {FRegEx}(k)\) is straightforwardly decidable for expressions from \(\operatorname {RegEx}(k)\)). For any n≥0, we define

$$f_c(n):=\max\bigl\{c(\hat{\alpha})\mid\hat{\alpha}\in F_n\bigr\}.$$

As every F n is finite and can be listed effectively, and as \(c(\hat{\alpha})\) can be computed effectively, f c is a recursive function. Furthermore, f c is a bound on the tradeoff from \(\operatorname {FRegEx}(k)\) to \(\operatorname {FRegEx}(0)\) by definition. □

Although the class of \(\operatorname {RegEx}\)-languages is not closed under complementation (Lemma 2 in Câmpeanu et al. [9]), there are languages L such that both L and its complement Σ L are \(\operatorname {RegEx}\)-languages (e.g., all regular languages). Combining Lemma 17 and Corollary 16, we can straightforwardly conclude that there are cases where it is far more efficient to describe the complement of a \(\operatorname {RegEx}(1)\)-language, as opposed to the language itself:

Corollary 18

Let Σ be a finite alphabet. Let c be a complexity measure for \(\operatorname {RegEx}(1)\). For any recursive function f c :ℕ→ℕ, there exists an \(\alpha\in \operatorname {RegEx}(1)\) such that Σ L(α) is a \(\operatorname {RegEx}(1)\)-language, and for every \(\beta\in \operatorname {RegEx}(1)\) with L(β)=Σ L(α), c(β)≥f c (c(α)).

Proof

Assume to the contrary that, for some complexity measure c for \(\operatorname {RegEx}(1)\), there is a recursive function f 1:ℕ→ℕ such that for every \(\alpha\in \operatorname {RegEx}(1)\), if Σ L(α) is a \(\operatorname {RegEx}(1)\)-language, there is a \(\beta\in \operatorname {RegEx}(1)\) with L(β)=Σ L(α) and c(β)≤f 1(c(α)) (in other words, f 1 is a recursive bound on the blowup of complementation for \(\operatorname {RegEx}(1)\)).

We can now use f 1 and Lemma 17 to obtain a recursive bound on the tradeoff between \(\operatorname {CoFRegEx}(1)\) and \(\operatorname {CoFRegEx}(0)\), which contradicts Corollary 16.

First, note that according to Lemma 17, there is a recursive bound f 2:ℕ→ℕ on the tradeoff between \(\operatorname {FRegEx}(1)\) and \(\operatorname {FRegEx}(0)\).

Furthermore, we can easily prove that there is a recursive bound f 3:ℕ→ℕ on the blowup that occurs in the complementation for \(\operatorname {FRegEx}(0)\), as f 3 can be computed as follows: For every input n, there are finitely many \(\alpha\in \operatorname {RegEx}(0)\) with c(α)=n. Finiteness for \(\operatorname {RegEx}(0)\) is obviously decidable; thus, we can effectively construct the finite set

$$F_n:=\bigl\{\alpha\in \operatorname {FRegEx}(0)\mid c(\alpha)=n\bigr\}.$$

For every αF n , we (effectively) construct an \(\overline {\alpha}\in \operatorname {RegEx}(0)\) with \(L(\overline{\alpha})=\varSigma ^{*}\setminus L(\alpha)\), for example, by converting α into a DFA, complementing it, and converting the resulting DFA into a proper regular expression, all using the standard techniques as described in Hopcroft and Ullman [22]. We then check all \(\beta\in \operatorname {RegEx}(0)\), ordered by growing size of c(β), until we find the smallest β (w.r.t. c) with L(β)=Σ L(α), and refer to this β as \(\tilde{\alpha}\) (again, this is possible due to the decidability of equivalence for \(\operatorname {RegEx}(0)\), cf. Hopcroft and Ullman [22]), and define

$$C_n:=\{\tilde{\alpha}\mid\alpha\in F_n\}.$$

Finally, we define f 3(n) to be the maximum of all \(c(\tilde{\alpha})\) with \(\tilde{\alpha}\in C_{n}\). By definition, f 3 is recursive, and serves as an upper bound for the blowup that occurs when complementing expressions from \(\operatorname {FRegEx}(0)\).

Now, consider the function f:ℕ→ℕ that is defined by f(n):=f 3(f 2(f 1(n))) for every n∈ℕ. We shall see that our assumption implies that f is a recursive bound on the tradeoff between \(\operatorname {CoFRegEx}(1)\) and \(\operatorname {CoFRegEx}(0)\), which contradicts Corollary 16. An illustration of this argument can be found in Fig. 3.

Fig. 3

An illustration of the functions that are used in the proof of Corollary 18

First, observe that f is a recursive function, as f 1, f 2, and f 3 are recursive functions. Due to Corollary 16, there is an \(\alpha_{fc}\in \operatorname {CoFRegEx}(1)\) such that L(α fc ) is regular, but for every \(\beta\in \operatorname {RegEx}(0)\) (and hence, \(\beta\in \operatorname {CoFRegEx}(0)\)) with L(α fc )=L(β), c(β)>f(c(α fc )).

By our assumption, there is an \(\overline{\alpha}_{fc}\in \operatorname {RegEx}(1)\) with \(L(\overline{\alpha}_{fc})=\varSigma^{*}\setminus L(\alpha_{fc})\) and \(c(\overline{\alpha}_{fc})\leq f_{1}(c(\alpha_{fc}))\). As \(\alpha _{fc}\in \operatorname {CoFRegEx}(1)\), L(α fc ) is cofinite, hence, \(L(\overline {\alpha}_{fc})\) is finite, and \(\overline{\alpha}_{fc}\in \operatorname {FRegEx}(1)\).

According to Lemma 17, there is a \(\overline{\beta }_{fc}\in \operatorname {FRegEx}(0)\) with \(L(\overline{\beta}_{fc})=L(\overline {\alpha }_{fc})=\varSigma^{*}\setminus L(\alpha_{fc})\), and \(c(\overline{\beta }_{fc})\leq f_{2}(c(\overline{\alpha}_{fc}))\).

Finally, as explained above, there is a \(\beta_{fc}\in \operatorname {CoFRegEx}(0)\) with \(L(\beta_{fc})=\varSigma^{*}\setminus L(\overline{\beta }_{fc})=L(\alpha _{fc})\) and

This contradicts our choice of α fc and concludes the proof. □

With some additional technical effort, we can extend the previous results on the undecidability of regularity and on the tradeoffs between \(\operatorname {RegEx}(1)\) and \(\operatorname {RegEx}(0)\) to arbitrary levels of the hierarchy of \(\operatorname {RegEx}(k)\)-languages:

Lemma 19

Let k≥1. For \(\operatorname {RegEx}(k+1)\), \(\operatorname {RegEx}(k)\)-ity is neither semi-decidable, nor co-semi-decidable.

Proof

We adapt the construction from the proof of Theorem 14 to the larger alphabet \(\varSigma_{k}:=\{0,\texttt {\#},\mathdollar ,\mathtt {a}_{1},\mathtt {b}_{1},\ldots ,\mathtt {a}_{k},\mathtt {b}_{k}\}\), where 0, #, \(\mathdollar \), all a i , and all b i are pairwise distinct letters. For every i with 1≤ik, let Σ i :={a i ,b i }. Given an extended Turing machine \(\mathcal {X}\), we construct the extended regular expression \(\alpha_{\mathcal {X}}\in \operatorname {RegEx}(1)\) as in the proof of Theorem 14. For every i with 1≤ik, we define an expression \(\alpha_{i}\in \operatorname {RegEx}(1)\) by

$$\alpha_i:=\bigl((\mathtt {a}_i\mid \mathtt {b}_i)^+\bigr)\%x_i x_i,$$

where every x i is distinct from the variable x in \(\alpha_{\mathcal {X}}\), and from all x j with j≠i. Finally, we define \(\alpha_{\mathcal {X},k}\in \operatorname {RegEx}(k+1)\) by

Thus, \(\alpha_{\mathcal {X},k}\in \operatorname {RegEx}(k+1)\). It suffices to show that \(L(\alpha_{\mathcal {X},k})\) is a \(\operatorname {RegEx}(k)\)-language if and only if \(L(\alpha _{\mathcal {X}})\) is regular, as neither regularity nor non-regularity is semi-decidable for \(\operatorname {RegEx}(1)\), cf. Theorem 10.

The if direction is obvious: If \(L(\alpha_{\mathcal {X}})\) is regular, we can (non-effectively) replace the subexpression \(\alpha_{\mathcal {X}}\) of \(\alpha_{\mathcal {X},k}\) with an appropriate proper regular expression, and obtain an expression from \(\operatorname {RegEx}(k)\) for \(L(\alpha_{\mathcal {X},k})\).

For the only if direction, first note that, for every i, \(L(\alpha_{i})=\{ww\mid w\in\{\mathtt {a}_{i},\mathtt {b}_{i}\}^{+}\}\), a language that is well known not to be regular (as can be easily verified with the Pumping Lemma, cf. Hopcroft and Ullman [22]). Likewise, note that \(L_{k}:=L(\alpha _{1}\mathdollar \cdots \mathdollar \alpha_{k})\) is not a \(\operatorname {RegEx}(k-1)\)-language, as can be seen by the following line of reasoning: Assume there is an \(\alpha\in \operatorname {RegEx}(k-1)\) with L(α)=L k . By definition, α contains at most k−1 different variables. Note that, whenever α is matched to a w∈L k , every variable x in α that is bound and also referenced when matching w contains only terminals from a single set \(\varSigma_{i}\cup\{\mathdollar \}\), as \(L_{k}\subset\{\mathtt {a}_{1},\mathtt {b}_{1}\}^{+}\mathdollar \cdots \mathdollar \{\mathtt {a}_{k},\mathtt {b}_{k}\}^{+}\), and repeating any string that contains some a j or b j with j≠i would break this structure. As every L(α i ) needs to use at least one variable, and no variable that is matched to a letter other than \(\mathdollar \) can cross the boundaries between the different L(α i ), there can be no \(\alpha\in \operatorname {RegEx}(k-1)\) with L(α)=L k .
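The languages L(α i ) correspond directly to backreferences in practical regex dialects; for instance, in Python's `re` syntax, where the group plays the role of the binding and `\1` the role of the reference (an illustration in our notation, collapsing the indexed letters a i , b i to a, b):

```python
import re

# Squares over {a,b}: ((?:a|b)+) captures some nonempty w, and \1
# demands an immediate repetition of exactly that w.
square = re.compile(r'((?:a|b)+)\1')

print(bool(square.fullmatch('abab')))  # True:  abab = ww with w = ab
print(bool(square.fullmatch('aba')))   # False: no odd-length word is a square
```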

If \(L(\alpha_{\mathcal {X}})\) is not regular, this reasoning extends to \(L(\alpha_{\mathcal {X},k})\) and \(\operatorname {RegEx}(k)\), as we need at least one variable to generate \(L(\alpha_{\mathcal {X}})\), and k variables for L k , for a total of k+1 variables.

Thus, \(L(\alpha_{\mathcal {X},k})\) is a \(\operatorname {RegEx}(k)\)-language if and only if \(L(\alpha_{\mathcal {X}})\) is regular. As seen in the proof of Theorem 10, this proves that for \(\operatorname {RegEx}(k+1)\), \(\operatorname {RegEx}(k)\)-ity is neither semi-decidable, nor co-semi-decidable. □

Non-recursive tradeoffs between \(\operatorname {RegEx}(k+1)\) and \(\operatorname {RegEx}(k)\) for every k≥1 follow immediately, using Hartmanis’ proof technique as in the proof of Corollary 16.

Although the proof of Lemma 19 uses an unbounded terminal alphabet \(\varSigma_{k}:=\{0,\texttt {\#},\mathdollar ,\mathtt {a}_{1},\mathtt {b}_{1},\ldots,\mathtt {a}_{k},\mathtt {b}_{k}\}\), the construction can be adapted to an alphabet Σ of constant size in the following way. First, let \(\varSigma :=\{0,\texttt {\#},\mathdollar ,\mathtt {a},\mathtt {b},\mathtt {c}\}\), and define a morphism \(h_{k}:\varSigma _{k}^{*}\to \varSigma^{*}\) by \(h_{k}(\mathtt {a}_{i}):=\mathtt {c}\mathtt {a}^{i}\mathtt {c}\) and \(h_{k}(\mathtt {b}_{i}):=\mathtt {c}\mathtt {b}^{i}\mathtt {c}\) for all 1≤i≤k, and \(h_{k}(a):=a\) for all \(a\in\{0,\texttt {\#},\mathdollar \}\). Instead of the languages \(L(\alpha_{\mathcal {X},k})\), we then consider the languages \(h_{k}(L(\alpha_{\mathcal {X},k}))\). The reasoning then proceeds as in the original proof, with the sole exception that, instead of arguing that the variables must contain characteristic letters a i and b i , they now contain characteristic segments \(\mathtt {c}\mathtt {a}^{i}\mathtt {c}\) and \(\mathtt {c}\mathtt {b}^{i}\mathtt {c}\).
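Under our reading of the morphism (a i ↦ c a^i c, with the index coded in unary so that the target alphabet stays constant-size; the token representation below is ours), h k can be sketched as:

```python
# Letters a_i / b_i are represented as tuples ('a', i) / ('b', i);
# the plain symbols '0', '#', '$' are fixed points of the morphism.
def h_k(tokens):
    out = []
    for t in tokens:
        if isinstance(t, tuple):
            letter, i = t
            out.append('c' + letter * i + 'c')  # a_i -> c a^i c, b_i -> c b^i c
        else:
            out.append(t)
    return ''.join(out)

print(h_k([('a', 2), '0', ('b', 1)]))  # 'caac0cbc'
```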

Using the same approach, the proof can be adapted to binary terminal alphabets.

4.1 A Technical Note on Bounded Occurrences of Variables

Although the extended regular expressions \(\alpha_{\mathcal {X}}\) that follow from the construction in the proof of Theorem 14 use only one variable x, there is no bound on the number of occurrences of x in \(\alpha_{\mathcal {X}}\). In fact, the number of occurrences grows with the number of fields in the transition table δ of \(\mathcal {X}\), and limiting that number would make it impossible to simulate infinitely many extended Turing machines, which is required to obtain the results in the present paper.

Nonetheless, similar to the proofs for the undecidability of inclusion for pattern languages in Bremer and Freydenberger [6], these limitations can be overcome by using a single universal Turing machine: First, let ψ:ℕ×ℕ→ℕ be a universal partially recursive function, i.e., for every partially recursive function ϕ:ℕ→ℕ, there is an m≥0 such that ψ(m,n)=ϕ(n) for every n≥0. Note that it is an elementary fact of recursion theory that such a function (which is often called a numbering) exists and can be defined constructively (cf. Cutland [13], Odifreddi [28]). For every m≥0, we define the function \(\psi_{m}:\mathbb {N}\to \mathbb {N}\) by \(\psi_{m}(n):=\psi(m,n)\) for all n≥0. Furthermore, we define the set

$$\operatorname {dom}(\psi_m):=\bigl\{n\in \mathbb {N}\mid\psi_m(n)\text{ is defined}\bigr\},$$

from which we derive the index sets

$$E_{\psi}:=\bigl\{m\in \mathbb {N}\mid \operatorname {dom}(\psi_m)=\emptyset\bigr\},\qquad F_{\psi}:=\bigl\{m\in \mathbb {N}\mid \operatorname {dom}(\psi_m)\text{ is finite}\bigr\}.$$

As in Lemma 11, one can use Rice’s Theorem and its extension (again, cf. [13, 28]) to show that \(E_{\psi}\) is not semi-decidable, and \(F_{\psi}\) is neither semi-decidable nor co-semi-decidable.
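For intuition only, the following toy sketch illustrates the definitions of \(\operatorname {dom}(\psi_{m})\), \(E_{\psi}\), and \(F_{\psi}\) on a small, hand-picked (and decidedly non-universal) family of partial functions; for a genuine universal ψ, membership in these index sets is of course not decidable.

```python
# A toy (NON-universal) family of partial functions, used only to illustrate
# the definitions of dom(psi_m) and the index sets E_psi and F_psi.
def psi(m, n):
    if m == 0:
        return None                              # psi_0: undefined everywhere
    if m == 1:
        return n // 2 if n % 2 == 0 else None    # psi_1: defined on even n
    if m == 2:
        return n if n <= 2 else None             # psi_2: defined on {0, 1, 2}
    return None

def dom_up_to(m, bound):
    """dom(psi_m) restricted to {0, ..., bound - 1}."""
    return {n for n in range(bound) if psi(m, n) is not None}

# In this toy setting: m = 0 belongs to E_psi (empty domain); m = 0 and m = 2
# belong to F_psi (finite domain); m = 1 does not (infinite domain).
print(sorted(dom_up_to(2, 100)))  # [0, 1, 2]
```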

Moreover, there is a Turing machine \(\mathcal{U}\) over some tape alphabet such that

$$\operatorname {dom}_{\mathrm {T}}(\mathcal{U})=\bigl\{\mathtt {a}^m0 \mathtt {b}^n\mid\psi(m,n) \text{ is defined}\bigr\}.$$

Again, this machine can be defined constructively (e.g. from the constructive definition of ψ). Using the same construction as in the proof of Lemma 11, we can build an extended Turing machine \(\hat{\mathcal{U}}=(Q,q_{1},\delta)\) that simulates \(\mathcal {U}\), using an appropriate injective function \(b_{k}:\hat {\varGamma }\to \varGamma ^{k}\) with \(b_{k}(0)=0^{k}\), for some appropriate k≥3. Instead of constructing a single extended regular expression \(\alpha_{\hat {\mathcal {U}}}\), we construct an expression \(\alpha_{\hat{\mathcal{U}},m}\) for every natural number m, using a slight modification of the proof of Theorem 14. Instead of allowing arbitrary contents for the right tape word in the initial configuration (as can be seen in the second line of the definition of the language \(S_{\mathcal {X}}\)), we define

The only difference from the definition of \(S_{\mathcal {X}}\) is the second line. Evidently, the new definition allows exactly those initial right tape sides \(t_{R}\) for which for some i≥0. As , these are exactly those \(t_{R}\) that have as a prefix.

The construction of \(\alpha_{\hat{\mathcal{U}},m}\) then uses \(S_{\hat {\mathcal{U}},m}\) to construct \(\alpha_{\mathrm{struc}}\), and proceeds as in the original proof. Thus, \(\{0,\texttt {\#}\}^{*}\setminus L(\alpha_{\hat{\mathcal {U}},m})\) is exactly the set that corresponds to the valid computations of \(\mathcal{U}\) on some input that starts with \(\mathtt {a}^{m}0\). Furthermore, the number of occurrences of x in every \(\alpha_{\hat {\mathcal{U}},m}\) depends only on δ, not on m, which means that we can bound that number.

Hence, given an m∈ℕ, we can effectively construct an expression \(\alpha_{\hat{\mathcal{U}},m}\) such that the language \(L(\alpha _{\hat {\mathcal{U}},m})\) is

  • universal iff \(\operatorname {dom}(\psi_{m})\) is empty iff \(m\in E_{\psi}\),

  • regular iff \(\operatorname {dom}(\psi_{m})\) is finite iff \(m\in F_{\psi}\),

  • cofinite iff \(\operatorname {dom}(\psi_{m})\) is finite iff \(m\in F_{\psi}\),

which holds by the same reasoning as for \(L(\alpha_{\mathcal {X}})\). Hence, a procedure that decides one of these problems for \(\operatorname {RegEx}(1)\) with a bounded number of variable occurrences to some degree can be effectively transformed into a procedure that decides one of the index sets \(E_{\psi}\) and \(F_{\psi}\) to the same degree.

As mentioned above, \(E_{\psi}\) and \(F_{\psi}\) are not semi-decidable, and \(F_{\psi}\) is also not co-semi-decidable. Hence, we arrive at the same conclusions as in Theorem 10 and Corollaries 16 and 18 for \(\operatorname {RegEx}(1)\) with a bounded number of occurrences of the single variable.

4.2 A Note on H-Systems

As mentioned multiple times above, Theorem 14 (and hence Theorem 10 and the resulting consequences) can be easily adapted to various other models that use similar repetition mechanisms. One of these models is the class of so-called H-expressions by Bordihn et al. [5]. These H-expressions are based on H-systems, which were introduced by Albert and Wegner [4].

In fact, instead of H-expressions, the construction can be implemented using these (less expressive) H-systems. An H-system is a 4-tuple \(H=(X,\varSigma,L_{1},\phi)\), where X and Σ are finite alphabets (the meta alphabet and the terminal alphabet, respectively), \(L_{1}\subseteq X^{*}\) is called the meta language, and \(\phi: X\to\mathcal{P}(\varSigma^{*})\) is a function that assigns to each x∈X a language \(\phi(x)=L_{x}\subseteq \varSigma ^{*}\). The language of H is defined as

$$L(H):=\bigl\{h(w)\mid w\in L_1, \text{$h$ is a homomorphism with $h(x)\in \phi(x)$ for all }x\in X\bigr\}.$$

Less formally, every letter x from the meta alphabet is replaced uniformly with a word from ϕ(x).
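To make this uniform-replacement semantics concrete, here is a minimal sketch that enumerates L(H) in the special case where the meta language \(L_{1}\) and all images ϕ(x) are finite (the general definition places no such restriction):

```python
from itertools import product

def h_system_language(L1, phi):
    """Enumerate L(H) = { h(w) | w in L1, h a homomorphism with h(x) in phi(x) }
    for a FINITE meta language L1 and finite images phi(x) -- a sketch only."""
    meta = sorted(phi)                      # meta alphabet X
    result = set()
    for w in L1:
        # pick one image per meta letter; the choice is uniform across all of w
        for images in product(*(phi[x] for x in meta)):
            h = dict(zip(meta, images))
            result.add("".join(h[x] for x in w))
    return result

# The meta letter x is replaced uniformly, so "ab" is not generated:
print(sorted(h_system_language({"xx"}, {"x": {"a", "b"}})))  # ['aa', 'bb']
```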

Furthermore, if \(\mathcal{L}_{1}\) and \(\mathcal{L}_{2}\) are classes of languages, \(\mathcal{H}(\mathcal{L}_{1},\mathcal{L}_{2})\) denotes the class of H-system languages of \(\mathcal{L}_{1}\) and \(\mathcal{L}_{2}\), i.e., the class of all languages that are generated by H-systems that use a language \(L_{1}\in\mathcal {L}_{1}\) as meta language, and have \(\phi(x)\in\mathcal{L}_{2}\) for every x in their meta alphabet X.

It is easy to see that, for every extended Turing machine \(\mathcal {X}\), the expressions \(\alpha_{\mathcal {X}}\) that are constructed in the proof of Theorem 14 can be converted into an H-system \(H=(X_{H},\varSigma_{H},L_{1},\phi)\) with \(L(H)= \operatorname {INVALC}(\mathcal {X})\) and \(L(H)\in \mathcal{H}(\operatorname {REG},\operatorname {REG})\) by proceeding as follows.

First, we select \(X_{H}:=\{0,\texttt {\#},x\}\) and \(\varSigma_{H}:=\{0,\texttt {\#}\}\). We then replace the single occurrence of \((0^{*})\%x\) in \(\alpha_{\mathcal {X}}\) with x, and obtain a proper regular expression \(\hat {\alpha }_{\mathcal {X}}\) over the alphabet \(X_{H}\). This allows us to choose \(L_{1}:=L(\hat {\alpha }_{\mathcal {X}})\), and \(L_{1}\in \operatorname {REG}\) holds. Next, we define \(\phi(0):=\{0\}\), \(\phi(\texttt {\#}):=\{\texttt {\#}\}\), and \(\phi(x):=\{0\}^{*}\). Hence, \(L(H)=L(\alpha _{\mathcal {X}})= \operatorname {INVALC}(\mathcal {X})\) follows immediately.
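For intuition, the same replacement step can be carried out on the expression from the introduction: replacing the bound subexpression in \(( (\mathtt {a}\mid \mathtt {b})^{*} ) \%x\: x\) with the meta letter x yields the meta language \(L_{1}=\{xx\}\) with \(\phi(x)=\{\mathtt {a},\mathtt {b}\}^{*}\), and L(H) is then the copy language \(\{ww\mid w\in\{\mathtt {a},\mathtt {b}\}^{*}\}\). A sketch enumerating this language up to a length bound (the bound is an artifact of the sketch, not of the definition):

```python
from itertools import product

def copy_language_up_to(max_half_len):
    """Enumerate { ww | w in {a,b}* with |w| <= max_half_len }: the image
    chosen for the meta letter x is used uniformly for both occurrences."""
    words = set()
    for n in range(max_half_len + 1):
        for tup in product("ab", repeat=n):
            w = "".join(tup)
            words.add(w + w)   # homomorphism sends both occurrences of x to w
    return words

L = copy_language_up_to(2)
print("abab" in L, "ab" in L)  # True False
```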

Thus, we can adapt these proofs to the class \(\mathcal{H}(\operatorname {REG},\operatorname {REG})\) of H-system languages, and conclude the same levels of undecidability of the respective decision problems for \(\mathcal{H}(\operatorname {REG},\operatorname {REG})\). As \(\mathcal{H}(\operatorname {REG},\operatorname {REG})\) is a subclass of the class of H-expression languages, this also proves that equivalence for H-expression languages is undecidable, a problem that was explicitly mentioned as open by Bordihn et al. [5].

5 Conclusions

The present paper shows that extending regular expressions with only a single variable already leads to an immense increase in succinctness and expressive power. The good part of this news is that in certain applications, using the right extended regular expression instead of a proper regular expression can lead to far more efficient running times, even with the same matching engine. The bad part of this news is that this additional power can only be harnessed in full if one is able to solve undecidable problems, which greatly diminishes the usefulness of extended regular expressions as a more efficient alternative to proper regular expressions.

Due to underlying undecidable problems, some questions of designing optimal extended regular expressions are of comparable difficulty to designing optimal programs. For applied computer scientists, it could be worthwhile to develop heuristics and good practices to identify cases where the non-conventional use of extended regular expressions might offer unexpected speed advantages. As regular expressions are often precompiled, such heuristics might be employed as compiler-level optimization features, comparable to the compilation of programming languages.

For theoretical computer scientists, the results in the present paper highlight the need for appropriate restrictions other than the number of variables: restrictions that lead to large and natural subclasses with decidable decision problems. One possible approach that does not extend the expressive power of proper regular expressions beyond the regular languages would be a restriction on the length of the words to which variables can be bound. As the results in the present paper show, any extension of proper regular expressions that includes some kind of repetition operator needs to be approached with utmost care.

On the other hand, the author wishes to point out that the enormous relative succinctness of extended regular expressions should be considered an advantage, especially as the existence of an effective minimization procedure for proper regular expressions is probably of little practical use, considering that no efficient minimization procedure exists (unless P = PSPACE). In particular, the author is convinced that the results on descriptional complexity emphasize the potential usefulness of large and natural subclasses of \(\operatorname {RegEx}\) for which the membership problem can be decided efficiently, as opposed to the NP-completeness of the general problem. One possible example of such restrictions can be found in Reidenbach and Schmid [30].