1 Introduction

A regular expression specifies a set of strings formed by characters combined with concatenation, union (|), and Kleene star (*) operators. For instance, (a|(ba))* describes the set of strings of as and bs, where every b is followed by an a. Regular expressions are a fundamental concept in formal language theory and a basic tool in computer science for specifying search patterns. Regular expression search appears in diverse areas such as internet traffic analysis [14, 18, 28], data mining [11], databases [20, 22], computational biology [24], and human-computer interaction [17].

Given a regular expression R and a string Q, the regular expression parsing problem [8, 10, 16, 19, 25, 26] is to determine if Q matches a string in the set of strings described by R and if so, determine how it matches by computing a corresponding sequence of positions of characters in R, i.e., the mapping of each character in Q to a character in R corresponding to the match. For instance, if \(R = \) (a|(ba))* and \(Q = \) aaba, then Q matches R and 1, 1, 2, 3 is a corresponding parse specifying that Q[1] and Q[2] match the first a in R, Q[3] matches the b in R, and Q[4] matches the last a in R. Another typical definition of parsing is to compute a parse tree (or a variant thereof) of the derivation of Q on R. Our definition simplifies our presentation and it is straightforward to derive a parse tree from our parses. Regular expression parsing makes finding matches of a regular expression even more useful by allowing us to directly extract subpatterns of the match, e.g., for extracting IP addresses from internet traffic analysis or extracting subparts of genomes from genetic databases.

To state the existing bounds, let n and m be the length of the string and the regular expression, respectively. As a starting point consider the simpler regular expression matching problem, that is, to determine if Q matches a string in the set of strings described by R (without necessarily providing a mapping from characters in Q to characters in R). A classic textbook algorithm for matching, due to Thompson [27], constructs and simulates a non-deterministic finite automaton (NFA) in O(nm) time and O(m) space. An immediate approach to solve the parsing problem is to combine Thompson’s algorithm with backtracking. To do so, we store all state-sets produced during the NFA simulation and then process these in reverse order to recover an accepting path in the NFA matching Q. From the path, we then immediately obtain the corresponding parse of Q since each transition labeled by a character uniquely corresponds to a character in R. This algorithm uses O(nm) time and space. Hence, we achieve the same time bound as matching but increase the space by an \(\Omega (n)\) factor. We can improve the time by polylogarithmic factors using faster algorithms for matching [3, 4, 6, 7, 23], but by a recent conditional lower bound [2] we cannot hope to achieve \(O((nm)^{1-\varepsilon })\) time assuming the strong exponential time hypothesis. Other direct approaches to regular expression parsing [8, 10, 16, 19, 25, 26] similarly achieve \(\Theta (nm)\) time and space (ignoring polylogarithmic factors), leaving a substantial gap between linear space for matching and \(\Theta (nm)\) space for parsing. The goal of this paper is to address this gap.

1.1 Results

We present a new technique to efficiently extend the classic state-set transition algorithms for matching to also solve parsing in the same time complexity while only using linear space. Specifically, we obtain the following main result based on Thompson’s algorithm:

Theorem 1

Given a regular expression of length m and a string of length n, we can solve the regular expression parsing problem in O(nm) time and \(O(n + m)\) space.

This is the first bound to significantly improve upon the combination of \(\Theta (nm)\) time and space. Our techniques are sufficiently general to also handle the more recent faster state-set transition algorithms [3, 4, 23] and we also obtain a similar space improvement for these.

1.2 Techniques

Our overall approach is similar in spirit to the classic divide and conquer algorithm by Hirschberg [13] for computing a longest common subsequence of two strings in linear space. Let A be the Thompson NFA (TNFA) for R built according to Thompson’s rules [27] (see also Fig. 1) with m states, and let Q be the string of length n.

We first decompose A using standard techniques into a pair of nested subTNFAs called the inner subTNFA and the outer subTNFA. Each has at most roughly 2/3 of the states of A, and the two overlap in at most 2 boundary states. We then show how to carefully simulate A to decompose Q into substrings corresponding to subparts of an accepting path in each of the subTNFAs. The key challenge here is to efficiently handle cyclic dependencies between the subTNFAs. From this we construct a sequence of subproblems for each of the substrings corresponding to the inner subTNFAs and a single subproblem for the outer subTNFA. We recursively solve these to construct a complete accepting path in A. This strategy leads to an O(nm) time and \(O(n\log m + m)\) space solution. We show how to tune and organize the recursion to avoid storing intermediate substrings leading to the linear space solution in Theorem 1. Finally, we show how to extend our solution to obtain linear space parsing solutions for other state-set transition algorithms.

1.3 Outline

In Sect. 2, we introduce the basic definitions and notation for regular expressions and finite automata. In Sect. 3, we show how to efficiently compute a balanced decomposition of a TNFA. In Sect. 4, we then show how to efficiently decompose a string according to a TNFA decomposition. We then use our TNFA and string decomposition algorithms in our first recursive algorithm for computing accepting paths in Sect. 5 using O(nm) time and \(O(n\log m + m)\) space. In Sect. 6, we improve the space to \(O(n + m)\) leading to our main result in Theorem 1. Finally, in Sect. 7, we show how to extend the fast state-set transition algorithms to regular expression parsing.

2 Preliminaries

2.1 Strings

A string Q of length \(n = |Q|\) is a sequence \(Q[1]\ldots Q[n]\) of n characters drawn from an alphabet \(\Sigma \). The string \(Q[i]\ldots Q[j]\), denoted Q[i, j], is called a substring of Q. The substrings Q[1, i] and Q[j, n] are the ith prefix and the jth suffix of Q, respectively. The string \(\epsilon \) is the unique empty string of length zero.

2.2 Regular expressions

First, we briefly review the classical concepts used in the paper. For more details see, e.g., Aho et al. [1]. We consider the set of non-empty regular expressions over an alphabet \(\Sigma \), defined recursively as follows. If \(\alpha \in \Sigma \cup \{\epsilon \}\), then \(\alpha \) is a regular expression, and if S and T are regular expressions, then so is the concatenation, \((S)\cdot (T)\), the union, (S)|(T), and the star, \((S)^*\). The language L(R) generated by R is defined as follows. If \(\alpha \in \Sigma \cup \{\epsilon \}\), then \(L(\alpha )\) is the set containing the single string \(\alpha \). If S and T are regular expressions, then \(L(S \cdot T) = L(S)\cdot L(T)\), that is, any string formed by the concatenation of a string in L(S) with a string in L(T), \(L(S|T) = L(S) \cup L(T)\), and \(L(S^*) = \bigcup _{i \ge 0} L(S)^i\), where \(L(S)^0 = \{\epsilon \}\) and \(L(S)^i = L(S)^{i-1} \cdot L(S)\), for \(i > 0\). The parse tree \(\mathcal {T}^P(R)\) of R (not to be confused with the parse of Q wrt. R) is the rooted, binary tree representing the hierarchical structure of R. The leaves of \(\mathcal {T}^P(R)\) are labeled by a character from \(\Sigma \) or \(\epsilon \) and internal nodes are labeled by either \(\cdot \), \(\mid \), or \(*\).

2.3 Finite automata

A finite automaton is a tuple \(A = (V, E, \Sigma , \theta , \phi )\), where V is a set of nodes called states, E is a set of directed edges between states called transitions either labeled \(\epsilon \) (called \(\epsilon \)-transitions) or labeled by a character from \(\Sigma \) (called character-transitions), \(\theta \in V\) is a start state, and \(\phi \in V\) is an accepting state. In short, A is an edge-labeled directed graph with a special start and accepting node. A is a deterministic finite automaton (DFA) if A does not contain any \(\epsilon \)-transitions, and all outgoing transitions of any state have different labels. Otherwise, A is a non-deterministic automaton (NFA). When we deal with multiple automata, we use a subscript \(_A\) to indicate information associated with automaton A, e.g., \(\theta _A\) is the start state of automaton A.

Given a string Q and a path P in A we say that Q and P match if the concatenation of the labels on the transitions in P is Q. Given a state s in A, we define the state-set transition \(\delta _A(s, Q)\) to be the set of states reachable from s through paths matching Q. For a set of states S we define \(\delta _A(S, Q) = \bigcup _{s \in S} \delta _A(s, Q)\). We say that A accepts the string Q if \(\phi _A\in \delta _A(\theta _A, Q)\). Otherwise, A rejects Q. For an accepting path P in A, we define the parse of P for A to be the sequence of character transitions in A on P. Given a string Q accepted by A, a parse of Q is a parse for A of any accepting path matching Q.

We can use a sequence of state-set transitions to test acceptance of a string Q of length n by computing a sequence of state-sets \(S_0, \ldots , S_n\), given by \(S_0 = \delta _A(\theta _A, \epsilon )\) and \(S_i = \delta _A(S_{i-1}, Q[i])\), \(i=1, \ldots , n\). We have that \(\phi _A \in S_n\) iff A accepts Q. We can extend the algorithm to also compute the parse of Q for A by processing the state-sets in reverse order to recover an accepting path and output the character transitions. Note that for matching, we only need to store the latest two state-sets at any point to compute the final state-set \(S_n\), whereas for parsing, we store the full sequence of state-sets.
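To make this concrete, here is a minimal Python sketch of the naive parsing algorithm just described (the dict-based NFA encoding and the position annotations on character transitions are our own illustrative choices, not part of the formal development), run on a hand-built NFA for (a|(ba))* with start state 0 and accept state 5:

```python
def eps_closure(states, eps):
    """All states reachable from `states` via epsilon-transitions."""
    stack, seen = list(states), set(states)
    while stack:
        s = stack.pop()
        for t in eps.get(s, ()):
            if t not in seen:
                seen.add(t)
                stack.append(t)
    return seen

def naive_parse(eps, char, start, accept, Q):
    """Return a parse of Q (positions of characters in R) or None."""
    # Forward pass: store all n + 1 state-sets (this is the O(nm) space cost).
    S = [eps_closure({start}, eps)]
    for c in Q:
        step = {t for s in S[-1] for (lbl, t, _) in char.get(s, ()) if lbl == c}
        S.append(eps_closure(step, eps))
    if accept not in S[-1]:
        return None
    # Backward pass: recover the character transitions of one accepting path.
    out, cur = [], accept
    for i in range(len(Q), 0, -1):
        for s in S[i - 1]:
            # A transition s --Q[i]--> t with an eps-path from t to cur is
            # guaranteed to exist because cur lies in S[i].
            hit = next(((t, pos) for (lbl, t, pos) in char.get(s, ())
                        if lbl == Q[i - 1] and cur in eps_closure({t}, eps)), None)
            if hit is not None:
                out.append(hit[1])
                cur = s
                break
    return out[::-1]

# NFA for (a|(ba))*: each character transition carries the position of its
# character in R, so the recovered path is directly a parse.
eps = {0: [1, 5], 1: [2, 3], 4: [1, 5]}
char = {2: [('a', 4, 1)], 3: [('b', 6, 2)], 6: [('a', 4, 3)]}
print(naive_parse(eps, char, 0, 5, 'aaba'))   # [1, 1, 2, 3]
print(naive_parse(eps, char, 0, 5, 'ab'))     # None
```

On \(Q = \) aaba this recovers the parse 1, 1, 2, 3 from the introduction.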

Fig. 1

Thompson’s recursive NFA construction. The regular expression \(\alpha \in \Sigma \cup \{\epsilon \}\) corresponds to NFA (a). If S and T are regular expressions, then N(ST), N(S|T), and \(N(S^*)\) correspond to NFAs (b), (c), and (d), respectively. In each of these figures, the leftmost node \(\theta \) and rightmost node \(\phi \) are the start and the accept nodes, respectively. For the top recursive calls, these are the start and accept nodes of the overall automaton. In the recursions indicated, e.g., for N(ST) in (b), we take the start node of the subautomaton N(S) and identify it with the state immediately to the left of N(S) in (b). Similarly, the accept node of N(S) is identified with the state immediately to the right of N(S) in (b).

2.4 Thompson NFA

Given a regular expression R, we can construct an NFA accepting precisely the strings in L(R) by several classic methods [12, 21, 27]. In particular, Thompson [27] gave the simple well-known construction shown in Fig. 1. We will call an NFA constructed with these rules a Thompson NFA (TNFA). A TNFA N(R) for R has at most 2m states, at most 4m transitions, and can be computed in O(m) time. Note that each character in R corresponds to a unique character transition in N(R) and hence a parse of a string Q for N(R) directly corresponds to a parse of Q for R. The parse tree of a TNFA N(R) is the parse tree of R. With a breadth-first search of N(R), we can compute a state-set transition for a single character in O(m) time. By our above discussion, it follows that we can solve regular expression matching in O(nm) time and O(m) space, and regular expression parsing in O(nm) time and O(nm) space.
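The construction of Fig. 1 can be sketched as follows. The nested-tuple syntax trees are our own encoding, and with one deviation from the figure: for concatenation we add an \(\epsilon \)-transition instead of identifying the accept state of N(S) with the start state of N(T), which preserves the language at the cost of a few extra states:

```python
def thompson(tree):
    """Build a TNFA from a pre-parsed syntax tree: a character, '' for
    epsilon, or ('cat', S, T), ('union', S, T), ('star', S)."""
    eps, char, cnt = {}, {}, [0]
    def new():
        cnt[0] += 1
        return cnt[0] - 1
    def link(u, v):
        eps.setdefault(u, []).append(v)
    def build(t):
        if isinstance(t, str):              # single character, or '' for epsilon
            s, f = new(), new()
            if t == '':
                link(s, f)
            else:
                char.setdefault(s, []).append((t, f))
            return s, f
        if t[0] == 'cat':                   # N(ST): connect accept of S to start of T
            s1, f1 = build(t[1]); s2, f2 = build(t[2])
            link(f1, s2)
            return s1, f2
        if t[0] == 'union':                 # N(S|T)
            s, f = new(), new()
            s1, f1 = build(t[1]); s2, f2 = build(t[2])
            link(s, s1); link(s, s2); link(f1, f); link(f2, f)
            return s, f
        if t[0] == 'star':                  # N(S*)
            s, f = new(), new()
            s1, f1 = build(t[1])
            link(s, s1); link(s, f); link(f1, s1); link(f1, f)
            return s, f
        raise ValueError(t)
    start, accept = build(tree)
    return eps, char, start, accept

def eps_closure(states, eps):
    stack, seen = list(states), set(states)
    while stack:
        for t in eps.get(stack.pop(), ()):
            if t not in seen:
                seen.add(t); stack.append(t)
    return seen

def matches(nfa, Q):
    """One state-set transition per character: O(m) time each by BFS."""
    eps, char, start, accept = nfa
    S = eps_closure({start}, eps)
    for c in Q:
        S = eps_closure({t for s in S for (l, t) in char.get(s, ()) if l == c}, eps)
    return accept in S

nfa = thompson(('star', ('union', 'a', ('cat', 'b', 'a'))))  # (a|(ba))*
print(matches(nfa, 'aaba'), matches(nfa, 'ab'))  # True False
```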

3 Computing TNFA decompositions

In this section, we show how to construct a balanced decomposition of TNFAs. Our decomposition is used for our string decomposition presented in the next section, which in turn is a key component in our recursive algorithm for regular expression parsing in Sects. 5 and 6. Our decomposition is based on well-known techniques for TNFAs and similar decompositions are used in [3, 23].

Given a TNFA A with \(m > 2\) states, we decompose A into an inner subTNFA \(A_I\) and an outer subTNFA \(A_O\). The inner subTNFA consists of a pair of boundary states \(\theta _{A_I}\) and \(\phi _{A_I}\) and all states and transitions that are reachable from \(\theta _{A_I}\) without going through \(\phi _{A_I}\). See Fig. 2 for an illustration. Furthermore, if there is a path of \(\epsilon \)-transitions from \(\phi _{A_I}\) to \( \theta _{A_I}\) in \(A_O\), we add an \(\epsilon \)-transition from \(\phi _{A_I}\) to \( \theta _{A_I}\) in \(A_I\) (following the rules from Thompson’s construction). The outer subTNFA is obtained by removing all states and transitions of \(A_I\) except \(\theta _{A_I}\) and \(\phi _{A_I}\). Between \(\theta _{A_I}\) and \(\phi _{A_I}\) we add a special transition labeled \(\beta _{A_I} \not \in \Sigma \) and if \(A_I\) accepts the empty string we also add an \(\epsilon \)-transition (corresponding to the regular expression \((\beta _{A_I} \mid \epsilon )\)). The decomposition has the following properties. Similar results are proved in [3, 23].

Fig. 2

Decomposition of TNFA A into subTNFAs \(A_O\) and \(A_I\). The dotted \(\epsilon \)-transition in \(A_O\) exists since \(A_I\) accepts the empty string, and the dotted \(\epsilon \)-transition in \(A_I\) exists since there is a path of \(\epsilon \)-transitions from \(\phi _{A_I}\) to \( \theta _{A_I}\).

Lemma 1

Let A be any TNFA with \(m > 2\) states. In O(m) time, we can decompose A into outer and inner subTNFAs \(A_O\) and \(A_I\) such that

(i) \(A_O\) and \(A_I\) have at most \(\frac{2}{3}m + 8\) states each, and

(ii) any path from \(A_O\) to \(A_I\) crosses \(\theta _{A_I}\) and any path from \(A_I\) to \(A_O\) crosses \(\phi _{A_I}\).

Proof

Consider the parse tree \(\mathcal {T}^P\) of A with v nodes. Since \(\mathcal {T}^P\) is a binary tree with more than one node we can find a separator edge e in linear time whose removal splits \(\mathcal {T}^P\) into subtrees \(\mathcal {T}^P_I\) and \(\mathcal {T}^P_O\), each of which has at most \(\frac{2}{3}v + 1\) nodes [15]. Here, \(\mathcal {T}^P_O\) is the subtree containing the root of \(\mathcal {T}^P\). The subTNFA \(A_I\) is the TNFA induced by \(\mathcal {T}^P_I\), possibly with a new union node as root with one child being the root of \(\mathcal {T}^P_I\) and the other a leaf labeled \(\epsilon \). The subTNFA \(A_O\) is the TNFA induced by the tree \(\mathcal {T}^P_O\), where the separator edge is replaced by an edge to either a leaf labeled \(\beta _{A_I}\) or to a union-node with children labeled \(\beta _{A_I}\) and \(\epsilon \) in the case where \(A_I\) accepts the empty string. Thus, each subTNFA is induced by a tree with at most \(\frac{2}{3}v + 4\) nodes. Since each node corresponds to two states, each subTNFA has at most \(\frac{2}{3}m + 8\) states. \(\square \)
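The separator-edge step of the proof can be sketched as follows. The dict-of-children tree encoding is our own; parse-tree nodes for star have a single child, which the sketch also handles. Descending along the heavier child until the subtree below has at most 2v/3 nodes leaves two parts of at most \(\frac{2}{3}v + 1\) nodes each, matching the bound cited from [15]:

```python
def separator_edge(children, root):
    """children: dict node -> list of children of a tree with v > 1 nodes
    and at most two children per node. Returns (u, size of subtree at u,
    size of the rest): removing edge (parent(u), u) is the separator."""
    # Compute subtree sizes bottom-up (BFS order, then reversed).
    order = [root]
    for u in order:
        order.extend(children.get(u, []))
    size = {}
    for u in reversed(order):
        size[u] = 1 + sum(size[c] for c in children.get(u, []))
    v = size[root]
    u = root
    while size[u] * 3 > 2 * v:                       # subtree still too large
        u = max(children[u], key=lambda c: size[c])  # descend to heavier child
    return u, size[u], v - size[u]

# Complete binary tree on 7 nodes: the separator splits it into 3 + 4 nodes.
children = {1: [2, 3], 2: [4, 5], 3: [6, 7]}
print(separator_edge(children, 1))  # (2, 3, 4)
```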

4 Computing string decompositions

Let Q be a string of length n accepted by a TNFA A with m states. Consider a decomposition of A into subTNFAs \(A_O\) and \(A_I\) according to Lemma 1. In this section, we show how to efficiently decompose Q into substrings corresponding to subpaths matched in each subTNFA using O(nm) time and \(O(n + m)\) space. The algorithm will be a key component in our recursive algorithms for computing accepting paths in the next sections.

4.1 Path and string decompositions

We first define string decompositions and discuss the main challenges for computing them. Given an accepting path P in A, we define the path decomposition of P wrt. \(A_I\) to be the partition of P into a sequence of subpaths \(\mathcal {P} = \overline{p}_1, p_1, \overline{p}_2, p_2, \ldots , \overline{p}_\ell ,p_\ell , \overline{p}_{\ell +1}\), where the outer subpaths, \(\overline{p}_1, \ldots , \overline{p}_{\ell +1}\), are subpaths in \(A_O\) and the inner subpaths, \(p_1, \ldots , p_{\ell }\), are the subpaths in \(A_I\). We require that none of the subpaths matches the empty string. Note that it is always possible to partition P into a sequence of alternating subpaths not matching the empty string, since when we construct \(A_I\) and \(A_O\), we add an \(\epsilon \)-transition in \(A_O\) if \(A_I\) accepts the empty string, and an \(\epsilon \)-transition from \(\phi _{A_I}\) to \( \theta _{A_I}\) in \(A_I\) if there is a path of \(\epsilon \)-transitions from \(\phi _{A_I}\) to \( \theta _{A_I}\) in \(A_O\). The string decomposition induced by \(\mathcal {P}\) is the sequence of (non-empty) substrings \(\mathcal {Q} = \overline{q}_1, q_1, \overline{q}_2, q_2, \ldots , \overline{q}_\ell ,q_\ell , \overline{q}_{\ell +1}\) formed by concatenating the labels of the corresponding subpath in A. A sequence of substrings is a string decomposition wrt. \(A_I\) if there exists an accepting path that induces it. Note that there may be many accepting paths, but each accepting path induces exactly one string decomposition.

As mentioned above, our goal is to compute a string decomposition wrt. \(A_I\) in O(nm) time and \(O(n + m)\) space, where n is the length of Q and m is the number of states in A. An immediate idea would be to process Q from left to right using state-set transitions and “collapse” the state set to a boundary state b of \(A_I\) whenever the state set contains b and there is a path from b to \(\phi _A\) matching the rest of Q. Since \(A_O\) and \(A_I\) only interact at the boundary states, this effectively corresponds to alternating the simulation of A between \(A_O\) and \(A_I\). However, because of potential cyclic dependencies from paths of \(\epsilon \)-transitions from \(\phi _{A_I}\) to \(\theta _{A_I}\) in \(A_O\) and \(\theta _{A_I}\) to \(\phi _{A_I}\) in \(A_I\), we cannot immediately determine which subTNFA we should proceed in and hence we cannot correctly compute the string decomposition. For instance, consider the string \(Q = aaacdaabaacdacdaabab\) from Fig. 3. After processing the first two characters (aa) both \(\theta _{A_I}\) and \(\phi _{A_I}\) are in the state set, and there is a path from both these states to \(\phi _A\) matching the rest of Q. The same is true after processing the first six characters (aaacda). In the first case, the substring consisting of the next three characters (acd) only matches a path in \(A_I\), whereas in the second case the substring consisting of the next two characters (ab) only matches a path in \(A_O\).

4.2 Computing string decompositions

We now show how to compute a string decomposition in O(nm) time and \(O(n+ m)\) space. The key challenge is to efficiently overcome the above mentioned issue of cyclic dependencies. We show how to achieve this by a two-step approach that first decomposes the string into substrings and then labels the substrings greedily to find a correct string decomposition.

We need the following new definitions. Let i be a position in Q and let s be a state in A. We say that \((i, s)\) is a valid pair if there is a path from \(\theta _A\) to s matching Q[1, i] and from s to \(\phi _A\) matching \(Q[i+1, n]\). For any set of states X in A, we say that \((i, X)\) is a valid pair if each pair \((i, x)\), \(x \in X\), is a valid pair. An accepting path P in A intersects a valid pair \((i, X)\) if some state \(x \in X\) is on the path, the subpath of P from \(\theta _A\) to x matches Q[1, i], and the subpath of P from x to \(\phi _A\) matches \(Q[i+1, n]\).

Our algorithm consists of the following steps. In step 1, we process Q from left to right and right to left to compute and store the match sets, consisting of all valid pairs for the boundary states \(\theta _{A_I}\) and \(\phi _{A_I}\). We then use the match sets in step 2 to process Q from left to right to build a sequence of valid pairs for the boundary states that all intersect a single accepting path P in A matching Q, and that has the property that all positions where the accepting path P contains \(\theta _{A_I}\) or \(\phi _{A_I}\) correspond to a valid pair in the sequence. Finally, in step 3, we construct the string decomposition using a greedy labeling of the sequence of valid pairs. See Fig. 3 for an example of the computation in each step.

Step 1: Computing Match Sets

First, compute the match sets given by

$$\begin{aligned} \qquad \qquad \mathrm {Match}(\theta _{A_I})&= \{i \mid (i, \theta _{A_I}) \text { is a valid pair}\} \\ \mathrm {Match}(\phi _{A_I})&= \{i \mid (i, \phi _{A_I}) \text { is a valid pair}\} \end{aligned}$$

Thus, \(\mathrm {Match}(\theta _{A_I})\) and \(\mathrm {Match}(\phi _{A_I})\) are the positions in Q that correspond to a valid pair for the boundary states \(\theta _{A_I}\) and \(\phi _{A_I}\), respectively. To compute these, we first compute the prefix match sets, \(\mathrm {Prefix}(s)\), and suffix match sets, \(\mathrm {Suffix}(s)\), for \(s\in \{\theta _{A_I}, \phi _{A_I}\}\). A position i is in \(\mathrm {Prefix}(s)\) if there is a path from \(\theta _A\) to s matching the prefix Q[1, i], and in \(\mathrm {Suffix}(s)\) if there is a path from s to \(\phi _A\) matching the suffix \(Q[i+1,n]\). To compute the prefix match sets, we perform state-set transitions on Q and A and whenever the current state-set contains either \(\theta _{A_I}\) or \(\phi _{A_I}\), we add the corresponding position to \(\mathrm {Prefix}(s)\). We compute the suffix match sets in the same way, but now we perform the state-set transitions on Q from right to left and A with the direction of all transitions reversed. Each step of the state-set transition takes O(m) time and hence we use O(nm) time in total.

Fig. 3

A string decomposition of the string \(Q= aaacdaabaacdacdaabab\) wrt. \(A_I\) in Fig. 2 and the corresponding suffix/prefix match sets. The dark gray blocks in the prefix, suffix and match sets are the positions contained in the sets. The blocks in the partition of the string are labeled \(\mathcal {O}\) and \(\mathcal {I}\) for outer and inner, respectively. The gray blocks in the partition are the substrings that can be parsed by both the inner and outer automaton. According to our procedure these blocks are labeled inner.

Finally, we compute the match sets \(\mathrm {Match}(s)\), for \(s\in \{\theta _{A_I},\phi _{A_I}\}\), by taking the intersection of \(\mathrm {Prefix}(s)\) and \(\mathrm {Suffix}(s)\). In total, we use O(mn) time and \(O(n+m)\) space to compute and store the match sets.
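Step 1 can be sketched on a toy NFA as follows; we reuse an illustrative dict encoding of an NFA for (a|(ba))* (start 0, accept 5), with state 4 standing in for a boundary state rather than an actual \(A_I\)/\(A_O\) decomposition:

```python
def eps_closure(states, eps):
    stack, seen = list(states), set(states)
    while stack:
        for t in eps.get(stack.pop(), ()):
            if t not in seen:
                seen.add(t); stack.append(t)
    return seen

def step(S, c, eps, char):
    """One state-set transition."""
    return eps_closure({t for s in S for (l, t) in char.get(s, ()) if l == c}, eps)

def reverse(eps, char):
    """Reverse the direction of every transition."""
    reps, rchar = {}, {}
    for u, vs in eps.items():
        for v in vs:
            reps.setdefault(v, []).append(u)
    for u, ts in char.items():
        for l, v in ts:
            rchar.setdefault(v, []).append((l, u))
    return reps, rchar

def match_set(eps, char, start, accept, s, Q):
    """Match(s) = Prefix(s) & Suffix(s) for a single state s."""
    n = len(Q)
    prefix, S = set(), eps_closure({start}, eps)
    for i in range(n + 1):                 # left-to-right pass
        if s in S:
            prefix.add(i)
        if i < n:
            S = step(S, Q[i], eps, char)
    reps, rchar = reverse(eps, char)
    suffix, S = set(), eps_closure({accept}, reps)
    for i in range(n, -1, -1):             # right-to-left pass, reversed NFA
        if s in S:
            suffix.add(i)
        if i > 0:
            S = step(S, Q[i - 1], reps, rchar)
    return prefix & suffix

eps = {0: [1, 5], 1: [2, 3], 4: [1, 5]}
char = {2: [('a', 4)], 3: [('b', 6)], 6: [('a', 4)]}
print(match_set(eps, char, 0, 5, 4, 'aaba'))  # {1, 2, 4}
```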

Step 2: Computing Valid Pairs

We now compute a sequence of valid pairs

$$\begin{aligned} V = (i_1, X_1), (i_2, X_2), \ldots , (i_{k}, X_{k}) \end{aligned}$$

such that \(0 \le i_1< \cdots < i_{k} \le n\) and \(X_j \subseteq \{\theta _{A_I}, \phi _{A_I}\}\), with the property that the states of all pairs intersect a single accepting path P in A and all positions where P visits either \(\theta _{A_I}\) or \(\phi _{A_I}\) correspond to a valid pair in V.

To compute the sequence, we run a slightly modified state-set transition algorithm: For \(i = 0,1, \ldots , n\), we set \(S_i = \delta _A(S_{i-1}, Q[i])\) (for \(i=0\) set \(S_0 := \delta _A(\theta _A,\epsilon )\)) and compute the set

$$\begin{aligned} X:= \{x \mid x\in \{\theta _{A_I}, \phi _{A_I}\} \text { and } i \in \mathrm {Match}(x)\} \cap S_i\;. \end{aligned}$$

Thus, X is the set of boundary nodes in \(S_i\) that correspond to valid pairs computed in Step 1. If \(X\ne \emptyset \), we add \((i, X)\) to the sequence V and set \(S_i := X\).

We argue that this produces a sequence V of valid pairs with the required properties. First note that by definition of X, we inductively produce state-sets \(S_0, \ldots , S_n\) such that \(S_i\) contains the set of states reachable from \(\theta _{A}\) that match Q[1, i] and the paths used to reach \(S_i\) intersect the states of the valid pairs produced up to this point. Furthermore, we include all positions in V where \(S_i\) contains \(\theta _{A_I}\) or \(\phi _{A_I}\). It follows that V satisfies the properties.

Each modified state-set transition uses O(m) time and hence we use O(nm) time in total. The sequence V uses O(n) space. In addition to this, we store the match sets and a constant number of active state-sets using \(O(n + m)\) space.
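A compact sketch of Step 2, with the state-set transition \(\delta _A\) abstracted as a callback and a deliberately tiny \(\epsilon \)-free toy automaton (our own assumption, purely to exercise the collapsing logic):

```python
def valid_pairs(step, S0, boundary, match, Q):
    """step(S, c): one state-set transition of A; boundary: the set of
    boundary states {theta_AI, phi_AI}; match: dict state -> Match(state)
    as computed in Step 1. Returns the sequence V of valid pairs."""
    V, S = [], S0
    for i in range(len(Q) + 1):
        X = {x for x in boundary if i in match.get(x, set()) and x in S}
        if X:
            V.append((i, X))
            S = set(X)            # collapse the state-set to the boundary states
        if i < len(Q):
            S = step(S, Q[i])
    return V

# Toy automaton 0 -a-> 1 -b-> 2 -c-> 3 with boundary states 1 and 2.
trans = {0: [('a', 1)], 1: [('b', 2)], 2: [('c', 3)]}
step = lambda S, c: {t for s in S for (l, t) in trans.get(s, ()) if l == c}
V = valid_pairs(step, {0}, {1, 2}, {1: {1}, 2: {2}}, 'abc')
print(V)  # [(1, {1}), (2, {2})]
```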

Step 3: Computing the String Decomposition

We now convert the sequence \(V = (i_1, X_1), (i_2, X_2), \ldots , (i_{k}, X_{k})\) into a string decomposition. First, we construct the partition \(q_0, \ldots , q_{k+1}\) of Q such that \(q_0 = Q[1,i_1]\), \(q_j = Q[i_j + 1, i_{j+1}]\), and \(q_{k+1} = Q[i_k+1, n]\). Note that \(q_0\) and \(q_{k+1}\) may be the empty string. Next, we convert this partition into a string decomposition by first labeling each substring as inner or outer and then greedily merging these substrings to obtain an alternating sequence forming a string decomposition. We discuss these steps in detail below.

Labeling We label the substrings as follows. First label \(q_0\) and \(q_{k+1}\) with outer. For the rest of the substrings, if \(X_i = \{\theta _{A_I}\}\) and \(X_{i+1} = \{\phi _{A_I}\}\) then label \(q_i\) with inner, and if \(X_i = \{\phi _{A_I}\}\) and \(X_{i+1} = \{\theta _{A_I}\}\) then label \(q_i\) with outer. If either \(X_{i}\) or \(X_{i+1}\) contain more than one boundary node, then we use standard state-set transitions in \(A_I\) and \(A_O\) to determine if \(A_I\) accepts \(q_i\) or if there is a path in \(A_O\) from \(\phi _{A_I}\) to \(\theta _{A_I}\) that matches \(q_i\). If a substring is accepted by \(A_I\), then it can be an inner substring and if there is a path in \(A_O\) from \(\phi _{A_I}\) to \(\theta _{A_I}\) that matches \(q_i\) then it can be an outer substring. If a substring can only be either an inner or an outer substring then it is labeled with inner or outer, respectively. Let \(q_i\) be a substring that can be both an inner and an outer substring. We divide this into two cases. If there is an \(\epsilon \)-path from \(\phi _{A_I}\) to \(\theta _{A_I}\) then label all such \(q_i\) with inner. Otherwise label all such \(q_i\) with outer. See also Algorithm 1.

Algorithm 1
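Since Algorithm 1 is rendered as a figure, the following Python sketch is our reading of the case analysis in the prose; the 't'/'p' state encoding and the precomputed booleans inner_ok and outer_ok (the results of the standard state-set transitions mentioned above) are assumptions of ours:

```python
def label_blocks(V, inner_ok, outer_ok, eps_phi_to_theta):
    """V: state sets X_1..X_k ('t' = theta_AI, 'p' = phi_AI); block q_j lies
    between pairs j and j+1; inner_ok[j] / outer_ok[j]: can q_j be matched
    inside A_I / across A_O (consulted only in the ambiguous cases);
    eps_phi_to_theta: is there an eps-path phi_AI -> theta_AI?"""
    labels = ['outer']                               # q_0 is always outer
    for j in range(len(V) - 1):
        Xi, Xn = V[j], V[j + 1]
        if Xi == {'t'} and Xn == {'p'}:
            lab = 'inner'                            # case 1
        elif Xi == {'p'} and Xn == {'t'}:
            lab = 'outer'                            # case 2
        elif inner_ok[j] and not outer_ok[j]:
            lab = 'inner'                            # case 3a: only A_I matches
        elif outer_ok[j] and not inner_ok[j]:
            lab = 'outer'                            # case 3b: only A_O matches
        elif eps_phi_to_theta:
            lab = 'inner'                            # case 3c: both possible
        else:
            lab = 'outer'                            # case 3d: both possible
        labels.append(lab)
    labels.append('outer')                           # last block is outer
    return labels

V = [{'t'}, {'p'}, {'t'}, {'t', 'p'}]
print(label_blocks(V, [None, None, True], [None, None, True], True))
# ['outer', 'inner', 'outer', 'inner', 'outer']
```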

For correctness first note that \(q_0\) and \(q_{k+1}\) are always (possibly empty) outer substrings. The cases where \(|X_i| = |X_{i+1}| = 1\) (cases 1 and 2) are correct by the correctness of the sequence of valid pairs V. Due to cyclic dependencies, we may have that \(X_{i}\) and \(X_{i+1}\) contain more than one boundary node. This can happen if there is an \(\epsilon \)-path from \(\theta _{A_I}\) to \(\phi _{A_I}\) and/or there is an \(\epsilon \)-path from \(\phi _{A_I}\) to \(\theta _{A_I}\). If a substring is only accepted by one of \(A_I\) (case 3a) or \(A_O\) (case 3b), then it follows from the correctness of V that the labeling is correct. It remains to argue that the labeling in the case where \(q_i\) is accepted by both \(A_I\) and \(A_O\) is correct. To see why the labeling in this case is consistent with a string decomposition of the accepting path consider case 3c. Here, it is safe to label \(q_i\) with inner, since if we are in \(\phi _{A_I}\) after having read \(q_{i-1}\), we can just follow the \(\epsilon \)-path from \(\phi _{A_I}\) to \(\theta _{A_I}\) and then start reading \(q_i\) from here. The argument for case 3d is similar.

Except for the state-set transitions in case 3, all cases take constant time. The total time of all the state-set transitions is O(nm). Storing V and the partition together with the labeling uses O(n) space.

String decomposition Now, every substring has a label that is either inner or outer. We then merge adjacent substrings that have the same label. This produces an alternating sequence of inner and outer substrings, which is the final string decomposition. Such an alternating sequence must always exist since each pair in V intersects an accepting path.
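The merging step can be sketched as a single pass (the block/label representation is our own):

```python
def merge_blocks(blocks, labels):
    """Collapse adjacent equally-labelled substrings into one, yielding an
    alternating sequence of (substring, label) pairs."""
    merged = []
    for q, lab in zip(blocks, labels):
        if merged and merged[-1][1] == lab:
            merged[-1] = (merged[-1][0] + q, lab)   # extend the previous run
        else:
            merged.append((q, lab))
    return merged

print(merge_blocks(['aa', 'acd', 'a', 'ab'],
                   ['outer', 'inner', 'inner', 'outer']))
# [('aa', 'outer'), ('acda', 'inner'), ('ab', 'outer')]
```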

In summary, we have the following result.

Lemma 2

Given a string Q of length n and a TNFA A with m states decomposed into \(A_O\) and \(A_I\), we can compute a string decomposition wrt. \(A_I\) in O(nm) time and \(O(n+m)\) space.

5 Computing accepting paths

Let Q be a string of length n accepted by a TNFA A with m states. In this section, we first show how to compute an accepting path for Q in A in O(nm) time and \(O(n\log m + m)\) space. In the next section, we then improve the space to \(O(n + m)\). As discussed earlier, an accepting path corresponds to a parse and hence this result immediately implies Theorem 1.

Since an accepting path may have length \(\Omega (nm)\) (there may be \(\Omega (m)\)-many \(\epsilon \)-transitions between each character transition), we cannot afford to explicitly compute the path in our later o(nm) time algorithms. Instead, our algorithm will compute compressed paths of size O(n) obtained by deleting all \(\epsilon \)-transitions from a path. Note that the compressed path stores precisely the information needed to solve the parsing problem.
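Compressing a path is then a simple filter; here transitions are represented as (label, target) pairs with '' for \(\epsilon \) (our own encoding):

```python
def compress(path):
    """Drop the (possibly Omega(m)-many) eps-transitions from a path; the
    remaining character transitions are exactly what parsing needs."""
    return [step for step in path if step[0] != '']

print(compress([('', 1), ('a', 2), ('', 3), ('', 4), ('b', 5)]))
# [('a', 2), ('b', 5)]
```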

To compute the compressed path we define a recursive algorithm \(\mathrm {Path}(A,Q)\) that takes a TNFA A and a string Q and returns a compressed accepting path in A matching Q as follows. If \(n < \gamma _n\) or \(m < \gamma _m\), for constants \(\gamma _n, \gamma _m > 0\) that we will fix later (in the analysis in Sect. 5.1), we simply run the naive algorithm that stores all state-sets during state-set simulation from left to right in Q. Since one of n and m is then constant, this uses \(O(nm) = O(n + m)\) time and space. Otherwise, we proceed according to the following steps.

Step 1: Decompose

We compute a decomposition of A into inner and outer subTNFAs \(A_I\) and \(A_O\) according to Lemma 1 and compute a corresponding string decomposition \(\mathcal {Q} = \overline{q}_1, q_1, \overline{q}_2, q_2, \ldots , \overline{q}_\ell , q_\ell , \overline{q}_{\ell +1}\) for Q.

Step 2: Recurse

We build a single substring corresponding to all the subpaths in \(A_O\) and \(\ell \) substrings for \(A_I\) (one for each subpath in \(A_I\)) and recursively compute the compressed paths. To do so, construct \(\overline{q} = \overline{q}_1\cdot \beta _{A_I} \cdot \overline{q}_2 \cdot \beta _{A_I} \cdots \beta _{A_I} \cdot \overline{q}_{\ell +1}\). Recall that \(\beta _{A_I}\) is the label of the special transition we added between \(\theta _{A_I}\) and \(\phi _{A_I}\) in \(A_O\). Then, compute the compressed paths

$$\begin{aligned} \qquad \qquad \overline{p}&= \mathrm {Path}(A_O, \overline{q}) \\ p_i&= \mathrm {Path}(A_I, q_i) \qquad 1\le i \le \ell \\ \end{aligned}$$

Step 3: Combine

Finally, extract the subpaths \(\overline{p}_1, \overline{p}_2, \ldots , \overline{p}_{\ell +1}\) from \(\overline{p}\) corresponding to the substrings \(\overline{q}_1, \overline{q}_2, \ldots , \overline{q}_{\ell +1}\) and return the combined compressed path

$$\begin{aligned} P = \overline{p}_1 \cdot p_1 \cdot \overline{p}_2 \cdot p_2 \cdot \overline{p}_3 \cdots p_\ell \cdot \overline{p}_{\ell +1} \end{aligned}$$

Inductively, it directly follows that the returned compressed path is a compressed accepting path for Q in A.
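Step 3 can be sketched as follows, with compressed paths as lists of (label, target) pairs and '#' standing in for the special label \(\beta _{A_I}\) (both encodings are ours):

```python
def combine(outer_path, inner_paths, beta='#'):
    """Split the outer compressed path at the special beta-transitions and
    interleave the pieces with the recursively computed inner paths."""
    pieces, cur = [], []
    for step in outer_path:
        if step[0] == beta:
            pieces.append(cur); cur = []
        else:
            cur.append(step)
    pieces.append(cur)             # pieces = outer subpaths p1..p_{l+1}
    combined = []
    for j, piece in enumerate(pieces):
        combined.extend(piece)
        if j < len(inner_paths):   # inner path p_j goes between the pieces
            combined.extend(inner_paths[j])
    return combined

outer = [('a', 1), ('#', 2), ('b', 3)]
print(combine(outer, [[('c', 9)]]))  # [('a', 1), ('c', 9), ('b', 3)]
```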

5.1 Time analysis

We now show that the total time \(T(n, m)\) of the algorithm is O(nm). If \(n < \gamma _n\) or \(m < \gamma _m\), we run the backtracking algorithm using \(O(nm) = O(n + m)\) time and space. If \(n \ge \gamma _n\) and \(m \ge \gamma _m\), we implement the recursive step of the algorithm using O(nm) time. Let \(n_i\) be the length of the inner string \(q_i\) in the string decomposition and let \(n_0 = \sum _{i=1}^{\ell +1}|\bar{q}_i|\). Thus, \(n= \sum _{i=0}^{\ell }n_i\) and \(|\bar{q}|=n_0+\ell \). In step 2, the recursive call to compute \(\overline{p}\) takes \(T(n_0 +\ell ,\frac{2}{3}m+8)\) time and the recursive calls to compute \(p_1, \ldots , p_\ell \) take \(\sum _{i=1}^\ell T(n_i, \frac{2}{3}m+8)\) time. The remaining steps of the algorithm all take O(nm) time. Hence, we have the following recurrence for \(T(n, m)\).

$$\begin{aligned} T(n,m) = {\left\{ \begin{array}{ll} \sum _{i=1}^\ell T(n_i, \frac{2}{3}m+8) &{} \\ \qquad + \; T(n_0 +\ell ,\frac{2}{3}m+8)+ O(mn) &{} m \ge \gamma _m \text { and } n\ge \gamma _n \\ O(m+n) &{} m< \gamma _m \text { or } n < \gamma _n \end{array}\right. } \end{aligned}$$

We show that there exist constants \(\gamma _n=2\) and \(\gamma _m=25\) such that \(T(n,m) = O(nm)\). For simplicity, we have not attempted to minimize the constants. We show by induction that \(T(n,m) \le acmn - (a-1)cm +c\) for constants \(a \ge 1\), \(c \ge 1\). First, consider the base case (\(m < \gamma _m\) or \(n<\gamma _n\)). Since \(mn + 1 \ge m+n\) and \(T(n,m) \le c(m+n)\), we have \(T(n,m) \le c(mn + 1) = cmn -cm + c(m+1)\) and it follows that \(T(n,m) \le acmn - (a-1)cm +c\). For the induction step (\(m \ge \gamma _m\) and \(n\ge \gamma _n\)), we have

$$\begin{aligned} T(n,m)= & {} \sum _{i=1}^\ell T\left( n_i, \frac{2}{3}m+8\right) + T\left( n_0 +\ell ,\frac{2}{3}m+8\right) + O(mn) \nonumber \\\le & {} \sum _{i=1}^\ell \left( ac n_i( \frac{2}{3}m+8) - (a-1)c(\frac{2}{3}m+8)+c\right) \nonumber \\&\qquad + \left( ac(n_0 +\ell )(\frac{2}{3}m+8) - (a-1)c\left( \frac{2}{3}m+8\right) + c\right) + cmn \nonumber \\= & {} ac\left( \frac{2}{3}m+8\right) \sum _{i=1}^\ell n_i + ac(n_0 +\ell )\left( \frac{2}{3}m+8\right) \nonumber \\&\qquad - (\ell +1)(a-1)c\left( \frac{2}{3}m+8\right) +(\ell +1) c +cmn \nonumber \\\le & {} ac\left( \frac{2}{3}m+\frac{8}{25}m\right) \sum _{i=0}^\ell n_i + ac\ell \left( \frac{2}{3}m+\frac{8}{25}m\right) \nonumber \\&\qquad - (\ell +1)(a-1)c\left( \frac{2}{3}m+\frac{8}{25}m\right) +(\ell +1) c + cmn \end{aligned}$$
(1)
$$\begin{aligned}\le & {} ac\frac{74}{75}mn+ac\ell \frac{74}{75}m - ac(\ell +1)\frac{74}{75}m \nonumber \\&\qquad + (\ell +1)c\frac{74}{75}m+ (\ell +1) c + cmn \end{aligned}$$
(2)
$$\begin{aligned}\le & {} ac\frac{74}{75}mn - ac\frac{74}{75}m+ (n/2+2)c\frac{77}{75}m+cmn \end{aligned}$$
(3)
$$\begin{aligned}\le & {} acmn - acm +c(m+1) \end{aligned}$$
(4)

Here, (1) and (2) follow since \(m \ge 25\). Inequality (3) follows from \(\ell \le (n+1)/2\), which holds since the string decomposition produced by the algorithm consists of an alternating sequence of at least \(2\ell -1\) non-empty inner and outer substrings. Inequality (4) holds for sufficiently large a. The size of the constant a depends on the choice of \(\gamma _n\). Choosing the constant a boils down to choosing a such that the following inequality holds:

$$\begin{aligned} (a - 227/2)n \ge 144 + a, \end{aligned}$$

e.g., for \(\gamma _n = 2\) inequality (4) holds for \(a\ge 371\) and for \(\gamma _n = 10\) it holds for \(a\ge 143\).
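These constants can be verified mechanically. The following small Python check (our addition) confirms the claim for \(\gamma _n = 2\):

```python
# Check the inequality (a - 227/2) * n >= 144 + a for the stated
# constants; since a > 227/2, the left-hand side grows with n, so
# n = gamma_n is the binding case.

def holds(a, n):
    return (a - 227 / 2) * n >= 144 + a

assert holds(371, 2)                               # a = 371 works (with equality)
assert not holds(370, 2)                           # a = 370 does not
assert all(holds(371, n) for n in range(2, 1000))  # larger n only helps
```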

Hence, the total running time is O(nm).

5.2 Space analysis

Next, we consider the space complexity. First, note that the total space for storing R and Q is \(O(n + m)\). To analyze the space used by the recursive calls of the algorithm, consider the recursion tree \(\mathcal {T}_{\text {rec}}\) for \(\mathrm {Path}(A, Q)\). For a node v in \(\mathcal {T}_{\text {rec}}\), let \(Q_v\) denote the string at v, of length \(n_v\), and let \(A_v\) denote the TNFA at v, with \(m_v\) states.

Consider a path \(\rho = v_1, \ldots , v_j\) of nodes in \(\mathcal {T}_{\text {rec}}\) from the root to a leaf \(v_j\) corresponding to a sequence of nested recursive calls of the algorithm. If we explicitly store the subTNFAs, the string decompositions, and the compressed paths on \(\rho \), we use \(O(n_{v_i} + m_{v_i})\) space at each node \(v_i\), \(1 \le i \le j\). By Lemma 1(i), the sizes of the subTNFAs on \(\rho \) form a geometrically decreasing sequence and hence the total space for the subTNFAs is \(\sum _{i=1}^j m_{v_i}=O(m)\). The string and compressed path at each node \(v_i\), \(1 \le i \le j\), have length O(n). By Lemma 1(i), the depth of \(\mathcal {T}_{\text {rec}}\) is \(O(\log m)\) and hence the total space for storing the strings and compressed paths on \(\rho \) is \(O(n\log m)\). In total, we use \(O(n\log m + m)\) space. In combination with the time analysis from the previous section, we have the following result.
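The geometric decrease can also be illustrated numerically. In the sketch below (our addition), subTNFA sizes along a root-to-leaf path follow the recurrence \(m \rightarrow \frac{2}{3}m + 8\), and their sum stays within a constant factor of m:

```python
# Sizes of subTNFAs along one root-to-leaf path of the recursion
# tree, following m -> (2/3)m + 8 until the base case m < gamma_m.

def path_sizes(m, gamma_m=25):
    sizes = [m]
    while m >= gamma_m:
        m = (2 / 3) * m + 8
        sizes.append(m)
    return sizes

for m0 in (25, 50, 100, 10_000):
    sizes = path_sizes(m0)
    assert all(b < a for a, b in zip(sizes, sizes[1:]))  # strictly decreasing
    assert sum(sizes) <= 8 * m0                          # total is O(m)
```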

Theorem 2

Given a TNFA A with m states and a string Q of length n, we can compute a compressed accepting path for Q in A in O(nm) time and \(O(n\log m + m)\) space.

Unfortunately, each string (and hence compressed path) on a root-to-leaf path in \(\mathcal {T}_{\text {rec}}\) may have length \(\Omega (n)\) and hence we may need \(\Omega (n \log m)\) space to store these in our algorithm. In the next section, we show how to improve this to \(O(n + m)\) space by refining the recursion.

6 Squeezing into linear space

We now show how to improve the space in Theorem 2 to \(O(n + m)\). To do so, we carefully implement the recursion to only store the strings for a selected subset of the nodes along any root-to-leaf path in \(\mathcal {T}_{\text {rec}}\), chosen so that the stored strings take no more than O(n) space in total.

First, consider a node v in \(\mathcal {T}_{\text {rec}}\) and the corresponding string \(Q_v\) and TNFA \(A_v\). Define \(\chi ^Q_v\) to be the function that maps each character position in \(Q_v\) (ignoring \(\beta _{A_I}\) transitions) to the unique corresponding character in Q and \(\chi ^A_v\) to be the function that maps each character transition (non-\(\epsilon \) transition) in \(A_v\) to the unique character transition in A. Note that these are well-defined by the construction of subproblems in the algorithm. At a node v, we represent \(\chi ^Q_v\) by storing for each character in \(Q_v\) a pointer to the corresponding character in Q. Similarly, we represent \(\chi ^A_v\) by storing for each character transition in \(A_v\) a pointer to the corresponding character transition in A. This uses \(O(n_v + m_v)\) additional space. It is straightforward to compute these mappings during the algorithm directly from the string and TNFA decomposition in the same time. With the mappings we can now output transitions on the compressed path as pairs of position in Q and transitions in A immediately as they are computed in a leaf of \(\mathcal {T}_{\text {rec}}\). Thus, when we have traversed a full subtree at a node we can free the space for that node since we do not have to wait to return to the root to combine the pieces of the subpath with other subpaths.

We combine the mappings with an ordering of the recursion according to a heavy-path decomposition of \(\mathcal {T}_{\text {rec}}\). Let v be an internal node in \(\mathcal {T}_{\text {rec}}\). The string length of v is \(n_v\). We define the heavy child of v to be a child of v of maximum string length among the children of v (ties are broken arbitrarily). The remaining children are light children of v. We have the following key property of light children.

Lemma 3

For any node v with light child u in \(\mathcal {T}_{\text {rec}}\), we have that \(n_u \le \frac{3}{4} n_v + O(1)\).

Proof

Let v be a node in \(\mathcal {T}_{\text {rec}}\) with \(\ell +1\) children. The total string length of the children of v is \(n_v+\ell \) and a light child of v can have string length at most \((n_v + \ell )/2\). Since the \(\ell \) inner strings are disjoint non-empty substrings of Q separated by at least one character (the non-empty outer substrings), we have that \(\ell \le (n_v+1)/2\). Hence, a light child u has string length \(n_u \le \frac{n_v + \ell }{2} < \frac{3}{4} n_v + 1\). \(\square \)
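The lemma's bound can be checked by brute force. The following Python snippet (our addition) verifies it over a range of string lengths:

```python
# Exhaustive check that with l <= (n_v + 1) / 2 inner substrings, a
# light child's string length (n_v + l) / 2 stays below (3/4) n_v + 1.

for n_v in range(1, 2000):
    l_max = (n_v + 1) // 2
    for l in range(l_max + 1):
        assert (n_v + l) / 2 < 0.75 * n_v + 1
```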

We order the recursive calls at each node v as follows. First, we recursively visit all the light children of v and upon completing each recursive call, we free the space for that node. Note that the mappings allow us to do this. We then construct the subproblem for the heavy child of v, free the space for v, and continue the recursion at the heavy child.
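This traversal order can be sketched as follows (our Python illustration; Node and visit are hypothetical names, and the construction and freeing of the actual subproblems is elided):

```python
# Recursion order with a heavy-path decomposition: at each node,
# recurse into light children first (their space is freed on return),
# then continue iteratively into the heavy child.

class Node:
    def __init__(self, length, children=()):
        self.length = length            # string length n_v at this node
        self.children = list(children)

def visit(v, order):
    while v is not None:
        order.append(v.length)
        kids = sorted(v.children, key=lambda c: c.length, reverse=True)
        for light in kids[1:]:
            visit(light, order)         # light subtree finished -> free its space
        v = kids[0] if kids else None   # free v's string, descend to heavy child

order = []
visit(Node(10, [Node(3), Node(7, [Node(2)])]), order)  # order: [10, 3, 7, 2]
```

Note how the light child (length 3) is processed and discarded before the recursion commits to the heavy child (length 7).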

To analyze the space of the modified algorithm, consider a path \(v_1, \ldots , v_j\) of nodes in \(\mathcal {T}_{\text {rec}}\) from the root to a leaf \(v_j\). Now a node \(v_i\), \(1 \le i < j\), explicitly stores a string only if \(v_{i+1}\) is a light child of \(v_i\). By Lemma 3, the sum of these lengths forms a geometrically decreasing sequence and hence the total space is now O(n). Our modified algorithm does not change the asymptotic time complexity. Hence, we improve the space of Theorem 2 to linear. In summary, we have the following result.

Theorem 3

Given a TNFA A with m states and a string Q of length n, we can compute a compressed accepting path for Q in A in O(nm) time and \(O(n+m)\) space.

As discussed earlier, we can compute a parse directly from the compressed accepting path in O(n) time. Hence, Theorem 3 directly implies our main result in Theorem 1.

7 Speeding up the algorithm

We now show how to adapt the algorithm to use the faster state-set simulation algorithms such as Myers’ algorithm [23] and later variants [3, 4] that improve the O(m) bound for a single state-set transition. These results and ours all work on a unit-cost word RAM model of computation with w-bit words and a standard instruction set including addition, bitwise Boolean operations, shifts, and multiplication. We can store a pointer to any position in the input and hence \(w \ge \log (n + m)\). For simplicity, we will show how to adapt the tabulation-based algorithm of Bille and Farach-Colton [4].

7.1 Fast matching

Let A be a TNFA with m states and let Q be a string of length n. Assume first that the alphabet is constant. We briefly describe the main features of the algorithm by Bille and Farach-Colton [4] that solves the matching problem in \(O(nm/\log n + n + m)\) time and \(O(n^{\varepsilon } + m)\) space, for any constant \(\varepsilon > 0\). In the next section, we show how to adapt the algorithm to compute an accepting path.

Given a parameter \(t < w\), we will construct a global table of size \(2^{ct} < 2^w\), for a constant c, to speed up the state-set transition. We decompose A into a tree MS of \(O(\left\lceil {m/t}\right\rceil )\) micro TNFAs, each with at most t states. For each \(M \in MS\), each child C is represented by its start and accepting state and a pseudo-transition connecting these. By similar arguments as in Lemma 1, we can always construct such a decomposition.

We represent each micro-TNFA \(M \in MS\) uniquely by its parse tree using O(t) bits. Since M has at most t states, we can represent the state-set for M, called the local state-set and denoted \(S_M\), using t bits. Hence, we can make a universal table of size \(2^{O(t)}\) that for every possible micro-TNFA M of size \(\le t\), local state-set \(S_M\), and character \(\alpha \in \Sigma \cup \{\epsilon \}\) computes the state-set transition \(\delta _M(S_M, \alpha )\) in constant time.
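To illustrate the tabulation idea, here is a simplified Python sketch (ours): instead of a universal table indexed by micro-TNFA parse trees, it precomputes the transition function of a single automaton for every t-bit state-set, which already yields O(1)-time state-set transitions per lookup:

```python
# Simplified tabulation sketch: for a single automaton with at most
# t states, precompute delta(S, alpha) for every t-bit state-set S
# and character alpha, so each state-set transition is a table lookup.
# (The paper's table is universal over all micro-TNFAs of size <= t;
# this per-automaton version only illustrates the idea.)

def build_table(t, transitions, alphabet):
    """transitions: dict mapping (state, char) -> set of states."""
    table = {}
    for mask in range(1 << t):          # every possible state-set
        for ch in alphabet:
            out = 0
            for s in range(t):
                if mask & (1 << s):     # state s is in the set
                    for s2 in transitions.get((s, ch), ()):
                        out |= 1 << s2
            table[mask, ch] = out
    return table

# Toy 3-state automaton: 0 -a-> 1, 1 -a-> 1, 1 -b-> 2.
table = build_table(3, {(0, "a"): {1}, (1, "a"): {1}, (1, "b"): {2}}, "ab")
```

The table has \(2^t \cdot |\Sigma |\) entries, matching the \(2^{O(t)}\) size bound for constant alphabets.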

We use the tabulation to efficiently implement a global state-set transition on A as follows. We represent the state-set for A as the union of the local state-sets in MS. Note that parents and children in MS share some local states, and these states will be copied in the local state-sets.

To implement a state-set transition on A for a character \(\alpha \), we first traverse all transitions labeled \(\alpha \) in each micro-TNFA from the current state-set. We then follow paths of \(\epsilon \)-transitions in two depth-first left-to-right traversals of MS. At each micro TNFA M, we compute all states reachable via \(\epsilon \)-transitions and propagate the shared states among parents and children in MS. Since any cycle-free path in a TNFA contains at most one back transition (see [23, Lemma 1]), it follows that two such traversals suffice to correctly compute all states in A reachable via \(\epsilon \)-transitions.

With the universal table, we process each micro TNFA in constant time, leading to an algorithm using \(O(n\left\lceil {m/t}\right\rceil + n + m) = O(nm/t + n + m)\) time and \(O(2^t + m)\) space. Setting \(t = \varepsilon \log n\) produces the stated result. Note that each state-set uses \(O(\left\lceil {m/t}\right\rceil )\) space. To handle general alphabets, we store dictionaries for each micro TNFA with bit masks corresponding to characters appearing in the TNFA and combine these with an additional masking step in the state-set transition. This leads to a general solution with the same time and space bounds as above.

7.2 Fast parsing

We now show how to modify our algorithm from Sect. 5 to take advantage of the fast state-set transition algorithm. Let \(t < w\) be the desired speed-up as above. We have the following two cases.

If \(n \ge t\) and \(m \ge t\), we implement the recursive step of the algorithm but replace all state-set transitions, that is, those used to compute the match sets and valid pairs, by the fast state-set transition algorithm. To compute the suffix match sets, we need to compute fast state-set transitions on A with the direction of all transitions reversed. To do so, we make a new universal table of size \(2^{O(t)}\) for the micro-TNFAs with the direction of transitions reversed. We traverse the tree of micro-TNFAs with two depth-first traversals as before except that we now traverse children in right-to-left order to correctly compute all reachable states. It follows that this uses O(nm/t) time.

Otherwise (\(n < t\) or \(m < t\)), we use backtracking to compute the accepting path as follows. First, we process Q from left to right using fast state-set transitions to compute the sets \(S_0, \ldots , S_n\) of states reachable via paths from \(\theta _A\) for each prefix of Q. We store each of these state-sets. This uses \(O(nm/t + n + m) = O(n + m)\) time and space. Then, we process Q from right to left to recover a compressed accepting path in A. Starting from \(\phi _A\), we repeatedly do a fast state-set transition on A with the direction of transitions reversed, compute the intersection of the resulting state-set with the corresponding state-set from the first step, and update the state-set to a state in the intersection. We can compute the intersection of state-sets and the update in O(m/t) time using standard bitwise operations. We do the state-set transitions on the TNFA with directions of transitions reversed as above. In total, this uses \(O(nm/t + n + m) = O(n + m)\) time.
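The two passes of this base case can be sketched in Python (our toy version, using a plain NFA without \(\epsilon \)-transitions and without the bit-parallel tables, purely to show the forward/backward structure):

```python
# Backtracking base case: simulate the NFA forward while storing
# every state-set S_0..S_n, then walk backward from the accept state,
# intersecting reversed transitions with the stored sets to recover
# an accepting path (here as a sequence of states).

def forward_sets(nfa, start, q):
    """nfa: dict mapping (state, char) -> set of states."""
    sets = [{start}]
    for ch in q:
        cur = sets[-1]
        sets.append({s2 for s in cur for s2 in nfa.get((s, ch), ())})
    return sets

def backtrack(nfa, start, accept, q):
    sets = forward_sets(nfa, start, q)
    if accept not in sets[-1]:
        return None                      # Q does not match
    path, state = [accept], accept
    for i in range(len(q), 0, -1):
        # predecessors of `state` under q[i-1], restricted to S_{i-1}
        preds = {s for s in sets[i - 1] if state in nfa.get((s, q[i - 1]), ())}
        state = next(iter(preds))
        path.append(state)
    return path[::-1]

# Toy NFA for a*b: 0 -a-> 0, 0 -b-> 1.
nfa = {(0, "a"): {0}, (0, "b"): {1}}
```

For instance, backtrack(nfa, 0, 1, "aab") returns the state sequence [0, 0, 0, 1], while a non-matching string yields None.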

In summary, we have the following recurrence for the time T(n, m).

$$\begin{aligned} T(n,m) = {\left\{ \begin{array}{ll} \sum _{i=1}^\ell T(n_i, \frac{2}{3}m + 8) + T(n_0 + \ell , \frac{2}{3}m + 8) + O\left( \frac{nm}{t}\right) &{} n \ge t\,\text { and }\,m \ge t \\ O\left( n + m \right) &{} m< t \,\text { or} \,n < t \end{array}\right. } \end{aligned}$$

It follows as in Sect. 5.1 that \(T(n,m) = O(nm/t + n + m)\), for \(25 \le t < w\). The space remains linear as before. Plugging in \(t = \varepsilon \log n\) and including the preprocessing time and space for the universal tables, we have the following logarithmic improvement of Theorem 1:

Theorem 4

Given a regular expression of length m and a string of length n, we can solve the regular expression parsing problem in \(O(nm/\log n + n + m)\) time and \(O(n+m)\) space.

Other fast state-set transition algorithms [3, 23] are straightforward to adapt using the above framework. The key requirement (satisfied by the existing solutions) is that the algorithm must efficiently support state-set transitions on the TNFA with the directions of the transitions reversed, as well as intersection of two state-sets.

An algorithm that does not appear to work within the above framework is the \(O(\frac{nm\log \log n}{\log ^{3/2} n})\) matching algorithm by Bille and Thorup [6]. This algorithm does not produce state-sets at each character in the string, but instead maintains specialized information to derive state-sets every \(\sqrt{\log n}\) characters, which we do not see how to maintain in our parsing framework.

8 Open problems

We have presented a general technique to efficiently convert a large class of regular expression matching algorithms into regular expression parsing algorithms. We conclude with two open problems:

  • To be applicable, our result requires that the regular expression matching algorithm is a state-set transition algorithm. A challenging open problem is whether it is possible to obtain a reduction that works for any regular expression matching algorithm.

  • In some applications, a specific parsing is required, e.g., matching subexpressions with the longest or shortest matching strategy, or we may want to enumerate all possible parsings. We wonder if it is possible to adapt our reduction to find such parsings.