Keywords

These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.

1 Introduction

Recent focus on string analysis is motivated by the fact that strings play a central role in all aspects of web programming. As soon as you visit a web page or read a file, several encoders, decoders and sanitizers launch for different purposes. Some coders are related to data integrity and format, such as UTF8 encoding and decoding that translates between standard text file representation (UTF8) and standard runtime memory representation (UTF16) of Unicode characters. Other encoders, called sanitizers, are used to prevent cross-site scripting (XSS) attacks; typical examples are Html encoder and Css encoder. While for such coders, basic functional correctness criteria is often vital for security, it may be notoriously difficult to implement them correctly or even reason about such correctness criteria [3, 13]. One reason behind this difficulty is the subtle semantics resulting from a combinaton of arithmetic with automata theory. Individual characters are represented by integers and operations over characters often involve arithmetic operations such as bit-shifting and modulo arithmetic. Automata theory, on the other hand, is used overs strings (sequences of characters) to check for possible input or output patterns that may cause security vulnerabilities. Encoding related security vulnerabilities have been exploited for example through over-encoding [18, 20], double-encoding [19], and XSS attacks. Some recent work has studied sanitizer correctness by utilizing automata techniques [6, 16, 17], including Bek [13] that our current work builds on.

Here we introduce a language called Bex. The main features of Bex that make it more expressive and succinct than Bek are: (1) regex lookahead for pattern matching that removes the burden of having to explicitly encode state machines; (2) default rules to specify what happens when a normal rule fails. In contrast, Bek supports only single-character guards and construction of default rules is then trivial by using the disjunction of all the negated guards from a given state as the guard of the default rule from that state.

Example 1

Consider the following Bex program B. B decodes two-digit html decimal encodings. The first rule, with pattern , states that the null character must not be decoded. The second rule, with pattern , is the normal decoding case. The third rule, with pattern , uses the end-anchor \(\mathtt {\$}\) so it applies only if the match occurs at the end of the input. The fourth rule is a default rule, it applies only when no other rule applies and it always reads a single character while a normal rules read \(k\) characters at a time with \(k\) being the length of the matched input.

figure a

Consider the input . No pattern matches initially, both \(P_0\) and \(P_1\) match from position 1, \(P_1\) matches from position 6, and \(P_2\) matches from position 11. For the overlapping case, \(P_i\) has priority over \(P_j\) for \(i<j\). So

where the ASCII character codes are , and .    \(\boxtimes \)

Bek programs were originally compiled into symbolic finite transducers or SFTs [13]. Unlike sanitizers, a direct representation of decoders with SFTs is highly impractical due to state space explosion [24]. Even when registers are added to Bek and symbolic transducers with registers (STs) are being used, direct representation with Bek and STs is still very cumbersome and error prone, as illustrated by the representation of HTMLdecode (corresponding to B) in [24, Fig. 7]. The need to read several characters at once without storing them in registers and without introducing intermediate states, motivated the introduction of extended symbolic finite transducers (ESFTs) [8], that add support for lookahead. However, unlike in the classical case where lookahead can effectively be eliminated [26, Theorem 2.17], analysis of ESFTs does not reduce to analysis of SFTs and requires, for decidability, further restriction to the Cartesian case [7] where guards are conjunctions of unary predicates. Regexes such as \(P_1\) in Example 1 naturally give rise to Cartesian guards, e.g., \(P_1\) represents the guard . The guard is Cartesian because it has the form \(\lambda \bar{x}.\bigwedge _{i=1}^{|\bar{x}|}\varphi _i(x_i)\).

Cartesian ESFTs are still a powerful extension of SFTs because outputs may depend on multiple variables and use nonunary functions. For example, the second rule of B in Example 1 has the output function \(\lambda \bar{x}.[10*(x_2-48) + x_3-48]\).

The main difficulty with Bex is how to efficiently deal with default rules. A naive implementation of the semantics of bex, e.g., by using a regex matching library, is far too inefficient. For example, the full version of HtmlDecode requires 280 rules. One approach would be to eliminate default rules by adding more normal rules in an attempt to transform Bex programs to ESFTs. For example, we could add the rule to cover the case when the matched subsequence starts with but is not followed by or . Continuing this transformation quickly leads to an explosion of cases and requires intermediate states, obfuscating the semantics and defeating the purpose of the concise declarative style of Bex.

Instead, we provide here a novel compilation scheme from Bex programs to an intermediate form called symbolic rollback transducers SRTs that are subsequently compiled into STs. SRTs use lookback to avoid state space explosion. For example, an SRT may treat the pattern of an html decoder using nine transitions rather than 100 k transitions required by an SFT; once it successfully matches the pattern it refers back to the characters in the matched input, similar to \(k\)-SLTs [5]. SRTs incorporate the notion of rollback in form of rollback-transitions not present in STs [24], ESFTs [8] or \(k\)-SLTs [5], to accommodate default or exceptional behavior.

To summarize, this paper makes the following contributions:

  • Bex: a new declarative language for specifying string coders;

  • SRTs: a variant of ESFTs with the capability of rewinding the input tape;

  • Algorithm for compiling bex programs into SRTs.

As a key component the algorithm makes use of the recent algorithm for minimizing SFAs [9].

2 Symbolic Automata

In this section we introduce the basic concepts of symbolic automata that we are using in this paper. A key role is played by symbolic representation of alphabets as effective Boolean algebras. An effective Boolean algebra \(\mathcal {A}\) has components \((\mathfrak {D}_{},{\varPsi }_{},[\![\_]\!]_{},\bot ,\top ,\vee ,\wedge ,\lnot )\). \(\mathfrak {D}_{}\) is a nonempty r.e. (recursively enumerable) set of domain elements. \({\varPsi }_{}\) is an r.e. set of predicates closed under the Boolean connectives and \(\bot ,\top \in {\varPsi }_{}\). The denotation function \([\![\_]\!]_{}:{\varPsi }_{}\rightarrow 2^{\mathfrak {D}_{}}\) is r.e. and is such that, \([\![\bot ]\!]_{} = \emptyset \), \([\![\top ]\!]_{} = \mathfrak {D}_{}\), for all \(\varphi ,\psi \in {\varPsi }_{}\), \([\![\varphi \vee \psi ]\!]_{} = [\![\varphi ]\!]_{}\cup [\![\psi ]\!]_{}\), \([\![\varphi \wedge \psi ]\!]_{} = [\![\varphi ]\!]_{}\cap [\![\psi ]\!]_{}\), and \([\![\lnot \varphi ]\!]_{} = \mathfrak {D}_{}\setminus [\![\varphi ]\!]_{}\). For \(\varphi \in {\varPsi }_{}\), we write \( IsSat (\varphi )\) when \([\![\varphi ]\!]_{}\ne \emptyset \) and say that \(\varphi \) is satisfiable. The intuition is that \(\mathcal {A}\) is represented programmatically as an API with corresponding methods implementing the components. We use the following symbolic alphabets.

  • \(\mathbf {2}^{k}\) is the powerset algebra with domain \(\{n\mid 0\le n < 2^{k}\}\). Case \(k=0\) is trivial and is denoted \(\mathbf {1}\). For \(k>0\), a predicate in \({\varPsi }_{\mathbf {2}^{k}}\) is a BDD of depth \(k\).Footnote 1 The Boolean operations are the BDD operations. The denotation \([\![\beta ]\!]_{}\) is the set of all \(n\), \(0\le n< 2^k\), whose binary representation is a solution of \(\beta \).

  • \(\mathcal {U}\) We let \(\mathcal {U}\mathop {=}\limits ^{\text {def}}\mathbf {2}^{16}\) denote the basic Unicode alphabet. We use standard regex character class notation to describe predicates in \({\varPsi }_{\mathcal {U}}\). For example , \([\![\mathtt {01} ]\!]_{} = \{48,49\}\), and .

We use the following construct for alphabet extensions. Given a domain \(D\) we write \(D'\) for an injective renaming of all elements in \(D\), \(D'=\{a'\mid a\in D\}\). Similarly for \(D''\). One concrete definition of \(D'\) is \(D\times \{1\}\) and of \(D''\) is \(D\times \{2\}\). In particular, \(D'\cap D'' = \emptyset \).

Definition 1

The disjoint union \(\mathcal {A}{+}\mathcal {B}\) of two effective Boolean algebras \(\mathcal {A}\) and \(\mathcal {B}\), is the effective Boolean algebra \((\mathfrak {D}_{\mathcal {A}}'\cup \mathfrak {D}_{\mathcal {B}}'', {\varPsi }_{\mathcal {A}}\times {\varPsi }_{\mathcal {B}}, [\![\_]\!]_{},\bot ,\top ,\vee ,\wedge ,\lnot )\) where,

$$ \begin{array}{l} [\![\langle \alpha ,\beta \rangle ]\!]_{} \mathop {=}\limits ^{\text {def}}[\![\alpha ]\!]_{\mathcal {A}}'\cup [\![\beta ]\!]_{\mathcal {B}}'',\; \langle \alpha ,\beta \rangle \diamond \langle \alpha _1,\beta _1 \rangle \mathop {=}\limits ^{\text {def}}\langle \alpha \diamond _{\mathcal {A}} \alpha _1,\beta \diamond _{\mathcal {B}} \beta _1 \rangle ,\;({\diamond } \in \{{\vee },{\wedge }\})\\ \lnot \langle \alpha ,\beta \rangle \mathop {=}\limits ^{\text {def}}\langle \lnot _{\mathcal {A}} \alpha ,\lnot _{\mathcal {B}} \beta \rangle , \quad \bot \mathop {=}\limits ^{\text {def}}\langle \bot _{\mathcal {A}},\bot _{\mathcal {B}} \rangle ,\; \top \mathop {=}\limits ^{\text {def}}\langle \top _{\mathcal {A}},\top _{\mathcal {B}} \rangle . \end{array} $$

It is straightforward to prove that \(\mathcal {A}{+}\mathcal {B}\) is still an effective Boolean algebra. Observe that the implementation of \(\mathcal {A}{+}\mathcal {B}\) is trivial given the implementations of \(\mathcal {A}\) and \(\mathcal {B}\), e.g., extension of \(\mathcal {A}\) with a new element can be defined as \(\mathcal {A}{+}\mathbf {1}\).

Given a word \(u\in \mathfrak {D}_{\mathcal {A}}^*\) we write \(u'\) for the word \([u{|}_{0}',u{|}_{1}',\ldots ,u{|}_{|u|-1}']\) in \(\mathfrak {D}_{\mathcal {A}{+}\mathcal {B}}^*\). Similarly for the second subdomain.

Definition 2

A symbolic finite automaton (SFA) \(M\) is a tuple \((\mathcal {A},Q,q^0,F,{\varDelta })\) where \(\mathcal {A}\) is an effective Boolean algebra, called the alphabet, \(Q\) is a finite set of states, \(q^0\in Q\) is the initial state, \(F\subseteq Q\) is the set of final states, and \({\varDelta }\subseteq Q\times {\varPsi }_{\mathcal {A}}\times Q\) is a finite set of moves or transitions. Elements of \(\mathfrak {D}_{\mathcal {A}}\) are called characters and finite sequences of characters are called words.    \(\boxtimes \)

A word \(w\) of length \(|w|\), is denoted by \([a_0,a_1,\ldots ,a_{|w|-1}]\) where all the \(a_i\) are characters. Given a position \(i<|w|\), \(w{|}_{i}\) denotes the \(i\)’th character \(a_i\) of \(w\). The empty word is \([]\). Given two words \(u\) and \(v\), \(u.v\) denotes their concatenation. In particular, \(u.[] = [].u = u\).

A move \(\rho =(p,\varphi ,q)\) from \(p\) to \(q\) is also denoted by \(p\xrightarrow {\varphi }q\) where \(p\) is the source state \(\textit{Src}(\rho )\), \(q\) is the target state \(\textit{Tgt}(\rho )\), and \(\varphi \) is the guard or predicate of the move \(\textit{Grd}(\rho )\). A move is feasible if its guard is satisfiable. In the following let \(M=(\mathcal {A},Q,q^0,F,{\varDelta })\) be a fixed SFA.

Definition 3

A word \(w\in \mathfrak {D}_{\mathcal {A}}^*\), is accepted at state \(q\) of \(M\), \(w\in \fancyscript{L}_{q}(M)\), if there exists a set of moves where \(k=|w|\), \(q_0=q\), \(q_k\in F\), and \(w{|}_{i}\in [\![\varphi _i]\!]_{}\) for \(i< k\). The language of \(M\) is \(\fancyscript{L}_{}(M)\mathop {=}\limits ^{\text {def}}\fancyscript{L}_{q^0}(M)\).

For \(q\in Q\), we use the definitions

$$ {\varDelta }(q)\mathop {=}\limits ^{\text {def}}\{\rho \in {\varDelta }\mid \textit{Src}(\rho )=q\} ,\quad {\varDelta }^{-1}(q)\mathop {=}\limits ^{\text {def}}\{\rho \in {\varDelta }\mid \textit{Tgt}(\rho )=q\}. $$

A state \(q\) of \(M\) is a deadend when \(\fancyscript{L}_{q}(M)=\emptyset \). A deadend-move is a move whose target is a deadend. A state \(q\) of \(M\) is partial if \(\lnot \bigvee \{\textit{Grd}(\rho )\mid \rho \in {\varDelta }(q)\}\) is satisfiable. A move is feasible if the guard of the move is satisfiable. The following terminology is used to characterize various subclasses of SFAs.

  • \(M\) is deterministic: for all \(p\xrightarrow {\varphi }q, p\xrightarrow {\varphi '}q' \in {\varDelta }\), if \( IsSat (\varphi \wedge \varphi ')\) then \(q=q'\).

  • \(M\) is partial (incomplete): there is a partial state.

  • \(M\) is clean: all moves are feasible and all states are reachable from \(q^0\),

  • \(M\) is trim: \(M\) is clean and has no deadend-moves,

  • \(M\) is normalized: forall \(p,q\in Q\), there is at most one move from \(p\) to \(q\).

  • \(M\) is minimal: \(M\) is deterministic, trim, and normalized, and forall \(p,q\in Q\), \(p=q\) if and only if \(\fancyscript{L}_{p}(M)=\fancyscript{L}_{q}(M)\).

  • \(M\) is a prefix acceptor if \(M\) is minimal, \(M\) has a single final state \(q^{\mathrm {f}}_M\) and either \({\varDelta }_M(q^{\mathrm {f}}_M)=\emptyset \), or \({\varDelta }_M(q^{\mathrm {f}}_M)=\{q^{\mathrm {f}}_M\xrightarrow {\top }q^{\mathrm {f}}_M\}\), and all paths from \(q^0_M\) to \(q^{\mathrm {f}}_M\) without passing through \(q^{\mathrm {f}}_M\) have a fixed length \(K\), the length of \(M\).

Regexes used here range over the Unicode alphabet \(\mathcal {U}\) and support character classes, the syntax and the semantics is the same as in C# or JavaScript.Footnote 2 Given a regex \(P\) we write for \(P\) prepended with the start anchor. We write \(\fancyscript{L}_{}(P)\) for the regular language over \(\mathfrak {D}_{\mathcal {U}}\) accepted by \(P\). Given a regular language \(L\), we write \( SFA _{\mathrm {min}}(L)\) for a minimal SFA accepting \(L\).

Anonymous functions. We write \(\varLambda (D\,{\rightarrow }\,R)\) for some well-defined effective representation of functions, or \(\lambda \) -terms, with domain \(D\) and range \(R\). A \(\lambda \)-term \(f\in \varLambda (D\,{\rightarrow }\,R)\) denotes the mathematical function \({\varvec{f}}:D\rightarrow R\).

Let the alphabet \(\mathcal {A}\) be fixed and let \(\mathfrak {D}_{}\) stand for \(\mathfrak {D}_{\mathcal {A}}\) and let \({\varPsi }_{}\) stand for \({\varPsi }_{\mathcal {A}}\). We let \(\mathfrak {D}_{}^k \mathop {=}\limits ^{\text {def}}\{w \in \mathfrak {D}_{}^*\mid |w|=k\}\).Footnote 3 We write \(\varLambda \) for \(\bigcup _{m>0,n\ge 0}\varLambda (\mathfrak {D}_{}^m\,{\rightarrow }\,\mathfrak {D}_{}^n)\), i.e., \(\varLambda \) is the set of \(\lambda \)-terms denoting functions from nonempty fixed length words to fixed length words (the range may be \(\{[]\}\)). Given \(f\in \varLambda \), let \(\natural (f)\) denote the input rank \(m\) of \(f\in \varLambda (\mathfrak {D}_{}^m\,{\rightarrow }\,\mathfrak {D}_{}^n)\).

Example 2

Consider \(\mathcal {A}=\mathcal {U}\). Let \(h\in \varLambda (\mathfrak {D}_{}\,{\rightarrow }\,\mathfrak {D}_{})\) be \(\lambda x.(x < 10\; {?}\; x + 48\; {:}\; x + 55)\). Then \({\varvec{h}}\) encodes every nibble (value in \(\{0,\ldots ,15\}\)) as the corresponding hexadecimal (ASCII) digit,Footnote 4 e.g., and . Let \(f\in \varLambda (\mathfrak {D}_{}^1\,{\rightarrow }\,\mathfrak {D}_{}^2)\) be \( \lambda x.[h(x{|}_{0}\,{\gg }\, 4),h(x{|}_{0}\,{ \& }\, 15)]\) (\(\gg \) is shift-right and & is bitwise-and). Then \({\varvec{f}}\) encodes every single-byte-word as a word of two hexadecimal digits, e.g., .    \(\boxtimes \)

3 Symbolic Rollback Transducers

Symbolic transducers (STs) are a generalization of symbolic finite transducers or SFTs; STs were originally introduced in [24]. An ST may use registers in addition to a finite set \(Q\) of states. In general, registers can hold arbitrary values and the use of registers is unrestricted. Here we introduce another extension of SFTs called SRTs that do not allow explicit use of registers but allow lookback and rollback of input. SRTs have three kinds of transitions, defined below.

To formally define the semantics of transitions we introduce the notion of a snapshot \(\mathfrak {s}\), that is a triple \(\langle \alpha ,q,\beta \rangle \in \mathfrak {S} = \mathfrak {D}_{}^*\times Q\times \mathfrak {D}_{}^*\) with argument store \(\alpha \), state \(q\) and buffer \(\beta \). We say current character for the first character of the buffer if it is nonempty, else for the first character in the rest of the input. The unread portion of the input tape is not part of the snapshot. The idea behind the concept of a snapshot is illustrated in Fig. 1. The buffer is intended to be a prepending to the rest of the input; the semantics enforces that the buffer must be empty before any more characters are read from the rest of the input.

Fig. 1.
figure 1

Intuition behind a snapshot \(\langle \alpha ,q,\beta \rangle \) of an SRT.

An input-transition \(p{\displaystyle \mathop {\longrightarrow }^{\varphi }}q\in Q\times {\varPsi }_{}\times Q\) has the following semantics. From source state \(p\) it reads and enqueues the current character \(a\) into the argument store, provided that \(a\in [\![\varphi ]\!]_{}\), and enters the target state \(q\), formally:

An output-transition \(p\mathop {\mapsto }\limits ^{f}q \in Q\times \varLambda \times Q\) has the following semantics. From state \(p\) it consumes the argument store \(\alpha \) outputs the word \({\varvec{f}}(\alpha )\) and enters state \(q\). The transition is enabled when the length of \(\alpha \) matches the arity of \(f\).

A rollback-transition has the following semantics. From state \(p\), if the current character \(a\in [\![\varphi ]\!]_{}\), it “rewinds the input tape” by pushing the current character and the argument store (back) into the buffer, and enters state \(q\).

The idea is that a rollback-transition is taken when a normal input sequence cannot be completed, the target state \(q\) is then an “exception handling” state.

Definition 4

A Symbolic Rollback Transducer (SRT) is a tuple \((\mathcal {A},Q,q^0,F,{\varDelta })\), where \(\mathcal {A}\), \(Q\), \(q_0\), and \(F\) are as in Definition 2, and \({\varDelta }\) is a finite set of transitions as defined above.    \(\boxtimes \)

The semantics of an SRT \(B\) is defined using a transducer \((\mathfrak {s}_0,\mathfrak {S},\mathfrak {T})\) that is the unwinding of \(B\), where \(\mathfrak {s}_0\) is the initial snapshot \(\langle [],q^0,[] \rangle \) of \(B\), \(\mathfrak {S}\) is the set \(\mathfrak {D}_{}^*\times Q\times \mathfrak {D}_{}^*\) and \(\mathfrak {T}\subseteq \mathfrak {S}\times \mathfrak {D}_{}^* \times \mathfrak {D}_{}^*\times \mathfrak {S}\) is the set \(\bigcup _{\rho \in {\varDelta }}[\![\rho ]\!]_{}\).

The relation \(\mathfrak {s}\xrightarrow {u/v}\!\!\!\!\!\!\!\rightarrow _{}\mathfrak {t}\) for \(\mathfrak {s},\mathfrak {t}\in \mathfrak {S}\) and \(u,v\in \mathfrak {D}_{}^*\) is defined as the least relation such that \(\mathfrak {s}_0\xrightarrow {[]/[]}\!\!\!\!\!\!\!\rightarrow _{}\mathfrak {s}_0\) and if \(\mathfrak {s}\xrightarrow {u/v}\!\!\!\!\!\!\!\rightarrow _{}\mathfrak {s}_1\) and \(\mathfrak {s}_1\xrightarrow {u_1/v_1}\mathfrak {t}\in \mathfrak {T}\) then \(\mathfrak {s}\xrightarrow {u.u_1/v.v_1}\!\!\!\!\!\!\!\rightarrow _{}\mathfrak {t}\). The transduction of \(B\) is now defined as the following function from \(\mathfrak {D}_{}^*\) to \(2^{\mathfrak {D}_{}^*}\).

$$ \fancyscript{T}_{\!B}^{}(u)\mathop {=}\limits ^{\text {def}}\{v \mid \exists q\in F\, \langle [],q^0,[] \rangle \xrightarrow {u/v}\!\!\!\!\!\!\!\rightarrow _{}\langle [],q,[] \rangle \} $$

As a minimal requirement, we want the transition relation \(\mathfrak {T}\) to be well-founded in the following sense: there is no infinite chain \(\{\mathfrak {s}_i\xrightarrow {[]/v_i}\mathfrak {s}_{i+1}\}_{i<\omega }\) in \(\mathfrak {T}\). For example, if there is a rollback-transition then \(\mathfrak {T}\) is not well-founded, because \(\langle [],p,[a] \rangle \xrightarrow {[]/[]}\langle [],p,[a] \rangle \in \mathfrak {T}\). A sufficient condition to ensure well-foundedness of \(\mathfrak {T}\) is that the SRT is not ill-defined:

Definition 5

An SRT is ill-defined if there exists a path of states \((q_{i})_{i\le n}\) and states \(p_1\) and \(p_2\) such that, , (for \(0\le i < n\)) \(q_i\;\xrightarrow {}\;q_{i+1}\), and . The SRT is well-defined otherwise.    \(\boxtimes \)

In a well-defined SRT, any two rollback-transitions must be separated by at least one output-transition. For example, if then the SRT is ill-defined. An output-state is a state that has an outgoing output-transition.

Definition 6

An SRT \(B\) is deterministic if every output-state has exactly one outgoing transition and for every other state \(q\), all transitions from \(q\) have mutually disjoint guards. \(B\) is single-valued if, for all \(u\), \(|\fancyscript{T}_{\!B}^{}(u)|\le 1\).    \(\boxtimes \)

Proposition 1

Every deterministic SRT is single-valued.

Proof

Determinism implies that for any snapshot and current character there can be at most one resulting snapshot. Thus, for any given \(u\in \mathfrak {D}_{}^*\), there can be at most one path \(\{\mathfrak {s}_i\xrightarrow {u_i/v_i}\mathfrak {s}_{i+1}\}_{i<n}\) such that \(u=u_0.u_1.\cdots .u_{n-1}\). Thus, either \(\fancyscript{T}_{\!B}^{}(u)=\emptyset \) or \(\fancyscript{T}_{\!B}^{}(u)=\{v_0.v_1.\cdots .v_{n-1}\}\).    \(\boxtimes \)

We treat a deterministic SRT \(B\) as a partial function and we write \(B(u)=v\) for \(\fancyscript{T}_{\!B}^{}(u)=\{v\}\).

Example 3

Let \(f\) be defined as in Example 2. Let \(B\) be the SRT

Since there are no rollback-transitions the buffer is never used. We have

Thus , so .    \(\boxtimes \)

End anchors. Given an alphabet \(\mathcal {A}\), in order to detect the end of the input string over \(\mathfrak {D}_{\mathcal {A}}\), we can lift \(\mathcal {A}\) to \(\mathcal {A}{+}\mathbf {1}\) and lift all \(u\in \mathfrak {D}_{\mathcal {A}}^*\) to \(u'.[0'']\in \mathfrak {D}_{\mathcal {A}{+}\mathbf {1}}^*\) where the character \(0''\in \mathfrak {D}_{\mathcal {A}{+}\mathbf {1}}\) is used only as the last input character. Such end-of-input character can then be used to trigger a final output-transition that empties the store (when the store is nonempty).

4 Bex

The alphabet is fixed to \(\mathcal {U}\) here, \(\mathfrak {D}_{}\) stands for \(\mathfrak {D}_{\mathcal {U}}\). A bex program consist of a nonempty sequence of pattern rules \((P_\imath \Longrightarrow f_\imath )_{0\le \imath < k}\) and a default output \(f^{}_{\text {d}}\), where all \(P_\imath \) are regexes, called patterns, and all \(f_\imath \) and \(f^{}_{\text {d}}\) are output expressions such that the following well-formdness criteria hold.

  • is a prefix acceptor of some length \(K_{\imath }>0\).

  • \(f_\imath \in \varLambda \) and .

  • \(f^{}_{\text {d}}\) is undefined or \(f^{}_{\text {d}}\in \varLambda \) and .

The first well-formdness condition ensures that all patterns have fixed lengths. The second condition ensures that the output functions are in scope: depend only on the characters matched by the pattern. The third condition ensures that the default output function only depends on one character (the current one).

The formal semantics of bex programs is as follows. The intent is to support straightforward specification of how typical encoders and decoders work in practice. Given a word \(u\) and indices \(i\) and \(j\), \(0\le i\le j < |u|\), we write \(u[i..]\) for the suffix \([u{|}_{i},\ldots ,u{|}_{|u|-1}]\) and \(u[i..j]\) for the subsequence \([u{|}_{i},\ldots ,u{|}_{j}]\) of \(u\).

Definition 7

Given a bex program \(B=((P_\imath \Longrightarrow f_\imath )_{0\le \imath < k},f^{}_{\text {d}})\). The denotation of \(B\), \({\varvec{B}}\), is a (partial) function from \(\mathfrak {D}_{\mathcal {U}}^*\) to \(\mathfrak {D}_{\mathcal {U}}^*\). Let \(u\in \mathfrak {D}_{\mathcal {U}}^*\) be the input sequence. Let \(n:=0\) and \(v:=[]\). Let and let \(K_\imath \) be the length of \(M_\imath \). Repeat the following until \(n=|u|\):

  1. 1.

    Let \(I=\{\imath \mid u[n..] \in \fancyscript{L}_{}(M_{\imath })\}\).

  2. 2.

    If \(I\ne \emptyset \) let \(\imath = \min \{i\in I\mid K_i=\min \{K_j\mid j\in I\}\}\) and \((m,f) = (K_\imath ,f_\imath )\)

  3. 3.

    If \(I= \emptyset \) let \((m,f)=(1,f^{}_{\text {d}})\).

  4. 4.

    Let \(v:= v + f(u[n..n + m - 1])\) and \(n:=n+m\).

Then \({\varvec{B}}(u)=v\). (\({\varvec{B}}(u)\) is undefined if \(f^{}_{\text {d}}\) is used but is undefined).    \(\boxtimes \)

Example 1 is a simplified version of an Html decoder. Its purpose is to illustrate the use and the semantics of typical pattern rules and the default rule. It is used as a running example in the rest of the paper. In the next section we describe an algorithm that converts a bex program into an equivalent SRT.

5 Bex to SRT Compiler

The purpose of the bex to SRT compiler is, given a well-formed bex program \(B=((P_\imath \Longrightarrow f_\imath )_{0\le i < k},f^{}_{\text {d}})\) as input, to generate a well-defined deterministic SRT that is equivalent to \(B\). We assume that the default output \(f^{}_{\text {d}}\) is defined. The case when \(f^{}_{\text {d}}\) is undefined amounts to a trivial modification of the compiler.

The compiler works in two main phases. First, all the patterns of the rules are combined into a single pattern automaton \(N\) that is then minimized. The alphabet of \(N\) is \( \mathcal {U}2 \mathop {=}\limits ^{\text {def}}\mathcal {U}{+}\mathcal {U}\). The first subuniverse \(\mathcal {U}'\) serves the purpose of the Unicode alphabet, while the second subuniverse \(\mathcal {U}''\) serves the purpose of bex rule identifiers.

Second, the pattern automaton \(N\) is (essentially) extended with output-transitions and rollback-transitions to form the final SRT. It follows from minimality of \(N\) and the construction of the additional transitions that the resulting SRT is well-defined and deterministic and preserves the semantics of the original bex program.

Fig. 2.
figure 2

Sample SRT with rollback-transitions.

The alphabet of the generated SRT is going to be \(\mathcal {U}+\mathbf {1}\). The new element \(0''\in \mathfrak {D}_{\mathcal {U}+\mathbf {1}}\) is used as the end-of-input symbol of words. Observe that \(\mathfrak {D}_{\mathcal {U}+\mathbf {1}}=\mathfrak {D}_{\mathcal {U}}'\cup \{0''\}\). The main correctness theorem is the following.

Theorem 1

Given a bex program \(B\), \(\textit{SRT}(B)\) is a well-defined deterministic SRT such that, for all \(u,v\in \mathfrak {D}_{\mathcal {U}}^*\), \(B(u)=v\) iff \(\fancyscript{T}_{\!\textit{SRT}(B)}^{}(u'.[0''])=\{v'.[0'']\}\).

Proof

Formal proof is by induction over the length of computations, relating the points in Definition 7 to the constructs below and by using basic properties of \(N\) and SFAs operations. The construction of the SFA \(N\) itself uses an algorithm for minimizing SFAs [9].    \(\boxtimes \)

Detailed descriptions of the compilation phases are given below. The following example illustrates a small but realistic example.

Example 4

Consider the bex program \(B\) in Example 1. Figure 2 shows the generated \(\textit{SRT}(B)\). The rollback-transitions have a guard (not shown) that is the complement of the disjunction of all the guards from all other transitions from the source state. E.g., and    \(\boxtimes \)

5.1 Pattern Automaton Construction

  1. 1.

    Let \(\mathcal {E}:=\emptyset \); \(\mathcal {E}\) is computed as the set of all pattern ids having end anchors.

  2. 2.

    For \(\imath = 0,\ldots ,k-1\):

    1. (a)

      Let . (Recall that \(M_{\imath }\) is a prefix acceptor.)

      Let \(K_\imath \) be the length of \(M_{\imath }\).

      If \({\varDelta }_{M_\imath }(q^{\mathrm {f}}_{M_\imath })=\emptyset \) then \(\mathcal {E}:= \mathcal {E}\cup \{\imath \}\).

    2. (b)

      Let \(q^\imath \) be a new state not it \(Q_{M_\imath }\). Lift \({M_{\imath }}\) into \(N_\imath \):

      $$ \begin{array}{rll} N_{\imath }&{}=&{} ( \mathcal {U}2 ,Q_{M_{\imath }}\cup \{q^\imath \},q^0_{M_{\imath }},\{q^\imath \},{\varDelta }),\\ \text {where}\; {\varDelta }&{}= &{} \{ p\xrightarrow {\langle \varphi ,\bot \rangle }q \mid p\xrightarrow {\varphi }q \in {\varDelta }_{M_{\imath }}, p\ne q^{\mathrm {f}}_{M_\imath } \} \cup \{ q^{\mathrm {f}}_{M_\imath }\xrightarrow {\langle \bot ,\hat{\imath } \rangle }q^\imath \} \end{array} $$
  3. 3.

    Let

    $$ N := SFA _{\mathrm {min}}(\bigcup _\imath \fancyscript{L}_{}(N_{\imath })). $$

    \(N\) has a single final state, say \(F_N=\{q^{\text {f}}_N\}\), and \({\varDelta }_N(q^{\text {f}}_N)=\emptyset \). A move \(p\xrightarrow {\langle \bot ,\beta \rangle }q^{\text {f}}_N\) is a final move; let \(\imath = \min [\![\beta ]\!]_{}\), the state \(p\) is \(\imath \) -final.

  4. 4.

    Cleanup:

    1. (a)

      If a state \(p\) is \(\imath \)-final but \(\imath \notin \mathcal {E}\) then delete all non-final moves from \(p\).

    2. (b)

      Remove unreachable states from \(N\).

Cleanup removes unreachable cases: shorter patterns override longer ones (for the overlapping cases) and for patterns of the same length the ones with smaller id have priority (see Definition 7.2). The following are key properties of \(N\).

Proposition 2

For all \(w\in \mathfrak {D}_{ \mathcal {U}2 }^*\) the following statements are equivalent:

  • \(w \in \fancyscript{L}_{}(N)\)

  • for some \(u\in \mathfrak {D}_{\mathcal {U}}^*\) and \(\imath \in \mathfrak {D}_{\mathcal {U}}\), \(w=u'.[\imath '']\) and \(w\in \fancyscript{L}_{}(N_\imath )\)

  • for some \(u\in \mathfrak {D}_{\mathcal {U}}^*\) and \(\imath \in \mathfrak {D}_{\mathcal {U}}\), \(w=u'.[\imath '']\) and \(u \in \fancyscript{L}_{}(M_\imath )\) and \(|u|=K_\imath \)

Proposition 3

If \(q^0_N\xrightarrow {v}\!\!\!\!\!\!\!\rightarrow _{}q\) and \(q\xrightarrow {\langle \bot ,\psi \rangle }q^{\text {f}}_N\in {\varDelta }_N\) then for all \(\imath \in [\![\psi ]\!]_{}\), \(|v| = K_\imath \).

Proof

Fix \(\imath ,\jmath \in [\![\psi ]\!]_{}\). Then \(v+\imath ^{(2)},v+\jmath ^{(2)}\in \fancyscript{L}_{}(N)\). So, by Proposition 2, \(v=u^{(1)}\) for some \(u\in \mathfrak {D}_{\mathcal {U}}^*\) such that \(u \in \fancyscript{L}_{}(M_\imath )\cap \fancyscript{L}_{}(M_\jmath )\) and \(|u|=K_\imath =K_\jmath \).    \(\boxtimes \)

The purpose of \(N\) is going to be that \(N\) is used to construct a control flow graph of the SRT. \(N\) takes care of selecting the correct rule for a given input.

Example 5

Consider the bex program B in Example 1. The SFAs \(N_0\), \(N_1\), \(N_2\) and \(N\) are as follows:Footnote 5

figure b

In \(N\) the overlapping patterns are reflected in the final move where . \(\mathcal {E}= \{2\}\). If the end anchor was removed from \(P_2\) then \(\mathcal {E}\) would be empty and the cleanup step would delete the moves from \(q_2\) to \(q_3\) and \(q_4\). Then the states \(\{q_3,q_4,q_6,q_7,q_9,q_{10}\}\) would become unreachable.    \(\boxtimes \)

5.2 Compute Normal Transitions

We will now use \(N\) as a starting point for constructing an SRT \(\textit{SRT}(B)\) from \(B\).

We lift functions \(f\) over the universe \(\mathfrak {D}_{\mathcal {U}}\) implicitly to functions over the universe \(\mathfrak {D}_{\mathcal {U}+\mathbf {1}}\) by lifting elements in \(\mathfrak {D}_{\mathcal {U}^{}}\) to elements in the first subuniverse \(\mathfrak {D}_{\mathcal {U}^{}}'\) of \(\mathfrak {D}_{\mathcal {U}+\mathbf {1}}\). Let \({\varDelta }^{\text {in}}\) be the following set of input-transitions.

$$\begin{aligned} {\varDelta }^{\text {in}}= & {} \{ p\;\xrightarrow {\langle \varphi ,\bot \rangle }\;q \mid p\xrightarrow {\langle \varphi ,\bot \rangle }q\in {\varDelta }_N \} \end{aligned}$$

In other words, all nonfinal moves of \(N\) become input-transitions. Let \({\varDelta }^{\text {out}}\) be the following set of output-transitions, where \(q^0=q^0_N\),

$$\begin{aligned} {\varDelta }^{\text {out}}= & {} \{ p\mathop {\mapsto }\limits ^{f_{\imath }}q^0 \mid p\xrightarrow {\langle \bot ,\beta \rangle }q^{\text {f}}_N\in {\varDelta }_N,\; \imath =\min [\![\beta ]\!]_{},\; \imath \notin \mathcal {E}\} \end{aligned}$$

In other words, if a state \(p\) is \(\imath \)-final and the pattern \(P_\imath \) is not a suffix pattern of the input \((\imath \notin \mathcal {E})\) then, upon reaching the state \(p\), the input store contains a word \(s\) of length \(K_\imath \) matching the pattern \(P_\imath \). The output function \({\varvec{f}}_\imath \) is applied to the matched word \(s\) committing to the output word \({\varvec{f}}_\imath (s)\). The process is repeated from the initial state \(q^0\).

5.3 Compute Ending Transitions

When a regex pattern \(P_{\imath }\) ends with an end anchor ($ or ) then this is reflected in \(N\) by the fact that there is a state \(p\) that is \(\imath \)-final and \(\imath \in \mathcal {E}\). This means that the match must end with the end-of-input character \(0''\) because all valid input words have the form \(u'.[0'']\) for \(u\in \mathfrak {D}_{\mathcal {U}}^*\). There are new output states \(p_{\$}^{\imath }\) for all \(\imath \in \mathcal {E}\), with the following input-transitions leading to them.

$$\begin{aligned} {\varDelta }^{\text {in{\$}}}= & {} \{ p\;\xrightarrow {\langle \bot ,\top \rangle }\;p^{\imath }_{\$} \mid p\xrightarrow {\langle \bot ,\beta \rangle }q^{\text {f}}_N\in {\varDelta }_N,\; \imath =\min [\![\beta ]\!]_{},\; \imath \in \mathcal {E}\} \end{aligned}$$

There are output-transitions from each \(p_{\$}^{\imath }\) that apply the corresponding final output function to the final stored input word in each case, append the ending character \(0''\), so that all output words are also \(0''\)-terminated, and transition to the final state \(q^{\text {f}} = q^{\text {f}}_N\) of the SRT.

There are also transitions from the initial state \(q^0=q^0_N\) leading to the final state (upon end of input), where \(q^0_{\$}\) is a new output state:

There are no transitions outgoing from the final state \(q^{\text {f}}\).

5.4 Compute Default Transitions

The default behavior kicks in from a state \(p\) when the current character does not match any of the possible guards of the outgoing input-transitions from \(p\). Formally, let \(G(p)\) be the disjunction of all the guards from transitions exiting from \(p\). Here \(p\) is an input state that is a non-output state and not \(q^{\text {f}}\).

$$ G(p) \mathop {=}\limits ^{\text {def}}\bigvee \{\varphi \mid \exists q(p\;\xrightarrow {\varphi }\;q \in {\varDelta }^{\text {in}}\cup {\varDelta }^{\text {in}\$}\cup {\varDelta }^{0})\},\quad \gamma _p\mathop {=}\limits ^{\text {def}}\lnot G(p). $$

Predicate \(\gamma _p\) describes all characters that break all possible patterns at state \(p\). If \(\gamma _p\) is satisfiable then, for all current characters in \([\![\gamma _p]\!]_{}\), roll back the input tape back to the position before the match was started, then apply the default function to the first character in the input tape (it cannot be \(0''\) because the input store is nonempty when \(p\ne q^0\) and \(0''\notin [\![\gamma _{q^0}]\!]_{}\)), and finally continue the process from state \(q^0\) and the next input position. This corresponds to Definition 7.3. Formally, the following transitions are added to capture this default behavior.

where \(\mathrm {default}_{\mathrm {in}}\) and \(\mathrm {default}_{\mathrm {out}}\) are fixed new states. Observe that the well-definedness criterion (see Definition 5) is trivially satisfied. Let \(Q\) be the set of all states that occur in the transitions. The final result of the compilation is the SRT:

$$ \textit{SRT}(B) \mathop {=}\limits ^{\text {def}}(\mathcal {U}+\mathbf {1},Q,q^0,\{q^{\text {f}}\}, {\varDelta }^{\text {in}}\cup {\varDelta }^{\text {out}}\cup {\varDelta }^{\text {in}\$}\cup {\varDelta }^{\text {out}\$}\cup {\varDelta }^0\cup {\varDelta }^{\text {default}}) $$

Moreover, a well-defined \(\text {SRT}\) can be further translated into an equivalent ST without rollback-transitions by performing a symbolic partial evaluation of the default cases. Generation of C# or JavaScript code is straightforward from either well-defined deterministic SRTs or deterministic STs.

6 Implementation and Experiments

The bex language and the algorithm for generating symbolic transducers from bex programs has been implemented and is available in an online toolkit and tutorial [4]. The tutorial includes several samples, such as base64 encoding and decoding, allows online editing, and enables JavaScript generation from the bex programs. The generated JavaScript can also be directly executed online.

We have built a prototype implementation of the compiler. In a final phase the compiler converts the generated SRT into an ST without rollback-transitions. It does so by symbolically forward executing the rollback-cases and by optimizing the generated code through a combination of SFA techniques and SMTlib representation of terms using Z3 [10, 27]. Z3 terms are used to simplify arithmetic expressions and to prune unsatisfiable predicates. The STs are then converted into either C# or JavaScript implementations.

We have applied this technique to a variety of different encoders and decoders such as: Utf16Encoder and Decoder, Base32Encoder and Decoder, Base64Encoder and Decoder, CssEncoder, JavaScriptEncoder, JsonEncoder, and HtmlEncoder and Decoder. They all fall into a category of string transformation routines that can be very naturally expressed and analyzed in bex.

So far our largest case study is a bex program for the complete version of HtmlDecode that uses over 280 rules. The full bex program is less than 300 lines of code including comments. The large number of rules is due to many special cases of patterns such as and in addition to rules that decode numeric (decimal or hexadecimal) encodings of characters. The resulting minimal pattern automaton \(N\) has in this case 920 states and the generated C# code is just shy of 20 k lines of code (with sparsely generated code). The end-to-end compilation time was around 8 s that includes preprocessing as well as some analysis of the generated code. A key factor here was an efficient minimization algorithm of SFAs [9]. The minimization algorithm is used repeatedly in the loop where the SFAs \(N_i\) are being constructed during the pattern automaton construction phase of the bex compiler. For the alphabet algebra we use \( \mathcal {U}2 \) for most parts, but for dealing with \(\lambda \)-terms and satisfiability checking of linear arithmetic formulas in the final phases of the compiler we use SMT2lib representation of terms and Z3 [27].

We compared the running time of the bex generated HtmlDecoder in C# against the hand-optimized HtmlDecoder in the. NET System.Net.WebUtility library. As input to both decoders we used maximally encoded input (with hexadecimal encoding of all non-ASCII) texts over various parts of the Unicode alphabet. In this experiment, the bex coder outperformed the System coder by 2 times on average.

7 Related Work

Symbolic finite transducers (SFTs) and the Bek language were originally introduced in [13] and formally studied in [24]. SFTs were also extended to STs in [24] to allow the use of registers for increased expressive power. A common usage pattern that often occurs in the context of string decoders is that of a finite window of characters that are grouped and processed together. For such a class of problems, SFTs are too weak, while STs sacrifice analyzability. Two related formalisms have been proposed to address this issue, ESFTs [8] and \(k\)-SLTs [5]. The former uses bounded lookahead and reads several characters at once, while the latter uses bounded lookback and reads one character at a time. Further properties of ESFTs are studied in [7].

The formalism of SRTs is in spirit related to \(k\)-SLTs, because output-transitions refer to earlier characters as a form of lookback. However, once an output happens (is “committed”), there is no way to refer back to those input characters that were used, in later transitions; this is similar to the sematic of ESFTs. The aspect that is new in SRTs, is the notion of a rollback-transition that allows the input tape to be rewound or rolled back conditionally. As we demonstrated with bex, this aspect greatly simplifies the task of programming typical encoders and decoders, HtmlDecoder being a perfect example, where default rules are used extensively when pattern matching fails.

Automata over infinite alphabets have received a lot of interest [21], starting with the work on register automata [14]. A different line of work on automata with infinite alphabets called lattice automata, originates from verification of symbolic communicating machines [11]. Streaming transducers [1] provide a recent symbolic extension of finite transducers. Extended Finite Automata, or XFAs, is a succinct representation of DFAs that use registers, are introduced in [22] for network packet inspection. XFAs support only finite alphabets. History-based finite automata [15] are another extension of DFAs that have been introduced for encoding regular expressions in the context of network intrusion detection systems. Finite state transducers have been used for dynamic and static analysis to validate sanitization functions in web applications [17, 25].

Symbolic transductions can also be considered over infinite strings. For finite alphabets, a study of transformations of infinite strings is proposed in [2]. Yet a different extension is symbolic transductions over trees [23].

We use the SMT solver Z3 [10] for incrementally solving and simplifying constraints in the process of composing predicates that arise during bex compilation. Similar applications of SMT techniques have been introduced in the context of symbolic execution of programs by using path conditions [12].