
1 Introduction

As a versatile mechanism for pattern matching, searching, substituting, and so on, real-world regular expressions (regexes for short) have become an integral part of modern programming languages and software development, with numerous applications across various fields [3, 13, 18, 19, 31, 43]. Previous research [12, 18, 53] has reported that regexes are utilized in 30–40% of Java, JavaScript, and Python software.

Though popular, regexes can be difficult to comprehend and construct, and are error-prone even for proficient programmers, due to the intrinsic complexity of their syntax and semantics, resulting in tens of thousands of bug reports [13, 27, 30, 31, 44]. It is therefore crucial to offer automated techniques for test generation and bug finding for regexes. Producing matching strings for regexes is essential for many tasks, such as automated testing, verification, and validation of programs that use regexes. Numerous studies have addressed this problem using various techniques [14, 25, 26, 28, 29, 42, 50, 52]. However, crucial concerns have been either overlooked or inadequately addressed in the existing literature, limiting their utility. We classify these issues as follows.

Feature Support. All existing works lack support for some important features. For example, none of them support lookbehind, and only one supports lookahead, but with soundness errors (see Sect. 2.3). Regexes offer various extended features (or simply features) such as lookarounds, capturing groups, and backreferences. An instance of a regex using a backreference is (.*)\1, which defines the context-sensitive language {ww | w ∈ Σ*}, where \1 is a backreference to the first capturing group. Moreover, if such an expression is placed inside lookarounds, those lookarounds effectively encode the intersection of context-sensitive languages [15]. This shows that regexes are not limited to representing regular languages [1], and as a result, generating strings for regexes becomes more involved. For instance, the authors of [15] demonstrated that the emptiness problem for regular expressions with lookaheads and backreferences is undecidable. Thus, in many works, these features are disregarded or imprecisely approximated. This lack of support can lead to, for instance, poor coverage or unrevealed bugs. Furthermore, based on our analysis of open-source projects in six programming languages (Java, Python, JavaScript, C#, PHP, Perl), which yielded 955,184 unique regexes, the average ratio of capturing group usage exceeds 38%, and the average percentage of lookarounds and backreferences is over 4% (approaching 10% in C#); these features are therefore non-negligible. Similar observations for JavaScript were reported by [29].

Subtle Semantic Differences. Regexes have different semantics that can yield different matching results. For example, most programming languages provide both partial and full matching functions for regexes, which can produce different results. For instance, the regex node_modules(?=paradigm.*) from a practical project [19] matches node_modulesparadigm under a partial matching call, but is unsatisfiable under a full matching call. None of the existing works addresses the different matching semantics of regexes (such as partial/full matching), and thus they may produce wrong results. As another example, backreferences have different semantics in different programming languages. For instance, the regex (?:(a)b)?\1c can match c in JavaScript, but does not match c in Java, Python, PHP, and C#. See more examples in Sect. 2.3.
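The backreference divergence can be observed directly. The following Python sketch checks the Python side of the example; the JavaScript behavior (where an unset \1 matches the empty string, so "c" is accepted) is as described above:

```python
import re

# In Python (as in Java, PHP, and C#), a backreference to an
# unmatched group fails; JavaScript instead treats the unset \1
# as the empty string.
pattern = re.compile(r'(?:(a)b)?\1c')

print(pattern.fullmatch('c'))     # None: \1 refers to an unset group
print(pattern.fullmatch('abac'))  # matches: group 1 = "a", so \1 = "a"
```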

Unicode Support. Supporting the Unicode Standard is useful for the internationalization and localization of practical software. The PCRE and POSIX standards for regexes define several operators, such as \uFFFF, [[:word:]], and \p{L}, to represent Unicode code points, improving the usability of regexes. These operators are common in modern mainstream regex engines and string constraint solvers [17, 24, 45]. However, we found that existing tools offer incomplete support for them.

Various Kinds of Soundness Errors. We also found incorrect outputs generated by existing works, reflecting logic errors in their implementations, which may be due to the intricacy of regex syntax and semantics. See Sect. 2.3 for details.

To this end, this paper proposes a novel semantic model that comprehensively captures the extended features of regexes while being aware of the different matching semantics (i.e., partial/full matching) and matching precedence (i.e., greedy/lazy matching). Leveraging this model, we develop PowerGen, a tool for deducing matching strings for regexes. Specifically, PowerGen first rewrites the input regex, selecting the appropriate reduction rules based on the input programming language and on the partial/full matching function used. It then uses Unicode automata to support a large class of extended Unicode-related features. Next, PowerGen selects the appropriate induction rules based on the input programming language and performs a top-down induction over the sub-expressions of the rewritten regex. Finally, PowerGen randomly generates matching strings according to the induction rules and the stack compiled from the rewritten regex, which also effectively identifies unsatisfiable cases.

We evaluate PowerGen by comparing PowerGen against nine state-of-the-art tools on publicly available datasets. Our evaluation demonstrates the high effectiveness and efficiency of PowerGen.

The contributions of this paper are listed as follows.

  • We propose a novel semantic model for regexes, which comprehensively models the extended features, with awareness of the different matching semantics and matching precedence. To the best of our knowledge, it is the first to consider partial/full matching semantics in modeling and to support lookbehind.

  • Based on our model, we develop PowerGen, a tool for deducing matching strings for regexes. To this end, we devise novel algorithms that randomly generate matching strings according to the input regex, which effectively identifies unsatisfiable cases.

  • Evaluation shows the high effectiveness and efficiency of PowerGen.

2 Background

2.1 Regex

Let \(\varSigma \) be a finite alphabet of symbols and \(\varSigma ^*\) be the set of all possible words (i.e. strings) over \(\varSigma \), \(\varepsilon \) denotes the empty word and the empty set is denoted by \(\varnothing \). For the definition of standard regular expressions we refer to [55].

In practice, real-world regular expressions (regexes) are pervasive in diverse application scenarios. A regex over Σ is a well-formed parenthesized formula constructed using, besides the constructs of standard regular expressions and character classes, the following operators: (i) capturing group (E); (ii) named capturing group (?<name>E); (iii) non-capturing group (?:E); (iv) lookarounds: positive lookahead (?=E1)E2, negative lookahead (?!E1)E2, positive lookbehind E1(?<=E2), and negative lookbehind E1(?<!E2); (v) anchors: word boundary \b, non-word boundary \B, start-of-line anchor ^, and end-of-line anchor $; (vi) greedy quantifiers: E?, E*, E+, and E{m,n}; (vii) lazy quantifiers: E??, E*?, E+?, and E{m,n}?; (viii) backreference \i; and (ix) named backreference \k<name>, etc.

In addition, E?, E*, E+, and E{i} where i ∈ ℕ are abbreviations of E{0,1}, E{0,∞}, E{1,∞}, and E{i,i}, respectively. E{m,∞} is often written E{m,}. The symbols (, ), {, }, [, ], ^, $, |, \, ., ?, *, and + are escaped with the escape character \ in Σ. The language L(E) of a regex E is the set of all strings accepted by E.

2.2 Research Problem

In this paper, we focus on the research problem of finding matching strings which depends on the partial/full matching semantics in regexes. We present related concepts below.

In most programming languages there are partial and full matching functions for regexes (e.g., the matches and find functions in Java for full and partial matching, respectively). For a regex E used with the full matching function, a string w is a matching string of E if w ∈ L(E). If E is used with the partial matching function, then w is a matching string of E if w ∈ L(.*E.*).
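In Python terms, this distinction corresponds to re.search (partial) versus re.fullmatch (full). The sketch below illustrates it with the lookahead example from Sect. 1:

```python
import re

# The lookahead consumes nothing, so under full matching the suffix
# "paradigm" is left unconsumed and the match fails; under partial
# matching a prefix match suffices.
pattern = r'node_modules(?=paradigm.*)'
text = 'node_modulesparadigm'

print(bool(re.search(pattern, text)))     # True: partial matching
print(bool(re.fullmatch(pattern, text)))  # False: full matching
```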

2.3 The Current Status of Existing String Generation Tools

Table 1. The Results from Each Tool for Examples in Practical Projects

We identified 9 state-of-the-art string generation tools for comparison, which fall into three groups: (1) string generation based on automata, including Egret [25], dk.brics [35], Mutrex [2], and Generex [54]; (2) string generation based on ASTs (Abstract Syntax Trees), including Exrex [49], Xeger (O'Connor) [37], and Randexp.js [22]; (3) string generation based on SMT (Satisfiability Modulo Theories) solvers, including ExpoSE [29] and Ostrich [14]. It should be noted that string constraint solvers do more than string generation: they handle word equations and other more complicated string constraints such as ReplaceAll.

Fig. 1. The Framework of PowerGen

We notice that, even for the features they claim to support, errors exist in existing tools, predominantly for features such as lazy quantifiers, word boundaries, and backreferences. ExamplesFootnote 1 from practical projects [19] are listed in Table 1 together with the strings generated by each tool mentioned above, where the correct results are marked with “( )”. In addition, “” indicates that the tool does not support one or more features in the regex, error indicates a run-time error, and unsat indicates that the tool determines the regex cannot be satisfied by any input. Certain tools evidently exhibit flawed handling of lazy quantifiers, word boundaries, backreferences, and lookarounds. In the second example, the .* should be constrained to respect \b, but Exrex, Xeger (O'Connor), Randexp.js, and Ostrich do not take this into account and give wrong results, while ExpoSE returns an error. For the third regex, Randexp.js and ExpoSE, yielding abcdefghijk and unsat respectively, fail to support more than 9 backreferences. In another example, semantic differences arise between languages: JavaScript supports backreferences preceding the corresponding capturing group, and correct outputs include ab and abab; nevertheless, Randexp.js fails to return a correct string. For example 6, the right operand of the alternation can accept an empty string; however, Egret, Exrex, and Xeger (O'Connor) return an error and ExpoSE returns unsat. Moreover, none of them can correctly handle a self-referenced backreference as in example 7, or a backreference combined with a lookaround as in example 8.

3 Overview

In this section, we provide an overview of our approach. Our method, depicted in Fig. 1, encompasses four main components: reduction rules, Unicode automata compiler, induction system, and random matching string generator. Initially, the Reduction Rules module takes the regex, the language, and a matching function to form an AST, addressing the semantic divergence between partial and full match calls. This simplified regex AST and language are then forwarded to the Unicode Automata Compiler, which develops automata to integrate the Unicode 15.0.0 standard into UTF-8 and compiles the AST leaves into the Unicode Automaton. The Induction System component, chosen based on the input language, converts the AST into induction rules and stack. Lastly, the Random Matching String Generator uses these induction rules and stack with capture information to generate matching strings. By iteratively executing the generation function, multiple matching strings are produced.

In the following section, we exemplify our approach by highlighting the intractable part of a regex from Node.js version 18.16.0. The original regex, denoted Ψ, is used to validate the Web Linking header within HTML documents. This validation ensures that the input value is free from syntactically invalid Uniform Resource Identifiers (URIs). A legal input of Ψ is ="style", and the backreference referring to the first capturing group (")? ensures that the quotation marks " are balanced; e.g., ="style should be rejected by Ψ, but =style is acceptable.

4 Modeling and String Generation Algorithms

In this section, we present the details of our model for regex semantics. First, we describe our extension of functions as a foundation for our model in Sect. 4.1. For optimizing the efficiency of our tool, we implemented some reduction rules in Sect. 4.2. Then, we introduce a new automaton model for effective representation and Boolean operations for Unicode character classes in Sect. 4.3. We provide the induction rules from the AST in Sect. 4.4. Finally, we give a brief account of the random generation algorithms in Sect. 4.5.

4.1 Extension of Functions to Regex

Among the basic functions used to synthesize position automata [23], following the nomenclature of [11], the outputs of the first, follow, and last functions are called position sets. Here we generalize the nullable, first, and last functions to regexes by giving computation rules for regex operators, which is necessary for modeling regexes. Due to space limitations, the details of the computation rules are given in our complete version.Footnote 2 We also use these functions for tasks such as processing the semantics of anchors and identifying unsatisfiable cases, as heuristics to avoid algorithms that require exponential time [46].

Definition 1

For a regex E over \(\varSigma \), we define the following functions:

$$first(E) = \{a \mid aw \in L(E),\ a \in \varSigma,\ w \in \varSigma^*\} \qquad (1)$$
$$last(E) = \{a \mid wa \in L(E),\ a \in \varSigma,\ w \in \varSigma^*\} \qquad (2)$$
$$nullable(E) = \begin{cases} true, & \text{if } \varepsilon \in L(E)\\ false, & \text{otherwise} \end{cases} \qquad (3)$$

These definitions effectively compute the possible prefixes/suffixes of length one of a regex, without a full traversal of the AST. For example, for one of our example regexes E, first(E) = {[a-zA-Z]} and nullable(E) = false. We also note that, for backreferences, these functions depend on the capturing information available during generation and thus cannot be soundly computed statically.
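The computation rules themselves are deferred to the complete version. As a rough illustration only, here is a minimal Python sketch of how nullable, first, and last can be computed inductively over a toy AST with literals, concatenation, alternation, and star; the tuple-based node encoding is our own assumption, not the paper's implementation:

```python
# Toy AST: ('lit', c) | ('cat', e1, e2) | ('alt', e1, e2) | ('star', e)
def nullable(e):
    tag = e[0]
    if tag == 'lit':
        return False
    if tag == 'star':
        return True
    if tag == 'alt':
        return nullable(e[1]) or nullable(e[2])
    return nullable(e[1]) and nullable(e[2])  # 'cat'

def first(e):
    tag = e[0]
    if tag == 'lit':
        return {e[1]}
    if tag == 'star':
        return first(e[1])
    if tag == 'alt':
        return first(e[1]) | first(e[2])
    # 'cat': include first(e2) only if e1 can match the empty word
    f = first(e[1])
    return f | first(e[2]) if nullable(e[1]) else f

def last(e):
    tag = e[0]
    if tag == 'lit':
        return {e[1]}
    if tag == 'star':
        return last(e[1])
    if tag == 'alt':
        return last(e[1]) | last(e[2])
    # 'cat': include last(e1) only if e2 can match the empty word
    l = last(e[2])
    return l | last(e[1]) if nullable(e[2]) else l

# (ab)*c : nullable is False, first is {'a', 'c'}, last is {'c'}
e = ('cat', ('star', ('cat', ('lit', 'a'), ('lit', 'b'))), ('lit', 'c'))
```

The concatenation cases show why these functions avoid a full traversal in practice: once first(e1) is known and e1 is not nullable, first(e2) never needs to be computed.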

4.2 Reduction Rules for Regex

We implemented several reduction rules, shown in Fig. 2, to optimize the efficiency of our tool. Some of these rules are derived from existing regex engines and practical tools; for instance, the same mechanism is also found in the C# regex library. The others provide significant help in terms of efficiency and precision. The reduction rules shown here are those common to our language-dependent reduction rules, which follow the respective engine implementations.

We handle the semantic differences between partial and full match calls in the reduction system. Given the tight relationship between matching calls and anchors, we apply a semantics-preserving reduction that appends .* to an unanchored prefix/suffix (and nothing when anchored) under a partial match call. For the full matching semantics, we append ^ and $. Inside our algorithms, anchors are processed as empty characters with constraints, and for succinctness we support start-of-line/end-of-line anchors implicitly. For our running example Ψ, the ? and * operators are rewritten into {0,1} and {0,∞}, respectively. However, E^ and $E, where E is not nullable, are considered unsat by the reduction system, which directly returns ∅.
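As a hedged illustration of the partial-to-full reduction (our own sketch, not the paper's rule set): wrapping an unanchored regex in .* lets a full matching call emulate a partial one, modulo details such as newlines, which the scoped (?s:...) flag below accounts for:

```python
import re

def partial_via_full(pattern, text):
    # Reduce a partial match to a full match by padding with .*;
    # the scoped DOTALL flag makes . also cover newlines.
    padded = '(?s:.*)(?:' + pattern + ')(?s:.*)'
    return re.fullmatch(padded, text) is not None

# The reduced full match agrees with the native partial match.
for s in ['xx ab yy', 'ab', 'nope']:
    assert partial_via_full('ab', s) == bool(re.search('ab', s))
```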

Fig. 2. Reduction Rules for Regex

4.3 Effective Representation of Unicode Character Classes

We build automata in a top-down manner to encode the Unicode 15.0.0 [51] standard into UTF-8 (Algorithm 1), and traverse the automata to generate acceptable strings byte-by-byte in UTF-8. This structure makes string generation feasible in acceptable time. When ASCII flags (e.g., re.A in Python) are enabled, our representation is simplified into ASCII ranges.
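The effect of such an ASCII flag can be seen directly in Python, where \w is Unicode-aware by default and collapses to [a-zA-Z0-9_] under re.ASCII:

```python
import re

# By default \w matches Unicode word characters such as "ï";
# under re.ASCII it is restricted to [a-zA-Z0-9_].
print(bool(re.fullmatch(r'\w+', 'naïve')))            # True
print(bool(re.fullmatch(r'\w+', 'naïve', re.ASCII)))  # False
```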

In Algorithm 1, each Unicode character set c_i, when transformed into the UTF-8 encoding, is composed of several runes (byte ranges) c_i^j, each defined over one byte. After initialization, Algorithm 1 first checks whether the character class is ∅. If not, it iterates over the Unicode ranges c_1, c_2, ..., c_r in the character class cc, initializing the current state 𝒜 to the initial state init and j to the height of each c_i.

(Algorithm 1: Unicode automaton construction)

Then Algorithm 1 checks whether j ≥ 0 and whether there is a transition on c_i^j from the current state 𝒜 to a non-null state; if so, it makes that state the current state and decrements j. If there is no such non-null state and j ≥ 0, it builds a new state 𝒮, adds a transition δ(𝒜, c_i^j) = 𝒮, then takes 𝒮 as the current state and decrements j. Finally, it marks the final states when j = 0.
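Although the paper's Algorithm 1 works over byte ranges, the core idea (sharing UTF-8 byte prefixes in a trie-shaped automaton and walking it to emit valid UTF-8 byte sequences) can be sketched for individual code points as follows; the dictionary-based layout is our own assumption:

```python
import random

def build_utf8_trie(codepoints):
    # Share UTF-8 byte prefixes: each node maps a byte to a child
    # node, and the None key marks an accepting state.
    root = {}
    for cp in codepoints:
        node = root
        for b in chr(cp).encode('utf-8'):
            node = node.setdefault(b, {})
        node[None] = True  # final state
    return root

def sample(trie):
    # Walk random transitions until an accepting state is reached,
    # producing a valid UTF-8 encoding byte-by-byte.
    out, node = bytearray(), trie
    while True:
        choices = [k for k in node if k is not None]
        if None in node and (not choices or random.random() < 0.5):
            return out.decode('utf-8')
        b = random.choice(choices)
        out.append(b)
        node = node[b]

greek = build_utf8_trie(range(0x0391, 0x03A0))  # some Greek capitals
s = sample(greek)
assert len(s) == 1 and 0x0391 <= ord(s) <= 0x039F
```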

The Unicode automaton allows us to support a large class of Unicode-related extended features, which is a major factor in the high usability of our tool. Unicode character classes effectively define an automaton of a finite language more succinctly than a translation to standard regular expressions. The cost of Boolean operations on Unicode automata, including intersection [40] and subset construction [41], has a major impact on the performance of our tool. To mitigate this cost, we execute those algorithms lazily: e.g., for the character class in our running example Ψ, the complement of [;"\s] is computed only when a character is to be generated from it, instead of pre-processing and rewriting in advance [17], thus guaranteeing efficiency. The other character classes in Ψ are also compiled into Unicode automata.

4.4 Induction System for Regex

To comprehensively model the semantics of extended operators and generate matching strings, we propose the induction system.

The induction rules are composed of configurations and logical constraints. A triple (E, w, C) is called a configuration, where E is a regex, w is a variable representing the strings that E defines, and C is a stack preserving the strings generated from the referenced subexpressions. To comprehensively model the semantics of regex operators, the extension of the basic functions defined in Sect. 4.1 is necessary for our induction system. Moreover, the syntactic and semantic differences between regex dialects can hardly be neglected. To tackle this, we designed different induction rules for string generation, according to the language specified by the user. Following the categorization in [5], we consider ε-semantics and ∅-semantics, and differentiate regex dialects in implementation details; e.g., \w is equal to [a-zA-Z0-9_] in Python mode and [\p{L}\p{Mn}\p{Nd}\p{Pc}] in C# mode. The induction rules shown here are designed for JavaScript regexes; our induction system also contains specialized rules for other dialects.

(Induction rules for regexes)

The rules for standard operators are self-explanatory. Rule GREEDY unfolds the quantifier in the original configuration into a conjunction of a series of configurations so that strings are generated separately. Rule LAZY makes the string matched by the expression as short as possible.
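The greedy/lazy distinction being modeled can be observed in any backtracking engine; in Python, for example:

```python
import re

# Greedy * takes as many characters as possible, lazy *? as few.
print(re.match(r'a*', 'aaa').group())    # 'aaa'
print(re.match(r'a*?', 'aaa').group())   # ''

# The same precedence decides how far a bracketed match extends.
print(re.match(r'<(.*)>', '<x><y>').group(1))   # 'x><y'
print(re.match(r'<(.*?)>', '<x><y>').group(1))  # 'x'
```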

We constrain the expressive power of the regex in rule CAPTURE as follows. When processing a Python/C# regex, we disallow backreferences inside any referenced capturing group. When the user specifies a JavaScript regex as input, the first assertion in the post-condition is dropped and an unassigned backreference is configured as ε by default. The configuration also pushes the string generated from its sub-configuration onto stack C for later reference. Since we do not rewrite quantifiers as in [29], strings generated from the same capturing group overwrite the stack entry of index i during generation. Note that we do not present an induction rule for non-capturing groups, since these are considered useless on the AST, as in regex engines such as C#'s, and unreferenced capturing groups are treated as non-capturing, as in the Java regex engine. Thus, capturing groups not referenced within the regex are treated as non-capturing groups. For succinctness, the logic for named capturing groups is also contained in rule CAPTURE.
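The "latest value overwrites" behavior mirrors how backtracking engines resolve a capture under a quantifier, e.g. in Python:

```python
import re

# Under a quantifier, a capturing group retains only the value from
# its last iteration, just as the stack entry of index i is
# overwritten during generation.
m = re.fullmatch(r'(a|b)+', 'ab')
print(m.group(1))  # 'b': the second iteration overwrote 'a'
```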

In rule BACKREF, the configuration simply reads the latest assigned value from the stack C_i into the string variable. For our running example Ψ, last and nullable initially output unknown, and generation proceeds to produce a character from (")?; when the generation for a capturing group requires any of those functions on that exact capturing group, our algorithm returns unknown to avoid non-termination. Once a character " is generated from the sub-regex (")?, those functions become decidable, i.e., first(Ψ) = {=}, last(Ψ) = {"}, nullable(Ψ) = false. Named backreferences are also covered by this rule.

Since ES2018 [20], lookbehinds have been part of the JavaScript standard. In the rules for lookarounds, taking positive lookahead as an example, w_1 must belong to L(E_1.*), w_1w_2 is generated from E_2, and the result is the conjunction of the two configurations; e.g., (?=a)\w+ can generate abbbb even in full match mode. The most intractable case is when lookarounds are decorated by repetitions: the lookarounds then also constrain each adjacent repeating subexpression, and our induction system shows a natural advantage in handling these cases.
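The lookahead example can be checked in Python:

```python
import re

# (?=a) consumes nothing; \w+ then consumes the whole string, so the
# conjunction of the two constraints still admits a full match.
assert re.fullmatch(r'(?=a)\w+', 'abbbb') is not None
assert re.fullmatch(r'(?=a)\w+', 'bbbbb') is None
```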

In rules WBOUND and NON-WBOUND, we categorize the situations via the first and last functions of the subregexes. For instance, in the regex \b(&|ab|c), the branch & is not satisfiable. Thus we generate a character from first((&|ab|c)) ∩ \w and prune the configuration of &. The rules also cover the case where the word boundary appears at the start or end of the regex. A non-word boundary can generate ε when E_1 and E_2 are both nullable. If none of the situations is satisfied, the induction system returns ∅.
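The pruning agrees with engine behavior; in Python:

```python
import re

# \b requires a word character on exactly one side; "&" is not a word
# character, so the branch "&" can never follow \b here.
pattern = re.compile(r'\b(&|ab|c)')
print(pattern.fullmatch('&'))   # None: the pruned branch
print(pattern.fullmatch('ab'))  # matches
print(pattern.fullmatch('c'))   # matches
```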

Furthermore, since complementing a regular language can require exponential time [21, 46], we also apply heuristics to identify unsatisfiable cases. For E_1(?<!E_2), if E_2 is nullable, the negative lookbehind can never be satisfied (the empty context always matches E_2), so the regex is unsatisfiable. Similar strategies are applied to other zero-width assertions.

Returning to the running example Ψ, the result of applying the induction rules is shown below. Note that we have simplified the induction process to improve readability.

(Induction result for the running example Ψ)

4.5 String Generation Algorithm

In this section, we introduce our algorithm PowerGen, which takes a regex, the corresponding language, and the matching function as inputs, and outputs matching strings.

Our algorithm first performs a syntax check against the regex syntax rules of the given language, selects the corresponding reduction rules based on the language, and creates an AST. The reduction rules handle the semantic difference between partial and full match calls in that language. The simplified regex AST and the language are then forwarded to the Unicode automata compiler, which builds automata incorporating the Unicode 15.0.0 [51] standard into UTF-8 via Algorithm 1 and compiles the AST leaves into Unicode automata. Depending on the language, we choose the appropriate induction system, which takes the AST as input and compiles it into induction rules with stack storage. Finally, our random matching string generator receives the induction rules and stack information as input and produces a matching string as output, performing a top-down traversal of the induction rules. All Boolean operations on induction rules are performed on the Unicode automata of Sect. 4.3. If the induction system returns ∅, PowerGen outputs unsat. Since our generation strategy is random, running the generation function repeatedly produces multiple matching strings. Returning to our running example, we can easily obtain random sentences from the string variable of the root configuration. One of the generated strings is =":z2L@Q", while the "sound" modeling in [29] mistakenly returns =".

5 Evaluation

We implemented our algorithms in C++, and conducted experiments on a machine with 192-core Intel Xeon E7-8850 v2 2.30 GHz processors and 2048 GB of RAM, running Ubuntu 16.04.5 LTS. Our algorithm can generate matching strings in multiple languages, including Python, JavaScript, Java, PCRE2, and C#. Our empirical investigation aims to address the following research questions:

RQ1.:

When randomly generating strings, does accurate modeling of regexes improve string generation efficiency?

RQ2.:

Is our support for full matching and partial matching better than other tools, which only support one kind of matching semantics?

RQ3.:

How does our approach work in practical projects?

To address RQ1, we compare our approach with existing string-generation tools by evaluating our algorithm on publicly available datasets. We select representative examples to demonstrate the correctness of all the tools in generating strings according to their specified matching semantics, thus validating RQ2. Lastly, we assess the performance of our approach in comparison to other tools in practical projects to clarify RQ3.

5.1 Datasets

Our experiment was conducted on a benchmark from [19]. This benchmark contains 537,806 unique regexes extracted from 193,524 programs written in 8 programming languages, including JavaScript, Java, PHP, Python, Ruby, Go, Perl, and Rust. The unique regexes represent the set of expressions after removing duplications.

5.2 Tools for Comparison

We compared against nine string generation tools (see Sect. 2.3), ensuring each tool was configured as stated in its paper or documentation. Egret, dk.brics, Mutrex, and Generex (which extends dk.brics) do not specify which programming language's regexes they support, so we classify them as Unspecific. Exrex and Xeger focus on Python regexes, while Randexp.js, ExpoSE, and Ostrich specialize in JavaScript regexes. ExpoSE supports partial matching semantics, while the others support full matching semantics, as shown in Table 2.

Table 2. Language-Matching Calls-Features Support by String Generation Tools

\(^{a}\) abbreviations for each tool in this table, along with their respective definitions, are as follows: Egr (Egret), DK (dk.brics), Mut (Mutrex), Gen (Generex), Exr (Exrex), Xeg (Xeger (O’Connor)), Rnd (Randexp.js), Exp (ExpoSE), Ost (Ostrich)

5.3 Evaluation of Random String Generation

We compared the feature support for each tool. The results are presented in Table 2. The “” in Table 2 indicates that the programming language specified by the tool lacks support for that feature, and the “ ” signifies that the corresponding tool does not support a feature. Our algorithm supports all the features listed earlier.

To begin with, we evaluate the impact of random generation. The language-specific tools target Python or JavaScript regexes. Therefore, in the full matching comparisons, we evaluate each language-specific tool on its own language, and the unspecific tools on both Python and JavaScript. For the partial matching comparisons, only one tool, ExpoSE, is available for comparison. We filtered the dataset to ensure syntactic correctness. We assess the accuracy rate and running time of the tools, with results shown in Table 3, where the full match validation is on the left and the JavaScript partial match validation on the right. The first row on the left reports Python validation results, while the second reports JavaScript validation results (achieved by adding start- and end-of-line anchors before and after the regex). The accuracy rate is defined as: \(AccuracyRate = \frac{1}{|D|}\sum _{i=1}^{|D|}Match(r_{i},s_{i})\). Here, r_i denotes the i-th regex in the dataset D, |D| the size of D, and s_i the string generated for r_i by the tool. Match(r_i, s_i) validates the string s_i using the language's matcher, returning 1 on success and 0 on failure. Note that Ostrich, ExpoSE, and our approach check for unsatisfiability. We classify regexes as unsatisfiable if all tools determine them to be either unsatisfiable or incorrect. The memory usage refers to the average maximum resident set size.
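The accuracy metric can be computed straightforwardly; a minimal Python sketch, where the dataset and generated strings are stand-ins for a tool's actual output:

```python
import re

def accuracy_rate(dataset):
    # dataset: list of (regex, generated_string_or_None) pairs;
    # Match(r_i, s_i) is 1 iff the language's matcher accepts s_i.
    hits = 0
    for pattern, s in dataset:
        if s is not None and re.fullmatch(pattern, s):
            hits += 1
    return hits / len(dataset)

sample = [(r'a+b', 'aab'), (r'[0-9]{2}', '7'), (r'x|y', 'y')]
print(accuracy_rate(sample))  # 2 of 3 strings validate
```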

Table 3. Experimental Results of String Generation Tools under Specific Language
Fig. 3. Experimental Results of String Generation Tools on Feature Dataset

The experimental results show that PowerGen achieves the highest accuracy among state-of-the-art random string generators. In Python full matching validation, PowerGen achieved the highest accuracy at 97.40%, followed by Exrex and Xeger (O'Connor) at about 92%; the worst performers were DK, Mutrex, and Generex at about 40%. For JavaScript full matching validation, ranked by accuracy, PowerGen achieved the highest score of 97.71%, followed by Randexp.js and Ostrich at about 92%, while DK, Mutrex, and Generex achieved only 35.82%, 44.62%, and 40.93%, respectively. For JavaScript partial matching validation, PowerGen is about 20% higher than ExpoSE. Additionally, PowerGen generates correct results for all the examples mentioned in Sect. 2.3.

Regarding efficiency, in the full matching comparisons, the fastest of the other tools is Exrex at 0.146 s, while our tool takes only 0.004 s, a 36.5x speedup over Exrex. Moreover, our tool outperforms the slowest tool, Ostrich, by a factor of 2136. In the partial matching comparisons, we are up to ten thousand times faster than ExpoSE. We therefore expect our tool to be integrated into other software to improve efficiency. Regarding memory, our usage is the lowest among the compared tools.

To further analyze the experimental results, we use an AST parsing tool to parse the regexes in the dataset and divide the dataset by regex features. Note that a regex can have multiple features and can thus fall into multiple feature categories. Due to space limitations, we only present the comparison results under full matching calls. We calculate the accuracy rate for each feature and plot Fig. 3(a) and 3(b) for Python and JavaScript, respectively. The x-axis represents the features, the y-axis the tools, and the values indicate the accuracy rate.

Table 4. The Results by Each Generator for Examples

Fig. 3(a) and 3(b) indicate that our tool achieved the highest accuracy rate for every feature; lookbehind is supported only by our tool. Among the other tools that support Python regex syntax, support for word boundaries is relatively poor. Likewise, for the tools that claim to support JavaScript regex syntax, accuracy on word boundary, non-word boundary and backreference reaches only about 80%, and even drops to 0% in some cases due to lack of support.


5.4 Statistics for Full Matching and Partial Matching

Real-world project libraries use both full and partial matching calls. Unfortunately, each of the compared tools supports only one matching semantics, which is inadequate for this situation. Furthermore, these tools contain many logical errors even in the semantics they claim to support. In this section, we analyse some representative examples from the dataset under full and partial matching calls, listed in Table 4.
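To make the distinction concrete, the two semantics correspond to different calls in Python's re module; a minimal illustration:

```python
import re

# Full matching (re.fullmatch) requires the entire input to match the
# regex; partial matching (re.search) only requires some substring to
# match. The same regex can therefore be satisfiable under one
# semantics and unsatisfiable under the other.
pattern = re.compile(r"\d+")

assert pattern.fullmatch("2024") is not None       # full match succeeds
assert pattern.fullmatch("year 2024") is None      # full matching fails
assert pattern.search("year 2024") is not None     # partial match: "2024"
```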

  1.

    \(\backslash b\backslash \$\)

    Under a full matching call, the regex above is unsatisfiable, while under partial matching it should return a string from \(\backslash w\backslash \$\). Among the tools supporting full matching semantics, dk.brics, Mutrex and Generex lack support for word boundaries, and Exrex, Xeger (O'Connor) and Randexp.js return $, which is incorrect. ExpoSE's output $ is also erroneous under partial matching, while PowerGen finds the correct results under both matching semantics.

  2.

    \(node\_modules(\texttt {?=}paradigm.^*)|(paradigm\texttt {-}gulp\texttt {-}watch)\)

    Under full matching, \(paradigm\texttt {-}gulp\texttt {-}watch\) is the only matching string, while for partial matching, results from \(node\_modulesparadigm.^*\) or \(paradigm\texttt {-}gulp\texttt {-}watch\) are acceptable. Although ExpoSE claims to support lookahead, it returned unsat, which is incorrect under both full and partial matching; PowerGen generates the correct answers in both cases.

  3.

    \(\texttt {"""}(\texttt {?="})\)

    For full matching, this regex is not satisfiable, since the lookahead demands a quote after the end of the match. Under partial matching, the lookaround requires that a quote follow the three matched quotes, so a string of four quotes is generated. ExpoSE returns unsat, which is incorrect in the case of partial matching. In contrast, our result is accurate in both full and partial matching scenarios.

  4.

    \((\texttt {?=}a\{2,5\})\backslash w\)

    For full matching, this regex is unsatisfiable: the lookahead requires at least two a's, yet the whole string may consist of only one word character. Under partial matching, the string aa is acceptable. ExpoSE returned unsat, which is wrong under partial matching, while our results are correct in both cases.

The other examples in the table are similar to the four examples above.
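The behaviour described in the four examples above can be reproduced with Python's re module, using re.fullmatch for full matching and re.search for partial matching (a sanity check, not the actual validation harness used in our experiments):

```python
import re

# 1. \b\$ : no full match is possible, but "a$" contains a match.
assert re.fullmatch(r"\b\$", "$") is None
assert re.search(r"\b\$", "a$") is not None

# 2. node_modules(?=paradigm.*)|(paradigm-gulp-watch) :
#    "paradigm-gulp-watch" is a full match; a partial match also
#    exists inside "node_modulesparadigm-x".
alt = r"node_modules(?=paradigm.*)|(paradigm-gulp-watch)"
assert re.fullmatch(alt, "paradigm-gulp-watch") is not None
assert re.search(alt, "node_modulesparadigm-x") is not None

# 3. """(?=") : unsatisfiable under full matching (the lookahead
#    demands a character after the end of the match), but four
#    quotes contain a partial match.
assert re.fullmatch('"""(?=")', '"""') is None
assert re.search('"""(?=")', '""""') is not None

# 4. (?=a{2,5})\w : the lookahead forces at least "aa", yet the whole
#    string may only be one word character, so full matching fails;
#    "aa" is fine under partial matching.
assert re.fullmatch(r"(?=a{2,5})\w", "a") is None
assert re.search(r"(?=a{2,5})\w", "aa") is not None
```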


5.5 Results on Real Projects

Table 5. Examples in PyPI (Python) Project Library
Table 6. Examples in Maven (Java) Project Library

We inspected a large number of projects in PyPI [39], Maven [32], npm [36] and other project libraries, and found that many of them contain various matching calls within the same project. Tables 5 and 6 present a few examples of this phenomenon. As in RQ1, we conducted experiments in the languages supported by each tool. The other tools give wrong results on most examples because they do not support some features and/or matching calls (indicated by –), raise run-time errors (indicated by error), or fail otherwise (indicated by ). Our tool consistently produces the correct results because it considers different matching semantics and supports a wider range of extended features.


6 Related Work

Matching Semantics and Extended Features. Leftmost-longest (POSIX-style) and leftmost-greedy (PCRE-style) policies are the two popular disambiguation strategies for regular expressions. However, POSIX implementations have been found to be error-prone [16]. Okui and Suzuki [38] formalized leftmost-longest semantics and extended position automata [23] with the leftmost-longest rule. Sulzmann and Lu extended Brzozowski's derivatives [7] to the POSIX parsing problem [48]. Berglund et al. formalized the Boost semantics for its combination of POSIX semantics and capturing groups [4]. Regular expressions with backreferences were first proposed by Aho in the 1990s [1]. Câmpeanu and colleagues gave rigorous formalisms and various properties [8,9,10] for regular expressions with backreferences. Recently, Berglund and van der Merwe investigated theoretical aspects of regexes with backreferences [5]. On the theoretical foundation of lookaheads, Miyazaki and Minamide [33] extended Brzozowski's derivatives [7] to lookaheads. Recently, Berglund et al. [6] proposed a model based on Boolean automata for regular expressions with lookaheads and gave state complexity results. In 2022, Chida and Terauchi [15] gave the first formal study of regexes with both backreferences and lookaheads.

String Generation Toolkits. The Automaton Library [35] compiles a regex into an \(\epsilon \)-NFA and implements interfaces for random string generation. Egret [25] offers partial support for regexes to find inconsistencies between regexes and specifications, but it lacks support for Unicode-related features. Reggae [26] supports string generation for regular expressions with intersection and complement operators; its authors also note that supporting lookarounds and boundaries is challenging. Veanes et al. [52] proposed Rex, which can be used for regular expression testing; due to the cost of determinizing their symbolic automata, which are built from \(\epsilon \)-NFAs, Rex's string generation is not efficient. Loring et al. [29] claimed that their model supports the complete ES6 regex language, but for lookarounds decorated with repetitions, ExpoSE seems to fail to give correct results. Chen et al. [14] proposed a novel transducer model, PSST, to formalize the semantics of regex-dependent string functions, but backreferences and lookarounds are left as future work.

This paper and [47] both propose novel methods for modeling regexes, but they develop different techniques because they solve different problems. The algorithmic differences are as follows. First, our implementation is platform-independent, based on induction rules instead of Z3 [34]. Second, the authors of [47] introduced length constraints when modeling regex operators for checking satisfiability, which is insufficient to deduce fixed-length matched strings for general purposes. Third, since matching functions for regexes are ubiquitous in practical programs, we are the first to consider different matching semantics when modeling regexes for deducing matching strings. Finally, we constrain the expressive power of the input regex by induction rules to ensure the completeness of our algorithm within a fragment of the class of regexes, whereas [47] introduced a CEGAR (counterexample-guided abstraction refinement) scheme, which makes their algorithm incomplete.

7 Conclusion

We propose PowerGen, a tool for deducing matching strings for regexes. It is based on a novel semantic model for regexes that comprehensively models the extended features, with awareness of different matching semantics and matching precedence. The evaluation results demonstrate the high effectiveness and efficiency of our algorithms. In the future, we aim to develop methods to deduce the shortest matching strings for regexes and thereby obtain a more refined model for regexes.