1 Introduction

Regular expressions (regexps) are used in many different contexts [4] (e.g., MySQL injection prevention, DNA sequence alignment, etc.) to validate data, perform lexical analysis, or match strings in texts. A regexp characterizes a set of strings (i.e., a language). Several studies [6, 12] show that regexps often contain conformance faults, i.e., they do not exactly describe the intended language: they accept strings that should be refused, refuse strings that should be accepted, or both. Indeed, specifying a correct regexp is usually difficult, since the compact syntax of regexps makes their exact meaning hard to grasp.

Due to their tolerant syntax, most regexps are free of syntax errors, and conformance faults are unlikely to be found by syntax checking or static analysis [12]. For this reason, the most used form of regexp validation is testing, i.e., generating a set of strings S and comparing the behavior of the regexp under test \( r \) over S against an oracle. Often, the oracle is the tester, who has to check whether \( r \) correctly evaluates the strings in S. A common assumption made by many validation and testing techniques is that, for a human, evaluating whether a given string should belong to the desired language \(\mathcal {L}\) is much easier than writing the regexp that characterizes \(\mathcal {L}\). Based on this assumption, several programs and web services validate regexps by generating a set of strings [6] or by allowing the user to enter strings and check whether they are correctly rejected or accepted by the regexp under test. However, two issues arise: (i) since checking all possible strings is not feasible, which strings are the most critical for validating a regexp? (ii) once a wrongly evaluated string is found, can a repair be suggested to the user?

In this paper, we address both issues. We start from the assumption that the syntactic faults usually made by developers are those described by a set of fault classes identified in the literature [2]. Based on this assumption, we can restrict the testing activity to the strings able to distinguish a regexp from its mutants, i.e., the strings demonstrating the presence of particular syntactic faults. Regarding what to do when a fault has been discovered (i.e., a string s that is rejected but should be accepted, or vice versa), we propose to repair the regexp under test by substituting it with one of its mutants that instead correctly evaluates s.

We devised an iterative process that starts with a developer who has written a regexp \( r \) and wants to check whether \( r \) is correct, i.e., whether it correctly evaluates the intended language. The process then proceeds through a sequence of iterations, each consisting of: (i) generating some mutants of the regexp \( r \) under test; (ii) computing a set S of distinguishing strings that distinguish \( r \) from its mutants; (iii) checking the quality of the mutated regexps and the original regexp over S by asking the user to assess the correctness of the string evaluations; and (iv) choosing the best regexp \( r ^\prime \) (i.e., the one correctly evaluating the majority of the generated strings) and updating \( r \) with \( r ^\prime \). The process iterates until no \( r ^\prime \) better than \( r \) is found. We propose four versions of this general process. The greedy version changes the regexp as soon as it finds a better candidate among the mutants. The multiDSs version mitigates this greediness by evaluating several strings before changing the candidate. The breadth approach evaluates all the mutants before changing the candidate. The collecting approach aims at distinguishing multiple mutants at the same time. Experiments show that the approach is able to repair 85% of faulty regexps; on average, the user must evaluate around 40 strings, and the whole process (also considering the user's effort) lasts, at most, 4.5 min.

Paper Structure. Section 2 gives background on regexps and mutation operators for them. Section 3 introduces the notions of conformance and mutation fault. Section 4 presents our approach for repairing a (possibly faulty) regexp. Section 5 describes the experiments we performed, Sect. 6 reviews related work, and Sect. 7 concludes the paper.

2 Background

In our context, regular expressions are intended as valid regular expressions from formal automata theory, i.e., only regular expressions that describe regular languages. Indeed, the proposed approach is based on the finite automata accepting the languages described by the regular expressions.

Definition 1

(Regular expression). A regular expression (regexp) \( r \) is a sequence of constants (defined over an alphabet \({\varSigma }\)) and operator symbols. The regexp \( r \) accepts a set of words \(\mathcal {L}( r ) \subseteq {\varSigma }^*\) (i.e., a language).

We also use \( r \) as a predicate: \( r (s)= true \) if \(s \in \mathcal {L}( r ) \), and \( r (s)= false \) otherwise.

As \({\varSigma }\) we support the Unicode alphabet (UTF-16); the supported operators are union, intersection, concatenation, complement, character class, character range, and wildcards.

Normally, acceptance is computed by converting \( r \) into an automaton \(\mathcal {R}\), and then checking whether \(\mathcal {R}\) accepts the string. In this paper, we use the dk.brics.automaton library to transform a regexp \( r \) into an automaton \(\mathcal {R}\) (by means of function \(\mathtt {toAutomaton}( r )\)) and to perform standard operations (complement, intersection, and union) on it. Moreover, we use the operator \(\mathtt {pickAword}(\mathcal {R})\), which returns a string s accepted by \(\mathcal {R}\), i.e., \(s \in \mathcal {L}(\mathcal {R}) \).
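For concreteness, the following is a minimal Java sketch of these primitives (the class name RegexpOps is ours, and realizing \(\mathtt {pickAword}\) via getShortestExample is just one possible choice; any strategy that picks an accepted word would do):

```java
import dk.brics.automaton.Automaton;
import dk.brics.automaton.RegExp;

public class RegexpOps {
    // toAutomaton(r): parse the regexp and build the automaton accepting L(r)
    static Automaton toAutomaton(String r) {
        return new RegExp(r).toAutomaton();
    }

    // pickAword(R): return some string accepted by R (here, a shortest one);
    // null if L(R) is empty
    static String pickAword(Automaton a) {
        return a.getShortestExample(true);
    }

    public static void main(String[] args) {
        Automaton r = toAutomaton("[a-z]*");
        System.out.println(r.run("abc")); // true:  "abc" is in L(r)
        System.out.println(r.run("A"));   // false: "A" is not in L(r)
        System.out.println(pickAword(r)); // a shortest accepted word (the empty string)
    }
}
```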

For our purposes, we give the notion of a string distinguishing two languages.

Definition 2

(Distinguishing string). Given two languages \(\mathcal {L}_1\) and \(\mathcal {L}_2\), we say that a string s is distinguishing if it is a word of the symmetric difference \(\mathcal {L}_1 \oplus \mathcal {L}_2 = (\mathcal {L}_1 \setminus \mathcal {L}_2) \cup (\mathcal {L}_2 \setminus \mathcal {L}_1)\) between \(\mathcal {L}_1\) and \(\mathcal {L}_2\).
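Using the automaton operations above, a distinguishing string can be computed directly from this definition. Below is a minimal sketch (the name genDs anticipates the generator of [2] used later, whose actual implementation may differ; minus is dk.brics.automaton's language difference):

```java
import dk.brics.automaton.Automaton;

public class Distinguish {
    // genDs(A1, A2): a word of the symmetric difference L(A1) ⊕ L(A2),
    // or null if the two languages coincide
    static String genDs(Automaton a1, Automaton a2) {
        Automaton symDiff = a1.minus(a2).union(a2.minus(a1));
        symDiff.determinize(); // keep the automaton in a canonical, searchable form
        return symDiff.getShortestExample(true);
    }
}
```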

Fault Classes and Mutation Operators. Our approach is based on the idea that conformance faults (see Definition 4) can be discovered by means of mutation analysis. In mutation analysis for regexps, a fault class represents a possible kind of mistake made by a developer when writing a regexp; a mutation operator for a fault class is a function that, given a regexp \( r \), returns a list of regexps (called mutants) obtained by mutating \( r \) (possibly removing the fault); a mutation slightly modifies the regexp \( r \), under the assumption that the programmer has defined \( r \) close to the correct version (competent programmer hypothesis [10]).

We use the fault classes and the related mutation operators proposed in [2]. In the following, given a regexp \( r \), we identify with \(\mathtt {mut}( r )\) its set of mutants.
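The operators of [2] are not reproduced here. Purely as a hypothetical illustration of the flavor of \(\mathtt {mut}\), the sketch below applies one plausible fault class (a quantifier exchanged for a similar one) at every position of the regexp string; the real operators of [2] work on the parse tree and cover many more fault classes:

```java
import java.util.ArrayList;
import java.util.List;

public class QuantifierMutation {
    // Hypothetical operator: swap one occurrence of '*' with '+' (and vice
    // versa) at a time, yielding one mutant per occurrence.
    static List<String> mutate(String regexp) {
        List<String> mutants = new ArrayList<>();
        for (int i = 0; i < regexp.length(); i++) {
            char c = regexp.charAt(i);
            if (c == '*' || c == '+') {
                char swapped = (c == '*') ? '+' : '*';
                mutants.add(regexp.substring(0, i) + swapped + regexp.substring(i + 1));
            }
        }
        return mutants; // e.g., mutate("[a-z]*") yields ["[a-z]+"]
    }
}
```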

3 Conformance Faults and Mutation Faults

In this section, we describe how to compare the language characterized by a developed regexp with the intended language. Given a regexp \( r \), we assume a reference oracle that represents the intended meaning of \( r \). Formally:

Definition 3

(Oracle). An oracle \(o_{ r }\) is a predicate saying whether a string must be accepted or not by a regexp \( r \), i.e., \(o_{ r }(s) = true \) if s must be accepted by a correct regexp. We identify with \(\mathcal {L}(o_{ r })\) the set of strings accepted by \(o_{ r }\).

The oracle (a program or a human) can be queried and must say whether a string belongs to \(\mathcal {L}(o_{ r })\) or not, i.e., whether it must be accepted or rejected by \( r \). Usually, the oracle is the developer, who can correctly assess whether a string should be accepted or not (but may not be able to write the correct regexp). A correct regexp is indistinguishable from its oracle (i.e., they share the same language), while a faulty one wrongly accepts or rejects some strings. We now define such strings.

Definition 4

(Conformance Fault). A regexp \( r \) (supposed to behave as specified by \(o_{ r }\)) contains a conformance fault if

$$\begin{aligned} \exists s \in {\varSigma }^* : distStr (s, r , o_{ r }) \end{aligned}$$

where \( distStr (s, r , o_{ r })\) indicates that s is a distinguishing string between the languages of \( r \) and \(o_{ r }\), i.e., \(s \in \mathcal {L}( r ) \wedge s \not \in \mathcal {L}(o_{ r }) \) or \(s \not \in \mathcal {L}( r ) \wedge s \in \mathcal {L}(o_{ r }) \). We call the first case an inclusion fault and the second case an exclusion fault.
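Operationally, \( distStr \) amounts to comparing the regexp's verdict with the oracle's. A minimal sketch (the Oracle interface is our stand-in for the queried user):

```java
import dk.brics.automaton.Automaton;

public class Conformance {
    interface Oracle { boolean accepts(String s); } // in practice, the user

    // distStr(s, r, o): true iff r and the oracle disagree on s
    static boolean distStr(String s, Automaton r, Oracle o) {
        return r.run(s) != o.accepts(s);
    }
}
```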

A fault shows that the user specified the regexp wrongly, possibly misunderstanding the semantics of the regexp notation. Finding all conformance faults would require computing \(\mathcal {L}( r ) \oplus \mathcal {L}(o_{ r }) \); however, we do not know \(\mathcal {L}(o_{ r }) \) in advance, and we can only invoke the oracle \(o_{ r }\) to check whether a string belongs to \(\mathcal {L}(o_{ r })\) or not. Therefore, proving the absence of conformance faults in \( r \) would require evaluating \( distStr (s, r , o_{ r })\) for all the strings \(s \in {\varSigma }^*\), i.e., checking whether s belongs to \(\mathcal {L}( r )\) and not to \(\mathcal {L}(o_{ r })\), or vice versa. Such exhaustive testing is infeasible, and so we aim at finding a subset of strings of \({\varSigma }^*\) that are more likely to witness faults. Following a fault-based approach, we claim that the syntactic faults most likely made by developers are those described by the fault classes we introduced in [2]. If this assumption holds, most conformance faults should be witnessed by strings that are evaluated differently by the regexp and its mutants. For this reason, we introduce a stronger type of fault, defined in terms of mutations.

Definition 5

(Mutation Fault). A regexp \( r \) (that is supposed to behave as specified by \(o_{ r }\)) contains a mutation fault if

$$\begin{aligned} \exists r ^\prime \in \mathtt {mut}( r ) , \exists s \in (\mathcal {L}( r ) \oplus \mathcal {L}( r ^\prime )) : distStr (s, r , o_{ r }) \end{aligned}$$

We call (\( r ^\prime \), s) a failure couple.

A regexp \( r \) contains a mutation fault if, given a mutant \( r ^\prime \), a distinguishing string s between \( r \) and \( r ^\prime \) shows that \( r \) is faulty. To find a mutation fault, we do not need to consider all the strings in \({\varSigma }^*\), since we know how to compute \(\mathcal {L}( r ) \oplus \mathcal {L}( r ^\prime ) \). Targeting mutation faults is beneficial thanks to the following theorem.

Theorem 1

Given a regexp \( r \), if it contains a mutation fault with failure couple (\( r ^\prime \),s), then \(\lnot distStr (s, r ^\prime , o_{ r })\), i.e., \( r ^\prime \) correctly evaluates s.

[Algorithm 1: the test-and-repair meta-algorithm (see Sect. 4)]

Proof

Let us assume that \(s \in \mathcal {L}( r ) \oplus \mathcal {L}( r ^\prime ) \) and that it is distinguishing w.r.t. the oracle, i.e., \( distStr (s, r , o_{ r })\). This also means that \(s \in \mathcal {L}( r ) \oplus \mathcal {L}(o_{ r }) \). We can identify two cases. (1) If s is accepted by \( r \), i.e., \( s \in \mathcal {L}( r ) \), then, by the two symmetric differences above, s is rejected by both \( r ^\prime \) and \(o_{ r }\). (2) If s is rejected by \( r \), i.e., \( s \not \in \mathcal {L}( r ) \), then s is accepted by both \( r ^\prime \) and \(o_{ r }\). In both cases, s is evaluated in the same way by \( r ^\prime \) and \(o_{ r }\), so \( s \notin \mathcal {L}( r ^\prime ) \oplus \mathcal {L}(o_{ r }) \). \(\square \)

Theorem 1 is central to our approach, which combines testing and repairing: if we test a regexp \( r \) with strings distinguishing \( r \) from one of its mutants \( r ^\prime \) (instead of other strings, such as random ones), then in case of failure (the test distinguishes \( r \) from its oracle) we also know how to repair \( r \) to remove that particular fault, by taking \( r ^\prime \) as the new regexp. Note, however, that \( r ^\prime \) may still contain other faults.

4 Testing and Repairing Regular Expressions

Our repairing approach exploits Theorem 1 in order to test and repair a regexp in an interactive way. It is based on the assumption that the mistakes made by a developer are those described by the fault classes proposed in [2]. However, if we showed the developer a mutated regexp, (s)he could still have problems in understanding whether it is the correct regexp. On the other hand, if we showed her/him a string s, (s)he would have no difficulty in assessing its correct evaluation. We therefore propose an approach that generates strings distinguishing a regexp \( r \) from its mutants and that selects one of the mutants in case of a mutation fault. The approach is formalized by the meta-algorithm presented in Algorithm 1. The algorithm takes as input the regexp \( r \) we want to repair and an oracle \(o_{ r }\) (usually the user, who is able to assess the correct evaluation of all the strings). Two sets \(A\) and \(R\) are created to memorize the strings that are known to be accepted and rejected by the oracle, respectively. Then, the following instructions are repeatedly executed:

  • \( r \) is mutated with the operators defined in [2] (mutants are stored in \( muts \));

  • function \(\textsc {testAndRep}\) checks whether in \( muts \) there exists a regexp \( r ^\prime \) better than \( r \), asking the user to evaluate some distinguishing strings;

  • if a better regexp is found, the process is iterated again using \( r ^\prime \) as the new regexp under test (line 5); otherwise, \( r ^\prime \) is returned as the final regexp (line 6).

We identify four possible ways to select the new repaired regexp (function testAndRep), described in the following sections. Note that Algorithm 1 is not guaranteed to terminate with any of the approaches and, in any case, it could iterate through several regexps (asking the user to evaluate several distinguishing strings) before terminating. For this reason, the user should specify a maximum number of evaluations MaxEval, after which the process is interrupted.
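Since the original listing is not reproduced here, the following Java sketch reconstructs Algorithm 1 from the description above (the line numbers cited in the rest of this section refer to the original listings, not to this sketch):

```java
import dk.brics.automaton.Automaton;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class Repairer {
    // Algorithm 1 (sketch): iterate until testAndRep finds no better regexp
    static Automaton repair(Automaton r, Conformance.Oracle oracle) {
        Set<String> A = new HashSet<>(); // strings the oracle accepts
        Set<String> R = new HashSet<>(); // strings the oracle rejects
        while (true) {                   // in the tool, also bounded by MaxEval oracle queries
            List<Automaton> muts = mut(r);                        // mutants of r
            Automaton rPrime = testAndRep(r, muts, oracle, A, R); // look for a better regexp
            if (rPrime == r) return r;   // no improvement: r is the final regexp
            r = rPrime;                  // repair and iterate
        }
    }

    // Placeholders for the components described in the text
    static List<Automaton> mut(Automaton r) {
        throw new UnsupportedOperationException("mutation operators of [2]");
    }

    static Automaton testAndRep(Automaton r, List<Automaton> muts,
                                Conformance.Oracle oracle,
                                Set<String> A, Set<String> R) {
        throw new UnsupportedOperationException("one of the variants of Sect. 4.1-4.4");
    }
}
```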

[Algorithm 2: the greedy testAndRep]

4.1 Greedy Approach

Algorithm 2 shows the greedy version of testAndRep. For each mutant m in \( muts \), the function performs the following instructions:

  • m is evaluated by function \(\textsc {EvMut}\) (line 3), working as follows:

    • if m wrongly evaluates a string in \(A\) or \(R\), it returns \(( null , null )\), meaning that m must not be considered (line 11);

    • it generates a distinguishing string \( ds \) between \( r \) and m, using function \(\mathtt {genDs}\) (line 13); see [2] for the implementation of \(\mathtt {genDs}\);

    • if no string is generated, \( r \) and m are equivalent; in this case, it returns \(( null , null )\), meaning that m must not be considered (line 14);

    • if a \( ds \) is generated, it stores the oracle evaluation of \( ds \) in \( oEv \) (line 16);

    • depending on the oracle evaluation, it adds \( ds \) either to \(A\) or \(R\), using procedure \(\textsc {MarkDs}\) (line 17);

    • it returns the \( ds \) and its evaluation \( oEv \) (line 18);

  • if a \( ds \) is returned, it checks whether the mutant m and the oracle assess the validity of \( ds \) in the same way (line 4). If this is the case, a mutation fault in \( r \) has been found, and m is returned as the new repaired regexp (line 5).

If no better mutant is found, \( r \) is returned (line 8). The approach guarantees to return a regexp that correctly accepts all the strings in \(A\) and rejects those in \(R\), while each of its mutants wrongly evaluates at least one of these strings.
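Building on the sketches above, the greedy testAndRep can be reconstructed from this description as follows (a sketch, not the original listing; the cited line numbers refer to Algorithm 2):

```java
import dk.brics.automaton.Automaton;
import java.util.List;
import java.util.Set;

public class Greedy {
    // Greedy testAndRep: return the first mutant that, on a fresh
    // distinguishing string, agrees with the oracle while r does not.
    static Automaton testAndRep(Automaton r, List<Automaton> muts,
                                Conformance.Oracle oracle,
                                Set<String> A, Set<String> R) {
        for (Automaton m : muts) {
            if (misclassifiesKnown(m, A, R)) continue; // EvMut, line 11
            String ds = Distinguish.genDs(r, m);       // line 13
            if (ds == null) continue;                  // m equivalent to r, line 14
            boolean oEv = oracle.accepts(ds);          // oracle evaluation, line 16
            if (oEv) A.add(ds); else R.add(ds);        // MarkDs, line 17
            if (m.run(ds) == oEv) return m;            // mutation fault found: repair (line 5)
        }
        return r; // no better mutant (line 8)
    }

    static boolean misclassifiesKnown(Automaton m, Set<String> A, Set<String> R) {
        for (String s : A) if (!m.run(s)) return true;
        for (String s : R) if (m.run(s)) return true;
        return false;
    }
}
```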

[Algorithm 3: the MultiDSs testAndRep]

4.2 MultiDSs Approach

The greedy approach returns the mutant m (line 5 of Algorithm 2) as the new repaired regexp if it evaluates the generated string as the oracle does. However, m may be distinguished by other strings, and so changing the regexp could be a too greedy choice. To mitigate this problem, the MultiDSs approach (see Algorithm 3) generates N distinguishing strings (with \(N \ge 2\)) at line 13 (the difference w.r.t. line 13 of Algorithm 2) and, at lines 6–7, takes m as the new repaired regexp only if it correctly evaluates more strings than \( r \) does (the difference w.r.t. lines 4–5 of Algorithm 2).

4.3 Breadth Approach

The approach in Sect. 4.1 and its extension in Sect. 4.2 are both greedy, as they change the regexp under test as soon as a better regexp is found (one correctly evaluating the majority of the generated strings). We here present a less greedy approach: Algorithm 4 shows the breadth-search version of testAndRep. The algorithm evaluates all the mutants of the regexp under test and selects the best one (by function bestCand). If a better regexp is found, it is returned; otherwise, \( r \) is returned.

[Algorithm 4: the breadth testAndRep]

In testAndRep, after the generation of the \( ds \) by EvMut (line 4), if the mutant m correctly evaluates \( ds \), it is stored in a set of possible candidates \( cands \) (line 6). Then, all the previously generated candidates that do not correctly evaluate \( ds \) are removed from \( cands \) (line 8). When all the mutants have been evaluated, if there is no candidate, the algorithm terminates and \( r \) is returned as the final regexp (line 10). Otherwise, a candidate is selected by function bestCand as follows. As long as there is more than one candidate:

  • two candidates \(c_1\) and \(c_2\) are randomly selected (line 18);

  • a \( ds \) is generated for \(c_1\) and \(c_2\) (line 19);

  • if \( ds \) is \( null \) (i.e., \(c_1\) and \(c_2\) are equivalent), \(c_1\) is removed from \( cands \) (line 25); otherwise, the following instructions are executed;

  • the candidates not evaluating \( ds \) correctly are removed from \( cands \) (line 22);

  • \( ds \) is added either to \(A\) or \(R\) using procedure \(\textsc {MarkDs}\) (line 23).

At the end of the while loop, the only surviving candidate (which is guaranteed to exist) is selected as the best candidate and returned as the new repaired regexp (line 28).
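A compact reconstruction of bestCand under the same assumptions (for brevity, the first two candidates are taken instead of two random ones):

```java
import dk.brics.automaton.Automaton;
import java.util.List;
import java.util.Set;

public class Breadth {
    // bestCand (sketch): tournament among the surviving candidates
    static Automaton bestCand(List<Automaton> cands, Conformance.Oracle oracle,
                              Set<String> A, Set<String> R) {
        while (cands.size() > 1) {
            Automaton c1 = cands.get(0), c2 = cands.get(1); // two candidates (line 18)
            String ds = Distinguish.genDs(c1, c2);          // line 19
            if (ds == null) {        // c1 and c2 are equivalent (line 25)
                cands.remove(c1);
                continue;
            }
            boolean oEv = oracle.accepts(ds);
            if (oEv) A.add(ds); else R.add(ds);     // MarkDs (line 23)
            cands.removeIf(c -> c.run(ds) != oEv);  // drop wrong candidates (line 22)
        }
        return cands.get(0); // the surviving candidate (line 28)
    }
}
```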

Fig. 1. Distinguishing automata

4.4 Collecting Approach

The previous approaches always generate a string \( ds \) distinguishing a regexp \( r \) from one of its mutants m; often \( ds \) does not distinguish \( r \) from the other mutants, and further strings must be generated to distinguish them, thus requiring more effort from the user, who must evaluate more strings. The aim of the collecting approach is to generate strings that distinguish as many mutants as possible. A string \( ds \) distinguishes \( r \) from a set of mutants \(M=\{m_1, \ldots , m_n\}\) if \( ds \) is accepted by \( r \) and rejected by every mutant in M, or if \( ds \) is rejected by \( r \) and accepted by every mutant in M; such a distinguishing string is a word of one of these two automata:

$$\begin{aligned} \mathcal {D} ^+ = \mathcal {R} \cap \bigcap _{i=1}^n \mathcal {M}_i^\complement \quad \quad \quad \quad \mathcal {D} ^- =\mathcal {R}^\complement \cap \bigcap _{i=1}^n \mathcal {M}_i \end{aligned}$$

where \(\mathcal {R}\), \(\mathcal {M}_1\), ..., \(\mathcal {M}_n\) are the automata of \( r \), \(m_1\), ..., \(m_n\). We call \(\mathcal {D} ^+\) and \(\mathcal {D} ^-\) the positive and negative distinguishing automata. Figure 1 shows a positive and a negative distinguishing automaton for two mutants \(m_1\) and \(m_2\) of a regexp \( r \).
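With dk.brics.automaton, the two distinguishing automata can be built directly from the formula above. A minimal sketch (Algorithm 5 actually adds mutants one at a time, discarding an addition whenever it would make the automaton empty):

```java
import dk.brics.automaton.Automaton;
import java.util.List;

public class DistinguishingAutomata {
    // D+ : words accepted by r and rejected by all the mutants
    static Automaton dPlus(Automaton r, List<Automaton> muts) {
        Automaton d = r;
        for (Automaton m : muts) d = d.intersection(m.complement());
        return d; // may be empty: check with d.isEmpty()
    }

    // D- : words rejected by r and accepted by all the mutants
    static Automaton dMinus(Automaton r, List<Automaton> muts) {
        Automaton d = r.complement();
        for (Automaton m : muts) d = d.intersection(m);
        return d;
    }
}
```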

[Algorithm 5: the collecting testAndRep]

Algorithm 5 shows the collecting approach, which exploits the definition of distinguishing automaton. It initially selects all the mutants \( muts \) as candidates in \( cands \) (line 2). Then, as long as \( cands \) contains at least one regexp, it performs the following actions:

  • it collects as many regexps as possible using function collect:

    • it randomly initializes a positive or a negative distinguishing automaton \(\mathcal {D}\) (line 19);

    • for each mutant m, it tries to add it to \(\mathcal {D}\) considering the polarity (lines 21–23); \(\mathcal {D}\) is modified only if the addition does not produce an empty automaton (lines 25–26); if the addition has been performed and \( CollTh \) regexps have been collected (\( CollTh \) being a parameter of the approach), it returns \(\mathcal {D}\) (line 27);

    • if, after trying all the candidates, at least one regexp has been collected, it returns \(\mathcal {D}\) (line 31); otherwise, it tries the distinguishing automaton with the opposite polarity (if any);

    • if no regexp can be collected in any distinguishing automaton of any polarity, the empty automaton is returned (line 34).

  • if the returned automaton \(\mathcal {D}\) is empty (meaning that all the candidates are equivalent to \( r \)), \( r \) is returned as the selected regexp (line 6); otherwise, a word is randomly selected from \(\mathcal {D}\) (line 8) and \( cands \) is updated to keep only the regexps that evaluate \( ds \) correctly (lines 9–10);

  • if \( r \) does not evaluate \( ds \) correctly, it is replaced with one mutant randomly selected among those collected in \(\mathcal {D}\) (lines 11–12).

When there are no more candidates, \( r \) is returned as the selected regexp (line 15).

TearRex. We implemented the approach in the tool TearRex (TEst And Repair Regular EXpressions). Figure 2 shows an interaction with the tool. At the beginning, the user inserts the regexp [a-z]*, intending to accept all the strings containing only Latin letters; however, since (s)he misunderstood the semantics of operator * and of character classes, the developed regexp also accepts the empty string and does not accept upper-case Latin letters. The tool asks her/him to evaluate two strings, “a” and “A”, and (s)he assesses that both strings should be accepted. Since the developed regexp does not accept “A”, the tool changes it into a mutant that also accepts “A”. The process continues by evaluating the strings “0”, “z”, and “AA”, which are all correctly evaluated. The empty string “”, instead, is wrongly evaluated, and the tool changes the regexp into a more correct version that does not accept it. Since no possible mutation fault exists between this regexp and its mutants, it is returned as the final regexp.

Fig. 2. TearRex: testing and repairing of [a-z]* with its oracle

5 Experiments

The websites http://www.regular-expressions.info/ and http://www.regexlib.com report, for different matching tasks (e.g., email addresses, credit cards, social security numbers, etc.), some regexps implementing them. To build the benchmark set Bench, we considered 20 tasks and, for each task, we selected two regexps, one acting as the initial (possibly faulty) regexp and the other one as the oracle; for 19 couples the initial regexp is indeed faulty, while in one couple the initial regexp is correct. The initial regexps are between 17 and 279 characters long (60 on average) and have between 7 and 112 operators (24.45 on average).

Note that, in our approach, the oracle should be the user, who is able to assess the evaluation of strings; in the experiments, we use another regexp as oracle for the sake of automation. However, in RQ2 we estimate the burden imposed on the user by the evaluation of the generated strings during the repair process.

We ran the approaches Greedy, MultiDSs3 (MultiDSs with \(N=3\)), Breadth, and Coll5 (collecting with \( CollTh =5\)) on the selected regexps, fixing the maximum number of evaluations MaxEval to 10, 30, 100, and 200 strings.

Experiments have been executed on a Linux machine with an Intel(R) Core(TM) i7 and 4 GB of RAM. All the reported data are averages over 10 runs of the experiments.

Table 1. Experiment results for Bench (E: avg # evaluations, T: avg time (secs), R: repaired (%), CR: completely repaired (%), R\(^\prime \): avg # \( r ^\prime \))

Tables 1a and 1b report the experimental results aggregated by type of repair process and by maximum number of evaluations MaxEval. They both report the average number E of strings generated and shown to the user for evaluation, the average process time T, the percentage R of regexps that have been repaired (i.e., the final regexp has a lower failure index than the initial one), and the percentage CR of regexps that have been completely repaired (i.e., all the faults have been removed). Table 1a also reports the average number R\(^\prime \) of regexps \( r ^\prime \) that are changed during the process. The cells in gray highlight the best results. We evaluate our approach through a series of research questions.

[RQ1]

Since the correct evaluation of the generated strings must be assessed by the user, we want to keep their number low. Table 1a (column E) shows that the four approaches generate, on average, between 41 and 53 strings; this limited number of strings allows manual evaluation by the user. MultiDSs3 generates more strings than Greedy, as three strings are generated for each mutant (when possible). Breadth generates slightly fewer strings than Greedy because, on average, the third changed regexp is the final one (see column R\(^\prime \)); Greedy changes 20 regexps on average, thus producing more strings. Coll5 also produces few strings, as it directly generates strings for sets of mutants.

Table 1b (column E) shows that a higher maximum number of evaluations allows more strings to be generated. However, with MaxEval equal to 10 and 30, the number of generated strings is close to MaxEval itself, while with MaxEval equal to 100 and 200 the number of strings is much lower than the limit: this means that, on average, the process requires fewer than 100 strings to terminate.

[RQ2]

Table 1a (column T) shows that MultiDSs3 is twice as slow as Greedy, as it generates more strings. Breadth is more than three times slower than Greedy; a possible reason is that, although Greedy generates more strings than Breadth, it creates far fewer mutants (mutant creation being a costly operation). Coll5 is the slowest one, as automata intersection (see lines 22 and 23 in Algorithm 5) is a costly operation. Table 1b shows that increasing the limit on evaluated strings also increases the time (as the process can continue the search).

All the experiments show that, regardless of the process configuration, the process is rather fast. However, the total time should be computed as \(\textit{pTime} + |A \cup R | \times \textit{eTime} \), where \(\textit{pTime}\) is the time of the generation and repair process (the one we measured), and \(\textit{eTime}\) is the time taken by the user to evaluate a string. For example, assuming 5 s per string as \(\textit{eTime}\), the average times would be: 4 min for Greedy, 4.5 min for MultiDSs3, 3.5 min for Breadth, and 3.6 min for Coll5. This means that the whole process, including the user's evaluations, is feasible. Note that Greedy and MultiDSs3 would be equivalent in time (break-even) to Breadth and Coll5 only if the user could evaluate a string in less than 0.2 s, which seems impossible; so, Breadth and Coll5 are always advantageous in terms of time when the user's effort is taken into account.
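As a worked instance of the formula, taking Greedy as an example (with \(\textit{pTime}\) of a few seconds and roughly 48 strings to evaluate, consistent with column E of Table 1a):

$$\begin{aligned} \textit{pTime} + |A \cup R | \times \textit{eTime} \approx \textit{pTime} + 48 \times 5\,\mathrm{s} \approx 240\,\mathrm{s} = 4\,\mathrm{min} \end{aligned}$$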

[RQ3]

We are here interested in assessing whether the proposed approach is actually able to repair faulty regexps. We can assess whether a starting regexp \( r \) has been completely repaired by checking the equivalence between the final regexp \( r ^\prime \) and the oracle \(o_{ r }\). For regexps that are not completely repaired, we introduce the measure \(F_{x}\), counting the number of strings that a regexp x wrongly accepts or rejects, i.e., that are misclassified. As the number of misclassified strings is possibly infinite, we restrict the length of the considered strings to n. Letting \(\mathcal {L}^{n}(x) = \mathcal {L}(x) \cap {\varSigma }^{n}\), \(F^n_x\) is defined as follows:

$$\begin{aligned} F^n_x = |\mathcal {L}^{n}(x) \oplus \mathcal {L}^{n}(o_{x}) | \end{aligned}$$

In order to know whether the repaired regexp \( r ^\prime \) is better than \( r \), we can compute \({\varDelta }F = F^n_{ r ^\prime } - F^n_{ r }\) for a fixed n. In the experiments, we set n to 20, i.e., we count the misclassified strings of length up to 20. If \({\varDelta }F<0\) or the final regexp is completely repaired, the regexp under test is considered repaired; otherwise, the process did not remove any fault (or removed and introduced faults in equal measure).
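As a minimal sketch, \(F^n_x\) can be computed with dk.brics.automaton by counting, length by length, the words of the symmetric difference between the regexp and its oracle (we read the length bound as "up to n", consistently with the experiments; class and method names are ours):

```java
import dk.brics.automaton.Automaton;
import dk.brics.automaton.RegExp;
import dk.brics.automaton.State;
import dk.brics.automaton.Transition;
import java.math.BigInteger;
import java.util.HashMap;
import java.util.Map;

public class FailureIndex {
    // F^n_x: number of misclassified strings of length up to n
    static BigInteger fMeasure(String x, String oracle, int n) {
        Automaton ax = new RegExp(x).toAutomaton();
        Automaton ao = new RegExp(oracle).toAutomaton();
        Automaton symDiff = ax.minus(ao).union(ao.minus(ax));
        return countWords(symDiff, n);
    }

    // Count the words of length <= n accepted by the automaton.
    static BigInteger countWords(Automaton a, int n) {
        a.determinize(); // in a DFA, each word corresponds to exactly one path
        Map<State, BigInteger> cur = new HashMap<>();
        cur.put(a.getInitialState(), BigInteger.ONE);
        BigInteger total = BigInteger.ZERO;
        for (int len = 0; len <= n; len++) {
            // words of the current length ending in an accepting state
            for (Map.Entry<State, BigInteger> e : cur.entrySet())
                if (e.getKey().isAccept()) total = total.add(e.getValue());
            // advance one character; each transition covers a range of characters
            Map<State, BigInteger> next = new HashMap<>();
            for (Map.Entry<State, BigInteger> e : cur.entrySet())
                for (Transition t : e.getKey().getTransitions()) {
                    BigInteger width = BigInteger.valueOf(t.getMax() - t.getMin() + 1);
                    next.merge(t.getDest(), e.getValue().multiply(width), BigInteger::add);
                }
            cur = next;
        }
        return total;
    }
}
```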

From Table 1a, we can see that the processes repair (column R) between 72% and 85% of the regexps, but completely repair (column CR) only a small fraction of them (between 2.92% and 4.58%). Thus, the proposed techniques are not reliable in finding the completely correct regexp, but they are very effective in removing some faults. However, some regexps are not repaired. This can be due to two reasons. First, the new changed regexp \( r ^\prime \) behaves better than the original \( r \) on the strings that are generated and tested, but fails to correctly accept/reject other strings that are not tested, so \({\varDelta }F\) is actually greater than 0. Second, we may be trying to repair a regexp \( r \) that is too far from the oracle, so that no mutation of \( r \) is better than \( r \) itself; indeed, when selecting the benchmarks, we did not assume the competent programmer hypothesis, so a regexp and its oracle may be very different. In RQ5, we evaluate how the results change when the competent programmer hypothesis holds.

Table 1b (columns R and CR) shows that increasing the maximum number of evaluations permits (completely) repairing more regexps.

[RQ4]
Table 2. Experiment results for MutBench (E: avg # evaluations, T: avg time (secs), R: repaired (%), CR: completely repaired (%), R\(^\prime \): avg # \( r ^\prime \))

From Table 1a, we observe that MultiDSs3 under-performs in all the measures. Breadth and Coll5, instead, are better approaches. Breadth repairs 7.5% more regexps than Greedy while using fewer strings. Although Breadth is more than three times slower than Greedy, we saw in RQ2 that the program time (without string evaluation by the user) is negligible w.r.t. the evaluation time of the user: therefore, in order to contain the total time of the process, it makes sense to keep the number of generated strings limited. Coll5 is the approach that completely repairs the most regexps, although it does not behave as well on the (partially) repaired ones: this probably means that the strings generated over the collection are good at driving towards correct candidates, but also that Coll5 is not incremental enough and tends not to choose regexps that are only slightly better.

[RQ5]

When building Bench, we did not make any assumption, and most of the selected couples of regexps are very different. However, the assumption commonly made in fault-based approaches is the competent programmer hypothesis, i.e., that the programmer defined the regexp close to the correct one (with only one or a few of the syntactic faults defined by the fault classes [2]). We here evaluate the performance of the approach when the competent programmer hypothesis holds. We built a second benchmark set, MutBench, as follows. We took the oracles of Bench and randomly modified them by introducing n faults (with \(n = 1, \ldots , 3\)), obtaining three faulty versions of each oracle; we thereby added 60 couples of regexps to MutBench. These regexps are between 43 and 302 characters long (108.15 on average) and contain between 11 and 63 operators (25.52 on average). We then applied our approach to MutBench. Table 2 reports the aggregated results of the experiments performed over the regexps in MutBench. Most of the observations we made for Bench are still valid for MutBench, although there are some interesting differences. First of all, from Table 2a, we observe that Greedy now behaves as well as Breadth in terms of repaired regexps (actually slightly better): this means that, if the syntactic faults of the regexp under test are those identified by our fault classes, applying a greedy approach that changes the regexp as soon as a fault is found pays off. Moreover, for both approaches, the numbers of evaluated strings and of changed regexps are lower w.r.t. Bench: this means that the approaches converge faster to the final solution, as they are able to apply the correct mutations that remove the syntactic faults. MultiDSs3 is now the best approach in terms of repaired regexps: this probably means that generating more strings than Greedy to reinforce the decision makes sense only if there are few faults; otherwise, the additional strings are useless.

Remark

A threat to external validity is that the obtained results may not generalize to all regexps. To mitigate this threat, we tried to select the most diverse regexps, performing different tasks; moreover, to evaluate the approach in the worst-case scenario, we did not assume the competent programmer hypothesis. We saw in RQ5 that, if the hypothesis holds (as assumed by fault-based approaches), the performance of the approach improves.

6 Related Work

As far as we know, no approach exists to repair regexps. The approaches proposed in the literature mainly focus on regexp testing or synthesis.

Regarding test generation of labeled strings, MutRex  [2] is an open source tool able to generate fault-detecting strings. We exploit its mutation operators and its string generation facility (i.e., function \(\mathtt {genDs}\)).

Another test generator is EGRET [6], which generates evil strings that should be able to expose errors commonly made by programmers. EGRET takes a regexp as input and generates both accepted and rejected strings. The user can estimate the regexp correctness by evaluating these strings and identifying those that are wrongly classified. As in our approach, the user is the oracle. EGRET also applies mutation, but on the strings accepted by the regexp under test, not on the regexp itself as in our case. The advantage of the strings we use in our approach is that, once we detect a failure, we also know how to remove the corresponding syntactic fault in the regexp; instead, using the strings generated by EGRET would leave open the problem of localizing the syntactic fault.

Another tool that can be used for labeled string generation is Rex [13], a solver of regexp constraints. Rex translates a given regexp into a symbolic finite automaton; the Z3 SMT solver is used for satisfiability checking. Since Z3 is able to generate a model as a witness of the satisfiability check, Rex can be used to build strings accepted by the regexp.

Reggae [7] is a tool based on dynamic symbolic execution that generates string inputs that are accepted by a regexp. Reggae aims at achieving branch coverage of programs containing complex string operations.

Several other tools for testing regexps exist, such as EXREX, Generex, and regldg. However, they are based on exhaustive or random generation of strings matching a given regexp, and the strings they generate are not useful for repairing regexps.

A different use of labeled strings is the synthesis of regexps. ReLIE [8] is a learning algorithm that, starting from an initial regexp and a set of labeled strings, tries to learn the correct regexp. It performs some regexp transformations (a kind of regexp mutation); however, no definition of fault classes is given. Our approach could be adapted for regexp synthesis as well.

Our approach has some similarities with automatic software repair. The automatic repair of software requires an oracle: in our approach, the oracle is the user, while in software repair the oracle is usually specified using test suites [3, 14], pre- and post-conditions [11], etc. Moreover, such approaches also identify techniques to repair the software when a fault is detected: in [9], for example, some repair actions similar to our mutation operators are proposed.

Automatic repair has also been proposed for specifications, for example, the automatic repair of feature models describing software product lines. The approach in [5] applies a cycle of test-and-fix to a feature model in order to remove its wrong constraints; it uses configurations derived both from the model and from the real system and checks whether these are correctly evaluated by the feature model. The approach is similar to ours in alternating testing and fixing (the latter similar to our repair phase); however, since the evaluation of configurations is done automatically on the system, it can produce many configurations, while in our approach we need to keep the number of generated labeled strings limited, as these must be assessed by the user.

The aim of our work is similar to that of [1], in which a student tries to learn an unknown regular set S by posing two types of queries to a teacher. In a membership query, the student gives a string t and the teacher tells whether it belongs to S or not. In a conjecture query, the student provides a regular set \(S^\prime \) and the teacher answers yes if \(S^\prime \) corresponds to S, or answers with a wrong string t (similar to our distinguishing string) otherwise. In our approach, the user plays the role of the teacher only for the first kind of query, not for the second (if (s)he could, (s)he would also be able to write the correct regexp). Our tool, instead, plays the role of the student by posing membership queries.

7 Conclusions

The paper presents an approach able to detect and remove conformance faults in a regular expression, i.e., faults that make a regular expression wrongly accept or reject some strings. The approach consists of an iterative process composed of a testing phase, in which the user who wrote the regular expression is asked to assess the correct evaluation of some strings that distinguish the regular expression from its mutants, and a repair phase, in which the mutant correctly evaluating all the generated strings is taken as the new version of the regular expression. The approach terminates when it is no longer possible to repair the regular expression under test. Experiments show that the approach is indeed able to remove conformance faults from regular expressions, in particular when the competent programmer hypothesis holds, i.e., the user made only small syntactic faults such as those described by the fault classes proposed in the literature.

In the experiments, we performed different evaluations regarding the effect of the process configuration on the final results, without assessing their statistical significance. As future work, we plan to perform experiments on a wider set of regular expressions and to conduct statistical tests to assess the significance of the drawn conclusions.