1 Introduction

Code change patterns have various uses in the software engineering domain. They are notably used for labeling changes (Pan et al. 2009), triaging developer commits (Tian et al. 2012), or predicting changes (Ying et al. 2004). In recent years, fix patterns have been heavily leveraged in the software maintenance community, notably for building patch generation systems, which now attract growing interest in the literature (Monperrus 2018). Automated Program Repair (APR) has indeed gained considerable momentum, and various approaches (Nguyen et al. 2013; Weimer et al. 2009; Le Goues et al. 2012a; Kim et al. 2013; Coker and Hafiz 2013; Ke et al. 2015; Mechtaev et al. 2015; Long and Rinard 2015, 2016; Le et al. 2016a, b, 2017; Chen et al. 2017; Long et al. 2017; Xuan et al. 2017; Xiong et al. 2017; Jiang et al. 2018; Wen et al. 2018; Hua et al. 2018; Liu et al. 2019a, b) have been proposed, aiming at reducing manual debugging efforts by automatically generating patches. A common and reliable strategy in automated program repair is to generate concrete patches based on fix patterns (Kim et al. 2013) (also referred to as fix templates (Liu and Zhong 2018) or program transformation schemas (Hua et al. 2018)). Several APR systems (Kim et al. 2013; Saha et al. 2017; Durieux et al. 2017; Liu and Zhong 2018; Hua et al. 2018; Martinez and Monperrus 2018; Liu et al. 2019a, b) in the literature implement this strategy by using diverse sets of fix patterns obtained either via manual generation or automatic mining of bug fix datasets.

In PAR (Kim et al. 2013), the authors mined fix patterns by manually inspecting 60,000 developer patches. Similarly, for Relifix (Tan and Roychoudhury 2015), fix patterns were inferred through a manual inspection of 73 real software regression bug fixes. Manual mining is, however, tedious, error-prone, and cannot scale. Thus, to overcome the limitations of manual pattern inference, several research groups have initiated studies towards automatically inferring bug fix patterns. With Genesis (Long et al. 2017), Long et al. proposed to automatically infer code transforms for patch generation. Genesis infers 108 code transforms, from a space of 577 sampled transforms, with specific code contexts. However, this work limits the search space to previously successful patches from only three classes of defects in Java programs: null pointer, out of bounds, and class cast related defects.

Liu and Zhong (2018) proposed SOFix, which explores fix patterns for Java programs from Q&A posts on Stack Overflow: it mines patterns based on GumTree (Falleri et al. 2014) edit scripts and builds categories based on repair pattern isomorphism. SOFix then mines a repair pattern from each category. However, the authors note that most of the categories are redundant or even irrelevant, mainly due to two major issues: (1) a considerable portion of code samples are designed for purposes other than repairing bugs; and (2) since the underlying GumTree tool relies on structural positions to extract modifications, these “modifications do not present the desirable semantic mappings”. They relied on heuristics to manually filter categories (e.g., categories that contain several modifications), and after SOFix mines repair patterns, they still have to manually select useful ones (e.g., merging some repair patterns due to their similar semantics).

Liu et al. (2018a) and Rolim et al. (2018) proposed to mine fix patterns from static analysis violations reported by FindBugs and PMD, respectively. Both approaches leverage a similar methodology in the inference process. Rolim et al. (2018) rely on the distance between edit scripts: edit scripts with low distances between them are grouped together according to a defined similarity threshold. Liu et al. (2018a), on the other hand, leverage deep learning to learn features of edit scripts and find clusters of similar edit scripts. Neither work, however, considers code context in the edit scripts, and both manually derive the fix patterns from the clusters of similar patch edit scripts.

In another vein, CapGen (Wen et al. 2018) and SimFix (Jiang et al. 2018) propose to use the frequency of code change actions: the former uses it to drive patch selection, while the latter uses it to compute donor code similarity for patch prioritization. In both cases, however, the notion of patterns is not an actionable artefact, but rather supplementary information that guides their patch generation systems. Although we concurrently share with SimFix and CapGen the idea of adding more contextual information for patch generation, our objective is to infer actionable fix patterns that are tractable and reusable as input to other APR systems.

Table 1 presents an overview of the different automated mining strategies implemented in the literature to obtain diverse sets of fix patterns. Some of the strategies are directly presented as part of APR systems, while others are independent approaches. We characterize the different strategies by considering the diff representation format, the use of contextual information, the tractability of patterns (i.e., to what extent they are separate and reusable components in patch generation systems), and the scope of mining (i.e., whether the scope is limited to specific code changes). Overall, although the literature approaches can come in handy for discovering diverse sets of fix patterns, the intractability of the fix patterns and the limited generalizability of the mining strategies remain a challenge for deriving relevant patterns for program repair.

This paper.:

We propose to investigate the feasibility of mining relevant fix patterns that can be easily integrated into an automated pattern-based program repair system. To that end, we propose an iterative, three-fold clustering strategy, FixMiner, to automatically discover relevant fix patterns from atomic changes within real-world developer fixes. FixMiner is a pattern mining approach that produces fix patterns for program repair systems. We present in this paper the concept of Rich Edit Script, a specialized tree data structure of the edit scripts that captures the AST-level context of code changes. To infer patterns, FixMiner leverages identical trees, which are computed, for each round of the iteration, based on the following information encoded in Rich Edit Scripts: the shape of the affected AST (context), the edit actions, and the code tokens.

Contribution.:

We propose the FixMiner pattern mining tool as a separate and reusable component that can be leveraged in other patch generation systems.

Paper content.:

Our contributions are:

  • We present the architecture of a pattern inference system, FixMiner, which builds on a three-fold clustering strategy where we iteratively discover similar changes based on different tree representations encoding contexts, change operations and code tokens.

  • We assess the capability of FixMiner to discover patterns by mining fix patterns among 11 416 patches addressing user-reported bugs in 43 open source projects. We further relate the discovered patterns to those that can be found in a dataset used by the program repair community (Just et al. 2014). We assess the compatibility of FixMiner patterns with patterns in the literature.

  • Finally, we investigate the relevance of the mined fix patterns by embedding them as part of an Automated Program Repair system. Our experimental results on the Defects4J benchmark show that our mined patterns are effective for fixing 26 bugs. We find that the FixMiner patterns are relevant as they lead to generating plausible patches that are mostly correct.

Table 1 Comparison of fix pattern mining techniques in the literature

2 Motivation

Mining, enumerating and understanding code changes have been key challenges of software maintenance in recent years. Ten years ago, Pan et al. contributed a manually compiled catalog of 27 code change patterns related to bug fixing (Pan et al. 2009). Such “bug fix patterns”, however, are generic patterns (e.g., IF-RMV: removal of an If Predicate) which represent the types of changes that often fix bugs. More recently, thanks to the availability of new AST differencing tools, researchers have proposed to automatically mine change patterns (Martinez et al. 2013; Osman et al. 2014; Oumarou et al. 2015; Lin et al. 2016). Such patterns have mostly been leveraged for analysing and understanding the characteristics of bug fixes. In practice, however, the inferred patterns may turn out to be irrelevant and intractable.

We argue, however, that mining fix patterns can help guide mutation operations for patch generation. In this case, there is a need to mine truly recurrent change patterns to which repair semantics can be attached, and to provide accurate, fine-grained patterns that are actionable in practice, i.e., separate and reusable as inputs to other processes.

Our intuition is that relevant patterns cannot be mined globally since bug fixes in the wild are subject to noisy details due to tangled changes (Herzig and Zeller 2013). There is thus a need to break patches into atomic units (contiguous code lines forming a hunk) and reason about the recurrences of the code changes among them. To mine changes, we propose to rely on the edit script format, which provides a fine-grained representation of code changes, where different layers of information are included:

  • the context, i.e., the AST node type of the code element being changed (e.g., a modifier in a declaration statement should not be generalized to other types of statements);

  • the change operation (e.g., a “remove then add” sequence should not be confused with “add then remove” as it may have a distinct meaning in a hierarchical model such as the AST);

  • and code tokens (e.g., changing calls to “Log.warn” should not be confused with changes to any other API method).

Our idea is to iteratively find patterns within the contexts, and patterns of change operations for each context, and patterns of recurrently affected literals in these operations.

We now provide background information for understanding the execution as well as the information processed by FixMiner.

2.1 Abstract Syntax Tree

Code representation is an essential step in the analysis and verification of programs. Abstract syntax trees (ASTs), which are generally produced for program analysis and transformations, are data structures that provide an efficient form of representing program structures to reason about syntax and even semantics. An AST indeed represents all of the syntactical elements of the programming language and focuses on the rules rather than elements like braces or semicolons that terminate statements in some popular languages like Java or C. The AST is a hierarchical representation where the elements of each programming statement are broken down recursively into their parts. Each node in the tree thus denotes a construct occurring in the programming language.

Formally, let t be an AST and N be the set of AST nodes in t. An AST t has a root node referred to as \(root(t) \in N\). Each node \(n \in N\) (with \(n \neq root(t)\)) has a parent denoted as \(parent(n) = p \in N\); the root \(root(t)\) has no parent. Furthermore, each node n has a set of child nodes (denoted as \(children(n) \subset N\)). A label l (i.e., AST node type) is assigned to each node from a given alphabet L (\(label(n) = l \in L\)). Finally, each node has a string value v (\(token(n) = v\), where \(n \in N\) and v is an arbitrary string) representing the corresponding raw code token. Consider the AST representation in Fig. 2 of the Java code in Fig. 1. The illustrated AST has nodes with labels matching structural elements of the Java language (e.g., MethodDeclaration, IfStatement or StringLiteral), which can be associated with values representing the raw tokens in the code (e.g., the node labelled StringLiteral in our AST is associated with the value “Hi!”).

Fig. 1
figure 1

Example Java class
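Fig. 1 itself is not reproduced in this version of the text. The following minimal class is a hedged reconstruction that is consistent with the elements the text mentions (a method declaration, an if statement, and a StringLiteral with value "Hi!"); the exact code of the original figure may differ.

```java
public class HelloWorld {
    public static void main(String[] args) {
        // An IfStatement and a StringLiteral ("Hi!") as referenced in Section 2.1
        if (args.length > 0) {
            System.out.println("Hi!");
        }
    }
}
```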

Fig. 2
figure 2

AST representation of the Helloworld class

2.2 Code differencing

Differencing two versions of a program is the key pre-processing step of all studies on software evolution. The evolved parts must be captured in a way that makes it easy for developers to understand or analyze the changes. Developers generally deal well with text-based differencing tools such as GNU Diff, which represents changes as additions and removals of source code lines, as shown in Fig. 3. The main issue with this text-based differencing is that it does not provide a fine-grained representation of the change (i.e., a StringLiteral replacement in this case), and it is thus poorly suited for systematically analysing changes.

Fig. 3
figure 3

GNU diff format

To address the challenges of code differencing, recent algorithms have been proposed based on tree structures (such as the AST). ChangeDistiller and GumTree are examples of such algorithms which produce edit scripts that detail the operations to be performed on the nodes of a given AST (as formalized in Section 2.1) to yield another AST corresponding to the new version of the code. In particular, in this work, we build on GumTree’s core algorithms for preparing an edit script. An edit script is a sequence of edit actions describing the following code change actions:

  • UPD where an upd(n, v) action transforms the AST by replacing the old value of an AST node n with the new value v.

  • INS where an ins(n, np, i, l, v) action inserts a new node n with v as value and l as label. If the parent np is specified, n is inserted as the ith child of np, otherwise n is the root node.

  • DEL where a del(n) action removes the leaf node n from the tree.

  • MOV where a mov(n, np, i) action moves the subtree having node n as root to make it the ith child of a parent node np.

An edit action embeds information about the node (i.e., the relevant node in the whole AST of the parsed program), the operator (i.e., UPD, INS, DEL, or MOV) which describes the action performed, and the raw tokens involved in the change.
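For illustration, the following sketch computes such an edit script programmatically with the GumTree library. The class and method names below follow the GumTree 2.1.x API and may differ in other versions; the file names are placeholders.

```java
import com.github.gumtreediff.actions.ActionGenerator;
import com.github.gumtreediff.actions.model.Action;
import com.github.gumtreediff.client.Run;
import com.github.gumtreediff.gen.Generators;
import com.github.gumtreediff.matchers.Matcher;
import com.github.gumtreediff.matchers.Matchers;
import com.github.gumtreediff.tree.TreeContext;
import java.util.List;

public class EditScriptDemo {
    public static void main(String[] args) throws Exception {
        Run.initGenerators(); // register the bundled language parsers
        TreeContext before = Generators.getInstance().getTree("Before.java");
        TreeContext after = Generators.getInstance().getTree("After.java");
        // Map the nodes of the two ASTs onto each other
        Matcher matcher = Matchers.getInstance().getMatcher(before.getRoot(), after.getRoot());
        matcher.match();
        // Derive the edit script (UPD/INS/DEL/MOV actions) from the mapping
        ActionGenerator generator =
                new ActionGenerator(before.getRoot(), after.getRoot(), matcher.getMappings());
        generator.generate();
        List<Action> editScript = generator.getActions();
        editScript.forEach(System.out::println);
    }
}
```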

2.3 Tangled code changes

Solving a single problem per patch is often considered a best practice to facilitate maintenance tasks. However, patches in real-world projects often address multiple problems at once (Tao and Kim 2015; Koyuncu et al. 2017). Developers often commit bug fixing code changes together with changes unrelated to the fix, such as functionality enhancements, feature requests, refactorings, or documentation. Such patches are called tangled patches (Herzig and Zeller 2013) or mixed-purpose fixing commits (Nguyen et al. 2013). Nguyen et al. found that 11% to 39% of all the fixing commits used for mining archives were tangled (Nguyen et al. 2013).

Consider the example patch from GWT illustrated in Fig. 4. The patch is intended to fix an issue that reported a failure in some web browsers when the page is served with a certain mime type (i.e., application/xhtml+xml). The developer fixes the issue by showing a warning when such a mime type is encountered. However, in addition to this change, a typo has been addressed in the commit. Since the typo is not related to the fix, the fixing commit is tangled. There is thus a need to separately consider single code hunks within a commit to allow the pattern inference to focus on finding recurrent atomic changes that are relevant to bug fixing operations.

Fig. 4
figure 4

Tangled commit

3 Approach

FixMiner aims to discover relevant fix patterns from the atomic changes within bug fixing patches in software repositories. To that end, we mine code changes that are similar in terms of context, operations, and the programming tokens that are involved. Figure 5 illustrates an overview of the FixMiner approach.

Fig. 5
figure 5

The FixMiner Approach. At each iteration, the search index is refined, and the computation of tree similarity is specialized in specific AST information details

3.1 Overview

In Step 0, as an initial step, we collect the relevant bug-fixing patches (cf. Definition 1) from project change tracking systems. Then, in Step 1, we compute a Rich Edit Script representation (cf. Section 3.3) to describe a code change in terms of the context, the operations performed, and the tokens involved. Accordingly, we consider three specialized tree representations of the Rich Edit Script (cf. Definition 2) carrying information about either the impacted AST node types, the repair actions performed, or the program tokens affected. FixMiner works in an iterative manner, considering a single specialized tree representation in each pattern mining iteration, to discover similar changes: first, changes affecting the same code context (i.e., identical abstract syntax trees) are identified; then, among those identified changes, changes using the same actions (i.e., identical sequences of operations) are regrouped; and finally, within each group, changes affecting the same token sets are mined. FixMiner thus implements a three-fold strategy, carrying out the following steps in each pattern mining iteration:

  • Step 2: We build a search index (cf. Definition 3) to identify the Rich Edit Scripts that must be compared.

  • Step 3: We detect identical trees (cf. Definition 4) by computing the distance between two representations of Rich Edit Scripts.

  • Step 4: We regroup identical trees into clusters (cf. Definition 5).

The initial pattern mining iteration uses the Rich Edit Scripts computed in Step 1 as its input, whereas the following rounds use the clusters of identical trees yielded in Step 4 as their input.

In the following sections, we present the details of Steps 1-4, considering that a dataset of bug fix patches is available.

3.2 Step 0 – patch collection

Definition 1

(Patch) A program patch is a transformation of a program into another program, usually to fix a defect. Let \(\mathbb{P}\) be a set of programs; a patch is represented by a pair (\(p, p^{\prime}\)), where \(p, p^{\prime} \in \mathbb{P}\) are the programs before and after applying the patch, respectively. Concretely, a patch implements changes in code block(s) within source code file(s).

To identify bug fix patches in software repositories, we build on the bug linking strategies implemented in the Jira issue tracking software. We use an approach similar to the ones proposed by Fischer et al. (2003) and Thomas et al. (2013) in order to link commits to relevant bug reports. Concretely, we crawl the bug reports for a given project and assess the links with a two-step search strategy: (i) we check project commit logs to identify bug report IDs and associate the corresponding bug reports to commits; then (ii) we check that the linked reports are indeed bug reports (i.e., tagged as “BUG”), are marked as resolved (i.e., with tags “RESOLVED” or “FIXED”), and are completed (i.e., with status “CLOSED”).

We further curate the patch set by considering bug reports that are fixed by a single commit. This provides more guarantees that the selected commits are indeed fixing the bugs in a single shot (i.e., the bug does not require supplementary patches (Park et al. 2012)). Eventually, we consider only changes that are made on the source code files: changes on configuration, documentation, or test files are excluded.
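As an illustration of step (i), commit messages can be scanned for Jira issue keys with a simple pattern. This is a hedged sketch; the exact linking heuristics used in the actual implementation may differ.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class BugLinker {
    // Jira issue keys have the form PROJECTKEY-NUMBER, e.g., "LANG-1234".
    static final Pattern ISSUE_KEY = Pattern.compile("\\b([A-Z][A-Z0-9]+-\\d+)\\b");

    // Returns the first bug report ID referenced in a commit log message, if any.
    static Optional<String> linkedIssue(String commitMessage) {
        Matcher m = ISSUE_KEY.matcher(commitMessage);
        return m.find() ? Optional.of(m.group(1)) : Optional.empty();
    }
}
```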

3.3 Step 1 – Rich Edit Script computation

Definition 2

(Rich Edit Script) A Rich Edit Script \(r \in RE\) represents a patch as a specialized tree of changes. This tree describes which operations are made on a given AST, associated with the code block before patch application, to transform it into another AST, associated with the code block after patch application: i.e., \(r: \mathbb{P} \rightarrow \mathbb{P}\). Each node in the tree is an AST node affected by the patch. Every node in a Rich Edit Script carries three different types of information: Shape, Action, and Token.

A bug-fix patch collected from open source change tracking systems is represented in the GNU diff format, based on additions and removals of source code lines, as shown in Fig. 6. This representation is not suitable for fine-grained analysis of changes.

Fig. 6
figure 6

Patch fixing bug Closure-93 in the Defects4J dataset

To accurately reflect the change that has been performed, several algorithms have been proposed based on tree structures (such as the AST) (Bille 2005; Pawlik and Augsten 2011; Chawathe et al. 1996; Hashimoto and Mori 2008; Duley et al. 2012; Fluri et al. 2007; Falleri et al. 2014). ChangeDistiller (Fluri et al. 2007) and GumTree (Falleri et al. 2014) are state-of-the-art examples of such algorithms, which produce edit scripts detailing the operations to be performed on the nodes of a given AST in order to yield another AST corresponding to the new version of the code. In particular, in this work, we selected the GumTree AST differencing tool, which has recently gained momentum in the literature for computing edit scripts. GumTree is claimed to build, in a fast, scalable and accurate way, the sequence of AST edit actions (a.k.a. the edit script) between the two associated AST representations (the buggy and fixed versions) of a given patch.

Consider the example edit script computed by GumTree for the patch of the Closure-93 bug from Defects4J, illustrated in Fig. 7. The patch fixes the wrong variable declaration of indexOfDot, which was due to a wrong method reference (indexOf was used where lastIndexOf was intended) on a java.lang.String object. The GumTree edit script summarizes the change as an update operation on an AST simple name node (i.e., an identifier other than a keyword) that modifies the identifier label (from indexOf to lastIndexOf).

Fig. 7
figure 7

GumTree edit script corresponding to Closure-93 bug fix patch represented in Fig. 6

Although the GumTree edit script is accurate in describing the bug fix operation at a fine-grained level, much of the contextual information describing the intended behaviour of the patch is missing. The information regarding the method invocation (on a java.lang.String object), the variable declaration fragment which assigns the value of the method invocation to indexOfDot, as well as the type information (int for indexOfDot – cf. Fig. 6) that is implied in the variable declaration statement, are all missing from the GumTree edit script. Since such contextual information is lost, the yielded edit script fails to convey the full syntactic and semantic meaning of the code change.

To address this limitation, we propose to enrich GumTree-yielded edit scripts by retaining more contextual information. To that end, we construct a specialized tree structure of the edit scripts which captures the AST-level context of the code change. We refer to this specialized tree structure as Rich Edit Script. A Rich Edit Script is computed as follows:

Given a patch, we start by computing the set of edit actions (edit script) using GumTree, where the set contains an edit action for each contiguous group of code lines (hunks) changed by the patch. In order to capture the context of the change, we re-organize the edit actions under new minimal AST subtrees, building an AST hierarchy. For each edit action in an edit script, we extract a minimal subtree from the original AST which has the GumTree edit action as its leaf node, and one of the following predefined node types as its root node: TypeDeclaration, FieldDeclaration, MethodDeclaration, SwitchCase, CatchClause, ConstructorInvocation, SuperConstructorInvocation or any Statement node. The objective is to limit the scope of the context to the encompassing statement, instead of going backwards up to the compilation unit (cf. Fig. 2). We limit the scope of parent traversal mainly for two reasons: first, the pattern mining must focus on the program context that is relevant to the change; second, the program repair approaches that FixMiner is built for generally target statement-level fault localization and patch generation.

Consider the AST differencing tree presented in Fig. 8. From this diff tree, GumTree yields the leaf nodes (gray) of edit actions as the final edit script. To build the Rich Edit Script, we follow these steps:

  i) For each GumTree-produced edit action, we remap it to the relevant node in the program AST;

  ii) Then, starting from the GumTree edit action nodes, we traverse the AST tree of the parsed program from bottom to top until we reach a node of a predefined root node type.

  iii) For every predefined root node that is reached, we extract the AST subtree spanning from the discovered predefined root node down to the leaf nodes mapped to the GumTree edit actions.

  iv) Finally, we create an ordered sequence of these extracted AST subtrees and store it as the Rich Edit Script.

Fig. 8
figure 8

Illustration of subtree extraction

Concretely, with respect to our running example, consider the case of Closure-93 illustrated in Fig. 6. The construction of the Rich Edit Script starts by generating the GumTree edit script (cf. Fig. 7) of the patch. The patch consists of a single hunk, thus we expect to extract a single AST subtree, which is illustrated in Fig. 9. To extract this AST subtree, we first identify the node of the edit action “SimpleName” at position 4 in the AST of the program. Then, starting from this node, we traverse the AST backwards until we reach the node “VariableDeclarationStatement” at position 1. We extract the AST subtree by creating a new tree, setting “VariableDeclarationStatement” as its root node, and adding the intermediate nodes at positions 2 and 3 until we reach the node corresponding to the edit action “UPD SimpleName” at position 4. We create a sequence, and add the extracted AST subtree to it.
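The backward traversal described above can be sketched with the Eclipse JDT DOM as follows. This is a minimal illustration with hypothetical helper names; FixMiner's actual implementation may differ.

```java
import org.eclipse.jdt.core.dom.ASTNode;
import org.eclipse.jdt.core.dom.Statement;

class ContextRootFinder {
    // Climb from the AST node mapped to a GumTree edit action up to the
    // closest enclosing "predefined root" (statement-level) node.
    static ASTNode findContextRoot(ASTNode changed) {
        ASTNode current = changed;
        while (current != null && !isPredefinedRoot(current)) {
            current = current.getParent();
        }
        return current; // null if no statement-level ancestor exists
    }

    static boolean isPredefinedRoot(ASTNode n) {
        switch (n.getNodeType()) {
            case ASTNode.TYPE_DECLARATION:
            case ASTNode.FIELD_DECLARATION:
            case ASTNode.METHOD_DECLARATION:
            case ASTNode.SWITCH_CASE:
            case ASTNode.CATCH_CLAUSE:
            case ASTNode.CONSTRUCTOR_INVOCATION:
            case ASTNode.SUPER_CONSTRUCTOR_INVOCATION:
                return true;
            default:
                return n instanceof Statement; // any Statement node
        }
    }
}
```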

Fig. 9
figure 9

Excerpt AST of buggy code (Closure-93)

Rich Edit Scripts are tree data structures used to represent changes. In order to provide tractable and reusable patterns as input to other APR systems, we define the following string notation (cf. Grammar 1), based on the syntactic rules governing the formation of a correct Rich Edit Script.

Grammar 1: string notation of Rich Edit Scripts (not reproduced here)

Figure 10 illustrates the computed Rich Edit Script. The first line indicates the root node (no dashes). ‘UPD’ indicates the action type of the node, and VariableDeclarationStatement corresponds to the AST node type of the node; the tokens between ‘@@’ and ‘@TO@’ contain the corresponding code tokens before the change, whereas the tokens between ‘@TO@’ and ‘@AT’ correspond to the new code tokens after the change. Three dashes (- - -) indicate a child node: immediate children carry three dashes, while their own children add another three dashes (- - - - - -), preserving the parent-child relation.

Fig. 10
figure 10

Rich Edit Script for the Closure-93 patch in Defects4J. ↩ represents the carriage return character, inserted for presentation reasons
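Since Fig. 10 is not reproduced here, the following is a hedged approximation of the Rich Edit Script for Closure-93, rendered in the string notation just described (the intermediate node types follow Fig. 9; the exact token spans in the original figure may differ):

```text
UPD VariableDeclarationStatement @@int indexOfDot = namespace.indexOf('.')@TO@int indexOfDot = namespace.lastIndexOf('.')@AT
--- UPD VariableDeclarationFragment @@indexOfDot = namespace.indexOf('.')@TO@indexOfDot = namespace.lastIndexOf('.')@AT
------ UPD MethodInvocation @@namespace.indexOf('.')@TO@namespace.lastIndexOf('.')@AT
--------- UPD SimpleName @@indexOf@TO@lastIndexOf@AT
```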

An edit action node carries the following three types of information: the AST node type (Shape), the repair action (Action), and the raw tokens (Token) in the patch. For each of these three information types, we create a separate tree representation from the Rich Edit Script, named ShapeTree, ActionTree and TokenTree, each carrying the type of information indicated by its name. Figures 11, 12 and 13 show the ShapeTree, ActionTree, and TokenTree, respectively, generated for Closure-93.

Fig. 11
figure 11

ShapeTree of Closure-93

Fig. 12
figure 12

ActionTree of Closure-93

Fig. 13
figure 13

TokenTree of Closure-93

3.4 Step 2 – search index construction

Definition 3

(Search Index) To reduce the effort of matching similar patches, a search index (SI) is used to confine the comparison space. Each fold ({Shape, Action, Token}) defines a search index: \(SI_{Shape}\), \(SI_{Action}\), and \(SI_{Token}\), respectively. Each is defined as \(SI_{\ast}: Q_{\ast} \rightarrow 2^{RE}\), where \(Q_{\ast}\) is a query set specific to each fold and \(\ast \in \{Shape, Action, Token\}\).

Given that Rich Edit Scripts are computed for each hunk in a patch, they are spread within and across different patches. A direct pairwise comparison of these Rich Edit Scripts would lead to a combinatorial explosion of the comparison space. In order to reduce this comparison space and enable a fast identification of the Rich Edit Scripts to compare, we build search indices. A search index is a set of comparison sub-spaces created by grouping the Rich Edit Scripts with criteria that depend on the information embedded in the tree representation (Shape, Action, Token) used in the different iterations.

The search indices are built as follows:

“Shape” search index.:

The construction process takes the ShapeTree representations of the Rich Edit Scripts produced by Step 1 as input, and groups them based on their tree structure in terms of AST node types. Concretely, Rich Edit Scripts having the same root node (e.g., IfStatement, MethodDeclaration, ReturnStatement) and same depth are grouped together. For each group, we create a comparison space by enumerating the pairwise combinations of the group members. Eventually, the “Shape” search index is built by storing an identifier per group, denoted as root node/depth (e.g., IfStatement/2, IfStatement/3, MethodDeclaration/4), and a pointer to its comparison space (i.e., the pairwise combinations of its members).

“Action” search index.:

The construction process follows the same principle as for the “Shape” search index, except that the regrouping is based on the clustering output of the ShapeTrees. Thus, the input is formed by the ActionTree representations of the Rich Edit Scripts, and the group identifier for each comparison space is generated as node/depth/ShapeTreeClusterId (e.g., IfStatement/2/1, MethodDeclaration/2/2), where ShapeTreeClusterId represents the id of the cluster yielded by the clustering (Steps 3-4) based on the ShapeTree information. Concretely, this means that the “Action” search index is built on groups of trees having the same shape.

“Token” search index.:

The construction process follows the same principle as for the “Action” search index, using this time the clustering output of the ActionTrees. Thus, the input is formed by the TokenTree representations of the Rich Edit Scripts, and the group identifier for each comparison space is generated as node/depth/ShapeTreeClusterId/ActionTreeClusterId (e.g., IfStatement/2/1/3, MethodDeclaration/2/2/1), where ActionTreeClusterId represents the id of the cluster yielded by the clustering (Steps 3-4) based on the ActionTree information.
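A hedged sketch of the index construction for the "Shape" fold follows (ShapeTree, rootLabel() and depth() are hypothetical names introduced for illustration):

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

interface ShapeTree {
    String rootLabel(); // e.g., "IfStatement"
    int depth();        // depth of the specialized tree
}

class ShapeIndex {
    // Group ShapeTrees by (root node type, depth); each group becomes a
    // comparison sub-space whose members are later compared pairwise.
    static Map<String, List<ShapeTree[]>> build(List<ShapeTree> trees) {
        Map<String, List<ShapeTree>> groups = new HashMap<>();
        for (ShapeTree t : trees) {
            String key = t.rootLabel() + "/" + t.depth(); // e.g., "IfStatement/2"
            groups.computeIfAbsent(key, k -> new ArrayList<>()).add(t);
        }
        Map<String, List<ShapeTree[]>> spaces = new HashMap<>();
        groups.forEach((key, g) -> {
            List<ShapeTree[]> pairs = new ArrayList<>();
            for (int i = 0; i < g.size(); i++)
                for (int j = i + 1; j < g.size(); j++)
                    pairs.add(new ShapeTree[] { g.get(i), g.get(j) });
            spaces.put(key, pairs);
        });
        return spaces;
    }
}
```

The Action and Token indices refine these keys with the cluster ids of the previous iteration, as described above.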

3.5 Step 3 – tree comparison

Definition 4

(Pair of identical trees) Let \(a = (r_i, r_j) \in R_{identical}\) be a pair of Rich Edit Script specialized tree representations such that \(d(r_i, r_j) = 0\), where \(r_i, r_j \in RE\) and d is a distance function. \(R_{identical}\) is a subset of \(RE \times RE\).

The goal of tree comparison is to find identical tree representations of Rich Edit Scripts for a given fold. There are several straightforward approaches for checking whether two Rich Edit Scripts are identical; for example, syntactic equality could be used. However, we aim at making FixMiner a flexible and extensible framework in which future research may tune threshold values for defining similar trees. Thus, we propose a generic approach for comparing Rich Edit Scripts that takes into account the diversity of information to compare for each specialized tree representation. To that end, we compute tree edit distances for the three representations of Rich Edit Scripts separately. The tree edit distance is the sequence of edit actions that transforms one tree into another. When the edit distance is zero (i.e., no operation is necessary to transform one tree into the other), the trees are considered identical. Algorithm 1 defines the steps to compare Rich Edit Scripts.

Algorithm 1 (tree comparison; not reproduced here)

The algorithm starts by retrieving the identifiers from the search index SI corresponding to the fold. An identifier is a pointer to a comparison sub-space that contains pairwise combinations of the tree representations of Rich Edit Scripts to compare (cf. Section 3.4). Concretely, we restore the Rich Edit Scripts of a given pair from the cache, together with their specialized tree representations according to the fold: in the first iteration, we consider only ShapeTrees, in the second iteration ActionTrees, and in the third iteration TokenTrees. We compute the edit distance between the restored trees in two distinct ways.

  • In the first two iterations (i.e., Shape and Action), we leverage again the edit script algorithm of GumTree (Falleri et al. 2014, Section 3). We compute the edit distance by simply invoking GumTree on the restored trees, given that Rich Edit Scripts are indeed AST subtrees that are compatible with GumTree. Concretely, GumTree takes the two ASTs as input and generates a sequence of edit actions (a.k.a. an edit script) that transforms one tree into the other, where the size of the edit script represents the edit distance between the two trees.

  • For the third iteration (i.e., Token), since the relevant information in the tree is text, we use a text distance algorithm (Jaro-Winkler (Jaro 1989; Winkler 1990)) to compute the edit distance between two tokens extracted from the trees. We use the implementation of the Jaro-Winkler edit distance from the Apache Commons Text library, which computes the Jaro-Winkler edit distance of two strings, dw, as defined in Eq. 1. The equation consists of two components: Jaro’s original algorithm (jsim) and Winkler’s extension (wsim). The Jaro similarity is the weighted sum of the percentage of matched characters c from each string and transposed characters t. Winkler increased this measure for matching initial characters, using a prefix scale p (set to 0.1 by default), which gives more favourable ratings to strings that match from the beginning, for a set prefix length l. The algorithm produces a similarity score (wsim) between 0.0 and 1.0, where 0.0 indicates no similarity and 1.0 an exact match. Finally, this similarity score is transformed into a distance (dw).

    $$ \begin{array}{@{}rcl@{}} d_{w}(s_{1},s_{2}) &=& 1 - w_{sim}(s_{1},s_{2})\\ w_{sim}(s_{1},s_{2}) &=& j_{sim}(s_{1},s_{2}) + l \cdot p \cdot \left(1-j_{sim}(s_{1},s_{2})\right)\\ j_{sim}(s_{1},s_{2}) &=& \left\{\begin{array}{ll} 0 & \text{if } c = 0; \\ \frac{1}{3}\left(\frac{c}{|s_{1}|}+\frac{c}{|s_{2}|}+\frac{c-t}{c}\right) & \text{otherwise} \end{array}\right. \end{array} $$
    (1)

    where l is the number of agreed characters at the beginning of the two strings, and p is a constant scaling factor for how much the score is adjusted upwards for having common prefixes, set to 0.1 in Winkler’s work (Winkler 1990).

As the last step of the comparison, we check the edit distance of each tree pair and tag pairs with distance zero as identical, since a distance of zero implies that no operation is necessary to transform one tree into the other, or, for the third fold (Token), that the tokens in the trees are the same. Eventually, we store the set of identical tree pairs produced in each iteration, which is used in Step 4.
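For instance, with the Apache Commons Text implementation, the token distance of Eq. 1 can be computed as follows (a sketch; the exact invocation in FixMiner may differ):

```java
import org.apache.commons.text.similarity.JaroWinklerSimilarity;

public class TokenDistanceDemo {
    public static void main(String[] args) {
        JaroWinklerSimilarity jaroWinkler = new JaroWinklerSimilarity();
        double wsim = jaroWinkler.apply("Log.warn", "Log.warning"); // w_sim in Eq. 1
        double dw = 1.0 - wsim;                                     // d_w in Eq. 1
        // Two token trees are tagged identical only when the distance is zero
        System.out.printf("similarity=%.3f distance=%.3f%n", wsim, dw);
    }
}
```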

3.6 Step 4 – pattern inference

Definition 5

(Pattern) Let g be a graph in which nodes are elements of RE and edges are defined by \(R_{identical}\).

g consists of a set of connected subgraphs SG (i.e., clusters of specialized tree representations of Rich Edit Scripts), where \(sg_i\) and \(sg_j\) are disjoint \(\forall sg_i, sg_j \in SG\). A pattern is defined by \(sg_i \in SG\) if \(sg_i\) has at least two nodes (i.e., there are recurrent trees).

Finally, to infer patterns, we resort to clustering the specialized tree representations of Rich Edit Scripts. We start by retrieving the set of identical tree pairs produced in Step 3 for each iteration. Following Algorithm 2, we extract the corresponding specialized tree representations according to the fold (i.e., ShapeTrees, ActionTrees, TokenTrees), since the trees are identical only for a given fold. In order to find groups of trees that are identical among themselves (i.e., clusters), we leverage graphs. Concretely, we implement a clustering process based on the identification of connected components (i.e., subgraphs) in a graph (Skiena 1997). We create an undirected graph from the list of tree pairs, where the nodes of the graph are the trees and the edges link trees that are associated (i.e., identical tree pairs). From this graph, we identify clusters as the subgraphs, where each subgraph contains a group of trees that are identical among themselves and disjoint from the others.

Algorithm 2 (pattern inference; not reproduced here)

A cluster contains a list of Rich Edit Scripts sharing a common specialized tree representation according to the fold. Finally, a cluster qualifies as a pattern when it has at least two members.
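A minimal sketch of this connected-component clustering, using union-find over integer tree identifiers (an illustrative setup; the actual implementation may rely on a graph library):

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class PairClustering {
    // Each pair (i, j) states that trees i and j are identical for the fold.
    static Map<Integer, List<Integer>> cluster(int treeCount, List<int[]> identicalPairs) {
        int[] parent = new int[treeCount];
        for (int i = 0; i < treeCount; i++) parent[i] = i;
        for (int[] p : identicalPairs) union(parent, p[0], p[1]);
        Map<Integer, List<Integer>> clusters = new HashMap<>();
        for (int i = 0; i < treeCount; i++)
            clusters.computeIfAbsent(find(parent, i), k -> new ArrayList<>()).add(i);
        // A cluster qualifies as a pattern only when it has at least two members
        clusters.values().removeIf(c -> c.size() < 2);
        return clusters;
    }

    static int find(int[] parent, int x) {
        while (parent[x] != x) x = parent[x] = parent[parent[x]]; // path halving
        return x;
    }

    static void union(int[] parent, int a, int b) {
        parent[find(parent, a)] = find(parent, b);
    }
}
```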

The patterns for each fold are defined as follows:

Shape patterns

The first iteration attempts to find patterns in the ShapeTrees associated with developer patches. We refer to them as Shape patterns, since they represent the shape of the changed code, i.e., the structure of the tree in terms of node types. Thus, they are not fix patterns per se, but rather the contexts in which changes are recurrent.

Action patterns

The second iteration considers the samples associated with each shape pattern and attempts to identify reoccurring repair actions from their ActionTrees. This step produces patterns that are relevant to program repair, as they refer to recurrent code change actions. Such patterns can indeed be matched to dissection studies performed in the literature (Sobreira et al. 2018). We will refer to Action patterns as the sought fix patterns. Nevertheless, it is noteworthy that, in contrast with literature fix patterns, which can be generically applied to any matching code context, our Action patterns are specifically mapped to a code shape (i.e., a shape pattern) and are thus applicable only to specific code contexts. This constrains the mutations to relevant code contexts, thus yielding more precise fix operations.

Token patterns

The third iteration finally considers the samples associated with each action pattern and attempts to identify more specific patterns with respect to the tokens available. Such token-specific patterns, which include concrete tokens, are not suitable for implementation in the pattern-based automated program repair systems from the literature. We discuss, however, their use in the context of deriving collateral evolutions (cf. Section 5.2).

4 Experimental evaluation

We now provide details on the experiments that we carry out for FixMiner. Notably, we discuss the dataset, and present the implementation details. Then, we overview the statistics on the mining steps, and eventually enumerate the research questions for the assessment of FixMiner.

4.1 Dataset

We collect code changes from 44 large and popular open-source projects from the Apache-Commons, JBoss, Spring and Wildfly communities with the following selection criteria: we focused on projects (1) written in Java, (2) with publicly available bug reports, (3) having at least 20 source code files in at least one of their versions; finally, to reduce selection bias, (4) we chose projects from a wide range of categories: middleware, databases, data warehouses, utilities, infrastructure. This process is similar to that of Bench4bl (Lee et al. 2018).

Table 2 Dataset

Table 2 details the number of bug fixing patches that we considered in each project. Eventually, our dataset includes 11 416 patches.

4.2 Implementation choices

We recall that we have made the following parameter choices in the FixMiner workflow:

  • The “Shape” search index considers only Rich Edit Scripts having a depth greater than 1 (i.e., the AST sub-tree should include at least one parent and one child).

  • Comparison of Rich Edit Scripts is designed to retrieve identical trees (i.e., tree edit distance is 0).

4.3 Statistics

FixMiner is a pattern mining approach that produces fix patterns for program repair systems. Its evaluation (cf. Section 5) will focus on the relevance of the yielded patterns. Nevertheless, we provide statistics on the mining process as a basis for discussing the implications of FixMiner’s design choices.

Search indices

FixMiner mines fix patterns through the comparison of hunks (i.e., contiguous groups of code lines). The 11 416 patches in our database are associated with 41 823 hunks. A direct pairwise comparison of these hunks would lead to 874 560 753 tree comparisons. The combinatorial explosion of the comparison space is overcome by building search indices, as previously described in Section 3.4. Table 3 shows the details of the search indices built for each fold in the FixMiner iterations. From the 874+ million tree pairs to be compared (i.e., \(C_{41823}^{2}\)), the construction of the Shape index (which implements criteria on the tree structure to focus on comparable trees) led to 670 relevant comparison sub-spaces yielding a total of only 12+ million tree comparison pairs. This represents a reduction of 98% of the comparison space. Similarly, the Action index and the Token index reduce the associated comparison spaces by 88% and 72%, respectively.
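For reference, the size of the naive comparison space follows directly from the number of hunks:

$$ C_{41823}^{2} = \frac{41823 \times 41822}{2} = 874\,560\,753 $$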

Table 3 Comparison space reduction

Clusters

We infer patterns by considering the recurrence of trees: the clustering process groups together only tree pairs that are identical among themselves. Table 4 overviews the statistics of the clusters yielded in the different iterations: Shape patterns (which represent code contexts) are the most diverse. Action patterns (which represent fix patterns that are suitable as inputs for program repair systems) are substantially less numerous. Finally, Token patterns (which may be codebase-specific) are significantly fewer. We recall that we consider all possible clusters as long as they include at least 2 elements. A practitioner may however decide to select only large clusters (i.e., based on a threshold).

Table 4 Statistics on clusters

Because FixMiner considers code hunks as the unit for building Rich Edit Scripts, a given pattern may represent a repeating context (i.e., Shape pattern) or change (i.e., Action or Token pattern) that is only part of a patch (i.e., the patch includes other change patterns) or that is the full patch (i.e., the whole patch is made of this change pattern). Table 5 provides statistics on partial and full patterns. The numbers represent the disjoint sets of patterns that can be identified as always full or as always partial. Patterns that may be full for a given patch but partial for another are not counted. Overall, the statistics indicate that, from our dataset of over 40 thousand code hunks, only a few (278 and 7 120 hunks, respectively) are associated with patterns that are always full or always partial. In the remaining cases, the pattern is associated with a code hunk that may form the patch alone or may be tangled with other code. This suggests that FixMiner is able to cope with tangled changes during pattern mining.

Table 5 Statistics on Full vs Partial patterns

Similarly, we investigate how the patterns are spread among patches. Indeed, a pattern may be found because a given patch has made the same change in several code hunks. We refer to such patterns as vertical. In contrast, a pattern may be found because the same code change is spread across several patches. We refer to such patterns as horizontal. Table 6 shows that vertical and horizontal patterns occur in similar proportions for Shape and Action patterns. However, Token patterns are significantly more vertical than horizontal (65 vs 224). This is in line with studies of collateral evolutions in Linux, which highlight large patches making repetitive changes in several locations at once (Padioleau et al. 2008) (i.e., collateral evolutions are applied through vertical patches).

Table 6 Statistics on pattern spread

4.4 Research questions

The assessment experiments are performed with the objective of investigating the usefulness of the patterns mined by FixMiner. To that end, we focus on the following research questions (RQs):

RQ-1:

Is automated patch clustering of FixMiner consistent with human manual dissection?

RQ-2:

Are patterns inferred by FixMiner compatible with known fix patterns?

RQ-3:

Are the mined patterns effective for automated program repair?

5 Results

5.1 RQ1: Comparison of FixMiner clustering against manual dissection

Objective. :

We propose to assess the relevance of the clusters yielded by FixMiner, in terms of whether they represent patterns that practitioners would view as recurrent changes that are indeed relevant to the patch behaviour. In the previous section, the statistics showed that several changes are recurrent and are mapped to FixMiner’s clusters. In this RQ, we validate whether they are relevant from a practitioner’s viewpoint. For example, if FixMiner were not leveraging AST information, the removal of blank lines would have been seen as a recurrent change (hence a pattern); however, a practitioner would not consider it relevant.

Protocol. :

We consider an oracle dataset of patches with change patterns that are labelled by humans. Then we associate each of these patches to the relevant clusters mined by FixMiner on our combined study datasets. This way, we ensure that the clustering does not overfit to the oracle dataset labelled by humans. Eventually, we check whether each set of patches (from the oracle dataset) that is associated with a given FixMiner cluster consists of patches having the same labels (from the oracle).

Oracle. :

For our experiments, we leverage the manual dissection of Defects4J (Just et al. 2014) provided by Sobreira et al. (2018).

This oracle dataset associates the developer patches of 395 bugs in the Defects4J dataset with 26 repair pattern labels (one of which is “Not classified”).

Results. :

Table 7 provides statistics that describe the proportion of FixMiner’s patterns that can be associated with change patterns in the Defects4J patches.

Table 7 Proportion of shared patterns between our study dataset and Defects4J

Diversity

We check the number of patterns that can be found in our study dataset and Defects4J. In absolute numbers, Defects4J patches include a limited set of change patterns (i.e., \(\sim 7\%=\frac {214}{2947}\)) in comparison to what can be found in our study dataset.

Consistency

We check the consistency of FixMiner’s pattern mining by assessing whether all Defects4J patches associated with a FixMiner cluster indeed share a common dissection pattern label. We found the clustering to be consistent for \(\sim 78\%=\frac{166}{214}\), \(\sim 73\%=\frac{27}{37}\), and \(\sim 92\%=\frac{12}{13}\) of the Shape, Action and Token clusters, respectively.


Granularity

The human dissection provides repair pattern labels for a given patch. Nonetheless, a label is not specifically associated with any of the various changes in the patch. FixMiner, however, yields patterns for code hunks. Thus, while FixMiner links a given hunk to a single pattern, the dissection data associates several patterns with a given patch. We investigate the granularity level with respect to human-provided patterns. Concretely, several patterns of FixMiner can actually be associated (based on the corresponding Defects4J patches) with a single human dissection pattern. Consider the example cases in Table 8. Both patches consist of nested InfixExpressions under an IfStatement. The first FixMiner pattern indicates that the change operation (i.e., updating an operator) should be performed on the child InfixExpression. The second pattern, on the other hand, implies a change operation in the parent InfixExpression. Thus, FixMiner patterns are finer-grained: they associate the example patches with two distinct patterns, each pointing to the precise node to update, while the manual dissection considers them under the same coarse-grained repair pattern.

Table 8 Granularity example to FixMiner mined patterns

We have investigated the differences between FixMiner patterns and dissection labels and found several granularity mismatches similar to the previous example: condBlockRetAdd (condition block addition with a return statement) from the manual dissection is associated with 14 fine-grained Shape patterns of FixMiner: this suggests that the repair potential of this pattern could be further refined depending on the code context. Similarly, expLogicMod (logic expression modification) is associated with 2 separate Action patterns of FixMiner (see Table 8): this suggests that the application of this repair pattern can be further specialized to reduce the repair search space and the number of false positives.

Overall, we found that in total 37, 3 and 1 dissection repair patterns are further refined into several FixMiner Shape, Action and Token patterns, respectively.


Assessment of FixMiner’s patterns with respect to associated bug reports

Beyond assessing the consistency of FixMiner’s patterns based on a human-built oracle dataset of labels, we further propose to investigate the relevance of the patterns in terms of the semantics that can be associated with the intention of the changes. To that end, we consider the bug reports associated with patches as a proxy to characterize the intention of the code changes. We expect bug reports sharing textual similarity to be addressed by patches that are syntactically similar. This hypothesis drives the entire research direction on information retrieval-based bug localization (Lee et al. 2018).

Figure 14 provides the distribution of pairwise bug report (textual) similarity values for the bug reports corresponding to the patches associated with each cluster. For clear presentation, we focus on the top-20 clusters (in terms of size). We use TF-IDF to represent each bug report as a vector, and leverage cosine similarity to compute similarity scores among vectors. The represented boxplots display all pairwise bug report similarity values, including outliers. Although for Shape and Action patterns the similarities are near 0 for all clusters, we note that there are fewer outliers for Action patterns. This suggests a relative increase in the similarity among bug reports. As expected, similarity among bug reports is the highest for Token patterns.
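A hedged sketch of this similarity computation (the tokenization and weighting details of the actual implementation may differ):

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class BugReportSimilarity {
    // Cosine similarity between two TF-IDF vectors stored as sparse maps.
    static double cosine(Map<String, Double> a, Map<String, Double> b) {
        double dot = 0, na = 0, nb = 0;
        for (Map.Entry<String, Double> e : a.entrySet()) {
            dot += e.getValue() * b.getOrDefault(e.getKey(), 0.0);
            na += e.getValue() * e.getValue();
        }
        for (double v : b.values()) nb += v * v;
        return (na == 0 || nb == 0) ? 0 : dot / (Math.sqrt(na) * Math.sqrt(nb));
    }

    // TF-IDF vector of a tokenized bug report, relative to a corpus of reports.
    static Map<String, Double> tfidf(List<String> doc, List<List<String>> corpus) {
        Map<String, Double> tf = new HashMap<>();
        doc.forEach(t -> tf.merge(t, 1.0, Double::sum));
        Map<String, Double> vec = new HashMap<>();
        for (Map.Entry<String, Double> e : tf.entrySet()) {
            long df = corpus.stream().filter(d -> d.contains(e.getKey())).count();
            double idf = Math.log((double) corpus.size() / (1 + df));
            vec.put(e.getKey(), (e.getValue() / doc.size()) * idf);
        }
        return vec;
    }
}
```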

Fig. 14
figure 14

Distribution of pairwise bug report similarity. Note: a red line represents the average similarity over all bug reports in a fold, and a blue line represents the average similarity of bug reports within a cluster

5.2 RQ2: Compatibility between FixMiner’s patterns and APR literature patterns

Objective. :

Given that FixMiner aims to automatically produce fix patterns that can be used by automated program repair systems, we propose to assess whether the yielded patterns are compatible with patterns in the literature.

Protocol. :

We consider the set of patterns used by literature APR systems and compare them against FixMiner’s patterns. Concretely, we systematically try to map FixMiner’s patterns to patterns in the literature. To that end, we rely on the comprehensive taxonomy of fix patterns proposed by Liu et al. (2019): if a given FixMiner pattern can be mapped to a type of change in the taxonomy, then this pattern is marked as compatible with patterns in the literature.

Recall that, as described earlier, the fix patterns used by APR tools abstract changes in the form of FixMiner’s Action patterns (Section 3 - Step 4). In the absence of a common language for specifying patterns, the comparison is performed manually. For the comparison, we do not conduct an exact mapping between literature patterns and the ones yielded by FixMiner, as the fix patterns yielded by FixMiner carry more context information. We rather consider whether the context information of FixMiner patterns matches the context of literature patterns. We discuss the related threats to validity in Section 6. Given that the assessment is manual and thus time-consuming, we limit the comparison to the top 50 patterns (i.e., Action patterns) yielded by FixMiner.

Oracle. :

We build on the patterns enumerated by Liu et al. (2019), who systematically reviewed the fix patterns used by Java APR systems in the literature. They summarised 35 fix patterns in GNU diff format, which we refer to for comparing against FixMiner patterns.

Results. :

Overall, among the 35 fix patterns used by the 11 studied APR systems, 16 are also included in the fix patterns (i.e., Action patterns) yielded by FixMiner when mining our study dataset. We recall that these literature patterns are often manually inferred and specified by researchers for their APR tools. Table 9 illustrates examples of FixMiner’s fix patterns associated with some of the patterns used in the literature. We note that the fix patterns identified by FixMiner are more specific (e.g., for FP4: Insert Missed Statement, the corresponding FixMiner fix pattern specifies which type of statement must be inserted).

Table 9 Example FixMiner fix-patterns associated to APR literature patterns

Table 10 illustrates the proportion of FixMiner’s patterns that are compatible with patterns in the literature. In this comparison, we select the top-50 fix patterns yielded by FixMiner and verify their presence within the fix patterns used in the APR systems.

Table 10 Compatibility of patterns: FixMiner vs Literature patterns

We observed that

  • 7 patterns are compatible with fix patterns that are mined manually from bug fix patches (i.e., fix patterns in PAR (Kim et al. 2013)).

  • between 1 and 8 patterns are compatible with researcher-predefined fix patterns used in ssFix (Xin and Reiss 2017), ELIXIR (Saha et al. 2017), S3 (Le et al. 2017), NEPfix (Durieux et al. 2017), and SketchFix (Hua et al. 2018), respectively.

  • 7 patterns are compatible with fix patterns mined from historical bug fixes by HDRepair (Le et al. 2016a), 9 patterns are compatible with fix patterns mined from Stack Overflow by SOFix (Liu and Zhong 2018), and 1 pattern is compatible with a fix pattern mined by Genesis (Long et al. 2017), which focuses on mining fix patterns for three kinds of bugs.

  • 12 and 8 patterns are compatible with the patterns used by CapGen (Wen et al. 2018) and SimFix (Jiang et al. 2018), respectively, which extract patterns in a statistical manner similar to empirical studies of bug fixes (Martinez and Monperrus 2015; Liu et al. 2018b).

  • 6 patterns are compatible with the fix patterns used in AVATAR (Liu et al. 2019), which were presented in a study on inferring fix patterns from FindBugs (Hovemeyer and Pugh 2004) static analysis violations (Liu et al. 2018a).


Manual (but Systematic) Assessment of Token patterns

Action and Token patterns are the two types of patterns that relate to code changes. In the assessment scenario above, we only considered Action patterns, since they are the most appropriate for comparison with the literature patterns. We now focus on Token patterns to assess whether our hypothesis on their usefulness for deriving collateral evolutions holds (cf. Section 3 - Step 4). To that end, we consider the various Token clusters yielded by FixMiner and manually verify whether the recurrent change (i.e., the pattern) is relevant (i.e., a human can explain whether the intentions of the changes are the same). Eventually, if the pattern is validated, it should be presentable as a generic/semantic patch (Padioleau et al. 2008; Andersen and Lawall 2010) written in SmPL.

In Table 11, we list some of the changes that we found to be relevant. Among the top 50 Token patterns investigated, 12 patterns correspond to a modifier change, 4 patterns target changes in logging methods, and 1 pattern is about fixing an infix operator (e.g., >>=). The remaining cases mainly focus on changes that complete the implementation of finally block logic (e.g., a missing call to closeAll for opened files), changes in exception handling, updates to wrong parameters passed to method invocations, as well as wrong method invocations. As mentioned earlier, these patterns are spread mostly vertically (i.e., the change is recurrent in several code hunks of a given patch) and their semantic behaviour is specific to the nature of the project.

Table 11 Example changes associated to FixMiner mined patterns

Overall, our manual investigation of the top 50 Token patterns confirms that many of the recurrent changes associated with specific tokens are indeed relevant. We even found several cases where collateral evolution changes are regrouped to form a pattern, as exhibited by the corresponding pattern example presented in Fig. 15. In this example, we illustrate the pattern using the SmPL specification language, which was designed for specifying collateral evolutions. This finding suggests that FixMiner can be leveraged to systematically mine collateral evolutions in the form of Token patterns, which could be automatically rewritten as semantic patches in the SmPL format. This endeavour is however out of the scope of this paper and will be investigated in future work.

Fig. 15 Example SmPL patch corresponding to the generic representation of a pattern mined by FixMiner

5.3 RQ3: Evaluation of Fix Patterns’ Relevance for APR

Objective:

We propose to assess whether fix patterns yielded by FixMiner are effective for automated program repair.

Protocol:

We implement a prototype APR system that uses the fix patterns mined by FixMiner to generate patches for bugs, following the principles of PAR (Kim et al. 2013); we refer to this system as PARFixMiner in the remainder of this paper. In contrast with PAR, where the templates were engineered through a manual investigation of example bug fixes, in PARFixMiner the repair templates are engineered from the fix patterns mined by FixMiner. Figure 16 overviews the workflow of PARFixMiner.

Fault Localization:

PARFixMiner uses spectrum-based fault localization. We use the GZoltar (Campos et al. 2012) dynamic testing framework and leverage the Ochiai (Abreu et al. 2007) ranking metric to predict buggy statements based on execution coverage information of passing and failing test cases. This setting is widely used in the repair community (Martinez and Monperrus 2016; Xiong et al. 2017; Xin and Reiss 2017; Wen et al. 2018; Liu et al. 2018), allowing for a comparable assessment of PARFixMiner against the state of the art.
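
For reference, the Ochiai metric scores the suspiciousness of a statement $s$ from the program spectrum as

$$\mathrm{Ochiai}(s) = \frac{e_f(s)}{\sqrt{t_f \cdot \big(e_f(s) + e_p(s)\big)}}$$

where $e_f(s)$ and $e_p(s)$ are the numbers of failing and passing test cases that execute $s$, and $t_f$ is the total number of failing test cases. Statements are ranked by decreasing score to form the list of suspicious locations.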

Pattern Matching and Patch Generation:

Once the spectrum-based fault localization (or IR-based fault localization (Koyuncu et al. 2019; Wen et al. 2016)) process yields a list of suspicious code locations, PARFixMiner attempts to select fix patterns for each statement in the list. The selection of fix patterns is conducted by matching the context information of suspicious code locations against the fix patterns mined by FixMiner. Concretely, we first parse the suspicious statement and traverse each node of its AST, from its first child node to its last leaf node, to form an AST subtree representing its context. Then, we try to match the context (i.e., shape) of this AST subtree against the shapes of the fix patterns.
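
To make this matching step concrete, the following minimal sketch illustrates shape matching over a toy AST representation. It is an illustration we introduce for this description, not PARFixMiner's actual implementation (which operates on ASTs produced by a Java parser): a pattern subtree matches a candidate subtree if node types agree and the pattern's children can be matched, in order, against the candidate's children.

```java
import java.util.ArrayList;
import java.util.List;

class AstNode {
    final String type;                        // AST node type, e.g., "ReturnStatement"
    final List<AstNode> children = new ArrayList<>();
    AstNode(String type) { this.type = type; }
    AstNode add(AstNode child) { children.add(child); return this; }
}

public class ShapeMatcher {
    // A pattern matches a candidate subtree if node types agree and every
    // pattern child is matched, left-to-right, by some candidate child.
    static boolean matches(AstNode pattern, AstNode candidate) {
        if (!pattern.type.equals(candidate.type)) return false;
        int from = 0;
        for (AstNode pChild : pattern.children) {
            boolean found = false;
            for (int i = from; i < candidate.children.size(); i++) {
                if (matches(pChild, candidate.children.get(i))) {
                    from = i + 1;             // preserve child order
                    found = true;
                    break;
                }
            }
            if (!found) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        AstNode pattern = new AstNode("ReturnStatement")
                .add(new AstNode("MethodInvocation"));
        AstNode suspicious = new AstNode("ReturnStatement")
                .add(new AstNode("MethodInvocation")
                        .add(new AstNode("SimpleName")));
        System.out.println(matches(pattern, suspicious));  // true
    }
}
```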

Fig. 16 The overall workflow of the PARFixMiner program repair pipeline

If a matching fix pattern is found, we proceed with the generation of a patch candidate. Some fix patterns require donor code (i.e., source code extracted from the buggy program) to generate patch candidates; such code is often referred to as fix ingredients. Recall that, to integrate with repair tools, we leverage FixMiner Action patterns, which do not contain any code token information: they have “holes”. Thus, we search for donor code locally, in the file which contains the suspicious statement. We select relevant donor code among the candidates that are applicable to the fix pattern and the suspicious statement (i.e., whose variable data types, expression types, etc. match the context), in order to reduce the search space of donor code and limit the generation of nonsensical patch candidates. For example, the fix pattern in Fig. 17 can only be matched to a suspicious return statement that has a method invocation expression: the suspicious return statement will thus be patched by replacing its method name with another one (i.e., donor code). The donor code is searched by identifying all method names from the suspicious file having the same return type and parameters as the suspicious statement (a simplified sketch of this search is given after Fig. 17). Finally, a patch candidate is generated by mutating the suspicious statement with the identified donor code, following the actions indicated in the matched fix pattern. We generate as many patches as there are identified pieces of donor code. Patches are generated consecutively in the order of matching within the AST.

Fig. 17 Example of a fix pattern yielded by FixMiner
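
To illustrate the donor-code search described above for patterns like the one in Fig. 17, the sketch below collects, from the methods declared in the suspicious file, the names of those whose return type and parameter types match the invoked (buggy) method. The classes and the string-based type representation are simplifications introduced for illustration only, not the actual implementation.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class DonorSearch {
    static class MethodDecl {
        final String name, returnType;
        final List<String> paramTypes;
        MethodDecl(String name, String returnType, String... paramTypes) {
            this.name = name;
            this.returnType = returnType;
            this.paramTypes = Arrays.asList(paramTypes);
        }
    }

    // Donor candidates: methods of the suspicious file with the same return
    // type and parameter types as the buggy invocation, but a different name.
    static List<String> donorMethodNames(List<MethodDecl> fileMethods,
                                         String buggyName,
                                         String returnType,
                                         List<String> paramTypes) {
        List<String> donors = new ArrayList<>();
        for (MethodDecl m : fileMethods) {
            if (!m.name.equals(buggyName)
                    && m.returnType.equals(returnType)
                    && m.paramTypes.equals(paramTypes)) {
                donors.add(m.name);           // one candidate patch per donor
            }
        }
        return donors;
    }

    public static void main(String[] args) {
        List<MethodDecl> methods = Arrays.asList(
                new MethodDecl("getName", "String"),
                new MethodDecl("getLabel", "String"),
                new MethodDecl("getSize", "int"));
        // The buggy return statement invokes getName(): getLabel is a donor.
        System.out.println(donorMethodNames(methods, "getName", "String",
                Arrays.<String>asList()));    // prints [getLabel]
    }
}
```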

Note: We remind the reader that in this study we do not apply any specific patch prioritization strategy. We search for donor code in the AST of the local file that contains the suspicious statement, traversing each node from the first child node to the last leaf node in a breadth-first manner (i.e., left-to-right and top-to-bottom). In case of multiple donor code options for a given fix pattern, the candidate patches are generated (each with a specific piece of donor code) following the positions of the donor code in the AST of the local file; a minimal sketch of this traversal follows.
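
The sketch below shows this breadth-first ordering, again over a simplified tree type we introduce for illustration rather than the actual parser AST:

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

public class BfsOrder {
    static class Node {
        final String label;
        final List<Node> children = new ArrayList<>();
        Node(String label) { this.label = label; }
    }

    // Left-to-right, top-to-bottom traversal: nodes closer to the root come
    // first, and siblings are visited in their source-code order.
    static List<Node> breadthFirst(Node root) {
        List<Node> order = new ArrayList<>();
        Deque<Node> queue = new ArrayDeque<>();
        queue.add(root);
        while (!queue.isEmpty()) {
            Node n = queue.poll();
            order.add(n);
            queue.addAll(n.children);   // enqueue children left-to-right
        }
        return order;                   // donor candidates follow this order
    }

    public static void main(String[] args) {
        Node root = new Node("CompilationUnit");
        Node m1 = new Node("Method#1");
        Node m2 = new Node("Method#2");
        root.children.add(m1);
        root.children.add(m2);
        m1.children.add(new Node("ReturnStatement"));
        for (Node n : breadthFirst(root)) {
            System.out.println(n.label);
        }
        // Prints: CompilationUnit, Method#1, Method#2, ReturnStatement
    }
}
```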

Patch Validation:

Once a patch candidate is generated, it is applied to the buggy program and validated against the test suite. If it makes the buggy program pass all test cases successfully, the patch candidate is considered a plausible patch, and PARFixMiner stops trying other patch candidates for this bug. Otherwise, the pattern matching and patch generation steps are repeated until the entire list of suspicious code locations is processed. Specifically, we consider only the first generated plausible patch for each bug when evaluating correctness. For all plausible patches generated by PARFixMiner, we further manually check the equivalence between these patches and the oracle patch provided in Defects4J. If a patch is semantically similar to the developer-provided fix, we consider it a correct patch; otherwise, it remains merely plausible.
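
Schematically, this validation loop can be summarized as below. The interfaces are hypothetical placeholders we introduce for exposition, not the actual PARFixMiner API:

```java
import java.util.List;
import java.util.Optional;

public class PatchValidation {
    interface Program { Program copy(); }
    interface Patch { void applyTo(Program p); }
    interface TestSuite { boolean allTestsPass(Program p); }

    // Returns the first candidate that makes the whole test suite pass
    // (a plausible patch); the search for this bug then stops.
    static Optional<Patch> firstPlausible(Program buggy,
                                          List<Patch> candidates,
                                          TestSuite tests) {
        for (Patch candidate : candidates) {
            Program patched = buggy.copy();
            candidate.applyTo(patched);
            if (tests.allTestsPass(patched)) {
                return Optional.of(candidate);
            }
        }
        return Optional.empty();  // move on to the next suspicious location
    }
}
```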

Oracle:

We use the Defects4J (Just et al. 2014) dataset, which is widely used as a benchmark for Java-targeted APR research (Martinez and Monperrus 2016; Le et al. 2016a; Chen et al. 2017; Martinez et al. 2017). The dataset contains 357 bugs with their corresponding developer fixes and test cases covering the bugs. Table 12 details statistics on the benchmark.

Results:

Overall, we implemented the 31 fix patterns (i.e., Action patterns) that FixMiner mined from the top-50 clusters (in terms of size).

We compare the performance of PARFixMiner against 13 state-of-the-art APR tools which have also used the Defects4J benchmark to evaluate their repair performance. Table 13 presents the comparative results in terms of numbers of plausible (i.e., passing all the test cases) and correct (i.e., eventually manually validated as semantically similar to the developer-provided fix) patches. Note that although the HDRepair manuscript reports 23 bugs for which “correct” fixes are generated (among which a correct fix is ranked first for 13 bugs), the authors labeled fixes as “verified ok” for only 6 bugs (see the artefact page). We consider these 6 bugs in our comparison.

Table 12 Details of the benchmark
Table 13 Number of bugs fixed by different APR tools

Overall, we find that PARFixMiner successfully repaired 26 bugs from the Defects4J benchmark by generating correct patches. This performance is, to date, surpassed only by SimFix (Jiang et al. 2018), which was developed concurrently with PARFixMiner.

Nevertheless, while these tools generate more correct patches than PARFixMiner, they also generate many more plausible patches which are not correct. In order to comparatively assess the different tools, we resort to a precision metric (P), which estimates the probability that a generated patch is correct. P(%) is defined as the ratio of the number of bugs for which a correct fix is generated first (i.e., before any other plausible patch) to the number of bugs for which a plausible patch is generated first. For example, 81% of PARFixMiner’s plausible patches are actually correct, while this is the case for only 63% and 60% of the plausible patches of ELIXIR and SimFix, respectively. To date, only CapGen (Wen et al. 2018) yields patches with a slightly higher probability (84%) of being correct. The high performance of CapGen confirms the intuition that context awareness, which we provide with the Rich Edit Script, is essential for improving patch correctness.
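
In other words, denoting by $N_{cf}$ the number of bugs for which a correct patch is generated first, and by $N_{pf}$ the number of bugs for which a plausible patch is generated first,

$$P = \frac{N_{cf}}{N_{pf}} \times 100\%$$

As a sanity check, the 81% precision reported for PARFixMiner is consistent with its 26 correct patches corresponding to 32 bugs for which a plausible patch is generated first, since $26/32 \approx 81\%$.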

Table 14 enumerates the 128 bugs that have been fixed (correctly or plausibly) in the literature; 89 of them can be correctly fixed by at least one APR tool, and PARFixMiner generates correct patches for 26 of them. Among the bugs in the used version of the Defects4J benchmark, 267 bugs have not yet been fixed by any tool in the literature, which remains a major challenge for automated program repair research.

Table 14 Defects4J bugs fixed by different APR tools

Finally, we find that, thanks to its automatically mined patterns, PARFixMiner is able to fix six (6) bugs which have not been fixed by any state-of-the-art APR tool (cf. Fig. 18).

Fig. 18 Overlap of the correct patches generated by PARFixMiner and other APR tools

6 Discussion and threats to validity

Runtime performance

To run the experiments with FixMiner, we leveraged a computing system with 24 Intel Xeon E5-2680 v3 cores (2.5 GHz per core) and 3 TB of RAM. The construction of the Rich Edit Scripts took about 17 minutes. Rich Edit Scripts are cached in memory to reduce disk access during the computation of identical trees. Nevertheless, we recorded that comparing 1,108,060 pairs of trees took about 18 minutes.

Threats to external validity

The selection of our bug-fix datasets carries some threats to external validity, which we have limited by considering well-known projects and heuristics used in previous studies. We also made our best effort to link commits to the bug reports tagged by developers. Some false positives may nonetheless be included if one considers a strict and formal definition of what constitutes a bug.

Threats to construct validity

Such threats arise when checking the compatibility of FixMiner’s patterns against the patterns used by literature APR systems. Indeed, for the comparison, we do not conduct an exact mapping where all elements must be identical, given that literature patterns can be more abstract than the ones yielded by FixMiner. For example, Modify Method Name (i.e., FP10.1) is a sub-pattern of Mutate Method Invocation Expression (i.e., FP10), which replaces the method name of a method invocation expression with another appropriate method name (Liu et al. 2019). This fix pattern can be matched to any statement that contains a method name under a method invocation expression. The similar fix patterns yielded by FixMiner, however, carry more context information. Therefore, we take context information into account when checking the compatibility of FixMiner’s patterns against the patterns used by literature APR systems. For example, the fix pattern shown in Fig. 17 modifies the buggy method name of a method invocation expression that sits inside a return statement. As the context information refers to a Return-Statement, the fix pattern shown in Fig. 17 is considered compatible with Mutate Return Statement (i.e., FP12). Nevertheless, the mapping is conservative, in the sense that we consider that a FixMiner pattern matches a pattern from the literature as long as it can fit within the literature pattern.

7 Related work

Automated program repair

Patch generation is one of the key tasks in software maintenance, since it is time-consuming and tedious. If this task is automated, the cost and time spent by developers on maintenance will be dramatically reduced. To address this issue, many automated techniques have been proposed for program repair (Monperrus 2018). GenProg (Le Goues et al. 2012b), which leverages genetic programming, is a pioneering work on program repair. It relies on mutation operators that insert, replace, or delete code elements. Although these mutations can create only a limited number of variants, GenProg could fix several bugs automatically (in their evaluation, test cases were passed for 55 out of 105 real program bugs), although most of these patches were later found to be incorrect. PACHIKA (Dallmeier et al. 2009) leverages object behavior models. SYDIT (Meng et al. 2011) and LASE (Meng et al. 2013) automatically extract edit scripts from program changes. While several techniques have focused on fixability, Kim et al. (2013) pointed out that patch acceptability should be considered in program repair as well: automatically generated patches often have nonsensical structures and logic, even though they fix program bugs with respect to program behavior (i.e., w.r.t. test cases). To address this issue, they proposed PAR, which leverages manually-crafted fix patterns. Similarly, Long and Rinard proposed Prophet (Long and Rinard 2016), and Long et al. proposed Genesis (Long et al. 2017), which generate patches by leveraging fix patterns extracted from the history of changes in repositories. Recently, several approaches leveraging deep learning (Bhatia and Singh 2016; Gupta et al. 2017) have been proposed for learning to fix bugs. Even recent APR approaches that target bug reports rely on fix templates to generate patches: iFixR (Koyuncu et al. 2019) is such an example, building on top of the templates of TBar (Liu et al. 2019). Overall, we note that the community is moving in the direction of implementing repair strategies based on fix patterns or templates. Our work is thus essential in this direction, as it provides a scalable, accurate and actionable tool to mine such relevant patterns for automated program repair.

Code differencing

Code differencing is an important research and practice concern in software engineering. Although commonly used by human developers in manual tasks, differencing at the text line granularity (Myers 1986) is generally unsuitable for automated analysis of changes and their associated semantics. AST differencing work has benefited in the last decade from the extensive investigations that the research community has performed on general tree differencing (Bille 2005; Chawathe et al. 1996; Chilowicz et al. 2009; Al-Ekram et al. 2005). ChangeDistiller (Fluri et al. 2007) and GumTree (Falleri et al. 2014) constitute the current state of the art for AST differencing in Java. In this work, we have selected GumTree as the base tool for the computation of edit scripts, as its results have been validated by humans and it has been shown to produce more accurate and fine-grained edit scripts. Nevertheless, we have further enhanced the yielded edit scripts with an algorithm that keeps track of contextual information. Our approach echoes recently published work by Huang et al. (2018): their CLDIFF tool similarly enriches the AST produced by GumTree to enable the generation of concise code differences. The tool, however, was not available at the time of our experiments. Thus, to satisfy the input requirements of our fix pattern mining approach, we implemented Rich Edit Script, which enriches GumTree-yielded edit scripts by retaining more contextual information.

Change patterns

The literature includes a large body of work on mining change patterns.

Mining-based approaches

In recent years, several approaches have built upon the idea of mining patterns or leveraging templates. Fluri et al. (2008), based on edit scripts computed by their ChangeDistiller AST differencer, have used hierarchical clustering to discover unknown change types in three Java applications. They have, however, limited themselves to changes implementing the 41 basic change types that they had previously identified (Fluri and Gall 2006). Kreutzer et al. (2016) have developed C3 to automatically detect groups of similar code changes in code repositories with the help of clustering algorithms. Martinez and Monperrus (2015) assessed the relationship between the types of bug fixes and automatic program repair. They performed extensive large-scale empirical investigations on the nature of human bug fixes, based on fine-grained abstract syntax tree differences computed by ChangeDistiller. Their experiments show that the mined models are more effective for driving the search than random search. Their models, however, remain at a high level and may not carry any actionable patterns usable by other template-based APR systems. Our work, in contrast, targets systematizing and automating the “mining of actionable fix patterns” to feed pattern-based program repair tools.

An example application is the work by Livshits and Zimmermann (2005), who discovered application-specific repair templates by using association rule mining on two Java projects. More recently, Hanam et al. (2016) have developed the BugAID technique for discovering the most prevalent repair templates in JavaScript, using AST differencing and unsupervised learning algorithms. Our objective is similar to theirs; we focus on Java programs and on patterns at different abstraction levels. FixMiner builds on a three-fold clustering strategy in which we iteratively discover recurrent changes while preserving the surrounding code context.

Studies on code change redundancies

A number of empirical studies have confirmed that code changes are repeatedly performed in software code bases (Kim and Notkin 2009; Kim et al. 2006; Molderez et al. 2017; Yue et al. 2017). The same changes are prevalent because multiple occurrences of the same bug require the same change. Similarly, when an API evolves, or when migrating to a new library/framework, all calling code must be adapted with the same collateral changes (Padioleau et al. 2008). Finally, code refactoring or routine code cleaning can lead to similar changes. In a manual investigation, Pan et al. (2009) identified 27 extractable repair templates for Java software. Among other findings, they observed that if-condition changes are the most frequently applied to fix bugs. Their study, however, does not discuss whether most bugs are related to if-conditions or not, which matters since it clarifies the context in which if-related changes are performed. Nguyen et al. (2010) have empirically found that 17-45% of bug fixes are recurring. Our focus in this paper is to provide a tool-supported, automated approach to inferring change patterns from a dataset, deriving repair patterns that guide APR mutations. Moreover, our patterns are less generic than the ones in previous works (e.g., Pan et al. (2009) and Nguyen et al. (2010)).

Concurrently to our work, Jiang et al. proposed SimFix (Jiang et al. 2018) and Wen et al. proposed CapGen (Wen et al. 2018), which implement a similar idea of leveraging code redundancies and contextual information to shape the program repair space. In FixMiner, however, the pattern mining phase is independent of the patch generation phase, and the resulting patterns are tractable and reusable as input to other APR systems.

Generic and semantic patch inference

FixMiner ultimately aims at finding generic patches that automated program repair can leverage to correctly update a collection of buggy code fragments. This problem has recently been studied by approaches such as spdiff (Andersen and Lawall 2010; Andersen et al. 2012), which works on the inference of generic and semantic patches. This approach, however, is known to scale poorly, and is constrained to producing ready-to-use semantic patches that can be used by the Coccinelle matching and transformation engine (Brunel et al. 2009). A number of prior works have also tried to detect and summarize program changes. A seminal work by Chawathe et al. (1996) describes a method to detect changes to structured information based on an ordered tree and its updated version. The goal was to derive a compact description of the changes through the notion of a minimum-cost edit script, which has been used in the recent ChangeDistiller and GumTree tools. The representations of edit operations, however, are often either overfitted to a particular code change, or so loosely abstracted that they cannot be easily instantiated. Neamtiu et al. (2005) proposed an approach for identifying changes, additions and deletions of C program elements based on structural matching of syntax trees: two trees that are structurally identical but differ in their nodes are considered to represent matching program fragments. Kim et al. (2007) later proposed a method to infer “change-rules” that capture many changes; these generally express changes related to program headers (method headers, class names, package names, etc.). Weissgerber et al. (2006) have also proposed a technique to identify likely refactorings in the changes that have been performed in Java programs. Overall, these generic patch inference approaches address the challenge of how mined patterns will be leveraged in practice. Our work goes in that direction by yielding different kinds of patterns for different purposes: Shape patterns reduce the context of code to match; Action patterns correspond to the fix patterns used in the repair community; Token patterns are used for inferring collateral evolutions.

8 Conclusion

We have presented FixMiner, a systematic and automated approach to mine relevant and actionable fix patterns for automated program repair. The approach builds on an iterative, three-fold clustering strategy, where each round forms clusters of identical trees representing recurrent patterns.

We have assessed the consistency of the mined patterns with patterns in the literature. We further demonstrated, through the implementation of an automated repair pipeline, that the patterns mined by our approach are relevant for generating correct patches for 26 bugs in the Defects4J benchmark. These correct patches correspond to 81% of all plausible patches generated by the tool.

Availability

All the data and tool support is available at:

https://github.com/SerVal-DTF/fixminer-core.