1 Introduction

Software cannot be seen or touched, but it has a physical existence. With software embedded into many devices today, software failures have caused not only inconveniences but also tragedies, such as the deaths of patients due to a massive overdose caused by an avoidable error in a radiation therapy machine (Kaner et al. 2008). A more recent case is Google’s self-driving cars (controlled by software), which experienced 272 failures in less than a year. These failures would have resulted in at least 13 crashes, killing their human drivers, had the drivers not intervened (Harris 2016). Software failures are also the cause of massive economic losses, costing the global economy $41 billion annually (Software 2013). Repairing software faults, however, is becoming an extremely difficult and expensive task – constituting up to 90% of software expenses (Le Goues et al. 2013) – due to the increasing complexity and size of software systems. A modern car, for example, has 100 million lines of code, and this number is expected to increase to 200-300 million in the near future (Charette 2009). Hence the critical task of software repair must be automated.

Automated Program Repair (APR) has been identified as the grand challenge in software engineering research (Mark Harman 2018). Many APR methods have shown promising results in fixing bugs with minimal, or even no, human intervention (Le Goues et al. 2012a; Le Goues et al. 2012b; Martinez and Monperrus 2015; Xuan et al. 2017). Although many studies have introduced various APR techniques (APRTs), much remains to be learned about what makes a particular technique work well (or not) for a specific software system (Anand et al. 2013). The effectiveness of APRTs is likely to be problem dependent, which calls for an analysis of the software characteristics that impact their effectiveness, in order to help practitioners select the most appropriate technique for their software system.

In addition, results claiming the superior performance of an APRT over other techniques on a selected set of software systems may not generalise to untested systems. It is likely that there are software systems where an APRT excels because it exploits particular characteristics of the buggy program. Thus, an understanding of the conditions under which an APRT can be expected to succeed or fail is essential; however, such an understanding is rarely included in published studies. The aim of this paper is to address the issue of objective assessment of APRTs, and we achieve this by answering the following research questions:

  • RQ1 What impacts the effectiveness of APRTs? - Research introducing new APRTs or experimental studies investigating the performance of different techniques is usually based on a carefully selected set of buggy programs. These works offer little insight into the characteristics of the buggy programs and how they correlate with the effectiveness of APRTs. The overwhelming majority of published work in APR describes only the benefits of the newly introduced technique, while just a few studies mention limitations or present negative results.

    Certain limitations of APRTs have previously been discussed in the literature, such as the issue of patch overfitting (a patch that, while valid according to the correctness oracle, is nevertheless incorrect and may introduce new bugs that the oracle cannot capture). On the other hand, negative results explaining why certain techniques cannot repair certain bugs have not been investigated in the literature so far. In this paper, we aim to find out whether particular features of a buggy program correlate with the effectiveness of APRTs, thus providing insights into why some techniques might be more or less suited to certain software and bug instances. We obtain these insights by proposing a new method for analysing the effectiveness of APRTs.

  • RQ2 Are APR datasets significantly different? - Most research in APR uses well-known datasets, such as Defects4J, which can result in techniques being tailored towards solving particular problems and, as a result, not generalising well to other problems. In this paper, we aim to show how different these datasets are in terms of the features that have an impact on the effectiveness of existing APRTs. This allows us to understand whether existing benchmarks are sufficiently diverse for stress-testing the effectiveness of APR techniques.

  • RQ3 How can we select the most suitable APR technique? - The final aim of this research is to investigate the effectiveness of Machine Learning techniques for APRT selection. We investigate different multi-label classification techniques and report their effectiveness in terms of recall, precision and f1-score.

To answer these research questions, we introduce a new approach, E-APR, which characterises both the strengths and weaknesses of existing APR techniques. E-APR is inspired by earlier work on instance space analysis in the area of machine learning (Muñoz et al. 2018) and search based software testing (SBST) (Oliveira et al. 2018, 2019). These approaches are concerned with the problem of objective performance evaluation of different algorithms used in machine learning (Muñoz et al. 2018) and SBST (Oliveira et al. 2018, 2019), and the impact of the choice of problem instances. The methodology used in these studies extends Rice’s framework (1976) with the aim of gaining insights into why some algorithms might be more or less suited to certain problem instances.

E-APR extends the methodology from Oliveira et al. (2018, 2019) and Muñoz et al. (2018) to the automated program repair problem. E-APR allows for a more objective assessment of existing APR techniques, and helps in understanding why certain APR techniques cannot generate plausible patches for certain bugs. We apply our framework to a large study of 2,141 bugs from 130 projects, and 23,551 repair attempts. E-APR uses software and bug features to characterise the buggy program instances, and learns which features have an impact on the effectiveness of APRTs. Software repair is challenging even for human programmers: while some bugs can be trivially fixed, many of us can remember a bug that took hours, if not days or weeks, to understand and fix (Eisenstadt 1997). The approach we devise gives insights into how an APR technique can be selected to automatically fix bugs.

2 The E-APR Framework

The E-APR framework has two main goals:

  • to help designers of APRTs gain insight into why some techniques might be more or less suited to repair certain buggy programs, thus devising new and better techniques that address any challenging areas, and

  • to help software developers select the most effective APRT for their software system.

E-APR provides a means of objectively assessing the overall effectiveness of an APR technique. It is based on previous work on instance space analysis and algorithm selection in the area of Search-Based Software Testing (Oliveira et al. 2018, 2019), machine learning (Muñoz et al. 2018), and optimisation (Smith-Miles and Tan 2012). The concept of instance space analysis was first introduced by Smith-Miles in her seminal work examining the strengths and weaknesses of algorithms for optimisation problems, and it forms the foundation of the E-APR approach. Understanding the effectiveness of an APR technique is critical for selecting the most suitable technique for a particular buggy program, thus avoiding trial-and-error application of APR techniques.

An overview of the E-APR framework is presented in Figure 1. E-APR starts with a set of buggy programs p ∈ P and a portfolio of APRTs t ∈ T. The performance of APRTs is measured for each buggy program as y(t, p), which indicates whether a plausible patch has been found for that program. The first step of E-APR is to identify the significant features of buggy programs (f(p) ∈ F) that have an impact on how easy or hard they are for a particular APRT. Next, E-APR constructs the APRT footprints (g(f(P)) ∈ ℝ2), which indicate the area of strength of each APRT. Finally, E-APR applies machine learning techniques on the most significant features to learn a model that can be used for APRT selection in future applications.
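To make the notation of Figure 1 concrete, the following minimal Python sketch (with made-up program names, feature values and repair outcomes, not taken from the E-APR implementation) shows the four spaces of Rice’s framework as instantiated by E-APR: the problem space P of buggy programs, the feature space F, the algorithm portfolio T, and the performance space Y.

```python
# Minimal sketch of the E-APR / Rice-framework spaces (illustrative data only).

# Problem space P: buggy programs, each described by a feature vector f(p) in F.
features = {
    "Chart-11": {"MOA": 0.42, "CAM": 0.61, "AMC": 18.3},
    "Math-50":  {"MOA": 0.10, "CAM": 0.85, "AMC":  9.7},
}

# Algorithm portfolio T.
aprts = ["jGenProg", "Nopol", "Cardumen"]

# Performance space Y: y(t, p) = 1 if APRT t produced a plausible patch for p.
y = {
    ("jGenProg", "Chart-11"): 1, ("Nopol", "Chart-11"): 0, ("Cardumen", "Chart-11"): 1,
    ("jGenProg", "Math-50"):  0, ("Nopol", "Math-50"):  1, ("Cardumen", "Math-50"):  0,
}

def best_aprt(program: str) -> str:
    """Selection mapping: pick the APRT with the best recorded performance on p.
    In E-APR this mapping is learned from the features f(p); here we look it up."""
    return max(aprts, key=lambda t: y[(t, program)])

print(best_aprt("Math-50"))  # -> "Nopol"
```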

Fig. 1: An overview of E-APR

2.1 Buggy Programs

Buggy Programs, defined in Figure 1 as p ∈ P, are software instances used by researchers to evaluate automated program repair techniques. Most of the APRTs for Java use Defects4J (Just et al. 2014).

Durieux et al. (2019) is one of the few studies that uses 5 peer-reviewed Java bug benchmarks: Bears (Madeiral et al. 2019), Bugs.jar (Saha et al. 2018), IntroClassJava, QuixBugs (Lin et al. 2017), and Defects4J (Just et al. 2014). Our analysis is based on the experimental data generated by Durieux et al. (2019), which is available at github.com/program-repair/RepairThemAll_experiment.

In total, we consider 2,141 bugs from 130 projects, and 23,551 repair attempts. A repair attempt is the execution of an APRT on a buggy program. The execution of all repair attempts on the 5 benchmarks by the 11 APRTs took 314 days (Durieux et al. 2019). The patches considered in this study are plausible patches (Qi et al. 2015): patches that make a) the failing test cases (that exposed the bug) pass, and b) the remaining test cases continue to pass. Previous work has shown that a test-suite adequate patch can pass all tests and yet be incorrect. Such patches are known as overfitting patches (Smith et al. 2015) and can arise due to the weakness of the test suite used for synthesising the patches. Overfitting detection is not yet mature (i.e., not capable of detecting all overfitting patches), and adopting such techniques could introduce bias in this work; hence we consider all patches generated by the repair tools executed by RepairThemAll. This means that we did not filter the outputs generated by APRTs.
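As a minimal illustration of the plausibility criterion, the sketch below checks conditions a) and b) for a candidate patch, assuming the test outcomes before and after applying the patch are available as simple name-to-result dictionaries (this helper is illustrative and not part of RepairThemAll).

```python
def is_plausible(before: dict, after: dict) -> bool:
    """A patch is plausible if (a) every originally failing test now passes and
    (b) every originally passing test still passes.
    `before`/`after` map test names to True (pass) or False (fail)."""
    originally_failing = {t for t, passed in before.items() if not passed}
    fixed = all(after[t] for t in originally_failing)                        # a)
    no_regression = all(after[t] for t, passed in before.items() if passed)  # b)
    return fixed and no_regression

before = {"testBug": False, "testOk1": True, "testOk2": True}
after  = {"testBug": True,  "testOk1": True, "testOk2": True}
print(is_plausible(before, after))  # True: plausible, though possibly overfitting
```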

The sources of the bugs in the benchmarks are diverse: Defects4J and Bugs.jar contain real bugs extracted from software repositories, Bears contains real bugs collected from breaking builds on the Travis platform, IntroClassJava contains buggy programs written by students, and QuixBugs contains buggy implementations of well-known algorithms (such as merge sort).

2.2 APR Techniques

APR techniques are defined in Figure 1 as t ∈ T. In this paper we focus on one family of repair approaches: test-suite based repair approaches (Le Goues et al. 2012a). Approaches from this family aim at repairing bugs exposed by at least one failing test case. The main idea of these approaches is to use failing test cases to localise potential faults and then apply mutations to the source code until the program satisfies all unit test cases. The mutations that are applied to the program code can range from small changes, like the modification, addition or removal of a single code line (Le Goues et al. 2012a), to complex edit operations (Martinez and Monperrus 2015; Kim et al. 2013), which are mined from software repositories and used to fix a fault in a different context.

In this paper we employ 11 repair tools for Java programs, as in the study by Durieux et al. (2019). These tools can be classified into semantics-based repair tools (Nopol (Xuan et al. 2017) and DynaMoth (Durieux and Monperrus 2016a)), a metaprogramming-based tool (NPEFix (Durieux et al. 2017)), and generate-and-validate tools (ARJA (Yuan and Banzhaf 2018), Cardumen (Martinez and Monperrus 2018), jGenProg (Martinez and Monperrus 2016), GenProg-A (Yuan and Banzhaf 2018), jKali (Martinez et al. 2017a), Kali-A (Yuan and Banzhaf 2018), jMutRepair (Martinez and Monperrus 2016), and RSRepair-A (Yuan and Banzhaf 2018)).

jGenProg and GenProg-A are two Java implementations of GenProg (Le Goues et al. 2012a). Both techniques use a generate-and-validate method to produce patches using a genetic programming approach. The search space consists of patches that are formed through combinations of removing code, and inserting and replacing code from elsewhere in the program under repair (Martinez and Monperrus 2016).

Cardumen (Martinez and Monperrus 2018) synthesises patches using the existing code as a basis, by taking code elements from elsewhere in the program and replacing the variables. Each potential patch is filtered based on location and type compatibility, and the remaining patches are prioritised based on how frequently the selected variables occur together.

jKali and Kali-A (Qi et al. 2015) are different Java implementations of Kali. They attempt to come up with candidate patches by removing or skipping statements. Neither jKali nor Kali-A is a ‘repair’ program; instead, they are more useful in identifying weak test suites and under-specified bugs (Martinez et al. 2017b). Since Kali simply removes or skips code, if a patch is found, it is a strong indication that the functionality of the removed code is not specified in the test suite. In addition, if Kali finds a test-suite adequate patch, so can jGenProg or Nopol (Martinez et al. 2017b); the patches found by Kali, however, rarely work beyond the given test suite.

jMutRepair (Martinez and Monperrus 2016) performs an exhaustive search of the code and applies three types of mutation operators to suspicious if conditions: the relational operator mutation (==, !=, ≤, ≥, <, >), the logical operator mutation (AND, OR), and the unary operator mutation, which applies negation and positivation.
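The following deliberately simplified, string-level sketch illustrates the relational operator mutation; jMutRepair itself works on the program’s AST rather than on raw text, so the helper below is purely illustrative.

```python
# Simplified sketch of relational-operator mutation on a suspicious if condition.
RELATIONAL = ["==", "!=", "<=", ">=", "<", ">"]

def mutate_condition(condition: str) -> list[str]:
    """Return all single-operator mutants of the first relational operator found."""
    for op in RELATIONAL:
        if op in condition:
            return [condition.replace(op, new_op, 1)
                    for new_op in RELATIONAL if new_op != op]
    return []

print(mutate_condition("i < list.size()"))
# ['i == list.size()', 'i != list.size()', 'i <= list.size()', ...]
```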

Nopol (Xuan et al. 2017) focuses on repairing IF conditions, which are amongst the most error-prone elements of Java programs, and many one-change commits simply update an IF condition. Nopol has three main steps. First, it locates a fix location for a potential patch using “angelic fix localisation”. This process also involves finding “angelic values”, which are values that can be assigned at the fix location to make all failing tests pass. Next, Nopol collects runtime data from a test execution, including a snapshot of the program state at candidate fix locations. Then, Nopol translates the angelic values and the variables available at the fix location into a Satisfiability Modulo Theories (SMT) problem, and attempts to find a solution, which is then translated into a patch.

RSRepair-A (Qi et al. 2014) is a Java implementation of the RSRepair program repair tool written for C programs. RSRepair uses a generate-and-validate technique to prepare patches. It takes inspiration from the GenProg tool, however, instead of using genetic programming as its search method, RSRepair uses random search.

ARJA (Yuan and Banzhaf 2018) uses Genetic Programming to modify and mutate suspicious statements in a program by performing three actions: i) deleting the suspicious statement, ii) replacing the suspicious statement, or iii) inserting extra statements before or after the suspicious statement. ARJA reduces the scope of the search and computation time to speed up the fitness process by applying rules that exclude statements that are not related to the problem (Yuan and Banzhaf 2018).

NPEFix (Durieux et al. 2017) repairs null pointer exceptions at runtime by using two strategies. The first strategy assigns an alternative value (which can be a valid value that is stored in another variable or a random value) for a null dereference. The second strategy skips the execution of the null dereference, by either skipping a single statement or skipping the complete method. All strategies are applicable for any arbitrary objects, including instances of library classes, and instances of domain classes.

In summary, the APR techniques discussed in this section can be broadly categorised based on their high-level repair strategy. For example, jGenProg (Martinez and Monperrus 2016), ARJA (Yuan and Banzhaf 2018) and RSRepair-A (Qi et al. 2014) use or build upon genetic programming. Other techniques take more specialised approaches and are designed to target specific bugs, like NPEFix (Durieux et al. 2017), which targets null pointer exceptions. Other repair tools can only function if code is structured in a certain way, like Nopol (Xuan et al. 2017), which only works when IF conditions are present, and will only find a valid patch if the patch involves changing IF conditions. These observations further support our hypothesis that the performance of each technique is likely to be affected by the features of the code: different repair strategies may favour different code features, and techniques that target a specific type of bug will perform badly on code containing a different type of bug.

2.3 APRT Performance Measures

An APRT performance measure y(t, p) ∈ Y takes as input the patches generated by an APRT t ∈ T for a particular buggy program p ∈ P. There exist various measures of APRT performance focusing on the quality of the patches produced. In this work we consider test-suite adequate patches (Le Goues et al. 2012a). We acknowledge that a portion of the patches may be overfitting, i.e., according to the test suite, the buggy program may appear to have been fixed by the patch, while new errors may have been introduced. The problem of filtering correct patches (e.g., (Yu et al. 2019; Xiong et al. 2018; Xin and Reiss 2017; Le et al. 2019)) is currently being addressed by many researchers, who are looking at ways of automating or semi-automating this process, since manually inspecting all patches generated by automated program repair techniques is not practical. The APRT performance measures in E-APR (y(t, p) ∈ Y) can easily be extended to new measures, such as patch correctness.

2.4 Significant Features

A critical step of E-APR is identifying features of buggy program instances f(p) ∈ F that have an impact on the effectiveness of APR techniques. Features are problem dependent and must be chosen such that the varying complexities of the buggy program instances are exposed, any known structural properties of the software systems are captured, and any known advantages and limitations of the different APRTs are related to features.

For the purpose of this work, an APR technique is effective if it can generate a plausible patch for a buggy software system. While much is known and reported on features that correlate with software quality, we must consider that there may be other unknown features that have an impact on the effectiveness of APR techniques. In addition, it is possible that not all known features are useful for our goal of separating the hard and easy software instances. The candidate set of features may contain redundancy, with features measuring aspects of a buggy program that are either similar or not relevant to expose the hardness of the APR task itself. Thus, a small set of relevant features must be selected.

Learning significant features has two steps: first, we define how to measure the quality of a particular set of features; second, we apply a Genetic Algorithm to select the set that maximises this measure. A subset of features is considered to be of high quality if it results in an instance space – defined by the 2-dimensional projection of the feature subset – in which buggy programs with similar APRT performance are located close to each other. The best subset of features is the one that can best discriminate between easy and hard buggy program instances for APR techniques.

E-APR aims at identifying features that are able to create a clear separation of the buggy program instances, such that we can clearly see the different clusters of buggy programs where each APRT is effective. We employ principal component analysis (PCA) (Jolliffe 2011) to locate significant features. PCA learns linear combinations of the buggy program features. The first PC is the linear combination of the variables which explains the maximum amount of variance in the dataset. Each subsequent PC is orthogonal to all previously calculated PCs and captures the maximum variance under these conditions. In our work, the subset of variables that have large coefficients, and therefore contribute significantly to the variance of each PC, are identified as the significant features which are selected to explain bugs.

Given |F| software features, we can have at most |F| components, which are estimated in decreasing order of the variance (measured through the eigenvalue of each PC) they explain in the dataset. For each PC, we analyse the features that are found to be significant. This shows which dimensions are the main drivers of APR technique effectiveness and helps explain why this is the case. In PCA, usually only the first few components are regarded as important. We retain the first 2 components, which makes visualising the footprints of the algorithms much easier.
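The sketch below (using scikit-learn, with a random matrix standing in for the real feature table) illustrates this step: standardise the features, fit a PCA, keep the first two components, and read off which features load most heavily on each component.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

feature_names = ["MOA", "CAM", "AMC", "PMC", "AECSL",
                 "SPTWNG", "CVNI", "VCTC", "PUIA"]

rng = np.random.default_rng(0)
X = rng.normal(size=(1282, len(feature_names)))   # placeholder for the real feature table

pca = PCA(n_components=2)
Z = pca.fit_transform(StandardScaler().fit_transform(X))   # 2-D instance space

print("instance space shape:", Z.shape)
print("explained variance:", pca.explained_variance_ratio_.sum())
for i, pc in enumerate(pca.components_, start=1):
    top = sorted(zip(feature_names, pc), key=lambda t: abs(t[1]), reverse=True)[:3]
    print(f"PC{i} dominated by:", top)
```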

E-APR uses a genetic algorithm (Aleti et al. 2014) to search the space of possible subsets of k features, with the classification accuracy on an out-of-sample test set used as the fitness function to guide the search for the optimal subset. The instance space is generated in iterations, until an optimal subset of features is found (Muñoz et al. 2018). The genetic algorithm performs the following steps to select the features and generate the instance space:

  1. a set of buggy program features is selected;

  2. an instance space is generated using the selected features and PCA to reduce the dimensionality;

  3. the fitness of the set of features is evaluated;

  4. if the features are not adequate, go back to step 1.

Once the best set of features is identified, E-APR creates a 2-D instance space that helps us inspect the relationships between problem instances and their features, and objectively assess APRT performance. 2-D visualisation has been found to be effective in visualising footprints (Oliveira et al. 2018, 2019; Muñoz et al. 2018), hence we follow a similar approach to previous work. Similar approaches have been proposed in the literature for feature subset selection for machine learning (Bengio and Chapados 2003), optimisation (Smith-Miles et al. 2014), and search-based software testing tasks (Oliveira et al. 2018). Certainly, other feature selection methods proposed in the literature (Guyon and Elisseeff 2003) would also be suitable for the task at hand.
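The feature-subset search can be sketched as a compact genetic algorithm over binary feature masks. In the sketch below the data is synthetic, and the fitness is the cross-validated accuracy of a simple decision tree trained on the selected features, standing in for E-APR’s full instance-space evaluation.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(1)
X, y = make_classification(n_samples=300, n_features=20, n_informative=5, random_state=1)

def fitness(mask):
    """Out-of-sample accuracy of a classifier using only the selected features."""
    if not mask.any():
        return 0.0
    clf = DecisionTreeClassifier(random_state=1)
    return cross_val_score(clf, X[:, mask], y, cv=3).mean()

pop = rng.integers(0, 2, size=(20, X.shape[1])).astype(bool)   # initial population
for _ in range(15):                                            # generations
    scores = np.array([fitness(ind) for ind in pop])
    parents = pop[np.argsort(scores)[-10:]]                    # truncation selection
    children = []
    while len(children) < len(pop):
        a, b = parents[rng.integers(10, size=2)]
        cut = rng.integers(1, X.shape[1])
        child = np.concatenate([a[:cut], b[cut:]])             # one-point crossover
        children.append(child ^ (rng.random(X.shape[1]) < 0.05))  # bit-flip mutation
    pop = np.array(children)

best = pop[np.argmax([fitness(ind) for ind in pop])]
print("selected feature indices:", np.flatnonzero(best))
```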

2.5 APRT Footprints

The idea of algorithm footprints was first introduced by Smith-Miles and Tan (2012) and aims to determine the relative performance of different algorithms across various classes of instances. In the original paper (Smith-Miles and Tan 2012), the authors focused on optimisation problems. Rather than reporting algorithm performance averaged across a chosen set of benchmark instances, the authors developed metrics for an algorithm’s performance generalised across a diverse set of instances. E-APR extends these ideas to Automated Program Repair techniques and aims to measure APRT footprints, which give an indication of the area of strength of these algorithms.

Once the significant features have been identified, they are used to analyse and visualise the footprints of the APR techniques. In order to facilitate the visualisation of the footprints, similar to previous work (Smith-Miles and Tan 2012; Oliveira et al. 2018, 2019), we utilise the 2-D instance space created using PCA as a dimensionality reduction technique, and project the instances to two dimensions, while making sure that we retain as much information as possible. PCA rotates the data to a new coordinate system ℝk, with axes defined by linear combinations of the selected F features, where k = |F|. The k new axes are the eigenvectors of the k × k covariance matrix.

We retain the two principal eigenvectors which correspond to the two largest eigenvalues of the covariance matrix. The instance space is then projected on this two-dimensional space. We use the variance explained in the data by the two principal components as a measure of the loss in information due to dimensionality reduction. Following a similar approach to previous work on dimensionality reduction (Smith-Miles et al. 2014), we accept the new two dimensional instance space as adequate if most of the variance in the data is explained by the two principal axes. The two principal components z1 and z2 are then used to visualise the footprints of the APR technique (APRT).

If our goal were only to make performance predictions on the best APR tool for repairing a particular software system, we could use machine learning algorithms to identify the relationship between software features and APR performance. Machine learning on its own, however, does not allow for explanations as to why a particular APRT works well. Our goal in this paper is much broader than only making predictions, as we aim to visualise the footprints of the different APR approaches and provide insights into the workings of these methods.

Next, we calculate the relative size of APRT footprints by estimating the area of the hull covering the software instances where the technique is expected to perform well. This is a metric of the relative goodness of the APRT across the software instance space. Formally, given the convex hull H(S) of an area defined by points S = {(xi, yi)},∀i = 1,..., n, the area A(H(S)) is given by

$$ A(H(S))=\frac{1}{2}\left(\sum\limits_{j=1}^{k-1}(x_{j}y_{j+1}-y_{j}x_{j+1})+(x_{k}y_{1}-y_{k}x_{1})\right), $$
(1)

where the subset {(xj, yj), ∀j = 1,..., k}, k ≤ n, defines the extreme points of H(S). Using (1), we compare the relative size of the footprint of each APRT to determine which APRT has the largest footprint, and explore the degree of overlap of the footprints.
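The footprint area can be computed directly from the 2-D instance coordinates with SciPy’s convex hull and the shoelace formula in (1); the coordinates below are synthetic.

```python
import numpy as np
from scipy.spatial import ConvexHull

def footprint_area(points: np.ndarray) -> float:
    """Area of the convex hull of the instances where an APRT is effective,
    via the shoelace formula over the hull's extreme points (Eq. 1)."""
    hull = ConvexHull(points)
    x, y = points[hull.vertices, 0], points[hull.vertices, 1]
    return 0.5 * abs(np.dot(x, np.roll(y, -1)) - np.dot(y, np.roll(x, -1)))

rng = np.random.default_rng(0)
effective = rng.normal(size=(200, 2))   # (z1, z2) coordinates of repaired instances
print(footprint_area(effective))        # equals ConvexHull(effective).volume in 2-D
```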

2.6 APRT Selection

In the final step, E-APR predicts, based on the most significant software features, the most effective APR technique for repairing particular buggy programs. E-APR uses the most significant features as an input to machine learning algorithms to learn the relationship between the instance features and APR method performance. For this purpose, we can use a variety of machine learning algorithms, such as decision trees, or support vector machines for binary labels (effective/ineffective), or statistical prediction methods, such as regression algorithms or neural networks for continuous labels (e.g., time complexity of the approach).

In this work, we investigate four machine learning approaches for multi-label classification (Madjarov et al. 2012). These methods are support vector machine (SVM) (Boser et al. 1992), a random forest classifier (RFC) (Prabhu and Varma 2014), a decision tree (DT) (Quinlan 1996) and a multi-layer perceptron (MLP) (Ruck et al. 1990). We now briefly describe those techniques.

Support Vector Machine (SVM)

is a supervised learning model with associated learning algorithms that analyse data for classification and regression analysis. For classification, SVM aims at finding a hyper-plane in the feature space, which separates the training data into two classes while maximising the margin (in the feature space) between this hyper-plane and the two classes (Vapnik 1995).

Decision Tree (DT)

uses observations about an item (represented in the branches) to learn an item’s target value (represented in the leaves). Classification trees are those trees where the target variable can take a discrete set of values; leaves represent class labels and branches represent conjunctions of features that lead to those class labels.

Random Forest Classifier (RFC)

is an ensemble learning method for classification that operates by constructing a multitude of decision trees at training time and outputting the class that is the mode of the classes predicted by the individual trees.

Multi-Layer Perceptron (MLP)

consists of a feed-forward artificial neural network which is a system of interconnected neurons representing a nonlinear mapping between an input vector and an output vector. MLP is used for classification by assigning output nodes to represent each class. MLPs are typically trained using a supervised learning technique called back-propagation.

At the end of this process, E-APR produces a model that can be used for algorithm selection in automated program repair. This model can be retrained and extended with more APR tools and features.
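A minimal sketch of this final step uses scikit-learn’s random forest as a multi-label classifier. The data below is synthetic, and the MLSMOTE resampling applied in our experiments (Section 4.3) is omitted for brevity.

```python
from sklearn.datasets import make_multilabel_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score, precision_score, recall_score
from sklearn.model_selection import train_test_split

# Rows: buggy programs (feature vectors); columns of Y: one label per APRT
# (1 = the tool produced a plausible patch for that program).
X, Y = make_multilabel_classification(n_samples=500, n_features=9,
                                      n_classes=11, random_state=0)
X_tr, X_te, Y_tr, Y_te = train_test_split(X, Y, test_size=0.2, random_state=0)

model = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr, Y_tr)
Y_pred = model.predict(X_te)

print("precision:", precision_score(Y_te, Y_pred, average="micro"))
print("recall:   ", recall_score(Y_te, Y_pred, average="micro"))
print("f1-score: ", f1_score(Y_te, Y_pred, average="micro"))
```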

3 Experimental Design

We implement the E-APR framework described in Section 2, and conduct a set of experiments and analysis to answer the research questions stated in Section 1. In this section, we describe: the automated program repair techniques, the benchmark of buggy programs, and the set of software features.

3.1 Buggy Program Features

Features are problem dependent and must be chosen so that the varying complexities of the problem instances are exposed, any known structural properties of the buggy programs are captured, and any known advantages and limitations of the different program repair techniques are related to features. The most common measures and metrics used to characterise features of a software system are extracted from code.

Among others, we use object-oriented code metrics based on measurement theory and the expertise of experienced software developers (Chidamber and Kemerer 1994b). These metrics are also mapped to the Quality Model for Object-Oriented Design (El-Wakil et al. 2004), which is a comprehensive, empirically validated model that assesses object-oriented design quality attributes such as understandability and reusability, and relates them through mathematical formulas to structural object-oriented design properties such as encapsulation and coupling. The set of code metrics, presented in Table 1, ranges from simple metrics, which count the number of methods or lines of code, to more complex metrics that measure the interaction between methods and the depth of the inheritance tree. As this paper focuses on Java APR, we also include Java-specific method features, which are presented in Table 2.

Table 1 Object-oriented features
Table 2 Java specific method features

In addition to code features widely used by software practitioners and researchers, we also consider a set of Observation-based features (Yu et al. 2019). These features were manually crafted to target different open challenges in the field of automated program repair, such as the prediction of source code transformations on buggy code (Yu et al. 2019) and the detection of incorrect patches (Ye et al. 2019).

Observation-based features capture different characteristics of a buggy program. Initially, Defects4J (Just et al. 2014), a dataset of real Java bugs and the corresponding human-written patches that is widely used in evaluations of automated program repair tools (Durieux et al. 2019), was considered as a starting point. We recorded the following information for each code element in the buggy code affected by the patch and in the patched code: a) the characteristics of the element (e.g., the type of a variable is primitive), and b) the relation of the element to the rest of the buggy and patched file, respectively. Finally, the designers defined a set of features from those observations. For example, from Listing 1, it was observed that the buggy statement references a variable (p1) which has a compatible type and a similar name to another variable (p2) in scope. From this observation, they created a feature named “HVSN” (Has Variable with Similar Name).

Listing 1: Human-written patch for bug Chart-11 from Defects4J

Observation-based features were included in the set of buggy program features because they allow us to capture the characteristics of code elements related to a bug that: a) can be repaired by a tool, and b) cannot be repaired by any tool.

Thus, our approach could predict whether a buggy program can be repaired (or not) by a particular repair tool, and determine which is the most adequate repair tool for addressing the bug. A simple example illustrates the intention behind the adoption of such features: the buggy version of bug Chart-11 from Listing 1 has the feature HVSN with a true value and is successfully repaired by Cardumen (Martinez and Monperrus 2018), but not by, for instance, Arja or GenProg (Durieux et al. 2019). Thus, our intuition is that other bugs having that feature could be repaired by Cardumen.

Observation-based features are grouped into three categories: 1) features related to the Usage of code elements, for example, the feature OUIA indicates if a statement references a local variable that has not been referenced in other statements before it, 2) features related to the Syntax of code elements, for example, the feature HVSN (Has Variable with Similar Name) indicates whether, given a statement that references a variable, there exist other variables in the same scope that have a similar identifier name with that variable; 3) features related to the Types of code elements, for example, the feature VTSV indicates whether, given a statement that references a variable, there exist other variables in the same scope that are type compatible with that variable.

In total, we have 146 Observation-based features. The complete list is available in our (Appendix 2020). These features can be computed using the open-source tool Coming (Martinez and Monperrus 2019), which is available online at https://github.com/SpoonLabs/coming.
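The observation-based features themselves are computed over the program’s AST by Coming; the sketch below only illustrates the intuition behind a feature such as HVSN, using a simple name-similarity check over a hypothetical list of in-scope variable names (the similarity threshold is arbitrary).

```python
from difflib import SequenceMatcher

def has_variable_with_similar_name(referenced: str, in_scope: list[str],
                                   threshold: float = 0.5) -> bool:
    """Illustrative HVSN-style check: does any other in-scope variable have a
    name similar to the one referenced by the suspicious statement?"""
    return any(SequenceMatcher(None, referenced, other).ratio() >= threshold
               for other in in_scope if other != referenced)

# In the Chart-11 example, the buggy statement references p1 while a similar,
# type-compatible variable p2 is in scope.
print(has_variable_with_similar_name("p1", ["p1", "p2", "iterator"]))  # True
```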

3.2 E-APR Input Data

For each buggy program, we first create a vector where each dimension corresponds to a particular feature. We add to that vector an additional dimension for each APRT considered in this experiment: its value is ‘1’ if the corresponding APRT produced a plausible patch and ‘0’ otherwise. Table 3 shows an example of the features extracted from 4 buggy programs. Each row contains the values of the features extracted for a program, and it is a vector of features. Columns two to five show the values of 4 object-oriented features (wmc, dit, npc and cbo). The last two columns indicate whether the buggy program could be repaired by two approaches (Kali and Arja).

Table 3 A snapshot of the dataset

To create a vector of features for each buggy program, we compute the Object-oriented and Java-specific method features, which are calculated at the class level. We then calculate the average value of these features over all classes of each buggy program. Next, we compute the Observation-based features. Instead of considering all statements from the buggy program, we focus on a subset of them: those that, with a given probability, could contain the bug. Note that, for predicting the most suitable tool given a program bug (Section 4.3), our approach does not know in which statement(s) the bug is located, nor does it have access to the human patch. For this reason we apply fault localisation to filter the statements. To retrieve those statements, we compute the suspiciousness value of each statement using the GZoltar tool (Campos et al. 2012), which uses the Ochiai formula (Abreu et al. 2007). GZoltar is the most prominent fault localisation tool used by the Java repair systems considered in this study. For each buggy program, we select the 100 most suspicious statements returned by GZoltar. We consider 100 as it is a common cut-off value used in program repair experiments, e.g., see the analysis from (Long and Rinard 2016a). If a patch in our dataset is applied to a statement not included in the list of suspicious statements returned by the fault localisation tool, we add that statement to the list, with the goal of also analysing it. We found 75 bugs with at least one such patch. Next, we compute the Observation-based features for each of those statements. Finally, we compute the average of the features that characterise the suspicious statements.
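Assuming the class-level metrics and the statement-level observation-based features are already available as tables, the averaging and labelling step can be sketched as follows (column names, tool names and values are illustrative).

```python
import pandas as pd

# Class-level OO / Java-specific metrics for one buggy program (one row per class).
class_metrics = pd.DataFrame({"wmc": [12, 7, 22], "dit": [1, 2, 3],
                              "npc": [4, 3, 9],  "cbo": [5, 2, 11]})

# Observation-based features of the most suspicious statements after fault
# localisation (one row per statement).
stmt_features = pd.DataFrame({"SPTWNG": [1, 0, 2], "VCTC": [0, 1, 1], "PUIA": [2, 2, 0]})

# One row of the E-APR input: averaged features plus one 0/1 label per APRT.
row = pd.concat([class_metrics.mean(), stmt_features.mean()])
row["Kali"] = 0   # no plausible patch produced
row["Arja"] = 1   # plausible patch produced
print(row)
```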

4 Results

We present the results for each research question, and aim to provide insights into why the different APR techniques work. First, we present the most significant features that impact APRT effectiveness. Second, we investigate the diversity of existing buggy datasets used for APR. Next, we investigate the differences between existing APRTs by analysing their strengths and weaknesses using the most significant features. Finally, we present the results from the Machine Learning algorithms used for APRT selection.

4.1 RQ1. What impacts the effectiveness of existing APRTs?

We performed feature learning on the total list of features (described in Section 3.1) that were extracted from 1,282 buggy programs. The aim is to select the best set of features that highlights the strengths and weaknesses of the APR techniques. To account for the randomness in the results, each trial of feature learning was run 10 times on each buggy program for each approach, using different random seeds, and the mean was considered. Out of the 146 features that were part of the study, E-APR identified the following 9 optimal features which best capture the difficulty in generating patches for APR:

(F1) MOA: Measure of Aggregation.

(F2) CAM: Cohesion Among Methods

(F3) AMC: Average Method Complexity

(F4) PMC: Private Method Count

(F5) AECSL: Atomic Expression Comparison Same Left indicates the number of statements with a binary expression that have more than an atomic expression (e.g., variable access). This feature belongs to Syntax category.

(F6) SPTWNG: Similar Primitive Type With Normal Guard indicates the number of statements that contain a variable (local or global) that is also used in another statement contained inside a guard (i.e., an If condition). This feature belongs to Usage category.

(F7) CVNI: Compatible Variable Not Included is the number of local primitive type variables within the scope of a statement that involves primitive variables that are not part of that statement. This feature belongs to Usage category.

(F8) VCTC: Variable Compatible Type in Condition measures the number of variables within an If condition that are compatible with another variable in the scope. This feature belongs to Type category.

(F9) PUIA: Primitive Used In Assignment measures the number of primitive variables in assignments. This feature belongs to Type category.

Using these features, we were able to define the footprints of the techniques with a topological preservation of 87% (explained variance). In essence, we can conclude the following.

[Boxed summary of the answer to RQ1]

To visualise the results in a 2-D instance space, we apply PCA as a dimensionality reduction technique on the optimal subset of features. Two new axes were created, which are linear combinations of the selected set of most significant features. The coordinate system that defines the new instance space is defined as:

$$ \begin{bmatrix} z_{1} \\ z_{2} \end{bmatrix} = \begin{bmatrix} {0.38} & {-0.02} \\ {-0.16} & {0.19} \\ {0.37}&{-0.04}\\ {-0.06}&{0.36}\\ {0.08}&{0.28}\\ {0.17}&{0.22}\\ {0.07}&{0.31}\\ {-0.34}&{0.01}\\ {0.12}&{0.16} \end{bmatrix}^{T} \begin{bmatrix} {\text{MOA}} \\ {\text{AECSL}} \\ {\text{PMC}}\\ {\text{SPTWNG}}\\ {\text{AMC}}\\ {\text{CVNI}}\\ {\text{VCTC}}\\ {\text{CAM}}\\ {\text{PUIA}} \end{bmatrix} $$
(2)

The new coordinates (depicted in (2)) are a combination of the 9 features. CAM, PMC and MOA have the highest contribution to z1, while SPTWNG, AMC and VCTC contribute the most to z2. CVNI, AECSL, and PUIA contribute roughly equally to both coordinates.
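Given the nine feature values of a buggy program, its coordinates in the instance space follow directly from (2), as in the sketch below (the feature values are made up).

```python
import numpy as np

# Projection matrix from (2); rows follow the feature order
# MOA, AECSL, PMC, SPTWNG, AMC, CVNI, VCTC, CAM, PUIA.
W = np.array([[ 0.38, -0.02],
              [-0.16,  0.19],
              [ 0.37, -0.04],
              [-0.06,  0.36],
              [ 0.08,  0.28],
              [ 0.17,  0.22],
              [ 0.07,  0.31],
              [-0.34,  0.01],
              [ 0.12,  0.16]])

f = np.array([0.4, -1.2, 0.8, 0.1, 1.5, -0.3, 0.9, -0.7, 0.2])  # illustrative feature values
z1, z2 = W.T @ f
print(z1, z2)   # coordinates of this buggy program in the 2-D instance space
```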

We plot the footprints of the 11 APRTs in Figure 2. Each point in the 2-D instance space represents a buggy program. If an APR technique produced a patch for a particular program, it is considered Effective; otherwise, we label it as Ineffective. Each graph in Figure 2 represents the footprint of one of the techniques that we study in this paper. The x-axis and y-axis are the two principal components z1 and z2, defined in (2).

Fig. 2: APR technique footprints. Each point is a buggy class, and is labelled as Effective if the technique was able to generate a plausible patch for it

A visual inspection of the footprints shows that while some techniques appear more similar than others (for example, jKali is more similar to jMutRepair than NPEFix), each technique has its unique strengths.

All APRTs apart from NPEFix repaired bugs located at the top-right of the instance space. These are bugs from the Defects4J benchmark (see Figure 4), which confirms a long-held hypothesis that APRTs are being perfected to repair bugs from this dataset.

4.1.1 Footprints Size

Table 4 shows the area size of the APRT footprints, measured using (1). The size of the footprint is an indication of the overall performance of the APRT. The larger the footprint, the more diverse bugs an APRT can repair.

Table 4 Performance differences between the APRTs

While most techniques have relatively similar footprint sizes, jGenProg has the largest footprint. The footprint size is not based on the number of programs that a technique was able to repair. Instead, the effectiveness of an APRT is measured in terms of the diversity of the features of these programs and their spread in the instance space. An APRT that can repair more diverse bugs is considered to be more effective.

4.1.2 Significant Software Features

Figure 3 depicts the feature footprints, which shows how the buggy program instances score in terms of the most significant features.

Fig. 3: Feature footprints. The values of features have been normalised between 0 and 1, and the colour scheme represents the feature values

MOA and PMC

The cluster of buggy programs where mostly jMutRepair, RSRepair, Nopol, GenProgA and Arja are effective has a lower measure of aggregation (MOA) and private method count (PMC). MOA (as defined in Table 1) is the percentage of data declarations in the system whose types are user-defined classes, as opposed to system-defined types such as integers and real numbers. It indicates that, compared to other approaches, it is easier for jMutRepair, RSRepair, Nopol, GenProgA and Arja to repair bugs originating from buggy programs that have fewer user-declared types and a lower number of private methods. jMutRepair, RSRepair, GenProgA and Arja belong to the class of generate-and-validate APR techniques, with GenProgA and RSRepair being variations of GenProg. These tools make use of mutation operators to generate patches, which in general can be an effective way to fix bugs, but proves ineffective in programs with many private methods and user-defined types. This indicates that more sophisticated operators are required to fix such programs.

CAM

The third most significant feature is cohesion among methods in a class (CAM), which is a measure of class cohesion. The cluster of buggy programs where mostly jMutRepair, RSRepair, Nopol, GenProgA and Arja are effective is high in terms of CAM. High class cohesion is a desirable property and has previously been linked with high software quality. As mentioned above, jMutRepair, RSRepair, GenProgA and Arja use mutation to generate new patches for buggy programs, which is quite simple and works well with highly cohesive programs where related program elements are in the same place (in this case, the same class). Mutation applies random changes to code, and is less likely to introduce new bugs if classes are highly cohesive. On the other hand, DynaMoth is effective at repairing bugs from programs with low cohesion. DynaMoth is a semantics-based APR tool, which performs a dynamic synthesis of patches for repairing conditional bugs. The tool specifically addresses the issue of complex method calls and low cohesion, which explains its superior performance on buggy programs with low CAM.

AMC

Average Method Complexity is relatively high in the upper right part of the instance space, where most APRTs are able to generate plausible patches. AMC is defined as the average method size (measured in the number of Java bytecodes in the method) for each class, indicating that APRTs are usually more effective with longer methods.

Observation-based features

These metrics capture different characteristics of the buggy parts of the programs. Out of the 146 features, E-APR identified 5 significant Observation-based features – SPTWNG, VCTC, PUIA, CVNI and AECSL – whose footprints we show in Figure 3. Four of these five features – SPTWNG, VCTC, PUIA, CVNI – have very similar footprints.

SPTWNG (Similar Primitive Type With Normal Guard) indicates the number of statements that contain a variable (local or global) that is also used in another statement contained inside a guard (i.e., an If condition). VCTC (Variable Compatible Type in Condition) measures the number of variables within an If condition that are compatible with another variable in the scope. PUIA (Primitive Used In Assignment) measures the number of primitive variables in assignments. CVNI (Compatible Variable Not Included) is the number of local primitive type variables within the scope of a statement that involves primitive variables that are not part of that statement. Finally, AECSL (Atomic Expression Comparison Same Left) indicates the number of statements with a binary expression that have more than an atomic expression (e.g., variable access). Programs with a high value of these features are more likely to be repaired by most techniques, while jMutRepair, Arja, KaliA, Nopol and RSRepair can generate plausible patches even for programs with low feature values. Since these features measure properties of the potential buggy locations, it makes sense that programs with such high feature values are more likely to be repaired.

In summary, the effectiveness of APRTs is impacted by software features, which makes these methods problem dependent, and as such, no technique can be considered the best in all cases. We observe different strengths and weaknesses of existing APRTs, which calls for methods that make it possible to select the most suitable technique given a buggy program with particular features.

4.2 RQ2. Are existing APR datasets significantly different?

The 2-D instance space that was constructed to analyse the effectiveness of APRTs also allows us to analyse the location of the different benchmarks, which reveals how diverse they are. The dataset footprint presented in Figure 4 shows the reduced feature space with instances labelled according to the dataset they belong to. Each point is a bug from a particular dataset.

Fig. 4: Benchmark footprint. Each point corresponds to a bug from a particular benchmark dataset

The features that were eventually found to be significant and used to create this instance space are the ones that have a good linear relationship with algorithm performance. For some APRTs, the choice of features may be better than for others; however, our approach chooses a common feature set that performs well on average across all algorithms.

We observe that there is a distinctive cluster on the left of Figure 4 composed of only bugs from IntroClassJava. It is clear that this dataset is significantly different from the other datasets. Further away from this cluster, is the footprint of Defects4J, which is on the rightmost side of the graph. This indicates that Defects4J is significantly different from IntroClassJava.

On the other hand, the footprints of Bears, Bugs.jar and QuixBugs overlap to a greater extent. They are spread between IntroClassJava and Defects4J and have a higher spread than the other datasets. Bugs.jar covers a larger area and encompasses that of QuixBugs.

Bugs.jar contains some bugs obtained from the same software as the other datasets (e.g., Apache Commons Math), thus some of the bugs may essentially be the same. QuixBugs is a set of buggy implementations of well-known algorithms (e.g., Quicksort), and each buggy program in this dataset is a single class. The other datasets contain real buggy programs, composed of several classes.

In summary, the answer to the second research question is as follows:

[Boxed summary of the answer to RQ2]

The dataset footprint also helps us understand whether a dataset is biased, that is, whether it does not fill the possible instance space, lies within the ‘footprint’ (area of strength) of only one APRT, and does not give other algorithms a chance to show their strength. We particularly observe that the footprint of Defects4J lies within the area of strength of most APRTs apart from NPEFix, whose footprints are shown in Figure 2. What this means is that if the performances of APRTs are compared solely on this dataset, the evaluation can be biased and demonstrate only the strengths of these approaches. The footprints of Bugs.jar, Quixbugs and Bears lie within the area where most APRTs apart from NPEFix are not able to generate plausible patches, indicating that these datasets are quite challenging for these approaches and exhibit less bias.

Our findings from this research task can inform researchers who develop new APRTs in the selection of bug benchmarks for testing their techniques. It would not be sufficient to test a new APRT on just one dataset, and a technique that works for Defects4J may not produce good results when repairing IntroClassJava.

4.3 RQ3. How can we select the most suitable APRT?

To answer this question, the E-APR framework uses multi-label classification algorithms to predict the most suitable APRT to repair buggy programs with particular features. We use 10-fold cross validation to evaluate the performance of four notable Machine Learning techniques: Support Vector Machine (SVM), Random Forest Classifier (RFC), Decision Tree (DT), and Multi-Layer Perceptron (MLP).

We use the scikit-learn Python implementation of these approaches and employ MLSMOTE (Charte et al. 2015) to address the class imbalance problem. The performance of the approaches is evaluated in terms of precision, recall, and f1-score. Precision is the fraction of predicted instances that are correct, calculated as:

$$ P = \frac{TP}{TP+FP} $$
(3)

where TP is the true positives and FP is the false positives. Recall measures how accurately the model is able to identify the relevant data.

$$ R = \frac{TP}{TP+FN} $$
(4)

where FN is false negatives. F1-Score is the harmonic mean of P and R, computed as follows:

$$ F1 = 2 \frac{P \cdot R}{P + R} $$
(5)
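For the multi-label setting, these metrics are computed from the true positives, false positives and false negatives aggregated over all APRT labels (micro-averaging); a minimal sketch with made-up predictions:

```python
import numpy as np

# Rows: buggy programs, columns: APRTs; 1 = plausible patch (expected / predicted).
y_true = np.array([[1, 0, 1], [0, 1, 0], [1, 1, 0], [0, 0, 1]])
y_pred = np.array([[1, 0, 0], [0, 1, 1], [1, 0, 0], [0, 0, 1]])

tp = np.sum((y_pred == 1) & (y_true == 1))
fp = np.sum((y_pred == 1) & (y_true == 0))
fn = np.sum((y_pred == 0) & (y_true == 1))

precision = tp / (tp + fp)                                   # Eq. (3)
recall    = tp / (tp + fn)                                   # Eq. (4)
f1        = 2 * precision * recall / (precision + recall)    # Eq. (5)
print(precision, recall, f1)
```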

Results are shown in Table 5.

Table 5 The performance of Support Vector Machine (SVM), Random Forest Classifier (RFC), Decision Tree (DT) and Multi-Layer Perceptron (MLP) classifier in terms of precision (P), recall (R) and f1-score (F1)

The results indicate that, while all ML algorithms perform well in the task of APRT selection, the performance of RFC is clearly better than that of SVM. As a comparison, the work of Le et al. (2015) addresses a similar task (predicting whether a bug can be repaired by a genetic-programming based repair approach, i.e., GenProg (Le Goues et al. 2012a)) and reports a precision of 72%.

[Boxed summary of the answer to RQ3]

Given the high performance of E-APR for predicting the most suitable APRT, it is a high priority for us to integrate this approach into existing repair infrastructures such as RepairThemAll (Durieux et al. 2019) or Repairnator (Monperrus et al. 2019). For example, RepairThemAll has 11 automated repair techniques, but it does not offer any capabilities or guidelines in terms of which technique to select. Integrating E-APR with RepairThemAll would make it possible for users to select the most suitable APRT on the fly. Repairnator, on the other hand, is a software bot that automatically repairs broken Travis builds. Given a buggy program that causes a build to fail, Repairnator executes different repair approaches (including jGenProg and Nopol, among others) one by one, and the execution order is hard-coded. By incorporating E-APR, Repairnator could first execute E-APR to obtain the most suitable repair approaches for the buggy program, and execute them accordingly. In the future, we will investigate the effectiveness of integrating our approach with automated program repair infrastructures, such as RepairThemAll and Repairnator.

The overhead of using E-APR to select the most suitable APRT within existing APR infrastructures is minimal. E-APR first extracts the code features by performing static analysis of the buggy program (it takes a few seconds to extract the nine features we have identified as significant). Then, based on the extracted features, E-APR uses the trained model to perform the prediction in a few milliseconds.

5 Discussion and Threats to Validity

5.1 Feature Selection

A threat to the validity of this study is the selection of the considered features. Our approach E-APR considers 3 sets of features, each of them selected with a clear purpose: one aims at capturing object-oriented features, the second aims at capturing features specific to Java, and the third aims at capturing features related to the bug fixing activity (Observation-based features). Thus, we consider that the set of features is diverse enough to capture the characteristics of the program under analysis.

In this work, we complement extensively used features (e.g., Object-oriented features (Chidamber and Kemerer 1994b)) with a novel set of features (Observation-based features) that aim at characterising buggy programs. As the latter features are novel, there is a risk that they do not precisely characterise buggy programs. However, recent work (Yu et al. 2019) used Observation-based features to successfully predict source code transformations applied to buggy programs. For this reason, we consider that Observation-based features can be used to predict the most suitable APR tool to apply to a buggy program.

5.2 Correctness of Patches

Another threat to validity is the correctness of patches. In our experiment from Section 4 we consider all generated (plausible) patches rather than focusing only on correct patches. One of the reasons is the availability of data on patch correctness. The number of correct patches generated by APRTs is much lower than the number of plausible patches. For instance, the recent manual evaluation by Tian et al. (2020) of the patches we have used in this paper (more than 67,000 patches from RepairThemAll (Durieux et al. 2019)) found only 900 correct patches, corresponding to 20 different bugs (14 from Defects4J, 5 from Quixbugs and 1 from Bugs.jar). We reproduced our experiment, available in our (Appendix 2020), considering only the bugs that could be repaired by at least one correct patch. We found that a dataset consisting of 20 correctly repaired bugs is not sufficient for the machine learning algorithms used to identify significant features and create the algorithm footprints.

While we think that considering correct patches is an important next step, and a priority for our future work, the results with plausible patches provide some important insights into how APRTs work and how effective they are. Current APRTs find it challenging to even produce plausible patches, and E-APR helps understand why this is the case, and what kind of weaknesses future research into APRTs should focus on. Moreover, recent studies show that a plausible patch, even one that is overfitting and not adequate for repairing a bug, can give developers a valuable piece of information. For instance, Ginelli et al. (2020) studied code-removal patches, which previous works based on manual patch analysis (e.g., Qi et al. 2015) mostly labelled as overfitting patches. They found that in 95.8% of the cases with an overfitting code-removal patch, the patch reveals different kinds of problems affecting the test suites that are relevant for the developers. Thus, they show that this type of overfitting patch is useful: it exposes a particular weakness of the test suites.

5.3 Selection of Repair Tools

In recent years, several repair tools have been presented to repair Java bugs. Two previous works have tried to execute the tools (i.e., the materialisation of repair approaches) on real bugs: Durieux et al. (2019) executed 11 repair tools, and Liu et al. (2020) executed 16 tools. Both papers list the reasons why other repair approaches and tools could not be executed.

In this paper we consider the execution data from 11 tools. The main reason is that those tools were executed on 5 different bug benchmarks and the generated patches are publicly available (Durieux et al. 2019). Other tools have been evaluated exclusively on Defects4J, and we were not able to generate results for the other datasets we consider in this study. For example, Liu et al. (2020) evaluated 16 repair approaches only on Defects4J. We have included in our (Appendix 2020) initial results of an experiment based on the patches of that study, which includes, in addition to 10 repair tools considered in our paper (all except NPEfix), another 6: ACS (Xiong et al. 2017), Avatar (Liu et al. 2019a), FixMiner (Koyuncu et al. 2020), kPar (Liu et al. 2019), SimFix (Jiang et al. 2018), and TBar (Liu et al. 2019b). From those initial results, we could not draw any conclusions. Our conjecture is that the experiment does not have sufficiently diverse data: a single dataset is evaluated (Defects4J), which contains bugs extracted from only 5 projects (Commons Math, Commons Lang, Joda Time, jFree Chart, and Closure).

In this paper, we prioritised having a larger dataset, and found that the techniques we consider are diverse enough to demonstrate the capabilities of the proposed approach. We consider that this point (i.e., the selection of evaluated tools) does not invalidate the novelty of our technique.

5.4 Failure Information

Some repair techniques are designed to fix specific bugs, and their effectiveness can be limited by the nature of the bug being addressed (Monperrus 2018; Gazzola et al. 2019). For instance, NPEFix repairs null pointer exceptions and is unlikely to be useful in other cases. In this paper, we decided to focus on the features that characterise the buggy program under repair, without considering, for instance, the type of failure. Our approach, however, can easily be extended to include failure information. For instance, an extension could include a new set of features that characterise the failure, such as null pointer exceptions, stack overflows, or array index errors; a possible encoding is sketched below.
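
As an illustration of such an extension, the following sketch shows one possible way to encode the failure type as an additional categorical feature. The enum values and the classify() heuristic are illustrative assumptions, not part of the current E-APR implementation.

```java
// Hypothetical sketch of a failure-type feature for E-APR. The enum values and
// the classify() heuristic are illustrative assumptions, not part of E-APR.
public class FailureTypeFeature {

    enum FailureType { NULL_POINTER, STACK_OVERFLOW, ARRAY_INDEX, ASSERTION, OTHER }

    /** Maps the exception observed in a failing test run to a categorical feature value. */
    static FailureType classify(Throwable failure) {
        if (failure instanceof NullPointerException)           return FailureType.NULL_POINTER;
        if (failure instanceof StackOverflowError)             return FailureType.STACK_OVERFLOW;
        if (failure instanceof ArrayIndexOutOfBoundsException) return FailureType.ARRAY_INDEX;
        if (failure instanceof AssertionError)                 return FailureType.ASSERTION;
        return FailureType.OTHER;
    }
}
```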

E-APR does not consider failure information since most of the bugs considered in this study do not produce a failure (e.g., a crash) but an incorrect output. The incorrect output is exposed by the failing test case via assertions. For instance, by inspecting Defects4J Dissection (Sobreira et al. 2018) we found that 304 out of 395 (77%) bugs from Defects4J are due to incorrect output. We explored the meta-data of bugs using http://program-repair.org/defects4j-dissection (Sobreira et al. 2018) and found that 275 bugs are due to a junit.framework.AssertionFailedError (for example, Chart-7: junit.framework.AssertionFailedError: expected:<1> but was:<3>) or a junit.framework.ComparisonFailure (e.g., junit.framework.ComparisonFailure: expected:<String[[]]> but was:<String[;]>). Both exceptions are thrown by the testing framework after detecting the incorrect output.
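
The following minimal, hypothetical JUnit test illustrates this kind of failure: the buggy method does not crash, but returns a wrong value, and the defect only surfaces when the assertion in the test fails.

```java
import org.junit.Test;
import static org.junit.Assert.assertEquals;

// Minimal, hypothetical example of a bug that produces an incorrect output
// rather than a crash. Depending on the JUnit version, the failing assertion
// surfaces as junit.framework.AssertionFailedError or java.lang.AssertionError.
public class IncorrectOutputExampleTest {

    // Hypothetical buggy method: supposed to return the index of the maximum
    // value, but mistakenly returns the index of the last element.
    static int indexOfMax(int[] values) {
        return values.length - 1; // bug: ignores the actual values
    }

    @Test
    public void returnsIndexOfMaximum() {
        // Fails with a message such as: expected:<1> but was:<3>
        assertEquals(1, indexOfMax(new int[]{2, 9, 4, 1}));
    }
}
```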

5.5 Integration with Repair Infrastructures

To our knowledge, repair infrastructures such as Repairnator or RepairThemAll do not have the ability to predict, given a buggy program taken as input, which is the most suitable APR tool to generate a test-suite-adequate patch for it. Repairnator calls APR tools in a fixed order, independently of the characteristics of the program under analysis. In contrast, the user of RepairThemAll must decide which APR tool to call. In both cases, our approach E-APR could be integrated. For Repairnator, E-APR could determine the order in which the APR tools are called, with the goal of calling first the tools that are most suitable for a given buggy program. Similarly, for RepairThemAll, E-APR could suggest to the user the APR tool to be invoked.
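
The sketch below illustrates how such an integration could work, assuming a simple EaprModel and RepairTool interface; both are hypothetical abstractions introduced for this example and are not actual Repairnator or RepairThemAll APIs.

```java
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;

// Illustrative sketch only: EaprModel and RepairTool are assumed interfaces,
// not part of Repairnator or RepairThemAll.
interface EaprModel {
    /** Predicted suitability (e.g., probability of a plausible patch) of a tool for a buggy program. */
    double suitability(String buggyProgramPath, String toolName);
}

interface RepairTool {
    String name();
    boolean tryRepair(String buggyProgramPath); // true if a test-suite adequate patch was found
}

public class EaprGuidedPipeline {

    /** Invokes the tools in decreasing order of predicted suitability, stopping at the first patch. */
    public static void repair(String buggyProgramPath, List<RepairTool> tools, EaprModel model) {
        List<RepairTool> ranked = tools.stream()
                .sorted(Comparator.comparingDouble(
                        (RepairTool t) -> model.suitability(buggyProgramPath, t.name())).reversed())
                .collect(Collectors.toList());
        for (RepairTool tool : ranked) {
            if (tool.tryRepair(buggyProgramPath)) {
                return; // first plausible patch found; remaining tools are skipped
            }
        }
    }
}
```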

6 Related Work on the Effectiveness of APR Techniques

Researchers working in the area of APR have acknowledged that evaluating the quality of patches produced by APR techniques is crucial (Martinez et al. 2017a; Smith et al. 2015). To this end, Qi et al. (2015) studied the patches generated by GenProg (Le Goues et al. 2012b) for C programs, and classified them as plausible (passing all tests), overfitting (plausible but incorrect) and correct (plausible, with no latent defects, and introducing no new defects or vulnerabilities (Long and Rinard 2016a)). They found that most of the reported patches were overfitting. Long and Rinard (2016a) analysed the patch search space of two repair tools, SPR (Long and Rinard 2015) and Prophet (Long and Rinard 2016b), and found that overfitting patches are typically orders of magnitude more abundant than correct patches.

Other works have studied the ability of APR techniques to repair buggy Java programs. For example, Martinez et al. (2017a) manually studied the correctness of patches produced by three APR techniques over defects from the Defects4J benchmark. They found that only a small number of bugs (9/47) could be correctly repaired.

Liu et al. (2020) executed 16 repair tools on Defects4J and manually analysed the generated patches following the procedure defined in that work. They found that the proportion of correct patches varies between the tools and is below 37% for 15 of the 16 tools.

Ye et al. (2019) studied the repairability of bugs from QuixBugs (Lin et al. 2017), a dataset of 40 small buggy programs (between 9 and 69 LOC). They found that 15 bugs could be repaired by Nopol (Xuan et al. 2017) and approaches from Astor (Martinez and Monperrus 2016), which generated in total 64 plausible patches. However, 33 of these patches were found to be incorrect.

The presence of overfitting patches has motivated researchers to quantify overfitting patches (e.g., Yu et al. 2019; Wang et al. 2020), to detect them (e.g., DiffTGen (Xin and Reiss 2017), PatchSim (Xiong et al. 2018), static code features via learning (Wang et al. 2020), ODS (Ye et al. 2019)), and to avoid generating them (e.g., UnsatGuided (Yu et al. 2019), CapGen (Wen et al. 2018), Anti-pattern (Tan et al. 2016)). Empirical studies have also examined overfitting patches in detail. For example, Liu et al. (2020) conducted a large-scale empirical study that analysed the correctness of patches generated by 16 repair tools (the 10 mentioned in Section 2.2 (all except NPEFix) plus ACS (Xiong et al. 2017), Avatar (Liu et al. 2019a), FixMiner (Koyuncu et al. 2020), kPar (Liu et al. 2019), SimFix (Jiang et al. 2018), and TBar (Liu et al. 2019b)). One of their main findings is that many plausible patches are generated at the wrong location. As previously found by Liu et al. (2019), the accuracy of the fault localisation tool has a direct and substantial impact on the performance of APR tools.

Our work extends existing research in analysing the effectiveness of APR techniques by examining which software features impact the repairability of a software system. We characterise a software system using code features (e.g., depth of inheritance tree and method cohesion) and determine the most significant features that affect whether an APR technique can generate a patch.

There has also been some research in characterising patches generated by APR techniques to investigate how these patches differ from the ones generated by human programmers.

Wang et al. (2019) compared 177 correct patches for Defects4J bugs generated by APR techniques with the patches written by developers. To characterise the patches, the authors considered 6 metrics: a) patch size, b) number of chunks, c) number of modified files, d) number of modified methods, e) line coverage, and f) branch coverage. They found that automatically generated patches are on average syntactically different from the patches written by developers: patches generated by APR techniques are usually longer, have a higher number of chunks, and have higher line and branch coverage.

Similarly, Smith et al. (2015) studied the quality of patches generated by two C program repair approaches (GenProg and TrpAutoRepair). The authors used two metrics that were computed dynamically (i.e., by running the program under repair): a) the number of passing and failing test cases, and b) test suite coverage.

Both Wang et al. (2019) and Smith et al. (2015) focus on analysing the kinds of patches generated by APR techniques. The aim of these works is to understand how good the patches are and how they differ from developer-written patches. Our work, instead, aims at understanding what kinds of software systems and bugs APR techniques are able to repair. This helps explain how and why they work and, as a result, makes it possible to select the right technique for a new buggy software system.

In their research, Smith et al. (2015) state that “Automatic repair should be used in the appropriate contexts” and that “Our results suggest that more work is needed to fully understand and characterise test suite quality beyond coverage metrics alone”. The E-APR framework addresses these two research challenges by investigating 146 features, and building a machine learning model that enables the selection of the most suitable APR technique for a given buggy program.
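
As a rough illustration of this last step, the sketch below trains a Random Forest with Weka to predict whether an APR technique will produce a patch for a buggy program. The library choice, the ARFF file name and its layout (one row per bug-technique pair, with feature values and a nominal class "repaired" {yes, no}) are assumptions made for this example, not the actual E-APR implementation.

```java
import java.util.Random;

import weka.classifiers.Evaluation;
import weka.classifiers.trees.RandomForest;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

// Sketch only: "eapr-features.arff" is a hypothetical dataset with one row per
// (bug, technique) pair, the significant software features as attributes, and
// a nominal class attribute "repaired" {yes, no} as the last attribute.
public class AprtEffectivenessModel {

    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("eapr-features.arff");
        data.setClassIndex(data.numAttributes() - 1); // class is the last attribute

        RandomForest forest = new RandomForest();

        // 10-fold cross-validation; report precision, recall and F1 for the "yes" (repaired) class.
        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(forest, data, 10, new Random(1));
        int repaired = data.classAttribute().indexOfValue("yes");
        System.out.printf("precision=%.2f recall=%.2f f1=%.2f%n",
                eval.precision(repaired), eval.recall(repaired), eval.fMeasure(repaired));
    }
}
```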

Another related work is that by Motwani et al. (2018), which investigates correlations between the effectiveness of APR techniques and different aspects of bugs, such as bug importance and bug complexity. Results were analysed at a coarse-grained level, with the findings showing a weak to moderate correlation between bug importance and the ability of an APR technique to produce a patch. The results also show that APR techniques are effective in repairing easy bugs - as measured by the number of files and lines that have to be changed to fix the bug - while struggling with more complex bugs. This study makes an important step towards understanding where APR techniques work. In this paper, we take this research one step further by providing a more detailed analysis of the effectiveness of different APR techniques. The framework we propose allows us to examine the effectiveness of individual techniques in a visual and numerical way. We measure the footprints of the different APR techniques and whether their results overlap. This helps us understand the strengths and weaknesses of individual techniques, and their similarities, in a more fine-grained way.

Le et al. (2015) present a work with a goal similar to ours: they build an oracle that can predict whether fixing a failure should be delegated to a genetic-programming-based automated repair technique. The authors first extract features from an early stage of running a repair tool. They then use the values of these features to learn a discriminative model capable of predicting whether continuing a genetic programming search will lead to a repair within a desired time limit.

Beyond the similarities, there are notable differences between our work and the work by Le et al. (2015).

First, Le et al. (2015) focus on a genetic-programming-based automated repair technique, while our approach is independent of the type of repair technique: for instance, it considers genetic-programming-based techniques (Arja and GenProg), semantics-based techniques (Nopol) and exhaustive methods (jMutRepair). Second, Le et al. (2015) consider 27 features, 18 of which are related to genetic programming, whereas we use 3 sets of features (in total more than 200 features) that are independent of any repair technique and aim to describe the buggy program under repair. Third, Le et al. (2015) analyse the early stage of a genetic-programming-based technique to extract 18 of the features, which means that it could be necessary to modify a repair approach or to monitor its execution logs to extract them; E-APR instead uses features extracted from the buggy program and trains the prediction model on the output from previous executions (i.e., whether a bug was patched or not by a technique). Finally, Le et al. (2015) consider one dataset of bugs (ManyBugs (Le Goues et al. 2015), with 105 bugs) and one repair tool (GenProg), while we consider 11 repair tools and 5 datasets (1,282 bugs).

Lin et al. (2020) studied the non-repairability factors of various APR techniques. They analysed 11,818 execution logs from 27 Java repair tools, and found that 25.7% of the logs contained unexpected exceptions that prevented those tools from finding a patch.

7 Conclusion

In this paper, we introduced E-APR, a novel framework for assessing the strengths and weaknesses of Automated Program Repair (APR) techniques. We identified nine significant software features that have an impact on APRT effectiveness. These features were then used to explain an APR technique's effectiveness across a range of buggy programs. We also introduced a method for visualising APRT footprints, which reveal the strengths and weaknesses of APR techniques in fixing buggy programs.

We conducted an analysis of 11 different APR techniques applied to 2,141 bugs from 130 projects, constituting in total 23,551 repair attempts. Our approach effectively identified APRT footprints and the features that impact the effectiveness of an automated program repair technique. Using the most significant features, we applied two machine learning approaches that learn the relationship between software features and APRT effectiveness. The Random Forest classifier showed the best performance, with 88% precision, 81% recall and an 84% F1-score.