1 Introduction

Software developers have limited resources to verify and test their source code. If developers can identify defective components (e.g., files or commits) they would be able to focus their effort on these components. Defect prediction supports this activity, and prior work has reported that defect prediction can reduce development cost for developers (Tassey 2002).

There exists plenty of work aimed at predicting defective components (Basili et al. 1996; Kim et al. 2007; Moser et al. 2008; Hassan 2009; D’Ambros et al. 2010). In particular, several prior studies have focused on predicting defective changes, called change-level defect prediction or just-in-time defect prediction (Kamei et al. 2013; Kim et al. 2008; Fukushima et al. 2014; Mockus and Votta 2000). Just-in-time defect prediction has the advantage that it can determine whether a commit is likely to be defective at the time the commit is made (Hata et al. 2012), providing faster feedback than other defect prediction methods (Kamei et al. 2013). Previous research has used metrics based on measuring the code changes (e.g., churn, the number of changed lines) in just-in-time defect prediction (Kamei et al. 2013; Kim et al. 2008; Mockus and Votta 2000).

To the best of our knowledge, no studies have considered using the information in the lines that surround the changed lines of a commit, which we call context lines. Our main hypothesis is that information in the context lines has an impact on the likelihood that the change is defective.

In this paper, we evaluate the use of this information in just-in-time defect prediction. The dictionary defines context as “the parts of something written or spoken that immediately precede and follow a word or passage and clarify its meaning” (Stevenson and Lindberg 2010). In this paper, we define the context lines of a chunk of changed lines as the n lines (n = 1,2,…) that precede the chunk and the n lines that follow the chunk.

This paper proposes several context metrics. The different metrics vary along three axes: a) how many context lines around each change to use (the size of the context, n), b) whether to use all context lines, or only those of added or removed lines (the type of the change), and c) whether to count the number of words or the number of keywords (as defined by the programming language) in the context. We consider these axes as the parameters of context metrics, and we refer to a context metric computed with a particular set of parameter values as a variant of the context metrics. We empirically study the best-performing variant in terms of defect prediction performance. We also compare the best-performing variants of the context metrics with traditional code churn metrics (change metrics (Kamei et al. 2013; Kim et al. 2008; Mockus and Votta 2000) and indentation metrics (Hindle et al. 2008)), extended context metrics, and combination metrics that use two extended context metrics. Indentation metrics use the total number of white spaces in front of changed lines and the total number of pairs of braces that surround changed lines; we treat indentation metrics as code churn metrics, since they are computed on changed lines. In order to improve the predictive power of the context metrics in defect prediction, we also define extended context metrics. Extended context metrics count the number of words/keywords in both the context lines and the changed lines. Hence, extended context metrics are hybrids of the context metrics and traditional code churn metrics. In addition, we use combination metrics that combine two extended context metrics, one counting the number of words and one counting the number of a certain keyword (e.g., “goto”), in a single prediction model in order to further improve the predictive power of the extended context metrics.

Using six large open source software projects (from different domains) we empirically evaluate the defect prediction power of context metrics and compare them against traditional change metrics. This comparison is done using logistic regression models and random forest models.

Specifically, we address the following three research questions:

RQ1: What is the impact of the different variants of context metrics on defect prediction?

RQ2: Do context lines improve the performance of defect prediction?

RQ3: What is the impact of combination metrics of context metrics on defect prediction?

The main findings of our paper are as follows:

  • The best performing context metrics are the ones that measure the context of added-lines only.

  • The prediction power of context metrics varies when different sizes of the context (number of lines around the change) are used. The optimal size of the context for the metric that uses number of words is smaller than the optimal size for the metric that uses keywords.

  • The number of “goto” statements in context lines and changed lines is a good indicator of defective commits.

  • Our proposed combination metrics of extended context metrics significantly outperform all the other metrics used in this paper, and are the best-performing metrics in all of the studied projects in terms of 2 of the 3 evaluation measures used (area under the receiver operating characteristic curve and Matthews correlation coefficient).

This paper is organized as follows: Section 2 shows a motivating example. Section 3 introduces related work. Section 4 explains our proposed context metrics. Section 5 presents our case study design. Section 6 describes the research questions and methodology. Section 7 presents the results of our case study. Section 8 discusses the results. Section 9 describes the threats to the validity of our findings. Section 10 presents the conclusion.

2 Motivating Example

Let us start with a simple example to illustrate the use of context lines to measure the complexity of changes. Figure 1 shows an example of two changed functions. The context lines are the lines that precede or follow the changed lines. In this example, the underlined text represents the context lines and the bold lines are the changed lines. The function shown in Fig. 1a has simple context lines: there is one assignment before the changed line and one empty line after the changed line. The change in Fig. 1b has more complex context lines: the “if” and “else” statements. If we use only the changed lines as an input to compute the complexity of the changes, these two changes have the same complexity. In contrast, if we use the context lines as a measure of complexity, these two functions have a different complexity.

Fig. 1: An example of two changed functions, each of which has one changed line (in this case, an added line, in bold). We call the lines that precede or follow the changed lines context lines (in italic with an underline). All other lines are the same in both functions

To the best of our knowledge, there exists no research work that studies the context lines in defect prediction. In this paper, we introduce two types of new metrics that use the context lines: context metrics and extended context metrics, and evaluate their performance in defect prediction.

There are complexity metrics, such as Halstead’s complexity metrics (Halstead 1977) and McCabe’s Cyclomatic complexity metrics (McCabe 1976), that can capture the complexity of the function being changed and take the context into consideration; however, (1) to compute these metrics we need all the lines of the function, (2) these metrics are limited because they require a parser, and (3) complexity metrics are not optimized for code churn. In contrast, context metrics provide several advantages: they are easy to compute (they only require the “diff” output and, in the case of the number of keywords, a list of keywords of the programming language as input) and they measure only the complexity that surrounds the change instead of that of the entire function.

3 Related Work

3.1 Source Code Churn

Many researchers have studied source code churn in relation to software defects, reliability and quality (Nagappan and Ball 2005; Munson and Elbaum 1998; Khoshgoftaar et al. 1996; Ohlsson et al. 1999; Graves et al. 2000; Karunanithi 1993; Khoshgoftaar and Szabo 1994; Ostrand et al. 2004; Kamei et al. 2013; Kim et al. 2008; Mockus and Votta 2000). Source code churn measures changes and extensions of source code in a period of time (Oram and Wilson 2010). Munson and Elbaum (1998) reported that, as a system is developed (evolved), the complexity of the system also changes.

They proposed a methodology to produce an indicator of defects based on this tendency. Nagappan and Ball (2005) predicted defect density between different releases of Windows Server 2003. Comparing traditional code churn metrics with relative code churn metrics, which normalize code churn by factors such as the size of the component, they found that the relative code churn metrics are strong predictors of defect density.

Prior studies proposed more complex code churn metrics (Hassan 2009; Hindle et al. 2008). Hassan (2009) proposed code churn metrics based on the code change process. He applied Shannon entropy (from information theory) to the code change process in order to formulate his metrics.

Hindle et al. (2008) proposed indentation metrics that measure the indentations of added-lines and fixed-lines of changes. They studied the correlations between the indentation metrics and traditional complexity metrics (McCabe’s Cyclomatic complexity (McCabe 1976) and Halstead’s complexity (Halstead 1977)). They showed that the indentation metrics are mildly or strongly correlated with the traditional complexity metrics and the indentation is potentially its own complexity metric (Hindle et al. 2008). Because indentation metrics use the information in changed or added lines, we refer to indentation metrics as a type of code churn metric. This paper is the first study to investigate the effectiveness of indentation metrics for defect prediction.

In this paper, we compare the prediction power of 6 types of metrics in defect prediction. These metrics are: 1) context metrics, 2) traditional code churn metrics (Kamei et al. 2013; Kim et al. 2008; Mockus and Votta 2000), 3) each individual traditional code churn metric, 4) code churn metrics based on indentation metrics (Hindle et al. 2008), 5) extended context metrics (which are combinations of context metrics and a traditional code churn metric) and 6) combination metrics of extended context metrics (which combine two extended context metrics, the number of words and the number of a certain keyword, in a single prediction model).

3.2 Text-Based/Just-In-Time Defect Prediction

Many researchers have tackled the problem of defect prediction (Mizuno and Kikuno 2007; Kim et al. 2008, 2011; Kamei et al. 2013; Aversano et al. 2007; Jiang et al. 2013; Yang et al. 2015; Wang et al. 2016; Zimmermann et al. 2007; Li et al. 2017; Bettenburg et al. 2012; Śliwerski et al. 2005). In addition, several researchers have proposed metrics to predict defective components (Basili et al. 1996; Kim et al. 2007; Moser et al. 2008; Hassan 2009; D’Ambros et al. 2010). Mizuno and Kikuno (2007) applied a spam filter to the defect prediction problem. Śliwerski et al. (2005) proposed a method that automatically identifies changes that lead to defects in the future.

Textual information has also been used for defect prediction (Mizuno and Kikuno 2007; Kim et al. 2008; Aversano et al. 2007; Wang et al. 2016; Li et al. 2017). Kim et al. (2008) used not only metadata and complexity metrics but also text information to build a prediction model and predict defects. They used change-log messages, source code and file names as input to their predictors.

Wang et al. (2016) used the programs’ Abstract Syntax Trees (ASTs) as a representation of source code. They applied a deep learning technique to ASTs in order to learn semantic features from token vectors.

Several researchers have worked on just-in-time defect prediction (Kamei et al. 2013; Kim et al. 2007, 2008, 2011; Fukushima et al. 2014; Mockus and Votta 2000; Aversano et al. 2007; Jiang et al. 2013; Yang et al. 2015; Hassan 2009). Just-in-time defect prediction aims at identifying defective code changes, such as commits, instead of identifying defective files or packages as in traditional file/package-level defect prediction. For example, Kamei et al. (2013) focused on predicting the risk of commits. They used change metrics to predict defective commits at commit time. Yang et al. (2015) applied a deep learning technique as a prediction model to change metrics and conducted just-in-time defect prediction. Just-in-time defect prediction has the following three benefits that address the challenges of file/package-level defect prediction (Kamei et al. 2013): (1) prediction targets are fine-grained, (2) the relevant developers can be identified, and (3) feedback is provided faster. In this paper, we use context metrics for just-in-time defect prediction.

There are several widely known pitfalls that should be avoided in defect prediction (Tan et al. 2015; Tantithamthavorn and Hassan 2018). For example, Tan et al. (2015) reported that the cross validation technique is frequently used to evaluate prediction models (Kim et al. 2008, 2011; Bettenburg et al. 2012; Jiang et al. 2013; Kamei et al. 2013). However, this technique risks mixing past and future commits, an unrealistic scenario that artificially improves results. In our study, we take their recommendations into consideration to avoid these potential pitfalls by using online change classification, a validation technique that avoids these risks. We describe the details in Section 5.4.

4 Context Metrics

In this section, we describe the implementation of the proposed context metrics. As described in the previous sections, context information might be useful for defect prediction since it provides a new perspective of changes. In addition, it is easy to obtain context information (e.g., using the diff command in the version control system). For example, for the changed function in Fig. 1b, we consider only the lines in italic with an underline for context information.

Any modification to a file can be described in terms of a unified diff. A unified diff is a sequence of hunks; each hunk is composed of one or more sequences of contiguously changed lines. Each of these sequences is composed of ‘+’ lines (lines added to the file) or ‘-’ lines (lines removed from the file). For the sake of simplicity, we refer to these sequences of changed lines as chunks. We consider two types of chunks: ‘+’ chunks (which contain at least one ‘+’ line) and ‘-’ chunks (which contain at least one ‘-’ line). Finally, we will refer to any chunks (including both ‘+’ and ‘-’ chunks) as ‘all’ chunks. Figure 2 shows an example of two unified diffs (a part of the output of git show).

Fig. 2: An example of the unified diffs of a commit with context size equal to three, produced by git show (< 1 >), in the Bitcoin project; due to space limitations, we remove the metadata of this commit (the commit comment and the author information). This commit consists of two source code file diffs. The upper diff has two hunks (divided by the lines prefixed with @@, < 2 >). Each of these hunks consists of only one chunk (a sequence of changed lines). The first chunk is of type ‘+’ and ‘all’; the second is of type ‘-’ and ‘all’. The lower diff has one hunk, which consists of two chunks, each of type ‘+’ and ‘all’. The context lines of each chunk are the lines above and below the corresponding chunk (above and below < 3 > and < 4 >). The filename is prefixed with ‘+++ b/’

The upper unified diff shown in Fig. 2 is a sequence of two hunks that are divided by the lines prefixed with @@, < 2 >. Each hunk has one chunk, < 3 > and < 4 >, respectively. The upper chunk, < 3 >, is of type ‘+’ and ‘all’. The lower chunk, < 4 >, is of type ‘-’ and ‘all’. The lower unified diff has one hunk. This hunk includes two chunks that are of type ‘+’ and ‘all’.

Each chunk is surrounded by its context lines (the lines above and below the chunk that indicate where the chunk is to be applied, prefixed with ‘ ’ in the hunk). We refer to these context lines as the context of the chunk. We also consider the full filename of the file being changed as part of the context. This is because we consider that the directories where the file is located can contribute to the complexity of the context; i.e., more directories in the filename indicate a more complex context than no directories. We evaluated the use of the filename/directories in the context metrics for their prediction power and found that, when used, the performance of the context metrics improved.
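To make the notions of chunks and their contexts concrete, the following minimal Python sketch groups the body of a single hunk into runs of changed and context lines and collects the context lines adjacent to chunks of a requested type. The function name chunk_contexts and the tiny example hunk are illustrative only; they are not part of our tooling, and the handling of context runs shared by two neighboring chunks is a simplifying assumption.

```python
import itertools

def chunk_contexts(hunk_body, chunk_type="+"):
    """Return the context lines adjacent to chunks of the given type ('+', '-', 'all')."""
    is_change = lambda line: line[:1] in ("+", "-")
    # Group the hunk body into alternating runs of changed lines and context lines.
    runs = [(changed, list(group))
            for changed, group in itertools.groupby(hunk_body, key=is_change)]
    contexts = []
    for i, (changed, lines) in enumerate(runs):
        if not changed:
            continue  # this run is itself context, not a chunk
        if chunk_type != "all" and not any(l.startswith(chunk_type) for l in lines):
            continue  # chunk does not match the requested type
        if i > 0 and not runs[i - 1][0]:
            contexts += [l[1:] for l in runs[i - 1][1]]   # context preceding the chunk
        if i + 1 < len(runs) and not runs[i + 1][0]:
            contexts += [l[1:] for l in runs[i + 1][1]]   # context following the chunk
    return contexts

# A hypothetical hunk body (without the @@ header), context size n = 1.
hunk = [" int total = 0;",
        "+total += compute(x);",
        " return total;"]
print(chunk_contexts(hunk, "+"))   # ['int total = 0;', 'return total;']
```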

For explaining context metrics, we define the following terminology:

  • c: a commit.

  • n: a context size that is the maximum number of lines that can precede or follow a chunk we consider. (This is also a parameter of the diff command in the version control system.)

  • d(f,n): a unified diff of a changed file f with context size n.

  • D(c,n): a set of d(f,n) for all the changed files in commit c.

For a given unified diff d(f,n), we define the three types of contexts, based on the three chunk types, with the following notation (refer to Table 1):

  • context(d(f,n),t): the concatenation of the full filename of f and the context of all chunks of chunk type t in diff d(f,n).

For a unified diff d(f,n), we define the following two notations:

  1. ncw(d(f,n),t): the number of words in context(d(f,n),t).

  2. nckw(d(f,n),t): the number of programming language keywords (Table 2 shows all studied keywords) in context(d(f,n),t).

Given a commit c, a context size (the number of context lines) n, and the chunk type t, we define the following two kinds of context metrics:

$$ \begin{array}{@{}rcl@{}} NCW\left( c, n, t\right) &=& \sum\limits_{d(f,n) \in D(c,n)} ncw(d(f,n), t),\\ NCKW\left( c, n, t\right) &=& \sum\limits_{d(f,n) \in D(c,n)} nckw(d(f,n), t). \end{array} $$
Table 1 Types of contexts. The context of chunk type t of a unified diff d(f,n) is the concatenation of the full filename of f and the contexts of the chunk type t in the diff d(f,n)
Table 2 Studied programming language keywords

The defined context metrics are described in Table 3. To compute the context metrics of a commit m(c,n,t), where m is either NCW or NCKW, c is a commit id, n is the number of context lines, and t is the chunk type, we use the following algorithm (a Python sketch of this algorithm is given after the list):

  1. Compute the diffs D(c,n) of the source code files of commit c with the given number of lines of context, n, using the following command: git show --unified=n c

  2. For each diff d(f,n) of a source code file, compute ncw(d(f,n),t) or nckw(d(f,n),t):

     (a) Remove all chunks that are not of chunk type t, including their contexts.

     (b) Remove comments.

     (c) Create a string st with the concatenation of

       • the full filename of the diff d(f,n), and

       • the contexts around the identified chunks.

     (d) Use lscp (Thomas WS 2015) to convert st into a sequence of words. For ncw, count the number of words in this sequence; for nckw, count the number of programming language keywords in st.

  3. Finally, the context metric NCW/NCKW of the commit is calculated as the sum of the values of ncw/nckw over all diffs of the source code files in the commit.
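The following end-to-end Python sketch illustrates the algorithm above under simplifying assumptions: it tokenizes by whitespace and identifier patterns instead of lscp, skips comment removal and the filtering of non-source-code files, and uses only an illustrative subset of the keywords of Table 2; all helper names are hypothetical.

```python
import itertools
import re
import subprocess

# Illustrative subset of the studied keywords; the full list is given in Table 2.
KEYWORDS = {"if", "else", "for", "while", "switch", "case", "do",
            "return", "goto", "break", "continue"}

def _contexts(hunk_body, chunk_type):
    """Context lines adjacent to chunks of the requested type within one hunk body."""
    runs = [(changed, list(group)) for changed, group in
            itertools.groupby(hunk_body, key=lambda l: l[:1] in ("+", "-"))]
    out = []
    for i, (changed, lines) in enumerate(runs):
        if not changed:
            continue
        if chunk_type != "all" and not any(l.startswith(chunk_type) for l in lines):
            continue
        if i > 0 and not runs[i - 1][0]:
            out += [l[1:] for l in runs[i - 1][1]]        # preceding context
        if i + 1 < len(runs) and not runs[i + 1][0]:
            out += [l[1:] for l in runs[i + 1][1]]        # following context
    return out

def _context_string(file_diff, chunk_type):
    """Step 2(c): concatenate the full filename and the contexts of the chunks."""
    match = re.search(r"(?m)^\+\+\+ b/(.*)$", file_diff)
    pieces = [match.group(1)] if match else []
    for hunk in re.split(r"(?m)^@@.*$", file_diff)[1:]:
        body = [l for l in hunk.splitlines() if l and not l.startswith("\\")]
        pieces += _contexts(body, chunk_type)
    return " ".join(pieces)

def context_metrics(commit, n, chunk_type, repo="."):
    """Return (NCW, NCKW) of a commit for context size n and chunk type t."""
    diff = subprocess.run(                                 # step 1: git show --unified=n c
        ["git", "show", f"--unified={n}", "--pretty=format:", commit],
        cwd=repo, capture_output=True, text=True, check=True).stdout
    ncw = nckw = 0
    for file_diff in re.split(r"(?m)^diff --git ", diff):  # one diff per changed file
        if not file_diff.strip():
            continue
        s = _context_string(file_diff, chunk_type)
        ncw += len(s.split())                              # step 2(d): number of words
        nckw += sum(w in KEYWORDS                          # ... and number of keywords
                    for w in re.findall(r"[A-Za-z_]\w*", s))
    return ncw, nckw                                       # step 3: summed over all diffs
```

For example, context_metrics("HEAD", 1, "+") would approximate NCW(c, 1, +) and NCKW(c, 1, +) for the most recent commit of the repository in the current directory.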

Table 3 Different context metrics

Figure 3 depicts an example showing how the context metrics are computed from a unified diff. The left square corresponds to the first step in our algorithm. (1) and (2) correspond to the second step; we remove unrelated code in (1) and convert the string into a sequence of words using lscp in (2). (3) corresponds to the third step, in which we compute the context metrics.

The Intuition Behind Counting Words or Keywords:

Our definition of context metrics involves counting words or keywords in the context of a change. We consider that a context with more words is likely to be more complex than a context that has fewer words. Hence, we consider that counting the number of words in the context of a change is a proxy of the complexity of such a change.

The main intuition behind using the number of keywords is that the number of keywords in the context might indicate how deeply nested the change is. Therefore, a change with a larger number of keywords around it is likely to be more complex than a change that has fewer (or no) keywords around it.

Finally, counting the number of words/keywords is easy to do in practice.

Fig. 3: Example showing how NCW and NCKW are computed from a unified diff. The unified diff corresponds to the change from Fig. 2; due to space limitations, we remove several hunks, the commit comment, the author information, and the commit hash from the unified diff, and use the “--unified=1” option. The number of context lines n is 1. The chunk type t is ‘+’. The commit hash c is ‘commit_hash.’ The changed file f is ‘src/qt/rpcconsole.cpp.’ The left square corresponds to the first step in our algorithm. (1) and (2) correspond to the second step; we remove unrelated code in (1) and convert the string into a sequence of words using lscp in (2). (3) corresponds to the third step, in which we compute the context metrics

5 Case Study Design

In this section, we discuss the studied indentation metrics, the data preparation, the validation technique, the preprocessing, the studied projects, the resampling approach, the evaluation measures, and the prediction models.

5.1 Indentation Metrics

We compare context metrics with indentation metrics. We study two indentation metrics. The first is Added Spaces (AS), defined by Hindle et al. (2008); AS is the sum of the number of white spaces in front of all the ‘+’ lines in a commit.

The second is a new indentation metric that we define, Added Braces (AB). We consider the number of braces as a logical indentation because the number of braces in C++ and Java expresses how deeply one block of code is nested inside others.

We first count the number of left braces B_left and right braces B_right from the head of the function to each ‘+’ line. Second, we compute the difference B_diff between B_left and B_right for each ‘+’ line. Finally, we sum B_diff over all ‘+’ lines in a commit.
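A minimal sketch of the two indentation metrics follows. It assumes the lines of the enclosing function and the indices of the added lines are already known, counts only space characters for AS (tabs would need to be expanded), and includes the added line itself when counting braces for AB; these choices are our reading of the definitions above, not the exact implementation.

```python
def added_spaces(added_lines):
    """AS: total number of leading white spaces over all '+' lines of a commit."""
    return sum(len(line) - len(line.lstrip(" ")) for line in added_lines)

def added_braces(function_lines, added_line_indices):
    """AB: for each added line, B_diff = B_left - B_right counted from the head of
    the function to that line; AB is the sum of B_diff over all added lines."""
    ab = 0
    for idx in added_line_indices:
        prefix = "\n".join(function_lines[:idx + 1])
        ab += prefix.count("{") - prefix.count("}")
    return ab

# Hypothetical function body in which the line with index 2 was added by the commit.
func = ["void f(int x) {",
        "    if (x > 0) {",
        "        log(x);",
        "    }",
        "}"]
print(added_spaces([func[2]]))   # 8 leading spaces
print(added_braces(func, [2]))   # 2: the added line sits inside two open braces
```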

The Intuition of Using the Indentation Metrics as Way to Predict Defects:

The indentation metrics have been used as a proxy to measure complexity of source code (Hindle et al. 2008).

However, they have not been used in defect prediction. The rationale behind their use in defect prediction is that modifications in more indented code are likely to be more complex than modifications in less indented code, because the person doing the changes not only has to be concerned with what the code does, but also with the code that surrounds it. Code with larger indentation is likely to be inside more control blocks (e.g., while, for, and if statements) than code with less indentation; we hypothesize that more control blocks might create more brittle code. Hence, all things being equal, we expect that changes to code that has more indentation might result in more defects than changes to code that has less indentation.

5.2 Preparing Data using Commit Guru

The availability and openness of experimental data are a real challenge when evaluating defect prediction approaches. Therefore, we use data provided by Commit Guru, which Rosen et al. (2015) provide publicly. Commit Guru is a web application which identifies and predicts defective commits for Git repositories and calculates the change metrics (Table 4) that are often used for just-in-time defect prediction (Kamei et al. 2013).

Table 4 Change metrics

In this paper, we use Commit Guru to calculate the change metrics (Kamei et al. 2013). We use the change metrics in RQ2 to compare against the context metrics in order to study the impact of the context metrics on defect prediction. We use both the change metrics and their subsets (each individual change metric) as studied metrics.

We refer to each individual metric in the change metrics as a subset of the change metrics. When using a subset of the change metrics, we pick one metric from the change metrics and use that metric alone for defect prediction. This is because each of the change metrics is also a churn metric. However, several metrics do not strongly relate to code churn. For example, the Purpose metric (i.e., FIX, described in Table 4) is not affected by code churn. Hence, we remove three types of metrics from the change metrics when considering their subsets: the Purpose metric (i.e., FIX), the History metrics (i.e., NDEV, AGE, and NUC), and the Experience metrics (i.e., EXP, REXP and SEXP). Hence, we use each of NS, ND, NF, Entropy, LA, LD, and LT as a subset of the change metrics. We apply z-score to each of the subsets to normalize it to a mean of 0 and a variance of 1.

When using the change metrics, to avoid using several strongly correlated metrics in the prediction, we apply the following preprocessing proposed and described in Kamei et al. (2013):

  • Exclude ND and REXP since they are strongly correlated with NF and EXP.

  • LA and LD are divided by LT to normalize LA and LD.

  • LT and NUC are divided by NF to normalize LT and NUC.

Finally, we apply z-score (Zhang et al. 2016) to the change metrics to normalize them to a mean of 0 and a variance of 1.
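The sketch below shows this preprocessing with pandas, assuming a data frame whose columns are the numeric change metrics of Table 4; the column names and the guard against division by zero are our own assumptions rather than details of Commit Guru's implementation.

```python
import pandas as pd

def preprocess_change_metrics(df: pd.DataFrame) -> pd.DataFrame:
    """Preprocessing of the change metrics following Kamei et al. (2013)."""
    df = df.copy()
    df = df.drop(columns=["ND", "REXP"])       # strongly correlated with NF and EXP
    lt = df["LT"].replace(0, 1)                # guard against division by zero
    nf = df["NF"].replace(0, 1)
    df["LA"] = df["LA"] / lt                   # LA and LD normalized by LT
    df["LD"] = df["LD"] / lt
    df["LT"] = df["LT"] / nf                   # LT and NUC normalized by NF
    df["NUC"] = df["NUC"] / nf
    return (df - df.mean()) / df.std(ddof=0)   # z-score: mean 0, variance 1
```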

5.3 Time Sensitive Change Classification

Because we could use future commits to predict past commits, using 10-fold cross validation risks producing artificially good results, such as high precision and recall, when studying just-in-time defect prediction (Tan et al. 2015). In addition, when using 10-fold cross validation, we label the commits in the training data as defective or not using the information of all commits. However, this procedure also risks using future information for prediction. To address these two issues and validate our experiments, we use time sensitive change classification (Tan et al. 2015).

Time sensitive change classification uses only past commits to label past commits and build prediction models for future commits. Figure 4 shows an example of time sensitive change classification that uses the training interval between t − Tr and t as training data and the test interval between t and t + Te as test data. In this example, we use the commits in the training data to label those commits and to build prediction models for predicting the commits in the test data.

Fig. 4: An example of time sensitive change classification. The gray cross indicates that the information about the fix of a commit is not used in the training interval

However, Tan et al. (2015) reported three challenges. First, because defective commits are typically detected and fixed in 100–300 days (Kim and Whitehead 2006), many undetected defective commits in the training interval would be labeled clean. Second, this validation is sensitive to the interval. For example, if the training interval is before the release day, the features in the test interval would differ from those in the training interval. Third, if we take a long time gap between the training interval and the test interval, features such as developers and programming styles might have changed between the training interval and the test interval. To address these three challenges, Tan et al. (2015) recommended using online change classification.

5.4 Online Change Classification

Online change classification is a validation technique. We describe online change classification and how this validation technique addresses these three challenges. To address the first challenge, a gap is used between the training interval and the test interval (Fig. 5). The gap is used only during the labeling of the commits in the training interval. This additional interval allows more time to detect defective commits in the training interval and makes the labeling more precise. Typically, the gap is the average or median time between a defect-inducing commit and a defect-fixing commit; in our experiments, we use the median time for each project, obtained from our pre-experiment (Table 5).

Fig. 5: An overview of online change classification. We show two iterations as an example. The black part of the rectangle is the training data (training interval), labeled using the commits in the training interval and the gap (in dark gray). The light gray part of the rectangle is the test data (test interval), labeled using all of the commits in the project history including the end gap. Details of the terms in this figure are described in Section 5.4

Table 5 Parameter values of the online change classification for each project (days)

To address the second and third challenges, time sensitive change classification is executed multiple times while updating the training interval, test interval and gap. The multiple executions minimize the bias from a certain test interval. The training interval, test interval and gap slide into the future by a certain interval (Fig. 5). This certain interval is called the unit. A unit is 30 days (one month) in our experiments. The test interval is 30 days as well. Note that the unit and the test interval are parameters; hence, different parameter values might have an impact on the results of our experiments. We study this point in Section 9. The result shows that these parameters have little impact on the results of our experiments.

We also use a start gap and an end gap (Tan et al. 2015), which are intervals that we do not use as training interval or test interval. The beginning of a software project history may be inconsistent and unstable. The commits at the end of a software project history would be labeled clean because their defects would not yet be detected. Hence, the start gap and end gap support building better prediction models and improve the quality of the analysis.

Table 5 shows the actual parameters for each project. We manually look at the number of commits and set the start date at a point after the number of committed commits has increased and then decreased moderately (i.e., reached a peak). The start gap is the interval between the first commit date and the start date. The reason why we use this process is that after the number of committed commits has increased and then decreased moderately, the project would have been released and would be in a stable state.

To decide the end gap, we need to compute the analysis period, the iteration step size and the training interval. In the following, the analysis period is the maximum number of studied days. We define the analysis period, the iteration step size and the training interval as follows:

$$ \begin{array}{@{}rcl@{}} {analysis\ period} &=& ({CommDate_{latest}} - {{start\ date}}) - {margin},\\ {iteration\ step\ size} &=& ({analysis\ period}/2 - {gap}) / {unit},\\ {Tr} &=& {iteration\ step\ size} \cdot {unit}, \end{array} $$

where (and hereafter)

  • CommDate_latest is the latest commit date,

  • margin is a margin to remove defective commits that may not be detected yet, and

  • Tr is the training interval.

We first compute the interval between the start date and the date that is margin days before the latest commit date. This process removes the defective commits that have not been detected yet. We use 365 as the margin to compute the end gap. Hence, the end gap is always 365 days or more. Because we use the unit as the test interval as well, the iteration step size indicates the number of iterations for which we can slide the training interval, test interval and gap into the future while avoiding the commits that were committed in the latest margin days. In addition, we use the gap to compute the iteration step size. This additional gap avoids the commits that are in the latest margin days plus the gap days and ensures that we consider enough commits to label the commits in the test interval. The training interval is determined by the iteration step size and the unit. Finally, we define the end date and the end gap as follows:

$$ \begin{array}{@{}rcl@{}} \mathit{end\ date} &=& \mathit{start\ date} + (\mathit{Tr} + \mathit{gap} + (\mathit{iteration\ step\ size} \cdot \mathit{unit})),\\ \mathit{end\ gap} &=& \mathit{CommDate_{latest}} - \mathit{end\ date}. \end{array} $$
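The following Python sketch puts the interval arithmetic above together; the integer divisions and the example dates are illustrative assumptions.

```python
from datetime import datetime, timedelta

def online_cc_intervals(start_date, latest_commit_date, gap_days,
                        unit_days=30, margin_days=365):
    """Compute the interval parameters of online change classification."""
    analysis_period = (latest_commit_date - start_date).days - margin_days
    iteration_step_size = (analysis_period // 2 - gap_days) // unit_days
    training_interval = iteration_step_size * unit_days                  # Tr, in days
    end_date = start_date + timedelta(
        days=training_interval + gap_days + iteration_step_size * unit_days)
    end_gap = (latest_commit_date - end_date).days                       # always >= 365
    return {"analysis period": analysis_period,
            "iteration step size": iteration_step_size,
            "training interval": training_interval,
            "end date": end_date,
            "end gap": end_gap}

# Hypothetical project: history starts 2015-01-01, latest commit 2019-01-01, gap 200 days.
print(online_cc_intervals(datetime(2015, 1, 1), datetime(2019, 1, 1), gap_days=200))
```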

For labeling commits either defective or clean, we follow the labeling process used by Commit Guru:

  1. Collect the commits c_fix whose messages contain specific keywords (as described by Rosen et al. (2015)), such as “bug” or “fix”. Identify the modified lines l in the commits c_fix.

  2. Find the previous commits c_bad in which the lines l were added or modified prior to the corresponding change in c_fix. Label each commit c_bad as defective.

We conduct this procedure using the training interval and the gap for labeling training data, and using all of the commits for labeling test data.
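A simplified Python sketch of this labeling process is shown below. It uses plain git commands (git log --grep, git show, git blame); the keyword list is only an illustrative subset of the one used by Rosen et al. (2015), and Commit Guru implements a more complete version of this process (e.g., handling renames and filtering non-source files).

```python
import re
import subprocess

FIX_KEYWORDS = ["bug", "fix", "defect", "patch"]   # illustrative subset (Rosen et al. 2015)

def git(repo, *args):
    return subprocess.run(["git", "-C", repo, *args],
                          capture_output=True, text=True, check=True).stdout

def fix_commits(repo):
    """Step 1: commits whose messages contain fix-related keywords."""
    pattern = "|".join(FIX_KEYWORDS)
    return git(repo, "log", "--all", "-i", "-E", f"--grep={pattern}", "--pretty=%H").split()

def defect_inducing_commits(repo, fix_commit):
    """Step 2: blame the lines modified by a fix commit on its parent to find the
    commits that last touched them; those commits are labeled defective."""
    bad, current_file = set(), None
    diff = git(repo, "show", "--unified=0", "--pretty=format:", fix_commit)
    for line in diff.splitlines():
        if line.startswith("diff --git"):
            current_file = None
        header = re.match(r"^--- a/(.+)$", line)
        if header:
            current_file = header.group(1)
        hunk = re.match(r"^@@ -(\d+)(?:,(\d+))? ", line)
        if hunk and current_file:
            start, count = int(hunk.group(1)), int(hunk.group(2) or "1")
            if count == 0:
                continue                      # pure addition: no old lines to blame
            blame = git(repo, "blame", "-l", f"-L{start},{start + count - 1}",
                        f"{fix_commit}^", "--", current_file)
            bad.update(l.split()[0].lstrip("^") for l in blame.splitlines())
    return bad
```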

5.5 Preprocessing by z-score

z-score is a popular normalization approach in defect prediction (Zhang et al. 2016). z-score normalizes the input data to mean 0 and variance 1.

The equation of z-score is:

$$ \boldsymbol{X}_{z-score} = \frac{\boldsymbol{X}_{org}-\mu}{\sigma} $$
(1)

where μ is the mean of the values of a feature over the commits, σ is the standard deviation of the values of that feature over the commits, X_org is the vector of all values (all commits) of the feature, and X_z-score is the vector of all values (all commits) of the normalized feature.
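A minimal NumPy sketch of Eq. (1) is given below; returning the training statistics so they can be reused on the test data reflects the procedure described later in Section 6.2.1.

```python
import numpy as np

def z_score(x, mu=None, sigma=None):
    """Normalize a feature vector to mean 0 and variance 1 (Eq. 1)."""
    x = np.asarray(x, dtype=float)
    mu = x.mean() if mu is None else mu
    sigma = x.std() if sigma is None else sigma
    return (x - mu) / sigma, mu, sigma

train_z, mu, sigma = z_score([3.0, 5.0, 7.0, 9.0])   # fit on the training data
test_z, _, _ = z_score([4.0, 8.0], mu, sigma)        # reuse the training statistics
```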

5.6 Studied Projects

For our experiments, we use six open source projects: Hadoop, Camel, Gerrit, Osmand, Bitcoin and Gimp. Table 6 shows details of the projects. The studied projects include software for various fields, such as a server or an application, and are written in two popular programming languages (C++ and Java). We calculate the context metrics and the indentation metrics for each commit of these projects. For more precise analysis, we study all the commits that have changed at least one line in the source code.

Table 6 Details of the studied projects. Defective rate refers to the commits labeled using all commits

5.7 Resampling Approach

While learning the defect prediction model, the learning performance is affected by imbalanced data (Tan et al. 2015). In our case, Table 6 shows that “clean” commits outnumber “defective” commits. Hence, if we use this data directly as training data, the learning performance could decrease. General resampling approaches remedy this problem, as shown by prior studies (Kamei et al. 2013; Yang et al. 2015; Tan et al. 2015).

For our experiment, we use random under-sampling. Random under-sampling reduces the majority class at random to make the size of the majority class equal to the size of the minority class. Because we must evaluate our approach on real data, we apply resampling only to training data, not to test data.
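The sketch below shows one way to implement random under-sampling with NumPy; the function name and seed handling are our own, and in our pipeline it is applied to the training data only.

```python
import numpy as np

def random_under_sample(X, y, seed=0):
    """Randomly reduce the majority class so that both classes have the same size."""
    rng = np.random.default_rng(seed)
    X, y = np.asarray(X), np.asarray(y)
    classes, counts = np.unique(y, return_counts=True)
    minority_size = counts.min()
    keep = np.concatenate([
        rng.choice(np.flatnonzero(y == c), size=minority_size, replace=False)
        for c in classes])
    rng.shuffle(keep)
    return X[keep], y[keep]
```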

5.8 Evaluation Measures

To measure the impact of the context metrics for defect prediction, we use three evaluation measures: the area under the receiver operating characteristic curve (AUC), the Matthews correlation coefficient (MCC), and the Brier score (Brier). Precision and Recall are frequently used in defect prediction as evaluation measures. However, several researchers warned that these measures show biased results (Bowes et al. 2012; Tantithamthavorn and Hassan 2018; Chicco 2017).

AUC and Brier score are threshold-independent measures. Tantithamthavorn and Hassan (2018) suggested using threshold-independent measures to address pitfalls in defect prediction research. Although MCC is a threshold-dependent measure, MCC is not affected by the skewness of defect data (Zhang et al. 2016; Boughorbel et al. 2017) and we want to better understand the predicting power of the metrics (Kamei et al. 2016). Therefore, we also use MCC in this paper. The threshold for MCC is 0.5.
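The three measures can be computed from predicted probabilities with scikit-learn, as in the sketch below; the example probabilities are hypothetical.

```python
import numpy as np
from sklearn.metrics import brier_score_loss, matthews_corrcoef, roc_auc_score

def evaluate(y_true, y_prob, threshold=0.5):
    """AUC and Brier score are threshold-independent; MCC uses a 0.5 threshold."""
    y_pred = (np.asarray(y_prob) >= threshold).astype(int)
    return {"AUC": roc_auc_score(y_true, y_prob),
            "MCC": matthews_corrcoef(y_true, y_pred),
            "Brier": brier_score_loss(y_true, y_prob)}

# Hypothetical predicted probabilities for five commits (1 = defective, 0 = clean).
print(evaluate([1, 0, 1, 0, 0], [0.8, 0.3, 0.6, 0.4, 0.2]))
```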

We use the Scott-Knott ESD test (Tantithamthavorn et al. 2017) (at a 95% significance level) to compare the context metrics and the traditional code churn metrics. The Scott-Knott test is a hierarchical clustering algorithm that ranks the distributions of values. In particular, metrics with distributions that are not statistically significantly different are placed in the same rank. The Scott-Knott ESD test is an extension of the Scott-Knott test, which not only ranks based on significance, but also on Cohen’s d effect size (Cohen 1988). The Scott-Knott ESD test places distributions which are not significantly different, or have a negligible effect size, in the same rank. We use the ScottKnottESD R package that was provided by Tantithamthavorn et al. (2016). We also apply the Scott-Knott ESD test to the ranks that are computed by the Scott-Knott ESD test.

The reason why we apply the Scott-Knott ESD test twice is to avoid the variance of the values of the evaluation measures across the studied projects. If such variance exists across the studied projects, it would be difficult to compare the studied metrics over all the studied projects instead of within each studied project. This idea was proposed by Ghotra et al. (2015). They applied the Scott-Knott test twice to ensure that they recognized techniques that perform well across the studied projects. They gave the following example: if a prediction model has an AUC of 0.9 on project A and 0.5 on project B, it would receive a poor rank if the Scott-Knott test were applied once over all projects. However, if an AUC of 0.5 is the best AUC value in project B, and 0.9 is also the best value in project A, then this classification technique should be the best-performing technique. The first Scott-Knott test computes the rank within a project, and the second Scott-Knott test computes the rank across the projects without the variance of the values of the evaluation measures, due to using the rank. We use the Scott-Knott ESD test instead of the Scott-Knott test in order to consider the effect size. We call this procedure the double Scott-Knott ESD test.

The results of the Scott-Knott ESD test and the double Scott-Knott ESD test are a rank (number) for each metric. The smallest rank, 1, indicates the best rank. The largest rank indicates the worst rank. A rank can contain multiple metrics at once. We interpret metrics which appear many times in the smallest/smaller ranks as the best metrics, since this indicates that the metrics significantly outperform many others. Hence, for the Scott-Knott ESD test, we use the top-3 ranks to evaluate the metrics across the studied projects. We report metrics which have the most top-3 ranks across the studied projects as the best metrics in the Scott-Knott ESD test.

For the double Scott-Knott ESD test, we use boxplots to show the ranks of the studied metrics for each evaluation measure. Each boxplot contains six ranks from the Scott-Knott ESD test, one for each studied project. The double Scott-Knott ESD test then classifies these boxplots using the Scott-Knott ESD test. This analysis avoids the variance of the actual performance differences across the studied projects due to using the rank. We interpret metrics which have the smallest rank as the best metrics, since this indicates that they significantly outperform many others.

5.9 Prediction Models

We use two defect prediction models, logistic regression model (LR) (McDonald 2014) and random forest model (RF) (Ho 1995). We give a brief overview of the idea behind the prediction models:

  • Logistic Regression (LR) (McDonald 2014): LR is a frequently used defect prediction model. It builds a linear model which has all metrics as explanatory variables, their coefficients, and a bias. LR feeds the output of this linear model to a sigmoid function (Han and Moraga 1995). The output of the sigmoid function corresponds to the probability.

  • Random Forest (RF) (Ho 1995): RF is an ensemble learning model. RF builds various decision trees (Quinlan 1993) based on subsets of metrics. Finally, RF merges all the results of the decision trees, and provides the probability of defect.

Prior work (Tantithamthavorn et al. 2016; Hall et al. 2012) showed that the parameter optimization of the prediction models crucially affects the prediction performance. For example, Tantithamthavorn et al. (2016) showed that a simple automated parameter optimization can dramatically improve the AUC performance of defect prediction models (the best case is about 40 percentage points of AUC). Hence, considering the parameter optimization is also an important aspect in our experiment.

For LR, we consider a parameter: C.

  • C: C is a parameter which indicates the regularization strength. For example, if we have many metrics but not much data, LR would fit its parameters to the training data excessively. Hence, LR would provide worse performance on the test data. To address this challenge, the regularization strength C is used when optimizing the parameters. We study C values of 0.00001, 0.0001, 0.001, 0.01, 0.1, 1, 10, and 100 when using the change metrics and COMB. For the other metrics, we do not tune C since only one metric is used in the prediction model.

In addition, we need to consider the correlation between the studied metrics. If the studied metrics are correlated, LR would suffer from the multicollinearity problem (Farrar and Glauber 1967). When using the change metrics, we need to consider the correlation. To avoid correlated metrics, prior work (Kamei et al. 2013) proposed a preprocessing. We follow the same preprocessing of prior work (Kamei et al. 2013) that was described in Section 5.2. COMB has two metrics. However, they are not correlated (see Table 21). Hence, we do not need to deal with the correlation in COMB.

For RF, we use the same normalized change metrics as for LR. In addition, we consider two parameters that are specific to RF: mtry and the number of trees.

  • mtry: mtry is a parameter which indicates the number of metrics randomly selected for each node in a tree. For example, if we set mtry = 2, RF selects 2 metrics from the studied metrics to generate a node in a tree for splitting the studied commits. We study mtry values of 1, 2, 5, 10, and 12 when using the normalized change metrics, 1 and 2 when using COMB, and 1 when using the other metrics, since the number of normalized change metrics is 12, the number of metrics in COMB is 2, and the number of other metrics is 1 in a prediction model.

  • number of trees: The number of trees is a parameter which indicates the number of trees which RF generates. RF merges all the outputs of the trees for computing the final result. We study numbers of trees of 2, 5, 10, 50, 100, 500, and 1,000.

We optimize these parameters for each iteration. We split the training data into 80% for training and 20% for validation. We use the training data to train the model based on a parameter setting, and evaluate that parameter setting on the validation data. We use the best parameter setting on the test data.
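The following scikit-learn sketch illustrates this tuning step. The choice of AUC as the validation criterion, the mapping of mtry to scikit-learn's max_features, and refitting the best setting on the full training data before testing are our assumptions, not details stated above.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

C_GRID = [1e-5, 1e-4, 1e-3, 1e-2, 0.1, 1, 10, 100]
TREES_GRID = [2, 5, 10, 50, 100, 500, 1000]

def tune_and_fit(X, y, model="rf", mtry_grid=(1,), seed=0):
    """Split the training data 80%/20%, pick the best parameter setting on the
    validation part, then refit that setting on the whole training data."""
    X_tr, X_val, y_tr, y_val = train_test_split(
        X, y, test_size=0.2, random_state=seed, stratify=y)
    if model == "lr":
        candidates = [LogisticRegression(C=c, max_iter=1000) for c in C_GRID]
    else:
        candidates = [RandomForestClassifier(n_estimators=t, max_features=m,
                                             random_state=seed)
                      for t in TREES_GRID for m in mtry_grid]
    best, best_auc = None, -1.0
    for clf in candidates:
        clf.fit(X_tr, y_tr)
        auc = roc_auc_score(y_val, clf.predict_proba(X_val)[:, 1])
        if auc > best_auc:
            best, best_auc = clf, auc
    return best.fit(X, y)
```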

6 Research Questions and Methodology

6.1 Research Questions

Our proposed context metrics have three parameters: commit c, context size n and chunk type t. Hence, we first study which configurations of these parameters are the best for predicting defective commits. Because c is a parameter that cannot be optimized, we study n and t to design the best context metrics. To do this, we formulate the following research question: (RQ1) What is the impact of the different variants of context metrics on defect prediction?

RQ1 does not establish the impact of the context metrics on defect prediction compared to the traditional code churn metrics. Hence, we also study the prediction performance of the context metrics compared to the traditional code churn metrics, namely the change metrics, their subsets and the indentation metrics, in order to confirm whether the context metrics are effective or not. We additionally study the performance of extended context metrics, which are combinations of the context metrics and the traditional code churn metrics, in order to improve the predicting power of the context metrics. The extended context metrics count (1) the number of words and (2) the number of keywords in the context lines and the changed lines. To do this, we formulate the following research question: (RQ2) Do context lines improve the performance of defect prediction?

RQ2 compares the prediction performance across the context metrics, the extended context metrics, and the traditional code churn metrics. However, we do not study combination metrics between the context metrics; we use a context metric alone in a prediction model in RQ1 and RQ2. Hence, in this RQ, we study the impact of combination metrics that use two extended context metrics, counting (1) the number of words and (2) the number of a certain keyword (e.g., “goto”), in a single prediction model. To do this, we formulate the following research question: (RQ3) What is the impact of combination metrics of context metrics on defect prediction?

6.2 Methodology

We explain our experimental methodology.

6.2.1 RQ1. What is the Impact of the Different Variants of Context Metrics on Defect Prediction?

We conduct two experiments in order to study the impact of chunk types and context sizes for just-in-time defect prediction. We first study the impact of chunk types. Second, we study the impact of context size based on a fixed chunk type. In each experiment, we build the studied defect prediction models and predict defective commits in the studied project histories.

We consider two supervised learning models as defect prediction models: LR and RF. Prior research reported inconsistent results on whether the choice of prediction model yields significant differences (Ghotra et al. 2015) or not (Lessmann et al. 2008; Shepperd et al. 2014; Menzies et al. 2010). The main point of this paper is to evaluate the impact of the context metrics for defect prediction, not the impact of the prediction models. Hence, we use only two models and do not consider the difference between the prediction models.

We split the set of commits into training data and test data using online change classification (Tan et al. 2015). 10-fold cross validation is a frequently used validation technique in defect prediction; however, cross validation has risks, such as producing artificially good results due to mixing past and future commits. Online change classification addresses the challenges of cross validation and improves the quality of the analysis in just-in-time defect prediction (Tan et al. 2015). We described the details in Section 5.4.

We compute the context metrics for each chunk type for each commit. We apply preprocessing to the context metrics in the training and test data. We use z-score; the mean and the standard deviation used by z-score are estimated from the training data. We use the context metric as an input to the studied models. The models are trained using the training data, and compute prediction results on the test data. When training the models, we optimize the parameters of the prediction models. We described the details in Section 5.9.

Finally, we evaluate the results using three evaluation measures: AUC, MCC, and Brier score. Each measure has multiple values, one per iteration step of the online change classification. We show the number of iteration steps (the iteration step size) in Table 5. For example, the iteration step size of the Hadoop project is 17. Hence, we get 17 values for each of the three evaluation measures. For each measure, we summarize the multiple values with their median value. We conduct the above procedure for each studied project. Therefore, each context metric has 12 median values in the online change classification (six projects times two prediction models).

We conduct this procedure for each chunk type. Then, we compare the context metrics of the different chunk types w.r.t. the three evaluation measures. We apply the Scott-Knott ESD test (Tantithamthavorn et al. 2017) to the context metrics for each evaluation measure for each project. Each context metric has two values (the results of the LR and RF models) for each project. Then, we evaluate the statistically significant differences and effect sizes between the context metrics for each evaluation measure for each project. The result is shown as a rank. For example, if a certain context metric A has the best value on a certain evaluation measure, context metric A achieves rank 1. If another context metric B has no significant difference to context metric A, context metric B also achieves rank 1. If another context metric C has a significant difference to context metrics A and B, context metric C achieves rank 2.

Although we get a rank from the first Scott-Knott ESD execution, the rank is computed for each project. Hence, we would get different ranks for each project for a context metric. To avoid the variance of the ranks across the studied projects, we additionally apply the Scott-Knott ESD test to the ranks instead of the actual values of the evaluation measures; this is the double Scott-Knott ESD test. Each context metric has six ranks (the results for all the studied projects) for each evaluation measure. The additional Scott-Knott ESD test compares the studied context metrics in terms of the rank. Then, we evaluate the statistically significant differences and effect sizes between the context metrics for each evaluation measure.

We conduct the same procedures on different context sizes instead of different chunk types before we apply the Scott-Knott ESD test. In this comparison, we then compare the values of the evaluation measures for each iteration step between the different context sizes. We count the iteration steps for each context size that provide the best prediction performance value. We make histograms of the number of iteration steps that provide the best prediction performance for each context size, for each evaluation measure and context metric. From these histograms, we conclude the impact of different context sizes on the performance of defect prediction. For example, suppose we conducted an experiment with 100 iteration steps and context sizes 1, 2, and 3; the context size 1 has 50 iteration steps where it performs best, the context size 2 has 20 such iteration steps, and the context size 3 has 30. In this example, we would get a histogram in which the context size 1 has 50, 2 has 20, and 3 has 30; hence, we would conclude that the context size 1 is the best.
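This counting step can be sketched in a few lines of Python, as below; the handling of ties (counting all tied context sizes as winners) and the example values are our own assumptions.

```python
from collections import Counter

def best_context_size_histogram(scores, higher_is_better=True):
    """scores[n][i]: evaluation value for context size n at iteration step i.
    Returns how many iteration steps each context size wins."""
    sizes = sorted(scores)
    wins = Counter()
    for i in range(len(scores[sizes[0]])):
        values = {n: scores[n][i] for n in sizes}
        best = max(values.values()) if higher_is_better else min(values.values())
        wins.update(n for n, v in values.items() if v == best)
    return wins

# Hypothetical AUC values for three context sizes over four iteration steps.
scores = {1: [0.71, 0.69, 0.74, 0.70],
          2: [0.70, 0.70, 0.72, 0.68],
          3: [0.69, 0.68, 0.73, 0.71]}
print(best_context_size_histogram(scores))   # Counter({1: 2, 2: 1, 3: 1})
```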

From the results, we investigate the impact of the context metrics variants (different chunk types and context sizes). The goal of this RQ1 is to find the best context metrics variant for just-in-time defect prediction. The best context metrics variant is considered as the context metrics in RQ2.

6.2.2 RQ2. Do Context Lines Improve the Performance of Defect Prediction?

To answer this RQ, we compare the best variant of the context metrics NCW and NCKW (as determined in RQ1) with the change metrics and their subsets (both described in Section 5.2), the indentation metrics (described in Section 5.1) and the extended context metrics. We build the defect prediction models to evaluate the metrics. The prediction procedure is similar to the procedure for RQ1; however, the preprocessing has differences (the details are described later in this section).

In order to improve the performance of defect prediction, we define two new metrics based on NCW and NCKW, called extended context metrics, that measure both the context and the changed lines. These metrics are NCCW (number of words in the context and the changed lines) and NCCKW (number of keywords in the context and the changed lines) in Table 7. NCCW and NCCKW use only added lines as the changed lines. This is because it is known that the change metric “added-lines” is one of the best indicators of change risk (Shihab 2012; Shihab et al. 2012). These metrics will show the results of the combination between the context metrics and the traditional code churn metrics. From the results of RQ1, we choose the appropriate chunk type from ‘+’, ‘-’ and ‘all’, and the context size from one to ten for NCCW and NCCKW.

Table 7 The extended context metrics

We apply to the change metrics and their subsets the preprocessing that was described in Section 5.2. For the context metrics, we apply z-score to normalize them to a mean of 0 and a variance of 1, since the subsets of the change metrics are also normalized by z-score.

6.2.3 RQ3. What is the Impact of Combination Metrics of Context Metrics on Defect Prediction?

To answer this RQ, we use our new combination metrics that use both NCCW and NCCKW. This is because, according to the results of RQ2, NCCW and NCCKW have better prediction performance than NCW and NCKW alone. NCCW and NCCKW are strongly correlated with each other (see Section 8.3). Hence, we need to remove the correlation in order to address the multicollinearity problem (Farrar and Glauber 1967) when using them in a prediction model.

We therefore modify NCCKW to count only one specific keyword at a time instead of counting all keywords (Table 2, # of keywords: 20). Hence, we get 20 variants of NCCKW. For example, one variant of NCCKW measures the number of “goto” statements (in both the context and the changed lines). We call each of these metrics a modified NCCKW. There are 20 modified NCCKW variants. This modification removes the strong correlation between NCCW and NCCKW; NCCW and each modified NCCKW are rarely correlated.

We use NCCW and each of the modified NCCKW in a prediction model as two explanatory variables, and study the performance of each of the modified NCCKW. From this result, we determine the best combination of NCCW and a modified NCCKW.

We call these combination metrics COMB. We compare COMB with the other metrics following the same procedure as for RQ2.

7 Case Study Results

7.1 RQ1. What is the Impact of the Different Variants of Context Metrics on Defect Prediction?

For the Context Metrics, the Best Chunk Type is ‘+’

Table 8 shows the ranks of the Scott-Knott ESD test results for each evaluation measure for each context metric variant. Each cell shows the rank of a context metric variant for an evaluation measure and a project. Note that we compare variants with different chunk types at the same context size (n = 3, the default context size of the diff command git show). The rank is computed across context metric variants for each project and evaluation measure. For example, the gray cells in Table 8 are one set on which the Scott-Knott ESD test is conducted. We summarize the number of projects in the top three ranks for each context metric variant (row) in the columns #R1, #R2, and #R3. Hence, the sum of the numbers from #R1 to #R3 in a row is 6 or less. The column Sum is the sum of #R1, #R2, and #R3. Due to space limitations, we shorten the project names in the table: Bitcoin is B., Camel is C., Gerrit is Ge., Gimp is Gi., Hadoop is H., and Osmand is O.

Table 8 The ranks of the Scott-Knott ESD test results for each context metric variant and studied project on three evaluation measures

Regarding AUC, using only the ‘+’ chunk type for NCW yields the best results and statistically outperforms the other metrics except in the Osmand project, i.e., the rank is one in 5 of 6 projects. Regarding MCC, we find that the rank is one in 3 of 6 projects, and the rank is one, two or three in all projects when using the ‘+’ chunk type for NCW or NCKW. Regarding Brier score, using the ‘+’ or ‘all’ chunk type for NCKW yields the best results and statistically outperforms the other metrics for 3 of 6 studied projects.

Figure 6 shows the results of the double Scott-Knott ESD test on the results for each context metric in all projects; each boxplot contains the six ranks of the first Scott-Knott ESD test execution for the studied projects for one chunk type. The x-axis indicates the chunk type; Plus, Minus, and All correspond to ‘+’, ‘-’, and ‘all’; the y-axis indicates the rank for each studied project in the first Scott-Knott ESD test execution. We use two gray colors (dark gray and light gray) and two line styles (solid and dashed) to indicate the rank according to the double Scott-Knott ESD test. A different rank indicates a statistically significant difference with at least a small effect size. We observe that ‘+’ achieves the best median rank for all the evaluation measures and the context metrics.

Fig. 6: The results of the double Scott-Knott ESD test on the results for each context metric in all projects. Please see the text for a full explanation

With one exception, ‘+’ consistently performed better than the other chunk types. This exception is shown in Fig. 6f: the ‘all’ chunk type statistically outperforms the ‘+’ chunk type for NCKW on Brier score; however, the median and the 25th and 75th percentiles are the same. Hence, we choose the ‘+’ chunk type as the best chunk type for our context metrics.

A Context Size of 1 Provides Better Prediction Performance for NCW, While a Context Size of 10 Provides Better Prediction Performance for NCKW

Figure 7 shows, for each context size, the number of iteration steps in which that context size provides the best prediction performance. The left column of Fig. 7 (Fig. 7a, c and e) shows the results for NCW with chunk type ‘+’. The right column of Fig. 7 (Fig. 7b, d and f) shows the results for NCKW with chunk type ‘+’.

Fig. 7 The numbers of iteration steps that provide the best prediction performance for each context size. We use all iteration steps of all studied projects on two prediction models (LR and RF). The sum of all iteration steps is 188 (17 + 37 + 30 + 14 + 20 + 70 from Table 5). Hence, the sum of all values is 376 (188 iteration steps * 2 models). For example, the sum of the y-axis values in Fig. 7a between 1 and 10 is 376

We can observe opposite results for NCW and NCKW. For NCW, context size 1 has the highest bar, indicating that context size 1 provides the best prediction performance in the largest number of iteration steps compared to the other context sizes. For NCKW, however, context size 10 has the highest bar for AUC and Brier score, while for MCC the bar for context size 1 is only slightly higher than those of the other context sizes. This result implies that the threshold of 0.5 is not suitable for NCKW. Figure 8 shows the distributions of the predicted probabilities computed by the prediction models in the Hadoop project when the context size is 10. In Fig. 8b (LR), the commits are gathered closely around 0.5 and many defective commits (orange) have probabilities below 0.5, whereas in Fig. 8a (RF) the commits are not gathered around 0.5. Because the 0.5 threshold causes many defective commits to be identified as clean in Fig. 8b, this distribution hurts the MCC results when using NCKW. Hence, the results are best at context size 10 for AUC and Brier score, but not for MCC. We observe the same tendency in the other studied projects.
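The following minimal sketch (hypothetical probabilities, not our experimental data) illustrates why this matters for MCC: the predicted probabilities are binarized at the 0.5 threshold before MCC is computed, so defective commits whose probabilities fall just below 0.5 are counted as clean.

```python
# Minimal sketch (hypothetical data): how a 0.5 threshold turns predicted
# probabilities into labels before computing MCC.
import numpy as np
from sklearn.metrics import matthews_corrcoef

y_true = np.array([1, 0, 1, 1, 0, 0])                   # 1 = defective, 0 = clean (toy labels)
probs  = np.array([0.46, 0.20, 0.48, 0.70, 0.30, 0.40])  # hypothetical LR outputs near 0.5

y_pred = (probs >= 0.5).astype(int)   # two defective commits just below 0.5 become "clean"
print(matthews_corrcoef(y_true, y_pred))
```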

Fig. 8 The numbers of studied commits in Hadoop project when the context size is 10. The x-axis refers to the predicted probabilities using NCKW that were computed by either RF (left) or LR (right) models

From these results, we use a context size of 1 for NCW and 10 for NCKW. Hereafter, we refer to NCW(c,1,+) and NCKW(c,10,+) as NCW and NCKW, respectively. In addition, we refer to NCCW(c,1,+) and NCCKW(c,10,+) as NCCW and NCCKW, respectively.

7.2 RQ2. Do Context Lines Improve the Performance of Defect Prediction?

The Extended Context Metric NCCW, the Indentation Metrics, and Lines Added (LA) Provide Top-Three Rank Performance in Many Projects for Just-in-Time Defect Prediction

Table 9 shows the ranks according to the Scott-Knott ESD test results of the three evaluation measures for each studied metric. Each cell shows the rank, which is computed across the studied metrics for each project. For example, the gray cells in Table 9 (a) form one set on which the Scott-Knott ESD test is conducted. The actual values of the three evaluation measures used in the Scott-Knott ESD test are shown in the Appendix (Tables 18, 19, and 20). We summarize the number of projects in which each studied metric (row) achieves the top three ranks in columns #R1 to #R3, and the column Sum is the sum of #R1, #R2, and #R3. The maximum value of Sum is six, which is the number of studied projects. Note that “Changes” in the table (and in the other tables and figures of this paper) indicates the change metrics.

Table 9 The ranks of the Scott-Knott ESD test results for studied metrics

NCCW (NCCW(c,1,+)) provides top-three rank prediction performance in all projects on AUC and MCC, and in 5 of 6 projects on Brier score. NCCW does not provide the top-one rank on Brier score. However, this is not a major concern for just-in-time defect prediction. The Brier score is the mean of the squared differences between the predicted probabilities (i.e., the outputs of the RF and LR models) and the actual binary labels (i.e., clean or defective) of the studied commits. This result therefore implies that the probabilities computed with NCCW might be closer to 0 or 1 (clean or defective) than those of the other studied metrics. Probabilities close to 0 or 1 clearly indicate either clean or defective, even when the predicted results are incorrect, and such confident mistakes are penalized heavily by the Brier score. Nevertheless, the results on AUC and MCC are good. Hence, even if the incorrect results are far from the correct labels, NCCW still has strong predicting power because of its MCC results, and it might provide better performance at other thresholds on average because of its AUC results. This result indicates that, among the studied churn metrics, the extended context metric NCCW has strong predicting power for just-in-time defect prediction.

Added spaces (AS), added braces (AB), and lines of code added (LA) also provide top-three rank prediction performance in many projects on AUC and MCC: AS and AB rank in the top three in all projects on AUC and in 5 of 6 projects on MCC, and LA ranks in the top three in all projects on both AUC and MCC. This result shows that the indentation metrics and the churn metric LA have strong predicting power. Like NCCW, none of these metrics provides the top-one rank on Brier score. For the same reason as for the extended context metric, we conclude that AS, AB, and LA have strong predicting power.

The change metrics, which use all of the churn metrics, rank in the top three on Brier score in all projects, while rarely achieving top-three rank performance on AUC and MCC. This result implies that the probabilities computed by the change metrics might be closer to 0.5, or closer to the correct label, than the probabilities given by the other studied metrics. Probabilities close to 0.5 are closer to the correct label when the prediction is incorrect, and therefore incur a smaller squared error. Figure 9 shows the distributions of the predicted probabilities computed by the prediction models in the Camel project using NCCW and the change metrics. We can observe that, with the RF model, the probabilities computed by the change metrics are closer to 0.5 than those of NCCW.

Fig. 9 The number of studied commits in Camel project. The x-axis refers to the probabilities using each metric on either RF (left column) or LR (right column) models

When using the LR model, the probabilities computed with NCCW are closer to 0.5 than those of the change metrics. However, the Brier scores of the change metrics are smaller than those of NCCW in half of the projects (Table 20 in the Appendix). To show this result in a simpler manner, we define a difference between the predicted probabilities and the actual labels for the LR model. In the following, Diff is this difference for a metric in a project, C is the set of all studied commits c, abs is the absolute value function, p_c is the predicted probability of a commit c, and label_c is the actual label of a commit c, where defective commits are 1 and clean commits are 0. Based on these parameters, we define Diff as follows:

$$ \mathit{Diff} = \sum\limits_{c \in C}{\text{abs}(p_{c}-\mathit{label}_{c})}. $$

This is a simple variant of the Brier score.
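A minimal sketch (hypothetical probabilities, not our experimental data) of both quantities, assuming the Brier score is the mean of the squared differences and Diff is the sum of the absolute differences defined above:

```python
# Minimal sketch (hypothetical data): Brier score vs. the simpler Diff.
import numpy as np

labels = np.array([1, 0, 1, 0, 1])                  # 1 = defective, 0 = clean (toy data)
probs  = np.array([0.55, 0.45, 0.40, 0.30, 0.65])   # hypothetical predicted probabilities

brier = np.mean((probs - labels) ** 2)   # mean squared difference
diff  = np.sum(np.abs(probs - labels))   # Diff: sum of absolute differences
print(brier, diff)
```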

Table 10 shows the values of Diff for the LR model. The gray cells indicate the smallest Diff value among the metrics in a project. We observe that the change metrics obtain gray cells in the majority (5 of 6) of projects. This result implies that, although the probabilities computed with NCCW are closer to 0.5 than those of the change metrics, the Diff of the change metrics is smaller than that of NCCW; hence, their probabilities are closer to the correct labels than those of NCCW. This is why the change metrics rank in the top three on Brier score in all projects.

Table 10 The values of our proposed difference (Diff) for the LR model. The gray cells refer to the smallest Diff values among the metrics within each project

The Indentation Metric, AS, is the Best-Performing Metric on AUC and MCC According to the Double Scott-Knott ESD Test

Figure 10 shows the results of the double Scott-Knott ESD test on the results for each studied metric in all projects; each boxplot contains the six ranks of the first Scott-Knott ESD test execution for the studied projects on a studied metric. We use two gray colors (dark and light gray) and two line styles (solid and dashed) to represent the ranks according to the double Scott-Knott ESD test; adjacent boxplots with the same gray color and line style have the same rank, otherwise the rank changes at that point. A different rank indicates a statistically significant difference with at least a small effect size according to the double Scott-Knott ESD test. We observe that AS is the best-performing metric on both AUC and MCC. The change metrics are the best-performing metrics on Brier score, and AS is the second best-performing metric. This result shows that AS is a top-rank metric across the studied projects on AUC and MCC, and the change metrics are the top-rank metrics across the studied projects on Brier score.

The Extended Context Metric, NCCW, and the Churn Metric, LA, Also Perform Well According to the Double Scott-Knott ESD Test

LA provides second-rank performance in AUC and Brier score, and first-rank performance in MCC. The extended context metric NCCW provides third-rank performance in AUC, second-rank performance in Brier score, and first-rank performance in MCC. This result shows that NCCW and LA also perform well across the studied projects on AUC and MCC.

Fig. 10 The double Scott-Knott ESD test results for each studied metric in all projects. Please see text for a full explanation

In this RQ, we study the metrics in terms of prediction performance only, and ignore other aspects such as which defective commits are detected. We closely examine the detected defective commits, the pair-wise relations across the studied metrics, and the basic predicting power of the studied metrics in Section 8 (Discussion).

7.3 RQ3. What is the Impact of Combination Metrics of Context Metrics on Defect Prediction?

“goto” Statement is the Best Keyword for the Modified NCCKW

Figure 11 shows the results of the double Scott-Knott ESD test on the results for each modified NCCKW in all projects. Each boxplot contains the six ranks of the first Scott-Knott ESD test execution for the studied projects when a given keyword is used for the modified NCCKW. The x-axis indicates the keyword used in the modified NCCKW; the y-axis indicates the rank of each studied project in the first Scott-Knott ESD test execution. We use two gray colors (dark and light gray) and two line styles (solid and dashed) to represent the ranks according to the double Scott-Knott ESD test; adjacent boxplots with the same gray color and line style have the same rank, otherwise the rank changes at that point. A different rank indicates a statistically significant difference with at least a small effect size. The first Scott-Knott ESD test is applied to the values of the evaluation measures computed from the studied prediction models that use NCCW and a certain modified NCCKW (i.e., one counting a certain keyword, such as “goto”) as the explanatory variables.

Fig. 11 The results of the double Scott-Knott ESD test on the results for each modified NCCKW in all projects. Please see text for a full explanation

We observe that the number of “goto” statements in the context and changed lines achieves the top-one or top-two rank in AUC and MCC. In addition, its median rank is the best in AUC and MCC. The number of “goto” statements achieves the worst rank in Brier score; for the same reason as in the RQ2 results on Brier score, we conclude that the modified NCCKW that counts the number of “goto” statements is the strongest metric to combine with NCCW. In addition, this modified NCCKW is not strongly correlated with NCCW (see Table 21). Hereafter, we refer to this variant (using the number of “goto” statements) of the modified NCCKW as gotoNCCKW. We use NCCW and gotoNCCKW together in a prediction model in order to improve the prediction performance, and refer to these combination metrics as COMB.

COMB Provides the Top-One Rank Prediction Performance for all the Studied Projects in AUC and MCC

Table 11 shows the ranks according to the Scott-Knott ESD test results of the three evaluation measures for each studied metric. We observe that COMB provides the top-one rank prediction performance for all the studied projects in AUC and MCC. In addition, except for AS in MCC, no other studied metric achieves the top-one rank prediction performance. This result indicates that COMB are the best prediction metrics among all the studied metrics. COMB also achieve at least a top-three rank for all studied projects in Brier score.

Table 11 The ranks of the Scott-Knott ESD test results for studied metrics

COMB Statistically Outperforms the Other Studied Metrics

Figure 12 shows the results of the double Scott-Knott ESD test on the results for each studied metric in all projects; each boxplot contains the six ranks of the first Scott-Knott ESD test execution for the studied projects on a studied metric. The x-axis indicates a metric; the y-axis indicates the rank of each studied project in the first Scott-Knott ESD test execution. We use two gray colors (dark and light gray) and two line styles (solid and dashed) to represent the ranks according to the double Scott-Knott ESD test; adjacent boxplots with the same gray color and line style have the same rank, otherwise the rank changes at that point. A different rank indicates a statistically significant difference with at least a small effect size.

Fig. 12 The results of the double Scott-Knott ESD test on the results for each studied metric in all projects. Please see text for a full explanation

We observe that COMB are the best-performing metrics on both AUC and MCC. This result shows that COMB are the top-rank metrics across the studied projects on AUC and MCC. Even on Brier score, COMB are the second-rank metrics; the best-performing metrics on Brier score are still the change metrics.

8 Discussion

8.1 Are the Commits Identified by the Context Metrics Different than the Ones Identified by the Traditional Churn Metrics?

The Proposed Context Metrics COMB Identify Some Defective Commits that Other Churn Metrics Cannot; These Commits Tend to Have Large Context Metric Values

We define unique defective commits as the commits that are only identified by our proposed metrics (and not by other metrics). The existence of these defective commits contributes to defect prediction since they cannot be identified using traditional churn metrics. Hence, we study the commits identified as defective by COMB.
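Conceptually, a unique defective commit set is a simple set difference; the following sketch (hypothetical commit identifiers, not our data) mirrors the COMB-LA / LA-COMB notation used in Figs. 13 and 14.

```python
# Minimal sketch (hypothetical data): "unique defective commits", i.e.,
# commits predicted as defective by COMB but not by another metric.
defective_by_comb = {"c1", "c4", "c7", "c9"}   # hypothetical commit ids flagged by COMB
defective_by_la   = {"c1", "c7", "c8"}         # hypothetical commit ids flagged by LA

comb_minus_la = defective_by_comb - defective_by_la   # "COMB-LA" in Fig. 13
la_minus_comb = defective_by_la - defective_by_comb   # "LA-COMB"
print(comb_minus_la, la_minus_comb)
```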

Figure 13 shows the values of the context metric NCW for the commits identified as defective in the Hadoop project.

Fig. 13 The values of the context metric NCW for the commits identified as defective in the Hadoop project. The boxplots show the cases where COMB identified commits differently from the context metric NCCW, the change metrics, LA, and the indentation metrics on the RF and LR models. For instance, COMB-AB refers to the cases where commits are identified as defective by COMB but as clean by AB. The x-axis shows the metrics that are compared; the y-axis shows the value of NCW

We can observe that, compared with the other metrics, COMB identifies commits with higher NCW values as defective. For example, the median NCW value of COMB-Changes is higher than the median NCW value of Changes-COMB (Fig. 13a and b). The results for the other projects show the same tendency, except for NCCW, which has higher NCW values in 4 of 6 projects since NCCW is also a context metric.

Because we use NCW values to characterize the unique defective commits, this result may seem obvious. However, even if we use LA values instead, the median LA value of COMB-LA is higher than the median LA value of LA-COMB in several projects. Figure 14 shows the values of LA for the commits identified as defective by the LR model in the Bitcoin and Hadoop projects. In the Bitcoin project, the median LA value of COMB-LA is higher than that of LA-COMB, while in the Hadoop project LA-COMB has the higher median LA value. This result supports the finding from Fig. 13a and b that COMB can uniquely identify some defective commits.

Fig. 14 The values of LA for the commits identified as defective in the Bitcoin and Hadoop projects. The boxplots show the cases where COMB identified commits differently from the context metric NCCW, the change metrics, LA, and the indentation metrics on the LR model. The y-axis shows the value of LA

The Proposed Context Metrics NCW and NCKW, and the Extended Context Metrics NCCW and NCCKW, Can Also Uniquely Identify Defective Commits; These Commits Tend to Have Larger Context Metric Values than Those Identified by Other Churn Metrics on the LR Model

We observe the same tendency for the other context metrics on the LR model, but not on the RF model. This result may stem from the difference between the RF and LR models; studying the differences between the prediction models lies beyond the scope of this paper. In addition, there exist commits that the traditional code churn metrics can identify but the context metrics cannot. Future studies are necessary to investigate these points.

8.2 How Much Do the Indentation Metrics Improve the Defect Prediction Performance?

Indentation Metrics AS and AB Have the Potential to Improve Defect Prediction Performance

Our study is the first to apply the indentation metrics to the defect prediction problem. From our results, the indentation metrics are among the best metrics in terms of defect prediction performance, and significantly outperform the other studied metrics except for COMB. Hence, we observe that the indentation metrics have strong potential predicting power for just-in-time defect prediction.

8.3 How Redundant are the Context Metrics Compared to the Traditional Metrics?

8.3.1 Motivation

To our knowledge, prior work in defect prediction disregards the information around the changed lines, i.e., the context lines. Hence, we propose the context metrics and study their impact on defect prediction performance. However, we have not yet studied whether our context metrics are redundant with respect to the traditional metrics.

We present an in-depth analysis to understand the relation between our context metrics and the traditional metrics. This analysis provides insights into why our context metrics do not introduce redundancy, and why they can uniquely identify defective commits compared to the traditional metrics. Finally, we show the basic predicting power of the studied metrics using information gain (Romanski and Kotthoff 2018).

8.3.2 Approach

We first study five context metrics (i.e., NCW, NCKW, NCCW, NCCKW, and gotoNCCKW), two indentation metrics, and 14 traditional change metrics using a correlation analysis (Zwillinger and Kokoska 1999) and principal component analysis (PCA) (D’Ambros et al. 2010) to identify correlated metrics and to find the metrics that are important for representing the variance of the original metrics. Second, we compute information gain (Romanski and Kotthoff 2018) for all the studied metrics in order to clarify their basic predicting power.

We first conduct a correlation analysis on the metrics. When we use strongly correlated metrics as explanatory variables of a prediction model, we face the problem of multicollinearity (Farrar and Glauber 1967). In addition, such metrics are redundant. We use Spearman rank correlation (Zwillinger and Kokoska 1999), a non-parametric correlation, to measure the correlation between the metrics. We apply Spearman rank correlation to all commits of each studied project, and compute the average of the correlation coefficients across the projects.
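A minimal sketch (hypothetical per-commit values; scipy's spearmanr is one possible implementation) of the correlation computation for a single project:

```python
# Minimal sketch (hypothetical data): Spearman rank correlation between two
# metrics over the commits of one project.
import numpy as np
from scipy.stats import spearmanr

# hypothetical per-commit metric values for one project
nccw = np.array([12, 3, 45, 7, 20, 31])
la   = np.array([10, 2, 50, 5, 18, 40])

rho, _ = spearmanr(nccw, la)   # non-parametric rank correlation
print(rho)                     # coefficients of 0.7 or over are treated as "strong" in Table 12
```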

Second, we conduct PCA in order to identify the metrics that represent the highest variance of all the studied metrics. PCA reduces the number of input metrics by constructing new, composite metrics (principal components), and provides, for each component, the coefficients that convert the input metrics into that component. We use the coefficients of the most important component, the first principal component, to identify which metrics represent the highest variance. We apply PCA to all commits of each studied project. We assume that the metrics that represent the highest variance are the important metrics among the studied metrics.
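The following sketch (hypothetical data; we assume standardized inputs, which may differ from our exact setup) shows how the coefficients of the first principal component can be inspected, as reported in Table 13:

```python
# Minimal sketch (hypothetical data): coefficients of the first principal component.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# rows = commits, columns = metrics (hypothetical values)
X = np.array([[12, 10, 3], [3, 2, 1], [45, 50, 9], [7, 5, 2], [20, 18, 4]], dtype=float)
metric_names = ["NCCW", "LA", "AS"]   # hypothetical subset of the studied metrics

pca = PCA()
pca.fit(StandardScaler().fit_transform(X))   # standardize, then fit PCA
first_pc = pca.components_[0]                # coefficients of the first principal component
for name, coef in zip(metric_names, first_pc):
    print(name, round(coef, 2))              # |coefficient| >= 0.3 is highlighted in Table 13
```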

Finally, we compute information gain (Romanski and Kotthoff 2018) in order to clarify the basic predicting power of the studied metrics. In our case, information gain measures the basic predicting power of each metric. For example, if a metric perfectly separates defective commits from clean commits, the information gain is maximal. However, if every group of commits induced by a metric contains 50% defective and 50% clean commits, the information gain is minimal, because such a split is no better than random classification. The formula of information gain (Romanski and Kotthoff 2018) is as follows:

$$ \mathit{InfoGain}(\mathit{metric}) = H(\mathit{Defect}) + H(\mathit{metric}) - H(\mathit{Defect}, \mathit{metric}), $$

where metric is a studied metric, InfoGain(⋅) is the information gain of its argument, H(⋅) is the Shannon entropy (Shannon 1948) of its argument using a base-2 logarithm, H(⋅,⋅) is the joint Shannon entropy of its two arguments, and Defect is the set of labels (defective or clean) of all studied commits.

We compute the ratio of the information gain of NCCW to that of the indentation metrics and the churn metrics. Since NCCW is our proposed metric, we use NCCW as the base. The formulation is as follows:

$$ \mathit{Ratio} = \mathit{InfoGain}(\mathit{NCCW}) / \mathit{InfoGain}(\cdot). $$

If the ratio is over 1.0 for a given metric, NCCW has a higher potential than that metric to classify the commits in defect prediction.
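A minimal sketch (hypothetical, already-discretized metric values, since information gain requires discrete values) of the information gain and the ratio defined above:

```python
# Minimal sketch (hypothetical data): information gain of a metric with
# respect to the defect labels, and the ratio used in Fig. 15.
import numpy as np
from collections import Counter

def entropy(values):
    """Shannon entropy (base 2) of a sequence of discrete values."""
    counts = np.array(list(Counter(values).values()), dtype=float)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

def info_gain(labels, metric_values):
    """InfoGain(metric) = H(Defect) + H(metric) - H(Defect, metric)."""
    return entropy(labels) + entropy(metric_values) - entropy(list(zip(labels, metric_values)))

labels      = [1, 0, 1, 0, 1, 0]                                 # 1 = defective, 0 = clean
binned_nccw = ["high", "low", "high", "low", "high", "low"]      # hypothetical, discretized NCCW
binned_la   = ["big", "small", "big", "big", "small", "small"]   # hypothetical, discretized LA

ratio = info_gain(labels, binned_nccw) / info_gain(labels, binned_la)
print(ratio)   # > 1.0 means NCCW has the stronger basic predicting power here
```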

8.3.3 Results

The Context Metrics NCCW and NCCKW, the Indentation Metrics AI and AS, and the Change Metric LA are Strongly Correlated

Table 12 shows the Spearman rank correlation between all the studied metrics (including the context, indentation, and change metrics) across all studied projects; each cell shows the average correlation over the studied projects (the median is very similar). A gray cell refers to a strong correlation, i.e., a coefficient of 0.7 or over. We observe that the correlations between NCCW, NCCKW, AI, AS, and LA are strong (over 0.7). This is because these context metrics and the indentation metrics include changed-line information.

Table 12 Spearman rank correlation between the context metrics, the indentation metrics, and the change metrics in the studied projects

The Context Metrics NCW and NCKW, However, are Moderately Correlated to the Indentation Metrics and the Change Metric LA

NCCW and NCCKW are extended versions of NCW and NCKW. NCW and NCKW are only moderately correlated with AI, AS, and LA (less than 0.7). Hence, although the context information is conceptually similar to the indentation metrics and the changed lines, the context information is not redundant.

The Context Metrics NCCW and NCCKW are the Metrics that Represent the Highest Variance of all the Original Metrics

Table 13 shows the coefficients of the first principal component for each project in the PCA. A gray cell refers to an absolute coefficient of 0.3 or over. We observe that NCCW and NCCKW have absolute coefficients over 0.3 in all the studied projects. If a metric has a high coefficient in the first principal component in all the projects, that metric is likely to represent the highest variance of all studied metrics across the projects. NCCW and NCCKW include the context information and are strongly correlated with the indentation metrics and LA because they also use changed-line information. Hence, NCCW and NCCKW add the context information while carrying the information of the indentation metrics and LA, and therefore represent the highest variance.

Table 13 The coefficient of the first principal component for each project in the PCA. GNCCKW indicates gotoNCCKW. Please see text for a full explanation

In summary, the context metrics NCW and NCKW are not redundant metrics, and add the context information to the defect prediction model. While NCCW and NCCKW have strong correlations to the indentation metrics and LA, NCCW and NCCKW also add information from the context of the change.

Except for LT, NCCW Has the Strongest Basic Predicting Power in Terms of Information Gain Among the Studied Metrics

Figure 15 shows the ratio of the information gain. We observe that all the median values are greater than 1.0, except for LT. Hence, in almost all cases, the information gain of NCCW is higher than that of the other studied metrics. LT has a higher information gain; however, its prediction performance (e.g., AUC) is not good. In summary, except for LT, NCCW has the strongest basic predicting power among the studied metrics.

Fig. 15 The ratio of the information gain between NCCW and other metrics. The x-axis indicates the metrics that are used to compute the ratio; the y-axis indicates the ratio. The dashed line indicates that the ratio is 1.0

8.4 Does the Context Size Change the Complexity of a Change?

We argued that the more words/keywords a context contains, the more complex the change is. Since the number of words/keywords is partly determined by the context size, one might be concerned that the measured complexity changes with the context size. In this discussion, we explain why changing the context size does not affect the complexity of a change.

From our experiments, given a fixed context size, the number of words/keywords in that context is a good indicator of the complexity of the change (RQ1). As the context size increases, the number of context words/keywords also increases; however, the distance of some words/keywords to the hunk also increases, making them less effective as an indicator of complexity. Hence, a balance is required: too small a context might not have enough information to capture the context of the change, while a context that is too large will dilute the important context information around a hunk.

8.5 What are the Actual AUC and MCC Values of the Context Metrics?

We study the ranks computed by the Scott-Knott ESD test across the studied metrics to determine the best prediction metrics to use in defect prediction. However, practitioners may be concerned about the actual AUC and MCC values, since they need accurate prediction models.

We show the actual AUC and MCC values in the Appendix (Tables 18 and 19). From the AUC results (Table 18), COMB provides at least 0.737; this value corresponds to a strong effect size according to prior work (Rice and Harris 2005). From the MCC results (Table 19), COMB provides at least 0.3, except for RF in the Camel project; this value corresponds to a moderate correlation. Hence, we conclude that COMB can be used in practice, since they have acceptable prediction performance in terms of the actual values as well.

8.6 Practical Guides (Recommendations) for the Parameters of the Context Metrics

The context metrics have two tunable parameters: the context size and the chunk type. We provide practical guides (recommendations) for optimizing these parameters so that they are as applicable as possible for practitioners.

Recommendation 1: If Practitioners Have Both Training Data and Validation Data, We Recommend Optimizing the Context Size and the Chunk Type Following our Experiments in RQ1

The most important parameters to determine are how many context lines to use (we call this the context size) and what type of context lines to use (we call this the chunk type). In our study, we determined the context size and the chunk type that yield the best results; we recommend that, if practitioners have training data and validation data, they optimize the context size and the chunk type following our experiments in RQ1. Our experiments in RQ3 show that COMB, the combination metrics of the extended context metrics that count the number of words and the number of the “goto” keyword, significantly outperform the other studied metrics. Hence, if practitioners want to use our prediction model, we recommend using COMB. Practitioners do not need to decide between the number of keywords and the number of words as a parameter of the context metrics, since COMB include both. The details of how to use COMB can be found in Section 8.7.

Recommendation 2: If Practitioners do not Have Enough Validation Data, We Recommend Using the Same Parameters that We Found to Perform Best

Our experiments in RQ1 show optimal values for the parameters for the projects we studied. The studied projects cover multiple domains of software, and two popular programming languages, C++ and Java. We believe this diversity of studied projects is likely to make these parameters useful in general.

8.7 Practical Guides (Recommendation) for Practitioners Who want to Use a Defect Prediction Model

We proposed the context metrics. We now present recommendations for using them in defect prediction, based on our experimental results.

Recommendation 1: Use the Indentation Metric AS Instead of the Traditional Size Metrics in the Change Metrics

Our experiments in RQ2 show that AS significantly outperforms the other studied metrics, including the traditional size metrics (LA, LD, and LT). In addition, AS is strongly correlated with the traditional size metric LA, which has the highest performance among the change (code churn) metrics. Hence, using AS instead of the traditional size metrics allows practitioners to improve the performance of their defect prediction models.

Recommendation 2: For the Case Where Practitioners Want to Improve the Prediction Performance Using a Simple Prediction Model, Use the Context Metrics COMB on the Logistic Regression Model

Our experiments in RQ3 show that COMB are the best-performing metrics in AUC and MCC. In addition, our discussion shows that: (1) NCCW, a context metric used in COMB, is one of the metrics that represent the highest variance of all the original metrics, and (2) the basic defect predicting power of NCCW is strong. Regarding the interpretation of the prediction model, COMB contains only two metrics (NCCW and gotoNCCKW); therefore, the prediction results can be interpreted easily. Finally, the effect size of the actual AUC values is strong. Hence, when practitioners want to improve prediction performance with a simple prediction model, using COMB might allow them to obtain good prediction performance.

9 Threats to Validity

9.1 Construct Validity

We follow the labeling process of Commit Guru (Rosen et al. 2015) in order to label each commit as either defective or clean. The SZZ algorithm is also a popular approach to identifying defective commits (Śliwerski et al. 2005); however, it has no open source implementation available. In contrast, Commit Guru is a publicly available open source project. Hence, we follow the labeling process of Commit Guru for its repeatability and openness.

We use the online change classification (Tan et al. 2015) to validate the performance of defect prediction. This validation technique addresses the challenges of the cross validation technique. Hence, we believe this validation technique is acceptable.

The online change classification has parameters. In particular, the unit (test interval) is the most important one. Below, we study the impact of the unit on defect prediction performance. If the unit had a strong impact on the performance, we would need to consider this parameter in our experiments.

Approach

We build defect prediction models for NCCW, NCCKW, COMB, AS, LA, and the change metrics. The prediction procedure is almost the same as in RQ2; the only difference is that we vary the unit value from 10 to 100 in steps of 10. Finally, we report the evaluation measures by (1) plotting a line plot for each project, prediction model, and studied metric, and (2) computing the median and 75th percentile (3Q) of the IQR values across the different unit values for all projects, prediction models, and studied metrics.
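The following sketch (hypothetical AUC values for one project, prediction model, and metric) illustrates the summary step: the IQR of an evaluation measure across the unit values, which Table 14 aggregates over all combinations.

```python
# Minimal sketch (hypothetical numbers): sensitivity of an evaluation measure
# to the unit (test interval), summarized as an IQR across unit values.
import numpy as np

units = range(10, 101, 10)
# hypothetical AUC values of one (project, model, metric) combination per unit
auc_by_unit = {u: 0.75 + 0.005 * np.sin(u) for u in units}

values = np.array(list(auc_by_unit.values()))
iqr = np.percentile(values, 75) - np.percentile(values, 25)   # spread across units
print(round(iqr, 3))   # an IQR well below 0.05 suggests the unit has little impact
```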

Results

Figure 16 shows the values of the evaluation measures for different unit values. We observe that all evaluation measures are stable for different unit values. In addition, we observe the same tendency for other projects, prediction models, and metrics.

Fig. 16 The values of the evaluation measures for each unit (test interval) using the NCCW metric on the LR model in the Camel project. Eva indicates the evaluation measures. The x-axis indicates the unit value from 10 to 100; the y-axis indicates the values of the evaluation measures

Table 14 shows the median and 75th percentile (3Q) of the IQR values across different unit values for all projects, prediction models, and studied metrics. We observe that even the 3Q IQR values are less than 0.05 in all cases. Hence, the unit (test interval) has little impact on the results. Since the training interval is determined by the unit, the training interval also has little impact on the results.

Table 14 The median and 75th percentile (3Q) IQR values of the performance for the context metrics, the indentation metric, and the change metrics

9.2 External Validity

As the studied projects, we use six large open source software systems. These systems are written in the popular programming languages C++ and Java, and cover various types of software, such as server and web applications. The systems we study are open source rather than commercial software. In the future, we need to study the context metrics, extended context metrics, and combination metrics on commercial projects to verify our findings.

9.3 Internal Validity

We remove comments from the hunks. However, if all lines in a hunk are comments written in the “/* */” style, we cannot identify that the hunk consists of comments.

We use three evaluation measures, AUC, MCC and Brier score, which are not affected by skewed data (Zhang et al. 2016; Boughorbel et al. 2017) and address the pitfalls in defect prediction (Tantithamthavorn et al. 2017). Hence, we believe these measures are acceptable.

10 Conclusion

In this paper, we propose context metrics based on the context lines, extended context metrics based on both the context lines and the changed lines (and hence also acting as code churn metrics), and COMB, which is based on the extended context metrics. We study the impact of considering the context lines on defect prediction.

We compare the context metrics, the extended context metrics, and COMB with the traditional code churn metrics on six open source software projects. The main findings of our paper are as follows:

  • The chunk type ‘+’ is the best parameter for context metrics for defect prediction. This chunk type achieves the best median rank according to the three evaluation measures, AUC, MCC and Brier score on the Scott-Knott ESD test.

  • A small context size is suitable when considering the number of words, while a large context size is suitable when considering the number of keywords in the context lines for defect prediction.

  • The “goto” statement in the context lines and the changed lines is the best keyword for detecting defective commits with the modified NCCKW.

  • Our proposed combination metrics, COMB, significantly outperform all the other studied metrics, and are the best-performing metrics in all of the studied projects in terms of AUC and MCC.