1 Introduction

Software developers have limited resources to verify and test their source code. If developers can identify defective components (e.g., files or commits) they would be able to focus their effort on these components. Defect prediction supports this activity, and prior work has reported that defect prediction can reduce development cost for developers (Tassey 2002).

There exists plenty of work aimed at predicting defective components (Basili et al. 1996; Kim et al. 2007; Moser et al. 2008; Hassan 2009; D’Ambros et al. 2010). In particular, several prior studies have focused on predicting defective changes, called change-level defect prediction or just-in-time defect prediction (Kamei et al. 2013; Kim et al. 2008; Fukushima et al. 2014; Mockus and Votta 2000). Just-in-time defect prediction has the advantage that it can determine whether a commit is likely to be defective at the time the commit is made (Hata et al. 2012), providing faster feedback than other defect prediction methods (Kamei et al. 2013). Previous research has used metrics based on measuring the code changes (e.g., churn, the number of changed lines) in just-in-time defect prediction (Kamei et al. 2013; Kim et al. 2008; Mockus and Votta 2000).

To the best of our knowledge, no studies have considered using the information in the lines that surround the changed lines of a commit, which we call context lines. Our main hypothesis is that information in the context lines has an impact on the likelihood that the change is defective.

In this paper, we evaluate the use of this information in just-in-time defect prediction. The dictionary defines context as “the parts of something written or spoken that immediately precede and follow a word or passage and clarify its meaning” (Stevenson and Lindberg 2010). In this paper, we define the context lines of a chunk of changed lines as the n lines (n = 1,2,…) that precede the chunk and the n lines that follow the chunk.

This paper proposes several context metrics. The different metrics vary along three axes: a) how many context lines around each change to use (the size of the context, n), b) whether to use all context lines, or only those of added or removed lines (the type of the change), and c) whether to count the number of words or the number of keywords (as defined by the programming language) in the context. We consider these axes as the parameters of context metrics, and we refer to a context metric computed with a particular set of parameter values as a variant of the context metrics. We empirically study the best-performing variant in terms of defect prediction performance. We also compare the best-performing variants of the context metrics with traditional code churn metrics (change metrics (Kamei et al. 2013; Kim et al. 2008; Mockus and Votta 2000) and indentation metrics (Hindle et al. 2008)), extended context metrics, and combination metrics that use two extended context metrics. Indentation metrics use the total number of white spaces in front of changed lines and the total number of pairs of braces that surround changed lines; we treat indentation metrics as code churn metrics, since they are computed on changed lines. In order to improve the predictive power of the context metrics in defect prediction, we also define extended context metrics. Extended context metrics count the number of words/keywords in both the context lines and the changed lines. Hence, extended context metrics are hybrids of the context metrics and traditional code churn metrics. In addition, we use combination metrics that combine two extended context metrics, one counting the number of words and one counting the number of a certain keyword (e.g., “goto”), in a single prediction model in order to further improve the predictive power of the extended context metrics.

Using six large open source software projects (from different domains) we empirically evaluate the defect prediction power of context metrics and compare them against traditional change metrics. This comparison is done using logistic regression models and random forest models.

Specifically, we address the following three research questions:

RQ1: What is the impact of the different variants of context metrics on defect prediction?

RQ2: Do context lines improve the performance of defect prediction?

RQ3: What is the impact of combination metrics of context metrics on defect prediction?

The main findings of our paper are as follows:

  • The best performing context metrics are the ones that measure the context of added-lines only.

  • The prediction power of context metrics varies when different sizes of the context (number of lines around the change) are used. The optimal size of the context for the metric that uses number of words is smaller than the optimal size for the metric that uses keywords.

  • The number of “goto” statements in context lines and changed lines is a good indicator of defective commits.

  • Our proposed combination metrics of extended context metrics significantly outperform all the other metrics used in this paper, and are the best-performing metrics in all of the studied projects in terms of 2 of the 3 evaluation measures used (area under the receiver operating characteristic curve and Matthews correlation coefficient).

This paper is organized as follows: Section 2 shows a motivating example. Section 3 introduces related work. Section 4 explains our proposed context metrics. Section 5 presents our case study design. Section 6 describes the research questions and methodology. Section 7 presents the results of our case study. Section 8 discusses the results. Section 9 describes the threats to the validity of our findings. Section 10 presents the conclusion.

2 Motivating Example

Let us start with a simple example to illustrate the use of context lines to measure the complexity of changes. Figure 1 shows an example of two changed functions. The context lines are the lines that precede or follow the changed lines. In this example, the underlined text represents the context lines and the bold lines are the changed lines. The function shown in Fig. 1a has simple context lines: there is one assignment before the changed line and one empty line after the changed line. The change in Fig. 1b has more complex context lines: the “if” and “else” statements. If we use only the changed lines as an input to compute the complexity of the changes, these two changes have the same complexity. In contrast, if we use the context lines as a measure of complexity, these two functions have a different complexity.

Fig. 1: An example of two changed functions, each of which has one changed line (in this case, an added line, in bold). We call the lines that precede or follow the changed lines context lines (in italic with an underline). All other lines are the same in both functions

To the best of our knowledge, there exists no research work that studies the context lines in defect prediction. In this paper, we introduce two types of new metrics that use the context lines: context metrics and extended context metrics, and evaluate their performance in defect prediction.

There are complexity metrics, such as Halstead’s complexity metrics (Halstead 1977) and McCabe’s Cyclomatic complexity metrics (McCabe 1976), that can capture the complexity of the function being changed and take the context into consideration; however, (1) to compute these metrics we need all the lines of the function, (2) these metrics are limited because they require a parser, and (3) complexity metrics are not optimized for code churn. In contrast, context metrics provide several advantages: they are easy to compute (they only require the “diff” output and, in the case of the number of keywords, a list of keywords of the programming language as input) and they measure only the complexity that surrounds the change instead of that of the entire function.

3 Related Work

3.1 Source Code Churn

Many researchers have studied source code churn in relation to software defects, reliability and quality (Nagappan and Ball 2005; Munson and Elbaum 1998; Khoshgoftaar et al. 1996; Ohlsson et al. 1999; Graves et al. 2000; Karunanithi 1993; Khoshgoftaar and Szabo 1994; Ostrand et al. 2004; Kamei et al. 2013; Kim et al. 2008; Mockus and Votta 2000). Source code churn measures changes and extensions of source code in a period of time (Oram and Wilson 2010). Munson and Elbaum (1998) reported that, as a system is developed (evolved), the complexity of the system also changes.

They proposed a methodology to produce an indicator of defects based on this tendency. Nagappan and Ball (2005) predicted defect density between different releases of Windows Server 2003. Comparing traditional code churn metrics with relative code churn metrics, which normalize code churn by factors such as the size of the component, they found that the relative code churn metrics are strong predictors of defect density.

Prior studies proposed more complex code churn metrics (Hassan 2009; Hindle et al. 2008). Hassan (2009) proposed code churn metrics based on the code change process. He applied Shannon entropy (from information theory) to the code change process in order to formulate his metrics.

Hindle et al. (2008) proposed indentation metrics that measure the indentations of added-lines and fixed-lines of changes. They studied the correlations between the indentation metrics and traditional complexity metrics (McCabe’s Cyclomatic complexity (McCabe 1976) and Halstead’s complexity (Halstead 1977)). They showed that the indentation metrics are mildly or strongly correlated with the traditional complexity metrics and the indentation is potentially its own complexity metric (Hindle et al. 2008). Because indentation metrics use the information in changed or added lines, we refer to indentation metrics as a type of code churn metric. This paper is the first study to investigate the effectiveness of indentation metrics for defect prediction.

In this paper, we compare the prediction power of 6 types of metrics in defect prediction. These metrics are: 1) context metrics, 2) traditional code churn metrics (Kamei et al. 2013; Kim et al. 2008; Mockus and Votta 2000), 3) each individual traditional code churn metric, 4) code churn metrics based on indentation metrics (Hindle et al. 2008), 5) extended context metrics (which are combinations of context metrics and a traditional code churn metric) and 6) combination metrics of extended context metrics (which combine two extended context metrics, the number of words and the number of a certain keyword, in a single prediction model).

3.2 Text-Based/Just-In-Time Defect Prediction

Many researchers have tackled the problem of defect prediction (Mizuno and Kikuno 2007; Kim et al. 2008, 2011; Kamei et al. 2013; Aversano et al. 2007; Jiang et al. 2013; Yang et al. 2015; Wang et al. 2016; Zimmermann et al. 2007; Li et al. 2017; Bettenburg et al. 2012; Śliwerski et al. 2005). In addition, several researchers have proposed metrics to predict defective components (Basili et al. 1996; Kim et al. 2007; Moser et al. 2008; Hassan 2009; D’Ambros et al. 2010). Mizuno and Kikuno (2007) applied a spam filter to the defect prediction problem. Śliwerski et al. (2005) proposed a method that automatically identifies changes that lead to defects in the future.

Textual information has also been used for defect prediction (Mizuno and Kikuno 2007; Kim et al. 2008; Aversano et al. 2007; Wang et al. 2016; Li et al. 2017). Kim et al. (2008) used not only metadata and complexity metrics but also text information to build a prediction model and predict defects. They used change-log messages, source code and file names as input to their predictors.

Wang et al. (2016) used the programs’ Abstract Syntax Trees (ASTs) as a representation of source code. They applied a deep learning technique to ASTs in order to learn semantic features from token vectors.

Several researchers have worked on just-in-time defect prediction (Kamei et al. 2013; Kim et al. 2007, 2008, 2011; Fukushima et al. 2014; Mockus and Votta 2000; Aversano et al. 2007; Jiang et al. 2013; Yang et al. 2015; Hassan 2009). Just-in-time defect prediction aims at identifying defective code changes, such as commits, instead of identifying defective files or packages as in traditional file/package-level defect prediction. For example, Kamei et al. (2013) focused on predicting the risk of commits. They used change metrics to predict defective commits at commit time. Yang et al. (2015) applied a deep learning technique as a prediction model to change metrics and conducted just-in-time defect prediction. Just-in-time defect prediction has the following three benefits that address the challenges of file/package-level defect prediction (Kamei et al. 2013): (1) prediction targets are fine-grained, (2) the relevant developers can be identified, and (3) feedback is provided faster. In this paper, we use context metrics for just-in-time defect prediction.

There are several widely known pitfalls that should be avoided in defect prediction (Tan et al. 2015; Tantithamthavorn and Hassan 2018). For example, Tan et al. (2015) reported that the cross validation technique is frequently used to evaluate prediction models (Kim et al. 2008, 2011; Bettenburg et al. 2012; Jiang et al. 2013; Kamei et al. 2013). However, this technique risks mixing past and future commits, an unrealistic scenario that artificially improves results. In our study, we take their recommendations into consideration to avoid these potential pitfalls by using online change classification, a validation technique that avoids these risks. We describe the details in Section 5.4.

4 Context Metrics

In this section, we describe the implementation of the proposed context metrics. As described in the previous sections, context information might be useful for defect prediction since it provides a new perspective of changes. In addition, it is easy to obtain context information (e.g., using the diff command in the version control system). For example, for the changed function in Fig. 1b, we consider only the lines in italic with an underline for context information.

Any modification to a file can be described in terms of a unified diff. A unified diff is a sequence of hunks; each hunk is composed of one or more sequences of contiguously changed lines. Each of these sequences is composed of ‘+’ lines (lines added to the file) or ‘-’ lines (lines removed from the file). For the sake of simplicity, we refer to these sequences of changed lines as chunks. We consider two types of chunks: ‘+’ chunks (which contain at least one ‘+’ line) and ‘-’ chunks (which contain at least one ‘-’ line). Finally, we will refer to any chunks (including both ‘+’ and ‘-’ chunks) as ‘all’ chunks. Figure 2 shows an example of two unified diffs (a part of the output of git show).

Fig. 2: An example of the unified diffs of a commit with context size equal to three, produced by git show (< 1 >), in the Bitcoin project; due to space limitations, we remove the metadata of this commit (the commit comment and the author information). This commit consists of two source code file diffs. The upper diff has two hunks (divided by the lines prefixed with @@, < 2 >). Each of these hunks consists of only one chunk (a sequence of changed lines). The first chunk is of type ‘+’ and ‘all’; the second is of type ‘-’ and ‘all’. The lower diff has one hunk, which consists of two chunks, each of type ‘+’ and ‘all’. The context lines of each chunk are the lines above and below the corresponding chunk (above and below < 3 > and < 4 >). The filename is prefixed with ‘+++ b/’

The upper unified diff shown in Fig. 2 is a sequence of two hunks that are divided by the lines prefixed with @@, < 2 >. Each hunk has one chunk, < 3 > and < 4 >, respectively. The upper chunk, < 3 >, is of type ‘+’ and ‘all’. The lower chunk, < 4 >, is of type ‘-’ and ‘all’. The lower unified diff has one hunk. This hunk includes two chunks that are of type ‘+’ and ‘all’.

Each chunk is surrounded by its context lines (the lines above and below the chunk that indicate where the chunk is to be applied, prefixed with ‘ ’ in the hunk). We refer to these context lines as the context of the chunk. We also consider the full filename of the file being changed as part of the context. This is because we consider that the directories where the file is located can contribute to the complexity of the context; i.e., more directories in the filename indicate a more complex context than no directories. We evaluated the use of the filename/directories in the context metrics for their prediction power and found that, when used, the performance of the context metrics improved.
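To make the notions of chunks and their contexts concrete, the following minimal Python sketch groups the body of a single hunk into runs of changed and context lines and collects the context lines adjacent to chunks of a requested type. The function name chunk_contexts and the tiny example hunk are illustrative only; they are not part of our tooling, and the handling of context runs shared by two neighboring chunks is a simplifying assumption.

```python
import itertools

def chunk_contexts(hunk_body, chunk_type="+"):
    """Return the context lines adjacent to chunks of the given type ('+', '-', 'all')."""
    is_change = lambda line: line[:1] in ("+", "-")
    # Group the hunk body into alternating runs of changed lines and context lines.
    runs = [(changed, list(group))
            for changed, group in itertools.groupby(hunk_body, key=is_change)]
    contexts = []
    for i, (changed, lines) in enumerate(runs):
        if not changed:
            continue  # this run is itself context, not a chunk
        if chunk_type != "all" and not any(l.startswith(chunk_type) for l in lines):
            continue  # chunk does not match the requested type
        if i > 0 and not runs[i - 1][0]:
            contexts += [l[1:] for l in runs[i - 1][1]]   # context preceding the chunk
        if i + 1 < len(runs) and not runs[i + 1][0]:
            contexts += [l[1:] for l in runs[i + 1][1]]   # context following the chunk
    return contexts

# A hypothetical hunk body (without the @@ header), context size n = 1.
hunk = [" int total = 0;",
        "+total += compute(x);",
        " return total;"]
print(chunk_contexts(hunk, "+"))   # ['int total = 0;', 'return total;']
```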

For explaining context metrics, we define the following terminology:

  • c: a commit.

  • n: a context size that is the maximum number of lines that can precede or follow a chunk we consider. (This is also a parameter of the diff command in the version control system.)

  • d(f,n): a unified diff of a changed file f with context size n.

  • D(c,n): a set of d(f,n) for all the changed files in commit c.

For a given unified diff d(f,n), we define the three types of contexts, based on the three chunk types, with the following notation (refer to Table 1):

  • context(d(f,n),t): the concatenation of the full filename of f and the context of all chunks of chunk type t in diff d(f,n).

For a unified diff d(f,n), we define the following two notations:

  1. ncw(d(f,n),t): the number of words in context(d(f,n),t).

  2. nckw(d(f,n),t): the number of programming language keywords (Table 2 shows all studied keywords) in context(d(f,n),t).

Given a commit c, a context size (the number of context lines) n, and the chunk type t, we define the following two kinds of context metrics:

$$ \begin{array}{@{}rcl@{}} NCW\left( c, n, t\right) &=& \sum\limits_{d(f,n) \in D(c,n)} ncw(d(f,n), t),\\ NCKW\left( c, n, t\right) &=& \sum\limits_{d(f,n) \in D(c,n)} nckw(d(f,n), t). \end{array} $$
Table 1 Types of contexts. The context of chunk type t of a unified diff d(f,n) is the concatenation of the full filename of f and the contexts of the chunk type t in the diff d(f,n)
Table 2 Studied programming language keywords

The defined context metrics are described in Table 3. To compute the context metrics of a commit m(c,n,t), where m is either NCW or NCKW, c is a commit id, n is the number of context lines, and t is the chunk type, we use the following algorithm (a Python sketch of this algorithm is given after the list):

  1. Compute the diffs D(c,n) of the source code files of commit c with the given number of lines of context, n, using the following command: git show --unified=n c

  2. For each diff d(f,n) of a source code file, compute ncw(d(f,n),t) or nckw(d(f,n),t):

     (a) Remove all chunks that are not of chunk type t, including their contexts.

     (b) Remove comments.

     (c) Create a string st with the concatenation of

       • the full filename of the diff d(f,n), and

       • the contexts around the identified chunks.

     (d) Use lscp (Thomas WS 2015) to convert st into a sequence of words. For ncw, count the number of words in this sequence; for nckw, count the number of programming language keywords in st.

  3. Finally, the context metric NCW/NCKW of the commit is calculated as the sum of the values of ncw/nckw over all diffs of the source code files in the commit.
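The following end-to-end Python sketch illustrates the algorithm above under simplifying assumptions: it tokenizes by whitespace and identifier patterns instead of lscp, skips comment removal and the filtering of non-source-code files, and uses only an illustrative subset of the keywords of Table 2; all helper names are hypothetical.

```python
import itertools
import re
import subprocess

# Illustrative subset of the studied keywords; the full list is given in Table 2.
KEYWORDS = {"if", "else", "for", "while", "switch", "case", "do",
            "return", "goto", "break", "continue"}

def _contexts(hunk_body, chunk_type):
    """Context lines adjacent to chunks of the requested type within one hunk body."""
    runs = [(changed, list(group)) for changed, group in
            itertools.groupby(hunk_body, key=lambda l: l[:1] in ("+", "-"))]
    out = []
    for i, (changed, lines) in enumerate(runs):
        if not changed:
            continue
        if chunk_type != "all" and not any(l.startswith(chunk_type) for l in lines):
            continue
        if i > 0 and not runs[i - 1][0]:
            out += [l[1:] for l in runs[i - 1][1]]        # preceding context
        if i + 1 < len(runs) and not runs[i + 1][0]:
            out += [l[1:] for l in runs[i + 1][1]]        # following context
    return out

def _context_string(file_diff, chunk_type):
    """Step 2(c): concatenate the full filename and the contexts of the chunks."""
    match = re.search(r"(?m)^\+\+\+ b/(.*)$", file_diff)
    pieces = [match.group(1)] if match else []
    for hunk in re.split(r"(?m)^@@.*$", file_diff)[1:]:
        body = [l for l in hunk.splitlines() if l and not l.startswith("\\")]
        pieces += _contexts(body, chunk_type)
    return " ".join(pieces)

def context_metrics(commit, n, chunk_type, repo="."):
    """Return (NCW, NCKW) of a commit for context size n and chunk type t."""
    diff = subprocess.run(                                 # step 1: git show --unified=n c
        ["git", "show", f"--unified={n}", "--pretty=format:", commit],
        cwd=repo, capture_output=True, text=True, check=True).stdout
    ncw = nckw = 0
    for file_diff in re.split(r"(?m)^diff --git ", diff):  # one diff per changed file
        if not file_diff.strip():
            continue
        s = _context_string(file_diff, chunk_type)
        ncw += len(s.split())                              # step 2(d): number of words
        nckw += sum(w in KEYWORDS                          # ... and number of keywords
                    for w in re.findall(r"[A-Za-z_]\w*", s))
    return ncw, nckw                                       # step 3: summed over all diffs
```

For example, context_metrics("HEAD", 1, "+") would approximate NCW(c, 1, +) and NCKW(c, 1, +) for the most recent commit of the repository in the current directory.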

Table 3 Different context metrics

Figure 3 depicts an example showing how the context metrics are computed from a unified diff. The left square corresponds to the first step in our algorithm. (1) and (2) correspond to the second step; we remove unrelated code in (1) and convert the string into a sequence of words using lscp in (2). (3) corresponds to the third step, in which we compute the context metrics.

The Intuition Behind Counting Words or Keywords:

Our definition of context metrics involves counting words or keywords in the context of a change. We consider that a context with more words is likely to be more complex than a context that has fewer words. Hence, we consider that counting the number of words in the context of a change is a proxy of the complexity of such a change.

The main intuition behind using the number of keywords is that the number of keywords in the context might indicate how deeply nested the change is. Therefore, a change with a larger number of keywords around it is likely to be more complex than a change that has fewer (or no) keywords around it.

Finally, counting the number of words/keywords is easy to do in practice.

Fig. 3: Example showing how NCW and NCKW are computed from a unified diff. The unified diff corresponds to the change from Fig. 2; due to space limitations, we remove several hunks, the commit comment, the author information, and the commit hash from the unified diff, and use the “--unified=1” option. The number of context lines n is 1. The chunk type t is ‘+’. The commit hash c is ‘commit_hash.’ The changed file f is ‘src/qt/rpcconsole.cpp.’ The left square corresponds to the first step in our algorithm. (1) and (2) correspond to the second step; we remove unrelated code in (1) and convert the string into a sequence of words using lscp in (2). (3) corresponds to the third step, in which we compute the context metrics

5 Case Study Design

In this section, we discuss the studied indentation metrics, the data preparation, the validation technique, the preprocessing, the studied projects, the resampling approach, the evaluation measures, and the prediction models.

5.1 Indentation Metrics

We compare context metrics with indentation metrics. We study two indentation metrics. The first is Added Spaces (AS), defined by Hindle et al. (2008); AS is the sum of the number of white spaces in front of all the ‘+’ lines in a commit.

The second is a new indentation metric that we define, Added Braces (AB). We consider the number of braces as a logical indentation because the number of braces in C++ and Java expresses how deeply one block of code is nested inside others.

We first count the number of left braces B_left and right braces B_right from the head of the function to each ‘+’ line. Second, we compute the difference B_diff between B_left and B_right for each ‘+’ line. Finally, we sum B_diff over all ‘+’ lines in a commit.
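A minimal sketch of the two indentation metrics follows. It assumes the lines of the enclosing function and the indices of the added lines are already known, counts only space characters for AS (tabs would need to be expanded), and includes the added line itself when counting braces for AB; these choices are our reading of the definitions above, not the exact implementation.

```python
def added_spaces(added_lines):
    """AS: total number of leading white spaces over all '+' lines of a commit."""
    return sum(len(line) - len(line.lstrip(" ")) for line in added_lines)

def added_braces(function_lines, added_line_indices):
    """AB: for each added line, B_diff = B_left - B_right counted from the head of
    the function to that line; AB is the sum of B_diff over all added lines."""
    ab = 0
    for idx in added_line_indices:
        prefix = "\n".join(function_lines[:idx + 1])
        ab += prefix.count("{") - prefix.count("}")
    return ab

# Hypothetical function body in which the line with index 2 was added by the commit.
func = ["void f(int x) {",
        "    if (x > 0) {",
        "        log(x);",
        "    }",
        "}"]
print(added_spaces([func[2]]))   # 8 leading spaces
print(added_braces(func, [2]))   # 2: the added line sits inside two open braces
```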

The Intuition of Using the Indentation Metrics as Way to Predict Defects:

The indentation metrics have been used as a proxy to measure complexity of source code (Hindle et al. 2008).

However, they have not been used in defect prediction. The rationale behind their use in defect prediction is that modifications in more indented code are likely to be more complex than modifications in less indented code, because the person doing the changes not only has to be concerned with what the code does, but also with the code that surrounds it. Code with larger indentation is likely to be inside more control blocks (e.g., while, for, and if statements) than code with less indentation; we hypothesize that more control blocks might create more brittle code. Hence, all things being equal, we expect that changes to code that has more indentation might result in more defects than changes to code that has less indentation.

5.2 Preparing Data using Commit Guru

The availability and openness of experimental data are a real challenge when evaluating defect prediction approaches. Therefore, we use data provided by Commit Guru, which Rosen et al. (2015) provide publicly. Commit Guru is a web application which identifies and predicts defective commits for Git repositories and calculates the change metrics (Table 4) that are often used for just-in-time defect prediction (Kamei et al. 2013).

Table 4 Change metrics

In this paper, we use Commit Guru to calculate the change metrics (Kamei et al. 2013). We use the change metrics in RQ2 to compare against the context metrics in order to study the impact of the context metrics on defect prediction. We use both the change metrics and their subsets (each individual change metric) as studied metrics.

We refer to each individual metric in the change metrics as a subset of the change metrics. When using a subset of the change metrics, we pick one metric from the change metrics and use that metric alone for defect prediction. This is because each of the change metrics is also a churn metric. However, several metrics do not strongly relate to code churn. For example, the Purpose metric (i.e., FIX, described in Table 4) is not affected by code churn. Hence, we remove three types of metrics from the change metrics when considering their subsets: the Purpose metric (i.e., FIX), the History metrics (i.e., NDEV, AGE, and NUC), and the Experience metrics (i.e., EXP, REXP and SEXP). Hence, we use each of NS, ND, NF, Entropy, LA, LD, and LT as a subset of the change metrics. We apply z-score to each of the subsets to normalize it to a mean of 0 and a variance of 1.

When using the change metrics, to avoid using several strongly correlated metrics in the prediction, we apply the following preprocessing proposed and described in Kamei et al. (2013):

  • Exclude ND and REXP since they are strongly correlated with NF and EXP.

  • LA and LD are divided by LT to normalize LA and LD.

  • LT and NUC are divided by NF to normalize LT and NUC.

Finally, we apply z-score (Zhang et al. 2016) to the change metrics to normalize them to a mean of 0 and a variance of 1.
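The sketch below shows this preprocessing with pandas, assuming a data frame whose columns are the numeric change metrics of Table 4; the column names and the guard against division by zero are our own assumptions rather than details of Commit Guru's implementation.

```python
import pandas as pd

def preprocess_change_metrics(df: pd.DataFrame) -> pd.DataFrame:
    """Preprocessing of the change metrics following Kamei et al. (2013)."""
    df = df.copy()
    df = df.drop(columns=["ND", "REXP"])       # strongly correlated with NF and EXP
    lt = df["LT"].replace(0, 1)                # guard against division by zero
    nf = df["NF"].replace(0, 1)
    df["LA"] = df["LA"] / lt                   # LA and LD normalized by LT
    df["LD"] = df["LD"] / lt
    df["LT"] = df["LT"] / nf                   # LT and NUC normalized by NF
    df["NUC"] = df["NUC"] / nf
    return (df - df.mean()) / df.std(ddof=0)   # z-score: mean 0, variance 1
```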

5.3 Time Sensitive Change Classification

Because we could use future commits to predict past commits, using 10-fold cross validation risks producing artificially good results, such as high precision and recall, when studying just-in-time defect prediction (Tan et al. 2015). In addition, when using 10-fold cross validation, we label the commits in the training data as defective or not using the information of all commits. However, this procedure also risks using future information for prediction. To address these two issues and validate our experiments, we use time sensitive change classification (Tan et al. 2015).

Time sensitive change classification uses only past commits to label past commits and build prediction models for future commits. Figure 4 shows an example of time sensitive change classification that uses the training interval between t − Tr and t as training data and the test interval between t and t + Te as test data. In this example, we use the commits in the training data to label those commits and to build prediction models for predicting the commits in the test data.

Fig. 4: An example of time sensitive change classification. The gray cross indicates that the information about the fix of a commit is not used in the training interval

However, Tan et al. (2015) reported three challenges. First, because defective commits are typically detected and fixed in 100–300 days (Kim and Whitehead 2006), many undetected defective commits in the training interval would be labeled clean. Second, this validation is sensitive to the interval. For example, if the training interval is before the release day, the features in the test interval would differ from those in the training interval. Third, if we take a long time gap between the training interval and the test interval, features such as developers and programming styles might have changed between the training interval and the test interval. To address these three challenges, Tan et al. (2015) recommended using online change classification.

5.4 Online Change Classification

Online change classification is a validation technique. We describe online change classification and how this validation technique addresses these three challenges. To address the first challenge, a gap is used between the training interval and the test interval (Fig. 5). The gap is used only during the labeling of the commits in the training interval. This additional interval allows more time to detect defective commits in the training interval and makes the labeling more precise. Typically, the gap is the average or median time between a defect-inducing commit and a defect-fixing commit; in our experiments, we use the median time for each project, obtained from our pre-experiment (Table 5).

Fig. 5: An overview of online change classification. We show two iterations as an example. The black part of the rectangle is the training data (training interval), labeled using the commits in the training interval and the gap (in dark gray). The light gray part of the rectangle is the test data (test interval), labeled using all of the commits in the project history including the end gap. Details of the terms in this figure are described in Section 5.4

Table 5 Parameter values of the online change classification for each project (days)

To address the second and third challenges, time sensitive change classification is executed multiple times while updating the training interval, test interval and gap. The multiple executions minimize the bias from a certain test interval. The training interval, test interval and gap slide into the future by a certain interval (Fig. 5). This certain interval is called the unit. A unit is 30 days (one month) in our experiments. The test interval is 30 days as well. Note that the unit and the test interval are parameters; hence, different parameter values might have an impact on the results of our experiments. We study this point in Section 9. The result shows that these parameters have little impact on the results of our experiments.

We also use a start gap and an end gap (Tan et al. 2015), which are intervals that we do not use as training interval or test interval. The beginning of a software project history may be inconsistent and unstable. The commits at the end of a software project history would be labeled clean because their defects would not yet be detected. Hence, the start gap and end gap support building better prediction models and improve the quality of the analysis.

Table 5 shows the actual parameters for each project. We manually look at the number of commits and set the start date at a point after the number of committed commits has increased and then decreased moderately (i.e., reached a peak). The start gap is the interval between the first commit date and the start date. The reason why we use this process is that after the number of committed commits has increased and then decreased moderately, the project would have been released and would be in a stable state.

To decide the end gap, we need to compute the analysis period, the iteration step size and the training interval. In the following, the analysis period is the maximum number of studied days. We define the analysis period, the iteration step size and the training interval as follows:

$$ \begin{array}{@{}rcl@{}} {analysis\ period} &=& ({CommDate_{latest}} - {{start\ date}}) - {margin},\\ {iteration\ step\ size} &=& ({analysis\ period}/2 - {gap}) / {unit},\\ {Tr} &=& {iteration\ step\ size} \cdot {unit}, \end{array} $$

where (and hereafter)

  • CommDate_latest is the latest commit date,

  • margin is a margin to remove defective commits that may not be detected yet, and

  • Tr is the training interval.

We first compute the interval between the start date and the date that is margin days before the latest commit date. This process removes the defective commits that have not been detected yet. We use 365 as the margin to compute the end gap. Hence, the end gap is always 365 days or more. Because we use the unit as the test interval as well, the iteration step size indicates the number of iterations for which we can slide the training interval, test interval and gap into the future while avoiding the commits that were committed in the latest margin days. In addition, we use the gap to compute the iteration step size. This additional gap avoids the commits that are in the latest margin days plus the gap days and ensures that we consider enough commits to label the commits in the test interval. The training interval is determined by the iteration step size and the unit. Finally, we define the end date and the end gap as follows:

$$ \begin{array}{@{}rcl@{}} \mathit{end\ date} &=& \mathit{start\ date} + (\mathit{Tr} + \mathit{gap} + (\mathit{iteration\ step\ size} \cdot \mathit{unit})),\\ \mathit{end\ gap} &=& \mathit{CommDate_{latest}} - \mathit{end\ date}. \end{array} $$
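The following Python sketch puts the interval arithmetic above together; the integer divisions and the example dates are illustrative assumptions.

```python
from datetime import datetime, timedelta

def online_cc_intervals(start_date, latest_commit_date, gap_days,
                        unit_days=30, margin_days=365):
    """Compute the interval parameters of online change classification."""
    analysis_period = (latest_commit_date - start_date).days - margin_days
    iteration_step_size = (analysis_period // 2 - gap_days) // unit_days
    training_interval = iteration_step_size * unit_days                  # Tr, in days
    end_date = start_date + timedelta(
        days=training_interval + gap_days + iteration_step_size * unit_days)
    end_gap = (latest_commit_date - end_date).days                       # always >= 365
    return {"analysis period": analysis_period,
            "iteration step size": iteration_step_size,
            "training interval": training_interval,
            "end date": end_date,
            "end gap": end_gap}

# Hypothetical project: history starts 2015-01-01, latest commit 2019-01-01, gap 200 days.
print(online_cc_intervals(datetime(2015, 1, 1), datetime(2019, 1, 1), gap_days=200))
```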

For labeling commits either defective or clean, we follow the labeling process used by Commit Guru:

  1. Collect the commits c_fix whose messages contain specific keywords (as described by Rosen et al. (2015)), such as “bug” or “fix”. Identify the modified lines l in the commits c_fix.

  2. Find the previous commits c_bad in which the lines l were added or modified prior to the corresponding change in c_fix. Label each commit c_bad as defective.

We conduct this procedure using the training interval and the gap for labeling training data, and using all of the commits for labeling test data.
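A simplified Python sketch of this labeling process is shown below. It uses plain git commands (git log --grep, git show, git blame); the keyword list is only an illustrative subset of the one used by Rosen et al. (2015), and Commit Guru implements a more complete version of this process (e.g., handling renames and filtering non-source files).

```python
import re
import subprocess

FIX_KEYWORDS = ["bug", "fix", "defect", "patch"]   # illustrative subset (Rosen et al. 2015)

def git(repo, *args):
    return subprocess.run(["git", "-C", repo, *args],
                          capture_output=True, text=True, check=True).stdout

def fix_commits(repo):
    """Step 1: commits whose messages contain fix-related keywords."""
    pattern = "|".join(FIX_KEYWORDS)
    return git(repo, "log", "--all", "-i", "-E", f"--grep={pattern}", "--pretty=%H").split()

def defect_inducing_commits(repo, fix_commit):
    """Step 2: blame the lines modified by a fix commit on its parent to find the
    commits that last touched them; those commits are labeled defective."""
    bad, current_file = set(), None
    diff = git(repo, "show", "--unified=0", "--pretty=format:", fix_commit)
    for line in diff.splitlines():
        if line.startswith("diff --git"):
            current_file = None
        header = re.match(r"^--- a/(.+)$", line)
        if header:
            current_file = header.group(1)
        hunk = re.match(r"^@@ -(\d+)(?:,(\d+))? ", line)
        if hunk and current_file:
            start, count = int(hunk.group(1)), int(hunk.group(2) or "1")
            if count == 0:
                continue                      # pure addition: no old lines to blame
            blame = git(repo, "blame", "-l", f"-L{start},{start + count - 1}",
                        f"{fix_commit}^", "--", current_file)
            bad.update(l.split()[0].lstrip("^") for l in blame.splitlines())
    return bad
```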

5.5 Preprocessing by z-score

z-score is a popular normalization approach in defect prediction (Zhang et al. 2016). z-score normalizes the input data to mean 0 and variance 1.

The equation of z-score is:

$$ \boldsymbol{X}_{z-score} = \frac{\boldsymbol{X}_{org}-\mu}{\sigma} $$
(1)

where μ is the mean of the values of a feature over the commits, σ is the standard deviation of the values of that feature over the commits, X_org is the vector of all values (all commits) of the feature, and X_z-score is the vector of all values (all commits) of the normalized feature.
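A minimal NumPy sketch of Eq. (1) is given below; returning the training statistics so they can be reused on the test data reflects the procedure described later in Section 6.2.1.

```python
import numpy as np

def z_score(x, mu=None, sigma=None):
    """Normalize a feature vector to mean 0 and variance 1 (Eq. 1)."""
    x = np.asarray(x, dtype=float)
    mu = x.mean() if mu is None else mu
    sigma = x.std() if sigma is None else sigma
    return (x - mu) / sigma, mu, sigma

train_z, mu, sigma = z_score([3.0, 5.0, 7.0, 9.0])   # fit on the training data
test_z, _, _ = z_score([4.0, 8.0], mu, sigma)        # reuse the training statistics
```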

5.6 Studied Projects

For our experiments, we use six open source projects: Hadoop, Camel, Gerrit, Osmand, Bitcoin and Gimp. Table 6 shows details of the projects. The studied projects include software for various fields, such as a server or an application, and are written in two popular programming languages (C++ and Java). We calculate the context metrics and the indentation metrics for each commit of these projects. For more precise analysis, we study all the commits that have changed at least one line in the source code.

Table 6 Details of the studied projects. Defective rate refers to the commits labeled using all commits

5.7 Resampling Approach

While learning the defect prediction model, the learning performance is affected by imbalanced data (Tan et al. 2015). In our case, Table 6 shows that “clean” commits outnumber “defective” commits. Hence, if we use this data directly as training data, the learning performance could decrease. General resampling approaches remedy this problem, as shown by prior studies (Kamei et al. 2013; Yang et al. 2015; Tan et al. 2015).

For our experiment, we use random under-sampling. Random under-sampling reduces the majority class at random to make the size of the majority class equal to the size of the minority class. Because we must evaluate our approach on real data, we apply resampling only to training data, not to test data.
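The sketch below shows one way to implement random under-sampling with NumPy; the function name and seed handling are our own, and in our pipeline it is applied to the training data only.

```python
import numpy as np

def random_under_sample(X, y, seed=0):
    """Randomly reduce the majority class so that both classes have the same size."""
    rng = np.random.default_rng(seed)
    X, y = np.asarray(X), np.asarray(y)
    classes, counts = np.unique(y, return_counts=True)
    minority_size = counts.min()
    keep = np.concatenate([
        rng.choice(np.flatnonzero(y == c), size=minority_size, replace=False)
        for c in classes])
    rng.shuffle(keep)
    return X[keep], y[keep]
```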

5.8 Evaluation Measures

To measure the impact of the context metrics for defect prediction, we use three evaluation measures: the area under the receiver operating characteristic curve (AUC), the Matthews correlation coefficient (MCC), and the Brier score (Brier). Precision and Recall are frequently used in defect prediction as evaluation measures. However, several researchers warned that these measures show biased results (Bowes et al. 2012; Tantithamthavorn and Hassan 2018; Chicco 2017).

AUC and Brier score are threshold-independent measures. Tantithamthavorn and Hassan (2018) suggested using threshold-independent measures to address pitfalls in defect prediction research. Although MCC is a threshold-dependent measure, MCC is not affected by the skewness of defect data (Zhang et al. 2016; Boughorbel et al. 2017) and we want to better understand the predicting power of the metrics (Kamei et al. 2016). Therefore, we also use MCC in this paper. The threshold for MCC is 0.5.
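The three measures can be computed from predicted probabilities with scikit-learn, as in the sketch below; the example probabilities are hypothetical.

```python
import numpy as np
from sklearn.metrics import brier_score_loss, matthews_corrcoef, roc_auc_score

def evaluate(y_true, y_prob, threshold=0.5):
    """AUC and Brier score are threshold-independent; MCC uses a 0.5 threshold."""
    y_pred = (np.asarray(y_prob) >= threshold).astype(int)
    return {"AUC": roc_auc_score(y_true, y_prob),
            "MCC": matthews_corrcoef(y_true, y_pred),
            "Brier": brier_score_loss(y_true, y_prob)}

# Hypothetical predicted probabilities for five commits (1 = defective, 0 = clean).
print(evaluate([1, 0, 1, 0, 0], [0.8, 0.3, 0.6, 0.4, 0.2]))
```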

We use the Scott-Knott ESD test (Tantithamthavorn et al. 2017) (at a 95% significance level) to compare the context metrics and the traditional code churn metrics. The Scott-Knott test is a hierarchical clustering algorithm that ranks the distributions of values. In particular, metrics with distributions that are not statistically significantly different are placed in the same rank. The Scott-Knott ESD test is an extension of the Scott-Knott test, which not only ranks based on significance, but also on Cohen’s d effect size (Cohen 1988). The Scott-Knott ESD test places distributions which are not significantly different, or have a negligible effect size, in the same rank. We use the ScottKnottESD R package that was provided by Tantithamthavorn et al. (2016). We also apply the Scott-Knott ESD test to the ranks that are computed by the Scott-Knott ESD test.

The reason why we apply the Scott-Knott ESD test twice is to avoid the variance of the values of the evaluation measures across the studied projects. If such variance exists across the studied projects, it would be difficult to compare the studied metrics over all the studied projects instead of within each studied project. This idea was proposed by Ghotra et al. (2015). They applied the Scott-Knott test twice to ensure that they recognized techniques that perform well across the studied projects. They gave the following example: if a prediction model has an AUC of 0.9 on project A and 0.5 on project B, it would receive a poor rank if the Scott-Knott test were applied once over all projects. However, if an AUC of 0.5 is the best AUC value in project B, and 0.9 is also the best value in project A, then this classification technique should be the best-performing technique. The first Scott-Knott test computes the rank within a project, and the second Scott-Knott test computes the rank across the projects without the variance of the values of the evaluation measures, due to using the rank. We use the Scott-Knott ESD test instead of the Scott-Knott test in order to consider the effect size. We call this procedure the double Scott-Knott ESD test.

The results of the Scott-Knott ESD test and the double Scott-Knott ESD test are a rank (number) for each metric. The smallest rank, 1, indicates the best rank. The largest rank indicates the worst rank. A rank can contain multiple metrics at once. We interpret metrics which appear many times in the smallest/smaller ranks as the best metrics, since this indicates that the metrics significantly outperform many others. Hence, for the Scott-Knott ESD test, we use the top-3 ranks to evaluate the metrics across the studied projects. We report metrics which have the most top-3 ranks across the studied projects as the best metrics in the Scott-Knott ESD test.

For the double Scott-Knott ESD test, we use boxplots to show the ranks of the studied metrics for each evaluation measure. Each boxplot contains six ranks from the Scott-Knott ESD test, one for each studied project. The double Scott-Knott ESD test then classifies these boxplots using the Scott-Knott ESD test. This analysis avoids the variance of the actual performance differences across the studied projects due to using the rank. We interpret metrics which have the smallest rank as the best metrics, since this indicates that they significantly outperform many others.

5.9 Prediction Models

We use two defect prediction models, logistic regression model (LR) (McDonald 2014) and random forest model (RF) (Ho 1995). We give a brief overview of the idea behind the prediction models:

  • Logistic Regression (LR) (McDonald 2014): LR is a frequently used defect prediction model. It builds a linear model which has all metrics as explanatory variables, their coefficients, and a bias. LR feeds the output of this linear model to a sigmoid function (Han and Moraga 1995). The output of the sigmoid function corresponds to the probability.

  • Random Forest (RF) (Ho 1995): RF is an ensemble learning model. RF builds various decision trees (Quinlan 1993) based on subsets of metrics. Finally, RF merges all the results of the decision trees, and provides the probability of defect.

Prior work (Tantithamthavorn et al. 2016; Hall et al. 2012) showed that the parameter optimization of the prediction models crucially affects the prediction performance. For example, Tantithamthavorn et al. (2016) showed that a simple automated parameter optimization can dramatically improve the AUC performance of defect prediction models (the best case is about 40 percentage points of AUC). Hence, considering the parameter optimization is also an important aspect in our experiment.

For LR, we consider a parameter: C.

  • C: C is a parameter which indicates the regularization strength. For example, if we have many metrics but not much data, LR would fit its parameters to the training data excessively. Hence, LR would provide worse performance on the test data. To address this challenge, the regularization strength C is used when optimizing the parameters. We study C values of 0.00001, 0.0001, 0.001, 0.01, 0.1, 1, 10, and 100 when using the change metrics and COMB. For the other metrics, we do not tune C since only one metric is used in the prediction model.

In addition, we need to consider the correlation between the studied metrics. If the studied metrics are correlated, LR would suffer from the multicollinearity problem (Farrar and Glauber 1967). When using the change metrics, we need to consider the correlation. To avoid correlated metrics, prior work (Kamei et al. 2013) proposed a preprocessing. We follow the same preprocessing of prior work (Kamei et al. 2013) that was described in Section 5.2. COMB has two metrics. However, they are not correlated (see Table 21). Hence, we do not need to deal with the correlation in COMB.

For RF, we use the same normalized change metrics as for LR. In addition, we consider two parameters that are specific to RF: mtry and the number of trees.

  • mtry: mtry is a parameter which indicates the number of metrics randomly selected for each node in a tree. For example, if we set mtry = 2, RF selects 2 metrics from the studied metrics to generate a node in a tree for splitting the studied commits. We study mtry values of 1, 2, 5, 10, and 12 when using the normalized change metrics, 1 and 2 when using COMB, and 1 when using the other metrics, since the number of normalized change metrics is 12, the number of metrics in COMB is 2, and the number of other metrics is 1 in a prediction model.

  • number of trees: The number of trees is a parameter which indicates the number of trees which RF generates. RF merges all the outputs of the trees for computing the final result. We study numbers of trees of 2, 5, 10, 50, 100, 500, and 1,000.

We optimize these parameters for each iteration. We split the training data into 80% for training and 20% for validation. We use the training data to train the model based on a parameter setting, and evaluate that parameter setting on the validation data. We use the best parameter setting on the test data.
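The following scikit-learn sketch illustrates this tuning step. The choice of AUC as the validation criterion, the mapping of mtry to scikit-learn's max_features, and refitting the best setting on the full training data before testing are our assumptions, not details stated above.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

C_GRID = [1e-5, 1e-4, 1e-3, 1e-2, 0.1, 1, 10, 100]
TREES_GRID = [2, 5, 10, 50, 100, 500, 1000]

def tune_and_fit(X, y, model="rf", mtry_grid=(1,), seed=0):
    """Split the training data 80%/20%, pick the best parameter setting on the
    validation part, then refit that setting on the whole training data."""
    X_tr, X_val, y_tr, y_val = train_test_split(
        X, y, test_size=0.2, random_state=seed, stratify=y)
    if model == "lr":
        candidates = [LogisticRegression(C=c, max_iter=1000) for c in C_GRID]
    else:
        candidates = [RandomForestClassifier(n_estimators=t, max_features=m,
                                             random_state=seed)
                      for t in TREES_GRID for m in mtry_grid]
    best, best_auc = None, -1.0
    for clf in candidates:
        clf.fit(X_tr, y_tr)
        auc = roc_auc_score(y_val, clf.predict_proba(X_val)[:, 1])
        if auc > best_auc:
            best, best_auc = clf, auc
    return best.fit(X, y)
```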

6 Research Questions and Methodology

6.1 Research Questions

Our proposed context metrics have three parameters: commit c, context size n and chunk type t. Hence, we first study which configurations of these parameters are the best for predicting defective commits. Because c is a parameter that cannot be optimized, we study n and t to design the best context metrics. To do this, we formulate the following research question: (RQ1) What is the impact of the different variants of context metrics on defect prediction?

RQ1 does not establish the impact of the context metrics on defect prediction compared to the traditional code churn metrics. Hence, we also study the prediction performance of the context metrics compared to the traditional code churn metrics, namely the change metrics, their subsets and the indentation metrics, in order to confirm whether the context metrics are effective or not. We additionally study the performance of extended context metrics, which are combinations of the context metrics and the traditional code churn metrics, in order to improve the predicting power of the context metrics. The extended context metrics count (1) the number of words and (2) the number of keywords in the context lines and the changed lines. To do this, we formulate the following research question: (RQ2) Do context lines improve the performance of defect prediction?

RQ2 compares the prediction performance across the context metrics, the extended context metrics, and the traditional code churn metrics. However, we do not study combination metrics between the context metrics; we use a context metric alone in a prediction model in RQ1 and RQ2. Hence, in this RQ, we study the impact of combination metrics that use two extended context metrics, counting (1) the number of words and (2) the number of a certain keyword (e.g., “goto”), in a single prediction model. To do this, we formulate the following research question: (RQ3) What is the impact of combination metrics of context metrics on defect prediction?

6.2 Methodology

We explain our experimental methodology.

6.2.1 RQ1. What is the Impact of the Different Variants of Context Metrics on Defect Prediction?

We conduct two experiments in order to study the impact of chunk types and context sizes for just-in-time defect prediction. We first study the impact of chunk types. Second, we study the impact of context size based on a fixed chunk type. In each experiment, we build the studied defect prediction models and predict defective commits in the studied project histories.

We consider two supervised learning models as defect prediction models: LR and RF. Prior research reported inconsistent results on whether the choice of prediction model yields significant differences (Ghotra et al. 2015) or not (Lessmann et al. 2008; Shepperd et al. 2014; Menzies et al. 2010). The main point of this paper is to evaluate the impact of the context metrics for defect prediction, not the impact of the prediction models. Hence, we use only two models and do not consider the difference between the prediction models.

We split the set of commits into training data and test data using online change classification (Tan et al. 2015). 10-fold cross validation is a frequently used validation technique in defect prediction; however, cross validation has risks, such as producing artificially good results due to mixing past and future commits. Online change classification addresses the challenges of cross validation and improves the quality of the analysis in just-in-time defect prediction (Tan et al. 2015). We described the details in Section 5.4.

We compute the context metrics for each chunk type for each commit. We apply preprocessing to the context metrics in the training and test data. We use z-score; the mean and the standard deviation used by z-score are estimated from the training data. We use the context metric as an input to the studied models. The models are trained using the training data, and compute prediction results on the test data. When training the models, we optimize the parameters of the prediction models. We described the details in Section 5.9.

Finally, we evaluate the results using three evaluation measures: AUC, MCC, and Brier score. Each measure has multiple values, one per iteration step of the online change classification. We show the number of iteration steps (the iteration step size) in Table 5. For example, the iteration step size of the Hadoop project is 17. Hence, we get 17 values for each of the three evaluation measures. For each measure, we summarize the multiple values with their median value. We conduct the above procedure for each studied project. Therefore, each context metric has 12 median values in the online change classification (six projects times two prediction models).

We conduct this procedure for each chunk type. Then, we compare the context metrics of the different chunk types w.r.t. the three evaluation measures. We apply the Scott-Knott ESD test (Tantithamthavorn et al. 2017) to the context metrics for each evaluation measure for each project. Each context metric has two values (the results of the LR and RF models) for each project. Then, we evaluate the statistically significant differences and effect sizes between the context metrics for each evaluation measure for each project. The result is shown as a rank. For example, if a certain context metric A has the best value on a certain evaluation measure, context metric A achieves rank 1. If another context metric B has no significant difference to context metric A, context metric B also achieves rank 1. If another context metric C has a significant difference to context metrics A and B, context metric C achieves rank 2.

Although we get a rank from the first Scott-Knott ESD execution, the rank is computed for each project. Hence, we would get different ranks for each project for a context metric. To avoid the variance of the ranks across the studied projects, we additionally apply the Scott-Knott ESD test to the ranks instead of the actual values of the evaluation measures; this is the double Scott-Knott ESD test. Each context metric has six ranks (the results for all the studied projects) for each evaluation measure. The additional Scott-Knott ESD test compares the studied context metrics in terms of the rank. Then, we evaluate the statistically significant differences and effect sizes between the context metrics for each evaluation measure.

We conduct the same procedures on different context sizes instead of different chunk types before we apply the Scott-Knott ESD test. In this comparison, we then compare the values of the evaluation measures for each iteration step between the different context sizes. We count the iteration steps for each context size that provide the best prediction performance value. We make histograms of the number of iteration steps that provide the best prediction performance for each context size, for each evaluation measure and context metric. From these histograms, we conclude the impact of different context sizes on the performance of defect prediction. For example, suppose we conducted an experiment with 100 iteration steps and context sizes 1, 2, and 3; the context size 1 has 50 iteration steps where it performs best, the context size 2 has 20 such iteration steps, and the context size 3 has 30. In this example, we would get a histogram in which the context size 1 has 50, 2 has 20, and 3 has 30; hence, we would conclude that the context size 1 is the best.
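This counting step can be sketched in a few lines of Python, as below; the handling of ties (counting all tied context sizes as winners) and the example values are our own assumptions.

```python
from collections import Counter

def best_context_size_histogram(scores, higher_is_better=True):
    """scores[n][i]: evaluation value for context size n at iteration step i.
    Returns how many iteration steps each context size wins."""
    sizes = sorted(scores)
    wins = Counter()
    for i in range(len(scores[sizes[0]])):
        values = {n: scores[n][i] for n in sizes}
        best = max(values.values()) if higher_is_better else min(values.values())
        wins.update(n for n, v in values.items() if v == best)
    return wins

# Hypothetical AUC values for three context sizes over four iteration steps.
scores = {1: [0.71, 0.69, 0.74, 0.70],
          2: [0.70, 0.70, 0.72, 0.68],
          3: [0.69, 0.68, 0.73, 0.71]}
print(best_context_size_histogram(scores))   # Counter({1: 2, 2: 1, 3: 1})
```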

From the results, we investigate the impact of the context metrics variants (different chunk types and context sizes). The goal of this RQ1 is to find the best context metrics variant for just-in-time defect prediction. The best context metrics variant is considered as the context metrics in RQ2.

6.2.2 RQ2. Do Context Lines Improve the Performance of Defect Prediction?

To answer this RQ, we compare the best variant of the context metrics NCW and NCKW (as determined in RQ1) with the change metrics and their subsets (both described in Section 5.2), the indentation metrics (described in Section 5.1) and the extended context metrics. We build the defect prediction models to evaluate the metrics. The prediction procedure is similar to the procedure for RQ1; however, the preprocessing has differences (the details are described later in this section).

In order to improve the performance of defect prediction, we define two new metrics based on NCW and NCKW, called extended context metrics, that measure both the context and the changed lines. These metrics are NCCW (number of words in the context and the changed lines) and NCCKW (number of keywords in the context and the changed lines) in Table 7. NCCW and NCCKW use only added lines as the changed lines. This is because it is known that the change metric “added-lines” is one of the best indicators of change risk (Shihab 2012; Shihab et al. 2012). These metrics will show the results of the combination between the context metrics and the traditional code churn metrics. From the results of RQ1, we choose the appropriate chunk type from ‘+’, ‘-’ and ‘all’, and the context size from one to ten for NCCW and NCCKW.

Table 7 The extended context metrics

We apply to the change metrics and their subsets the preprocessing that was described in Section 5.2. For the context metrics, we apply z-score to normalize them to a mean of 0 and a variance of 1, since the subsets of the change metrics are also normalized by z-score.

6.2.3 RQ3. What is the Impact of Combination Metrics of Context Metrics on Defect Prediction?

To answer this RQ, we use our new combination metrics that use both NCCW and NCCKW. This is because, according to the results of RQ2, NCCW and NCCKW have better prediction performance than NCW and NCKW alone. NCCW and NCCKW are strongly correlated with each other (see Section 8.3). Hence, we need to remove the correlation in order to address the multicollinearity problem (Farrar and Glauber 1967) when using them in a prediction model.

We therefore modify NCCKW to count only one specific keyword at a time instead of counting all keywords (Table 2, # of keywords: 20). Hence, we get 20 variants of NCCKW. For example, one variant of NCCKW measures the number of “goto” statements (in both the context and the changed lines). We call each of these metrics a modified NCCKW. There are 20 modified NCCKW variants. This modification removes the strong correlation between NCCW and NCCKW; NCCW and each modified NCCKW are rarely correlated.

We use NCCW and each of the modified NCCKW in a prediction model as two explanatory variables, and study the performance of each of the modified NCCKW. From this result, we determine the best combination of NCCW and a modified NCCKW.

We call these combination metrics COMB. We compare COMB with the other metrics following the same procedure as for RQ2.

7 Case Study Results

7.1 RQ1. What is the Impact of the Different Variants of Context Metrics on Defect Prediction?

For the Context Metrics, the Best Chunk Type is ‘+’

Table 8 shows the ranks of the Scott-Knott ESD test results for each evaluation measure for each context metric variant. Each cell shows the rank of a context metric variant for an evaluation measure and a project. Note that we compare variants with different chunk types at the same context size (n = 3, the default context size of the diff command git show). The rank is computed across context metric variants for each project and evaluation measure. For example, the gray cells in Table 8 are one set on which the Scott-Knott ESD test is conducted. We summarize the number of projects in the top three ranks for each context metric variant (row) in the columns #R1, #R2, and #R3. Hence, the sum of the numbers from #R1 to #R3 in a row is 6 or less. The column Sum is the sum of #R1, #R2, and #R3. Due to space limitations, we shorten the project names in the table: Bitcoin is B., Camel is C., Gerrit is Ge., Gimp is Gi., Hadoop is H., and Osmand is O.

Table 8 The ranks of the Scott-Knott ESD test results for each context metric variant and studied project on three evaluation measures

Regarding AUC, using only the ‘+’ chunk type for NCW yields the best results and statistically outperforms the other metrics except in the Osmand project, i.e., the rank is one in 5 of 6 projects. Regarding MCC, we find that the rank is one in 3 of 6 projects, and the rank is one, two or three in all projects when using the ‘+’ chunk type for NCW or NCKW. Regarding Brier score, using the ‘+’ or ‘all’ chunk type for NCKW yields the best results and statistically outperforms the other metrics for 3 of 6 studied projects.

Figure 6 shows the results of the double Scott-Knott ESD test on the results for each context metric in all projects; each boxplot contains the six ranks of the first Scott-Knott ESD test execution for the studied projects for one chunk type. The x-axis indicates the chunk type; Plus, Minus, and All correspond to ‘+’, ‘-’, and ‘all’; the y-axis indicates the rank for each studied project in the first Scott-Knott ESD test execution. We use two gray colors (dark gray and light gray) and two line styles (solid and dashed) to indicate the rank according to the double Scott-Knott ESD test. A different rank indicates a statistically significant difference with at least a small effect size. We observe that ‘+’ achieves the best median rank for all the evaluation measures and the context metrics.

Fig. 6: The results of the double Scott-Knott ESD test on the results for each context metric in all projects. Please see the text for a full explanation

With one exception, ‘+’ consistently performed better than the other chunk types. This exception is shown in Fig. 6f: the ‘all’ chunk type statistically outperforms the ‘+’ chunk type for NCKW on Brier score; however, the median and the 25th and 75th percentiles are the same. Hence, we choose the ‘+’ chunk type as the best chunk type for our context metrics.

A Context Size of 1 Provides Better Prediction Performance for NCW, While a Context Size of 10 Provides Better Prediction Performance for NCKW

Figure 7 shows, for each context size, the number of iteration steps in which that context size provides the best prediction performance. The left column of Fig. 7 (Fig. 7a, c and e) shows the results for NCW with chunk type ‘+’. The right column of Fig. 7 (Fig. 7b, d and f) shows the results for NCKW with chunk type ‘+’.

Fig. 7 The numbers of iteration steps that provide the best prediction performance for each context size. We use all iteration steps of all studied projects on two prediction models (LR and RF). The sum of all iteration steps is 188 (17 + 37 + 30 + 14 + 20 + 70 from Table 5). Hence, the sum of all values is 376 (188 iteration steps * 2 models). For example, the sum of the y-axis values in Fig. 7a between 1 and 10 is 376

We can observe opposite results for NCW and NCKW. For NCW, context size 1 has the highest bar, indicating that context size 1 provides the best prediction performance in the largest number of iteration steps compared to the other context sizes. For NCKW, however, context size 10 has the highest bar for AUC and Brier score, while for MCC the bar for context size 1 is only slightly higher than those of the other context sizes. This result implies that the threshold of 0.5 is not suitable for NCKW. Figure 8 shows the distributions of the predicted probabilities computed by the prediction models in the Hadoop project when the context size is 10. In Fig. 8b (LR), the commits are gathered closely around 0.5 and many defective commits (orange) have probabilities below 0.5, whereas in Fig. 8a (RF) the commits are not gathered around 0.5. Because the 0.5 threshold causes many defective commits to be identified as clean in Fig. 8b, this distribution hurts the MCC results when using NCKW. Hence, the results are best at context size 10 for AUC and Brier score, but not for MCC. We observe the same tendency in the other studied projects.
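The following minimal sketch (hypothetical probabilities, not our experimental data) illustrates why this matters for MCC: the predicted probabilities are binarized at the 0.5 threshold before MCC is computed, so defective commits whose probabilities fall just below 0.5 are counted as clean.

```python
# Minimal sketch (hypothetical data): how a 0.5 threshold turns predicted
# probabilities into labels before computing MCC.
import numpy as np
from sklearn.metrics import matthews_corrcoef

y_true = np.array([1, 0, 1, 1, 0, 0])                   # 1 = defective, 0 = clean (toy labels)
probs  = np.array([0.46, 0.20, 0.48, 0.70, 0.30, 0.40])  # hypothetical LR outputs near 0.5

y_pred = (probs >= 0.5).astype(int)   # two defective commits just below 0.5 become "clean"
print(matthews_corrcoef(y_true, y_pred))
```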

Fig. 8 The numbers of studied commits in Hadoop project when the context size is 10. The x-axis refers to the predicted probabilities using NCKW that were computed by either RF (left) or LR (right) models

From these results, we use a context size of 1 for NCW and 10 for NCKW. Hereafter, we refer to NCW(c,1,+) and NCKW(c,10,+) as NCW and NCKW, respectively. In addition, we refer to NCCW(c,1,+) and NCCKW(c,10,+) as NCCW and NCCKW, respectively.

7.2 RQ2. Do Context Lines Improve the Performance of Defect Prediction?

The Extended Context Metric NCCW, the Indentation Metrics, and Lines Added (LA) Provide Top-Three Rank Performance in Many Projects for Just-in-Time Defect Prediction

Table 9 shows the ranks according to the Scott-Knott ESD test results of the three evaluation measures for each studied metric. Each cell shows the rank, which is computed across the studied metrics for each project. For example, the gray cells in Table 9 (a) form one set on which the Scott-Knott ESD test is conducted. The actual values of the three evaluation measures used in the Scott-Knott ESD test are shown in the Appendix (Tables 18, 19, and 20). We summarize the number of projects in which each studied metric (row) achieves the top three ranks in columns #R1 to #R3, and the column Sum is the sum of #R1, #R2, and #R3. The maximum value of Sum is six, which is the number of studied projects. Note that “Changes” in the table (and in the other tables and figures of this paper) indicates the change metrics.

Table 9 The ranks of the Scott-Knott ESD test results for studied metrics

NCCW (NCCW(c,1,+)) provides top-three rank prediction performance in all projects on AUC and MCC, and in 5 of 6 projects on Brier score. NCCW does not provide the top-one rank on Brier score. However, this is not a major concern for just-in-time defect prediction. The Brier score is the mean of the squared differences between the predicted probabilities (i.e., the outputs of the RF and LR models) and the actual binary labels (i.e., clean or defective) of the studied commits. This result therefore implies that the probabilities computed with NCCW might be closer to 0 or 1 (clean or defective) than those of the other studied metrics. Probabilities close to 0 or 1 clearly indicate either clean or defective, even when the predicted results are incorrect, and such confident mistakes are penalized heavily by the Brier score. Nevertheless, the results on AUC and MCC are good. Hence, even if the incorrect results are far from the correct labels, NCCW still has strong predicting power because of its MCC results, and it might provide better performance at other thresholds on average because of its AUC results. This result indicates that, among the studied churn metrics, the extended context metric NCCW has strong predicting power for just-in-time defect prediction.

Added spaces (AS), added braces (AB), and lines of code added (LA) also provide top-three rank prediction performance in many projects on AUC and MCC: AS and AB rank in the top three in all projects on AUC and in 5 of 6 projects on MCC, and LA ranks in the top three in all projects on both AUC and MCC. This result shows that the indentation metrics and the churn metric LA have strong predicting power. Like NCCW, none of these metrics provides the top-one rank on Brier score. For the same reason as for the extended context metric, we conclude that AS, AB, and LA have strong predicting power.

The change metrics, which use all of the churn metrics, rank in the top three on Brier score in all projects, while rarely achieving top-three rank performance on AUC and MCC. This result implies that the probabilities computed by the change metrics might be closer to 0.5, or closer to the correct label, than the probabilities given by the other studied metrics. Probabilities close to 0.5 are closer to the correct label when the prediction is incorrect, and therefore incur a smaller squared error. Figure 9 shows the distributions of the predicted probabilities computed by the prediction models in the Camel project using NCCW and the change metrics. We can observe that, with the RF model, the probabilities computed by the change metrics are closer to 0.5 than those of NCCW.

Fig. 9 The number of studied commits in Camel project. The x-axis refers to the probabilities using each metric on either RF (left column) or LR (right column) models

When using the LR model, the probabilities computed with NCCW are closer to 0.5 than those of the change metrics. However, the Brier scores of the change metrics are smaller than those of NCCW in half of the projects (Table 20 in the Appendix). To show this result in a simpler manner, we define a difference between the predicted probabilities and the actual labels for the LR model. In the following, Diff is this difference for a metric in a project, C is the set of all studied commits c, abs is the absolute value function, p_c is the predicted probability of a commit c, and label_c is the actual label of a commit c, where defective commits are 1 and clean commits are 0. Based on these parameters, we define Diff as follows:

$$ \mathit{Diff} = \sum\limits_{c \in C}{\text{abs}(p_{c}-\mathit{label}_{c})}. $$

This is a simple variant of the Brier score.
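A minimal sketch (hypothetical probabilities, not our experimental data) of both quantities, assuming the Brier score is the mean of the squared differences and Diff is the sum of the absolute differences defined above:

```python
# Minimal sketch (hypothetical data): Brier score vs. the simpler Diff.
import numpy as np

labels = np.array([1, 0, 1, 0, 1])                  # 1 = defective, 0 = clean (toy data)
probs  = np.array([0.55, 0.45, 0.40, 0.30, 0.65])   # hypothetical predicted probabilities

brier = np.mean((probs - labels) ** 2)   # mean squared difference
diff  = np.sum(np.abs(probs - labels))   # Diff: sum of absolute differences
print(brier, diff)
```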

Table 10 shows the values of Diff for the LR model. The gray cells indicate the smallest Diff value among the metrics in a project. We observe that the change metrics obtain gray cells in the majority (5 of 6) of projects. This result implies that, although the probabilities computed with NCCW are closer to 0.5 than those of the change metrics, the Diff of the change metrics is smaller than that of NCCW; hence, their probabilities are closer to the correct labels than those of NCCW. This is why the change metrics rank in the top three on Brier score in all projects.

Table 10 The values of our proposed difference (Diff) for the LR model. The gray cells refer to the smallest Diff values among the metrics within each project

The Indentation Metric, AS, is the Best-Performing Metric on AUC and MCC According to the Double Scott-Knott ESD Test

Figure 10 shows the results of the double Scott-Knott ESD test on the results for each studied metric in all projects; each boxplot contains the six ranks of the first Scott-Knott ESD test execution for the studied projects on a studied metric. We use two gray colors (dark and light gray) and two line styles (solid and dashed) to represent the ranks according to the double Scott-Knott ESD test; adjacent boxplots with the same gray color and line style have the same rank, otherwise the rank changes at that point. A different rank indicates a statistically significant difference with at least a small effect size according to the double Scott-Knott ESD test. We observe that AS is the best-performing metric on both AUC and MCC. The change metrics are the best-performing metrics on Brier score, and AS is the second best-performing metric. This result shows that AS is a top-rank metric across the studied projects on AUC and MCC, and the change metrics are the top-rank metrics across the studied projects on Brier score.

The Extended Context Metric, NCCW, and the Churn Metric, LA, Also Perform Well According to the Double Scott-Knott ESD Test

LA provides second-rank performance in AUC and Brier score, and first-rank performance in MCC. The extended context metric NCCW provides third-rank performance in AUC, second-rank performance in Brier score, and first-rank performance in MCC. This result shows that NCCW and LA also perform well across the studied projects on AUC and MCC.

Fig. 10 The double Scott-Knott ESD test results for each studied metric in all projects. Please see text for a full explanation

In this RQ, we study the metrics in terms of prediction performance only, and ignore other aspects such as which defective commits are detected. We closely examine the detected defective commits, the pair-wise relations across the studied metrics, and the basic predicting power of the studied metrics in Section 8 (Discussion).

7.3 RQ3. What is the Impact of Combination Metrics of Context Metrics on Defect Prediction?

“goto” Statement is the Best Keyword for the Modified NCCKW

Figure 11 shows the results of the double Scott-Knott ESD test on the results for each modified NCCKW in all projects. Each boxplot contains the six ranks of the first Scott-Knott ESD test execution for the studied projects when a given keyword is used for the modified NCCKW. The x-axis indicates the keyword used in the modified NCCKW; the y-axis indicates the rank of each studied project in the first Scott-Knott ESD test execution. We use two gray colors (dark and light gray) and two line styles (solid and dashed) to represent the ranks according to the double Scott-Knott ESD test; adjacent boxplots with the same gray color and line style have the same rank, otherwise the rank changes at that point. A different rank indicates a statistically significant difference with at least a small effect size. The first Scott-Knott ESD test is applied to the values of the evaluation measures computed from the studied prediction models that use NCCW and a certain modified NCCKW (i.e., one counting a certain keyword, such as “goto”) as the explanatory variables.

Fig. 11 The results of the double Scott-Knott ESD test on the results for each modified NCCKW in all projects. Please see text for a full explanation

We observe that the number of “goto” statements in the context and changed lines achieves the top-one or top-two rank in AUC and MCC. In addition, its median rank is the best in AUC and MCC. The number of “goto” statements achieves the worst rank in Brier score; for the same reason as in the RQ2 results on Brier score, we conclude that the modified NCCKW that counts the number of “goto” statements is the strongest metric to combine with NCCW. In addition, this modified NCCKW is not strongly correlated with NCCW (see Table 21). Hereafter, we refer to this variant (using the number of “goto” statements) of the modified NCCKW as gotoNCCKW. We use NCCW and gotoNCCKW together in a prediction model in order to improve the prediction performance, and refer to these combination metrics as COMB.

COMB Provides the Top-One Rank Prediction Performance for all the Studied Projects in AUC and MCC

Table 11 shows the ranks according to the Scott-Knott ESD test results of the three evaluation measures for each studied metric. We observe that COMB provides the top-one rank prediction performance for all the studied projects in AUC and MCC. In addition, except for AS in MCC, no other studied metric achieves the top-one rank prediction performance. This result indicates that COMB are the best prediction metrics among all the studied metrics. COMB also achieve at least a top-three rank for all studied projects in Brier score.

Table 11 The ranks of the Scott-Knott ESD test results for studied metrics

COMB Statistically Outperforms the Other Studied Metrics

Figure 12 shows the results of the double Scott-Knott ESD test on the results for each studied metric in all projects; each boxplot contains the six ranks of the first Scott-Knott ESD test execution for the studied projects on a studied metric. The x-axis indicates a metric; the y-axis indicates the rank of each studied project in the first Scott-Knott ESD test execution. We use two gray colors (dark and light gray) and two line styles (solid and dashed) to represent the ranks according to the double Scott-Knott ESD test; adjacent boxplots with the same gray color and line style have the same rank, otherwise the rank changes at that point. A different rank indicates a statistically significant difference with at least a small effect size.

Fig. 12 The results of the double Scott-Knott ESD test on the results for each studied metric in all projects. Please see text for a full explanation

We observe that COMB are the best-performing metrics on both AUC and MCC. This result shows that COMB are the top-rank metrics across the studied projects on AUC and MCC. Even on Brier score, COMB are the second-rank metrics; the best-performing metrics on Brier score are still the change metrics.

8 Discussion

8.1 Are the Commits Identified by the Context Metrics Different than the Ones Identified by the Traditional Churn Metrics?

The Proposed Context Metrics COMB Identify Some Defective Commits that Other Churn Metrics Cannot; These Commits Tend to Have Large Context Metric Values

We define unique defective commits as the commits that are only identified by our proposed metrics (and not by other metrics). The existence of these defective commits contributes to defect prediction since they cannot be identified using traditional churn metrics. Hence, we study the commits identified as defective by COMB.
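Conceptually, a unique defective commit set is a simple set difference; the following sketch (hypothetical commit identifiers, not our data) mirrors the COMB-LA / LA-COMB notation used in Figs. 13 and 14.

```python
# Minimal sketch (hypothetical data): "unique defective commits", i.e.,
# commits predicted as defective by COMB but not by another metric.
defective_by_comb = {"c1", "c4", "c7", "c9"}   # hypothetical commit ids flagged by COMB
defective_by_la   = {"c1", "c7", "c8"}         # hypothetical commit ids flagged by LA

comb_minus_la = defective_by_comb - defective_by_la   # "COMB-LA" in Fig. 13
la_minus_comb = defective_by_la - defective_by_comb   # "LA-COMB"
print(comb_minus_la, la_minus_comb)
```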

Figure 13 shows the values of the context metric NCW for the commits identified as defective in the Hadoop project.

Fig. 13 The values of the context metric NCW for the commits identified as defective in the Hadoop project. The boxplots show the cases where COMB identified commits differently from the context metric NCCW, the change metrics, LA, and the indentation metrics on the RF and LR models. For instance, COMB-AB refers to the cases where commits are identified as defective by COMB but as clean by AB. The x-axis shows the metrics that are compared; the y-axis shows the value of NCW

We can observe that, compared with the other metrics, COMB identifies commits with higher NCW values as defective. For example, the median NCW value of COMB-Changes is higher than the median NCW value of Changes-COMB (Fig. 13a and b). The results for the other projects show the same tendency, except for NCCW, which has higher NCW values in 4 of 6 projects since NCCW is also a context metric.

Because we use NCW values to characterize the unique defective commits, this result may seem obvious. However, even if we use LA values instead, the median LA value of COMB-LA is higher than the median LA value of LA-COMB in several projects. Figure 14 shows the values of LA for the commits identified as defective by the LR model in the Bitcoin and Hadoop projects. In the Bitcoin project, the median LA value of COMB-LA is higher than that of LA-COMB, while in the Hadoop project LA-COMB has the higher median LA value. This result supports the finding from Fig. 13a and b that COMB can uniquely identify some defective commits.

Fig. 14 The values of LA for the commits identified as defective in the Bitcoin and Hadoop projects. The boxplots show the cases where COMB identified commits differently from the context metric NCCW, the change metrics, LA, and the indentation metrics on the LR model. The y-axis shows the value of LA

The Proposed Context Metrics NCW and NCKW, and the Extended Context Metrics NCCW and NCCKW, Can Also Uniquely Identify Defective Commits; These Commits Tend to Have Larger Context Metric Values than Those Identified by Other Churn Metrics on the LR Model

We observe the same tendency for the other context metrics on the LR model, but not on the RF model. This result may stem from the difference between the RF and LR models; studying the differences between the prediction models lies beyond the scope of this paper. In addition, there exist commits that the traditional code churn metrics can identify but the context metrics cannot. Future studies are necessary to investigate these points.

8.2 How Much Do the Indentation Metrics Improve the Defect Prediction Performance?

Indentation Metrics AS and AB Have the Potential to Improve Defect Prediction Performance

Our study is the first to apply the indentation metrics to the defect prediction problem. From our results, the indentation metrics are among the best metrics in terms of defect prediction performance, and significantly outperform the other studied metrics except for COMB. Hence, we observe that the indentation metrics have strong potential predicting power for just-in-time defect prediction.

8.3 How Redundant are the Context Metrics Compared to the Traditional Metrics?

8.3.1 Motivation

To our knowledge, prior work in defect prediction disregards the information around the changed lines, i.e., the context lines. Hence, we propose the context metrics and study their impact on defect prediction performance. However, we have not yet studied whether our context metrics are redundant with respect to the traditional metrics.

We present an in-depth analysis to understand the relation between our context metrics and the traditional metrics. This analysis provides insights into why our context metrics do not introduce redundancy, and why they can uniquely identify defective commits compared to the traditional metrics. Finally, we show the basic predicting power of the studied metrics using information gain (Romanski and Kotthoff 2018).

8.3.2 Approach

We first study five context metrics (i.e., NCW, NCKW, NCCW, NCCKW, and gotoNCCKW), two indentation metrics, and 14 traditional change metrics using a correlation analysis (Zwillinger and Kokoska 1999) and principal component analysis (PCA) (D’Ambros et al. 2010) to identify correlated metrics and to find the metrics that are important for representing the variance of the original metrics. Second, we compute information gain (Romanski and Kotthoff 2018) for all the studied metrics in order to clarify their basic predicting power.

We first conduct a correlation analysis on the metrics. When we use strongly correlated metrics as explanatory variables of a prediction model, we face the problem of multicollinearity (Farrar and Glauber 1967). In addition, such metrics are redundant. We use Spearman rank correlation (Zwillinger and Kokoska 1999), a non-parametric correlation, to measure the correlation between the metrics. We apply Spearman rank correlation to all commits of each studied project, and compute the average of the correlation coefficients across the projects.
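A minimal sketch (hypothetical per-commit values; scipy's spearmanr is one possible implementation) of the correlation computation for a single project:

```python
# Minimal sketch (hypothetical data): Spearman rank correlation between two
# metrics over the commits of one project.
import numpy as np
from scipy.stats import spearmanr

# hypothetical per-commit metric values for one project
nccw = np.array([12, 3, 45, 7, 20, 31])
la   = np.array([10, 2, 50, 5, 18, 40])

rho, _ = spearmanr(nccw, la)   # non-parametric rank correlation
print(rho)                     # coefficients of 0.7 or over are treated as "strong" in Table 12
```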

Second, we conduct PCA in order to identify the metrics that represent the highest variance of all the studied metrics. PCA reduces the number of input metrics by constructing new, composite metrics (principal components), and provides, for each component, the coefficients that convert the input metrics into that component. We use the coefficients of the most important component, the first principal component, to identify which metrics represent the highest variance. We apply PCA to all commits of each studied project. We assume that the metrics that represent the highest variance are the important metrics among the studied metrics.
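The following sketch (hypothetical data; we assume standardized inputs, which may differ from our exact setup) shows how the coefficients of the first principal component can be inspected, as reported in Table 13:

```python
# Minimal sketch (hypothetical data): coefficients of the first principal component.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# rows = commits, columns = metrics (hypothetical values)
X = np.array([[12, 10, 3], [3, 2, 1], [45, 50, 9], [7, 5, 2], [20, 18, 4]], dtype=float)
metric_names = ["NCCW", "LA", "AS"]   # hypothetical subset of the studied metrics

pca = PCA()
pca.fit(StandardScaler().fit_transform(X))   # standardize, then fit PCA
first_pc = pca.components_[0]                # coefficients of the first principal component
for name, coef in zip(metric_names, first_pc):
    print(name, round(coef, 2))              # |coefficient| >= 0.3 is highlighted in Table 13
```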

Finally, we compute information gain (Romanski and Kotthoff 2018) in order to clarify the basic predicting power of the studied metrics. In our case, information gain measures the basic predicting power of each metric. For example, if a metric perfectly separates defective commits from clean commits, the information gain is maximal. However, if every group of commits induced by a metric contains 50% defective and 50% clean commits, the information gain is minimal, because such a split is no better than random classification. The formula of information gain (Romanski and Kotthoff 2018) is as follows:

$$ \mathit{InfoGain}(\mathit{metric}) = H(\mathit{Defect}) + H(\mathit{metric}) - H(\mathit{Defect}, \mathit{metric}), $$

where metric is a studied metric, InfoGain(⋅) is the information gain of its argument, H(⋅) is the Shannon entropy (Shannon 1948) of its argument using a base-2 logarithm, H(⋅,⋅) is the joint Shannon entropy of its two arguments, and Defect is the set of labels (defective or clean) of all studied commits.

We compute the ratio of the information gain of NCCW to that of the indentation metrics and the churn metrics. Since NCCW is our proposed metric, we use NCCW as the base. The formulation is as follows:

$$ \mathit{Ratio} = \mathit{InfoGain}(\mathit{NCCW}) / \mathit{InfoGain}(\cdot). $$

If the ratio is over 1.0 for a given metric, NCCW has a higher potential than that metric to classify the commits in defect prediction.
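A minimal sketch (hypothetical, already-discretized metric values, since information gain requires discrete values) of the information gain and the ratio defined above:

```python
# Minimal sketch (hypothetical data): information gain of a metric with
# respect to the defect labels, and the ratio used in Fig. 15.
import numpy as np
from collections import Counter

def entropy(values):
    """Shannon entropy (base 2) of a sequence of discrete values."""
    counts = np.array(list(Counter(values).values()), dtype=float)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

def info_gain(labels, metric_values):
    """InfoGain(metric) = H(Defect) + H(metric) - H(Defect, metric)."""
    return entropy(labels) + entropy(metric_values) - entropy(list(zip(labels, metric_values)))

labels      = [1, 0, 1, 0, 1, 0]                                 # 1 = defective, 0 = clean
binned_nccw = ["high", "low", "high", "low", "high", "low"]      # hypothetical, discretized NCCW
binned_la   = ["big", "small", "big", "big", "small", "small"]   # hypothetical, discretized LA

ratio = info_gain(labels, binned_nccw) / info_gain(labels, binned_la)
print(ratio)   # > 1.0 means NCCW has the stronger basic predicting power here
```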

8.3.3 Results

The Context Metrics NCCW and NCCKW, the Indentation Metrics AI and AS, and the Change Metric LA are Strongly Correlated

Table 12 shows the Spearman rank correlation between all the studied metrics (including the context, indentation, and change metrics) across all studied projects; each cell shows the average correlation over the studied projects (the median is very similar). A gray cell refers to a strong correlation, i.e., a coefficient of 0.7 or over. We observe that the correlations between NCCW, NCCKW, AI, AS, and LA are strong (over 0.7). This is because these context metrics and the indentation metrics include changed-line information.

Table 12 Spearman rank correlation between the context metrics, the indentation metrics, and the change metrics in the studied projects

The Context Metrics NCW and NCKW, However, are Moderately Correlated to the Indentation Metrics and the Change Metric LA

NCCW and NCCKW are extended versions of NCW and NCKW. NCW and NCKW are only moderately correlated with AI, AS, and LA (less than 0.7). Hence, although the context information is conceptually similar to the indentation metrics and the changed lines, the context information is not redundant.

The Context Metrics NCCW and NCCKW are the Metrics that Represent the Highest Variance of all the Original Metrics

Table 13 shows the coefficients of the first principal component for each project in the PCA. A gray cell refers to an absolute coefficient of 0.3 or over. We observe that NCCW and NCCKW have absolute coefficients over 0.3 in all the studied projects. If a metric has a high coefficient in the first principal component in all the projects, that metric is likely to represent the highest variance of all studied metrics across the projects. NCCW and NCCKW include the context information and are strongly correlated with the indentation metrics and LA because they also use changed-line information. Hence, NCCW and NCCKW add the context information while carrying the information of the indentation metrics and LA, and therefore represent the highest variance.

Table 13 The coefficient of the first principal component for each project in the PCA. GNCCKW indicates gotoNCCKW. Please see text for a full explanation

In summary, the context metrics NCW and NCKW are not redundant metrics, and add the context information to the defect prediction model. While NCCW and NCCKW have strong correlations to the indentation metrics and LA, NCCW and NCCKW also add information from the context of the change.

Except for LT, NCCW Has the Strongest Basic Predicting Power in Terms of Information Gain Among the Studied Metrics

Figure 15 shows the ratio of the information gain. We observe that all the median values are greater than 1.0, except for LT. Hence, in almost all cases, the information gain of NCCW is higher than that of the other studied metrics. LT has a higher information gain; however, its prediction performance (e.g., AUC) is not good. In summary, except for LT, NCCW has the strongest basic predicting power among the studied metrics.

Fig. 15 The ratio of the information gain between NCCW and other metrics. The x-axis indicates the metrics that are used to compute the ratio; the y-axis indicates the ratio. The dashed line indicates that the ratio is 1.0

8.4 Does the Context Size Change the Complexity of a Change?

We argued that the more words/keywords a context contains, the more complex the change is. Since the number of words/keywords is partly determined by the context size, one might be concerned that the measured complexity changes with the context size. In this discussion, we explain why changing the context size does not affect the complexity of a change.

From our experiments, given a fixed context size, the number of words/keywords in that context is a good indicator of the complexity of the change (RQ1). As the context size increases, the number of context words/keywords also increases; however, the distance of some words/keywords to the hunk also increases, making them less effective as an indicator of complexity. Hence, a balance is required: too small a context might not have enough information to capture the context of the change, while a context that is too large will dilute the important context information around a hunk.

8.5 What are the Actual AUC and MCC Values of the Context Metrics?

We study the ranks computed by the Scott-Knott ESD test across the studied metrics to determine the best prediction metrics to use in defect prediction. However, practitioners may be concerned about the actual AUC and MCC values, since they need accurate prediction models.

We show the actual AUC and MCC values in the Appendix (Tables 18 and 19). From the AUC results (Table 18), COMB provides at least 0.737; this value corresponds to a strong effect size according to prior work (Rice and Harris 2005). From the MCC results (Table 19), COMB provides at least 0.3, except for RF in the Camel project; this value corresponds to a moderate correlation. Hence, we conclude that COMB can be used in practice, since they have acceptable prediction performance in terms of the actual values as well.

8.6 Practical Guides (Recommendations) for the Parameters of the Context Metrics

The context metrics have two tunable parameters: the context size and the chunk type. We provide practical guides (recommendations) for optimizing these parameters so that they are as applicable as possible for practitioners.

Recommendation 1: If Practitioners Have Both Training Data and Validation Data, We Recommend Optimizing the Context Size and the Chunk Type Following our Experiments in RQ1

The most important parameters to determine are how many context lines to use (we call this the context size) and what type of context lines to use (we call this the chunk type). In our study, we determined the context size and the chunk type that yield the best results; we recommend that, if practitioners have training data and validation data, they optimize the context size and the chunk type following our experiments in RQ1. Our experiments in RQ3 show that COMB, the combination metrics of the extended context metrics that count the number of words and the number of the “goto” keyword, significantly outperform the other studied metrics. Hence, if practitioners want to use our prediction model, we recommend using COMB. Practitioners do not need to decide between the number of keywords and the number of words as a parameter of the context metrics, since COMB include both. The details of how to use COMB can be found in Section 8.7.

Recommendation 2: If Practitioners do not Have Enough Validation Data, We Recommend Using the Same Parameters that We Found to Perform Best

Our experiments in RQ1 show optimal values for the parameters for the projects we studied. The studied projects cover multiple domains of software, and two popular programming languages, C++ and Java. We believe this diversity of studied projects is likely to make these parameters useful in general.

8.7 Practical Guides (Recommendation) for Practitioners Who want to Use a Defect Prediction Model

We proposed the context metrics. We now present recommendations for using them in defect prediction, based on our experimental results.

Recommendation 1: Use the Indentation Metric AS Instead of the Traditional Size Metrics in the Change Metrics

Our experiments in RQ2 show that AS significantly outperforms the other studied metrics, including the traditional size metrics (LA, LD, and LT). In addition, AS is strongly correlated with the traditional size metric LA, which has the highest performance among the change (code churn) metrics. Hence, using AS instead of the traditional size metrics allows practitioners to improve the performance of their defect prediction models.

Recommendation 2: For the Case Where Practitioners Want to Improve the Prediction Performance Using a Simple Prediction Model, Use the Context Metrics COMB on the Logistic Regression Model

Our experiments in RQ3 show that COMB are the best-performing metrics in AUC and MCC. In addition, our discussion shows that: (1) NCCW, a context metric used in COMB, is one of the metrics that represent the highest variance of all the original metrics, and (2) the basic defect predicting power of NCCW is strong. Regarding the interpretation of the prediction model, COMB contains only two metrics (NCCW and gotoNCCKW); therefore, the prediction results can be interpreted easily. Finally, the effect size of the actual AUC values is strong. Hence, when practitioners want to improve prediction performance with a simple prediction model, using COMB might allow them to obtain good prediction performance.

9 Threats to Validity

9.1 Construct Validity

We follow the labeling process of Commit Guru (Rosen et al. 2015) in order to label each commit as either defective or clean. The SZZ algorithm is also a popular approach to identifying defective commits (Śliwerski et al. 2005); however, it has no open source implementation available. In contrast, Commit Guru is a publicly available open source project. Hence, we follow the labeling process of Commit Guru for its repeatability and openness.

We use the online change classification (Tan et al. 2015) to validate the performance of defect prediction. This validation technique addresses the challenges of the cross validation technique. Hence, we believe this validation technique is acceptable.

The online change classification has parameters. In particular, the unit (test interval) is the most important one. Below, we study the impact of the unit on defect prediction performance. If the unit had a strong impact on the performance, we would need to consider this parameter in our experiments.

Approach

We build defect prediction models for NCCW, NCCKW, COMB, AS, LA, and the change metrics. The prediction procedure is almost the same as in RQ2; the only difference is that we vary the unit value from 10 to 100 in steps of 10. Finally, we report the evaluation measures by (1) plotting a line plot for each project, prediction model, and studied metric, and (2) computing the median and 75th percentile (3Q) of the IQR values across the different unit values for all projects, prediction models, and studied metrics.
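The following sketch (hypothetical AUC values for one project, prediction model, and metric) illustrates the summary step: the IQR of an evaluation measure across the unit values, which Table 14 aggregates over all combinations.

```python
# Minimal sketch (hypothetical numbers): sensitivity of an evaluation measure
# to the unit (test interval), summarized as an IQR across unit values.
import numpy as np

units = range(10, 101, 10)
# hypothetical AUC values of one (project, model, metric) combination per unit
auc_by_unit = {u: 0.75 + 0.005 * np.sin(u) for u in units}

values = np.array(list(auc_by_unit.values()))
iqr = np.percentile(values, 75) - np.percentile(values, 25)   # spread across units
print(round(iqr, 3))   # an IQR well below 0.05 suggests the unit has little impact
```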

Results

Figure 16 shows the values of the evaluation measures for different unit values. We observe that all evaluation measures are stable for different unit values. In addition, we observe the same tendency for other projects, prediction models, and metrics.

Fig. 16 The values of the evaluation measures for each unit (test interval) using the NCCW metric on the LR model in the Camel project. Eva indicates the evaluation measures. The x-axis indicates the unit value from 10 to 100; the y-axis indicates the values of the evaluation measures

Table 14 shows the median and 75th percentile (3Q) of the IQR values across different unit values for all projects, prediction models, and studied metrics. We observe that even the 3Q IQR values are less than 0.05 in all cases. Hence, the unit (test interval) has little impact on the results. Since the training interval is determined by the unit, the training interval also has little impact on the results.

Table 14 The median and 75th percentile (3Q) IQR values of the performance for the context metrics, the indentation metric, and the change metrics

9.2 External Validity

As the studied projects, we use six large open source software systems. These systems are written in the popular programming languages C++ and Java, and cover various types of software, such as server and web applications. The systems we study are open source rather than commercial software. In the future, we need to study the context metrics, extended context metrics, and combination metrics on commercial projects to verify our findings.

9.3 Internal Validity

We remove comments from the hunks. However, if all lines in a hunk are comments written in the “/* */” style, we cannot identify that the hunk consists of comments.

We use three evaluation measures, AUC, MCC and Brier score, which are not affected by skewed data (Zhang et al. 2016; Boughorbel et al. 2017) and address the pitfalls in defect prediction (Tantithamthavorn et al. 2017). Hence, we believe these measures are acceptable.

10 Conclusion

In this paper, we propose context metrics based on the context lines, extended context metrics based on both the context lines and the changed lines (and hence also acting as code churn metrics), and COMB, which is based on the extended context metrics. We study the impact of considering the context lines on defect prediction.

We compare the context metrics, the extended context metrics, and COMB with the traditional code churn metrics on six open source software projects. The main findings of our paper are as follows:

  • The chunk type ‘+’ is the best parameter for context metrics for defect prediction. This chunk type achieves the best median rank according to the three evaluation measures, AUC, MCC and Brier score on the Scott-Knott ESD test.

  • A small context size is suitable when considering the number of words, while a large context size is suitable when considering the number of keywords in the context lines for defect prediction.

  • The “goto” statement in the context lines and the changed lines is the best keyword for detecting defective commits with the modified NCCKW.

  • Our proposed combination metrics, COMB, significantly outperform all the other studied metrics, and are the best-performing metrics in all of the studied projects in terms of AUC and MCC.