1 Introduction

It is common for software to contain defects (Nguyen et al. 2011). A defect is an error in software behaviour that causes unexpected results. Software defects were estimated to cost the U.S. economy $59.5 billion annually (Tassey 2002). The cost of correcting defects ranges from 50 % to 75 % of the total software development cost (Hailpern and Santhanam 2002). There is a rich history of attempts to anticipate which parts of the source code are likely to contain defects. For example, D’Ambros et al. (2012) evaluate over 30 different approaches published from 1996 to 2010 for building defect prediction models. Unfortunately, such models generalize poorly to other projects or even to new releases of the same project (Zimmermann et al. 2009; Nam et al. 2013). Refitting such models is not a trivial task: it requires collecting and tagging defects for each file, and computing software metrics from historical data. Sufficient history may not be available in certain projects, e.g., small or new projects (Nagappan et al. 2006).

We refer to a single model that is built from the entire set of projects as a universal model. A universal defect prediction model would relieve the need for refitting project-specific or release-specific models for an individual project. A universal model would also help interpret basic relationships between software metrics and defects, potentially resolving inconsistencies among different studies (Mair and Shepperd 2005). Moreover, it might allow a more direct comparison of defect rates across projects, and enable a continuous evaluation of defect proneness of a project. Therefore, it is of significant interest to build a universal defect prediction model.

Cross-project prediction may be a step towards building a universal model. We refer to a prediction as a cross-project prediction if a model is learnt from one project and the prediction is performed on another project. Zimmermann et al. (2009) examine the performance of all 622 possible cross-project predictions using 28 versions from 12 projects, and find a low ratio of successful cross-project predictions (i.e., 3.4 %). They consider a prediction to be successful if all three performance measures (i.e., precision, recall, and accuracy) are greater than 0.75. The first challenge in building successful cross-project defect prediction models is the variation in the distribution of predictors (Cruz and Ochimizu 2009; Nam et al. 2013). To overcome this challenge, two kinds of approaches have been proposed: 1) use data from projects whose distributions of predictors are similar to those of the target project as training data (e.g., Turhan et al. 2009; Menzies et al. 2011); or 2) transform predictors in both the training and target projects to make their distributions more similar (e.g., Nam et al. 2013; Ma et al. 2012). However, the first approach uses only a partial dataset and results in multiple models for different target projects, while the transformation approaches are typically specialized to a particular pair of training and testing datasets. In addition, our prior study (Zhang et al. 2013) found that the distribution of software metrics varies with project contexts (e.g., size and programming language). We combine these three insights in an attempt to build a universal defect prediction model for a large set of projects with diverse contexts.

In this study, we propose a context-aware rank transformation to address the variations in the distribution of predictors before fitting them to the universal defect prediction model. We investigate six context factors: programming language, issue tracking, the total lines of code, the total number of files, the total number of commits, and the total number of developers. The context-aware approach stratifies the entire set of projects by context factors, and clusters the projects with similar distributions of predictors. Inspired by metric-based benchmarks (e.g., Alves et al. 2010), which use quantiles to derive thresholds for ranking software quality, we use every tenth quantile (i.e., the deciles) of each predictor within each cluster to specify ranking functions. We use twenty-one code metrics and five process metrics as predictors. After rank transformation, the predictors from different projects have exactly the same scale. The universal model is then built using the transformed predictors.

We apply our approach on 1,385 open source projects hosted on SourceForge and GoogleCode. We observe that the F-measure and area under curve (AUC) obtained using rank-transformed predictors are comparable to those obtained using logarithmically transformed predictors. The logarithmic transformation uses the logarithm of predictor values, and is commonly used to build prediction models. Adding the six context factors as predictors further improves the performance of the universal model built using only code and process metrics. On average, the universal model yields a higher AUC than within-project models. Moreover, the universal model achieves up to 48 % of the successful predictions of within-project models using the loose criteria (i.e., recall is above 0.70, and precision is greater than 0.50) suggested by He et al. (2012) to determine the success of defect prediction models.

We examine the generalizability of the universal model in two ways. First, we build the universal model using projects hosted on SourceForge and GoogleCode, and apply the universal model on five external projects, including Lucene, Eclipse, Equinox, Mylyn, and Eclipse PDE. The results show that the universal model provides a similar performance (in terms of AUC) as within-project models for the five projects. Second, we compare the performance of the universal model on projects of different context factors. The results indicate that the performance does not change significantly among projects with different context factors. These results suggest that the universal model is context-insensitive and generalizable.

In summary, the major contributions of our study are:

  • Propose a context-aware rank transformation approach. The rank transformation addresses the problem of large variations in the distribution of predictors across projects from diverse contexts. The transformed predictors have exactly the same scale, which enables us to build a universal model for a large set of projects.

  • Improve the performance of the universal model by adding context factors as predictors. We add the context factors to our universal prediction model, and find that they significantly improve the predictive power of the universal defect prediction model (e.g., the average AUC increases from 0.607 to 0.641 compared to the combination of code and process metrics).

  • Provide a universal defect prediction model. The universal model achieves performance similar to within-project models for five external projects, and does not show significant differences in performance across projects with different context factors; i.e., the universal model is context-insensitive and generalizable. We also provide the estimated coefficients of the predictors in the universal model.

This work extends our previous work (Zhang et al. 2014), published in the proceedings of the 11th Working Conference on Mining Software Repositories (MSR), in three ways. First, we added details of the approach and the data processing steps, so that our study can be easily replicated. Second, we added RQ4, which examines the performance of the universal model when applied to projects from diverse contexts. Third, we added RQ5, which investigates the importance of different predictors in the universal model. Our findings provide insights into which predictors are more suitable for establishing general relationships with defect proneness.

The remainder of this paper is organized as follows. The related work is summarized in Section 2. Section 3 and Section 4 describe our approach and experiment design, respectively. Section 5 presents our results and discussions. The threats to validity of our work are discussed in Section 6. We conclude and provide insights for future work in Section 7.

2 Related Work

In this section, we review previous studies on defect prediction in general, cross-project defect prediction, and preprocessing techniques on predictors.

2.1 Defect Prediction

Defect prediction studies have a long history dating back to the 1970s (Akiyama 1971), and have become very active in the last decade (D’Ambros et al. 2012). The purpose of defect prediction models is to predict the defect proneness (i.e., buggy or clean) or the number of defects of a software artifact. Defect prediction has been studied at various granularities, such as the project, module, file, and method levels (e.g., Zimmermann et al. 2009; Hassan 2009). The impact of granularity on the performance of defect prediction models is studied by Posnett et al. (2011). This study chooses to predict defect proneness at the file level.

Building a defect prediction model requires three major steps: 1) collect predictors; 2) label defect proneness; and 3) choose a proper modelling technique. Software metrics are commonly used as predictors in defect prediction models. Numerous software metrics have been investigated, including complexity metrics (e.g., lines of code and McCabe’s cyclomatic complexity) (Menzies et al. 2007b), structural metrics (Zimmermann and Nagappan 2008), process metrics (e.g., recent activities, number of changes, and the complexity of changes) (Hassan 2009), the number of previous defects (Zimmermann et al. 2007), and social network metrics (Bettenburg and Hassan 2010). D’Ambros et al. (2012) systematically compare the predictive power of different metric categories, and find that process metrics are superior for predicting defect proneness. Arisholm et al. (2010) find large differences in the cost-effectiveness of defect prediction among different metric sets (e.g., process metrics significantly outperform structural metrics). Moreover, Radjenović et al. (2013) report that the performance of metric sets relates to several context factors (e.g., size and life cycle) of the subject projects.

There are two major types of modelling techniques: statistical methods (e.g., Naive Bayes and logistic regression) and machine learning methods (e.g., decision trees, support vector machines (SVM), k-nearest neighbours, and artificial neural networks). Lessmann et al. (2008), Arisholm et al. (2010), and D’Ambros et al. (2012) propose different approaches to compare and evaluate modelling techniques. Lessmann et al. (2008) find no significant differences in performance among different modelling techniques. Arisholm et al. (2010) also report that the choice of modelling technique has only a limited impact on performance in terms of accuracy or cost-effectiveness. However, Hall et al. (2012) report different observations: some modelling techniques (e.g., Naive Bayes and logistic regression) perform well in defect prediction, while others (e.g., SVM) perform less well. Sarro et al. (2012) find that tuning the parameters of SVM using a genetic algorithm can improve the performance of defect prediction. Our rank transformation approach is a data preprocessing step, and is thus independent of the modelling technique; software organizations can choose the technique that best suits their needs.

2.2 Cross-Project Defect Prediction

Most of the aforementioned studies have been conducted in within-project settings. We refer to a prediction as a within-project prediction if the training and target projects are the same. Building within-project models requires enough historical data from the target project. However, some projects, such as small or new projects, may not have sufficient historical data (Nagappan et al. 2006). In response to this challenge, many researchers have attempted to build cross-project defect prediction models. Most studies report poor performance for cross-project defect prediction (Hall et al. 2012). For instance, Zimmermann et al. (2009) run cross-project predictions for all 622 possible pairs of 28 datasets from 12 projects, and find that only 21 pairs (i.e., cross-project predictions) match their performance criteria (i.e., precision, recall, and accuracy are all above 0.75). Turhan et al. (2009) observe that cross-project prediction not only underperforms within-project prediction, but also produces excessive false alarms. Premraj and Herzig (2011) confirm the challenges of cross-project defect prediction in their replication study. Even for the same project, Shatnawi and Li (2008) report a decrease in the performance of within-project defect prediction models from release to release.

Rahman et al. (2012) argue that cross-project defect prediction can yield the same performance as within-project prediction in terms of cost-effectiveness, rather than standard measures (i.e., precision, recall, and F-measure). Nevertheless, the challenge of cross-project prediction remains. This might be because metrics from different projects can have significantly different distributions (Cruz and Ochimizu 2009; Nam et al. 2013). Denaro and Pezzè (2002) conclude that good predictive performance can be achieved only across homogeneous projects; a similar finding is reported by Nagappan et al. (2006). Hall et al. (2012) investigate 36 studies and report that some context factors (e.g., system size and application domain) affect the performance of cross-project predictions. Zimmermann et al. (2009) and Menzies et al. (2011) both suggest considering project contexts for cross-project defect prediction. To deal with heterogeneous data from diverse projects, this study proposes a context-aware rank transformation of predictors as a data preprocessing step. In addition, we find that adding context factors (see Section 3.2) as predictors can improve the predictive power of the universal defect prediction model.

2.3 Data Preprocessing

The distribution of metric values varies across projects (Cruz and Ochimizu 2009). Our previous study (Zhang et al. 2013) examines 320 diverse projects, and concludes that the distribution of metric values sometimes varies significantly between projects of different contexts. The variation in the scales of metrics poses a challenge for building a universal defect prediction model. Menzies et al. (2007b) show that data preprocessing can improve the performance of defect prediction models. These findings suggest that preprocessing predictors may be a necessary step for building a successful cross-project defect prediction model. Jiang et al. (2008) evaluate the impact of log transformation and discretization on the performance of defect prediction models, and find that different modelling techniques “prefer” different transformation techniques. For instance, Naive Bayes achieves better performance on discretized data, while logistic regression benefits from both approaches. Cruz and Ochimizu (2009) also observe that log transformations can improve the performance of cross-project predictions, but only if the data of the target project is not as skewed as the data of the training project.

The state-of-the-art approaches to improve the performance of cross-project defect prediction mainly use two data preprocessing techniques: 1) use data from projects with distributions similar to the target project (e.g., Turhan et al. 2009; Menzies et al. 2011); or 2) transform predictors in both the training and target projects to make their distributions more similar (e.g., Nam et al. 2013; Ma et al. 2012). To filter the training data, He et al. (2012) propose to use distributional characteristics (e.g., median, mean, variance, standard deviation, skewness, and quantiles); Turhan et al. (2009) propose a nearest neighbour filter; Li et al. (2012) propose sampling; and He et al. (2013) propose data similarity. The aforementioned approaches are able to improve the performance of cross-project defect prediction models. However, they use only a partial dataset and end up with multiple models (i.e., one model per target project). On the other hand, the transformation approaches are typically specialized to a particular pair of training and testing datasets. For instance, Watanabe et al. (2008) propose to compensate the target project with the average values of predictors of both the target and training projects. Similarly, Ma et al. (2012) weight the training data by estimates of the distribution of the target data. Nam et al. (2013) propose to transform both training and target data to the same latent feature space, and to build models in that space. Our previous study (Zhang et al. 2013) suggests considering project contexts when deriving the thresholds and ranges of metric values that are often used to evaluate software quality (Baggen et al. 2012). By combining these three insights, we propose a context-aware rank transformation approach that does not require or depend on the target dataset. The target dataset contains the projects to which defect prediction models are applied.

3 Approach

In this section, we present the details of our approach for building a universal defect prediction model.

3.1 Overview

The poor performance of cross-project prediction may be caused by the significant differences in the distribution of metric values among projects (Cruz and Ochimizu 2009; Nam et al. 2013). To build a universal model using a large set of projects, it is therefore essential to reduce the difference in the distribution of metric values across projects. Our previous work (Zhang et al. 2013) finds that the context factors of projects can significantly affect the distribution of metric values. Hence, we propose a context-aware rank transformation approach to preprocess metric values before fitting them to the universal model. As illustrated in Fig. 1, our approach consists of the following four steps:

  1) Partitioning. We partition the entire set of projects into non-overlapping groups based on the six context factors (i.e., programming language, issue tracking, the total lines of code, the total number of files, the total number of commits, and the total number of developers). This step aims to reduce the number of pairwise comparisons: we compare the distribution of metric values across groups of projects instead of individual projects.

  2) Clustering. We cluster the project groups with similar distributions of predictor values. This step aims to merge similar groups of projects so that each cluster contains more projects for obtaining ranking functions.

  3) Obtaining ranking functions. We derive a ranking function for each cluster using every tenth quantile of predictor values. This transformation removes large variations in the distribution of predictors by transforming them to exactly the same scale.

  4) Ranking. We apply the ranking functions to convert the raw values of predictors to one of ten levels. This step removes the difference in the scales of metric values across projects; the scales of the transformed metric values are exactly the same for all projects.

Fig. 1 Our four-step rank transformation approach: 1) stratify the set of projects along different contexts into non-overlapping groups; 2) cluster project groups; 3) derive a ranking function for each cluster; and 4) perform rank transformation

After the preprocessing steps, we build the universal model based on the transformed predictors. The following subsections describe the context factors used in this study, and the details of each step.
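To make the data flow of the four steps concrete, the following minimal Python sketch outlines them as function stubs. The function and field names are our own illustration and are not part of the original tooling; the concrete steps are detailed in the following subsections.

    # Minimal sketch of the four preprocessing steps (names and data structures are
    # illustrative assumptions, not the original implementation).

    def partition_by_context(projects):
        """Step 1: group projects by the six context factors (PL, IT, TLOC, TNF, TNC, TND)."""
        groups = {}
        for p in projects:
            key = (p["language"], p["uses_issue_tracker"],
                   p["tloc_bin"], p["tnf_bin"], p["tnc_bin"], p["tnd_bin"])
            groups.setdefault(key, []).append(p)
        return groups

    def cluster_similar_groups(groups, metric):
        """Step 2: merge groups whose distributions of `metric` are similar
        (Mann-Whitney U test plus Cliff's delta; see Section 3.4)."""
        ...

    def derive_ranking_function(cluster, metric):
        """Step 3: return the nine deciles (10th, 20th, ..., 90th percentiles) of `metric`
        over the files of the projects in `cluster` (see Section 3.5)."""
        ...

    def rank_transform(value, deciles):
        """Step 4: map a raw metric value to a rank between 1 and 10 (see (1))."""
        ...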

3.2 Context Factors

Software projects have diverse context factors. However, it is still unclear which context factors best characterize projects. For instance, Nagappan et al. (2013) choose seven context factors based on their availability in Ohloh, including the main programming language, the total lines of code, the number of contributors, code churn, the number of commits, project age, and project activity. Along the same lines, our previous work (Zhang et al. 2013) also selects seven context factors, i.e., application domain, programming language, age, lifespan, the total lines of code, the number of changes, and the number of downloads. Our prior study (Zhang et al. 2013) finds that project age, lifespan, and the number of downloads do not significantly affect the distribution of metric values; thus we exclude these three context factors from this study. The context factors shared by the two aforementioned studies are programming language, the total lines of code, and the number of commits. These factors are common to all projects with version control systems; hence, we include them in this study. Information about application domain is unavailable for our subject projects hosted on GoogleCode, so we exclude application domain as well. Moreover, we add the number of developers, as in Nagappan et al. (2013), and the number of files as another size measurement. In the end, we choose the following six context factors.

  1) Programming Language (PL) describes the nature of the programming paradigm. Metric values of different programming languages are likely to have significantly different distributions. Moreover, it is interesting to investigate the possibility of inter-language reuse of prediction models. Due to the limitation of our metric computing tool, we only consider projects mainly written in C, C++, Java, C#, or Pascal. A project is considered mainly written in programming language pl if the largest number of its source code files is written in pl.

  2) Issue Tracking (IT) describes whether a project uses an issue tracking system or not. The usage of an issue tracking system can reflect the quality of the project management process. The distributions of metric values are likely to differ between projects with and without issue tracking systems. A project uses an issue tracking system if the issue tracking system is enabled on the website of the project and there is at least one recorded issue.

  3) Total Lines of Code (TLOC) describes the project size in terms of source code. Comment and blank lines are excluded when counting the total lines of code. Moreover, the lines of code of files that are not written in the main language of the project are also excluded. Such exclusion simplifies our approach for transforming metric values, as only one programming language is considered for each project.

  4) Total Number of Files (TNF) describes the project size in terms of files. This context factor measures the project size at a different granularity than the total lines of code. Similar to the total lines of code measurement, we exclude files that are not written in the main language of each project.

  5) Total Number of Commits (TNC) describes the project size in terms of commits. Different from the total lines of code and the total number of files, this context factor captures the project size from the process perspective. The total number of commits describes how actively the project was developed.

  6) Total Number of Developers (TND) describes the project size in terms of developers. Teams of different sizes (e.g., small or large) may follow different development strategies; therefore, team size can affect the distribution of metric values.

3.3 Partitioning Projects

We assume that projects with the same context factors have similar distributions of software metrics, and that projects with different contexts might have different distributions. Hence, we stratify the entire set of projects based on the aforementioned six context factors.

  1) PL. We divide the set of projects into 5 groups based on programming language: G_C, G_C++, G_Java, G_C#, and G_Pascal.

  2) IT. The set of projects is separated into 2 groups based on the usage of an issue tracking system: G_useIT and G_noIT.

  3) TLOC. We compute the TLOC of each project and the quartiles of TLOC. Based on the first, second, and third quartiles, we split the set of projects into 4 groups: G_leastTLOC, G_lessTLOC, G_moreTLOC, and G_mostTLOC.

  4) TNF. We calculate the TNF of each project and the quartiles of TNF. Based on the first, second, and third quartiles, we separate the set of projects into 4 groups: G_leastTNF, G_lessTNF, G_moreTNF, and G_mostTNF.

  5) TNC. We compute the TNC of each project and the quartiles of TNC. Based on the first, second, and third quartiles, we break the entire set of projects into 4 groups: G_leastTNC, G_lessTNC, G_moreTNC, and G_mostTNC.

  6) TND. We calculate the TND of each project and the quartiles of TND. Based on the first, second, and third quartiles, we split the whole set of projects into 4 groups: G_leastTND, G_lessTND, G_moreTND, and G_mostTND.

In summary, we get 5, 2, 4, 4, 4, and 4 non-overlapping groups along each of the six context factors, respectively. In total, we obtain 2560 (i.e., 5×2×4×4×4×4) non-overlapping groups for the entire set of projects.
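As an illustration of this partitioning, the Python sketch below assigns a project to its group key using the quartile cut points. The variable names, the structure of the `cuts` dictionary, and the treatment of values that fall exactly on a quartile boundary are our own assumptions.

    import numpy as np

    # Illustrative sketch of the group assignment.
    # `cuts` maps each size factor to its (Q1, Q2, Q3) cut points over all projects.

    def quartile_cuts(values):
        """First, second, and third quartiles of a size factor across all projects."""
        return np.percentile(values, [25, 50, 75])

    def size_label(value, q1, q2, q3):
        """Map a size measure (TLOC, TNF, TNC, or TND) to one of the four quartile groups.
        Boundary values are assigned to the lower group here; the paper does not
        specify this detail."""
        if value <= q1:
            return "least"
        if value <= q2:
            return "less"
        if value <= q3:
            return "more"
        return "most"

    def group_key(project, cuts):
        """Build the non-overlapping group key from the six context factors."""
        return (
            project["language"],                                  # PL
            "useIT" if project["has_issue_tracker"] else "noIT",  # IT
            size_label(project["tloc"], *cuts["tloc"]),           # TLOC
            size_label(project["tnf"], *cuts["tnf"]),             # TNF
            size_label(project["tnc"], *cuts["tnc"]),             # TNC
            size_label(project["tnd"], *cuts["tnd"]),             # TND
        )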

3.4 Clustering Similar Projects

In the previous step, we obtained non-overlapping groups of projects. However, most groups are small, and in some cases the non-overlapping groups of projects do not have significantly different distributions of metrics. In addition, clustering similar projects together helps obtain more representative quantiles of a particular metric. At this step, we therefore cluster the groups of projects with similar distributions of a metric. We consider two distributions to be similar if their difference is not statistically significant or if the effect size of their difference is not large, following our previous study (Zhang et al. 2013).

For different metrics, the corresponding clusters are not necessarily the same. In other words, we produce a particular set of clusters for each individual metric. We describe a cluster using a vector: the first element shows for which metric the cluster is created, and the remaining elements characterize the cluster from the context factor perspective. For example, the cluster <m, C++, useIT, moreTLOC> is created for metric m, contains C++ projects that use issue tracking systems, and has TLOC between the second and third quartiles (see Section 3.2).

For each metric m, the clusters of projects with similar distributions of metric m are obtained using Algorithm 1, which has two major steps:

  1) Comparing the distributions of metrics. This step (Line 8 in Algorithm 1) merges the groups of projects that do not have significantly different distributions of metric m. We apply the Mann-Whitney U test (Sheskin 2007) to compare the distribution of metric values between every two groups of projects, using the 95 % confidence level (i.e., p-value < 0.05). The Mann-Whitney U test assesses whether two independent distributions have equally large values. It is a non-parametric statistical test, and therefore does not assume a normal distribution. As we conduct multiple tests to investigate the distribution of each metric, we apply Bonferroni correction to control family-wise errors. The Bonferroni correction adjusts the threshold p-value by dividing it by the number of tests.

  2) Quantifying the difference between distributions. This step (Lines 10 to 16 in Algorithm 1) merges the groups of projects that have significantly different distributions of metric m, but whose difference is not large. We calculate Cliff’s δ (Line 10 in Algorithm 1) as the effect size (Romano et al. 2006) to quantify the importance of the difference between the distributions of every two groups of projects. Cliff’s δ is a non-parametric effect size measure. It makes no assumptions about a particular distribution, and is reported (Romano et al. 2006) to be more robust and reliable than Cohen’s d (Cohen 1988). Cliff’s δ represents the degree of overlap between two sample distributions (Romano et al. 2006). It ranges from -1 to +1, where the extremes indicate that all values in one group are larger than all values in the other, and it is zero when the two sample distributions are identical (Cliff 1993). Cohen’s standards (i.e., small, medium, and large) are commonly used to interpret effect sizes. Therefore, we map Cliff’s δ to Cohen’s standards using the percentage of non-overlap (Romano et al. 2006). The mapping between Cliff’s δ and Cohen’s standards is shown in Table 1. Cohen (1992) states that a medium effect size represents a difference likely to be visible to a careful observer, while a large effect is noticeably greater than medium. In this study, we choose the large effect size as the threshold for the importance of the differences between distributions (Line 11 in Algorithm 1).

Algorithm 1 Cluster the groups of projects that have similar distributions of metric m
Table 1 Mapping Cliff’s δ to Cohen’s standards
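Since Algorithm 1 is presented as a figure, the following Python sketch re-creates its two merging criteria under our own reading (a union-find style merge). The Bonferroni-corrected significance threshold and the “large” Cliff’s δ cut-off of 0.474 are taken from the text; everything else, including the O(n·m) Cliff’s δ computation, is an assumption.

    from itertools import combinations
    from scipy.stats import mannwhitneyu

    def cliffs_delta(xs, ys):
        """Cliff's delta: P(x > y) - P(x < y), computed by exhaustive comparison."""
        gt = sum(x > y for x in xs for y in ys)
        lt = sum(x < y for x in xs for y in ys)
        return (gt - lt) / (len(xs) * len(ys))

    def cluster_groups(groups, alpha=0.05, large_delta=0.474):
        """Merge groups whose distributions of one metric are 'similar': either not
        significantly different (Mann-Whitney U, Bonferroni-corrected) or different
        with an effect size below the 'large' threshold.
        `groups` maps a group name to the list of metric values pooled over its projects."""
        n_tests = len(list(combinations(groups, 2)))
        threshold = alpha / n_tests          # Bonferroni correction
        parent = {name: name for name in groups}   # union-find: each group starts alone

        def find(x):
            while parent[x] != x:
                x = parent[x]
            return x

        for a, b in combinations(groups, 2):
            _, p = mannwhitneyu(groups[a], groups[b], alternative="two-sided")
            if p >= threshold or abs(cliffs_delta(groups[a], groups[b])) < large_delta:
                parent[find(a)] = find(b)    # merge the clusters of the two groups

        clusters = {}
        for name in groups:
            clusters.setdefault(find(name), []).append(name)
        return list(clusters.values())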

3.5 Obtaining Ranking Functions

In Section 3.4, we created clusters of projects for each metric independently. For a particular metric, projects within the same cluster exhibit similar distributions of values of that metric. To remove the variation in the scales of metric values, this step derives a ranking function for each cluster. The ranking function transforms raw metric values to predefined values (i.e., ranging from one to ten). Therefore, the transformed metrics have exactly the same scale across projects.

We use the quantiles of metric values to formulate our ranking functions. This is inspired by metric-based benchmarks (e.g., Alves et al. 2010), which often use quantiles to derive thresholds of metrics. Such thresholds are used to distinguish files at different levels of quality related to defects.

Let M denote the total number of metrics. For metric \(m_{i}\) (where \(i\in\{1,\ldots,M\}\)), the corresponding clusters are represented as \(Cl_{i1}, Cl_{i2}, \ldots, Cl_{iN_{i}}\), where \(N_{i}\) is the total number of clusters obtained for metric \(m_{i}\). We formulate the ranking function for metric \(m_{i}\) in the jth cluster \(Cl_{ij}\) following (1).

$$ Rank(m_{i}, Cl_{ij}, value) = \left \{ \begin{array}{l l} 1 & \quad \text{if}\; value \in [0, Q_{ij,1}(m_{i})]\\[0.1cm] k & \quad \text{if}\; value \in (Q_{ij,k-1}(m_{i}), Q_{ij,k}(m_{i})]\\[0.1cm] 10 & \quad \text{if}\; value \in (Q_{ij,9}(m_{i}), +\infty) \end{array} \right. $$
(1)

where the variable value denotes the value of metric \(m_{i}\) to be converted, \(Q_{ij,k}(m_{i})\) is the \(k\times 10\)th quantile of metric \(m_{i}\) in cluster \(Cl_{ij}\), \(j\in\{1,\ldots,N_{i}\}\), and \(k\in\{2,\ldots,9\}\).

For example, we assume that every tenth quantile for a metric \(m_{1}\) in cluster \(Cl_{12}\) is 11, 22, 33, 44, 55, 66, 77, 88, and 99, respectively. The ranking function for metric \(m_{1}\) in cluster \(Cl_{12}\) is shown in Table 2. If metric \(m_{1}\) has a value of 27 in a file of a project that belongs to cluster \(Cl_{12}\), then the metric value in that file is converted to 3. This is because the value 27 is greater than 22 (i.e., the 20 % quantile) and not greater than 33 (i.e., the 30 % quantile).

Table 2 An example of ranking functions
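The ranking function (1) and the worked example of Table 2 can be summarized in a few lines of Python; the decile values below are those of the example, and the helper names are ours.

    import numpy as np

    def ranking_function(metric_values):
        """Derive the nine decile thresholds Q_1..Q_9 for one metric in one cluster."""
        return np.percentile(metric_values, range(10, 100, 10))

    def rank(value, deciles):
        """Equation (1): map a raw value to a rank in 1..10 using the decile thresholds.
        A value equal to a threshold stays in the lower rank, matching the (Q_{k-1}, Q_k]
        intervals of the equation."""
        r = 1
        for q in deciles:
            if value > q:
                r += 1
        return r

    # Worked example from Table 2: deciles of metric m1 in cluster Cl_12
    deciles = [11, 22, 33, 44, 55, 66, 77, 88, 99]
    assert rank(27, deciles) == 3    # 22 < 27 <= 33, so the value maps to rank 3
    assert rank(5, deciles) == 1
    assert rank(120, deciles) == 10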

3.6 Building a Universal Defect Prediction Model

3.6.1 Choice of Modelling Techniques

As described in Section 2.1, Lessmann et al. (2008) and Arisholm et al. (2010) show that there is no significant difference in the performance of defect prediction models among different modelling techniques. However, Kim et al. (2011) find that Bayes learners (i.e., Bayes Net and Naive Bayes) perform better when the defect data contains noise, tolerating even 20 %–35 % of false positive and false negative noise. Based on their findings, we apply Naive Bayes as the modelling technique in our experiments to evaluate the performance of the universal defect prediction model. When investigating the importance of different metrics in the universal model, we apply a logistic regression model, as this is a common practice for comparing the importance of different metrics (Zimmermann et al. 2012).

3.6.2 Steps to Build the Universal Defect Prediction Model

Our universal model is built upon the entire set of projects using rank-transformed metric values. The first step is to transform metric values using the ranking functions obtained from our dataset. To locate the ranking function for metric \(m_{i}\) in project \(p_{j}\), we need to determine which cluster project \(p_{j}\) belongs to. We identify the context factors of project \(p_{j}\), and formulate a vector such as <\(m_{i}\), C++, useIT, moreTLOC, lessTNF, lessTNC, lessTND> to represent a cluster, where the first item specifies the metric, and the remaining items describe the context factors shared by the projects in the cluster. The vector of project \(p_{j}\) is then compared to the vectors of all clusters; the exactly matching cluster is the cluster that project \(p_{j}\) belongs to. After the transformation, the values of all metrics have the same scale, ranging from one to ten.

The second and last step is to build the model. We apply the Naive Bayes algorithm implemented in the Weka tool to build a universal defect prediction model upon the entire set of projects.
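As a sketch of this modelling step, the snippet below fits a Naive Bayes classifier on rank-transformed predictors. We use scikit-learn's GaussianNB purely as an illustrative stand-in for Weka's NaiveBayes (the two implementations are not identical), and the toy data is fabricated.

    import numpy as np
    from sklearn.naive_bayes import GaussianNB

    def build_universal_model(X, y):
        """Fit a Naive Bayes model on rank-transformed predictors (values 1..10)."""
        model = GaussianNB()
        model.fit(X, y)
        return model

    # Toy example: 100 files, 26 rank-transformed predictors (21 code + 5 process metrics),
    # y holds defect proneness (0 = clean, 1 = defective).
    X = np.random.randint(1, 11, size=(100, 26))
    y = np.random.randint(0, 2, size=100)
    universal_model = build_universal_model(X, y)
    print(universal_model.predict_proba(X[:5])[:, 1])   # predicted defect probabilities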

3.7 Measuring the Performance

To evaluate the performance of prediction models, we compute the confusion matrix as shown in Table 3. In the confusion matrix, true positive (TP) is the number of defective files that are correctly predicted as defective files; false negative (FN) counts the number of defective files that are incorrectly predicted as clean files; false positive (FP) measures the number of files that are clean but incorrectly predicted as defective; and true negative (TN) represents the number of clean files that are correctly predicted as clean files.

Table 3 Confusion matrix used in defect prediction studies

We calculate the following six measures (i.e., precision, recall, false positive rate, F-measure, g-measure, and Matthews correlation coefficient) using the confusion matrix. We also compute the area under curve (AUC) as an additional measure.

Precision (prec)

measures the proportion of actual defective entities that are predicted as defective against all predicted defective entities. It is defined as:

$$ prec=\frac{TP}{TP+FP} $$
(2)

Recall (pd)

evaluates the proportion of actual defective entities that are predicted as defective against all actual defective entities. It is defined as:

$$ pd=\frac{TP}{TP+FN} $$
(3)

False Positive Rate (fpr)

is the proportion of actual non-defective entities that are predicted as defective against all actual non-defective entities. It is defined as:

$$ fpr=\frac{FP}{FP+TN} $$
(4)

F-measure

calculates the harmonic mean of precision and recall. It balances precision and recall. It is defined as:

$$ F\textnormal{-}measure=\frac{2\times pd \times prec}{pd+prec} $$
(5)

g-measure

computes the harmonic mean of recall and 1-fpr. The term 1-fpr represents specificity (i.e., not predicting entities without defects as defective). We report the g-measure, as Peters et al. (2013b) do, since Menzies et al. (2007a) show that precision can be unstable when datasets contain a low percentage of defects. It is defined as:

$$ g\textnormal{-}measure=\frac{2\times pd \times (1-fpr)}{pd+(1-fpr)} $$
(6)

Matthews Correlation Coefficient (MCC)

is a balanced measure of true and false positives and negatives. It ranges from -1 to +1, where +1 indicates a perfect prediction, 0 means the prediction is close to a random prediction, and -1 represents total disagreement between predicted and actual values. It is defined as:

$$ MCC=\frac{TP \times TN-FP \times FN}{\sqrt{(TP+FP) \times (TP+FN) \times (TN+FP) \times (TN+FN)}} $$
(7)

Area Under Curve (AUC)

is the area under the receiver operating characteristics (ROC) curve. ROC is independent of the cut-off value that is used to compute the confusion matrix. We choose to apply AUC as another performance measure of prediction models, since AUC has been commonly reported in other studies of cross-project defect prediction. Rahman et al. (2012) find that traditional measures (e.g., precision, and recall) are not as effective as AUC when measuring the performance of cross-project defect prediction models.

Moreover, the confusion matrix can be reconstructed from precision, recall, and d (Hall et al. 2012), where d represents the proportion of correct predictions (i.e., d = TP + TN). We can compute d using precision (prec), recall (pd), and false positive rate (fpr) as follows:

$$ d=\frac{prec \times fpr}{pd \times (1-prec)+prec \times fpr} $$
(8)
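For reference, the confusion-matrix-based measures defined above can be computed with a small helper such as the hypothetical one below (the function name and the example counts are ours). AUC is not derived from the confusion matrix; it is computed from the ranked predicted probabilities, e.g., with sklearn.metrics.roc_auc_score.

    import math

    def performance_measures(tp, fp, fn, tn):
        """Compute the confusion-matrix-based measures of Section 3.7."""
        prec = tp / (tp + fp)
        pd = tp / (tp + fn)                       # recall
        fpr = fp / (fp + tn)
        f_measure = 2 * pd * prec / (pd + prec)
        g_measure = 2 * pd * (1 - fpr) / (pd + (1 - fpr))
        mcc = (tp * tn - fp * fn) / math.sqrt(
            (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
        return {"prec": prec, "pd": pd, "fpr": fpr,
                "F-measure": f_measure, "g-measure": g_measure, "MCC": mcc}

    # Hypothetical example: 60 TP, 20 FP, 40 FN, 380 TN
    print(performance_measures(60, 20, 40, 380))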

4 Experiment Setup

4.1 Data Collection

4.1.1 Subject Projects

SourceForge and GoogleCode are two large and popular repositories for open source projects. We use the SourceForge and GoogleCode data initially collected by Mockus (2009), with his updates until October 2010. The dataset contains the full history of about 154K projects hosted on SourceForge and 81K projects hosted on GoogleCode up to the date they were collected. The file contents of each revision and the commit logs are stored separately and linked together using a universal unique identifier. The file contents of the SourceForge and GoogleCode projects are spread across 100 database files, each of which is about 8 gigabytes. The commit logs are spread across 13 compressed files with a total size of about 10 gigabytes. Although we have 235K projects in total, many of them are trivial projects that do not have enough history and defect data for evaluation. Hence, we clean the dataset and obtain 1,385 projects for our experiments. Compared to the 1,398 projects used in our previous work (Zhang et al. 2014), 13 projects were removed due to an error in data pre-processing; the error was identified during this extension, and has been fixed. The cleaning process is detailed in the following subsection.

4.1.2 Cleaning the Dataset

Filtering Out Projects by Programming Languages

In this study, we use a commercial tool, called Understand (SciTools 2015), to compute code metrics. Due to the limitations of the tool, we only investigate projects that are mainly written in C, C++, C#, Java, or Pascal. For each project, we determine its main programming language by counting the total number of files per file type (i.e., *.c, *.cpp, *.cxx, *.cc, *.cs, *.java, and *.pas).

Filtering Out the Projects with a Small Number of Commits

A small number of commits cannot provide enough information for computing process metrics and mining defect data. We compute the quantiles of the number of commits of all projects throughout their history, and choose the 25 % quantile of the number of commits as the threshold to filter out projects. In our dataset, we filter out projects with 32 or fewer commits throughout their histories.

Filtering Out the Projects with Lifespan Less Than One Year

Most defect prediction studies collect defect data from a six-month period after a software release (Zimmermann et al. 2007), and compute process metrics using the data from the six months before it. However, numerous projects on SourceForge or GoogleCode do not have clearly defined releases. Therefore, we determine the split date for each project by looking six months (i.e., 182.5 days) back from its last commit. We collect defect data in the six-month period after the split date, and compute process metrics using the change history in the six-month period before the split date. Thus, we filter out projects with a lifespan of less than one year (i.e., 365 days).
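A small sketch of the split-date computation, under the assumption that commit timestamps are available as Python datetime objects:

    from datetime import timedelta

    HALF_YEAR = timedelta(days=182.5)

    def split_windows(last_commit_date):
        """Split a project's history around a date six months before its last commit:
        process metrics come from the window before the split date, defect data from
        the window after it."""
        split_date = last_commit_date - HALF_YEAR
        metrics_window = (split_date - HALF_YEAR, split_date)   # compute process metrics here
        defect_window = (split_date, last_commit_date)          # mine fix-inducing commits here
        return split_date, metrics_window, defect_window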

Filtering Out the Projects with Limited Defect Data

Defect data needs to be mined from a sufficient number of commit messages. We count the number of fix-inducing and non-fixing commits in a one-year period, and choose the 75 % quantile of the number of fix-inducing (respectively, non-fixing) commits as the threshold to filter out the projects with too little defect data. For projects hosted on SourceForge, the 75 % quantiles of the number of fix-inducing and non-fixing commits are 152 and 1,689, respectively. For projects hosted on GoogleCode, the 75 % quantiles of the number of fix-inducing and non-fixing commits are 92 and 985, respectively.

Filtering Out the Projects Without Fix-Inducing Commits

Subject projects in defect prediction studies usually contain defects. For example, the 56 projects used by Peters et al. (2013b) all have at least one defect. We consider projects that have no fix-inducing commits during the six-month period as abnormal projects, and therefore filter them out. Moreover, there are 13 projects with few commits during the six-month period used for collecting process metrics. We filter out these 13 projects, since process metrics are not available for them.

Description of the Final Experiment Dataset

In the cleaned dataset, there are 931 SourceForge projects and 454 GoogleCode projects. Among them, 713 projects employ CVS as their version control system, 610 projects use Subversion, and 62 projects adopt Mercurial. The numbers of projects that are mainly written in C, C++, C#, Java, and Pascal are 283, 421, 84, 586, and 11, respectively. There are 810 projects using issue tracking systems, and 575 projects without any issue tracking system. We show boxplots of the other four context factors in Fig. 2.

Fig. 2 Boxplot of the four numeric context factors (i.e., TLOC, TNF, TNC, and TND) in our dataset

4.2 Software Metrics

Software metrics are used as predictors to build a defect prediction model. In this study, we choose 21 code metrics and 5 process metrics that are often used in defect prediction models. The list of selected metrics is shown in Table 4. File- and method-level metrics are available for all five studied programming languages. Class-level metrics are only available for object-oriented programming languages, and are set to zero for files written in C. As defect prediction is performed at the file level in this study, method-level and class-level metrics are aggregated to the file level using three schemes, i.e., average (avg), maximum (max), and summation (total). The code metrics are computed by the Understand tool (SciTools 2015). Process metrics include the number of revisions and bug-fixing revisions (see Section 4.3), and the numbers of added, deleted, and modified lines of code; they are computed by our scripts. For each file, we extract all revisions performed during the period for collecting process metrics, and obtain the number of revisions and bug-fixing revisions. The numbers of added, deleted, and modified lines between every two consecutive revisions of each file are computed, and then aggregated to the file level using the three aforementioned schemes. As mentioned in Section 4.1.2, we look six months (i.e., 182.5 days) back from the last commit to obtain the split date. The code metrics are computed on the files from the snapshot at the split date. The process metrics are computed using the change history in the six-month period before the split date.

Table 4 List of software metrics. The last column refers to the aggregation scheme (“none” means that aggregation is not performed for file-level metrics)
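As an illustration of the three aggregation schemes, the following pandas sketch rolls up a hypothetical method-level metric to the file level; the column names and toy values are our own.

    import pandas as pd

    # Hypothetical method-level measurements for two files
    method_metrics = pd.DataFrame({
        "file": ["a.c", "a.c", "b.c"],
        "cyclomatic": [3, 7, 2],
    })

    # Aggregate to the file level using the avg, max, and total schemes of Table 4
    aggregated = method_metrics.groupby("file")["cyclomatic"].agg(
        avg="mean", max="max", total="sum")
    print(aggregated)
    # a.c -> avg 5.0, max 7, total 10; b.c -> avg 2.0, max 2, total 2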

4.3 Defect Data

Defect data are often mined from commit messages, and corrected using defect information stored in an issue tracking system (Zimmermann et al. 2007). In our dataset, 42 % of the subject projects do not use issue tracking systems. For such projects, we mine defect data solely by analyzing the content of commit messages. A similar method for mining defect data is used by Mockus and Votta (2000) and in the SZZ algorithm (Śliwerski et al. 2005). We first remove all words ending with “bug” or “fix” from commit messages, since “bug” and “fix” can be affixes of other words (e.g., “debug” and “prefix”). A commit message is tagged as defect-fixing if it matches the following regular expression:

$$(bug|fix|error|issue|crash|problem|fail|defect|patch) $$
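The sketch below illustrates this tagging procedure. The keyword list is taken from the expression above; the case-insensitive matching and the decision to keep the bare words “bug” and “fix” while stripping longer words that merely end in them are our assumptions.

    import re

    # Keywords from the regular expression above (case-insensitivity is an assumption).
    KEYWORDS = re.compile(r"(bug|fix|error|issue|crash|problem|fail|defect|patch)",
                          re.IGNORECASE)
    # Strip longer words that merely end in "bug" or "fix" (e.g., "debug", "prefix"),
    # so that they do not trigger a false keyword match.
    AFFIXED = re.compile(r"\b\w+(?:bug|fix)\b", re.IGNORECASE)

    def is_fix_inducing(message):
        cleaned = AFFIXED.sub("", message)
        return KEYWORDS.search(cleaned) is not None

    assert is_fix_inducing("Fix null pointer crash in parser")
    assert not is_fix_inducing("add debug logging for prefix handling")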

Using commit messages to mine defect information may be biased (Bird et al. 2009; Kim et al. 2011; Herzig et al. 2013). However, Rahman et al. (2013) report that increasing the sample size can mitigate the possible bias in defect data. Our dataset contains 1,385 subject projects, and is around 140 to 280 times larger than the datasets used in most papers in this field (Peters et al. 2013b). In addition, the modelling technique (i.e., Naive Bayes) used in this study has been shown by Kim et al. (2011) to have strong noise resistance, tolerating up to 20 %–35 % of false positive and false negative noise in defect data. Finally, the defect data is collected in the six-month period after the split date. We show boxplots of the number of defects and the percentage of defects in our dataset in Fig. 3.

Fig. 3 Boxplot of the number of defects and the percentage of defects in our dataset

5 Case Study Results

This section first describes the statistics of our project clusters, and then presents the motivation, approach, and findings of the following five research questions.

RQ1: Can a context-aware rank transformation provide predictive power comparable to the power of log transformation?

RQ2: What is the performance of the universal defect prediction model?

RQ3: What is the performance of the universal defect prediction model on external projects?

RQ4: Do context factors affect the performance of the universal defect prediction model?

RQ5: What predictors should be included in the universal defect prediction model?

5.1 Project Clusters

In our dataset, there are 1,385 open source projects. The set of projects is stratified into non-overlapping groups along the six context factors: programming language, issue tracking, the total lines of code, the total number of files, the total number of commits, and the total number of developers. In total, we obtain 478 non-empty groups. For each metric, we perform \({478 \choose 2}=\frac {478!}{2! \times 476!}=114,003\) Mann-Whitney U tests to compare the difference in distribution between every pair of groups. To control family-wise errors, we adjust the threshold p-value using Bonferroni correction to 0.05/114,003 = 4.39e-07. Any pair of groups without a statistically significant difference in their distributions is merged. Moreover, pairs of groups whose difference is significant but not large (measured by Cliff’s δ) are also merged. The maximum number of clusters observed for a metric is 32, obtained for the metric total_Cbo (i.e., the sum of the values of coupling between objects per file).

5.2 Research Questions

RQ1: Can a context-aware rank transformation provide predictive power comparable to the power of log transformation?

Motivation

We have proposed a context-aware rank transformation method to eliminate the impact of the varied scales of metrics among different projects. The rank transformation converts the raw values of all metrics to levels on the same scale. Before fitting the rank-transformed metric values to a universal defect prediction model, it is necessary to evaluate the performance of our transformation approach. To achieve this goal, we compare the performance of defect prediction models built using rank transformation to that of models built using log transformation. The log transformation uses the logarithm of raw metric values, and has been shown to improve predictive power in defect prediction (Menzies et al. 2007b; Jiang et al. 2008).

Approach

For each project, we build two within-project defect prediction models using the metrics listed in Table 4: one uses log-transformed metric values, and the other uses rank-transformed metric values. We call a model a within-project defect prediction model if both the training and testing data are from the same project. To evaluate the performance of predictions, we perform 10-fold cross validation on each project.

To investigate the performance of our rank transformation, we test the following null hypothesis for each performance measure:

H01: there is no difference between the performance of defect prediction models built using log and rank transformations.

Hypothesis H01 is two-tailed, since it investigates whether rank transformation yields better or worse performance than log transformation. We conduct two-tailed, paired Wilcoxon rank sum tests (Sheskin 2007) to compare the seven performance measures, using the 95 % confidence level (i.e., p-value < 0.05). The Wilcoxon rank sum test is a non-parametric statistical test for assessing whether two distributions have equally large values; non-parametric statistical methods make no assumptions about the distributions of the assessed variables. If there is statistical significance, we reject the hypothesis and conclude that the performance of the two transformation techniques differs. Moreover, we compare the proportions of successful predictions. The success of a prediction is determined using two criteria: 1) strict criteria (i.e., precision and recall are greater than 0.75), as used by Zimmermann et al. (2009); and 2) loose criteria (i.e., precision is greater than 0.5 and recall is greater than 0.7), as applied by He et al. (2012).
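The comparison can be sketched as follows. The arrays `perf_log` and `perf_rank` are hypothetical names for per-project values of one performance measure, aligned by project, and scipy's paired Wilcoxon test is used as a stand-in for the test described above.

    from scipy.stats import wilcoxon

    def reject_h01(perf_log, perf_rank, alpha=0.05):
        """Paired, two-sided Wilcoxon test of H01 for a single performance measure."""
        _, p_value = wilcoxon(perf_log, perf_rank)
        return p_value < alpha

    def is_successful(prec, recall, criteria="strict"):
        """Success criteria: strict (precision and recall > 0.75, Zimmermann et al. 2009)
        or loose (precision > 0.5 and recall > 0.7, He et al. 2012)."""
        if criteria == "strict":
            return prec > 0.75 and recall > 0.75
        return prec > 0.5 and recall > 0.7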

Findings

There are 99 projects that do not contain enough files to perform 10-fold cross validation. Hence, we compare the performance of log and rank transformations on the remaining 1,286 projects. Table 5 presents the mean values of the seven performance measures for both transformations, together with the corresponding p-values of the Wilcoxon rank sum tests. We reject hypothesis H01 for most measures (except recall), and conclude that there are significant differences between rank transformation and log transformation in within-project defect prediction in terms of precision, false positive rate, F-measure, g-measure, MCC, and AUC. However, the differences between their average performance measures are negligible (i.e., the absolute value of Cliff’s δ is less than 0.147), as shown in Table 5. To better illustrate the differences, we show boxplots of the performance measures of models built using log and rank transformations in Fig. 4.

Table 5 The results of Wilcoxon rank sum tests and mean values of the seven performance measures of log transformation and our context-aware rank transformation in the within-project settings
Fig. 4 Boxplots of performance measures of models built using log and rank transformations

Furthermore, the proportions of successful predictions for both approaches are identical: 13 % and 27 % under the strict and loose criteria, respectively. Therefore, we conclude that rank transformation achieves performance comparable to log transformation, and that it is reasonable to use the proposed rank transformation method to build universal defect prediction models.


RQ2: What is the performance of the universal defect prediction model?

Motivation

The findings of RQ1 support the feasibility of our proposed rank transformation method for building defect prediction models. However, building an effective universal model remains a challenge. For instance, Menzies et al. (2011) report poor performance of a model built on an entire set of diverse projects. This research question aims to investigate the best achievable predictive power of the universal model. First, we evaluate whether the predictive power of the universal model can be improved by adding context factors as predictors, together with the code metrics and process metrics commonly used in prior defect prediction studies. Second, we study whether the universal model can achieve performance comparable to within-project defect prediction models. Accordingly, we split RQ2 into two sub-questions:

RQ2.1: Can context factors improve the predictive power?

RQ2.2: Is the performance of the universal defect prediction model comparable to within-project models?

Approach

We describe the approach for each sub-question in turn.

To address RQ2.1, we build the universal model using five combinations of metrics: 1) code metrics; 2) code and process metrics; 3) code metrics and context factors; 4) process metrics and context factors; and 5) code metrics, process metrics, and context factors. All metrics are transformed using the context-aware rank transformation. To evaluate the performance of predictions, we perform 10-fold cross validation on the entire set of projects. To compare the performance of the universal model across the different combinations of metrics, we test the following null hypothesis for each pair of metric combinations:

H021: there is no difference between the performance of the universal defect prediction models built using two metric combinations.

To address RQ2.2, we obtain the performance of within-project and universal models for each project. The predictive power of within-project models is obtained using 10-fold cross-validation (as in RQ1). The performance of the universal model on a particular project is evaluated by applying a universal model built upon the remaining set of projects to that project. We compute and compare the proportions of acceptable predictions of both the universal model and the within-project models. To compare the performance of within-project and universal models, we test the following null hypothesis for each performance measure:

H022: there is no difference between the performance of within-project and universal defect prediction models.

Hypotheses H021 and H022 are two-tailed, since they investigate whether one prediction model yields better or worse performance than the other. We apply two-tailed, paired Wilcoxon rank sum tests at the 95 % confidence level to examine each hypothesis. If there is significance, we reject the null hypothesis and compute Cliff’s δ (Cliff 1993) to measure the magnitude of the difference.

Findings

We report our findings for two sub questions, respectively.

(RQ2.1): Table 6 provides the performance measures of the universal model for each combination of metrics. In general, adding context factors increases five performance measures (i.e., precision, F-measure, g-measure, MCC, and AUC). AUC is the only measure that is independent of the cut-off value. Due to space limitations, Table 7 only presents Cliff’s δ and the p-values of Wilcoxon rank sum tests for the comparisons of AUC values. We observe that adding context factors significantly improves the performance compared to using only code metrics. The Cliff’s δ is -0.734, indicating a large improvement (i.e., the absolute value of Cliff’s δ is greater than 0.474). In addition, adding context factors yields a significant improvement (Cliff’s δ is -0.707) over using code and process metrics. Hence, we conclude that the context factors are good predictors for building a universal defect prediction model.

(RQ2.2): The boxplots of performance measures of within-project and universal models are shown in Fig. 5. Table 8 presents the results of the Wilcoxon rank sum tests on the performance measures of within-project models and universal models built using rank transformation. We reject the null hypothesis H022 for all measures except precision and g-measure. The results show that the universal model and the within-project models have similar precision, recall, false positive rate, g-measure, and MCC: the differences in these five performance measures are either not statistically significant (i.e., the p-value is greater than 0.05) or negligible in magnitude (i.e., the absolute value of Cliff’s δ is less than 0.147). There are observable but small (i.e., the absolute value of Cliff’s δ is greater than 0.147, but less than 0.330) differences in F-measure and AUC: the universal model has a lower F-measure but a higher AUC than within-project models. F-measure is computed from the confusion matrix (see Section 3.7), which is obtained using a cut-off value, whereas calculating AUC does not require a cut-off value. A possible cause of the lower F-measure but higher AUC of the universal model is that different cut-off values may be needed for different projects when applying the universal model. Understanding how to choose the best cut-off values might help improve the F-measure of the universal model.

Moreover, the universal model yields a similar percentage (i.e., 3.6 %) of successful predictions under the strict criteria (see RQ1) to that of Zimmermann et al. (2009), who report a 3.4 % success rate. Under the loose criteria, the universal model achieves 13 % successful predictions, much higher than He et al. (2012), who report 0.32 % successful predictions. The universal model thus achieves up to 48 % (i.e., 13 % against 27 %) of the successful predictions of within-project models. We conclude that our approach for building a universal model is promising.

Table 6 The seven performance measures (mean ± std.dev) of the universal models built using code metrics (CM), code + process metrics (CPM), code metrics + context factors (CM-C), process metrics + context factors (PM-C), and code + process metrics + context factors (CPM-C), respectively
Table 7 The Cliff's δ and p-value of Wilcoxon rank sum tests on the comparison of AUC values across the universal models built using different combinations of metrics
Fig. 5 Boxplots of performance measures of within-project and universal models

Table 8 The results for Wilcoxon rank sum tests and average values of the seven performance measures of within-project models and universal models

RQ3: What is the performance of the universal defect prediction model on external projects?

Motivation

In RQ2, we successfully build a universal model for a large set of projects. The universal model slightly outperforms within-project models in terms of recall and AUC. Although our experiments involve a large number of projects from various contexts, the projects are selected from only two hosts: SourceForge and GoogleCode. It is still unclear whether the universal model generalizes to external projects that are not managed on these two hosts. This research question investigates the performance of the universal model when predicting defects for external projects that are not hosted on SourceForge or GoogleCode.

Approach

To address this question, we use the publicly available dataset collected by D’Ambros et al. (2010). The dataset contains four Eclipse projects (i.e., Eclipse JDT Core, Eclipse PDE UI, Equinox Framework, and Mylyn) and one Apache project (i.e., Lucene). We present the descriptive statistics of the five external projects in Table 9.

Table 9 The descriptive statistics of the five external projects used in this study

We calculate the six context factors of the five aforementioned projects, and apply the corresponding ranking functions to convert their raw metric values to one of ten levels. We predict defects on each project using the universal model, which is learnt from the 1,385 SourceForge and GoogleCode projects. The seven performance measures of within-project models are obtained via 10-fold cross-validation for each project.
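
As an illustration, a rough R sketch of one such ranking function is given below, assuming (as described earlier in the paper) thresholds at every tenth quantile of the matched cluster; raw and cluster_values are hypothetical names for the raw values of one metric in the external project and in the matched cluster of training projects, respectively.

    # Convert raw metric values to levels 1..10 using every tenth quantile
    # of the matched cluster as thresholds (a sketch, not the exact tooling)
    rank_transform <- function(raw, cluster_values) {
      thresholds <- quantile(cluster_values, probs = seq(0.1, 0.9, by = 0.1), na.rm = TRUE)
      findInterval(raw, unique(thresholds)) + 1
    }

    # Example: transform the lines of code of an external project
    # rank_transform(external$Loc, training_cluster$Loc)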

Findings

Table 10 presents the average values of the seven performance measures of the universal model and within-project models, respectively. Overall, there are clear differences in the performance (all measures except AUC) of the universal model and within-project models. In particular, the universal model yields lower precision and a larger false positive rate, but higher recall than within-project models. However, these performance measures depend on the cut-off value that is used to determine whether an entity is defective (see Section 3.7). Such performance measures can change significantly when the cut-off value is altered. The AUC is independent of the cut-off value and is preferred for cross-project defect prediction (Rahman et al. 2012). As the universal model achieves AUC values similar to within-project models on the five subject projects, we conclude that the universal model is as effective as within-project defect prediction models for these projects. However, different cut-off values may be needed to yield high precision or a low false positive rate when applying the universal model to different projects. We discuss how to deal with the high false positive rate below.

Table 10 The performance measures for within-project model and the universal model

Discussions on False Positive Rate

In practice, a high false positive rate is unacceptable, e.g., a false positive rate greater than 0.64 (Turhan et al. 2009). As shown in Table 10, the universal model experiences high false positive rates in three projects (i.e., Eclipse, Lucene, and PDE). In RQ2, we observe that the universal model exhibits a false positive rate similar to within-project defect prediction models in general. Hence, we conjecture that the high false positive rate in external projects is due to the different percentages of defects in the training set (the median percentage of defects is 40 %) and in the five external projects (the median percentage of defects is 14 %). Nevertheless, it is of significant interest to seek insights on how to determine cut-off values that reduce the false positive rate.

1) Effort-aware estimation of the cut-off value. It is time consuming to examine all entities that are predicted as defective. If a development team has limited resources or a tight schedule, it is more realistic to inspect only the top X % of entities that are predicted as defective. To this end, we choose the minimum predicted probability among the top X % of entities as the cut-off value for each project. We recalculate the performance measures and present the detailed results in Table 11. We observe that the median false positive rate is reduced to 0.053 if only the top 10 % of entities are considered defective. When considering only the top 20 % and 30 % of entities as defective, the median false positive rate becomes 0.137 and 0.230, respectively. As cut-off values change, the other performance measures are also updated. For instance, the median recall becomes 0.270, 0.444, and 0.578, respectively, when considering only the top 10 %, 20 %, and 30 % of entities as defective. The AUC values remain the same when altering cut-off values. Therefore, we conclude that a high false positive rate can be tamed by considering only the top 10 %, 20 %, or 30 % of entities that are predicted as defective by the universal model.

2) Other insights on selecting the cut-off value. Inspired by transfer learning (Pan and Yang 2010), we suppose that appropriate cut-off values may be learnt from target projects. Intuitively, we conjecture that the appropriate cut-off values might depend on the percentage of defective entities in the target project. Alternatively, minimizing the total error (i.e., FP+FN) may help reduce the false positive rate while maintaining recall. Therefore, we examine whether it is feasible to reduce the false positive rate by inferring the cut-off value from: 1) the percentage of defects in the target project; and 2) the minimized total error (i.e., FP+FN). A sketch of these cut-off heuristics is given after this list. Table 12 shows the detailed results of the two methods. In particular, using the percentage of defects in the target project results in a median false positive rate of 0.323 along with a recall of 0.727. Minimizing the total error yields a median false positive rate of 0.255 along with a recall of 0.609. In both cases, the performance is similar to the work by Turhan et al. (2009), which successfully reduces a high false positive rate to 0.33 with a recall of 0.68 by filtering the training set.
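
The following R sketch illustrates one plausible implementation of these cut-off heuristics, together with the effort-aware variant from item 1; prob holds the predicted defect probabilities of the universal model for one external project and defective its known labels, both hypothetical names.

    # 1) Effort-aware: minimum predicted probability among the top x % of entities
    top_x_cutoff <- function(prob, x = 0.10) {
      sort(prob, decreasing = TRUE)[max(1, ceiling(x * length(prob)))]
    }

    # 2a) Match the flagged fraction to the percentage of defects in the target project
    ratio_cutoff <- function(prob, defect_ratio) {
      quantile(prob, probs = 1 - defect_ratio, names = FALSE)
    }

    # 2b) Choose the cut-off that minimises the total error FP + FN
    error_min_cutoff <- function(prob, defective) {
      candidates <- sort(unique(prob))
      errors <- sapply(candidates, function(t)
        sum(prob >= t & defective == 0) + sum(prob < t & defective == 1))
      candidates[which.min(errors)]
    }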

Table 11 The performance measures for the universal model on external projects with cut-off values determined by the minimum predicted probability among the top 10 %, 20 %, and 30 % of entities predicted as defective
Table 12 The performance measures for the universal model on external projects with cut-off values obtained by using the ratio of defects or by minimizing the total error on the entire set of entities

However, it is impractical to obtain the defectiveness of all entities in the target project; otherwise, a within-project defect prediction model could be constructed. Therefore, it is necessary to investigate how many entities are required to estimate cut-off values. As using fewer entities may yield unstable cut-off values, we average the cut-off values determined by the aforementioned two methods to obtain the final cut-off value. We perform an exploratory experiment by randomly sampling 10, 20, and 30 entities from each project. We repeat the experiment 100 times for each project, and report the average performance measures in Table 13. In general, we observe that increasing the number of randomly sampled entities improves the performance of the universal model in terms of five measures (i.e., precision, false positive rate, F-measure, g-measure, and MCC). When randomly sampling 10 entities, the universal model achieves a median false positive rate of 0.344 and a median recall of 0.723. Unlike the work by Turhan et al. (2009), which customizes the prediction model (i.e., by filtering the training set) for the target project, the universal model is not altered for a particular target project. Software organizations do not need to provide their data for customizing prediction models; they only need to tune cut-off values for their goals. The universal model can thus help address concerns about sharing data or models across companies (Peters et al. 2013a). Although the defectiveness of some entities in the target project needs to be inspected, the proportion of required entities is relatively low, such as 1 % (10/997) for Eclipse, 3 % (10/324) for Equinox, 1 % (10/691) for Lucene, 1 % (10/1862) for Mylyn, and 1 % (10/1497) for PDE.
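
A hypothetical sketch of this sampling procedure is shown below; it inspects the labels of only n randomly chosen entities and averages the two cut-off heuristics described above. All names are placeholders.

    # Estimate a cut-off from n inspected entities; prob: predicted probabilities
    # for all entities, defective: labels (only the sampled ones are actually used)
    estimate_cutoff <- function(prob, defective, n = 10) {
      idx   <- sample(seq_along(prob), n)
      ratio <- mean(defective[idx])                        # sample defect ratio
      c1    <- quantile(prob, probs = 1 - ratio, names = FALSE)
      grid  <- sort(unique(prob[idx]))
      errs  <- sapply(grid, function(t)
        sum(prob[idx] >= t & defective[idx] == 0) +
        sum(prob[idx] <  t & defective[idx] == 1))
      c2    <- grid[which.min(errs)]
      mean(c(c1, c2))                                      # averaged final cut-off
    }

    # Average over 100 repetitions, as in the experiment above
    # mean(replicate(100, estimate_cutoff(prob, defective, n = 10)))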

Table 13 The performance measures for the universal model on external projects with cut-off values learnt from both the ratio of defects and the minimized error rates, using a subset of randomly sampled entities

In summary, the results show that our universal model provides comparable performance to within-project defect prediction models for the five subject projects. Considering that the five projects might follow different development practices than SourceForge or GoogleCode projects, there is a good chance that the universal model can be applied to more external projects with acceptable predictive power.


RQ4: Do context factors affect the performance of the universal defect prediction model?

Motivation

In RQ3, we verified the capability of the universal model to predict defects for five external projects. The universal model captures general relationships between metrics and defect proneness, regardless of where projects are hosted. However, the five external projects have limited diversity. For example, they are all written in Java. Hence, the generalizability of the universal model is not deeply examined in RQ3, which threatens the applicability of the universal model to projects from different contexts (Nagappan et al. 2013). It is therefore essential to investigate whether the performance of the universal model varies across projects with different context factors.

Approach

To address this question, we compare the performance of the universal model across projects with different context factors (see Section 3.2). To train the universal model with the largest possible number of projects, we apply leave-one-out cross-validation (e.g., Zhou and Leung 2007): for a particular project, we use all other projects to build a universal model and apply it to that project. We repeat this step for every project and obtain the predictive power of the universal model on each project. As the findings of RQ2 and RQ3 suggest that different projects may prefer different cut-off values, we only compare AUC values to avoid the impact of cut-off values on our observations.
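
A minimal R sketch of this leave-one-out loop is given below. It assumes a hypothetical named list projects of per-file data frames (rank-transformed predictors plus a binary defective column) and uses the naiveBayes implementation from the e1071 package, which we assume is available.

    library(e1071)  # assumed; provides naiveBayes()

    # Cut-off-free AUC via the Mann-Whitney formulation
    auc <- function(label, score) {
      r <- rank(score); pos <- label == 1
      (sum(r[pos]) - sum(pos) * (sum(pos) + 1) / 2) / (sum(pos) * sum(!pos))
    }

    loo_auc <- sapply(names(projects), function(p) {
      train <- do.call(rbind, projects[names(projects) != p])   # all other projects
      train$defective <- as.factor(train$defective)
      fit   <- naiveBayes(defective ~ ., data = train)          # universal model
      score <- predict(fit, projects[[p]], type = "raw")[, "1"] # P(defective)
      auc(projects[[p]]$defective, score)                       # per-project AUC
    })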

We divide the entire set of projects along each context factor, respectively. There are three types of context factors: a categorical factor (i.e., programming language), a boolean factor (i.e., issue tracking), and numerical factors (e.g., the total lines of code). For the categorical factor, we obtain one group per category. For the boolean factor, we obtain two groups. For the numerical factors, we compute quantiles of the values and derive four groups. All groups are listed in Table 14, and the details on these groups are described in Section 3.3. Please note that these groups are created solely based on context factors, rather than the distribution of software metrics.

Table 14 Groups of projects split along different context factors

For each pair of groups on a particular context factor, we test the following null hypothesis:

H04:

there is no difference in the performance of the universal model between the projects of a group-pair.

Hypothesis H04 is two-tailed, since it investigates whether the universal model yields better or worse performance in one project group than in the other group of a group-pair. As the sizes of the two groups may differ, we apply a two-tailed and unpaired Wilcoxon rank sum test at the 95 % confidence level to examine the hypothesis.
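
For illustration, the sketch below shows how such pairwise comparisons could be carried out in R for one context factor; auc_by_group is a hypothetical named list holding the per-project AUC values of each group (e.g., the four TLOC groups).

    pairs    <- combn(names(auc_by_group), 2, simplify = FALSE)  # all group-pairs
    p_thresh <- 0.05 / length(pairs)                             # Bonferroni-corrected threshold

    p_values <- sapply(pairs, function(g)
      wilcox.test(auc_by_group[[g[1]]], auc_by_group[[g[2]]],
                  paired = FALSE, alternative = "two.sided")$p.value)
    names(p_values) <- sapply(pairs, paste, collapse = " vs ")

    p_values < p_thresh   # TRUE would indicate a significant difference for that pair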

Findings

For each context factor, we present our findings on the generalizability of the universal model.

1) Programming language (PL): The average AUC values for groups G_C, G_C++, G_C#, G_Java, and G_Pascal are 0.64, 0.65, 0.61, 0.64, and 0.64, respectively. There are 5 groups of projects divided by programming language, so the number of pairwise comparisons is 10 and the threshold p-value is 5.00e-03 after Bonferroni correction. The p-values of the Wilcoxon rank sum tests are always greater than 5.00e-03. We do not find enough evidence to support that there are significant differences across projects written in different programming languages. In other words, the universal model yields similar performance for projects written in any of the five studied programming languages. There exist common relationships between software metrics and defect proneness, no matter whether projects are developed using C, C++, C#, Java, or Pascal. Future work is needed to understand such common relationships more deeply.

2) Issue tracking (IT): The average AUC values for groups G_useIT and G_noIT are 0.64 and 0.65, respectively. There is only one pair of groups divided by the usage of issue tracking systems, and therefore the threshold p-value is 0.05. The p-value of the Wilcoxon rank sum test is 0.14, indicating that there is no significant difference between projects with and without issue tracking systems.

3) Total lines of code (TLOC): The average AUC values for groups G_leastTLOC, G_lessTLOC, G_moreTLOC, and G_mostTLOC are 0.63, 0.65, 0.65, and 0.64, respectively. There are 4 groups of projects divided by the total lines of code, so the number of pairwise comparisons is 6 and the threshold p-value is 8.33e-03 after Bonferroni correction. The p-values of the Wilcoxon rank sum tests are always greater than 8.33e-03. The universal model can reveal general relationships between software metrics and defect proneness for small, medium, and large projects.

4) Total number of files (TNF): The average AUC values for groups G_leastTNF, G_lessTNF, G_moreTNF, and G_mostTNF are 0.62, 0.65, 0.64, and 0.64, respectively. Similarly, the threshold p-value is 8.33e-03 after Bonferroni correction. The p-values of the Wilcoxon rank sum tests are always greater than 8.33e-03. We conclude that no matter how many files a project has, the universal model can predict defect proneness without a significant difference in its performance.

5) Total number of commits (TNC): The average AUC values for groups G_leastTNC, G_lessTNC, G_moreTNC, and G_mostTNC are 0.64, 0.65, 0.65, and 0.63, respectively. There are 4 groups and 6 pairwise comparisons, so we correct the threshold p-value to 8.33e-03. The p-values of the Wilcoxon rank sum tests are always greater than 8.33e-03. There is no significant difference in the performance of the universal model across projects with different total numbers of commits.

6) Total number of developers (TND): The average AUC values for groups G_leastTND, G_lessTND, G_moreTND, and G_mostTND are 0.63, 0.65, 0.64, and 0.64, respectively. There are 4 groups and 6 pairwise comparisons, so we correct the threshold p-value to 8.33e-03. The p-values of the Wilcoxon rank sum tests are always greater than 8.33e-03. The performance of the universal model does not change significantly across projects with different numbers of developers.

In summary, we cannot find enough evidence to support the hypothesis that the universal model performs significantly differently for projects with different context factors. Hence, we conclude that the universal model is applicable to projects with different context factors.


RQ5: What predictors should be included in the universal defect prediction model?

Motivation

The purpose of RQ3 and RQ4 is to show that our universal model is applicable to external projects and is context-insensitive, respectively. Therefore, in the earlier experiments, we use all metrics together to build Naive Bayes models. However, many of these metrics may be strongly correlated. To build an interpretable model, we need to select a subset of metrics that are not strongly correlated. In within-project settings, the importance of various metrics has been examined in depth (e.g., Shihab et al. 2010). For the universal model, we aim to find an uncorrelated, interpretable set of predictors that are associated with the chance that a file will have a fix in the future. We do this to understand the general relationship between predictors and defect proneness for the entire set of projects.

Approach

To make the model more interpretable, we chose to use logistic regression to build the universal model. Our choice was motivated by the ease with which logistic regression coefficients can be interpreted (Zimmermann et al. 2012). For instance, the sign of a coefficient indicates the direction of the impact, i.e., positive or negative. The magnitude indicates the strength of the impact, i.e., how much the probability of defect proneness is affected by a one-unit change in the corresponding predictor. Further details on the conversion from coefficients to exact probabilities can be found in the book by Hosmer et al. (2013).
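
As a small worked example in R, a coefficient β on a rank-transformed predictor translates into an odds ratio of exp(β) per one-level increase, and into a probability change that also depends on the other terms. The two numbers below are illustrative (they correspond to the intercept and the Loc coefficient reported later in Table 16).

    beta0 <- -2.68   # illustrative intercept
    beta1 <-  0.10   # illustrative coefficient of a rank-transformed predictor

    exp(beta1)                 # odds ratio per one-level increase (about 1.105)
    p <- function(level) 1 / (1 + exp(-(beta0 + beta1 * level)))
    p(5) - p(4)                # change in predicted probability from level 4 to level 5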

Because predictors may be highly correlated, we first need to select an uncorrelated subset of predictors. We use the following rules to select predictors:

1) Select well-known, simple predictors that are uncorrelated. We choose lines of code (Loc) as the code metric and the number of revisions (Nrev) as the process metric, as both have often been associated with future fixes.

2) We analyze the correlation among context factors, and find that the context factors are strongly associated. Hence, we choose the total number of files (TNF) as a context measure of project size and the total number of developers (TND) as a context measure of project activity. Using the first, second, and third quartiles, we convert the two context measures to four levels, respectively. We treat them as categorical variables (same as programming language) in the model, because the odds of future fixes may not increase by the same amount as we go from one level (defined by quartiles) to the next.

3) We perform hierarchical clustering of all predictors using a distance defined as 1 − ∥cor(p_1, p_2)∥^2, where p_1 and p_2 are two predictors. We use the R function hclust to obtain the clusters of predictors shown in Fig. 6, and then apply the R function cutree to derive eight distinct clusters.

4) For each remaining cluster that does not contain any of the aforementioned predictors, we choose the first predictor that uses the simplest aggregation scheme (avg).

We then use the selected predictors to build the universal model. The glm method in R is used to build the logistic regression model. We then inspect the coefficient of each metric to interpret the universal model.
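
The steps above could be implemented along the following lines. This is a sketch under the assumption that d is a file-level data frame containing the rank-transformed metrics, the categorical context factors (here named PL, TNF_level, and TND_level for illustration), and a binary defective column; the metric names follow Table 15.

    # Hierarchical clustering of the numeric predictors with distance 1 - cor(p1, p2)^2
    metrics  <- d[, !(names(d) %in% c("defective", "PL", "TNF_level", "TND_level"))]
    clusters <- cutree(hclust(as.dist(1 - cor(metrics)^2)), k = 8)
    # inspect `clusters` (or the dendrogram in Fig. 6) to pick one representative per cluster

    # One representative per cluster (Loc, Nrev, or the simplest avg-aggregated metric),
    # plus the three categorical context factors
    selected <- c("Loc", "Nrev", "avgNoc", "avgNiv", "avgNpm",
                  "avgNprm", "avgFanin", "avgAddedLoc",
                  "PL", "TNF_level", "TND_level")

    # Logistic regression via glm; the coefficients are then inspected for interpretation
    fit <- glm(defective ~ ., data = d[, c(selected, "defective")],
               family = binomial(link = "logit"))
    summary(fit)$coefficients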

Fig. 6 Cluster dendrogram of predictors

Findings

Eight code and process metrics are selected, i.e., lines of code (Loc), average number of immediate subclasses (avgNoc), average number of instance variables (avgNiv), average number of protected methods (avgNpm), average number of private methods (avgNprm), average number of input data (avgFanin), number of revisions (Nrev), and average added lines of code (avgAddedLoc). The correlation matrix of the selected eight predictors in Table 15 does not show any correlation above 0.5.

Table 15 The Pearson correlation among selected code and process metrics

Table 16 presents the coefficient of each predictor and the amount of deviance that each predictor explains. The final model explains 7 % of the deviance of the probability of fixes for the entire set of projects. The null deviance is 177,497 on 136,160 degrees of freedom, while the residual deviance is 165,003 on 136,142 degrees of freedom. It is important to note that each predictor should be considered as a representative of all tightly correlated predictors within its cluster. In particular, avgNiv represents a very large number of metrics, including Cbo, Lcom, Wmc, Npbm, Nim, and Nom. The coefficient of avgNprm is not significantly different from zero, suggesting that the predictors in that small cluster do not help model defect proneness. Furthermore, the coefficients of avgAddedLoc and avgNoc do not explain as much deviance as the remaining predictors and are barely significantly different from zero. All three of these predictors should be removed from the final model used for prediction in practice.

Table 16 The coefficients of each predictor in the logistic regression model. The numerical intercept is -2.68. For the context factors PL, TNF, and TND, the categories “C”, “leastTNF”, and “leastTND” are folded into the intercept term, respectively

In models where each predictor is measured in different units, it is difficult to compare coefficient magnitudes among predictors. In our study, the coefficient magnitudes of code and process metrics can be compared, as they have exactly the same scale after the rank transformation. In the resulting model, the most important code metric is lines of code (Loc), followed by average number of input data (avgFanin) and average number of protected methods (avgNpm). The three code metrics explain 4,876 of the deviance of the probability of defect proneness for the entire set of projects. The most important process metric is the number of revisions (Nrev), which explains 1,493.7 of the deviance.

Among the context factors, the most important predictor is the total number of files, followed by the programming language and the total number of developers. The three context factors explain 6,027.4 of the deviance in total. The R tool treats the alphabetically earliest category of each categorical factor as the reference level and folds it into the intercept term. The intercept represents the base probability of defect proneness when all categorical factors are at their reference levels. In our case, the “C” programming language, “leastTNF”, and “leastTND” are the reference levels, and therefore they are not shown in Table 16. There are differences among languages, with “C” code having more fixes than “Java”, “C++”, and “Pascal” code. Projects with more developers involved (relative to other projects in the cluster) are also more likely to contain a fix, whereas projects with more files (relative to other projects in the cluster) are less likely to contain a fix. Future research is needed to fully understand the mechanisms and causes that affect both the predictor values and the chances of future fixes.

The final universal model can be obtained by the following equation:

$$ P = 1 - \frac{1}{1 + \exp\left(\beta_{0} + \beta_{1} m_{1} + \ldots + \beta_{k} m_{k}\right)} $$
(9)

where P is the probability that a file is defective, k is the total number of predictors (in our case k = 11, including three context factors and eight metrics), β_0 is the intercept, and β_i is the coefficient of predictor m_i (i = 1,…,k). For instance, if m_1 is the metric Loc, then β_1 is 0.10. This model can be implemented in an integrated development environment (IDE) for instant evaluation of defect proneness, and can be used to compare defect proneness across projects.
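
As a simple illustration, Eq. (9) can be evaluated in R as follows; the intercept (-2.68) and the Loc coefficient (0.10) come from Table 16, while the remaining coefficients and the rank-transformed input levels are placeholders.

    # Eq. (9); the terms of the categorical context factors would simply be added
    # to the linear predictor eta in the same way
    universal_p <- function(x, beta0, beta) {
      eta <- beta0 + sum(beta * x)
      1 - 1 / (1 + exp(eta))
    }

    beta0 <- -2.68                                          # intercept from Table 16
    beta  <- c(Loc = 0.10, Nrev = 0.05, avgFanin = 0.08)    # placeholders except Loc
    x     <- c(Loc = 7,    Nrev = 5,    avgFanin = 4)       # rank-transformed levels (1-10)
    universal_p(x, beta0, beta)                             # predicted defect probability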


6 Threats to Validity

We now discuss the threats to validity of our study following common guidelines provided by Yin (2002).

Threats to Conclusion Validity

concern the relation between the treatment and the outcome. One conclusion validity threat comes from our data cleaning methods. For instance, we remove the projects with negligible numbers of fix-inducing or non-fixing commits (both using the 75 % quantile as the threshold). We plan to investigate the impact of different thresholds in a future study. Another threat is due to the extraction of defect data. We mine defect data solely based on commit messages, since 42 % of our subject projects do not use issue tracking systems. To deal with this threat, we use a large set of subject projects (Rahman et al. 2013) and apply Naive Bayes as the modelling technique, which has strong resistance to noise in defect data (Rahman et al. 2013).

Threats to Internal Validity

concern our selection of subject projects and analysis methods. SourceForge and GoogleCode are considered to host a large proportion of poorly managed projects. We believe that our data cleaning step increases the data quality. The other threat to internal validity is the possible bias in the defect data. We plan to include well-managed projects (e.g., Linux, Eclipse, and Apache projects) in a future study.

Threats to External Validity

concern the possibility of generalizing our results. Although we demonstrate the capability of the universal model to predict defects for four Eclipse projects and one Apache project, it is unclear whether the universal model also performs well for commercial projects. Future validation on commercial projects is welcome.

Threats to Reliability Validity

concern the possibility of replicating this study. The subject projects are publicly available from SourceForge and GoogleCode. We attempt to provide all the details necessary to replicate our study.

7 Conclusion

In this study, we attempt to build a universal defect prediction model using a large set of projects from various contexts. We first propose a context-aware rank transformation method to pre-process the predictors. This step ensures that the predictors (i.e., software metrics) from the entire set of projects share the same scale. We compare our rank transformation with the widely used log transformation, and find that the rank transformation performs as well as the log transformation in within-project settings. We then build a universal model using the rank-transformed metrics, adding different metric sets (i.e., code metrics, process metrics, and context factors) step by step. The studied context factors include the programming language, the presence of an issue tracking system, the total lines of code, the total number of files, the total number of commits, and the total number of developers. The results show that the context factors further increase the predictive power of the universal model beyond code and process metrics.

To evaluate the performance of the universal defect prediction model, we compare it with within-project models. We find that the universal model has higher AUC values but lower F-measures than within-project models, suggesting that different cut-off values may be needed for different projects. We also study the generalizability of the universal model. First, we apply the universal model built using projects from SourceForge and GoogleCode to five external projects from the Eclipse and Apache repositories. We observe that the AUC values of the predictions by the universal model are very close to those of within-project models built from each project. Moreover, we provide several insights on how to select appropriate cut-off values to control the false positive rate. For instance, the median false positive rate is reduced to 0.053 when only the top 10 % of entities are considered defective.

We further investigate whether the universal model performs differently for projects with different contexts, and find no statistically significant difference for any context factor. Based on our findings, we conclude that our universal model is context-insensitive and applicable to external projects. Finally, we investigate the importance of different metrics using a logistic regression model and present the coefficient of each metric in the universal model. The universal model not only relieves the need for training defect prediction models for different projects, but also helps interpret basic relationships between software metrics and defects.

In future work, we plan to evaluate the feasibility of the universal model for commercial projects. We will also evaluate the possibility of embedding the universal model as a plugin for a version control system or an integrated development environment (IDE) to provide developers with immediate feedback on risk.