1 Introduction

Defect prediction, one of the holy grails of software development, has attracted much attention among software engineering researchers. The driving factor is resource allocation. Since quality assurance resources are limited, it is wise to prioritize and allocate them towards the areas with the highest predicted defect proneness. Different approaches relying on various information sources, such as code metrics Nagappan and Ball (2005), Zimmermann et al. (2007), process metrics Rahman and Devanbu (2013), Mnkandla and Mpofu (2016), previous defects Kim et al. (2007), Felix and Lee (2017), the churn of source code, the entropy of source code D’Ambros et al. (2012), and the entropy of changes Hassan (2009), have been devised to perform the task of defect prediction. The relative performance comparison of these approaches is still an area where research is needed. Most of these approaches were compared to only a few other approaches, were evaluated using distinct performance indicators, or were compared only in the within-project context. Moreover, replicating the evaluations is difficult because many of them used data of commercial systems that is not publicly available.

An evaluation of the relative performance of different approaches in the cross-project context is imperative to identify the approaches with stable and good performance. Traditionally, most defect prediction models were trained and evaluated in the within-project context, i.e., the data of past releases of the software were used for model training to predict defects in upcoming releases. However, there are software systems for which little or no defect data is tracked in defect tracking systems. For example, no defect data is available for the first release of a software system. Moreover, practitioners prefer to reuse existing defect data in defect prediction Turhan et al. (2009). The local data scarcity and the possibility of reusing existing data triggered the idea of cross-project defect prediction (CPDP). In CPDP, defect data of multiple source projects combined together is used to train a prediction model that is then used to predict defects in a target project. Over the past decade, research has focused on designing training data selection Bhat and Farooq (2021a, b), Li et al. (2017) and transfer learning Xu et al. (2019), Qiu et al. (2019), Ma et al. (2012) techniques for alleviating the distribution mismatch between different projects. However, the relative performance of different data sources (features) used in model training has not been evaluated in the cross-project context. Moreover, defect datasets are inherently imbalanced because only a small proportion of modules/classes contain most of the defects. The use of these imbalanced datasets biases the trained models towards correctly classifying the majority-class non-defective instances and incorrectly classifying the minority-class defective instances. Various techniques have been proposed and evaluated to address the effect of imbalanced datasets on the prediction results. However, the effect of class imbalance on the ranking of different approaches has not been evaluated.

Performance evaluation of the approaches is also handled differently across studies. Some researchers use classification (i.e., predicting whether an artifact is defective or non-defective), while others rank artifacts, with or without taking the effort factor (i.e., the effort required to inspect an artifact) into consideration.

Therefore, we provide a performance evaluation of different approaches over three different scenarios, classification, ranking, and ranking with effort consideration, in both within-project and cross-project defect prediction contexts. The evaluation is performed over five projects from the publicly available Bug Prediction Dataset and is limited to comparing different data sources for prediction performance. The effect of balanced datasets on the ranking of different data sources is also evaluated. The evaluations do not compare learning algorithms or data preprocessing methods.

The primary contributions of this paper are:

  1. An evaluation of different defect prediction approaches in the within-project context.

  2. An evaluation of different defect prediction approaches in the cross-project context.

  3. An evaluation of the effect of data balancing on the ranking of different defect prediction approaches in both within-project and cross-project contexts in the classification scenario.

The evaluations are performed according to three different scenarios of classification, ranking, and ranking with effort consideration in two distinct ways, aimed at (1) comparing performance and (2) testing the statistical significance of performance differences.

The remainder of this paper is structured as follows: First, we give a summary of the related work in Section 2. Next, we summarize the datasets and data sources used in our evaluations in Section 3. Afterward, in Section 4, we discuss the evaluation scenarios we used for the evaluation of approaches. We discuss the experimental methodology used to evaluate the approaches in the within-project and cross-project contexts in Sections 5 and 6, respectively. In Section 7, we report and discuss the within-project and cross-project defect prediction experimental results. In Section 8 we report on the statistical significance of experimental results. In Section 9, we discuss experimental methodology for evaluating the effect of class imbalance on the ranking of different approaches and discuss the results. In Section 10, we discuss the threats to the validity of our work. Finally, in Section 11 we conclude.

2 Related work

In defect prediction, we train statistical or machine learning models to predict the defect proneness of software components. The predicted defect proneness score is used to prioritize the code review and testing effort optimally Suhag et al. (2020). In a systematic literature review, Hosseini et al. (2019) report that there is considerable diversity in the features, datasets, performance evaluation, and training methods used in software defect prediction. They further report that CPDP is a challenge that needs more scrutiny before it can be used reliably in practice. In this section, we summarize various defect prediction approaches, the type of data they need, and the datasets they were validated on.

Change log approaches

use information collected from the versioning system, supposing that recently or frequently changed artifacts are the most probable source of future defects. Khoshgoftaar et al. (1996) used the number of past changes to software modules to predict which modules are defect prone. At the module level, they found that the number of lines of code added or removed in the past predicts future defects with good performance.

Nagappan and Ball studied the impact of code churn (i.e., the number of changes) on defect density in Windows Server 2003. They conclude that relative churn metrics perform better than absolute churn metrics Nagappan and Ball (2005).

Moser et al. (2008) used different metrics (code churn, past defects, refactorings, number of authors, age and size of files, etc.) to classify the files of Eclipse as defective or non-defective.

Hassan (2009) proposed the entropy of change metrics (i.e., the complexity of code changes). They compared the entropy to code churn and previous defects, and found the entropy to perform better. Their evaluation used six open-source systems: FreeBSD, KDE, KOffice, NetBSD, OpenBSD, and PostgreSQL.

Hassan and Holt (2005) validate heuristics regarding the defect proneness of the most recently changed files and the files with the most bug fixes on six open-source systems: FreeBSD, KDE, KOffice, NetBSD, OpenBSD, and PostgreSQL. They conclude that the most recently changed and most frequently fixed files are the most defect prone.

Kim et al. (2007), similar to Hassan and Holt (2005), use recent changes and defects as features, with the additional assumption that defects occur in bursts.

Rahman and Devanbu (2013) built prediction models across 85 releases of 12 large open-source projects to check the performance, portability, stability, and stasis of process and product metrics. Their findings indicate that process metrics are generally more useful than the widely used product metrics for prediction.

Single version approaches

with a diverse set of metrics analyze the present state of the source code rigorously, presuming that the current design and behavior of the software determine the presence or absence of future defects more than its history. One standard set of metrics is the set of source code metrics proposed by Chidamber and Kemerer (1994). Ohlsson and Alberg (1996) used graph-based metrics, including cyclomatic complexity, to predict defects in telecom switches. Nagappan et al. (2006) used a set of source code metrics to predict release-level defects in five Microsoft systems. Although the predictors were able to predict well for individual projects, they failed to do so across all projects.

D’Ambros et al. (2010, 2012) evaluate the performance of various source code metrics (CK, OO) along with change log approaches (process metrics, previous defects, entropy of changes, churn of source code, entropy of source code).

Cross-project defect prediction

Traditionally, the historical data of prior releases of software is used to train defect prediction models for predicting defects in the future releases of the software — an idea called within-project defect prediction. However, there are software systems for which insufficient defect data is recorded in defect tracking systems. For example, no defect data record exists for the first release of a software system Çatal (2016). Moreover, the reuse of old data in defect prediction is preferred over the collection of new data for each project Turhan et al. (2009). The insufficiency of local defect data and the potential for reuse of defect data of other projects gave birth to the idea of cross-project defect prediction (CPDP). In CPDP, for software systems with insufficient training data, models trained on defect data of other software are adapted for the defect prediction task.

Zimmermann et al. (2009) analyzed the impact of the domain, process, and code structure on CPDP. Only 3.4% of 622 cross-project defect predictions over 12 real-world applications satisfied their performance benchmark, and the CPDP models trained on one set of software did not generalize to other software. Turhan et al. (2009) report an increased defect detection rate with an associated increase in the false-positive rate for defect predictors trained on all multi-source cross-project data. They introduced an analogy-based training data selection method (Burak filter) to select relevant training data and reduce the false-positive rate. Bhat and Farooq (2021) propose a filter approach (BurakMHD filter) for selecting relevant training data in CPDP and conclude that the BurakMHD filter improves CPDP performance compared to the Burak filter Turhan et al. (2009) and the Peter filter Peters (2013). Turhan (2012) asserted that the dataset shift problem between software defect datasets is the main reason for the substandard performance of CPDP.

Hosseini et al. (2018) infer that search-based methods integrated with feature selection are a promising way to select training data in CPDP. Yu et al. (2019), through an empirical study on the NASA and PROMISE datasets, reveal that feature subset as well as feature ranking approaches improve CPDP performance. They therefore recommend selecting a representative subset of features to improve CPDP performance. Xu et al. (2019) present a Balanced Distribution Adaptation Based Transfer Learning technique for CPDP and report that their approach performs better than 12 baseline approaches. Sun et al. (2021) introduced the Collaborative filtering-based source projects selection (CFPS) technique for source project selection and validated the feasibility, importance, and effectiveness of source project selection for CPDP. Agrawal and Malhotra (2019) report that CPDP is not always feasible and suggest the selection of relevant training data for the defect prediction task.

Class imbalance learning

Defect datasets are inherently imbalanced because only a small proportion of all the modules/classes contain most of the defects, while most of the classes/modules are without any defects Wang and Yao (2013). The imbalanced nature of the datasets biases classification models towards correctly classifying the non-defective class instances and incorrectly classifying the defective class instances Haixiang et al. (2017). To improve the performance of models trained on imbalanced datasets, both data-level approaches García et al. (2012), Chawla et al. (2002), Barua et al. (2014), Al Majzoub et al. (2020), Han et al. (2005), Menardi and Torelli (2012), Bashir et al. (2020), Bennin et al. (2022), Feng et al. (2021), Malhotra and Jain (2022), Bennin et al. (2017) that alter the distribution of the datasets and algorithm-level approaches Zhou and Liu (2006), Tomar and Agarwal (2015, 2016), Ryu et al. (2017) that modify the learning procedure according to a cost function have been proposed Dar and Farooq (2022). Bennin et al. (2017) studied the effect of resampling approaches on classification performance. They observed that recall and G-mean improve at the expense of the probability of false alarm (PF), with no significant effect on AUC, indicating that resampling approaches improve defect classification but not defect ranking or prioritization. Limsettho et al. (2018) introduced the CDE-SMOTE technique to cope with the class imbalance and the distribution difference between source and target projects. Goel et al. (2021) evaluate data sampling techniques used to cope with class imbalance in CPDP and conclude that the synthetic minority oversampling technique (SMOTE) is suitable to handle class imbalance. Han et al. (2005), on the assumption that borderline instances are the most likely to be misclassified, proposed the BLSMOTE technique, which synthetically creates minority instances from the borderline minority instances.

Effort-aware defect prediction

Traditional methods largely ignore the effort required to inspect an artifact during defect prediction; they assume that the effort is uniformly distributed across modules. Arisholm et al. (2010) propose that the effort is approximately proportional to the size of the software modules. Hence, to discover an equal number of defects, less effort is required in shorter files. The idea of effort-aware defect prediction is to output a ranking of files in which a smaller amount of effort discovers a greater number of defects.

Mende and Koschke (2009) showed that when evaluated with traditional evaluation metrics like the ROC curve, the simplest defect prediction model — large files are the most defect prone — performs well. However, under effort-aware evaluation, the simplest model performs the worst. Similarly, when Kamei et al. (2010) introduced effort-aware performance metrics to revisit the common findings in defect prediction, they confirmed the fact that process metrics still outperform the product metrics.

We observe that the evaluation of the relative performance of different data sources for defect prediction is still an area where research is needed. While most of the earlier studies used different data sources, the data sources were compared to only a few other data sources, were evaluated using distinct performance indicators without any effort-aware consideration, or were compared only in the within-project context. The effect of data imbalance on the relative performance of different data sources was also not considered. Moreover, it is difficult to replicate those studies since they used data of commercial software that is not publicly accessible. Therefore, it is imperative to evaluate the data sources in both within-project and cross-project contexts using different evaluation scenarios (classification, ranking, and ranking with effort consideration). It is also important to evaluate the effect of data imbalance on the ranking of different data sources.

3 Evaluated data sources

It is impractical to compare the plethora of existing approaches to defect prediction. To have a range as inclusive as possible for evaluation, we select the approaches summarized in Table 1. We select five project datasets, summarized in Table 2, from the Bug Prediction Dataset D’Ambros et al. (2010), publicly accessible at https://bug.inf.usi.ch/index.php. Only the Bug Prediction Dataset is used in the experiments because no other dataset provides information about all the approaches summarized in Table 1. Source code metrics (CK metrics Basili et al. (1996); Capretz and Xu (2008) and an additional set of OO metrics D’Ambros et al. (2010)), previous defect data Zimmermann et al. (2007); Kim et al. (2007), change metrics Moser et al. (2008), the entropy of changes Hassan (2009), the churn of source code, and the entropy of source code D’Ambros et al. (2010) of the five projects from the Bug Prediction Dataset are used in the evaluations. To facilitate replication of our experiments, we select a publicly accessible defect prediction dataset. Moreover, the five projects have the same Java code structure, and the datasets have been recorded with the same defect tracking methods.

Table 1 Evaluated Defect Prediction Approaches
Table 2 Five Bug Prediction Datasets used in evaluations, sorted in order of number of Lines of Code

4 Performance evaluation

The evaluation of defect prediction is still a matter of debate Zhang and Zhang (2007), Menzies et al. (2007), Lessmann et al. (2008), Jiang et al. (2008). In line with a particular usage scenario of defect prediction, various strategies are used to evaluate the performance of defect prediction approaches. We use the performance metrics recommended by Jiang et al. (2008) to evaluate the approaches in the scenarios of classification (defective or non-defective), ranking (most defective to least defective), and effort-aware ranking (most defect dense to least defect dense).

Classification

One common setting in which defect prediction is applied is classification, where we focus on grouping the classes into defective and non-defective groups. The underlying classification models are often probabilistic classifiers that assign probabilities instead of class labels to each observation Fawcett (2006). They are converted to binary classifiers by a user-defined threshold, and the commonly used evaluation measures of precision and recall assess the performance only at that particular threshold. As a surrogate metric, we use the Receiver Operating Characteristic (ROC) curve. The ROC curve measures the performance of the prediction model in general, as we sweep over the range of different thresholds. The ROC curve, plotted between sensitivity and specificity, illustrates the benefits of using the model (true positives) versus the costs of using the model (false positives) at different thresholds. The area under the ROC curve (AUC), a scalar value with \(0.5 \le AUC \le 1.0\), is the estimated probability that a randomly picked positive instance will be assigned a higher predicted probability by the prediction model than a randomly picked negative instance Hanley and McNeil (1982). Hence, the statistic measures a model’s capability to correctly rank instances. A perfect model has an AUC of 1, while the worst model has an AUC of 0.5. To perform a comprehensive comparison across approaches over different datasets, we report the AUC. AUC is appropriate for comparing classifiers over different datasets Lessmann et al. (2008) and is often used for that purpose.
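Because AUC is threshold independent, it is computed from the predicted probabilities rather than from hard class labels. The following minimal sketch illustrates the idea with scikit-learn on synthetic data; the feature matrix, labels, and logistic model are placeholders and not the pipeline used in this study.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 6))                            # placeholder metric values
y = (X[:, 0] + rng.normal(size=500) > 1.2).astype(int)   # imbalanced defect labels

X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)
model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

# AUC sweeps over all thresholds, so it uses the predicted probabilities directly.
scores = model.predict_proba(X_te)[:, 1]
print("AUC:", roc_auc_score(y_te, scores))
```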

Ranking

A more realistic and useful scenario of defect prediction is one where the modules are ranked by their defect proneness score, and the ranking is used to prioritize the quality assurance effort (testing and code inspection). The overall concept is known as the Module-Order Model (MOM) Khoshgoftaar and Allen (2003). Consequently, we need to evaluate how good a model is at ranking instances correctly. One way to graphically evaluate Module-Order Models is lift charts, also known as Alberg diagrams Ohlsson and Alberg (1996). They are generated by ordering the modules according to the defect score assigned by a prediction model and illustrating, for each ratio of modules on the x-axis, which ratio of defects has been predicted on the y-axis. For instance, Ostrand et al. found 83% of the defects in 20% of the files Ostrand et al. (2005). Mende and Koschke have defined a comprehensive performance evaluation measure \(p_{opt}\) from lift charts by comparing a prediction model with an optimal model Mende et al. (2009). An optimal model is generated by ordering all the modules by decreasing actual defect count. \(\Delta p_{opt}\) is the difference between the area under the lift curve of the optimal model and the area under the lift curve of the prediction model. A higher value of \(\Delta p_{opt}\) indicates a greater difference between the optimal model and the prediction model, and hence a worse prediction model. They define \(p_{opt} = 1 - \Delta p_{opt}\), where a high value means a better model.
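As a hedged sketch of this measure (the exact normalization used in the cited papers may differ slightly), \(p_{opt}\) can be computed directly from the two lift-curve areas; the module scores and defect counts below are toy values.

```python
import numpy as np

def lift_area(order, defects):
    """Area under the cumulative lift curve for a given module ordering."""
    d = defects[order]
    x = np.arange(1, len(d) + 1) / len(d)   # fraction of modules inspected
    y = np.cumsum(d) / d.sum()              # fraction of defects found so far
    return np.trapz(y, x)

def p_opt(pred_scores, defects):
    defects = np.asarray(defects, dtype=float)
    model_order = np.argsort(-np.asarray(pred_scores))  # predicted: most defect prone first
    optimal_order = np.argsort(-defects)                # optimal: actual defect counts, descending
    delta = lift_area(optimal_order, defects) - lift_area(model_order, defects)
    return 1.0 - delta

# Toy example: five modules with predicted scores and actual defect counts.
print(p_opt(pred_scores=[0.9, 0.2, 0.7, 0.1, 0.4], defects=[3, 0, 2, 0, 1]))
```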

Effort-aware ranking

Arisholm et al. (2007) argue that the costs of testing or reviewing software modules are not uniformly distributed but relate, to some extent, to the size of the modules. They introduce a variant of lift charts where the x-axis contains the ratio of lines of code instead of the ratio of modules. To make an effort-aware evaluation, we define the relative defect risk of a software module as \(RDR(x) = errors(x)/effort(x)\). Similar to Mende and Koschke (2010), Menzies et al. (2010), and D’Ambros et al. (2012), we use the LOC metric as a notion of effort and order the software modules by decreasing defect density, i.e., RDR. The idea is that larger modules take much more time to review than smaller modules; therefore, if the number of predicted defects is the same, smaller modules are prioritized before larger ones. Accordingly, we use LOC-based cumulative lift charts to evaluate the performance of the prediction approaches. LOC-based cumulative lift charts are generated by ordering the modules according to their defect density and illustrating, for each ratio of lines of code on the x-axis, which ratio of defects has been predicted on the y-axis. Similar to the ranking without effort consideration scenario, we again use \(p_{opt}\) to evaluate the performance of the approaches, but in the present case, to distinguish it from the previous metric, we refer to it as \(p_{effort}\).

The two desirable properties of this performance measure are that it takes the costs associated with reviewing or testing the code into consideration, and that it considers the actual distribution of defects by comparing against the theoretically possible optimal model.
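A sketch of \(p_{effort}\) under the assumptions above (LOC as the effort proxy, modules ordered by predicted defect density); as with \(p_{opt}\), the exact normalization in the cited papers may differ slightly, and the inputs are toy values.

```python
import numpy as np

def p_effort(pred_defects, loc, actual_defects):
    pred = np.asarray(pred_defects, dtype=float)
    loc = np.asarray(loc, dtype=float)
    actual = np.asarray(actual_defects, dtype=float)

    def area(order):
        x = np.cumsum(loc[order]) / loc.sum()        # fraction of LOC inspected
        y = np.cumsum(actual[order]) / actual.sum()  # fraction of defects found
        return np.trapz(y, x)

    model_order = np.argsort(-(pred / loc))      # predicted RDR, descending
    optimal_order = np.argsort(-(actual / loc))  # actual defect density, descending
    return 1.0 - (area(optimal_order) - area(model_order))

# Toy example: three modules with predicted defects, size in LOC, and actual defects.
print(p_effort(pred_defects=[3, 1, 2], loc=[1200, 150, 400], actual_defects=[2, 1, 3]))
```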

5 Experiment 1: within-project defect prediction context

The 25 defect prediction approaches listed in the Predictor column of Table 1 are used for each software system to train and validate the models. For the within-project comparison of approaches, the methodology discussed below and given in Algorithm 1 is followed during model training and validation for all 25 defect prediction approaches.

Algorithm 1

5.1 Min-Max normalization

Predictors measured at different scales contribute differently to the model fitting process and might bias it. To handle this potential bias, feature normalization such as Min-Max scaling is applied before fitting a model Henderi et al. (2021), Patro and Sahu (2015), Jain and Bhandare (2011).
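A minimal sketch of this step with scikit-learn; the metric values are placeholders, and the scaler is fit on training data only so that no information from the test fold leaks into the transformation.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(0)
X_train = rng.normal(loc=50, scale=10, size=(100, 4))  # placeholder metric values
X_test = rng.normal(loc=50, scale=10, size=(25, 4))

scaler = MinMaxScaler()                           # maps each feature to [0, 1]
X_train_scaled = scaler.fit_transform(X_train)    # learn min/max on training data only
X_test_scaled = scaler.transform(X_test)          # reuse the training min/max on test data
```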

5.2 Feature selection

A multitude of features might result in model overfitting, which drastically degrades the model's predictive performance when applied to new data. Multicollinearity among features is also problematic because it makes it difficult to interpret the effect of individual features. A feature selection technique called sequential floating forward selection (SFFS) Pudil et al. (1994) is used to select the subset of features that gives the best predictive performance. SFFS selects a subset of features by sequentially adding and deleting features as long as the performance increases.

Starting with an empty subset of features, the wrapper evaluates each unselected feature together with the already selected features using a stratified tenfold cross-validation experiment and sequentially adds the feature that improves performance the most. After adding a feature to the subset, the wrapper checks whether eliminating the worst feature would improve performance and, if so, removes it from the subset. The addition and elimination steps are repeated as long as adding more features improves performance and stop when no further improvement is observed.
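One way to realize this wrapper is the SequentialFeatureSelector from the mlxtend library (an assumption; the paper does not name an implementation), configured for forward floating selection with ten-fold cross-validation and an AUC objective. The data below are synthetic placeholders.

```python
import numpy as np
from mlxtend.feature_selection import SequentialFeatureSelector as SFS
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 10))                                   # placeholder metric matrix
y = (X[:, 0] + X[:, 3] + rng.normal(size=300) > 1).astype(int)   # placeholder defect labels

sffs = SFS(LogisticRegression(max_iter=1000),
           k_features='best',   # keep the best-performing subset found
           forward=True,        # sequential forward selection ...
           floating=True,       # ... with conditional backward elimination (SFFS)
           scoring='roc_auc',
           cv=10)               # stratified ten-fold CV for a classification target
sffs = sffs.fit(X, y)
print("Selected feature indices:", sffs.k_feature_idx_)
```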

5.3 Training regression models

We train generalized linear regression models from the features we compare in our experiments. The explanatory variables, i.e., the features used for prediction, are the metrics from each category we compare in our study, while the predicted variable is the number of post-release defects. Linear regression, being the simplest model, is reported to closely approximate the defect prediction function D’Ambros et al. (2012). Generalized linear regression models are trained throughout our experiments, whether comparing the approaches for classification, ranking, or ranking with effort consideration. They are even trained within the feature selection procedure to evaluate the performance of different feature subsets. Generalized Linear Models (GLMs) are used for predicting categorical outcomes in classification as well as count outcomes in regression Zhao (2012).
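A hedged sketch of the two GLM variants described above, assuming the statsmodels implementation (the paper does not name a library): a binomial GLM for the categorical outcome and a Poisson GLM for the post-release defect count. The data are synthetic placeholders.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
X = sm.add_constant(rng.normal(size=(200, 5)))       # placeholder metrics plus intercept
defect_counts = rng.poisson(lam=np.exp(X[:, 1]))     # synthetic post-release defect counts
is_defective = (defect_counts > 0).astype(int)

clf = sm.GLM(is_defective, X, family=sm.families.Binomial()).fit()   # classification scenario
reg = sm.GLM(defect_counts, X, family=sm.families.Poisson()).fit()   # count / ranking scenario

print(clf.predict(X)[:3])   # predicted probabilities of being defective
print(reg.predict(X)[:3])   # predicted defect counts
```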

5.4 Five-fold cross-validation

We perform stratified five-fold repeated cross-validation experiments, i.e., we split the dataset into five folds, using four folds (80% of the dataset) as a training set to train the prediction model and the remaining fold (20% of the dataset) as a testing set to evaluate the model performance. Each of the five folds is used once as a testing set. The five-fold experiment is repeated 10 times to get a robust evaluation of the prediction model. Stratified cross-validation is used so that the distribution of the predicted feature in each fold is consistent with the distribution of the entire dataset.
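A minimal sketch of this protocol with scikit-learn's RepeatedStratifiedKFold; with 5 folds and 10 repetitions it yields the 50 per-dataset results reused in Section 8. The data and model are placeholders.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import RepeatedStratifiedKFold

rng = np.random.default_rng(0)
X = rng.normal(size=(400, 6))                            # placeholder metric values
y = (X[:, 0] + rng.normal(size=400) > 1).astype(int)     # placeholder defect labels

cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=10, random_state=0)
aucs = []
for train_idx, test_idx in cv.split(X, y):
    model = LogisticRegression(max_iter=1000).fit(X[train_idx], y[train_idx])
    aucs.append(roc_auc_score(y[test_idx], model.predict_proba(X[test_idx])[:, 1]))

print(len(aucs), "fold results, mean AUC:", round(float(np.mean(aucs)), 3))
```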

6 Experiment 2: cross-project defect prediction context

In the CPDP experiment, for each dataset reserved as a target project dataset, the remaining datasets combined constitute what we call the multi-source training data set (TDS). For comparing the 25 defect prediction approaches listed in the Predictor column of Table 1 in the cross-project prediction context, a methodology similar to that of Experiment 1 (see Section 5) is followed. However, for each target project dataset, the feature selection and final model training are done over the multi-source training data set. Moreover, the same data folds from each target project dataset that are used for model validation in the final step of Experiment 1 are used as testing sets in the performance evaluation of the cross-project prediction models.
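The setup amounts to a leave-one-project-out loop, sketched below on placeholder in-memory datasets standing in for the five Bug Prediction Dataset projects; the column names and the training/evaluation steps are placeholders as well.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Placeholder per-project datasets; columns stand in for metrics and the defect label.
projects = {name: pd.DataFrame(rng.normal(size=(50, 4)),
                               columns=["wmc", "cbo", "loc", "bugs"])
            for name in ["jdt", "pde", "equinox", "lucene", "mylyn"]}

for target_name, target_df in projects.items():
    # Pool all other projects into the multi-source training data set (TDS).
    tds = pd.concat([df for name, df in projects.items() if name != target_name],
                    ignore_index=True)
    # Feature selection and model training would run on `tds`; evaluation would use
    # the held-out folds of `target_df` (placeholder comment, see Section 5).
```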

7 Results and discussion of defect prediction approaches

In the first evaluation, we rank the approaches across several project datasets, adhering to a stringent statistical methodology. Towards this goal, we follow the approach of D’Ambros et al. (2012), Jiang et al. (2008), and Lessmann et al. (2008).

For every prediction approach, we determine the ranking on each project in terms of the performance metrics (AUC, \(p_{opt}\), \(p_{effort}\)). Moreover, we find the average rank (AR) of each prediction approach over all project datasets. Next, to ascertain whether the performance differences with regard to AR are statistically significant (\(H_{0}\): all prediction approaches perform equally), we apply the Friedman test with Nemenyi’s post hoc test Calvo and Santafé (2015), Demšar (2006) on the rankings.
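A hedged sketch of this procedure, assuming scipy for the Friedman test and the scikit-posthocs package (not named in the paper) for Nemenyi's post hoc test; the performance matrix below is a random placeholder with datasets as rows and approaches as columns.

```python
import numpy as np
import scipy.stats as ss
import scikit_posthocs as sp

rng = np.random.default_rng(0)
auc = rng.uniform(0.6, 0.9, size=(5, 4))   # placeholder: 5 datasets x 4 approaches

stat, p = ss.friedmanchisquare(*auc.T)     # H0: all approaches perform equally
print("Friedman p-value:", p)
if p < 0.05:
    # Pairwise Nemenyi post hoc test on the per-dataset rankings.
    print(sp.posthoc_nemenyi_friedman(auc))
```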

In Tables 3, 4, and 5 (within-project performance) and Tables 6, 7, and 8 (cross-project performance), we report the mean of the performance metrics (AUC, \(p_{opt}\), and \(p_{effort}\)) from the cross-validation experiments on each project dataset and the AR across all project datasets. These tables are structured similarly. Each prediction approach is shown on a separate row, where the mean performance across five folds from ten runs on the subject datasets is given in the first five cells. The best-performing approaches are highlighted by showing the mean performance values above 95% of the best value in bold font. The AR of the approaches over the subject datasets is shown in the last cell of each row. The approaches with overall good performance (AR values \(\le\) 10, or the top 40% of the ranking) are shown with a shaded background.

Table 3 Within-project AUC values of all predictors over all systems
Table 4 Within-project \(p_{opt}\) values of all predictors over all systems
Table 5 Within-project \(p_{effort}\) values of all predictors over all systems
Table 6 Cross-project AUC values of all predictors over all systems
Table 7 Cross-project \(p_{opt}\) values of all predictors over all systems
Table 8 Cross-project \(p_{effort}\) values of all predictors over all systems

7.1 Results of within-project defect prediction

AUC in within-project classification

The top-ranked approaches are MOSER (2.4), CK-OO (7), LDCHU (7.8), OO and LGDHH (tied at 8.2), and LDHH (8.4). From there, the AR degrades slowly until the approaches ranked around 18, after which it drops to the worst rank of 24.4 for LOC and NFIX-ONLY. In general, the process metrics, regular source code metrics, churn of source code, and entropy of source code metrics perform remarkably well. Defect metrics come next. The entropy of change metrics (AR 14.4 to 16.8) performs very poorly. A probable reason for the bad performance of the entropy of change metrics is that each approach from this set corresponds to a single metric, which on its own may not have sufficient explanatory power to distinguish files as clean or defective. Likewise, approximations of process or source code metrics that involve only one or a few metrics perform quite badly.

\(\bf p_{opt}\) in within-project ranking The top performers are MOSER (3), LDHH (4.2), LGDHH (5.8), LDCHU (6), and CK-OO (7.8), followed by most of the entropy and churn of source code metrics. The defect metrics have comparatively bad ranks (15.9 and 16.5). As pointed out before, approaches that use only a few metrics (entropy of change metrics, approximations of process or source code metrics) performed quite badly (14.6 or worse). The worst performers are NFIX-ONLY (24.4) and LOC (22).

\(\bf p_{effort}\) in within-project ranking The top performers are EDHH (5.4), CHU (6.4), LGDHH (7), LDCHU (7.2), HH and HWH (tied at 7.4), LDHH (8.2), and CK-OO (8.8). Almost all of the churn and entropy of source code metrics give a good predictive performance. All the ranks below 10 are taken by one of these, except for the rank of 8.8 occupied by CK-OO (placing it 8th in the ranking). The worst performers are LOC (25), WHCM (22.4), NR (20.2), BUG-FIX (18.9), HCM (18.6), NFIX-ONLY (18.4), and NFIX+NR (18). The general trend of approaches utilizing fewer metrics performing worse is observed. The process metrics perform worse than in the effort-unaware scenario, ranked 10.8 for \(p_{effort}\) versus 3 for \(p_{opt}\). These results contradict those of Mende and Koschke (2009), Kamei (2010), and D’Ambros et al. (2012).

7.2 Discussion of within-project defect prediction rankings

The null hypothesis of all approaches performing equally is rejected by the Friedman test when AUC, \(p_{opt}\), and \(p_{effort}\) are considered. However, Nemenyi’s post hoc test could only separate the very best performer MOSER from the worst performers NFIX-ONLY and LOC for AUC. In Fig. 1, we present the findings of Nemenyi’s test using Demsar’s significance diagrams Demšar (2006). All classifiers that are not connected by a horizontal line can be seen as performing significantly differently. Similarly, for \(p_{opt}\), Nemenyi’s test could separate the best performers MOSER and LDHH from the worst performers LOC and NFIX-ONLY. Considering \(p_{effort}\), EDHH, CHU, LGDHH, LDCHU, HH, and HWH are statistically better than LOC. The statistical tests are performed at a significance level of 0.05.

LOC is the worst approach under all evaluation criteria, and NFIX-ONLY is the worst approach for classification and for ranking without effort consideration.

Fig. 1 Demsar’s significance diagrams for AUC in within-project context

A different ordering of the approaches is generated by each task, which indicates that each problem has different characteristics.

In general, the best performers for classification are process metrics and product metrics, followed by the churn of source code and entropy of source code metrics. The process metrics also perform best for ranking without effort consideration, followed by the entropy of source code, the churn of source code metrics, and product metrics. In ranking with effort consideration, the entropy and churn of source code perform best, followed by product and process metrics. The defect metrics and other single metric approaches (entropy of changes, number of revisions, lines of code) performed worst in all three scenarios.

7.3 Results of cross-project defect prediction

AUC in cross-project classification

The top-ranked approaches are LDCHU (7), MOSER (8), WHCM and LGDHCM (tied at 9.4), and LDHH (10). Then, the AR degrades slowly until the approaches ranked around 16, after which it abruptly drops to the worst ranks of 24 for NFIX-ONLY and 23.4 for LOC. In general, the churn of source code metrics, process metrics, the entropy of changes, and the entropy of code metrics perform quite well. Regular source code metrics perform next. Defect metrics perform very poorly. Apart from the entropy of changes, the general trend of approaches featuring few metrics performing poorly is observed. Compared to within-project predictions, only LDCHU, MOSER, and LDHH feature among the top performers in both contexts, while the worst-ranked approaches are the same in the within-project and cross-project contexts. That means the process metrics, the churn of code metrics, and the entropy of code metrics are good approaches to consider for defect prediction in the classification context, whereas NFIX-ONLY and LOC are generally very poor approaches.

\(\bf p_{opt}\) in cross-project ranking The top performers are MOSER (5.2), LDCHU (7.2), EDCHU and LDHH (tied at 9.4), WHCM (9.8), and EDHH (10). The worst performers are NFIX-ONLY (23.8), CHU (19.2), and LOC (18.6). The general trend of approaches with few metrics performing poorly continues, except for WHCM. Compared to within-project predictions, only MOSER, LDCHU, EDCHU, LDHH, and EDHH feature among the top performers in both contexts, and the worst performers from within-project predictions (NFIX-ONLY and LOC) are among the worst three approaches in cross-project predictions. Therefore, we conclude that process metrics, the churn of code metrics, and the entropy of code metrics are generally good approaches to consider for defect prediction in the ranking context, and NFIX-ONLY and LOC are the worst approaches.

\(\bf p_{effort}\) in cross-project ranking The top ranks are achieved by LDHH and LGDHH (tied at 8), BUG-CAT (8.4), HH (8.6), EDHH (8.8), and CK (9.6). Except for HWH, all the entropy of source code metrics achieve an average rank below 10. Then, the performance degrades slowly until the approaches ranked around 18.4, after which the AR suddenly drops to the worst AR of 25 for LOC. Except for BUG-CAT, the general trend that approaches with only a few metrics perform poorly continues. The entropy of source code metric LDHH achieves an AR below 10 in both the within-project and cross-project contexts under all three performance indicators of AUC, \(p_{opt}\), and \(p_{effort}\). The LOC metric always features among the worst two approaches. Therefore, based on the experimental results, we suggest always including the LDHH metrics and excluding the LOC metric in model training for the defect prediction task.

7.4 Discussion of cross-project prediction rankings

In the case of cross-project defect prediction, no distinction of average ranks could be made at a significance level of 0.05. However, at a significance level of 0.1, the null hypothesis of all approaches performing equally is rejected by the Friedman test when AUC, \(p_{opt}\), and \(p_{effort}\) are considered. Nemenyi’s post hoc test only separates the very best performer LDCHU from the worst performers LOC and NFIX-ONLY for AUC. In Fig. 2, we present the findings of Nemenyi’s test using Demsar’s significance diagrams Demšar (2006). Similarly, for \(p_{opt}\), Nemenyi’s test separates the best performers MOSER and LDCHU from the worst performer NFIX-ONLY. Considering \(p_{effort}\), LDHH, LGDHH, BUG-CAT, and HH are statistically better than LOC.

LOC is the worst approach for classification and ranking with effort consideration, and NFIX-ONLY is the worst for classification and ranking without effort consideration.

Fig. 2 Demsar’s significance diagrams for AUC in cross-project context

Each task generates a different ordering of the approaches, and the ordering also differs from that of within-project predictions. This means that each problem has different characteristics and that conclusions from within-project studies should not be generalized to cross-project predictions.

In general, the best performers for classification are the churn of source code, followed by process metrics, the entropy of changes, and the entropy of source code metrics. The defect metrics and other single metric approaches (except for the entropy of changes) performed worst for classification. For ranking without effort consideration, the process metrics perform best, followed by the churn of source code, entropy of source code, entropy of changes, and product metrics. The defect metrics and other single metric approaches (except for the entropy of changes) performed worst for ranking without effort consideration as well. In ranking with effort consideration, the entropy of source code and defect metrics perform best, followed by product and process metrics. The churn of source code, the entropy of change metrics, and other single metric approaches (number of revisions, lines of code) perform worst.

8 Statistical significance for defect prediction approaches

Motivated by D’Ambros et al. (2012) and Menzies et al. (2010), our second set of evaluations aims to check the variability of the approaches across several runs and to determine a statistically significant ranking of the approaches. To get an estimate of the likely variability of performance, for each defect prediction approach, we save the 50 results of cross-validation (last line in Algorithm 1) for each performance measure. For each performance measure, we combine the 50 data points for each dataset, generating a set of 250 data points that represent all five datasets. Then, the median, first quartile, and third quartile of each set of 250 data points for the defect prediction approaches are computed. Next, the approaches are sorted by their medians and displayed visually through “mini boxplots”: the first-third quartile range is represented by a bar and the median by a circle. Figure 3 shows the mini boxplots for all the approaches over all five datasets for classification.

In the within-project context, for 75% of the total 250 predictions over five datasets, the AUC, \(p_{opt}\), and \(p_{effort}\) values of the top-ranking approaches resulting from the evaluations in Section 7.1 are greater than 0.7. That means the top-ranking approaches are able to strongly predict the defects in the within-project context.

In the cross-project context, for 75% of the total 250 predictions over five datasets, the AUC, \(p_{opt}\), and \(p_{effort}\) values of the top-ranking approaches resulting from the evaluations in Section 7.3 are greater than 0.65. The AUC and \(p_{effort}\) values are greater than 0.7 for 50% of the total predictions, and the \(p_{opt}\) value is greater than 0.7 for 75% of the predictions. That means the top-ranking approaches are at least able to feasibly predict the defects in the cross-project context.

Fig. 3 Mini boxplots for classification over all systems in within- and cross-project context

As shown in Fig. 3, many approaches perform similarly in our experiments, so it is difficult to get a meaningful ranking. Therefore, we apply the technique used by D’Ambros et al. (2012) and Menzies et al. (2010) to compare the categories of approaches instead of individual approaches. To simplify the comparison, we select the best representative from each category (the one with the highest median). Menzies et al. (2010) and D’Ambros et al. (2012) generate a ranking by performing the Mann-Whitney U test on each successive pair of approaches. If the U test fails to reject the null hypothesis that the performance distributions are equal at the 95% confidence level, then the two approaches have the same rank. To get a different rank from other approaches, the test must reject the null hypothesis when applied against all other approaches of the equivalent rank. Starting with the top two approaches, this procedure is repeated sequentially.
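A simplified sketch of this ranking procedure, assuming scipy's Mann-Whitney U test; the performance distributions are synthetic placeholders already sorted by decreasing median, and the rank numbers here are consecutive rather than positional.

```python
import numpy as np
from scipy.stats import mannwhitneyu

rng = np.random.default_rng(0)
# Placeholder performance distributions (e.g. 250 AUC values per category representative),
# listed in order of decreasing median.
perf = {"MOSER": rng.normal(0.79, 0.03, 250),
        "CK+OO": rng.normal(0.76, 0.03, 250),
        "BUG-CAT": rng.normal(0.73, 0.03, 250)}

names = list(perf)
ranks = {names[0]: 1}
for curr in names[1:]:
    current_rank = max(ranks.values())
    peers = [n for n, r in ranks.items() if r == current_rank]
    # A new (worse) rank only if the U test rejects H0 against every approach that
    # currently holds the rank; otherwise the approach shares that rank.
    rejects_all = all(mannwhitneyu(perf[p], perf[curr],
                                   alternative="two-sided").pvalue < 0.05
                      for p in peers)
    ranks[curr] = current_rank + 1 if rejects_all else current_rank
print(ranks)
```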

8.1 Finding statistical significance for within-project defect prediction

AUC in within-project classification

Table 9 shows, for all systems, the classification results, including the AUC medians of the selected approaches and, for every combination of approaches, whether the Mann-Whitney U test rejects the null hypothesis that the two perform equally. The medians range from 0.728 to 0.793. The U test results show that the approach built on process metrics (MOSER–rank 1) outperforms source code metrics (CK+OO–rank 2), the entropy of code metrics (LGDHH–rank 2), the churn of code metrics (LDCHU–rank 2), previous defects (BUG-CAT–rank 5), and the entropy of changes (WHCM–rank 5) in classification with 95% confidence. One interesting fact is that the process metrics (MOSER) perform significantly better than all other approaches.

Table 9 Medians, results of Mann-Whitney U test and rank of approaches for all systems in within-project defect prediction context when approaches are compared for classification. From each group only the top performing approaches are compared in terms of AUC

\(\bf p_{opt}\) in within-project ranking For all systems, the results of ranking in terms of \(p_{opt}\) are shown in Table 10. The medians of all chosen approaches range from 0.836 to 0.866. Apart from the entropy of source code metrics, where LGDHH is substituted by LDHH, the top performers in each category are identical to those for classification. The order generated by sorting the approaches by decreasing medians is different; however, the process metrics (MOSER) still occupy the first position. The Mann-Whitney U test results are different from those for classification. The process metrics (MOSER–rank 1), the churn of code metrics (LDCHU–rank 1), and the entropy of code metrics (LDHH–rank 1) are better than source code metrics (CK+OO–rank 4), the entropy of changes (WHCM–rank 5), and previous defects (BUG-CAT–rank 5). The worst performers (entropy of changes and previous defects) are the same as for classification.

Table 10 Medians, results of Mann-Whitney U test and rank of approaches for all systems within-project defect prediction context when approaches are compared for ranking. From each group only the top performing approaches are compared in terms of \(p_{opt}\)

\(\bf p_{effort}\) in within-project ranking For all systems, the results of effort-aware ranking with regard to \(p_{effort}\) are shown in Table 11. The medians of all chosen approaches range from 0.758 to 0.819. The best performers per category are different from those for classification and ranking with \(p_{opt}\). The overall worst is the approach based on the entropy of changes (HCM–rank 6), clearly underperforming the other approaches in effort-aware ranking. In general, saying that one approach is better than another is difficult, as is clear from the same rank (rank 1) shared by five out of six categories of approaches. Therefore, we conclude that getting a confident ranking with \(p_{effort}\) is more difficult than for classification and ranking without effort consideration.

Table 11 Medians, results of Mann-Whitney U test and rank of approaches for all systems in within-project defect prediction context when approaches are compared for effort-aware ranking. From each group only the top performing approaches are compared in terms of \(p_{effort}\)

8.2 Discussion of statistical significance for within-project defect prediction

In general, for classification, process metrics (rank 1) followed by the churn of source code metrics (rank 2), the entropy of source code metrics (rank 2), and source code metrics (rank 2) are the best performers from a statistical point of view. In ranking with \(p_{opt}\), process metrics (rank 1), tied with the churn of source code metrics and the entropy of source code metrics, are the best performers. Effort-aware ranking with \(p_{effort}\) is a harder problem, where five out of six categories of approaches, tied in the first rank, are indistinguishable from a statistical significance point of view. The worst performer for effort-aware ranking is the entropy of changes (rank 6), which is also tied with previous defects in the worst rank (rank 5) for ranking and classification.

The overall best performers are the process metrics, the churn of source code, and the entropy of source code metrics, and the overall worst performers are the entropy of change metrics.

8.3 Finding statistical significance for cross-project defect prediction

AUC in cross-project classification

The results of the Mann-Whitney U test for classification (AUC) in the cross-project context are shown in Table 12. The medians of all chosen approaches range from 0.700 to 0.731. Sorting the approaches by decreasing medians generates an order in which the process metrics (MOSER) are placed at the first position and BUG-FIX at the last position. MOSER, LDCHU, WHCM, LDHH, and CK are indistinguishable from a statistical significance point of view because all five approaches get the same rank (rank 1). The defect metrics (BUG-FIX) get the worst rank (rank 6). In general, it is difficult to get a confident ranking with AUC in the cross-project context.

Table 12 Medians, results of Mann-Whitney U test and rank of approaches for all systems cross-project defect prediction context when approaches are compared for classification. From each group only the top performing approaches are compared in terms of AUC

\(\bf p_{opt}\) in cross-project ranking The results of the Mann-Whitney U test for ranking without effort consideration (\(p_{opt}\)) in the cross-project context are shown in Table 13. The medians of all chosen approaches range from 0.806 to 0.846. Apart from source code metrics, where CK is substituted by CK+OO, the top performers in each category are similar to those for classification. Sorting the approaches by decreasing medians generates a different order; however, the process metrics (MOSER) are again placed at the first position. MOSER, WHCM, LDCHU, BUG-FIX, and LDHH are indistinguishable from a statistical significance point of view because all five approaches get the same rank (rank 1). The product metrics (CK+OO) get the worst rank (rank 6). In general, it is difficult to get a confident ranking with \(p_{opt}\) in the cross-project context.

Table 13 Medians, results of Mann-Whitney U test and rank of approaches for all systems in cross-project defect prediction context when approaches are compared for ranking. From each group only the top performing approaches are compared in terms of \(p_{opt}\)
Table 14 Medians, results of Mann-Whitney U test and rank of approaches for all systems in cross-project defect prediction context when approaches are compared for effort-aware ranking. From each group only the top performing approaches are compared in terms of \(p_{effort}\)

\(\bf p_{effort}\) in cross-project ranking The results of the Mann-Whitney U test for effort-aware ranking (\(p_{effort}\)) in the cross-project context are shown in Table 14. The medians of all chosen approaches range from 0.744 to 0.780. The best performers are different from those for classification and ranking with \(p_{opt}\). All six approaches get a rank of 1, which means that from a statistical significance point of view no distinction can be made between the approaches. Therefore, we conclude that it is difficult to get a confident ranking with \(p_{effort}\) in the cross-project context.

8.4 Discussion of statistical significance for cross-project defect prediction

For classification in the cross-project context, the difference between process metrics, churn of source code metrics, entropy of change metrics, entropy of source code metrics, and regular source code metrics is not significant. These approaches perform significantly better than the previous defect metrics. Similarly, for ranking without effort consideration, only regular source code metrics perform significantly differently from the other five top approaches (process metrics, the entropy of change metrics, churn of source code, previous defects, and the entropy of source code metrics). For ranking with effort consideration, no significant difference is found between the approaches, and the experiments are unable to produce a ranking of significance.

The overall best performers are the process metrics, the churn of source code metrics, the entropy of code metrics, and the entropy of change metrics. For classification, previous defects perform worst, and for ranking without effort consideration, source code metrics perform worst. For ranking with effort consideration, no significant distinction could be made between the performances of the approaches.

9 Experiment 3: Effect of class imbalance on the ranking of different approaches

9.1 Methodology

To check the effect of data imbalance on the classification performance, we use a data balancing technique, Borderline Synthetic Minority Oversampling (BLSMOTE), to create synthetic minority defective class instances in the training dataset. Presuming that borderline instances are the most likely to be misclassified, Han et al. (2005) proposed BLSMOTE to create synthetic minority class instances from the borderline minority class instances. BLSMOTE is reported to have stable and good performance compared to other data balancing techniques in software defect prediction Bennin et al. (2019). For these reasons, we use BLSMOTE for data balancing.

For evaluating the effect of class imbalance on the ranking of the 25 defect prediction approaches listed in the Predictor column of Table 1, a methodology similar to Experiment 1 (see Section 5) for WPDP and Experiment 2 (see Section 6) for CPDP is followed. However, for each target project dataset, the final model training is done over the balanced training data set generated through BLSMOTE. Moreover, the same data folds from each target project dataset that are used for model validation in the final step of Experiment 1 and Experiment 2 are used as testing sets in the performance evaluation of the models trained on the BLSMOTE-balanced datasets. The average ranking (AR) based on AUC performance over the five datasets is used to evaluate the ranking of the selected approaches.
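A sketch of the balancing step, assuming the imbalanced-learn implementation of Borderline-SMOTE (the paper does not name a library). Oversampling is applied to the training folds only; the test folds keep their original class distribution. The data below are synthetic placeholders.

```python
import numpy as np
from collections import Counter
from imblearn.over_sampling import BorderlineSMOTE

rng = np.random.default_rng(0)
X_train = rng.normal(size=(300, 5))                                             # placeholder metrics
y_train = (X_train[:, 0] + rng.normal(scale=0.5, size=300) > 1.2).astype(int)   # minority = defective

# Synthesize minority instances from borderline minority samples in the training data only.
X_bal, y_bal = BorderlineSMOTE(random_state=0).fit_resample(X_train, y_train)
print("before:", Counter(y_train), "after:", Counter(y_bal))
```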

9.2 Results and discussion

From the experimental results reported in Table 15, for both the within-project and cross-project contexts, we do not see much difference in the Average Ranking of the approaches between models trained on the original imbalanced datasets and models trained on the datasets balanced through the BLSMOTE technique. In general, the best performers for classification in the within-project context (i.e., process metrics, product metrics, the churn of source code, and the entropy of source code metrics) and in the cross-project context (i.e., process metrics, the entropy of changes, and the entropy of source code metrics) are the same whether imbalanced or balanced datasets are used for training. Similarly, the worst performers for classification (i.e., LOC and NFIX-ONLY) are also the same in both cases. No major difference exists in the Average Rankings of the remaining approaches either: they are more or less the same with the BLSMOTE-balanced datasets as with the original imbalanced datasets. Therefore, we conclude that data imbalance does not have much effect on the ranking of the evaluated approaches.

Table 15 Average Ranking of defect prediction approaches on imbalanced and balanced datasets

10 Threats to validity

Construct validity

Threats to construct validity are related to the bias of the data utilized in model training and the metrics employed to assess the prediction performance. Since we use a public dataset, we cannot directly validate the quality of the data. However, D’Ambros et al. (2010) describe in detail the procedure they used to mine different data repositories to create the Bug Prediction Dataset, which gives us confidence in the validity of the dataset used in our experiments. We evaluate the approaches in three different evaluation scenarios (classification, ranking, and ranking with effort consideration).

Statistical conclusion validity

To check whether the performance difference between the approaches is significant, we perform Nemenyi’s post hoc test. The test only separates the extremely best from the extremely worst performers. Due to its conservative behavior, the test is more likely to fail to reject the null hypothesis that the approaches perform equally. Therefore, we also test the performance difference between pairs of approaches with the Mann-Whitney U test at a 95% confidence level.

Internal validity

Threats to internal validity relate to the data transformation, feature selection methods, and training algorithms used in the experiments. Software metrics measured at different scales contribute differently to model training and may bias the model. Therefore, feature scaling (Min-Max scaling) is applied to the datasets. Multicollinearity and a large number of features in model training result in model overfitting. To address the problem of multicollinearity and use only informative features in model training, a feature selection technique (sequential floating forward selection) is applied to the datasets before model training.

External validity

Threats to external validity are related to the risks of generalizing the findings of a study to other contexts and datasets. The contrasting results about the performance of defect prediction approaches in different scenarios presented in this article show how results obtained in a certain context and evaluated with a particular evaluation metric are difficult to generalize to other contexts and evaluation metrics. In our experiments, we use data about five open-source projects from the Bug Prediction Dataset. Replication of the study on other datasets is required to generalize its findings.

11 Conclusions

The driving factor for defect prediction is resource allocation. Given an accurate estimate of the distribution of defects across software components, the quality assurance effort can be prioritized towards the problematic parts of the system. Many approaches differing in the data sources used, training methods, and evaluation methods have been developed to predict the defects in software components. The earlier studies compare the data sources to only a few other data sources, evaluate them using distinct performance indicators without any effort consideration, or compare them only in the within-project context.

We evaluated a set of representative defect prediction approaches from the literature in the within-project and cross-project contexts in three different scenarios: classification, ranking, and ranking with effort consideration. In the within-project context, for classification, process metrics, followed by product metrics, the churn of source code, and entropy of source code metrics, perform the best. A subsequent evaluation identified that process metrics significantly outperform the other three approaches. The process metrics also perform best for ranking without effort consideration, followed by the entropy of source code, the churn of source code metrics, and product metrics. The subsequent evaluation found the difference between the top three not significant. In ranking with effort consideration, the entropy and churn of source code perform best, followed by product and process metrics. The subsequent evaluation did not find any significant difference.

In the cross-project context, for classification, the churn of source code, followed by process metrics, the entropy of changes, the entropy of source code metrics, and product metrics, perform the best. A subsequent evaluation could not find any significant difference among the top performers. In ranking without effort consideration, the process metrics perform best, followed by the churn of source code, entropy of source code, and entropy of changes. A subsequent evaluation could not find any significant difference among the top performers. In ranking with effort consideration, the entropy of source code and defect metrics perform best. The subsequent evaluation found no significant difference between any of the approaches.

In general, under all three evaluation scenarios, we observe that the process metrics, the churn of source code, and the entropy of source code perform significantly better in the within-project as well as the cross-project context. The defect metrics and other single metric approaches (entropy of changes, number of revisions, lines of code, number of fixes) generally perform worse in all three scenarios. Moreover, for both the WPDP and CPDP contexts, no clear difference in the Average Ranking of the approaches is observed between the models trained on the original imbalanced datasets and the models trained on the datasets balanced through the BLSMOTE technique. Therefore, we conclude that data imbalance does not have any substantial effect on the ranking of the evaluated approaches.

We suggest considering the process metrics, the churn of source code, and the entropy of source code metrics as predictors in future defect prediction studies and taking a great deal of care when considering single metric approaches (number of revisions, lines of code, number of fixes, etc.). We reckon that the instance selection and transfer learning techniques developed over the last decade for alleviating the distribution mismatch between software projects could improve performance if used after careful selection of the predictor metrics. In the future, we suggest studying the effects of feature selection in combination with instance selection and transfer learning on the prediction performance.

We also observe that a different ordering of approaches is generated by each evaluation scenario, in the within-project as well as the cross-project context. That means each problem has distinct characteristics. Therefore, conclusions from within-project defect prediction studies should not be generalized to cross-project defect predictions.