1 Introduction

Software quality engineers need tools to monitor, audit and verify the fault-proneness of software throughout the life cycle of a project. Software engineers keep records of fault data in dedicated repositories such as Bugzilla. These fault data can be used to find where faults are likely to occur. However, fault data may not be recorded or available for many reasons, so indirect measurements of the software are needed as surrogates for fault-proneness. Software metrics, for example the Chidamber and Kemerer (CK) metrics [15], have been validated as indicators of the fault-proneness of classes. The CK metrics were found to have significant relationships with faults using many machine learning and statistical techniques [2, 4, 18, 19, 28, 37, 48, 54, 63]. Machine learning and statistical techniques are rigorous, and dedicated software tools are needed to analyze such relationships; software engineers, however, need simpler tools to investigate fault-proneness in modules. Identifying which classes are more likely to have faults helps guide software testers, improves their performance and reduces the costs of activities such as testing and maintenance [18]. The presence of faults can be used to profile software modules into several risk levels, delimited by threshold (reference) values. However, we face some problems in data quality. Fault data may not be collected for part of the system because the cost of collection can be prohibitive [11]. Two major issues arise when analyzing fault-proneness in classes: (1) few classes have faults, while the majority do not (imbalanced fault distribution) [42, 57], and (2) many metrics already exist and can be used to evaluate software fault-proneness. Khoshgoftaar et al. [38] proposed two techniques to confront these issues: a data sampling technique to overcome the class imbalance in fault distribution and a feature selection technique for selecting the important metrics. According to Khoshgoftaar et al. [38], the performance of a prediction model depends on both the selected metrics and the fault distribution in modules.

In this study, we propose the receiver operating characteristic (ROC) analysis as a quality assurance tool for selecting software metrics as fault-proneness indicators. The ROC analysis identifies a threshold value that separates software modules into two areas: low quality (fault-prone) and high quality (not fault-prone). The results of the ROC analysis can be used in quality assurance tools such as JArchitectFootnote 1 to identify, for example, the most coupled classes or the least cohesive classes in a system. JArchitect uses thresholds to identify bad quality areas and allows the user to change the thresholds; the tool reports all classes that exceed the thresholds and therefore require further quality inspection. In a white paper, Gronback [27] reported on using threshold values to identify bad smells in a commercial quality assurance tool (Borland Together). The tool uses threshold values to identify potential bad smells such as god classes, god methods and data classes; however, these threshold values were reported to be chosen subjectively [44]. This study introduces the ROC analysis to investigate the relationship between the CK metrics and the faults in five open-source systems. The study aims to validate the use of the ROC analysis in threshold identification. We also validate the consistency of the ROC analysis under two major sampling techniques: oversampling and undersampling. The ROC analysis is used to select metrics to include in learners, and the stability of this selection is assessed and compared with other traditional feature selection methods. Finally, we use the metrics that are selected via the ROC analysis to build fault-proneness models using four well-known learners: logistic regression, naïve Bayes, the nearest neighbors and C4.5 decision trees [10].

The rest of this paper is organized as follows: Sect. 2 discusses the related work on fault prediction and the identification of threshold values. Section 3 presents the experimental design of this research and provides a detailed description of the research methodology. Section 4 presents and discusses the results. Section 5 discusses the limitations and threats to validity, and Sect. 6 concludes the work.

2 Related work

Studies on fault-proneness categorize software classes into several groups based on the number of faults. Usually, classes are divided into two groups: faulty classes, which have one or more faults in the release under investigation, and non-faulty classes, which do not have any faults. Other researchers have utilized the threshold values of software metrics in many applications. Metric threshold values can help developers identify the most risky classes in a software design. Using threshold values, a developer can bookmark such classes during daily development tasks and make quick decisions whenever needed. However, there have been only a few empirical studies on threshold values. Erni et al. [21] proposed using normal distribution parameters (average and standard deviation) to determine thresholds, calculated as \(\hbox {Tmin} = \mu - s\) and \(\hbox {Tmax} = \mu + s\), where \(\mu \) is the average of a metric and s its standard deviation. However, these thresholds were not empirically validated. Daly et al. [16] studied the effect of two arbitrary levels of inheritance (three and five) on maintenance time. Other researchers [8, 30, 49] replicated the study of Daly et al. on different systems, considered only the effect of three levels of inheritance on maintenance time, and found different results. However, these experiments were conducted with only a few undergraduate students as subjects. Benlarbi et al. [6] and El Emam et al. [20] estimated threshold values using logistic regression and could not find valid thresholds. Shatnawi [55] used a logistic regression method reported in Bender [5] to find thresholds for the Chidamber and Kemerer suite, using the derivatives of the logistic regression to identify several risk levels in software classes; e.g., CBO could have four different thresholds at four different risk levels, CBO = 6, 9, 16 or 29. In another study, Shatnawi et al. [56] studied the use of the ROC curve to identify threshold values of object-oriented metrics. They conducted the study on three releases of a large open-source system, Eclipse. The results identified thresholds for only three metrics, RFC = 44, CBO = 13 and WMC = 24, whereas LCOM, DIT and NOC did not have plausible and practical thresholds. Catal et al. [11] proposed a modification of the ROC analysis of Shatnawi et al. [56] to obtain thresholds for structural metrics. Catal et al. [11] used outlier detection techniques to improve the performance of fault prediction by labeling classes as faulty if they exceed the threshold criteria and as non-faulty otherwise. For example, the threshold values for SLOC fall in the range between 11 and 33. Ferreira et al. [23] derived thresholds for some OO metrics using statistical properties of the metrics, such as power law behavior. For example, the authors assigned three rankings based on ranges of metrics: good, regular and bad. Ferreira et al. [23] found different thresholds for different application domains (11 domains were reported in the study).

To our knowledge, previous studies on identifying threshold values have not addressed the problems of data imbalance and metric selection. Since the same data are used both to build fault prediction models and to identify thresholds, we need to study these two important issues in threshold identification. Many studies have already addressed data imbalance in fault prediction [3, 38, 42, 57].

The ROC analysis was proposed to identify threshold values in Shatnawi et al. [56]. The objective of this research is to extend our previous work [56] on identifying threshold values using ROC curves to new contexts and to test it under the effects of data imbalance and feature selection. The previous and the current works are both empirical but have many differences. The previous work included only one project (Eclipse), while this work studies five projects. The objective also differs: here, we provide more evidence on using the ROC analysis to identify thresholds even for imbalanced data, which have a major effect on the validity of fault prediction results. In addition, we introduce the ROC analysis as a method of metric selection, which is also important to provide better performance in quality assurance activities.

3 Experimental design

In this section, we discuss the details of the ROC analysis, the metrics under investigation, the research objectives, data sources and feature selection approaches, and finally, we provide a brief description of classification models.

3.1 Area under the curve (AUC)

The area under the receiver operating characteristic (ROC) curve is used in classifier evaluation. The ROC curve is plotted using two variables: one binary and one continuous. The binary variable indicates whether a software module has faults or not (i.e., 1 or 0). The continuous variable is one of the CK metrics, e.g., the WMC metric in Fig. 1. Each metric is analyzed separately by considering all of its values as potential thresholds that can be used to categorize classes as either faulty (\(\ge \) threshold) or not faulty (< threshold). For each potential threshold, a classification table (confusion matrix) is produced, as shown in Table 1.

Each table can be used to calculate two important measures of ROC performance: sensitivity and specificity, which are defined as follows.

$$\begin{aligned} \hbox {Sensitivity}&= \hbox {TP rate} = \hbox {TP}/\hbox {P},\\ \hbox {Specificity}&= \hbox {TN rate} = \hbox {TN}/\hbox {N}. \end{aligned}$$
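As an illustration of this computation, the following Python sketch derives sensitivity and specificity from the confusion matrix of a single candidate threshold; the metric values, fault labels and threshold are hypothetical, and the sketch is not part of the tooling used in this study.

```python
# Illustrative sketch only: sensitivity and specificity for one candidate
# threshold of a single metric. The sample data below are hypothetical.
import numpy as np

wmc = np.array([3, 8, 15, 22, 5, 31, 9, 12])   # metric values, one per class
faulty = np.array([0, 0, 1, 1, 0, 1, 0, 1])    # 1 = class has at least one fault

def sensitivity_specificity(metric, labels, threshold):
    predicted_faulty = metric >= threshold          # >= threshold -> predicted fault-prone
    tp = np.sum(predicted_faulty & (labels == 1))   # true positives
    tn = np.sum(~predicted_faulty & (labels == 0))  # true negatives
    p = np.sum(labels == 1)                         # all faulty classes
    n = np.sum(labels == 0)                         # all non-faulty classes
    return tp / p, tn / n                           # TP/P, TN/N

sens, spec = sensitivity_specificity(wmc, faulty, threshold=10)
print(f"sensitivity = {sens:.2f}, specificity = {spec:.2f}")
```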

The ROC curve, as shown in Fig. 1, provides a visual trade-off analysis between the rate of faulty classes correctly classified as fault-prone (sensitivity) and the rate of non-faulty classes incorrectly classified as fault-prone (1 − specificity). The AUC is a single value that evaluates the discrimination power of the curve between the faulty and non-faulty classes. The diagonal line in Fig. 1 represents random guessing: if a classifier randomly guesses the positive class 50% of the time, then half of the positives and half of the negatives are classified correctly, and the area under the diagonal line is 0.50. Therefore, a curve that discriminates well between the two classes should have an area larger than 0.5 and should approach the upper left corner. Hosmer and Lemeshow suggested the following rules to evaluate the performance of classifiers using the AUC [31]:

  • AUC = 0.50: means no discrimination, i.e., not significantly different from a random classifier;

  • \(0.50<\hbox {AUC}<0.60\): means poor classification;

  • \(0.60 \le \hbox {AUC}<0.70\): means fair classification;

  • \(0.70 \le \hbox {AUC}<0.80\): means acceptable classification;

  • \(0.80 \le \hbox {AUC}<0.90\): means excellent classification;

  • AUC \(\ge \) 0.9: means outstanding classification.
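For illustration, the sketch below computes the AUC of a single metric against the fault labels and maps it to the scale above; the use of scikit-learn and the handling of values at exactly 0.50 are assumptions for this example, not the tooling of the study.

```python
# Hedged example: AUC of one metric used directly as a ranking score,
# interpreted with the Hosmer-Lemeshow scale listed above.
from sklearn.metrics import roc_auc_score

def interpret_auc(auc):
    if auc <= 0.50:
        return "no discrimination"
    if auc < 0.60:
        return "poor"
    if auc < 0.70:
        return "fair"
    if auc < 0.80:
        return "acceptable"
    if auc < 0.90:
        return "excellent"
    return "outstanding"

# hypothetical sample: metric values and fault labels for eight classes
metric_values = [3, 8, 15, 22, 5, 31, 9, 12]
faulty = [0, 0, 1, 1, 0, 1, 0, 1]

auc = roc_auc_score(faulty, metric_values)   # the metric itself is the score
print(f"AUC = {auc:.2f} ({interpret_auc(auc)})")
```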

The ROC analysis is very effective for data with skewed distributions and unequal classification error costs [22]. The characteristics of the ROC analysis help researchers generalize results even when data distributions change [39, 40]. In addition, the ROC analysis is preferred both for practical choices and for drawing scientific conclusions [64]. Koru et al. showed that smaller modules are more fault-prone than larger modules [39]. Kubat and Matwin found that in imbalanced datasets, the effect of the negative cases (classes without bugs) prevails [40].

Fig. 1 The ROC curve for the WMC metric in jEdit4.2

Table 1 The confusion matrix based on a threshold value

3.2 The Chidamber and Kemerer metrics

In this research, the CK metrics are used to build predictive models. Chidamber and Kemerer validated six metrics both theoretically and empirically and proposed to use these metrics to predict design quality factors. The CK suite is composed of six metrics that measure the complexity of the software design by measuring the properties of classes. In the definition of these metrics, we refer to software modules as classes. These metrics are summarized as follows:

  • Coupling between Objects (CBO): the CBO metric counts the number of other classes to which a class is coupled. Larger values of the CBO metric mean that the class is highly coupled. Developers and testers perceive highly coupled classes as difficult to maintain and test, which makes maintaining them and uncovering pre-release and post-release errors difficult as well.

  • Depth of Inheritance Hierarchy (DIT): the DIT metric measures the length of the inheritance chain from the root of the inheritance tree to the measured class and is therefore an indicator of the number of ancestors of a class. Developers and testers may need to understand all ancestors to comprehend the class, which is necessary to maintain it or to uncover pre- and post-release faults.

  • Number of Child Classes (NOC): the NOC metric counts the number of descendants of a class. The number of children represents the number of specializations and uses of a class; therefore, understanding all children is important for understanding the parent. A large number of children increases the burden on developers and testers when comprehending and maintaining the class and when uncovering both pre- and post-release faults.

  • Lack of Cohesion of Methods (LCOM): the LCOM metric is the number of pairs of methods in the class that use no attributes in common (referred to as P) minus the number of pairs of methods that do (referred to as Q). The LCOM is set to zero if this difference is negative. After considering each pair of methods:

    $$\begin{aligned} \hbox {LCOM}= (P > Q)? (P - Q): 0. \end{aligned}$$

    The LCOM metric measures the coherence among local methods in a class. The class that does one thing (i.e., cohesive class) is easier to reuse and maintain than the class that does many things (i.e., the class provides many services).

  • Response for Class (RFC): the response set of a class includes methods in the class inheritance hierarchy and methods that can be invoked on other objects. The RFC metric counts the number of methods in the response set of a class, i.e., the number of local methods plus the number of remote methods invoked by local methods. A class that has many responsibilities tends to be large and to have many interactions with other classes; such classes are complex and incur more time and effort to maintain and test than small classes.

  • Weighted Methods Complexity (WMC): the WMC metric is the sum of the complexities of all methods of a class. In practice, many software metrics tools calculate the WMC metric simply as the number of methods in a class, which is equivalent to considering all methods to have equal complexity. Therefore, larger values of the WMC metric indicate larger complexity as well.

3.3 Research objectives

We aim to analyze the CK metrics using the ROC analysis. The sensitivity and specificity are calculated for all potential thresholds and are used to draw the ROC curve, as shown in Fig. 1. One point on the ROC curve is considered better than another if it is closer to the upper left corner (i.e., its TP rate is higher, its FP rate is lower, or both) [22].

This study aims to use ROC analysis to achieve three objectives:

Objective 1

Study software fault-proneness in open-source systems using ROC curves. We study whether software metrics can discriminate between fault-prone and not fault-prone classes using ROC curves. The aim of this objective is to find practical threshold values using the ROC analysis. The search for threshold values has several steps. The first step is to test the hypothesis \(H_{01}\) for the software metrics. This test is an essential step in the ROC analysis and is vital to assess whether a particular ROC curve is usable in the first place [31]. The null hypothesis \((H_{01})\) is proposed to determine whether the curve is significantly different from a random classifier (AUC = 0.5).

\(\mathbf{H}_{01}\): The AUC for a particular metric is equal to 0.5. The AUC should be significantly different from the diagonal line for the curve to be considered a candidate for threshold identification.

If the AUC is significantly different from random guessing (i.e., \(H_{01}\) is rejected), then the ROC curve is usable and we continue the search for practical threshold values; otherwise, the threshold identification for the metric stops at this point. The second step is to retain the metrics for which the null hypothesis \((H_{01})\) is rejected. A curve is considered to provide acceptable classification when its AUC value is larger than 0.70 [31]. For the selected curves, many points can be potential threshold values. We suggest two criteria for identifying the best threshold value:

  1. Threshold values should be close to the ideal point (0, 1). We use the Euclidean distance to measure the distance between the ideal point (0, 1) and every possible threshold; the point with the lowest distance is considered the closest to the ideal point.

  2. The ROC curve is built using pairs of values (sensitivity, 1 − specificity) that are used to evaluate classification performance, whereas we need a single value to evaluate the performance of a particular threshold. The slope of the tangent line at a threshold on the curve gives the ratio of the probability of identifying true positives to that of false positives, i.e., the likelihood ratio (LR) for the test value:

    $$\begin{aligned} \hbox {LR} = \hbox {sensitivity}/(1-\hbox {specificity}) \end{aligned}$$

The likelihood ratio shows the relative change in the two values. If the likelihood ratio is equal to one, the selected threshold adds no information for identifying the true positives and corresponds to the diagonal line shown in Fig. 1. If the likelihood ratio is greater than one, then the selected threshold helps in identifying true positive results [60]. If the ratio is less than one, then the selected threshold does not help in identifying fault-prone modules. Finally, a threshold value is selected if it has \(\hbox {LR} > 1\).
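A compact sketch of these two criteria is given below; it assumes scikit-learn's roc_curve and hypothetical sample data, sweeps all candidate thresholds, discards points with LR ≤ 1 and returns the point closest to the ideal corner (0, 1).

```python
# Sketch of the threshold selection described above (illustrative, not the
# authors' implementation): minimum Euclidean distance to (0, 1) among the
# candidate thresholds whose likelihood ratio exceeds one.
import numpy as np
from sklearn.metrics import roc_curve

def select_threshold(metric_values, faulty):
    fpr, tpr, thresholds = roc_curve(faulty, metric_values)
    with np.errstate(divide="ignore", invalid="ignore"):
        lr = np.where(fpr > 0, tpr / fpr, np.inf)    # LR = sensitivity / (1 - specificity)
    distance = np.sqrt(fpr ** 2 + (1.0 - tpr) ** 2)  # distance to the ideal point (0, 1)
    candidates = np.where(lr > 1.0)[0]
    if candidates.size == 0:
        return None                                  # no practical threshold found
    best = candidates[np.argmin(distance[candidates])]
    return thresholds[best], tpr[best], 1.0 - fpr[best]

# hypothetical WMC values and fault labels
print(select_threshold([3, 8, 15, 22, 5, 31, 9, 12], [0, 0, 1, 1, 0, 1, 0, 1]))
```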

Objective 2

Validate the effect of different sampling techniques on the ROC curves and the derived thresholds. Several sampling techniques have been used to improve the performance of fault-proneness models on imbalanced fault distributions. In this research, the ROC analysis is repeated for all systems after applying the following sampling techniques: synthetic minority oversampling (SMOTE) and undersampling (at a ratio of 2:1). SMOTE is an oversampling approach that increases the number of instances of the minority class (faulty modules in our study) by creating synthetic examples rather than by oversampling with replacement [14]. In the implementation of SMOTE, five nearest neighbors are used to synthesize the new samples. In undersampling, a sample of the data is created such that the ratio of not faulty to faulty classes is kept at 2:1. These sampling techniques have been widely used to improve the performance of fault-proneness models when the data distribution is imbalanced [9, 42, 53, 57]. The results of the ROC analysis after sampling are compared with the ROC results without sampling. Therefore, we test the null hypothesis \(H_{02}\) in two scenarios: AUC values without sampling versus SMOTE and AUC values without sampling versus undersampling.

\(\mathbf{H}_{02}\): There are no significant differences in AUC values before and after sampling (undersampling and SMOTE) for a metric.
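The sketch below reproduces the two sampling scenarios under the assumption that the imbalanced-learn library is used (the paper does not name its implementation); the 5 nearest neighbors for SMOTE and the 2:1 undersampling ratio follow the description above, while the data are hypothetical.

```python
# Hedged sketch: recomputing a per-metric AUC after SMOTE oversampling and
# after 2:1 random undersampling, assuming the imbalanced-learn library.
import numpy as np
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler
from sklearn.metrics import roc_auc_score

def auc_after_sampling(metric_values, faulty, sampler):
    X = np.asarray(metric_values, dtype=float).reshape(-1, 1)
    y = np.asarray(faulty)
    X_res, y_res = sampler.fit_resample(X, y)        # resampled metric and labels
    return roc_auc_score(y_res, X_res.ravel())

smote = SMOTE(k_neighbors=5, random_state=1)         # 5 neighbours, as described above
under = RandomUnderSampler(sampling_strategy=0.5,    # faulty : not faulty = 1 : 2
                           random_state=1)
# auc_after_sampling(metric_values, faulty, smote) would then be compared with
# the AUC computed on the original, unsampled data.
```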

Table 2 The fault distribution for all systems

Objective 3

Validate the use of the ROC analysis in selecting the metrics that are most significantly associated with fault-proneness in modules. Feature selection is another vital technique for improving the performance of fault prediction models [26, 38]. The metrics selected using the ROC analysis are used to build fault-proneness models. The performance of four machine learning techniques, logistic regression, decision trees (C4.5), the nearest neighbors (kNN) and naïve Bayes, is used to validate the ROC analysis as a metric selection method. The fault-proneness models that use the metrics resulting from the ROC analysis are compared against other models: models including all the CK metrics and models resulting from three different feature selection approaches. Therefore, we aim to test the following hypothesis:

\(\mathbf{H}_{03}\): There are no significant differences in the performance of the models resulting from metric subset selection (forward selection, Chi-squared and information gain) and the models resulting from the metrics selected via the ROC analysis.

3.4 Data sources

Faults are discovered during the software life cycle, especially in the testing phases or throughout system evolution. We study five open-source systems that are publicly available. The systems are from different domains and of different sizes.

  • Apache Ant is a Java library and command-line tool whose mission is to drive processes described in build files as targets and extension points dependent upon each other. The main known usage of Ant is to build Java applications. Ant supplies a number of built-in tasks that allow users to compile, assemble, test and run Java applications. The Ant project is publicly available at http://ant.apache.org/.

  • Apache Camel is a versatile open-source integration framework based on known enterprise integration patterns. Apache Camel uses URIs to work directly with any kind of transport or messaging models such as HTTP, ActiveMQ, JMS, JBI, SCA, MINA or CXF, as well as pluggable components and data format options. Apache Camel is a small library with minimal dependencies for easy embedding in any Java application. Camel project is publicly available at: (http://camel.apache.org/).

  • jEdit is a cross platform programmer’s text editor written in Java. It uses the Swing toolkit for the GUI and can be configured as a rather powerful IDE through the use of its plug-in architecture. jEdit project is publicly available at: (www.jedit.org).

  • Apache Lucene is a high-performance, full-featured text search engine library written entirely in Java. Lucene provides Java-based indexing and search technology, as well as spellchecking, hit highlighting and advanced analysis/tokenization capabilities. Apache Lucene is an open-source project available for free download (http://lucene.apache.org/core/).

  • Apache Synapse is a lightweight and high-performance enterprise service bus (ESB). Powered by a fast and asynchronous mediation engine, Apache Synapse provides exceptional support for XML, Web Services and REST. In addition to XML and SOAP, Apache Synapse supports several other content interchange formats, such as plain text, binary, Hessian and JSON. Apache Synapse is a free and open-source software (http://synapse.apache.org/).

The fault data for the systems under investigation were collected from the repositories of the projects and reported in the PROMISE data repository [35, 36]. The authors used two separate tools: BugInfoFootnote 2 and Ckjm.Footnote 3 BugInfo was used to collect the fault data, whereas Ckjm was used to collect the CK metrics. BugInfo analyzes the history of the classes by studying the code repositories (Subversion or CVS). If a log contains a fault fix, then the affected classes are determined from the full description and marked as faulty. BugInfo uses regular expressions to extract fault information: if a log description matches a pattern of a regular expression, the fault counts are incremented. The fault distributions are shown for each system in Table 2. Mostly, these systems have imbalanced distributions; i.e., the faulty classes are the minority, while the non-faulty classes are the majority. For example, the percentage of classes that have faults ranges between 11 and 26% across the Ant releases. The use of sampling techniques helps in balancing the fault distributions.

We provide, in Table 3, the descriptive statistics for one release of each system. The inheritance metrics (DIT and NOC) have low variances, whereas the cohesion metric (LCOM) has the highest variance. The mean values range between 8 and 12 for the WMC metric, 11 and 14 for the CBO metric, 21 and 40 for the RFC metric, and 2 and 3 for the DIT metric. The mean of the NOC metric approaches one due to its low variance. The LCOM metric shows very different means across projects.

Table 3 The descriptive statistics for later releases of the five systems

3.5 Feature selection approaches

There are two main feature selection approaches: filter and wrapper. In filter approaches, no data mining technique is used in the feature selection method. A wrapper approach, on the other hand, depends on the results of the data mining algorithm to determine the effectiveness of the resulting feature subset. Many software metrics have emerged to assess fault-proneness in software. These metrics measure different dimensions and may not be significantly associated with the fault-proneness of modules. Metrics that are not good indicators of fault-proneness may affect classifiers' performance [2, 7]. For example, the nearest neighbor algorithm is prone to the inclusion of irrelevant features because of their effect on the calculation of similarity measures.

In this research, we consider the results of the ROC analysis in Sect. 4.1 to select the features to include in fault classification models. The proposed feature selection technique is compared against three other feature selection techniques: the forward stepwise greedy algorithm, Chi-squared feature evaluation and information gain feature evaluation. The forward greedy feature selection evaluates subsets by considering the individual predictive ability of each feature along with the degree of redundancy between them; feature subsets that are highly correlated with the class while having low intercorrelation are preferred. The Chi-squared selection evaluates features by computing the value of the chi-squared statistic with respect to the class [17]. Information gain selection evaluates features by measuring the information gain with respect to the class, i.e., how much the entropy of the class decreases when the value of a given feature is known [17]. The selected features are used to build fault prediction classifiers, and the resulting classifiers are then compared against the models that include the metrics selected via the ROC analysis. All feature subsets are applied to four machine learning techniques: logistic regression, decision trees (C4.5), the nearest neighbors (kNN) and naïve Bayes.

The stability of feature selection techniques is an important issue for obtaining consistent results across datasets. If a technique produces a different subset for different datasets, then that technique becomes unreliable for feature selection [13]. Measuring feature subset stability requires a similarity measure for feature subsets. There are three types of feature stability measures [41]: in the first type, a weight or score is assigned to each feature indicating its importance; in the second type, ranks are assigned to features; and the third type consists of sets of selected features in which no weighting or ranking is considered. In this work, stability is measured by considering the overlap between two subsets of features using a straightforward adaptation of the Tanimoto distance measure as follows.

$$\begin{aligned} S_{s} \left( {s,~s^{\prime }} \right) = 1 - \frac{{\left| s \right| + \left| {s^{\prime }} \right| - 2\left| {s \cap s^{\prime }} \right| }}{{\left| s \right| + \left| {s^{\prime }} \right| - \left| {s \cap s^{\prime }} \right| }} \end{aligned}$$

The Tanimoto distance metric measures the amount of overlap between two sets (s and \(s^{\prime }\)). \(S_{\mathrm{s}}\) takes values in [0, 1], with 0 meaning that there is no overlap between the two sets and 1 meaning that the two sets are identical.
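A minimal sketch of this overlap measure, applied to two hypothetical metric subsets, is shown below.

```python
# Tanimoto-based subset overlap as defined above: 1 means identical subsets,
# 0 means no overlap. The two subsets below are hypothetical examples.
def tanimoto_similarity(s, s_prime):
    s, s_prime = set(s), set(s_prime)
    union = len(s) + len(s_prime) - len(s & s_prime)
    if union == 0:                # both subsets empty: treat as identical
        return 1.0
    return 1.0 - (len(s) + len(s_prime) - 2 * len(s & s_prime)) / union

print(tanimoto_similarity({"WMC", "CBO", "RFC"}, {"WMC", "CBO", "LCOM"}))  # 0.5
```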

3.6 Classification models

There are many classification techniques that can be used; however, we limit our work to four: naïve Bayes (NB), logistic regression (LR), the nearest neighbors (kNN) and C4.5 decision trees. All selected classifiers are commonly used in the field of software engineering. Weka is used to train and test these classifiers, and the default settings of the learners are used [29]. The LR model is a regression model suitable for a binary class (faulty or not faulty); it is a statistical technique that has been widely used to build fault-proneness models in many studies, including [33, 43]. The naïve Bayes (NB) algorithm is a commonly used classifier in software defect prediction and has been applied in many studies, including [12, 46]. NB is intuitive and simple to build and can be viewed as a simple Bayesian network with two assumptions: the attributes (metrics) are independent given the class (faulty or not faulty), and no hidden or underlying attributes affect the prediction [34]. The nearest neighbor (kNN) algorithm uses distance (similarity) metrics to assign the dominant label of the closest group of k objects in the training set [1]. The value of k is usually chosen to be odd (1, 3, 5, 7, etc.); in this research, an arbitrary \(k=5\) is selected, as finding the best k is not an objective of this research. The nearest neighbors classifier has been used in fault-proneness models in many previous papers [25, 58, 63]. C4.5 is an extension of the basic ID3 algorithm designed by Quinlan and is a well-known decision tree classifier in the fault prediction domain. C4.5 uses an information-based criterion (information gain) to build decision trees [50]; the tree grows by selecting, at each decision node, the attribute with the highest information gain. C4.5 has been used as a classifier in fault-proneness models in many previous papers [47, 51]. The four classification models are trained and tested using tenfold cross-validation.
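The sketch below mirrors this setup with scikit-learn analogues of the Weka learners; the library choice and the entropy-based decision tree standing in for C4.5 are assumptions for illustration, not the exact implementations used in the study.

```python
# Hedged sketch: four learners evaluated with tenfold cross-validated AUC.
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

learners = {
    "LR": LogisticRegression(max_iter=1000),
    "NB": GaussianNB(),
    "5NN": KNeighborsClassifier(n_neighbors=5),                 # k = 5, as above
    "C4.5-like": DecisionTreeClassifier(criterion="entropy"),   # entropy-based stand-in for C4.5
}

def cross_validated_auc(X, y):
    """X: metric matrix (classes x metrics); y: binary fault labels."""
    return {name: cross_val_score(clf, X, y, cv=10, scoring="roc_auc").mean()
            for name, clf in learners.items()}
```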

4 Results and analysis

4.1 ROC analysis and identification of threshold values

In this section, we discuss the results of the ROC analysis and the significance of the curves in classifying software classes as faulty or not. In addition, for each metric, we identify the possible threshold values. For each metric, we conduct a two-sided t-test of the significance of the difference between the curve and the random curve (AUC = 0.50) at the 95% confidence level. For the curves that are significantly different from random guessing, we identify threshold values based on both sensitivity (representing benefits) and specificity (representing costs). A visual assessment of the relationship between both measurements is shown in Fig. 2. In the following, we present the threshold values that have the lowest distance from the optimal point (0, 1). We provide a comprehensive analysis of the ROC; however, due to limited space, we do not draw a chart like Fig. 2 for every metric; rather, we provide the candidate thresholds in a tabular format. At the end of this section, we use the selected thresholds to classify modules into two groups: faulty and not faulty.
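The paper tests each curve with a two-sided t-test; as an illustrative alternative check (an assumption, not the authors' procedure), the Mann-Whitney U test can be used, since the AUC equals the U statistic divided by the product of the two group sizes.

```python
# Hedged sketch: significance of a single-metric AUC against 0.50 via the
# Mann-Whitney U test (AUC = U / (n_faulty * n_not_faulty)), assuming SciPy.
import numpy as np
from scipy.stats import mannwhitneyu

def auc_significance(metric_values, faulty):
    metric_values = np.asarray(metric_values, dtype=float)
    faulty = np.asarray(faulty)
    pos = metric_values[faulty == 1]          # metric values of faulty classes
    neg = metric_values[faulty == 0]          # metric values of non-faulty classes
    u, p_value = mannwhitneyu(pos, neg, alternative="two-sided")
    auc = u / (len(pos) * len(neg))
    return auc, p_value
```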

4.1.1 WMC metric

The results of the ROC analysis for the WMC metric in all releases are shown in Table 4 for the curves that are significantly different from the random classifier. The AUC is calculated, and the p value shows whether there is a significant difference from a random classifier (AUC = 0.50). If the p value is lower than the significance level (\(\alpha =0.05\)), we reject the null hypothesis \(H_{01}\).

Fig. 2 The sensitivity versus specificity for the WMC metric in Ant1.7

Table 4 The ROC analysis for WMC ordered by AUC
Table 5 The ROC analysis for CBO ordered by AUC

To identify a practical threshold value, the AUC value should be larger than 0.70. We use the Euclidean distance from the optimal point on the ROC curve (i.e., the distance from (0, 1)) to find the optimal threshold values. The selected thresholds should also have a likelihood ratio larger than one, which is satisfied for all systems. Therefore, we obtain threshold values that fall in the range 6–11.

Thresholds can be used to profile software into two levels: low-risk modules and high-risk modules. The modules in the high-risk group require more quality investigation, such as looking for potential refactorings or code improvements. McCabe suggested a threshold of WMC = 10 [45]: when the complexity of a given module exceeds 10, the likelihood of the code being unreliable is much higher. Shatnawi et al. [56] found a threshold of WMC = 9 for a large open-source system using ROC analysis. This threshold falls in the range reported in this work; however, fewer constraints were applied to find a threshold in the work of Shatnawi et al. [56]. Previously, Rosenberg suggested that WMC = 20 or WMC = 40 can be acceptable thresholds as well [52], although most classes were found to have values less than 20. The works of McCabe and Rosenberg are either anecdotal or based on histogram analysis, which assumes normality of the data. The assumption of normality for OO metrics is not valid, as shown in the descriptive analysis of all releases under investigation. In addition, the thresholds derived in this work are based on the relationship with faults in the system.

4.1.2 CBO metric

The results of the ROC analysis for the CBO metric are shown in Table 5. The p values are lower than the significance level except for one system (jEdit 4.3); therefore, we reject the null hypothesis \(H_{01}\) for the other releases. The AUC values should be larger than 0.70 for a curve to be considered practical for identifying a threshold value. Therefore, we obtain threshold values that fall in the range 6–11.

Table 6 The ROC analysis for NOC
Table 7 The ROC analysis for DIT

Again, the selected thresholds for the CBO metric can be used to profile software modules into two levels: modules with low coupling and modules with high coupling. McCabe suggested a threshold of CBO = 6 [45], which is more conservative than ours: when a module is coupled to more than six modules, it can be identified as more fault-prone. Shatnawi et al. [56] found a threshold of CBO = 13 for a large open-source system using ROC analysis, which falls outside our range. Previously, Rosenberg preferred another possible threshold value of CBO = 5 [52].

From this discussion, we do not reach a consensus on a particular threshold value for the CBO metric. In our work, we found thresholds in the range 6–11; only the threshold suggested by McCabe falls in this range.

4.1.3 NOC metric

The results of the ROC analysis for the NOC metric are shown in Table 6. As the p values are larger than the significance level, we cannot reject the null hypothesis \(H_{01}\) in favor of the alternative \(H_{\mathrm{a}1}\). Therefore, the ROC curves produced for NOC cannot be used to classify software classes into faulty and not faulty, and threshold values cannot be identified for the NOC metric. The works of McCabe and Rosenberg did not suggest any thresholds for NOC [45, 52]. In addition, Shatnawi et al. [56] could not report a threshold for NOC using ROC analysis. These results are consistent, and therefore, we conclude that no threshold can be identified for the NOC metric.

Table 8 The ROC analysis for RFC ordered by AUC
Table 9 The ROC analysis for LCOM ordered by AUC

4.1.4 DIT metric

The results of the ROC analysis for the DIT metric are shown in Table 7. As the p values are larger than the significance level, we cannot reject the null hypothesis \(H_{01}\) in favor of the alternative \(H_{\mathrm{a}1}\). Therefore, the ROC curves produced for the DIT metric cannot be used to classify software classes into faulty and not faulty, and threshold values cannot be identified for the DIT metric; the ROC analysis is not suitable for identifying such thresholds. The ROC analysis can define multiple thresholds using an ordinal variable (non-binary coding of faults), such as the 3-level severity of faults; however, the results are expected to show a monotonic behavior, which is not expected for the DIT. For example, Rosenberg et al. [52] defined a threshold as \(2<\hbox {DIT}<5\): modules with \(\hbox {DIT} < 2\) may represent poor exploitation of inheritance, whereas modules with \(\hbox {DIT} > 5\) have larger complexity. McCabe [45] defined the thresholds as \(2<\hbox {DIT}<6\), i.e., \(\hbox {DIT} > 6\) increases the testing effort and \(\hbox {DIT} < 2\) indicates a poor exploitation of inheritance.

4.1.5 RFC metric

The results of the ROC analysis for the RFC metric are shown in Table 8. The p values are lower than the significance level (\(\alpha =0.05\)) for most releases, and we reject the null hypothesis \(H_{01}\) for them; the AUC values are significantly different from the random classifier except in two releases (Ant 1.4 and jEdit 4.3). The thresholds identified for the RFC metric cover a wide range (15–40).

Again, the selected thresholds for the RFC metric can be used to profile software modules into two levels: modules with low responsibilities and modules with high responsibilities. McCabe suggested a threshold of RFC = 40 [45]. Shatnawi et al. [56] found a threshold of RFC = 44 for Eclipse using ROC analysis. Previously, Rosenberg preferred another possible threshold value of RFC = 50 [52]. These thresholds fall outside the range reported in this work.

Table 10 The application of thresholds on Ant1.5
Table 11 The application of thresholds on jEdit4.2
Table 12 Wilcoxon signed rank tests for \(H_{02}\)

4.1.6 LCOM metric

The results of the ROC analysis for the LCOM metric are shown in Table 9. The LCOM metric shows a different trend from all other metrics: the p values are not consistent across systems. Out of seventeen releases, eleven have p values lower than the significance level, whereas six have p values larger than alpha. Therefore, we reject the null hypothesis \(H_{01}\) for the 11 releases only. Again, a threshold is practical when the AUC value is larger than 0.70. The thresholds identified for the LCOM metric cover a wide range (10–26).

Shatnawi et al. [56] could not find a threshold for LCOM using ROC analysis. Previously, Rosenberg did not report any preferred thresholds for LCOM [52]. McCabe suggested a threshold of LCOM = 75% [45], but for a different definition of the LCOM metric. From this work and the previously reported results on LCOM thresholds, we can conclude that the LCOM metric was not significantly associated with fault-proneness [4, 48].

4.1.7 ROC application in god class identification

We study the application of the identified threshold values in profiling software modules into low- and high-risk groups. Software verification and validation is a lengthy process, and it should be cost-effective. We expect to identify a small proportion of modules that contain a large percentage of the faults. A threshold value separates modules into two groups: low and high risk. The first group includes the modules with values less than the threshold; the remaining modules are placed in the second group.

Threshold values are used to identify god classes in code. A god class, or large class, is one of the important code bad smells reported in Fowler [24]. A god class has a large number of responsibilities, complexities and interconnections. The classes that exceed the threshold values are identified as god classes. In this section, we report only the results of the god class analysis on two systems, Ant 1.5 and jEdit 4.2, as shown in Tables 10 and 11. The numbers of god classes identified are 51 and 82 in Ant 1.5 and jEdit 4.2, respectively. Engineers can choose a proportion of these classes for manual inspection or refactoring. Among the god classes, a large percentage have faults (41 and 45%, respectively). The god classes are more fault-prone than other classes, and a large portion already have faults. These results confirm the relationship between faults and god classes.
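The following sketch illustrates this profiling step; the threshold values and the rule that a god class must exceed all of the available thresholds are hypothetical choices for illustration, not the exact per-release values or decision rule of the study.

```python
# Illustrative profiling of classes with derived thresholds. The thresholds
# below are hypothetical picks from the ranges reported in Sect. 4.1.
thresholds = {"WMC": 10, "CBO": 9, "RFC": 30, "LCOM": 20}

def is_god_class(metrics):
    """metrics: dict of metric name -> value for one class.
    Flags a class when every available metric meets or exceeds its threshold
    (one possible combination rule; 'any' would give a looser profile)."""
    return all(metrics[m] >= t for m, t in thresholds.items() if m in metrics)

candidate = {"WMC": 25, "CBO": 14, "RFC": 61, "LCOM": 120}
print(is_god_class(candidate))   # True -> flag for inspection or refactoring
```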

4.2 ROC analysis after sampling

In this section, we conduct an experiment to find the effect of two sampling techniques, SMOTE and undersampling, on the ROC analysis. The systems under investigation have imbalanced fault distributions, as shown in Table 2.

We conducted the ROC analysis for all systems under investigation in three scenarios: original data without sampling, data after SMOTE oversampling and data after undersampling. Both the AUC and the threshold values after data sampling were calculated but are not shown for brevity. To test the null hypothesis \((H_{02})\), we conducted a pairwise Wilcoxon signed rank test on the AUC values for two pairs: without sampling against SMOTE and without sampling against undersampling. The results of the statistical tests (p values) are shown in Table 12 for three metrics only. The results for WMC, CBO and RFC do not show significant differences between the groups. Therefore, we conclude that sampling does not have a significant effect on the ROC analysis; i.e., the null hypothesis \((H_{02})\) cannot be rejected, and the AUC values are not statistically different. We also tested the significance of the differences in threshold values using the Wilcoxon signed rank test, as shown in the second part of Table 12. The test results show no statistically significant differences between the threshold values before and after sampling. These results show that the ROC analysis is robust under imbalanced data; on the other hand, many studies have reported an effect of imbalanced data on fault prediction models. Hall et al. [32] found that fault prediction models using C4.5 underperform on imbalanced data and recommended not using imbalanced data. Other researchers, Wang et al. [59] and Yu et al. [62], found results similar to the Hall et al. study and concluded that C4.5 outcomes are unstable on imbalanced datasets. Yan et al. [61] applied fuzzy logic and rules to overcome the imbalance effects on support vector machines. Agrawal and Menzies [3] introduced a tuned SMOTE technique and found improvements in classifiers such as decision trees, logistic regression, K-means, naïve Bayes and support vector machines; the authors recommended that any prior study which did not examine the effects of data pre-processing be analyzed again. However, this work presents evidence of the robustness of using the ROC analysis to identify threshold values under imbalanced data conditions: the threshold values are not significantly different before and after sampling.
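For concreteness, the sketch below shows the paired comparison using SciPy's Wilcoxon signed-rank test; the AUC values are made up for illustration only.

```python
# Hedged sketch: pairwise Wilcoxon signed-rank test on per-release AUC values
# before and after SMOTE (the numbers are illustrative, not from Table 12).
from scipy.stats import wilcoxon

auc_original = [0.78, 0.74, 0.81, 0.69, 0.77]
auc_smote    = [0.79, 0.73, 0.80, 0.70, 0.78]

stat, p_value = wilcoxon(auc_original, auc_smote)
print(f"p = {p_value:.3f}")   # p > 0.05 -> no significant difference, H02 is retained
```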

4.3 Feature selection using ROC

In this section, we train and test fault-proneness models using four learners: logistic regression (LR), naïve Bayes (NB), the nearest neighbors (5NN) and C4.5 decision trees. These learners are used to validate the impact of using the ROC analysis as an alternative attribute (metric) selection method. Only the metrics that have threshold values resulting from the ROC analysis in Sect. 4.1 are selected to build models. A metric is included in a fault-proneness model if its AUC value is significantly different from random guessing and \(\hbox {AUC} \ge 0.70\). These models are then compared with well-known feature selection techniques that perform a stepwise search through the space of attribute subsets.

First, the results of including all metrics in building classifiers are summarized in Table 13. The AUC values result from applying the four classifiers in the Weka data mining tool; the default settings of the classifiers are used with tenfold cross-validation, and the runs are repeated ten times. Not all releases show acceptable results (\(\hbox {AUC}\ge 0.70\)); in summary, the four classifiers achieved acceptable results for most releases. The LR and 5NN models showed the best classification performance, while the C4.5 models showed the worst performance of all.

Table 13 The area under curve (AUC) values when all metrics are included

Second, the metrics selected via the ROC analysis and via the other feature selection techniques are shown in Table 14. The metrics listed in the second column are selected based on the significance of the ROC curve with respect to random guessing (AUC = 0.5). For example, the CBO metric is the only metric selected for Ant1.4; therefore, the classifiers for this dataset are trained and tested using the CBO metric only. The metrics selected using the other selection techniques are shown in Table 14 as well. Several combinations of metric subsets are selected for different datasets. The inheritance metrics (DIT and NOC) are never selected by the ROC analysis. In contrast, the stepwise selection produces eleven distinct combinations, while the Chi-squared and information gain selections produce the same results for all datasets, each yielding eight different subsets.

To test the significance of the differences in the AUC values among the models, we conducted a pairwise Wilcoxon signed rank test at the 95% confidence level; the results are presented in Table 15. The p values do not show significant differences between the models for the four classifiers, except for logistic regression, which shows significant differences from the forward stepwise models. Moreover, the ROC analysis is more consistent than the feature selection techniques in selecting the same metrics across datasets. Since we could not find differences in the models' performance, we cannot reject the null hypothesis \((H_{03})\).

Table 14 Metrics selection via ROC and stepwise selection procedures
Table 15 The Wilcoxon signed rank tests for ROC versus three selection techniques

The stability of feature selection is measured to find the overlap in the selected subsets. If a technique produces a different subset for different datasets, then that technique becomes unreliable for feature selection. Table 16 shows the results of the Tanimoto distance measure; a stable technique has a value close to 1. The ROC analysis provides more stable and consistent selection subsets than the three feature selection techniques. The reduction in the number of metrics also reduces the effort of collecting additional metrics when a smaller number is sufficient: the ROC analysis selects at most four metrics, while the other techniques select up to six metrics in some models.

Table 16 Subset overlap measurement

5 Limitations and threats to validity

With regard to internal threats, from the viewpoint of applying the results, different interpretations of the software metrics represent a threat to the validity of the study. The definition of a threshold value might not be valid for every metric, and more investigation is needed. For example, the DIT metric may need a more detailed definition: as previous works show, the DIT metric might need two or more break points, e.g., a bad module might have \(\hbox {DIT}<3\) or \(\hbox {DIT}>5\), whereas the rest of the modules are regular. We did not collect the metrics and fault data ourselves, and it is possible that there are mistakes in the fault identification: faults were identified from the comments in the source code version control system, which are not always well written [36]. A further problem with the use of the confusion matrix and the ROC in evaluating the fault-proneness of software modules is that they are designed to apply to all classification problems and do not clearly and directly relate to the cost-effectiveness of using fault-proneness models [2].

With regard to external threats, this study considers open-source systems developed in Java, and the results might not generalize to systems written in other languages. The study is conducted on five systems only, and further investigation of more software systems is still needed. In addition, these systems are representative of open-source systems and might not be representative of commercial products; however, the systems under investigation are well known and widely used, and four of them are licensed as Apache products and are used as third-party components in commercial products. This study does not provide a comprehensive investigation of different sampling and feature selection techniques. In addition, only four learners were reported in the study, although there are more to investigate; however, the purpose is not to be comprehensive, but to provide a basis for comparison.

6 Conclusion

Using ROC analysis, we identified threshold values for the CK metrics to help software engineers identify risky classes. The CK metrics are widely validated as predictors of software fault-proneness. The ROC analysis was used to diagnose the relationship between software metrics and faults in five open-source systems. The results of the ROC analysis on the five systems showed significant relationships between four metrics (WMC, CBO, RFC and LCOM) and faults in most releases. For each metric, we identified threshold values via a statistical test of the significance of the curve. The identified thresholds are not consistent across all releases under investigation. To validate the consistency of the ROC analysis, we tested the effect of two sampling techniques (oversampling using SMOTE and undersampling) on the area under the ROC curve. The results of oversampling and undersampling the data are not significantly different from the ROC analysis conducted on the data without sampling. These results confirm our findings in Shatnawi et al. [56] and provide more evidence of the robustness of the ROC analysis in the case of data imbalance.

The ROC analysis was also used to select metrics for inclusion in fault-proneness models. The metrics with an area under the curve considered acceptable (larger than 0.70) are selected for further quality assessment. The selected metrics are then used to build prediction models using four machine learning techniques: logistic regression, naïve Bayes, the nearest neighbors and C4.5 decision trees. The performance of the classifiers using the metrics selected via the ROC analysis does not differ significantly from that of models built from all metrics or via stepwise feature selection techniques. However, the ROC selection is more reliable and consistent in selecting the same metrics than the three feature selection techniques. Therefore, we conclude that the ROC analysis can be used to identify which metrics are strongly related to fault-proneness and to identify plausible threshold values. Furthermore, the ROC analysis as proposed in this work can be used to derive thresholds for other software metrics.

In future work, we plan to investigate other techniques for threshold identification and to use thresholds in SQA applications. We plan to study a larger number of metrics using many sampling and feature selection techniques to find which technique produces the best results. In addition, we plan to study the effect of data transformation on the effectiveness of software metrics in building fault-proneness models.