1 Introduction

In the software development lifecycle, software testing is one of the most important phases. At this stage, software testing engineers should identify as many software defects as possible because the cost of rectifying a defect after release is much higher than during development (Pelayo and Dick 2007). Furthermore, in many software organizations, the project budget is limited, or the software system is too large to be tested exhaustively. Therefore, effectively allocating the limited testing resources to the software modules (i.e., the smallest units of functionality) that are most likely to contain defects is particularly crucial to a software organization. To address this problem, the notion of software defect prediction (SDP) has been proposed and has attracted the attention of a growing number of researchers in the field of software engineering (Turhan et al. 2009; Menzies et al. 2007; Yang et al. 2015; Ghotra et al. 2015; Kamei et al. 2013; Xia et al. 2016). The main task of SDP models is to classify software modules into two types, defective or non-defective, using software metrics (i.e., features or attributes), such as the McCabe (McCabe 1976) and Halstead (Halstead 1977) metrics. The performance of SDP models can be reproduced and compared on publicly available data sets, such as the PROMISE repository (Boetticher et al. 2007).

Naive Bayes (NB) is one of the most commonly used classifiers in SDP (Malhotra 2015), and its performance is usually superior to that of more complex classifiers (Menzies et al. 2007; Hand and Yu 2001; Zaidi et al. 2013). Because of its simplicity and effectiveness, NB is also used in many other learning scenarios, such as text classification, web page mining, fraud detection, and image classification. However, some issues remain to be resolved, especially concerning the assumptions NB makes about features, including the equal importance assumption (Turhan et al. 2009; Turhan and Bener 2007; Zhang and Sheng 2005), the normal distribution assumption (Witten et al. 2011), and the independence assumption (Zaidi et al. 2013; Arar and Ayan 2017; Turhan et al. 2009), none of which hold in many problems. In this paper, our study focuses on the equal importance and normal distribution assumptions to improve the performance of NB for SDP.

The equal importance assumption means that each feature of the software contributes equally to defect prediction. However, in practice, as different features are designed for different purposes, not all of them are good indicators of software defects (Turhan and Bener 2007). Therefore, the equal importance assumption is inappropriate. One solution to this problem is feature-weighting techniques, which assign each feature a weight based on its degree of importance. This approach is commonly called weighted naive Bayes (WNB). Turhan and Bener (2007) applied a heuristic method for feature weighting in WNB. More specifically, they used information gain (IG), gain ratio (GR), and principal component analysis (PCA) to evaluate the degree of importance of features. The experimental results illustrated that IG and GR are more suitable for WNB than PCA. However, a number of other algorithms are available as importance measures, for example, Chi square (CS) (Plackett 1983), IG (Witten et al. 2011), GR (Quinlan 1993), symmetrical uncertainty (SU) (Yu and Liu 2003), ReliefF (RF) (Kira and Rendell 1992; Robnikšikonja and Kononenko 2003), and information flow (IF) (Liang 2014; Bai et al. 2018), to name a few. Thus, we must identify which method is the most appropriate for assigning the feature weights in WNB.

The normal distribution assumption refers to the general assumption that a data set with continuous or numeric features follows a normal distribution. However, after performing a normality hypothesis test on 10 data sets (the data sets are illustrated in Sect. 4), we found that the features do not follow a normal distribution. More specifically, we conducted a Kolmogorov-Smirnov (KS) test (Razali and Wah 2011) 10,000 times on each data set using the “kstest” function in MATLAB; the default null hypothesis is that each feature comes from a normal distribution; the return value h equals 1 if the test rejects the null hypothesis at the 0.05 significance level; otherwise, h equals 0. Taking the data set Xalan-2.6 as an example, the results are provided in Table 1. As shown in the table, the normal distribution assumption is often violated, and as a result, the probability density estimated at each value of a numeric feature by NB is often suboptimal.

Table 1 The results of Xalan-2.6 for the Kolmogorov-Smirnov test
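To make this check concrete, the following sketch reproduces the same kind of normality test in Python, with scipy's kstest standing in for MATLAB's “kstest”; the standardization step and the feature-matrix layout (one column per metric) are our assumptions, not details fixed by the paper.

```python
import numpy as np
from scipy import stats

def ks_normality_flags(X, alpha=0.05):
    """For each feature column of X, test H0: 'the feature follows a
    normal distribution'. h[i] = 1 means H0 is rejected at level alpha,
    mirroring the h flag returned by MATLAB's kstest."""
    h = np.zeros(X.shape[1], dtype=int)
    for i in range(X.shape[1]):
        col = X[:, i]
        z = (col - col.mean()) / col.std()   # standardize before testing
        _, p = stats.kstest(z, 'norm')       # compare with the standard normal
        h[i] = int(p < alpha)
    return h
```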

In this paper, our solution to the normal distribution assumption is to use the information diffusion model (IDM) to improve the WNB classifier. Bai et al. (2015) illustrated that IDM is effective when dealing with the small-sample issue, based on the fuzziness of incomplete data and acoustic wave propagation. More specifically, an incomplete data set is regarded as a piece of fuzzy information in IDM, and thus, IDM can extract some additional information via diffusion (Huang 1997). Greater reliability for time-series analysis (Bai et al. 2014), risk assessment (Huang 2002), exceeding probability estimation (Huang and Shi 2012), and relation detection (Bai et al. 2017) has been achieved using IDM. Therefore, in this paper, we treat a training defect data set as incomplete data and use IDM to compute the feature probability density instead of the normal distribution probability density function in WNB for SDP.

As discussed above, we propose a new weighted naive Bayes method based on information diffusion (WNB-ID). It combines WNB and IDM to improve the performance of NB in SDP. We apply different feature-weighting techniques in WNB to determine the most appropriate weight for each feature and then use IDM for probability density estimation. To the best of our knowledge, no study in the SDP literature has used this method. Thus, we design experiments to address the following research questions.

  • RQ1. Which feature-weighting technique is more suitable for alleviating the equal importance assumption in WNB?

  • RQ2. How can IDM appropriately alleviate the normal distribution assumption to improve the performance of NB for SDP?

  • RQ3. What is the defect prediction performance of WNB-ID compared with other prediction methods?

In this paper, our primary contributions are as follows:

  1. The paper proposes a new weighted naive Bayes method based on information diffusion (i.e., WNB-ID). To our knowledge, this is the first paper to integrate feature-weighting techniques and the information diffusion model to boost the potential of NB for SDP.

  2. The paper conducts a comprehensive investigation of feature-weighting techniques for weight assignment, including CS, IG, GR, SU, RF, and IF. They are evaluated on 10 real-world data sets covering three types of projects in three different programming languages. The most appropriate feature-weighting algorithm, i.e., IG, is used to alleviate the equal importance assumption of NB for SDP. Meanwhile, IDM is used to overcome the normal distribution assumption and further improve the performance of WNB for SDP by calculating the probability density of each feature instead of using the probability density function of the normal distribution.

  3. The paper evaluates the effectiveness of WNB-ID via several groups of experiments. A set of experiments is designed to verify the improvement brought by WNB-ID, and the main components of WNB-ID are also verified and analyzed. Classic classification algorithms and a state-of-the-art method are included for comparison. The final experimental results demonstrate the effectiveness of WNB-ID.

This work is based on our earlier work (Wu et al. 2018). However, this paper differs from our previous work in the following aspects:

  • The application backgrounds of our newly proposed method WNB-ID and our previous work are different. We apply WNB-ID to within-project defect prediction (WPDP), while our previous work focuses on cross-project defect prediction (CPDP). In WPDP, the SDP models are trained on data from historical releases of the same project. In CPDP, the SDP models are trained on historical data from other projects.

  • We apply feature-weighting techniques to address the equal importance assumption of NB, which is an extension of our previous work. In our previous work, the features of the data sets were assumed to be equally important. In this paper, we investigate six feature-weighting techniques to address the equal importance assumption in NB, i.e., CS, IG, GR, SU, RF, and IF, and we select the best technique, IG, for our proposed method based on the aggregative indicator F-measure. The detailed feature weights are also listed in the paper, which may be helpful for further studies. We find that the performance of our proposed method WNB-ID is significantly improved after applying feature-weighting techniques. In addition, IF is introduced for the first time to assign feature weights in SDP.

  • We use a more comprehensive set of data sets to evaluate the performance of our proposed method. In the original study, only Java projects were used in our experiments. In this paper, ten real-world data sets of three types of projects in three different programming languages are used to evaluate the performance of our proposed method. These data sets include both balanced and imbalanced data sets and were chosen to verify the generality of our proposed method. The experimental results demonstrate that WNB-ID is effective in the three types of projects on both balanced and imbalanced data sets.

  • A state-of-the-art ensemble method is also included for comparison. In our previous work, we only compared the proposed method with classic classifiers. In this paper, the ensemble learning method STSE is also included for comparison. After comparison, we find that our proposed method WNB-ID is superior to STSE.

  • We provide a more comprehensive evaluation and comparison. In this paper, we first compare WNB (CS), WNB (IG), WNB (GR), WNB (SU), WNB (RF), and WNB (IF) with standard NB to verify whether the feature-weighting techniques can alleviate the equal importance assumption of NB and select the most appropriate feature-weighting technique. Subsequently, IDM is used to alleviate the normal distribution assumption of NB. Finally, we compare our proposed method WNB-ID with classic classifiers (NB, SVM, LR, and RT), its two components (WNB (IG) and NB-ID), and the state-of-the-art method, STSE. In all the experiments, the 10 × 10-fold cross-validation method is used to guarantee the stability of the experiment, and Student’s t test is conducted to further verify the results.

The remainder of this paper is organized as follows: Section 2 summarizes related work. Section 3 describes the new proposed methodology for SDP. Sections 4 and 5 describe the detailed experimental setups and the results of the primary analysis, respectively. Section 6 presents the threats to validity. Discussions about our research are presented in Sect. 7. Finally, we present the conclusions in Sect. 8.

2 Related work

2.1 Software defect prediction

SDP models are designed to improve software quality and reduce the costs of developing software systems. The field developed rapidly after the PROMISE repository (Boetticher et al. 2007) was established in 2005. This repository includes a great number of defect data sets from real-world projects for public use and can be used to build repeatable, comparable models for SDP studies. Research efforts in SDP can be broadly divided into two categories.

The first category is the metrics (i.e., features or attributes) describing a module for SDP model building. The widely used static code metrics have been defined by McCabe (McCabe 1976) and Halstead (Halstead 1977). McCabe metrics represent the complexity of pathways of a module’s flow graph. Halstead metrics use the number of operators and operands to estimate the reading complexity of a module. Many other software metrics have also been proposed for SDP in recent years, including network metrics (Ma et al. 2016a, b), web metrics (Bicer and Diri 2015), cascading style sheet metrics (Bicer and Diri 2016), developer micro interaction metrics (Lee et al. 2016), change metrics (Kim and Zhang 2008), package modularization metrics (Zhao et al. 2015), mutation-based metrics (Bowes et al. 2016), lines of comments (Aman et al. 2015), context-based package cohesion metrics (Zhao et al. 2017), and code smell metrics (Hall et al. 2014). The more complex a module is, the more likely it is to be defective. However, since metrics contribute differently to defect prediction, we should treat them accordingly.

Another significant category is the learning algorithms applied in SDP. SDP can be treated as a learning problem in software engineering (Tantithamthavorn et al. 2017), and many machine learning techniques have been employed in SDP, such as naive Bayes (NB) (Menzies et al. 2007; Turhan et al. 2009; Olague et al. 2007), support vector machine (SVM) (Jin and Liu 2010), logistic regression (LR) (Olague et al. 2007), random tree (RT) (Jagannathan et al. 2009), random forest (Lee et al. 2016), decision trees (Khoshgoftaar and Seliya 2002), and neural networks (Zheng 2010). In recent years, because of its great potential contribution to software organizations, this problem has attracted the attention of a large number of researchers (Malhotra 2015; Lenz et al. 2013). Some improved algorithms have been proposed and used for SDP, such as ensemble learning techniques (Tong et al. 2018; Yang et al. 2017), association rules (Miholca et al. 2018), and multi-objective optimization (Chen et al. 2018). However, Menzies et al. (2007) illustrated that NB may be more suitable for SDP than complicated classifiers. This finding motivates our study.

2.2 Naive Bayes

NB is one of the simplest classifiers based on conditional probability (Hand and Yu 2001). In NB, the features are assumed to be independent, which is why the classifier is termed ‘naïve’. It can usually perform better than more complex classification models (Menzies et al. 2007; Hand and Yu 2001; Zaidi et al. 2013), and it has been used in many learning domains, such as text classification (Tang et al. 2016), web mining (Shirakawa et al. 2015), fraud detection (Ma et al. 2016a, b), and image classification (Vitello et al. 2014). Because of its steady performance in many learning scenarios, NB is still an important research topic. NB has three assumptions, i.e., the equal importance assumption, the normal distribution assumption, and the independence assumption, which have been explained in Sect. 1. However, these assumptions are often violated in real-world applications, leading to suboptimal classification results. To address this issue, researchers have made efforts to alleviate these assumptions, mainly concentrating on the equal importance and independence assumptions via feature-weighting NB and semi-NB. For feature-weighting NB, the weights of highly predictive features should be increased while those of less predictive features should be decreased (Zaidi et al. 2013; Wong 2012; Zhang and Sheng 2005). Following a different course, semi-NB studies have mainly focused on alleviating the independence assumption (Yang et al. 2016; Zheng and Webb 2005).

2.3 Naive Bayes for software defect prediction

According to the statistics in Malhotra’s literature review (Malhotra 2015), Bayesian classifiers are widely used in the domain of SDP, accounting for 47.7% of all SDP studies. However, most of these studies (74%) use standard NB, while only 5% use weighted naive Bayes (WNB), and few works have attempted to alleviate the normal distribution assumption of NB. These statistics show that room exists to improve NB by alleviating the equal importance and normal distribution assumptions in the field of SDP.

Menzies et al. (2007) demonstrated that NB with log filtering and feature selection obtained the best results. In another study, Turhan and Bener (2009) alleviated the independence assumption using principal component analysis and the equal importance assumption using a feature-weighting technique; the results were better than those of standard NB. Similar research was conducted by Arar and Ayan (2017), with results more successful than those of standard NB.

For feature-weighting NB, the weights of highly predictive features should be increased, while those of less predictive features should be decreased (Zaidi et al. 2013; Wong 2012; Zhang and Sheng 2005). There are a number of feature-weighting algorithms that can be used to measure the importance of features, such as Chi square (CS), information gain (IG), gain ratio (GR), symmetrical uncertainty (SU), ReliefF (RF), and information flow (IF) (Liang 2014; Bai et al. 2018). Some studies have used them in SDP. For example, Turhan and Bener (2007) used IG, GR, and principal component analysis (PCA) to alleviate the equal importance assumption and found that assigning weights to software metrics significantly increased the prediction performance. Hosseini et al. (2018) used IG for feature selection to extend their previous work, and they found that IG is effective in their proposed method. Zhou et al. (2017) employed CS to perform feature selection to address the class imbalance of data sets. The experimental results demonstrated that CS is helpful in improving the performance of their proposed method. In addition, IF is a newly proposed method in physics and has achieved good results in many areas (Kaufman et al. 2011; Macias et al. 2016). IF has never been employed for feature-weighting in SDP.

Although these feature-weighting techniques can improve the performance of NB in SDP, we still need to identify the one that yields the most appropriate weights for NB. Another issue that has received little attention in SDP research is the normal distribution assumption, which is also addressed in this paper.

3 Proposed methodology for software defect prediction

This section presents the weighted naive Bayes method based on information diffusion (WNB-ID), which combines weighted naive Bayes (WNB) and the information diffusion model (IDM) to alleviate the equal importance and normal distribution assumptions in NB for SDP. In particular, for WNB, we apply different feature-weighting techniques with the purpose of identifying the most appropriate technique for assigning feature weights. For IDM, we use it to estimate the probability density of each feature instead of the normal distribution probability density function. Finally, we propose a novel IDM-based WNB algorithm that improves the performance of NB for SDP.

3.1 The entire framework

Figure 1 illustrates the entire framework of the proposed method WNB-ID. The first stage is the application of WNB, which aims to alleviate the equal importance assumption. The other stage is the application of IDM, which aims to alleviate the normal distribution assumption.

Fig. 1
figure 1

The entire framework of the WNB-ID method

In our method, the WNB stage is performed before the IDM stage. Specifically, the former applies different feature-weighting techniques in WNB to assign appropriate weights to the features, while the latter performs the probability density estimation via the diffusion function of IDM.

3.2 Application of weighted naive Bayes

In this stage, we apply the WNB method to alleviate the equal importance assumption in NB. The idea of WNB is to increase the weights of highly predictive features and decrease those of less predictive features (Zaidi et al. 2013; Wong 2012; Zhang and Sheng 2005). The WNB method uses a specific algorithm to measure the degree of importance of the features. The expected result of this stage is a set of feature weights that can help WNB improve its performance.

There are many algorithms that can measure the degree of importance of features in WNB, such as Chi square (CS) (Plackett 1983), information gain (IG) (Witten et al. 2011), gain ratio (GR) (Quinlan 1993), symmetrical uncertainty (SU) (Yu and Liu 2003), ReliefF (RF) (Kira and Rendell 1992; Robnikšikonja and Kononenko 2003), and information flow (IF) (Liang 2014; Bai et al. 2018). Thus, it is necessary to determine which algorithm to use in WNB-ID based on the F-measure. For the readers’ convenience, we explain the details of these algorithms in the following subsections.

3.2.1 Weighted naive Bayes

In this subsection, we present a complete derivation of WNB referring to Zhang and Sheng (2005).

According to Bayesian theory (Witten et al. 2011), the posterior probability of an instance is proportional to the prior probability and the likelihood, and the features (i.e., attributes or metrics) are assumed to be independent in NB. More formally, let A = {A1, A2, …, An} be a set of software features, X = {(A1, a1), (A2, a2), …, (An, an)} be a software module, and a = {ai | i = 1, 2, …, n} be the associated values. The class of a software module is then defined as C ∈ {C0, C1}, where C0 is the non-defective class and C1 is the defective one. Therefore, the probability that a software module belongs to a specific class can be computed by Eq. (1):

$$ P\left({C}_k|X\right)=\frac{P\left({C}_k\right)\prod \limits_{i=1}^nP\left({a}_i|{C}_k\right)}{P(X)} $$
(1)

We use training defect data sets to train the classifier and compute argmaxk(P(Ck|X)), which determines the class of a new software module. As shown in Eq. (2), the classifier assigns software module X to the class with the higher posterior score. Please note that the denominator P(X) in Eq. (1) is the same for all classes and does not affect the classification; thus, it is removed in Eq. (2).

$$ V(X)={\mathrm{argmax}}_k\left(P\left({C}_k\right)\prod \limits_{i=1}^nP\left({a}_i|{C}_k\right)\right) $$
(2)

To alleviate the equal importance assumption in NB, feature weights are introduced. Thus, we update Eq. (2) as in Eq. (3), and this model is called the weighted naive Bayes (WNB).

$$ V(X)={\mathrm{argmax}}_k\left(P\left({C}_k\right)\prod \limits_{i=1}^nP{\left({a}_i|{C}_k\right)}^{w_i}\right) $$
(3)

where wi refers to the weight of feature Ai.

P(Ck) in Eq. (3) is computed using Eq. (4),

$$ P\left({C}_k\right)=\frac{N_k}{N} $$
(4)

where Nk refers to the number of software modules that belong to class k and N is the total number of software modules in the training data sets.

In NB, numeric features are assumed to have a normal distribution (Witten et al. 2011). Therefore, P(ai|Ck) in Eq. (3) can be calculated by Eq. (5),

$$ P\left({a}_i|{C}_k\right)=\frac{1}{\sqrt{2\pi }{\sigma}_i}{e}^{\frac{-{\left({a}_i-{\mu}_i\right)}^2}{2{\sigma}_i^2}} $$
(5)

where μi and σi are the mean and the standard deviation of feature Ai over the training modules belonging to class k.

The above briefly describes the principle of WNB. From this derivation, we can see that the value of wi, which is based on the degree of importance of the features, is crucial to WNB. Thus, we perform experiments to identify the most suitable among the existing feature-weighting techniques based on the F-measure; these techniques are illustrated in the following subsections. Another significant problem is the normal distribution assumption shown in Eq. (5), which has been shown experimentally not to hold. This problem is discussed in Sect. 3.3.

3.2.2 Chi square

The CS test (Plackett 1983) is used to evaluate whether there is a significant relationship between the distribution of the values of a feature and the class. The null hypothesis is that there is no correlation between them. Given this null hypothesis, the χ2 statistic computes the degree of deviation of the actual values from the expected values (Plackett 1983). In this paper, we use the χ2 value of each feature to measure its degree of importance. The larger the χ2 value is, the more strongly the feature is associated with the class. Therefore, the weight wi of feature Ai can be computed by Eq. (6), where i = 1, 2, …, n, and n is the number of features.

$$ {w}_i=\frac{\chi^2\ \mathrm{statistic}(i)\times n}{\sum {\chi}^2\ \mathrm{statistic}(i)} $$
(6)
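As an illustration of this normalization, the sketch below computes CS-based weights in Python, with scikit-learn’s chi2 scorer standing in for the authors’ MATLAB implementation; it assumes non-negative feature values, which holds for the usual code metrics.

```python
import numpy as np
from sklearn.feature_selection import chi2

def cs_weights(X, y):
    """Eq. (6): scale the per-feature chi-square statistics so that
    the n weights sum to n (i.e., their mean is 1)."""
    stats_, _ = chi2(X, y)            # chi-square statistic per feature
    n = X.shape[1]
    return stats_ * n / stats_.sum()
```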

3.2.3 Information gain

IG is based on the concept of entropy from information theory (Witten et al. 2011). Entropy is a measure of the uncertainty of a random variable. IG measures the decrease in the impurity of the whole data set after the value of a feature A is given. The larger the IG value is, the more likely it is that the feature is relevant to the class. Turhan and Bener (2007) use this method for feature weighting. Menzies et al. (2007) use IG to perform feature ranking. Li (2012) describes the principle of IG in detail. In this paper, the weight wi of a feature can be written as follows, where i = 1, 2, …, n, n is the number of features, and D refers to the training data set.

$$ {w}_i=\frac{ig\left(D,i\right)\times n}{\sum ig\left(D,i\right)} $$
(7)

However, IG has a drawback in that it tends to prefer features with a large number of distinct values. More specifically, if a feature has many distinct values, ig(D, A) tends to be larger than the feature’s actual predictive value warrants. GR and SU are two methods used to overcome this problem.
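A small sketch of Eq. (7) is given below; since the code metrics are numeric, the features are first discretized into equal-width bins, a preprocessing choice of ours that the paper does not prescribe.

```python
import numpy as np

def entropy(labels):
    """Shannon entropy of a discrete label vector."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def info_gain(feature, y, bins=10):
    """ig(D, i): reduction in class entropy after splitting on the
    (discretized) feature."""
    cut = np.digitize(feature, np.histogram_bin_edges(feature, bins=bins))
    conditional = sum((cut == v).mean() * entropy(y[cut == v])
                      for v in np.unique(cut))
    return entropy(y) - conditional

def ig_weights(X, y):
    """Eq. (7): w_i = ig(D, i) * n / sum_j ig(D, j)."""
    ig = np.array([info_gain(X[:, i], y) for i in range(X.shape[1])])
    return ig * X.shape[1] / ig.sum()
```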

3.2.4 Gain ratio

GR (Quinlan 1993) can be used to alleviate IG’s bias towards attributes with more values by penalizing features with a large number of distinct values. The larger the GR value is, the more likely it is that the feature is relevant to the class. Zhang and Sheng (2005) use GR to assign feature weights. In this paper, the weight wi of a feature in Eq. (3) can be written as follows:

$$ {w}_i=\frac{gr\left(D,i\right)\times n}{\sum gr\left(D,i\right)} $$
(8)

where i = 1, 2, …, n, n is the number of features and D refers to the training data set.

3.2.5 Symmetrical uncertainty

SU is another way to alleviate the limitation of IG; it normalizes the IG value by the sum of the entropies of the feature and the class (Yu and Liu 2003). Thus, the weight wi of a feature in Eq. (3) can be written as follows:

$$ {w}_i=\frac{su\left(D,i\right)\times n}{\sum su\left(D,i\right)} $$
(9)

where i = 1, 2, …, n, n is the number of features and D refers to the training data set.

3.2.6 ReliefF (RF)

ReliefF is an instance-based feature-weighting technique that was introduced by Kira and Rendell (1992) and Robnikšikonja and Kononenko (2003), and it can also be used to reveal the relationship between the features and the class. ReliefF can measure how well a feature differentiates instances from different classes by searching for the nearest neighbors of the instances from both the same and different classes (Kononenko 1994). The larger the RF value is, the more likely it is that the feature is relevant to the class. Therefore, the weight of a feature wi in Eq. (3) can be written as follows, where i = 1, 2, …, n, n is the number of features and D refers to the training data set.

$$ {w}_i=\frac{rf\left[i\right]\times n}{\sum rf\left[i\right]} $$
(10)

3.2.7 Information flow

Considering that the modules of a software release are completed in time sequence, the software attributes can accordingly be regarded as pseudo time series (PST). Specifically, as shown in Table 2, one software release is composed of a number of modules or classes that are completed in time sequence; thus, the numerical values of these modules for a specific feature can be seen as a PST. We now explain how to use IF to reveal the relationship between the features and the class.

Table 2 Brief illustration of PST

Given two PSTs X1 and X2, Liang (2014) proved the maximum likelihood estimator of the rate of the IF from X2 to X1 as follows:

$$ {T}_{2\to 1}=\frac{C_{11}{C}_{12}{C}_{2,d1}-{C}_{12}^2{C}_{1,d1}}{C_{11}^2{C}_{22}-{C}_{11}{C}_{12}^2} $$
(11)

where Cij denotes the covariance between Xi and Xj. In this paper, X1 always represents the PST of the class, and X2 denotes that of a software attribute. Ci,dj is determined as follows. Let \( {\dot{X}}_j \) be the finite-difference approximation of dXj/dt using the Euler forward scheme, \( {\dot{X}}_{j,n}=\left({X}_{j,n+k}-{X}_{j,n}\right)/\left(k\varDelta t\right) \), with k = 1 or k = 2 (the details of how to determine k are provided by Liang (2014)) and Δt being the time step. Ci,dj in Eq. (11) is the covariance between Xi and \( {\dot{X}}_j \). Ideally, if T2 → 1 = 0, then X2 does not cause X1; otherwise, the relation is causal. To quantify the relative importance of a detected causality, Bai et al. (2018) developed an approach to normalize the IF (NIF); the associated MATLAB implementation is available at https://cn.mathworks.com/matlabcentral/fileexchange/62471-the-normalized-information-flow. In this study, we employ this approach. The closer the NIF is to 1, the more likely it is that the feature is relevant to the class. Therefore, the weight wi of a feature in Eq. (3) can be calculated as follows, where i = 1, 2, …, n, and n is the number of features.

$$ {w}_i=\frac{NIF\left[i\right]\times n}{\sum NIF\left[i\right]} $$
(12)
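The sketch below evaluates the estimator of Eq. (11) directly from sample covariances; the normalization into NIF (Bai et al. 2018) and the weight normalization of Eq. (12) are not reproduced here, and the unit time step is our assumption.

```python
import numpy as np

def info_flow_rate(x2, x1, k=1, dt=1.0):
    """Eq. (11): maximum-likelihood rate of information flow from the
    pseudo time series x2 (a feature) to x1 (the class), after Liang (2014)."""
    dx1 = (x1[k:] - x1[:-k]) / (k * dt)       # Euler forward difference of x1
    a, b = x1[:-k], x2[:-k]                   # align the series with dx1
    c = np.cov(np.vstack([a, b]))
    c11, c12, c22 = c[0, 0], c[0, 1], c[1, 1]
    c1d1 = np.cov(np.vstack([a, dx1]))[0, 1]  # Cov(X1, dX1/dt)
    c2d1 = np.cov(np.vstack([b, dx1]))[0, 1]  # Cov(X2, dX1/dt)
    return (c11 * c12 * c2d1 - c12**2 * c1d1) / (c11**2 * c22 - c11 * c12**2)
```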

3.3 Application of the information diffusion model

As discussed in Sect. 3.2.1, the normal distribution probability density function used in Eq. (5) for computing the likelihood was experimentally demonstrated to be inappropriate in Sect. 1. Thus, in this stage, we apply the information diffusion model (IDM) to compute the feature probability density instead of the normal distribution probability density function in Eq. (5), thereby alleviating the normal distribution assumption. Finally, we obtain a novel classifier, which we name the weighted naive Bayes method based on information diffusion (WNB-ID). For the readers’ convenience, we first briefly review several basic concepts of IDM, referring to Huang (1997) and Bai et al. (2015).

3.3.1 Information diffusion model

Information diffusion makes the following affirmation: when a knowledge sample is given, it can be used to estimate a relationship between the population and the sample. Given an instance set X = (x1, x2, …, xm) with universe of discourse U, if and only if X is incomplete, there must exist a reasonable information diffusion function μ(xi, u), u ∈ U, that can accurately compute the real relationship. This is the principle of information diffusion (Huang 1997). Let xi (i = 1, 2, …, m) be independent and identically distributed instances drawn from a population with density p(x). Suppose μ(y) is a Borel measurable function in (−∞, ∞) and d > 0 is a constant. Then

$$ \widehat{f}(x)=\frac{1}{md}{\sum}_{j=1}^m\mu \left(\frac{x-{x}_j}{d}\right) $$
(13)

is an information diffusion estimator of p(x), where μ(y) is the diffusion function (Huang 1997). Inspired by nearby criteria and acoustic wave propagation (i.e., the equation of the vibrating string), Bai et al. (2015) defined the diffusion function in Eq. (14), Eq. (15), and Eq. (16):

$$ \mu (y)=\frac{1}{2}{e}^{-\frac{\mid y\mid }{d}} $$
(14)
$$ d=\frac{k}{m-1}\left[\max (x)-\min (x)\right] $$
(15)
$$ \widehat{f}(x)=\frac{1}{2md}{\sum}_{j=1}^m\exp \left(-\frac{\mid x-{x}_j\mid }{d^2}\right) $$
(16)

The value of k can be obtained from Table 3.

Table 3 The values of k

Equation (16) is defined as IDM or an information diffusion estimator about p(x).

When we treat the training data as incomplete data, IDM can accurately estimate the probability density function for a non-normal or normal distribution. In the next section, we combine WNB with IDM.
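A minimal sketch of this estimator follows, implementing Eqs. (14)–(16); the diffusion parameter k is looked up from Table 3 in the paper and is treated here as a user-supplied argument.

```python
import numpy as np

def idm_density(x, samples, k):
    """Information diffusion estimate of the density at x from the m
    observed values of one feature (Eqs. (14)-(16))."""
    samples = np.asarray(samples, dtype=float)
    m = len(samples)
    d = k / (m - 1) * (samples.max() - samples.min())               # Eq. (15)
    return np.exp(-np.abs(x - samples) / d**2).sum() / (2 * m * d)  # Eq. (16)
```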

3.3.2 The proposed method WNB-ID

As stated in the previous sections, the features of software modules provide different contributions to the classification, and the numeric features do not always have a normal distribution. Therefore, we use WNB-ID to improve NB in this section.

Let A = {A1, A2, …, An} be a set of software features, X = {(A1, a1), (A2, a2), …, (An, an)} be a software module, and a = {ai | i = 1, 2, …, n} be the associated values. Let xj (j = 1, 2, …, m) be the m training software modules, where xj,i refers to the value of feature Ai in software module xj.

The class of a software module is defined as C ∈ {C0, C1}, where C0 is the non-defective class and C1 is the defective one. According to Sect. 3.2.1, whether a software module X belongs to C0 or C1 is determined by Eq. (3), and the prior probability P(Ck) in Eq. (3) is computed by Eq. (4).

There are two differences between NB and WNB-ID.

  1. We investigate six weight assignment methods to obtain the best wi in Eq. (3), while NB regards the features as equally important.

  2. The likelihood P(ai|Ck) in Eq. (3) can be computed by Eq. (17).

$$ P\left({a}_i|{C}_k\right)=\frac{1}{2m{d}_i}{\sum}_{j=1}^m\exp \left(-\frac{\mid {a}_i-{x}_{j,i}\mid }{{d_i}^2}\right) $$
(17)

where di is the diffusion coefficient of feature Ai, obtained via Eq. (15). The process of WNB-ID is described by the following algorithm.

(Figure: the WNB-ID algorithm)
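To make the procedure concrete, the following Python sketch implements a classifier along these lines: class priors from Eq. (4), externally supplied feature weights (e.g., from the IG sketch in Sect. 3.2.3), diffusion coefficients from Eq. (15), and the decision rule of Eq. (3) with the likelihood of Eq. (17). Estimating the density separately per class, evaluating in log space, and adding a small smoothing constant are our implementation choices, not details fixed by the paper.

```python
import numpy as np

class WNBID:
    """Sketch of a WNB-ID classifier for binary defect data (labels 0/1)."""

    def __init__(self, weights, k=2.0):
        self.weights = np.asarray(weights, dtype=float)  # w_i from a feature-weighting technique
        self.k = k                                       # diffusion parameter (Table 3)

    def fit(self, X, y):
        self.classes_ = np.unique(y)
        self.priors_ = {c: np.mean(y == c) for c in self.classes_}   # Eq. (4)
        self.samples_ = {c: X[y == c] for c in self.classes_}
        # one diffusion coefficient per feature and class, Eq. (15)
        self.d_ = {c: self.k / (len(s) - 1) * (s.max(axis=0) - s.min(axis=0)) + 1e-12
                   for c, s in self.samples_.items()}
        return self

    def _score(self, x, c):
        s, d = self.samples_[c], self.d_[c]
        m = len(s)
        # Eq. (17), evaluated feature by feature
        dens = np.exp(-np.abs(x - s) / d**2).sum(axis=0) / (2 * m * d)
        # Eq. (3) in log space: log P(C_k) + sum_i w_i * log P(a_i | C_k)
        return np.log(self.priors_[c]) + np.sum(self.weights * np.log(dens + 1e-12))

    def predict(self, X):
        scores = np.array([[self._score(x, c) for c in self.classes_] for x in X])
        return self.classes_[np.argmax(scores, axis=1)]
```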

4 Experimental setup

In the experimental setup, we design experiments to validate the effectiveness and practicality of the proposed method WNB-ID. First, we describe the data sets used in the experiments, the 10 × 10-fold cross-validation method, the evaluation measures, and Student’s t test. Then, on the basis of the research questions, we present the experimental design.

4.1 Data collection

In our experiments, the original 10 releases are listed in Table 4, where #Modules and #DP are the number of instances and the number of defects, respectively, %DP is the ratio of defective modules to all modules, #Language is the project type, and #Features is the number of metrics. Based on the proposals of He et al. (2015) and Rathore and Kumar (2017), the 10 releases are randomly chosen from eight real software applications in different programming languages provided by the PROMISE repository (Boetticher et al. 2007). The first seven releases are Java projects, and the last three releases are C or C++ projects. The 21 software features of the Java projects (20 code metric features and a dependent feature BUG) are listed in Table 5. Since BUG is a numeric feature, we transform it into a binary class label in the following experiments. More specifically, a module is non-defective (denoted by 0) only if its number of bugs is 0; otherwise, it is defective (denoted by 1). The features of the C and C++ projects are listed in Table 6. Please note that the features shown in boldface in Table 6 are the 22 features of CM1.

Table 4 Details of the 10 data sets
Table 5 Descriptions of the features of the Java data sets
Table 6 Descriptions of the features of the C and C++ data sets

4.2 The 10 × 10-fold cross-validation method

As shown in Fig. 2, we use 10 × 10-fold cross-validation in all experiments with reference to Song et al. (2011). More specifically, we divide a data set into 10 bins, of which 9 bins are used for training and 1 for testing; all 10 bins come from the same data set. This process is run 10 times so that each bin is used once for testing, minimizing sampling bias. On the other hand, to reduce ordering effects, each holdout experiment is repeated 10 times, and the ordering of the data records in each iteration is randomized to achieve reliable statistics. Therefore, we perform a total of 10 × 10 = 100 experiments for each method, and our reported results represent the mean of these 100 experiments.
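Under the assumption that the classifier exposes fit/predict methods (like the WNB-ID sketch in Sect. 3.3.2), this protocol might be scripted as follows; scikit-learn’s RepeatedKFold covers both the 10 folds and the 10 shuffled repetitions.

```python
import numpy as np
from sklearn.model_selection import RepeatedKFold
from sklearn.metrics import f1_score

def run_10x10cv(model, X, y, seed=0):
    """10 x 10-fold cross-validation: 100 train/test runs in total;
    returns the mean F-measure, as reported in the paper's tables."""
    rkf = RepeatedKFold(n_splits=10, n_repeats=10, random_state=seed)
    scores = [f1_score(y[test], model.fit(X[train], y[train]).predict(X[test]))
              for train, test in rkf.split(X)]
    return np.mean(scores)
```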

Fig. 2
figure 2

10 × 10 cross-validation method

4.3 Evaluation measures

In this paper, the evaluation of the prediction results is based on the F-measure. The F-measure combines Precision and Recall into a single accuracy measure, computed as their harmonic mean:

$$ F=\frac{2\times \mathrm{Precision}\times \mathrm{Recall}}{\mathrm{Precision}+\mathrm{Recall}} $$
(18)

where precision and recall are two accuracy indicators, and

$$ \mathrm{Precision}=\frac{\mathrm{TP}}{\mathrm{TP}+\mathrm{FP}}\kern0.5em \mathrm{Recall}=\frac{\mathrm{TP}}{\mathrm{TP}+\mathrm{FN}} $$
(19)

In Eq. (19), TP (i.e., true positive) and TN (i.e., true negative) denote correctly classified defective and non-defective modules, respectively. FP (i.e., false positive) refers to non-defective modules that are incorrectly classified as defective, while FN (i.e., false negative) refers to defective modules that are incorrectly classified as non-defective.
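As a worked example of Eqs. (18) and (19): with TP = 30, FP = 10, and FN = 20, Precision = 0.75, Recall = 0.6, and F = 2 × 0.75 × 0.6 / 1.35 ≈ 0.667, as the small helper below computes.

```python
def f_measure(tp, fp, fn):
    """Eqs. (18)-(19): the harmonic mean of precision and recall."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

print(f_measure(30, 10, 20))   # 0.666...
```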

4.4 Student’s t test

In this paper, the performance index of each SDP model is the mean of the 100 experimental results of 10 × 10-fold cross-validation. Therefore, we utilize Student’s t test to verify whether the difference between two performance indices is due only to chance. More specifically, Student’s t test here refers to the independent two-sample t test, the most common form of the test. It examines the significance of the difference between the means of two sets of data. The null hypothesis is that the means of the two sets are the same and any difference between them is due only to chance, while the alternative hypothesis is that the two means differ. The significance level for Student’s t test is α = 0.05 (i.e., a confidence level of 95%). If the p value of the test is less than or equal to α, the null hypothesis is rejected in favor of the alternative; otherwise, the null hypothesis is retained.
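In code, the test might look as follows, with scipy’s ttest_ind (whose default settings give the classic equal-variance Student’s t test) applied to the two vectors of 100 F-measure values.

```python
from scipy import stats

def significantly_different(f_a, f_b, alpha=0.05):
    """Independent two-sample Student's t test on two sets of
    10 x 10-fold CV F-measure values; True means the null hypothesis
    of equal means is rejected at level alpha."""
    _, p = stats.ttest_ind(f_a, f_b)
    return p <= alpha
```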

4.5 Experimental design

4.5.1 Weighted naive Bayes methods

In the WNB application stage, we use the WNB method, which measures the degree of importance via feature-weighting techniques, to alleviate the equal importance assumption in NB for SDP. To identify which feature-weighting technique performs best in WNB, we conduct comparative experiments among CS, IG, GR, SU, RF, and IF, which have been explained in Sect. 3.2. For each feature-weighting technique on a given data set, we conduct 10 × 10-fold cross-validation experiments and utilize Student’s t test at a confidence level of 95% for the statistical significance test. Finally, the F-measure is used to choose the best feature-weighting technique in WNB. In addition to the six types of WNB methods, standard NB is also included for comparison.

For CS, IG, GR, and SU, we implement these four feature-weighting algorithms in MATLAB; they have no extra parameters that need to be set. For RF, we use the “relieff” function in the MATLAB toolbox. Parameter k is set to 10 in RF (i.e., the number of nearest hits or misses) and to 1 in IF (the forward-difference step of Sect. 3.2.7). Parameter m (i.e., the number of iterations of the RF algorithm) defaults in MATLAB to the number of modules in the data set.

Standard NB is one of the simplest classifiers based on conditional probability (Hand and Yu 2001). In this classifier, the features are assumed to be independent, which is why the classifier is termed ‘naive’. In practice, the NB classifier often performs better than more sophisticated classifiers, although the independence assumption is often violated (Menzies et al. 2007; Hand and Yu 2001; Zaidi et al. 2013). The prediction model built by this classifier is a set of probabilities; the probability that a new instance is defective is calculated from the individual conditional probabilities of the instance’s feature values. We use standard NB with the “NaiveBayes.fit” and “predict” functions in MATLAB.

4.5.2 Information diffusion model method

This experiment is designed to verify whether IDM can alleviate the normal distribution assumption of NB; standard NB is included for comparison. To facilitate the experiment, we name this variant NB-ID. In this experiment, we use the diffusion function of IDM to calculate the probability density of each feature instead of the default probability density function of the normal distribution in NB (i.e., we use Eq. (17) instead of Eq. (5)). Please note that feature-weighting techniques are not used in this experiment. We also conduct 10 × 10-fold cross-validation experiments for NB-ID and standard NB and verify whether NB-ID can improve the performance of NB on each data set.

4.5.3 Classification models

In this experiment, we implement the proposed method WNB-ID, which combines WNB and IDM to alleviate the equal importance and normal distribution assumptions in NB. We apply four other commonly used classification models for software defect prediction for comparison: standard naive Bayes (NB), support vector machine (SVM), logistic regression (LR), and random tree (RT). Moreover, WNB, NB-ID, and the state-of-the-art approach are also included for comparison. We again conduct 10 × 10-fold cross-validation experiments for each prediction model on each data set.

SVM is a supervised learning algorithm that is typically applied in classification and regression analyses. It searches for the optimal hyperplane that maximally separates the instances into two categories (Jin and Liu 2010). We use SVM with the “svmtrain” and “svmclassify” functions in MATLAB.

LR is a probabilistic statistical regression model for classification that fits the data to a logistic curve (Olague et al. 2007). It can be used in binary classification to predict a binary response. In SDP, the labeled feature (either defective or not) is binary; thus, LR is suitable for SDP. In this paper, we use LR with the “glmfit” and “glmval” functions in MATLAB.

RT, as a hypothesis space for the supervised learning algorithm, is one of the simplest hypothesis spaces possible (Jagannathan et al. 2009). It consists of two parts: a schema and a body. The schema is a set of features, and the body is a set of labeled instances. In this paper, we use RT with the “ClassificationTree.fit” and “predict” functions in MATLAB.

4.5.4 The ensemble method

We first present a general idea of how ensemble methods perform SDP. Let L = ϕ(F) denote the learning function, where L = {li}, i = 1, 2, …, n, is the BUG attribute (see Tables 5 and 6) and F = {fi | (f1, f2, …, fp)}, i = 1, 2, …, n, is the set of features. The training data set S is used to train each learning algorithm, determining the parameters associated with L. The learning techniques then produce the corresponding L values for the unknown test data set T. Let these corresponding outputs be E = {ei}, i = 1, 2, …, m. The ensemble method combines these estimated values in the linear form:

$$ P=\sum \limits_{i=1}^m{w}_i{e}_i $$
(20)

where wi are the weights derived from the base learners. The classification of the test data set T is determined by P.

In this paper, a state-of-the-art ensemble method (Tong et al. 2018) is also included for comparison; it is denoted as STSE for simplicity. STSE consists of two stages: a deep learning stage and an ensemble learning stage. In the deep learning stage, stacked denoising autoencoders are used to extract deep representations from the software metrics. In the ensemble learning stage, three strong classifiers, i.e., bagging, random forest, and AdaBoost, are used as base learners, and the weighted average of their probabilities is used as the combining rule. The details of the combination approach can be found in Sect. 3 of Tong et al. (2018).
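The combining rule of Eq. (20) alone is simple to express; in the sketch below, the base learners’ estimated defective-class probabilities are stacked column-wise, and the 0.5 decision threshold is our assumption (STSE’s full pipeline is described in Tong et al. (2018)).

```python
import numpy as np

def combine_base_learners(base_probs, weights):
    """Eq. (20): P = sum_i w_i * e_i, where column i of base_probs holds
    base learner i's estimated probability of the defective class."""
    p = base_probs @ np.asarray(weights, dtype=float)
    return (p >= 0.5).astype(int)   # assumed decision threshold
```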

5 Experimental results

In this section, based on the experimental results, we study three research questions.

5.1 RQ1: Which feature-weighting technique is more suitable for alleviating the equal importance assumption in WNB?

In the WNB application stage, we use the WNB method to alleviate the equal importance assumption of NB. To identify which feature-weighting technique performs better, we conduct experiments comparing the six feature-weighting techniques, i.e., CS, IG, GR, SU, RF, and IF. We conduct 10 × 10-fold cross-validation experiments for each feature-weighting technique on each data set. In each run, we first calculate the weight of each feature on the basis of the nine training bins and then run the WNB algorithm with these feature weights on the remaining testing bin. Finally, after determining the performance indices of each feature-weighting technique, we perform Student’s t test at a confidence level of 95% to determine the statistical significance and thus the better feature-weighting method based on the F-measure. Table 7 shows the experimental results.

Table 7 Performances of different feature-weighting techniques

From Table 7, we observe that all six WNB methods perform better than standard NB, with a higher F-measure on average. This finding demonstrates that WNB methods can alleviate the equal importance assumption of NB for SDP. We also find that WNB(IG) performs better than the other five feature-weighting techniques, with the highest F-measure on average. To further verify this result, we conduct Student’s t test at a confidence level of 95% for each release. The F-measure values derived from the different feature-weighting techniques for each release are used in the tests. The test results are shown in Tables 8, 9, 10, 11, 12, 13, 14, 15, 16, and 17. In each row of Table 7, values in boldface have no significant difference. Taking Ivy-1.1 as an example, since WNB(RF) has the highest F-measure, 0.7347, as shown in Table 7, we conduct Student’s t test between the performance of WNB(RF) and those of the other methods, as shown in Table 8. Table 8 shows that the p value of RF × IG is 0.0913, while the remainder are all less than 0.05. The results reveal no significant difference between the performances of WNB(RF) and WNB(IG), while the performance of WNB(RF) is significantly different from the others. Table 7 and Tables 8, 9, 10, 11, 12, 13, 14, 15, 16, and 17 demonstrate that WNB(IG) has the highest F-measure on average, and although the F-measure of WNB(IG) is not the highest in Ivy-1.1, Lucene-2.4, Poi-1.5, Xalan-2.6, and CM1, it consistently shows no significant difference from the highest one. Therefore, we select IG as the best feature-weighting technique in WNB. Table 18 shows the average feature weights measured by IG for the Java data sets in the 10 × 10-fold cross-validation experiments. The feature weights of the C and C++ projects are shown in Table 19. The data set CM1 has the 22 features (21 code metric features and a dependent feature, defective) shown in boldface in Table 6, while the data sets MW1 and PC4 have 38 features (37 code metric features and a dependent feature, defective). Therefore, the weights of the features in Table 19 that do not belong to CM1 are empty.

Table 8 Student’s t test for Ivy-1.1 (the value in boldface is greater than 0.05)
Table 9 Student’s t test for Lucene-2.2
Table 10 Student’s t test for Lucene-2.4
Table 11 Student’s t test for Poi-1.5
Table 12 Student’s t test for Poi-2.5
Table 13 Student’s t test for Velocity-1.6
Table 14 Student’s t test for Xalan-2.6
Table 15 Student’s t test for CM1
Table 16 Student’s t test for MW1
Table 17 Student’s t test for PC4
Table 18 Average feature weights of the Java data sets measured by IG in the 10 × 10-fold cross-validation experiments
Table 19 Average feature weights of the C and C++ projects measured by IG in the 10 × 10-fold cross-validation experiments

For the readers’ convenience, we have plotted the distribution of the cumulative feature weights of the Java projects in Fig. 3, from which we observe that the features ‘RFC’, ‘LOC’, ‘LCOM3’, ‘AMC’, ‘WMC’, and ‘NPM’ have the six highest cumulative weights. Thus, they have the highest degree of importance and the most discriminative power according to the principle of IG. In contrast, the features ‘NOC’, ‘IC’, ‘CA’, ‘MOA’, ‘MAX_CC’, and ‘AVG_CC’ have the six lowest cumulative weights, i.e., the lowest degree of importance in the classification.

Fig. 3
figure 3

The distribution of cumulative feature weights on the Java data sets

These experiments demonstrate that WNB can effectively alleviate the equal importance assumption of NB, and IG is selected as the most appropriate feature-weighting technique.

5.2 RQ2. How can IDM appropriately alleviate the normal distribution assumption to improve the performance of NB for SDP?

To investigate the effectiveness of IDM in alleviating the normal distribution assumption of NB, we first calculate the probability density for each feature by using the diffusion function of IDM and then apply this new probability density in NB. We name it NB-ID in this experiment. We then conduct comparison experiments with standard NB using 10 × 10-fold cross-validation on the 10 data sets.

Table 20 and Figs. 4, 5, 6, 7, 8, 9, 10, 11, 12, and 13 show the experimental results. Each value in Table 20 represents the mean of 100 experimental results. “R”, “P”, and “F” in Figs. 4, 5, 6, 7, 8, 9, 10, 11, 12, and 13 refer to Recall, Precision, and F-measure, respectively. Each figure in Figs. 4, 5, 6, 7, 8, 9, 10, 11, 12, and 13 depicts the Recall, Precision, and F-measure of 100 experimental results based on their quartiles. From the bottom to the top of a standardized box plot: minimum, first quartile, median, third quartile, and maximum. Any outlier data are plotted as a small cross. Taking Fig. 4 as an example, considering the red median line, we find that NB-ID performs better than NB in Recall, Precision, and F-measure. Compared with standard NB, NB-ID performs better in Recall, Precision, and F-measure in the data sets of Ivy-1.1, Poi-1.5, Poi-2.5, Xalan-2.6, CM1, MW1, and PC4. In the data sets of Lucene-2.2, Lucene-2.4, and Velocity-1.6, NB-ID performs better in Recall and F-measure, while standard NB performs better in Precision. In brief, NB-ID can always obtain better F-measure statistics than the standard NB, demonstrating the ability of IDM to alleviate the normal distribution assumption of NB, considering that the F-measure integrates Recall and Precision in a single indicator.

Table 20 Comparison between NB-ID and standard NB on 10 data sets
Fig. 4
figure 4

The standardized boxplots of the performances of NB-ID and standard NB on the data set Ivy-1.1 From the bottom to the top of a standardized box plot: minimum, first quartile, median, third quartile, and maximum. Any outlier data are plotted as a small cross. “R”, “P”, and “F” refer to Recall, Precision, and F-measure, respectively

Fig. 5
figure 5

The standardized boxplots of the performances of NB-ID and standard NB on the data set Lucene-2.2

Fig. 6
figure 6

The standardized boxplots of the performances of NB-ID and standard NB on the data set Lucene-2.4

Fig. 7
figure 7

The standardized boxplots of the performances of NB-ID and standard NB on the data set Poi-1.5

Fig. 8
figure 8

The standardized boxplots of the performances of NB-ID and standard NB on the data set Poi-2.5

Fig. 9
figure 9

The standardized boxplots of the performances of NB-ID and standard NB on the data set Velocity-1.6

Fig. 10
figure 10

The standardized boxplots of the performances of NB-ID and standard NB on the data set Xalan-2.6

Fig. 11
figure 11

The standardized boxplots of the performances of NB-ID and standard NB on the data set CM1

Fig. 12
figure 12

The standardized boxplots of the performances of NB-ID and standard NB on the data set MW1

Fig. 13
figure 13

The standardized boxplots of the performances of NB-ID and standard NB on the data set PC4

5.3 RQ3: What is the defect prediction performance of WNB-ID compared with other prediction methods?

To demonstrate the overall effectiveness of WNB-ID, we combine WNB and IDM and compare NB, SVM, LR, RT, and WNB-ID on the 10 data sets. Please note that we use the best feature-weighting technique, IG, in WNB, as determined in RQ1. We also compare WNB(IG), NB-ID, STSE, and WNB-ID, which were demonstrated to be effective in RQ1, RQ2, and Tong et al. (2018), respectively.

Table 21 and Figs. 14, 15, 16, 17, 18, 19, 20, 21, 22, and 23 show the experimental results of the comparison among NB, SVM, LR, RT, and WNB-ID. WNB-ID performs better than the others, most noticeably in Ivy-1.1, Lucene-2.2, Poi-1.5, and Velocity-1.6. Although WNB-ID is not the best in Recall or Precision in Lucene-2.4, Poi-2.5, Xalan-2.6, CM1, MW1, and PC4, its aggregative indicator F-measure is consistently better than those of the others in these data sets.

Table 21 Comparisons among NB, SVM, LR, RT, and WNB-ID on the 10 data sets
Fig. 14
figure 14

The standardized boxplots of the performances of NB, SVM, LR, RT, and WNB-ID on the data set Ivy-1.1. From the bottom to the top of a standardized box plot: minimum, first quartile, median, third quartile, and maximum. Any outlier data are plotted as a small cross. “R”, “P”, and “F” refer to Recall, Precision, and F-measure, respectively

Fig. 15
figure 15

The standardized boxplots of the performances of NB, SVM, LR, RT, and WNB-ID on the data set Lucene-2.2

Fig. 16
figure 16

The standardized boxplots of the performances of NB, SVM, LR, RT, and WNB-ID on the data set Lucene-2.4

Fig. 17
figure 17

The standardized boxplots of the performances of NB, SVM, LR, RT, and WNB-ID on the data set Poi-1.5

Fig. 18
figure 18

The standardized boxplots of the performances of NB, SVM, LR, RT, and WNB-ID on the data set Poi-2.5

Fig. 19
figure 19

The standardized boxplots of the performances of NB, SVM, LR, RT, and WNB-ID on the data set Velocity-1.6

Fig. 20
figure 20

The standardized boxplots of the performances of NB, SVM, LR, RT, and WNB-ID on the data set Xalan-2.6

Fig. 21
figure 21

The standardized boxplots of the performances of NB, SVM, LR, RT, and WNB-ID on the data set CM1

Fig. 22
figure 22

The standardized boxplots of the performances of NB, SVM, LR, RT, and WNB-ID on the data set MW1

Fig. 23
figure 23

The standardized boxplots of the performances of NB, SVM, LR, RT, and WNB-ID on the data set PC4

To further verify this result, we conduct Student’s t test at a confidence level of 95% for each release to determine whether WNB-ID shows significant difference in performance from the other methods. Considering that F-measure integrates Recall and Precision in a single indicator, F-measure values derived from different classifiers for each release are used to implement the tests. The test results are listed in Table 22. We can see that the p values are all less than 0.05. The test results reveal that the performances of WNB-ID and other classic methods are significantly different. Therefore, we think that WNB-ID is superior to NB, SVM, LR, and RT.

Table 22 Student’s t test between WNB-ID and NB, SVM, LR, and RT

Table 23 and Figs. 24, 25, 26, 27, 28, 29, 30, 31, 32, and 33 show the experimental results of the comparison among WNB (IG), NB-ID, STSE, and WNB-ID. WNB-ID clearly performs better than the others in the data sets Ivy-1.1, Poi-1.5, and Poi-2.5. Although WNB-ID is not the best in Recall or Precision in Lucene-2.2, Lucene-2.4, Velocity-1.6, Xalan-2.6, CM1, and MW1, its aggregative indicator F-measure is consistently better than those of the others in these data sets. In the data set PC4, the F-measure of WNB-ID is lower than that of STSE but still comparable to it.

Table 23 Comparisons among WNB (IG), NB-ID, STSE, and WNB-ID on the 10 data sets
Fig. 24
figure 24

The standardized boxplots of the performances of WNB (IG), NB-ID, and WNB-ID on the data set Ivy-1.1

Fig. 25
figure 25

The standardized boxplots of the performances of WNB (IG), NB-ID, and WNB-ID on the data set Lucene-2.2

Fig. 26
figure 26

The standardized boxplots of the performances of WNB (IG), NB-ID, and WNB-ID on the data set Lucene-2.4

Fig. 27
figure 27

The standardized boxplots of the performances of WNB (IG), NB-ID, and WNB-ID on the data set Poi-1.5

Fig. 28
figure 28

The standardized boxplots of the performances of WNB (IG), NB-ID, and WNB-ID on the data set Poi-2.5

Fig. 29
figure 29

The standardized boxplots of the performances of WNB (IG), NB-ID, and WNB-ID on the data set Velocity-1.6

Fig. 30
figure 30

The standardized boxplots of the performances of WNB (IG), NB-ID, and WNB-ID on the data set Xalan-2.6

Fig. 31
figure 31

The standardized boxplots of the performances of WNB (IG), NB-ID, STSE, and WNB-ID on the data set CM1

Fig. 32
figure 32

The standardized boxplots of the performances of WNB (IG), NB-ID, STSE, and WNB-ID on the data set MW1

Fig. 33
figure 33

The standardized boxplots of the performances of WNB (IG), NB-ID, STSE, and WNB-ID on the data set PC4

We also conduct Student’s t test at a confidence level of 95% for each release. The test results are listed in Table 24. We can see that the p values between WNB-ID and STSE in data sets Poi-1.5, Poi-2.5, and Xalan-2.6 are greater than 0.05. The test results reveal that the performances of WNB-ID and STSE are not significantly different in these data sets. As shown in Table 24, the performances of WNB-ID and NB-ID also show no significant differences in the data set Xalan-2.6.

Table 24 Student’s t test between WNB-ID and WNB(IG), NB-ID, and STSE

In summary, considering the aggregative indicator F-measure, WNB-ID performs better than WNB(IG) in all data sets. WNB-ID shows no significant difference from NB-ID in the data set Xalan-2.6 and performs better than NB-ID in the other data sets. WNB-ID performs better than STSE in six data sets, shows no significant difference from STSE in three data sets, and is not as effective as STSE in PC4. Therefore, we conclude that WNB-ID is superior to WNB (IG), NB-ID, and STSE.

Additionally, as shown in Table 23, our new method achieves poorer results than the others in some situations, for three main reasons. (1) WNB(IG) and NB-ID are the two components of WNB-ID. As suggested by Tong et al. (2018), constructing an SDP model that is more complex than its predecessors may cause overfitting, which can affect the effectiveness of the new method. This may explain why the Recall or Precision of WNB-ID is not superior in some releases. (2) Although WNB-ID achieves better performance in most releases, both balanced and imbalanced, ensemble methods may be more immune to class imbalance problems under some conditions, potentially explaining the superior performance of STSE over WNB-ID in PC4. (3) The number of software modules in PC4 clearly exceeds that of the other data sets, while NB is particularly effective on data sets with small data quantities (Zaidi et al. 2013). This may be another reason why STSE performs better than WNB-ID in PC4.

6 Threats to validity

This paper focuses on how to boost the potential of NB with feature-weighting techniques and IDM for SDP. One possible threat to the validity of the proposed WNB-ID method is the parameter settings. In the feature-weighting techniques, these settings influence the prediction performance of WNB-ID. Appropriate values can enhance the efficiency of the algorithm, but how to choose these values is not the emphasis of this study. Hence, we use the default values to compare the results of WNB-ID on the different data sets, which means that the performance of WNB-ID may not be ideal for some data sets. A second threat to validity is the distribution of the feature values. The experiments show that these distributions have a significant influence on the performance of WNB-ID; if the feature values follow a normal distribution, the proposed method WNB-ID may perform the same as WNB(IG). A third threat to validity is the limited set of data sets. The data sets used in this paper are from the PROMISE repository, and different conclusions might be drawn using different data sets. However, most of these data sets have been widely used in many other SDP studies, providing us and other researchers with the possibility of comparing results.

7 Discussion

In this study, feature-weighting techniques and IDM are combined to improve NB in SDP for the first time. The advantage of the proposed method is that the equal importance and normal distribution assumptions of NB, which are often violated in common practice, are alleviated by applying feature-weighting techniques and IDM, respectively. Through a series of experiments, the proposed methods are shown to be effective in improving NB for SDP:

  • For the application of feature-weighting techniques, we find that WNB can effectively alleviate the equal importance assumption of NB, and the feature-weighting method IG performs best in WNB.

  • For the application of IDM, after comparison with standard NB, we find that it is sufficiently competent when the feature values of defect data sets are non-normally distributed.

  • For the proposed classification model WNB-ID, we observe that it is better than the classic classification models, i.e., NB, SVM, LR, and RT, and it is also superior to WNB(IG), NB-ID, and the state-of-the-art method STSE.

8 Conclusion

In this paper, aiming to alleviate the equal importance and normal distribution assumptions of NB, we design and implement a new NB method, WNB-ID, which comprises the application of WNB and of IDM. For the application of WNB, we investigate six feature-weighting techniques for establishing the feature weights to alleviate the equal importance assumption of NB. For the application of IDM, we use IDM to compute the probability density of each feature instead of the default probability density function of the normal distribution to alleviate the normal distribution assumption of NB. To validate the effectiveness and practicality of our approach, we systematically conduct verification experiments; several classic methods and a state-of-the-art method are included for comparison. The final experimental results demonstrate the potential of our approach for improving the performance of SDP. In future work, we will examine more feature-weighting techniques and optimize the information diffusion model to further improve the potential of NB in SDP. In addition, we plan to apply this method to cross-project defect prediction, which is now a research hotspot in the field of SDP (Yu et al. 2017; Herbold et al. 2017).