
1 Introduction

Nowadays, software has become an essential and integral part of our lives. Industry and society rely immensely on software-backed surroundings, as software significantly reduces human effort and time [1]. Every electronic device or system, from a modern household product to a spacecraft, revolves around software. Due to the high demand for good-quality software products across different walks of life, it is crucial to develop software products free from defects. However, good-quality and reliable software products come with complexities and challenges. In software development, a defect is the outcome of a programming mistake due to which the software system does not show the expected results. Software testing enables developers to build a quality, reliable software product by identifying and fixing defects. Defects present in a software module increase development and maintenance costs and, at times, are the prime reason for the failure of the software product. At present, a significant part of the software development and maintenance budget goes into identifying and fixing defects [2]. This high maintenance cost can come down significantly by identifying defects at an earlier stage of development. Mäntylä et al. [3], in 2008, described two types of defects: functional defects and evolvability defects. Functional defects affect the system's behaviour, whereas evolvability defects degrade the software's evolvability by making it harder to understand and modify.

Software defect prediction (SDP) focuses testing effort on the defect-prone modules that require it most, facilitating efficient utilization of the available resources without violating testing constraints. Timely defect prediction boosts the quality of a software product and gives project managers the flexibility to manage resources optimally [4]. A higher-quality product, in turn, leads to higher customer satisfaction and, subsequently, to the product's success. Building a practical and powerful defect prediction mechanism is complicated by two challenges: high dimensionality (irrelevant features) and class imbalance. The class imbalance problem is an extreme imbalance between the defect-prone (DP) and non-defect-prone (NDP) modules, which leaves the data set highly skewed. Usually, the DP modules are far fewer than the NDP modules, so learners initially focus mainly on the NDP modules; data balancing is therefore required to resolve the skewness in the data set [5]. The other major challenge most defect prediction models face is irrelevant features in the data set. These features do not carry any significant information and are hence considered noise. Like class imbalance, irrelevant features reduce the model's predictive performance.
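As an illustration of data balancing, the sketch below shows random oversampling of the minority (defect-prone) class; the toy data set and function name are ours, not taken from any cited study:

```python
import random

def random_oversample(X, y, seed=42):
    """Duplicate minority-class samples until both classes are equally sized.

    A simple way to reduce the skew between defect-prone (1) and
    non-defect-prone (0) modules before training a classifier.
    """
    rng = random.Random(seed)
    minority = [i for i, label in enumerate(y) if label == 1]
    majority = [i for i, label in enumerate(y) if label == 0]
    if len(minority) > len(majority):
        minority, majority = majority, minority
    # Sample minority indices with replacement to match the majority size
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    idx = list(range(len(y))) + extra
    return [X[i] for i in idx], [y[i] for i in idx]

# Toy data: 8 non-defect-prone modules, 2 defect-prone
X = [[i] for i in range(10)]
y = [0] * 8 + [1] * 2
Xb, yb = random_oversample(X, y)
print(sum(yb), len(yb) - sum(yb))  # balanced counts: 8 8
```

More elaborate techniques (e.g. SMOTE) synthesize new minority samples instead of duplicating existing ones, but the goal is the same: remove the skew the learner would otherwise exploit.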

Organization of the paper: Sect. 1 discusses the formal requirement of a software defect prediction model and its challenges. Section 2 describes the basic concepts needed to understand this work. Section 3 discusses the methodology followed in this work. Section 4 presents the literature review and raises three research questions. Section 5 discusses our findings and answers the research questions. Finally, Sect. 6 concludes the paper with a summary and future work.

2 Software Defect Prediction

Here, we discuss the fundamental concepts of defect prediction in software modules and the applicability of machine learning. Figure 1 depicts the block diagram of software defect prediction [7]. The data set, taken from publicly available repositories such as NASA, PROMISE, ECLIPSE, and AEEEM, or from a real-life project, first undergoes pre-processing to remove noise and missing values. After pre-processing, the data set is split into training and testing data. The training set may still contain irrelevant attributes that do not contribute significantly to defect prediction but can lower the model's performance; a suitable feature selection technique can mitigate this issue. A machine learning classifier then uses the reduced training data to build the model, and the built model's performance is measured on the test data.

Fig. 1
A flow diagram for software defect prediction. It has data set, pre-processing, training, testing data, attribute selection, classifier, and result.

Software defect prediction: An overview
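The pipeline in Fig. 1 can be sketched with scikit-learn on synthetic data (an illustrative stand-in for a NASA/PROMISE data set; the parameter choices are arbitrary):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score

# Synthetic "module metrics" with some irrelevant (noise) attributes
# and an imbalanced class distribution, mimicking defect data
X, y = make_classification(n_samples=400, n_features=20, n_informative=5,
                           weights=[0.8, 0.2], random_state=0)

# Split the data set into training and testing data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

# Feature selection: keep the 5 attributes most associated with the label
selector = SelectKBest(f_classif, k=5).fit(X_train, y_train)
X_train_sel, X_test_sel = selector.transform(X_train), selector.transform(X_test)

# Train a classifier on the reduced training data, then evaluate on test data
clf = RandomForestClassifier(random_state=0).fit(X_train_sel, y_train)
auc = roc_auc_score(y_test, clf.predict_proba(X_test_sel)[:, 1])
print(round(auc, 3))
```

Each stage of Fig. 1 maps to one step here: pre-processed data, train/test split, attribute selection, classifier, and result.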

Machine learning (ML) enables computers to learn automatically, without human intervention, and to adjust their actions accordingly. Lately, ML has emerged as a powerful decision-making approach in software defect prediction: it helps identify defective modules more effectively and reduces maintenance costs. Several defect prediction models have been proposed using classifiers such as Naïve Bayes (NB), logistic regression (LR), random forest (RF), support vector machine (SVM), K-nearest neighbours (K-NN), and decision tree (DT). However, the performance of these models was not fully satisfactory, as they suffer from challenges like class imbalance and irrelevant features [6].

3 Methodology

This study intends to review and assess the experimental evidence gained from the existing work done in software defect prediction using machine learning techniques. The overall methodology followed in this review is as below:

3.1 Research Questions

The following research questions have been framed to assist in the assessment mentioned above.

  • RQ1: Do feature selection/reduction techniques impact the model’s predictive performance? What are the most widely used feature selection/reduction techniques for defect prediction?

  • RQ2: Do Ensemble Learning methods give a better predictive performance in defect prediction models than the individual classifiers?

  • RQ3: What are the frequently used software and performance measure metrics in software defect prediction?

3.2 Review Protocol

The general process of identifying relevant works includes selecting appropriate digital repositories, determining search keywords, and compiling a list of existing works that match the keywords. This survey includes research papers published since 2010 from several databases/publishers, namely Google Scholar, IEEE, Science Direct, and Springer Link. We used the search keyword “Defect Prediction in software modules using Machine Learning” to retrieve relevant studies. The initial list comprised 117 papers from the repositories mentioned above. A preliminary screening based on the title, abstract, and conclusion of those papers reduced the number to 79 in our second list. The statistical distribution of the publisher-wise studies and the methods followed is shown in Figs. 2 and 3.

Fig. 2
A pie chart depicts the number of papers per publisher: IEEE 28, Springer 26, Elsevier 25, ACM 5, BEIESP 3, MDPI 3, MECS 2, Wiley 1, and others 26.

Statistical distribution of publisher-wise studies

Fig. 3
A pie chart depicts the number of papers per method followed: ML techniques 58, ensemble modelling 27, deep learning 15, and others 17.

Statistical distribution of studies on the methods followed

We went through these papers thoroughly and selected 15 final papers based on the inclusion criteria as mentioned below:

  • Inclusion Criteria:

    • Papers that used ML techniques in software defect prediction.

    • Papers that compared the performance of different defect prediction models.

    • Empirical papers.

We examined the final list of papers for characteristics such as feature selection/reduction techniques, ensemble learning methods, software metrics, and performance measurement metrics, to check whether they covered the research questions defined above. The results are presented in Table 1.

Table 1 Data extraction results

4 Literature Review

In this section, we present the findings of our literature survey. We divided the survey into two categories. The first category studies the conventional ML approach using feature selection/reduction techniques; its findings are summarized in Table 2.

Table 2 Conventional ML approach using the feature selection/reduction techniques
Table 3 Defect prediction employing an ensemble learning approach

Though defect prediction models based on individual classifiers under the conventional machine learning approach showed good performance, they still faced challenges like class imbalance. This challenge paved the way for further research to strengthen the performance of defect prediction models. As a result, ensemble learning methods came into existence, which combine several individual classifiers and build the prediction model from the ensemble. Table 3 presents the findings of defect prediction using ensemble learning methods.

5 Discussion on Our Findings

5.1 Answer to the Research Questions

Through a detailed study of the selected papers, we observed that though software defect prediction has made significant progress in recent times, it still faces challenges like irrelevant features and data imbalance. While feature selection and reduction techniques have helped to an extent in removing irrelevant features, ensemble learning techniques have significantly improved model performance compared to individual classifiers. The findings of the above study answer the research questions raised in this paper.

RQ1: Do feature selection/reduction techniques impact the model’s predictive performance? What are the most widely used feature selection/reduction techniques for defect prediction?

Answer: The above survey found that feature selection/reduction techniques significantly improved the model’s performance [4, 13,14,15, 18,19,20,21,22,23, 26, 27]. The widely used feature selection techniques are filter-based, wrapper-based, correlation-based, and consistency-based. Similarly, the widely used feature reduction techniques are principal component analysis, FastMap, feature agglomeration, transfer component analysis (TCA) and TCA+, random projection, restricted Boltzmann machine, and autoencoder. The study also found that selection techniques outperformed reduction techniques in supervised learning, whereas neural network-based reduction techniques were better in unsupervised learning [27].
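As an illustration of one of the listed reduction techniques, principal component analysis can be sketched in a few lines of NumPy via the singular value decomposition (synthetic data; in practice a library implementation would be used):

```python
import numpy as np

def pca_reduce(X, k):
    """Project X onto its top-k principal components (feature reduction)."""
    Xc = X - X.mean(axis=0)                        # centre each feature
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:k].T                           # scores on first k components

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))    # 100 modules, 10 correlated metrics
Z = pca_reduce(X, 3)              # reduced to 3 derived features
print(Z.shape)                    # (100, 3)
```

Unlike feature selection, which keeps a subset of the original metrics, such reduction techniques construct new derived features, which is one reason the survey treats the two families separately.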

RQ2: Do ensemble learning methods give a better predictive performance in defect prediction models than the individual classifiers?

Answer: In the ensemble method, several core models are combined to produce one optimal predictive model, which improves predictive performance compared to individual classifiers [5, 13, 23, 24]. When applied together with ensemble learning methods, feature selection mostly gave better performance than when no feature selection was used [4, 13, 26, 27]. However, in some cases, performance decreased when ensemble learning methods were combined with specific feature selection methods [4]. Some widely used ensemble methods are bagging, boosting, and stacking.

RQ3: What are the frequently used software and performance measure metrics in software defect prediction?

Answer: In this review, we extensively studied the different software metrics used in software defect prediction. From our analysis, we found that the majority of the studied works used McCabe metrics and the Halstead base and derived metrics. In terms of performance measurement metrics, our study observed that the area under the receiver operating characteristic curve (AUC) is the most widely used measure, as the skewness of defect data does not affect AUC. AUC is followed by accuracy as the second most widely used performance measure metric. Some other popular performance measures are precision, recall, specificity, G-mean, F-measure, performance variance, and error rate.

5.2 Filter-Based Feature Subset Selection Technique

The feature subset selection techniques examine the importance of each feature and produce a subset of relevant features. Filter-based and wrapper-based feature subset selection techniques are prevalent in defect prediction. Past research has extensively used filter-based feature subset selection techniques and found them very effective [15,16,17]. An overview of subset-based feature selection is depicted in Fig. 4.

Fig. 4
A flow chart depicts the subset selection technique with original features, search strategy, selected feature, goodness criteria, optimal feature subset.

Feature subset selection technique

Correlation subset-based techniques do not evaluate individual features; instead, they evaluate subsets of features. The best feature subset has low inter-feature correlation but a high correlation with the class label [17]. Consistency subset-based techniques use consistency to estimate the relevance of a feature subset; this approach yields a minimal feature subset whose consistency is equivalent to that of the full feature set [16]. Ghotra et al. [15] conducted an extensive study assessing the influence of feature selection techniques on the defect prediction model. Based on their experimental results, the authors established that correlation-based subset feature selection coupled with best-first search outperformed the other feature selection approaches over the different datasets used in their study. Balogun et al. [20] scrutinized the effect of diverse feature selection approaches on the predictive performance of the models and compared filter-based feature subset selection methods with filter-based feature ranking methods on defect prediction models. The authors concluded that although filter-based feature subset selection combined with best-first search enhanced the performance of the defect prediction model, the models based on filter feature ranking gave a more stable predictive performance. Kondo et al. [21] applied different filter-based feature subset selection techniques and filter-based feature ranking techniques and observed the outcome for defect prediction models in both supervised and unsupervised learning. They observed that filter-based feature subset selection procedures gave the best performance for the supervised defect prediction model, while neural network-based feature reduction approaches (RBM and AE) performed better in the unsupervised setting.
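A simplified sketch of correlation-based subset selection follows; it uses a greedy forward search as a stand-in for the best-first search in the cited studies, a merit score in the usual CFS form (subset-to-class correlation penalized by inter-feature correlation), and a toy data set of ours:

```python
import numpy as np

def cfs_forward(X, y):
    """Greedy forward search with a CFS-style merit: favour features
    correlated with the class but not with each other."""
    n_feat = X.shape[1]
    r_cf = np.array([abs(np.corrcoef(X[:, j], y)[0, 1]) for j in range(n_feat)])
    r_ff = np.abs(np.corrcoef(X, rowvar=False))

    def merit(subset):
        k = len(subset)
        avg_cf = r_cf[subset].mean()
        # Mean pairwise feature-feature correlation (diagonal removed)
        avg_ff = (r_ff[np.ix_(subset, subset)].sum() - k) / max(k * (k - 1), 1)
        return k * avg_cf / np.sqrt(k + k * (k - 1) * avg_ff)

    selected, best = [], -np.inf
    while True:
        candidates = [c for c in range(n_feat) if c not in selected]
        if not candidates:
            return selected
        top, c = max((merit(selected + [c]), c) for c in candidates)
        if top <= best:            # stop when no candidate improves the merit
            return selected
        selected.append(c)
        best = top

# Toy data: features 0 and 1 carry the signal, features 2-4 are noise
rng = np.random.default_rng(1)
y = rng.integers(0, 2, 200).astype(float)
X = np.column_stack([y + rng.normal(0, 0.3, 200),
                     y + rng.normal(0, 0.3, 200),
                     rng.normal(size=(200, 3))])
print(sorted(cfs_forward(X, y)))  # expected to keep only informative features
```

The search stops as soon as adding a feature no longer raises the merit, which is how redundant or noisy metrics get excluded.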

Balogun et al. [23] proposed an enhanced wrapper feature selection technique that selects features dynamically and iteratively, and found that the proposed technique not only selected subsets in less time but also returned an improved prediction rate.
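A minimal sketch of the wrapper idea (not Balogun et al.'s enhanced technique): forward selection that scores each candidate subset by the classifier's own cross-validated accuracy, so the learner itself judges feature relevance:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def wrapper_forward(X, y, clf, max_feats=3):
    """Wrapper-style forward selection: grow the subset greedily,
    scoring each candidate subset by the classifier's CV accuracy."""
    selected, best = [], 0.0
    while len(selected) < max_feats:
        scores = {}
        for c in range(X.shape[1]):
            if c in selected:
                continue
            subset = selected + [c]
            scores[c] = cross_val_score(clf, X[:, subset], y, cv=3).mean()
        c, top = max(scores.items(), key=lambda kv: kv[1])
        if top <= best:            # stop when accuracy no longer improves
            break
        selected.append(c)
        best = top
    return selected, best

X, y = make_classification(n_samples=300, n_features=10, n_informative=3,
                           random_state=0)
subset, acc = wrapper_forward(X, y, LogisticRegression(max_iter=1000))
print(subset, round(acc, 3))
```

Because every candidate subset requires retraining the classifier, wrappers are slower than filters, which is exactly the cost that dynamic/iterative variants such as [23] try to reduce.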

5.3 Bagging Ensemble Learning

Bagging, short for bootstrap aggregating, is a widely used ensemble learning method. It is a parallel method that fits several weak learners independently, making it possible to train them simultaneously. Bagging uses random sampling with replacement to generate additional training sets from the original data set; multiple models are trained in parallel on these bootstrap samples, and the predictions of the ensemble members are finally averaged. Bagging thus reduces variance and yields a more stable aggregated prediction. Figure 5 depicts an overview of the bagging ensemble learning method.

Fig. 5
A flow diagram depicts the learning method with data, random sampling with replacement, learners building in parallel, and the average learner.

Bagging ensemble learning
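The procedure above can be illustrated with scikit-learn's BaggingClassifier, which by default bags decision trees over bootstrap samples; the synthetic, imbalanced toy data and hyperparameters are ours:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

# Synthetic imbalanced "defect data"
X, y = make_classification(n_samples=400, n_features=20, n_informative=5,
                           weights=[0.8, 0.2], random_state=0)

# One tree vs. 25 trees, each fit on a bootstrap sample (random sampling
# with replacement) and combined by majority vote
single = DecisionTreeClassifier(random_state=0)
bagged = BaggingClassifier(n_estimators=25, random_state=0)

acc_single = cross_val_score(single, X, y, cv=5).mean()
acc_bagged = cross_val_score(bagged, X, y, cv=5).mean()
print(round(acc_single, 3), round(acc_bagged, 3))
```

On high-variance base learners such as unpruned trees, the bagged ensemble typically scores at least as well as the single tree, which is the variance-reduction effect described above.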

5.4 Boosting Ensemble Learning

Boosting is another type of ensemble method in which the weak learners learn in sequence and adaptively enhance the performance of the model. Boosting increases the weight of wrongly predicted data points, and each resulting model is assigned a weight during training. Figure 6 shows the boosting ensemble method.

Fig. 6
A flow diagram depicts boosting learning with data, random sampling with over-weighted data, sequential learner building, and average strategy.

Boosting ensemble learning
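An illustration of boosting using scikit-learn's AdaBoostClassifier on synthetic data (the data and hyperparameters are arbitrary, chosen only for the sketch):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import cross_val_score

# Synthetic imbalanced "defect data"
X, y = make_classification(n_samples=400, n_features=20, n_informative=5,
                           weights=[0.8, 0.2], random_state=0)

# AdaBoost fits shallow trees in sequence; each round re-weights the
# training points the previous learner got wrong, and each learner
# receives a weight for the final vote.
boosted = AdaBoostClassifier(n_estimators=50, random_state=0)
acc = cross_val_score(boosted, X, y, cv=5).mean()
print(round(acc, 3))
```

Swapping the base estimator (e.g. to an SVM, as in the AdaBoost-SVM study below) changes only the `estimator` argument, not the sequential re-weighting scheme.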

Khan et al. [24] explored a hybrid ensemble learning technique that applied the AdaBoost and bagging ensemble approaches to Naïve Bayes, support vector machine, and random forest classifiers on PROMISE datasets. Comparing the results of the studied models, they concluded that AdaBoost-SVM and bagging-SVM gave the best performance among the studied methods.

From our study, we observed that most of the studied papers used bagging and boosting as the ensemble learning method in software defect prediction [4, 13, 24,25,26,27]. Mangla et al. [5] used a sequential ensemble model based on a neural network and compared the performance of their proposed model with other ensemble methods such as bagging, boosting, and stacking. Laradji et al. [13] used the average probability ensemble (APE) method in two variants to build a system more robust to data imbalance and feature redundancy. Figure 7 shows the distribution of the selected ensemble method-based papers.

Fig. 7
A pie chart of the distribution of ensemble learning methods: bagging, boosting, stacking, and others. Boosting occupies the maximum percentage.

Distribution on ensemble learning method

5.5 Metrics Used in Software Defect Prediction

5.5.1 Software Metrics

Software metrics are features extracted from static source code. These features are helpful, easy to use, and extensively employed. The data are module-based and mainly comprise McCabe and Halstead features extracted from the source code. A practical defect prediction model considers only the best metrics and discards those that may hurt the model's predictive performance. McCabe [8] argued that code containing complex pathways is more prone to errors; hence, his metrics reflect the pathways within a code module. Some commonly used McCabe metrics are LOC, cyclomatic complexity, essential complexity, and design complexity. Halstead [9] argued that hard-to-read code is more error-prone; his metrics estimate the various concepts in a module and determine its complexity. Table 4 lists the software metrics.

Table 4 Software metrics in software defect prediction
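For intuition, cyclomatic complexity can be roughly estimated for Python source with the standard ast module. The count below is a simplification of ours, not a full McCabe implementation: it treats each branching construct as one extra independent path:

```python
import ast

def cyclomatic(source):
    """Rough cyclomatic-complexity estimate: 1 + number of branching
    constructs (real metric extractors handle more cases, e.g. the
    short-circuit operators in boolean expressions)."""
    tree = ast.parse(source)
    branches = (ast.If, ast.For, ast.While, ast.ExceptHandler, ast.IfExp)
    return 1 + sum(isinstance(n, branches) for n in ast.walk(tree))

code = """
def classify(x):
    if x < 0:
        return "neg"
    for i in range(x):
        if i % 2:
            x += 1
    return "pos" if x else "zero"
"""
print(cyclomatic(code))  # 1 + (if, for, if, conditional expression) = 5
```

The higher this count, the more test cases are needed to cover every path, which is why McCabe linked it to defect-proneness.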

The widely used Halstead metrics are the base Halstead metrics (number of unique operators, number of unique operands, total number of operators, total number of operands, length, vocabulary) and the derived Halstead metrics (volume, potential minimum volume, program level, difficulty, effort, and time) [10]. Our study noted that all the selected papers used the above-mentioned software metrics in their defect prediction models.
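The derived Halstead metrics follow arithmetically from the four base counts; the counts below are made up for a hypothetical small module:

```python
import math

# Base Halstead counts (made-up values for a hypothetical module)
n1, n2 = 10, 15      # unique operators, unique operands
N1, N2 = 40, 60      # total operators, total operands

vocabulary = n1 + n2                          # n
length = N1 + N2                              # N
volume = length * math.log2(vocabulary)       # V = N * log2(n)
difficulty = (n1 / 2) * (N2 / n2)             # D = (n1 / 2) * (N2 / n2)
effort = difficulty * volume                  # E = D * V
time = effort / 18                            # T in seconds (Stroud number 18)

print(round(volume, 1), round(difficulty, 1), round(effort, 1))
```

Volume grows with how much code there is, difficulty with how densely operands are reused, so effort combines "how much" and "how hard" into one predictor of error-proneness.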

5.5.2 Performance Measure Metrics

Performance measure metrics quantify the model's predictive performance. Table 5 lists some performance measure metrics used in software defect prediction.

Table 5 Performance measure metrics in software defect prediction

As noted above, the area under the receiver operating characteristic curve (AUC) is the most widely used performance measure, as the skewness of defect data does not affect AUC, followed by accuracy as the second most widely used metric. Other popular performance measures are precision, recall, specificity, G-mean, F-measure, performance variance, and error rate. Figure 8 represents the distribution of the studied works by performance measure metric used.

Fig. 8
A pie chart depicts performance measure metrics: AUC 11, accuracy 6, recall 2, precision 2, F-measure 2, G-mean 2, specificity 1, and error rate 1.

Distribution on performance measure metrics
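The most common of these measures can be computed directly from the confusion matrix, and AUC via its rank-based formulation; the sketch below uses made-up predictions for eight modules:

```python
def confusion_metrics(y_true, y_pred):
    """Accuracy, precision, recall, and F-measure from the confusion matrix."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "accuracy": (tp + tn) / len(y_true),
        "precision": precision,
        "recall": recall,
        "f_measure": 2 * precision * recall / (precision + recall),
    }

def auc(y_true, scores):
    """AUC as the probability that a random defect-prone module is ranked
    above a random non-defect-prone one (rank-sum formulation); this is
    why class skew does not affect it."""
    pos = [s for s, t in zip(scores, y_true) if t == 1]
    neg = [s for s, t in zip(scores, y_true) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Made-up labels and classifier scores for eight modules
y_true = [1, 0, 1, 0, 0, 1, 0, 0]
scores = [0.9, 0.3, 0.6, 0.4, 0.1, 0.8, 0.7, 0.2]
y_pred = [int(s >= 0.5) for s in scores]
m = confusion_metrics(y_true, y_pred)
print({k: round(v, 3) for k, v in m.items()}, round(auc(y_true, scores), 3))
```

Note that accuracy depends on the 0.5 threshold and the class ratio, whereas AUC scores the ranking of all DP/NDP pairs, which is why it is preferred on skewed defect data.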

5.6 Limitations of Existing Research

The research gap in the existing works lies in the lack of techniques that are universally acceptable across different datasets and classifiers, even though past research has used several feature selection techniques. An approach is therefore needed that finds the right balance of features, using a suitable feature selection technique, to enhance the model's performance across different datasets and classifiers. In addition, the high complexity of ensemble-based defect prediction models remains an unaddressed challenge whose resolution is crucial to bringing down maintenance costs.

6 Conclusion

This literature survey aimed to comprehensively study the different machine learning techniques used in software defect prediction. The study found that, though defect prediction has come a long way, it still lacks a suitable approach for selecting appropriate features while discarding irrelevant ones, as no universally accepted feature selection technique is available. This work discussed different feature selection/reduction techniques used to enhance the performance of defect prediction systems. The study placed particular emphasis on ensemble learning methods and observed that prediction models based on ensemble learning approaches give better performance than individual classifier-based models. This work also discussed the widely used software and performance measurement metrics in defect prediction. This concise survey should guide future researchers in this emerging research area. In future, we aim to introduce a hybrid feature selection technique, combining filter and wrapper approaches, into an ensemble learning-based defect prediction model to enhance its predictive performance.