
1 Introduction

The idea behind software defect prediction (SDP) is to deploy machine learning (ML) methods to predict software defects based on historical information such as bug reports and source code edit logs generated during the software development process [1]. SDP can help development teams use available resources more wisely by concentrating on flawed or defect-prone modules or components before software product release [1, 2]. To anticipate defective modules in software systems, data from software features, which include source code complexity (Lines of Code (LOC), McCabe, and Halstead metrics), software development history, and software cohesion and coupling, are used to build SDP models [3,4,5]. These software features are quantitatively measured to assess the degree of reliability and dependability of software systems [3].

SDP models can be developed using either supervised or unsupervised ML approaches [6,7,8,9]. The objective is to develop an SDP model that predicts defects in software systems with high certainty and accuracy. Nonetheless, the effectiveness of SDP models depends on the characteristics of the software datasets deployed. In particular, the software attributes utilized to develop SDP models impact their efficacy [1, 2, 10]. By nature, software defect datasets are noisy and skewed, a situation that manifests as a class imbalance problem.

Class imbalance occurs in SDP when there is a disproportion between class labels, with the non-defective and defective cases forming the majority and minority labels respectively. In addition, class imbalance is a latent issue that arises naturally in software defect data and impairs the prediction performance of deployed prediction algorithms. Addressing class imbalance as a data quality problem has piqued the interest of experts, and several studies and techniques have been presented to handle the imbalance issue [7, 11, 12]. Previous research has shown that SDP models developed using imbalanced datasets provide unreliable findings and poor prediction performance. In other words, SDP models developed on imbalanced datasets preferentially identify the majority instances over the minority. However, it is crucial to re-affirm that correctly predicting the minority instances (in this case defective labels) is imperative, since neglecting the defective labels may be deleterious. Consequently, in the context of SDP, several researchers have adopted methodologies such as data sampling, cost-sensitive learning, and ensemble methods to address the problem of class imbalance [7, 11, 13]. Independently, these strategies have a positive influence on deployed prediction algorithms in SDP processes; nonetheless, the development of novel approaches to addressing class imbalance in SDP is still ongoing research. Data sampling approaches that increase the proportion of minority instances (over-sampling) or decrease the proportion of majority instances (under-sampling) have been reported to overcome the class imbalance problem [14, 15]. Furthermore, class imbalance has been shown to have negligible or no effect on ensemble techniques [7, 16]. In response to the foregoing reports, this research work proposes a hybrid of data sampling and ensemble approaches to overcome the class imbalance problem in SDP.

Data sampling approaches, especially data oversampling, have been proven in studies to have a positive influence on ML algorithms [17, 18]. However, as demonstrated in [19], they still suffer from excessive volatility and instability. To alleviate the large variance and instability of data sampling methods, ensemble methods such as boosting and bagging may be used [20]. The objective of this study is to conduct an empirical evaluation of the prediction performance of data sampling-based ensemble methods. Specifically, models based on ensembled (Bagging and Boosting) Naïve Bayes (NB) and Decision Tree (DT) classifiers are deployed on newly generated datasets based on the Random Over-Sampling (ROS), Synthetic Minority Over-Sampling Technique (SMOTE), Adaptive Synthetic Sampling (ADASYN), Random Under-Sampling (RUS), and NearMiss data sampling methods. Software defect datasets from the NASA repository are deployed in this study.

Summarily, the contributions of this study are as follows:

  i. Validate the effectiveness of ensemble methods over conventional ML classifiers on data-sampled datasets in SDP.

  ii. Compare the effectiveness of the experimented data sampling methods on the studied ensemble (bagging and boosting) methods.

The remainder of this research is structured as follows: Sect. 2 outlines the related work that has been completed in this respect. Section 3 describes the methodology, including the data sampling methods, prediction algorithms, and ensemble techniques; Sect. 4 presents the experimental results and analyses. Section 5 concludes this research work.

2 Related Works

This section reviews relevant SDP models developed using ML methods and current solutions to the class imbalance problem.

SDP is a method for detecting defects in software systems early on. It uses the characteristics of individual code components to detect whether they are prone to defects [7] or to forecast the number of defects in each component or module [2]. Experts have used a variety of approaches to create SDP models based on static code metrics. NB [21], DT [14], artificial neural networks (ANNs) [22], support vector machines (SVM) [23], k-nearest neighbour (KNN) [24], and logistic regression (LR) [21] are just a few of the classical classifiers that have been used directly to develop SDP models. Nonetheless, these classifiers ignore the skewness and other inherent data quality problems in software defect datasets that could affect the effectiveness of SDP models [10]. For instance, SVM and KNN tend to overlook the minority class labels as they seek to maximise accuracy values [25]. [26] reported that software defect datasets are very susceptible to the class imbalance problem, which is an example of a data quality problem.

Class imbalance is a latent anomaly of software defect data, which consists of a small number of defective instances and a large number of non-defective instances [27]. The NASA dataset PC1 exemplifies this, with just 6.59% of its instances belonging to the defective class label. Since most classifiers aim to maximize overall prediction accuracy, this characteristic has a significant influence on both model training and predictive performance. Therefore, such models often neglect the valued minority (defective) class.
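As an illustration, using the commonly used majority-to-minority definition of the imbalance ratio (assumed here; Table 3 lists the exact ratios used in the study), the PC1 figure above corresponds to

$$ IR = \frac{N_{majority}}{N_{minority}} \approx \frac{93.41\%}{6.59\%} \approx 14.2, $$

i.e., roughly fourteen non-defective instances for every defective one.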

A considerable amount of research has been devoted to addressing the issue of class imbalance. [28] presented an overview of approaches for decreasing the detrimental impact of imbalance on prediction performance. [29] investigated whether various classifiers detect the same defects. To do this, they used NASA datasets to conduct a sensitivity analysis and evaluate the outcomes of RF, NB, RPart, and SVM. They concluded that certain defects are predicted more consistently than others and that each classifier identifies a distinct set of defects.

[30] examined two procedures, undersampling and oversampling, and asserted that both data sampling approaches were successful. Furthermore, it was discovered that the adoption of advanced sampling procedures did not give any discernible improvement in addressing the class imbalance issue. Similarly, [31] observed that the RUS approach typically outperforms more complicated undersampling algorithms. In addition to undersampling procedures, oversampling methods are prominently deployed to solve the class imbalance issue. Aside from ROS with replacement, numerous more sophisticated data sampling techniques have been created. [15] proposed a unique oversampling approach termed SMOTE, in which new minority samples are generated based on feature-space similarities between existing minority instances. [32] used borderline-SMOTE to oversample minority class samples close to the class borderline. Both [18] and [13] found that oversampling outperforms undersampling. It can be noted that the aforementioned research on data sampling techniques was not conducted on software defect problems, and there is limited literature on evaluating data sampling methods for SDP. In [33], Tomek-Link, an undersampling technique, was combined with Random Undersampling (RUS) and the Synthetic Minority Oversampling Technique (SMOTE). It was reported that this combination showed improved performance over the other experimented methods.

Cost-sensitive learning is another technique that has been investigated by researchers. [34] evaluated data sampling techniques and MetaCost learning and concluded that sampling methods with replacement are effective for imbalanced learning. However, cost-sensitive learning needs to be investigated more thoroughly because assigning a cost penalty is not generic; rather, it depends on factors such as the dataset used and the level of misclassification [35].

More recently, ensemble methods have been explored by researchers [7, 11, 16]. [36] suggested an ensemble technique for SDP based on object-oriented (OO) modules and compared the proposed method to some existing ML methods. They reported that the proposed method performed better than the studied ML methods. Furthermore, ensemble techniques have also been combined with resampling methods, which is known as the hybrid approach. [11] proposed a hybrid SMOTE-ensemble technique, which they simply referred to as <SMOTE + classifier>. In their experiment, they first resampled the dataset using SMOTE, and the ensembles were then built using RF, AdaBoost, and Bagging. They observed that the suggested approach could effectively enhance the prediction accuracy of the studied SDP models. [19] also researched this area; although their work was not in the domain of SDP, they demonstrated the effectiveness of ensembling resampled data over a single base classifier. They conducted their experiments using both simulated and real-world datasets.

3 Methodology

This section outlines and describes the data sampling techniques, prediction algorithms, ensemble methods, defect datasets examined, performance measures, and experimental strategy employed in this research work.

3.1 Data Sampling Method

In this study, five (5) data sampling approaches (SMOTE, ADASYN, ROS, RUS, and NearMiss) are studied. Data sampling techniques are broadly classified into two types: oversampling methods and undersampling methods [15]. The fundamental idea of oversampling is to balance the dataset by increasing the number of minority class instances until it equals the number of majority class instances. In contrast, undersampling approaches reduce the majority class instances to the same number as the minority class instances. ROS, SMOTE, and ADASYN are instances of oversampling methods, while RUS and NearMiss are examples of undersampling methods. In ROS, the samples to replicate are selected at random; this duplication of minority class instances often leads to overfitting and a poor prediction model. In SMOTE, however, the samples are generated using the k-Nearest Neighbour (k-NN) algorithm: the Euclidean distance between a minority feature vector and its nearest neighbour is used to generate a new synthetic vector [15]. ADASYN is an oversampling strategy based on k-NN that generates data adaptively according to the density distribution of minority instances [18]. RUS is an undersampling approach that removes instances of the majority class at random until both classes are equal in size. Although some information may be lost in this process, it reduces computation time and cost. NearMiss is another undersampling technique, but instead of randomly selecting samples to eliminate, it uses the k-NN approach [13].
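As an illustration of how these five samplers can be applied, the following is a minimal sketch using the imbalanced-learn (imblearn) Python library; the study reports using the Python/scikit-learn ecosystem but does not name imblearn explicitly, and the synthetic data and parameter values below are placeholders for demonstration only.

```python
from collections import Counter

from sklearn.datasets import make_classification
from imblearn.over_sampling import RandomOverSampler, SMOTE, ADASYN
from imblearn.under_sampling import RandomUnderSampler, NearMiss

# Synthetic imbalanced data standing in for a defect dataset
# (~10% "defective" minority class).
X, y = make_classification(n_samples=1000, n_features=20,
                           weights=[0.9, 0.1], random_state=42)

samplers = {
    "ROS": RandomOverSampler(random_state=42),
    "SMOTE": SMOTE(random_state=42),
    "ADASYN": ADASYN(random_state=42),
    "RUS": RandomUnderSampler(random_state=42),
    "NearMiss": NearMiss(version=1),
}

for name, sampler in samplers.items():
    X_res, y_res = sampler.fit_resample(X, y)
    print(name, Counter(y_res))  # class counts after resampling
```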

3.2 Prediction Algorithms

In this study, the NB and DT algorithms are used as prediction algorithms. NB is a probability-based classifier that is predicated on Bayes' theorem and the assumption that features are conditionally independent of one another given the class [37]. DT is a non-parametric classifier whose goal is to create a model that predicts the class of an instance using simple decision rules derived from the data features. The classifiers were chosen to bring heterogeneity into the prediction models and on the basis of their use and performance in previous SDP research. Table 1 gives a summary of the chosen models and their parameter values as employed in this research work.
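A minimal sketch of instantiating the two base classifiers with scikit-learn is given below; the defaults shown are assumptions for illustration, while the settings actually used in the study are those listed in Table 1.

```python
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

# Base classifiers; library defaults shown, Table 1 lists the study's settings.
nb = GaussianNB()
dt = DecisionTreeClassifier(random_state=42)
```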

Table 1. Selected prediction algorithms with parameter settings

3.3 Ensemble Methods

This study investigated the boosting and bootstrap aggregating (Bagging) ensemble methods. Ensemble methods generally combine multiple weak classifiers into a single strong and robust model to improve the effectiveness and stability of the model [26]. The boosting ensemble method trains a weak classifier sequentially on re-weighted training data and then combines all the weak hypotheses created by the weak classifiers into the final hypothesis using a majority vote mechanism [14]. In other words, boosting uses weighted averages to transform weak classifiers into stronger classifiers, with each model determining which instances the next iteration focuses on. In the case of Bagging, the base classifiers learn from multiple bootstrap samples drawn from the original dataset, and their outputs are aggregated at prediction time. Consequently, the aggregation ensures that each classifier's variance is reduced while its bias is not raised. In simpler terms, the bagging approach randomly resamples the original dataset, trains several base classifiers on the resampled subsets, and then makes a prediction based on the predictions of the many base learners [36]. Summarily, Table 2 depicts the investigated ensemble techniques and their parameters as they were used throughout the experimental phase of this research work.
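The following sketch shows one way to wrap the NB and DT base classifiers in the two studied ensemble methods using scikit-learn; the number of estimators and other settings are placeholders, since the study's actual values are those in Table 2 (and the study built its models in WEKA rather than scikit-learn).

```python
from sklearn.ensemble import BaggingClassifier, AdaBoostClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

# Bagged and boosted variants of the NB and DT base classifiers.
# The base estimator is passed positionally; n_estimators=10 is a placeholder.
bagged_nb = BaggingClassifier(GaussianNB(), n_estimators=10, random_state=42)
bagged_dt = BaggingClassifier(DecisionTreeClassifier(), n_estimators=10, random_state=42)
boosted_nb = AdaBoostClassifier(GaussianNB(), n_estimators=10, random_state=42)
boosted_dt = AdaBoostClassifier(DecisionTreeClassifier(max_depth=1), n_estimators=10, random_state=42)
```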

Table 2. Experimented Ensemble Methods with parameter settings

3.4 Software Defect Datasets

The software defect datasets utilized in this research work were gathered from the NASA repository. In this research work, the [4] version of the NASA corpus was employed. The NASA datasets contain software features obtained from static code analysis centred on source code size and complexity [38,39,40]. Table 3 contains details of the datasets analysed as well as their corresponding imbalance ratios.

Table 3. Description of studied defect datasets

3.5 Evaluation Measures

According to available research, choosing appropriate performance evaluation criteria is crucial in SDP [41, 42]. This is because the datasets used to train and test SDP models are imbalanced, and relying on prediction accuracy values alone may not be sufficient. For example, in [1], an experiment was conducted to test the bias of measures such as Accuracy, F-Measure, and the Area Under the ROC Curve (AUC). They concluded that the Matthews Correlation Coefficient (MCC) is a more trustworthy statistic since it includes all confusion-matrix measures, as opposed to others that exclude the True Negative (TN). In this study, the prediction performances of the developed SDP models were evaluated using AUC and MCC values. These chosen assessment measures have been used often and are reliable [1, 16, 21].

$$ \text{AUC} = \frac{1 + TPR - FPR}{2} $$
(1)
$$ \text{MCC} = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}} $$
(2)
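As a sketch of how these two measures can be computed in practice (using scikit-learn's metric functions rather than evaluating Eqs. (1) and (2) by hand; the labels and scores below are illustrative placeholders, not data from the study):

```python
from sklearn.metrics import roc_auc_score, matthews_corrcoef

y_true = [0, 0, 0, 1, 1, 0, 1, 0]                    # 1 = defective, 0 = non-defective
y_score = [0.1, 0.7, 0.2, 0.8, 0.4, 0.3, 0.9, 0.2]   # predicted defect probabilities
y_pred = [1 if s >= 0.5 else 0 for s in y_score]     # thresholded class predictions

auc = roc_auc_score(y_true, y_score)     # area under the ROC curve
mcc = matthews_corrcoef(y_true, y_pred)  # Matthews correlation coefficient
print(f"AUC = {auc:.3f}, MCC = {mcc:.3f}")
```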

3.6 Experimental Procedure

This section discusses the experimental procedure employed in this research work as shown in Fig. 1.

The procedure is designed to experimentally investigate and evaluate the efficacy of data sampling-based ensemble methods in SDP. Particularly, the original defect datasets and the newly generated datasets based on the studied data sampling methods (ROS, SMOTE, ADASYN, RUS, and NearMiss) are deployed on ensembled NB and DT classifiers. That is, each of the studied data sampling methods is used to resolve the inherent class imbalance present in the defect datasets by balancing the number of majority and minority class instances, thereby producing new datasets. The balancing of the original datasets is based on conclusions presented in previous research [21, 43]. The SDP models are created and tested using the K-fold (k = 10) Cross-Validation (CV) technique. The preference for the k-fold technique is based on its ability to develop prediction models while minimizing the influence of the class imbalance problem [27, 44]. Furthermore, the K-fold technique enables every instance to be used iteratively for both training and testing. The investigated classifiers (NB and DT) and ensemble techniques (Boosting and Bagging) were chosen based on their application and performance in previous research [11, 36]. Table 1 (Sect. 3.2) and Table 2 (Sect. 3.3) indicate the parameter values for the classifiers and ensemble methods investigated in this research work. Following that, the prediction performances of the resulting models are evaluated using the chosen evaluation measures (AUC and MCC values). In addition, the prediction performances of the created models are compared with each other to determine the influence of the data sampling techniques on the prediction models (NB and DT) and the relative effectiveness of the examined data sampling methods. Conclusively, the effectiveness of the created models is compared with that of existing SDP models. The essence of the comparison is to validate the effectiveness of data sampling-based ensemble approaches in SDP procedures. The Python scikit-learn ML library was utilized to implement the data sampling techniques, while the WEKA ML platform was used to construct the prediction models. These two ML resources are often employed in SDP and ML activities.
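A minimal end-to-end sketch of this pipeline is given below. It is written entirely in Python (the study itself built the models in WEKA) and assumes the imbalanced-learn Pipeline so that resampling is applied only to the training folds within cross-validation, a detail not stated in the text but standard practice; the dataset and parameter values are placeholders.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_validate
from sklearn.naive_bayes import GaussianNB
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline

# Stand-in for a NASA defect dataset (features X, defect labels y).
X, y = make_classification(n_samples=1000, n_features=20,
                           weights=[0.9, 0.1], random_state=42)

# Resampling happens inside each training fold, never on the test fold.
model = Pipeline([
    ("smote", SMOTE(random_state=42)),
    ("bagged_nb", BaggingClassifier(GaussianNB(), n_estimators=10, random_state=42)),
])

scores = cross_validate(model, X, y, cv=10,
                        scoring=["roc_auc", "matthews_corrcoef"])
print("AUC:", np.mean(scores["test_roc_auc"]))
print("MCC:", np.mean(scores["test_matthews_corrcoef"]))
```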

Fig. 1.
figure 1

Experimental framework

4 Results and Discussion

This section displays and analyses the results of evaluating the various constructed SDP models. It is crucial to show how the data sampling approach affects the effectiveness of SDP models. Furthermore, the performance of the studied data sampling-based ensemble models is one of the most important aspects of this study. As a result, the findings for investigated data sampling approaches, ensemble methods, and software defect datasets will be provided to reflect these impacts.

Table 4 and Table 5 display the AUC values of experimented prediction models using original and balanced NASA defect datasets.

Table 4. AUC values of NB and DT models on original and balanced datasets

From Table 4, NB and DT models developed using the balanced NASA datasets had better AUC values than when the original NASA datasets are used. Models based on the studied datasets recorded significant increments in their respective AUC values except in the case of the RUS-balanced KC3 dataset. NB and DT models trained with the SMOTE (NB: +9.52%; DT: +24.65%), ADASYN (NB: +8.16%; DT: +29.71%), ROS (NB: +4.23%; DT: +36.75%), and NearMiss (NB: +22.21%; DT: +39.36%)-balanced KC3 datasets had increments in AUC values when compared with the NB and DT models developed with the original KC3 dataset. On the PC1 dataset, NB and DT models developed with the SMOTE (NB: +5.19%; DT: +53.68%), ADASYN (NB: +4.94%; DT: +55.35%), ROS (NB: +3.54%; DT: +61.20%), RUS (NB: +4.56%; DT: +9.20%), and NearMiss (NB: +8.61%; DT: +40.47%)-balanced PC1 datasets had enhanced AUC values when compared with the NB and DT models developed with the original PC1 dataset. A similar occurrence was observed on the MW1 dataset: NB and DT models trained with the balanced MW1 datasets had superior AUC values to those trained on the original MW1 dataset in most cases. NB models developed with the SMOTE, ADASYN, ROS, NearMiss, and RUS-balanced MW1 datasets had more than a +100% increment in AUC values, while DT models developed with SMOTE (+76.54%), ADASYN (+73.96%), ROS (+87.28%), NearMiss (+59.64%), and RUS (+16.89%) had significant increments in AUC values. These results show that the experimented data sampling methods can enhance the performance of NB and DT in the presence of class imbalance.
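For clarity, the percentage increments reported throughout this section are taken to be relative improvements over the model trained on the original dataset (an assumption about the computation, which the text does not state explicitly):

$$ \Delta\text{AUC} = \frac{\text{AUC}_{balanced} - \text{AUC}_{original}}{\text{AUC}_{original}} \times 100\% $$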

Based on this observation, the performance of ensembled NB and DT models trained with the balanced and original NASA datasets is further analyzed. Table 5 presents the AUC values of ensemble NB and DT models on the original and balanced NASA datasets. Specifically, ensembled NB models developed with the SMOTE (BaggedNB: +5.93%; BoostedNB: +27.08%), ADASYN (BaggedNB: +7.42%; BoostedNB: +23.38%), ROS (BaggedNB: +5.64%; BoostedNB: +16.46%), and NearMiss (BaggedNB: +24.04%; BoostedNB: +16.92%)-balanced KC3 datasets had increments in AUC values when compared with the ensemble NB models developed with the original KC3 dataset. A similar occurrence was observed with ensemble DT models developed with the SMOTE (BaggedDT: +16.62%; BoostedDT: +30.06%), ADASYN (BaggedDT: +19.87%; BoostedDT: +33.15%), ROS (BaggedDT: +25.71%; BoostedDT: +39.88%), and NearMiss (BaggedDT: +14.68%; BoostedDT: +22.19%)-balanced KC3 datasets relative to the original KC3 dataset. In the case of the RUS-balanced KC3 dataset, the ensembled NB and DT models had poor AUC values, lower than those of the other experimented models.

Table 5. AUC values of ensemble NB and DT models on original and balanced datasets

For the PC1 dataset, ensemble NB and DT models trained with the balanced PC1 datasets had superior AUC values to those trained on the original PC1 dataset. Ensemble NB models developed with the SMOTE (BaggedNB: +5.22%; BoostedNB: +4.28%), ADASYN (BaggedNB: +5.61%; BoostedNB: +0.49%), ROS (BaggedNB: +4.08%; BoostedNB: +8.20%), NearMiss (BaggedNB: +12.48%; BoostedNB: +4.65%), and RUS (BaggedNB: +3.44%; BoostedNB: +7.96%)-balanced PC1 datasets had improved AUC values. Also, the ensemble DT models with the balanced PC1 datasets had better AUC values than when the original PC1 dataset is used.

In addition, on the MW1 dataset, ensemble NB and DT models trained with the balanced MW1 datasets had better AUC values than when the original MW1 dataset is deployed, except in the case of the RUS-balanced MW1 dataset. Ensembled NB models developed with the SMOTE (BaggedNB: +4.27%; BoostedNB: +11.49%), ADASYN (BaggedNB: +5.31%; BoostedNB: +15.89%), and NearMiss (BaggedNB: +5.05%; BoostedNB: +9.04%)-balanced MW1 datasets had increments in AUC values when compared with the ensemble NB model developed with the original MW1 dataset. A similar occurrence was observed with ensemble DT models developed with the SMOTE (BaggedDT: +29.63%; BoostedDT: +33.56%), ADASYN (BaggedDT: +29.64%; BoostedDT: +37.06%), ROS (BaggedDT: +33.11%; BoostedDT: +39.72%), and NearMiss (BaggedDT: +15.22%; BoostedDT: +32.73%)-balanced MW1 datasets relative to the original MW1 dataset.

Furthermore, as recommended by [1], the performance of the developed models was analyzed using the MCC value. Table 6 and Table 7 present the MCC values of the developed models on both the balanced and original NASA datasets. According to Table 6, NB and DT models created using the balanced datasets showed higher MCC values than those created with the original NASA datasets, and recorded a considerable increase in their respective MCC values. This observation further supports the notion that data sampling methods can improve the prediction performance of SDP models in the presence of class imbalance.

Table 6. MCC values of NB and DT models on original and balanced datasets
Table 7. MCC values of ensemble NB and DT models on original and balanced datasets

Table 7 presents the MCC values of ensemble NB and DT models on both the original and balanced datasets. Except in the case of the RUS-balanced dataset, ensemble NB and DT models developed using the balanced KC3 datasets showed more than a +40% increase in MCC values. A similar phenomenon was observed with the ensemble NB and DT models using the balanced PC1 and balanced MW1 datasets. Specifically, ensemble DT models on the balanced PC1 and balanced MW1 datasets had a +100% increment in their MCC values in most cases.

Consequently, these findings indicate that the deployment of balanced datasets further enhances the prediction performances of the experimented ensemble NB and DT models. Table 8 shows the performance comparison of some of the developed models (ROS-BoostedDT, ADASYN-BoostedDT and SMOTE-BoostedDT) with existing methods on PC1. Specifically, the experimental results from El-Shorbagy, El-Gammal and Abdelmoez [16], Li, Zhou, Zhang, Liu, Huang and Sun [45], and Alsaeedi and Khan [7] are compared with ROS-BoostedDT, ADASYN-BoostedDT and SMOTE-BoostedDT. The developed methods had superior AUC and MCC values to existing SDP models.

Table 8. Comparison of some developed models with existing SDP results

In summary, the analyses of the experimental results show that the investigated data sampling methods can ameliorate the class imbalance problem in SDP datasets while also improving the prediction performances of the SDP models. Furthermore, it was discovered that the analyzed data oversampling techniques (SMOTE, ADASYN, and ROS) outperformed the data undersampling approaches (NearMiss and RUS). The disparities in performance amongst the explored oversampling strategies vary across the studied datasets and chosen prediction models. However, it is worth mentioning that the RUS technique had the least influence on the prediction models and, in some instances, performed worse than when the original datasets were used. This finding may be ascribed to the random elimination of key data that might be critical for the SDP process. Although the NearMiss technique is likewise a data undersampling method, it eliminates instances based on their nearest-neighbour characteristics.

5 Conclusion and Future Works

Addressing SDP concepts and the class imbalance problem as described in this research work is critical for successful SDP model development. Data sampling methods are applied to software defect datasets in this study to alleviate the latent class imbalance problem by levelling the number of minority and majority class instances, resulting in new defect datasets with no class imbalance problem. Particularly, three data oversampling methods (SMOTE, ADASYN, and ROS) and two data undersampling methods (RUS and NearMiss) are deployed on defect datasets from the NASA repository, while ensembled (Bagging and Boosting) NB and DT classifiers are employed on the original and newly generated software defect datasets.

Overall, the experimental findings showed that the data sampling methods investigated can address the class imbalance problem in SDP datasets. Furthermore, in most of the experimental scenarios, the studied data sampling methods improved the prediction performances of the deployed ensembled NB and DT models. However, it should be noted that when combining ensemble models with data sampling methods, the choice of the data sampling method, as well as the base classifier, is critical if any significant result is to be achieved. In terms of effectiveness, the oversampling approaches (ROS, SMOTE, and ADASYN) had a greater positive impact on the prediction models than their undersampling counterparts (RUS and NearMiss).

As a result, it is recommended that data sampling operations, particularly oversampling approaches, be carried out during SDP activities. Implementing data sampling procedures may help to ease the underlying class imbalance issue and ensure the effectiveness of SDP models.

Following this, further study on the hybrid technique should be carried out using other ensemble methods, to investigate which data sampling methods best suit them and which classification algorithms work well with those ensemble methods.