Introduction

Software systems must be fault free to prevent service interruptions and financial losses. Proper software testing after development is therefore necessary to develop quality software, but testing consumes almost half of the allocated resources and budget [1]. Testing effort and cost should be reduced to develop quality software within a limited budget. High-quality software must have few faults, which reduces the testing cost and effort of detecting and correcting faults, because the fault proneness of a module depends on the code quality of the software [1]. To develop high-quality software, all software modules under development must be tested properly. However, it is difficult to execute all test cases over the execution paths of complex software to ensure that it is defect free. Fault prediction models based on code quality can identify poor-quality, fault-prone modules to support the testing and development teams, so that high-quality software can be developed with limited resources and budget. All such fault prediction models use code quality metrics to identify the fault-prone modules of the software.

Researchers have proposed many code metrics-based defect prediction models in the literature. The Chidamber and Kemerer (CK) metrics suite is widely used for software defect prediction in object-oriented systems [2,3,4,5]. These fault prediction models are based either on machine learning (ML) algorithms or on threshold values of code metrics [6,7,8,9,10,11,12,13]. Threshold-based fault prediction is faster than machine learning-based prediction because the tester can identify fault-prone modules directly by checking the threshold values. Threshold-based models predict a software module as defective when the value of a certain metric exceeds its threshold. Many researchers have proposed defect prediction models based on threshold values of code metrics. Henderson-Sellers proposed a model that categorizes classes into safe, flag, and alarm (defective) based on different threshold values of code metrics [14]. He also suggested that metric thresholds are relative to the complexity of the software. Daly et al. [15] studied the average time needed for a maintenance change in software modules with different numbers of inheritance levels. They noted that up to three levels of inheritance reduce the time of a maintenance change, while more than three levels increase it. Cartwright et al. [16] replicated Daly's study with different datasets, and the results differed from Daly's. El Emam et al. [17, 18] proposed a new method to find the threshold values of code metrics that uses logistic regression to draw a threshold line. Erni and Lewerentz [19] proposed a method to find thresholds based on the mean and standard deviation of code metric values. Shatnawi et al. [3] proposed a method based on ROC curves to find the thresholds of different code metrics. Several other methods exist to find code metric thresholds, such as Alves ranking and VARL [20, 21]. All these methods calculate the threshold value of each code metric independently, but a software module cannot be categorized as faulty based on only one metric's threshold value; relative threshold values for all metrics in the dataset are needed.

This research article proposes a new method based on metaheuristic algorithms to find the thresholds of different code metrics in object-oriented and procedural systems. A new objective function is formulated from specificity (also called the true negative rate, TNR) and sensitivity (also called the true positive rate, TPR), and metaheuristic algorithms such as particle swarm optimization (PSO) and the genetic algorithm (GA) are used to optimize it. While calculating the TPR and TNR, a class with at least one metric value beyond its threshold is considered faulty. We use GA and PSO to find the thresholds of code metrics and compare the performance of the metaheuristic-based models with the ROC curve-based model, as Boucher et al. [1] showed that ROC curve-based models give the best performance compared to other models such as Alves ranking and VARL.

The rest of this article is organized into related works, methodology, results, and conclusion. "Related Works" summarizes past work in this field. "Methodology" presents our proposed metaheuristic-based and ROC-based threshold calculation techniques; it also explains the feature selection procedure used to select the optimal set of features for the threshold-based fault prediction model. The results section compares the performance of the metaheuristic-based models with the ROC curve-based model, followed by the conclusion.

Related Works

Many studies in the recent past have developed threshold-based software fault prediction models because there is a direct relationship between code quality and metric values. This section presents those past studies.

McCabe [22] proposed a threshold value for McCabe's cyclomatic complexity metric based on experience. Catal et al. [11] used the PREDICTIVE tool to calculate the thresholds of code quality metrics; the tool is no longer available. Rosenberg et al. [23] proposed a statistical analysis method to calculate thresholds, deriving thresholds for the CK metrics suite from experience and statistical methods.

Shatnawi et al. [3] proposed a method that uses the ROC curve to find the thresholds of different code quality metrics. They plotted a curve between sensitivity and specificity over a range of candidate thresholds for each code metric, and the threshold with the maximum area under the curve was chosen for software fault prediction. The experiment was performed on three versions of the Eclipse project for binary and multi-class classification. They showed that the proposed method works well for multi-class classification; for binary classification, however, more accurate and sophisticated methods are needed.

Bender [20] proposed a method called VARL to calculate the thresholds of code metrics. This method was not originally proposed for software code metrics, but many researchers have used it in their studies to find optimal threshold values [3, 6, 24]. It uses univariate logistic regression to calculate the threshold of a metric. Some studies found it useful for threshold calculation [24]; other researchers reported that no valid threshold values could be found with VARL [3, 6].

Alves et al. [21] proposed a method, known as Alves ranking, to calculate the thresholds of different code metrics. They combined the data of several projects to find a single threshold value per code metric that is universal across projects. They used the threshold values to assess the quality of software modules; however, they did not use them to identify defective modules.

In their study, Benlarbi et al. [25] used logistic regression to find the thresholds of software code metrics. They showed that logistic regression cannot calculate optimal thresholds and hence does not improve the model's prediction performance.

In another study, Catal et al. [26] used the ROC curve method originally proposed by Shatnawi et al. [3] to find thresholds for software fault prediction, with a slight modification: instead of maximizing the sum of sensitivity (TPR) and specificity (TNR), they plotted a curve between sensitivity and 1 − specificity and maximized the AUC defined by the three points (0, 0), (sensitivity, 1 − specificity), and (1, 1).

After studying the different threshold calculation techniques in the literature, we propose a metaheuristic-based method to calculate the relative thresholds of different code metrics. Table 1 compares the proposed work with existing threshold-based fault prediction methods.

Table 1 A comparative study of the proposed model with existing threshold-based fault prediction models

Based on the literature, we identified that threshold-based software fault prediction models are faster than machine learning-based models [1]. Once a universal threshold value for a code metric is decided, it can be applied to any software to identify faulty modules, saving the testing team cost and effort because the tester can identify faulty modules immediately from the threshold values, without a prediction model. Many techniques exist to calculate code metric thresholds, such as ROC, VARL, and Alves ranking, but all of them calculate the threshold of each code metric independently. There is a need to find relative threshold values of code metrics because independent thresholds cannot identify the faulty classes.

Methodology

This section describes the detailed steps of our proposed technique for deciding the threshold values of different code metrics for software fault identification. Figure 1 shows the overall structure of the proposed model.

Fig. 1

Proposed fault prediction model based on metrics’ thresholds

The first step in Fig. 1 normalizes the original dataset using the min–max normalization technique. The normalized dataset is divided into training and testing data. On the training data, a rank is assigned to every feature based on its ROC value; features with higher ROC values receive higher ranks. Relevant features are then selected from the ranked features using an iterative feature elimination method, in which the lowest-ranked feature is eliminated at each step. Next, the dataset with the selected features is provided as input to train the model based on metaheuristic algorithms such as PSO and GA. After training, the optimal threshold values of the selected features are extracted and used for software fault prediction on the testing data. Finally, the model's performance is evaluated based on accuracy, sensitivity (true positive rate), and specificity (true negative rate).

ROC Curve Based Thresholds

The ROC curve-based threshold detection method plots an ROC curve for each code metric and selects the threshold value where the sum of sensitivity and specificity is maximum. Plotting an ROC curve requires one continuous variable and one categorical variable; here, the continuous variable is the code metric and the categorical variable is the class label. The ROC curve is drawn over a range of candidate thresholds from the minimum to the maximum value of the code metric, and the threshold where the sum of sensitivity and specificity is maximum is chosen. Once the thresholds of all code metrics have been found, they are used for software fault prediction: a class is considered faulty if the threshold of at least one code metric is violated.
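To make this procedure concrete, the following is a minimal sketch of the per-metric threshold search, assuming normalized metric values in a NumPy array and binary labels (1 = faulty, 0 = healthy); the function name and inputs are illustrative, not the authors' implementation.

```python
import numpy as np

def roc_threshold(metric_values, labels, step=0.0001):
    """Return the threshold of one code metric maximizing sensitivity + specificity."""
    best_threshold, best_score = None, -1.0
    for t in np.arange(metric_values.min(), metric_values.max() + step, step):
        predicted_faulty = metric_values > t          # violation of candidate threshold
        tp = np.sum(predicted_faulty & (labels == 1))
        fn = np.sum(~predicted_faulty & (labels == 1))
        tn = np.sum(~predicted_faulty & (labels == 0))
        fp = np.sum(predicted_faulty & (labels == 0))
        sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
        specificity = tn / (tn + fp) if (tn + fp) else 0.0
        if sensitivity + specificity > best_score:    # keep the best trade-off
            best_score, best_threshold = sensitivity + specificity, t
    return best_threshold
```

The step size of 0.0001 matches the ROC configuration reported later in this article.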

Metaheuristic Algorithms Based Thresholds

The steps for applying metaheuristic algorithms to find the optimal threshold values of software code metrics are shown in Algorithm 1.

Algorithm 1

Normalization

Simple min–max normalization is used to standardize the range of all selected code metrics before applying the metaheuristic algorithms. The values of each code metric are normalized between 0 and 1.

Some code metrics, such as WMC, CBO, and RFC, are directly related to fault proneness: higher values indicate poorer code quality. Others, such as the cohesion metrics TCC, LCC, and SCOM, are inversely related: higher values indicate better code quality. The values of the inversely related code metrics must therefore be reversed after normalization by subtracting each value from 1.0.
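The following is a minimal sketch of this normalization step, assuming the dataset is a pandas DataFrame of numeric metric columns; the default set of inverse metrics is taken from the examples above.

```python
import pandas as pd

def normalize_metrics(df: pd.DataFrame,
                      inverse_metrics=("TCC", "LCC", "SCOM")) -> pd.DataFrame:
    # Min-max normalization: scale every metric into [0, 1].
    norm = (df - df.min()) / (df.max() - df.min())
    # Reverse the inversely related (cohesion-style) metrics so that larger
    # values always indicate poorer quality, as the thresholding rule assumes.
    for col in inverse_metrics:
        if col in norm.columns:
            norm[col] = 1.0 - norm[col]
    return norm
```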

Feature Selection

In the first step, we select relevant code metrics from the 20 code metrics available in the PROMISE repository datasets. Selecting relevant code metrics is very important when building a software fault prediction model based on metric threshold values, because irrelevant features can drastically reduce prediction performance. Features can be selected based on experimentation or through feature selection techniques.

In this work, we select the best features based on the following steps:

  1. The objective value (the sum of specificity and sensitivity) of each code metric on each selected dataset is calculated using the ROC curve method.

  2. The average objective value of each code metric across all selected datasets is computed to obtain its overall performance.

  3. The code metrics are then arranged in decreasing order of average objective value, since our objective function is a maximization function.

  4. After sorting, features are eliminated from the lowest rank upward. In each iteration, the lowest-ranked feature (the one with the lowest objective value) is eliminated, the combined objective value of the remaining features is recalculated, and it is compared with the previous objective value. If the performance degrades, feature elimination stops. A sketch of this loop is given after Fig. 2.

Figure 2 shows the fourth step of the above procedure for feature selection. In Fig. 2, \(\text{obj}\) represents the objective value of the feature set in the current iteration and \({\text{obj}}_{\text{prev}}\) represents the objective value of the feature set in the previous iteration. Initially, \({\text{obj}}_{\text{prev}}\) is set to 0 and the feature set contains all 20 features arranged in descending order of ROC value. Ɛ is a small constant, set to 0.01 in our case.

Fig. 2

Feature set selection based on iterative feature elimination
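The elimination loop of Fig. 2 can be sketched as follows; `objective(features)` is assumed to evaluate Eq. (3) on the training data for a given feature subset, and `ranked_features` is assumed to be sorted in descending order of average objective value.

```python
EPSILON = 0.01  # tolerance Ɛ from Fig. 2

def eliminate_features(ranked_features, objective):
    """Iteratively drop the lowest-ranked feature until performance degrades."""
    selected = list(ranked_features)
    obj_prev = objective(selected)
    while len(selected) > 1:
        candidate = selected[:-1]        # drop the lowest-ranked feature
        obj = objective(candidate)
        if obj < obj_prev - EPSILON:     # performance degraded: stop eliminating
            break
        selected, obj_prev = candidate, obj
    return selected
```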

Objective Function

Our objective is to maximize the sum of sensitivity and specificity. Sensitivity, also called the true positive rate (TPR), measures the fraction of positive instances that are correctly classified as positive. Specificity, also called the true negative rate (TNR), measures the fraction of negative instances that are correctly classified as negative. Sensitivity and specificity are calculated using Eqs. (1) and (2) [28].

$$\text{Sensitivity }\,(\text{TPR})=\frac{\text{TP}}{\text{TP}+\text{FN}}$$
(1)
$$\text{Specificity }\,(\text{TNR})=\frac{\text{TN}}{\text{TN}+\text{FP}}$$
(2)

The threshold values are used to differentiate the positive and negative classes, so we maximize the sum of sensitivity and specificity, as shown in Eq. (3).

$$\text{Objective function}=\left(\frac{\text{TP}}{\text{TP}+\text{FN}}+ \frac{\text{TN}}{\text{TN}+\text{FP}}\right)$$
(3)

A class with a metric value greater than the threshold for at least one metric is considered faulty; otherwise, the class is considered healthy. The value of the objective function varies between 0 and 2.
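A minimal sketch of this objective, assuming X is an (n_modules, n_metrics) NumPy array of normalized metric values and y holds the labels (1 = faulty, 0 = healthy):

```python
import numpy as np

def objective(thresholds, X, y):
    """Eq. (3): sensitivity + specificity for a candidate threshold vector."""
    # A module is predicted faulty if ANY metric exceeds its threshold.
    predicted_faulty = np.any(X > thresholds, axis=1)
    tp = np.sum(predicted_faulty & (y == 1))
    fn = np.sum(~predicted_faulty & (y == 1))
    tn = np.sum(~predicted_faulty & (y == 0))
    fp = np.sum(predicted_faulty & (y == 0))
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    return sensitivity + specificity     # ranges from 0 to 2
```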

Metaheuristic Algorithms

Any metaheuristic algorithm can optimize the objective function in Eq. (3). This research article uses a genetic algorithm (GA) and particle swarm optimization (PSO).

In the genetic algorithm, the chromosome size equals the number of code metrics whose thresholds are to be detected. Each chromosome element represents the threshold value of the corresponding code metric and must contain a value between 0 and 1, because each code metric is normalized into that range. A population of 20 chromosomes is generated in the first step of the algorithm. In the second step, the objective value of each chromosome is calculated using Eq. (3); to do this, we first calculate sensitivity and specificity, considering a class faulty if at least one of its metric values is greater than the corresponding threshold. After finding each chromosome's objective value, the chromosomes are sorted from high to low objective value. The best two chromosomes are selected, and binary crossover is applied to generate the offspring. The best chromosome and its objective value are stored as the global optimum. Mutation is performed to add randomization, with a mutation probability of 0.1. After one hundred iterations, the best chromosome is returned; it represents the threshold values of the code metrics and can be used for software fault prediction. Algorithm 2 presents the overall steps of the genetic algorithm.

Algorithm 2
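As an illustration, the following is a hedged sketch of Algorithm 2, reusing the `objective` function above; the paper's binary crossover is rendered here as a uniform binary-mask crossover, and the parameters follow the stated configuration (population 20, mutation rate 0.1, 100 iterations).

```python
import numpy as np

rng = np.random.default_rng()

def ga_thresholds(X, y, pop_size=20, iterations=100, mutation_rate=0.1):
    n_metrics = X.shape[1]
    population = rng.random((pop_size, n_metrics))       # thresholds in [0, 1]
    best, best_obj = None, -1.0
    for _ in range(iterations):
        scores = np.array([objective(c, X, y) for c in population])
        order = np.argsort(scores)[::-1]                 # sort high to low
        population, scores = population[order], scores[order]
        if scores[0] > best_obj:                         # track the global optimum
            best, best_obj = population[0].copy(), scores[0]
        p1, p2 = population[0], population[1]            # two best parents
        mask = rng.random((pop_size, n_metrics)) < 0.5   # binary-mask crossover
        offspring = np.where(mask, p1, p2)
        mutate = rng.random((pop_size, n_metrics)) < mutation_rate
        offspring[mutate] = rng.random(np.sum(mutate))   # random-reset mutation
        population = offspring
    return best, best_obj
```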

In the particle swarm optimizer, twenty random particle positions are initialized between 0 and 1. Each particle's size equals the number of code metrics in the feature set. Each particle's velocity components are initialized between −0.5 and 0.5. The global best objective value and the local best objective values are initialized to 0, and the global best and local best particle positions are initially empty.

After initialization, the objective value of each particle is calculated using Eq. (3), in the same way as in the genetic algorithm. The local best and global best values of the fitness function are stored, along with the particle achieving the global best. In the next step, the velocity and position of each particle are updated using Eqs. (4) and (5) [29].

$${v}^{i}\left(t+1\right)= w\times {v}^{i}\left(t\right)+ {c}_{1}{r}_{1}\left({\text{best}}_{\text{local}}^{i}- {X}^{i}\left(t\right)\right)+ {c}_{2}{r}_{2}\left({\text{best}}_{\text{global}}-{X}^{i}\left(t\right)\right)$$
(4)
$${X}^{i}\left(t+1\right)= {X}^{i}\left(t\right)+ {v}^{i}\left(t+1\right)$$
(5)

Here, \(w\) is the inertia weight, \({c}_{1}\) and \({c}_{2}\) are the acceleration coefficients, and \({r}_{1}\) and \({r}_{2}\) are random numbers drawn uniformly from [0, 1]. While calculating the new position of a particle, if the value of any element of the particle goes beyond the valid range of the code metrics (between 0 and 1 in our case), that element is reinitialized randomly between 0 and 1.

The global best objective value and the corresponding particle are returned after one hundred iterations. The global best particle represents the optimal threshold values of the features used for software fault prediction. Algorithm 3 shows the overall steps of the PSO algorithm applied to select optimal threshold values.

Algorithm 3
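A corresponding sketch of Algorithm 3, implementing Eqs. (4) and (5) and again reusing `objective`; the inertia weight `w` is an assumption, since the paper reports only \({c}_{1}={c}_{2}=1\).

```python
import numpy as np

rng = np.random.default_rng()

def pso_thresholds(X, y, n_particles=20, iterations=100, w=0.7, c1=1.0, c2=1.0):
    n_metrics = X.shape[1]
    pos = rng.random((n_particles, n_metrics))              # positions in [0, 1]
    vel = rng.uniform(-0.5, 0.5, (n_particles, n_metrics))  # initial velocities
    local_best, local_obj = pos.copy(), np.zeros(n_particles)
    # The paper initializes the global best to 0/empty; seeding it with the
    # first particle avoids an unset value in the first velocity update.
    global_best, global_obj = pos[0].copy(), -1.0
    for _ in range(iterations):
        for i in range(n_particles):
            obj = objective(pos[i], X, y)
            if obj > local_obj[i]:
                local_best[i], local_obj[i] = pos[i].copy(), obj
            if obj > global_obj:
                global_best, global_obj = pos[i].copy(), obj
        r1 = rng.random((n_particles, n_metrics))
        r2 = rng.random((n_particles, n_metrics))
        vel = w * vel + c1 * r1 * (local_best - pos) + c2 * r2 * (global_best - pos)
        pos = pos + vel
        out = (pos < 0) | (pos > 1)           # reinitialize out-of-range elements
        pos[out] = rng.random(np.sum(out))
    return global_best, global_obj
```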

Results and Discussion

This section presents the results of the experiments performed on the selected datasets. Experimentation was done on an HP Pavilion laptop with 8 GB RAM and an Intel Core i5 processor.

Datasets

Table 2 shows the datasets selected for our experimentation. The datasets are downloaded from the PROMISE repository [27] and selected based on recent studies [30,31,32,33,34]. They are divided into two categories in Table 2: the first ten datasets are based on object-oriented code metrics, and the last four are based on procedural code metrics because the corresponding software is developed in a procedural programming language. Each object-oriented dataset contains 20 features: WMC, CBO, RFC, LOC, LCOM, LCOM3, DIT, NOC, CA, CE, MAX_CC, AVG_CC, NPM, MOA, MFA, DAM, CAM, IC, CBM, and AMC. Each procedural dataset contains 28 code metrics: total_loc, blank_loc, comment_loc, code_and_comment_loc, executable_loc, unique_operands, unique_operators, total_operands, total_operators, halstead_vocabulary, halstead_length, halstead_volume, halstead_level, halstead_difficulty, halstead_effort, halstead_error, halstead_time, branch_count, decision_count, call_pairs, condition_count, multiple_condition_count, cyclomatic_complexity, cyclomatic_density, decision_density, design_complexity, design_density, normalized_cyclomatic_complexity, and formal_parameters.

Table 2 Datasets used to perform experiments

This research article uses only six features for the object-oriented datasets: WMC, LOC, CBO, RFC, DIT, and LCOM, because these six features give the best threshold-based results compared with using all 20 features. For the procedural datasets, the selected code metrics are halstead_volume, cyclomatic_complexity, cyclomatic_density, decision_density, design_density, and normalized_cyclomatic_complexity. Each dataset is divided in a 67–33% ratio: 67% of the data is used to find the threshold values, and the remaining 33% is used to test the performance. Each experiment is performed ten times on each dataset, and the averages and standard deviations are collected to compare the techniques.
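A minimal sketch of this evaluation protocol, assuming NumPy arrays and scikit-learn's train_test_split; the threshold finder passed in can be either of the metaheuristic sketches above (e.g., `evaluate(X, y, ga_thresholds)`).

```python
import numpy as np
from sklearn.model_selection import train_test_split

def evaluate(X, y, find_thresholds, runs=10):
    """Repeat a 67-33 split `runs` times and report mean and std of accuracy."""
    accuracies = []
    for seed in range(runs):
        X_tr, X_te, y_tr, y_te = train_test_split(
            X, y, test_size=0.33, random_state=seed)
        thresholds, _ = find_thresholds(X_tr, y_tr)
        predicted = np.any(X_te > thresholds, axis=1)   # any violated threshold
        accuracies.append(np.mean(predicted == y_te))
    return np.mean(accuracies), np.std(accuracies)
```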

Performance Metrics

The accuracy, sensitivity, and specificity performance metrics are selected to compare the techniques. Accuracy shows the overall performance of the model, while sensitivity and specificity measure how well the proposed approach differentiates the classes in a binary classification problem. Accuracy, sensitivity, and specificity are calculated using Eqs. (6), (7), and (8) [28].

$$\text{Accuracy}=\frac{\text{TP}+\text{TN}}{\text{TP}+\text{TN}+\text{FP}+\text{FN}}$$
(6)
$$\text{Sensitivity}=\frac{\text{TP}}{\text{TP}+\text{FN}}$$
(7)
$$\text{Specificity}=\frac{\text{TN}}{\text{TN}+\text{FP}}$$
(8)
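For reference, Eqs. (6)-(8) translate directly into code; the confusion-matrix counts are assumed to come from the threshold-based predictions described above.

```python
def performance_metrics(tp, tn, fp, fn):
    """Eqs. (6)-(8): accuracy, sensitivity, and specificity from TP/TN/FP/FN."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    return accuracy, sensitivity, specificity
```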

Parameter Configuration

  1. ROC: step size = 0.0001

  2. GA: iterations = 100, population size = 20, crossover = binary, mutation rate = 0.1

  3. PSO: iterations = 100, \({c}_{1}={c}_{2}=1\), number of particles = 20
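Collected in one place, the stated configuration could be expressed as a simple dictionary; note that the PSO inertia weight is not reported here and is therefore omitted.

```python
CONFIG = {
    "roc": {"step_size": 0.0001},
    "ga":  {"iterations": 100, "population_size": 20,
            "crossover": "binary", "mutation_rate": 0.1},
    "pso": {"iterations": 100, "c1": 1.0, "c2": 1.0, "n_particles": 20},
}
```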

Performance Comparison

This section compares the proposed threshold-based software fault prediction technique with the existing technique based on the accuracy, specificity, and sensitivity performance metrics.

Selection of Features

The selection of good features is important when predicting software faults based on metric threshold values; otherwise, prediction accuracy drops drastically.

Table 3 shows the accuracy comparison of the different threshold selection techniques with all features and with the selected features. The selected features are WMC, CBO, RFC, LCOM, LOC, and DIT for object-oriented software, and halstead_volume, cyclomatic_complexity, cyclomatic_density, decision_density, design_density, and normalized_cyclomatic_complexity for procedural software. The selected feature set is not necessarily the same for all datasets; it may differ depending on the initial feature set and the values of the features. However, an optimal set of features can be selected for any dataset using the procedure presented in "Feature Selection".

Table 3 Accuracy comparison of selected features with all features

Prediction accuracy with all features is very low, as shown in Table 3, so choosing good features is essential to achieve acceptable results. The best values are shown in boldface in Table 3.

In the case of ROC-based software fault prediction, the selected features give approximately a 14% improvement in accuracy; for GA-based prediction, a 24% improvement; and for PSO-based prediction, a 21% improvement. However, even with the selected features, the ROC curve-based results remain poor and below the acceptable accuracy of 0.6 [1]. With GA and PSO, the achieved performance is above this acceptable level, and GA gives 3% better results than PSO.

Based on Table 3, we conclude that the selected features significantly improve the performance of threshold-based software fault prediction. Further, the metaheuristic algorithms provide better relative threshold values for the selected features than ROC, and among the metaheuristics, GA provides better results than PSO.

Accuracy Comparison

Table 4 presents the accuracy comparison of the metaheuristic algorithm-based threshold detection techniques with the ROC-based technique. The average accuracy over 10 runs is shown in Table 4, with the standard deviation after the '\(\pm\)' sign; the best results are shown in boldface. GA gives the best results on 12 of the 14 datasets. Of the remaining two, ROC gives the best result on Lucene-2.2, and PSO gives the best result on Jedit-4.0.

Table 4 Accuracy comparison of metaheuristic algorithms with ROC-based thresholds

On average, ROC achieves 43% accuracy with the selected features, which is too poor to apply to real-life projects. The metaheuristic-based threshold values, on the other hand, achieve accuracy greater than 60%, which is acceptable. PSO-based thresholds achieve 66% average accuracy, and GA-based thresholds achieve 69% average accuracy, 2% better than PSO.

Figure 3 compares the accuracy of the different threshold-based software fault prediction techniques. The red line shows the results of ROC-based threshold values, the blue line shows the accuracy of PSO-based threshold values, and the green line shows the accuracy achieved by GA-based threshold values.

Fig. 3

Accuracy comparison of ROC-, GA-, and PSO-based threshold detection techniques

The green and blue lines almost overlap. However, the green line is slightly above the blue line, so GA performs marginally better than PSO. Both GA and PSO outperform the ROC method, as shown in Fig. 3.

Thus, Fig. 3 makes clear that the results of GA and PSO are comparable, with GA slightly ahead, and that both are far better than the ROC-based threshold detection technique.

Other Metrics Comparison

Table 5 shows the sensitivity comparison of the ROC-, GA-, and PSO-based threshold detection techniques over the selected datasets. It is clear from Table 5 that the ROC-based technique provides better sensitivity than the metaheuristic-based techniques; the sensitivity (true positive rate) of ROC is better on all datasets. However, the true positive rate alone does not show that ROC-based threshold values yield a better model; the true negative rate must also be considered.

Table 5 Sensitivity comparison

Table 6 presents the specificity comparison of the different threshold detection techniques. On 9 of the 14 datasets GA gives better results, and PSO gives better results on the remaining five. On all fourteen datasets, the metaheuristic-based techniques give better specificity than the ROC-based technique. However, the true negative rate alone does not show that the metaheuristic-based thresholds provide a better software fault prediction model; both the true positive and true negative rates must be considered.

Table 6 Specificity comparison

Figure 4 compares the sensitivity-specificity trade-offs of the different threshold detection techniques. ROC gives better sensitivity, while the metaheuristic-based techniques give better specificity. When the trade-off between sensitivity and specificity is considered, however, the metaheuristic algorithms give better results than ROC, and the results of the different metaheuristic algorithms are comparable.

Fig. 4

Trade-off between sensitivity and specificity

In Fig. 4, the red line shows the performance of ROC-based threshold values, the green line shows GA-based threshold values, and the blue line shows PSO-based threshold values. The blue and green lines lie above the red line, so based on Fig. 4 we conclude that metaheuristic-based threshold values are better than ROC-based ones for developing a good software fault prediction model.

Threats to Validity

This study compares metaheuristic-based threshold values with ROC-based threshold values for software fault prediction. We used fourteen open-source software projects: ten developed with object-oriented programming and four with procedural programming. However, the performance of threshold-based fault prediction models depends on the features extracted from the code. We compare results based on only 20 object-oriented and 28 procedural code metrics, but many more code metrics exist in the literature, and the model's performance can vary with changes in the input features.

Second, we cannot claim that the threshold values determined in this research article are universal, because the experiment was performed on a small set of datasets.

Third, the thresholds determined by the metaheuristic algorithms are not exact; metaheuristics always give approximate solutions. However, their performance is better than that of the ROC-based technique, and properly tuned algorithms can achieve good results.

Conclusion

In this research article, we propose a new method based on metaheuristic algorithms to find the threshold values of software code metrics for software fault prediction. GA and PSO are used to find the threshold values, and their performance is compared with that of ROC-based threshold values. In the first experiment, useful code metrics are selected by experimentation from the 20 available object-oriented and 28 procedural code metrics, because poor code metrics can drastically reduce the prediction performance of threshold-based software fault prediction models. In the second experiment, threshold values of the selected code metrics are detected and used for software fault prediction. The accuracy, sensitivity, and specificity results are compared with those of the ROC curve-based threshold detection technique. The results show that metaheuristic algorithms give better threshold values for software fault prediction.

In the future, we will compare the performance of more metaheuristic algorithms for better and faster convergence and use the threshold values for cross-project software fault prediction.