1 Introduction

Software modules have become a crucial component of every business model as the software industry has grown rapidly, regardless of the industry's core product: manufacturing, banking, healthcare, aviation, medicine, e-commerce, social networking, education, or any other. Delivering software components with the desired quality is a challenging task in the development process. Software Quality Assurance (SQA) aims to ensure the desired software quality at a reasonable price by monitoring and managing the development process. SQA comprises inspection, walk-through, code review, peer review, software fault prediction, and testing. Jones and Bonsignour (2012) reported that finding and correcting defects is one of the most expensive software development activities. The complexity of the software increases as the development stages progress, making defects hide deeper and become harder to find. Unidentified defects eventually manifest themselves and cause the application to malfunction. Figure 1 shows the generic software defect prediction framework architecture. Early prediction of defects has therefore become a recent interest among researchers, as it not only helps identify flaws in the early stages but also reduces cost and effort. Using software metrics (Schneider et al. 1992; Kamei et al. 2012; Hall et al. 2011; Pandey et al. 2021), software defect prediction (Weiss and Provost 2001; Yoon and Kwek 2007) aims to narrow down a software project's most likely fault-prone modules. Early prediction assists the software quality assurance team in allocating finite resources effectively to build quality software.

Fig. 1 Software defect prediction framework

Regression, SVM, KNN, Decision Tree, and other techniques have been used for years to build SDP models, but these models depend on a good amount of training data. Unfortunately, defect data sets suffer from the class imbalance problem: the amount of data in one class outnumbers the number of instances in the other class, leading to biased learning (Provost 2000; Diez-Pastor et al. 2015; Thabtah et al. 2020). Researchers have proposed a wide range of approaches to confront class imbalance, including class rebalancing through sampling, algorithmic approaches, and ensemble-based approaches. Sampling techniques either reduce the number of majority instances, i.e., under-sampling, or increase the minority instances by introducing synthetic samples, i.e., over-sampling (Gao et al. 2014).

Algorithmic approaches involve adaptive weight learning, cost-sensitive learning, and threshold adjustment methods (Guo and Viktor 2004; Sri Kavya 2020; Ozturk 2017; Zhang et al. 2015). Finally, ensemble-based strategies, including bagging, boosting, and voting, are applied to improve learners' performance (Song et al. 2019; Freund and Schapire 1997; Jo and Japkowicz 2004). However, the minority class's small disjuncts (Rathore and Kumar 2019; Li and Henry 1993) are equally responsible for the poor performance of SDP models. Minority class samples are widely dispersed in the distribution space, yet when designing sampling methods, researchers often take a single-cluster view in which all minority samples belong to one cluster, which causes overgeneralization due to class overlapping.

Initial research has been done to comprehend and assess how well different metrics measure fault proneness (Briand et al. 2001; Ohlsson et al. 1998; Menzies et al. 2007; Gray et al. 2010). Li and Henry (1993) conducted a study analysing the relationship between the object-oriented metrics proposed by CK using multiple linear regression and stated that a strong relation exists between the proposed metrics and the effort spent on maintenance activity. Briand et al. (2001) validated the usefulness of coupling and cohesion in fault prediction and showed they are strong candidates for fault prediction. As new metrics are introduced, researchers focus on their ability to indicate fault proneness. Static code metrics have been found to be highly correlated with defects, as investigated by various researchers (Menzies et al. 2007; Gray et al. 2010).

Japkowicz and Stephen (2002) studied the class imbalance problem and discussed two major sampling techniques to counter it: oversampling and under-sampling. The oversampling approach increases the number of samples in the minority class by random duplication until the desired level of balance is attained, which can result in overfitting due to the repetition of samples. To compensate for the randomness, an information-based approach was proposed in which the samples' closeness is considered for duplication. The second technique discussed was random under-sampling, which eliminates samples from the majority class until the desired level of balance is attained, resulting in poor performance due to loss of information. Information-based under-sampling techniques focus on the samples far away from the boundaries, which are considered for elimination.

Chawla et al. (2002) proposed a novel oversampling technique called SMOTE, which operates on the feature space to generate synthetic samples by choosing each minority sample and introducing synthetic samples along the lines joining it to its minority neighbours. The number of synthetic samples is chosen according to the level of balance required. Since the samples are placed along these lines, the decision boundaries may overlap, resulting in overgeneralization. Variations of SMOTE have been proposed, such as BSMOTE (Han et al. 2005), which focuses on the samples lying on the decision boundary, resulting in denser samples near the boundary, and Safe-Level SMOTE (Bunkhumpornpat et al. 2009), which is quite similar to SMOTE but introduces all synthetic samples along the safe line.

MWMOTE (Barua et al. 2014) introduces weights for the instances that are hard to classify and focuses on the more challenging samples alone. ADASYN (He et al. 2008) is similar to MWMOTE, assigning weights to instances based on how difficult they are to classify. Puntumapon et al. (2016) proposed cluster-based minority over-sampling, which uses the TRIM criterion to identify all minority class regions, form clusters, and combine multiple small clusters into larger ones. Bennin et al. (2017) proposed MAHAKIL, a diversity-based oversampling technique that introduces synthetic samples using a diversity measure, the Mahalanobis distance, by averaging two minority samples.

2 Method

2.1 Overview

Most oversampling approaches do not consider the distribution of the minority samples and treat all instances of the minority class as a single cluster. Since the minority samples are in reality more dispersed, generated synthetic samples may overlap with representatives of the majority class or may escape the minority decision boundary, which leads to overgeneralization. The poor performance of SDP models is as attributable to the small disjuncts of minority instances as it is to class imbalance. By considering the distribution of the minority data samples and partitioning them into multiple clusters, we can eliminate the possibility of overlapping.

2.2 Diversity measurement

Researchers have often opted for distance metrics such as Hamming, Manhattan, Minkowski, Euclidean, and Cosine distance. These are mere statistical distances that compute the difference between two points, between which synthetic samples are then introduced. Most statistical measures do not consider the sample distribution and suffer when applied to high-dimensional data. To alleviate this issue, we opted for the Mahalanobis distance (D), a diversity-based measure that computes how far an individual sample diverges from the population and has already been applied to outlier detection. Considering x1 and x2 as two instances of the minority class, the Mahalanobis distance is defined by Eq. (1).

$$D^{2} = (x_{1} - x_{2})^{T}\,C^{-1}\,(x_{1} - x_{2})$$
(1)

where D2 is the square of the Mahalanobis distance, x1 and x2 are the observation vectors, i.e., rows of the data set, and C−1 is the inverse of the covariance matrix of the independent variables.

The covariance matrix of a feature vector with n dimensions is computed as

$$C = \begin{pmatrix} var(n_{1}) & cov(n_{1},n_{2}) & \cdots & cov(n_{1},n_{n}) \\ cov(n_{1},n_{2}) & var(n_{2}) & \cdots & cov(n_{2},n_{n}) \\ \vdots & \vdots & \ddots & \vdots \\ cov(n_{1},n_{n}) & cov(n_{2},n_{n}) & \cdots & var(n_{n}) \end{pmatrix}$$

Variance (V) computes the diversity of the data samples around their mean, and covariance (c) measures the relationship between variables; together they measure the diversity of a sample from its population. Variance is given by Eq. (2).

$$V = \frac{\sum_{i=1}^{n} (x_{i} - \mu)^{2}}{n}$$
(2)

where xi is the i-th element of the column vector, µ is the mean of the column, and n is the number of rows. Similarly, covariance is computed using Eq. (3):

$$c(x,y) = \frac{\sum_{i=1}^{n} (x_{i} - \mu_{x})(y_{i} - \mu_{y})}{n}$$
(3)
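For illustration, the following minimal Python sketch computes Eq. (1) with NumPy, estimating C from the minority rows via Eqs. (2) and (3). The function name and the random data are illustrative only, not part of the original implementation.

```python
import numpy as np

def mahalanobis_sq(x1, x2, X_minority):
    """Squared Mahalanobis distance between two minority instances,
    per Eq. (1). C is estimated from the minority sample matrix
    (rows = instances, columns = metrics)."""
    # Covariance matrix per Eqs. (2)-(3); bias=True uses the n
    # denominator of the text rather than NumPy's default n-1.
    C = np.cov(X_minority, rowvar=False, bias=True)
    # Pseudo-inverse instead of a plain inverse, so a singular C
    # (correlated features, few minority rows) is handled gracefully.
    C_inv = np.linalg.pinv(C)
    diff = np.asarray(x1, dtype=float) - np.asarray(x2, dtype=float)
    return float(diff @ C_inv @ diff)

# Example: distance between the first two minority instances
rng = np.random.default_rng(0)
X_min = rng.normal(size=(30, 5))  # 30 minority rows, 5 software metrics
d2 = mahalanobis_sq(X_min[0], X_min[1], X_min)
```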

2.3 Clustering of minority samples

Based on the computed Mahalanobis distance matrix, we group the minority samples into multiple clusters guided by the Calinski-Harabasz (CH) index and the Davies-Bouldin (DB) index. Both metrics help identify the correct number of clusters based on compactness and cluster separation, as given by Eqs. (4) and (5).

$${\text{CH}}_{{\text{k}}} = \frac{BCSM}{{k - 1}}* \frac{n - k}{{WCSM}}$$
(4)

where k is the number of clusters, n is the number of minority samples, BCSM measures the separation between clusters, and WCSM measures cluster compactness.

$$\text{DB} = \frac{1}{n}\sum_{i=1}^{n} \max_{j \ne i} \left( \frac{\sigma_{i} + \sigma_{j}}{d(c_{i}, c_{j})} \right)$$
(5)

where n is the number of clusters and σi is the average distance of all vectors in cluster i from its cluster center ci. We compute the number of clusters (nc) by taking the minimum over the two indices, CH and DB, using Eq. (6):

$${\text{nc}} = {\text{ min}}\left( {{\text{CH}},{\text{DB}}} \right)$$
(6)
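A minimal sketch of this selection step, assuming K-means is used to produce the candidate partitions and using scikit-learn's implementations of the two indices. How the two scores are combined in Eq. (6) is paraphrased here as preferring a low DB score with the CH score as tie-breaker; the paper's exact rule may differ.

```python
from sklearn.cluster import KMeans
from sklearn.metrics import calinski_harabasz_score, davies_bouldin_score

def choose_num_clusters(X_min, k_max=10):
    """Score candidate cluster counts with the CH index (Eq. 4,
    higher is better) and the DB index (Eq. 5, lower is better)."""
    scores = {}
    for k in range(2, k_max + 1):
        labels = KMeans(n_clusters=k, n_init=10,
                        random_state=0).fit_predict(X_min)
        # Sort key: minimise DB first, break ties by maximising CH
        scores[k] = (davies_bouldin_score(X_min, labels),
                     -calinski_harabasz_score(X_min, labels))
    return min(scores, key=scores.get)
```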

2.4 Synthetic sample generation

Based on the computed diversity measure (MD), the instances are grouped into multiple clusters. The feature vector, made up of metric components extracted from software artifacts, is treated as a chromosome, following Walter Sutton's chromosomal theory of inheritance (as in MAHAKIL).


By computing the midpoint of each cluster, the data are partitioned into two subgroups, where g1 holds the instances above the center and g2 the instances below it. The parents are indexed from 1 to n within g1 and g2. A parent p1 from g1 and a parent p2 from g2 with matching indices are fed into the sampling process, which yields new offspring. Because the paired samples are far apart from each other, duplication is avoided, and over-generalization is alleviated by introducing more diverse samples. The process of synthetic generation is illustrated in Fig. 2 and is repeated until the desired level of balance is attained; a minimal code sketch follows the figure.

Fig. 2 Illustration of synthetic sample generation
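The sketch below illustrates the per-cluster generation step described above. Ranking the instances by their distance to the cluster centre and averaging the paired parents are assumptions made for illustration, in the spirit of MAHAKIL; the exact operator is depicted in Fig. 2 rather than spelled out here.

```python
import numpy as np

def breed_cluster(cluster, n_new, rank_dist):
    """Generate n_new offspring inside one minority cluster.
    cluster   : (m, d) array of minority instances (m >= 2 assumed)
    rank_dist : (m,) distance of each instance to the cluster centre
    Instances are split at the midpoint into g1/g2, and parents with
    matching indices are averaged (assumed crossover operator)."""
    order = np.argsort(rank_dist)
    mid = len(order) // 2
    g1, g2 = cluster[order[:mid]], cluster[order[mid:]]
    children = []
    i = 0
    while len(children) < n_new:
        p1, p2 = g1[i % len(g1)], g2[i % len(g2)]
        children.append((p1 + p2) / 2.0)  # offspring between distant parents
        i += 1
    return np.asarray(children)
```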

The objective of the proposed methodology is to improve the performance of the SDP model by reducing the false alarm rate and overgeneralization. To justify the performance, a comparative experiment is conducted with five other prominent oversampling approaches, SMOTE, B-SMOTE, ADASYN, ROS, and MAHAKIL, using five different classification models, Naive Bayes, Decision Tree, KNN, SVM, and Random Forest, on 20 imbalanced datasets obtained from the PROMISE repository with imbalance ratios varying from 2 to 14%. The chosen datasets vary in imbalance ratio and sample size, enabling a comprehensive investigation. Figure 3 shows the experimental setup used to validate the performance of the proposed over-sampling approach.

Fig. 3 Framework of proposed model

The dataset comprises 20 features capturing various measures of the software components, computed from the modules or classes of those projects. The class labels are converted to binary values of 0 and 1 using label encoding, where 0 represents a negative instance and 1 a positive instance. The encoded dataset is separated into two bins, where bin1 holds the positive samples and bin2 the negative samples. The positive samples are passed to the oversampling module, which adds synthetic samples using the proposed approach. The oversampled dataset is divided into ten folds, with the first eight folds used for training, the ninth for model validation, and the tenth for model testing. The division is stratified, so that the proportion of minority class instances (pfp) remains constant across the training and testing datasets. The sampled dataset is used to train five machine learning models, Naive Bayes, Decision Tree, KNN, SVM, and Random Forest, and the performance of the models is evaluated using different measures; a sketch of this pipeline follows.
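A minimal end-to-end sketch with scikit-learn. The synthetic data and the two-step 80/10/10 split below merely stand in for the oversampled dataset and the ten-fold stratified arrangement described above; all names are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier

# Stand-in for an oversampled defect dataset with 20 metric features
X, y = make_classification(n_samples=500, n_features=20,
                           weights=[0.8], random_state=0)

# Stratified 80/10/10 split: train / validation / test, keeping the
# defective proportion (pfp) constant in every partition
X_tr, X_rest, y_tr, y_rest = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(
    X_rest, y_rest, test_size=0.5, stratify=y_rest, random_state=0)

models = {
    "NB": GaussianNB(),
    "DT": DecisionTreeClassifier(random_state=0),
    "KNN": KNeighborsClassifier(),
    "SVM": SVC(),
    "RF": RandomForestClassifier(random_state=0),
}
for name, clf in models.items():
    clf.fit(X_tr, y_tr)
    print(name, clf.score(X_val, y_val))  # validation accuracy
```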

3 Results and discussion

SDP is a two-class problem: the samples of interest, the defective modules, form the minority class and are regarded as positive, while the majority class is regarded as negative. In the confusion matrix, True Positive (TP) and False Negative (FN) refer to the numbers of defective samples classified as defective and non-defective, respectively; similarly, False Positive (FP) and True Negative (TN) refer to the numbers of non-defective samples classified as defective and non-defective, respectively. The performance of the machine learning models is evaluated using measures such as accuracy, precision, recall, F1-score, AUC, and the false alarm rate. Since the model is trained on imbalanced data, accuracy has proved to be biased.

Similarly, precision is also unstable on imbalanced data (Shihab 2012). So we opted for the recall (pd) and false-alarm (pf) rates to evaluate performance. Recall reflects how many defective instances are correctly identified, and thus how many false negatives remain, while the false-alarm rate measures how many non-defective instances are wrongly flagged as defective. We use these two measures to assess performance (Table 1).

$$\text{Recall (pd)} = \frac{TP}{TP + FN}$$
(7)
$$\text{False alarm (pf)} = \frac{FP}{TN + FP}$$
(8)
Table 1 Performance comparison of the proposed model

3.1 Comparison with other oversampling techniques

To validate the performance of the proposed oversampling approach, we conducted the experiment with five other oversampling approaches, SMOTE, BSMOTE, ADASYN, MAHAKIL, and ROS, using the same five machine learning models across all 20 datasets (Table 2). The proposed sampling approach outperforms the other oversampling approaches in terms of reduced false alarm rate at all three balance ratios. When paired with RF, the proposed method provides better performance in terms of both recall and false alarm rate.

Table 2 Performance comparison of different oversampling techniques

The experiment was conducted using five oversampling approaches at three balance ratios across 20 datasets and evaluated using five machine learning models. As Fig. 4 shows, KNN and RF consistently outperform all other models with respect to recall and false alarm rate. It is also evident from Fig. 4 that balancing has little effect on NB and SVM, i.e., their performance is essentially unchanged after balancing. Increasing pfp to 50% for all sampling methods yields the best pd performance across all prediction models and datasets.

Fig. 4 a Recall comparison of the proposed model using different machine learning algorithms. b False alarm rate comparison of the proposed model using different machine learning algorithms. c False alarm comparison of different over-sampling approaches using the Random Forest algorithm

Across all models, the proposed approach, MAHAKIL, and SMOTE consistently perform better in recall (pd) and false-alarm rate (pf). The proposed approach performed considerably better, with high pd and low pf values across all models and datasets. This is because the more diverse synthetic samples also remain intact within the minority decision boundary, eliminating the possibility of misclassification.

The outcomes demonstrate that the proposed model outperforms all five sampling strategies. According to Lessmann et al. (2008), RF significantly outperformed 21 other prediction models, which is consistent with these results; as an ensemble method, RF is more resilient to overfitting and parameterization. Our proposed approach effectively splits the minority samples into multiple clusters based on the computed diversity measure, which also respects the samples' original distribution. The diversity-based measure further ensures the synthetic samples are more diverse, eliminating duplication of samples.

Oversampling, which entails producing additional synthetic instances of the minority class to balance it with the majority class, is a popular strategy for dealing with imbalanced datasets. Several algorithms have been developed to perform oversampling effectively; in this context, we compare the distribution of data samples after applying two cluster-based oversampling algorithms, KMeansSMOTE and the proposed MBOA algorithm, against the initial distribution. Figure 5 depicts the distribution of data samples of an imbalanced dataset before and after sampling.

Fig. 5 a Distribution of data samples without sampling. b Distribution of data samples after sampling using KMeansSMOTE. c Distribution of data samples after sampling using MBOA

In comparing the distributions of data samples after oversampling with the two cluster-based algorithms against the original distribution, we observe the following:

KMeansSMOTE tends to create synthetic samples distributed around the clusters identified by the K-means algorithm. Because it creates synthetic samples around the borderline of each cluster, the data samples concentrate near the cluster boundary regions. Hence, the generated synthetic samples deviate strongly from the original distribution, which leads to poor performance; a usage sketch follows.
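For reference, the KMeansSMOTE baseline used for Fig. 5b is available in the imbalanced-learn library; a minimal usage sketch, with X and y as in the earlier pipeline sketch:

```python
from imblearn.over_sampling import KMeansSMOTE

# Cluster with K-means, then apply SMOTE inside selected clusters;
# cluster_balance_threshold often needs loosening on severely
# imbalanced defect data to avoid "no clusters found" errors.
sampler = KMeansSMOTE(cluster_balance_threshold=0.1, random_state=0)
X_kms, y_kms = sampler.fit_resample(X, y)
```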

The proposed multi-cluster based oversampling approach tends to create synthetic samples that closely match the original distribution, since the synthetic samples are created in each cluster in proportion to the original cluster density. To enhance diversity, it then applies crossover to create new combinations of synthetic samples within clusters. The resulting distribution is more diverse, with synthetic samples representing a broader range of minority-class characteristics, which improves the performance of the prediction models; the allocation step is sketched below.
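A minimal sketch of the density-proportional allocation step, with illustrative names; it splits the synthetic-sample budget across minority clusters by size, after which a per-cluster generator such as the breed_cluster sketch from Sect. 2.4 fills each quota.

```python
import numpy as np

def per_cluster_quota(cluster_labels, n_needed):
    """Allocate the total number of synthetic samples (n_needed)
    across minority clusters proportionally to each cluster's size,
    so the oversampled data tracks the original distribution."""
    sizes = np.bincount(cluster_labels)
    quota = np.floor(n_needed * sizes / sizes.sum()).astype(int)
    quota[np.argmax(sizes)] += n_needed - quota.sum()  # remainder to largest cluster
    return quota
```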

We also performed a quantitative analysis of both cluster-based oversampling approaches using the XGBoost ensemble model. The performance of the XGBoost model on five imbalanced datasets with varied imbalance ratios, in terms of recall and false-alarm rate, is shown in Table 3. The proposed multi-cluster based oversampling approach outperforms KMeansSMOTE on both measures because of the diversity of its synthetic samples; a sketch of this comparison follows.
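A sketch of the head-to-head comparison behind Table 3, assuming X_kms/y_kms come from the KMeansSMOTE sketch above, X_mboa/y_mboa from the proposed sampler, pd_pf() from Sect. 3, and a held-out X_test/y_test; all of these names are illustrative.

```python
from xgboost import XGBClassifier

for name, (Xr, yr) in [("KMeansSMOTE", (X_kms, y_kms)),
                       ("MBOA", (X_mboa, y_mboa))]:
    clf = XGBClassifier(eval_metric="logloss", random_state=0)
    clf.fit(Xr, yr)  # train on the rebalanced data
    rec, fa = pd_pf(y_test, clf.predict(X_test))
    print(f"{name}: pd={rec:.3f}, pf={fa:.3f}")
```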

Table 3 Performance comparison of cluster-based oversampling approaches using the XGBoost classifier

4 Threats to validity

The outcome of the proposed model depends on the Mahalanobis distance. Computing MD for high-dimensional data is challenging because of the covariance matrices: when features are highly correlated, the covariance matrix becomes singular and cannot be inverted, and the same happens for datasets that contain fewer minority samples than features. This is compensated for using the generalized inverse method, as sketched below. To obtain more stable performance irrespective of the imbalance ratio, the generalized inverse can be adopted throughout. Also, the model's performance has not been tested on cross-project defect prediction.
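The generalized-inverse workaround mentioned above is a one-liner with NumPy; a minimal sketch, with an illustrative function name:

```python
import numpy as np

def safe_inverse_cov(X_min):
    """Moore-Penrose pseudo-inverse of the covariance matrix,
    usable even when C is singular (highly correlated features,
    or fewer minority rows than features)."""
    C = np.cov(X_min, rowvar=False, bias=True)
    return np.linalg.pinv(C)
```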

5 Conclusion

The proposed approach improves the performance of SDP by generating more diverse synthetic samples while also ensuring they remain within the decision boundaries. It outperforms other over-sampling approaches with respect to the false alarm rate (pf) and recall (pd) measures. However, in the cross-project defect prediction domain, where datasets are not readily available, finding projects with similar characteristics and domain for transfer learning remains a challenging task. The performance of the proposed model, tested using five different models across 20 datasets, was superior to that of the other over-sampling approaches.