Feature Engineering to Heterogeneous Cross Software Projects Defect Prediction: A Novel Framework

Vashisht, Rohit; Rizvi, Syed Afzal Murtaza

doi:10.1007/s13369-022-07337-9

Research Article-Computer Engineering and Computer Science
Published: 02 November 2022

Volume 48, pages 2539–2560, (2023)
Arabian Journal for Science and Engineering

Abstract

Heterogeneous cross-project defect prediction (HCPDP) aims to predict defects in a target project with limited historical defect data via a defect prediction (DP) model trained with the defect data of another source project. The accuracy of a DP model is highly dependent on the set of features selected in the feature engineering (FE) phase. The study evaluates the effectiveness of the proposed four-phase HCPDP framework, with more focus on the FE phase, using the stacking-based ensemble learning method. Auto-encoder (AE), a deep learning-based FE technique, is used for the proposed analysis. In addition, two novel techniques, one to deal with imbalanced datasets and one to determine the correlation between features, are also proposed in this paper. For comparative analysis, accuracy, recall, F-score and area under curve (AUC) are used as the output parameters. To compare the DP model’s output with and without the FE phase, ten prediction pairs from four open source projects have been considered. The experimental results show that the AE technique is able to reduce the number of features by an average of 50% as compared to data-driven approaches. Also, the proposed model gave better performance in comparison with traditional heterogeneous models, with a highest AUC of 0.8901.

1 Introduction

Software has become an indispensable part of day-to-day human activity. Many important fields such as education, marketing, banking and transport need highly reliable, defect-free and high-quality software applications, as any failure in these applications can result in enormous losses, from finance to human lives. Software errors may be due to inconsistencies, ambiguities, oversights or misinterpretation of the specifications to be met by the software, carelessness or negligence in writing code, insufficient testing, inappropriate or unexpected use of the software or other unforeseen issues. To reduce the significant cost of software development, it is very important to identify these software defects at the right time. "Software testing should be done for early identification of software faults because amendments in maintenance phase will lead to huge cost that grows exponentially if faults are identified in later stages of Software Development Life Cycle (SDLC)", as described in [1]. Moreover, the SDLC’s software testing phase absorbs 60% of the overall cost of software development. Therefore, it is critical that testing be performed on the right modules at the right time.

According to the state of the art, software defect prediction (SDP) can be broadly divided into two groups: within-project defect prediction (WPDP) and cross-project defect prediction (CPDP). In WPDP, the available defect dataset is split into two parts: the DP model is trained on one part (referred to as labeled observations) and validated on the other, as shown in Fig. 1. Testing the DP model involves assigning a label, either faulty or non-faulty, to the unlabeled instances in the target dataset [2].
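The WPDP split described above can be sketched in a few lines. This is a minimal illustration, not the authors' procedure: the 70/30 ratio, the function name `wpdp_split` and the toy `project` data are assumptions made here for the example.

```python
import random

def wpdp_split(instances, train_fraction=0.7, seed=0):
    """Shuffle one project's labeled instances and split them into a
    training part and a validation part (ratio is an assumption)."""
    rng = random.Random(seed)
    shuffled = instances[:]          # copy so the caller's list is untouched
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

# Hypothetical toy project: (metric vector, label), 1 = faulty, 0 = non-faulty.
project = [([i, i % 3], i % 2) for i in range(10)]
train, validation = wpdp_split(project)
```

Both parts come from the same project, which is what distinguishes WPDP from the cross-project setting discussed next.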

CPDP is another class of SDP in which software projects that lack the needed local defect data can use data from other projects to construct an accurate and effective DP model. CPDP can be further categorized into homogeneous CPDP (HoCPDP) and heterogeneous CPDP (HCPDP). HoCPDP relies on software measures/features that are common to both the source application (whose defect data is employed to train the SDP model) and the target application (for which the SDP model is made) [3]. In HCPDP, however, the datasets of a prediction pair share no common metrics. Matching features may instead be found between the two applications by measuring the coefficient of correlation between all feasible combinations of source and target features. The feature pairs whose values show an analogous distribution are then used as common features between the source and target datasets to forecast defects across projects. For example, (A,Q), (B,P) and (D,S) are correlated feature pairs for the HCPDP category, as depicted in Fig. 2, which gives more information about both CPDP categories.
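One way to picture the pairing of heterogeneous features is sketched below. It is an illustrative stand-in, not the correlation technique proposed in this paper: because source and target datasets have different numbers of instances, the sketch summarizes each feature by a fixed number of order statistics and compares those summaries with the Pearson coefficient. The helper names, the threshold of 0.9 and the toy data are assumptions.

```python
import math

def quantiles(values, k=10):
    """Return k+1 evenly spaced order statistics (min ... max) of a feature."""
    ordered = sorted(values)
    last = len(ordered) - 1
    return [ordered[round(i * last / k)] for i in range(k + 1)]

def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx and sy else 0.0

def match_features(source, target, threshold=0.9):
    """Compare every source feature with every target feature and keep
    the pairs whose quantile summaries correlate above the threshold."""
    pairs = []
    for s_name, s_vals in source.items():
        for t_name, t_vals in target.items():
            r = pearson(quantiles(s_vals), quantiles(t_vals))
            if r >= threshold:
                pairs.append((s_name, t_name, r))
    return pairs

# Hypothetical feature columns with different numbers of instances.
source = {"A": list(range(1, 21))}
target = {"Q": [2 * v for v in range(1, 16)],   # similar shape to A
          "P": [2 ** v for v in range(11)]}     # heavily skewed
matched = match_features(source, target)
```

Since sorted summaries are always non-decreasing, their correlations tend to be high, so a strict threshold (or a distribution test) is needed in practice to separate genuinely similar features from dissimilar ones.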

Irrelevant or useless software features chosen during the FE step can be one of the key reasons for a less accurate DP model, as a redundant collection of features can lead to skewed or misleading prediction performance. Therefore, the first issue to tackle in building a highly reliable SDP model is the selection of the right set of features from a given pool of input features. The article tests the prediction performance of the three-phase WPDP model and the four-phase HCPDP-AE model, with and without the FE phase, using different classification algorithms.

Data-driven FE techniques such as principal component analysis (PCA) can only model linear relationships among input features. In contrast to data-driven FE techniques, neural nets can model nonlinear transformations of features and perform better as the number of features grows.

Auto-encoder (AE), an unsupervised artificial neural network (ANN) and encoding–decoding-based FE technique, is used in the proposed research study to map higher-dimensional feature data to a lower-dimensional form while eliminating redundant and noisy features. The motivation behind this study is to evaluate the prediction performance of the four-stage HCPDP-AE model, introducing a novel approach to implement each stage with a greater emphasis on the FE stage. In addition, two novel techniques, one to deal with imbalanced datasets and one to determine the correlation among features in HCPDP, are also proposed in the paper. Both techniques overcome the shortcomings of traditional methods in their domain, as described in a later section. The main contribution areas of the study are framed as the following research questions:

RQ1. Compare and contrast the use of data-driven and deep learning-based FE strategies with the conventional approach of DP, i.e. WPDP.

RQ2. Compare the prediction performance of proposed HCPDP framework with or without FE phase.

RQ3. Whether and to what extent DP results of HCPDP’s model are comparable to the outcomes of WPDP’s model?

RQ4. Compare and validate the performance of the proposed HCPDP framework with existing benchmarked heterogeneous prediction models.

The outline of the paper is as follows: Sect. 2 includes a comprehensive analysis of HCPDP’s related work, Sect. 3 describes the four-phase HCPDP-AE model and the three-phase WPDP model with a detailed explanation of each phase, Sect. 4 describes the datasets used to execute the proposed work and the output metrics used to assess the experimental results, Sect. 5 explains the experimental setup, Sect. 6 discusses the experimental outcomes, Sect. 7 explains the threats to validity and Sect. 8 outlines the conclusive findings.

2 Related Work

In 2002, Melo et al. [4] reported the first known study in CPDP. They introduced the MARS (Multivariate Adaptive Regression Spline) paradigm for defect prediction and data architecture in two Java-based frameworks, Xpose and Jwriter. They predicted the fault-proneness of the classes in Jwriter using a model trained on the Xpose dataset. They compared the efficiency of MARS to that of linear regression (LR) and found that MARS outperformed LR and was much more cost-effective.

Menzies et al. [5] used data from ten projects from two different sources in 2009. They filtered the data for defect prediction by eliminating noisy, repetitive and irrelevant data and trained the model with the filtered data. The tests were carried out using the nearest neighbor (NN) method on data from the ten projects. The findings showed that the tests were effective at predicting defects within a project. However, in these experiments, the CPDP task was unable to outperform the within-project defect prediction task.

In the same year, Camargo et al. [6] used log transformation for the first time to identify related instances in training and analyzing project data to eliminate project-based data instances. The classification for defect prediction on Internet Explorer and Mozilla Firefox as training and testing projects was suggested by Menzies et al. [5] in the same year. For the classification task, they used the coding model and process parameters. They used Mozilla Firefox defect data to train the proposed DP model, which was then used to predict defects in Internet Explorer. These experiments revealed that when the proposed model was used as a training project and Mozilla Firefox was used as a testing project, the proposed model outperformed.

Menzies et al. [7] argued in 2011 that relevance varies with interpretation, and that data relevance can be contradictory depending on how it is perceived: data that seems significant when viewed globally can be meaningless when viewed locally. They supported their claims with experiments, concluding that local behavior was superior to global behavior and that condition-based laws should be prioritized over taking other factors into account.

In 2012, Bellenburg et al. [8] added to the arguments of Menzies et al. [7] by demonstrating that local models were better for a specific dataset, but global models were better for generality. In the same year, Rahman et al. [9] conducted studies to show that performance indicators such as F-score, accuracy and recall are not sufficient for quality assurance when defect prediction is made using various models. They reported that AUC gives comparable results in WPDP models.

To address the shortcomings of the single-objective model [9], Canfora et al. [10] suggested a multi-objective approach in 2013. They used the non-dominated sorting genetic algorithm (NSGA-II) to train the logistic regression (LR) model.

In 2011, Gao et al. [11] developed a universal defect prediction (UDP) model using 1398 projects from Google Code and SourceForge. This model compares the metrics in the training and testing projects' datasets, and predictions can be made on the target project only if at least 26 of them match. He et al. [3] overcame this constraint by developing a new metric based on instance characteristic vectors. They also found unfavorable effects when comparing CPDP to feature disparity. The tests were carried out on 11 projects using three different datasets.

In 2014, He et al. [3] compared the output findings for WPDP and CPDP using feature selection approaches. They discovered that the lower the number of training project features used to train classifiers, the higher the precision in WPDP and the greater the F-score and recall performance in CPDP. Various ensemble classifiers are also trained and validated for the CPDP task [12, 13].

Dong et al. [14] suggested a canonical correlation analysis (CCA) approach to model defects. They were the first to bring heterogeneous defect prediction (HDP) to wider attention. By supplementing dummy metrics with null values, they eliminated the metric disparity between the training and testing project datasets. They tested 14 projects using four different datasets.

In 2015, Fu et al. [15] performed studies on 34 projects with 5 datasets. They proposed an HDP approach based on transfer learning. They did not use null values to augment metrics as Dong et al. [14] suggested, but their findings were comparable to WPDP.

Ryu, Jang and Baik [16] used a new approach called the transfer cost-sensitive boosting method to execute the CPDP challenge in the same year. For the CPDP task, their approach generated state-of-the-art performance. They also suggested a CPDP challenge that takes into account class-imbalance using a multi-objective Naive Bayes technique (MONBT) [17]. All WPDP models, as well as single objective models, were outperformed by their MONBT.

Jing et al. [18] created a novel unified metric representation (UMR) for predicting heterogeneous defects in 2015. Fu et al. [15] suggested an HDP challenge based on metric collection and metric matching in the same year. They studied 28 projects and found that the proposed approach was superior to WPDP and, in some circumstances, outperformed it statistically.

Ni et al. [19] proposed FeSCH in 2017, a novel approach that outperformed both TCA+ and WPDP in most scenarios and gave state-of-the-art results against the baseline methods used. Furthermore, the findings indicated that FeSCH's success was independent of the classifiers used.

In the same year, Li et al. [20] compared four filtering approaches for defect data. They reported that the choice of filtering approach has a significant impact on the model's ability to predict defects. The four techniques compared were the data characteristic-based filter (DCBF), the target project data-guided filter (TGF), the source project data-guided filter (SGF) and the local cluster-based filter (LCBF). They also introduced a new filter, the hierarchical selection-dependent filter (HSDF), to resolve the scalability shortcomings of the previous four filters when dealing with massive datasets. The proposed filtering strategy outperformed the existing filtering techniques.

Xu et al. [21] proposed a domain adaptation approach for reducing the higher-dimensional features of training and analyzing project domains in 2018. To learn the difference between function spaces, they used the dictionary learning technique. They compared heterogeneous defect adaptation (HDA) [21], CCA+ [14] and HDP [15] using three open source projects: NetGene, NASA and AEEEM, and three performance measures: recall, F-score and balance.

In 2020, Lee and Felix [22] concentrated on method-level (ML) defect estimation using regression models in a recent software release after gathering-related data from previous versions of the same system. The authors used three performance measurement variables, such as defect density, defect velocity, and time to implement defects that show a significant relationship with ML defects. The proposed work also facilitated the study and evaluation of pre-to-post system data preprocessing classifiers and entropy values in average output datasets. The defect velocity had the highest correlation with the count among ML faults among all three factors, with a 93% correlation.

Majd et al. [23] suggested using deep learning models to forecast statement class defects (SCD) in the same year. The authors sought to relieve the burden on software developers by identifying the places or modules that are more vulnerable to defects. They used Code4Bench's Broad Short-Term Memory (BSTM) deep learning model to run experiments on 119,989 C/C++ programs. The authors tested the SCD model on predicting defects in unseen data (i.e., new statements) and found that it performed well, with high recall, precision and accuracy.

In the field of HCPDP, Grassmann manifold optimal transfer defect prediction (GMOTDP) is a recent work from 2020. Jiang et al. [24] gave a three-phase HDP model that proposed a Mahalanobis distance-based class imbalance learning (CIL) framework for dealing with the class imbalance problem (CIP) in the source dataset, as well as a classification and regression trees (CART)-based ensemble learning methodology for finding the best subset of the source dataset for metric matching. To check the feasibility of the proposed approach, the authors used nine projects from three public domain software defect repositories and compared it to four known advanced approaches. The findings of the experiments show that the proposed approach is more reliable in terms of AUC.

Wu et al. [40] presented the multi-source heterogeneous cross-project defect prediction (MHCPDP) approach in 2021, which employed AE to extract intermediate features from the original datasets rather than merely deleting redundant and unrelated features. To limit the impact of negative transfer and improve the performance of the classifier, MHCPDP included a multi-source transfer learning algorithm. The authors tested MHCPDP on five open source datasets in depth. The results of the experiments revealed that MHCPDP not only improved two performance measures but also addressed the drawbacks of traditional HCPDP approaches.

3 Proposed Defect Prediction Model

While HoCPDP allows the datasets of more than one homogeneous project to be used for training and testing the DP model, HCPDP-AE begins its DP process with a pair of datasets: a source dataset S_a*b and a target dataset T_p*q. In both datasets, each row represents an instance and each column represents a software metric.

Preprocessing of the datasets is performed in the very first phase to make them suitable for use in the model, as shown in Fig. 3. This phase entails the treatment of missing values, the labeling of the categorical/dependent variable, the removal of the CIP in an imbalanced training dataset, and the normalization of the data values. The CIP, a highly disproportionate ratio of instances between the two classes, is the most challenging and significant problem to be addressed in this phase. Data resampling techniques [25] offer two ways to equalize the number of cases in the majority and minority classes. The first, termed random over-sampling (ROS), generates additional minority class observations. The second, termed random under-sampling (RUS), reduces the majority class observations. Either way, the goal is an equal count of observations in both classes for a binary classification problem [26].
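The two resampling strategies can be sketched as follows. This is a minimal sketch of the plain random variants, assuming the majority class is at least as large as the minority class; here ROS duplicates existing minority instances, whereas methods such as SMOTE synthesize new ones. The function names and toy data are hypothetical.

```python
import random

def random_over_sample(majority, minority, rng):
    """ROS: add randomly chosen minority copies until both classes match."""
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    return majority + minority + extra

def random_under_sample(majority, minority, rng):
    """RUS: keep a random subset of the majority class, drop the rest."""
    kept = rng.sample(majority, len(minority))
    return kept + minority

# Hypothetical imbalanced dataset: (metric vector, label), 1 = faulty.
rng = random.Random(0)
majority = [([i], 0) for i in range(12)]
minority = [([100 + i], 1) for i in range(3)]
ros = random_over_sample(majority, minority, rng)   # 12 vs 12 instances
rus = random_under_sample(majority, minority, rng)  # 3 vs 3 instances
```

The sketch makes the trade-off visible: `ros` contains repeated minority rows, while `rus` silently discards nine of the twelve majority rows.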

But, as per the state of the art, both approaches have drawbacks of their own [26]. The former approach induces redundant observations for minority class that can over-generalize the minority class without taking into account the distribution of instances in the majority class. On the other hand, the latter method can exclude some helpful or relevant observations from the majority class without taking into consideration their significance in predicting the expected outcome.

Therefore, a novel hybrid approach named the chunk balancing algorithm (CBA) is proposed in this study to treat the CIP, retaining the benefits of both ROS and RUS while overcoming their shortcomings. CBA addresses the over-generalization and over-fitting issues of SMOTE by balancing minority class instances through the creation of chunks, rather than by introducing synthetic examples that cause duplication in the dataset distribution. RUS, on the other hand, eliminates majority class occurrences regardless of their significance in predicting the final outcome. To create n balanced chunks, CBA also randomly selects instances from the majority class. However, it gives all majority class instances an equal chance to become part of the model's training rather than discarding them entirely. In this manner, CBA overcomes the constraints of the data re-sampling methodologies used to handle the CIP. This completes the first phase of the HCPDP-AE model.

This algorithm takes an imbalanced dataset as an input and returns n chunks with almost equal numbers of instances from both the majority and minority groups as the output.
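The input/output behavior described above can be illustrated with the sketch below. It is one plausible reading of the description, not the authors' CBA listing: the majority class is shuffled and divided into n disjoint parts, and each part is paired with the full minority class. The choice n = ceil(|majority| / |minority|), the function name and the toy data are assumptions.

```python
import math
import random

def chunk_balance(majority, minority, seed=0):
    """Return n roughly balanced chunks; every majority instance lands in
    exactly one chunk, and every chunk holds all minority instances."""
    rng = random.Random(seed)
    shuffled = majority[:]
    rng.shuffle(shuffled)                  # random assignment to chunks
    n = max(1, math.ceil(len(shuffled) / len(minority)))
    chunks = []
    for i in range(n):
        part = shuffled[i::n]              # strided slice: sizes differ by at most 1
        chunks.append(part + minority)
    return chunks

# Hypothetical imbalanced dataset: 23 non-faulty vs 5 faulty instances.
majority = [(i, 0) for i in range(23)]
minority = [(100 + i, 1) for i in range(5)]
chunks = chunk_balance(majority, minority)
```

With 23 majority and 5 minority instances this yields five chunks holding 4 or 5 majority instances each, so no majority instance is discarded and no synthetic instance is created. A classifier could then be trained per chunk and the results combined, for example in an ensemble.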

The present age is known as the era of data, so there is a pressing need to extract from the available pool the meaningful, relevant and highly discriminating data that matters most for predicting the expected results. The second phase of the model, feature engineering (FE), focuses on this problem. It involves both feature selection and feature extraction [26]. Feature selection methods help to delete unnecessary and outdated features that are not crucial in determining expected outcomes. Feature extraction, in contrast, decreases the dimensionality of the given pool of features by introducing a new lower-dimensional feature set and discarding the original higher-dimensional one. The accuracy and reliability of an SDP model are therefore strongly influenced by the collection of the most suitable and substantial features in the FE process. According to the literature survey carried out for this research, deep learning-based FE techniques have not yet been well explored together with a robust metric matching technique for predicting defects among projects with homogeneous and heterogeneous features. So, auto-encoder (AE), a deep learning-based FE technique, is applied to implement the second phase of the HCPDP-AE model. The principle behind the AE technique is to extract a feature set of lower dimension that can reproduce the original input using the encoding–decoding model. AE is an unsupervised learning technique; it employs a feed-forward neural network (FFNN) to build a compact representation of the original input, which is known as representation learning [27]. In contrast to other related approaches, it provides better outcomes if some of the features in the dataset are correlated with each other rather than being entirely independent. More detail on the AE method is shown in Fig. 4.

All of the properties listed in Fig. 4 are implemented in the proposed HCPDP_AE model. For example, the output features reconstructed from the hidden layer's features cannot exactly match the input feature set. As demonstrated in Table 6, the model generates varied RMSE values depending on the reduced number of features at the hidden layer for a given source dataset. This shows that the model's transformation of a higher-dimensional feature set to a lower-dimensional feature set is always lossy.

At each iteration, the weight of every link between the input layer and the hidden layer, and between the hidden layer and the output layer, is computed from the weight of the same link at the previous iteration and from the loss function (RMSE), which estimates the transformation loss.
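The text does not state the exact update rule. Assuming plain gradient descent, which is a common choice and an assumption here, the dependence of a link weight on its previous value and on the loss can be written as follows, where $w^{(t)}$ is a link weight at iteration $t$, $L$ is the RMSE loss and $\eta$ is a learning rate (a symbol introduced only for this illustration):

$$w^{(t+1)} = w^{(t)} - \eta \left. \frac{\partial L}{\partial w} \right|_{w = w^{(t)}}$$

A weight is therefore nudged in the direction that reduces the reconstruction error, and the size of the step is controlled by $\eta$.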

The model only accepts vectors of feature values as input, with no class labels, which means that HCPDP_AE adheres to the unsupervised nature of the AE approach. It simply tries to learn a function that maps the higher-dimensional input x to a lower-dimensional representation y, and then attempts to recreate x from y.

The model uses a deep learning strategy to extract the features, but only one hidden layer was employed because one hidden layer is adequate to train the FFNN considering the feature cardinality of employed datasets.

The detailed encoding–decoding architecture of AE is shown in Fig. 5. It consists of three constituent stages: the encoding stage, the bottleneck stage and the decoding stage. The encoding stage may have n encoding layers, En-LAYER(1), En-LAYER(2), En-LAYER(3),…, En-LAYER(n), consisting of N₁, N₂, N₃,…, N_n nodes, respectively. This stage encodes the original input with q features into a compact form with q' features such that q' < q.

The best possible compressed form of input features that are used to train the DP model is given by the second stage, i.e., bottleneck stage. In the last stage, the architecture attempts to recreate the original input from the compressed form generated from the bottleneck stage and aims to regularize the loss of reconstruction by comparing original data and reconstructed data. The decoding step can be thought of as a mirror image of the encoding stage, with De_LAYER(1) performing the mirror operations as done in En_LAYER(n) with N_n nodes. During back propagation in architecture, the model’s training focuses to mitigate the reconstruction loss. In this way, the proposed research aims to incorporate the second phase of the model by investigating the same using FE methodology based on deep learning.

The encoding and decoding equations for the AE model with one hidden layer are given in Eqs. (1) and (2), respectively, where F_λ is the encoding function that maps the higher-dimensional input A to the compressed form B and G_λ′ is the decoding function that attempts to recreate the input as A' from the compressed form B with minimal reconstruction loss. A_f is the applied activation function, and w and w’ are the weight parameters in the encoding and decoding stages, respectively. The bias values for the two stages are denoted by b_A and b_B, respectively.

$$B \, = \, F_{\lambda } \left( A \right) \, = \, A_{f} \left( {wA + b_{A} } \right)$$

(1)

$$A^{\prime} \, = \, G_{{\lambda^{\prime}}} \left( B \right) \, = \, A_{f} \left( {w^{\prime}B + b_{B} } \right)$$

(2)

The main aim of AE is to reduce the reconstruction loss on an original input A, which is defined by the objective function λ as follows:

$$\lambda \, = \, \min \, L_{f} \left( {A,A^{\prime}} \right)$$

(3)

where A′ = G(F(A)) and L_f is the loss function, which depends upon the type of reconstruction (linear or nonlinear). Minimizing it produces the optimal set of weights for mapping the input variables to the target or output variable with the least amount of reconstruction loss. The loss function used in a neural network’s implementation is closely tied to the selected activation function. According to [28], the best result for a neural network designed for a regression problem is given by the rectified linear activation function (ReLAF) with the root mean squared error (RMSE) loss function. Table 1 offers more information on the loss function and activation function used in executing the AE model.
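Eqs. (1) to (3) can be made concrete with a small one-hidden-layer auto-encoder written in plain Python. This is an illustrative sketch under stated assumptions, not the authors' implementation: the per-sample gradient descent, the learning rate, the weight initialization and the toy data are choices made here, and the gradient is taken on the squared error, whose minimizer coincides with that of RMSE.

```python
import math
import random

def relu(v):
    return [max(0.0, x) for x in v]

def affine(W, x, b):
    return [sum(wi * xi for wi, xi in zip(row, x)) + bi for row, bi in zip(W, b)]

def encode(a, w, b_a):        # Eq. (1): B = A_f(wA + b_A)
    return relu(affine(w, a, b_a))

def decode(h, w2, b_b):       # Eq. (2): A' = A_f(w'B + b_B)
    return relu(affine(w2, h, b_b))

def rmse(data, recon):
    n = sum(len(r) for r in data)
    sq = sum((x - y) ** 2 for r, s in zip(data, recon) for x, y in zip(r, s))
    return math.sqrt(sq / n)

def train(data, hidden, epochs=200, lr=0.05, seed=0):
    """Fit the q -> hidden -> q network by per-sample gradient descent."""
    rng = random.Random(seed)
    q = len(data[0])
    w = [[rng.uniform(0.0, 0.5) for _ in range(q)] for _ in range(hidden)]
    w2 = [[rng.uniform(0.0, 0.5) for _ in range(hidden)] for _ in range(q)]
    b_a, b_b = [0.0] * hidden, [0.0] * q
    history = []              # RMSE of the reconstructions seen in each epoch
    for _ in range(epochs):
        recon = []
        for a in data:
            h = encode(a, w, b_a)
            out = decode(h, w2, b_b)
            recon.append(out)
            # Backpropagate the squared error through both ReLU layers.
            d_out = [(o - x) if o > 0 else 0.0 for o, x in zip(out, a)]
            d_h = [sum(d_out[i] * w2[i][j] for i in range(q)) if h[j] > 0 else 0.0
                   for j in range(hidden)]
            for i in range(q):
                for j in range(hidden):
                    w2[i][j] -= lr * d_out[i] * h[j]
                b_b[i] -= lr * d_out[i]
            for j in range(hidden):
                for k in range(q):
                    w[j][k] -= lr * d_h[j] * a[k]
                b_a[j] -= lr * d_h[j]
        history.append(rmse(data, recon))
    return (w, b_a, w2, b_b), history

# Hypothetical data: q = 4 correlated features driven by 2 latent values.
rng = random.Random(1)
data = []
for _ in range(30):
    a, b = rng.random(), rng.random()
    data.append([a, b, (a + b) / 2, a])
(w, b_a, w2, b_b), history = train(data, hidden=2)
compressed = encode(data[0], w, b_a)   # q' = 2 features for the DP model
```

The vector returned by `encode` plays the role of the compressed form B: it is this lower-dimensional representation, not the reconstruction A′, that would be passed on to train the DP model.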

Table 1 Loss function and activation function
