1 Introduction

Software has become an indispensable part of day-to-day human activity. Today, many important fields such as education, marketing, banking and transport need highly reliable, defect-free and high-quality software applications, as any failure in these applications can result in enormous losses, from financial damage to loss of human life. Software errors may be due to inconsistencies, ambiguities, oversights or misinterpretation of the specifications to be met by the software, carelessness or negligence in writing code, insufficient testing, inappropriate or unexpected use of the software, or other unforeseen issues. To reduce the significant cost of software development, it is very important to identify these software defects at the right time. "Software testing should be done for early identification of software faults because amendments in maintenance phase will lead to huge cost that grows exponentially if faults are identified in later stages of Software Development Life Cycle (SDLC)", as described in [1]. The software testing phase of the SDLC alone absorbs 60% of the overall cost of software development. Therefore, it is critical that testing be performed on the right modules at the right time.

According to the state of the art, software defect prediction (SDP) can be broadly divided into two groups: with-in project defect prediction (WPDP) and cross-project defect prediction (CPDP). In WPDP, the available defect dataset is split into two parts: one part (referred to as labeled observations) is used to train the defect prediction (DP) model, and the other part is used to validate it, as shown in Fig. 1. Testing the DP model involves predicting a label, either faulty or non-faulty, for the unlabeled instances in the target dataset [2].

CPDP is another class of SDP in which software projects that lack the needed local defect data can use data from other projects to construct an accurate and effective DP model. CPDP can be further divided into two subcategories: homogeneous CPDP (HoCPDP) and heterogeneous CPDP (HCPDP). HoCPDP collects common software measures/features from both the source application (whose defect data is employed to train the SDP model) and the target application (for which the SDP model is built) [3]. In HCPDP, however, there are no uniform metrics between the datasets of the prediction pair. Uniform features between two applications may be found by measuring the coefficient of correlation between all feasible software feature combinations. To forecast cross-project defects in HCPDP, the feature pairs whose values show an analogous distribution are used as common features between the source and target datasets. For example, (A,Q), (B,P) and (D,S) are correlated feature pairs for the HCPDP category, as depicted in Fig. 2, which gives more information about both CPDP categories.

Irrelevant or useless software features chosen during the feature engineering (FE) step can be one of the key reasons for a less accurate DP model, as a redundant collection of features can lead to skewed or misleading prediction performance. Therefore, the first issue to tackle in order to build a highly reliable SDP model is the selection of the right set of features from a given pool of input features. This article tests the prediction performance of the three-phased WPDP model and the four-phased HCPDP-AE model, with and without the FE phase, using different classification algorithms.

Data-driven FE techniques such as principal component analysis (PCA) can only model linear relationships among input features. In contrast to data-based FE models, neural nets can model nonlinear transformations of features and perform better as the number of features grows.

The auto-encoder (AE), an unsupervised artificial neural network (ANN) and an encoding–decoding-based FE technique, is used in this study to map higher-dimensional feature data to a lower-dimensional space while eliminating redundant and noisy features. The motivation behind this study is to evaluate the prediction performance of the four-stage HCPDP-AE model, introducing a novel approach to implement each stage with a greater emphasis on the FE stage. In addition, two novel techniques are proposed in the paper: one to deal with imbalanced datasets and one to determine the correlation among features in HCPDP. Both techniques overcome the shortcomings of the traditional methods in their domain, as described in a later section. The study addresses the following research questions:

RQ1. Compare and contrast the use of data-driven and deep learning-based FE strategies with the conventional approach of DP, i.e. WPDP.

RQ2. Compare the prediction performance of the proposed HCPDP framework with and without the FE phase.

RQ3. Are the DP results of the HCPDP model comparable to the outcomes of the WPDP model, and if so, to what extent?

RQ4. Compare and validate the performance of the proposed HCPDP framework with existing benchmarked heterogeneous prediction models.

The outline of the paper is as follows. Section 2 presents a comprehensive analysis of work related to HCPDP. Section 3 describes the four-phase HCPDP-AE model and the three-phase WPDP model, with a detailed explanation of each phase. Section 4 describes the datasets used to execute the proposed work and the performance metrics used to assess the experimental results. The experimental setup is explained in Sect. 5, and the experimental outcomes are discussed in Sect. 6. Section 7 explains the threats to validity, and the conclusions are outlined in Sect. 8.

Fig. 1 With-in project defect prediction

Fig. 2 Categories of cross software projects defect prediction

2 Related Work

In 2002, Melo et al. [4] reported the first known study in CPDP. They introduced the MARS (Multivariate Adaptive Regression Splines) approach for defect prediction and applied it to two Java-based systems, Xpose and Jwriter. They predicted the fault-proneness of the classes in Jwriter using a model trained on the Xpose dataset. They compared the performance of MARS with that of linear regression (LR) and found that MARS outperformed LR and was much more cost-effective.

Menzies et al. [5] used data from ten projects from two different sources in 2009. They filtered the data for successful defect prediction by eliminating noisy, repetitive and irrelevant data, and trained the model with the filtered data. The tests were carried out using the nearest neighbor (NN) method on data from the ten projects. The findings showed that the approach was effective at predicting defects within a project. In these experiments, however, CPDP was unable to outperform within-project defect prediction.

In the same year, Camargo et al. [6] used log transformation for the first time to identify related instances in the training and testing project data and to eliminate project-specific data instances. Also in 2009, Menzies et al. [5] proposed defect classification with Internet Explorer and Mozilla Firefox as the training and testing projects. For the classification task, they used code and process metrics. They used Mozilla Firefox defect data to train the proposed DP model, which was then used to predict defects in Internet Explorer. These experiments revealed that the model trained on Mozilla Firefox data performed well when predicting defects in Internet Explorer.

Menzies et al. [7] argued in 2011 that relevance varies with interpretation, and that the relevance of data can be contradictory depending on how it is viewed: data that seems significant when viewed globally can be meaningless when viewed locally. They supported their claims with experiments, concluding that local models were superior to global models and that rules derived from local conditions should be prioritized over those taking other factors into account.

In 2012, Bettenburg et al. [8] extended the arguments of Menzies et al. [7] by demonstrating that local models were better for a specific dataset, whereas global models were better for generality. In the same year, Rahman et al. [9] conducted studies showing that performance indicators such as F-score, accuracy and recall are not sufficient for quality assurance when defect prediction is made using various models. They reported that AUC gives comparable results in WPDP models.

To address the shortcomings of the single-objective model [9], Canfora et al. [10] suggested a multi-objective approach in 2013. They used the non-dominated sorting genetic algorithm (NSGA-II) to train the logistic regression (LR) model.

In 2011, Gao et al. [11] developed a universal defect prediction (UDP) model using 1398 projects from Google Code and SourceForge. This model compares the metrics in the datasets of the training and testing projects, and predictions can be made on the target project only if at least 26 of them match. He et al. [3] overcame this constraint by developing a new metric based on instance characteristic vectors. They also found unfavorable effects of feature disparity on CPDP. The tests were carried out on 11 projects using three different datasets.

In 2014, He et al. [3] compared the output findings for WPDP and CPDP using feature selection approaches. They discovered that the lower the number of training project features used to train classifiers, the higher the precision in WPDP and the greater the F-score and recall performance in CPDP. Various ensemble classifiers are also trained and validated for the CPDP task [12, 13].

Dong et al. [14] suggested a canonical correlation analysis (CCA) approach to model defects. They were the first to bring heterogeneous defect prediction (HDP) to public attention. By supplementing dummy metrics with null values, they eliminated the metric disparity issue between the training and testing project datasets. They tested 14 projects using four different datasets.

In 2015, Fu et al. [15] performed studies on 34 projects with 5 datasets. They approached the HDP task using a transfer learning system. They did not use null values to augment metrics as Dong et al. [14] suggested, but their findings were comparable to WPDP.

Ryu, Jang and Baik [16] used a new approach called the transfer cost-sensitive boosting method to execute the CPDP challenge in the same year. For the CPDP task, their approach generated state-of-the-art performance. They also suggested a CPDP challenge that takes into account class-imbalance using a multi-objective Naive Bayes technique (MONBT) [17]. All WPDP models, as well as single objective models, were outperformed by their MONBT.

Jing et al. [18] created a novel unified metric representation (UMR) for predicting heterogeneous defects in 2015. Fu et al. [15] suggested an HDP challenge based on metric collection and metric matching in the same year. They studied 28 projects and found that the proposed approach was superior to WPDP and, in some circumstances, outperformed it statistically.

Ni et al. [19] proposed FeSCH in 2017, a novel approach that outperformed both TCA+ and WPDP in most scenarios and gave state-of-the-art results against the baseline methods used. Furthermore, the findings indicated that the performance of FeSCH was stable and independent of the classifiers used.

In the same year, Li et al. [20] compared four filtering approaches for defect data. They reported that the chosen defect data filtering approach has a significant impact on the model's ability to predict defects. The four filtering techniques compared were the data characteristic-based filter (DCBF), the target project data-guided filter (TGF), the source project data-guided filter (SGF) and the local cluster-based filter (LCBF). They also introduced a new filter, the hierarchical selection-dependent filter (HSDF), to resolve the scalability shortcomings of the previous four filters when dealing with massive datasets. The proposed filtering strategy outperformed the existing filtering techniques.

Xu et al. [21] proposed a domain adaptation approach in 2018 for reducing the higher-dimensional features of the training and testing project domains. They used the dictionary learning technique to learn the difference between the feature spaces. They compared heterogeneous defect adaptation (HDA) [21], CCA+ [14] and HDP [15] using three open source projects (NetGene, NASA and AEEEM) and three performance measures (recall, F-score and balance).

In 2020, Lee and Felix [22] concentrated on method-level (ML) defect estimation in a new software release, using regression models built on related data gathered from previous versions of the same system. The authors used three performance measurement variables (defect density, defect velocity, and time to implement defects) that show a significant relationship with ML defects. The proposed work also facilitated the study and evaluation of pre-to-post system data preprocessing classifiers and entropy values in average output datasets. Among the three factors, defect velocity had the highest correlation with the count of ML faults, at 93%.

Majd et al. [23] suggested using deep learning models to forecast statement class defects (SCD) in the same year. The authors sought to relieve the burden on software developers by identifying the places or modules that are more vulnerable to defects. They used a long short-term memory (LSTM) deep learning model to run experiments on 119,989 C/C++ programs from Code4Bench. The authors tested the SCD model on unseen data (i.e., new statements) and found that it performed well, with high recall, precision and accuracy.

In the field of HCPDP, Grassmann manifold optimal transfer defect prediction (GMOTDP) is a recent novel work in year 2020. Jiang et al. [24] gave a three-phase HDP model that proposed Mahalanobis distance-based class imbalance learning (CIL) framework for dealing with class imbalance problem (CIP) in the source dataset, as well as a classification and regression trees (CART)-based ensemble learning methodology for finding the best subset of the source dataset for metric matching. To check the feasibility of the proposed approach, the authors used nine projects from three public domain software defect repositories and compared them to four known advanced approaches. The findings of the experiments show that the proposed approach is more reliable in terms of AUC.

Wu et al. [40] presented the multi-source heterogeneous cross-project defect prediction (MHCPDP) approach in 2021, which employed AE to extract intermediate features from the original datasets rather than merely deleting redundant and unrelated features. To limit the impact of negative transfer and improve the performance of the classifier, MHCPDP introduced a multi-source transfer learning algorithm. The authors tested MHCPDP in depth on five open source datasets. The results of the experiments revealed that MHCPDP not only improved two performance measures but also addressed the drawbacks of traditional HCPDP approaches.

3 Proposed Defect Prediction Model

Whereas HoCPDP allows the datasets of more than one homogeneous project to be used for training and testing the DP model, HCPDP-AE begins its DP process with a pair of datasets: a source dataset S of size a*b and a target dataset T of size p*q. In both datasets, each row represents an instance and each column represents a software metric.

Preprocessing of the datasets is performed in the very first phase to make them suitable for use in the model, as shown in Fig. 3. This phase entails the treatment of missing values, the labeling of the categorical/dependent variable, the elimination of the class imbalance problem (CIP) in an imbalanced training dataset, and the normalization of the data values. CIP, the highly disproportionate ratio of instances between the two classes, is the most challenging and significant problem to be addressed in this phase. Data resampling techniques [25] offer two ways to equalize the number of instances in the majority and minority classes of a binary classification problem. The first, termed random over-sampling (ROS), generates more synthetic minority class observations. The second, termed random under-sampling (RUS), reduces the majority class observations until both classes have an equivalent count of observations [26].

However, as per the state of the art, both approaches have drawbacks of their own [26]. The former induces redundant observations for the minority class, which can over-generalize the minority class without taking into account the distribution of instances in the majority class. The latter can exclude helpful or relevant observations from the majority class without taking into consideration their significance in predicting the expected outcome.

Therefore, a novel hybrid approach named the chunk balancing algorithm (CBA) is proposed in this study to treat CIP, retaining the benefits of both ROS and RUS while overcoming their shortcomings. The overgeneralization and over-fitting issues of over-sampling techniques such as SMOTE are addressed in CBA by balancing minority class instances through the creation of chunks, rather than by introducing synthetic examples that cause duplication in the dataset distribution. RUS, on the other hand, eliminates majority class instances regardless of their significance in predicting the final outcome. To create n balanced chunks, CBA also selects instances at random from the majority class. However, it gives all majority class instances an equal chance to take part in the model's training rather than discarding them entirely. In this manner, CBA overcomes the constraints of the data re-sampling methodologies used to handle CIP. This completes the first phase of the HCPDP-AE model.

This algorithm takes an imbalanced dataset as an input and returns n chunks with almost equal numbers of instances from both the majority and minority groups as the output.
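The input/output behavior described above can be illustrated with a minimal sketch. It is not the authors' algorithm listing: the function name `chunk_balance`, the rule for choosing the number of chunks, and the choice to reuse every minority instance in every chunk are assumptions made for illustration only.

```python
import numpy as np

def chunk_balance(X, y, seed=0):
    """Split an imbalanced binary dataset into roughly balanced chunks.

    Assumptions (hypothetical, not taken from the paper): each chunk
    reuses all minority instances and receives a disjoint random share
    of the majority class, so no majority instance is discarded.
    """
    rng = np.random.default_rng(seed)
    labels, counts = np.unique(y, return_counts=True)
    minority = labels[np.argmin(counts)]
    min_idx = np.flatnonzero(y == minority)
    maj_idx = np.flatnonzero(y != minority)

    # Enough chunks for every majority instance to appear exactly once.
    n_chunks = max(1, round(len(maj_idx) / len(min_idx)))

    chunks = []
    for part in np.array_split(rng.permutation(maj_idx), n_chunks):
        idx = np.concatenate([min_idx, part])
        chunks.append((X[idx], y[idx]))
    return chunks
```

With 10 minority and 90 majority instances, this sketch yields nine chunks of 20 instances each, so every majority instance takes part in training once.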

Fig. 3 Four-phased HCPDP-AE framework

figure a

This is often called the era of data, so it is essential to extract meaningful, relevant and highly discriminating data that is more significant for predicting the expected results than the data left out of the pool. The second phase of the model, feature engineering (FE), addresses this problem. It involves both feature selection and feature extraction [26]. Feature selection methods delete unnecessary and redundant features that are not crucial in determining the expected outcomes. Feature extraction, in contrast, reduces the dimensionality of the given pool of features by introducing a new lower-dimensional feature set and discarding the original higher-dimensional one. The accuracy and reliability of an SDP model are therefore strongly influenced by the selection of the most suitable and substantial features in the FE process. According to the literature survey carried out for this study, deep learning-based FE techniques combined with a robust metric matching technique are not yet well explored for predicting defects among projects with homogeneous and heterogeneous features. Therefore, the auto-encoder (AE), a deep learning-based FE technique, is applied to implement the second phase of the HCPDP-AE model. The principle behind the AE technique is to extract a lower-dimensional feature set that can reproduce the original input using the encoding–decoding model. AE is an unsupervised learning technique; it employs a feed-forward neural network (FFNN) to learn a compact representation of the original input, which is known as representation learning [27]. In contrast to other related approaches, it provides better outcomes if some of the features in the dataset are correlated rather than entirely independent. More details of the AE method are shown in Fig. 4.

Fig. 4 Characteristics of auto-encoder

All of the properties listed in Fig. 4 are implemented in the proposed HCPDP-AE model. For example, the output features reconstructed from the hidden layer's features cannot be exactly the same as the input feature set. As demonstrated in Table 6, the model generates different RMSE values depending on the reduced number of features at the hidden layer for a given source dataset. This shows that the model's transformation of a higher-dimensional feature set into a lower-dimensional feature set is always lossy.

At each iteration, the weights of the links between the input layer and the hidden layer, and between the hidden layer and the output layer, are calculated from the weights of the links between the corresponding layers at the previous iteration and from the loss function (RMSE) that is used to estimate the transformation loss.

The model accepts only vectors of feature values as input, with no class labels, which means that HCPDP-AE adheres to the unsupervised nature of the AE approach. It simply tries to learn a function that maps the higher-dimensional input x to a lower-dimensional representation y, and then attempts to recreate x from y.

The model uses a deep learning strategy to extract the features, but only one hidden layer is employed because one hidden layer is adequate to train the FFNN given the feature cardinality of the employed datasets.

The detailed encoding–decoding architecture of AE is well shown in Fig. 5. It consists of three constituent stages: encoding stage, bottleneck stage and decoding stage. In the encoding stage, it is possible to have n numbers of encoding layers that are En-LAYER(1), En-LAYER(2), En-LAYER(3),…, En-LAYER(n) consisting of N1, N2, N3,…, Nn number of nodes in each respective layer. This stage encodes the original input with q number of features in a compact form with q’ number of features such that q' < q.

Fig. 5 Three stage architecture of auto-encoder

The second stage, the bottleneck stage, gives the best possible compressed form of the input features, which is used to train the DP model. In the last stage, the architecture attempts to recreate the original input from the compressed form generated by the bottleneck stage and aims to minimize the reconstruction loss by comparing the original data with the reconstructed data. The decoding stage can be thought of as a mirror image of the encoding stage, with De-LAYER(1) mirroring the operations performed in En-LAYER(n) with Nn nodes. During back propagation, the model's training focuses on reducing the reconstruction loss. In this way, the proposed research implements the second phase of the model using an FE methodology based on deep learning.

The encoding and decoding equations for the AE model with one hidden layer are given in Eqs. (1) and (2), respectively, where F_λ is the encoding function that maps the higher-dimensional input A to the compressed form B, and G_λ′ is the decoding function that attempts to recreate the input as A′ from the compressed form B with minimal reconstruction loss. A_f is the applied activation function, and w and w′ are the weight parameters of the encoding and decoding stages, respectively. The bias values of the two stages are denoted by b_A and b_B, respectively.

$$B \, = \, F_{\lambda } \left( A \right) \, = \, A_{f} \left( {wA + b_{A} } \right)$$
(1)
$$A^{\prime} \, = \, G_{{\lambda^{\prime}}} \left( B \right) \, = \, A_{f} \left( {w^{\prime}B + b_{B} } \right)$$
(2)

The main aim of AE is to reduce the reconstruction loss on an original input A, which is defined by the objective function λ as follows:

$$\lambda \, = \, \min \, L_{f} \left( {A,A^{\prime}} \right)$$
(3)

where A′ = G_λ′(F_λ(A)) and L_f is the loss function, which depends on the type of reconstruction (linear or nonlinear). Minimizing it produces the optimal set of weights for mapping the input variables to the target or output variable with the least amount of reconstruction loss. The loss function used for a neural network's implementation is closely tied to the selected activation function. According to [28], the best result for a neural network designed for a regression problem is given by the rectified linear activation function (ReLAF) with the root mean squared error (RMSE) loss function. Table 1 offers more information on the loss function and activation function used in executing the AE model in this research.
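Equations (1) to (3) can be made concrete with a small NumPy sketch of a one-hidden-layer AE trained by plain gradient descent, using the rectified linear activation and the RMSE loss named above. This is an illustrative sketch, not the authors' implementation: the function names, the learning rate, the number of epochs, the weight initialization and the row-vector convention (`A @ w` instead of `wA`) are all assumptions, and the input is assumed to be normalized to [0, 1].

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def rmse(A, A_rec):
    return float(np.sqrt(np.mean((A - A_rec) ** 2)))

def train_autoencoder(A, q_reduced, lr=0.1, epochs=500, seed=0):
    """Train a one-hidden-layer AE; rows of A are instances.

    Hyperparameters and initialization are hypothetical choices.
    """
    rng = np.random.default_rng(seed)
    n, q = A.shape
    w = rng.uniform(0.0, 0.1, size=(q, q_reduced))      # encoder weights w
    w_dec = rng.uniform(0.0, 0.1, size=(q_reduced, q))  # decoder weights w'
    b_A = np.zeros(q_reduced)                           # encoder bias b_A
    b_B = np.zeros(q)                                   # decoder bias b_B
    history = []
    for _ in range(epochs):
        z1 = A @ w + b_A
        B = relu(z1)                  # Eq. (1): B = A_f(wA + b_A)
        z2 = B @ w_dec + b_B
        A_rec = relu(z2)              # Eq. (2): A' = A_f(w'B + b_B)
        loss = rmse(A, A_rec)         # Eq. (3): quantity being minimized
        history.append(loss)

        # Back propagation of the RMSE loss through both ReLU layers.
        d_rec = (A_rec - A) / (n * q * max(loss, 1e-12))
        d_z2 = d_rec * (z2 > 0)
        d_z1 = (d_z2 @ w_dec.T) * (z1 > 0)

        w_dec -= lr * (B.T @ d_z2)
        b_B -= lr * d_z2.sum(axis=0)
        w -= lr * (A.T @ d_z1)
        b_A -= lr * d_z1.sum(axis=0)
    return {"w": w, "b_A": b_A, "w_dec": w_dec, "b_B": b_B}, history

def encode(A, params):
    """Map the q input features to the q' bottleneck features."""
    return relu(A @ params["w"] + params["b_A"])
```

Calling `encode` on a dataset with q features returns the q′ compressed features that would be passed on to metric matching, while `history` records the reconstruction RMSE per epoch.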

Table 1 Loss function and activation function

Since the built model's training accuracy is primarily based on the matched metric collection, the third stage of heterogeneous prediction modeling, i.e., metric matching, is the most critical and complex stage. To illustrate the association between metrics, modern methods such as the least square method, dispersion diagram method and Spearman's rank correlation method can be used. After measuring the coefficient of correlation value (CCV) between different possible feature pairs of two applications, the model selects those feature pairs whose CCV is greater than the given cutoff threshold.

After imposing a cutoff threshold filter, a set of feature pairs known as strongly correlated features is selected. The source dataset S cannot be used to model defects in a heterogeneous target dataset T if the highly correlated metric collection for a pair of datasets (S, T) is null. In the literature [15, 18], the authors picked instances (rows in a dataset) at random and created only one training chunk from the instance pool of whichever dataset, source or target, had the larger number of instances. However, picking instances arbitrarily from either dataset only once is likely to result in poor metric matching between feature pairs, or the count of strongly correlated feature pairs obtained after applying the threshold filter may be insufficient to train the DP model.

To address this shortcoming of the traditional metric matching approach, a novel metric matching method known as the chunk-based metric matching technique (CBMMT) is proposed in this article to implement this most critical part of the heterogeneous prediction system. CBMMT, which is also random in nature, checks each chunk of instances for metric matching and selects for the DP model's training the chunk that shows the maximum number of strongly correlated feature pairs with a CCV greater than the threshold. In this manner, the most important step of HCPDP-AE is implemented with CBMMT.
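The chunk selection idea can be sketched as follows. This is a hedged illustration rather than the authors' listing: it assumes Spearman's rank correlation as the CCV, a simplified ranking that does not average ties, chunks whose size equals the number of instances in the smaller dataset, and a hypothetical default threshold of 0.5. The names `spearman_matrix` and `cbmmt` are introduced here for illustration.

```python
import numpy as np

def _ranks(col):
    # Ordinal ranks; ties are broken by position (a simplification).
    return np.argsort(np.argsort(col, kind="stable"), kind="stable").astype(float)

def spearman_matrix(S, T):
    """CCV of every column of S against every column of T.

    S and T must have the same number of rows.
    """
    Rs = np.apply_along_axis(_ranks, 0, S)
    Rt = np.apply_along_axis(_ranks, 0, T)
    Rs = (Rs - Rs.mean(axis=0)) / Rs.std(axis=0)
    Rt = (Rt - Rt.mean(axis=0)) / Rt.std(axis=0)
    return Rs.T @ Rt / len(S)

def cbmmt(larger, smaller, threshold=0.5, seed=0):
    """Pick the chunk of `larger` with the most strongly correlated pairs.

    Requires len(larger) >= len(smaller). Returns the row indices of the
    chosen chunk and an array of (larger_feature, smaller_feature) index
    pairs whose CCV exceeds the threshold.
    """
    rng = np.random.default_rng(seed)
    m = len(smaller)
    order = rng.permutation(len(larger))
    best = None
    for start in range(0, len(larger) - m + 1, m):
        rows = order[start:start + m]
        ccv = spearman_matrix(larger[rows], smaller)
        pairs = np.argwhere(ccv > threshold)
        if best is None or len(pairs) > len(best[1]):
            best = (rows, pairs)
    return best
```

If the returned pair list is empty for every chunk, the dataset pair has no strongly correlated metrics at that threshold and, as stated above, cannot be used for heterogeneous prediction.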

figure b

The model is trained using an appropriate machine learning algorithm after identifying this strongly correlated metric collection, and the performance results are reported in the model's final step. Different evaluation metrics are used to outline the output findings. The traditional DP approach, known as WPDP, uses only one dataset, which is then split up into two components, one for training and another for testing, as per the partition strategy (7:3 or 6:4). Fig. 6 presents the structured WPDP model in detail.

Fig. 6 Three-phase WPDP framework

The function of all three phases is identical to that of the corresponding phases of the HCPDP-AE model. In this study, the performance of both categories of DP (WPDP and HCPDP-AE) is evaluated using both single and ensemble learning approaches. Compared with a single model, the ensemble learning approach allows for improved predictive results.

4 Datasets Used and Performance Measures

4.1 Data Collection

The study uses 16 publicly available and widely used datasets from four open source repositories (AEEEM, ReLink, SOFTLAB and NASA) for the experimental analysis. The AEEEM dataset was created by D'Ambros et al. [38]. It has 61 metrics in all: 17 source code metrics, 5 past-defect metrics, 5 entropy-of-change metrics, 17 entropy-of-source-code metrics, and 17 churn-of-source-code metrics [38]. In particular, AEEEM contains linearly decaying entropy (LDHH) and weighted churn (WCHU) metrics.

The ReLink repository stores 26 code metrics extracted with the Understand tool. Wu et al. [15] gathered manually validated and corrected defect data in ReLink. The number of instances in its three datasets varies from 56 to 399, but the number of features is fixed at 26 [15].

The NASA data was obtained over a five-year period from a variety of NASA contractors working in various geographic locations across the USA [5]. Size, readability, complexity, and other static code metrics for NASA datasets are all strongly connected to software quality.

The five datasets of the SOFTLAB repository were obtained from a Turkish software company (SOFTLAB). Each dataset contains data about controller software for a variety of electrical appliances. These proprietary datasets consist of Halstead and McCabe cyclomatic metrics. The SOFTLAB and NASA datasets used here were obtained from the PROMISE repository [39]. These two repositories have 28 metrics in common.

These repositories were chosen for two reasons. First, experimental studies of SDP are commonly carried out on them, as established by a thorough literature review [14, 15]. Second, the datasets must contain a sufficient number of features to execute effective deep learning-based feature extraction and metric matching. Some datasets, such as MORPH, have only 20 features, which is insufficient for the proposed research study. There is no other particular motive for choosing these datasets.

Table 2 provides additional information about these datasets. According to Table 2, the proportion of defective instances in the four project categories ranges from 7.43% to 50.51%. The class imbalance ratio (CIR) is the ratio of the number of defective instances to the number of non-defective instances, or vice versa. For a given training sample, the lower the CIR value, the greater the imbalance problem (IP). CIR values range from 6.79 (highest IP) for CM1 to 102.08 (lowest IP) for Apache. The four repositories contain 61, 26, 29 and 37 software metrics, respectively. All datasets are available at:

https://github.com/Sanuj12/ROHIT_VASHISHT.git
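The CIR definition above can be written as a one-line computation. The sketch below assumes that the reported values are percentages (a value such as 102.08 is only possible on that scale); the instance counts in the usage line are hypothetical and are chosen only so that the result matches the 102.08 quoted for Apache. The actual counts are those listed in Table 2.

```python
def class_imbalance_ratio(n_defective, n_non_defective):
    """CIR as a percentage: defective count over non-defective count.

    The percentage scale is an assumption inferred from the reported range.
    """
    return 100.0 * n_defective / n_non_defective

# Hypothetical counts for illustration only.
print(round(class_imbalance_ratio(98, 96), 2))
```

A value close to 100 indicates nearly balanced classes, whereas a value close to 0 indicates that defective instances are rare.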

Table 2 Datasets illustration

4.2 Performance Parameters

This section presents the metrics used to gauge the performance of the different machine learning classifiers. All of them are computed from the entries of the confusion matrix shown in Table 3: true positives (TP), true negatives (TN), false positives (FP) and false negatives (FN).

  • Accuracy: The ratio of correct outcomes (TP and TN) to the total number of instances examined is known as accuracy. Its value varies from 0 to 1, with 0 representing the least accurate result and 1 representing the most accurate result.

    $${\text{Accuracy }} = \frac{{{\text{TP}} + {\text{TN}}}}{{{\text{TP}} + {\text{TN}} + {\text{FP}} + {\text{FN}}}}$$
    (6)
  • Recall (True Positive Rate): Also known as sensitivity, it is the probability that an instance is predicted as defective given that it is actually defective.

    $${\text{Recall }} = \frac{{{\text{TP}}}}{{{\text{TP}} + {\text{FN}}}}$$
    (7)
  • F-Score: Also known as the F-measure, it is a measure of a test's accuracy in the statistical analysis of binary classification. It considers both the precision and the recall of the test. It is the harmonic mean of precision (p) and recall (r), as per Eq. (8).

    $${\text{F-Score}} = \frac{2 \cdot p \cdot r}{p + r}$$
    (8)
  • Area Under the Receiver Operating Characteristic Curve (AUC): The ROC curve plots the true positive rate (TPR) against the false positive rate (FPR), and the area under it summarizes the overall efficacy of a classification algorithm. A higher AUC value indicates a more accurate classification model. For a given classification algorithm, the maximal AUC value is 1. Table 4 relates AUC values to prediction performance in greater detail.
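The count-based metrics above can be computed directly from the four confusion-matrix entries. The following is a minimal sketch, not the authors' implementation; the function name and the example counts are hypothetical, and precision is taken in its standard form TP / (TP + FP), which the text uses in Eq. (8) without defining it.

```python
def classification_metrics(tp, tn, fp, fn):
    """Accuracy (Eq. 6), recall (Eq. 7) and F-score (Eq. 8) from confusion-matrix counts."""
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # Standard definition of precision; an assumption, since the text does not state it.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    denom = precision + recall
    f_score = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "recall": recall,
            "precision": precision, "f_score": f_score}


# Hypothetical counts, used only to exercise the formulas.
metrics = classification_metrics(tp=40, tn=45, fp=5, fn=10)
print(metrics)
```

The guards against zero denominators matter for highly imbalanced test sets, where a classifier may predict no defective instances at all.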

Table 3 Confusion matrix
Table 4 Relation between AUC value and prediction performance

5 Experimentation Setup

The proposed research study has four main goals. The first two compare the performance of the WPDP and HCPDP frameworks, respectively, with and without the FE technique. The third is to determine whether, and how closely, the prediction performance of the HCPDP-AE model is comparable to that of WPDP for a given set of machine learning classifiers. The final goal is to compare the performance of the proposed model with that of the existing benchmarked models, and to validate the former. Two experiments were performed to address these four research questions.

5.1 Experiment 1

The aim of this experiment is to examine the output of the conventional category of DP, namely WPDP, with two categories of FE techniques: data-based FE and deep learning-based FE. To begin, the dataset is preprocessed to remove unnecessary software features and to label-encode the categorical data. The process assigns 0 and 1 to the defective and non-defective cases, respectively. It also employs the novel CBA mentioned in Sect. 3 to deal with CIP in imbalanced training datasets. According to the current state of the art [23, 29], DP on heterogeneous projects has yet to be examined using FE based on deep learning in conjunction with effective and resilient strategies for dealing with CIP and for executing metric matching, which is the fundamental phase of the HCPDP framework.

Data-driven FE is a generic methodology that does not require domain awareness. Such a representation is found by mining pair-wise feature correlations, evaluating the linear or nonlinear relationship between each pair, applying regression, and choosing the most stable relationships [30]. The experiment compares the performance of a three-stage WPDP model using two FE approaches: regression feature engineering (RFE) and AE [27, 30]. RFE and AE are used as the data-based and deep learning-based FE methods, respectively, to implement the second phase of the model. After the most discriminating features are selected, the available instances are split into training and validation sets in a ratio of 7:3. Figure 7 depicts software defect prediction within a project, where I_Total, I_Train and I_Test denote the counts of total, training and testing observations in a dataset, respectively. Table 5 lists the training and testing datasets used to conduct the experiment.

Fig. 7 With-In project defect prediction

Table 5 Prediction pairs for WPDP

5.2 Experiment 2

The aim of the experiment is to see how the FE step affects the performance of the proposed four-phase HCPDP-AE framework when using the stacking-based ensemble classification method. To carry out this experiment, ten prediction pairs mentioned in Table 6 are taken from four open source projects: AEEEM, ReLink, NASA and SOFTLAB. Prediction pairs are formed based on the number of maximal-associated feature pairs found between them.

Table 6 Prediction combinations for HCPDP

In the first phase, preprocessing involves the deletion of redundant data and the label encoding of categorical data. In this phase, class imbalance learning (CIL) is also applied to deal with the significant difference between the instance counts of the two classes. To carry out CIL, the model employs a novel hybrid approach known as CBA, as mentioned in Sect. 3. Following that, feature ranking and feature selection strategies are used to derive a list of the K best features, i.e., those most relevant to predicting the final outcome for a given dataset. The research aims to examine the framework using AE as a deep learning-based FE approach. The features are chosen to make the two datasets dimensionally identical, allowing for easy metric matching. After the useful features are selected, the metric matching method evaluates the relationship between each pair of features drawn from the source and target datasets. CBMMT is used to execute the third phase, which provides a set of highly correlated features for a given heterogeneous prediction pair.

To train the HCPDP-AE model, i.e., to execute the final modeling phase, the experiment uses an ensemble classification approach. Finally, the efficiency of the model is assessed using the performance parameters specified in Sect. 4.

6 Results and Discussion

TensorFlow 2.0 with GPU support is used to conduct both experiments, which run on Windows 10 with an Intel Core i5-1130G7 processor and 32 GB RAM. The results of the experiments are discussed in this section. Tables 7, 8, 9, 10, 11, 12, 13, 14 and 15 and Figs. 8, 9, 10, 11, 12 and 13 display the experimental results. The analysis uses accuracy, recall, F-score (FS) and AUC as performance indicators. Randomness, the class imbalance issue and the prediction threshold all have a significant impact on the model's accuracy and recall. As a result, tenfold cross-validation (CV) is used when estimating these two metrics in order to obtain results that are less skewed and have a low variance. AUC and F-score are also evaluated over 30 repeated trials to reduce the effect of randomness on prediction performance. In addition, these two parameters (AUC and FS) are not affected by the imbalanced distribution of instances or by the prediction threshold.

Table 7 Feature extraction using AE
Table 8 Feature reduction for WPDP
Table 9 Comparison of WPDP performance
Table 10 Statistics of feature reduction for HCPDP
Table 11 Performance comparison of HCPDP output
Table 12 Execution time (in seconds) for HCPDP and HCPDP_AE
Table 13 Comparative analysis of WPDP-AE and HCPDP-AE
Table 14 HCPDP-AE vs. benchmarked HDP models
Table 15 Output of Wilcoxon signed rank test
Fig. 8 WPDP’s performance comparison with three baselines
Fig. 9 CBMMT in HCPDP-P1
Fig. 10 HCPDP vs. HCPDP-AE
Fig. 11 WPDP-AE vs. HCPDP-AE
Fig. 12 Comparison between HCPDP-AE with benchmarked models
Fig. 13 Plot of AUC values for nine prediction pairs

6.1 RQ1. Compare and contrast the use of data-driven and deep learning-based FE strategies with the conventional approach to DP, i.e. WPDP.

This research question compares WPDP's output using RFE and AE as the data-driven and deep learning-based FE techniques, respectively. Ten with-in prediction pairs, WPDP-P1 to WPDP-P10, are used for the analysis. In the first step of the WPDP model, i.e., preprocessing, binary encoding is used to label the target variable, and Z-score normalization (ZSN) is used to scale all feature values so that no single feature overshadows the others. The mean and standard deviation of each feature are computed to produce the normalized values. In this manner, feature encoding and feature scaling are performed as two subtasks of this step. The last subtask of the preprocessing step is to manage CIP in the training datasets of all prediction pairs. The range of CIR values for all ten with-in prediction pairs is shown in Table 2. With a CIR of 6.29%, CM1 is the most imbalanced dataset, whereas Apache has the most balanced data distribution with a CIR of 102.08%. Traditional data re-sampling methods used for CIL have their own pitfalls, as discussed in Sect. 3.
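The Z-score scaling used in the preprocessing step can be illustrated with a short sketch. This is not the authors' code: the function name is hypothetical, and the use of the population standard deviation (rather than the sample one) is an assumption, since the text does not say which was used.

```python
from statistics import mean, pstdev


def z_score_normalize(column):
    """Scale one feature column to zero mean and unit standard deviation."""
    mu = mean(column)
    sigma = pstdev(column)  # population standard deviation (assumption)
    if sigma == 0:
        # A constant feature carries no information; map it to zeros.
        return [0.0 for _ in column]
    return [(value - mu) / sigma for value in column]


# Hypothetical feature values, e.g. lines of code of three modules.
scaled = z_score_normalize([2.0, 4.0, 6.0])
print(scaled)
```

After scaling, every feature has the same spread, so a metric measured in large units (such as lines of code) cannot dominate one measured in small units.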

In the analysis, the prediction accuracy of the WPDP model is assessed using three CIL approaches: Synthetic Minority Over-Sampling Technique (SMOTE) as a ROS technique, RUS and a novel approach CBA [37]. The algorithms SMOTE_CIL and RUS_CIL, respectively, explain the detailed procedure for implementing SMOTE and RUS.

figure c
figure d

For example, in HCPDP-P3, the source dataset ar1 has a CIR of 7.43. Based on the instance counts in Table 2, the minority and majority classes are the defective and non-defective categories, respectively. With ICMaj = 112 and ICMin = 9, the OS% is computed as 92% according to Eq. (9). According to Eq. (10), the CRR is 103, i.e., to attain a CIR of 1:1, 103 additional replicas should be created for the defective class by following steps 4 to 7 in SMOTE_CIL. To deal with CIP, RUS_CIL deletes 243 randomly chosen instances from the non-defective class in dataset CM1.
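The arithmetic of the ar1 example can be checked with a few lines of code. Eqs. (9) and (10) are defined in Sect. 3 and are not reproduced here, so the two formulas below are assumptions chosen only because they reproduce the quoted values (92% and 103); the function name is hypothetical.

```python
def oversampling_requirements(ic_maj, ic_min):
    """Return (OS%, CRR) for a majority/minority instance count pair.

    Assumed forms (consistent with the numbers quoted for ar1, not taken
    from the paper's equations):
      OS% = (ICMaj - ICMin) / ICMaj * 100
      CRR = ICMaj - ICMin   (replicas needed for a 1:1 class ratio)
    """
    crr = ic_maj - ic_min
    os_percent = crr / ic_maj * 100
    return os_percent, crr


os_percent, crr = oversampling_requirements(ic_maj=112, ic_min=9)
print(f"OS% = {os_percent:.0f}%, CRR = {crr}")
```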

The shallow AE model extracts features using a three-layer convolutional FFNN. The number of neurons in both the input and output layers equals the count of original features in the dataset. The number of neurons in the hidden layer, which represents the compact number of features, is chosen by rule of thumb [24]. The reconstruction loss between the initial input features and the output features reconstructed from the extracted features is measured by RMSE. The output of the hidden layer with the lowest RMSE is taken as the final set of extracted features. Feature extraction is performed in this way using a deep learning-based AE model. For example, Table 7 displays the RMSE values for the considered hidden-layer sizes for the first four with-in prediction pairs. It shows that the reduced numbers of features for datasets ar1, JDT, Apache and ML are 17, 20, 16 and 15, respectively, with corresponding least RMSE values of 0.1371, 0.1207, 0.1261 and 0.1129.
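The selection rule described above (train one shallow auto-encoder per candidate hidden-layer size and keep the size with the lowest reconstruction RMSE) can be sketched as follows. This is an illustrative NumPy implementation under stated assumptions, not the TensorFlow model used in the study: the tanh encoder, linear decoder, learning rate, epoch count, candidate sizes and random data are all hypothetical.

```python
import numpy as np


def train_autoencoder(X, n_hidden, epochs=500, lr=0.01, seed=0):
    """Train a one-hidden-layer auto-encoder; return (hidden codes, RMSE)."""
    rng = np.random.default_rng(seed)
    n_samples, n_features = X.shape
    W1 = rng.normal(0.0, 0.1, size=(n_features, n_hidden))
    b1 = np.zeros(n_hidden)
    W2 = rng.normal(0.0, 0.1, size=(n_hidden, n_features))
    b2 = np.zeros(n_features)
    for _ in range(epochs):
        H = np.tanh(X @ W1 + b1)            # encoder
        err = (H @ W2 + b2) - X             # linear decoder minus input
        g_out = 2.0 * err / n_samples       # gradient of the squared error
        g_hid = (g_out @ W2.T) * (1.0 - H ** 2)
        W2 -= lr * (H.T @ g_out)
        b2 -= lr * g_out.sum(axis=0)
        W1 -= lr * (X.T @ g_hid)
        b1 -= lr * g_hid.sum(axis=0)
    H = np.tanh(X @ W1 + b1)
    rmse = float(np.sqrt(np.mean((H @ W2 + b2 - X) ** 2)))
    return H, rmse


def select_hidden_size(X, candidates):
    """Keep the candidate hidden size with the lowest reconstruction RMSE."""
    results = {n: train_autoencoder(X, n) for n in candidates}
    best_n = min(results, key=lambda n: results[n][1])
    codes, rmse = results[best_n]
    return best_n, codes, rmse


# Hypothetical standardized data: 60 instances, 8 features.
rng = np.random.default_rng(42)
X = rng.normal(size=(60, 8))
X = (X - X.mean(axis=0)) / X.std(axis=0)
best_n, codes, rmse = select_hidden_size(X, candidates=[3, 4, 5])
print(best_n, codes.shape, round(rmse, 4))
```

The hidden-layer activations returned for the chosen size play the role of the extracted feature set, so the output has one column per hidden neuron.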

The statistics of the original and extracted features using the AE model are shown in Table 8. They indicate an overall decrease of 50.45% in the total number of input features.

As shown in Fig. 7, 70% and 30% of the total observations are used as training and testing observations for each prediction pair, respectively. For example, WPDP-P1 (ar1) uses 85 instances to train the WPDP model and 36 instances to validate it. The WPDP model is trained using a support vector machine (SVM) with a radial basis function kernel (RBFK). SVM-RBF is an effective algorithm for linear as well as nonlinear classification. It is beneficial when the number of features in a dataset is substantially greater than the number of instances. Looking at the cardinality of the datasets used, the number of observations in Apache, ar1, ar3, ar4, ar5, ar6 and Safe is substantially lower than in the others. SVM was chosen so that the limited number of instances in these training datasets has little or no impact on prediction performance. Table 9 shows the values of the performance parameters after validating the model with 30% testing observations for all prediction pairs under three baselines: without FE, with FE using RFE, and with FE using AE.
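The instance counts quoted for WPDP-P1 follow from the 7:3 split. The sketch below reproduces them for ar1 (121 instances in total); the function names, the rounding rule and the seeded shuffle are assumptions made for illustration.

```python
import random


def split_counts(i_total, train_fraction=0.7):
    """Return (I_Train, I_Test) for a given I_Total, rounding to the nearest integer."""
    i_train = round(i_total * train_fraction)
    return i_train, i_total - i_train


def split_indices(i_total, train_fraction=0.7, seed=0):
    """Shuffle instance indices and split them into training and testing sets."""
    indices = list(range(i_total))
    random.Random(seed).shuffle(indices)
    i_train, _ = split_counts(i_total, train_fraction)
    return indices[:i_train], indices[i_train:]


train_idx, test_idx = split_indices(121)  # ar1 has 121 instances
print(len(train_idx), len(test_idx))
```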

The results of Table 9 can be viewed from two angles. The first is a comparison of DP’s output between the standard CIL approaches and the proposed hybrid CBA approach. The results reveal that CBA outperforms both SMOTE and RUS for all prediction pairs.

As per the statistics of Table 9, CBA offers the best values of all performance parameters with accuracy as 0.94 for WPDP-P3 and recall, AUC and FS as 0.96, 0.810 and 0.88, respectively, for WPDP-P4, when AE model was employed for feature extraction.

When CIP is handled using SMOTE with the original pool of features, the prediction pair WPDP-P7 has the lowest accuracy and AUC, 0.33 and 0.300, respectively. For all three baselines, SMOTE and RUS give comparable results. The first baseline, i.e., WPDP without FE, performs worst among the three. Using the average values of all performance measures over the ten prediction pairs under the CBA strategy for CIP, Fig. 8 depicts the performance comparison of WPDP under the three baselines as:

$$\text{WPDP without FE} < \text{WPDP with RFE} < \text{WPDP with AE}$$

As a result, the proposed three-phase WPDP model (WPDP-AE) outperforms when a hybrid solution is used to tackle CIP and feature extraction is performed using a deep learning-based model rather than a data-driven approach.

6.2 RQ2. Compare the Prediction Performance of Proposed HCPDP-AE Framework with and Without FE Phase.

Ten heterogeneous prediction pairs (HCPDP-P1 to HCPDP-P10) from four open source projects are considered to test the prediction accuracy of the proposed HCPDP-AE model with and without FE. In the first step, binary encoding and ZSN are used for feature encoding and feature scaling, respectively. Preprocessing is carried out in the same way as in the WPDP scenario. The implementation of the HCPDP-AE model is explained in detail using the prediction combination HCPDP-P1, where JDT and ar1 are the source and target datasets, respectively. In both datasets, the target variable is initially encoded by labeling defective instances as 0 and non-defective instances as 1. Feature extraction is then performed using the three-layer AE model in the same way as in the WPDP-AE model. Table 10 shows the total number of extracted features and the feature reduction percentage for all training datasets in the ten prediction combinations considered for the analysis.

The extracted feature set is chosen based on the least reconstruction loss, in terms of RMSE, for a particular input feature set. The number of neurons in the hidden layer that produces the least RMSE is referred to as the compressed number of features for that dataset.

After the second phase (FE), JDT and ar1 have dimensions of (997 × 20) and (121 × 17), respectively. In heterogeneous prediction, the next phase, metric matching, is the most important of the four phases. The HCPDP-AE model uses a novel technique called CBMMT to accomplish this phase. The primary requirement of CBMMT is that the two datasets have the same number of software features. The authors applied the Fisher Score method, a supervised feature ranking and selection technique, to choose the best 17 features (the minimum of 20 and 17) that are most relevant and useful in predicting the final outcome. The Fisher Score technique was not chosen over other feature selection techniques for any particular reason. It evaluates each feature’s significance independently and ranks the features according to their utility in predicting the expected outcome, producing a list of features sorted by ranking value in descending order.

Once the number of features has been equalized in both datasets, CBMMT performs metric matching. The execution of metric matching in HCPDP-P1 is shown in Fig. 9. According to CBMMT, the variables a, p, n and m have values of 997, 121, 17 and 8, respectively. This means there are eight candidate chunks (JDT_C1, JDT_C2, ..., JDT_C8) of the source dataset JDT for evaluating feature correlation with ar1. As shown in Fig. 9, CBMMT calculates eight correlation matrices (C1, C2, ..., C8), one for each combination of a source chunk with the target dataset ar1: (JDT_C1, ar1), (JDT_C2, ar1), (JDT_C3, ar1), ..., (JDT_C8, ar1). Each matrix is used to find the number of highly correlated features between the two datasets.

The threshold for this analysis is set at 0.05. In line with the state of the art, this value is chosen to cover the maximum number of possible pairs of source and target projects for defect prediction [15]. Based on the CCVs obtained in the correlation matrix, a threshold of 0.05 yields the greatest number of highly associated feature pairs. The cutoff level is thus determined from the estimated CCVs, allowing the DP model to predict defects in the maximum number of datasets in the target project. In HDP, this is referred to as target prediction coverage (TPC), and the best HCPDP model should obtain the highest TPC for a given prediction combination [36]. Selecting a proper threshold to achieve maximum TPC is one of the promising future directions in this domain. In the analysis, after 30 repeated trials, the candidate chunk C4 gives the highest number of highly correlated pairs with ar1. Between JDT and ar1, CBMMT discovered 9 pairs of correlated features.
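The chunk-selection idea behind metric matching can be sketched in a few functions. CBMMT itself is defined in Sect. 3 and is not reproduced here, so everything below is an assumption made for illustration: the use of Pearson correlation as the CCV, the rule of counting every feature pair whose coefficient exceeds the threshold, the seeded shuffle, and all function names. The one detail taken from the text is the number of chunks, m = a // p (997 // 121 = 8 for JDT and ar1).

```python
import random


def pearson(u, v):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    cov = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    su = sum((a - mu) ** 2 for a in u) ** 0.5
    sv = sum((b - mv) ** 2 for b in v) ** 0.5
    return cov / (su * sv) if su and sv else 0.0


def n_chunks(a, p):
    """Number of source chunks of p instances that fit into a instances."""
    return a // p


def count_correlated_pairs(chunk, target, threshold=0.05):
    """Count (source feature, target feature) pairs with correlation > threshold."""
    src_cols, tgt_cols = list(zip(*chunk)), list(zip(*target))
    return sum(1 for s in src_cols for t in tgt_cols if pearson(s, t) > threshold)


def best_source_chunk(source, target, threshold=0.05, seed=0):
    """Split the shuffled source into chunks of len(target) rows; return best index and counts."""
    p = len(target)
    rows = list(source)
    random.Random(seed).shuffle(rows)
    chunks = [rows[i * p:(i + 1) * p] for i in range(n_chunks(len(rows), p))]
    counts = [count_correlated_pairs(c, target, threshold) for c in chunks]
    best = max(range(len(counts)), key=counts.__getitem__)
    return best, counts


# Hypothetical data: 12 source rows and 4 target rows, 2 features each.
gen = random.Random(1)
source = [[gen.random(), gen.random()] for _ in range(12)]
target = [[gen.random(), gen.random()] for _ in range(4)]
best, counts = best_source_chunk(source, target)
print(best, counts)
```

Each chunk has the same number of rows as the target dataset, which is what allows a column of the chunk to be correlated with a column of the target.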

CIL is now applied to the candidate source chunk C4 using the novel technique CBA. The algorithm generates three chunks as output, each with a balanced number of defective and non-defective instances. Hence, CBMMT and CBA are each used only once over the life cycle of the HCPDP-AE model to predict defects between two heterogeneous projects. Thus, the third phase, i.e., metric matching, is carried out to produce the appropriate chunks of the training dataset JDT.

Finally, the model is trained using stacking-based ensemble learning (SBEL), with two base models, K-nearest neighbor (KNN) and random forest (RF), and a meta model trained using logistic regression (LR). With a higher value of the hyper-parameter k, KNN is less sensitive to the randomness in picking instances for the training dataset. The findings revealed that k = 18 keeps the variance of the model's performance under control in the experiments; this value was settled by trial and error. Random forest is a classifier that builds a number of decision trees on different subsets of a dataset and combines their results to improve predictive accuracy. Instead of relying on a single decision tree, the random forest collects the prediction from each tree and predicts the final output based on the majority vote, which increases the accuracy of the model's predictions. LR is easy to interpret and less prone to over-fitting.

Because CBA and CBMMT use random shuffling of instances in their implementations, the most important criterion when choosing classification algorithms is that randomness should have a minimal impact on prediction performance. Second, the dataset distribution plays a significant role in the selection of the classification technique. An ensemble model gives better prediction performance than a single prediction model. SBEL is the most suitable classification algorithm among all ensemble learning strategies when the training data is broken into n distinct fragments. Finally, the performance of the heterogeneous prediction is compared in terms of AUC, recall, accuracy and FS with and without FE, as shown in Table 11.
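The stacking configuration described above (KNN with k = 18 and random forest as base models, logistic regression as meta model) could be assembled as in the following sketch. It uses scikit-learn, which is an assumption since the text names only TensorFlow as the experimental platform; the synthetic dataset, the helper name and all remaining hyper-parameters (library defaults) are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier


def build_stacking_model(k=18, seed=0):
    """Stack KNN and random forest under a logistic-regression meta model."""
    base_models = [
        ("knn", KNeighborsClassifier(n_neighbors=k)),
        ("rf", RandomForestClassifier(random_state=seed)),
    ]
    return StackingClassifier(estimators=base_models,
                              final_estimator=LogisticRegression())


# Synthetic stand-in for a matched source/target feature set (17 features).
X, y = make_classification(n_samples=300, n_features=17, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y)

model = build_stacking_model().fit(X_train, y_train)
accuracy = model.score(X_test, y_test)
print(f"accuracy = {accuracy:.2f}")
```

In this arrangement the meta model is fitted on cross-validated predictions of the base models, so it learns how much to trust each base learner rather than seeing the raw features.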

In all ten prediction combinations, HCPDP-AE outperforms HCPDP without FE, as shown in Table 11. The highest and lowest AUC values are 0.901 (HCPDP-P10) and 0.697 (HCPDP-P2), respectively, when the features are extracted using the AE technique. The experimental results under RQ1 and RQ2 show that using AE to obtain strongly discriminative features has a substantial impact on the prediction accuracy of DP, whether it is WPDP or HCPDP. Figure 10 compares heterogeneous prediction output with and without FE, taking the average values of all performance parameters into account.

In Table 12, the authors introduce training time as an extra evaluation criterion to improve the comparability of the results. For all prediction combinations, Table 12 compares HCPDP's performance with the original (HCPDP) and reduced (HCPDP-AE) number of features. It can be seen that HCPDP-AE takes longer to train than the HCPDP model because the former includes the training of a neural network for feature extraction. The average classification time for the HCPDP-AE model is 0.178 s, which is 37.32% faster than the average classification time for HCPDP.

However, because the HCPDP-AE framework only includes one hidden layer in the encoding–decoding AE model for feature extraction, there is no substantial difference in training and classification time for both approaches. As a result, the execution time cannot be used as an effective assessment parameter to compare the performance of the two models here.

For HCPDP, the mean accuracy, recall, F-score and AUC are 0.63, 0.67, 0.64 and 0.587, respectively; for HCPDP-AE, they are 0.83, 0.84, 0.83 and 0.796. The target dataset in both HCPDP-P5 and HCPDP-P6 is ar3, but the source dataset differs. According to the experimental findings of Table 11, the prediction accuracy using LC as the source dataset is higher than that using EQ. CBMMT produced 12 and 8 strongly correlated feature pairs for (LC, ar3) and (EQ, ar3), respectively, which explains LC's superior performance: it has more features correlated with the target dataset ar3. The study shows that a larger set of highly correlated feature pairs significantly improves heterogeneous predictive performance.

6.3 RQ3. Whether and to What Extent DP Results of HCPDP’s Model are Comparable to the Outcomes of WPDP’s Model?

The experimental findings for with-in project combinations WPDP-P1 to WPDP-P10 and heterogeneous combinations HCPDP-P1 to HCPDP-P10 can be used to compare the success of the standard method of DP, i.e., WPDP, with heterogeneous prediction. The target projects in these combinations are same. Under WPDP and HCPDP categories, the same as well as different source projects are used to predict defects in these target projects. WPDP's prediction performance is comparable to HCPDP, as shown in Table 13.

It is generally expected that a DP model trained and tested on the same project makes stronger predictions than a DP model trained and tested on different projects. Here, however, WPDP and HCPDP have almost comparable mean AUCs of 0.778 and 0.796, respectively. Figure 11 presents a comparison of WPDP and HCPDP based on the mean values of all output parameters considered. On the basis of these findings, one may conclude that HCPDP efficiency is comparable to defect prediction within the project with statistical significance. Moreover, when there is not enough past defect data for the target application to train the DP model, the proposed HCPDP-AE model proves useful.

6.4 RQ4. Compare and Validate the Performance of the Proposed HCPDP-AE Framework with Existing Benchmarked HCPDP Models.

To assess the performance of the proposed algorithm, i.e., HCPDP-AE, three existing benchmarked defect prediction models, TCA+ [31], CCA+ [14] and GMOTDP [24], are considered for the comparative analysis. In heterogeneous defect prediction, TCA+ and CCA+ are two benchmarked comparative approaches, and GMOTDP is very recent work in the SDP domain. One dataset is used as the source project, and three heterogeneous datasets from another project are used as target projects for heterogeneous prediction throughout the analysis. AUC is used as the evaluation index for the DP output of all four prediction models. The AUC estimates the probability that a classification model ranks a randomly selected defective instance higher than a randomly selected defect-free instance [32]. AUC is preferred over other performance evaluation metrics (such as accuracy and FS) because it is not influenced by the class imbalance issue and is independent of the prediction threshold used to decide whether an instance should be labeled as negative [9, 33, 34]. According to the literature [14, 31], CIP is not taken into account in TCA+ and CCA+. These facts (CIP and prediction threshold) highlight the importance of comparing model performance using AUC, so that uniform pre-conditions (irrespective of CIP) are achieved for the comparative analysis of all four models. Since treating CIP with CBA and selecting the source dataset's chunk for metric matching using CBMMT in HCPDP-AE both involve randomness, the average result of 30 repeated trials is taken for each case to mitigate the effect of randomness on the experimental outcomes in both training and testing. The authors examined these nine prediction pairs for HCPDP-AE only and used the experimental results from [24] for the remaining three HDP models.
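The ranking interpretation of AUC given above can be made concrete with a small sketch that compares every (defective, defect-free) pair of instances. The function name and the example scores are hypothetical. For readability, label 1 marks the defective (positive) class here, which is the opposite of the 0/1 encoding used in the experiments.

```python
def auc_by_ranking(scores, labels):
    """AUC as the fraction of (defective, defect-free) pairs ranked correctly.

    scores: predicted defect-proneness, higher means more likely defective.
    labels: 1 for defective, 0 for defect-free (assumption of this sketch).
    Ties count as half a correct ranking.
    """
    positives = [s for s, l in zip(scores, labels) if l == 1]
    negatives = [s for s, l in zip(scores, labels) if l == 0]
    if not positives or not negatives:
        raise ValueError("both classes must be present")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))


# Hypothetical scores: three of the four pairs are ranked correctly.
print(auc_by_ranking([0.9, 0.8, 0.4, 0.3], [1, 0, 1, 0]))
```

Because only the relative order of the scores matters, the result does not change with the decision threshold or with the ratio of defective to defect-free instances, which is the property the comparison relies on.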

Table 14 reveals that HCPDP-AE gives markedly better prediction performance than the two classic HDP models, TCA+ and CCA+, with AUC gains of 54.88% and 46.09% over their respective mean AUC values. The difference in mean AUC between GMOTDP and HCPDP-AE is much smaller. Nonetheless, HCPDP-AE outperforms GMOTDP, with a 2.43% AUC gain over the latter. The mean AUC for HCPDP-AE is 0.8901, while the mean AUCs for TCA+, CCA+ and GMOTDP are 0.5747, 0.6093 and 0.8690, respectively, as shown in Table 14.
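The quoted AUC gains are relative improvements over each baseline's mean AUC. The short sketch below recomputes them from the mean AUC values stated in the text; the function name is hypothetical.

```python
def auc_gain_percent(proposed_auc, baseline_auc):
    """Relative AUC gain of the proposed model over a baseline, in percent."""
    return (proposed_auc - baseline_auc) / baseline_auc * 100


hcpdp_ae = 0.8901
baselines = {"TCA+": 0.5747, "CCA+": 0.6093, "GMOTDP": 0.8690}
for name, auc in baselines.items():
    print(f"{name}: {auc_gain_percent(hcpdp_ae, auc):.2f}%")
```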

Figure 12 contrasts the performance of all four HCPDP models on the basis of the mean AUC value over the same nine prediction combinations. The analysis concludes that HCPDP-AE has a stronger prediction effect than the other three models, as also shown by the line graph in Fig. 13. The performance ranking of the four prediction models in increasing order is therefore:

$$\text{TCA+} < \text{CCA+} < \text{GMOTDP} < \text{HCPDP-AE}$$

The following are the reasons behind the usefulness and superior performance of proposed model HCPDP-AE:

  1. First, it handles CIP better than SMOTE and RUS. It tackles SMOTE's overgeneralization and over-fitting concerns by balancing minority class instances through the creation of chunks, as described in CBA, rather than introducing synthetic instances that duplicate minority class instances. RUS, on the other hand, discards majority class instances regardless of their importance in predicting the final outcome, whereas CBA does not remove any instance from the dataset. In this manner, CBA addresses the limitations of the data re-sampling approaches used to manage CIP.

  2. By randomly selecting instances from the original source dataset, traditional HDP techniques [14, 31] create only one training source chunk for the metric matching step. Because there is only one source training chunk, the scope for identifying a better correlation matrix with the target application is limited. Furthermore, the random selection of instances used to build this chunk can occasionally produce poor results. In contrast, as indicated in Sect. 3, CBMMT widens this scope by providing numerous source chunks that are used to compute correlation matrices with the target dataset. The candidate source chunk with the highest number of CCVs above the threshold (> 0.05) in the corresponding correlation matrix is used as the training dataset.

  3. Finally, the deep learning-based FE approach, AE, helps increase the prediction performance of the HCPDP-AE model.

A nonparametric Wilcoxon signed rank test is used to statistically validate the effect of deep learning-based feature extraction on heterogeneous prediction. The test is performed at a significance level of 0.05 (5%) to see whether there is a significant difference between the performance of the proposed model HCPDP-AE and that of the existing benchmarked models TCA+, CCA+ and GMOTDP. For the analysis, the authors use the recall and AUC metrics to compare all the models on 14 heterogeneous prediction combinations: (EQ → ar3), (EQ → ar4), (EQ → ar5), (JDT → ar3), (JDT → ar4), (JDT → ar5), (LC → ar3), (LC → ar4), (LC → ar5), (JDT → ar1), (ar5 → JDT), (ar1 → Apache), (Safe → ML) and (ML → ar4). AUC and recall are used as assessment measures for the test because they are least affected, or not affected at all, by the imbalance problem in training datasets [9, 33, 34, 35]. Although HCPDP-AE uses the robust approach CBA to deal with CIP, CCA+ and TCA+ do not address this issue in their frameworks [14, 31]. The authors therefore used these two parameters as evaluation criteria for the test in order to obtain consistent and unbiased results. If the calculated P-value is less than 0.05, the null hypothesis is rejected. The null and alternate hypotheses are framed as follows:

  • H0: The two heterogeneous models give the same prediction performance.
  • H1: The two heterogeneous models give different prediction performance.
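The mechanics of the test can be illustrated with a self-contained sketch of the two-sided exact Wilcoxon signed rank test for paired samples. It is not the authors' procedure in detail: dropping zero differences, averaging tied ranks and computing an exact (rather than approximate) P-value are assumptions, and the paired values in the usage example are hypothetical, not results from Table 15.

```python
from collections import Counter


def wilcoxon_signed_rank(x, y):
    """Two-sided exact Wilcoxon signed rank test; returns (W+, P-value)."""
    diffs = [a - b for a, b in zip(x, y) if a != b]  # zero differences dropped
    n = len(diffs)
    if n == 0:
        raise ValueError("all paired differences are zero")
    # Rank |d| with average ranks for ties; ranks are doubled to stay integers.
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks2 = [0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        for k in range(i, j + 1):
            ranks2[order[k]] = i + j + 2  # twice the average of ranks i+1..j+1
        i = j + 1
    w_plus2 = sum(r for r, d in zip(ranks2, diffs) if d > 0)
    # Exact null distribution: each rank joins W+ with probability 1/2.
    dist = Counter({0: 1})
    for r in ranks2:
        updated = Counter()
        for total, count in dist.items():
            updated[total] += count
            updated[total + r] += count
        dist = updated
    n_outcomes = 2 ** n
    p_low = sum(c for t, c in dist.items() if t <= w_plus2) / n_outcomes
    p_high = sum(c for t, c in dist.items() if t >= w_plus2) / n_outcomes
    return w_plus2 / 2, min(1.0, 2 * min(p_low, p_high))


# Hypothetical paired scores of two models on six prediction pairs.
model_a = [2, 3, 4, 5, 6, 7]
model_b = [1, 1, 1, 1, 1, 1]
w_plus, p_value = wilcoxon_signed_rank(model_a, model_b)
print(w_plus, p_value)  # H0 is rejected when p_value < 0.05
```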

As shown in Table 15, the calculated P-values for both recall and AUC, for every benchmarked model paired with HCPDP-AE, are less than 0.05. This means that the proposed HCPDP-AE model performs differently from the existing heterogeneous prediction models. Therefore, on the basis of this empirical study, it can be concluded that HCPDP-AE outperforms all the compared models.

7 Threats to validity

In the proposed feature extraction technique, the number of neurons in the hidden layer is determined by the minimum reconstruction loss. To perform metric matching effectively, the CBMMT technique needs at least 15 input features from both projects, so that at least 5 correlated feature pairs can be obtained after metric matching to train the DP model effectively. This raises a question about the model's construct validity. The experimental findings could be improved further if the classification algorithms were tuned using optimized settings, as the authors used the default options for the machine learners in the experiments. This may also be a threat to construct validity.

8 Conclusion

HCPDP is a promising area in the SDP domain that uses potentially heterogeneous software project datasets to predict defects in new projects or in projects that lack the historical defect data needed to train a DP model. In this paper, the authors proposed a novel four-phase heterogeneous prediction model using a deep learning-based FE technique, the auto-encoder, to extract strongly discriminative features that are more relevant for predicting the expected outcomes. Furthermore, two novel techniques, CBA and CBMMT, are proposed to deal with the imbalance problem in training datasets and to evaluate the correlation between features of two heterogeneous projects, respectively. For WPDP and HCPDP, the study reduces the number of features by 50.45% and 47.31%, respectively. The experimental findings show that the prediction performance of heterogeneous prediction with and without feature extraction is statistically significant as compared to the respective DP within a project.

The results indicate that using a deep learning method to extract features has a major impact on model prediction accuracy compared with a data-driven FE approach, since a larger number of features contributes to over-fitting and to longer model training time. In addition, the authors compared the proposed model's efficiency with three traditional heterogeneous prediction models. HCPDP-AE was found to outperform all of them, with the highest mean AUC value of 0.8901.

The future scope of the research is to integrate instance-based filtering and feature extraction as a double pre-processing of datasets. Second, the metric matching process has a huge impact on the prediction performance of any heterogeneous model. Another interesting future direction is therefore to develop a more robust correlation estimation methodology. Future research should also focus on developing an empirical relationship between software defect prediction and predictive maintenance. The authors used a shallow auto-encoder with only one hidden layer in this study. After studying their applicability to a particular problem, other variants of AE, such as stacked AE, contractive AE and de-noising AE, may be investigated as future work.