1 Introduction

Modern software is a large collection of code with thousands or even millions of lines, so ensuring software quality has become a key issue. Bugs inevitably exist in software projects, even when experienced developers build them on powerful development platforms (Xiaoting et al., 2020).

Within-project defect prediction (WPDP) (Lee et al., 2011; Menzies et al., 2007b; Nam et al., 2015) first constructs classification models from a sufficient amount of historical labeled software entities of a project and then uses the models to predict the defect labels of new entities within the same project (Xu et al., 2018). However, when not enough labeled defect data is available, WPDP does not necessarily work well.

To address this problem, many researchers have explored open source software and shared public datasets such as NASA (Menzies et al., 2007a), Relink (Wu et al., 2011), Promise (Jureczko & Madeyski, 2010), and AEEEM (D’Ambros et al., 2010). These datasets make cross-project defect prediction (CPDP) possible. CPDP (He et al., 2012; Ma et al., 2012; Rahman et al., 2012; Turhan et al., 2009) studies how to transfer knowledge relevant to the target project from source projects (Hosseini et al., 2019), and it has attracted considerable attention in the recent literature.

However, most existing CPDP methods assume that the source and target projects have identical or at least similar features. Unfortunately, this assumption does not hold when source and target projects share few or no features (Turhan et al., 2008), because different projects may be developed in different programming languages and their features may be collected at different levels of granularity with different tools. Jing et al. (Hosseini et al., 2019) showed that NASA and Relink, as well as Softlab and Relink, have few common metrics, and that the remaining dataset pairs share none. In such cases, the benefit of CPDP is limited. Heterogeneous cross-project software defect prediction (HCPDP) methods were therefore proposed.

The main challenge of HCPDP is how to narrow the difference between the feature spaces of the source and target projects. In addition, the relationship between source and target projects influences the effectiveness of transfer learning: using source data that is unrelated to the target may degrade classifier performance and cause negative transfer. One strategy to reduce negative transfer is to increase the number of source projects.

In this paper, we propose a novel multi-source heterogeneous cross-project defect prediction (MHCPDP) method. To narrow the gap between source and target datasets, MHCPDP uses an autoencoder to extract intermediate features from the original datasets instead of simply removing redundant and unrelated features, and adopts a modified autoencoder algorithm to perform instance selection, eliminating irrelevant instances from the source datasets.

Furthermore, to reduce the impact of negative transfer, MHCPDP incorporates multiple source projects, which alleviates the problem of insufficient source data, and defines a multi-source transfer-learning algorithm.

To evaluate our approach, we conducted large-scale experiments on five public datasets. This paper mainly focuses on answering the following three research questions.

  1. Is MHCPDP better than the existing HCPDP methods?

  2. How does MHCPDP compare to unsupervised learning?

  3. What is the result of MHCPDP when compared with WPDP?

This paper makes the following main contributions:

  1. We propose the MHCPDP method for HCPDP. MHCPDP adopts an autoencoder and a modified autoencoder algorithm for dimensionality reduction and instance selection, which reduces the feature space gap between the source and target projects. Furthermore, we develop a multi-source transfer-learning algorithm to address the negative transfer effects of transfer learning in HCPDP and to overcome the shortcoming of insufficient source projects.

  2. We comprehensively evaluate MHCPDP using two widely used performance indicators. The results show that MHCPDP is not only comparable with two of the most advanced HCPDP methods but also overcomes the shortcomings of conventional HCPDP methods.

  3. We conduct extensive experiments on our method. The results show that MHCPDP is competitive with the WPDP scheme and a state-of-the-art unsupervised learning method.

The rest of the paper is organized as follows. Section 2 introduces related work. After presenting our MHCPDP method in Sect. 3, we detail the experimental setup in Sect. 4. Section 5 reports the experimental results and statistical tests. Section 6 discusses the parameter sensitivity of MHCPDP, its effectiveness, and the impact of different classifiers, followed by threats to validity in Sect. 7. Finally, Sect. 8 concludes the work.

2 Related work

2.1 Cross-project defect prediction method

Cross-project defect prediction builds a defect prediction model for a target project from the datasets of other existing projects. To make such models effective and accurate, scholars have carried out a series of studies.

Briand et al. (2002) were the first to study CPDP; they proposed a new use of exploratory analysis techniques (namely, multivariate adaptive regression splines) to build a fault-proneness model. Their results show that a model built on one system can accurately classify the classes of another system by fault-proneness.

Zimmermann et al. (2009) conducted a series of large-scale studies on cross-project defect prediction models. For 12 real-world applications, they ran 622 cross-project predictions. The results show that cross-project prediction is a serious challenge: using only models from the same domain or from projects with the same process does not lead to accurate predictions.

Nam et al. (2015) observed that the performance of cross-project defect prediction is generally poor, mainly because of the difference in feature distributions between the source and target projects. They applied a state-of-the-art transfer learning method, TCA (Pan et al., 2011), to make the feature distributions of the source and target projects similar, and extended it into a new transfer defect learning method, TCA+. Experiments on eight open source projects showed that TCA+ significantly improves cross-project prediction performance.

However, all these studies assume that the target and source project data share some common features. Therefore, when the cross-project data have heterogeneous feature sets (i.e., the HCPDP scenario), these methods fail.

2.2 Heterogeneous cross-project defect prediction method

Nam et al. (2013) propose a heterogeneous defect prediction (HDP) method that matches different metrics across different projects. Experiments show that their HDP model is feasible and achieves good results. In addition, they studied the lower bounds of the source and target dataset sizes needed for effective transfer learning in defect prediction; their empirical and mathematical studies suggest that as few as 50 labeled instances are sufficient to build a defect prediction model and apply HDP.

Jing et al. (2015) propose a unified metric representation (UMR) for the source and target projects. The UMR includes three types of metrics: those common to the source and target projects, those unique to the source project, and those unique to the target project. Canonical correlation analysis (CCA) (Knapp, 1978) is then applied to this representation. Their approach is clearly superior to the latest competing CPDP methods and achieves comparable predictive performance.

In addition, Li et al. propose a series of solutions for HCPDP, for example, a novel two-stage ensemble learning (TSEL) approach to HDP (Li et al., 2019), a cost-sensitive transfer kernel canonical correlation analysis (CTKCCA) approach for HDP (Li et al., 2017a), and a multi-source selection based manifold discriminant alignment (MSMDA) approach (Li et al., 2017b); there is also an empirical study on heterogeneous defect prediction approaches (Chen et al., 2020).

In short, the prediction performance of an HCPDP model is greatly affected by how it is constructed. Most studies build multiple models with different modeling techniques and have made great progress in comparative performance. The main differences between our approach and existing HCPDP methods are as follows. First, our method adopts an autoencoder and a modified autoencoder algorithm for dimensionality reduction and instance selection, which reduces the feature space gap between the source and target projects. Second, our method uses multiple heterogeneous source projects to compensate for insufficient source datasets, and we propose a multi-source transfer-learning algorithm to address the negative transfer effects of transfer learning.

2.3 Transfer learning

With the increasing popularity of machine learning applications in recent years, supervised learning tends to require a large amount of annotated data. However, labeling data is a cumbersome task that demands considerable effort, human resources, and funding. Transfer learning (Fu-Zhen & Ping, 2015; Pan & Yang, 2010), which requires less data annotation, has therefore received attention.

A common assumption of traditional machine learning algorithms is that the training and test datasets follow the same probability distribution. Under this assumption, when a new dataset with a different distribution arrives, we need to collect and label new training samples to learn a new classifier. If we instead reuse the useful information in existing labeled datasets to train the classifier, we can save this effort. Transfer learning applies the knowledge or models learned in one domain to different but related domains. However, when the probability distributions of the source and target domains differ greatly, serious negative transfer may occur. How to mitigate negative transfer in HCPDP is a main motivation of this paper.

3 Method

3.1 Framework

As shown in Fig. 1, the framework of MHCPDP consists of five steps: data collection, data pre-processing, feature processing and instance selection, multi-source transfer learning, and defect prediction.

Fig. 1 Framework of MHCPDP

Step 1. Data collection: The first step of the framework is data collection. In this paper, we collect the datasets that come from five public datasets, i.e., AEEEM, MORPH, NASA, Relink, and SOFTLAB.

Step 2. Data pre-processing: The second step is data pre-processing. After data collection, the datasets need to be pre-processed because they may contain irregular data. The pre-processing in this paper includes over-sampling and dataset standardization: we adopt the SMOTE method for over-sampling and z-score normalization for standardization.
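As a concrete illustration, the following sketch shows one way to implement this pre-processing step with scikit-learn and imbalanced-learn; the helper name and the default SMOTE settings are our own assumptions, not prescribed by the paper.

```python
import numpy as np
from imblearn.over_sampling import SMOTE          # over-sampling of the minority (defective) class
from sklearn.preprocessing import StandardScaler  # z-score standardization


def preprocess(X, y, random_state=0):
    """Over-sample the defective class with SMOTE, then z-score every feature."""
    X_res, y_res = SMOTE(random_state=random_state).fit_resample(X, y)
    scaler = StandardScaler().fit(X_res)
    return scaler.transform(X_res), y_res, scaler


# Usage: X, y are the raw metric matrix and defect labels of one source project.
# X_std, y_bal, scaler = preprocess(X, y)
```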

Step 3. Feature representation and instance selection: The third step is feature representation and instance selection. In this step, we first adopt an autoencoder to obtain the intermediate features of the source datasets. Then, we design a modified autoencoder for instance selection to narrow the gap between the source datasets and the target dataset. The procedure of instance selection is described in Sect. 3.3.

Step 4. Multi-source transfer learning: The fourth step is multi-source transfer learning. The project that needs defect prediction (i.e., the target project) may not have enough labeled defect data to train a classification model, while other projects (i.e., source projects) may have a large amount of labeled data. Transfer learning can capture useful information from the source projects as additional training data for the target project’s prediction. In this step, we propose a multi-source transfer learning algorithm to mitigate negative transfer and further narrow the gap between the source datasets and the target dataset. The procedure of the multi-source transfer learning algorithm is presented in Sect. 3.4.

Step 5. Defect prediction: The fifth step is defect prediction. Machine learning classifiers are used to build the defect prediction model. To compare the performance of different classifiers, four machine learning classifiers are used: decision tree, Naive Bayes, logistic regression, and random forest. The evaluation metrics used in this paper are AUC and g-measure.
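The sketch below illustrates how such a classifier comparison can be set up with scikit-learn; the hyper-parameters are library defaults and `evaluate_classifiers` is a hypothetical helper, not part of MHCPDP itself.

```python
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score

CLASSIFIERS = {
    "DT": DecisionTreeClassifier(random_state=0),
    "NB": GaussianNB(),
    "LR": LogisticRegression(max_iter=1000),
    "RF": RandomForestClassifier(random_state=0),
}


def evaluate_classifiers(X_train, y_train, X_test, y_test):
    """Train each classifier on the transferred source data and report AUC on the target."""
    scores = {}
    for name, clf in CLASSIFIERS.items():
        clf.fit(X_train, y_train)
        proba = clf.predict_proba(X_test)[:, 1]   # probability of the defective class (labels 0/1 assumed)
        scores[name] = roc_auc_score(y_test, proba)
    return scores
```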

3.2 Problem definition

Let the target data matrix be \(T=[{t}_{1},{t}_{2},\dots ,{t}_{m}]\), where \({t}_{i}={[{t}_{i1},{t}_{i2},\dots ,{t}_{ip}]}^{T}\in {R}^{p}\); m and p represent the number of target project entities and features, respectively, and \({t}_{ij}\) is the jth feature of the ith target entity. Similarly, let the source project data matrix be \(S=[{s}_{1},{s}_{2},\dots ,{s}_{n}]\), where \({s}_{i}={[{s}_{i1},{s}_{i2},\dots ,{s}_{iq}]}^{T}\in {R}^{q}\); n and q represent the number of source entities and features, respectively, and \({s}_{ij}\) is the jth feature of the ith source entity. A task \(\tau\) consists of a label space \(\gamma =\{+1,-1\}\) and a Boolean function \(\mathcal{F}:x\to y\). \({v}_{S}={[{v}_{s1},{v}_{s2},\dots ,{v}_{sn}]}^{T}\), \({v}_{si}\in \{0,1\}\), where \({v}_{si}\) corresponds to the ith instance of the source project: when \({v}_{si}=1\), the instance is selected; otherwise, it is not. In addition, \({f}_{e}(.)\) is the encoding function and \({f}_{d}(.)\) is the decoding function.

3.3 Feature processing and instance selection

3.3.1 Feature processing

Inspired by the work of Hinton and Salakhutdinov (Hinton & Salakhutdinov, 2006), MHCPDP performs feature processing with an autoencoder instead of simply removing redundant and unrelated features. We use the autoencoder to mine latent features and the connections between features.

The autoencoder is an unsupervised neural network model that contains an encoding process and a decoding process. The encoding process learns implicit features of the input data, and the decoding process reconstructs the original input from the learned features. The intermediate features obtained in this way can be regarded as a robust, higher-level feature extraction of the original data.

Through training, we make the output \(x'\) close to the input \({x}^{i}\in {R}^{n}\). When we add restrictions to the autoencoder network, such as limiting the number of hidden neurons, we can discover interesting structure in the input data. For example, suppose the input dimension is n = 64 and the hidden layer \({L}_{2}\) contains 25 hidden neurons.

Because there are only 25 hidden neurons, the autoencoder is forced to learn a compressed representation of the input: it must reconstruct the 64-dimensional input from the 25-dimensional hidden activation vector \({a}^{(2)}\in {R}^{25}\). If the input data contain specific structure, for example if some input features are correlated, the algorithm can discover these correlations. After the network is trained, the activation \({a}^{(2)}\in {R}^{25}\) of layer \({L}_{2}\) for each input \({x}^{i}\) serves as the reduced-dimensional representation of that input.

In this paper, the autoencoder is used for feature extraction, and the five public datasets all have the same number of features after extraction.
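To make the idea concrete, here is a minimal Keras sketch of such an autoencoder, assuming the 64-to-25 compression from the example above; the layer sizes, activations, and optimizer are illustrative choices, not the exact configuration of MHCPDP.

```python
from tensorflow.keras import layers, Model


def build_autoencoder(n_features=64, n_hidden=25):
    """Plain autoencoder: the bottleneck activations serve as intermediate features."""
    inputs = layers.Input(shape=(n_features,))
    code = layers.Dense(n_hidden, activation="relu")(inputs)       # encoder f_e
    outputs = layers.Dense(n_features, activation="linear")(code)  # decoder f_d
    autoencoder = Model(inputs, outputs)
    encoder = Model(inputs, code)
    autoencoder.compile(optimizer="adam", loss="mse")
    return autoencoder, encoder


# Usage on a standardized dataset X with 64 metrics:
# ae, enc = build_autoencoder(X.shape[1], 25)
# ae.fit(X, X, epochs=100, batch_size=128, verbose=0)   # EPOCHS and BATCH_SIZE as in Sect. 4.4
# X_reduced = enc.predict(X)                            # 25-dimensional intermediate features
```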

3.3.2 Instance selection

After dimensionality reduction, we modify the autoencoder algorithm for instance selection. This idea comes from distant domain transfer learning (Tan et al., 2017). The goal is to narrow the gap between the source and target datasets.

The specific steps of Algorithm 1 are as follows:

Step 1: We take the source dataset, the target dataset, and the number of iterations as input data.

Step 2: Initialize the parameters; \({V}_{s}=1\) for the source domain dataset means that the corresponding instances are selected.

Step 3: Iterate N times, updating \({V}_{s}\) by adjusting the parameters of the autoencoder to minimize the loss function.

Step 4: Finally, remove the source instances with \({V}_{s}=0\) and output the remaining subset of the source datasets.

The loss function is as follows:

$$\underset{{f}_{e},{f}_{d}}{\mathrm{min}}\sum_{i=1}^{n}{\left\Vert {x}^{\prime}-{x}^{i}\right\Vert }_{2}^{2}+l\left({y}^{i},{f}_{c}\left({f}_{e}\left({x}^{i}\right)\right)\right)$$
(1)

\({y}^{i}\) is the label of \({x}^{i}\), \(l(.)\) is the loss function, and \({f}_{c}(.)\) is a classification function that outputs classification probabilities.

The original autoencoder reconstructs the output from the input. Based on this reconstruction, we identify the source instances that differ greatly from the target datasets and mark them with zero; these instances are excluded from the final output. The final output is a subset of the source datasets whose gap to the target datasets is small. After processing the source datasets in this way, we reconstruct the datasets and then apply multi-source transfer learning for modeling.
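The following is a simplified sketch of the selection idea: an autoencoder fitted to the target data scores each source instance by reconstruction error, and instances with large error (far from the target distribution) receive \({V}_{s}=0\) and are dropped. The quantile threshold and the use of a target-fitted autoencoder are our own simplifications of Algorithm 1, which jointly optimizes the loss in Eq. (1).

```python
import numpy as np


def select_instances(autoencoder, X_source, keep_ratio=0.8):
    """Keep the source instances that the (target-fitted) autoencoder reconstructs best.

    Instances with a large reconstruction error are treated as far from the target
    distribution (v_s = 0) and removed; the rest (v_s = 1) form the output subset.
    """
    recon = autoencoder.predict(X_source, verbose=0)
    errors = np.sum((recon - X_source) ** 2, axis=1)   # ||x' - x||_2^2 per instance
    threshold = np.quantile(errors, keep_ratio)
    v_s = (errors <= threshold).astype(int)            # selection vector v_s
    return X_source[v_s == 1], v_s


# Usage: fit `ae` from the previous sketch on the target data, then
# X_sub, v_s = select_instances(ae, X_source)
```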

3.4 Multi-source transfer learning

Transfer learning has been widely used since it was introduced into software defect prediction, but the effects of negative transfer have rarely been addressed. We use Algorithm 2 to mitigate this problem and further narrow the gap between the source and target datasets.

Algorithm 2 Multi-source transfer learning

Source projects that have little relation to the target project may reduce classifier performance and cause negative transfer. An effective way to mitigate negative transfer is to use multiple sources for transfer learning (Yao & Doretto, 2010). Prior research often uses single-source transfer learning, where such negative transfer is unavoidable. This paper therefore introduces multi-source transfer learning to address negative transfer in HCPDP.

Algorithm 2 is our multi-source transfer learning algorithm, inspired by Yao and Doretto’s multi-source transfer learning. It takes the filtered subsets of \({S}_{1},\dots ,{S}_{N}\), the target dataset T, and the maximum number of iterations M as input, and outputs the target classifier function \({F}_{t}:x\to y\). As in traditional machine learning, learning the task \(\tau\) for the target domain T from the given source datasets amounts to estimating a classifier function \({F}_{t}:x\to y\). By adjusting the weights over iterations, the algorithm down-weights source data that differ from the target dataset, thereby limiting negative transfer. We use this algorithm to mitigate negative transfer in HCPDP, narrow the gap between the source and target data domains, and increase the number of usable sources.
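Because the pseudocode of Algorithm 2 is not reproduced here, the sketch below shows one common multi-source strategy in the spirit of Yao and Doretto (2010): one weak learner per filtered source subset, weighted by how well it predicts a small labeled portion of the target. It is an illustrative approximation under these assumptions, not the paper’s exact Algorithm 2.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import roc_auc_score


def multi_source_ensemble(sources, X_target_lab, y_target_lab):
    """Weight one decision tree per source subset by its AUC on labeled target data.

    `sources` is a list of (X_sub, y_sub) pairs produced by instance selection.
    Sources that transfer poorly (AUC close to 0.5) get weights near zero,
    which limits negative transfer.
    """
    models, weights = [], []
    for X_sub, y_sub in sources:
        clf = DecisionTreeClassifier(random_state=0).fit(X_sub, y_sub)
        auc = roc_auc_score(y_target_lab, clf.predict_proba(X_target_lab)[:, 1])
        models.append(clf)
        weights.append(max(auc - 0.5, 0.0))   # only better-than-random sources contribute
    weights = np.array(weights)
    weights = weights / weights.sum() if weights.sum() > 0 else np.ones(len(models)) / len(models)

    def predict_proba(X):
        """Weighted vote over the per-source classifiers."""
        return sum(w * m.predict_proba(X)[:, 1] for w, m in zip(weights, models))

    return predict_proba
```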

4 Empirical setup

4.1 Benchmark datasets

We selected five public datasets as the benchmark datasets, namely AEEEM, MORPH, NASA, Relink, and SOFTLAB. These five datasets are widely used for defect prediction (D'Ambros et al., 2012; Herzig et al., 2013a; Herzig et al., 2013b; Jing et al., 2015; Nam et al., 2013; Nam et al., 2015; Tantithamthavorn et al., 2015; Xu et al., 2016; Pingclasai, xxxx). Table 1 reports the detailed information of the items in the five datasets, including the name of the items (dataset), the number of features (metrics), the number of entities (entities), and the number of defective entities (Buggy).

Table 1 Statistics of the benchmark datasets

These datasets may share some features but certainly differ from each other. NASA and SOFTLAB contain proprietary datasets from NASA and a Turkish software company, respectively (Turhan et al., 2009). D’Ambros et al. (2010) collected the AEEEM dataset, whose features include change metrics, source code metrics, entropy of source code metrics, and churn of source code metrics (Appendix). Wu et al. (2011) collected the Relink dataset; the projects they analyzed were Apache HTTP Server, Safe, and Zxing. The dataset considers 26 metrics, all of which focus on code complexity. The MORPH group contains defect datasets of several open source projects used in a study of the dataset privacy issue for defect prediction (Peters & Menzies, 2012).

4.2 Performance indicators

In this paper, we chose two commonly used metrics as indicators to measure the performance of our experiment.

4.2.1 AUC

The area under the ROC curve (AUC) (Jiang et al., 2009) is the area under the receiver operating characteristic curve (Fawcett, 2006). AUC is an evaluation index for two-class models that indicates the probability that a predicted positive example is ranked ahead of a negative example. Its calculation is independent of the threshold setting. The value ranges from 0 to 1; the higher the value, the better the model’s performance.

4.2.2 G-measure

There is a trade-off between the probability of detection (pd, i.e., recall) and the probability of false alarm (pf). The g-measure balances these two indicators. The calculation formula is as follows:

$$\mathrm{g}-\mathrm{measure}=\frac{2\times \mathrm{pd}\times (1-\mathrm{pf})}{(\mathrm{pd}+\left(1-\mathrm{pf}\right))}$$
(2)

Here, pf is the probability of false alarm, i.e., the proportion of non-defective modules that are predicted as defective, and pd is the probability of detection, i.e., the proportion of defective modules that are predicted as defective. The g-measure is the harmonic mean of pd and (1 − pf).
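For reference, both indicators can be computed from model predictions as follows; `auc_and_g_measure` is a hypothetical helper and the 0.5 threshold is an assumption.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, roc_auc_score


def auc_and_g_measure(y_true, y_score, threshold=0.5):
    """Compute AUC from scores and g-measure from thresholded predictions."""
    auc = roc_auc_score(y_true, y_score)
    y_pred = (np.asarray(y_score) >= threshold).astype(int)
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
    pd = tp / (tp + fn)                         # probability of detection (recall)
    pf = fp / (fp + tn)                         # probability of false alarm
    g = 2 * pd * (1 - pf) / (pd + (1 - pf))     # harmonic mean of pd and (1 - pf)
    return auc, g
```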

4.3 Prediction model

In this work, we train a decision tree model on the mapped source project datasets and then apply it to the mapped target project dataset. The decision tree model is widely used for software defect prediction.

4.4 Experimental design

At the beginning of the experiment, some parameters need to be set. The parameters are adjusted over multiple experiments, and a relatively optimal setting is selected as the final one. The initial values of the parameters are listed in Table 2.

Following the MHCPDP steps presented in Sect. 3.1, we carry out the experiments. To alleviate bias, we run MHCPDP 30 times and report the average indicator values. For the parameter values, we empirically set x = 100 (EPOCHS: a single training pass over all batches in forward and backward propagation) and y = 128 (BATCH_SIZE: the number of samples used in one training step) as the basic setting through extensive experiments with the autoencoder. The number of iterations is an important parameter in transfer learning; we set it to 50. We discuss the impact of different parameter settings in Sect. 6.

5 Experimental results and statistical test

5.1 Experimental results

Common classification methods include logistic regression (Hosmer et al., 2013), random forest, Bayesian networks, and decision trees. Whether these classification methods differ for software defect prediction is also a significant question. Lessmann et al. explored this problem early on: they experimented on the NASA datasets and compared the performance of 22 different classification methods, finding that the differences are not significant (Lessmann et al., 2008). Later, other scholars did corresponding research and reached the same conclusion. However, Ghotra et al. (Ghotra et al., 2015) obtained a different answer. We discuss the impact of the classifier on our experiments in Sect. 6. Here, we choose the decision tree classifier for our experiments.

Next, we address the following three research questions and compare the performance of our model in terms of AUC and g-measure values (in Sects. 5-A, 5-B, and 5-C).

  A. RQ1. How does MHCPDP compare to existing HCPDP methods?

    1) Motivation

      When we perform software defect prediction, we often encounter situations where the dataset is insufficient or the available datasets differ from the target dataset. Although many methods have studied this, they still have shortcomings. We compare our approach with two state-of-the-art heterogeneous defect prediction methods (CCA+ and HDP), hoping to achieve better results while overcoming their shortcomings.

    2) Methods

      We choose AUC and g-measure to evaluate our experiments and the comparative experiments. To simulate the HCPDP scenario, we select the source project and the target project from different dataset groups. We selected 16 datasets as target datasets. Since neither CCA+ nor HDP involves any randomness, we run each of them once and record the results; CCA+ is a representative HCPDP method. Note that both indicators used in this work range between 0 and 1. For the CCA+ method, when it is applied to some cross-project pairs and all elements of the mapped target project data are zero, we treat this as a failure. For the HDP method, when the matching scores of all feature pairs are below the threshold, no features can be selected; we also treat this as a failure, and we do not analyze failed experiments further.

      Table 3 shows the comparison of AUC values and g-measure values between our method, the HDP method, and the CCA+ method.

    3) Results

      We selected sixteen datasets from AEEEM, NASA, MORPH, Relink, and Softlab as target datasets.

      First, according to Table 3, our method is superior to HDP in terms of AUC values. Our AUC values are all above 0.6, and some projects even reach 0.875. Our average AUC is 0.712, while the HDP mean is only 0.700, slightly lower than ours. HDP does not perform well on some datasets, such as xerces-1.2, whose AUC value is only 0.497, lower than random guessing. Our method performs better: all AUC values are above 0.6, many projects are around 0.7, and the overall average is higher than HDP. Although the results are not ideal on some projects, they are generally better than HDP.

      Second, when comparing AUC values with the CCA+ method, our method achieves better performance. The CCA+ method relies heavily on features common to the source and target projects. The source datasets selected for the comparative experiment share many features with the target datasets, so its results are relatively good. However, even with many common features between the source and target projects, the AUC values of CCA+ are not very high: the average is only 0.654, the AUC values of the ML and xerces-1.2 projects are around 0.5, and the highest AUC value is 0.716. Most projects have AUC values above 0.6, and only a few reach 0.7 or more.

      Third, when comparing g-measure values, the mean of our method is higher than those of the CCA+ and HDP methods, and both the maximum and minimum values are better than those of the other two methods. Most of our g-measure values lie between 0.6 and 0.65; only a few are slightly below 0.6. Overall, our method is better than the other two.

      Lastly, our experiments are subject to fewer constraints than the other two methods, so our success rate is higher. CCA+ works better when the source and target projects have more common features; when there are few or no common features, its results are poor. The HDP method must consider the consistency of the defect proneness of the source and target projects, which is often unknown in advance. In contrast, our method overcomes these shortcomings and achieves better results.

      In summary, our method overcomes some shortcomings of the HDP and CCA+ methods while achieving performance comparable to HDP.

Table 2 Hyper-parameter settings
Table 3 Comparison of AUC values and g-measure values between MHCPDP, HDP, and CCA + methods
  B. RQ2. What is the result of MHCPDP compared with unsupervised learning?

    1) Motivation

      Due to insufficient source datasets, unsupervised learning methods have been used to solve CPDP problems. However, unsupervised defect prediction also has shortcomings. First, many unsupervised methods in software defect prediction require manual labeling, and the cost of manual labeling is too high. Second, unsupervised learning methods generally rely on cluster analysis, which has its own defects: for software defect prediction, cluster analysis assumes that the metric values of defective modules are higher than those of non-defective modules. This assumption may not hold in practice, so the effectiveness of unsupervised methods cannot be guaranteed.

      Therefore, HCPDP research is still essential. Next, we compare our method with a leading unsupervised method.

    2) Methods

      We compare against CLAML, the unsupervised learning method proposed by Nam and Kim, which automates the expert-annotation process in unsupervised defect prediction (Nam & Kim, 2015). Their method is one of the leading unsupervised approaches in the field of software defect prediction. We use AUC values and g-measure values to evaluate the performance of the CLAML model and the MHCPDP model.

    3) Results

      From Table 4, we can see that our method is better than UDP in terms of both AUC values and g-measure values.

      First, in terms of AUC, MHCPDP achieves a higher value than the UDP method on 12 target datasets, and its overall mean and minimum values are much higher than those of UDP.

      Second, in terms of g-measure, MHCPDP achieves a higher value than the UDP method on 11 target datasets, and its overall mean and minimum values are much higher than those of UDP.

      Although unsupervised learning methods do not face the heterogeneity problem, many other issues still affect the accuracy of the resulting defect prediction model. Therefore, it is not feasible to solve heterogeneous defect prediction with unsupervised learning alone. Although our method has its own drawbacks, its results are much better than the UDP method; we will continue to improve it in the future to obtain a better model.

Table 4 Comparison of AUC values and g-measure between MHCPDP, UDP, and WPDP methods
  C. RQ3. What is the result of MHCPDP compared with WPDP?
    1) Motivation

      In the WPDP scheme, some labeled entities of the target project are used to train the prediction model, and the model is used to predict the class labels of the remaining entities. Previous studies have shown that WPDP performs well when enough labeled training data is available (Lessmann et al., 2008). However, we often do not have enough labeled data in the target project, so CPDP is used instead; compared with WPDP, though, the performance of previous CPDP methods is usually not good enough. Therefore, we are interested in whether MHCPDP achieves better or comparable results than WPDP in terms of the two performance indicators.

    2) Methods

      For WPDP, we randomly select 50% of the entities in the target dataset as the training set and use the remaining entities as the test set. Since this random split may affect the prediction performance, we repeat the process 30 times and report the average indicator values.

      Then, we also select the AUC value and g-measure for performance evaluation. For the parameter settings of these classifiers, we follow the previous studies (Xu et al., 2018).
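      A minimal sketch of this WPDP baseline protocol is shown below, assuming a decision tree learner and a stratified 50/50 split; it reuses the `auc_and_g_measure` helper sketched in Sect. 4.2.2.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier


def wpdp_baseline(X, y, repeats=30):
    """WPDP baseline: 50/50 random split of the target project, repeated `repeats` times."""
    aucs, gms = [], []
    for seed in range(repeats):
        X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5,
                                                  random_state=seed, stratify=y)
        clf = DecisionTreeClassifier(random_state=seed).fit(X_tr, y_tr)
        auc, g = auc_and_g_measure(y_te, clf.predict_proba(X_te)[:, 1])  # helper from Sect. 4.2.2
        aucs.append(auc)
        gms.append(g)
    return np.mean(aucs), np.mean(gms)
```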

    3) Results

      First, our method achieves AUC values above 0.7 on nine datasets and at or above 0.6 on seven datasets. WPDP achieves AUC values above 0.7 on fourteen datasets, at or above 0.6 on one, and around 0.5 on one.

      Second, in terms of g-measure, MHCPDP achieves a higher value than the WPDP method on seven target datasets, its overall mean is close to that of WPDP, and its minimum value is much higher than that of WPDP.

      Note that WPDP achieves this performance using training data from the same project, whereas our method is evaluated under heterogeneous conditions, where the source datasets differ greatly from the target.

      In summary, compared with the WPDP scheme, our method still achieves comparable performance while performing heterogeneous defect prediction.

5.2 Statistical test

To statistically analyze the performance values in Tables 3 and 4, we perform the non-parametric Friedman test with Nemenyi’s post hoc test (Demsar, 2006) at a significance level of 0.05 over the datasets, because the Friedman test shows that the p values for both indicators are less than 0.05. The Friedman test evaluates whether there are statistically significant differences among the average ranks of different methods. Since the Friedman test is based on the performance ranks of the methods rather than the actual performance values, it makes no assumptions about the distribution of the performance values and is less susceptible to outliers (Xu et al., 2018).

The Friedman test determines whether the algorithms have the same performance; if they do, their average rank values should be equal. Suppose we compare k algorithms on N datasets, and let \({r}_{i}\) denote the average rank of the ith algorithm. To simplify the discussion, we ignore tied ranks for the time being; then \({r}_{i}\) follows a normal distribution with mean \(\frac{k+1}{2}\) and variance \(({k}^{2}-1)/12\). The test statistic is \({\tau }_{{\chi }^{2}}=\frac{k-1}{k}\cdot \frac{12N}{{k}^{2}-1}\sum_{i=1}^{k}{\left({r}_{i}-\frac{k+1}{2}\right)}^{2}=\frac{12N}{k\left(k+1\right)}\left(\sum_{i=1}^{k}{{r}_{i}}^{2}-\frac{{k(k+1)}^{2}}{4}\right)\); when both k and N are large, it follows a \({\chi }^{2}\) distribution with \(k-1\) degrees of freedom.

If the hypothesis that “all algorithms have the same performance” is rejected, the performances of the algorithms differ significantly, and a post hoc test is needed to further distinguish them. A commonly used procedure is Nemenyi’s post hoc test, which computes the critical difference of average ranks as \(\mathrm{CD}={q}_{\alpha }\sqrt{\frac{k(k+1)}{6N}}\).

Table 5 shows the commonly used \({q}_{\alpha }\) values when α = 0.05 and 0.1.

Table 5 The commonly used \({q}_{\alpha }\) values when α = 0.05 and 0.1

If the difference between the average rank values of two algorithms exceeds the critical difference CD, the hypothesis that “the two algorithms perform the same” is rejected at the corresponding confidence level.
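As an illustration, the Friedman statistic, average ranks, and Nemenyi critical difference can be computed as in the following sketch; the performance matrix is a placeholder for the values in Tables 3 and 4, and the \(q_{0.05}\) value for five methods is taken from the standard table (cf. Table 5).

```python
import numpy as np
from scipy.stats import friedmanchisquare, rankdata

# perf[i, j]: performance (e.g., AUC) of method j on dataset i; placeholder shape 16 x 5
perf = np.random.rand(16, 5)                        # replace with the values from Tables 3/4
N, k = perf.shape

stat, p_value = friedmanchisquare(*perf.T)          # Friedman test over the k methods
avg_ranks = rankdata(-perf, axis=1).mean(axis=0)    # rank 1 = best method on each dataset

q_alpha = 2.728                                     # Nemenyi q for k = 5, alpha = 0.05 (cf. Table 5)
cd = q_alpha * np.sqrt(k * (k + 1) / (6 * N))       # critical difference

print(p_value, avg_ranks, cd)
# Two methods differ significantly if |avg_rank_a - avg_rank_b| > cd.
```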

As shown in Fig. 2, the horizontal axis is the average rank and the vertical axis lists the algorithms. Each algorithm’s average rank is plotted as a point, and the horizontal segment centered on that point represents the critical difference range. If the segments of two algorithms overlap, there is no significant difference between them. Following the Friedman test with Nemenyi’s post hoc test, we ranked each method on every dataset and calculated the average rank values.

Fig. 2 Comparison of MHCPDP against HDP, CCA+, UDP, and WPDP with the Friedman test and Nemenyi’s post hoc test on the AUC values and g-measure values

Figure 2 shows the comparison of MHCPDP against HDP, CCA+, UDP, and WPDP with the Friedman test and Nemenyi’s post hoc test on the AUC values and g-measure values. First, when HDP, CCA+, and MHCPDP are compared, HDP shows no significant difference from MHCPDP in terms of AUC values, while MHCPDP differs significantly from CCA+; in terms of g-measure values, MHCPDP, HDP, and CCA+ show no significant differences. Second, when WPDP, UDP, and MHCPDP are compared, MHCPDP shows no significant difference from WPDP in terms of either AUC or g-measure values, while it differs significantly from UDP.

We now know whether our method differs significantly from these baseline methods; how large the differences are is determined by an effect size test.

The effect size measures the magnitude of a treatment effect, i.e., the size of the difference between group means under different treatments. Unlike significance tests, effect size indicators are not affected by the sample size and can be compared across studies; mean differences and the explained-variance ratios of ANOVA and regression analysis are described by effect sizes. When the sample size is large, reporting the effect size is necessary. We choose Cohen’s d: the greater the Cohen’s d value, the greater the difference. The calculation formula is as follows:

Cohen’s d = \(\frac{{M}_{1}-{M}_{2}}{\sqrt{\frac{{\mathrm{SD}}_{1}^{2}+{\mathrm{SD}}_{2}^{2}}{2}}}\)

where M1 and M2 are the mean values of the two sets of comparative experiments, and SD1 and SD2 are their standard deviations. The results are shown in Table 6 below:

Table 6 The values of Cohen’s d
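A direct implementation of this formula is sketched below; the function name is ours.

```python
import numpy as np


def cohens_d(a, b):
    """Cohen's d with the pooled standard deviation sqrt((SD1^2 + SD2^2) / 2)."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    pooled_sd = np.sqrt((a.std(ddof=1) ** 2 + b.std(ddof=1) ** 2) / 2)
    return (a.mean() - b.mean()) / pooled_sd


# Usage: cohens_d(auc_mhcpdp, auc_hdp) on the per-dataset AUC values of two methods.
```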

From this point of view, both MHCPDP and HDP are effective methods for defect prediction on heterogeneous datasets. To sum up, our method MHCPDP achieves competitive results compared with CCA+ and HDP on most datasets in terms of the two indicators.

6 Discussion

6.1 How much time do the MHCPDP and CCA+ methods take?

In addition to comparing the effectiveness of our method, we also evaluate its efficiency. Because both our method and the CCA+ method involve multiple projects as source datasets, we select the CCA+ method as the reference. The time we compare is the instance-selection time of MHCPDP and the time CCA+ takes to obtain the mapped project data.

The time-consuming part of our method is the instance-selection process of the autoencoder. We measured this specifically, and the time spent remains at a relatively low level. The time spent by CCA+ is for obtaining the mapped project data, so we selected several projects to measure it. As shown in Fig. 3, CCA+ takes about a few minutes; when the source project is a large-scale dataset, the time may be even longer, which is inconvenient in practice, whereas our method adapts well to large-scale datasets.

Fig. 3 The time (s) for MHCPDP and CCA+

6.2 How do different classifiers affect the performance of MHCPDP?

In MHCPDP, we choose the decision tree (DT) classifier as the basic prediction model. To explore the impact of different classifiers on the performance of MHCPDP, we choose three other classifiers, Naive Bayes (NB), logistic regression (LR), and random forest (RF), and compare the AUC and g-measure values of these classifiers. Figures 4 and 5 depict the results of our experiments.

Fig. 4 The impact of different classifiers on AUC values

Fig. 5 The impact of different classifiers on g-measure values

We can see that all classifiers achieve good results in terms of AUC, but the decision tree performs better overall. For the g-measure, random forest and decision tree perform better, while Naive Bayes and logistic regression perform relatively poorly.

Figure 6 shows the performance comparison of DT against RF, LR, and NB with the Friedman test and Nemenyi’s post hoc test on the AUC and g-measure values. Comparing the g-measure and AUC values of DT, RF, LR, and NB in Fig. 6, DT has little overlap with the other three classifiers and its average rank is lower, which indicates that DT differs significantly from, and outperforms, the other three classifiers.

Fig. 6 Comparison of DT against RF, LR, and NB with the Friedman test and Nemenyi’s post hoc test on the AUC values and g-measure values

6.3 How do the parameters affect the performance of MHCPDP?

From Figs. 7 and 8, we can see that the parameters have a certain influence on our experiments. How to adjust the parameters so that our method performs better is a topic for future research.

Fig. 7 The effect of x on g-measure values and AUC values

Fig. 8 The effect of y on g-measure values and AUC values

6.3.1 How does the parameter x affect the performance of MHCPDP?

The parameter x (EPOCHS) is defined as a single training pass over all batches in forward and backward propagation. We select x = 40 in this experiment. Here, we evaluate the influence of different x values on the results by selecting x = 50 for comparison. The blue bars in the figure represent the x value selected in our experiment.

6.3.2 How does the parameter y affect the performance of MHCPDP?

The parameter y (BATCH_SIZE) is the number of samples used in one training step. We select y = 128 in our experiment. Here, we evaluate the influence of different y values on the results by selecting y = 150 for comparison. The blue bars in the figure represent the y value selected in our experiment.

Overall, the adjustment of the parameters has a certain impact on our method. Choosing optimal parameters will be a focus of our future work.

7 Threats to validity

In this section, we discuss some of the threats to the validity of our work.

7.1 Threats to construct validity

The threat to construct validity lies in the selection of performance indicators. In this paper, we selected two commonly used indicators, g-measure and AUC. These indicators allow a sound comparison with existing strong methods and help us evaluate the performance of our method.

7.2 Threats to internal validity

The internal threat is mainly related to the implementation of the baseline methods and their parameter settings. The baseline methods we re-implemented may differ somewhat from the originals because the authors’ parameter settings and some implementation details are unknown.

However, we reduce this threat by adjusting the parameter settings to choose the optimal combination of parameters.

7.3 Threats to external validity

External threats mainly concern the applicability of our approach to real-world settings. We reduce this threat by selecting five publicly available datasets.

8 Conclusion

Heterogeneous cross-project software defect prediction is challenging because of the low correlation between the features of different datasets. Facing this challenge, we proposed our own method, MHCPDP. First, after analyzing many previous studies, we found that many methods simply delete redundant features and do not mine the deeper meaning behind the data when filtering features; we choose an autoencoder to mine the data more deeply.

Second, we modified the autoencoder algorithm for instance selection, which narrows the gap between the source and target projects. Third, we introduced a multi-source transfer learning algorithm into heterogeneous software defect prediction, which not only mitigates the impact of negative transfer but also enables multi-source learning; multiple source project datasets alleviate the problem of insufficient training data. This idea also comes from transfer learning, and in the future we will continue to explore more ideas from transfer learning and apply them to software defect prediction in order to develop better methods. To confirm the effectiveness of our approach, we conducted a number of experiments and compared MHCPDP with some of the top methods in the UDP, WPDP, and HDP directions. The experimental results show that our method achieves better results.

Although our approach achieves good results, some issues remain unresolved. For example, class imbalance is an important topic in software defect prediction, and we simply adopted the SMOTE method. In addition, we have not considered the noise problem in the datasets. We will continue to study these issues in the future.