1 Introduction

With the development of modern software engineering, more and more resources are allocated to the software testing phase to keep software projects bug-free, and thus software testing has become one of the most important phases in the whole software lifecycle. As a result, how to guarantee software reliability through software testing attracts significant attention. Some researchers (He et al. 2015; Ni et al. 2019; Balogun et al. 2021; Bal and Kumar 2020) adopted Software Defect Prediction (SDP) methods to locate defects in software projects. A program that contains at least one bug (defect), and/or bugs that seriously interfere with its functionality, is said to be buggy (defective). Otherwise, the program is said to be clean (non-defective). Most existing SDP methods utilized the historical data of a software project to train their prediction models, and then employed them to find future defects in the same project. However, in practice, it is hard for researchers or developers to build a satisfactory SDP model in the early stage of a given project because labeled data are hardly available. Therefore, Cross-Project Defect Prediction (CPDP) (Briand et al. 2002; Herbold et al. 2018; Hosseini et al. 2019; Khatri and Singh 2021) was proposed, making it feasible to build an SDP model for a new project. CPDP allows us to train a defect prediction model on mature projects with sufficient labeled data (source projects), and then apply it to new projects that lack labeled data (target projects). A typical CPDP method can be roughly split into two modules: how to extract predictive features, and how to effectively apply the knowledge learned from source projects to target projects.

How to construct predictive features remains a challenge in CPDP research. As CPDP is a binary classification problem, feature extraction is one of the most important components in CPDP models. In early studies of software defect prediction, most methods (Watanabe et al. 2008; Turhan et al. 2009; Shepperd et al. 2014; Jing et al. 2014; Zhu et al. 2021; Li et al. 2019) depended on handcrafted features that represent the static characteristics of programs, including McCabe cyclomatic complexity (McCabe 1976), the Halstead metrics (Halstead 1977) and the CK (Chidamber-Kemerer) metrics (Chidamber and Kemerer 1994). However, these handcrafted metrics were often designed based on researchers' statistical analysis or experience of the software, which might not fully mine the contextual and semantic features of project source code (Wang et al. 2016). If CPDP models are constructed based only on these handcrafted features, they may lose some prediction performance. Because of its prominence in feature representation, deep learning is strongly recommended by researchers (Wang et al. 2016; Li et al. 2017) to capture programs' hidden contextual and semantic features and feed them into prediction models for training. Several prevalent deep learning approaches have been adopted in CPDP, including the Deep Belief Network (DBN) (Tong et al. 2019; Wang et al. 2020), the Long Short-Term Memory (LSTM) network (Li et al. 2019; Deng et al. 2020; Liang et al. 2019) and the Convolutional Neural Network (CNN) (Li et al. 2017; Qiu et al. 2019a, d). To some extent, the source code of a software project is a kind of standardized and formal language that is closely related to the NLP context (Huang et al. 2021), and we believe that the LSTM network is more suitable for extracting hidden semantic features in CPDP tasks. LSTM is a variant of the recurrent neural network, which processes sequential data by modeling units in sequence and "memorizes" long-term dependencies through gated computations (Hochreiter and Schmidhuber 1997). In this paper, we extend the LSTM network with the Attention Mechanism (AM) (Vaswani et al. 2017) as the feature generator to fully capture the hidden semantic features of programs. AM enables a model to focus on the most important parts of the input when making decisions by assigning different weights to different parts of the input. Since it can bring significant improvement, AM has recently become almost a standard component in many sequential tasks (Veličković et al. 2018; Zeng et al. 2021). In the field of CPDP, project source code is also sequential data, and we believe CPDP would benefit from the LSTM network and AM, too.

Fig. 1

(a) Knowledge learned from source project is directly transferred to target project. (b) Traditional transfer learning methods may fail to consider the relationship between the decision boundary and the target samples, which results in performance loss. (c) We train two classifiers simultaneously, which try to categorize source instances correctly and distinguish ambiguous target instances at the same time. In these figures, dashed lines represent the boundary between the source and target sample space while the solid lines represent the decision boundary learned by CPDP models

Another challenging problem in CPDP tasks is how to effectively transfer knowledge learned from one project and apply it to other projects. Most successful CPDP methods (Yu et al. 2017; Ryu et al. 2017; Hosseini et al. 2018; Wu et al. 2018; Liu et al. 2019; Gong et al. 2020) adopted transfer learning approaches to bridge the gap between the feature distributions of two different projects. Although inspiring and straightforward, these CPDP models might overlook the relationship between the target project instances and the decision boundary. Chen et al. (2015) put forward a CPDP method called Double Transfer Boosting, which re-calculated instance weights and reduced the divergence of two distributions based on TrAdaBoost (Dai et al. 2007). Yu et al. (2017) proposed a feature matching and transfer method to perform CPDP. They designed an algorithm to match and transfer features in the source and target projects with respect to their distance. Xu et al. (2019) came up with a transfer learning method, named Balanced Domain Adaptation, to assign different weights to the marginal and conditional distributions of two projects. By balancing these two kinds of distributions, their approach was able to effectively overcome the divergence of two projects. However, these methods might not be capable of mining predictive features since they did not take the relationship between the target data and the decision boundary into account when matching the two distributions. From a mathematical point of view, this relationship can be defined as the distance between the boundary and the target data. As shown in Fig. 1, even if the two distributions are quite similar, the feature generator could generate ambiguous instances near the decision boundary if we do not take this distance into consideration. We hypothesize that these ambiguous target samples could be misclassified and therefore result in poor prediction performance.

In our work, we propose a novel Adversarial Domain Adaptation (ADA) method for CPDP to handle the drawbacks mentioned above. The proposed approach, as shown in Fig. 1, adopts a domain adaptation method based on adversarial learning to rectify ambiguously classified data, which not only alleviates the gap between the feature distributions of different projects, but also takes the relationship between the target project instances and the learned decision boundary into account to further enhance performance. The general framework of the ADA method is presented in Fig. 2. Specifically, we first compile the project source files to generate the corresponding abstract syntax trees (ASTs), and then convert the ASTs into token vectors by traversing them. Using certain mapping rules, we map the token vectors to integer vectors and feed them into the following feature extraction network. We adopt a bi-directional LSTM network with a self-attention layer to capture the programs' hidden semantic features. To avoid overfitting and obtain more accurate features, we amalgamate the generated features with the handcrafted features to construct joint features. Inspired by Generative Adversarial Networks (GAN) (Goodfellow et al. 2014) and an adversarial domain adaptation method proposed by Saito et al. (2018), we utilize a domain adaptation method trained in the manner of adversarial learning to improve the discriminability of the joint features. In ADA, we simultaneously train two classifiers that take the joint features as input, trying to correctly classify source samples while identifying the target instances that are not close to the support of the source. We presume these target instances are non-discriminative and ambiguous since they are not categorized explicitly as negative or positive. Therefore, we regard these two classifiers as a discriminator that judges whether the generator (the LSTM network with an attention layer) creates discriminative features for the target instances. In this way, the disagreement of the two classifiers can be used to estimate the distance between the decision boundary and the target instances. By repeating this procedure, ADA reduces the divergence between the feature distributions of the two projects, and informs the classifiers about the relationship mentioned above. Knowing this relationship, the classifiers can correctly categorize those confusing target instances and thereby improve defect prediction performance.

Fig. 2

(a) In the phase of data preprocessing, we first parse the project source files into abstract syntax trees (ASTs), and then traverse the ASTs to obtain token vectors. Lastly, we convert the token vectors into integer vectors by pre-defined mapping rules. (b) In the phase of model construction, we first feed the integer vectors into the generator (LSTM network with AM) to extract preliminary semantic features. Then, we combine the semantic and handcrafted features to form joint features. Next, we utilize the joint features to simultaneously train two classifiers based on adversarial learning methods

To validate the effectiveness of the ADA method, we conduct multiple experiments on two public benchmark datasets, AEEEM (D’Ambros et al. 2010) and PROMISE (Jureczko and Madeyski 2010). In many experimental settings, the proposed model is superior to other state-of-the-art methods by a significant margin. The contributions of this work are three-fold.

  • We propose a novel CPDP model which integrates transfer learning and adversarial learning methods to not only bridge the gap between the feature distributions of different projects, but also fully consider the relationship between the target data and the class boundary.

  • We extend the LSTM network with the Attention Mechanism to learn long-term dependencies in a software project context and extract more meaningful semantic features of programs. Together with AM, the LSTM network significantly improves the performance of defect prediction.

  • We conduct extensive experiments on two benchmark datasets and the experimental and statistical results verify the effectiveness of the proposed method.

The rest of this paper consists of six parts. Firstly, we briefly review the related work in Section 2. Then, we thoroughly illustrate the ADA model in Section 3. In Section 4, we describe the details of the experimental setups. Next, we present the experimental and statistical results in Section 5. More discussions on the proposed ADA method are given in Section 6. Lastly, we conclude our work and discuss future work in Section 7.

2 Related Work

In this section, we briefly introduce related work on cross-project defect prediction, adversarial domain adaptation and attention mechanism.

2.1 Cross-Project Defect Prediction

Software Defect Prediction (SDP) (Li et al. 2018; Zou et al. 2018; Thota et al. 2020; Rathore and Kumar 2021) aims to find out defects in software modules before release so that developers can allocate limited resources optimally. According to the project data used in the training and evaluating phase, there are two branches in SDP, namely Within-Project Defect Prediction (WPDP) and Cross-Project Defect Prediction (CPDP). In WPDP, the data are from the same project in the training and evaluating phase. By contrast, CPDP uses different project data when training and evaluating. In our work, we concentrate on the problem of CPDP.

To evaluate the feasibility of CPDP, Zimmermann et al. (2009) carried out a total of 622 cross-project experiments on 12 software projects using logistic regression classifiers. Their experimental results showed that only 3.4% of them were successful. They came to the conclusion that CPDP was a challenging problem because it was not transitive. Researchers hoped to improve software defect prediction performance by applying deep learning (LeCun et al. 2015) methods to mine the semantic features of programs due to their prominent feature learning capability. Wang et al. (2016) utilized a DBN model to automatically extract semantic features from the ASTs of programs, and bridged the gap between semantic features and defect prediction features of different projects. Tong et al. (2019) adopted a transfer naive Bayes approach to consider both the class-imbalance and feature importance problems, and used it to make two feature distributions as similar as possible. Li et al. (2017) applied a CNN to generate effective features of programs. Combined with traditional handcrafted features, the prediction model they trained achieved better performance than other baseline methods. Different from the deep learning methods adopted in the above CPDP studies, Pandey and Tripathi (2021) employed an LSTM network to mine programs' semantic features. In the context of CPDP, project source code is a kind of formal language containing rich structural and semantic information, which is closely related to the context of NLP problems (Li et al. 2019). Therefore, we believe that the Recurrent Neural Network (RNN) is more suitable since it excels at processing sequential input, and we leverage a variant of RNN, the LSTM network, to mine the structural and semantic information contained in programs.

To further verify the feasibility of CPDP, He et al. (2012) carried out three experiments on 34 projects by manually selecting training data. They concluded that CPDP performance could be better than that of WPDP tasks in some cases. After further analysis, they concluded that defect prediction results were related to the distributional characteristics, which could be valuable for training data selection. Their research suggested that CPDP tasks were feasible if the data distributions of two different projects can be made similar. Taking their suggestions, Herbold (2013) put forward a distance-based algorithm for training data selection according to the corresponding distributional characteristics. Concretely, they came up with characteristic vectors, consisting of the mean and standard deviation, to represent each dataset, and adopted the characteristic vectors to stand for the marginal distribution of each dataset. Herbold conducted experiments on 44 public software projects and witnessed a 9% improvement in performance over traditional CPDP methods. Apart from training data selection methods, transfer learning techniques (Nam et al. 2013; Qiu et al. 2019d; Liu et al. 2019; Jin 2021; Huang et al. 2021) were also adopted to make the two distributions similar. Liu et al. (2019) came up with a CPDP method named Two-Phase Transfer Learning. Firstly, they chose the two most similar source projects by means of a source project estimator. Secondly, they leveraged TCA+ (Nam et al. 2013) to construct two defect predictors based on the two selected projects individually, and then combined their prediction probabilities to enhance performance. Qiu et al. (2019d) proposed a CPDP model called Transfer CNN (TCNN). They adopted a CNN to extract the semantic features of project data and added a matching layer to align the feature distributions of two different projects. When matching, they embedded the source and target data representations into a reproducing kernel Hilbert space and utilized a classic transfer learning method, Maximum Mean Discrepancy (MMD) (Borgwardt et al. 2006), to bridge the gap between the two distributions. Next, they combined the generated semantic features with handcrafted features and trained the predictor on them. Jin (2021) learned a domain adaptation model with a method called the kernel twin support vector machine, trying to match the feature distributions of the source and target projects as much as possible. He trained a feature generator that aimed to match the distributions of the two projects and assumed that the matched target features would be correctly categorized by the defect predictor because they were aligned with the source instances.

Compared to CPDP, heterogeneous defect prediction (HDP) (Nam and Kim 2015; Chen et al. 2021) relaxes the limitation on the defect data used for prediction, allowing different metric sets in the source and target projects. Many researchers have since produced fruitful work on HDP. Jing et al. (2015) proposed a unified metric representation for the defective data, and used it for further canonical correlation analysis (CCA) to reduce the gap between two domains. Based on CCA, Li et al. (2018) also proposed an HDP model called cost-sensitive transfer kernel canonical correlation analysis, which can not only make the data distributions of the source and target projects much more similar in the nonlinear feature space, but also utilize different misclassification costs for the defective and non-defective classes to alleviate the class imbalance problem. Li et al. (2019) employed multiple sources to improve the performance of the HDP model, putting forward a multi-source selection based manifold discriminant alignment approach. Their experimental results verified the performance gain. Bal and Kumar (2023) enhanced the data pre-processing for HDP by utilizing the chi-square test to select the relevant metrics between source and target datasets. Finally, they performed experiments using their proposed approach with various machine learning algorithms to verify the effectiveness of their model. Note that we only concentrate on "homogeneous" (non-heterogeneous) defect prediction.

Most of the above CPDP studies adopted different approaches (training data selection (Herbold 2013) or transfer learning (Qiu et al. 2019d; Liu et al. 2019; Jin 2021)) to make the feature distributions of two different projects as similar as possible. However, they might fail to take full account of the relationship between the decision boundary and the target data when matching the distributions. They used various transfer learning approaches to match the two distributions, hoping that the target samples would then be correctly classified by the defect predictor. As we mentioned before, even though the two distributions are quite similar, the classifier can still be confused by target samples with ambiguous features. In this paper, we put forward a domain adaptation method based on adversarial learning to fully consider the relationship between the target instances and the decision boundary, which can eliminate this confusion for the predictor.

2.2 Adversarial Domain Adaptation

Domain adaptation is a prevalent type of transfer learning in which the target task remains the same as the source task whereas the domain differs (Pan and Yang 2010). Since the generative adversarial network (GAN) was proposed by Goodfellow et al. (2014), many scholars have applied it to domain adaptation to solve specific tasks, including image classification (Tzeng et al. 2017; Saito et al. 2018; Ma et al. 2019), object detection (Song et al. 2020; Su et al. 2020), machine translation (Wang et al. 2021) and image semantic segmentation (Li et al. 2019; Yi et al. 2021). Tzeng et al. (2017) proposed an adversarial discriminative domain adaptation method to classify images. They first learned a discriminative representation based on source data. Then, they learned a separate encoding based on transfer learning to map the target data to the same feature space as the source. Lastly, they trained the whole model by minimizing a domain-adversarial loss function. Song et al. (2020) proposed a method for the salient object detection problem based on adversarial domain adaptation. To evaluate the effectiveness of their model, they collected a new dataset and compared their method with others on it. In the field of machine translation, Wang et al. (2021) came up with a counterfactual domain adaptation method to improve target domain translation. Adopting adversarial learning methods, they used the concatenations of texts in the source domain and tags in the target domain to construct counterfactual representations. Motivated by adversarial learning, Li et al. (2019) put forward a bidirectional learning model to solve the problem of image semantic segmentation. They separated their model into two submodules, an image-to-image translation model and a segmentation adaptation model, which were trained to promote each other alternately and gradually reduce the domain gap. Saito et al. (2018) put forward an adversarial domain adaptation method, attempting to match the distributions of source and target by using task-specific decision boundaries. They proved that this method outperformed other methods in the tasks of image classification and semantic segmentation.

Inspired by this fruitful work, we believe that CPDP models will benefit from adversarial learning since CPDP is also an application of domain adaptation. In particular, we assume that we can take advantage of the training pattern adopted by Saito et al. (2018), using multiple task-specific classifiers as a discriminator and a feature generator that tries to "fool" them in order to generate more discriminative features. Different from previous CPDP methods that applied traditional domain adaptation (Jin 2021; Xu et al. 2018; Qiu et al. 2019b; Zou et al. 2021), which tried to align two distributions by optimizing appropriate distance metrics, in this work we develop a domain adaptation method based on adversarial learning to reduce the divergence between the feature distributions of different projects.

3 Proposed Method

In this section, we elaborate on the ADA method in detail. Firstly, we give the formal formulation of the CPDP problem. Then, we present the overall framework of the proposed approach. A discussion of the data preprocessing of the software projects is given in Section 3.2, including program parsing and imbalanced learning. In Section 3.3, we elaborate on how we construct the CPDP model.

3.1 Problem Definition

Let the given source project with labelled data be \( D_\textrm{S} = \{(\varvec{x}_{\textrm{S}_i}, y_{\textrm{S}_i})\}_{i=1}^n \), where \( \varvec{x}_{\textrm{S}_i} = \left\{ x_{\textrm{S}_{i1}}, \ldots , x_{\textrm{S}_{id}}\right\} \in \mathbb {R}^{d}\) denotes the i-th source instance, and \( y_{\textrm{S}_i} \in \{0, 1\}\) is the corresponding defect information (0 for non-defective and 1 for defective). Let the given target project without labelled data be \( D_\textrm{T} = \{\varvec{x}_{\textrm{T}_i}\}_{i=1}^m \), where \( \varvec{x}_{\textrm{T}_i} = \left\{ x_{\textrm{T}_{i1}}, \ldots , x_{\textrm{T}_{id}}\right\} \in \mathbb {R}^{d}\) denotes the i-th target instance. We assume that both source and target samples share the same feature space as they come from the same set of metrics (i.e. \( \varvec{x}_\textrm{S}, \varvec{x}_\textrm{T} \in \mathbb {R}^d\) where d denotes the dimension of the feature). Let n and m be the numbers of instances in the source and target projects, respectively. Let \( P_\textrm{S}(\textbf{X}_\textrm{S}) \) and \( P_\textrm{T}(\textbf{X}_\textrm{T}) \) be the marginal probability distributions of \( \textbf{X}_\textrm{S}=\{\varvec{x}_{\textrm{S}_i}\}_{i=1}^n \) and \( \textbf{X}_\textrm{T}=\{\varvec{x}_{\textrm{T}_i}\}_{i=1}^m \) from the source and target projects, respectively. Generally, the distributions of two distinct projects are also different, which implies \( P_\textrm{S}(\textbf{X}_\textrm{S}) \ne P_\textrm{T}(\textbf{X}_\textrm{T}) \). Cross-project defect prediction aims to enhance the performance of the target predictor \( f_{T}(\cdot ) \) on the target project \( D_\textrm{T} \) by utilizing the knowledge learned from the source project \( D_\textrm{S} \).

In this paper, we learn the optimal parameters \( \theta ^{*} \) of our method by solving the following minimization problem,

$$\begin{aligned} \begin{aligned} \theta ^{*}&=\underset{\theta \in \varTheta }{\arg \min }\ \sum _{(\varvec{x}, y) \in D_\textrm{S}} \frac{P_\textrm{T}\left( \textbf{X}_\textrm{T}\right) }{P_\textrm{S}\left( \textbf{X}_\textrm{S}\right) } P\left( \textbf{X}_\textrm{S}\right) \ell (\varvec{x}, y, \theta ) \\&\approx \underset{\theta \in \varTheta }{\arg \min }\ \sum _{i=1}^{n} \frac{P_\textrm{T}(\varvec{x}_{\textrm{T}_{i}})}{P_\textrm{S}\left( \varvec{x}_{\textrm{S}_{i}}, y_{\textrm{S}_{i}}\right) } \ell \left( \varvec{x}_{\textrm{S}_{i}}, y_{\textrm{S}_{i}}, \theta \right) , \end{aligned} \end{aligned}$$
(1)

where \( \varTheta \) is the parameter space and \( \ell (\varvec{x}, y, \theta ) \) denotes the error function relying on \( \theta \). Therefore, by assigning distinct weights to each sample \( (\varvec{x}_{\textrm{S}_{i}}, y_{\textrm{S}_{i}}) \) with corresponding value \( \frac{P_\textrm{T}(\varvec{x}_{\textrm{T}_{i}})}{P_\textrm{S}\left( \varvec{x}_{\textrm{S}_{i}}, y_{\textrm{S}_{i}}\right) } \), we are able to build an accurate predictor for the target project.

3.2 Data Preprocessing

Figure 2 presents the general framework of the proposed ADA approach. We divide the whole framework into two steps: data preprocessing and model construction. In this subsection, we discuss the details of data preprocessing in ADA method.

3.2.1 Generating Input Vectors

An abstract syntax tree (AST) is a representation of the syntactic structure parsed from source code; it contains rich semantic information about the program and is considered highly useful in the field of program analysis (Alon et al. 2019; Compton et al. 2020). Early studies (Wang et al. 2016, 2020; Chen et al. 2016; Balog et al. 2017) have proven that ASTs can be mined and utilized in software defect prediction, so we choose the AST as a high-level representation of the project source code. Specifically, we use an open-source compiling tool, Javalang, to parse Java source files and generate the corresponding ASTs. There are many types of nodes in ASTs, but only a part of them are highly related to code defects. Following previous work (Wang et al. 2020; Deng et al. 2020; Huang et al. 2021), we evaluate the AST nodes and select four kinds of nodes: 1) method invocation nodes, 2) declaration nodes, 3) control flow nodes and 4) other necessary nodes. Then, we employ depth-first traversal to generate sequence vectors from the ASTs. Since the sequence vectors are lists of string tokens, which cannot be directly used as input to an LSTM network, we construct a mapping dictionary between nodes and integers. After this step, we have converted the project source files into integer vectors.
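As an illustration, the parsing and encoding step can be sketched in Python with the javalang package. This is only a minimal sketch: the selected node classes and the helper names below are simplified placeholders, not the exact node set or code used in our implementation.

```python
import javalang

# A simplified placeholder set of AST node types: method invocations,
# declarations and control-flow nodes.
SELECTED_NODES = (
    javalang.tree.MethodInvocation,
    javalang.tree.ClassDeclaration,
    javalang.tree.MethodDeclaration,
    javalang.tree.IfStatement,
    javalang.tree.ForStatement,
    javalang.tree.WhileStatement,
)

def file_to_tokens(java_source: str):
    """Parse a Java file and return a token sequence via depth-first traversal."""
    tree = javalang.parse.parse(java_source)
    tokens = []
    for _, node in tree:  # javalang yields nodes in depth-first order
        if isinstance(node, SELECTED_NODES):
            # use the declared/invoked name if present, otherwise the node type
            name = getattr(node, "name", None) or getattr(node, "member", None)
            tokens.append(name if name else type(node).__name__)
    return tokens

def tokens_to_integers(token_lists):
    """Build a token-to-integer mapping dictionary and encode every file."""
    vocab, encoded = {}, []
    for tokens in token_lists:
        vec = []
        for tok in tokens:
            if tok not in vocab:
                vocab[tok] = len(vocab) + 1  # reserve 0 for padding
            vec.append(vocab[tok])
        encoded.append(vec)
    return encoded, vocab
```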

3.2.2 Imbalanced Learning

The imbalanced learning problem is one of the challenging problems in machine learning, in which the number of samples of one class greatly exceeds that of another (He and Garcia 2009). This problem also arises in the field of SDP (Jing et al. 2017; Tong et al. 2019; Bal and Kumar 2020); we discuss the details of the datasets chosen in this paper in Section 4.1. Qiu et al. (2019c) and Song et al. (2019) carried out large-scale experiments to explore the characteristics of the imbalanced learning problem and systematically evaluated multiple imbalanced learning methods. Their research proved that class imbalance is omnipresent and can significantly affect the performance of prediction models. Nevertheless, we can alleviate this problem by adopting appropriate imbalanced learning approaches.

There are two types of methods for coping with class imbalance: over-sampling methods and under-sampling methods. The former over-sample the minority class and the latter under-sample the majority class, both of which balance the numbers of instances in the two classes. In our approach, we apply the Synthetic Minority Over-Sampling Technique (SMOTE) (Chawla et al. 2002), which synthesizes new minority-class instances and is commonly combined with under-sampling of the majority class. SMOTE handles the skewed class distribution by introducing a bias towards the minority class, thereby achieving better performance than minority over-sampling with replacement.
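As a concrete sketch, the re-balancing step could be performed with the SMOTE implementation from the imbalanced-learn package; the library choice and names here are illustrative assumptions rather than our exact implementation.

```python
from collections import Counter
from imblearn.over_sampling import SMOTE

def balance_source(X_source, y_source, random_state=42):
    """Re-balance labelled source-project data before training."""
    print("before:", Counter(y_source))      # e.g. {0: 800, 1: 120}
    sampler = SMOTE(random_state=random_state)
    X_bal, y_bal = sampler.fit_resample(X_source, y_source)
    print("after: ", Counter(y_bal))         # e.g. {0: 800, 1: 800}
    return X_bal, y_bal
```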

3.3 Model Construction

In this section, we illustrate in detail how we build the ADA model based on adversarial learning methods. Figure 1 provides a brief introduction to our approach. We first leverage a bi-directional LSTM network with AM as a feature generator, which takes the integer vectors as input. Then, we concatenate the generated semantic features and the handcrafted features to construct the joint features. Next, we simultaneously train two classifiers as a discriminator by feeding the joint features into them. The two classifiers attempt to categorize the source samples properly. In the meanwhile, they are also trained to find out which target instances are distant from the support of the source. We reckon that these target instances may confuse the classifiers, as most of them are likely to be misclassified, and we consider these target samples non-discriminative. Based on the adversarial training process (Chen et al. 2020), the generator is forced to create discriminative target features near the support by taking the discriminator's predictions for the target samples into account. In the adversarial learning manner, the generator attempts to fool the discriminator, and the discriminator gives feedback to the generator on whether the extracted features are the desired ones. Through this training pattern, we are able to align the feature distributions of the source and target projects and, in the meanwhile, let the classifiers know how to correctly predict ambiguous target instances. After training, the ADA model can predict whether a new project instance is defective or not.

3.3.1 Generator

Since the source code of a software project is a kind of standardized and formal language (Huang et al. 2021), we believe CPDP tasks are closely related to the NLP context. LSTM networks (Hochreiter and Schmidhuber 1997) are well suited to NLP tasks because they can learn long-term dependencies through sophisticated gated computations. To mine which code snippets are the most important for defect prediction, we also adopt the attention mechanism to learn and assign different weights to the sequential data. Together with AM, we expect the LSTM network to be powerful enough to extract the contextual and semantic features of the project data.

Fig. 3

The network architecture of the generator

The overall network architecture of the generator is described in Fig. 3. The feature generator comprises five parts: an input layer, an embedding layer, a bi-directional LSTM layer, an attention layer and an output layer. We discuss the latter four layers as follows.

Embedding Layer

As a useful technique for encoding semantic information, word embeddings have been proven powerful as extra features in many NLP tasks (Almeida and Xexéo 2019). Therefore, we convert each integer vector into the corresponding embedding matrix by means of a word embedding layer,

$$\begin{aligned} \varphi :T\rightarrow W_e, \end{aligned}$$
(2)

where \( T \in \mathbb {R}^N \) represents the input integer vector after AST parsing, \( W_e \in \mathbb {R}^{E*N} \) is the embedding matrix to be learned and \( \varphi \) is a mapping function between them. We randomly initialize the embedding matrix and it can be updated when training the whole network. We assume that the high-dimensional embedding matrix is able to capture more contextual information contained in ASTs. Encoded integer vectors represented by the embedding matrices are then fed into a bi-directional LSTM network to create preliminary semantic features.

LSTM Layer

The LSTM network is good at dealing with sequential data and can explore long-term dependencies in the semantic context. To feed the embedding matrix into the LSTM, we split it into N column vectors with E dimensions,

$$\begin{aligned} W_e=\left[ e_{1}, e_{2}, \ldots , e_{N}\right] , \end{aligned}$$
(3)

where \( e_{i} \in \mathbb {R}^{E} (i=1,2,\ldots ,N) \). There are three types of gates, including input, output and forget gates, in LSTM networks. Based on these gates, LSTM can control how the information is processed and memorized in the cell states of LSTM units. The forget gates decide what information ought to be ignored from the previous moment, which can be described as

$$\begin{aligned} f_{t}=\sigma \left( W_{\textrm{fg}} \cdot \left[ h_{t-1}, e_{t}\right] +b_{\textrm{fg}}\right) , \end{aligned}$$
(4)

where \( \sigma (\cdot ) \) denotes the sigmoid function used for neural activation, \( h_{i} \left( 0 \le i \le N \right) \) is the hidden state of the memory cell at moment i and \( W_{\textrm{fg}} \) and \( b_{\textrm{fg}} \) are two parameters of the forget gate. Different from forget gates, input gates decide what to memorize from the current moment. This includes two parts, one of which is the activation result of the input and the other of which is the \( \tanh \) result of the input. We can formulate them as follows.

$$\begin{aligned} \begin{aligned} i_{t}&=\sigma \left( W_{\textrm{in}} \cdot \left[ h_{t-1}, e_{t}\right] +b_{\textrm{in}}\right) , \\ \tilde{C}_{t}&=\tanh \left( W_{C} \cdot \left[ h_{t-1}, e_{t}\right] +b_{C}\right) , \end{aligned} \end{aligned}$$
(5)

where \( W_{\textrm{in}}, b_{\textrm{in}}, W_{C}, b_{C} \) are the corresponding weights and biases. \( \tilde{C}_{t} \) is called candidate value of moment t, which is a part of updating to the current cell state. We multiply the old state by \( f_{t} \), trying to forget the information we decide to ignore, and add the candidate value scaled by the input gate’s result:

$$\begin{aligned} C_{t}=f_{t}*C_{t-1}+i_{t}*\tilde{C}_{t}. \end{aligned}$$
(6)

The final output is given by the output gates according to the output from the last moment and the cell state at the current moment, namely

$$\begin{aligned} \begin{aligned} o_{t}&=\sigma \left( W_{\textrm{out}} \cdot \left[ h_{t-1}, e_{t}\right] +b_{\textrm{out}}\right) , \\ h_{t}&=o_{t}*\tanh (C_{t}), \end{aligned} \end{aligned}$$
(7)

where \( W_{\textrm{out}}, b_{\textrm{out}} \) are the weights and bias of the output gates. Through the mechanisms of LSTM networks, we are able to learn the dependencies of defective codes in the context, which is greatly helpful for defect prediction.
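To make the gate equations (4)-(7) concrete, the following minimal sketch spells out a single forward step of one LSTM cell in PyTorch. In practice, nn.LSTM performs this computation internally; the parameter names here are ours for illustration only.

```python
import torch

def lstm_cell_step(e_t, h_prev, C_prev, params):
    """One time step of an LSTM cell, following Eqs. (4)-(7).

    e_t: embedding vector at moment t, shape (E,)
    h_prev, C_prev: hidden and cell states from moment t-1, shape (H,)
    params: dict of weight matrices of shape (H, H+E) and biases of shape (H,)
    """
    z = torch.cat([h_prev, e_t])                                # [h_{t-1}, e_t]
    f_t = torch.sigmoid(params["W_fg"] @ z + params["b_fg"])    # forget gate, Eq. (4)
    i_t = torch.sigmoid(params["W_in"] @ z + params["b_in"])    # input gate, Eq. (5)
    C_tilde = torch.tanh(params["W_C"] @ z + params["b_C"])     # candidate value, Eq. (5)
    C_t = f_t * C_prev + i_t * C_tilde                          # new cell state, Eq. (6)
    o_t = torch.sigmoid(params["W_out"] @ z + params["b_out"])  # output gate, Eq. (7)
    h_t = o_t * torch.tanh(C_t)                                 # new hidden state, Eq. (7)
    return h_t, C_t
```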

Attention Layer

The semantic features generated by the bi-directional LSTM network could be directly used by classifiers to train the prediction model. Though simple and effective, directly feeding these features into the classification model may result in performance loss. As in natural language, different words have different importance, and people tend to put more emphasis on the more significant words rather than the less relevant ones. Since programming languages conform to the paradigm of natural languages to some extent, we believe different code snippets contribute differently to software defect prediction. To formally depict these differences, we introduce the attention mechanism (Vaswani et al. 2017) to assign weights to the generated semantic features. Firstly, we input the hidden state \( h_{i} \) and the output feature \( o_{t-1}\) at moment \( t-1 \) into an alignment model \( \zeta (\cdot )\) to compute the alignment score,

$$\begin{aligned} a_{t,i} = \zeta \left( o_{t-1}, h_{i}\right) . \end{aligned}$$
(8)

We use a multi-layer perceptron as the alignment model, which evaluates how well the elements of the input sequence align with the current output at moment t. Then, we obtain the weights \( \alpha _{t,i} \) by applying a softmax function to the previously computed alignment scores,

$$\begin{aligned} \alpha _{t,i} = {\text {softmax}} \left( a_{t,i}\right) = \frac{\exp \left( a_{t,i}\right) }{\sum _{i=1}^{N} \exp \left( a_{t,i}\right) }. \end{aligned}$$
(9)

Finally, we obtain a context vector \( q_{t} \) at moment t by adding up the corresponding weights of all moments,

$$\begin{aligned} q_{t} = \sum _{i}^{N} \alpha _{t,i} h_{i}. \end{aligned}$$
(10)

Output Layer

The context vector produced by the attention mechanism is then concatenated with the handcrafted features to construct the joint features for further prediction. We can view the generator as a mapping \( \mathcal {G} \) from the AST input integer vectors to the joint features,

$$\begin{aligned} \mathcal {G}:\mathcal {V}\rightarrow R_{j}, \end{aligned}$$
(11)

where \( \mathcal {V} \in \mathbb {R}^N \) represents the AST input integer vector and \( R_{j} \in \mathbb {R}^{N_{j}} \) is the joint features vector with \( N_{j}\) dimensions extracted by the generator.
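Putting the four layers together, the generator can be sketched as a PyTorch module like the one below. This is a simplified sketch: the attention scores hidden states directly (a self-attention simplification of the alignment in (8)), the layer sizes follow the settings reported in Section 4.2, and the handcrafted-feature dimension of 20 is only illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Generator(nn.Module):
    """Embedding + bi-directional LSTM + attention, producing joint features."""

    def __init__(self, vocab_size, embed_dim=48, hidden_dim=256, handcrafted_dim=20):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True, bidirectional=True)
        # alignment model zeta(.): a small MLP scoring each hidden state (cf. Eq. (8), simplified)
        self.align = nn.Sequential(nn.Linear(2 * hidden_dim, hidden_dim),
                                   nn.Tanh(),
                                   nn.Linear(hidden_dim, 1))

    def forward(self, token_ids, handcrafted):
        # token_ids: (B, N) integer vectors; handcrafted: (B, handcrafted_dim)
        emb = self.embedding(token_ids)                          # (B, N, E)
        h, _ = self.lstm(emb)                                    # (B, N, 2H) hidden states
        scores = self.align(h).squeeze(-1)                       # (B, N) alignment scores
        alpha = F.softmax(scores, dim=-1)                        # attention weights, Eq. (9)
        context = torch.bmm(alpha.unsqueeze(1), h).squeeze(1)    # (B, 2H) context vector, Eq. (10)
        return torch.cat([context, handcrafted], dim=-1)         # joint features, Eq. (11)
```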

3.3.2 Training Steps

The aim of the ADA method is to minimize the misalignment between the feature distributions of the source and target projects using classifiers that know the relationship between the target instances and the decision boundary. To achieve this goal, we have to distinguish some target instances from others. Generally, instances that are near the class boundary are more likely to be misclassified, and we say these instances are ambiguous. We exploit two distinct classifiers to predict whether a given target instance is defective or not, and then make use of their disagreement to find these target instances. Consider two distinct classifiers, \( \mathcal {F}_{1} \) and \( \mathcal {F}_{2} \), which are initialized differently. Since the labeled source samples are available, we assume that \( \mathcal {F}_{1} \) and \( \mathcal {F}_{2} \) can correctly classify source samples after training. We regard the two classifiers as a discriminator, which tells the generator whether the target features are discriminative by maximizing the discrepancy on the corresponding instances. By doing so, the two classifiers become different and capable of finding ambiguous target instances. Then, the generator is forced not to create such target features by minimizing the same discrepancy over the target instances. The discriminator and generator interact with each other, encouraging the generator to create more predictive features and the discriminator to classify more precisely. We repeat the above steps in an adversarial learning manner.

$$\begin{aligned} \begin{gathered} \min _{\mathcal {G}, \mathcal {F}_{1}, \mathcal {F}_{2}} \mathcal {L}\left( \textbf{X}_\textrm{S}, Y_\textrm{S}\right) , \\ \mathcal {L}\left( \textbf{X}_\textrm{S}, Y_\textrm{S}\right) =-\mathbb {E}_{\left( \varvec{x}_{\textrm{S}},y_{\textrm{S}}\right) \sim \left( \textbf{X}_\textrm{S}, Y_\textrm{S}\right) }\left[ y_{\textrm{S}}\log p\left( y \mid \varvec{x}_{\textrm{S}}\right) +\left( 1-y_{\textrm{S}}\right) \log \left( 1-p\left( y \mid \varvec{x}_{\textrm{S}}\right) \right) \right] , \\ p\left( y \mid \varvec{x}\right) =\frac{p_{1}\left( y \mid \varvec{x}\right) +p_{2}\left( y \mid \varvec{x}\right) }{2} \end{gathered} \end{aligned}$$
(12)

To sum up, we first have to train a generator and two classifiers, all of which must correctly categorize source instances. Then, we apply a discrepancy maximization step to the two classifiers to filter out unwanted target instances, forcing the generator to create more predictive target features. We address this task in the following steps.

Step 1

Firstly, the two classifiers and the generator are trained on the source project data. They have to properly categorize source instances. This step is important because the subsequent derivations are based on this assumption. In our study, we select the Logistic Regression (LR) classifier as the base classifier. Since CPDP is a binary classification task, the binary cross-entropy loss is applied to train the whole CPDP model, which can be written as (12), where \( p_{1}\left( y\mid \varvec{x}\right) \) and \( p_{2}\left( y\mid \varvec{x}\right) \) denote the outputs of the two classifiers for instance \( \varvec{x} \).

Step 2

Secondly, the discriminator (\( \mathcal {F}_{1} \) and \( \mathcal {F}_{2} \)) is trained to find the ambiguous target instances while the generator is fixed. Given a specific target instance, if one classifier categorizes it as negative while the other classifies it as positive, then this instance is likely to be ambiguous. Therefore, in this step, we maximize the disagreement between the two classifiers to find these target instances, that is,

$$\begin{aligned} \begin{gathered} \min _{\mathcal {F}_{1}, \mathcal {F}_{2}} \mathcal {L}\left( \textbf{X}_\textrm{S}, Y_\textrm{S}\right) -\lambda \mathcal {L}_{\textrm{dis}}\left( \textbf{X}_\textrm{T}\right) , \\ \mathcal {L}_{\textrm{dis}}\left( \textbf{X}_\textrm{T}\right) =\mathbb {E}_{\varvec{x}_{\textrm{T}} \sim \textbf{X}_\textrm{T}}\left[ d\left( p_{1}\left( y \mid \varvec{x}_{\textrm{T}}\right) , p_{2}\left( y \mid \varvec{x}_{\textrm{T}}\right) \right) \right] , \\ d\left( p_{1}\left( y \mid \varvec{x}_{\textrm{T}}\right) , p_{2}\left( y \mid \varvec{x}_{\textrm{T}}\right) \right) =\left( p_{1}^{+}-p_{2}^{+}\right) ^{2} + \left( p_{1}^{-}-p_{2}^{-}\right) ^{2}, \end{gathered} \end{aligned}$$
(13)

where \( p_{1}^{+} \)/\( p_{1}^{-} \) and \( p_{2}^{+} \)/\( p_{2}^{-} \) denote the prediction probabilities of \( \mathcal {F}_{1} \) and \( \mathcal {F}_{2} \) for the positive and negative classes, respectively, and \( \lambda >0 \) is a weighting parameter for adjusting the influence of the discrepancy loss. The discrepancy loss equals the expectation of the discrepancy over all target instances, and we choose the \( L_{2} \) distance to calculate the discrepancy between the two classifiers.
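For clarity, the discrepancy term in (13) reduces to the following function of the two classifiers' predicted probabilities on target instances (a minimal sketch with illustrative names).

```python
import torch

def discrepancy_loss(p1, p2):
    """L2 discrepancy between two classifiers' predictions, Eq. (13).

    p1, p2: tensors of shape (B, 2) holding class probabilities
            (column 0: non-defective, column 1: defective) for target instances.
    """
    return torch.mean(torch.sum((p1 - p2) ** 2, dim=1))
```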

Step 3

Finally, the generator is trained to create more discriminative features while the discriminator is fixed. Since the two classifiers are not updated in this step, we minimize the discrepancy above in order to generate discriminative features. Consider a target instance and its generated features: if the prediction results of \( \mathcal {F}_{1} \) and \( \mathcal {F}_{2} \) are the same, then this instance is unambiguous and more likely to be classified correctly. We formulate the objective of this step as follows.

$$\begin{aligned} \min _{\mathcal {G}} \mathcal {L}_{\textrm{dis}}\left( \textbf{X}_\textrm{T}\right) . \end{aligned}$$
(14)

Ideally, the feature distribution of the target data is well aligned with that of the source after this step. As a result, the discriminator is able to achieve performance on the target instances comparable to that on the source project data.

Algorithm 1

ADA training algorithm.

We repeat these three steps during the training phase. Because the classifiers can correctly categorize source samples (Step 1), we train them to detect the desired target instances (Step 2) and force the generator to extract more discriminative features (Step 3). The pseudo-code of the proposed ADA method is described in Algorithm 1. Its approximate time complexity is \( \mathcal {O}\left( n \times N \times T\right) \), where n is the number of instances in the source project, N is the number of neurons in the LSTM network and T is the maximum number of training iterations. By applying the adversarial training method, we can improve the final prediction performance.
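The three steps can be organized into a training loop of roughly the following shape. This is a minimal sketch under our own naming: the data-loader structure and classifier modules are assumptions for illustration, while the optimizer settings follow Section 4.2.

```python
import torch
import torch.nn.functional as F

def train_ada(G, F1, F2, src_loader, tgt_loader, lam=1.0, epochs=100, lr=1e-3):
    opt_g = torch.optim.Adam(G.parameters(), lr=lr)
    opt_f = torch.optim.Adam(list(F1.parameters()) + list(F2.parameters()), lr=lr)

    def cls_loss(feat, y):
        # average the two classifiers' outputs and apply cross entropy, Eq. (12)
        p = (F.softmax(F1(feat), dim=1) + F.softmax(F2(feat), dim=1)) / 2
        return F.nll_loss(torch.log(p + 1e-8), y)

    def dis_loss(feat):
        # L2 discrepancy between the two classifiers on target features, Eq. (13)
        p1, p2 = F.softmax(F1(feat), dim=1), F.softmax(F2(feat), dim=1)
        return torch.mean(torch.sum((p1 - p2) ** 2, dim=1))

    for _ in range(epochs):
        for (xs, hs, ys), (xt, ht) in zip(src_loader, tgt_loader):
            # Step 1: train G, F1, F2 to classify source instances correctly
            opt_g.zero_grad(); opt_f.zero_grad()
            cls_loss(G(xs, hs), ys).backward()
            opt_g.step(); opt_f.step()

            # Step 2: fix G; keep source accuracy while maximizing the
            # classifiers' disagreement on target instances
            opt_f.zero_grad()
            (cls_loss(G(xs, hs).detach(), ys)
             - lam * dis_loss(G(xt, ht).detach())).backward()
            opt_f.step()

            # Step 3: fix F1/F2; train G to minimize the same discrepancy, Eq. (14)
            opt_g.zero_grad()
            dis_loss(G(xt, ht)).backward()
            opt_g.step()
```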

4 Experiment Setups

In this section, we describe our experimental setups in detail, including benchmark datasets, experimental settings, evaluation metrics, baseline methods, statistical analysis methods and research questions.

Table 1 5 projects chosen from the AEEEM dataset
Table 2 10 projects chosen from the PROMISE dataset

4.1 Benchmark Datasets

To assess the ADA model, we utilize two benchmark datasets, AEEEM (D’Ambros et al. 2010) and PROMISE (Jureczko and Madeyski 2010). These two datasets are readily available and widely used in recent CPDP research. The projects they contain are rather representative in the field of software engineering, which can better reflect the realities of defects in general software programs.

AEEEM dataset consists of five open-source Java projects: Eclipse JDT Core (JDT), Eclipse PDE UI (PDE), Equinox framework (Equinox), Mylyn and Apache Lucene (Lucene). There are 61 metrics in it, including source code metrics, entropy-of-change metrics, entropy-of-source-code metrics, etc. Table 1 presents the main information of these projects.

PROMISE dataset is comprised of 48 releases of 15 public open-source projects, 27 releases of 6 proprietary projects and 17 releases of 17 academic projects. All of them are written in Java. In recent CPDP studies (Qiu et al. 2019d, b; Jin 2021), researchers prefer to use the open-source projects. In our work, we carefully select 10 projects (as shown in Table 2). PROMISE dataset consists of 20 handcrafted features (metrics), which mainly concentrate on the programs’ complexity.

In the CPDP setting, we need to select a source project and a target project. Therefore, we first choose a project as the target project, and then treat each of the remaining projects as the source project in turn. In this way, we collect 20 and 90 project pairs from the AEEEM and PROMISE datasets, respectively. In the following sections, we conduct our experiments to perform CPDP on these project pairs.
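For instance, the ordered source/target pairs can be enumerated directly (project names as listed in Table 1):

```python
from itertools import permutations

aeeem = ["JDT", "PDE", "Equinox", "Mylyn", "Lucene"]
pairs = list(permutations(aeeem, 2))   # ordered (source, target) pairs
print(len(pairs))                      # 20; the 10 PROMISE projects yield 90 pairs
```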

4.2 Experimental Settings

There are many hyperparameters in the proposed ADA model, and different values of these parameters can influence the defect prediction performance differently. We conduct cross-validation analysis of these parameters to find the optimal ones. There are two main parameters in our model: the dimension of the embedding E and the penalty coefficient \( \lambda \) that balances the classification loss and the discrepancy loss in (13). We empirically set E to 48. More details about tuning \( \lambda \) are discussed in Section 6.2.

We implement the proposed ADA model with PyTorch (Paszke et al. 2019). The hidden state dimension of the bi-directional LSTM is set to 256 and the batch size in the training phase is set to 32. Adam, an extension of stochastic gradient descent, is adopted to optimize the entire model. We set the momentum to 0.9 by default and the initial learning rate to 0.001. The entire model is trained for 100 iterations. Our experimental environment is a server with an Intel(R) Xeon(R) CPU E5-2618 at 2.20 GHz, 64 GB RAM and eight NVIDIA 1080 Ti GPUs, running Ubuntu 20.04.2 LTS.

Table 3 Confusion matrix for binary classification tasks

4.3 Evaluation Metrics

To measure the proposed ADA method, we employ the following three metrics, namely the \( F_{1} \) measure, balanced accuracy and geometric mean (G-Mean), which are widely used in CPDP research. In binary classification analysis, we often measure a classifier by a confusion matrix (Table 3). According to its true label and classification result, an instance can be a TP (True Positive), FP (False Positive), TN (True Negative) or FN (False Negative). We can derive the metrics used in this work from the confusion matrix. Sensitivity (or recall) evaluates the effectiveness of the classifier on the positive class, while specificity assesses it on the negative class. Precision, another widely used indicator, measures how precise the model's positive predictions are. The definitions of these metrics are as follows.

$$\begin{aligned} \begin{aligned}&\textrm{Sensitivity}=\frac{\textrm{TP}}{\textrm{TP}+\textrm{FN}}, \\&\textrm{Specificity}=\frac{\textrm{TN}}{\textrm{TN}+\textrm{FP}}, \\&\textrm{Precision}=\frac{\textrm{TP}}{\textrm{TP}+\textrm{FP}}. \end{aligned} \end{aligned}$$
(15)

\( F_{1} \) Measure

By definition, a low sensitivity means that many positive samples are missed (underreported), whereas a low precision means that many samples are misreported as positive. These two metrics are usually in conflict: a classifier with high sensitivity is likely to perform poorly on precision, and vice versa. Therefore, we use the \( F_{1} \) measure to balance these two metrics, which is the harmonic mean of the two,

$$\begin{aligned} \begin{aligned} F_{1}&=\frac{2 \times \textrm{Sensitivity} \times \textrm{Precision}}{\textrm{Sensitivity}+\textrm{Precision}} \\&=\frac{2 \textrm{TP}}{2 \textrm{TP}+\textrm{FN}+\textrm{FP}}. \end{aligned} \end{aligned}$$
(16)

Balanced Accuracy

The most commonly used metric for a balanced classification problem is accuracy, which evaluates the overall effectiveness of a model. Accuracy equals the number of correctly classified instances divided by the total number of instances,

$$\begin{aligned} \textrm{Accuracy}=\frac{\textrm{TP}+\textrm{TN}}{\textrm{TP}+\textrm{TN}+\textrm{FP}+\textrm{FN}}. \end{aligned}$$
(17)

Nevertheless, when the data are skewed and imbalanced, accuracy may not be an appropriate metric (Bekkar et al. 2013). In CPDP tasks, a classifier that predicts all samples as the majority class can still perform well on accuracy. Thus, we use balanced accuracy to measure the performance of CPDP models. Balanced accuracy equals the average of specificity and sensitivity, which comprehensively considers the performance of a classifier on both the majority and minority classes. Balanced accuracy is formulated as follows.

$$\begin{aligned} \begin{aligned} \mathrm {Balanced\ Accuracy}&=\frac{1}{2}\left( \textrm{Sensitivity} + \textrm{Specificity}\right) \\&=\frac{1}{2}\left( \frac{\textrm{TP}}{\textrm{TP}+\textrm{FN}} + \frac{\textrm{TN}}{\textrm{TN}+\textrm{FP}}\right) . \end{aligned} \end{aligned}$$
(18)

Geometric Mean

The geometric mean (G-Mean) (Kubat and Matwin 1997) of the sensitivity and specificity is another metric used in imbalanced classification problem. G-Mean tries to maximize the performance of both the majority and minority classes, and keep them balanced at the same time, which is defined as:

$$\begin{aligned} \begin{aligned} \text {G-Mean}&=\sqrt{\textrm{Sensitivity} \times \textrm{Specificity}} \\&=\sqrt{\frac{\textrm{TP} \times \textrm{TN}}{\left( \textrm{TP}+\textrm{FN}\right) \left( \textrm{TN}+\textrm{FP}\right) }}. \end{aligned} \end{aligned}$$
(19)
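Given the confusion-matrix counts, the three metrics follow directly from (16), (18) and (19); a minimal sketch:

```python
def evaluate(tp, fp, tn, fn):
    """Compute F1, balanced accuracy and G-Mean from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    precision = tp / (tp + fp)
    f1 = 2 * sensitivity * precision / (sensitivity + precision)   # Eq. (16)
    balanced_accuracy = (sensitivity + specificity) / 2            # Eq. (18)
    g_mean = (sensitivity * specificity) ** 0.5                    # Eq. (19)
    return f1, balanced_accuracy, g_mean
```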

4.4 Baseline Models

To validate that the proposed ADA model is effective for CPDP tasks, we conduct extensive comparison experiments to show whether our model can outperform other state-of-the-art methods. We select 9 baseline models and summarize them briefly in Table 4.

Table 4 Baseline models for comparison

Please note that we re-implement all the baseline methods except TCNN (whose source code is available online) with PyTorch. All baseline methods are re-trained under roughly the same experimental settings for fair comparison.

4.5 Statistical Analysis Methods

To show the statistically significant difference between two models, we adopt a non-parametric test, the Wilcoxon signed-rank test (Wilcoxon 1945), at a confidence level of 95%, which is commonly used in other defect prediction research (Ryu et al. 2016; Wang et al. 2020; Li et al. 2019). The Wilcoxon signed-rank test relaxes the constraints on the data distribution: it does not require the data to follow any particular distribution, including the normal distribution. At the 95% confidence level, we say that two methods are statistically different from each other if the p-value is less than 0.05. When the p-value is equal to or larger than 0.05, we conclude that the difference between the two methods is not statistically significant.
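In practice, the test is applied to the paired per-project-pair scores of two methods, for example with SciPy; this snippet is shown only to illustrate the procedure, and the significance threshold follows the 95% confidence level above.

```python
from scipy.stats import wilcoxon

def significantly_different(scores_a, scores_b, alpha=0.05):
    """Paired Wilcoxon signed-rank test on per-project-pair scores (e.g., F1)."""
    _, p_value = wilcoxon(scores_a, scores_b)
    return p_value < alpha
```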

Besides, we utilize a variant of the Scott-Knott Effect Size Difference (ESD) test (Tantithamthavorn et al. 2016) to measure the effect size between two models and to rank all compared models. The Scott-Knott ESD test is a comparison method based on the means of the data. It partitions the set of means into statistically distinct groups with non-negligible differences using a hierarchical clustering algorithm. Table 5 describes the meanings of different Scott-Knott ESD values (denoted by d).

Table 5 Scott-Knott Effect Size Difference and corresponding effectiveness level

4.6 Research Questions

To assess the effectiveness and performance of the proposed ADA method, we discuss the following three research questions (RQs).

  • RQ1: Is our proposed ADA method better than other state-of-the-art CPDP models?

  • Motivation: To prove the effectiveness of the proposed ADA method, we need to make comparisons with other state-of-the-art CPDP models. We choose 9 baseline methods (Section 4.4) and compare their predictive performance with that of ADA.

  • RQ2: How effective are the semantic features extracted by the generator that combines an LSTM network and attention mechanisms?

  • Motivation: To verify that our proposed feature generation model is more suitable for CPDP tasks, we need to confirm the effectiveness of our feature generation method.

  • RQ3: How effective is the adversarial domain adaptation method?

  • Motivation: Transfer learning is a powerful tool dealing with CPDP tasks. We need to verify the effectiveness of the proposed adversarial domain adaptation method compared to other transfer learning methods.

Table 6 \( F_{1} \) measure comparison results with 9 methods on PROMISE dataset
Table 7 Balanced accuracy comparison results with 9 methods on PROMISE dataset
Table 8 G-Mean comparison results with 9 methods on PROMISE dataset
Table 9 \( F_{1} \) measure comparison results with 9 methods on AEEEM dataset
Table 10 Balanced accuracy comparison results with 9 methods on AEEEM dataset
Table 11 G-Mean comparison results with 9 methods on AEEEM dataset

5 Experimental Results

In this section, we conduct extensive experiments and present the experimental and statistical results.

5.1 RQ1: Is Our Proposed ADA Method Better than Other State-of-the-Art CPDP Models?

Tables 6, 7, 8, 9, 10 and 11 display the comparison results for the three evaluation metrics on the PROMISE and AEEEM datasets, respectively.

On the PROMISE dataset, there are 10 projects, which form 90 project pairs. We can observe that our proposed ADA method achieves 0.666, 0.722 and 0.713 on average in terms of \( F_{1} \) measure, balanced accuracy and G-Mean, respectively. All of them are the best average values on the corresponding metrics compared to the other baseline methods. From the perspective of Win/Tie/Lose (W/T/L), our model wins 7 projects, ties 0 projects and loses 3 projects on \( F_{1} \) measure compared to the TPTL method, and wins 9 projects, ties 0 projects and loses 1 project on balanced accuracy compared to DMDA_JFR. To better illustrate that our method outperforms the other baseline models, we quantify the degree of improvement. Compared with the DMDA_JFR method, the ADA model attains average improvements of 12.36%, 5.01% and 3.94% in terms of \( F_{1} \) measure, balanced accuracy and G-Mean, respectively; compared to the DBN method, these improvements are 11.30%, 9.40% and 5.34%; compared to the TPTL method, 18.96%, 21.18% and 17.92%; and compared to the STr-NN method, 20.87%, 20.33% and 26.19%. In terms of the p-values given by the Wilcoxon signed-rank test, all p-values are less than 0.05, meaning that the differences between the baselines and the ADA method are statistically significant at the 95% confidence level. From the perspective of the Scott-Knott ESD test, all ESDs are larger than 0.8 on the \( F_{1} \) measure, which implies that the differences are large according to the Scott-Knott ESD effectiveness levels described in Table 5. On balanced accuracy, the ESD of DMDA_JFR is 0.449, indicating that the effectiveness level is small; the same is true for DBN and DMDA_JFR on G-Mean. Notably, our method achieves 0.495 on project Log4j in terms of \( F_{1} \) measure, which is only better than 3 other baseline methods. From Table 2 we can see that the defect rate of Log4j is 92.2%, resulting in an imbalanced data distribution dominated by positive samples. Even though we apply an imbalanced learning method to alleviate this problem, we may still get poor performance when there are insufficient data, which may be the reason why we perform poorly on this project. Consider another imbalanced project, Xalan, with a 98.8% defect rate: since there are far more samples (909 instances) in the Xalan project, our method performs best among the baseline models on it. Therefore, if there are ample samples in a project, ADA is able to attain considerable performance even when they are imbalanced.

On the AEEEM dataset, our model also achieves the best performance among the baseline methods. From these tables we can observe that our method achieves 0.669, 0.654 and 0.681 on average on the metrics of \( F_{1} \) measure, balanced accuracy and G-Mean, respectively. Similarly, all of them are the best average values among all compared methods. The experimental and statistical results on the AEEEM dataset indicate the same conclusion.

Fig. 4 Box charts of three evaluation metrics on the PROMISE dataset (left-hand side) and the AEEEM dataset (right-hand side). Fliers are omitted for simplicity, as with the following box charts. Methods are ranked and grouped according to the Scott-Knott ESD test results; the orange lines represent the methods’ average medians, the green triangles stand for the methods’ means annotated by values, and methods in boxes of the same color belong to the same cluster in the Scott-Knott ESD test

In order to intuitively show the differences among these methods, we draw box charts based on the Scott-Knott ESD test results, as shown in Fig. 4. The methods are ranked and grouped according to the Scott-Knott ESD test, and methods in boxes of the same color show little difference in performance. From the figures we can observe that both the medians and the means of our model on the three metrics are higher than those of the other baseline models on both datasets. In terms of ranking, the proposed ADA model ranks first on all three metrics. The second-tier models are DMDA_JFR and DBN, which rank second and third respectively in terms of balanced accuracy and G-Mean on the PROMISE dataset; TPTL, STr-NN and MANN rank after these two methods. On the AEEEM dataset, TPTL ranks second on balanced accuracy and G-Mean. Based on these comprehensive observations, we can answer RQ1: our proposed ADA method is better than the other state-of-the-art baseline models.

Fig. 5 The box charts of different feature generation models in terms of three evaluation metrics on PROMISE dataset. The orange lines represent the methods’ average medians and the blue triangles stand for the methods’ means annotated by values

5.2 RQ2: How Effective are the Semantic Features Extracted by the Generator that Combines an LSTM Network and Attention Mechanism?

To answer RQ2, we first compare different approaches adopted in CPDP research. Among the baseline models we choose, Seml (Liang et al. 2019) uses an LSTM network to extract the semantic features of programs, while TCNN (Qiu et al. 2019d) utilizes a CNN as the base feature generator. These two methods adopt roughly the same data preprocessing techniques and train their models to perform CPDP tasks. Empirically, Seml performs better than TCNN: referring to Fig. 4, Seml ranks sixth while TCNN ranks seventh or eighth in terms of \( F_{1} \) measure, balanced accuracy and G-Mean on the PROMISE dataset. As we discussed in Section 1, we are more inclined to treat the CPDP problem as an NLP problem, since a programming language is a standard, formal language that follows certain language paradigms. Therefore, recurrent neural networks, which are good at processing sequential data, may perform better than other deep learning methods such as CNN and DBN. The comparisons among Seml, TCNN, DBN and ADA partly support this view.

Table 12 Ablation study of different components on the metric of \( F_{1} \) measure on PROMISE dataset

To make this point more convincing, we further conduct a comparison experiment. We use different feature generator models, including CNN (adopted in TCNN (Qiu et al. 2019d)), DBN (adopted in Wang et al. (2020)), Double Marginalized Denoising Auto-Encoders (DMDA, adopted in DMDA_JFR (Zou et al. 2021)) and LSTM with AM (adopted in this paper), to extract the semantic features separately and feed them into LR classifiers. We train these generators in the manner of adversarial learning, namely by means of the adversarial domain adaptation method. We use the PROMISE dataset because it contains more projects than the AEEEM repository. Figure 5 shows the performance differences between these feature generation models. Our model, which uses the LSTM network as the base feature generator, performs best; DMDA with DA ranks second, and DBN and CNN with DA achieve the worst performance. Compared to DBN or CNN, DMDA uses two auto-encoders to capture the semantic features of programs, which is also suitable for sequential data. DBN builds its prediction model on probability estimation, and CNN relies on convolution operations for feature extraction, both of which are better suited to computer vision tasks. We believe this is the main reason why our feature generation model outperforms the other methods: we fully consider the context of the CPDP problem and leverage an LSTM network with AM to capture the semantic and contextual information contained in programs.
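To make the feature generator concrete, below is a minimal PyTorch sketch of an LSTM encoder with a simple attention layer over the hidden states, in the spirit of the generator described above. The class name LSTMAttentionGenerator, the vocabulary size and the layer dimensions are illustrative assumptions, not the paper's exact architecture.

# Hedged sketch: LSTM feature generator with token-level attention.
import torch
import torch.nn as nn

class LSTMAttentionGenerator(nn.Module):
    def __init__(self, vocab_size=5000, embed_dim=64, hidden_dim=128):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.attn = nn.Linear(hidden_dim, 1)  # scores each time step

    def forward(self, token_ids):
        # token_ids: (batch, seq_len) integer-encoded token sequences of programs
        h, _ = self.lstm(self.embedding(token_ids))   # (batch, seq_len, hidden_dim)
        weights = torch.softmax(self.attn(h), dim=1)  # attention weight per token
        return (weights * h).sum(dim=1)               # weighted sum -> (batch, hidden_dim)

# Toy forward pass with random token ids
generator = LSTMAttentionGenerator()
features = generator(torch.randint(1, 5000, (4, 50)))
print(features.shape)  # torch.Size([4, 128])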

Based on the above analysis, we can answer RQ2: combining an LSTM network with AM, the feature generator of our proposed model is more effective than the other approaches.

Table 13 Ablation study of different components on the metric of balanced accuracy on PROMISE dataset
Table 14 Ablation study of different components on the metric of G-Mean on PROMISE dataset

5.3 RQ3: How Effective is the Adversarial Domain Adaptation Method?

Tables 12, 13 and 14 and Fig. 6 show the results of the ablation study. We examine the influence of each component on the prediction performance, including imbalanced learning techniques (IL), handcrafted features (HF), attention mechanisms (AM) and adversarial domain adaptation methods (DA). Here we focus on the adversarial domain adaptation method (we discuss the ablation study in more detail in Section 6.1). No-DA stands for the variant in which the feature generator is not trained by means of adversarial domain adaptation. We train the generator in an SDP way, that is, we directly feed the output of the generator into the LR classifier without applying any transfer learning or domain adaptation method. From the tables we can see that the performance drops significantly. On the \( F_{1} \) measure, ADA achieves 0.666 on average while No-DA achieves only 0.423, losing more than one third of the performance; on balanced accuracy, ADA achieves 0.722 on average while No-DA achieves only 0.431, a 37.45% drop; on G-Mean, ADA achieves 0.713 on average while No-DA achieves only 0.465, a 34.78% drop. The differences between ADA and No-DA are also obvious in the box charts (Fig. 6).

Fig. 6 The box charts of ablation study results in terms of three evaluation metrics on PROMISE dataset. The orange lines represent the methods’ average medians and the blue triangles stand for the methods’ means annotated by values

To further validate the effectiveness of the adversarial domain adaptation method used in the ADA model, we conduct comparison experiments with another transfer learning method, namely the one adopted in TCNN (Qiu et al. 2019d). TCNN leverages a standard CNN to extract program features, adds a matching layer to the CNN model, and embeds both the source and target data representations into a reproducing kernel Hilbert space; it then adopts Maximum Mean Discrepancy (MMD) to reduce the distribution divergence between the source and target projects. In the comparison experiments, we add the same matching layer to our feature generation model and try to align the two distributions by means of MMD. Table 15 shows the comparison results, where the comparative model is denoted by MMD. From the table and figure we can observe that ADA is better, surpassing MMD by 15.29%, 9.34% and 8.23% in terms of the three metrics, respectively. Compared to MMD and other transfer learning methods adopted in CPDP (Jin 2021; Nam et al. 2013), ADA fully considers the relationship between the decision boundary and the target instances while reducing the distribution divergence, forcing the feature extraction model to generate more discriminative features. We believe this is the main reason why the adversarial domain adaptation method we use is more effective than the others. A hedged sketch of the MMD-style alignment we compare against is given below.
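The following is a minimal, hedged sketch of a Gaussian-kernel MMD loss between source and target feature batches, illustrating the kind of distribution matching the MMD variant performs. The kernel bandwidth, tensor shapes and function names are assumptions, not TCNN's exact configuration.

# Hedged sketch: Gaussian-kernel MMD between source and target features.
import torch

def gaussian_kernel(x, y, sigma=1.0):
    # Pairwise Gaussian kernel values between two feature batches
    dist_sq = torch.cdist(x, y) ** 2
    return torch.exp(-dist_sq / (2 * sigma ** 2))

def mmd_loss(source_feats, target_feats, sigma=1.0):
    k_ss = gaussian_kernel(source_feats, source_feats, sigma).mean()
    k_tt = gaussian_kernel(target_feats, target_feats, sigma).mean()
    k_st = gaussian_kernel(source_feats, target_feats, sigma).mean()
    return k_ss + k_tt - 2 * k_st  # small value means similar distributions

# Toy usage with random features: the loss penalizes distribution divergence
src, tgt = torch.randn(32, 128), torch.randn(32, 128) + 0.5
print(mmd_loss(src, tgt).item())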

Through the above discussions, we are able to answer RQ3: the adversarial domain adaptation method we utilize is the main factor accounting for the performance improvement, and the experimental results show that it is more effective than other transfer learning methods.

6 Discussions

In this section, we further examine the proposed ADA method through several questions, and then discuss the threats to the validity of this work.

6.1 Why does ADA Work?

From the empirical point of view, the experimental and statistical results on two benchmark datasets of 15 software releases show that our proposed ADA method is generally superior to the other baseline methods in terms of three evaluation metrics: \( F_{1} \) measure, balanced accuracy and G-Mean. In this section, we discuss the effectiveness of ADA from the perspectives of problem modeling and the ablation study.

From the perspective of problem modeling, we believe our model is superior to the other models for two main reasons. Firstly, we exploit the semantic and contextual features of programs in a more appropriate way. Because project source code is a standardized, formalized language that follows certain specifications, we believe that the CPDP problem is more in line with the NLP context. Therefore, we extend an LSTM network with an attention mechanism as the feature generator, since it is better at coping with sequential data such as programming languages. AM enables us to assign different weights to distinct tokens. Just as the importance of each word in a sentence of natural language differs, we believe that the contribution of different tokens in a program to software defects also differs. In this way, the feature extraction model can produce better semantic features for prediction.

Secondly, we take an effective measure to transfer knowledge learned from the source project to the target project. Previous CPDP methods (Qiu et al. 2019d; Ryu et al. 2017; Liu et al. 2019) utilized transfer learning methods to reduce the divergence between the feature distributions of different projects. Though simple and effective, most of them fail to consider substantial information about the feature space. In the proposed method, we adopt an adversarial domain adaptation method: by training the feature generator and defect predictor in the manner of adversarial learning, we reduce the divergence between the feature distributions of the source and target projects while fully taking the relationship between the decision boundary and the target project instances into consideration.

As described in Section 3.3, we train the whole model in three steps. In order to observe how the adversarial learning method works, we investigate the trend of the discrepancy value at each iteration. Take the project pair Xerces \( \rightarrow \) Ivy (i.e., Xerces is the source project and Ivy is the target project) as an example. Concretely, we record the discrepancy value between the two classifiers (denoted by \( d_{1} \)) as described in (13), and the difference loss of the feature generator (denoted by \( d_{2} \)) as described in (14) when the two classifiers are fixed. \( d_{1} \) reflects the degree of disagreement between the predictions of the two classifiers on the target project data: a large value means that the current target instance lies near the class boundaries and is considered ambiguous. By maximizing \( d_{1} \), the classifiers become more likely to detect these ambiguous target instances while still correctly classifying non-ambiguous ones. Once this relationship between the decision boundaries and the target instances is known, we can tell the generator not to generate such ambiguous instances. \( d_{2} \) reflects the difference loss of the feature generator on the target project: the smaller the value, the more confidently the features produced by the generator can be classified by the predictor. We improve the discriminability of the generated features by minimizing \( d_{2} \), so that the features of ambiguous instances are generated in regions far away from the decision boundary. We repeat this procedure in an adversarial training manner until convergence. We draw line charts to show the trends of \( d_{1} \), \( d_{2} \) and the loss value in (12), as shown in Figs. 7 and 8. From the line charts we can see that \( d_{1} \) increases and \( d_{2} \) decreases in general, which is in line with our expectation. In this example, the loss stabilizes and no longer decreases after 15 iterations, meaning that the model has finished training and our algorithm has converged.
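To illustrate this alternating scheme, below is a simplified, hedged PyTorch sketch of one training round: a supervised step on the source data, a step that maximizes the classifier discrepancy \( d_{1} \) on the target data with the generator fixed, and a step that minimizes the generator's difference loss \( d_{2} \) with the classifiers fixed. The L1 discrepancy, the toy modules, the \( \lambda \) weight and the optimizer settings are assumptions and do not reproduce Eqs. (12)-(14) exactly.

# Hedged sketch: three-step adversarial training with two classifiers.
import torch
import torch.nn as nn
import torch.nn.functional as F

def discrepancy(logits1, logits2):
    # Mean absolute difference between the two classifiers' predicted probabilities
    return (F.softmax(logits1, dim=1) - F.softmax(logits2, dim=1)).abs().mean()

def train_round(gen, clf1, clf2, opt_g, opt_c, xs, ys, xt, lam=0.8):
    # Step A: supervised training of the generator and both classifiers on source data
    opt_g.zero_grad(); opt_c.zero_grad()
    fs = gen(xs)
    loss_a = F.cross_entropy(clf1(fs), ys) + F.cross_entropy(clf2(fs), ys)
    loss_a.backward(); opt_g.step(); opt_c.step()

    # Step B: fix the generator, maximize the discrepancy d1 on target data
    opt_c.zero_grad()
    fs, ft = gen(xs).detach(), gen(xt).detach()
    loss_b = (F.cross_entropy(clf1(fs), ys) + F.cross_entropy(clf2(fs), ys)
              - lam * discrepancy(clf1(ft), clf2(ft)))
    loss_b.backward(); opt_c.step()

    # Step C: fix the classifiers, minimize the generator's difference loss d2
    opt_g.zero_grad()
    ft = gen(xt)
    loss_c = discrepancy(clf1(ft), clf2(ft))
    loss_c.backward(); opt_g.step()

# Toy usage with random data and tiny stand-in modules
gen = nn.Sequential(nn.Linear(20, 16), nn.ReLU())
clf1, clf2 = nn.Linear(16, 2), nn.Linear(16, 2)
opt_g = torch.optim.Adam(gen.parameters(), lr=1e-3)
opt_c = torch.optim.Adam(list(clf1.parameters()) + list(clf2.parameters()), lr=1e-3)
xs, ys, xt = torch.randn(32, 20), torch.randint(0, 2, (32,)), torch.randn(32, 20)
for _ in range(5):
    train_round(gen, clf1, clf2, opt_g, opt_c, xs, ys, xt)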

Table 15 Comparison results with MMD transfer learning method
Fig. 7 The changes of discrepancy \( d_{1} \) and \( d_{2} \)

Fig. 8 The change of loss value

In order to examine the effectiveness of each component of the proposed model, we design four ablation models. The first model removes the imbalanced learning techniques (No-IL) described in Section 3.2. The second model, No-HF, only takes the semantic features extracted by the generator to train the prediction model, without combining the handcrafted features. The third model, No-AM, removes the attention mechanism from the generator, i.e., it assigns the same weight to every token. The final model, No-DA, is designed without the adversarial domain adaptation method: we train both the feature generator and an LR classifier (as the prediction model) on the source project data and directly apply the model to the target project data without matching the distribution difference between them. Tables 12, 13 and 14 show the experimental results of these models on the metrics of \( F_{1} \) measure, balanced accuracy and G-Mean, and Fig. 6 visualizes the table results.

We can draw some conclusions from the tables and figures. Firstly, the performance of our model drops significantly if we do not apply techniques to tackle the data imbalance problem: compared to the ADA method, the performance of No-IL drops from 0.666 to 0.415, from 0.722 to 0.429 and from 0.713 to 0.454 on the three metrics. Secondly, handcrafted features and attention mechanisms play important roles in generating crucial features for defect prediction. On average, handcrafted features improve the performance by 15.84%, 15.90% and 18.87% on the three metrics. By assigning larger weights to important tokens and smaller weights to irrelevant ones, attention mechanisms bring 11.67%, 13.07% and 15.57% improvement to the ADA model on the three performance indicators, respectively. Last but not least, adversarial domain adaptation plays a significant role in defect prediction. If we take no action to transfer the knowledge learned from the source projects to the target projects, we suffer from poor performance in all aspects. By adopting the domain adaptation method, we enhance the performance from 0.423 to 0.666, from 0.431 to 0.722 and from 0.465 to 0.713 in terms of \( F_{1} \) measure, balanced accuracy and G-Mean.

6.2 How does the Hyper Parameter \( \lambda \) Affect the Performance of ADA?

The hyper parameter \( \lambda \) controls the weight ratio between the classification loss of the prediction model and the discrepancy loss between the two classifiers in (13). If \( \lambda \) is small, we put more emphasis on the classification loss and less on the discrepancy loss; if \( \lambda \) is large, we care more about the discrepancy loss and less about the classification loss. We set \( \lambda = 0.8 \) in our research through cross-validation experiments. We select 14 different values of \( \lambda \) (from 0.1 to 1.3 with step 0.1) and observe how they influence the prediction performance of the ADA model.

Fig. 9 Experimental results of the three metric values of ADA with different \( \lambda \) values on PROMISE dataset

Figure 9 shows the performance of ADA with the 14 selected \( \lambda \) values in terms of \( F_{1} \) measure, balanced accuracy and G-Mean on the PROMISE dataset. From the figure, we can draw two conclusions. First, when \( \lambda \) is relatively small (less than 0.7), the performance on the three metrics is quite low. This is consistent with our earlier discussion: if we put too little weight on the discrepancy, we cannot sufficiently maximize the discrepancy loss between the two classifiers, so they fail to distinguish ambiguous target samples and ADA becomes less effective. Second, when \( \lambda \) lies in the interval \( \left[ 0.8, 1.0\right] \), the performance of ADA is the best, and when \( \lambda \) is larger than 1.0, the performance starts to drop. Based on these observations, we choose \( \lambda = 0.8 \). Note that the other hyper parameters in the ADA model are selected in the same way.

Fig. 10 Experimental results of ADA model with five different classifiers on the three evaluation metrics on PROMISE (left) and AEEEM (right) dataset

6.3 How do Different Classifiers Affect the Performance of ADA?

In this subsection, we discuss why we choose the Logistic Regression classifier as the base classification model. To evaluate the impact of different classifiers on the prediction performance, we choose five classic machine learning classifiers: Logistic Regression (LR), Support Vector Machine (SVM), Neural Network (NN), Random Forest (RF) and Naive Bayes (NB). For SVM, we use the Gaussian radial basis function as the kernel function. The NN has one hidden layer, and the number of hidden neurons is twice the input feature dimension.
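As a hedged sketch of this comparison, the snippet below fits the five classifiers on synthetic features with scikit-learn. The SVM uses the RBF kernel and the NN uses one hidden layer with twice the feature dimension, as stated above; the feature array, labels and all other settings are illustrative assumptions or library defaults.

# Hedged sketch: the five candidate classifiers on synthetic generated features.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.neural_network import MLPClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import GaussianNB

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 40))    # synthetic features standing in for generator output
y = rng.integers(0, 2, size=200)  # synthetic defect labels

classifiers = {
    "LR": LogisticRegression(max_iter=1000),
    "SVM": SVC(kernel="rbf"),
    "NN": MLPClassifier(hidden_layer_sizes=(2 * X.shape[1],), max_iter=500),
    "RF": RandomForestClassifier(),
    "NB": GaussianNB(),
}
for name, clf in classifiers.items():
    clf.fit(X, y)
    print(name, round(clf.score(X, y), 3))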

Figure 10 displays the performance of the proposed method with the five different classifiers on the three evaluation metrics on the PROMISE and AEEEM datasets. From the figure we can observe that the performance of these models does not differ greatly. On the PROMISE dataset, they all achieve about 0.65 on the \( F_{1} \) measure, and LR achieves the best performance among the five models in terms of balanced accuracy and G-Mean. On the AEEEM dataset, the performance of SVM, NN and RF on balanced accuracy is relatively low compared to LR and NB; RF achieves performance comparable to, or even better than, LR in terms of \( F_{1} \) measure and G-Mean, but it fails to compete with LR on balanced accuracy.

To sum up, the classifiers we evaluate do not differ greatly on the performance indicators. Among them, the Logistic Regression classifier consistently achieves the best performance in terms of the three metrics on the two benchmark datasets.

6.4 Threats to Validity

Threats to Construct Validity

We carefully choose three commonly used performance metrics, \( F_{1} \) measure, balanced accuracy and G-Mean, as our evaluation criteria. The \( F_{1} \) measure is the harmonic mean of sensitivity and precision, seeking a balance between missed defects and false alarms. Balanced accuracy and G-Mean are two commonly used metrics when the data distribution is imbalanced and skewed. However, these might not be the only appropriate metrics; other measures, such as the Area Under the ROC (receiver operating characteristic) Curve (AUC) and the Matthews Correlation Coefficient (MCC), can also be used to measure performance in binary classification tasks.
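For clarity, the following hedged sketch computes the three metrics on a small synthetic prediction vector with scikit-learn; here G-Mean is taken as the geometric mean of sensitivity (recall on the defective class) and specificity (recall on the clean class).

# Hedged sketch: computing F1, balanced accuracy and G-Mean on synthetic labels.
import numpy as np
from sklearn.metrics import f1_score, balanced_accuracy_score, recall_score

y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])  # 1 = defective, 0 = clean (synthetic)
y_pred = np.array([1, 0, 0, 1, 0, 1, 1, 0])

f1 = f1_score(y_true, y_pred)
bal_acc = balanced_accuracy_score(y_true, y_pred)
sensitivity = recall_score(y_true, y_pred, pos_label=1)
specificity = recall_score(y_true, y_pred, pos_label=0)
g_mean = np.sqrt(sensitivity * specificity)
print(f"F1={f1:.3f}, balanced accuracy={bal_acc:.3f}, G-Mean={g_mean:.3f}")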

Threats to Internal Validity

For baseline models with open-access source code (such as TCNN), we use the provided implementations to reduce the potential impact of improper re-implementation. For the models whose source code is not provided, we carefully implement them following the details described in the corresponding papers; however, our implementations might not fully reproduce every detail of the baseline methods. For a fair comparison, we apply consistent implementations of the shared components, including data preprocessing and the LR classifier.

Threats to External Validity

We try to reduce bias by carefully selecting two public benchmark datasets of 15 open-source projects. These projects may not represent all software projects, and all of the selected projects are written in the Java programming language. The results of our proposed method on commercial software projects or on projects written in other programming languages may be better or worse. Validating the effectiveness of ADA on more diverse datasets with more projects is needed.

7 Conclusions

The main challenges of the CPDP problem lie in how to construct more meaningful and contextual features that represent programs, and how to effectively transfer knowledge learned from source projects to target projects. In this paper, we proposed a novel ADA method to tackle these two challenges. For feature generation, we extend a variant of recurrent neural networks, the Long Short-Term Memory network, with an attention mechanism. Compared to other deep learning methods, LSTM is better at dealing with sequential data such as project source code, and with the attention mechanism the feature generator can focus on the more important parts and further improve the prediction performance. To effectively transfer knowledge learned from source projects to target projects, we propose an adversarial domain adaptation method to bridge the gap between the two feature distributions. Moreover, our method fully considers the relationship between the target instances and the class boundary when aligning the distributions. We treat the feature extraction model as the generator and train two classifiers as the discriminator. By training the whole model in the manner of adversarial learning, we first maximize the discrepancy of the two classifiers over the target samples to distinguish the ambiguous ones, and then train the generator to create more discriminative features according to the relationship between the target instances and the class boundary.

We conduct extensive experiments on two benchmark datasets of 15 open-source Java projects. The classification performance of ADA is measured with three evaluation metrics: \( F_{1} \) measure, balanced accuracy and G-Mean. We compare ADA with state-of-the-art CPDP models using the Wilcoxon signed-rank test and the Scott-Knott Effect Size Difference test. Experimental and statistical results show that ADA is effective and outperforms the other baseline models by a significant margin.

Several problems need to be investigated in the future. Firstly, our proposed method may not generalize well to other datasets; we will conduct experiments on more projects and extend our method to other programming languages to make ADA more generalizable. Secondly, transfer learning is a powerful tool for dealing with CPDP tasks; more precise and appropriate transfer learning methods that can align two or more feature distributions without losing the information of the source domain need to be explored in the future.