1 Introduction

With the development of modern software engineering, more and more resources are allocated to the software testing phase to keep software projects bug-free, and thus software testing has become one of the most important phases in the whole software lifecycle. As a result, how to guarantee software reliability through software testing attracts significant attention. Some researchers (He et al. 2015; Ni et al. 2019; Balogun et al. 2021; Bal and Kumar 2020) adopted Software Defect Prediction (SDP) methods to locate defects in software projects. A program that contains at least one bug (defect), and/or bugs that seriously interfere with its functionality, is said to be buggy (defective). Otherwise, the program is said to be clean (non-defective). Most existing SDP methods utilized the historical data of a software project to train their prediction models, and then employed them to find future defects in the same project. However, in practice, it is hard for researchers or developers to build a satisfactory SDP model in the early stage of a given project because labeled data are hardly available. Therefore, Cross-Project Defect Prediction (CPDP) (Briand et al. 2002; Herbold et al. 2018; Hosseini et al. 2019; Khatri and Singh 2021) was proposed, making it feasible to build an SDP model for a new project. CPDP allows us to train a defect prediction model on mature projects with sufficient labeled data (source projects), and then apply it to new projects that lack labeled data (target projects). A typical CPDP method can be roughly split into two modules: how to extract predictive features, and how to effectively apply the knowledge learned from source projects to target projects.

How to construct predictive features remains a challenge in CPDP research. As CPDP is a binary classification problem, feature extraction is one of the most important components in CPDP models. In early studies of software defect prediction, most methods (Watanabe et al. 2008; Turhan et al. 2009; Shepperd et al. 2014; Jing et al. 2014; Zhu et al. 2021; Li et al. 2019) depended on handcrafted features that represent the static characteristics of programs, including McCabe cyclomatic complexity (McCabe 1976), the Halstead metrics (Halstead 1977) and the CK (Chidamber-Kemerer) metrics (Chidamber and Kemerer 1994). However, these handcrafted metrics were often designed based on researchers' statistical analysis or experience of the software, which might not fully mine the contextual and semantic features of project source code (Wang et al. 2016). If CPDP models are constructed based only on these handcrafted features, they may lose some prediction performance. Because of its prominence in feature representation, deep learning is strongly recommended by researchers (Wang et al. 2016; Li et al. 2017) to capture programs' hidden contextual and semantic features and feed them into prediction models for training. Several prevalent deep learning approaches have been adopted in CPDP, including the Deep Belief Network (DBN) (Tong et al. 2019; Wang et al. 2020), the Long Short-Term Memory (LSTM) network (Li et al. 2019; Deng et al. 2020; Liang et al. 2019) and the Convolutional Neural Network (CNN) (Li et al. 2017; Qiu et al. 2019a, d). To some extent, the source code of a software project is a kind of standardized and formal language that is closely related to the NLP context (Huang et al. 2021), and we believe that the LSTM network is more suitable for extracting hidden semantic features in CPDP tasks. LSTM is a variant of the recurrent neural network, which processes sequential data by modeling units in sequence and "memorizes" long-term dependencies through gated computations (Hochreiter and Schmidhuber 1997). In this paper, we extend the LSTM network with the Attention Mechanism (AM) (Vaswani et al. 2017) as the feature generator to fully capture the hidden semantic features of programs. AM enables a model to focus on the most important parts of the input when making decisions by assigning different weights to different parts of the input. Since it can bring significant improvement, AM has recently become almost a standard component in many sequential tasks (Veličković et al. 2018; Zeng et al. 2021). In the field of CPDP, project source code is also sequential data, and we believe CPDP would benefit from the LSTM network and AM, too.

Fig. 1

(a) Knowledge learned from source project is directly transferred to target project. (b) Traditional transfer learning methods may fail to consider the relationship between the decision boundary and the target samples, which results in performance loss. (c) We train two classifiers simultaneously, which try to categorize source instances correctly and distinguish ambiguous target instances at the same time. In these figures, dashed lines represent the boundary between the source and target sample space while the solid lines represent the decision boundary learned by CPDP models

Another challenging problem in CPDP tasks is how to effectively transfer knowledge learned from one project and apply it to other projects. Most successful CPDP methods (Yu et al. 2017; Ryu et al. 2017; Hosseini et al. 2018; Wu et al. 2018; Liu et al. 2019; Gong et al. 2020) adopted transfer learning approaches to bridge the gap between the feature distributions of two different projects. Although inspiring and straightforward, these CPDP models might overlook the relationship between the target project instances and the decision boundary. Chen et al. (2015) put forward a CPDP method called Double Transfer Boosting, which re-calculated instance weights and reduced the divergence of two distributions based on TrAdaBoost (Dai et al. 2007). Yu et al. (2017) proposed a feature matching and transfer method to perform CPDP. They designed an algorithm to match and transfer features in the source and target projects with respect to their distance. Xu et al. (2019) came up with a transfer learning method, named Balanced Domain Adaptation, to assign different weights to the marginal and conditional distributions of two projects. By balancing these two kinds of distributions, their approach was able to effectively overcome the divergence of two projects. However, these methods might not be capable of mining predictive features since they did not take the relationship between the target data and the decision boundary into account when matching the two distributions. From a mathematical point of view, this relationship can be defined as the distance between the boundary and the target data. As shown in Fig. 1, even if the two distributions are quite similar, the feature generator could generate ambiguous instances near the decision boundary if we do not take this distance into consideration. We hypothesize that these ambiguous target samples could be misclassified and therefore result in poor prediction performance.

In our work, we propose a novel Adversarial Domain Adaptation (ADA) method for CPDP to handle the drawbacks mentioned above. The proposed approach, as shown in Fig. 1, adopts a domain adaptation method based on adversarial learning to rectify ambiguously classified data, which not only alleviates the gap between the feature distributions of different projects, but also takes the relationship between the target project instances and the learned decision boundary into account to further enhance performance. The general framework of the ADA method is presented in Fig. 2. Specifically, we first compile the project source files to generate the corresponding abstract syntax trees (ASTs), and then convert the ASTs into token vectors by traversing them. Using certain mapping rules, we map the token vectors to integer vectors and feed them into the following feature extraction network. We adopt a bi-directional LSTM network with a self-attention layer to capture the programs' hidden semantic features. To avoid overfitting and obtain more accurate features, we amalgamate the generated features with the handcrafted features to construct joint features. Inspired by Generative Adversarial Networks (GAN) (Goodfellow et al. 2014) and an adversarial domain adaptation method proposed by Saito et al. (2018), we utilize a domain adaptation method trained in the manner of adversarial learning to improve the discriminability of the joint features. In ADA, we simultaneously train two classifiers that take the joint features as input, trying to correctly classify source samples while identifying the target instances that are not close to the support of the source. We presume these target instances are non-discriminative and ambiguous since they are not categorized explicitly as negative or positive. Therefore, we regard these two classifiers as a discriminator that judges whether the generator (the LSTM network with an attention layer) creates discriminative features for the target instances. In this way, the disagreement of the two classifiers can be used to estimate the distance between the decision boundary and the target instances. By repeating this procedure, ADA reduces the divergence between the feature distributions of the two projects, and informs the classifiers about the relationship mentioned above. Knowing this relationship, the classifiers can correctly categorize those confusing target instances and thereby improve defect prediction performance.

Fig. 2

(a) In the phase of data preprocessing, we first parse the project source files into abstract syntax trees (ASTs), and then traverse the ASTs to obtain token vectors. Lastly, we convert the token vectors into integer vectors by pre-defined mapping rules. (b) In the phase of model construction, we first feed the integer vectors into the generator (LSTM network with AM) to extract preliminary semantic features. Then, we combine the semantic and handcrafted features to form joint features. Next, we utilize the joint features to simultaneously train two classifiers based on adversarial learning methods

To validate the effectiveness of the ADA method, we conduct multiple experiments on two public benchmark datasets, AEEEM (D’Ambros et al. 2010) and PROMISE (Jureczko and Madeyski 2010). In many experimental settings, the proposed model is superior to other state-of-the-art methods by a significant margin. The contributions of this work are three-fold.

  • We propose a novel CPDP model which integrates transfer learning and adversarial learning methods to not only bridge the gap between the feature distributions of different projects, but also fully consider the relationship between the target data and the class boundary.

  • We extend the LSTM network with the Attention Mechanism to learn long-term dependencies in a software project context and extract more meaningful semantic features of programs. Together with AM, the LSTM network significantly improves the performance of defect prediction.

  • We conduct extensive experiments on two benchmark datasets and the experimental and statistical results verify the effectiveness of the proposed method.

The rest of this paper consists of six parts. Firstly, we briefly review the related work in Section 2. Then, we thoroughly illustrate the ADA model in Section 3. In Section 4, we describe the details of the experimental setups. Next, we present the experimental and statistical results in Section 5. More discussions on the proposed ADA method are given in Section 6. Lastly, we conclude our work and discuss future work in Section 7.

2 Related Work

In this section, we briefly introduce related work on cross-project defect prediction, adversarial domain adaptation and attention mechanism.

2.1 Cross-Project Defect Prediction

Software Defect Prediction (SDP) (Li et al. 2018; Zou et al. 2018; Thota et al. 2020; Rathore and Kumar 2021) aims to find out defects in software modules before release so that developers can allocate limited resources optimally. According to the project data used in the training and evaluating phase, there are two branches in SDP, namely Within-Project Defect Prediction (WPDP) and Cross-Project Defect Prediction (CPDP). In WPDP, the data are from the same project in the training and evaluating phase. By contrast, CPDP uses different project data when training and evaluating. In our work, we concentrate on the problem of CPDP.

To evaluate the feasibility of CPDP, Zimmermann et al. (2009) carried out a total of 622 cross-project experiments on 12 software projects using logistic regression classifiers. Their experimental results showed that only 3.4% of them were successful. They came to the conclusion that CPDP was a challenging problem because it was not transitive. Researchers hoped to improve software defect prediction performance by applying deep learning (LeCun et al. 2015) methods to mine the semantic features of programs due to their prominent feature learning capability. Wang et al. (2016) utilized a DBN model to automatically extract semantic features from the ASTs of programs, and bridged the gap between semantic features and defect prediction features of different projects. Tong et al. (2019) adopted a transfer naive Bayes approach to consider both the class-imbalance and feature importance problems, and used it to make two feature distributions as similar as possible. Li et al. (2017) applied a CNN to generate effective features of programs. Combined with traditional handcrafted features, the prediction model they trained achieved better performance than other baseline methods. Different from the deep learning methods adopted in the above CPDP studies, Pandey and Tripathi (2021) employed an LSTM network to mine programs' semantic features. In the context of CPDP, project source code is a kind of formal language containing rich structural and semantic information, which is closely related to the context of NLP problems (Li et al. 2019). Therefore, we believe that the Recurrent Neural Network (RNN) is more suitable since it excels at processing sequential input, and we leverage a variant of RNN, the LSTM network, to mine the structural and semantic information contained in programs.

To further verify the feasibility of CPDP, He et al. (2012) carried out three experiments on 34 projects by manually selecting training data. They concluded that CPDP performance could be better than that of WPDP tasks in some cases. After further analysis, they concluded that defect prediction results were related to the distributional characteristics, which could be valuable for training data selection. Their research suggested that CPDP tasks were feasible if the data distributions of two different projects can be made similar. Taking their suggestions, Herbold (2013) put forward a distance-based algorithm for training data selection according to the corresponding distributional characteristics. Concretely, they came up with characteristic vectors, consisting of the mean and standard deviation, to represent each dataset, and adopted the characteristic vectors to stand for the marginal distribution of each dataset. Herbold conducted experiments on 44 public software projects and witnessed a 9% improvement in performance over traditional CPDP methods. Apart from training data selection methods, transfer learning techniques (Nam et al. 2013; Qiu et al. 2019d; Liu et al. 2019; Jin 2021; Huang et al. 2021) were also adopted to make the two distributions similar. Liu et al. (2019) came up with a CPDP method named Two-Phase Transfer Learning. Firstly, they chose the two most similar source projects by means of a source project estimator. Secondly, they leveraged TCA+ (Nam et al. 2013) to construct two defect predictors based on the two selected projects individually, and then combined their prediction probabilities to enhance performance. Qiu et al. (2019d) proposed a CPDP model called Transfer CNN (TCNN). They adopted a CNN to extract the semantic features of project data and added a matching layer to align the feature distributions of two different projects. When matching, they embedded the source and target data representations into a reproducing kernel Hilbert space and utilized a classic transfer learning method, Maximum Mean Discrepancy (MMD) (Borgwardt et al. 2006), to bridge the gap between the two distributions. Next, they combined the generated semantic features with handcrafted features and trained the predictor on them. Jin (2021) learned a domain adaptation model with a method called the kernel twin support vector machine, trying to match the feature distributions of the source and target projects as much as possible. He trained a feature generator that aimed to match the distributions of the two projects and assumed that the matched target features would be correctly categorized by the defect predictor because they were aligned with the source instances.

Compared to CPDP, heterogeneous defect prediction (HDP) (Nam and Kim 2015; Chen et al. 2021) relaxes the limitation on the defect data used for prediction, allowing different metric sets in the source and target projects. Many researchers have since produced fruitful work on HDP. Jing et al. (2015) proposed a unified metric representation for the defective data, and used it for further canonical correlation analysis (CCA) to reduce the gap between two domains. Based on CCA, Li et al. (2018) also proposed an HDP model called cost-sensitive transfer kernel canonical correlation analysis, which can not only make the data distributions of the source and target projects much more similar in the nonlinear feature space, but also utilize different misclassification costs for the defective and non-defective classes to alleviate the class imbalance problem. Li et al. (2019) employed multiple sources to improve the performance of the HDP model, putting forward a multi-source selection based manifold discriminant alignment approach. Their experimental results verified the performance gain. Bal and Kumar (2023) enhanced the data pre-processing for HDP by utilizing the chi-square test to select the relevant metrics between source and target datasets. Finally, they performed experiments using their proposed approach with various machine learning algorithms to verify the effectiveness of their model. Note that we only concentrate on "homogeneous" (non-heterogeneous) defect prediction.

Most of the above CPDP studies adopted different approaches (training data selection (Herbold 2013) or transfer learning (Qiu et al. 2019d; Liu et al. 2019; Jin 2021)) to make the feature distributions of two different projects as similar as possible. However, they might fail to take full account of the relationship between the decision boundary and the target data when matching the distributions. They used various transfer learning approaches to match the two distributions, hoping that the target samples would then be correctly classified by the defect predictor. As we mentioned before, even though the two distributions are quite similar, the classifier can still be confused by target samples with ambiguous features. In this paper, we put forward a domain adaptation method based on adversarial learning to fully consider the relationship between the target instances and the decision boundary, which can eliminate this confusion for the predictor.

2.2 Adversarial Domain Adaptation

Domain adaptation is a prevalent type of transfer learning in which the target task remains the same as the source task whereas the domain differs (Pan and Yang 2010). Since the generative adversarial network (GAN) was proposed by Goodfellow et al. (2014), many scholars have applied it to domain adaptation to solve specific tasks, including image classification (Tzeng et al. 2017; Saito et al. 2018; Ma et al. 2019), object detection (Song et al. 2020; Su et al. 2020), machine translation (Wang et al. 2021) and image semantic segmentation (Li et al. 2019; Yi et al. 2021). Tzeng et al. (2017) proposed an adversarial discriminative domain adaptation method to classify images. They first learned a discriminative representation based on source data. Then, they learned a separate encoding based on transfer learning to map the target data to the same feature space as the source. Lastly, they trained the whole model by minimizing a domain-adversarial loss function. Song et al. (2020) proposed a method for the salient object detection problem based on adversarial domain adaptation. To evaluate the effectiveness of their model, they collected a new dataset and compared their method with others on it. In the field of machine translation, Wang et al. (2021) came up with a counterfactual domain adaptation method to improve target domain translation. Adopting adversarial learning methods, they used the concatenations of texts in the source domain and tags in the target domain to construct counterfactual representations. Motivated by adversarial learning, Li et al. (2019) put forward a bidirectional learning model to solve the problem of image semantic segmentation. They separated their model into two submodules, an image-to-image translation model and a segmentation adaptation model, which were trained to promote each other alternately and gradually reduce the domain gap. Saito et al. (2018) put forward an adversarial domain adaptation method, attempting to match the distributions of source and target by using task-specific decision boundaries. They proved that this method outperformed other methods in the tasks of image classification and semantic segmentation.

Inspired by this fruitful work, we believe that CPDP models will benefit from adversarial learning since CPDP is also an application of domain adaptation. In particular, we assume that we can take advantage of the training pattern adopted by Saito et al. (2018), using multiple task-specific classifiers as a discriminator and a feature generator that tries to "fool" them in order to generate more discriminative features. Different from previous CPDP methods that applied traditional domain adaptation (Jin 2021; Xu et al. 2018; Qiu et al. 2019b; Zou et al. 2021), which tried to align two distributions by optimizing appropriate distance metrics, in this work we develop a domain adaptation method based on adversarial learning to reduce the divergence between the feature distributions of different projects.

3 Proposed Method

In this section, we elaborate on the ADA method in detail. Firstly, we give the formal formulation of the CPDP problem. Then, we present the overall framework of the proposed approach. A discussion of the data preprocessing of the software projects is given in Section 3.2, including program parsing and imbalanced learning. In Section 3.3, we elaborate on how we construct the CPDP model.

3.1 Problem Definition

Let the given source project with labelled data be \( D_\textrm{S} = \{(\varvec{x}_{\textrm{S}_i}, y_{\textrm{S}_i})\}_{i=1}^n \), where \( \varvec{x}_{\textrm{S}_i} = \left\{ x_{\textrm{S}_{i1}}, \ldots , x_{\textrm{S}_{id}}\right\} \in \mathbb {R}^{d}\) denotes the i-th source instance, and \( y_{\textrm{S}_i} \in \{0, 1\}\) is the corresponding defect information (0 for non-defective and 1 for defective). Let the given target project without labelled data be \( D_\textrm{T} = \{\varvec{x}_{\textrm{T}_i}\}_{i=1}^m \), where \( \varvec{x}_{\textrm{T}_i} = \left\{ x_{\textrm{T}_{i1}}, \ldots , x_{\textrm{T}_{id}}\right\} \in \mathbb {R}^{d}\) denotes the i-th target instance. We assume that both source and target samples share the same feature space as they come from the same set of metrics (i.e. \( \varvec{x}_\textrm{S}, \varvec{x}_\textrm{T} \in \mathbb {R}^d\) where d denotes the dimension of the feature). Let n and m be the numbers of instances in the source and target projects, respectively. Let \( P_\textrm{S}(\textbf{X}_\textrm{S}) \) and \( P_\textrm{T}(\textbf{X}_\textrm{T}) \) be the marginal probability distributions of \( \textbf{X}_\textrm{S}=\{\varvec{x}_{\textrm{S}_i}\}_{i=1}^n \) and \( \textbf{X}_\textrm{T}=\{\varvec{x}_{\textrm{T}_i}\}_{i=1}^m \) from the source and target projects, respectively. Generally, the distributions of two distinct projects are also different, which implies \( P_\textrm{S}(\textbf{X}_\textrm{S}) \ne P_\textrm{T}(\textbf{X}_\textrm{T}) \). Cross-project defect prediction aims to enhance the performance of the target predictor \( f_{T}(\cdot ) \) on the target project \( D_\textrm{T} \) by utilizing the knowledge learned from the source project \( D_\textrm{S} \).

In this paper, we learn the optimal parameters \( \theta ^{*} \) of our method by solving the following minimization problem,

$$\begin{aligned} \begin{aligned} \theta ^{*}&=\underset{\theta \in \varTheta }{\arg \min }\ \sum _{(\varvec{x}, y) \in D_\textrm{S}} \frac{P_\textrm{T}\left( \textbf{X}_\textrm{T}\right) }{P_\textrm{S}\left( \textbf{X}_\textrm{S}\right) } P\left( \textbf{X}_\textrm{S}\right) \ell (\varvec{x}, y, \theta ) \\&\approx \underset{\theta \in \varTheta }{\arg \min }\ \sum _{i=1}^{n} \frac{P_\textrm{T}(\varvec{x}_{\textrm{T}_{i}})}{P_\textrm{S}\left( \varvec{x}_{\textrm{S}_{i}}, y_{\textrm{S}_{i}}\right) } \ell \left( \varvec{x}_{\textrm{S}_{i}}, y_{\textrm{S}_{i}}, \theta \right) , \end{aligned} \end{aligned}$$
(1)

where \( \varTheta \) is the parameter space and \( \ell (\varvec{x}, y, \theta ) \) denotes the error function relying on \( \theta \). Therefore, by assigning distinct weights to each sample \( (\varvec{x}_{\textrm{S}_{i}}, y_{\textrm{S}_{i}}) \) with corresponding value \( \frac{P_\textrm{T}(\varvec{x}_{\textrm{T}_{i}})}{P_\textrm{S}\left( \varvec{x}_{\textrm{S}_{i}}, y_{\textrm{S}_{i}}\right) } \), we are able to build an accurate predictor for the target project.

3.2 Data Preprocessing

Figure 2 presents the general framework of the proposed ADA approach. We divide the whole framework into two steps: data preprocessing and model construction. In this subsection, we discuss the details of data preprocessing in ADA method.

3.2.1 Generating Input Vectors

An abstract syntax tree (AST) is a representation of the syntactic structure parsed from source code; it contains rich semantic information about the program and is considered highly useful in the field of program analysis (Alon et al. 2019; Compton et al. 2020). Early studies (Wang et al. 2016, 2020; Chen et al. 2016; Balog et al. 2017) have proven that ASTs can be mined and utilized in software defect prediction, so we choose the AST as a high-level representation of the project source code. Specifically, we use an open-source compiling tool, Javalang, to parse Java source files and generate the corresponding ASTs. There are many types of nodes in ASTs, but only a part of them are highly related to code defects. Following previous work (Wang et al. 2020; Deng et al. 2020; Huang et al. 2021), we evaluate the AST nodes and select four kinds of nodes: 1) method invocation nodes, 2) declaration nodes, 3) control flow nodes and 4) other necessary nodes. Then, we employ depth-first traversal to generate sequence vectors from the ASTs. Since the sequence vectors are lists of string tokens, which cannot be directly used as input to an LSTM network, we construct a mapping dictionary between nodes and integers. After this step, we have converted the project source files into integer vectors.
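As an illustration, the parsing and encoding step can be sketched in Python with the javalang package. This is only a minimal sketch: the selected node classes and the helper names below are simplified placeholders, not the exact node set or code used in our implementation.

```python
import javalang

# A simplified placeholder set of AST node types: method invocations,
# declarations and control-flow nodes.
SELECTED_NODES = (
    javalang.tree.MethodInvocation,
    javalang.tree.ClassDeclaration,
    javalang.tree.MethodDeclaration,
    javalang.tree.IfStatement,
    javalang.tree.ForStatement,
    javalang.tree.WhileStatement,
)

def file_to_tokens(java_source: str):
    """Parse a Java file and return a token sequence via depth-first traversal."""
    tree = javalang.parse.parse(java_source)
    tokens = []
    for _, node in tree:  # javalang yields nodes in depth-first order
        if isinstance(node, SELECTED_NODES):
            # use the declared/invoked name if present, otherwise the node type
            name = getattr(node, "name", None) or getattr(node, "member", None)
            tokens.append(name if name else type(node).__name__)
    return tokens

def tokens_to_integers(token_lists):
    """Build a token-to-integer mapping dictionary and encode every file."""
    vocab, encoded = {}, []
    for tokens in token_lists:
        vec = []
        for tok in tokens:
            if tok not in vocab:
                vocab[tok] = len(vocab) + 1  # reserve 0 for padding
            vec.append(vocab[tok])
        encoded.append(vec)
    return encoded, vocab
```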

3.2.2 Imbalanced Learning

The imbalanced learning problem is one of the challenging problems in machine learning, in which the number of samples of one class greatly exceeds that of another (He and Garcia 2009). This problem also arises in the field of SDP (Jing et al. 2017; Tong et al. 2019; Bal and Kumar 2020); we discuss the details of the datasets chosen in this paper in Section 4.1. Qiu et al. (2019c) and Song et al. (2019) carried out large-scale experiments to explore the characteristics of the imbalanced learning problem and systematically evaluated multiple imbalanced learning methods. Their research proved that class imbalance is omnipresent and can significantly affect the performance of prediction models. Nevertheless, we can alleviate this problem by adopting appropriate imbalanced learning approaches.

There are two types of methods for coping with class imbalance: over-sampling methods and under-sampling methods. The former over-sample the minority class and the latter under-sample the majority class, both of which balance the numbers of instances in the two classes. In our approach, we apply the Synthetic Minority Over-Sampling Technique (SMOTE) (Chawla et al. 2002), which synthesizes new minority-class instances and is commonly combined with under-sampling of the majority class. SMOTE handles the skewed class distribution by introducing a bias towards the minority class, thereby achieving better performance than minority over-sampling with replacement.
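As a concrete sketch, the re-balancing step could be performed with the SMOTE implementation from the imbalanced-learn package; the library choice and names here are illustrative assumptions rather than our exact implementation.

```python
from collections import Counter
from imblearn.over_sampling import SMOTE

def balance_source(X_source, y_source, random_state=42):
    """Re-balance labelled source-project data before training."""
    print("before:", Counter(y_source))      # e.g. {0: 800, 1: 120}
    sampler = SMOTE(random_state=random_state)
    X_bal, y_bal = sampler.fit_resample(X_source, y_source)
    print("after: ", Counter(y_bal))         # e.g. {0: 800, 1: 800}
    return X_bal, y_bal
```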

3.3 Model Construction

In this section, we illustrate in detail how we build the ADA model based on adversarial learning methods. Figure 1 provides a brief introduction to our approach. We first leverage a bi-directional LSTM network with AM as a feature generator, which takes the integer vectors as input. Then, we concatenate the generated semantic features and the handcrafted features to construct the joint features. Next, we simultaneously train two classifiers as a discriminator by feeding the joint features into them. The two classifiers attempt to categorize the source samples properly. In the meanwhile, they are also trained to find out which target instances are distant from the support of the source. We reckon that these target instances may confuse the classifiers, as most of them are likely to be misclassified, and we consider these target samples non-discriminative. Based on the adversarial training process (Chen et al. 2020), the generator is forced to create discriminative target features near the support by taking the discriminator's predictions for the target samples into account. In the adversarial learning manner, the generator attempts to fool the discriminator, and the discriminator gives feedback to the generator on whether the extracted features are the desired ones. Through this training pattern, we are able to align the feature distributions of the source and target projects and, in the meanwhile, let the classifiers know how to correctly predict ambiguous target instances. After training, the ADA model can predict whether a new project instance is defective or not.

3.3.1 Generator

Since the source code of a software project is a kind of standardized and formal language (Huang et al. 2021), we believe CPDP tasks are closely related to the NLP context. LSTM networks (Hochreiter and Schmidhuber 1997) are well suited to NLP tasks because they can learn long-term dependencies through sophisticated gated computations. To mine which code snippets are the most important for defect prediction, we also adopt the attention mechanism to learn and assign different weights to the sequential data. Together with AM, we expect the LSTM network to be powerful enough to extract the contextual and semantic features of the project data.

Fig. 3

The network architecture of the generator

The overall network architecture of the generator is described in Fig. 3. The feature generator comprises five parts: an input layer, an embedding layer, a bi-directional LSTM layer, an attention layer and an output layer. We discuss the latter four layers as follows.

Embedding Layer

As a useful technique for encoding semantic information, word embeddings have been proven powerful as extra features in many NLP tasks (Almeida and Xexéo 2019). Therefore, we convert each integer vector into the corresponding embedding matrix by means of a word embedding layer,

$$\begin{aligned} \varphi :T\rightarrow W_e, \end{aligned}$$
(2)

where \( T \in \mathbb {R}^N \) represents the input integer vector after AST parsing, \( W_e \in \mathbb {R}^{E*N} \) is the embedding matrix to be learned and \( \varphi \) is a mapping function between them. We randomly initialize the embedding matrix and it can be updated when training the whole network. We assume that the high-dimensional embedding matrix is able to capture more contextual information contained in ASTs. Encoded integer vectors represented by the embedding matrices are then fed into a bi-directional LSTM network to create preliminary semantic features.

LSTM Layer

The LSTM network is good at dealing with sequential data and can explore long-term dependencies in the semantic context. To feed the embedding matrix into the LSTM, we split it into N column vectors with E dimensions,

$$\begin{aligned} W_e=\left[ e_{1}, e_{2}, \ldots , e_{N}\right] , \end{aligned}$$
(3)

where \( e_{i} \in \mathbb {R}^{E} (i=1,2,\ldots ,N) \). There are three types of gates, including input, output and forget gates, in LSTM networks. Based on these gates, LSTM can control how the information is processed and memorized in the cell states of LSTM units. The forget gates decide what information ought to be ignored from the previous moment, which can be described as

$$\begin{aligned} f_{t}=\sigma \left( W_{\textrm{fg}} \cdot \left[ h_{t-1}, e_{t}\right] +b_{\textrm{fg}}\right) , \end{aligned}$$
(4)

where \( \sigma (\cdot ) \) denotes the sigmoid function used for neural activation, \( h_{i} \left( 0 \le i \le N \right) \) is the hidden state of the memory cell at moment i and \( W_{\textrm{fg}} \) and \( b_{\textrm{fg}} \) are two parameters of the forget gate. Different from forget gates, input gates decide what to memorize from the current moment. This includes two parts, one of which is the activation result of the input and the other of which is the \( \tanh \) result of the input. We can formulate them as follows.

$$\begin{aligned} \begin{aligned} i_{t}&=\sigma \left( W_{\textrm{in}} \cdot \left[ h_{t-1}, e_{t}\right] +b_{\textrm{in}}\right) , \\ \tilde{C}_{t}&=\tanh \left( W_{C} \cdot \left[ h_{t-1}, e_{t}\right] +b_{C}\right) , \end{aligned} \end{aligned}$$
(5)

where \( W_{\textrm{in}}, b_{\textrm{in}}, W_{C}, b_{C} \) are the corresponding weights and biases. \( \tilde{C}_{t} \) is called candidate value of moment t, which is a part of updating to the current cell state. We multiply the old state by \( f_{t} \), trying to forget the information we decide to ignore, and add the candidate value scaled by the input gate’s result:

$$\begin{aligned} C_{t}=f_{t}*C_{t-1}+i_{t}*\tilde{C}_{t}. \end{aligned}$$
(6)

The final output is given by the output gates according to the output from the last moment and the cell state at the current moment, namely

$$\begin{aligned} \begin{aligned} o_{t}&=\sigma \left( W_{\textrm{out}} \cdot \left[ h_{t-1}, e_{t}\right] +b_{\textrm{out}}\right) , \\ h_{t}&=o_{t}*\tanh (C_{t}), \end{aligned} \end{aligned}$$
(7)

where \( W_{\textrm{out}}, b_{\textrm{out}} \) are the weights and bias of the output gates. Through the mechanisms of LSTM networks, we are able to learn the dependencies of defective codes in the context, which is greatly helpful for defect prediction.
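To make the gate equations (4)-(7) concrete, the following minimal sketch spells out a single forward step of one LSTM cell in PyTorch. In practice, nn.LSTM performs this computation internally; the parameter names here are ours for illustration only.

```python
import torch

def lstm_cell_step(e_t, h_prev, C_prev, params):
    """One time step of an LSTM cell, following Eqs. (4)-(7).

    e_t: embedding vector at moment t, shape (E,)
    h_prev, C_prev: hidden and cell states from moment t-1, shape (H,)
    params: dict of weight matrices of shape (H, H+E) and biases of shape (H,)
    """
    z = torch.cat([h_prev, e_t])                                # [h_{t-1}, e_t]
    f_t = torch.sigmoid(params["W_fg"] @ z + params["b_fg"])    # forget gate, Eq. (4)
    i_t = torch.sigmoid(params["W_in"] @ z + params["b_in"])    # input gate, Eq. (5)
    C_tilde = torch.tanh(params["W_C"] @ z + params["b_C"])     # candidate value, Eq. (5)
    C_t = f_t * C_prev + i_t * C_tilde                          # new cell state, Eq. (6)
    o_t = torch.sigmoid(params["W_out"] @ z + params["b_out"])  # output gate, Eq. (7)
    h_t = o_t * torch.tanh(C_t)                                 # new hidden state, Eq. (7)
    return h_t, C_t
```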

Attention Layer

The semantic features generated by the bi-directional LSTM network could be directly used by classifiers to train the prediction model. Though simple and effective, directly feeding these features into the classification model may result in performance loss. As in natural language, different words have different importance, and people tend to put more emphasis on the more significant words rather than the less relevant ones. Since programming languages conform to the paradigm of natural languages to some extent, we believe different code snippets contribute differently to software defect prediction. To formally depict these differences, we introduce the attention mechanism (Vaswani et al. 2017) to assign weights to the generated semantic features. Firstly, we input the hidden state \( h_{i} \) and the output feature \( o_{t-1}\) at moment \( t-1 \) into an alignment model \( \zeta (\cdot )\) to compute the alignment score,

$$\begin{aligned} a_{t,i} = \zeta \left( o_{t-1}, h_{i}\right) . \end{aligned}$$
(8)

We use a multi-layer perceptron as the alignment model, which evaluates how well the elements of the input sequence align with the current output at moment t. Then, we obtain the weights \( \alpha _{t,i} \) by applying a softmax function to the previously computed alignment scores,

$$\begin{aligned} \alpha _{t,i} = {\text {softmax}} \left( a_{t,i}\right) = \frac{\exp \left( a_{t,i}\right) }{\sum _{i=1}^{N} \exp \left( a_{t,i}\right) }. \end{aligned}$$
(9)

Finally, we obtain a context vector \( q_{t} \) at moment t by adding up the corresponding weights of all moments,

$$\begin{aligned} q_{t} = \sum _{i}^{N} \alpha _{t,i} h_{i}. \end{aligned}$$
(10)

Output Layer

The context vector produced by the attention mechanism is then concatenated with the handcrafted features to construct the joint features for further prediction. We can view the generator as a mapping \( \mathcal {G} \) from the AST input integer vectors to the joint features,

$$\begin{aligned} \mathcal {G}:\mathcal {V}\rightarrow R_{j}, \end{aligned}$$
(11)

where \( \mathcal {V} \in \mathbb {R}^N \) represents the AST input integer vector and \( R_{j} \in \mathbb {R}^{N_{j}} \) is the joint features vector with \( N_{j}\) dimensions extracted by the generator.
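Putting the four layers together, the generator can be sketched as a PyTorch module like the one below. This is a simplified sketch: the attention scores hidden states directly (a self-attention simplification of the alignment in (8)), the layer sizes follow the settings reported in Section 4.2, and the handcrafted-feature dimension of 20 is only illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Generator(nn.Module):
    """Embedding + bi-directional LSTM + attention, producing joint features."""

    def __init__(self, vocab_size, embed_dim=48, hidden_dim=256, handcrafted_dim=20):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True, bidirectional=True)
        # alignment model zeta(.): a small MLP scoring each hidden state (cf. Eq. (8), simplified)
        self.align = nn.Sequential(nn.Linear(2 * hidden_dim, hidden_dim),
                                   nn.Tanh(),
                                   nn.Linear(hidden_dim, 1))

    def forward(self, token_ids, handcrafted):
        # token_ids: (B, N) integer vectors; handcrafted: (B, handcrafted_dim)
        emb = self.embedding(token_ids)                          # (B, N, E)
        h, _ = self.lstm(emb)                                    # (B, N, 2H) hidden states
        scores = self.align(h).squeeze(-1)                       # (B, N) alignment scores
        alpha = F.softmax(scores, dim=-1)                        # attention weights, Eq. (9)
        context = torch.bmm(alpha.unsqueeze(1), h).squeeze(1)    # (B, 2H) context vector, Eq. (10)
        return torch.cat([context, handcrafted], dim=-1)         # joint features, Eq. (11)
```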

3.3.2 Training Steps

The aim of the ADA method is to minimize the misalignment between the feature distributions of the source and target projects using classifiers that know the relationship between the target instances and the decision boundary. To achieve this goal, we have to distinguish some target instances from others. Generally, instances that are near the class boundary are more likely to be misclassified, and we say these instances are ambiguous. We exploit two distinct classifiers to predict whether a given target instance is defective or not, and then make use of their disagreement to find these target instances. Consider two distinct classifiers, \( \mathcal {F}_{1} \) and \( \mathcal {F}_{2} \), which are initialized differently. Since the labeled source samples are available, we assume that \( \mathcal {F}_{1} \) and \( \mathcal {F}_{2} \) can correctly classify source samples after training. We regard the two classifiers as a discriminator, which tells the generator whether the target features are discriminative by maximizing the discrepancy on the corresponding instances. By doing so, the two classifiers become different and capable of finding ambiguous target instances. Then, the generator is forced not to create such target features by minimizing the same discrepancy over the target instances. The discriminator and generator interact with each other, encouraging the generator to create more predictive features and the discriminator to classify more precisely. We repeat the above steps in an adversarial learning manner.

$$\begin{aligned} \begin{gathered} \min _{\mathcal {G}, \mathcal {F}_{1}, \mathcal {F}_{2}} \mathcal {L}\left( \textbf{X}_\textrm{S}, Y_\textrm{S}\right) , \\ \mathcal {L}\left( \textbf{X}_\textrm{S}, Y_\textrm{S}\right) =-\mathbb {E}_{\left( \varvec{x}_{\textrm{S}},y_{\textrm{S}}\right) \sim \left( \textbf{X}_\textrm{S}, Y_\textrm{S}\right) }\left[ y_{\textrm{S}}\log p\left( y \mid \varvec{x}_{\textrm{S}}\right) +\left( 1-y_{\textrm{S}}\right) \log \left( 1-p\left( y \mid \varvec{x}_{\textrm{S}}\right) \right) \right] , \\ p\left( y \mid \varvec{x}\right) =\frac{p_{1}\left( y \mid \varvec{x}\right) +p_{2}\left( y \mid \varvec{x}\right) }{2} \end{gathered} \end{aligned}$$
(12)

To sum up, we first have to train a generator and two classifiers, all of which must correctly categorize source instances. Then, we apply a discrepancy maximization step to the two classifiers to filter out unwanted target instances, forcing the generator to create more predictive target features. We address this task in the following steps.

Step 1

Firstly, the two classifiers and the generator are trained on the source project data. They have to properly categorize source instances. This step is important because the subsequent derivations are based on this assumption. In our study, we select the Logistic Regression (LR) classifier as the base classifier. Since CPDP is a binary classification task, the binary cross-entropy loss is applied to train the whole CPDP model, which can be written as (12), where \( p_{1}\left( y\mid \varvec{x}\right) \) and \( p_{2}\left( y\mid \varvec{x}\right) \) denote the outputs of the two classifiers for instance \( \varvec{x} \).

Step 2

Secondly, the discriminator (\( \mathcal {F}_{1} \) and \( \mathcal {F}_{2} \)) is trained to find the ambiguous target instances while the generator is fixed. Given a specific target instance, if one classifier categorizes it as negative while the other classifies it as positive, then this instance is likely to be ambiguous. Therefore, in this step, we maximize the disagreement between the two classifiers to find these target instances, that is,

$$\begin{aligned} \begin{gathered} \min _{\mathcal {F}_{1}, \mathcal {F}_{2}} \mathcal {L}\left( \textbf{X}_\textrm{S}, Y_\textrm{S}\right) -\lambda \mathcal {L}_{\textrm{dis}}\left( \textbf{X}_\textrm{T}\right) , \\ \mathcal {L}_{\textrm{dis}}\left( \textbf{X}_\textrm{T}\right) =\mathbb {E}_{\varvec{x}_{\textrm{T}} \sim \textbf{X}_\textrm{T}}\left[ d\left( p_{1}\left( y \mid \varvec{x}_{\textrm{T}}\right) , p_{2}\left( y \mid \varvec{x}_{\textrm{T}}\right) \right) \right] , \\ d\left( p_{1}\left( y \mid \varvec{x}_{\textrm{T}}\right) , p_{2}\left( y \mid \varvec{x}_{\textrm{T}}\right) \right) =\left( p_{1}^{+}-p_{2}^{+}\right) ^{2} + \left( p_{1}^{-}-p_{2}^{-}\right) ^{2}, \end{gathered} \end{aligned}$$
(13)

where \( p_{1}^{+} \)/\( p_{1}^{-} \) and \( p_{2}^{+} \)/\( p_{2}^{-} \) denote the prediction probabilities of \( \mathcal {F}_{1} \) and \( \mathcal {F}_{2} \) for the positive and negative classes, respectively, and \( \lambda >0 \) is a weighting parameter for adjusting the influence of the discrepancy loss. The discrepancy loss equals the expectation of the discrepancy over all target instances, and we choose the \( L_{2} \) distance to calculate the discrepancy between the two classifiers.
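For clarity, the discrepancy term in (13) reduces to the following function of the two classifiers' predicted probabilities on target instances (a minimal sketch with illustrative names).

```python
import torch

def discrepancy_loss(p1, p2):
    """L2 discrepancy between two classifiers' predictions, Eq. (13).

    p1, p2: tensors of shape (B, 2) holding class probabilities
            (column 0: non-defective, column 1: defective) for target instances.
    """
    return torch.mean(torch.sum((p1 - p2) ** 2, dim=1))
```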

Step 3

Finally, the generator is trained to create more discriminative features while the discriminator is fixed. Since the two classifiers are not updated in this step, we minimize the discrepancy above in order to generate discriminative features. Consider a target instance and its generated features: if the prediction results of \( \mathcal {F}_{1} \) and \( \mathcal {F}_{2} \) are the same, then this instance is unambiguous and more likely to be classified correctly. We formulate the objective of this step as follows.

$$\begin{aligned} \min _{\mathcal {G}} \mathcal {L}_{\textrm{dis}}\left( \textbf{X}_\textrm{T}\right) . \end{aligned}$$
(14)

Ideally, the feature distribution of the target data is well aligned with that of the source after this step. As a result, the discriminator is able to achieve performance on the target instances comparable to that on the source project data.

Algorithm 1

ADA training algorithm.

We repeat these three steps during the training phase. Because the classifiers can correctly categorize source samples (Step 1), we train them to detect the desired target instances (Step 2) and force the generator to extract more discriminative features (Step 3). The pseudo-code of the proposed ADA method is described in Algorithm 1. Its approximate time complexity is \( \mathcal {O}\left( n \times N \times T\right) \), where n is the number of instances in the source project, N is the number of neurons in the LSTM network and T is the maximum number of training iterations. By applying the adversarial training method, we can improve the final prediction performance.
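The three steps can be organized into a training loop of roughly the following shape. This is a minimal sketch under our own naming: the data-loader structure and classifier modules are assumptions for illustration, while the optimizer settings follow Section 4.2.

```python
import torch
import torch.nn.functional as F

def train_ada(G, F1, F2, src_loader, tgt_loader, lam=1.0, epochs=100, lr=1e-3):
    opt_g = torch.optim.Adam(G.parameters(), lr=lr)
    opt_f = torch.optim.Adam(list(F1.parameters()) + list(F2.parameters()), lr=lr)

    def cls_loss(feat, y):
        # average the two classifiers' outputs and apply cross entropy, Eq. (12)
        p = (F.softmax(F1(feat), dim=1) + F.softmax(F2(feat), dim=1)) / 2
        return F.nll_loss(torch.log(p + 1e-8), y)

    def dis_loss(feat):
        # L2 discrepancy between the two classifiers on target features, Eq. (13)
        p1, p2 = F.softmax(F1(feat), dim=1), F.softmax(F2(feat), dim=1)
        return torch.mean(torch.sum((p1 - p2) ** 2, dim=1))

    for _ in range(epochs):
        for (xs, hs, ys), (xt, ht) in zip(src_loader, tgt_loader):
            # Step 1: train G, F1, F2 to classify source instances correctly
            opt_g.zero_grad(); opt_f.zero_grad()
            cls_loss(G(xs, hs), ys).backward()
            opt_g.step(); opt_f.step()

            # Step 2: fix G; keep source accuracy while maximizing the
            # classifiers' disagreement on target instances
            opt_f.zero_grad()
            (cls_loss(G(xs, hs).detach(), ys)
             - lam * dis_loss(G(xt, ht).detach())).backward()
            opt_f.step()

            # Step 3: fix F1/F2; train G to minimize the same discrepancy, Eq. (14)
            opt_g.zero_grad()
            dis_loss(G(xt, ht)).backward()
            opt_g.step()
```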

4 Experiment Setups

In this section, we describe our experimental setups in detail, including benchmark datasets, experimental settings, evaluation metrics, baseline methods, statistical analysis methods and research questions.

Table 1 5 projects chosen from the AEEEM dataset
Table 2 10 projects chosen from the PROMISE dataset

4.1 Benchmark Datasets

To assess the ADA model, we utilize two benchmark datasets, AEEEM (D’Ambros et al. 2010) and PROMISE (Jureczko and Madeyski 2010). These two datasets are readily available and widely used in recent CPDP research. The projects they contain are rather representative in the field of software engineering, which can better reflect the realities of defects in general software programs.

AEEEM dataset consists of five open-source Java projects: Eclipse JDT Core (JDT), Eclipse PDE UI (PDE), Equinox framework (Equinox), Mylyn and Apache Lucene (Lucene). There are 61 metrics in it, including source code metrics, entropy-of-change metrics, entropy-of-source-code metrics, etc. Table 1 presents the main information of these projects.

PROMISE dataset is comprised of 48 releases of 15 public open-source projects, 27 releases of 6 proprietary projects and 17 releases of 17 academic projects. All of them are written in Java. In recent CPDP studies (Qiu et al. 2019d, b; Jin 2021), researchers prefer to use the open-source projects. In our work, we carefully select 10 projects (as shown in Table 2). PROMISE dataset consists of 20 handcrafted features (metrics), which mainly concentrate on the programs’ complexity.

In the CPDP setting, we need to select a source project and a target project. Therefore, we first choose a project as the target project, and then treat each of the remaining projects as the source project in turn. In this way, we collect 20 and 90 project pairs from the AEEEM and PROMISE datasets, respectively. In the following sections, we conduct our experiments to perform CPDP on these project pairs.
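For instance, the ordered source/target pairs can be enumerated directly (project names as listed in Table 1):

```python
from itertools import permutations

aeeem = ["JDT", "PDE", "Equinox", "Mylyn", "Lucene"]
pairs = list(permutations(aeeem, 2))   # ordered (source, target) pairs
print(len(pairs))                      # 20; the 10 PROMISE projects yield 90 pairs
```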

4.2 Experimental Settings

There are many hyperparameters in the proposed ADA model, and different values of these parameters can influence the defect prediction performance differently. We conduct cross-validation analysis of these parameters to find the optimal ones. There are two main parameters in our model: the dimension of the embedding E and the penalty coefficient \( \lambda \) that balances the classification loss and the discrepancy loss in (13). We empirically set E to 48. More details about tuning \( \lambda \) are discussed in Section 6.2.

We implement the proposed ADA model with PyTorch (Paszke et al. 2019). The hidden state dimension of the bi-directional LSTM is set to 256 and the batch size in the training phase is set to 32. Adam, an extension of stochastic gradient descent, is adopted to optimize the entire model. We set the momentum to 0.9 by default and the initial learning rate to 0.001. The entire model is trained for 100 iterations. Our experimental environment is a server with an Intel(R) Xeon(R) CPU E5-2618 at 2.20 GHz, 64 GB RAM and eight NVIDIA 1080 Ti GPUs, running Ubuntu 20.04.2 LTS.

Table 3 Confusion matrix for binary classification tasks

4.3 Evaluation Metrics

To measure the proposed ADA method, we employ the following three metrics, namely the \( F_{1} \) measure, balanced accuracy and geometric mean (G-Mean), which are widely used in CPDP research. In binary classification analysis, we often measure a classifier by a confusion matrix (Table 3). According to its true label and classification result, an instance can be a TP (True Positive), FP (False Positive), TN (True Negative) or FN (False Negative). We can derive the metrics used in this work from the confusion matrix. Sensitivity (or recall) evaluates the effectiveness of the classifier on the positive class, while specificity assesses it on the negative class. Precision, another widely used indicator, measures how precise the model's positive predictions are. The definitions of these metrics are as follows.

$$\begin{aligned} \begin{aligned}&\textrm{Sensitivity}=\frac{\textrm{TP}}{\textrm{TP}+\textrm{FN}}, \\&\textrm{Specificity}=\frac{\textrm{TN}}{\textrm{TN}+\textrm{FP}}, \\&\textrm{Precision}=\frac{\textrm{TP}}{\textrm{TP}+\textrm{FP}}. \end{aligned} \end{aligned}$$
(15)

\( F_{1} \) Measure

By definition, a low sensitivity means that many positive samples are missed (underreported), whereas a low precision means that many samples are misreported as positive. These two metrics are usually in conflict: a classifier with high sensitivity is likely to perform poorly on precision, and vice versa. Therefore, we use the \( F_{1} \) measure to balance these two metrics, which is the harmonic mean of the two,

$$\begin{aligned} \begin{aligned} F_{1}&=\frac{2 \times \textrm{Sensitivity} \times \textrm{Precision}}{\textrm{Sensitivity}+\textrm{Precision}} \\&=\frac{2 \textrm{TP}}{2 \textrm{TP}+\textrm{FN}+\textrm{FP}}. \end{aligned} \end{aligned}$$
(16)

Balanced Accuracy

The most commonly used metric for a balanced classification problem is accuracy, which evaluates the overall effectiveness of a model. Accuracy equals the number of correctly classified instances divided by the total number of instances,

$$\begin{aligned} \textrm{Accuracy}=\frac{\textrm{TP}+\textrm{TN}}{\textrm{TP}+\textrm{TN}+\textrm{FP}+\textrm{FN}}. \end{aligned}$$
(17)

Nevertheless, when the data are skewed and imbalanced, accuracy may not be an appropriate metric (Bekkar et al. 2013). In CPDP tasks, a classifier that predicts all samples as the majority class can still perform well on accuracy. Thus, we use balanced accuracy to measure the performance of CPDP models. Balanced accuracy equals the average of specificity and sensitivity, which comprehensively considers the performance of a classifier on both the majority and minority classes. Balanced accuracy is formulated as follows.

$$\begin{aligned} \begin{aligned} \mathrm {Balanced\ Accuracy}&=\frac{1}{2}\left( \textrm{Sensitivity} + \textrm{Specificity}\right) \\&=\frac{1}{2}\left( \frac{\textrm{TP}}{\textrm{TP}+\textrm{FN}} + \frac{\textrm{TN}}{\textrm{TN}+\textrm{FP}}\right) . \end{aligned} \end{aligned}$$
(18)

Geometric Mean

The geometric mean (G-Mean) (Kubat and Matwin 1997) of the sensitivity and specificity is another metric used in imbalanced classification problem. G-Mean tries to maximize the performance of both the majority and minority classes, and keep them balanced at the same time, which is defined as:

$$\begin{aligned} \begin{aligned} \text {G-Mean}&=\sqrt{\textrm{Sensitivity} \times \textrm{Specificity}} \\&=\sqrt{\frac{\textrm{TP} \times \textrm{TN}}{\left( \textrm{TP}+\textrm{FN}\right) \left( \textrm{TN}+\textrm{FP}\right) }}. \end{aligned} \end{aligned}$$
(19)
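Given the confusion-matrix counts, the three metrics follow directly from (16), (18) and (19); a minimal sketch:

```python
def evaluate(tp, fp, tn, fn):
    """Compute F1, balanced accuracy and G-Mean from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    precision = tp / (tp + fp)
    f1 = 2 * sensitivity * precision / (sensitivity + precision)   # Eq. (16)
    balanced_accuracy = (sensitivity + specificity) / 2            # Eq. (18)
    g_mean = (sensitivity * specificity) ** 0.5                    # Eq. (19)
    return f1, balanced_accuracy, g_mean
```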

4.4 Baseline Models

To validate that the proposed ADA model is effective for CPDP tasks, we conduct extensive comparison experiments to show whether our model can outperform other state-of-the-art methods. We select 9 baseline models and summarize them briefly in Table 4.

Table 4 Baseline models for comparison

Please note that we re-implement all the baseline methods except TCNN (whose source code is available online) with PyTorch. All baseline methods are re-trained under roughly the same experimental settings for fair comparison.

4.5 Statistical Analysis Methods

To show the statistically significant difference between two models, we adopt a non-parametric test, the Wilcoxon signed-rank test (Wilcoxon 1945), at a confidence level of 95%, which is commonly used in other defect prediction research (Ryu et al. 2016; Wang et al. 2020; Li et al. 2019). The Wilcoxon signed-rank test relaxes the constraints on the data distribution: it does not require the data to follow any particular distribution, including the normal distribution. At the 95% confidence level, we say that two methods are statistically different from each other if the p-value is less than 0.05. When the p-value is equal to or larger than 0.05, we conclude that the difference between the two methods is not statistically significant.
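In practice, the test is applied to the paired per-project-pair scores of two methods, for example with SciPy; this snippet is shown only to illustrate the procedure, and the significance threshold follows the 95% confidence level above.

```python
from scipy.stats import wilcoxon

def significantly_different(scores_a, scores_b, alpha=0.05):
    """Paired Wilcoxon signed-rank test on per-project-pair scores (e.g., F1)."""
    _, p_value = wilcoxon(scores_a, scores_b)
    return p_value < alpha
```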

Besides, we utilize a variant of the Scott-Knott Effect Size Difference (ESD) test (Tantithamthavorn et al. 2016) to measure the effect size between two models and to rank all compared models. The Scott-Knott ESD test is a comparison method based on the means of the data. It partitions the set of means into statistically distinct groups with non-negligible differences using a hierarchical clustering algorithm. Table 5 describes the meanings of different Scott-Knott ESD values (denoted by d).

Table 5 Scott-Knott Effect Size Difference and corresponding effectiveness level

4.6 Research Questions

To assess the effectiveness and performance of the proposed ADA method, we discuss the following three research questions (RQs).

  • RQ1: Is our proposed ADA method better than other state-of-the-art CPDP models?

  • Motivation: To prove the effectiveness of the proposed ADA method, we need to make comparisons with other state-of-the-art CPDP models. We choose 9 baseline methods (Section 4.4) and compare their predictive performance with that of ADA.

  • RQ2: How effective are the semantic features extracted by the generator that combines an LSTM network and attention mechanisms?

  • Motivation: To verify that our proposed feature generation model is more suitable for CPDP tasks, we need to confirm the effectiveness of our feature generation method.

  • RQ3: How effective is the adversarial domain adaptation method?

  • Motivation: Transfer learning is a powerful tool dealing with CPDP tasks. We need to verify the effectiveness of the proposed adversarial domain adaptation method compared to other transfer learning methods.

Table 6 \( F_{1} \) measure comparison results with 9 methods on PROMISE dataset
Table 7 Balanced accuracy comparison results with 9 methods on PROMISE dataset
Table 8 G-Mean comparison results with 9 methods on PROMISE dataset
Table 9 \( F_{1} \) measure comparison results with 9 methods on AEEEM dataset
Table 10 Balanced accuracy comparison results with 9 methods on AEEEM dataset
Table 11 G-Mean comparison results with 9 methods on AEEEM dataset

5 Experimental Results

In this section, we conduct extensive experiments and present the experimental and statistical results.

5.1 RQ1: Is Our Proposed ADA Method Better than Other State-of-the-Art CPDP Models?

Tables 6, 7, 8, 9, 10 and 11 display the comparison results for the three evaluation metrics on the PROMISE and AEEEM datasets, respectively.

On the PROMISE dataset, there are 10 projects, which form 90 project pairs. We can observe that our proposed ADA method achieves 0.666, 0.722 and 0.713 on average in terms of \( F_{1} \) measure, balanced accuracy and G-Mean, respectively. All of them are the best average values on the corresponding metrics compared to the other baseline methods. From the perspective of Win/Tie/Lose (W/T/L), our model wins 7 projects, ties 0 projects and loses 3 projects on \( F_{1} \) measure compared to the TPTL method, and wins 9 projects, ties 0 projects and loses 1 project on balanced accuracy compared to DMDA_JFR. To better illustrate that our method outperforms the other baseline models, we quantify the degree of improvement. Compared with the DMDA_JFR method, the ADA model attains average improvements of 12.36%, 5.01% and 3.94% in terms of \( F_{1} \) measure, balanced accuracy and G-Mean, respectively; compared to the DBN method, these improvements are 11.30%, 9.40% and 5.34%; compared to the TPTL method, 18.96%, 21.18% and 17.92%; and compared to the STr-NN method, 20.87%, 20.33% and 26.19%. In terms of the p-values given by the Wilcoxon signed-rank test, all p-values are less than 0.05, meaning that the differences between the baselines and the ADA method are statistically significant at the 95% confidence level. From the perspective of the Scott-Knott ESD test, all ESDs are larger than 0.8 on the \( F_{1} \) measure, which implies that the differences are large according to the Scott-Knott ESD effectiveness levels described in Table 5. On balanced accuracy, the ESD of DMDA_JFR is 0.449, indicating that the effectiveness level is small; the same is true for DBN and DMDA_JFR on G-Mean. Notably, our method achieves 0.495 on project Log4j in terms of \( F_{1} \) measure, which is only better than 3 other baseline methods. From Table 2 we can see that the defect rate of Log4j is 92.2%, resulting in an imbalanced data distribution dominated by positive samples. Even though we apply an imbalanced learning method to alleviate this problem, we may still get poor performance when there are insufficient data, which may be the reason why we perform poorly on this project. Consider another imbalanced project, Xalan, with a 98.8% defect rate: since there are far more samples (909 instances) in the Xalan project, our method performs best among the baseline models on it. Therefore, if there are ample samples in a project, ADA is able to attain considerable performance even when they are imbalanced.

On the AEEEM dataset, our model also achieves the best performance among the baseline methods. From these tables we can observe that our method achieves 0.669, 0.654 and 0.681 on average on the metrics of \( F_{1} \) measure, balanced accuracy and G-Mean, respectively. Similarly, all of them are the best average values among all compared methods. The experimental and statistical results on the AEEEM dataset indicate the same conclusion.

Fig. 4 Box charts of three evaluation metrics on the PROMISE dataset (left-hand side) and the AEEEM dataset (right-hand side). Fliers are omitted for simplicity, as with the following box charts. Methods are ranked and grouped according to the Scott-Knott ESD test results; the orange lines represent the methods’ average medians, the green triangles stand for the methods’ means annotated by values, and methods in boxes of the same color belong to the same cluster in the Scott-Knott ESD test

In order to intuitively show the differences among these methods, we draw box charts based on the Scott-Knott ESD test results, as shown in Fig. 4. The methods are ranked and grouped according to the Scott-Knott ESD test, and methods in boxes of the same color show little difference in performance. From the figures we can observe that both the medians and the means of our model on the three metrics are higher than those of the other baseline models on both datasets. In terms of ranking, the proposed ADA model ranks first on all three metrics. The second-tier models are DMDA_JFR and DBN, which rank second and third respectively in terms of balanced accuracy and G-Mean on the PROMISE dataset; TPTL, STr-NN and MANN rank after these two methods. On the AEEEM dataset, TPTL ranks second on balanced accuracy and G-Mean. Based on these comprehensive observations, we can answer RQ1: our proposed ADA method is better than the other state-of-the-art baseline models.

Fig. 5 The box charts of different feature generation models in terms of three evaluation metrics on PROMISE dataset. The orange lines represent the methods’ average medians and the blue triangles stand for the methods’ means annotated by values

5.2 RQ2: How Effective are the Semantic Features Extracted by the Generator that Combines an LSTM Network and Attention Mechanism?

To answer RQ2, we first compare different approaches adopted in CPDP research. Among the baseline models we choose, Seml (Liang et al. 2019) uses an LSTM network to extract the semantic features of programs, while TCNN (Qiu et al. 2019d) utilizes a CNN as the base feature generator. These two methods adopt roughly the same data preprocessing techniques and train their models to perform CPDP tasks. Empirically, Seml performs better than TCNN: referring to Fig. 4, Seml ranks sixth while TCNN ranks seventh or eighth in terms of \( F_{1} \) measure, balanced accuracy and G-Mean on the PROMISE dataset. As we discussed in Section 1, we are more inclined to treat the CPDP problem as an NLP problem, since a programming language is a standard, formal language that follows certain language paradigms. Therefore, recurrent neural networks, which are good at processing sequential data, may perform better than other deep learning methods such as CNN and DBN. The comparisons among Seml, TCNN, DBN and ADA partly support this view.

Table 12 Ablation study of different components on the metric of \( F_{1} \) measure on PROMISE dataset

To make this point more convincing, we further conduct a comparison experiment. We use different feature generator models, including CNN (adopted in TCNN (Qiu et al. 2019d)), DBN (adopted in Wang et al. (2020)), Double Marginalized Denoising Auto-Encoders (DMDA, adopted in DMDA_JFR (Zou et al. 2021)) and LSTM with AM (adopted in this paper), to extract the semantic features separately and feed them into LR classifiers. We train these generators in the manner of adversarial learning, namely by means of the adversarial domain adaptation method. We use the PROMISE dataset because it contains more projects than the AEEEM repository. Figure 5 shows the performance differences between these feature generation models. Our model, which uses the LSTM network as the base feature generator, performs best; DMDA with DA ranks second, and DBN and CNN with DA achieve the worst performance. Compared to DBN or CNN, DMDA uses two auto-encoders to capture the semantic features of programs, which is also suitable for sequential data. DBN builds its prediction model on probability estimation, and CNN relies on convolution operations for feature extraction, both of which are better suited to computer vision tasks. We believe this is the main reason why our feature generation model outperforms the other methods: we fully consider the context of the CPDP problem and leverage an LSTM network with AM to capture the semantic and contextual information contained in programs.
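To make the feature generator concrete, below is a minimal PyTorch sketch of an LSTM encoder with a simple attention layer over the hidden states, in the spirit of the generator described above. The class name LSTMAttentionGenerator, the vocabulary size and the layer dimensions are illustrative assumptions, not the paper's exact architecture.

# Hedged sketch: LSTM feature generator with token-level attention.
import torch
import torch.nn as nn

class LSTMAttentionGenerator(nn.Module):
    def __init__(self, vocab_size=5000, embed_dim=64, hidden_dim=128):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.attn = nn.Linear(hidden_dim, 1)  # scores each time step

    def forward(self, token_ids):
        # token_ids: (batch, seq_len) integer-encoded token sequences of programs
        h, _ = self.lstm(self.embedding(token_ids))   # (batch, seq_len, hidden_dim)
        weights = torch.softmax(self.attn(h), dim=1)  # attention weight per token
        return (weights * h).sum(dim=1)               # weighted sum -> (batch, hidden_dim)

# Toy forward pass with random token ids
generator = LSTMAttentionGenerator()
features = generator(torch.randint(1, 5000, (4, 50)))
print(features.shape)  # torch.Size([4, 128])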

Based on the above analysis, we can answer RQ2: combining an LSTM network with AM, the feature generator of our proposed model is more effective than the other approaches.

Table 13 Ablation study of different components on the metric of balanced accuracy on PROMISE dataset
Table 14 Ablation study of different components on the metric of G-Mean on PROMISE dataset

5.3 RQ3: How Effective is the Adversarial Domain Adaptation Method?

Tables 12, 13 and 14 and Fig. 6 show the results of the ablation study. We examine the influence of each component on the prediction performance, including imbalanced learning techniques (IL), handcrafted features (HF), attention mechanisms (AM) and adversarial domain adaptation methods (DA). Here we focus on the adversarial domain adaptation method (we discuss the ablation study in more detail in Section 6.1). No-DA stands for the variant in which the feature generator is not trained by means of adversarial domain adaptation. We train the generator in an SDP way, that is, we directly feed the output of the generator into the LR classifier without applying any transfer learning or domain adaptation method. From the tables we can see that the performance drops significantly. On the \( F_{1} \) measure, ADA achieves 0.666 on average while No-DA achieves only 0.423, losing more than one third of the performance; on balanced accuracy, ADA achieves 0.722 on average while No-DA achieves only 0.431, a 37.45% drop; on G-Mean, ADA achieves 0.713 on average while No-DA achieves only 0.465, a 34.78% drop. The differences between ADA and No-DA are also obvious in the box charts (Fig. 6).

Fig. 6 The box charts of ablation study results in terms of three evaluation metrics on PROMISE dataset. The orange lines represent the methods’ average medians and the blue triangles stand for the methods’ means annotated by values

To further validate the effectiveness of the adversarial domain adaptation method used in the ADA model, we conduct comparison experiments with another transfer learning method, namely the one adopted in TCNN (Qiu et al. 2019d). TCNN leverages a standard CNN to extract program features, adds a matching layer to the CNN model, and embeds both the source and target data representations into a reproducing kernel Hilbert space; it then adopts Maximum Mean Discrepancy (MMD) to reduce the distribution divergence between the source and target projects. In the comparison experiments, we add the same matching layer to our feature generation model and try to align the two distributions by means of MMD. Table 15 shows the comparison results, where the comparative model is denoted by MMD. From the table and figure we can observe that ADA is better, surpassing MMD by 15.29%, 9.34% and 8.23% in terms of the three metrics, respectively. Compared to MMD and other transfer learning methods adopted in CPDP (Jin 2021; Nam et al. 2013), ADA fully considers the relationship between the decision boundary and the target instances while reducing the distribution divergence, forcing the feature extraction model to generate more discriminative features. We believe this is the main reason why the adversarial domain adaptation method we use is more effective than the others. A hedged sketch of the MMD-style alignment we compare against is given below.
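The following is a minimal, hedged sketch of a Gaussian-kernel MMD loss between source and target feature batches, illustrating the kind of distribution matching the MMD variant performs. The kernel bandwidth, tensor shapes and function names are assumptions, not TCNN's exact configuration.

# Hedged sketch: Gaussian-kernel MMD between source and target features.
import torch

def gaussian_kernel(x, y, sigma=1.0):
    # Pairwise Gaussian kernel values between two feature batches
    dist_sq = torch.cdist(x, y) ** 2
    return torch.exp(-dist_sq / (2 * sigma ** 2))

def mmd_loss(source_feats, target_feats, sigma=1.0):
    k_ss = gaussian_kernel(source_feats, source_feats, sigma).mean()
    k_tt = gaussian_kernel(target_feats, target_feats, sigma).mean()
    k_st = gaussian_kernel(source_feats, target_feats, sigma).mean()
    return k_ss + k_tt - 2 * k_st  # small value means similar distributions

# Toy usage with random features: the loss penalizes distribution divergence
src, tgt = torch.randn(32, 128), torch.randn(32, 128) + 0.5
print(mmd_loss(src, tgt).item())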

Through the above discussions, we are able to answer RQ3: the adversarial domain adaptation method we utilize is the main factor accounting for the performance improvement, and the experimental results show that it is more effective than other transfer learning methods.

6 Discussions

In this section, we further examine the proposed ADA method through several questions, and then discuss the threats to the validity of this work.

6.1 Why does ADA Work?

From the empirical point of view, the experimental and statistical results on two benchmark datasets of 15 software releases show that our proposed ADA method is generally superior to the other baseline methods in terms of three evaluation metrics: \( F_{1} \) measure, balanced accuracy and G-Mean. In this section, we discuss the effectiveness of ADA from the perspectives of problem modeling and the ablation study.

From the perspective of problem modeling, we believe our model is superior to the other models for two main reasons. Firstly, we exploit the semantic and contextual features of programs in a more appropriate way. Because project source code is a standardized, formalized language that follows certain specifications, we believe that the CPDP problem is more in line with the NLP context. Therefore, we extend an LSTM network with an attention mechanism as the feature generator, since it is better at coping with sequential data such as programming languages. AM enables us to assign different weights to distinct tokens. Just as the importance of each word in a sentence of natural language differs, we believe that the contribution of different tokens in a program to software defects also differs. In this way, the feature extraction model can produce better semantic features for prediction.

Secondly, we take an effective measure to transfer knowledge learned from the source project to the target project. Previous CPDP methods (Qiu et al. 2019d; Ryu et al. 2017; Liu et al. 2019) utilized transfer learning methods to reduce the divergence between the feature distributions of different projects. Though simple and effective, most of them fail to consider substantial information about the feature space. In the proposed method, we adopt an adversarial domain adaptation method: by training the feature generator and defect predictor in the manner of adversarial learning, we reduce the divergence between the feature distributions of the source and target projects while fully taking the relationship between the decision boundary and the target project instances into consideration.

As described in Section 3.3, we train the whole model in three steps. In order to observe how the adversarial learning method works, we investigate the trend of the discrepancy value at each iteration. Take the project pair Xerces \( \rightarrow \) Ivy (i.e., Xerces is the source project and Ivy is the target project) as an example. Concretely, we record the discrepancy value between the two classifiers (denoted by \( d_{1} \)) as described in (13), and the difference loss of the feature generator (denoted by \( d_{2} \)) as described in (14) when the two classifiers are fixed. \( d_{1} \) reflects the degree of disagreement between the predictions of the two classifiers on the target project data: a large value means that the current target instance lies near the class boundaries and is considered ambiguous. By maximizing \( d_{1} \), the classifiers become more likely to detect these ambiguous target instances while still correctly classifying non-ambiguous ones. Once this relationship between the decision boundaries and the target instances is known, we can tell the generator not to generate such ambiguous instances. \( d_{2} \) reflects the difference loss of the feature generator on the target project: the smaller the value, the more confidently the features produced by the generator can be classified by the predictor. We improve the discriminability of the generated features by minimizing \( d_{2} \), so that the features of ambiguous instances are generated in regions far away from the decision boundary. We repeat this procedure in an adversarial training manner until convergence. We draw line charts to show the trends of \( d_{1} \), \( d_{2} \) and the loss value in (12), as shown in Figs. 7 and 8. From the line charts we can see that \( d_{1} \) increases and \( d_{2} \) decreases in general, which is in line with our expectation. In this example, the loss stabilizes and no longer decreases after 15 iterations, meaning that the model has finished training and our algorithm has converged.
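To illustrate this alternating scheme, below is a simplified, hedged PyTorch sketch of one training round: a supervised step on the source data, a step that maximizes the classifier discrepancy \( d_{1} \) on the target data with the generator fixed, and a step that minimizes the generator's difference loss \( d_{2} \) with the classifiers fixed. The L1 discrepancy, the toy modules, the \( \lambda \) weight and the optimizer settings are assumptions and do not reproduce Eqs. (12)-(14) exactly.

# Hedged sketch: three-step adversarial training with two classifiers.
import torch
import torch.nn as nn
import torch.nn.functional as F

def discrepancy(logits1, logits2):
    # Mean absolute difference between the two classifiers' predicted probabilities
    return (F.softmax(logits1, dim=1) - F.softmax(logits2, dim=1)).abs().mean()

def train_round(gen, clf1, clf2, opt_g, opt_c, xs, ys, xt, lam=0.8):
    # Step A: supervised training of the generator and both classifiers on source data
    opt_g.zero_grad(); opt_c.zero_grad()
    fs = gen(xs)
    loss_a = F.cross_entropy(clf1(fs), ys) + F.cross_entropy(clf2(fs), ys)
    loss_a.backward(); opt_g.step(); opt_c.step()

    # Step B: fix the generator, maximize the discrepancy d1 on target data
    opt_c.zero_grad()
    fs, ft = gen(xs).detach(), gen(xt).detach()
    loss_b = (F.cross_entropy(clf1(fs), ys) + F.cross_entropy(clf2(fs), ys)
              - lam * discrepancy(clf1(ft), clf2(ft)))
    loss_b.backward(); opt_c.step()

    # Step C: fix the classifiers, minimize the generator's difference loss d2
    opt_g.zero_grad()
    ft = gen(xt)
    loss_c = discrepancy(clf1(ft), clf2(ft))
    loss_c.backward(); opt_g.step()

# Toy usage with random data and tiny stand-in modules
gen = nn.Sequential(nn.Linear(20, 16), nn.ReLU())
clf1, clf2 = nn.Linear(16, 2), nn.Linear(16, 2)
opt_g = torch.optim.Adam(gen.parameters(), lr=1e-3)
opt_c = torch.optim.Adam(list(clf1.parameters()) + list(clf2.parameters()), lr=1e-3)
xs, ys, xt = torch.randn(32, 20), torch.randint(0, 2, (32,)), torch.randn(32, 20)
for _ in range(5):
    train_round(gen, clf1, clf2, opt_g, opt_c, xs, ys, xt)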

Table 15 Comparison results with MMD transfer learning method
Fig. 7 The changes of discrepancy \( d_{1} \) and \( d_{2} \)

Fig. 8 The change of loss value

In order to examine the effectiveness of each component of the proposed model, we design four ablation models. The first model removes the imbalanced learning techniques (No-IL) described in Section 3.2. The second model, No-HF, only takes the semantic features extracted by the generator to train the prediction model, without combining the handcrafted features. The third model, No-AM, removes the attention mechanism from the generator, i.e., it assigns the same weight to every token. The final model, No-DA, is designed without the adversarial domain adaptation method: we train both the feature generator and an LR classifier (as the prediction model) on the source project data and directly apply the model to the target project data without matching the distribution difference between them. Tables 12, 13 and 14 show the experimental results of these models on the metrics of \( F_{1} \) measure, balanced accuracy and G-Mean, and Fig. 6 visualizes the table results.

We can draw some conclusions from the tables and figures. Firstly, the performance of our model drops significantly if we do not apply techniques to tackle the data imbalance problem: compared to the ADA method, the performance of No-IL drops from 0.666 to 0.415, from 0.722 to 0.429 and from 0.713 to 0.454 on the three metrics. Secondly, handcrafted features and attention mechanisms play important roles in generating crucial features for defect prediction. On average, handcrafted features improve the performance by 15.84%, 15.90% and 18.87% on the three metrics. By assigning larger weights to important tokens and smaller weights to irrelevant ones, attention mechanisms bring 11.67%, 13.07% and 15.57% improvement to the ADA model on the three performance indicators, respectively. Last but not least, adversarial domain adaptation plays a significant role in defect prediction. If we take no action to transfer the knowledge learned from the source projects to the target projects, we suffer from poor performance in all aspects. By adopting the domain adaptation method, we enhance the performance from 0.423 to 0.666, from 0.431 to 0.722 and from 0.465 to 0.713 in terms of \( F_{1} \) measure, balanced accuracy and G-Mean.

6.2 How does the Hyper Parameter \( \lambda \) Affect the Performance of ADA?

The hyper parameter \( \lambda \) controls the weight ratio between the classification loss of the prediction model and the discrepancy loss between the two classifiers in (13). If \( \lambda \) is small, we put more emphasis on the classification loss and less on the discrepancy loss; if \( \lambda \) is large, we care more about the discrepancy loss and less about the classification loss. We set \( \lambda = 0.8 \) in our research through cross-validation experiments. We select 14 different values of \( \lambda \) (from 0.1 to 1.3 with step 0.1) and observe how they influence the prediction performance of the ADA model.

Fig. 9 Experimental results of the three metric values of ADA with different \( \lambda \) values on PROMISE dataset

Figure 9 shows the performance of ADA with the 14 selected \( \lambda \) values in terms of \( F_{1} \) measure, balanced accuracy and G-Mean on the PROMISE dataset. From the figure, we can draw two conclusions. First, when \( \lambda \) is relatively small (less than 0.7), the performance on the three metrics is quite low. This is consistent with our earlier discussion: if we put too little weight on the discrepancy, we cannot sufficiently maximize the discrepancy loss between the two classifiers, so they fail to distinguish ambiguous target samples and ADA becomes less effective. Second, when \( \lambda \) lies in the interval \( \left[ 0.8, 1.0\right] \), the performance of ADA is the best, and when \( \lambda \) is larger than 1.0, the performance starts to drop. Based on these observations, we choose \( \lambda = 0.8 \). Note that the other hyper parameters in the ADA model are selected in the same way.

Fig. 10 Experimental results of ADA model with five different classifiers on the three evaluation metrics on PROMISE (left) and AEEEM (right) dataset

6.3 How do Different Classifiers Affect the Performance of ADA?

In this subsection, we discuss why we choose the Logistic Regression classifier as the base classification model. To evaluate the impact of different classifiers on the prediction performance, we choose five classic machine learning classifiers: Logistic Regression (LR), Support Vector Machine (SVM), Neural Network (NN), Random Forest (RF) and Naive Bayes (NB). For SVM, we use the Gaussian radial basis function as the kernel function. The NN has one hidden layer, and the number of hidden neurons is twice the input feature dimension.
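As a hedged sketch of this comparison, the snippet below fits the five classifiers on synthetic features with scikit-learn. The SVM uses the RBF kernel and the NN uses one hidden layer with twice the feature dimension, as stated above; the feature array, labels and all other settings are illustrative assumptions or library defaults.

# Hedged sketch: the five candidate classifiers on synthetic generated features.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.neural_network import MLPClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import GaussianNB

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 40))    # synthetic features standing in for generator output
y = rng.integers(0, 2, size=200)  # synthetic defect labels

classifiers = {
    "LR": LogisticRegression(max_iter=1000),
    "SVM": SVC(kernel="rbf"),
    "NN": MLPClassifier(hidden_layer_sizes=(2 * X.shape[1],), max_iter=500),
    "RF": RandomForestClassifier(),
    "NB": GaussianNB(),
}
for name, clf in classifiers.items():
    clf.fit(X, y)
    print(name, round(clf.score(X, y), 3))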

Figure 10 displays the performance of the proposed method with the five different classifiers on the three evaluation metrics on the PROMISE and AEEEM datasets. From the figure we can observe that the performance of these models does not differ greatly. On the PROMISE dataset, they all achieve about 0.65 on the \( F_{1} \) measure, and LR achieves the best performance among the five models in terms of balanced accuracy and G-Mean. On the AEEEM dataset, the performance of SVM, NN and RF on balanced accuracy is relatively low compared to LR and NB; RF achieves performance comparable to, or even better than, LR in terms of \( F_{1} \) measure and G-Mean, but it fails to compete with LR on balanced accuracy.

To sum up, the classifiers we evaluate do not differ greatly on the performance indicators. Among them, the Logistic Regression classifier consistently achieves the best performance in terms of the three metrics on the two benchmark datasets.

6.4 Threats to Validity

Threats to Construct Validity

We carefully choose three commonly used performance metrics, \( F_{1} \) measure, balanced accuracy and G-Mean, as our evaluation criteria. The \( F_{1} \) measure is the harmonic mean of sensitivity and precision, seeking a balance between missed defects and false alarms. Balanced accuracy and G-Mean are two commonly used metrics when the data distribution is imbalanced and skewed. However, these might not be the only appropriate metrics; other measures, such as the Area Under the ROC (receiver operating characteristic) Curve (AUC) and the Matthews Correlation Coefficient (MCC), can also be used to measure performance in binary classification tasks.
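For clarity, the following hedged sketch computes the three metrics on a small synthetic prediction vector with scikit-learn; here G-Mean is taken as the geometric mean of sensitivity (recall on the defective class) and specificity (recall on the clean class).

# Hedged sketch: computing F1, balanced accuracy and G-Mean on synthetic labels.
import numpy as np
from sklearn.metrics import f1_score, balanced_accuracy_score, recall_score

y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])  # 1 = defective, 0 = clean (synthetic)
y_pred = np.array([1, 0, 0, 1, 0, 1, 1, 0])

f1 = f1_score(y_true, y_pred)
bal_acc = balanced_accuracy_score(y_true, y_pred)
sensitivity = recall_score(y_true, y_pred, pos_label=1)
specificity = recall_score(y_true, y_pred, pos_label=0)
g_mean = np.sqrt(sensitivity * specificity)
print(f"F1={f1:.3f}, balanced accuracy={bal_acc:.3f}, G-Mean={g_mean:.3f}")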

Threats to Internal Validity

For baseline models with open-access source code (such as TCNN), we use the provided implementations to reduce the potential impact of improper re-implementation. For the models whose source code is not provided, we carefully implement them following the details described in the corresponding papers; however, our implementations might not fully reproduce every detail of the baseline methods. For a fair comparison, we apply consistent implementations of the shared components, including data preprocessing and the LR classifier.

Threats to External Validity

We try to reduce bias by carefully selecting two public benchmark datasets of 15 open-source projects. These projects may not represent all software projects, and all of the selected projects are written in the Java programming language. The results of our proposed method on commercial software projects or on projects written in other programming languages may be better or worse. Validating the effectiveness of ADA on more diverse datasets with more projects is needed.

7 Conclusions

The main challenges of the CPDP problem lie in how to construct more meaningful and contextual features that represent programs, and how to effectively transfer knowledge learned from source projects to target projects. In this paper, we proposed a novel ADA method to tackle these two challenges. For feature generation, we extend a variant of recurrent neural networks, the Long Short-Term Memory network, with an attention mechanism. Compared to other deep learning methods, LSTM is better at dealing with sequential data such as project source code, and with the attention mechanism the feature generator can focus on the more important parts and further improve the prediction performance. To effectively transfer knowledge learned from source projects to target projects, we propose an adversarial domain adaptation method to bridge the gap between the two feature distributions. Moreover, our method fully considers the relationship between the target instances and the class boundary when aligning the distributions. We treat the feature extraction model as the generator and train two classifiers as the discriminator. By training the whole model in the manner of adversarial learning, we first maximize the discrepancy of the two classifiers over the target samples to distinguish the ambiguous ones, and then train the generator to create more discriminative features according to the relationship between the target instances and the class boundary.

We conduct extensive experiments on two benchmark datasets of 15 open-source Java projects. The classification performance of ADA is measured with three evaluation metrics: \( F_{1} \) measure, balanced accuracy and G-Mean. We compare ADA with state-of-the-art CPDP models using the Wilcoxon signed-rank test and the Scott-Knott Effect Size Difference test. Experimental and statistical results show that ADA is effective and outperforms the other baseline models by a significant margin.

Several problems need to be investigated in the future. Firstly, our proposed method may not generalize well to other datasets; we will conduct experiments on more projects and extend our method to other programming languages to make ADA more generalizable. Secondly, transfer learning is a powerful tool for dealing with CPDP tasks; more precise and appropriate transfer learning methods that can align two or more feature distributions without losing the information of the source domain need to be explored in the future.