Data Transformation in Cross-project Defect Prediction

Zhang, Feng; Keivanloo, Iman; Zou, Ying

doi:10.1007/s10664-017-9516-2

Data Transformation in Cross-project Defect Prediction

Published: 14 April 2017

Volume 22, pages 3186–3218, (2017)
Cite this article

Download PDF

Access provided by Autonomous University of Puebla

Empirical Software Engineering Aims and scope Submit manuscript

Data Transformation in Cross-project Defect Prediction

Download PDF

1590 Accesses
47 Citations
2 Altmetric
Explore all metrics

Abstract

Software metrics rarely follow a normal distribution. Therefore, software metrics are usually transformed prior to building a defect prediction model. To the best of our knowledge, the impact that the transformation has on cross-project defect prediction models has not been thoroughly explored. A cross-project model is built from one project and applied on another project. In this study, we investigate if cross-project defect prediction is affected by applying different transformations (i.e., log and rank transformations, as well as the Box-Cox transformation). The Box-Cox transformation subsumes log and other power transformations (e.g., square root), but has not been studied in the defect prediction literature. We propose an approach, namely Multiple Transformations (MT), to utilize multiple transformations for cross-project defect prediction. We further propose an enhanced approach MT+ to use the parameter of the Box-Cox transformation to determine the most appropriate training project for each target project. Our experiments are conducted upon three publicly available data sets (i.e., AEEEM, ReLink, and PROMISE). Comparing to the random forest model built solely using the log transformation, our MT+ approach improves the F-measure by 7, 59 and 43% for the three data sets, respectively. As a summary, our major contributions are three-fold: 1) conduct an empirical study on the impact that data transformation has on cross-project defect prediction models; 2) propose an approach to utilize the various information retained by applying different transformation methods; and 3) propose an unsupervised approach to select the most appropriate training project for each target project.

FENSE: A feature-based ensemble modeling approach to cross-project just-in-time defect prediction

Article 20 September 2022

Cross project defect prediction: a comprehensive survey with its SWOT analysis

Article 03 January 2021

A software defect prediction method with metric compensation based on feature selection and transfer learning

Article 04 April 2022

Discover the latest articles, news and stories from top researchers in related subjects.

Artificial Intelligence

Use our pre-submission checklist

Avoid common mistakes on your manuscript.

1 Introduction

A defect is an error that can cause a software system to behave in an unexpected way or produce incorrect results. In the last decade, defect prediction has attracted great attention from both researchers and practitioners. Software metrics (e.g., lines of code and the number of method calls) constitute the major part of the input data used to build a defect prediction model. Earlier studies (e.g., Concas et al. 2007; Louridas et al. 2008; Zhang 2009) report that software metrics rarely follow a normal distribution, but a power-law distribution, which threats the fitness of prediction models to provide an accurate prediction (Cohen et al. 2003).

In the literature of defect prediction, researchers widely apply log and rank transformations to improve the normality of software metrics (e.g., Menzies et al. 2007; Jiang et al. 2008; Cruz and Ochimizu 2009; Song et al. 2011; Zhang et al. 2014). The log transformation is basically a mathematical operation that replaces the original metric values by their logarithm, thus suits log-normal data (i.e., normally distributed data after the log transformation). The rank transformation substitutes the original metric values with their ranks.

Despite of the success to improve the normality of software metrics, the aforementioned transformations fail to constantly improve the performance of defect prediction models in a within-project setting (Jiang et al. 2008). A within-project model is built and applied within the same project. However, to the best of our knowledge, the impact that such transformations have on the performance of defect prediction models has not been thoroughly investigated in a cross-project setting. A cross-project model is built using the training data from one project and applied on the target data from another project.

Cross-project prediction is needed when the target project (e.g., a small or new project) does not have sufficient historical data to build a prediction model (Nagappan et al. 2006). Cross-project prediction experiences a great challenge to deal with the heterogeneity between the training and target projects, since software metrics among different projects often exhibit varied distributions (Zhang et al. 2013). Transformations, if learnt from both the training and target projects, have the potential to mitigate the heterogeneity between the training and target projects. For instance, our previous work (Zhang et al. 2014) successfully implements the context-aware rank transformation towards generalizing defect prediction models. In addition, Ma et al. (2012) propose to transform the training project based on the statistical characteristics learnt from the target project. Nam et al. (2013) apply transfer component analysis (TCA) approach to transform both the training and target projects. Although both approaches on average significantly improve the performance of cross-project predictions, it is unclear how to choose an appropriate transformation for a particular pair of training and target projects. Jiang et al. (2008) even show that the benefit of transformations varies with modelling techniques on the same dataset.

Nonetheless, different transformations retain the information of the original data from various perspectives, especially in the cross-project setting. Therefore, in this study, we set out an exploratory study to investigate if using different transformations that retain distinct characteristics of software metrics is beneficial to cross-project defect prediction.

We perform experiments using three publicly available data sets, i.e., AEEEM (D’Ambros et al. 2010), ReLink (Wu et al. 2011), and PROMISE (Jureczko and Madeyski 2010). First, we examine if transformations have the same ability to improve the normality of software metrics. Besides log and rank transformations, we study the Box-Cox transformation (Box and Cox 1964) that represents a family of power transformations (e.g., the log transformation) but has not been investigated in existing studies on defect prediction. Second, we study if different transformations cause distinct predictions on the same file in the cross-project setting. We propose an approach, namely Multiple Transformations (MT), to integrate predictions by multiple models, with each model built using a single transformation. The weight of each model is determined by its accuracy in predicting defective instances on the training data. We further enhance our approach (MT+) by automatically selecting the most appropriate training project for each target project based on the parameter of the Box-Cox transformation. Accordingly, we study four research questions:

RQ1.
Are log, Box-Cox, and rank transformations equally effective in increasing the normality of software metrics?

All three transformations can significantly improve the normality of software metrics (i.e., reduce both the skewness and kurtosis). The three transformations have similar ability to improve the normality of software metrics, with small or negligible difference indicated by Cliff’s δ.
RQ2.
Do different transformations result in distinct predictions in cross-project defect prediction models?

In general, models built with each of the three transformations do not exhibit significantly difference in terms of the six studied performance measures (i.e., precision, recall, false positive rate, balance, F-measure and AUC values). However, the results of McNemar’s test indicate that the three prediction models judge differently about the defect proneness of each file. When a defective file is overlooked by one model, it may be captured by other models.
RQ3.
Can our approaches improve the performance of cross-project defect prediction models?

The results show that our approach MT+ statistically significantly improves the performance of cross-project defect prediction, comparing to models built with the log transformation. On average, our MT+ approach increases F-measure of cross-project defect prediction models built using logistic regression by 24% (i.e., from 0.34 to 0.42), 11% (i.e., from 0.54 to 0.60), and 29% (i.e., from 0.31 to 0.42) in AEEEM, ReLink, and PROMISE datasets, respectively.
RQ4.
Do our approaches work well for other classifiers?

We study the generalizability of our approaches using six other classifiers (e.g., Naive Bayes, and random forest), since different classifiers are reported to prefer different transformations (Jiang et al. 2008). We find that our approach MT+ generally outperforms models built with the log transformation.

Our major contributions are: 1) study if data transformation impacts cross-project defect prediction; 2) propose to utilize the various information retained through different transformation methods; and 3) propose to use the parameter of the Box-Cox transformation to select the most appropriate training project for each target project.

Paper organization. Section 2 presents the three studied transformation methods. The experimental setup is presented in Section 3. Our motivation study is described in Section 4. Our approaches (i.e., MT and MT+) and evaluation are presented in Sections 5 and 6, respectively. The related work is summarized in Section 7. The threats to validity of our work are discussed in Section 8. We conclude the paper and provide insights for future work in Section 9.

2 Background on transformation methods

In this section, we describe two common measurements of data normality, and present the details of the three studied transformation methods.

2.1 Normality measurements

Skewness and kurtosis are two widely applied measurements of data normality. We compute these two measurements to measure the normality of software metrics, using the R functions skewness and kurtosis in the R ^{Footnote 1} package e1071.^{Footnote 2}

Skewness measures the degree of symmetry in the probability distribution of the values of a software metric. The value of skewness can be positive (indicating a long tail to the right), negative (indicating a long tail to the left), or zero (indicating balanced tails on both sides), as illustrated in Fig. 1a. The ideal value of skewness ranges from −0.80 to 0.80 (Osborne 2010).
Kurtosis measures the “peakness” (e.g., the width of the peak) in the probability distribution of the values of a software metric. The value of kurtosis can be positive that indicates a more acute peak, or negative that indicates a lower and wider peak. Positive and negative kurtosis are illustrated in Fig. 1b. The ideal value of kurtosis is zero.

2.2 Log transformation

The log transformation is a mathematical operation that computes the logarithm (mostly the natural logarithm) of software metrics to replace the original values. The log transformation is widely used in building software defect prediction models (e.g., Menzies et al. 2007; Song et al. 2011).

The log transformation can only transform numerical values that are greater than zero, due to the limitation of the function “ln(x)”. To deal with zero values, a constant is often added, such as “ l n(x + 1)”. An alternative solution is to replace all values under 0.000001 by 0.000001. We apply the following commonly used equation:

$$ Log(x) = ln(x+1) $$

(1)

where x is the value of a software metric.

2.3 Rank transformation

The rank transformation replaces the original values by their ranks. The rank transformation is recommended to deal with heavy-tailed distributions (i.e., have high kurtosis) Bishara and Hittner (2014) and Keren and Lewis (1993). In the literature of defect prediction, Jiang et al. (2008) observe that the rank transformation can improve the performance of some classifiers (e.g., Naive Bayes). Moreover, the rank transformation has been successfully applied to mitigate the heterogeneity of software metrics across projects in the cross-project setting (Zhang et al. 2014).

In this study, we convert the original values of each metric into ten ranks, using every 10t h percentile of the corresponding metric, as defined in (2).

$$ Rank(x) = \left\{ \begin{array}{l l} 1 & \quad \text{if $x \in [0, Q_{1}]$}\\ k & \quad \text{if $x \in (Q_{k-1}, Q_{k}]$, $k \in \{2,\ldots, 9\}$}\\ 10 & \quad \text{if $x \in (Q_{9}, +\infty)$ } \end{array} \right. $$

(2)

where Q _k is the k*10% percentile of the corresponding metric in the union of the training and target projects.

2.4 Box-Cox transformation

The Box-Cox transformation represents a family of power transformations, as defined in (3). To the best of our knowledge, the Box-Cox transformation has not been explored in the literature of defect prediction.

$$ BoxCox(x, \lambda) = \left\{ \begin{array}{l l} \frac{x^{\lambda}-1}{\lambda} & \quad \text{if $\lambda \ne 0$}\\ ln(x) & \quad \text{if $\lambda=0$ } \end{array} \right. $$

(3)

where x is the value of a metric, and λ is the only configuration parameter of the Box-Cox transformation.

The parameter λ determines the concrete format of the Box-Cox transformation. For example, “ λ = 1.0” means no transformation, “ λ = 0.5” equals to the square root transformation, “ λ = 0.0” represents the log transformation, and “ λ = −1.0” indicates the inverse transformation. As such, the Box-Cox transformation is often used to transform variables that follow a power law distribution. The Box-Cox transformation is suggested to improve the variance homogeneity, increase the precision of estimation, and simplify models (Shang 2014).

The parameter λ can be estimated from a sample of data points. In the context of cross-project prediction, the parameter λ is estimated from both the training and target projects. The details to apply the Box-Cox transformation in our study are presented as follows.

1)
Shifting metric values to 1.0. As suggested by Guo (2014), we shift the minimum value of a metric in a distribution at exactly 1.0 before applying the Box-Cox transformation. This treatment can increase the accuracy of the Box-Cox transformation (Guo 2014). We use the equation $\tilde {x} = x-min(x)+1$, where x is the value of a software metric.
2)
Estimating the parameter λ . The parameter λ is estimated for each metric independently, since different metrics rarely follow the same distribution. To ensure the same transformation applied on both the training and target projects, as aforementioned, we estimate the parameter λ using the values of the corresponding metric from both sets.

We estimate the parameter λ in an iterative process. First, we select a set of candidate λ values that range from −1.0 to 1.0. Second, we iterate the λ values from −1.0 towards 1.0 with a step of 0.1. At each iteration, we compute the skewness of transformed values. We select the λ value that leads to the minimum skewness (i.e., the absolute skewness value is the closest to zero) of transformed values. The iterative process can be described using the following equation:
$$ \widehat{\lambda} = \arg{\min_{\lambda \in L}}~|skewness(\underset{\tilde{x} \in X}{BoxCox(\tilde{x}, \lambda)})| $$
(4)
where L is a set of candidate λ values from -1.0 to 1.0 with a step by 0.1, and X is a vector of shifted metric values.
3)
Normalizing transformed values. Normalization creates equal scales of software metrics, and is useful for classification algorithms (Han et al. 2012; Nam et al. 2013). In this study, we choose the min-max method (Han et al. 2012), since it can normalize values exactly into the range of [0,1]. Based on the benefit of shifting the minimum value to 1.0 (Guo 2014), we slightly modify this method using the following (5).
$$ Normalize(\widehat{x}) = \frac{\widehat{x}-\min_{\widehat{x} \in U}(\widehat{x})}{\max_{\widehat{x} \in U}(\widehat{x})-\min_{\widehat{x} \in U}(\widehat{x})}+1 $$
(5)
where $\widehat {x}$ is the transformed value by (3) using $\tilde {x}$ and $\widehat {\lambda }$, and U is a set of $\widehat {x}$ from the union of the training and target projects.

3 Experimental setup

In this section, we first describe our subject projects. Then, we present classifiers to build cross-project defect prediction models, and six performance measures used in this study.

3.1 Subject projects

In this study, we choose three publicly available datasets, such as AEEEM (D’Ambros et al. 2010), ReLink (Wu et al. 2011), and PROMISE (Jureczko and Madeyski 2010). The three datasets have been widely used for cross-project defect prediction (e.g., Nam et al. 2013). The diversity of the three datasets can help verify the generalizability of our approach. Table 1 presents the summary of the three datasets.

1)
AEEEM dataset was made by D’Ambros et al. (2010), and contains 61 metrics. It has the two largest projects (i.e., Mylyn and PDE) among the three datasets. The ratio of defective files in this dataset is relatively lower than the other two datasets (i.e., 9.3% to 39.8%).
2)
ReLink dataset was collected by Wu et al. (2011), and the defect information in this dataset was manually verified. ReLink dataset has 26 metrics. Projects in this dataset are relatively small (e.g., project OpenIntents Safe has the least number of files). This dataset has a moderate ratio of defective files (i.e., 29.6% to 50.5%).
3)
PROMISE dataset was prepared by Jureczko and Madeyski (2010). We select the same ten projects as in our prior study (Zhang et al. 2016). PROMISE dataset has 20 metrics. PROMISE dataset has the most diverse characteristics of projects, such as the number of files ranges from 135 to 965, and the ratio of defective files varies between 6.6 and 63.6%.

Table 1 Descriptive statistics of all 18 subject projects from AEEEM, ReLink, and PROMISE datasets

Full size table

3.2 Classifiers for defect prediction

Each classifier has its own advantages when used to build a defect prediction model. For instance, logistic regression is easy to interpret and is widely used (Nam et al. 2013). Naive Bayes is robust for defect prediction using data with observable noises (Kim et al. 2011). In this study, we choose to use logistic regression as the main classifier in RQ2 and RQ3. We further perform the sensitive analysis on the choice of classifiers in RQ4, since not all classifiers are sensitive to data transformations (Kuhn and Johnson 2013). For instance, in the defect prediction literature, data transformations have been reported to have varied impacts on the performance of different classifiers (Jiang et al. 2008; Menzies et al. 2007; Song et al. 2011). In addition to logistic regression, we evaluate the performance of our approaches in RQ4 using six other classifiers such as Bayes net (BN), k-nearest neighbours (IBk), decision tree (J48), naive Bayes (NB), random forest (RF), and random tree (RT).

3.3 Performance measures

In this study, we compute six commonly used measures (i.e., precision, recall, false positive rate, balance, F-measure, and AUC value) to evaluate the performance of cross-project prediction models.

The first five measures can be calculated from the following four numbers: 1) true positive (TP) that counts the number of defective instances successfully predicted as defective instances; 2) true negative (TN) that calculates the number of non-defective instances correctly predicted as non-defective instances; 3) false positive (FP) that is the number of non-defective instances incorrectly predicted as defective instances; and 4) false negative (FN) that measures the number of defective instances wrongly predicted as non-defective instances. The details are described as follows:

Precision (prec):: measures the ratio of correctly predicted defective instances. It is defined as: $prec=\frac {TP}{TP+FP}$.
Recall (pd):: evaluates the proportion of defective instances that are predicted as defective instances. It is defined as: $pd=\frac {TP}{TP+FN}$.
False Positive Rate (fpr):: captures the proportion of non-defective instances that are predicted as defective instances. It is defined as: $fpr=\frac {FP}{FP+TN}$.
Balance:: is proposed by Menzies et al. (2007) to balance recall and false positive rate. It is defined as: $balance=1-\frac {\sqrt {(0-fpr)^{2}+(1-pd)^{2}}}{\sqrt {2}}$.
F-measure:: is the harmonic mean of precision and recall. It is defined as: F-$measure=\frac {2\times pd \times prec}{pd+prec}$.

The five aforementioned measures depend on the cut-off value, which is used to compute the four numbers TP, TN, FP, and FN. On the other hand, Area Under Curve (AUC) is the area under the receiver operating characteristics (ROC) curve, thus the AUC value is independent of the cut-off value. Therefore, we further compute AUC values to evaluate cross-project defect prediction models as prior studies, such as (Rahman et al. 2012).

4 Motivation study

In this section, we aim to find if the three studied transformation methods have different performances in the context of defect prediction. The investigation is performed from the following two perspectives:

1)
if they can equally improve the normality of software metrics.
2)
if cross-project defect prediction models built using each of the transformation methods have similar performance.

Accordingly, we formulate two research questions. We now present the findings of each question, along with our motivation and approach.

4.1 RQ1. Are log, Box-Cox, and rank transformations equally effective in increasing the normality of software metrics?

Motivation

Data normality can impact the performance of a prediction model, particularly the model that is not tree-based (Kuhn and Johnson 2013). Although log and rank transformations have been applied in defect prediction (e.g., Jiang et al. 2008; Menzies 2007; Zhang et al. 2014), their capability in improving the normality of software metric values has not been explicitly explored. In addition, the Box-Cox transformation introduced in Section 2.4 has not been used in the defect prediction studies.

To thoroughly examine the impact that transformations have on defect prediction models, it is necessary to investigate if the three transformation methods indeed have different performance in improving the normality of software metric values.

Approach

To address this question, software metrics need to be transformed using each of the three transformation methods. As different software metrics exhibit various distributions, we transform the values of each metric independently.

In each project, we apply the log transformation on software metric values to get log transformed values. When applying the Box-Cox transformation, we first apply the steps described in Section 2.4 to estimate the parameter λ using values of a single metric from the same project, and then apply the Box-Cox transformation. To apply the rank transformation, we compute every 10t h percentile of the distribution of values of a single metric from the same project, and obtain rank transformed values using (2).

On the transformed metric values, the skewness and kurtosis are computed to evaluate the normality. To investigate if transformation improves the normality of software metric values, we test the following null hypothesis for each transformation method:

H0₁₁::: there is no difference in the normality of the transformed metric values and the original metric values.

We conduct paired Wilcoxon rank sum test (Sheskin 2007), with the 95% confidence level (i.e., p-value < 0.05). The Wilcoxon rank sum test is a non-parametric statistical test to assess whether two independent distributions are equal. Non-parametric statistical methods make no assumptions about the distribution of assessed variables. If there is a statistical significance, we reject the hypothesis and conclude that the examined transformation significantly changes the normality of software metric values.

Furthermore, we compare the capability of the three transformations in improving the normality of software metric values. We apply paired Wilcoxon rank sum test to evaluate the following null hypothesis, with the 95% confidence level (i.e., p-value < 0.05).

H0₁₂::: there is no difference in the normality of metric values that are processed by transformations T _a and T _b.

T _a and T _b denote two different transformations. If there is a statistical significance, we reject the hypothesis and conclude that the corresponding two transformations have different capability in improving data normality. We further compute the Cliff’s δ (Romano et al. 2006) to quantify the difference. The Cliff’s δ is a nonparametric effect size that does not assume a particular distribution. The difference is negligible if Cliff’s |δ| < 0.147, small if Cliff’s 0.147 <= |δ| < 0.330, medium if Cliff’s 0.330 <= |δ| < 0.474, and large if Cliff’s |δ| >= 0.474.

Findings

All three studied transformations can significantly improve the normality of software metrics. Figure 2 presents the skewness and kurtosis values of the transformed metric values from all projects. The median skewness and kurtosis of the original metric values among all three datasets are 4 and 22, respectively. It indicates that the original metric values are highly skewed, because the ideal skewness value is between −0.80 and 0.80, and the perfect kurtosis value is zero (see Section 2.1). Using any of the three transformations can make the skewness and kurtosis values become closer to zero (i.e., nearly normally distributed). As shown in Table 2, the results of Wilcoxon rank sum tests in the skewness and kurtosis between the transformed values and the original values show statistically significant difference, respectively. Hence, we reject hypothesis H0₁₁ for all three transformations. Moreover, the corresponding Cliff’s δ is always greater than 0.474 (as shown in Table 2), indicating that each of the studied transformation methods yields a large improvement on the data normality. The capabilities of the three transformations on improving data normality are ordered as: the rank transformation > the Box-Cox transformation > the log transformation. For any pair of the transformations, we reject hypothesis H0₁₂ as the p-values of Wilcoxon rank sum tests are always less than 0.05. However, the Cliff’s δ is either small or negligible, except the kurtosis value between log and rank transformations and the kurtosis value between Box-Cox and rank transformations.

Table 2 The results of comparing cross-project defect prediction models built using the three transformations (n.s. denotes no statistical significance, and bold font is used if the corresponding model is better)

Full size table

Regarding the Box-Cox transformation, the estimated parameter λ varies across projects

We present the boxplot of the estimated λ values for each project in Fig. 3. The varying λ values across projects suggests that estimating λ values from both the training and target projects can maximize the normality of metrics values in both projects. We observe that few of the estimated λ values are zero (λ = 0 indicates a log transformation). Therefore, the Box-Cox transformation is not close to the log transformation when dealing with software metrics. In addition, the median value of the estimated λ is often less than zero across projects, showing that most of the estimated λ values are negative. In other words, the Box-Cox transformation tends to reverse the order of metric values, i.e., making larger metric values smaller, and vice versa. The reversed order of metric values does not affect the performance of defect prediction models, since both the training and testing projects are treated in the same way. However, researchers should keep in mind of such possible alteration of the order of metric values when interpreting models built with the Box-Cox transformation.