1 Introduction

A defect is an error that can cause a software system to behave in an unexpected way or produce incorrect results. In the last decade, defect prediction has attracted great attention from both researchers and practitioners. Software metrics (e.g., lines of code and the number of method calls) constitute the major part of the input data used to build a defect prediction model. Earlier studies (e.g., Concas et al. 2007; Louridas et al. 2008; Zhang 2009) report that software metrics rarely follow a normal distribution, but rather a power-law distribution, which threatens the ability of prediction models to provide accurate predictions (Cohen et al. 2003).

In the literature of defect prediction, researchers widely apply log and rank transformations to improve the normality of software metrics (e.g., Menzies et al. 2007; Jiang et al. 2008; Cruz and Ochimizu 2009; Song et al. 2011; Zhang et al. 2014). The log transformation is a mathematical operation that replaces the original metric values with their logarithms, and thus suits log-normal data (i.e., data that are normally distributed after the log transformation). The rank transformation substitutes the original metric values with their ranks.

Despite their success in improving the normality of software metrics, the aforementioned transformations do not consistently improve the performance of defect prediction models in a within-project setting (Jiang et al. 2008). A within-project model is built and applied within the same project. However, to the best of our knowledge, the impact that such transformations have on the performance of defect prediction models has not been thoroughly investigated in a cross-project setting. A cross-project model is built using the training data from one project and applied on the target data from another project.

Cross-project prediction is needed when the target project (e.g., a small or new project) does not have sufficient historical data to build a prediction model (Nagappan et al. 2006). Cross-project prediction faces a great challenge in dealing with the heterogeneity between the training and target projects, since software metrics of different projects often exhibit varied distributions (Zhang et al. 2013). Transformations, if learnt from both the training and target projects, have the potential to mitigate this heterogeneity. For instance, our previous work (Zhang et al. 2014) successfully applies a context-aware rank transformation to generalize defect prediction models. In addition, Ma et al. (2012) propose to transform the training project based on the statistical characteristics learnt from the target project. Nam et al. (2013) apply a transfer component analysis (TCA) approach to transform both the training and target projects. Although both approaches on average significantly improve the performance of cross-project predictions, it is unclear how to choose an appropriate transformation for a particular pair of training and target projects. Jiang et al. (2008) even show that the benefit of transformations varies with modelling techniques on the same dataset.

Nonetheless, different transformations retain the information of the original data from different perspectives, especially in the cross-project setting. Therefore, we conduct an exploratory study to investigate whether using different transformations that retain distinct characteristics of software metrics is beneficial to cross-project defect prediction.

We perform experiments using three publicly available data sets, i.e., AEEEM (D’Ambros et al. 2010), ReLink (Wu et al. 2011), and PROMISE (Jureczko and Madeyski 2010). First, we examine if transformations have the same ability to improve the normality of software metrics. Besides log and rank transformations, we study the Box-Cox transformation (Box and Cox 1964) that represents a family of power transformations (e.g., the log transformation) but has not been investigated in existing studies on defect prediction. Second, we study if different transformations cause distinct predictions on the same file in the cross-project setting. We propose an approach, namely Multiple Transformations (MT), to integrate predictions by multiple models, with each model built using a single transformation. The weight of each model is determined by its accuracy in predicting defective instances on the training data. We further enhance our approach (MT+) by automatically selecting the most appropriate training project for each target project based on the parameter of the Box-Cox transformation. Accordingly, we study four research questions:

  1. RQ1.

    Are log, Box-Cox, and rank transformations equally effective in increasing the normality of software metrics?

    All three transformations can significantly improve the normality of software metrics (i.e., reduce both the skewness and kurtosis). The three transformations have a similar ability to improve the normality of software metrics, with small or negligible differences as indicated by Cliff’s δ.

  2. RQ2.

    Do different transformations result in distinct predictions in cross-project defect prediction models?

    In general, models built with each of the three transformations do not exhibit significant differences in terms of the six studied performance measures (i.e., precision, recall, false positive rate, balance, F-measure, and AUC values). However, the results of McNemar’s test indicate that the three prediction models judge the defect proneness of individual files differently. When a defective file is overlooked by one model, it may be captured by the other models.

  3. RQ3.

    Can our approaches improve the performance of cross-project defect prediction models?

    The results show that our approach MT+ statistically significantly improves the performance of cross-project defect prediction, compared to models built with the log transformation. On average, our MT+ approach increases the F-measure of cross-project defect prediction models built using logistic regression by 24% (i.e., from 0.34 to 0.42), 11% (i.e., from 0.54 to 0.60), and 29% (i.e., from 0.31 to 0.42) in the AEEEM, ReLink, and PROMISE datasets, respectively.

  4. RQ4.

    Do our approaches work well for other classifiers?

    We study the generalizability of our approaches using six other classifiers (e.g., Naive Bayes, and random forest), since different classifiers are reported to prefer different transformations (Jiang et al. 2008). We find that our approach MT+ generally outperforms models built with the log transformation.

Our major contributions are: 1) a study of whether data transformation impacts cross-project defect prediction; 2) an approach that utilizes the diverse information retained by different transformation methods; and 3) a method that uses the parameter of the Box-Cox transformation to select the most appropriate training project for each target project.

Paper organization. Section 2 presents the three studied transformation methods. The experimental setup is presented in Section 3. Our motivation study is described in Section 4. Our approaches (i.e., MT and MT+) and evaluation are presented in Sections 5 and 6, respectively. The related work is summarized in Section 7. The threats to validity of our work are discussed in Section 8. We conclude the paper and provide insights for future work in Section 9.

2 Background on transformation methods

In this section, we describe two common measurements of data normality, and present the details of the three studied transformation methods.

2.1 Normality measurements

Skewness and kurtosis are two widely applied measurements of data normality. We compute both measurements to assess the normality of software metrics, using the functions skewness and kurtosis from the R package e1071.

  • Skewness measures the degree of symmetry in the probability distribution of the values of a software metric. The value of skewness can be positive (indicating a long tail to the right), negative (indicating a long tail to the left), or zero (indicating balanced tails on both sides), as illustrated in Fig. 1a. The ideal value of skewness ranges from −0.80 to 0.80 (Osborne 2010).

  • Kurtosis measures the “peakedness” (e.g., the width of the peak) of the probability distribution of the values of a software metric. The value of kurtosis can be positive (indicating a more acute peak) or negative (indicating a lower and wider peak). Positive and negative kurtosis are illustrated in Fig. 1b. The ideal value of kurtosis is zero.

Fig. 1 The illustration of skewness and kurtosis in a distribution
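The following minimal sketch shows how both normality measurements can be computed with the e1071 functions named above; the vector loc and its values are hypothetical metric values used only for illustration.

```r
library(e1071)

# Hypothetical lines-of-code values of one project, used only for illustration.
loc <- c(12, 3, 250, 41, 7, 1800, 95, 5, 33, 60)

skewness(loc)  # positive value: long tail to the right
kurtosis(loc)  # positive value: a peak more acute than the normal distribution
```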

2.2 Log transformation

The log transformation is a mathematical operation that computes the logarithm (mostly the natural logarithm) of software metrics to replace the original values. The log transformation is widely used in building software defect prediction models (e.g., Menzies et al. 2007; Song et al. 2011).

The log transformation can only transform numerical values that are greater than zero, due to the limitation of the function ln(x). To deal with zero values, a constant is often added, as in ln(x + 1). An alternative solution is to replace all values below 0.000001 with 0.000001. We apply the following commonly used equation:

$$ Log(x) = ln(x+1) $$
(1)

where x is the value of a software metric.
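As a minimal sketch, (1) can be implemented with the built-in R function log1p, which computes ln(1 + x):

```r
# Equation (1): replace each metric value x by ln(x + 1).
log_transform <- function(x) log1p(x)

log_transform(c(0, 1, 9, 99))  # 0.0000000 0.6931472 2.3025851 4.6051702
```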

2.3 Rank transformation

The rank transformation replaces the original values with their ranks. The rank transformation is recommended for dealing with heavy-tailed distributions (i.e., distributions with high kurtosis) (Bishara and Hittner 2014; Keren and Lewis 1993). In the literature of defect prediction, Jiang et al. (2008) observe that the rank transformation can improve the performance of some classifiers (e.g., Naive Bayes). Moreover, the rank transformation has been successfully applied to mitigate the heterogeneity of software metrics across projects in the cross-project setting (Zhang et al. 2014).

In this study, we convert the original values of each metric into ten ranks, using every 10th percentile of the corresponding metric, as defined in (2).

$$ Rank(x) = \left\{ \begin{array}{l l} 1 & \quad \text{if \(x \in [0, Q_{1}]\)}\\ k & \quad \text{if \(x \in (Q_{k-1}, Q_{k}]\), \(k \in \{2,\ldots, 9\}\)}\\ 10 & \quad \text{if \(x \in (Q_{9}, +\infty)\) } \end{array} \right. $$
(2)

where \(Q_{k}\) is the (k × 10)th percentile of the corresponding metric in the union of the training and target projects.
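A minimal sketch of (2) in R is shown below; the vectors train_x and target_x are hypothetical metric values, and the pooled percentiles follow the definition of \(Q_{k}\) above.

```r
# Rank transformation (Equation (2)): map metric values to ten ranks using
# every 10th percentile of the pooled training and target values.
rank_transform <- function(x, pooled = x) {
  q <- quantile(pooled, probs = seq(0.1, 0.9, by = 0.1))  # Q1, ..., Q9
  # findInterval with left.open = TRUE returns 0 for x <= Q1, k-1 for x in
  # (Q_{k-1}, Q_k], and 9 for x > Q9; adding 1 yields ranks 1 to 10.
  findInterval(x, q, left.open = TRUE) + 1
}

train_x  <- c(0, 2, 5, 7, 11, 13, 17, 19, 23, 100)  # hypothetical metric values
target_x <- c(1, 4, 12, 30)                          # hypothetical metric values
rank_transform(target_x, pooled = c(train_x, target_x))
```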

2.4 Box-Cox transformation

The Box-Cox transformation represents a family of power transformations, as defined in (3). To the best of our knowledge, the Box-Cox transformation has not been explored in the literature of defect prediction.

$$ BoxCox(x, \lambda) = \left\{ \begin{array}{l l} \frac{x^{\lambda}-1}{\lambda} & \quad \text{if \(\lambda \ne 0\)}\\ ln(x) & \quad \text{if \(\lambda=0\) } \end{array} \right. $$
(3)

where x is the value of a metric, and λ is the only configuration parameter of the Box-Cox transformation.

The parameter λ determines the concrete form of the Box-Cox transformation. For example, λ = 1.0 means no transformation, λ = 0.5 corresponds to the square root transformation, λ = 0.0 represents the log transformation, and λ = −1.0 indicates the inverse transformation. As such, the Box-Cox transformation is often used to transform variables that follow a power law distribution. The Box-Cox transformation is suggested to improve variance homogeneity, increase the precision of estimation, and simplify models (Shang 2014).

The parameter λ can be estimated from a sample of data points. In the context of cross-project prediction, the parameter λ is estimated from both the training and target projects. The details of applying the Box-Cox transformation in our study are as follows, with a code sketch provided after the list.

  1. 1)

    Shifting metric values to 1.0. As suggested by Guo (2014), we shift the minimum value of a metric in a distribution to exactly 1.0 before applying the Box-Cox transformation. This treatment can increase the accuracy of the Box-Cox transformation (Guo 2014). We use the equation \(\tilde {x} = x-min(x)+1\), where x is the value of a software metric.

  2. 2)

    Estimating the parameter λ. The parameter λ is estimated for each metric independently, since different metrics rarely follow the same distribution. To ensure that the same transformation is applied to both the training and target projects, as aforementioned, we estimate the parameter λ using the values of the corresponding metric from both sets.

    We estimate the parameter λ in an iterative process. First, we select a set of candidate λ values ranging from −1.0 to 1.0. Second, we iterate over the λ values from −1.0 to 1.0 with a step of 0.1. At each iteration, we compute the skewness of the transformed values. We select the λ value that leads to the minimum absolute skewness (i.e., the skewness closest to zero) of the transformed values. The iterative process can be described using the following equation:

    $$ \widehat{\lambda} = \arg\min_{\lambda \in L}~\left|skewness\left(\left\{BoxCox(\tilde{x}, \lambda) : \tilde{x} \in X\right\}\right)\right| $$
    (4)

    where L is the set of candidate λ values from −1.0 to 1.0 with a step of 0.1, and X is a vector of shifted metric values.

  3. 3)

    Normalizing transformed values. Normalization creates equal scales of software metrics, and is useful for classification algorithms (Han et al. 2012; Nam et al. 2013). In this study, we choose the min-max method (Han et al. 2012), since it normalizes values exactly into the range [0,1]. Based on the benefit of shifting the minimum value to 1.0 (Guo 2014), we slightly modify this method as shown in (5).

    $$ Normalize(\widehat{x}) = \frac{\widehat{x}-\min_{\widehat{x} \in U}(\widehat{x})}{\max_{\widehat{x} \in U}(\widehat{x})-\min_{\widehat{x} \in U}(\widehat{x})}+1 $$
    (5)

    where \(\widehat {x}\) is the transformed value by (3) using \(\tilde {x}\) and \(\widehat {\lambda }\), and U is a set of \(\widehat {x}\) from the union of the training and target projects.
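The sketch below ties the three steps together in R, assuming train_x and target_x are hypothetical metric values of the training and target projects; it is only an illustration of (3) to (5), not the exact implementation used in our experiments.

```r
library(e1071)  # for skewness()

# Equation (3): the Box-Cox transformation for a given lambda.
boxcox <- function(x, lambda) {
  if (lambda == 0) log(x) else (x^lambda - 1) / lambda
}

boxcox_transform <- function(train_x, target_x) {
  pooled  <- c(train_x, target_x)
  shifted <- pooled - min(pooled) + 1                     # step 1: shift the minimum to 1.0
  candidates <- seq(-1, 1, by = 0.1)                      # step 2: candidate lambda values
  skews <- sapply(candidates,
                  function(l) abs(skewness(boxcox(shifted, l))))
  lambda_hat  <- candidates[which.min(skews)]             # Equation (4)
  transformed <- boxcox(shifted, lambda_hat)
  normalized  <- (transformed - min(transformed)) /
                 (max(transformed) - min(transformed)) + 1  # step 3: Equation (5)
  list(lambda = lambda_hat,
       train  = normalized[seq_along(train_x)],
       target = normalized[-seq_along(train_x)])
}

# Hypothetical metric values, used only for illustration.
train_x  <- c(12, 3, 250, 41, 7, 1800)
target_x <- c(5, 60, 33, 95, 4)
boxcox_transform(train_x, target_x)$lambda
```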

3 Experimental setup

In this section, we first describe our subject projects. Then, we present classifiers to build cross-project defect prediction models, and six performance measures used in this study.

3.1 Subject projects

In this study, we choose three publicly available datasets, namely AEEEM (D’Ambros et al. 2010), ReLink (Wu et al. 2011), and PROMISE (Jureczko and Madeyski 2010). The three datasets have been widely used for cross-project defect prediction (e.g., Nam et al. 2013). The diversity of the three datasets helps verify the generalizability of our approach. Table 1 presents a summary of the three datasets.

  1. 1)

    The AEEEM dataset was prepared by D’Ambros et al. (2010), and contains 61 metrics. It has the two largest projects (i.e., Mylyn and PDE) among the three datasets. The ratio of defective files in this dataset is relatively lower than in the other two datasets (ranging from 9.3% to 39.8%).

  2. 2)

    The ReLink dataset was collected by Wu et al. (2011), and the defect information in this dataset was manually verified. The ReLink dataset has 26 metrics. Projects in this dataset are relatively small (e.g., the project OpenIntents Safe has the fewest files). This dataset has a moderate ratio of defective files (i.e., 29.6% to 50.5%).

  3. 3)

    The PROMISE dataset was prepared by Jureczko and Madeyski (2010). We select the same ten projects as in our prior study (Zhang et al. 2016). The PROMISE dataset has 20 metrics and the most diverse project characteristics: the number of files ranges from 135 to 965, and the ratio of defective files varies between 6.6% and 63.6%.

Table 1 Descriptive statistics of all 18 subject projects from AEEEM, ReLink, and PROMISE datasets

3.2 Classifiers for defect prediction

Each classifier has its own advantages when used to build a defect prediction model. For instance, logistic regression is easy to interpret and is widely used (Nam et al. 2013). Naive Bayes is robust for defect prediction using data with observable noise (Kim et al. 2011). In this study, we choose logistic regression as the main classifier in RQ2 and RQ3. We further perform a sensitivity analysis on the choice of classifiers in RQ4, since not all classifiers are sensitive to data transformations (Kuhn and Johnson 2013). For instance, in the defect prediction literature, data transformations have been reported to have varied impacts on the performance of different classifiers (Jiang et al. 2008; Menzies et al. 2007; Song et al. 2011). In addition to logistic regression, we evaluate the performance of our approaches in RQ4 using six other classifiers: Bayes net (BN), k-nearest neighbours (IBk), decision tree (J48), naive Bayes (NB), random forest (RF), and random tree (RT).

3.3 Performance measures

In this study, we compute six commonly used measures (i.e., precision, recall, false positive rate, balance, F-measure, and AUC value) to evaluate the performance of cross-project prediction models.

The first five measures can be calculated from the following four numbers: 1) true positive (TP) that counts the number of defective instances successfully predicted as defective instances; 2) true negative (TN) that calculates the number of non-defective instances correctly predicted as non-defective instances; 3) false positive (FP) that is the number of non-defective instances incorrectly predicted as defective instances; and 4) false negative (FN) that measures the number of defective instances wrongly predicted as non-defective instances. The details are described as follows:

Precision (prec):

measures the proportion of instances predicted as defective that are truly defective. It is defined as: \(prec=\frac {TP}{TP+FP}\).

Recall (pd):

evaluates the proportion of defective instances that are predicted as defective instances. It is defined as: \(pd=\frac {TP}{TP+FN}\).

False Positive Rate (fpr):

captures the proportion of non-defective instances that are predicted as defective instances. It is defined as: \(fpr=\frac {FP}{FP+TN}\).

Balance:

is proposed by Menzies et al. (2007) to balance recall and false positive rate. It is defined as: \(balance=1-\frac {\sqrt {(0-fpr)^{2}+(1-pd)^{2}}}{\sqrt {2}}\).

F-measure:

is the harmonic mean of precision and recall. It is defined as: F-\(measure=\frac {2\times pd \times prec}{pd+prec}\).

The five aforementioned measures depend on the cut-off value, which is used to compute the four numbers TP, TN, FP, and FN. In contrast, the Area Under Curve (AUC) is the area under the receiver operating characteristic (ROC) curve, and thus the AUC value is independent of the cut-off value. Therefore, we further compute AUC values to evaluate cross-project defect prediction models, as in prior studies (e.g., Rahman et al. 2012).
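The following minimal sketch computes the six measures from a vector of actual labels and predicted probabilities; the input vectors are hypothetical, and the pROC package is used here as one possible way to obtain the AUC.

```r
library(pROC)

evaluate <- function(actual, prob, cutoff = 0.5) {
  predicted <- prob > cutoff
  TP <- sum(predicted & actual);   FP <- sum(predicted & !actual)
  FN <- sum(!predicted & actual);  TN <- sum(!predicted & !actual)
  prec <- TP / (TP + FP)
  pd   <- TP / (TP + FN)                                   # recall
  fpr  <- FP / (FP + TN)
  bal  <- 1 - sqrt((0 - fpr)^2 + (1 - pd)^2) / sqrt(2)
  f1   <- 2 * pd * prec / (pd + prec)
  auc  <- as.numeric(auc(roc(as.numeric(actual), prob)))   # cut-off independent
  c(precision = prec, recall = pd, fpr = fpr,
    balance = bal, f.measure = f1, auc = auc)
}

# Hypothetical labels and predicted probabilities, used only for illustration.
actual <- c(TRUE, FALSE, TRUE, TRUE, FALSE, FALSE, TRUE, FALSE)
prob   <- c(0.9, 0.2, 0.4, 0.8, 0.6, 0.1, 0.7, 0.3)
evaluate(actual, prob)
```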

4 Motivation study

In this section, we aim to determine whether the three studied transformation methods perform differently in the context of defect prediction. The investigation is performed from the following two perspectives:

  1. 1)

    if they can equally improve the normality of software metrics.

  2. 2)

    if cross-project defect prediction models built using each of the transformation methods have similar performance.

Accordingly, we formulate two research questions. We now present the findings of each question, along with our motivation and approach.

4.1 RQ1. Are log, Box-Cox, and rank transformations equally effective in increasing the normality of software metrics?

Motivation

Data normality can impact the performance of a prediction model, particularly a model that is not tree-based (Kuhn and Johnson 2013). Although log and rank transformations have been applied in defect prediction (e.g., Jiang et al. 2008; Menzies et al. 2007; Zhang et al. 2014), their capability to improve the normality of software metric values has not been explicitly explored. In addition, the Box-Cox transformation introduced in Section 2.4 has not been used in defect prediction studies.

To thoroughly examine the impact that transformations have on defect prediction models, it is necessary to investigate if the three transformation methods indeed have different performance in improving the normality of software metric values.

Approach

To address this question, software metrics need to be transformed using each of the three transformation methods. As different software metrics exhibit various distributions, we transform the values of each metric independently.

In each project, we apply the log transformation on software metric values to obtain log-transformed values. When applying the Box-Cox transformation, we first follow the steps described in Section 2.4 to estimate the parameter λ using the values of a single metric from the same project, and then apply the Box-Cox transformation. To apply the rank transformation, we compute every 10th percentile of the distribution of the values of a single metric from the same project, and obtain rank-transformed values using (2).

On the transformed metric values, the skewness and kurtosis are computed to evaluate the normality. To investigate if transformation improves the normality of software metric values, we test the following null hypothesis for each transformation method:

H01-1: there is no difference in the normality between the transformed metric values and the original metric values.

We conduct a paired Wilcoxon rank sum test (Sheskin 2007) at the 95% confidence level (i.e., p-value < 0.05). The Wilcoxon rank sum test is a non-parametric statistical test that assesses whether two distributions are equal. Non-parametric statistical methods make no assumptions about the distribution of the assessed variables. If there is a statistical significance, we reject the hypothesis and conclude that the examined transformation significantly changes the normality of software metric values.

Furthermore, we compare the capability of the three transformations in improving the normality of software metric values. We apply paired Wilcoxon rank sum test to evaluate the following null hypothesis, with the 95% confidence level (i.e., p-value < 0.05).

H01-2: there is no difference in the normality of metric values that are processed by transformations \(T_a\) and \(T_b\).

\(T_a\) and \(T_b\) denote two different transformations. If there is a statistical significance, we reject the hypothesis and conclude that the corresponding two transformations have different capability in improving data normality. We further compute Cliff’s δ (Romano et al. 2006) to quantify the difference. Cliff’s δ is a nonparametric effect size measure that does not assume a particular distribution. The difference is negligible if |δ| < 0.147, small if 0.147 ≤ |δ| < 0.330, medium if 0.330 ≤ |δ| < 0.474, and large if |δ| ≥ 0.474.
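The statistical comparison can be sketched in R as follows; skew_raw and skew_log are hypothetical per-metric skewness values before and after the log transformation, and the effsize package is one possible way to compute Cliff’s δ.

```r
library(effsize)

# Hypothetical skewness of each metric before and after the log transformation.
skew_raw <- c(4.2, 3.1, 6.8, 2.5, 5.0)
skew_log <- c(0.9, 0.4, 1.2, 0.2, 0.7)

wilcox.test(skew_raw, skew_log, paired = TRUE)  # reject H01-1 if p-value < 0.05
cliff.delta(skew_raw, skew_log)                 # |delta| >= 0.474 indicates a large effect
```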

Findings

All three studied transformations can significantly improve the normality of software metrics. Figure 2 presents the skewness and kurtosis values of the transformed metric values from all projects. The median skewness and kurtosis of the original metric values among all three datasets are 4 and 22, respectively, which indicates that the original metric values are highly skewed, because the ideal skewness value is between −0.80 and 0.80 and the ideal kurtosis value is zero (see Section 2.1). Applying any of the three transformations brings the skewness and kurtosis values closer to zero (i.e., closer to a normal distribution). As shown in Table 2, the Wilcoxon rank sum tests comparing the skewness and kurtosis of the transformed values against the original values show statistically significant differences. Hence, we reject hypothesis H01-1 for all three transformations. Moreover, the corresponding Cliff’s δ is always greater than 0.474 (as shown in Table 2), indicating that each of the studied transformation methods yields a large improvement in data normality. The capabilities of the three transformations in improving data normality are ordered as follows: the rank transformation > the Box-Cox transformation > the log transformation. For every pair of transformations, we reject hypothesis H01-2, as the p-values of the Wilcoxon rank sum tests are always less than 0.05. However, the Cliff’s δ is either small or negligible, except for the kurtosis comparisons between the log and rank transformations and between the Box-Cox and rank transformations.

Fig. 2 The boxplot of the skewness and kurtosis values of the metrics that are transformed using each of the three methods on all subject projects (“Raw”, “Log”, “BC”, and “Rank” represent no transformation, the log transformation, the Box-Cox transformation, and the rank transformation, respectively)

Table 2 The results of Wilcoxon rank sum tests and Cliff’s δ comparing the normality (skewness and kurtosis) of the transformed metric values against the original values (n.s. denotes no statistical significance)

Regarding the Box-Cox transformation, the estimated parameter λ varies across projects

We present the boxplot of the estimated λ values for each project in Fig. 3. The variation of λ values across projects suggests that estimating λ from both the training and target projects can maximize the normality of metric values in both projects. We observe that few of the estimated λ values are zero (λ = 0 indicates the log transformation). Therefore, the Box-Cox transformation is not close to the log transformation when dealing with software metrics. In addition, the median of the estimated λ values is often less than zero across projects, showing that most of the estimated λ values are negative. In other words, the Box-Cox transformation tends to reverse the order of metric values, i.e., making larger metric values smaller, and vice versa. The reversed order of metric values does not affect the performance of defect prediction models, since both the training and target projects are treated in the same way. However, researchers should keep such possible alteration of the order of metric values in mind when interpreting models built with the Box-Cox transformation.

Fig. 3 Boxplot of estimated λ values for metrics in each project (The full name of each project is presented in Table 1)

4.2 RQ2. Do different transformations result in distinct predictions in cross-project defect prediction models?

Motivation

Although all three transformations can effectively improve the normality of software metric values, it is unclear if cross-project defect prediction models are impacted by applying different transformations. We are interested in whether cross-project defect prediction models achieve the same performance when each transformation method is applied. In particular, we want to 1) compare the overall performance (e.g., F-measure and AUC value) of cross-project defect prediction models built using the three transformations; and 2) examine if the three transformations result in distinct predictions in the cross-project setting.

Approach

To address this question, we build cross-project prediction models using all possible pairs of the training and target projects in each dataset. In the AEEEM, ReLink, and PROMISE datasets, there are 5, 3, and 10 projects, respectively. Therefore, the total numbers of possible pairs for cross-project defect prediction in the three datasets are 20 (= 5 × 4), 6 (= 3 × 2), and 90 (= 10 × 9), respectively.

For each pair of the training and target projects, we build three models. Each model is built using metrics transformed by one of the three studied transformation methods. As some metrics correlate with other metrics, we perform a correlation analysis to remove the redundancy among software metrics. To measure correlation, we compute Spearman’s ρ (Sheskin 2007), which is more robust to outliers (Triola 2004) and preferred in the presence of ties (Sheskin 2007). We define the distance between each pair of software metrics as 1 − ∥ρ∥², where ρ is their correlation. We perform hierarchical clustering using the R function hclust and obtain the metrics with a threshold of ∥ρ∥ < 0.8 (Succi et al. 2005; Selim et al. 2010; Fukushima et al. 2014) using the R function cutree.
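A minimal sketch of the redundancy removal step follows, assuming metrics is a hypothetical data frame with one column per software metric; the cut height follows the 1 − ∥ρ∥² distance and the ∥ρ∥ < 0.8 threshold described above.

```r
# Hypothetical metric values, used only for illustration.
set.seed(1)
metrics <- data.frame(loc = rpois(50, 100))
metrics$wmc <- metrics$loc + rpois(50, 5)   # strongly correlated with loc
metrics$cbo <- rpois(50, 10)                # largely independent metric

rho <- cor(metrics, method = "spearman")    # Spearman correlation matrix
d   <- as.dist(1 - abs(rho)^2)              # distance between metrics
hc  <- hclust(d)                            # hierarchical clustering
clusters <- cutree(hc, h = 1 - 0.8^2)       # metrics with |rho| >= 0.8 fall in one cluster
selected <- names(clusters)[!duplicated(clusters)]  # keep one metric per cluster
selected
```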

To apply the same Box-Cox transformation on the training and target projects, we estimate λ values for each metric using both projects (see Section 2.4). Similarly, for the rank transformation, we calculate every 10th percentile of the values of each metric using both projects.

We apply logistic regression to build cross-project prediction models using transformed values of the training project, and apply the models on the target project. Next, we examine the impact of three transformation methods on cross-project defect prediction from two perspectives:

  1. 1)

    The overall performance: We compute precision, recall, false positive rate, balance, F-measure, and AUC value to measure the overall performance of these models. To compare the overall performance of the models built using the three transformations, we test the following null hypothesis.

    H02-1: there is no difference between the performance of models built with transformations \(T_a\) and \(T_b\).

    \(T_a\) and \(T_b\) represent two different transformations. We apply a paired Wilcoxon rank sum test with the 95% confidence level (i.e., p-value < 0.05). If there is a statistical significance, we reject null hypothesis H02-1 and further compute Cliff’s δ to quantify the difference.

  2. 2)

    The prediction error: To evaluate if different transformations result in distinct predictions, we compare the prediction errors among models built using the three transformations. To this end, we test the following null hypothesis using McNemar’s test with the 95% confidence level (i.e., p-value < 0.05).

    H02-2: there is no difference between the error rate of models built with transformations \(T_a\) and \(T_b\).

McNemar’s test is commonly used to compare the prediction errors of two prediction models (Japkowicz and Shah 2011). As a nonparametric test, it makes no assumptions about the distribution of the subject variable. McNemar’s test is applicable only if two models are applied on the same dataset with separate training and target sets. In this study, our models are built on the same training project and applied on the same target project, which is different from the training project. Therefore, McNemar’s test is applicable to our study.

To perform McNemar’s test, we need to compute a contingency matrix (see Table 3) based on the predictions produced by two models (i.e., \(M_1\) and \(M_2\)). In the contingency matrix, \(N_{cc}\) is the number of instances for which both models make correct predictions; \(N_{cw}\) is the number of instances for which model \(M_1\) makes a correct prediction but model \(M_2\) makes a wrong prediction; \(N_{wc}\) is the number of instances for which model \(M_1\) makes a wrong prediction but model \(M_2\) makes a correct prediction; and \(N_{ww}\) is the number of instances for which both models make wrong predictions.

Table 3 Contingency matrix to perform McNemar’s test

The null hypothesis of McNemar’s test is that both models \(M_1\) and \(M_2\) have the same error rate. We apply the R function mcnemar.exact from the R package exact2x2 to perform McNemar’s test. The result of McNemar’s test can only indicate whether there is a statistically significant difference, but cannot show the magnitude of the difference. Therefore, we further compute the odds ratio (OR) to measure the effect size of the difference. The odds ratio measures the degree to which one model makes wrong predictions relative to the other model. We compute the odds ratio using the equation \(OR=\frac {N_{cw}}{N_{wc}}\) (Breslow and Day 1980). An OR = 1 means that the two models make the same number of wrong predictions on the instances where the other model predicts correctly. An OR > 1 indicates that model \(M_2\) makes more wrong predictions than model \(M_1\) on the instances where the two models disagree, and vice versa. An OR much greater than or much less than 1 implies a larger difference between the two models.
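The test and the odds ratio can be sketched as follows, where correct1 and correct2 are hypothetical logical vectors marking whether models \(M_1\) and \(M_2\) predict each file correctly.

```r
library(exact2x2)

# Hypothetical per-file correctness of two models, used only for illustration.
correct1 <- c(TRUE, TRUE, FALSE, TRUE, FALSE, TRUE, FALSE, TRUE)
correct2 <- c(TRUE, FALSE, FALSE, TRUE, TRUE, TRUE, TRUE, FALSE)

contingency <- table(correct1, correct2)   # 2x2 contingency matrix as in Table 3
mcnemar.exact(contingency)                 # reject H02-2 if p-value < 0.05

N_cw <- sum(correct1 & !correct2)          # M1 correct, M2 wrong
N_wc <- sum(!correct1 & correct2)          # M1 wrong, M2 correct
N_cw / N_wc                                # odds ratio (OR)
```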

Findings

Applying the three transformation methods yields a similar performance of cross-project prediction models. Figure 4 depicts the boxplots of the six performance measures of the prediction models obtained using each of the three transformations. Table 4 presents the average performance measures of models built using the three transformations for each dataset. The results of the Wilcoxon rank sum tests show that, overall, there is no significant difference among the three transformation methods. Hence, we cannot reject the null hypothesis H02-1 in any of the cases. We conclude that the performance of cross-project prediction models built using the three transformations is similar. This finding is consistent with our previous work (Zhang et al. 2014), which shows that rank and log transformations have a similar power for cross-project predictions, as well as with the work of Jiang et al. (2008). Although the rank transformation significantly outperforms the log and the Box-Cox transformations in improving the normality of software metric values in terms of both skewness and kurtosis (see RQ1), models built using the rank transformation do not outperform models built using either the log or the Box-Cox transformation. One possible reason is that information is lost after the rank transformation, which may offset the potential benefits of using well transformed metrics.

Fig. 4 The boxplots of the six performance measures with the three transformations

Table 4 The results of comparing cross-project defect prediction models built using the three transformations (n.s. denotes no statistical significance, and bold font is used if the corresponding model is better)

The predicted defective files are not consistent among the results of multiple defect prediction models built using different transformation methods.

Although they have similar overall performances (e.g., F-measure and AUC value), the three models do not necessarily have similar prediction errors. More specifically, when some models make wrong predictions on a file, other models may make correct predictions on the same file. To the best of our knowledge, such distinct predictions by models built with various transformation methods are overlooked in prior studies. We conjecture that integrating multiple models, each built using a different transformation method, might improve the predictive power of cross-project defect prediction models.

The detailed results of McNemar’s tests are presented in Table 5. We observe that in the AEEEM dataset, the prediction errors of using the log transformation are significantly different from those of using the Box-Cox and rank transformations in 40% and 50% of all cross-project prediction models, respectively. The prediction errors of using the Box-Cox and rank transformations are significantly different in 70% of all cross-project prediction models. In cases with a statistically significant difference, we can reject the null hypothesis H02-2 and conclude that the three models built using each of the three transformation methods do not consistently make the same predictions on the same file. In other words, each transformation method captures different aspects of the metric values. Similar findings are observed in the ReLink and PROMISE datasets.

Table 5 The p-values of McNemar’s test and odds ratio (OR)

5 Our approach

Transformations may alter the nature of software metrics, and applying various transformations on software metrics captures different perspectives of them. In Section 4, we find that cross-project defect prediction models built using different transformations do not always make the same prediction on the same file. This observation motivates us to integrate models built using different transformations.

In this section, we describe our approach to integrate a set of predictions made by models built with multiple transformations. We present the details of our two approaches: 1) the basic approach MT that integrates multiple models; and 2) the enhanced approach MT+ that selects the most appropriate training project for each target project.

5.1 Our basic approach – MT (multiple transformations)

For a pair of training and target projects, we build multiple defect prediction models, each built using one of the three transformation methods. Our approach aims to utilize the information obtained from multiple models rather than from a single model. The weight of each model is determined by the accuracy of the model on the training data. Figure 5 illustrates the overview of our approach. The details are described as follows.

  • 1) Notations. Let \(M = \{M_1, \ldots, M_n\}\) represent a set of prediction models built using n transformations. A file f in a target project is represented as X, a vector of all software metrics. \(P_{B,i}(X)\) is the predicted probability of defect proneness of file f by model \(M_i\), and \(P_{C,i}(X)\) is the predicted probability that file f is a clean file. Thus, \(P_{B,i}(X) + P_{C,i}(X) = 1\). We consider a file to be defective if \(P_{B,i}(X)\) is greater than 0.5 (Zimmermann et al. 2009).

  • 2) Computation of the probability of defect proneness. We use \(P_B(X)\) to denote the final probability of defect proneness of file f using all n models. We compute \(P_B(X)\) in the following two ways: 1) weighting the probabilities \(P_{B,i}(X)\) produced by the models that consider file f as defective; or 2) weighting the probabilities \(P_{C,i}(X)\) produced by all models, if no model considers file f as defective. Accordingly, \(P_B(X)\) is defined in (6).

    $$ P_{B}(X) = \left\{ \begin{array}{l l} min(1, \frac{{\sum}_{M_{i} \in M}w_{i} \times s_{i}(X) \times P_{B,i}(X)}{N_{B}(X)}) & \text{if \(N_{B}(X)>0\)}\\ \\ max(0, 1-\frac{{\sum}_{M_{i} \in M}w_{i} \times P_{C,i}(X)}{n}) & \text{otherwise} \end{array} \right. $$
    (6)

    where \(w_i\) is the weight assigned to model \(M_i\); \(s_i(X)\) is the selector for model \(M_i\) that determines whether the probability predicted by model \(M_i\) is used to compute the final probability of defect proneness; and \(N_B(X)\) is the number of selected models. The min and max operators limit \(P_B(X)\) to the range [0,1].

  • 3) Weight of each model. A weight is assigned to each model, since the accuracy of different models varies. Models with higher accuracy should be encouraged, and models with lower accuracy should be penalized. Hence, we use the accuracy \(a_i\) of a model to obtain the weight \(w_i\) for each model \(M_i\). The accuracy \(a_i\) is computed on the training data as the proportion of correct predictions (i.e., true positives and true negatives) relative to the total number of predictions. We set \(w_i = 0\) if \(a_i = 0\). For a model with non-zero accuracy (i.e., \(a_i > 0\)), we define its weight \(w_i\) as \(w_{i}=\frac {a_{i}}{minAcc}\), where minAcc is the minimum non-zero accuracy among the n models.

  • 4) Selection of each model. A selector \(s_i(X)\) for each model \(M_i\) is defined to capture every possible defective file. We consider a file to be defective if it is predicted as defective by one or more models. As such, the selector \(s_i(X)\) is defined in (7). For each file, as shown in (6), the predicted probability of model \(M_i\) is used only if the file is predicted as defective by model \(M_i\) (i.e., \(P_{B,i}(X) > 0.5\)).

    $$ s_{i}(X) = \left\{ \begin{array}{l l} 1 & \quad \text{if \(P_{B,i}(X)>0.5\)}\\ 0 & \quad \text{otherwise} \end{array} \right. $$
    (7)

    The number of selected models \(N_B(X)\) for file f is defined in (8). As applied in (6), \(N_B(X)\) is used to normalize the predicted probability of a file that is predicted as defective by at least one model.

    $$ N_{B}(X)=\sum\limits_{i=1}^{n} s_{i}(X) $$
    (8)
  • 5) Prediction by the integrated model. As described in (6), the integrated model considers a file as defective if it is predicted as defective by at least one model. Therefore, the integrated model increases the recall of defective files but introduces false positives as well. To mitigate the inflation of the false positive rate, we increase the cut-off value by multiplying it by a factor α, where α > 1. In this study, we empirically set α = 1.2, thus the default cut-off value 0.5 is increased to 0.6 (= 0.5 × 1.2). Changing the value of α only affects the performance measures that rely on the cut-off value of the predicted probability of defect proneness, but does not alter the AUC. Moreover, automatic determination of the cut-off value has been proposed in our previous work (Zhang et al. 2015). A code sketch of the integration defined in (6) to (8) is provided after Fig. 5.

Fig. 5 Overview of our approach MT to integrate models built upon differently transformed data
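The following minimal sketch implements the integration defined in (6) to (8), assuming prob_defective is a hypothetical matrix of per-model probabilities \(P_{B,i}(X)\) (one row per file, one column per model) and acc holds each model’s accuracy on the training data.

```r
mt_integrate <- function(prob_defective, acc, alpha = 1.2) {
  min_acc <- min(acc[acc > 0])
  w <- ifelse(acc > 0, acc / min_acc, 0)            # weight of each model
  p_final <- apply(prob_defective, 1, function(p_b) {
    s   <- as.numeric(p_b > 0.5)                    # selector, Equation (7)
    n_b <- sum(s)                                   # number of selected models, Equation (8)
    if (n_b > 0) {
      min(1, sum(w * s * p_b) / n_b)                # first case of Equation (6)
    } else {
      max(0, 1 - sum(w * (1 - p_b)) / length(p_b))  # second case (P_C = 1 - P_B)
    }
  })
  # A file is predicted as defective if the integrated probability exceeds the
  # inflated cut-off 0.5 * alpha (0.6 by default).
  list(probability = p_final, defective = p_final > 0.5 * alpha)
}

# Hypothetical probabilities of two files predicted by three models, and the
# training accuracy of each model, used only for illustration.
prob_defective <- matrix(c(0.7, 0.4, 0.3,
                           0.2, 0.1, 0.3), nrow = 2, byrow = TRUE)
acc <- c(0.70, 0.65, 0.60)
mt_integrate(prob_defective, acc)
```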

5.2 Our enhanced approach MT+

Our basic approach integrates defect prediction models built with various transformation methods, but it cannot select the most appropriate training project for each target project. When there are many candidate training projects, it is unclear how to choose the best project to build the model.

To this end, we further enhance our approach by automatically selecting the most appropriate training project for each target project in an unsupervised fashion. For the automatic selection, we propose to apply the knowledge learned from the Box-Cox transformation, as the Box-Cox transformation has a parameter λ that reflects the distribution of the original metric values. For each candidate training project, we compute the best \(\widehat {\lambda }\) value of each metric using (4). We then choose the training project whose average \(\widehat {\lambda }\) value is the closest to zero. Note that λ = 0 indicates a log-normal distribution; therefore, the distribution of the metric values of the selected training project is the closest to a log-normal distribution. A log-normal distribution can be observed in many natural growth processes where small changes accumulate. We conjecture that a software project that grows similarly to natural growth processes is more appropriate for building a cross-project defect prediction model.

As this approach is based on MT, we term our enhanced approach MT+.
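A minimal sketch of the selection strategy follows; lambda_by_project is a hypothetical named list holding the estimated λ values of every metric in each candidate training project.

```r
# Hypothetical estimated lambda values per candidate training project.
lambda_by_project <- list(
  projectA = c(-0.3, 0.1, -0.2, 0.0),
  projectB = c(-0.8, -0.6, -1.0, -0.4)
)

# MT+ picks the project whose average lambda is the closest to zero, i.e.,
# whose metrics are the closest to a log-normal distribution.
avg_lambda    <- sapply(lambda_by_project, mean)
best_training <- names(which.min(abs(avg_lambda)))
best_training  # "projectA"
```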

6 Evaluation of our approaches

In this section, we evaluate the effectiveness of using our approaches in cross-project defect prediction.

6.1 RQ3. Can our approaches improve the performance of cross-project defect prediction models?

Motivation

When building defect prediction models, usually only one transformation method is applied. The findings of RQ2 suggest that cross-project defect prediction models built using the three transformation methods can make different predictions on the same file. To improve the predictive power of cross-project defect prediction models, we propose the approach MT to integrate multiple transformations (see Section 5) and the enhanced approach MT+ to select the most appropriate training project. In this question, we aim to investigate if our approaches can achieve a better performance of cross-project predictions, compared to models built using only one transformation method.

Approach

To evaluate the performance of our approaches, we choose five baselines: 1) Raw models built without any transformation; 2) Raw+ models built without any transformation but the most appropriate training project is selected using the same strategy as MT+; 3) Min-max models built using the min-max scaling method (i.e., min-\(max=\frac {x-x_{min}}{x_{max}-x_{min}}\), where x is the raw value of a metric); 4) Z-score models built using the normalization method (i.e., z-\(score=\frac {x-\mu }{\sigma }\), where x is the raw value of a metric, μ is the mean of the metric, and σ is the standard deviation of the metric); and 5) Log models built using the log transformation.
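The two scaling baselines can be sketched in R as follows, applied to a hypothetical metric vector x.

```r
min_max <- function(x) (x - min(x)) / (max(x) - min(x))  # scales into [0, 1]
z_score <- function(x) (x - mean(x)) / sd(x)             # zero mean, unit variance

x <- c(12, 3, 250, 41, 7, 1800)  # hypothetical metric values
min_max(x)
z_score(x)
```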

Similar to RQ2, we build cross-project defect prediction models using all possible pairs of the training and target projects in each dataset. We perform the seven transformations on each training project, and build seven logistic regression models, respectively. We apply the seven models on the target project to obtain predictions on each file of the target project. As described in Section 5, we use our approach MT to integrate the predictions of the three models built with log, the Box-Cox, and rank transformations. We use the method described in Section 5.2 to select the most appropriate training projects for Raw+ and MT+. The most appropriate training projects in the three datasets are: “Eclipse JDT Core” followed by “Eclipse PDE UI” (AEEEM), “Apache HTTP Server” followed by “OpenIntents Safe” (ReLink), and “POI v3” followed by “Camel v1.6” (PROMISE).

To evaluate the performance of these models, we calculate precision, recall, false positive rate, balance, F-measure, and AUC value. To investigate if our approaches can improve the performance of cross-project prediction, we test the following null hypothesis for each performance measure:

H03-1: there is no difference between the performance of the two types of models.

For instance, we test the null hypothesis H03-1 between models built with our approach MT and models built with the log transformation. We apply a paired Wilcoxon rank sum test with the 95% confidence level (i.e., p-value < 0.05). If there is a statistical significance, we reject the hypothesis and conclude that our approaches achieve a statistically significant improvement in the performance of cross-project predictions.

Findings

In general, our approaches MT and MT+ statistically significantly improve the performance of cross-project defect prediction in terms of recall, balance, and F-measure. Table 6 shows the detailed Cliff’s δ between the performance measures of each pair of assessed models. When using logistic regression to build the model, our strategy of selecting the most appropriate training project (i.e., Raw+) statistically significantly improves the performance of models built without transformation (i.e., Raw). Using the min-max scaling method yields statistically significantly lower performance in terms of recall and AUC than using no transformation. We do not observe any significant difference between the performance of models built with the z-score normalization method and models built without transformation for any of the six studied performance measures. Similarly, there is no significant difference between the performance of models built with our approach MT and Raw+. However, compared to models built without transformation or with the Min-max, Z-score, or Log transformation, our approach MT significantly improves the performance of cross-project defect prediction in terms of recall, balance, F-measure, and AUC; the only exception is that MT achieves performance similar to the log transformation. From Table 6, we observe that our enhanced approach MT+ further improves the performance of cross-project defect prediction over Raw+ and MT in terms of recall and F-measure.

Table 6 The Cliff’s δ between the performance measures of two assessed models. (‘*’ indicates p-value < 0.05, ‘**’ for p-value < 0.01, and ‘***’ for p-value < 0.001)

Table 7 presents the average value of the six measures over all possible cross-project predictions. In particular, the average F-measure of the models built using the log transformation in the AEEEM dataset is 0.34. The log transformation is the most widely used transformation in the defect prediction literature. Compared to the log transformation, our approach MT+ achieves an improvement of 24% in the average F-measure (i.e., from 0.34 to 0.42). In the ReLink and PROMISE datasets, we have similar observations: the average F-measures are improved from 0.54 to 0.60 (i.e., an 11% improvement) and from 0.31 to 0.42 (i.e., a 29% improvement), respectively. Compared to the models built with the log transformation, our approach MT only achieves a trivial improvement in terms of F-measure. The p-values of the Wilcoxon tests on three performance measures (i.e., recall, balance, and F-measure) between models built with the log transformation and our approach MT+ are 1.93e-03, 1.58e-03, and 3.28e-04, respectively. Hence, we reject the null hypothesis H03-1 for these three performance measures and conclude that our enhanced approach MT+ achieves a statistically significant improvement in these measures. We conclude that solely integrating models built with multiple transformations is not sufficient to improve the predictive power. It is more effective to improve the performance of cross-project defect prediction by selecting the most appropriate training project for each target project (i.e., MT+).

Table 7 Average performance measures of cross-project defect prediction models obtained using five baseline transformations and our approaches (MT and MT+). (Note: bold font is used if the corresponding model is better)

Furthermore, we present the F-measure and AUC values of cross-project defect prediction models for each project in Table 8. We consider that a model wins if it achieves the best performance among all models. We observe that models built without any transformation win one or zero times in terms of F-measure and the AUC value. Models built with Raw+, the log transformation, and our basic approach MT achieve similar results. However, models built with our enhanced approach MT+ win 11 and 10 times in terms of F-measure and the AUC value, respectively. This indicates that the strategy used in MT+ to select the most appropriate training project is effective. We recall that our strategy is to select as the training project the project whose metrics are most likely to follow a log-normal distribution. Our results confirm our assumption that a software project that grows similarly to natural growth processes is more appropriate for building a cross-project defect prediction model. In other words, whether the metrics of the training project follow a log-normal distribution is an important factor in determining if a defect prediction model succeeds in cross-project defect prediction.

Table 8 The F-measure and AUC values of cross-project defect prediction models for each project (bold font is used if the corresponding model is better)

A recent work by Nam et al. (2013) shares a similar concept with our approach, i.e., using transformations (namely TCA+) to improve the performance of cross-project defect prediction. In particular, Nam et al. (2013) propose a set of rules to automatically select the most appropriate normalization method (e.g., min-max and z-score) for each pair of projects. The TCA+ approach successfully improves the average F-measure by 28% (i.e., from 0.32 to 0.41) in the AEEEM dataset, and by 24% (i.e., from 0.49 to 0.61) in the ReLink dataset. Although TCA+ and MT+ achieve similar improvements, our approach MT+ can automatically determine the most appropriate training project for each target project, and therefore relieves practitioners from experimenting with every pair of training and target projects. Moreover, with the training projects selected by our approach MT+, the F-measure of TCA+ can be further improved (i.e., from 0.41 to 0.43 in the AEEEM dataset and from 0.61 to 0.62 in the ReLink dataset).

The false positive rate is increased by our approach MT+ in the ReLink and the PROMISE datasets, but it is controllable

We observe that our approach MT+ has a higher false positive rate than the models built with the log transformation in two datasets. The average false positive rate of the models built using our approach MT+ in the ReLink dataset is acceptable (i.e., 0.28, which is less than 0.3 (Moser et al. 2008)). In the PROMISE dataset, the false positive rate is increased from 0.18 to 0.47. We conjecture that an appropriate cut-off may reduce the excessive false positive rate, since the AUC value is the same (i.e., both are 0.71 in the PROMISE dataset). Our previous work (Zhang et al. 2015) describes two concrete and practical solutions to reduce the false positive rate by automatically determining the cut-off. Therefore, the inflated false positive rate is controllable.


6.2 RQ4. Do our approaches work well for other classifiers?

Motivation

We have demonstrated the effectiveness of our approaches using a single classifier (i.e., logistic regression). However, there are many other classifiers (e.g., random forest and Naive Bayes) that are also frequently used to build defect prediction models (e.g., Jiang et al. 2008; Kim et al. 2011; Menzies et al. 2007; Song et al. 2011). To understand the generalizability of our approaches, it is necessary to study if our approaches can achieve a similar improvement with other classifiers as with logistic regression.

Approach

We follow the same approach as in RQ3, but using different classifiers to build cross-project prediction models. As described in Section 3.2, we study six classifiers, i.e., Bayes net (BN), k-nearest neighbours (IBk), decision tree (J48), naive Bayes (NB), random forest (RF), and random tree (RT).

To investigate if our approaches can improve the performance of cross-project prediction, we test the following null hypothesis for each classifier. We apply paired Wilcoxon rank sum test with the 95% confidence level (i.e., p-value < 0.05).

H04-1: there is no difference between the performance of our approach and the models built with the log transformation, when using classifier C to build the model.

Classifier C represents one of the studied classifiers. As in RQ3, we use the same five baselines: Raw, Raw+, Min-max, Z-score, and Log.

Findings

In general, our approach MT+ can improve the performance of cross-project defect prediction models. However, the improvement varies with classifiers. Table 9 presents the average F-measures and AUC values of models built with the log transformation and our approaches using each of the six classifiers. Compared to models built without any transformation, building models with a single transformation method (i.e., Min-max, Z-score, or the log transformation) generally cannot improve the performance in terms of F-measure and AUC values. From this perspective, improving the normality of software metrics may not be sufficient to improve the performance of cross-project defect prediction models. However, our approach MT generally improves the performance, indicating that the integration of models built with multiple transformations is beneficial. Moreover, our strategy of selecting the most appropriate training project (i.e., both Raw+ and MT+) can further improve the performance. Therefore, although using a single transformation is not shown to be beneficial, it is worth integrating models built with multiple transformations and selecting the most appropriate training project based on the estimated parameter of the Box-Cox transformation.

Table 9 Average F-measures and AUC values of cross-project defect predictions obtained using the log transformation and our approaches (bold font is used if the corresponding model is better)

In terms of F-measure, our approach MT can achieve statistically significant improvement over models built with the log transformation, when using IBk (26% to 30%), J48 (16% to 68%), logistic regression (3% to 13%), and random tree (21% to 54%). In terms of the AUC value, four classifiers can benefit from our approach MT, i.e., IBk (4% to 9%), J48 (3% to 13%), random forest (1% to 4%) and random tree (3% to 9%). In all cases, the AUC values remain the same or are increased by using our approach MT.

Our enhanced approach MT+ statistically significantly improves the F-measure for almost all studied classifiers (except Naive Bayes), namely Bayes net (5% to 56%), IBk (30% to 37%), J48 (28% to 106%), logistic regression (11% to 35%), random forest (7% to 59%), and random tree (21% to 56%). The AUC values are statistically significantly improved for one classifier, Bayes net (1% to 11%). As shown in Table 9, our enhanced approach MT+ achieves the same or higher AUC values in all but four cases.

In summary, our approaches generally improve the performance of cross-project defect prediction models for multiple classifiers, although the improvement varies with classifiers.


7 Related work

In this section, we first describe prior studies on data transformation in defect prediction, and then present related work regarding cross-project defect prediction.

7.1 Data transformation in defect prediction

Data normality benefits both parametric and non-parametric statistical methods (Osborne 2008), and improves the performance of linear models (Kuhn and Johnson 2013). Transformation is a common method to reduce skewness and improve data normality (Bishara and Hittner 2014; Gaudard and Karson 2000). Data transformation is essential in defect prediction studies, since many software metrics follow power law distributions (Zhang 2009).

To build a defect prediction model, researchers often apply the natural log transformation (e.g., Menzies et al. 2007; Jiang et al. 2008; Cruz and Ochimizu 2009; Song et al. 2011) and the rank transformation (e.g., Jiang et al. 2008; Zhang et al. 2014) on software metrics.

However, applying the log transformation only benefits some classifiers (e.g., Naive Bayes) (Menzies et al. 2007; Song et al. 2011). For some other classifiers (e.g., decision tree), there is no statistically significant difference (Menzies et al. 2007; Song et al. 2011). Jiang et al. (2008) further compare the performance of defect prediction models built using log and rank transformations. After examining ten classifiers, Jiang et al. (2008) conclude that different classifiers prefer different transformations. For instance, random forest performs better with the log transformation, whereas Naive Bayes performs better with the rank transformation. There are also studies that use different transformations for different classifiers. He et al. (2013) apply the rank transformation for Naive Bayes, but use the original values for random forest and logistic regression.

7.2 Cross-project defect prediction

The major challenge in cross-project defect prediction is the heterogeneity between the training and target projects (Zimmermann et al. 2009). To address this problem, Menzies et al. (2013) and Bettenburg et al. (2012) investigate if it is beneficial to build models based on instances (e.g., files or classes) that are similar to the target project. Both studies observe that using only the instances that are similar to the target project yields better performance in cross-project defect prediction models than using all instances. Turhan et al. (2013) recommend mixing the within-project and cross-project data to build a model, since models built with the mixed data outperform models built with only cross-project data. In summary, the aforementioned studies propose to select similar instances to reduce the heterogeneity between the training and target projects.

An alternative solution is to transform the training and target projects to mitigate their heterogeneity (e.g., Ma et al. 2012; Nam et al. 2013). For instance, Ma et al. (2012) propose an approach to transform the training project using statistical characteristics extracted from the target project. Nam et al. (2013) propose an approach based on transfer component analysis (TCA) to transform the training and target projects together. Jing et al. (2015) propose to unify the metric representation between the training and target projects based on canonical correlation analysis (CCA). These three approaches achieve significant improvement for cross-project predictions. Furthermore, in our prior work (Zhang et al. 2014), we propose a context-aware rank transformation to convert the values of metrics to exactly the same scales across projects. The rank transformation enables us to build a generalized model that on average provides comparable performance to within-project models. Different from the aforementioned studies, we focus on an in-depth analysis of three simple transformation methods (i.e., log, Box-Cox, and rank) in the cross-project setting. We perform a thorough analysis of the capability of these transformations to improve the normality of software metrics (i.e., RQ1), and of the benefits of applying these transformations for cross-project predictions (i.e., RQ2).

In addition to reducing the heterogeneity between the training and target projects, there are a few other approaches that aim to improve cross-project predictions. For instance, Jing et al. (2016) propose a feature learning method, namely subclass discriminant analysis (SDA), to effectively address the class-imbalance problem in cross-project defect prediction. For the case where the training and target projects do not share the same set of metrics, Nam and Kim (2015) provide a solution. Canfora et al. (2013) propose to build multiple models rather than a single model. Similarly, we propose to integrate predictions by models built using the three transformations (i.e., RQ3 and RQ4).

8 Threats to validity

In this section, we describe the threats to the validity of our study, following the common guidelines by Yin (2002).

Threats to conclusion validity

concern the relation between the treatment and the outcome. The main threats come from our implementation of the three transformations. For instance, we normalize metric values transformed by the Box-Cox transformation to [1,2]. We clearly describe our treatments of the three transformations, so that researchers can replicate our work and reach the same conclusions when applying the same treatments as in our study.
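For clarity, the treatment mentioned above can be sketched as follows. This is a simplified illustration; the +1 shift that keeps values strictly positive and the maximum-likelihood estimation of the Box-Cox parameter are assumptions of this sketch:

import numpy as np
from scipy.stats import boxcox

def boxcox_rescaled(values):
    shifted = np.asarray(values, dtype=float) + 1.0   # Box-Cox requires x > 0
    transformed, lam = boxcox(shifted)                # lambda fitted by maximum likelihood
    lo, hi = transformed.min(), transformed.max()
    span = (hi - lo) or 1.0                           # guard against constant metrics
    return 1.0 + (transformed - lo) / span, lam       # rescale to [1, 2]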

Threats to internal validity

concern our selection of subject systems and analysis methods. We choose subjects from three publicly available data sets that have been used in many other studies (He et al. 2012; Nam et al. 2013; Tantithamthavorn et al. 2016). The selected projects are diverse in size and in the ratio of defective instances. The threats to our analysis method come from our choice of random forest to study RQ2 and RQ3. Thus, in RQ4, we examine the effectiveness of our approach using six other classifiers.

Threats to external validity

concern the possibility of generalizing our results. Our approach is based on the log, Box-Cox, and rank transformations. All three transformations are applicable to software metrics, since many metrics follow power law distributions (Concas et al. 2007; Louridas et al. 2008; Zhang 2009). The diversity in size and defect-proneness of our subject projects helps verify the generalizability of our approach. Nevertheless, further validation on other open source projects and even commercial projects is welcome.

Threats to reliability validity

concern the possibility of replicating this study. All three data sets used in this study are publicly accessible. We also provide all the necessary details of our experiments online (see Footnote 6).

9 Conclusion

Cross-project defect prediction remains a challenging problem, due to the heterogeneity between the training and target projects (Nam et al. 2013; Zimmermann et al. 2009). For instance, the values of software metrics exhibit varied distributions across projects (Zhang et al. 2013). To this end, Ma et al. (2012), Nam et al. (2013), and Zhang et al. (2014) successfully apply appropriate transformations to improve the performance of cross-project defect prediction models. Apart from such complex transformations, several simple transformations have been overlooked. Therefore, we set out to investigate the impact of three simple transformations (i.e., log, Box-Cox, and rank transformations) in the cross-project setting.

In this paper, we observe that all three transformation methods have a similar power to significantly improve the normality of software metrics. Moreover, cross-project prediction models built with each of the three transformation methods achieve similar performance (in terms of precision, recall, false positive rate, balance, F-measure, and AUC). However, we find that these models do not always make the same prediction on the same file, as the results of McNemar’s tests clearly show that these models can experience significantly different error rates.

Therefore, we propose an approach MT (Multiple Transformations) to integrate the predictions of cross-project defect prediction models built using each of the three transformation methods (i.e., log, Box-Cox, and rank). We further enhance our approach (i.e., MT+) by automatically selecting the most appropriate training project for each target project. We perform an experiment using three public data sets, namely AEEEM (D’Ambros et al. 2010), ReLink (Wu et al. 2011), and PROMISE (Jureczko and Madeyski 2010). The results show that, compared to the models built with only one transformation method (i.e., the widely used log transformation), our enhanced approach MT+ statistically significantly improves recall, balance, and F-measure in all three data sets. For instance, the average F-measures are improved by 24%, 11%, and 29% in the AEEEM, ReLink, and PROMISE data sets, respectively. Our approaches also lead to performance improvement in cross-project defect prediction models for the various classifiers under study (e.g., random forest). Furthermore, we compare the performance of using untransformed values (Raw), untransformed values with the selection of the most appropriate training project (Raw+), rescaled values by the min-max method (Min-max), normalized values by the z-score method (Z-score), transformed values by logarithm, and our approaches MT and MT+. We find that using a single transformation method usually cannot improve the performance of cross-project defect prediction models. Instead, the models built with multiple transformations should be integrated, and, more importantly, the most appropriate training project should be selected, which can be done in an unsupervised way (i.e., by estimating the parameter of the Box-Cox transformation).
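To illustrate the unsupervised selection step, the following sketch estimates a Box-Cox parameter per project and picks the candidate training project whose parameter is closest to that of the target project. Averaging the per-metric parameters and the +1 shift are simplifications for illustration, not the exact procedure of our experiments:

import numpy as np
from scipy.stats import boxcox_normmax

def project_lambda(metric_matrix):
    # metric_matrix: files x metrics; estimate one Box-Cox lambda per metric
    # column and aggregate them into a single per-project value.
    lams = [boxcox_normmax(np.asarray(col, dtype=float) + 1.0)
            for col in metric_matrix.T]
    return float(np.mean(lams))

def select_training_project(candidates, target_matrix):
    # candidates: dict mapping a candidate project name to its metric matrix.
    target_lam = project_lambda(target_matrix)
    return min(candidates,
               key=lambda name: abs(project_lambda(candidates[name]) - target_lam))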

As future work, we recommend experimenting with our approach for potential gains in the predictive power of cross-project defect prediction models, since our approach introduces little overhead by only adding simple mathematical operations. We are also interested in applying more advanced ensemble learners (e.g., Xia et al. 2015; Misirli et al. 2011; Panichella et al. 2014) to enhance our approach. In addition, it is worth studying whether it is beneficial to apply multiple transformation methods when building other types of prediction models (e.g., effort estimation).