
1 Introduction

Fixing software defects is a difficult and time-consuming job for software practitioners during the development of a large-scale software project [3], at the start of which the potential locations of these defects need to be identified. In recent decades, software defect prediction has been proposed to locate defect-prone code regions [7, 9, 11, 18, 19]. Most defect prediction models are constructed with supervised classifiers, i.e., trained with label information on a training set and tested on a disjoint test set. Generally, these supervised prediction models have greatly aided software testing and alleviated the debugging burden of software engineers. Most previous studies report that such prediction models achieve promising performance in defect prediction.

Recently, unsupervised defect prediction has drawn more attention from the research community. Zhou et al. [24] pointed out that a simple unsupervised defect prediction model (a simple module size model) has prediction performance comparable or even superior to most existing defect prediction models in the cross-project scenario. Yang et al. [21] showed that in effort-aware just-in-time defect prediction settings, many simple unsupervised models perform better than state-of-the-art supervised models.

One limiting factor of defect prediction is that the defect datasets extracted from real-world software projects are not big, which prevents defect prediction models from employing popular methods for analyzing and mining big data and restricts the scope of application of defect prediction models. In this study, we construct bigger defect datasets by merging available datasets that share the same measurement dimensions and check whether bigger data brings better defect prediction performance.

In detail, we apply traditional supervised classifiers and a state-of-the-art unsupervised model (the simple module size model [24]) and test their performance on the bigger datasets we collected. Besides, to examine whether enlarging the dataset boosts the performance of supervised classifiers, we organize a comparative experiment between the raw smaller datasets and the merged bigger datasets. The results of our experiment reveal that larger-scale datasets do not improve the supervised classifiers, and that the simple module size model remains comparable to traditional supervised models on the bigger datasets.

The rest of this paper is organized as follows. Section 2 introduces the background. Section 3 describes our preparations, including the studied datasets, classifiers, and evaluation measures. Section 4 presents the experimental results. Section 5 summarizes the threats to validity. Section 6 concludes the paper.

2 Background and Related Work

There are two kinds of defect prediction models: supervised and unsupervised. Supervised models are widely used and have been proved effective in prior work [22]. Unsupervised models, in contrast, can classify objects directly without a training process, so they are time-saving and easy to implement. Clustering and the simple module size model are two common unsupervised techniques used in the software engineering field.

According to the data they are built on, defect prediction can also be divided into two scenarios. One is within-project defect prediction (WPDP), in which the dataset used to train the classifier and the one used for prediction come from the same project. The other is cross-project defect prediction (CPDP). In practice, some companies may find it hard to collect enough data from their own project to train a classifier [25], so they need CPDP solutions, which utilize data from a different source project. In this study, we leverage both supervised and unsupervised models and organize experiments on both WPDP and CPDP.

3 Experimental Setup

In this section, we introduce our preparations for the experiment: how we collected the datasets, selected the classifiers, and set up the prediction and evaluation procedures.

3.1 Collected Datasets

We use data from 15 projects in three groups: AEEEM, Eclipse, and JURECZKO. The data within each group share a unified set of metrics, so they can easily be merged for further testing. Each project has both code metrics and clear defect labels; specifically, each file is labeled 1 (buggy file) or 0 (clean file). Table 1 describes the 43 datasets used. The first to third columns list the group name, project name, and version number. For each project, the fourth to sixth columns list the number of modules, the number of metrics, and the percentage of defective modules.

Table 1. Information of collected data sets.

In this study, we investigate how defect prediction models perform on larger-scale datasets, so all the versions of a project are merged into one file (they share the same measurement dimensions). The AEEEM group is an exception, because its projects do not possess a series of versions to merge. Hence, we merge all the projects in AEEEM into one file, so this file can be seen as a CPDP test, while the others are WPDP tests. All these merged files are called combinations in the following description; in contrast, the original raw files are called individuals. All the merged files are listed in Table 2.
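
As an illustration of this merging step, the sketch below simply concatenates the version files of a project that share the same metric columns; the file layout and names are hypothetical and not taken from our implementation.

```python
import glob

import pandas as pd

# Merge all versions of one project into a single "combination" file,
# assuming every version is a CSV file with identical metric columns
# (hypothetical layout, for illustration only).
def merge_versions(pattern):
    frames = [pd.read_csv(path) for path in sorted(glob.glob(pattern))]
    return pd.concat(frames, ignore_index=True)

# e.g. merge ant-1.3.csv ... ant-1.7.csv into one "combination" dataset
# ant_combination = merge_versions("JURECZKO/ant-*.csv")
```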

3.2 Classifiers

In this study, we leverage three common supervised learning models and a simple module size model: DNN (Deep Neural Network), RF (Random Forest), LR (Logistic Regression), and ManualDown. Compared with other frequently used supervised learning models such as linear regression or decision trees, the models we chose are more suitable and representative for our datasets. Notions and implementation details are presented below in sequence.

Table 2. Information of merged datasets.

Deep Neural Network. A deep neural network (DNN) is a classical machine learning technique that is widely used in a variety of learning scenarios. In our study, we build the DNN model with TensorFlow. The DNN we use has 2 hidden layers with 15 neurons in each layer. We select the Adam optimizer provided by TensorFlow as the training algorithm and cross entropy as the loss function.
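
The following Keras sketch illustrates such a network: 2 hidden layers with 15 neurons each, the Adam optimizer, and a cross-entropy loss. The activation functions and training hyper-parameters are our assumptions and are not taken from the original implementation.

```python
import tensorflow as tf

# Sketch of the DNN described above (assumed activations and hyper-parameters).
def build_dnn(n_metrics):
    model = tf.keras.Sequential([
        tf.keras.layers.Input(shape=(n_metrics,)),
        tf.keras.layers.Dense(15, activation="relu"),
        tf.keras.layers.Dense(15, activation="relu"),
        tf.keras.layers.Dense(1, activation="sigmoid"),  # P(file is buggy)
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy",
                  metrics=[tf.keras.metrics.AUC()])
    return model

# model = build_dnn(n_metrics=X_train.shape[1])
# model.fit(X_train, y_train, epochs=100, batch_size=32, verbose=0)
```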

Random Forest. Random forest (RF) is an ensemble learning method usually used for classification. Nowadays it is used more and more frequently in software engineering studies and has been proved effective by prior related work [5, 15]. In our experiment, we leverage the R package randomForest to implement the model. The number of trees grown (a parameter of the RF function) is set to 100, which proved an appropriate value in our testing.
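
For illustration, the scikit-learn sketch below mirrors this setting (100 trees); the actual experiment uses R's randomForest, so this is only an assumed equivalent.

```python
from sklearn.ensemble import RandomForestClassifier

# Illustrative equivalent of randomForest with ntree = 100 (assumed seed).
rf = RandomForestClassifier(n_estimators=100, random_state=0)
# rf.fit(X_train, y_train)
# scores = rf.predict_proba(X_test)[:, 1]  # predicted defect-proneness
```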

Logistic Regression. Logistic regression (LR) is a technique borrowed by machine learning from the field of statistics. As a typical machine learning model, logistic regression is also chosen as one of the traditional defect prediction models. We use the glm function in R, which fits generalized linear models, to implement it.
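
A scikit-learn sketch of the same classifier is given below for illustration; the actual experiment fits LR with R's glm, so this is only an assumed equivalent.

```python
from sklearn.linear_model import LogisticRegression

# Illustrative stand-in for glm with a binomial link (logistic regression).
lr = LogisticRegression(max_iter=1000)
# lr.fit(X_train, y_train)
# scores = lr.predict_proba(X_test)[:, 1]  # predicted defect-proneness
```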

ManualDown. ManualDown is a simple module size model based on the idea that module size in a target project can be used to predict the location of bugs: larger modules are considered more defect-prone. Since ManualDown does not need any data from source projects to build the model, it is free of the challenges caused by different metric sets in the source and target project data. In particular, it has a small computational cost and is easy to implement. Furthermore, previous studies show that module size has a strong confounding effect on the associations between code metrics and defect-proneness [2, 23]. We hence include ManualDown as a baseline model for the following comparison.
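
A minimal sketch of ManualDown is shown below: modules of the target project are ranked by size (LOC) in descending order and the size itself serves as the defect-proneness score. The column name "loc" is an assumption about the dataset format.

```python
import pandas as pd

# ManualDown sketch: larger modules are predicted to be more defect-prone.
def manual_down(df, size_col="loc"):
    ranked = df.sort_values(size_col, ascending=False).copy()
    # use the (normalised) module size directly as the defect-proneness score
    ranked["score"] = ranked[size_col] / ranked[size_col].max()
    return ranked
```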

3.3 Prediction Setting

To decrease test error and estimate the accuracy of a predictive model in practice, we apply one round of 5-fold cross-validation for each test in our study. First, we partition all the instances (also called modules) in the dataset into 5 complementary subsets; in each pass, one of them is chosen as the test set and the others form the training set. After 5 passes, we obtain 5 performance values, which reflect the classifier's performance and constitute the box plots exhibited in Sect. 4.
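
The sketch below shows this procedure with scikit-learn; stratified splitting and the fixed random seed are our own choices and are not stated in the setup above.

```python
from sklearn.model_selection import StratifiedKFold

# One round of 5-fold cross-validation: each subset serves once as the test
# set while the remaining four form the training set.
def cross_validate(model, X, y, evaluate):
    scores = []
    kfold = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
    for train_idx, test_idx in kfold.split(X, y):
        model.fit(X[train_idx], y[train_idx])
        pred = model.predict_proba(X[test_idx])[:, 1]
        scores.append(evaluate(y[test_idx], pred))
    return scores  # 5 values -> one box in the plots of Sect. 4
```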

3.4 Evaluation Measures

There are numerous evaluation measures in the defect prediction field. We mainly consider effort-aware scenarios and select three typical evaluation measures to evaluate our prediction models' performance.

AUC (The area under the ROC curve). In machine learning, AUC is commonly relied on to evaluate classification problems. AUC is the area under the ROC curve, which reflects a classification model's performance at various threshold settings. Hence AUC can fairly describe a model's capability of distinguishing between classes. AUC lies between 0 and 1 (inclusive); the closer AUC is to 1, the better the model distinguishes between buggy and clean instances.
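
Our evaluation measures are implemented in Python; the sketch below uses scikit-learn for AUC, which is our assumption rather than a statement about the original code.

```python
from sklearn.metrics import roc_auc_score

# AUC from predicted defect-proneness scores and 0/1 defect labels.
def auc(y_true, scores):
    return roc_auc_score(y_true, scores)
```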

F1 (F1-Score). F1 is the harmonic mean of precision and recall:

$$ F_{1}=(\frac{recall^{-1}+precision^{-1}}{2})^{-1} =2\cdot \frac{precision\cdot recall}{precision+recall} $$

As a harmonic mean, F1 combines precision and recall to give a comprehensive description of how well the model distinguishes the two classes. A good F1 score means that both false positives and false negatives are low, so the model correctly identifies the truly buggy instances. The F1 score is also between 0 and 1 (inclusive) and is considered better the closer it is to 1.
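
The sketch below computes F1 as defined above from binarised predictions; thresholding the predicted probabilities at 0.5 is our assumption.

```python
from sklearn.metrics import f1_score

# F1 from predicted defect-proneness scores, binarised at a threshold.
def f1(y_true, scores, threshold=0.5):
    y_pred = [1 if s >= threshold else 0 for s in scores]
    return f1_score(y_true, y_pred)
```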

CE (Cost-Effectiveness). During the defect prediction process, software practitioners prefer to check files with high defect-proneness and small module size because of limited resources. Therefore, we rank files in descending order of their probability of being buggy. The effort-aware ranking effectiveness of classifiers is usually measured by the cost-effectiveness (CE) curve, a common evaluation measure in the field of defect prediction that has been widely used in prior work [12, 16]. CE is calculated by the following formula introduced by Arisholm et al. [1]:

$$ CE_{\pi}=\frac{Area_{\pi}(M)-Area_{\pi}(Random)}{Area_{\pi}(Optimal)-Area_{\pi}(Random)} $$

where \(Area_{\pi }(M)\) is the area under the curve of model M (likewise for Optimal and Random) for a given \({\pi }\). \({\pi }\) is a cut-off varying from 0 to 1 that indicates the percentage of cumulative LOC taken into account. A larger \(CE_{\pi }\) represents better ranking effectiveness. In this work, we report \(CE_{\pi }\) at \({\pi }\) = 0.2.
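
The sketch below illustrates one way to compute \(CE_{\pi }\) under this definition: modules are ranked by predicted defect-proneness, the x-axis is the cumulative fraction of LOC inspected and the y-axis the cumulative fraction of defects found; the optimal ranking puts buggy, small modules first, and the random curve is the diagonal (area \(\pi^2/2\)). The cut-off at \(\pi\) is handled approximately, and the whole implementation is our reading of the formula, not the original code.

```python
import numpy as np

def area_under_curve(loc, buggy, order, pi):
    # cumulative % of LOC (x) vs. cumulative % of defects found (y)
    loc, buggy = loc[order], buggy[order]
    x = np.insert(np.cumsum(loc) / loc.sum(), 0, 0.0)
    y = np.insert(np.cumsum(buggy) / buggy.sum(), 0, 0.0)
    mask = x <= pi  # approximate truncation at the cut-off
    return np.trapz(y[mask], x[mask])

def cost_effectiveness(loc, buggy, scores, pi=0.2):
    loc, buggy = np.asarray(loc, float), np.asarray(buggy, float)
    model_area = area_under_curve(loc, buggy,
                                  np.argsort(-np.asarray(scores)), pi)
    optimal_order = np.lexsort((loc, -buggy))  # buggy first, small LOC first
    optimal_area = area_under_curve(loc, buggy, optimal_order, pi)
    random_area = pi ** 2 / 2                  # area under the diagonal
    return (model_area - random_area) / (optimal_area - random_area)
```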

3.5 Analysis Methods

To make the results in Sect. 4 more convincing, we introduce two practical analysis methods to help us analyze them in detail.

Scott-Knott Test. The Scott-Knott (SK) test [17] is an analysis method used to group classifiers into statistically distinct ranks. From a plot alone it is hard to tell whether there is a significant distinction between classifiers, and the SK test helps with this. The SK test recursively ranks the given classifiers through hierarchical clustering analysis: it clusters the classifiers into two groups based on the evaluation indicators and recurses until no significantly distinct group can be created [4]. The use of the SK test for classifier comparison can be found in prior work [4, 8, 13]. In this study, we use the SK test at the 95% confidence level.
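
The sketch below conveys only the recursive idea: classifiers are sorted by mean performance, split at the point that maximises the between-group sum of squares, and the recursion continues while the split is significant at the 95% level. The one-way ANOVA used as the significance check is a simplification of the exact SK statistic, and a dedicated package is typically used in practice.

```python
import numpy as np
from scipy.stats import f_oneway

# Simplified Scott-Knott-style recursive ranking (approximation, see above).
def scott_knott(groups, alpha=0.05):
    names = sorted(groups, key=lambda n: np.mean(groups[n]))
    if len(names) < 2:
        return [names]
    pooled = lambda subset: np.concatenate([np.asarray(groups[n]) for n in subset])
    grand_mean = np.mean(pooled(names))
    def between_ss(split):
        return sum(len(pooled(p)) * (np.mean(pooled(p)) - grand_mean) ** 2
                   for p in (names[:split], names[split:]))
    split = max(range(1, len(names)), key=between_ss)
    left, right = pooled(names[:split]), pooled(names[split:])
    if f_oneway(left, right).pvalue < alpha:
        return (scott_knott({n: groups[n] for n in names[:split]}, alpha)
                + scott_knott({n: groups[n] for n in names[split:]}, alpha))
    return [names]

# ranks = scott_knott({"DNN": auc_dnn, "RF": auc_rf,
#                      "LR": auc_lr, "ManualDown": auc_md})
```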

Win/Tie/Loss. The Win/Tie/Loss result is another analysis method that is useful for performance comparison between different techniques and has been widely used in prior work [10, 14]. Once we obtain the classifiers' performance data, the Wilcoxon signed-rank test [20] and Cliff's delta \({\delta }\) [6] can be used to compare their performance. If one technique outperforms another based on the Wilcoxon signed-rank test (p < 0.05) and there is a distinct difference between the two based on Cliff's delta (\({\delta } \ge \) 0.147), we mark the test as a 'Win'. In contrast, the test is marked as a 'Loss' if p < 0.05 and \({\delta } \le \) -0.147. Otherwise, the case is marked as a 'Tie'. The Win/Tie/Loss result shows whether one technique actually outperforms the other across all conditions.
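
The decision rule can be sketched as follows, using SciPy's Wilcoxon signed-rank test and a hand-rolled Cliff's delta; the thresholds are the ones stated above.

```python
from scipy.stats import wilcoxon

def cliffs_delta(a, b):
    # fraction of pairs where a > b minus fraction where a < b
    greater = sum(1 for x in a for y in b if x > y)
    less = sum(1 for x in a for y in b if x < y)
    return (greater - less) / (len(a) * len(b))

def win_tie_loss(a, b, alpha=0.05, magnitude=0.147):
    # a, b: paired performance values of the two competing settings
    p_value = wilcoxon(a, b).pvalue
    delta = cliffs_delta(a, b)
    if p_value < alpha and delta >= magnitude:
        return "Win"
    if p_value < alpha and delta <= -magnitude:
        return "Loss"
    return "Tie"
```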

Fig. 1. Comparison between combinations and individuals on AUC, F1 and \(CE_{0.2}\) (Color figure online)

4 Experimental Results

This section provides a detailed description of our experimental results. We focus on how the different defect prediction models perform on the combination and individual datasets and answer the following research questions:

4.1 RQ1: Does Increasing the Size of the Dataset Improve Defect Prediction?

The answer is no. Generally speaking, training on the combination of a project does not outperform training on the individual versions as evaluated by AUC, F1, and \(CE_{0.2}\), and this result is convincingly supported by the Win/Tie/Loss analysis.

Figure 1 shows an overview of the comparison between prediction on combinations and on individuals. The box plots show the distribution of the evaluation measures (AUC, F1, and \(CE_{0.2}\)) of each classifier on the studied datasets. Green boxes represent the performance on merged files; blue boxes represent the performance on individual files. Generally speaking, there is no significant discrepancy between the two. A closer look at the plots, however, shows that almost all the blue boxes are slightly higher than the green boxes. Taking the second plot in Fig. 1 as an example, except for the performance on lucene with LR, all the blue boxes are higher than the green boxes. This trend means that, surprisingly, merging the datasets slightly reduces the performance of the classifiers, which may be due to hidden differences between the versions of a project.

Table 3. Information of Win/Tie/Loss indicator.

To obtain a more detailed observation, we also apply the Win/Tie/Loss indicator in the analysis. The Win/Tie/Loss result shows whether combination is significantly better than individual. Table 3 displays the details of the Win/Tie/Loss results, i.e., the number of Wins/Ties/Losses in our tests. Taking the first row of the table as an illustration, the tests on individuals win 11 times, lose 1 time, and tie 21 times against the tests on combinations (11 merged files and 3 classifiers, so 33 comparisons in total). This result confirms our prior observation from Fig. 1: the performance on individuals slightly surpasses that on combinations. In general, combination is not significantly better than individual on any of the three measures. This result shows that merging software engineering datasets does not achieve a distinct improvement.

We have considered several possible reasons why the increase in dataset scale does not achieve a distinct improvement, and the most likely one is the heterogeneity between different projects and different versions. Due to this heterogeneity, an increase in sample size does not provide the supervised models with consistent and valid information, whether in WPDP or in CPDP, so it is hard to improve the classifiers' performance.


4.2 RQ2: Does ManualDown Outperform the Supervised Techniques?

In addition to the differences between combinations and individuals, we also want to thoroughly examine the different performances of the four defect prediction models: DNN, RF, LR, and ManualDown. Our question is whether ManualDown can outperform the three typical supervised learning models, and it is answered on both the merged and the individual datasets.

First we inspect the performance on the individual datasets, shown in the three box plots of Fig. 2. The models from left to right on the x-axis are DNN, LR, ManualDown, and RF. The three plots show the performance in terms of AUC, F1, and \(CE_{0.2}\). Our approach is to calculate the evaluation measure values for each individual version and put them into box plots, so we can clearly discern the discrepancies between the models.

Fig. 2. Comparison between classifiers on individuals

To explain in more detail, take the first plot as an example. ant is one of the projects in the study. First, we run 5-fold DNN training and testing on each version of ant (ant1.3, ant1.4, ant1.5, ant1.6, ant1.7) and obtain a mean AUC value for each version. Then we average these values across the versions to obtain a synthesized AUC value for ant. Since we prepared 11 projects, we obtain 11 synthesized AUC values, which constitute the leftmost box in the plot. Similarly, we complete the box plots by running 5-fold training and testing with the other models on the same datasets.

Besides, the Scott-Knott (SK) test is applied here to determine whether distinct differences exist among the models. In Fig. 2, the clusters with high SK values are tinted pink, while the others stay gray. From the three plots we can tell that RF outperforms the other models: according to the SK test, RF is significantly distinct from DNN, LR, and ManualDown on AUC, and from DNN and LR on \(CE_{0.2}\). As a result, we conclude that ManualDown does not outperform the supervised techniques on the individual datasets.

Now we turn to the performance on the merged datasets, shown in Fig. 3. Unlike the experiment on the individual datasets, taking the first plot as an example, we first combine all the instances of the different versions of a project, which yields 11 combined datasets (because we have 11 projects). Then we run 5-fold DNN training and testing on each combined dataset and calculate its AUC value; these values make up the leftmost box in the first plot. In a similar way, the box plots of \(CE_{0.2}\) and F1 on the merged datasets are drawn.

Fig. 3. Comparison between classifiers on combinations

From Fig. 3, it is conspicuous that there is no obvious difference among the performances of these defect prediction models. On the other hand, as the number of instances in a dataset increases, the models with weak performance on the individual datasets make slight progress, which makes the differences between the models even less obvious. In summary, ManualDown does not outperform the supervised techniques on the merged datasets either, but it is also not significantly worse than the traditional supervised models. Considering the complexity of the traditional models, ManualDown is, in our experience, undoubtedly a prediction method with more practical significance.


5 Threats to Validity

Project Selection. In this study, we select 15 open-source projects that have been used in prior works. These projects have a large enough data scale and regular metric information, which is conducive to training classifiers. However, limitations remain, since these projects come from only 3 groups; the results might differ if more diverse data were introduced into the experiment. It is therefore necessary to replicate our study in the future with a wider variety of datasets.

Classifier Selection. The classifiers we select in this work are DNN, RF, and LR. Although these are commonly investigated in the defect prediction literature, they cannot represent all classifiers. Better prediction performance might be obtained if some newly proposed effective techniques were used during model training. Replication studies using different classifiers may prove fruitful.

Study Replication. DNN and ManualDown are implemented in Python. LR and RF are implemented using R packages. The three evaluation measures are implemented in Python. All these open source implementations and datasets can be accessed online at https://github.com/NJUaaron/2019DataTest.

6 Conclusions

Accurate software defect prediction plays an important role in the software industry in alleviating the burden on software engineers. Many supervised and unsupervised methods have been proposed in prior works and proved effective in the literature. However, for supervised methods, one limiting factor of defect prediction is that the data available for training is not big, which restricts the scope of application of defect prediction models in practice. In this study, we construct bigger defect datasets by merging available datasets with the same measurement dimensions and check whether the increase in data size lifts defect prediction performance. Meanwhile, ManualDown, a simple module size model, is introduced as a baseline against which to measure the supervised models' performance.

In the experiment, we test DNN, RF, LR, and ManualDown on individual files and merged files. Their prediction performance is evaluated by AUC, F1, and \(CE_{0.2}\), and the experimental results are analyzed with the SK test and the Win/Tie/Loss technique. In summary, our conclusions are as follows:

  • Performance on the larger-scale datasets is not significantly better than performance on the raw smaller datasets under AUC, F1, and \(CE_{0.2}\). More precisely, the increase in dataset size even makes the classifiers perform slightly worse, although the degree of deterioration is not distinct.

  • There is no significant difference between the performance of ManualDown and that of the supervised techniques. In other words, the classical supervised models cannot outperform the simple module size model on our merged bigger data.