1 Introduction

Defect prediction in software systems focuses on identifying fault-prone classes early in the software development life cycle. This supports near-optimal allocation of testing and maintenance resources. Defect prediction works well when a large amount of data is available to train the prediction model. However, if the data has not been preserved or if we are dealing with the first release of a software system, no training data is available. Thus, defect prediction based on historical data of the same project is not always feasible.

Cross project defect prediction is the process of predicting defects in software systems using historical data of other projects [1]. Very few studies on cross project defect prediction are available in the literature, and they show that it is a challenging task. In our work, we have studied the feasibility of cross project defect prediction using open source software systems. The prediction model is built using logistic regression. In this work we have empirically investigated the following research questions.

1.1 Defect data of one project is likely to predict defects of another project

Various studies in the literature show that historical data from software repositories can be used to predict software defects in upcoming releases, but this past defect data is not always available. In this study, we have empirically validated that defect data from other projects can be used to identify defective classes.

1.2 The potential defect predictors in cross project defect prediction

Cross project defect prediction is feasible in some cases, but not in all. The major challenge in this field is to identify the scenarios in which cross project defect prediction is applicable. One solution to this problem, proposed by Zimmermann et al. [1], is to study the relationship between the characteristics of the training and test sets. In this work, we have studied 14 characteristics of software projects and illustrated this relationship with the help of a decision tree in which these characteristics determine the potential predictors.

1.3 The criteria for successful cross project defect prediction

There are numerous studies in the literature in the area of software defect prediction. The prediction model is built using statistical or machine learning methods, and the efficiency of the model is evaluated using various measures such as sensitivity, specificity, precision, recall and AUC. After analyzing these studies, we have built the model using logistic regression and chosen appropriate cut-off values of precision, recall and AUC to accept or reject the model.

The rest of the paper is organized as follows: Sect. 2 reviews related work in the context of cross project defect prediction; numerous studies in the literature train defect prediction models on previous releases of the same project, but very few cross project studies have been reported. Section 3 explains the research methodology used for our experiment. Section 4 presents the result analysis and describes the cross project prediction results; we show the acceptance criterion of each cross project model as well as the decision tree learnt from the data. In the last section, we provide the conclusions and the scope for future work on this topic, and we also discuss the threats to validity.

2 Related work

Numerous studies are available in the literature in the area of software defect prediction. The aim of most of them is to study the feasibility of defect prediction from the historical data of the same project. Prediction models are built using statistical and machine learning methods. Radjenovic et al. have carried out a systematic literature review of software fault prediction models [3]. In this work, the authors searched seven digital libraries to identify the most commonly used set of software metrics in software fault proneness prediction. Gray and MacDonell have compared the various techniques used for software fault prediction models [4] and discussed the inherent limitations of the techniques used in defect prediction models. Careful attribute selection is very important for the success of a fault prediction model; the impact of attribute selection on a naïve Bayes based fault prediction model has been investigated in [5].

Very few studies are available in the area of cross project defect prediction. Turhan et al. have investigated the application of cross company defect data to build prediction models using static code features [9]. They conducted their experiments on seven NASA and three SOFTLAB datasets. Zimmermann et al. have studied the feasibility of cross project defect prediction and validated it using several versions of open source and commercial software systems. They conducted their study on Apache Tomcat, Apache Derby, Eclipse, Firefox, Direct-X, IIS, Printing, Windows Clustering, Windows File System, SQL Server 2005 and Windows Kernel [1]. The results indicate that the relationship between the characteristics of the projects permits cross project defect prediction in some cases. This relationship is analyzed with the help of decision trees.

He et al. have also empirically validated cross project defect prediction using defect data from the PROMISE repository [7]. They conducted the experiment on 34 releases of 10 open source projects. Ma et al. have proposed a novel learning algorithm, Transfer Naïve Bayes, for cross company defect prediction [8]. They exploited all the cross company data in training the model, and the results were validated on NASA datasets and Turkish local software datasets. Canfora et al. proposed the use of a genetic algorithm to build a multi-objective cross project defect prediction model [10]. They used public datasets from the PROMISE repository for validation and produced a compromise between precision and recall for cross-project defect prediction.

Herbold proposed distance-based strategies for training data selection based on distributional characteristics [11]. The approach was evaluated on 44 datasets obtained from 14 open source projects, and the results indicate that this training data selection strategy improved the success rate of cross-project defect prediction. Herbold also proposed a tool, CrossPare, to provide standards for cross project defect prediction [15]. The tool implements several techniques proposed for cross-project defect prediction and can be used to improve the assessment of results in cross project defect prediction studies.

Ryu et al. proposed a transfer learning based model to deal with the class imbalance problem, which may decrease prediction accuracy in cross project defect prediction studies [12]. They computed similarity weights of the training data based on the test data and applied them to a boosting algorithm that accounts for class imbalance. The results were validated using NASA and SOFTLAB datasets. Ryu et al. also proposed a multi-objective naïve Bayes algorithm combined with the Harmony Search meta-heuristic [20]. The results indicate that the proposed approach shows similar prediction performance but better diversity compared to existing multi-objective CPDP algorithms.

Panichella et al. conducted an empirical study on 10 open source software systems to analyze the similarity of different defect prediction models [13]. They proposed a combined approach (CODEP) that uses the classifications provided by different machine learning methods to improve the defect prediction results, and found that better prediction accuracy was achieved using the combined approach. Amasaki et al. conducted a study to identify the effects of data simplification on CPDP methods [14]. They compared the predictive performance with and without data simplification and found that applying data simplification improved the results for cross-project selection. Satin et al. studied the combination of different classification algorithms for feature selection and data clustering [16]. They applied it to 1270 projects and built different cross-project prediction models, reporting that the naïve Bayes algorithm obtained the best performance, with 31.58% adequate predictions in 19 models.

Zhang et al. investigated 7 algorithms that integrate multiple machine learning classifiers to improve prediction results in cross project studies [17]. They performed experiments using 10 open source software systems from the PROMISE repository and, comparing their results with CODEP [13], found better results in terms of F-measure. Zhang et al. also compared the performance of unsupervised and supervised classifiers for cross project defect prediction using the AEEEM, NASA and PROMISE datasets [21] and proposed connectivity-based classifiers as a potential solution for cross project defect prediction studies. The same authors investigated the effect of log, Box-Cox and rank transformations in cross project defect prediction [23]. They found that all of these are comparable in terms of performance measures, although the resulting models do not exhibit the same behavior on individual entities.

Peters et al. proposed a private multi-party sharing method for cross-project defect prediction [18]. Xia et al. proposed a two-phase technique for cross project defect prediction consisting of a genetic algorithm phase and an ensemble learning phase [19]. They performed experiments with 29 datasets from the PROMISE repository and reported improved results compared to the literature. Hosseini et al. proposed Genetic Instance Selection (GIS), which optimizes a combined measure of F-measure and G-mean [22]. They used 13 datasets from the PROMISE repository for their experiments and concluded that search based instance selection is a promising solution for cross project defect prediction. Wu et al. proposed a semi-supervised dictionary learning technique for software defect prediction [24]. They used labeled defect data together with unlabeled data and performed their experiments on two public datasets, finding the proposed technique useful in the identification of software defects. Poon et al. proposed a credibility theory based naïve Bayes classifier built on a reweighting mechanism [25], so that the source data adapts to the distribution of the target data while preserving its own pattern. The results are promising and show a significant improvement in prediction rate. Huang et al. proposed a three-stage algorithm for cross project defect prediction; they used the nearest neighbor algorithm for similarity identification and then applied a Bayes classifier [26]. Jing et al. proposed a combination of improved subclass discriminant analysis (ISDA) and semi-supervised transfer component analysis as a solution for cross project defect prediction [27]. Goel et al. have conducted a systematic literature review on cross project defect prediction [28]. They found that best practices for cross project defect prediction could not be established and that more research needs to be carried out in heterogeneous CPDP to improve the prediction results.

From this survey we observe that cross project defect prediction is feasible with careful selection of code quality features. The relationship among the various characteristics of the datasets should be carefully analyzed to choose the potential defect predictors. We have attempted to extend this work by empirically validating cross project defect prediction using defect data of twelve open source software systems and twelve OO metrics [6]. The prediction model is built using logistic regression.

3 Research methodology

3.1 Data collection

We have analyzed the logs of the latest version of each software system to identify the faulty classes. We have developed a tool, configuration management system (CMS), in Java to fetch these logs [2]. CMS offers features to analyze the changes between two versions of a software system as well as to fetch logs from software project repositories and process them to obtain the bug count. In this study we have used CMS to obtain the faulty classes only. Figure 1 explains the data collection method of CMS.

Fig. 1 Data collection methodology

3.1.1 Source code checkout

The first step in the data collection process is to obtain the source code from the remote repository. For this, we create a local copy of the software: we connect to the CVS repository of the software by logging into the system and then download the source code to our local machine with the help of the CVS "checkout" command.
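A minimal sketch of this step is shown below, with the checkout driven from Java in the spirit of the CMS tool; the CVSROOT, module name and local directory are illustrative placeholders rather than the actual repositories used in this study.

```java
import java.io.File;
import java.io.IOException;

public class SourceCheckout {
    public static void main(String[] args) throws IOException, InterruptedException {
        // Hypothetical CVSROOT and module name; substitute the values of the project under study.
        String cvsRoot = ":pserver:anonymous@cvs.example.org:/cvsroot/project";
        String module = "project";
        File workDir = new File("workspace");
        workDir.mkdirs();

        // Log in to the pserver repository, then check out a local working copy of the module.
        run(workDir, "cvs", "-d", cvsRoot, "login");
        run(workDir, "cvs", "-d", cvsRoot, "checkout", module);
    }

    private static void run(File dir, String... cmd) throws IOException, InterruptedException {
        Process p = new ProcessBuilder(cmd).directory(dir).inheritIO().start();
        p.waitFor();
    }
}
```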

3.1.2 Extraction of bugs

After we make a local copy of the code, we request the logs using the "log" command. The server replies with the software logs, which form a large file. We apply text mining to this file and search for the text patterns "bug" and "fix" in the logs. If either of the two keywords is found, the class is assumed to be faulty. We repeat the process for each file in the source code to identify all the faulty classes in the software.
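The keyword search can be sketched as follows; running "cvs log" once per file, the directory layout and the exact search pattern are simplifying assumptions, and the CMS tool of [2] may differ in detail.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Stream;

public class BugExtractor {

    // A class is flagged faulty if its commit messages contain "bug" or "fix".
    private static final Pattern FAULT_PATTERN =
            Pattern.compile("\\b(bug|fix)", Pattern.CASE_INSENSITIVE);

    public static void main(String[] args) throws IOException {
        Path sourceRoot = Paths.get("workspace/project");   // local checkout (illustrative path)
        List<String> faultyClasses = new ArrayList<>();

        try (Stream<Path> files = Files.walk(sourceRoot)) {
            files.filter(f -> f.toString().endsWith(".java")).forEach(f -> {
                try {
                    // Request the revision history of this file and read the full log output.
                    Process p = new ProcessBuilder("cvs", "log", f.toString())
                            .redirectErrorStream(true).start();
                    String log = new String(p.getInputStream().readAllBytes(),
                            StandardCharsets.UTF_8);
                    if (FAULT_PATTERN.matcher(log).find()) {
                        faultyClasses.add(f.getFileName().toString());
                    }
                } catch (IOException e) {
                    System.err.println("Could not read log for " + f + ": " + e.getMessage());
                }
            });
        }
        System.out.println("Faulty classes: " + faultyClasses.size());
    }
}
```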

3.1.3 Metrics calculation

We obtain the metrics for each software system with the help of the "Understand" tool, which calculates the object oriented metrics for each class. We have calculated the object oriented metrics listed in Table 2 for every software system.

3.1.4 Preparation of dataset

We integrate the metrics and the bug report to obtain the dataset. Preprocessing is done to remove unusable data points. We then apply logistic regression to the collected data to build the prediction model.
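A rough sketch of this merge step is given below, assuming the metrics are exported as a CSV file with the class name in the first column and the faulty classes are listed one per line; all file names and the column layout are illustrative.

```java
import java.io.IOException;
import java.io.PrintWriter;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class DatasetBuilder {
    public static void main(String[] args) throws IOException {
        // Class names flagged faulty by the log analysis (one name per line; illustrative file name).
        Set<String> faulty = new HashSet<>(Files.readAllLines(Paths.get("faulty_classes.txt")));

        // Per-class metrics exported from the metrics tool as CSV: ClassName,Metric1,Metric2,...
        List<String> metricRows = Files.readAllLines(Paths.get("metrics.csv"));

        try (PrintWriter out = new PrintWriter("dataset.csv")) {
            out.println(metricRows.get(0) + ",faulty");          // extend the header with the label
            for (String row : metricRows.subList(1, metricRows.size())) {
                String className = row.split(",")[0];
                if (className.isEmpty()) continue;               // drop unusable rows (preprocessing)
                out.println(row + "," + (faulty.contains(className) ? "yes" : "no"));
            }
        }
    }
}
```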

We have used 12 software systems for our study, obtained from sourceforge.net. These datasets vary in application domain, size and percentage of faulty classes, while the programming language of all datasets is Java. Table 1 lists the programming language, the version used for our experiments, and the count and percentage of faulty classes for each software system under study.

Table 1 Software systems used for experiment

Amakihi: Amakihi supports the software testing activity of the SDLC by helping software developers automate test scripts. It consists of 98 classes, of which 44 are faulty [29].

Amber archer: Amber archer is a Java class library that supports the corporate software development process. It consists of 693 Java files with 9.7% faulty classes [30].

Abbot: ABBOT is a Java framework used to test the UI of Java applications. It consists of 330 Java classes, of which 46.1% are faulty [31].

Apollo: It provides an editor and a compiler for data migration in software systems. It consists of 292 Java classes, of which 58 are faulty [32].

Avisync: Avisync is a utility developed in Java that is used to fix audio/video synchronization problems while playing AVI files. It is a small system with 67 classes, of which 37.3% have one or more faults [33].

Jfreechart: Jfreechart is a chart library that can be used with Java programs. It is developed in Java, and we have used version 1.0.0. It consists of 689 classes, of which 59.2% contain one or more faults [34].

Jgap: Jgap is a genetic programming component available as a Java framework. It is developed in Java and consists of 173 classes, of which 35.3% (61) have one or more faults [35].

Jtreeview: Jtreeview is a cross-platform visualization tool for gene expression data. It is developed in Java. We have studied version 1.0.0 of this software, in which 184 out of 405 classes (45.4%) are found to be faulty [36].

Barcode4j: Barcode4j is a flexible barcode generator available under the Apache license v2.0. We have used version 1.0 of this software for our experiments, which consists of 170 classes, of which 31 have one or more faults [37].

Jtopen: It is a set of lightweight classes suitable for use on mobile devices. We have used v1.0 of this software for our study, which consists of 1527 classes, of which 27.9% are faulty [38].

Jung: JUNG provides a common and extendible language for the modeling, analysis, and visualization of data that can be represented as a graph or network. We have performed our experiments on JUNG v1.3, which has 51 faulty classes out of 149 [39].

Geotag: It is a portable, GUI-based intelligent matching software system. We have used v0.07 of this software for our study, which consists of 628 classes, of which 89 are faulty [40].

3.2 Prediction model

The prediction model is built using the logistic regression technique. Logistic regression is a probabilistic statistical classification model used to predict a binary response based on one or more predictor variables. It measures the relationship between the independent variables and the categorical dependent variable. We have studied various object oriented software metrics and selected 12 of them to build our prediction model. Table 2 lists these software metrics. These metrics are the independent variables of the prediction model, and the binary dependent variable is fault proneness.
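As an illustration, a single cross-project model can be built and evaluated with the Weka API roughly as follows; the ARFF file names, the position of the class attribute and the index of the "faulty" class value are assumptions made for this sketch, not a prescription of how the original experiments were scripted.

```java
import weka.classifiers.Evaluation;
import weka.classifiers.functions.Logistic;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class CrossProjectLogistic {
    public static void main(String[] args) throws Exception {
        // One project as training set, another as test set (illustrative file names;
        // the last attribute is assumed to be the binary fault-proneness label).
        Instances train = new DataSource("barcode4j.arff").getDataSet();
        Instances test  = new DataSource("amberarcher.arff").getDataSet();
        train.setClassIndex(train.numAttributes() - 1);
        test.setClassIndex(test.numAttributes() - 1);

        // Binary logistic regression on the OO metrics.
        Logistic model = new Logistic();
        model.buildClassifier(train);

        // Evaluate the cross-project model on the test project.
        Evaluation eval = new Evaluation(train);
        eval.evaluateModel(model, test);
        int faultyIndex = 1;   // index of the "faulty" class value, assumed second in the ARFF header
        System.out.printf("precision=%.3f recall=%.3f AUC=%.3f%n",
                eval.precision(faultyIndex), eval.recall(faultyIndex),
                eval.areaUnderROC(faultyIndex));
    }
}
```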

Table 2 Metrics description

3.3 Descriptive statistics

We have calculated 14 indicators to describe the distribution of each metric in a training/test set. These indicators and their descriptions are listed in Table 3. We combine these indicators with the metrics to obtain a set of (14 indicators × 12 metrics) 168 metric indicators. These 168 indicators describe the distributional characteristics of the training and test sets under study. They are listed in Tables 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14 and 15 for all the datasets.
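Each indicator is a simple summary statistic computed over one metric column of a dataset. The sketch below covers a few representative indicators (mean, median, standard deviation, minimum, maximum); the remaining indicators of Table 3 follow the same pattern, and the input values shown are purely illustrative.

```java
import java.util.Arrays;

public class DistributionIndicators {

    // Computes a few per-metric indicators from one metric column of a dataset.
    public static double[] indicators(double[] metricColumn) {
        double[] v = metricColumn.clone();
        Arrays.sort(v);
        int n = v.length;

        double sum = 0;
        for (double x : v) sum += x;
        double mean = sum / n;

        double varSum = 0;
        for (double x : v) varSum += (x - mean) * (x - mean);
        double stdDev = Math.sqrt(varSum / n);

        double median = (n % 2 == 1) ? v[n / 2] : (v[n / 2 - 1] + v[n / 2]) / 2.0;

        // mean, median, standard deviation, minimum, maximum
        return new double[] { mean, median, stdDev, v[0], v[n - 1] };
    }

    public static void main(String[] args) {
        // Illustrative metric values (e.g. WMC) for a handful of classes.
        double[] wmc = { 3, 7, 12, 5, 9, 21, 4 };
        System.out.println(Arrays.toString(indicators(wmc)));
    }
}
```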

Table 3 Indicators of software attributes
Table 4 Descriptive statistics of Amakihi dataset
Table 5 Descriptive statistics of Amberarcher dataset
Table 6 Descriptive statistics of Abbot dataset
Table 7 Descriptive statistics of Apollo dataset
Table 8 Descriptive statistics of Avisync dataset
Table 9 Descriptive statistics of Jfreechart dataset
Table 10 Descriptive statistics of Jgap dataset
Table 11 Descriptive statistics of Jtreeview dataset
Table 12 Descriptive statistics of barcode4j dataset
Table 13 Descriptive statistics of Jtopen dataset
Table 14 Descriptive statistics of Jung dataset
Table 15 Descriptive statistics of Geotag dataset

3.4 Performance evaluation measures

3.4.1 Precision

Precision is the ratio of the number of classes that are correctly classified as faulty to the total number of classes that are classified as faulty.

3.4.2 Recall

Recall is defined as the ratio of the number of faulty classes that are correctly classified as faulty to the total number of faulty classes.

3.4.3 Area under receiver operating characteristics (ROC) curve

The ROC curve is a plot of the true positives out of the total actual positives against the false positives out of the total actual negatives. Hence, the ROC curve is a graphical plot of sensitivity against 1 − specificity at varying discrimination thresholds. We define sensitivity and specificity as follows.

Sensitivity: Sensitivity, or the true positive rate, is the ratio of true positives to the total actual positives.

Specificity: It is the ratio of true negatives to the total actual negatives, i.e. the false positive rate subtracted from 1.
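In terms of the confusion matrix entries (TP, FP, TN, FN), these measures can be summarized as

$$\text{Precision} = \frac{TP}{TP + FP}, \qquad \text{Recall} = \text{Sensitivity} = \frac{TP}{TP + FN}, \qquad \text{Specificity} = \frac{TN}{TN + FP} = 1 - \text{FPR},$$

where FPR = FP/(FP + TN) is the false positive rate. The ROC curve plots sensitivity against 1 − specificity as the discrimination threshold varies, and AUC is the area under this curve.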

3.5 Construction of decision tree

Although cross project defect prediction works in several cases, it is not feasible in all cases. After studying the various combinations of training and test datasets, we have constructed a decision tree to validate the relationship between the feasibility of cross project defect prediction and the distributional characteristics of the training and test datasets.

We have conducted our experiments on all possible ordered pairs of the datasets. One dataset is chosen as the training set, which is used to build the prediction model, and the remaining 11 datasets serve as test sets, chosen one by one to evaluate the model. This process is repeated by choosing each dataset as the training set in turn. Thus we get 132 (12 × 11) combinations from the 12 datasets. Precision, recall and AUC are analyzed to decide whether prediction is possible or not. If precision > 0.6, recall > 0.7 and AUC > 0.6, we assume that prediction is possible; otherwise it is not. These cut-off values were chosen after analyzing the acceptance criteria of the various prediction models available in the literature. With these cut-off values of precision, recall and AUC, prediction was found to be possible in 35 out of the 132 combinations. We then used the distributional characteristics of the datasets to build the decision nodes of the decision tree, while the leaf nodes tell whether prediction is possible or not.
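The screening of all 132 ordered train/test pairs against this acceptance criterion can be sketched with the Weka API as follows; the project file names and the index of the "faulty" class value are illustrative assumptions, and the sketch only mirrors the procedure described above rather than reproducing the original scripts.

```java
import weka.classifiers.Evaluation;
import weka.classifiers.functions.Logistic;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class CombinationScreening {

    // Acceptance criterion used in this study for a successful cross-project model.
    static boolean accepted(double precision, double recall, double auc) {
        return precision > 0.6 && recall > 0.7 && auc > 0.6;
    }

    public static void main(String[] args) throws Exception {
        String[] projects = { "amakihi", "amberarcher", "abbot", "apollo", "avisync", "jfreechart",
                "jgap", "jtreeview", "barcode4j", "jtopen", "jung", "geotag" };   // illustrative file stems
        int successful = 0;
        for (String trainName : projects) {
            for (String testName : projects) {
                if (trainName.equals(testName)) continue;        // 12 x 11 = 132 ordered pairs
                Instances train = new DataSource(trainName + ".arff").getDataSet();
                Instances test  = new DataSource(testName + ".arff").getDataSet();
                train.setClassIndex(train.numAttributes() - 1);
                test.setClassIndex(test.numAttributes() - 1);

                Logistic model = new Logistic();
                model.buildClassifier(train);
                Evaluation eval = new Evaluation(train);
                eval.evaluateModel(model, test);

                // Class value index 1 is assumed to correspond to "faulty".
                if (accepted(eval.precision(1), eval.recall(1), eval.areaUnderROC(1))) {
                    successful++;
                    System.out.println(trainName + " -> " + testName + " : prediction possible");
                }
            }
        }
        System.out.println("Successful combinations: " + successful + " / 132");
    }
}
```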

To construct the decision tree, we have used Weka 3.6.10. The random tree algorithm is used to construct the decision tree with 10-fold cross validation on the dataset. The dataset is constructed in the following manner: first we list all the distributional characteristics of all the metrics for the training dataset, followed by the distributional characteristics of the test dataset. The last column is a binary variable that tells whether prediction is possible for this combination or not. Assuming we have m distributional characteristics for n metrics, the total number of columns in the dataset is 2(m × n) + 1. In our case, m = 14 and n = 12, hence the total number of columns in the dataset is 337. The number of rows is equal to the number of ordered train/test pairs.
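A sketch of this step using the Weka API is shown below; the name of the combined ARFF file and the index of the "yes" class value are assumptions, and the same tree can equally be built through the Weka Explorer GUI.

```java
import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.trees.RandomTree;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class FeasibilityTree {
    public static void main(String[] args) throws Exception {
        // Combined dataset: 168 training-set indicators + 168 test-set indicators + the
        // "prediction possible" label, one row per train/test combination (illustrative file name).
        Instances data = new DataSource("combinations.arff").getDataSet();
        data.setClassIndex(data.numAttributes() - 1);

        // Learn the decision tree from the full dataset; rules such as those in
        // Table 17 can be read off the printed tree.
        RandomTree tree = new RandomTree();
        tree.buildClassifier(data);
        System.out.println(tree);

        // 10-fold cross-validation, as done with Weka 3.6.10 in this study.
        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(new RandomTree(), data, 10, new Random(1));
        int yesIndex = 0;   // index of the "yes" class value, assumed first in the ARFF header
        System.out.printf("precision=%.3f recall=%.3f AUC=%.3f%n",
                eval.precision(yesIndex), eval.recall(yesIndex), eval.areaUnderROC(yesIndex));
    }
}
```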

The procedure for constructing the dataset for the decision tree is shown in Fig. 2. The prediction model is trained on one software system and tested on each of the remaining datasets. The result is marked "yes" if the criterion for successful prediction is satisfied, else "no". We then calculate the distributional characteristics of all metrics for both the training and test sets and combine them with the prediction result, as shown in Fig. 2. This gives one row of the combined dataset. We repeat the process for all combinations to complete the dataset used to learn the decision tree.

Fig. 2 Generation of a training-test instance from a dataset combination

4 Results and findings

4.1 Experimental results

We generated 132 train-test instances from the various combinations of the datasets. Out of these 132 instances, 35 were successful, with values of precision, recall and AUC greater than the cut-off values. Thus we get only 26.5% successful cross project defect prediction scenarios. The best prediction results are observed with Amberarcher as the test set and various training sets. The highest values of precision, recall and AUC are obtained with Barcode4j and Geotag as training sets and Amberarcher as the test set; precision and recall for both these models are greater than 80% and AUC is greater than 70%. Table 16 lists the successful train-test combinations and the corresponding values of precision, recall and AUC.

Table 16 Successful prediction results

The decision tree learnt from these train-test instances has a size of 75. It consists of 38 leaf nodes, of which 15 are labeled "yes" and 23 are labeled "no". The decision tree is built using the random tree algorithm. We performed 10-fold cross validation and observed a precision of 74.7%, a recall of 74.2% and an AUC of 67.9%. Table 17 lists the top 3 rules derived from the decision tree for successful cross project defect prediction; the support indicates the number of instances that satisfy the rule. Only 37 out of the 336 project characteristics are found to be significant in the construction of the decision tree; 24 of these 37 deciding characteristics belong to the training set and the remaining 13 to the test set. Each of these characteristics is compared with a cut-off value at its decision node, and the outcome determines the class, "yes" or "no".

Table 17 Top 3 rules for successful prediction learnt from DT

4.2 Discussion of results

From the results of our experiments, we draw the following common observations:

  • Software with lower percentage of defective classes has a very large set of potential defect predictors.

  • Defects for large software systems cannot be predicted by relatively smaller software systems.

  • Datasets with huge difference in the number of classes cannot be used in cross project defect prediction.

Table 18 lists the datasets that can be used for identification of defective classes by each training dataset. Here we can see that Jtopen is not useful for predicting defect proneness in any of the software systems under study. Amber archer is a potential predictor for 5 datasets, while Barcode4j and Jung are potential predictors for 4 datasets each. Abbot and Avisync are predictors for only 1 dataset each.

Table 18 Performance of training sets

Figure 3 shows a diagrammatic representation of the potential predictors for each software system. The X-axis shows the test dataset and the Y-axis shows the count of potential predictors, which helps in the relative study of the potential predictors. Amber archer has the highest number of predictors, while Jfreechart, Jtopen and Jung cannot be predicted by any of the training sets. However, if we relax some of the acceptance criteria, we obtain better results with these software systems. Apollo, Avisync, Barcode4j and Geotag also have ample training sets. Thus we can see that 9 out of the 12 software systems under study can be successfully predicted by one or more training sets.

Fig. 3 Potential defect predictors for the datasets under study

4.3 Threats to validity

One of the major threats to the validity of our work is the acceptance criterion for a successful model. We have selected three parameters for a successful model, i.e. precision, recall and AUC. The selection of these parameters is based on previous defect prediction studies in the literature and on our own analysis. However, the acceptance criterion may vary depending on various factors, in which case some of our observations and conclusions may change.

Another threat is the selection of static code metrics to build the defect prediction model. Studies in the literature show that these metrics can be used for defect prediction models, but this is not always the case. The appropriate selection of these metrics may vary depending on the dataset, although a subset of them is found significant in a large number of studies. Thus we conclude that our experiments and observations may vary depending on the selection of these static code metrics.

5 Conclusions and future work

5.1 Conclusions

Cross project defect prediction is the process of learning from one project to improve another project. It can be applied in the software development life cycle to improve software quality, make the software system more reliable and gain more confidence and customer satisfaction. Moreover, choosing more than one prediction model trained on different sets will increase the confidence in the prediction results. This is not possible in the traditional defect prediction approach, because the model is trained on the previous release of the same system. Training the model with different data will increase the reliance on the prediction results. This will increase the reliability, traceability, usability and maintainability of software systems and will help in mitigating the software crisis.

In this work, the prediction model is built using logistic regression. We conducted our experiments on 12 open source projects. 132 combinations of train-test instances are generated from these 12 projects and the feasibility of cross project defect prediction is analyzed. The results show that cross project defect prediction is not always feasible: only 35 out of 132 instances exhibit successful cross project prediction behavior in our experiments. Thus careful selection of the training set needs to be done in order to identify defective classes correctly. The decision tree constructed in our experiment learns from the distributional characteristics of the software projects to generate rules for successful defect prediction. This may help in the selection of appropriate training sets.

The chances of successful cross project defect prediction are higher when the numbers of classes in the training and test sets are comparable. It is observed that Jtopen, with a much higher number of classes and higher LOC than the other software systems, is neither a good training set nor a good test set. Cross project defect prediction may provide acceptable results if careful selection of the training set is done.

5.2 Future work

Previous studies show that some machine learning algorithms build better prediction models than statistical methods. In future work we may apply machine learning methods such as bagging and naïve Bayes to build the prediction model instead of logistic regression. This may improve the performance of the prediction model as well as of the decision tree.

Our experiments for cross project prediction do not take into account the programming language of the software under study; all the projects under study are developed using the same programming language, i.e. Java. We may extend our work to combinations of different programming languages and verify whether prediction works as well in such scenarios. We may also extend the work to real-life corporate software. The process followed and the complexity of industrial software differ from those of open source software, and hence we may take these features into account as well. This will make the application of cross project defect prediction more realistic and applicable to the software industry.