1 Introduction

Defect prediction in software systems focuses on identifying fault-prone classes early in the software development life cycle. This supports near-optimal allocation of testing and maintenance resources. Defect prediction works well when a large amount of data is available to train the prediction model. However, if the data has not been preserved or if we are dealing with the first release of a software system, no training data is available. Thus, defect prediction based on historical data of the same project is not always feasible.

Cross project defect prediction is the process of predicting defects in software systems using historical data of other projects [1]. Very few studies on cross project defect prediction are available in the literature, and they show that it is a challenging task. In our work, we have studied the feasibility of cross project defect prediction using open source software systems. The prediction model is built using logistic regression. In this work we have empirically investigated the following research questions.

1.1 Defect data of one project is likely to predict defects of another project

Various studies in the literature show that historical data from software repositories can be used to predict software defects in upcoming releases, but this past defect data is not always available. In this study, we have empirically validated that defect data from other projects can be used to identify defective classes.

1.2 The potential defect predictors in cross project defect prediction

Cross project defect prediction is feasible in some cases, but not in all. The major challenge in this field is to identify the scenarios in which cross project defect prediction is applicable. One solution to this problem, proposed by Zimmermann et al. [1], is to study the relationship between the characteristics of the training and test sets. In this work, we have studied 14 characteristics of software projects and illustrated this relationship with the help of a decision tree in which these characteristics determine the potential predictors.

1.3 The criteria for successful cross project defect prediction

There are numerous studies in the literature in the area of software defect prediction. The prediction model is built using statistical or machine learning methods, and the efficiency of the model is evaluated using various measures such as sensitivity, specificity, precision, recall and AUC. After analyzing these studies, we have built the model using logistic regression and chosen appropriate cut-off values of precision, recall and AUC to accept or reject the model.

The rest of the paper is organized as follows: Sect. 2 reviews related work in the context of cross project defect prediction; numerous studies in the literature train defect prediction models on previous releases of the same project, but very few cross project studies have been reported. Section 3 explains the research methodology used for our experiment. Section 4 presents the result analysis and describes the cross project prediction results; we show the acceptance criterion of each cross project model as well as the decision tree learnt from the data. In the last section, we provide the conclusions and the scope for future work on this topic, and we also discuss the threats to validity.

2 Related work

Numerous studies are available in the literature in the area of software defect prediction. The aim of most of them is to study the feasibility of defect prediction from the historical data of the same project. Prediction models are built using statistical and machine learning methods. Radjenovic et al. have carried out a systematic literature review of software fault prediction models [3]. In this work, the authors searched seven digital libraries to identify the most commonly used set of software metrics in software fault proneness prediction. Gray and MacDonell have compared the various techniques used for software fault prediction models [4] and discussed the inherent limitations of the techniques used in defect prediction models. Careful attribute selection is very important for the success of a fault prediction model; the impact of attribute selection on a naïve Bayes based fault prediction model has been investigated in [5].

Very few studies are available in the area of cross project defect prediction. Turhan et al. have investigated the application of cross company defect data to build prediction models using static code features [9]. They conducted their experiments on seven NASA and three SOFTLAB datasets. Zimmermann et al. have studied the feasibility of cross project defect prediction and validated it using several versions of open source and commercial software systems. They conducted their study on Apache Tomcat, Apache Derby, Eclipse, Firefox, Direct-X, IIS, Printing, Windows Clustering, Windows File System, SQL Server 2005 and Windows Kernel [1]. The results indicate that the relationship between the characteristics of the projects permits cross project defect prediction in some cases. This relationship is analyzed with the help of decision trees.

He et al. have also empirically validated cross project defect prediction using defect data from the PROMISE repository [7]. They conducted the experiment on 34 releases of 10 open source projects. Ma et al. have proposed a novel learning algorithm, Transfer Naïve Bayes, for cross company defect prediction [8]. They exploited all the cross company data in training the model, and the results were validated on NASA datasets and Turkish local software datasets. Canfora et al. proposed the use of a genetic algorithm to build a multi-objective cross project defect prediction model [10]. They used public datasets from the PROMISE repository for validation and produced a compromise between precision and recall for cross-project defect prediction.

Herbold proposed distance-based strategies for training data selection based on distributional characteristics [11]. The approach was evaluated on 44 datasets obtained from 14 open source projects, and the results indicate that this training data selection strategy improved the success rate of cross-project defect prediction. Herbold also proposed a tool, CrossPare, to provide standards for cross project defect prediction [15]. The tool implements several techniques proposed for cross-project defect prediction and can be used to improve the assessment of results in cross project defect prediction studies.

Ryu et al. proposed a transfer learning based model to deal with the class imbalance problem, which may decrease prediction accuracy in cross project defect prediction studies [12]. They computed similarity weights of the training data based on the test data and applied them to a boosting algorithm that accounts for class imbalance. The results were validated using NASA and SOFTLAB datasets. Ryu et al. also proposed a multi-objective naïve Bayes algorithm combined with the Harmony Search meta-heuristic [20]. The results indicate that the proposed approach shows similar prediction performance but better diversity compared to existing multi-objective CPDP algorithms.

Panichella et al. conducted an empirical study on 10 open source software systems to analyze the similarity of different defect prediction models [13]. They proposed a combined approach (CODEP) that uses the classifications provided by different machine learning methods to improve the defect prediction results, and found that better prediction accuracy was achieved using the combined approach. Amasaki et al. conducted a study to identify the effects of data simplification on CPDP methods [14]. They compared the predictive performance with and without data simplification and found that applying data simplification improved the results for cross-project selection. Satin et al. studied the combination of different classification algorithms for feature selection and data clustering [16]. They applied it to 1270 projects and built different cross-project prediction models, reporting that the naïve Bayes algorithm obtained the best performance, with 31.58% adequate predictions in 19 models.

Zhang et al. investigated 7 algorithms that integrate multiple machine learning classifiers to improve prediction results in cross project studies [17]. They performed experiments using 10 open source software systems from the PROMISE repository and, comparing their results with CODEP [13], found better results in terms of F-measure. Zhang et al. also compared the performance of unsupervised and supervised classifiers for cross project defect prediction using the AEEEM, NASA and PROMISE datasets [21] and proposed connectivity-based classifiers as a potential solution for cross project defect prediction studies. The same authors investigated the effect of log, Box-Cox and rank transformations in cross project defect prediction [23]. They found that all of these are comparable in terms of performance measures, although the resulting models do not exhibit the same behavior on individual entities.

Peters et al. proposed a private multi-party sharing method for cross-project defect prediction [18]. Xia et al. proposed a two-phase technique for cross project defect prediction consisting of a genetic algorithm phase and an ensemble learning phase [19]. They performed experiments with 29 datasets from the PROMISE repository and reported improved results compared to the literature. Hosseini et al. proposed Genetic Instance Selection (GIS), which optimizes a combined measure of F-measure and G-mean [22]. They used 13 datasets from the PROMISE repository for their experiments and concluded that search based instance selection is a promising solution for cross project defect prediction. Wu et al. proposed a semi-supervised dictionary learning technique for software defect prediction [24]. They used labeled defect data together with unlabeled data and performed their experiments on two public datasets, finding the proposed technique useful in the identification of software defects. Poon et al. proposed a credibility theory based naïve Bayes classifier built on a reweighting mechanism [25], so that the source data adapts to the distribution of the target data while preserving its own pattern. The results are promising and show a significant improvement in prediction rate. Huang et al. proposed a three-stage algorithm for cross project defect prediction; they used the nearest neighbor algorithm for similarity identification and then applied a Bayes classifier [26]. Jing et al. proposed a combination of improved subclass discriminant analysis (ISDA) and semi-supervised transfer component analysis as a solution for cross project defect prediction [27]. Goel et al. have conducted a systematic literature review on cross project defect prediction [28]. They found that best practices for cross project defect prediction could not be established and that more research needs to be carried out in heterogeneous CPDP to improve the prediction results.

From this survey we observe that cross project defect prediction is feasible with careful selection of code quality features. The relationship among the various characteristics of the datasets should be carefully analyzed to choose the potential defect predictors. We have attempted to extend this work by empirically validating cross project defect prediction using defect data of twelve open source software systems and twelve OO metrics [6]. The prediction model is built using logistic regression.

3 Research methodology

3.1 Data collection

We have analyzed the logs of the latest version of each software system to identify the faulty classes. We have developed a tool, configuration management system (CMS), in Java to fetch these logs [2]. CMS offers features to analyze the changes between two versions of a software system as well as to fetch logs from software project repositories and process them to obtain the bug count. In this study we have used CMS to obtain the faulty classes only. Figure 1 explains the data collection method of CMS.

Fig. 1 Data collection methodology

3.1.1 Source code checkout

The first step in the data collection process is to obtain the source code from the remote repository. For this, we create a local copy of the software: we connect to the CVS repository of the software by logging into the system and then download the source code to our local machine with the help of the CVS "checkout" command.
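A minimal sketch of this step is shown below, with the checkout driven from Java in the spirit of the CMS tool; the CVSROOT, module name and local directory are illustrative placeholders rather than the actual repositories used in this study.

```java
import java.io.File;
import java.io.IOException;

public class SourceCheckout {
    public static void main(String[] args) throws IOException, InterruptedException {
        // Hypothetical CVSROOT and module name; substitute the values of the project under study.
        String cvsRoot = ":pserver:anonymous@cvs.example.org:/cvsroot/project";
        String module = "project";
        File workDir = new File("workspace");
        workDir.mkdirs();

        // Log in to the pserver repository, then check out a local working copy of the module.
        run(workDir, "cvs", "-d", cvsRoot, "login");
        run(workDir, "cvs", "-d", cvsRoot, "checkout", module);
    }

    private static void run(File dir, String... cmd) throws IOException, InterruptedException {
        Process p = new ProcessBuilder(cmd).directory(dir).inheritIO().start();
        p.waitFor();
    }
}
```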

3.1.2 Extraction of bugs

After we make a local copy of the code, we request the logs using the "log" command. The server replies with the software logs, which form a large file. We apply text mining to this file and search for the text patterns "bug" and "fix" in the logs. If either of the two keywords is found, the class is assumed to be faulty. We repeat the process for each file in the source code to identify all the faulty classes in the software.
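The keyword search can be sketched as follows; running "cvs log" once per file, the directory layout and the exact search pattern are simplifying assumptions, and the CMS tool of [2] may differ in detail.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Stream;

public class BugExtractor {

    // A class is flagged faulty if its commit messages contain "bug" or "fix".
    private static final Pattern FAULT_PATTERN =
            Pattern.compile("\\b(bug|fix)", Pattern.CASE_INSENSITIVE);

    public static void main(String[] args) throws IOException {
        Path sourceRoot = Paths.get("workspace/project");   // local checkout (illustrative path)
        List<String> faultyClasses = new ArrayList<>();

        try (Stream<Path> files = Files.walk(sourceRoot)) {
            files.filter(f -> f.toString().endsWith(".java")).forEach(f -> {
                try {
                    // Request the revision history of this file and read the full log output.
                    Process p = new ProcessBuilder("cvs", "log", f.toString())
                            .redirectErrorStream(true).start();
                    String log = new String(p.getInputStream().readAllBytes(),
                            StandardCharsets.UTF_8);
                    if (FAULT_PATTERN.matcher(log).find()) {
                        faultyClasses.add(f.getFileName().toString());
                    }
                } catch (IOException e) {
                    System.err.println("Could not read log for " + f + ": " + e.getMessage());
                }
            });
        }
        System.out.println("Faulty classes: " + faultyClasses.size());
    }
}
```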

3.1.3 Metrics calculation

We obtain the metrics for each software system with the help of the "Understand" tool, which calculates the object oriented metrics for each class. We have calculated the object oriented metrics listed in Table 2 for every software system.

3.1.4 Preparation of dataset

We integrate the metrics and the bug report to obtain the dataset. Preprocessing is done to remove unusable data points. We then apply logistic regression to the collected data to build the prediction model.
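A rough sketch of this merge step is given below, assuming the metrics are exported as a CSV file with the class name in the first column and the faulty classes are listed one per line; all file names and the column layout are illustrative.

```java
import java.io.IOException;
import java.io.PrintWriter;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class DatasetBuilder {
    public static void main(String[] args) throws IOException {
        // Class names flagged faulty by the log analysis (one name per line; illustrative file name).
        Set<String> faulty = new HashSet<>(Files.readAllLines(Paths.get("faulty_classes.txt")));

        // Per-class metrics exported from the metrics tool as CSV: ClassName,Metric1,Metric2,...
        List<String> metricRows = Files.readAllLines(Paths.get("metrics.csv"));

        try (PrintWriter out = new PrintWriter("dataset.csv")) {
            out.println(metricRows.get(0) + ",faulty");          // extend the header with the label
            for (String row : metricRows.subList(1, metricRows.size())) {
                String className = row.split(",")[0];
                if (className.isEmpty()) continue;               // drop unusable rows (preprocessing)
                out.println(row + "," + (faulty.contains(className) ? "yes" : "no"));
            }
        }
    }
}
```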

We have used 12 software systems for our study, obtained from sourceforge.net. These datasets vary in application domain, size and percentage of faulty classes, while the programming language of all datasets is Java. Table 1 lists the programming language, the version used for our experiments, and the count and percentage of faulty classes for each software system under study.

Table 1 Software systems used for experiment

Amakihi: Amakihi supports the software testing activity of the SDLC by helping software developers automate test scripts. It consists of 98 classes, of which 44 are faulty [29].

Amber archer: Amber archer is a Java class library that supports the corporate software development process. It consists of 693 Java files with 9.7% faulty classes [30].

Abbot: ABBOT is a Java framework used to test the UI of Java applications. It consists of 330 Java classes, of which 46.1% are faulty [31].

Apollo: It provides an editor and a compiler for data migration in software systems. It consists of 292 Java classes, of which 58 are faulty [32].

Avisync: Avisync is a utility developed in Java that is used to fix audio/video synchronization problems while playing AVI files. It is a small system with 67 classes, of which 37.3% have one or more faults [33].

Jfreechart: Jfreechart is a chart library that can be used with Java programs. It is developed in Java, and we have used version 1.0.0. It consists of 689 classes, of which 59.2% contain one or more faults [34].

Jgap: Jgap is a genetic programming component available as a Java framework. It is developed in Java and consists of 173 classes, of which 35.3% (61) have one or more faults [35].

Jtreeview: Jtreeview is a cross-platform visualization tool for gene expression data. It is developed in Java. We have studied version 1.0.0 of this software, in which 184 out of 405 classes (45.4%) are found to be faulty [36].

Barcode4j: Barcode4j is a flexible barcode generator available under the Apache license v2.0. We have used version 1.0 of this software for our experiments, which consists of 170 classes, of which 31 have one or more faults [37].

Jtopen: It is a set of lightweight classes suitable for use on mobile devices. We have used v1.0 of this software for our study, which consists of 1527 classes, of which 27.9% are faulty [38].

Jung: JUNG provides a common and extendible language for the modeling, analysis, and visualization of data that can be represented as a graph or network. We have performed our experiments on JUNG v1.3, which has 51 faulty classes out of 149 [39].

Geotag: It is a portable, GUI-based intelligent matching software system. We have used v0.07 of this software for our study, which consists of 628 classes, of which 89 are faulty [40].

3.2 Prediction model

The prediction model is built using the logistic regression technique. Logistic regression is a probabilistic statistical classification model used to predict a binary response based on one or more predictor variables. It measures the relationship between the independent variables and the categorical dependent variable. We have studied various object oriented software metrics and selected 12 of them to build our prediction model. Table 2 lists these software metrics. These metrics are the independent variables of the prediction model, and the binary dependent variable is fault proneness.
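As an illustration, a single cross-project model can be built and evaluated with the Weka API roughly as follows; the ARFF file names, the position of the class attribute and the index of the "faulty" class value are assumptions made for this sketch, not a prescription of how the original experiments were scripted.

```java
import weka.classifiers.Evaluation;
import weka.classifiers.functions.Logistic;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class CrossProjectLogistic {
    public static void main(String[] args) throws Exception {
        // One project as training set, another as test set (illustrative file names;
        // the last attribute is assumed to be the binary fault-proneness label).
        Instances train = new DataSource("barcode4j.arff").getDataSet();
        Instances test  = new DataSource("amberarcher.arff").getDataSet();
        train.setClassIndex(train.numAttributes() - 1);
        test.setClassIndex(test.numAttributes() - 1);

        // Binary logistic regression on the OO metrics.
        Logistic model = new Logistic();
        model.buildClassifier(train);

        // Evaluate the cross-project model on the test project.
        Evaluation eval = new Evaluation(train);
        eval.evaluateModel(model, test);
        int faultyIndex = 1;   // index of the "faulty" class value, assumed second in the ARFF header
        System.out.printf("precision=%.3f recall=%.3f AUC=%.3f%n",
                eval.precision(faultyIndex), eval.recall(faultyIndex),
                eval.areaUnderROC(faultyIndex));
    }
}
```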

Table 2 Metrics description

3.3 Descriptive statistics

We have calculated 14 indicators to describe the distribution of each metric in a training/test set. These indicators and their descriptions are listed in Table 3. We combine these indicators with the metrics to obtain a set of (14 indicators × 12 metrics) 168 metric indicators. These 168 indicators describe the distributional characteristics of the training and test sets under study. They are listed in Tables 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14 and 15 for all the datasets.
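Each indicator is a simple summary statistic computed over one metric column of a dataset. The sketch below covers a few representative indicators (mean, median, standard deviation, minimum, maximum); the remaining indicators of Table 3 follow the same pattern, and the input values shown are purely illustrative.

```java
import java.util.Arrays;

public class DistributionIndicators {

    // Computes a few per-metric indicators from one metric column of a dataset.
    public static double[] indicators(double[] metricColumn) {
        double[] v = metricColumn.clone();
        Arrays.sort(v);
        int n = v.length;

        double sum = 0;
        for (double x : v) sum += x;
        double mean = sum / n;

        double varSum = 0;
        for (double x : v) varSum += (x - mean) * (x - mean);
        double stdDev = Math.sqrt(varSum / n);

        double median = (n % 2 == 1) ? v[n / 2] : (v[n / 2 - 1] + v[n / 2]) / 2.0;

        // mean, median, standard deviation, minimum, maximum
        return new double[] { mean, median, stdDev, v[0], v[n - 1] };
    }

    public static void main(String[] args) {
        // Illustrative metric values (e.g. WMC) for a handful of classes.
        double[] wmc = { 3, 7, 12, 5, 9, 21, 4 };
        System.out.println(Arrays.toString(indicators(wmc)));
    }
}
```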

Table 3 Indicators of software attributes
Table 4 Descriptive statistics of Amakihi dataset
Table 5 Descriptive statistics of Amberarcher dataset
Table 6 Descriptive statistics of Abbot dataset
Table 7 Descriptive statistics of Apollo dataset
Table 8 Descriptive statistics of Avisync dataset
Table 9 Descriptive statistics of Jfreechart dataset
Table 10 Descriptive statistics of Jgap dataset
Table 11 Descriptive statistics of Jtreeview dataset
Table 12 Descriptive statistics of barcode4j dataset
Table 13 Descriptive statistics of Jtopen dataset
Table 14 Descriptive statistics of Jung dataset
Table 15 Descriptive statistics of Geotag dataset

3.4 Performance evaluation measures

3.4.1 Precision

Precision is the ratio of the number of classes that are correctly classified as faulty to the total number of classes that are classified as faulty.

3.4.2 Recall

Recall is defined as the ratio of the number of faulty classes that are correctly classified as faulty to the total number of faulty classes.

3.4.3 Area under receiver operating characteristics (ROC) curve

The ROC curve is a plot of the true positives out of the total actual positives against the false positives out of the total actual negatives. Hence, the ROC curve is a graphical plot of sensitivity against 1 − specificity at varying discrimination thresholds. We define sensitivity and specificity as follows.

Sensitivity: Sensitivity, or the true positive rate, is the ratio of true positives to the total actual positives.

Specificity: It is the ratio of true negatives to the total actual negatives, i.e. the false positive rate subtracted from 1.
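In terms of the confusion matrix entries (TP, FP, TN, FN), these measures can be summarized as

$$\text{Precision} = \frac{TP}{TP + FP}, \qquad \text{Recall} = \text{Sensitivity} = \frac{TP}{TP + FN}, \qquad \text{Specificity} = \frac{TN}{TN + FP} = 1 - \text{FPR},$$

where FPR = FP/(FP + TN) is the false positive rate. The ROC curve plots sensitivity against 1 − specificity as the discrimination threshold varies, and AUC is the area under this curve.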

3.5 Construction of decision tree

Although cross project defect prediction works in several cases, it is not feasible in all cases. After studying the various combinations of training and test datasets, we have constructed a decision tree to validate the relationship between the feasibility of cross project defect prediction and the distributional characteristics of the training and test datasets.

We have conducted our experiments on all possible ordered pairs of the datasets. One dataset is chosen as the training set, which is used to build the prediction model, and the remaining 11 datasets serve as test sets, chosen one by one to evaluate the model. This process is repeated by choosing each dataset as the training set in turn. Thus we get 132 (12 × 11) combinations from the 12 datasets. Precision, recall and AUC are analyzed to decide whether prediction is possible or not. If precision > 0.6, recall > 0.7 and AUC > 0.6, we assume that prediction is possible; otherwise it is not. These cut-off values were chosen after analyzing the acceptance criteria of the various prediction models available in the literature. With these cut-off values of precision, recall and AUC, prediction was found to be possible in 35 out of the 132 combinations. We then used the distributional characteristics of the datasets to build the decision nodes of the decision tree, while the leaf nodes tell whether prediction is possible or not.
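The screening of all 132 ordered train/test pairs against this acceptance criterion can be sketched with the Weka API as follows; the project file names and the index of the "faulty" class value are illustrative assumptions, and the sketch only mirrors the procedure described above rather than reproducing the original scripts.

```java
import weka.classifiers.Evaluation;
import weka.classifiers.functions.Logistic;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class CombinationScreening {

    // Acceptance criterion used in this study for a successful cross-project model.
    static boolean accepted(double precision, double recall, double auc) {
        return precision > 0.6 && recall > 0.7 && auc > 0.6;
    }

    public static void main(String[] args) throws Exception {
        String[] projects = { "amakihi", "amberarcher", "abbot", "apollo", "avisync", "jfreechart",
                "jgap", "jtreeview", "barcode4j", "jtopen", "jung", "geotag" };   // illustrative file stems
        int successful = 0;
        for (String trainName : projects) {
            for (String testName : projects) {
                if (trainName.equals(testName)) continue;        // 12 x 11 = 132 ordered pairs
                Instances train = new DataSource(trainName + ".arff").getDataSet();
                Instances test  = new DataSource(testName + ".arff").getDataSet();
                train.setClassIndex(train.numAttributes() - 1);
                test.setClassIndex(test.numAttributes() - 1);

                Logistic model = new Logistic();
                model.buildClassifier(train);
                Evaluation eval = new Evaluation(train);
                eval.evaluateModel(model, test);

                // Class value index 1 is assumed to correspond to "faulty".
                if (accepted(eval.precision(1), eval.recall(1), eval.areaUnderROC(1))) {
                    successful++;
                    System.out.println(trainName + " -> " + testName + " : prediction possible");
                }
            }
        }
        System.out.println("Successful combinations: " + successful + " / 132");
    }
}
```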

To construct the decision tree, we have used Weka 3.6.10. The random tree algorithm is used to construct the decision tree with 10-fold cross validation on the dataset. The dataset is constructed in the following manner: first we list all the distributional characteristics of all the metrics for the training dataset, followed by the distributional characteristics of the test dataset. The last column is a binary variable that tells whether prediction is possible for this combination or not. Assuming we have m distributional characteristics for n metrics, the total number of columns in the dataset is 2(m × n) + 1. In our case, m = 14 and n = 12, hence the total number of columns in the dataset is 337. The number of rows is equal to the number of ordered train/test pairs.
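A sketch of this step using the Weka API is shown below; the name of the combined ARFF file and the index of the "yes" class value are assumptions, and the same tree can equally be built through the Weka Explorer GUI.

```java
import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.trees.RandomTree;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class FeasibilityTree {
    public static void main(String[] args) throws Exception {
        // Combined dataset: 168 training-set indicators + 168 test-set indicators + the
        // "prediction possible" label, one row per train/test combination (illustrative file name).
        Instances data = new DataSource("combinations.arff").getDataSet();
        data.setClassIndex(data.numAttributes() - 1);

        // Learn the decision tree from the full dataset; rules such as those in
        // Table 17 can be read off the printed tree.
        RandomTree tree = new RandomTree();
        tree.buildClassifier(data);
        System.out.println(tree);

        // 10-fold cross-validation, as done with Weka 3.6.10 in this study.
        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(new RandomTree(), data, 10, new Random(1));
        int yesIndex = 0;   // index of the "yes" class value, assumed first in the ARFF header
        System.out.printf("precision=%.3f recall=%.3f AUC=%.3f%n",
                eval.precision(yesIndex), eval.recall(yesIndex), eval.areaUnderROC(yesIndex));
    }
}
```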

The procedure for constructing the dataset for the decision tree is shown in Fig. 2. The prediction model is trained on one software system and tested on each of the remaining datasets. The result is marked "yes" if the criterion for successful prediction is satisfied, else "no". We then calculate the distributional characteristics of all metrics for both the training and test sets and combine them with the prediction result, as shown in Fig. 2. This gives one row of the combined dataset. We repeat the process for all combinations to complete the dataset used to learn the decision tree.

Fig. 2 Generation of a training-test instance from a dataset combination

4 Results and findings

4.1 Experimental results

We generated 132 train-test instances from the various combinations of the datasets. Out of these 132 instances, 35 were successful, with values of precision, recall and AUC greater than the cut-off values. Thus we get only 26.5% successful cross project defect prediction scenarios. The best prediction results are observed with Amberarcher as the test set and various training sets. The highest values of precision, recall and AUC are obtained with Barcode4j and Geotag as training sets and Amberarcher as the test set; precision and recall for both these models are greater than 80% and AUC is greater than 70%. Table 16 lists the successful train-test combinations and the corresponding values of precision, recall and AUC.

Table 16 Successful prediction results

The decision tree learnt from these train-test instances has a size of 75. It consists of 38 leaf nodes, of which 15 are labeled "yes" and 23 are labeled "no". The decision tree is built using the random tree algorithm. We performed 10-fold cross validation and observed a precision of 74.7%, a recall of 74.2% and an AUC of 67.9%. Table 17 lists the top 3 rules derived from the decision tree for successful cross project defect prediction; the support indicates the number of instances that satisfy the rule. Only 37 out of the 336 project characteristics are found to be significant in the construction of the decision tree; 24 of these 37 deciding characteristics belong to the training set and the remaining 13 to the test set. Each of these characteristics is compared with a cut-off value at its decision node, and the outcome determines the class, "yes" or "no".

Table 17 Top 3 rules for successful prediction learnt from DT

4.2 Discussion of results

From the results of our experiments, we draw the following common observations:

  • Software with lower percentage of defective classes has a very large set of potential defect predictors.

  • Defects for large software systems cannot be predicted by relatively smaller software systems.

  • Datasets with huge difference in the number of classes cannot be used in cross project defect prediction.

Table 18 lists the datasets that can be used for identification of defective classes by each training dataset. Here we can see that Jtopen is not useful for predicting defect proneness in any of the software systems under study. Amber archer is a potential predictor for 5 datasets, while Barcode4j and Jung are potential predictors for 4 datasets each. Abbot and Avisync are predictors for only 1 dataset each.

Table 18 Performance of training sets

Figure 3 shows a diagrammatic representation of the potential predictors for each software system. The X-axis shows the test dataset and the Y-axis shows the count of potential predictors, which helps in the relative study of the potential predictors. Amber archer has the highest number of predictors, while Jfreechart, Jtopen and Jung cannot be predicted by any of the training sets. However, if we relax some of the acceptance criteria, we obtain better results with these software systems. Apollo, Avisync, Barcode4j and Geotag also have ample training sets. Thus we can see that 9 out of the 12 software systems under study can be successfully predicted by one or more training sets.

Fig. 3 Potential defect predictors for the datasets under study

4.3 Threats to validity

One of the major threats to the validity of our work is the acceptance criterion for a successful model. We have selected three parameters for a successful model, i.e. precision, recall and AUC. The selection of these parameters is based on previous defect prediction studies in the literature and on our own analysis. However, the acceptance criterion may vary depending on various factors, in which case some of our observations and conclusions may change.

Another threat is the selection of static code metrics to build the defect prediction model. Studies in the literature show that these metrics can be used for defect prediction models, but this is not always the case. The appropriate selection of these metrics may vary depending on the dataset, although a subset of them is found significant in a large number of studies. Thus we conclude that our experiments and observations may vary depending on the selection of these static code metrics.

5 Conclusions and future work

5.1 Conclusions

Cross project defect prediction is the process of learning from one project to improve another project. It can be applied in the software development life cycle to improve software quality, make the software system more reliable and gain more confidence and customer satisfaction. Moreover, choosing more than one prediction model trained on different sets will increase the confidence in the prediction results. This is not possible in the traditional defect prediction approach, because the model is trained on the previous release of the same system. Training the model with different data will increase the reliance on the prediction results. This will increase the reliability, traceability, usability and maintainability of software systems and will help in mitigating the software crisis.

In this work, the prediction model is built using logistic regression. We conducted our experiments on 12 open source projects. 132 combinations of train-test instances are generated from these 12 projects and the feasibility of cross project defect prediction is analyzed. The results show that cross project defect prediction is not always feasible: only 35 out of 132 instances exhibit successful cross project prediction behavior in our experiments. Thus careful selection of the training set needs to be done in order to identify defective classes correctly. The decision tree constructed in our experiment learns from the distributional characteristics of the software projects to generate rules for successful defect prediction. This may help in the selection of appropriate training sets.

The chances of successful cross project defect prediction are higher when the numbers of classes in the training and test sets are comparable. It is observed that Jtopen, with a much higher number of classes and higher LOC than the other software systems, is neither a good training set nor a good test set. Cross project defect prediction may provide acceptable results if careful selection of the training set is done.

5.2 Future work

Previous studies show that some machine learning algorithms build better prediction models than statistical methods. In future work we may apply machine learning methods such as bagging and naïve Bayes to build the prediction model instead of logistic regression. This may improve the performance of the prediction model as well as of the decision tree.

Our experiments for cross project prediction do not take into account the programming language of the software under study; all the projects under study are developed using the same programming language, i.e. Java. We may extend our work to combinations of different programming languages and verify whether prediction works as well in such scenarios. We may also extend the work to real-life corporate software. The process followed and the complexity of industrial software differ from those of open source software, and hence we may take these features into account as well. This will make the application of cross project defect prediction more realistic and applicable to the software industry.