1 Introduction

Many different approaches have been developed to predict the number and location of future bugs in source code (e.g., Khoshgoftaar et al. 1996; Graves et al. 2000; Hassan and Holt 2005; Ostrand et al. 2005; Bernstein et al. 2007). Such predictions can help project managers to quantitatively plan and steer a project according to the expected number of bugs and the associated bug-fixing effort. Bug prediction can also be helpful in a qualitative way whenever the defect location is predicted: testing efforts can focus on the predicted bug locations.

Many bug prediction approaches (including the ones cited above) use software evolution (or history) information to predict defects. This information is typically collected from software development systems such as CVS or Bugzilla. From this data, file-related features (i.e. attributes of source-code files) such as the number of revisions or the number of authors are extracted.

Those features are then used to train a prediction model. To evaluate such a model, it is given the feature values from another time period and the predicted values are compared with the observed ones, which allows the quality of the model to be assessed. The common downside of these approaches is their temporally coarse evaluation. Usually, a bug prediction algorithm is evaluated in terms of accuracy at only one or a few points in time. Such an evaluation implicitly assumes that the evolution of a project and its underlying data are relatively stable over time. But, according to the findings of Tsymbal (2004) and Widmer and Kubat (1993), this assumption is not necessarily valid. Therefore, such models are difficult to generalize, which jeopardizes correct decision-making by software managers.

Given the dynamic nature of software evolution data, the purpose of this paper is to analyze the variability in prediction quality and measure the reliability of prediction models. The general hypothesis of this paper is: Defect prediction quality varies over time exhibiting periods of stability and variability. Derived from this hypothesis, this paper focuses on the following research questions:

  1. RQ1

    Can we develop methods to assess the prediction quality over time?

  2. RQ2

    Is it possible to identify periods of stability in the prediction quality?

  3. RQ3

    Can we identify elements from the models’ input features which are responsible for the variability?

  4. RQ4

    Can we develop a method to predict the future variability of a prediction model?

Possible reasons for a high variability could be sudden changes in influencing factors such as the number of developers, the use of a new development tool, or even political/economic events (e.g., a financial crisis or holiday seasons). Our goal in this work is to uncover possible candidate factors by looking for correlating features in the development process and to provide software managers with a decision procedure to evaluate the prediction model’s accuracy in advance.

Our approach can be summarized as follows: Similar to other techniques described in Kagdi et al. (2007), we use software evolution data to extract file-related features. The set of features chosen reflects data about the files themselves and their history (see Table 2). In addition, we extract these features for many time periods of the investigated projects. From this data we then train our prediction models.

1.1 General Overview of Experiments

The first set of experiments, described in Section 5.1, addresses RQ1. First, for one prediction time period—in our case a single month, i.e. the target period—many prediction models are trained using datasets generated from every possible training period.

This procedure is repeated with varying target periods. The predictive power of the models is measured using the receiver operating characteristic (ROC) and the area under the ROC curve (AUC). For example, if a given software project has evolved over the last 36 months, then we use data from the past 35 months to train 34 different bug prediction models and then predict defects in the 36th month. Second, we use the same model to predict defects on every possible target. For example, a prediction model is trained using the data from the first 10 months and then this model is used to predict defects from month 11 onwards until the end of the observed period. To substantiate our claims we illustrate the predictions using heat-maps as a visual tool and additionally employ statistical methods to further support our observations. Our results indicate that there are periods of stability and variability of prediction quality over time and, hence, project managers should not always rely on bug prediction models without such stability information.

Our second set of experiments, described in Section 5.2, addresses RQ2: We first determined a suitable threshold for the prediction quality measure, denoting periods of “good” predictions, based on the performance of the BugCache algorithm (Kim et al. 2007). Using this threshold we then graphically illustrate how each of the four investigated projects exhibits periods of stability and change in prediction quality.

The third set of experiments, described in Section 5.3, is aimed at establishing statistically that the observations about stability and variation are not random. To that end we intersperse our features with random information and show that our results are statistically unaffected by the random data and, hence, reflect a non-random phenomenon.

The fourth and fifth sets of experiments, described in Sections 5.4 and 5.5, address RQ3. Using regression analysis we empirically uncover potential reasons for the variability of prediction quality that can serve as early indicators of upcoming variability. For example, we observe that an increasing number of authors editing a project causes a decline in prediction quality. Another observation is that more work spent on fixing bugs relative to other activities reduces the prediction quality. Moreover, when authors who are active in the training period fix many of the bugs, prediction quality increases.

The last set of experiments, described in Section 6, addresses RQ4 and makes the results actionable. Using the insights of the previous experiments as a foundation, we developed a tool that evaluates the quality of the prediction models. Specifically, we train a meta-model that predicts the quality of the models. Using these ‘meta-predictions’, software project managers can easily decide when to use bug prediction models and when to forgo them given their (expected) poor quality and/or reduced expressivity.

2 Related Work

Research in mining software repositories investigates the usage of historical data from software projects for various kinds of analyses (as described in Kagdi et al. 2007). One line of this research focuses on building models for the prediction of the occurrence of future defects, changes, or refactorings (cf. Diehl et al. 2009). To put our work in relation to these studies, we discuss a brief selection of related papers. Note, however, that to the best of our knowledge, there is no prior work investigating the possible variation in defect prediction quality over time and its causes.

2.1 General Issues in Bug Prediction

A critical survey of defect prediction models was conducted by Fenton and Neil (1999). They claim that there are numerous serious theoretical and practical problems in many studies. In particular, they mention five issues regarding defect prediction models: (1) the unknown relationship between defects and failures, (2) problems with the multivariate statistical approach, (3) problems of using size and complexity metrics as sole predictors of defects, (4) problems in statistical methodology and data quality, and (5) false claims about software decomposition. In this work, we tried to avoid the above issues. Since this was not completely possible, we discuss the remaining problems in Section 4 (Threats to Validity). Additionally, to ensure methodological soundness, we employed the methods described in Zimmermann et al. (2007) to link the CVS and Bugzilla databases and relied on Eaddy et al. (2008), Antoniol et al. (2008), and Bird et al. (2009) to validate the defect datasets.

Lessmann et al. (2008) introduce a framework to compare defect prediction models. They criticize the use of only a small number of datasets, which might additionally be proprietary, the use of inappropriate accuracy indicators, and the limited use of statistical testing procedures to substantiate findings. We are convinced that we address all these issues by (1) using well-known open-source projects as data sources, (2) reporting the AUC to measure the accuracy of our prediction models and (3) using statistical tests where appropriate.

2.2 Different Approaches for Bug Prediction

Apart from our work, there are different approaches that try to predict defects in software systems based on their source code information. For instance, Hassan (2009) proposed complexity metrics that are based on the code-change process. He used concepts from information theory to define the change complexity metrics and considered the code-change introduction process in order to measure the change probability, or entropy, during specific time periods. His definition of time frames is similar to the one used in this paper.

Li et al. (2005) present an approach for predicting the model parameters of software reliability growth models (SRGMs). SRGMs are time-based models whose parameters are estimated using metrics-based modeling methods. They used three SRGMs, seven metrics-based prediction methods, and two different sets of predictors to forecast pre-release defect-occurrence rates. Our study also uses time-based prediction models to predict the location of defects. However, we predict defects in every possible time period, which allows us to perform a continuous analysis of the bug prediction quality. Further, we use only one prediction model—i.e. class probability estimation models—and only process metrics as predictors. Moreover, our goal is to investigate the variability of defect prediction quality over time as opposed to forecasting defect occurrence rates. Ostrand et al. (2005) and Knab et al. (2006) both used code metrics and modification history to train regression models predicting the location and number of faults in software systems. Zimmermann et al. (2007), in contrast, used only code metrics. They all share the following experimental procedure: first, they constructed several file-level and project-level features from the software history and used those features to train prediction models. Then, the feature values from another time period were computed and the predicted values were compared with the observed ones. The common downside of these approaches is their temporally coarse evaluation. Usually, a bug prediction algorithm is evaluated, in terms of accuracy, at only one or a few points in time. This renders the generalization of models difficult, as such an evaluation implicitly assumes that the evolution of a project and its underlying data are relatively stable over time. In our study, we also use the software history to compute a set of features, and some of our features are similar to those of these studies. In contrast, however, our feature set reflects almost all changes to a file in the past. In addition, we evaluate our prediction models throughout the project duration in order to show the variability in prediction quality over time and to illustrate the limited “temporal generalizability” of bug prediction models.

Khoshgoftaar et al. (1996), Graves et al. (2000), Nagappan and Ball (2005) and Bernstein et al. (2007) all developed prediction models using software evolution data to predict future failures of software systems. Mockus and Votta (2000) showed that the textual description field of a change history entry is essential for predicting the cause of that change. Further, they define three causes for a change: adding new features, correcting faults, and restructuring code to accommodate future changes (i.e. refactoring). We also use the change history for constructing features, but we predict only faults. We partially base our work on the above-mentioned related approaches by adopting some of their features.

3 Experimental Setup

Now we succinctly introduce the overall experimental setup. We present the data used, its acquisition method, and the measures used to evaluate the quality of the results.

3.1 The Data: CVS and Bugzilla for Eclipse, Netbeans, Mozilla, and Open Office

The availability of data covering multiple development cycles and their possible association with the variation in prediction quality was essential to this study. We therefore selected four open source projects with a particularly long development history (>6 years): Eclipse, Netbeans, Mozilla, and Open Office.

Within each project we considered unique file names (including the file path) of source files of type *.java in Eclipse and Netbeans, *.cpp in Mozilla, and *.hxx and *.cxx in Open Office during the observed period of each project. Further, we considered only those files that were not marked as dead within the observation period. We did not follow file renaming events, as this is not a feature supported by CVS.

All data was collected from the projects’ Concurrent Versions System (CVS)Footnote 1 and Bugzilla.Footnote 2 The data from the two sources was then linked using the method described in Zimmermann et al. (2007). Even though the method has been shown to lead to bias in terms of links (see Bird et al. 2009; Bachmann and Bernstein 2009), it has been argued that it is highly unlikely to produce false positives (i.e. links between commits and bugs that should not be there). Consequently, we can say that we predict the presence of bugs (fixes) instead of just commits. Note that, in total, we considered 114,186 bug reports from all four projects. The manual verification of such a large number of bug reports would be prohibitively expensive.

We also considered reusing existing and verified datasets.Footnote 3 However, given that we needed temporally oriented data for our analysis (organized on a monthly basis), we decided to extract the data from the CVS and Bugzilla databases ourselves. We discuss the possible bias in the threats to validity in Section 4.

Table 1 shows an overview of the observation periods and the number of files considered. Moreover, Tables 13, 14 and 15Footnote 4 provide detailed descriptions of all components and the number of files in those components. In Eclipse, we consider the core components of the products Equinox, JDT, PDE and Platform available in June 2007. We selected all the components from Netbeans and Mozilla available in June 2007 and February 2008, respectively. For Open Office we only used files from the SW component. This component relates to the product Writer, the word processor of the Open Office suite.

Table 1 Analyzed projects: time spans and number of files

3.2 The Data: Features

All features available to the decision tree learner are listed in Table 2. They reflect information about a file and the changes made to it in the past. All features, with the exception of the target variable hasBug, are computed during the training period of a model. The target variable is computed in the test period of the model. The training period always lies before the test period with no overlap, as the target variable hasBug is only known ex-post (i.e. in the future) whilst the other features are available ex-ante (i.e. at the time a prediction is made).

Table 2 Extracted variables (features) from CVS and Bugzilla

To investigate the robustness of the predictions against variation in the training period length we vary its length from two months to the maximum length possible given the data collected.

The feature values are computed over calendar months, i.e. each month starts on the 1st and ends on the 28th, 29th, 30th, or 31st, depending on its calendar length.

Target Feature   We consider a change or revision made to a file to be a bug-fixing activity if it references (or links to) an entry in the bug database that is marked as a defect (Zimmermann et al. 2007). In these referenced (or linked) cases the bug database contains the opening date of the bug. Therefore, we can determine the number of bugs that have been reported for each file during a specific time period. Commits that do not reference an entry in the bug database are considered non-bug-fixing commits, i.e. all changes that are not related to a defect (such as feature enhancements or refactoring activities).
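A minimal sketch of this labeling step (the commit and bug-report structures are hypothetical, not the authors' actual extraction code):

```python
def label_has_bug(commits, linked_bugs, target_start, target_end):
    """Assign the hasBug target per file: True if any commit to the file within the
    target period links to a bug-database entry marked as a defect.
    `commits` items look like {"file": str, "date": date, "bug_id": int or None};
    `linked_bugs` maps bug ids to {"type": str, ...} (hypothetical schema)."""
    has_bug = {}
    for c in commits:
        if not (target_start <= c["date"] <= target_end):
            continue
        bug = linked_bugs.get(c["bug_id"])          # None for non-bug-fixing commits
        is_fix = bug is not None and bug["type"] == "defect"
        has_bug[c["file"]] = has_bug.get(c["file"], False) or is_fix
    return has_bug
```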

Other Features   Most of the names of the features listed in Table 2 are self-explanatory. Some need additional context, which is provided below.Footnote 5

The activityRate represents how many activities (revisions) took place per month. To determine the rate we divide the number of revisions during the training period by the length of the training period in months, i.e. the rate is an average.

The features lineAdded and lineDeleted are the total number of lines of code added and deleted in all revisions during the training period. The sum of these two features is totalLineOperations.

grownPerMonth describes the evolution of the overall project (in terms of lines of code) in the training period. Specifically, we compute the difference between the number of lines added and deleted. This number can be positive (growth) or negative (shrinkage). We then average this value by dividing it by the length of the training period (in months).

The feature lineOperationRRevision describes the average number of lines added and deleted per revision during the training period.
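For illustration, a sketch of how these counts and rates could be computed from revision records (the field names are hypothetical; grownPerMonth is aggregated over the whole project, the other features per file):

```python
def file_rate_features(revisions, months):
    """Per-file features over a training period of `months` calendar months.
    Each revision is a dict like {"added": int, "deleted": int} (hypothetical)."""
    added = sum(r["added"] for r in revisions)
    deleted = sum(r["deleted"] for r in revisions)
    return {
        "activityRate": len(revisions) / months,                  # revisions per month
        "lineAdded": added,
        "lineDeleted": deleted,
        "totalLineOperations": added + deleted,
        "lineOperationRRevision": (added + deleted) / max(len(revisions), 1),
    }

def grown_per_month(all_project_revisions, months):
    """Project-level growth: average net lines added per month (negative = shrinkage)."""
    net = sum(r["added"] - r["deleted"] for r in all_project_revisions)
    return net / months
```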

The chanceRevision and chanceBug features describe the probability of having a revision and a bug in the future as used in the BugCache approach (Kim et al. 2007) discussed in Section 2. We compute those two features using the formula \(1/2^{i}\), where i represents how far back (in months) the latest revision or bug occurred from the prediction time period. If the latest revision or bug occurrence is far from the prediction time period, then i is large and the overall probability of having a bug (or revision) in the near future is low. Hence, these variables model the scenario that files with recent bugs are more likely to have bugs in the future than others (see Kim et al. 2007; Hassan and Holt 2005).
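A sketch of this decay feature, assuming the exponential reading \(1/2^{i}\) of the formula above:

```python
def chance_feature(latest_event_month, prediction_month):
    """chanceRevision / chanceBug: 1 / 2**i, where i is how many months back the
    latest revision or bug occurred relative to the prediction period."""
    i = prediction_month - latest_event_month   # age in months, i >= 1
    return 1.0 / (2 ** i)
```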

The features from blockerFixes to p5-reported provide information about the different types of bugs reported and fixed for a file during the training period. If the opening date of a bug reported for a file falls into the training period, the bug is counted as reported during the training period. Analogously, we count the number of fixed bugs.

lineAddedI, lineDeletedI and totalLineOperationsI provide the number of lines operated on (added or deleted) to fix bugs, and lineOperationIRbugFixes provides the average number of lines operated on per bug fix during the training period.

LineOpeIRTotLines counts how many lines were added and deleted to fix bugs in relation to the total number of lines added and deleted. This indicates what fraction of the changes is focused on fixing bugs as opposed to other activities (such as adding new features).

Finally, the remaining features represent the average lifetimes of bugs. Note that if a fixed bug is revised, the revision date is considered the closing date of that bug. The corresponding entry for the bug-fixing revision in the bug database provides the opening date of the bug and, hence, we can compute the lifetime of the bug. Even though the opening date may lie outside the training period, we still use it to compute the lifetime of the bug.

3.3 Performance Measures

In our experiments we train two types of models: class probability estimation and regression models.

For most of our experiments we trained class probability estimation (CPE) models. Specifically, we used a simple decision tree inducer: Weka’s (Witten and Frank 2005) J48 decision tree learner. This is a reimplementation of C4.5 (Quinlan 1993), which predicts the probability distribution of a given instance over the two possible classes of the target variable: hasBug and hasNoBug. Typically, one then chooses a cut-off threshold to determine the actual predicted class, which in turn can be used to derive a confusion matrix and the prediction’s accuracy. When training a model we introduce misclassification costs such that the costs of false negatives and false positives are equal.

Our datasets have a heavily skewed class distribution: the ratio between defective and non-defective files is, depending on the project, about 1:20 and stays approximately constant across all samples. For that reason we do not use the confusion matrix and the associated accuracy as our performance measures, as they are heavily influenced by this prior distribution. Instead we use the receiver operating characteristic (ROC) and the area under the ROC curve (AUC), which relate the true-positive rate to the false-positive rate and are independent of the prior distribution (see Provost and Fawcett 2001). An AUC of 1.0 represents perfect and 0.5 random prediction quality.
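The paper uses Weka's J48; the sketch below uses scikit-learn's decision tree as a stand-in (an assumption, not the original tooling) to show how class probability estimates yield an AUC that is insensitive to the 1:20 class skew:

```python
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import roc_auc_score

def train_and_score(X_train, y_train, X_test, y_test):
    """Train a CPE decision tree on the training period and report the AUC on the
    target period; y uses 1 for hasBug and 0 for hasNoBug."""
    tree = DecisionTreeClassifier()                # defaults imply equal misclassification costs
    tree.fit(X_train, y_train)
    p_bug = tree.predict_proba(X_test)[:, 1]       # class probability estimates for hasBug
    return roc_auc_score(y_test, p_bug)            # 1.0 = perfect, 0.5 = random
```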

For the regression experiments we use linear regression models. Linear regression is a form of regression analysis in which the relationship between one or more independent variables and another variable, called the dependent variable, is modeled by a linear function, i.e. a weighted linear combination of the independent variables whose weights, called regression coefficients, are chosen to minimize the squared prediction error. We report the Pearson correlation, the root mean squared error (RMSE), and the mean absolute error (MAE) to measure the performance of the regression models.
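A sketch of this evaluation with NumPy/SciPy/scikit-learn as stand-ins for the authors' tooling:

```python
import numpy as np
from scipy.stats import pearsonr
from sklearn.linear_model import LinearRegression

def evaluate_regression(X_train, y_train, X_test, y_test):
    """Fit a linear model and report Pearson correlation, RMSE, and MAE."""
    model = LinearRegression().fit(X_train, y_train)
    pred = model.predict(X_test)
    rmse = float(np.sqrt(np.mean((pred - y_test) ** 2)))
    mae = float(np.mean(np.abs(pred - y_test)))
    corr, _ = pearsonr(pred, y_test)
    return corr, rmse, mae
```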

4 Threats to Validity

In this section we briefly discuss the most important threats to validity concerning the data gathering process, the data itself, and the applied methodologies.

4.1 Determination of Authorship

Throughout this paper, we consider the committer to be the author of a change. This is possibly wrong for projects that do not allow direct write-access to CVS; Apache’s code base, e.g., can only be changed via trusted proxy persons (i.e. committers). For our experiments, we did not consider source code information and therefore needed to rely on the information available from the versioning system (CVS). Even with the source code information available (e.g., relying on the @author tag from Javadoc) we could not be sure that the listed person is also the author of the code change. Consider the following example: a developer makes a minor addition to the code in order to fix a defect and does not add himself to the list of authors in the source code because he thinks it is not worth mentioning. In such a situation the initial author of the file would be considered to also have made the bug-fix. Hence, our method is limited to determining, without doubt, the person who brought the code into the project’s codebase, whether or not that person is the actual author.

4.2 Creation-Time vs. Commit-Time

In this work we only considered data that is made publicly available by the developers. Since we use a time-based partitioning of the datasets, we make the implicit assumption that bugs occur at the moment they are reported and are fixed at the moment the respective code change is committed. This may not always be correct, because a code change may have been made long before being committed (in the developer’s private workspace). Also, a bug might be in the code for months (or years) without being noticed. Given the available data we see no way to address this limitation. However, from a project management perspective it can be argued that defects and code changes only become relevant when they are reported. Only at reporting time do they “materialize” as a task for the development team and cause further actions.

4.3 Bug-Fixing or Enhancement? A Clear Case of Bias

It is hard to distinguish between bug-fixing efforts and enhancements of the code (e.g., the addition of new features or refactoring). Oftentimes, developers make a connection between a bug report and its related code changes by mentioning a bug ID in the commit message. However, this is a brittle connection without any mechanism granting exclusivity of the submitted files to the mentioned bug report. A clear distinction between bug-fixing and code enhancement activities would require manually verified datasets (see also Bachmann and Bernstein 2009). In addition to the brittle connection, some information could be outright missing. For instance, a minor bug that is quickly fixed changes its state from open to fixed without having a priority and/or severity assigned. Consequently, the data of this study clearly exhibits both commit feature bias and bug feature bias as introduced by Bird et al. (2009). In addition, Ko and Chilana (2010) revealed that most power users reported non-issues that devolved into technical support, redundant reports with little new information, or expert feature requests, and that the reports that did lead to changes came from a comparably small group of experienced, frequent reporters. This implies that even power users have no clear notion of the state of their reports.

Unfortunately, barring the availability of manually verified datasets, we see no practical way to address these biases. A main characteristic of the methods used in this work is the long-term evaluation of prediction models on software projects. To manually verify our datasets we would need to look into every bug report and every code change of a whole project and its history—an effort clearly beyond the scope of this study.

4.4 Choice of Time Frames

We chose two-month windows as datasets for our prediction models. We do not claim that this window is the only or even the optimal choice. We decided on two months because it is a time-frame that we found useful in one of our previous studies (see Bernstein et al. 2007) and corresponds to a release cycle we observed in different projects: often, a version/milestone is reached after 6–8 weeks. Obviously, software projects, just like any other projects, can exhibit some form of entrainment (see Ancona and Chong 1996). For future work it would be interesting to (i) assess the entrainment cyclesFootnote 6 and (ii) investigate the robustness of our results when narrowing the time windows to, e.g., days or weeks.

4.5 Choice of Observed Periods of Projects

We selected January 1, 2001 as the starting date for all four projects. At this date all projects were under development. However, at this date not all projects were very mature and, hence, their data might be inconsistent. Eclipse and Open Office, for example, had no bug reports during a nine-month period in 2001; this is probably due to the early state of the projects or their less systematic use of a bug tracker, and not because these projects were bug free. Nonetheless, as our analyses will show, the starting date should not have an undue influence on our results, as we investigated long time periods (>6 years).

5 Experiments: Change in Bug Prediction Quality

In this section we explore the nature and possible causes of the variation in bug prediction quality. From these findings we finally develop a “meta-model” that predicts the quality of the prediction models in advance. This meta-model helps project managers to decide when to use their bug prediction model and when not to—a goal we explore in Section 6.

5.1 Defect Prediction Quality Varies Over Time

The first research question we explore is whether defect prediction quality varies over time. To that end, we conduct two different kinds of experiments. In the first experiment, we keep the target period constant and predict defects on that target using models trained on data collected from every possible training period. In the second experiment, we keep the prediction model constant and predict defects on varying target periods. As mentioned, we use Weka’s (Witten and Frank 2005) J48 decision tree learner as the CPE induction method, trained with the features listed in Table 2. In both experiments the algorithm predicts the location of defects, i.e. it predicts which files will (or will not) contain bugs in the target period. The datasets for these two experiments are described in Appendix C.

For the first experiment, we start by predicting defects in the last month—the target period—of the observed period of each project using models trained on data collected in the two months—the training period—before the target period. Next, we expand the training period by one month in order to collect more information, still predicting on the same target period. This procedure is repeated until the training period reaches the maximum possible length into the past. Consequently, the maximum length of the training period is 74 months for Eclipse and Netbeans, 82 months for Mozilla, and 85 months for Open Office. Then, we move the target period one month backwards and repeat the above procedure. For example, if the initial target period is month t, then the initial training period is [t − 1, t − 2], followed by [t − 1, t − 3], etc. Next we move the target period to t − 1 and the initial training period to [t − 2, t − 3] and repeat the procedure. For each training run we measure the model’s prediction quality using its AUC value and visualize it in a heat-map. Since all projects show similar characteristics we only show the results for Eclipse in Fig. 1. In this heat-map, the X-axis indicates the target period and the Y-axis the length of the training period in months (i.e. for the training period [t − n, t − m], Y is m − n).

Fig. 1
figure 1

Eclipse heat-map: Prediction quality using different training periods with the points of highest AUC highlighted
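A sketch of how such an AUC heat-map could be assembled; `build_dataset` is a hypothetical helper that assembles the training and target data, and `train_and_score` is the sketch from Section 3.3:

```python
import numpy as np

def auc_heatmap(n_months, build_dataset, train_and_score):
    """AUC for every (target month, training-period length) pair; months are 1-based.
    build_dataset(start, end, target) returns (X_train, y_train, X_test, y_test)
    for the training period [start, end] and the target month."""
    heat = np.full((n_months, n_months), np.nan)
    for target in range(3, n_months + 1):       # x-axis: target period
        for length in range(2, target):         # y-axis: training-period length in months
            data = build_dataset(target - length, target - 1, target)
            heat[length, target - 1] = train_and_score(*data)
    return heat
```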

We further compute the maximum, minimum, mean, and variance of the AUC values, as well as the histogram of the AUC variances, for each column of the Eclipse heat-map (cf. Fig. 1) and visualize them using bar charts as in Fig. 2. In the bar charts (Fig. 2a–d), the x-axis shows the target period and the y-axis shows the AUC. In Fig. 2e, the x-axis shows the bin values and the y-axis shows the frequencies. According to the observations in Fig. 1, the models predicting defects on certain target periods (e.g., April 2005) obtain an AUC around 0.9 (see Mean AUC in Fig. 2c), while the models that predict defects in August 2003 obtain an AUC around 0.6. In some prediction periods (e.g., March 2006) the prediction quality is initially relatively low, but when the training period is expanded a certain number of months into the past, the models gain prediction quality. In other cases (e.g., July 2005), in contrast, a further expansion of the training period causes a degradation of prediction quality. The maximum AUC values for each target period (shown as squares in Fig. 1) typically lie at neither extreme. This suggests that in order to obtain higher prediction accuracy, the models should be trained on data collected from neither a very long nor a very short history.

Fig. 2
figure 2

Descriptive statistics of AUC values in each column of the Eclipse heat-map (Fig. 1)

Hence, prediction quality seems to vary across time periods (and training lengths), but is the variation significant? To establish that the AUC varies significantly we compare the distributions of high and low AUC-variance values. Specifically, we use the split-half method as described by Ko and Chilana (2010): First, we rank the variance values in descending order and divide them into two equal parts. Second, we compare the upper and lower halves using the Mann–Whitney U test (at the 95% confidence level) and find that the mean ranks of the AUC variances differ significantly (see Table 3 for details).Footnote 7

Table 3 Half-split test on column AUC values (P = 0.00 at 95% confidence level): keep target constant and varying training length
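A sketch of this split-half comparison on the column variances, using SciPy's Mann–Whitney U test:

```python
import numpy as np
from scipy.stats import mannwhitneyu

def split_half_test(auc_matrix):
    """Rank the per-column AUC variances, split them into a high and a low half,
    and test whether the two halves differ (two-sided Mann-Whitney U test)."""
    variances = np.nanvar(auc_matrix, axis=0)         # one variance per target period
    ranked = np.sort(variances)[::-1]                 # descending order
    half = len(ranked) // 2
    high, low = ranked[:half], ranked[half:2 * half]  # two equal-sized halves
    return mannwhitneyu(high, low, alternative="two-sided")
```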

What we have learned so far is that when we keep the target constant but vary the training period, we find some variation in prediction quality. In the second experiment we establish that the prediction quality also varies when we keep the training period constant and change the target.

To that end, our second experiment initially trains a prediction model using the data collected from the first two months of the observed period and then uses this model to predict defects on the third month, the fourth month, and so on until the last month of the observed period. Next we expand the training period by one month and start predicting defects from the fourth month onward. This procedure is repeated until the training period reaches the maximum observation period. Similar to the first experiment, we measure the model’s prediction quality for each target period using the AUC value and visualize all of these values in a heat-map (see Fig. 3; the x- and y-axes are the same as in Fig. 1). In Fig. 3 the bottom row of the heat-map, for example, shows the AUCs of the models trained on the data collected from the first two months of the observation period; the target periods are the third, fourth, and so on until the last month. The second row (from the bottom) shows the AUCs of the models trained on the first three months of the observation period, with target periods from the fourth month onward. Analogously, we compute the descriptive statistics of the AUC values in each row of this heat-map and visualize them in bar charts as in Fig. 4. In Fig. 4a–d, the x-axis shows the length of the training period and the y-axis shows the AUC values. The x- and y-axes of the bar chart in Fig. 4e are the same as in the above experiment. Following the one-sample Shapiro–Wilk test for normality (p = 0.442), we repeat the split-half method on the AUC variance values as above and compare the high and low AUC-variance values using the Mann–Whitney U test (at the 95% confidence level) as shown in Table 4.Footnote 8 Again we find significant variation over time. Note, however, that this experiment does not address the question of establishing the optimal training period—a question we leave open for future work. It is also important to note that the models trained in different periods are likely to rely on different combinations of predictors.

Fig. 3
figure 3

Eclipse heat-map: Prediction quality at different target periods

Fig. 4
figure 4

Descriptive statistics of AUC values in each row of the Eclipse heat-map (Fig. 3)

Table 4 Half-split test on row AUC values (P = 0.00 at 95% confidence level): keep model constant and varying target

Summarizing, these two experiments show that the prediction quality varies over time: both when holding the model constant and predicting varying target periods (change along the x-axis in Fig. 3) as well as when sliding the training period while predicting the same target (change along the y-axis in Fig. 1). Hence, models that are good predictors in some target periods are likely to be bad ones in others and the prediction quality of models for a given target period varies based on the training period.

5.2 Finding Periods of Stability and Change

So far we have seen that the prediction quality varies over time. Hence, it is worth investigating whether there are periods in which good prediction quality persists, forming periods of stability, and periods in which the model’s prediction quality changes continuously.

To differentiate periods of stability and change we slightly adapted our experiment as follows: similar to the first experiment, we kept the target period constant but varied the training period. In contrast to before, we used a two-month training window and slid this training window into the past. The format of the dataset is the same as described in Appendix C, the difference being that the length of the training period is fixed at two months. We decided to use a two-month training window because the release cycle of the considered projects is typically 6–8 weeks. In addition, the work of Bernstein et al. (2007) has shown that two months of history data attain higher prediction quality. Again, we employed Weka’s J48 decision tree learner.

Figures 5, 6, 7 and 8 visualize the results of this procedure for the considered projects. Note that whilst the X-axis of these graphs shows the target period as before, the Y-axis has a different meaning: it represents the time difference, in months, between the target period and the two-month training window. Hence, the higher up in the figure we look, the older the two-month training period is relative to the target. Values on a diagonal (bottom left to top right) represent predictions of models trained on the same period.

Fig. 5
figure 5

Two-month Heat-map: Eclipse. Note In the first nine months there are no bug reports in the target period and therefore no prediction model was built (white area at the bottom left corner)

Fig. 6
figure 6

Two-month Heat-map: Mozilla

Fig. 7
figure 7

Two-month Heat-map: Netbeans

Fig. 8
figure 8

Two-month Heat-map: Open Office. Note In the first four months there are no bug reports in the target period and therefore no prediction model was built (white area at the bottom left corner)

Looking at the above heat-maps we can see some triangle shapes (dark color). For instance, in Fig. 5, one such triangle starts in April 2002 and continues until July 2003. During these periods the prediction quality stayed relatively stable, and a triangle emerges because the old training data (along the upper left boundary/diagonal of the triangle) remains predictive. But what is a “good” prediction quality?

To identify periods of stably good predictions and to establish that the triangles are indicative of these periods, we need a notion of what “good” predictions are. Whilst the AUC scale clearly has boundaries for “perfect” (= 1) and “random” (= 0.5), it is not necessarily clear what can be regarded as “decent” for any particular task.

To determine a notion of “decent” for our task empirically we first determined the attainable prediction quality on our data using the BugCacheFootnote 9 prediction model by Kim et al. (2007). We ran BugCache on our Eclipse project data (the observed period is from April 2001 to January 2005) with different cache sizes (as the number of files varies) and present the results in Table 5. The table shows the minimum, maximum, median and mean of AUC for each of the runs. As the table shows the maximum attained AUC varies between 0.66 for the smallest cache and 0.76 for the largest one. Given BugCache’s usual prediction quality we decided to take the lower end of these values as indicating “sufficiently good” and set our threshold for “decent” predictions on our data to AUC = 0.65.

Table 5 Prediction quality (AUC) of the model (Kim et al. 2007) for Eclipse project; observed period is Apr 2001–Jan 2005

To illustrate the resulting triangle shapes, Figs. 5–8 mark periods where more than 80% of the values are higher than the threshold with a drawn triangle. Consequently, we find that the predictions inside the triangles are stable even for models learned from older data. Returning to the stable example period in Eclipse (April 2002–July 2003 in Fig. 5) we find that even data from the second quarter of 2002 (more than a year old) provides a decent prediction quality.

In all figures we also observe that the further we move the training period into the past, the more likely the prediction quality is to drop to almost random (≈ 0.5). This provides some evidence for the statement that the further back in time one goes, the more the prediction deteriorates (Kenmei et al. 2008). More concretely, from April 2002 to July 2003 the Eclipse model exhibits a stable, good prediction quality. In March 2004 the project seems to recover some stability in defect prediction quality and generates another, slightly less pronounced triangle lasting until October 2004. The triangles exhibited by the Netbeans project look similar to those of Eclipse: relatively small (approximately one year) but occurring with a high frequency. Mozilla and Open Office, on the other hand, have long periods of stability (e.g., Mozilla: from May 2001 until November 2004). In such a period, a two-month training window that is more than three years old can predict defects with a decent accuracy of around AUC 0.7.

Summarizing, the models exhibit periods of stability and variability in defect prediction quality over time. The causes of the changes—be they observable in our features or not—are not obvious from the graphs and will be investigated in Section 5.4. Another interesting observation in the heat-maps is the height of the triangle shapes. This height indicates the length of the stable period. Note that the height varies both within and between projects. Hence, a universally optimal training period length cannot be determined; it is highly dependent on the causes of the current stable period. Finally, this finding indicates that decision makers in software projects should be cautious about basing their decisions on a generic defect prediction model. Whilst such models might be useful in periods of stability, they should be ignored in periods of variability.

All of these findings assume that the triangle shapes observed are indeed a feature of the underlying software projects rather than a random artifact—a question to which we turn next.

5.3 Triangle Shapes are not Random Phenomena

To illustrate that the triangle shapes are not an epiphenomenon of the data or the prediction algorithm, we graphed the result of a naïve model. The naïve model simply predicts that a file will be buggy in the target period if it was recorded to contain at least one bug during the given two-month training period. Figure 9a shows for Eclipse that most predictions attained in this manner are random (i.e. AUC ≈ 0.5; white in the figure) and do not exhibit the triangle shapes.

Fig. 9
figure 9

Experiments to exclude the possibility of the triangles being an epiphenomenon of the data or the prediction algorithm (Eclipse data)

To elicit whether the triangles indeed visualize a phenomenon of the underlying data rather than of the prediction process itself, we added ten random variables to our feature set. The random features are generated from distributions similar to those of ten randomly selected real variables. To avoid an outlying result we repeated this procedure four times. In each of these runs the number of models that actually picked up random features was between 158 and 162 of the 2,850 models computed.Footnote 10 Figure 9b shows the run with 162 models picking random features; the other runs look almost identical. It is important to note that the models containing random features (marked with a square in the figure) are mostly found where the AUC is close to 0.5 (i.e. random). In only 14% of the cases does a model with an AUC > 0.65 pick up a random variable. Hence, we can assume that the random features are mostly picked due to noise in the data. Models pick at most two random features, which appear below the third level of the decision tree. Hence, the random features are seldom used by predictive models (where AUC ≫ 0.5) and do not seem to dominate those models. But do they deteriorate the predictive quality of those models?

To show that the random features have no statistically significant effect on the prediction quality, we compare the prediction quality of the models with random features (see Fig. 9b) to that of the models without (see Fig. 5). To that end, we first determine the triangles in Fig. 9b (with random features) using the same method as described in the previous section and confirm that these triangles are located in the same places as in Fig. 5. We then generate pairs of AUC values by picking one AUC value from Fig. 5 and the other from Fig. 9b at the same coordinate. Having confirmed, using visual techniques, that the data does not follow a normal distribution, we performed a Wilcoxon signed rank test comparing the paired values and found no significant difference between them (p = 0.446).
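The paired comparison could be sketched as follows, pairing each cell of the original heat-map with the corresponding cell of the random-feature run:

```python
import numpy as np
from scipy.stats import wilcoxon

def compare_heatmaps(auc_plain, auc_with_random):
    """Paired Wilcoxon signed rank test on AUC values taken at the same
    (training, target) coordinates of the two heat-maps."""
    mask = ~np.isnan(auc_plain) & ~np.isnan(auc_with_random)
    return wilcoxon(auc_plain[mask], auc_with_random[mask])  # large p => no significant difference
```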

The above experiments show that our feature set is better than a set of random features and that the quality of the prediction models (i.e. the ones within the triangles) does not differ significantly whether or not random features are present. Hence, the triangles must be a result of the underlying models’ predictive power given the available data. Having established that our observation of periods of stability and variability is sound, we turn to identifying the causes of such variability and set the stage for making these observations actionable.

5.4 Finding Indicators for Prediction Quality Variability

In Section 5.2 we showed that defect prediction models exhibit periods of stability and change. Can we uncover reasons for such variability?

To that end we learned a regression model to predict the AUC of the bug prediction model according to the following procedure:

First, we computed the AUC of the bug prediction model based on the data in the three months (a two-month training period and a one-month labeling period, as described in Appendix C) before the target period, in exactly the same way as in the previous subsection. The AUC is derived using the file-level features but characterizes the defect prediction for the project as a whole. Hence, it is a project-level feature.

Second, since the AUC is a project-level feature, we needed project-level features to train a prediction model. Thus, we computed a series of project-level features (listed in Table 6) by aggregating the respective file-level features over the two-month training windows and one-month labeling periods. However, we do not include information about the labeling period itself in the features since it is unique and consequently not a useful predictor. We also exclude the distance between the target period and the two-month training window, since the finding that predictions get worse the further back in time the training data lies is not novel for the software engineering community.

Table 6 Project level features for regression

Third, since (i) the AUC prediction model used a two-month training period and (ii) we are interested in changes between the training and the labeling period, we transformed the features by taking the average of the two training months (\(avg_{t}=\mathrm{average}(feature_{t-1}, feature_{t-2})\)) and subtracting it from the values of the labeling month (\(feature_{t} - avg_{t}\)).

Fourth and last, we trained a traditional linear regression model predicting the AUC from these transformed features.
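The four-step pipeline could be sketched as follows (the pandas feature frame, its column names, and the monthly indexing are hypothetical):

```python
from sklearn.linear_model import LinearRegression

def build_meta_model(monthly_features, auc_per_month):
    """monthly_features: pandas DataFrame of project-level features indexed by month t;
    auc_per_month: pandas Series with the AUC of the bug prediction model for month t.
    Meta-features are feature_t - average(feature_{t-1}, feature_{t-2})."""
    avg = (monthly_features.shift(1) + monthly_features.shift(2)) / 2.0
    delta = (monthly_features - avg).dropna()          # drops the first two months
    y = auc_per_month.loc[delta.index]
    return LinearRegression().fit(delta, y)            # step four: linear regression on the deltas
```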

The resulting regression models are shown in Tables 7, 8, 9, and 10.

Table 7 Eclipse: regression model (p-value is better than the standard level 95%)
Table 8 Mozilla: regression model (p-value is better than the standard level 95%)
Table 9 Open Office: regression model
Table 10 Netbeans: regression model

If a regression coefficient is large compared to its standard error, then it is probably different from zero. The p-value of each coefficient indicates whether the coefficient is significantly different from zero: if the p-value is ≤ 0.05 (at the 95% confidence level), the corresponding variable contributes significantly to the model. In the Eclipse and Mozilla regression models the p-value of each coefficient listed in Tables 7 and 8 is better than the standard 95% level. We did not list coefficients that are not significant. The performance of the models, measured in terms of Pearson correlation, Spearman’s rank correlation, mean absolute error (MAE), and root mean square error (RMSE), is shown in Table 11. Note that all models have a moderate correlation between the predicted and actual values of the AUC. The small MAE and RMSE reflect the good performance of our regression models.

Table 11 Performance of the regression models: correlations are significant at α = 0.01 level

In all regression models the change in the number of authors feature (for brevity we call this author in the tables) has a negative impact on the AUC: if the number of authors in the target period is larger than the number of authors in the learning period then the defect prediction quality goes down and vice versa.

Hence, an increase in the number of authors in a project reduces the applicability of a defect prediction model learned without those authors. One possible reason could be that adding more authors to a project may change the underlying development patterns.

The regression coefficients (unstandardized) for author in all four models are very small, but since the AUC moves in the range of 0–0.9, they contribute about 1% to the model and thus have at least an indicative character. For example, in Eclipse, having ten more authors in the target period than in the learning period decreases the AUC by 0.065, which is a considerable decrease. For the other projects, the influence is an order of magnitude smaller but still points in the same direction. For Open Office, the effect is statistically insignificant at the 98% confidence level. Hence, the effect is observable in most projects but is most pronounced in Eclipse.

Another interesting feature of the models is the number of lines added/removed to fix bugs relative to the total number of lines changed (called LineOpeIRTotLines). This feature reflects the fraction of work performed to fix bugs relative to the total work done. In all models except for Netbeans this factor has a high impact compared to the other features. In Eclipse, Mozilla, and Open Office, this factor contributes negatively to the model, while in Netbeans it contributes positively but not significantly.

Hence, if the coefficient is negative (as in Eclipse, Mozilla, and Open Office), then increased bug-fixing activity (compared to new feature additions) has a negative impact on the AUC—presumably because an increased overall bug-fixing effort improves general code stability but, at the same time, changes the relationship between the project-level features and the prediction quality of a bug prediction model. In addition, more bug-fixing effort changes the underlying defect generation patterns and, consequently, the prediction quality drops. The new defect generation patterns have to be fed to the prediction models so that the models can incorporate this new information.

Our assumption about the cause of this relationship (between bug-fixing effort and prediction quality) and the different influence in Netbeans is supported by the projects’ data: Netbeans has the smallest bug-fixing rate per file (3.36) compared to the other three projects (Eclipse: 3.65, Mozilla: 9.77 and Open Office: 9.29), and the mean bug-fixing rate of Netbeans differs significantly from that of the other three projects at the α = 0.05 level. We compute the bug-fixing rate per file by dividing the number of unique bugs fixed during the observed periods by the number of files that have at least one bug-fixing activity during those periods. Table 12 shows, for each project, the number of buggy files, the mean number of bugs fixed per file, and the standard deviation.

Table 12 Bug fixing rate per file: mean value of Netbeans project significant at 0.01 level

Also, it is worth mentioning that the four regression models use different sets of features for their predictions. One reason could be that the observed projects are completely independent of each other in terms of authors, their workload, their experience, development environment, etc. Therefore, the set of project features that influence the defect prediction quality varies from one project to another, resulting in different regression models. Unfortunately, however, we have no firm theory for why the predictors vary between projects.

Finally, we would like to point out that the somewhat moderate (but significant) correlations reported in Table 11 are no cause for concern. Indeed we embarked on this experiment with the goal of showing that such a prediction of an AUC is possible and produces promising results. We have achieved this goal given the results reported in the table. Obviously, there is room for improvement: a further exploration should consider non-linear regression models.

To conclude, we found that we can predict the AUC of a defect prediction model with decent accuracy (in terms of mean squared and absolute error). In addition, we found that the LineOpeIRTotLines feature behaves consistently in three of the projects and, hence, we explore it further in the next section.

5.5 Author Fluctuation and Bug Fixing Activities

The project feature LineOpeIRTotLines in the previous experiment encourages us to further investigate the authors’ contributions to bug-fixing activities. This feature has a negative impact on the AUC for the Eclipse, Mozilla and Open Office projects. However, it does not have any significant effect on the Netbeans model (p = 10.2%). One could, therefore, hypothesize that in the Eclipse, Mozilla and Open Office projects most bugs are fixed by authors who are not active in the training period but only in the target period.

To test this proposition we computed the fraction of bug-fixing work done by authors who are not active in the training period but in the target period. Figure 10 graphs the result for one target period (the last month of the observed period; the others are omitted due to space considerations;Footnote 11 but they look similar to this figure), where the x-axis represents the time difference, in months, between the target period and the two-month training window, and the y-axis represents the fraction of bug fixing performed by authors who are not active in the two-month training period but are active in the one-month target period.

Fig. 10
figure 10

Work done by new authors to fix bugs
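The fraction plotted in Fig. 10 could be computed roughly as follows (the per-commit author lists are hypothetical structures):

```python
def new_author_fix_fraction(training_fix_authors, target_fix_authors):
    """Fraction of bug-fixing commits in the target month made by authors who do
    not appear in the two-month training window; each list holds one author per
    bug-fixing commit."""
    known = set(training_fix_authors)
    if not target_fix_authors:
        return 0.0
    new = sum(1 for author in target_fix_authors if author not in known)
    return new / len(target_fix_authors)
```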

The figure shows that in Eclipse and Mozilla an increasing proportion of bugs is fixed by authors who are not in the training period, and the fraction continuously increases the further we look back into the past. In Open Office the fraction of work done by new authors varies drastically and is probably not meaningful due to a significantly smaller number of transactions (commits) per month. For Netbeans the fraction of bug-fixing work done by authors who are not in the two-month training period is initially very small and never rises above 50%, with a mean of 33.2%. Also, the number for Netbeans is relatively constant, indicating some stability in its developer base. Hence, having bugs fixed by authors who are already active in the two-month training period seems to increase the models’ prediction quality.

The above observations encouraged us to further investigate the relationship between author fluctuation and bug-fixing activity in periods of stability versus variability. To that end we identified tipping points from stable to variable periods in each of the projects and graphed the normalized change in the number of authors and the normalized change in bug-fixing activity for the months preceding the onset of the variability and for some months into the variability. Consider Eclipse (Fig. 5) as an example: here the investigated months include the “stable” months leading up to the tipping month of July 2003 and the “variable” months until October 2003.

The value for the authors is computed as:

$$ \emph{authchange}_{\rm month}=\frac{\#\emph{auth}_{\rm month} - \#\emph{auth}_{\rm month-1}}{\sum_{t \in \mathrm{months}} | \#\emph{auth}_{t} - \#\emph{auth}_{t-1} | } $$

In words: the difference between the number of authors (#auth) of the month (\(\#\emph{auth}_{\rm month}\)) and its preceding month (\(\#\emph{auth}_{\rm month-1}\)) normalized by the sum of the absolute differences of all the months considered in the graph. The value for changes in bug fixes is computed analogously. The rationale for the normalization is to make the figures somewhat comparable across different projects and time-frames.
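A sketch of this normalization applied to a list of monthly author counts (the bug-fix series is treated analogously):

```python
def normalized_changes(monthly_counts):
    """authchange per month: the month-over-month difference divided by the sum of
    the absolute differences over all months considered in the graph."""
    diffs = [b - a for a, b in zip(monthly_counts, monthly_counts[1:])]
    total = sum(abs(d) for d in diffs) or 1     # guard against division by zero
    return [d / total for d in diffs]
```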

Figures 11, 12, 13 and 14 show a selection of the resulting figures, which are titled by the “tipping” month.

Fig. 11
figure 11

Eclipse: Tipping starts in July 2003

Fig. 12
figure 12

Netbeans: Tipping starts in April 2006

Fig. 13
figure 13

Open Office: Tipping starts in February 2004

Fig. 14
figure 14

Open Office: Tipping starts in September 2007

According to the figures, in most cases we observe a relative drop in the number of authors in the “tipping” month, mostly followed by an increase in authors during the drift (the decline in prediction quality). We also find that in many instances the relative amount of work done for bug fixing increases in the “tipping” month.

Unfortunately, none of these observations are unique to the tipping periods. Considering Eclipse (Fig. 11), for example, we find that the normalized author differential dips three times: in January 2003, in April 2003, and preceding the drop in prediction quality in July 2003. The same can be said for the normalized bug differential. Hence, we cannot argue that these factors alone can be used to predict periods when the prediction quality starts declining, but together they can serve as a basis for developing such an early warning indicator.

Summarizing, we observe that increasing the number of authors editing the project has a negative impact on defect prediction quality. We also saw that more work done to fix bugs, relative to other activities, causes a reduction of the defect prediction quality. Further exploration indicated that bug fixing by authors/developers who are already present during the learning period helps to increase prediction quality. These findings indicate that it is possible to uncover reasons that influence defect prediction quality.

6 Turning the Insights into Actionable Knowledge

In the previous section we explored the variation in defect prediction quality and early indicators for such variation. In particular, in Section 5.4 we found that a prediction model can be learned that predicts the performance of bug prediction models. Our approach essentially devises a prediction model of a prediction model, which we will call a ‘meta-prediction model’ in the following. The main remaining question is whether such a meta-model can be used within a decision procedure for software project managers. In this section, we address this issue and present such a decision procedure that relies on these meta-prediction models.

Consider the meta-prediction models learned in Section 5.4 and shown in Tables 7–10. As we argued, these meta-prediction models can predict the AUC of the bug prediction model at any given time period using the project features. Assuming that these predictions are good (and we showed in Table 11 that they are at least decent), it seems natural to use them as a decision measure for the expected quality of bug prediction methods.

Specifically, a software manager hoping to attain a reliable indication of the location and quantity of bugs should only use bug prediction methods when they can be expected to reach a certain prediction quality. A good indicator for the expected quality of a bug prediction method is the value generated by the meta-prediction model. Hence, she should only use the bug prediction method when the meta-prediction model predicts an AUC above a certain threshold. We have shown in Section 5.2 that an AUC > 0.65 is a “decent” result when compared to BugCache.
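The resulting decision rule is deliberately simple. The sketch below assumes a callable meta_predict_auc standing in for the learned meta-prediction model and a project-specific threshold; both names are illustrative rather than part of our implementation.

```python
from typing import Callable, Mapping

def should_use_bug_prediction(project_features: Mapping[str, float],
                              meta_predict_auc: Callable[[Mapping[str, float]], float],
                              threshold: float) -> bool:
    """Decision rule: rely on the bug prediction model for the coming
    period only if the meta-prediction model expects its AUC to exceed
    the project-specific threshold (chosen so that the actual AUC stays
    above the required minimum)."""
    return meta_predict_auc(project_features) > threshold
```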

To show that this method works, Fig. 15 plots on the y-axis the average AUC of all actual predictions obtained from the bug prediction models that the meta-prediction model predicted to have an AUC above the threshold, whilst varying the threshold on the x-axis. Note that whenever the meta-prediction model makes a bad (overly optimistic) prediction, an actual AUC below the threshold will result.

Fig. 15 Estimating the actual AUC based on the AUC predicted by the linear models

The figures show that raising the threshold eventually leads to better predictions. The figures also show some instability in prediction quality with a rising threshold, and in three of the four cases a collapse of prediction quality at the very end. Whilst this is disappointing at first, it becomes logical when considering that the number of data points decreases with a rising threshold. In other words: the further to the right in the figure one looks, the fewer actual predictions are used to compute the average. This has two consequences: First, at the very end only one model is over the threshold and the “average” is really the prediction quality of that single model. Second, as the number of data points included in the average decreases, the average becomes increasingly influenced by single misjudgments. Hence, the instability, which is initially quite disappointing, becomes understandable.
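In principle, the curve in Fig. 15 can be reproduced from pairs of predicted and actual AUC values as sketched below; the function and variable names are ours and not taken from the original evaluation scripts.

```python
from typing import List, Tuple

def average_actual_auc(auc_pairs: List[Tuple[float, float]],
                       thresholds: List[float]) -> List[Tuple[float, float, int]]:
    """For each threshold, average the actual AUC over all bug prediction
    models whose predicted AUC (from the meta-model) exceeds it.  The
    number of remaining models is returned as well, since the average
    becomes unstable once only a few models pass the threshold."""
    curve = []
    for th in thresholds:
        selected = [actual for predicted, actual in auc_pairs if predicted > th]
        if selected:
            curve.append((th, sum(selected) / len(selected), len(selected)))
    return curve
```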

As a consequence, we can conclude that our approach works quite well for all projects. Indeed, choosing thresholds (>0.825 for Eclipse, >0.8 for Mozilla, >0.6 for Open Office) assures a manager that her model will reach the minimum required prediction quality (AUC 0.65). This does not, however, imply that the model’s prediction quality cannot exceed that limit: it can vary between 0.65, the minimum in our case, and 1.0. For Netbeans, the threshold is >0.7 and it yields a minimum prediction quality of approximately 0.61–0.62, which is rather low. Please note that for Netbeans we explored 93 subcomponents in this experiment. These subcomponents have been developed under different development environments (different authors, tools, etc.). Therefore, we can expect a large variability in the Netbeans data and a consequently lower prediction quality for the model. However, this experiment is not designed to test the proposition that large variability in software data has an impact on defect prediction quality; this is an avenue for future research.

7 Discussion, Conclusions, and Future Work

This paper investigated the notion of periods of stability and variability in data from software projects. Specifically, we were interested in such differing periods with respect to their impact on defect prediction algorithms. Using data from four open source projects we found that the quality of defect prediction approaches indeed varies significantly over time. We furthermore found that the quality of the prediction clearly follows periods of stability and variability, indicating that differing periods are indeed an important factor to consider when investigating defect prediction. As a consequence, the benefit of bug prediction in general must be seen as volatile over time, and bug prediction should therefore be used with caution.

We observed that the number of authors editing the project rises right before or during periods of instability. This reinforces the well-known software engineering lesson that “adding manpower to a late software project makes it even later” (Brooks and Phillips 1995). We also saw a relationship between changes in the proportion of work done to fix bugs relative to other activities and changes in defect prediction quality. Unfortunately, neither of those correlations was observed uniformly, so they can only serve as a starting point for eliciting early warning indicators for changes in stability and, hence, for the reduced quality of existing defect prediction models. Furthermore, we found that bug fixing by authors who were active in the learning period helps to improve defect prediction quality. Nonetheless, we found that we can build a “meta-prediction model” that predicts the bug prediction model’s quality with decent accuracy. This meta-prediction model in turn can be used as the basis for a decision procedure that allows software managers to decide when to use bug prediction models and when not. Empirically, we found that this decision procedure works well for three out of the four investigated projects. It can, therefore, be seen as a first step towards turning insights about stable and volatile phases into an actionable element in software managers’ decisions.

During our experimentation we repeatedly asked ourselves whether the cause behind the periods of stability and variability lies in concept drift. Concept drift is a notion from machine learning that refers to changes in the data generation process. Specifically, Tsymbal (2004) defines it as follows:

In the real world concepts are often not stable but change with time. ...Often these changes make the model built on old data inconsistent with the new data, and regular updating of the model is necessary. This problem is known as concept drift, ...drifts can occur suddenly (abruptly, instantaneously) or gradually.

Our phenomenon exhibits strong attributes of concept drift: it shows periods of stability followed by periods of variability (= drift). Indeed, it was this “behavior” that inspired us to draw on previous work about prediction under concept drift and adapt it to our meta-prediction model (Vorburger and Bernstein 2006). In addition, we found some limited evidence for concept drift: the author fluctuations discussed in Section 5.5, for example, indicate that the influx of new developers is associated with changes in variability. Obviously, these new authors may not be familiar with the norms of the projects and, hence, import new programming habits. This in turn may lead to the introduction of new bugs, changing the concept and generating the periods of variability we observe. Whilst this scenario seems plausible, we have no direct evidence for it. One would have to conduct interviews with developers and extract a multitude of events that may influence a project in order to identify the underlying reasons for the phenomenon. While interesting, this constitutes a new study, which we have to leave open for future work.

Obviously, the generalizability of all our findings is curtailed by the limited number of projects considered. Whilst four projects is a decent size, the generalizability of these findings needs to be investigated by looking at additional open and closed source projects. Furthermore, we find that the results are not as ‘clean’ as one would wish. Indeed, as Figs. 5–8 illustrate, the triangle shapes indicating periods of stability are sometimes difficult to discern, even though we identified and highlighted them using an auto-detection method. We hope that further investigations may uncover the reasons for this seeming ‘noise’ in the data.

This paper only scratches the surface of changing periods in software projects. Indeed, further investigations into the causes of these changes are needed. In the ideal case it would be possible to identify influential factors that hold for software projects in general. Also, we need to investigate other models for change in addition to our straightforward meta-model. Whatever the outcome of future investigations, we can safely say that the notion of periods of stability and variability seems to have a profound influence on the empirical investigation of software evolution and needs to be taken seriously in any empirical software engineering study, in particular in studies about predicting software bugs.