1 Introduction

In open source software, the requirements of the users are dynamic and large number of requests for bug fixes, feature enhancement and new feature introduction are reported through bug/issue reporting system. Software source codes are modified or enhanced to satisfy the user’s need. The enormous changes in the code make the software complex over a period of time. These changes increase the complexity of the code which also leads to the introduction of bugs. The changes are being made continuously in the software to remove the bugs, enhance the features and implement the new functionalities. The open source software/projects are being built through the contributors from diverse communities and keep on improving rapidly. The code complexity is one of the major attribute which determines the quality and reliability of the software. Many contributors are making changes in the source code of the software to produce the quality software in limited time. It is getting difficult to remember all those changes committed in the code which makes source code complex. The contributors are interacting with each other through the discussion forums/mailing lists etc. Three types of code changes occur in the source code namely, bug repair/bug fix, feature enhancement and addition of new features. Bugs are generated in the software mainly due to mis-communication or no communication among active users, frequently changing requirements, early release pressures, occurrence of programming errors, increasing software complexity, and bugs present in software development tools itself.

Entropy, an information theory based measure is defined as a measure of randomness/uncertainty/complexity in the code changes. This has been earlier attempted to quantify the code change and predict the bugs based on past defects using entropy (Hassan 2009). The software reliability growth models (SRGM) based on calendar time and testing efforts (time vs bugs) have been proposed in the literature and widely used in the industry (Goel and Okumoto 1979; Musa et al. 1987; Ohba 1984; Yamada et al. 1983; Singh et al. 2007; Pham 2006; Trivedi 2001; Xie 1991) .

In this paper, we have proposed models to determine the latent bugs lying dormant in the software by using two approaches namely, time vs entropy and entropy vs bugs. The proposed models have been developed in the line of existing SRGM. These approaches are summarized as follows:

Time vs bugs (reliability growth models):

In this approach, the reliability growth models namely exponential (Goel and Okumoto 1979), delayed S-Shaped (Yamada et al. 1983), inflected delayed S-shaped (Ohba 1984) and power function (Singh et al. 2007) based models have been used to predict the potential bugs lying dormant in the software

Time vs entropy (potential complexity of code changes/entropy based model):

Using this approach, we firstly predict the potential complexity of code changes/entropy and then it has been used to predict the potential bugs

Entropy vs bugs (complexity of code changes based models):

In this approach, we have proposed and developed the complexity of code changes based bug prediction models

To study the various bug prediction practices/approaches mentioned above, we have proposed the following research hypothesis:

  • Null hypothesis: The bug prediction approaches namely time vs bugs, time vs entropy and entropy vs bugs have no significant difference.

  • Alternate hypothesis: The bug prediction approaches namely time vs bugs, time vs entropy and entropy vs bugs have a significant difference.

The proposed models in the study have been applied to various components of Mozilla software to carry out the empirical analysis. The statistical significance has been tested using non parametric Kolmogorov–Smirnov (K–S) statistical test. The bug prediction scenario has been validated using R-squared (R2), Adjusted R2 (Adj. R2), bias, variation and root mean square prediction errors (RMSPE).

The rest of the paper is divided into nine sections. Sect 2 describes the review of work available in the literature. Sect 3 defines the complexity of code changes. Sect 4 discusses bug prediction approaches and modelling. Sect 5 mentions the procedure of data collection and data pre-processing. In Sect 6, the results and discussions of the models are discussed. The managerial applications and threat to the validity has been discussed in Sects 7 and 8 respectively. Finally, the paper is concluded in Sect 9.

2 Review of work

A variety of approaches have been proposed in the literature for bug prediction such as software reliability growth models (Goel and Okumoto 1979; Huang et al. 1997; Kapur and Garg 1992; Kapur et al. 1999, 2008; Lyu 1996; Musa et al. 1987; Ohba 1984; Yamada et al. 1983), code metrics (lines of code) (Arisholm and Briand 2006; Gyimothty et al. 2005; Nagappan and Ball 2005a; Nagappan et al. 2006), process metrics (number of changes) (Hassan 2009; Nagappan and Ball 2005b) and previous defects (Hassan and Holt 2005; Kim et al. 2007; Ostrand et al. 2005). Reliability models have been developed to predict failure rates based on the expected operational usage profile of the system. Defect detection rate during testing process was used to predict defects with an objective to reduce the software defects. The defects are minimized, if we incorporate the changes early in design and testing phases (Fenton and Neil 1999). The number of changes and the age of a module provide a good prediction over other product measures (Graves et al. 2000). The chaos in software and code development process has been studied in the literature and concluded that a complex process of coding had a negative effect of producing a complex system (Hassan and Holt 2003a, b). Researchers proposed in the literature that a prior modification to a file is a good predictor to determine the fault potential in the software (Arisholm and Briand 2006; Graves et al. 2000; Khoshgoftaar et al. 1999; Leszak et al. 2002). There is no single set of metrics which can predict bugs of multiple projects (Nagappan et al. 2006). A combination of product and process measures to predict the defect density using a decision tree algorithm was developed and validated with Mozilla releases. It is also concluded that the process measures are performing well over the product measures for bug prediction (Knab et al. 2006). A comparative study has been conducted for bug prediction using change metrics and code metrics for different releases of Eclipse and found that change metric gives better performance over the code metric (Moser et al. 2008). Change metrics achieve better prediction performance in the case of effort aware prediction models which consider the effort required for testing and code reviews (Kamei et al. 2010). History complexity metric (HCM) as a process measure has been proposed in the literature for bug prediction and compared with other approaches namely prior faults, prior modifications and complexity metrics. It is concluded that the proposed complexity metric is a better predictor of faults as compared to product measures (Hassan 2009). A benchmark study was conducted on defect prediction approaches and provided an extensive study to compare these models and metrics using decay based models (D’Ambros et al. 2010, 2012). A mathematical model was proposed to study the diffusion of the complexity of code changes in software and predicted the potential complexity of code changes (Chaturvedi et al. 2012). Recently, an attempt has been made to propose generalized decay models to measure the historical complexity metric with an intuition that the effect of code change can be reduced over a period (Singh and Chaturvedi 2012). Further, the authors applied support vector machine technique for regression and predicted the next year expected bugs based on the current year complexity of code changes/entropy (Singh and Chaturvedi 2013). To the best of our knowledge, no effort has been made to develop the complexity of code changes based mathematical models for predicting potential bugs. This paper investigates the applicability of the complexity of code changes based bug prediction. A comparison with the existing reliability growth models has been also proposed.

3 Complexity of code changes

The information theory deals with assessing and defining the amount of information contained in a message is measured as the amount of uncertainty or entropy of the distribution. The entropy H n is defined by (Shannon 1948) as

$$H_{n} (P) = - \sum\limits_{k = 1}^{n} {(P_{k} } \log_{2} P_{k} )$$
(1)

where P k  ≥ 0 and \(\sum\limits_{k = 1}^{n} {P_{k}=1}\)

where n is the number of files and the value of k varies from 1 to n. For a distribution P, where all the files have the same probability of changes \((P_{k} = \, {1 \mathord{\left/ {\vphantom {1 n}} \right. \kern-0pt} n};\forall \;k = 1,2, \ldots ,n)\), we achieve maximum entropy. On the other hand for a distribution P where a file i has a probability of code changes i.e., \(P_{i} = \, 1\)and ∀k ≠ i, P k  = 0, we achieve a minimal entropy.

The entropy for this period can be calculated by substituting the values of these probabilities in Eq (1). From the definition, it is clear that the entropy will be maximum, if the changes are in every file. On the other hand, it will be minimum if the changes are occurring in a single file.

Code change process means to study the patterns of source code modifications. These modifications/changes are occurring due to bug repairs (BR), feature enhancement/modification (EM) and the addition of new features (NF). BR is the changes or modifications made in the code due to fixing of the bug. EM is the changes made due to some cosmetic changes like format, alignment, justification, comments etc. NF is the changes made in the code due to incorporation of new features or new components. These changes make the software code complex and may produce a system with new bugs. If the developer does not understand source code properly before performing changes/modifications in the source code, the release or new version of the software may get delayed. The frequent changes in the code may also negatively affect the software system in terms of its overall quality, reliability and sustainability.

The entropy based estimation may be helpful in studying the code change process. Code change quantifies the patterns of changes instead of simply measuring the number of changes to measure the effect of changes in code structure. This measurement is called the complexity of code changes. These changes can be studied at a source code level with granularity at package/file/class/method/variable at the code level. These changes are measured at a specific duration which can be anything ranging from hours to year/decades depending on the frequency of updates/changes in the code. The complexity of code changes is calculated based on the number of changes in a file for specific periods. The period can be taken as a day, week, month, year etc. based on the total duration of the project or the number of changes occurs.

Hassan (2009) proposed and used entropy to measure the complexity of code changes. The probability P k is defined as the ratio of the number of times kth file changed during a period and the total number of changes for all files in that period.

For example, suppose that there are 15 changes occurred in four files and three periods. These changes are shown in Fig. 1. For a first period T1, there are five changes occurred across all four files namely F1, F2, F3 and F4. There are eight changes in period T2 and two changes in period T3. All the four files are affected in period T2 and only two files are affected in period T3. The probability of these changes in files F1, F2, F3, and F4 during the period T1 will be 1/5 (=0.2), 1/5 (=0.2), 2/5 (=0.4) and 1/5 (=0.2) respectively. These probabilities have been shown in Fig. 1. By using these probabilities, we calculate the entropy and complexity of code changes for all the periods.

Fig. 1
figure 1

Number of changes in files with respect to a specific period of time

4 Bug prediction approaches and modelling

In this section, we firstly discussed the mathematical models used for measuring the reliability growth of software and after that on the line of these models, new models have been proposed to measure the potential complexity of code changes for bug prediction and to predict the bugs based on the complexity of code changes.

Notations

t :

Time

X :

Potential bugs

m(t):

Bugs occurred/removed up to time t

b :

Bug detection/removal rate

β :

Constant and >0

H 0 :

Potential complexity of code changes

H(t):

Complexity of code changes at time t

c :

Rate of complexity of code changes

χ :

Constant and >0

Z :

Potential bugs due to the complexity of code changes

m(H(t)):

Bugs occurred/diffused/removed up to H(t)

g :

Rate of bug occurrence/diffusion in the software due to the code changes

δ :

Constant and >0

k :

Constant used in power function

4.1 Time vs bugs (reliability growth models)

In this approach, the reliability growth models namely GO (Goel and Okumoto 1979), Yamada delayed S-shaped (Yamada et al. 1983), inflected S-shaped (Ohba 1984), Kapur-Garg Model (Kapur and Garg 1992) and the instruction execution dependent models (Singh et al. 2007) have been used to predict the potential bugs.

There are several existing well-known non-homogenous poisson process (NHPP) models with different mean value functions. The mean value function m(t) is described as the expected cumulative number of bugs in (0, t] time interval. The SRGM are based on the following assumptions.

It has been observed that the relationship between the time and the corresponding number of bugs detected/removed is either exponential or S-shaped. Let [N(t), t ≥ 0] denote a discrete counting process representing the cumulative number of failures experienced (fault removed) up to time t, i.e., N(t), is said to be a non-homogeneous poisson process (NHPP) with intensity function λ(t), if it satisfies the following conditions:

There are no failures experienced at a timet = 0, i.e., N(t = 0) = 0 with probability 1.

The process has independent increments, i.e., the number of failures experienced in time interval, i.e., N(t + Δt) − N(t), is independent of the history. Note this assumption implies the Markov property that N(t + Δt) of the process depends only on the present state N(t) and is independent of its past stateN(x), for x < t.

The probability that a failure will occur during (tt + Δt] is λ(tt + ot), i.e., Pr [N(t + Δt) − N(t) = 1] = λ(t) + ot). Note that the function ot) is defined as

$$\mathop {\lim }\limits_{\varDelta t \to 0} \frac{o(\varDelta t)}{\varDelta t} = 0$$

In practice, it implies that the second or higher order effects of Δt are negligible.

The probability that more than one failure will occur during (tt + Δt] is ot), i.e., Pr [N(t + Δt) − N(t) > 1] = ot).

The bug detection/removal process can be described as

$$\frac{{{\text{d}}m(t)}}{{{\text{d}}t}} \propto (X - m(t))\;\text{or} \;\frac{{{\text{d}}m(t)}}{{{\text{d}}t}} = b(t)\left( {X - m(t)} \right)$$
(2)

Here, b(t)is time dependent bug detection/removal rate.

Solving Eq (2) with initial condition, m(0) = 0 at a time t = 0, we get

$$m(t) = X\left( {1 - {\text{e}}^{{ - \int_{0}^{t} {b(t)\,{\text{d}}t} }} } \right)$$
(3)

By putting different value of b(t) in Eq (3) and solving with initial condition m(0) = 0, we get SRGM as follows in (Goel and Okumoto 1979; Ohba 1984; Yamada et al. 1983; Singh et al. 2007). These models for different values of b(t) have been shown in Table 1.

Table 1 Time vs bugs (existing software reliability growth models)

In the above approach, the models have been developed by taking the calendar time i.e., t. These models have not been linked to the source code changes. In the following sections, we have proposed and developed models based on the source code changes i.e., the complexity of code changes.

4.2 Time vs entropy (potential entropy based model)

Using this approach, we firstly predict the potential complexity of code changes/entropy and after that the potential complexity of code changes has been used to predict the potential bugs lying dormant in the software.

The function H(t) is called the mean value function and describes the expected complexity of code changes in (0, t] time interval as described below by different mathematical models. The diffusion of the complexity of code changes in software has been described by the following equation:

$$\frac{{{\text{d}}H(t)}}{{{\text{d}}t}} \propto (H_{0} - H(t))\;\text{or} \;\frac{{{\text{d}}H(t)}}{{{\text{d}}t}} = c(t)\left[ {H_{0} - H(t)} \right]$$
(4)

Here, c(t) is rate at which the code are changes i.e., the complexity of code changes.

Solving the Eq (4) with initial condition, H(0) = 0 at a time t = 0, we get

$$\Rightarrow H(t) = H_{0} \left( {1 - {\text{e}}^{{ - \int_{o}^{t} {c(t)dt} }} } \right)$$
(5)

We take different value of c(t) to derive different models which are shown in Table 2.

Table 2 Time vs entropy (proposed potential complexity of code changes based models)

We calculate the potential complexity of code changes diffused in the software using proposed models i.e., Model 7 to Model 12 which are shown in Table 2. After getting the potential complexity of code changes, we apply simple linear regression to find the potential bugs diffused in the software.

The simple linear regression (Weisberg 1980) has been fitted with the complexity of code changes as independent variable H(t) and bugs detected as dependent variable m(t) using an equation

$$m(t) = \beta_{0} + \beta_{1} *H(t)$$
(6)

where β 0 and β 1 are regression coefficients

After getting the regression coefficients β 0 and β 1, we obtain the potential bugs X for potential entropy H 0 in the software by putting the value of the potential complexity of code changes obtained from the above proposed models shown in Table 2.

4.3 Entropy vs bugs (complexity of code changes)

In this approach, we have developed the complexity of code changes based models for bug prediction.

The function m(H(t)) is the mean value function and describes the expected cumulative potential complexity of code changes in (0, H(t)) interval. The bug diffusion/detection process can defined as

$$\frac{{{{{\text{d}}m(t)} \mathord{\left/ {\vphantom {{{\text{d}}m(t)} {{\text{d}}t}}} \right. \kern-0pt} {{\text{d}}t}}}}{{{{{\text{d}}H(t)} \mathord{\left/ {\vphantom {{{\text{d}}H(t)} {{\text{d}}t}}} \right. \kern-0pt} {{\text{d}}t}}}} \propto \left[ {Z - m(H(t))} \right]\;\text{or} \;\frac{{{{{\text{d}}m(t)} \mathord{\left/ {\vphantom {{{\text{d}}m(t)} {{\text{d}}t}}} \right. \kern-0pt} {{\text{d}}t}}}}{{{{{\text{d}}H(t)} \mathord{\left/ {\vphantom {{{\text{d}}H(t)} {{\text{d}}t}}} \right. \kern-0pt} {{\text{d}}t}}}} = g(H(t))(Z - m(H(t)))$$
(7)

g(H(t)) is the rate of bug occurrence/diffusion in the software due to the complexity of code changes at any given time t.

With the initial conditions H(0) = 0 at the time \(t = 0 ,\) by solving the above equation, we get

$$m(t) = Z\left( {1 - e^{{ - \int_{0}^{t} {g(H(t))d(H(t))} }} } \right)$$
(8)

If we take different values of g(H(t)) in Eq (8), we get different proposed models based on these values as shown in Table 3.

Table 3 Entropy vs bugs (proposed complexity of code changes based models)

The existing and proposed models are shown in Tables 1, 2 and 3. The unknown parameters of the models have been estimated using Statistical Package for Social Sciences (SPSS) software.

5 Data collection and preprocessing

We have taken the complexity of code changes of “mozilla/layout/svg/”, “mozilla/layout/base/”, and “mozilla/layout/xul/” of the Mozilla project (The Mozilla project 2013) and the bugs which have been reported in BugZilla (The bugZilla project 2013) bug reporting and tracking system. The process of data collection, extraction and model building are shown in Fig. 2. These subsystems are selected as test cases and historical changes of the files of these subsystems are extracted from the concurrent versioning system (CVS) logs repository. After extracting data from the repository, monthly number of changes and the number of bugs occurred/fixed during that period have been recorded for all the files.

Fig. 2
figure 2

Process for building the entropy based prediction models

These bugs are arranged based on their first appearance for a fix and remaining duplicate entries of these bugs are discarded. In this study, we have considered the period as the calendar months of a year. There are 52 files in the subsystem “mozilla/layout/svg/”, 60 files in the subsystem “mozilla/layout/base/”, and 106 files in the subsystem “mozilla/layout/xul/”. The methodology for data collection and modelling are as follows:

Step 1:

Choose the project

Step 2:

Select the sub-systems

Step 3:

Browse the CVS logs for historical changes

a. Arrange the files as per their commit/transaction date of the changes

b. Count month wise changes with respect to various files in the chosen sub-system

c. Calculate entropy/complexity of code changes of the files

Step 4:

Browse the CVS logs for historical changes

a. Extract the bugs detected/removed during the period under consideration

b. Arrange the bugs as per their arrival or first appearance and remove other duplicate occurrences

c. Count month wise bugs of the chosen sub-system

Step 5:

Build the Prediction Models using different approaches

Step 6:

Determine the potential complexity of code changes and the potential number of bugs in the software

Step 7:

Apply the K–S test for statistical significance using data in step6

Step 8:

Calculate the entropy of the current period and predict the number of bugs for the coming year

These detected/removed bugs are extracted from the CVS log repository and confirms with the bug reporting system. A sample of CVS logs in a particular file of the specific subsystem is also shown in Fig. 3.

Fig. 3
figure 3

CVS log output of the subsystem of Mozilla (Source: http://bonsai.mozilla.org/cvslog.cgi?file=mozilla/layout/base/nsBidiPresUtils.h&rev=HEAD&mark=1.29)

In Fig. 3, date wise changes are recorded for a specific file nsBidiPresUtils.h with a specific bug in the log entry.

Prediction Error (Pred. Error):

It is the difference between observed value and predicted value. The lower value of Prediction error indicates less fitting error, thus indicates better goodness of fit

$${\text {Pr}}\, ed.Error = {(ObservedValue - {\text {Pr}}\, edictedValue)}$$
Bias:

It is the average of prediction error. Lower the value of bias, provides better goodness of fit

$$Bias = \frac{{\sum {(Observed - Predicted)} }}{n}$$
Variation:

The standard deviation of prediction error is known as variation. Lower value of variation indicates better goodness of fit

$$Variation = \sqrt {\frac{{\sum {(\Pr \;ed.Error - Bias)*(\Pr \;ed.Error - Bias)} }}{n - 1}}$$
Root mean squared prediction error (RMSPE):

It is a measure of closeness with which a model predicts the observation. Lower value of RMSPE provides better goodness of fit

$$RMSPE = \sqrt {(Bias*Bias + Variation*Variation)}$$
Coefficient of determination (R-square):

This coefficient is defined as the ratio of the sum of squares between corrected and residuals subtracted from 1. This measure is used to test the significant trend between predicted and observed values.

$$R^{2} = 1 - \frac{CorrectedSS}{{\text{Re} \;sidualSS}}$$

R2 measures the total variation about the mean for the fitted curve. It ranges in value from 0 to 1. Lower values indicate that the model does not fit the data well and the larger R2 indicates the better fit of the model and able to explain the variation in data

Kolmogorov–Smirnov (K–S) test:

The K–S test is a non-parametric test and it is free from the distribution. It does not consider any assumption about the data. This test is able to provide the distance between the observed and predicted curve with the confidence level in terms of probability. Lower the value of distance and higher the value of probability provides the significantly improved performance of the model

The trends of complexity of code changes/entropy and bugs for the considered subsystem under study are shown in Figs. 4, 5 and 6 for subsystems SVG, Base and XUL respectively. It is clear from these figures that the entropy and bugs fixed/occurred are highly correlated.

Fig. 4
figure 4

Entropy and bug occurrence trends of SVG dataset

Fig. 5
figure 5

Entropy and bug occurrence trends of base dataset

Fig. 6
figure 6

Entropy and bug occurrence trends of XUL dataset

6 Results and discussions

The data has been collected for three subsystems of layout components of Mozilla namely SVG, Base, and XUL. The entropy/complexity of code changes is calculated on a monthly basis with respect to changes in the files of these subsystems. The parameters of these models have been estimated by applying the non-linear regression in SPSS. Simple linear regression has been applied between the complexity of code changes and bugs to obtain the regression coefficients. These regression coefficients have been used to calculate the potential bugs based on the potential complexity of code changes estimated using entropy based models. The statistical performance and regression coefficients for the considered data sets are shown in Table 4. The values of R2 are found more than 98 % in all cases. The standard error of the estimates has been found to be 19.0108, 48.8429 and 20.6138 for SVG, Base and XUL respectively which shows the appropriateness of the model fitting.

Table 4 Statistical performance parameters of entropy vs bugs using simple linear regression

We have estimated the parameters of different proposed models with respect to all the cases i.e., namely software reliability growth models, the potential complexity of code changes based models and the complexity of code changes based models using the SPSS package.

The parameters of SRGM based models developed in literature i.e., from Model 1 to Model 6 are estimated using nonlinear regression and the results of this estimation is shown in Table 5. The value of R2 is more than 0.99 for all the models across all the datasets except models mentioned Model 1 and Model 2 for SVG dataset. The bug detection/removal rate i.e., b is varied from 0.001 to 0.095 for SVG, 0.001–0.013 for Base and 0.001–0.021 for XUL datasets. Model 1 does not give the accurate value of the parameters as this model is exponential in nature and does not fit on SVG dataset. Estimated potential bugs are shown in Table 5 in the third column as X. The potential bugs for the Model 1 is in the range of 195,400, 12,450 and 9,845 for datasets SVG, Base and XUL respectively. Model 1 is not fitted well as shown by the value of R2 for the SVG dataset. The value of b is also much less for this dataset i.e., 1.71E−05. If we further look into the table, we found that the values of X for the Model 2 are also very large i.e., 1,532 with the R2 0.845. In the rest of the other cases shown in Table 5, the value of R2 is more than 0.99 and the potential bugs are within the range.

Table 5 Parameter estimates for different models of time vs bugs (software reliability growth models)

The goodness of fit curve has been shown in Figs. 7 and 8 for the models which give the maximum fit on the basis of R2 for given datasets. The results are confirmed by drawing the goodness of the fit curve for the Model 3 and Model 6 in the Figs. 7 and 8 respectively. Model 3 fits well with the predicted bugs and the actual bugs lying/detected in the software. Model 6 shows the large variation for SVG and Base dataset in the later part of the software maintenance. Larger value of β shows the poor maintenance (bug removal is delayed) in SVG as compared to smaller value of β for XUL dataset. The value of k is found significant in all the models namely Model 4, Model 5 and Model 6 using power functions. For management decision we observed that the predicted potential bugs using Model 3, 4, 5 and 6 are closer to the detected bugs i.e., 488 for SVG data set. The difference in R2 value of these models is also non-significant. For Base data set, the Model 3 and Model 6 are estimated potential bugs closer to the detected bugs. These models are also showing the improved performance in terms of R2 which is significant. For XUL data set, Model 2, Model 3 and Model 5 shows the significantly closer potential bugs with the detected bugs in the component i.e., 922. But the R2 is improved only in case of Model 4 and Model 6 which have a significantly higher value of potential bugs.

Fig. 7
figure 7

Goodness of fit for model defined in Model 3

Fig. 8
figure 8

Goodness of fit for model defined in Model 6

The parameters of the potential complexity of code changes based models proposed in Model 7 to Model 12 has been estimated using nonlinear regression and the results are shown in Table 6. The value of R2 is more than 0.99 for all the models across all the datasets except model discussed in Model 8 and Model 9 for SVG dataset. For model mentioned in the Model 8 does not give the accurate value of the parameters as this model is exponential and does not fit on SVG dataset. This potential entropy or potential complexity of code changes will help in predicting the future potential bugs lying dormant in the software using simple linear regression. These regression parameters shown in Table 4 are used with the potential complexity of code changes to determine the potential bugs lying dormant in the software. These potential bugs are shown as the last column of the Table 6. The goodness of fit curve has been shown in Figs. 9 and 10 for the models which give the maximum fit on the basis of R2.

Table 6 Parameter estimates for different models for time vs entropy (potential complexity based models)
Fig. 9
figure 9

Goodness of fit for model defined in Model 9

Fig. 10
figure 10

Goodness of fit for model defined in Model 12

From the Table 6, it is also clear that the potential bugs with respect to Model 7 are found 246,933, 16,209 and 6,298 for SVG, Base and XUL respectively. The value of R2 for Model 7 and Model 8 are 0.784 and 0.971 for SVG dataset. The value of R2 is more than 0.99 in the rest of the other models and datasets. The rate of complexity of code changes is also in the range of 0.001–0.06 for all the proposed models based on the complexity of code changes. The potential bugs predicted using all these proposed models are at par with the actual bugs except Model 7. Model 8 which is overestimated and also showing the poor value of R2 i.e., 0.971 for SVG dataset.

For SVG data set, the value of potential bugs for Model 9, Model 10 and Model 12 give fair value against the bugs detected i.e., 488. This shows that the remaining bugs still need to be removed from the software. Since, Model 9 and Model 12 gives better R2, we will trust these models for determining remaining bugs yet to be removed from the software. For Base data set, the potential bugs are estimated at par i.e., 1,621 for the Model 8, Model 9, Model 11 and Model 12. By the comparison between these models based on value of R2, it is found that for Model 9 and Model 12 with 0.999 and 0.998 are slightly improved over other models. Similar trends have also been observed in the XUL data set with the potential bugs are estimated at 922 which is closer to the predicted potential bugs lying in the software. The Model 9 and Model 12 show the improved performance over other models. The goodness of fit curves is shown in Figs. 9 and 10 for Model 9 and Model 12 respectively. The curve with respect to SVG dataset shows the best fit using both these models.

The parameters related to complexity of code changes with respect to bugs detected/removal for Model 13 to Model 18 has been estimated using nonlinear regression and the result is shown in Table 7. The value of R2 is more than 0.98 for all the models and for all the datasets except Model 13 for SVG and Base dataset. Model 13 does not give the accurate value of the parameters as this model is exponential and does not fit on SVG and Base datasets. The values of potential bugs detected/removed due to diffusion of complexity of code changes are in the range of 5.88E−05 to 0.06. The value of potential bugs using the diffused complexity of code changes using Model 13 i.e., Zare 126,900, 13,770 and 5,771 for SVG, Base and XUL dataset respectively. The value of R2 is also very low for these datasets using Model 13 i.e., 0.974, 0.968, and 0.985 for SVG, Base and XUL dataset respectively. For SVG data set, the predicted potential bugs for Model 15 and 18 are closer to the detected bugs i.e., 488. The value of R2 as 0.999 also confirms the fitting of these models for this data set using these proposed models. For Base data set, the value of predicted potential bugs estimated using the model 16 is 1,800 which is closer to detect bugs. In this model, the R2 is least i.e., 0.889 for Model 16 as compared to other models. Model 15 and Model 18 predict the fair amount of potential bugs lying in the software components as the larger is the value of R2 for these models. This confirms with the bug arrival/detected patterns of the software components because of the sharp increase after 105 months. For XUL data set, the closer value with the predicted potential bugs is found for the model 14, 15, 17 and 18 but the R2 is poor as compared to Model 16. This is due to sudden increase in the arrival/detection of bugs in the software.

Table 7 Parameter estimates for different models of various entropy vs bugs (complexity of code changes)

The goodness of fit curve has been shown in Figs. 11 and 12 for the models which give the maximum fit on the basis of R2 for given datasets. The goodness of fit curve of the Model 18 for Base dataset is shown best fitting. In the rest of the cases, the performance diminishes.

Fig. 11
figure 11

Goodness of fit for model defined in Model 15

Fig. 12
figure 12

Goodness of fit for model defined in Model 18

To summarize the performance of the proposed models based on goodness of fit criteria i.e., R2, we have found that there are 41 cases out of 54 cases (6 models, 3 data sets and 3 approaches) which shows the value of R2 is more than 99 %. It is also observed that 14 of software reliability growth models based (time vs bugs) approach, 16 of potential complexity of code changes based (time vs entropy) approach and 11 cases of complexity of code changes based (entropy vs bugs) approach out of 18 cases for each category show the accuracy more than 99 %.

We further investigated the model which gives the best performance in three approaches on the basis of Bias, variation and RMSPE (root mean squared prediction error).

The statistical performances of the best models obtained in the above proposed models are shown in the Table 8. In this table, the value of R2 is more than 98 % which shows all models are applicable in determining the potential bugs in the software project based on complexity of code change metric. If we compare these models based on other performance measures namely, Bias, variation and RMSPE, it is found that the time vs entropy based models are having lower values of these performance measures. Further, we have confirmed it with the non-parametric Kolmogorov–Smirnov (K–S) test.

Table 8 Statistical performance parameters of best models of various datasets

Further, we have applied the Kolmogorov–Smirnov test (K–S test) to statistically validate the proposed models by considering the actual and predicted bugs/complexity of code changes of the system. Kolmogorov–Smirnov (K–S) statistic quantifies a distance between the observed value and the predicted value of the cumulative distribution function for the given distribution. The K–S test has been applied to the models selected based on the goodness of fit criteria. The results of the K–S test are shown in Table 9. We have applied K–S test on the models which give the best performance on the basis of R2, Bias, Variation, and RMSPE (the results are shown in Tables 4, 5, 6, 7, 8).

Table 9 Statistical performance using Kolmogorov–Smirnov (K–S) test of different models

From Table 9, we have selected Model 3 and Model 6 for time vs bugs, Model 9 and Model 12 for time vs entropy and Model 15 and Model 18 for entropy vs bugs approaches. The D-value shows the distance between the actual and predicted values. The high value of performance parameter of associated P value shows the applicability of that model. It is clear from the above table that for time vs bugs approach; Model 3 gives associated P value 0.7677 and 0.8992 for the Base and XUL data sets. In time vs entropy approach, Model 9 gives associated P value 0.9583 and 0.9733 for XUL and Base data sets. For this approach also, Model 12 gives the associated P value 0.9583 and 0.9733 for XUL and Base data sets. The associate P value in entropy vs bugs approach is in the range of 0.00–0.1070 which is not significant. If we compared across different types of cases for the SVG dataset, we found that the maximum associated P value of time vs entropy case is 0.2713. This value is non significant. This result in rejecting the Null Hypothesis i.e., the alternate hypothesis is accepted. This analytical results show that the time vs entropy based approaches show significantly improved performance over the other two approaches.

From above experiment and analysis, it is clear that all three approaches used for bug prediction are not performing equally on the basis of different comparison criteria and K–S test. It is also found that the second approach, i.e. the potential complexity of code changes/entropy based models (time vs entropy) performance is better than the other two approaches. Based on the results, we found that all three approaches have significant difference; hence we reject the null hypothesis.

7 Managerial applications

The potential complexity of code changes/entropy or potential bugs will help the project manager in calculating the remaining bugs that still exist in the software. It also helps in determining the number of changes required in the software to fix these bugs. Further the changes in the software are also being made due to more usage, more bug fixes, more feature enhancements and addition of new features. These changes will make the software source code complex. The complexity of the code changes will further lead to determine the potential complexity of code changes which will be helpful as a deciding factor for new version/release of the software. These changes further help in determining the software evolution.

This study is quite helpful for the manager who is managing the code. One can continuously draws trends of bug occurrence as well as trends of complexity of code changes. The increase in the complexity may also lead to require more resources as the number of files need to be changed and it is very difficult to remember all of these changes. The predicted bugs/complexity of code changes will be helpful in determining the code maturity. As we are aware that anyone can participate in the development process of open source projects, but it is the duty of the manager or consortia owner to judge the quality of code produced by a particular developer. For example, whenever any contributor of the code wants to fix the bugs, one has to see all changes previously made by all contributors who have contributed towards these changes in source codes. The resource allocation can be optimized by the manager to efficiently produce the quality product by allocating the prospect contributor who is producing less complex codes. This study will help in deciding the release schedule of the software.

8 Threats to validity

In our study, we have considered only those bugs which are affecting the changes in the subsystems under study. We have collected data from the open source repositories available online. A number of changes to the files are affecting the entropy or complexity of code changes. These changes might be lines modified, deleted or added but in our study, we have considered the changes made in the source code as reported in the CVS logs repositories instead of lines deleted/added/modified. The assumptions made during the models building process are violated in the real time scenario. The repair of bugs may introduce further new bugs.

9 Conclusions

Changes in files make the software complex and affect the quality of product but essentially requires for bug repairs, feature enhancement/modification for improvement and new feature introduction in the software. In this paper, we have discussed three approaches namely software reliability growth models, potential complexity of code changes based models and complexity of code changes based models to determine the potential bugs lying in the software. The experiment shows that all the three approaches are sufficient to predict the potential bugs lying dormant in the software. We observed that the potential complexity of code changes based approach is performing well on the basis of goodness of fit criteria. In our experiment, we found that 14, 16 and 11 cases out of total 18 cases each of different approaches respectively show the goodness of fit more than 99 %. We have validated the models using non-parametric K–S test to test the statistical significance of the models proposed in the paper. The K–S test is also confirming the results obtained in the goodness of fit criteria. The K–S test also shows that best cases are lying with time vs entropy or potential complexity of code changes based models with the associated P-Value more than 95 % for two data sets out of three data sets. This study can be further extended to capture all changes for the entire project which will be helpful in predicting the actual number of future bugs for the entire system.