1 Introduction

Software bug prediction is an active and continuously evolving research field. Bug prediction is defined as the act of identifying the files or software code modules that are most likely to contain bugs before formal testing, so that testing time and resources can be allocated optimally. Only the files that are more likely to contain bugs need to be tested more thoroughly. Accurate and reliable bug prediction can help the software industry, as companies seek ways to deliver high-quality software systems at a lower cost for software quality assurance activities such as testing (Catal and Banerjee 2010). Bug prediction models can also be used to identify refactoring candidates, architectural improvements and the best design approaches (Catal and Banerjee 2010). Bug prediction is also useful for software project managers, as it helps in quantitative planning and steering of projects (Ekanayake et al. 2012).

Bug prediction literature contains many seminal studies where bug prediction models are based on static code metrics (Aggarwal et al. 2009; Basili et al. 1996; Gyimothy et al. 2005). Basili et al. (1996) concluded that a relationship exists between bugs and the Chidamber and Kemerer metrics. Their conclusions were based on studying eight medium-sized systems. Aggarwal et al. (2009) concluded that import coupling and size metrics were related to bugs. Their conclusions were based on studying 12 software systems. Gyimothy et al. (2005) also found object-oriented metrics to be influential in bug prediction. Their results were based on a study of the open source web software Mozilla. Despite the existence of many seminal studies on empirical validation of code metrics (Aggarwal et al. 2009; Basili et al. 1996; Gyimothy et al. 2005) in bug prediction, code metrics based bug prediction has faced criticism from some researchers (Fenton and Ohlsson 2000). Fenton et al. (2007) constructed a bug prediction model based on project and process metrics. Moser et al. (2008) also showed that process metrics outperform code metrics in bug prediction. Graves et al. (2000), Khoshgoftaar et al. (1999) and Nagappan and Ball (2007) have shown that the number of previous modifications to a file is a good predictor of bugs. Radjenovic et al. (2013) conducted a systematic literature review of bug prediction metrics and found that object-oriented metrics were used twice as often as traditional source code metrics or process metrics. They reported that object-oriented and process metrics were more successful in finding faults than traditional size and complexity metrics. Recently, Hassan (2009) introduced the concept of complexity of code change by applying information theory principles. He conceptualized the code change process of a software system as a system that emits data, where he defined data as the feature introducing changes (FIC) to source code files.
This allows concepts from Shannon’s information entropy theory to be applied to quantify the complexity of code change. As per Hassan (2009), in a software system consisting of n source code files, if changes are monitored and it is found that the probability of modification of a single file (for example, file 1) is one and that of all other files is zero, then the complexity of code change, or software entropy, is minimum. On the other hand, if the probability of modification of all files (file 1, …, file n) is equal (= 1/n), then software entropy is maximum. Hassan (2009) further defined history complexity metrics (HCM) based on software entropy concepts and showed that bugs could be more accurately predicted using HCM as predictors. He used statistical linear regression to build bug prediction models with HCM as predictors (Hassan 2009). Another conspicuous research direction in software bug prediction emphasizes that the selection of learning methods is very important for accurately predicting software bugs (Menzies et al. 2007). Menzies et al. (2007) established a baseline experiment by utilizing rich developments in machine learning and data mining to demonstrate that the selection of a machine learning method greatly affects the accuracy of a defect prediction model. Later, Lessmann et al. (2008) extended the experiment of Menzies et al. (2007) and evaluated 22 machine learners on defect data sets of ten large-scale NASA projects. However, the experiments by Menzies et al. (2007) and Lessmann et al. (2008) are based on static code metrics only. Very recently, Malhotra (2015) performed a systematic literature review of machine learning techniques for software fault/bug prediction and concluded that machine learning techniques have the ability to predict software bugs. The review further concluded that more studies should be carried out in order to obtain well-formed and generalizable results.
Hassan (2009) did not evaluate any learning techniques in the context of entropy based bug prediction. Although a large number of machine learning techniques have been evaluated in static code metrics based bug prediction (Menzies et al. 2007; Lessmann et al. 2008; Malhotra 2014; Kaur and Kaur 2014), no comparative study is available in entropy based bug prediction. This motivates us to evaluate machine learning techniques in entropy based bug prediction. It is also important to note that the dependent or response variable in static code metrics based bug prediction (Menzies et al. 2007; Lessmann et al. 2008; Malhotra 2014; Kaur and Kaur 2014) considered in most previous studies is binary. In this paper, the dependent or response variable is continuous. To the best of our knowledge, there is only one study that considers the application of a single machine learning technique, namely support vector regression, in software entropy based bug prediction (Singh and Chaturvedi 2012). This motivates us to investigate various machine learning techniques in software entropy based bug prediction. Collection of data for empirical studies of bug prediction is another challenge. Therefore, a tool for automatic data extraction is developed. Thus, the contribution of this paper is twofold:

  (1) A tool for automatic data collection and classification of software changes is developed.

  (2) Concept of complexity of code change as proposed by Hassan (2009) is used and performance of the following machine learning techniques for predicting bugs is compared:

    • Gene Expression Programming (GEP)

    • General Regression Neural Network (GRNN)

    • Locally Weighted Regression (LWR)

    • Support Vector Regression (SVR)

    • Least Median Square Regression (LMSR)

The performance of these machine learning techniques is compared for two subsystems of Mozilla and two subsystems of Apache Http Server. The results are analyzed to arrive at general conclusions regarding the applicability of the techniques.

The rest of this paper is organized as follows: Sect. 2 presents an overview of related work in bug prediction. Section 3 describes the concept of Entropy of changes. The data extraction algorithm and metrics calculation is explained in Sect. 4. Section 5 describes the regression techniques that have been compared. Results are analyzed in Sect. 6 while Sect. 7 discusses the threats to validity. The study is finally concluded in Sect. 8.

2 Related work

Many bug prediction approaches have been developed by researchers. Mende and Koschke (2009) verified that a trivial defect prediction model, such as one assuming that large files are more prone to bugs, performs well when a classic evaluation metric is used, but fails badly when an effort-aware performance metric is used. D’Ambros et al. (2012) compared the performance of various such techniques. They developed a benchmark for bug prediction that includes process metrics, system metrics and the defect history of five open source software projects: Eclipse JDT Core, Eclipse PDE UI, Equinox framework, Mylyn and Apache Lucene. The approaches are evaluated using a binary classification scenario, a ranking-based evaluation and an effort-aware ranking-based evaluation. In addition, two effort-aware models were evaluated and compared with a classical prediction model. Khoshgoftaar et al. (1996) used the number of past modifications to a module to predict bug-prone entities in the software system. They concluded that the number of past modifications reliably predicts future bugs. Nagappan and Ball (2005) also conducted a study on the influence of code churn, or the number of changes to the system, on the defect density. Their study was validated for Windows Server 2003, and it was found that relative code churn predicted better than absolute churn. Zhou and Leung (2006) built bug prediction models that could classify bugs according to two levels of severity: high and low. Later, Singh et al. (2010) used machine learning techniques to classify bugs into three severity levels: low, medium and high.

A lot of work has been done to determine the best techniques for prediction. Khoshgoftaar et al. (1997) applied neural networks for bug prediction using procedural static code metrics as predictor variables. Thwin and Quah (2005) applied generalized regression neural networks and used object-oriented static code metrics as predictor variables to predict bugs. Kanmani et al. (2007) used object-oriented static code metrics as predictor variables and applied two different kinds of neural network techniques for bug prediction. Menzies et al. (2007) conducted a study that suggested that the category of static source code metrics employed is not as important as the learning algorithm that is used for prediction. They used the datasets from the NASA Metrics Data Program (MDP) to conclude this. Their study compared the impact of using various categories of software metrics, such as Halstead metrics, lines of code and McCabe complexity metrics, with the impact of using various learning algorithms, such as J48, Naive Bayes and OneR. Tosun et al. (2011) performed a study on bug prediction in embedded software projects using classifier ensembles and found that 70 % of defects could be detected. Their study is based on static code metrics. Rodrigues et al. (2013) utilized static code metrics datasets from the PROMISE (Menzies et al. 2016) data repository and the bug prediction datasets developed by D’Ambros et al. (2012), and suggested an evolutionary subgroup based descriptive approach for defect prediction rather than precise classification techniques. Malhotra (2014) performed a comparative analysis of statistical and machine learning techniques for bug prediction using static code metrics and found that the decision tree method was better than logistic regression and other machine learning techniques. Okutan and Yildiz (2014) applied Bayesian networks in code and process metrics based bug prediction and found that there was a positive correlation between the number of developers and the level of defects.
Dejaeger et al. (2013) applied fifteen different Bayesian network classifiers in static code metrics based bug prediction and found that they had better comprehensibility than other machine learning techniques. We have presented a brief review of machine learning techniques in bug prediction. There are three noteworthy systematic literature reviews on bug prediction (Catal 2011; Radjenovic et al. 2013; Malhotra 2015). Two significant observations from these reviews are:

  • Most bug prediction studies use static code metrics as predictors or independent variables

  • The dependent or response variable is binary in most bug prediction studies.

There are only a few studies (Afzal and Torkar 2008) in which the dependent variable is continuous, and they use only one technique, namely genetic programming.

Hassan (2009) introduced a novel concept of bug prediction by quantifying the complexity of code changes and using it to develop bug prediction models. He applied Shannon’s information entropy principles to the complexity of changes. He devised three code change models, namely the Basic Code Change (BCC) Model, the Extended Code Change (ECC) Model and the File Code Change (FCC) Model. Entropy of changes is calculated using Shannon’s entropy (Hassan 2009). Hassan (2009) proposed a new entropy-based complexity metric, termed the history complexity metric (HCM), and used it as the independent or predictor variable for prediction of bugs. He concluded that history complexity metrics (HCM) predicted bugs more accurately than code churn metrics and prior faults. Hassan (2009) developed bug prediction models using Statistical Linear Regression (SLR) techniques but did not evaluate any machine learning techniques in the context of entropy based bug prediction. Singh and Chaturvedi (2012) also employed the same concept of history complexity metrics to arrive at the conclusion that Support Vector Regression performs better than Statistical Linear Regression (SLR). They considered only one software system and one machine learning technique, whereas in our current study on software entropy based bug prediction, subsystems from two different software systems are considered and five machine learning techniques are compared. We consider two subsystems of Mozilla and two subsystems of Apache Http Server and compare the performance of five machine learning techniques for predicting bugs: Gene Expression Programming (GEP), General Regression Neural Network (GRNN), Locally Weighted Regression (LWR), Support Vector Regression (SVR) and Least Median Square Regression (LMSR).

3 Entropy of changes

Entropy of changes as proposed by Hassan (2009) is used to quantify the complexity of code changes. A software file is altered:

  • when a bug is to be removed,

  • when a new functionality is introduced,

  • when some comments or coding standards are changed.

The changes that take place when a new functionality is introduced are the most complex type of changes. The complexity of such changes is what is quantified in terms of entropy of changes. This entropy of changes is then used to derive History Complexity Metrics (HCM) for predicting bugs.

3.1 Measurement of entropy

Entropy of changes for a period in a system/subsystem is calculated by the Shannon’s Entropy formula specified in (1). The period is taken as 1 year for this study.

$$SE_{n} (P) = - \sum\limits_{k = 1}^{n} {(P_{k} \times \log_{2} P_{k} )}$$
(1)

where \(P_{k} \ge 0\) and \(\sum\nolimits_{k = 1}^{n} P_{k} = 1\).

\(P_{k}\) is taken to be the probability of change for the kth file in the specified period, i.e. the number of times the kth file is modified divided by the total number of modifications. For example, let us assume, as shown in Fig. 1, that 14 changes occurred in four files and are divided into three periods. In the first period, six changes occurred across all four files. The probabilities of change occurrence for files F1, F2, F3 and F4 are 2/6 (=0.33), 1/6 (=0.17), 1/6 (=0.17) and 2/6 (=0.33) respectively. These probabilities are also shown in Fig. 1. The value of Entropy for the first period is calculated as

$${-}\left( {0.33 \times \log _{2} 0.33 + 0.17 \times \log _{2} 0.17 + 0.17 \times \log _{2} 0.17 + 0.33 \times \log _{2} 0.33} \right) = 1.924819.$$
Fig. 1 Probability of change for a file in a specified time period

The Entropy is normalized using (2), so that the entropy of subsystems that contain different numbers of files, or of totally different software systems, can be compared easily.

$${\text{SE}}(P) = \frac{1}{{{\text{Maximum}}\;{\text{Entropy}}}} \times {\text{SE}}_{n} (P) = - \frac{1}{{\log_{2} n}} \times \sum\limits_{k = 1}^{n} {(P_{k} \times \log_{2} P_{k} )}$$
(2)

such that 0 ≤ SE ≤ 1, where SE is the value of Normalized Entropy. For the example given in Fig. 1, the value of Normalized Entropy (SE) for the first period is calculated as \(1.924819/\log_{2} 4 = 0.962409\).
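As a minimal sketch of Eqs. (1) and (2), the following Python functions compute the entropy and normalized entropy from per-file change counts. The function names are illustrative and not part of the study's tooling. Note that with unrounded probabilities (2/6, 1/6, 1/6, 2/6) the entropy of the first period is about 1.9183, slightly below the 1.924819 obtained above from the two-decimal values 0.33 and 0.17.

```python
import math

def shannon_entropy(change_counts):
    """Entropy of changes, Eq. (1): -sum(P_k * log2(P_k))."""
    total = sum(change_counts)
    # Files with no changes contribute nothing (P_k * log2(P_k) -> 0 as P_k -> 0).
    probs = [c / total for c in change_counts if c > 0]
    return -sum(p * math.log2(p) for p in probs)

def normalized_entropy(change_counts):
    """Normalized entropy, Eq. (2): entropy divided by the maximum entropy log2(n)."""
    n = len(change_counts)  # n counts every file in the subsystem, changed or not
    if n < 2:
        return 0.0  # log2(1) = 0, so normalization is undefined for a single file
    return shannon_entropy(change_counts) / math.log2(n)

# First period of the Fig. 1 example: number of changes to F1, F2, F3, F4.
first_period = [2, 1, 1, 2]
print(shannon_entropy(first_period), normalized_entropy(first_period))
```

The two extreme cases described in the Introduction follow directly: equal counts such as `[1, 1, 1, 1]` give a normalized entropy of 1, and a period in which only one file changes gives 0.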

3.2 History Complexity Metric

Entropy of Changes is then used to compute the History Complexity Metric (HCM). It is a measure of the effect of the complexity of changes that is assigned to each file in the software subsystem/system. First, however, the History Complexity Period Factor \(HCPF_{i}(j)\) for a file j during period i is calculated using (3).

$$HCPF_{i} (j) = \left\{ \begin{array}{ll} C_{ij} \times SE_{i} ,&\quad j \in F_{i} \hfill \\ 0,&\quad otherwise \hfill \end{array} \right\}$$
(3)

where \(SE_{i}\) is the Entropy of changes for the system/subsystem over period i, \(F_{i}\) is the set of files modified in period i, and \(C_{ij}\) is the portion of \(SE_{i}\) that is assigned to every file j modified in period i. The definition of \(C_{ij}\) is varied to arrive at the three variants of HCPF that are used in computing HCM. Figure 2 describes the three variants of HCM.

Fig. 2 Variants of HCM

In the example given in Fig. 1, the HCPF calculated for file 1 in the first period is different for the three variants of HCM. For HCM1 the HCPF is 1 × 0.962409 = 0.962409, for HCM2 the HCPF is 0.33 × 0.962409 = 0.317595, whereas for HCM3 the HCPF is 1/4 × 0.962409 = 0.240602. The HCM for a file j for the evolution period set {f,…,g} is defined as

$$HCM_{{\{ f, \ldots ,g\} }} (j) = \sum\nolimits_{{i \in \{ f, \ldots ,g\} }} {HCPF_{i} (j)}$$
(4)

Similarly, HCM for a system/subsystem S for the evolution period set {f,…,g} is the sum of HCMs for all files in that subsystem as specified in (5).

$$HCM_{{\{ f, \ldots ,g\} }} (S) = \sum\nolimits_{j \in S} {HCM_{{\{ f, \ldots ,g\} }} (j)}$$
(5)
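Equations (3)–(5) can be sketched in Python as follows. This is an illustration under stated assumptions, not the implementation used in the study: the three weightings for \(C_{ij}\) are read off the worked example above (1 for HCM1, the file's change probability for HCM2, and one over the number of files modified in the period for HCM3), with Fig. 2 remaining the authoritative definition. The function names and the second period's change counts are hypothetical.

```python
import math

def normalized_entropy(counts):
    """Normalized entropy of one period, Eq. (2); counts maps file name -> changes."""
    total, n = sum(counts.values()), len(counts)
    if n < 2 or total == 0:
        return 0.0
    probs = [c / total for c in counts.values() if c > 0]
    return -sum(p * math.log2(p) for p in probs) / math.log2(n)

def hcpf(counts, variant):
    """HCPF_i(j) of Eq. (3) for every file j in one period i."""
    se = normalized_entropy(counts)
    total = sum(counts.values())
    modified = sum(1 for c in counts.values() if c > 0)  # |F_i|
    factors = {}
    for name, c in counts.items():
        if c == 0:
            factors[name] = 0.0                 # file not in F_i
        elif variant == 1:
            factors[name] = 1.0 * se            # assumed C_ij = 1
        elif variant == 2:
            factors[name] = (c / total) * se    # assumed C_ij = P_j
        else:
            factors[name] = se / modified       # assumed C_ij = 1 / |F_i|
    return factors

def hcm_per_file(periods, variant):
    """HCM of each file over a set of periods, Eq. (4)."""
    totals = {}
    for counts in periods:
        for name, value in hcpf(counts, variant).items():
            totals[name] = totals.get(name, 0.0) + value
    return totals

def hcm_subsystem(periods, variant):
    """HCM of the whole subsystem, Eq. (5): sum over its files."""
    return sum(hcm_per_file(periods, variant).values())

periods = [
    {"F1": 2, "F2": 1, "F3": 1, "F4": 2},  # first period of the Fig. 1 example
    {"F1": 0, "F2": 3, "F3": 1, "F4": 0},  # hypothetical second period
]
for variant in (1, 2, 3):
    print(variant, hcm_per_file(periods, variant), hcm_subsystem(periods, variant))
```

Because the sketch uses unrounded probabilities, its first-period values differ slightly from the rounded figures in the worked example.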

4 Empirical data

Data Extraction is the most crucial step in any empirical study. It is very important to correctly collect the data from software repositories for predicting bugs. The prediction is accurate only if the data collected is reliable and does not contain errors. The following subsections describe the software subsystems used for this study, the development of the tool used for data extraction and how the metrics are calculated.

4.1 Data sources

In this study, we extract bugs data for two subsystems each of Mozilla and Apache Http Server for evaluating the performance of machine learning techniques in software entropy based bug prediction. Mozilla is a popular web browser whereas Apache Http Server is a popular web server. The data for the subsystems of Mozilla is extracted from Mozilla-central which is a Mercurial repository of Mozilla project. On the other hand GitHub, a web-based hosting service for Git repositories is used to extract the data for Apache Http Server. After extracting the year-wise bugs and changes for each file in each subsystem, the entropy and History Complexity Metric (HCM) are calculated for a time period of each year using Eqs. (1)–(5). The number of changes and Normalized Entropy for each year for the four subsystems is given in Table 2.

Since manual data collection is a tedious and error-prone process, we develop a tool for automatic data collection, as explained in Sect. 4.2.

4.2 Automating data extraction

The first step in data extraction is to browse the revision history of each file and record the year-wise bugs and changes. A change record typically defines the reason for the change along with date of change, name of the committer, lines of code added/deleted/modified etc. These changes are categorized as follows:

  • Bug Repairing Changes (BRC): These are the changes that take place when a bug is to be corrected.

  • Functionality Introducing Changes (FIC): These changes take place whenever new functionality or a feature needs to be added in the software system.

  • General Changes (GC): These changes are maintenance related and do not add any feature or rectify a bug. This class of changes includes changes in comments, copyrights, formatting changes etc.

Feature introducing changes (FIC) are those that introduce new features or enhance existing features and are related to adaptive maintenance of a software system. FIC cause modifications to software code which may be highly scattered across a large number of source code files and modules and cause code decay. This code decay is responsible for an increase in overall complexity, and thus FIC are used for calculating entropy of changes (Hassan 2009). General Changes (GC) are not related to the introduction of new features (Hassan 2009). Examples of general changes include the addition of authors in comments, the update of a copyright notice in comments and the re-indentation of source code to make it more readable. Thus general changes are localized and do not lead to code decay. Bug Repairing Changes (BRC) are logically related to the number of faults or bugs in a file or module of source code. Thus, BRCs are used for validation of results. This methodology is consistent with the methodology established by Hassan (2009).

Change classification is automated by developing a Java application named Change Classifier, which takes the URL of the repository and the repository name as input and classifies the change records listed on that page. The output generated is the year-wise frequency of each type of change. Web scraping and regular expressions are used to extract the date and reason of each change. A regular expression is a text string used for pattern matching. Two regular expressions are formulated, one for the Mozilla-Central repository and the other for the GitHub repository. Different regular expressions are needed because the websites of these repositories are structured and formatted differently. The regular expression used for Mozilla-Central in the Change Classifier is:

  • expr = “<td class =\“age\”> .*? <i>([^ <]+) </i> .*? <td> (.*?) </strong > </td>”

The regular expression used for GitHub in the Change Classifier is:

  • expr = “<a href = .*?class =\“message\”.*?title = (.*?)> .*? <time datetime = [^ >]*> (.*?) </time>”

The content matching the highlighted expression is extracted. After extracting the date and reason of change, the changes are classified as specified in the algorithm given in Fig. 3. The output of Change Classifier for Mozilla-Central and GitHub is as shown in Figs. 4 and 5.
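The extraction and classification steps can be sketched as follows, written in Python rather than the Java of Change Classifier. Everything specific in this sketch is an assumption made for illustration: the sample markup, the simplified pattern and, in particular, the keyword lists, which are a stand-in heuristic and not the classification rules of Fig. 3.

```python
import re
from collections import Counter

# Hypothetical, simplified change-log markup; real repository pages differ.
SAMPLE_HTML = """
<tr><td class="age"><i>2012-03-14</i></td><td><strong>Bug 1234 - fix crash on startup</strong></td></tr>
<tr><td class="age"><i>2012-07-02</i></td><td><strong>Add support for new tab groups</strong></td></tr>
<tr><td class="age"><i>2013-01-20</i></td><td><strong>Update copyright notice</strong></td></tr>
"""

# Simplified pattern: group 1 captures the date, group 2 the reason of change.
ROW = re.compile(
    r'<td class="age">.*?<i>([^<]+)</i>.*?<td>\s*<strong>(.*?)</strong>\s*</td>',
    re.DOTALL,
)

# Assumed keyword lists, for illustration only.
GC_WORDS = ("copyright", "comment", "indent", "whitespace", "license")
BRC_WORDS = ("bug", "fix", "crash", "defect")

def classify(reason):
    """Label a change as GC, BRC or FIC from its commit message."""
    text = reason.lower()
    if any(word in text for word in GC_WORDS):
        return "GC"
    if any(word in text for word in BRC_WORDS):
        return "BRC"
    return "FIC"  # anything else is treated as a feature introducing change

def yearly_counts(html):
    """Year-wise frequency of each change type found in the page."""
    counts = Counter()
    for date, reason in ROW.findall(html):
        counts[(date[:4], classify(reason))] += 1
    return counts

print(yearly_counts(SAMPLE_HTML))
```

The FIC counts per file and year would then feed the entropy calculation, while the BRC counts provide the number of bugs used for validation.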

Fig. 3 Algorithm for data extraction

Fig. 4 Output of change classifier for Mozilla-Central

Fig. 5 Output of change classifier for GitHub

4.3 Metrics calculation

After extracting the year-wise bugs and changes for each file in the subsystem, the entropy and History Complexity Metrics are calculated for each year using Eqs. (1)–(5). The number of changes and Normalized Entropy for each year for the four subsystems is given in Table 2.

4.4 Independent and dependent variables

In this study, bug prediction models are built using History Complexity Metrics (HCM1, HCM2 and HCM3) as independent variables and the number of bugs as the dependent variable. The dependent variable is continuous in this study. The calculation of the History Complexity Metrics (HCM1, HCM2 and HCM3) is explained in Sect. 3. The bug prediction models are built using five machine learning based regression techniques. The accuracy of the five machine learning techniques is evaluated on the data sets of the four software subsystems described in Table 1. For each of the four software subsystems mentioned in Table 1, bug prediction models are built using the three HCM metrics as predictors.

Table 1 Software subsystems for evaluation
Table 2 Normalized entropy and number of changes

5 Machine learning based regression techniques

Regression analysis is a statistical process used for estimating the relationships among variables. There are many different techniques that model and analyze variables to derive a relationship between a target variable and the one or more independent predictors on which it depends. The performance of some of these regression techniques for predicting bugs using entropy-based metrics has been compared in this study. The following subsections describe the regression techniques that are compared.

5.1 Gene expression programming (GEP)

Gene Expression Programming (GEP) (Ferreira 2001) is a procedure based on biological evolution. It creates a computer program for modeling some phenomenon. The different types of models that can be created by GEP include neural networks, decision trees and polynomial constructs. A simplified representation of GEP algorithm is given in Fig. 6.

Fig. 6 Simplified algorithm of GEP

GEP is similar to Genetic Algorithms (GA) and Genetic Programming (GP) as it uses a population of individuals, computes their fitness to select them and introduces genetic variations by using genetic operators such as mutation, transposition, recombination etc. But the basic difference between the three is that:

  • in GA the individuals are linear strings having fixed length.

  • in GP the individuals are non-linear entities having different sizes and shapes.

  • in GEP the individuals are encoded using linear strings having fixed length which are later expressed using non-linear entities of different sizes and shapes.

DTREG (Predictive Modeling Software) tool is used to implement GEP. The type of GEP modeled by DTREG is Symbolic Regression. Symbolic Regression is a subset of non-parametric regression in which the form of the function to be fitted is not given in advance but the function is restricted to be mathematical or logical expressions. It is the goal of the procedure to find the function that best fits the data.

The goal of GEP for Symbolic Regression is to find the expression that performs well for all fitness cases within a certain minimum error of the correct target value. An evolutionary strategy is used for discovering a very good solution without halting the evolution process. So, the system finds the best possible solution within minimum error. The founder individuals are very unfit, but their modified descendants are reshaped by selection, and the population adapts by finding better solutions that progressively approach a perfect solution. The fitness \(f_{i}\) of a program i is calculated using (6) if absolute error is considered or using (7) if relative error is considered.

$$f_{i} = \mathop \sum \limits_{j = 1}^{{C_{t} }} \left( {M - \left| {C_{i,j} - T_{j} } \right|} \right)$$
(6)
$$f_{i} = \mathop \sum \limits_{j = 1}^{{C_{t} }} \left( {M - \left| {\frac{{C_{i,j} - T_{j} }}{{T_{j} }} \times 100} \right|} \right)$$
(7)

where M is the range of selection, \(C_{i,j}\) is the value returned by individual i for fitness case j out of a total of \(C_{t}\) fitness cases, and \(T_{j}\) is the target value for fitness case j.
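The two fitness measures of Eqs. (6) and (7) can be written directly in Python. This is only a sketch of the formulas, not of DTREG's internals; the function names and the sample values are illustrative.

```python
def fitness_absolute(predicted, targets, selection_range):
    """Eq. (6): sum over fitness cases of M - |C_ij - T_j|."""
    return sum(selection_range - abs(c - t) for c, t in zip(predicted, targets))

def fitness_relative(predicted, targets, selection_range):
    """Eq. (7): sum over fitness cases of M - |(C_ij - T_j) / T_j * 100|.

    Target values must be non-zero, since each error is divided by T_j.
    """
    return sum(
        selection_range - abs((c - t) / t * 100)
        for c, t in zip(predicted, targets)
    )

# Illustrative values: three fitness cases and a selection range M = 100.
targets = [10.0, 20.0, 40.0]
predicted = [12.0, 18.0, 40.0]
print(fitness_absolute(predicted, targets, 100))
print(fitness_relative(predicted, targets, 100))
```

For these values the absolute-error fitness is 296 and the relative-error fitness is 270. A program that reproduces every target exactly reaches the maximum fitness \(C_{t} \times M\), here 300.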

5.2 General regression neural network (GRNN)

The general regression neural network (GRNN) (Specht 1991) is a memory-based network that provides estimates for continuous target variable. It is a single-pass learning algorithm having a highly parallel structure. The architecture of GRNN is as shown in Fig. 7.

Fig. 7 GRNN architecture

GRNN consists of the following four layers:

  1. Input layer: The input layer consists of one neuron per predictor. Each input neuron standardizes the values of its input variable by subtracting the median and then dividing the result by the interquartile range. These values are then fed into the neurons of the hidden layer.

  2. Hidden layer: There is one neuron in the hidden layer for each case of the training data set. Each neuron stores the values of the predictors for the case along with the value of the target variable. When given a vector of input values from the input layer, a hidden neuron calculates the Euclidean distance of the test case from the center point of the neuron and then applies the kernel function to compute its weightage. This value is passed to the summation layer.

  3. Summation layer: The summation layer consists of only two neurons, namely the denominator summation unit and the numerator summation unit. The denominator summation unit sums the weightage values calculated by the hidden neurons, while the numerator summation unit sums the products of the weightage value and the actual value of the target variable for each hidden neuron.

  4. Decision layer: The decision layer calculates the predicted value of the target variable by dividing the value calculated in the numerator summation unit by the value calculated in the denominator summation unit.

GRNN is implemented using DTREG (Predictive Modeling Software) tool. The tool provides a choice of two kernel functions: Gaussian and Reciprocal. The performance of GRNN for predicting bugs using Entropy of changes is recorded for both kernel functions.
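The four layers described above can be sketched in a few lines of Python. This is a generic GRNN with a Gaussian kernel, not DTREG's implementation; the function names, the smoothing parameter `sigma` and the sample HCM values are assumptions made for illustration.

```python
import math
from statistics import median, quantiles

def robust_scaler(column):
    """Input layer: subtract the median, divide by the interquartile range."""
    med = median(column)
    q1, _, q3 = quantiles(column, n=4)
    iqr = (q3 - q1) or 1.0  # avoid division by zero for a constant predictor
    return lambda value: (value - med) / iqr

def grnn_predict(train_x, train_y, query, sigma=1.0):
    """Predict a continuous target for `query` from the stored training cases."""
    scalers = [robust_scaler(col) for col in zip(*train_x)]

    def scale(row):
        return [s(v) for s, v in zip(scalers, row)]

    q = scale(query)
    numerator = denominator = 0.0
    for row, y in zip(train_x, train_y):  # hidden layer: one neuron per case
        dist2 = sum((a - b) ** 2 for a, b in zip(scale(row), q))
        weight = math.exp(-dist2 / (2 * sigma ** 2))  # Gaussian kernel
        numerator += weight * y    # numerator summation unit
        denominator += weight      # denominator summation unit
    return numerator / denominator  # decision layer

# Hypothetical training cases: (HCM1, HCM2, HCM3) per file and its bug count.
train_x = [[0.9, 0.3, 0.2], [1.8, 0.7, 0.5], [2.6, 1.1, 0.8], [3.9, 1.6, 1.1]]
train_y = [1.0, 3.0, 4.0, 7.0]
print(grnn_predict(train_x, train_y, [2.0, 0.8, 0.6], sigma=0.5))
```

Because the prediction is a weighted average of the stored targets, it always lies between the smallest and largest training value. A very small `sigma` makes the network reproduce the nearest training case, while a large one approaches the overall mean.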

5.3 Locally weighted regression (LWR)

Locally Weighted Learning (LWL) (Atkeson et al. 1997) employs a lazy learning method, since processing is deferred until a query needs to be answered. This is a local method in the sense that it attempts to fit the training data only in the neighbourhood of the query point. The weights for the training instances are calculated using a distance function, so that nearby points have greater weight. The weighting function is also called the Kernel function (K).

In general, there are two methods of weighting:

  • Weighting the error criterion: In this method the error criterion is assigned weights. The aim is to minimize the error criterion given in (8).

    $$C\left( q \right) = \mathop \sum \limits_{i} (\hat{y}_{i} - y_{i} )^{2} K\left( {d\left( {x_{i} ,q} \right)} \right)$$
    (8)
  • Direct data weighting: In this method the weights are directly assigned to the training data using the Kernel function as specified in (9).

    $$\hat{y}\left( q \right) = \frac{{\sum y_{i} K\left( {d\left( {x_{i} ,q} \right)} \right)}}{{\sum K(d\left( {x_{i} ,q} \right))}}$$
    (9)

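where \(x_{i}\) is the ith input vector, \(y_{i}\) is the ith training target value, \(\hat{y}_{i}\) is the corresponding predicted value, q is the query point and \(d(x_{i}, q)\) is the distance function.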
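Both weighting schemes can be sketched for a single predictor as follows. This is an illustration of Eqs. (8) and (9), not Weka's LWL implementation; the Gaussian kernel form, the `bandwidth` parameter and the function names are assumptions.

```python
import math

def gaussian_kernel(distance, bandwidth=1.0):
    """One possible kernel K: weight decays smoothly with distance."""
    return math.exp(-(distance / bandwidth) ** 2)

def direct_weighting(xs, ys, q, bandwidth=1.0):
    """Eq. (9): kernel-weighted average of the training targets."""
    weights = [gaussian_kernel(abs(x - q), bandwidth) for x in xs]
    return sum(w * y for w, y in zip(weights, ys)) / sum(weights)

def weighted_error_fit(xs, ys, q, bandwidth=1.0):
    """Eq. (8): fit a line around q by weighted least squares, evaluate it at q."""
    weights = [gaussian_kernel(abs(x - q), bandwidth) for x in xs]
    total = sum(weights)
    mean_x = sum(w * x for w, x in zip(weights, xs)) / total
    mean_y = sum(w * y for w, y in zip(weights, ys)) / total
    sxx = sum(w * (x - mean_x) ** 2 for w, x in zip(weights, xs))
    sxy = sum(w * (x - mean_x) * (y - mean_y) for w, x, y in zip(weights, xs, ys))
    slope = sxy / sxx if sxx > 0 else 0.0
    return mean_y + slope * (q - mean_x)

# Illustrative data: the model is fitted only when the query q arrives.
xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 5.0, 8.0, 11.0, 14.0, 17.0]
print(direct_weighting(xs, ys, 2.5), weighted_error_fit(xs, ys, 2.5))
```

No model is stored between queries, which is what makes the method lazy: each call recomputes the weights around its own query point.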

The tool used to implement LWR is Weka (Witten et al. 2011). Weka provides six kernel functions for LWR namely, Linear, Epanechnikov, Tricube, Inverse-distance, Gaussian and Constant weighting. The performance of LWR for predicting bugs using Entropy of changes is analyzed for all six kernel functions.

5.4 Support vector regression (SVR)

The goal of SVR is to estimate the function f(x) specified in (10) for the training dataset \(\{x_{i}, d_{i}\}\), where \(x_{i}\) is the ith input vector and \(d_{i}\) is the ith target value, such that it predicts the actual target value as closely as possible and is also as flat as possible in order to provide good generalization.

$$f\left( x \right) = w.\varphi \left( x \right) + b$$
(10)

where, b denotes the bias and w denotes the coefficient vector. Also, z = φ(x) specifies the feature space vector. All computations are done using the Kernel function defined in (11)

$$K\left( {x,\hat{x}} \right) = \varphi \left( x \right).\varphi \left( {\hat{x}} \right)$$
(11)

where the dot (·) represents the dot product in feature space. Given the ε-insensitive loss function, the primal optimization problem of SVR is to minimize Eq. (12).

$$\frac{1}{2}w^{2} + C\sum \left( {\xi_{i} + \hat{\xi }_{i} } \right)$$
(12)

such that: \(d_{i} - w.z_{i} - b \le \varepsilon + \xi_{i}\), \(w.z_{i} + b - d_{i} \le \varepsilon + \hat{\xi }_{i}\) and \(\xi_{i} ,\hat{\xi }_{i} \ge 0\).
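To make the roles of ε, the slack variables and C concrete, the sketch below evaluates the objective of Eq. (12) for a given w and b. It assumes the identity feature map φ(x) = x (a linear kernel) and sets each slack variable to the smallest value allowed by the constraints above. It only evaluates the objective and does not solve the optimization problem; the function name and sample values are illustrative.

```python
def svr_primal_objective(w, b, xs, ds, epsilon, C):
    """Value of Eq. (12) for a linear model f(x) = w . x + b."""
    def f(x):
        return sum(wi * xi for wi, xi in zip(w, x)) + b

    slack = 0.0
    for x, d in zip(xs, ds):
        residual = d - f(x)
        xi = max(0.0, residual - epsilon)        # target above the epsilon tube
        xi_hat = max(0.0, -residual - epsilon)   # target below the epsilon tube
        slack += xi + xi_hat
    flatness = 0.5 * sum(wi * wi for wi in w)    # (1/2) w . w
    return flatness + C * slack

# Illustrative data: the first two targets lie inside the tube, the third does not.
xs = [[1.0], [2.0], [3.0]]
ds = [1.0, 2.05, 4.0]
print(svr_primal_objective([1.0], 0.0, xs, ds, epsilon=0.1, C=10.0))
```

Errors smaller than ε cost nothing, which is what "ε-insensitive" means, while larger errors are penalized linearly with weight C against the flatness term.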

It is difficult to solve the primal optimization problem because z and w can be infinite-dimensional. Hence, a finite-dimensional optimization problem, known as the dual optimization problem, is defined as in (13) using Lagrange multipliers \((\alpha_{i}, \hat{\alpha }_{i})\).

$${\text{Maximize}}\;\mathop \sum \limits_{i} d_{i} \left( {\alpha_{i} - \hat{\alpha }_{i} } \right) - \varepsilon \mathop \sum \limits_{i} \left( {\alpha_{i} + \hat{\alpha }_{i} } \right) - \frac{1}{2}w\left( {\alpha ,\hat{\alpha }_{i} } \right)^{2}$$
(13)

where \(w\left( {\alpha ,\hat{\alpha }_{i} } \right) = \mathop \sum \limits_{i} \left( {\alpha_{i} - \hat{\alpha }_{i} } \right)z_{i}\), such that \(\mathop \sum \limits_{i} \left( {\alpha_{i} - \hat{\alpha }_{i} } \right) = 0\) and \(\alpha_{i} ,\hat{\alpha }_{i} \in \left[ {0,C} \right]\) for each i. C denotes the coefficient of smoothness.

Weka (Witten et al. 2011) is used to implement SVR. It uses SMOReg to implement SVR, the explanation of which is given by Shevade et al. (2000). Weka provides four different kernels for numeric data in SVR: PolyKernel, NormalizedPolyKernel, Puk and RBFKernel. All four have been used to perform the analysis.
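The kernel identity in (11) can be checked numerically for a small case. The sketch below uses a degree-2 polynomial kernel on two-dimensional inputs, for which an explicit feature map is known, and also shows a Gaussian (RBF) kernel, whose feature space is infinite-dimensional and therefore only reachable through the kernel. These are hand-written stand-ins with assumed parameter values, not Weka's PolyKernel or RBFKernel classes.

```python
import math


def dot(u, v):
    """Ordinary dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def phi(x):
    """Explicit feature map for the homogeneous degree-2 polynomial kernel in 2-D."""
    x1, x2 = x
    return (x1 * x1, math.sqrt(2.0) * x1 * x2, x2 * x2)


def poly2_kernel(x, y):
    """K(x, y) = (x . y)^2, computed without visiting the feature space."""
    return dot(x, y) ** 2


def rbf_kernel(x, y, gamma=0.5):
    """K(x, y) = exp(-gamma * ||x - y||^2); gamma = 0.5 is an arbitrary choice."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y)))


if __name__ == "__main__":
    x, y = (1.0, 2.0), (3.0, -1.0)
    # Both routes give the same number up to rounding.
    print(poly2_kernel(x, y), dot(phi(x), phi(y)))
    print(rbf_kernel(x, y))
```

Because the dual problem (13) needs only such kernel values, the feature vectors z never have to be constructed explicitly.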

5.5 Least median square regression (LMSR)

Least Median Square Regression (LMSR) generates regression functions from random samples of the data. The final model is the least squares regression with the lowest median squared error. Consider the data generation process given by (14).

$$Y_{i} = \beta_{0} + \beta_{1} X_{i} + \varepsilon_{i}$$
(14)

where \(\varepsilon_{i}\) is independently and identically distributed. Given a sample of n observed (X, Y) pairs, the aim is to estimate the parameters \(\beta_{0}\) and \(\beta_{1}\). This is done by fitting a line to the observations in the sample. The fitted line's intercept is \(\beta_{0}\) and its slope is \(\beta_{1}\). Ordinary least squares fits the line by finding the intercept and slope that minimize the sum of squared residuals, whereas LMSR selects the fit that minimizes the median of the squared residuals. This technique is also implemented using Weka (Witten et al. 2011) based on the algorithm given by Leroy and Rousseeuw (1987).
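The difference between the two fitting criteria can be seen on a small example. The sketch below is a simplification under stated assumptions: instead of drawing random subsamples, it tries the line through every pair of points and keeps the one with the lowest median squared residual. The data are made up, and the function names are hypothetical; this is not Weka's implementation.

```python
from itertools import combinations
from statistics import median


def ols_line(xs, ys):
    """Intercept and slope that minimise the sum of squared residuals."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return my - slope * mx, slope


def median_sq_residual(b0, b1, xs, ys):
    """Median of the squared residuals of the line y = b0 + b1 * x."""
    return median((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))


def lms_line(xs, ys):
    """Candidate lines through each pair of points; keep the lowest median."""
    best = None
    for i, j in combinations(range(len(xs)), 2):
        if xs[i] == xs[j]:
            continue  # a vertical line cannot be written as y = b0 + b1 * x
        b1 = (ys[j] - ys[i]) / (xs[j] - xs[i])
        b0 = ys[i] - b1 * xs[i]
        score = median_sq_residual(b0, b1, xs, ys)
        if best is None or score < best[0]:
            best = (score, b0, b1)
    return best[1], best[2]


if __name__ == "__main__":
    xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
    ys = [1.0, 3.0, 5.0, 7.0, 9.0, 11.0, 40.0]  # y = 1 + 2x plus one outlier
    print("OLS:", ols_line(xs, ys))
    print("LMS:", lms_line(xs, ys))
```

On these data the single outlier pulls the least squares slope well above 2, while the median-based fit recovers the line followed by the other six points.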

6 Result analysis

This section presents the performance of the machine learning techniques described in the previous section when applied to software entropy based bug prediction. The techniques are compared for each of the four selected subsystems, and general conclusions are then derived.

Performance of machine learning techniques is compared using Mean Absolute Error (MAE) and Root Mean Square Error (RMSE). These measures are defined in Witten et al. (2011) as follows:

  • Mean Absolute Error (MAE):

    $$\frac{{|p_{1} - a_{1} | + \cdots + |p_{n} - a_{n} |}}{n}$$
    (15)
  • Root Mean Square Error (RMSE):

    $$\sqrt {\frac{{\left( {p_{1} - a_{1} } \right)^{2} + \cdots + \left( {p_{n} - a_{n} } \right)^{2} }}{n}}$$
    (16)

where \(p_{i}\) is the predicted number of bugs and \(a_{i}\) is the actual number of bugs for the ith instance, and n is the number of instances.
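Both measures are straightforward to compute from paired lists of predicted and actual bug counts. The following is a minimal sketch; the function names and the three sample values are illustrative and are not taken from the study's data.

```python
import math


def mae(predicted, actual):
    """Mean Absolute Error: average of |p_i - a_i|."""
    if len(predicted) != len(actual) or not predicted:
        raise ValueError("need two non-empty lists of equal length")
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(predicted)


def rmse(predicted, actual):
    """Root Mean Square Error: square root of the average of (p_i - a_i)^2."""
    if len(predicted) != len(actual) or not predicted:
        raise ValueError("need two non-empty lists of equal length")
    mean_sq = sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(predicted)
    return math.sqrt(mean_sq)


if __name__ == "__main__":
    predicted = [3, 5, 10]   # made-up predictions
    actual = [1, 5, 14]      # made-up observed bug counts
    print("MAE :", mae(predicted, actual))
    print("RMSE:", rmse(predicted, actual))
```

Because RMSE squares each error before averaging, it penalizes a few large errors more heavily than MAE does, which is why the two measures can rank the techniques differently for the same subsystem.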

The results of the regression techniques for each subsystem are presented in Tables 3, 4, 5 and 6. The best results for each technique are highlighted.

Table 3 Results for mozilla/layout/generic
Table 4 Results for mozilla/layout/forms
Table 5 Results for apache/httpd/modules/ssl
Table 6 Results for apache/httpd/modules/mappers

Table 3 presents the results for the mozilla/layout/generic subsystem. The best cases are compared for each of the techniques. The best case MAE and RMSE values are plotted in Fig. 8. SVR gives the least MAE (125.089), followed by LWR, LMSR, GEP and GRNN, in that order. Based on RMSE, the best results are obtained by LWR, followed by SVR, GEP, LMSR and GRNN, in that order.

Fig. 8 Best case MAE and RMSE values for mozilla/layout/generic subsystem

The results for the mozilla/layout/forms subsystem are presented in Table 4. The best cases are compared for each of the techniques. LMSR gives the least MAE (65.518), followed by GEP, SVR, GRNN and LWR, in that order. The least RMSE is also observed for LMSR (95.694), followed by SVR, GEP, GRNN and LWR, in that order. These best case MAE and RMSE values are plotted in Fig. 9.

Fig. 9 Best case MAE and RMSE values for mozilla/layout/forms subsystem

Table 5 lists the results for the apache/httpd/modules/ssl subsystem. The best cases are compared for each of the techniques. GEP gives the least MAE (12.995), followed by LWR, SVR, GRNN and LMSR, in that order. The least RMSE is also obtained using GEP (19.039), followed by GRNN, LWR, SVR and LMSR, in that order. These best case MAE and RMSE values are plotted in Fig. 10.

Fig. 10 Best case MAE and RMSE values for apache/httpd/modules/ssl subsystem

Table 6 shows the results for the apache/httpd/modules/mappers subsystem. The best cases are compared for each of the techniques. The best case MAE and RMSE values are plotted in Fig. 11. The least values of both MAE and RMSE are obtained using GRNN (MAE = 10.343 and RMSE = 12.951), followed by SVR, LWR, GEP and LMSR, in that order.

Fig. 11 Best case MAE and RMSE values for apache/httpd/modules/mappers subsystem

Although it is difficult to conclude which machine learning based regression technique performs best, LMSR performs worst for both Apache HTTP Server subsystems. The techniques that give adequate performance for all the subsystems are GEP and SVR. We therefore suggest employing GEP and SVR for predicting bugs using entropy of changes.

7 Threats to validity

Empirical studies in bug prediction are subject to factors that affect the correctness of results. These are called threats to validity. Broadly, there are two kinds of threats: internal and external validity threats. Threats to internal validity occur if the true causes that affect the experimental results are misinterpreted. External validity refers to the ability to generalize the results of a study. To address it, we include multiple data sets extracted from four open-source software subsystems and perform empirical analysis on them. This study does not take into consideration bugs that are reported but not corrected when counting the year-wise number of bugs. However, since there are hardly any bugs that are not fixed, the results of this study stand justified. Also, the tool we developed and deployed matches keyword substrings in the reason of change extracted from the repository, and hence it does not apply any human-like reasoning to classify a change. The tool uses a simple but robust keyword matching algorithm to classify the changes with the least possible chance of misclassification. Since very few studies have examined the applicability of machine learning in entropy based bug prediction, it is suggested that more studies be carried out across various application domains, industrial settings and programming languages to obtain better generalization of results. Another threat to validity is that the four software subsystems considered in this study (mozilla/layout/generic, mozilla/layout/forms, apache/httpd/modules/ssl and apache/httpd/modules/mappers) cannot be considered representative of systems written in programming languages other than C and C++.

8 Conclusions and future direction

Empirical software engineering is concerned with developing accurate models to support various phases of software development. Bug prediction models support the testing phase by helping to optimize software testing costs through early identification of modules that require more rigorous testing. Building replicable and refutable models is encouraged in empirical software engineering (Menzies et al. 2016). The contribution of this study is two-fold, for both the research community and industry practitioners. First, this study proposes and implements an algorithm that automates the data extraction process for conducting software entropy based bug prediction studies. The concept of software entropy is grounded in information theory principles; it is intuitive and has a strong mathematical foundation (Hassan 2009). It is evident from three recent systematic literature reviews (Catal 2011; Radjenovic et al. 2013; Malhotra 2015) that there are no benchmarking studies to date that evaluate the applicability of machine learning in software entropy based bug prediction. The second important contribution of this study is that it compares machine learning based regression techniques for predicting bugs using entropy of changes. The study explains how entropy of changes is calculated for a software system or subsystem and how metrics are derived from it, and finally compares the results obtained using five regression techniques: Gene Expression Programming (GEP), General Regression Neural Network (GRNN), Locally Weighted Regression (LWR), Support Vector Regression (SVR) and Least Median Square Regression (LMSR). Although no single technique outperforms all the others in every case, GEP and SVR give adequate results in all cases. Hence it is suggested that GEP and SVR be employed for bug prediction using entropy of changes.

An important extension of our work will be to extract software entropy data for large scale software systems developed using other programming languages such as Java and Python, as well as other modern and upcoming programming languages such as Go, Rust and Ruby.