1 Introduction

Modern software systems are growing rapidly in complexity and size; ensuring their reliability and quality is therefore of paramount importance, and both are strongly affected by software faults [1]. Software fault prediction (SFP) actively helps in the detection of faults by highlighting potentially faulty areas of code in the software system [2]. Identifying the areas of code liable to contain more faults can help the testing team allocate software quality assurance resources optimally and efficiently [3, 4]. SFP modeling has been examined widely by several researchers due to its inherent advantages in optimizing testing resource utilization and improving the quality of software projects [5,6,7].

For the last two decades, various learning techniques have been used extensively for SFP [8,9,10,11,12]; naïve Bayes, regression techniques, k-nearest neighbors, decision trees, multilayer perceptron, and rule-based learners are a few of them. However, analyses of these algorithms showed that most of them achieved an average prediction accuracy of 80%-85% with a relatively high misclassification rate [4, 13, 14]. Moreover, the performance of these algorithms has not been consistent across different fault datasets [15,16,17,18]. In software systems, it is observed that most faults are concentrated in a small area of the code. Therefore, evaluating a classification algorithm using the accuracy measure alone does not provide an accurate depiction of model performance [19, 20].

Earlier research in the SFP domain revealed that individual classification and learning techniques have reached their performance ceiling, and the performance of these techniques may not be further improved without applying external corrections to the fault datasets or the model building process [2, 21, 22]. Some researchers have tried to break this performance ceiling by adopting different performance-improving strategies, such as enriching the information content of the training datasets [21], customizing the prediction model to specific local business goals [2], or combining multiple sets of software metrics [23]. The results of these strategies suggested that the performance bottleneck of SFP models can indeed be broken. Presently, ensemble techniques based SFP models have gained popularity in the software engineering research community [24,25,26]. A considerable body of research evidence shows that ensemble techniques can help overcome the performance bottleneck of classification algorithms and can serve as a tool to develop improved fault prediction models [23]. A few researchers have analyzed ensemble techniques such as bagging, boosting, voting, and stacking for SFP [26,27,28]. However, these studies were limited to a small number of fault datasets and analyzed only one or two ensemble techniques. Further, many new as well as improved ensemble techniques have since been reported, but they have not yet been evaluated for SFP. This motivated us to undertake a study of these ensemble techniques and to establish their usefulness for SFP.

This paper performs an extensive experimental study of seven ensemble techniques, namely Dagging, Decorate, Grading, MultiBoostAB, RealAdaBoost, Rotation Forest, and Ensemble Selection, for SFP. To the best of our knowledge, most of the ensemble techniques used in this study have not been investigated thoroughly before for SFP. Three different classification algorithms, namely naive Bayes, logistic regression, and J48 (decision tree), are chosen to serve as base learners for the ensemble techniques. The experimental study is performed on twenty-eight public-domain software fault datasets available in the PROMISE data repository [29]. Precision, recall, AUC (area under the ROC curve), specificity, and G-means (G-mean 1 and G-mean 2) measures are used to evaluate the performance of the ensemble techniques. The statistical significance of performance differences among the seven ensemble techniques is evaluated using Friedman's test and the Wilcoxon signed-rank test. Additionally, a cost-benefit analysis is carried out to assess the cost-effectiveness of the used ensemble techniques in terms of saving software testing cost and effort. The results and observations obtained from this empirical study can help practitioners build effective SFP models.

1.1 Contributions

Over the last decade, various researchers have used different ensemble techniques for software fault prediction. However, many new as well as improved versions of existing ensemble techniques have recently been introduced in the machine learning domain, which have not been explored for SFP. This raises the need for a comprehensive evaluation of these techniques to benchmark their performance for SFP, which could be very beneficial to the research community and the practitioners working in the SFP domain.

The contributions of the presented work are as follows:

  1. We provide a systematic literature review of the ensemble techniques used for software fault prediction and report the findings of the review.

  2. We perform an extensive comparison of seven different ensemble techniques for SFP, which, to the best of our knowledge, have not been explored before.

  3. We repeat the experiments on twenty-eight distinct fault datasets from different domains to establish the feasibility and usefulness of the used ensemble techniques for SFP.

  4. Further, we perform a cost-benefit analysis of the used ensemble techniques to assess their economic viability for SFP.

The following research questions have been framed for investigation in the presented experimental study:

  • RQ1: Which ensemble technique shows overall best performance for software fault prediction?

  • RQ2: Is there any statistically significant performance difference between the chosen ensemble techniques?

  • RQ3: How do base learners affect the performance of ensemble techniques?

  • RQ4: For a given software system, how economically effective are ensemble techniques for software fault prediction?

The structure of the paper is as follows. A discussion of earlier similar works is provided in Section 2. Section 3 provides a systematic review of ensemble techniques based SFP. Section 4 details the software fault prediction process. Section 5 gives an overview of the ensemble techniques used for SFP. Section 6 provides details of the empirical study, including a description of the used software fault datasets, performance evaluation measures, and experimental procedure. Section 7 presents and discusses the results of the study. A comparative analysis with earlier works is presented in Section 8. Section 9 lists various threats to the validity of the presented study, followed by conclusions and future work in the final section.

2 Related work

Many works in the literature have used ensemble techniques/methods for SFP [23, 25, 30, 31]. Tosun et al. [32] built an ensemble based fault prediction model that combines the learning of three different classifiers: naive Bayes, neural network, and voting feature intervals. The authors compared the performance of the presented ensemble model with naive Bayes and found that the presented model achieved considerably improved performance. However, they focused on only one ensemble model and performed experiments on a few NASA datasets. In a similar study, J. Zheng [33] presented and evaluated three cost-sensitive boosting algorithms for SFP. The author used one threshold-updating and two weight-adjusting algorithms and performed the analysis on four NASA datasets. The results showed that the threshold-updating algorithm with a boosted neural network performed best among the techniques considered for SFP. Wang et al. [26] presented a study of software defect prediction using several classifier ensembles. The authors assessed the capabilities of seven ensemble techniques, namely Bagging, Boosting, Random trees, Random forest, Random subspace, Stacking, and Voting, and used naive Bayes as the base learner for the ensemble techniques. They performed a series of experiments on several NASA datasets and found that Voting and Random forest performed better than the other methods. Overall, the authors suggested that ensemble methods produce better performance than a single classifier. B. Twala [34] built an ensemble based fault prediction model using three distinct techniques for a large space software system. The author showed that ensembles based on decision tree and apriori techniques outperformed the other used ensemble techniques and yielded better accuracy.

Aljamaan et al. [35] investigated bagging and boosting ensemble techniques for software defect prediction and compared their performance with other commonly used fault prediction techniques. The results showed that ensemble based prediction models produced better accuracy values than most of the used fault prediction techniques. Recently, Siers and Islam [36] presented two ensemble methods, namely CSForest and CSVoting, using cost-sensitive analysis for SFP. The examined ensemble methods initially create a set of decision trees and later combine these trees to minimize the classification cost. The authors showed that the presented ensemble methods achieved superior performance compared to the six other classification algorithms used.

In the presented work, we perform an extensive analysis of seven ensemble techniques, Dagging, Decorate, Grading, MultiBoostAB, RealAdaBoost, Rotation Forest, and Ensemble Selection, for SFP. To the best of our knowledge, most of these ensemble techniques have not been explored or experimented with for SFP until now. Further, we use three different classification algorithms as base learners to analyze the impact of the base learner on the performance of the ensemble techniques. The study was performed on twenty-eight software fault datasets, and a total of 532 fault prediction models have been generated. We believe that the analysis of ensemble techniques presented in this paper will help the research community build more effective fault prediction models using ensemble techniques.

3 Systematic review of ensemble techniques based software fault prediction

To identify the papers related to ensemble techniques for software fault prediction, we searched the Google Scholar, IEEE Xplore, ScienceDirect, and Scopus databases and extracted papers published between January 2010 and April 2020. We selected this timeline for the article search because most of the works using ensemble techniques for software fault/defect prediction were published in the last decade only. The query string used for the database search is “(Software Fault OR Defect OR Bug Prediction) AND (Ensemble techniques OR Bagging OR Boosting OR Stacking)”. The initial query run resulted in a large number of articles. We applied the following inclusion and exclusion criteria to filter the articles and to select only the relevant ones [37].

Inclusion Criteria

  1. The paper must be written in the English language.

  2. The full content of the paper must be available online.

  3. The paper must have been published between January 2010 and April 2020.

  4. The reported study must use real software project datasets, not simulated ones.

  5. The paper must apply at least one ensemble technique for software fault/defect prediction.

  6. The paper must report new experiments only.

  7. The paper must report results using standard performance measures with sufficient detail.

Table 1 lists the studies related to ensemble techniques based software fault/defect prediction. The use of ensemble techniques for SFP has accelerated since 2010 [38,39,40,41]. The review showed that a large number of researchers have focused on bagging, boosting, and stacking based ensemble techniques. Different studies used different classifiers as base learners for these ensemble techniques, such as naïve Bayes, decision tree, and multilayer perceptron. The results of these analyses showed that the ensemble techniques produced higher, or at least equal, performance compared to their base learners [23]. Some other researchers explored variations of the traditional ensemble techniques, such as cost-sensitive neural networks, cost-sensitive boosting, and bagging with oversampling, and claimed that these variations resulted in improved performance compared to the traditional ensemble techniques [65, 74]. A few researchers used hybrid ensemble techniques, such as ensemble techniques with feature selection or with sampling, and showed that such hybrids can be useful in building accurate fault prediction models [55, 61]. However, over the last few years many new or improved ensemble techniques have been presented, yet a comprehensive evaluation of these newly available techniques is missing. Thus, in this work, we include ensemble techniques that have not been explored before for SFP.

Table 1 Analysis of ensemble techniques based software fault prediction literature

4 Software fault prediction process: An overview

In this section, we discuss a generic process used for the prediction of software faults. Many works in the literature present various approaches for software fault prediction. The aim of this section is to discuss the commonly used steps for software fault prediction based on the available works [81, 84,85,86,87]. These steps are also used in building the ensemble models under study for software fault prediction, discussed in the upcoming sections.

The aim of software fault prediction (SFP) is to identify the software modules having a higher probability of being faulty. The SFP process is based on the use of underlying characteristics of the software project, such as source code metrics, change and revision history, and structural properties. An SFP model uses such software project data, augmented with the corresponding fault information of a known project, as a training dataset, and the trained model is subsequently used to predict faults in unknown projects. The working assumption of the SFP process is that if a software project was developed in an environment that led to faults, then any subsequent software modules developed in a similar environment with similar underlying characteristics will tend to be faulty [81, 82]. Let the software fault dataset be defined as D = {X, Y}, where X represents a set of software metrics (features, attributes, or independent variables) and is a matrix of size N × M; N is the number of rows (software modules) and M is the number of features. Y represents the fault information (dependent variable) and is a vector of size N. {xi, yi} is the ith observation in the dataset. The dependent variable (DV) is yi ∈ {0, 1}, where “1” stands for a faulty software module and “0” stands for a non-faulty software module. A prediction model is built on the dataset D and aims to classify unseen software modules into faulty or non-faulty labels, yielding classifier predictions ŷi = f(xi), where f denotes the trained model. If a classification algorithm is used to build the SFP model, it is often referred to as a classification model or, given its binary outcome, a binary classification model.

Figure 1 depicts an overview of the software fault prediction process. The process shown in the figure and described below is a generic process used for the prediction of software faults. The steps involved in SFP model building and assessment are described as follows [83]; a minimal code sketch of the model building and evaluation steps is given after the list.

  1. Extraction of fault information: Each software project has source code and bug repositories such as SVN or CVS. The extraction of fault information involves retrieving data from the bug repository and linking it to the source code. Based on the log contents and status of the bug, it is decided whether a commit is a bugfix or not. All such reported bugs are collected from the bug repository and mapped to their corresponding source code modules.

  2. Collecting software metrics (features or attributes) and creating the fault dataset: This step collects software metric information from the source code of the software project or from the log contents of the project. First, it is decided which properties of the given software are required. Then, based on that, the source code or log files are parsed and the corresponding software metrics are collected. Finally, the extracted fault information and collected software metrics are combined to create the fault dataset that is used to train the SFP model.

  3. Building SFP models: Usually, classification algorithms or regression techniques such as decision tree, support vector machine, naïve Bayes, or linear regression are used to build the SFP model from the fault dataset. The trained SFP model is then used to predict faults in unseen software modules.

  4. Evaluation: To assess the SFP model’s performance, a separate testing dataset is generally used besides the training dataset. This testing dataset is created by partitioning the fault dataset into training and testing parts. The fault-proneness of the software modules in the testing dataset is predicted, and the performance of the model is evaluated by comparing the predicted fault labels with the corresponding actual labels.
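As a concrete illustration of steps 2-4, the following minimal sketch builds and evaluates a simple SFP model with scikit-learn. It is only an illustrative stand-in for the process described above: the file name camel-1.6.csv and the 'bug' fault-count column are hypothetical assumptions about the dataset layout, and the classifier here is not the tooling used later in the paper, which relies on Weka.

```python
# Minimal sketch of SFP steps 2-4 (illustrative only; the CSV file name and
# the 'bug' column are hypothetical assumptions about the dataset layout).
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import precision_score, recall_score

data = pd.read_csv("camel-1.6.csv")              # step 2: metrics + fault counts
X = data.drop(columns=["bug"])                   # software metrics (independent variables)
y = (data["bug"] > 0).astype(int)                # 1 = faulty module, 0 = non-faulty

# step 4 setup: hold out part of the dataset for testing
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=1)

model = DecisionTreeClassifier(random_state=1)   # step 3: build the SFP model
model.fit(X_tr, y_tr)

y_pred = model.predict(X_te)                     # step 4: predict and evaluate
print("precision:", precision_score(y_te, y_pred))
print("recall   :", recall_score(y_te, y_pred))
```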

Fig. 1 Software fault prediction process

A number of researchers have explored different models for software fault prediction. Most of these works focused on the binary classification of modules as faulty or non-faulty [84,85,86,87]. Some researchers have built prediction models for the number of faults in a software module or for the severity of faults [88,89,90]. The results of these studies showed that the average prediction accuracy of software fault prediction models was approximately 80%-85%, with a 30%-40% misclassification rate. Additionally, it has been found that no single learning technique (classifier or regression technique) always performs better than the others across different software projects [91]. However, some learning techniques such as naive Bayes, logistic regression, and random forest achieved better performance than techniques such as support vector machine (SVM) and multilayer perceptron (MLP), although in some cases SVM or MLP yielded better performance than the other techniques [4]. A few researchers have performed comparative or meta-analysis studies of learning techniques for software fault prediction [82, 92]. Recently, Li et al. [93] and N. Li et al. [37] reported benchmark studies for software fault prediction in 2019 and 2020, respectively. In 2019, Li et al. [93] reported an updated benchmark study in which the authors evaluated various classifiers using new fault datasets and new evaluation metrics. The analysis showed that techniques such as bagged MLP, ANN/MLP, decision tree, and random forest yielded better prediction performance than techniques such as CART, logistic regression, SVM, and naïve Bayes. The authors further stated that no single best classifier was found for SFP, and suggested the use of simple classifiers over complex ones due to the problem of hyper-parameter tuning. In 2020, N. Li et al. [37] reported a systematic review and meta-analysis of unsupervised learning techniques for software defect prediction. After a thorough screening of the works published between 2000 and 2018, the authors included a total of 49 studies in their meta-analysis. The results showed that the performance of unsupervised learning techniques was comparable with supervised learning techniques for both within-project and cross-project prediction. Among the considered unsupervised learning techniques, Fuzzy C-Means (FCM) and Fuzzy SOMs (FSOMs) yielded the best performance. Further, the authors stated that factors such as dataset characteristics did not show any significant impact on the performance of the unsupervised techniques.

5 Ensemble techniques for software fault prediction

An ensemble technique generates several intermediate prediction models, which are integrated to make an overall prediction [94]. The primary purpose of an ensemble technique is to overcome the performance ceiling of a single learning algorithm and to enhance the overall performance of the prediction model. Several techniques are available in the literature to generate the intermediate prediction models used by ensembles [95]. Ensemble techniques make effective use of these intermediate prediction models to reduce the variance in prediction performance without increasing bias [96]. In this work, the SFP problem is defined as a classification task whose aim is to categorize the given software modules into faulty or non-faulty classes. The prediction takes the form of a function f that, during training, takes as input a vector of size n + 1 consisting of n software metrics (A1, A2, ..., An) and one dependent variable (fault information), and outputs the fault-proneness (Y) of a given software module. Each vector of software metrics and dependent variable describes one software module, i.e., a class in object-oriented software systems or a file in other software systems. The calibration of f is done on the training dataset (TR) containing several such vectors or examples. The dependent variable is the faulty/non-faulty information of a software module.

Figure 2 shows the working of ensemble techniques for SFP. The process of building a prediction model using an ensemble technique is two-fold: (1) generation of the intermediate prediction models to be used in the ensemble (ensemble generation), and (2) integration of the generated prediction models to obtain the final prediction (ensemble integration) [95]. Ensemble techniques utilize multiple models (known as “weak learners”) that are trained and combined to obtain improved results. The effectiveness of an ensemble technique depends on correctly combining the weak learners. In ensemble theory, a weak learner is a model that does not perform well on its own, either because it has high bias or high variance. Ensemble techniques overcome this problem by combining several weak learners so as to reduce their bias and variance. Most ensemble techniques rely on a single base learning algorithm to generate multiple weak learners, with each instance of the weak learner trained differently; this setting is known as a homogeneous ensemble. Some ensemble techniques instead use different learning algorithms to generate the weak learners; this is known as a heterogeneous ensemble. The next step is the correct aggregation of the weak learners, and different ensemble techniques combine them differently. For example, in bagging, weak learners are combined using a deterministic averaging process. In boosting, weak learners are generated adaptively and combined using a deterministic strategy. In stacking, weak learners are combined using a meta-model that learns from the outputs of the weak learners and combines them.

Fig. 2 Working of ensemble techniques for the SFP

Irrespective of its type, every ensemble technique takes one or more learning algorithms as input, together with a training dataset. Depending on the number of weak learners to be generated, the input training dataset is partitioned into several subsamples, and one weak learner is trained on each subsample. The output of this training phase is a set of weak learners, each trained on a different subsample. Next, based on the chosen combination strategy, the weight of each weak learner is decided and their outputs are combined for the final prediction. Several techniques have been proposed by researchers for ensemble generation and ensemble integration [95]. In the presented work, we focus on homogeneous ensemble generation techniques, where the same algorithm is used to generate the intermediate prediction models; a minimal code sketch of this generic partition-train-combine workflow is given below. Seven different homogeneous ensemble techniques are used in the study and are described after the sketch.
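The following sketch illustrates the generic partition-train-combine workflow just described, under simple assumptions (numpy feature matrix, a single scikit-learn base learner, unweighted majority voting); it is not the Weka implementation used in the experiments.

```python
# Sketch of a homogeneous ensemble: disjoint subsamples of the training data,
# one weak learner per subsample, and majority voting for the final prediction.
import numpy as np
from sklearn.base import clone
from sklearn.tree import DecisionTreeClassifier

def train_homogeneous_ensemble(X, y, base_learner, n_learners=5, seed=0):
    rng = np.random.RandomState(seed)
    idx = rng.permutation(len(y))
    learners = []
    for part in np.array_split(idx, n_learners):   # disjoint subsamples
        model = clone(base_learner)                # same algorithm, trained differently
        learners.append(model.fit(X[part], y[part]))
    return learners

def predict_majority(learners, X):
    votes = np.stack([m.predict(X) for m in learners])
    return (votes.mean(axis=0) >= 0.5).astype(int)  # unweighted majority vote

# usage (X_train, y_train, X_test are numpy arrays built from a fault dataset):
# learners = train_homogeneous_ensemble(X_train, y_train, DecisionTreeClassifier())
# y_hat = predict_majority(learners, X_test)
```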

  1. Dagging: In this ensemble technique, several disjoint stratified subsets of the original fault dataset are generated and fed to the classification algorithm (base learner). The final prediction is made by using a majority voting scheme to combine the outcomes of the base learner over all the generated subsets [97]. It differs from bagging in that disjoint subsets of the given dataset are used to build the prediction models.

  2. Decorate: This ensemble technique generates diverse intermediate prediction models by using specially constructed artificial training examples. It follows an iterative ensemble generation process: in each iteration, an intermediate prediction model is generated and added to the current ensemble. The base learner is trained in each iteration on the training dataset augmented with some artificially generated data points. The artificial training data points are drawn from the original data distribution, and their number is specified as a fraction of the training dataset size [98]. The class labels of these artificial data points are chosen to be maximally different from the current ensemble’s predictions.

  3. Grading: This is a meta-classification scheme that uses graded predictions on the meta-level classes to make the final prediction [99]. For each base learner, a meta-classifier is learned whose task is to predict when the base learner will be incorrect. A graded prediction is a prediction that has been marked as correct or incorrect. The training dataset for the meta-classifier is constructed using the graded predictions of the corresponding base learner as new class labels for the original attributes. The final prediction is derived from the predictions of the base learners that are predicted to be correct by the meta-classification scheme [100].

  4. MultiBoostAB: This ensemble technique extends the AdaBoost ensemble technique. It combines the capabilities of AdaBoost with wagging to reduce the prediction bias and variance of the final model [101]. The advantage of MultiBoost over AdaBoost is that, in contrast to AdaBoost, its intermediate models can be learned in parallel, which speeds up the training and model building process.

  5. RealAdaBoost: RealAdaBoost is a modified version of the AdaBoost ensemble technique that fits an additive logistic regression model and produces a non-linear version of logistic regression [102]. It extends AdaBoost and removes the need for a coefficient, as the optimal coefficient is always 1. Additionally, it generates fewer trees than AdaBoost to reach the final prediction [103].

  6. Rotation Forest: This ensemble technique uses the PCA (Principal Component Analysis) algorithm to transform features and select instances of the training dataset when building decision trees [104]. First, the features of the training dataset are split into K non-overlapping subsets of equal size. Then, 25% of the training examples are removed randomly using a bootstrap method and PCA is applied to the remaining 75%. These steps are repeated for each tree in the rotation forest, and the final prediction is based on the integrated outputs of all trees (a simplified sketch of this procedure is given after this list).

  7. Ensemble Selection: Ensemble Selection is a meta-classification ensemble technique that uses a library of base learners to generate the final ensemble. It starts with an empty ensemble and iteratively adds the base learner from the library that maximizes the ensemble’s performance. This process is repeated for a fixed number of rounds, and the final ensemble based prediction model is the nested set of base learners that maximizes the prediction performance [105].
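The sketch below captures the Rotation Forest idea in a simplified form: features are split into subsets, PCA is fitted on a 75% bootstrap sample per subset, one decision tree is trained per rotated copy of the data, and the trees vote. It is a didactic approximation under these assumptions, not the exact algorithm of [104] nor the Weka implementation used in the experiments.

```python
# Simplified Rotation Forest sketch (assumes X, y are numpy arrays).
import numpy as np
from sklearn.decomposition import PCA
from sklearn.tree import DecisionTreeClassifier

class SimpleRotationForest:
    def __init__(self, n_trees=10, k_subsets=3, seed=0):
        self.n_trees, self.k_subsets = n_trees, k_subsets
        self.rng = np.random.RandomState(seed)
        self.models = []                        # (feature groups, fitted PCAs, tree)

    def _rotate(self, X, groups, pcas):
        # apply each subset's PCA and concatenate the rotated feature blocks
        return np.hstack([p.transform(X[:, g]) for g, p in zip(groups, pcas)])

    def fit(self, X, y):
        n, m = X.shape
        for _ in range(self.n_trees):
            groups = np.array_split(self.rng.permutation(m), self.k_subsets)
            boot = self.rng.choice(n, size=int(0.75 * n), replace=True)
            pcas = [PCA().fit(X[boot][:, g]) for g in groups]   # PCA per feature subset
            tree = DecisionTreeClassifier(random_state=0)
            tree.fit(self._rotate(X, groups, pcas), y)          # tree on rotated data
            self.models.append((groups, pcas, tree))
        return self

    def predict(self, X):
        votes = np.stack([t.predict(self._rotate(X, g, p))
                          for g, p, t in self.models])
        return (votes.mean(axis=0) >= 0.5).astype(int)          # majority vote

# usage: y_hat = SimpleRotationForest().fit(X_train, y_train).predict(X_test)
```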

6 Empirical study

6.1 Experimental datasets

In this work, fault datasets were gathered from the PROMISE data repository for building and evaluating prediction models [29]. A total of twenty-eight benchmark software fault datasets were collected from this repository. The considered fault datasets include data of several open-source software systems such as Apache Camel, Apache Xerces, Apache Xalan, and PROP. The details of the considered datasets are given in Table 2. The datasets used (described in Table 2) are the same as those used in our previous paper [16]. All the used fault datasets have 300 or more software modules; we dropped all datasets smaller than this threshold of 300 modules. Each dataset contains twenty-one object-oriented software metrics and the number of faults found in each software module. Since the aim of the presented study is to classify software modules as faulty or non-faulty, we performed a data transformation on these datasets and categorized the fault-count information into faulty and non-faulty classes: software modules with one or more faults were marked as faulty, and modules with zero faults were marked as non-faulty. The same data transformation scheme was applied to all twenty-eight datasets (a minimal sketch of this preparation step is given below). The considered dependent variable is the faulty/non-faulty label of the software modules.
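A minimal sketch of this preparation step is as follows; the directory name, file naming, and the 'bug' fault-count column are hypothetical assumptions about the PROMISE CSV layout.

```python
# Hedged sketch of dataset preparation: keep datasets with at least 300 modules
# and binarize the fault counts (>= 1 fault -> faulty, 0 faults -> non-faulty).
import glob
import pandas as pd

datasets = {}
for path in glob.glob("promise/*.csv"):            # hypothetical local copies
    df = pd.read_csv(path)
    if len(df) < 300:                              # drop datasets below the 300-module threshold
        continue
    df["faulty"] = (df["bug"] >= 1).astype(int)    # binary dependent variable
    datasets[path] = df.drop(columns=["bug"])      # 21 OO metrics + 'faulty' label remain
print(f"{len(datasets)} datasets retained")
```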

Table 2 Details of considered software fault datasets [16]

6.2 Experimental procedure

Figure 3 depicts the procedure used for the experimental study presented in this paper.

Fig. 3 Overview of the experimental procedure

The experimental procedure mainly consists of three steps. In the initial step, training and testing subsets are generated from the original fault dataset by splitting it into multiple partitions. A ten-fold cross-validation scheme is used to build the prediction models and evaluate the performance of the ensemble techniques. This scheme partitions the original fault dataset into ten disjoint folds; in each iteration, nine folds serve as the training dataset used to train the ensemble techniques and the remaining fold serves as the testing dataset used to evaluate their performance. This process is repeated ten times, once for each fold. The second step is the building of the ensemble based prediction models: the selected training dataset is used to build the prediction model, with three different classification algorithms used as base learners for the ensemble techniques. Each time a different classification algorithm is fed to the ensemble technique, and this process is repeated for all the base learners. The final step is the evaluation of the built ensemble based fault prediction models on the testing dataset. Various performance measures are used to evaluate the performance of the built models, and Friedman’s test and the Wilcoxon signed-rank test are used to evaluate the statistically significant performance differences among the chosen ensemble techniques. A minimal sketch of the cross-validation loop is given below; the components of the procedure are described in the following subsections.
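The sketch below illustrates the ten-fold cross-validation loop of the first and third steps, using synthetic data and a scikit-learn bagging ensemble as a stand-in for the Weka ensemble techniques; the dataset shape and class imbalance are illustrative assumptions only.

```python
# Minimal sketch of the ten-fold cross-validation procedure for one dataset.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import recall_score

# synthetic stand-in for one prepared fault dataset (21 metrics, ~20% faulty)
X, y = make_classification(n_samples=400, n_features=21,
                           weights=[0.8, 0.2], random_state=1)

ensemble = BaggingClassifier(GaussianNB(), n_estimators=10, random_state=1)
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=1)

fold_recall = []
for train_idx, test_idx in cv.split(X, y):       # nine folds train, one fold tests
    ensemble.fit(X[train_idx], y[train_idx])
    y_pred = ensemble.predict(X[test_idx])
    fold_recall.append(recall_score(y[test_idx], y_pred))

print("mean recall over 10 folds:", np.mean(fold_recall))
```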

6.3 Base Learners

Three different classification algorithms, namely naive Bayes, logistic regression, and J48 (decision tree), have been used as base learners. Previous research showed that these algorithms produced better performance than other classification algorithms for SFP [4]; for this reason, we selected them as base learners to feed into the ensemble techniques. A brief description of these algorithms is given as follows.

  1. Naive Bayes (NB): The naive Bayes algorithm belongs to the Bayesian classifier family. Its working is based on the use of Bayes' theorem to categorize a given testing module into one of the classes [106]. Naive Bayes first calculates the posterior probability of each class using the attribute values (software metrics) of the given module, and then assigns the module the label of the class with the highest probability. The parameter estimation process of the naive Bayes classifier involves a simple estimation of the probability of attribute values within each class from the training modules. A comprehensive description of naive Bayes can be found in [107].

  2. Logistic Regression (LR): LR is a type of regression technique used when the response variable is categorical. It calculates the probability of a binary response variable using one or more independent variables (software metrics) [108]. The simple logistic model only predicts the probabilities of outcomes in terms of the input values; to use it as a classifier, a cutoff value (threshold) must be selected that assigns values greater than the cutoff to one class and values lower than the cutoff to the other class. More details of logistic regression are given in [109].

  3. J48 (decision tree): As the name implies, a decision tree forms a tree-like structure to make decisions. Building a decision tree involves selecting the tree nodes and splitting criteria, along with knowing when to stop [110]. Initially, the most promising node is selected as the root node of the tree, and tree construction continues with intermediate promising nodes. Typically, information gain (Infogain) or gain ratio is used as the splitting criterion [111]. We used the J48 algorithm in the present study, which is an implementation of the C4.5 decision tree in the Weka machine learning tool [112].

6.4 Implementation details

All ensemble techniques were implemented using the Weka machine learning tool [113]. The parameter values of the used ensemble techniques and base learners are given in the Appendix. Each ensemble technique receives as input the training dataset containing the software metrics and the corresponding fault information. The training dataset is used to train the SFP model according to the internal working of the ensemble technique. After training, a separate testing dataset is fed to the trained SFP model and a prediction is made for the software modules of the testing dataset; each ensemble technique outputs the faulty or non-faulty labels of the given software modules. Six of the seven ensemble techniques are paired with each of the three base learners, and Ensemble Selection is run with its own model library, so a total of nineteen fault prediction models are created per fault dataset (as enumerated in the sketch below). We replicated the experiments for twenty-eight fault datasets; therefore, a total of 532 (19 × 28) fault prediction models have been created.
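The following small sketch enumerates how the nineteen configurations per dataset and the 532 models overall arise; the grouping of Ensemble Selection as a single configuration is inferred from the model IDs used later in Tables 9 and 10.

```python
# Enumerating the model configurations: six ensemble techniques paired with
# each of the three base learners, plus Ensemble Selection on its own.
ensembles = ["Dagging", "Decorate", "Grading",
             "MultiBoostAB", "RealAdaBoost", "RotationForest"]
base_learners = ["NB", "LR", "J48"]

configurations = [f"{e}({b})" for e in ensembles for b in base_learners]
configurations.append("EnsembleSelection")

print(len(configurations))           # 6 * 3 + 1 = 19 models per fault dataset
print(len(configurations) * 28)      # 19 * 28 datasets = 532 prediction models
```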

6.5 Performance evaluation measures

Five different performance measures, namely precision, recall, AUC (area under the ROC curve), specificity, and G-means (G-mean 1 and G-mean 2), have been used to evaluate the performance of all seven ensemble techniques [23, 114]. Previous studies reported that the accuracy measure does not provide a complete evaluation of model performance due to the imbalance in the fault datasets; for this reason, we have excluded it from the study and selected measures that provide a complete model evaluation despite this imbalance [115]. An explanation of these performance measures is given below; a computational sketch of the measures and of the cost-benefit model is provided at the end of this subsection.

  • (i) Precision: It measures the proportion of correctly predicted faulty modules out of all modules predicted as faulty. It is defined by Equation (1).

    $$ Precision = \frac{TP}{TP+FP} $$
    (1)
  • (ii) Recall: It measures the proportion of actual faulty modules that are correctly predicted. It is defined by Equation (2).

    $$ Recall = \frac{TP}{TP+FN} $$
    (2)
  • (iii) AUC: It stands for the area under the receiver operating characteristic (ROC) curve. The ROC curve is a graphical plot that depicts the diagnostic capability of a prediction model under different threshold values, plotting the true positive rate on the y-axis against the false positive rate on the x-axis. The area under the curve gives the probability that the classifier will rank a randomly chosen positive (faulty) module higher than a randomly chosen negative (non-faulty) module.

  • (iv) Specificity: It measures the proportion of negative (non-faulty) modules that are correctly predicted by the model; specificity therefore quantifies the avoidance of false positives. It is defined by Equation (3).

    $$ Specificity = \frac{TN}{TN+FP} $$
    (3)

    A high specificity value shows that the prediction model has a low false positive rate, which helps significantly reduce the resources spent on false alarms. Conversely, a low specificity value signifies a higher false positive rate and thus a higher consumption of resources on false alarm cases.

  • (v) G-means: G-means stands for geometric means; two measures, G-mean 1 and G-mean 2, are generally used together.

    G-mean 1 is the square root of the product of precision and recall, and G-mean 2 is the square root of the product of recall and specificity. They are defined by Equations (4) and (5), respectively.

    $$ \text{G-mean 1} = \sqrt{Precision \times Recall} $$
    (4)
    $$ \text{G-mean 2} = \sqrt{Specificity \times Recall} $$
    (5)
  • (vi) Statistical tests: We perform Friedman’s test and the Wilcoxon signed-rank test to identify differences in the performance of the used ensemble techniques [116]. Both tests are nonparametric, so they do not make any assumptions about the normality of the data. The significance level (α) is set to 0.05, which corresponds to a 5% risk of rejecting the null hypothesis when it is true. For these tests, the framed null hypothesis (H0) and alternative hypothesis (Ha) are as follows:

    H0: There is no significant performance difference among the used ensemble techniques at the given significance level.

    Ha: There is a significant performance difference among the used ensemble techniques at the given significance level.

  • (vii) Cost-benefit Analysis: A cost-benefit analysis of the used ensemble techniques is performed to assess the cost-effectiveness of the SFP models. Wagner initially proposed the concept of cost-benefit analysis in the context of SFP [117]. This analysis estimates the amount of testing effort and cost that can be saved by using the results of SFP models along with the software testing process in the software development life cycle. The analysis model considers the fault removal cost and the fault identification efficiency of different testing phases, derived from case studies of different software organizations, to estimate the fault removal cost of a specific fault prediction model. Kumar et al. [118] explored the use of cost-benefit analysis in SFP, and we have used their model in our work for the cost-effectiveness analysis of the built SFP models. Certain assumptions have been made in designing the cost-benefit model, as specified below:

    1. (a) Each testing phase, such as unit testing, integration testing, and system testing, has a different fault removal cost.

    2. (b) No software testing phase is able to detect 100% of the software faults.

    3. (c) Unit testing of all software modules is not practically feasible.

Equation (6) gives the estimated fault removal cost (Ecost) incurred when the results of fault prediction are used along with the software testing process. Equation (7) gives the minimum fault removal cost (Tcost) incurred without the use of fault prediction results in the software testing process. Equation (8) gives the normalized fault removal cost and its interpretation.

$$ \begin{array}{@{}rcl@{}} Ecost&=& C_{ini}+C_{u}*(FP+TP)\\ &&+ \delta_{i}*C_{i}*(FN+(1-\delta_{u})*TP)\\ &&+ \delta_{s}*C_{s}*(1-\delta_{i})*(FN+(1-\delta_{u})*TP)\\ &&+ (1-\delta_{s})*C_{f}*((1-\delta_{u})*FN+(1-\delta_{u})*TP) \end{array} $$
(6)
$$ \begin{array}{@{}rcl@{}} Tcost &=& M_{p}*C_{u}*TM+\delta_{i}*C_{i}*(1-\delta_{u})*FM\\ &&+\delta_{s}*C_{s}*(1-\delta_{i})*(1-\delta_{u})*FM\\ &&+(1-\delta_{s})*C_{f}*(1-\delta_{i})*(1-\delta_{u})*FM \end{array} $$
(7)
$$ Ncost = \frac{Ecost}{Tcost} \begin{cases} <1 & \quad \text{Fault prediction is useful}\\ \geq 1 & \quad \text{Unit testing is useful} \end{cases} $$
(8)

The meanings of the notations used are the same as described in the study by Kumar et al. [118]:

  • Ecost: Estimated fault removal cost of the software with the use of software fault prediction results

  • Tcost: Total fault removal cost of the software without the use of software fault prediction results

  • Ncost: Normalized fault removal cost of the software when software fault prediction is used

  • Cini: Initial setup cost of using the software fault prediction model (Cini = 0)

  • Cu: Normalized fault removal cost in unit testing

  • Cs: Normalized fault removal cost in system testing

  • Cf: Normalized fault removal cost in field testing

  • Ci: Normalized fault removal cost in integration testing

  • Mp: Percentage of modules unit tested

  • FP: Number of false positives

  • FN: Number of false negatives

  • TP: Number of true positives

  • TM: Total number of modules

  • FM: Total number of faulty modules

  • δu: Fault identification efficiency of unit testing

  • δs: Fault identification efficiency of system testing

  • δi: Fault identification efficiency of integration testing

The fault identification efficiency values of the different testing phases are borrowed from the study performed by Jones [119]; we have used the medians of the values reported by Jones, namely δu = 0.25, δs = 0.5, and δi = 0.45. The normalized fault removal cost is defined in staff-hours per fault and is borrowed from Wagner’s work [117]; again, we have used the medians of these values, namely Cf = 27, Cs = 6.2, Cu = 2.5, and Ci = 4.55. Mp denotes the fraction of modules that are unit tested, and its value, Mp = 0.5, is taken from the study performed in [120]. A detailed description of the used cost-benefit analysis model is given in [118]. A computational sketch of the evaluation measures and of this cost model is given below.
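The sketch below computes the classification measures of Equations (1)-(5) and the cost model of Equations (6)-(8) as reconstructed above, using the constants quoted from [117, 119, 120]; the confusion-matrix counts in the example call are hypothetical.

```python
# Hedged sketch of the evaluation measures and the cost-benefit model (eqs. 1-8).
import math

def classification_measures(TP, FP, TN, FN):
    precision   = TP / (TP + FP)
    recall      = TP / (TP + FN)
    specificity = TN / (TN + FP)
    return {"precision": precision,
            "recall": recall,
            "specificity": specificity,
            "g_mean1": math.sqrt(precision * recall),
            "g_mean2": math.sqrt(specificity * recall)}

def normalized_cost(TP, FP, TN, FN,
                    C_ini=0.0, C_u=2.5, C_i=4.55, C_s=6.2, C_f=27.0,
                    d_u=0.25, d_i=0.45, d_s=0.5, M_p=0.5):
    TM = TP + FP + TN + FN                      # total modules
    FM = TP + FN                                # total faulty modules
    # Ecost (eq. 6): estimated cost when fault prediction results guide testing
    Ecost = (C_ini + C_u * (FP + TP)
             + d_i * C_i * (FN + (1 - d_u) * TP)
             + d_s * C_s * (1 - d_i) * (FN + (1 - d_u) * TP)
             + (1 - d_s) * C_f * ((1 - d_u) * FN + (1 - d_u) * TP))
    # Tcost (eq. 7): cost of testing without fault prediction results
    Tcost = (M_p * C_u * TM
             + d_i * C_i * (1 - d_u) * FM
             + d_s * C_s * (1 - d_i) * (1 - d_u) * FM
             + (1 - d_s) * C_f * (1 - d_i) * (1 - d_u) * FM)
    return Ecost / Tcost                        # Ncost < 1 => fault prediction pays off

# hypothetical confusion matrix for one model on one dataset
print(classification_measures(TP=60, FP=20, TN=200, FN=25))
print("Ncost:", normalized_cost(TP=60, FP=20, TN=200, FN=25))
```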

7 Results and analysis

This section reports the results of used ensemble techniques for various performance measures. Further, an analysis of results is performed to draw observations about the ensemble techniques’ performance. The experimental procedure discussed in Section 6 has been used to build and evaluate prediction models. Later, this section discusses the results of the used statistical tests.

7.1 Results for precision, recall, AUC, specificity, and G-means

Tables 3, 4, 5, 6, and 7 show the summarized results of the ensemble techniques for the used datasets. Each table reports the results for one performance measure and contains the min, max, and mean values of each ensemble technique calculated over all datasets. We report only the summarized results due to space constraints. The following observations are drawn from the tables.

  • With respect to the precision measure, Rotation Forest with J48 as the base learner achieved the highest max value and the highest mean value, whereas MultiBoostAB with NB as the base learner yielded the lowest min value.

  • With respect to the recall measure, Rotation Forest with J48 as the base learner again achieved the highest max value and the highest mean value, whereas Rotation Forest with NB as the base learner yielded the lowest min value.

  • With respect to the AUC measure, Decorate with NB as the base learner produced the highest max value and Dagging with NB as the base learner produced the highest mean value. RealAdaBoost with J48 as the base learner produced the lowest min value.

  • With respect to the specificity measure, MultiBoostAB with J48 as the base learner produced the highest mean value, and Rotation Forest with J48 and Dagging with J48 as base learners produced the highest max value. Rotation Forest with NB produced the lowest min value.

  • With respect to the G-mean measures, Rotation Forest with J48 as the base learner produced the highest max and mean values for G-mean 1, while Dagging with NB as the base learner produced the highest max value and Dagging with J48 as the base learner produced the highest mean value for G-mean 2. RealAdaBoost with NB as the base learner produced the lowest min value for G-mean 1 and Dagging with LR produced the lowest min value for G-mean 2.

  • Overall, Rotation Forest outperformed the other used ensemble techniques and yielded better performance. Among the base learners, J48 achieved the best performance.

  • From the tables, it can be observed that for all the considered performance measures the used ensemble techniques produced mean values greater than 0.7, except for the Grading ensemble technique in terms of the AUC measure. The standard deviation (std) values of all ensemble techniques are below 0.10 in most cases for all performance measures, except for the specificity measure, for which all ensemble techniques produced std values above 0.10, with a highest value of 0.223. This high variation in the models' specificity signifies a low true negative rate; it shows that the prediction models missed some true negative cases and classified them as false positives, which increases the software testing effort spent on false positive cases. However, Boehm et al. [121] argued that the verification/testing effort saved by a fault prediction model through the correct identification of one fault is higher than the cost of misclassifying a hundred fault-free modules as fault-prone. Therefore, the high std values of the specificity measure would result in only a marginal increase in testing cost, while the overall software testing cost would still be saved.

Table 3 Summarized results of ensemble techniques for the used fault datasets with respect to precision measure
Table 4 Summarized results of ensemble techniques for the used fault datasets with respect to recall measure
Table 5 Summarized results of ensemble techniques for the used fault datasets with respect to AUC measure
Table 6 Summarized results of ensemble techniques for the used fault datasets with respect to specificity measure
Table 7 Summarized results of ensemble techniques for the used fault datasets with respect to G-mean 1 and G-mean 2 measures

Figure 4 shows box-plots comparing the degree of dispersion, inter-quartile range, outliers, and skewness of the precision, recall, AUC, specificity, and G-means values for all ensemble techniques across all fault datasets. Each box-plot corresponds to one ensemble technique and one performance measure, and the middle line in each box-plot marks the median value. The following observations are drawn from the figure.

  • For the AUC measure, all ensemble techniques performed relatively poorly compared to the other used performance measures.

  • Additionally, the inter-quartile range (the difference between the first and third quartiles) for the AUC measure is larger than for the other used performance measures.

  • The box-plots of the specificity measure are relatively wider than the other box-plots, which shows the variation in the specificity values across datasets. The upper and lower whiskers of the specificity box-plots show that many values deviate largely from the median value.

  • For the other performance measures, namely precision, recall, and G-means, there is not much variation in the values, and all the ensemble techniques achieved relatively better performance on them.

Fig. 4 Boxplot diagrams showing the degree of dispersion, interquartile range, outliers, and skewness for all the used performance measures

7.2 Results of statistical tests

Table 8 shows the results of Friedman’s test for all used ensemble techniques and all five performance measures. It is observed from the table that a statistically significant difference in the performance of at least one pair of ensemble techniques has been found for every used performance evaluation measure; the p-values are lower than the considered significance level (α = 0.05) in all cases. These results show that, for the given software fault datasets, at least one pair of ensemble techniques performed differently. Further, the Wilcoxon signed-rank test is performed to evaluate the pairwise differences among the used ensemble techniques.

Table 8 Results of statistical comparisons of Friedman’s tests among the used ensemble techniques for all five performance measures

Table 9 shows the results of the Wilcoxon signed-rank test for all five performance measures and all used ensemble techniques. Each sub-table corresponds to one performance measure. Due to space constraints, we use abbreviated IDs for the technique names; the full name of each ID is provided in the table caption. A filled black circle indicates a statistically significant performance difference for a pair of ensemble techniques at α = 0.05, thus rejecting the null hypothesis; a hollow circle indicates no significant performance difference at α = 0.05, thus retaining the null hypothesis. A total of 171 pair-wise comparisons among the nineteen ensemble based prediction models (the ensemble technique and base learner combinations listed in the table caption) are reported in Table 9 for each performance measure. The summarized results of the Wilcoxon signed-rank test are given below, and a small sketch of how these tests can be applied is given at the end of this subsection.

  • For the precision measure, a total of 106 pairs show a statistically significant difference in performance, while the other 65 pairs do not.

  • For the recall measure, a total of 138 pairs show a statistically significant difference in performance, while the other 33 pairs do not.

  • For the AUC measure, a total of 133 pairs show a statistically significant difference in performance, while the other 38 pairs do not.

  • For the specificity measure, a total of 128 pairs show a statistically significant difference in performance, while the other 43 pairs do not.

  • For the G-mean 1 measure, a total of 54 pairs show a statistically significant difference in performance, while the other 117 pairs do not.

  • For the G-mean 2 measure, a total of 128 pairs show a statistically significant difference in performance, while the other 43 pairs do not.

Table 9 Results of the statistical comparison of the Wilcoxon signed-rank test among the used ensemble techniques for all five performance measures. A filled circle shows a significant difference and a hollow circle shows no significant difference. (ID1: Dagging(NB), ID2: Dagging(LR), ID3: Dagging(J48), ID4: Decorate(NB), ID5: Decorate(LR), ID6: Decorate(J48), ID7: Grading(NB), ID8: Grading(LR), ID9: Grading(J48), ID10: MultiBoostAB(NB), ID11: MultiBoostAB(LR), ID12: MultiBoostAB(J48), ID13: RealAdaBoost(NB), ID14: RealAdaBoost(LR), ID15: RealAdaBoost(J48), ID16: RotationForest(NB), ID17: RotationForest(LR), ID18: RotationForest(J48), ID19: Ensemble Selection)

These results show that the performance of the ensemble techniques differs statistically significantly from one technique to another. Except for the G-mean 1 measure, for every other performance measure the number of pairs showing a statistically significant performance difference exceeds the number of pairs showing none.
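As an illustration of the statistical procedure used in this subsection, the following sketch runs Friedman's test and the pairwise Wilcoxon signed-rank tests with SciPy on a placeholder score matrix; the random scores stand in for the real per-dataset results.

```python
# Sketch of the omnibus Friedman test plus 171 pairwise Wilcoxon signed-rank tests.
from itertools import combinations
import numpy as np
from scipy.stats import friedmanchisquare, wilcoxon

scores = np.random.rand(28, 19)          # placeholder: 28 datasets x 19 models (e.g., AUC)

stat, p = friedmanchisquare(*[scores[:, j] for j in range(scores.shape[1])])
print(f"Friedman: chi2={stat:.3f}, p={p:.4f}")

alpha, significant = 0.05, 0
for a, b in combinations(range(scores.shape[1]), 2):   # 19 choose 2 = 171 pairs
    _, p_ab = wilcoxon(scores[:, a], scores[:, b])
    significant += p_ab < alpha
print(f"{significant} of 171 pairs differ significantly at alpha={alpha}")
```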

7.3 Results of cost-benefit analysis

Table 10 shows the normalized cost values (Ncost) of each ensemble technique for all the used software fault datasets. For each dataset, the Ncost value is reported in the table; values less than 1.0 indicate the cost-effectiveness of the ensemble technique, implying that if the results of SFP are used with software testing then overall testing cost and effort can be saved. On the other hand, values higher than 1.0 indicate that SFP is not helpful in saving testing cost and effort, and it is suggested not to use SFP models in those cases. From the table, it can be seen that for datasets such as Lucene-2.4, Poi-2.5, Poi-3.0, Xalan-2.5, Xalan-2.6, Xalan-2.7, Xerces-1.3, and Xerces-1.4, the Ncost values are higher than the threshold value (1.0) for all the used ensemble techniques. Therefore, as estimated from this study, it may not be beneficial to use software fault prediction based on the used ensemble techniques along with software testing for these fault datasets. For the other 20 datasets, the Ncost values are lower than the threshold value, and thus it is beneficial to use software fault prediction based on the used ensemble techniques.

Table 10 Results of the cost-benefit analysis (Ncost) of ensemble techniques for all used software fault datasets [ID1: Dagging(NB), ID2: Dagging(LR), ID3: Dagging(J48), ID4: Decorate(NB), ID5: Decorate(LR), ID6: Decorate(J48), ID7: Grading(NB), ID8: Grading(LR), ID9: Grading(J48), ID10: MultiBoostAB(NB), ID11: MultiBoostAB(LR), ID12: MultiBoostAB(J48), ID13: RealAdaBoost(NB), ID14: RealAdaBoost(LR), ID15: RealAdaBoost(J48), ID16: RotationForest(NB), ID17: RotationForest(LR), ID18: RotationForest(J48), ID19: Ensemble Selection]

7.4 Answer to the research questions

Based on the results reported in Tables 3-9, the answers to the research questions are discussed as follows:

RQ1::

Which ensemble technique shows overall best performance for software fault prediction?

Results reported in Tables 3-7 show that in most cases Rotation Forest yielded better performance than the other used ensemble techniques. MultiBoostAB, Decorate, and Dagging produced better performance in some cases. The other ensemble techniques performed relatively poorly.

RQ2::

Is there any statistically significant performance difference between the chosen ensemble techniques?

The results of Friedman’s test and the Wilcoxon signed-rank test reported in Tables 8 and 9 show that, for the majority of cases, pairs of ensemble techniques exhibit a statistically significant performance difference. This pattern holds for all the used performance measures except the G-mean 1 measure.

RQ3::

How do base learners affect the performance of ensemble techniques?

The evidence obtained from the experimental results discussed in Section 7 shows that the performance of the ensemble techniques varies with the base learner used. Overall, J48 as a base learner helped in achieving improved prediction performance, whereas NB as a base learner generally resulted in inferior performance of the ensemble techniques.

RQ4::

For a given software system, how economically effective ensemble techniques are for software fault prediction?

The evidence obtained from Table 10 shows that for twenty out of twenty-eight fault datasets, SFP models based on the used ensemble techniques helped in saving software testing cost and effort; for only eight fault datasets did the used ensemble techniques not help in saving testing cost and effort. From these results, it can be recommended to use SFP models based on the used ensemble techniques to reduce the software testing cost.

In this paper, we have explored the use of seven different ensemble techniques for software fault prediction, with three different classification algorithms used as base learners. The observations drawn from the experimental results and the main advantages of the presented work are summarized as follows.

  • The analysis of the used ensemble techniques showed that no single ensemble technique always provides the best performance across all the fault datasets, and the choice of a particular ensemble technique for SFP depends on the properties of the fault dataset at hand.

  • However, among the used ensemble techniques, Rotation Forest yielded better prediction performance than the others, and J48 as a base learner outperformed the other used base learners. Thus, from this study, it may be recommended to use Rotation Forest with J48 to build SFP models for better prediction performance.

  • The cost-benefit analysis showed that the SFP models based on the ensemble techniques under consideration can help in reducing the software testing cost and can help in optimizing the testing resources.

8 Comparison analysis

A few efforts have been reported earlier regarding the evaluation of ensemble techniques based fault prediction models. A comparison of the reported study with these works on various attributes is tabulated in Table 11. A majority of the previous works listed in Table 11 included the contextual information of the fault prediction model, model building information, the used software fault datasets, and the prediction modeling techniques, along with the experimental findings. It can also be observed from the table that the available works generally focused on a limited set of prediction modeling techniques. In comparison, the reported work examines seven ensemble techniques that have not been explored earlier. Moreover, most of the earlier works used only a few datasets in their experiments, whereas the reported study considers a total of 28 different fault datasets to generalize its findings. Further, we have performed a cost-benefit analysis to assess the economic viability of the used ensemble techniques for SFP, which has not been done in previous studies. When comparing the results of the presented empirical study with the results reported by [31] for SFP, it is found that the ensemble techniques used in this study performed better: the highest AUC value achieved is 0.986, by Decorate with NB as the base learner, in comparison to the 0.96 value achieved by stacking in Wang et al.'s study. In the presented study, the mean AUC value is 0.781 and the minimum AUC value is 0.532, which are comparable with the values reported in Wang et al.'s work.

Table 11 Summary of comparative analysis

9 Threats to the validity

The empirical analysis reported in this paper may suffer from some threats to validity, which are discussed as follows.

Construct Validity:

This validity threat concerns the accuracy of the used software fault datasets. We gathered and used fault datasets reported in the PROMISE data repository, which is available in the public domain. The fault datasets in this repository come from various contributors, and it is a primary repository used for building and evaluating software fault prediction models. This makes us believe that the fault datasets used in the study are accurate and consistent.

Internal Validity:

This validity threat concerns the selection of base learners. The presented study included three different classification algorithms as base learners. The rationale behind the selection of these three algorithms is that previous research found that they performed better than other algorithms for software fault prediction. However, the selection of base learners is orthogonal to the intended contribution. We have used the faulty or non-faulty label of a given software module as the dependent variable due to the nature of the designed experimental study; other dependent variables, such as the number of faults in a software module or the severity of a fault, could also be used.

External Validity:

This validity threat is related to the used statistical tests. We have used Friedman’s test and the Wilcoxon signed-rank test to evaluate the performance differences of the considered ensemble techniques. These are non-parametric tests, which do not impose any conditions on the distribution of the underlying data sample. The selection of these tests was made according to the data sample available in the presented study; however, other statistical tests could be used depending on the given data. We have used datasets corresponding to different software projects to generalize the conclusions drawn in the presented study.

10 Conclusions and future work

In this work, an extensive experimental analysis of ensemble techniques for SFP has been carried out. The study presented an evaluation of seven ensemble techniques using three different classification algorithms as base learners on twenty-eight software fault datasets; in total, 532 prediction models have been built and evaluated for fault prediction. Overall, we found that ensemble techniques are useful modeling techniques and can thus be considered for building effective fault prediction models. Among the used ensemble techniques, Rotation Forest yielded better prediction performance than the other ensemble techniques, and J48 as the base learner worked most effectively. Further, we found that the used ensemble techniques show statistically significant performance differences for the used performance measures. The work reported in this paper can help the research community build accurate fault prediction models by selecting an appropriate ensemble technique.

In the future, we aim to develop a hybrid ensemble technique based fault prediction model based on the findings of the reported study. Additionally, future work includes the assessment of ensemble techniques on fault datasets drawn from other software systems with different software metrics to generalize the findings of the work.