1 Introduction

Software Defect Prediction (SDP) helps to improve the quality of a software product by assessing the fault-proneness of modules and forecasting which parts of the software will require more testing and quality assurance (QA) resources [16]. It thereby reduces the testing cost and the overall development cost. Machine learning (ML) techniques are finding wide application in SDP [2, 4, 6, 7, 15, 17]. However, machine learning methods alone yield sub-optimal results because of the class imbalance present in defect datasets. Class imbalance refers to the situation in which one of the classes in the dataset outnumbers the others; the class with the higher number of instances is called the majority class and the rest are called minority classes. This imbalanced nature of defect data negatively impacts the accuracy of ML-based SDP classifiers [3, 8, 13]. The literature survey shows that ensemble learning has better prediction power for software defect prediction using historic data from past projects under the condition of class imbalance [24, 25, 28]. Galar et al. (2011) [5] and Rathore et al. (2017) [18] advocated that ensemble-based classifiers have a built-in capability to handle data imbalance.

1.1 Motivation

Learning from imbalanced datasets is an open problem. The ensembles proposed in the literature are standard random forest, bagging, or boosting algorithms; none of these techniques is customized to the application or to the nature of the dataset. This work contributes to improving the prediction power of classifiers by using a customized heterogeneous stacked ensemble classification algorithm.

1.2 Contribution

This work contributes a customized stacked ensemble classifier for the task of SDP when the data suffers from class imbalance. The proposed model is compared with state-of-the-art techniques to find the best classifier, and statistical evidence is presented to support the claim that the proposed model is the best-performing SDP classifier among those compared.

1.3 Organization

The paper is organized as follows. Section 2 covers the current state of the art in handling class imbalance using ensembles, along with a review of the literature. The research methodology is explained in Sect. 3, together with the research questions and the experimental setup. In Sect. 4, the datasets and evaluation metrics used in the experimental work are described. In Sect. 5, the experimental results are reported and analysed to answer the research questions. The conclusions are drawn in Sect. 6.

2 Related works

This section highlights the contributions made by various researchers in the field of SDP using machine learning (ML) algorithms and ensemble approaches to the class-imbalance problem, with the aim of obtaining accurate defect prediction models. Table 1 shows the current trends in tackling the class imbalance issue in SDP. The table lists the year of publication of each referenced work, the technique used, the dataset(s) and feature space considered, and the performance measurement criteria adopted. The last column of the table contains the observations drawn by the authors of the present work.

Table 1 State-of-the-art: SDP with ML and SDP with class imbalance

Corresponding to each study, we have recorded our observations in the last column of Table 1. After reviewing the literature along multiple dimensions, we identified that the results of the existing studies are sub-optimal and that all the ensembles used are existing, traditional ensembles.

Some further observations made from the literature review are: (1) the majority of research in the field of SDP has been carried out using publicly available datasets, namely the NASA Metrics Data Program and the PROMISE Data Repository, which together account for almost 67% of the total research work carried out in the past three decades; (2) the most popular evaluation metrics among software practitioners for SDP evaluation are AUC, ROC, and accuracy; (3) ANN and SVM are the two most popular classifiers for software defect prediction; (4) class imbalance is a major hindrance to classifier performance; and (5) ensembles are robust and possess a built-in capacity to deal with the class imbalance of defect datasets.

In the next section, the research methodology adopted for this paper is explained and the research gaps are formulated as well-formed research questions.

3 Research methodology

In this section, we report the methodology adopted to carry out the research work. First, we formulate the research questions to steer the research in a systematic way. Then, we describe the configuration of the proposed stacked ensemble and its working algorithm. The experimental set-up adopted for this work, along with the parameter settings for the experimental model, is also discussed in detail.

3.1 Research questions

To steer the research in a systematic way, we address the following research questions:

  • RQ1. Does the proposed heterogeneous stacked ensemble empirically outperform the existing single classifiers?

    This RQ deals with the comparison of the proposed model with traditional models in order to ensure that the proposed model has the potential to predict buggy modules effectively. For this purpose, the five most popular classifiers from the literature are selected for the comparative study: artificial neural networks, nearest neighbour, tree-based classifiers, Naïve Bayes, and support vector machines. The reasons for selecting these classification algorithms are: (1) their popularity in SDP [20], (2) their effective prediction power in the SDP domain [23], and (3) the fact that they are the base classifiers of our proposed ensemble. For the comparative analysis in this respect, the study of Goyal and Bhatia (2020) [6] is selected.

  • RQ2. Does the proposed customized stacked ensemble empirically outperform the state-of-the-art ensemble based SDP classifiers?

    This RQ investigates the prediction power of the proposed model in comparison with state-of-the-art ensemble-based SDP models. For the comparative analysis, both homogeneous and heterogeneous ensemble-based classifiers are selected: the study of Balogun et al. [1] is used for comparison with homogeneous ensemble-based SDP classifiers, and the study of Khuat et al. [11] for comparison with heterogeneous ensemble-based SDP classifiers.

  • RQ3. Are the answers to the above mentioned RQs statistically valid?

    This is the most crucial RQ, as it confirms that the answers to the above RQs are valid. Appropriate statistical tests are selected and conducted to obtain statistical evidence. The Friedman test is found suitable and is therefore conducted to provide the statistical proof for the study.

3.2 Proposed stacked ensemble classifier

We propose a stacking-based ensemble combining heterogeneous base learning classifiers. We use the five most popular SDP classifiers from the literature, namely support vector machines (SVM), artificial neural networks (ANN), Naïve Bayes (NB), nearest neighbour, and decision trees (DT) [4, 6, 7, 17], as base learners. A neural network is then selected as the meta-model; it takes the predictions made by the base classifiers (the 'Level-1' data) as inputs and returns the final predicted outputs.

The choice of base learners and meta-model follows directly from the literature survey of work contributed to the SDP domain over the past three decades [23]. The neural network is selected as the meta-model so as to combine the predictions from the base classifiers non-linearly [9], exploiting this synergy to produce the most accurate final prediction of whether a candidate module is 'buggy' or 'clean'.

The proposed stacked ensemble-based SDP classifier is modelled as in Fig. 1, and it works according to the algorithm stated below.

Fig. 1

Proposed stacked ensemble SDP model
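
As an illustration of this configuration, the following Python/scikit-learn sketch assembles the same two-level design: five heterogeneous base learners at level-1 and a small neural network as the level-2 meta-learner. It is a minimal reconstruction for readability, not the authors' MATLAB implementation, and all hyperparameters shown are placeholder assumptions.

```python
# Illustrative sketch of the proposed two-level stacking (not the authors'
# MATLAB implementation). Level-1: five heterogeneous base learners;
# level-2: a small neural network meta-learner trained on their predictions.
from sklearn.ensemble import StackingClassifier
from sklearn.svm import SVC
from sklearn.neural_network import MLPClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

base_learners = [
    ("svm", SVC(kernel="rbf", probability=True)),
    ("ann", MLPClassifier(hidden_layer_sizes=(16,), max_iter=1000)),
    ("nb", GaussianNB()),
    ("knn", KNeighborsClassifier(n_neighbors=5)),
    ("tree", DecisionTreeClassifier(criterion="entropy")),
]

# The meta-learner sees the level-1 class probabilities produced by internal
# cross-validation, mirroring the stacking scheme sketched in Fig. 1.
stacked_ensemble = StackingClassifier(
    estimators=base_learners,
    final_estimator=MLPClassifier(hidden_layer_sizes=(8,), max_iter=1000),
    stack_method="predict_proba",
    cv=10,
)
```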

3.3 Experimental set-up

In this paper, MATLAB™ R2019a is used for carrying out the processing and computational tasks. It is installed on Windows™ 10 Pro running on an Intel® Core™ i5-8265U CPU with 8 GB of RAM. All of the experiments, from data pre-processing through fitting the classifiers to their validation, are executed on the same hardware and software platform. The performance of the proposed classifier is measured over the five selected datasets for each selected performance evaluation criterion, namely AUC, ROC, and accuracy. The data is partitioned into training and testing subsets using k-fold cross-validation with k = 10: the training subset is used to train the stacked ensemble classifier, which is then tested on the corresponding testing subset. All the experiments are performed on the above experimental set-up using the classifiers at the two levels of the model shown in Table 2 (a minimal sketch of the cross-validation protocol is given below).
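
The sketch below continues the scikit-learn illustration of Sect. 3.2 (using the `stacked_ensemble` object defined there). The synthetic stand-in data, the stratified splitting, and the random seed are assumptions made only for illustration; the actual experiments use the NASA datasets in MATLAB.

```python
# Sketch of the evaluation protocol: 10-fold cross-validation scored with
# AUC and accuracy. `stacked_ensemble` is the StackingClassifier from the
# sketch in Sect. 3.2.
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_validate

# Stand-in for one defect dataset: 21 metrics, imbalanced two-class labels
# (replace with the real metric matrix X and buggy/clean labels y).
X, y = make_classification(n_samples=500, n_features=21, weights=[0.9, 0.1],
                           random_state=0)

cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
scores = cross_validate(stacked_ensemble, X, y, cv=cv,
                        scoring=["roc_auc", "accuracy"])
print("mean AUC     :", scores["test_roc_auc"].mean())
print("mean accuracy:", scores["test_accuracy"].mean())
```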

Table 2 Description of 5 base-learners and 1 meta-learner (level-wise)

As noted above, the choice of base learners and meta-model follows from the literature survey [23]. At level-2 of the stacking, a neural network is used because of its robust capability to learn non-linear relationships among its inputs [9], so that the predictions of the base classifiers are combined non-linearly into the final 'buggy' or 'clean' decision. The parameter settings for the proposed model are given in Table 3.

Table 3 Parameter settings for base-learners and meta-learner

For the comparative analysis, rigorous experiments are conducted following the same process, including the parameter settings, tools, and environment deployed by the selected studies, to ensure a fair comparison of performance [1, 6, 11]. All eight models (5 traditional ML SDP models + 3 ensemble-based classifiers) are synthesized, and the experiments are repeated for all five datasets. The performance is then recorded over the three selected evaluation criteria (ROC, AUC, accuracy) and the comparison is made statistically. The SDP models selected for the comparative analysis are listed in Table 4.

Table 4 Details of 8 SDP Classifiers Selected for Comparative Analysis

3.4 Mathematical background

The Naïve Bayes classifier performs classification using probability theory. Bayes' rule is applied to predict whether a module is buggy or not: the classifier assigns a test data-point to the class with the highest posterior probability for that data-point. For the defect prediction problem, let the vector x denote the attribute set and let y be a set with two elements, {buggy, clean}, denoting the classes to which each data-point uniquely belongs. The Naïve Bayes classifier predicts that a specific module with attribute vector x belongs to the 'buggy' class only if Eq. (1) is satisfied; otherwise, it predicts that the module belongs to the 'clean' class.

$$P(buggy \mid \mathbf{x}) \ge P(clean \mid \mathbf{x})$$
(1)

In Eq. (1), \(P(buggy \mid \mathbf{x})\) denotes the posterior probability of the class buggy after having seen x, and \(P(clean \mid \mathbf{x})\) denotes the posterior probability of the class clean after having seen x. Equation (1) shows that, for a two-class classification problem, whichever class has the highest posterior probability for the given x is the one predicted by the classifier. The posterior probability of a class can be computed using Bayes' rule, as given in Eq. (2); Eq. (2) can be rewritten as Eq. (3) for the class buggy and as Eq. (4) for the class clean.

$$Posterior = \frac{Prior \times Likelihood}{{Evidence}}$$
(2)
$$P(buggy \mid \mathbf{x}) = \frac{p(\mathbf{x} \mid buggy) \times P(buggy)}{p(\mathbf{x})}$$
(3)

where \(p(\mathbf{x} \mid buggy)\) denotes the class likelihood: the probability of observing x as input when it is known to belong to the buggy class. The prior probabilities \(P(buggy)\) and \(P(clean)\) satisfy the constraints in Eqs. (5) and (6).

$$P(clean \mid \mathbf{x}) = \frac{p(\mathbf{x} \mid clean) \times P(clean)}{p(\mathbf{x})}$$
(4)

where \(p(\mathbf{x} \mid clean)\) denotes the class likelihood: the probability of observing x as input when it is known to belong to the clean class. The prior probabilities again satisfy the constraints in Eqs. (5) and (6).

$$P(buggy) \ge 0, \quad P(clean) \ge 0$$
(5)
$$P(buggy) + P(clean) = 1$$
(6)

The evidence, \(p(\mathbf{x})\), is the marginal probability of observing x, regardless of whether it belongs to the buggy class or the clean class. It can be computed as in Eq. (7).

$$p(\mathbf{x}) = p(\mathbf{x} \mid buggy) \times P(buggy) + p(\mathbf{x} \mid clean) \times P(clean)$$
(7)

Equation (2), which represents Bayes' rule, is the basis of the Naïve Bayes classifier. By substituting Eqs. (3) and (4) into Eq. (1), the prediction of whether a given data-point belongs to the 'buggy' class can be made.
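
A short numeric sketch of Eqs. (1)-(7) is given below. The prior and likelihood values are made-up numbers chosen only to illustrate the computation; in practice they would be estimated from the training data.

```python
# Worked numeric sketch of Eqs. (1)-(7); the priors and class likelihoods
# are illustrative values, not estimates from the defect datasets.
p_buggy, p_clean = 0.2, 0.8            # priors, satisfying Eqs. (5)-(6)
lik_x_buggy, lik_x_clean = 0.09, 0.01  # p(x|buggy), p(x|clean) for one module x

evidence = lik_x_buggy * p_buggy + lik_x_clean * p_clean   # Eq. (7)
post_buggy = lik_x_buggy * p_buggy / evidence              # Eq. (3)
post_clean = lik_x_clean * p_clean / evidence              # Eq. (4)

# Decision rule of Eq. (1): predict 'buggy' iff P(buggy|x) >= P(clean|x)
prediction = "buggy" if post_buggy >= post_clean else "clean"
print(round(post_buggy, 3), round(post_clean, 3), prediction)  # 0.692 0.308 buggy
```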

K-Nearest Neighbours (K-NN) is another statistical classification algorithm. It uses the similarity between data-points to predict the class. In our experimental set-up, we use the Euclidean distance, which can be computed between any two data-points \(x_{i}\) and \(x_{j}\) as in Eq. (8), where d indexes the n attributes. As before, x denotes the attribute vector and y = {buggy, clean} denotes the classes to which each data-point uniquely belongs.

$$D(x_{i}, x_{j}) = \sqrt{\sum_{d = 1}^{n} (x_{id} - x_{jd})^{2}}$$
(8)

Assume buggy is denoted by '+1' and clean by '−1', hence y = {+1, −1}. For an instance \(x_{q}\), K-NN makes the classification using Eq. (9) after computing the k nearest neighbours of \(x_{q}\) using Eq. (8), where \(N_{k}\) denotes the set of the k neighbours of \(x_{q}\) (a minimal sketch of this rule follows Eq. (9)).

$$\hat{y}_{q} = \operatorname{sign}\left( \sum_{x_{i} \in N_{k}} y_{i} \right)$$
(9)
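
The sketch below implements this rule directly with NumPy; the toy training points, labels, and the choice k = 3 are illustrative assumptions, not part of the experimental set-up.

```python
# Minimal sketch of the k-NN rule in Eqs. (8)-(9): Euclidean distance,
# then the sign of the summed neighbour labels (+1 = buggy, -1 = clean).
import numpy as np

def knn_predict(X_train, y_train, x_q, k=5):
    # Eq. (8): Euclidean distance from the query point to every training point
    dists = np.sqrt(((X_train - x_q) ** 2).sum(axis=1))
    # indices of the k nearest neighbours N_k
    nearest = np.argsort(dists)[:k]
    # Eq. (9): majority vote as the sign of the summed labels (odd k avoids ties)
    return int(np.sign(y_train[nearest].sum()))

# Toy usage with made-up 2-metric modules
X_train = np.array([[1.0, 2.0], [1.2, 1.9], [8.0, 9.0], [7.5, 8.5], [0.9, 2.1]])
y_train = np.array([-1, -1, +1, +1, -1])          # -1 = clean, +1 = buggy
print(knn_predict(X_train, y_train, np.array([1.1, 2.0]), k=3))  # -> -1 (clean)
```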

Decision tree-based classifiers are built using the Classification and Regression Trees (CART) algorithm. Decision trees are hierarchical, non-parametric, supervised machine learning models. A tree comprises internal nodes carrying decision functions and external leaves. In our experiments, we use entropy as the measure of impurity, which in turn measures the goodness of a split. Consider a node 'a' of the classification tree: let \(N_{a}\) denote the number of instances that reach node 'a', and let \(N_{a}^{buggy}\) and \(N_{a}^{clean}\) denote the numbers of instances in \(N_{a}\) that belong to the classes 'buggy' and 'clean', respectively. If an instance reaches node 'a', its probability of being 'buggy' is given by Eq. (10) and its probability of being 'clean' by Eq. (11). The entropy for the two-class classification problem is computed as in Eq. (12).

$$p_{a}^{buggy} = \frac{{N_{a}^{buggy} }}{{N_{a} }}$$
(10)
$$p_{a}^{clean} \; = \;\frac{{N_{a}^{clean} }}{{N_{a} }}$$
(11)
$$\operatorname{Entropy}(a) = -\left( p_{a}^{buggy}\log p_{a}^{buggy} + p_{a}^{clean}\log p_{a}^{clean} \right)$$
(12)
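
A minimal sketch of Eqs. (10)-(12) is shown below; it assumes a base-2 logarithm and adopts the usual convention that 0·log 0 = 0, neither of which is stated explicitly in the text.

```python
# Sketch of Eqs. (10)-(12): impurity of a tree node 'a' from the counts of
# buggy and clean instances that reach it (base-2 logarithm assumed).
import math

def node_entropy(n_buggy, n_clean):
    n = n_buggy + n_clean
    p_buggy, p_clean = n_buggy / n, n_clean / n      # Eqs. (10)-(11)
    entropy = 0.0
    for p in (p_buggy, p_clean):                     # Eq. (12), with 0*log(0) = 0
        if p > 0:
            entropy -= p * math.log2(p)
    return entropy

print(node_entropy(5, 5))    # 1.0 -> maximally impure node
print(node_entropy(10, 0))   # 0.0 -> pure node
```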

Artificial neural networks are implemented with the standard feed-forward, error back-propagation algorithm. For n-feature input data X = <x1, x2, …, xn>, there are n input neurons. With a sigmoid activation function, the output \(\hat{y}\) of a neuron is computed using Eq. (13). Features are fed forward from the input layer to the hidden layer and then from the hidden layer to the output layer. The output computed at the output neurons is compared with the actual output, and the error is computed by Eq. (14) as half of the sum of squared differences between the actual and predicted outputs. The error is then back-propagated to update the weights according to Eq. (15); learning proceeds in this way to minimize the error.

$$\hat{y} = \operatorname{sigmoid}\left( \sum_{i = 1}^{n} w_{i} x_{i} + w_{0} \right)$$
(13)

where \(w_{i}\) denotes the weight on the ith input and \(w_{0}\) denotes the bias;

$$error = \frac{1}{2}\sum_{m} \sum_{i = 1}^{n} (y_{i} - \hat{y}_{i})^{2}$$
(14)

where m indexes the output neurons.

The weights are then updated as Δw = η · error · input signal, i.e.,

$$\Delta w_{i} = \eta \cdot \frac{1}{2}\sum_{m} \sum_{i = 1}^{n} (y_{i} - \hat{y}_{i})^{2} \cdot x_{i}$$
(15)

where η denotes the learning rate.
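
The sketch below illustrates Eqs. (13)-(15) for a single sigmoid neuron: a forward pass, the squared-error term, and a gradient-based weight update with learning rate η. All numbers are made-up; the experiments use a full multi-layer network, and the update here uses the chain-rule gradient of the squared error rather than the literal η · error · input shorthand of Eq. (15).

```python
# One-neuron sketch of Eqs. (13)-(15): forward pass through a sigmoid unit,
# squared-error term, and a gradient-based weight update.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x = np.array([0.5, 1.5, -0.3])     # toy metric vector of one module
w = np.array([0.1, -0.2, 0.4])     # weights w_i
w0 = 0.05                          # bias
y, eta = 1.0, 0.1                  # actual label (buggy) and learning rate

y_hat = sigmoid(np.dot(w, x) + w0)          # Eq. (13)
error = 0.5 * (y - y_hat) ** 2              # Eq. (14), single output neuron

# Chain-rule gradient of the squared error through the sigmoid,
# used here in place of the shorthand of Eq. (15)
delta = (y - y_hat) * y_hat * (1.0 - y_hat)
w = w + eta * delta * x                     # weight update
w0 = w0 + eta * delta                       # bias update
print(y_hat, error, w, w0)
```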

The support vector machine is based on Vapnik's theory of maximum-margin methods. We use the RBF kernel setting for the SVM. For n instances denoted <Xi, yi>, it finds the optimal separating hyperplane between the two classes (buggy as +1, clean as −1) by finding w1 and w2 that satisfy Eq. (16).

$$y\left( w_{2} x + w_{1} \right) \ge 1$$
(16)

The SVM solves the optimal hyperplane problem using Lagrangian multipliers. First, a mapping into a higher-dimensional space is achieved with the function ϕ, as shown in Eq. (17).

$$y = \mathbf{w}^{T}\phi(x) + c$$
(17)

where w is the weight vector and c is a scalar.

The SVM then has to optimize Eq. (18):

$$\text{Minimize} \quad \frac{1}{2}\mathbf{w}^{T}\mathbf{w} + \rho\,\frac{1}{2}\sum_{i} error_{i}^{2}$$
(18)
$$\text{subject to} \quad y_{i} = \mathbf{w}^{T}\phi(x_{i}) + c + error_{i}$$

where \(\rho\) denotes the cost parameter.

After solving this, the prediction made by the SVM classifier can be expressed in terms of the kernel, as in Eq. (19).

$$\hat{Y} = \sum \left( \alpha - \alpha^{T} \right) \cdot K\left( x_{centre}, x \right) + b$$
(19)

In Eq. (20), \(K(x_{centre}, x)\) denotes the kernel based on the radial basis function (RBF). In our experiments, we use the RBF kernel for the SVM, where the centre and radius are defined by the user.

$$K\left( x_{centre}, x \right) = e^{- \frac{\left| x_{centre} - x \right|^{2}}{2\,(radius)^{2}}}$$
(20)
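
Eq. (20) can be evaluated directly, as in the following sketch; the centre point, the query point, and the radius are arbitrary illustrative values.

```python
# Direct evaluation of the RBF kernel in Eq. (20).
import numpy as np

def rbf_kernel(x_centre, x, radius):
    diff = np.asarray(x_centre, dtype=float) - np.asarray(x, dtype=float)
    return np.exp(-np.dot(diff, diff) / (2.0 * radius ** 2))

print(rbf_kernel([1.0, 2.0], [1.5, 2.5], radius=1.0))  # ~0.7788
```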

4 Dataset and evaluation criteria used

This section highlights the datasets and metrics used for the experimentation. The performance evaluation metrics chosen to measure the performance of the proposed stacked ensemble-based SDP model, and to support the comparative analysis among the selected models, are described.

4.1 Dataset and software metrics

The datasets used for the experimental study are NASA defect datasets, which are publicly available in the PROMISE repository. The data metrics are collected from NASA projects. The experiment is designed using five datasets: CM1, KC1, KC2, PC1, and JM1. McCabe and Halstead feature extractors are used to collect the data [19, 21]. Table 5 lists, for each dataset, the total number of instances and the numbers of buggy and clean instances. The datasets comprise the most popular static code metrics; all five datasets possess 21 metrics and 1 response variable (a short sketch of loading one of the datasets and checking its class imbalance is given below).
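
As an illustration, the snippet below loads one of the datasets and reports its class imbalance. The file name `cm1.csv` and the `defects` label column are assumptions about a local CSV export of the PROMISE/NASA data and may need adjusting to the actual files.

```python
# Illustrative check of the class imbalance in one dataset (hypothetical
# local CSV export with a "defects" label column).
import pandas as pd

df = pd.read_csv("cm1.csv")
n_buggy = (df["defects"].astype(str).str.lower() == "true").sum()
print("modules:", len(df), "| buggy:", n_buggy,
      "| defective ratio: {:.1%}".format(n_buggy / len(df)))
```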

Table 5 Dataset description [2]

4.2 Performance evaluation criteria

The performance of the proposed stacked ensemble is evaluated using the widely accepted evaluation metrics, namely the confusion matrix, ROC, AUC, accuracy, and recall [2, 7, 10, 20, 26, 27]. These are defined below; an illustrative computation of these measures is given after Fig. 2.

  • A confusion matrix is a matrix whose individual cells contain the information necessary for evaluating the performance of a classifier.

    As shown in Fig. 2a, the class 'buggy' is considered the positive class and the class 'clean' the negative class. The term 'true positive' refers to the count of modules that are actually buggy and are classified as buggy by the classifier. The term 'true negative' refers to the count of modules that are clean in the actual dataset and are predicted as clean by the classifier. Two further terms follow: 'false positive' refers to the count of modules that belong to the clean class in the actual dataset but are predicted as buggy by the classifier under consideration, and 'false negative' refers to the modules that are buggy in the actual dataset but are predicted as clean.

  • The sensitivity (true positive rate, TPR) and the specificity (1 − false positive rate, i.e. 1 − FPR) are computed as in Eqs. (21) and (22). The true positive rate, TPR, can be thought of as the hit rate: it accounts for the proportion of buggy modules that we correctly predict. The false positive rate, FPR, is the proportion of clean modules that we wrongly accept as buggy.

    $$sensitivity\;(\text{or recall}) = \frac{\text{true positive}}{\text{true positive} + \text{false negative}}$$
    (21)
    $$specificity = \frac{\text{true negative}}{\text{true negative} + \text{false positive}}$$
    (22)
  • The receiver operating characteristic (ROC) curve is a plot of TPR (y-axis) against FPR (x-axis) (see Fig. 2b and c). The closer the curve gets to the upper left corner, the better the classifier's performance; when comparing classifiers, the curve lying above the other indicates the better classifier.

  • The area under the ROC curve (AUC) gives the performance of the classifier averaged over all decision thresholds. AUC = 1 is considered ideal.

  • Accuracy is computed as Eq. (23)

    $$Accuracy = \frac{\text{true positive} + \text{true negative}}{\text{true positive} + \text{false positive} + \text{true negative} + \text{false negative}}$$
    (23)
Fig. 2

a Confusion matrix. b ROC. c Multiple ROCs
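
As referenced above, the following sketch shows how Eqs. (21)-(23) follow from the four confusion-matrix counts; the counts are made-up numbers, not results from the experiments.

```python
# Eqs. (21)-(23) computed from confusion-matrix counts (illustrative values).
tp, fn = 30, 10     # buggy modules predicted buggy / predicted clean
tn, fp = 150, 12    # clean modules predicted clean / predicted buggy

sensitivity = tp / (tp + fn)                   # Eq. (21): recall / TPR
specificity = tn / (tn + fp)                   # Eq. (22): 1 - FPR
accuracy = (tp + tn) / (tp + fp + tn + fn)     # Eq. (23)

print(sensitivity, round(specificity, 3), round(accuracy, 3))  # 0.75 0.926 0.891
```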

5 Result analysis and discussion

In this section, we report the results recorded in the experimental study and answer the research questions (RQs) analytically. All three RQs are discussed one by one in the following sub-sections.

5.1 Finding the answer to RQ1-

RQ1. Does the proposed heterogeneous stacked ensemble empirically outperform the existing single classifiers?

To answer RQ1, we first recorded the performance of all six classification algorithms selected in this study (ANN, SVM, NB, KNN, tree, and the stacked ensemble) on all five datasets, in terms of AUC and accuracy, as reported in Tables 6 and 7 respectively. The ROC curves are also reported for comparison in Fig. 3.

Table 6 Performance comparison of stacked ensemble classifier with base classifiers (in AUC)
Table 7 Performance comparison of stacked ensemble classifier with base classifiers (in ACCURACY)
Fig. 3

ROC Curve for all six classifiers over five datasets

From the results reported in Tables 6 and 7, it can be seen that the proposed model performs better than the base classifiers (the highest values are shown in bold). The inferences drawn are:

  i. It is better than ANN by 4%, SVM by 10%, NB by 11%, Tree by 17% and KNN by 20% in terms of AUC.

  ii. It is better than ANN by 2%, SVM by 3%, NB by 3%, Tree by 4% and KNN by 2% in terms of Accuracy.

  iii. The proposed model shows the best ROC curve among all six classifiers.

Further, the results recorded for the AUC and accuracy measures of the candidate classifiers are plotted as box plots for better visualization and analysis (shown in Fig. 4a and b respectively). From these figures, the classifiers can easily be analysed comparatively: a technique having a high median value with fewer outliers performs better than the other classification algorithms. It is evident from the figures and plots that the proposed model outperforms all 5 base classifiers on the AUC, accuracy, and ROC metrics.

Fig. 4

a AUC box plots for all six classifiers over five datasets. b Accuracy box plots for all six classifiers over five datasets

The average performance of all 5 base classifiers and of the proposed stacked ensemble-based SDP classifier is plotted in Fig. 5. It can be inferred that the proposed stacked ensemble model outperforms the base classifiers on average by 12% in AUC and by 8% in accuracy.

Fig. 5

Proposed Stacked Ensemble Model outperforms Base Classifiers by 12% in AUC and by 8% in Accuracy

ANSWER to RQ1- From the results and analysis, YES! The proposed model outperforms the single base classifiers empirically.

5.2 Finding the answer to RQ2-

RQ2- Does the proposed customized stacked ensemble empirically outperform the state-of-the-art ensemble based SDP classifiers?

To answer this RQ, we need to compare the performance of the proposed stacked ensemble against the state-of-the-art ensemble-based SDP models. We selected two empirical studies, covering 3 different ensemble-based SDP models, for the comparative analysis: (1) Balogun et al. (2020) [1], who deployed the standard ensemble techniques of bagging and boosting, and (2) Khuat et al. (2021) [11], who deployed heterogeneous ensembles using 9 base classifiers. Tables 8 and 9 report the comparative analysis between the proposed model and the state-of-the-art models in terms of AUC and accuracy respectively.

Table 8 Performance comparison of stacked ensemble classifier with state-of-the-art ensembles (in AUC)
Table 9 Performance comparison of stacked ensemble classifier with state-of-the-art ensembles (in ACCURACY)

It is clear from the results recorded in Tables 8 and 9 that the stacked ensemble performs better than the state-of-the-art ensemble classifiers (the highest values are shown in bold). Further, for comparison, the ROC curves of all 4 SDP models are plotted in Fig. 6.

Fig. 6

ROC Curve for all four classifiers over five datasets

The inferences drawn are:

  i. The proposed stacked ensemble based classifier outperforms the Bagging, Boosting and heterogeneous models by 6%, 5%, and 2% respectively in terms of AUC.

  ii. The proposed stacked ensemble based classifier outperforms the Bagging, Boosting and heterogeneous models by 10%, 11%, and 5% respectively in terms of Accuracy.

  iii. The best ROC curve among all 4 classifiers is shown by the proposed stacked model.

Further, the results recorded for the AUC and accuracy measures of the candidate classifiers are plotted as box plots for better visualization and analysis (shown in Fig. 7a and b respectively). From these figures, the classifiers can easily be analysed comparatively: a technique having a high median value with fewer outliers performs better than the other classification algorithms. It is evident from the figures and plots that the proposed model outperforms all the state-of-the-art ensemble methods on the AUC, accuracy, and ROC metrics.

Fig. 7

a AUC box plots for all four classifiers over five datasets. b Accuracy box plots for all four classifiers over five datasets

The average performance of the 3 ensembles from the literature and of the proposed stacked ensemble-based SDP classifier is plotted in Fig. 8. It can be inferred that the proposed stacked ensemble model outperforms the state-of-the-art ensemble classifiers on average by 4% in AUC and by 9% in accuracy.

Fig. 8

Proposed stacked ensemble model outperforms the state-of-the-art ensemble classifiers by 4% in AUC and by 9% in Accuracy

ANSWER to RQ2- From the results and analysis, YES! The proposed model outperforms the state-of-the-art ensemble classifiers empirically.

5.3 Finding the answer to RQ3

RQ3. Are the answers to the above mentioned RQs statistically valid?

From the above experimental results, analysis, and inferences, the proposed stacked ensemble-based classifier is the best SDP classifier among all 8 SDP models selected from the literature.

Before drawing any final conclusions, we statistically validate the inferences drawn in the above two subsections and seek statistical evidence for the answers reported for RQ1 and RQ2. The Friedman test is found suitable for a non-parametric comparison among more than two samples [14, 23]; a sketch of how such a test can be applied to the recorded scores is given below.
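
The sketch uses SciPy's implementation of the Friedman test; the score vectors (one value per dataset for each classifier) are placeholders, not the values reported in Tables 6-9.

```python
# Sketch of the Friedman test on recorded scores: one vector per classifier,
# one entry per dataset (placeholder numbers only).
from scipy.stats import friedmanchisquare

clf_a = [0.78, 0.80, 0.82, 0.85, 0.72]   # e.g. AUC of classifier A on CM1..JM1
clf_b = [0.74, 0.77, 0.79, 0.81, 0.70]
clf_c = [0.70, 0.73, 0.75, 0.78, 0.66]

stat, p_value = friedmanchisquare(clf_a, clf_b, clf_c)
print("Friedman statistic:", stat, "| p-value:", p_value)
# Reject H0 at the 95% confidence level when p_value < 0.05
```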

In respect of RQ1, we assume the null hypothesis H0: "The performance reported by the stacked ensemble and the performance reported by the other 5 base classifiers are not different", and the alternative hypothesis H1: "The performance reported by the stacked ensemble and the performance reported by the other 5 base classifiers are different".

We conducted the test at the 95% confidence level. The results of the statistical test are shown in Fig. 9. It can be seen clearly that the p-value is 0.0058, which is smaller than 0.05.

Fig. 9

Friedman test with p-value 0.0058

This means that the null hypothesis H0 is rejected and the alternative hypothesis H1 is accepted. It can therefore be inferred that the answer reported to RQ1 in Sect. 5.1, namely that the proposed stacked ensemble outperforms the single base classifiers over all datasets, is statistically validated.

In respect of RQ2, we assume the null hypothesis H0: "The performance reported by the stacked ensemble and the performance reported by the other 3 state-of-the-art ensemble classifiers are not different", and the alternative hypothesis H1: "The performance reported by the stacked ensemble and the performance reported by the other 3 state-of-the-art ensemble classifiers are different".

We conducted the test at the 95% confidence level. The results of the statistical test are shown in Fig. 10. It can be seen clearly that the p-value is 0.0029, which is smaller than 0.05.

Fig. 10

Friedman test with p-value 0.0029

This means that the null hypothesis H0 is rejected and the alternative hypothesis H1 is accepted. It can therefore be inferred that the answer reported to RQ2 in Sect. 5.2, namely that the proposed stacked ensemble outperforms the state-of-the-art ensemble classifiers over all datasets, is statistically validated.

ANSWER to RQ3- From the results and analysis, YES! The answers to RQ1 and RQ2 are statistically valid.

6 Conclusion

In this paper, we proposed a novel heterogeneous ensemble utilizing the stacking methodology and deployed it for effective software defect prediction (SDP). The proposed model is robust enough to handle defect datasets with class imbalance issues. Software defect prediction plays an important role in targeting testing efforts at faulty modules and hence in saving time and cost. From the literature, it is evident that a plethora of ML-based SDP models have been contributed by researchers. The imbalanced nature of defect datasets has always been a hurdle to achieving good classification accuracy: class imbalance means that the number of instances belonging to one class outnumbers the number of instances of the other class, and the outnumbering instances bias the classification algorithms and hinder their performance. For this reason, traditional ML-based SDP models yield sub-optimal results when trained on imbalanced datasets. The studies reported in the literature deal with class imbalance using ensemble techniques, as ensembles have a built-in capacity to handle the class-imbalanced nature of datasets; still, there is considerable scope for improvement in the accuracy of ensemble-based defect predictors.

This work is dedicated to building an effective SDP classifier using the heterogeneous stacking ensemble method. The model is built upon the five best classifiers reported in the literature as base classifiers at level-1 of a two-level stacking scheme, with an ANN at level-2 combining the outputs of the heterogeneous level-1 classifiers. The performance of the proposed model is empirically evaluated on the three most popular evaluation criteria: AUC, ROC, and accuracy. A statistical comparison is also made between the performance of the proposed stacked ensemble classifier and that of the 5 base classifiers and 3 state-of-the-art ensemble techniques. From the reported results, it can be inferred that the proposed model, built of two-level stacking with a heterogeneous ensemble of 5 base classifiers at level-1 and an ANN at level-2, performs best, with the highest AUC (= 85.3%), the best ROC curve, and the highest accuracy (= 92.6%). In the future, we propose to replicate the study on other defect datasets extracted from live projects.