1 Introduction

Researchers in most disciplines have well-defined research strategies that come with detailed guidance and can also be viewed in simplified form from an observer's perspective. For instance, the public understands large-scale medical studies well enough to discuss the risks associated with an experimental treatment. This is not true of software engineering research, which lacks comparably accessible guidance [1]. Several attempts have been made to formalize software engineering research, but they fail to paint a comprehensive picture [2]. As a result, rigorous research is under way to fill this void. In this paper, the authors attempt to formalize the method of software fault prediction through a sequential ensemble model [3, 4].

The motivation for selecting this topic is that the prime cause of software failures is faults in software modules, which affect the software's reliability. Such failures may lead to dissatisfaction among users and, eventually, to the downfall of the company. In the current scenario, where industrial demand for software is increasing exponentially, there is nearly zero tolerance for software faults. Additionally, software contains a considerable number of modules, which further complicates the identification of fault-prone modules. This has opened avenues for research in software engineering, particularly in software fault prediction (SFP) [5, 6].

SFP aims to thoroughly inspect the software's quality before its release by inspecting the fault vulnerability of the software modules [7]. Identifying fault-vulnerable modules allows effort to be focused on them, so that resources are managed efficiently and the number of post-implementation faults is reduced [8]. SFP focuses on accurately predicting the fault vulnerability of the software modules so as to maximize software availability. Moreover, it helps to minimize the maintenance cost and thus to achieve high-quality software products [9].

Machine Learning (ML) has been deployed widely and successfully for solving the classification problems of SFP [10]. Here, the classification problem refers to classifying modules as Fault-Prone (FP) or Non-Fault-Prone (NFP). This classification problem has several associated challenges: class imbalance, irrelevant features, and noise [11]. No single learning technique can handle all of these challenges, namely imbalance, redundant features, and noise in the dataset, which leads to performance degradation [14, 15]. Hence, it is widely accepted that ensemble modeling may overcome the limitations that remain unaddressed by an individual ML classifier [6, 12]. This belief is strengthened by the proven competence of ensemble learning algorithms (ELA) in various research fields [10, 13]. These studies advocate the application of ELA in SFP, as it outperforms individual classifier algorithms [6, 16].

The class imbalance problem occurs when there is an extreme imbalance between Fault-Prone (FP) and Non-Fault-Prone (NFP) modules, so that the dataset is highly skewed toward one class. Generally, FP modules are few and occur rarely but have considerable significance, whereas learners primarily focus on NFP modules while ignoring the FP modules. Data balancing is therefore applied to resolve the skewness in the dataset and to improve the performance of ELA for SFP models. ML researchers have suggested two methods to handle class imbalance. The first assigns distinct costs to training examples. The second resamples the original dataset by oversampling the minority class and under-sampling the majority class [14]. Accordingly, various studies suggest the Synthetic Minority Oversampling Technique (SMOTE) for balancing the data to enhance classification performance [17].
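The oversampling idea behind SMOTE can be illustrated with a minimal sketch. It is not the implementation used in this study: the function name `smote_like_oversample`, the neighbour count `k`, and the toy metric values are all assumptions made for illustration. Each synthetic FP sample is placed at a random position on the line segment between an existing FP sample and one of its nearest FP neighbours.

```python
import math
import random


def smote_like_oversample(minority, n_synthetic, k=3, seed=0):
    """Create synthetic minority samples by interpolating between a
    minority point and one of its k nearest minority neighbours."""
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_synthetic):
        base = rng.choice(minority)
        # k nearest minority neighbours of the base point (excluding itself)
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        other = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> other
        synthetic.append(tuple(b + gap * (o - b) for b, o in zip(base, other)))
    return synthetic


# Hypothetical toy data: (lines of code, cyclomatic complexity) of FP modules.
fp_modules = [(120.0, 9.0), (150.0, 12.0), (90.0, 7.0), (200.0, 15.0)]
new_points = smote_like_oversample(fp_modules, n_synthetic=6)
for point in new_points:
    print(point)
```

Because every synthetic point is an interpolation of two real FP samples, it always lies inside the range spanned by the original minority class, which is what distinguishes this approach from simple duplication.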

Also, it is evident that the performance of a classification learner is affected by the quality of the data used [13, 18]. Consequently, corrupted data in real datasets may impede decisions, and ensemble classifiers built from such data lack accuracy. A specifically designed ensemble-based framework may help in this case; hence, the authors propose a framework that combines ELA with noise filtering, feature selection, and data balancing. Feature selection eliminates redundant and less significant features so that only the principal features are considered for training. This reduces the complexity of the algorithm, thus improving speed and cost-effectiveness. To eliminate the less useful features, the framework employs the Information Gain approach, which has been widely accepted in studies related to SFP [19].
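As a rough illustration of Information Gain based feature ranking, the sketch below computes the reduction in label entropy obtained by splitting on a discrete feature. The feature names and toy values are hypothetical, and the restriction to discrete features is an assumption made to keep the example short; continuous metrics would first need to be discretized.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())


def information_gain(feature_values, labels):
    """Entropy of the labels minus their entropy after splitting on the feature."""
    total = len(labels)
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    conditional = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - conditional


# Hypothetical toy data: six modules, two binary features.
labels = ["FP", "FP", "NFP", "NFP", "NFP", "NFP"]
features = {
    "high_complexity": [1, 1, 0, 0, 0, 0],  # separates FP from NFP perfectly
    "has_comments": [1, 0, 1, 0, 1, 0],     # carries no information about the label
}

ranking = sorted(
    features,
    key=lambda name: information_gain(features[name], labels),
    reverse=True,
)
for name in ranking:
    print(name, round(information_gain(features[name], labels), 4))
```

Features at the bottom of such a ranking are the candidates for elimination before training.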

Hence, the authors aim to conduct a large-scale experiment to understand the effect of ensemble modeling in SFP. The paper also presents a thorough comparison of individual techniques at each stage. Thereafter, a new sequential ensemble model is presented that aims to address the challenges of the individual models.

The related work demonstrates that ELA achieves robustness. In summary, when ELA is applied to selected features of balanced data, it achieves a remarkable performance improvement. Thus, ensemble techniques need to be examined in order to achieve robust performance. The objective of this study is to identify FP modules efficiently.

The work is organized as follows. The requirement for a formalized method for SFP is discussed in Sect. 1. Materials and methods are presented in Sect. 2. The proposed sequential ensemble model is elaborated in Sect. 3. The results obtained from the proposed model are discussed in Sect. 4. Finally, the manuscript is concluded in Sect. 5.

2 Materials and methods

An ensemble model follows a learning paradigm in which multiple models are trained on the same dataset and their forecasts are combined to forecast future values. The ensemble approach improves on individual models owing to reduced bias and variance. Ensembling involves training multiple (identical or different) models individually and then combining their results. These results are combined by feeding them into a meta-model that uses the individual models' outputs to predict the final value. Ensemble modeling may be broadly categorized into sequential and parallel models, which are defined as follows:

2.1 Sequential

In this method, base learning models are dependent on each other. A classic example of a sequential model is AdaBoost.

2.2 Parallel

In this method, base learning models are independent of each other. A classic example of a parallel model is Random Forest.

This broad classification of the ensemble model is shown in Fig. 1a, b.

Fig. 1 a Sequential ensemble method. b Parallel ensemble method

The following subsection discusses the materials and methods used for this research work.

2.3 Data collection

In this manuscript, the authors have selected 8 fault datasets for experimental analysis. These datasets have been taken from various open-source projects available in the PROMISE and Eclipse bug data repositories [20]. PROMISE is an open-source repository that contains fault datasets for various open-source software. The Eclipse repository contains fault information for the Eclipse project, which comprises thousands of files and is similar to an industrial system in terms of size and complexity. The dataset contains various file-level metrics, such as structure and complexity metrics. Specifically, it contains "filename," "count of errors reported six months before release," "count of errors reported six months after the release," and "complexity metrics." Using standard datasets enables experiments to be reproduced, which aids comparative performance analysis. Only datasets with more than 300 modules have been taken to check the proposed model's efficiency. The datasets are listed in Table 1.

Table 1 Dataset for the proposed model of SFP

3 Proposed sequential ensemble model

The proposed model is presented abstractly in Fig. 2. Here, the output of each model is given to Neural Network Autoregression (NNAR). The NNAR model is run multiple times to obtain the best-fitting model with the least error [21, 22]. Thereafter, the fitted values from the NNAR model are fed to a support vector regression (SVR) model.

Fig. 2 Abstract view of proposed model

In the proposed model, the authors suggest tuning the hyperparameters of SVR to obtain an accurate prediction. The model mainly considers three hyperparameters, viz. the soft-margin cost constant (\(C\)), the kernel parameter that governs the linearity of the hyperplane (\(\gamma \)), and the error tolerance (\(\varepsilon \)). The cost constant relates to the misclassification of the training data: for larger values, the model picks a smaller-margin hyperplane to improve classification performance, whereas for smaller values it picks a larger-margin hyperplane. The model, derived from a line function, is mathematically defined by the following equation:

$$m\cdot a+n=0$$
(1)

here,

$$m={\sum }_{i=1}^{k}{\Psi }_{i}{b}_{i}{a}_{i}$$
(2)
$$n=\frac{1}{s}{\sum }_{i=1}^{s}\left({b}_{i}-m\cdot {a}_{i}\right)$$
(3)

where \(s\) indicates the count of support vectors.

In the model, the role of \(C\) can be mathematically described by the following optimization problem:

$$\underset{m,n,\xi }{\min }\ \frac{1}{2}{\left|\left|m\right|\right|}^{2}+C{\sum }_{i=1}^{p}{\xi }_{i}$$
(4)

subject to

$${b}_{i}\left(m\cdot {a}_{i}+n\right)\ge 1-{\xi }_{i}$$
(5)
$${\xi }_{i}\ge 0$$
(6)
$$i=1,\dots ,p$$
(7)

The parameter \(\gamma \) indicates how far the influence of a single training example reaches, with a small gamma meaning distant and a large gamma meaning near. Gamma can be understood as the inverse of the radius of influence of the samples selected as support vectors. Thus, a large gamma indicates that the area of influence of a support vector includes only the support vector itself; in such a case, regularization fails to avoid overfitting. Conversely, a small gamma yields an overly constrained model that fails to capture the complexity of the data. The kernel can be mathematically expressed as follows:

$$R\left({a}_{i},{a}_{j}\right)=\exp \left(-\gamma {\left|\left|{a}_{i}-{a}_{j}\right|\right|}^{2}\right)$$
(8)
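The effect of gamma in Eq. (8) can be seen numerically with a short sketch. The function name `rbf_kernel`, the two sample points, and the gamma values are chosen only for illustration and are not taken from the study.

```python
import math


def rbf_kernel(a_i, a_j, gamma):
    """Evaluate Eq. (8): exp(-gamma * ||a_i - a_j||^2)."""
    squared_distance = sum((x - y) ** 2 for x, y in zip(a_i, a_j))
    return math.exp(-gamma * squared_distance)


# Two hypothetical points whose squared distance is 5.
a_i, a_j = (1.0, 2.0), (2.0, 4.0)

for gamma in (0.01, 0.1, 1.0, 10.0):
    print(f"gamma={gamma:<5} R={rbf_kernel(a_i, a_j, gamma):.6f}")
```

With a small gamma the kernel value stays close to 1, so distant points still influence each other; with a large gamma it drops toward 0, so each training example affects only its immediate neighbourhood.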

A detailed view of the proposed model is given in Fig. 3. Here, the historical data are divided into training data and testing data. The actual number of faults is given to the NNAR model individually. The NNAR model gives the fitted values, which are input to the SVR model. Finally, the predicted values are obtained through the SVR model, and the performance of the model is established in terms of several error metrics [23].

Fig. 3 Detailed view of sequential ensemble model

3.1 Implementation steps

Various steps in the proposed ensemble model for SFP are depicted in Fig. 4.

  1.

    As mentioned earlier, the proposed model considers 8 well-known fault datasets from PROMISE and Eclipse bug data repository [20].

  2.

    Further, if the dataset contains zeros (for example, to indicate missing values), z-score transformation is suggested, because log transformation would again lead to undefined values. Log transformation may be applied to the rest of the datasets. The equations for log and z-score transformation are given in Eqs. (9) and (10), respectively. These transformations help in minimizing the variance within the dataset.

    $${\widehat{a}}_{t}=\log \left({a}_{t}\right)$$
    (9)
    $${\widehat{a}}_{t}=\frac{{a}_{t}-mean\left({a}_{t}\right)}{{SD}_{t}}$$
    (10)

    where

    $$mean\left({a}_{t}\right)=\frac{1}{p}{\sum }_{t=1}^{p}{a}_{t}$$
    (11)
    $${SD}_{t}=\sqrt{\frac{1}{p}{\sum }_{t=1}^{p}{\left({a}_{t}-mean\left({a}_{t}\right)\right)}^{2}}$$
    (12)
  3.

    Thereafter, the transformed dataset is split into a training dataset and a test dataset.

  4.

    The training part of the dataset is given to the model for learning purposes. During learning, the model extracts useful patterns and information from the data. This is followed by feeding the actual data into the NNAR model. The equation for the same is expressed as follows:

    $${\widehat{b}}_{t}=f\left({\widehat{a}}_{t}\right)+{\varepsilon }_{t}$$
    (13)

    Then, fitted values of NNAR model are given as input to the SVR model which can be mathematically expressed as follows:

    $$g\left({c}_{t}\right)=\left(w\cdot {\Psi }_{t}\left({b}_{t}\right)\right)+C$$
    (14)

    The parameters used in the above equations are described in Table 2.

    The parameters are tuned manually by fixing one parameter at its default value and adjusting the other parameters accordingly. During each adjustment, the accuracy of the model is checked. When a parameter reaches the required value, the other parameters are adjusted in turn while the model accuracy is checked. Thus, the values of all parameters are adjusted manually. The NNAR model is executed multiple times, with a different seed value each time and with the autoregression and seasonal autoregression orders set to zero, which helps avoid missing values in the fitted result of NNAR. The SVR model is likewise initialized with default values, which are then modified to optimize its performance. This process of fixing some parameters and adjusting the others to find the optimum values is referred to here as cross-validation, although it is closer to a manual one-parameter-at-a-time search.

  5.

    The performance of the trained model on the test dataset is measured, and the accuracy of its predictions is verified, using the error metrics described in the subsequent section.
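The preprocessing in steps 2 and 3 can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: the function names are hypothetical, the standard deviation is taken to be the usual population standard deviation, the fault counts are toy values, and the split simply keeps the first 80% of the observations for training (the 80/20 ratio is the one reported in Sect. 4).

```python
import math


def log_transform(values):
    """Eq. (9); undefined for zero or negative values."""
    if any(v <= 0 for v in values):
        raise ValueError("log transform is undefined for zero or negative values")
    return [math.log(v) for v in values]


def z_score_transform(values):
    """Eq. (10), using the population standard deviation."""
    p = len(values)
    mean = sum(values) / p
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / p)
    if sd == 0:
        raise ValueError("z-score is undefined when all values are equal")
    return [(v - mean) / sd for v in values]


def transform(values):
    """Use the z-score when zeros are present, otherwise the log."""
    if any(v == 0 for v in values):
        return z_score_transform(values)
    return log_transform(values)


def split(values, train_fraction=0.8):
    """Keep the first part for training and the remainder for testing."""
    cut = int(len(values) * train_fraction)
    return values[:cut], values[cut:]


# Hypothetical fault counts per module; the zeros force the z-score branch.
faults = [0, 2, 1, 0, 4, 3, 0, 1, 5, 2]
scaled = transform(faults)
train, test = split(scaled)
print(len(train), len(test))
```

Choosing the transformation per dataset, rather than globally, mirrors the rule in step 2: the log is used wherever it is defined, and the z-score is the fallback.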

Fig. 4 Step-wise illustration of the proposed ensemble model

Table 2 Descriptions of used parameters

4 Results and discussions

To assess the efficacy of the proposed approach, experiments are conducted on the collected datasets and against other ensemble techniques. This section is divided into two subsections. The first presents an empirical evaluation of the proposed ensemble learning approach under two scenarios, viz. Intrarelease prediction and Interrelease prediction. The second presents a comparative analysis of the proposed technique against other ensemble techniques. The datasets are split into two subsets, a training dataset (80%) and a testing dataset (20%). The obtained results are analyzed with respect to several performance metrics, viz. Average Absolute Error (AAE), Average Relative Error (ARE), and prediction at level l (pred(l)).

AAE, given in Eq. (15), is the average absolute difference between the predicted faults (\(PF\)) and the actual faults (\(AF\)) over \(n\) modules:

$$AAE=\frac{1}{n}\sum_{i=1}^{n}\left|{AF}_{i}-{PF}_{i}\right|$$
(15)

ARE, given in Eq. (16), represents the absolute error in proportion to the actual number of faults; one is added to the denominator so that the ratio remains defined for modules with no faults.

$$ARE=\frac{1}{n}\sum_{i=1}^{n}\frac{\left|{AF}_{i}-{PF}_{i}\right|}{{AF}_{i}+1}$$
(16)

Further, another measure, prediction at level \(l\) (pred(\(l\))), represents the percentage of predictions whose relative error with respect to the actual faults lies within \(l\). For the model to be considered acceptable, the value of \(l\) must be kept less than or equal to 0.3 [24].
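The three metrics can be computed as in the following sketch. The functions `aae` and `are` follow Eqs. (15) and (16) directly. The function `pred` is an assumption: it counts a prediction as a hit when its relative error, measured with the same denominator as Eq. (16), does not exceed the chosen level. The fault counts are toy values.

```python
def aae(actual, predicted):
    """Average Absolute Error, Eq. (15)."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def are(actual, predicted):
    """Average Relative Error, Eq. (16)."""
    return sum(abs(a - p) / (a + 1) for a, p in zip(actual, predicted)) / len(actual)


def pred(actual, predicted, level=0.3):
    """Percentage of modules whose relative error is at most `level`.

    Assumption: relative error uses the (actual + 1) denominator of Eq. (16).
    """
    hits = sum(
        1 for a, p in zip(actual, predicted) if abs(a - p) / (a + 1) <= level
    )
    return 100.0 * hits / len(actual)


# Hypothetical actual and predicted fault counts for four modules.
actual_faults = [0, 1, 3, 4]
predicted_faults = [0.2, 1.5, 1.0, 4.0]

print(f"AAE       = {aae(actual_faults, predicted_faults):.3f}")
print(f"ARE       = {are(actual_faults, predicted_faults):.4f}")
print(f"pred(0.3) = {pred(actual_faults, predicted_faults):.1f}%")
```

In this toy case only the third module, predicted as 1 fault against 3 actual faults, falls outside the 0.3 level.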

4.1 Empirical evaluation

In Intrarelease prediction, only a single version of the software is used to collect the datasets, whereas in Interrelease prediction several versions of the same software are used to derive the datasets. The results obtained are illustrated below:

4.1.1 Intrarelease prediction

The results obtained are presented in Table 3. The AAE values lie between 0.13 and 1.78. CAMEL 1.6 and XALAN 2.6 show the highest values, whereas the Eclipse datasets show the lowest. Most of the datasets show values less than 0.50. The ARE values lie between 0.12 and 0.46, and most lie below 0.31, with the exception of Xerces 1.4. The prediction (level l) results range between 45 and 90%, with the highest value for EMF 2.1; Xerces 1.4 and XALAN 2.6 show lower values than the other datasets. The average values for the proposed ensemble approach are 0.51, 0.21, and 72.14% for AAE, ARE, and pred(0.3), respectively. The obtained results are visualized in Fig. 5.

Table 3 Error metrics of ensemble approach for Intrarelease prediction
Fig. 5 Illustration of obtained results for Intrarelease prediction

4.1.2 Interrelease prediction

In Interrelease prediction, the software's current version is used as the testing dataset, whereas the previous versions are used as the training set. The simulation results are provided in Table 4. The AAE values range from 0.2 to 2.0, and the ARE values range from 0.12 to 0.44. The Eclipse datasets show promising results, with the lowest AAE and ARE values; in contrast, the Xerces 1.4 dataset has the highest values. The average AAE is 0.61, and all datasets except Xerces 1.4 fall below this average. Correspondingly, the average ARE is 0.27. For the pred(0.3) measure, the values range between 35 and 85%, with an average of 66.56%. Figure 6 visualizes the results.

Table 4 Error metrics of ensemble approach for Interrelease prediction
Fig. 6 Illustration of obtained results for Interrelease prediction

4.2 Comparative analysis

The performance of the proposed ensemble approach has been compared with state-of-the-art ensemble techniques, viz. random forest, bagging, stacking, and XGBoost. The simulation results, along with the comparative analysis, are presented in Tables 5 and 6.

Table 5 Comparative analysis for Intrarelease prediction
Table 6 Comparative analysis for Interrelease prediction

The results in Tables 5 and 6 show that ensemble modeling outperforms the individual model by a significant margin. Thus, the competence of ensemble modeling is established for SFP, in addition to various other domains.

5 Conclusion

The field of software engineering has lacked well-defined strategies, unlike other related fields. However, during the past few decades, there has been a significant rise in software demand. This growth of the software industry necessitates well-defined strategies, and rigorous research is taking place in this direction. The authors in this manuscript aim to formalize the method of SFP through a sequential ensemble model. The proposed model is applied to 8 datasets taken from well-known repositories. Its performance is analyzed in terms of various error metrics, viz. average absolute error, average relative error, and prediction at level l. Root mean squared error (RMSE), another error metric, is not employed because it squares the errors before averaging and therefore assigns larger weights to larger errors; it is thus more suitable for applications in which large errors are particularly undesirable. The results obtained through the proposed model are encouraging and thus support the employment of ensemble modeling for SFP.
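The remark on RMSE can be made concrete with a small numerical sketch. The two error profiles below are hypothetical: both have the same average absolute error, but one concentrates the whole error in a single module.

```python
import math


def mean_absolute_error(errors):
    """Average of the absolute errors."""
    return sum(abs(e) for e in errors) / len(errors)


def root_mean_squared_error(errors):
    """Square the errors, average them, then take the square root."""
    return math.sqrt(sum(e * e for e in errors) / len(errors))


# Hypothetical per-module prediction errors.
evenly_spread = [2, 2, 2, 2]
one_large_error = [0, 0, 0, 8]

for name, errors in (("evenly spread", evenly_spread), ("one large error", one_large_error)):
    print(
        f"{name:<16} MAE={mean_absolute_error(errors):.1f} "
        f"RMSE={root_mean_squared_error(errors):.1f}"
    )
```

The absolute-error average treats the two profiles alike, whereas RMSE penalizes the profile containing the single large error more heavily.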