1 Introduction

Software enhancement effort estimation (also termed prediction) has been recognized as increasingly important by several software organizations, since most software enhancement projects are allocated a lower cost than new development [1]. Software enhancement is considered a critical activity in the software development life cycle. It is defined as “changes made to an existing application where new functionality has been added, or existing functionality has been changed or deleted. This would include adding a module to an existing application, irrespective of whether any of the existing functionality is changed or deleted” [2]. Since changes are frequent throughout the Software Development Life Cycle (SDLC), software project planning should be reviewed frequently, and the software enhancement effort estimation should therefore be accurate.

The benefits of using enhancement effort estimation models are numerous. For instance, estimation models can help in deciding when to restructure or re-engineer a software component to make it more maintainable, and in better understanding the underlying reasons for the difficulty of correcting specific kinds of errors [3]. In this area, Machine Learning (ML) techniques are widely used to achieve better estimation, and they are considered the most suitable for modeling high-dimensional problems [4]. However, there is a lack of consensus among researchers about which technique achieves the best estimation [5]. Several techniques have been proposed for estimating software enhancement effort, including statistical regressions and machine learning models such as case-based reasoning, neural networks (NN), decision trees (DT), Bayesian networks, support vector machines (SVM), genetic algorithms, genetic programming, and association rules (ARU) [5].

In our previous work [6], we separately used four ML techniques (M5P, Gradient Boosting Regressor (GBRegr), Linear Support Vector Regression (LinearSVR), and Random Forest Regressor (RFR)) for estimating software enhancement effort. The four selected ML techniques were trained and tested using industrial projects from the International Software Benchmarking Standards Group (ISBSG) Release 12 dataset [7]. The first phase focused on selecting the optimal feature set in the ISBSG dataset using the correlation-based feature selection (CFS) algorithm, while the second phase focused on estimating the enhancement effort based on the optimal feature set obtained from the first phase. The findings of our previous empirical study were as follows:

  • The correlation coefficient computed between enhancement functional size and enhancement effort has a value of 0.5, which indicates a good correlation. The enhancement functional size was therefore chosen as the primary independent variable.

  • The use of ML techniques without feature selection generated good accuracy. However, the use of ML techniques with the CFS algorithm gave better results.

  • The empirical results suggested that M5P is the most accurate model, with a small MAE of 0.0612 and a quite good performance that can reach 99%.

More recently, research publications have investigated the use of ensemble learning for improving software effort estimation [8, 9]. Various ensemble methods are considered for estimating software effort, such as [10]:

  • Bagging: The estimation merges the predictions of several models of the same type, each trained on a different bootstrap sample of the data.

  • Boosting: The estimation is built sequentially, with each model correcting the errors of the previous ones to reduce the bias.

  • Stacking: The estimations of multiple individual models are combined by a new model.

Based on the results obtained in our previous work [6], we aim in this paper to build a stacking ensemble method to accurately predict the total enhancement effort for enhancement projects in person-hours. Our stacking ensemble method combines three different Machine Learning models (GBRegr, LinearSVR, and RFR). Estimation results obtained using stacking will be compared to those obtained using a single algorithm (M5P). M5P has recently been used for software estimation [11,12,13,14]. It is a powerful implementation of Quinlan’s M5 algorithm for inducing both Model Trees and Regression Trees [15]. The main motivation for this research study arises from the fact that existing single techniques used for estimating software effort suffer from several limitations [16], while other innovative approaches such as the ensemble method are yet to be adopted in the industry for estimating software effort. This study investigates the use of CFS and stacking ensemble methods for improving enhancement effort estimation. First, M5P, GBRegr, LinearSVR, and RFR are used separately. Second, the stacking ensemble method that combines GBRegr, LinearSVR, and RFR is used. Finally, the experimental results are compared. The hypotheses investigated in this research are the following:

  • H1: The enhancement effort estimation accuracy with the stacking ensemble method is statistically better than that obtained with M5P when the functional change size is used as the independent variable.

  • H2: The use of the CFS algorithm improves the accuracy of the selected ML methods.

The rest of the paper is organized as follows. In Sect. 2, we present the background and the related work. The detailed description of our research methodology for achieving better software enhancement effort estimation is presented in Sect. 3. In Sect. 4, we discuss the experimental results. Threats to the validity of our evaluation are presented in Sect. 5. Finally, we conclude the paper and give directions for future work in Sect. 6.

2 Background and related work

2.1 Software enhancement effort estimation and machine learning

Enhancement is considered a type of adaptive maintenance [2]. Regarding the use of ML for software maintenance effort estimation models, we identified 18 studies published between 1995 and 2020. The models in these 18 studies were statistical regressions [17,18,19,20,21], neural networks [22, 23], rule-based models [23, 24], Bayesian networks [25], analogy [26], a pattern recognition approach termed optimized set reduction [21], general regression [22], support linear regression models [22], support vector regression (SVR) [24], and decision trees with stochastic gradient boosting [27].

Results of these studies showed that there was no statistically significant difference in estimation accuracy among the proposed models. A major challenge for the research community is to develop a good theoretical understanding of maintenance and evolution that scales to industrial applications [28]. Several studies lack clarity on how the data were prepared and used, which makes it difficult to compare results among studies and to replicate them [29]. More recently, ensemble methods, which combine more than one single ML technique, have attracted attention in software engineering research [30]. The systematic reviews conducted by Idri et al. [31] and Alsolai et al. [32] confirmed that ensemble methods outperformed their constituents (single models). The ensemble method has shown promising capabilities in improving accuracy over single models [33]. It yields more accurate results even when compared to deep learning models [34]. Indeed, the more diverse the constituents are, the better the ensemble method outputs will be distributed around the desired output [35, 36]. Despite that, ensemble methods have not yet been adopted in the area of enhancement (maintenance) effort estimation. To our knowledge, this is the first study that investigates the use of the ensemble method in software maintenance effort estimation. The main motivation behind using the ensemble method in this work is that it makes the model more reliable and robust, owing to the advantages of using more than one ML technique for estimation [37]. Hence, if any of the models used performs poorly, the ensemble method can reduce the error by relying on the other models [38].

2.2 Stacking ensemble method

The stacking model was introduced by Wolpert [39] and has recently been used for estimating software effort [30, 34]. The stacking model combines lower-level Machine Learning techniques to achieve more accurate estimation. The constructed linear estimation model consists of two learning levels [40]. The first learning level is called Level-0, where models are trained and tested on independent cross-validation examples from the original input data. Then, the output of the Level-0 models, together with the original input data, is used as input for Level-1, called the generalizer (i.e., the meta-model) [40].
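The two learning levels described above can be sketched as follows. This is a minimal illustration on synthetic data, not the configuration used in this study: the data, the number of folds, and all hyperparameters are assumptions chosen only to keep the example small.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.model_selection import KFold, cross_val_predict
from sklearn.svm import LinearSVR

# Synthetic stand-in data (assumption): one size-like feature, one effort-like target.
rng = np.random.default_rng(0)
X = rng.uniform(0.0, 1.0, size=(120, 1))
y = 2.0 * X[:, 0] + rng.normal(0.0, 0.05, size=120)

# Level-0: each model predicts only on cross-validation folds it was not trained on.
level0_models = [
    GradientBoostingRegressor(random_state=0),
    RandomForestRegressor(n_estimators=50, random_state=0),
]
folds = KFold(n_splits=5, shuffle=True, random_state=0)
level0_output = np.column_stack(
    [cross_val_predict(model, X, y, cv=folds) for model in level0_models]
)

# Level-1: the meta-model sees the original input data plus the Level-0 outputs.
level1_input = np.hstack([X, level0_output])
meta_model = LinearSVR(max_iter=10000, random_state=0).fit(level1_input, y)

# For new data, the Level-0 models are refit on the full training set.
for model in level0_models:
    model.fit(X, y)


def predict_stacked(X_new):
    """Feed Level-0 predictions, together with the raw input, into the meta-model."""
    outputs = np.column_stack([m.predict(X_new) for m in level0_models])
    return meta_model.predict(np.hstack([X_new, outputs]))
```

Using out-of-fold predictions at Level-0 keeps the meta-model from being trained on predictions that the base models made for data they had already seen.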

2.3 Feature selection methods

Feature selection methods can be classified into three categories: Filter, Wrapper, and Embedded (or Hybrid) methods [41]. Filter methods select features based on the characteristics of the dataset without involving any learning technique. Afterward, the selected subset of features is presented as input to a classification/regression estimation algorithm [42]. Wrapper methods select the feature subset based on the performance of a given learning technique according to a performance measure. Embedded or Hybrid methods perform the selection step and model building simultaneously, or combine filter and wrapper techniques. One of the measures used for feature selection is the dependency measure, and many dependency-based algorithms have been proposed. In this study, we use correlation-based feature selection (CFS) since it can evaluate all the possible combinations [43]. It can also update the subset of selected features during the evaluation process, unlike greedy forward selection and greedy backward elimination, which do not [44]. CFS was proposed by Hall [43] to evaluate subsets of features according to a heuristic evaluation function based on correlation, derived from the Pearson correlation coefficient. It is a multivariate feature filter, meaning that it assesses different feature subsets and chooses the best one. It rests on the hypothesis that “A good feature subset contains features highly correlated with the class, yet uncorrelated with each other” [43]. Because the feature selection algorithm produces good subsets of features, using it with the ensemble method is expected to improve the accuracy of the ensemble [45]. This observation was also made by Hosni et al. [44] in their empirical study, where the CFS ensemble generated better results than the RReliefF ensemble. The choice of feature selection method differs among application areas [46]. Table 1 presents the findings of studies that used filter feature selection for software effort estimation.

Table 1 Literature review on CFS algorithm used for software effort estimation

Relatively few studies have investigated the use of the CFS algorithm in the area of software enhancement effort estimation for both individual and ensemble models. Nevertheless, a number of research studies have confirmed the effectiveness of the CFS algorithm and ML techniques for software effort estimation [41, 44, 45].
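To make the CFS hypothesis quoted in this section concrete, the sketch below computes the subset merit from Hall's heuristic, k·r_cf / sqrt(k + k(k−1)·r_ff), where k is the subset size, r_cf the mean feature-to-target correlation, and r_ff the mean feature-to-feature correlation. It is an illustration only: it uses absolute Pearson correlations on synthetic data, the function name `cfs_merit` is hypothetical, and the subset search strategy of a full CFS implementation is not shown.

```python
import numpy as np


def cfs_merit(X, y, subset):
    """Merit of the feature columns in `subset`, following Hall's heuristic."""
    k = len(subset)
    # Mean absolute correlation between each selected feature and the target.
    r_cf = np.mean([abs(np.corrcoef(X[:, j], y)[0, 1]) for j in subset])
    if k == 1:
        return float(r_cf)
    # Mean absolute correlation over all pairs of selected features.
    r_ff = np.mean([
        abs(np.corrcoef(X[:, a], X[:, b])[0, 1])
        for i, a in enumerate(subset)
        for b in subset[i + 1:]
    ])
    return float(k * r_cf / np.sqrt(k + k * (k - 1) * r_ff))


# Synthetic data (assumption): f0 and f1 are informative and independent,
# f2 is a near copy of f0 and therefore redundant.
rng = np.random.default_rng(0)
f0 = rng.normal(size=500)
f1 = rng.normal(size=500)
f2 = f0 + 0.1 * rng.normal(size=500)
X = np.column_stack([f0, f1, f2])
y = f0 + f1 + 0.1 * rng.normal(size=500)

merit_complementary = cfs_merit(X, y, [0, 1])  # relevant and uncorrelated
merit_redundant = cfs_merit(X, y, [0, 2])      # relevant but highly correlated
```

In this setting the pair of uncorrelated features obtains the higher merit, which is the behavior the hypothesis describes.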

3 Research process

In this paper, we extend our previous research methodology [6] by setting up two new models that use the CFS algorithm to predict software enhancement maintenance effort. The first model is constructed using four selected regression ML techniques (M5P, LinearSVR, GBRegr, and RFR) separately, while the second model combines three of them (LinearSVR, GBRegr, and RFR) into a stacking ensemble method. Finally, we compare the estimation accuracy of the two models. We aim to identify whether the use of the CFS algorithm with the stacking ensemble method improves the performance of the estimation model versus the use of the CFS algorithm with the M5P model.

Fig. 1 Research method design

3.1 Data preprocessing

The dataset used for training and testing the estimation model is obtained from the ISBSG Release 12 [18]. The ISBSG dataset is widely used for software project estimation [47]. It includes new development, enhancement, and re-development software projects. It has been extensively reviewed for its applicability to building effort estimation models, including the effects of outliers and missing values [48]. The effort expended on the support activities is reported in person-hours. We selected the data with “enhancement” as the “development type” and the COSMIC functional size measurement method as the “count approach”. In addition, we considered only sound data with a high level of integrity (i.e., records having a “Data Quality Rating” of “A” or “B”). To exclude trivial projects, the following filters were applied:

  • Normalized work effort (full life cycle effort for the project) equal to or greater than 80 person-hours.

  • Development types other than enhancement were excluded.
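The selection criteria above can be expressed as a simple table filter. The sketch below uses pandas on a handful of made-up records; the column names are hypothetical stand-ins and do not necessarily match the field names of the ISBSG release.

```python
import pandas as pd

# Made-up records (assumption); column names are illustrative only.
projects = pd.DataFrame({
    "DevelopmentType": ["Enhancement", "New Development", "Enhancement",
                        "Enhancement", "Enhancement", "Enhancement"],
    "CountApproach": ["COSMIC", "COSMIC", "IFPUG", "COSMIC", "COSMIC", "COSMIC"],
    "DataQualityRating": ["A", "B", "A", "C", "B", "B"],
    "NormalizedWorkEffort": [350.0, 1200.0, 90.0, 400.0, 40.0, 80.0],
})

keep = (
    (projects["DevelopmentType"] == "Enhancement")     # enhancement projects only
    & (projects["CountApproach"] == "COSMIC")          # COSMIC sizing only
    & projects["DataQualityRating"].isin(["A", "B"])   # high-integrity records
    & (projects["NormalizedWorkEffort"] >= 80)         # exclude trivial projects
)
selected_projects = projects[keep]
```

Only the first and last records satisfy all four conditions in this toy table.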

Table 2 lists the data fields, the corresponding values selected in this study, the discarded values, and the number of projects. After the preprocessing phase, we selected a total of 17 attributes.

Table 2 First selection of data concerning software enhancement projects from the ISBSG dataset

3.2 Constructing estimation models

This section presents a series of experiments to investigate the performance of estimation models with the use of the CFS algorithm. We constructed the two estimation models presented in Fig. 1. The first model uses four ML techniques (M5P, LinearSVR, GBRegr, and RFR) for estimating enhancement effort. The chosen models are trained and tested separately on the ISBSG dataset with the relevant features selected by the CFS algorithm. The second model is a stacking ensemble method that combines LinearSVR, GBRegr, and RFR. For this second model, the meta-model provided via the “final_estimator” argument (LinearSVR) is trained to combine the estimations of the chosen regression ML techniques provided via the “estimators” argument (GBRegr, RFR). Each regression model is trained on the portion of the ISBSG dataset allocated for training, restricted to the relevant features filtered by the CFS algorithm. Then the outputs of the “estimators” are fed into the “final_estimator”, which combines the regression estimators, each with a weight, and delivers the final estimation.

For the first set of experiments, we follow the classic approach of a simple 70%–30% split of the data into a training set and a validation/test set. The training set is used to train the model, and the validation/test set is used to validate it on data it has never seen before. The selected ML techniques are trained and tested in several experiments using the features selected in the preprocessing phase.

Different tools were used to carry out the experiments. The M5P model (a tree-based model) was built using the Weka software (Footnote 1), which is widely used for teaching, research, and industrial applications and contains a plethora of built-in tools for standard machine learning tasks. Feature selection, 10-fold cross-validation, and the validation/test estimation of the GBRegr, SVR, and RFR models were performed in Python using Google Colaboratory (Footnote 2). Google Colaboratory, known as Google Colab, is a recent tool [49] that provides GPUs for research to people who do not have enough resources or cannot afford one. Table 3 lists the selected ML techniques with their corresponding predefined ranges of parameter values.

Table 3 Parameters values for grid search
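As an illustration of the second model, the sketch below wires GBRegr and RFR as “estimators” and LinearSVR as “final_estimator” in scikit-learn's StackingRegressor, after a 70%–30% split. The data are synthetic and the hyperparameters are placeholders (assumptions); the grid-searched values of Table 3 are not reproduced here.

```python
import numpy as np
from sklearn.ensemble import (
    GradientBoostingRegressor,
    RandomForestRegressor,
    StackingRegressor,
)
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVR

# Synthetic stand-in for the CFS-filtered data (assumption).
rng = np.random.default_rng(42)
X = rng.uniform(0.0, 1.0, size=(200, 1))
y = 2.0 * X[:, 0] + rng.normal(0.0, 0.05, size=200)

# Simple 70%-30% split into training and validation/test sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=42
)

stack = StackingRegressor(
    estimators=[
        ("gbr", GradientBoostingRegressor(random_state=42)),
        ("rfr", RandomForestRegressor(n_estimators=50, random_state=42)),
    ],
    final_estimator=LinearSVR(max_iter=10000, random_state=42),
    cv=10,  # out-of-fold predictions of the estimators train the meta-model
)
stack.fit(X_train, y_train)
predictions = stack.predict(X_test)
test_r2 = stack.score(X_test, y_test)  # coefficient of determination on unseen data
```

By default StackingRegressor passes only the estimators' predictions to the final estimator; setting `passthrough=True` would also forward the original features.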

3.3 Experiments results

This section evaluates the estimation performance of the two constructed models through two experiments. Each constructed model uses the CFS algorithm: after applying it, we randomly split the data with the relevant features into two subsets, a training set and a test set. To evaluate the accuracy of the prediction models, we used a set of evaluation metrics [47, 48] such as root mean square error (RMSE) and mean absolute error (MAE), as well as the Standardized Accuracy measure (SA), which is based on MAE and was proposed by [50]. We also used the cross-validation method [51] with K = 10 folds. This choice is well established, since the number of model fits needed to obtain the estimate becomes independent of the size of the training sample [52].
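The three error measures named above can be computed as in the sketch below. The SA formulation shown, SA = 1 − MAE / MAE_P0 with MAE_P0 estimated by repeated random guessing, is the commonly used definition and is assumed here; the number of runs, the seed, and the toy values are arbitrary choices for illustration.

```python
import numpy as np


def mae(actual, predicted):
    """Mean absolute error."""
    return float(np.mean(np.abs(np.asarray(actual) - np.asarray(predicted))))


def rmse(actual, predicted):
    """Root mean square error."""
    return float(np.sqrt(np.mean((np.asarray(actual) - np.asarray(predicted)) ** 2)))


def standardized_accuracy(actual, predicted, runs=1000, seed=0):
    """SA = 1 - MAE / MAE_P0, where MAE_P0 is the mean MAE of random guessing."""
    actual = np.asarray(actual, dtype=float)
    rng = np.random.default_rng(seed)
    n = len(actual)
    guess_maes = []
    for _ in range(runs):
        # Guess, for each project, the effort of a different randomly chosen project.
        offsets = rng.integers(1, n, size=n)
        guesses = actual[(np.arange(n) + offsets) % n]
        guess_maes.append(mae(actual, guesses))
    return 1.0 - mae(actual, predicted) / float(np.mean(guess_maes))


# Toy values (assumption), not results from this study.
actual_effort = [1.0, 2.0, 3.0, 4.0, 5.0]
predicted_effort = [1.1, 1.9, 3.2, 3.8, 5.0]
example_mae = mae(actual_effort, predicted_effort)
example_rmse = rmse(actual_effort, predicted_effort)
example_sa = standardized_accuracy(actual_effort, predicted_effort)
```

An SA close to 1 means the model is far better than random guessing, while an SA near 0 means it is no better.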

3.3.1 Correlation-based feature selection (CFS) algorithm

Once the appropriate projects have been selected (i.e., projects with high data quality), we use the CFS algorithm to select the features that are relevant for software enhancement effort estimation. The main challenge when using correlation-based filters is related to the starting points for feature subset generation [44]. To handle missing values in a feature, CFS replaces them with the average value for continuous features and the most common value for discrete features [44]. After applying the CFS algorithm, we determine which features globally and consistently appear in the optimal set of features. The filtering here is done using the correlation matrix and the Pearson correlation [53].

Fig. 2 Pearson correlation heat map

Pearson correlation

Pearson’s correlation coefficient is a measure of the strength of the association between two variables [54]. In our research, we plot the Pearson correlation heat map (see Fig. 2). After the preprocessing phase, we selected a total of 17 attributes, of which 16 are independent variables and one is the dependent variable (NormalizedWorkEffort). The correlation coefficient is a single number that indicates both the strength and direction of the linear relationship between two continuous variables. Values can range from \(-\,1\) to \(+\,1\) [54].

  • Strength: The greater the absolute value of the correlation coefficient, the stronger the relationship. When the value is in-between 0 and \(+\,1\)/\(-\,1\), there is a relationship, but the points do not all fall on a line.

  • Direction: The sign of the correlation coefficient represents the direction of the relationship. Positive coefficients indicate that when the value of one variable increases, the value of the other variable also tends to increase. Negative coefficients represent cases when the value of one variable increases, the value of the other variable tends to decrease.

Since correlation coefficients whose magnitude is less than 0.3 indicate little if any (linear) correlation [54], only the features whose correlation with the output variable is larger than 0.4 in absolute value are selected. The CFS algorithm thus selects 37.5% (6 out of 16) of the features (see Table 4). Note that the CFS algorithm is used not only to select features but also to evaluate the impact of the enhancement size (i.e., functional size of the functional change) feature on the accuracy of the software enhancement effort estimation.
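The thresholding step described above can be sketched with pandas as follows. The data frame is synthetic and the feature names FunctionalSize and TeamSize are hypothetical; only NormalizedWorkEffort and the 0.4 threshold come from the text.

```python
import numpy as np
import pandas as pd

# Synthetic data (assumption): one feature related to effort, one unrelated.
rng = np.random.default_rng(1)
n = 300
size = rng.uniform(0.0, 1.0, n)
data = pd.DataFrame({
    "FunctionalSize": size,                 # hypothetical feature name
    "TeamSize": rng.uniform(0.0, 1.0, n),   # hypothetical feature name
    "NormalizedWorkEffort": size + rng.normal(0.0, 0.3, n),
})

# Pearson correlation matrix; this is the matrix a heat map would display.
correlations = data.corr(method="pearson")

# Correlation of every independent variable with the dependent variable.
with_target = correlations["NormalizedWorkEffort"].drop("NormalizedWorkEffort")

# Keep features whose absolute correlation with the target exceeds 0.4.
selected_features = with_target[with_target.abs() > 0.4].index.tolist()
```

Taking the absolute value ensures that strongly negative correlations are retained as well as strongly positive ones.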

Table 4 Selected feature correlation

It has been observed that COSMIC sizing is an efficient method for measuring not only software size but also the functional size of the functional change that may occur during the software life cycle [55]. Figure 2 shows that the correlation coefficient between enhancement functional size and enhancement effort is equal to 0.5. This indicates an acceptable correlation when compared with other features (such as CHANGEWorkEffort and UnrecordedWorkEffort). The functional size of the change was therefore chosen as the primary independent variable.

M5P algorithm performance assessment versus GBRegr, SVR and RFR models

Using the CFS algorithm with the selected regression ML techniques separately leads to an accurate enhancement effort estimation when the enhancement functional size is used as the independent variable (see Table 5). The error metrics (MAE and RMSE) take quite small values with M5P (MAE = 0.0612; RMSE = 0.2514). It is evident from the results that the M5P method delivers the best performance compared to the other three ML techniques, with an SA of around 99% (see Fig. 3).

Table 5 Estimation analysis using MAE, RMSE and SA

Fig. 3 ML techniques accuracy

3.3.2 Stacking ensemble method based on the use of correlation-based feature selection (CFS) algorithm

Given the above estimation results, our stacking ensemble method is based on the hypothesis that “When weak models are rightly aggregated, the strength of the union, therefore, leads to better performance and more accurate estimation of software enhancement effort”. The method proceeds in two steps: (1) selecting which models are used as “estimators” and which model is used as the meta-model, and (2) making predictions by feeding the estimators’ predictions into the meta-model.

Selecting estimators and meta-model

The main parameters of the stacking ensemble regression model are defined in scikit-learn (Footnote 3) as follows: StackingRegressor(estimators, final_estimator=None, *), as explained in Table 6.

Table 6 Stacking ensemble regression model parameters

Thus, we try to identify which of the three ML techniques can be used as the “final_estimator” and which ones should be used as “estimators”. To this end, we selected the r2_score evaluation metric (Footnote 4) to evaluate the overall performance of the selected prediction models and thereby identify an adequate combination. Table 7 reports the r2_score results, where the best possible score is 1.0. Figure 4 shows the ML “estimators” and the average of their predictions.
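One possible way to explore this choice, which is an assumption and not necessarily the exact procedure followed in this study, is to try each of the three techniques in turn as “final_estimator”, use the other two as “estimators”, and compare the resulting r2_score on held-out data. The sketch below does this on synthetic data with placeholder hyperparameters.

```python
import numpy as np
from sklearn.ensemble import (
    GradientBoostingRegressor,
    RandomForestRegressor,
    StackingRegressor,
)
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVR

# Synthetic stand-in data (assumption).
rng = np.random.default_rng(7)
X = rng.uniform(0.0, 1.0, size=(150, 1))
y = 2.0 * X[:, 0] + rng.normal(0.0, 0.05, size=150)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=7
)

# Factories so that every combination gets fresh, unfitted models.
candidates = {
    "GBRegr": lambda: GradientBoostingRegressor(random_state=7),
    "LinearSVR": lambda: LinearSVR(max_iter=10000, random_state=7),
    "RFR": lambda: RandomForestRegressor(n_estimators=50, random_state=7),
}

scores = {}
for meta_name, make_meta in candidates.items():
    # The two remaining techniques act as the "estimators".
    base = [(name, make()) for name, make in candidates.items() if name != meta_name]
    stack = StackingRegressor(estimators=base, final_estimator=make_meta(), cv=5)
    stack.fit(X_train, y_train)
    scores[meta_name] = r2_score(y_test, stack.predict(X_test))

best_final_estimator = max(scores, key=scores.get)
```

Which technique wins depends on the data, so the outcome on this toy example says nothing about the ISBSG results.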

Table 7 Estimation analysis using R2 score

Fig. 4 ML “estimators” and the average of their predictions

Constructing the software enhancement effort estimation model

Based on Table 7, LinearSVR is selected as the final_estimator. Table 8 shows the stacking ensemble method parameters that define the best combination.

Table 8 Parameters values for grid search

Using the CFS algorithm with the constructed stacking ensemble method leads to an accurate enhancement effort estimation when the enhancement functional size is used as the independent variable (see Table 9). It is evident from the results that the stacking ensemble method delivers the best performance when compared with the other three ML techniques. The r2_score reaches 0.987 (see Figs. 3, 5).

Fig. 5 Regressor estimation score

4 Discussion and comparison

When comparing the estimation accuracy of the models on the same ISBSG dataset, we can accept the two hypotheses formulated in Sect. 1.

  • H1: The enhancement effort estimation accuracy using the stacking ensemble method, with an R2 score of 0.987, is statistically better than that obtained using M5P when the functional change size is used as the independent variable.

  • H2: The use of the CFS algorithm improves the accuracy of the selected ML methods.

The main reason for selecting the enhancement functional size as the primary independent variable in our study is that software functional size is correlated with software project effort, which affects the sensitivity of the software project [56]. Our previous experimental study was conducted to evaluate the accuracy of four machine learning techniques (M5P, GBRegr, LinearSVR, and RFR) separately. The selected ML techniques are used to estimate the effort of a new enhancement while the software is being developed. Among the selected ML algorithms, M5P is the most effective. This is supported by its minimum MAE of 0.0612. Its effectiveness can be seen from the small MAE and RMSE values obtained when applying a simple method. A good accuracy (SA) of 99% is obtained when using 10-fold cross-validation.

To identify the effective determinants for enhancement effort estimation, the importance of each feature is computed using the CFS algorithm. Furthermore, the model using the CFS algorithm delivers superior performance when compared to the model that used all the selected features (17 features). Thus, using the M5P ML algorithm improves the accuracy of enhancement estimation.

Furthermore, to confirm the above results, we investigated the use of a stacking ensemble method that combines the weak ML techniques (GBRegr, LinearSVR, and RFR). The experimental results are compared with those of the M5P algorithm (see Table 9). The effectiveness of the stacking ensemble method can be seen in the results (see Figs. 4 and 6): it achieves the minimum MAE of 0.0383, an RMSE of 0.1973, and a good r2_score of 0.987 (Fig. 7).

Table 9 Estimation analysis using MAE, RMSE and r2_score

Fig. 6 ML techniques performance assessment

Fig. 7 ML techniques accuracy

5 Threats to validity

In this section, we discuss the threats to the validity of this research study according to the guidelines proposed by [57]. The validity of this research results is pertinent to internal validity, external validity, and construct validity.

5.1 Internal validity

Internal validity is related to (i) the size of the dataset, where a larger number of instances would be desirable, and (ii) the number and nature of the attributes used to estimate the software enhancement effort. To mitigate limitation (ii), we used the CFS algorithm to select attributes from one of the well-known historical software project datasets (the ISBSG dataset, which contains many attributes). Since we restricted the study to numerical attributes, only 17 features were selected after the data preprocessing phase, which constitutes 17% of all the attributes in the ISBSG dataset. Six features were then selected using the CFS algorithm, which constitutes 6% of all the attributes in the ISBSG dataset. This is why the findings of this work may differ from those of other studies that use other types of data.

5.2 External validity

External validity is related to the degree to which the results can be generalized. The results of this study are based only on the ISBSG R12 dataset. More experiments with other kinds of datasets that present quality characteristics are also required. There are two threats to the external validity of this study. (1) The first threat may come from the CFS algorithm: although the experiments were performed using CFS, it is still necessary to test other FS algorithms with different ML techniques. (2) The other threat may come from the selected dataset: we used a single popular dataset, ISBSG, containing COSMIC Function Point measures.

5.3 Construct validity

Threats to construct validity are related to (i) the degree of reliability of the features used to predict enhancement effort and (ii) the accuracy metrics used for the analysis. Regarding (i), the estimation of enhancement effort in our study is based on the independent variable (i.e., the size of functional change). Even though the selected ML techniques provide a good accuracy of 99%, the correlation coefficient computed between enhancement functional size and enhancement effort remains moderate. This is because enhancement functional size is identified at a high level of abstraction of the software life cycle. Regarding the accuracy metrics (ii), there has been some criticism of these metrics [47], such as ignoring the importance of dataset quality. Nevertheless, we adopted these four evaluation metrics (MAE, RMSE, R2 score, and SA) in our work.

6 Conclusion

The study was based on two main hypotheses (i) “A good feature subset is one that contains features highly correlated with the class, yet uncorrelated with each other” [43] and (ii) “When powerless (/weak) models are rightly combined we can obtain more accurate software enhancement effort estimation models”. The constructed models are tested using the ISBSG dataset of historical software projects that take into account the use of software functional size expressed in terms of COSMIC Function Point units (as an independent variable). The findings of the research questions were as follows:

  • The correlation coefficient computed between enhancement functional size and enhancement effort has a value of 0.5 which indicates a good correlation. The enhancement functional size was therefore chosen as the primary independent variable.

  • The ML techniques without feature selection generated good accuracy. However, ensemble learning techniques with the CFS algorithm gave better results.

  • The experimental results suggested that:

    • M5P is more accurate than GBRegr, LinearSVR, and RFR, with a small MAE of 0.0612 and a quite good performance of 99%.

    • The stacking ensemble method (combining GBRegr, LinearSVR, and RFR) is more accurate than the M5P algorithm, with a small MAE of 0.0383 and an R2 score of 0.987.

For future work, several extensions can be made. This work will be extended by exploring other ensemble methods for estimating software enhancement effort, with the aim of bringing the accuracy closer to 100%, and other feature selection methods including Backward Elimination, Forward Selection, Bidirectional Elimination, and RFE.