1 Introduction

Software effort estimation deals with predicting the effort essential to build a software project [1]. It covers estimates of schedules and the probable cost and manpower required to build a software project [2]. A software effort estimation approach predicts the practical quantity of effort required to develop and maintain software from uncertain, insufficient, noisy, and inconsistent input data [3]. Software effort estimation has drawn attention since the mid-1970s [4]. It is of utmost importance because inaccurate predictions have serious consequences: overestimation dissipates resources, while underestimation causes projects to exceed their planned budgets [5]. With the boundless expansion of software technology, the significance and size of software have grown, and its increased complexity makes it harder to predict a software project's effort accurately.

The techniques used to estimate software effort comprise algorithmic models, expert judgment, estimation by analogy, and machine learning. The algorithmic models include COCOMO, function points, SLIM, and use case points. Expert judgment is a method that analyzes and utilizes experts' experience to estimate a project. Estimation by analogy compares homogeneous historical projects with the project currently under development. Machine learning techniques have been renowned for the last two decades in estimating software effort [6].

Moving further into the conceptualization of machine learning, its methods build regression models that make use of prior projects and are subsequently employed to estimate a software project's effort. As the convolutions of software projects increase, statistical methods and traditional parametric models often fail to coherently represent the correlation between project features and software effort [7]. With the immense progress of software effort estimation practice, there should be techniques that can compute the effort of rapidly changing software, whose programming tools and skills keep being updated. In this scenario, machine learning is preferable to traditional methods because it has the potential to access historical data, learn from it, and adapt to the wide variations that occur in a software project [8]. Hence, such techniques deal efficiently with complex data while possessing high accuracy.

Periodically, machine learning techniques face difficulty in specifying which effort estimator performs best, because when estimators are compared under modified conditions, any estimator may change its place in the ranking. However, if we merge the estimates of multiple estimators, the resulting method performs better than any single estimator. Taking this assertion into consideration, we turn to ensemble learning, which makes use of multiple learning algorithms to achieve better predictive performance [9]. Integrating ensemble learning into the software effort estimation process can therefore lead to better accuracy by exploiting many algorithms and selecting the best result of all.

The primary goal of this paper is to estimate software effort using an ensemble learning technique and various machine learning techniques. The highlights of the research work are as follows:

  • Proposed gradient boosting regressor (GBR) for estimating the software effort of large-scale projects.

  • Other ensemble learning methods such as the Ada-Boost regressor (ABR), bagging regressor (BR), and random forest regressor (RFR), and machine learning models such as stochastic gradient descent (SGD), K-nearest neighbor (KNN), and decision trees (DT), are used to compare performance with the proposed method.

  • Simulations are carried out using the COCOMO’81 and CHINA datasets.

  • Four performance metrics are considered to analyze the performance of each regressor.

The rest of the paper is structured as follows: Section 2 reviews the literature on techniques for estimating a software project’s effort. Section 3 describes the proposed model along with its framework. Section 4 presents the experimental setup, describing the COCOMO’81 and CHINA datasets, the performance measures, and the environment. Section 5 shows the result analysis with accompanying plots, and Section 6 concludes the work.

2 Literature review

Estimating software effort has always remained a challenging task for machine learning researchers. Singal et al. [10] applied the differential evolution (DE) algorithm to the COCOMO and COCOMO II models to estimate software effort. In this experiment, the COCOMO81 and NASA’93 datasets were used, and the evaluation metric for comparison was MMRE. The effectiveness of the differential evolution algorithm was investigated in enhancing the parameter values of traditional algorithmic models such as COCOMO II and COCOMO. The cost driver values of both models were upgraded with the DE approach, which boosted the accuracy of software effort estimation.

Accurate effort estimation in agile software development projects can help in planning a sprint, which leads to optimal results. Malgonde and Chari [11] experimented with seven algorithmic approaches, namely support vector machine, ridge regression, artificial neural network, K-nearest neighbor, decision tree, linear regression, and Bayesian networks, to determine the method with the best accuracy in estimating software effort, and no single method uniformly outperformed the others. So, they implemented an ensemble-based approach to predict the effort, using data from 24 software development projects. The performance metrics considered for evaluation were mean absolute error (MAE), mean balanced error (MBE), and root mean square error (RMSE). The proposed ensemble-based algorithm outperformed the other similar approaches in predicting software effort.

Abdelali et al. [12] built a random forest (RF) model and experimentally optimized its performance by modifying the key parameters to estimate software effort accurately. The datasets used were ISBSG, Tukutuku, and COCOMO. The evaluation was handled through the 30% hold-out validation method. To evaluate and determine the best-performing technique, three performance metrics were used: median magnitude of relative error (MdMRE), mean magnitude of relative error (MMRE), and Pred(0.25). The obtained RF model was compared with the classical regression tree, and the results proved that the enhanced RF model performed better than the regression tree model.

Fuzzy models for predicting software effort were experimented with by Nassif et al. [13]. Three fuzzy logic models, namely Mamdani, Sugeno with constant output, and Sugeno with linear output, were implemented and compared for estimating software effort, and regression analysis was carried out to aid in their design. The dataset utilized for training and testing the models was the ISBSG dataset, and the performance metrics used for evaluation were MAE, MBRE, mean inverted balanced relative error, standardized accuracy, and the Scott–Knott test. The comparative analysis among the fuzzy models designed with the assistance of regression analysis showed that the Sugeno fuzzy model with linear output performed best.

Pospieszny et al. [14] proposed predicting the effort of software projects effectively by averaging an ensemble of machine learning algorithms, including neural networks, generalized linear models, and support vector machines, with cross-validation. The methods were tested on the ISBSG dataset and evaluated with metrics such as MAE, MMRE, mean squared error (MSE), RMSE, MMER, mean balanced relative error (MBRE), and PRED. The efficacy of the model helped in predicting software effort accurately and within the time duration.

Other works that estimate software effort with different techniques, employing well-known datasets along with real-time industry projects, and the performance metrics evaluated to determine the best-predicting model, are shown in Table 1.

Table 1 Literature study in software effort estimation

3 Proposed methodology

The proposed method’s framework comprises several independent steps. Primarily, data collection is the initial step of our framework, and the succeeding steps are data preprocessing, data cleaning, and data visualization. In our framework, various machine learning and ensemble techniques, namely SGD, KNN, DT, BR, RFR, ABR, and GBR, are compared for software effort estimation. The data are split into training and testing sets in the ratio of 80:20. Our proposed framework is represented in Fig. 1.

Fig. 1 The framework of the proposed model
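A minimal sketch of this pipeline in scikit-learn is given below. The synthetic data, the split seed, and the default model settings are illustrative stand-ins for the actual datasets and tuned parameters, not the paper's exact configuration:

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.linear_model import SGDRegressor
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import (AdaBoostRegressor, BaggingRegressor,
                              GradientBoostingRegressor, RandomForestRegressor)
from sklearn.metrics import r2_score

# Synthetic stand-in for a loaded effort dataset (X: project features, y: effort).
X, y = make_regression(n_samples=499, n_features=18, noise=10.0, random_state=0)

# 80:20 train/test split, as in the framework.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# The seven regressors compared in the framework, with default settings.
models = {
    "SGD": SGDRegressor(max_iter=1000),
    "KNN": KNeighborsRegressor(),
    "DT": DecisionTreeRegressor(),
    "BR": BaggingRegressor(),
    "RFR": RandomForestRegressor(),
    "ABR": AdaBoostRegressor(),
    "GBR": GradientBoostingRegressor(),
}
for name, model in models.items():
    model.fit(X_train, y_train)
    print(f"{name}: R2 = {r2_score(y_test, model.predict(X_test)):.3f}")
```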

3.1 Gradient boosting

In ensemble learning, sets of learning machines are trained and combined to execute the same task, so as to enhance predictive performance. Ensemble learning techniques have drawn the attention of the software effort estimation community as they consistently contribute better performance than single learning techniques [15]. Gradient boosting is an ensemble learning algorithm of machine learning. The primary concept of boosting is to constantly append new base models to the ensemble of learning machines. At every iteration of the learning process, a new weak base-learning model is trained on the errors made by the entire ensemble in earlier iterations. Gradient boosting was developed by Friedman [16]. It solves the minimization problem using gradient descent and develops a prediction model in the form of an ensemble of weak prediction models, typically decision trees.

The gradient boosting regressor is the realization of gradient boosting for regression and entails a weak learner, a loss function, and an additive model. The loss function to be optimized must be differentiable, which is a prerequisite of gradient boosting; its exact form depends on the kind of problem being solved. Decision trees are utilized as the weak learners because their outputs can be added together, allowing each subsequent model's output to be appended to rectify the residuals in the predictions. A greedy approach is used to build the trees, choosing the split that best minimizes the loss. Gradient descent is utilized to minimize the loss as decision trees are appended: after evaluating the loss, a new tree is added following the gradient descent procedure, and its output is appended to the existing ensemble of trees to improve the accuracy of the final output [17].

The procedure of the proposed gradient boosting regressor is as follows:

Algorithm: Gradient boosting regressor model

Input: a training set \( \left\{ \left( x_{i}, y_{i} \right) \right\}_{i = 1}^{n} \), where \( x_{i} \) represents the input (independent) variables and \( y_{i} \) the corresponding output (dependent) variable, and a differentiable loss function \( G\left( y_{i}, F\left( x \right) \right) \).

The loss function is defined as \( \frac{1}{2}\left( {\text{Actual}} - {\text{Predicted}} \right)^{2} \).

1. Initialize the model with a constant value: the function \( F_{0}\left( x \right) \) is set to the constant \( \beta \) that minimizes the summation of the loss function over all samples, where \( y_{i} \) are the observed values and \( \beta \) is the predicted value.

$$ F_{0}\left( x \right) = \mathop{\arg\min}_{\beta} \sum_{i = 1}^{n} G\left( y_{i}, \beta \right) $$
(1)

2. Consider the decision trees, i.e., the weak learners in our case, from the first tree to the last, for \( h = 1 \) to \( H \):

(a) Compute the residuals as the negative derivative of the loss function with respect to the predicted value, for the tree currently being built and for every sample:

$$ q_{ih} = - \left[ \frac{\partial G\left( y_{i}, F\left( x_{i} \right) \right)}{\partial F\left( x_{i} \right)} \right]_{F\left( x \right) = F_{h - 1}\left( x \right)} \quad {\text{for}}\; i = 1, \ldots, n $$
(2)

These residuals are called pseudo-residuals, and the derivative \( \frac{\partial G\left( y_{i}, F\left( x_{i} \right) \right)}{\partial F\left( x_{i} \right)} \) is called the gradient.

(b) After calculating the residuals, fit a regression tree to the pseudo-residual values \( q_{ih} \) and index its terminal regions, i.e., the leaves of the tree, as \( Q_{jh} \) for \( j = 1, \ldots, j_{h} \), where \( h \) denotes the tree.

(c) Compute the output value for each leaf of the tree by minimizing the summation of the loss function over the samples in that leaf, i.e., \( x_{i} \in Q_{jh} \). Each output value is the average of the residuals in the leaf and is stored in \( \beta_{jh} \). For \( j = 1, \ldots, j_{h} \) compute

$$ \beta_{jh} = \mathop{\arg\min}_{\beta} \sum_{x_{i} \in Q_{jh}} G\left( y_{i}, F_{h - 1}\left( x_{i} \right) + \beta \right) $$
(3)

(d) Update the model \( F_{h}\left( x \right) \) by adding to the previous output the summation of the leaf output values, scaled by the learning rate \( \upsilon \), to minimize the loss function. The learning rate controls each tree's contribution; through it, the accuracy of the model can be increased.

$$ F_{h}\left( x \right) = F_{h - 1}\left( x \right) + \upsilon \sum_{j = 1}^{j_{h}} \beta_{jh} I\left( x \in Q_{jh} \right) $$
(4)

3. The loop continues while weak learners keep being added; the output is the model after the \( H \)th tree, \( F_{H}\left( x \right) \), which attains the estimated level of highest accuracy.

The gradient boosting algorithm constructs trees according to the input values, calculates residuals from the observed values, and improves accuracy by considering the predicted outputs of the previous trees. In the above algorithm, the inputs are the independent variables and the dependent variable of all samples, and the loss function is used to calculate the residuals. In step 1, the function \( F_{0}\left( x \right) \) is initialized with the constant that minimizes the summation of the loss function over all samples. In step 2, the residuals are calculated for every tree by taking the derivative of the loss function with respect to the predicted values, and a regression tree is fitted to the residuals. The output of each leaf, the average of the residuals in that leaf, is computed by minimizing the summation of the loss over the leaf's samples and stored. The model is then updated by adding the summation of these outputs to the previous output, and this iteration continues for \( H \) trees until we obtain the estimated accuracy, as described in step 3.
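For illustration, a minimal from-scratch sketch of steps 1–3 under the squared-error loss defined above is given below. The function names are our own, the use of scikit-learn's DecisionTreeRegressor as the weak learner is an implementation choice, and the defaults mirror the COCOMO'81 settings reported in Sect. 4:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gb_fit(X, y, n_trees=69, learning_rate=0.3, max_leaf_nodes=6):
    """Steps 1-3 for the loss G = 0.5*(y - F)^2, whose negative
    gradient (the pseudo-residual) is simply y - F(x)."""
    # Step 1: F0(x) is the constant minimizing the loss, i.e., the mean of y.
    f0 = float(np.mean(y))
    pred = np.full(len(y), f0)
    trees = []
    for _ in range(n_trees):
        residuals = y - pred                      # step 2(a): pseudo-residuals
        tree = DecisionTreeRegressor(max_leaf_nodes=max_leaf_nodes)
        tree.fit(X, residuals)                    # step 2(b): fit tree to residuals
        # Steps 2(c)-(d): each leaf predicts the mean residual of its samples;
        # add the scaled tree output to the running model F_h(x).
        pred = pred + learning_rate * tree.predict(X)
        trees.append(tree)
    return f0, trees

def gb_predict(X, f0, trees, learning_rate=0.3):
    pred = np.full(X.shape[0], f0)
    for tree in trees:
        pred = pred + learning_rate * tree.predict(X)
    return pred
```

With the squared-error loss, the leaf outputs of step 2(c) coincide with the tree's own leaf averages, which is why the sketch needs no explicit argmin.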


4 Experimental setup

The COCOMO’81 and CHINA datasets were employed to evaluate and compare software effort estimation using various regression models. The COCOMO dataset has been widely used in research studies to estimate software effort with traditional algorithms as well as machine learning algorithms [18,19,20,21]. The CHINA dataset has been used in various models to estimate software effort accurately, for instance, in designing a model with an algorithm [22] and in comparing approaches to improve defect and effort estimation models [23]. These two datasets are publicly available in the PROMISE repository [24, 25].

COCOMO’81, proposed by Barry Boehm [26], comprises 63 software projects. It contains 17 numeric attributes in total, of which 15 are effort multipliers divided into four clusters: product, platform, personnel, and project. The effort multipliers under their respective clusters are shown in Table 2. The two attributes other than the 15 effort multipliers are lines of code and actual development effort. The effort is estimated in person-months. The development effort is the dependent variable, whereas the effort multipliers and lines of code are the independent variables.
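Assuming a local CSV export of this dataset, the split into independent and dependent variables might look as follows; the file name and the effort column label are hypothetical and may differ in the actual PROMISE export:

```python
import pandas as pd

# Hypothetical local copy of the PROMISE COCOMO'81 data.
df = pd.read_csv("cocomo81.csv")

X = df.drop(columns=["actual"])  # 15 effort multipliers + lines of code
y = df["actual"]                 # development effort in person-months
```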

Table 2 Effort multipliers and their description of COCOMO’81 dataset

The CHINA dataset covers Chinese software projects and consists of 499 projects with 19 attributes each; size is measured in function points and effort is estimated in person-hours [27]. Features and their descriptions are shown in Table 3 [28].

Table 3 Features and description of the CHINA dataset

4.1 Performance measures

Developing models is not sufficient; they need to be evaluated to verify their accuracy and to know how precise they are. In this experiment, to evaluate and compare the performance of the various regression models, we adopted four performance measures: mean absolute error (MAE), mean squared error (MSE), root mean squared error (RMSE), and the coefficient of determination (R2). These regression metrics are imported from the scikit-learn package.

The MAE is the mean absolute error, the average of the absolute differences between the true and predicted values over all samples. The lower the MAE value, the better the model. The mean_absolute_error function imported from the sklearn metrics package evaluates this metric, which corresponds to the expected value of the absolute error loss. The computation of MAE [22] is shown in Eq. (5)

$$ {\text{MAE}} = \frac{1}{n}\mathop \sum \limits_{i = 1}^{n} \left| {y_{i} - \hat{y}_{i} } \right| $$
(5)

The MSE is the mean squared error, calculated as the average of the squares of the differences between the actual and predicted values over all samples. The lower the MSE value, the better the model. The mean_squared_error function imported from the sklearn metrics package evaluates this metric, which corresponds to the expected value of the squared error loss. The computation of MSE [29] is shown in Eq. (6)

$$ {\text{MSE}} = \frac{1}{n}\mathop \sum \limits_{i = 1}^{n} (y_{i} - \hat{y}_{i} )^{2} $$
(6)

The RMSE is the root mean squared error, evaluated by calculating the square root of the MSE; it can also be defined as the standard deviation of the residuals. Like MAE and MSE, a lower RMSE value indicates a better-performing model. The RMSE is computed by importing the mean_squared_error function from the sklearn metrics package and setting the squared parameter to False. RMSE is described in Eq. (7) [18].

$$ {\text{RMSE}} = \sqrt {\frac{1}{n}\mathop \sum \limits_{i = 1}^{n} (y_{i} - \hat{y}_{i} )^{2} } $$
(7)

In Eqs. (5)–(7), \( y_{i} \) denotes the actual values and \( \hat{y}_{i} \) the corresponding predicted values for a total of \( n \) samples.

The R2 evaluation metric is the coefficient of determination. It indicates how well a model fits a given dataset, i.e., how close the predicted values are to the actual values. The R2 value is at most 1: a value of 1 indicates that the model fits the dataset exactly, while a negative value means the model does not fit the dataset well. Unlike MAE, MSE, and RMSE, the R2 value should be higher, i.e., closer to 1, for a model to perform better. The r2_score function imported from the sklearn metrics package computes the coefficient of determination, which represents the proportion of variance explained by the independent variables in the model. R2 is presented in Eq. (8) [29]

$$ R^{2} \left( {y,\hat{y}} \right) = 1 - \frac{{\mathop \sum \nolimits_{i = 1}^{n} (y_{i} - \hat{y}_{i} )^{2} }}{{\mathop \sum \nolimits_{i = 1}^{n} (y_{i} - \bar{y} )^{2} }} $$
(8)

\( y_{i} \) is the actual value and \( \hat{y}_{i} \) the predicted value of the ith sample for a total of \( n \) samples, and \( \bar{y} = \frac{1}{n}\sum\nolimits_{i = 1}^{n} {y_{i} } \).
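The four measures can be computed together with scikit-learn as sketched below; the effort values are illustrative, not taken from the paper's results:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Illustrative actual and predicted effort values (person-months).
y_true = np.array([2040.0, 1600.0, 243.0, 240.0, 33.0])
y_pred = np.array([1950.0, 1710.0, 260.0, 225.0, 40.0])

mae = mean_absolute_error(y_true, y_pred)
mse = mean_squared_error(y_true, y_pred)
rmse = np.sqrt(mse)  # or mean_squared_error(..., squared=False) in older releases
r2 = r2_score(y_true, y_pred)
print(f"MAE={mae:.1f}  MSE={mse:.1f}  RMSE={rmse:.1f}  R2={r2:.3f}")
```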

4.2 Environmental setup

Experimentation is done on a Lenovo system with the Windows 10 Pro 64-bit operating system, an Intel i5 processor, and 6 GB RAM. The design of the regression models and the evaluation of statistics are done using Python with the NumPy and pandas modules. Visualization is done using the matplotlib and seaborn libraries. We used Python's scikit-learn package to build the various machine learning and ensemble models. Experimentation is carried out using the COCOMO’81 and CHINA datasets, which comprise 63 and 499 projects, respectively. The datasets are split into 80% for training the model and 20% for testing it. Various machine learning and ensemble learning algorithms, namely the stochastic gradient descent (SGD) regressor, K-nearest neighbors (KNN) regressor, decision tree (DT) regressor, bagging regressor (BR), random forest regressor (RFR), Ada-boost regressor (ABR), and gradient boosting regressor (GBR), were used to estimate software effort. For each regression model and dataset, the parameter settings that give the best score are shown in Table 4.

Table 4 Parameter settings of all considered methods w.r.t. respective datasets

The evaluation metrics were computed and compared across the results obtained from different parameter settings, in the process of obtaining the highest accuracy for the regression models and the proposed method. The evaluation of the gradient boosting model's accuracy under different parameter settings on the COCOMO’81 and CHINA datasets is shown in Table 5.

Table 5 Parameter settings and performance metrics of gradient boosting regression model on COCOMO’81 and CHINA datasets

Table 5 shows the changes in the evaluation metrics for different parameter settings applied to the gradient boosting regression model on the COCOMO’81 and CHINA datasets. On the COCOMO’81 dataset, the best score for the R2 metric is 0.98, and the lowest error values for the MAE, MSE, and RMSE metrics are 184.8, 52,314.5, and 228.7, respectively, obtained with the parameter settings n_estimators: 69, learning rate: 0.3, max_leaf_nodes: 6. The “n_estimators” parameter defines the number of trees added to the regression model, and from the table it is evident that increasing “n_estimators” increases the accuracy. Lower values were set for the “learning rate” parameter to make the model more robust. The “max_leaf_nodes” parameter caps the number of leaf nodes in each tree; from the table, we can state that a small number of leaf nodes is sufficient to boost the accuracy of the model.

For the proposed model on the CHINA dataset, the best R2 score of 0.93 and the MAE, MSE, and RMSE values of 676.6, 3,252,196.6, and 1803.3, respectively, are obtained with the parameter settings n_estimators: 96, subsample: 0.7, criterion: mae, ccp_alpha: 0.9. With the increase in “n_estimators,” the accuracy increases, as more trees are appended, each learning from the previous errors, which results in the best score. The “subsample” parameter is the fraction of samples considered for each tree; values slightly below 1 yield a more robust model. The “criterion” parameter defines the quality of a split; with this dataset, the “mae” criterion gives a better score than the other criteria. The “ccp_alpha” parameter is utilized for cost-complexity pruning and is set according to the cost-complexity of the subtree; a value slightly less than 1 gives the best result on the CHINA dataset.
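For reference, the two best-scoring configurations described above could be instantiated as follows. This is a sketch tied to the scikit-learn versions the experiments appear to assume; in particular, the "mae" criterion spelling belongs to older releases and was later deprecated and removed:

```python
from sklearn.ensemble import GradientBoostingRegressor

# Best-scoring settings reported for COCOMO'81 (Table 5).
gbr_cocomo = GradientBoostingRegressor(
    n_estimators=69,     # number of boosting stages (trees)
    learning_rate=0.3,   # shrinks each tree's contribution
    max_leaf_nodes=6,    # caps tree size, keeping each learner weak
)

# Best-scoring settings reported for CHINA (Table 5).
gbr_china = GradientBoostingRegressor(
    n_estimators=96,
    subsample=0.7,       # fraction of samples drawn for each tree
    criterion="mae",     # spelling in older scikit-learn releases only
    ccp_alpha=0.9,       # cost-complexity pruning strength
)
```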

5 Result analysis

In this section, the performance of the various machine learning algorithms is analyzed and evaluated using MAE, MSE, RMSE, and R2 on the COCOMO’81 and CHINA datasets. The statistical results for each regression model on the COCOMO’81 and CHINA datasets are tabulated in Tables 6 and 7, respectively.

Table 6 Performance metrics evaluated using COCOMO’81 dataset for various regression models
Table 7 Performance metrics evaluated using CHINA dataset for various regression models

Table 6 describes the performance metrics of the considered regression models on the COCOMO’81 dataset. From the results in Table 6, the order of the regression models by mean absolute error, from lowest to highest, is BR, GBR, ABR, KNN, RFR, DT, and SGD, with values 153, 184.8, 229.4, 397.9, 402.6, 429, and 860.5, respectively. Similarly, the order by mean squared error, from lowest to highest, is GBR, ABR, BR, RFR, KNN, DT, and SGD, with values 52,314.5, 69,936.2, 109,995.3, 293,674.5, 481,104.3, 639,741.4, and 1,498,019.6, respectively. For root mean squared error, the order from lowest to highest is GBR, ABR, BR, RFR, KNN, DT, and SGD, with values 228.7, 251.7, 331.6, 541.9, 693.6, 799.8, and 1223.9, respectively. Unlike MAE, MSE, and RMSE, which should be low for an accurate model, the R2 metric should be high and close to 1.0 for the most efficient model. From Table 6, the order of the regression models by R2, from highest to lowest, is GBR, ABR, BR, RFR, KNN, DT, and SGD, with scores 0.98, 0.97, 0.96, 0.89, 0.83, 0.77, and 0.47, respectively. So, it is evident from the results in Table 6 that GBR attains optimal performance compared with the other regression models on the COCOMO’81 dataset.

Table 7 presents the performance metrics of the regression models on the CHINA dataset. The order of the regression models by MAE, from lowest to highest, is GBR, RFR, BR, DT, ABR, SGD, and KNN, with values 676.6, 810.1, 903.9, 1268.7, 1483.8, 2103.5, and 2268.2, respectively. Similarly, for MSE, the order from lowest to highest is GBR, RFR, BR, DT, ABR, KNN, and SGD, with values 3,252,196.6, 5,235,939.8, 5,786,313.0, 6,325,892.0, 7,431,929.3, 19,883,523.4, and 20,462,127.5, respectively. Correspondingly, for RMSE, the order from lowest to highest is GBR, RFR, BR, DT, ABR, KNN, and SGD, with values 1803.3, 2282.2, 2405.4, 2515.1, 2726.1, 4459.0, and 4523.5, respectively. For the R2 metric, the regression models ordered by scores closest to 1.0 are GBR, RFR, BR, DT, ABR, KNN, and SGD, with values 0.93, 0.89, 0.88, 0.87, 0.85, 0.61, and 0.60, respectively. From these results, it is noticed that the gradient boosting regression model outperformed the other models on the CHINA dataset.

From the results in Tables 6 and 7, i.e., by comparing the various regression models and analyzing the different evaluation metrics, we conclude that our proposed regression model, the gradient boosting regressor, performed best among the models on both datasets. The GBR outperformed the other models with an R2 score of 0.98 on the COCOMO’81 dataset, and on the CHINA dataset it obtained the highest score of 0.93 compared with the other regression models.

Figures 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14 and 15 present scatter plots for the CHINA and COCOMO’81 datasets. Figures 2, 3, 4, 5, 6, 7 and 8 show the scatter plots of the regression models on the CHINA dataset, and Figs. 9, 10, 11, 12, 13, 14 and 15 visualize the scatter plots of the regression models on the COCOMO’81 dataset. Scatter plots are used to identify relationships between variables and to examine patterns. In this case, the plots of true versus predicted values show a strong, linear, positive relationship. The line passing through the origin in each scatter plot indicates how well the model fits: the closer the points lie to the line and the fewer the outliers, the better the model.
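A sketch of how such an actual-versus-predicted plot with its identity line can be produced is given below; the effort values are illustrative and would in practice come from a fitted regressor's test-set predictions:

```python
import numpy as np
import matplotlib.pyplot as plt

# Illustrative values; in practice y_test and y_pred come from a fitted model.
y_test = np.array([2040.0, 1600.0, 243.0, 240.0, 33.0, 423.0])
y_pred = np.array([1950.0, 1710.0, 260.0, 225.0, 40.0, 400.0])

plt.scatter(y_test, y_pred, alpha=0.7)
lims = [min(y_test.min(), y_pred.min()), max(y_test.max(), y_pred.max())]
plt.plot(lims, lims, "r--", label="perfect prediction")  # identity (y = x) line
plt.xlabel("Actual effort")
plt.ylabel("Predicted effort")
plt.legend()
plt.show()
```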

Fig. 2 Actual versus predicted using SGD in China dataset

Fig. 3 Actual versus predicted using KNN in China dataset

Fig. 4 Actual versus predicted using DT in China dataset

Fig. 5 Actual versus predicted using BR in China dataset

Fig. 6 Actual versus predicted using RFR in China dataset

Fig. 7 Actual versus predicted using ABR in China dataset

Fig. 8 Actual versus predicted using GBR in China dataset

Fig. 9 Actual versus predicted using SGD in COCOMO’81 dataset

Fig. 10 Actual versus predicted using KNN in COCOMO’81 dataset

Fig. 11 Actual versus predicted using DT in COCOMO’81 dataset

Fig. 12 Actual versus predicted using BR in COCOMO’81 dataset

Fig. 13 Actual versus predicted using RFR in COCOMO’81 dataset

Fig. 14 Actual versus predicted using ABR in COCOMO’81 dataset

Fig. 15 Actual versus predicted using GBR in COCOMO’81 dataset

The proposed GBR model gives the best fit line on both datasets compared with the other models. The COCOMO’81 dataset has fewer projects than the CHINA dataset, and since only 20% of the data is used for testing, fewer data points are plotted for the COCOMO’81 dataset. Though fewer in number, the data points fit closely to the regression line, expressing the best fit.

For better visualization and understanding, we plotted boxplots for each evaluation metric (MAE, MSE, RMSE, and R2) of the various regression models on both datasets. Figures 16, 17, 18 and 19 compare the evaluation metrics of the distinct regression models on both datasets. The boxplots show how the errors and performance of the regression models are distributed: each box marks the median, and the quartiles represent the lower and higher values. We considered the evaluation metrics of both datasets to gain a detailed understanding of which regression model is better.

Fig. 16 MAE in COCOMO’81 and China dataset

Fig. 17 MSE in COCOMO’81 and China dataset

Fig. 18 RMSE in COCOMO’81 and China dataset

Fig. 19 R2 in COCOMO’81 and China dataset

The boxplot of MAE in Fig. 16 shows that the box of our proposed GBR model sits at the bottom compared with the other regressor models, meaning the MAE values of the GBR on both datasets are comparatively lower than those of the other models. The visualizations of the GBR in Figs. 17 and 18, the boxplots of MSE and RMSE, are similar to the MAE boxplot, showing that the MSE and RMSE values of the GBR model are also lower than those of the other models. The boxplot of R2 in Fig. 19 provides an important insight into how much better our proposed model performs compared with the others. For the R2 metric, a score close to 1.0 indicates a better model; therefore, unlike the other three evaluation metrics, the R2 boxplot is read in the opposite direction.

Figures 20 and 21 compare the predicted effort of each regression model with the actual effort for the two datasets. These plots show how each model's predicted effort contrasts with the actual effort, giving a better understanding of each model's performance on the particular dataset. Figure 20 shows the line plot of the regression models' predicted efforts against the actual effort of the COCOMO’81 dataset; the GBR line tracks the actual effort line, i.e., the predicted effort of the proposed model is relatively close to the actual effort. Figure 21 shows the corresponding line plot for the CHINA dataset, where the GBR's effort line is adjacent to the actual effort line. Hence, the GBR model's predicted effort is comparatively better than that of the other models.

Fig. 20 Actual effort and predicted efforts of regression models in the case of COCOMO’81

Fig. 21 Actual effort and predicted efforts of regression models in the case of the China dataset

All in all, considering the different evaluation metrics, our proposed regression model, the gradient boosting regressor, performed well compared with the other models. For better comprehension, we examined the visualizations of each regression model on each dataset, where the proposed model showed a better fit, and the visualization of the evaluation metrics on both datasets likewise showed the better performance of the proposed model. Finally, the gradient boosting regressor performs well on the COCOMO’81 and CHINA datasets compared with SGD, KNN, DT, BR, RFR, and ABR.

Table 8 shows previous works on software effort estimation using various techniques, with results concerning the COCOMO and CHINA datasets. When these models are compared with ours, our proposed model outperforms the previous models in estimating software effort.

Table 8 Evaluated values of previous paper models in estimating software effort

6 Conclusion

Software effort estimation is very significant and necessary for software projects. Many machine learning models and traditional algorithms, such as COCOMO, SLIM, and function points, are used for software effort estimation. In this paper, a gradient boosting regressor method is proposed for effective software effort estimation. We also studied the problem of estimating the effort of software projects by adopting SGD, KNN, DT, BR, RFR, and ABR on two datasets, COCOMO’81 and CHINA. We evaluated the performance of these models using MAE, MSE, RMSE, and R2. For a model to be accurate, the MAE, MSE, and RMSE should be low, and from the results it is observed that the gradient boosting algorithm achieved lower values than the rest of the models. For the R2 metric, a model is considered efficient if the value is high and close to 1.0; the proposed method obtained R2 values of 0.98 and 0.93 on the COCOMO’81 and CHINA datasets, respectively. The study proved that the gradient boosting regressor's performance is outstanding on the COCOMO’81 and CHINA datasets across all performance measures. In future work, other ensemble learning models may be adopted to estimate the effort of software projects, and special attention may be paid to large-sized projects.