1 Introduction

The process of estimating software development efforts prior to and/or during the development stage is critical to the success of a software project and to reducing risk. Software projects do not have the same structure and nature and so the estimation of the effort process may become a challenging task [1, 2]. In the literature, several machine learning (ML) methods have been proposed to enhance software development effort estimation (SEE) [3,4,5,6,7]. Furthermore, a wide variety of research articles have been published on the optimization of the parameters of the three COCOMO-based models by employing an artificial neural network (ANN) [8,9,10,11].

Commonly, ANNs have been employed to tackle software estimation problems due to their suitability for arbitrary accuracy and superior predictive ability [12, 13]. They have also been used in various real-world applications [10, 14,15,16], proving capable of tackling estimation issues very effectively because of their learning capacity. The ANN learning technique mainly involves the updating of the weights and biases of neurons as they modify the transmission of signals among interconnected neurons. Therefore, the learning technique defines how the weights and biases should be updated in order, for example, to optimize the loss function.

Several studies have employed various types of ANNs to address SEE problems. For instance, [17] compared four previous studies [18,19,20,21] which used different neural network architectures, namely a multilayer perceptron (MLP), a general regression neural network (GRNN), a radial basis function neural network (RBF), and cascade correlation neural networks (CCNN), to estimate software project efforts. In their study, the authors of [17] found that each network outperformed the previous one, confirming the ability of different types of networks to address the SEE problem but with varying capabilities. One of the most popular and efficient learning techniques employed to train an ANN is the backpropagation algorithm, which has proved able to address a variety of estimation and classification issues [22].

The backpropagation algorithm utilizes a local search-based technique called gradient descent to minimize the value of the error function in the weight space [23]. The fully connected neural network (FCNN) has many superior properties, including an excellent self-learning technique, robustness, self-adaptation, and generalization capacity [24]. In addition, a three-layer FCNN can tackle non-linear functions [25]. The FCNN can also be applied to a wide range of research fields, such as function approximation, image processing, and pattern recognition [26]. Nevertheless, the FCNN suffers from many drawbacks, such as slow convergence speed [27,28,29], becoming easily stuck in local optima [27, 28], low convergence accuracy [28], and high dependency on initial parameters [29]. To overcome these drawbacks, several metaheuristic algorithms have been proposed, most notably the genetic algorithm (GA) [30], particle swarm optimization (PSO) [31], the artificial bee colony (ABC) [32], the whale optimization algorithm [33], biogeography-based optimization (BBO) [34], the firefly optimization algorithm (FFA) [35], the bat algorithm (BA) [36], and the cuckoo search (CS) [37].

In this study, we propose a novel technique known as GWO-FC, in which the gray wolf optimizer (GWO) [38, 39] is integrated with the FCNN to optimize the FCNN parameters (i.e., weights and biases) so that they are more sensitive to tackling the SEE problem. It is essential to find a low-complexity and high-utility estimation method. In this regard, the GWO is fast, robust, and has simple features [40] which support dependability. The motivation for this work is the priority that must be given to managing the expenditure and effort incurred during the software project development cycle. The effort estimation process aims to provide an accurate estimation of the cost of software development, as well as assist in the efficient use and allocation of human and computational resources for development tasks.

To validate the SEE findings obtained by the GWO-FC approach, 12 dataset instances for the SEE were selected from different repositories, such as PROMISE and GitHub. First, the data preparation step for the selected datasets with parameter configurations was conducted. Then, the performance of the proposed GWO-FC was studied and analyzed in terms of convergence behavior. After this, a statistics-based evaluation was performed to compare the GWO-FC against traditional FCNN methods. Finally, the efficiency of the proposed GWO-FC was further validated by comparing its results with those of selected state-of-the-art methods. The analysis of the findings shows that the GWO-FC approach is a viable method for the SEE problem in the field of software engineering.

It should be noted that the gray wolf optimizer (GWO) has been used in previous studies to tackle the problem addressed in this study [37, 41]. In [41], the authors used the GWO algorithm to address the shortcomings of conventional software prediction methods that result from imprecise model construction and erroneous outcomes. They combined three metaheuristic algorithms, namely the GWO, harmony search (HSA), and strawberry (SB) algorithms, to optimize the COCOMO effort estimation method and applied the developed model to a NASA dataset. Their study did not use ML methods with the GWO to address the effort prediction problem; instead, they opted for outdated traditional methods. In [37], a combination of the GWO and SB algorithms was utilized to build a parametric model for the SEE problem; GWO was used to optimize the weights of a deep neural network, while SB was used to improve its learning rate. However, their model suffers from a high convergence time as well as a poor balance between exploitation and exploration.

In short, the ultimate goal of this study is to integrate the GWO into the FCNN to tackle the SEE problem. To achieve this, the following contributions are made, in order:

  • Introduction of the FCNN to tackle the SEE problem;

  • Use of the GWO algorithm as a learning technique for the FCNN to identify the best parameter values, consequently enhancing the estimation ability;

  • Formulation of the FCNN parameters as input solutions for the GWO;

  • Use of the GWO to find the optimal vector to use as the optimal parameters for the FCNN;

  • Evaluation of the proposed GWO-FC using several well-known benchmark datasets.

The remainder of this article is organized as follows: Sect. 2 provides the background to the study and an overview of the GWO and FCNN, and Sect. 2.3 presents the SEE problem. The proposed method is discussed in Sect. 3, while Sect. 4 presents the experimental results and performance evaluation. Sect. 6 concludes the article.

2 Background

In this section, the GWO is thoroughly discussed and the mathematical formulation of the FCNN presented.

2.1 Grey wolf optimizer

Mirjalili et al. [38] developed the GWO as a population-based metaheuristic algorithm inspired by the social leadership and hunting behavior of a pack of gray wolves. In the GWO, three dominant leaders, \(\alpha\), \(\beta\), and \(\delta\), can lead the remainder of the pack, called \(\omega\), to the candidate regions to discover the global solution. The hunting mechanism consists of three stages: encircling, hunting, and attacking the prey.

Encircling: As seen in Eqs. 1 and 2, it is possible to mimic how wolves might surround their prey:

$$\begin{aligned}{} & {} D = |C * X_p(t) - X(t)| \end{aligned}$$
(1)
$$\begin{aligned}{} & {} X(t+1) = X_p(t) - A * D \end{aligned}$$
(2)

where the prey position is denoted by \(X_p\), the position vector of a gray wolf by X, and the current iteration by t. The coefficient vectors A and C are computed by Eqs. 3 and 4:

$$\begin{aligned}{} & {} A = 2 * a(t) * r_1 - a(t) \end{aligned}$$
(3)
$$\begin{aligned}{} & {} C = 2 * r_2 \end{aligned}$$
(4)

where \(r_1\) and \(r_2\) are vectors with random values ranging from 0 to 1. The components of the vector a decrease linearly from 2 to 0 over the iterations according to Eq. 5:

$$\begin{aligned} a(t) = 2 - \frac{(2 * t)}{MaxIter} \end{aligned}$$
(5)

Hunting: In order to model the wolves’ hunting behavior mathematically, it is presumed that \(\alpha\), \(\beta\), and \(\delta\) have better knowledge of the prey’s location. Therefore, given the location of the three best solutions \(\alpha\), \(\beta\), and \(\delta\), the other wolves \(\omega\) are forced to follow. The description of the hunting behavior is represented by Eqs. 6, 7, and 8:

$$\begin{aligned} \begin{aligned} D_{\alpha } = |C_1 * X_{\alpha } - X(t)| \\ D_{\beta } = |C_2 * X_{\beta } - X(t)| \\ D_{\delta } = |C_3 * X_{\delta } - X(t)| \end{aligned} \end{aligned}$$
(6)

where the coefficients \(C_1\), \(C_2\), and \(C_3\) are computed by Eq. 4.

$$\begin{aligned} \begin{aligned} X_{i1}(t) = X_{\alpha }(t) - A_{i1} * D_{\alpha }(t) \\ X_{i2}(t) = X_{\beta }(t) - A_{i2} * D_{\beta }(t) \\ X_{i3}(t) = X_{\delta }(t) - A_{i3} * D_{\delta }(t) \end{aligned} \end{aligned}$$
(7)

where \(X_{\alpha }\), \(X_{\beta }\), and \(X_{\delta }\) are the three best solutions at iteration t, the coefficients \(A_{i1}\), \(A_{i2}\), and \(A_{i3}\) are computed by Eq. 3, and \(D_{\alpha }\), \(D_{\beta }\), and \(D_{\delta }\) are computed by Eq. 6.

$$\begin{aligned} X(t + 1) = \frac{X_{i1}(t) + X_{i2}(t) + X_{i3}(t)}{3} \end{aligned}$$
(8)
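The hunting step can be illustrated with a minimal Python sketch. It assumes that a fresh pair of random numbers is drawn per leader and per dimension, and the function name `gwo_update` and the sample positions are hypothetical, chosen only for illustration.

```python
import random

def gwo_update(x, leaders, a, rng):
    """Move one wolf x toward the three leaders (Eqs. 3, 4, 6, 7, 8)."""
    new_x = []
    for d in range(len(x)):
        candidates = []
        for leader in leaders:                    # alpha, beta, delta
            A = 2.0 * a * rng.random() - a        # Eq. 3
            C = 2.0 * rng.random()                # Eq. 4
            D = abs(C * leader[d] - x[d])         # Eq. 6
            candidates.append(leader[d] - A * D)  # Eq. 7
        new_x.append(sum(candidates) / 3.0)       # Eq. 8: average of the three
    return new_x

rng = random.Random(0)
wolf = [4.0, -3.0]                                # illustrative position
leaders = [[0.1, 0.0], [0.0, 0.2], [-0.1, 0.1]]   # illustrative alpha, beta, delta
print(gwo_update(wolf, leaders, a=1.0, rng=rng))
```

When a reaches 0, the coefficient A vanishes and the wolf lands exactly on the mean of the three leaders, which is the limiting behavior at the end of the run.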

Attacking: Hunting ends when the prey stops moving and the wolves attack. This is modeled mathematically by linearly decreasing the value of a over the course of the iterations, which controls the balance between exploration and exploitation. Eq. 5 shows that the value of a is updated in each iteration, decreasing from 2 to 0. Emary et al. [42] recommend that 50% of the iterations are used for exploration and the remaining iterations for exploitation in a seamless transition. In the exploitation phase, each wolf moves to a random location in the range between its current location and that of the prey.
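As a brief worked check of this schedule, the range of A can be derived from Eqs. 3 and 5. This is a sketch that assumes \(r_1\) is drawn from [0, 1] and uses the usual GWO reading that \(|A| > 1\) pushes wolves away from the prey while \(|A| < 1\) pulls them toward it, a convention that is not spelled out in this section:

$$\begin{aligned} A = 2 * a(t) * r_1 - a(t) \in [-a(t),\, a(t)], \qquad a(t) > 1 \iff t < \frac{MaxIter}{2} \end{aligned}$$

For example, with MaxIter = 300, Eq. 5 gives a(0) = 2, a(150) = 1, and a(300) = 0, so values with \(|A| > 1\) can only occur in the first half of the run. This is consistent with the even split between exploration and exploitation mentioned above.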

Figure 1 depicts a full flowchart of the GWO. Generally, within the search space, the method begins with an initial random formation of wolves. Next, the fitness of each solution (wolf position) is evaluated. The remaining steps are repeated until the halting requirement is met. The maximum number of iterations is defined as the halting requirement. In each iteration, the highest ranked solutions (i.e., wolves \(\alpha\), \(\beta\), and \(\delta\)) with the best fitness values are considered. Subsequently, the location of each wolf is updated in the above stages (i.e., encircling, hunting, and attacking). Through repetition of the above three stages, the best prey position can be determined, which is \(\alpha\)’s position.
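The overall loop can be sketched as follows. This is an illustrative implementation under stated assumptions, not the authors' code: the sphere objective, the clipping of positions to the search bounds, and the names `gwo` and `sphere` are all hypothetical choices made for the example.

```python
import random

def sphere(x):
    """Toy objective (assumption): minimum value 0 at the origin."""
    return sum(v * v for v in x)

def gwo(fitness, dim, lo, hi, n_wolves=30, max_iter=300, seed=0):
    rng = random.Random(seed)
    # Initial random formation of wolves inside the search space.
    wolves = [[rng.uniform(lo, hi) for _ in range(dim)] for _ in range(n_wolves)]
    for t in range(max_iter):
        # Rank by fitness and copy the three best as alpha, beta, delta.
        leaders = [list(w) for w in sorted(wolves, key=fitness)[:3]]
        a = 2.0 - 2.0 * t / max_iter                    # Eq. 5
        for w in wolves:
            for d in range(dim):
                total = 0.0
                for leader in leaders:
                    A = 2.0 * a * rng.random() - a      # Eq. 3
                    C = 2.0 * rng.random()              # Eq. 4
                    total += leader[d] - A * abs(C * leader[d] - w[d])  # Eqs. 6, 7
                # Eq. 8, then clip to the bounds (clipping is an assumption).
                w[d] = min(hi, max(lo, total / 3.0))
    return min(wolves, key=fitness)                     # alpha of the final pack

best = gwo(sphere, dim=3, lo=-10.0, hi=10.0)
print(best, sphere(best))
```

The defaults of 30 wolves and 300 iterations mirror the configuration reported later in the article; any fitness function that maps a position list to a number can be passed in place of `sphere`.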

Fig. 1 Flowchart of the GWO algorithm

2.2 Fully connected neural network

One of the most popular ANN models is the FCNN. It is widely used to tackle regression and classification problems [43, 44]. Essentially, the FCNN contains several layers made up of processing elements called neurons. The layers are stacked parallel to each other, with neurons distributed over each layer. In addition, every neuron in one layer is connected to every neuron in the next layer, as shown in Fig. 2. The input layer is the first layer, in which the network receives its input variables, while the output layer is the last. The layers between the input and output layers are known as hidden layers.

Fig. 2 Basic structure of fully-connected neural network

Weights are associated with all neuronal connections, which determine the impact of the relevant inputs on neurons. In addition, each neuron contains an aggregation function and an activation function to produce its output, with all neurons in a given layer sharing the same activation function. The aggregation function, shown in Eq. 9, computes the weighted sum of the inputs. The activation function (Eq. 14) is then applied to this weighted sum to produce the neuron’s output.

$$\begin{aligned} net_j = \sum ^{n}_{i=1} w_{ij} * x_i + b_j \end{aligned}$$
(9)

where the \(i^{th}\) input variable is denoted by \(x_i\), the bias of the \(j^{th}\) neuron is denoted by \(b_j\), and the connection weight from the \(i^{th}\) input to the \(j^{th}\) neuron is denoted by \(w_{ij}\).
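As a small worked example of Eq. 9, consider a neuron with n = 2 inputs. The numbers below are illustrative values chosen for this example, not values from the study: \(x_1 = 1\), \(x_2 = 2\), \(w_{1j} = 0.5\), \(w_{2j} = -0.2\), and \(b_j = 0.1\).

$$\begin{aligned} net_j = 0.5 * 1 + (-0.2) * 2 + 0.1 = 0.2 \end{aligned}$$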

To calibrate the network connection weights, the FCNN training technique is employed, consisting of the forward propagation (FP) and backward propagation (BP) stages. In the FP stage, data are fed to the input layer and passed through the hidden layer to the output layer. During this pass, the aggregation and activation functions compute the output of each neuron, while the weights and biases remain unchanged. The estimation error e is then measured in the output layer by computing the difference between the real and anticipated outputs. The obtained error is back-propagated to the hidden layer in the BP stage to adjust the weights and biases according to the value of e so that the error is reduced. The weights \(\omega _{ij}\) and \(\omega _{jk}\) are updated as in Eqs. 10 and 11, respectively. In addition, the biases \(b_j\) and \(b_k\) are updated as in Eqs. 12 and 13, respectively. These two stages are repeated iteratively until the value of e approaches zero or a tolerable limit. Thus, the FCNN is trained to reduce the overall network error, which can be viewed as an optimization problem [45]. The training technique is portrayed in Fig. 3.

$$\begin{aligned}{} & {} \widehat{\omega _{ij}} = \omega _{ij} + \eta H_j(1 - H_j) I_i \sum _{k=1}^{t} \omega _{jk}e_k \end{aligned}$$
(10)
$$\begin{aligned}{} & {} \widehat{\omega _{jk}} = \omega _{jk} + \eta H_j e_k \end{aligned}$$
(11)

where the adjusted weights are denoted by \(\widehat{\omega _{ij}}\) and \(\widehat{\omega _{jk}}\), the original weights are denoted by \(\omega _{ij}\) and \(\omega _{jk}\), and the learning rate is denoted by \(\eta\).

$$\begin{aligned}{} & {} \widehat{b_j} = b_j + \eta H_j (1 - H_j) \sum _{k=1}^{t} \omega _{jk}e_k \end{aligned}$$
(12)
$$\begin{aligned}{} & {} \widehat{b_k} = b_k + \eta e_k \end{aligned}$$
(13)

where the adjusted biases are denoted by \(\widehat{b_j}\) and \(\widehat{b_k}\), the original biases are denoted by \(b_j\) and \(b_k\), and the learning rate is denoted by \(\eta\).
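A single training step for one sample can be sketched in Python as follows. Several points are assumptions made for this illustration: \(H_j\) is read as the sigmoid output of hidden neuron j, \(I_i\) as the \(i^{th}\) input, and \(e_k\) as the target minus the output; the hidden-layer factor is the sigmoid derivative \(H_j(1 - H_j)\); and the output bias step is scaled by \(\eta\) as in the standard delta rule. The name `bp_step` is hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def bp_step(x, target, w_ij, b_j, w_jk, b_k, eta=0.1):
    """One FP + BP update for a single sample; parameters are modified in place.

    w_ij[i][j]: weight from input i to hidden j; w_jk[j][k]: hidden j to output k.
    Returns the mean squared error measured before the update.
    """
    n_in, n_hid, n_out = len(x), len(b_j), len(b_k)
    # Forward pass: sigmoid hidden layer, linear output layer.
    H = [sigmoid(sum(w_ij[i][j] * x[i] for i in range(n_in)) + b_j[j])
         for j in range(n_hid)]
    out = [sum(w_jk[j][k] * H[j] for j in range(n_hid)) + b_k[k]
           for k in range(n_out)]
    e = [target[k] - out[k] for k in range(n_out)]
    # Error propagated back to each hidden neuron, using the old output weights.
    back = [sum(w_jk[j][k] * e[k] for k in range(n_out)) for j in range(n_hid)]
    for j in range(n_hid):
        grad = H[j] * (1.0 - H[j]) * back[j]
        for i in range(n_in):
            w_ij[i][j] += eta * grad * x[i]   # Eq. 10
        b_j[j] += eta * grad                  # Eq. 12
        for k in range(n_out):
            w_jk[j][k] += eta * H[j] * e[k]   # Eq. 11
    for k in range(n_out):
        b_k[k] += eta * e[k]                  # Eq. 13, scaled by eta (assumption)
    return sum(err * err for err in e) / n_out

# Illustrative 2-2-1 network and sample (values are arbitrary).
w_ij, b_j = [[0.1, -0.2], [0.3, 0.4]], [0.0, 0.0]
w_jk, b_k = [[0.5], [-0.5]], [0.0]
for _ in range(50):
    err = bp_step([0.5, 0.2], [0.7], w_ij, b_j, w_jk, b_k)
print(err)
```

Repeating the step on the same sample drives the error toward zero, which corresponds to the iterative repetition of the FP and BP stages described above.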

Fig. 3 Backpropagation training technique

2.3 The software development effort estimation issue

The estimation of software development effort can be defined as the process of estimating the practical amount of effort necessary to develop a software project from inconsistent, incomplete, noisy, and uncertain input data [46]. This is the crucial moment when the project manager needs to estimate in advance the substantial resources required for the development of a software project [47]. A reasonable estimate of the effort needed to build a software project facilitates the allocation of resources to project tasks, supports accurate cost estimation, reduces failures and delays in development, and smooths the project schedule [10]. The possibility of stakeholders accepting or rejecting a software project is a substantial factor in estimating the effort involved in realizing a software project [4]. In general, underestimation and/or overestimation are the main issues encountered in the forecasting process, with underestimation leading to understaffing of the project, delivery delays, and inaccurate forecasting of budget expenditure. In contrast, overestimation will cause project overrun and loss of resources [48, 49].

Generally, effort is measured in monetary terms and in person-hours, that is, the number of person-hours needed to develop the software. In [50], there is an explanation of the general factors that may lead to software failure, including the risk of mismanagement, unrealistic software project objectives, employment of immature technology, inaccurate definition of system requirements, incompetence in managing the complexity of the project, incorrect project status reports, disagreements between stakeholders, labor market pressures and difficulties, and miscommunication between customers/users and software developers. Nevertheless, the success of the software depends mainly on the accuracy of estimating the efforts made to develop it [51]. Consequently, effort estimation must be optimized for a software project because correct estimation is desired by both developers and clients. Furthermore, estimating the effort required assists the developer in building and controlling a software project efficiently, as well as enabling the client to plan project contract completion dates, negotiations, and prototype release dates. Although there are many methods of estimating software development effort, it remains difficult for researchers to develop a reliable approach to estimating development efforts.

3 Proposed method

The proposed method involves combining the FCNN and the GWO, as shown in Fig. 4. The FCNN design consists of three layers: an input layer, a single hidden layer, and an output layer. A single hidden layer is sufficient to enhance the estimation accuracy of the problem addressed in this study [52, 53], and a three-layered FCNN has the ability to address any non-linear function [54]. In the input layer, the number of neurons is based on the number of dataset features. In the hidden layer, the number of neurons is established through the trial-and-error method [55]. The total estimated effort is determined by the output layer, which contains only one neuron.

Fig. 4 Proposed GWO-FC method

All neurons in the hidden layer use the same sigmoid activation function, given in Eq. 14, together with the aggregation function. In general, most ANN types use the S-shaped sigmoid function because it is thought to be suitable for reducing the influence of overfitting and accelerating the training of the model [56,57,58,59]. The output layer neurons, in turn, have a linear activation function, as in Eq. 15.

$$\begin{aligned}{} & {} f(x) = \frac{1}{(1+e^{-x})} \end{aligned}$$
(14)
$$\begin{aligned}{} & {} f(x) = x \end{aligned}$$
(15)
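Combining Eqs. 9, 14, and 15, the estimate produced by the three-layer network can be written in one line. In this sketch, the symbols \(\hat{y}\) for the estimated effort and h for the number of hidden neurons are introduced here for illustration, and \(\omega _{jk}\) and \(b_k\) are the output-layer weight and bias from Eqs. 11 and 13:

$$\begin{aligned} \hat{y} = \sum _{j=1}^{h} \omega _{jk} * f(net_j) + b_k, \qquad f(net_j) = \frac{1}{(1+e^{-net_j})} \end{aligned}$$

For instance, a hidden neuron with \(net_j = 0\) outputs exactly 0.5, and the linear output neuron passes its weighted sum through unchanged.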

After the FCNN has finished its training process, the vector form of the adjusted parameters (weights and biases) is retrieved and delivered as shown in the following expression and Fig. 5:

$$\begin{aligned} (\textbf{w}, \textbf{b}) = (\textbf{w}_{1,1},\textbf{w}_{1,2},...,\textbf{w}_{n,n}, ~~~\textbf{b}_{1,1},\textbf{b}_{1,2},...,\textbf{b}_{n,n}) \end{aligned}$$

where the number of input nodes is denoted by n, the connection weight is denoted by \(w_{ij}\), and the bias of the hidden node is denoted by \(b_j\).

Fig. 5 Single vector (Solution) extracted from FCNN

This vector consists of two parts, as shown in Fig. 5. The first (blue side) is for the weights and the second (red side) for the biases. After the vector is optimized by the GWO-FC and fed back to the FCNN network, the vector is split into two parts, weights and biases. This returns them to the state/condition they were in before being merged into a single vector, for further use by FCNN.
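The merge and split operations can be sketched in Python as below. The exact ordering inside each part is an assumption made for this example (input-to-hidden weights, then hidden-to-output weights, then hidden biases, then output biases), and the names `pack` and `unpack` are hypothetical.

```python
def pack(w_ij, b_j, w_jk, b_k):
    """Flatten all weights first, then all biases, into one solution vector."""
    weights = [w for row in w_ij for w in row] + [w for row in w_jk for w in row]
    return weights + list(b_j) + list(b_k)

def unpack(vec, n_in, n_hid, n_out):
    """Split a solution vector back into the shapes the network expects."""
    a = n_in * n_hid          # end of the input-to-hidden weights
    b = a + n_hid * n_out     # end of the hidden-to-output weights
    c = b + n_hid             # end of the hidden biases
    w_ij = [vec[i * n_hid:(i + 1) * n_hid] for i in range(n_in)]
    w_jk = [vec[a + j * n_out:a + (j + 1) * n_out] for j in range(n_hid)]
    return w_ij, vec[b:c], w_jk, vec[c:c + n_out]

# A 2-3-1 network has 2*3 + 3*1 weights and 3 + 1 biases: 13 values in total.
vec = [float(v) for v in range(13)]
w_ij, b_j, w_jk, b_k = unpack(vec, n_in=2, n_hid=3, n_out=1)
print(pack(w_ij, b_j, w_jk, b_k) == vec)
```

The length of the vector is the dimension of the search space seen by the GWO, so it grows with the number of dataset features and hidden neurons.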

The FCNN is trained as many times as there are individuals in the GWO population, which in this study is equal to 30. Each training process yields one vector of FCNN-adjusted weights and biases, as mentioned earlier. Together, these vectors form the population that is passed to the GWO for optimization. The population optimization process is implemented to find the optimal individual solution (vector), which reduces the overall FCNN estimation error, and this process can be considered an optimization problem [45].

The FCNN is a gradient-based search method and so, unfortunately, it has a scaling problem which can lead to a rapid decline in performance when handling high-dimensional problems [45, 60]. Gradient-based search techniques may also become stuck in local minima since the FCNN parameter space is multi-modal, with a range of local minima close to the global minimum [45]. Metaheuristic optimization algorithms can be used to address this problem because these are generally combined with an ANN to obtain high-precision results as well as to minimize network training time [61]. In the present study, the metaheuristic algorithm is the core of the proposed method, which uses the GWO to optimize the FCNN weights and biases. The main aim of using the GWO algorithm is to reduce the probability of the FCNN falling into local minima. This is achieved through the joint process between the GWO and the FCNN, as the GWO promotes exploration while the FCNN promotes exploitation, leading to a reasonable balance between them. A previous study has shown that population-based metaheuristic algorithms have powerful exploration capabilities [62].

Returning to the procedure of the proposed method, after the optimal vector has been obtained by the GWO, this vector is returned to the FCNN, which utilizes the vector in the validation process and calculates the accuracy. Mean square error (MSE) is employed in this article as the key fitness function by which to measure output error. The MSE measures the errors between the estimations and the true labels of each estimation process during the training and validation steps of the proposed method, as shown in Eq.16:

$$\begin{aligned} \text {MSE} = \frac{1}{m} \sum _{i=1}^{m} (A_i - D_i)^2 \end{aligned}$$
(16)

where the number of training samples is denoted by m, the actual output of the \(i^{th}\) instance is denoted by \(A_i\), and the desired output is denoted by \(D_i\).
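A minimal Python sketch of the MSE as a fitness function is given below. It squares each difference, following the standard definition of the MSE, so that positive and negative errors cannot cancel. The helper `make_fitness` and the toy linear predictor are hypothetical and stand in for a forward pass of the network.

```python
def mse(actual, desired):
    """Mean of the squared differences between actual and desired outputs."""
    if not actual or len(actual) != len(desired):
        raise ValueError("inputs must be non-empty and of equal length")
    return sum((a - d) ** 2 for a, d in zip(actual, desired)) / len(actual)

def make_fitness(predict, inputs, targets):
    """Wrap a model as a fitness function over parameter vectors."""
    def fitness(vec):
        return mse([predict(vec, x) for x in inputs], targets)
    return fitness

# Toy predictor (assumption): a line with slope vec[0] and intercept vec[1].
fitness = make_fitness(lambda vec, x: vec[0] * x + vec[1],
                       inputs=[0.0, 1.0, 2.0], targets=[1.0, 3.0, 5.0])
print(mse([2.0, 4.0], [1.0, 6.0]))   # errors +1 and -2 give (1 + 4) / 2
print(fitness([2.0, 1.0]))           # exact fit of the toy data
```

A function of this shape, taking one solution vector and returning one error value, is what a population-based optimizer needs in order to rank its candidate solutions.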

MSE is used as the primary loss function in this research because the problem being tackled is a regression problem, and MSE is a standard measure of test error for this type of problem [63]. In addition, FCNN models often employ the MSE as a robust loss function [64, 65]. Furthermore, as seen in Table 3, this study uses several loss functions since there is no single measurement method that is useful for all types of problems [57].

In short, the main contribution of this research is combining the GWO algorithm with FCNN to act as one unit (GWO-FC) to address the SEE problem. Therefore, the parameters (weights and biases) of the BP network are collected as a single vector and sent to the GWO algorithm as input solutions. Then, the GWO optimizes this solution, which is then used as optimal parameters for the BP network. Finally, the developed GWO-FC is evaluated using several global benchmark datasets.

4 Experimental results and performance evaluation

The performance evaluation of the proposed GWO-FC approach to the SEE issue is covered in this section. First, the evaluation criteria, datasets used, experimental design, and parameter composition are discussed. The outcomes of the proposed approach are then contrasted with those of the conventional FCNN on datasets for SEE. After this comparison, the results are presented of a statistical analysis utilizing the Wilcoxon-Mann-Whitney test to generate statistically significant findings. In addition, boxplot and convergence behavior analyses are provided, respectively. Finally, performance validation results are given by comparing the results of the proposed method with those of some state-of-the-art methods when applied to SEE problem datasets.

4.1 Evaluation criteria

The determination of evaluation criteria is critical to the success of a proposed approach. There are different evaluation criteria in the literature and this study sought to utilize the most reliable and common criteria. Heuristic algorithms are stochastic optimization methods which can, therefore, produce different outcomes [45]. Thus, an average of 30 separate runs for each dataset was evaluated to acquire all the experimental outcomes.

MSE served as the primary evaluation criterion in this study, as previously established. Other criteria were also used, namely: relative absolute error (RAE), mean absolute error (MAE), variance-accounted-for (VAF), Manhattan distance (MD), root mean square error (RMSE), root relative squared error (RRSE), median of magnitude relative error (MdMRE), correlation coefficient (\(R^2\)), Euclidean distance (ED), standardized accuracy (SA), and effect size (\(\bigtriangleup\)) measures. In utilizing the proposed method, the main aim was to decrease the value of these criteria, except for VAF and \(R^2\), where the aim was to increase their value.
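Several of these criteria can be computed with a few lines of Python. The formulas below are the commonly used textbook definitions and are an assumption about what the study computes; in particular, \(R^2\) is computed here as the coefficient of determination, and SA and effect size are omitted because they depend on a random-guessing baseline. The function name `metrics` and the sample values are hypothetical.

```python
import math
from statistics import mean, median, pvariance

def metrics(y, p):
    """Error measures for true efforts y and predictions p.

    Assumes y contains no zeros (needed for MdMRE) and is not constant.
    """
    err = [a - b for a, b in zip(y, p)]
    y_bar = mean(y)
    sae = sum(abs(e) for e in err)                 # sum of absolute errors
    sse = sum(e * e for e in err)                  # sum of squared errors
    abs_dev = sum(abs(a - y_bar) for a in y)
    sq_dev = sum((a - y_bar) ** 2 for a in y)
    return {
        "MAE": sae / len(y),
        "RMSE": math.sqrt(sse / len(y)),
        "RAE": sae / abs_dev,
        "RRSE": math.sqrt(sse / sq_dev),
        "MdMRE": median(abs(e) / abs(a) for e, a in zip(err, y)),
        "MD": sae,
        "ED": math.sqrt(sse),
        "R2": 1.0 - sse / sq_dev,
        "VAF": (1.0 - pvariance(err) / pvariance(y)) * 100.0,
    }

y_true = [10.0, 20.0, 30.0, 40.0]   # illustrative efforts
y_pred = [12.0, 18.0, 33.0, 39.0]   # illustrative estimates
for name, value in metrics(y_true, y_pred).items():
    print(name, round(value, 4))
```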

4.2 Datasets used

In this study, twelve freely and publicly available benchmark datasets from the software development effort estimation research community were utilized. These benchmark datasets have been employed in related works in the literature and thus they are appropriate for use in evaluating the proposed method [3]. The literature states that using a relatively large number of software development effort estimation datasets is helpful for reaching a stable conclusion [66]. All the employed datasets were obtained from the GitHub and PROMISE repositories and had between 7 and 27 features, all of which were used in the experiments. The number of observations ranged from 15 to 499, and the datasets also had different technological features.

Table 1 provides the following details for each dataset: the name of the dataset, number of features, search space dimensions, estimated time unit, and repository source. It is evident from the table that there is variation in the complexity and size of all datasets. In terms of both the number of features and instances, Albrecht, Kemerer, and Miyazaki are small datasets, whereas China, COCOMO, Maxwell, and NASA are medium/large. The Kemerer, Albrecht, and Kitchenham datasets have the fewest number of features, whereas the COCOMONASA-II and Maxwell have the most. While all other datasets are recorded in person-months, the Maxwell, China, and Desharnais are recorded in person-hours. In all datasets, the dependent variable is effort, expressed in person-months or person-hours.

It is critical to comprehend the scope of software development work to produce an accurate estimate and learn how to deliver estimates for the effort of software development. Estimation is crucial because it enables the developer to determine the expenses and time needed to finish the task. Once it is known how much a person-hour costs and the dependencies of all tasks have been analyzed, one can easily calculate the time it will take to complete the entire software project. A person-hour refers to that portion of work achieved by an average specialist in one hour of uninterrupted work. The term refers to two distinct concepts: Man (person) refers to the specialist performing the activity (e.g., analyst, developer, engineer, tester, etc.), and Hour refers to 60 minutes of uninterrupted work. Finally, what applies to the term person-hours also applies to person-months, except that the second term is calculated as the total hours worked in a month. The features of the datasets, given in Table 1, are as follows:

Table 1 Description of the datasets
  • Size features: data on the scope of the project as measured by several metrics, e.g., function points (fp), lines of code (loc).

  • Environment features: company data, the development team of the project, number and experience of developers, and so on.

  • Development features: project technical details, such as the database type and development language employed.

  • Project-related features: regarding the project’s purpose, type, and requirements.

The IBM DP service corporation generated the Albrecht dataset, which includes 24 samples from industrial IT projects and eight attributes. It is described in terms of KSLOC and FPs, which are weighted sums of inputs, outputs, files, and inquiries for software projects. The China dataset contains information from projects developed by Chinese corporations and considers 16 features and 499 samples. Functional elements are used as independent variables to find how many FPs there are, such as inquiry, file, input, interface, and output.

The COCOMO dataset, created by NASA, includes 17 features and 63 samples. Among the features are: loc (line of code), rely (reliability of the software), tool (use of software tools), data (size of the datasets), modp (modern programming practices), cplx (process complexity), lexp (language experience), time (cpu time constraint), sced (schedule constraint), stor (main memory constraint), vexp (virtual machine experience), virt (volatility of the machine), pcap (capability of programmers), turn (turnaround time), aexp (application experience), and acap (capability of analysts).

The Kemerer dataset is small, with seven features and 15 samples. Two of the independent features are categorical (language and hardware), and size is described by KSLOC, raw FPs, and adjusted function points. The two dependent variables are the project time and overall effort.

The Maxwell dataset contains information about 62 projects, including details of the industrial software initiatives programmed by one of Finland’s leading commercial banks. Among the significant independent features are T15 (staff team skill), FPs (SizeFP), T14 (staff tool skills), T01 (customer participation), T13 (staff application knowledge), T02 (development environment adequacy), T12 (staff analysis skills), T03 (staff availability), T11 (installation requirements), T04 (standards used), T10 (efficiency requirements), T05 (methods used), T09 (quality requirements), T06 (tools used), T08 (requirements volatility), and T07 (software logical complexity).

Miyazaki is a medium-sized dataset with 48 samples that include data on projects created by Fujitsu Large Systems Users Group software firms. It has eight independent features, with the dependent variable being the number of person-hours needed to finish the development process from system design through system testing. The number of different record formats (file), report forms (form), and input or output screens (scrn) are all significant independent variables.

In the 1980s, the Desharnais dataset, which contains 81 software projects, was gathered from ten Canadian firms. The total effort was used as a dependent variable in this study, but not the loc. The categorical variables (i.e., language and year end) were also omitted from this study, and the following variables were employed: adjusted FPs, transactions (i.e., the total number of fundamental logical transactions in the system), teamexp (i.e., the team’s years of experience), entities (i.e., the number of entities in the system’s data model), and managerexp (i.e., the manager’s experience measured in years).

The datasets used include data from one or more software firms representing a wide range of application environments and project features. The datasets above have also been utilized in wide ranging practical research to evaluate effort estimation approaches in the literature [71, 72].

4.3 Experimental design

Before the proposed method was assessed, it was necessary to pre-process the data to enable optimal usage of it. By analyzing the variables within the datasets, we found that the characteristics exhibited several notable factors. For instance, large projects are less numerous than smaller ones, which influences the skewness of the data and, as the scale of the project increases, so does the diversity of the effort. There are also some very large data values, which mean that there are outliers. Finally, it appears that the correlations between size and effort varied for different software projects.

Based on the previous findings, it appears that there is a need to use transformation techniques for the data to guarantee that the developed model will traverse the raw data’s scale origin. This will take into account the relationships between size and effort, which may be linear, nonlinear, or both. Transforming the data values by applying the transformation technique will bring the data closer to a normal distribution, and also bring the values closer together by reducing larger values to smaller ones. On this basis, the datasets were converted into a new form where all the values were between 0 and 1, in a transformation technique called min-max normalization, as shown in Eq. 17:

$$\begin{aligned} Y_i = \frac{x_i - x_{min}}{x_{max} - x_{min}} \end{aligned}$$
(17)

where \(Y_i\) is the normalized value, \(x_i\) is the initial data value, and \(x_{min}\) and \(x_{max}\) are the minimum and maximum data values, respectively.
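Eq. 17 can be illustrated with a minimal Python sketch. The function name `min_max_normalize`, the handling of a constant column, and the sample values are assumptions made for illustration only; they are not taken from the study or its datasets.

```python
def min_max_normalize(values):
    """Rescale a sequence of numbers to the range [0, 1] following Eq. 17."""
    x_min = min(values)
    x_max = max(values)
    span = x_max - x_min
    if span == 0:
        # Assumption: a constant column is mapped to 0.0 to avoid division by zero.
        return [0.0 for _ in values]
    return [(x - x_min) / span for x in values]


# Hypothetical effort values (illustrative only, not from any benchmark dataset).
effort = [120.0, 300.0, 480.0, 1200.0]
normalized = min_max_normalize(effort)
```

Because the mapping is linear, the smallest value always maps to 0 and the largest to 1, while the relative spacing of the values is preserved.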

The second step in the data pre-processing was to divide each dataset into training and testing sets. The training set accounted for 70% of the original dataset and the testing set for 30%. The rows of the training set were selected at random, and the remaining rows formed the testing set. Random selection was employed to reduce the risk of overfitting and data selection bias. For each of the 30 runs of the proposed method, new training and testing sets were therefore drawn at random. In each run, the training phase was executed first, followed by the testing phase. The evaluation criteria for each dataset were then calculated by averaging the results obtained over the 30 runs.
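The repeated random hold-out protocol described above might be sketched as follows. This is an illustration under stated assumptions, not the authors' MATLAB code: the helper names `split_train_test` and `average_over_runs`, the rounding of the 70% boundary, and the `evaluate` callback (standing in for "train a model, then return one evaluation measure on the test rows") are all hypothetical.

```python
import random


def split_train_test(rows, train_fraction=0.7, rng=None):
    """Randomly split rows into a training part and a testing part."""
    if rng is None:
        rng = random.Random()
    indices = list(range(len(rows)))
    rng.shuffle(indices)
    # Assumption: the 70% boundary is rounded to the nearest whole row.
    n_train = round(train_fraction * len(rows))
    train = [rows[i] for i in indices[:n_train]]
    test = [rows[i] for i in indices[n_train:]]
    return train, test


def average_over_runs(rows, evaluate, n_runs=30, seed=0):
    """Draw a fresh random split for every run and average the scores."""
    rng = random.Random(seed)
    scores = []
    for _ in range(n_runs):
        train, test = split_train_test(rows, rng=rng)
        scores.append(evaluate(train, test))
    return sum(scores) / n_runs
```

For a dataset with 81 projects, such as Desharnais, this rounding gives 57 training rows and 24 testing rows in every run.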

4.4 Parameter configuration

The parameter configuration was the same for all experiments: the population size of the GWO was set to 30, the maximum number of GWO iterations (L) was set to 300, and all experiments were performed 30 separate times. Extensive preliminary experiments were carried out to find these parameter values (i.e., the maximum number of iterations and the population size), and the best-performing values were chosen. The experiments were carried out using a personal computer with an Intel Core i5 processor, Windows 10 system, 8 GB of RAM, 2.0 GHz CPU, using MATLAB 2016a.
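To show where the population size (30) and the maximum number of iterations (300) enter the search, the following is a minimal sketch of a textbook grey wolf optimizer in Python. It is not the authors' MATLAB implementation and may differ from the GWO-FC variant used in the study. The function name `grey_wolf_optimize`, the search bounds, the seed, and the `sphere` test function (a stand-in for the FCNN training error as a function of the flattened weights and biases) are assumptions for illustration.

```python
import random


def grey_wolf_optimize(loss, dim, pop_size=30, max_iter=300,
                       lower=-1.0, upper=1.0, seed=0):
    """Minimize loss(vector) with a textbook grey wolf optimizer (needs pop_size >= 3)."""
    rng = random.Random(seed)
    wolves = [[rng.uniform(lower, upper) for _ in range(dim)]
              for _ in range(pop_size)]
    best, best_loss = None, float("inf")

    for iteration in range(max_iter):
        # The three fittest wolves (alpha, beta, delta) guide the pack.
        leaders = [list(w) for w in sorted(wolves, key=loss)[:3]]
        if loss(leaders[0]) < best_loss:
            best, best_loss = list(leaders[0]), loss(leaders[0])

        # 'a' decays linearly from 2 towards 0: exploration first, exploitation later.
        a = 2.0 * (1.0 - iteration / max_iter)

        updated = []
        for wolf in wolves:
            position = []
            for d in range(dim):
                total = 0.0
                for leader in leaders:
                    big_a = 2.0 * a * rng.random() - a
                    big_c = 2.0 * rng.random()
                    distance = abs(big_c * leader[d] - wolf[d])
                    total += leader[d] - big_a * distance
                # Mean of the three leader-guided estimates, clipped to the bounds.
                position.append(min(upper, max(lower, total / 3.0)))
            updated.append(position)
        wolves = updated

    # Also consider the final population before returning.
    final = min(wolves, key=loss)
    if loss(final) < best_loss:
        best, best_loss = list(final), loss(final)
    return best, best_loss


def sphere(vector):
    """Hypothetical stand-in for the network's training error."""
    return sum(v * v for v in vector)


weights, error = grey_wolf_optimize(sphere, dim=5)
```

In a GWO-FC style setup, `loss` would instead decode the vector into network weights and biases and return the training error (for example, the MSE) of the resulting network.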

4.5 Evaluation comparison of GWO-FC against traditional FCNN

The first experiment was conducted to evaluate the performance of the proposed GWO-FC method. This involved comparing the results obtained by the GWO-FC with those produced by the traditional FCNN for SEE problems on test data. A comparison of the results obtained by the GWO-FC and conventional FCNN for each data set is shown in Tables 2 and 3 in terms of MSE and other measures (i.e., RAE, MAE, VAF (%), MD, RMSE, RRSE, MdMRE, \(R^2\), ED, standardized accuracy (SA) and effect size (\(\bigtriangleup\)) measures), respectively. Table 2 shows that the GWO-FC approach gave more accurate estimates than the classic FCNN in terms of MSE for all datasets. Table 3 shows that the GWO-FC achieved the lowest RRSE, ED, MdMRE, RAE, RMSE, MAE, MD, SA and effect size (\(\bigtriangleup\)) for most datasets. Moreover, the GWO-FC produced the largest VAF and \(R^2\) for all datasets.
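For readers who want to reproduce a few of the reported measures, the sketch below computes MSE, RMSE, MAE, and MdMRE using their common textbook definitions, which may differ in detail from the definitions adopted earlier in the paper. The function name `evaluation_metrics` and the sample numbers are hypothetical, and the actual effort values are assumed to be strictly positive.

```python
import math
import statistics


def evaluation_metrics(actual, predicted):
    """Return common error measures for paired actual and predicted efforts."""
    errors = [a - p for a, p in zip(actual, predicted)]
    mse = sum(e * e for e in errors) / len(errors)
    mae = sum(abs(e) for e in errors) / len(errors)
    # Magnitude of relative error per project; assumes every actual value is > 0.
    mdmre = statistics.median(abs(e) / a for a, e in zip(actual, errors))
    return {"MSE": mse, "RMSE": math.sqrt(mse), "MAE": mae, "MdMRE": mdmre}


# Hypothetical efforts (illustrative only).
metrics = evaluation_metrics([100.0, 200.0, 400.0], [110.0, 180.0, 400.0])
```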

The outcomes of the proposed technique thus show that the performance of the FCNN estimator and the quality of the findings were significantly improved by GWO optimization of the FCNN parameters. As previously mentioned, the FCNN tends to become trapped in local optima and has a sluggish convergence rate, and the proposed GWO-FC overcame these drawbacks. Providing better weights and biases improves the balance between exploitation and exploration and accelerates the convergence of the FCNN estimation process. This results in strong generalization performance.

Table 2 Results obtained by GWO-FC and FCNN in terms of MSE
Table 3 Obtained results by GWO-FC and FCNN in terms of RAE, MAE, VAF, ED, MdMRE, RMSE, RRSE, \(R^2\), MD, SA, \(\bigtriangleup\)

A boxplot was created to show the distribution of the findings obtained using the proposed method. Figure 6 shows the MSE results produced by the GWO-FC method, demonstrating that the majority of the results of the examined datasets were more successful when using the proposed strategy. The effectiveness of the obtained findings may be related to the proposed method’s ability to identify the optimal weights of the FCNN, which is essential for resolving the early convergence defect as well as improving the convergence behavior of the FCNN.

Fig. 6 Boxplots of GWO-FC

4.5.1 Wilcoxon Mann-Whitney statistical test analysis

The significance of the outcomes produced using the suggested strategy was validated with the Wilcoxon-Mann-Whitney statistical test. This non-parametric test is used to demonstrate a difference in the value of an ordinal, interval, or ratio variable between two sets. The full derivation of the test statistic is too lengthy to explain here; the interested reader may wish to consult [73]. In outline, the calculation is as follows:

Let (\(x_1,..., x_m\)) and (\(y_1,..., y_n\)) be two independent sets of random variables, and let an arbitrary response from each set be denoted by x and y, respectively. In addition, let \(x \sim\) F and \(y \sim\) G, and let the observed data be categorized. The Mann-Whitney function is as follows:

$$\begin{aligned} \phi = h_{MW}(F,G) = Pr[x<y] + \frac{1}{2}Pr[x=y] \end{aligned}$$
(18)

The \(\widehat{G}\) and \(\widehat{F}\) experimental distributions can be used to estimate the \(\phi\) as shown below:

$$\begin{aligned} \widehat{\phi } = h_{MW}(\widehat{F},\widehat{G}) = \frac{1}{mn}(S_y - \frac{n(n+1)}{2}) \end{aligned}$$
(19)

The mid-ranks are calculated by ranking all N \(= m + n\) responses combined and assigning tied values the average of the ranks they span. Here, \(S_y\) is the sum of the n mid-ranks of the second group. The Wilcoxon–Mann–Whitney test is a permutation test based on \(\widehat{\phi }\).
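The estimate in Eq. 19 can be computed directly from mid-ranks, as in the following sketch. The function names `mid_ranks` and `mann_whitney_phi` are illustrative assumptions; the code only evaluates \(\widehat{\phi }\) and does not compute the permutation \(\rho\)-value.

```python
def mid_ranks(values):
    """Return 1-based ranks, giving tied values the average of the ranks they span."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        # Sorted positions are 0-based, ranks are 1-based.
        average_rank = (start + end) / 2 + 1
        for position in range(start, end + 1):
            ranks[order[position]] = average_rank
        start = end + 1
    return ranks


def mann_whitney_phi(x, y):
    """Estimate Pr[x < y] + 0.5 * Pr[x = y] from the mid-ranks (Eq. 19)."""
    m, n = len(x), len(y)
    ranks = mid_ranks(list(x) + list(y))
    s_y = sum(ranks[m:])  # sum of the n mid-ranks of the second group
    return (s_y - n * (n + 1) / 2) / (m * n)
```

A value of 0.5 indicates that neither group tends to produce larger values, while values near 0 or 1 indicate that one group dominates the other.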

Table 4 presents the outcomes of the Wilcoxon test according to the average of the best results obtained across 30 independent experiment runs. Regression and correlation values were used in this test to compare the two methods (GWO-FC and FCNN), as well as to assess how different the two estimation methods were from each other. The \(\rho\)-value represents the probability of obtaining results at least as extreme as those observed if the null hypothesis were true. A low \(\rho\)-value therefore indicates strong evidence against the null hypothesis and high statistical significance. Depending on the \(\rho\)-value, the null hypothesis is either rejected or not rejected. If the \(\rho\)-value is less than or equal to the probability threshold (\(\alpha\)), which is the case here, the null hypothesis is rejected.

Table 4 \(\rho\)-values of Wilcoxon Test for GWO-FC against FCNN

In this test, \(\alpha = 0.05\), so a \(\rho\)-value below \(\alpha\) means that a difference at least as large as the one observed would be expected less than 5% of the time if it were due to chance alone rather than to the experimental factors. As a result, there was a statistically significant difference between the two methods. From Table 4, it can be seen that the statistical indicators demonstrate that the results obtained by the GWO-FC were significantly different from those of the traditional FCNN in all datasets.

4.5.2 Convergence rate analysis

To evaluate the proposed method in depth, a convergence rate analysis was performed. In this work, the maximum number of iterations was 300, and the values of the parameters (weights and biases) were changed at each iteration. The parameters had random values at the early iterations, but as the number of iterations increased, these values dropped and the FCNN became stuck in a local optimum. Using the GWO therefore helped the FCNN to escape from local optima through the GWO’s capacity for exploration and exploitation, which accelerated and enhanced the convergence. The parameters produced by the GWO give the method the ability to explore the search space around the related best solutions.

The convergence curves for the average values of the overall results for the GWO-FC versus the FCNN for each dataset are shown in Fig. 7. It can be seen from the curves that the integration of the GWO algorithm with the FCNN significantly enhanced the convergence rate, as well as the accuracy of FCNN estimation, which in turn improved the quality of the results.

Fig. 7 Convergence curves of GWO-FC against FCNN

The findings of the experiments suggest that the usage of metaheuristic techniques in general, and the GWO in particular, may produce significant gains in the field of parameter optimization, since generating high quality results for ANNs is correlated to parameter optimization, which in turn enhances estimate effectiveness. Therefore, the use of metaheuristic methods to estimate optimal parameter values may enable ANNs to address the uncertainties in estimations for different benchmark datasets and provide more accurate results. In addition, the use of parameter optimization methods based on metaheuristic algorithms increases the chance of a neural network being able to enhance estimation accuracy and convergence speed, as well as reduce the possibility of falling into local optima.

4.5.3 Computational time

In order to compare the selected methods further, a computational time comparison between the traditional FCNN and the proposed GWO-FC in the simulation phase was performed using the test data. This comparison was made on a 64-bit Windows 10 system with an i5-10600 CPU @ 3.30 GHz and 16 GB of RAM. The results, computed as the mean time over 30 runs, are presented in Table 5 for all datasets used.

Table 5 Comparison of computational time

The results show that the FCNN took less time than the GWO-FC method. The noticeable increase in the GWO-FC computation time is due to the fact that it consists of the FCNN training time plus the GWO optimization time: the GWO-FC method involves several loops, as well as the passing of data between the two methods. The time reported here is the total time of the FCNN training process and the GWO optimization process, where the bulk of the time is spent on the optimization process. Although the training and optimization processes are the same for all employed datasets, a small dataset (i.e., USP05) consumes more computational time than a large dataset (i.e., China); this is due to the complexity of the dataset itself. The same is true of the COCOMO81 and the COCOMONASA-II datasets, where COCOMONASA-II is comparatively larger than COCOMO81, but COCOMO81 takes more computational time than COCOMONASA-II.

4.6 Validation comparison of GWO-FC against state-of-the-art methods

To support the proposed methodology, a comparison with state-of-the-art approaches from the literature was conducted. The comparison was based on experiments employing evaluation criteria and benchmark datasets analogous to those in the previously described experiment. The results of the comparison are presented in five tables: Table 6 presents MSE and MAE, Tables 7, 8 and 9 present MAE, and Table 10 presents SA and (\(\bigtriangleup\)).

Table 6 shows that the GWO-FC was superior to the Salp Swarm Algorithm with Backpropagation Neural Network (SSA-BPNN) method in most of the datasets, the exceptions being Cosmic and COCOMONASA-II in terms of MSE and Miyazaki 94 in terms of MAE.

Table 6 Results obtained by GWO-FC against state-of-the-art methods [57]

From Table 7, the proposed GWO-FC can be seen to be superior to the other methods in most of the datasets, except for China and Kitchenham, where SRF achieved the best results. The abbreviations used in Table 7 are as follows:

  • Decision Tree (DT),

  • Deep Net (DN),

  • Elastic Net Regression (EN),

  • Ensemble Technique: Bagging (BA),

  • Ensemble Technique: Boosting (BS),

  • Ensemble Technique: Weighted Averaging (WAVG),

  • LASSO Regression (LASSO),

  • Random Forest (RF),

  • Ridge Regression (Ridge),

  • Stacking Using RF (SRF)

Table 7 Results obtained by GWO-FC against state-of-the-art methods [74]

Table 8 shows that the GWO-FC approach outperformed the competition over all datasets. The abbreviations used in Table 8 are as follows:

  • Genetic algorithm - hybrid search-based algorithm (GA-HSBA)

  • Black hole optimization algorithm - hybrid search-based algorithm (BHO-HSBA)

  • Firefly algorithm (FFA) - hybrid search-based algorithm (FFA-HSBA)

Table 8 Results obtained by GWO-FC against state-of-the-art methods [30]

Table 9 shows that the GWO-FC approach outperformed the competition over all datasets. The abbreviations used in Table 9 are as follows:

  • Ant colony optimization (ACO)

  • Chaos optimization algorithm (COA)

  • Genetic algorithm (GA)

  • Particle swarm optimization (PSO)

  • Bat algorithm (BA)

Table 9 Results obtained by GWO-FC against state-of-the-art methods [37]

Table 10 shows that the GWO-FC approach outperformed the competition over all datasets. The abbreviations used in Table 10 are as follows:

  • Cluster-based fuzzy regression tree (CFRT)

  • Multi-layer perceptron (MLP)

  • K-nearest neighbor (KNN)

  • Classification and regression trees (CART)

  • Linear regression (LR)

Table 10 Results obtained by GWO-FC against state-of-the-art methods [75]

In conclusion, the comparative results generally demonstrate that the proposed GWO-FC method can outperform other methods because it is robust and can handle a variety of situations that are different in complexity and dimension.

4.7 Statistical test analysis

Statistical analysis is crucial for investigating the significance of, and variations among, the outcomes of a proposed method and competing methods. The statistical tests that can be applied to compare two or more predictors/classifiers across multiple datasets include non-parametric tests (the Wilcoxon and Friedman tests), parametric tests (the paired ANOVA test), and a test that assumes no commensurability of the results (the sign test). A typical ML dataset can deviate from the basic assumptions of these tests and, based on the well-known statistical properties of the tests and an understanding of ML data, non-parametric tests should be favored over parametric ones [76]. For statistical comparisons of classifiers, a collection of straightforward non-parametric tests that are safe and reliable is recommended in [76]: the Wilcoxon signed-ranks test for the comparison of two classifiers, and the Friedman test with the appropriate post-hoc tests for the comparison of more classifiers across different datasets.

To determine if there were statistically significant variations between the GWO-FC’s accuracy and that of the cutting-edge methods listed in Table 7, Friedman and Holm/Hochberg statistical tests [77, 78] were used in this work. The efficiency and appropriateness of the proposed strategy were also confirmed using statistical test methods.

4.7.1 Friedman’s tests of GWO-FC and state-of-the-art methods in terms of MAE

Friedman’s test is a non-parametric statistical test created by the economist Milton Friedman. It is used to determine whether there are any differences between sets (treatments) when an ordinal dependent variable is measured. When the same parameter has been measured under multiple conditions on the same subjects, Friedman’s test provides a one-way repeated-measures analysis of variance by ranks. The hypotheses for the comparison across repeated measures are as follows:

  • H0: throughout repeated measures, the distributions are the same (there is no substantial variation across the tested sets);

  • H1: throughout repeated measures, the distributions are different (there is a substantial variation across the tested sets).

In addition, this test looks at the values of rankings by column after ranking each row (or block) jointly. The test statistic that Friedman suggests [79] is as follows:

$$\begin{aligned} T= \frac{12}{mk(k+1)} \sum ^k_{j=1} R^2_j - 3m(k+1) \end{aligned}$$
(20)

where k denotes the number of sets, m denotes the number of subjects, and \(R_j\) denotes the sum of the ranks for the \(j^{th}\) set.
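Eq. 20 can be evaluated with a few lines of Python, as sketched below. The function name `friedman_statistic` and the table layout (one row per dataset, one column per method, lower scores ranked first) are assumptions for illustration. No tie correction is applied, and the \(\rho\)-value, which is usually obtained from a chi-square distribution with k − 1 degrees of freedom, is not computed here.

```python
def friedman_statistic(table):
    """Return (T, average ranks) for table[i][j] = score of set j on subject i."""
    m = len(table)       # number of subjects (e.g., datasets)
    k = len(table[0])    # number of sets (e.g., methods)
    rank_sums = [0.0] * k
    for row in table:
        for j, value in enumerate(row):
            smaller = sum(v < value for v in row)
            equal = sum(v == value for v in row)
            # Mid-rank within the row: ties share the average of their ranks.
            rank_sums[j] += smaller + (equal + 1) / 2
    t = 12 / (m * k * (k + 1)) * sum(r * r for r in rank_sums) - 3 * m * (k + 1)
    return t, [r / m for r in rank_sums]
```

When every subject ranks the sets identically, T reaches its maximum of m(k − 1); when the rank sums are all equal, T is zero.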

Suppose significant differences are detected between groups (treatments) by Friedman’s test (i.e., the null hypothesis of Friedman’s test is rejected). In this case, Holm’s procedure (Wright 1992) is performed as a post-hoc method for multiple comparison tests to determine which groups (treatments) differ from the others (unplanned comparisons). Holm’s procedure is one of the earliest uses of stepwise algorithms in simultaneous inference. It is an improvement on the Bonferroni procedure [79] in that it allocates the significance criterion unequally across the hypotheses being tested. Holm’s step-down method tests the hypotheses sequentially in order of significance, adjusting the critical value used to reject the null hypothesis at each step. It starts with the hypothesis linked to the most significant test statistic and, each time a hypothesis is rejected, moves on to the most significant of the remaining hypotheses.
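The step-down logic can be sketched as follows. This is a minimal illustration of Holm’s procedure only (the Hochberg step-up variant is not shown); the function name `holm_step_down` and the example \(\rho\)-values are hypothetical and are not taken from Table 11.

```python
def holm_step_down(p_values, alpha=0.05):
    """Return a list of booleans: True where the null hypothesis is rejected."""
    n = len(p_values)
    order = sorted(range(n), key=lambda i: p_values[i])  # most significant first
    rejected = [False] * n
    for step, index in enumerate(order):
        # The threshold is relaxed at each step: alpha/n, alpha/(n-1), ..., alpha.
        threshold = alpha / (n - step)
        if p_values[index] <= threshold:
            rejected[index] = True
        else:
            break  # the first non-rejection stops the procedure
    return rejected


# Hypothetical p-values for four comparisons against a control method.
decisions = holm_step_down([0.01, 0.04, 0.03, 0.005])
```

In this example, the two smallest values (0.005 and 0.01) fall below their thresholds (0.0125 and about 0.0167), whereas 0.03 exceeds 0.025, so the procedure stops and the remaining hypotheses are retained.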

Friedman’s test was performed as a quantitative evaluation of the statistical difference between the GWO-FC method and the state-of-the-art methods in terms of MAE, using a significance level of \(\alpha\) = 0.05. By using the results from Table 7 and Friedman’s test, the GWO-FC was ranked against its rivals. Figure 8 provides an overview of how all of the competing approaches ranked.

Fig. 8 Friedman’s Test average rankings for comparative methods results in Table 7

Figure 8 shows that the GWO-FC method ranked second, with an average of 2.86, preceded by the SRF algorithm (average ranking 1.71), and followed by, in descending order, BS (4.78), WAVG (5.57), LASSO (5.57), EN (6.00), Ridge (6.14), BA (6.36), RF (7.29), DT (8.71), and DN (11.00). Also, the \(\rho\)-value calculated using Friedman’s test for results listed in Table 7 was 1.1803E-5, which is below the threshold for significance (\(\alpha\) = 0.05). These results illustrate the extent to which there was a significant difference between the methods under evaluation.
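As a quick consistency check on these rankings (a worked calculation added for illustration, not a result reported in the study): with k = 11 methods, the ranks assigned within each dataset sum to k(k+1)/2, so the average ranks must sum to the same value:

$$\begin{aligned} \sum _{j=1}^{k} \bar{R}_j = \frac{k(k+1)}{2} = \frac{11 \cdot 12}{2} = 66 \end{aligned}$$

The average rankings listed above sum to 1.71 + 2.86 + 4.78 + 5.57 + 5.57 + 6.00 + 6.14 + 6.36 + 7.29 + 8.71 + 11.00 = 65.99, which agrees with 66 up to rounding.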

4.7.2 Holm’s/Hochberg statistical test analysis in terms of MAE

A post-hoc statistical method for contrasting the control method (i.e., the best performing method) with other methods is the Holm’s/Hochberg test. This test was applied to the proposed and compared methods because there were significant statistical differences between the obtained results. In addition, the null hypothesis of equivalent accuracy was rejected by the Holm’s/Hochberg method, with the aim of confirming the existence of significant differences in the accuracy of the results produced by the competing methods. The Holm’s/Hochberg results are provided in Table 11. For all test instances, the confidence threshold level was 0.05.

Table 11 GWO-FC and other comparative methods Holm’s/Hochberg results in Table 7

Table 11 shows that the control method (SRF) was statistically better than the compared methods based on the results of the Holm’s/Hochberg test. However, the difference between the performance of the SRF and the GWO-FC was slight, and their \(\rho\)-values based on the Holm’s/Hochberg test were very close. Accordingly, the average estimation results of the GWO-FC approach were statistically superior to those of the compared state-of-the-art approaches, except for the SRF method. Therefore, the GWO-FC method can be considered a viable alternative for estimating software effort and other engineering problems.

The superiority of the proposed method can be attributed to the synergy of estimation and optimization achieved by combining the GWO with an FCNN, which enables the GWO to explore the FCNN parameter space and discover the optimal parameter values. This provides the best estimation performance and a reasonable balance between exploitation and exploration, as well as preventing the FCNN from falling into local optima.

4.8 Discussion

This research demonstrates that combining a metaheuristic optimizer with an ML estimator, in this case, the GWO and the FCNN network, can improve the accuracy of the estimation process and achieve high quality estimation performance and results. Using the GWO to optimize the FCNN network significantly helps in optimizing the solution in an iterative search process. As a result, the GWO-FC method helped identify the optimal set of parameters for estimation activities. The results of the experiments demonstrated the value of parameter optimization, as a pre-processing step in any estimation process. As a result, there is a strong likelihood that including non-optimal parameter values degrades estimation quality. This is supported by the evaluation comparison results in Sect. 4.5 and the verification comparison results in Sect. 4.6. Furthermore, all the results show that the proposed integration approach was successful in identifying the best parameter values for the estimation process.

Another notable finding from this study is that the proposed GWO-FC performed significantly better than the traditional FCNN because the GWO maintains a balance between exploitation and exploration. Essentially, the GWO has significant exploration ability in the GWO-FC approach due to the embedded adaptive parameters [38]. Simultaneously, the FCNN contributes to the enhancement of local search capability, which improves exploitation. These are the main explanations for the results shown in Tables 2 and 3, which show that the GWO-FC outperformed the conventional FCNN not only in terms of MSE but also in a variety of other performance metrics.

In summary, the empirical findings show that combining GWO and FCNN is beneficial. In the GWO-FC, the GWO receives multiple new solutions from the FCNN during the training step, which maximizes the exploration capability. The GWO then evaluates the most promising solutions and selects the best one for use in the FCNN testing step (the estimation process). Based on the experimental results, it can be inferred that the FCNN’s performance improved and became more stable, and that the optimal parameter values guarantee more accurate estimation. Finally, the proposed method can be said to be superior to comparable methods in the literature based on the statistical analyses.

In the future, researchers may wish to combine the GWO optimizer with another local-search algorithm, for instance, simulated annealing (SA), to improve search exploitation (local) capability, or to employ the GWO in other estimation/classification fields, such as medical diagnosis, intrusion detection, or image segmentation.

5 Threats to the study’s validity

In the following subsections, a number of threats to the study’s validity are covered:

5.1 Construct validity

If dependent and independent variables are not measured properly, a construct threat arises [80]. The datasets for the current research were taken from a trustworthy software engineering source and therefore this threat is not present here.

5.2 Internal validity

When a study’s usage of software metrics is loosely connected to the software efforts, threats to internal validity become possible. Internal validity threats are possible in this study as the programmer’s expertise capacity was not considered.

5.3 External validity

The study’s conclusions may be generalized since several project kinds were represented in the datasets and so there is very little threat to the study’s external validity.

6 Conclusion

Parameter tuning is a difficult optimization challenge for engineering problems involving estimation and classification because the aim of such tuning is to maximize and strengthen the performance of the estimator used. Previous studies have shown that metaheuristic techniques are appropriate for addressing this issue. Therefore, in this study, a promising metaheuristic algorithm, the gray wolf optimizer (GWO), was combined with the fully connected neural network (FCNN) to form a method named GWO-FC for software development effort estimation (SEE) problems. In GWO-FC, the GWO is utilized to optimize the FCNN parameters (weights and biases) to increase the accuracy of FCNN estimation by defining the most suitable parameter values to tackle the SEE problem. Hence, the GWO helps to increase exploration ability in the parameter search space as well as prevent the FCNN from falling into local optima.

The proposed GWO-FC method was evaluated against the traditional FCNN. In addition, the proposed method was validated against 24 state-of-the-art methods extracted from the literature. The research findings demonstrate that, for the majority of benchmark datasets and evaluation criteria, the GWO-FC substantially improved on the FCNN and most recent approaches. The results indicate that the traditional FCNN has limitations when trying to address the estimation problem. On the other hand, the GWO-FC has the potential to tackle the estimation task by increasing exploration in the search space. Thus, the results demonstrate that the GWO can be integrated with ML methods for the purpose of maximizing the accuracy of the estimation task.

Nevertheless, it should be stated that the proposed method still lacks the ability to compete with other modified metaheuristic algorithms in the literature. In addition, the proposed method suffers from a relatively high computational time compared to the traditional FCNN. Therefore, further studies may wish to consider developing new approaches to address the estimation problem. The authors intend to develop new and more efficient methods of tackling the SEE problem in the future, including ways of reducing the computational time of the proposed method.

Since the GWO algorithm’s limitations, in some circumstances, may prolong convergence time, plans have been made to develop novel techniques to improve the search behavior of the algorithm so that it maintains an appropriate balance between exploitation and exploration, while relatively increasing the exploration capacity. Metaheuristic techniques have been effectively used in several fields of study to enhance the training of ANNs but there are still few pertinent papers in this field. Future research should focus on investigating the use of metaheuristic algorithms in ANN architectures for SEE.