1 Introduction

Most software goes through a "compilation phase" without which it is not deemed complete, and in the absence of adequate resources it may not be efficient, reliable, maintainable or scalable. Amid the juggernaut of IT revolutions in the world, Software Development Effort Estimation (SDEE) is a research area undergoing intense study, as it assists in producing an error-free software product that adheres to the requirements of the customers (Gupta et al. 2014). In the present world of IT, a software failure is extremely perilous and can sometimes prove fatal. Many software disasters have occurred in the past, such as Y2K (2000), Mariner Bugs Out (1962) and the Hartford Coliseum Collapse (1978) (http://www.devtopics.com/20-famous-software-disasters/). Some of the prominent causes identified were inaccurate estimates of needed resources, badly defined system requirements, unmanaged risks and poor project management. Hence, it is essential to be systematic and organised while producing software, and in particular to estimate accurately the time, effort and cost required to produce it.

SDEE helps in realizing these goals of producing high-quality software. It involves estimating the effort (or manpower) required to produce the software prior to its development phase. Since effort is the main driver of software production, its estimation ultimately leads to estimates of the cost, time and staffing levels required to complete the project. Hence, accurate estimates of software development effort are imperative, as they underpin the entire framework and success of the software development process.

SDEE is usually performed when most of the attributes of the software are not yet known, so it has always been a challenging and daunting task. Since the 1980s, many SDEE models have been proposed in the literature (Jørgensen and Shepperd 2007; Elish 2009; Benala et al. 2012), but no model is perfect or has proved to be 100% accurate.

Traditionally, expert judgment was used for SDEE. It involved consulting an expert with extensive experience in developing software (Srivastava et al. 2020). This approach failed when no experts were available to consult; moreover, it was entirely intuitive. Later, algorithmic models were developed, such as the Constructive Cost Model (COCOMO) (Boehm 1984; Fenton and Pfleeger 1997; Pressman 1997) and the Putnam Resource Allocation model (Putnam 1978). These models were based on linear least-squares regression, and their equations were constructed from datasets of previous software projects. Their input was usually software size and/or function points. However, such models could not capture non-linear relationships between project characteristics and effort (Gray 1999).

Recently, various authors have estimated software development effort using non-algorithmic models.

These non-algorithmic models do not rely on one fixed formula; they allow expert intervention to adjust the model used for software cost prediction. They can work even on imprecise, noisy and vague data and still produce suitable results.

Non-algorithmic models work well in SDEE because, at the early stages of software development, the attributes of the software are unknown or imprecise, vague and noisy. Two widely known classes of non-algorithmic techniques are machine learning techniques and metaheuristic techniques. The machine learning techniques that have been used in the domain of SDEE include Artificial Neural Networks (Kumar et al. 2008), Genetic Algorithms (Burgess and Lefley 2001), Multiple Additive Regression Trees (Elish 2009) and linear regression models (Jiang et al. 2007; Xia et al. 2008). Many researchers are also inclined to use metaheuristic techniques in various domains for solving complex problems, and these can be combined with machine learning techniques. Some of the metaheuristic techniques that have been used in the domain of SDEE include the Firefly Algorithm (Kaushik et al. 2016), COA-Cuckoo Optimization (Kaushik et al. 2017) and Ant Lion Optimization (ALO) (Kaushik et al. 2020). Other domains of software engineering where metaheuristic techniques are used are software reliability (Sharma et al. 2011; Sharma and Pant 2014) and the Software Project Scheduling (SPS) problem (Sharma 2016).

Deep learning is another domain that is gaining importance these days. It outperforms previously existing techniques in multiple domains, such as image processing (Ciregan et al. 2012), optical character recognition (Breuel et al. 2013) and text-to-speech synthesis (Fan et al. 2014). The contribution of this research is to predict software effort by integrating the Deep Belief Network (DBN), a deep learning model, with the Whale Optimization Algorithm (WOA), a metaheuristic technique. The paper attempts to evaluate the success of deep learning for SDEE and to judge the effectiveness of integrating a metaheuristic technique, WOA, with the DBN as compared to integrating backpropagation with the DBN. This research attempts to answer the following questions:

  1.

    Can deep learning accurately predict the software development effort?

  2.

    Can the fine tuning of DBN using Whale Optimization Algorithm (WOA) perform better than the fine tuning of the DBN using backpropagation?

The remainder of the paper is organised as follows: Sect. 2 summarises the related work; Sect. 3 discusses the overview of techniques employed in the paper; Sect. 4 describes the proposed methodology of SDEE; Sect. 5 describes the experimental evaluation and results; Sect. 6 discusses the statistical validation; Sect. 7 lists the limitations and future scope and Sect. 8 concludes the paper and answers the research questions.

2 Related research

Realizing the importance of Software Development Effort Estimation, researchers over the past few decades have developed a number of techniques, which have been discussed here.

Muzaffar and Ahmed (2010) presented a study showing that the accuracy of effort prediction with a fuzzy logic-based system (FLS) depends largely on the architecture of the system, its parameters and the training algorithms. They concluded that the steepest descent algorithm is a better training algorithm than heuristic-based algorithms for FLS-based effort prediction.

Qin and Fang (2011) listed three types of software cost estimation methods: the top-down method, the bottom-up method and the analogy method. They discussed the COCOMO model, the COCOMO 2 model and also the inability of the COCOMO model to handle Commercial Off-The-Shelf (COTS) product integration costs.

Wen et al. (2012) analysed 84 studies through a thorough literature review of empirical studies on Machine learning (ML) models published between 1991 and 2010. They discovered that eight machine learning techniques have been employed in SDEE models, and that the estimation accuracy of these techniques was not only close to acceptable estimation levels but better than non-machine learning models. They also concluded that different ML models were suitable to different estimation contexts, based on the strengths and weaknesses of the model.

Nassif et al. (2013) compared a Multilayer Perceptron (MLP) with a novel log-linear regression model based on the Use Case Point (UCP) model for SDEE. They developed a linear regression model to predict the values of the productivity factor in their novel regression model. They also developed an MLP artificial neural network model that takes the size of the software and the productivity of the team, represented by eight factors, as inputs to give the software effort. They claimed that the MLP model surpasses the regression model for small projects (effort < 3000 person-hours) and that the log-linear regression model gives improved results for larger projects (effort > 3000 person-hours).

Dave and Dutta (2014) provided a review covering various artificial neural network-based models for SDEE proposed by various researchers. They reviewed twenty-one articles and emphasized the abilities of neural network-based models in SDEE.

Rao and Kumar (2015) proposed a Generalized Regression Neural Network (GRNN) for enhanced software effort estimation on the COCOMO dataset. They used the Mean Magnitude of Relative Error (MMRE) and Median Magnitude of Relative Error (MdMRE) as the evaluation criteria. They compared the proposed GRNN with various techniques such as M5, linear regression and the Radial Basis Function (RBF) kernel.

Panda et al. (2015) used the Story Point Approach (SPA) to enhance the accuracy of prediction of agile SDEE. They used different neural networks—GRNN, Probabilistic Neural Network (PNN), Group Method of Data Handling (GMDH), Polynomial Neural Network and Cascade-Correlation Neural Network. They concluded that cascade networks outscore other networks.

Kaushik et al. (2016) integrated the firefly algorithm and an artificial neural network (ANN) for accurate cost predictions. The novel method was compared with particle swarm optimization, and it was concluded that the proposed ANN model gives more accurate estimations than particle swarm optimization.

Rijwani and Jain (2016) used Multi Layered Feed Forward Neural Network for software effort estimation and experimentally evaluated their model on COCOMO dataset.

Miandoab and Gharehchopogh (2016) proposed a COA-Cuckoo optimization and K-Nearest Neighbours (KNN) algorithm for SDEE. They evaluated the proposed technique on eight evaluation criteria: Mean Magnitude of Error Relative to the estimate (MMER), Mean Magnitude of Relative Error (MMRE), Median Magnitude of Relative Error (MdMRE), Root Mean Square Error (RMSE), Mean Absolute Percentage Error (MAPE), Mean Absolute Error (MAE), Prediction(N) (PRED(N)) and Mean Square Error (MSE), and six datasets: KEMERER, MAXWELL, MIYAZAKI 1, NASA93 60, NASA93 63 and NASA93. They concluded that the proposed technique outperforms other techniques, such as COCOMO, KNN and the Cuckoo optimization algorithm, on the KEMERER, MIYAZAKI1, NASA93 60 and NASA93 datasets.

Kaushik et al. (2017) proposed a novel method, CUCKOO-FIS, which integrates two techniques: Cuckoo optimization, a metaheuristic search algorithm, and a Fuzzy Inference System. The combined technique is used for effort estimation and was successfully evaluated on software effort estimation datasets.

Rajpurohit et al. (2017) provided a comprehensive list of metaheuristic algorithms developed so far which is very insightful for solving various engineering problems including software cost estimation. The work provided a base for the researchers to work with metaheuristic algorithms.

Sharma and Pant (2017) proposed an Intermediate Artificial Bee Colony Greedy (I-ABC Greedy) algorithm. This algorithm overcomes the limitations of ABC algorithm. The efficiency of the proposed technique was verified by applying it on three real world problems consisting of parameter estimation in software reliability growth models, software effort estimation and redundancy optimization in modular software system models.

Kaushik et al. (2020) used Deep Belief Network along with Ant Lion Optimization (DBN-ALO) for effort estimation in both agile and non-agile datasets. Their approach worked best for both agile and non-agile development approaches.

All the above techniques are unique in their own way, and as new techniques emerge every year there always remains scope for improvement. This work explores that improvement by integrating the Whale Optimization Algorithm (WOA) and the Deep Belief Network (DBN) for Software Development Effort Estimation (SDEE).

3 Overview of techniques employed

3.1 Deep learning

Deep learning is part of a family of machine learning methods based on learning data representations. It uses a given training set to extrapolate new features from a limited set of features. Hence, it has an advantage over earlier neural networks and other machine learning algorithms: it can search for and discover additional features that correlate with those already known, and it may discover new ways of separating the noise from the signal. The two models of deep learning used in this research are described in Sects. 3.1.1 and 3.1.2.

3.1.1 Restricted Boltzmann machine (RBM)

RBMs (Yang and Papa 2016; Fischer and Igel 2012) are energy-based stochastic neural networks that have two layers—the visible layer (to represent observable data) and the hidden layer (to capture the dependencies observed between data). RBMs are connected in a complete bipartite graph manner as shown in Fig. 1, where there are m visible units and n hidden units. Each connection between a visible unit and a hidden unit in an RBM is associated with a weight. The matrix of all the weights of an RBM is represented by a weight matrix W.

Fig. 1
figure 1

A typical structure of RBM

The energy function of an RBM is given in Eq. (1)

$$E\left( {v,h} \right) = - \mathop \sum \limits_{i = 1}^{m} a_{i} v_{i} - \mathop \sum \limits_{j = 1}^{n} b_{j} h_{j} - \mathop \sum \limits_{i = 1}^{m} \mathop \sum \limits_{j = 1}^{n} v_{i} h_{j} w_{ij}$$
(1)

where a and b are the biases of visible and hidden layer respectively, vi is the value of observable data at ith visible unit, hj is the state of jth hidden unit and wij is the weight associated with the connection between ith visible unit and jth hidden unit.

In a DBN, each RBM is trained in an unsupervised manner, independently of other RBMs.
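As an illustration of the energy function in Eq. (1), the following minimal NumPy sketch evaluates E(v, h) for one visible-hidden configuration; the unit counts and random values are purely illustrative and are not taken from the paper's implementation.

```python
import numpy as np

def rbm_energy(v, h, a, b, W):
    """Energy E(v, h) of an RBM configuration as in Eq. (1)."""
    # -sum_i a_i v_i - sum_j b_j h_j - sum_ij v_i h_j w_ij
    return -a @ v - b @ h - v @ W @ h

# Illustrative shapes only: m visible units, n hidden units.
rng = np.random.default_rng(0)
m, n = 15, 30
v, h = rng.random(m), rng.random(n)
a, b = rng.normal(size=m), rng.normal(size=n)
W = rng.normal(scale=0.1, size=(m, n))
print(rbm_energy(v, h, a, b, W))
```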

3.1.2 Deep belief network

A Deep Belief Network (DBN) (Yang and Papa 2016; Fischer and Igel 2012) is formed when two or more Restricted Boltzmann Machines (RBMs) are stacked upon each other. In a DBN, the hidden layer of the ith RBM acts as the input layer to the (i + 1)th RBM. Each RBM in a DBN is trained in an unsupervised manner, independently of the other RBMs. A final output layer is added to the top of the DBN. The DBN can be fine-tuned with any algorithm (e.g. a gradient descent or a metaheuristic algorithm) to minimize some error measure at the final layer. A typical structure of a DBN is shown in Fig. 2. It represents L RBM layers, with Wi being the weight matrix of the ith RBM.

Fig. 2
figure 2

A typical structure of DBN

3.2 Whale optimization algorithm

Whale Optimization Algorithm (WOA) (Mirjalili and Lewis 2016) draws its inspiration from the hunting behaviour of humpback whales known as the bubble-net feeding method. Generally, they hunt schools of krill and small fish close to the surface. They do so by creating distinctive bubbles along a circular or '9'-shaped path, which comprises 'upward-spirals' and 'double-loops'. The humpback whales encircle the prey with their flashing fins, which keeps the prey contained so that it does not escape. The mathematical models of encircling prey, the spiral bubble-net foraging manoeuvre and the search for prey are described in the following subsections:

3.2.1 Encircling prey

Humpback whales encircle the prey and keep updating their positions towards the best search agent as the iterations increase i.e. from start to a maximum number of iterations. This behaviour is mathematically formulated as:

$$\vec{E} = \left| {\overrightarrow {D } \cdot \overrightarrow {{Y^{*} }} \left( t \right) - \vec{Y}\left( t \right)} \right|$$
(2)
$$\overrightarrow {Y } \left( {t + 1} \right) = \overrightarrow {{Y^{*} }} (t) - \overrightarrow {B } \cdot \vec{E}$$
(3)

where \(\vec{B}\) and \(\vec{D}\) are the coefficient vectors, t indicates the current iteration, Y* is the position vector of the best solution obtained so far, \(\vec{Y}\) is the position vector, | | denotes the absolute value and · denotes element-by-element multiplication. The vectors \(\vec{B}\) and \(\vec{D}\) are calculated as follows:

$$\vec{B} = 2\overrightarrow {b } \cdot \overrightarrow {s } - \overrightarrow {b }$$
(4)
$$\vec{D} = 2 \cdot \overrightarrow {s }$$
(5)

where \(\vec{b}\) is decreased linearly from 2 to 0 over the course of iterations (in both exploration and exploitation phases) and \(\vec{s}\) is a random vector in [0, 1].

3.2.2 Bubble-net attacking method

The following two approaches are developed to mathematically model the bubble-net behaviour of the humpback whales:

  1.

    Shrinking encircling mechanism: This behaviour is achieved by decreasing the value of b from 2 to 0 in Eq. (4) over the course of iterations. The new position of a search agent can be defined anywhere between the original position of the agent and the position of the current best agent by setting random values for \(\vec{B}\) in [− 1, 1].

  2.

    Spiral updating position: The spiral equation between the position of the prey and the whale, imitating the helix-shaped movement of the humpback whales, is as follows:

    $$\overrightarrow {Y } \left( {t + 1} \right) = \overrightarrow {E^{\prime}} \cdot e^{cl} \cdot cos\left( {2\pi l} \right) + \overrightarrow {{Y^{*} }} \left( t \right)$$
    (6)

The humpback whales move around the prey within a shrinking circle and along a spiral-shaped path simultaneously. To model this behaviour, a probability of 50% is assumed for choosing between the shrinking encircling mechanism and the spiral model when updating the positions of the whales. The mathematical model is described as follows:

$$\overrightarrow {Y } \left( {t + 1} \right) = \overrightarrow {{Y^{*} }} \left( t \right) - \overrightarrow {B } \cdot \vec{E}\quad {\mathrm{if}}\;p < 0.5$$
(7)
$$\overrightarrow {Y } \left( {t + 1} \right) = \overrightarrow {E^{\prime}} \cdot e^{cl} \cdot cos\left( {2\pi l} \right) + \overrightarrow {{Y^{*} }} \left( t \right)\quad {\mathrm{if}}\;p \ge 0.5$$
(8)

where \(\overrightarrow {{E^{\prime}}} = \left| {\overrightarrow {{Y^{*} }} \left( t \right) - \vec{Y}\left( t \right)} \right|\) is the distance of the ith whale to the prey (the best solution obtained so far), c is a constant defining the shape of the logarithmic spiral, l is a random number in [− 1, 1] and p is a random number in [0, 1].

3.2.3 Search for prey

The variation of \(\vec{B}\) vector can be utilized to search for prey, i.e., exploration phase. The mathematical model for this phase is as follows:

$$\vec{E} = \left| {\vec{D}. \overrightarrow {{Y_{rand} }} - \vec{Y}} \right|$$
(9)
$$\overrightarrow {Y } \left( {t + 1} \right) = \overrightarrow {{Y_{rand} }} - \overrightarrow {B } \cdot \vec{E}$$
(10)

where \(\overrightarrow {{Y_{rand} }}\) is a random position vector (a random whale) chosen from the current population.
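To make the update rules of Eqs. (2)-(10) concrete, the following is a minimal NumPy sketch of one WOA iteration over a population of search agents. It is a generic illustration rather than the paper's implementation; in particular, the element-wise treatment of the |B| < 1 exploitation/exploration switch and the spiral constant c are assumptions following Mirjalili and Lewis (2016).

```python
import numpy as np

def woa_step(Y, Y_star, t, max_iter, c=1.0, rng=np.random.default_rng()):
    """One position update of the whale population Y (pop x dim), Eqs. (2)-(10)."""
    pop, dim = Y.shape
    b = 2.0 - 2.0 * t / max_iter                   # b decreases linearly from 2 to 0
    Y_new = np.empty_like(Y)
    for i in range(pop):
        s = rng.random(dim)                        # random vector in [0, 1]
        B = 2.0 * b * s - b                        # Eq. (4)
        D = 2.0 * rng.random(dim)                  # Eq. (5)
        p, l = rng.random(), rng.uniform(-1.0, 1.0)
        if p < 0.5:
            if np.all(np.abs(B) < 1.0):            # exploitation: encircle the best agent
                E = np.abs(D * Y_star - Y[i])      # Eq. (2)
                Y_new[i] = Y_star - B * E          # Eq. (3)
            else:                                  # exploration: move w.r.t. a random whale
                Y_rand = Y[rng.integers(pop)]
                E = np.abs(D * Y_rand - Y[i])      # Eq. (9)
                Y_new[i] = Y_rand - B * E          # Eq. (10)
        else:                                      # spiral (bubble-net) update, Eq. (6)
            E_dash = np.abs(Y_star - Y[i])
            Y_new[i] = E_dash * np.exp(c * l) * np.cos(2 * np.pi * l) + Y_star
    return Y_new
```

In the proposed methodology (Sect. 4.4), the positions returned by WOA are used to derive the output-layer weights of the DBN, with the benchmark functions of Table 2 serving as fitness functions.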

4 Proposed methodology

This research uses the Deep Belief Network to predict software development effort. The DBN is fine-tuned using Whale Optimization Algorithm. The block diagram for the proposed methodology is given in Fig. 3. The graphical Representation of the proposed DBN for SDEE is given in Fig. 4.

Fig. 3
figure 3

Block diagram representing the proposed methodology

Fig. 4
figure 4

Graphical Representation of the proposed DBN for SDEE

The following key steps were taken for implementing the proposed methodology for SDEE:

  a.

    Construction of RBM

  b.

    Training of RBM

  c.

    Construction of DBN

  d.

    Fine-tuning of DBN with WOA

These steps are explained below:

4.1 Construction of RBM

The Restricted Boltzmann Machine (RBM) is constructed by taking two layers: the input layer (visible layer) and the hidden layer. The inputs (i.e. effort multipliers) \(v_{i}\) from the datasets are fed to the input layer. For example, from the COCOMO81 dataset, the 15 effort multipliers (EMs) whose values for project ID 1 are 0.88, 1.16, 0.7, 1, 1.06, 1.15, 1.07, 1.19, 1.13, 1.17, 1.1, 1, 1.24, 1.1 and 1.04 are fed to the input layer of RBM 1. Subsequently, the values of the hidden units of the hidden layer are calculated using Eq. (11)

$$h_{j} = \sigma \left( {\mathop \sum \limits_{i = 1}^{m} w_{ij} v_{i} + b_{j} } \right)$$
(11)

where m is the number of visible units, b is the bias of the hidden layer, vi is the value of observable data at ith visible unit, hj is the state of jth hidden unit, wij is the weight associated with the connection between ith visible unit and jth hidden unit and \(\sigma ( . )\) is the sigmoid activation function given in Eq. (12).

$$\sigma \left( x \right) = \frac{1}{{1 + e^{ - x} }}$$
(12)

Weights wij and bias b are initialized to random values.
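For illustration, the computation of Eqs. (11) and (12) for the COCOMO81 example above can be sketched as follows; the hidden-layer width and the random initial weights are assumptions for the example rather than the paper's trained values.

```python
import numpy as np

def sigmoid(x):
    """Sigmoid activation, Eq. (12)."""
    return 1.0 / (1.0 + np.exp(-x))

def hidden_values(v, W, b):
    """h_j = sigma(sum_i w_ij v_i + b_j), Eq. (11)."""
    return sigmoid(v @ W + b)

# Effort multipliers of project ID 1 of the COCOMO81 dataset (as quoted above).
v = np.array([0.88, 1.16, 0.70, 1.00, 1.06, 1.15, 1.07, 1.19,
              1.13, 1.17, 1.10, 1.00, 1.24, 1.10, 1.04])
rng = np.random.default_rng(0)
n_hidden = 30                                        # assumed width of the first hidden layer
W = rng.normal(scale=0.1, size=(v.size, n_hidden))   # random initial weights
b = np.zeros(n_hidden)                               # initial bias (random in the paper; zeros here)
h = hidden_values(v, W, b)
```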

4.2 Training of RBM

For the training of an RBM (Yang and Papa 2016; Fischer and Igel 2012), three parameters are updated in each iteration: the weight matrix W (the set of weights between the visible layer and the hidden layer), the bias a of the visible layer and the bias b of the hidden layer. In this research, the number of iterations for training the RBM is taken as 100. Firstly, the probability of the configuration (v, h) for every visible–hidden unit pair is computed using Eq. (13)

$$P\left( {v,h} \right) = \frac{{e^{{ - E\left( {v,h} \right)}} }}{{\mathop \sum \nolimits_{v,h} e^{{ - E\left( {v,h} \right)}} }}$$
(13)

where E (v, h) is computed using Eq. (14)

$$E\left( {v,h} \right) = - \mathop \sum \limits_{i = 1}^{m} a_{i} v_{i} - \mathop \sum \limits_{j = 1}^{n} b_{j} h_{j} - \mathop \sum \limits_{i = 1}^{m} \mathop \sum \limits_{j = 1}^{n} v_{i} h_{j} w_{ij}$$
(14)

here, a and b are the biases of the visible and hidden layers respectively, m is the number of visible units, n is the number of hidden units, vi is the value of the observable data at the ith visible unit, hj is the state of the jth hidden unit and wij is the weight associated with the connection between the ith visible unit and the jth hidden unit. The weights wij and the biases a and b are initialized to random values. Then, given a visible vector (its observable data), its probability is computed by summing over all hidden vectors as given in Eq. (15)

$$P\left( v \right) = \frac{{\mathop \sum \nolimits_{h} e^{{ - E\left( {v,h} \right)}} }}{{\mathop \sum \nolimits_{v,h} e^{{ - E\left( {v,h} \right)}} }}$$
(15)

The data-driven probability \(E[hv]^{data}\), which is used directly for updating the weight matrix W and the biases a and b, is calculated using Eq. (16) as

$$E[hv]^{data} = P(h|v)v^{T}$$
(16)

where \(P(h|v)\) is the conditional probability for obtaining h given v and this is calculated using Eq. (17) as

$$P\left( {h{|}v} \right) = \frac{{P\left( {v,h} \right)}}{{P\left( v \right)}}$$
(17)

The reconstruction-driven probability \(E[hv]^{model}\) is also needed for updating the weight matrix W and the biases a and b. It is calculated by a method based on contrastive divergence given by Hinton (2002). Firstly, the states of the hidden units are computed from the visible units (which are initialized with a training sample) using Eq. (18), and the states of the visible units are then reconstructed using Eq. (19).

$$P\left( {h_{j} = 1{|}v} \right) = \sigma \left( {\mathop \sum \limits_{i = 1}^{m} w_{ij} v_{i} + b_{j} } \right)$$
(18)
$$P\left( {v_{i} = 1{|}h} \right) = \sigma \left( {\mathop \sum \limits_{j = 1}^{n} w_{ij} h_{j} + a_{i} } \right)$$
(19)

where \(\sigma ( . )\) is the sigmoid activation function given in Eq. (12).

Formula for \(E[hv]^{model}\) is given in Eq. (20)

$$E[hv]^{model} = P(\tilde{h}|\tilde{v})\tilde{v}^{T}$$
(20)

Finally, Eq. (21), Eq. (22) and Eq. (23) are used to update the weight matrix W and biases a and b respectively.

$$W^{t + 1} = W^{t} + \eta \left( {E[hv]^{data} - E[hv]^{model} } \right) = W^{t} + \eta \left( {P(h|v)v^{T} - P(\tilde{h}|\tilde{v})\tilde{v}^{T} } \right)$$
(21)
$$a^{t + 1} = a^{t} + \eta \left( {v - E\left[ v \right]^{model} } \right) = a^{t} + \eta \left( {v - \tilde{v}} \right)$$
(22)
$$b^{t + 1} = b^{t} + \eta \left( {E[h]^{data} - E[h]^{model} } \right) = b^{t} + \eta \left( {P(h|v) - P(\tilde{h}|\tilde{v})} \right)$$
(23)

The training of RBM 1 is carried out for the specified number of iterations in an unsupervised manner. After stacking multiple RBMs on each other, a DBN is constructed; its construction is described in Sect. 4.3. For concreteness, one training update of a single RBM can be sketched as follows.
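This is a minimal NumPy rendering of a CD-1 style update following Eqs. (16)-(23); the hidden-state sampling and the learning rate value are illustrative choices, not the paper's exact settings.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_update(v, W, a, b, eta=0.1, rng=np.random.default_rng()):
    """One contrastive-divergence update of W, a and b, following Eqs. (16)-(23)."""
    p_h = sigmoid(v @ W + b)                     # P(h_j = 1 | v), Eq. (18)
    h = (rng.random(p_h.shape) < p_h) * 1.0      # sampled hidden states
    v_tilde = sigmoid(h @ W.T + a)               # reconstructed visible units, Eq. (19)
    p_h_tilde = sigmoid(v_tilde @ W + b)         # P(h_j = 1 | v_tilde)
    W = W + eta * (np.outer(v, p_h) - np.outer(v_tilde, p_h_tilde))  # Eq. (21)
    a = a + eta * (v - v_tilde)                                      # Eq. (22)
    b = b + eta * (p_h - p_h_tilde)                                  # Eq. (23)
    return W, a, b
```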

4.3 Construction of DBN

The DBN is constructed when multiple RBMs are stacked upon each other. In this research, we have stacked three RBMs on each other to construct the DBN. Each RBM is trained independently in an unsupervised manner for a specific number of epochs.

The values of the hidden units of RBM 1 after training are fed to the visible units of RBM 2. Subsequently, the values of the hidden units of RBM 2, after training, are fed to the visible units of RBM 3 (final RBM). Finally, after the training of RBM 3, an output layer is added to RBM 3 consisting of only one output unit as shown in Fig. 4. The values of the hidden units of the final hidden layer (RBM 3’s hidden layer) are multiplied with the weights between the final hidden layer and the output unit and summed up to give the calculated effort.

For example, after the m effort multipliers of the project ID 1 of COCOMO81 dataset is fed to the input layer of RBM 1, RBM 1 was trained using 100 epochs. The value of m is 15 in this case. The values of its hidden layer nodes turned out to be 0.842, 0.989, 0.823, 0.759, 0.751, 0.788, 0.969, 0.764, 0.855, 0.839, 0.805, 0.778, 0.789, 0.765, 0.732, 0.932, 0.864, 0.761, 0.760, 0.960, 0.805, 0.893, 0.958, 0.735, 0.933, 0.726, 0.890, 0.909, 0.848, and 0.940. After the training of RBM 1, RBM 2 was trained in a similar way. The values of its hidden layer nodes turned out to be 0.818, 0.812, 0.819, 0.799, 0.811, 0.777, 0.812, 0.813, 0.803, 0.804, 0.810, 0.808, 0.813, 0.819, 0.825, 0.796, 0.819, 0.802, 0.812, 0.822, 0.814, 0.785, 0.808, 0.820, 0.780, 0.807, 0.807, 0.808, 0.795, 0.802, 0.796, 0.800, 0.789, 0.809, 0.798, 0.802, 0.810, 0.803, 0.810, 0.802, 0.811, 0.810, 0.803, 0.797, 0.779, 0.804, 0.807, 0.817, 0.795, 0.805, 0.779, 0.824, 0.802, 0.815, 0.826, 0.822, 0.796, 0.805, 0.789 and 0.797. Finally, after the training of RBM 2, RBM 3 was trained in a similar way. The values of its hidden layer nodes turned out to be 0.902, 0.907, 0.888, 0.887, 0.908, 0.906, 0.903, 0.903, 0.902, 0.891, 0.884, 0.906, 0.902, 0.902, 0.899, 0.905, 0.898, 0.900, 0.893, 0.902, 0.888, 0.904, 0.887, 0.898, 0.903, 0.892, 0.908, 0.895, 0.912 and 0.899.
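For illustration, the layer-by-layer propagation just described can be sketched as follows; the layer sizes mirror the worked example (15 inputs, then 30, 60 and 30 hidden units), while the weights shown here are random placeholders rather than the trained values.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def dbn_hidden_values(v, weights, biases):
    """Feed v through the stacked RBMs: the hidden layer of RBM i feeds RBM i+1."""
    h = v
    for W, b in zip(weights, biases):
        h = sigmoid(h @ W + b)                 # Eq. (11) applied at each RBM
    return h                                   # hidden values of RBM 3

# Layer sizes as in the worked example: 15 -> 30 -> 60 -> 30 (assumed here).
sizes = [15, 30, 60, 30]
rng = np.random.default_rng(0)
weights = [rng.normal(scale=0.1, size=(m, n)) for m, n in zip(sizes[:-1], sizes[1:])]
biases = [np.zeros(n) for n in sizes[1:]]
h_final = dbn_hidden_values(rng.random(15), weights, biases)
```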

The DBN is then fine-tuned using the whale optimization algorithm, as explained in Sect. 4.4.

4.4 Fine tuning of DBN with WOA

For the fine-tuning of the DBN, the whale optimization algorithm is run, and the weights between the hidden layer of the final RBM and the single output unit are calculated from the parameters returned by the whale algorithm in the first for loop. The matrices used, the weight calculations and the dimensions of the matrices are given in Table 1. The matrices used in the algorithm are initialized to random values. Table 2 describes the 13 objective functions (F1–F13) (Mirjalili and Lewis 2016) which are used for finding the fitness values in WOA. Results were obtained with all the functions F1–F13 for the proposed technique on all four datasets, but only the functions that provided the best results are shown in the experimental evaluation in Sect. 5. Table 3 shows the values used to initialize the parameters of the proposed approach.

Table 1 Particulars used in WOA for fine-tuning of the DBN
Table 2 Description of unimodal and multimodal benchmark function (Mirjalili and Lewis 2016)
Table 3 Initialization of parameters for implementing the proposed methodology

The hidden units’ values of RBM 3, obtained after training, are multiplied with their corresponding weight values obtained from WOA at the output unit and added as given in Eq. (24).

$$X = \left( {\mathop \sum \limits_{i = 1}^{n} w_{i} h_{i} } \right)$$
(24)

where X is the final sum obtained at the output layer of the DBN, n is the number of hidden units in the final RBM's hidden layer, \(h_{i}\) is the value of the ith hidden unit and \(w_{i}\) is the weight, obtained from WOA, between the ith hidden unit of the final RBM and the single output unit. For example, for project ID 1 of the COCOMO81 dataset, the value of \(X\) obtained with the WOA algorithm came out to be 19.521. The sum X obtained from Eq. (24) is multiplied by the size of the project taken from the dataset to obtain the final calculated effort, as given in Eq. (25).

$$Effort = X * size$$
(25)

In our case, the value of size is 113 and thus the calculated \(Effort\) comes out to be 2205.793, whereas the actual effort is 2040. The calculated effort is compared with the actual effort from the dataset using various evaluation criteria, as discussed in Sect. 5. The process of fine-tuning is terminated if the magnitude of relative error (MRE) is within the acceptable range (taken as 25%) (Di Martino et al. 2011; Foss et al. 2003) or if the maximum number of epochs, taken as 500, is reached. Otherwise, the procedure is repeated.

The metaheuristic algorithm continues to explore and exploit the search space until the stopping criterion is satisfied. The actual effort from the dataset is used only to verify whether the MRE is within the acceptable range; neither the error measure nor the actual effort value is used to update any parameters. The proposed technique has also been compared with DBN-BP. The equation used for updating the weights between RBM 3 and the final output layer in DBN-BP is given as Eq. (26)

$$w_{new} = w_{old} + \left( {\delta *alpha} \right)$$
(26)

where \(\delta\) is equal to MRE and alpha is the learning rate whose value lies in the range [0, 1].
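The effort computation and the stopping rule described above, together with the DBN-BP baseline update of Eq. (26), can be summarised in the sketch below. In the DBN-WOA variant the output weights `w_out` would instead be taken from the best agent's position returned by WOA (see the sketch at the end of Sect. 3.2); the names and default values here are illustrative, not the paper's exact settings.

```python
import numpy as np

def estimated_effort(h_final, w_out, size):
    """Effort from the single output unit, Eqs. (24)-(25)."""
    X = float(np.dot(w_out, h_final))          # Eq. (24)
    return X * size                            # Eq. (25)

def mre(actual, estimated):
    """Magnitude of relative error, Eq. (28)."""
    return abs(actual - estimated) / actual

def fine_tune_bp(h_final, w_out, size, actual_effort,
                 alpha=0.1, max_epochs=500, acceptable_mre=0.25):
    """DBN-BP style fine-tuning of the output weights using Eq. (26) as stated."""
    for _ in range(max_epochs):
        est = estimated_effort(h_final, w_out, size)
        delta = mre(actual_effort, est)        # MRE against the actual effort
        if delta <= acceptable_mre:            # stop within the 25% acceptable range
            break
        w_out = w_out + delta * alpha          # Eq. (26): w_new = w_old + (delta * alpha)
    return w_out, estimated_effort(h_final, w_out, size)
```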

4.5 Complexity

There are four steps in the algorithm. The construction of an RBM includes creating the input layer and the hidden layer, which takes constant time, i.e., O(1). Let the time complexity of training one RBM be O(R). The construction of the DBN involves training the stacked RBMs; in our proposed algorithm, three RBMs are trained, so the time complexity is O(3R). In the last step, for each iteration during the fine-tuning of the DBN, optimized weights are obtained from WOA, so the time complexity is O(N*W), where N is the maximum number of iterations and O(W) is the time complexity of WOA. Hence, the total time complexity of DBN-WOA is O(1 + 3R + N*W) = O(R + N*W).

5 Experimental evaluation

Testing the proposed methodology for SDEE after training the DBN is imperative, as it gives a clear estimate of how well the technique performs when predicting effort for unseen projects (projects not seen during the training period). There are various methodologies to test and validate such techniques, including the holdout method, leave-one-out cross validation, tenfold cross validation and threefold cross validation. In this research, the proposed technique is experimentally validated using threefold cross validation, which has been widely used by various authors in SDEE to validate their methodologies, such as Burgess and Lefley (2001) and Kumar et al. (2008). It divides the dataset into three parts; any two parts are taken for training and the left-out part is used for testing. This is repeated three times. In this research, we have divided the datasets into three equal parts. The advantage of threefold cross validation is that it does not matter how the dataset gets divided: every data point is in a test set exactly once and in a training set exactly twice.
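Below is a minimal sketch of the threefold split, assuming scikit-learn's KFold; the feature matrix and effort vector are random placeholders standing in for a dataset's effort multipliers and actual efforts.

```python
import numpy as np
from sklearn.model_selection import KFold

# Placeholder data: e.g. 63 projects with 15 effort multipliers each (COCOMO81-like shape).
rng = np.random.default_rng(0)
X, y = rng.random((63, 15)), rng.random(63)

kf = KFold(n_splits=3, shuffle=True, random_state=1)
for fold, (train_idx, test_idx) in enumerate(kf.split(X), start=1):
    X_train, y_train = X[train_idx], y[train_idx]
    X_test, y_test = X[test_idx], y[test_idx]
    # train DBN-WOA on (X_train, y_train); compute MMRE, MdMRE, Pred(0.25) on (X_test, y_test)
    print(f"fold {fold}: {len(train_idx)} training projects, {len(test_idx)} test projects")
```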

The datasets used for experimentally and statistically evaluating the proposed technique are COCOMO81, NASA93, MAXWELL and CHINA. These datasets were obtained from the PROMISE Software Engineering Repository (http://promise.site.uottawa.ca/SERepository/) and have been used widely by various authors in the domain of SDEE (Benala et al. 2012; Elish 2009; Kaushik et al. 2016). Evaluation criteria play an important role in estimating and commenting upon the success of a technique. In this paper, we have used three evaluation criteria: Mean of Magnitude of Relative Error (MMRE), Prediction (l) (Pred(l)) and Median of MRE (MdMRE).

Mean of magnitude of relative error (MMRE) is calculated as the average of magnitude of relative errors (MREs) of all projects in the dataset. The formula for calculating MMRE is given in Eq. (27).

$${\mathrm{MMRE}} = \frac{1}{N}\mathop \sum \limits_{i = 1}^{N} MRE_{i}$$
(27)

where N is the number of projects in the dataset.

Magnitude of relative error (MRE) finds the relative error between actual effort and estimated effort for each project in the dataset. Formula for calculating MRE is given in Eq. (28).

$${\mathrm{MRE}} = \frac{{\left| {{\mathrm{Actual}}\;\,{\mathrm{Effort}} - {\mathrm{Estimated}}\;\,{\mathrm{Effort}}} \right|}}{{{\mathrm{Actual}}\;\,{\mathrm{Effort}}}}$$
(28)

Pred (l) represents the percentage of MREs less than or equal to the value l among all projects. It is defined in Eq. (29).

$${\mathrm{Pred }}\left( l \right) = \frac{k}{n}$$
(29)

where n is the total number of observations and k is the number of observations whose MRE is less than or equal to l. MdMRE is the median of the MREs of all projects in a dataset. MdMRE is less susceptible to extreme values; it behaves similarly to MMRE but is more likely to select the true model when under-estimation is present (Kumar et al. 2008). Software development effort estimation models are said to be acceptably accurate if the values of MMRE and MdMRE are less than or equal to 0.25 and Pred(0.25) is greater than or equal to 0.75 (Di Martino et al. 2011; Foss et al. 2003).
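These criteria can be computed directly from the vectors of actual and estimated efforts, as in the following sketch, a straightforward NumPy rendering of Eqs. (27)-(29) and of the MdMRE definition.

```python
import numpy as np

def mre(actual, estimated):
    """Magnitude of relative error per project, Eq. (28)."""
    return np.abs(actual - estimated) / actual

def mmre(actual, estimated):
    """Mean of the MREs over all projects, Eq. (27)."""
    return float(np.mean(mre(actual, estimated)))

def mdmre(actual, estimated):
    """Median of the MREs over all projects."""
    return float(np.median(mre(actual, estimated)))

def pred(actual, estimated, l=0.25):
    """Proportion of projects with MRE <= l, Eq. (29)."""
    return float(np.mean(mre(actual, estimated) <= l))

# Acceptability check used in the paper: MMRE, MdMRE <= 0.25 and Pred(0.25) >= 0.75.
```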

The results of training the proposed technique on all four datasets are given in Table 4. The proposed DBN-WOA was evaluated with all 13 objective functions and for all three folds of each dataset. The objective function which gave the best result is tabulated for the respective dataset in Table 4. It is found that objective function F11 is best for the COCOMO81 and NASA93 datasets, F4 is best for the MAXWELL dataset and F1 is best for the CHINA dataset. The proposed technique has been tested using the best objective functions obtained from Table 4. After the optimum weights of the DBN are successfully explored and searched using WOA in the training phase, the resultant DBN is subjected to testing. The results from the testing phase on the four datasets are presented in Table 5. Since both the training and testing errors are low, it can be inferred that the model is neither overfitting nor underfitting.

Table 4 Results of training of the proposed technique, DBN-WOA
Table 5 Results of testing of the proposed technique, DBN-WOA

A snapshot of the convergence graph for a particular stage of the proposed methodology, where the calculated effort approaches the actual effort for project ID 32 of the COCOMO81 dataset, is given in Fig. 5. This convergence helps the DBN obtain optimum weights.

Fig. 5
figure 5

Epochs vs Effort curve for DBN-WOA for COCOMO81 dataset for project ID 32

The proposed technique is compared against the DBN fine-tuned with backpropagation, and the results from its training and testing phases are presented in Tables 6 and 7 respectively. The results show that integrating WOA with the DBN greatly enhances the accuracy of software development effort estimation compared with integrating backpropagation with the DBN. The rationale is that searching for weights in a DBN can be modelled as an optimization problem whose goal is to find optimum weights, and metaheuristic techniques tend to surpass other optimization techniques because they tend to avoid local optima.

Table 6 Results of training of DBN-BP for the datasets used
Table 7 Results of testing of DBN-BP for the datasets used

Figure 6 shows the bar graph representation of DBN-WOA and DBN-BP for COCOMO81 training dataset.

Fig. 6
figure 6

Comparison of DBN-WOA and DBN-BP on COCOMO81 training dataset

The results of the proposed DBN-WOA technique are also compared with the work of Rijwani and Jain (2016) and Kaushik et al. (2020). Table 8 demonstrates the results for a few projects of the COCOMO dataset, as the same project IDs were used by Rijwani and Jain (2016).

Table 8 Effort Comparison on COCOMO dataset

Table 8 shows that the estimated effort using DBN-WOA is better for most of the projects.

Kaushik et al. (2020) used Ant Lion Optimization (ALO) and proposed DBN-ALO. The estimated effort given by them for the COCOMO dataset is 2378 for project ID 1 and 230 for project ID 3, whereas the estimated effort using DBN-WOA is 2205.793 for project ID 1 and 240.68 for project ID 3. Thus, DBN-WOA provides better results than the techniques used in the earlier research.

6 Statistical validation

Since experimental evaluation alone cannot accurately determine the success of the proposed technique for SDEE, analysis of the technique using statistical tests is imperative (Prasad Reddy et al. 2010; Kitchenham and Mendes 2009).

Statistical inferential tests are of three types: parametric, non-parametric and semi-parametric. Parametric tests assume that the data come from a certain type of probability distribution and make deductions about the distribution parameters. Non-parametric tests, on the other hand, do not rely on the data belonging to any specific distribution and include techniques that do not assume a fixed model structure; they are mainly used for ordinal data. Semi-parametric tests combine the merits of both parametric and non-parametric techniques in a sophisticated manner, and Mittas et al. (2015) have given a new set of techniques called semi-parametric models (SPM) to handle software estimations. In software cost estimation, we assume that the project data do not follow any distribution. Hence, to validate the work statistically, non-parametric tests are used. Here, the Friedman test is used, which is the non-parametric equivalent of the well-known analysis of variance. The Friedman test detects differences between several related samples, and its null hypothesis states that all techniques perform equivalently. The test was carried out using the IBM SPSS (Statistical Package for the Social Sciences) tool. For each of the datasets, the MRE values calculated using the different objective functions were used as input to the tool, which provided the mean ranks depicted in Table 9. The test confirmed that the objective functions which gave the best experimental results also performed best here. For example, function F11 gave the best results for COCOMO81 and NASA93, and here too it has the lowest mean rank for both datasets: 5.10 for COCOMO81 and 4.06 for NASA93. Similarly, function F1 performed best for the CHINA dataset and has the minimum mean rank of 4.55, and for the MAXWELL dataset function F4 has the lowest mean rank of 3.92. This is depicted in Table 9.
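As a reproducibility aid, the same test can also be run outside SPSS, for instance with SciPy; in the sketch below the MRE matrix is a random placeholder standing in for a dataset's per-project MRE values under the compared objective functions, and the mean ranks computed correspond to those reported in Table 9.

```python
import numpy as np
from scipy.stats import friedmanchisquare, rankdata

# Rows: projects of one dataset; columns: objective functions compared (placeholder values).
rng = np.random.default_rng(0)
mre_by_function = rng.random((63, 9))

# Friedman test: null hypothesis is that all objective functions perform equivalently.
stat, p_value = friedmanchisquare(*mre_by_function.T)

# Mean rank of each objective function across projects (lower is better), as in Table 9.
mean_ranks = rankdata(mre_by_function, axis=1).mean(axis=0)
print(stat, p_value, mean_ranks)
```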

Table 9 Results of Friedman Test depicting mean ranks for the datasets

The objective functions F5 and F8 for the COCOMO81 and NASA93 datasets, the objective functions F5, F7, F8 and F11 for CHINA, and the objective functions F2, F5, F8 and F11 for MAXWELL were not used in the statistical validation, as they produced NaN (not a number) values in MATLAB during the experimental validation.

7 Limitations and future scope

The limitations of the proposed technique can be summarized as follows.

  • The effort multipliers in the datasets employed may not be extremely accurate, because developers might be optimistic when answering questions related to their capabilities. Therefore, errors may be introduced into the calculated effort. Also, the effect of environmental factors cannot be determined with the available data.

  • In this paper, traditional datasets are used for estimating the performance of the proposed technique. Nowadays, software is being developed using agile techniques. The proposed technique should also be evaluated and validated using these latest software projects’ datasets.

  • Although initialization of the parameters in the employed technique is done through extensive study and evaluation, more optimum initializations of these parameters might be possible.

  • In this paper, k-fold cross validation is employed where the value of k is taken as 3, but it is very difficult to decide the appropriate value of k. Techniques employed can, therefore, be tested with more validation techniques such as tenfold cross validation, Leave-one out cross validation etc.

  • Though some objective functions gave remarkable results for SDEE, functions such as F5 and F8 failed to provide appropriate efforts for all the proposed techniques.

  • In this paper, three RBMs have been stacked upon each other to form a DBN, but it is difficult to know the appropriate number of RBMs to stack. Using a different number of stacked RBMs might give different results.

All the above limitations can be overcome by further work in this domain. This research attempts to evaluate the effectiveness of integrating deep learning and metaheuristics for SDEE. Traditional datasets are used for conducting the experimental evaluation; nowadays, software is mostly developed using agile techniques, so the research can also be verified using agile datasets (Panda et al. 2015). Integration of metaheuristic techniques with other deep learning models, such as Recurrent Neural Networks and Convolutional Neural Networks, can also be evaluated to confirm the effectiveness of deep learning in effort estimation. For more refined fine-tuning of the parameters in the DBN, the techniques proposed by Calvet et al. (2016) can be employed. The proposed technique can also be tested using different validation techniques.

8 Conclusion

In this research, a Deep Belief Network (DBN) is used to predict software development effort, and its parameters are fine-tuned using a metaheuristic technique, the Whale Optimization Algorithm (WOA). The technique, DBN-WOA, is evaluated on four effort estimation datasets and is compared with another variant of the DBN that uses the backpropagation algorithm, i.e., DBN-BP. The experimental and statistical results show that DBN-WOA performs better than DBN-BP. In the future, we will work on the limitations listed in Sect. 7 to improve the existing method.

8.1 Answers to research questions

  1.

    Can deep learning accurately predict the software development effort?

    Yes, deep learning can predict the software development effort within 25% MRE, as can be observed from Tables 4 and 5. However, the results of the technique can vary depending upon the DBN architecture.

  2.

    Can the fine tuning of DBN using Whale Optimization Algorithm (WOA) perform better than the fine tuning of the DBN using backpropagation?

Yes, the fine-tuning of the DBN with a metaheuristic technique like WOA outperforms the fine-tuning of the DBN with backpropagation, as can be inferred from the tabular results in Sect. 5. Here too, depending upon the initialization of parameters, the performance of WOA may vary.