1 Introduction

Early estimation of software development effort is a critical management activity. Indeed, realistic estimates are crucial to the adequate allocation of resources and also affect the competitiveness of a software company (Mendes 2009). Several studies have addressed this problem (e.g., Briand et al. 1999; Briand et al. 2000; Briand and Wieczorek 2002; Costagliola et al. 2006; Maxwell et al. 1999; Mendes et al. 2003b; Shepperd et al. 1996; Shepperd and Schofield 1997), many of which focus on proposing and evaluating techniques to construct predictive models that estimate the effort of a new project by exploiting information (actual effort and cost drivers) about past projects. In particular, recent studies (Corazza et al. 2009, 2010, 2011) have investigated the effectiveness of Support Vector Regression (SVR) for software effort estimation. SVR is a technique based on Support Vector Machines, a family of Machine Learning algorithms that have been successfully applied to several predictive data modeling problems (Cristianini and Shawe-Taylor 2000; Smola and Schölkopf 2004). The studies reported in (Corazza et al. 2009, 2011) showed that SVR is also promising for software development effort estimation; indeed, it outperformed the most commonly adopted prediction techniques on the Tukutuku database (Mendes et al. 2005a, b), a cross-company dataset of Web projects widely adopted in Web effort estimation studies. It was argued that the main reason for this lies in the flexibility of the method: SVR supports different kernels and parameter settings, allowing the learning mechanism to better suit the characteristics of different chunks of data, which is a typical characteristic of cross-company datasets. However, the parameters need to be set carefully, since an inappropriate choice can lead to over- or under-fitting, severely worsening the performance of the method (Chang and Lin 2001; Keerthi 2002). Nevertheless, there are no guidelines on how to best select these parameters (Scholkopf and Smola 2002; Vapnik and Chervonenkis 1964; Vapnik 1995), since the appropriate setting depends on the characteristics of the dataset employed. Moreover, an exhaustive examination of all possible parameter values is not computationally affordable, as the search space is too large, also because the parameters interact and thus cannot be optimized separately.

The issues mentioned above motivated us to investigate the use of Tabu Search (TS) to automatically select SVR parameters (Corazza et al. 2010). TS is a meta-heuristic search technique used to address several optimization problems (Glover and Laguna 1997). The TS-based approach was first investigated in (Corazza et al. 2010), employing SVR in combination with different kernels and variable preprocessing strategies and using the Tukutuku database (Mendes et al. 2005a) as dataset. In particular, we compared SVR configured with TS (SVR + TS) with other effort estimation techniques, namely Manual StepWise Regression (MSWR), Case-Based Reasoning (CBR), Bayesian Networks (Mendes 2008), and the Mean and Median effort of the training sets. SVR + TS gave the best results ever achieved with the Tukutuku database. However, these results were based on two random splits of only one cross-company dataset, and it is widely recognized that several empirical analyses are needed to generalize empirical findings. Thus, the aim of this paper is to further investigate the combination of TS and SVR using data from several single- and cross-company datasets. Let us recall that the former represents a dataset containing data on projects from a single software company, while the latter includes project data gathered from several software companies. In our analysis, we employed 13 different datasets from the PROMISE repository and 8 further datasets obtained by splitting the Tukutuku database according to the values of its four categorical variables (see Appendix A). The choice to use datasets from the PROMISE repository is motivated by the following points:

  • Availability of datasets on industrial software projects, representing a diversity of application domains and projects’ characteristics. This is also in line with recommendations made by Kitchenham and Mendes (2009).

  • Availability of projects that are not Web-based, thus enabling the assessment of the effectiveness of the estimation technique employed herein when applied to different types of applications – Web, using the Tukutuku, and non-Web, using the PROMISE datasets. We would also like to point out that, in our view, Web and software development differ in a number of areas, such as: application characteristics, primary technologies used, approach to quality delivered, development process drivers, availability of the application, customers (stakeholders), update rate (maintenance cycles), people involved in development, architecture and network, disciplines involved, legal, social, and ethical issues, and information structuring and design. A detailed discussion on this issue is provided in (Mendes et al. 2005b).

  • Availability of single- and cross-company datasets, thus enabling the assessment of the estimation technique employed herein when applied to single- and cross-company datasets. We would also like to point out that the use of a cross-company dataset is particularly useful for companies that do not have their own data on past projects from which to obtain their estimates, or that have data on projects developed in different application domains and/or technologies. To date, several studies have investigated whether estimates obtained using cross-company datasets can be as accurate as the ones obtained using single-company datasets (e.g., Briand et al. 1999; Jeffery et al. 2000; Kitchenham and Mendes 2004; Corazza et al. 2010; Lefley and Shepperd 2003; Maxwell et al. 1999; Mendes et al. 2008; Mendes and Kitchenham 2004; Wieczorek and Ruhe 2002), reporting different findings (see Kitchenham et al. 2007 for a systematic review).

In relation to the choice of SVR kernels and pre-processing strategies, we focused our analysis on the RBF kernel and a logarithmic transformation of the variables since they provided the best results in our previous study (Corazza et al. 2010).

In order to verify whether the proposed TS technique is able to make a suitable choice of SVR parameters, we also compared the estimates obtained with SVR + TS with those obtained by applying SVR with different strategies for parameter selection, namely:

  • random SVR configurations. This means that the same number of solutions investigated for SVR + TS was generated in a totally random fashion and the best one among them was selected according to the same criteria employed for SVR + TS. This is a natural benchmark when using meta-heuristic search techniques;

  • default parameters employed by the Weka tool (Hall et al. 2009);

  • the Grid-search algorithm provided by (Chang and Lin 2001).

In addition, we also assessed whether the estimates provided by the proposed approach were better than those obtained using the Mean and Median effort (popular and simple benchmarks for effort estimation techniques) and those achieved with MSWR and CBR. These techniques were chosen because they are widely used both in the literature and in industry, being among the most frequently employed estimation techniques (Mair and Shepperd 2005).

Consequently, the research questions addressed in this paper are:

  1. RQ1.

    Is Tabu Search able to effectively set Support Vector Regression parameters?

  2. RQ2.

    Are the effort predictions obtained by using Support Vector Regression configured with Tabu Search significantly superior to the ones obtained by other techniques?

The remainder of the paper is organized as follows. Section 2 first reports on the main aspects of SVR and TS and then describes the proposed approach based on TS to set-up SVR parameters. Section 3 presents the design of our empirical study, i.e., the datasets, the null hypotheses, the validation method, and the evaluation criteria employed to assess the prediction accuracy. Results are presented in Section 4, followed by a discussion on the empirical study validity in Section 5. Related work is discussed in Section 6. Final remarks and some future work conclude the paper.

2 Using Support Vector Regression in Combination with Tabu Search for Effort Estimation

In this section, we describe Support Vector Regression, Tabu Search, and how we have combined them for effort estimation.

2.1 Support Vector Regression

Support Vector Regression is a regression technique based on Support Vector (SV) machines, a learning approach originally introduced for linear binary classification. Linear classifiers construct a hyperplane separating the training set points belonging to the two classes. In the SV machine classifier (Vapnik and Chervonenkis 1974; Vapnik 1995), the hyperplane maximizes the classification margin, that is, the minimum distance of the hyperplane from the training points (Vapnik and Chervonenkis 1974), as shown in Fig. 1. Choosing such an optimal hyperplane requires solving a quadratic optimization problem subject to linear constraints, corresponding to the fact that each point of the training set must be correctly labeled. The hyperplane resulting from this optimization depends only on a subset of the training points, called support vectors. As an example, in Fig. 1 the three points closest to the classification hyperplane are highlighted, as they represent the support vectors.

Fig. 1

Hyperplane, margin and support vectors in linearly separable set

Thus, the system admits a solution only if there is a hyperplane separating the two classes in the training set (as in Fig. 1), i.e., when the training set is linearly separable. However, this requirement is too restrictive to be of practical interest. Thus, Cortes and Vapnik (1995) defined a modified version of the approach, introducing a parameter C that allows (but penalizes) misclassifications in the training set, thus obtaining soft margin SVMs. The choice of the best value for C is crucial for performance, as it determines the trade-off between classification errors in the training set and model complexity (Hofmann et al. 2008; Moser et al. 2007).

When the SV approach is applied to a regression problem, a function has to be derived that minimizes the deviation between observed and predicted values. To solve this problem, we apply an SV approach that, rather than minimizing a function of the errors on the training set, aims at minimizing a bound on the generalization error, which takes into account a regularization term in addition to the training error. Thus, the goal is to find a linear function that achieves an error lower than a constant ε on the training data and that at the same time is as flat as possible. This formulation of the problem can be softened, as discussed above, by using the parameter C, so that an error larger than the bound can be allowed on some of the points in the training set. Therefore, the parameter C determines the trade-off between the occurrence of errors larger than ε in the training set and the flatness of the function. On the other hand, ε controls the width of a tube such that points falling inside it are considered correct and only points outside the tube are counted as errors (see Fig. 2). The two parameters are therefore strictly correlated, even if their suitable values depend on the dataset (Cherkassky and Ma 2004), so no rule of thumb exists.

Fig. 2

ε-tube in SVR
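
To make the roles of C and ε more concrete, the following minimal sketch (ours, not part of the original formulation) computes the ε-insensitive loss underlying SVR: deviations inside the ε-tube cost nothing, while larger deviations are penalized only for the part exceeding ε; the C parameter then weights the total of these losses against model flatness.

```python
import numpy as np

def epsilon_insensitive_loss(actual, predicted, epsilon):
    """Deviations within the epsilon-tube cost nothing; larger deviations are
    penalized linearly by the amount exceeding epsilon."""
    residuals = np.abs(np.asarray(actual) - np.asarray(predicted))
    return np.maximum(residuals - epsilon, 0.0)

# Toy illustration: with epsilon = 1.0, only the third point falls outside the tube.
print(epsilon_insensitive_loss([10.0, 12.0, 20.0], [10.5, 11.2, 15.0], epsilon=1.0))
# -> [0. 0. 4.]
```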

2.1.1 The Non-Linear Case and the Kernel Choice

The SV approaches described above are conceived for the linear case. Thus, they may not be suitable for development effort estimation, where the dependent variable (i.e., effort) does not necessarily depend linearly on the independent variables (i.e., cost drivers). To deal with the nonlinear case, we can map the input vectors into a feature space before the linear SV approach is applied.

Mathematically, such a mapping requires replacing the dot products between the input vector x and each support vector s with a function describing their similarity in the feature space: this function k(x, s) of the two variables x and s is called a kernel function.

A wide variety of kernel functions has been proposed in the literature: a good overview can be found in (Hofmann et al. 2008). An important kernel family is the Radial Basis Function (RBF) family, where the output value depends only on the distance between the two points in the input space. In particular, the most popular kernel belonging to the RBF family is the Gaussian one, which is defined as follows:

$$ k(x, s) = \exp\left( -\gamma \left\| x - s \right\|^{2} \right), \quad \text{with } \gamma > 0. $$
(1)

The Gaussian RBF kernel has been successfully applied in a variety of contexts, both alone (e.g., Moser et al. 2007; Shin and Goel 2000) and in combination with SV approaches (e.g., Corazza et al. 2009, 2011; Schölkopf et al. 1997). Furthermore, the Gaussian RBF kernel is usually suggested as the first choice in many practical guides (e.g., Hsu et al. 2010) and is implemented in LibSVM, a popular library for SV approaches (Chang and Lin 2001). All the above considerations motivated our choice to use this kernel in the study reported in the present paper.

Using the Gaussian RBF kernel, a value for the kernel parameter γ needs to be selected in addition to the values for the C and ε parameters. The main issue is how to set these parameters so as to ensure good generalization performance for a given dataset. In the following, we report some existing approaches to address this problem and then describe our proposal.
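
As a concrete illustration of Eq. (1), the short sketch below (ours) computes the Gaussian RBF kernel between an input vector and a support vector for a few values of γ, showing how γ controls the rate at which similarity decays with distance.

```python
import numpy as np

def gaussian_rbf_kernel(x, s, gamma):
    """Gaussian RBF kernel of Eq. (1): k(x, s) = exp(-gamma * ||x - s||^2)."""
    diff = np.asarray(x, dtype=float) - np.asarray(s, dtype=float)
    return np.exp(-gamma * np.dot(diff, diff))

x, s = [1.0, 2.0], [1.5, 1.0]
for gamma in (0.1, 1.0, 10.0):
    print(gamma, gaussian_rbf_kernel(x, s, gamma))
# Larger gamma values make the similarity drop faster, so only very close
# support vectors influence the prediction.
```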

2.1.2 SVR Parameter Setting

Many alternative strategies have been defined in the literature to select suitable values for SVR parameters. As pointed out in (Cherkassky and Ma 2004), many studies related to the use of SVR are based on the opinion of experts who select parameter values on the basis of their knowledge of both the approach and the application domain. Of course, reliance upon experts severely limits the applicability of this approach. Another possibility is the use of heuristics based on noise characteristics (Kwok and Tsang 2003). However, in addition to some technical limitations of these approaches, they require either an expert with a deep understanding of the problem or a statistical model for the noise. Parameter choices based on more direct information, such as the range of output values, are prone to other problems, including sensitivity to outliers (Mattera and Haykin 1999).

In Grid-search approaches, a certain number of parameter values are explored to identify the best option. Nevertheless, the points are chosen a priori and do not depend on the specific case. For instance, the software library LibSVM provides a mechanism that explores a combination of 8 values for each of the parameters C, ε, and γ (in the ranges [1.0E-3, 32000], [1.0E-6, 1], and [1.0E-6, 8]) using a five-fold cross-validation on the training set (Chang and Lin 2001). Thus, a total of 512 fixed points are assessed and the one with the best cross-validation accuracy is returned. Even though Grid-search is easy to apply, it has a main drawback: the search is always performed on the same (coarse-grained) points, without using the dataset to guide the search.
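
The following sketch illustrates a fixed-grid search of this kind using scikit-learn's SVR rather than LibSVM directly; the grid values and the synthetic data are ours and only illustrative, not LibSVM's exact defaults.

```python
import numpy as np
from sklearn.svm import SVR
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(40, 3))            # stand-in for project features
y = 2.0 * X[:, 0] + rng.normal(0, 1, size=40)   # stand-in for effort values

param_grid = {
    "C":       np.logspace(-3, 4, 8),    # 8 fixed values per parameter,
    "epsilon": np.logspace(-6, 0, 8),    # so 8*8*8 = 512 points in total
    "gamma":   np.logspace(-6, 1, 8),
}
search = GridSearchCV(SVR(kernel="rbf"), param_grid, cv=5,
                      scoring="neg_mean_absolute_error")
search.fit(X, y)
print(search.best_params_)   # the fixed grid point with the best cross-validation accuracy
```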

In (Corazza et al. 2009, 2011) the problem was addressed in the context of effort estimation, adopting an automatic approach to explore a large number of parameter values (employing various nested cycles with small incremental steps). For each run, depending on the kernel, the number of executions ranged from a few dozen to more than 4000. An inner leave-one-out cross validation was performed on the training set (each cycle of execution required a number of iterations corresponding to the cardinality of the training set) and for each iteration the goodness of the solution was evaluated using a combination of effort estimation accuracy measures.Footnote 1 Thus, the setting providing the best estimates (according to the selected criterion) on the training set was chosen.

Although such an optimization strategy covered quite a large combination of parameter values, it proceeded by brute force, in predefined steps, and did not use any information from prior steps to improve the search. Moreover, it was computationally too expensive. Smarter optimization strategies, on the contrary, use all possible clues to focus the search on the most promising areas of parameter values for a given dataset. Among such strategies, in (Corazza et al. 2010) we proposed the use of the Tabu Search meta-heuristic to search for the best parameter settings. This approach is further investigated in this paper and will be described in the next section. One of the strengths of the Tabu Search strategy is that it uses information both in a positive way, to focus the search, and in a negative way, to avoid already explored areas and loops.

2.2 Tabu Search

Tabu Search (TS) is a meta-heuristic search algorithm that can be used for solving optimization problems. The method was originally proposed by Glover to overcome some limitations of Local Search (LS) heuristics (Glover and Laguna 1997). Indeed, while a classical LS heuristic at each iteration constructs from the current solution i a next solution j and checks whether j is worse than i to decide whether the search should stop, a TS optimization step consists in creating from the current solution i a set of solutions N(i) (also called neighboring solutions) and selecting the best available one to continue the search. In particular, TS usually starts with a random solution and applies local transformations (i.e., moves) to the current solution i to create N(i). When no improving neighboring solution exists, TS allows for a climbing move, i.e., a temporary worsening move can be performed. The search terminates when a stopping condition is met (e.g., a maximum number of iterations is reached). To determine whether a solution is worse (or better) than another, an objective function is employed. In order to prevent loops and to guide the search away from already visited portions of the search space, some moves can be classified as tabu, which means that they are forbidden. The tabu moves can be stored in a list, named Tabu List, of fixed or variable length, following a short-term (i.e., moves leading to already visited solutions are stored) or a long-term memory strategy (i.e., moves that have been performed several times are stored). Since tabu moves may sometimes prohibit attractive solutions or lead to an overall stagnation of the search process (Glover and Laguna 1997), so-called aspiration criteria can be used to revoke the tabu status of a move. A common aspiration criterion allows a tabu move if it results in a solution with an objective value better than that of the current solution.

To summarize, TS starts from a random solution and at each iteration explores a search space consisting of a set of moves. Such moves are often local transformations of the current solution and depend on the problem to be solved. Among these moves, the one that provides the best objective value and is not tabu, or that matches an aspiration criterion, is selected to continue the search.
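
To make this scheme concrete, the following problem-agnostic sketch (our own simplification, not the authors' implementation) shows the core loop: build the neighborhood of the current solution, discard tabu neighbors unless they satisfy the aspiration criterion, accept the best admissible one even if it is worse than the current solution, and keep track of the best solution found overall.

```python
def tabu_search(initial_solution, make_neighbors, objective,
                is_tabu, record_tabu, max_iterations=100):
    """Problem-agnostic Tabu Search skeleton (minimization)."""
    current = initial_solution
    best, best_value = current, objective(current)
    for _ in range(max_iterations):
        neighbors = make_neighbors(current)
        # Keep non-tabu neighbors, plus tabu ones that beat the best so far
        # (the aspiration criterion).
        admissible = [n for n in neighbors
                      if not is_tabu(n) or objective(n) < best_value]
        if not admissible:
            continue
        current = min(admissible, key=objective)   # may be a worsening (climbing) move
        record_tabu(current)                       # remember it to avoid cycling back
        value = objective(current)
        if value < best_value:
            best, best_value = current, value
    return best, best_value
```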

Thus, to tailor the TS meta-heuristic to a given problem, we have to make the following choices:

  • define a representation of possible solutions and the way for generating the initial one;

  • define local transformations (i.e., moves) to be applied to the current solution for exploring the neighboring solutions;

  • choose a means to evaluate the neighborhood (i.e., an objective function), thus guiding the search in a suitable way;

  • define the Tabu list, the aspiration criteria, and the termination criteria.

In the next section we describe how we designed TS for setting SVR parameters, thus specifying the above choices.

2.3 Using Tabu Search to Configure SVR

Let us formulate our goal: starting from a dataset of past projects we have to identify a good solution S, represented by values for variables C, ε, and γ (see step 1 in Fig. 3), so that SVR configured with those parameter values can accurately predict the unknown effort for new incoming projects (see step 2 in Fig. 3). Thus, in this section we will detail step 1, whose process is illustrated in Fig. 4.

Fig. 3

The two steps of applying SVR + TS: parameters identification and use

Fig. 4

The proposed TS-based approach for SVR parameters selection

An initial solution is generated by randomly choosing the value of each variable in a defined range. In particular, since the values of C, ε, and γ can vary from zero to infinity, an upper bound usually has to be chosen. To this end, we employ the same ranges used by the Grid-search algorithm (Hsu et al. 2010) for C, ε, and γ, and, as is usual, we perform the search for parameter values in the logarithmic space of these ranges (Hsu et al. 2010; Keerthi and Lin 2003).
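
As an illustration, the sketch below (ours) draws such a random initial solution in the logarithmic space of the parameter ranges reported for the Grid-search in Section 2.1.2; the representation of a solution as a simple dictionary is our own choice.

```python
import math
import random

# Parameter ranges taken from the Grid-search discussed in Section 2.1.2.
RANGES = {"C": (1.0e-3, 32000.0), "epsilon": (1.0e-6, 1.0), "gamma": (1.0e-6, 8.0)}

def random_solution(rng=random):
    """Sample each parameter uniformly in log space and map back to its range."""
    return {name: 10 ** rng.uniform(math.log10(lo), math.log10(hi))
            for name, (lo, hi) in RANGES.items()}

print(random_solution())   # e.g. {'C': 4.7, 'epsilon': 2.1e-05, 'gamma': 0.003}
```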

Starting from the random initial solution, at each iteration 25 moves are performed, each one according to the pseudocode provided in Fig. 5 and explained herein. The parameter to be changed is selected among C, ε, and γ with equal probability. In 80% of the cases, the current value of the chosen parameter is changed by up to 20% of its value, by adding (or subtracting, with the same probability) a random amount, while in the remaining 20% of the cases the new parameter value is chosen in a totally random fashion within the specified range.

Fig. 5

The TS move

The rationale for the 80% threshold is to exploit a promising solution as much as possible. Indeed, once a “better” region of the space has been identified, a finer search of that region is conducted by performing small changes around a potentially interesting solution (Fig. 5, line 6). On the other hand, we also defined a mechanism allowing for diversification in the search space (obtained through the totally random move) to escape from local optima (Fig. 5, line 8).
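
A minimal sketch of this move is given below; it repeats the RANGES and solution representation of the previous sketch for self-containment and is not a transcription of the pseudocode in Fig. 5.

```python
import math
import random

RANGES = {"C": (1.0e-3, 32000.0), "epsilon": (1.0e-6, 1.0), "gamma": (1.0e-6, 8.0)}

def move(solution, rng=random):
    """Perturb one parameter: small change (80% of the cases, intensification)
    or a completely random re-draw within its range (20%, diversification)."""
    neighbor = dict(solution)
    name = rng.choice(list(RANGES))          # C, epsilon, or gamma, with equal probability
    lo, hi = RANGES[name]
    if rng.random() < 0.8:
        delta = rng.uniform(0.0, 0.2) * neighbor[name]
        if rng.random() < 0.5:
            delta = -delta
        neighbor[name] = min(max(neighbor[name] + delta, lo), hi)
    else:
        neighbor[name] = 10 ** rng.uniform(math.log10(lo), math.log10(hi))
    return neighbor

# At each iteration, 25 such moves produce the neighborhood of the current solution.
```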

Once all moves are performed, a set of 25 new neighboring solutions is created; the neighboring solution with the best objective function value that is not tabu, or that matches an aspiration criterion, is selected as the current solution and becomes the starting point for exploring a new neighborhood in the next iteration. It is worth noting that a move is marked as tabu if it leads to a solution whose parameter values are very similar (i.e., the difference between parameter values is less than 10%) to those of a solution stored in the Tabu List. To allow tabu moves to be revoked, we employ the most commonly used aspiration criterion, namely we permit a tabu move if it results in a solution whose objective function value is better than that of the best solution reached so far.

Moreover, if the current best solution’s objective value is better than the one achieved by the best solution found so far, the latter is replaced. Finally, to avoid retracing the moves previously used, the current solution is stored in the Tabu List. Note that since only a fairly limited quantity of information is usually recorded in the Tabu List (Glover and Laguna 1997), we decided to employ a short-term memory of fixed length with 7 elements.
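
The sketch below gives one possible encoding (ours, with illustrative names) of this tabu check, the aspiration criterion, and the fixed-length short-term memory of 7 elements.

```python
from collections import deque

tabu_list = deque(maxlen=7)   # short-term memory: only the 7 most recent solutions are kept

def is_similar(a, b, tolerance=0.10):
    """Two solutions are 'very similar' if every parameter differs by less than 10%."""
    return all(abs(a[k] - b[k]) < tolerance * abs(b[k]) for k in b)

def is_tabu(candidate):
    return any(is_similar(candidate, stored) for stored in tabu_list)

def select_next(neighbors, objective, best_value):
    """Best neighbor that is not tabu, or that satisfies the aspiration criterion."""
    admissible = [n for n in neighbors
                  if not is_tabu(n) or objective(n) < best_value]
    return min(admissible, key=objective) if admissible else None

# After the selected solution is accepted, it is appended to tabu_list; the
# deque automatically discards the oldest entry once 7 solutions are stored.
```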

The search is stopped after a fixed number of iterations is performed (i.e., 100).

It is worth noting that we adopted the same choices for the number of moves, Tabu List size, and iterations employed in our previous study (Corazza et al. 2010). Those numbers were determined empirically, as is usual when no guidelines are available. In particular, they were retained for the work presented herein because our previous research showed that increasing them did not improve the estimation accuracy while wasting computation time.

As for the objective function, a number of accuracy measures can be used to compare effort estimates, usually based on the residuals, i.e., the differences between predicted and actual efforts. Among them, two widely used summary measures are the Mean Magnitude of Relative Error (MMRE) (Conte et al. 1986) and the Mean Magnitude of Relative Error relative to the Estimate (MEMRE) (Kitchenham et al. 2001). Let us recall that MMRE is the Mean of MRE and MEMRE is the Mean of EMRE, where:

$$ MRE = \frac{\left| e - \hat{e} \right|}{e} $$
(2)
$$ EMRE = \frac{\left| e - \hat{e} \right|}{\hat{e}} $$
(3)

where e represents the actual effort and ê the estimated effort. We can observe that EMRE has the same form as MRE, but the denominator is the estimate, thus giving a stronger penalty to under-estimates. In (Corazza et al. 2009, 2010, 2011) we employed as objective function the mean of the two:

$$ \text{Objective Function} = \left( \text{MMRE} + \text{MEMRE} \right) / 2 $$
(4)

The rationale was that, since MRE is more sensitive to overestimates and EMRE to underestimates, an objective function minimizing both should find better solutions. Since the present paper provides a further assessment of the technique proposed in (Corazza et al. 2010), we exploited the same objective function.
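
A minimal sketch of the objective function of Eq. (4), computed from paired actual and estimated efforts (for instance those produced by a cross validation on the training set), is given below.

```python
import numpy as np

def objective(actual, estimated):
    """(MMRE + MEMRE) / 2, as in Eq. (4)."""
    actual = np.asarray(actual, dtype=float)
    estimated = np.asarray(estimated, dtype=float)
    abs_residuals = np.abs(actual - estimated)
    mmre = np.mean(abs_residuals / actual)       # MRE penalizes over-estimates more
    memre = np.mean(abs_residuals / estimated)   # EMRE penalizes under-estimates more
    return (mmre + memre) / 2.0

print(objective([100.0, 250.0, 40.0], [120.0, 200.0, 35.0]))   # -> about 0.18
```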

It is worth noting that the solution we are proposing attempts to capture the necessary domain knowledge by using performance indicators as the objective function. On the other hand, it requires a meta-heuristic that is as robust as possible with respect to the characteristics of the target function, which are completely unexplored. We think that the TS strategy has this robustness because of its capability to adapt to the input function, both by concentrating search efforts on promising areas and by keeping away from already visited regions by means of the Tabu List.

Finally, in order to cope with the non-deterministic nature of TS, we performed 10 executions of SVR + TS and, among the obtained configurations, we retained as final the one whose objective value was closest to the mean of the objective values obtained over the 10 executions.
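
A sketch of this rule is shown below; run_svr_plus_ts is a hypothetical helper standing for one complete TS execution that returns the selected configuration together with its objective value.

```python
import numpy as np

def select_final_configuration(run_svr_plus_ts, runs=10):
    """Keep the configuration whose objective value is closest to the mean of
    the objective values obtained over the executions."""
    results = [run_svr_plus_ts() for _ in range(runs)]   # [(configuration, objective_value), ...]
    values = np.array([value for _, value in results])
    closest = int(np.argmin(np.abs(values - values.mean())))
    return results[closest][0]
```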

3 Empirical Study Design

In this section, we present the design of the empirical study carried out to assess the effectiveness of the proposed approach. In particular, we present the employed datasets, the null hypotheses, the adopted validation method, and evaluation criteria. The results of the empirical analysis are discussed in Section 4.

3.1 Datasets

To carry out the empirical evaluation of the proposed technique, we employed a total of 21 industrial software project datasets selected from the PROMISE repository (PROMISE 2011) and the Tukutuku database (Mendes et al. 2005a). PROMISE contains publicly available single- and cross-company datasets, while the Tukutuku database contains data about Web projects (i.e., Web hypermedia systems and Web applications) developed in different companies and gathered by the Tukutuku project, which aimed to develop Web cost estimation models and to benchmark productivity across and within Web companies.

Concerning the PROMISE repository, it is worth noting that we did not employ all the datasets it contains, since we were interested only in the ones that can be employed for early effort estimation (i.e., datasets containing information that would be available at the early stages of a software development process), which is the managerial goal of our investigation. To this end, we avoided the use of datasets, such as NASA and COCOMO, containing as size measures only features available once a project is completed, such as Lines of Code (LOC). Moreover, we pruned this kind of feature from the remaining datasets, since their use could bias the results (Shepperd and Schofield 1997). As for the categorical variables contained in some datasets, we used them as done in (Kocaguneli et al. 2010; Shepperd and Schofield 1997) to obtain different, more homogeneous splits from the original datasets, or we excluded them from our analysis when splitting was not possible (e.g., the resulting sub-datasets were too small). As an example, we used the categorical variable “Languages” in the Desharnais dataset to split the original data into three different datasets corresponding to Languages 1, 2, and 3, respectively. After applying the above criteria, 13 PROMISE datasets were kept for our empirical analysis, namely Albrecht, China, Desharnais1, Desharnais2, Desharnais3, Finnish, Kemerer, MaxwellA2, MaxwellA3, MaxwellS2, MaxwellT1, Miyazaki, and Telecom. We applied the same procedure to the Tukutuku database, obtaining 8 splits since all the categorical variables (i.e., TypeProj, DocPro, ProImpr, and Metrics) were binary.

Table 1 summarizes the main characteristics of the considered datasets, while further details, together with the descriptive statistics of the involved features, are provided in Appendix A. They represent an interesting sample of software projects, since they contain data about projects that are Web-based (i.e., the ones from Tukutuku) and not Web-based (i.e., the ones from PROMISE) and include datasets that were collected from a single software company or from several companies. Moreover, all the datasets contain data about industrial projects, representing a diversity of application domains and projects’ characteristics. In particular, they all differ in relation to:

Table 1 Summary of the employed datasets
  • geographical locations: software projects coming from Canada, China, Finland, Japan, New Zealand, Italy, United States, etc.;

  • number of involved companies;

  • number of observations: from 10 to 499;

  • number and type of features: from 1 to 27 features, including a variety of features describing the software and Web projects, such as the number of entities in the data model, the number of basic logical transactions, the number of developers involved in the project and their experience, and the number of Web pages or images;

  • technical characteristics: software projects developed in different programming languages and for different application domains, ranging from telecommunications to commercial information systems.

Nevertheless, note that none of these datasets are random samples of software and Web projects. Therefore the information provided in Appendix A can be useful for readers to assess whether the results we gathered can scale up to their own contexts.

To prevent large differences in the ranges of the features’ values from giving greater importance to some characteristics than to others, a data preprocessing step should be applied when using SVR (Chang and Lin 2001; Smola and Schölkopf 2004). In our previous studies (Corazza et al. 2009, 2011), we experimented with different preprocessing strategies, such as normalization and logarithmic transformation. The latter is a typical approach in the field of effort estimation (Briand et al. 2000; Costagliola et al. 2006; Di Martino et al. 2007; Kitchenham and Mendes 2004), since it reduces ranges and at the same time mitigates the linearity issue. It provided the best results in (Corazza et al. 2009, 2011); thus, we adopted it in (Corazza et al. 2010) and in the present paper. Moreover, we removed from the employed datasets the observations with missing values (see Appendix A).
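
A minimal sketch of this preprocessing step is shown below, assuming the observations are held in a pandas DataFrame with numeric columns only; the +1 shift before the logarithm is our own precaution against zero values and may not match the exact transformation used in the original studies.

```python
import numpy as np
import pandas as pd

def preprocess(observations: pd.DataFrame) -> pd.DataFrame:
    """Drop observations with missing values, then log-transform the variables
    to reduce their ranges (the +1 shift guards against log(0))."""
    complete = observations.dropna()
    return np.log(complete + 1.0)
```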

3.2 Null Hypotheses

To address the first research question (i.e., assessing the effectiveness of TS for configuring SVR), we first verified the benefits of using a search-based approach like TS to configure SVR against a simpler approach considering random configurations (SVRrand in the following). In this case, to be fair, the same number of solutions has to be generated and compared with those achieved with the meta-heuristic search approach. Thus, we randomly generated 25 × 100 SVR configurations ten times (within the same ranges defined for TS in Section 2.3) and the best of these was selected based on the same criteria employed for SVR + TS, but without guiding the search in any way. Moreover, we also considered the use of the default configuration (i.e., C = 1, ε = 0.001, γ = 0) provided by the Weka tool (Hall et al. 2009) (SVRweka in the following) and the Grid-search algorithm provided by LibSVM (Chang and Lin 2001) (SVRgrid in the following).

As a consequence, the following null hypotheses were formulated:

  1. Hn0:

    SVR + TS does not provide significantly better estimates than SVRrand;

  2. Hn1:

    SVR + TS does not provide significantly better estimates than SVRweka;

  3. Hn2:

    SVR + TS does not provide significantly better estimates than SVRgrid;

which contrast with the following alternative hypotheses:

  1. Ha0:

    SVR + TS provides significantly better estimates than SVRrand;

  2. Ha1:

    SVR + TS provides significantly better estimates than SVRweka;

  3. Ha2:

    SVR + TS provides significantly better estimates than SVRgrid.

With regard to the second research question, we assessed whether the estimates obtained with SVR + TS were better than those obtained using Manual StepWise Regression (MSWR) (Kitchenham and Mendes 2004; Mendes and Kitchenham 2004) and Case-Based Reasoning (CBR) (Shepperd and Kadoda 2001), two techniques widely used in the literature and in industry (probably the most frequently employed estimation methods).

MSWR is a statistical technique whereby a prediction model (an equation) is built to represent the relationship between the independent variables (e.g., number of Web pages) and the dependent variable (e.g., total effort). This technique builds the model by adding, at each stage, the independent variable with the highest association to the dependent variable, taking into account all variables currently in the model. It aims to find the set of independent variables (predictors) that best explains the variation in the dependent variable (response).

Within the context of our investigation, the idea behind the use of CBR is to predict the effort of a new project by considering similar projects previously developed. In particular, the completed projects are characterized in terms of a set of p features (i.e., variables) and form the case base (Shepperd and Kadoda 2001). The new project is also characterized in terms of the same p features and is referred to as the target case. Then, the similarity between the target case and the other cases in the p-dimensional feature space is measured, and the most similar cases are used, possibly with adaptations, to obtain a prediction for the target case. In our empirical study we employed CBR in two ways:

  1. i)

    by considering only the independent variables that are statistically correlated to the dependent variable (CBRfss in the following), and

  2. ii)

    without applying any kind of selection of the variables (CBR in the following).

The key aspects of MSWR and CBR are detailed in Appendix B and C, respectively.

In addition, we also assessed whether the estimates obtained with SVR + TS were significantly better than those obtained using the mean effort (MeanEffort in the following) and the median effort (MedianEffort in the following). This was done because, as suggested by Mendes and Kitchenham (2004), if an estimation technique does not outperform the results achieved using MeanEffort and MedianEffort, it cannot be transferred to industry, since there would be no value in dealing with the complex computations of estimation methods when a company could simply use the mean or median effort of its own past projects as the estimate.

Thus, we formulated the following null hypotheses:

  1. Hn3:

    SVR + TS does not provide significantly better estimates than MSWR;

  2. Hn4:

    SVR + TS does not provide significantly better estimates than CBRfss;

  3. Hn5:

    SVR + TS does not provide significantly better estimates than CBR;

  4. Hn6:

    SVR + TS does not provide significantly better estimates than MeanEffort;

  5. Hn7:

    SVR + TS does not provide significantly better estimates than MedianEffort;

which contrast with the following alternative hypotheses:

  1. Ha3:

    SVR + TS provides significantly better estimates than MSWR;

  2. Ha4:

    SVR + TS provides significantly better estimates than CBRfss;

  3. Ha5:

    SVR + TS provides significantly better estimates than CBR;

  4. Ha6:

    SVR + TS provides significantly better estimates than MeanEffort;

  5. Ha7:

    SVR + TS provides significantly better estimates than MedianEffort.

3.3 Validation Method

To assess the effectiveness of the effort predictions obtained using the techniques employed herein, we exploited a multiple-fold cross validation, partitioning each original dataset into training sets, for model building, and test sets, for model evaluation. This is done to avoid optimistic predictions (Briand and Wieczorek 2002). Indeed, cross validation is widely used in the literature to validate effort estimation models when dealing with medium/small datasets (e.g., Briand et al. 2000). When applying the multiple-fold cross validation, we decided to use leave-one-out cross validation on the datasets that have fewer than 60 observations (i.e., Albrecht, Desharnais1, Desharnais2, Desharnais3, Finnish, Kemerer, Miyazaki, and Telecom). In those cases, the original datasets of N observations were divided into N different subsets of training and validation sets, where each validation set had one project. On the other hand, we decided to partition the datasets having more than 60 observations (i.e., China and the 8 splits obtained from the Tukutuku database) into k = 10 randomly selected test sets, and then for each test set to consider the remaining observations as the training set to build the estimation model. This choice was made to find a trade-off between computational cost and effectiveness of the validation. The 10 folds for the China dataset are given in Appendix E (Table 10).Footnote 2
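
The sketch below (ours, using scikit-learn splitters purely for illustration) encodes the rule just described: leave-one-out for the smaller datasets, 10 randomly drawn folds for the larger ones.

```python
from sklearn.model_selection import KFold, LeaveOneOut

def make_splitter(n_observations):
    """Leave-one-out below 60 observations, 10 random folds otherwise."""
    if n_observations < 60:
        return LeaveOneOut()
    return KFold(n_splits=10, shuffle=True, random_state=0)

# Usage: for train_idx, test_idx in make_splitter(len(X)).split(X): ...
```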

3.4 Evaluation Criteria

Several accuracy measures have been proposed in the literature to assess and compare the estimates achieved with effort estimation methods (Conte et al. 1986; Kitchenham et al. 2001), e.g., Mean of MRE, Median of MRE, Mean of EMRE, Median of EMRE, and Pred(25) (i.e., Prediction at level 25%). Considering that all the above measures are based on the absolute residuals (i.e., the absolute values of the differences between predicted and actual efforts), in our empirical analysis we decided to compare the employed estimation techniques in terms of the Median of Absolute Residuals (MdAR), a cumulative measure widely employed, as is the Mean of Absolute Residuals (MAR). We chose to employ MdAR since it is less sensitive to extreme values than MAR (Mendes et al. 2003b). The use of a single summary measure was motivated by the aim of improving the readability of the comparison of the analyzed effort estimation methods (which would otherwise be confused by the fact that some measures have to be minimized and others maximized). Moreover, to make the comparison more reliable, we complemented this summary measure with a statistical test. Indeed, to verify whether the differences observed using the above measure were legitimate or due to chance, we checked whether the absolute residuals obtained with the application of the various estimation techniques came from the same population. If they do, it means that there are no significant differences between the data values being compared. We carried out this check using a nonparametric statistical significance test (Kitchenham et al. 2001), namely the Wilcoxon Signed Rank test, with α = 0.05. We decided to use the Wilcoxon test since it is resilient to strong departures from the t-test assumptions (Conover 1998).
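
The evaluation just described can be sketched as follows (ours); scipy is assumed, and the one-sided alternative reflects the question of whether one technique's absolute residuals are smaller than another's.

```python
import numpy as np
from scipy.stats import wilcoxon

def mdar(actual, predicted):
    """Median of Absolute Residuals."""
    return float(np.median(np.abs(np.asarray(actual) - np.asarray(predicted))))

def significantly_smaller(abs_residuals_a, abs_residuals_b, alpha=0.05):
    """True if technique A's absolute residuals are significantly smaller than B's
    according to the Wilcoxon Signed Rank test."""
    _, p_value = wilcoxon(abs_residuals_a, abs_residuals_b, alternative="less")
    return p_value < alpha
```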

4 Results and Discussion

Table 2 reports the Median of the Absolute Residuals (MdAR) obtained with each technique for all the employed datasets. Let us recall that the results of SVR + TS reported herein were obtained by applying to each test set the final configuration provided by TS, namely the one whose objective value was closest to the mean of the objective values obtained in the 10 executions performed on the training set. An assessment of the variation of the objective values can be found in Appendix D.

Table 2 Accuracy in terms of MdAR

Notice that for CBR we used 1, 2, and 3 analogies and, due to space constraints, only the best results are reported herein. The number of analogies used to obtain each of these best results is specified in Table 2. The details about the application of MSWR and CBR are reported in Appendix B and C, respectively.

In order to provide better readability, all the best results (i.e., the minimum MdAR values) obtained for each dataset across the employed techniques are reported in bold (see Table 2).

Table 2 shows that SVR + TS provided the best MdAR values for all the datasets, except for NewProjects, where CBR provided a slightly better result.

To quantify how much better the results provided by SVR + TS were than those of the other employed techniques, for each dataset we calculated the ratio BestSVR/SVR + TS (AvgSVR/SVR + TS and WorstSVR/SVR + TS, respectively) between the best (the mean and the worst, respectively) MdAR provided by the other SVR-based approaches and the MdAR of SVR + TS. Similarly, we also provide the same ratios (named BestBench/SVR + TS, AvgBench/SVR + TS, and WorstBench/SVR + TS) with respect to the other estimation techniques used as benchmarks. These results are reported in Table 3, together with the median values of these ratios over all the datasets.

Table 3 A comparison between SVR + TS and the other techniques

Thus, we can observe that with respect to the other SVR techniques:

  • the error (i.e., MdAR) made using the other SVR technique providing the best estimates is on median about one and a half times (i.e., 1.48) the error made employing SVR + TS;

  • the mean of the errors made using the other SVR techniques is on median about twice (i.e., 1.75) the error made employing SVR + TS;

  • the error made using the other SVR technique providing the worst result is on median about twice (i.e., 2.06) the error made employing SVR + TS.

As for the comparison with the other estimation techniques used as benchmarks (i.e., MSWR, CBR, MeanEffort, and MedianEffort), the results in Table 3 suggest that:

  • the error made using the technique providing the best estimates is on median more than one and a half times (i.e., 1.65) the error made employing SVR + TS;

  • the mean of the errors made using the other techniques is on median about four times (i.e., 3.99) the error made employing SVR + TS;

  • the error made using the technique providing the worst result is on median about nine times (i.e., 8.93) the error made employing SVR + TS.

In order to verify whether the differences observed using MdAR values were legitimate or due to chance, we employed the Wilcoxon test (α = 0.05) to assess if the absolute residuals from all the techniques used came from the same population. The results are reported in Table 4 where “Yes” in a cell means that SVR + TS is significantly superior to the technique indicated on the column (i.e., it means that the absolute residuals achieved with SVR + TS are significantly less than the ones obtained with the technique indicated on the column).

Table 4 Comparison of the absolute residuals using Wilcoxon test (p-values are reported between brackets) for PROMISE and Tukutuku datasets

These results allowed us to state that the predictions obtained with SVR + TS were significantly superior to those obtained with SVRrand, SVRweka, SVRgrid, MSWR, CBR (with and without feature selection), MedianEffort, and MeanEffort for all PROMISE and Tukutuku datasets, except for a few cases (i.e., the China, EnhancementProjects, MetricNo, ProImprYes, and ProImprNo datasets with respect to the SVRgrid, SVRweka, SVRgrid, CBR, and SVRweka approaches, respectively) where no significant difference was found.

According to these results, we can reject all the null hypotheses presented in Section 3 (with a confidence of 95%), highlighting that SVR + TS provided significantly better estimates than:

  • SVRrand for all the datasets;

  • SVRweka for 19 out of 21 datasets;

  • SVRgrid for 19 out of 21 datasets;

  • MSWR for all the datasets;

  • CBR for 20 out of 21 datasets;

  • CBRfss for all the datasets;

  • MeanEffort for all the datasets;

  • MedianEffort for all the datasets.

Thus, we conclude that we can positively answer our research questions, i.e., Tabu Search was able to effectively set Support Vector Regression parameters and the effort predictions obtained by using the combination of Tabu Search and Support Vector Regression were significantly superior to the ones obtained by other techniques.

Note that these results confirm and extend those previously obtained and detailed in (Corazza et al. 2010), thus supporting the usefulness of TS for configuring SVR. Indeed, TS has allowed us to improve the accuracy of the obtained estimates with respect to the use of random configurations, the use of a default configuration, and the use of the Grid-search algorithm for parameter selection provided by LibSVM. Moreover, we want to stress that the analysis showed that SVR + TS outperformed the two techniques that are to date the most widely and successfully employed prediction techniques in Software Engineering (e.g., Briand et al. 2000; Briand and Wieczorek 2002; Costagliola et al. 2006; Kitchenham and Mendes 2004; Mendes et al. 2008; Mendes and Kitchenham 2004; Shepperd and Kadoda 2001), namely MSWR and CBR.

In addition, note that SVR + TS outperformed all the other techniques both for single- and cross-company datasets and for both Web-based and non-Web-based application datasets.

5 Validity Assessment

There are several factors that can bias the validity of empirical studies. Here we consider three types of validity threats: Construct validity, related to the agreement between a theoretical concept and a specific measuring device or procedure; Conclusion validity, related to the ability to draw statistically correct conclusions; External validity, related to the ability to generalise the achieved results. As highlighted by Kitchenham et al. (1995), to satisfy construct validity a study has “to establish correct operational measures for the concepts being studied”. Thus, the choice of the features and how they are collected represent the crucial aspects. We mitigated such a threat by evaluating the employed estimation methods on publicly available datasets from the PROMISE repository. These datasets have been previously used in many other empirical studies carried out to evaluate effort estimation methods (see the PROMISE web site).

With respect to the Tukutuku datasets, the size measures and cost drivers used in the Tukutuku database, and therefore in our study, have been obtained from the results of a survey investigation (Mendes et al. 2003a), using data from 133 on-line Web forms aimed at giving quotes on Web development projects. In addition, these measures and cost drivers have also been confirmed by an established Web company and by a second survey involving 33 Web companies in New Zealand. Consequently, it is our belief that the variables identified are measures that are meaningful to Web companies and are constructed from information their customers can provide at a very early stage in the project development. As for data quality, in order to distinguish effort guesstimates from more accurate effort data, companies were asked how their effort data was collected (see Table 5). For at least 93.8% of the Web projects in the Tukutuku database, effort values were based on more than just guesstimates.

Table 5 How effort data was collected

In relation to conclusion validity, we carefully applied the statistical tests, verifying all the required assumptions. Moreover, we used medium-sized datasets to mitigate the threats related to the number of observations composing each dataset.

As for external validity, let us observe that both the PROMISE and the Tukutuku datasets comprise data on projects volunteered by individual companies, and therefore they do not represent random samples of projects from a defined population. This means that we cannot conclude that the results of this study readily apply to companies other than the ones that volunteered the data used here. However, we believe that companies that develop projects with characteristics similar to those included in the Tukutuku and PROMISE databases may be able to apply our results to their software projects. Note that the adoption of this technique by industry may require building and calibrating the initial model prior to its use for effort estimation. This also applies to most effort estimation techniques investigated to date in the literature, and some examples of how to bridge the gap between research and practice are given in (Mendes et al. 2009).

6 Related Work

Regarding the use of SVR for software effort estimation, Oliveira (2006) was the first to apply the technique in this domain, exploiting data on 18 applications from the well-known NASA software project dataset (Bailey and Basili 1981). The author tested the linear and RBF kernels, trying for each of them three settings of the SVR parameters. The evaluation, conducted using a leave-one-out cross-validation and expressed in terms of the indicators MMRE and Pred(25), highlighted that SVR significantly outperformed both Linear Regression and Radial Basis Function Networks (RBFNs). In a subsequent study, Braga et al. (2007) proposed a machine learning-based method able to provide an effort estimate and a corresponding confidence interval. To assess the defined method, they performed a case study using the Desharnais (Desharnais 1989) and NASA (Bailey and Basili 1981) datasets. The results of this empirical analysis showed that the proposed method performed better than the one in the previous study. It is worth noting that we cannot perform a direct comparison of our results with those presented in that work, since the authors used a hold-out validation on the Desharnais dataset, obtained by randomly selecting 18 projects as training set, but did not report which projects they used. As for NASA, as stated in Section 3.1, we excluded it from our analysis since it contains only LOC as size measure.

We also previously employed SVR (Corazza et al. 2009, 2011) and SVR + TS (Corazza et al. 2010), as detailed in Sections 1 and 2.

As for the use of meta-heuristics to explore parameter settings with the aim of improving effort predictions, this is a rather new research area. Some research has been conducted on employing Genetic Algorithms (GA) to improve the estimation performance of existing estimation techniques. To the best of our knowledge, the first attempt to combine evolutionary approaches with an existing effort estimation technique was made by Shukla (2000), applying GA to a Neural Network (NN) predictor (namely, the neuro-genetic approach GANN) to improve its estimation capability. The results were significantly better than those of other techniques, such as a modified version of Regression Trees.

Li et al. (2009) proposed a combination of an evolutionary approach with CBR, aiming to exploit GA to simultaneously optimize the selection of the feature weights and projects. The case study performed employed a hold-out validation on the Desharnais, Albrecht, and two artificial datasets. The results showed that the use of GA can provide significantly better estimates. Also in this case, we cannot compare our results with those presented in that paper since the datasets were handled differently.

Furthermore, Chiu and Huang (2007) applied GA to several different analogy-based approaches using two datasets not included in the PROMISE repository. The results showed an improvement of 38% in terms of MMRE when using GA to explore an adjustment function.

As for Tabu Search, to the best of our knowledge only two case studies have been performed to assess its use for estimating software development effort. In particular, Ferrucci et al. applied TS to the Desharnais (Ferrucci et al. 2009) and Tukutuku (Ferrucci et al. 2010) datasets, obtaining interesting results that motivate further investigation of search-based methods in this field.

7 Conclusions and Future Work

In this paper, we have assessed whether Support Vector Regression configured by means of the proposed Tabu Search approach can be effective in estimating software development effort. We extended a previous empirical study (Corazza et al. 2010) in which we applied SVR + TS, using a hold-out validation, to two splits randomly obtained from the 195 applications of the Tukutuku database. The results obtained were promising and encouraged us to further verify the effectiveness of SVR + TS. In particular, in this paper we have presented the results achieved by applying SVR + TS to 13 further datasets obtained from the PROMISE repository and to 8 further datasets obtained by splitting the Tukutuku database according to the values of 4 categorical variables included in it. Thus, a total of 21 datasets (both single- and cross-company datasets, related to both Web-based and non-Web-based applications) were employed, performing a 10-fold or a leave-one-out validation depending on the size of the dataset.

Regarding the choices of SVR kernels and pre-processing strategy, we have employed the RBF kernel and a logarithmic transformation of the variables since they provided the best results in (Corazza et al. 2010).

The results of the empirical analysis have confirmed and extended those reported in (Corazza et al. 2010), highlighting the effectiveness of TS for configuring SVR. Indeed, SVR + TS provided significantly better estimates than SVR configured with simpler approaches, such as random configurations, the default configuration provided by the Weka tool, and the Grid-search algorithm provided by LibSVM. Moreover, SVR + TS allowed us to obtain significantly better effort estimates than the ones obtained using MSWR and CBR, two techniques widely employed in both academic and industrial contexts.

Many studies reported in the literature show the ability of SVR to construct accurate predictive models in different contexts (Cherkassky and Ma 2004). Nevertheless, those studies are usually based on the opinion of experts who select SVR parameter values on the basis of their knowledge of both the approach and the application domain (Cherkassky and Ma 2004). Of course, reliance upon experts severely limits the practical applicability of this approach in the software industry. The approach investigated in the present paper not only addresses the problem of finding a suitable SVR setting for effort estimation, but also allows practitioners in the software industry to use it effectively without being experts in these techniques. Indeed, although the models constructed using the datasets employed in the present paper cannot be immediately adopted by other software companies, thanks to the proposed approach project managers can automatically build their own effort estimation models starting from their historical data.

These observations, together with the results presented in this paper, suggest SVR + TS as one of the techniques suitable for software development effort estimation in the industrial world.

Several interesting investigations can be planned as future work. First of all, other objective functions could be exploited in the definition of TS and their influence on the final results could be analyzed. These functions could be based on other evaluation criteria (e.g., Pred(25)) used to compare effort estimation models, or on measures optimized by other estimation techniques (e.g., SSR, optimized by MSWR). Other aspects of TS could also be investigated, such as the use of a heuristic to choose the initial solution, comparing the results with those of the random initialization employed in the present paper.

Finally, the good results reported herein concerning the ability of TS to configure SVR encourage us to apply a similar approach to other estimation techniques, such as CBR (for example, to select features and/or other aspects, such as the number of analogies).