1 Introduction

Software cost estimation is a continuous activity that often starts at the first stage of the software life cycle and continues throughout the lifetime of the project. Since software cost estimation affects most software development activities, it has become a critical practice in software project management. The importance of accurate cost estimation has led to extensive research efforts into estimation methods in the past decades. These methods can be classified into three basic types (Angelis and Stamelos 2000): expert judgment (Jorgensen 2004, 2005, 2007), algorithmic estimation (Jun and Lee 2001; Heiat 2002; Pendharkar et al. 2005; Van Koten and Gray 2006), and analogy based estimation (Shepperd and Schofield 1997; Auer et al. 2006; Lee and Lee 2006; Chiu and Huang 2007; Li et al. 2007; Li and Ruhe 2008).

Analogy based estimation (ABE) was first proposed by Shepperd and Schofield (1997) as a valid alternative to expert judgment and algorithmic estimation. ABE is partially motivated by the obvious connection between project managers estimating from memories of past similar projects and the formal use of analogies in Case Based Reasoning (CBR) (Kolodner 1993). The fundamental principle of ABE is simple: when a new project is presented for estimation, the most similar historical projects (analogies) are retrieved, and the solutions (cost values) of the retrieved projects are used to construct a ‘retrieved solution’ for the new project, with the expectation that the cost values of the retrieved projects will be similar to the real cost of the new project.

However, adjusting the retrieved solution is necessary because the adjustment captures the differences between the new project and the retrieved projects and refines the retrieved solution into the target solution (Walkerden and Jeffery 1999). In the literature, many works (Walkerden and Jeffery 1999; Jorgensen et al. 2003; Chiu and Huang 2007; Li et al. 2007; Li and Ruhe 2008) have focused on adjustments to the retrieved solution. However, most of these adjustment mechanisms are based on predetermined linear forms without the learning ability to adapt to more complex situations such as non-normality. In addition, these adjustment techniques are limited to numeric features, even though categorical features also contain valuable information that can improve cost estimation accuracy (Angelis et al. 2000). Meanwhile, software project datasets often exhibit non-normal characteristics (Pickard et al. 2001) and contain a large proportion of categorical features (Sentas and Angelis 2006; Liu and Mintram 2005).

To improve the existing adjustment mechanisms, we propose a more flexible non-linear adjustment that has learning ability and incorporates categorical features. The Non-linearity adjusted Analogy Based Estimation (NABE) is realized by adding a non-linear component (an Artificial Neural Network, ANN) onto the retrieved solution of the ABE system. In this approach, the ordinary ABE procedure is first executed to produce an un-adjusted retrieved solution for the new project. Then, the differences between the new project’s features and its analogies’ features are fed as inputs into the ANN model to generate the non-linear adjustment. Finally, the retrieved solution and the adjustment from the ANN are summed to form the final prediction.

The rest of this paper is organized as follows: Section 2 presents the related work on adjustments for analogy based cost estimation, detailed comparisons of the existing adjustment mechanisms, and how they relate to the properties of project datasets. Section 3 describes the details of the non-linearity adjusted ABE system (NABE). Section 4 presents four real world data sets and the evaluation criteria for the experiments. Section 5 provides an illustrative example of the application procedure of NABE. In Section 6, NABE is tested on the real world datasets and compared against the linearly adjusted ABEs, ANN, CART and SWR. In Section 7, eight artificial data sets are generated and a systematic analysis is conducted to explore how model accuracy relates to dataset properties. In Section 8, the threats to validity are presented. The final section presents the conclusions and future work.

2 Related Work and Motivations

2.1 Related Work

Analogy based software cost estimation is essentially a case-based reasoning (CBR) approach (Shepperd and Schofield 1997). This approach identifies one or more historical projects that are similar to the present project and then derives the cost estimate from the similar projects. Generally, ABE consists of four parts (Fig. 1): a case/project database, a similarity function, a retrieved solution function, and the associated retrieval rules (Kolodner 1993).

Fig. 1 The general framework of analogy based estimation

Figure 1 shows that the retrieved solution function is a crucial component in ABE, since it incorporates the adjustment and produces the final prediction. The retrieved solution has the general mathematical form shown in the following formula:

$$\hat C_x = g\left( {C_1 ,C_2 , \ldots ,C_n } \right)$$
(1)

where Ĉ_x denotes the estimated cost of the new project x, C_i is the cost value of the ith closest analogy to project x, and n denotes the total number of closest analogies. The retrieved solution function (1) only includes the ‘cost’ values as its variables; other project features such as ‘lines of source code’ and ‘function points’ do not appear in this function. In the literature several retrieved solutions have been proposed, such as the un-weighted mean (Shepperd and Schofield 1997; Jorgensen et al. 2003), the weighted mean (Mendes et al. 2003), and the median (Angelis and Stamelos 2000; Mendes et al. 2003). However, these solution functions can rarely be applied directly to predict Ĉ_x. Instead, they need to be adjusted to fit the situation of the new project (Walkerden and Jeffery 1999). Therefore an adjustment mechanism should first identify the differences between the features of the new project and those of the retrieved projects, and then convert these differences into an amount of change in the cost value. In the literature, many adjustment techniques have been proposed:

Walkerden and Jeffery (1999) first proposed the linear size adjustment. This approach performs a linear extrapolation along a single metric, the number of function points (FP), which is assumed to be strongly correlated with project cost.

$$\hat C_x = \frac{{FP_x }}{{FP_{CA} }}C_{CA} $$
(2)

where Ĉ_x denotes the estimated cost of a new project x, C_CA is the cost value of the closest analogy (CA), i.e. the nearest neighbor, FP_x is the number of function points of the new project x, and FP_CA is the number of function points of the closest project. However, this adjustment technique may not be applicable when the size measure is not function points, or when there are many size measures other than function points, such as the size measures of website projects (Mendes et al. 2001). Thus, based on Walkerden and Jeffery’s work, Mendes and Mosley (2003) extended the naive form to an arbitrary number of size related features with multiple closest analogies.

$$\hat C_x = \frac{1}{K}\sum\limits_{i = 1}^K {\frac{1}{Q}\left( {\sum\limits_{p = 1}^Q {\frac{{s_{qx} }}{{s_{qi} }}C_i } } \right)} $$
(3)

where C_i is the project cost of the ith closest analogy, K is the total number of retrieved closest analogies, s_qx is the qth size related feature of the new project x, s_qi is the qth size related feature of the ith closest project, and Q is the total number of size related features. In a later study, Kirsopp et al. (2003) conducted an empirical investigation into the above two types of linear adjustment for the ABE system.

Later, Jorgensen et al. (2003) proposed the ‘Regression Toward the Mean’ (RTM) adjustment mechanism:

$$\begin{array}{*{20}l} {\hat C_x = FP_x \times \hat P_x } \hfill \\ {\hat P_x = P_{CA} + \left( {M - P_{CA} } \right) \times \left( {1 - r} \right)} \hfill \\ \end{array} $$
(4)

where \(\hat P_x \) denotes the adjusted productivity (Productivity = Cost / Function Points) of the new project x, P_CA is the productivity of the closest analogy, M is the average productivity of the similar projects, and r is the historical correlation between the non-adjusted analogy based productivity and the actual productivity, used as a measure of the expected estimation accuracy. This method can also be regarded as an extension of Walkerden and Jeffery’s model, as it adjusts the ratio P_CA = C_CA / FP_CA in (2) by adding a component (M − P_CA) × (1 − r) representing the ‘regression toward the mean’.

At a later stage, heuristic methods were applied to the adjustment techniques. Chiu and Huang (2007) applied a Genetic Algorithm (GA) to optimize a linear adjustment model:

$$\hat C_x = C_{CA} + Adj$$
(5)

where C_CA denotes the cost value of the closest analogy, \(Adj = \sum\limits_{i = 1}^m {\alpha _i \times \left( {s_{xi} - s_{CAi} } \right)} \) is the linear adjustment term, s_xi is the ith feature of the new project x, and s_CAi denotes the ith feature of the closest analogy. The Genetic Algorithm is used to optimize the coefficients α_i in this equation.

More recently, categorical features have been included in the adjustment model. Li et al. (2007) and Li and Ruhe (2008) proposed AQUA and AQUA+ for cost estimation. In their works, the following similarity adjusted solution function is proposed:

$$\hat C_x = \sum\limits_{i = 1}^K {\left[ {\frac{{Sim\left( {x,i} \right)}}{{\sum\limits_{j = 1}^K {Sim\left( {x,j} \right)} }} \cdot C_i } \right]} $$
(6)

where C_i is the project cost of the ith closest analogy, Sim(x, i) is the similarity between project x and its ith analogy, and K is the total number of closest analogies. The similarity measure Sim(x, i) can handle both numerical and categorical features. In AQUA, the similarity measure assigns equal weights to the project features to eliminate the impact of different feature weights, while AQUA+ employs the rough set approach to weight each project feature.

2.2 Motivations

Following the brief descriptions of the published adjustment techniques, we present the motivations of this study. Table 1 characterizes each adjustment method from six aspects. The first column contains the source of the adjustment. The second column is the type of function (linear / non-linear) that the adjustment is based on. The third column describes the type of project feature used in the adjustment function. The fourth column indicates whether categorical features are considered in the adjustment. The fifth column shows whether the adjustment function can learn from the training dataset to approximate a complex relationship. The last column presents the number of closest analogies (one / multiple) used in the adjustment function. The reasons for selecting these criteria are as follows. The function type reflects the basic structure of the adjustment model. The adjustment feature, categorical feature, and number of analogies together determine the inputs of the adjustment model. The learning ability indicates whether the adjustment mechanism has the flexibility to adapt to complex relationships.

Table 1 Comparison of published adjustment mechanisms

We can tell from Table 1 that most works are restricted to linear functions without learning ability, except the GA adjusted approach (Chiu and Huang 2007), and most works do not consider categorical features, except the similarity adjusted function (Li et al. 2007; Li and Ruhe 2008). To improve the adjustment mechanisms, we propose a more flexible non-linear adjustment mechanism with learning ability that incorporates categorical features.

On the other hand, three relevant dataset characteristics are considered in our study: non-normality, categorical features, and dataset size. These properties are likely to be relevant to the differences between the adjustment models. First, non-normality is a commonly reported characteristic across software engineering datasets (Pickard et al. 2001). Many existing studies (Myrtveit and Stensrud 1999; Shepperd and Kadoda 2001; Mendes et al. 2003) have considered non-normality an influential factor on the accuracy of models, including analogy based methods. Generally, a higher degree of non-normality leads to lower modeling accuracy. This property appears to relate to the function type of the adjustment models, since linear models usually work well under normal conditions while non-linear models with adaptive abilities seem to produce better results under non-normal conditions. Several application studies of ANN in other research fields show that ANN models or ANN based models are robust to non-normal datasets (Guh 2002; Chang and Ho 1999; Cannon 2007), and theoretically ANN is capable of approximating arbitrary relationships (Lawrence 1994). Therefore, it is expected that the ANN based adjustment might enhance the ABE model’s robustness to non-normality.

Given that categorical features frequently appear in software engineering datasets (Sentas and Angelis 2006; Liu and Mintram 2005) and may contain useful information that distinguishes projects (Angelis et al. 2000), many papers have started to take categorical features into consideration (Angelis et al. 2000; Sentas et al. 2005; Li et al. 2007; Li and Ruhe 2008). However, most existing adjustment techniques do not consider categorical features. NABE aims to incorporate categorical features into the adjustment mechanism to improve overall performance. Therefore, the presence of categorical features is regarded as one important dataset property in our study.

The dataset size is also an influential factor for ABE methods. The ABE system retrieves similar cases from the historical project dataset. A dataset with more projects provides a larger search space for ABE; if the data is not very heterogeneous, this may lead to a higher chance of good predictions. Several papers (Auer et al. 2006; Shepperd and Kadoda 2001; Shepperd and Schofield 1997) studied dataset size as one major factor affecting the accuracy of analogy based methods. In both Shepperd and Schofield’s paper and Auer’s paper, the authors analyzed the trends in estimation accuracy as the datasets grow, while in Shepperd and Kadoda’s work, the authors confirmed that ABE benefits from larger training sets. In addition, Shepperd and Kadoda also found that ANN can achieve better performance on large training sets. Hence, dataset size appears to be connected with the learning abilities of ANN and ABE.

As discussed above, dataset characteristics have a large impact on estimation results, and it is therefore more fruitful to identify the preferable estimation system for a particular context than to search for the ‘best’ prediction system for all cases.

3 Artificial Neural Networks for Non-Linear Adjustment

In this section, a detailed description of the non-linear adjusted analogy based estimation (NABE) is presented.

First of all, the non-linear component, the ANN, is briefly introduced. An Artificial Neural Network (ANN) is a machine learning technique that has played an important role in approximating complex relationships (Lawrence 1994). Due to its excellent approximation capability, ANN has been widely applied in software cost estimation research (Gray and MacDonell 1997; Heiat 2002; De Barcelos Tronto et al. 2007).

In the ANN architecture there are typically three layers: the input layer, the hidden layers, and the output layer. All the layers are composed of neurons. The connections between neurons across layers represent the transmission of information between neurons. ANN has the following mathematical form:

$$y = y\left( x \right) = \sum\limits_{j = 1}^J {w_j } f\left( {\sum\limits_{i = 1}^I {v_{ij} f\left( {x_i } \right) + \alpha _j } } \right) + \beta + \varepsilon $$
(7)

where x is an I-dimensional vector with {x_1, x_2, …, x_I} as its elements, f(·) is the user defined transfer function, ɛ is a random error with mean 0, J is the total number of hidden neurons, v_ij is the weight on the connection between the ith input neuron and the jth hidden neuron, α_j is the bias of the jth hidden neuron, w_j is the weight on the connection between the jth hidden neuron and the output neuron, and β is the bias of the output neuron. The weights and biases are determined by the training procedure, which minimizes the training error. The commonly used training error function, the Mean Square Error (MSE), is presented as follows:

$$E = \frac{1}{S}\sum\limits_{s = 1}^S {\left( {t^s - y^s } \right)^2 } $$
(8)

where y^s is the output of the network when the sth sample is the ANN input, t^s is the sth training target, and S is the total number of training samples. The classical Back Propagation (BP) algorithm is often used to update the weights and biases to minimize the training error.

As shown by formula (7), ANN has three user-defined parameters: the number of hidden layers, the number of hidden nodes and the type of transfer function. These parameters have a major impact on ANN’s prediction performance (Hagan et al. 1997). Among these parameters, one hidden layer is often recommended, since multiple hidden layers may lead to an over-parameterized ANN structure. As for the number of hidden nodes, too few hidden nodes can compromise the ability of the network to approximate a desired function, while too many hidden nodes can lead to over-fitting. In our study, ANN is used as the adaptive non-linear adjustment component in the NABE system. The NABE method and its system procedure are described in Section 3.1. A minimal sketch of the network in (7) is given below.
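To make the structure of formula (7) concrete, the following sketch implements the forward pass of a one-hidden-layer network in Python with NumPy. The tanh (tan-sigmoid) transfer function and the random parameters are illustrative assumptions; in the experiments the transfer function and hidden layer size are selected on the training data, and the weights come from BP training.

```python
import numpy as np

def ann_forward(x, V, alpha, w, beta, f=np.tanh):
    """One-hidden-layer network of formula (7).

    x:     input vector of length I
    V:     (I, J) matrix of weights v_ij from input to hidden neurons
    alpha: (J,) biases of the hidden neurons
    w:     (J,) weights from hidden neurons to the output neuron
    beta:  bias of the output neuron
    f:     transfer function (tanh is one common choice)
    """
    # Formula (7) applies f to the inputs as well as to the hidden sums.
    hidden = f(f(x) @ V + alpha)
    return float(w @ hidden + beta)

# Illustrative usage with random parameters (I = 6 inputs, J = 3 hidden nodes).
rng = np.random.default_rng(0)
V, alpha = rng.normal(size=(6, 3)), rng.normal(size=3)
w, beta = rng.normal(size=3), rng.normal()
print(ann_forward(rng.normal(size=6), V, alpha, w, beta))
```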

3.1 Non-linear Adjusted Analogy Based System

Following the explanations in Section 2, the adjustment mechanism should capture the ‘update’ that transforms the solution of the retrieved projects into the target solution. Starting from the linear adjustment model proposed by Chiu and Huang (2007), we extend it to the following additive form:

$$\hat C_x = C_{w/o} + f\left( {s_x ,S_k } \right)$$
(9)

where f(·) is an arbitrary function approximating the update that is necessary to change the retrieved solution into the target solution (in our study, f(·) is the ANN model), s_x is the feature vector of project x, S_k is the feature matrix of the K closest analogies, and C_w/o is the cost value obtained from the ABE without adjustment (i.e., the retrieved solution).

The NABE system consists of two stages. In the first stage the NABE system obtains the retrieved (un-adjusted) solution and trains the non-linear component, the ANN. In the second stage the non-linear component is used to produce the update, which is then added to the retrieved solution to generate the final prediction.

3.1.1 Stage I—Training

The procedure of Stage I is shown in Fig. 2. The jackknife approach (Angelis and Stamelos 2000) (also known as leave-one-out cross-validation) is employed for training the non-linear adjustment (ANN). For each project in the training dataset, the following steps are performed (a code sketch of this training loop is given at the end of this subsection):

Step 1: the ith project is extracted from the training dataset as the new project being estimated, and the remaining projects are treated as the historical projects in the ABE system.

Step 2: the ABE system finds the K closest analogies among the historical projects by the similarity measure. In this study, the Euclidean distance is used to construct the similarity function Sim(i, j):

    $$Sim\left( {i,j} \right) = \frac{1}{{\delta + \sqrt {\sum\limits_{q = 1}^Q {Dist\left( {s_{iq} ,s_{jq} } \right)} } }},\quad \delta = 0.0001$$
    $$Dist\left( {s_{iq} ,s_{jq} } \right) = \begin{cases} \left( {s_{iq} - s_{jq} } \right)^2 & {\text{if }}s_{iq} {\text{ and }}s_{jq} {\text{ are numeric}} \\ 1 & {\text{if }}s_{iq} {\text{ and }}s_{jq} {\text{ are categorical and }}s_{iq} = s_{jq} \\ 0 & {\text{if }}s_{iq} {\text{ and }}s_{jq} {\text{ are categorical and }}s_{iq} \ne s_{jq} \end{cases}$$
    (10)

where i represents the project being estimated, j denotes one historical project, s_iq is the qth feature value of project i, s_jq denotes the qth feature value of project j, Q is the total number of features in each project, and δ = 0.0001 is a small constant to prevent division by zero when \(\sqrt {\sum\limits_{q = 1}^Q {Dist\left( {s_{iq} ,s_{jq} } \right)} } = 0\). In our similarity function, we use the un-weighted Euclidean distance to eliminate the impact of different feature weights.

Fig. 2 Training stage of the ANN adjusted ABE system with K closest analogies (AK denotes the Kth closest analogy of project i)

After obtaining the K analogies, the retrieved solution (cost value) for the ith project is generated. For the sake of simplicity, the un-weighted mean (Shepperd and Schofield 1997) is used as the retrieved solution in this study.

Step 3: after obtaining the retrieved solution, the inputs and the training target are prepared to train the ANN model in (9). The inputs of the ANN are the residuals between the features of project i and the features of its K analogies. The training target of the ANN is the residual between the ith project’s real cost value and the retrieved solution from its K analogies:

    $$C_i - \sum\limits_{k = 1}^K {\frac{{C_k }}{K}} = \sum\limits_{j = 1}^J {w_j } f\left( {\sum\limits_{k = 1}^K {\sum\limits_{q = 1}^Q {v_{kqj} f\left( {s_{iq} - s_{kq} } \right) + \alpha _j } } } \right) + \beta + \varepsilon $$
    (11)

The left hand side of (11) is the training target: the difference between the real cost of project i and the retrieved solution of project i. The right hand side of (11) is the ANN model, with s_iq as the qth feature of project i and s_kq as the qth feature of its kth analogy (if s_iq and s_kq are categorical features, then s_iq − s_kq = 1 when s_iq = s_kq, and s_iq − s_kq = 0 when s_iq ≠ s_kq), with w_j, v_kqj, α_j and β as the ANN weights and biases, with f(·) as the transfer function, with J as the number of hidden neurons, with K as the total number of analogies, and with Q as the total number of features in each project. For example, if the ith project’s real cost is 40 and the retrieved solution is 21, then the target output of the ANN is 40 − 21 = 19.

Step 4: given the inputs and the target output, the Back Propagation (BP) algorithm is performed to update the parameters in (11) so as to minimize the training error MSE (8).

After repeating the above procedure for all the projects in the training dataset, the training stage is completed and the system moves to the testing stage.
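As a summary of steps 1 to 3, the following is a minimal sketch of the jackknife pair construction in Python with NumPy; the similarity follows (10), the categorical difference convention follows (11), and the `train_ann` routine named in the final comment is a hypothetical stand-in for any BP trainer of the network in (11).

```python
import numpy as np

def feature_diff(p, q, is_categorical):
    """Residual vector between two projects; per (11), a categorical feature
    contributes 1 when the values match and 0 when they differ."""
    return np.array([(1.0 if a == b else 0.0) if cat else a - b
                     for a, b, cat in zip(p, q, is_categorical)])

def similarity(p, q, is_categorical, delta=1e-4):
    """Un-weighted similarity function (10), as stated in the text."""
    dist = sum((1.0 if a == b else 0.0) if cat else (a - b) ** 2
               for a, b, cat in zip(p, q, is_categorical))
    return 1.0 / (delta + np.sqrt(dist))

def build_training_pairs(features, costs, is_categorical, K):
    """Jackknife over the training set: for each project i, retrieve its K
    closest analogies, then form the ANN input (feature residuals) and the
    ANN target (real cost minus the un-weighted-mean retrieved solution)."""
    inputs, targets = [], []
    for i in range(len(features)):
        others = [j for j in range(len(features)) if j != i]
        sims = [similarity(features[i], features[j], is_categorical) for j in others]
        analogies = [others[j] for j in np.argsort(sims)[::-1][:K]]
        retrieved = np.mean([costs[k] for k in analogies])
        inputs.append(np.concatenate(
            [feature_diff(features[i], features[k], is_categorical)
             for k in analogies]))
        targets.append(costs[i] - retrieved)
    return np.array(inputs), np.array(targets)

# The pairs are then fed to a BP trainer, e.g. (hypothetical helper):
# ann = train_ann(inputs, targets, hidden_nodes=J, transfer=np.tanh, mse_goal=0.01)
```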

3.1.2 Stage II—Predicting

The predicting stage is illustrated in Fig. 3. At this stage, a new project x is presented to the trained NABE system. Then, a set of K analogies is retrieved from the training dataset by applying (10) to calculate the similarities. After obtaining the K analogies, the retrieved solution function is used to generate the un-adjusted prediction, and the differences between the features of project x and those of its K analogies are fed into the trained ANN model to generate the adjustment. Finally, the ABE prediction and the ANN adjustment are summed to produce the final prediction:

$$\hat y\left( x \right) = \sum\limits_{k = 1}^K {\frac{{C_k }}{K}} + \sum\limits_{j = 1}^J {w_j } f\left( {\sum\limits_{k = 1}^K {\sum\limits_{q = 1}^Q {v_{kqj} f(s_{xq} - s_{kq} ) + \alpha _j } } } \right) + \beta $$
(12)
Fig. 3 Predicting stage of the ANN adjusted ABE system with K closest analogies (AK denotes the Kth closest analogy of project x)
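Continuing the sketch from Section 3.1.1, the predicting stage of formula (12) reduces to a few lines; `similarity`, `feature_diff`, and `ann_forward` are the illustrative helpers defined earlier, and `ann` stands for the trained network wrapped as a callable.

```python
def nabe_predict(x, train_features, train_costs, is_categorical, K, ann):
    """Final prediction of formula (12): retrieved solution plus ANN adjustment."""
    sims = [similarity(x, p, is_categorical) for p in train_features]
    analogies = np.argsort(sims)[::-1][:K]          # K most similar projects
    retrieved = np.mean([train_costs[k] for k in analogies])
    residuals = np.concatenate(
        [feature_diff(x, train_features[k], is_categorical) for k in analogies])
    return retrieved + ann(residuals)               # un-adjusted solution + update
```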

For a better illustration of the NABE system procedure, an application example is given in Section 5. Before the example, the accuracy evaluation criteria and the real world datasets are presented in Section 4.

4 Evaluation Criteria and Data Sets

The evaluation criteria and real world data sets for experiments are presented in Section 4.1 and Section 4.2 respectively.

4.1 The Evaluation Criteria

Evaluation criteria are essential to the experiments. In the literature, several quality metrics have been proposed to assess the performance of estimation methods. More specifically, the Mean Magnitude of Relative Error (MMRE), PRED(0.25) (Conte et al. 1986), and the Median Magnitude of Relative Error (MdMRE) (Jorgensen et al. 1995) are three popular metrics.

The MMRE is defined as below:

$$\begin{array}{*{20}l} {MMRE = \frac{1}{n} \times \sum\limits_{i = 1}^n {MRE_i } } \hfill \\ {MRE_i = \left| {\frac{{C_i - \hat C_i }}{{C_i }}} \right|} \hfill \\ \end{array} $$
(13)

where n denotes the total number of projects, C_i denotes the actual cost of project i, and Ĉ_i denotes the estimated cost of project i. A small MMRE value indicates a low level of estimation error. However, this metric is unbalanced and penalizes overestimation more than underestimation.

The MdMRE (Kitchenham et al. 2001) is the median of all the MREs.

$$MdMRE = median\left( {\left| {\frac{{C_i - \hat C_i }}{{C_i }}} \right|} \right)$$
(14)

It exhibits a similar pattern to MMRE, but it is more likely to select the true model, especially in underestimation cases, since it is less sensitive to extreme outliers (Foss et al. 2003). PRED(0.25) is the percentage of predictions that fall within 25% of the actual value.

$$PRED\left( {0.25} \right) = \frac{1}{n} \times \sum\limits_{i = 1}^n {\left( {MRE_i \leqslant 0.25} \right)} $$
(15)

PRED(0.25) identifies cost estimations that are generally accurate, while MMRE is biased and not always reliable as a performance metric. However, MMRE has been the de facto standard in the software cost estimation literature. In addition to the metrics mentioned above, several other metrics are available in the literature, such as the Adjusted Mean Square Error (AMSE) (Burgess and Lefley 2001), the Standard Deviation (SD) (Foss et al. 2003), the Relative Standard Deviation (RSD) (Foss et al. 2003), and the Logarithmic Standard Deviation (LSD) (Foss et al. 2003).
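For reference, the three metrics (13)–(15) follow directly from their definitions; a small sketch in Python with NumPy:

```python
import numpy as np

def mre(actual, predicted):
    """Magnitude of Relative Error per project, as in (13)."""
    actual, predicted = np.asarray(actual, float), np.asarray(predicted, float)
    return np.abs((actual - predicted) / actual)

def mmre(actual, predicted):                 # Mean MRE, formula (13)
    return mre(actual, predicted).mean()

def mdmre(actual, predicted):                # Median MRE, formula (14)
    return np.median(mre(actual, predicted))

def pred(actual, predicted, level=0.25):     # PRED(0.25), formula (15)
    return (mre(actual, predicted) <= level).mean()

# Example with three projects (actual vs. estimated cost).
actual, estimated = [40.0, 10.0, 25.0], [30.0, 12.0, 26.0]
print(mmre(actual, estimated), mdmre(actual, estimated), pred(actual, estimated))
```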

4.2 Data Sets

Four well known real world datasets are chosen for the experiments. The Albrecht dataset (Albrecht and Gaffney 1983) is a popular dataset used by many recent studies (Shepperd and Schofield 1997; Heiat 2002; Auer et al. 2006). This dataset includes 24 projects developed in third generation languages. Eighteen of the 24 projects were written in COBOL, four in PL/1, and two in DMS languages. There are five independent features: ‘Inpcount’, ‘Outcount’, ‘Quecount’, ‘Filcount’, and ‘SLOC’. The two dependent features are ‘Fp’ and ‘Effort’. The ‘Effort’, recorded in 1,000 person hours, is the target feature of cost estimation. The detailed descriptions of the features are shown in Appendix A, and the descriptive statistics are presented in Table 2. Among these statistics, the ‘Skewness’ and ‘Kurtosis’ quantify the degree of non-normality of the features (Kendall and Stuart 1976). It is noted that Albrecht is a relatively small dataset with high order non-normality compared to the other three datasets.

Table 2 Descriptive statistics of all features of Albrecht dataset

The Desharnais dataset was collected by Desharnais (1989). Although the Desharnais dataset is relatively old, it is one of the largest publicly available datasets, and it has therefore been employed by many recent research works, such as Mair et al. (2000), Burgess and Lefley (2001), and Auer et al. (2006). This data set includes 81 projects (with nine features) from one Canadian software company. Four of the 81 projects contain missing values, so they have been excluded from further investigation. The eight independent features are ‘TeamExp’, ‘ManagerExp’, ‘Length’, ‘Language’, ‘Transactions’, ‘Entities’, ‘Envergure’, and ‘PointsAdjust’. The dependent feature ‘Effort’ is recorded in 1,000 h. The definitions of the features are provided in Appendix B, and the descriptive statistics of all features are presented in Table 3. Table 3 shows that Desharnais is a larger dataset with relatively lower order non-normality compared with the Albrecht dataset.

Table 3 Descriptive statistics of all features of Desharnais dataset

The Maxwell dataset (Maxwell 2002) is a relatively new dataset that has already been used by some recent research works (Sentas et al. 2005; Li et al. 2008b). This dataset contains 62 projects (with 26 features) from one of the biggest commercial banks in Finland. In this dataset, four of the 26 features are numerical and the rest are categorical. The categorical features can be further divided into ordinal features and nominal features, and they have to be distinguished: when calculating the similarity measure, the ordinal features are treated as ‘numerical’ since they are sensitive to order, while the nominal features are regarded as ‘categorical’ (see formula (10)).

In the Maxwell dataset, the numerical features are ‘Time’, ‘Duration’, ‘Size’ and ‘Effort’. The categorical features are ‘Nlan’, ‘T01’–‘T15’, ‘App’, ‘Har’, ‘Dba’, ‘Ifc’, ‘Source’ and ‘Telonuse’. The ordinal features are ‘Nlan’ and ‘T01’–‘T15’. The nominal features are ‘App’, ‘Har’, ‘Dba’, ‘Ifc’, ‘Source’ and ‘Telonuse’. The definitions of all the features are presented in Appendix C, and the descriptive statistics of all features are provided in Table 4. It shows that Maxwell is a relatively large dataset with relatively lower order non-normality and a larger proportion of categorical features compared with the Albrecht and Desharnais sets.

Table 4 Descriptive statistics of all features of Maxwell data set

The ISBSG (International Software Benchmarking Standards Group) has developed and refined its data collection standard over a ten-year period, based on metrics that have proven to be very useful for improving software development processes. The latest data release of this organization is the ISBSG R10 data repository (ISBSG 2007a), which contains 4,106 projects (with 105 features) from 22 countries and various organization types such as banking, communications, insurance, business services, government and manufacturing.

Due to the heterogeneous nature and the huge size of the entire repository, ISBSG recommends extracting a suitable subset for any cost estimation practice (ISBSG 2007b). As a first step, only the relevant features characterizing projects should be considered for the subset. Thus, we select 14 important features (including project effort) suggested by ISBSG (ISBSG 2007b): ‘DevType’, ‘OrgType’, ‘BusType’, ‘AppType’, ‘DevPlat’, ‘PriProLan’, ‘DevTech’, ‘ProjectSize’ (consisting of six sub features: ‘InpCont’, ‘OutCont’, ‘EnqCont’, ‘FileCont’, ‘IntCont’, and ‘AFP’), and ‘NorEffort’. Then, the projects with missing values in any of the selected features are excluded from the subset. Afterward, a further step is taken to refine the subset: in the ISBSG dataset, project data quality is rated, and only projects with an A or B rating are used in published research works, so the projects with ratings other than A and B are excluded. Moreover, since the normalized effort (‘NorEffort’) is used as the target for estimation, the risk of using normalized effort should be noted: for a project covering less than a full development life cycle, the normalized effort is an estimate of the full development effort, and this may introduce bias. Hence the normalization ratio (normalized effort / summary effort) is used to refine the project subset. As ISBSG suggests that ratios up to 1.2 are acceptable (ISBSG 2007b), we filter out the projects with a normalization ratio larger than 1.2. Finally, the subset is further reduced to the projects with ‘Banking’ as ‘OrgType’. These procedures result in a subset of 118 projects.
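A sketch of this filtering pipeline in Python with pandas is given below. The column names ‘Quality’ and ‘SumEffort’ are hypothetical placeholders (the actual ISBSG column labels differ across releases), so this illustrates the steps rather than providing a runnable recipe for the R10 release.

```python
import pandas as pd

SELECTED = ['DevType', 'OrgType', 'BusType', 'AppType', 'DevPlat', 'PriProLan',
            'DevTech', 'InpCont', 'OutCont', 'EnqCont', 'FileCont', 'IntCont',
            'AFP', 'NorEffort']

def isbsg_subset(df: pd.DataFrame) -> pd.DataFrame:
    """Apply the subset refinement steps described above ('Quality' and
    'SumEffort' are assumed column names)."""
    sub = df.dropna(subset=SELECTED)                # drop projects with missing values
    sub = sub[sub['Quality'].isin(['A', 'B'])]      # keep data quality ratings A and B
    ratio = sub['NorEffort'] / sub['SumEffort']     # normalization ratio
    sub = sub[ratio <= 1.2]                         # ratio up to 1.2 is acceptable
    return sub[sub['OrgType'] == 'Banking']         # banking projects only
```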

The definitions of the project features are presented in Appendix D, and the descriptive statistics of all features are summarized in Table 5. Table 5 and Appendix D show that the ISBSG subset is the largest dataset, with high order non-normality and a large proportion of categorical features compared with the datasets above.

Table 5 Descriptive statistics of all features of ISBSG data set

5 Application Example of NABE

This section presents an application example of the NABE system on the Albrecht dataset.

5.1 Stage I—Training

Step 1: suppose project i is used as the training target for the non-linear component (all feature values are normalized into the region [0, 1]):

Features    Inpcount  Outcount  Quecount  Filcount  Fp    SLOC  Effort
Project i   0.22      0.38      0.08      0.16      0.13  0.32  0.17

Step 2: the ABE algorithm searches through the training dataset for the K = 1 closest analogy of project i. Suppose that the retrieved closest analogy is project j:

Features    Inpcount  Outcount  Quecount  Filcount  Fp    SLOC  Effort
Project j   0.23      0.43      0.27      0.19      0.15  0.36  0.12

Step 3: the training target and inputs are calculated for the ANN component. The target output is (project i’s effort) − (project j’s effort) = 0.17 − 0.12 = 0.05; the inputs of the ANN are the features of project i minus the corresponding features of project j:

Inputs      1      2      3      4      5      6
Residual   −0.01  −0.05  −0.19  −0.03  −0.02  −0.04

Step 4: the back-propagation algorithm is performed to train the ANN using the target and inputs from step 3.

Steps 1 to 4 are repeated across all the projects in the training set. The trained ANN model constitutes the non-linear adjustment that adapts to the relationship transforming the retrieved solution into the target solution.

5.2 Stage II—Predicting

Suppose a new project x is being estimated and the effort value of this project is unknown:

Features    Inpcount  Outcount  Quecount  Filcount  Fp    SLOC  Effort
Project x   0.36      0.18      0.20      0         0.06  0.23  ?

The ABE system first retrieves the closest analogy, project k, from the training dataset:

Features    Inpcount  Outcount  Quecount  Filcount  Fp    SLOC  Effort
Project k   0.18      0.09      0.08      0         0.11  0.11  0.08

Therefore, effort = 0.08 is the retrieved solution. Then the residuals between the feature values of project x and those of project k are calculated and fed into the ANN model:

Inputs      1     2     3     4     5      6
Residual   0.18  0.09  0.12  0     −0.05  0.12

After processing the inputs, the ANN generates an output of 0.07, according to the network structure obtained from the training stage. Finally, the final prediction is produced by adding the ANN output to the retrieved solution (0.08 + 0.07 = 0.15).

6 Experiments and Results

In this section, the proposed NABE is tested on the four datasets introduced in Section 4, and compared with the linearly adjusted ABEs and other estimation methods including ANN, CART and SWR. In Section 6.1, the experiment design and the parameterization of the methods are presented. From Section 6.2 to Section 6.5, the experimental results on the four datasets are summarized and analyzed.

6.1 Experiment Design

6.1.1 Three-Fold Cross Validation

Prior to the experiment setup, all features are normalized into [0, 1] by dividing each feature value by that feature's range, similarly to ANGEL (Shepperd and Schofield 1997). Three-fold cross-validation is used to assess the accuracy of the methods, similarly to Briand et al. (1999), Jeffery et al. (2001), and Mendes et al. (2003). Under this scheme, the data set is randomly divided into k = 3 equally sized subsets. Each time, one of the three subsets is used as the testing set exclusively for evaluating model predictions, and the remaining two subsets are combined to form a training set used only to construct the estimation model. This process is repeated three times, so that each subset serves as the testing set exactly once. Finally the average training error and testing error across all three trials are computed. The advantage of the cross validation scheme is that it matters little how the data is divided, since every data point is assigned to a test set exactly once and to a training set twice.
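A compact sketch of this setup in Python with NumPy; shifting by the minimum before dividing by the range is an added assumption here, so that values land exactly in [0, 1]:

```python
import numpy as np

def normalize(X):
    """Scale each feature by its range, as described above."""
    X = np.asarray(X, float)
    span = X.max(axis=0) - X.min(axis=0)
    return (X - X.min(axis=0)) / np.where(span == 0, 1.0, span)

def three_fold_splits(n, seed=0):
    """Randomly partition n projects into three equally sized folds; each fold
    serves once as the test set and twice inside the training set."""
    idx = np.random.default_rng(seed).permutation(n)
    folds = np.array_split(idx, 3)
    for i in range(3):
        train = np.concatenate([folds[j] for j in range(3) if j != i])
        yield train, folds[i]
```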

6.1.2 Experiment Procedures

After determining the cross-validation scheme, the following procedure is performed to validate the proposed NABE system, with comparisons against other methods, on each dataset.

  1. The performance of NABE is analyzed on both the training set and the testing set by varying the number of analogies K from 1 to 5, while keeping the similarity measure as the formula in (10) and the retrieved solution function as the ‘un-weighted mean’. The reason for varying K is that K is an important parameter which determines the number of inputs to the non-linear adjustment. The similarity measure and the retrieved solution are fixed because the focus of this study is the non-linear adjustment, and these two components may not have a direct impact on it.

  2. The optimal K value from the training practice (the K that minimizes the MMRE on the training set) is selected to configure NABE for the comparisons. Similarly, the best variants of the other methods on the training sets are also obtained to compare with NABE. The training and testing results are summarized and analyzed.

  3. The Wilcoxon signed-rank tests (α = 0.05) are performed to quantitatively identify the significance of the difference in each pair-wise comparison on the testing sets; a sketch of this test is given after the list.
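As a sketch of step 3, the paired test can be run on the absolute residuals of two methods with SciPy; the residual arrays below are illustrative, not experimental data:

```python
import numpy as np
from scipy.stats import wilcoxon

# Absolute residuals |actual - predicted| of two methods on the same test projects.
abs_res_nabe = np.array([2.1, 0.5, 3.3, 1.2, 0.9, 4.0])
abs_res_abe  = np.array([2.8, 1.1, 3.0, 2.5, 1.4, 5.2])

stat, p_value = wilcoxon(abs_res_nabe, abs_res_abe)
print(f"p = {p_value:.3f}")   # p < 0.05 -> significant difference at alpha = 0.05
```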

6.1.3 Methods Specifications

Many cost estimation techniques are included for comparison: the standard ABE (Shepperd and Schofield 1997), the linear size adjusted ABE (LABE) (Walkerden and Jeffery 1999), the regression toward the mean adjusted ABE (RABE) (Jorgensen et al. 2003), the GA optimized linear adjusted ABE (GABE) (Chiu and Huang 2007), the similarity adjusted ABE (SABE) (Li and Ruhe 2008), and other popular cost estimation methods including Classification and Regression Trees (CART) (Stensrud 2001), the Artificial Neural Network (ANN) (Mair et al. 2000) and Stepwise Regression (SWR) (Mendes et al. 2003).

To eliminate the impact of different parameters, all ABE type methods are implemented with a fixed similarity measure (Euclidean) and retrieved solution (un-weighted mean). The only variable parameter, the number of analogies K, varies from 1 to 5. Note that in the SABE method the un-weighted similarity function is applied, since feature weighting is not included in this study.

For ANN, there are generally three parameters: the number of hidden nodes, the number of hidden layers and the type of hidden transfer function. In our study, only one hidden layer is considered, in order to avoid an over-parameterized ANN structure. The number of hidden nodes is chosen from the set {1, 3, 5, 7, 10} and the type of hidden transfer function is chosen from the set {Linear, Tan-Sigmoid, Log-Sigmoid}. Every combination of hidden nodes and hidden transfer function is evaluated on the training data, and the optimal combination (minimizing MMRE) is used for testing and comparisons.

CART (Brieman et al. 1984) is a non-parametric, tree structured analysis procedure that can be used for classification and regression. When the tree structure is applied to numerical targets, the trees are often called regression trees. CART has the following advantages: the capability of dealing with categorical features, an easily understandable diagram of complex data, and the ability to identify the major subsets in the total dataset (Srinivasan and Fisher 1995). The construction of a CART involves recursively splitting the data set into (usually two) relatively homogeneous subsets until the termination conditions are satisfied. The best tree is obtained by applying cross-validation on the training set using a spread minimization criterion, and this best tree model is used in testing and comparisons.
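For illustration, a regression tree of this kind can be fitted with scikit-learn, selecting the tree size by cross-validation on the training set; the depth grid, the scoring rule, and the random training data are assumptions for the sketch, not the paper's exact configuration:

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor

# Stand-in training data: project features (categorical ones label-encoded) and effort.
rng = np.random.default_rng(0)
X_train, y_train = rng.random((40, 6)), rng.random(40)

search = GridSearchCV(
    DecisionTreeRegressor(criterion="squared_error"),
    param_grid={"max_depth": [2, 3, 4, 5, None]},
    cv=3,                                  # cross-validation on the training set
    scoring="neg_mean_absolute_error",
)
best_tree = search.fit(X_train, y_train).best_estimator_
```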

For the stepwise regression method (SWR), the optimal regression model is determined by the forward stepwise procedure on the training dataset, and the resulting optimal linear equation is used in testing and comparisons. When categorical features appear in the dataset, the optimal scaling (or CATREG) technique used by Angelis et al. (2000) is utilized to build the regression model on both numerical and categorical features.

Finally, a random model (RAND) is included in the comparisons as a control group; it produces an estimate by randomly selecting any project’s cost value from the dataset (training set or testing set).

All the methods are implemented in MATLAB code. The ANN component in the NABE system and the standalone ANN method are trained by the BP algorithm. The MSE error in (8) is used to determine how well the network is trained; training stops when the MSE drops below the specified threshold, set to 0.01 in this study.

6.2 The Results on Albrecht Dataset

This section presents the results and comparisons on the Albrecht dataset. Table 6 summarizes the three-fold cross validation results of NABE with different K values. It is observed that the setting K = 4 minimizes the training MMRE. Thus, the NABE system with K = 4 is chosen for the comparisons with other methods. In order to provide more insight into the magnitude of the adjustment proposed by the ANN, the ratios of absolute adjustment to non-adjusted prediction are calculated across the testing sets. The mean value of these ratios is 0.41 for the NABE system with K = 4.

Table 6 Results of NABE on Albrecht dataset

Table 7 collects the training and testing results of the best variants of all cost estimation models. The configurations for the ABE based methods are K = 2 for ABE, K = 3 for RABE, K = 1 for LABE, K = 2 for GABE and K = 1 for SABE. The testing results in Table 7 show that NABE achieves the best values of MMRE, PRED(0.25) and MdMRE. Among the other types of ABE, LABE obtains the smallest MMRE, ABE achieves the maximum PRED(0.25), and SABE has the minimal MdMRE. In addition, all methods perform better than the random model. Another interesting observation is that some testing results are better than the training results. Some published cost estimation works (such as Chiu and Huang (2007) and Huang and Chiu (2006)) reported similar patterns. This may be due to the fact that machine learning techniques are data driven methods that learn from examples without any knowledge of the model type; if the testing data happens to fit the model constructed on the training data well, then better testing results than training results are possible.

Table 7 Accuracy comparison on Albrecht dataset

To further analyze the testing performance, we draw the box plots of the absolute residuals, because absolute residuals are less sensitive to bias than the asymmetric MRE values (Stensrud et al. 2003). The plots in Fig. 4 show that NABE has a lower median, a shorter inter-quartile range, and fewer outliers than the other methods. It is also observed that the distributions of absolute residuals are heavily skewed, which implies that the standard t-test is not valid for significance testing. Thus, the assumption-free Wilcoxon signed-rank tests are performed instead. The p-values of the Wilcoxon signed-rank tests are summarized in Table 8.

Fig. 4 Boxplots of absolute residuals on Albrecht dataset

Table 8 NABE vs. other methods: p-values of the Wilcoxon tests and the improvements in percentages

Table 8 summarizes the p-values of the Wilcoxon tests of NABE versus each other method. Four paired comparisons have p-values smaller than 0.05: NABE vs. RABE, NABE vs. GABE, NABE vs. CART, and NABE vs. SWR. In addition, the improvements of NABE over the other methods in terms of MMRE values are presented in Table 8. Four of the MMRE improvements are larger than 30%; the largest improvement is 60% over CART and the smallest is 6% over LABE.

6.3 The Results on Desharnais Dataset

In this section, we present the results on the Desharnais dataset in a way similar to the presentation for the Albrecht dataset. Table 9 illustrates the training errors and testing errors of NABE for different K values. K = 2 achieves the minimal training MMRE, thus NABE with K = 2 is chosen for the comparisons with other methods. The average of the ratios of absolute adjustment to non-adjusted prediction is 0.03 on the testing sets.

Table 9 Results of NABE on Desharnais dataset

Table 10 summarizes the training and testing errors of the best variants of all cost estimation models. The optimal parameters for the ABE based methods other than NABE are: ABE with K = 1, RABE with K = 1, LABE with K = 1, GABE with K = 2 and SABE with K = 4. The testing results show that NABE achieves the smallest MMRE and MdMRE, and the second largest PRED(0.25). Among the other types of ABE, GABE obtains the smallest MMRE, while RABE achieves the largest PRED(0.25) and the minimal MdMRE. It is also observed that the differences between NABE and the other methods are not as apparent as those on the Albrecht dataset. This observation may be attributed to a characteristic of the Desharnais dataset, its moderate non-normality; it implies that all methods tend to perform equally well when the data set is close to a normal distribution. As to the control group, all other methods produce better predictions than the random model.

Table 10 Accuracy comparisons on Desharnais dataset

For further analysis, the box plots of the absolute residuals on the testing datasets are presented in Fig. 5. The plots show that NABE’s median is close to those of RABE, ANN and SWR; NABE has the shortest inter-quartile range; and NABE has five outliers while SABE and CART have fewer, though more extreme, ones. The distributions of absolute residuals are skewed, and therefore Wilcoxon tests are used to quantitatively investigate the differences between NABE and the other methods.

Fig. 5 Boxplots of absolute residuals on Desharnais dataset

In Table 11, the p-values from the Wilcoxon tests are presented together with the improvements in MMRE. Six out of eight p-values are larger than 0.05; the remaining two are NABE vs. LABE = 0.02 and NABE vs. CART = 0.03. None of the MMRE improvements exceeds 30%: the largest improvement is 30% over SWR while the smallest is 7% over GABE. These observations confirm that on the Desharnais dataset NABE does not perform significantly better than most methods, and the performances of the different methods are very close.

Table 11 NABE vs. other methods: p-values of the Wilcoxon tests and the improvements in percentages

6.4 The Results on Maxwell Dataset

This section presents the results and comparisons on the Maxwell dataset. Table 12 presents the three-fold cross validation results of NABE with different K values. The best setting, K = 3, which minimizes the training MMRE, is chosen for the comparisons with other methods. The mean of the ratios of absolute adjustment to non-adjusted prediction is 0.37 on the testing sets.

Table 12 Results of NABE on Maxwell dataset

Table 13 presents the training and testing accuracies of the different cost estimation models; the results of the best variants of all methods are collected in this table. The configurations for the ABE based methods are: ABE with K = 3, RABE with K = 3, LABE with K = 2, GABE with K = 3 and SABE with K = 4. The results show that NABE achieves the best testing MMRE, PRED(0.25) and MdMRE. Among the other types of ABE, SABE obtains the smallest MMRE, LABE achieves the largest PRED(0.25), and SABE has the minimal MdMRE. As to the control group, all other methods perform better than the random model.

Table 13 Accuracy comparisons on Maxwell dataset

To further analyze the testing results, we draw the box plots of the absolute residuals. The plots in Fig. 6 show that NABE has a median close to those of GABE and SABE; NABE has an inter-quartile range close to those of GABE, SABE and CART; and NABE has five outliers while RABE, GABE, ABE, ANN and SWR have fewer, though some of their outliers are more extreme. The distributions of the absolute residuals again call for Wilcoxon tests to identify the differences between NABE and the other methods.

Fig. 6 Boxplots of absolute residuals on Maxwell dataset

Table 14 summarizes the p-values of the Wilcoxon tests and the improvements in MMRE values. Four out of eight p-values are smaller than 0.05, and two of the MMRE improvements are larger than 30%. The largest improvement is 48% over CART and the smallest is 7% over GABE. These observations confirm that NABE performs significantly better than the other methods, except SABE and GABE, on the Maxwell dataset.

Table 14 NABE vs. other methods: p-values of the Wilcoxon tests and the improvements in percentages

6.5 The Results on ISBSG Dataset

In this section, we present the results and comparisons on the ISBSG dataset. Table 15 illustrates the training and testing errors of NABE with different K values. K = 2 achieves the minimal training MMRE, and therefore NABE with K = 2 is chosen for the comparisons with other methods. The mean value of the ratios of absolute adjustment to non-adjusted prediction is 0.43 on the testing sets, which is close to those on the Albrecht and Maxwell data sets.

Table 15 Results of NABE on ISBSG dataset

Table 16 summarizes the comparisons among the best variants of the different cost estimation models. The optimal parameters for the ABE based methods are: ABE with K = 3, RABE with K = 3, LABE with K = 1, GABE with K = 3 and SABE with K = 5. The results show that NABE achieves the best testing MMRE, PRED(0.25), and MdMRE. Among the other types of ABE, SABE obtains the smallest MMRE, while RABE achieves the largest PRED(0.25) and the minimal MdMRE. Compared to the control group, all methods appear to be better than the random model.

Table 16 Accuracy comparisons on ISBSG dataset

The box plots of absolute residuals on the testing sets are provided for further analysis. The plots in Fig. 7 show that NABE achieves a lower median and a shorter inter-quartile range than the other methods. Another observation is that all methods are prone to extreme outliers. This may be attributed to the fact that the ISBSG data were collected inter-organizationally and internationally; due to the diverse sources of data, even two similar projects might have quite different costs. In the next step, Wilcoxon tests are used to assess the differences between NABE and the other methods.

Fig. 7 Boxplots of absolute residuals on ISBSG dataset

In Table 17, the p-values from the Wilcoxon tests are presented together with the improvements in MMRE. In this table, no p-value is larger than 0.05. As to the MMRE improvements, four are larger than 30%; the largest improvement is 48% over RABE while the smallest is 14% over SWR.

Table 17 NABE vs. other methods: p-values of the Wilcoxon tests and the improvements in percentages

Besides the three-fold cross validation, we conduct a different round of testing on the ISBSG dataset. In this experiment, we select the test subset to consist of the 33% more recent projects (completed in 2000 and 2001) and the training subset to consist of the 66% older projects (completed from 1993 to 1999), because this provides a more realistic setting: in real life applications a user of NABE would train the method on old projects and apply it to incoming projects. The results on the training and testing subsets are summarized in Table 18. Most methods achieve better results than in the three-fold cross-validation, and NABE achieves the best results under all error metrics. Among the other ABEs, ABE obtains the smallest MMRE, and LABE achieves the largest PRED(0.25) and the minimal MdMRE. Compared to the control group, all methods appear to perform better than the random model.

Table 18 Accuracy comparisons on ISBSG dataset

7 Analysis on Dataset Characteristics

In Section 6, the results and comparisons were presented for each real dataset individually. However, the results vary dramatically from one dataset to another. For instance, NABE is statistically better than RABE on the ISBSG dataset (p = 0.02), but their performances are statistically indistinguishable on the Desharnais dataset (p = 0.28). This is probably due to the fact that model accuracy is affected not only by parameter selection but also by other factors such as dataset characteristics (Shepperd and Kadoda 2001). In this section, we conduct a systematic investigation to explore the relationship between model accuracy and dataset characteristics, and to identify the conditions under which NABE is the preferred prediction system and the conditions under which other methods are also recommendable.

Table 19 summarizes a set of characteristics of the real world datasets. The columns in this table list the dataset ID, the number of projects, the total number of features, the number of categorical features, and the average absolute skewness and kurtosis of the features. The skewness and kurtosis values together reflect the degree of non-normality of the dataset.

Table 19 Characteristics of the four real world datasets

This table provides some insight into each dataset. Software datasets often exhibit a mixture of several characteristics, such as skewness and excessive outliers (kurtosis), and these characteristics do not always appear to the same degree: in some cases they are moderate, as in the Albrecht dataset, while in other cases they are severe, as in the ISBSG dataset. It is also noted that the data sets contrast strongly with each other; for example, the Albrecht dataset is relatively small with a small proportion of categorical features, while the Maxwell dataset is larger and has a large proportion of categorical features. However, based only on the real world datasets, a systematic analysis remains difficult: the real dataset properties are uncontrollable, and the real world datasets cannot cover the full range of combinations of the properties being studied.

Artificially generating datasets by simulation (Pickard et al. 2001; Shepperd and Kadoda 2001) is a feasible solution to the above difficulties. This approach generates artificial datasets from predefined distributions and equations. A simulated dataset gives the researcher more control over the characteristics of a dataset; in particular, it enables the researcher to vary one property at a time and thus allows a more systematic exploration of the relationship between dataset characteristics and model accuracy. As a simple but powerful tool for empirical evaluations, this technique has been frequently employed by recently published studies (Myrtveit et al. 2005; Li et al. 2008a).

Besides the simulation approach, bootstrapping (Efron and Gong 1983) is often used to produce artificial datasets to study the uncertainties in predictions (Angelis and Stamelos 2000). Its principle is to generate several new datasets of the same size as the original dataset by randomly sampling the original data with replacement; each new dataset may contain some items from the original dataset more than once and others not at all. However, bootstrapping is not used for artificial dataset generation in this study. The reason is that our study emphasizes varying dataset properties to investigate the relationships between dataset properties and model accuracy, whereas bootstrapping only generates a series of datasets based on the original data and offers limited ability to change the dataset properties. The simulation technique, on the other hand, provides more explicit control over the dataset properties, such as adjusting the distribution parameters to vary the skewness and kurtosis of a variable's distribution.

In Section 7.1, we simulate eight artificial datasets to match the eight different combinations of the three data characteristics. Due to computational limits, we only consider two levels for each characteristic: Large/Small for the ‘Dataset size’, Large/Small for the ‘Proportion of categorical features’, and Severe/Moderate for the ‘Non-normality’.

7.1 Artificial Datasets Generation

In this section, we present the procedure for artificial dataset generation. We extend Pickard’s equation for artificial dataset generation in this work. Other simulation techniques for artificial datasets are also available in the literature; for more details, readers may refer to Shepperd and Kadoda (2001), Foss et al. (2003), and Myrtveit et al. (2005).

Based on Pickard’s method, we simulate the combinations of characteristics from equation (16):

$$y = 1000 + 6x_1 sk + 3x_2 sk + 2x_3 sk + 5x_4 sk + 10x_5 sk + x_6 sk + e$$
(16)

The independent variables are $x_1^{sk}$, $x_2^{sk}$, $x_3^{sk}$, $x_4^{sk}$, $x_5^{sk}$, and $x_6^{sk}$. Among them, $x_1^{sk}$, $x_2^{sk}$, and $x_3^{sk}$ are continuous variables, and $x_6^{sk}$ is a categorical variable. The first variable $x_1^{sk}$ is treated as the feature ‘function point’ for the linear adjustment methods. The last term $e$ in (16) is normally distributed noise with mean 0 and variance 1. To simulate different proportions of categorical features (Large/Small), $x_4^{sk}$ and $x_5^{sk}$ are defined as categorical variables for the large-proportion situation (50%), while they are set to be continuous to represent the small-proportion situation (16.7%).

The non-normality is represented by skewness and outliers (kurtosis). For the continuous variables, skewness is generated by five independent Gamma distributed random variables $x_1'$, $x_2'$, $x_3'$, $x_4'$, and $x_5'$, with scale parameter $\theta = 2$ and shape parameter $k = 3$ representing moderate skewness, and $\theta = 2$ and $k = 1$ representing severe skewness. For the categorical variables, moderate skewness is simulated by independent discrete random variables $x_4'$, $x_5'$, and $x_6'$ with the distribution {P(X = 1) = 0.1, P(X = 2) = 0.1, P(X = 3) = 0.5, P(X = 4) = 0.2, P(X = 5) = 0.1}, and severe skewness is simulated by the distribution {P(X = 1) = 0.7, P(X = 2) = 0.1, P(X = 3) = 0.1, P(X = 4) = 0, P(X = 5) = 0.1}. To vary the magnitudes of the independent variables, we then multiply $x_1'$ by ten to create $x_1^{sk}$, $x_2'$ by three to create $x_2^{sk}$, $x_3'$ by 20 to create $x_3^{sk}$, $x_4'$ by five to create $x_4^{sk}$, $x_5'$ by two to create $x_5^{sk}$, and $x_6'$ by one to create $x_6^{sk}$. The outliers are generated by multiplying or dividing the dependent variable $y$ by a constant: we select 1% of the data points to be outliers, half obtained by multiplying and half by dividing. For moderate outliers the constant is set to 2, while for severe outliers it is set to 6.

For the dataset size property, we generate 400 projects to form the large dataset and 40 projects to form the small dataset. Table 20 summarizes the properties of the eight artificial datasets.

Table 20 Artificial datasets and properties
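For concreteness, the sketch below shows one way to implement the generation procedure of this section in Python; the function name, the seeding, and the handling of the 1% outlier fraction on the 40-project datasets (where 1% falls below one project per half) are our own assumptions:

```python
# A sketch of the generation procedure under our stated assumptions;
# numpy's gamma(shape=k, scale=theta) parameterization is used.
import numpy as np

def generate_dataset(n=400, large_categorical=False,
                     severe_nonnormal=False, seed=0):
    rng = np.random.default_rng(seed)
    k, theta = (1, 2.0) if severe_nonnormal else (3, 2.0)  # Gamma shape/scale
    levels = np.arange(1, 6)
    probs = ([0.7, 0.1, 0.1, 0.0, 0.1] if severe_nonnormal
             else [0.1, 0.1, 0.5, 0.2, 0.1])

    x = np.empty((n, 6))
    for j in (0, 1, 2):                        # x1', x2', x3': always continuous
        x[:, j] = rng.gamma(k, theta, size=n)
    if large_categorical:                      # x4', x5' categorical -> 50%
        x[:, 3] = rng.choice(levels, size=n, p=probs)
        x[:, 4] = rng.choice(levels, size=n, p=probs)
    else:                                      # x4', x5' continuous -> 16.7%
        x[:, 3] = rng.gamma(k, theta, size=n)
        x[:, 4] = rng.gamma(k, theta, size=n)
    x[:, 5] = rng.choice(levels, size=n, p=probs)   # x6': always categorical

    x = x * np.array([10, 3, 20, 5, 2, 1])     # magnitude multipliers
    e = rng.normal(0.0, 1.0, size=n)           # noise: mean 0, variance 1
    y = 1000 + x @ np.array([6, 3, 2, 5, 10, 1]) + e   # equation (16)

    # 1% outliers: half multiplied, half divided by the constant
    # (2 moderate, 6 severe). On n = 40 we assume one of each.
    n_out = max(2, round(0.01 * n))
    idx = rng.choice(n, size=n_out, replace=False)
    c = 6.0 if severe_nonnormal else 2.0
    y[idx[: n_out // 2]] *= c
    y[idx[n_out // 2 :]] /= c
    return x, y

# Example: a dataset #8-style sample (large, many categorical, severe).
X, y = generate_dataset(n=400, large_categorical=True, severe_nonnormal=True)
```

Enumerating the two levels of each argument (n, large_categorical, severe_nonnormal) reproduces the eight combinations of Table 20.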

7.2 Comparisons on Modeling Accuracies

The experimental procedures presented in Section 6.1.2 are applied to all artificial datasets. The comparisons between NABE and the other models are presented first, since the performance of NABE relative to the other methods provides insight into how to choose an appropriate cost estimation method under a given condition. Table 21 summarizes the results of Wilcoxon signed rank tests, which assess the differences between the absolute residuals of NABE’s predictions and those of each other method’s predictions. The significance level is set at α = 0.05. In Table 21, an entry marked ‘Y’ indicates that NABE performs significantly better than the method in that entry’s column. The last column gives the total number of ‘Y’s in each row (dataset).

Table 21 Comparative performance of NABE to other methods
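To make the test procedure concrete, the following sketch shows how such a paired comparison can be run with scipy; the residual values are hypothetical, and the two-sided variant of the test is assumed:

```python
# Hedged sketch of the Wilcoxon signed rank test on paired absolute
# residuals; the actual and predicted values below are hypothetical.
import numpy as np
from scipy.stats import wilcoxon

actual     = np.array([520., 310., 880., 1200., 450., 700.])
pred_nabe  = np.array([540., 300., 860., 1150., 470., 690.])
pred_other = np.array([600., 260., 960., 1350., 400., 780.])

stat, p = wilcoxon(np.abs(actual - pred_nabe),
                   np.abs(actual - pred_other))
print(p < 0.05)  # corresponds to a 'Y' entry at alpha = 0.05
```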

The results in Table 21 show that NABE performs better than all the other methods on datasets #4 and #8, both of which have a large proportion of categorical features and severe non-normality. This observation suggests that NABE might be the best choice among the methods in our study when the dataset is highly non-normal and has a large proportion of categorical features. It also confirms the findings on the ISBSG dataset, which has properties similar to dataset #8. Another interesting observation is that NABE obtains predictions equally as good as the other methods’ on dataset #1, which has a small size, a small number of categorical features, and moderate non-normality. Compared with the real world datasets, dataset #1’s properties are closest to those of the Desharnais set, on which NABE also performs equally to the other methods except LABE and CART.

The analysis above clarifies the conditions under which NABE is preferable to the other methods. To further study the relationship between dataset properties and model accuracy, we analyze the model predictions under each single dataset characteristic.

7.3 Analysis on ‘Size’

Table 22 summarizes the testing MMREs of each cost estimation model on the artificial datasets grouped by ‘size’. The results show that NABE achieves the lowest MMREs on datasets #2, #4, #5, #6, #7, and #8. It is also observed that the dataset size might largely influence the prediction accuracies; more specifically, almost all the methods obtain smaller MMRE values on the larger datasets.

Table 22 Testing MMREs under different dataset sizes
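For reference, the accuracy criteria reported in this section can be computed as in the sketch below; these are the standard definitions from the literature, which we assume match those given in Section 4, and the arrays are hypothetical:

```python
# Standard definitions of MMRE, MdMRE, and PRED(0.25), assumed to
# match the evaluation criteria of Section 4; arrays are hypothetical.
import numpy as np

def mre(actual, predicted):
    return np.abs(actual - predicted) / actual

def mmre(actual, predicted):
    return np.mean(mre(actual, predicted))

def mdmre(actual, predicted):
    return np.median(mre(actual, predicted))

def pred(actual, predicted, level=0.25):
    return np.mean(mre(actual, predicted) <= level)

actual    = np.array([520., 310., 880., 1200.])
predicted = np.array([540., 300., 700., 1100.])
print(mmre(actual, predicted))  # lower is better
```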

To further investigate the ‘size’ property, we compare the absolute residuals of predictions on the small datasets with those on the large datasets. The difference is tested using the Mann-Whitney U test with the significance level set at α = 0.05, since the sample sizes are unequal (40 vs. 400 data points). The results are presented in Table 23, where an entry marked ‘Y’ means the difference between the dataset pair in its row is significant for the model in its column. Table 23 shows that a larger dataset size may significantly reduce the prediction error measured by absolute residuals; most approaches, including NABE, benefit from larger datasets. However, SWR seems uninfluenced by the dataset size. This may be attributed to the fact that SWR constructs a hyperplane from the data using only a few critical data points. This finding also confirms the suggestion by Shepperd and Kadoda (2001) that, for machine learning methods, a large dataset size can reduce prediction errors when other properties are fixed.

Table 23 Mann-Whitney U tests of the influence of dataset size
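A minimal sketch of this unpaired comparison follows; the residual samples are synthetic placeholders rather than values from our experiments:

```python
# Hedged sketch of the Mann-Whitney U test on two unequal-sized
# samples of absolute residuals; the samples are synthetic.
import numpy as np
from scipy.stats import mannwhitneyu

rng = np.random.default_rng(7)
res_small = rng.gamma(3, 50, size=40)   # residuals on a small dataset
res_large = rng.gamma(3, 30, size=400)  # residuals on a large dataset

stat, p = mannwhitneyu(res_small, res_large)
print(p < 0.05)  # corresponds to a 'Y' entry at alpha = 0.05
```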

7.4 Analysis on ‘Proportion of Categorical Features’

This section presents the analysis of the proportion of categorical features. Table 24 is essentially a re-arrangement of the rows of Table 22, with the artificial datasets grouped by ‘proportion of categorical features’. It is observed that a large proportion of categorical features may have a negative impact on prediction accuracy. This finding reflects the fact that categorical features may have less statistical power than numerical features (Kirsopp et al. 2003).

Table 24 Testing MMREs under different proportions of categorical features

Table 25 presents the results of Wilcoxon signed rank tests at the α = 0.05 significance level, comparing the absolute residuals of predictions on the datasets with a smaller proportion of categorical features against those on the datasets with a larger proportion. In general, all methods are affected to some extent by this property. Among them, NABE, SABE, and SWR are the least sensitive to categorical features. The probable reason is that the CATREG technique is adopted in the SWR model, while NABE and SABE can both make use of the categorical features in their adjustment mechanisms.

Table 25 Wilcoxon tests of the influence of the proportion of categorical features

7.5 Analysis on ‘Degree of Non-Normality’

This section provides the analysis of the degree of non-normality. Table 26 is also a re-arrangement of the rows of Table 22, with the artificial datasets grouped by ‘degree of non-normality’. It is noted that most methods obtain larger MMRE values under severely non-normal conditions, indicating that an increase in non-normality may result in a decrease in prediction accuracy. However, NABE appears least sensitive to non-normality, while SWR seems most sensitive. This observation supports our argument in Section 2.2 that ANN can enhance ABE’s robustness to non-normal data.

Table 26 Testing MMREs under different degrees of non-normality

Table 27 presents the results of Wilcoxon signed rank tests at the α = 0.05 significance level, comparing the absolute residuals of predictions on the moderately non-normal datasets against those on the severely non-normal datasets. The results confirm the finding from Table 26 that NABE is the least sensitive to non-normality while SWR is the most sensitive. Table 27 also partially supports Shepperd and Kadoda’s (2001) argument that ABE is preferable to SWR if the dataset contains a large proportion of outliers.

Table 27 Wilcoxon tests of the influence of non-normality

7.6 Summary of Analysis

This section summarizes the findings from the real world datasets and artificial datasets:

  • Generally, NABE achieves better MMRE, PRED(0.25), and MdMRE values than most cost estimation methods on the four real world datasets. Its predictions (in terms of absolute residuals) are significantly better than the other methods’ in 18 of the 32 comparisons (eight methods × four datasets). For the artificial datasets, NABE achieves the lowest MMRE values on six of the eight datasets, and its absolute residuals are significantly better in 29 of the 64 comparisons.

  • More specifically, NABE outperforms the other methods on the ISBSG dataset and on artificial datasets #4 and #8, all of which have a large proportion of categorical features and severe non-normality. This observation indicates that NABE can substantially improve ABE on datasets with a high degree of non-normality and a large proportion of categorical features.

  • NABE obtains accuracy equal to that of most methods on the Desharnais dataset and artificial dataset #1. This indicates that NABE may not be the ideal option on datasets of small size, small proportion of categorical features, and moderate non-normality, compared with the linear based adjustment mechanisms. Besides accuracy, simplicity is also a critical criterion for model evaluation; in particular, when models are equally accurate, the simpler ones are preferable.

  • There are significant relationships between the success of NABE and dataset properties. First, a large dataset size generally improves NABE’s performance. Second, compared with the linear based adjustments, NABE appears less sensitive to the proportion of categorical features and the degree of non-normality.

8 Threats to Validity

This section discusses the validity of our study in terms of internal and external threats to validity.

8.1 Internal Validity

The threats to internal validity include the following aspects. To focus on the different adjustment mechanisms, we pre-determined the similarity measure and the retrieved solution function in the ABE system. However, there are many other options: for the similarity measure, there are alternatives based on Manhattan and Minkowski distances (Mendes et al. 2003; Huang and Chiu 2006; Li and Ruhe 2008), as sketched below, and for the retrieved solution there are the weighted mean and the median (Angelis and Stamelos 2000; Mendes et al. 2003).
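A minimal sketch of these alternative distance measures, under the common convention that similarity decreases with distance; the feature vectors are hypothetical:

```python
# Minkowski distance family for project feature vectors; p = 1 gives
# the Manhattan distance, p = 2 the Euclidean distance. The vectors
# below are hypothetical examples.
import numpy as np

def minkowski(a, b, p=2.0):
    return np.sum(np.abs(a - b) ** p) ** (1.0 / p)

proj_new  = np.array([120., 3., 0.7])
proj_hist = np.array([150., 2., 0.9])
print(minkowski(proj_new, proj_hist, p=1))  # Manhattan
print(minkowski(proj_new, proj_hist, p=2))  # Euclidean
```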

Moreover, feature selection (Kirsopp et al. 2003) and project selection (Li et al. 2008a) are important preprocessing steps for the ABE method, since software engineering datasets often contain many irrelevant features and noisy projects. The possibility of further improving the NABE system also lies in the appropriate selection of relevant features and representative projects.

Furthermore, missing values often appear in software engineering datasets. Many studies (Myrtveit et al. 2001; Strike et al. 2001; Jonsson and Wohlin 2006; Song and Shepperd 2007) have proposed data imputation techniques that recover missing data by estimating replacement values. However, missing values are excluded from our study. This might cause some difficulty for practitioners applying the proposed NABE system to datasets with missing values; for example, during the ISBSG subset preparation, we found that missing values caused the deletion of too many projects.

8.2 External Validity

External validity concerns the extent to which the findings of our comparative study can be generalized.

The threats to external validity are as follows. First, the limited number of real world datasets makes it difficult to generalize our findings. Although the four real world datasets contrast strongly with each other (from the simplest, Albrecht, to the largest and most complex, ISBSG), additional real world datasets are required for a more comprehensive evaluation of our method. As for the artificial datasets, eight were systematically generated to match the eight combinations of the three dataset properties in this study. However, other dataset characteristics, such as multi-collinearity (Shepperd and Kadoda 2001) and heteroskedasticity (Pickard et al. 2001), may also influence the performance of NABE. Further studies on additional dataset characteristics are necessary to increase the external validity in this respect.

Moreover, the non-linear adjustment proposed in our study is based on artificial neural networks, but other types of non-linear approximators, such as Radial Basis Functions (Hardy 1971) and Support Vector Machines (Vapnik 1995), could also be employed as the non-linear adjustment. They may achieve better performance than ANN, since they have fewer parameters and possess regularization mechanisms to prevent the over-fitting problem confronting ANN. The reasons for choosing ANN in this study are that it has been widely accepted in the cost estimation literature, it has the flexibility to adapt to complex relationships, and it has the capability to process categorical inputs.

9 Conclusions and Future Works

Analogy based estimation is one of the most widely studied methods in the software cost estimation literature. Given a new project, the ABE system retrieves similar projects from its historical project database and derives the cost prediction from them. Adjustment of the retrieved solution is necessary since the adjustment recognizes the differences between the new project and the historical information, and refines the retrieved solution into the target solution. However, most published adjustment mechanisms are based on predetermined linear forms without the learning ability to adapt to more complex situations. In addition, these adjustment techniques are often restricted to numeric features, even though categorical features contain valuable information for improving cost estimation accuracy. Moreover, given that software project datasets often exhibit non-normal characteristics, it is hard to approximate the relationships among projects by linear adjustments.

To improve the adjustment mechanism, this paper proposes a more flexible non-linear adjustment with learning ability that incorporates categorical features. The Non-linearity adjusted Analogy Based Estimation (NABE) is implemented by adding a non-linear component (an Artificial Neural Network) onto the retrieved solution of the ABE system. The proposed NABE is validated on four real world datasets, with comparisons against the published linear adjusted ABEs and three well established methods: CART, ANN, and SWR. The results show that NABE generally achieves the best MMRE, PRED(0.25), and MdMRE values on the real world datasets.

To answer the question of under what conditions NABE is preferred, we generated eight artificial datasets to analyze the relationships between model accuracies and dataset characteristics (non-normality, proportion of categorical features, and dataset size). The analyses show that NABE performs significantly better than the other methods on the artificial datasets with severe non-normality and a large proportion of categorical features.

In the domain of cost estimation, the lessons learned from this study are as follows:

  • The non-linear adjustment to the ABE system is generally an effective approach for extending ABE’s flexibility on complex datasets and improving its accuracy.

  • NABE is likely to be a more accurate method than other types of ABE methods on datasets with a high degree of non-normality and a large proportion of categorical features.

  • On datasets with a relatively small size, a small proportion of categorical features, and moderate non-normality, NABE may not be the ideal option, since it is likely to achieve only equal accuracy to other ABE methods while having a more complex structure.

  • There are strong relationships between the success of NABE and dataset properties (non-normality, proportion of categorical features, and dataset size). Practitioners should therefore be aware of the tradeoffs among dataset properties, model complexity, and model accuracy when implementing NABE.

There are some limitations to NABE. The similarity measure and the retrieved solution function are pre-determined in this study; future work could investigate the sensitivity of the predictions to these ABE parameters. Moreover, additional real world datasets and additional dataset characteristics could be explored to enhance the external validity of the current work. Finally, other types of non-linear approximators, such as RBF and SVM, could be considered in future work.