1 Introduction

Planning, bidding, and budgeting of software development projects consider the effort required to complete the project as an important factor; bad effort estimates may lead to poor planning, low profitability, and consequently products of poor quality [23]. Software researchers have addressed the problem of effort estimation for software development projects since at least the 1960s, and the early models were based upon statistical regression [24].

Software estimation has been identified as one of the three great challenges for the half-century-old field of computer science [9], and several effort estimation techniques have been proposed and researched over the years [7, 35]. Researchers have aimed to (1) determine which technique has the greatest effort prediction accuracy or (2) propose new or combined techniques that could provide better estimates.

Software development estimation techniques can be classified into three general categories [35]:

  1. Expert judgment: It remains the predominant methodology of choice [5]. The term expert estimation is not clearly defined and covers a wide range of estimation approaches; a common characteristic is, however, that intuitive processes constitute a determinant part of the estimation [21]. This technique is characterized by a lack of analytical argumentation and by the frequent use of phrases such as “I think that …” and “I feel that …” [24], and it aims to derive estimates based on the experience of experts on similar projects. The means of deriving an estimate are not explicit and therefore not repeatable [35]. Psychological research on real-world quantitative expert estimation “has not culminated in any theory of estimation, not even in a coherent framework for thinking about the process” [24].

  2. Algorithmic models: These models have been very popular in the literature [8, 35]. They attempt to represent the relationship between effort and one or more characteristics of a project; the main cost driver in a model is usually taken to be some notion of software size (e.g., the number of lines of source code). Their general form is a linear regression equation, as that used by Kok [29], or a group of non-linear regression equations, as those used by Boehm in COCOMO 81 [2] and COCOMO II [4] (see the sketch after this list).

  3. Machine learning: Machine learning techniques have been used as a complement or an alternative to the previous two techniques in recent years [40, 46]. Fuzzy logic models are included in this category [32], as well as neural networks [40], genetic algorithms [47], genetic programming [10], regression trees [24], case-based reasoning [25], and associative memories [30].
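
The following minimal Python sketch contrasts the two typical algorithmic forms mentioned in item 2: a linear regression equation and a COCOMO-81-style power law. The COCOMO constants are the published basic-mode "organic" values; the linear coefficients are purely illustrative and are not taken from any of the cited models.

```python
# A minimal sketch contrasting the two typical algorithmic forms:
# a linear regression equation and a COCOMO-81-style power law.
# The COCOMO constants are the published basic-mode "organic" values;
# the linear coefficients are purely illustrative.

def linear_effort(loc, a=40.0, b=1.0):
    """Linear form: effort = a + b * size (size in LOC)."""
    return a + b * loc

def cocomo81_basic_organic(kloc):
    """Non-linear COCOMO 81 basic form (organic mode): person-months = 2.4 * KLOC^1.05."""
    return 2.4 * kloc ** 1.05

print(linear_effort(120))           # effort for a 120-LOC program (illustrative units)
print(cocomo81_basic_organic(32))   # person-months for a 32-KLOC project
```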

A primary conclusion of previous research is that no single technique is best for all situations and that a careful comparison of the results of several approaches is most likely to produce realistic estimates [3]. Based upon this conclusion, this paper compares the accuracies of a multiple linear regression (MLR), a general regression neural network (GRNN), and a fuzzy logic model (FLM). This comparison is based upon the two main stages involved when an estimation model is used [37]: (1) the model adequacy checking or model verification (estimation stage), that is, determining whether the model is adequate to describe the observed (actual) data; if so, then (2) the estimation model is validated using new data (prediction stage).

The hypothesis to be investigated in this paper is the following:

The effort prediction accuracy of a general regression neural network is statistically equal to or better than that obtained by a fuzzy logic model as well as by a multiple linear regression, when new and changed code and reused code obtained from short-scale programs developed with personal practices are used as independent variables.

The foundation for predicting the effort is based on the assumption that unless software engineers have the capabilities provided by personal training, they cannot properly support their teams or consistently and reliably produce quality products; this assumption comes from the Personal Software Process (PSP), whose practices and methods have been used by thousands of software engineers for delivering quality products on predictable schedules [43].

The capability maturity model (CMM) provides a description of the goals, methods, and practices needed in industrial software engineering practice, while the PSP allows their instrumentation at a personal level (twelve of the eighteen key process areas of the CMM are at least partially addressed in the PSP) [17]. This paper is based upon some PSP practices and takes into account the guidelines suggested in [27].

In this study, two measures related to software size (lines of code) as well as development time (effort) were gathered from 163 small programs developed by 53 programmers. From this set of programs, an MLR equation as well as an FLM was generated, and a GRNN was trained; their adequacy was then checked (verification). Afterward, these three models were validated by estimating the effort of 80 programs developed by another group of thirty programmers. Verification data of this research were gathered during 2005, 2006, and 2007, whereas validation data were obtained during 2008 and the first trimester of 2009.

1.1 Fuzzy logic

Newer, non-algorithmic computation techniques for cost estimation appeared in the 1990s; fuzzy logic, with its powerful linguistic representation, can represent imprecision in inputs and outputs while providing a more expert knowledge-based approach to model building [1].

A fuzzy model is a modeling construct featuring two main properties [42]: (1) It operates at a level of linguistic terms (fuzzy sets) and (2) it represents and processes uncertainty.

Estimation techniques have an important limitation, which arises when software projects are described using categorical data (nominal or ordinal scale) such as small, medium, average, or high (linguistic terms or values). A more comprehensive approach to dealing with linguistic values is to use fuzzy set theory [20, 34]. Specifically, fuzzy logic offers a particularly convenient way to generate a keen mapping between input and output spaces thanks to the natural expression of fuzzy rules [49].

In software development effort estimation, two considerations justify the decision to implement a fuzzy model: first, it is impossible to develop a precise mathematical model of the domain [31]; second, metrics only produce estimations of the real complexity. Thus, according to the previous assertions, formulating a small set of natural rules describing the underlying interactions between the software metrics and the effort estimation could readily reveal their intrinsic and wider correlations. Disadvantages of a fuzzy model could be that (1) it requires a lot of data, (2) the estimators must be familiar with the historically developed programs, and (3) it is not useful for programs much larger or smaller than those in the historical data [17]. In this research, only the third one represents a disadvantage, since the sample comprised 163 programs and those were well known to the estimator.

In this paper, the fuzzy system is constructed as a rule-based system that replaces crisp facts with fuzzy inputs; an inference engine then uses a base of rules to map the inputs to a fuzzy output, which can either be translated back to a crisp value (defuzzification) or left as a fuzzy value [44].

1.2 General regression neural network

There are some advantages to estimation by artificial neural networks [40]: (1) they allow learning from previous outcomes, and (2) they can model a complex set of relationships between the dependent variable (effort) and the independent variables (i.e., new and changed, as well as reused lines of code); moreover, they have been considered promising techniques for building predictive models because they are capable of modeling non-linear relationships [26]. However, there are some shortcomings: (1) the ability of neural networks to solve problems of high complexity has been proven in classification and categorization areas, whereas in the cost estimation field we deal with a generalization rather than a classification problem, and (2) there are no guidelines for the construction of neural network topologies (number of layers, number of neurons per layer, or initial weights).

This paper uses a general regression neural network (GRNN) whose principal advantages are (a) fast learning and (b) convergence to the optimal regression surface as the number of samples becomes very large. The GRNN has shown that even with sparse data in a multidimensional measurement space, the algorithm provides smooth transitions from one observed value to another [45].

Figure 1 shows the architecture of a GRNN [45]. The input units are merely distribution units that provide all of the scaled measurement variables X to all of the neurons on the second layer, the pattern units, each of which is dedicated to one exemplar or one cluster center. When a new vector X is entered into the network, it is subtracted from the stored vector representing each cluster center. Either the squares or the absolute values of the differences are summed and fed into a non-linear activation function, normally the exponential. The pattern units’ outputs are passed on to the summation units. The summation units perform a dot product between a weight vector and a vector composed of the signals from the pattern units. The summation unit that generates an estimate of f(X)K sums the outputs of the pattern units weighted by the number of observations each cluster center represents. The summation unit that estimates Y′f(X)K multiplies each value from a pattern unit by the sum of the samples Yj associated with cluster center Xi. The output unit merely divides Y′f(X)K by f(X)K to yield the desired estimate of Y. When estimation of a vector Y is desired, each component is estimated using one extra summation unit, which uses as its multipliers the sums of the samples of that component of Y associated with each cluster center Xi. Figure 1 thus depicts a feedforward network that can be used to estimate a vector Y from a measurement vector X.

Fig. 1 General regression neural network diagram
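
As an illustration of the behavior just described, the following Python sketch implements the core GRNN computation (pattern units with an exponential activation, summation units, and the final division). The training vectors, the smoothing parameter sigma, and the example call are placeholders, not data or settings from this study.

```python
import numpy as np

def grnn_predict(X_train, y_train, X_new, sigma=1.0):
    """Specht-style GRNN prediction: each training sample acts as a pattern unit,
    the exponential activation weights it, and the two summation units form the
    numerator and denominator whose ratio is the output estimate."""
    X_train = np.asarray(X_train, dtype=float)
    y_train = np.asarray(y_train, dtype=float)
    estimates = []
    for x in np.atleast_2d(np.asarray(X_new, dtype=float)):
        d2 = np.sum((X_train - x) ** 2, axis=1)            # squared distances to the stored centers
        w = np.exp(-d2 / (2.0 * sigma ** 2))               # exponential activation of the pattern units
        estimates.append(np.dot(w, y_train) / np.sum(w))   # summation units and final division
    return np.array(estimates)

# Illustrative call: two size measures (N&C and reused LOC) as inputs, effort as output.
# The data and the smoothing parameter sigma are placeholders, not values from this study.
X = [[100, 20], [250, 0], [180, 60]]
y = [150.0, 320.0, 230.0]
print(grnn_predict(X, y, [[150, 30]], sigma=50.0))
```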

1.3 Software measurement (independent variables)

In spite of the availability of a wide range of software product size measures, source lines of code (LOC) remains favored by many models [33, 35]. In fact, since attributes must be relevant to the effort estimation, researchers commonly correlate lines of code with effort [19].

There are two measures of source code size: physical source lines and logical source statements. The count of physical lines gives the size in terms of the physical length of the code as it appears when printed [39].

In this study, the independent variables of the models are new and changed (N&C) as well as reused code, and all of them are considered as physical lines of code (LOC). N&C is composed of added and modified code. The added code is the LOC written during the current programming process, while the modified code is the LOC changed in the base program when modifying a previously developed program. The base program is the total LOC of the previous program, while the reused code is the LOC of previously developed programs that are used without any modification [17].

A coding standard should establish a consistent set of coding practices that is used as a criterion when judging the quality of the produced code [17]. Hence, it is necessary to always use the same coding and counting standards. The programs developed in this study followed these guidelines; a minimal counting sketch is shown below.
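
The following Python sketch illustrates a counting standard for physical LOC, counting non-blank, non-comment physical lines. The comment prefixes are assumptions for illustration; the counting standard actually used in the study is the one summarized in Table 1.

```python
def count_physical_loc(source_text, comment_prefixes=("//", "#")):
    """Count physical LOC, omitting blank lines and whole-line comments.
    The comment prefixes are assumptions for illustration; the counting
    standard actually used in the study is the one summarized in Table 1."""
    count = 0
    for line in source_text.splitlines():
        stripped = line.strip()
        if not stripped:
            continue  # blank line: not counted
        if stripped.startswith(comment_prefixes):
            continue  # whole-line comment: not counted
        count += 1
    return count

print(count_physical_loc("a = 1\n\n# comment\nb = a + 1\n"))  # -> 2
```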

1.4 Accuracy criteria

It has been demonstrated that the magnitude of relative error, or MRE (a common criterion for the evaluation of cost estimation models), does not identify the best prediction model [14]. In accordance with [14], the implication of this finding is that the results and conclusions on prediction models over the past 15–25 years are unreliable and may have misled the entire software engineering discipline; therefore, the authors strongly recommend not using MRE to evaluate and compare prediction models, but rather the magnitude of error relative to the estimate, or MER, which was proposed in [28].

The MER is defined as follows:

$$ {\text{MER}}_{i} = \frac{{\left| {{\text{Actual} \,\text{Effort}}_{i} - {\text{Estimated} \,\text{Effort}}_{i} } \right|}}{{{\text{Estimated} \,\text{Effort}}_{i} }} $$

The MER value is calculated for each observation i whose effort is estimated. The aggregation of MER over multiple observations (n) can be achieved through the mean (MMER) as follows:

$$ {\text{MMER}} = (1/n)\sum\limits_{i = 1}^{n} {{\text{MER}}_{i} } $$

Intuitively, MER seems preferable to MRE since MER measures the error relative to the estimate. In [14], MMER produced better results than MMRE; this is the reason for using MMER in this study.

The accuracy of an estimation technique is inversely proportional to its MMER. In several papers, an MMRE ≤ 0.25 has been considered acceptable; however, the authors who proposed this value [11] neither reference any supporting studies nor provide any argumentation as evidence [22]. On the other hand, a reference for an acceptable value of MMER has not been found.

Another complementary criterion used in this study for evaluating the accuracy of the models is the mean square error (MSE), which is calculated over the n programs as follows:

$$ {\text{MSE}} = (1/n)\sum\limits_{i = 1}^{n} {\left( {{\text{Actual} \,\text{Effort}}_{i} - {\text{Estimated} \,\text{Effort}}_{i} } \right)^{2} } $$
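
The following Python sketch computes the three criteria exactly as defined above; the effort values in the example call are illustrative, not data from this study.

```python
import numpy as np

def accuracy_criteria(actual, estimated):
    """Compute MER per observation, MMER, and MSE as defined above."""
    actual = np.asarray(actual, dtype=float)
    estimated = np.asarray(estimated, dtype=float)
    mer = np.abs(actual - estimated) / estimated   # MER_i
    mmer = mer.mean()                              # MMER = (1/n) * sum(MER_i)
    mse = np.mean((actual - estimated) ** 2)       # MSE over the n programs
    return mer, mmer, mse

# Illustrative effort values (minutes), not data from this study.
mer, mmer, mse = accuracy_criteria(actual=[150, 320, 230], estimated=[130, 340, 250])
print(mmer, mse)
```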

1.5 Related work

Artificial neural networks and statistical regression have been investigated, and results have shown that both techniques are competitive with models generated from data of large programs [12, 13, 15]. The feedforward multi-layer perceptron with the back-propagation learning algorithm is the most commonly used in the effort estimation field [40]. However, no paper has been found in which a GRNN was used for predicting the software development effort of short-scale programs. One paper was found that uses a GRNN, but for predicting the number of defects in a class as well as the number of lines changed per class [36].

On the other hand, fuzzy logic has been used for predicting the development effort of large programs [1, 6, 16, 19, 39, 48], whereas for short-scale programs only one paper was found; however, it did not use both new and changed (N&C) and reused code as independent variables, but only one of them (N&C) [32].

1.6 Verification (estimating) and validation (predicting) of models

There are two main stages for using an estimation model [37]:

  1. Model adequacy checking (or model verification): The MMER is calculated for each model, and an analysis of variance (ANOVA) is performed to compare their MER values.

  2. Model validation: Once the adequacy of the models is checked, the effort of the newly gathered dataset is predicted.

This research involved a sample of 83 programmers divided into two groups: one for checking the adequacy of the models (53 developers with 163 programs) and the other (30 developers with 80 programs) for validating the models.

2 Experimental design

The experiment was done inside a controlled environment having the following characteristics:

  • Each programmer developed seven small programs. Seven was the number established because of the availability of developers (course duration is a principal concern for industrial organizations [18]).

  • Ten sessions were carried out per programmer. The first session was spent on the introduction to PSP and on studying and producing the coding and counting standards. From the second to the eighth session, one program was developed each day. Finally, the ninth and tenth days were assigned to producing final reports.

  • In all programs, the following artefacts were used by all developers [17]: coding standard, counting standard (see Table 1), defect-type standard, project plan summary, time recording log, defect recording log, and process improvement proposal. A test report template was introduced in the testing phase from the second through the seventh programs. A code review checklist was introduced in the third through seventh programs, and a design review checklist was used in the fourth through seventh programs. Thus, from the fourth program on, all practices and logs planned for this study were used by all the developers. Hence, the first, second, and third programs were excluded from this study; otherwise, the comparison of the development time results would have been unfair.

    Table 1 Counting standard
  • Developers had already received at least one course on the imperative programming language that they used throughout the assigned programs.

  • Developers were constantly supervised and advised about the process.

  • Each developer selected his/her own imperative programming language (such as C or Pascal), whose coding standard had the following characteristic: each compiler directive, variable declaration, constant definition, delimiter, assignment sentence, and flow control statement was written on its own line of code.

  • The code written in each program was designed by the developers to be reused in subsequent programs.

  • All programs were developed following these phases of the process: planning, algorithm design, coding, compiling, testing, and postmortem; from the fourth program on, two review phases were added. The development time considered for this study included only the following phases: algorithm design, design review, coding, code review, compiling, and testing.

  • The kinds of programs developed had a complexity similar to those suggested in [17]. From a set of 18 small programs, a subset of seven was randomly assigned to each programmer. A brief description of these 18 programs follows:

  1. Estimating the mean of a sample of n real numbers.

  2. Estimating the standard deviation of a sample of n real numbers.

  3. Adding two matrices of real numbers.

  4. Summing the diagonal of a square matrix of real numbers.

  5. Translating a numeric quantity to letters.

  6. Calculating the correlation (r) between two series of real numbers.

  7. Computing the linear regression equation parameters a and b (y = a + bX).

  8. Calculating z-values from a sample of real numbers.

  9. Calculating the size of a sample.

  10. Calculating the y-values from a sample of real numbers using the normal distribution equation.

  11. Calculating the standard error of the estimate (from y = a + bX).

  12. Calculating the coefficient of determination (r²) from a linear regression equation.

  13. Calculating both upper and lower limits of a sample of real numbers based upon its standard deviation and mean.

  14. Calculating the coefficient of variation of a distribution.

  15. Estimating values based upon the statistical empirical rule.

  16. Counting the physical lines of code of a program, omitting comments and blank lines.

  17. Both storing and searching records in a file.

  18. Both deleting and modifying records in a file.

3 Conducting the experiment

The 53 programmers developed 371 programs in total. One hundred and fifty-nine of the 371 were excluded because they corresponded to the first, second, or third program. Of the remaining 212 programs, 175 reused lines of code. In addition, 12 of those 175 programs were excluded because they presented errors in their time recording. Hence, the final sample for verifying the three models comprised 163 programs. Actual data and names by developer are depicted in Appendices 1 and 3, respectively.

3.1 Regression model

The following multiple linear regression equation, considering new and changed (N&C) as well as reused code, was generated from the data in Appendix 1:

$$ {\text{Effort}} = 44.7136 + \left( {1.08075 \times {\text{N\&C}}} \right)-\left( {0.1454 \times {\text{Reused}}} \right) $$

In accordance with [17], an r² ≥ 0.5 (coefficient of determination) is an acceptable value for predicting. In this case, the r² of this equation was 0.57. The ANOVA for this equation (Table 2a) shows a statistically significant relationship between the variables at the 99% confidence level.

Table 2 Multiple linear regression analysis

However, the ANOVA was not sufficient. To determine whether the model could be simplified, a parameter analysis of the multiple linear regression was done. Table 2b depicts the results of this analysis; the highest p-value among the independent variables is 0.0239, belonging to reused code. Since this p-value is less than 0.05, reused code is statistically significant at the 95% confidence level. Consequently, the reused code variable was not removed and had to be considered for evaluation in the GRNN as well as in the fuzzy logic model. A sketch of how such a fit and parameter analysis can be reproduced is shown below.
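
As a hedged illustration, the following Python sketch shows how such a multiple linear regression, its r², and the parameter p-values could be reproduced with the statsmodels library; the arrays stand in for the Appendix 1 data and are not the actual observations.

```python
import numpy as np
import statsmodels.api as sm

# Illustrative arrays standing in for the Appendix 1 data
# (N&C LOC, reused LOC, and effort in minutes); not the actual observations.
nc     = np.array([100, 250, 180,  90, 300], dtype=float)
reused = np.array([ 20,   0,  60, 150,  40], dtype=float)
effort = np.array([150, 320, 230, 110, 370], dtype=float)

X = sm.add_constant(np.column_stack([nc, reused]))  # columns: intercept, N&C, Reused
model = sm.OLS(effort, X).fit()

print(model.params)    # intercept and coefficients (cf. the equation in Sect. 3.1)
print(model.rsquared)  # coefficient of determination r^2 (acceptable if >= 0.5 [17])
print(model.pvalues)   # parameter p-values used to decide whether Reused can be dropped
```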

3.2 Fuzzy rules

The term “fuzzy identification” usually refers to the techniques and algorithms for constructing fuzzy models from data. There are two main approaches for obtaining a fuzzy model from data [48]:

  1. Expert knowledge is translated in a verbal form into a set of if–then rules. A certain model structure can be created, and parameters of this structure, such as membership functions and weights of rules, can be tuned using input and output data.

  2. No prior knowledge about the system under study is initially used to formulate the rules, and a fuzzy model is constructed from data based on a certain algorithm. It is expected that the extracted rules and membership functions can explain the system behavior. An expert can modify the rules or supply new ones based upon his or her own experience; expert tuning is optional in this approach.

This paper is based upon the first approach. The fuzzy rules were formulated as follows:

  1. Correlation analysis between N&C and effort. Figure 2 shows that the higher the value of N&C, the higher the effort; the correlation value (r) is 0.74. From this correlation pattern, the following rules can be formulated:

    Fig. 2 Effort-N&C scatter plot (r = 0.74)

    • Rule 1: If (New&Changed is Small), then Effort is Low

    • Rule 2: If (New&Changed is Big), then Effort is High

  2. Correlation between reused code and effort from Appendix 1, with the table ordered by N&C. Data of N&C, reused code, and effort from Appendix 1 were put in ascending order (with N&C as the first ordering criterion). The data were then divided into two subsamples, A and B: A comprised the programs with smaller N&C, and B the larger ones. Then, the reused code of subsamples A and B (the first 81 programs and the last 82 programs, respectively) was correlated with effort. Figure 3a, b depicts these two scatter plots. Figure 3a shows that whether reused code is small or big, the effort tends to be low in relation to the programs of Fig. 3b, while Fig. 3b shows that whether reused code is small or big, the effort tends to be higher in relation to the programs of Fig. 3a.

    Fig. 3 Effort-Reused code scatter plots

Hence, rules 1 and 2 can be completed as follows:

  • Rule 1: If (New&Changed is Small) and (Reused is Small or Reused is Big), then Effort is Low

  • Rule 2: If (New&Changed is Big) and (Reused is Small or Reused is Big), then Effort is High

Implementing a fuzzy system requires that the different categories of the different inputs be represented by fuzzy sets, which in turn are represented by membership functions (MF) [1]. The MF type considered in this experiment was triangular because it has demonstrated better accuracy than other types, such as Gaussian and trapezoidal [48].

A triangular MF is a three-parameter function defined by its minimum (a), maximum (c), and modal (m) values, that is, MF(a, m, c) where a ≤ m ≤ c. It is zero outside the interval [a, c], rises linearly to 1 at the modal value m, and falls linearly back to zero:

$$ {\text{MF}}(x) = \begin{cases} 0 & \text{if } x \le a \\ \dfrac{x - a}{m - a} & \text{if } a < x \le m \\ \dfrac{c - x}{c - m} & \text{if } m < x < c \\ 0 & \text{if } x \ge c \end{cases} $$

The values of the MF parameters for the fuzzy model were then defined. Starting from values close or equal to the minima and maxima of both program code sizes and efforts, the parameters were iteratively adjusted until the best (smallest) possible MMER was obtained. Table 3 shows the final values of the model parameters; a sketch of how the rules and triangular MFs can be evaluated follows Table 3.

Table 3 Parameters of the fuzzy model membership functions
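
As an illustration only, the following Python sketch evaluates Rules 1 and 2 with triangular MFs and centroid defuzzification; the (a, m, c) parameters used here are placeholders, not the tuned values reported in Table 3.

```python
import numpy as np

def tri_mf(x, a, m, c):
    """Triangular membership function MF(a, m, c) with a <= m <= c."""
    if x <= a or x >= c:
        return 0.0
    return (x - a) / (m - a) if x <= m else (c - x) / (c - m)

def fuzzy_effort(nc, reused, p):
    """Mamdani-style evaluation of Rules 1 and 2 with centroid defuzzification.
    The (a, m, c) triples in `p` are placeholders, not the values of Table 3."""
    # Antecedents: min for AND, max for OR.
    reused_any = max(tri_mf(reused, *p["reused_small"]), tri_mf(reused, *p["reused_big"]))
    r1 = min(tri_mf(nc, *p["nc_small"]), reused_any)   # Rule 1 -> Effort is Low
    r2 = min(tri_mf(nc, *p["nc_big"]), reused_any)     # Rule 2 -> Effort is High
    # Clip the output sets, aggregate, and defuzzify by centroid.
    y = np.linspace(0.0, 600.0, 601)
    low = np.minimum(r1, [tri_mf(v, *p["effort_low"]) for v in y])
    high = np.minimum(r2, [tri_mf(v, *p["effort_high"]) for v in y])
    agg = np.maximum(low, high)
    return float(np.sum(agg * y) / np.sum(agg)) if agg.sum() > 0 else float("nan")

params = {"nc_small": (0, 50, 200), "nc_big": (100, 400, 700),
          "reused_small": (0, 30, 150), "reused_big": (80, 300, 600),
          "effort_low": (0, 100, 300), "effort_high": (200, 450, 600)}
print(fuzzy_effort(nc=120, reused=40, p=params))
```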

4 Analysis

4.1 Model adequacy checking (model verification)

The multiple linear regression equation (Sect. 3.1), the fuzzy logic model (Sect. 3.2), and the general regression neural network (Sect. 1.2) were applied to the original data set, and the MER by program as well as the MMER by technique (model) was then calculated (see Appendix 1). The MMER results by technique were the following: multiple linear regression = 0.27, general regression neural network = 0.24, and fuzzy logic model = 0.25.

The ANOVA for the MER of the programs (Table 4) shows that there is no statistically significant difference among the prediction accuracies of the three techniques at the 95.0% confidence level. This result can be interpreted graphically from the means plot of Fig. 4b. In addition, in accordance with the Shapiro–Wilk test, a normal probability plot of the residuals should be roughly linear, as shown in Fig. 4a; a sketch of this analysis is shown after Fig. 4.

Table 4 MER ANOVA (verification of models)
Fig. 4 Plots of MER ANOVA (verification stage)
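
As a hedged illustration, the following Python sketch shows how the MER ANOVA and the Shapiro–Wilk normality check could be reproduced with SciPy; the MER arrays are synthetic placeholders, not the Appendix 1 results.

```python
import numpy as np
from scipy import stats

# Synthetic MER arrays for the three models (placeholders for the Appendix 1 results).
rng = np.random.default_rng(0)
mer_mlr  = rng.gamma(2.0, 0.135, size=163)
mer_grnn = rng.gamma(2.0, 0.120, size=163)
mer_flm  = rng.gamma(2.0, 0.125, size=163)

# One-way ANOVA comparing the MER of the three techniques.
f_stat, p_value = stats.f_oneway(mer_mlr, mer_grnn, mer_flm)
print(f"ANOVA p-value: {p_value:.4f}")  # > 0.05 means no significant difference at 95% confidence

# Shapiro-Wilk test on the ANOVA residuals (each observation minus its group mean).
residuals = np.concatenate([g - g.mean() for g in (mer_mlr, mer_grnn, mer_flm)])
w_stat, p_norm = stats.shapiro(residuals)
print(f"Shapiro-Wilk p-value: {p_norm:.4f}")
```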

In accordance with the MSE criterion, the ANOVA for the models showed a p-value of 0.005; that is, there was a statistically significant difference among the MSE of the models at the 95.0% confidence level. The means plot of Fig. 5 shows that the GRNN (which had the lowest MMER) differs from the FLM, whereas the GRNN does not differ from the MLR.

Fig. 5 Plot of MSE ANOVA (verification stage)

4.2 Model validation

Another group of thirty programmers developed 210 programs. Ninety of the 210 were not considered because they corresponded to the first, second, or third program. Eighty of the remaining 120 programs constituted the sample for validating the models (after selecting only the programs that reused code and excluding the programs with errors in their time recording). Actual data by developer are depicted in Appendix 2.

The three models for estimating the effort were applied using the data depicted in Appendix 2. The MMER results by technique were the following: multiple linear regression = 0.29, general regression neural network = 0.31, and fuzzy logic model = 0.31. In accordance with the MER ANOVA for the models (Table 5), there is no statistically significant difference among the prediction accuracies of the three models at the 95.0% confidence level (Fig. 6b). Figure 6a shows that the normality assumption on the residuals of this ANOVA is met (Shapiro–Wilk test).

Table 5 MER ANOVA (validation of models)
Fig. 6 Plots of MER ANOVA (validation stage)

The MSE ANOVA for the models showed a p-value of 0.8627; that is, there was no statistically significant difference among the MSE of the FLM, GRNN, and MLR models at the 95.0% confidence level.

5 Discussion

This study is related to software engineering education and training (SEET). Discipline in the software process is one of the SEET goals. In accordance with [43], the performance of a development organization is determined by the performance of its engineering teams; the performance of an engineering team is determined by the performance of the team members; and the performance of the engineers is, at least in part, determined by the practices these engineers follow in doing their work. This assertion is also supported by [17], in the sense that unless software developers have the capabilities provided by personal training, they cannot properly support their teams or consistently and reliably produce quality products. One of those practices is software development effort prediction, which can be started at the individual level by developing components that can later be integrated when large systems are built. Effort prediction is one of the three main practices used for training developers at the personal level (the other two are related to software defects and software size [43]). Prediction techniques that have been applied at the personal level include expert judgment, statistical regression, and fuzzy logic, and the discipline of using these techniques has even been scaled up to developing large systems. Because no single technique is best for all situations and a careful comparison of the results of several approaches is most likely to produce realistic estimates [3], this paper applied an additional one: a GRNN. Neural networks have already been applied for predicting effort [12, 13, 15, 40]; however, none of them has been a GRNN. The results of this paper could encourage researchers to analyze the prediction accuracy when a GRNN is applied for predicting the effort not only of software components developed at the individual level, but also of large systems built by teams of developers. In addition, the models of this research used two measures of software size, whereas future research could involve applying the GRNN with more independent variables related to large systems (such as those considered by statistical regression models [4]).

6 Conclusions and future research

Levels of software engineering education and training could be classified as in the small and in the large; this paper focused on individual training based on short-scale programs and using the PSP, whose practices and methods are used for delivering quality products on predictable schedules. The paper focused on the following practice: estimation and prediction of software development effort. An accuracy comparison among a multiple linear regression (MLR), a general regression neural network (GRNN), and a fuzzy logic model (FLM) was presented. In this research, a total of 163 programs were developed by a group of 53 programmers. Using the data gathered from these programs (two types of lines of code as well as effort), three models were applied for estimating the effort. The models were then validated with data gathered from eighty programs developed by another group of thirty developers. All programs were developed with personal practices from the PSP. In the verification and validation stages, the results of the two machine learning techniques were compared with those of the multiple linear regression. This comparison was based on MER as well as on MSE. In the validation stage, the GRNN showed no statistically significant difference from the FLM and the MLR. These results allow accepting the following hypothesis: the effort prediction accuracy of a general regression neural network is statistically equal to that obtained by a fuzzy logic model as well as by a multiple linear regression, which suggests that a GRNN could be used for predicting the development effort when new and changed code and reused code obtained from short-scale programs developed with personal practices are used as independent variables.

Future research involves comparing predictive accuracy when applying a GRNN to data from large systems built by teams of developers, as well as using classifiers and associative memories based on datasets from individual and team projects.