1 Introduction

Experts estimate that in 2015 over 70 million cars will be sold (Zimmermann and Hauser 2004). At the same time, the European Commission calls for a 50% reduction in traffic deaths. The same experts further estimate that 70% of future innovation will be software-based. With the increasing number of cars, software therefore becomes more and more important. In recent years, 50% of all vehicle recalls were software related.

One major challenge in the development of large-scale software products for the automotive industry is to balance quality assurance (QA) against product costs, i.e. to develop high-quality products with a low number of defects while still keeping costs low. Short development life cycles stand in contrast to long product life cycles, a fact which reinforces the demand for failsafe innovative products.

An earlier Software Process Model (SPM) (Schulz et al. 2011) describes the influence of changes on a software product in terms of the potential defect correction effort (DCE) remaining after development. The SPM defines a defect cost factor (DCF) that reflects the error-proneness of a specific feature of the software product. The SPM estimates the DCE based on various characteristics of software changes within a specific software release.

We found that defects detected in a development phase other than the one in which they were introduced require a different overall DCE than defects detected and corrected within their phase of origin. Therefore, to reduce the DCE of the software product, we assess every development phase along with its specific DCE characteristic and focus on the shift of defects through development phases, i.e. on the defects that are detected and corrected in a different phase than the one they were introduced in.

The main goal of this paper is to develop a model that can predict the DCE at various development phases. We have built a completely new Defect Cost Flow Model (DCFM) that reflects the V-model of a software development life cycle, a real engineering process for the development of embedded applications in the automotive industry. With the DCFM, it is possible to identify the process areas where optimization leads to higher quality and lower costs. The DCFM estimates the DCE based on key performance indicators (KPIs) representing product, process and project aspects. With the DCFM it is possible to assess the effort spent on defect correction in comparison to the effort spent on development throughout every phase of the development process. To ensure meaningful results at a sufficient level of detail and the ability to calibrate the model with industrial data, the DCFM represents a single high-level function of the software product.

Technically, the DCFM is a Bayesian Network (BN), which incorporates both process data and expert knowledge within a single model. We have selected Bayesian Networks as a formal representation of the DCFM for a variety of reasons. Most importantly, BNs may incorporate expert knowledge combined with empirical data encoded in cause-effect relationships. They also enable various types of analyses using rigorous probability calculus focused on decision support. Section 2.4 provides further motivations for using BNs.

We have discussed some brief details of the model and its validation in an earlier study (Schulz et al. 2010). The current paper extends the previous one by providing significantly more details on the rationale for the model, its structure, validation and practical considerations. Specifically, the primary contributions of this paper are:

  1.

    A Defect Cost Flow Model—a Bayesian Network predicting the effort, especially for defect correction, in software projects at various stages of the development process.

  2.

    Validation of the DCFM—which not only confirms that the model can be successfully used in industrial decision support, but also discusses the mechanics encoded in the model, which may be used in future work focused on developing similar models.

  3.

    Empirical data on the underlying software development process, specifically concerning the effectiveness of quality reviews, i.e. the effects of such reviews on the reduction of defect correction effort.

  4.

    Discussion of the practical benefits of applying the model in industrial-scale projects.

  5.

    General discussion of the background and motivations for this study – these can also prompt future research in the field.

  6.

    Discussion on how the DCFM can be extended, calibrated and adjusted to specific industrial needs and environments.

The rest of this paper is organized as follows: in Section 2 we provide the background and motivations for this study, including an introduction to Bayesian Networks and a discussion of related work. Section 3 describes the research procedure we followed. Section 4 presents the most important part of this paper, the main result: the Defect Cost Flow Model. Section 5 discusses the results of the DCFM validation. In Section 6 we demonstrate how the model may be enhanced and customized for specific industrial needs. Section 7 considers the limitations and threats to the validity of the study. Section 8 draws conclusions and discusses future work.

2 Background and Motivation

2.1 Software Development Process in the Automotive Industry

The demand for software products in the automotive industry has been exponentially increasing over the last decades, making software engineering one of its critical success factors (Lee et al. 2007). The extensive use of electronic control units (ECUs) in vehicles today has already led to an average of 20 ECUs in smaller cars and even more than 70 ECUs in upper-class cars (Zimmermann and Hauser 2004). The challenges to be met are to develop complex features with short time to market and minimal engineering costs at very high quality.

Furthermore, embedded software is subject to domain specific characteristics, especially:

  • Code efficiency due to limitations of processor and memory resources,

  • Code portability to support the product line approach for variant handling,

  • Code understandability and maintainability to enable multiple, distributed engineering teams,

  • High availability and reliability to support safety critical applications,

  • Real-time operation,

  • Network operation to support distributed systems.

Given these aspects, effort estimation including the effort needed for defect correction already plays a major role in the domain of automotive software engineering.

Standardized engineering processes are crucial to continuously improve and adapt to the market’s demands. The engineering process behind the model presented here is based on the V-Model (IABG 1992), an international standard for developing information technology systems. Inside the automotive industry, the V-Model has been enhanced for decades to reflect the needs for development of high quality failsafe products. It describes different development phases in a software release life cycle focusing on high-level system specification and testing, as well as module implementation and testing.

Figure 1 illustrates the V-Model.

Fig. 1 Stages of software development process in V-Model

Car manufacturers, who are internal customers from the perspective of the development process we are investigating, demand new functionalities in their cars. Adding functionality in the form of new features, or changing and removing existing functionality, is handled in the requirements engineering (RE) phase. In the design (DE) phase, features are allocated to software components. The succeeding implementation (IM) phase is responsible for developing the code base. All engineering artifacts (requirements, design and source code) are subject to phase-specific QA measures. The V-Model's right side is related to QA measures, mainly to integration and testing (I&T) based on test specifications from the RE, DE and IM phases. Module tests are performed after implementation. These modules are integrated into software components, and components are integrated into the overall software product in the so-called integration phase. Next, the software is integrated into the ECU to perform high-level functional tests, forming the base for reviews and tests carried out by system engineers. Finally, the product has to pass acceptance tests by the customer (CU).

2.2 Flow of Defects

The Defect Flow Model (DFM) represents defects based on where they are created and where they are found. It is possible to assess every development phase separately with the help of the DFM to identify areas where QA measures are to be focused on. The DFM uses the number of defects as an indicator for the performance of a development phase.

Figure 2 illustrates a DFM based on a data set from Fraunhofer IESE (Klaes et al. 2007). It shows defects introduced and detected within the V-model phases RE, DE, IM and I&T. The data set contains 345 defects in total, distributed over all development phases: 55 defects were introduced in phase RE, 140 in phase DE, 125 in phase IM, and 25 in the final phase I&T. Furthermore, the DFM shows a total of 297 detected defects, again distributed over all development phases. Here, 40 defects were detected in phase RE, e.g. with the help of requirements reviews. In phase DE, 91 defects were detected, 98 in phase IM and an additional 68 defects in phase I&T. The last phase, CU, indicates the residual 48 defects detected by the customer.

Fig. 2 Flow of defects

There are several other views on the DFM, e.g. with information about the origin of every residual defect in every phase. According to Stolz and Wagner (2004), the DFM has proven its capability to monitor and improve the quality processes in the domain of software development for automotive applications. Our DCFM is based on the DFM, a measurement system supporting the quantitative evaluation of QA measures in an engineering process (Stolz and Wagner 2004). The DFM is derived from the orthogonal defect classification concept for process measurements (Chillarege et al. 1992). Its main goal is to provide transparency on phase-specific defect rates, i.e. the number of defects per unit of software size.
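The bookkeeping behind the DFM is simple to reproduce. The sketch below uses only the phase counts quoted above from the Fraunhofer IESE data set; the forward-flow accounting (defects not detected in a phase carry over to the next) is our own illustration:

```python
# Defect counts per V-model phase as quoted in the text; the 48 defects
# detected by the customer (CU) are the residual after all development phases.
introduced = {"RE": 55, "DE": 140, "IM": 125, "I&T": 25}
detected   = {"RE": 40, "DE": 91,  "IM": 98,  "I&T": 68}

residual = 0
for phase in ["RE", "DE", "IM", "I&T"]:
    # Defects flow forward: whatever is not detected in a phase
    # remains in the product and carries over into the next phase.
    residual += introduced[phase] - detected[phase]
    print(f"after {phase}: {residual} defects remaining")

print("introduced in total:", sum(introduced.values()))   # 345
print("detected during development:", sum(detected.values()))
print("escaped to the customer:", residual)
```

Running this confirms the internal consistency of the data set: the residual after I&T equals the 48 customer-detected defects.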

2.3 Motivation for the Flow of Defect Correction Costs

Defect correction is a stage where released features are corrected if they do not meet their requirements. Defect costs can be derived from the defect correction effort needed to complete a product. The amount of defect correction is often estimated based on the number of defects. However, the number of defects may only be an indicator for the actual defect correction effort. Since it is such a simple measure, it cannot quantify the actual amount of defect correction: a simple defect might be fixed in hours, whereas another defect might take days of analysis and fixing. Furthermore, defects introduced in early development phases (RE or DE) but detected in later phases (e.g. I&T) require revisiting these earlier phases. The correction of such defects is much more costly than that of defects introduced, found and corrected in a single phase.

The DCFM uses the potential defect correction effort as one indicator to describe a product’s defectiveness. A second indicator, the development effort, is needed as a reference value for the DCE to indicate its impact. This leads to the definition of the defect cost factor (DCF) representing DCFM’s second indicator for software defectiveness:

$$ DCF = \frac{\text{defect correction effort}}{\text{development effort}} $$

Where:

  • Defect correction effort is the effort used to fix a defective implementation.

  • Development effort represents the effort needed for implementation.

This implies that with increasing development effort, the potential effort needed for later fixing increases proportionally. We define a DCF for every product feature because features differ in many aspects, for example in their requirement complexity and volatility, the resulting code complexity and finally their testability; e.g. the probability of creating a defect is lower for implementing parameter changes than for a complex state machine used to realize HMI logic. Considering effort, the feature “parameter dispatcher” might only have 10 h of DCE in relation to 1000 h of development. This represents a low DCF of 0.01. In contrast, the feature “HMI” might need 1500 h of DCE and 1000 h of development effort. Consequently, the DCF for the HMI feature would be higher at 1.5.
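The two hypothetical features above translate directly into the DCF definition; a trivial sketch:

```python
def dcf(defect_correction_effort, development_effort):
    """Defect cost factor: defect correction effort relative to
    development effort (both in hours)."""
    return defect_correction_effort / development_effort

# Values taken from the two hypothetical features discussed in the text.
dcf_parameter_dispatcher = dcf(10, 1000)   # low error-proneness
dcf_hmi = dcf(1500, 1000)                  # high error-proneness

print(dcf_parameter_dispatcher)  # 0.01
print(dcf_hmi)                   # 1.5
```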

Summarizing, the DCF allows quantifying the defect correction effort for specific features and development phases. Thereby, it supports the goal of identifying areas for intensifying QA measures focused on cost reduction. Furthermore, it enables determining the return on an investment measured directly from its cost.

2.4 Motivation for Bayesian Networks

In the extensively studied area of software effort prediction, several approaches have been used. These include successful applications of techniques such as neural networks, rule-based models, decision trees, system dynamics or estimation by analogy (Jørgensen and Shepperd 2007; Mair et al. 2000; Zhang and Tsai 2003). Most existing research and practical applications in this area are data-driven. This means that algorithms or models are typically highly dependent on the availability of empirical data in the required volume and granularity, which causes problems in selecting the most appropriate technique for a given analytical environment (i.e. a dataset). This can be partially solved by approaches that dynamically assess the suitability of a particular technique (Song et al. 2011). While we agree that there is a great demand for such results, in this study we opt for a different point of view.

We have found that such machine learning techniques cannot be applied successfully in our context, where only little empirical data is available. Although we have huge databases, the meaning of the data changes over time, even for datasets carrying the same name. This is due to several activities, which can be summarized as a continuous improvement process. For example, several years ago the process phase “module test” only covered hand-written module test cases, whereas currently test frameworks are used to create automatic module tests. Therefore, not only did the effort needed for testing change, but so did defect detection rates. In such cases, based on previous research (Fenton et al. 2004; Fenton et al. 2008a; Fenton et al. 2008b; Herbsleb 2010), we believe that predictive models may be constructed if they are primarily based on expert knowledge and merged with some empirical data. After a comparative analysis of various modeling techniques, we have found that Bayesian Networks are the technique most relevant for our needs. First, they enable a model to incorporate both expert knowledge and empirical data, encoded in the model structure and parameter definitions. Second, a model may contain causal relationships between variables to reflect background knowledge of the development process and the flow of defects. Third, BNs explicitly capture the uncertainty about variables through probabilistic variable definitions. As a result, predictions take the form of probability distributions, which are more informative than point values. Fourth, the model can be calculated with missing data: variables with an observation entered are treated as predictors to calculate the output variables (i.e. those without observations). Additionally, BNs offer both forward and backward inference. Further, BNs can be presented graphically to increase understandability and clarity.

2.5 Formal and Illustrative Introduction to Bayesian Networks

Formally, a Bayesian Network is a probabilistic model, which consists of:

  • A directed acyclic graph with nodes \( V = \{v_1, \ldots, v_n\} \) and directed links E,

  • A set of random variables X, represented by the nodes in the graph,

  • A set of conditional probability distributions P, containing one distribution \( P(X_v \mid X_{pa(v)}) \) for each random variable \( X_v \in X \),

where pa(v) denotes the set of parents of a node v. A Bayesian Network encodes a joint probability distribution P(X) over the set of random variables X. The set of conditional probability distributions P specifies a multiplicative factorization of the joint probability distribution over X: \( P(X) = \prod_{v \in V} P(X_v \mid X_{pa(v)}) \) (Jensen and Nielsen 2007; Kjærulff and Madsen 2008).

BN concepts (and the term itself) were introduced in the 1980s in pioneering work by Pearl (1985, 1988). Since then, BNs have become a well-established modeling technique, which has been applied in fields as diverse as medicine, biology, chemistry, physics, law, management and computer science.

Before we discuss in detail the Defect Cost Flow Model, which is the focal point of this paper, we will show an example of a very simple BN model to illustrate what BNs are and how they can be used. Figure 3 illustrates the structure of such a simple model, including its Conditional Probability Tables (CPTs). The model consists of five variables, which, for simplicity, are all binary with states “low” and “high”. A root node, development effort, has a probability distribution unconditional on other nodes. It has been defined as P(“low”) = 0.4 and P(“high”) = 0.6. It may reflect the frequency of development effort in past projects (based on past data) or an expert's belief about the anticipated amount of development effort for the project investigated (expert knowledge). Note that this distribution, although unconditional on other nodes, is formally still conditional on available knowledge and data; for details see the discussion in (Radlinski 2010; Winkler 2003).

Fig. 3 Structure of a simple Bayesian network

The distribution for the review effort states that it is proportional to the amount of development effort. The distribution for quality after development is more complex because this variable has two parents. It states that achieving high quality after development requires both a high development effort and a high review effort, and that the review effort is the more important of the two. In this example, the testing effort is inversely proportional to the quality after development, based on the notion that there is no need to spend much time on testing good software. Finally, high quality after testing can be achieved with high quality after development and high testing effort.

Let us illustrate how such a model can be used. In the first scenario (baseline) we assume that the amount of development effort is high. The model propagates this observation to all nodes without entered observations and calculates posterior distributions for them (Fig. 4). For example, the model predicts that P(quality after development = high) = 82% and that P(quality after testing = high) = 76.16%. Let us further assume that in addition to the high development effort we also allocate a high review effort (scenario: high review effort). In this case, the model predicts that software quality should increase and achieve the level “high” with a probability of 90% (after development) and 79.2% (after testing). Finally, let us assume that in addition to the previous observations we enter that the testing effort is high (scenario: high review and testing effort). Now, the model predicts a further increase in the expected high quality after testing. However, it also predicts that the quality after development is likely to be lower. The explanation is related to the definition of the distribution for testing effort. Normally, high testing effort is required for software with low quality after development. Because we entered an observation about high testing effort, the model correctly believes that the previously predicted high quality after development is actually lower.

Fig. 4 Results for example analysis using a simple Bayesian network

This simple analysis illustrates part of the great analytical potential of Bayesian Networks. Further examples of similar analyses with models for software engineering are available in (Fenton et al. 2004; Fenton et al. 2008a; Fenton et al. 2008b; Radlinski 2008).
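The mechanics of such a toy model can be imitated in a few lines of code. The sketch below performs exact inference by enumerating the joint distribution of five binary nodes; all CPT values are illustrative stand-ins chosen by us (not the values from Fig. 3), so the resulting numbers differ from those quoted above, but the qualitative behavior, including the backward inference on quality after development, is the same:

```python
from itertools import product

# Toy five-node BN (True = "high", False = "low").  All CPT values below
# are illustrative stand-ins, not the values of the published model.
def p_dev(d):
    return 0.6 if d else 0.4

def p_rev(r, d):
    # Review effort is proportional to development effort.
    ph = 0.8 if d else 0.3
    return ph if r else 1.0 - ph

def p_qdev(q, d, r):
    # Quality after development: review effort matters more than dev effort.
    ph = {(True, True): 0.9, (True, False): 0.5,
          (False, True): 0.7, (False, False): 0.2}[(d, r)]
    return ph if q else 1.0 - ph

def p_test(t, q):
    # Testing effort is inversely proportional to quality after development.
    ph = 0.3 if q else 0.9
    return ph if t else 1.0 - ph

def p_qtest(q2, q, t):
    ph = {(True, True): 0.95, (True, False): 0.8,
          (False, True): 0.5, (False, False): 0.1}[(q, t)]
    return ph if q2 else 1.0 - ph

def query(target, evidence):
    """P(target = 'high' | evidence) by brute-force enumeration of the joint."""
    num = den = 0.0
    for d, r, qd, t, qt in product([True, False], repeat=5):
        world = {"dev": d, "review": r, "q_dev": qd, "test": t, "q_test": qt}
        if any(world[k] != v for k, v in evidence.items()):
            continue
        p = (p_dev(d) * p_rev(r, d) * p_qdev(qd, d, r)
             * p_test(t, qd) * p_qtest(qt, qd, t))
        den += p
        if world[target]:
            num += p
    return num / den

baseline = query("q_test", {"dev": True})
with_review = query("q_test", {"dev": True, "review": True})
# Backward inference: observing high testing effort lowers the belief in
# high quality after development (from 0.9 down to 0.75 with these CPTs).
back = query("q_dev", {"dev": True, "review": True, "test": True})
print(baseline, with_review, back)
```

Adding the high review effort observation increases the predicted quality after testing, while the high testing effort observation pulls the belief about quality after development down, mirroring the scenarios discussed above.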

2.6 Tackling Complexity of Variable Definition in BNs

The simple BN model discussed earlier contains very few variables, with only two states for each of them. As a result, the CPTs are quite simple and contain very few entries. As illustrated in Fig. 3, a CPT for a binary node with two parents that are also binary contains eight values (2 × 2 × 2). The complexity of CPTs increases exponentially with the number of parents and the number of states. For example, a three-state node with two parents, where each parent also has three states, has a CPT with 27 entries (3 × 3 × 3). Manual elicitation of such a CPT based on expert knowledge is difficult and prone to inconsistencies. With even more states and/or parents, manual elicitation of CPTs becomes extremely difficult. There is ongoing research on tackling the complexity of the variable/model definition process. Some of these studies are related to supporting the definition of CPTs only (Das 2004; Fenton et al. 2007a; Helsper et al. 2005; O’Hagan et al. 2006; Pfautz et al. 2007; Wiegmann 2005), while others are focused on the definition of both the structure and the CPTs (Kraaijeveld et al. 2005; Nadkarni and Shenoy 2004; Skaanning 2000).
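The growth of CPT size is easy to quantify: one value is needed per node state per combination of parent states. A minimal helper (hypothetical, not part of any BN tool):

```python
from math import prod

def cpt_entries(node_states, parent_state_counts):
    """Number of CPT entries: one value per node state
    per combination of parent states."""
    return node_states * prod(parent_state_counts)

print(cpt_entries(2, [2, 2]))     # binary node, two binary parents: 8
print(cpt_entries(3, [3, 3]))     # three-state node, two three-state parents: 27
print(cpt_entries(5, [5, 5, 5]))  # five-state node, three five-state parents: 625
```

The last line illustrates why manual elicitation quickly becomes infeasible as states and parents are added.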

Our DCFM model, discussed in the following sections, contains only a few ranked nodes (with five states); all other nodes are numeric and need to be discretized into sets of mutually exclusive intervals. Defining a high number of intervals ensures high precision in calculations but makes CPT definition difficult. To solve this problem we have followed an approach where a CPT for numeric nodes is defined as a mathematical expression that uses the distributions of the parent nodes as parameters.

We have also used the concept of “partitioned expressions”, which allows defining different expressions for a single node of interest depending on the state of the parent node. For example, let us consider two variables: amount of QA effort, expressed on a 3-point ranked scale, and defect detection rate, expressed on a numeric scale. Defect detection rate may be defined as a child of amount of QA effort using the following partitioned expressions:

  • For amount of QA effort = ‘low’ as Normal(μ = 0.3, σ² = 0.1),

  • For amount of QA effort = ‘medium’ as Normal(μ = 0.6, σ² = 0.05),

  • For amount of QA effort = ‘high’ as Normal(μ = 0.8, σ² = 0.05).
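A partitioned expression of this kind can be emulated in a few lines. In the sketch below, the function names and the rejection-sampling truncation to [0, 1] are our own assumptions; only the (μ, σ²) pairs come from the example above:

```python
import random

# Hypothetical partitioned expression: the distribution of the numeric child
# node "defect detection rate" depends on the state of the ranked parent node
# "amount of QA effort".  (mu, sigma^2) pairs as in the example above.
PARTITIONS = {
    "low":    (0.3, 0.10),
    "medium": (0.6, 0.05),
    "high":   (0.8, 0.05),
}

def sample_detection_rate(qa_effort, rng):
    mu, var = PARTITIONS[qa_effort]
    # A detection rate must lie in [0, 1]; truncate by rejection sampling.
    while True:
        x = rng.gauss(mu, var ** 0.5)
        if 0.0 <= x <= 1.0:
            return x

rng = random.Random(1)
low = [sample_detection_rate("low", rng) for _ in range(2000)]
high = [sample_detection_rate("high", rng) for _ in range(2000)]
print(sum(low) / len(low), sum(high) / len(high))
```

Sampling like this mimics what a BN tool does internally when it evaluates a partitioned expression for a numeric node.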

We have used partitioned expressions to define some variables in the DCFM. The use of expressions simplifies the definition of variables because fewer parameters must be supplied. However, it does not solve the task of defining appropriate intervals for numeric nodes. To achieve the highest precision in calculations, it is recommended that a variable have more intervals defined around values of high probability and fewer intervals around values of low probability. This is difficult to achieve in advance during the process of model building because a model can be used in a variety of scenarios, causing probability distributions to shift left or right. As a solution, we have applied a dynamic discretization algorithm (Neil et al. 2007; Neil et al. 2010). This algorithm relieves the modeler of the need to discretize numeric variables: during model calculation, it automatically defines intervals that more precisely reflect the distributions of numeric variables.

In this study, we have used the algorithms supporting the expressions and dynamic discretization implemented in the AgenaRisk tool (AgenaRisk 2009).

2.7 Related Work

We have justified the selection of BNs as the modeling technique for this study; therefore we do not discuss the use of other techniques, as they have been applied in different environments. Extensive analyses of BN models developed for effort and quality prediction have been performed in earlier studies (Fenton et al. 2008a; Radlinski 2010a). Table 1 summarizes the most relevant of them.

Table 1 Summary of related BN models

Some of these BNs do not have a causal structure, some have been automatically generated from datasets without any expert input, and some incorporate a different level of detail; they have been developed for other analytical purposes based on different assumptions. Only a few of the previous studies are strongly related to the current one.

Bibi and Stamelos (2004) proposed a model for development effort prediction in projects compatible with the Rational Unified Process (Kruchten 1998). In this model, the effort is estimated at various project stages and for different activities, and then aggregated. This concept seems to be clear and intuitive. This model can be adjusted to reflect the incremental or iterative lifecycle, where the number of iterations will depend on a particular project. However, the authors have not performed any validation and only published a basic topology of the model. Thus, it is difficult to analyze the details of their model.

Fenton et al. (2008b) investigated how the effectiveness of specification, coding and testing influences the numbers of defects inserted and found, respectively. Ultimately, the model predicts the number of residual defects. However, it assumes that the effort required to fix a defect is constant, i.e. that it does not depend on the defect itself. Such an assumption may work only when the proportions of different types of defects are constant across multiple projects and components.

Radliński et al. (2008) developed a dynamic BN (sequentially linked identical BNs), where each instance reflects a testing and defect correction iteration. The model aims to predict the number of defects remaining after each iteration and thus may be used to estimate the release time. This model is restricted to situations where no functionality is added to the code during the testing and correction phase. The predictions are based mainly on the amount of effort, process quality factors and the numbers of defects found and fixed. The model has been validated using a synthetic dataset.

Schulz et al. (2011) proposed a Software Process Model whose main aim is to predict defect correction effort. This model is also a dynamic BN, where each instance reflects a single development task related to implementing part of a specific feature. The model predicts engineering effort, which is later used to predict defect correction effort. Predictions are mainly based on the type of task (i.e. task complexity) and process quality. Interestingly, the model does not contain any numerical variable reflecting the size of the project, feature or task. Model usage has been illustrated and discussed in a set of hypothetical scenarios and validated for accuracy using past project data. The high accuracy of predictions based on task complexity, expressed on a ranked scale, shows that accurate effort predictions do not always require a numerically expressed project size.

In our DCFM model, we have used our own and other authors' experience from developing earlier BN models. In particular, we have put an emphasis on building an appropriate structure and variable definitions that enable performing useful analyses and relatively easy calibration.

3 Research Procedure

The research procedure of this study is divided into four main stages, in which three types of experts took part, as illustrated in Fig. 5.

Fig. 5 Steps of the research procedure

3.1 Problem Definition

The main application of the DCFM is the support of project managers and process owners in complex decisions related to DCE estimation. The DCFM is based on the guidelines for designing expert systems (Weiss 1984).

In every expert system, it is essential to properly define the problem and the KPIs related to it. The problem is derived from the initial goal of the model. We used the Goal Question Metric (GQM) approach (Basili and Rombach 1988) to systematically identify the relevant KPIs of the model and their relations. According to the GQM, the problem definition includes its purpose, object of interest, issue and the user's point of view, leading to the following definition:

For a specific engineering environment, the DCFM shall identify the ideal distribution of QA effort to minimize defect correction effort at a given time and costs from a project manager’s point of view.

Based on the problem definition we identified the KPIs and their relations, and used them in a causal structure of the model. Table 2, discussed later, summarizes the goal, questions and metrics.

Table 2 Definition of variables

3.2 Data Gathering and Analysis

Our data for the model definition was mostly obtained from internal sources within the specific automotive engineering environment. We analyzed data from a completed project and extracted relevant metrics according to the data definition illustrated in Fig. 6. Our main data source was the change & defect management (CM) system, holding information about the development of the artifacts needed to realize a specific part of a feature. The CM system is a standardized database system used for the management of all product-related changes and defects. It supports traceability both from requirements to source code and from source code to requirements. Furthermore, it is a major source of metrics to support the assessment and improvement of the development process.

Fig. 6 Definition of project data

An entry in the CM has one of the following categories:

  • The Change entry describes the development of a newly added, modified or removed requirement. It holds information about the development effort needed for this change as well as the testing effort.

  • The Defect Analysis entry describes a defect in an existing feature. It stores the information about the analysis effort, where the defect has been detected (origin phase) and in what phase it has been corrected (correction phase).

  • The Defect Correction entry describes what was changed to solve the defect. It stores information about the correction effort.

Every CM entry is related to a feature in the requirement management (RM) system. The RM system holds information about the complexity and volatility of a specific feature. The DCE and other model parameters are defined based on these data from the CM and RM systems. Further internal sources are the process documentation as well as expert knowledge. Data used for the model scenarios has been anonymized due to its confidentiality.

3.3 Model Creation and Enhancement

Model creation and simulation is a very challenging task because expertise is required in multiple fields. First, a deep understanding of the model's domain has to be established, which in the case of the DCFM is software engineering. Second, the problem investigated has to be fully understood, including its KPIs and how they are related to each other. For the DCFM, this means being an expert in the specific automotive work environment who is able to map process and project data to the problem definition and its KPIs. Finally, statistical know-how is required, e.g. to understand the consequences of combining data from various sources and how this affects the simulation results. We needed several iterations to design and calibrate the DCFM to fulfill the requirements from all the fields above.

The model structure and calibration are based on the automotive engineering process described before. Here, we collected historical and current data according to the project data definition. The DCFM was set up based on this data. Furthermore, domain experts calibrated those parts of the model where no project data was available. To evaluate the model performance, several scenarios were defined to reflect the different aspects of the problem definition. The final structure of the DCFM is based on an iterative refinement with the goal of optimizing the performance for all scenarios. This stage also involved developing various specialized editions of the model, one of which, for the analysis of multiple releases of developed software, is discussed later in this paper.

3.4 Model Validation

The final step in creating the DCFM was model validation. Due to the novelty of our approach, it was not possible to validate all simulation scenarios. Instead, domain experts analyzed and evaluated the model according to the definition of the problem. This stage involved the following:

  1. Validation of general model behavior—We entered hypothetical data on development and review allocation in various project stages into the DCFM and analyzed whether the predictions fell within the ranges expected for a specific environment. This served as an introduction to the analysis of predictive accuracy; the predictions were not compared with empirical data but judged by experts as possible or impossible in reality (in the latter case the model was recalibrated).

  2. Validation of practical usefulness—We used empirical data from past projects and entered the data on an assumed development process into the DCFM. Based on the DCFM’s predictions, we changed the allocation of the review effort in an ongoing project and finally analyzed the effects of the changed process.

  3. Validation of detailed model behavior—As in (1), we entered hypothetical data into the model and analyzed the predictions in detail. The correctness of the predictions was assessed by experts based on the quality of the predicted probability distributions, calculated summary statistics, model behavior in standard and non-standard situations, and model behavior with low and high numbers of observations entered.

  4. Validation of model sensitivity—Again using hypothetical data, we analyzed how a change of one or more variables influences the predicted variables, most notably overall effort. Experts assessed the obtained predictions in terms of compliance with the existing body of knowledge in the field and with high-granularity empirical data.

  5. Evaluation of the possibility of adjusting and calibrating the core of the DCFM—Initially we pre-calibrated the model to a particular industrial setting. Since the model is also useful in other environments, we analyzed what sorts of adjustments and calibrations could be made to the core part of the DCFM.

3.5 Model Release

The final version of the BN model, including calibration and some predefined scenarios, is available for public use from the PROMISE online repository (Boetticher et al. 2007).

4 Results—Defect Cost Flow Model

4.1 Overview of DCFM

Based on the procedure model from the previous section, this section first introduces the concept behind the DCFM. It focuses on the problem of assessing every phase of a development process to identify the ideal amount of QA effort to spend in each phase. The model data, in the form of the DCFM’s KPIs and their relations, are described in a metric definition table in which every metric is derived from a set of questions describing the DCFM’s principal aim (see Table 2 for details).

The DCFM is built as a BN based on these parameters and their relations to each other. The model focuses on the assessment of the overall amount of effort, of development and defect correction effort, and of QA effort for the developed features.

The schematic of the DCFM reflecting the notion of defect flow is illustrated in Fig. 7. According to the V-model process definition, there are four development phases built into the DCFM: RE, DE, IM and I&T. The final phase CU is calculated based on the results of the preceding phase I&T.

Fig. 7 Schematic of DCFM

For every development phase, the DCE is determined separately based on phase-specific KPIs. The DCE after QA originating in RE flows to the succeeding phase after being adjusted by its phase multiplier. This phase multiplier represents the increase of effort if defects are not corrected in the same phase in which they are injected. It varies from company to company and depends on the people involved across the different phases. The more process activities involved, the higher the DCE for defects flowing from one phase to another. For example, the customer specification demands that a signal light be red; instead, the internal specification describes it as green. If this mismatch of specifications is not detected early, the green signal is passed into the design and implementation phases. Test cases are created to validate that the signal is green. Finally, the customer detects the problem and reports a defect. All process steps from requirements engineering to testing the feature have to be repeated. The phase multiplier describes this repetition of phase activities.

4.2 Measurements and Metrics

The DCFM can be used to analyze how effort spent on QA activities in various development phases reduces the overall engineering effort. The identification of KPIs and their relations is the first step in creating such a model. According to our research method, the systematic approach to identifying these KPIs is the prior definition of measurement categories. Every measurement category contains a set of specific metrics, i.e. actual measures characterizing that category. Measurement categories are defined with the help of questions one might ask to understand the statement of the problem. With the help of these questions it is possible to systematically identify relevant KPIs and characterize the goal of the model.

Table 2 illustrates the set of questions enabling the identification of all major components of the final model. Each question is answered by metrics incorporated as variables in the DCFM. We have used the resulting variable definitions to build the model.

4.3 Detailed Model Structure and Initial Calibration

Figures 8, 9, 10 and 11 illustrate the details of the DCFM structure by visualizing a cause-and-effect chain in which nodes represent events and arcs their relations. Calibration nodes are used for setting up the model. We elicited prior distributions based on internal data and expert knowledge. We analyzed process-specific activities as well as cost distribution tables for different development phases. Furthermore, the change management system provided valuable information on the flow of defects, used to elicit e.g. specific effort reduction rates.

Fig. 8 Structure of phase RE

Fig. 9 Structure of phase DE

Fig. 10 Structure of phase IM

Fig. 11 Structure of phase I&T

Fig. 12 Raw predictions from DCFM for two example scenarios

Fig. 13 Development of costs predicted in DCFM

Fig. 14 RE scenario results

In every phase, there is a specific development effort and defect cost factor resulting in the potential DCE as an indicator for the error-proneness of a phase. The amount of QA effort is determined in relation to the amount of development effort and the level of sufficiency of QA effort. Specifically, QA effort is defined as a percentage of development effort, where the sufficiency of QA effort defines this percentage value. These nodes take into account that for specific features, e.g. a simple parameter database, it might not be necessary to use the maximum QA effort for defect detection because their initial defectiveness is already very low. Furthermore, not every development phase introduces defects at the same rate; later phases in particular have lower DCFs than, e.g., RE or DE. The effort reduction rate based on the sufficiency of QA effort determines the effectiveness of phase-specific QA activities. In combination with the potential DCE, the DCE after QA can be determined, expressing the undetected DCE remaining after every phase.

The DCFM is calibrated using the phase-specific DCFs illustrated in Table 3. These values are based on the assumption that a complex feature, e.g. an HMI, is being developed; features with lower complexity have lower DCFs. In every development phase, the nodes development effort and defect cost factor are combined (multiplied) in the node defect correction effort. The nodes effort reduction rate and sufficiency of QA effort represent the effectiveness of all QA activities for specific development phases.

Table 3 DCFs per development phase
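This multiplicative relation can be sketched as follows. The function name is illustrative, not part of the model implementation, and the 0.75 example DCF is borrowed from the evaluation discussed later in Section 5.4:

```python
# Illustrative sketch (not the calibrated BN): the potential DCE of a phase
# is the development effort multiplied by the phase-specific defect cost
# factor (DCF), as in Table 3.
def potential_dce(development_effort_h: float, dcf: float) -> float:
    """Potential defect correction effort for one phase."""
    return development_effort_h * dcf

# E.g. 200 h of development with a DCF of 0.75 (the value later used in the
# evaluation) leaves 150 h of potential DCE.
assert potential_dce(200, 0.75) == 150.0
```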

The node effort reduction rate enables the definition of defect detection rates for specific levels of QA activity between 0% and 100%. The DCFM represents these using ranked nodes, with the possibility to define the most relevant values for a realistic scenario from a project manager’s perspective.

Four different ranks are defined:

  • Low represents a worst case scenario.

  • Medium is used in an average scenario.

  • High represents ideal conditions for a scenario.

  • Very High is used for the best case scenario.

Table 4 illustrates the resulting sufficiency of QA effort on a ranked scale. It is represented as a percentage of the development effort, e.g. a medium sufficiency of QA effort represents 10% of the development effort spent additionally on reviews.

Table 4 Sufficiency of QA effort
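The mapping from ranked sufficiency to review hours can be sketched as below. The 10%, 20% and 40% rates for “medium”, “high” and “very high” are stated in the text; the 5% rate for “low” is an assumption inferred from scenario S1 (50 h of QA on 1000 h of development):

```python
# Sketch of the ranked "sufficiency of QA effort" scale (cf. Table 4).
# 'medium' = 10%, 'high' = 20% and 'very high' = 40% are quoted in the text;
# 'low' = 5% is an assumption inferred from scenario S1.
QA_PERCENT = {"low": 5, "medium": 10, "high": 20, "very high": 40}

def qa_effort(development_effort_h: float, sufficiency: str) -> float:
    """QA (review) effort as a percentage of the development effort."""
    return development_effort_h * QA_PERCENT[sufficiency] / 100

print(qa_effort(1000, "medium"))  # 100.0 additional review hours
```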

The nodes effort reduction rate and sufficiency of QA effort result in the node DCE after QA, which defines the possible reduction of DCE for every phase. The corresponding rates, depending on the sufficiency of QA effort and on the phases of defect origin and detection, are shown in Table 5.

Table 5 Effort reduction rates

In every development phase there is a specific detection rate depending on the sufficiency of QA effort, the defect origin and the defect detection phase. For example, the third column represents detection rates in phase DE for defects originating from phase RE. The node defect correction effort after QA is a subtraction node that calculates the difference between the potential DCE and the reduced DCE.

At this point the residual effort flows from phase RE to DE, where it is adjusted by the phase multiplier into the node defect correction effort (RE) in DE. It represents the increase of effort if defects are not corrected in the same phase in which they are injected, e.g. if the design process has to be followed twice due to a defective requirement, the requirement first has to be corrected (this effort cannot be saved) and, additionally, the design might have to be redone. The phase multiplier varies from company to company and depends on the development process: the more development phases involved, the higher the overall effort needed for fixing these defects. There are different phase multipliers for different development phases, depending on their phase activities. The DCFM uses the phase multipliers illustrated in Table 6. For DCE flowing from RE to DE, a phase multiplier of 4 is used, leading to an effort multiplication by 4, e.g. a residual DCE of 64 h left in phase RE requires an effort of 256 h if it is detected in phase DE. For DCE flowing from DE to IM the multiplier is 5, and from IM to the final phase I&T it is 4. This results in a worst case DCE multiplication of 80 for defects flowing through all development phases.

Table 6 Development phase multipliers
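The multiplier chain above can be checked with a few lines of code; the function and dictionary names are illustrative:

```python
# Worked check of the phase multipliers in Table 6: undetected DCE is
# multiplied at every phase transition, so a defect flowing through all
# phases accumulates the product of the multipliers.
PHASE_MULTIPLIER = {("RE", "DE"): 4, ("DE", "IM"): 5, ("IM", "I&T"): 4}

def flowed_dce(dce_h: float, path: list) -> float:
    """DCE after flowing along consecutive phase transitions."""
    for transition in path:
        dce_h *= PHASE_MULTIPLIER[transition]
    return dce_h

# 64 h of residual RE DCE detected only in DE costs 256 h.
assert flowed_dce(64, [("RE", "DE")]) == 256
# Worst case: 4 * 5 * 4 = 80 times the original correction effort.
assert flowed_dce(1, [("RE", "DE"), ("DE", "IM"), ("IM", "I&T")]) == 80
```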

In phase DE, the nodes sufficiency of QA effort (RE) in DE and effort reduction rate (RE) in DE differ from their counterparts in phase RE according to the phase specifics. For example, for defects created in RE or DE, the effort reduction rate in phase IM is lower than in RE or DE, because implementation follows the requirements and design, and it is not the task of a programmer to question the meaning of all requirements or of the interfaces used.

This leads to different values of defect correction effort reduced (RE) in DE and defect correction effort after QA (RE) in DE for defects originating in phase RE. The described pattern is repeated for every succeeding phase, resulting in the case of the DCFM in a four-phase model covering RE, DE, IM and I&T.

4.4 Calculation of Aggregated Effort

The DCFM enables the prediction of DCE in various stages of a development process, with an explanation of the origin of the DCE (i.e. in which development stage it was injected). To enable analysis at a more aggregated level, the model also contains a set of summary variables, which are defined deterministically as shown in Table 7. In the DCFM validation discussed in the next section, we used aggregated values from the bottom rows of this table.

Table 7 Definition of summary variables

5 DCFM Validation

We have validated the DCFM by performing various simulations. They involve the following basic steps: entering observations into the model, calculating the model using an exact junction tree algorithm, obtaining posterior probabilities for the investigated variables, and analyzing the results (either the whole distributions or summary statistics calculated from those distributions). The main aims of the particular simulations are the following:

  • Simulation 1 (Section 5.1)—analysis of aggregated predictions for a hypothetical project;

  • Simulation 2 (Section 5.2)—analysis of the detailed predictions by investigating the cost flow for a hypothetical project;

  • Simulation 3 (Section 5.3)—comparison of four scenarios with different levels of QA activities in particular phases;

  • Simulation 4 (Section 5.4)—application of model predictions in a real industrial setting;

  • Simulation 5 (Section 5.5)—sensitivity analysis demonstrating the effects of various levels of QA activities.

During the development of earlier BN models (Fenton et al. 2008b, Radliński et al. 2007, Radliński et al. 2008, Schulz et al. 2011) we found that it is not trivial to ensure correct encoding of expert knowledge into the model and model consistency with the general body of knowledge in the software engineering area. Thus, all simulations are focused on validating such issues. Additionally, Simulation 4 validates predictive accuracy and usefulness in a real environment.

To keep this discussion concise, in some simulations we do not present posterior probability distributions but only summary statistics. The model performs all calculations using probability distributions rather than point values. Distributions for some variables are left-truncated at the value ‘0’ to reflect that effort cannot be negative. As a result, the presented median values are not always equal to the results of mathematical operations performed on point values (Figs. 15, 16 and 17).
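The effect of left truncation on summary statistics can be sketched with a small Monte Carlo experiment; the distribution parameters below are arbitrary placeholders, not calibrated model values:

```python
import random
import statistics

# Sketch of why medians from left-truncated distributions differ from point
# arithmetic: effort distributions are truncated at 0 (effort cannot be
# negative), which shifts their summary statistics. Parameters are arbitrary.
random.seed(42)
samples = [random.gauss(10, 8) for _ in range(100_000)]
truncated = [x for x in samples if x >= 0]  # left-truncate at 0

# Removing the negative tail shifts the median upwards.
print(statistics.median(samples), statistics.median(truncated))
```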

Fig. 15 DE scenario results

Fig. 16 IM scenario results

Fig. 17 I&T scenario results

5.1 Example Predictions

In the first analysis, we discuss the results of predictions provided by the DCFM for two scenarios. The aim of this analysis is to briefly illustrate the raw predictions, in the form of probability distributions, provided by the model.

We compare the predictions for a hypothetical project, with the following settings:

  • Total Development Effort = 1000,

  • Development Effort RE = 200,

  • Development Effort DE = 200,

  • Development Effort IM = 200,

  • Development Effort I&T = 400.

In the first scenario we assume the review effort to be “high” in early stages (RE and DE), while being “low” in the later stages (IM and I&T). In the second scenario the assumption about the sufficiency of QA effort is opposite—“low” in early stages and “high” in later stages.

Figure 12 illustrates the predictions for four variables by visualizing probability density functions and calculated summary statistics. In this analysis the amounts of review effort have not been set numerically but verbally (“low”, “high”). The exact total QA effort is not a numerical point value but a distribution, because of the uncertainty related to the imprecision of the verbal expression encoded in the model. In both scenarios the review effort has been set to “high” in two stages and “low” in the two remaining stages. However, due to the unequal allocation of development effort among stages and the fact that numerical review effort is defined as a proportion of the development effort, the total QA effort is not equal in these two scenarios.

The total QA effort is higher in the second scenario (more effort spent on reviews). In spite of this, for all three other variables of interest, defect correction effort, residual defect correction effort, and overall effort, the predicted values are higher in the second scenario than in the first. Additionally, the predictions in the second scenario are more uncertain, i.e. they have a higher variance. We suspect that this is caused by the lower effectiveness of reviews in IM and I&T compared to RE and DE. A detailed discussion of review effectiveness in particular stages is presented in the subsection on the sensitivity analysis.
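The arithmetic behind the unequal total QA effort can be sketched as follows. The “high” = 20% rate is quoted elsewhere in the paper; the “low” = 5% rate is an assumption, so the exact totals are illustrative only:

```python
# Sketch of why the two scenarios need different total QA effort: review
# effort is a proportion of phase development effort, and development effort
# is unevenly allocated. 'high' = 20% is quoted in the text; 'low' = 5% is
# an assumption.
DEV_EFFORT = {"RE": 200, "DE": 200, "IM": 200, "I&T": 400}
RATE_PERCENT = {"low": 5, "high": 20}

def total_qa(levels: dict) -> float:
    return sum(DEV_EFFORT[p] * RATE_PERCENT[levels[p]] / 100 for p in DEV_EFFORT)

scenario_1 = {"RE": "high", "DE": "high", "IM": "low", "I&T": "low"}
scenario_2 = {"RE": "low", "DE": "low", "IM": "high", "I&T": "high"}
print(total_qa(scenario_1), total_qa(scenario_2))  # 110.0 140.0
```

Because I&T carries twice the development effort of the other phases, moving the “high” reviews to the later stages raises the total QA effort, as observed in the predictions.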

Although prediction results in the form of probability distributions provide more information than point values, in most of the further analyses we use only the median of the distribution as the “predicted value”. This simplifies the visualization of predictions and the comparison of scenarios. This approach has been followed in various previous studies (Fenton et al. 2008a; Fenton et al. 2008b; Radlinski 2008; Radliński 2010b; Schulz et al. 2011). An analysis of full distributions would be unnecessarily complex and long for this paper.

5.2 Development of Costs

The development of costs in the DCFM is shown in Fig. 13, focusing on the flow of DCE over the development phases. Defects are injected in their corresponding phase and detected in later phases. In the DCFM, defect correction costs are represented by the DCE. The positive axis illustrates effort spent on defect correction, whereas the negative axis depicts the DCE reduced by QA measures.

Focusing on defects originating in phase RE, there are 440 h of potential DCE residing in the software product. With QA measures and a very high detection rate of 85%, the DCE could be reduced by 374 h, leading to a total DCE of 66 h for phase RE. These 66 h remain undetected and shift from phase RE to DE.

In the DE phase they are adjusted by the phase multiplier for RE to DE, for which the DCFM is calibrated with a value of 4. Following the flow of DCE injected in phase RE, 264 h enter phase DE, where QA measures reduce them by 224 h to around 40 h. Further adjusted by the phase multiplier (5 from DE to IM), 198 h of DCE injected originally in phase RE flow from DE to IM. In phase IM it is much more difficult to detect defects from RE, so only 43% of the DCE is detected. Finally, the DCE shifts to phase I&T (adjusted by a phase multiplier of 4), where 455 h are reduced by 387 h to a final DCE of 68 h residing after the last phase and potentially detected by the customer.

Summing up for phase RE, an initial DCE of 440 h results in an overall DCE of 763 h caused by undetected RE defects. In every development phase there is a corresponding DCE, either reduced by specific QA measures or flowing on to further phases. The DCFM shows that, even with very high defect detection rates, every undetected defect flowing from one phase to another causes a multiple of the correction effort compared to the case in which QA measures had prevented the defect from being made, or had at least detected it in the same phase in which it was injected.
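The flow described above can be approximated with a few lines of code. The detection rates (85% in RE, roughly 85% in DE and I&T, 43% in IM) are rounded values read from the text, so the result only approximately matches the quoted figures:

```python
# Sketch of the flow of RE-injected DCE through the later phases, using the
# detection rates quoted in the text (85% in RE, ~85% in DE and I&T, 43% in
# IM; these are approximations) and the phase multipliers 4, 5 and 4.
def residual_flow(initial_dce_h, steps):
    residual = initial_dce_h
    for multiplier, detection_rate in steps:
        residual *= multiplier            # cost inflation across the phase gap
        residual *= 1 - detection_rate    # the detected part is corrected
    return residual

residual_re = 440 * (1 - 0.85)  # 66 h leave phase RE undetected
steps = [(4, 0.85), (5, 0.43), (4, 0.85)]
print(round(residual_re), round(residual_flow(residual_re, steps)))
# 66 h after RE; about 68 h remain after I&T, matching the quoted values
```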

5.3 Scenario Results

Following the goal of identifying the ideal distribution of QA effort over all development phases (RE, DE, IM and I&T) of the development process under discussion, four scenarios have been defined based on the assumption of developing a complex feature with an estimated development effort of 1000 h. These scenarios, defined in detail in Table 8, demonstrate the capabilities of the DCFM:

Table 8 Scenario definition
  • S1 uses a low sufficiency of QA effort. This is the worst case scenario considering the development of DCE over all development phases.

  • S2 uses a high sufficiency of QA effort, as typically applied when optimizing a single development phase only. An additional scenario demonstrating medium (average) QA activities has been left out because it performs similarly to S2.

  • S3 uses very high QA activities for RE and DE and a high amount for IM and I&T. It is expected to be too expensive if every development phase is considered only in its own context.

  • S4 uses a very high amount of QA activities in all development phases.

All scenarios share the same settings for the amount of development effort: 200 h in RE, 200 h in DE, 200 h in IM, and 400 h in I&T. The simulation results are illustrated in the following figures in the form of a comparison of scenarios for every development phase. All values represent the calculated median of the predicted probability distribution for DCE.

5.3.1 Requirements Engineering Phase

For phase RE, Fig. 14 illustrates a constant DCE of 200 h in all scenarios. In S1, the DCE could only be reduced by 23 h. S2 reduces the DCE by 150 h, whereas S3 and S4 achieve the highest DCE reduction of 170 h due to very high QA activities.

The remaining DCE shifts from phase RE to DE and is adjusted by the phase-specific multiplier.

5.3.2 Design Phase

The development of DCE is illustrated in Fig. 15. All scenarios have an additional DCE of 160 h caused by defects originating in phase DE. The DCE shifted from RE has increased to 598 h in S1; in S2 it is 199 h, whereas S3 and S4 retain 119 h of remaining DCE. With the corresponding QA effort, the DCE for defects that originated in phase DE could be reduced by 16 h in S1, 120 h in S2, and 136 h in S3 and S4. The DCFM assumes that defects originating in RE have detection ratios similar to those of DE defects. Thereby, the DCE reduction for RE defects in phase DE is 17 h in S1, 76 h in S2, and 52 h in S3 and S4.

5.3.3 Implementation Phase

Figure 16 illustrates the predictions for the IM phase. Focusing only on defects with their origin in RE, the DCE has increased to 2529 h in S1; in S2 the DCE is still 601 h, whereas S3 and S4 show only 332 h. The reduction of DCE for defects originating in RE is shown in row “–RE”. The DCFM assumes low rates for detecting RE and DE defects in IM. Thus, the DCE is reduced by 36 h in S1, 225 h in S2, and 146 h in S3 and S4.

5.3.4 Integration & Test Phase

Results for the final phase I&T are presented in Fig. 17. The DCE for RE defects is 8361 h in S1, 1462 h in S2, and 722 h in S3 and S4. The DCFM assumes higher detection rates for RE defects in the final phase I&T. Thus, the reduction of DCE is 851 h in S1, 1115 h in S2, 375 h in S3 and 614 h in S4.

Figure 17 also illustrates the final development of DCE for S1 to S4. It shows that S1 still faces 10 times more DCE than the initially planned development effort. This is when software projects run out of time or money and customers are disappointed. S2 has around 2000 h of DCE left, whereas S3 and S4 are at around 1000 h, with most of the remaining correction effort needed for defects originating in RE.

Table 9 summarizes the overall results. The initial development effort is 1000 h for every scenario. The effort spent on QA activities is 50 h in S1, 200 h in S2, 320 h in S3 and 400 h in S4. The overall DCE spent as part of QA activities is lowest for S1 at 1224 h; in S2 it is 2343 h, in S3 1328 h and in S4 1608 h. The DCE residing in the product after development is 9771 h in S1, 528 h in S2, 457 h in S3 and 177 h in S4. Finally, the overall simulation results expect S1 to need 12045 h to develop a product initially planned at 1000 h. S2, at 4071 h of overall effort, still consumes three times the planned amount. S3 has an estimated effort of 3105 h, close to S4 at 3185 h.

Table 9 Scenario overall results
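The overall efforts quoted above decompose into the four components reported per scenario, which can be verified directly:

```python
# Check of the Table 9 arithmetic: the overall effort per scenario is the sum
# of development effort, QA effort, DCE corrected during development and the
# DCE residing in the product afterwards (all values quoted in the text).
scenarios = {          # (development, QA, corrected DCE, residual DCE) in h
    "S1": (1000, 50, 1224, 9771),
    "S2": (1000, 200, 2343, 528),
    "S3": (1000, 320, 1328, 457),
    "S4": (1000, 400, 1608, 177),
}
overall = {name: sum(parts) for name, parts in scenarios.items()}
print(overall)  # {'S1': 12045, 'S2': 4071, 'S3': 3105, 'S4': 3185}
```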

Considering the overall amount of engineering effort, it is still cheaper to invest in QA activities close to the maximum level than to economize on the effort spent on QA. QA activities in early phases pay off in particular, because longer process iterations involve more process activities and therefore more effort.

5.4 Practical Benefits

5.4.1 Motivation for Process Change

This section describes the benefits of using the DCFM in a real-life environment. Working with such a model can reveal improvement potential in various ways. We calibrated the DCFM based on data and expert knowledge in the target environment. Expert knowledge was needed to describe process dependencies, e.g. the relevant development phases, as well as the corresponding metric databases. We used historical project data to calibrate the process dependencies, e.g. specific DCFs for specific features lead to an environment-specific DCE, which was the basis for optimization. After that, we used the resulting model to simulate various hypothetical scenarios. For further analysis, we focused on scenarios S2 and S3 introduced in the previous section. Scenario S2 represents the current level of QA in the real-life development environment. It involves using a ‘high’ sufficiency of QA effort in all development phases; such a scenario is typically used to optimize a single development phase only. Scenario S3 uses ‘very high’ QA activities in the early phases RE and DE. Such QA effort is often considered too expensive in terms of the return on investment per development phase. However, considering the overall effort reduction, scenario S3 outperforms S2, i.e. for S3 the overall effort is lower than for S2 by about 24%. This reduction results from the fact that increasing the level of QA activities reduces the required amount of DCE.

Encouraged by this prediction, we changed the real-life development process. This change involved increasing the sufficiency of QA effort from ‘high’ to ‘very high’ in all the following phases: requirements engineering (RE), design (DE) and implementation (IM). In numbers, this means an increase of the QA effort from 20% (‘high’) to 40% (‘very high’) of the development effort, as presented earlier in Table 4.

5.4.2 Evaluation of Process Change

The evaluation of the process change lasted three months and involved one real-life project. The review activities were intensified in phases RE, DE and IM. In a review performance analysis based on statistical data, we determined the potential DCE per defect for a specific feature. It involved using the DCF in combination with the corresponding development effort to determine the potential DCE. For this evaluation, we assumed a DCF of 0.75 due to the high difficulty of the feature development in this specific project. The phase multiplier for defects shifting from one phase to another is 20 for undetected defects from phases RE and DE, and 4 for undetected defects originating in phase IM.

The company spent 560 h on the core development and the base amount of reviews. The DCFM predicted a total DCE of 300 h. Software engineers performed detailed reviews that took an additional 57 h. During these reviews, they found a number of defects (for confidentiality reasons we cannot provide exact numbers) and immediately corrected them. The company’s software engineering experts carefully analyzed each of these defects and, based on their experience and historical data, estimated that correcting them in later phases (i.e. the DCE) would have taken 218 h. This value served as the basis for the analysis of both the model and the process performance.

Figure 18 illustrates the results of the evaluation analysis. If the company had not spent additional effort on reviews, the total effort would have been 860 h, i.e. 560 h on development plus 300 h on corrections. Since the company actually performed the additional reviews, it saved 73% of the DCE (i.e. 218/300). Because the effort on these reviews was already spent, the total effort was 699 h (i.e. 560 + 57 + 300 − 218).

Fig. 18 Review evaluation results

The company saved about 19% (i.e. 1 − 699/860) of the overall effort thanks to the additional reviews, which justifies performing them. This ratio of 19% is close to the predicted reduction of 24%, which confirms the reasonable accuracy of the DCFM.
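The evaluation arithmetic from the two preceding paragraphs can be reproduced directly from the quoted figures:

```python
# Arithmetic behind the review evaluation: without the extra reviews the
# total would have been 560 + 300 = 860 h; the 57 h of reviews saved 218 h
# of the 300 h of predicted DCE.
development_h, predicted_dce_h, review_h, saved_dce_h = 560, 300, 57, 218

total_without = development_h + predicted_dce_h
total_with = development_h + review_h + predicted_dce_h - saved_dce_h
dce_saving = saved_dce_h / predicted_dce_h        # fraction of DCE avoided
overall_saving = 1 - total_with / total_without   # fraction of total saved

print(total_with, round(dce_saving, 2), round(overall_saving, 2))  # 699 0.73 0.19
```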

These evaluation results are based on the DCE estimated by experts, which we treated as a real empirical value. This is a limitation of this part of the validation, because the experts might not have estimated this value correctly. However, it is not possible to assess the accuracy of this estimate, because the defects for which the DCE was estimated were actually detected and fixed during the reviews and not passed on to the next phases to see how much their correction would really have taken.

5.5 Sensitivity Analysis for Decision Support

We discuss the pairwise impact of various factors, mostly related to review effort in specific stages, on the overall effort. The last of these analyses covers the impact of various combinations of the review effort on the overall effort. These analyses show how the DCFM can be used for decision support, i.e. how specific states of variables of interest influence the overall effort, to help in finding the optimum review effort in specific stages. There are various techniques aiming to quickly find the optimum solution to similar problems, most notably within the search-based software engineering area (Antoniol et al. 2011; Durillo et al. 2009; Finkelstein et al. 2009; Gueorguiev et al. 2009). We decided not to use them because they are outside the scope of this paper, in which we want to show detailed model behavior in various situations, including those far from the optimal solution.

Table 10 summarizes the settings used in these sensitivity analyses. The marks “x” and “y” indicate the variables that have been used on particular axes of the graphs (where applicable). In SA4, the marks “a” and “b” denote two different settings that have been used.

Table 10 Settings for the sensitivity analyses

5.5.1 SA1—Pairwise Impact of the Review Effort RE vs the Review Effort DE

Here we examine the impact of various combinations of review effort RE and review effort DE on the overall effort.

Figures 19 and 20 illustrate the predictions. With very low review effort in RE and DE we should expect the need for high overall effort, with a median of over 24 thousand hours for no early reviews at all. This value is thus 24 times higher than the assumed total development effort (1000 h). The explanation is that the lack of QA effort in early phases makes fixing the defects in later stages and after release significantly more difficult.

Fig. 19 Impact of review effort RE and review effort DE

Fig. 20 Impact of review effort DE and review effort IM

We can see that the impact of review effort RE on the overall effort is higher than that of review effort DE. This is illustrated by the steeper slope of the curve for varying values of review effort RE with constant review effort DE than for varying values of review effort DE with constant review effort RE. This can be explained by the notion that finding and fixing defects sooner requires a lower total effort.

This analysis also reveals that increasing the review effort is efficient only up to some point; beyond that point, reviews are no longer efficient. In this experiment, the optimum review effort in RE and DE is about 70 person-hours. With more reviews, the total effort starts to increase slightly (although this is difficult to observe in the graph due to the resolution of the vertical axis, which has to cover the high values for low amounts of reviews). The reason for this behavior is the law of diminishing returns: increasing the review effort yields a smaller and smaller reduction of correction effort, up to the point where the reduction of correction effort is lower than the review effort and reviews are no longer effective. The decreasing reduction of correction effort is caused by the fact that it is initially easy to find and fix obvious defects, while finding and fixing more hidden defects is difficult and takes more time. This relationship has previously been encoded in an earlier BN model (Radliński et al. 2008). These results confirm that the DCFM correctly incorporates known relationships between variables.
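The law of diminishing returns described above can be illustrated with a simple toy model. The exponential saturation curve and all parameter values below are assumptions chosen for illustration; they are not the calibrated DCFM relationships:

```python
import math

# Illustrative model of the law of diminishing returns (an assumption, not
# the calibrated DCFM): each extra review hour avoids less correction effort,
# so the total effort has a minimum at some finite review effort.
def total_effort(review_h, dev_h=1000, potential_dce_h=2000, k=0.02):
    avoided = potential_dce_h * (1 - math.exp(-k * review_h))
    return dev_h + review_h + (potential_dce_h - avoided)

efforts = {h: total_effort(h) for h in range(0, 401, 10)}
optimum_h = min(efforts, key=efforts.get)
# Beyond the optimum, an extra review hour costs more than it saves.
print(optimum_h, round(efforts[optimum_h]))
```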

5.5.2 SA2—Pairwise Impact of the Review Effort DE vs the Review Effort IM

This experiment illustrates the impact of various combinations of the review effort DE and the review effort IM (Fig. 21). The slope of the curve for varying values of review effort DE is higher than for varying values of review effort IM. This suggests a higher impact of the review effort DE than of the review effort IM and indicates that it is more effective to perform reviews as early as possible.

Fig. 21 Impact of review effort RE and review effort IM

As in the SA1 experiment, the optimum review effort is also around 70 person-hours. With more reviews in DE and IM, the overall effort gradually starts to increase. These results correspond to the similar analysis in SA1.

5.5.3 SA3—Pairwise Impact of the Review Effort RE vs the Review Effort IM

In this experiment we analyze the impact of the review effort RE compared to the review effort IM. As in the previous analyses, the predictions in Fig. 21 also confirm the appropriate encoding of the law of diminishing returns: the increase of review effort in RE and IM initially reduces the overall effort, but after a certain threshold the reviews are no longer effective and the overall effort starts to increase.

Figure 21 clearly shows that the review effort RE has a higher impact on the overall effort than the review effort IM. Again, these results confirm that early-stage reviews are more effective than reviews performed later in the software development lifecycle.

5.5.4 SA4—Pairwise Impact of the Review Effort IM vs the Review Effort I&T

Here we analyze the impact of various combinations of review effort IM and review effort I&T. Figure 22 illustrates the results for appropriate and for inadequate early-stage reviews. These graphs have different shapes than those in SA1–SA3, because review effort I&T has a different impact on the overall effort than reviews in RE, DE, and IM.

Fig. 22 Impact of review effort IM and review effort I&T with appropriate (left) and little (right) early-stage reviews (RE and DE)

The optimum review effort in IM is still around 70 person-hours. Surprisingly, it appears that the overall effort increases along with the review effort I&T. This happens both when appropriate and when only very little effort has been spent on reviews in the earlier stages (RE and DE), although at different rates. There are three reasons: a much higher defect cost factor and lower defect detection rates set for I&T than in the default settings, and the lack of a phase multiplier after the phase I&T. These settings mean that, according to the model, the final phase I&T is not as effective as the QA activities in the previous phases.

5.5.5 SA5—Pairwise Impact of the Defect Cost Factor vs the Review Effort RE and DE

In this experiment we analyze the impact of early-stage reviews (the same in RE and in DE) for different values of the defect cost factor (set equal across stages). Figure 23 illustrates the predicted overall effort. We can observe that the overall effort increases linearly with the defect cost factor. This shows that the defect-proneness of a feature (an uncontrollable factor) linearly affects the overall effort.

Fig. 23 Impact of the defect cost factor vs review effort RE and review effort DE

The impact of the amount of early-stage reviews is two-fold. Initially, increasing the review effort up to the optimum point reduces the overall effort; beyond that point, further reviews increase it. The optimum review effort in this experiment is still around 70 person-hours. The higher the defect cost factor, the greater the reduction of correction effort achieved by increasing the amount of reviews. These results also confirm that the DCFM properly reflects the law of diminishing returns discussed earlier and properly captures known data.

5.5.6 SA6—Pairwise Impact of the Development Effort RE vs the Review Effort RE

This experiment focuses on the impact of the review effort RE for different values of the development effort RE. In the previous analyses, we assumed a constant development effort in each phase, reflecting the fact that real projects in the analyzed company typically share the same proportions of development effort across stages. Here, we analyze what happens if we spend a different amount of effort on development in RE while keeping the baseline proportions of development effort in the other stages.

The results of this analysis are illustrated in Fig. 24. The model predicts that if we spend no or little effort on reviews, then the overall effort increases together with the development effort RE. This happens because the model assumes that a higher development effort RE corresponds to larger software, and larger software requires more reviews in RE to keep its quality at an appropriate level. Since no or few reviews are performed in RE, significantly more correction effort has to be spent later.

Fig. 24 Impact of the development effort RE and the review effort RE

For low development effort RE, the optimum review effort RE is about 40 person-hours. As the development effort RE increases, so does the optimum review effort RE; for example, for a development effort RE of 300 person-hours, the optimum review effort RE is around 110 person-hours.

These results confirm that the DCFM appropriately encodes the relationships between these variables. Although we only discussed the results of such an experiment for RE here, the relationships between the relevant variables in the other phases are defined in the same way. Thus, for other phases the model will behave accordingly, but the overall effort will be less responsive to changes in development and review effort.

5.5.7 SA7—Combined Impact of the Review Effort RE, DE, IM and I&T

In the previous sensitivity analyses, we analyzed the impact of pairs of variables (or pairs of groups of variables) on the overall effort. In this last analysis, we examine the impact of review effort in all four phases at a time. We generated scenarios with all possible combinations of review effort within the range [30, 130] person-hours with a step of 20, and performed predictions for all of these combinations.
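The scenario enumeration can be sketched as follows. The snippet only generates the input combinations; in the study, each tuple was then evaluated with the BN tool (the variable names are our own):

```python
# Generate the SA7 scenarios: every combination of review effort for the
# four phases RE, DE, IM and I&T over [30, 130] person-hours, step 20.
from itertools import product

levels = list(range(30, 131, 20))            # 30, 50, 70, 90, 110, 130
scenarios = list(product(levels, repeat=4))  # (RE, DE, IM, I&T) tuples

# 6 levels in 4 phases give 6**4 = 1296 scenarios; each would then be
# fed into the BN model to obtain the predicted overall effort.
```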

Because there are four predictor variables and one dependent variable in this analysis, we cannot easily visualize the results for all combinations at once. First, we analyzed the results using 3D graphs with some of the predictors set to fixed values. One such graph, for review effort I&T = 70 person-hours, is illustrated in Fig. 25. Each surface illustrates the predictions for a specific value of review effort RE and varying values of review effort DE and review effort IM. With increasing values of review effort RE up to 70, the overall effort decreases. For higher values of review effort RE, the overall effort starts to increase, as the reviews in RE are no longer effective (this is not shown in the graph: the increase of overall effort was not significant, the surfaces would almost overlap, and the whole figure would be less clear).

Fig. 25 Impact of the review effort DE and the review effort IM for different values of the review effort RE and fixed review effort I&T = 70

Second, we analyzed the predictions using box-plots generated for each predictor (Fig. 26). These illustrate the expected ranges of the overall effort for different values of review effort. The ranges visually vary the most for different values of review effort RE and the least for review effort IM. Thus, they also confirm that reviews are most effective in RE and least effective in IM.

Fig. 26 Comparison of the impact of the review effort in various stages

Lastly, to confirm numerically in which phase reviews are most effective, we calculated the Spearman rank correlation coefficients between each predictor and the overall effort. The coefficients are: RE = −0.62, DE = −0.38, IM = −0.17, and I&T = 0.29 (all with p < 0.05). These results confirm that the strength of the impact of review effort on the overall effort in different phases can be ranked as follows: RE, DE, I&T, IM.
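The ranking step can be sketched in a few lines. The implementation below uses the standard rank-difference formula for Spearman's rho (valid when there are no ties); the two input arrays are tiny illustrative stand-ins for the 1296 scenario predictions, not the study's actual data.

```python
# Minimal pure-Python Spearman rank correlation (no ties assumed):
#   rho = 1 - 6 * sum(d_i^2) / (n * (n^2 - 1))
# where d_i is the difference between the ranks of the paired samples.

def spearman(xs, ys):
    def ranks(values):
        order = sorted(range(len(values)), key=lambda i: values[i])
        r = [0] * len(values)
        for rank, i in enumerate(order, start=1):
            r[i] = rank
        return r
    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n * n - 1))

# Hypothetical scenario results: overall effort dips near the optimum
# review effort and then creeps back up.
review_effort_re = [30, 50, 70, 90, 110, 130]
overall_effort = [900, 400, 250, 260, 270, 280]

rho = spearman(review_effort_re, overall_effort)
# A negative rho means more review effort in RE tends to lower the
# overall effort, as reported for RE, DE and IM above.
```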

5.6 Summary of the Validation Analyses

The analyses performed to validate the DCFM have shown that the model properly encodes the relationships between specific variables known from the literature in this field. The model achieved high accuracy in the analysis of data from real industrial projects. Limitations and threats to validity are discussed in Section 7.

6 Model Customization and Enhancements

We have developed this version of the DCFM to meet the particular needs of a specific software development process and to prove the concept of our general approach in a real-world environment. One of the main purposes of the DCFM is to support experts in improving specific development processes. However, it is possible to tailor our approach to other environments. In any case, expert knowledge of both model development and the target process is needed, including for adjusting and calibrating the model. Assessing scenario simulations and deriving measures to improve the target development process also requires expert knowledge. In general, to develop, calibrate and assess the model, the following topics need to be taken into account:

6.1 Development Effort

The expert is expected to provide all data related to development effort. This effort should be assigned to the development of specific features and recorded per development phase. The allocation of effort to specific phases is encoded in the model as fixed proportions, which can be adjusted and defined either deterministically (fixed fractions of total development effort) or stochastically (fractions with uncertainty encoded as variance).
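The two allocation schemes can be contrasted with a short sketch. The phase fractions and the noise level `sigma` below are hypothetical placeholders for values a domain expert would supply after calibration:

```python
# Sketch of the two phase-effort allocation schemes: deterministic fixed
# fractions of total development effort, or stochastic fractions with
# encoded uncertainty.  All fractions here are hypothetical.
import random

PHASE_FRACTIONS = {"RE": 0.25, "DE": 0.25, "IM": 0.30, "I&T": 0.20}

def allocate_deterministic(total_effort):
    """Fixed fractions: each phase gets a constant share."""
    return {p: f * total_effort for p, f in PHASE_FRACTIONS.items()}

def allocate_stochastic(total_effort, sigma=0.03, rng=random):
    """Fractions perturbed by Gaussian noise, then renormalized to 1."""
    noisy = {p: max(0.0, rng.gauss(f, sigma))
             for p, f in PHASE_FRACTIONS.items()}
    scale = sum(noisy.values())
    return {p: v / scale * total_effort for p, v in noisy.items()}
```

Note that in the actual model the stochastic variant is represented by distributions inside the BN rather than by sampling; the snippet only illustrates the difference between the two definitions.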

6.2 Sufficiency of QA Effort (ranked) and Review Effort (numeric)

As explained earlier, the effort spent on reviews in a specific phase is expected to be a fixed proportion of the development effort in that phase. Like the development effort, this can be adjusted to other proportions depending on the specific process needs.

6.3 Defect Cost Factor, Effort Reduction Rate, and Phase Multiplier

Currently these parameters are defined based on data from past projects, results published in the literature, and expert knowledge about the specific development environment. They may be defined differently in other environments, for example if new data become available, or if the model is to be used for projects of a different type, size, complexity or novelty, or developed by a different team or according to a different process.

6.4 Adding New or Removing Existing Phases

The model has a modular structure in which the phases have very similar sub-models. Thus, adding a new phase requires copying an existing phase, renaming its variables, adjusting the links to the prior and subsequent phase sub-models, and calibrating the parameters. Removing a phase only requires deleting the unnecessary variables and adjusting the links in the prior and subsequent phase sub-models.
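The copy-rename-relink workflow can be illustrated with a simple mock. The real model is a Bayesian network edited in a BN tool, so the dictionary structure, the helper name, and the extra phase "ST" below are purely hypothetical; calibration of the copied parameters is omitted.

```python
# Mock of the modular phase structure: each phase sub-model is reduced
# to a name and a link to its successor.
import copy

phases = [
    {"name": "RE", "next": "DE"},
    {"name": "DE", "next": "IM"},
    {"name": "IM", "next": "I&T"},
    {"name": "I&T", "next": None},
]

def add_phase_after(phases, existing_name, new_name):
    """Clone an existing phase sub-model, rename it, and fix the links."""
    idx = next(i for i, p in enumerate(phases)
               if p["name"] == existing_name)
    new_phase = copy.deepcopy(phases[idx])   # copy the sub-model
    new_phase["name"] = new_name             # rename its variables
    new_phase["next"] = phases[idx]["next"]  # adopt the old successor
    phases[idx]["next"] = new_name           # relink the predecessor
    phases.insert(idx + 1, new_phase)
    return phases
```

Removing a phase would be the mirror operation: delete the entry and point its predecessor's link at its successor.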

7 Limitations and Threats to Validity

The study presented in this paper has both research and industrial potential. However, we are aware of the limitations of this approach and the threats to its validity. The first issue concerns the industrial validation. In this study we used real empirical data from past industrial projects. In a typical scientific experiment, the analysis of predictive accuracy involves specific measures, for example based on relative or absolute error. To make such an analysis reliable, these measures should be calculated for several data points, i.e. projects. However, we did not have access to data of the required volume and granularity, and thus performed only a basic analysis of accuracy. With more testing data it might be possible to reveal some predictive inaccuracies and calibrate the model accordingly. It is not possible to compare the predictions of the DCFM with results from other models because, despite our extensive survey of related work, we have not found another model able to provide predictions using the DCFM input data. In fact, this was also one of the motivations to build a new model rather than use an existing one.

Second, the DCFM presented in this paper reflects a fixed software development process, known as the V-model. In such a process, once one development stage is fully completed, the project moves to the next stage, and so on. Although this development process has attracted much interest over the past decades, other approaches have become increasingly popular, for example agile processes. Currently the DCFM cannot reflect an agile process because of the different concepts of the development processes, and specifically the different points in time for testing and correction activities. However, thanks to its modular structure, the DCFM can be adjusted to reflect a spiral development process, in a similar way to modeling multiple software releases as discussed earlier. Additionally, not every iteration needs to cover all four development phases. For example, iteration 1 may involve only RE and DE; iteration 2 RE, DE and IM; and iteration 3 all four phases. Adjusting the DCFM for other development lifecycles may be a direction for future research.

The lack of empirical data of the required volume and granularity was one of the main motivations for choosing BNs as opposed to data-driven techniques. Most parts of the DCFM have been built according to expert knowledge. However, calibrating and tweaking the model requires access to some empirical data. Thus, the DCFM may be used in other software companies, but it is necessary to calibrate the model to the particular environment; otherwise the model may not provide accurate predictions.

Further, the model works on proportions of development effort (multipliers, fractions, etc.). It does not incorporate the scale effect, whereby the relationships between certain variables in small projects may differ from those in large projects. For example, the specific states for sufficiency of QA effort (i.e. “very low”–“very high”) may indicate different proportions of development effort depending on project scale. To simplify the model and the processes of its creation, calibration and calculation, we did not implement this feature. In addition, our focus was on a specific industrial setting where the projects do not vary significantly in size. Still, reflecting a wider spectrum of project types and sizes in the DCFM may be a topic for future research.

In the sensitivity analysis discussed earlier we analyzed various scenarios, i.e. combinations of values for review effort at different stages. Each scenario requires calculating the whole model with specific input values. The calculation time for one scenario on a “standard” contemporary PC or laptop, using the junction-tree algorithm implemented in the AgenaRisk BN tool (AgenaRisk 2009), varies from about 20 s to about 1 min, depending on which variables the input values are assigned to. Thus, an analysis of various combinations of just one pair of variables typically took over an hour. While researchers may afford such calculation times, industrial users (i.e. decision-makers, analysts) typically wish to have results within seconds, or minutes at most. Currently there is no simple solution to this problem. On the one hand, it is possible to set the parameters of the calculation algorithm so as to speed up the calculations; however, the improvement comes nowhere near reducing the calculation time to a hundredth of the original. Most importantly, there is a trade-off between calculation time and predictive accuracy: with shorter calculations, the resulting distributions will have fewer intervals and thus will not accurately define the output variables. Another option is to use an approximate calculation algorithm, again trading precision for calculation speed.

Finally, the major challenge of establishing a general model in a real-life software development environment is its calibration. The more parameters a model uses, the more complex it is to calibrate, and a detailed understanding of the model is essential. Modeling experts who can translate a given problem into a mathematical and graphical representation are of high value, as is a deep understanding of the model’s field of application. The development processes behind the models developed in this study are used to build large-scale embedded systems for the automotive industry, where potentially more than 100 engineers work simultaneously on a single product in a distributed environment. Such complex process structures evolve over time and adapt to specific boundary conditions. Thus, it is very challenging to model software defects as part of such a development process.

8 Conclusions and Future Work

The DCFM introduced in this paper enables an in-depth analysis of the impact of QA activities in various stages on defect correction and overall effort. Typical cost-benefit optimization strategies tend to optimize effort locally, i.e. every development phase is optimized separately on its own. In contrast, we demonstrated that even cost-intensive quality activities pay off when the overall DCE of specific features is considered.

The optimum amount of QA effort depends on the DCE injected, whereas the DCE itself is based on the development effort and the corresponding DCF of the feature to be developed. Additionally, the people involved in the development phase also have a strong impact: the more people involved in the correction of an engineering artifact, the higher the overall DCE.

Keeping an overview of the influence of all KPIs of a software product is very complex. With the DCFM, project managers can not only monitor the current situation of a project but also estimate the project's behavior under given circumstances, e.g. project rescheduling or process optimization. Furthermore, the DCFM can support higher process maturity levels, e.g. in Capability Maturity Model Integration (CMMI Product Team 2010), a process improvement approach also used as a reference for appraising a company’s engineering processes.

The next step towards higher effort estimation performance, and therefore better software projects in terms of time, quality and costs, is the establishment of our method as part of the continuous improvement process. One major part of this is the establishment of a long-term measurement framework. Based on these data, we could further enhance our models and thereby provide decision support for process optimization and project estimation.

In the future we also plan to overcome the limitations of the approach discussed in the previous section, most notably with respect to extended validation, easier calibration, enhancement and adjustment to specific development processes, and easier incorporation of expert knowledge and new empirical data. In addition, an extension of this work may involve combining the proposed DCFM with various optimization techniques for detailed studies of the optimum development process.