
1 Introduction

With the pervasive application of computer software in systems ranging from daily life to safety-critical applications, software quality has become essential to software survival, and software reliability is an important index for measuring software quality [1,2,3]. During software development, it is necessary to consider the release time of the software, the number of failures after release, and the severity of those failures. Therefore, as an important means of quantitatively evaluating and predicting software reliability, many time-domain software reliability growth models (SRGMs) have been proposed in recent years and successfully applied to the development of various types of safety-critical software. Among all SRGMs, the non-homogeneous Poisson process (NHPP) class is recognized as the most effective and widely used because it is easy to understand and easy to apply.

In 1979, Goel and Okumoto [4] first used an NHPP to describe an SRGM, known as the G-O model, which assumes a constant failure detection rate. Many later models refine the assumptions of the G-O model so that the resulting model achieves a better fit. Yamada [5] proposed the Delayed S-Shaped model, in which the failure detection rate is a non-decreasing function of time. Ohba [6, 7] considered that there is more than one type of fault in software. Kapur et al. [8] introduced the concept of the Fault Severity Factor (FSF) and proposed an SRGM with two types of faults: the first type follows the model of Goel and Okumoto, while the second introduces a logistic rate into the removal process. Many later studies indicate that a software system contains more than one level of software failure. However, the models proposed so far usually assume that detected faults are eliminated immediately in order to simplify the calculations; in other words, they assume that the debugging process is perfect. This assumption ignores the fact that not all detected faults can be eliminated, owing to resource constraints and to new faults introduced during debugging. Consequently, these models cannot be applied well in practical environments, and their fitting and prediction abilities need to be improved.

In this paper, we classify software failures into three categories according to severity: severe, general, and simple, and we propose a new SRGM that quantifies these fault levels. In the fault detection process, for simple faults the probability of testers finding problems in a given time interval is related only to the number of residual faults; therefore, unlike Kapur et al., we assume that the fault detection rate for this level is a decreasing function of time. For the two more complex fault levels, we consider both the testers' ability to detect problems and the number of remaining faults. Similarly, in the debugging process, we introduce a non-removable failure rate to account for the possibility of imperfect debugging, and a failure introduction rate to account for the possibility of introducing new errors. Based on the above analysis, we conducted experiments on two sets of failure data and compared the results with existing SRGMs. The results show that the new model has better fitting and prediction performance.

The rest of the paper is arranged as follows. The second part presents three models for faults of different severity and describes the proposed new model. The third part presents the evaluation criteria for the models. The fourth part estimates the model parameters using Least Squares Estimation (LSE) [4, 25], applies the models to two failure data sets, and compares the experimental results with other classical SRGMs. Finally, the fifth part concludes the paper.

2 The Fault Levels Model

We refine the assumptions of the earliest NHPP-class models and adopt the following assumptions [9,10,11].

  1. The software is tested in a manner consistent with its actual operational profile.

  2. The different types of software faults are mutually independent.

  3. Let \( m(t) \) denote the mean value function (MVF), the expected number of problems detected in time \( \left( {0,t} \right) \). The cumulative number of errors up to time \( t \) follows a Poisson process with MVF \( m(t) \), and \( m(t) \) is a bounded, non-decreasing function of time satisfying \( m(0) = 0 \).

  4. The expected number of errors detected in any time interval \( \left( {t,t + \Delta t} \right) \) is proportional to the number of errors remaining at time \( t \), and the proportionality factor is the failure detection rate \( b(t) \). The failure detection rate as a function of time for each fault level is assumed as follows:

    1) For Level I, the probability of testers finding minor problems in a given time interval is related only to the number of residual faults; the failure detection rate function is \( b_{1} (t) \).

    2) For Level II, the probability of testers finding non-minor problems is related not only to the number of residual faults but also to the testers’ learning ability; the failure detection rate function is \( b_{2} (t) \).

    3) For Level III, we assume the failure detection rate function takes the same form as that of Level II and denote it \( b_{3} (t) \).

  5. We assume the original fault content of the software is \( N \); \( N_{1} \), \( N_{2} \) and \( N_{3} \) denote the initial numbers of simple, general, and severe faults, respectively. Debugging is not perfect for every fault grade, so we introduce the failure introduction rate \( a \).

    1) For Levels I and II, the fault severity is low and developers introduce no new errors during debugging; the failure introduction rates are \( a_{1} \) and \( a_{2} \), with \( a_{1} = a_{2} = 0 \).

    2) For Level III, developers may introduce new problems while fixing existing ones, so we assume a failure introduction rate \( a_{3} \).

  6. In practice, owing to limited test resources, the skill and experience of the testers, and the varying severity of faults, not all detected faults can be eliminated during the test phase. We therefore introduce the non-removable failure rate \( c \).

    1) For Levels I and II, the faults are relatively simple and we assume they can be removed completely, so the non-removable failure rates are \( c_{1} = c_{2} = 0 \).

    2) For Level III, some faults cannot be eliminated by the development team, so we assume a non-removable failure rate \( c_{3} \).

Based on the above assumptions, we construct the model as follows, where the MVFs of Levels I, II, and III are denoted \( m_{1} (t) \), \( m_{2} (t) \), and \( m_{3} (t) \), respectively.

2.1 Level I SRGM

According to the above assumptions, and following [12, 13], the model for simple faults can be expressed as

$$ \frac{{\text{d} m_{1} (t)}}{{\text{d} t}} = b_{1} (t) \times [N - m_{1} (t)] $$
(1)

Since the rate at which testers detect minor problems is related only to the number of remaining faults, the number of residual faults in the software decreases as failures are corrected, and the probability of detection becomes lower and lower. We therefore assume that the failure detection rate for minor problems satisfies the following equation.

$$ b_{1} (t) = \frac{{b_{1} }}{{1 + t}} \quad (0 \le b_{1} \le 1) $$
(2)

In this formula, \( b_{1} \) denotes the detection rate of simple faults at the initial time.

Substituting (2) into (1) and solving (1) under the initial condition \( m_{1} (0) = 0 \), we obtain the MVF of Level I as follows

$$ m_{1} (t) = N[1 - (1 + t)^{{ - b_{1} }} ] $$
(3)
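
As a sanity check, the closed form in Eq. (3) can be verified numerically. The sketch below is our illustration, not the authors' code; the values of \( N \) and \( b_{1} \) are hypothetical. It integrates the ODE (1) with the detection rate (2) using SciPy and compares the result against Eq. (3).

```python
# Cross-check of the Level I closed form, Eq. (3), against the ODE (1)-(2).
import numpy as np
from scipy.integrate import solve_ivp

N, b1 = 231.0, 0.12  # hypothetical parameter values, for illustration only

def m1(t):
    """Closed-form Level I MVF, Eq. (3)."""
    return N * (1.0 - (1.0 + t) ** (-b1))

# dm/dt = b1/(1+t) * (N - m) with m(0) = 0, i.e. Eqs. (1) and (2)
sol = solve_ivp(lambda t, m: b1 / (1.0 + t) * (N - m),
                t_span=(0.0, 38.0), y0=[0.0],
                rtol=1e-9, atol=1e-9, dense_output=True)

ts = np.linspace(0.0, 38.0, 20)
assert np.allclose(sol.sol(ts)[0], m1(ts), atol=1e-4)
```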

2.2 Level II SRGM

Similarly, according to the above assumptions, the model for general (middle-severity) faults can be formulated as

$$ \frac{{\text{d} m_{2} (t)}}{{\text{d} t}} = b_{2} (t) \times [N - m_{2} (t)] $$
(4)

The logistic Testing-Effort Function (TEF) [14,15,16,17,18] describes extensive testing work well, so we use the ratio of the test coverage growth rate to the uncovered code to express the failure detection rate of Level II faults.

The logistic TEF formula for the period \( \left( {0,t} \right] \) is

$$ W(t) = \frac{{W_{\hbox{max} } }}{1 + A\exp ( - \alpha t)} $$
(5)

where \( A \) is a constant, \( \alpha \) is the consumption rate of testing effort, and \( W_{\hbox{max} } \) is the total testing effort eventually consumed. The TEF rate at test time \( t \) can be written as

$$ W^{\prime}(t) = \frac{{W_{\hbox{max} } A\alpha \exp ( - \alpha t)}}{{[1 + A\exp ( - \alpha t)]^{2} }} $$
(6)

For simplicity of calculation, we assume \( W_{\hbox{max} } = 1 \). The failure detection rate as a function of time can then be represented as

$$ b_{2} (t) = \frac{{W^{\prime}(t)}}{1 - W(t)} = \frac{\alpha }{1 + A\exp ( - \alpha t)} $$
(7)
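
The algebra behind Eq. (7) is easy to confirm symbolically. The following sketch is an illustration we added (assuming \( W_{\hbox{max} } = 1 \) as above); it checks that \( W'(t)/(1 - W(t)) \) reduces to \( \alpha /(1 + A\exp ( - \alpha t)) \).

```python
# Symbolic check of Eq. (7): b2(t) = W'(t) / (1 - W(t)) with W_max = 1.
import sympy as sp

t = sp.symbols('t', positive=True)
A, alpha = sp.symbols('A alpha', positive=True)

W = 1 / (1 + A * sp.exp(-alpha * t))   # logistic TEF, Eq. (5)
b2 = sp.diff(W, t) / (1 - W)           # definition used in Eq. (7)
assert sp.simplify(b2 - alpha / (1 + A * sp.exp(-alpha * t))) == 0
```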

As testers gain understanding of the software under test, they write better test cases and the failure detection rate increases; at the same time, as failures are corrected, fewer faults remain in the software and the probability of detection decreases. The fault detection rate at this stage is therefore affected by both factors.

Substituting (7) into (4) and solving (4) under the boundary condition \( m_{2} (0) = 0 \), we obtain the MVF of Level II as follows

$$ m_{2} (t) = \frac{N[\exp (\alpha t) - 1]}{A + \exp (\alpha t)} $$
(8)
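
As with Level I, Eq. (8) can be cross-checked against a numerical solution of the ODE. The sketch below is ours, with illustrative (hypothetical) parameter values.

```python
# Cross-check of the Level II closed form, Eq. (8), against ODE (4) with rate (7).
import numpy as np
from scipy.integrate import solve_ivp

N, A, alpha = 231.0, 5.0, 0.3  # hypothetical values

def b2(t):
    return alpha / (1.0 + A * np.exp(-alpha * t))                   # Eq. (7)

def m2(t):
    return N * (np.exp(alpha * t) - 1.0) / (A + np.exp(alpha * t))  # Eq. (8)

sol = solve_ivp(lambda t, m: b2(t) * (N - m), (0.0, 38.0), [0.0],
                rtol=1e-9, atol=1e-9, dense_output=True)
ts = np.linspace(0.0, 38.0, 20)
assert np.allclose(sol.sol(ts)[0], m2(ts), atol=1e-4)
```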

2.3 Level III SRGM

According to the above assumptions, the model for complex faults can be written as

$$ \frac{{dm_{3} (t)}}{dt} = b_{3} (t) \times [N(1 + a_{3} t) - m_{3} (t)] - c_{3} m_{3} (t) $$
(9)

According to assumption 2, the parameters of \( b_{2} (t) \) and \( b_{3} (t) \) should differ. Therefore, the failure detection rate of Level III as a function of time can be formulated as

$$ b_{3} (t) = \frac{{\alpha_{1} }}{{1 + A_{1} \exp ( - \alpha_{1} t)}} $$
(10)

Substituting (10) into (9) and solving (9) with the condition \( m_{3} (0) = 0 \), the MVF of Level III is

$$ \begin{aligned} m_{3} (t) = & \frac{{N\alpha_{1} }}{{[\exp (\alpha_{1} t) + A_{1} ](\alpha_{1} + c_{3} )^{2} \exp (c_{3} t)}} \times \\ & [(1 + a_{3} t)(\alpha_{1} + c_{3} )\exp (\alpha_{1} t + c_{3} t) - a_{3} \exp (\alpha_{1} t + c_{3} t) + a_{3} - \alpha_{1} - c_{3} ] \\ \end{aligned} $$
(11)
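
Because Eq. (11) combines imperfect debugging, fault introduction, and the logistic detection rate, a numerical check is especially useful here. The sketch below is our illustration (all parameter values hypothetical) and verifies Eq. (11) against direct integration of Eq. (9).

```python
# Cross-check of the Level III closed form, Eq. (11), against ODE (9) with rate (10).
import numpy as np
from scipy.integrate import solve_ivp

N, A1, alpha1, a3, c3 = 231.0, 5.0, 0.3, 0.05, 0.02  # hypothetical values

def b3(t):
    return alpha1 / (1.0 + A1 * np.exp(-alpha1 * t))  # Eq. (10)

def m3(t):
    """Closed-form Level III MVF, Eq. (11), with beta = alpha1 + c3."""
    beta = alpha1 + c3
    num = ((1.0 + a3 * t) * beta - a3) * np.exp(beta * t) + a3 - beta
    return N * alpha1 * num / ((np.exp(alpha1 * t) + A1) * beta ** 2 * np.exp(c3 * t))

sol = solve_ivp(lambda t, m: b3(t) * (N * (1.0 + a3 * t) - m) - c3 * m,
                (0.0, 38.0), [0.0], rtol=1e-10, atol=1e-10, dense_output=True)
ts = np.linspace(0.0, 38.0, 20)
assert np.allclose(sol.sol(ts)[0], m3(ts), atol=1e-3)
```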

Therefore, we can assign a different weight to the formula of each level and obtain the new SRGM as follows

$$ m(t) = \sum\limits_{i = 1}^{3} {p_{i} m_{i} (t)} $$
(12)

We call this the fault levels model. In Eq. (12), \( p_{1} \), \( p_{2} \) and \( p_{3} \) must satisfy \( \sum\nolimits_{i = 1}^{3} {p_{i} } = 1 \), and the initial fault numbers satisfy \( p_{i} N = N_{i} \;(i = 1,2,3) \).
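
Putting the pieces together, Eq. (12) is simply a weighted sum of the three level-wise MVFs. A minimal sketch (reusing the `m1`, `m2`, `m3` functions from the sketches above; the weights are hypothetical):

```python
# Combined fault levels model, Eq. (12): m(t) = p1*m1(t) + p2*m2(t) + p3*m3(t)
import numpy as np

p = np.array([0.4, 0.4, 0.2])   # hypothetical weights; they must sum to one
assert np.isclose(p.sum(), 1.0)

def m(t):
    """MVF of the fault levels model, reusing m1, m2, m3 defined above."""
    return p[0] * m1(t) + p[1] * m2(t) + p[2] * m3(t)
```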

3 Model Comparison Criteria

We analyze the models in terms of their ability to fit observed software failure data and their ability to predict future software behavior from the observed failure data sets. The four model comparison criteria are:

3.1 The Fitting Effect Criterion

To compare quantitatively how well the models fit the data, we use the Sum of Squared Errors (SSE), the Mean Square of Fitting Errors (MSE), and R-square (R) [19,20,21,22].

MSE.

The MSE formula is shown below

$$ MSE = \frac{{\sum\limits_{i = 1}^{k} {(m(t_{i} ) - m_{i} )^{2} } }}{k} $$
(13)

The smaller the value of MSE, the lower the fitting error, and the better the performance.

SSE.

The calculation formula of SSE is as follows

$$ SSE = \sum\limits_{i = 1}^{k} {(m(t_{i} ) - m_{i} )^{2} } $$
(14)

Similarly, the smaller the value of SSE, the lower the fitting error and the better the performance.

R.

The formula for R is

$$ R = 1 - \frac{{\sum\limits_{i = 1}^{k} {(m(t_{i} ) - m_{i} )^{2} } }}{{\sum\limits_{i = 1}^{k} {(m_{i} - m_{ave} )^{2} } }} $$
(15)

Unlike the two criteria above, the closer the value of R is to one, the better the fitting effect.
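
All three fitting criteria are straightforward to compute. A small helper (ours, not from the paper) implementing Eqs. (13)-(15):

```python
import numpy as np

def fit_metrics(m_fit, m_obs):
    """Return SSE (Eq. 14), MSE (Eq. 13) and R-square (Eq. 15)."""
    m_fit = np.asarray(m_fit, dtype=float)   # fitted values m(t_i)
    m_obs = np.asarray(m_obs, dtype=float)   # observed cumulative failures m_i
    sse = float(np.sum((m_fit - m_obs) ** 2))
    mse = sse / m_obs.size
    r = 1.0 - sse / float(np.sum((m_obs - m_obs.mean()) ** 2))
    return sse, mse, r
```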

3.2 The Predictive Goodness Criterion

The ability of a model to predict future failure behavior from the current failure data is called predictive validity. Musa [8, 23, 24] proposed calculating the Relative Error (RE) on the data set to represent predictive validity.

$$ RE = \frac{{\hat{m}(t_{q} ) - q}}{q} $$
(16)

First, assume that \( q \) faults have been found by the end of the test time \( t_{q} \). We use the failure data up to time \( t_{e} \;(t_{e} \le t_{q}) \) to estimate the parameters of \( m(t) \). Substituting these parameter estimates into the MVF gives the predicted number of failures \( \hat{m}(t_{q}) \) at time \( t_{q} \). Second, we compare the predicted value with the actual number \( q \). Third, we repeat the process for different values of \( t_{e} \). Predictive validity can be assessed by plotting the relative errors for the different \( t_{e} \) values: the closer they are to zero, the better the prediction. A positive error indicates overestimation; a negative error indicates underestimation.
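
The procedure translates directly into code. The sketch below is our illustration, not the authors' implementation; `fit_model` is a hypothetical helper that performs the LSE fit of Sect. 4 on the truncated data and returns a fitted MVF.

```python
import numpy as np

def relative_errors(weeks, failures, te_values, fit_model):
    """RE of Eq. (16) for each truncation point t_e (fit_model is hypothetical)."""
    t_q, q = weeks[-1], failures[-1]       # end of test and total faults found
    re = []
    for t_e in te_values:
        keep = weeks <= t_e                # use only data observed up to t_e
        m_hat = fit_model(weeks[keep], failures[keep])  # refitted MVF
        re.append((m_hat(t_q) - q) / q)    # Eq. (16)
    return np.array(re)
```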

4 Model Simulation and Result Analysis

In this section, the proposed model is tested on two real data sets to evaluate its validity. On each data set, the models that perform best, together with classical models, serve as comparison models. We use PyCharm as the simulation platform. The Least Squares Estimation (LSE) method is used to estimate the model parameters [4, 25, 26], and the estimates produced by LSE are unbiased.
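
As one concrete way to realize the LSE step, the sketch below fits the Level I parameters of Eq. (3) with SciPy's nonlinear least squares. It is not the authors' code: the data here are synthetic stand-ins, and a real run would use DS-I or DS-II instead.

```python
# LSE parameter estimation, sketched for the Level I model of Eq. (3).
import numpy as np
from scipy.optimize import curve_fit

def m1(t, N, b1):
    """Eq. (3), with N and b1 as the parameters to estimate."""
    return N * (1.0 - (1.0 + t) ** (-b1))

weeks = np.arange(1.0, 39.0)   # a 38-week test, as in DS-I
rng = np.random.default_rng(0)
failures = m1(weeks, 231.0, 0.5) + rng.normal(0.0, 2.0, weeks.size)  # synthetic

(N_hat, b1_hat), _ = curve_fit(m1, weeks, failures, p0=[200.0, 0.3])
print(f"N = {N_hat:.1f}, b1 = {b1_hat:.3f}")
```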

4.1 Data Set I

The data come from Misra and consist of the failure data of software developed under a contract between IBM's Federal Systems Division and NASA's Johnson Space Center [27, 28]. The software was tested for 38 weeks, during which 2456.4 computer hours were used and 231 faults were removed. Faults were classified by severity when the failure data were recorded. The proposed model is compared with the model of Kapur et al. [29, 30], who also used this data set in their experiments.

Analysis of Fitting Results.

The parameter estimates and goodness-of-fit results for the proposed SRGM are given in Table 1. The proposed model has smaller SSE and MSE values than the SDE model, and the two models have the same value of R. From the weight coefficients of the proposed model, we find that Level I and Level II faults account for a significant proportion of DS-I, while Level III faults account for a small percentage; the SDE model reflects the same phenomenon. The fitting results of both models are close to the original data set, which further supports the validity of the model. Moreover, the total number of faults fitted by the proposed model is closer to the total number of faults in the original data set than that of the SDE model. Figure 1 compares the fitted values \( m(t_{i} ) \) of the two models with the actual observed failure data \( m_{i} \) for DS-I. As Fig. 1 shows, the fitting results of both models essentially coincide with the real data. Taken together, Table 1 and Fig. 1 indicate that the proposed model fits DS-I better.

Table 1. Fitting parameters for DS-I.
Fig. 1. Goodness of fit curves for DS-I.

Analysis of Prediction Results.

We train on the failure data of the first 22 weeks and compare the predicted values with the real values to obtain the RE curves in Fig. 2. Compared with the SDE model, the RE values of the proposed model are the closest to zero overall, which means the proposed model predicts more accurately.

Fig. 2. RE curves on DS-I.

4.2 Data Set II

The Ohba data set comes from a paper by Ohba on a database software system containing approximately 1,317,000 lines of code [12, 31, 32]. The software was tested for 19 weeks, during which 47.65 computer hours were used and 328 faults were removed. Unlike DS-I, DS-II does not classify failures in its recorded failure data. We therefore choose three models that have been tested on DS-II to compare with the proposed model, to show that the proposed model performs well on both classified and unclassified data sets.

Analysis of Fitting Results.

Table 2 lists the parameter estimates of the different models on DS-II, including the G-O model and the traditional Yamada Delayed S-Shaped model, together with the values of SSE, MSE, and R. The proposed model has the smallest SSE and MSE, and its value of R is the closest to one among the compared SRGMs.

Table 2. Fitting parameters for DS-II.

Unlike on DS-I, we find that Level II and Level III faults account for a significant proportion of DS-II, while Level I faults account for a small percentage. This is because different software development environments and application scenarios have different fault level distributions.

To show the fitting effect clearly, we chose the two models with the best fit for comparison: the proposed model and the Improved G-O model. Figure 3 compares the fitted values \( m(t_{i} ) \) of these two models with the actual observed failure data \( m_{i} \) on DS-II. On average, the proposed model fits the data set better.

Fig. 3. Goodness of fit curves for DS-II.

Analysis of Prediction Results.

For DS-II, we train on the failure data of the first 12 weeks and compare the predicted values with the real values to obtain the RE curves in Fig. 4. For ease of observation, Fig. 4 depicts the RE curves of the two models with the better fitting effect on DS-II. It is worth noting that the curve of the Improved G-O model deviates from zero by a large margin, whereas the RE values of the proposed model are the closest to zero overall and approach zero fastest after week eighteen. This indicates that the proposed model has excellent prediction results on DS-II.

Fig. 4. RE curves on DS-II.

5 Conclusion

This paper presents a new SRGM based on three levels of fault severity. The model not only considers the existence of more than one type of software failure but also accounts for the possibility of imperfect debugging in the real world by introducing a failure introduction rate and a non-removable failure rate. This makes the model more consistent with real situations while keeping its calculations simple, which eases porting and application. The simulation results on two different types of data sets show that, compared with previous methods, fault severity classification can effectively improve the fitting and prediction performance of traditional software reliability models, which plays an essential role in both the theoretical study and the engineering application of software reliability models.