1 Introduction

Aiming at enhancing the dependability and reducing outages of software systems, the studies on software aging and rejuvenation have been extensively taken over the last three decades. The software aging refers to the phenomenon that software system performance degrades continuously with time and system failure occurs suddenly. Although it is known that the software aging is usually caused by some aging-related software bugs, such as the memory leak and the accumulation of round-off errors, remaining in the system, it is difficult to remove such bugs completely in the software development and testing phases, due to their low reproducibility. The software rejuvenation is a proactive software control technique to prevent the system performance degradation and other associated failures related to software aging by refreshing the system condition occasionally or periodically before the system failure, e.g., a hardware reboot. The execution of rejuvenation typically bring an overhead, but it may prevent more severe failures from occurring.

The concept of software aging and rejuvenation was first introduced by Huang et al. (1995). They presented a simple continuous-time Markov chain (CTMC) model with four states (normal, failure-probable, failure, and rejuvenation states) of a software system. Since then, many authors have proposed various software rejuvenation policies for counteracting the software aging. For example, Garg et al. (1995) brought an idea of periodic rejuvenation (i.e., deterministic rejuvenation trigger interval) into Huang et al.’s model. In their assumption, the degraded system is renewed and the system state becomes normal immediately after the repair operation. However, there is the fact that, after the completion of the repair operation, some cleanup and the resume of process execution at the checkpointed state may be necessary before restarting the system; thus, another periodic rejuvenation model based on Garg et al.’s model was proposed by Suzuki et al. (2002a, 2002b). As a result, Garg et al.’s model and Suzuki et al.’s model are regarded as two fundamental software rejuvenation models in software reliability engineering, since they have been widely studied and evaluated over years and can be extended to various complicated rejuvenation models.

To our knowledge, many researches focused on solving steady-state solutions and analyzing steady-state measures for software rejuvenation models such as the steady-state availability (Trivedi et al. 2000; Xie et al. 2005; Bao et al. 2005; Thein and Park 2009; Rezaei and Sharifi 2010; Dohi and Okamura 2016; Zheng et al. 2017). The steady-state measure, in general, is appropriate when the software systems have a very long lifetime. The steady-state availability is one of the typical steady-state measures, indicating the long-run probability when the system operation time tends to infinity. On the other hand, there is the case where one wishes to ensure the system availability for a specific time period. For example, it is important for a typical enterprise system with attendance management to be available during office-going hours, and it is not necessary to work for a whole day. In such a situation, transient measures are more appropriate than the steady-state measures. In fact, Hosford (1960) first proposed the concept of the interval availability and defined it as the fraction of time duration when a system is operational over a finite time horizon. Rubino and Sericola (1992) analyzed the interval availability distribution of repairable computer systems using Markov models.

This paper focuses on the interval reliability for software systems with periodic rejuvenation. The interval reliability is comprehensive and is known as one of the most generalized dependability measures, which involve both system reliability and system availability such as the steady-state availability and the pointwise availability. It assesses an ability of an operable software system over a specific time interval without failure and is helpful for the system design during a fixed mission time period. However, from the perspective of software aging and rejuvenation, there are a few works concerning the interval reliability, since the formulation of the interval reliability leads to the transient analysis of the software rejuvenation models. The transient analysis is carried out to obtain the state probabilities of the system at an arbitrary time, so that it is a relatively much hard work compared with the steady-state analysis. Consequently, to date, the transient analysis is still a challenging issue when one focuses on the short period for system operation. Several studies by Dohi et al. (2010, 2012, 2018) tried to solve the transient solutions by the usage of Laplace-Stieltjes transform (LST) and its numerical inversion. However, the main problem is that the numerical inversion of LST is not stable, in other words, it is not easy to control the round-off and truncation errors in the numerical inversion of LST.

In this paper, we evaluate the interval reliability for software rejuvenation based on stochastic models through the transient analysis with the phase expansion approach. Concretely, the interval reliability is formulated via the transient analysis of two software rejuvenation models introduced by Garg et al. (1995) and Suzuki et al. (2002a, 2002b). The transient analysis is easily conducted by using the phase expansion, which is a method to approximate an arbitrary general distribution with phase-type (PH) distribution (Okamura and Dohi 2016a). Since the PH distribution is defined by a CTMC with absorbing states, the approximate (phase-expanded) model reduces to the “equivalent” CTMC. It is well-known that the transient analysis of CTMC is computationally feasible via some numerical methods, e.g., uniformization. Based on the phase expansion, this paper also considers two special cases of interval reliability: the reliability function and the pointwise availability. The highlights of this paper are as follows:

  • Coverage of the analytical transient solutions for both two fundamental software rejuvenation models in software reliability engineering;

  • Formulation and evaluation of the transient behavior in terms of the interval reliability for two rejuvenation models using a phase expansion approach;

  • Comparison of the interval reliabilities between two rejuvenation models.

The rest of this paper is organized as follows. In Section 2, we give an overview of the previous works on the analysis of software systems with rejuvenation. In Section 3, we introduce two fundamental software rejuvenation models using Markov regenerative models and their behavioral analysis with LST. Section 4 presents the transient analysis of software rejuvenation models with phase expansion. In Section 5, we formulate the interval reliability of software rejuvenation models in the context of phase-expanded CTMC models. Section 6 is devoted to numerical experiments. Finally, this paper is concluded with some remarks in Section 7.

2 Related works

Huang et al. (1995) first introduced the concept of software aging and rejuvenation in 1995. In Huang et al. (1995), a simple stochastic model for analyzing software rejuvenation in a continuously running telecommunication application was presented. They also expressed the expected downtime and expected cost due to the overhead of rejuvenation, aiming to handle transient software failures. Afterwards, Garg et al. (1995) proposed a periodic time-based rejuvenation policy upon Huang et al.’s model (1995). To deal with the deterministic rejuvenation trigger interval, the dynamics of system behavior were characterized by a Markov regenerative stochastic Petri net (MRSPN) model. Both models in Huang et al. (1995) and Garg et al. (1995) were extensively analyzed by Dohi et al. (2000a, 2000b, 2001) and Suzuki et al. (2002a). They showed that Huang et al.’s model (1995) and Garg et al.’s model (1995) could be extended to the well-known non-Markovian processes, i.e., the semi-Markov process (SMP) and the Markov regenerative process (MRGP). On the other hand, several papers (Okamura et al. 2001; Vaidyanathan and Trivedi 2005; Bruneo et al. 2013) discussed the workload-based rejuvenation policies where the rejuvenation timing is controlled by the number of processed transactions, namely, the rejuvenation is triggered when the number of processed transactions exceeds a threshold.

Since the frequent software rejuvenation typically involves much overheads, there is a tradeoff on the system performance between the execution frequency of rejuvenation and its associated overhead. In general, low system performance results in low system dependability. For example, the system hanging due to software aging extremely decreases the system dependability such as the system availability. Thus, it is important to determine the optimal rejuvenation policy based on the system performance indices, as well as the system dependability measures. In many papers, the system dependability measures or performance indices are derived by the means of the steady-state analysis of the stochastic models. Thein and Park (2009), Rezaei and Sharifi (2010), and Machida et al. (2013) derived the steady-state availability of a virtualized system with rejuvenation. Trivedi et al. (2000) discussed availability models with rejuvenation to evaluate the effectiveness of proactive fault management in operational software systems and determined the optimal times to perform rejuvenation under different scenarios. Xie et al. (2005) and Bao et al. (2005) described SMP models for degraded software systems with rejuvenation and determined the optimal rejuvenation schedule that maximizes the system availability. Recently, Dohi and Okamura (2016) considered an operational software system with multi-stage degradation levels due to software aging and derived the optimal dynamic software rejuvenation policy that maximizes the steady-state system availability via the semi-Markov decision process. Zheng et al. (2017) comprehensively evaluated six software rejuvenation policies for transaction systems in terms of steady-state performance measures, i.e., loss probability of transactions and the mean response time, under the environment where the arrival stream of system follows a Markovian arrival process.

On the contrary, only a few works focused on the transient measures for the systems considering software aging and rejuvenation, because it is well-known that the transient analysis of the software rejuvenation models is more difficult than the steady-state analysis and requires the state probabilities at an arbitrary time. The computation of state probabilities is not so easy even if the number of states in the model is small. One solution of solving the transient case is to derive the LSTs of state probabilities and take place the numerical inversion of the LSTs by Durbin’s algorithm (Durbin 1974; Dubner and Abate 1968). For example, Dohi et al. (2010, 2012) formulated the pointwise availability and steady-state interval reliability of Huang et al.’s model (1995) and other variants by LSTs and their numerical inversion. Recently, Dohi et al. (2018) further examined numerically the transient behavior of the interval reliability of software rejuvenation models at an arbitrary operation time using the inversion of the LSTs. However, since Durbin’s algorithm is based on a Fourier series expansion and requires some judicious choices of the parameters in his algorithm for rapid convergence, the biggest disadvantage of Durbin’s method is the dependence of the discretization and truncation errors on the free parameters. In other words, by a suitable choice of the parameters, the discretization error becomes arbitrarily small, but the truncation error grows to infinity at the same time and vice versa. Therefore, it should be pointed out that, the numerical inversion of LST is not stable, namely, it is not easy to control the round-off and truncation errors in the numerical inversion of LST. Alternatively, Okamura and Dohi (2016a) developed the fast PH fitting algorithm from the probability density function (p.d.f.) of general distribution. This enables us to perform the transient analysis more accurately. In fact, Okamura et al. (2014) demonstrated the transient analysis for the rejuvenation model of a virtualized system with phase-expanded CTMC models. Also, in Okamura and Dohi (2016b), they discussed the transient analysis of a basic software rejuvenation model introduced by Garg et al. (1995) and formulated the interval reliability for software rejuvenation.

This paper extends and improves and the work of Okamura and Dohi (2016b) to cover both two fundamental software rejuvenation models in software reliability engineering. Apart from Garg et al.’s model, this paper also formulates and evaluates the transient behavior of the interval reliability of another fundamental software rejuvenation model introduced by Suzuki et al. (2002a, 2002b) using the phase expansion, and compare the reliability measures between two basic rejuvenation models.

3 Software rejuvenation models

3.1 Model description

3.1.1 Model-I

Consider the stochastic model with periodic software rejuvenation (Model-I, see Fig. 1) similar to Garg et al. (1995). We suppose three degradation levels: highly robust state, failure probable state, and degradation failure state. Note that the degradation failure is regarded as the system failure, e.g., a sudden crash of software application. All system states are defined as follows:

State 0::

highly robust state (normal operation state)

State 1::

failure probable state (degraded state)

State 2::

software rejuvenation state from failure probable state

State 3::

failure state

State 4::

software rejuvenation state from highly robust state.

Without loss of generality, the software system is supposed to start at time t = 0 at the highly robust state (State 0: normal operation state). From State 0, the system performance degrades with the operation time. Let Z0 be the random time to reach the failure probable state (State 1: degraded state) from State 0 with the cumulative distribution function (c.d.f.) F0(t) = P(Z0t). In the degraded state, the system may fail with a positive probability. Let X be the time to failure from State 1 having the c.d.f. Ff(t) = P(Xt). Whenever the system failure occurs, the system state enters State 3 where the system is forced to undergo the repair operation immediately. Let Y denote the time to complete the repair operation with the c.d.f. Fa(t) = P(Yt). After the completion of repair operation, the system state is assumed to be as good as new and returns back to State 0 from State 3.

Fig. 1
figure 1

State transition diagram (Model-I)

From the aspect of prevention of the system failure, the rejuvenation of system may be performed at a random time interval measured from the start of the software in State 0. Thus, we define Fr(t) as the c.d.f. of the time to execute the software rejuvenation and Fc(t) as the c.d.f. of the time to complete the software rejuvenation. After the completion of the software rejuvenation, the software system also becomes as good as new and goes back to State 0 from State 2 (4). Generally speaking, the age of software system is renewed when the state becomes State 0 again due to either the repair operation or the rejuvenation, and the time clock of rejuvenation trigger is also reset when the system enters the highly robust state. We call the above state transition process as an MRGP (Kulkarni 1995) since the time clock of rejuvenation trigger does not reset when the system degrades to the failure probable state.

In the MRGP of Model-I illustrated in Fig. 1, the regeneration points and non-regeneration point are distinguished by circles (0, 2, 3, 4) and square (1), respectively. Model-I shows a very basic stochastic model with periodic software rejuvenation, which is essentially the same as that in Garg et al. (1995). Garg et al. (1995) numerically presented this model using an MRSPN.

3.1.2 Model-II

Consider a different model (Model-II, see Fig. 2) from Model-I introduced in Suzuki et al. (2002a, 2002b). In Model-I, it is assumed that the system is renewed and the state moves to State 0 immediately after the completion of the repair operation. However, after the completion of the repair operation, some cleanup and the resume of process execution at the checkpointed state may be necessary before restarting the system; thus, in Model-II, an additional software rejuvenation after completing the repair operation is adopted to renew the system.

Fig. 2
figure 2

State transition diagram (Model-II)

Figure 2 shows the state transition diagram of Model-II. The only difference between Model-I and Model-II is that the system is rejuvenated just after a system failure recovery in Model-II.

3.2 Behavioral analysis

For both Model-I and Model-II, Dohi et al. (2010) presented the behavioral analysis in terms of LST. Let Kij(t) be the probability that the system is in State j by time t when the state of system begins with regenerative State i at t = 0. Also let qij(s) be the LST of Kij(t), i.e.,

$$ q_{ij}(s) ={\int}_{0}^{\infty}e^{-st}dK_{ij}(t). $$
(1)

Further, let \(K^{(k)}_{ij}(t)\) and \(q^{(k)}_{ij}(t)\) be the probability that the system enters State j by time t when starting from State i via a non-regeneration State k and its LST, respectively.

3.2.1 Model-I

For the Model-I, we obtain

$$ \begin{array}{@{}rcl@{}} q_{01}(s) &=& {\int}_{0}^{\infty} e^{-st} \overline{F}_{r}(t) dF_{0}(t), \end{array} $$
(2)
$$ \begin{array}{@{}rcl@{}} q_{02}^{(1)}(s) &=& {\int}_{0}^{\infty} e^{-st} (F_{0} * \overline{F}_{f})(t) dF_{r}(t), \end{array} $$
(3)
$$ \begin{array}{@{}rcl@{}} q_{03}^{(1)}(s) &=&{\int}_{0}^{\infty} e^{-st} \overline{F}_{r}(t) d (F_{0}*F_{f}) (t), \end{array} $$
(4)
$$ \begin{array}{@{}rcl@{}} q_{04}(s) &=&{\int}_{0}^{\infty} e^{-st} \overline{F}_{0}(t) dF_{r}(t), \end{array} $$
(5)
$$ \begin{array}{@{}rcl@{}} q_{20}(s) &=&{\int}_{0}^{\infty} e^{-st} dF_{c}(t), \end{array} $$
(6)
$$ \begin{array}{@{}rcl@{}} q_{30}(s) &=&{\int}_{0}^{\infty} e^{-st} dF_{a}(t), \end{array} $$
(7)
$$ \begin{array}{@{}rcl@{}} q_{40}(s) &=&{\int}_{0}^{\infty} e^{-st} dF_{c}(t), \end{array} $$
(8)

where \((F_{0}* \overline {F}_{f})(t)\) and (F0Ff)(t) are Stieltjes convolution functions which are given by

$$ \begin{array}{@{}rcl@{}} (F_{0} * \overline{F}_{f})(t) &=& {{\int}_{0}^{t}} \overline{F}_{f}(t-u) dF_{0}(u), \end{array} $$
(9)
$$ \begin{array}{@{}rcl@{}} (F_{0} * F_{f})(t) &=& {{\int}_{0}^{t}} F_{f}(t-u) dF_{0}(u). \end{array} $$
(10)

Defining H00(t) as the recurrence time distribution from State 0 to State 0, we then obtain its LST

$$ \begin{array}{@{}rcl@{}} h_{00}(s) &=& {\int}_{0}^{\infty} e^{-st} dH_{00}(t) \\ &=& q_{02}^{(1)}(s)q_{20}(s)+q_{03}^{(1)}(s)q_{30}(s)+q_{04}(s)q_{40}(s). \end{array} $$
(11)

The objective of our concern is to derive the transient probability of staying in State j at an arbitrary time, provided that the initial state at time t = 0 is State 0. Let P0j(t) denote the transient probability from State 0 to State j, then, its LST is given by \(p_{0j}(s)={\int }_{0}^{\infty } e^{-st} dP_{0j}(t)\). After some manipulations, we have

$$ \begin{array}{@{}rcl@{}} p_{00}(s)&=&\frac{1-q_{01}(s)-q_{04}(s)}{1-h_{00}(s)}, \end{array} $$
(12)
$$ \begin{array}{@{}rcl@{}} p_{01}(s)&=&\frac{q_{01}(s)-q_{02}^{(1)}(s)-q_{03}^{(1)}(s)}{1-h_{00}(s)}, \end{array} $$
(13)
$$ \begin{array}{@{}rcl@{}} p_{02}(s)&=&\frac{q_{02}^{(1)}(s)(1-q_{20}(s))}{1-h_{00}(s)}, \end{array} $$
(14)
$$ \begin{array}{@{}rcl@{}} p_{03}(s)&=&\frac{q_{03}^{(1)}(s)(1-q_{30}(s))}{1-h_{00}(s)}, \end{array} $$
(15)
$$ \begin{array}{@{}rcl@{}} p_{04}(s)&=&\frac{q_{04}(s)(1-q_{40}(s))}{1-h_{00}(s)}. \end{array} $$
(16)

3.2.2 Model-II

Similarly, for the Model-II, the LSTs of the probabilities that the system becomes State j by time t from State i are

$$ \begin{array}{@{}rcl@{}} q_{01}(s) &=& {\int}_{0}^{\infty} e^{-st} \overline{F}_{r}(t) dF_{0}(t), \end{array} $$
(17)
$$ \begin{array}{@{}rcl@{}} q_{02}^{(1)}(s) &=& {\int}_{0}^{\infty} e^{-st} (F_{0} * \overline{F}_{f})(t) dF_{r}(t), \end{array} $$
(18)
$$ \begin{array}{@{}rcl@{}} q_{03}^{(1)}(s) &=&{\int}_{0}^{\infty} e^{-st} \overline{F}_{r}(t) d (F_{0}*F_{f}) (t), \end{array} $$
(19)
$$ \begin{array}{@{}rcl@{}} q_{04}(s) &=&{\int}_{0}^{\infty} e^{-st} \overline{F}_{0}(t) dF_{r}(t), \end{array} $$
(20)
$$ \begin{array}{@{}rcl@{}} q_{20}(s) &=&{\int}_{0}^{\infty} e^{-st} dF_{c}(t),\end{array} $$
(21)
$$ \begin{array}{@{}rcl@{}} q_{32}(s) &=&{\int}_{0}^{\infty} e^{-st} dF_{a}(t), \end{array} $$
(22)
$$ \begin{array}{@{}rcl@{}} q_{40}(s) &=&{\int}_{0}^{\infty} e^{-st} dF_{c}(t). \end{array} $$
(23)

Then, we have the LST of the recurrence time distribution under Model-II

$$ \begin{array}{@{}rcl@{}} h_{00}(s) &=& {\int}_{0}^{\infty} e^{-st} dH_{00}(t) \\ &=& \left( q_{02}^{(1)}(s) + q_{03}^{(1)}(s)q_{32}(s)\right)q_{20}(s)+q_{04}(s)q_{40}(s), \end{array} $$
(24)

and the LSTs of the transient probabilities beginning from State 0

$$ \begin{array}{@{}rcl@{}} p_{00}(s)&=&\frac{1-q_{01}(s)-q_{04}(s)}{1-h_{00}(s)}, \end{array} $$
(25)
$$ \begin{array}{@{}rcl@{}} p_{01}(s)&=&\frac{q_{01}(s)-q_{02}^{(1)}(s)-q_{03}^{(1)}(s)}{1-h_{00}(s)}, \end{array} $$
(26)
$$ \begin{array}{@{}rcl@{}} p_{02}(s)&=&\frac{q_{02}^{(1)}(s)(1-q_{20}(s))}{1-h_{00}(s)}, \end{array} $$
(27)
$$ \begin{array}{@{}rcl@{}} p_{03}(s)&=&\frac{q_{03}^{(1)}(s)(1-q_{32}(s)q_{20}(s))}{1-h_{00}(s)}, \end{array} $$
(28)
$$ \begin{array}{@{}rcl@{}} p_{04}(s)&=&\frac{q_{04}(s)(1-q_{40}(s))}{1-h_{00}(s)}. \end{array} $$
(29)

Dohi et al. (2010, 2018) discussed the transient analysis based on the above LSTs by applying Durbin’s algorithm (Durbin 1974; Dubner and Abate 1968) to solve the inversion of LSTs. But unfortunately, there are several problems about LST and its inversion in practice (Okamura and Dohi 2016b). At first, the LST is not always expressed in a closed form using elementary functions. For instance, the LST of Weibull distribution includes the special function such as gamma function. Therefore, even if we know the LSTs of transient probabilities, it is practically impossible to solve the inversion of LSTs analytically. On the other hand, it is known that the numerical methods for the inversion of LST or equivalently the inversion of Laplace transform are not stable, i.e., they are sensitive to round-off errors. For example, Durbin’s algorithm needs some judicious choices of the parameters in the algorithm for rapid convergence, which may bring large truncation error easily.

4 Transient analysis with phase expansion

Here, we apply the phase expansion to seek the transient solutions of rejuvenation models. The phase expansion, also called PH approximation, is an approximation technique to replace general distributions (i.e., non-exponential distributions) by PH distributions (Okamura and Dohi 2016a). The PH distribution is defined by the probability distribution of the absorbing time in a CTMC with absorbing states. Since the PH distribution is dense for any probability distribution defined on positive domain (Asmussen and Koole 1993), the PH distribution can approximate any probability distribution with high precision. In this way, the original non-exponential model such as MRGP can be approximated by an phase-expanded CTMC. In other words, by solving the “equivalent” CTMC, we can obtain the solution of the original non-exponential model. There are, in general, several stable methods to compute the transient solution of phase-expanded CTMC such as uniformization, moment match, least square method of the difference of p.d.f.s, and maximum likelihood estimation, so we can apply them to solve the approximation pattern. Thus, the transient solutions of MRGPs of software rejuvenation can be obtained through solving an “equivalent” CTMC.

Consider a phase-expanded CTMC. Without loss of generality, the infinitesimal generator Q of the CTMC is assumed to be partitioned as follows:

(30)

where T is an infinitesimal generator of the underlying CTMC over transient states (non-absorbing states) and τ is a column vector whose elements represent the transition rates from transient states to the absorbing ones. In general, the p.d.f. and c.d.f. of PH distribution are represented by

$$ f_{PH}(t) = \boldsymbol \alpha \exp(\boldsymbol T t) \boldsymbol \tau, \quad F_{PH}(t) = 1 - \boldsymbol \alpha \exp(\boldsymbol T t) \boldsymbol 1, $$
(31)

where α is the probability (row) vector that determines the initial state of the underlying CTMC, and 1 is a column vector whose elements are 1. Note that ξ = −T1 from the property of CTMC. The transient states of the underlying CTMC are called phases. The accuracy of approximation depends on the number of phases.

For the stochastic models with periodic software rejuvenation (Model-I and Model-II) in Section 3, the probability distributions F0(t), Ff(t), Fa(t), Fr(t), and Fc(t) are approximated by the following PH distributions:

$$ \begin{array}{@{}rcl@{}} F_{0}(t) &\approx& 1 - \boldsymbol \alpha_{0} \exp(\boldsymbol T_{0} t) \boldsymbol 1_{0}, \end{array} $$
(32)
$$ \begin{array}{@{}rcl@{}} F_{f}(t) &\approx& 1 - \boldsymbol \alpha_{f} \exp(\boldsymbol T_{f} t) \boldsymbol 1_{f}, \end{array} $$
(33)
$$ \begin{array}{@{}rcl@{}} F_{a}(t) &\approx& 1 - \boldsymbol \alpha_{a} \exp(\boldsymbol T_{a} t) \boldsymbol 1_{a}, \end{array} $$
(34)
$$ \begin{array}{@{}rcl@{}} F_{r}(t) &\approx& 1 - \boldsymbol \alpha_{r} \exp(\boldsymbol T_{r} t) \boldsymbol 1_{r}, \end{array} $$
(35)
$$ \begin{array}{@{}rcl@{}} F_{c}(t) &\approx& 1 - \boldsymbol \alpha_{c} \exp(\boldsymbol T_{c} t) \boldsymbol 1_{c}, \end{array} $$
(36)

where 10, 1f, 1a, 1r, and 1c are the 1’s column vectors, whose sizes are determined by the numbers of phases of the respective PH distributions. Thus, we obtain the following approximate CTMCs for the rejuvenation models via using PH distributions.

4.1 Model-I

The Model-I is approximated by the CTMC with the following infinitesimal generator

$$ \boldsymbol Q = \left( \begin{array}{lllll} \boldsymbol T_{0} \oplus \boldsymbol T_{r} & (\boldsymbol \tau_{0} \boldsymbol \alpha_{f}) \otimes \boldsymbol I_{r,r} & \boldsymbol 1_{0} \otimes (\boldsymbol \tau_{r} \boldsymbol \alpha_{c}) \\ & \boldsymbol T_{f} \oplus \boldsymbol T_{r} & \boldsymbol 1_{f} \otimes (\boldsymbol \tau_{r} \boldsymbol \alpha_{c}) & (\boldsymbol \tau_{f} \boldsymbol \alpha_{a}) \otimes \boldsymbol 1_{r} \\ \boldsymbol \tau_{c} (\boldsymbol \alpha_{0} \otimes \boldsymbol \alpha_{r}) & & \boldsymbol T_{c} & \\ \boldsymbol \tau_{a} (\boldsymbol \alpha_{0} \otimes \boldsymbol \alpha_{r}) & & & \boldsymbol T_{a} \end{array}\right) $$
(37)

where

$$ \begin{array}{@{}rcl@{}} \boldsymbol \tau_{0} &=& -\boldsymbol T_{0} \boldsymbol 1_{0}, \quad \boldsymbol \tau_{f} = -\boldsymbol T_{f} \boldsymbol 1_{f}, \quad \boldsymbol \tau_{a} = -\boldsymbol T_{a} \boldsymbol 1_{a}, \end{array} $$
(38)
$$ \begin{array}{@{}rcl@{}} \boldsymbol \tau_{r} &=& -\boldsymbol T_{r} \boldsymbol 1_{r}, \quad \boldsymbol \tau_{c} = -\boldsymbol T_{c} \boldsymbol 1_{c}. \end{array} $$
(39)

Also, Ir,r is the identity matrix having the same dimension as Tr. Moreover, ⊗ and ⊕ are operators of Kronecker product and sum, respectively. Each block of Q corresponds to State 0, State 1, State 2, or State 3. Note that State 4 is the same as State 2, since they both refer to software rejuvenation state. When the system starts from State 0, the initial probability vector for the above CTMC is given by

$$ \boldsymbol \pi_{0} = \left( \begin{array}{llll} \boldsymbol \alpha_{0} \otimes \boldsymbol \alpha_{r} & \boldsymbol 0 & \boldsymbol 0 & \boldsymbol 0 \end{array}\right). $$
(40)

From the argument of CTMC, the transient probability vector of the approximate CTMC at time t is computed by

$$ \boldsymbol \pi(t) = \boldsymbol \pi_{0} \exp(\boldsymbol Q t) $$
(41)

and then, the transient probabilities are approximately obtained by

$$ \begin{array}{@{}rcl@{}} &&\left( \begin{array}{lllll} P_{00}(t) & P_{01}(t) & P_{02}(t) & P_{03}(t) \end{array}\right)\\ &&= \boldsymbol \pi(t) \left( \begin{array}{lllll} \boldsymbol 1_{0} \otimes \boldsymbol 1_{r} \\ & \boldsymbol 1_{f} \otimes \boldsymbol 1_{r} \\ & & \boldsymbol 1_{c} \\ & & & \boldsymbol 1_{a} \end{array}\right). \end{array} $$
(42)

4.2 Model-II

For Model-II, the approximate CTMC is given with the following infinitesimal generator

$$ \begin{array}{@{}rcl@{}} \boldsymbol Q = \left( \begin{array}{lllll} \boldsymbol T_{0} \oplus \boldsymbol T_{r} & (\boldsymbol \tau_{0} \boldsymbol \alpha_{f}) \otimes \boldsymbol I_{r,r} & \boldsymbol 1_{0} \otimes (\boldsymbol \tau_{r} \boldsymbol \alpha_{c}) \\ & \boldsymbol T_{f} \oplus \boldsymbol T_{r} & \boldsymbol 1_{f} \otimes (\boldsymbol \tau_{r} \boldsymbol \alpha_{c}) & (\boldsymbol \tau_{f} \boldsymbol \alpha_{a}) \otimes \boldsymbol 1_{r} \\ \boldsymbol \tau_{c} (\boldsymbol \alpha_{0} \otimes \boldsymbol \alpha_{r}) & & \boldsymbol T_{c} & \\ & & \boldsymbol \tau_{a} \boldsymbol \alpha_{c} & \boldsymbol T_{a} \end{array}\right). \end{array} $$
(43)

Similar to Model-I, the transient probabilities under Model-II can be obtained by using (40) through (42).

After obtaining the above approximate CTMCs for Model-I and Model-II, the most important problem is to determine the PH parameters so that the PH distribution can approximate the original distribution well. Okamura and Dohi (2016b) discussed two approaches for the estimation of PH parameters; moment-based (Osogami and Harchol-Balter 2006; Bobbio et al. 2005) and likelihood-based approaches (Okamura et al. 2011). In this paper, the PH parameters are estimated by the likelihood-based approach. Thus, the analytical transient solutions of the software rejuvenation models can be achieved through the analysis of the “equivalent” CTMCs.

5 Formulation of interval reliability

The interval reliability is, generally speaking, a generalized dependability measure derived from the transient analysis. More specifically, the interval reliability is formally defined as the probability that the system is operational throughout [t,t + x], where t and x are called operation time and planning time, respectively. The interval reliability is comprehensive and embraces two other significant dependability measures as special cases, that is, it can be reduced to the pointwise availability at operation time t as x → 0 and the (conditional) reliability at the planning time x given that the system has started in the working state at time t = 0. The interval reliability analysis was originally introduced by Barlow and Proschan (1965). In Dohi et al. (2018), a special case of the interval reliability (i.e., limiting interval reliability) for rejuvenation model was discussed.

Suppose that the software rejuvenation is not performed during the time interval [t,t + x], even if the rejuvenation time has already come. Letting M(t) be the expected number of visits to State 0 during the time interval (0,t], after the system state starts from State 0 at time t = 0, we have the LST of M(t)

$$ m(s) = \frac{h_{00}(s)}{1 + h_{00}(s)}. $$
(44)

The interval reliability R(t,x) is defined as the probability that at a specified operation time t, the software system is operating and will continue operating for an interval of [t,t + x] (Barlow and Proschan 1965). The interval reliability is simply called reliability at time x when the operation time t = 0, and further becomes pointwise availability at time t as x → 0. The interval reliability is given by

$$ R(x,t) = F_{UP}(t) + {{\int}_{0}^{t}} F_{UP}(t-u) dM(u), $$
(45)

where FUP(t) gives the probability that the system continues operating during the time interval (0,t + x] provided that the software rejuvenation is not performed before time t;

$$ F_{UP}(t)=(1 - (F_{0}*F_{f})(t+x)) (1 - F_{r}(t)). $$
(46)

In Dohi et al. (2018), the LST of (45) was derived. However, as mentioned before, the inversion of LST is not easy in practice.

Next we consider the interval reliability under the phase expansion. According to the definition of interval reliability, we define the following matrices:

$$ \begin{array}{@{}rcl@{}} & \boldsymbol Q_{UP} = \left( \begin{array}{lllll} \boldsymbol T_{0} & \boldsymbol \tau_{0} \boldsymbol \alpha_{f} \\ & \boldsymbol T_{f} \\ & & \boldsymbol O_{c,c} \\ & & & \boldsymbol O_{a,a} \end{array}\right), \end{array} $$
(47)

where O means a zero matrix. The above matrix is the infinitesimal generator without rejuvenation. Then, the interval reliability is written by

$$ R(x,t) = \boldsymbol \pi_{0} \exp(\boldsymbol Q t) \boldsymbol U \exp(\boldsymbol Q_{UP} x) \boldsymbol 1_{UP}, $$
(48)

where

$$ \begin{array}{@{}rcl@{}} \boldsymbol 1_{UP} &=& \left( \begin{array}{lllll} \boldsymbol 1_{0} \\ \boldsymbol 1_{f} \\ \boldsymbol 0_{c} \\ \boldsymbol 0_{a} \end{array}\right), \end{array} $$
(49)
$$ \begin{array}{@{}rcl@{}} \boldsymbol U &=& \left( \begin{array}{lllll} \boldsymbol I_{0,0} \otimes \boldsymbol 1_{r} \\ & \boldsymbol I_{f,f} \otimes \boldsymbol 1_{r} \\ & & \boldsymbol O_{c,c} \\ & & & \boldsymbol O_{a,a} \end{array}\right). \end{array} $$
(50)

As special cases, R(x,0) and R(0,t) are reduced to the reliability function and the pointwise availability, respectively, that is

$$ \begin{array}{@{}rcl@{}} R(x,0) &=& \boldsymbol \pi_{0} \boldsymbol U \exp(\boldsymbol Q_{UP} x) \boldsymbol 1_{UP}, \end{array} $$
(51)
$$ \begin{array}{@{}rcl@{}} R(0,t) &=& \boldsymbol \pi_{0} \exp(\boldsymbol Q t) \boldsymbol U \boldsymbol 1_{UP}. \end{array} $$
(52)

Since the approximate CTMC is ergodic, the limiting interval reliability \(R(x, \infty ) = \lim _{t \to \infty } R(x, t)\) is simply given by

$$ R(x, \infty) = \boldsymbol \pi_{s} \boldsymbol U \exp(\boldsymbol Q_{UP} x) \boldsymbol 1_{UP}, $$
(53)

where πs is the steady-state probability vector satisfying

$$ \boldsymbol \pi_{s} \boldsymbol Q = \boldsymbol 0, \quad \boldsymbol \pi_{s} \boldsymbol 1 = 1. $$
(54)

In this paper, we take account into the above three special cases of the interval reliability, covering different desired system dependability properties under specific operation conditions. For example, when one wishes to ensure the system dependability for a specific time period, the interval reliability and reliability function are good candidates to be considered. On the contrary, for systems with long operational lifetimes, the limiting interval reliability is more preferred.

6 Numerical illustration

This section consists of two illustrative experiments. One is conducted to examine the accuracy of the phase expansion from the viewpoints of distribution curves and the interval reliabilities, including the approximate pointwise availabilities, reliability functions, and limiting interval reliabilities, of Model-I. Another is to compare the interval reliabilities between two software rejuvenation models. All model parameters are set empirically in our current experiments, aiming mainly to examine the accuracy of the phase expansion and execute comparison between two rejuvenation models.

6.1 Accuracy of phase expansion

We present an example to analyze quantitatively the transient measures of the software rejuvenation model, Model-I, with the phase expansion. To examine the accuracy of phase expansion, we assume that the exact steady-state availability, reliability function, and limiting interval reliability function are known. In such a case, the probability distributions of state transitions in Fig. 1 are given in Table 1. From this table, we see that the distributions are assumed to be Weibull and lognormal distributions, that means, the transient analysis is difficult by using common methods, e.g., the LST and its numerical inversion approach. The columns of Mean and CV represent the mean time and the coefficient of variation, respectively.

Table 1 Probability distributions of state transitions

Each distribution is approximated by a PH distribution using the likelihood-based approach (Okamura and Dohi 2015; Okamura et al. 2011). We consider three cases on the numbers of phases; PH1 (1 phase), PH10 (10 phases), and PH100 (100 phases) to examine the accuracy of PH approximation. It should be noted that when the number of phases is 1, the corresponding PH distribution becomes the exponential distribution as a special case. Figure 3 shows the p.d.f.s of approximate PH distributions for the distributions F0(t), Ff(t), Fr(t), Fc(t), and Fa(t). In the figure, the original p.d.f.’s are drawn as dotted lines. From these figures, we can find that the PH distribution gets close to the original distribution when the number of phases increases. Also, in the case where CV is small, the large number of phases is needed to approximate the original distribution accurately. For example, the PH distributions with PH10 are accurate enough to approximate the distributions F0(t) and Ff(t) with CV= 0.5 (see (a) and (b) in Fig. 3), whereas the PH distributions with PH100 are needed for the distributions Fr(t) with CV= 0.1, Fc(t) and Fa(t) with CV= 0.2 (see (c), (d), and (e) in Fig. 3).

Fig. 3
figure 3

Approximate PH distributions. aF0(t). bFf(t). cFr(t). dFc(t). eFa(t)

We evaluate the approximate pointwise availabilities of Model-I. Figure 4 illustrates the approximate pointwise availabilities of Model-I with the phase expansion. The behavior of these availabilities depends on the number of phases. PH1 and PH10 are not enough to approximate accurately the behavior of pointwise availability. In particular, PH1 quickly converges to the steady state, compared with the others. Also, Table 2 shows the exact steady-state availability and those with PH1, PH10, and PH100 under Model-I. Dissimilar to the pointwise availability, the results of PH1 and PH10 get close to the exact steady-state availability, though they are not good to approximate the pointwise availability. Consequently, PH100 is the most accurate to capture the behavior of pointwise availability, as well as the steady-state availability. However, when focusing on only the steady state, PH10 is a more good choice due to lower computational cost, compared with PH100. Moreover, from Fig. 4, in the case of PH100, the pointwise availability reaches its minimum at t = 297.01 hours with R(0,297.01) = 0.99779. Also, oscillations can be observed in this figure. A possible explanation is that at the beginning, no rejuvenation actions are performed and the availability first decreases due to failures. A first rejuvenation action takes place at a time with mean 300 h, which brings an impact on pointwise availability. That is, the availability increases roughly from time 297.01 up to time 436.52 h. After a while, this first rejuvenation action has no more impact and the availability starts again to decrease. However, the time to the second rejuvenation action becomes shorter than for the first one, because the first one is not perfect, so that the availability does not decrease much than before. This leads to damped oscillations that can be seen in Fig. 4. Note that the pointwise availability eventually stabilizes and tends to 0.99944 as t, which is regarded as steady-state availability as shown in Table 2.

Fig. 4
figure 4

Approximate pointwise availabilities of Model-I

Table 2 Steady-state availabilities of Model-I with phase expansion

Next, we investigate the interval reliability. To compare with the exact value, we consider the reliability function R(x,0) and the limiting interval reliability R(x,). Figure 5a and b show the reliability function R(x,0) and the limiting interval reliability function R(x,) with respect to x of Model-I, respectively. Note that the exact results are drawn as dotted lines in both figures. From these figures, it is found that except for PH1, the PH approximation (i.e., PH10 or PH100) works well to approximate the reliabilities. In particular, PH10 becomes more accurate, compared with the result of pointwise availability. This is caused by the fact that F0(t) and Ff(t) are accurately approximated by PH10. The behavior of the reliability is less sensitive to the rejuvenation trigger time distribution Fr(t), because the rejuvenation is not performed in [t,t + x]. Hence, the approximation with PH10 was improved.

Fig. 5
figure 5

Approximate reliability functions of Model-I. a Approximate reliability functions. b Approximate limiting interval reliabilities

6.2 Comparison between Model-I and Model-II

We evaluate the interval reliabilities of a Web server under both two software rejuvenation models described in Section 3 by using the phase expansion and compare them between two models. Table 3 gives the probability distributions of state transitions. The number of phase is fixed as 10. The reasons are given as follows: (i) The PH distributions with PH10 are accurate enough to approximate the distributions F0(t) and Ff(t) with CV= 0.5 and (ii) although the PH distributions with PH10 are not accurate enough to approximate the distributions Fr(t), Fc(t), and Fa(t) with small CV, but the behavior of the reliability is less sensitive to the distributions Fr(t), Fc(t), and Fa(t), which are not performed in [t,t + x].

Table 3 Probability distributions of state transitions

We consider the approximate pointwise availabilities of Model-I and Model-II with PH10, which are illustrated in Fig. 6. From this figure, the availability of either Model-I or Model-II quickly converges to the steady state. The approximate pointwise availability of Model-II is slightly smaller than that of Model-I. This is because under Model-II, a software rejuvenation after the completion of the repair operation will be needed to renew the system, which causes much downtime, compared with Model-I. Also, Model-I has a larger steady-state availability, compared with Model-II (see Table 4).

Fig. 6
figure 6

Approximate pointwise availabilities of Model-I and Model-II (PH10)

Table 4 Steady-state availabilities of Model-I and Model-II

Next, we investigate the interval reliability of two software rejuvenation models. Figure 7 a and b present the approximate reliability functions and the approximate limiting interval reliabilities of Model-I and Model-II with PH10, respectively. It can be observed that all the reliability functions of Model-I and Model-II show the same results. This is due to the fact that when considering the reliability, we focus on only the transitions from State 0 to State 3 or from State 0 to State 2 that are the same in both models (see Figs. 1 and 2). In other words, the repair and rejuvenation operations cannot affect the system reliability significantly.

Fig. 7
figure 7

Approximate reliability functions of Model-I and Model-II (PH10). a Approximate reliability functions. b Approximate limiting interval reliabilities

7 Conclusion and future work

More and more systems focus on software reliability. Software reliability engineering is an important subset of the larger domain of software engineering. In software engineering, software rejuvenation is a proactive software control technique to prevent system performance degradation and other associated failures related to software aging due to aging-related bugs. In this paper, we have focused on analyzing the interval reliability for two fundamental software rejuvenation models in software reliability engineering. It is well-known that the transient analysis of stochastic model is relatively difficult compared with the steady-state analysis. Even if we know the LST representation, the inversion of LST is not efficient from the viewpoint of numerical analysis. Therefore, the key significance of this paper lies on three aspects:

  • This paper covers the analytical transient solutions for both two fundamental software rejuvenation models;

  • The interval reliability has been formulated through the transient analysis of two software rejuvenation models with phase expansion approximately;

  • A comprehensive comparison has been made between two software rejuvenation models based on three interval reliability measures.

In the numerical examples, we have derived the approximate PH distributions from the original distributions and discussed the accuracy of the PH approximation. As a result, the phase expansion with the moderate or large number of phases is effective to analyze the transient behavior of rejuvenation model. However, the PH approximation requires the large number of phases when CV of the original distribution is small. Also, according to the results from the accuracy analysis of phase expansion, we have investigated the dependability measures with the interval reliability of a web server under two different software rejuvenation models by using the phase expansion. The lessons learned from the experiments are twofold: 1) the approximate pointwise availability of the model with rejuvenation after the completion of repair operation seems to be slightly smaller than that without rejuvenation after the repair operation and 2) two rejuvenation models have the same reliability functions because the rejuvenation operation cannot affect the system reliability.

In the future, we will combine the discrete PH distribution with the present approach. The discrete PH distribution is effective to approximate the distribution with a small CV such as a constant distribution. Thus, we will try to make a new framework with both continuous and discrete PH distributions. Also, we will consider a practical method to determine the number of phases from the viewpoint of tolerance error.