22.1 Introduction

Modern life relies heavily on man-made systems that are expected to be reliable, yet many high-tech cybersystems fall short of this expectation: their failures are not uncommon. Such failures and the consequent downtimes lead to economic losses, loss of reputation, and even loss of life. To ameliorate the situation, methods have been developed that reduce the frequency of failures and the resulting downtimes. Gauging the effectiveness of these improvement methods requires scalable, high-fidelity techniques for reliability and availability assessment.

This chapter discusses techniques that have proven effective for reliability and availability assessment of real systems. Assessment methods can be divided into measurement-driven (or data-driven) and model-driven methods (Fig. 22.1). Data-driven methods are suitable for small subsystems, while model-driven methods are appropriate for large systems. Using model-driven methods, we can derive the dynamic behavior of a system consisting of many components from first principles (of probability theory) rather than from measurements.

Fig. 22.1 Reliability/availability assessment methods

In practice, the two approaches are combined: subsystem or component behavior is derived using data-driven methods, while system-level behavior is derived using model-driven methods.

This chapter focuses on model-driven methods. Models can be solved using discrete-event simulation or using analytic–numeric techniques. Some simple models can be solved analytically to yield a closed-form formula, while a much larger set of models can be handled by numerical solution of their underlying equations; the latter approach is known as analytic–numeric solution and should be distinguished from discrete-event simulation. We believe that simulative and analytic–numeric solutions should be judiciously combined to solve complex system models. This chapter, however, concentrates on analytic–numeric methods, providing an overview of a recently published book by the authors of this chapter [1].

Our exposition of the methods is example-based. The chosen examples are real systems that we have ourselves analyzed for various companies. The overall modeling process is depicted in Fig. 22.2.

Fig. 22.2 The overall modeling process

Non-state-space (or combinatorial) models can deal with large systems, but only under the drastic assumption of statistical independence among components. State-space model types, specifically continuous-time Markov chains and Markov reward models, are commonly utilized for higher fidelity. Multi-level models that judiciously combine non-state-space and state-space methods will be seen to have the scalability and fidelity needed to capture the dynamic behavior of real systems. Depending on the application, a model may be solved for its long-term (steady-state) behavior or its time-dependent (transient) behavior. Solution types for such models are classified in Fig. 22.3 [1, 2]. The software packages used to solve the examples in this chapter are SHARPE [2, 3] and SPNP [4, 5].

Fig. 22.3 Solution techniques

22.2 Non-state-Space Methods

Several traditional methods for the analysis of system reliability and availability can be classified under the umbrella of non-state-space (sometimes called combinatorial) methods:

  • Reliability Block Diagrams (RBD)

  • Network reliability or Reliability graphs (RelGraph)

  • Fault Trees.

The simplest paradigm for reliability/availability modeling is the (series–parallel) reliability block diagram (RBD). RBDs are commonly used in the computer and communications industry; they are easy to use and, assuming statistical independence, simple algorithms are available to solve very large RBDs. Reliabilities (availabilities) multiply for blocks in series, while unreliabilities (unavailabilities) multiply for blocks in parallel. Efficient algorithms for k-out-of-n blocks are also available, both for statistically identical blocks and for non-identical blocks [1].
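To make these composition rules concrete, here is a minimal Python sketch (illustrative only, not SHARPE syntax; the block probabilities in the example are hypothetical) of series, parallel, and k-out-of-n evaluation under statistical independence:

```python
from math import comb

def series(rels):
    """Blocks in series: reliabilities multiply."""
    r = 1.0
    for p in rels:
        r *= p
    return r

def parallel(rels):
    """Blocks in parallel: unreliabilities multiply."""
    u = 1.0
    for p in rels:
        u *= 1.0 - p
    return 1.0 - u

def k_out_of_n(k, n, p):
    """k-out-of-n block of statistically identical components:
    at least k of the n components must work."""
    return sum(comb(n, i) * p**i * (1.0 - p)**(n - i) for i in range(k, n + 1))

# Two 0.999 blocks in series, in parallel with a 2-out-of-3 block of 0.99 units
r_sys = parallel([series([0.999, 0.999]), k_out_of_n(2, 3, 0.99)])
```

The same functions apply to availabilities, with steady-state availabilities substituted for reliabilities.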

Besides system reliability at time t, the system mean time to failure, system availability (steady-state and instantaneous), and importance measures can also be computed; the latter help identify critical components (bottlenecks) [1].

The high availability requirements of telecommunication systems are usually more stringent than those of most other sectors of industry. The carrier-grade platform from Sun Microsystems requires "five nines and better" availability. From the availability point of view, the top-level architecture of a typical carrier-grade platform was modeled in [6] as a reliability block diagram consisting of series, parallel, and k-out-of-n subsystems, as shown in Fig. 22.4. The SCSI series block is further expanded in the inset of Fig. 22.4.

Fig. 22.4 High availability platform from Sun Microsystems

The series–parallel structure is often violated in practice. Non-series–parallel block diagrams are usually cast as s-t connectedness problems, also known as network reliability problems (or simply relgraphs in SHARPE). The price paid for this additional modeling power is the increased complexity of the solution methods. Known solution methods are factoring (conditioning); finding all minpaths followed by one of many sum-of-disjoint-products (SDP) algorithms; binary decision diagrams (BDD); and Monte Carlo simulation. SDP- and BDD-based algorithms have been implemented in the SHARPE software package [2, 3]. Nevertheless, real systems pose a challenge to these algorithms. For instance, the reliability of the current return network subsystem of the Boeing 787 was modeled as the relgraph shown in Fig. 22.5, in which the number of minpaths was estimated to be over 4.2 trillion.

Fig. 22.5 Boeing relgraph example

To solve the model for the purpose of FAA certification, a new bounding algorithm was developed, patented, and published [7]. Table 22.1 reports the results, showing that the upper and lower bounds on the s-t reliability were close enough, with a very small number of minpaths and mincuts selected for the computation. The computation time was very short for this otherwise intractable problem. The bounding algorithm is implemented in the SHARPE software package and continues to be used by Boeing (via their IRAP software package [8]) for the reliability assessment of the current return network of all Boeing commercial airplanes.

Table 22.1 Unreliability upper/lower bounds

Table 22.2 shows a comparison of the SDP and BDD methods for various benchmark networks of increasing complexity. The different BDD columns show the effect of node ordering on the computational time [9]. The benchmark networks used are shown in Fig. 22.6 and were inspired by the literature [10]. Note that the bounding method is not used in this comparison.

Table 22.2 Comparison of SDP and BDD with various orderings
Fig. 22.6 Benchmark networks
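To illustrate the factoring (conditioning) method listed above, the following minimal sketch computes the exact s-t reliability of a small network by recursively conditioning on each edge being up or down; the bridge network in the example is a textbook case, and all names and probabilities are illustrative. Factoring is exponential-time in the worst case, which is precisely why SDP, BDD, and bounding methods matter for graphs of realistic size.

```python
def st_reliability(edges, probs, s, t):
    """Exact s-t reliability by factoring: condition on each edge in turn,
    R = p * R(edge up) + (1 - p) * R(edge down)."""

    def connected(up):
        # Is t reachable from s using only the edges assumed up?
        adj = {}
        for u, v in up:
            adj.setdefault(u, []).append(v)
            adj.setdefault(v, []).append(u)
        seen, stack = {s}, [s]
        while stack:
            for m in adj.get(stack.pop(), []):
                if m not in seen:
                    seen.add(m)
                    stack.append(m)
        return t in seen

    def rec(i, up):
        if connected(up):                    # s and t already joined: reliability 1
            return 1.0
        if not connected(up + edges[i:]):    # unreachable even if the rest work
            return 0.0
        p = probs[i]
        return p * rec(i + 1, up + [edges[i]]) + (1.0 - p) * rec(i + 1, up)

    return rec(0, [])

# Bridge network: s-a, s-b, a-b (bridge), a-t, b-t, each with reliability 0.9
edges = [("s", "a"), ("s", "b"), ("a", "b"), ("a", "t"), ("b", "t")]
print(st_reliability(edges, [0.9] * 5, "s", "t"))   # ~0.97848
```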

In the aerospace, chemical, and nuclear industries, engineers use fault trees (FT) to capture the conditions under which a system fails. These Boolean conditions are encoded as a tree with AND, OR, and k-out-of-n gates as internal nodes; leaf nodes represent component failures, and the top (root) node represents system failure.

Fault trees without repeated events are equivalent to series–parallel RBDs, while those with repeated events are more powerful [1, 2, 11]. Solution techniques for fault trees with repeated events are the same as those for the network reliability problem discussed in the previous paragraph [1]. Fault trees with several thousand components can be solved with relative ease.
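For a fault tree without repeated events, the top-event probability can be computed bottom-up under statistical independence. The following is a minimal sketch (the node encoding is illustrative and handles only AND/OR gates, not SHARPE's input format); fault trees with repeated events require the SDP/BDD machinery described above.

```python
def ft_prob(node):
    """Top-event probability of a fault tree WITHOUT repeated events,
    evaluated bottom-up under statistical independence.
    A node is ("basic", q), ("and", children), or ("or", children)."""
    kind = node[0]
    if kind == "basic":
        return node[1]                     # leaf: component failure probability
    qs = [ft_prob(child) for child in node[1]]
    if kind == "and":                      # gate output fails iff all inputs fail
        out = 1.0
        for q in qs:
            out *= q
        return out
    if kind == "or":                       # gate output fails iff any input fails
        out = 1.0
        for q in qs:
            out *= 1.0 - q
        return 1.0 - out
    raise ValueError(f"unknown gate: {kind}")

# Example: TOP = OR(AND(e1, e2), e3), all basic events with q = 0.001
top = ("or", [("and", [("basic", 0.001), ("basic", 0.001)]), ("basic", 0.001)])
print(ft_prob(top))                        # ~0.001001
```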

Figure 22.7 shows an FT for a GE equipment ventilation system. Notice that leaves drawn as circles are basic events, while inverted triangles represent repeated events. Assuming that all events have failure probability q = 0.001, the SHARPE input and output files are shown on the left-hand and right-hand sides of Fig. 22.8, respectively. In this example, SHARPE is asked to compute the top-event probability (QTE = 1.0945e−02) as well as the list of mincuts. We could also ask for importance measures and for a closed-form expression of the top-event probability [1, 3]. By assigning a failure rate to each event, we could ask for the time-dependent failure probability of the system. Many other output measures are possible.

Fig. 22.7 Fault tree model of the equipment ventilation system

Fig. 22.8 SHARPE input/output files for the ventilation system

By assigning failure rates to components, the system reliability at time t and the mean time to system failure can be computed. Time-to-failure distributions other than the exponential (e.g., Weibull) can be used in such non-state-space models. Furthermore, by assigning a failure rate and a repair rate to each component, steady-state and instantaneous availability can be computed (assuming independence of repair in addition to failure independence).
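As a small illustration of a non-exponential time-to-failure distribution in a non-state-space model (all parameter values hypothetical), the following sketch evaluates, at t = 1000 h, an RBD with two Weibull blocks in series, that pair in parallel with an exponential block:

```python
import math

def weibull_rel(t, shape, scale):
    """Component reliability R(t) = exp(-(t/scale)^shape) for a Weibull
    time-to-failure distribution."""
    return math.exp(-((t / scale) ** shape))

t = 1000.0
r_pair = weibull_rel(t, 1.5, 8000.0) * weibull_rel(t, 1.5, 8000.0)  # series
r_exp = math.exp(-1e-4 * t)            # exponential block, rate 1e-4 per hour
r_sys = 1.0 - (1.0 - r_pair) * (1.0 - r_exp)                        # parallel
```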

FTs have been extended with non-coherent gates such as NOT gates, multi-state components [12], phased-mission systems [13], and dynamic gates [14]. SHARPE fault trees allow NOT gates, multi-state components, and phased-mission systems. Dynamic gates are not explicitly included in SHARPE but can easily be implemented, since (static) fault trees, Markov chains, and their combination via hierarchical modeling are provided [1].

22.3 State-Space Methods

As stated in the previous section, non-state-space models with thousands of components can be solved without generating their underlying state space by making the independence assumption. But in practice, dependencies do exist among components. We then need to resort to state-space models such as (homogeneous) continuous-time Markov chains (CTMCs).

Markov models have been used to capture dynamic redundancy, imperfect coverage (e.g., failure to fail over or failure to detect), escalated levels of recovery, concurrency, contention for resources, combined performance and reliability/availability, and survivability [1, 15]. A Markov availability model has no absorbing states (Fig. 22.9), while a Markov reliability model has one or more absorbing states (Fig. 22.11). Markov models can be solved for steady-state, transient, and cumulative transient behavior according to the following equations [1, 15]:

Steady-state

$$\pi Q = 0 \quad \text{with} \quad \sum_i \pi_i = 1$$

Transient

$$\frac{\mathrm{d}\pi(t)}{\mathrm{d}t} = \pi(t)\,Q \quad \text{given } \pi(0)$$

Cumulative transient

$$\frac{\mathrm{d}b(t)}{\mathrm{d}t} = b(t)\,Q + \pi(0)$$

Fig. 22.9 CTMC availability model of Linux OS

In the above formulas, \(Q\) is the infinitesimal generator matrix of the CTMC, \(\pi(t)\) is the state probability vector at time \(t\), \(\pi(0)\) is the initial state probability vector, \(\pi = \lim_{t \to \infty} \pi(t)\) is the steady-state probability vector, and \(b(t) = \int_0^t \pi(u)\,\mathrm{d}u\) is the vector of expected state occupancy times in the interval from 0 to \(t\). Derivatives of these measures with respect to the input parameters can also be computed numerically [1].
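As a minimal numerical sketch of these three solutions (illustrative only; it does not reflect the algorithms inside SHARPE or SPNP, which typically rely on techniques such as uniformization and iterative linear solvers for large chains):

```python
import numpy as np
from scipy.linalg import expm

def steady_state(Q):
    """Solve pi Q = 0 with sum(pi) = 1 by replacing one balance equation
    with the normalization condition."""
    n = Q.shape[0]
    A = Q.T.copy()
    A[-1, :] = 1.0                    # normalization row: probabilities sum to 1
    b = np.zeros(n)
    b[-1] = 1.0
    return np.linalg.solve(A, b)

def transient(Q, pi0, t):
    """Transient probabilities: pi(t) = pi(0) e^{Qt}."""
    return pi0 @ expm(Q * t)

def cumulative(Q, pi0, t, steps=1000):
    """Expected state occupancy times b(t) = integral of pi(u) du over (0, t),
    approximated here by a midpoint Riemann sum."""
    du = t / steps
    return sum(transient(Q, pi0, (i + 0.5) * du) for i in range(steps)) * du
```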

22.3.1 CTMC Availability Models

The system availability (or instantaneous, point, or transient availability) is defined as the probability that at time t the system is in an up state:

$$A(t) = P\{\text{system working at } t\}$$

Steady-state availability \(A_{\text{ss}}\), or just availability, is the long-term probability that the system is up:

$$A_{\text{ss}} = \lim_{t \to \infty} A(t) = \frac{\text{MTTF}}{\text{MTTF} + \text{MTTR}}$$

where MTTF is the system mean time to failure and MTTR is the system mean time to recovery. When applied to a single component, the above equation holds without any distributional assumptions. For a complex system with redundancy, the equation holds if we use “equivalent” MTTF and “equivalent” MTTR [1].

The availability model of the Linux operating system used in IBM's SIP implementation on WebSphere was presented in [16] and is shown in Fig. 22.9. From the UP state, the model enters the down state DN with failure rate \(\lambda_{\text{OS}}\). After failure detection, which takes a mean time of \(1/\delta_{\text{OS}}\), the system enters the failure-detected state DT.

The OS is then rebooted, the mean time to reboot being \(1/\beta_{\text{OS}}\). With probability \(b_{\text{OS}}\) the reboot is successful, and the system returns to the UP state. With probability \(1 - b_{\text{OS}}\), however, the reboot is unsuccessful, and the system enters the DW state, where a repairperson is summoned. The travel time of the repairperson is assumed to be exponentially distributed with rate \(\alpha_{\text{SP}}\), after which the system moves to state RP. The repair takes a mean time of \(1/\mu_{\text{OS}}\), and after its completion the system returns to the UP state.

Because this Markov chain is so simple, solving the steady-state balance equations yields a closed-form expression for the steady-state availability of the OS:

$$A_{\text{ss}} = \pi_{\text{UP}} = \frac{1}{\lambda_{\text{OS}}}\left[\frac{1}{\lambda_{\text{OS}}} + \frac{1}{\delta_{\text{OS}}} + \frac{1}{\beta_{\text{OS}}} + \left(1 - b_{\text{OS}}\right)\left(\frac{1}{\alpha_{\text{SP}}} + \frac{1}{\mu_{\text{OS}}}\right)\right]^{-1}$$

We can alternatively obtain a numerical solution of the underlying equations using a software package such as SHARPE. Either graphical or textual input can be employed. The SHARPE textual input file modeling the CTMC of Fig. 22.9 is shown in Fig. 22.10. Noting that UP (labeled 1) is the only up state, the steady-state availability is computed using the command expr prob (LinuxOS,1). With the assigned numerical parameter values (see Fig. 22.10), the result is \(A_{\text{ss}} = 0.99963\).

Fig. 22.10 SHARPE input file for the CTMC of Fig. 22.9
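For readers without SHARPE at hand, a small self-contained sketch (with hypothetical rate values, since the actual parameters live in Fig. 22.10) builds the generator matrix of Fig. 22.9 and checks the numerical steady-state solution against the closed form above:

```python
import numpy as np

# States: 0=UP, 1=DN, 2=DT, 3=DW, 4=RP; all rates hypothetical, per hour
lam, delta, beta, b, alpha, mu = 1.0 / 4000, 12.0, 2.0, 0.9, 0.5, 1.0

Q = np.zeros((5, 5))
Q[0, 1] = lam                # UP -> DN: OS failure
Q[1, 2] = delta              # DN -> DT: failure detection
Q[2, 0] = b * beta           # DT -> UP: successful reboot
Q[2, 3] = (1 - b) * beta     # DT -> DW: failed reboot, summon repairperson
Q[3, 4] = alpha              # DW -> RP: repairperson arrives
Q[4, 0] = mu                 # RP -> UP: repair completes
np.fill_diagonal(Q, -Q.sum(axis=1))

A = Q.T.copy()
A[-1, :] = 1.0               # pi Q = 0 together with sum(pi) = 1
pi = np.linalg.solve(A, np.array([0.0, 0.0, 0.0, 0.0, 1.0]))

a_closed = (1 / lam) / (1 / lam + 1 / delta + 1 / beta
                        + (1 - b) * (1 / alpha + 1 / mu))
assert abs(pi[0] - a_closed) < 1e-9   # numerical and closed-form A_ss agree
```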

22.3.2 CTMC Reliability Models

While CTMC availability models have no absorbing states, CTMC reliability models have one or more absorbing states, and the reliability at time t is defined as the probability that the system is continuously working throughout the interval (0, t]. Further, since in a reliability model the system down state is absorbing, the MTTF can be calculated as the mean time to absorption in the corresponding CTMC [1, 2, 15].

The reliability model, extracted from the availability model of the Linux operating system used for IBM's SIP application, is shown in Fig. 22.11. The repair transition from state RP to state UP and the outgoing transition from state DW are removed; that is, the down state reached from the UP state is made absorbing. Note that states DN and DT are also down states, but the sojourns in these states are likely short enough to be treated as glitches that can be ignored when computing system reliability and MTTF.

Fig. 22.11 CTMC reliability model of Linux OS

In this case, the model is simple enough that a closed-form solution can be obtained by hand (or using Mathematica) by setting up and solving the underlying Kolmogorov differential equations. Alternatively, a numerical solution of the underlying equations can be obtained using SHARPE. The SHARPE textual input file for the reliability model of Fig. 22.11 is shown in Fig. 22.12. Note that in this case, since the CTMC is not irreducible, an initial probability vector must be specified.

Fig. 22.12 SHARPE input file for the CTMC of Fig. 22.11

The system reliability at time t is defined in this case as \(R(t) = \pi_{\text{UP}}(t)\) and, in the SHARPE input file of Fig. 22.12, is computed from t = 0 to t = 10,000 in steps of 2000. As noted earlier, the MTTF is the mean time to absorption and is computed using the SHARPE command expr mean (LinuxOS). With the assigned numerical values, the result is MTTF = 40,012 h.

The CTMC of a reliability model can be, but need not be, acyclic; the model of Fig. 22.11, which retains the reboot cycle, is not. If there is no component-level repair (recovery), then the CTMC will be acyclic; if there is component-level repair (but no repair after system failure), then the CTMC will have cycles. In either case, the model will always have one or more absorbing states.
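A minimal generic sketch of the mean-time-to-absorption computation mentioned above (not tied to SHARPE): restrict the generator to the transient (non-absorbing) states and solve a linear system. For the model of Fig. 22.11, the transient states would be those that are not absorbing (UP, DN, DT).

```python
import numpy as np

def mttf(Q, transient_states, pi0):
    """Mean time to absorption of a CTMC: with Q_TT the generator restricted
    to the transient states, the expected times-to-absorption tau satisfy
    -Q_TT tau = 1; MTTF is the initial-probability-weighted sum of tau."""
    T = np.ix_(transient_states, transient_states)
    tau = np.linalg.solve(-Q[T], np.ones(len(transient_states)))
    return np.asarray(pi0)[transient_states] @ tau
```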

Reliability modeling techniques have wide application across technical fields and have also been proposed as a new frontier in predicting healthcare outcomes. With the rise of quantitative approaches to health care, lessons from reliability modeling may well provide new ways of improving patient care. Describing the development of conditions leading to organ system failure motivates the quantification of disease progression. As an example, a simple model for progressive kidney disease leading to renal failure is shown in Fig. 22.13 [17], where five discrete conditions are enumerated in keeping with the clinical classification of kidney function.

Fig. 22.13 Markov model of renal disease

The parameter values used in solving the model of Fig. 22.13 are reported in Table 22.3. These values are estimated for a 65-year-old Medicare patient and are based on the latest available statistics from the United States Renal Data System (USRDS) annual report [18].

Table 22.3 Parameter values for a 65-year-old Medicare patient

The model of Fig. 22.13 is solved for the survival rate and expected cost incurred by a patient in a 1-year interval [17].

Efficient algorithms are available for solving Markov chains with several million states [19,20,21], both in the steady-state and in the transient regime. Furthermore, measures of interest such as reliability, availability, performability, and survivability can be computed by means of reward rate assignments to the states of the CTMC [1, 15]. Derivatives (sensitivity functions) of the measures of interest with respect to input parameters can also be computed to help detect bottlenecks [22,23,24]. Nevertheless, the generation, storage, and solution of Markov models of real-life systems still pose challenges. Higher level formalisms such as those based on stochastic Petri nets (SPNs) and their variants [4, 15, 25,26,27] have been used to automate the generation, storage, and solution of large state-space Markov models [26]. Our own variant of SPN is known as the stochastic reward net (SRN). SRNs extend the SPN formalism in several useful ways, in particular by allowing the specification of reward rates at the net level. This enables a more concise description of real-world problems and an easier path to the output measures [4].
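A sketch of the reward-rate idea (generic; the reward values are hypothetical): attach a reward rate to each state and weight it by the state's steady-state probability.

```python
import numpy as np

def expected_steady_state_reward(Q, rewards):
    """Markov reward model: E[reward] = sum_i r_i * pi_i. With r_i = 1 for
    up states and 0 for down states this is the steady-state availability;
    with r_i = power draw of state i it is the expected power consumption."""
    n = Q.shape[0]
    A = Q.T.copy()
    A[-1, :] = 1.0                     # normalization replaces one balance equation
    b = np.zeros(n)
    b[-1] = 1.0
    pi = np.linalg.solve(A, b)
    return np.asarray(rewards) @ pi
```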

An example of the use of stochastic reward nets to model the availability of an Infrastructure-as-a-Service (IaaS) cloud is shown in Fig. 22.14 [28]. To reduce power usage costs, physical machines (PMs) are divided into three pools: a hot pool (high performance, high power usage), a warm pool (medium performance, medium power usage), and a cold pool (lowest performance, lowest power usage). PMs may fail and be repaired. A minimum number of operational hot PMs is required for the system to function, but PMs from the other pools may temporarily be assigned to the hot pool to maintain system operation. Upon repair, PMs migrate back to their original pool. This migration creates dependencies among the pools.

Fig. 22.14 SRN availability model of IaaS cloud

A monolithic CTMC for this system is too large to construct by hand. We therefore use our high-level formalism of stochastic reward nets (SRNs) [26]. An SRN model can be automatically converted into the underlying Markov (reward) model, which is then solved numerically for measures of interest such as expected downtime, steady-state availability, reliability, power consumption, and performability, as well as the sensitivities of these measures.

In Fig. 22.14, place \(P_{\text{h}}\) initially contains \(n_{\text{h}}\) tokens (the PMs of the hot pool), \(P_{\text{w}}\) contains \(n_{\text{w}}\) tokens (the PMs of the warm pool), and \(P_{\text{c}}\) contains \(n_{\text{c}}\) tokens (the PMs of the cold pool). Assuming the number of PMs in each pool is identical and equal to n, the number of states of the monolithic model of Fig. 22.14 is reported in the second column of Table 22.4. From this table, it is clear that an approach based merely on a higher-level formalism such as SRNs, which we call largeness tolerance, soon reaches its limits, as the time needed for the generation and storage of the state space becomes prohibitively large for real systems.

Table 22.4 Comparison of monolithic versus decomposed model

22.4 Hierarchy and Fixed-Point Iteration

To avoid the largeness of a monolithic Markov (or, generally, state-space) model, we advocate the use of multi-level models, in which the modeling power of state-space models and the efficiency of non-state-space models are combined (Fig. 22.15).

Fig. 22.15 Analytic modeling taxonomy

Since a single monolithic model is never generated and stored in this approach, it constitutes largeness avoidance, in contrast with largeness tolerance (recall stochastic Petri nets, SRNs, and related modeling paradigms), wherein the underlying large model is generated and stored. In multi-level modeling, each sub-model is solved and its results are conveyed to other relevant sub-models as their input parameters. This transmission of the results of one sub-model as input parameters to others is depicted as a graph that we have called an import graph [29].

Consider, for instance, the Sun Microsystems carrier-grade platform, whose top-level RBD availability model is shown in Fig. 22.4. Each block of the RBD is a complex subsystem that was modeled separately, using the appropriate formalism, to compute that subsystem's steady-state availability. In the present case, the subsystems were modeled as Markov chains to cater to the dependencies within each subsystem.

The subsystem availabilities are then rolled up into the higher level RBD model to compute the system steady-state availability. The import graph for this system model is shown in Fig. 22.16. The specification and solution of such multi-level models, and the parameter passing between their sub-models, are facilitated by the SHARPE software package [2, 3]. The import graph in this case is acyclic, so a topological sort of the graph yields a linear order in which the sub-models are to be solved and their results rolled up the hierarchy.

Fig. 22.16 Import graph for the high availability platform from Sun Microsystems [6]
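Because the import graph is acyclic here, the solution order can be obtained mechanically; a small sketch using Python's standard library (the sub-model names are hypothetical):

```python
from graphlib import TopologicalSorter

# Each sub-model maps to the set of sub-models whose outputs it imports.
import_graph = {
    "system_rbd": {"cpu_ctmc", "scsi_ctmc", "power_ctmc"},
    "cpu_ctmc": set(),
    "scsi_ctmc": set(),
    "power_ctmc": set(),
}

# Leaf sub-models come first; the top-level RBD is solved last, after the
# subsystem availabilities it imports have been computed.
solve_order = list(TopologicalSorter(import_graph).static_order())
```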

As the next example, we return to the IaaS cloud availability model and improve its scalability. The monolithic SRN model of Fig. 22.14 is decomposed into three sub-models that separately describe the behavior of the three pools [28, 29], while taking their mutual dependencies into account by means of parameter passing. The three sub-models are shown in Fig. 22.17.

Fig. 22.17 Decomposed SRN availability model of IaaS cloud

Its import graph is shown in Fig. 22.18, indicating the output measures and input parameters exchanged among the sub-models to obtain the overall model solution. Import graphs such as this one are not acyclic, and hence the overall solution must be set up as a fixed-point problem. Such problems can be solved iteratively by successive substitution from some initial starting point. Many mathematical issues arise, such as the existence of the fixed point, its uniqueness, the rate of convergence, accuracy, and scalability. Except for the existence of the fixed point [30], all of these issues remain open for investigation. Nevertheless, the method has been successfully applied to many real problems [1].

Fig. 22.18 Import graph describing sub-model interactions
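When the import graph has cycles, successive substitution applies; a minimal generic sketch (the function shape, tolerance, and iteration cap are all illustrative choices):

```python
def fixed_point(solve_submodels, x0, tol=1e-9, max_iter=100):
    """Successive substitution: repeatedly solve all sub-models with the
    latest exchanged parameters until the parameters stop changing."""
    x = x0
    for _ in range(max_iter):
        x_next = solve_submodels(x)      # one sweep over all sub-models
        if max(abs(a - b) for a, b in zip(x, x_next)) < tol:
            return x_next
        x = x_next
    raise RuntimeError("fixed-point iteration did not converge")
```

For the model of Fig. 22.17, solve_submodels would solve each pool's SRN using the other pools' latest outputs and return the updated exchanged measures.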

Table 22.4 shows the effect of the decomposition/fixed-point iteration method (also known as the interacting sub-models method), comparing the number of states of the monolithic model (column 2) with the number of states in the interacting sub-models case (column 3).

Many more examples of this type of multi-level model can be found in the literature [1, 2, 16, 29,30,31,32,33,34,35]. A particular example is the implementation of the Session Initiation Protocol (SIP) by IBM on its WebSphere platform. A hierarchical availability model of that system is described in detail in [16].

22.5 Relaxing the Exponential Assumption

One standard complaint about homogeneous continuous-time Markov chains is the ubiquitous assumption that all event times are exponentially distributed. Several known paradigms can remove this assumption: non-homogeneous Markov chains, semi-Markov and Markov regenerative processes, and phase-type expansions. All of these techniques have been used in practice, and many examples are illustrated in [1].

Nevertheless, using non-exponential techniques in practice brings additional complexity, partly because the analytic–numeric solution is more involved, and partly because the additional information needed about the non-exponential distributions is often hard to come by.

A flowchart comparing the modeling power of the different state-space model types is shown in Fig. 22.19 [1], and in Fig. 22.20, we provide a classification of the modeling formalisms considered in [1].

Fig. 22.19 Flowchart comparing the modeling power of the different state-space model types [1]

Fig. 22.20 Modeling formalisms

22.6 Conclusions

We have tried to provide an overview of known modeling techniques for the reliability and availability of complex systems. We believe that techniques and tools already exist to capture the behavior of current-day systems of moderate complexity. Nevertheless, ever higher complexity is being designed into systems, and hence the techniques must continue to evolve. Besides the largeness problem, the need for higher fidelity will require the increasing use of non-exponential distributions and the proper combination of performance, power, and other measures of system effectiveness with failure and recovery behavior. Parameterization and validation of the models need to be further emphasized and supported. A tighter connection between data-driven and model-driven methods on the one hand, and a combination of simulative and analytic–numeric solution on the other, is desirable. Validated models need to be maintained throughout the life of a system so that they can also be used for tuning at operational time. Besides system-oriented measures such as reliability and availability, user-perceived measures need to be explored [34,35,36]. Finally, uncertainty in model parameters, so-called epistemic uncertainty, as opposed to the aleatory uncertainty already incorporated in the models discussed here, needs to be accounted for in a high-fidelity assessment of reliability and availability [37]. For further discussion of these topics, see [1].