1 Introduction

Reliability Analysis of Safety-Critical Systems. Reliability analysis is concerned with analysing system models to determine measures-of-interest such as the mean-time-to-failure (MTTF) and the system’s reliability, i.e., the probability that the system is continuously operational up to a given mission time? Model-based analysis such as the numerical evaluation of Markov chains suffer from the state-space explosion problem. A possible remedy is discrete-event simulation. Simulation is applicable to a large class of reliability models, e.g., it supports general failure and repair rates, and has a low memory footprint as only the current model state needs to be kept in memory. Simulation results though come with statistical bounds onlyFootnote 1 and excessively many simulation runs are needed for rare events, events that happen with very low probability. Failures in safety-critical systems such as autonomous cars, nuclear power plants, satellites, launchers, etc., are (supposed to be) rare events, and standards such as ISO 26262 require hard guarantees—safe lower- and upper-bounds.

Lazy Verification. The challenge is to come up with a reliability analysis technique that provides hard guarantees, can deal with rare events, and preferably provides results with numerical accuracy.

Fig. 1.
figure 1

Lazy verification for reliability

This paper proposes to use lazy verification for reliability analysis. The idea is conceptually simple: generate a partial state space of a reliability model description and carry out a fast analysis that takes a conservative and an optimistic perspective. By considering unexplored states as being always operational or, dually, already failed, the analysis yields a sound upper- and lower-bound (ub and lb) on the system’s reliability, see Fig. 1. Fast analysis is done using the state-of-the-art probabilistic model-checking techniques [2, 4, 25] for continuous-time Markov models. If the gap between the lower and upper-bound is below an a priori user-defined tolerance \(\varepsilon \), i.e., ub-lb\(\ \le \varepsilon \), the analysis halts: the system reliability is certainly between these bounds. In case the results are not tight enough, ub-lb\(\ >\varepsilon \), the partial state space is extended with some unexplored states. This iterative approach thus has a lazy character: only a state-space fragment required to obtain the system’s reliability (or measures such as MTTF) with a given accuracy is generated and explored.

Lazy Verification of BDMPs. Our lazy approach is applicable to a wide range of dynamic reliability models, in particular those containing state-dependent failure mechanisms such as temporal orderings, spare management, and failure dependencies. This includes, e.g., dynamic fault trees [17], state-event fault trees [24], dynamic reliability block diagrams [16], and Pandora temporal fault trees [31]. Our lazy verification approach is also applicable to the static (non-state-dependent) reliability models such as static fault trees and reliability block diagrams. However, our approach will not be competitive to the binary decision diagram-based analysis of these models. We present details of our lazy verification approach for Boolean logic-driven Markov processes (BDMPs), an expressive dynamic fault tree variant extensively used by engineers at Électricité de France (EDF) [9]. EDF is one of the world’s largest producers of electricity and is active in technologies such as nuclear power, hydropower, wind power, solar and geothermal energy. BDMPs contain VOTing gates (generalisation of AND and OR gates), priority AND-gates, basic events that model system components whose lifetime is exponentially distributed, instantaneous events, and two forms of triggers. While the general definition of BDMPs in [9] allows the use of arbitrary Markov processes for defining basic events, we restrict ourselves to the commonly used exponential distributions. The semantics of BDMPs has been translated into Markov automaton in [26], and generalised stochastic Petri nets in [27]. The underlying stochastic process of a BDMP is a continuous-time Markov chain (CTMC). Polynomial-time model-checking algorithms for computing lifetimes on CTMCs have been given [4] and are part of probabilistic model checkers such as Prism  [28] and Storm  [22]. The main questions for lazy BDMP verification are “which fragment of the state space needs to be explored?” and “how much to extend a partial state space in an iteration of Fig. 1?” The first question is answered by a probabilistic criterion, i.e., the states with the highest reachability probabilities are selected for exploration. Regarding the second question, all one-step-successors of the selected state(s) are explored and the exploration is stopped based on an exploration threshold, e.g., if all remaining states in iteration i have reachability probability \({<}{2^{-i}}\).

Experimental Evaluation. We implemented the BDMP lazy verification approach on top of the probabilistic model checker Storm  [22] whose performance is among the best as witnessed in recent tool competitions (see qcomp.org). Distinguishing features of Storm are its modular set-up enabling the rapid exchange of solvers, its facility to generate counterexamples, and its support for multiple modelling languages such as the reliability models dynamic fault trees and BDMPs. To validate our results, we compare to a free discrete-event simulation tool [7]. To indicate the “goodness” of the obtained bounds, we compare to initiator and all barriers (I&AB), a recent closed-form technique for obtaining pessimistic system reliability of BDMPs [11]Footnote 2. We distinguish between non-repairable and repairable reliability models, as some analysis techniques perform better for a particular class. The main findings of the experimental evaluation:

  • Exploring small state-space fragments mostly suffices, in particular for repairable models. This extends the findings of lazy verification of dynamic fault trees that do not include repairs, as in [29].

  • Reliability model sizes that could not be handled before come within range.

  • In contrast to I&AB, lazy verification is generally applicable (non-repairable and repairable models, arbitrary mission times, acyclic models, and models with loops), and provides sound lower- and upper-bounds within a given accuracy \(\varepsilon \).

Note that solving reliability models is much more difficult when the model includes repairs, for several reasons:

  • Merging failure states into a single state is not correct for repairs any more, because these states are usually associated with different repair rates.

  • The existence of repairs can create new states that cannot be reached from the initial state by a path containing only failures. This is exemplified by the case of the PAND-gate given in Sect. 2.

  • The failure and repair rates usually differ by several orders of magnitude creating stiffness problems for solvers based on matrix calculations.

2 Boolean Logic-Driven Markov Processes

This section briefly explains the main principles of BDMPs; for more details we refer to [9]. BDMPs extend static fault trees with triggers and associate a pair of CTMCs with leaves to model various failure modes. Triggers model activation mechanisms that are useful to model dynamic failure dependencies, e.g., failure on-demand, mutual exclusion, and causal failure dependencies. Semantically, BDMPs augment the failure predicate of static fault trees with activation and trimming predicates. While activation predicates govern the activation status (active or dormant) of BDMP elements, the trimming predicates curtail the BDMP’s state space, e.g., by inhibiting the failure of non-failed inputs of OR-gates once the gate fails. (If such input is shared with other parts of the BDMP, then it is not pruned.) Fig. 2 depicts the main BDMP elements.

Fig. 2.
figure 2

BDMP elements

Gates (the first row of Fig. 2). A VOT-gate has two parameters: the number N of inputs and the number \(1\le K\le N\) of inputs that need to fail for the VOT gate to fail. The gate is repaired once the number of input failures is below K. The AND- and OR-gates are special cases of VOT with \(K=N\) and \(K=1\), respectively. The priority-AND (PAND) gate fails once both its inputs fail in a left-to-right order. Simultaneous input failures of PAND do not lead to a gate failure. (Other fault tree variants use inclusive PAND [23].) Four repair strategies for PAND exist in BDMPs: (1) on the repair of the first input (repair-first), (2) on the repair of the second input (repair-last), (3) on the repair of both inputs (repair-all), or (4) on the repair of any input (repair-any).

Basic events (the second row of Fig. 2). The EXP-type basic events fail and are repaired following an exponential distribution. They can fail only in active mode; their repair is independent of their activation status. The STDBY-type basic events can switch between active and passive failure rates depending on their activation status. The INST-type basic events fail upon activation with probability \(\gamma \). The repair behavior of both STDBY and INST is the same as for EXP. The failure of the top event TOP represents the system’s failure.

Triggers and links (the third row of Fig. 2). The trigger TRIG activates its target when its source fails and if at least one parent of TRIG’s target is active. The inverted trigger InvTRIG achieves the inverse behavior, i.e., it activates the target node when at least one parent of the target node is active and the origin of InvTRIG is not failed. The BeforeLINK forces an order on simultaneously enabled INST leaves to reduce the number of combinations to be examined thanks to the trimming mechanism. The LogicLINK propagates the failure and activation status among BDMP elements.

Example 1

As a running example we consider a reconfiguration strategy adopted from [11]. The system, see Fig. 3 (left), has two power sources for the Main_line: (1) a Grid, and (2) a diesel generator Dies_gen. The red dotted line indicates the reconfiguration strategy. Initially, the Normal_line is active and powers the Main_line. On its failure, the load is switched to the Standby_line. A repair will switch back the load to the Normal_line. Reconfigurations are realised by circuit breakers; they can fail due to: (1) inadvertent failure during normal operation, (2) a refuse-to-open failure, or (3) a refuse-to-close failure. The latter two modes are failure on-demand as they happen while switching the load from the Normal_line to Standby_line and vice versa. Dies_gen has two failure modes: failure in-operation, and failure on-demand. Whereas, Grid can only fail in-operation.

A BDMP model is shown in Fig. 3 (right). The red (blue) arrow represents TRIG (InvTRIG). The inverted trigger models mutually exclusive failure modes of the circuit breaker CB1. That is, the inadvertent opening of CB1 (CB1_IO) preempts its refuse-to-open failure mode CB1_RO. In the BDMP, initially either Grid or CB1_IO can fail. Any of these failures causes a failure of Normal_line_fail, which in turn activates Standby_line_fail. This also activates its children. A failure of an INST leaf causes the events Standby_line_fail and the Main_line_fail to fail. After a successful reconfiguration, Dies_gen can fail. We point out that a failing sequence initiated by CB1_IO does not lead to CB1_RO being tested. For the sake of simplicity, the basic event CB2_IO is omitted from the model. Moreover, CB2_RO and CB1_RC are omitted due to their negligible failure probability.

Fig. 3.
figure 3

Running example of a system (left) and a BDMP modeling it (right) (Color figure online)

3 Lazy Probabilistic Verification

This section introduces the partial state-space exploration for continuous-time Markov chains (CTMCs), the underlying model of BDMPs. CTMCs are finite transition systems with a designated initial state, whose transitions are labelled with rates (positive reals) of exponential distributions. For state s with transition rates \(R(s,u) = \lambda \) and \(R(s,v) = \mu \), say, the probability to move from s to u is \(\nicefrac {\lambda }{\lambda {+}\mu }\), and the state residence time is random: the probability to stay for t time units in s equals \(1{-}e^{-(\lambda {+}\mu )\cdot t}\). To enable CTMC analysis by model checking [4], states are labelled with sets of atomic propositions.

3.1 State-Space Generation

The (compositional) semantics of BDMPs in terms of CTMCs is fully explained in [26, 27]. We present the general idea by an example.

Example 2 (CTMC for BDMP)

Consider again the BDMP in Fig. 3 (right). Figure 5 depicts a fragment of its corresponding CTMC. Initially, the BDMP elements Grid and CB1_IO can fail. The initial state thus has two outgoing transitions labelled with the failure rates of Grid and CB1_IO. The failure of Grid causes the testing of three independent Bernoulli trials through the trigger. Thus, the failure of Grid is succeeded by \(2^3=8\) probabilistic transitions. The failure of CB1_IO causes two independent Bernoulli trials, as testing of CB1_RO is inhibited by the inverted trigger. We combine these probabilistic transitions to the preceding exponential transitions. This gives rise to 12 outgoing transitions in the initial state, e.g., \(s_{1}\) represents the scenario where CB1_IO fails, but both instantaneous events CB1_RO and CB2_RC have not failed. The failure rate of CB1_IO is 0.0001 and the failure probabilities of the two instantaneous events are 0.0001. We have a transition rate of \(0.0001 \cdot (1{-}0.0001)^{2} = 9.998e^{-5}\). We indicate the failure of component CB1_IO by adding the corresponding label \(\{\textit{CB1}\_\textit{IO}\}\) to state \(s_1\). The other transition rates are obtained similarly. For state \(s_1\), there are two possibilities: either CB1_IO is repaired, or Dies_gen fails. The transitions have rates 0.1 and 0.0001, respectively.

Fig. 4.
figure 4

State space after the first iteration

On completing the state-space generation process, the resulting CTMC is amenable to standard CTMC analysis techniques such as model checking [3]. This is supported by state-of-the-art model checkers such as Storm  [22]. Time-bounded properties are important for reliability analysis. The unreliability is the probability of system failure within T time units. Expressed in a timed probabilistic temporal logic such as CSL [4]:

$$ \mathbb {P} \left( \Diamond ^{[0,\,T]} \,\,``failed\text {''} \right) , $$

where “failed” represents the failure of the top-level event in the BDMP. The unavailability is the probability that the system is failed at a given time point t:

$$ \mathbb {P} \left( \Diamond ^{[t,\,t]}\,\,``failed\text {''} \right) . $$

3.2 Lazy Verification

While efficient probabilistic model-checking algorithms exist [21], the state-space explosion remains a limiting factor. This issue can partially be mitigated by using bisimulation minimization [5], but this requires the entire state space to be generated first. We resort to partial state-space generation. This is supported by Storm to analyse non-repairable dynamic fault trees [29] and yields safe upper- and lower-bounds of unreliability (and other measures). In this work, we generalise this to repairable models described by BDMPs.

The idea is to perform iterative state-space exploration. Let us explain this using Fig. 4. In each iteration, we explore a certain number of unexplored states—prioritized according to some heuristic. In our example of Fig. 4, we start by first exploring the initial state \(s_0\) which yields successor states \(s_1, \dots , s_{12}\). State \(s_0\) is now explored (white), and all other states are discovered (gray). All discovered states are equipped with a self-loop in order to distinguish them from deadlock states. Deadlocks are attributed to modelling errors. The resulting partial state space is used to obtain safe lower and upper bounds. As both unreliability and unavailability use the label “failed” we pursue as follows. To obtain a lower bound lb, we mark all explored states that correspond to a system failure with the label “failed” and analyse the unreliability (or unavailability) of the resulting CTMC. For the upper bound ub, we additionally label all discovered states with label “failed” as well and analyse the resulting CTMC. For our example, we obtain the interval [0, 0.8646] for the unreliability. If \(ub{-}lb\) is less than a user-defined accuracy \(\epsilon \), then the procedure terminates. Otherwise, it continues with the next iteration and explores more states. In our example, we continue exploration as indicated in Fig. 5. In the second iteration, we explore states \(s_1\) and \(s_2\) and obtain three new states. Note that now also “repair” transitions returning to state \(s_0\) are inserted. Computing the refined bounds on the extended state space yields [0, 0.0044]. (The nominal unreliability of this example is 0.0024).

Fig. 5.
figure 5

State space after the second iteration

“Failed” Labels for Discovered States. Note that we could have already added “failed” labels to many discovered states of the first iteration as they already represent system-level failure, e.g., state \(s_{12}\). While this will help to significantly tighten the bounds for the unreliability computation, it can give non-monotonic behavior for the unavailability. A discovered state which is marked as failed can never be left due to the self-loop. However, further exploration might introduce repair transitions such that non-failed states can be reached now. In such scenarios, the lower bound for the unavailability would decrease between iterations, which could yield unsound bounds. We, thus, keep our bound differences strictly non-increasing by applying a more conservative approach for discovered states.

Exploration Heuristics. The order in which states are explored and the threshold when to stop exploration for an iteration are determined by exploration heuristics. We use two types of heuristics in Storm: depth-based and probability-based. The depth-based heuristic explores the state space up to a predefined depth, i.e., distance to the initial state. This method is beneficial if one wants to analyse a system for a certain number of consecutive failures. The probability-based heuristic orders the states by their probability to eventually reach the state from the initial state. That way, we give priority to states which are more likely to occur and disregard unlikely events.

In the second iteration of our example in Fig. 5, we explored states \(s_1\) and \(s_2\), because they have the highest incoming transition probabilities. We set the exploration threshold to \(10^{-5}\) for iteration 2. That means, the other states are not explored in this iteration, because their probabilities are below the threshold.

Proposition 1

The lazy verification technique provides sound lower and upper bounds, i.e., the exact unreliability (or unavailability) lies within [lbub].

This can be seen as follows. The approximation accounts for both extremes. Either all of the discovered states lead to the failure (upper bound) or none of the discovered states lead to a failure (lower bound). Moreover, the difference between the upper and the lower bounds is strictly non-increasing for increasing iterations. The algorithm terminates at the latest when the complete state space is explored. The lower and upper bound then equal the exact result (up to the machine precision) as no discovered states are present anymore.

4 The Initiator and All Barriers Method

Context. The I&AB method [10, 11] is an approximation-based analysis approach. It aims to compute a conservative approximation of the BDMP’s unreliability for a given mission time. It is included in the commercial RiskSpectrum PSA 1.4.0/RSAT 3.4.5 software tool to assess repairs in real-life nuclear probabilistic safety assessments (PSAs) (cf. footnote \({}^4\)). It is based on two assumptions:

  • all standby redundancies become active upon the failure of an initiator, and

  • the repair of initiator i inhibits the system failure (due to the sequence initiated by i).

As the I&AB method relies on the repair of initiators, it is applicable to repairable systems only. It requires the repair rates to be at least ten times higher than the failure rates. We describe the four steps of the I&AB method and demonstrate each step on the sample BDMP in Fig. 3 (right):

Step-1: Marking initiators and barriers. The I&AB method starts by partitioning the set of basic events into initiators and barriers. Initiators are the basic events that can take the system out of its perfect state (aka: initial state). Barriers are basic events that get activated upon the failure of an initiator. Once an initiator fails, all non-failed initiators are—as barriers—activated.

Example 3

As Grid and CB1_IO are the only basic events that can be active at the start, we mark them as initiators. All other basic events are barriers.

Step-2: Generating minimal contents of sequences. After declaring the initiators, the I&AB method computes the set of minimal contents of sequences (MCSS). This set contains sequences of failures—initiated by an initiator—that cause the top event to fail. Such sequence m is minimal in the sense that none of its proper prefixes causes a top event failure. To compute the MCSS, an SFT with NOT-gates is derived from the BDMP while respecting the triggers and precedence links. A PAND-gate is replaced by an AND-gate. The minimal cut setsFootnote 3 of the SFT computed using SCRAMFootnote 4 correspond to MCSS of the BDMP.

Fig. 6.
figure 6

SFT generated for the running example

Example 4

Applying the proprietary EDF-tool KB3Footnote 5 to our running example yields the SFT of Fig. 6. The black circles above basic events represent negated literals. The two top layers of the SFT are as in the BDMP. The top event AND Main_line_fail has two inputs: Normal_line_fail, and Standby_line_fail. Consider the boxed fragment of the SFT. Normal_line_fail occurs when either CB1_IO or Grid fails. Since they are initiators, their failures as initiators are mutually exclusive. This is captured by the negated literals, e.g., the left input of CB1_IO fails if CB1_IO_init fails and Grid_init does not. As CB1_IO is an initiator and its mode of failure switches from init to failF (failure in-function) when Grid_init fails, the right input of CB1_IO has Grid_init and CB1_IO_failF as inputs.

Now consider the right input of Main_line_fail, i.e., Standby_line_fail. The Standby_line_fail of Fig. 3 (right) has four components, and it is the target of a trigger. Correspondingly, four AND gates in the SFT capture the failure of each component. Dies_gen_RS, CB2_RC, and Dies_gen fail once the trigger source Normal_line_fail fails. The failure conditions of CB1_RO are more involved, as it is a target of an inverted trigger. The right input of CB1_RO depicts that CB1_RO_failI (failI is a failure on-demand) must fail for CB1_RO to fail. Whereas the left input of CB1_RO ensures that the trigger source, i.e., Normal_line_fail has failed and this failure is caused by Grid. The SFT’s minimal cut sets are:

{CB1_IO_init, Dies_gen_failF}, {CB1_IO_init, Dies_gen_RS_failI}, {CB1_IO_init,

CB2_RC_failI}, {Grid_init, CB1_RO_failI}, {Grid_init, CB2_RC_failI},

{Grid_init, Dies_gen_RS_failI}, and {Grid_init, Dies_gen_failF}.

Step-3: Computing minimal products. The MCSS are arranged as minimal products, partial minimal sequences associated to initiators. It amounts to stripping initiator ie from sequence \(m \in MCSS\) and designating \(m \setminus ie\) as minimal product associated to ie. The minimal products are arranged as dictionary structure. The minimal products for MCSS are:

CB1_IO_init: {{Dies_gen_failF}, {Dies_gen_RS_failI}, {CB2_RC_failI}} and

Grid_init: {{CB1_RO_failI}, {CB2_RC_failI}, {Dies_gen_RS_failI},

{Dies_gen_failF}}.

Step-4: Computing unreliability. Based on minimal products, a closed-form formula approximates the BDMP’s unreliability. To that end, the system failure distribution is assumed to be exponential, and the key is to approximate its failure rate \(\lambda _{eq}\). For mission time t, the unreliability is approximated by \(1{-}e^{-\lambda _{eq}\cdot t}\). The rate \(\lambda _{eq}\) is approximated as \(\sum _{ie \in \text{ init }}{\lambda _{ie} \cdot P_{ie}}\), where \(\text{ init }\) is the set of initiators and \(P_{ie}\) is the probability of system failure due to minimal products associated to \(ie \in \text{ init }\). An upper bound is obtained by \(P_{ie} \le \sum _{c=1}^{k}{\overline{R}_c{(\infty )}}\), where k is the number of minimal products with initiator ie and \(\overline{R}_c(\infty )\) is the steady-state unreliability of a hypothetical parallel system composed of components \(comp \in c\). Since BDMPs can have both exponential (failure in-function) and instantaneous (failure on-demand) failures, a minimal product c can be: (1) mixed type, i.e., both exponential and instantaneous type components, (2) exponential type, or (3) instantaneous type. Closed-forms for \(\overline{R}_c(\infty )\) for these types are given in [10]. As type (2) and (3) are special cases of type (1), we consider (1) in the following.

Using Murchland approximation [15] we have: \(P_{ie} \le \sum _{c=1}^{k}{E(N_c(\infty ))}\), where \(E(N_c(\infty ))\) is the expected number of system failures due to minimal product c within time interval \((0, \infty )\). Consider a minimal product c involving \(\ell \) instantaneous and m exponential components. Let \(\lambda _{c,i}\) denote the failure rate of the \(i^{th}\) exponential component, \(\gamma _{c,i}\) denote the failure probability of the \(i^{th}\) instantaneous component and \(\mu _{c,i}\) denote the repair rate of the \(i^{th}\) component. Using \(r_{c,i} = \lambda _{c,i}{+}\mu _{c,i}\) and \(\mu _c = \mu _{c,ie} + \sum _{j=1}^{\ell }{\mu _{c,j}}\), where \(\mu _{c,ie}\) is the repair rate of initiator ie, \(E(N_c(\infty ))\) equals:

$$\begin{aligned}\begin{gathered} \overbrace{\prod _{i=1}^{\ell }{\gamma _{c,i}}}^{(INST)}\sum _{i=1}^{m}{\lambda _{c,i}} \bigl ( \prod _{\begin{array}{c} j=1\\ j\ne i \end{array}}^{m}{\frac{\lambda _{c,j}}{r_{c,j}}\underbrace{\int _{0}^{\infty }{\overbrace{e^{-\sum _{k=1}^{\ell }{\mu _{c,k}}\cdot x}}^{INST}\sum _{\begin{array}{c} j=1\\ j\ne i \end{array}}^{m}{(1{-}e^{-r_{c,j} x}})dx}}_{expr_1}} - \prod _{j=1}^{m}{\frac{\lambda _{c,j}}{r_{c,j}}\underbrace{\int _{0}^{\infty }{\overbrace{e^{-\sum _{k=1}^{\ell }{\mu _{c,k}\cdot x}}}^{INST}\sum _{j=1}^{m}{(1{-}e^{-r_{c,j} x}})dx}}_{expr_2}} \bigr ) \end{gathered}\end{aligned}$$

Intuitively, \(expr_1\) represents the probability that all components of c except component i failed. The term \(expr_2\) represents the probability that all components of c failed. Their difference is the contribution of component i to c’s failure probability. The repair of instantaneous components also contributes to these probabilities (these parts are identified by overbraces). Solving \(expr_2\) yields:

$$\begin{aligned}\begin{gathered} \frac{1}{\mu _c}-\sum _{i=1}^{m}{\frac{1}{\mu _c {+} r_{c,i}}}+\sum _{i=1}^{m}{\sum _{j>i}^{m}{\frac{1}{\mu _c {+} r_{c,i} {+} r_{c,j}}}} - \sum _{i=1}^{m}{\sum _{j>i}^{m}{\sum _{k>j}^{m}{\frac{1}{\mu _c {+} r_{c,i} {+} r_{c,j} {+} r_{c,k}}}}} + \cdots +(-1)^m \frac{1}{\mu _c {+} \sum _{i=1}^{m}{r_{c,i}}} \end{gathered}\end{aligned}$$

The formula for \(expr_1\) is similar except that index i is ignored.

Example 5

In our running example, let the failure rate (probability) of all exponential (instantaneous) events be 0.0001, and the repair rate be 0.1. We have that \(c_1\) = {Grid_init : {CB2_RC_failI}} is instantaneous, and \(c_2\) = {CB1_IO_init: {Dies_gen_failF}} is exponential. For \(c_1\) and \(c_2\) we obtain:

where for \(\lambda _1 = \lambda _{Dies\_gen\_failF}\), \(\mu _1 = \mu _{Dies\_gen\_failF}\), \(\mu = \mu _{CB1\_IO}\), \(r_1 = \lambda _1{+}\mu _1\):

As there are two exponential and five instantaneous components, all of the same cardinality and equal reliability parameters:

$$\begin{aligned} \lambda _{eq} = 5\cdot (1\cdot 10^{-8}) + 2 \cdot (9.995\cdot 10^{-8}) = 2.499\cdot 10^{-7}. \end{aligned}$$

The system unreliability for mission time \(t=10,000\) thus is: \(1-e^{-t\cdot \lambda _{eq}} = 0.0025.\) (The nominal unreliability of this example is 0.0024, see Sect. 3.2.)

5 Case Studies

We first tested our implementation on 24 BDMP test cases available onlineFootnote 6. Though useful as a sanity check for our implementation, the test cases are too small to meet our objective: investigating scalable analysis on real-world case studies. There is only one larger test case in the literature: a power supply of a nuclear power plant [8], so other test cases were obtained by a semi-automatic translation from DFTs to BDMPs using the approach outlined in [6]. We selected DFTs from the online DFT repositoryFootnote 7, and added repairs. This yields:

Dual Processor Reactor Regulation System. This DFT models a power regulator in a nuclear reactor [18]. The model available on the benchmark website is non-repairable and we adopted the repair rates from the original paper [18].

Emergency Power Supply of Nuclear Power Plant. This BDMP is given by EDF as a challenge for BDMP analysis tools [8]. The repairable model captures most of the scenarios encountered while analysing complex dynamic systems.

Railway Crossing. This DFT models a railway level crossing [20]. We consider variant RC_5_5_sc in this paper. It consists of 5 barriers with independent motors, 5 groups of sensors, and the controller is modeled as a single basic event.

Vehicle Guidance Case Study. These DFTs stem from an industrial case study on safety concepts from the automotive domain [19]. We extended the original (non-repairable) DFTs by adding repair rates of 0.1 to all basic events. We consider all eight variants of this test case.

Railway Station. These DFTs model the influence of infrastructure failures in German railway station areas on the routability of trains [30]. We consider all six variants and made them repairable by adding repair rates of 0.1 to all basic events. Note that this model is not yet available on the benchmark website.

Sensor Filter. This DFT is automatically synthesized from an AADL description using the Compass tool-chain [13]. We use variant \(sf\_3\_2\) modelling a network consisting of three sensors and two filters.

6 Results and Discussion

We implemented the incremental verification (Sect. 3.2) in the probabilistic model checker Storm  [22], called Storm-approx. We also implemented the I&AB method (Sect. 4). We validated the outcomes of our open-source prototypical implementation with the commercial RiskSpectrum toolFootnote 8. While the results coincide for both implementations, our implementation is not competitive in terms of computation time and scalability. (We calibrated our implementation using [8] as the RiskSpectrum results were online for this test case [12].) We compare both approaches with the Monte Carlo simulator Yams  [7] and evaluate both repairable and non-repairable BDMPs. We use the trimming feature [9]—which reduces the state space—to be able to treat large BDMPs. All the results discussed in this Section are available onlineFootnote 9.

Fig. 7.
figure 7

Accuracy of results computed by Storm, Yams and I&AB

6.1 Accuracy

Non-repairable Models. As the I&AB method is not applicable to non-repairable systems, we only compare Yams and Storm-approx and give the results in Fig. 7a. Since the absolute deviations between both tools are negligible, we plot the % deviations relative to the respective mean value. Yams provides a mean value and a confidence interval \([\ell ,u]\). In Fig. 7a, the extremities of segments correspond to \(\frac{\ell -u}{2\cdot mean\_value}\cdot 100\) and \(\frac{u-\ell }{2\cdot mean\_value}\cdot 100\) Storm-approx gives an upper_bound and a lower_bound. We compute % error bounds by using \(\frac{upper\_bound-mean\_value}{mean\_value}\cdot 100\) and \(\frac{lower\_bound-mean\_value}{mean\_value}\cdot 100\) where mean_value is the middle of the obtained bounds. Since we used \(10^7\) simulations for each test case with YAMS, the width of the confidence interval depends on the probability to be estimated. Figure 7a indicates that the magnitude of the unreliability values of both tools coincides. However, the Storm bounds are significantly tighter than the simulation bounds of Yams with the notable exception of test case RS_A_alt. For this test case, Storm-approx took 18 h and explored 15.6 M states with 20.6 M transitions, cf. Table 1. This means that the number of states required to compute tighter bounds is too big to explore within the time-limit of 24 h.

Repairable Models. Figure 7b presents the (absolute) unreliability values obtained by Storm, Yams and I&AB. The results of Yams and Storm coincide. However, the results of the I&AB method are very pessimistic for six benchmarks. This is mostly due to the fact that I&AB does not work if there are looped interactions of structure functions through triggers, i.e., cyclic dependencies. Such loops inhibit the generation of SFTs for MCSS computation. One must manually break the loops by modifying the activation conditions of some instantaneous type basic events. The BDMPs of VG-variants contain such loops making them inappropriate for processing with I&AB. This yields trivial bounds.

Table 1. Computation time and state spaces statistics

6.2 Computation Time

Table 1 gives a detailed account of the model statistics and tool performances. For each test case, we list the number of BDMP elements. For both repairable and non-repairable variants, we give the explored state space sizes of Storm and the computation times required by the three tools. The error bound of Storm-approx is set to \(10^{-3}\) and the number of simulations for Yams is \(10^{7}\).

The state-space sizes might seem small at first glance, but they are the result of enabling trimming for the BDMP and partial exploration of these trimmed state spaces. The effect of trimming can be seen for the DPRPS test case, where the state space without trimming is several orders of magnitudes larger.

The computation times of Storm and I&AB are within a few minutes for most of the repairable test cases whereas Yams requires significantly more computation time. Note that some of these times could be considerably reduced by adapting the number of simulations to the probability to be calculated. The performance of Storm and I&AB is comparable for the test cases where I&AB returns accurate results. For RS_M_alt, I&AB is 5 times faster than Storm, whereas Storm is faster on RS_A_std. Note that the timings for Storm are not directly correlated to the state space sizes, compare, e.g., NPPS and RC_5_5_sc. As failure and repair rates typically differ by at least one order of magnitude, computing unreliability becomes expensive due to the stiffness of the CTMC.

The I&AB method requires more than one hour for RS_A_alt and RS_M_alt. For both BDMPs, a high number of cut sets needs to be computed. We address this issue using SCRAM by limiting the cardinality of a cut set. But by doing so we introduce an error that cannot be mastered. The implementation of I&AB in RiskSpectrum would be much faster and yet more precise, as it is able to eliminate cut sets with probabilities lower than a given threshold during their generation.

Fig. 8.
figure 8

Percentage of explored state space and computation time

6.3 Gained Insights on Lazy Verification

We provide insights into the lazy verification technique using the sensor filter sf_3_2 test case. The observed trends for this benchmark are representative for the other benchmarks as well. The exhaustive non-trimmed state-space of the non-repairable (repairable) version consists of 5.19 M (5.19 M) states and 14 M (66 M) transitions and was verified in 10 m (56 m). The (approximate) trimmed state-space for non-repairable (repairable) case consists of the same number of states and 14 M (50 M) transitions and verified in 18 m (50 m).

State Space Coverage Versus Number of Iterations. The first trend we studied is the percentage of the total state space explored in each iteration and the total computation time up to this iteration. The trends for both repairable and non-repairable models are shown in Fig. 8a. Both graphs are plotted for an accuracy of \(10^{-3}\). Interestingly, the repairable model converges faster as compared to the non-repairable version and explores fewer states. One reason is the fact that while exploring the state space, we enter the region which does not lead to system failure but both our lazy verification and BDMP trimming are agnostic of this. Such region can exist due to, e.g., fail-safe behavior of a PAND-gate.

State Space Coverage Versus Accuracy. The next trend we studied is the dependency of the state space coverage on the accuracy. Figure 8b shows that a larger percentage of the state-space is explored for increasing accuracy. After the threshold of \(10^{-3}\), the time for approximating the non-repairable model exceeds the verification time of the exact approach, i.e., 10 m. However, for the repairable case, the computation time is less than that required to fully explore the state space. This implies that our approximation technique, combined with the trimming feature, is more efficient for analyzing repairable systems.

Convergence of Approximate Results. Figure 9 plots the % relative errors of upper and lower bound to the exact value. We see that the bounds converge faster for increasing number of iterations and the error is negligible after 20 iterations.

Fig. 9.
figure 9

Iteration versus bound convergence

7 Conclusion

We presented an iterative verification procedure for BDMPs based on partial state-space exploration. Our evaluation shows that this approach allows scalable analysis of BDMPs while providing sound upper and lower bounds of the exact result. Our technique is applicable to other dynamic reliability models too. Lazy verification is a promising approach and we plan to further improve it by developing better exploration heuristics, e.g., using learning techniques [1, 14].