Keywords

1 Introduction

Reliability is the desired ability for mission and safety-critical systems, which generally speaking characterizes the ability of the system to continue to operate the intended functions correctly even in the presence of faults [2, 18]. Among all the possible approaches to increase the system reliability, a common and prominent practice in reliability engineering is to replicate the components carrying out critical functions and encapsulate them in architectural redundancy patterns so that the single point of failure can be avoided and when faults occur in a limited number of critical components, they can be identified and excluded upon reconfiguration without compromising the overall functionality of the system. The most well-known redundancy architectural patterns are static Triple Modular Redundancy (TMR) and dynamic Cold Standby, which have been widely used in the practical development of mission and safety-critical systems [23].

Despite the practical needs of designing and analyzing safety-critical systems, the reliability analysis of systems built with redundancy architectures, in general, is a difficult task due to the lack of specific techniques addressing both modeling and automated analysis. Previous works mostly use fault tree analysis (FTA) [16] and rely on a substantial amount of labor effort for the reliability verification and analysis. In [15] the authors analyze the reliability of cascaded TMR with “paper-and-pencil” techniques. However, such approaches cannot be generalized to cover a wide range of architectural patterns. Some other works [9, 10] propose to use SMT techniques to perform automated analysis of redundancy architectures and it improves the overall scalability concerning the monolithic case, obtaining speed-ups of various orders of magnitude. However, they do not consider the system’s dynamic behavior or integrate it with the process of model-based system design.

In order to automate the reliability evaluation of different redundancy architectures, we follow the methodology of model-based design and safety analysis [21], which has been promoted as an increasingly prominent approach for the development of safety-critical systems. In model-based design, various development activities such as simulation, verification, testing, and code generation are based on a unified model, describing system behavior and architectures. In the model-based safety analysis, the model of the system behavior is further extended and augmented by taking into account the faulty behavior of software and hardware components. We then can analyze the reliability properties based on the extended system in the presence of faults. The main advantage of this methodology is that the system and safety engineers work off a common, unambiguous model of the system leading to tighter integration between the systems and safety engineering processes. The common model ensures that reliability analysis results are relevant and up-to-date as the system architecture evolves and allows reliability assessment early in the system design process. Additionally, it supports the exploration of different architectures and design choices by automatically determining which choices will increase reliability. Ideally, computational tools such as model checkers can automate many reliability analysis activities, leading to more accurate and complete reliability analyses while reducing manual effort.

In this work, we leverage the SBIP framework. SBIP [22] is a stochastic extension of BIP (Behavior-Interaction-Priority) [3, 4, 7, 17] with an emphasis on formal modeling and statistical analysis of safety-critical systems exhibiting stochastic behaviors. BIP is a component-based system design framework advocating the rigorous design methodology for complex hardware/software mixed system design [3, 24]. The concept of rigorous system design can be understood as a formal, accountable and coherent process for deriving correct-by-construct system implementations from high-level specifications. The essential safety properties of the design are guaranteed at the earliest possible design phase by applying algorithmic verification to the system model. Then the system implementation is automatically generated by a sequence of property preserving model transformations, progressively refining the model with details specific to the target platforms. The BIP framework provides a well-defined modeling language and an associated toolbox to realize the rigorous system design flow. The modeling language allows the construction of composite components from atomic components through the layered application of interactions and priorities. The BIP toolbox supports both verifications of high-level system designs [5] and automatic model transformation and code generation of low-level implementations from high-level system designs. In practice, BIP has been actively used in several applications [1, 19].

To this end, the contributions of this work can be summarized as follows.

  1. (1)

    We propose a novel approach for the modeling and quantitative reliability analysis of redundancy architectures based on the SBIP framework. Our approach supports modeling the nominal system behavior and the system faults in a unified formal model, which can be further integrated into the rigorous component-based system design paradigm.

  2. (2)

    We propose two categories of metrics to evaluate the reliability and related properties of redundancy architectures, in terms of whether the system is able to operate correctly or provide reduced functionalities in the presence of faults. All the properties are specified as formulas in bounded LTL and automatically analyzed using the statistical model checker for BIP [22].

  3. (3)

    We further take a computation unit as the running example and apply the proposed approach to the analysis of two widely used redundancy architectures, i.e., the static triple duplication redundancy and the dynamic cold standby redundancy. The experimental results show that our approach can accurately model various redundancy architectures and provide a comprehensive analysis of reliability and related properties in an automated manner.

The remainder of this paper is organized as follows. Section 2 describes the related works. In Sect. 3, we carry out a study with the computation unit and build a formal model of CU including both nominal and faulty behavior within the SBIP framework. Then we develop the formal model of TMR and Cold Standby architectures and analyze their reliability using SBIP. Finally, Sect. 4 concludes the paper and points out our future work.

2 Related Work

COMPASS (Correctness, Modeling, and Performance of Aerospace Systems) [8] is an international research effort aiming to ensure system-level correctness, safety, dependability, and performability of onboard computer-based aerospace systems. COMPASS uses a System-Level Integrated Modeling (SLIM) language for modeling and specifying hardware/software systems, and the COMPASS toolset supports for timed failure propagation graphs, non-deterministic models, the newly developed statistical model checking and requirement formalization approaches.

The Altarica language [6] can also formally specify the behavior of systems when faults occur. An Altarica model can be assessed be employing complementary tools such as fault tree generator and model-checker. FSAP/NuSMV-SA [11, 12] is a toolset for safety and reliability analysis. FSAP/NuSMV-SA supports failure mode definition and model extension through automatic failure injection.

The work in [14] proposes a model-based approach using dynamic fault trees for the safety analysis of vehicle guidance systems. Its flexibility could support new partitions and architectural changes being accommodated automatically. The fault tree analysis in [9, 10] aims at evaluating characteristics of redundancy architectures without considering the behaviors of components. The work in [20] proposes an intelligent fault diagnosis method based on an improved domain adaptation method. In [25], a module based on redundancy is designed within the formalism of timed automata and formally analyzed using the UPPAAL model checker. The work in [13] presents an approach based BIP framework for the rigorous design of FDIR components. It leverages the statistical model checking to check the requirement satisfaction and the code generation feature of the BIP compiler. Our work differs from the previous work in that we mainly focus on the reliability analysis of redundancy architectures. The work in [15] presents an ad-hoc algorithm that can analyze the reliability of computational chains based on Triple Modular Redundancy with one voter. Differently, we provide ways to make quantitative analyses reliability of redundancy architectures in full automation.

3 Formal Modeling and Reliability Analysis in SBIP

This section introduces a simple computation unit (CU) that will be used as a running example to illustrate the concepts and our approach throughout. Firstly, we build a formal model of CU including both nominal and faulty behavior within the SBIP framework. Then we build the formal model of TMR and Cold Standby architectures.

3.1 A Computation Unit Example

The computation unit is a common module in many embedded systems, for instance, the detection system in the space field. Its function is to receive the instructions of the upper main control system, then obtain the data from the lower sensor, process the data, and return the processing results to the main control system.

We consider an abstract model of CU, which receives data from some components as its input and computes an output to the other components. Let \(\mathtt {cu\_input}\) be the input and \(\mathtt {cu\_output}\) be the output of the CU. To simplify the model, we assume that the value of \(\mathtt {cu\_output}\) ranges from \(-1\) to 1, where 1 indicates the correct computation, \(-1\) indicates the incorrect computation and 0 is the cleared output. Initially, the value of \(\mathtt {cu\_output}\) is 1. An impulse generator is used to first issue an edge impulse xms every \(\mathtt {T}\) time unit to force the CU to read its input, compute and place the result in \(\mathtt {cu\_output}\), and then clear the output. Figure 1 displays a system with one fault-free CU, one Impulse generator and their connector modeled in SBIP. In the Impulse generator model, the Impulse generator automaton models the impulse generator periodically produces edge impulses. A clock x is declared to record the time between sending of two edge impulses. Starting from the initial state \(\mathtt {S0}\), it generates a signal through port xms in each T time unit that synchronizes with the port xms of the CU automaton, then generates a signal through port clear that synchronizes with the port clear of the CU automaton. In the CU model, a variable \(\mathtt {cu\_output}\) is declared to represent the computing result of CU. Starting from the initial state \(\mathtt {Good}\), it receives a signal through port xms that computes an output and receives a signal through port clear that clears an output.

Fig. 1.
figure 1

Fault-free CU model

We consider that following three kinds of faults that may occur in a CU.

  • \({FAULT_0}\): CU enters a deadlock state.

  • \({FAULT_1}\): CU enters an error state in which it computes incorrect results.

  • \({FAULT_2}\): CU enters a livelock state and executes only internal actions without any outputs.

The fault-affected CU model is shown in Fig. 2. Initially, CU works well and stays in location Good. The self-loop in fault-free location Good models the scenario when a synchronization impulse xms occurs, and the CU outputs a correct result. From location Good, the CU automaton selects which fault may occur in a stochastic manner. We use exponential distribution to approximate the failure occurrence model for the faults mentioned above in this work. In SBIP, we define a stochastic port with a gamma density function, i.e., its scheduling time is sampled with respect to a gamma function with alpha 1 and beta 100 to indicate the probability of the occurring fault. If \({FAULT_0}\) occurs, it moves to the error location Error0 and the automaton deadlocks. Since the xms is defined as a broadcast channel, the Impulse automaton can still execute it when the CU stays in location Error0. The Impulse automaton can still execute the xms. The location Error1 is reached from location Good when \(FAULT_1\) occurs. The model outputs incorrect data when the signal xms is issued in this location. Similarly, if \(FAULT_2\) occurs, the CU goes to location Error2, in which it fails to output data when the signal xms is issued. We also add fault transitions from location Error1 to Error0 and Error2, because a \(FAULT_1\) may be followed by one of the other faults.

An auxiliary variable \(\mathtt {cu\_fault}\) indicates whether the CU is faulty. It is initially 0, representing that the CU is in location Good. When a fault occurs in the CU, the value of \(\mathtt {cu\_fault}\) becomes 1.

Fig. 2.
figure 2

A SBIP model of fault-effected CU

3.2 Formal Model of Triple Modular Redundancy Architecture

The \(\mathbf {TMR}\) model consists of three replicated CUs, one Impulse generator and one Voter (see Fig. 3). Since the Impulse generator and Voter are not redundant in this architecture, it is prone to single-point failures. However, its complexity is usually significantly lower than that of the three modules, and this safety gap is tolerated. So we assume that no faults occur in the Impulse generator or the Voter.

In the Impulse generator model (see Fig. 4), the automaton generates a synchronization signal through port xms to trigger the three CUs to process their inputs and start the computation. Then it generates a signal through port x5ms to trigger the Voter to check on the outputs of CUs after \({CU\_PERIOD}\) time units. Finally, a clear is sent after \({VOTER\_PERIOD}\) time units to trigger the three CUs to clear their outputs.

In \(\mathbf {TMR}\), the CU model (see Fig. 5) is similar to fault-effected CU, but it has two additional ports. The port fault of \(\mathtt {CUi}\) synchronizes with the port faulti of the Voter automaton, and the port output of \(\mathtt {CUi}\) synchronizes with the port outputi of the Voter automaton. In \(\mathbf {TMR}\), we define the system as in a degraded state when one CU is faulty and the other two CUs are fault-free.

Fig. 3.
figure 3

Triple Modular Redundancy model

Fig. 4.
figure 4

A SBIP model of Impulse generator

Fig. 5.
figure 5

A SBIP model of TMR’s CU

In this model, a Voter is used to detect the faults (see Fig. 6). When it receives the x5ms signal, it calls two functions: \(\mathtt {fault\_check}\)() and \(\mathtt {vote}\)(). Function \(\mathtt {fault\_check}\)() evaluates whether any CU is faulty. It gets the value of variable \(cu\_fault\) of \(\mathtt {CUi}\) through port faulti. If it is 1, then Voter treats \(\mathtt {CUi}\) as faulty. Otherwise, it is fault-free. Function \(\mathtt {vote}\)() computes the voting algorithm. It compares results from the three CUs and outputs the majority value to the buffer \(\mathtt {voter\_output}\). If the results of the three CUs are different, it computes the value of \(\mathtt {CU0}\).

3.3 Formal Model of Dynamic Cold Standby Redundancy Architecture

We build the cold standby redundancy architecture model (see Fig. 7), which contains one Impulse generator, one Primary CU, one Backup CU, and one Switch. The Impulse generator and Primary CU are similar to that of the \(\mathbf {TMR}\) model, but all ports of Primary CU are connected to the port primary of Switch.

The Switch automaton is shown in Fig. 9. The Switch initiates in the state \(\mathtt {S0}\) and sets variable \(\mathtt {is\_fault}\) to 0, which means the Primary CU is fault-free. Moreover, the Switch only interacts with the Primary CU in state \(\mathtt {S0}\). The port isswitch connects with the port fault of the Primary CU. The port isswitch is active and the state \(\mathtt {S0}\) transfers to state \(\mathtt {S1}\) when the condition \(\mathtt {is\_faulty == 1}\) holds. Then the Switch interacts with the Backup CU in the state \(\mathtt {S1}\).

The Backup CU automaton is shown in Fig. 8. The initial state is in \(\mathtt {Start}\), and it transfers to state \(\mathtt {Good}\) when the port backup is active. Moreover, the function of Backup CU is similar to Primary CU after it locates the state \(\mathtt {Good}\). It may still locate in state \(\mathtt {Good}\) in the next state, or transfer to state \(\mathtt {Errori}\). All ports of the Backup CU except port backup are connected to the port backup of Switch.

3.4 Formal Reliability Analysis Using SBIP

As stated above, reliability refers to the capability of the system to operate correctly in the presence of faults. The system is in a fault-free state if the system safely operates, and the system is in a degraded state if the system still safely operates but provides reduced functionality. Furthermore, the system is in a failure state if the function of the system, parts or components are lost or abnormal, resulting in its inability to perform the expected function. We propose two categories of metrics for formal reliability evaluation of the system in terms of whether the system can operate correctly or provide reduced functionalities in the presence of faults.

We use the following four measures to analyze the reliability of the system. \(\mathtt {Reliability}\) means the probability of the system being in a fault-free state. \(\mathtt {Failures\ In\ Time\ (FIT)}\) means the average probability of the system being in a fault-free state during the considered operational lifetime. \(\mathtt {Degraded\ Availability\ (DA)}\) means the probability of the system being in a degraded state. Moreover, \(\mathtt {Full\ Function\ Availability\ (FFA)}\) means the probability of the system being in a fault-free and degraded state.

\(\mathtt {Reliability}\) is obtained by first computing the time-bounded probability to reach a state where the system has failed and then complementing this value. To obtain the average failure-probability, the complement of the reliability is scaled with the lifetime (to determine the failures in time, \(\mathtt {FIT}\)). \(\mathtt {DA}\) is obtained by computing the time-bounded reachability probability for reaching a degraded state. \(\mathtt {FFA}\) is obtained by computing the complement of the time-bounded reachability probability for reaching a failed or degraded state.

Fig. 6.
figure 6

A SBIP model of TMR’s Voter

Fig. 7.
figure 7

Cold Standby model

Fig. 8.
figure 8

A SBIP model of Backup CU

Fig. 9.
figure 9

A SBIP model of Switch

The formal definition of the measures is given in Table 1. The detailed BLTL specifications of the measures estimated on the Triple Modular Redundancy model and dynamic redundancy model are followed.

In order to simplify writing, we assume that the \(cf_i\) represents that the i-th CU is faulty, i.e., variable \(\mathtt {cu\_fault}\) of the i-th CU equals to 1, and \(\overline{cf_i}\) represents the i-th CU is fault-free, i.e., variable \(\mathtt {cu\_fault}\) of the i-th CU equals to 0. The \(\phi _T\) represents the detailed BLTL specification of the Triple Modular Redundancy model, and the \(\phi _D\) represents the detailed BLTL specification of the dynamic redundancy model.

  • For \(\mathtt {Reliability}\):

    $$\begin{aligned} \begin{aligned} \phi _T(t) = 1-{\textbf {F}}[0,t]((cf_0\wedge cf_1\wedge cf_2)\vee (\overline{cf_0}\wedge cf_1\wedge cf_2) \\ \vee (cf_0\wedge \overline{cf_1}\wedge cf_2)\vee (cf_0\wedge cf_1\wedge \overline{cf_2})) \end{aligned} \end{aligned}$$
    (1)
    $$\begin{aligned} \begin{aligned} \phi _D(t) = 1-{\textbf {F}}[0,t](cf_0 \wedge cf_1) \end{aligned} \end{aligned}$$
    (2)
  • For \(\mathtt {FIT}\):

    $$\begin{aligned} \begin{aligned} \phi _T(t) = \frac{1}{lifetime}(1-{\textbf {F}}[0,lifetime]((cf_0\wedge cf_1\wedge cf_2)\\ \vee (\overline{cf_0}\wedge cf_1\wedge cf_2) \vee (cf_0\wedge \overline{cf_1}\wedge cf_2)\vee (cf_0\wedge cf_1\wedge \overline{cf_2}))) \end{aligned} \end{aligned}$$
    (3)
    $$\begin{aligned} \begin{aligned} \phi _D(t) = ) = \frac{1}{lifetime}(1-{\textbf {F}}[0,lifetime](cf_0 \wedge cf_1)) \end{aligned} \end{aligned}$$
    (4)
  • For \(\mathtt {DA}\):

    $$\begin{aligned} \begin{aligned} \phi _T(t) = {\textbf {F}}[0,t]((\overline{cf_0}\wedge \overline{cf_1}\wedge cf_2)\vee (\overline{cf_0}\wedge cf_1\wedge \overline{cf_2})\\ \vee (cf_0\wedge \overline{cf_1}\wedge \overline{cf_2})) \end{aligned} \end{aligned}$$
    (5)
    $$\begin{aligned} \begin{aligned} \phi _D(t) = {\textbf {F}}[0,t](cf_0 \wedge \overline{cf_1}) \end{aligned} \end{aligned}$$
    (6)
  • For \(\mathtt {FFA}\):

    $$\begin{aligned} \begin{aligned} \phi _T(t) = 1-{\textbf {F}}[0,t]((cf_0\wedge cf_1\wedge cf_2)\vee (\overline{cf_0}\wedge cf_1\wedge cf_2) \\ \vee (cf_0\wedge \overline{cf_1}\wedge cf_2) \vee (cf_0\wedge cf_1\wedge \overline{cf_2}) \vee (\overline{cf_0}\wedge \overline{cf_1}\wedge cf_2)\\ \vee (\overline{cf_0}\wedge cf_1\wedge \overline{cf_2}) \vee (cf_0\wedge \overline{cf_1}\wedge \overline{cf_2})) \end{aligned} \end{aligned}$$
    (7)
    $$\begin{aligned} \begin{aligned} \phi _D(t) = 1-{\textbf {F}}[0,t]((cf_0 \wedge cf_1)\vee (cf_0 \wedge \overline{cf_1})) \end{aligned} \end{aligned}$$
    (8)
Table 1. Definition of measures
Fig. 10.
figure 10

Results of experimental evaluations using statistical model checking

In the experimental evaluations, probability estimation algorithm is utilized. It enables to compute the probability p for S to satisfy \(\phi \). Give a precision \(\delta \), this algorithm computes a value for \(p'\) such that \(\left| p'-p\right| \le \delta \) with confidence \(1-\alpha \). We use probability estimation with (\(\alpha \) = 0.2, \(\delta \) = 0.2) for all the analyses and rely on the parametric exploration to analyze specifications \(\phi _T\)(t) and \(\phi _D\)(t).

Figure 10(a) shows a comparison of the reliability of a single CU, \(\mathbf {TMR}\), and Cold Standby model. The x-axis represents the time steps and the y-axis represents the probability of property. We observe that \(\mathbf {TMR}\) is slightly more reliable than single CU and Cold Standby is more reliable than single CU and \(\mathbf {TMR}\). Because three CUs in \(\mathbf {TMR}\) are executed concurrently and the backup CU is activated after the primary CU is faulty, the reliability of Cold Standby is higher at the beginning of a period. The result of Fig. 10(b) shows that the FIT between \(\mathbf {TMR}\) and single CU are nearly the same, and the FIT of Cold Standby is greater than the others. Figure 10(c) indicates that the differences between DA of \(\mathbf {TMR}\) and Cold Standby are marginal. Figure 10(d) shows that the FFA of Cold Standby is Slightly higher than \(\mathbf {TMR}\) at the beginning of a period, and then the FFA of Cold Standby is similar to \(\mathbf {TMR}\).

4 Conclusions and Future Work

We propose a novel approach to automated reliability analysis of redundancy architectures based on the BIP framework and illustrate it on Triple Modular Redundancy and Cold Standby. The reliability analysis uses the statistical model checking tool SBIP. We propose a relatively complete set of metrics for reliability, including system reliability metrics and degradation reliability metrics, and use it to evaluate the reliability of redundancy architectures. We apply our approach to a computation unit example and experimental results show that our approach can automatically model various redundancy architectures and perform a comprehensive analysis of reliability in practical redundancy systems. Moreover, our models are scalable and our approach is open to additional - models that can be easily extended to further types of fault rates, behavior types, etc. In the future, we will integrate qualitative analysis of reliability into our approach such that qualitative and quantitative analysis can be integrated into the rigorous component-based system design paradigm. Moreover, we can broaden the evaluation architecture and guide the system design according to the evaluation results.