1 Introduction

Safety critical software systems (SCSS) are pervasive in the medical field and have to be designed with maximum care. Dependability is an important criterion in these systems. To reduce the probability of losses, appropriate failure analysis practices have to be used to validate the safety of the critical system [1].

Achieving an error-free SCSS is very difficult, even when the system has been well tested, used, and documented. If one part of a system fails, this can affect other parts and, in the worst case, result in partial or even total system failure. To avoid such incidents, research on failure analysis is of high importance. Failure analysis is a proven approach for mitigating system hazards and failure modes and for determining which of those are influenced by software or by the lack of software [1]. A failure of a safety critical system can be defined as “the non performance or incapability of the system or a component of the system to perform or meet the expectation for a specific time under stated environmental conditions.”

Error propagation is the condition in which an error that occurs in one system module propagates to other modules and thereby cascades to the system output; the error propagation probability is the likelihood of this happening [2]. Error propagation analysis is a vital activity for the efficient and robust design of safety critical software systems.

Error propagation between software modules is a quantitative factor that reflects on the reliability of a safety critical software product. In general, an SCSS is considered to be in one of two major states, perfect functioning or failure. Here we consider several intermediate states between these two major states for the failure analysis. Hence these systems are termed Multistate Systems (MS) in our research. The reliability of an MS can be defined as a measure of the capability of the system to deliver the required performance level [3].

The presence of an error [4] in a software module might trigger an error in other modules of the system that are interconnected. Identifying error propagation in a software system is an important task during the development activity. Propagation analysis may be used to identify the critical modules in a system, and to determine how other modules are affected in the presence of errors. This concept will aid in system testing and debugging through generating required test cases that will stimulate fault activation in the identified critical modules and facilitate error detection [5].

The errors under consideration may be design errors caused by faulty design, or data errors caused by wrong, late, or early data. The impact of error propagation across modules can be assessed by analyzing the error propagation process and arriving at a general expression for the performance distribution of each module; because this process is complex and random, computational intelligence is used [6]. As per IEC 61508, the design and performance of critical systems must be safe enough to meet tolerable risk targets, taking into account all failure sources, including systematic hardware and software faults and random faults [7].

The reliability and performance of a multistate safety critical system can be computed using the Universal Generating Function (UGF) technique [8]. The UGF technique is based on probability theory and expresses models through polynomial functions. The UGF technique applied to failure analysis of safety critical systems in this paper follows the procedure given by Levitin et al. [8, 9].

Hence the error propagation analysis provides the base for the reliability evaluation, since the occurrence of error propagation across the modules has a significant effect on the system behavior during critical states.

The paper is structured as follows: Sect. 2 describes the background and Sect. 3 describes the proposed approach through a framework. The analysis of error propagation and failure of an SCSS is presented in Sect. 4. Conclusions and future work are discussed in Sect. 5.

2 Background

According to Avizienis et al. [4], a failure is an event that occurs when the delivered service no longer complies with the expected service of the system. An error is an incorrect internal state that is liable to lead to a failure or to another error. A fault is active when it results in an error; otherwise it is said to be inactive. Not all faults lead to an error, and not all errors reach the system’s output to cause a failure.

2.1 Error Propagation

Error propagation (EP) can be defined as a condition where a failure of a component may potentially impact other system components or the environment through interactions, such as communication of bad data, no data, and early/late data [10]. Such failures represent hazards that if not handled properly can result in potential damage. The interaction topology and hierarchical structure of the system architecture model provide the information on how the erroneous behavior of each system component interacts through error propagation.

Morozov et al. [5] have used probabilistic error propagation techniques for diagnosing the system. This aids in tracing the error propagation path back to the error origin. Moreover, this diagnosis helps in error localization, testing, and debugging [5].

The influence of error propagation in the overall system reliability has been demonstrated in [11]. With the help of UML artifacts, the system architectural information is used to find the probability of error propagation across system components [11]. Since they have used UML artifacts, their model can be used to predict reliability in the early phases of system development.

Hiller et al. [12] introduced the concept of “error permeability” of software modules, together with a set of associated measures, namely error exposure, relative permeability, and non-weighted relative permeability. They found these measures helpful in identifying the vulnerable modules that are exposed to error propagation.

A bottom-up approach to estimating the reliability of component-based systems is considered in [2]. First, the reliability of each system component was assessed. Then, based on the architectural information, the system reliability was estimated, taking into account the error propagation probability. The system analysis was carried out through a failure model that considers only data errors across components. The authors of [2] concluded that error propagation is a significant characteristic of each system component, defined as the probability that a component propagates erroneous inputs to its generated output. Their approach can be used in the early prediction of system reliability.

An error happens when a fault is activated [4]. A fault in a module causes an error only in that module; it cannot directly cause an error in other modules. Likewise, an error in a module can lead directly to a failure only of that module. A module error is caused either by the activation of a fault in the same module or by deviated input service from other modules. A module failure is defined as the deviation of its performance from its accepted output behavior. If the failed module is the output interface module of the system, then its failure is considered a system failure [13]. System failures are defined with respect to the system boundary: a system failure is said to occur when an error propagates outside the system. Figure 1 depicts inter-modular error propagation, in which module X influences module Y.

Fig. 1 Inter-modular error propagation

In a safety critical system, certain attributes are crucial to the safety of the system, and such critical attributes should be monitored consistently throughout the lifecycle of the system [14]. This work focuses on analyzing error propagation in safety critical software systems. In this approach, we use the universal generating function (UGF) technique to quantify the performance distribution of a multistate safety critical software system [3] and subsequently introduce a new metric called the safety metric SMEP.

2.2 MSS and Universal Generating Function

The UGF technique, also called the u function, is a mathematical tool introduced by Ushakov [15]. Levitin [8] extended it and showed that it is an effective technique for assessing the performance of real-world systems, specifically Multistate Systems. Traditional reliability models generally treat a system as a binary-state system, the states being perfect functionality and complete failure. In reality, each system has different performance levels and various failure modes affecting the system performance [3]. Such systems are termed Multistate Systems (MS).

Let us assume an MS composed of n modules. In order to assess the reliability of an MS, it is necessary to analyze the characteristics of each module present in the system. A system module m can have different performance rates, represented by a finite set \(q_{m} = \left\{ q_{m1}, q_{m2}, \ldots, q_{mi}, \ldots, q_{mk_{m}} \right\}\) [16], where \(q_{mi}\) is the performance rate of module m in the ith state and \(i \in \left\{ 1, 2, \ldots, k_{m} \right\}\). The performance rate \(Q_{m}\left( t \right)\) of module m at time \(t \ge 0\) is a random variable that takes its values from \(q_{m}\): \(Q_{m}\left( t \right) \in q_{m}\).

Let the ordered set \(p_{m} = \left\{ p_{m1}, p_{m2}, \ldots, p_{mi}, \ldots, p_{mk_{m}} \right\}\) associate a probability with each performance rate of the system module m, where \(p_{mi} = \Pr\left\{ Q_{m} = q_{mi} \right\}\).

The mapping \(q_{mi} \to p_{mi}\) is called the probability mass function (pmf) [17].

The random performance [18] of each module m, expressed as a polynomial, is termed the module’s UGF, \(u_{m}(z)\):

$$u_{m} \left( z \right) = \sum\limits_{i = 1}^{k_{m}} p_{mi} z^{q_{mi}} ,\quad \text{where } m = 1, 2, \ldots, n.$$
(1)
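As an illustration of Eq. (1), a module's u function can be held as a mapping from performance rate to probability. The following is a minimal sketch, not the authors' implementation: the helper names `make_ugf` and `expected_performance` and the three-state numbers are hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def make_ugf(rates, probs, tol=1e-9):
    """Build a u function as {performance rate q_mi: probability p_mi}.

    Hypothetical helper: each (q, p) pair stands for one term p * z**q.
    """
    if len(rates) != len(probs):
        raise ValueError("rates and probs must have the same length")
    if abs(sum(probs) - 1.0) > tol:
        raise ValueError("state probabilities must sum to 1")
    u = defaultdict(float)
    for q, p in zip(rates, probs):
        u[q] += p  # terms with the same exponent are collected
    return dict(u)


def expected_performance(u):
    """Mean performance rate, i.e. the derivative of u(z) evaluated at z = 1."""
    return sum(q * p for q, p in u.items())


# Illustrative three-state module (assumed numbers, not from the paper).
u_m = make_ugf(rates=[0, 50, 100], probs=[0.1, 0.2, 0.7])
mean_rate = expected_performance(u_m)
```

Storing exponents as dictionary keys means that like powers of z are merged automatically, which is what keeps the polynomial compact when u functions are later combined.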

Similarly, the performance rates of all n system modules have to be determined. At each instant t ≥ 0, all the system modules have performance rates corresponding to their states. The UGF of the MS, denoted \(U_{s}(z)\), can be obtained from the interconnection of the modules given by the system architecture. The random performance of the system as a whole at an instant t ≥ 0 depends on the performance states of its modules. The UGF technique specifies an algebraic procedure to calculate the performance distribution of the entire MS,

$$U_{s} \left( z \right) = f\left\{ {u_{m1} \left( z \right),u_{m2} \left( z \right), \ldots ,u_{mn} \left( z \right) } \right\},$$
(2)
$$U_{s} \left( z \right) = \nabla_{\phi} \left\{ {u_{m1} \left( z \right),u_{m2} \left( z \right), \ldots ,u_{mn} \left( z \right) } \right\}$$
(3)

where ∇ is the composition operator and φ is the system structure function. To obtain the performance distribution of the complete system with an arbitrary structure function φ, the composition operator ∇ is applied across the individual u functions of the n system modules [17].

\(U_{s}(z)\) is the u function representation of the performance distribution of the whole MS software system. The composition operator ∇ determines the u function of the whole system by performing numerical operations on the individual u functions of the system modules. The structure function φ(·) in the composition operator ∇ expresses the performance rate of the complete system in terms of the individual performance rates of its modules. The structure function φ(·) depends upon the system architecture and the nature of the interaction among system modules.
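The composition operator can be sketched for two modules as follows. This is a hedged illustration, assuming the modules are statistically independent and assuming two commonly used structure functions that the text does not fix: `min` for a series connection and addition for a parallel connection. The function name `compose` and the numbers are hypothetical.

```python
from collections import defaultdict
from itertools import product


def compose(u_a, u_b, phi):
    """Combine two u functions, each given as {rate: probability}.

    Every pair of states contributes the product of its probabilities
    (independence is assumed) at the combined rate phi(q_a, q_b).
    """
    out = defaultdict(float)
    for (q_a, p_a), (q_b, p_b) in product(u_a.items(), u_b.items()):
        out[phi(q_a, q_b)] += p_a * p_b
    return dict(out)


# Two illustrative two-state modules (assumed numbers).
u_a = {0: 0.1, 100: 0.9}
u_b = {0: 0.2, 80: 0.8}

# Assumed structure functions: the weaker module limits a series pair,
# while the rates of a parallel pair add up.
series = compose(u_a, u_b, min)
parallel = compose(u_a, u_b, lambda q_a, q_b: q_a + q_b)
```

Passing φ as an ordinary function keeps the operator independent of the architecture: a different interaction between modules only requires a different `phi`.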

Reliability is the continuity of expected service [4], and it can be measured quantitatively as failures over time. The UGF technique can be used to estimate the software reliability of a system as a whole consisting of n modules. Each module performs a sub-function, and the combined execution of all modules performs a major function [17].

An assumption of the UGF technique is that the performance rates of the system modules are mutually independent.

3 Proposed Approach

Error Propagation (EP) is defined as the condition where an error (or failure) propagates across system modules [19]. Our approach focuses on quantifying the propagation of errors between modules in safety critical software systems.

The analysis proposed in this research comprises four stages, which are explained through a framework. The framework, shown in Fig. 2, follows a bottom-up approach to assessing the performance distribution of an SCSS. In the first step, we arrive at the performance distribution of each system module (\(PD_{MOD}\)) using the u function.

Fig. 2 Error propagation and failure analysis

In the second step, the probability of error propagation in a module (\(PD_{MOD} + SM_{EP}\)) is quantified. In the third step, the performance distribution of the subsystems (\(PD_{SS} + SM_{EP}\)) is obtained through the composition operator. In the final step, failure prediction is achieved through recursive operations that quantify the error propagation throughout the system (\(PD_{SYS} + SM_{EP}\)).

During software development, this framework helps to estimate the probability of error propagation and thereby to identify the error-prone areas.

4 Error Propagation and Failure Analysis

The error propagation and failure analysis model is a conceptual framework for analyzing the occurrence of error propagation in an SCSS. The system considered is broken down into subsystems, and each subsystem in turn is subdivided into modules, also called elements.

A module is an atomic structure, which performs definite function(s) of a complex system.

4.1 Performance Distribution of System Module

The performance rate of a module can be measured in terms of levels of failure.

Let us assume that the performance rate of a module m with 0% failure is \(q_{m1}\), with 10% failure is \(q_{m2}\), with 30% failure is \(q_{m3}\), with 50% failure is \(q_{m4}\), and with 100% failure is \(q_{m5}\).

The state of each module m can be represented by a discrete random variable \(Q_{m}\) that takes its value from the set of performance rates,

$$Q_{m} \in \left\{ q_{m1}, q_{m2}, q_{m3}, q_{m4}, q_{m5} \right\}$$

The random performance of a module varies from perfect functioning state to complete failure state.

The probabilities associated with the different states (performance rates) of a module m at time t can be represented by the set \(p_{m} = \left\{ p_{m1}, p_{m2}, p_{m3}, p_{m4}, p_{m5} \right\}\), where \(p_{mh} = \Pr\left\{ Q_{m} = q_{mh} \right\}\).

The states of a module form a complete group of mutually exclusive events, so that

$$\sum\limits_{h = 1}^{5} p_{mh} = 1$$
(5)

The performance distribution of a module m (the pmf of the discrete random variable \(Q_{m}\)) can be defined as

$$u_{m} \left( z \right) = \sum\limits_{h = 1}^{5} p_{mh} z^{q_{mh}}$$
(6)
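As a worked instance of Eq. (6), suppose, purely for illustration, that the five failure levels correspond to performance rates \(q_{m1} = 100\), \(q_{m2} = 90\), \(q_{m3} = 70\), \(q_{m4} = 50\), \(q_{m5} = 0\) (percentage of nominal performance) and that the state probabilities are \(0.80, 0.10, 0.05, 0.03, 0.02\). These values are assumptions and are not taken from a measured system. The u function then reads

$$u_{m} \left( z \right) = 0.80\,z^{100} + 0.10\,z^{90} + 0.05\,z^{70} + 0.03\,z^{50} + 0.02\,z^{0}$$

The coefficients sum to 1, as Eq. (5) requires, and the mean performance rate is \(u_{m}^{\prime}(1) = 80 + 9 + 3.5 + 1.5 + 0 = 94\).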

The performance distribution of any pair of system modules l and m, connected in series or parallel [18], can be determined by,

$$u_{l} \left( z \right) \nabla u_{m} \left( z \right) = \sum\limits_{h = 1}^{5} p_{lh} z^{q_{lh}} \, \nabla \sum\limits_{h = 1}^{5} p_{mh} z^{q_{mh}}$$
(7)

The composition operator ∇ determines the u function for two modules, using the structure function φ, according to whether they are connected in series or in parallel. Equation (7) quantifies the performance distribution of a combination of modules. Levitin et al. [9] have demonstrated how to determine the performance distribution when the modules are connected in series or in parallel. A failure in a module may potentially impact other modules through error propagation.
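Written out term by term, the composition in Eq. (7) pairs every state of module l with every state of module m. Under the independence assumption of the UGF technique, this takes the standard form

$$u_{l} \left( z \right) \nabla_{\phi} u_{m} \left( z \right) = \sum\limits_{h = 1}^{5} \sum\limits_{j = 1}^{5} p_{lh}\, p_{mj}\, z^{\phi \left( q_{lh} , q_{mj} \right)}$$

As a small illustration with assumed numbers and only two states per module, take \(u_{l}(z) = 0.7\,z^{100} + 0.3\,z^{50}\) and \(u_{m}(z) = 0.6\,z^{100} + 0.4\,z^{0}\), and assume a series connection with \(\phi(a, b) = \min(a, b)\), which is one common choice rather than something fixed by the text. Then

$$u_{l} \left( z \right) \nabla_{\phi} u_{m} \left( z \right) = 0.42\,z^{100} + 0.18\,z^{50} + \left( 0.28 + 0.12 \right) z^{0} = 0.42\,z^{100} + 0.18\,z^{50} + 0.40\,z^{0}$$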

4.2 Formulation of Safety Metric SMEP

The probability of occurrence of EP in a module can be defined by introducing a new state in that module [9]. Assume that state 0 of each module corresponds to the EP originating from this module [8]. Equation (6) can then be rewritten as

$$u_{m} \left( z \right)_{ep} = p_{m0} z^{q_{m0}} + \sum\limits_{h = 1}^{5} p_{mh} z^{q_{mh}}$$
(8)
$$u_{m} \left( z \right)_{ep} = p_{m0} z^{q_{m0}} + u_{m} \left( z \right)$$
(9)

where \(p_{m0}\) is the probability of the error propagation state and \(q_{m0}\) is the performance of the module in state 0.

\(u_{m} \left( z \right)\) represents all states except the error propagation state, as given in Eq. (6).

The term \(p_{m0} z^{q_{m0}}\), which is the contribution of state 0 (the error propagation state) to the performance distribution of module m, is termed the safety metric SMEP. This metric depends upon the probability of the module performance with respect to EP, on whether a failed module can cause EP, and on whether a module can be influenced by EP. The safety metric SMEP of each module carries a weight based on the probability of propagating an error. If the module does not propagate any error, the corresponding state probability is set to zero [9].

$$p_{m0} = 0$$
(10)

By substituting Eq. (10) in Eq. (8), the SMEP is quantified as zero. Therefore Eq. (8) becomes

$$u_{m} \left( z \right)_{ep} = \mathop \sum \limits_{h = 1}^{5} p_{mh} z^{{q_{mh} }}$$
(11)

For a module that has no error propagation property or state, \(u_{m} \left( z \right)_{ep} = u_{m} \left( z \right)\); that is, the u function in Eq. (11) reduces to Eq. (6).

If a failed module causes error propagation, then the performance of the module in that state of error propagation is

$$q_{m0} = \alpha$$
(12)

where α can be any of the performance rates \(q_{m1}\), \(q_{m2}\), \(q_{m3}\), \(q_{m4}\), or \(q_{m5}\). An operational module m that does not fail due to error propagation can be represented by the conditional pmf [9],

$${{u_{m } \left( z \right)}}_{ep} = \mathop \sum \limits_{h = 1}^{5} \frac{{p_{mh} }}{{1 - p_{m0} }}\,z^{{q_{mh} }}$$
(13)
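Equations (8) to (13) can be sketched numerically as follows. This is an illustrative sketch under stated assumptions, not the authors' procedure: it assumes that \(p_{m0}\) together with the five state probabilities sums to 1, it keeps state 0 as a separate term because \(q_{m0} = \alpha\) may coincide with another rate, and the function names and numbers are hypothetical.

```python
def ep_terms(p_m0, q_m0, rates, probs, tol=1e-9):
    """Terms (probability, rate) of Eq. (8); element 0 is the SMEP term."""
    if not 0.0 <= p_m0 < 1.0:
        raise ValueError("p_m0 must lie in [0, 1)")
    # Assumption: state 0 and states 1..5 together form a complete group.
    if abs(p_m0 + sum(probs) - 1.0) > tol:
        raise ValueError("p_m0 plus the state probabilities must sum to 1")
    return [(p_m0, q_m0)] + list(zip(probs, rates))


def conditional_pmf(p_m0, rates, probs):
    """Eq. (13): pmf over states 1..5 given that state 0 did not occur."""
    if not 0.0 <= p_m0 < 1.0:
        raise ValueError("p_m0 must lie in [0, 1)")
    return {q: p / (1.0 - p_m0) for q, p in zip(rates, probs)}


# Assumed example values; alpha = 0 is just one admissible choice in Eq. (12).
rates = (100, 90, 70, 50, 0)
probs = (0.76, 0.10, 0.05, 0.03, 0.01)
p_m0, q_m0 = 0.05, 0

terms = ep_terms(p_m0, q_m0, rates, probs)
sm_ep = terms[0]                        # the SMEP term (p_m0, q_m0)
cond = conditional_pmf(p_m0, rates, probs)

# With p_m0 = 0 (Eq. 10) the conditional pmf is the original pmf of Eq. (6).
no_ep = conditional_pmf(0.0, rates, (0.80, 0.10, 0.05, 0.03, 0.02))
```

Dividing by \(1 - p_{m0}\) renormalizes the remaining probabilities so that the conditional pmf again sums to 1.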

Here the module can be in any one of the five states defined in Eq. (6), and the state probabilities are renormalized by \(1 - p_{m0}\). The safety metric SMEP depends on the performance of each module in the multistate system. This metric helps to estimate the probability of EP reaching hazardous modules of the system and to identify the modules that should be protected with fault detection and fault recovery mechanisms.

Based on the above considerations, the conditional u function of each system module has to be estimated. Depending upon the subsystem architecture, the u function of each subsystem can be quantified by applying the composition operator ∇ with the structure function φ. A recursive approach is then used to obtain the u function of the entire safety critical software system, which will be elaborated in subsequent work.
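Since the recursive procedure is left to subsequent work, the following is only a generic sketch of how module u functions might be folded bottom-up over a nested series/parallel structure. It assumes independent modules, assumes `min` and addition as the series and parallel structure functions, and does not model the error propagation state; the tree encoding, the name `system_ugf`, and the numbers are all hypothetical.

```python
from collections import defaultdict
from functools import reduce
from itertools import product


def compose(u_a, u_b, phi):
    """Pairwise composition of two u functions given as {rate: probability}."""
    out = defaultdict(float)
    for (q_a, p_a), (q_b, p_b) in product(u_a.items(), u_b.items()):
        out[phi(q_a, q_b)] += p_a * p_b
    return dict(out)


# Assumed structure functions for the two connection types.
PHI = {"series": min, "parallel": lambda q_a, q_b: q_a + q_b}


def system_ugf(node):
    """Return the u function of a node.

    A node is either a module (a dict) or a tuple (kind, children),
    where kind is "series" or "parallel" and children are nodes.
    """
    if isinstance(node, dict):
        return node
    kind, children = node
    child_ugfs = [system_ugf(child) for child in children]
    return reduce(lambda a, b: compose(a, b, PHI[kind]), child_ugfs)


# Illustrative system: two parallel modules feeding a third in series.
m1 = {100: 0.9, 0: 0.1}
m2 = {100: 0.8, 0: 0.2}
m3 = {50: 0.7, 0: 0.3}
u_sys = system_ugf(("series", [("parallel", [m1, m2]), m3]))
```

Each subsystem is reduced to a single u function before it is combined with its siblings, which mirrors the module, subsystem, system order of the framework.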

5 Conclusions

This paper proposes a new framework for analyzing the failure of multistate safety critical software with respect to error propagation and arrives at a new metric called the safety metric SMEP. The proposed metric is intended to be the key element in the failure analysis of real-time safety critical systems. It can be applied to find the failure probability of each module, to trace the migration of error propagation from the module level to the subsystem level and then to the system level, to identify the most critical module in the whole safety critical software system, and to assess the impact of error propagation on the performance of the SCSS. Our future work will apply the safety metric SMEP to relevant real-time SCSS for failure analysis.