
1 Introduction

In most countries, the current approach to ensuring safe and competitive operation of industrial plants, such as nuclear facilities, is largely based upon deterministic criteria using multiple layers of Defence-in-depth (DiD). Design basis accidents (DBAs) are then defined, and safety systems are incorporated into the design to respond to these accidents. In general, risk methods are not explicitly considered in the regulatory process, although the selection of DBAs and their inclusion in Safety Analysis Reports implicitly include consideration of their risk potential (IAEA 2009).

As a result of the nuclear accidents at Three Mile Island, Chernobyl, and Fukushima, many countries have incorporated additional steps into the licensing processes of Nuclear Power Plants (NPPs) in order to control accident risks. Lessons learned included recommendations to improve plant systems, resources, and operator training for effective responses to severe accidents (IAEA 2012). Probabilistic Risk Assessment (PRA) is used in the nuclear industry in the United States and in many other countries for analyzing beyond-design-basis accidents, such as the Fukushima event. Sometimes named Probabilistic Safety Assessment (PSA), this approach is useful for improving safety, reliability and availability in the design and operation of NPPs (NAS & USNRC 2014).

Although risk assessment is an integral part of evaluating NPP safety, the main strategy for designing and regulating such facilities remains the DiD philosophy. This involves the use of multiple redundant systems for preventing and mitigating component and human failures. In addition, Human Reliability Analysis (HRA) is typically performed as part of these PRAs (or PSAs) to quantify the likelihood of errors of omission and commission, as well as failures in recovery actions.

The United States is an example of a country where many applications of PRA to regulatory issues have been carried out. Both the U.S. Nuclear Regulatory Commission (USNRC) and the regulated industry have made significant advances in the development and application of risk-based technology (USNRC 2011). Overall, there is clear evidence in all countries that PRA methods have become an important part of the safety, reliability, and risk management processes in support of regulation. These issues, however, are normally treated individually, without systematically considering the human factors that have a significant impact on operational effectiveness and on risk assessment and management (Cox and Tait 1998).

On the other hand, the use of common tools in the analysis of each of these subjects, such as Fault Tree Analysis (FTA), Reliability Block Diagram (RBD), and Event Tree Analysis (ETA), is a clear indication that an integrated evaluation is feasible (USNRC 2001). This integrated approach is also particularly important when implementing Quality, Safety, Health, and Environment Integrated Management Systems following the ISO 9001, BS 8800, OHSAS 18001, and ISO 14001 standards. Such systems cannot assure legal compliance, but if they are effective, they can help organizations to know their compliance status better, so that preventive and corrective actions can be efficiently implemented (Vasconcelos et al. 2009).

This chapter proposes an integration of safety, reliability, risk management and human factors issues based on the application of systems theory. Section 2 presents the main terminology and concepts related to safety assessment, risk management, reliability engineering, human factors and ergonomics. Section 3 presents an overview of the integrated framework based on systems theory. Section 4 briefly describes the common tools used in the integrated analysis, such as Fault Tree Analysis (FTA), Reliability Block Diagram (RBD), Event Tree Analysis (ETA), and the Technique for Human Error Rate Prediction (THERP), including their mathematical and statistical bases. Section 5 presents a simple representative example to illustrate the benefits of the integrated engineering approach to safety, reliability, risk management and human factors for a generic Loss of Coolant Accident (LOCA) in a Nuclear Power Plant. Finally, the conclusions about the integrated framework and a summary of the application example are presented in Sect. 6.

2 Terminology and Concepts

This section presents the main concepts and terminology adopted in this chapter within an integrated engineering approach to safety, reliability, risk management and human factors.

2.1 Safety Assessment

ALARP

“As Low as Reasonably Practicable” is a principle usually applied, in areas such as radiation protection and chemical accident prevention, preparedness and response, to risks that fall below a defined level of “intolerable” risk. This principle recognizes that not all risk can be eliminated; there will always be a residual risk of an accident, since it may not be practicable to take further actions to reduce the risk or to identify the potential accidents (HSE 2017). The associated term used in Nuclear Regulatory Commission (NRC) standards is ALARA (As Low As Reasonably Achievable). ALARA means making every reasonable effort to maintain exposure to ionizing radiation as far below the dose limits as practical, consistent with the purpose for which the licensed activity is undertaken, taking into account the state of technology, economic factors, and the public interest (USNRC 2017).

Safety Assessment

Safety can be seen as a practical certainty that adverse effects will not result from exposure to an agent under defined circumstances (Christensen et al. 2003). Safety assessment is therefore a systematic process that is carried out throughout the design process (and throughout the lifetime of the facility or the activity) to ensure that all the relevant safety requirements are met by the proposed (or actual) design. Safety assessment includes the formal safety analysis, i.e., it includes the evaluation of the potential hazards associated with the operation of a facility or the conduct of an activity (IAEA 2016a, b).

Defence-in-depth (DID)

It is an established safety philosophy in which multiple lines of defence and safety margins are applied to the design, operation, and regulation of plants to assure that public health and safety are adequately protected. The NRC defines DID as a safety philosophy that employs successive compensatory measures to prevent accidents or lessen the effects of damage if a malfunction or accident occurs. This philosophy ensures that the public is adequately protected and that emergency plans surrounding a nuclear facility are well conceived and will work. Moreover, it ensures that safety will not be wholly dependent on any single element of the design, construction, maintenance, or operation of a nuclear facility (ANS 2016).

Design Basis Accidents (DBA)

Design-basis accidents are postulated accidents that are used to set design criteria and limits for the design and sizing of safety-related systems and components. When developing a nuclear power plant, DBAs are selected to ensure that the plant can withstand and recover from these accidents (USNRC 2013).

Deterministic Safety Analysis

It is the engineering analysis of the plant response, using validated models, calculations and data that predict the transient response of the plant to an event sequence. It typically uses conservative estimates, safety margins and DBAs, and it is based on expert judgement and knowledge of the phenomena being modelled (ANS 2016).

Probabilistic Safety Assessment (PSA), also referred to as Probabilistic Risk Analysis (PRA)

PSA or PRA is a qualitative and quantitative assessment of the risk associated with plant operation and maintenance, measured in terms of the frequency of occurrence of risk metrics, such as core damage or a radioactive material release and its effects on the health of the public in the case of an NPP (ANS 2016).

2.2 Risk Management

Risk

There are many different definitions of risk. In the scope of this chapter, risk is a comprehensive set of event sequences, a quantitative assessment of the event sequence frequencies and their consequences, and an evaluation of the uncertainties in the assessments (Christensen et al. 2003; WHO 2004; ANS 2016). Mathematically this can be expressed as a product of frequency of occurrence and severity, as shown in Eq. 1 (USNRC 1975).

$$ Risk\left[\frac{consequence}{time}\right] = frequency\left[\frac{event}{time}\right] \times severity\left[\frac{consequence}{event}\right] $$
(1)
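As a minimal numerical illustration of Eq. 1 (a sketch with hypothetical placeholder values, not plant data), the product of an assumed event frequency and an assumed severity gives the risk contribution of a single scenario:

```python
# Sketch of Eq. 1 with hypothetical placeholder values (not plant data).
frequency = 1.0e-4   # events per year (assumed)
severity = 100.0     # consequence units per event (assumed)

risk = frequency * severity  # consequence units per year
print(f"Risk = {risk:.1e} consequence units per year")
```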

Hazard

It is an event or a natural phenomenon that poses some risk to a facility. Internal hazards include events such as equipment failures, human failures, and flooding and fires internal to the plant. External hazards include events such as flooding and fires external to the plant, tornadoes, earthquakes, and aircraft crashes (Lees 2012).

Hazard Analysis

It is the determination of material, system, process, and plant characteristics that can produce undesirable consequences, followed by assessment of hazardous situations associated with a process or activity. Qualitative techniques are normally used to pinpoint weaknesses in design or operation of the facility that could lead to hazardous material releases. The hazard analysis examines the complete spectrum of potential events that could expose members of the public, facility workers, and the environment to hazardous materials (Lees 2012).

Risk Assessment

Risk assessment refers to the technical estimation of the nature and magnitude of a risk. It basically involves answering three questions: What can go wrong? How frequently does it happen? What are the consequences? Figure 1 illustrates the risk assessment process. Risk assessment is a process for measuring, qualitatively and quantitatively, the risks that a particular agent represents for a specific system or facility (Stamatelatos 2002).

Fig. 1 Illustration of a risk assessment process (adapted from Stamatelatos 2002)

Risk Management

It is the systematic application of management policies, procedures and practices to the tasks of establishing the context, identifying, analyzing, planning, implementing, controlling, communicating and documenting risks, in a way that enables organizations to minimize losses and maximize opportunities in a cost-effective way (Stamatelatos 2002; IAEA 2001). A risk management process is illustrated in Fig. 2.

Fig. 2 Illustration of risk management process

Risk management activities encompass the following steps:

  • Identify. States the risk in terms of conditions and consequences; captures the context of the risk, e.g., what, when, where, how, and why.

  • Analyze. Evaluates probability and severity; prioritizes risks and groups those that are similar or related.

  • Plan. Identifies techniques or strategies to manage the risk, including actions to mitigate, transfer or retain risks.

  • Implement. Carries out the chosen techniques or strategies.

  • Control. Analyzes results, decides how to proceed (re-plans, closes the risk, invokes contingency plans, continues tracking, etc.) and executes control decisions, providing feedback so that risk analysis is always updated.

  • Communicate and document. Essential risk status is to be documented and communicated on a regular basis to the entire team.

2.3 Reliability Engineering

A system is a collection of interrelated parts (components) that work together by way of some driving process. In this context, reliability is defined as the probability that an engineering system will perform its intended function satisfactorily, for its intended life, under specified environmental and operating conditions. Reliability is basically a design parameter and must be incorporated into the system at the design stage; it is therefore an inherent characteristic of the system, just as its capacity or performance is. To analyze and measure the reliability characteristics of a system, there must be a mathematical and logical model of the system that shows the functional relationships among all the components, the subsystems, and the overall system. The reliability of a system is a function of the reliabilities of its components. A system reliability model combines component reliability data through the use of techniques like block diagrams or fault trees, and requires a definition of all equipment failure and repair distributions and a statement of spare and repair strategies (IAEA 2016a).

Since component failure characteristics can be described by probability distributions, the system reliability is actually time-dependent (ReliaSoft 2015). Assuming an exponential life distribution, the reliability of component i as a function of time t, R_i(t), is:

$$ {R}_i(t)={e}^{-{\lambda}_it}, $$
(2)

where λ_i is the failure rate of component i.

The mean life (or Mean Time to Failure, MTTF) can be obtained by integrating the reliability function from zero to infinity:

$$ MTTF=\underset{0}{\overset{\infty }{\int }}{R}_i(t) dt=\underset{0}{\overset{\infty }{\int }}{e}^{-{\lambda}_it} dt=\frac{1}{\lambda_i}. $$
(3)

Since the reliability of a system is the probability that it will operate successfully up to a given time, these definitions need to be adapted, when dealing with repairable systems, to account for the renewal of systems and components. Repairable systems receive maintenance actions that restore system components when they fail. These actions change the overall makeup of the system.

Maintainability, M_i(t), is defined as the probability of performing a successful repair action within a given time t. In other words, maintainability measures the ease and speed with which a system can be restored to operational status after a failure occurs. In maintainability, the random variable is time-to-repair, in the same way that time-to-failure is the random variable in reliability (Mobley et al. 2008). As an example, consider a system whose repair times are distributed exponentially. Its maintainability is given by:

$$ {M}_i(t)=1-{e}^{-{\mu}_it}, $$
(4)

where μ_i is the repair rate of component i.

The Mean Time to Repair, MTTR, can be obtained by integrating the complement of the maintainability function, i.e., the probability that the repair has not yet been completed, from zero to infinity:

$$ MTTR=\underset{0}{\overset{\infty }{\int }}\left[1-{M}_i(t)\right] dt=\underset{0}{\overset{\infty }{\int }}{e}^{-{\mu}_it} dt=\frac{1}{\mu_i}. $$
(5)

If one considers both reliability (the probability that an item will not fail) and maintainability (the probability that an item is successfully restored after failure), then an additional metric is needed for the probability that a component or system is operational at a given time, i.e., that it has not failed or has been restored after failure. This metric is availability. Availability is a performance criterion for repairable systems that accounts for both the reliability and maintainability properties of a component or system. Availability, A(t), is defined as the probability that a system is operating properly when it is requested for use; in other words, it is the probability that a system is not failed or undergoing a repair action when it needs to be used. For a single component i with constant failure and repair rates, the steady-state availability A_i is given by:

$$ {A}_i=\frac{System\ up\ time}{System\ up\ time+ System\ down\ time}=\frac{MTTF}{MTTF+ MTTR}=\frac{\mu_i}{\lambda_i+{\mu}_i}. $$
(6)
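The sketch below ties Eqs. 2 to 6 together for a single component with exponentially distributed failure and repair times; the failure and repair rates are hypothetical placeholders chosen only to show the calculation.

```python
from math import exp

# Hypothetical rates for one component (assumed, for illustration only).
failure_rate = 1.0e-4   # lambda_i, failures per hour
repair_rate = 0.125     # mu_i, repairs per hour (mean repair time of 8 h)

def reliability(t):
    """Eq. 2: probability the component has not failed by time t."""
    return exp(-failure_rate * t)

def maintainability(t):
    """Eq. 4: probability a repair is completed within time t."""
    return 1.0 - exp(-repair_rate * t)

mttf = 1.0 / failure_rate            # Eq. 3
mttr = 1.0 / repair_rate             # Eq. 5
availability = mttf / (mttf + mttr)  # Eq. 6, steady state

print(f"R(1000 h) = {reliability(1000.0):.4f}")
print(f"M(24 h)   = {maintainability(24.0):.4f}")
print(f"MTTF = {mttf:.0f} h, MTTR = {mttr:.0f} h, A = {availability:.6f}")
```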

2.4 Human Factors and Ergonomics

Human Factors

It is a discipline concerned with the development and application of human-system interface technology to systems analysis, design and evaluation. This technology includes human-machine, human-task, human-environment, and organization-machine interfaces. Efforts in human factors engineering are directed at improving the operability, maintainability, usability, comfort, safety and health characteristics of systems, in order to improve human and system effectiveness and to reduce the potential for injury and error (Stanton et al. 2005).

Ergonomics

It is a term often used interchangeably with human factors that commonly refers to designing work environments to maximize safety and efficiency. Ergonomics has gained great importance because companies have learned that designing a safe work environment can also result in greater efficiency and productivity. Today, around the world, many laws require a safe work environment. Workplace design has a great impact on both safety and efficiency: the easier a job is to do, the more likely it is that productivity gains will follow from greater efficiency; likewise, the safer a job is to do, the more likely it is that productivity gains will follow from reduced time off for injury. Ergonomics can address both of these issues concurrently by optimizing the workspace, equipment and activities needed to do a job (Stanton et al. 2005).

Human Reliability Assessment (HRA)

HRA is a method that involves the systematic prediction of potential human errors when interacting with a system. Once such errors are identified, the method seeks to eliminate or reduce their occurrence, in order to maximize the safety and performance of a system or facility. Results of HRA can feed into risk management actions to reduce risk to ALARP, both through system re-design and through the implementation of controls and mitigations (USNRC 2005).

HRA, in general, encompasses the identification of error types, the likelihood of error occurrence, the opportunities to recover from errors and the consequences of errors. This method should analyze the current design and recommend how to mitigate the errors identified. Many reliability and risk analysis tools, such as FTA and ETA, can support the HRA steps. There are also many HRA-specific techniques, like THERP (Technique for Human Error Rate Prediction), SHERPA (Systematic Human Error Reduction and Prediction Approach), HEART (Human Error Assessment and Reduction Technique), CREAM (Cognitive Reliability and Error Analysis Method) and ATHEANA (A Technique for Human Event Analysis) (Calixto 2013). THERP will be briefly discussed in Sect. 4.4.

3 Integrated Framework for Assessing Safety, Reliability, Risk and Human Factors

Management systems in complex facilities like Nuclear Power Plants encompass several areas, such as Quality (ISO 9001 standards), Environment (ISO 14001 standards), and Safety, Health and Risk Assessment (BS 8800 and OHSAS 18001 standards). Such management systems are often treated as independent functions within organizations. However, the corresponding elements of these management systems are compatible, and it is feasible to integrate them. An integrated management and a systemic approach, i.e. an approach relating to the system as a whole in which the interactions among technical, human and organizational factors are fully considered, are essential to the specification and application of adequate safety measures and the fostering of a safety culture (IAEA 2016b).

3.1 Systems Theory

To understand complex systems, scientists usually try to envisage natural phenomena and processes as simplified versions of reality known as systems. As defined above, a system can be envisaged as a collection of interrelated parts that work together by way of some driving process. Systems can be visualized as component blocks that have connections between them and can be modeled using tools like block diagrams, facilitating, for instance, evaluations of safety and reliability (ReliaSoft 2015).

Most systems share common characteristics, which include the following (Cox and Tait 1998):

  • Systems have a structure defined by their parts and processes.

  • Systems are generalizations of reality.

  • Systems tend to function in the same way: they involve inputs of material (energy or matter) that is then processed, causing it to change in some way, and corresponding outputs.

  • Various parts of a system have functional as well as structural relationships between each other.

  • The fact of having functional relationships between parts suggests flow and transfer of some type of energy or matter.

  • Systems often exchange energy or matter beyond their defined boundary with outside environment, and other systems, through various input and output processes.

  • Functional relationships can only occur because of the presence of a driving force.

  • The parts that make up a system show some degree of integration; in other words, the parts work well together.

Within the boundary of a system, three kinds of properties can be found:

  • Elements—kinds of parts (things or substances) that make up a system. These parts may be hardware, software, raw materials, and persons, for instance.

  • Attributes—characteristics of elements that may be perceived and measured. Examples: production, reliability, safety, and availability.

  • Relationships—associations that occur between elements and attributes. These associations are based on cause and effect. In an organizational system, for example, there is a close relationship between human factors and production, safety and availability.

The state of a system is defined by the value of its properties (elements, attributes, and/or relationships).

3.2 Overview of Human Factors Integration

Figure 3 can be used to support the definition of the objective of the integrated analysis. It shows an overview of the possibilities of integration among human factors (ergonomics), the life-cycle step of the project (design, implantation, operation or decommissioning), the target (quality; occupational health and safety; or environmental management), and the focus of analysis (safety, reliability, or risk).

Fig. 3 Overview of framework for human factors integration (Vasconcelos et al. 2009)

3.3 Integrated Framework

Figure 4 shows the steps of the proposed methodology for the integration of safety, reliability, risk management and human factors.

Fig. 4 Proposed methodology for safety, reliability, risk management and human factors integration (Vasconcelos et al. 2009)

The identification of a system to be analyzed is carried out with the aid of systems theory. Figure 5 illustrates a systemic model of an organization adapted to an industrial facility (Cox and Tait 1998). The first box represents inputs into the system and includes physical, human and financial resources, as well as services and knowledge. The transformation process integrates the plant (hardware), human resources (liveware) and policies, procedures, rules, and processes (software). The box at the right represents outputs; depending on the target of the analysis, elements of quality, occupational health and safety, or environmental management can be selected.

Fig. 5 Systemic model of an organization (adapted from Cox and Tait 1998)

The framework for human factors integration outlined in Fig. 3, through the intersection of the human factors (ergonomics) arrow with the characteristics in focus (safety, reliability, or risk) or their combinations, is further detailed in Fig. 6. This figure illustrates some identified pertinent safety and reliability items, as well as common pertinent human factors issues. In this way, the systems to be analyzed are identified systematically under all focus combinations, within the selected life-cycle step and the required target (EUROCONTROL 2004). For each selected focus, applicable principles and criteria are chosen (examples in Table 1). The human factors to be considered in the analysis are grouped into six areas, in order to ensure that all issues are considered and can be adequately prioritized. The six human factors areas and some example issues within each one are shown in Table 2. The integrated analysis can be carried out using the common tools described in Sect. 4 of this chapter.

Fig. 6 Examples of pertinent items and Human Factors within an integrated safety and reliability focus (adapted from EUROCONTROL 2004)

Table 1 Examples of design and analysis principles and criteria applied to safety, reliability, risk and human factors (adapted from Vasconcelos et al. 2009)
Table 2 Six human factors areas and examples of human factors issues (adapted from EUROCONTROL 2004)

4 Applied Models and Methods

Event Tree Analysis (ETA), Fault Tree Analysis (FTA), Reliability Block Diagrams (RBD), and the Technique for Human Error Rate Prediction (THERP) are examples of common safety, reliability, and risk evaluation tools that can support the team, within the proposed integrated framework, in analyzing process systems and identifying potential accidents.

4.1 Event Tree Analysis (ETA)

The modeling of accident scenarios within a risk assessment process proceeds with inductive logic and probabilistic tools called Event Trees (ETs). An event tree starts with the initiating event and progresses through the scenario, a series of successes or failures of intermediate events (Defence-in-depth levels), until an end-state is reached (Stamatelatos 2002). Figure 7 illustrates an event tree for a generic initiating event and two levels of Defence-in-depth. Considering λ_ie as the frequency of occurrence of the initiating event, and p1 and p2 as the probabilities of failure of Defence-in-depth levels 1 and 2, respectively, the frequency of occurrence, F, of the four possible end states (no consequence, and accident scenarios 1, 2 and 3) can be calculated as shown in Fig. 7. Notice that these estimates are only valid if the events involved in each sequence are independent.

Fig. 7 Event Tree for a generic initiating event and two levels of Defence-in-depth
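As a sketch of the calculation indicated in Fig. 7, the snippet below combines an assumed initiating-event frequency with assumed, independent failure probabilities of the two Defence-in-depth levels; the numerical values are placeholders, and the mapping of branch combinations to scenarios 1 to 3 follows one common labelling convention and should be checked against the figure.

```python
# Hypothetical values (assumed for illustration only).
lambda_ie = 1.0e-2  # initiating-event frequency, events per year
p1 = 1.0e-2         # failure probability of Defence-in-depth level 1
p2 = 1.0e-3         # failure probability of Defence-in-depth level 2

# End-state frequencies, assuming independent events in each sequence.
sequences = {
    "no consequence (levels 1 and 2 succeed)": lambda_ie * (1 - p1) * (1 - p2),
    "accident scenario 1 (level 2 fails)":     lambda_ie * (1 - p1) * p2,
    "accident scenario 2 (level 1 fails)":     lambda_ie * p1 * (1 - p2),
    "accident scenario 3 (both levels fail)":  lambda_ie * p1 * p2,
}

for name, freq in sequences.items():
    print(f"{name}: {freq:.2e} per year")
```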

4.2 Fault Tree Analysis (FTA)

Fault Tree Analysis (FTA) is an analytical technique whereby an undesired state of a system is specified, usually a state that is critical from a safety or reliability standpoint. The system is then analyzed, in the context of its environment and operation, to find all realistic ways in which the undesired event (called the top event) can occur. The fault tree itself is a graphic model of the various parallel and sequential combinations of faults that will result in the occurrence of the top event. Faults can be events associated with component hardware failures, human errors, software errors, or any other pertinent events that can lead to the top event. A fault tree thus depicts the logical interrelationships of the basic events that lead to the top event of the fault tree (Stamatelatos 2002).

Both qualitative and quantitative evaluations can be performed with the help of the fault tree technique. The fault tree itself is a qualitative representation of events and shows the relationships that lead to the top event. In constructing the fault tree, significant insights and understanding are gained concerning the causes of the top event. Additional evaluations serve to further refine the information that the fault tree provides.

Qualitative evaluations basically transform a fault tree into logically equivalent forms that provide more focused information. The principal qualitative results obtained are the Minimal Cut Sets (MCSs) of the top event. A cut set is a combination of basic events that can cause the top event; an MCS is the smallest combination of basic events that results in the top event. MCSs relate the top event directly to its basic event causes, and the set of MCSs for the top event constitutes all the ways in which basic events can cause the top event. Because of the excessively large number of possible combinations of basic events, computer programs are necessary to identify MCSs. For instance, in a system with 100 basic events there are 100 possible cut sets containing a single basic event, 4,950 containing two basic events, 161,700 containing three, 3,921,225 containing four, and so on. A manual review of all these combinations to check whether they are MCSs is virtually impossible, so specialized computer programs are necessary to obtain the MCSs of more complex fault trees (ReliaSoft 2015).
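The cut-set counts quoted above can be checked with a few lines of Python (a sketch using the standard library only):

```python
from math import comb

# Number of candidate cut sets of size k from 100 basic events.
for k in (1, 2, 3, 4):
    print(f"candidate cut sets of size {k}: {comb(100, k):,}")
# Prints 100, 4,950, 161,700 and 3,921,225, illustrating why manual
# screening for minimal cut sets is impractical.
```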

Quantitative evaluations of a fault tree consist of determining the top event probability and the relative importance of the basic events. Uncertainties in any quantified result can also be determined. Fault trees are typically quantified by calculating the probability of each MCS and summing these probabilities, which is a good approximation when the events are independent and the MCS probabilities are small. Different types of probabilities can be calculated for different applications. In addition to the constant probability value that is typically calculated, time-related probabilities can be calculated, providing the probability distribution of the time of first occurrence of the top event. Occurrence rates and availabilities of top events can also be calculated. These characteristics are particularly applicable when the top event is a system failure. Two examples of fault trees, representing series and parallel systems respectively, are shown in Fig. 8.

Fig. 8 Fault trees representing series and parallel systems

The top event probability is calculated from a fault tree using the probabilities that are input for the basic events. Depending on the specific top event definition, the top event probability can be the probability of the top event occurring during a mission time or a given period of time, or the probability that the top event exists at a given point in time. In some cases, the top event probability can also be the frequency of the top event or the expected number of occurrences of the top event in some time interval; this only applies if the inputs are basic event frequencies or expected numbers of occurrences.

Using the set theory concepts (Stamatelatos 2002) the probability equations of the two fault trees in Fig. 8 can be expressed as:

$$ \mathrm{P}\left(\mathrm{A}\ \mathrm{or}\ \mathrm{B}\right)=\mathrm{P}\left(\mathrm{A}\cup \mathrm{B}\right)=\mathrm{P}\left(\mathrm{A}\right)+\mathrm{P}\left(\mathrm{B}\right)-\mathrm{P}\left(\mathrm{A}\cap \mathrm{B}\right), $$
(7)
$$ \mathrm{P}\left(\mathrm{A}\ \mathrm{and}\ \mathrm{B}\right)=\mathrm{P}\left(\mathrm{A}\cap \mathrm{B}\right)=\mathrm{P}\left(\mathrm{A}|\mathrm{B}\right)\ \mathrm{P}\left(\mathrm{B}\right)=\mathrm{P}\left(\mathrm{B}|\mathrm{A}\right)\ \mathrm{P}\left(\mathrm{A}\right), $$
(8)

where P(A) and P(B) are the probabilities of the basic events, and P(A|B) and P(B|A) are conditional probabilities. If events A and B are independent, Eqs. 7 and 8 become:

$$ \mathrm{P}\left(\mathrm{A}\ \mathrm{or}\ \mathrm{B}\right)=\mathrm{P}\left(\mathrm{A}\cup \mathrm{B}\right)=\mathrm{P}\left(\mathrm{A}\right)+\mathrm{P}\left(\mathrm{B}\right)-\mathrm{P}\left(\mathrm{A}\right)\ \mathrm{P}\left(\mathrm{B}\right), $$
(9)
$$ \mathrm{P}\left(\mathrm{A}\ \mathrm{and}\ \mathrm{B}\right)=\mathrm{P}\left(\mathrm{A}\cap \mathrm{B}\right)=\mathrm{P}\left(\mathrm{A}\right)\ \mathrm{P}\left(\mathrm{B}\right). $$
(10)
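A minimal sketch of Eqs. 9 and 10, evaluating the top-event probability of the OR (series) and AND (parallel) fault trees of Fig. 8 for two independent basic events with hypothetical probabilities:

```python
# Hypothetical basic-event probabilities (assumed, independent).
p_a = 1.0e-3  # P(A)
p_b = 2.0e-3  # P(B)

p_or = p_a + p_b - p_a * p_b   # Eq. 9: OR gate (series system)
p_and = p_a * p_b              # Eq. 10: AND gate (parallel system)

print(f"P(A or B)  = {p_or:.3e}")
print(f"P(A and B) = {p_and:.3e}")
```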

4.3 Reliability Block Diagrams (RBD)

An overall system reliability prediction can be made by looking at the reliabilities of the components that make up the whole system or product. A Reliability Block Diagram (RBD) is a graphical representation of the components of a system and how they are related or connected (ReliaSoft 2015). RBDs for series and parallel systems are shown in Fig. 9.

Fig. 9 Reliability block diagrams representing series and parallel systems

In a series configuration, the failure of any component results in the failure of the entire system. In most cases, when considering complete systems at their basic subsystem level, it is found that these are arranged reliability-wise in a series configuration. A failure of any of these subsystems will cause a system failure. In other words, all of the components in a series system must succeed for the system to succeed.

The reliability of a series system, R_s, is the probability that all n components in the system succeed. The reliability of the system is therefore given by:

$$ {R}_s=P\left({X}_1\cap {X}_2\cap \dots \cap {X}_n\right)=P\left({X}_1\right)P\left({X}_2|{X}_1\right)P\left({X}_3|{X}_1{X}_2\right)\dots P\left({X}_n|{X}_1{X}_2\dots {X}_{n-1}\right), $$
(11)

where X_i is the event of component i being operational, P(X_i) is the probability that component i is operational, and P(X_i | X_1 X_2 … X_{i−1}) is the conditional probability that component i is operational given that components 1 to i−1 are operational.

In the case where the failure of one component affects the failure rates of other components, the conditional probabilities in the equation above must be retained. In the case of independent components, however, the equation above becomes:

$$ {R}_s=P\left({X}_1\right)P\left({X}_2\right)P\left({X}_3\right)\dots P\left({X}_n\right)={R}_1{R}_2{R}_3\dots {R}_n, $$
(12)

where R_i is the reliability of component i.

In a parallel configuration, at least one of the components must succeed for the system to succeed. For this reason, components in parallel are also referred to as redundant components. Redundancy is a very important method of improving system design and reliability.

The probability of failure, or unreliability, Q_p, of a system with n parallel components is the probability that all components in the system fail. The unreliability of a parallel system is therefore given by:

$$ {Q}_p=P\left({x}_1\cap {x}_2\cap \dots \cap {x}_n\right)=P\left({x}_1\right)P\left({x}_2|{x}_1\right)P\left({x}_3|{x}_1{x}_2\right)\dots P\left({x}_n|{x}_1{x}_2\dots {x}_{n-1}\right), $$
(13)

where x_i is the event of failure of component i, P(x_i) is the failure probability of component i, and P(x_i | x_1 x_2 … x_{i−1}) is the conditional probability that component i fails given that components 1 to i−1 have failed.

In the case where the failure of one component affects the failure rates of other components, the conditional probabilities in the equation above must be retained. In the case of independent components, however, the equation above becomes:

$$ {Q}_p=P\left({x}_1\right)P\left({x}_2\right)P\left({x}_3\right)\dots P\left({x}_n\right)={Q}_1{Q}_2{Q}_3\dots {Q}_n. $$
(14)

The reliability of a parallel system, R_p, is then given by:

$$ {R}_p=1-{Q}_p=1-\left(1-{R}_1\right)\left(1-{R}_2\right)\left(1-{R}_3\right)\dots \left(1-{R}_n\right). $$
(15)
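The sketch below applies Eqs. 12 and 15 to three independent components with hypothetical reliabilities, showing how a series arrangement degrades and a parallel (redundant) arrangement improves the system reliability:

```python
from math import prod

reliabilities = [0.99, 0.97, 0.95]  # R_1, R_2, R_3 (assumed values)

r_series = prod(reliabilities)                       # Eq. 12
r_parallel = 1 - prod(1 - r for r in reliabilities)  # Eq. 15

print(f"Series reliability:   {r_series:.4f}")    # about 0.9123
print(f"Parallel reliability: {r_parallel:.6f}")  # about 0.999985
```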

4.4 Technique for Human Error Rate Prediction (THERP)

Technique for Human Error Rate Prediction (THERP) is the most structured, detailed, and widely used Human Reliability Analysis (HRA) method in PRAs for NPPs. Swain and Guttmann (1983) define THERP as a method to predict Human Error Probabilities (HEP) and to evaluate the degradation of a man-machine system. This degradation can be caused by human errors alone or in connection with equipment malfunctions, operational procedures and practices, or other system and human characteristics that influence system behavior.

THERP analysis encompasses the following steps (Calixto 2013):

  • Understanding the problem to be assessed to see if THERP is the best tool for finding the answer.

  • Understanding the human error context and how human tasks influence the activity or system being assessed.

  • Listing and analyzing the related human tasks.

  • Estimating error probabilities for each task using databases, expert opinion or literature data.

  • Estimating the final HEP for the whole activity using a THERP event tree.

  • Proposing recommendations to reduce HEP.

  • Estimating the effects of recommendations on HEP after they are implemented.

THERP depends heavily on a detailed and properly performed task analysis. Upon completion of the task analysis, Human Interaction (HI) is logically represented by an HRA event tree, which is used to combine HEPs associated with various HI tasks/subtasks, including cognitive response and action response. Figure 10 shows an example of an HRA event tree for an HI with two tasks A and B (Swain and Guttmann 1983).

Fig. 10 HRA event tree example for series or parallel system (adapted from Swain and Guttmann 1983)

As can be seen in Fig. 10, the probability of success, P(S), or failure, P(F), of a task is the sum of the probabilities of the respective sequences. In the THERP notation, lowercase letters (a, b) denote success probabilities, uppercase letters (A, B) denote failure probabilities, and terms such as (b|a) denote conditional probabilities, e.g., the probability of success in Task B given success in Task A. So, for the series system:

$$ \mathrm{P}\left(\mathrm{S}\right)=\mathrm{a}\left(\mathrm{b}|\mathrm{a}\right), $$
(16)
$$ \mathrm{P}\left(\mathrm{F}\right)=1-\mathrm{a}\left(\mathrm{b}|\mathrm{a}\right)=\mathrm{a}\left(\mathrm{B}|\mathrm{a}\right)+A\left(\mathrm{b}|\mathrm{A}\right)+\mathrm{A}\left(\mathrm{B}|\mathrm{A}\right). $$
(17)

For the parallel system:

$$ P\left(\mathrm{S}\right)=1-\mathrm{A}\left(\mathrm{B}|\mathrm{A}\right)=\mathrm{a}\left(\mathrm{b}|\mathrm{a}\right)+\mathrm{a}\left(\mathrm{B}|\mathrm{a}\right)+\mathrm{A}\left(\mathrm{b}|\mathrm{A}\right), $$
(18)
$$ P\left(\mathrm{F}\right)=\mathrm{A}\left(\mathrm{B}|\mathrm{A}\right). $$
(19)
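The sketch below evaluates Eqs. 16 to 19 for the HRA event tree of Fig. 10, assuming for simplicity that the two tasks are independent (so the conditional probabilities equal the unconditional ones); the HEP values are hypothetical placeholders.

```python
# Hypothetical human error probabilities (assumed, independent tasks).
hep_a = 0.003  # A: probability of failing Task A
hep_b = 0.005  # B: probability of failing Task B
a, b = 1 - hep_a, 1 - hep_b  # success probabilities

# Series HI: success requires success in both tasks (Eqs. 16 and 17).
p_s_series = a * b
p_f_series = a * hep_b + hep_a * b + hep_a * hep_b  # equals 1 - a * b

# Parallel HI: failure requires failure in both tasks (Eqs. 18 and 19).
p_f_parallel = hep_a * hep_b
p_s_parallel = 1 - p_f_parallel

print(f"Series:   P(S) = {p_s_series:.6f}, P(F) = {p_f_series:.6f}")
print(f"Parallel: P(S) = {p_s_parallel:.6f}, P(F) = {p_f_parallel:.6f}")
```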

Many dependency models have been developed to account for potential dependencies among multiple tasks or human interactions (Zhou et al. 2017; Su et al. 2015). In the model proposed by Swain and Guttmann (1983), the dependence between two HIs/tasks is broken into five levels, as shown in Table 3: Zero Dependence (ZD), Low Dependence (LD), Moderate Dependence (MD), High Dependence (HD), and Complete Dependence (CD). In Table 3, HEP_n is the HEP for Task n given zero dependence on Task n−1.

Table 3 Dependence model for Human Error Probability System (Swain and Guttmann 1983)
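Since Table 3 is not reproduced here in full, the sketch below implements the conditional-HEP expressions commonly attributed to the Swain and Guttmann (1983) dependence model, where hep_n is the zero-dependence HEP of Task n and the result is the HEP of Task n given failure of Task n−1; the expressions should be checked against Table 3 before use.

```python
# Conditional HEP of Task n given failure of Task n-1, using the
# dependence expressions commonly attributed to Swain and Guttmann (1983).
# Check against Table 3 before use; hep_n is the zero-dependence HEP.
DEPENDENCE_FORMULAS = {
    "ZD": lambda p: p,                  # zero dependence
    "LD": lambda p: (1 + 19 * p) / 20,  # low dependence
    "MD": lambda p: (1 + 6 * p) / 7,    # moderate dependence
    "HD": lambda p: (1 + p) / 2,        # high dependence
    "CD": lambda p: 1.0,                # complete dependence
}

def conditional_hep(hep_n: float, level: str) -> float:
    return DEPENDENCE_FORMULAS[level](hep_n)

# Example with a hypothetical nominal HEP of 0.003.
for level in ("ZD", "LD", "MD", "HD", "CD"):
    print(f"{level}: {conditional_hep(0.003, level):.4f}")
```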

Many authors consider the assessment of the dependence level in THERP highly subjective and dependent on a considerable amount of expert judgment. The criticisms also include the absence of specific guidance, which makes the THERP dependence method difficult to use, so that the results may lack traceability and repeatability (Su et al. 2015). Although the methodology does not consider the human performance-shaping factors that cause human error, which is a characteristic of the first generation of HRA methodologies, the longevity of THERP is a testament to its significance. THERP started the field of HRA, and newer methods can be seen as extensions of this pioneering work (Boring 2012).

5 Application Example

In this section, a simple representative example is presented in order to illustrate the benefits of the integrated engineering approach to safety, reliability, risk management and human factors for a generic Loss of Coolant Accident (LOCA) in a Nuclear Power Plant (NPP).

5.1 Objective of Analysis

The objective of the analysis is to improve the Non-destructive Inspection (NDI) process for pipe segments of the core cooling system of an NPP, reducing LOCA probability, increasing system reliability and managing risks by acting on human factors issues. The life-cycle focus of the analysis is the operation phase of the NPP.

5.2 System Identification

Figure 11 shows a simplified block diagram of a generic core cooling system (primary system) of an NPP and the pertinent safety and reliability items that act as DID levels in the case of a LOCA.

Fig. 11 Simplified block diagram of a primary circuit of an NPP

In this example, LOCA consequences are prevented or mitigated through the actuation of safety systems, flaw detection, leak detection, or maintenance and repair. Piping flaws can be identified using NDI techniques, like ultrasonic inspection. Maintenance and repair actions can prevent leaks and avoid accidents. If the NDI method fails, a leak will occur and could be detected by leak detection systems. The leak detection system can have the functions of initiating safety, maintenance and repair actions, thus mitigating the consequences.

5.3 Pertinent Safety, Reliability and Risk Items

The identified pertinent safety items are piping segments, safety instrumentation, flaw detection systems, leak monitoring systems, and engineering safety features (safety systems). Reliability items include piping segments, control instrumentation and engineering safety features. Common pertinent safety and reliability items include piping segments, safety instrumentation, and engineering safety features. Risk management of a generic LOCA in the primary circuit is the pertinent risk item.

5.4 Pertinent Human Factors Issues

Table 4 shows examples of human factors issues related to a generic LOCA analysis, taking into account the six human factors areas defined in Table 2.

Table 4 Examples of human factors issues related to generic LOCA

5.5 Pertinent Principles and Criteria Issues

According to the pertinent principles listed in Table 1, the following issues were identified for the selected application example.

The pertinent safety principle and criterion for the operation phase of the NPP is Defence-in-depth (DID) against LOCA. DID level 1 includes the use of operational experience, the planning of safety improvements, maintenance and training. DID level 2 includes In-Service Inspection (ISI) and leakage detection. Automatic and manual actuation of the safety systems are part of DID level 3. The pertinent reliability principles and criteria are the maintainability of piping segments and the redundancy of leakage detection and safety systems. LOCA is the Design Basis Accident (DBA) criterion considered in risk management, and some human factors should be considered in order to reduce risks. Workspace accessibility; the usability of the ISI, leakage detection, and maintenance and repair systems; and the cognitive ergonomics features of the related operating plans and procedures are some of the pertinent human factors criteria that can be cited.

5.6 Adequate Assessment Tools

In order to analyze safety, reliability, risk management and human factors for a generic LOCA in an NPP using the proposed integrated engineering approach, a set of tools should be selected.

The sequences of plant end-states after an initiating event involving flaw occurrence in a pipe segment of the primary circuit, together with the actuation of the DID levels, can be analyzed using the Event Tree Analysis (ETA) technique.

The occurrence of piping rupture (which can cause a LOCA or core damage) can be analyzed qualitatively or quantitatively using the Fault Tree Analysis (FTA) technique, which can involve hardware failures and human errors.

The evaluation of human errors occurring throughout the completion of selected complex tasks, such as NDI and the typical action sequence for inspection, can be carried out using THERP. Human performance issues can then be analyzed, and improvements to the NDI process for pipe segments of the NPP can be suggested. In this example, only a qualitative use of these tools is made.

5.7 Integrated Assessment

Considering as pertinent safety items the pipe segments, NDI, leakage detection and safety systems, an event tree with the initiating event “Flaw occurrence in a pipe segment” is shown in Fig. 12. This example is based on a previous work by Holmberg and Nirmark (2008) related to the risk-informed assessment of Defence-in-depth for a generic LOCA.

Fig. 12 Example of a simple event tree for LOCA (adapted from Holmberg and Nirmark 2008)

The following event sequences were considered. The occurrence of the initiating event can be avoided by DID level 1, i.e., the use of operational experience, the planning of safety improvements, maintenance and training. If the initiating event occurs, the flaw can be identified by In-Service Inspection (ISI) using NDI methods (DID level 2). If the NDI method fails, a leak will occur. This leak, assumed to be a small LOCA, can propagate to a large LOCA if the leakage detection system fails. Both the small and the large LOCA can lead to core damage if the safety systems fail. The leak detection system and the safety systems constitute the DID level 3 measures in this example.

A fault tree was constructed in order to evaluate qualitatively the likelihood of occurrence of piping rupture, taking into account the reliability of the ISI and leak detection systems. Considering “Piping failure” as the top event and classifying the piping states into four types, “success state”, “detectable flaw state”, “detectable leakage state” and “failure state”, a fault tree model can be constructed as shown in Fig. 13. The descriptions of the primary events of the fault tree are listed in Table 5. The parameters expressing the primary event rates in the fault tree depend on both historical generic component data and plant-specific data. Among the data necessary for estimating the primary event parameters, the following can be highlighted: the effectiveness rate of flaw inspection, the piping flaw probability, the piping rupture probability, the effectiveness rate of leakage detection, and the leakage occurrence rate. A qualitative evaluation of such a fault tree can be performed through the identification of the Minimal Cut Sets (MCSs), i.e., the minimal combinations of events that can cause the occurrence of the top event.

Fig. 13 Fault tree model for piping failure (adapted from Vasconcelos et al. 2016)

Table 5 Description of the primary events of fault tree of Fig. 13

In order to estimate the effectiveness rate of flaw inspection, a THERP event tree evaluating the likelihood of human errors occurring throughout the completion of the piping inspection task was constructed and is shown in Fig. 14.

Fig. 14 THERP event tree for evaluating the probability of human errors occurring throughout the completion of the piping inspection task

The tasks considered in this THERP are: define inspection strategy, select inspection technique, prepare equipment and procedures, acquire data, analyze data, record data, and report inspections (Parris 1988). The Human Error Probability (HEP) of the NDI task depends on the HEP of each action in the sequence; these actions are described as follows.

Define Inspection Strategy

To be effective, an inspection must be based on existing information about location, geometric profile, frequency of inspection, history, risks, etc., in order to define the inspection strategy. Critical human functions should be performed as automatically and remotely as possible, reducing radiation exposure and improving the results of inspections.

Select Inspection Technique

The selection of the most effective inspection technique for flaw detection involves considerations of geometry and material properties, as well as detailed procedures to be carefully followed.

Prepare Equipment and Procedures

Preparation involves calibration, equipment setup and tests, establishing team coordination, and following written and trained procedures.

Acquire Data

Acquiring data requires explicitly written and trained procedures, i.e., specific steps must be prescribed and followed invariably. Sometimes this is not possible due to the task complexity and the number of variables and conditions that must be addressed in ISI.

Analyze Data

Interpreting flaw data and discriminating them from other signals depend on the equipment setup, inspector skill and training, and accurate procedures.

Record Data and Report Inspections

In manual data recording and inspection reporting, data such as relevant parameters, defect indications and locations are collected and analyzed at the same time, increasing the possibility of error. Automatic data recording and analysis do not need to proceed simultaneously with data collection, reducing the HEP.

5.8 Improvements on Safety, Reliability and Risk Management

A quantitative assessment of safety, reliability and risk, including human factors, for complex tasks such as NDI is not easy to perform in this application example, because it depends on specific data and HEPs for each step of the THERP, which are usually unavailable. However, the qualitative integrated assessment illustrated in this application example can be helpful for understanding the human error context and identifying many improvements that can be made in human factors issues and, accordingly, in safety, reliability and risk management.

Among the improvements to the generic NDI process for pipe segments of the core cooling system of a Nuclear Power Plant (NPP), the following can be highlighted:

  • Definition of an optimal inspection strategy. There are different possible inspection strategies involving locations, techniques, frequencies, etc. A Risk-Based Inspection approach, for instance, prioritizing locations and higher-risk systems, should be considered (Soares et al. 2015).

  • Development of guidelines for operator-control interface design, and use of human factors principles and criteria in the design of new inspection systems. These guidelines can be used to design more effective systems and to reduce the time and expense required for inspection tasks (Stanton et al. 2005).

  • Reduction of the complexity of manual NDI. Manual NDI is typically too complex to produce reliable results, because many variables must be addressed in order to prepare and conduct inspections. In NPPs, the task is usually performed in radioactive areas, under time pressure and wearing protective clothing that hinders the work. Manual NDI should only be performed where accessibility limitations preclude automatic inspection (Parris 1988).

  • Application of human factors principles and criteria to the preparation of written instructions. NDI procedures usually offer many opportunities for human performance errors. Many inspections are, in general, similar, but differ in significant details. The principal means of countering error potential is to provide understandable, action-oriented instructions combined with labels on controls and indicators, taking into account ergonomic principles such as usability and accessibility. For example, instructions that emphasize graphics and decision tables rather than narrative presentation of information are less error-prone (Stanton et al. 2005).

  • Collection and analysis of NDI performance data. Many studies have shown that inspection accuracies are typically lower than expected. It is necessary to know what might be done to redesign the tasks or instrumentation to yield better results. Collected performance data should be interpreted and transformed into specific recommendations for task, instrumentation, and training improvements (Parris 1988).

  • Development of methods for feeding back information on the effectiveness of NDI. Task performance tends to deteriorate if feedback is lacking or inadequate. Complete, accurate, and timely information on task performance is one of the best ways to improve and sustain human performance in complex tasks such as NDI and to improve risk management (IAEA 2001).

6 Conclusions

This chapter proposes and applies a systematic methodology for integrated analysis of safety, reliability, risk and human factors. Interactions among technical, human and organizational factors can be fully considered by using systems theory.

The systematic approach directs the analysis, starting from the selection of the applicable life-cycle step and the required target (quality; occupational health and safety; or environmental management). The analyses of the attributes in focus (safety, reliability, or risk), or of their intersections, are carried out through the integration of human factors, which are selected, prioritized and analyzed considering applicable principles and criteria and using common applicable safety, reliability, and risk tools. Merging these various assessment and management systems could, among other benefits, reduce duplication of efforts and costs and increase the effectiveness of the management systems.

The main terminology and concepts related to safety assessment, risk management, reliability engineering, human factors and ergonomics were presented. Concepts of systems theory supporting the integrated framework for assessing safety, reliability, risk and human factors were also introduced. The mathematical and statistical bases for the assessment of reliability, unreliability, maintainability and availability were described. The systematization of the application of the methodology was driven by the use of figures and tables, helping to define the objectives of the analysis, to detail its steps, and to define the pertinent items, principles and criteria applied to safety, reliability and human factors.

The common tools used in the integrated analysis, such as Fault Tree Analysis (FTA), Reliability Block Diagram (RBD), Event Tree Analysis (ETA), and the Technique for Human Error Rate Prediction (THERP), including their mathematical and statistical bases, were briefly described. An event tree for a generic initiating event and two levels of Defence-in-depth was presented, showing the estimation of the frequency of occurrence of the possible accident scenarios as a function of the frequency of occurrence of the initiating event and the probabilities of failure of the Defence-in-depth levels. Concepts of Fault Trees and Reliability Block Diagrams were presented, including the theoretical basis for the qualitative and quantitative assessment of the likelihood of failure and the reliability of series and parallel systems. THERP was also presented through a Human Reliability Analysis event tree for series and parallel systems, illustrating how to estimate the probabilities of successful and unsuccessful performance of tasks.

Finally, a simple representative example was presented in order to illustrate the benefits of the integrated engineering approach to safety, reliability, risk management and human factors for a generic LOCA in a Nuclear Power Plant (NPP). The qualitative assessment demonstrated the benefits of using the proposed integrated approach. The application example illustrated an integrated assessment of safety, reliability and risks, including human factors, for the complex task of Non-destructive Inspection (NDI) of piping segments of the primary circuit of an NPP. A quantitative assessment of complex tasks such as the NDI involved in the application example is difficult to perform, because it depends on specific data and human error probabilities for each step of the developed THERP, which are usually unavailable. However, this qualitative integrated assessment was helpful for understanding the human error context and identifying many improvements that can be made in human factors issues and, consequently, in safety, reliability and risk management. Some generic improvements for the NDI process of piping segments of the primary circuit were then presented for the purpose of reducing LOCA probabilities.