
1 Defining Resilience

Resilience has emerged in the last decade as a concept for better understanding the performance of infrastructures, especially their behavior during and after the occurrence of disturbances, e.g. natural hazards or technical failures. Recently, resilience has grown as a proactive approach to enhance the ability of infrastructures to prevent damage before disturbance events, mitigate losses during the events and improve the recovery capability after the events, beyond the concept of pure prevention and hardening (Woods 2015).

The concept of resilience is still evolving and has been developing in various fields (Hosseini et al. 2016). The first definition described resilience as “a measure of the persistence of systems and of their ability to absorb change and disturbance and still maintain the same relationships between populations or state variables” (Holling 1973). Several domain-specific resilience definitions have been proposed (Ouyang et al. 2012) (Adger 2000) (Pant et al. 2014) (Francis and Bekera 2014). Further developments of this concept should include endogenous and exogenous events and recovery efforts. To include these factors, resilience is broadly defined as “the ability of a system to resist the effects of disruptive forces and to reduce performance deviations” (Nan et al. 2016). Recently, the AR6A resilience framework has been proposed based on eight generic system functions, i.e. attentiveness, robustness, resistance, re-stabilization, rebuilding, reconfiguration, remembering, and adaptiveness (Heinimann 2016).

Assessing and engineering systems resilience is emerging as a fundamental concern in risk research (Woods and Hollnagel 2006) (Haimes 2009) (McCarthy et al. 2007) (McDaniels et al. 2008) (Panteli and Mancarella 2015). Resilience adds a dynamical and proactive perspective into risk governance by focusing (i) on the evolution of system performance during undesired system conditions, and (ii) on surprises (“known unknowns” or “unknown unknowns”), i.e. disruptive events and operating regimes which were not considered likely design conditions. Resilience encompasses the concept of vulnerability (Johansson and Hassel 2010) (Kröger and Zio 2011) as a strategy to strengthen the system response and foster graceful degradation against a wide spectrum of known and unknown hazards. Moreover, it expands vulnerability in the direction of system reaction/adaptation and capability of recovering an adequate level of performance following the performance transient.

2 Need for Resilience in Critical Interdependent Infrastructures

Resilience calls for developing a strategy rather than performing an assessment. If on the one hand it is important to quantify and measure resilience in the context of risk management, it is even more important that the quantification effort enables the engineering of resilience into critical infrastructures (Guikema et al. 2015). Especially for emerging, not-well-understood hazards and “surprises” (Paté-Cornell 2012), resilience integrates very smoothly into risk management, and expediently focuses the perspective on the ex-ante system design process. Following this perspective, risk thinking becomes increasingly embedded into the system design process.

The application of resilience-building strategies looks particularly promising for critical interdependent infrastructures, also called systems-of-systems, because of the dynamical perspective of resilience, in which the system responds to the shock event, adapts and self-heals, and eventually recovers to a suitable level of performance. Such a perspective suits the characteristics of these complex systems well, i.e. (i) the coexistence of multiple time scales, from infrastructure evolution to real-time contingencies; (ii) multiple levels of interdependencies and lack of fixed boundaries, i.e. they are made of multiple layers (management, information & control, energy, physical infrastructure); (iii) a broad spectrum of hazards and threats; (iv) different types of physical flows, i.e. mass, information, power, vehicles; (v) the presence of organizational and human factors, which play a major role in severe accidents, highlighting the importance of assessing the performance of the social system together with the technical systems.

As a key system of interdependent infrastructures, the energy infrastructure is well suited to resilience engineering. In the context of security of supply and security of the operations, resilience encompasses the concept of flexibility in energy systems. Flexibility providers, i.e. hydro and gas-fired plants, cross-border exchanges, storage technologies, demand management, decentralized generation, ensure enough coping capacity, redundancy and diversity during supply shortages, uncertain fluctuating operating conditions and unforeseen contingencies (Roege et al. 2014) (Skea et al. 2011).

3 Quantifying Resilience

Resilience is defined and measured based on system performance, expressed through a measure of performance (MOP). The selection of the appropriate MOP depends on the specific service provided by the system under analysis.

The resilience definition can be further interpreted as the ability of the system to withstand a change or a disruptive event by reducing the initial negative impacts (absorptive capability), by adapting itself to them (adaptive capability) and by recovering from them (restorative capability). Enhancing any of these features will enhance system resilience. It is important to understand and quantify these capabilities that contribute to the characterization of system resilience (Fiksel 2003). Absorptive capability refers to an endogenous ability of the system to reduce the negative impacts caused by disruptive events and minimize consequences. In order to quantify this capability, robustness can be used, which is defined as the strength of the system to resist disruption. This capability can be enhanced by improving system redundancy, which provides an alternative way for the system to operate. Adaptive capability refers to an endogenous ability of the system to adapt to disruptive events through self-organization in order to minimize consequences. Emergency systems can be used to enhance adaptive capability. Restorative capability refers to an ability of the system to be repaired. The effects of adaptive and restorative capacities overlap and therefore, their combined effects on the system performance are quantified by rapidity and performance loss.

Resilience can be quantified through computational experiments in which disruptions are triggered, the system performance is analyzed (Fig. 6.1), and integrated resilience metrics are computed (Nan and Sansavini 2017). By repeating this process, different system design solutions can be ranked with respect to resilience. By the same token, resilience against various disruptions can be assessed, and resilience-improving strategies compared. For generality, it is assumed that the value of MOP is normalized between 0 and 1, where 0 is total loss of operation and 1 is the target MOP value in the steady phase. As illustrated in Fig. 6.1, the first phase is the original steady phase (\(t<t_d\)), in which the system performance assumes its target value. The second phase is the disruptive phase (\(t_d\le t<t_r\)), in which the system performance starts dropping until reaching the lowest level at time \(t_r\). During this phase, the system absorptive capability can be assessed by identifying appropriate measures. Robustness (or Resistance) (R) is a measure to assess this capability, which quantifies the minimum MOP value between \(t_d\) and \(t_{ns}\):

$$ R=\min \left\{ MOP(t)\right\}\quad \left(\text{for } t_d\le t\le t_{ns}\right) $$
(6.1)

where \(t_d\) represents the time at which the disruptive phase begins and \(t_{ns}\) represents the time at which the system reaches the new steady phase. This measure is able to identify the maximum impact of disruptive events; however, it is not sufficient to reflect the ability of the system to absorb the impact. Two additional complementary measures are further developed: Rapidity (\(RAPI_{DP}\)) and Performance Loss (\(PL_{DP}\)) during the disruptive phase. The measure Rapidity can be approximated by the average slope of the MOP function:

$$ RAP{I}_{DP}=\frac{MOP\left({t}_d\right)- MOP\left({t}_r\right)}{t_r-{t}_d} $$
(6.2)

Fig. 6.1 The “resilience curve”, i.e. the performance transient after disturbance, and its phases

To improve the accuracy of the estimation of RAPI, ramp detection is applied to quantify the average slope (Ferreira et al. 2013). According to (Kamath 2010) and (Zheng and Kusiak 2009), a ramp is assumed to occur if the rate of change of the measured value between the initial and final points of a time interval \(\Delta t\) is greater than a predefined ramping threshold value:

$$ \frac{MOP\left(t+\varDelta t\right)- MOP(t)}{\varDelta t}>\varDelta {X}_{ramp} $$
(6.3)

where \(\Delta X_{ramp}\) represents the predefined ramping threshold value. The system rapidity can then be calculated as the average of the slopes of the detected ramps:

$$ \mathrm{RAPI}=\frac{\left|{{\displaystyle \sum}}_{i=1}^K\frac{MOP\left({t}_i\right)- MOP\left({t}_i-\varDelta t\right)}{\varDelta t}\right|}{K} $$
(6.4)

where K represents the number of detected ramps and \(MOP(t_i)\) represents the MOP value at the i-th detected ramp. Compared to Eq. (6.2), this method better captures the speed of change in the system performance during the disruptive and recovery phases. According to this approach, the rapidity during the disruptive phase can be calculated as:

$$ {\mathrm{RAPI}}_{DP}=\frac{\left|\sum_{i=1}^{K_{DP}}\frac{MOP\left(t_i\right)- MOP\left(t_i-\Delta t\right)}{\Delta t}\right|}{K_{DP}}\quad \left(\text{for } t_d\le t_i<t_r\right) $$
(6.5)

where \(K_{DP}\) represents the number of detected ramps during the disruptive phase.
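The ramp-based estimate of Eqs. (6.3) to (6.5) can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation used in the cited works: the function names, the sample MOP values and the threshold are hypothetical, the MOP is assumed to be sampled at a uniform step \(\Delta t\), and the ramp test is applied to the absolute value of the slope so that performance drops are detected as well as increases.

```python
def ramp_slopes(mop, dt, threshold):
    """Slopes of all sampling intervals that qualify as ramps.

    Assumption: an interval is a ramp when the absolute value of its slope
    exceeds the threshold, so that drops are detected as well as rises.
    """
    slopes = []
    for previous, current in zip(mop, mop[1:]):
        slope = (current - previous) / dt
        if abs(slope) > threshold:
            slopes.append(slope)
    return slopes


def rapidity(mop, dt, threshold):
    """Absolute sum of the ramp slopes divided by the number of ramps (Eq. 6.4)."""
    slopes = ramp_slopes(mop, dt, threshold)
    if not slopes:
        return 0.0  # no ramp detected in this phase
    return abs(sum(slopes)) / len(slopes)


# Hypothetical MOP samples of a disruptive phase, from t_d to t_r, with dt = 1.
mop_dp = [1.0, 0.95, 0.6, 0.4]
rapi_dp = rapidity(mop_dp, dt=1.0, threshold=0.1)

# Plain average slope over the whole phase, as in Eq. (6.2), for comparison.
average_slope = (mop_dp[0] - mop_dp[-1]) / (1.0 * (len(mop_dp) - 1))
```

In this example the slow initial decline (slope of magnitude 0.05) stays below the threshold and is not counted as a ramp, so the ramp-based rapidity (about 0.275) is larger than the plain average slope (about 0.2).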

The performance loss in the disruptive phase (\(PL_{DP}\)), using the system illustrated in Fig. 6.1 as an example, can be quantified as the area of the region bounded by the MOP curve before and after occurrence of the negative effects caused by the disruptive events, i.e. between \(t_d\) and \(t_r\), which is referred to as the system impact area:

$$ P{L}_{DP}=\underset{t_d}{\overset{t_r}{{\displaystyle \int }}}\left( MOP\left({t}_o\right)- MOP(t)\right)dt $$
(6.6)

where \(t_o\) represents the time when the system is in the original steady phase. A new measure, i.e. the time averaged performance loss (TAPL), is introduced. Compared to PL, TAPL considers the time of appearance of negative effects due to disruptive events up to full system recovery and provides a time-independent indication of both adaptive and restorative capabilities as responses to the disruptive events. \(TAPL_{DP}\) in the disruptive phase (\(t_d\le t<t_r\)) can be calculated as:

$$ {\mathrm{TAPL}}_{DP}=\frac{{{\displaystyle \int}}_{t_d}^{t_r}\left( MOP\left({\mathrm{t}}_{\mathrm{o}}\right)- MOP(t)\right)dt}{t_r-{t}_d} $$
(6.7)
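As an illustration of Eqs. (6.1), (6.6) and (6.7), the sketch below evaluates robustness, performance loss and time averaged performance loss on a sampled MOP series. It assumes that the integral is approximated with the trapezoidal rule and that the phase boundaries are known as sample indices; the function names and the sample values are hypothetical.

```python
def trapezoid(y, x):
    """Trapezoidal approximation of the integral of y over x."""
    return sum((y[i] + y[i + 1]) * (x[i + 1] - x[i]) / 2.0
               for i in range(len(x) - 1))


def performance_loss(t, mop, i_start, i_end, mop_o):
    """Area between the original steady level mop_o and the MOP curve (Eq. 6.6)."""
    loss = [mop_o - value for value in mop[i_start:i_end + 1]]
    return trapezoid(loss, t[i_start:i_end + 1])


def time_averaged_performance_loss(t, mop, i_start, i_end, mop_o):
    """Performance loss divided by the duration of the phase (Eq. 6.7)."""
    duration = t[i_end] - t[i_start]
    return performance_loss(t, mop, i_start, i_end, mop_o) / duration


# Hypothetical normalized MOP series: steady until t_d = 2, minimum at t_r = 4,
# new steady level reached at t_ns = 9.
t = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
mop = [1.0, 1.0, 1.0, 0.7, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 0.9]
i_d, i_r, i_ns = 2, 4, 9  # sample indices of t_d, t_r and t_ns

robustness = min(mop[i_d:i_ns + 1])  # Eq. (6.1)
pl_dp = performance_loss(t, mop, i_d, i_r, mop_o=1.0)
tapl_dp = time_averaged_performance_loss(t, mop, i_d, i_r, mop_o=1.0)
```

For this series the robustness is 0.4, the performance loss in the disruptive phase is about 0.6 and, since the phase lasts two time units, the time averaged performance loss is about 0.3.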

The third phase is the recovery phase (\(t_r\le t<t_{ns}\)), in which the system performance increases until the new steady level. During this phase, the system adaptive and restorative capability can be assessed by developing appropriate measures: rapidity (\(RAPI_{RP}\)), performance loss (\(PL_{RP}\)) and time averaged performance loss (\(TAPL_{RP}\)).

$$ {\mathrm{RAPI}}_{RP}=\frac{\left|\sum_{i=1}^{K_{RP}}\frac{MOP\left(t_i\right)- MOP\left(t_i-\Delta t\right)}{\Delta t}\right|}{K_{RP}}\quad \left(\text{for } t_r\le t_i<t_{ns}\right) $$
(6.8)

where \(K_{RP}\) represents the number of detected ramps in the recovery phase.

$$ P{L}_{RP}=\int_{t_r}^{t_{ns}}\left( MOP\left(t_o\right)- MOP(t)\right)dt $$
(6.9)

$$ {\mathrm{TAPL}}_{RP}=\frac{\int_{t_r}^{t_{ns}}\left( MOP\left(t_o\right)- MOP(t)\right)dt}{t_{ns}-t_r} $$
(6.10)

The fourth phase is the new steady state (\(t\ge t_{ns}\)), in which system performance reaches and maintains a new steady level. As seen in Fig. 6.1, the newly attained steady level may equal the previous steady level or reach a lower level. It should be noted that the new steady state may even be at a higher level than the original one. In order to take this situation into consideration, a simple quantitative measure, Recovery Ability (RA), is developed:

$$ RA=\left|\frac{MOP\left({t}_{ns}\right)- MOP\left({t}_r\right)}{MOP\left({t}_o\right)- MOP\left({t}_r\right)}\right| $$
(6.11)

Different system phases and related system capabilities are summarized in Table 6.1.

Table 6.1 Summary of different resilience phases

3.1 The Integrated Resilience Metric

Although the measures introduced and discussed in Sect. 6.6.3 are useful in assessing the system behavior during and after disruptive events, an integrated metric that combines these capabilities is needed in order to assess system resilience with an overall perspective and to allow comparisons among different systems and system configurations. The basic idea of incorporating various resilience capacities into one metric has been proposed by Francis and Bekera to develop a resilience factor (Francis and Bekera 2014). The idea is also supported by (McDaniels et al. 2008). Therefore, the resilience metric (GR) is proposed, which integrates the previous measures:

$$ \begin{array}{l}GR=f\left(R, RAP{I}_{DP}, RAP{I}_{RP}, TAPL,RA\right)\hfill \\ {}=\kern0.5em R\times \left(\frac{RAP{I}_{RP}}{RAP{I}_{DP}}\right)\times {(TAPL)}^{-1}\times RA\hfill \end{array} $$
(6.12)

where \(TAPL_{DP}\) and \(TAPL_{RP}\) have been combined into one TAPL measure (\( \frac{\int_{t_d}^{t_{ns}}\left[ MOP\left(t_o\right)- MOP(t)\right]dt}{t_{ns}-t_d} \)) in order to incorporate the effects of total performance loss during the disruptive and recovery phases.

The functional form of the proposed resilience metric assumes that robustness R, recovery speed \(RAPI_{RP}\) and recovery ability RA have a positive effect on resilience, and, conversely, performance loss TAPL and loss speed \(RAPI_{DP}\) have a negative effect. To compile the integrated metric of Eq. (6.12), no weighting factor is assigned to the measures so that no bias is introduced, i.e. they contribute equally to resilience. GR is consistent with the definition proposed in Sect. 6.6.1:

  1. If the system is more capable of resisting a disruptive event or force (large R, small \(RAPI_{DP}\)), the system is more resilient (large GR).

  2. If the system is more capable of reducing the magnitude and duration of deviation of its performance level between original state and new steady state (small TAPL, large \(RAPI_{RP}\)), the system is more resilient (large GR).

  3. Additionally, GR also incorporates the possibility of improvement of the system performance after the occurrence of the disruptive event. If the new performance level is larger than the original (large RA), the system is more resilient (large GR).

GR is a non-negative metric and its value equals zero in the following relevant cases:

  1. System performance level drops to zero after the disturbance (R = 0).

  2. After the disturbance, system performance immediately drops to its lowest level (\(RAPI_{DP}\to \infty\), i.e. no absorptive capability).

  3. System performance never increases past the lower level, R, which is the new steady phase (\(RAPI_{RP}=0\), i.e. no adaptive and restorative capability).

GR is dimensionless and is most useful in a comparative manner, i.e. to compare the resilience of various systems to the same disruptive event, or to compare the resilience of the same system under different disruptive events. This approach of measuring system resilience is neither model nor domain specific. For instance, historical data can also be used for the resilience analysis. It only requires the time series that represents the system output during the whole time period. In this respect, the selection of the MOP is very important.
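A comparative use of GR on two time series can be sketched as follows. This is an illustrative example under stated assumptions rather than the procedure of the cited works: the phase boundaries are taken as known sample indices, every interval with a non-zero slope counts as a ramp (threshold of zero), integrals use the trapezoidal rule, the MOP at \(t_d\) is assumed to equal the original steady level, and all names and sample values are hypothetical.

```python
def trapezoid(y, x):
    """Trapezoidal approximation of the integral of y over x."""
    return sum((y[i] + y[i + 1]) * (x[i + 1] - x[i]) / 2.0
               for i in range(len(x) - 1))


def rapidity(t, mop, threshold=0.0):
    """Absolute sum of ramp slopes divided by the number of ramps (Eq. 6.4).

    Assumption: an interval is a ramp when |slope| exceeds the threshold.
    """
    slopes = [(mop[i + 1] - mop[i]) / (t[i + 1] - t[i])
              for i in range(len(t) - 1)]
    ramps = [s for s in slopes if abs(s) > threshold]
    return abs(sum(ramps)) / len(ramps) if ramps else 0.0


def integrated_resilience(t, mop, i_d, i_r, i_ns):
    """GR of Eq. (6.12); i_d, i_r, i_ns are the sample indices of t_d, t_r, t_ns."""
    mop_o = mop[i_d]  # assumed: still at the original steady level at t_d
    robustness = min(mop[i_d:i_ns + 1])
    rapi_dp = rapidity(t[i_d:i_r + 1], mop[i_d:i_r + 1])
    rapi_rp = rapidity(t[i_r:i_ns + 1], mop[i_r:i_ns + 1])
    loss = [mop_o - value for value in mop[i_d:i_ns + 1]]
    tapl = trapezoid(loss, t[i_d:i_ns + 1]) / (t[i_ns] - t[i_d])
    if rapi_dp == 0.0 or tapl == 0.0 or mop_o == mop[i_r]:
        raise ValueError("no performance drop found; GR is undefined here")
    recovery_ability = abs((mop[i_ns] - mop[i_r]) / (mop_o - mop[i_r]))
    return robustness * (rapi_rp / rapi_dp) * recovery_ability / tapl


# Two hypothetical systems exposed to the same disruption at t_d = 2.
t = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
mop_a = [1.0, 1.0, 1.0, 0.7, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 0.9]
mop_b = [1.0, 1.0, 1.0, 0.85, 0.7, 0.75, 0.8, 0.85, 0.9, 0.95, 0.95]

gr_a = integrated_resilience(t, mop_a, i_d=2, i_r=4, i_ns=9)
gr_b = integrated_resilience(t, mop_b, i_d=2, i_r=4, i_ns=9)
more_resilient = "B" if gr_b > gr_a else "A"
```

In this example system B loses less performance and ends closer to its original level, and it obtains the larger GR (about 1.16 against about 0.33 for system A). The absolute values carry little meaning on their own; only the ranking is of interest.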

During the last decade, researchers have proposed different methods for quantifying resilience. In 2003, the first conceptual framework was proposed to measure the seismic resilience of a community (Bruneau et al. 2003), by introducing the concept of Resilience Loss, later also referred to as “resilience triangle”.
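For reference, the resilience loss of Bruneau et al. (2003) is usually written as the area between full functionality and the degraded quality curve. In the expression below, \(Q(t)\) denotes the quality of the infrastructure in percent, and \(t_0\) and \(t_1\) the time of the disruption and the time of full recovery; the symbol \(RL\) is used here only to avoid confusion with the robustness R, and none of these symbols is used elsewhere in this chapter.

$$ RL=\int_{t_0}^{t_1}\left[100-Q(t)\right]dt $$

With an MOP normalized between 0 and 1 and a full recovery to the original level, this quantity plays the same role as the performance loss PL accumulated over the disruptive and recovery phases, up to the scaling of the performance axis.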

In recent years, the importance of improving the resilience of interdependent critical infrastructures has been recognized, and a body of research has developed. Historically, knowledge-based approaches have been applied to improve the understanding of infrastructure resilience (McDaniels et al. 2008). Lately, model-based approaches, such as System Dynamics (Bueno 2012), Complex Network Theory (Gao et al. 2016), and hybrid approaches (Nan et al. 2016), have been developed to overcome the limitations of data-driven approaches.

Approaches to quantify system resilience should be able to

  • capture the complex behavior of interdependent infrastructures

  • cover all phases of the transient performance following the disruption, and include all resilience capabilities

  • clarify the overlap with other concepts such as robustness, vulnerability and fragility.

Resilience quantification of interdependent infrastructures is still at an early stage. Currently, a comprehensive method aiming at improving our understanding of the system resilience and at analyzing the resilience by performing in-depth experiments is still missing.

4 Building Resilience in Critical Infrastructures

In the context of critical infrastructures, resilience can be developed by focusing on the different phases of the transient performance following a disturbance (also called resilience curve), and devising strategies and improvements which strengthen the system response.

Focusing mainly on the technical aspects, these strategies can be summarized as:

4.1 Planning Ahead During the Design Phase

Robust or stochastic optimization against uncertain future scenarios, e.g. attacks or uncertain future demand in the energy infrastructure, can be used in the system planning or expansion process; uncertain scenarios provide the basis to design resilient systems.

In (Fang and Sansavini 2017), the combination of capacity expansion and switch installation in electric systems that ensures optimum performance under nominal operations and attacks is studied. The planner-attacker-defender model is adopted to develop decisions that minimize investment and operating costs, and functionality loss after attacks. As such, the model bridges long-term system planning for transmission expansion and short-term switching operations in reaction to attacks. The mixed-integer optimization is solved by decomposition via a two-layer cutting plane algorithm. Numerical results show that small investments in transmission line switching enhance resilience by responding to disruptions via system reconfiguration (Fig. 6.2). Sensitivity analyses show that transmission planning under the assumption of small-scale attacks provides the most robust strategy, i.e. the minimum-regret planning, if many constraints and a limited investment budget affect the planning. On the other hand, the assumption of large-scale attacks provides the most robust strategy if the planning process involves large flexibility and budget.

Fig. 6.2 Integrated planning of system expansion and recovery devices against uncertain attack scenarios (Fang and Sansavini 2017)

4.2 Self-Healing, Adaptation and Control

Graceful degradation: the system cannot be designed with respect to every uncertain scenario, therefore a resilient design should consider how to prevent the disturbance from spreading across the whole system, creating systemic contagion and system-wide collapse. In this respect, cascading failures analysis (Li and Sansavini 2016), and engineering network systems to be robust against outbreak of outages and propagations of cascading failures across their elements are key strategies. Control engineering can provide strategies to create robust feedback loops capable of enabling infrastructures to absorb shocks and avoid instabilities. Designing structures and topologies which prevent failure propagation, and devising flexible topologies by switching elements which allow graceful degradation of system performances after disruptions are also valuable resilience-enhancing techniques (Fig. 6.3).

Fig. 6.3 The heat map of the cumulative economic losses at each canton of Switzerland due to propagation of cascading failures in the electric power system (Li et al. 2015)

4.3 Recovering Quickly from the Minimum Performance Level

Robust or stochastic optimization can be applied to the recovery and restoration process in the face of uncertainties in the repair process or in the disruption scenarios.

System restoration and its contribution to the resilience of infrastructure networks following disruptions have attracted attention in recent years. Optimization approaches usually guide the identification and scheduling of restoration strategies for rapid system functionality reestablishment under limited resources. Most of the related studies rely on deterministic assumptions such as complete information of resource usage and deterministic duration of the repair tasks. However, restoration activities are subject to considerable uncertainty stemming from subjective expert judgment and imprecise forecasts that may render the scheduling solution obtained by a deterministic approach suboptimal or even infeasible under some uncertainty realizations. Restoration planning and scheduling under uncertainty can be investigated within a credibility-based fuzzy mixed integer programming (PMIP) approach, in which the imprecise parameters are modelled by fuzzy numbers (Fang and Sansavini 2016). To solve the proposed fuzzy optimization problem, an interactive fuzzy solution technique is utilized which provides the decision maker (DM) the flexibility to consider two significant factors when making decisions: the degree of achievement of his/her aspiration level and the risk of violation of the constraints. A computational experiment involving the Swiss high voltage electric power transmission network demonstrates the significance and applicability of the developed approach for the DM to determine efficient restoration actions aimed at enhancing system resilience. Generally, the system restoration curves, i.e. the system performance levels evolving over time, show that decreasing the degree of feasibility of the constraints results in a faster system restoration (Fig. 6.4).

Fig. 6.4 System restoration curves for five different feasibility levels of the solution vector, i.e. the set of decision variables concerning the restoration process

4.4 Effective System Restoration

Through the combination of restoration strategies, e.g. repairing the failed elements and building new elements, the infrastructure can achieve a higher performance with respect to the pre-disruption conditions, and display the anti-fragility property (Taleb 2012; Aven 2015).

A system is anti-fragile if its performance improves as the result of exposure to stressors, shocks or disruptions. This behavior is typical of complex systems and is not usually exhibited by engineered technical systems. However, technical systems can display anti-fragility when new investments are allocated, e.g. after disasters. In post-disaster restoration planning of infrastructure networks, the possibility of combining the construction of new components and the repair of failed ones can lead to anti-fragile behavior. The strategic goal is to determine the optimal target system structure so that the performance of the target system is maximized under the constraints of investment cost and network connectivity. The problem can be formulated as a mixed-integer binary linear program (MILP). The preliminary results (Fig. 6.5) show that the restored network can achieve an improved functionality as compared to the original network if new components are constructed and some failed components are not repaired, even when construction is much more expensive than repair. Therefore, different investment allocation schemes define whether an infrastructure network is fragile or anti-fragile. In particular, the tested infrastructure exhibits anti-fragile behavior even for restoration investments that amount to 62% of the cost of complete repair. Furthermore, anti-fragility provides an opportunity for the system to meet future service demand increase, and a perspective under which disruptions can be seen as chances for system performance improvements.

Fig. 6.5 Optimum restoration by repairing and building anew
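The structure of this restoration problem can be illustrated on a toy network. The sketch below is not the MILP formulation mentioned above: it replaces the solver with a brute-force enumeration of all repair and build decisions, and it assumes, purely for illustration, that performance is the sum of the capacities of the links in the restored network. The four-node network, the capacities, the costs and the function names are all hypothetical.

```python
from itertools import combinations


def is_connected(nodes, edges):
    """Depth-first search check that every node is reachable from the first one."""
    adjacency = {node: set() for node in nodes}
    for u, v in edges:
        adjacency[u].add(v)
        adjacency[v].add(u)
    seen, stack = {nodes[0]}, [nodes[0]]
    while stack:
        for neighbour in adjacency[stack.pop()]:
            if neighbour not in seen:
                seen.add(neighbour)
                stack.append(neighbour)
    return len(seen) == len(nodes)


def best_restoration(nodes, surviving, candidates, budget):
    """Enumerate every subset of candidate links (repairs and new builds).

    surviving:  {edge: capacity} for links that did not fail
    candidates: {edge: (capacity, cost)} for failed links and possible new links
    Returns (best performance, chosen links), or (None, None) if infeasible.
    """
    best_performance, best_choice = None, None
    options = list(candidates)
    for size in range(len(options) + 1):
        for choice in combinations(options, size):
            cost = sum(candidates[edge][1] for edge in choice)
            if cost > budget:
                continue  # investment constraint
            if not is_connected(nodes, list(surviving) + list(choice)):
                continue  # connectivity constraint
            performance = (sum(surviving.values())
                           + sum(candidates[edge][0] for edge in choice))
            if best_performance is None or performance > best_performance:
                best_performance, best_choice = performance, choice
    return best_performance, best_choice


nodes = ["A", "B", "C", "D"]
surviving = {("A", "B"): 1.0}
candidates = {
    ("B", "C"): (1.0, 1.0),  # failed link: capacity 1, repair cost 1
    ("C", "D"): (1.0, 1.0),  # failed link
    ("A", "D"): (1.0, 1.0),  # failed link
    ("A", "C"): (3.0, 2.0),  # new link: higher capacity, twice the cost
}
original_performance = 4.0  # four original links of capacity 1

full_budget = best_restoration(nodes, surviving, candidates, budget=3.0)
reduced_budget = best_restoration(nodes, surviving, candidates, budget=2.0)
```

With a budget equal to the cost of complete repair (3.0), the best plan in this toy case builds the new link and repairs only one failed link, reaching a performance of 5.0 against the original 4.0. With a budget of 2.0 the new link cannot be combined with a repair that keeps the network connected, and the best achievable performance is 3.0. The toy numbers only illustrate how the investment level decides between fragile and anti-fragile outcomes; they are unrelated to the results shown in Fig. 6.5.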

4.5 Exploiting Interdependencies Among Infrastructures

Interdependencies and couplings in system operations can foster the propagation of failures across coupled systems; on the other hand, interdependencies might also provide additional flexibility in disrupted conditions and additional resources that can facilitate achieving stable conditions of the coupled system.

Cyber interdependencies are pervasive in critical infrastructures (CIs) and particularly in electric power networks, which are dependent on information and communications technology (ICT), e.g., supervisory control and data acquisition (SCADA) systems, to transmit measurement signals to control centers and to dispatch control signals to actuators. The requirements towards ICT to transmit these signals with tolerable communication delays for timely balancing of power demand and supply have increased due to changes in the operating conditions of electric power networks. On the one hand, their operating conditions are pushed closer to their stability limits due to amplified loading conditions. On the other hand, the increasing share of distributed inverter-connected renewable energy, e.g., wind and PV, on the distribution level has led to a decrease in the inertia and an increase in the volatility in the power grid, further reducing its stability margins. Under these conditions, severe consequences, e.g. system-wide blackouts, can be caused by disturbances in the electric grid.

In the face of these challenges, ICT is expected to turn the current electric grid into a “smart grid” in order to assure reliable, efficient and secure operations of the electric grid. An application that benefits from the ICT in power systems is grid splitting, also referred to as controlled islanding, which relies on real-time system-wide measurements to enable the detection of and recovery from failures in real time, i.e., by applying system topology changes. Grid splitting is a special protection scheme that separates a power system into synchronized islands in a controlled manner in response to an impending instability, i.e., generator rotation desynchronization triggered by a component fault. By appropriately disconnecting transmission lines, severe consequences, e.g., system-wide blackouts, are mitigated through the formation of stable islands.

The successful application of grid splitting depends on the communication infrastructure to collect system-wide synchronized measurements and to relay the command to open line switches. Grid splitting may be ineffective if communication is degraded, and its outcome may also depend on the system loading conditions. The effects of degraded communication and load variability on grid splitting are investigated in (Tian and Sansavini 2016). To this aim, a communication delay model is coupled with a transient electrical model and applied to the IEEE 39-Bus and the IEEE 118-Bus Test System. Case studies show that the loss of generator synchronism following a fault is mitigated by timely splitting the network into islands. On the other hand, the results show that communication delays and increased network flows can degrade the performance of grid splitting. The developed framework enables the identification of the requirements of the dedicated communication infrastructure for a successful grid-splitting procedure.