1 Introduction

Cascading effects constitute a major issue in Critical Infrastructures (CI), for they are often unforeseen and have severe consequences [1]. With the launch of the ‘Smart Grid Luxembourg’, a project which started in 2012 and aims at deploying a country-wide smart grid infrastructure, the Grand-Duchy of Luxembourg is facing new risks in the energy distribution domain.

Although risk assessments can help to get a good overview of the threats faced by a CI operator, and comprehensive frameworks [2, 3] exist for taking appropriate decisions, these methodologies deal with risk scenarios one by one and do not analyse the interactions between them. Indeed, in a risk analysis, threats may share common causes or consequences, which will be accounted for multiple times if each scenario is considered separately – see Fig. 1. Several authors have addressed this issue; Aubigny et al. [4] provide a risk ontology for highly interdependent infrastructures and use Quality of Service (QoS) as a weighting instrument. Foglietta et al. [5] present the EU-funded project CockpitCI that identifies consequences of cyber-threats in real-time. Suh et al. [6] discuss various factors that influence the relative importance of risk scenarios. Tong et al. [7] classify assets into three layers, viz. business, information and system. On the lowest layer, risk is computed traditionally as \(\text {risk} = \text {impact}\times \text {likelihood}\). Dependencies appear in the model as weighted impact added to the risk of dependent higher-level assets. Breier et al. [8] adopt a hierarchical approach and model dependencies as added risk for parent assets; cyclically dependent components are not supported. Stergiopoulos et al. [9] use dependency graphs to describe cascading effects, but the risk of individual nodes is not expressed with respect to the graph structure.

Fig. 1. Illustration of several causal effect chains that share common consequences. In the example of a smart grid operator, the availability of the electrical grid is of first priority, and the latter is endangered by various accidental, intentional and environmental causes. However, if each of the 6 depicted risk scenarios is analysed separately, the risk assessment would account 6 times for the grid instability, although only 3 causes (human error, command injection and fire) would ultimately be responsible for it.

In addition, such a risk analysis is only valid for the time at which it was created. Whenever a security control is put in place or the network topology changes, the situation may change considerably, especially if the infrastructure is characterised by many interdependencies. Smart grids are particularly affected by this issue, as the smart-meter network is constantly reorganised while new living spaces are built. Since critical infrastructures should know the risk they are currently facing at all times, the risk analysis should be generated dynamically, taking all changes to the system into account. Several authors [10,11,12] propose algorithms for computing the likelihoods of interdependent risk scenarios, and thus adopt an approach similar to this paper's, but all of them assume a fixed system and do not discuss the time needed to build the model, which grows exponentially in general and is thus unsuitable for real-time updates.

The objective of this work is to provide a systematic approach for identifying the most critical parts in a cyber-physical system with the help of a risk analysis. Indeed, instead of relying on human intuition to properly cover all interdependencies in the risk assessment, the proposed model automatically deduces the risk scenarios from the dependencies provided as input by the risk assessor.

This way of proceeding has several advantages. First, it accounts for each risk scenario to an appropriate extent (proportionally to its likelihood of occurring). Second, it eases updating or reusing the risk analysis generated by the proposed model, since all aspects are explicitly included in the model; especially in the context of huge systems, this saves a lot of time. This work also paves the way for real-time risk monitoring (see Sect. 6) and for deeper analyses, such as determining the most critical sequence of cascading effects.

Section 2 defines the notions used throughout this paper, while Sect. 3 presents the risk model itself. Section 4 describes how to efficiently encode threats shared by multiple assets. The concepts developed in the paper are applied to the Luxembourgish smart grid in Sect. 5 and conclusions are drawn in Sect. 6.

2 Terminology

The (quantitative) risk modelling approach adopted by this paper relies on notions originating from probability theory. A risk event is a stochastic event which may occur with a certain probability. Two properties are associated with it:

  • The likelihood (or expected frequency) of an event is defined as the number of times it is estimated to happen per time unit. This notion is analogous to the probability of a probabilistic event, but different in principle: whereas traditional probabilities are bounded by ‘absolute certainty’ (100%), the likelihood (frequency) can be arbitrarily large.

  • The impact describes the consequences of a risk event that are estimated to arise whenever the latter occurs. For simplicity, this paper focuses on financial damage (in €), but the latter can be substituted by any other impact measure (such as reputation, number of affected people, ...), as long as this is done consistently throughout the whole risk assessment.

Risk is then defined to be the expected impact with respect to the likelihood. Note the analogy to the ‘expected value’ in probability theory:

$$\begin{aligned} \text {risk} = \text {likelihood} \times \text {impact}. \end{aligned}$$

In a risk analysis composed of many scenarios, the total risk is the sum of all partial risks (engendered by each event):

$$\begin{aligned} \text {risk} = \sum _{i: \text {event}} \text {risk}_i = \sum _{i: \text {event}} \text {likelihood}_i \times \text {impact}_i. \end{aligned}$$
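This sum is a plain computation; the following minimal Python sketch illustrates it (scenario names and figures are purely illustrative, not taken from the paper's case study):

```python
# Each scenario: (likelihood in occurrences per year, impact in EUR).
scenarios = {
    "human error":       (2.0,   10_000),
    "command injection": (0.1,   50_000),
    "fire":              (0.05, 200_000),
}

def total_risk(scenarios):
    """Total risk = sum over all scenarios of likelihood * impact."""
    return sum(lik * imp for lik, imp in scenarios.values())

print(total_risk(scenarios))  # 20000 + 5000 + 10000 = 35000.0
```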

The collection of events in a risk analysis can be embedded into a directed graph \(G=(V,E)\) as follows. Whenever two events \(\alpha \) and \(\beta \) bear a causal relation, in the sense that \(\alpha \) is likely to cause \(\beta \) to happen, an edge is drawn from vertex \(\alpha \) to vertex \(\beta \), written “\(\alpha \rightarrow \beta \)”. An example of a causal graph is depicted in Fig. 1. When acyclic, such graphs are called Bayesian networks [13], but the ‘cycle-free’ assumption is not necessary in the context of this paper.

In order to encode the extent to which two events are dependent, let \(p:E\rightarrow [0,1]\) be the map which associates to each edge \((\alpha \rightarrow \beta )\) the probability that \(\alpha \) causes \(\beta \) directly. Note that this is not the same as the probability that \(\beta \) occurs given that \(\alpha \) occurs; indeed, if the graph consists of a single chain \(\alpha \xrightarrow {0.1}\gamma \xrightarrow {0.5}\beta \), then \(\Pr [\beta |\alpha ]=0.05\), but \(p(\alpha \rightarrow \beta )=0\) since \(\alpha \) cannot cause \(\beta \) directly.
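The distinction can be checked numerically on the chain example: along a single chain, the probability that the first event eventually causes the last is the product of the edge probabilities, whereas p of a non-existent edge is 0. A small sketch (assuming edges fire independently):

```python
# Edge probabilities of the single chain alpha -> gamma -> beta.
p = {("alpha", "gamma"): 0.1, ("gamma", "beta"): 0.5}

def chain_prob(path, p):
    """Probability that path[0] eventually causes path[-1] along this
    single chain: the product of the edge probabilities."""
    prob = 1.0
    for a, b in zip(path, path[1:]):
        prob *= p.get((a, b), 0.0)  # missing edge: no direct causation
    return prob

print(chain_prob(["alpha", "gamma", "beta"], p))  # Pr[beta|alpha] = 0.05
print(p.get(("alpha", "beta"), 0.0))              # p(alpha -> beta) = 0.0
```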

The graph \(G=(V,E,p)\) consisting of the vertex set V of events, the (directed) edge set E of causal relations and the probability map p, is called the dependency graph associated with the risk analysis. Vertices without parents are called root causes and are supposed to occur independently of one another. Denote the set of root causes by \(V_R\subset V\). Those will be the events that ultimately trigger any risk scenario encoded in the graph.

3 Risk Assessments Using the Dependency-Aware Root Cause (DARC) Model

The model described in Sect. 2 is called Dependency-Aware Root Cause (Darc) Model and was introduced by Muller et al. in [14]. It is designed for conducting a risk analysis in complex systems featuring many interdependencies (from a physical, logical or management point of view), such as cyber-physical systems.

Whereas scenarios in traditional risk analyses usually cover direct consequences and all cascading effects, events in the Darc model are supposed to be as specific as possible. In fact, in order to eliminate any redundancy or repetition in a risk analysis, all involved intermediate events should be made explicit (by adding them as vertices to the dependency graph), so that the precise nature of the dependencies can be properly encoded (using directed edges); see Fig. 2 for an example.

Fig. 2. Example of two security events (‘full control over a Programmable Logic Controller (PLC)’ and ‘leak of sensitive data’) that share common causes. Edge labels represent the probability map \(p:E\rightarrow [0,1]\).

The motivation behind the dependency graph is to identify all possible root causes that can bootstrap a chain of effects in a risk scenario. Note that in order to yield sensible results, the model requires the set of causes to be exhaustive for each node, but determining all possible (significant) causes can be hard and time-consuming. A separate node accounts for any uncertainty resulting from unknown, rare or minor causes; notice the node ‘other/unknown’ in Fig. 2. Since its likelihood is usually unknown, an estimated upper bound shall be used instead (which should be lower than the likelihood of all other parent nodes).

Once a dependency graph has been established, the risk assessor estimates the probability p of each edge and the likelihoods \(\mathcal {L}\) (expected frequencies) of the root events only, since the remaining ones can be deduced from the dependency graph and its probability map p. Indeed, if \(\mathcal {L}(\cdot )\) denotes the likelihood of an event and \(V_R\subset V\) represents the set of (independent) root causes, then

$$\begin{aligned} \mathcal {L}(\alpha ) = \sum _{r\in V_R} \mathcal {L}(r) \cdot \Pr [r\;\text {eventually causes}\;\alpha ]. \end{aligned}$$

Define \(\mathcal {P}(r,\alpha ):=\Pr [r\;\text {eventually causes}\;\alpha ]\) for root causes r and any event \(\alpha \). Efficient randomized algorithms exist [14] which can compute these values \(\mathcal {P}\). Now, if \(\mathcal {I}(\cdot )\) denotes the (estimated) impact of an event, then the global risk is

$$\begin{aligned} \text {risk} = \sum _{\alpha \in V} \mathcal {I}(\alpha ) \cdot \mathcal {L}(\alpha ) = \sum _{\alpha \in V} \mathcal {I}(\alpha ) \cdot \sum _{r\in V_R} \mathcal {L}(r) \cdot \mathcal {P}(r, \alpha ), \end{aligned}$$
(1)

where \(V_R\subset V\) denotes the set of root causes. The benefit of this reformulation is that it is no longer necessary to know the likelihood of every event, but only that of the root causes. When the involved maps are interpreted as vectors and matrices, the previous line is equivalent to

$$\begin{aligned} \text {risk} = \underbrace{\mathcal {I}^\top }_{1\times V \; \text {matrix}} \cdot \underbrace{\mathcal {P}^\top }_{V \times V_R \; \text {matrix}} \cdot \underbrace{\mathcal {L}|_{V_R}}_{V_R \times 1 \; \text {matrix}}, \end{aligned}$$
(2)

where \(\mathcal {L}|_{V_R}\) denotes the restriction of \(\mathcal {L}\) to \(V_R\). Note that \(\mathcal {P}\) is entirely deduced from the dependency graph, whereas \(\mathcal {I}\) and \(\mathcal {L}\) have to be estimated by a risk assessor.
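With toy figures (purely illustrative), Eq. (2) can be evaluated directly; the triple matrix product reduces to a sum over event/root-cause pairs:

```python
# Toy example: V = [r1, r2, a], root causes V_R = [r1, r2].
impact   = [0.0, 0.0, 100_000.0]   # I: impact of each event (EUR)
lik_root = [1.0, 0.5]              # L|_{V_R}: root-cause likelihoods per year
# P[r][a] = Pr[r eventually causes a]; a root cause trivially causes itself.
P = [[1.0, 0.0, 0.2],
     [0.0, 1.0, 0.4]]

# Eq. (2): risk = I^T . P^T . L|_{V_R}
risk = sum(impact[a] * P[r][a] * lik_root[r]
           for a in range(len(impact))
           for r in range(len(lik_root)))
print(risk)  # 100000 * (1.0*0.2 + 0.5*0.4) = 40000.0
```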

Writing Eq. (2) in matrix form makes it possible to embed it into a spreadsheet file, where it can be further edited, formatted or analysed (e.g. using diagrams).

4 Risk Taxonomy for Critical Infrastructures

A major drawback of dependency-aware risk analyses is the increase in size of the risk description, resulting from the additional values to be estimated. However, many assets share the same threats, which gives rise to information redundancy in the graph. To avoid this, and thus to reduce the size of the model, the risk taxonomy presented in this section aims at providing a framework where threats can be defined generically for a set of assets. It extends the Darc model introduced in [14] and briefly presented in Sect. 2.

Fig. 3. Class diagram representing the taxonomy of assets involved in a risk analysis, grouped by layer. Read ‘\(A\xrightarrow {\;\xi \;}B\)’ as ‘B is a \(\xi \) of A’.

In fact, most (yet not all) security events are related to a (physical or digital) asset. For this reason, it makes sense to group assets facing similar threats, so as to express the dependencies between asset classes as causal relations between threats acting upon them. For instance, the natural dependence of sensitive data on its database is contained in the statement that threats to any software also put data managed by it at risk. At the same time, one should be able to specify risks for a specific asset, but not for the whole class. For instance, when (specific) login credentials are stolen, a (specific) application can be accessed, whereas this is not the case when other kinds of information are stolen.

The following asset classes have been identified; see Fig. 3 for an overview.

  • A Network is a closed environment of interconnected devices. Communication with the outside is possible, but subject to rules (often imposed by a firewall). Compromising a network amounts to compromising the flows provided by devices inside that network.

  • A Device is any physical hardware. Devices are subject to mechanical damage and physical access.

  • An Application is the functional counterpart of a device; it covers programs, operating systems and firmware. Unlike a device, an application is threatened by software vulnerabilities and remote attacks. The idea is to separate the software and hardware layers, since the two are exposed to different risks.

  • A Flow transports data from one application to another over a network. Flows can be manipulated/blocked and tapped.

  • Information comprises all kinds of data, including contents of a database, keys, passwords and certificates. Information can be destroyed, be tampered with or leak.

  • A Service is a goal that one or more applications pursue. Services represent the purpose of having a particular application running in the system, and are often linked to business processes. Unavailability is the main threat that services are exposed to. For instance, a firewall ultimately prevents network intrusions; if the associated service fails, this may have an impact on the integrity of all flows within that network.
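To make the layer structure concrete, the taxonomy can be sketched as a handful of Python classes; class and attribute names below are illustrative assumptions, not the paper's identifiers:

```python
from dataclasses import dataclass, field

@dataclass
class Information:
    name: str

@dataclass
class Service:
    name: str

@dataclass
class Application:
    name: str
    held_info: list = field(default_factory=list)     # Information assets
    provided_svc: list = field(default_factory=list)  # Services

@dataclass
class Device:
    name: str
    run_appl: list = field(default_factory=list)      # Applications

@dataclass
class Network:
    name: str
    hosted_dev: list = field(default_factory=list)    # Devices

# Minimal inventory: a database application holding customer data.
custdat = Information("custdat")
db = Application("db", held_info=[custdat])
server = Device("server", run_appl=[db])
lan = Network("lan", hosted_dev=[server])

# Compromising the network endangers everything reachable below it:
exposed = [i.name for d in lan.hosted_dev
                  for a in d.run_appl
                  for i in a.held_info]
print(exposed)  # ['custdat']
```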

The idea of classifying assets is not new; Tong et al. [7] have suggested systematising assets on different ‘levels’ (system, informational, business), but their motivation is primarily to do a risk assessment on each layer and interconnect the results, whereas the Darc model goes one step further by linking risk scenarios directly to asset classes. In the critical infrastructure context, Aubigny et al. [4] adopt a similar approach for interdependent infrastructures (using layers ‘implementation’, ‘service’ and ‘composite’). The IRRIIS Information Model [15] separates the topological, structural and functional behaviour of dependencies and analyses their interactions across these three layers. In contrast to all of these works, this paper uses asset classes only to define similar threats for similar assets, but considers each asset individually when computing the risk.

4.1 Dependency Definition Language

The encoding of the causal relations is achieved using a simple mark-up language based on the GraphViz dot syntax. Causal relations are expressed as

figure a

which reads as

If \(\textit{node}_{A}\) occurs, it causes \(\textit{node}_{B}\) to occur with probability x;

where \(\textit{node}_{A}\) and \(\textit{node}_{B}\) are the IDs of the respective nodes in the dependency graph, and \(x\in [0,1]\). The syntax is extended in such a way that the node IDs can contain placeholders which match whole asset classes. Placeholders are enclosed in angle brackets \(\mathtt{<}\) and \(\mathtt{>}\) and have the following form.

figure b

where

  • assetclass is one of net, dev, app, flow, inf, svc;

  • selector is one of hosted-devc, run-appl, provided-flow, required-flow, sent-info, held-info, required-info, required-svc, provided-srvc, depending on the context, meant to navigate through the class model depicted in Fig. 3;

  • filter is a keyword restricting the choice of selected nodes (e.g. #fw for only selecting firewall devices, #key for only selecting keys).

For instance, the following line encodes the fact that if an attacker has full control over some (any) application, there is a 10% chance that he gets access to any associated secret keys.

figure c

4.2 Generating the Dependency Graph

To deduce the final dependency graph from the definitions, an inventory describing the assets themselves needs to be created. Note that organisations usually have such an inventory, especially if they operate a security management system. The inventory itself can be encoded as a directed graph in the dot syntax as well.

figure d

Node IDs should be prefixed by the asset class (e.g. info: for information assets) so that the class can be inferred from the ID. Edge labels express the kind of relation which the second (right) node maintains to the first (left); in the example above, customer data (info:custdat) is information held (held-info) by the database (appl:db).

Once the dependencies have been defined and the inventory has been established, both inputs can be programmatically combined to yield the dependency graph – in the use-case described below, this is achieved using a Python script.
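A minimal sketch of this combination step follows (data formats, tags and node names are illustrative assumptions, not those of the actual script): each generic rule is instantiated once per inventory relation that matches its selector and filter.

```python
# Inventory relations: (asset ID, relation label, related asset ID),
# following the ID convention above (class prefix inside the ID).
inventory = [
    ("appl:db",    "held-info", "info:custdat"),
    ("appl:scada", "held-info", "info:key1"),
]
tags = {"info:key1": {"key"}}  # '#' filters, e.g. #key

# Generic rule: full control over any application leaks, with
# probability 0.1, any key held by that application.
rule = {"class": "appl", "selector": "held-info", "filter": "key", "p": 0.1}

def expand(rule, inventory, tags):
    """Replace the placeholder rule by concrete dependency-graph edges."""
    edges = []
    for src, relation, dst in inventory:
        if (src.split(":")[0] == rule["class"]
                and relation == rule["selector"]
                and rule["filter"] in tags.get(dst, set())):
            edges.append((f"control[{src}]", f"leak[{dst}]", rule["p"]))
    return edges

print(expand(rule, inventory, tags))
# [('control[appl:scada]', 'leak[info:key1]', 0.1)]
```

The database's customer data carries no `key` tag, so only the key held by the SCADA application produces an edge.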

5 The ‘Smart Grid Luxembourg’ Use-Case

The ‘Smart Grid Luxembourg’ (SGL) project aims at modernising the electricity and gas transmission networks by deploying a nation-wide ‘smart’ grid infrastructure. By law, every new electricity or gas meter deployed in Luxembourg from July 2016 onwards must be a smart meter. By the end of the project in 2020, the system will comprise 380,000 smart meters. In the following, the concepts presented in the previous sections are applied to SGL.

5.1 Compiling a Dependency-Aware Inventory

In a first phase, the inventory of all relevant assets was compiled, covering hardware and software, physical wiring, network flows, database tables and their contents, certificates, other kinds of information and the services provided by the various applications. Figures 4, 5 and 6 provide anonymised variants of the complete, confidential graphs.

The Luxembourgish smart grid manages its own Public Key Infrastructure (PKI), so as to guarantee complete independence from external providers. The certificates in a PKI carry a natural dependency hierarchy, in the sense that compromising any certificate authority (CA) allows one to reproduce any dependent certificates and thus, ultimately, to undermine an encrypted communication channel.

Fig. 4. Anonymised network diagram of the central system architecture showing devices and their affinity to the respective networks. DSO = Distribution System Operator; DMZ = DeMilitarised Zone; field devices include data concentrators and meters.

Fig. 5. Anonymised hierarchy of the PKI. CA = Certificate Authority.

Fig. 6. Excerpt of a dependency graph containing placeholders. Note the singleton node ‘DDoS’, which is not associated with any particular asset.

5.2 Threat Model

The second phase consisted in identifying all possible threats faced by the system and encoding them properly in a dependency graph (with placeholders).

The elaborated threat model is based, on the one hand, on research by Grochocki et al. [16] and ENISA [17], who identify the threats faced by a smart grid infrastructure. On the other hand, the dependencies specific to Smart Grid Luxembourg could be extracted from former risk analyses and from documentation material kindly provided by Luxmetering.

It turns out that a large portion of the threats can be expressed as a tuple consisting of an asset (class) and an endangered security property (such as confidentiality, integrity or availability). For instance, the generic risk scenarios faced by applications comprise malfunction, unauthorised access (e.g. by faking login credentials), loss of control (e.g. due to code injection) and denial of service. The risk scenarios that are not directly associated with an asset (such as distributed denial-of-service or fire) are added as singleton nodes to the graph.

5.3 Generation of the Dependency Graph

Once the threat model was set up, the final dependency graph could be programmatically derived from the inventory. For this purpose, a small script written in Python reads in the dependency definitions and applies them to the inventory (by replacing all placeholders by the respective IDs of the assets in the inventory).

The algorithm developed in [14] then allows one to identify all root causes, that is to say, to find those events which are ultimately responsible for all risk scenarios in the threat model. Moreover, it determines the probabilities that each of these root causes eventually leads to each of the other events by cascading effect – thus computing the probability matrix \(\mathcal {P}:V_R\times V\rightarrow [0,1]\) introduced in Sect. 3.
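Efficient randomised algorithms for computing \(\mathcal {P}\) are given in [14]; as an intuition for what they deliver, \(\mathcal {P}(r,\alpha )\) can be approximated by a naive Monte Carlo percolation (a sketch, not the algorithm of [14]): let a cascade unfold from a root cause, with every edge firing independently with its probability, and count how often the target event is reached.

```python
import random
from collections import deque

def estimate_P(edges, root, target, trials=20_000, seed=0):
    """Monte Carlo estimate of Pr[root eventually causes target].
    `edges` maps each node to a list of (successor, probability) pairs;
    the `reached` set also keeps cyclic graphs from looping forever."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        reached, queue = {root}, deque([root])
        while queue:
            node = queue.popleft()
            for succ, prob in edges.get(node, ()):
                if succ not in reached and rng.random() < prob:
                    reached.add(succ)
                    queue.append(succ)
        hits += target in reached
    return hits / trials

# The chain of Sect. 2: alpha -0.1-> gamma -0.5-> beta, expect about 0.05.
edges = {"alpha": [("gamma", 0.1)], "gamma": [("beta", 0.5)]}
print(estimate_P(edges, "alpha", "beta"))  # close to 0.05
```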

5.4 Results

The inventory consists of 12 different devices (each with multiplicity), 9 networks, 37 applications, 43 flows, 14 data sets, 26 certificates, 9 sets of credentials and 18 services. The generic dependency graph encoding the threat model (with placeholders) has 53 nodes and 104 edges. The time needed for conducting the risk analysis for Luxmetering is composed as follows:

  • Gather asset inventory from documentation material and past (static) risk analyses: 18 h (2 md)

  • Define (generic) dependency graph: 30 h (4 md)

  • Estimate the probability map p and the likelihoods \(\mathcal {L}\): 9 h (1 md)

  • Fine-tuning of the model: 7 h (1 md)

  • Total: 64 h (8 md)

Since the generic dependency graph contains (almost) no SGL-specific information, it can easily be reused for other, similar use-cases.

Computing the full final dependency graph (consisting of 502 nodes and 1516 edges) took 3.79 s on a 2.0 GHz dual-core processor (which includes the parsing time of the inventory files). The probability matrix \(\mathcal {P}\) has been computed using a C# implementation of the algorithm proposed in [14]; it is composed of 502 rows (as many as nodes) and 25 columns (root causes), comprising thus a total of 12550 probability values. Its computation took 39.14 s.

The following 25 root causes were read off the model:

  • phishing, social engineering,

  • bad input validation, XSS, CSRF, broken authentication, buffer overflow,

  • DDoS, jamming, smart meter intrusion, physical access to facility,

  • data center incidents (fire ...), device construction faults, mechanical attrition

  • and 11 SGL-specific attacks.

The most critical risks read off from the risk matrix were the following (most critical first):

figure e

In total, a yearly risk of the order of 300 k€ was estimated.

6 Conclusion and Future Work

The Darc model allows one to perform a risk analysis that accounts for the interdependencies in complex infrastructures such as cyber-physical systems.

This paper extends the Darc risk model developed in [14] by a risk taxonomy and describes an automated approach to generate a dependency graph from an asset inventory. That way, any change in the infrastructure can be automatically propagated to the risk analysis, which renders it dynamic.

This is a first step towards dynamic risk management. Future work is devoted to the question of how other sources of real-time information (apart from the inventory) can be used to automatically update the parameters of the risk analysis. Practical examples include intrusion detection systems, firewalls and patch management tools, which make it possible to infer the likelihoods of certain root causes from historical data (such as log files) collected by these sources.

In this spirit, such real-time information can also be used to directly compute the risk currently faced by an organisation. In a dashboard-like interface, technical alerts can be translated into the equivalent losses that an intrusion or fault can cause, expressing technical issues in a language understood by decision-makers. Such an interface is planned to be implemented in Trick Service, an existing web application developed by itrust consulting to conduct risk assessments according to ISO 27005.