1 Introduction

The North American electrical power infrastructure is in need of modernization, due to global warming concerns, volatility in energy supplies and prices, increasing worldwide energy demand, and aging power generation and transmission equipment. This has led to the vision of a smart electrical power grid utilizing the latest information and communications technologies, enabling real-time load and control capabilities from the point of generation to the end-customer. According to the US Department of Energy (DoE) [1], the so-called smart grid electricity delivery network will have several key defining functions, namely: (1) enabling active participation by consumers to adjust consumption based on price and overall demand, (2) improving the utilization and efficiency of the grid by better matching generation with demand, (3) integrating renewable (e.g., solar, hydro, wind, etc.) and distributed power generation sources, (4) providing energy storage options to support renewable power generation, and (5) improving power quality and reliability, and enhancing resiliency to attack, natural disaster and system disturbances.

The first two functions correspond to demand response with real-time adjustment of demand and generation. The third and fourth functions can be thought of as improving the environmental sustainability of electricity generation thus greening the grid. The fifth function is providing wide-area situational awareness, seeking to better manage the grid within a wider geographic scope in order to improve operational efficiency and to perform quick fault diagnosis and isolation. There are significant technical hurdles to overcome to realize the full vision of the smart grid, one of which is the design and development of the cyber infrastructures to support the integration of distributed generation, demand/response and new control functions. Hence, a fundamental challenge in implementing the smart grid is the development of a secure resilient cost effective communications network infrastructure that meets the quality of service and reliability requirements [2]. In this paper, we focus on the reliability requirements posed by the smart grid and their impact on communication network design. We adopt a network service view by considering the communications network’s all-terminal reliability/availability, rather than the reliability of individual components.

The remainder of this paper is structured as follows. The standards, reliability requirements, and methods to calculate reliability are first overviewed in Sect. 2. We next compute the reliability of a potential wide area network and attached substations in the California Power Grid in Sect. 3.2 and outline that its reliability falls short of the DoE recommendations. To overcome this issue, we discuss augmentation of the substation systems and then propose new simple, but yet efficient, optimization methods using minimum cut-sets to augment the initial wide area network design in Sect. 4. Section 5 concludes the paper.

2 Communications Network Reliability

Reliability is characterized as the ability to execute a defined function under specified conditions for a known period of time [3]. Here we use reliability and availability interchangeably. Electric power reliability is quantified in terms of availability of power to the customer. Public utility commissions require power companies to report outages and use metrics such as SAIDI, SAIFI and CAIDI to evaluate the power delivery reliability [4]. The system average interruption duration index (SAIDI) is the total duration of power outage to an average customer over a fixed time period (e.g., monthly, yearly). The system average interruption frequency index (SAIFI) is the average number of times a system customer experiences an outage during the time period of interest (e.g., monthly, yearly). The customer average interruption duration index (CAIDI) is the average time to restore service for customers that suffer an outage. In contrast, communication network reliability is usually discussed in terms of 2-terminal, k-terminal, or all-terminal reliability [3]. The first is the most basic case, where a sender s, and a receiver t, are able to communicate with each other with a certain guarantee. The k-terminal case is when a set of k nodes in the network can communicate. Finally, all-terminal refers to the case when all nodes are able to communicate with all other nodes.

2.1 Standards and Reliability Requirements

The US Department of Energy (DoE), has developed a set of recommended communications QoS and availability requirements to enable the functional objectives for the smart grid [5]. The DoE availability targets are summarized in Table 1. These requirements can be mapped on to services across several communications networks in the NIST Smart Grid Information Networks architecture [6] or the IEEE 2030 [7] reference architecture. Here we focus on the most stringent availability target of supporting wide area situational awareness. This requires high levels of availability from the transmission substations LANs and the wide area network interconnecting system operators and the substations.

Table 1 Smart grid communications availability requirements

The DoE availability requirements, however, do not differentiate between 2-terminal, k-terminal, or all-terminal reliability. In this paper, we assume the most stringent case, namely that of all-terminal reliability. We justify the decision based on the fact that the power grid is a societal dependent critical infrastructure. Thus, when analyzing the requirements for a vital cyber-physical system like the Smart Grid, it is an accepted practice to consider only the worst operational conditions. The resulting system is over-engineered to meet the worst case conditions, thus normally operating with substantial safety margins. Here, we concentrate on determining the all-terminal availability of communication networks deployed to support wide area situational awareness.

2.2 All Terminal Reliability

Consider an undirected network, represented by an undirected graph \(G = (N,E)\), where \(N=\{v_1,v_2, \dots , v_n \}\) is the set of nodes and \(E=\{e_1,e_2, \dots , e_l\}\) is the set of edges, where n is the number of nodes and l the number of edges. Each edge is an unordered pair of different elements belonging to N. We assume nodes do not fail and that edges fail independently representing a random failure scenario. An edge e has a probability \(p_e\) of being operational and \(q_e = 1 - p_e\) probability of being in the failed state. The all-terminal reliability/availability R(G) is the probability the graph G is connected. Calculating all-terminal reliability is known to be an NP-hard problem [8]. Classical approaches for calculating all-terminal reliability focus either on minimum path-set (minpath) or minimum cut-set (mincut) within a given network. A path-set (pathset) is a set of operative edges which ensure that all end-nodes in the network are connected, and a minpath is a pathset that does not contain any other pathset. Hence, a minpath is a spanning tree. Let the minpaths be denoted \(P_i\), \(i=1,2, \dots , r\) where r is the number of spanning trees. Then R(G) is given by

$$\begin{aligned} R(G) = \Pr (P'_1 \cup P'_2 \cup \cdots \cup P'_r) . \end{aligned}$$
(1)

where \(P'_i\) is the event that all edges in \(P_i\) are simultaneously operational (\(i=1,2, \dots , r\)). In [9], an algorithm for generating all spanning trees, with time complexity \(O(N + E + NT)\), where N, E, and NT represent the number of nodes, edges, and spanning trees, respectively, in undirected networks, was proposed. Although this is a very efficient algorithm, the number of spanning trees NT grows quickly as a function of N and E.

The other basic approach to computing all-terminal reliability is based on a cut-set which is a set of of links such that if all of them fail simultaneously, then the system fails. In the case of all-terminal reliability, this means that at least one vertex will become disconnected. A minimum cut-set (mincut) is a cutset that does not contain any other cutset. The mincuts are designated by \(C_j\), \(j=1,2, \dots , u\), where u is the number of mincuts. The all-terminal reliability can be determined from the mincuts by

$$\begin{aligned} R(G) = 1 - \Pr (C'_1 \cup C'_2 \cup \cdots \cup C'_u) . \end{aligned}$$
(2)

where \(C'_j\) is the event that all edges in \(C_j\) are simultaneously failed (\(j=1,2, \dots , u\)). Although a mincut can be found in polynomial time, generating all mincuts is NP-hard [8]. Hence, determining the exact all-terminal reliability is not possible for large networks and several bounds and approximations have been developed (e.g., Bonferroni Bounds [10] and Esary-Proschan bounds [11]). Recently in [12], we proposed a novel algorithm for computing all-terminal reliability bounds using ordered subsets of minimum cut-sets and minimum path-sets. The bounds were shown to be computationally feasible for large networks and reasonably accurate. Here, we utilize the bounds [12] to evaluate the all-terminal reliability of large networks and use the exact calculation by serial and parallel reduction for small networks.

Network edges, due to the probability of cable cuts, are much more subject to failures than nodes. Hence node availability has, in general, very low impact in the all-terminal availability of a network. However, if the availability of the nodes were to be considered in the model, the probability of the event of all nodes in the network being operational, given by the product of the availability of every network node (assuming nodes fail independently) would have to be calculated. The all-terminal availability, considering nodes may fail, would then be the result of the multiplication of this value by R(G).

3 Smart Grid WANs

Smart Grid wide area situational awareness functionality requires very high levels of communication network availability. Wide area situational awareness is implemented by the installation of GPS-synched Phasor Measurement Units (PMU) at transmission substations along with advanced distributed control algorithms to closely monitor and adaptively stabilize the power grid. IEEE has developed standards for synchrophasors measurements and communications [13]. Also, the North American Synchro-Phasor Initiative (NASPI) [14] has developed guidelines for interoperation of the network of synchrophasors and sharing information among regional power grid control centers (i.e., balancing authorities, local power companies, independent system operators). Current NASPI plans project having PMUs installed at all 200KV and above substations by 2019 with expansion to lower voltage substations afterwards. Figure 1 shows the proposed logical view of the communication network proposed by NASPI. The NASPI system consists of several basic components: PMUs at substations, Phasor Data Concentrators (PDC) which aggregate and archive measurements from several PMUs and run application software that allows operators to query various PMUs and analyze the measurements; Phasor Gateways and the Data Bus. The latter will help in the sharing of the phasor measurement data by different utilities and independent system operators (ISOs) and would typically include firewalls for security. Note that communications will range from unicast to multicast to broadcast modes. Implementation of a NASPI like network will require high availability of the transmission level substation LANs upon which the PMUs are placed as well as the interoperator WAN connecting the various utilities. Here we look at these component networks in turn.

Fig. 1
figure 1

NASPI logical architecture

3.1 All-Terminal Reliability of Transmission Substations

Transmission power substations are in charge of stepping down high voltage electric power and splitting feeder lines into multiple lower voltage output distribution lines. Substations are constructed in a variety of sizes and configurations depending on the several factors such as voltage level, number of feeders and distribution lines. Typically, the substation layout is divided into multiple bays that contain power and communication equipment. Figure 2 shows a small 220KV single feeder substation, which consists of five bays: three line bays, one bay housing the transformer, and one bus bay [15]. The high voltage feeder line is connected to the transformer and consequently to the bus. The transformer steps down voltage from the incoming feed to the voltage level needed by the output lines. The bus, usually a very low impedance piece of metal, connects the output of the transformer to the lower voltage lines. Furthermore, two circuit breakers protect the transformer. Positioned on either side, the breakers monitor impedance and resistance, and trip in an emergency situation to prevent any damage to the transformer. Additionally, there are two more breakers in the bays housing the output voltage lines and protecting the substation from any downstream surges.

Fig. 2
figure 2

Small 220/132kV substation layout with a single bus (T1-1 type)

The major challenge faced by the substation automation system (SAS) is to provide interoperability between protection, control, and monitoring services. In order to address this issue, the International Electrotechnical Commission (IEC) standards organization developed the IEC 61850 standard [15]. This widely accepted standard provides the procedures and formats for communication data exchange, data format definition, and XML-based configurations. The main goal of the standard is to remove any ambiguity regarding the functionality of intelligent electric devices (IEDs) such as voltage regulators, protective relays and recloser controllers. The standard defines the allowed power and network equipment, functionalities, inputs/outputs, and all interfaces to the communications network which is typically an industrial Ethernet network.

The SAS monitors in real-time the status of each circuit breaker via digitization of the current and voltage metrics, and distributes those measurements using a substation communication network. Power engineers commonly refer to the communications network within a substation as “the bus”, not to be confused with the power distribution bus. In general, the IEDs in a given substation include four types of equipment. First, the protection IEDs (P IEDs) are in charge of tripping when they measure a high level of impedance or resistance on the power line. The control IEDs (C IEDs) monitor the status of the protection IEDs and issue additional trip commands if a surge could cause cascading failures. Third, the merging units (MUs) facilitate multicast communications within a given bus. The nodes generating sampled measurements send real-time data to the MUs, which in turn multiplex the measurements to the subscribed control IEDs. In the reverse direction, the control equipments send commands to the network MUs, which in turn push the data to the destinations. Finally, due to the stringent real-time delivery requirements specified in IEC 61850, (e.g., 3 ms for sampled values and control commands), time synchronization (TS) equipment is needed for the devices on the bus network. Further, the Ethernet switches provide layer-2 addressing for the local bus. The standard contains two main real-time data protocols: (1) SMV which is used to publish sample values and (2) GOOSE which carries command signals. Due to the real-time delivery requirements, both protocols reside directly on the top of the MAC layer. The addition of PMUs to substations as led to the IEC 61850-90-5 extension to support communication among PMUs and between PMUs and PDCs over the SAS network using UDP/IP and DSCP (Differentiated Services Code Point).

Several standard topologies have been defined for IEC 61850 based substation networks, namely: cascade, ring, star, and redundant ring. In [16], Kanabar et al. provide an analysis of IEC 61850 availability for the various network topology options. However, the focus was on inter-bay two terminal reliability. Specificially, they study a protection IED from a given bay attempting to send a message to a given control IED located in a different bay. Given the importance of PMU data for grid control we adopt the redundant ring architecture substation communications architecture presented in Fig. 3. On the left side of the figure we have the actual communications architecture. Each bay in the substation has a redundant set of Ethernet switches, a redundant set of protection IEDs, a control IED, a merging unit, and a TS unit. Additionally, in the transformer and bus bays there is an extra set of merging units with attached TS units for redundancy. On the substation level, there are four redundant switches that connect two rings to the bay-level switches.

Fig. 3
figure 3

Redundant ring network for IEC 61850-based substation

In the redundant ring network architecture the availability of the links is very high and are ignored in estimating the availability. The architecture can be reduced using series, parallel and pendant reduction [3] to simplify the network topology used to calculate the availability. On the right side of Fig. 3, we present the reduced communications network. Table 2 presents the availability statistics for each type of IED. The redundant switch and protection IED configuration in each bay are combined in parallel fashion with availability \(1-(1-0.999981735)^2\) and \(1-(1-0.999993912)^2\) respectively. The substation level switches consists of two parallel pairs of two parallel switches which can be combined resulting in availability \(1-(1-0.999981735)^{4}\). In the transformer and bus bays there are redundant merging units with attached TS units with combined parallel availability of 0.9999999994. Using the reduced topology of Fig. 3, we calculate the all-terminal reliability \(R_{sub}\) of the typical substation using the reliability block diagram reduction technique, resulting in \(R_{sub} = 0.9998962477\). Note the substation availability is well below the desired level of five to six \(9'\)s.

Table 2 Availability of substation equipment [16]

3.2 Availability of Intersystem WAN

The reliability of a fiber based wide area network serving as the intersystem network in a region will be function of several factors, such as, the geographic size of the network, the amount of redundancy, the protection mechanisms employed and the equipment utilized. We note that power companies in the USA are typically vertically integrated companies and assume that power companies would prefer to not lease network links from a third party, but instead create their own communications network following the power lines as they have right-of-way facilitating cable placement.

Here as a representative case study we consider a intersystem wide area communications network serving the California Power Grid. The California Power Grid is served by 75 utilities and supplies power to more than 30 million customers through more than 32,000 miles of transmission lines. The power grid network layout was obtained from California Energy Commission [17] maps. The maps are distributed as PDF files, which contain text entries with the location and name of power substations. We created a PDF text parser and obtained the (xy) coordinates of all substations. Further, for each map there is a scale from which the relationship between pixels and miles can be deducted. As such, we were able to obtain relative positions of all California substations. For a number of them, we verified that the distances calculated were accurate by using Google maps. In total, there are 3329 substations indicated on the maps. The minimum distance between two substations is 1.2 miles, maximum is 1074 miles, and average is 310 miles. All the substations can be categorized by ownership as belonging to: Imperial Irrigation District, Los Angeles Department of Water & Power, PacifiCorp, Pacific Gas & Electric, Southern California Edison, San Diego Gas & Electric, Sacramento Municipal Utility District, West Area Power Administration, and Other. The transmission lines fall into five categories 33-92KV, 110-161KV, 220-287KV, 345-500KV, and 500KV-DC.

Using the power grid map data, we constructed a WAN backbone connecting all transmission level substations to implement the logical topology of Fig. 1. Following NASPI recommendations the substation PMUs are grouped into clusters based on geographic proximity, and all sub-stations in a cluster have a point-to-point connection to the cluster PDC. In Fig. 4, the fifty four named nodes represent the PDCs. The PDCs form the backbone network nodes which are connected into a WAN following the power grid topology as shown in Fig. 4. Given the target data rates for PMUs (1.5 Mbps) and NASPI QoS requirements, the capacity and performance should not be an issue with current WDM technology.

In order to estimate the availability of the resulting WAN we assume all PDC nodes have perfect availability and the availability of each optical fiber link in the network as a function of distance d in miles is given by the formula \(A = 0.99987^{d/250}\) from [18]. Based on the communication links availabilities, we used the mincut algorithm in [12] to calculated the upper bound on the all-terminal reliability of the example California power network to be 0.99935954 . The reliability of this network falls short of the DoE requirements of five to six 9’s. Note that a simple algorithm of following the power grid was used to construct the backbone network assuming no protection mechanisms, a mesh network with protection may yield a higher availability but at a larger cost.

Fig. 4
figure 4

Sample communications network for the California power grid

3.3 Effects of the Substation on Communications Reliability

For completeness, we extend the analysis to consider the effects of the substation network reliability on the overall availability of the system. In reality, many of the substations in the California Power grid will have larger substation networks than the single feeder 220 kV system analyzed earlier, however we assume that the reliability of all substation networks can be approximated by our earlier analysis. This will result in an availability that is an upper bound on the real system since the larger the substation the more extensive the network resulting in a lower overall reliability. Here we consider the effects of the fifty four substations hosting PDCs on the overall availability. Table 3 presents a summary of the all-terminal reliability results. Observe that considering the substation availability decreases the overall availability (the links and substations case). Furthermore, the WAN network is the weakest system having the lowest availability.

Table 3 Summary of all-terminal reliability results

3.4 Relationship with Power Reliability Metrics

One can illustrate the need for higher levels of reliability by relating the availability metrics above to standard electric power metrics such as SAIDI. A precise analysis would require a detailed model of both the power grid and communication network, such as a co-simulation [2]. Here we conduct a simple average value type of analysis to illustrate the tradeoffs between network availability and power outages. Specifically we consider the substation networks and make the assumptions that all substations in the network are identical and network downtime results in a power grid outage (this is a worst case assumption that is not always the case). Let \(p_s\) denote the probability a substation network is functioning in the all terminal reliability sense (i.e., \(p_s = 0.99989624\) for the system of Fig. 3). Let X denote the event of a substation network failure, assuming the networks fail independently, then \(P\{X = k\}\) follows a Binomial distribution with \(P\{X = k\} = {N \atopwithdelims ()k} (1-p_s)^k (p_s)^{N-k}\) where N is the number of substations. This results in a mean number of substation failures of \(N (1-p_s)\). Let Cust denote the total number of customer served by the power grid, ACS represents the average customers per substation and \(MMTR_s\) the average repair time for a substation network (in minutes). Then the SAIDI can be determined by \(SAIDI = [(N(1-p_s) MTTR_s ACS]/Cust\). Table 4 shows the SAIDI value in minutes versus the substation availability for the California ISO assuming a \(MTTR_s\) of ten hours. One can see that high levels of availability (i.e., 0.99999) eleminates the substation network as a significant source of SAIDI, which has traditionally been the case in the current power grid.

Table 4 SAIDI versus substation network availability

4 Improving Communication Reliability

In the previous section, we calculated the communications network reliability of the California Power Grid and showed that the network falls short of the DoE requirement of 0.99999 or better. In order to meet the DoE objective both the substation networks and the WAN network need higher levels of availability and we discuss each in turn below.

4.1 Substation Network

In order to improve the substation communication availability, we note that the network is not the issue, instead the end device IEDs are what lower the availability. We investigated several options for increasing the availability and report two simple cases here depending on the level of reliability needed.

  1. 1.

    Adding one additional TS and MU component to the Line 1, 2 and 3 Bays, results in increasing the substation availability up to 0.999969216.

  2. 2.

    If in addition to the previous modification we assume that redundant control IEDs (C IEDs) are deployed in the Bays. The C IED performs real time control of the power equipment and in older systems redundancy was often achieved by deploying an odd number of C IEDs with majority voting logic to determine the actual control actions. However, newer digital systems adopt a simpiler backup/standby controller approach using only a single backup C IED. Here we note 6 nines can be achieved if duplicate C IEDs are put in every Bay (i.e., Line 1, 2, 3, Transformer, and Bus), resulting in availability of 0.99999974000.

4.2 Intersystem WAN

In order to meet the DoE availability requirement in the WAN, new links must be added in a cost effective fashion. There are two basic approaches to improving the nework reliability: (1) augmenting the original network with parallel redundant links and (2) adopting a mesh network design. In both cases, we seek to minimize the network cost (e.g., in terms of link distance) given an all-terminal reliability constraint. Recall that the all-terminal reliability calculation is an NP-hard problem, network reliability optimization is even more challenging due to the growth of the search space. Several previous works looked at this specific problem [1922]. In [19], the authors applied a genetic search heuristic to optimize network reliability. While it does not guarantee optimality, the approach was shown to be flexible and effective. In [21], a neural approach called OPTI-net was proposed based on artificial neural networks. The approach was found to provide the optimal solution for most of the tested topologies of limited size. Furthermore, as opposed to related techniques, OPTI-net requires just a few seconds of calculation even with large topologies. However, the approach needs training/tuning of neural network parameters and the optimality of result for large networks was not evaluated. In [23], the authors present a Integer Linear Programming (ILP) model to minimize the total number of fibers or fiber length in a multi-fiber WDM network scenario to meet traffic demands. However, meeting an availability requirement is not considered.

In the following, we propose new efficient mechanisms using mincuts to optimize a communications network to meet a reliability constraint. We first consider the case of adding links in parallel to existing links, then we consider the case of the mesh network design with the possiblity of addiing cross links to the original network design. We note that adding cross links to make the topology a mesh may be too expensive due to the need to procure right-of-way.

4.3 Heuristic Integer Linear Programming (H-ILP)

Consider the problem of given a topology, finding the optimal set of redundant links to add in order to meet a reliability constraint of \(\mathfrak {R}\). We first generate the mincuts, noted \(\mathcal {C}\), whereby each mincut \(\psi \in \mathcal {C}\) consists of a set of links, \(l \in \psi\). Given the link reliability \(r_l\) and number of redundant links \(x_l\) for a given link l, we formulate the reliability as follows:

$$\begin{aligned} R(\mathcal {C}, x) = \prod \limits _{\psi \in \mathcal {C}} \left( 1 - \prod \limits _{l \in \psi } (1 - r_l)^{x_l+1}\right) . \end{aligned}$$
(3)

Note, that \(\mathfrak {R}\) corresponds to the Esary-Proschan all-terminal reliability lower bound. Hence, the exact all-terminal reliability will be larger than the calculated R(C,x). The objective function is defined as follows:

$$\begin{aligned} \min \left( \sum _{l \in \mathcal {L}} x_l\right) , \end{aligned}$$
(4)

where \(\mathcal {L}\) represents the set of links. The nonlinear constraints are given as follows for each mincut \(\psi\):

$$\begin{aligned} \prod \limits _{l \in \psi } (1 - r_l)^{x_l} = b_\psi , \end{aligned}$$
(5)

where \(b_\psi\) represents the probability that the mincut \(\psi\) fails and disconnects the network. We then add the constraint to meet the reliability requirement \(\mathfrak {R}\):

$$\begin{aligned} \prod \limits _{\psi \in \mathcal {C}} (1 - b_\psi ) \ge \mathfrak {R}. \end{aligned}$$
(6)

The problem we face with this system is that Eq. (5) is nonlinear and it would not converge for the communication network supporting the California Power Grid topology presented in the previous section. Therefore, we change constraint Eq. (5) to linear by exploiting the properties of logarithms as follows:

$$\begin{aligned} \sum \limits _{l \in \psi } \log (1 - r_l) \cdot x_l = \log (b_\psi ). \end{aligned}$$
(7)

However, we now face another problem with the overall constraint (Eq. (6)), since the right hand side of Eq. (7) (\(\log (b_\psi )\)) is mapped to a log scale and thus the probability \(b_\psi\) is not accessible to calculate the overall constraint of Eq. (6). To overcome this problem, we propose a heuristic by approximating \(b_\psi\) with a global b such that:

$$\begin{aligned} (1 - b)^{|\mathcal {C}|} \ge \mathfrak {R}. \end{aligned}$$
(8)

Using this global b, we replace Eq. (7) with:

$$\begin{aligned} \sum \limits _{l \in \psi } \log (1 - r_l) \cdot x_l = \log (b). \end{aligned}$$
(9)

The resulting ILP defined by (4), (7) and (9) does not necessarily find the optimal solution, since the mincut probability b is assumed the same for all mincuts. Hence, we develop a greedy heuristic algorithm in order to compare with in terms of computation complexity and cost.

4.4 Greedy Iterative: Max Availability

We propose a greedy-based iterative algorithm which principle consists of successively adding redundant links between node pairs maximizing the increase in availability. Using the mincuts and reliability formulation (Eq. (3)), we define a maximimize availability greedy heuristic in Algorithm 1.

figure a

The number of redundant links, noted \(x_l\), is first set to 0 for each original link l. The algorithm keeps adding redundant links as long as the reliability of the network is below the reliability requirement \(\mathfrak {R}\). In order to find the best redundant link to add, all links \(l \in \mathcal {L}\) are tested by calculating the gain (\(d* = r_2 - r_1\)) in terms of availability of adding this given link. The link \(l*\) maximizing the availability gain (\(d*\)) is selected and a redundant link is added to \(l*\), with \(x_{l*} \leftarrow x_{l*} + 1\).

4.5 Greedy Iterative: Min Cost

In the previous greedy iterative algorithm (Max Availability), extra backup fibers are added to improve the overall network availability assuming a single type of fiber, whereby reliability depends mainly on the fiber length. However, the availability of a given fiber can vary depending on its fiber technology, as investigated in [24]. Here we assume the availability of different fiber technologies as in Table I and Eq. (1) in [24] for 1 km. For a given fiber type \(t \in \mathcal {F}_t\), we have the following availability:

$$\begin{aligned} A_t = \frac{MTBF_t - MTTR_t}{MTBF_t}, \end{aligned}$$
(10)

where \(MTBF_t\) and \(MTTR_t\) correspond to the mean time between failure and mean time to repair, respectively. To extend for any number of kilometers d, availability is given by \(A_t^{d}\). As in [24, Eq. (3)], we define cost for a given fiber type t as follows:

$$\begin{aligned} y_t = MTBF_t^\alpha + \beta , \end{aligned}$$
(11)

where \(\alpha\) corresponds to a scaling factor reflecting the growth of the cost for increasing MTBF and \(\beta\) is the starting cost. Therefore, the overall cost of the network is given by:

$$\begin{aligned} Cost = \sum _{\forall l} \sum _{t \in \mathcal {F}_t} x_{l, t} \cdot y_t, \end{aligned}$$
(12)

where \(x_{l, t}\) corresponds to the number of links installed of type t at link l. We propose a greedy algorithm (Min Cost, Algorithm 2) aiming at minimizing the network cost while still providing the required availability target.

figure b

The first step of Algorithm 2 consists of initializing the links using an initial fiber type dt. Then, as long as the network availability is lower than the requirement, backup links are added using the technologies minimizing cost.

4.6 Reliable Mesh Design (RMD)

In the previous approach, we added new parallel backup links on existing interconnected node pairs only. Here we adopt a greenfield design where nodes can be connected without the restriction of following the power lines. Specifically we propose to construct a reliable mesh network, given the existing node positions, to meet the reliability requirement. The algorithm consists of two main steps, that is, first design a minimal cost spanning tree and then secondly add links to improve the availability. Note, that one might improve the network availability by selecting any node pair to connect, which could result in new network links as well as parallel backup links.

4.6.1 Minimal Low-Cost Spanning Tree

As a first step, a low-cost spanning tree is constructed using the well known Prim’s minimum spanning tree algorithm.

4.6.2 Reliability Mesh Improvement

Starting with the spanning tree topology determined in the first step, new edges are added maximizing reliability, as described in Algorithm 3. Note, that a maximum possible link distance threshold \(\mathcal {L}_d\) is defined to limit the search space for possible new links. Then every potential new link whose distance is below the threshold is evaluated in terms of improving the availability.

figure c

4.7 Numerical Results

H-ILP and Max Availability comparison: We first compare the H-ILP and Greedy Iterative (Max Availability) approaches to improving the reliability for the sample communications network of the California Power Grid. Table 5 shows a comparison of both approaches in terms of availability after network augmentation with parallel links, cost (in terms of number of links), and computation time. Note that we could have different cost metrics, such as link distance, installation, etc. Since, the link distance influences the availability in this scenario (i.e., \(A = 0.99987^{d/250}\) [18]), we use the number of redundant links added as the cost.

Table 5 Results for the California power grid

One can observe that both approaches give similar results in terms of availability and meet the requirement of 0.99999. However, the greedy iterative algorithm overall outperforms H-ILP since the resulting cost in terms of number of links is significantly lower. Note, that the solution found by both approaches is more complex than just the longest links having a redundant link placed in parallel, as the solution contains some short links which appear in many mincuts. Observe that, both algorithms required a reasonable computation time, but the H-ILP was found to require a significantly lower amount of computing resources. In order to compute both algorithms we used a Mac Pro 6-Core Intel Xeon with 32 GB RAM. Despite giving a larger cost, H-ILP could still potentially be useful for large networks to quickly provide an approximate solution.

Cost optimization: We next compare the Max Availability and Min Cost algorithms with fiber type “fiber buried opti” from [24] as the initial baseline link technology. Note that the original greedy Max Availability algorithm was adapted to include multiple fiber types. In order to compare both approaches, we define the cost reduction metric:

$$\begin{aligned} Cost\ reduction = \frac{Cost_{ma} - Cost_{mc}}{Cost_{ma}} \cdot 100, \end{aligned}$$
(13)

where \(Cost_{ma}\) and \(Cost_{mc}\) correspond to the cost (Eq. 12) with Max Availability and Min Cost algorithms, respectively.

Fig. 5
figure 5

Cost reduction of min cost compared to max availability greedy algorithm

Figure 5 depicts the cost savings in using Min Cost compared to the Max Availability algorithm. A cost reduction on the order of 17–20.5 % was observed, depending on the cost scaling factor \(\alpha\) (values of \(\alpha\) taken from [24]) with \(\beta = 6\).

Table 6 Performance of greedy algorithms with multiple fiber types

Furthermore, we compared both algorithms in terms of availability and computation duration in Table 6. The Max Availability greedy algorithm requires a significant amount of computing compared to Min Cost. Note that the Max Availability approach selected few but very costly links (in terms of distance and technology), as opposed to the Min Cost approach which selected many cheap and short fiber links (e.g., aerial fiber). Also, the availability of Min Cost is closer to the minimum requirement, which is correlated to the cost reduction depicted in Fig. 5.

Reliable Mesh Design (RMD) Results In the previous design approachs, we were looking at improving the network by adding backup edges on an existing network constrained to follow the power grid. We next compare the RMD of Algorithm 3 and Max Availability, where RMD maximizes reliability by adopting a more greenfield approach first determing a minimum spanning tree and then modifing the topology into a mesh structure. In contrast the Max Availability design adds backup edges on existing connected node pairs only.

Figure 6 depicts the comparison of both approaches in terms of the availability versus the total number of network links. It is worth noting in the Max Availability case, the network edges are based on the power transmission lines, where in RMD we started from a minimal spanning tree (Refer to Sect. 4.6.1) which is not restricted to following the power line topology. Thus one can see in Fig. 6 that the initial number of links and availability of the two designs are different. In the RMD approach, fewer additional links are needed to achieve the desired availability. Thus, in this specific scenario, adopting the mesh structure leads is significantly few links in order to improve the availability of the network.

Providing a direct cost comparison is difficult since the RMD approach will involve right of way costs and geographic topology effects (e.g., mountains), whereas the Max Availability approach follows existing power lines. However one can compare the two designs in terms of total length of cables required as shown in Fig. 7. Observe that the RMD design starts with a shorter total distance and adds longer cables to increase the availability in comparison to the Max Availability method. However, even though long edges are added in the RMD method, the overall distance of the network using RMD is about 20 % less compared to the Max Availability.

Fig. 6
figure 6

Reliability mesh design compared to the Max Availability approach

Fig. 7
figure 7

Cost comparison of the RMD approach compared to the Max Availability approach

5 Conclusions

In this paper, we discussed the communications reliability requirements for smart power grids at the transmission level with a focus on all-terminal reliability. We studied in detail the reliability of a typical transmission level substation and the intersystem wide area network to implement wide area situational awareness in the California Power Grid. We discussed the need for increasing the availability of the substation networks and how this could be achieved. To improve the wide area network reliability, we investigated different novel optimization mechanisms. We first proposed an heuristic integer programming model (H-ILP) allowing to efficiently approximate the cost to meet a certain reliability threshold. Further, as the H-ILP model does not provide the optimal solution, we proposed a greedy iterative algorithm which outperforms the H-ILP approach in terms of cost but performs worse interms of computation time. Next, we developed a greedy heuristic that considers the tradeoff between cost and link reliability for various fiber types and seeks to minimize the cost. The minimum cost heuristic was shown to out perform the greed max availability heuristic in terms of link cost and computation time. As a basis for comparison we also formulated a mesh network design based on a two step approach consisting of an initial minimum spanning tree which is then augmented with a greedy heuristic. The mesh based design was shown to require fewer links and shorter total link distance. However, how the distance translates into monetary cost will depend on several factors.