1 Introduction

The ever-increasing number of users and applications perpetually consuming and producing data across the Internet of today has produced a great demand which must be met. Extreme-scale science applications continue to grow in importance, and with those applications comes a need for a network that can support the high bit rate necessary for inter-laboratory cooperation and data storage at scale. A prime example can be found in the experiments performed using the Large Hadron Collider (LHC) run by the European Organization for Nuclear Research (CERN), which are expected to generate data ranging from petabytes to exabytes in scale throughout the project lifecycle [1]. Transmitting the resulting measurements and calculations both to storage facilities for data replication and to other laboratories for data verification in a timely fashion requires a medium with the ability to support a tremendous bit rate. Optical networks are such a medium.

In all-optical networks, data are transmitted in the optical domain on fibers which connect optical switches, and the fibers themselves are typically divided into several logical wavelength channels through the process of wavelength-division multiplexing (WDM). Optical Cross-Connects (OXCs) support all-optical WDM by demultiplexing optical signals and multiplexing them onto the correct fibers. A logical connection utilizing WDM between two endpoints is called a lightpath, made up of a combination of the physical links between the two nodes and the particular wavelength that carries that connection’s traffic on each link. Wavelengths on the same link do not interfere with each other, so more than one lightpath can overlap and share a physical link. This flexibility is limited, however: without wavelength converters in the network, a lightpath must use the same wavelength along its entire physical path. This restriction is known as the wavelength-continuity constraint.
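As a rough illustration of the wavelength-continuity constraint, the sketch below searches for the lowest-indexed wavelength that is free on every link of a given path. The function name `first_fit_wavelength` and the layout of the `used` dictionary are assumptions made for this example only, and first-fit is just one common assignment policy rather than the method used in this paper.

```python
from typing import Dict, List, Optional, Set, Tuple

Link = Tuple[int, int]


def first_fit_wavelength(path: List[int],
                         used: Dict[Link, Set[int]],
                         num_wavelengths: int) -> Optional[int]:
    """Return the lowest wavelength index free on every link of `path`.

    `used` maps a directed link (i, j) to the wavelengths already occupied
    on it (a hypothetical data layout). Returns None when no single
    wavelength is free end to end, i.e. the lightpath cannot be set up
    without wavelength conversion.
    """
    links = list(zip(path, path[1:]))
    for w in range(num_wavelengths):
        if all(w not in used.get(link, set()) for link in links):
            return w
    return None


# Wavelength 0 is busy on link (1, 2) and wavelength 1 on link (2, 3).
used = {(1, 2): {0}, (2, 3): {1}}
print(first_fit_wavelength([1, 2, 3], used, 3))  # 2
print(first_fit_wavelength([1, 2, 3], used, 2))  # None
```

In the second call each link still has a free wavelength, but no wavelength is free on both links, so the path is blocked.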

While simple point-to-point connections can be easily supported through this process, modern applications may demand the ability to transmit to and from multiple points. Example use cases include real-time streaming or distributed storage and retrieval. These multiple points can be referred to as a generic set of resources, which could be a group of experiment laboratories, data centers in a Content Distribution Network, or user machines receiving a live stream of a presentation. Multicast, a widely used point-to-multipoint paradigm between a single source node s and an entire set of resource nodes D, is possible at the optical layer through the use of multicast-capable OXCs (MC-OXCs). MC-OXCs are equipped with power splitters, which enable an incoming optical signal to be replicated/split into some number of outgoing signals [2]. The source and the resources of a request are collectively referred to as members, and the logical end-to-end connections between them can be described as a conglomerate of lightpaths known as a light-tree. Multicast light-trees can also be supported without splitting technology through the use of a logical overlay, in which an optical signal carrying traffic is dropped to the electrical layer at particular nodes and converted back to optical to forward the traffic to one or more other nodes [3]. The establishment of an optical light-tree can be reduced to the Steiner minimal tree (SMT) problem, which is NP-complete [4].

Fig. 1 Two solutions for a multicast request from source node 1 to resource nodes 3, 4, and 5. Should resource 3 fail, source 1 will become disconnected from resources 4 and 5 in solution (a), while solution (b) will remain viable should any one resource fail. (a) Unprotected multicast. (b) Protected multicast

Regardless of how efficient multicasting could be, multicast light-trees face the same vulnerability as simple point-to-point connections: The failure of a single physical-layer component, such as a fiber link or a switching node, could disconnect an entire session and render the established tree inoperable. An example is shown in Fig. 1a, where the removal of node 3 renders the provisioned connection futile. Survivability is the capability of a network to continue operation even in the presence of accidents, attacks, or equipment failures, which can all have detrimental effects on the integrity and performance of a network. Should a link carrying traffic be cut, any data propagating across the link at the time, which could number upward of 100 Gb depending on the medium, will be lost. Should a network node fail, not only can it no longer be used for forwarding traffic, but the data held in buffers at the time, and any data currently on links incident to the node, will be lost as well.

Network survivability strategies tend to focus on dealing with link failure, which can be fairly common due to human error, such as accidental fiber cuts during construction projects [5, 6]. Node failures typically require an extraordinary event, such as a natural disaster [7] or a directed attack with an electromagnetic pulse [8]. The arrival of Hurricane Sandy in New York and New Jersey in October 2012 resulted in the failure of three hundred Verizon facilities along the eastern seaboard [9]. Other unavoidable natural disasters, such as the 2011 Tōhoku earthquake and tsunami in Japan [10], can occur, often without warning. Researchers have proposed methods for designing networks to survive disasters, such as the authors of [11], who propose optimal disaster-aware methods for selecting locations for underwater fiber-optic cables. Survivability as a discipline has been well studied [12] and modeled [13], with strategies generally broken into two categories. Restoration involves computing and setting up an alternate path should a network failure occur, at the cost of greater downtime, while protection requires provisioning paths in advance and quickly switching traffic should something fail, at the cost of having to set aside resources which other active demands will be unable to use. Recent research in restoration includes work on improving efficiency through approaches that take advantage of the flexibility inherent in elastic optical networks [14], where the optical spectrum can be divided into slots that can be combined to meet demands. Protection can be further divided into two classes: dedicated, in which each demand has its own distinct backup path to switch to in the event of a failure, and shared, in which costs can be reduced by sharing a backup path between multiple demands, with only one demand able to switch should a failure occur.

Specialized methods for multicast survivability exist for all aforementioned subdivisions of the field. Preventing a disconnection in the event of a single link failure has been studied extensively [15, 16]. The recent work proposed in [16] provides a tree-segment-based protection solution, lowering the blocking for dynamic traffic, where demands are satisfied as they arrive in the system and compete for resources, with a marginal increase in cost over earlier backup light-tree solutions [15]. An overlay dual-homing approach is used in [17] to protect client nodes against the failure of any one optical link or access network link, while others add additional paths onto a multicast tree to make established demands survivable against the failure of intermediate nodes between the source and the resources [18]. The problem of protecting multicast sessions across multi-domain optical networks is tackled in [19], where the authors propose two cost-efficient heuristics for building a survivable inter-domain multicast tree.

Other methods include the use of predesigned cycles (p-Cycles), which have been proposed by multiple authors as a cost-efficient method for protecting against a wide variety of failures in optical networks [20]. Multicast survivability problems have been solved through overlapping p-Cycles to protect against intermediate node failure [21], or constructing trees for recovering from both node and link failure [22]. The authors of [23] take a different approach, focusing instead on a subset of node failure and proposing a solution for resource node failure protection in the context of virtual networks, provisioning an additional physical backup node for each virtual resource. Work in a similar vein is done in [24], where the authors provide a mixed ILP along with heuristic algorithms.

With the currently explored approaches in mind, we aim to examine the static problem of protecting multicast requests in WDM networks against the failure of a single node out of the group of multicast resource nodes through a logical overlay. We have previously found solutions to this problem through both heuristics [25] and ILP [26] in the context of traditional optical WDM networks and have found that traditional SMT solutions for multicast often do not survive the removal of just one resource node from a tree. The removal of a branching, or Steiner, point can disconnect a branch from a multicast tree, rendering any resources on that branch unreachable through the established lightpaths. Resource nodes can, depending on the topology, often be used as Steiner points. Given the important role resource nodes serve in multicast communications, the loss of a resource node can result in both the disconnection of entire light-tree branches and the loss of any critical data at that node. This can make resource nodes a tempting target for anyone looking to cause maximum harm to an established multicast connection.

In order to mitigate this harm, we present protection methods that enable a connection between a source node and the resource nodes to remain intact should any one resource node fail. These strategies are not guaranteed to protect against the failure of any intermediate node, which can be costly [18], but rather act as a less-expensive compromise that will protect against targeted removal of these high-priority nodes. Aiming to protect only against resource failure allows algorithms to be designed with a focus on a clearly defined set of nodes, rather than attempting to tackle the problem of dealing with any node or Steiner point failing, which might be determined dynamically depending on how the light-tree is constructed. In this paper, we present an extended and improved ILP formulation for solving the static version of the resource-failure protection problem and compare the performance to heuristics.

A major motivation behind our approach is informed by the use case of large-scale science facilities, such as the previously mentioned LHC. These facilities may stream tremendous quantities of data at one time to multiple remote sites and cannot afford to lose both the data at a failed node and data en route to the other sites at the time of failure. Node in this case would refer to both a remote site and the closest major switch responsible for forwarding data to not only that site, but likely other repositories or laboratories as well. The demands in a scientific network can be large in terms of size and may be long-lasting, so the static approach of determining how efficiently a group of demands can be provisioned is appropriate. Each request receives dedicated backup paths, to ensure that every request is left with an option to switch to should a failure occur; with shared backup paths, a request can be left without such an option. We assume that the resource nodes are geographically distributed in such a way that a natural disaster, which would disable a swath of networking equipment simultaneously, would not affect multiple resources at a time. Finally, our protected solutions are constructed through the use of a logical overlay, as multicast-capable switches may often not be available.

The paper is structured as follows. The formal problem definition is given in Sect. 2, and the ILP solution follows in Sect. 3. Section 4 describes our proposed resource-failure protection heuristics, whose performance is quantitatively compared with the ILP in Sect. 5. Section 6 concludes the paper.

2 Problem definition

We are given the following inputs to the problem.

  • A topology \(G=(V,E)\), where V is a set of network nodes, and E is a set of unweighted, directed edges. The wavelength-continuity constraint is enforced.

  • A set of wavelengths W, where |W| is the number of wavelength channels supported by each fiber.

  • A static set of immediate reservation multicast connection requests R, with \(r \in R\). Each request \(r = (s_r, D_r)\), where source \(s_r \in V\) and the set of resources \(D_r \subseteq V - \{s_r\}\), must be established while protecting it against the failure of any single node in \(D_r\). Immediate reservation requests must be provisioned at the time of arrival, and the entire set arrives at once. The bandwidth granularity of each request is assumed to be equivalent to the capacity of a single wavelength, and no grooming is performed.

Defining the problem formally, the goal is to establish a protected solution \(G'\) for each multicast request r on topology G such that the removal of any one resource does not disconnect the remaining multicast members of that request, and the total number of wavelengths required to satisfy all requests is minimized. The protection requirement can be formulated as a bound of \(|M|\ge 2\) for any minimal vertex cut M, where \(M \subseteq D_r\) for solution \(G'\). A cut is a set of vertices from V, such that their removal causes the remaining graph G to become disconnected. Cuts may be of various sizes, but the minimal cut is that which contains the smallest set of nodes from V. Such a protected solution \(G'\) can be described as biconnected, meaning that it takes the removal of two elements (in this case, only resource vertices) to disconnect the solution. An example multicast solution is shown in Fig. 1a which, while using a minimum number of links, is unprotected should resource 3 be removed. A protected solution for the same multicast request is shown in Fig. 1b.
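The protection requirement can be checked directly on a candidate solution. The following minimal sketch removes each resource in turn and tests, with a breadth-first search, whether the source still reaches every remaining resource. The function names and the two small example graphs are hypothetical; in particular, the graphs are not the topology of Fig. 1.

```python
from collections import deque
from typing import Dict, Iterable, Set


def reachable(adj: Dict[int, Set[int]], source: int, removed: int) -> Set[int]:
    """Nodes reachable from `source` by BFS once node `removed` is deleted."""
    seen = {source}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj.get(u, ()):
            if v != removed and v not in seen:
                seen.add(v)
                queue.append(v)
    return seen


def survives_single_resource_failure(adj: Dict[int, Set[int]],
                                     source: int,
                                     resources: Iterable[int]) -> bool:
    """True if removing any one resource leaves all others reachable."""
    resources = set(resources)
    for failed in resources:
        alive = reachable(adj, source, failed)
        if not (resources - {failed}) <= alive:
            return False
    return True


# Hypothetical directed solutions for source 1 and resources {3, 4, 5}.
unprotected = {1: {3}, 3: {4, 5}}          # resource 3 is a branching point
protected = {1: {3, 4}, 3: {5}, 4: {5}}    # resource 5 has two disjoint routes
print(survives_single_resource_failure(unprotected, 1, [3, 4, 5]))  # False
print(survives_single_resource_failure(protected, 1, [3, 4, 5]))    # True
```

Only resource nodes are removed in the check, which mirrors the restriction \(M \subseteq D_r\) on the vertex cut.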

3 Integer linear programming solution

The Multicast Destination Failure Protection ILP formulated below is based in part on the Drop At Any Node (DAAN) multicast overlay ILP, presented in [27]. The DAAN approach to establishing multicast circuits in optical networks efficiently establishes a logical overlay over the underlying physical network. In these solutions, we provision requests by creating a set of lightpath routes in the overlay layer from the source node of a request to each resource member. Each lightpath route can terminate, or “drop,” at any node to the electronic layer and can then return to the optical layer to forward the traffic toward another node. In this manner, a light-tree can be constructed without splitting hardware, at some cost to efficiency compared with a purely optical-level solution, were one possible. We build on this formulation to create resource-failure survivable overlays, providing protection for multicast sessions in any biconnected network. Our survivable solution combines multiple lightpaths to form a primary end-to-end “connection” from the source for each resource. Individual lightpaths can be shared between different connections. If there are any intermediate resource nodes present in a resource’s primary connection, we provide a backup connection which does not share any intermediate resource nodes with the primary connection.

3.1 Minimum wavelengths required ILP formulation (ILP-MinWR)

3.1.1 Given

V :

is the set of nodes in the network.

\(A_{ij}\) :

is 1 if a physical link exists between \(i,j \in {{V}}\).

R :

is the set of multicast requests, which are numbered 1 through |R|. For a given multicast request r, we denote the source node of the request as \(s_r\) and the set of resource member nodes as \(D_r\). The set of non-resource members is denoted as \(X_r = V - D_r \cup \{s_r\}\).

W :

is the set of wavelengths available on each link.

H :

is the indexing set for variable \(P^{r,d,h}_{u,v}\), where \( H = \{1,2\}\). This is used to indicate whether either one or two end-to-end connections are required between the source and a particular resource to provide resource-failure protection, with \(h = 1\) indicating the primary path, and \(h = 2\) the secondary.

Z :

is a very large number, used as an upper bound for inequalities.

3.1.2 Variables

The ILP will solve for the following variables:

\(L^{r,w}_{u,v}\) :

is a binary variable, with a value of 1 if a lightpath is established for request r from node u to node v on wavelength w. It is 0 otherwise.

\(F^{r,w}_{u,v,i,j}\) :

is a binary variable, with a value of 1 if there is a flow on the physical link from node i to node j on wavelength w, for a lightpath from node u to node v, for request r. It is 0 otherwise.

\(C^r_w\) :

is a binary variable, with a value of 1 if wavelength w is used to service multicast request r. It is 0 otherwise.

\(P^{r,d,h}_{u,v}\) :

is binary, equal to 1 if there is an end-to-end connection (i.e., a series of lightpaths) from the source node \(s_r\) to resource \(d \in D_r\) for request r, using lightpath (uv) as a virtual link. These connections are indexed by \(h \in H\). The value is 0 otherwise.

\(LP^{r,h}_{u,v}\) :

is binary, equal to 1 if the lightpath (uv) is a virtual link in a connection P from the source node \(s_r\) to any resource node. The value is 0 otherwise. A lightpath can act as a virtual link for several end-to-end connections between the source \(s_r\) and the resource nodes in \(D_r\).

\(I^{r,n}_{u,v}\) :

is binary, equal to 1 if node \(n \in V\) is present in lightpath (uv). The value is 0 otherwise.

\(G^{r,d,h}_{n,u,v}\) :

is binary, equal to 1 if node \(n \in V\) is present in lightpath (uv), where (uv) is a virtual link in request r’s connection h to resource d. The value is 0 otherwise.

\(N^{r,d,h}\) :

is an integer counter variable, equal to the number of resource nodes present in end-to-end connection P from \(s_r\) to d.

\(B^{r,d}\) :

is binary, equal to 1 if any connection \(P^{r,d,h}_{u,v}\) contains at least one intermediate resource node \(\in D_r\), indicating that the connection from \(s_r\) to resource node d would become disconnected should another resource node fail. This variable determines whether more than one connection P is required to provide single resource node failure protection for node d. The value is 0 otherwise.

MaxIndex :

is an integer variable, representing the largest wavelength index used on any link network-wide. Minimizing this value is the objective.

3.1.3 Constraints

Objective function:

                              minimize: MaxIndex

Subject to:

$$\begin{aligned}&MaxIndex \ge C^r_w \times w; \qquad \forall \;r \in {R},\ \;w \in {W}. \end{aligned}$$
(1)
$$\begin{aligned}&\quad \sum \limits _{w}^{{W}} C^r_w \ge 1; \qquad \forall \;r \in {R}. \end{aligned}$$
(2)
$$\begin{aligned}&\quad L^{r,w}_{u,v} \;\le \; C^r_w; \quad \forall \;r \in {R},\ \;w \in {W},\quad u,v\in {V}. \end{aligned}$$
(3)

A lower bound for the maximum wavelength index used is provided in Constraint (1). Constraint (2) ensures that at least one wavelength is used to satisfy each request, and Constraint (3) ensures that a lightpath for request r can be established on wavelength w only if w is marked as used by that request.

$$\begin{aligned}&\sum \limits _{r}^{{R}} \sum \limits _{u}^{{V}} \sum \limits _{v}^{{V}} F^{r,w}_{u,v,i,j} \le 1; \ \ \forall \;i,j \in {V}, \qquad \;w \in {W}. \end{aligned}$$
(4)
$$\begin{aligned}&F^{r,w}_{u,v,i,j} \le A_{ij} \times L^{r,w}_{u,v} \ \ \forall \;r \in {R}, \ \;u,v,i,j \in {V}, \nonumber \\&\quad u\ne v, i \ne j,{\quad } w \in {W}. \end{aligned}$$
(5)
$$\begin{aligned}&\sum \limits _{i}^{{V}} F^{r,w}_{u,v,i,j} - \sum \limits _{k}^{{V}} F^{r,w}_{u,v,j,k} = \left\{ \begin{array}{ll} 0 &{}\quad \text{ if } j\, \ne \, u,v \\ L^{r,w}_{u,v} &{}\quad \text{ if } j = v \\ -L^{r,w}_{u,v} &{}\quad \text{ if } j = u \\ \end{array} \right. \nonumber \\&\quad \forall \;u,v,j \in {V},\quad w \in {W}, \quad r \in {R}. \end{aligned}$$
(6)

Constraints (4) through (6) are the physical-layer constraints. Constraint (4) prevents any wavelength from being used by more than one lightpath or request on any particular link, while Constraint (5) allows a flow to be placed only on a physical link that exists in the topology, and only for a lightpath that is established. Constraint (6) is a flow conservation constraint, requiring the in-flow to equal the out-flow of any bypass, or non-endpoint, node. The origin and terminus of a lightpath have negative and positive net flow, respectively.
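To make the flow conservation rule of Constraint (6) concrete, the sketch below checks it for a single lightpath (u, v) on one wavelength, given the set of physical links that carry its flow. The helper `conserves_flow` and its argument layout are assumptions for illustration; an ILP solver enforces the same balance implicitly.

```python
from typing import Iterable, Set, Tuple

Link = Tuple[int, int]


def conserves_flow(flow_links: Set[Link], nodes: Iterable[int],
                   u: int, v: int, established: int = 1) -> bool:
    """Check Constraint (6) for one lightpath (u, v) on one wavelength.

    `flow_links` holds the links (i, j) whose flow variable equals 1.
    Net flow (in minus out) must be +L at v, -L at u, and 0 elsewhere,
    where L = `established` is the lightpath variable.
    """
    for j in nodes:
        inflow = sum(1 for (_, head) in flow_links if head == j)
        outflow = sum(1 for (tail, _) in flow_links if tail == j)
        if j == v:
            expected = established
        elif j == u:
            expected = -established
        else:
            expected = 0
        if inflow - outflow != expected:
            return False
    return True


nodes = [1, 2, 3, 4]
print(conserves_flow({(1, 2), (2, 3)}, nodes, u=1, v=3))  # True
print(conserves_flow({(1, 2)}, nodes, u=1, v=3))          # False
```

The second call fails because the flow enters bypass node 2 and never leaves it.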

$$\begin{aligned}&\sum \limits _{u}^{V} \sum \limits _{w}^{{W}} L^{r,w}_{u,v} \ge 1; \ \ \ \forall \;r \in R, \ v \in D_r. \end{aligned}$$
(7)
$$\begin{aligned}&\quad \sum \limits _{v}^{V}\sum \limits _{w}^{{W}} L^{r,w}_{s_r,v} \ge 1; \ \ \ \forall \;r \in R. \end{aligned}$$
(8)
$$\begin{aligned}&\quad \sum \limits _{u}^{V}\sum \limits _{w}^{{W}} L^{r,w}_{u,s_r} = 0; \quad \forall \;r \in R. \end{aligned}$$
(9)
$$\begin{aligned}&\quad \sum \limits _{v}^{V} \sum \limits _{w}^{{W}} L^{r,w}_{u,v} - Z \times \sum \limits _{v}^{V} \sum \limits _{w}^{{W}} L^{r,w}_{v,u}\le 0; \quad \forall \;r \in R,\ u \in V, \nonumber \\&{\quad } u \ne s_{r}. \end{aligned}$$
(10)
$$\begin{aligned}&\quad \sum \limits _{u}^{V} \sum \limits _{w}^{{W}} L^{r,w}_{u,v} - \sum \limits _{u}^{V} \sum \limits _{w}^{{W}} L^{r,w}_{v,u} \le 0; \quad \forall \;r \in R,\ v \in X_r. \end{aligned}$$
(11)

Lightpath establishment is covered through constraints (7) through (11). At least one lightpath must terminate at each resource node so the data can be received (7) and at least one lightpath must originate from the source node to carry the data (8). No lightpath may terminate at the source node (9), but lightpaths are allowed to terminate at any other node. Lightpaths can only originate at a non-source node if there is at least one terminating lightpath at the node (10). This is accomplished by summing the number of lightpaths originating at a node and subtracting the product of the number of lightpaths terminating at the node and a large number Z. This allows the constraint to hold when more lightpaths originate from the node than terminate at it, provided that at least one terminates there. There must be at least one lightpath originating from a non-resource node if a lightpath terminates there, so the data can be forwarded to resources (11).
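The lightpath establishment rules can be sketched as a simple validity check on a candidate set of lightpaths for one request. The function below is illustrative only: `valid_lightpath_set` is a hypothetical name, wavelengths are collapsed so that each lightpath is just a pair (u, v), and the big-Z form of (10) is replaced by the equivalent logical test.

```python
from typing import Iterable, List, Tuple


def valid_lightpath_set(lightpaths: List[Tuple[int, int]],
                        nodes: Iterable[int],
                        source: int,
                        resources: Iterable[int]) -> bool:
    """Check the conditions behind Constraints (7) through (11)."""
    nodes = list(nodes)
    resources = set(resources)
    out_deg = {n: 0 for n in nodes}   # lightpaths originating at n
    in_deg = {n: 0 for n in nodes}    # lightpaths terminating at n
    for u, v in lightpaths:
        out_deg[u] += 1
        in_deg[v] += 1
    if any(in_deg[d] < 1 for d in resources):   # (7)
        return False
    if out_deg[source] < 1:                     # (8)
        return False
    if in_deg[source] != 0:                     # (9)
        return False
    for n in nodes:
        if n == source:
            continue
        if out_deg[n] > 0 and in_deg[n] == 0:   # (10)
            return False
        if n not in resources and in_deg[n] > out_deg[n]:   # (11)
            return False
    return True


nodes = [1, 2, 3, 4, 5]
print(valid_lightpath_set([(1, 2), (2, 3), (2, 4)], nodes, 1, [3, 4]))  # True
print(valid_lightpath_set([(1, 3), (1, 2), (3, 4)], nodes, 1, [3, 4]))  # False
```

In the first call the traffic is dropped at non-member node 2 and forwarded on two new lightpaths, which is the overlay behavior the constraints permit. The second call fails (11) because a lightpath ends at node 2 with nothing forwarded.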

$$\begin{aligned}&I^{r,n}_{u,v} \times Z \ge \sum \limits _{w}^{{W}} \sum \limits _{i}^{V} F^{r,w}_{u,v,i,n} + \sum \limits _{w}^{{W}} \sum \limits _{k}^{V} F^{r,w}_{u,v,n,k}; \nonumber \\&\quad \forall \;r \in R, \ u, v, n \in V. \end{aligned}$$
(12)
$$\begin{aligned}&\quad I^{r,n}_{u,v} \le \sum \limits _{w}^{{W}} \sum \limits _{i}^{V} F^{r,w}_{u,v,i,n} + \sum \limits _{w}^{{W}} \sum \limits _{k}^{V} F^{r,w}_{u,v,n,k}; \nonumber \\&\quad \forall \;r \in R, \ u, v, n \in V. \end{aligned}$$
(13)

A critical component of protecting multicast requests against the failure of a resource node is determining which nodes are physically present within a lightpath. The binary variable \(I^{r,n}_{u,v}\) is set to 1 if there is at least one flow into or out of node n, indicating that a lightpath (uv) either originates, terminates, or passes through n (12) and (13).

$$\begin{aligned}&\sum \limits _{h}^{H} \sum \limits _{u}^{V \setminus \{s_r\}} P^{r,d,h}_{u,s_r} = 0; \ \ \forall r \in R, \ d \in D_r. \end{aligned}$$
(14)
$$\begin{aligned}&\sum \limits _{h}^{H} \sum \limits _{v}^{V \setminus \{d\}} P^{r,d,h}_{d,v} = 0; \ \ \forall r \in R, \ d \in D_r. \end{aligned}$$
(15)
$$\begin{aligned}&\sum \limits _{v}^{V \setminus \{s_r\}} P^{r,d,1}_{s_r,v} = 1; \ \ \forall r \in R, \ d \in D_r. \end{aligned}$$
(16)
$$\begin{aligned}&\sum \limits _{v}^{V \setminus \{s_r\}} P^{r,d,2}_{s_r,v} = B^{r,d}; \ \ \forall r \in R, \ d \in D_r. \end{aligned}$$
(17)
$$\begin{aligned}&\sum \limits _{u}^{V \setminus \{d\}} P^{r,d,1}_{u,d} = 1; \ \ \forall r \in R, \ d \in D_r. \end{aligned}$$
(18)
$$\begin{aligned}&\sum \limits _{u}^{V \setminus \{d\}} P^{r,d,2}_{u,d} = B^{r,d}; \ \ \forall r \in R, \ d \in D_r. \end{aligned}$$
(19)
$$\begin{aligned}&\sum \limits _{u}^{V \setminus \{v\}} P^{r,d,h}_{u,v} = \sum \limits _{a}^{V \setminus \{v\}} P^{r,d,h}_{v,a}; \ \ \forall r \in R, \ v \in V \setminus \{s_r, d\}, \nonumber \\&d \in D_r, \ h \in H. \end{aligned}$$
(20)

\(P^{r,d,h}_{u,v}\) is used to keep track of which (u,v) lightpaths are used for either the primary (h = 1) or backup (h = 2) connections from \(s_r\) to each \(d \in {D_r}\). A connection from \(s_r\) to a resource d does not need a lightpath terminating at \(s_r\) (14). Constraint (15) similarly prevents lightpaths originating at node d from being used in a connection from \(s_r\) to d. One lightpath originating from the source must be a part of the primary end-to-end connection to each d (16), and if the binary variable \(B^{r,d}\) is equal to 1, there must also be a lightpath originating at the source for the backup connection (17). A similar set of constraints (18) and (19) is established for lightpaths terminating at resource nodes. Finally, the number of lightpaths in a connection from \(s_r\) to d terminating at a node v must equal the number of lightpaths originating at v, enforcing the continuity of connection traffic (20).

$$\begin{aligned}&B^{r,d} \times Z \ge N^{r,d,1} - 2; \ \ \forall r \in R, \ \ d \in D_r. \end{aligned}$$
(21)
$$\begin{aligned}&\quad B^{r,d} \le N^{r,d,1} - 2; \ \ \forall r \in R, \ \ d \in D_r. \end{aligned}$$
(22)

\(B^{r,d}\) is a binary variable for determining when a backup connection is necessary for a particular resource d. Constraints (21) and (22), when combined, force B to equal 1 when there are at least two resource nodes in the primary connection, indicating that there is at least one intermediate resource node. Otherwise, \(B = 0\).
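The effect of the big-Z pair (21) and (22) can be seen by enumerating, for a given count N, which binary values of B satisfy both inequalities. This is a small sketch, not part of the formulation; the helper name and the value Z = 1000 are arbitrary choices for illustration.

```python
from typing import List


def feasible_backup_flags(n_count: int, big_z: int = 1000) -> List[int]:
    """Binary values of B with B * Z >= N - 2 and B <= N - 2."""
    return [b for b in (0, 1)
            if b * big_z >= n_count - 2 and b <= n_count - 2]


for n in range(1, 5):
    print(n, feasible_backup_flags(n))
# 1 []
# 2 [0]
# 3 [1]
# 4 [1]
```

Read literally, the two constraints force B = 0 when N = 2 and B = 1 when N is 3 or more, while N = 1 admits no value of B. This suggests that the count N always includes at least two nodes of every primary connection, for instance the source and d itself if \(M_r\) in (30) denotes the full member set, although that reading is an assumption rather than something stated in the formulation.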

$$\begin{aligned}&LP^{r,h}_{u,v} \times \left| {D_r}\right| \ge \sum \limits _{d}^{D_r} P^{r,d,h}_{u,v}; \ \ \forall \ r \in R, \ \ u,v \in V, \nonumber \\&\quad h \in H. \end{aligned}$$
(23)
$$\begin{aligned}&LP^{r,h}_{u,v} \le \sum \limits _{d}^{D_r} P^{r,d,h}_{u,v}; \ \ \forall \ r \in R, \ \ u,v \in V, \ \ h \in H. \end{aligned}$$
(24)
$$\begin{aligned}&LP^{r,h}_{u,v} \le \sum \limits _{w}^{{W}} L^{r,w}_{u,v}; \ \ \forall \ r \in R, \ \ u,v \in V, \ \ h \in H. \end{aligned}$$
(25)
$$\begin{aligned}&\sum \limits _{h}^{H} LP^{r,h}_{u,v} \ge \sum \limits _{w}^{{W}} L^{r,w}_{u,v}; \ \ \forall \ r \in R, \ \ u,v \in V. \end{aligned}$$
(26)

Each lightpath is a component of a connection, so the variable \(LP^{r,h}_{u,v}\) is necessary for indicating when a lightpath is used in a connection. It is important to note that a lightpath can be used for multiple connections in a request simultaneously. LP is equal to 1 when it is both used in at least one connection (23) and (24), and the lightpath L is established for the solution (25) and (26).

$$\begin{aligned}&G^{r,d,h}_{n,u,v} \ge P^{r,d,h}_{u,v}+ I^{r,n}_{u,v} -1; \ \ \forall \ r \in R, \nonumber \\&n,u,v \in V, \ d \in D_r, \ \ h \in H. \end{aligned}$$
(27)
$$\begin{aligned}&G^{r,d,h}_{n,u,v} \le P^{r,d,h}_{u,v}; \ \ \forall \ r \in R, \ \ n,u,v \in V, \ d \in D_r, \nonumber \\&h \in H. \end{aligned}$$
(28)
$$\begin{aligned}&G^{r,d,h}_{n,u,v} \le I^{r,n}_{u,v}; \ \ \forall \ r \in R, \nonumber \\&n,u,v \in V, \ d \in D_r, \ \ h \in H. \end{aligned}$$
(29)
$$\begin{aligned}&N^{r,d,h} = \sum \limits _{n}^{M_r} \sum \limits _{u}^{V} \sum \limits _{v}^{V} G^{r,d,h}_{n,u,v}; \ \ \forall \ r \in R, \ d \in D_r, \nonumber \\&h \in H. \end{aligned}$$
(30)
$$\begin{aligned}&\sum \limits _{h}^{H} \sum \limits _{u}^{V} \sum \limits _{v}^{V} G^{r,d,h}_{n,u,v} \le 1; \ \ \forall \ r \in R, \ d \in D_r, \ n \in D_r \setminus \{d\}. \end{aligned}$$
(31)

Variable I keeps track of when a node n is in a lightpath (u,v), and variable P indicates when a lightpath (u,v) is used in a connection, so indicator variable G can be used to show when node n is an intermediate node in an end-to-end connection. The value of G is determined through the equivalent of the logical operation \(I \wedge P\) in Constraints (27), (28), and (29). Variable \(N^{r,d,h}\) is then used to store the number of resource nodes in an \(s_r\) to d connection by summing up the value of \(G^{r,d,h}_{n,u,v}\) across each lightpath and node (30). The N variable is used for determining when a backup connection is necessary in constraints (21) and (22). Finally, variable G is essential for determining whether a backup connection is survivable; if a resource node \(d' \ne d\) is in both connections to d, then removing it will cause d to become disconnected from the source. This is prevented through constraint (31), which restricts every other resource to appearing at most once across all connections to resource d.
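That Constraints (27) through (29) reproduce the logical AND can be verified by brute force over the four combinations of the two binary inputs. The snippet below is a minimal check written for this purpose; the function name is hypothetical.

```python
from itertools import product
from typing import List


def feasible_g(p: int, i: int) -> List[int]:
    """Binary values of G allowed by G >= P + I - 1, G <= P and G <= I."""
    return [g for g in (0, 1) if g >= p + i - 1 and g <= p and g <= i]


for p, i in product((0, 1), repeat=2):
    # Exactly one value of G is feasible, and it equals P AND I.
    assert feasible_g(p, i) == [p & i]
    print(p, i, feasible_g(p, i))
```

Because the three inequalities leave a single feasible value in every case, G never needs to appear in the objective to take the intended value.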

3.2 Minimum wavelength-links ILP formulation (ILP-MinWL)

While the presented ILP does minimize the number of wavelengths required on any one link in the network, it does not necessarily minimize alternative costs. One such cost is the number of wavelength-links, i.e., the number of wavelengths used on each link, summed over every link in the network. We present an alternative formulation for the objective of minimizing this cost, while satisfying a static set of multicast requests with logical overlays protected against single resource failure.

3.2.1 Variables

The ILP utilizes Constraints (2) through (31) and requires an additional variable, replacing MaxIndex:

WL :

is an integer variable, representing the number of wavelength-links network-wide. Minimizing this value is the objective.

3.2.2 Constraints

In addition, an alternative objective function is required, along with a new constraint to replace Constraint (1).

Objective function:

                              minimize: WL

Subject to:

$$\begin{aligned} WL = \sum \limits _{r}^{{R}} \sum \limits _{w}^{{W}} \sum \limits _{u}^{{V}} \sum \limits _{v}^{{V}} \sum \limits _{i}^{{V}} \sum \limits _{j}^{{V}} F^{r,w}_{u,v,i,j}. \end{aligned}$$
(32)

Constraint (32) sets the value of WL equal to the number of flows across all links, summing up the value of \(F^{r,w}_{u,v,i,j}\) across all requests, lightpaths, and links. As \(F^{r,w}_{u,v,i,j}\) is a binary variable, with a value of 1 only when wavelength w is used on link (ij) for some lightpath (uv) for a request r, performing this summation is sufficient for determining the number of wavelength-links.
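The difference between the two objectives can be illustrated by evaluating both on a list of active flow variables. In the sketch below each tuple (r, w, u, v, i, j) stands for one variable \(F^{r,w}_{u,v,i,j}\) equal to 1. The helper names and the two toy solutions are assumptions, and the largest wavelength index carrying flow is used as a stand-in for MaxIndex, which presumes that \(C^r_w\) is set only for wavelengths actually in use.

```python
from typing import List, Tuple

# One entry per flow variable equal to 1: (r, w, u, v, i, j).
Flow = Tuple[int, int, int, int, int, int]


def wavelength_links(flows: List[Flow]) -> int:
    """WL from Constraint (32): the number of flow variables set to 1."""
    return len(set(flows))


def max_wavelength_index(flows: List[Flow]) -> int:
    """Largest wavelength index carrying any flow (0 if there are none)."""
    return max((flow[1] for flow in flows), default=0)


# Two hypothetical ways of serving two requests from node 1 to node 3.
solution_a = [(1, 1, 1, 3, 1, 2), (1, 1, 1, 3, 2, 3),
              (2, 2, 1, 3, 1, 2), (2, 2, 1, 3, 2, 3)]
solution_b = [(1, 1, 1, 3, 1, 2), (1, 1, 1, 3, 2, 3),
              (2, 1, 1, 3, 1, 4), (2, 1, 1, 3, 4, 5), (2, 1, 1, 3, 5, 3)]

print(wavelength_links(solution_a), max_wavelength_index(solution_a))  # 4 2
print(wavelength_links(solution_b), max_wavelength_index(solution_b))  # 5 1
```

Solution A shares the short route on two wavelengths and is preferred under ILP-MinWL, while solution B takes a longer detour on a single wavelength and is preferred under ILP-MinWR.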

4 Heuristics

While the ILP provides an optimal solution in terms of minimizing required wavelengths, the run-time growth is exponential, making the ILP infeasible to run on large or well-connected topologies. We briefly describe two heuristics proposed in [25], which provide resource-failure protection approaches for each multicast request with a more reasonable time complexity.

4.1 Steiner minimal tree with failure-avoidance backup (Steiner-FAB)

A traditional SMT provides a minimal solution, in terms of hops or links used, for a multicast request, and while the SMT problem is NP-complete, a solution can be approximated in \(\Theta (|{V}|^3)\) time [28]. While an SMT efficiently connects a source node to all of its resources, the solution found is in no way guaranteed to survive the failure of a resource node. Depending on the paths chosen to connect the member nodes, it is possible that a resource node is chosen as a branching point, disconnecting any nodes along one of the branches should it fail. Due to the minimal nature of an SMT for satisfying multicast requests, the approach is a logical starting point for developing a resource-failure survivable multicast heuristic. Steiner minimal tree with failure-avoidance backup (Steiner-FAB) builds upon a request’s primary SMT, with the addition of backup paths to every vulnerable resource that could become disconnected due to another resource’s failure, and is described by Algorithm 1. The approximation algorithm in [28] is used to build the original SMT in polynomial time.

The algorithm first constructs an empty set of Vulnerable nodes and builds a SMT for the given request r and topology G (Lines 1–2). With the SMT established, Steiner-FAB evaluates the route between the source and each resource (Line 3) \(d_i\) in the tree, finding \(path_i\) to resource \(d_i\) through the route function, a depth-first-search (Line 4). Then, the length of \(path_i\) is stored in the variable len (Line 5). The empty set \(V_{d_i}\) is instantiated, along with a variable \(d_{nearest}\) for storing the nearest resource to resource \(d_i\) (Lines 6–7). Then, each resource \(d_j\) other than \(d_i\) is checked (Line 8), and if \(path_i\) contains \(d_j\) (Line 9), the distance between \(d_i\) and \(d_j\) (Line 10) is stored in set \(V_{d_i}\) with \(d_j\) as tuple \((d_j, len_j)\) (Line 11). With this loop through all other resources \(d_j\), all resource nodes along the path to \(d_i\) are found. Then, as long as at least one intermediate resource node was found (Line 12), the node closest to \(d_i\) is found using the stored distance (Line 13), and a tuple \((d_i, d_{nearest}, len)\) is stored in the set of Vulnerable nodes.
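The vulnerability scan described above can be sketched in Python as follows. This is a minimal illustration that assumes the SMT is already available as an undirected adjacency dictionary; the names route and find_vulnerable and the example tree are hypothetical, and the SMT approximation of [28] is not reproduced here.

```python
from typing import Dict, List, Set, Tuple

Graph = Dict[str, Set[str]]


def route(tree: Graph, src: str, dst: str) -> List[str]:
    """Depth-first search for the path from src to dst inside the tree."""
    stack = [(src, [src])]
    seen = {src}
    while stack:
        node, path = stack.pop()
        if node == dst:
            return path
        for nxt in tree.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                stack.append((nxt, path + [nxt]))
    return []


def find_vulnerable(tree: Graph, source: str,
                    resources: List[str]) -> List[Tuple[str, str, int]]:
    """Return (d_i, d_nearest, len) for each resource whose tree path
    from the source traverses at least one other resource."""
    vulnerable = []
    for d_i in resources:
        path_i = route(tree, source, d_i)
        length = len(path_i) - 1  # hops from the source to d_i
        # Other resources on the path, paired with their hop distance to d_i.
        on_path = [(d_j, length - path_i.index(d_j))
                   for d_j in resources if d_j != d_i and d_j in path_i]
        if on_path:
            d_nearest = min(on_path, key=lambda pair: pair[1])[0]
            vulnerable.append((d_i, d_nearest, length))
    return vulnerable


# Hypothetical tree: S-R1, S-X, X-R2, R2-R3, so R3 is reached through R2.
smt = {"S": {"R1", "X"}, "R1": {"S"}, "X": {"S", "R2"},
       "R2": {"X", "R3"}, "R3": {"R2"}}
vulnerable = find_vulnerable(smt, "S", ["R1", "R2", "R3"])
```

In this example only R3 is reported, together with R2 as the resource whose failure would disconnect it and its 3-hop distance from the source.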

Once that process has been repeated for each resource \(d_i\), the procedure for protecting vulnerable nodes is repeated until no vulnerable resources remain (Line 15). Using the len value stored in each tuple in Vulnerable, the mostVulnerable resource node tuple is determined (Line 16). A subgraph \(G'\) is then constructed, using all nodes in V except for the \(d_{nearest}\) stored in the mostVulnerable tuple (Line 17), and the ShortestPath from s to \(d_i\) in \(G'\) is found and added to the original SMT \(t_r\) (Line 18). The mostVulnerable tuple is removed from the Vulnerable set (Line 19), and then for each tuple v remaining in the Vulnerable set (Line 20), all paths \(Paths_i\) within the tree \(t_r\) to the resource node \(d_i\) are found (Line 21). If there is more than one path to node \(d_i\) (Line 22), then for each \(path_p\) in that set of paths (Line 23), if the path does not contain the previously identified dangerous node \(d_{nearest}\), then the resource \(d_i\) is considered safe and the associated tuple v is removed from the Vulnerable set (Lines 24–27). Finally, once the Vulnerable set is empty, all resource nodes have been protected through the addition of backup paths, and the survivable modified tree \(t_r\) is returned (Line 28). An illustrative example of how Steiner-FAB would establish such a protected solution on a SMT with one vulnerable member is shown in Fig. 2. The run-time complexity of Algorithm 1 is bounded by the time taken to build the initial SMT, \(O(|V|^3)\), the time taken to identify vulnerable nodes, \(O(|V|^3)\), and the time taken to route the set of O(|V|) backup paths, each requiring \(O(|V|\log |V|)\). Therefore, the total worst-case run-time complexity is \(O(|V|^3 + |V|^3 + |V|(|V|\log |V|))\), or simply \(O(|V|^3)\).
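The backup-path phase can be illustrated with the following Python sketch. It rests on several assumptions that the text does not fix: the most vulnerable tuple is taken to be the one with the largest len, shortest paths are measured in hops with a breadth-first search, the "all paths" check is replaced by an equivalent reachability test that bans \(d_{nearest}\), and an exception is raised when no backup path exists. The function names and the example topology are hypothetical.

```python
from collections import deque
from typing import Dict, List, Optional, Set, Tuple

Graph = Dict[str, Set[str]]


def shortest_path(graph: Graph, src: str, dst: str,
                  banned: Set[str]) -> Optional[List[str]]:
    """Breadth-first shortest path (in hops) that avoids every banned node."""
    if src in banned or dst in banned:
        return None
    parent: Dict[str, Optional[str]] = {src: None}
    queue = deque([src])
    while queue:
        node = queue.popleft()
        if node == dst:
            path = []
            while node is not None:
                path.append(node)
                node = parent[node]
            return path[::-1]
        for nxt in graph.get(node, ()):
            if nxt not in parent and nxt not in banned:
                parent[nxt] = node
                queue.append(nxt)
    return None


def add_backup_paths(graph: Graph, tree: Graph, source: str,
                     vulnerable: List[Tuple[str, str, int]]) -> Graph:
    """Add one backup path per vulnerable resource until none remain."""
    tree = {n: set(nbrs) for n, nbrs in tree.items()}  # work on a copy
    pending = list(vulnerable)
    while pending:
        # Assumption: "most vulnerable" means the largest stored len.
        most = max(pending, key=lambda t: t[2])
        pending.remove(most)
        d_i, d_nearest, _ = most
        backup = shortest_path(graph, source, d_i, {d_nearest})
        if backup is None:
            # Assumption: the sketch treats a missing backup path as an error.
            raise ValueError(f"no backup path to {d_i} avoiding {d_nearest}")
        for a, b in zip(backup, backup[1:]):
            tree.setdefault(a, set()).add(b)
            tree.setdefault(b, set()).add(a)
        # Keep only resources that still cannot be reached without d_nearest.
        pending = [t for t in pending
                   if shortest_path(tree, source, t[0], {t[1]}) is None]
    return tree


# Hypothetical topology: the tree S-R1, S-X, X-R2, R2-R3 plus spare links S-Y, Y-R3.
G = {"S": {"R1", "X", "Y"}, "R1": {"S"}, "X": {"S", "R2"},
     "R2": {"X", "R3"}, "R3": {"R2", "Y"}, "Y": {"S", "R3"}}
smt = {"S": {"R1", "X"}, "R1": {"S"}, "X": {"S", "R2"},
       "R2": {"X", "R3"}, "R3": {"R2"}}
protected = add_backup_paths(G, smt, "S", [("R3", "R2", 3)])
```

Here the backup path S, Y, R3 is added to the tree, so R3 stays connected to S when R2 fails.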

Fig. 2

Steiner-FAB algorithm first constructs a SMT (solid red) from source S to resources R1, R2, and R3. The algorithm then checks each resource node for vulnerability and identifies the single vulnerable multicast resource, R3. The backup lightpath (dashed gold) directly from S to node R3 is established to avoid the potential failure node R2 that, when removed, would disconnect S from R3. If either R1 or R2 fails, R3 will remain connected to source S (Color figure online)

Algorithm 1: Steiner-FAB pseudocode (listing omitted)

4.2 Critical resource biconnective survivability (CRB)

While Steiner-FAB does not add significant complexity to the SMT approach while providing the desired protection, it is possible to establish a more efficient solution through constructing a series of paths that prevents disconnection in the event of a member failure, without requiring additional backup paths after construction. Our Critical Resource Biconnective Survivability (CRB) solution, proposed in [25], aims to establish a subgraph \(G'\) from physical topology G for each request r in such a way that there is an alternate path to every \(d \in {D_r}\), the set of resources, from \(s_r\), the source, should it be impossible to establish a direct path from \(s_r\) to d without traversing another member of \(D_r\). With these alternate paths, should any one resource fail, there will always be a path available to every remaining d from \(s_r\). It is important to note that there is no guarantee that a protected solution \(G'\) can be found if the underlying physical topology G is itself not biconnected.
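The survivability property stated above can be checked directly on a candidate solution. The following Python sketch, with hypothetical function names and example graphs, removes each resource in turn and tests whether every remaining resource is still reachable from the source; it is a verification aid, not part of the CRB construction itself.

```python
from collections import deque
from typing import Dict, List, Set

Graph = Dict[str, Set[str]]


def reachable(graph: Graph, start: str, removed: str) -> Set[str]:
    """Nodes reachable from start by breadth-first search once `removed` has failed."""
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in graph.get(node, ()):
            if nxt != removed and nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


def survives_single_resource_failure(subgraph: Graph, source: str,
                                     resources: List[str]) -> bool:
    """True if, after any one resource fails, all others still reach the source."""
    for failed in resources:
        alive = reachable(subgraph, source, failed)
        if any(d not in alive for d in resources if d != failed):
            return False
    return True


# Hypothetical solutions: a chain S-R1-R2 and a ring S-R1-R2-S.
chain = {"S": {"R1"}, "R1": {"S", "R2"}, "R2": {"R1"}}
ring = {"S": {"R1", "R2"}, "R1": {"S", "R2"}, "R2": {"R1", "S"}}
chain_ok = survives_single_resource_failure(chain, "S", ["R1", "R2"])
ring_ok = survives_single_resource_failure(ring, "S", ["R1", "R2"])
```

The chain fails the check because R2 is cut off when R1 fails, while the ring passes because each resource has a second route to the source.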

CRB algorithm pseudocode (listing omitted)

The CRB algorithm begins by creating a set of ShortestPaths (Line 1). Then, for each possible member node pair \((m_i, m_j)\), the set \(NonPairMembers = (D_r \cup \{s_r\}) \setminus \{m_i, m_j\}\) is computed (Line 2), and a subgraph that does not contain those nodes is found (Line 3). With this subgraph, the ShortestPath between the node pair \((m_i, m_j)\) can be found, if one exists, providing the shortest path between those two nodes that does not contain any other member node (Line 4). With this set of ShortestPaths, a new logical topology H is constructed, where the new set of nodes \(V'\) is made up of all multicast members for the request r, and there is a link \((i, j) \in L\) between each pair of members that has a corresponding shortest path SP in ShortestPaths (Line 5). A weight is assigned to each link in L equal to the length of the corresponding shortest path in ShortestPaths (Lines 6–7). Following that, each logical edge adjacent to source \(s_r\) is added to the logical solution set, LogicalSolutionEdges (Lines 8–10).
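The construction of the logical topology H can be sketched in Python as follows. This is a minimal illustration assuming hop-count shortest paths found by breadth-first search; the function names, the return layout, and the six-node ring used as the physical topology are assumptions for this example.

```python
from collections import deque
from itertools import combinations
from typing import Dict, List, Optional, Set, Tuple

Graph = Dict[str, Set[str]]
Pair = Tuple[str, str]


def shortest_path(graph: Graph, src: str, dst: str,
                  banned: Set[str]) -> Optional[List[str]]:
    """Breadth-first shortest path (in hops) that avoids every banned node."""
    parent: Dict[str, Optional[str]] = {src: None}
    queue = deque([src])
    while queue:
        node = queue.popleft()
        if node == dst:
            path = []
            while node is not None:
                path.append(node)
                node = parent[node]
            return path[::-1]
        for nxt in graph.get(node, ()):
            if nxt not in parent and nxt not in banned:
                parent[nxt] = node
                queue.append(nxt)
    return None


def build_logical_topology(graph: Graph, source: str, resources: List[str]):
    """Return the member-to-member shortest paths, the logical edge weights,
    and the logical edges adjacent to the source."""
    members = [source] + list(resources)
    shortest_paths: Dict[Pair, List[str]] = {}
    for m_i, m_j in combinations(members, 2):
        non_pair_members = set(members) - {m_i, m_j}
        sp = shortest_path(graph, m_i, m_j, non_pair_members)
        if sp is not None:
            shortest_paths[(m_i, m_j)] = sp
    weights = {pair: len(sp) - 1 for pair, sp in shortest_paths.items()}
    source_edges = [pair for pair in weights if source in pair]
    return shortest_paths, weights, source_edges


# Hypothetical physical ring: S-A-R1-R2-R3-B-S.
G = {"S": {"A", "B"}, "A": {"S", "R1"}, "R1": {"A", "R2"},
     "R2": {"R1", "R3"}, "R3": {"R2", "B"}, "B": {"R3", "S"}}
paths, weights, source_edges = build_logical_topology(G, "S", ["R1", "R2", "R3"])
```

In this ring, R2 has no logical edge to S because every physical path between them crosses R1 or R3, which is exactly the situation the next step of CRB treats as vulnerable.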

Adding those logical edges ensures that there is an uninterrupted physical path between source \(s_r\) and those logically adjacent resource nodes, but there may be resources that cannot be reached without traversing another resource node. Those nodes are considered Vulnerable (Line 11), and the resource nodes which could cause a disconnection will be put into the set Failure (Line 12). For every resource node d (Line 13), if there is no logical edge directly connecting the source to that resource node (Line 14), then that resource node is marked as Vulnerable (Line 15), a logical path is found to that node d through depth-first search (Line 16), and each resource node in that path is added to the set of Failure nodes (Lines 17–18). A BiconnectedSolution is then found by running the Minimum-Cost 2-Vertex Connected (MC2VC) approximation algorithm on a subgraph where the nodes are \(\{s_r\} \cup Vulnerable \cup Failure\) (Line 19). This yields an approximately minimum-cost biconnected subgraph, constructed so that it can survive the removal of any one node. The edges in that BiconnectedSolution are added to the set of LogicalSolutionEdges if they are not already included in the set (Lines 20–22). A subgraph of logical graph H is then constructed, consisting only of member nodes \(V'\) and the LogicalSolutionEdges (Line 23). The corresponding PhysicalEdges and PhysicalNodes (Lines 24–25) are then found through mapping the LogicalSolutionEdges back to the physical topology (Lines 26–28). The physical subgraph \(G' = (PhysicalNodes, PhysicalEdges)\) is then returned as a solution protected against the failure of any multicast resource node.

Fig. 3

Critical resource biconnective (CRB) survivability. a Physical topology G. b Logical topology H. c Pruned logical topology H’. d Pruned physical topology G’

An example conversion from physical topology G to logical topology H is shown in Fig. 3a, b. H is pruned in Fig. 3c and then mapped back to the physical topology as an established circuit in Fig. 3d. The time complexity of CRB is dominated by the optimized \(O(|V|^3)\) complexity of the minimum-cost 2-vertex connectivity problem. Including the O(|E|) complexity from each of the conversions from G to H and from \(H'\) to \(G'\), and the \(O(|V|^2)\) time to identify vulnerable and potential failure nodes, the total complexity of this approach is \(O(|V|^3 + |V|^2 + |E|)\), which can be reduced to \(O(|V|^3 + |E|)\) [25].

5 Results and analysis

In this section, we quantitatively examine and compare the performance of the two presented resource-failure protection multicast ILPs (henceforth referred to as ILP-MinWR, for minimizing wavelengths required on any link network-wide, and ILP-MinWL, for minimizing the number of wavelength-links in the network) and both of the heuristics. In addition, the SMT approximation presented in [28] is considered alongside the heuristics and ILP. Even though the SMT solves only the multicast problem, not the survivable version, it is useful to compare this minimal multicast solution to the survivable solutions to give an approximate lower bound for wavelength consumption and other metrics.

Fig. 4

Fourteen-node NSFNet topology. The topology has 21 edges, an average path length of 2.14 hops, a maximum nodal degree of 4, a minimum nodal degree of 2, an average nodal degree of 3, and a path diameter of 3 hops

Fig. 5

Twenty-five-node Manhattan topology. The topology has 40 edges, an average path length of 3.33 hops, a maximum nodal degree of 4, a minimum nodal degree of 2, an average nodal degree of 3.2, and a path diameter of 8 hops

The heuristic simulations, implemented in Python, and the ILPs, which were implemented with AMPL and solved using the Gurobi version 5.6.3 optimization software package, are run on both the 14-node National Science Foundation (NSFNet) topology shown in Fig. 4 and a symmetrical 25-node Manhattan topology depicted in Fig. 5. It is assumed that for each link (i, j) in the topology, there is a fiber available in both directions, each possessing its own set of available wavelengths. When comparing the ILP solutions and the heuristics, we generated 30 request sets, using 30 different seed values, each set comprising 5 requests, and each request having a randomly selected source and two resource nodes drawn from a uniform random distribution. The same was done to generate sets which required three, four, and five uniformly randomly selected resources. The ILPs and heuristics were evaluated only for sets of 5 requests, due to the high complexity of integer linear programming, with the values presented here averaged across the 30 seeds. The CRB, Steiner-FAB, and SMT algorithms were run on the same topology for the same sets of requests, the results of which are shown in Table 1 for the 14-node NSFNet and Table 2 for the 25-node Manhattan network. We additionally compared the performance between just the heuristics, with a much higher request set size of 1000 requests and with resource requirements of up to seven resources, on the same topologies. The results are shown in Table 3 for NSFNet and in Table 4 for the larger Manhattan network.

Table 1 Comparison between the ILP and the CRB, Steiner-FAB, and SMT heuristics on the NSFNet 14-node topology
Table 2 Comparison between the ILP and the CRB, Steiner-FAB, and SMT heuristics on the Manhattan 25-node topology
Table 3 Comparison between the CRB, Steiner-FAB, and SMT algorithms on the NSFNet 14-node topology
Table 4 Comparison between the CRB, Steiner-FAB, and SMT algorithms on the Manhattan 25-node topology
Table 5 General ranking in descending order of each approach for each metric considered

Eight metrics are compared: (1) the minimum number of wavelengths required on each link to provision all requests in a set; (2) the number of wavelength-links (the summation across all links, of the number of wavelengths provisioned per link) required network-wide to provision all requests in a set; (3) the average number of links utilized (had at least one wavelength allocated) per request; (4) the average diameter, or the minimum number of hops to reach the furthest resource from the source, per request; (5) the average path length difference between the source and the closest/furthest resource per request, also known as jitter, which can be important when considering delay for data arrival between all resource nodes; (6) the average number of resource nodes that must be removed from a provisioned request to disconnect the source from the remaining resource(s). If this average value is lower than 2, that indicates that some requests provisioned with that method can be disconnected with only 1 resource removed; (7) the number of requests in a set that are protected against single resource failure; (8) the running time, in seconds, for completing an entire request set given a certain algorithm.
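Metrics (4) and (5) can be made concrete with a short Python sketch that computes the per-request diameter and jitter from hop distances in the provisioned subgraph. The function names and the two example subgraphs are assumptions for illustration; the reported values in the tables are averages of such per-request quantities.

```python
from collections import deque
from typing import Dict, List, Set, Tuple

Graph = Dict[str, Set[str]]


def hop_distances(graph: Graph, source: str) -> Dict[str, int]:
    """Minimum hop count from the source to every reachable node (BFS)."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for nxt in graph.get(node, ()):
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
    return dist


def diameter_and_jitter(graph: Graph, source: str,
                        resources: List[str]) -> Tuple[int, int]:
    """Diameter: hops to the furthest resource.
    Jitter: difference in hops between the furthest and closest resource."""
    dist = hop_distances(graph, source)
    hops = [dist[d] for d in resources]
    return max(hops), max(hops) - min(hops)


# Hypothetical tree S-R1, S-X, X-R2, R2-R3 ...
tree = {"S": {"R1", "X"}, "R1": {"S"}, "X": {"S", "R2"},
        "R2": {"X", "R3"}, "R3": {"R2"}}
# ... and the same tree with a backup path S-Y-R3 added.
protected = {"S": {"R1", "X", "Y"}, "R1": {"S"}, "X": {"S", "R2"},
             "R2": {"X", "R3"}, "R3": {"R2", "Y"}, "Y": {"S", "R3"}}
tree_metrics = diameter_and_jitter(tree, "S", ["R1", "R2", "R3"])
protected_metrics = diameter_and_jitter(protected, "S", ["R1", "R2", "R3"])
```

In this example the backup path shortens the route to the furthest resource, lowering both diameter and jitter relative to the unprotected tree.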

The general ranking of the approaches is presented in Table 5. The ranking may change based upon the topology or the number of requests, with the SMT and Steiner-FAB solutions often swapping position, but the presented ordering generally holds. ILP-MinWR and ILP-MinWL score the best among the survivable approaches when it comes to the wavelengths required to provision all requests, and the number of wavelength-links consumed to provision a request set, respectively. The wavelengths required, wavelength-links, average links utilized, diameter, and jitter all increase as a greater number of resources must be reached per request. Each request must, on average, be provisioned a larger proportion of the network as the resource set size increases, so all approaches, as expected, perform more poorly. Among the survivable heuristics, CRB outperforms Steiner-FAB in terms of wavelength-links and the average number of links utilized per request, while Steiner-FAB performs better when considering the number of hops between the source and the resources. CRB is less costly in terms of the number of wavelengths/links consumed, but the average time to transfer data to the furthest resources from the source will likely be lower with Steiner-FAB. CRB, as it utilizes the Minimum-Cost-K-Vertex-Connected-Subgraph algorithm as a component, can survive a greater number of resource node failures on average than any other approach, as CRB paths are often established to ensure that resources can connect not only to the source, but to each other. All survivable approaches (ILP-MinWR, ILP-MinWL, CRB, and Steiner-FAB) provide only protected solutions, while SMT is in no way guaranteed to provide survivable trees. SMT, given its minimal nature, solves the multicast problem in the shortest time, by far, and is followed by Steiner-FAB and CRB, which both have additional requirements beyond connecting the source to its resources. The ILP-MinWL and ILP-MinWR approaches both require significantly more time than the heuristics regardless of the topology, even when the heuristics have to handle a greatly increased number of requests in Tables 3 and 4.

When moving from NSFNet to the larger Manhattan network, the number of wavelengths required per link network-wide increases slightly, but the number of wavelength-links consumed per request set is, in the worst case with ILP-MinWR, almost doubled. The average number of links utilized per request experiences a slightly smaller growth, while the average request diameter and jitter scale with the increased average path length in the Manhattan network compared to the NSFNet topology. The number of resource failures required to disconnect an established session does experience a slight increase as the network grows larger, which can be tied to the greater path length in the network. Resource nodes, which are chosen uniformly, may end up more “spread out” in a larger network, increasing the number of failures required to disconnect a solution completely. For the SMT approach, which is the only non-survivable algorithm, the average number of requests in a request set which are protected against the failure of any single resource node increases slightly as the size of the network increases. The running times for ILP-MinWR, by far the worst due to the difficulty of minimizing the maximum number of wavelengths consumed on any link in the network, and ILP-MinWL, both increase as the topology size increases, and almost converge as the number of resources in a set increases. SMT, meanwhile, always completes in the shortest amount of time, taking slightly longer in a larger topology. Steiner-FAB follows a similar pattern of growth, but CRB scales at a much higher rate in terms of running time compared to other heuristics. This is related to the greater computational complexity of CRB in comparison with SMT and Steiner-FAB, and the running time of CRB accordingly increases greatly as a larger number of resource nodes are required per request.

While SMT consumes fewer wavelength-links on average, it is important to keep in mind that the SMT solutions are often not protected against the failure of a single resource node. The ILPs and heuristics, on the other hand, always require at least 2 resources to fail before the established request is considered disconnected. In addition, Steiner-FAB, which builds backup paths alongside a SMT to provide survivability, provisions requests in such a way as to provide the lowest average diameter and jitter among the examined approaches. This is a by-product of the backup paths: resources further from the source in a SMT are more likely to have another resource present as an intermediate node in the SMT, so they often require a backup path directly from the source. The SMT, while reducing the number of hops to connect the source to all resources with one tree, does not necessarily use the shortest path from the source to any particular resource. The additional backup paths, while not necessarily minimal, can be shorter than the established path in the SMT to the furthest resources. In several ways, the heuristics appear to be satisfactory substitutes for the ILP, as they outperform the ILP in several metrics, and can run much more efficiently at scale.

Overall, it can be seen that CRB allows wavelengths to be used more efficiently than Steiner-FAB for the same level of protection, but the greater end-to-end delay could be a detrimental factor for time-sensitive multicast scenarios. SMT outperforms both survivable heuristics on nearly all fronts, being beaten only by Steiner-FAB in terms of diameter and jitter, as previously mentioned, and by both CRB and Steiner-FAB in terms of the number of resource failures the provisioned requests can withstand, on average. The relationship between SMT and the two survivable heuristics perfectly demonstrates the trade-off a user can expect when survivability is a requirement: the survivable methods are likely to be far more inefficient in terms of cost. This trade-off between survivability and cost must be weighed when deciding which methods to use for provisioning network requests. When comparing survivable heuristics, CRB tends to be more cost-effective in terms of wavelengths, but is more costly in terms of delay and running time. If a large number of requests, or a greater number of resources, must be protected, Steiner-FAB may be chosen over CRB if time is a concern. On the other hand, if protection against multiple resource node failure is a priority, CRB, on average, is more resilient against multiple failures than every other approach.

6 Conclusion

The point-to-multipoint nature of the multicast communication paradigm plays an important role in supporting a wide variety of networking applications, including cloud-based services, streaming media, and distributed storage or retrieval. The high bandwidth available in optical WDM networks makes them an excellent candidate for supporting the paradigm. However, the most efficient solution for provisioning multicast requests may create points of vulnerability that can lead to loss of data or service. We have proposed two optimal solutions through ILP to solve this issue in the static case of provisioning an entire set of multicast requests in networks that do not have optical-level multicast splitters available, and we compared their performance to two survivable heuristics, finding that the heuristics solve the same problem with a slightly higher wavelength consumption, but in much less time. The built-in redundancy provided by these solutions is guaranteed to protect any single request against the failure of one of its resources in well-connected networks. When a demand requires guaranteed survivability, due to either its importance, its size, or both, these methods can secure network transmissions against a potentially devastating type of failure, should the trade-off in terms of cost be acceptable.

Future areas of work include further simulation and evaluation of performance for these approaches on larger topologies and for more sizable request sets. It is possible that the relationship between the heuristics in terms of resource consumption could vary based on topology, and the optimal solution may outperform the other approaches to an even greater degree, although the running time is likely to increase significantly on more complex topologies. Additional objectives, such as minimizing the diameter of solutions, can also be considered and formulated. How the proposed methods perform in other types of networks, such as the increasingly researched Elastic Optical Networks, where the spectrum allocated per request can be flexibly tailored to meet a demand, will be examined. The approaches presented in this paper did not consider blocking, so future work can include updated versions of the heuristics which are able to prioritize blocking reduction. Going beyond just the static problem of provisioning a known set of requests, the dynamic problem can be considered, where requests are satisfied as they become known, and failure events and recovery times can be simulated following a probabilistic model.