1 Introduction

The introduction of new applications has been pushing the boundaries of the throughput and latency that current networks can provide. It is expected that by 2022, 82% of all Internet traffic will be video, and more than half of these videos will be watched in Ultra High Definition (UHD) [5], greatly increasing the network traffic. In addition, latency-critical applications, such as gaming, are expected to grow at 55% per year [5], requiring application servers to be ever closer to users. Other applications, such as those supporting Machine-to-Machine (M2M) communication and the Internet of Things (IoT), may require a mix of latency-critical or Ultra Low Latency Reliable Communication (ULLRC) [30] and/or high bit rates. These new applications are sensitive to network performance degradation, such as an increase in latency and/or a decrease in throughput. For instance, while traditional services (e.g. e-mail and Web pages) remain functional under latency degradations of several tens or hundreds of milliseconds, new applications (e.g. self-driving cars, interactive video streaming) cannot afford a latency degradation of even a few milliseconds.

The 5th generation of networks (5G) is expected to provide specialized slices with appropriate QoS to support the execution of these bit-rate-hungry, latency-constrained applications [27]. Driven by the proliferation of data-center-based communication paradigms, network traffic profiles have been shifting from unicast to anycast or even multicast, and metrics tailored for unicast traffic may not effectively measure the ability of a network to support these new traffic patterns [5, 22]. Moreover, optical networks, the technology of choice for high-capacity networks, have been adopting elastic bandwidth allocation, i.e. Elastic Optical Networks (EONs), which adapt the modulation format to the particularities of the channel (e.g. path length and physical-layer impairments), enabling shorter paths to adopt more efficient modulation formats [3]. However, small variations in path length may disrupt the communication, forcing the use of a less efficient modulation format [2]. Finally, some applications may be provisioned with a very tight latency budget, making them very sensitive to any increase in path length and the consequent increase in propagation latency.

Many studies in the literature have analyzed the reliability of the network infrastructure using structural measures. These measures are related to the network topology and usually rely on graph theory, such as Average Two-Terminal Reliability (ATTR) and the size of the giant component [7, 20, 26, 28]. However, metrics based solely on the topology only assess whether nodes can communicate at all. By focusing only on the structure of the network topology, these measures may not fully capture the disruptions caused by disasters to the applications running over the network, i.e. their QoS. Moreover, even if an application remains functional after a disaster, the user experience may be affected, an aspect also not captured by structural measures.

This chapter presents an overview of the functional metrics used to evaluate the network QoS and QoE. A case study introduces an approach that assesses the impact of a disaster on application functionality and on perceived user experience. The assessment demonstrates that new applications (such as video in the specific case study) can be severely degraded by disasters even when their impact on the network structure is minor.

2 Metrics for Functional Evaluations

In computer networks, functional metrics measure the performance of the network infrastructure as perceived by what runs on top of it (e.g. overlay networks, network applications, and end users). For functional metrics, it is not enough that nodes are connected; the connections must correctly support the functioning of the application. Functional metrics can be classified into objective and subjective, a classification we introduce in this section.

2.1 Objective Metrics

Objective functional metrics measure the network performance in objective terms, which can usually be obtained from networking devices and/or specific monitoring equipment. These metrics are closely related to QoS metrics and are usually collected at the transport layer (Layer 4) of the network model. Since they are easy to collect, objective functional metrics are often used to define the terms of Service Level Agreements (SLAs).

2.1.1 Packet-Level Metrics

Several applications may suffer degradation when the transmitted packets are delayed, lost, or corrupted. Such degradations may be caused by a disaster that disrupts part of the network infrastructure. For instance, applications such as Voice over IP (VoIP) over the User Datagram Protocol (UDP) may lose part of the audio if packets are lost, while other applications may experience severe degradation if the latency increases beyond certain limits. The main objective metrics at the packet level are described as follows; a short sketch illustrating their computation is given after the list.

  • Transmission time over a link [16] represents the amount of time necessary to push a message (all bits of the message) into a link (channel).

  • Propagation time over a link [16] is a metric that represents the amount of time a message takes to travel through the link (channel).

  • Latency/delay [4] of a message represents the sum of the transmission and propagation times over all links from the sender to the receiver. We assume that the message is transmitted through all links of the path and reaches its destination; otherwise, it is retransmitted. Note that the delay from the sender to the receiver can be considered at two different levels: the network layer and the application layer (especially in overlay networks).

  • Jitter [9] is the variation in network delay caused by factors such as fluctuations in queuing and scheduling delays at intermediate network elements. In general, jitter is the variation of the one-way delay for two consecutive packets. In static audio/video applications (e.g. listening to music or watching a video), de-jitter buffers are used to remove the delay variation introduced by the network. However, many applications, such as video-conferencing and gaming, require very low latency and cannot use buffers to alleviate the effects of jitter.

  • Packet loss ratio [18] is a metric that represents the ratio of packets not correctly delivered at their destination over the total number of packets sent. Packets may not reach the receiver either due to errors in transmission or network congestion.

  • Retransmission rate [16] is a metric that represents the percentage of retransmitted packets over the total number of packets transmitted from the sender to the receiver.
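
A minimal sketch of these packet-level metrics, computed directly from their definitions, is shown below; the link parameters in the example are hypothetical.

```python
# Illustrative computation of the packet-level metrics above.
# All parameter values are hypothetical examples.

def transmission_time(message_bits: float, link_rate_bps: float) -> float:
    """Time to push all bits of a message into the link (seconds)."""
    return message_bits / link_rate_bps

def propagation_time(link_length_m: float, signal_speed_mps: float = 2e8) -> float:
    """Time for a bit to travel across the link (seconds)."""
    return link_length_m / signal_speed_mps

def end_to_end_latency(message_bits: float, links) -> float:
    """Latency: sum of transmission and propagation times over all links.

    `links` is a list of (rate_bps, length_m) tuples describing the path.
    """
    return sum(transmission_time(message_bits, rate) + propagation_time(length)
               for rate, length in links)

def packet_loss_ratio(packets_sent: int, packets_delivered: int) -> float:
    """Ratio of packets not correctly delivered over the total sent."""
    return (packets_sent - packets_delivered) / packets_sent

# Example: a 1500-byte packet over two 10 Gb/s links of 100 km each.
path = [(10e9, 100e3), (10e9, 100e3)]
print(end_to_end_latency(1500 * 8, path))  # ~1 ms, dominated by propagation
```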

2.1.2 Network-Based Metrics

During network operation, and especially in disaster scenarios, network-based metrics are essential to understand the status of the network, i.e. the status of links and nodes, and how they determine the capabilities of the network as a whole. These network-based metrics might be used, for instance, to define which strategy should be used for the routing of new services, or whether a reconfiguration of services is necessary to prevent bottlenecks [15, 29, 35]. In the following, the main objective metrics at the network level are described; illustrative sketches of selected metrics are given after the list.

  • Link utilization [4] is a metric that indicates the percentage of the total link capacity in use.

  • Link load in overlay networks [34] is a metric that represents the number of overlay links that pass through a physical link. A link with a larger load is more important than a link with a lower load, as its failure generates a larger impact.

  • Node load in overlay networks [34] represents the number of overlay links that pass through a physical node. A node with a larger load is more important than a node with a lower load, as its failure generates a larger impact.

  • Congestion [18] is a metric that represents the ratio between the average of the best (lowest) Round Trip Times (RTTs) \(\textit{RTT}_i\) over all \(n\) active hops between the sender and the receiver and the best (lowest) RTT at the receiver (the last hop), \(\textit{RTT}_{\textit{last}}\):

    $$\begin{aligned} F_{\textit{Congestion}} = \frac{\sum _{i=1}^{n} \textit{RTT}_i}{n \cdot \textit{RTT}_{\textit{last}}} \end{aligned}$$
    (2.1)

    While congestion on continuous, stable, and ascending forwarding paths is expected to fall within the open interval (0, 1), the total range of the congestion metric is \((0,\infty )\). If the sender is one hop away from the receiver (i.e. they are neighbors), the congestion value is 1. If there is congestion along the path from the sender to the receiver, \(F_{\textit{Congestion}}\) is greater than 1.

  • Path stability (PST) [18] indicates how stable a forwarding path is over time. It is defined as the product of the packet loss ratio and the congestion metric. Values within the open interval (0, 1) indicate optimal, non-congested paths, although in general \(\textit{PST} \in (0,\infty )\). This metric can thus be used to select the best path among those within the interval (0, 1), as the remaining paths (i.e. paths with \(F_{\textit{Congestion}} > 1\)) are congested.

  • Path symmetry (PSY) [18] is a measure that quantifies how symmetrical the paths between the sender and the receiver are, i.e. it expresses the balance between end-to-end latency and hop count on the forwarding and reverse paths. This metric is important in distributed systems, where it is often unexpected (or unwanted) that the path from the sender to the receiver differs from the reverse path [14]. Therefore, the path symmetry \(\textit{PSY}\) considers both the hop count and the RTT in both directions:

    $$\begin{aligned} \textit{PSY} = \frac{n}{n'} \cdot \frac{\textit{RTT}_{\textit{last}}}{\textit{RTT}_{\textit{last}}'} \end{aligned}$$
    (2.2)

    where \(n'\) and \(\textit{RTT}_{\textit{last}}'\) denote the total hop count and the lowest RTT of the reverse path, respectively. Based on (2.2), one can determine whether a path is perfectly symmetric (\(\textit{PSY} = 1\)), whether the reverse path is relatively inflated (\(\textit{PSY} < 1\)), or whether the forwarding path is relatively inflated (\(\textit{PSY} > 1\)).

  • Link stress [4] counts the number of times a packet crosses the same physical link, which is greater than or equal to 1 in overlay networks. For example, as shown in Fig. 2.1, a message from Node N1 to Node N5 needs to cross link \(<\texttt {R1}, \texttt {R2}>\) twice.

  • Stretch or relative delay penalty [4, 17] measures the delay overhead of the overlay compared to the underlying network. It is the ratio between the delay of two nodes in the overlay network and the delay of the same two nodes in the underlying network. Using the example shown in Fig. 2.1, messages from N2 \(\rightarrow \) N3 have a total cost of 730 (going through \(\texttt {N2} \rightarrow \texttt {R2} \rightarrow \texttt {R1} \rightarrow \texttt {R5} \rightarrow \texttt {N5} \rightarrow \texttt {R5} \rightarrow \texttt {R3} \rightarrow \texttt {R4} \rightarrow \texttt {N4} \rightarrow \texttt {R4} \rightarrow \texttt {R3} \rightarrow \texttt {N3}\)), while the underlying network generates a total cost of 470 (path \(\texttt {N2} \rightarrow \texttt {R2} \rightarrow \texttt {R4} \rightarrow \texttt {R3} \rightarrow \texttt {N3}\)), yielding a stretch of \(730/470 \approx 1.55\).

  • Average content accessibility ACA [22] quantifies the ability of a network to deliver anycast traffic when its infrastructure is subject to a disaster scenario. This ability is largely influenced by the placement of the replicas of the service provided by the network. Different from the ATTR, which counts the node pairs that are able to communicate in a unicast fashion, Average Content Accessibility (ACA) can be considered a functional metric because it is tailored to assess the communication in anycast traffic, such as Content Distribution/Delivery Networks (CDNs). Content is considered accessible to a requesting (source) node if the requesting node can reach any of the nodes hosting a replica of the requested content. Considering that nodes hosting replicas can also originate requests for the replicas, the ACA value lies in the interval (r/|V|, 1), where r is the number of replicas and |V| the number of nodes. The ACA can be computed in three different cases: best-, worst- and real-case scenarios. These different cases illustrate how the placement decision influences the ability of the network to serve its users. A minimal sketch of the real-case computation is given after Fig. 2.1.

  • Average content accessibility in the best-case scenario ACA-BCS [22] is a theoretical metric that calculates the upper bound on the ACA value for a given network topology, link cut scenario, and a number of content replicas. The ACA reaches its highest value when content is spread across the largest connected components, such that each one of the largest components hosts at least one replica.

  • Average content accessibility in the worst-case scenario ACA-WCS [22] is a theoretical metric that calculates the lower bound on the ACA value for a given network topology and a number of content replicas. The lowest value of ACA occurs in two situations, depending on the relation between the number of replicas and the number and size of connected components. The exact fit occurs when the number of replicas is equal to the number of nodes in a subset of one or more connected components. Placing all replicas in those components leaves the content inaccessible to nodes in all other components. If the replicas do not fit exactly into any subset of connected components, Average Content Accessibility in the Worst Case Scenario (ACA-WCS) is calculated by searching for the best fit of replicas to the smallest connected components.

  • Average content accessibility in a real-case scenario ACA-RCS [22] is a metric that calculates the ACA for a given network topology and placement of the replicas. Different from the Average Content Accessibility in the Best Case Scenario (ACA-BCS) and ACA-WCS, which only consider the number of replicas, the Average Content Accessibility in a Real Case Scenario (ACA-RCS) uses the replica placement information to compute the value of the ACA for that particular placement. The ACA-RCS can be used, for instance, to evaluate how different replica placement strategies may perform with respect to the content accessibility when the network is under a disaster.

  • Mean content accessibility (\(\mu \)-ACA) [23] extends the ACA and evaluates the robustness of anycast traffic of a topology over a range of disasters. It is particularly useful when evaluating the effect of several disaster scenarios over a network topology and server placement.
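
As a minimal illustration of the path-level metrics defined above, the sketch below computes \(F_{\textit{Congestion}}\), PST, and PSY from per-hop RTT measurements. The RTT values are hypothetical; in practice, they would be collected with a traceroute-like tool.

```python
# Hedged sketch of the congestion, path stability (PST), and path
# symmetry (PSY) metrics of [18], following (2.1) and (2.2).

def congestion(best_rtts_per_hop):
    """F_Congestion: average of the best per-hop RTTs, normalized by the
    best RTT of the last hop (the receiver). Values > 1 indicate congestion."""
    n = len(best_rtts_per_hop)
    rtt_last = best_rtts_per_hop[-1]
    return sum(best_rtts_per_hop) / (n * rtt_last)

def path_stability(loss_ratio: float, f_congestion: float) -> float:
    """PST as the product of the packet loss ratio and the congestion."""
    return loss_ratio * f_congestion

def path_symmetry(n_fwd, rtt_last_fwd, n_rev, rtt_last_rev):
    """PSY per (2.2): 1 is perfectly symmetric; < 1 means the reverse path
    is relatively inflated; > 1 means the forward path is inflated."""
    return (n_fwd / n_rev) * (rtt_last_fwd / rtt_last_rev)

fwd = [10.0, 18.0, 25.0, 31.0]  # best RTT (ms) per active hop, forward path
print(congestion(fwd))                  # 84/124 ~ 0.68: non-congested
print(path_symmetry(4, 31.0, 5, 40.0))  # (4/5)*(31/40) = 0.62: reverse inflated
```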

Fig. 2.1
figure 1

Overlay network example. The values on links denote the delay incurred
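
The ACA-RCS computation can be sketched as follows under the definition in [22]: a source node can access the content if its connected component hosts at least one replica. The sketch uses networkx, and the example graph, link cut, and replica placement are hypothetical.

```python
import networkx as nx

def aca_rcs(graph: nx.Graph, replica_nodes) -> float:
    """Fraction of nodes that can reach at least one replica.

    Nodes hosting replicas count as having access to the content."""
    replicas = set(replica_nodes)
    accessible = 0
    for component in nx.connected_components(graph):
        if component & replicas:          # component hosts a replica:
            accessible += len(component)  # every node in it has access
    return accessible / graph.number_of_nodes()

# Example: a 6-node path graph split by a link cut into {0,1,2} and {3,4,5}.
g = nx.path_graph(6)
g.remove_edge(2, 3)
print(aca_rcs(g, replica_nodes={1}))     # 0.5: only one component has a replica
print(aca_rcs(g, replica_nodes={1, 4}))  # 1.0: both components are covered
```

Averaging this value over a set of disaster scenarios yields an estimate in the spirit of \(\mu \)-ACA [23].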

2.2 Subjective Metrics

While objective metrics capture how network performance reflects on application performance, they cannot capture how such performance degradations impact the user experience. Different from objective metrics, subjective functional metrics should represent the quality that the user perceives while using an application that runs over the network infrastructure. Since these metrics aim at measuring the user experience, they are more difficult to obtain with automated or analytical tools. In recent years, there has been increasing interest in developing tools that map objective metrics to subjective ones without human intervention. Many of these proposals use, e.g., machine learning to learn the mapping from objective to subjective metrics based on previously evaluated scenarios [19, 37].

These metrics also depend on the specific application. Several evaluation methods are available in the literature, with audio and video being the most studied applications. However, with the emergence of new applications, it becomes necessary to revisit the audio/video-based metrics and develop new ones to better evaluate the experiences these applications provide. The best-known metrics for objectively estimating subjective quality include the following (a short PSNR sketch is given after the list):

  • Mean opinion score (MOS) [11] estimates the perceived quality considering the human visual system and the video re-buffering frequency. It can also be used to obtain scores from subjective evaluation of users.

  • Peak signal-to-noise ratio (PSNR) [10] measures the Mean Squared Error (MSE) between the original and the received image, averaged over all frames of a video. The PSNR is usually expressed on a logarithmic scale as the ratio between the maximum possible power of a signal and the power of the noise that affects the fidelity of its representation.

  • Structural similarity (SSIM) [36] predicts the perceived quality of images and videos. It improves upon the MOS and PSNR measures, which have been shown to provide assessments of video quality inconsistent with human perception [33].
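
As an illustration, the sketch below implements the PSNR computation just described; the frame dimensions and noise level in the example are hypothetical, and frames are assumed to be 8-bit arrays.

```python
import numpy as np

def psnr(original: np.ndarray, received: np.ndarray, max_value: float = 255.0) -> float:
    """PSNR in dB between two frames of identical shape."""
    mse = np.mean((original.astype(np.float64) - received.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")  # identical frames
    return 10.0 * np.log10(max_value ** 2 / mse)

def video_psnr(original_frames, received_frames) -> float:
    """PSNR averaged over all frames of a video, as in [10]."""
    return float(np.mean([psnr(o, r) for o, r in zip(original_frames, received_frames)]))

# Example with a synthetic frame and additive Gaussian noise.
rng = np.random.default_rng(0)
frame = rng.integers(0, 256, size=(720, 1280), dtype=np.uint8)
noisy = np.clip(frame + rng.normal(0, 5, frame.shape), 0, 255).astype(np.uint8)
print(psnr(frame, noisy))  # roughly 34 dB for noise with standard deviation 5
```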

Subjective metrics, such as the ones presented in this chapter, have been used as a basis to develop several algorithms with the goal of delivering better QoE to users. For instance, heterogeneous networks may collaborate to deliver a better experience [12]. Another example is the coordination of vehicles to improve QoE in vehicular networks [13]. A final example is considering the energy consumption of devices to decide on the bit-rate adaptation [1].

Several aspects affect the video quality perceived by the user. For instance, video resolution, screen resolution, and refresh rate play a fundamental role in the perceived video quality. Another important aspect is the codec used to encode the video; H.264 and H.265 are currently considered state-of-the-art options. Depending on the codec, several encoding options must be selected according to the type of video content, e.g. animation or high-motion scenes [6, 33].

Companies providing video streaming services need to select the codec options carefully. For instance, Netflix selects the codec options specifically for each title (e.g. movie or series).Footnote 1 If overestimated, these options can lead to an unnecessary increase in the bit rate of the videos; if underestimated, they may cause artefacts in the image, resulting in low QoE even at high resolutions [6, 33].

In this scenario, video providers need to decide, based on the screen resolution and available throughput, which video quality should be provided to the user such that the QoE is maximized. Therefore, there is a need to move from fixed-quality video streaming, which does not adapt to the user's network conditions, to more dynamic video-quality adjustment. In this context, the MPEG Dynamic Adaptive Streaming over HTTP (MPEG-DASH) standard facilitates interoperability between video providers and devices, allowing the video quality to dynamically adapt to the changing network conditions at the user end [25, 31]. By using this standard, video providers can better exploit the capabilities of user devices and provide the best possible QoE under the current network conditions while using the right amount of network resources. However, in the context of disasters, network resources are expected to be scarce, and video quality is expected to drop significantly.
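
MPEG-DASH standardizes the media manifest and segmentation, while the adaptation logic itself is left to the client. The sketch below shows one minimal throughput-based selection strategy; the representation ladder (resolutions and bit rates) is a hypothetical example.

```python
# Hedged sketch of DASH-style rate adaptation: pick the highest
# representation whose required bit rate fits the measured throughput.

LADDER = [  # (resolution, required bit rate in Mb/s), ascending; hypothetical
    ("240p", 0.7), ("360p", 1.5), ("720p", 5.0),
    ("1080p", 8.0), ("4k", 25.0), ("8k", 50.0),
]

def select_representation(throughput_mbps: float):
    """Best resolution the current throughput can sustain, or None when
    even the lowest quality does not fit (no video can be delivered)."""
    best = None
    for resolution, rate in LADDER:
        if rate <= throughput_mbps:
            best = resolution
    return best

print(select_representation(30.0))  # '4k'
print(select_representation(0.5))   # None
```

In the following section, a case study showing the effects of disasters on network performance is presented.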

3 Case Study

In this section, we analyze how link removals (caused by natural disasters or malicious attacks) can degrade or disrupt the correct functioning of real-world applications running over a real-world network topology. First, we describe the setup used to obtain the statistics presented in this chapter, which considers a realistic optical network infrastructure and a population-based traffic distribution. Then, we present numerical results on the effects of link removals on the functioning of video streaming applications. Finally, we discuss the remaining challenges in assessing the effects of disasters on the degradation or disruption of applications running over the damaged network.

3.1 Evaluation Setup

To evaluate the effects of link removals on real-world applications, we use the Germany50 network topology [24], with 50 nodes and 88 links, depicted in Fig. 2.2. Each node in the topology is associated with a German city and its population, as shown in Table 2.1. The link lengths were computed using the geographical distance between adjacent nodes, considering the curvature of the Earth's surface.Footnote 2
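
Assuming node coordinates are given in decimal degrees, one common way to compute such curvature-aware distances is the haversine (great-circle) formula, sketched below; the coordinates in the example are approximate.

```python
from math import radians, sin, cos, asin, sqrt

def link_length_km(lat1, lon1, lat2, lon2, earth_radius_km=6371.0):
    """Great-circle distance between two points given in decimal degrees."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = radians(lat2 - lat1)
    dlam = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlam / 2) ** 2
    return 2 * earth_radius_km * asin(sqrt(a))

# Example: roughly Berlin (52.52N, 13.40E) to Hamburg (53.55N, 9.99E).
print(link_length_km(52.52, 13.40, 53.55, 9.99))  # ~255 km
```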

Fig. 2.2
figure 2

Network topology (Germany50) used for the study carried in this chapter [24]. Cities and population associated with each node are described in Table 2.1

A multi-Data Centre (DC) scenario is considered, where users connect to the closest among the several DCs deployed in the network. The anycast traffic assumption is suitable for many current applications [5], such as video streaming and gaming, and makes it possible to increase the robustness of the applications while reducing the latency between user and application server. In Table 2.1, superscripts next to city names indicate the nodes that are selected by the optimal DC placement model (we refer to [21] for more details) when \(\beta \) DCs are placed in the network (\(\beta = 4, 5, 6\)).

Table 2.1 List of nodes and their associated population [24]

For the physical connectivity, we assume an EON deployed over the network topology with one fiber pair per link and a grid of 320 frequency slots of 12.5 GHz each. For each scenario, 100 simulations were carried out in which the available spectrum is allocated to unitary demands (i.e. demands comprising one frequency slot) while respecting the frequency continuity constraint. The sources of the demands are randomly selected using a probability distribution that reflects the population associated with each node. The Routing, Modulation and Spectrum Assignment (RMSA) selects the shortest path to the closest DC as the route and uses the first-fit approach to select the frequency slot used by the demand (we refer to [3, 8, 32] for more details regarding RMSA).

Table 2.2 shows the modulation formats considered for the dynamic modulation format assignment used in this chapter, their respective spectral efficiency, and their maximum reach. Demands are generated until network resources are exhausted, i.e. until no additional demand can be accommodated. On average, 170,000 demands are assigned in each simulation. For the sake of simplicity, no guard bands were considered in the results of this section. The speed of light in the fiber is assumed to be \(2\times 10^8\) m/s.
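
The sketch below illustrates, under simplifying assumptions, the RMSA loop just described: shortest-path routing to the closest DC, distance-adaptive modulation selection, and first-fit spectrum assignment under the frequency continuity constraint. The reach values are illustrative placeholders (see Table 2.2 and [8, 32] for actual values), and all helper names are hypothetical.

```python
import networkx as nx

SLOTS = 320  # 12.5 GHz frequency slots per link

# (name, spectral efficiency in bit/s/Hz, max reach in km) -- illustrative
MODULATIONS = [("64QAM", 6, 125), ("32QAM", 5, 250), ("16QAM", 4, 500),
               ("8QAM", 3, 1000), ("QPSK", 2, 2000)]

def pick_modulation(path_length_km: float):
    """Most spectrally efficient format whose reach covers the path."""
    for name, efficiency, reach in MODULATIONS:
        if path_length_km <= reach:
            return name, efficiency
    return None  # path too long for any format

def first_fit(path, spectrum):
    """First slot index free on every link of the path (frequency
    continuity); allocates and returns it, or returns None if blocked."""
    links = [frozenset(l) for l in zip(path, path[1:])]
    for slot in range(SLOTS):
        if all(not spectrum[link][slot] for link in links):
            for link in links:
                spectrum[link][slot] = True
            return slot
    return None

def serve_demand(graph, spectrum, source, dcs):
    """Route a unitary demand to the closest DC; None means blocked."""
    lengths, paths = nx.single_source_dijkstra(graph, source, weight="length")
    dc = min((d for d in dcs if d in lengths), key=lambda d: lengths[d],
             default=None)
    if dc is None:
        return None  # no DC reachable
    mod = pick_modulation(lengths[dc])
    slot = first_fit(paths[dc], spectrum) if mod else None
    return (slot, mod) if slot is not None else None

# Usage: spectrum = {frozenset(e): [False] * SLOTS for e in graph.edges()}
```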

Table 2.2 Spectral efficiency and maximum reach of modulation formats [8, 32]
Fig. 2.3
figure 3

Cumulative Distribution Function (CDF) of the ratio of nodes for different numbers of DCs (\(\beta \)): a path length; b propagation latency; c spectral efficiency and d bit rate per user

Figure 2.3 shows the CDF of different performance indicators obtained for the Germany50 network topology in the considered scenarios with 4, 5, and 6 DCs. Figure 2.3a shows that the maximum path length from a node to its closest DC is shortened by around 30% when increasing from 4 to 5 DCs, i.e. from more than 500 km to around 350 km. The shortening of paths directly translates into lower propagation latency, as shown in Fig. 2.3b. For instance, while \(\beta =4\) leaves some nodes with up to 2.5 ms of latency, an extra DC (\(\beta =5\)) reduces the maximum propagation latency to less than 2 ms. Moreover, shorter paths also allow more efficient modulation formats to be used, as shown in Fig. 2.3c. With 4 DCs (\(\beta =4\)), 10% more nodes use modulation formats with spectral efficiencies of 3 and 4 bit/s/Hz compared to \(\beta =5\). This lower spectral efficiency translates into lower bit rates available to the users. Figure 2.3d shows that increasing from 4 to 5 DCs raises the maximum bit rate from less than 70 Mbps to more than 80 Mbps per user.

To mimic the effects of a disaster on the network topology, we adopt a strategy that progressively removes a number of links (\(\rho \)) from the network topology and evaluates the impact of each removal on the performance indicators. Depending on the order of link selection, these link removals can model natural disasters (disrupting links within a geographical area) or malicious attacks (disrupting the most critical links) [28]. For the sake of simplicity, the links are removed in the order given by their link betweenness centrality computed for the anycast traffic scenario, i.e. based on the number of shortest paths from each node to its closest DC that traverse each link. First, the network topology with all 88 links was evaluated (i.e. \(\rho =0\)). Then, links were progressively removed following the order obtained with the betweenness centrality metric, and the bit-rate simulations were performed on the network comprising the remaining links.
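
A minimal sketch of this removal strategy is given below: it ranks links by the number of node-to-closest-DC shortest paths traversing them and removes the \(\rho \) most central ones. Tie-breaking and recomputation of the ranking after each removal are ignored for simplicity, and the helper names are hypothetical.

```python
import networkx as nx
from collections import Counter

def anycast_link_ranking(graph, dcs):
    """Links sorted by how many shortest node-to-closest-DC paths use them."""
    count = Counter()
    for source in graph.nodes:
        lengths, paths = nx.single_source_dijkstra(graph, source, weight="length")
        reachable = [d for d in dcs if d in lengths]
        if not reachable:
            continue
        path = paths[min(reachable, key=lambda d: lengths[d])]
        for u, v in zip(path, path[1:]):
            count[frozenset((u, v))] += 1
    return [tuple(link) for link, _ in count.most_common()]

def remove_links(graph, dcs, rho: int):
    """Copy of the graph with the rho most central links removed."""
    g = graph.copy()
    for u, v in anycast_link_ranking(graph, dcs)[:rho]:
        g.remove_edge(u, v)
    return g
```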

Fig. 2.4
figure 4

Percentage of the nodes using a given modulation format over the number of removed links (\(\rho \))

Figure 2.4 shows the impact of the link removals on the selection of the modulation format per node. For the network topology in normal working conditions (\(\rho =0\)), 32QAM, 64QAM, and 16QAM are the predominant modulation formats, selected for 50%, 48%, and 2% of the nodes, respectively. As links are removed, the paths from source nodes to their closest DCs increase in length, affecting the eligible modulation formats. For instance, 16QAM accounts for only 2% of the nodes at \(\rho =0\) but is used by up to 42% of the nodes when 24 links are removed. Less efficient modulation formats (i.e. 8QAM and QPSK), which are not used in the network in its initial state, also become necessary under large disruptions. The use of less efficient modulation formats is expected to degrade the bit rate available to the users, potentially disrupting the functioning of bit-rate-hungry applications. These potential degradations are investigated in the next section.

3.2 Use Case: Adaptive Video Streaming

This section presents the results obtained when the network described in the previous sections is used to support adaptive video streaming services. We assume that the streaming services use the MPEG-DASH standard to detect the network conditions and adjust the video quality accordingly [25, 31]. The total bit rate assigned to each network node is evenly divided among the population associated with the node. Table 2.3 summarizes the considered video resolution options and their respective required bit rates. The video quality associated with each node is selected as the highest resolution possible for the bit rate available per user.
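
Combining the per-node bit rate with the DASH-style selection sketched in Sect. 2.2 yields the per-node quality assignment used in this section; `select_representation` below is the hypothetical helper introduced there, not the exact model of this chapter.

```python
def node_video_quality(node_bit_rate_mbps: float, population: int):
    """Highest resolution sustainable by the bit rate available per user."""
    per_user_mbps = node_bit_rate_mbps / population
    return select_representation(per_user_mbps)

# Example: a node carrying 80 Gb/s shared by one million inhabitants leaves
# 0.08 Mb/s per user -- not enough for any video quality (returns None).
print(node_video_quality(80_000, 1_000_000))
```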

Table 2.3 List of video resolutions and their associated bit rates [6]
Fig. 2.5
figure 5

Percentage of the population accessing videos with a given quality over the number of links removed (\(\rho \))

Figure 2.5 shows the percentage of the population able to receive a given video quality when up to 40 links are removed. For the network in normal operating conditions (i.e. \(\rho =0\)), 4k, 8k, and 1080p are the dominant video qualities, accounting together for around 88% of the users (45%, 28%, and 14%, respectively). As links are disrupted, traffic must traverse longer paths to reach the DCs, and the resources are shared among more users, lowering the bit rate available per user.

When 9 links are removed (i.e. \(\rho =9\)), the share of users with the higher video qualities (i.e. 8k, 4k, and 1080p) drops from 88% (in normal operating conditions) to 61%, with the share of users accessing 4k dropping by nearly 15%. Moreover, the bit rate available to 0.12% of the users is already insufficient to support even the lowest video quality, meaning that they have no access to video at all. Interestingly, at this stage the network topology remains connected, i.e. all node pairs can still communicate. This means that structural metrics, such as the ATTR, still assign the network their highest score, demonstrating that they are not sufficient to assess the effects of disasters on applications such as video streaming.

With 14 links removed (i.e. roughly 16% of the 88 links), the video service is no longer available to 3% of the users. In this condition, nearly 34% of the users have their video quality degraded to 240p or 720p, compared to only 10% of the users at \(\rho =0\). Moreover, the network is partitioned, and the all-node-pairs connectivity drops to 92% (\(\textit{ATTR}=0.92\)). These values also show that even though only 92% of all node pairs remain connected, 97% of the users can still reach one of the DCs, demonstrating the increased reliability of anycast traffic.

4 Conclusions

This chapter reviewed functional metrics for evaluating the impact of disasters on applications and users. Objective and subjective functional metrics were discussed. For the objective metrics, packet-level and network-level metrics were summarized and discussed as means to evaluate the network status. The subjective metrics were also summarized and discussed, bringing attention to the need for evaluating the network status not only through its infrastructure metrics but also through the quality perceived by the user.

Moreover, this chapter investigated the impact of disasters that disrupt network links on the functional aspects of applications relying on the network infrastructure to deliver services to users. The effects of these link disruptions on QoS metrics were illustrated assuming an EON deployed over a 50-node national topology. A use case of adaptive video streaming was investigated, and the numerical results indicate that link disruptions impose a severe degradation of the QoE.

The assessment methodology adopted in this chapter assumed several state-of-the-art technologies, such as EONs and adaptive video streaming using MPEG-DASH. However, several effects of disasters were not investigated in this chapter. One example is the impact that the latency degradation shown in this chapter may have on applications. Moreover, new video codecs and adaptive video streaming techniques may behave differently upon degradation of the network connectivity. Finally, other optical network technologies and/or traffic conditions also need to be investigated, given that optical networks are expected to support a wide range of applications.