1 Introduction

Video streaming is nowadays one of the most bandwidth-hungry applications. Internet video accounted for 40 % of consumer Internet traffic in 2011, and it is expected to reach 62 % by the end of 2015 [5]. To minimize this expense, encoders from the MPEG family [18] have been widely adopted for encoding audio and video. Given that MPEG-2 and MPEG-4/H.264 are spread throughout the industry and reach millions of user devices, they will most likely remain dominant in the foreseeable future.

In the MPEG standards, the coding process takes advantage of temporal similarities between frames in order to produce smaller compressed frames. The decoding process for most video frames requires previously decoded ones. This hierarchical structure of MPEG encoding implies possible error propagation through its frames, and therefore adds an extra difficulty to the transport of MPEG video flows over lossy networks [19]. Small packet loss rates may cause high frame error rates, degrading the video quality perceived by the user.

A video provider needs to quantify video quality problems, ideally before the user perceives them. The concept of Quality of Experience (QoE) emerged from this need. ITU-T defines QoE [13] as “The overall acceptability of an application or service, as perceived subjectively by the end-user”. The user does not directly perceive the proportion of lost packets, but rather the proportion of frames that could or could not be displayed, which depends on the lost frames and their relationship to the other frames. The proportion of lost packets is a Quality of Service (QoS) parameter of the transport network, but according to the definition given above, it cannot be considered a QoE parameter. The proportion of frames that can be decoded, and therefore displayed, is named the Decodable Frame Rate (Q) [29] and can be considered a QoE parameter.

User perception is not only affected by the number of non-displayed frames, but also by the grouping in time of these non-displayed frames. This grouping is reflected in the length of the video playback interruptions or cuts.

The main contribution of this paper is to consider the Decodable Frame Rate and the video playback interruption or cut lengths as important parameters to quantify video quality, since the user perceives both. This paper presents an analytical model to compute the Decodable Frame Rate and the video playback interruption lengths for network scenarios where frame losses can be considered independent. The analytical equations are validated by simulation and used, under certain restrictions, on packet switching networks. These QoE parameters show that different videos can have a similar Decodable Frame Rate or similar cut lengths, but both parameters can only be similar for different videos if the videos' transport characteristics are similar.

The rest of the paper is organized as follows. Section 2 presents the related work on this topic. Section 3 presents the analytical study of the QoE parameters. Section 4 validates the analytical model for a packet switching network. Section 5 presents results that can be extracted from the model and the validation. Finally, Section 6 concludes the paper.

2 Related work

A common strategy for quality evaluation of IP packet transport is the use of QoS measures like the packet loss rate or packet jitter. Their analogues for MPEG video transport are QoS measures like the packet/frame loss ratio or packet/frame jitter [26]. These measures are not directly related to user perception and therefore cannot be considered valid QoE metrics.

QoE metrics can be classified based on the availability of the original video signal [7]. In full reference (FR) metrics the whole original video signal is available. In reduced reference (RR) metrics only partial information from the original video is available. In no reference (NR) metrics the original video is not available at all. The signal as received by the user is assumed to be available for all three types of metrics.

The most widely used QoE metrics are FR metrics like the Peak Signal-to-Noise Ratio (PSNR) [9] and the Perceptual Evaluation of Video Quality (PEVQ) [14]. These metrics can only be computed offline, or online in highly controlled environments where the original video and the video at the user side are available together at some point. In a real video distribution scenario, online QoE computation is preferred, in order to react quickly to quality problems. However, it is possible only if the original video is available at the user side, which would make QoE assessment unnecessary. Therefore, RR or NR metrics are more useful.

RR metrics require parameter extraction from the original video and the received one. They measure changes in these parameters, which must be transported by the network and collected at a single site in order to obtain the measured QoE. In practical situations these parameters are hard to obtain. For example, the Hybrid Image Quality Metric (HIQM) [15] combines five structural parameters that are computed for each frame, both in the original and the received video. There is therefore a clear overhead not only in computation but also in the transport of this metadata.

The computation of NR metrics does not use the original video or any parameter extracted from it. They can be computed at user side based only on the received video. This paper focuses on these metrics as they are the easiest to implement and use for traffic engineering in a real video distribution network.

The Decodable Frame Rate is the proportion of video frames that the user will see completely correctly, so it is directly related to user perception and only needs the video signal as received by the user. However, the user perceives non-decoded frames differently depending on whether they are temporally contiguous or separated, i.e., the user also perceives the length of the video cuts.

The Decodable Frame Rate was introduced in [29]. Frame losses were assumed to be mutually independent, and the analytical model for the Decodable Frame Rate depends on the probability of a frame being lost. In [29] this probability is obtained for the case of a packet switching network.

Some papers have used this analytical model for transmission over wireless networks [16], on IPTV networks [1] or in the evaluation of the SCTP protocol [3]. All of them present the Decodable Frame Rate as a valid QoE metric. In fact, the authors of [3, 4] obtain both the PSNR and the Decodable Frame Rate of some videos for the same scenarios and parameters. They compare the results and conclude that the Decodable Frame Rate reflects the behaviour of the PSNR, and therefore the MOS video quality can be estimated with reasonable accuracy from the Decodable Frame Rate.

However, none of them [1, 3, 4, 16, 29] includes the distribution of the non-decoded frames, which gives more data than the Decodable Frame Rate for the statistical evaluation of video quality. The Decodable Frame Rate can be derived from the distribution of the non-decoded frames, but this distribution cannot be derived from the Decodable Frame Rate.

The distribution of the non-decoded frames is reflected in the video playback interruption or cut lengths. In this paper, the video playback interruption lengths are analytically derived for network scenarios where frame losses can be considered independent.

Both the number of non-decoded frames and the video playback interruption lengths must be taken into account to evaluate video quality. User perception is affected by the number of consecutive non-displayed frames, as experimental measurements in [22, 23] have shown. For example, video playback interruptions of 200 ms are certainly visible to the user, and even an 80 ms cut may be visible.

In [22, 23], the authors present a mathematical expression to obtain the MOS video quality from the video playback interruption length. They only consider the case of one cut, i.e., the expression calculates the degradation in quality when the video has a single cut. In [20, 21], an NR metric is proposed to obtain the MOS video quality when the video has multiple playback interruptions. The mathematical expression from [22, 23] is extended to take into account the distribution of video playback interruption lengths. The authors obtained this distribution experimentally, so the metric can only be used after the reception of the video and cannot be used to make predictions about the video quality at the user side. The analytical model of video playback interruption lengths that we propose in this paper can be used to simplify the metric and to make predictions about the video quality at the user side based on current network conditions.

3 Analytical model

Three types of video frames are defined in the MPEG standards [8]: intra-coded frames (I-frames), inter-coded or predicted frames (P-frames) and bidirectional coded frames (B-frames).

I-frames can be decoded on their own. P- and B-frames hold only the changes in the image from the reference frames, and thus improve video compression rates. P-frames have only one reference frame, the previous I- or P-frame. In MPEG-2 Part 2 (H.262) [10] and MPEG-4 Part 2 [11], B-frames have two reference frames. These frames are the previous I- or P-frame and the following one of either type. In MPEG-4 Part 10 (MPEG-4 AVC or H.264) [12], B-frames can have up to 16 reference frames, located before or after the B-frame. They can be either I- or P-frames, and even B-frames can be reference frames for other B-frames. In this paper, only videos with “classic” B-frames will be studied, i.e. videos whose B-frames have two reference frames of I- or P-frame type.

In MPEG-2 Part 2 and MPEG-4 Part 2, the prediction level is the same for the whole frame. In H.264, the prediction type granularity is reduced to a level below the frame, called the slice. A frame can contain multiple slices that are encoded separately from the other slices of the frame, and they can even have different prediction types (I-slice, P-slice, B-slice). Therefore, instead of frame losses, we could talk about slice losses.

We have carried out a brief survey of the presence of slices in videos from current networks [6]. The results show that most of the analyzed video sources use one slice per frame, i.e., the use of more than one slice per frame is infrequent. In the rare event of more than one slice per frame, all the slices are of the same type. Therefore, we have concluded that a model that ignores the presence of slices and simplifies the model of losses is still useful.

The hierarchical structure of MPEG encoding implies possible error propagation through its frames. Frames that arrive at the destination can be useless if the frames they depend on have been dropped by the network. The loss of a frame, or part of it, in the network is named a direct loss, and it implies that the frame is non-decodable. An indirect loss happens when a frame is considered non-decodable because some frame it depends on is non-decodable. These are common assumptions made in many other papers [3, 16, 29].

I-, P- and B-frames are grouped into Groups of Pictures (GoP). A GoP is a sequence of frames beginning with an I-frame up to the frame before the next I-frame. The GoP structure is the pattern of I-, P- and B-frames used inside every GoP. A regular GoP structure is usually described by the pattern (N, M), where N is the I-frame to I-frame distance and M is the I-frame to P-frame distance (see Table 1 for the notation used in the paper). For example, the GoP structure could be (12,3) or IBBPBBPBBPBB.
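To illustrate, the (N, M) description of a regular GoP can be expanded into its frame pattern programmatically. The following minimal Python sketch (the function name and the open/closed detection rule are our own, derived from the definitions above) rebuilds the pattern:

```python
def gop_pattern(N, M):
    """Expand a regular GoP description (N, M) into its frame pattern.

    N: I-frame to I-frame distance; M: I-frame to P-frame distance.
    An open GoP (N a multiple of M) ends with M-1 B-frames that reference
    the next GoP's I-frame; a closed GoP ends with a P-frame.
    """
    n_p = (N - 1) // M                     # number of P-frames
    is_open = N % M == 0
    pattern = "I" + ("B" * (M - 1) + "P") * n_p
    if is_open:
        pattern += "B" * (M - 1)           # trailing B-frames of an open GoP
    return pattern

print(gop_pattern(12, 3))  # → IBBPBBPBBPBB (open GoP)
print(gop_pattern(10, 3))  # → IBBPBBPBBP   (closed GoP)
```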

Table 1 Notation

N {I,P,B} is the number of frames of each type in a single GoP. For any regular GoP, N I  = 1 and N = N I  + N P  + N B , but N P and N B depend on the GoP structure (N,M). In an open GoP the last B-frames depend on the I-frame from the next GoP, as for example in (12, 3) or IBBPBBPBBPBB. In this case N is a multiple of M and the number of P-frames can be computed as N/M − 1. In a closed GoP there is no dependence on frames outside the GoP and it ends with a P-frame, as for example in (10,3) or IBBPBBPBBP. In this case N − 1 is a multiple of M and the number of P-frames can be computed as (N − 1)/M.

For simplicity, we define \(N_P=\lfloor (N-1)/M \rfloor\) for any type of GoP. As N − 1 is a multiple of M for a closed GoP, \(N_P=\lfloor (N-1)/M \rfloor=(N-1)/M\) in that case. As N is a multiple of M for an open GoP, \(\lfloor (N-1)/M \rfloor\) is the same as \(\lfloor (N-M)/M \rfloor\), and therefore \(N_P=\lfloor (N-1)/M \rfloor=\lfloor (N-M)/M \rfloor=\lfloor N/M-1 \rfloor=N/M-1\) for an open GoP. Summarizing, the number of frames of each type (N {I,P,B}) can be obtained for any type of GoP by (1).

$$ \begin{array}{rll} \label{eq:N_IPB} N_I &= & 1 \\ N_P &= & \left\lfloor \frac{N-1}{M} \right\rfloor \\ N_B& = & N-1-N_P = N-1-\left\lfloor \frac{N-1}{M} \right\rfloor \end{array} $$
(1)

In an open GoP N B  = (N P  + 1)*(M − 1), and in a closed one N B  = N P *(M − 1). We can define a control variable z (2) that nullifies the terms of the analytical model that only affect open GoPs: z = 1 when the GoP is open and z = 0 when it is closed. This value can be obtained directly from the GoP structure (2).

$$ \label{eq:z} z = \frac{N_B}{M-1}-N_P = \frac{N-1-\left\lfloor \frac{N-1}{M} \right\rfloor}{M-1}-\left\lfloor \frac{N-1}{M} \right\rfloor $$
(2)
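Equations (1) and (2) can be transcribed directly. The Python sketch below (our own naming; it assumes a regular GoP structure so that the division defining z is exact) computes the frame counts and the open/closed indicator:

```python
def gop_frame_counts(N, M):
    """Frame counts per GoP and the open/closed indicator z, per (1) and (2).

    Assumes a regular GoP structure, so N_B is an exact multiple of M-1.
    """
    N_I = 1
    N_P = (N - 1) // M          # floor((N-1)/M), Eq. (1)
    N_B = N - 1 - N_P           # the remaining frames are B-frames
    z = N_B // (M - 1) - N_P    # Eq. (2): z = 1 (open GoP) or 0 (closed GoP)
    return N_I, N_P, N_B, z

print(gop_frame_counts(12, 3))  # open GoP IBBPBBPBBPBB → (1, 3, 8, 1)
print(gop_frame_counts(10, 3))  # closed GoP IBBPBBPBBP → (1, 3, 6, 0)
```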

A GoP structure like IBBPBBPBBPBB or IBBBPBBBP is shown in presentation order, i.e. the order in which the frames will be shown to the user. However, as B-frames require the previous I- or P-frame and the following one of either type in order to be decoded, the coding/decoding order is different. The coding/decoding order is IbbPBBPBBPBBiBB for a (12, 3) GoP structure, where the frames in lower case belong to the previous or the next GoP, showing that it is an open GoP.

The transmission order usually corresponds to the coding/decoding order, but the user perceives the cuts in presentation order. The measurement of cut durations has to take this change of frame order into account.
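For classic B-frames, the reorder from decoding to presentation order amounts to a one-anchor delay: each I- or P-frame is held back until the next anchor arrives, while B-frames are emitted immediately. A minimal sketch of this standard reorder (the frame labels are illustrative, not taken from any trace):

```python
def decode_to_presentation(frames):
    """Reorder a classic MPEG frame sequence from decoding order to
    presentation order using a one-anchor delay."""
    out, held = [], None
    for f in frames:
        if f.startswith("B"):
            out.append(f)          # B-frames display before the held anchor
        else:
            if held is not None:
                out.append(held)   # the previous anchor displays now
            held = f
    if held is not None:
        out.append(held)           # flush the last held anchor
    return out

# Decoding order I0 P3 B1 B2 P6 B4 B5 ...
print(decode_to_presentation(["I0", "P3", "B1", "B2", "P6", "B4", "B5"]))
# → ['I0', 'B1', 'B2', 'P3', 'B4', 'B5', 'P6']
```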

Parameters presented in this paper, like the Decodable Frame Rate and the video playback interruptions or cuts, are measured in presentation order.

3.1 Decodable frame rate Q

The analytical model (3) for the Decodable Frame Rate presented in [29] is valid for open GoPs.

$$ \begin{array}{rll} \label{eq:QGoPZiviani} Q & =& \frac{ (1-P_I) + (1-P_I) \sum\limits_{i=1}^{N_P} (1-P_P)^i }{ N } \\ &&+ \frac{ (M-1)(1-P_I)(1-P_B) \left[ \sum\limits_{i=1}^{N_P} (1-P_P)^i + (1-P_I)(1-P_P)^{N_P} \right] }{ N } \end{array} $$
(3)

We use P τ to denote the probability of losing a frame of type τ ∈ {I, P, B} in the network: P I is the probability of losing an I-frame, P P that of losing a P-frame and P B that of losing a B-frame. It is assumed that frame losses are mutually independent. The last term of (3), \((1-P_I)(1-P_P)^{N_P}\), corresponds to the last B-frames, which depend on the I-frame from the next GoP. A closed GoP does not have these last B-frames, so the analytical model should reflect this difference. Equation (4) presents the Decodable Frame Rate expression, valid for any type of GoP, where the difference between open and closed GoPs is reflected through the variable z (z = 1 for open GoPs and z = 0 for closed GoPs).

$$ \begin{array}{rll} \label{eq:QGoP} Q & =& \frac{ (1-P_I) + (1-P_I) \sum\limits_{i=1}^{N_P} (1-P_P)^i }{ N } \\ &&+ \frac{ (M-1)(1-P_I)(1-P_B) \left[ \sum\limits_{i=1}^{N_P} (1-P_P)^i + z * (1-P_I)(1-P_P)^{N_P} \right] }{ N } \end{array} $$
(4)
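Equation (4) translates into a few lines of code. The sketch below (our own naming) computes Q for any regular GoP; with zero loss probabilities all frames are decodable and Q evaluates to exactly 1:

```python
def decodable_frame_rate(N, M, P_I, P_P, P_B):
    """Decodable Frame Rate Q for any type of GoP, per Eq. (4)."""
    N_P = (N - 1) // M
    N_B = N - 1 - N_P
    z = N_B // (M - 1) - N_P          # 1 for an open GoP, 0 for a closed one
    s = sum((1 - P_P) ** i for i in range(1, N_P + 1))
    q_i = 1 - P_I                     # expected decodable I-frame
    q_p = (1 - P_I) * s               # expected decodable P-frames
    q_b = (M - 1) * (1 - P_I) * (1 - P_B) * (s + z * (1 - P_I) * (1 - P_P) ** N_P)
    return (q_i + q_p + q_b) / N

print(decodable_frame_rate(12, 3, 0.0, 0.0, 0.0))  # → 1.0
```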

3.2 Video playback interruptions or cuts

Video playback interruption or cut lengths can be measured as the number of consecutive non-decoded frames. A cut length is a positive integer c that is related to the cut duration through the inter-frame time T if . The possible cut lengths depend on the GoP structure of the video and can be obtained by (5) (see Appendix for more details).

$$ \begin{array}{rll} \label{eq:c} c & = & [ 1 \ldots M-1 ,\: j * N + i * M + z * (M-1) ,\: (j+1) * N + z * (M-1) ] \\ i& = &1 \ldots N_P \\ j& = &0 \ldots N_G-1 \end{array} $$
(5)

Where N G  = F/N is the total number of GoPs in the video and F is the number of frames.

N cut[c] (6) is the number of cuts of length c frames when one or more losses happen. It depends on the GoP structure of the video and on the frame loss probabilities P {I, P, B} (see Appendix for more details).

$$ \label{eq:Ncut} N_{\rm cut}[c] = \begin{cases} N_G * \delta * P_B^c * (1-P_I) \sum\limits_{m=1}^{N_P} (1-P_P)^m + z * N_G * \delta * P_B^c * (1-P_I)^2 (1-P_P)^{N_P} & \text{ for $c=1 \ldots M-1$ }\\ N_G * P_I^j * P_P * (1-P_I)^2 (1-P_P)^{N_P-i} & \text{ for $c = j * N + i * M + z * (M-1)$ }\\ N_G * P_I^{(j+1)} * (1-P_I)^2 (1-P_P)^{N_P} & \text{ for $c = (j+1) * N + z * (M-1)$ }\\ 0 & \text{ otherwise } \end{cases} $$
(6)

Where δ is defined in (7) (see Appendix for more details).

$$ \begin{array}{rll} \label{eq:deltaysigma} \delta &= & \sum\limits_{r=1}^{M-c} (1-P_B)^{\sigma} \\ \sigma &= & \begin{cases} 0 & \text{ if $r=1$ and $r+c=M$ } \\ 1 & \text{ if $(r=1$ and $r+c<M)$ or $(r>1$ and $r+c=M)$ } \\ 2 & \text{ if $r>1$ and $r+c<M$ } \\ \end{cases} \end{array} $$
(7)

The total number of cuts T cut can be computed as:

$$ \label{eq:Tcut} T_{\rm cut} = \sum\limits_{c=1}^{F} N_{\rm cut}[c] $$
(8)

The proportion of cuts of c frames length (P cut[c]) can be computed by dividing the number of cuts of c frames length (N cut[c]) by the total number of cuts (T cut). The cut length Probability Mass Function P cut is the set of all possible values of P cut[c].

$$ \label{eq:Pcut} P_{\rm cut}[c] = \frac{ N_{\rm cut}[c] }{ T_{\rm cut} } = \frac{ N_{\rm cut}[c] }{ \sum\limits_{i=1}^{F} N_{\rm cut}[i] } $$
(9)

The average cut length L cut can be computed from the cut length Probability Mass Function:

$$ \label{eq:CutLength} L_{\rm cut} = \sum\limits_{c=1}^{F} c * P_{\rm cut}[c] = \frac{ \sum\limits_{c=1}^{F} c * N_{\rm cut}[c] }{ \sum\limits_{c=1}^{F} N_{\rm cut}[c] } = \frac{ \sum\limits_{c=1}^{F} c * N_{\rm cut}[c] }{ T_{\rm cut} } $$
(10)

The Decodable Frame Rate can be derived from the cut length Probability Mass Function:

$$ \label{eq:Q} T_{\rm cut} = \frac{ F * (1-Q) }{ L_{\rm cut} } \Rightarrow Q = 1 - \frac{L_{\rm cut} * T_{\rm cut}}{F} = 1 - \frac{1}{F} \sum\limits_{c=1}^{F} c * N_{\rm cut}[c] $$
(11)
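Equations (5)-(11) fit together as a small computational pipeline. The sketch below (our own structure; cut counts are kept in a dictionary keyed by cut length) transcribes N cut (6) with δ and σ (7), and derives T cut (8), P cut (9), L cut (10) and Q (11). By construction P cut sums to one:

```python
def cut_distribution(N, M, N_G, P_I, P_P, P_B):
    """Cut length model of Eqs. (5)-(11) for a video of N_G regular GoPs."""
    N_P = (N - 1) // M
    N_B = N - 1 - N_P
    z = N_B // (M - 1) - N_P                 # 1 for open GoP, 0 for closed
    F = N_G * N                              # total number of frames
    N_cut = {}
    # Cuts of B-frames only: c = 1 .. M-1, weighted by delta (Eq. 7)
    for c in range(1, M):
        d = sum((1 - P_B) ** ((r > 1) + (r + c < M))   # sigma per Eq. (7)
                for r in range(1, M - c + 1))
        s = sum((1 - P_P) ** m for m in range(1, N_P + 1))
        N_cut[c] = (N_G * d * P_B ** c * (1 - P_I) * s
                    + z * N_G * d * P_B ** c * (1 - P_I) ** 2 * (1 - P_P) ** N_P)
    for j in range(N_G):
        # Cuts starting at the i-th P-frame, spanning j further whole GoPs
        for i in range(1, N_P + 1):
            c = j * N + i * M + z * (M - 1)
            N_cut[c] = N_cut.get(c, 0.0) + (
                N_G * P_I ** j * P_P * (1 - P_I) ** 2 * (1 - P_P) ** (N_P - i))
        # Cuts starting at an I-frame, spanning j+1 whole GoPs
        c = (j + 1) * N + z * (M - 1)
        N_cut[c] = N_cut.get(c, 0.0) + (
            N_G * P_I ** (j + 1) * (1 - P_I) ** 2 * (1 - P_P) ** N_P)
    T_cut = sum(N_cut.values())                          # Eq. (8)
    P_cut = {c: n / T_cut for c, n in N_cut.items()}     # Eq. (9)
    L_cut = sum(c * p for c, p in P_cut.items())         # Eq. (10)
    Q = 1 - sum(c * n for c, n in N_cut.items()) / F     # Eq. (11)
    return N_cut, T_cut, P_cut, L_cut, Q
```

For small loss probabilities the derived Q stays close to 1, consistent with the direct expression (4).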

4 Analytical model validation on packet switching networks

In Section 3, we presented the analytical model for computing the cut length Probability Mass Function P cut, the Average Cut Length L cut and the Decodable Frame Rate Q for network scenarios where frame losses can be considered independent. In this section, we present simulation results that validate the applicability of the analytical model. Statistical results are obtained at a 95 % confidence level, but most confidence intervals are too small to be noticed in the figures.

The video source takes a video trace containing the size and timestamp of each frame. The video source generates UDP packets using the frame sizes from the video traces. The number of packets per frame depends on the frame length and the selected packet length; therefore, all packets from the video have the same length, except the last one of each frame. A summary of the different video traces from [25, 27] used as video flows is presented in Table 2. Video traces with the same and with different bit-rates, GoPs and relative frame sizes have been used. For example, two versions of the Tokyo video have been used. Both have the same GoP, but different bit-rates and therefore, as stated in [24], different relative frame sizes.

Table 2 Summary of video traces used in the simulations

The analytical formulation requires the frame loss probabilities as input parameters. If packet losses can be modelled as independent with rate p, then the probability P τ of losing a frame of type τ ∈ {I, P, B} can be approximated by (12), based on the packet loss ratio p and the average number of packets per frame d τ .

$$ \label{eq:Ptau} P_\tau = \sum\limits_{k=1}^{d_\tau} \binom{d_\tau}{k} p^k * (1-p)^{d_\tau-k} = 1 - (1-p)^{d_\tau} $$
(12)
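The binomial sum in (12) simply expresses that a frame is lost when at least one of its packets is lost, so it collapses to the complementary probability that all packets survive. A quick numerical check (the values of p and d here are illustrative):

```python
from math import comb

def frame_loss_probability(p, d):
    """P_tau per Eq. (12): a frame of d packets is lost if at least one
    of its packets is lost, under i.i.d. packet loss with ratio p."""
    return 1 - (1 - p) ** d

# The explicit binomial sum in Eq. (12) gives the same value:
p, d = 0.01, 7
binomial_sum = sum(comb(d, k) * p ** k * (1 - p) ** (d - k) for k in range(1, d + 1))
assert abs(binomial_sum - frame_loss_probability(p, d)) < 1e-12
```

Since d τ is an average and may be non-integer in practice, the closed form on the right-hand side can still be evaluated directly.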

First, we assume an environment where the packet loss ratio experienced by the video flow is an independent rate p. This environment is modelled by a black box network scenario with an i.i.d. packet loss ratio (Subsection 4.1). Afterwards, a more realistic environment is studied (Subsection 4.2), where the packet loss ratio is the result of output port contention in routers. The analytical model requires frame losses to be independent. In a more realistic environment, when the link is congested, the buffer is close to full occupancy and bursty arrivals can result in bursty losses. Therefore, packet losses can present correlation and it cannot be asserted that the analytical model is valid. However, as the results will show, the model stays accurate for low to medium link utilizations and only deviates from the simulation results for highly congested links. For each of the environments, a specific simulator was developed using OMNeT++ [28].

4.1 Model validation in a network scenario with i.i.d. packet losses

Figure 1 shows the first four terms of the cut length Probability Mass Function (P cut) versus the independent packet loss ratio p for the different video traces. The analytical results match quite well with the simulations. The LOTRIII and Matrix traces have the same GoP structure, G12B2, and obtain a very similar P cut[c]. However, this is not always the case: TokyoQP4, TokyoQP1 and StarWarsIV have the same GoP structure, G16B7, but different P cut[c]. The cut length Probability Mass Function depends not only on the GoP structure, but also on the relation between the average number of packets per frame of each type. Table 2 shows that TokyoQP4, TokyoQP1 and StarWarsIV have the same GoP structure, but they differ greatly in this relation, whereas LOTRIII and Matrix have the same GoP structure and a similar relation.

Fig. 1
figure 1

Cut length probability mass function for different video traces in a network with independent losses

As the packet loss ratio p decreases, the cut length Probability Mass Function tends to stabilize. This happens because the probabilities of cut lengths that involve the loss of multiple frames tend to zero. These negligible cut length probabilities depend on the GoP structure, e.g. for G12B2 they are c ∈ {2, 8} and for G16B7 they are c ∈ {2, 3, 4} (see Fig. 1). So, as the packet loss ratio p decreases, the analytical formulation can be simplified by assuming that cut lengths coming from the loss of multiple frames cannot happen (see Section 5).

Figure 2a shows the simulation results for the Average Cut Length (L cut) versus the independent packet loss ratio p. The analytical results match quite well for all video traces. As the packet loss ratio p grows, more frames are lost and it is more probable that these losses interact, producing greater cut lengths. As LOTRIII and Matrix have a similar P cut, their Average Cut Lengths tend to the same value when p decreases, something that does not happen with TokyoQP4, TokyoQP1 and StarWarsIV.

Fig. 2
figure 2

Average cut length and decodable frame rate for different video traces in a network with independent losses

Figure 2b shows the simulation results for Decodable Frame Rate (Q) versus the independent packet loss ratio. Again, the analytical results match quite well with the simulations. As the packet loss ratio grows, more frames are lost and a lower Q is obtained.

As the bit-rate grows, the number of packets per frame grows, and therefore the frame loss probabilities increase too. So, the same video (e.g. TokyoQP4) with a higher bit-rate (TokyoQP1) will suffer more losses and will have a worse Decodable Frame Rate, as seen in Fig. 2b. However, the Average Cut Length can be smaller, as seen in Fig. 2a when p < 0.01. Basically, TokyoQP1 suffers more cuts than TokyoQP4 in the same scenario, but these cuts are shorter. This can be good, for example, for video recovery techniques such as interpolation, which work well with few losses. This possible disparity between the Decodable Frame Rate and the Average Cut Length reflects the importance of not only considering the Decodable Frame Rate for a QoE metric, but of complementing it with the distribution of the cuts, as proposed in this paper.

So far, the results have been validated for a scenario with an independent packet loss ratio, where the frame loss probabilities depend only on the frame sizes. In the following, we simulate a network scenario where packet losses, and hence frame losses, can present correlation. We check the validity of the formulation when the independence of frame losses is not assured.

4.2 Model validation in a network scenario with real traffic

The network scenario, Fig. 3, is similar to the scenario used in other papers [29]. Each switching node is modelled as a router with a finite queue on the output port. There are three background traffic flows in the network. Each background traffic flow competes with the video traffic for resources (bandwidth and queue space) in only one router, i.e., FlowA goes from LAN A1 to LAN A2 through RouterA, FlowB goes from LAN B1 to LAN B2 through RouterB, and FlowC goes from LAN C1 to LAN C2 through RouterC. Each background traffic flow is generated from different Ethernet packet traces from the WIDE project’s MAWI Working Group Traffic Archive [2, 17]. The 2010/04/13 set from the Day in the Life of the Internet project is used. Chronologically consecutive Ethernet traces are concatenated to reach at least the duration of the video flow, and then the packets' inter-arrival times and sizes are extracted. Each background traffic source uses these inter-arrival times and sizes to generate the corresponding background traffic. The Ethernet traces have an average rate of 200 Mbps, while the simulation links have 2 Gbps of bandwidth. Larger background traffic rates are created by multiplexing several Ethernet traces before the extraction process. As more background traffic is added at each hop, the packet loss ratio grows.

Fig. 3
figure 3

Network scenario with background traffic

Figure 4 shows the first four terms of the cut length Probability Mass Function (P cut) versus the experimental packet loss ratio, only for Matrix and TokyoQP1. The analytical results match quite well with the simulation ones for both video traces. Again, as the packet loss ratio decreases, the cut length Probability Mass Function tends to stabilize and the analytical formulation can be simplified by assuming that cut lengths that imply the loss of multiple frames will not happen (see Section 5).

Fig. 4
figure 4

Cut length Probability Mass Function for different video traces in the network scenario with background traffic

Figure 5 shows the simulation results for the Average Cut Length (L cut) and the Decodable Frame Rate (Q) versus the experimental packet loss ratio. Again, the analytical results match quite well for both video traces for small packet loss ratios. For high packet loss ratios (p ∼ 0.04) the analytical results differ from the simulation ones. This happens for high background traffic rates, where the router’s queue is saturated most of the time and the losses are bursty. The analytical model needs independent packet losses, so for high background traffic rates the model is not valid. Production networks usually have small packet loss ratios, so the analytical model can be used in realistic packet switching scenarios, and the frame loss probabilities can be obtained from the packet loss ratio and the average number of packets per frame.

Fig. 5
figure 5

Average Cut Length and Decodable Frame Rate for different video traces in the network scenario with background traffic

For moderate or low packet loss ratios (p ≤ 0.01), the model accuracy is good. The analytical model for the Average Cut Length has an error below 0.5 frames or 20 ms (40 ms inter-frame time) for p = 0.01. The authors in [22, 23] show that the MOS does not change significantly, even in the most sensitive range, for variations in cut lengths below the inter-frame time (Figure 7 in [23]). The analytical model for the Decodable Frame Rate has an error below 3 % for p = 0.01. The authors in [3, 4] relate this rate to the PSNR. They show that the PSNR value greatly depends on the content under study, but for the same content the PSNR does not change significantly for variations in the Decodable Frame Rate below that range.

5 Model evaluation results

The analytical model for the Decodable Frame Rate shows that as the bit-rate grows, the Decodable Frame Rate decreases. This happens because the number of packets per frame grows, and therefore the frame loss probabilities for a packet switching network, P τ , increase too. It can be checked in Figs. 2b and 5b with the extreme cases of TokyoQP1 and TokyoQP4. TokyoQP1 has a bit-rate six times greater than TokyoQP4, and therefore the difference in Decodable Frame Rate is noticeable.

However, the analytical model for video cuts, P cut[c], does not show such a clear behaviour with the bit-rate. The Average Cut Length can be smaller for the higher bit-rate case, as seen in Fig. 5a when p < 0.01. TokyoQP1 has more cuts than TokyoQP4, but these cuts are shorter. Video recovery techniques such as interpolation work well in scenarios with few losses.

This disparity between the Decodable Frame Rate and the Average Cut Length reflects the importance of considering both parameters together, not only the Decodable Frame Rate. TokyoQP1 will always have a worse Decodable Frame Rate than TokyoQP4, but for small p this can be compensated by the improvement in the Average Cut Length. In other cases this compensation can happen for any p.

In the previous section it was shown that the analytical formulation can be simplified when the packet loss ratio p is low. If p is small, it can be assumed that the probability of packet losses in adjacent frames tends to zero, i.e., only cuts resulting from single frame losses will be considered. This leads to:

  • Only cuts of length c = 1, i * M + z * (M − 1) or N + z * (M − 1) are possible.

  • All terms not related to the lost frame can be removed from the analytical formulation, i.e., terms of type 1 − P {I, P, B} can be removed.

Taking these assumptions into account, simplified versions of the number of cuts can be computed:

  • Number of cuts of c frames length

    $$ \label{eq:Ncut_Approx} N_{\rm cut}[c] \approx \begin{cases} N_G * P_B * N_B & \text{ for $c=1$ }\\ N_G * P_P & \text{ for $c=i * M + z * (M-1)$ }\\ N_G * P_I & \text{ for $c=N + z * (M-1)$ } \\ 0 & \text{ otherwise } \end{cases} $$
    (13)
  • Total number of cuts

    $$ \label{eq:NcutApprox} T_{\rm cut} \approx N_G * P_B * N_B + N_G * P_P * N_P + N_G * P_I = N_G \sum\limits_{\tau \in \{I, P, B\}}^{} P_{\tau} * N_{\tau} $$
    (14)
  • Cut length Probability Mass Function

    $$ \label{eq:Pcut_Approx} P_{\rm cut}[c] \approx \begin{cases} \frac{ P_B * N_B }{ \sum\limits_{\tau \in \{I, P, B\}}^{} P_{\tau} * N_{\tau} } & \text{ for $c=1$ }\\ \frac{ P_P }{ \sum\limits_{\tau \in \{I, P, B\}}^{} P_{\tau} * N_{\tau} } & \text{ for $c=i * M + z * (M-1)$ }\\ \frac{ P_I }{ \sum\limits_{\tau \in \{I, P, B\}}^{} P_{\tau} * N_{\tau} } & \text{ for $c=N + z * (M-1)$ } \\ 0 & \text{ otherwise } \end{cases} $$
    (15)
  • Average Cut Length

    $$ \begin{array}{rll} \label{eq:CutLength_Approx} L_{\rm cut} & \approx &\frac{ P_B * N_B + \sum\limits_{i=1}^{N_P} \left[ i * M + z * (M-1) \right] P_P + \left[N+z * (M-1)\right] P_I}{ P_B * N_B + \sum\limits_{i=1}^{N_P} P_P + P_I } = \\ & =& \frac{ P_B * N_B + \left[ M \frac{ N_P * (N_P+1)}{ 2 } + z * (M-1) N_P \right] P_P + \left[N + z * (M-1)\right] P_I }{ P_B * N_B + P_P * N_P + P_I } \end{array} $$
    (16)
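The simplified expressions (13)–(16) can be checked numerically. The following Python sketch uses illustrative, hypothetical values for the GoP parameters M, N_P, z and the frame loss probabilities; N_B and N are derived from the GoP structure as N_B = (M − 1)(N_P + z) and N = 1 + N_P + N_B. It builds the cut length Probability Mass Function of (15) and computes the Total number of cuts (14) and the Average Cut Length (16):

```python
def simplified_cut_stats(M, N_P, z, p_I, p_P, p_B, N_G=1):
    """Cut statistics under the low-loss approximation (single frame losses).

    M    : anchor frame distance (M - 1 B-frames between anchors)
    N_P  : number of P-frames per GoP
    z    : 1 if the GoP has trailing B-frames after the last P-frame, else 0
    p_*  : loss probabilities of I-, P- and B-frames
    N_G  : number of GoPs considered
    """
    N_B = (M - 1) * (N_P + z)            # B-frames per GoP
    N = 1 + N_P + N_B                    # GoP length in frames

    # Un-normalized weight of each possible cut length, per Eq. (13)
    weights = {1: p_B * N_B}                         # loss of any B-frame
    for i in range(1, N_P + 1):                      # loss of the i-th P-frame
        c = i * M + z * (M - 1)
        weights[c] = weights.get(c, 0.0) + p_P
    c_I = N + z * (M - 1)                            # loss of the I-frame
    weights[c_I] = weights.get(c_I, 0.0) + p_I

    total_weight = sum(weights.values())
    T_cut = N_G * total_weight                       # Eq. (14)
    pmf = {c: w / total_weight for c, w in weights.items()}  # Eq. (15)
    L_cut = sum(c * pr for c, pr in pmf.items())     # Eq. (16)
    return pmf, T_cut, L_cut, N
```

For example, with M = 3, N_P = 3, z = 1 and all frame loss probabilities equal to 0.01, the possible cut lengths are {1, 5, 8, 11, 14} and the Average Cut Length is ≈ 3.83 frames, dominated by the P- and I-frame terms even though B-frames are far more numerous in the GoP.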

It can be seen in (15) that the probability of a cut of length c frames, \(P_{\rm cut}[c]\), is proportional to the loss probability of the frame type that generates that cut length and to the number of such frames inside a GoP. So, the more B-frames a GoP contains, the more one-frame cuts there will be and the smaller the Average Cut Length will be. The loss of a B-frame does not have a great impact on the average cut length (16) compared to the loss of a P- or I-frame: the effect of a P-frame loss on the average cut length is amplified by a factor of \(M \frac{ (N_P+1) }{ 2 } + z(M-1)\), that of an I-frame loss by a factor of N + z(M − 1), and that of a B-frame loss only by a factor of 1. Although the loss of a frame is a sporadic incident when p is small, the type of the lost frame is very important. Therefore, any attempt to improve the transmission of a video must be based on reducing the loss of I- and/or P-frames.

Taking into account the same assumptions, the Decodable Frame Rate can also be simplified. The expression proposed in [29] tends to 1 (17), but the expression derived from the cut length Probability Mass Function still depends on the GoP structure (and the frame loss probabilities) (18).

$$ \begin{array}{rll} \label{eq:QGoP_Aprrox} Q &=& \frac{ (1-P_I) + (1-P_I) \sum\limits_{i=1}^{N_P} (1-P_P)^i }{ N } \\ && + \frac{ (M-1)(1-P_I)(1-P_B) \left[ \sum\limits_{i=1}^{N_P} (1-P_P)^i + z * (1-P_I)(1-P_P)^{N_P} \right] }{ N }\\ & \approx& \frac{ 1 + \sum\limits_{i=1}^{N_P} 1 + (M-1) \left[ \sum\limits_{i=1}^{N_P} 1 + z \right] }{ N } = \frac{ 1 + N_P + (M-1)(N_P+z) }{ N } = 1 \end{array} $$
(17)
$$ \begin{array}{rll} \label{eq:Q_Approx} Q &=& 1 - \frac{L_{\rm cut} * T_{\rm cut}}{F} \approx 1 - \frac{ L_{\rm cut} }{N} \sum\limits_{\tau \in \{I, P, B\}}^{} P_{\tau} * N_{\tau} \\ & =& 1 - \frac{ P_B * N_B + \left[ M \frac{ N_P * (N_P+1)}{ 2 } + z * (M-1) N_P \right] P_P + \left[N + z * (M-1)\right] P_I }{ N } \end{array} $$
(18)

This is a clear advantage of the presented formulation for computing the Decodable Frame Rate, as it provides a better approximation in low loss scenarios. It also shows the importance of considering the Decodable Frame Rate together with the video cut lengths.
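As a sanity check of (18), the cut-based approximation of the Decodable Frame Rate can be evaluated directly. The following minimal Python sketch uses illustrative, hypothetical parameter values, with N_B = (M − 1)(N_P + z) and N = 1 + N_P + N_B derived from the GoP structure:

```python
def decodable_frame_rate_approx(M, N_P, z, p_I, p_P, p_B):
    """Low-loss approximation of the Decodable Frame Rate Q, per Eq. (18)."""
    N_B = (M - 1) * (N_P + z)            # B-frames per GoP
    N = 1 + N_P + N_B                    # GoP length in frames
    # Amplification of each frame type's loss on the non-decoded frame count
    amp_P = M * N_P * (N_P + 1) / 2 + z * (M - 1) * N_P
    amp_I = N + z * (M - 1)
    return 1 - (p_B * N_B + amp_P * p_P + amp_I * p_I) / N
```

With M = 3, N_P = 3, z = 1 and loss probabilities of 0.01 for all frame types, this yields Q ≈ 0.962, whereas the approximation derived from (17) would simply report Q ≈ 1 and hide the dependence on the GoP structure and on which frame types are lost.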

6 Conclusions

This paper has presented two objective video quality evaluation parameters for a network where losses of video frames can be considered independent: the Decodable Frame Rate and the video cut lengths. The Decodable Frame Rate has been used in previous works, but user perception is affected not only by the number of non-decoded frames, but also by the video playback interruptions caused by the grouping of these non-decoded frames. Therefore, both parameters have to be considered in order to quantify the video quality. The analytical formulation for them has been presented, and the importance of considering the two parameters together has been highlighted.

The analytical model has shown that as the bit-rate grows, the Decodable Frame Rate decreases. However, the Average Cut Length can be smaller for the higher bit-rate case, because there can be more cuts, but shorter ones. This reinforces the importance of considering the two parameters together.

The simplified analytical model shows that, as expected, the loss of a B-frame does not have a great impact on the average cut length compared to the loss of a P- or I-frame. Although the loss of a frame is a sporadic incident when the packet loss ratio is small, the simplified analytical model shows that the type of lost frame is very important. Therefore, any attempt at improving the transmission of a video should be directed at minimizing the Average Cut Length and/or at maximizing the Decodable Frame Rate by reducing the number of frame losses. Based on the analytical results on frame losses, and depending on the GoP structure, the best strategy for this improvement will be the reduction of I-, P- and/or B-frame losses.