1 Introduction

The growth of the Internet has drawn more and more resources into performance analysis. This work compares network performance from an experimental viewpoint. One of its objectives is to analyse the various impacts on transmitted video, such as packet loss, jitter and packet reordering. The results should show whether it is better for video quality to receive packets in the wrong order or not at all. The comparison includes static and dynamic video sequences, different quality levels and, finally, the two currently most used streaming video codecs. In this work, we deal with objective methods for evaluating the quality of video sequences. Subjective assessment methods are very demanding in terms of time and people, and comparing their results with one another is not a simple process.

Objective methods are therefore commonly preferred and offer results that can be used for immediate comparison and verification. Each objective method has a different procedure and a different evaluation metric. The obtained results were used to implement a prediction model able to compute the final quality of a video service according to network behavior. This model can be useful to Internet service providers during the process of network architecture design.

2 State of the art

The recently growing interest in transferring real-time services (such as audio and video) through packet networks based on the IP protocol has intensified the analysis of these services and their behavior in such networks. Logically, the greatest emphasis is placed on the transfer of voice, since this service is the most sensitive to the overall network status. Packet networks based on the IP protocol were not designed to carry delay-sensitive traffic, and without supplementary mechanisms securing quality of service, such a transfer was not capable of providing high-quality interactive communication comparable to the standard PSTN (Public Switched Telephone Network). The lack of synchronization, in comparison to the TDM (Time-Division Multiplexing) used in the PSTN, brings concerns about variable conditions on the network that cause packet loss and fluctuation in delays (jitter). Jitter causes excess packet loss in receiving buffers, depending on the buffer size and delay variance [1]. These two factors (packet loss and overall delay) have a significant influence on the final quality of service. Research is focused on developing new codecs and techniques able to eliminate the effect of packet loss on voice clarity (e.g. Packet Loss Concealment in G.711), and on voice flow prioritization in the network that reduces transmission time [2, 3].

On the other hand, video has become the major part of all data traffic sent via IP networks. In general, a video service is a one-way service (except, e.g., video calls), so network delay is not as important a factor as in a voice service. The dominant network factors that influence the final video quality are packet loss, delay variation and the capacity of the transmission links [4]. Analysis of video quality concentrates on the resistance of video codecs to packet loss in the network, which causes artefacts in the video [5, 6]. The new video codecs VP9 (Google) and H.265 (the Moving Picture Experts Group, MPEG, and ITU-T), released in December 2012 and April 2013 respectively, started the process of comparison and evaluation against their predecessors. Published papers report many tests assessing the suitability of these codecs. The newest video codecs offer higher compression efficiency, in other words better quality at the same video bitrate [7]. However, new codecs need more computing performance, which is the main reason why they are used for very high resolutions such as Full HD (\(1920 \times 1080\)) and 4K (\(4096 \times 2160\)) [8]. Streaming video as a part of triple play services will still be ensured mainly by older video codecs like MPEG-2 and MPEG-4 (H.264).

Nevertheless, a few things are still lacking, such as a comprehensive view of the influence of video parameters on the final video quality. In our previous work [2], we focused on the implementation of a quality prediction model for triple play services, where one part was dedicated to the quality of the video service.

The main motivation behind this work is to extend the mentioned computational model and bring a comprehensive view of video parameters like codec type, character and resolution, and their influence on the resistance to negative network factors. In addition to packet loss, we focused on another network disruption phenomenon called delay variation (also known as jitter). This phenomenon is very often overlooked because a de-jitter buffer is implemented on the receiving side, but for better modeling and prediction of network situations, it is good to know how it influences the final video quality.

3 Methodology

3.1 Video processing

The volume of digital video data is usually described in terms of bandwidth or transfer rate. The bandwidth of a classical digital video transmission without compression reaches hundreds of Mbps, and the amount of picture data grows with increasing resolution. The volume of data is a major problem in the transmission, processing, storage and display of video information. Compared to static images, digital video is very demanding on the memory needed to store it [9].

Standard television broadcasting has a frame rate of at least 25 fps (frames per second) [10], which is sufficient with respect to the persistence of human visual perception. Every second of uncompressed video at 1080p (Full HD) resolution can take up tens of megabytes of memory. Video typically contains a large amount of redundant data, which can be removed using appropriate compression algorithms [9].
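
As a quick sanity check of this figure, the following Python sketch (our illustration, not part of the original measurements) computes the raw data rate of a Full HD stream; the 8-bit YUV 4:2:0 sampling (1.5 bytes per pixel) is an assumption.

# Raw data rate of uncompressed Full HD video, assuming 8-bit YUV 4:2:0
width, height, fps = 1920, 1080, 25      # Full HD at the standard TV frame rate
bytes_per_pixel = 1.5                    # 8-bit luma plus subsampled chroma (4:2:0)
frame_bytes = width * height * bytes_per_pixel
print(f"{frame_bytes / 1e6:.1f} MB per frame")            # ~3.1 MB
print(f"{frame_bytes * fps / 1e6:.1f} MB per second")     # ~77.8 MB/s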

MPEG-2 is one of the most used compression standards. It was approved in 1994. MPEG-2 is built on MPEG-1, and its video coding scheme is a refinement of the MPEG-1 standard. The advantage of the MPEG-2 standard is that it is suitable for coding both progressive and interlaced video. Much functionality, such as scalability, has been introduced. MPEG-2 also defines Profiles and Levels: a Profile describes a degree of functionality, whereas a Level represents resolutions and bitrates, and not all Levels are supported in all Profiles. The most important application of the MPEG-2 standard is digital television broadcasting (DVB-T, DVB-S, DVB-C), but it also specifies the format of movies and other programs distributed on DVDs and similar disks [5, 10].

The latest and today most used compression standard, designed for a wide range of applications ranging from mobile video to HDTV, is MPEG-4 Part 10, also called MPEG-4 H.264/AVC. Some of the feature enhancements in the MPEG-4 H.264/AVC standard over earlier codecs are, e.g., redundant pictures, multiple reference frames, arithmetic variable-length coding and motion compensation blocks of variable sizes. MPEG-4 H.264/AVC defines Profiles and Levels too, but its organization is much simpler than in MPEG-4 Part 2. There are only three Profiles currently defined (Baseline, Main, Extended) [5, 10, 11].

3.1.1 Group of pictures (GOP)

A very important factor that also influences video quality is the frame type. There are three defined types of frames: I, P and B.

I (intra) frames are coded without reference to other frames (without any motion-compensated prediction), in a very similar manner to JPEG (Joint Photographic Experts Group—lossy compression for digital images), which means that they contain all the information necessary for their reconstruction by the decoder. For this reason, they are the essential entry point for accessing a video sequence. An I frame is used as a reference for further predicted frames (P and B). The compression rate of I frames is relatively low.

P (predicted) frames are inter-coded using motion-compensated prediction from a reference frame (the P or I frame preceding the current P frame). Hence, a P frame is predicted using forward prediction, and may itself be used as a reference for further predicted frames (P and B frames). The compression rate of P frames is significantly higher than that of I frames.

B frames are inter-coded using motion-compensated prediction from two reference frames, the P and/or I frames before and after the current B frame. Two motion vectors are generated for each macroblock in a B frame—one pointing to a matching area in the previous reference picture (a forward vector) and one pointing to a matching area in the future reference picture (a backward vector). A motion-compensated prediction macroblock can be formed in three ways—forward prediction using the forward vector, backward prediction using the backward vector, or bidirectional prediction (where the prediction reference is formed by averaging the forward and backward prediction references). Typically, an encoder chooses the prediction mode (forward, backward or bidirectional) that results in the lowest energy in the difference macroblock. B frames offer the highest compression rate.

All these frame types (I, P, B) are grouped together into a sequence with a specific repeating order, called the Group of Pictures (GOP). A GOP must always start with an I frame and can contain only I frames, a combination of I and P frames, or I, P and B frames. The number of B and P frames used within a GOP can be increased or decreased depending on the image content, the compression rate or the application the compressed video is intended for. Various GOP lengths and combinations of P and B frames can be encoded, but mostly a typical GOP pattern is used—IBBPBBPBBPBBI—where each letter represents the viewing order and type of the frame [9, 10, 12].
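
Since B frames reference the I or P frame that follows them, the transmission and decoding order differs from the display order. The following Python sketch, a hypothetical helper added for illustration, derives the decode order for a GOP pattern; the trailing I of the pattern above belongs to the next GOP and is omitted here.

# Derive decode/transmission order from display order: each reference frame
# (I or P) must be sent before the B frames that point to it.
def decode_order(gop="IBBPBBPBBPBB"):
    order, pending_b = [], []
    for display_idx, frame in enumerate(gop):
        if frame in "IP":              # reference frame: decode immediately,
            order.append((display_idx, frame))
            order.extend(pending_b)    # then the B frames that waited for it
            pending_b = []
        else:                          # B frame: buffer until the next reference
            pending_b.append((display_idx, frame))
    return order + pending_b           # trailing B frames wait for the next GOP's I

print(decode_order())
# [(0, 'I'), (3, 'P'), (1, 'B'), (2, 'B'), (6, 'P'), (4, 'B'), (5, 'B'), ...]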

3.1.2 Video transmission

To transfer video, whether in real time or on demand (Video on Demand), fundamentally unreliable protocols are used. The principle consists of sending and receiving data without feedback between the sender and the recipient. The factors affecting video transmission are [2, 9]:

  • Latency—the time that elapses between sending a message from the source and its reception at the destination node.

  • Packet order—variability in packet delivery time to the destination node causes packets to arrive in an incorrect order.

  • Packet loss—occurs when one or more packets of data travelling across a computer network fail to reach their destination. It is most often expressed as a percentage.

  • Bandwidth—this expresses the capacity of the transmission channel.

  • Delay—caused by overfilling of the packet queue on the outgoing interface.

Network factors like latency and delay are critical for real-time services such as Voice over IP (VoIP), where communication flows in both directions at the same time. Video broadcasting is typically a one-way service (in the direction from content provider to end user), which is why we have focused mainly on packet loss and delay variation as the primary network impairments affecting the final service quality.

3.1.3 Methods for evaluating the quality

In this work, we used two objective methods—VQM and SSIM. Objective evaluation involves the use of computational methods that produce a “score” for the quality of the investigated video. These methods measure physical characteristics of the video signal, such as amplitude, timing and signal-to-noise ratio.

The SSIM (Structural Similarity Index) is a method for measuring the similarity between two images. Both SSIM and VQM are full reference metrics; in other words, image quality is measured against an initial uncompressed or distortion-free reference image. SSIM is designed to improve on traditional methods like peak signal-to-noise ratio (PSNR) and mean squared error (MSE), which have proven to be inconsistent with human visual perception. SSIM considers image degradation as a perceived change in structural information. Structural information builds on the idea that pixels have strong inter-dependencies, especially when they are spatially close, and these dependencies carry valuable information about the structure of the objects in the visual scene. The index combines three components—the similarity of intensity, the corresponding contrast and the corresponding structure—into one value, as shown in Fig. 1.

Fig. 1 The block diagram of the SSIM index metric [13]

This method differs from others by evaluating structural distortion rather than error rate, the main reason being the characteristics of the human visual system. The SSIM method achieves a good correlation with the subjective impression; its rating is defined in the interval [0, 1], where 0 represents the worst value and 1 the best one (identity) [2]. The final SSIM value is a combination of three parameters and, for an original signal \(x\) and an encoded signal \(y\), is defined as follows [2, 10, 13]:

$$\mathrm{SSIM}(x,y) = \left[ l(x,y) \right]^{\alpha} \left[ c(x,y) \right]^{\beta} \left[ s(x,y) \right]^{\gamma}$$
(1)
  • Element \(l(x,y)\) compares the luminance of the signals

  • Element \(c(x,y)\) compares the contrast of the signals

  • Element \(s(x,y)\) measures the structural correlation

  • \(\alpha > 0,\, \beta > 0,\, \gamma > 0\) weight the individual elements
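
For illustration, Eq. (1) can be evaluated directly in a few lines of Python. The sketch below computes a single global SSIM value over two grayscale frames with \(\alpha = \beta = \gamma = 1\); the stabilising constants follow the common choice \(C_3 = C_2/2\), and a windowed implementation (as used by the actual metric) would apply the same formula over local neighborhoods.

import numpy as np

def ssim(x, y, L=255, k1=0.01, k2=0.03):
    # x, y: two grayscale frames as numpy arrays of identical shape
    x, y = x.astype(np.float64), y.astype(np.float64)
    C1, C2 = (k1 * L) ** 2, (k2 * L) ** 2
    C3 = C2 / 2
    mx, my = x.mean(), y.mean()
    sx, sy = x.std(), y.std()
    cov = ((x - mx) * (y - my)).mean()
    l = (2 * mx * my + C1) / (mx ** 2 + my ** 2 + C1)   # luminance comparison
    c = (2 * sx * sy + C2) / (sx ** 2 + sy ** 2 + C2)   # contrast comparison
    s = (cov + C3) / (sx * sy + C3)                     # structural correlation
    return l * c * s                                    # 1.0 for identical frames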

The VQM (Video Quality Measure) metric computes the visibility of artefacts expressed in the DCT domain (Discrete Cosine Transform—expresses a finite sequence of data points as a sum of cosine functions oscillating at different frequencies). Figure 2 shows the block diagram of this metric, which can be divided into nine steps.

Fig. 2 The block diagram of the VQM metric [12]

The input of the metric is a pair of color image sequences—the reference one and the test one. Both sequences are cropped, then transformed to a blocked DCT and afterwards converted to units of local contrast. In the next step, the input sequences are subjected to temporal filtering, which implements the temporal part of the contrast sensitivity function. The DCT coefficients, expressed in local contrast form, are then converted to just-noticeable differences (JNDs) by dividing them by their respective spatial thresholds, which implements the spatial part of the contrast sensitivity function. After the conversion to JNDs, the two sequences are subtracted to produce a difference sequence, to which a contrast masking operation is applied. Finally, the masked differences are weighted and pooled over all dimensions to yield summary measures of visual error [12]. The output value of the VQM metric indicates the amount of distortion of the sequence—for no impairment the value is equal to zero, and the output value rises with the level of impairment [12].
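
The nine steps above can be summarized in the following structural sketch in Python. It is an outline only: the cropping, local-contrast conversion, temporal filter and masking/pooling weights are placeholders for the calibrated values defined in [12], and the jnd_thresholds matrix is an assumed input.

import numpy as np
from scipy.fft import dctn

def blocked_dct(frame, block=8):
    h, w = (d - d % block for d in frame.shape)
    blocks = (frame[:h, :w]
              .reshape(h // block, block, w // block, block)
              .transpose(0, 2, 1, 3))            # -> grid of 8x8 blocks
    return dctn(blocks, axes=(2, 3), norm="ortho")

def vqm_sketch(reference, test, jnd_thresholds):
    # steps 1-3: crop (omitted), blocked DCT, local contrast (omitted)
    ref = [blocked_dct(f) for f in reference]
    tst = [blocked_dct(f) for f in test]
    # step 4: temporal filtering would be applied here (placeholder)
    # step 5: convert to just-noticeable differences via spatial thresholds
    ref = [r / jnd_thresholds for r in ref]       # jnd_thresholds: 8x8 matrix
    tst = [t / jnd_thresholds for t in tst]
    # step 6: difference sequence; step 7: contrast masking (omitted)
    diff = [t - r for r, t in zip(ref, tst)]
    # steps 8-9: weight and pool the masked differences into one score
    return float(np.mean([np.abs(d).mean() for d in diff]))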

3.2 Video quality evaluation

The aim of the measurement was to simulate the effect of packet loss and jitter on the video formats MPEG-2 and MPEG-4, to determine the impact on the resulting image using objective methods for measuring video quality, and to compare the results. We made measurements for one static and two dynamic videos of 25 s each. All videos were measured at resolutions of \(720 \times 576\) (PAL), \(1280 \times 720\) (HD) and \(1920 \times 1080\) (Full HD). The static video was represented by TV news (slow motion), the first dynamic video by a NASA space shuttle launch, and the second dynamic video, with the highest bitrate (70 Mbps in Full HD), by the open source animated movie Big Buck Bunny, which is commonly used for testing purposes.

The whole measurement process is shown in Figs. 3 and 4. To evaluate the quality, we used the SSIM and VQM methods; SSIM correlates better with the perception of the human eye [10, 11]. We evaluated these methods using the MSU VQM Tools. As a first step, we created a stream in VLC Player. The video content was streamed via RTP/UDP/IP using MPEG-2 (TS) and H.264 (MP4) (Table 1).

Table 1 Parameters for measurements
Fig. 3 Measurement procedure

Fig. 4 Comparing the tested sequence with the original video

We captured the broadcast stream on the local computer interface using VLC Player; this video was saved and tagged as the original video. Our testing scenarios reflect situations that can happen in a real network. Mobile networks in particular, capable of using the IP architecture, like UMTS (the Universal Mobile Telecommunications System, a 3rd generation mobile cellular system) and LTE (Long-Term Evolution, 4th generation), reach high values of packet loss and delay variation [14]. To configure our scenarios, we used the Linux tool Netem. Netem provides network emulation functionality for testing protocols by emulating the properties of wide area networks. The current version emulates variable delay, loss, duplication and reordering [15].

We set the packet loss on the interface to 1 % using Netem and then repeated the whole measurement. We then repeated this step for packet loss values increasing in 1 % steps: 1, 2, ..., 10 %.

For the purpose of delay variation (jitter) simulation, we set 25 % of packets to be delayed (the results of our previous research [2] showed that approximately 25 % of all traffic had a different one-way delay). We repeated the measurements for delay variations of 10, 20, 30, 50, 75 and 100 ms. When streaming the videos, we set the de-jitter buffer in VLC to 0 so that the configured delays were processed without any buffering impact.

3.3 Netem settings

# tc qdisc add dev eth0 root netem loss 1%    // add packet loss on interface

# tc qdisc change dev eth0 root netem loss 2%    // change packet loss on local interface

This causes 2 % of packets (i.e., 2 out of every 100) to be dropped at random. The videos used for the measurement were already in the target formats, so we did not apply any additional transcoding when creating or capturing a stream.

# tc qdisc add dev eth0 root netem delay 10ms reorder 75% 50%    // set packet delay on local interface

In this example, 75 % of the packets (with a correlation of 50 %) are sent immediately, and the others are delayed by 10 ms. In our case, the 50 % correlation means that the delayed share of all data traffic oscillates around a value of 25 %. This setting simulates the network behavior more accurately.

# tc qdisc change dev eth0 root netem delay 20ms reorder 75% 50%    // change packet delay on local interface
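
The individual tc commands above can also be scripted. The following Python sketch automates the two sweeps described in Sect. 3.2; the interface name is an assumption, the run_measurement hook is hypothetical, and the commands require root privileges.

import subprocess

IFACE = "eth0"   # assumption: the interface carrying the stream

def set_netem(params):
    # "replace" acts as add-or-change, so it also works for the first call
    subprocess.run(["tc", "qdisc", "replace", "dev", IFACE, "root", "netem"]
                   + params.split(), check=True)

for loss in range(1, 11):                     # 1, 2, ..., 10 % packet loss
    set_netem(f"loss {loss}%")
    # run_measurement(f"loss_{loss}")         # hypothetical capture/compare step

for jitter in (10, 20, 30, 50, 75, 100):      # delay variation sweep in ms
    set_netem(f"delay {jitter}ms reorder 75% 50%")
    # run_measurement(f"jitter_{jitter}")

subprocess.run(["tc", "qdisc", "del", "dev", IFACE, "root"], check=True)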

3.4 Evaluation of the results

We compared the original sample with the tested sample, which contained the damage caused by our settings, in the MSU VQMT tool. For correct opening of the video formats, it was necessary to use an AviSynth plugin, which is able to open many video formats. Using this script is simple; an example of its application is listed below:

DirectShowSource("c:\folder\myclip.mpg")

This single line, written in a text editor and saved as “name.avs”, is sufficient for correct functioning in MSU VQMT. DirectShow is a multimedia playback system from Microsoft. It can read most of the formats that Media Player can play, including MPEG, MP3 and some QuickTime files, as well as AVI files. MSU VQMT works with the AviSynth script, which is responsible for playing the video; MSU VQMT then compares every single frame with its reference. MSU VQMT offers many objective video metrics other than those we have mentioned, but VQM and SSIM are the most commonly used and are fully accepted by the scientific community.

The tool exports the results into CSV format, where we can find the value for every compared frame together with the total average value for the whole video.
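
A short Python sketch for post-processing such an export is shown below; the exact column names and delimiter vary with the MSU VQMT version, so the ones used here are assumptions.

import csv

def load_scores(path, column="SSIM"):
    # returns the per-frame values and their average for one exported run
    with open(path, newline="") as f:
        rows = list(csv.DictReader(f))
    values = [float(r[column]) for r in rows if r.get(column)]
    return values, sum(values) / len(values)

per_frame, average = load_scores("results.csv")   # hypothetical file name
print(f"{len(per_frame)} frames, average = {average:.4f}")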

4 Results

The results of the measurements verified our prediction that the type of video codec is not the only factor with a degradation impact on video quality. Video resolution, on the other hand, was not proven to be a significant parameter of video robustness.

All measured results led to the creation of a video quality prediction model. Our goal was to find regression equations with the highest R-squared factor. R-squared is a statistical measure of how close the data are to the fitted regression line; regression calculates an equation that minimizes the distance between the fitted line and all of the data points. These regression equations were calculated with the statistical program Statgraphics and, as can be seen, they have an exponential character. The proposed model was then tested with values obtained by network simulation. The predicted values showed good correlation with the measured values; the statistical deviation was approximately 5 %, which corresponds to the computed R-squared factor.

The video compression type appeared to be the most important factor. The video codec H.264 (MPEG-4 Part 10) is more sensitive to the packet loss rate in the network infrastructure than the older MPEG-2. According to the results of the static video measurements shown in Table 2, there is no big difference between the resolutions used; we detected only a slight decrease at higher resolutions. During the static scene, where changes were very slow, the P and B frames contained approximately the same information regardless of the resolution used [12].

Table 2 Static video measurements results

Table 3 describes the results of the first tested dynamic video sequence. This sequence achieved worse results than the static video. Again, the differences between the used resolutions were small. All the GOP frames contain more information, so packet loss has a greater influence on picture reconstruction. That is the reason why dynamic video is generally more sensitive to data loss [9, 12].

Table 3 VQM and SSIM results for first tested dynamic video

The third video also has a dynamic character, but a very high bitrate compared to the previous two videos. The values obtained for this dynamic video are listed in Table 4.

Table 4 VQM and SSIM results for the second tested dynamic video

For a better illustration, Figs. 5 and 6 demonstrate the video quality results for Full HD resolution.

Fig. 5 SSIM results for MPEG-2

Fig. 6 SSIM results for MPEG-4 (H.264)

A high bitrate means a lot of information contained in the I, P and B frames, and its loss causes significant degradation of video quality. These results support the argument about the importance of the codec type, its bitrate and the video character: higher resolution alone does not bring significant packet loss resistance.

This paper follows on from our previous work [2] and extends the video prediction model used there. All the results mentioned above were processed into the following regression equations. In the equations described here, X represents the packet loss in % (range 0.5–10), and all coefficients are shown in Tables 5, 6 and 7.

  • Slow-motion video. MPEG-2

    $$\mathrm{SSIM} = \alpha \left( a + b X^{2} \right) + \beta \left( a + b \sqrt{X} \right) + \gamma \left( a + b X^{2} \right)$$
    (2)

    MPEG-4(H.264)

    $$\mathrm{SSIM} = \alpha \sqrt{a + \frac{b}{X}} + \beta \left( \frac{1}{a + b X} \right) + \gamma \exp \left( a + b X \right)$$
    (3)

    All the necessary coefficients are presented in Table 5. Because measurements were performed on two dynamic videos, the following regression equations represent the prediction model for both of them.

  • Dynamic video with ordinary and high bitrate. MPEG-2

    $$\mathrm{SSIM} = \alpha \left( \frac{1}{a + b X} \right) + \beta \left( a + b \ln X \right)$$
    (4)

    MPEG-4(H.264)

    $$\mathrm{SSIM} = \alpha \left( a + b \ln X \right) + \beta \left( \frac{1}{a + b X} \right)$$
    (5)

    Tables 6 and 7 contain the coefficients for these two equations. All regression models described here achieved an R-squared factor (\(R^{2}\)) higher than 92 %, which represents a high goodness of fit. An illustrative re-fit of this model form in Python is sketched below.
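
For illustration, the model form of Eq. (5) can be re-fitted in Python (the paper used Statgraphics). In the sketch below the weights \(\alpha, \beta\) are folded into the per-term coefficients, which leaves the model form unchanged but makes the fit identifiable; the data points are hypothetical placeholders, not the values measured in our experiments.

import numpy as np
from scipy.optimize import curve_fit

def ssim_model(X, c0, c1, a, b):
    # Eq. (5) with alpha, beta absorbed: c0 + c1*ln(X) + 1/(a + b*X)
    return c0 + c1 * np.log(X) + 1.0 / (a + b * X)

X = np.array([0.5, 1, 2, 3, 5, 7, 10])                      # packet loss in %
y = np.array([0.97, 0.94, 0.90, 0.86, 0.80, 0.75, 0.70])    # hypothetical SSIM

popt, _ = curve_fit(ssim_model, X, y, p0=[0.9, -0.05, 10.0, 1.0])
residuals = y - ssim_model(X, *popt)
r_squared = 1 - np.sum(residuals ** 2) / np.sum((y - y.mean()) ** 2)
print(f"R^2 = {r_squared:.3f}")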

The second group of measurements led to an analysis of the degradation effect of delay variation (jitter). The results of the performed tests, shown in Figs. 8 and 9, uncover a critical boundary at 20 ms; above this value, a significant reduction in final video quality is observed. This boundary indicates that if a network provider is able to keep the jitter from exceeding 20 ms, the use of a de-jitter buffer on the receiving side would not be necessary. The value of 20 ms corresponds to the typical interval between transmitted RTP audio/video datagrams. Because decompressing and processing the video stream requires a certain amount of time, both codecs are tolerant of a small delay variation.

Table 5 Coefficients for static video
Table 6 Coefficients for MPEG-2 dynamic videos
Table 7 Coefficients for MPEG-4 (H.264) dynamic videos

As can be seen in Fig. 7, packet loss causes artefacts in the image, whereas delay variation degrades the overall image sharpness. The video affected by delay variation is more blurred in comparison to the video with some packets missing.

Fig. 7 Comparison of packet loss and jitter effects on video quality

Fig. 8 Results of delay variation measurements—HD resolution

Fig. 9 Results of delay variation measurements—Full HD resolution

The behavior of the two dynamic videos is approximately at the same level, with bigger differences between MPEG-2 and MPEG-4 for the static video. In real operation, a de-jitter buffer is typically used to eliminate this phenomenon. Nevertheless, our experiments showed a significant degradation impact on video quality beyond a particular level of delay variation.

5 Conclusion

This article brings a detailed view of video streaming performance over an IP-based network. The measured results showed the relation of video codec type and bitrate to the final video quality. These results have helped us to create and extend our previous mathematical models of video streaming behavior. The second part of the measurements was dedicated to another adverse network impact on video quality, jitter. The results proved the importance of de-jitter buffer implementation not only for voice services but also for video streaming services.

The extended prediction model of video quality described in this paper is easy to implement in any programming language, making it a simple and fast tool for predicting video behavior in the network. Constant network monitoring, along with intervention in network performance as needed, appears to be a viable method for securing at least a minimal QoS level in a packet network. The application aims to be a helpful tool for designing network infrastructure with regard to achieving at least a minimal QoS level.

Our future work will focus on two directions. First, the new generation of video codecs, such as H.265 and VP9: the tools we used do not support these new codecs yet, so we will include them in our model as soon as support becomes available. Second, we will analyze the impact of encryption on QoS. Security is a highly discussed topic nowadays, and technologies such as IPsec, VPN/SSL and SRTP are used more and more frequently to secure voice and video content. Computational mathematical models should therefore handle this new situation.