1 Introduction

With the growth of video applications, video has become an increasingly important part of daily life. The demand for high-quality video keeps rising, which requires more capacity and bandwidth. Because of unstable bandwidth and complex transmission environments, packet loss often occurs in streaming video and degrades the perceived video quality. Owing to the prediction structure of the codec, a frame affected by packet loss may also impair the frames that follow it. Even within the same frame, packet loss in different regions (e.g., regions with intense or slow movement) degrades video quality to different degrees [4, 33]. Accurately measuring the influence of packet loss is therefore a challenging problem.

Many research works have been devoted to measuring the impact of packet loss on video quality [16]. The most reliable way is to collect judgements from many viewers, since humans are the final receivers of video, as in [11, 17]. However, such subjective methods are time-consuming and labor-intensive, which makes objective quality evaluation popular. Considering the diversity of lost packets, numerous studies have explored the quality change of video under specific conditions, mainly different packet loss rates [6], different distributions of lost packets [5, 25], packet loss affecting different frame types (e.g., I, P, and B frames) [12, 27], packets lost under different Group of Pictures (GOP) patterns [31], and videos with different resolutions [3]. These works usually analyze the relationship between packet loss and traditional objective metrics, e.g., peak signal-to-noise ratio (PSNR) and structural similarity index (SSIM). Traditional objective metrics are simple and easy to calculate, but they usually require the complete original video. Unfortunately, the original video is not always available; in particular, when packet loss occurs, much of the information in the video is lost. There is therefore a need for no-reference (NR) methods, which estimate the influence of packet loss from the limited available input information.

During the past years, several NR methods have been proposed. In general, an NR method consists of two main steps: analyzing the characteristics of the corrupted video stream and predicting video quality. The prediction models are usually established either by explicit formulations or by learning-based methods. The former incorporates the analyzed characteristics into explicit formulas, as in [7, 21, 29, 30], while the latter uses machine learning techniques to derive the quality from a set of features. Since the selected characteristics determine the accuracy of the prediction model, choosing effective characteristics is a key point for NR methods. Several features are extracted at the bit-stream level [8, 28, 34, 35], such as the bit rate, the frame type, the packet length, and the packet loss rate. These bit-stream-based features can be easily obtained after decoding the packet and the video frame header. However, such parameters sometimes cannot exactly capture the impact of packet loss on a specific video, because the same parameters may affect different video contents differently [1]. In that case, pixel-level features are analyzed by utilizing the decoded video information. Because of the multidimensional nature of video, pixel-based techniques usually consider both spatial and temporal features. For example, the 3D shearlet transform and the 3D discrete cosine transform are used in [19] and [20], respectively, and the statistical analysis of the transform coefficients characterizes the spatiotemporal statistics of videos from different views. In [18], spatial features come from the statistical analysis of spatial activity and edge discontinuity indices, and temporal features are derived from motion vector (MV) information. A quality model based on time-domain characteristics was proposed in [22], which statistically analyzes the residual map of adjacent frames to predict video quality. The work in [2] extracted features from the spatial and frequency domains: the Canny operator was used to extract edge information in the spatial domain, and the Discrete Cosine Transform (DCT) was applied to the video frames to obtain frequency-domain features.

Up to now, most existing NR methods focus on the impact of packet loss on whole frames or the video sequence, and rarely study the impact on local regions within one frame. Unfortunately, not all packet loss artifacts are the same; the sensitivity of video to packet loss artifacts varies widely. Valenzise et al. [32] showed that after applying error concealment (EC), some packet loss artifacts are significantly alleviated, whereas others remain visible. The effectiveness of EC depends on several factors, e.g., the motion complexity and local texture of the lost region. In addition, because of the temporal prediction used by the codec, errors in one frame may spread to the following frames, which is called error propagation. Thus, packets lost in regions with different coding patterns can lead to very different results, so exploring the impact of packet loss on local regions is of great importance. Korhonen [17] focuses on the visibility of packet loss artifacts appearing in spatially and temporally limited regions of a video sequence: the corrupted macroblocks (MBs) are grouped into error clusters, and a subjective test is then carried out to obtain the visibility of each cluster. This work narrows the analysis to a partial spatiotemporal region, but the results depend significantly on how the error clusters are formed. A recursive distortion model [7] was proposed by analyzing the propagation of transmission errors due to packet loss. Although the model is accurate at the MB level, it requires a recursive operation for every pixel in an MB, which increases the computational complexity.

In this paper, an error sensitivity model is proposed to measure the video quality affected by various packet losses. Different from traditional methods that evaluate quality on the basis of whole frames or the video sequence, the proposed model focuses on measuring the impact of a single lost packet on the local region. Once a block is lost, its internal information is completely lost, so the impact of the packet loss on the region is unknown. We first define the error sensitivity index, based on the number of error pixels that remain after applying an EC algorithm. Then, inspired by the spatiotemporal correlation of video sequences, we extract available features from the correctly received blocks in the spatial and temporal domains. Finally, machine learning is applied to learn a mapping from the feature space to the error sensitivity of videos. Our model can give some guidance on feature selection in related fields and can also point to directions for improving error concealment algorithms. Moreover, most existing methods operate on whole video frames and cannot provide theoretical support for improving local regions, which our approach complements. Since many factors affect video quality (compression, blur, packet loss, etc.), unless otherwise noted, the errors mentioned in this paper refer to packet loss. In H.265/HEVC, multiple largest coding units (LCUs) can be packetized into one packet for transmission. To simplify the problem, we assume that each LCU is carried in a separate packet, so packet loss in this paper refers to loss in the form of blocks. The major contributions of this paper are summarized as follows.

1) Considering the specific situation of video packet loss, we extract a collection of features from the spatial and temporal domains. Most of these features are simple to compute but closely related to quality.

2) The proposed error sensitivity model pays attention to the severity of damage in local regions of the video and can accurately predict the sensitivity of videos to different packet loss cases. The proposed model is suitable for applications such as video processing and transmission, helping to alleviate the impact of packet loss on video quality.

The rest of this paper is organized as follows. The theory of error sensitivity is introduced in Section 2. Section 3 describes the extraction of the spatial and temporal features, as well as the prediction of error sensitivity from the extracted features. Experimental results are presented in Section 4. Finally, conclusions are drawn in Section 5.

2 The theory of error sensitivity

The quality impact of packet loss on a local region can be regarded as a combination of the degradation caused directly by the packet loss and the degradation induced by error propagation. When a packet loss is detected, the decoder usually applies an EC algorithm to mitigate the degradation. However, local differences within a video sequence lead to different EC performance, so it is difficult to accurately measure the effect of packet loss on local region quality. Not all lost information can be reconstructed intact, and errors in some areas remain very obvious and affect the viewing quality. Because the content characteristics of video regions are diverse (e.g., in motion or texture), packet losses at different positions within the same frame can affect video quality very differently [15]. Quantifying the impact of different packet losses on video quality is therefore an important open issue in video compression and processing. In this paper, error sensitivity, which reflects the sensitivity of a damaged region to errors, is used to evaluate the impact of packet loss. Regions with high sensitivity are more susceptible to errors, and the distortion in these regions usually remains quite noticeable even after simple EC operations. Thus, the error sensitivity can be obtained by counting the number of error pixels that remain after concealment [10]. Since packet loss occurs in the form of blocks, the error sensitivity of a corrupted block yB can be described as:

$$ {y}_B=n/{N}_B $$
(1)

where n is the number of error pixels within the block, and NB is the total number of pixels in the block. The more error pixels there are, the higher the error sensitivity of the block.
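To make the definition concrete, the following minimal sketch (not the authors' code) computes yB for a single block, assuming both the original block and its error-concealed version are available and that a pixel counts as an error pixel when it differs from the original by more than a small tolerance:

```python
import numpy as np

def error_sensitivity(original_block: np.ndarray,
                      concealed_block: np.ndarray,
                      tol: float = 0.0) -> float:
    """y_B = n / N_B (Eq. 1): fraction of pixels still in error after EC."""
    # The tolerance is our assumption; the text only speaks of "error pixels".
    diff = np.abs(original_block.astype(np.float64)
                  - concealed_block.astype(np.float64))
    n_error = int(np.count_nonzero(diff > tol))
    return n_error / diff.size
```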

Figure 1 illustrates the highly sensitive regions of the Traffic and Cactus sequences. Figure 1(a) and (b) are the original frames of Traffic and Cactus, respectively. After being subjected to random packet loss at a rate of 20%, the damaged frames are shown in Fig. 1(c) and (d), where the black blocks mark the lost regions. We assume that a damaged region is highly sensitive when its error sensitivity exceeds 10%. In Fig. 1(e) and (f), the blocks remaining in the lost regions are the highly sensitive regions, and the other white regions are low-sensitivity regions. It is worth noting that regions containing objects or their boundaries are usually more sensitive to errors, whereas the background and regions with consistent content are not.

Fig. 1

The highly sensitive regions of the Traffic and Cactus sequences. (a) the 12th frame of Traffic; (b) the 11th frame of Cactus; (c), (d) the random packet loss results of (a), (b) with packet loss ratio: 20%; (e), (f) the highly sensitive regions in the lost regions of (c), (d)

3 The proposed error sensitivity model

Since the original undistorted videos are not always accessible in many practical applications, it is not straightforward to obtain the sensitivities of regions to errors. Our proposed error sensitivity model therefore predicts the sensitivity from quality-related features. The flowchart is given in Fig. 2. The lost blocks in the video sequences constitute the sample set. For every lost block, the related spatial and temporal features are extracted. After concealing the lost blocks in the training set, the error sensitivity of each block is calculated as described in Section 2. The features and error sensitivities are then used to train a regression module. Finally, the trained model maps the features of the testing dataset to error sensitivities.

Fig. 2

The flowchart of error sensitivity model

3.1 Selection of spatial features

When a block is lost, all the information within the block is lost as well. Fortunately, natural videos possess substantial spatiotemporal regularities: video frames at different times and spatial positions are highly correlated. Based on this property, we extract features from the adjacent blocks in the spatiotemporal domain. The selection of temporal features is detailed in the next subsection; in this section, the spatial features are discussed.

3.1.1 Correctness of adjacent blocks in spatial domain

As a source of information for feature extraction, the correctness of adjacent blocks matters a lot. The number and locations of the available adjacent blocks may affect video quality to different extents. Therefore, the eight surrounding blocks of the current lost block, with the same size as the lost block, are considered. The relation between the lost block and its adjacent blocks is shown in Fig. 3, where B0 is the lost block and Bk (k = 1–8) are the surrounding blocks of B0 in the spatial domain. We assume that the loss probability of each block is independent and use (2) to judge whether a block is correctly received. Since there are eight adjacent blocks, each lost block is associated with an eight-dimensional feature characterizing the correctness of its adjacent blocks in the spatial domain.

$$ C_k=\begin{cases}1, & B_k\ \text{is correct}\\ 0, & B_k\ \text{is lost}\end{cases}\qquad \left(k=1,\dots,8\right) $$
(2)
Fig. 3

The relation of lost block and its adjacent blocks
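As an illustration, the following sketch derives the eight-dimensional correctness feature of (2) from a per-block loss map; the row-major ordering of the 3 × 3 neighbourhood and the treatment of out-of-frame neighbours as unavailable are our assumptions rather than details stated above:

```python
import numpy as np

def spatial_correctness(loss_map: np.ndarray, row: int, col: int) -> np.ndarray:
    """Return C_1..C_8 for the lost block at grid position (row, col)."""
    c = []
    for dr in (-1, 0, 1):
        for dc in (-1, 0, 1):
            if dr == 0 and dc == 0:
                continue                      # skip the lost block B_0 itself
            r, s = row + dr, col + dc
            inside = 0 <= r < loss_map.shape[0] and 0 <= s < loss_map.shape[1]
            # 1 = correctly received, 0 = lost (or outside the frame).
            c.append(1 if inside and not loss_map[r, s] else 0)
    return np.asarray(c, dtype=np.int32)      # hypothetical ordering of B_1..B_8
```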

3.1.2 Textural features

Texture is one of the important characteristics of a picture, describing the surface properties of objects in a region. When a region with detailed texture is corrupted, the performance of a standard EC method may be unsatisfactory, since the surrounding texture information it relies on is not reliable. On the other hand, a region with simple texture is much easier to reconstruct with good quality from the surrounding information. Hence, textural features directly affect the sensitivity of regions to errors. In this work, textural features are calculated with the help of the gray-level co-occurrence matrix (GLCM). The GLCM represents the joint probability that two pixels separated by distance d in direction θ take the gray levels (i, j), which reflects the texture characteristics [9]. Various GLCMs can be calculated for different distances and directions. To reduce the computational complexity, for each correctly received adjacent block we only compute the GLCMs with a pixel distance of 1 in the 0° and 90° directions. These two matrices are then averaged and normalized to mitigate the effect of direction on the results, giving Pd(i, j). Considering the correlation among the textural descriptors derived from the GLCM, we focus on one of them, namely entropy, which measures the regularity versus disorder of pixel values in the block. The entropy Hk is calculated as follows:

$$ {H}_k=-\sum \limits_{i=0}^{L-1}\sum \limits_{j=0}^{L-1}{P}_d\left(i,j\right)\mathit{\log}{P}_d\left(i,j\right) $$
(3)

where L is the number of gray levels, set to 128 in this work. When all elements of the co-occurrence matrix are equal, i.e., the matrix has maximum randomness, the entropy is large, which means the block has high texture complexity. After computing the entropy of all the available surrounding blocks, the texture complexity of the lost block Ec is defined as:

$$ E_c=\begin{cases}\dfrac{\sum_{k=1}^{8}H_k C_k}{\sum_{k=1}^{8}C_k}, & \text{if}\ \sum_{k=1}^{8}C_k\ne 0\\ 0, & \text{otherwise}\end{cases} $$
(4)

In particular, when all the adjacent blocks are lost, the texture complexity of the lost block is fixed to 0.

Ec reflects the average texture complexity of the regions around the lost block. However, when the texture complexity of the surrounding regions varies widely, the mean value alone cannot reflect the textural features effectively. Therefore, we also analyze the differences in texture among the available surrounding blocks, which we call the texture consistency:

$$ {E_c}^{\prime}=\begin{cases}\operatorname{std}\left(H_k,\ k=1,\dots,8\ \text{and}\ C_k\ne 0\right), & \text{if}\ \sum_{k=1}^{8}C_k\ne 0\\ 0, & \text{otherwise}\end{cases} $$
(5)

where std(∙) indicates the standard deviation operation.
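A possible implementation of the textural features is sketched below, under the assumptions that the blocks are 8-bit grayscale, that the GLCM is accumulated directly with NumPy, and that the logarithm in (3) is the natural logarithm:

```python
import numpy as np

L_LEVELS = 128  # gray levels L used in Eq. (3)

def glcm_entropy(block: np.ndarray) -> float:
    """H_k (Eq. 3): entropy of the combined, normalized GLCM (d = 1, 0° and 90°)."""
    q = (block.astype(np.int64) * L_LEVELS) // 256      # quantize 0..255 to 128 levels
    glcm = np.zeros((L_LEVELS, L_LEVELS), dtype=np.float64)
    np.add.at(glcm, (q[:, :-1], q[:, 1:]), 1.0)          # 0° pairs at distance 1
    np.add.at(glcm, (q[:-1, :], q[1:, :]), 1.0)          # 90° pairs at distance 1
    p = glcm / glcm.sum()                                # normalized P_d(i, j)
    nz = p[p > 0]
    return float(-(nz * np.log(nz)).sum())

def texture_features(neighbour_blocks, correctness):
    """E_c (Eq. 4) and E_c' (Eq. 5) over the correctly received adjacent blocks."""
    h = [glcm_entropy(b) for b, c in zip(neighbour_blocks, correctness) if c == 1]
    if not h:
        return 0.0, 0.0                                  # all adjacent blocks lost
    return float(np.mean(h)), float(np.std(h))
```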

3.1.3 Spatial activity

Spatial activity indicates the amount of detail of a video in the spatial domain. It is derived from the gradient and describes the structural information of the video. As shown in Fig. 1, regions with rich structural information, such as edge regions, are more likely to be highly sensitive regions. Taking spatial activity into account can therefore enhance the prediction accuracy of our model. In this paper, we use the definition of spatial activity in [14] with a slight modification: the video frame is filtered with the Sobel operator, and the spatial activity is the standard deviation of all pixel values of the filtered image:

$$ \mathrm{SA}={std}_{M,N}\left( Sobel(F)\right) $$
(6)

where F is the video frame, Sobel(∙) denotes the Sobel filter operation on F, and stdM, N(∙) denotes the standard deviation over all pixel values of the filtered M × N image.

After converting the video frame to a grayscale image, the spatial activity of an available adjacent block SAk can be calculated as:

$$ {SA}_k=\frac{std_{H,W}\left({Sobel}_x\left({B}_k\right)\right)+{std}_{H,W}\left({Sobel}_y\left({B}_k\right)\right)}{2} $$
(7)

where H and W are the height and width of the block, respectively. Sobelx(Bk) and Sobely(Bk) denote applying the Sobel operator to Bk with the horizontal and vertical masks, respectively. Considering the computational complexity, 3 × 3 Sobel masks are adopted. After obtaining the spatial activity values of all the available adjacent blocks, the spatial activity of the lost block ESA can be calculated by:

$$ E_{SA}=\begin{cases}\dfrac{\sum_{k=1}^{8}SA_k C_k}{\sum_{k=1}^{8}C_k}, & \text{if}\ \sum_{k=1}^{8}C_k\ne 0\\ 0, & \text{otherwise}\end{cases} $$
(8)
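A small sketch of the spatial-activity feature, assuming grayscale blocks and SciPy's Sobel filter with 3 × 3 masks, could look as follows:

```python
import numpy as np
from scipy.ndimage import sobel

def block_spatial_activity(block: np.ndarray) -> float:
    """SA_k (Eq. 7): mean of the stds of the horizontal and vertical Sobel responses."""
    g = block.astype(np.float64)
    return (sobel(g, axis=1).std() + sobel(g, axis=0).std()) / 2.0

def lost_block_spatial_activity(neighbour_blocks, correctness) -> float:
    """E_SA (Eq. 8): average SA_k over the correctly received adjacent blocks."""
    sa = [block_spatial_activity(b)
          for b, c in zip(neighbour_blocks, correctness) if c == 1]
    return float(np.mean(sa)) if sa else 0.0
```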

3.2 Selection of temporal features

It is not enough to consider only spatial features in our model. The degradation of video quality results not only from the impairments caused directly by packet loss, but also from error propagation along the direction of inter prediction: packet loss artifacts may spread from frame to frame. Since temporal information plays an important role in error sensitivity estimation, several temporal features are analyzed in this paper.

3.2.1 Correctness of the corresponding block in temporal domain

According to the characteristics of inter prediction, if the reference blocks in previous frames are damaged, the probability of errors occurring in the current block increases, which means the current block has higher sensitivity. Analyzing the correctness of reference blocks is therefore necessary. To simplify the problem, suppose the current frame only references its previous frame. The corresponding block in the previous frame, located at the same position as the lost block in the current frame, is taken into consideration, as shown in Fig. 4. The correctness of this corresponding block B9 can be obtained by (2).

Fig. 4

The relation of lost block and its corresponding block in temporal domain

3.2.2 Motion features

Motion features reflect the motion activity of the video content. Generally, temporal EC works well when the video content has low motion activity, but often leaves noticeable errors in regions with high motion activity. To further explore the relationship between the motion characteristics of a region and the impact of packet loss in that region, Fig. 5 shows scatter plots of motion vector (MV) amplitude versus error sensitivity of image blocks in the BQMall and BasketballPass sequences. For both sequences, as the MV amplitude increases, the error sensitivity of the corresponding image block also tends to increase, i.e., the two are positively correlated. Therefore, it is necessary to analyze the motion characteristics of the video.

Fig. 5

Scatter plot of MV amplitude and error sensitivity of image blocks

To capture the motion features of the lost block, we make full use of the spatiotemporal correlation of the video and analyze the motion characteristics of the related regions in the spatial and temporal domains. For a lost block, the motion information of the four surrounding blocks in the spatial domain (B2, B4, B5, and B7 in Fig. 3) and the corresponding block in the temporal domain (B9 in Fig. 4) is considered. First, the motion vector magnitude of each related block Vk is calculated as:

$$ V_k=\begin{cases}\sqrt{\left|MV_x(k)\right|^{2}+\left|MV_y(k)\right|^{2}}, & C_k=1\\ 0, & C_k=0\end{cases}\qquad \left(k=2,4,5,7,9\right) $$
(9)

where MVx(k) and MVy(k) are the x-axis and y-axis components of the average motion vector of all pixels in the kth block, respectively. Then, the motion intensity of the lost block EV can be estimated by those motion vectors:

$$ E_V=\begin{cases}\dfrac{\sum V_k C_k}{\sum C_k}, & \text{if}\ \sum C_k\ne 0\\ 0, & \text{otherwise}\end{cases}\qquad \left(k=2,4,5,7,9\right) $$
(10)

In addition, by analyzing the motion differences of the related regions, we can obtain the motion consistency of the lost block:

$$ {E_V}^{\prime}=\begin{cases}\operatorname{std}\left(V_k,\ k=2,4,5,7,9\ \text{and}\ C_k\ne 0\right), & \text{if}\ \sum C_k\ne 0\\ 0, & \text{otherwise}\end{cases} $$
(11)

Finally, the motion features of the lost block are characterized by the motion intensity and the motion consistency.
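The following sketch illustrates (9)–(11), assuming that the per-block average motion vectors of the related blocks have already been parsed from the bit stream:

```python
import numpy as np

def motion_features(mvs: dict, correctness: dict):
    """E_V (Eq. 10) and E_V' (Eq. 11) from per-block motion vectors.

    mvs: k -> (mv_x, mv_y) for k in {2, 4, 5, 7, 9}; correctness: k -> 0/1.
    """
    v = []
    for k in (2, 4, 5, 7, 9):
        if correctness.get(k, 0) == 1:
            mvx, mvy = mvs[k]
            v.append(np.hypot(mvx, mvy))          # V_k, Eq. (9)
    if not v:
        return 0.0, 0.0                            # all related blocks lost
    return float(np.mean(v)), float(np.std(v))     # motion intensity, consistency
```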

3.2.3 Temporal randomness

Temporal randomness measures the temporal regularity of video content. Packet loss occurring in regions with regular and irregular motion results in different visual impairments. Following [13], the temporal randomness is obtained by using previous frames to predict the current frame, which can be written as:

$$ r=\mid f\left({t}_2+1\right)-C\hat{A}{X}_{t_1}^{t_2}\mid $$
(12)

where f(t2 + 1) represents the current frame. C, \( \hat{A} \), and \( {X}_{t_1}^{t_2} \) are parameters associated with \( {F}_{t_1}^{t_2} \) (the sequence from the t1th frame to the t2th frame) and carry the information of the previous frames. They come from the observation that successive frames are correlated, so that the video signal can be modeled as a dynamic system:

$$ {F}_{t_1}^{t_2}=C{X}_{t_1}^{t_2}+{W}_{t_1}^{t_2} $$
(13)
$$ {X}_{t_1}^{t_2}=A{X}_{t_1-1}^{t_2-1}+{V}_{t_1}^{t_2} $$
(14)

where \( {X}_{t_1}^{t_2} \) and \( {X}_{t_1-1}^{t_2-1} \) are the state sequences of \( {F}_{t_1}^{t_2} \) and \( {F}_{t_1-1}^{t_2-1} \), respectively. A is the state transition matrix reflecting the regularity of motion, and C encodes the regularity of the spatial information. \( {W}_{t_1}^{t_2} \) and \( {V}_{t_1}^{t_2} \) are the noise terms that cannot be represented by C and A, respectively. By performing a singular value decomposition on \( {F}_{t_1}^{t_2} \), we obtain C and \( {X}_{t_1}^{t_2} \) in (13). Since A reflects the motion information and can be used to predict subsequent frames, the optimal A should represent as much of this information as possible, and is calculated as:

$$ \hat{A}={X}_{t_1+1}^{t_2}{\left({X}_{t_1}^{t_2-1}\right)}^{-1} $$
(15)

where \( {\left({X}_{t_1}^{t_2-1}\right)}^{-1} \) is the pseudo-inverse of \( {X}_{t_1}^{t_2-1} \). Once the parameters of \( {F}_{t_1}^{t_2} \) are obtained, r can be calculated using (12). It is easy to see that r reflects the irregular information that cannot be predicted from the previous frames, which indicates temporal randomness.

In this paper, the temporal randomness is obtained by using the previous frame to predict the current frame. If the motion structures of adjacent frames are similar to each other, the temporal randomness is small. To visualize the temporal randomness, following [13], we scale its values to the range 0 to 255, generating a temporal randomness map; the brighter a point in the map, the stronger the temporal randomness. Figure 6 shows the temporal randomness map for the Traffic sequence. Figure 6(a) and (b) are two consecutive frames of the sequence, and Fig. 6(c) shows the corresponding temporal randomness map. As seen in Fig. 6, the motion in the background is regular and its temporal randomness is rather small, whereas the motion of the cars is unpredictable, corresponding to large temporal randomness.

Fig. 6

The temporal randomness map for the Traffic sequence. (a)-(b) Consecutive frames in Traffic; (c) The corresponding temporal randomness map

Since the temporal randomness describes the regularity of video content across frames, the information from the temporal domain is as important as that from the spatial domain when estimating the temporal randomness of the lost block. In fact, the temporal randomness represents the intensity of change in the local region: it is large when the movement is intense and small otherwise. Therefore, temporal randomness reflects the error sensitivity well, which is why this feature is used in this paper. For every lost block, we analyze the temporal randomness of its eight surrounding blocks and the corresponding block in the previous frame. After computing the temporal randomness of the available related blocks, we take the sum of temporal randomness within each block as rk, and define the average value as the temporal randomness of the lost block Er. The detailed process is expressed as follows.

$$ {r}_k={\sum}_{i=1}^H{\sum}_{j=1}^W\left|r\left(i,j\right)\right| $$
(16)
$$ E_r=\begin{cases}\dfrac{\sum_{k=1}^{9}r_k}{\sum_{k=1}^{9}C_k}, & \text{if}\ \sum_{k=1}^{9}C_k\ne 0\\ 0, & \text{otherwise}\end{cases} $$
(17)

where r(i, j) is the temporal randomness of the pixel at position (i, j).
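A rough sketch of the whole temporal-randomness computation is given below. It follows the dynamic-system formulation above, using a truncated SVD (the number of states and the use of a short window of at least two previous frames are our choices), predicts frame t2 + 1, and aggregates the per-pixel residuals into Er over the available related blocks:

```python
import numpy as np

def temporal_randomness_map(frames, n_states: int = 10) -> np.ndarray:
    """frames: list of 2-D arrays f(t1), ..., f(t2), f(t2+1); returns |r| per pixel."""
    h, w = frames[0].shape
    F = np.stack([f.ravel().astype(np.float64) for f in frames[:-1]], axis=1)
    U, s, Vt = np.linalg.svd(F, full_matrices=False)       # F = C X + W, Eq. (13)
    k = min(n_states, len(s))
    C = U[:, :k]                                            # spatial structure
    X = np.diag(s[:k]) @ Vt[:k, :]                          # state sequence
    A = X[:, 1:] @ np.linalg.pinv(X[:, :-1])                # \hat{A}, Eq. (15)
    pred = C @ (A @ X[:, -1])                               # predicted frame t2+1
    r = np.abs(frames[-1].ravel().astype(np.float64) - pred)   # Eq. (12)
    return r.reshape(h, w)

def block_temporal_randomness(r_map, block_slices, correctness) -> float:
    """E_r (Eqs. 16-17): average per-block sum of |r| over available related blocks."""
    r_k = [r_map[sl].sum() if c else 0.0
           for sl, c in zip(block_slices, correctness)]     # r_k, Eq. (16)
    n_avail = sum(correctness)
    return float(np.sum(r_k) / n_avail) if n_avail else 0.0
```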

3.3 Module regression

As the original video signals are not always available, we cannot directly calculate the error sensitivity according to (1), which means the sensitivity of a region to errors is unknown. Fortunately, machine learning methods have been widely used to derive such an index from numerous features. The sample dataset is divided into a training dataset and a testing dataset, as depicted in Fig. 2. For the training dataset, the extracted features described above and their corresponding ground truth are used to train the model. In total, 15 features are extracted for each lost block, including 11 spatial features and 4 temporal features, namely Ck (k = 1–9), Ec, Ec′, ESA, EV, EV′, and Er. These features are listed in Table 1. To obtain the ground truth of the error sensitivity of a lost block in the training dataset, we assume that the lost block is concealed by the simplest temporal EC method, in which the lost block is directly replaced with the corresponding block in the previous frame. The error sensitivity of the region is then obtained using (1). In this paper, support vector regression (SVR) is adopted to learn the relationship between the features and the error sensitivity index. Specifically, the LibSVM package is used to implement the SVR with the Radial Basis Function (RBF) kernel. The task of SVR is to train a regression model such as (18) so that f(x) and y are as close as possible.

$$ f\left(\boldsymbol{x}\right)={\boldsymbol{w}}^T\boldsymbol{x}+b $$
(18)

where w and b are the parameters of the model, x is the feature set of the missing block, and f(x) is the predicted error sensitivity.

Table 1 Features and feature description

For the testing dataset, the extracted features are fed into the trained SVR model, and its corresponding error sensitivity is then predicted.
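For illustration, the following sketch reproduces the training and prediction steps with scikit-learn's SVR in place of the LibSVM package used in this paper; the 80/20 split matches the setting in Section 4, while the kernel parameters shown are placeholders rather than the optimized values:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

def train_and_predict(X: np.ndarray, y: np.ndarray, seed: int = 0):
    """X: N x 15 feature matrix (Table 1); y: ground-truth sensitivities from Eq. (1)."""
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=seed)
    model = make_pipeline(StandardScaler(),
                          SVR(kernel="rbf", C=1.0, gamma="scale"))  # placeholder params
    model.fit(X_tr, y_tr)                 # learn the mapping features -> sensitivity
    return model.predict(X_te), y_te      # predicted vs. ground-truth sensitivities
```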

4 Experimental results

In this section, experiments are carried out to evaluate the performance of the proposed error sensitivity model. The experiment settings are introduced first, and then the experimental results are reported in detail.

4.1 Experiment settings

In our work, all the experiments are conducted with HM-16.9 and MATLAB R2016a. We use six video sequences (Traffic, Cactus, Kimono, BQMall, BasketballPass, and FourPeople) with different spatial and temporal complexity to evaluate our model. Detailed information about these sequences is summarized in Table 2. All the video sequences are first compressed with the H.265/HEVC reference encoder (HM-16.9) using an IPPP... structure and a fixed QP of 32. A packet loss rate of 20% is considered to study the effect under poor channel conditions. The packet loss is simulated block-wise, and each lost block is 64 × 64 pixels.

Table 2 Information of video sequences
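As an illustration of these settings, a minimal sketch of the block-wise loss simulation (64 × 64 blocks dropped at a 20% rate, with lost blocks blacked out as in Fig. 1, following the independence assumption made in Section 3.1.1) is shown below:

```python
import numpy as np

def simulate_block_loss(frame: np.ndarray, loss_rate: float = 0.2,
                        block: int = 64, seed: int = 0):
    """Return the damaged frame and the per-block loss map (True = block lost)."""
    rng = np.random.default_rng(seed)
    h, w = frame.shape[:2]
    rows, cols = int(np.ceil(h / block)), int(np.ceil(w / block))
    loss_map = rng.random((rows, cols)) < loss_rate
    damaged = frame.copy()
    for r in range(rows):
        for c in range(cols):
            if loss_map[r, c]:
                damaged[r * block:(r + 1) * block, c * block:(c + 1) * block] = 0
    return damaged, loss_map
```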

For each database, 80% of the samples are randomly selected as the training set and the remaining 20% are used as the test set. To avoid any performance bias, we repeat the training and test cycle with 10 different random splits, and the mean values are reported as the final performance scores.

Three criteria are employed in this study to quantitatively measure the performance of the model: Pearson's linear correlation coefficient (PLCC), Spearman's rank-order correlation coefficient (SROCC), and root mean-squared error (RMSE). PLCC and RMSE measure prediction accuracy, whereas SROCC measures prediction monotonicity. Higher PLCC and SROCC values, and an RMSE closer to zero, indicate better agreement with the ground truth.
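These criteria can be computed, for example, as follows (a small sketch using scipy.stats):

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

def evaluate(pred: np.ndarray, gt: np.ndarray):
    """Return (PLCC, SROCC, RMSE) between predicted and ground-truth sensitivities."""
    plcc, _ = pearsonr(pred, gt)           # prediction accuracy (linear correlation)
    srocc, _ = spearmanr(pred, gt)         # prediction monotonicity (rank correlation)
    rmse = float(np.sqrt(np.mean((pred - gt) ** 2)))
    return float(plcc), float(srocc), rmse
```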

4.2 Performance comparison

To investigate the effectiveness of the proposed model, we conduct experiments comparing the proposed model (denoted as Proposed) with existing state-of-the-art NR quality assessment models: DIIVINE [23], NOREQI [24], VBLIINDS [26], and NR_VQA [18]. Among them, DIIVINE and NOREQI are designed to evaluate frame-level quality; for convenience, they are referred to as type A models in the following. They use spatial or frequency features and do not consider any temporal features. To verify the effectiveness of the spatiotemporal feature selection in this paper, we also select two general video quality assessment models (VBLIINDS and NR_VQA), both of which analyze video characteristics in the spatial and temporal domains. These two models and the proposed model are collectively referred to as type B models. For a fair comparison, we extract the features of these models at block level and then train an SVR model with these features to predict error sensitivity. Moreover, the SVR parameters are all optimized to achieve the best performance of each model.

Table 3 lists the performance of all the methods on the six video sequences, where the best performance is highlighted in bold. From Table 3, the PLCC and SROCC values of the proposed model are larger than those of the other models on all six video sequences, and its RMSE values are smaller, which means the proposed model predicts error sensitivity with higher accuracy. It is worth noting that, in most cases, the PLCC and SROCC values of the type B models, which consider video characteristics in the spatiotemporal domain, are larger than those of the type A models, which consider only spatial or frequency features, and the RMSE values of the type B models are smaller, highlighting the importance of temporal information. Especially for FourPeople, which has simple content and low motion activity, the type B models are significantly superior to the type A models, increasing the PLCC and SROCC by at least 0.06. This is because these type B models extract temporal features more accurately when the motion of the video is slow. However, for sequences with intensive motion, such as Kimono and BasketballPass, it is more difficult for all models to predict error sensitivity: the PLCC and SROCC values are all below 0.7, and the RMSE values exceed 0.094. To visualize the statistical significance of the comparison, we take Traffic as an example and show the box plots of the PLCC, SROCC, and RMSE distributions of the different models over 10 trials in Fig. 7(a), (b), and (c), respectively. It is clear that the proposed model performs well among all the NR models under consideration.

Table 3 Performances of the proposed model and the other four models on the six video sequences
Fig. 7

Box plots of the PLCC, SROCC, and RMSE distributions of the models over 10 trials on Traffic. (a) Box plot of the PLCC distribution; (b) Box plot of the SROCC distribution; (c) Box plot of the RMSE distribution

As is well known, generalization capability is a significant concern for all learning-based methods. To evaluate the generalization capability of our model, we conduct cross-dataset experiments in which the models are trained and tested on different datasets. The six video sequences are divided into two parts; considering the balance of sample sizes, three sequences (Traffic, BasketballPass, and FourPeople) are used for training, and the remaining sequences (Cactus, Kimono, and BQMall) for testing. The results are shown in Table 4. Compared with the results in Table 3, the cross-dataset setting reduces the prediction accuracy of both the proposed model and the other four models. Even so, our model maintains stable performance across most sequences, showing better robustness. In addition, compared with the type B models, the performance of the type A models is comparatively low. Because the motion of different sequences varies widely, using spatial or frequency features alone cannot predict the error sensitivity well, which indicates the validity of combining spatial and temporal features.

Table 4 Results of cross-dataset experiments

4.3 Contributions of features

In this paper, 15 features are extracted to train the error sensitivity model. To further investigate their individual and combined contributions to the performance of the model, the following test is conducted. In our experiment, the 15 features are classified into five categories: correctness of the related blocks (in both the spatial and temporal domains), textural features, spatial activity, motion features, and temporal randomness, denoted by I, II, III, IV, and V, respectively. The performance of different combinations of feature types on the BasketballPass sequence is shown in Fig. 8. First, we test each feature type in isolation and rank them in terms of PLCC (SROCC or RMSE could also be used): feature V (0.5393) > feature IV (0.5168) > feature I (0.3908) > feature II (0.3555) > feature III (0.2028). The ranking shows that feature V contributes the most. However, the prediction accuracy of the model using only one feature type is still unsatisfactory, with all PLCC values below 0.6. Second, to find the most effective combination of feature types, we fix the best feature type V and add each of the other four feature types individually. The new ranking is: feature I + V (0.6414) > feature II + V (0.5762) > feature IV + V (0.5554) > feature III + V (0.5387). The performance is clearly better when two feature types are used. Then, we fix the optimal combination I + V and add each of the other three feature types, and so on, until the contributions of all combinations of feature types are evaluated; a sketch of this greedy procedure is given below.
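The procedure is essentially a greedy forward selection over feature groups. In the compact sketch below, groups maps a label (I–V) to the column indices of that feature type, and score() is a hypothetical helper that trains and tests the SVR on the given feature columns and returns the PLCC:

```python
def greedy_group_selection(groups: dict, score) -> list:
    """Greedily add the feature group that most improves the PLCC at each step."""
    selected, remaining = [], set(groups)
    while remaining:
        best = max(remaining,
                   key=lambda g: score([c for s in selected + [g] for c in groups[s]]))
        selected.append(best)
        remaining.remove(best)
        print(selected, score([c for s in selected for c in groups[s]]))
    return selected
```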

Fig. 8

Performances of different combinations of feature types on the BasketballPass

As demonstrated in Fig. 8, all five feature types play important roles in the performance of the error sensitivity model. As feature types are added, the PLCC and SROCC values rise, whereas the RMSE values gradually decrease. When all five feature types are used together, the model achieves its best performance, indicating the feasibility of our model.

5 Conclusions

In this paper, a novel error sensitivity model aiming to explore the impact of different packet losses on local regions is presented. To address the problem of missing information when packet loss occurs, the available information from adjacent regions is studied. Spatial and temporal features related to error sensitivity are considered comprehensively. We detail the extracted features and then map them to error sensitivities using SVR. The experimental results show that our model predicts error sensitivity with high accuracy and is robust across datasets compared with state-of-the-art NR quality assessment models. Moreover, we demonstrate the effectiveness of the feature selection of our model. More importantly, our error sensitivity model can provide guidance and improvement directions for error concealment algorithms, because improving the algorithm for high-sensitivity regions can greatly improve the error concealment effect.

However, the prediction accuracy for videos with high motion activity is still not encouraging. In the future, we will focus on the temporal characteristics and investigate other features to improve the prediction accuracy.