1 Introduction

With the advance of video compression technology, the transmission and storage of video signals have become increasingly common. In networked environments, various channel/network errors damage or destroy compressed video information during transmission or storage. H.264/AVC is the most popular compression coding standard [16], outperforming the previous standards. Because it removes redundancy in the spatial and temporal domains of a video, the resulting bitstream is very sensitive to transmission errors in the network. Unfortunately, packet errors are unavoidable and cause serious degradation of display quality.

The loss of transmitted packets degrades the quality of the received video. To combat transmission errors, on one hand, error resilience adds redundant information at the encoder at the cost of reduced compression efficiency. On the other hand, error concealment hides the errors in a post-processing stage at the decoder, after the information is received, without modifying the source and channel coding schemes. Since real-time video transmission aims to minimize delay, it is not possible to retransmit all erroneous or lost packets. Therefore, post-processing methods are needed to reduce the visual artifacts caused by bitstream errors. The simplest error concealment method used in H.264/AVC is frame copy, in which the lost frame is estimated directly by copying the co-located pixels from the previous frame; this method causes frame discontinuities, especially for high-motion sequences. An alternative adopted in H.264/AVC is motion copy, in which the motion vectors of the previous frame are used to conceal the current lost frame.

The error concealment algorithms in the literature can be categorized into three classes: spatial, temporal, and hybrid approaches. Spatial approaches restore a corrupted macroblock from the surrounding macroblocks or pixels that were correctly received and decoded. The spatial error concealment method of Wang et al. minimizes a first-order derivative-based smoothness measure [15], and Zhu et al. enhanced it with a second-order derivative that suppresses the induced blurring artifact [22]. In [12], an algorithm based on selective directional interpolation was proposed that efficiently recovers both smooth and edge areas.

Temporal approaches exploit the temporal correlation between successive received frames to reconstruct corrupted blocks. The key step is finding appropriate blocks in previous frames that can be used to restore the corrupted blocks; these are found by estimating the lost motion vectors (MVs). Many methods have been proposed to recover the MVs of corrupted blocks. The well-known boundary matching algorithm in [9] uses the information between the internal and external boundaries of reconstructed blocks to select the MV that minimizes the total variation. More sophisticated algorithms [13,18] obtain even better replacements for the corrupted blocks.

Hybrid approaches combine the spatial and temporal approaches. A priority-driven region matching algorithm exploiting spatial and temporal information was proposed in [4]. The work in [19] proposed an auto-regression model that considers spatial and temporal information to recover the lost pixels. The work in [21] first performs a temporal method, motion vector extrapolation, using the motion vectors of the previous frame to estimate the lost MVs; the estimates are then refined based on the spatially neighboring available motion vectors around the lost MB. The work of Yan et al. [17] considers whole-frame loss, using a motion extrapolation technique and a received depth map to recover the lost frame from the previous frame. A recent work [10] proposes a dynamic programming method to estimate the missing motion vectors of consecutive blocks, with a block merging procedure for similar motion vectors. A lost-frame error concealment method using inter-view information is proposed in [20] for wireless multimedia sensor networks. In this paper, we propose a novel way to recover lost pixels using the theory of sparse optimization, building on [19] and on the sparse characteristics of natural images. Spatial and temporal sparse optimization problems are formulated for the spatial and temporal mechanisms; after the problems are solved, the solutions are combined to recover the lost pixels.

Differences between the proposed work and important related works

References [12,15,16,22] are spatial approaches to error concealment and references [9,13,18] are temporal approaches, whereas the proposed method is a spatial-temporal approach. Among the more recent works, the method in [4] must first pre-process the pixels to find the edges in an image for recovery, whereas the proposed method directly finds the lost motion vectors from the pixels and the relations among them. The work in [17] performs error concealment of the color frame based on the depth frame, whereas the proposed work does not involve the transmission of depth frames; furthermore, [17] addresses whole-frame losses, while the proposed work addresses packet losses in a checkerboard pattern. The algorithm in [10] addresses the loss pattern of consecutive lost blocks, whereas the proposed work focuses on checkerboard packet loss. The work in [20] concentrates on loss recovery for multi-view videos, while the proposed work is designed for single-view videos.

The proposed work is most similar to [19], since both formulate the spatial and temporal relations among neighboring pixels and solve the problem via matrix computation. In addition, [21] estimates missing motion vectors using mathematical formulations established among neighboring motion vectors. These two papers resemble the proposed work in that all three approaches depend on developed mathematical and matrix formulations. Therefore, in this paper, we compare the proposed work against the state-of-the-art methods in [19] and [21], due to the similarities among the works, to show the significance of the proposed method.

This paper is organized as follows: Section 2 introduces the theory of sparse optimization. Section 3 develops the spatial and temporal formulations between pixels in the current and reference frames, and constructs and solves the corresponding sparse optimization problems. Section 4 presents the experimental results, comparing the proposed method objectively and subjectively with two state-of-the-art methods. Section 5 concludes the paper.

2 Sparse optimization

The signal processing field has developed an exciting theory known as “sparse optimization” [13,6]. With this approach, the sampling rate can be greatly reduced while the quality of the reconstructed signal is preserved. The original signal must satisfy a sparsity condition, meaning that it has a concise representation in an appropriate basis. Consider a 1-D vector \( f \in \mathbb{R}^n \) and an orthogonal basis \( \Psi = [\psi_1, \psi_2, \ldots, \psi_n] \), such as the DCT basis; then f can be expressed as:

$$ f(t) = \sum_{i=1}^{n} z_i \psi_i(t) $$
(1)

where \( z_i = \langle f(t), \psi_i(t) \rangle \). In matrix form, this relationship is:

$$ f=\Psi z $$
(2)

where f is an n × 1 vector, \( \Psi = [\psi_1, \psi_2, \ldots, \psi_n] \) is an n × n matrix, and z is an n × 1 vector. We define the sparsity of a signal as follows: when a signal expressed in the orthogonal basis Ψ has mostly zero entries in z, with only \( \mathcal{K} \) non-zero values, we say f is \( \mathcal{K} \)-sparse. This means that to restore the signal f, which has n samples, we only need the \( \mathcal{K} \) coefficients \( (\mathcal{K} < n) \) in the basis domain, and f can be reconstructed completely. However, we do not sample in the basis domain but in the signal domain. Suppose, for example, that an n-point signal is transmitted and only m points are received. When restoring the signal, matrix theory states that there are infinitely many solutions when m < n. Remarkably, sparse optimization can still reconstruct the signal perfectly and uniquely. Considering Eq (2) and assuming f is \( \mathcal{K} \)-sparse, the unique z can be obtained through the following optimization problem:

$$ z = \arg \min_{\tilde{z} \in \mathbb{R}^n} \|\tilde{z}\|_0 \quad \mathrm{s.t.} \quad \Psi \tilde{z} = f $$
(3)

where \( \|\tilde{z}\|_0 \) is the \( \ell_0 \)-norm, i.e., the number of non-zero values in \( \tilde{z} \). We therefore seek the \( \tilde{z} \) with the fewest non-zero values that satisfies \( \Psi \tilde{z} = f \). However, the \( \ell_0 \)-norm problem is computationally intractable to solve exactly. Fortunately, [3] and [5] proposed an approximation that replaces the \( \ell_0 \)-norm with the \( \ell_1 \)-norm:

$$ z = \arg \min_{\tilde{z} \in \mathbb{R}^n} \|\tilde{z}\|_1 \quad \mathrm{s.t.} \quad \Psi \tilde{z} = f $$
(4)

Works [1] and [5] prove that this problem recovers the \( \mathcal{K} \)-sparse signal with high probability from \( m \ge c\,\mathcal{K} \log(n/\mathcal{K}) \) measurements, where c is a constant. Therefore, the sparser the signal, the more easily it can be restored perfectly by the sparse optimization technique. The works in [14] and [7] discuss further sparse optimization methods and how to recover the information when noise is present (or when the signal is only near-sparse). The \( \ell_1 \)-norm optimization problem in Eq (4) can be solved efficiently by the primal-dual interior point method in [8].
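As a concrete illustration, the \( \ell_1 \) problem in Eq (4) can be recast as a linear program by splitting z into non-negative parts. The following Python sketch is our own illustration, not the solver of [8]: it uses SciPy's `linprog` instead of a primal-dual interior point method, and the function name `basis_pursuit` and the toy sizes are assumptions for demonstration only.

```python
import numpy as np
from scipy.optimize import linprog

def basis_pursuit(Psi, f):
    """Solve  min ||z||_1  s.t.  Psi @ z = f  (Eq 4) as a linear program.

    Write z = u - v with u, v >= 0, so that ||z||_1 = sum(u) + sum(v).
    """
    m, n = Psi.shape
    c = np.ones(2 * n)                    # objective: sum(u) + sum(v)
    A_eq = np.hstack([Psi, -Psi])         # Psi @ (u - v) = f
    res = linprog(c, A_eq=A_eq, b_eq=f, bounds=[(0, None)] * (2 * n))
    u, v = res.x[:n], res.x[n:]
    return u - v

# Toy example: recover a 3-sparse z from m = 24 < n = 64 measurements,
# consistent with the m >= c*K*log(n/K) condition above.
rng = np.random.default_rng(0)
n, m, k = 64, 24, 3
z_true = np.zeros(n)
z_true[rng.choice(n, size=k, replace=False)] = rng.standard_normal(k)
Psi = rng.standard_normal((m, n))
f = Psi @ z_true
z_hat = basis_pursuit(Psi, f)
print(np.max(np.abs(z_hat - z_true)))     # typically ~1e-9: exact recovery
```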

3 Spatial and Temporal Sparse Optimization (STSO)

In this section, we collect data from the available pixels surrounding the lost MB as well as from pixels in the reference frame, similarly to [10]. Pixels two frames away are also used in the proposed method. These two mechanisms are formulated mathematically, and the predicting coefficients are solved by the sparse optimization technique of Section 2. We then combine the two mechanisms into an overall algorithm, which we denote STSO (Spatial and Temporal Sparse Optimization).

3.1 Spatial Formulation of Sparse Optimization

For the spatial mechanism, the idea can be explained with Fig. 1. The gray area is the lost block. The initial motion vector is derived by averaging the available motion vectors of the MBs neighboring the lost MB. Each available pixel A in the neighboring blocks uses the initial motion vector to point back to a reference pixel B in the reference frame. Pixel B is used as the center of a 3 × 3 subblock in the reference frame, as shown in Fig. 1. These 3 × 3 = 9 pixels are used to predict pixel A through the predicting coefficients α:

Fig. 1

Nine pixels in the reference frame to predict one available neighboring pixel in the current frame using initial estimated motion vectors

$$ x_t(i,j) = \sum_{k=-1}^{1} \sum_{l=-1}^{1} \alpha(k,l)\, x_{t-1}(i + dy + k,\ j + dx + l) $$
(5)

where \( x_{t-1} \) is a pixel in the reference frame, \( x_t \) is a pixel in the current frame, and \( \alpha(k,l) \) is the weight of the reference pixel \( x_{t-1}(i+dy+k,\ j+dx+l) \) in predicting the current pixel \( x_t(i,j) \). Here, \( (dx, dy) \) is the initial motion vector.

This procedure is repeated for all available pixels spatially neighboring the lost block, and their corresponding 9 reference pixels are located and collected. With the collected data and Eq (5), the optimization problem can be written in matrix form:

$$ \boldsymbol{x}_t = \mathbf{X}_{t-1} \boldsymbol{\alpha}^T $$
(6)

Here, \( \boldsymbol{x}_t \) is a vector containing the available pixels neighboring the lost MB, and \( \mathbf{X}_{t-1} \) is the matrix containing the corresponding reference pixels according to Eq (5). \( \boldsymbol{\alpha}^T \) is the 9 × 1 vector of predicting coefficients that we aim to determine. Figure 2 visually demonstrates the spatial pixel relation in Eq (6): each of the nine bases (256 × 1) is multiplied by its corresponding coefficient in the 9 × 1 vector α, and the results are summed to form the target vector.

Fig. 2

The pixel illustration of the values in the nine columns (each of size 256 × 1) of \( \mathbf{X}_{t-1} \) as nine bases

The sparse characteristics of natural images are discussed in [11]. We use this property to recover the lost pixels of natural images in the error concealment application. We solve for the optimal \( \boldsymbol{\alpha}^T \) by the sparse optimization technique of Section 2: we aim to find a coefficient vector \( \boldsymbol{\alpha}_{\mathrm{SO}} \) with minimal \( \ell_1 \)-norm such that the relation between the neighboring available pixels and the reference pixels in Eq (6) holds:

$$ \begin{array}{c} \boldsymbol{\alpha}_{\mathrm{SO}} = \arg \min_{\boldsymbol{\alpha} \in \mathbb{R}^9} \|\boldsymbol{\alpha}\|_1 \\ \mathrm{s.t.}\ \boldsymbol{x}_t = \mathbf{X}_{t-1} \boldsymbol{\alpha}^T \end{array} $$
(7)

This problem has exactly the form introduced in Section 2 and can therefore be solved efficiently by the primal-dual interior point method in [8]. After \( \boldsymbol{\alpha}_{\mathrm{SO}} \) is obtained, the pixels in the lost MB are recovered by:

$$ \hat{\boldsymbol{x}}_{t,\ \mathrm{spatial}}^{\mathrm{lostMB}} = \mathbf{X}_{t-1}^{\mathrm{lostMB}} \boldsymbol{\alpha}_{\mathrm{SO}}^T $$
(8)

where \( \hat{\boldsymbol{x}}_{t,\ \mathrm{spatial}}^{\mathrm{lostMB}} \) is a (16 × 16) × 1 vector containing the pixels of the lost MB as recovered by the spatial method, and \( \mathbf{X}_{t-1}^{\mathrm{lostMB}} \) is the matrix containing the corresponding pixels in the reference frame according to Eq (5) and the initial motion vector \( (dx, dy) \).
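To make the data collection and recovery steps concrete, the following sketch implements the spatial mechanism of Eqs (5)-(8) under our own simplifying assumptions: frames are 2-D numpy arrays, all 3 × 3 patches lie in-bounds, and `l1_solver` has the interface of the `basis_pursuit` sketch in Section 2. The helper names (`gather_patch_row`, `spatial_recover`) are illustrative, not from the paper.

```python
import numpy as np

def gather_patch_row(ref, y, x):
    """The 3x3 patch of `ref` centered at (y, x), flattened to a 1x9 row (Eq 5)."""
    return ref[y - 1:y + 2, x - 1:x + 2].reshape(-1).astype(float)

def spatial_recover(cur, ref, lost_tl, avail_coords, mv, l1_solver):
    """Spatial mechanism, Eqs (6)-(8).

    cur, ref     : current frame t and reference frame t-1
    lost_tl      : (row, col) of the lost MB's top-left corner
    avail_coords : (i, j) positions of the available pixels around the lost MB
    mv           : initial motion vector (dx, dy), e.g. the average of the
                   available neighboring MVs
    l1_solver    : solves  min ||a||_1  s.t.  X @ a = x  (Eq 7)
    """
    dx, dy = mv
    # Eq (6): one row of X_{t-1} per available neighboring pixel.
    X = np.array([gather_patch_row(ref, i + dy, j + dx) for i, j in avail_coords])
    x = np.array([cur[i, j] for i, j in avail_coords], dtype=float)
    alpha = l1_solver(X, x)                                   # Eq (7)
    # Eq (8): apply the same 9 coefficients to the lost MB's reference patches.
    r0, c0 = lost_tl
    lost = [(r0 + u, c0 + v) for u in range(16) for v in range(16)]
    X_lost = np.array([gather_patch_row(ref, i + dy, j + dx) for i, j in lost])
    return (X_lost @ alpha).reshape(16, 16)
```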

3.2 Temporal Formulation of Sparse Optimization

In this subsection, the temporal relations between pixels in different frames are considered, similarly to [19]. Specifically, the pixels one frame and two frames before the current frame are considered, as shown in Fig. 3. The initial motion vector \( (dx, dy) \) points from the current pixel to pixel C, one frame before the current frame. Then, the available motion vectors in the previous frame point back to pixel D, two frames before the current frame. Centered at D, we collect the surrounding 3 × 3 = 9 pixels to predict pixel C. After considering all such pixels in the reference frame, the matrix formulation can be written as:

Fig. 3

Temporal relations of pixels

$$ \boldsymbol{x}_{t-1} = \mathbf{X}_{t-2} \boldsymbol{\beta}^T $$
(9)

Here, \( \boldsymbol{x}_{t-1} \) is a vector containing the available pixels one frame away, and \( \mathbf{X}_{t-2} \) is the matrix containing the corresponding reference pixels two frames away. \( \boldsymbol{\beta}^T \) is the predicting coefficient vector describing the temporal relationship. Figure 4 visually demonstrates the temporal formulation in Eq (9), where nine bases, weighted by the temporal predicting vector β, form the target vector \( \boldsymbol{x}_{t-1} \). Again, based on the sparse characteristics of natural images [11], we aim to find a temporal coefficient vector \( \boldsymbol{\beta}_{\mathrm{SO}} \) with minimal \( \ell_1 \)-norm such that the relation among the temporally neighboring available pixels in Eq (9) holds:

Fig. 4

The pixel illustration of the values in the nine columns (each of size 256 × 1) of \( \mathbf{X}_{t-2} \) as nine bases

$$ \begin{array}{c} \boldsymbol{\beta}_{\mathrm{SO}} = \arg \min_{\boldsymbol{\beta} \in \mathbb{R}^9} \|\boldsymbol{\beta}\|_1 \\ \mathrm{s.t.}\ \boldsymbol{x}_{t-1} = \mathbf{X}_{t-2} \boldsymbol{\beta}^T \end{array} $$
(10)

The problem is identical in form to the one in Section 2 and can be solved efficiently by the method in [8]. After \( \boldsymbol{\beta}_{\mathrm{SO}} \) is found, the pixels in the lost MB are recovered by:

$$ \hat{\boldsymbol{x}}_{t,\ \mathrm{temporal}}^{\mathrm{lostMB}} = \mathbf{X}_{t-1}^{\mathrm{lostMB}} \boldsymbol{\beta}_{\mathrm{SO}}^T $$
(11)

where \( \hat{\boldsymbol{x}}_{t,\ \mathrm{temporal}}^{\mathrm{lostMB}} \) is a vector containing the pixels of the lost MB as recovered by the temporal method, and \( \mathbf{X}_{t-1}^{\mathrm{lostMB}} \) is the same matrix as in Eq (8).
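Under the same assumptions, the temporal mechanism of Eqs (9)-(11) differs from the spatial one only in how the training pairs are collected: each pixel C in frame t − 1 is paired with the 3 × 3 patch around its motion-compensated pixel D in frame t − 2. The sketch below reuses `gather_patch_row` from the spatial sketch; `prev_mv_of` is a hypothetical accessor for the stored motion vectors of frame t − 1, and we take the pixels of frame t − 1 that the lost MB maps to as the training targets, which is one plausible reading of the formulation above.

```python
def temporal_recover(ref1, ref2, lost_tl, mv, prev_mv_of, l1_solver):
    """Temporal mechanism, Eqs (9)-(11).

    ref1, ref2 : frames t-1 and t-2
    mv         : initial motion vector (dx, dy) into frame t-1
    prev_mv_of : (i, j) in frame t-1 -> that block's MV (dx2, dy2) into frame t-2
    """
    dx, dy = mv
    r0, c0 = lost_tl
    lost = [(r0 + u, c0 + v) for u in range(16) for v in range(16)]
    rows, targets = [], []
    for i, j in lost:
        ci, cj = i + dy, j + dx                 # pixel C in frame t-1 (Fig. 3)
        dx2, dy2 = prev_mv_of(ci, cj)           # MV from frame t-1 to t-2
        rows.append(gather_patch_row(ref2, ci + dy2, cj + dx2))   # patch at D
        targets.append(float(ref1[ci, cj]))
    beta = l1_solver(np.array(rows), np.array(targets))           # Eq (10)
    # Eq (11): X_{t-1}^{lostMB} is the same matrix as in Eq (8).
    X_lost = np.array([gather_patch_row(ref1, i + dy, j + dx) for i, j in lost])
    return (X_lost @ beta).reshape(16, 16)
```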

3.3 Combined algorithm STSO

After the spatial and temporal sparse optimizations are solved, we combine the results to obtain the final estimate of the lost pixels \( \hat{\boldsymbol{x}}_t^{\mathrm{lostMB}} \):

$$ \hat{\boldsymbol{x}}_t^{\mathrm{lostMB}} = \tau \times \hat{\boldsymbol{x}}_{t,\ \mathrm{spatial}}^{\mathrm{lostMB}} + (1 - \tau) \times \hat{\boldsymbol{x}}_{t,\ \mathrm{temporal}}^{\mathrm{lostMB}} $$
(12)

The balance parameter τ between the spatial and temporal solutions is set as in [19]. The whole STSO algorithm is shown in Fig. 5.
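Operationally, Eq (12) is a per-pixel weighted average of the two recovered MBs; continuing the illustrative sketches above:

```python
def stso_conceal(cur, ref1, ref2, lost_tl, avail_coords, mv, prev_mv_of,
                 tau, l1_solver):
    """Eq (12): blend the spatial and temporal recoveries with weight tau."""
    s = spatial_recover(cur, ref1, lost_tl, avail_coords, mv, l1_solver)
    t = temporal_recover(ref1, ref2, lost_tl, mv, prev_mv_of, l1_solver)
    return tau * s + (1.0 - tau) * t
```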

Fig. 5

Proposed STSO method for error concealment

4 Experimental results

In this section, we discuss the experimental results comparing our method with existing ones. The video encoder is H.264 (JM 18.0). We use an IPPPP encoding GOP (Group Of Pictures) structure (I: intra frame, P: predictive frame) with a GOP size of 15 frames. We use the dispersed mode of FMO (Flexible Macroblock Ordering) so that packets are encoded/lost in a checkerboard manner. We use various QPs (quantization parameters) from 16 to 38. The videos crew, container, md, carphone, mobile, soccer, stefan, football, and foreman are considered; the video resolution is 176 × 144 (QCIF). We compare our proposed STSO method against the state-of-the-art spatial-temporal hybrid methods of Zhang [19] and Zhou [21].

PSNR (Peak Signal-to-Noise Ratio) is used as the performance metric, computed between the uncompressed video and the concealed video. Table 1 shows the PSNR comparison among the proposed STSO method, Zhang’s method [19], and Zhou’s method [21] for different video sequences and QPs. Due to page limits, we choose representative cases for discussion: first the slow-motion videos (crew, container, md, carphone, and mobile), then the higher-motion videos (soccer, stefan, football, and foreman).

Table 1 PSNR (dB) performance comparisons for different QPs

For the slower videos: on crew, STSO outperforms Zhang by 0.15 dB, 0.10 dB, and 0.07 dB across the tested QPs, and outperforms Zhou by the larger margins of 0.47 dB, 0.43 dB, and 0.39 dB. The margin against Zhou is larger because Zhou does not utilize information from the past two frames, as Zhang and STSO do, so Zhou’s performance is usually lower. For container, the trend holds: STSO is better than Zhang by 0.10 dB, 0.05 dB, and 0.04 dB, and better than Zhou by 0.27 dB, 0.19 dB, and 0.16 dB at QP = 20, 22, and 24, respectively. Similarly, on md, the gains of STSO are 0.18 dB, 0.05 dB, and 0.07 dB over Zhang, and 0.62 dB, 0.44 dB, and 0.39 dB over Zhou, across the QP settings. For carphone, STSO improves on Zhang by 0.19 dB, 0.17 dB, and 0.15 dB, and on Zhou by 1.12 dB, 0.98 dB, and 0.85 dB. On mobile, the margins are 0.10 dB, 0.08 dB, and 0.17 dB over Zhang, and 0.81 dB, 0.76 dB, and 0.63 dB over Zhou. We can therefore conclude that STSO outperforms both Zhang and Zhou on the slower-motion videos.

For the higher-motion videos: on soccer, STSO is better than Zhang by 0.12 dB, 0.08 dB, and 0.07 dB, and better than Zhou by 0.32 dB, 0.17 dB, and 0.09 dB at QP = 24, 26, and 32, respectively. On stefan, the situation is similar: STSO improves on Zhang by 0.15 dB, 0.09 dB, and 0.07 dB, and on Zhou by 0.38 dB, 0.29 dB, and 0.17 dB across the QPs; again, Zhou performs lowest, as discussed. On football, STSO outperforms Zhang by 0.12 dB, 0.07 dB, and 0.02 dB, and Zhou by the even larger margins of 0.52 dB, 0.45 dB, and 0.41 dB at QP = 32, 34, and 38. Finally, on foreman, at QP = 30, 34, and 38, STSO’s margins over Zhang are 0.12 dB, 0.10 dB, and 0.04 dB, and over Zhou 0.50 dB, 0.37 dB, and 0.31 dB. In summary, STSO outperforms the existing methods of Zhang and Zhou even on fast videos.

To conclude from the analyses and comparisons in Table 1, the proposed STSO outperforms the state-of-the-art Zhang and Zhou methods across QP settings and across videos with different textures and motions. Furthermore, as discussed, the winning margin of STSO over Zhou is usually larger than that over Zhang, because Zhou’s method only uses information from the previous frame, whereas STSO and Zhang also exploit the frame before it; hence STSO and Zhang perform better. The largest gains among all comparisons are 0.19 dB over Zhang’s method and 1.12 dB over Zhou’s method, both on the carphone sequence at QP = 20. Note that both compared methods are state-of-the-art, so outperforming them by a very large margin is difficult.

In addition to the objective PSNR improvements, our algorithm mostly performs better in subjective quality as well. Figures 6 to 14 show frames concealed by the three methods. In Fig. 6, compared with Zhou’s method, the proposed method reduces the discontinuities along the concealed edges, and compared with Zhang’s method it performs better in the detailed areas, for example the lines on the clothes. In Fig. 7, since the video is simple and slow, the performance differences are not obvious; however, as the bird flies quickly across the frame, the proposed method produces fewer ghosting artifacts. In Fig. 8, the results of the three methods are similar, but the proposed method better preserves the continuity and integrity of the fingers. In Fig. 9, the proposed method recovers better facial detail and smoother line connections in the background. In Fig. 10, the proposed method produces a more complete result in the fast-moving ball area. In Fig. 11, a higher-motion video, the proposed method preserves more integrity in the hand and leg areas. Figure 12 is also a fast video; the proposed method yields better image integrity for the player and reconstructs the English text in the background more completely. In Fig. 13, another fast video, the proposed method provides better image integrity for the fast-moving player. In Fig. 14, the proposed method performs better in the edge areas, especially the edge of the hat, the edge of the face, and the facial details.

Fig. 6

Error concealment performance comparisons for crew, 193rd frame at QP = 20

Fig. 7

Error concealment performance comparisons for container, 252nd frame at QP = 20

Fig. 8

Error concealment performance comparisons for md, 112th frame at QP = 22

Fig. 9

Error concealment performance comparisons for carphone, 191st frame at QP = 24

Fig. 10

Error concealment performance comparisons for mobile, 201st frame at QP = 24

Fig. 11

Error concealment performance comparisons for soccer, 96th frame at QP = 26

Fig. 12

Error concealment performance comparisons for stefan, 236th frame at QP = 28

Fig. 13

Error concealment performance comparisons for football, 124th frame at QP = 32

Fig. 14

Error concealment performance comparisons for foreman, 52nd frame at QP = 34

The improved results of the proposed method can be attributed to the fact that sparse optimization always finds a solution with a minimal number of important, representative predictors. By restricting the solution to a limited number of important predictors and suppressing the effects of the unimportant ones, the proposed error concealment algorithm performs better in both objective and subjective measurements.

Complexity analysis

We now discuss the complexities of the three error concealment algorithms. Zhou’s method [21] is the least complex, since it needs only the extrapolated motion vector and the surrounding available motion vectors to estimate the lost motion vector in the vertical and horizontal directions; its processing time is 0.0064 s per lost MB. The complexities of Zhang’s method [19] and the proposed STSO are comparable, since their data collection and pixel arrangement are similar: for a lost MB, an initial motion vector is first generated from the surrounding available motion vectors; with this vector, the matrices in Eq (7) are formed, and the α’s are solved by Zhang and STSO in different ways. Similarly, with the same initial motion vector and the motion vectors of the previous frame, the matrices in Eq (10) are formed, and the β’s are solved by Zhang and STSO in different ways. The resulting α and β are combined in Eq (12). The main computational cost in Zhang and STSO is the optimization that finds α and β, so their computation times are higher than Zhou’s: per lost MB, Zhang’s method requires 0.038 s and STSO requires 0.043 s. STSO in particular uses the sparse optimization technique, so its processing time is slightly higher, with the benefit of higher recovered image quality.

5 Conclusion

In this paper, we exploit the fact that natural images have sparse representations and propose STSO, a spatial-temporal hybrid sparse optimization error concealment method. Our algorithm restricts the predictors to the numerically useful ones, so the error concealment result is not distorted by less-useful predictors. As shown in the experimental results, the proposed algorithm improves on two state-of-the-art spatial-temporal hybrid algorithms, both subjectively and objectively, by up to 0.19 dB and 1.12 dB, respectively. These are meaningful results, considering that the two compared algorithms are among the strongest methods in the existing literature.