
1 Introduction

Real-time video traffic over the Internet has experienced tremendous growth with the rise of video conferencing applications such as Zoom and Skype. With the outbreak of the COVID-19 pandemic, classes, meetings, and conferences have gone virtual. The demand for real-time video streaming is unprecedented.

Unlike stored video streaming (e.g., YouTube and Netflix), where up to 10 s of delay is tolerable, real-time video streaming requires a much smaller delay (e.g., \(< 400\) ms [27]) to preserve the interactive experience, while consuming large network bandwidth. User Datagram Protocol (UDP), rather than Transmission Control Protocol (TCP), is widely employed as the transport-layer protocol to avoid retransmission delay. There are two obstacles to realizing real-time video delivery with high quality of experience (QoE). First, UDP provides only best-effort packet transmission and does not guarantee error-free packet delivery. Packet loss causes damaged frames in the video application and brings negative visual impacts. Second, due to the complexity of the Internet, the end-to-end throughput may fluctuate. Unlike stored video playback, where a buffer can be used to smooth the playout at the cost of additional delay, real-time video has to be down-sampled to adapt to the network dynamics, as delaying packet transmission is not allowed.

Fig. 1. Output quality vs. inference delay.

Fig. 2. Demonstration of the selection of supporting frames, where I and P denote I-frame and P-frame, respectively. The loss regions bounded by solid lines are caused directly by packet loss, while those bounded by dashed lines are caused by loss propagation.

Recent advances in deep neural networks (DNNs) have strong potential to address the aforementioned issues. Video inpainting networks, e.g., [12], can achieve video loss recovery by reconstructing a missing region in a sequence of frames. Video super-resolution networks, e.g., [11], use low-resolution input frames to reconstruct high-resolution output frames. These DNNs can be combined to enhance the received real-time video. However, directly using them at the receiver side of real-time video applications is impractical. There are two challenges. (1) Existing DNNs [2, 8, 12, 16, 29] are not designed for real-time video streaming; they introduce seconds of inference delay (see Fig. 1), which is intolerable for real-time video streaming. (2) Existing DNNs require both the preceding frames and the succeeding frames for reconstruction. To meet the stringent delay requirement of real-time video streaming, we cannot use a future (succeeding) frame, as waiting for its arrival would introduce additional delay. In fact, we can only utilize the current frame and its preceding frames to enhance real-time video frames.

In this paper, we design a new network, namely 3RE-Net, to effectively realize super-resolution and loss recovery for real-time video. 3RE-Net takes the partially missing (damaged) current frame and the preceding frames as input, all in low resolution. It recovers the missing region and performs super-resolution to output a high-resolution, complete frame with a small delay.

To address the aforementioned Challenge (1), 3RE-Net exploits the similarities between video inpainting and video super-resolution approaches to reduce duplicate processing and improve effectiveness. In particular, the extracted optical flows and features of the frames can be regarded as the semantics of the video content, and the extractors can be tuned in a generic direction to facilitate both loss recovery and super-resolution. By sharing the extractors, we not only reduce redundant computation but also achieve better performance.

To address the aforementioned Challenge (2), 3RE-Net takes only the partially missing current frame and the preceding frames as input. We capture multiple optical flows (motions) using a modified motion extraction DNN. By cross-checking the optical flows, we achieve performance comparable to DNNs that utilize both the preceding and succeeding frames for reconstruction.

We have conducted comprehensive experiments to evaluate the performance of 3RE-Net against state-of-the-art benchmark schemes. In terms of video quality improvement (both the correctness of the reconstructed region and the accuracy of the super-resolved details), 3RE-Net substantially outperforms the benchmarks. In terms of the quality-delay trade-off, 3RE-Net substantially improves the video quality at the cost of a small amount of delay, which is more advantageous than all other benchmarks: they either cannot realize as much performance gain as 3RE-Net or cause too much delay, intolerable for real-time video applications. A rough comparison is shown in Fig. 1; detailed comparisons are presented in the experiments.

2 Related Work

Video Enhancement Using DNNs. There are three types of video enhancement DNNs related to our work. Video super-resolution (VSR) [2, 17] restores high-resolution frames from multiple low-resolution observations of the same scene. Different from video super-resolution, we do not use succeeding frames as input, to avoid additional delay, and we simultaneously recover the loss. Video inpainting [12, 29] fills the missing regions of a given video sequence with contents that are both spatially and temporally coherent. Similar to inpainting, one of our objectives is to recover the missing regions; however, we do not use succeeding frames as input, and our inference is lightweight without incurring a long delay. Video interpolation [6, 8] increases the temporal resolution of a video by synthesizing non-existent frames between two original frames. Among the three types of DNNs, interpolation networks have optical flow extractors with the fewest parameters and the least delay.

Fig. 3. DNN model architecture.

Video Delivery and Packet Loss. UDP is widely deployed to carry real-time [25] and interactive [13] video applications [22] (e.g., WebRTC, Zoom, and TeamViewer) to preserve conversational behavior. UDP does not recover packet loss, and packet loss [19] usually results in a missing block after decoding [5]. Frames in a video are grouped into a Group of Pictures (GOP) [18], which contains an intra-coded frame (I-frame) and predicted frames (P-frames); no bi-directional predicted frames (B-frames) are used in real-time video as they cause delay [28]. A lost frame in a GOP affects subsequent P-frames, and recovery may require earlier error-free frames instead of succeeding ones. Our solution is designed to address these issues.

3 Model Design

Let \(O \in \mathbb {R}^{H \times W \times C}\) be the damaged (partially missing) video frame in low resolution (LR), and \(\overline{O} \in \mathbb {R}^{s H \times s W \times C}\) be the corresponding original (ground-truth) video frame in high resolution (HR), where H, W, and C denote the height, width, and number of channels of the frame respectively, and s is the upscaling factor. Please note that \(C = 3\) is a constant, and s is an integer with \(s > 1\). We utilize the N complete frames in LR prior to frame O to aid the reconstruction; these frames are known as the supporting frames, where \(N \in \left\{ 2, 3, 4 \right\} \). (A larger N brings higher accuracy but is more time-consuming; this trade-off is discussed in the ablation studies of the supplementary material.) The proposed network takes the reference frame O and N supporting frames \(\left\{ I_{1}, ... , I_{N}\right\} \) in LR as input, and outputs the recovered and super-resolved frame \(\hat{O} \in \mathbb {R}^{s H \times s W \times C}\) in HR:

$$\begin{aligned} \hat{O} = f_{DNN} (\left\{ I_{1}, ... , I_{N}\right\} , O), \end{aligned}$$
(1)

with the objective to minimize the difference between the ground-truth frame \(\overline{O}\) and the reconstructed frame \(\hat{O}\).

Figure 3 is an overview of the proposed network architecture. First, we extract two groups of motions using the motion extraction network (Step A). We also extract the corresponding feature map of each input frame (Step B). Then, we warp the extracted features towards the extracted motions (Step C) and obtain a set of candidate frames along with their feature maps. After that, we synthesize the candidate frames (Step D) and align the synthesized frame with its spatio-temporal neighboring frames (Step E). Finally, we apply deep detail refinement to obtain the recovered and super-resolved frame (Step F). Note that the motions extracted in Step A and the features extracted in Step B benefit both the loss recovery and super-resolution processes. By extracting them only once, we avoid redundant computation and reduce delay.

Please also note that the supporting frames \(\left\{ I_{1}, ... , I_{N}\right\} \) may not be consecutive frames, due to loss propagation within GOPs. Let \(t_o\) denote the timestamp of frame O, and \(\left\{ t_{1}, ... , t_{N}\right\} \) be the timestamps of frames \(\left\{ I_{1}, ... , I_{N}\right\} \), respectively. Figure 2 shows an example where the frame immediately preceding the reference frame O also contains loss. In this case, we iterate backwards through the preceding frames until \(N = 4\) error-free frames \(\left\{ I_{1}, I_2, I_3, I_4\right\} \) are found. In addition, we do not use a recovered frame to reconstruct O. The reason is twofold: (1) this avoids error propagation across the sequence of frames, and (2) this avoids the processing delay of waiting for a previous frame to be reconstructed. We assume that the network loss is not significant and only a small percentage of frames need to be recovered [4].
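As a concrete illustration, the following minimal Python sketch implements the backward search for supporting frames described above. The predicate `is_error_free(t)` and the `max_lookback` cap are assumptions introduced here for illustration; the paper only specifies that damaged and recovered frames are skipped.

```python
def select_supporting_frames(t_o, is_error_free, n_support=4, max_lookback=30):
    """Walk backwards from the reference frame at timestamp t_o and collect
    the timestamps of the n_support most recent error-free frames.

    is_error_free(t) is a hypothetical predicate (not part of the paper's
    interface) that returns True iff the decoded frame at timestamp t has no
    loss region. Recovered frames are treated as unusable, matching the
    design above (no error propagation, no extra waiting).
    """
    timestamps = []
    t = t_o - 1
    while len(timestamps) < n_support and t_o - t <= max_lookback:
        if is_error_free(t):
            timestamps.append(t)
        t -= 1
    if len(timestamps) < n_support:
        raise RuntimeError("not enough error-free preceding frames")
    return timestamps  # e.g. [t_1, t_2, t_3, t_4], most recent first


# Toy usage: frames 9 and 8 are damaged (loss propagation), so the selector
# skips them and returns older error-free frames.
damaged = {9, 8}
print(select_supporting_frames(10, lambda t: t not in damaged, n_support=4))
# -> [7, 6, 5, 4]
```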

Motion Extraction and Prediction (Step A). We adopt a motion extraction DNN (PWC-Net [23]). Given two input frames, frame x and frame y, the DNN outputs the estimated motions from frame y to frame x, with respect to all pixels in frame x. The output depends on the order of the inputs.

By feeding different pairs of inputs into the DNN, we extract two groups of motions using direct and indirect approaches to better estimate the true motions, as detailed below.

First Group of Motions. For each supporting frame \(I_{n}\), where \(n \in \left\{ 1, \cdots , N \right\} \), we feed the reference frame O and the supporting frame \(I_{n}\) into the motion extraction network. The motion extraction network outputs the extracted motion \(V_{n \rightarrow o}\). The same is done for each supporting frame, resulting in a group of N motions \(\left\{ V_{1 \rightarrow o}, ... , V_{N \rightarrow o} \right\} \). We denote the first group of motions as \(\mathbb {V}\), where \(\mathbb {V} = \left\{ V_{1 \rightarrow o}, ... , V_{N \rightarrow o} \right\} \) and \(|\mathbb {V}| = N\).
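A small sketch of how this first group could be gathered. The callable `flow_net(x, y)` is a hypothetical stand-in for PWC-Net that follows the interface described above (motion from frame y to frame x, one 2-vector per pixel of x); it returns zeros here purely so the snippet runs on its own.

```python
import torch

def flow_net(x, y):
    # Placeholder for PWC-Net: returns the motion from frame y to frame x for
    # every pixel of x, as a (2, H, W) tensor. A real implementation would
    # load pretrained PWC-Net weights; zeros keep the sketch runnable.
    _, h, w = x.shape
    return torch.zeros(2, h, w)

def first_motion_group(ref, supports):
    """V = {V_{n->o}}: motion from each supporting frame I_n to the reference O."""
    return [flow_net(ref, I_n) for I_n in supports]

# Toy usage with N = 2 supporting frames of size 64x64 (3 channels).
O = torch.rand(3, 64, 64)
supports = [torch.rand(3, 64, 64) for _ in range(2)]
V = first_motion_group(O, supports)
print(len(V), V[0].shape)  # 2 torch.Size([2, 64, 64])
```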

Second Group of Motions. We estimate the second group of motions using the supporting frames only. For each pair of supporting frames, we estimate a set of four motions. Let \(I_m\) and \(I_n\) be a pair of distinct supporting frames, where \(m, n \in \left\{ 1, \cdots , N \right\} \) and \(m \ne n\). We obtain a set of four estimated motions using this pair of frames.

Given frames \(I_{n}\) and \(I_{m}\) as input, the motion extraction network outputs the estimated motion \(V_{m \rightarrow n}\) for all pixels in \(I_{n}\). By reversing the order of input, we obtain another motion \(V_{n \rightarrow m}\) for all pixels in \(I_{m}\). Given timestamps \(t_m\), \(t_n\), and \(t_o\), we expand the motion vector \(V_{m \rightarrow n}\) and obtain the estimated motion vectors \(U_{m \rightarrow o \mid m \rightarrow n}\) and \(U_{n \rightarrow o \mid m \rightarrow n}\). Formally, we have

$$\begin{aligned} U_{m \rightarrow o \mid m \rightarrow n}= \frac{t_o - t_{m}}{t_{n} - t_{m}} \times V_{m \rightarrow n},\end{aligned}$$
(2)
$$\begin{aligned} U_{n \rightarrow o \mid m \rightarrow n}= \frac{t_o - t_{n}}{t_{n} - t_{m}} \times V_{m \rightarrow n},\nonumber \\ \forall m, n \in \left\{ 1, \cdots , N \right\} , m \ne n. \end{aligned}$$
(3)

Similarly, for the reversed motion \(V_{n \rightarrow m}\), we have

$$\begin{aligned} U_{m \rightarrow o \mid n \rightarrow m}= -\frac{t_o - t_{m}}{t_{n} - t_{m}} \times V_{n \rightarrow m},\end{aligned}$$
(4)
$$\begin{aligned} U_{n \rightarrow o \mid n \rightarrow m}= -\frac{t_o - t_{n}}{t_{n} - t_{m}} \times V_{n \rightarrow m},\nonumber \\ \forall m, n \in \left\{ 1, \cdots , N \right\} , m \ne n. \end{aligned}$$
(5)

We apply the same procedure to each pair of supporting frames and denote the second group of motions as \(\mathbb {U}\). As a result, we obtain \(|\mathbb {U}| = \binom{N}{2} \times 4\) motions, where

$$\begin{aligned} \mathbb {U} = \left\{ U | U_{m \rightarrow o \mid m \rightarrow n}, U_{n \rightarrow o \mid m \rightarrow n}, U_{m \rightarrow o \mid n \rightarrow m}, U_{n \rightarrow o \mid n \rightarrow m} \right\} , \nonumber \\ \forall m, n \in \left\{ 1, \cdots , N \right\} , m \ne n. \end{aligned}$$
(6)
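The time-scaling in Eqs. (2)-(5) is straightforward to implement. The sketch below assumes the pairwise motions \(V_{m \rightarrow n}\) and \(V_{n \rightarrow m}\) are already available as tensors (e.g., from the flow network above); function and variable names are ours, not the paper's.

```python
import torch
from itertools import combinations

def scale_pair_motions(V_mn, V_nm, t_m, t_n, t_o):
    """Expand the pairwise motions V_{m->n} and V_{n->m} into the four
    estimated motions towards the reference timestamp t_o (Eqs. 2-5)."""
    a = (t_o - t_m) / (t_n - t_m)
    b = (t_o - t_n) / (t_n - t_m)
    return {
        "U_m_to_o|m->n":  a * V_mn,   # Eq. (2)
        "U_n_to_o|m->n":  b * V_mn,   # Eq. (3)
        "U_m_to_o|n->m": -a * V_nm,   # Eq. (4)
        "U_n_to_o|n->m": -b * V_nm,   # Eq. (5)
    }

def second_motion_group(flows, timestamps, t_o):
    """flows[(m, n)] holds V_{m->n} for every ordered pair of supporting
    frames; the result contains C(N, 2) * 4 motions, matching |U| above."""
    U = {}
    for m, n in combinations(range(len(timestamps)), 2):
        U[(m, n)] = scale_pair_motions(
            flows[(m, n)], flows[(n, m)], timestamps[m], timestamps[n], t_o)
    return U

# Toy usage: N = 2 supporting frames at t = 8 and 9, reference at t = 10.
flows = {(0, 1): torch.rand(2, 64, 64), (1, 0): torch.rand(2, 64, 64)}
U = second_motion_group(flows, timestamps=[8, 9], t_o=10)
print(len(U), len(U[(0, 1)]))  # 1 pair, 4 motions
```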

On the one hand, the first group of motions is extracted using the reference frame O and one of the supporting frames. The reference frame contains loss, so the first group does not include the motions of the pixels in the missing region; apart from that region, however, all other pixels in the reference frame are accurate. On the other hand, the second group of motions is estimated using the complete supporting frames. The motions missing from the first group are likely to be present in the second group, but the motions in the second group are less accurate since the estimation does not utilize any information from the reference frame itself. The two groups of motions thus complement each other, allowing us to obtain more accurate motions for the subsequent steps.

Refined Feature Extraction (Step B). Given frame \(I_n\) as input, the feature extraction module extracts the feature map \(F_n\) of the input frame \(I_n\). The module uses only the first convolutional layer of ResNet [7], since ResNet is designed for image classification tasks and its remaining layers narrow down the representation and lose valuable contextual information. The module takes a 3-channel frame as input and outputs a 64-channel feature map, which contains rich and generic contextual information to serve as a backbone for the alignment and up-sampling modules.

For each supporting frame \(I_{n}\), where \(n \in \left\{ 1, \cdots , N \right\} \), the corresponding feature map \(F_{n}\) is obtained using this module. As a result, we obtain feature maps \(\left\{ F_{1}, ... , F_{N} \right\} \). Similarly, we apply feature extraction to the input reference frame O and obtain feature map E. The contextual information in the extracted feature maps works alongside the extracted motions to perform loss recovery and super-resolution. Letting Conv denote the feature extraction module, we have

$$\begin{aligned} E = \texttt {Conv} (O), F_n = \texttt {Conv} (I_n), \forall n \in \left\{ 1, \cdots , N \right\} . \end{aligned}$$
(7)
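A minimal sketch of a shared feature extractor in the spirit of Eq. (7): a single 3-to-64-channel convolution mirroring ResNet's first layer, applied with the same weights to the reference and supporting frames. Using stride 1 (ResNet's first conv uses stride 2) is our assumption, to keep the feature maps at the input resolution.

```python
import torch
import torch.nn as nn

class FeatureExtractor(nn.Module):
    """Single convolutional layer mapping a 3-channel frame to a 64-channel
    feature map, mirroring ResNet's first conv; stride 1 is an assumption
    made here so the feature map stays at the input resolution."""
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(3, 64, kernel_size=7, stride=1, padding=3)
        self.act = nn.ReLU(inplace=True)

    def forward(self, frame):              # frame: (B, 3, H, W)
        return self.act(self.conv(frame))  # (B, 64, H, W)

# Eq. (7): E = Conv(O), F_n = Conv(I_n) with the *same* shared extractor.
extractor = FeatureExtractor()
O = torch.rand(1, 3, 64, 64)
supports = [torch.rand(1, 3, 64, 64) for _ in range(2)]
E = extractor(O)
F_maps = [extractor(I_n) for I_n in supports]
print(E.shape)  # torch.Size([1, 64, 64, 64])
```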

Warping Module (Step C). This module warps frames and feature maps based on a given motion vector. We adopt and modify the warping layer from spatial transformer networks [9], which applies a spatial transformation to a pair of frames and feature maps.

For each supporting frame \(I_{n}\), where \(n \in \left\{ 1, \cdots , N \right\} \), we warp the frame itself and its corresponding feature map \(F_n\) towards the motion \(V_{n \rightarrow o} \in \mathbb {V}\) (the first group of motions derived above). We obtain a warped frame \(W'_n\) along with its warped feature map \(F'_n\). We denote the aggregated pair of warped frame and feature map as \(P_n = \left[ W'_n, F'_n \right] \), and let \(\mathbb {P}\) denote the set of all \(P_{n}\).

(8)

For each pair of supporting frames \(I_m\) and \(I_n\), where \(m, n \in \left\{ 1, \cdots , N \right\} \) and \(m \ne n\), we warp each supporting frame and its feature map towards the corresponding motions in \(\mathbb {U}\) (the second group of motions), in the same manner as above.

(9)
(10)
(11)
(12)

Let \(\mathbb {Q}\) denote the set of all \(Q_m, \tilde{Q}_m, Q_n, \tilde{Q}_n\).

$$\begin{aligned} \mathbb {Q} = \left\{ Q | Q_m, \tilde{Q}_m, Q_n, \tilde{Q}_n \right\} , \forall m, n \in \left\{ 1, \cdots , N \right\} , m \ne n. \end{aligned}$$
(13)

Since \(|\mathbb {P}| = |\mathbb {V}| = N\) and \(|\mathbb {Q}| = |\mathbb {U}| = \binom{N}{2}\times 4\), we obtain \(|\mathbb {P}| + |\mathbb {Q}| = N + \binom{N}{2} \times 4\) pairs of warped frames and feature maps. These frames, along with their corresponding feature maps, then become candidates for the synthesis module.
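A generic sketch of the backward warping used in Step C, assuming motions are expressed in pixel units; it builds a sampling grid and relies on `torch.nn.functional.grid_sample`, a common way to implement spatial-transformer-style warping layers, though not necessarily the paper's exact layer.

```python
import torch
import torch.nn.functional as F

def warp(x, flow):
    """Backward-warp a frame or feature map x (B, C, H, W) with a per-pixel
    motion field flow (B, 2, H, W) given in pixel units: output[p] is sampled
    from x at p + flow[p]."""
    b, _, h, w = x.shape
    ys, xs = torch.meshgrid(torch.arange(h, dtype=x.dtype),
                            torch.arange(w, dtype=x.dtype), indexing="ij")
    grid_x = xs.unsqueeze(0) + flow[:, 0]          # (B, H, W)
    grid_y = ys.unsqueeze(0) + flow[:, 1]
    # Normalise to [-1, 1] as required by grid_sample.
    grid_x = 2.0 * grid_x / max(w - 1, 1) - 1.0
    grid_y = 2.0 * grid_y / max(h - 1, 1) - 1.0
    grid = torch.stack((grid_x, grid_y), dim=-1)   # (B, H, W, 2)
    return F.grid_sample(x, grid, mode="bilinear",
                         padding_mode="border", align_corners=True)

# Warping a supporting frame and its feature map towards one motion V_{n->o}.
I_n = torch.rand(1, 3, 64, 64)
F_n = torch.rand(1, 64, 64, 64)
V_no = torch.zeros(1, 2, 64, 64)   # zero motion: warp is the identity
W_n, Fp_n = warp(I_n, V_no), warp(F_n, V_no)
print(torch.allclose(W_n, I_n, atol=1e-5))  # True
```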

Motion Synthesis (Step D). We do not simply average the candidate warped frames, as this leads to blurry results caused by conflicting estimates of fast-moving and occluded pixels. To address this issue, we feed the missing and unreliable pixels into a DNN (a dynamic filter network [10]), which takes spatio-temporal factors into consideration to resolve these conflicts soundly.

We first adopt the filter generation network to cross-check the \(|\mathbb {P}| + |\mathbb {Q}|\) candidate warped frames at the image level. For each candidate warped frame, the network generates a set of \(5\times 5\times \left[ N + \binom{N}{2} \times 4\right] \) parameters for the blending filter, denoted as X, where \(5\times 5\) is the size of the blending filter. We then apply convolution with the blending filter on each candidate warped frame, and the results are aggregated to generate the synthesized frame \(O'\). We extract the feature map \(E'\) of frame \(O'\) using the refined feature extractor described above (Step B).
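A sketch of how per-pixel \(5\times 5\) blending filters could be applied to the candidate warped frames, in the spirit of the dynamic filter network. The filter-generation network itself is omitted; softmax-normalised random weights stand in for its output X so the snippet is self-contained.

```python
import torch
import torch.nn.functional as F

def apply_blending_filters(candidates, filters, ksize=5):
    """Blend K candidate warped frames into one synthesized frame with
    per-pixel dynamic filters.

    candidates: (B, K, C, H, W) warped candidate frames (|P| + |Q| of them).
    filters:    (B, K * ksize * ksize, H, W) per-pixel filter weights, assumed
                to come from a filter-generation network and to be normalised
                (softmax) over the K * ksize * ksize taps.
    """
    b, k, c, h, w = candidates.shape
    pad = ksize // 2
    # Extract ksize x ksize neighbourhoods of every candidate frame.
    patches = F.unfold(candidates.view(b * k, c, h, w), ksize, padding=pad)
    patches = patches.view(b, k, c, ksize * ksize, h, w)
    weights = filters.view(b, k, 1, ksize * ksize, h, w)
    return (patches * weights).sum(dim=(1, 3))      # (B, C, H, W)

# Toy usage: N = 2 supporting frames -> |P| + |Q| = 2 + 1 * 4 = 6 candidates.
B, K, C, H, W = 1, 6, 3, 32, 32
cands = torch.rand(B, K, C, H, W)
X = torch.softmax(torch.rand(B, K * 25, H, W), dim=1)   # normalised filter taps
out = apply_blending_filters(cands, X)
print(out.shape)  # torch.Size([1, 3, 32, 32])
```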

Feature-Level Alignment (Step E). Spatial and temporal alignment is crucial for video enhancement. Due to the varying motions of cameras or objects, the synthesized reference frame and the supporting frames are not aligned. Misaligned frames hinder aggregation and reduce performance, especially with a larger number of supporting frames N. Alignment modules align frames and features to enable subsequent aggregation. Traditional image-level alignment methods do not combine well with the deep detail refinement module (Step F), as artifacts around image structures are propagated into the final reconstructed frames. To avoid this, we resort to deformable convolution [3], which provides additional offsets that allow the convolutional network to obtain information beyond its regular local neighborhood [24, 26], and redevelop the feature-level deformable alignment module in our design.

The module takes the feature map \(E'\) of the synthesized frame as input, aligns the supporting feature maps \(F_1, \cdots , F_N\) and E with \(E'\), and outputs the aligned feature map \(\bar{E}\).

Note that the synthesis module handles the conflicting estimates of occluded or missing pixels at the same timestamp \(t_o\); it performs spatial aggregation at the image level. The alignment module, on the other hand, performs feature-level adjustments to align the temporally neighboring frames (at timestamps \(\left\{ t_{1}, ... , t_{N}\right\} \)) with the reference frame. It effectively handles complex motions and large-parallax problems.
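A simplified sketch of feature-level deformable alignment using `torchvision.ops.DeformConv2d`. Predicting offsets from the concatenation of the feature map to be aligned and the target feature map \(E'\) is an assumption on our part; the paper's redeveloped module may differ.

```python
import torch
import torch.nn as nn
from torchvision.ops import DeformConv2d

class DeformableAlign(nn.Module):
    """Align a 64-channel feature map F_n to the target feature map E' of the
    synthesized frame. Offsets are predicted from [F_n, E'] and fed to a
    deformable convolution; a simplified stand-in for the paper's module."""
    def __init__(self, channels=64, ksize=3):
        super().__init__()
        # 2 offsets (x, y) per kernel tap.
        self.offset_pred = nn.Conv2d(2 * channels, 2 * ksize * ksize,
                                     kernel_size=3, padding=1)
        self.deform = DeformConv2d(channels, channels, kernel_size=ksize,
                                   padding=ksize // 2)

    def forward(self, feat, target):          # both (B, 64, H, W)
        offsets = self.offset_pred(torch.cat([feat, target], dim=1))
        return self.deform(feat, offsets)     # aligned feature map

align = DeformableAlign()
F_n = torch.rand(1, 64, 32, 32)
E_prime = torch.rand(1, 64, 32, 32)
print(align(F_n, E_prime).shape)  # torch.Size([1, 64, 32, 32])
```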

Deep Detail Refinement (Step F). Having obtained the synthesized reference frame \(O'\) along with its aligned feature map \(\bar{E}\), we aggregate the reference frame and the supporting frames across time and space (since each supporting frame may contain different details), and then up-sample to obtain the output in HR. We directly concatenate the \(N + 1\) frames and feed them into a convolutional layer to output the fused feature map. Then, we feed the fused feature map and the aligned feature map derived in Step E into a nonlinear mapping module, which utilizes the fused features to predict deep features. After extracting the deep features, we utilize an up-sampling module composed of an up-scaling layer [14] with sub-pixel convolution [21] to increase the resolution of the feature map. The final output frame in HR is obtained by applying a convolutional layer to the zoomed feature map.
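A sketch of the up-sampling tail: a convolution expands the deep feature map by a factor of \(s^2\) in channels, `nn.PixelShuffle` rearranges the channels into an \(s\times \) larger spatial grid (sub-pixel convolution), and a final convolution produces the 3-channel HR frame. The layer widths are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class SubPixelUpsampler(nn.Module):
    """Sub-pixel up-sampling: conv to s^2 * C' channels, PixelShuffle to an
    s-times larger grid, then a final conv producing the 3-channel HR frame.
    Channel widths here are illustrative assumptions."""
    def __init__(self, in_channels=64, scale=4):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(in_channels, in_channels * scale ** 2, 3, padding=1),
            nn.PixelShuffle(scale),                   # (B, 64, sH, sW)
            nn.Conv2d(in_channels, 3, 3, padding=1),  # final HR frame
        )

    def forward(self, deep_features):                 # (B, 64, H, W)
        return self.body(deep_features)               # (B, 3, sH, sW)

up = SubPixelUpsampler(scale=4)
deep = torch.rand(1, 64, 64, 112)   # LR-sized features (e.g. 448x256 after 4x down-sampling)
print(up(deep).shape)               # torch.Size([1, 3, 256, 448])
```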

4 Experiments

Datasets and Metrics. We adopt Vimeo90k [30] as the training and testing dataset. It is a public dataset consisting of \(448 \times 256\) videos for video enhancement tasks, including super-resolution. We further use Vid4 [15] as another testing set, since its results are more visually comparable. Vid4 consists of multiple scenes with various motions and occlusions in 720p. We apply \(4 \times \) down-sampling using bicubic degradation. We stream the down-sampled H.264-encoded video over the Internet using the UDP-based RTSP protocol [20] via FFmpeg. As quantitative evaluation metrics for the quality of reconstructions, we employ Peak Signal-to-Noise Ratio (PSNR), measured in decibels (dB), where higher values indicate better quality, and the Structural Similarity Index (SSIM), which ranges from 0 to 1, with higher values denoting greater similarity to the ground truth.
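For reference, a minimal sketch of the PSNR computation on 8-bit frames (MAX = 255); SSIM is more involved and is typically taken from a library such as scikit-image.

```python
import numpy as np

def psnr(ref, test, max_val=255.0):
    """Peak Signal-to-Noise Ratio in dB between a ground-truth frame `ref`
    and a reconstructed frame `test` (uint8 arrays of identical shape)."""
    mse = np.mean((ref.astype(np.float64) - test.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")
    return 10.0 * np.log10(max_val ** 2 / mse)

# Toy usage on random 8-bit frames.
gt = np.random.randint(0, 256, (256, 448, 3), dtype=np.uint8)
noisy = np.clip(gt.astype(int) + np.random.randint(-5, 6, gt.shape), 0, 255).astype(np.uint8)
print(round(psnr(gt, noisy), 2))
# SSIM can be computed with skimage.metrics.structural_similarity(gt, noisy, channel_axis=2).
```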

Table 1. Average PSNR (in dB) and SSIM along with the delay (in ms) under all schemes. B1–B12 denote the benchmark schemes.
Fig. 4. Quality vs. delay under different schemes.

Fig. 5. Qualitative comparison of the reconstructed loss region. Two red boxes are zoomed in. The second row shows one of the reconstructed regions. The third row shows the border between the damaged and undamaged parts. (Color figure online)

Benchmarks. We conduct comprehensive experiments comparing 3RE-Net with the following benchmarks. For the loss recovery part, we consider two video interpolation networks (DAIN [2] and BMBC [17]) and two video inpainting networks (DFC-Net [29] and VI-Net [12]). For the super-resolution part, we consider two video super-resolution networks (RRN [8] and RBPN [6]). We then run interpolation/inpainting and super-resolution in tandem to compare with ours. Since the above schemes are not designed for real-time loss recovery, we modify them slightly for a fair comparison. The modified schemes take as input two previous (supporting) frames and the current damaged (reference) frame, identical to 3RE-Net. The timestamps in the optical flow manipulation are changed to ensure that the benchmarks synthesize the damaged frame as desired. Please note that we can swap the order of interpolation and super-resolution and treat them as different benchmarks. For example, DAIN + RRN means we run DAIN first and then RRN, while RRN + DAIN means we run RRN first and then DAIN. Inpainting (DFC-Net and VI-Net) can only be run after super-resolution, because these networks contain large convolution kernels and cannot handle input frames in low resolution. Video prediction performs worse than interpolation, so we do not include prediction-based benchmarks. In sum, we consider 12 benchmarks (labeled B1–B12) in Table 1.

Results. The quantitative and qualitative results are summarized in Table 1 and Fig. 5, respectively. For the visualized results, we only show our results and the top five benchmarks due to the space limit. The quality and delay comparison on Vid4 is illustrated in Fig. 4. Note that the delay is averaged over 1,000 rounds of experiments. In this subsection, we choose \(N = 2\) (number of supporting frames) by default; the choice of N is further discussed in the ablation studies of the supplementary material.

3RE-Net substantially outperforms all benchmark schemes in terms of PSNR and SSIM, achieving 21.26 dB in PSNR (resp. 0.6435 in SSIM) on the Vid4 dataset and 31.21 dB in PSNR (resp. 0.9064 in SSIM) on Vimeo90k. This is because 3RE-Net effectively exploits the motions and features in the supporting frames. With packet loss, the damaged part of the frame is still observable in the preceding (supporting) frames. 3RE-Net aggregates this information and reconstructs the damaged part in a spatio-temporally consistent manner. The border between the damaged and undamaged parts is smooth and barely observable (see the last row in Fig. 5).

Other benchmark schemes either yield poor quality (B1 and B5 introduce a barely acceptable \(\sim 400\) ms delay but low PSNR of around 18 dB) or cause too much delay (\(>1000\) ms for the rest of the benchmarks). 3RE-Net causes the least delay (187 ms) and yields satisfactory results. In our joint design, the extracted motions and features are reused by multiple modules (including motion synthesis, feature-level alignment, and deep detail refinement), which significantly reduces redundancy.

We observe a trade-off when swapping the order of interpolation and super-resolution. Performing interpolation before super-resolution (B1, B3, B5, and B7) yields lower quality but less delay than the opposite order (B2, B4, B6, and B8): the load on the motion and feature extractors is smaller for low-resolution inputs, but the results are less accurate due to the limited input resolution. 3RE-Net provides high quality without causing large delays, since all modules have access to accurate motions and features.

In terms of the quality-delay trade-off, 3RE-Net substantially improves the video quality at the cost of a small amount of delay. It is more advantageous than all other benchmarks, which either cannot realize as much performance gain as ours or cause too much delay, intolerable for real-time video applications.

Other Studies. Results with confidence intervals, ablation study, inference delay breakdown, and more visualized examples are shown in the supplementary materials [1].

5 Conclusion

In this work, we propose 3RE-Net (Joint Loss-REcovery and Super-REsolution Neural Network for REal-time Video). Our pipeline utilizes the reference (damaged) frame and preceding frames. It exploits the motions and features of the input frames and propagates them through the warping and synthesis modules. The output frame is then reconstructed from deep features, in high resolution and loss-free, with small inference delay. We run benchmark interpolation/inpainting and super-resolution schemes, and the results show that 3RE-Net outperforms all benchmarks with substantial improvements in both quality and delay.