MVFI-Net: Motion-Aware Video Frame Interpolation Network

Lin, Xuhu; Zhao, Lili; Liu, Xi; Chen, Jianwen

doi:10.1007/978-3-031-26313-2_21

Xuhu Lin¹²,
Lili Zhao^12,13,
Xi Liu¹² &
…
Jianwen Chen¹²

Part of the book series: Lecture Notes in Computer Science ((LNCS,volume 13843))

Included in the following conference series:

Asian Conference on Computer Vision

813 Accesses
1 Citations

Abstract

Video frame interpolation (VFI) is to synthesize the intermediate frame given successive frames. Most existing learning-based VFI methods generate each target pixel by using the warping operation with either one predicted kernel or flow, or both. However, their performances are often degraded due to the issues on the limited direction and scope of the reference regions, especially encountering complex motions. In this paper, we propose a novel motion-aware VFI network (MVFI-Net) to address these issues. One of the key novelties of our method lies in the newly developed warping operation, i.e., motion-aware convolution (MAC). By predicting multiple extensible temporal motion vectors (MVs) and filter kernels for each target pixel, the direction and scope could be enlarged simultaneously. Besides, we first attempt to incorporate the pyramid structure into the kernel-based VFI, which can decompose large motions into smaller scales to improve the prediction efficiency. The quantitative and qualitative experimental results have demonstrated the proposed method delivers the state-of-the-art performance on the diverse benchmarks with various resolutions. Our codes are available at https://github.com/MediaLabVFI/MVFI-Net.

Access provided by Autonomous University of Puebla. Download conference paper PDF

A Multi-frame Video Interpolation Neural Network for Large Motion

A Flexible Recurrent Residual Pyramid Network for Video Frame Interpolation

Real-Time Intermediate Flow Estimation for Video Frame Interpolation

Keywords

1 Introduction

Video frame interpolation (VFI) has been greatly demanded in many applications such as frame-rate up conversion [4, 5], slow-motion generation [19, 35, 39] and video compression [10, 16], etc. In recent years, deep learning shows its strong capacity in a series of low-level tasks, some learning-based VFI methods have been subsequently developed [2, 7, 21, 29, 31, 33, 34, 37, 40, 41], which can be viewed as a two-stage process: 1) motion estimation (ME), i.e., for the output pixels, finding their reference regions in reference frames; 2) motion compensation (MC), i.e., synthesizing the output pixels based on ME (Fig. 1).

Among the existing methods, one popular strategy is to conduct the optical flow estimation as ME (e.g., [2, 29, 33, 34, 41]). However, the optical flow estimator will impose much computational complexity into the VFI algorithms, and the quality of the interpolated frame largely depends on the quality of flows [46]. Besides, the flow-based methods more consider the most related pixel in reference frame, while ignoring the influence of surrounding pixels, which could result in the limited direction issues.

Instead of using optical flows, some kernel-based VFI methods are developed (e.g., [7, 21, 31, 37, 40]), which can be viewed as a convolution process. SepConv [31] estimated a separable kernel for each location, and then convolved these kernels with the corresponding reference regions to predict the output pixels. However, it can not deal with the motions beyond the kernel size since its scope is fixed and limited, as demonstrated in Fig. 2(a). In this case, SDC-Net [37] proposed a spatially-displaced convolution which predicted one flow and one kernel for each location, as shown in Fig. 2(b). Then, the scope could be enlarged based on flow information, but the direction is still limited. Recently, DSepConv [7] utilized a separably deformable convolution to expand the direction of the reference regions [seeing Fig. 2(c)]. The work [21] provided the version without weight sharing as an independent warping operation namely AdaCoF. Nevertheless, the issue of limited scope is exposed again, especially for videos with a large number of complex motions.

Based on the above-mentioned discussions, there are two issues existing in VFI: 1) the direction and scope of reference regions are limited and hard to be loosed simultaneously; 2) kernel-based methods are non-robust for complex motions. To tackle the aforementioned issues, we propose a novel VFI method namely motion-aware VFI network (MVFI-Net).

For issue 1), we develop a novel warping technique called motion-aware convolution (MAC), which is one of the key novelties. In specific, multiple temporally extensible motion vectors (MVs) and corresponding spatially-varying kernels are predicted for each target pixel. Then the target pixel is calculated by convolving kernels with selected regions. The rationale behind this design is that during the motion estimation via the network, it may be possible to search MVs in a very small range and later synthesize the current pixel only based on the adjacent pixels. Obviously, the performance would be largely degraded while dealing with complex motion. In this case, we propose a motion-aware extension mechanism to adaptively extend the temporal MVs for a wider search range, and improve the efficiency of VFI. As illustrated in Fig. 2(d), compared to the previous works, MAC can explore more directions and larger scopes, which has the potential to overcome both limitations.

For issue 2), it has been known that many optical flow-based approaches have used the feature pyramid network (FPN) as the feature extractor to get multi-scale feature maps, which can be warped by internally scaling the flow in different sizes. This operation can decompose large motions into a smaller scale, which facilitates more accurate motion estimation. Thus, it is highly desirable to bring FPN into kernel-based VFI methods. However, for warping multi-scale feature maps, kernels with various sizes are required. This means that large memory is demanded, which is impractical. To solve this problem, we propose a two-stage warping strategy to warp reference frames and multi-scale features, while exploiting a frame synthesis network to mix these information for generating high-quality frame. Experimental results show that our design can improve the robustness and performance of VFI with complex motion.

Our contributions can be summarized as follows: 1) A novel warping technique MAC is proposed to simultaneously alleviate the issues of limited direction and scope of reference regions; 2) We propose a two-stage warping strategy to firstly integrate the pyramid structure into the kernel-based VFI method; 3) The proposed method delivers the state-of-the-art results on several benchmarks with various resolutions.

2 Related Work

In this section, we briefly review the works about VFI. The VFI methods can be mainly divided into three categories: 1) the phase-based methods (e.g., [26, 27]), 2) the optical flow-based methods (e.g., [2, 3, 9, 22, 23, 28, 29, 33, 34, 41]), and 3) the kernel-based methods (e.g., [7, 8, 21, 31, 32, 40]).

Phase-Based VFI. It is commonly-known that images can be transformed to the frequency field by Fourier Transform, and this operation has been used into VFI by [26, 27]. Specifically, they utilized a Phase-Net to conduct the motion estimation by the linear combinations of wavelets, which achieved competitive inference time. However, it is difficult for this category of methods to process the large motion on the high frequency components, usually imposing artifacts and pixel disappearance.

Optical Flow-Based VFI. In recent years, the deep learning-based optical flow estimation [13, 17, 36, 44, 45] has delivered impressive quality on motion estimation. Therefore, the optical flow estimation has been used in many subsequently developed VFI approaches (e.g., [2, 3, 15, 19, 29, 33, 34]). For example, in [29], the interpolated frame $I_t$ is synthesized by forward warping the input consecutive frames ($I_0$ and $I_1$) with their features, under the guidance of the estimated optical flows $t \cdot F_{0 \rightarrow 1}$ and $(1-t) \cdot F_{1 \rightarrow 0}$ via softmax splatting and a frame synthesis network. Instead of using the forward-warping method, another trend in this research have exploited the backward-warping strategy (e.g., [2, 3, 19, 33]). It is well-known that for these backward warping-based methods, the flows from the target frame to the bi-directional reference frames, denoted as $F_{t \rightarrow 0}$ and $F_{t \rightarrow 1}$, are unavailable. To address this issue, the algorithms presented in [3, 19] approximated $F_{t \rightarrow 0}$ and $F_{t \rightarrow 1}$ by linearly combining bi-directional optical flows based on the target time step t. However, such motion estimation is under a basic assumption that the motion is linear and symmetric. Therefore, it can not model complex motions in real-world videos, and any initial errors imposed by flow estimation would be inevitably propagated to the subsequent processing procedures. More recently, the work proposed in [15] directly predicted the intermediate flow via a teacher-student net architecture without any prior bi-directional flow estimation. For the asymmetric motions in real-world videos, a bi-directional correlation volume algorithm [34] is proposed to modify intermediate flows, respectively, which achieves the state-of-the-art performance.

Kernel-Based VFI. Consider that incredibly extra cost would be introduced for predicting prior optical flow, some works (e.g., [7, 8, 21, 30,31,32]) attempt to directly synthesize the pixel by spatially-varying convolution. For example, the work in [30] synthesized a target pixel by predicting kernels with the size of $41\times 41$ for two reference frames, and then convolved them with the reference pixels. However, a huge amount of memory is required to store the kernels with such large size. In this case, [31, 32] decomposed the convolutional kernels into two one-dimensional vectors, and then used outer product to obtain the final kernel. These two methods save much memory but can not handle videos with large motions beyond the limited kernel size. To address this issue, an added kernel is estimated for each flow to collect local appearance information [37]. Moreover, inspired by deformable convolution, [8, 40] predicted weight-sharing offsets for each element of kernels, and thus irregular motions could be described. In [21], the degree of freedom of square kernel is further improved, while it is still limited by reference scopes. Obviously, compared to optical flow-based VFI approaches, the existing kernel-based methods have drawbacks in motion estimation due to lack of prior motion guidance. Moreover, it is hard to introduce the pyramid structure into kernel-based VFI methods. This is because unlike the optical flow, if kernels are down-sampled, their weights would change completely, and there is no one-to-one correspondence between pixels. Fortunately, our work can fill up this gap.

3 Method

In this section, we first present the problem statement, and then we give an overview of our MVFI-Net. Later, more details of each module will be provided, respectively.

3.1 Problem Statement

VFI is to synthesize the temporally-consistent middle frame $I_{1}$ between two consecutive frames $I_{0}$ and $I_{2}$. An essential step is to find a transformation function $\mathcal {T}{\left( \cdot \right) }$ to warp reference frames based on the motion estimation (ME) results $\{\theta _0,\theta _2\}$. Therefore, the procedure of VFI can be formulated as

$$\begin{aligned} I_1 = \mathcal {T}_{\theta _0}{\left( I_{0}\right) }+ \mathcal {T}_{\theta _2}{\left( I_{2}\right) }. \end{aligned}$$

(1)

However, it is commonly-known that some undesirable artifacts could be yielded during the aforementioned linear combination, when the target pixel is only visible in one of reference frames. This phenomenon is called as the occlusion issue. In this case, we define a soft mask [8, 21] ${M}\in {[0,1]^{{H}\times {W}}}$ to tackle this problem, where [H, W] is the target frame size. Then, Eq. (1) can be modified as

$$\begin{aligned} I_1 ={M}\odot \mathcal {T}_{\theta _0}{\left( I_{0}\right) }+ {(J-M)}\odot \mathcal {T}_{\theta _2}{\left( I_{2}\right) }, \end{aligned}$$

(2)

where $\odot $ is element-wise multiplication, and ${J}\in {R^{H\times W}}$ is a matrix where each element is equal to one.

3.2 Overall Architecture

The pipeline of the proposed MVFI-Net is shown in Fig. 3, which is mainly composed of four modules: a motion estimation network MENet ($\mathcal {U}$), a novel warping technique motion-aware convolution MAC ($\mathcal {M}$), a context-pyramid feature extractor ($\mathcal {C}$) and a frame synthesis network ($\mathcal {G}$). In specific, $\mathcal {U}$ first takes two reference frames $I_0$ and $I_2$ as inputs to predict temporal MVs $\{F_{10},F_{12}\}$, kernel weights $\{K_{10}, K_{12}\}$ and the aforementioned mask M. Concurrently, the weight-sharing $\mathcal {C}$ extracts multi-scale feature maps of $\{I_0, I_2\}$, i.e., $\{c_0^0, c_0^1, c_0^2\}$ for $I_0$ and $\{c_2^0, c_2^1, c_2^2\}$ for $I_2$. Then, the proposed two-stage warping strategy is used to align these feature maps and reference frames with time step 1. Particularly, $\mathcal {M}$ is adopted to warp $\{I_0, I_2\}$ and feature maps at the first pyramid layer $\{c_0^0, c_2^0\}$. Next, instead of predicting multi-scale kernels, the flow information $\{f_{10}, f_{12}\}$, called as sumflow, is calculated by weighted summation of $\{F_{10}, F_{12}\}$ along the channel axis, respectively. Later, lower-resolution feature maps $\{c_0^1, c_2^1, c_0^2, c_2^2\}$ are backward warped to the middle ones $\{c_{01}^1, c_{21}^1, c_{01}^2, c_{21}^2\}$ with the guidance of $\{f_{10}, f_{12}\}$ which are internally scaled to the corresponding size. Finally, the interpolated frame $I_1$ will be synthesized by $\mathcal {G}$. More details will be analyzed next.

3.3 Motion Estimation Network

As illustrated in Fig. 4, MENet ($\mathcal {U}$) is designed based on the U-Net [38] architecture, followed by nine parallel sub-nets, which are used to predict five elements: the bi-directional temporal MVs (${F}_{10}=[{u}_{10},{v}_{10}]$, ${F}_{12}=[{u}_{12},{v}_{12}]$), filter kernels (${K}_{10}=[{k^u_{10}},{k^v_{10}}]$, ${K}_{12}=[{k^u_{12}},{k^v_{12}}]$) and the soft mask M, where u and v represent the horizontal and vertical direction, respectively. Each prediction process $\mathcal {X}$ can be formulated by

$$\begin{aligned} \mathcal {X} = \mathcal {U}(Cat[I_0,I_2]), \end{aligned}$$

(3)

where $Cat[\cdot ]$ is the concatenation operation along the channel axis.

3.4 Motion-Aware Convolution

Let $\bar{\bar{I}}$ be the target frame and I be the reference frame. Most kernel-based VFI methods (e.g., [31, 32]) assume that $\mathcal {T}{\left( \cdot \right) }$ is conducted by using the spatially-varying convolution, which is described as

$$\begin{aligned} \bar{\bar{I}}(x,y)=\sum {K(x,y)\odot P_I(x,y)}, \end{aligned}$$

(4)

where $P_I(x,y)$ is the patch centered at (x, y) in I, and the sum ($\sum $) represents the summation of all elements in Hadamard product.

Besides, K(x, y) is a 2D filter kernel obtained by the outer product of two 1D vectors, which is computed as

$$\begin{aligned} K(x,y)=k^v(k^u)^T \end{aligned}$$

(5)

However, for these methods, the direction and scope of reference regions are limited, and thus they fail to handle large motion beyond the kernel size. In this case, we attempt to predict multiple extensible temporal MVs for each location, and the corresponding Eq. (4) can be modified as

$$\begin{aligned} \bar{\bar{I}}(x,y)=\sum _{l=0}^{L-1}\sum K^l(x,y) \odot P_I(x+u^l(x,y)+d_{u}(l),y+v^l(x,y)+d_{v}(l)), \end{aligned}$$

(6)

where $u^{l}(x,y)$ and $v^{l}(x,y)$ are horizontal and vertical components of the $l^{th}$ temporal MV, respectively, and L denotes the amount of temporal MVs. $d_{u}(l)$ and $d_{v}(l)$ are offset biases, which are adaptively calculated by our proposed motion-aware extension mechanism (MAEM) [see Fig. 5(b) top]. $d_{u}(l)$ can be formulated as

$$\begin{aligned} d_{u}(l)= l \cdot sign(u^{l}(x,y)),\Vert u^{l}(x,y)\Vert \ge \gamma , \end{aligned}$$

(7)

where $sign(\cdot )$ is the signal function, while $\gamma $ is the preset threshold and we empirically set $\gamma =1$. Note that $d_v(l)$ can be calculated in the same way. The motivation behind it is that we find the traditional dilation mechanism [see Fig. 5(a) top] fails to handle irregular motions due to its fixed dilation coefficient. To address this issue, we propose motion-aware extension mechanism (MAEM), which is able to adaptively extend temporal MV by modifying its start position on the basis of initial prediction. As shown in Fig. 5, our method can capture more accurate MVs and deliver the higher quality frame.

3.5 Multi-scale Feature Aggregation

Context-Pyramid Feature Extractor. Consider that the features of frames in VFI tasks are different from those in classification tasks. Motivated by [29], we optimize a feature extractor ($\mathcal {C}$) from scratch, rather than using the existing pre-trained model (e.g., VGG [42] and Resnet [14]). As depicted in Fig. 6, the features can be obtained by

$$\begin{aligned} c_t^s = \mathcal {C}(I_t)^{s}, \end{aligned}$$

(8)

where s is the scale of pyramid, and t is the time step of reference frames.

As discussed previously, extremely large memory is required while predicting different kernels (e.g., $ H\times W, \frac{H}{2} \times \frac{W}{2}, \frac{H}{4} \times \frac{W}{4}$) for each scale feature map, which is impractical for applications. To address this issue, we only apply MAC ($\mathcal {M}$) for $c_0^0$ and $c_2^0$ which have the same resolutions as the target frame. Then, the warped feature map of the first layer is obtained by

$$\begin{aligned} c_1^0 = M \odot \mathcal {M}_{F_{10};K_{10}}(c_0^0)+(J-M) \odot \mathcal {M}_{F_{12};K_{12}}(c_2^0). \end{aligned}$$

(9)

For $c_t^1$ and $c_t^2$, we first calculate the sumflow ($f_{10}$ and $f_{12}$) through the predicted temporal MVs as

$$\begin{aligned} \begin{aligned} f_{10}&= Cat[\sum _l^{L-1}{u_{10}^l(x,y)+d_{u_{10}}(l)},\sum _l^{L-1}{v_{10}^l(x,y)+d_{v_{10}}(l)}].\\ \end{aligned} \end{aligned}$$

(10)

$f_{12}$ can be computed by the same way.

Then the backward warping operation [18] is used for the temporal alignment, which is described as

$$\begin{aligned} \begin{aligned} c_{01}^s&=backwarp((f_{10})^{\downarrow s},c_{0}^{s}),\\ c_{21}^s&=backwarp((f_{12})^{\downarrow s},c_{2}^{s}), \end{aligned} \end{aligned}$$

(11)

where $\downarrow s$ denotes that sumflow has been downsampled to the same size with $s^{th}$ feature map.

Frame Synthesis Network. The modified version of Grid-Net [28] is used as our frame synthesis network. As depicted in Fig. 7, the inputs of the network are warped frame $I_1$ and feature maps $\{c_1^0, c_{01}^1, c_{21}^1, c_{01}^2, c_{21}^2\}$, and the output is the final interpolated frame.

4 Experiments

4.1 Datasets and Implementation Details

We select four benchmarks with various video resolutions for comparison, which are Vimeo90K [47] ($448 \times 256$), UCF101 [24, 43] ($256 \times 256$), Middlebury [1] ($640 \times 480$) and SNU-FILM [11] ($1280 \times 720$). We next describe our training details for MVFI-Net.

Loss Functions. To evaluate the difference between the interpolated frame $I_{1}$ and its ground truth ${I_{gt}}$, we combine Charbonnier penalty function [6] with the gradient loss [25], which can facilitate generating a sharper frame.

$$\begin{aligned} \mathcal {L}_{d} = \lambda _{1}\mathcal {L}_{char} + \lambda _{2}\mathcal {L}_{gdl}, \end{aligned}$$

(12)

where we empirically set $\lambda _{1}=1$, $\lambda _{2}=1$.

Training Strategy. We train the MVFI-Net for 150 epochs on the Vimeo90K training triplets, and use AdaMax [20] optimization with $\beta _{1} = 0.9$, $\beta _{2} =0.999$, where the initial learning rate is set as 0.001. Note that the learning rate will be decreased by a factor of 0.5 when the validation loss does not decrease for five epochs.

Table 1. Quantitative comparisons on three benchmarks. We also calculate the inference time and MACs on Middlebury [1] ‘Urban’ set. All methods are tested on one NVIDIA 2080Ti GPU. For a fair comparison, each method is only trained on Vimeo90K triplets [47] by taking only two frames as reference.

Full size table

4.2 Comparison with the State-of-the-Arts

To prove the effectiveness of our proposed algorithm, we compare our method with other competitive works, including SepConv [31], DSepConv [7], DAIN [2], CAIN [11], AdaCoF [21], BMBC [33], SepConv++ [32], CDFI [12], EDSC [8], X-VFI [41] and GDConvNet [40]. For evaluation metrics, we measure the performance of VFI methods in terms of Peak Signal-to-Noise Ratio (PSNR) and Structural Similarity (SSIM). Additionally, we also calculate the widely-used Interpolation Error (IE) on Middlebury-Other set for evaluation. Note that the code of all compared methods are publicly available.

Table 2. Quantitative results of the current competitive kernel-based methods on four settings of SNU-FILM [11].

Full size table

Quantitative Results. We provide two versions of MVFI-Net for comparison, which have different model sizes. Formally, MVFI-Net$_S$ represents five temporal MVs and corresponding $11\times 11$ kernels are predicted, while MVFI-Net$_L$ possesses eleven temporal MVs and the same kernel size. As shown in Table 1, our method is far superior to others on diverse benchmarks in terms of PSNR and SSIM. Compared to the state-of-the-art kernel-based method SepConv++ [32], our lightweight model MVFI-Net$_S$ improves the PSNR by 0.88 dB on Vimeo90K with 1.5$\times $ faster inference speed. Although the fastest VFI approach AdaCoF [21] is 2.3$\times $ faster than us, MVFI-Net$_S$ gives 1.36 dB improvement on Vimeo90K. Note that the gains are distinct on the other two benchmarks while the MACs are similar. Compared to optical flow-based approaches, we comprehensively provide improvements either objective qualities or inference speed. This result supports that kernel-based approaches could yield impressive results without prior flow estimation.

To further prove that MVFI-Net is more robust for complex motions, we compare it with current competitive kernel-based algorithms on SNU-FILM [11] which is divided into four settings according to motions. From Table 2, it can be obviously found that our proposed MVFI-Net$_L$ delivers better performance on all settings. Generally, it is difficult to retain the structure and shape of objects in the interpolated frame when large motion exists, which imposes artifacts and blurriness with lower SSIM. Nevertheless, MVFI-Net$_L$ partly fixes this defect and supply an impressive result.

Qualitative Results. To verify the subjective quality, we also visually compare MVFI-Net with high-performance VFI approaches. As illustrated in Fig. 8, it can be seen that there are no overshoot artifacts in the frames generated by our method, while others fail. For example, in the third row and fourth row, our method clearly keep the texture details of the airplane head and the wing of the bird, while other methods impose serious artifacts and blur background.

4.3 Ablation Study

In this section, we first conduct ablation studies to demonstrate the effect of our proposed motion-aware extension mechanism (MAEM) and the improvements of introducing the pyramid structure. Then, we design a series of experiments to explore the effectiveness of different amounts of temporal MVs and different kernel sizes. Finally, we attempt to transfer our method to AdaCoF [21], in which multiple flows are predicted for each target pixel, to analyze whether the performance can be improved by our algorithm.

Table 3. Ablation studies of the motion-aware extension mechanism and pyramid architecture.

Full size table

MAEM and Pyramid Structure. For a fair comparison, each group is retrained under the same condition without any prior information. We first remove the pyramid structure, and then the final interpolated frame is synthesized by Eq. (2) where $\mathcal {T}{\left( \cdot \right) }$ is MAC. Next, we continue to remove MAEM, temporal MVs are no longer extended according to the initial prediction. From Table 3, it can be observed that our proposed MAEM, which is non-parametric and non-extra inference time cost, provides stable improvements on each benchmark in terms of two evaluation metrics. Besides, it can be demonstrated that the pyramid structure is vital for VFI, since large motions could be decomposed into the smaller scale that can be easier predicted and captured by the network. It can be seen that there is nearly 1 dB gain on Vimeo90K and Middlebury, while supplying higher structure similarity. We also illustrate a visual result for intuitive comparison in Fig. 9.

Table 4. Ablation studies of the amounts of temporal motion vectors.

Full size table

Amount of Temporal MVs. As discussed above, more reference pixels are required for complex motions. Therefore, the amount of predicted temporal MVs would directly influence the performance. To verify it, we first set a fixed kernel size $N=11$, and let the number of temporal MVs be $L\in \{1,5,11\}$. As displayed in Table 4, increasing temporal MVs helps the network explore more related regions, and thus the middle frame with a higher quality is synthesized. It should be noted that for Middlebury [1], the model with five temporal MVs is a little better than that with eleven temporal MVs. This can be explained by the fact that the videos in Middlebury usually have small motion. This means that the motion difference between consecutive frames is relatively small. Therefore, redundant temporal MVs may incur dispensable appearance information, leading to undesirable artifacts.

Kernel Size. It has been demonstrated that the quality of the interpolated frame is closely relevant to the size of the adaptive kernel [31]. To explore the effectiveness of the kernel size, we train several models by using the kernels with different sizes. Similar to the above experiments, we fix the number of temporal MVs $L=5$ and modify the kernel size $N \in \{1,5,11\}$. Note that $N=11$ means the $11\times 11$ kernel is predicted for each target pixel. From Table 5, it can be observed that a larger kernel can facilitate generating a better interpolated result. However, when the kernel becomes larger (i.e., $N=11$), there is no significant improvements. It is because our proposed MAEM has facilitated capturing motions accurately, while larger kernels have little effect and may impose repetitive local information.

Table 5. Ablation studies of kernel size

Full size table

Table 6. Ablation study on transferring our method to AdaCoF [21]

Full size table

Transferability. In AdaCoF [21], multiple flows are predicted for each target pixel, which is similar to our algorithm. Moreover, they introduce a fixed dilation coefficient to expand the initial point of flow for a wider searching range. As demonstrated above, our MAEM could facilitate more accurate motion estimation and the pyramid structure is vital for VFI. Therefore, we attempt to transfer MAEM and the two-stage warping strategy into AdaCoF to analyze their effectiveness. Table 6 illustrates that our proposed method significantly provides stable gains for AdaCoF on the basis of each benchmark.

5 Conclusion

In this paper, we propose a novel VFI network namely MVFI-Net. There are two novelties: (1) we design an efficient warping technique MAC, where multiple extensible temporal MVs and corresponding filter kernels are predicted for each target pixel, which enlarges the direction and scope of reference region simultaneously; (2) we firstly integrate the pyramid structure into the kernel-based VFI approach, which can decompose complex motions to a smaller scale, improving the efficiency of searching for temporal MVs. Extensive simulations conducted on various datasets have demonstrated that our proposed MVFI-Net is able to consistently deliver the state-of-the-art results in terms of the objective quality and human perception.

References

Baker, S., Scharstein, D., Lewis, J., Roth, S., Black, M.J., Szeliski, R.: A database and evaluation methodology for optical flow. Proc. Int. J. Comput. Vis. 1–8 (2007)
Google Scholar
Bao, W., Lai, W.S., Ma, C., Zhang, X., Gao, Z., Yang, M.H.: Depth-aware video frame interpolation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3698–3707 (2019)
Google Scholar
Bao, W., Lai, W.S., Zhang, X., Gao, Z., Yang, M.H.: Memc-net: motion estimation and motion compensation driven neural network for video interpolation and enhancement. IEEE Trans. Pattern Anal. Mach. Intell. 43, 933–948 (2019)
Article Google Scholar
Bao, W., Zhang, X., Chen, L., Ding, L., Gao, Z.: High-order model and dynamic filtering for frame rate up-conversion. IEEE Trans. Image Process. 27, 3813–3826 (2018)
Article MathSciNet MATH Google Scholar
Castagno, R., Haavisto, P., Ramponi, G.: A method for motion adaptive frame rate up-conversion. IEEE Trans. Circuits Syst. Video Technol. 6, 436–446 (1996)
Article Google Scholar
Charbonnier, P., Blanc-Feraud, L., Aubert, G., Barlaud, M.: Two deterministic half-quadratic regularization algorithms for computed imaging. In: Proceedings of 1st International Conference on Image Processing, pp. 168–172 (1994)
Google Scholar
Cheng, X., Chen, Z.: Video frame interpolation via deformable separable convolution. In: Proceedings of the AAAI Conference on Artificial Intelligence, pp. 10607–10614 (2020)
Google Scholar
Cheng, X., Chen, Z.: Multiple video frame interpolation via enhanced deformable separable convolution. IEEE Trans. Pattern Anal. Mach. Intell. 44, 7029–7045 (2021)
Google Scholar
Chi, Z., Mohammadi Nasiri, R., Liu, Z., Lu, J., Tang, J., Plataniotis, K.N.: All at once: temporally adaptive multi-frame interpolation with advanced motion modeling. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) ECCV 2020. LNCS, vol. 12372, pp. 107–123. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-58583-9_7
Chapter Google Scholar
Choi, H., Bajić, I.V.: Deep frame prediction for video coding. IEEE Trans. Circuits Syst. Video Technol. 30, 1843–1855 (2020)
Google Scholar
Choi, M., Kim, H., Han, B., Xu, N., Lee, K.M.: Channel attention is all you need for video frame interpolation. In: Proceedings of the AAAI Conference on Artificial Intelligence, pp. 10663–10671 (2020)
Google Scholar
Ding, T., Liang, L., Zhu, Z., Zharkov, I.: CDFI: compression-driven network design for frame interpolation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 7997–8007 (2021)
Google Scholar
Dosovitskiy, A., et al.: FlowNet: learning optical flow with convolutional networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2758–2766 (2015)
Google Scholar
He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016)
Google Scholar
Huang, Z., Zhang, T., Heng, W., Shi, B., Zhou, S.: RIFE: real-time intermediate flow estimation for video frame interpolation. arXiv preprint arXiv:2011.06294 (2020)
Huo, S., Liu, D., Li, B., Ma, S., Wu, F., Gao, W.: Deep network-based frame extrapolation with reference frame alignment. IEEE Trans. Circuits Syst. Video Technol. 31, 1178–1192 (2021)
Article Google Scholar
Ilg, E., Mayer, N., Saikia, T., Keuper, M., Dosovitskiy, A., Brox, T.: FlowNet 2.0: evolution of optical flow estimation with deep networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2462–2470 (2017)
Google Scholar
Jaderberg, M., Simonyan, K., Zisserman, A.: Spatial transformer networks. In: Proceedings of the Advances in Neural Information Processing Systems, pp. 2017–2025 (2015)
Google Scholar
Jiang, H., Sun, D., Jampani, V., Yang, M.H., Learned-Miller, E., Kautz, J.: Super slomo: high quality estimation of multiple intermediate frames for video interpolation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3813–3826 (2018)
Google Scholar
Kingma, D.P., Ba, J.: Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014)
Lee, H., Kim, T., Chung, T.Y., Pak, D., Ban, Y., Lee, S.: AdaCof: adaptive collaboration of flows for video frame interpolation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5315–5324 (2020)
Google Scholar
Li, H., Yuan, Y., Wang, Q.: Video frame interpolation via residue refinement. In: IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 2613–2617 (2020)
Google Scholar
Liu, Y.L., Liao, Y.T., Lin, Y.Y., Chuang, Y.Y.: Deep video frame interpolation using cyclic frame generation. In: Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), pp. 8794–8802 (2019)
Google Scholar
Liu, Z., Yeh, R.A., Tang, X., Liu, Y., Agarwala, A.: Video frame synthesis using deep voxel flow. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 4473–4481 (2017)
Google Scholar
Mathieu, M., Couprie, C., LeCun, Y.: Deep multi-scale video prediction beyond mean square error. arXiv preprint arXiv:1511.05440, pp. 1–14 (2016)
Meyer, S., Djelouah, A., McWilliams, B., Sorkine-Hornung, A., Gross, M., Schroers, C.: Phasenet for video frame interpolation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 498–507 (2018)
Google Scholar
Meyer, S., Wang, O., Zimmer, H., Grosse, M., Sorkine-Hornung, A.: Phase-based frame interpolation for video. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1410–1418 (2015)
Google Scholar
Niklaus, S., Liu, F.: Context-aware synthesis for video frame interpolation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1701–1710 (2018)
Google Scholar
Niklaus, S., Liu, F.: Softmax splatting for video frame interpolation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5436–5445 (2020)
Google Scholar
Niklaus, S., Mai, L., Liu, F.: Video frame interpolation via adaptive convolution. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 670–679 (2017)
Google Scholar
Niklaus, S., Mai, L., Liu, F.: Video frame interpolation via adaptive separable convolution. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 261–270 (2017)
Google Scholar
Niklaus, S., Mai, L., Wang, O.: Revisiting adaptive convolutions for video frame interpolation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 1098–1108 (2021)
Google Scholar
Park, J., Ko, K., Lee, C., Kim, C.-S.: BMBC: bilateral motion estimation with bilateral cost volume for video interpolation. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) ECCV 2020. LNCS, vol. 12359, pp. 109–125. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-58568-6_7
Chapter Google Scholar
Park, J., Lee, C., Kim, C.S.: Asymmetric bilateral motion estimation for video frame interpolation. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 14519–14528 (2021)
Google Scholar
Peleg, T., Szekely, P., Sabo, D., Sendik, O.: IM-Net for high resolution video frame interpolation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2393–2402 (2019)
Google Scholar
Ranjan, A., Black, M.J.: Optical flow estimation using a spatial pyramid network. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4161–4170 (2017)
Google Scholar
Reda, F.A., Liu, G., Shih, K.J., Kirby, R., Barker, J., Tarjan, D., Tao, A., Catanzaro, B.: SDC-Net: video prediction using spatially-displaced convolution. In: Proceedings of the European Conference on Computer Vision, pp. 718–733 (2018)
Google Scholar
Ronneberger, O., Fischer, P., Brox, T.: U-Net: convolutional networks for biomedical image segmentation. In: Navab, N., Hornegger, J., Wells, W.M., Frangi, A.F. (eds.) MICCAI 2015. LNCS, vol. 9351, pp. 234–241. Springer, Cham (2015). https://doi.org/10.1007/978-3-319-24574-4_28
Chapter Google Scholar
Shen, W., Bao, W., Zhai, G., Chen, L., Min, X., Gao, Z.: Blurry video frame interpolation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5113–5122 (2020)
Google Scholar
Shi, Z., Liu, X., Shi, K., Dai, L., Chen, J.: Video frame interpolation via generalized deformable convolution. IEEE Trans. Multimedia 20, 426–436 (2022)
Article Google Scholar
Sim, H., Oh, J., Kim, M.: XVFI: extreme video frame interpolation. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 14469–14478 (2021)
Google Scholar
Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556 (2014)
Soomro, K., Zamir, A.R., Shah, M.: UCF101: a dataset of 101 human actions classes from videos in the wild. arXiv preprint arXiv:1212.0402 (2012)
Sun, D., Yang, X., Liu, M.Y., Kautz, J.: PWC-NET: CNNs for optical flow using pyramid, warping, and cost volume. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 8934–8943 (2018)
Google Scholar
Teed, Z., Deng, J.: RAFT: recurrent all-pairs field transforms for optical flow. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) ECCV 2020. LNCS, vol. 12347, pp. 402–419. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-58536-5_24
Chapter Google Scholar
Wu, Z., Zhang, K., Xuan, H., Yang, J., Yan, Y.: DAPC-Net: deformable alignment and pyramid context completion networks for video inpainting. IEEE Signal Process. Lett. 28, 1145–1149 (2021)
Article Google Scholar
Xue, T., Chen, B., Wu, J., Wei, D., Freeman, W.T.: Video enhancement with task-oriented flow. Proc. Int. J. Comput. Vis. 1106–1128 (2019)
Google Scholar

Download references

Author information

Authors and Affiliations

University of Electronic Science and Technology of China, Chengdu, 611731, China
Xuhu Lin, Lili Zhao, Xi Liu & Jianwen Chen
China Mobile Research Institute, Beijing, 100032, China
Lili Zhao

Authors

Xuhu Lin
View author publications
You can also search for this author in PubMed Google Scholar
Lili Zhao
View author publications
You can also search for this author in PubMed Google Scholar
Xi Liu
View author publications
You can also search for this author in PubMed Google Scholar
Jianwen Chen
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Jianwen Chen .

Editor information

Editors and Affiliations

University of Wollongong, Wollongong, NSW, Australia
Lei Wang
University of Bonn, Bonn, Germany
Juergen Gall
University of Adelaide, Adelaide, SA, Australia
Tat-Jun Chin
National Institute of Informatics, Tokyo, Japan
Imari Sato
Johns Hopkins University, Baltimore, MD, USA
Rama Chellappa

Rights and permissions

Reprints and permissions

Copyright information

About this paper

Cite this paper

Lin, X., Zhao, L., Liu, X., Chen, J. (2023). MVFI-Net: Motion-Aware Video Frame Interpolation Network. In: Wang, L., Gall, J., Chin, TJ., Sato, I., Chellappa, R. (eds) Computer Vision – ACCV 2022. ACCV 2022. Lecture Notes in Computer Science, vol 13843. Springer, Cham. https://doi.org/10.1007/978-3-031-26313-2_21

Download citation

DOI: https://doi.org/10.1007/978-3-031-26313-2_21
Published: 02 March 2023
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-26312-5
Online ISBN: 978-3-031-26313-2
eBook Packages: Computer ScienceComputer Science (R0)

Publish with us

Policies and ethics

MVFI-Net: Motion-Aware Video Frame Interpolation Network

Abstract

Similar content being viewed by others

A Multi-frame Video Interpolation Neural Network for Large Motion

A Flexible Recurrent Residual Pyramid Network for Video Frame Interpolation

Real-Time Intermediate Flow Estimation for Video Frame Interpolation

Keywords

1 Introduction

2 Related Work