
1 Introduction

Video frame interpolation (VFI) is in great demand in many applications such as frame-rate up-conversion [4, 5], slow-motion generation [19, 35, 39] and video compression [10, 16]. In recent years, deep learning has shown strong capability in a series of low-level vision tasks, and a number of learning-based VFI methods have subsequently been developed [2, 7, 21, 29, 31, 33, 34, 37, 40, 41]. These methods can be viewed as a two-stage process: 1) motion estimation (ME), i.e., finding, for each output pixel, its reference regions in the reference frames; 2) motion compensation (MC), i.e., synthesizing the output pixels based on the ME results (Fig. 1).

Fig. 1.

Results of frame interpolation on UHD videos with large motion. The frame interpolated by our proposed MVFI-Net is sharper and contains more texture details than those generated by state-of-the-art kernel-based methods, e.g., SepConv++ [32] and GDConvNet [40].

Fig. 2.

Comparisons between the mainstream methods and our work. (a) The kernel-based method (e.g., SepConv [31]); (b) The method based on the kernel and flow (e.g., SDC-Net [37]); (c) DSepConv [7]; (d) The proposed MAC.

Among the existing methods, one popular strategy is to conduct optical flow estimation as ME (e.g., [2, 29, 33, 34, 41]). However, the optical flow estimator imposes considerable computational complexity on VFI algorithms, and the quality of the interpolated frame largely depends on the quality of the estimated flows [46]. Besides, flow-based methods mainly consider the most related pixel in the reference frame while ignoring the influence of the surrounding pixels, which can lead to the limited-direction issue.

Instead of using optical flows, some kernel-based VFI methods have been developed (e.g., [7, 21, 31, 37, 40]), in which warping is viewed as a convolution process. SepConv [31] estimated a separable kernel for each location, and then convolved these kernels with the corresponding reference regions to predict the output pixels. However, it cannot deal with motions beyond the kernel size since its scope is fixed and limited, as demonstrated in Fig. 2(a). To address this, SDC-Net [37] proposed a spatially-displaced convolution that predicts one flow and one kernel for each location, as shown in Fig. 2(b). The scope can then be enlarged based on the flow information, but the direction is still limited. Recently, DSepConv [7] utilized a separable deformable convolution to expand the directions of the reference regions [see Fig. 2(c)]. The work in [21] provided a version without weight sharing as an independent warping operation, namely AdaCoF. Nevertheless, the issue of limited scope is exposed again, especially for videos with a large number of complex motions.

Based on the above discussions, two issues exist in VFI: 1) the direction and scope of the reference regions are limited and hard to relax simultaneously; 2) kernel-based methods are not robust to complex motions. To tackle these issues, we propose a novel VFI method, namely the motion-aware VFI network (MVFI-Net).

For issue 1), we develop a novel warping technique called motion-aware convolution (MAC), which is one of the key novelties. Specifically, multiple temporally extensible motion vectors (MVs) and corresponding spatially-varying kernels are predicted for each target pixel. The target pixel is then calculated by convolving the kernels with the selected regions. The rationale behind this design is that, when motion is estimated by the network alone, MVs tend to be searched within a very small range, so that the current pixel is synthesized only from adjacent pixels. Obviously, the performance would degrade considerably when dealing with complex motion. We therefore propose a motion-aware extension mechanism to adaptively extend the temporal MVs for a wider search range and improve the efficiency of VFI. As illustrated in Fig. 2(d), compared to previous works, MAC can explore more directions and larger scopes, which has the potential to overcome both limitations.

For issue 2), it is well known that many optical flow-based approaches use a feature pyramid network (FPN) as the feature extractor to obtain multi-scale feature maps, which can be warped by internally scaling the flow to different sizes. This operation decomposes large motions into a smaller scale, which facilitates more accurate motion estimation. Thus, it is highly desirable to bring the FPN into kernel-based VFI methods. However, warping multi-scale feature maps requires kernels of various sizes, which demands a large amount of memory and is impractical. To solve this problem, we propose a two-stage warping strategy to warp the reference frames and multi-scale features, while exploiting a frame synthesis network to fuse this information and generate a high-quality frame. Experimental results show that our design improves the robustness and performance of VFI under complex motion.

Our contributions can be summarized as follows: 1) A novel warping technique, MAC, is proposed to simultaneously alleviate the limited direction and scope of the reference regions; 2) We propose a two-stage warping strategy that, for the first time, integrates the pyramid structure into a kernel-based VFI method; 3) The proposed method delivers state-of-the-art results on several benchmarks with various resolutions.

2 Related Work

In this section, we briefly review the related work on VFI. Existing VFI methods can be mainly divided into three categories: 1) phase-based methods (e.g., [26, 27]), 2) optical flow-based methods (e.g., [2, 3, 9, 22, 23, 28, 29, 33, 34, 41]), and 3) kernel-based methods (e.g., [7, 8, 21, 31, 32, 40]).

Phase-Based VFI. It is commonly known that images can be transformed to the frequency domain by the Fourier transform, and this idea has been applied to VFI in [26, 27]. Specifically, they utilized a PhaseNet to conduct motion estimation through linear combinations of wavelets, achieving competitive inference time. However, it is difficult for this category of methods to handle large motion in the high-frequency components, which usually introduces artifacts and missing pixels.

Optical Flow-Based VFI. In recent years, deep learning-based optical flow estimation [13, 17, 36, 44, 45] has delivered impressive motion estimation quality. Therefore, optical flow estimation has been adopted in many subsequently developed VFI approaches (e.g., [2, 3, 15, 19, 29, 33, 34]). For example, in [29], the interpolated frame \(I_t\) is synthesized by forward warping the input consecutive frames (\(I_0\) and \(I_1\)) together with their features, under the guidance of the estimated optical flows \(t \cdot F_{0 \rightarrow 1}\) and \((1-t) \cdot F_{1 \rightarrow 0}\), via softmax splatting and a frame synthesis network. Instead of forward warping, another line of research has exploited the backward-warping strategy (e.g., [2, 3, 19, 33]). It is well known that for these backward warping-based methods, the flows from the target frame to the bi-directional reference frames, denoted as \(F_{t \rightarrow 0}\) and \(F_{t \rightarrow 1}\), are unavailable. To address this issue, the algorithms presented in [3, 19] approximate \(F_{t \rightarrow 0}\) and \(F_{t \rightarrow 1}\) by linearly combining the bi-directional optical flows according to the target time step t. However, such motion estimation rests on the basic assumption that the motion is linear and symmetric. Therefore, it cannot model complex motions in real-world videos, and any initial errors introduced by the flow estimation are inevitably propagated to the subsequent processing steps. More recently, the work in [15] directly predicts the intermediate flow via a teacher-student architecture without any prior bi-directional flow estimation. To handle the asymmetric motions in real-world videos, a bi-directional correlation volume algorithm [34] has been proposed to refine the intermediate flows, achieving state-of-the-art performance.

Kernel-Based VFI. Considering that considerable extra cost is introduced by predicting optical flow beforehand, some works (e.g., [7, 8, 21, 30, 31, 32]) attempt to directly synthesize each pixel by spatially-varying convolution. For example, the work in [30] synthesized a target pixel by predicting a \(41\times 41\) kernel for each of the two reference frames and then convolving them with the reference pixels. However, a huge amount of memory is required to store kernels of such a large size. Therefore, [31, 32] decomposed the convolutional kernels into two one-dimensional vectors and used their outer product to obtain the final kernel. These two methods save much memory but cannot handle videos with large motions beyond the limited kernel size. To address this issue, an additional kernel is estimated for each flow to collect local appearance information [37]. Moreover, inspired by deformable convolution, [8, 40] predicted weight-sharing offsets for each element of the kernels, so that irregular motions can be described. In [21], the degree of freedom of the square kernel is further improved, yet the reference scope is still limited. Obviously, compared to optical flow-based VFI approaches, the existing kernel-based methods have drawbacks in motion estimation due to the lack of prior motion guidance. Moreover, it is hard to introduce the pyramid structure into kernel-based VFI methods, because, unlike optical flow, down-sampling the kernels would completely change their weights, and there is no one-to-one correspondence between pixels. Fortunately, our work fills this gap.

3 Method

In this section, we first present the problem statement and then give an overview of our MVFI-Net. After that, more details of each module are provided.

3.1 Problem Statement

VFI aims to synthesize the temporally consistent middle frame \(I_{1}\) between two consecutive frames \(I_{0}\) and \(I_{2}\). An essential step is to find a transformation function \(\mathcal {T}{\left( \cdot \right) }\) that warps the reference frames based on the motion estimation (ME) results \(\{\theta _0,\theta _2\}\). Therefore, the procedure of VFI can be formulated as

$$\begin{aligned} I_1 = \mathcal {T}_{\theta _0}{\left( I_{0}\right) }+ \mathcal {T}_{\theta _2}{\left( I_{2}\right) }. \end{aligned}$$
(1)

However, it is commonly known that undesirable artifacts can be produced by the above linear combination when the target pixel is visible in only one of the reference frames. This phenomenon is referred to as the occlusion issue. To tackle this problem, we define a soft mask [8, 21] \({M}\in {[0,1]^{{H}\times {W}}}\), where \([H, W]\) is the target frame size. Then, Eq. (1) can be modified as

$$\begin{aligned} I_1 ={M}\odot \mathcal {T}_{\theta _0}{\left( I_{0}\right) }+ {(J-M)}\odot \mathcal {T}_{\theta _2}{\left( I_{2}\right) }, \end{aligned}$$
(2)

where \(\odot \) denotes element-wise multiplication, and \({J}\in {R^{H\times W}}\) is an all-ones matrix.
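For concreteness, the occlusion-aware blending of Eq. (2) can be written as the following minimal PyTorch sketch; the tensor shapes and the names of the already-warped inputs are illustrative assumptions rather than our exact implementation.

```python
import torch

def blend_warped(warped_0: torch.Tensor,
                 warped_2: torch.Tensor,
                 mask: torch.Tensor) -> torch.Tensor:
    """Occlusion-aware blending of Eq. (2).

    warped_0, warped_2: warped reference frames T_{theta_0}(I_0), T_{theta_2}(I_2),
                        shape (B, 3, H, W)
    mask:               soft occlusion mask M in [0, 1], shape (B, 1, H, W)
    """
    # (J - M) reduces to 1 - M because J is the all-ones matrix.
    return mask * warped_0 + (1.0 - mask) * warped_2
```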

Fig. 3.

The overview of our proposed MVFI-Net. The network takes two consecutive frames \(I_0\) and \(I_2\) as inputs and generates the middle frame \(I_1\). Note that the green dotted box indicates that the parameters of that part are updated by gradient descent, while the gray one denotes a non-parametric warping operation. (Color figure online)

3.2 Overall Architecture

The pipeline of the proposed MVFI-Net is shown in Fig. 3. It is mainly composed of four modules: a motion estimation network, MENet (\(\mathcal {U}\)); a novel warping technique, motion-aware convolution, MAC (\(\mathcal {M}\)); a context-pyramid feature extractor (\(\mathcal {C}\)); and a frame synthesis network (\(\mathcal {G}\)). Specifically, \(\mathcal {U}\) first takes the two reference frames \(I_0\) and \(I_2\) as inputs to predict the temporal MVs \(\{F_{10},F_{12}\}\), the kernel weights \(\{K_{10}, K_{12}\}\) and the aforementioned mask M. Concurrently, the weight-sharing \(\mathcal {C}\) extracts multi-scale feature maps of \(\{I_0, I_2\}\), i.e., \(\{c_0^0, c_0^1, c_0^2\}\) for \(I_0\) and \(\{c_2^0, c_2^1, c_2^2\}\) for \(I_2\). Then, the proposed two-stage warping strategy is used to align these feature maps and the reference frames with time step 1. In particular, \(\mathcal {M}\) is adopted to warp \(\{I_0, I_2\}\) and the feature maps at the first pyramid layer, \(\{c_0^0, c_2^0\}\). Next, instead of predicting multi-scale kernels, the flow information \(\{f_{10}, f_{12}\}\), called sumflow, is calculated by the weighted summation of \(\{F_{10}, F_{12}\}\) along the channel axis, respectively. Later, the lower-resolution feature maps \(\{c_0^1, c_2^1, c_0^2, c_2^2\}\) are backward warped to the intermediate ones \(\{c_{01}^1, c_{21}^1, c_{01}^2, c_{21}^2\}\) under the guidance of \(\{f_{10}, f_{12}\}\), which are internally scaled to the corresponding sizes. Finally, the interpolated frame \(I_1\) is synthesized by \(\mathcal {G}\). More details are given below.

Fig. 4.

Our designed MENet. Notably, each sub-net (gray box) consists of three \(3\times 3\) convolution layers and one bilinear up-sampling layer. Note that the weights are not shared across sub-nets; therefore, the temporal MVs, filter kernels and the soft mask are predicted independently.

3.3 Motion Estimation Network

As illustrated in Fig. 4, MENet (\(\mathcal {U}\)) is designed based on the U-Net [38] architecture, followed by nine parallel sub-nets, which are used to predict five elements: the bi-directional temporal MVs (\({F}_{10}=[{u}_{10},{v}_{10}]\), \({F}_{12}=[{u}_{12},{v}_{12}]\)), the filter kernels (\({K}_{10}=[{k^u_{10}},{k^v_{10}}]\), \({K}_{12}=[{k^u_{12}},{k^v_{12}}]\)) and the soft mask M, where u and v denote the horizontal and vertical directions, respectively. Each prediction process \(\mathcal {X}\) can be formulated as

$$\begin{aligned} \mathcal {X} = \mathcal {U}(Cat[I_0,I_2]), \end{aligned}$$
(3)

where \(Cat[\cdot ]\) is the concatenation operation along the channel axis.
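A minimal sketch of one such prediction head is given below; the intermediate channel width, the ReLU activations and the up-sampling factor are assumptions for illustration rather than the exact configuration in Fig. 4.

```python
import torch.nn as nn

class PredictionHead(nn.Module):
    """One of the nine parallel sub-nets in Fig. 4: three 3x3 convolution
    layers followed by one bilinear up-sampling layer."""
    def __init__(self, in_ch: int, out_ch: int, mid_ch: int = 64):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Conv2d(in_ch, mid_ch, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(mid_ch, mid_ch, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(mid_ch, out_ch, 3, padding=1),
            nn.Upsample(scale_factor=2, mode='bilinear', align_corners=False),
        )

    def forward(self, x):
        return self.layers(x)

# With L temporal MVs and kernel size N, the nine heads could output, e.g.,
#   u_10, v_10, u_12, v_12          -> L channels each (temporal MVs)
#   k^u_10, k^v_10, k^u_12, k^v_12  -> L*N channels each (separable kernels)
#   M                               -> 1 channel (soft mask, e.g., via sigmoid)
```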

3.4 Motion-Aware Convolution

Let \(\bar{\bar{I}}\) be the target frame and I be the reference frame. Most kernel-based VFI methods (e.g., [31, 32]) assume that \(\mathcal {T}{\left( \cdot \right) }\) is implemented by a spatially-varying convolution, which is described as

$$\begin{aligned} \bar{\bar{I}}(x,y)=\sum {K(x,y)\odot P_I(x,y)}, \end{aligned}$$
(4)

where \(P_I(x,y)\) is the patch centered at (x, y) in I, and \(\sum \) denotes the summation over all elements of the Hadamard product.

Besides, K(x, y) is a 2D filter kernel obtained by the outer product of two 1D vectors, which is computed as

$$\begin{aligned} K(x,y)=k^v(k^u)^T \end{aligned}$$
(5)
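The separable formulation of Eqs. (4) and (5) can be sketched for a single pixel as follows (NumPy, single-channel frame, no boundary padding; purely illustrative):

```python
import numpy as np

def sepconv_pixel(ref: np.ndarray, ku: np.ndarray, kv: np.ndarray,
                  x: int, y: int) -> float:
    """Spatially-varying separable convolution at one location (Eqs. 4-5)."""
    r = len(ku) // 2
    K = np.outer(kv, ku)                           # Eq. (5): 2D kernel from two 1D vectors
    patch = ref[y - r:y + r + 1, x - r:x + r + 1]  # patch P_I(x, y) centred at (x, y)
    return float(np.sum(K * patch))                # Eq. (4): sum of the Hadamard product
```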
Fig. 5.

Visualization of the effect of the traditional dilation mechanism (TDM) and our MAEM on the search range. The first row displays the toy examples and the qualitative results given by the two methods, the second row depicts the co-located patches of the two reference frames, where the pink cross denotes the co-located position of the target pixel, and the third row shows the endpoint of each temporal MV from its start point. (a) The result of using TDM. (b) The result of replacing TDM with our MAEM. It can be seen that our MAEM extends the search range effectively. (Color figure online)

However, for these methods, the direction and scope of the reference regions are limited, and thus they fail to handle large motion beyond the kernel size. We therefore predict multiple extensible temporal MVs for each location, and Eq. (4) is accordingly modified as

$$\begin{aligned} \bar{\bar{I}}(x,y)=\sum _{l=0}^{L-1}\sum K^l(x,y) \odot P_I(x+u^l(x,y)+d_{u}(l),y+v^l(x,y)+d_{v}(l)), \end{aligned}$$
(6)

where \(u^{l}(x,y)\) and \(v^{l}(x,y)\) are the horizontal and vertical components of the \(l^{th}\) temporal MV, respectively, and L denotes the number of temporal MVs. \(d_{u}(l)\) and \(d_{v}(l)\) are offset biases, which are adaptively calculated by our proposed motion-aware extension mechanism (MAEM) [see Fig. 5(b), top]. \(d_{u}(l)\) is formulated as

$$\begin{aligned} d_{u}(l)= l \cdot sign(u^{l}(x,y)),\quad \vert u^{l}(x,y)\vert \ge \gamma , \end{aligned}$$
(7)

where \(sign(\cdot )\) is the sign function, and \(\gamma \) is a preset threshold that we empirically set to \(\gamma =1\). Note that \(d_v(l)\) is calculated in the same way. The motivation is that the traditional dilation mechanism [see Fig. 5(a), top] fails to handle irregular motions due to its fixed dilation coefficient. To address this issue, our MAEM adaptively extends each temporal MV by modifying its start position on the basis of the initial prediction. As shown in Fig. 5, our method captures more accurate MVs and delivers a higher-quality frame.
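The following NumPy sketch illustrates Eqs. (6) and (7) for a single pixel. Treating the extension as zero when the MV magnitude is below \(\gamma \), and rounding displacements to integers instead of using bilinear sampling for fractional MVs, are simplifying assumptions made only for readability.

```python
import numpy as np

def maem_offset(comp: float, l: int, gamma: float = 1.0) -> float:
    """Motion-aware extension of Eq. (7): extend the l-th MV component only
    when its magnitude reaches the threshold gamma (assumed 0 otherwise)."""
    return l * np.sign(comp) if abs(comp) >= gamma else 0.0

def mac_pixel(ref: np.ndarray, mvs: np.ndarray, kernels: np.ndarray,
              x: int, y: int) -> float:
    """Naive per-pixel motion-aware convolution (Eq. 6).

    ref:     reference frame, shape (H, W)
    mvs:     L temporal MVs for this pixel, shape (L, 2) holding (u, v)
    kernels: L separable kernel pairs, shape (L, 2, N) holding (k^u, k^v)
    Patches are assumed to stay inside the frame; padding is omitted.
    """
    out = 0.0
    for l in range(len(mvs)):
        u, v = mvs[l]
        ku, kv = kernels[l]
        dx = int(round(u + maem_offset(u, l)))   # u^l(x, y) + d_u(l)
        dy = int(round(v + maem_offset(v, l)))   # v^l(x, y) + d_v(l)
        K = np.outer(kv, ku)                     # per-MV 2D kernel, cf. Eq. (5)
        r = len(ku) // 2
        patch = ref[y + dy - r:y + dy + r + 1, x + dx - r:x + dx + r + 1]
        out += float(np.sum(K * patch))
    return out
```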

Fig. 6.

The architecture of the weight-sharing context feature extractor.

3.5 Multi-scale Feature Aggregation

Context-Pyramid Feature Extractor. Considering that the features required for VFI differ from those used in classification tasks, and motivated by [29], we train a feature extractor (\(\mathcal {C}\)) from scratch rather than using an existing pre-trained model (e.g., VGG [42] or ResNet [14]). As depicted in Fig. 6, the features are obtained by

$$\begin{aligned} c_t^s = \mathcal {C}(I_t)^{s}, \end{aligned}$$
(8)

where s is the pyramid scale, and t is the time step of the reference frame.
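A possible realization of such a three-level extractor is sketched below; the channel widths, the PReLU activations and the two-convolutions-per-level layout are assumptions, since Fig. 6 only fixes the overall pyramid structure.

```python
import torch.nn as nn

class ContextPyramid(nn.Module):
    """Weight-sharing context-pyramid feature extractor in the spirit of Fig. 6."""
    def __init__(self, chs=(32, 64, 96)):
        super().__init__()
        def level(cin, cout, stride):
            return nn.Sequential(
                nn.Conv2d(cin, cout, 3, stride=stride, padding=1), nn.PReLU(cout),
                nn.Conv2d(cout, cout, 3, padding=1), nn.PReLU(cout),
            )
        self.level0 = level(3, chs[0], stride=1)       # c_t^0: full resolution
        self.level1 = level(chs[0], chs[1], stride=2)  # c_t^1: 1/2 resolution
        self.level2 = level(chs[1], chs[2], stride=2)  # c_t^2: 1/4 resolution

    def forward(self, frame):
        c0 = self.level0(frame)
        c1 = self.level1(c0)
        c2 = self.level2(c1)
        return c0, c1, c2                              # Eq. (8): c_t^s = C(I_t)^s
```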

As discussed previously, extremely large memory would be required to predict different kernels (e.g., of size \( H\times W\), \(\frac{H}{2} \times \frac{W}{2}\) and \(\frac{H}{4} \times \frac{W}{4}\)) for each feature-map scale, which is impractical for applications. To address this issue, we only apply MAC (\(\mathcal {M}\)) to \(c_0^0\) and \(c_2^0\), which have the same resolution as the target frame. The warped feature map of the first layer is then obtained by

$$\begin{aligned} c_1^0 = M \odot \mathcal {M}_{F_{10};K_{10}}(c_0^0)+(J-M) \odot \mathcal {M}_{F_{12};K_{12}}(c_2^0). \end{aligned}$$
(9)

For \(c_t^1\) and \(c_t^2\), we first calculate the sumflow (\(f_{10}\) and \(f_{12}\)) from the predicted temporal MVs as

$$\begin{aligned} f_{10} = Cat\Big [\sum _{l=0}^{L-1}\big (u_{10}^l(x,y)+d_{u_{10}}(l)\big ),\ \sum _{l=0}^{L-1}\big (v_{10}^l(x,y)+d_{v_{10}}(l)\big )\Big ]. \end{aligned}$$
(10)

\(f_{12}\) is computed in the same way.
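Equation (10) collapses the L extended MVs into a single two-channel flow. Section 3.2 refers to a weighted summation while Eq. (10) is written as a plain sum; the sketch below follows Eq. (10) and should therefore be read as an approximation rather than the exact reduction.

```python
import torch

def sumflow(u: torch.Tensor, v: torch.Tensor,
            du: torch.Tensor, dv: torch.Tensor) -> torch.Tensor:
    """Sumflow of Eq. (10).

    u, v:   (B, L, H, W) horizontal / vertical MV components
    du, dv: (B, L, H, W) motion-aware offset biases d_u(l), d_v(l)
    Returns a two-channel flow of shape (B, 2, H, W).
    """
    fx = (u + du).sum(dim=1, keepdim=True)
    fy = (v + dv).sum(dim=1, keepdim=True)
    return torch.cat([fx, fy], dim=1)   # Cat[.] along the channel axis
```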

Then the backward warping operation [18] is used for the temporal alignment, which is described as

$$\begin{aligned} c_{01}^s&=backwarp\big ((f_{10})^{\downarrow s},c_{0}^{s}\big ),\\ c_{21}^s&=backwarp\big ((f_{12})^{\downarrow s},c_{2}^{s}\big ), \end{aligned}$$
(11)

where \(\downarrow s\) denotes that the sumflow has been downsampled to the same size as the \(s^{th}\)-level feature map.
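A possible PyTorch implementation of this second-stage warping is sketched below. Rescaling the flow magnitude by the same factor as the spatial downsampling is our reading of "internally scaled" and is therefore an assumption.

```python
import torch
import torch.nn.functional as F

def backwarp(feat: torch.Tensor, flow: torch.Tensor) -> torch.Tensor:
    """Backward warping via bilinear sampling, in the spirit of [18]."""
    B, _, H, W = feat.shape
    gy, gx = torch.meshgrid(torch.arange(H), torch.arange(W), indexing='ij')
    base = torch.stack((gx, gy), dim=0).float().to(feat.device)   # (2, H, W)
    coords = base.unsqueeze(0) + flow                             # sampling positions
    # Normalize to [-1, 1] as required by grid_sample.
    nx = 2.0 * coords[:, 0] / max(W - 1, 1) - 1.0
    ny = 2.0 * coords[:, 1] / max(H - 1, 1) - 1.0
    grid = torch.stack((nx, ny), dim=3)                           # (B, H, W, 2)
    return F.grid_sample(feat, grid, mode='bilinear', align_corners=True)

def warp_level(feat_s: torch.Tensor, sumflow: torch.Tensor, s: int) -> torch.Tensor:
    """Eq. (11): downsample the sumflow to level s, rescale its magnitude,
    and backward warp the corresponding feature map."""
    factor = 2 ** s
    flow_s = F.interpolate(sumflow, scale_factor=1.0 / factor,
                           mode='bilinear', align_corners=False) / factor
    return backwarp(feat_s, flow_s)
```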

Fig. 7.

The structure of the frame synthesis network.

Frame Synthesis Network. A modified version of GridNet [28] is used as our frame synthesis network. As depicted in Fig. 7, the inputs of the network are the warped frame \(I_1\) and the feature maps \(\{c_1^0, c_{01}^1, c_{21}^1, c_{01}^2, c_{21}^2\}\), and the output is the final interpolated frame.

4 Experiments

4.1 Datasets and Implementation Details

We select four benchmarks with various video resolutions for comparison, which are Vimeo90K [47] (\(448 \times 256\)), UCF101 [24, 43] (\(256 \times 256\)), Middlebury [1] (\(640 \times 480\)) and SNU-FILM [11] (\(1280 \times 720\)). We next describe our training details for MVFI-Net.

Loss Functions. To measure the difference between the interpolated frame \(I_{1}\) and its ground truth \({I_{gt}}\), we combine the Charbonnier penalty function [6] with the gradient loss [25], which facilitates generating a sharper frame:

$$\begin{aligned} \mathcal {L}_{d} = \lambda _{1}\mathcal {L}_{char} + \lambda _{2}\mathcal {L}_{gdl}, \end{aligned}$$
(12)

where we empirically set \(\lambda _{1}=1\) and \(\lambda _{2}=1\).
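A minimal sketch of this loss is shown below; the Charbonnier epsilon and the exact form of the gradient term (an L1 penalty on horizontal and vertical image-gradient differences) are assumptions based on [6] and [25].

```python
import torch

def charbonnier(pred: torch.Tensor, gt: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Charbonnier penalty [6], a smooth approximation of the L1 loss."""
    return torch.mean(torch.sqrt((pred - gt) ** 2 + eps ** 2))

def gradient_loss(pred: torch.Tensor, gt: torch.Tensor) -> torch.Tensor:
    """Gradient difference loss [25], encouraging sharper edges."""
    def grads(x):
        return x[..., :, 1:] - x[..., :, :-1], x[..., 1:, :] - x[..., :-1, :]
    px, py = grads(pred)
    gx, gy = grads(gt)
    return torch.mean(torch.abs(px - gx)) + torch.mean(torch.abs(py - gy))

def distortion_loss(pred: torch.Tensor, gt: torch.Tensor,
                    lam1: float = 1.0, lam2: float = 1.0) -> torch.Tensor:
    """Eq. (12) with the empirically chosen lambda_1 = lambda_2 = 1."""
    return lam1 * charbonnier(pred, gt) + lam2 * gradient_loss(pred, gt)
```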

Training Strategy. We train MVFI-Net for 150 epochs on the Vimeo90K training triplets and use the AdaMax [20] optimizer with \(\beta _{1} = 0.9\) and \(\beta _{2} =0.999\), where the initial learning rate is set to 0.001. Note that the learning rate is decreased by a factor of 0.5 when the validation loss does not decrease for five epochs.
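The stated schedule maps naturally onto PyTorch's Adamax optimizer and ReduceLROnPlateau scheduler, as in the following sketch; the stand-in model and the empty training loop are placeholders rather than our actual training code.

```python
import torch

model = torch.nn.Conv2d(3, 3, 3, padding=1)   # placeholder for MVFI-Net
optimizer = torch.optim.Adamax(model.parameters(), lr=1e-3, betas=(0.9, 0.999))
# Halve the learning rate when the validation loss has not decreased for 5 epochs.
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode='min', factor=0.5, patience=5)

for epoch in range(150):
    # ... one training epoch over the Vimeo90K triplets goes here ...
    val_loss = 0.0   # placeholder: evaluate the validation loss here
    scheduler.step(val_loss)
```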

Table 1. Quantitative comparisons on three benchmarks. We also calculate the inference time and MACs on Middlebury [1] ‘Urban’ set. All methods are tested on one NVIDIA 2080Ti GPU. For a fair comparison, each method is only trained on Vimeo90K triplets [47] by taking only two frames as reference.

4.2 Comparison with the State-of-the-Arts

To demonstrate the effectiveness of our proposed algorithm, we compare our method with other competitive works, including SepConv [31], DSepConv [7], DAIN [2], CAIN [11], AdaCoF [21], BMBC [33], SepConv++ [32], CDFI [12], EDSC [8], XVFI [41] and GDConvNet [40]. For evaluation metrics, we measure the performance of the VFI methods in terms of the Peak Signal-to-Noise Ratio (PSNR) and the Structural Similarity (SSIM). Additionally, we also report the widely used Interpolation Error (IE) on the Middlebury-Other set. Note that the code of all compared methods is publicly available.

Table 2. Quantitative results of the current competitive kernel-based methods on four settings of SNU-FILM [11].

Quantitative Results. We provide two versions of MVFI-Net with different model sizes for comparison. Specifically, MVFI-Net\(_S\) predicts five temporal MVs with corresponding \(11\times 11\) kernels, while MVFI-Net\(_L\) predicts eleven temporal MVs with the same kernel size. As shown in Table 1, our method is far superior to the others on diverse benchmarks in terms of PSNR and SSIM. Compared to the state-of-the-art kernel-based method SepConv++ [32], our lightweight model MVFI-Net\(_S\) improves the PSNR by 0.88 dB on Vimeo90K with 1.5\(\times \) faster inference speed. Although the fastest VFI approach, AdaCoF [21], is 2.3\(\times \) faster than ours, MVFI-Net\(_S\) gives a 1.36 dB improvement on Vimeo90K. Note that the gains are also distinct on the other two benchmarks while the MACs are similar. Compared to optical flow-based approaches, we provide comprehensive improvements in both objective quality and inference speed. This result supports that kernel-based approaches can yield impressive results without prior flow estimation.

To further prove that MVFI-Net is more robust to complex motions, we compare it with current competitive kernel-based algorithms on SNU-FILM [11], which is divided into four settings according to motion magnitude. From Table 2, it can be clearly seen that our proposed MVFI-Net\(_L\) delivers better performance on all settings. Generally, it is difficult to retain the structure and shape of objects in the interpolated frame when large motion exists, which introduces artifacts and blurriness with lower SSIM. Nevertheless, MVFI-Net\(_L\) partly alleviates this defect and supplies impressive results.

Fig. 8.

Qualitative comparisons with DAIN [2], SepConv++ [32], XVFI [41] and GDConvNet [40] on the Vimeo90K [47] test set and the SNU-FILM [11] extreme setting, including cases with large motion and occlusion.

Qualitative Results. To assess the subjective quality, we also visually compare MVFI-Net with high-performance VFI approaches. As illustrated in Fig. 8, there are no overshoot artifacts in the frames generated by our method, while the others fail in this respect. For example, in the third and fourth rows, our method clearly keeps the texture details of the airplane head and the wing of the bird, whereas the other methods introduce serious artifacts and blurry backgrounds.

4.3 Ablation Study

In this section, we first conduct ablation studies to demonstrate the effect of our proposed motion-aware extension mechanism (MAEM) and the improvements brought by introducing the pyramid structure. Then, we design a series of experiments to explore the effect of different numbers of temporal MVs and different kernel sizes. Finally, we transfer our method to AdaCoF [21], in which multiple flows are predicted for each target pixel, to analyze whether its performance can be improved by our algorithm.

Table 3. Ablation studies of the motion-aware extension mechanism and pyramid architecture.

MAEM and Pyramid Structure. For a fair comparison, each group is retrained under the same conditions without any prior information. We first remove the pyramid structure, so that the final interpolated frame is synthesized by Eq. (2) with \(\mathcal {T}{\left( \cdot \right) }\) being MAC. Next, we further remove MAEM, so that the temporal MVs are no longer extended according to the initial prediction. From Table 3, it can be observed that our proposed MAEM, which is non-parametric and adds no extra inference time, provides stable improvements on each benchmark in terms of the two evaluation metrics. Besides, the results demonstrate that the pyramid structure is vital for VFI, since large motions can be decomposed into smaller scales that are more easily predicted and captured by the network. There is nearly a 1 dB gain on Vimeo90K and Middlebury, together with higher structural similarity. We also show a visual result for intuitive comparison in Fig. 9.

Fig. 9.

Qualitative comparisons with Baseline, Baseline+MAC and Baseline+MAC+FPN. It can be seen that the frame quality is gradually enhanced by our design.

Table 4. Ablation studies of the number of temporal motion vectors.

Number of Temporal MVs. As discussed above, more reference pixels are required for complex motions. Therefore, the number of predicted temporal MVs directly influences the performance. To verify this, we fix the kernel size to \(N=11\) and let the number of temporal MVs be \(L\in \{1,5,11\}\). As displayed in Table 4, increasing the number of temporal MVs helps the network explore more related regions, and thus a middle frame with higher quality is synthesized. It should be noted that for Middlebury [1], the model with five temporal MVs is slightly better than that with eleven. This can be explained by the fact that the videos in Middlebury usually contain small motion, which means the motion difference between consecutive frames is relatively small. Therefore, redundant temporal MVs may introduce dispensable appearance information, leading to undesirable artifacts.

Kernel Size. It has been demonstrated that the quality of the interpolated frame is closely related to the size of the adaptive kernel [31]. To explore the effect of the kernel size, we train several models using kernels of different sizes. Similar to the above experiments, we fix the number of temporal MVs to \(L=5\) and vary the kernel size \(N \in \{1,5,11\}\), where \(N=11\) means an \(11\times 11\) kernel is predicted for each target pixel. From Table 5, it can be observed that a larger kernel facilitates generating a better interpolation result. However, when the kernel becomes larger (i.e., \(N=11\)), there is no significant further improvement. This is because our proposed MAEM already facilitates capturing motions accurately, so larger kernels have little effect and may introduce repetitive local information.

Table 5. Ablation studies of the kernel size.
Table 6. Ablation study on transferring our method to AdaCoF [21].

Transferability. In AdaCoF [21], multiple flows are predicted for each target pixel, which is similar to our algorithm. Moreover, it introduces a fixed dilation coefficient to expand the initial point of each flow for a wider search range. As demonstrated above, our MAEM facilitates more accurate motion estimation, and the pyramid structure is vital for VFI. Therefore, we transfer MAEM and the two-stage warping strategy into AdaCoF to analyze their effectiveness. Table 6 shows that our proposed method provides significant and stable gains for AdaCoF on each benchmark.

5 Conclusion

In this paper, we propose a novel VFI network, namely MVFI-Net. There are two novelties: (1) we design an efficient warping technique, MAC, where multiple extensible temporal MVs and corresponding filter kernels are predicted for each target pixel, which enlarges the direction and scope of the reference regions simultaneously; (2) we integrate, for the first time, the pyramid structure into a kernel-based VFI approach, which decomposes complex motions to a smaller scale and improves the efficiency of searching for temporal MVs. Extensive experiments conducted on various datasets demonstrate that our proposed MVFI-Net consistently delivers state-of-the-art results in terms of both objective quality and human perception.