
1 Introduction

Video inpainting is the task of filling missing regions in video frames with realistic content. Its applications include object removal, watermark/logo/subtitle removal, and even corrupted video restoration. The task is challenging because the hole regions must be filled with synthesized content that blends seamlessly with the surrounding background and remains temporally consistent.

Existing video inpainting methods usually extract useful information from reference frames and merge it to fill the target frame. The key is to use as many frames as possible, since the most relevant pixels may lie anywhere in the other frames. Common video inpainting pipelines can be categorized into two groups: 1) direct synthesis and 2) flow-based propagation.

In general, direct synthesis methods adopt convolution-based [3, 4, 7, 23] and attention-based [13, 22] networks. They usually take corrupted frames and the corresponding pixel-wise masks as input and directly output completed frames. However, due to their computational complexity, these methods have relatively small temporal windows, so only a few reference frames (up to 10) can be used. Serious temporal inconsistencies occur if key frames containing the unique pixel values and textures needed to fill the hole are not selected. Figure 1(a) illustrates failure cases of [13, 22] when an occluded object is not included in the small reference frame set.

Flow-based methods [6, 20] split the synthesis process into three steps: flow completion, pixel propagation, and synthesis. In the flow completion step, the holes in the corrupted optical flow maps are inpainted. In the pixel propagation step, pixel values from reference frames are propagated into the holes under the guidance of the inpainted flows. Finally, in the synthesis step, only the remaining holes are synthesized using an image inpainting method. Flow-based approaches show better long-term temporal consistency than direct synthesis methods because they explicitly propagate pixel values from other frames using flows.

Fig. 1. Qualitative comparison with other video inpainting methods on object removal scenarios. (a) The rear camel is originally visible in all frames, but it disappears in the results of [22, 23] due to the small temporal window. (b) Wrong pixels are propagated, and brightness inconsistency occurs in [6, 20].

Nevertheless, flow-based methods have some drawbacks. (1) Flow completion. In flow inpainting, it is crucial to infer spatio-temporal relationships from the other corrupted flows. However, even adjacent flows can have different values and directions due to inaccurate flow estimation or non-linear motion between frames, resulting in temporally inconsistent flows. Also, without taking the underlying video contents into account, errors from the flow estimator, such as the "From corrupted" case in Fig. 2, cannot be handled. (2) Pixel propagation. Simply copying and pasting propagated pixels causes pixel misalignment and brightness inconsistency. As pixel values and textures can be conveyed from distant frames, perspective mismatches in scale and shape often arise. In addition, brightness inconsistency may arise from changes in the lighting condition or camera exposure over the video sequence. Without proper compensation for these errors, models output visual artifacts. While existing methods attempt to address these issues by checking flow consistency [6, 20] and applying Poisson blending [6], visual artifacts still remain, as shown in Fig. 1(b).

To this end, we propose a simple yet effective Error Compensation Framework for Flow-guided Video Inpainting (ECFVI) that addresses both drawbacks. We follow the three steps of flow-based methods (i.e., flow completion, pixel propagation, synthesis) but significantly improve the first two steps by correcting errors, thereby preventing error accumulation. For flow completion, we make the flow inpainting aware of RGB values: hallucinating missing flow values based on the underlying video contents yields temporally coherent flows. Specifically, we first roughly complete the RGB pixels in the holes, considering the local temporal relationship. Then, the locally coherent RGB inpainting results guide the flow inpainting. Note that this local RGB inpainting is only used to guide the flow; the actual inpainting results are obtained in later steps. For the pixel propagation step, we introduce an additional compensation network to detect and correct the misalignment and brightness inconsistency errors arising from pixel propagation. The network utilizes the error map outside the hole (called the error guidance map) to predict errors inside the hole. These carefully designed components prevent errors from accumulating in the following stages, which significantly improves the visual quality and temporal consistency of completed videos.

For quantitative evaluation, many video inpainting methods construct their own datasets by randomly masking regions in a video. They usually set the unmasked video as the ground truth and evaluate their models against it. However, when a randomly generated mask covers an entire object, as in Fig. 6(a), a model will output an object-removed result that differs from the unmasked video. In this case, the evaluation scores do not correlate with human perception. Therefore, we propose a new benchmark dataset constructed with an object segmentation algorithm [5] so that each mask only partially covers an object. On our dataset, the desired result is identical to the unmasked video, which enables a fair comparative analysis of video inpainting methods.

In summary, the contributions of the paper are as follows:

  • We propose a simple yet effective way to compensate for the limitations of flow-based video inpainting. Specifically, we improve flow completion with RGB awareness and propose to compensate for errors in pixel propagation.

  • Our method outperforms previous state-of-the-art video inpainting methods in terms of PSNR/SSIM and visual quality. Compared with the state-of-the-art flow-based method FGVC [6], we produce better results while reducing the computation time (\(\times \)6 faster).

  • We provide a new benchmark dataset for quantitative evaluation of video inpainting, which can be very useful for evaluating future work on this topic by providing a common ground.

2 Related Work

2.1 Direct Synthesis Methods

With the success of deep learning, many deep video inpainting methods have emerged. [3, 10, 17] proposed 3D encoder-decoder networks to improve efficiency and temporal consistency. [13, 22] exploited attention-based methods, which use transformer modules to match similar patches within and across frames. Zou et al. [23] used optical flow to progressively merge target frames with reference frames, enriching feature representations of temporal relationships. Due to memory constraints, the above methods can refer to only a few frames. On the other hand, Oh et al. [14] exploited non-local pixel matching with a memory network to obtain a global temporal window. Although they refer to all frames, information is corrupted because the frames are continuously referenced in an implicit way. While Lee et al. [12] designed a network that takes information from long-distance frames with global affine matrices, it cannot handle non-rigid complex motions.

Fig. 2. Corrupted flows with different settings. In [6, 20], they take original frames as input to the flow estimator and remove the values in the hole. However, if the original frames are unavailable, errors can occur when the corrupted frames are used as input: the flow values near the hole differ from the ground-truth flow.

2.2 Flow-Based Methods

Huang et al. [8] adopted spatial patch matching together with flow guidance but suffered from mismatching issues caused by corrupted flows. Instead, [6, 20] first inpainted the corrupted flows and then propagated pixels along the flow trajectories. Despite their success, they still have limitations. First, to estimate the corrupted flows, they feed the original frames to the flow estimator and remove the values under the masks. This procedure works well in object removal scenarios, where the original frames exist. The problem arises when the original frames are not available, as in video restoration scenarios. If the flow estimator takes the corrupted frames as input, the masks can interfere with estimating the motion of objects in the video (Fig. 2). The errors in the corrupted flows then affect the flow completion stage, resulting in incorrect flow values. Second, as mentioned in the previous section, it is difficult to exploit spatio-temporal information between the corrupted flows for flow completion.

Wrong estimation from the flow completion would lead to the pixel misalignment issue. Both methods checked the flow consistency and only propagated trustworthy pixels to deal with the issue, but the consistency is not well preserved as the completed flow itself is incorrect. Our framework focuses on offsetting these limitations of flow-based methods.

3 Method

Let \(X=[x_1, x_2, ..., x_T]\) be a set of corrupted video frames of length T. \(M=[m_1, m_2, ..., m_T]\) denotes the corresponding frame-wise masks, whose spatial resolution is the same as that of X. For each mask, a value of 0 indicates valid regions and 1 indicates hole (missing) regions. Video inpainting aims to predict the original or object-removed video \(\hat{Y}=[\hat{y}_1, \hat{y}_2, ..., \hat{y}_T]\) given X and M as inputs. We call the frame to be inpainted the target frame and the other frames reference frames.

3.1 Overview

We illustrate our video inpainting framework in Fig. 3. The overall procedure is as follows. (1) Flow Completion with RGB Guidance: Given the corrupted frames, we first generate locally coherent frames. Then, we estimate completed flows between adjacent frames using our flow estimator (Sect. 3.2). Before propagation, we estimate bi-directional completed flows between all adjacent frame pairs. (2) Pixel Propagation with Error Compensation: Under the guidance of the completed flows, we propagate valid pixels from reference frames to the target holes. In this process, our error compensation network prevents errors from propagating to the following stages (Sect. 3.3). In order of proximity to the target frame, we iterate the propagation and compensation procedures until the hole regions are entirely filled or all frames have been referenced. (3) Synthesis: If holes remain, we synthesize them using an existing video inpainting method. This case usually occurs due to occlusion, where some pixels cannot be found in any other frame. For the synthesis, we use FuseFormer [13] with the weights released by its authors.
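To make the overall control flow concrete, the following is a minimal sketch of the three stages in PyTorch-style Python. All callables (ltn, flow_net, propagate_and_compensate, synthesize) are placeholders for the modules described in Sects. 3.2 and 3.3; the sketch illustrates the procedure and is not our released implementation.

```python
def reference_order(i: int, T: int):
    """Reference-frame indices sorted by temporal distance from target frame i."""
    return sorted((j for j in range(T) if j != i), key=lambda j: abs(j - i))


def inpaint_video(x, m, ltn, flow_net, propagate_and_compensate, synthesize):
    # x: (T, 3, H, W) corrupted frames; m: (T, 1, H, W) masks (1 = hole)
    T = x.shape[0]

    # (1) Flow completion with RGB guidance (Sect. 3.2)
    x_bar = ltn(x, m)                                   # locally coherent frames
    flow = {}
    for t in range(T - 1):                              # bi-directional adjacent flows
        flow[(t, t + 1)] = flow_net(x_bar[t], x_bar[t + 1], m[t], m[t + 1])
        flow[(t + 1, t)] = flow_net(x_bar[t + 1], x_bar[t], m[t + 1], m[t])

    # (2) Pixel propagation with error compensation (Sect. 3.3)
    y_hat, m_rem = x.clone(), m.clone()
    for i in range(T):
        for j in reference_order(i, T):                 # nearest references first
            if m_rem[i].sum() == 0:                     # hole fully filled
                break
            y_hat[i], m_rem[i] = propagate_and_compensate(y_hat, m_rem, flow, i, j)

    # (3) Synthesize pixels that remain occluded in every reference frame
    return synthesize(y_hat, m_rem)
```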

3.2 Flow Completion with RGB Guidance

Previous methods [6, 20] first estimate corrupted flows with a flow estimator [9, 16] and then complete the flow values. However, without considering the original video contents, they cannot handle the errors in the corrupted flows. Therefore, we design our flow completion module to be aware of RGB values and to output completed flows directly from the flow estimator.

Fig. 3. Overview of our error compensation framework for flow-guided video inpainting (ECFVI). At first, we estimate all completed flows between adjacent frames using our flow completion module. With the guidance of the completed flows, we iteratively propagate and compensate pixels from reference frames to fill target holes. One iteration of the process is shown in Fig. 4. The remaining holes are further synthesized with the existing video inpainting method.

A simple approach would be to have the flow estimator take the corrupted frames and complete the flow values directly. However, since the flow estimator [9, 16] iteratively takes two frames at a time, the completed flows would be created without considering the spatio-temporal information of the other flows. It is also more challenging to run flow estimation and completion simultaneously. Therefore, we roughly complete the RGB pixels using local neighboring frames (more than two frames) and pass the results to our flow estimator, which allows temporal information to be further exploited when estimating completed flows. Here, our flow estimator takes locally coherent frames \(\bar{x}\) produced by a local temporal network LTN. Note that the intermediate frames \(\bar{x}\) are only used for guiding the flow completion.

We design the local temporal network LTN to consist of an encoder, multiple spatio-temporal transformer layers [13, 22], and a decoder. We first extract features of each input with the encoder and exploit the transformer layers to merge information from the references in the deep encoding space. The decoder takes the output of the transformer layers and reconstructs the locally coherent frames \(\bar{x}\) as follows:

$$\begin{aligned} \begin{aligned} \bar{x}_{i} = LTN(x_{i-N:i+N},m_{i-N:i+N}), \end{aligned} \end{aligned}$$
(1)

where N is the temporal radius and \(N=5\) is used in our experiment. Then, we estimate a completed flow between adjacent frames as follows:

$$\begin{aligned} \begin{aligned} \tilde{f}_{t\rightarrow {t+1}} = F(\bar{x}_{t}, \bar{x}_{t+1}, m_{t}, m_{t+1}), \end{aligned} \end{aligned}$$
(2)

where F is our flow estimator. The backward flow \(\tilde{f}_{t\rightarrow {t-1}}\) is estimated in the same manner. To distinguish regions of original content from regions of coarsely completed content, we use the masks m as additional input. The flow estimator is initialized with pretrained RAFT [16] weights, except for the first layer, which is modified to take the masks as additional input. We jointly train our local temporal network and flow estimator as follows:

$$\begin{aligned} \begin{aligned} \mathcal {L} = \mathcal {L}_{rec} + \lambda _{flow}\mathcal {L}_{flow}, \end{aligned} \end{aligned}$$
(3)

where \(\mathcal {L}_{rec}\) is the L1 reconstruction loss defined as \(||\bar{x}-{y}||_{1}\). \(\lambda _{flow}\) is a coefficient balancing the loss terms, and we set it to 2 in our experiment. For the flow loss \(\mathcal {L}_{flow}\), inspired by [15], we employ a hard example mining mechanism to weigh difficult areas more heavily:

$$\begin{aligned} \begin{aligned} \mathcal {L}_{flow} = || {h}_{t} \odot {} (\tilde{f}_{t \rightarrow {t+1}} - f_{t \rightarrow {t+1}})||_{1}, \end{aligned} \end{aligned}$$
(4)

where \(\odot {}\) is element-wise multiplication and the ground-truth flow \(f_{t \rightarrow {t+1}}\) is calculated from the original frames Y. We set the hard mining weight \({h}_{t}\) to \( (1 +|\bar{x}_{t} - y_{t}|)^2\), which encourages the model to implicitly deal with errors from the local temporal network. More details of our network are given in the supplementary material.
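For clarity, the joint training objective of Eqs. (3)-(4) can be written as the following sketch; the L1 norms are averaged for readability, and averaging the hard-mining weight over the RGB channels (so that it broadcasts over the two flow channels) is our assumption.

```python
def flow_completion_loss(x_bar, y, flow_pred, flow_gt, lambda_flow=2.0):
    # x_bar, y: (B, 3, H, W) locally coherent / ground-truth frames
    # flow_pred, flow_gt: (B, 2, H, W) completed / ground-truth flows
    l_rec = (x_bar - y).abs().mean()                       # Eq. (3), L1 reconstruction

    # Hard-mining weight h = (1 + |x_bar - y|)^2, averaged over RGB (assumption)
    h = (1.0 + (x_bar - y).abs().mean(dim=1, keepdim=True)) ** 2
    l_flow = (h * (flow_pred - flow_gt).abs()).mean()      # Eq. (4)

    return l_rec + lambda_flow * l_flow
```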

This design choice has two advantages. First, it alleviates the problems of corrupted flows mentioned in Sect. 2.2. Second, with the strong initialization from pretrained weights and no need to inpaint flow values from scratch, it is much easier to train. Although errors from the two networks may still remain, they are further compensated in the next stage.
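One plausible way to realize the mask-aware first layer while keeping the pretrained initialization is sketched below: the first convolution is widened by one input channel, the pretrained RGB weights are copied, and the new mask channel is zero-initialized. The attribute path in the usage comment is hypothetical; this is an assumption about the adaptation, not the released code.

```python
import torch
import torch.nn as nn

def expand_first_conv(conv: nn.Conv2d, extra_in: int = 1) -> nn.Conv2d:
    """Copy of `conv` with `extra_in` extra input channels (e.g., the hole mask)."""
    new_conv = nn.Conv2d(conv.in_channels + extra_in, conv.out_channels,
                         kernel_size=conv.kernel_size, stride=conv.stride,
                         padding=conv.padding, bias=conv.bias is not None)
    with torch.no_grad():
        new_conv.weight.zero_()                               # mask channel starts at zero
        new_conv.weight[:, :conv.in_channels] = conv.weight   # keep pretrained RGB weights
        if conv.bias is not None:
            new_conv.bias.copy_(conv.bias)
    return new_conv

# Usage (the attribute path `raft.fnet.conv1` is hypothetical):
# raft.fnet.conv1 = expand_first_conv(raft.fnet.conv1)
```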

3.3 Pixel Propagation with Error Compensation

The next step, pixel propagation with error compensation, is illustrated in Fig. 4. With the guidance of the completed flows, we iterate propagation and compensation steps so that the errors do not accumulate or amplify in the next iteration.

Propagation. Let \(x_i\) and \(x_j\) be the target and the reference frame respectively. With the completed flow \(\tilde{f}_{i\rightarrow {j}}\), only the valid pixels in the reference frame \(x_j\) are propagated into the target holes using the backward warping function w:

$$\begin{aligned} \begin{aligned} {m}^{p}_{i} = m_i \odot {}(1-w(m_j,\tilde{f}_{i\rightarrow {j}})),&\\ \tilde{x}_i = x_i + {m}^{p}_{i}\odot {}w(x_j, \tilde{f}_{i\rightarrow {j}}), \end{aligned} \end{aligned}$$
(5)

where \(\tilde{x}_i\) is the filled target frame. To find the regions of the hole that have matched, valid counterparts in the reference frame, we warp the reference mask \(m_j\) and obtain the valid regions \(1-w(m_j,\tilde{f}_{i\rightarrow {j}})\) of the warped reference frame. \(m^p_i\) denotes the propagation mask, i.e., where propagated pixels exist, shown as the green region in Fig. 4. This mask is used at the compensation stage to indicate where errors should be compensated. After the propagation, there may be a remaining mask \(m^{r}_{i} = m_i - {m}^{p}_{i}\), which is used for the next propagation or as input to the synthesis network.
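For illustration, one propagation step of Eq. (5) can be sketched with a standard grid_sample-based backward warp; thresholding of the bilinearly warped mask and out-of-frame handling are simplified, and variable names are ours.

```python
import torch
import torch.nn.functional as F

def backward_warp(img, flow):
    """Warp `img` (B, C, H, W) by `flow` (B, 2, H, W); flow[:, 0] = dx, flow[:, 1] = dy."""
    B, _, H, W = img.shape
    ys, xs = torch.meshgrid(torch.arange(H, device=img.device),
                            torch.arange(W, device=img.device), indexing='ij')
    grid_x = (xs[None] + flow[:, 0]) / (W - 1) * 2 - 1     # normalize to [-1, 1]
    grid_y = (ys[None] + flow[:, 1]) / (H - 1) * 2 - 1
    grid = torch.stack([grid_x, grid_y], dim=-1)            # (B, H, W, 2)
    return F.grid_sample(img, grid, align_corners=True)

def propagate(x_i, m_i, x_j, m_j, flow_ij):
    # Hole pixels of the target whose correspondence in x_j is valid (Eq. 5)
    m_p = m_i * (1 - backward_warp(m_j, flow_ij))           # propagation mask
    x_filled = x_i + m_p * backward_warp(x_j, flow_ij)      # filled target frame
    m_r = m_i - m_p                                          # remaining hole
    return x_filled, m_p, m_r
```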

Fig. 4. For simplicity, we show the propagation and compensation stages from one reference frame \(x_j\) to the target frame \(x_i\). We dilate the original hole region to get the overfilled frame \(\tilde{x}^{d}_i\). Pixels corresponding to the blue region are further propagated from the aligned reference frame. The filled frame \(\tilde{x}_i\) and overfilled frame \(\tilde{x}^{d}_i\) are used to estimate the error guidance map \(e_i\). Note that the green and blue regions correspond to the propagation and error masks respectively.

Compensation. As shown in Fig. 5, directly propagating pixels from the reference frame may cause misalignment or brightness inconsistency. One possible remedy is to take the filled frame \(\tilde{x}_i\) and the corresponding mask \(m^p_i\) as input and process them with a GAN framework. However, such a network is difficult to train because it cannot easily detect which type of problem has occurred.

We approach this problem by introducing the error guidance map \(e_i\) as an additional input to our network. To know what went wrong during propagation, we need valid (ground-truth) values. Since only the regions outside the holes of the corrupted frame have valid values, we propagate more pixels by dilating the original hole regions, as in Fig. 4. We then calculate the errors on the enlarged parts, which form the error guidance map \(e_i\). This process assumes that the propagated pixels in the enlarged regions and in the original hole regions have similar error tendencies. We set the dilation factor to 17 pixels in our experiment.

We denote the dilated mask by \(m^d_i\) and the overfilled frame by \(\tilde{x}^d_i\). The error guidance map \(e_i\) and the corresponding error regions (mask) \(m^e_i\) are computed as follows:

$$\begin{aligned} \begin{gathered} m^e_i = (m^d_i -m_i) \odot {} (1- w(m_j,\tilde{f}_{i\rightarrow {j}})), \\ e_i = m^e_i\odot {}(\tilde{x}^d_i - \tilde{x}_i). \end{gathered} \end{aligned}$$
(6)

As shown in Fig. 5, the error guidance map intuitively reveals the problems of pixel propagation. If there is no issue, it is black (zero). If there is a pixel misalignment issue, edges become visible in the error map. This additional information makes our network more aware of the problems introduced at the propagation stage.
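A sketch of the error guidance map of Eq. (6) is given below. Approximating the dilation with max pooling and interpreting the 17-pixel factor as a dilation radius are our assumptions; `warp` is the backward-warping helper from the propagation sketch above.

```python
import torch.nn.functional as F

def dilate(mask, radius=17):
    """Binary-mask dilation approximated by max pooling (radius in pixels)."""
    return F.max_pool2d(mask, kernel_size=2 * radius + 1, stride=1, padding=radius)

def error_guidance(x_filled, x_overfilled, m_i, m_d_i, m_j, flow_ij, warp):
    # Ring around the original hole (dilated minus original) with a valid match
    m_e = (m_d_i - m_i) * (1 - warp(m_j, flow_ij))
    # On the ring, x_filled keeps the known valid values while x_overfilled holds
    # the propagated ones, so their difference is the observed error (Eq. 6).
    e = m_e * (x_overfilled - x_filled)
    return e, m_e
```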

Fig. 5. Problems from the propagation stage and corresponding error maps. From top to bottom: no issue, brightness inconsistency, pixel misalignment, and combined issues. The estimated errors and compensated frames are computed using our method.

We design our error compensation network with a structure similar to that of our local temporal network LTN, but with different input and output settings:

$$\begin{aligned} \begin{aligned} \tilde{e}_i = ECN(\tilde{x}^d_{i-N:i+N}, {e}_{i-N:i+N}, m^{[e,p,r]}_{i-N:i+N}), \end{aligned} \end{aligned}$$
(7)

where ECN is our error compensation network and \(m^{[e,p,r]}\) denotes the error, propagation, and remaining masks respectively. For simplicity, we omit the frame index from now on. \(\tilde{x}^d, {e}, m^e\) are concatenated along the channel dimension and fed to our encoder, while the two masks \(m^p, m^r\) are used in the transformer layers. Note that the overfilled frames \(\tilde{x}^d\) are used as input instead of the filled frames \(\tilde{x}\). This makes our network learn the relationship between the overfilled frame \(\tilde{x}^d\) and the error guidance map e on the error regions \(m^e\). Since the propagated regions contain some structural information, our network can accurately estimate error values on the regions \(m^p\) based on this relationship. The final compensated frame is \(\tilde{y} = \tilde{x}^d + \tilde{e}\). The results and corresponding errors are shown in Fig. 5.
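The way the inputs are packed and the final compensation is applied can be sketched as follows; the exact interface of ECN (e.g., returning the error only for the centre frame of the temporal window) is our assumption.

```python
import torch

def compensate(ecn, x_over_win, e_win, m_e_win, m_p_win, m_r_win):
    # x_over_win, e_win: (B, 2N+1, 3, H, W) temporal windows centred on the target
    # m_*_win:           (B, 2N+1, 1, H, W) error / propagation / remaining masks
    enc_input = torch.cat([x_over_win, e_win, m_e_win], dim=2)   # channel-wise concat
    e_pred = ecn(enc_input, m_p_win, m_r_win)    # estimated error for the centre frame
    centre = x_over_win.shape[1] // 2
    return x_over_win[:, centre] + e_pred        # compensated frame y_tilde
```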

The loss to train our error compensation network consists of reconstruction loss \(\mathcal {L}_{rec}\) and adversarial loss \(\mathcal {L}_{adv}\):

$$\begin{aligned} \begin{aligned} \mathcal {L} = \mathcal {L}_{rec} + \lambda _{adv}\mathcal {L}_{adv}, \end{aligned} \end{aligned}$$
(8)

where \(\lambda _{adv}\) is a coefficient balancing the loss terms, set to 0.01 in our experiment. \(\mathcal {L}_{rec}\) is an L1 loss between the ground-truth frame \({y}_i\) and the compensated output \(\tilde{y}_i\). For \(\mathcal {L}_{adv}\), we adopt the Temporal PatchGAN (T-PatchGAN) discriminator [3]. Further details on our network are included in the supplementary material.
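For completeness, Eq. (8) can be sketched as below. The discriminator D stands in for the T-PatchGAN of [3]; the hinge formulation is a common choice for it and should be read as our assumption rather than the exact released objective.

```python
import torch

def ecn_generator_loss(y_pred, y_gt, D, lambda_adv=0.01):
    l_rec = (y_pred - y_gt).abs().mean()          # L1 reconstruction
    l_adv = -D(y_pred).mean()                     # generator hinge term
    return l_rec + lambda_adv * l_adv

def tpatchgan_discriminator_loss(D, y_pred, y_gt):
    real = torch.relu(1.0 - D(y_gt)).mean()
    fake = torch.relu(1.0 + D(y_pred.detach())).mean()
    return real + fake
```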

4 Experiments

4.1 Training Details

Our networks are trained on the Youtube-VOS [19] dataset, which consists of 4,453 videos split into 3,471/474/508 for training, validation, and testing respectively. We randomly crop \(256\times {}256\) frames and use free-form masks from [22] as input. To train our compensation network, we freeze the weights of the flow completion module, and to handle the brightness inconsistency issue, we additionally jitter the brightness, hue, and saturation of the reference frames. We use the Adam optimizer [11] with a learning rate starting at 1e−4 and halved every 40,000 iterations. Each module is trained for 120,000 iterations with mini-batch size 4, which takes about two days on two NVIDIA RTX 2080 Ti GPUs.
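A minimal sketch of this optimization schedule is shown below; `model`, `loader`, and `compute_loss` are placeholders, and applying the colour jitter of reference frames inside the data loader is our assumption.

```python
import torch

def train_module(model, loader, compute_loss, total_iters=120_000):
    """Train one module: Adam, lr 1e-4 halved every 40k iterations."""
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
    scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=40_000, gamma=0.5)
    it = 0
    while it < total_iters:
        for frames, masks in loader:                    # mini-batch size 4, 256x256 crops
            loss = compute_loss(model, frames, masks)   # Eq. (3) or Eq. (8)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            scheduler.step()                            # per-iteration schedule
            it += 1
            if it == total_iters:
                break
```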

Fig. 6. Two common scenarios for video inpainting: object removal and video restoration. The gray region in the corrupted frame denotes a mask. In the object removal scenario, the result and the original frame are different. Only the video restoration scenario should be used for quantitative evaluation. (Color figure online)

4.2 Youtube-VI Dataset

Most previous works created their own random masks and measured performance on the Youtube-VOS [19] and DAVIS [2] datasets. However, when a randomly generated mask coincidentally covers an entire object, a model will produce an object-removed result, as in Fig. 6(a). The result is reasonable, but it is far from the original frame. Only the case in Fig. 6(b) should be used for quantitative evaluation. Also, if there is no motion in the mask and its vicinity, the case is unsuitable for video inpainting, since no pixels can be traced from other frames.

TSAM [23] publicly provides frame-wise masks for the FVI [21] dataset, which consists of 100 scenes with 15 frames each. For each scene, three types of masks are provided, some of which are shown in Fig. 7. While the masks cover diverse scenarios, 15 frames per video is still limited compared to real-world videos, and the mask problems mentioned above remain.

To this end, we design the new Youtube Video Inpainting (Youtube-VI) dataset, which will be publicly available. 100 scenes are selected in order of name from the beginning of the Youtube-VOS dataset, and each scene has 50 frames. We generate two types of masks, shown in Fig. 7 and described as follows:

(1) Moving mask. We follow the same rules for generating free-form masks as STTN [22], but we use the video segmentation network STCN [5] to ensure that no object is fully covered. With the help of STCN, we randomly generate one moving mask that partially overlaps with objects, and we adjust the mask again if the scene and mask do not look visually proper.

(2) Stationary mask. This case covers logo/subtitle removal scenarios. We find that free-form masks are unsuitable here since there may be no motion near the mask. Instead, we use 5\(\,\times \,\)4 small square masks as in FGVC [6].

Fig. 7. Four different types of mask. In the FVI [23] dataset, (a) and (b) are used as moving masks and (c) as a stationary mask. In our Youtube-VI dataset, (c) and (d) are used as the moving mask and the stationary mask respectively.

4.3 Comparisons

We quantitatively compare with other methods on the published FVI [23] masks and our new dataset for video restoration scenarios. Following [13, 22, 23], we use the DAVIS [2] dataset for qualitative comparison in object removal scenarios. The DAVIS dataset consists of 150 videos in total, of which 90 are annotated with pixel-wise object masks. In addition, the DAVIS-shadow dataset [8], where the mask covers both an object and its shadow, is used for further visual comparison.

In their published code, DFC-Net [20] and FGVC [6] take only original frames as input to the flow estimator. This is acceptable in object removal scenarios, but it is not a fair setting for quantitative evaluation, where the original frame is the ground truth. Therefore, we also experiment with corrupted frames as input and denote these variants as DFC-Net* and FGVC* respectively.

Table 1. Quantitative evaluation on the FVI and Youtube-VI datasets. Best values are shown in bold and second-best values are underlined. For PSNR and SSIM, higher is better. Missing entries indicate methods that cannot be run in those settings.

Quantitative Results. We evaluate video inpainting quality with PSNR and SSIM. In Table 1, we achieve the best PSNR and SSIM except for the curve mask on the FVI dataset. It is worth noting that the curve mask is a particular case rarely found in real-world scenarios: since our error guidance map is designed for common scenarios, the error values in the dilated regions can overlap for such thin and overlapping masks, which worsens our results. On the other hand, for the moving and stationary masks, our framework clearly outperforms DFC-Net [20] and FGVC [6], which also use flow-based propagation, by a large margin. These results show that our overall framework plays a critical role in addressing the limitations of existing methods. Comparisons on further metrics, Video-based Fréchet Inception Distance (VFID) [18] and optical-flow-based warping error (EWarp) [18], are shown in the supplementary material.
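As a reference, the frame-wise PSNR/SSIM evaluation can be computed as in the sketch below (using scikit-image); whether the metrics are averaged over full frames or restricted to hole regions is our assumption (full frames here).

```python
import numpy as np
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

def video_psnr_ssim(pred_frames, gt_frames):
    """Average PSNR/SSIM over a video; frames are uint8 (H, W, 3) arrays."""
    psnr = [peak_signal_noise_ratio(g, p, data_range=255)
            for p, g in zip(pred_frames, gt_frames)]
    ssim = [structural_similarity(g, p, channel_axis=-1, data_range=255)
            for p, g in zip(pred_frames, gt_frames)]
    return float(np.mean(psnr)), float(np.mean(ssim))
```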

To compare efficiency, we measure the execution time for inpainting 50 frames at 864\(\,\times \,\)480 resolution. Since all of our components are built on deep networks, our method is \(\times {6}\) faster than FGVC [6], which relies on optimization-based procedures.

Fig. 8. Qualitative results compared with [6, 13, 22]. The temporal profiles of the red scan line are shown below the results. Best viewed in zoom. (Color figure online)

Qualitative Results. In Fig. 8, we show our model’s qualitative results compared with other methods, including STTN [22], FGVC [6] and FuseFormer [13]. To visualize the temporal consistency, we show the temporal profile [1] of the resulting videos below the completed frames. Sharp and smooth edges in the temporal profile indicate that the video has much less flickering. As can be seen, our method shows more visually pleasing results compared to others.

For a more comprehensive comparison, we conducted a user study on Amazon Mechanical Turk to subjectively compare our method against others on the DAVIS dataset. The study used 30 videos, excluding easy cases (Tennis, Flamingo, etc.) and difficult cases (India, Drone, etc.) where every method failed. Videos like those in Fig. 8 were shown to 20 participants; the results were unlabeled and shuffled in order, and participants were asked to rank them from 1 to 4 (1 being the highest preference). As shown in Fig. 9, our model received higher preference than the others, supporting that our approach produces more visually pleasing outputs.

Fig. 9. A user study on DAVIS [2] object removal. The lower is better.

Table 2. Ablation studies on Youtube-VI dataset. For Flow-EPE, the lower is better. FGVC/* means FGVC and FGVC* respectively.

4.4 Ablation Study

Effectiveness of Flow Completion. In Table 2(a), we report the flow end-point error (EPE) to compare our flow completion with others, using the flows estimated by RAFT [16] as pseudo ground-truth flows. In the table, FGVC/* denotes FGVC and FGVC* respectively. For DFC-Net [20] and FGVC [6], the performance drops significantly when corrupted frames are used as input to the flow estimator, verifying that the errors in the corrupted flows prevent correct flow estimation. In the same setting where corrupted frames are used, ours achieves the best performance.

In addition, we conduct a series of ablation studies to analyze the effectiveness of our flow completion module. First, we sequentially run the pretrained FuseFormer [13] followed by the original RAFT [16] (FF + RAFT). To demonstrate the effectiveness of the local temporal network LTN, we report results using only the flow estimator (w/o LTN), trained with the same settings as ours. We also train our flow estimator with equal weights in Eq. (4) instead of h (w/o Hard). These results verify that our jointly trained flow completion module can deal with the errors from the local temporal network LTN and produce more accurate completed flows.

Effectiveness of Compensation. In Table 2(b), we evaluate the error compensation network by replacing our flow completion with FGVC's method (FGVC* + Comp). Although the flow completion of FGVC* is worse than ours, this variant still outperforms the full FGVC* procedure. We also show results with the pseudo ground-truth flow from RAFT [16] (GT flow + Comp). These two experiments demonstrate that our error compensation network is robust to errors from the flow completion. We further test our model without the compensation stage (w/o Comp) and without the error guidance map (w/o Guide). In the w/o Guide experiment, we use a naive GAN that takes only the filled frames with propagation and remaining masks as input. Both variants show severe performance drops compared to our full method, demonstrating that our compensation network with the error guidance map is necessary to handle the errors from the previous stages.

5 Conclusion

In this paper, we have proposed a simple yet effective video inpainting framework that takes advantage of flow-based methods while compensating for their shortcomings. With our flow completion and error compensation networks, we can propagate valid pixels and produce high-quality completed videos. In particular, with the help of the error guidance map, we prevent errors from accumulating and amplifying in the following stages. We show that our method achieves state-of-the-art performance and demonstrate the benefits of our framework through extensive experiments. We also provide a new benchmark dataset that enables comparative analysis of video inpainting methods.