1 Introduction

Optical flow captures the motion between consecutive frames and serves as a cue for other computer vision tasks such as object tracking, action recognition, 3D reconstruction, and video enhancement. Recently, deep neural networks have made great progress in optical flow estimation [12, 15, 29,30,31], primarily in a supervised learning manner that requires a large amount of labeled data. Despite the effectiveness of these learning-based approaches, obtaining labeled real-world data at scale is prohibitively expensive. Therefore, synthetic computer graphics data [1, 3, 6, 23] are typically leveraged.

A common belief about synthetic data is that frames rendered by graphics engines generalize poorly to real scenes because of synthetic-to-real domain gaps in quality. These gaps involve real-world effects such as noise, 3D motion, non-rigidity, motion blur, occlusions, large displacements, and texture diversity. Synthetic datasets [1, 3, 6, 23] for optical flow have therefore been designed to mimic these real-world effects to some extent.

Within this paradigm, we pose a question: “Which factors of a synthetic dataset are essential for generalization to the real domain?” In this work, we find that the required characteristics of an optical flow dataset are simple; achieving only a certain level of realism is enough to train highly generalizable and accurate optical flow models. We empirically observe that a simple 2D motion-based training dataset often performs favorably for ordinary purposes, and frequently better than earlier synthetic datasets [1, 22] that render complex 3D objects and motion with rich textures. Furthermore, we find that using occlusion masks to give the network incomplete information provides a powerful initial state for curriculum learning.

Fig. 1

Prior synthetic datasets and our proposed dataset. Sampled frames and their corresponding flow maps are visualized. While diverse in motion, a, b include many thin object parts and unrealistically simple reflectance. c includes a semantically coherent flow map, but the diversity of the motion is limited by a global camera motion. Our method, in contrast, provides controllable and diverse motion with semantically coherent object shapes and rich texture

We design easily controllable synthetic dataset generation recipes using a cut-and-paste method with segmented 2D object textures. As shown in Fig. 1, our generated data appears far from real-world imagery, yet training on it yields promising results in both the generalization and fine-tuning regimes, outperforming networks trained on the competing datasets. We also utilize occlusion masks to stop gradients on occluded regions, and the RAFT network initially trained with occlusion masks outperforms the original RAFT on the two most challenging online benchmarks, MPI Sintel [3] and KITTI 2015 [24]. Our key contributions are summarized as follows: (1) we present simple synthetic data generation recipes composed of elementary operations and show comparable performance against competing methods; (2) we propose a novel method of utilizing occlusion masks in a supervised setting and show that suppressing gradients on occluded regions serves as a powerful initial state in a curriculum learning protocol; and (3) we systematically analyze our dataset and the effects of motion type, motion distribution, data size, texture diversity, and occlusion masks.

2 Related work

We briefly review our target task, i.e., optical flow estimation, and the training datasets that have been used to train learning-based optical flow estimation methods.

Optical Flow Fundamentally, optical flow estimation for each pixel is an ill-posed problem. Traditional approaches [2, 11, 25, 33] imposed smoothness priors to regularize this ill-posedness within an optimization framework. With the advance of deep learning, the ill-posedness has instead been tackled by learning, yielding superior performance. Starting with the success of FlowNet [6, 15], recent optical flow estimation methods have been developed with coarse-to-fine approaches [10, 29, 32] or iterative refinement approaches [13, 31]. However, these approaches rely strongly on training datasets, and real supervised optical flow data is extremely difficult to obtain [22].

Fig. 2

Schematic overview of our data generation pipeline and occlusion mask estimation. a Given a background image and foreground objects, we sample affine flow coefficients and generate a consecutive frame. These coefficients can be used to extract the exact ground-truth optical flow map. b We describe the process of estimating the occlusion mask (\(\textrm{M}_{r,i}\)) for the first layer (\(i=0\)), which is the background. This process is conducted recursively in ascending order until the last layer

Datasets Supervised learning-based methods for optical flow estimation require exact, pixel-accurate ground truth. While obtaining true real-world motion is extremely difficult without additional information, several real-world optical flow datasets [9, 16, 20, 24] have been proposed. However, these datasets are relatively small and biased toward limited scenarios; thus, they are insufficient for training deep models and are more suitable as benchmark test sets.

To address this persistent data scarcity, several studies have generated large-scale synthetic datasets. Dosovitskiy et al. [6] propose a synthetic dataset of moving 3D chairs superimposed on images from Flickr. Similarly, Mayer et al. [23] present datasets in which not only chairs but various objects are scattered over the background. Aleotti et al. [1] leverage an off-the-shelf monocular depth network to synthesize a novel view from a single image and compute an accurate flow map.

Mayer et al. [22] identify critical factors of synthetic datasets, i.e., object shape, motion types and distributions, textures, real-world effects, data augmentation, and learning schedules. Prior works [5, 28] generate learning-based synthetic datasets for training accurate optical flow networks, but it remains challenging to intuitively distinguish the key factors of synthetic data. We build upon the observations of Mayer et al. [22], design easily controllable synthetic dataset generation recipes, and identify additional key factors such as balanced motion distribution, amount of data, texture combination, and learning schedules with occlusion masks.

3 Data generation pipeline

In this section, we present a simple method to generate an effective optical flow dataset. Unlike prior arts that use 3D motions and objects rendered with computer graphics, our generation scheme remains simple by using 2D image segment datasets and the 2D affine motion group. This simplicity makes it possible to analyze the effect of each factor of the synthetic dataset.

Overall Pipeline The overall data generation pipeline is illustrated in Fig. 2. As shown, we use a simple cut-and-paste method in which foreground objects are pasted onto an arbitrary background image. Inspired by Oh et al. [26], the segmented foreground objects and random background images are obtained from two independent datasets to encourage combinatorial diversity while avoiding texture overlaps. In this work, we use PASCAL VOC [7] and MS COCO [21], as suggested by Oh et al. [26]. The foreground objects are first superimposed at random positions, and the consecutive frame is composed by moving both the foreground objects and the background image with random, simple affine motions. This allows us to express diverse motions, easily control the motion distribution, and compute occlusion masks.

Background Processing We first sample a background image from an image dataset and resize it to \(712\times 584\). We regard this frame as the target frame (Frame B in Fig. 2). Then, we generate a flow map using random affine coefficients, including translation, rotation, and scaling (zooming), and inverse-warp the target frame to obtain the reference frame (Frame A in Fig. 2). We sample the background translation coefficient from the range \([-20, 20]\) pixels for each direction, and with a 30% chance, the translation coefficient is reset to zero. The rotation and scale coefficients are sampled from \([-\frac{\pi }{100}, \frac{\pi }{100}]\) and [0.85, 1.15], respectively. From the sampled affine matrix, we obtain a ground-truth flow map by subtracting the coordinates of the two background images as \(\textbf{f} = \textbf{A}\textbf{x}-\textbf{x}\), where \(\textbf{f}\) denotes the flow vector of a pixel in the reference frame, \(\textbf{A}\) the affine transform, and \(\textbf{x}\) the homogeneous coordinate [x, y, 1] of each pixel in the reference frame. We sample 7,849 background images from MS COCO [21].
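
To make this step concrete, the following is a minimal sketch of the background affine sampling and the flow computation \(\textbf{f} = \textbf{A}\textbf{x}-\textbf{x}\) in Python/NumPy. The coefficient ranges follow the text; the function and variable names are illustrative and not taken from any released code.

```python
# Minimal sketch of the background flow generation described above (NumPy only).
import numpy as np

H, W = 584, 712  # background resized to 712 x 584 (W x H)

def sample_background_affine(rng):
    """Sample a random 2D affine matrix A (3x3, homogeneous) for the background."""
    tx, ty = rng.uniform(-20, 20, size=2)            # translation in pixels
    if rng.random() < 0.3:                            # 30% chance of no translation
        tx, ty = 0.0, 0.0
    theta = rng.uniform(-np.pi / 100, np.pi / 100)    # rotation
    s = rng.uniform(0.85, 1.15)                       # isotropic scale (zooming)
    cos_t, sin_t = np.cos(theta), np.sin(theta)
    return np.array([[s * cos_t, -s * sin_t, tx],
                     [s * sin_t,  s * cos_t, ty],
                     [0.0,        0.0,       1.0]])

def affine_flow(A, h, w):
    """Ground-truth flow f = A x - x for every pixel x of the reference frame."""
    ys, xs = np.mgrid[0:h, 0:w]
    x_h = np.stack([xs, ys, np.ones_like(xs)], axis=-1).reshape(-1, 3).T  # 3 x (h*w)
    flow = (A @ x_h)[:2] - x_h[:2]                    # 2 x (h*w)
    return flow.T.reshape(h, w, 2).astype(np.float32)

rng = np.random.default_rng(0)
bg_flow = affine_flow(sample_background_affine(rng), H, W)
```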

Foreground Processing To synthesize foreground objects’ motion, we use segmented objects from a semantic image segmentation dataset. For the target frame, we first sample the number of foreground objects to be composited from \(\{7,8,\cdots ,14,15\}\). Then, we randomly place these objects on the target frame and apply inverse-warping to obtain the warped objects in the reference frame using optical flow maps obtained from random affine transformations. The sampling ranges of the rotation and scale coefficients are the same as in the background case. The distribution of the translation magnitude is designed to follow the exponential distribution \(\tfrac{1}{Z}\exp (-f/T)\), where the temperature T is empirically set to 20 and Z is the normalization term. This distribution is inspired by the natural statistics of optical flow [27], where motion statistics tend to follow a Laplacian distribution. We limit the range to [0, 150] by resampling whenever the magnitude exceeds 150 pixels. The translation direction of each foreground is sampled uniformly at random. We use 2913 images from PASCAL VOC [7], from which we extract 5543 preprocessed segments as foreground objects.
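
A minimal sketch of the foreground translation sampling under the stated assumptions (exponential magnitudes with T = 20, resampling above 150 pixels, uniformly random direction); the helper name is hypothetical.

```python
# Sketch of the foreground translation sampling described above.
import numpy as np

def sample_foreground_translation(rng, temperature=20.0, max_mag=150.0):
    mag = rng.exponential(scale=temperature)   # magnitude ~ (1/Z) exp(-f/T)
    while mag > max_mag:                       # resample out-of-range magnitudes
        mag = rng.exponential(scale=temperature)
    angle = rng.uniform(0.0, 2.0 * np.pi)      # uniformly random direction
    return mag * np.cos(angle), mag * np.sin(angle)

rng = np.random.default_rng(0)
tx, ty = sample_foreground_translation(rng)
```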

Composition We sequentially paste foregrounds onto the background to generate a single pair of consecutive frames. The flow map of each foreground is pasted only where its alpha channel value is at least the threshold c. Following the implementation details of [28], we set c to 0.4 and empirically find that performance is not sensitive to this threshold.
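
The composition rule can be sketched as follows, assuming per-layer flow maps and alpha maps at the same resolution; this is an illustration of the thresholded paste, not the authors’ implementation.

```python
# Layer-wise composition: a foreground's flow overwrites the composite only
# where its alpha value is at least the threshold c = 0.4.
import numpy as np

def composite_flow(bg_flow, fg_flows, fg_alphas, c=0.4):
    """bg_flow: (H, W, 2); fg_flows: list of (H, W, 2); fg_alphas: list of (H, W)."""
    flow = bg_flow.copy()
    for fg_flow, alpha in zip(fg_flows, fg_alphas):   # paste in layer order
        mask = alpha >= c
        flow[mask] = fg_flow[mask]
    return flow
```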

After composition, we center-crop the composited images to obtain outputs of size \(512\times 384\), the same as FlyingChairs [6]. Our data generation is faster than AutoFlow [28], which learns a dataset for given target data, and about 500 times faster than dCOCO [1], as shown in Table 1. This fast generation speed is beneficial for analyzing the characteristics required to train accurate optical flow networks.

Occlusion Mask Similar to prior arts [3, 9, 23, 24], our data generation method exports occlusion masks as well. Predicting the motion of regions that become occluded is an intractable problem and requires uncertain forecasting, which can act as a detrimental outlier during training. Thus, prior arts [14, 17] also estimate occlusion masks to encourage reliable optical flow estimation. Unlike these works, we utilize occlusion masks in a supervised manner by suppressing gradients on occluded regions. This gradient suppression with occlusion masks serves as a powerful initial state in the curriculum learning protocol, which will be discussed in the experimental section. To obtain occlusion masks, given the alpha maps of each layer, foregrounds (\(i\ge 1\)) and background (\(i=0\)) in order, we binarize the alpha maps by thresholding at 0.4, denoting them \(\mathrm {\alpha }_{\{r, t\},i}\) for the i-th object layer in the reference and target frames, respectively. The non-visible regions \(\textrm{V}_{\{r, t\},i}\) of the i-th layer in each frame are computed by \(\textrm{V}_{\{r, t\},i} = \alpha _{\{r, t\},i}{\cap }(\cup _{k=i+1}^{L}{\alpha _{\{r, t\},k}})\). Using the i-th layer flow map \(\textbf{f}_{i}\), we inverse-warp \(\textrm{V}_{t,i}\) to the reference frame as \(\textrm{V}_{t\rightarrow r,i} = \textbf{f}_{i}\circ \textrm{V}_{t,i}\) and binarize it at 0.4 again, where \(\circ \) denotes the warping operation. Then, because occluded regions are visible only in the reference frame, we can find the occlusion mask of each layer by \(\textrm{M}_{r,i} = \textrm{max}(\textrm{V}_{t\rightarrow r,i} - \textrm{V}_{r,i}, 0)\). The combined occlusion mask \(\textrm{M}_{r}\) is obtained by \(\textrm{M}_{r} = \cup _{i=0}^{L} \textrm{M}_{r,i}\).
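
A sketch of the per-layer occlusion-mask computation described above; OpenCV’s remap stands in for the warping operator \(\circ \), the binarization threshold 0.4 follows the text, and all names are our own.

```python
# Occlusion mask M_r from layered alpha maps and per-layer flow maps.
import numpy as np
import cv2

def inverse_warp(mask_t, flow_i):
    """Backward-warp a target-frame mask to the reference frame using flow f_i."""
    h, w = mask_t.shape
    ys, xs = np.mgrid[0:h, 0:w].astype(np.float32)
    return cv2.remap(mask_t.astype(np.float32),
                     xs + flow_i[..., 0], ys + flow_i[..., 1], cv2.INTER_LINEAR)

def occlusion_mask(alphas_r, alphas_t, flows, thr=0.4):
    """alphas_*: lists of (H, W) alpha maps, layer 0 = background; flows: per-layer flow."""
    L = len(alphas_r) - 1
    bin_r = [(a >= thr).astype(np.float32) for a in alphas_r]
    bin_t = [(a >= thr).astype(np.float32) for a in alphas_t]
    M_r = np.zeros_like(bin_r[0])
    for i in range(L + 1):
        # union of all layers above layer i in each frame
        above_r = np.clip(sum(bin_r[i + 1:], np.zeros_like(bin_r[0])), 0, 1)
        above_t = np.clip(sum(bin_t[i + 1:], np.zeros_like(bin_t[0])), 0, 1)
        V_r = bin_r[i] * above_r                          # non-visible region, reference
        V_t = bin_t[i] * above_t                          # non-visible region, target
        V_t2r = (inverse_warp(V_t, flows[i]) >= thr).astype(np.float32)
        M_r = np.maximum(M_r, np.maximum(V_t2r - V_r, 0))  # union over layers
    return M_r
```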

4 Experiments

In this section, we compare the performance of optical flow networks trained on our datasets, with and without the occlusion mask, and on competing datasets. Utilizing the simple data generation recipe, we also analyze the effects of different characteristics of optical flow datasets.

Table 1 Data generation speed

Optical Flow Network We use RAFT [31] as the reference model to evaluate the benefits of our synthetic dataset in generalization and fine-tuning setups. RAFT is a representative supervised model that is widely used to assess the effectiveness of optical flow datasets [1, 28]. We follow the hyper-parameters suggested by the implementation of [31] and the experimental setup of Aleotti et al. [1], which reports one-/multi-stage training results. For our synthetic datasets, in the initial training stage, we train RAFT for 100k iterations with a batch size of 10, image crops of size \(496\times 368\), a learning rate of \(4\times 10^{-4}\), and a weight decay of \(1\times 10^{-4}\).

For multi-stage training with FlyingThings3D [23], starting from the RAFT networks pre-trained on our datasets, we further train on the frames_cleanpass split of FlyingThings3D, which includes 40k consecutive frame pairs. We train the model for 100k iterations with a batch size of 6, image crops of size \(720\times 400\), a learning rate of \(1.25\times 10^{-4}\), and a weight decay of \(1\times 10^{-4}\). These hyper-parameters are the same as those of the Things training stage reported in [31].
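
For reference, the two training stages can be summarized as a plain configuration sketch; the values are those given in the text, while the key names are ours rather than RAFT’s actual argument names.

```python
# Two-stage training schedule (values from the text; key names are illustrative).
STAGES = {
    "ours_initial": dict(iters=100_000, batch_size=10, crop=(496, 368),
                         lr=4e-4, weight_decay=1e-4),
    "things_stage": dict(iters=100_000, batch_size=6, crop=(720, 400),
                         lr=1.25e-4, weight_decay=1e-4),
}
```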

Competing Datasets for Training We choose FlyingChairs (Ch) [6] and dCOCO [1] as the competing datasets and use the RAFT networks pre-trained on each dataset, as provided by the respective authors. For multi-stage training, starting from the networks pre-trained on ours, we further train on FlyingThings3D (Th) [23] to compare against the RAFT model trained on FlyingChairs followed by FlyingThings3D (Ch\(\rightarrow \)Th).

Table 2 Comparison with other datasets

Test Datasets We evaluate on Sintel [3] and KITTI 2015 [24]. These datasets contain crucial real-world effects, such as occlusions, illumination changes, motion blur, and camera noise, making them challenging and widely used benchmarks for evaluating optical flow models. We report the performance of models trained on the base datasets without fine-tuning on Sintel or KITTI, which we call generalization, and that of models fine-tuned on the training set of Sintel or KITTI, which we call fine-tuning.

Evaluation Following convention, we report the average end-point error (EPE) and the percentage of pixels whose error exceeds 3 pixels and 5% of the true flow magnitude (Fl). We also report the percentage of pixels with an absolute error of at most 1 pixel (\(\le \)1). Bold highlights the best method.
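
A minimal sketch of these metrics, assuming dense ground truth with an optional validity mask; it is not the benchmarks’ official evaluation code.

```python
# Average EPE, Fl (error > 3 px AND > 5% of GT magnitude), and <=1 px ratio.
import numpy as np

def flow_metrics(pred, gt, valid=None):
    """pred, gt: (H, W, 2) flow fields; valid: optional boolean (H, W) mask."""
    err = np.linalg.norm(pred - gt, axis=-1)
    mag = np.linalg.norm(gt, axis=-1)
    if valid is None:
        valid = np.ones(err.shape, dtype=bool)
    err, mag = err[valid], mag[valid]
    epe = err.mean()
    fl = 100.0 * np.mean((err > 3.0) & (err > 0.05 * mag))
    within_1px = 100.0 * np.mean(err <= 1.0)
    return epe, fl, within_1px
```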

4.1 Comparison with other synthetic datasets

We compare the generalization and fine-tuning performance of the networks trained on our dataset and other competing datasets [1, 6, 23]. For fair comparisons, we train the network on our dataset (denoted as Ours) with 20k image pairs that include translation, rotation, and zooming. We also evaluate our dataset with occlusion masks \(\langle \)O\(\rangle \) (denoted as Ours+O).

Generalization The left part of Table 2 summarizes the generalization test. Among the models trained on a single dataset, our datasets (C, D) show the best performance on Sintel, whereas dCOCO (B) shows better performance on the KITTI datasets. We further evaluate on two other benchmarks, as shown in Table 3, and observe that dCOCO achieves better performance on Virtual KITTI [8], which is a synthetic dataset. On the other hand, ours achieves more accurate optical flow estimation on a real dataset, i.e., HD1K [20]. From these results, we conjecture that dCOCO, which uses a depth-aware data generation approach with real images, is effective for autonomous driving scenarios, and that similar motion distributions and textures between the synthetic and target datasets are key factors for generalization. We also pre-train the network on 2D motion datasets, such as FlyingChairs [6] and ours, and sequentially train on FlyingThings3D [23]. Compared to (E), which uses FlyingChairs at the initial stage, (F, G) show better generalization performance on the KITTI datasets and the Sintel Clean pass. This shows that the choice of the initial training stage significantly affects the final performance.

Table 3 Generalization results on other benchmarks
Table 4 Test results on Sintel and KITTI 2015
Table 5 Generalization results on other backbone networks
Fig. 3

Generalization results and histograms of datasets depending on the foreground translation distribution. From left to right: A uniform, B Gaussian, and C exponential distribution. A is sampled from a uniform distribution on the interval [0, 150]. B is the distribution suggested by FlyingChairs [6], given as \(\textrm{max}(\textrm{min}(\textrm{sign}(\gamma )\cdot |\gamma |^3,150),-150)\), where \(\gamma \sim \mathcal {N}(0,\,2.3^2)\). C is the proposed distribution, which follows natural statistics [4]. Note that we sample the foreground translation magnitude from these three distributions while the background distribution is fixed

Fine-tuning We fine-tune the networks from the left part of Table 2 on Sintel or the KITTI datasets, and the results are reported in the right part of the table. Overall, our datasets show favorable performance. Compared to (E), first pre-trained on FlyingChairs, (F, G) show better performance. (G) in particular achieves the lowest Fl and a noticeable performance improvement on KITTI 2015. These results suggest that utilizing occlusion masks for gradient suppression is effective when fine-tuning on real-world datasets, i.e., KITTI 2012 and KITTI 2015. We observe a consistent tendency in the online benchmark results below.

Online Benchmarks We follow the training procedure described in RAFT [31] to fine-tune the model pre-trained on our dataset and test on the public benchmarks of Sintel and KITTI 2015. As summarized in Table 4, using our dataset for the initial curriculum outperforms the original RAFT on both public benchmarks. On the KITTI 2015 test set, the network pre-trained on our synthetic dataset with occlusion masks outperforms RAFT. On the Sintel test set, we observe performance improvements on both the Clean and Final passes with our dataset. With and without the warm-start initialization, the network trained with our schedule also achieves better results on both passes. From these results, we conjecture that learning the simplest characteristics for estimating optical flow in the initial learning schedule, without occlusion estimation, helps the network perform better.

Other Backbone Networks To evaluate the effectiveness of our dataset beyond RAFT, we select two more optical flow models: FlowNet [6] and PWC-Net [29]. We use publicly available re-implementations of FlowNet and PWC-Net. Table 5 shows that each network trained on our dataset outperforms the one trained on FlyingChairs [6]. We also include the previous experiment with RAFT in (C) as a reference. These results show that the simple properties of our dataset are effective not only for RAFT [31] but also for general optical flow networks.

Table 6 Impact of motion complexity and occlusion masks

4.2 Ablation study

By virtue of the fast generation speed of the simple recipes and the controllability of our dataset, we can conduct a series of ablation studies to determine which factors of our dataset affect network performance the most.

Foreground Translation Distributions   We evaluate the effect of the translational motion distribution of foregrounds with 20k image pairs. We use three different distributions to sample translation magnitudes. Figure 3 shows the histogram of each dataset and summarizes the generalization results achieved by the RAFT network. (A) is a uniform distribution and (B) is the Gaussian-based distribution suggested by FlowNet [6]. (C) is the proposed distribution, which follows natural statistics [4].

As shown in the histograms, the peaks are near zero (with counts on the order of \(10^9\)) due to the background translation. Thus, we focus on the tails of the distributions, which mostly come from foregrounds. (A) includes excessively large motions, which are unrealistic in real-world scenarios and eventually degrade performance. Compared with (B), (C) outperforms on the overall benchmark metrics. The main difference between the two is the density of the focused region in the histogram, where (C) decays faster than (B). From this, we observe that slight differences in the tails of the translation distribution significantly affect model performance; thus, we take special care in designing a balanced motion distribution. We choose (C) as the translation distribution for the following experiments.
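
For reference, the three magnitude distributions can be sketched as follows, using the parameters given in Fig. 3; the function is illustrative only.

```python
# Foreground translation magnitudes: A uniform, B FlyingChairs-style cubed
# Gaussian (clipped to [-150, 150]), C exponential with T = 20 (resampled above 150).
import numpy as np

def sample_magnitude(kind, rng, n=1):
    if kind == "A":
        return rng.uniform(0.0, 150.0, n)
    if kind == "B":
        g = rng.normal(0.0, 2.3, n)
        return np.abs(np.clip(np.sign(g) * np.abs(g) ** 3, -150.0, 150.0))
    if kind == "C":
        m = rng.exponential(20.0, n)
        while np.any(m > 150.0):
            m[m > 150.0] = rng.exponential(20.0, np.count_nonzero(m > 150.0))
        return m
    raise ValueError(kind)
```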

Motion Complexity We assess the effect of each motion type during training. We start by evaluating datasets containing only translation \(\langle \)T\(\rangle \), rotation \(\langle \)R\(\rangle \), or zooming \(\langle \)Z\(\rangle \), respectively. Then, we sequentially add rotation \(\langle \)R\(\rangle \) and zooming \(\langle \)Z\(\rangle \) to the translation-only \(\langle \)T\(\rangle \) dataset. As shown in Table 6, the network trained on translation-only motion (A) performs comparably to a network trained on FlyingChairs (G). In contrast, with rotation only (B), the generalization performance drops significantly on both benchmarks. With zooming alone (C), the performance is sub-par on Sintel, but on KITTI it exhibits a favorable EPE and surpasses (A) in the Fl score. This result is likely attributed to KITTI’s characteristics, which predominantly feature driving scenes with frequent forward-backward ego motion that zooming can mimic. We also construct dataset (D) by adding rotation to (A) and measure the performance of the trained network. Compared to (A), the network trained on (D) achieves an improved EPE on Sintel because Sintel’s cinematic scenes frequently contain rotational motion. On KITTI, however, the EPE of the model trained on (D) degrades, which implies that adding rotation may confuse the network on test data containing little rotational motion, i.e., the driving scenes of KITTI. Interestingly, both (A) and (D) perform comparably to the network trained on FlyingChairs (G), which contains all three motion types, T+R+Z. We believe the different translation distributions and abundant textures explain these results. Finally, adding both rotation and zooming (E) outperforms (A\(\sim \)D) and (G) in all cases. We observe that zooming mimics backward and forward object or ego motion, which occurs frequently in both benchmarks. In summary, translation is the most fundamental factor for the generalization ability on both benchmarks, while the effects of rotation and zooming vary depending on the characteristics of the test dataset. Although the effect of each motion type may differ, the combination of translation, rotation, and zooming achieves the highest generalization performance on both benchmarks.

Table 7 Impact of abundant texture

The networks trained on our datasets have not seen any 3D motion during training; thus, in practice we can further fine-tune them on other datasets that include 3D motion. To assess the value of our datasets for pre-training, we further fine-tune the aforementioned networks on the benchmarks, KITTI 2015 or Sintel. We follow the fine-tuning protocol suggested by Aleotti et al. [1] on the KITTI datasets. The same protocol is applied to the Sintel dataset, with the first 80% of the data used for fine-tuning and the remainder used for validation. The fine-tuning results in the right part of Table 6 show a tendency consistent with the above generalization study. While the improvement is marginal in this high-accuracy regime, on the KITTI datasets the best performance is achieved when the pre-trained network has been exposed to diverse motion types (E), i.e., translation, rotation, and zooming. On Sintel, the best performance is nearly achieved when the network is exposed to both translation and rotation in the pre-training stage (D). These results suggest that the required motion characteristics of the pre-training dataset vary depending on the test dataset.

Effects of Occlusion Mask   Prior works [18, 19, 34] show the effectiveness of occlusion masks \(\langle \)O\(\rangle \). Unlike these prior arts, we propose an intuitive and effective method that utilizes easily obtainable occlusion masks by suppressing the gradients at regions that become occluded, in a supervised manner. In the left part of Table 6, generalization results with the occlusion mask (F) show EPE comparable to (E) on the benchmarks but lower Fl on the KITTI datasets. To evaluate further, we fine-tune network (F) from the left part of Table 6 on the benchmarks and show the results in the right part of the table. The results again show lower Fl on the KITTI datasets. Moreover, (F) outperforms (E) on both metrics when fine-tuning on KITTI 2015, which contains the most complicated real-world scenes. On Sintel, applying occlusion masks during pre-training makes only a marginal difference compared to the gains observed on KITTI 2015. We hypothesize that the gap on Sintel is marginal because the occlusion patterns of our dataset and Sintel, both synthetic, are similar and easier than those of KITTI 2015; accordingly, the EPE errors on Sintel are much lower. On the other hand, the occlusion patterns of our synthetic data and KITTI 2015 differ clearly, so learning occlusion during pre-training can bias the network toward our synthetic data, whose characteristics differ from KITTI 2015. This hints that applying the occlusion mask in pre-training lets the network focus on learning correspondences in the early stage and on the specific occlusion patterns of the benchmark datasets later during fine-tuning. This can be regarded as curriculum learning, where gradually progressing from simple to complex concepts helps the network perform better. Applying the occlusion mask is an intuitive form of curriculum learning, and we demonstrate its effectiveness in improving the final performance, particularly when the occlusion pattern differs between the pre-training and fine-tuning datasets.
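
A minimal PyTorch sketch of this gradient suppression, assuming a single flow prediction and an L1 end-point penalty (RAFT’s sequence-weighted loss is omitted for brevity); this is our reading of the method, not the authors’ code.

```python
# Supervised flow loss with gradients suppressed on occluded pixels:
# the error is simply not accumulated where the occlusion mask M_r is set.
import torch

def masked_flow_loss(pred, gt, occ_mask):
    """pred, gt: (B, 2, H, W) flows; occ_mask: (B, 1, H, W), 1 where occluded."""
    visible = 1.0 - occ_mask
    per_pixel = (pred - gt).abs().sum(dim=1, keepdim=True)   # L1 end-point error
    return (per_pixel * visible).sum() / visible.sum().clamp(min=1.0)
```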

Abundant Textures We analyze the effect of abundant foreground textures during training. Considering that the average number of foregrounds in FlyingChairs [6] is 5, we compare the cases of 4 and 8 foregrounds. We also apply a Gaussian filter with a kernel size of 5 to the foregrounds to simulate the lack of high-frequency texture of the chairs used in FlyingChairs. Table 7 shows that more foregrounds with high-frequency textures lead to overall improvement. These results hint that abundant texture is another important factor in generating synthetic data.
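
The low-frequency texture variant can be sketched as a simple per-segment blur applied before compositing; the OpenCV call and helper name are illustrative.

```python
# Simulate the lack of high-frequency texture: 5x5 Gaussian blur on the RGB
# channels of an RGBA foreground segment, leaving the alpha channel untouched.
import cv2

def blur_foreground(fg_rgba):
    blurred = fg_rgba.copy()
    blurred[..., :3] = cv2.GaussianBlur(fg_rgba[..., :3], (5, 5), 0)
    return blurred
```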

5 Discussion and limitation

We propose an easily controllable synthetic dataset recipe based on cut-and-paste, which enables comprehensive studies. Through the experiments, we reveal simple yet crucial factors for generating synthetic datasets and designing learning curricula. We introduce a supervised occlusion mask method that stops the gradient at regions that become occluded. Combining these findings, we observe that networks trained on our datasets achieve favorable generalization performance, and that our datasets with occlusion masks serve as a powerful initial curriculum, achieving superior performance in fine-tuning and on online benchmarks.

Limitation In this work, using the proposed controllable synthetic dataset, we analyzed the effect of key factors in optical flow training datasets, such as balanced motion distribution, amount of data, texture combination, and learning schedules with occlusion masks. Although we examined these fundamental and simple factors through extensive analysis, the impact of various real-world effects, such as motion blur and fog, which would also matter, has not been addressed in this paper. Those real-world effects require a certain level of physics simulation or introduce notable complexity into data generation, which is not aligned with the direction of this work, i.e., simplicity. Nonetheless, motion estimation in extreme cases, including weather artifacts or degraded photos, is an important problem and the next challenge. Which factors matter for handling such artifacts, and whether complicated and realistic simulation is necessary, are interesting research questions that we leave for future work.