1 Introduction

Synthetic datasets offer an appealing opportunity for training machine learning models, e.g., for perception and planning in driving [18, 53, 55], indoor scene perception [46, 57], and robotic control [61]. Rendered with graphics engines, synthetic datasets come with perfect ground truth for tasks in which labels are expensive or even impossible to obtain, such as segmentation, depth or material information. Adding a new type of label to a synthetic dataset is as simple as calling the renderer again, rather than embarking on a time-consuming annotation endeavor that requires new tooling as well as hiring, training and overseeing annotators.

Creating synthetic datasets comes with its own hurdles. While content such as the 3D CAD models that make up a scene is available on online asset stores, artists must write complex procedural models that synthesize scenes by placing these assets in realistic layouts. This often requires browsing through massive amounts of real imagery to carefully tune a procedural model – a time-consuming task. For scenarios such as street scenes, creating synthetic scenes relevant to one city may require re-tuning, from scratch, a procedural model made for another city. In this paper, we propose an automatic method to carry out this task (Fig. 1).

Fig. 1.

We present a method that learns to generate synthetic scenes from real imagery in an unsupervised fashion. It does so by learning a generative model of scene structure, samples from which (with additional scene parameters) can be rendered to create synthetic images and labels.

Recently, Meta-Sim [30] proposed to optimize scene parameters in a synthetically generated scene by exploiting the visual similarity of (rendered) generated synthetic data with real data. They represent scene structure and parameters in a scene graph, and generate data by sampling a random scene structure (and parameters) from a given probabilistic grammar of scenes and then modifying the scene parameters using a learnt model. Since they only learn scene parameters, a sim-to-real gap in the scene structure remains. For example, one would likely find more cars, people and buildings in Manhattan than in a quaint Italian village. Other work on generative models of structured data such as graphs and grammar strings [12, 17, 37, 42] requires large amounts of ground-truth data to learn to generate realistic samples. However, scene structures are extremely cumbersome to annotate and are thus not available in most real datasets.

In this paper, we propose a procedural generative model of synthetic scenes that is learned unsupervised from real imagery. We generate scene graphs object-by-object by learning to sample rule expansions from a given probabilistic scene grammar, and generate scene parameters using [30]. Learning without supervision here is a challenging problem due to the discrete nature of the scene structures we aim to generate and the presence of a non-differentiable renderer in the generative process. To this end, we propose a feature-space divergence to compare (rendered) generated scenes with real scenes, which can be computed per scene and is key to enabling credit assignment when training with reinforcement learning.

We evaluate our method on two synthetic datasets and a real driving dataset, and find that our approach significantly reduces the distribution gap between scene structures in our generated and target data, improving over human priors on scene structure by learning to closely align with the target structure distributions. On the real driving dataset, starting from minimal human priors, we show that we can almost exactly recover the structural distribution of the real target scenes (measured using the GT annotations available for cars) – an exciting result given that the model is trained without any labels. We show that an object detector trained on our generated data outperforms those trained on data generated with human priors or by [30], and show improved distribution-similarity measures between our generated rendered images and real data.

2 Related Work

2.1 Synthetic Content Creation

Synthetic content creation has been receiving significant interest as a promising alternative to dataset collection and annotation. Various works have proposed generating synthetic data for tasks such as perception and planning in driving  [2, 14, 18, 48, 53, 55, 68], indoor scene perception  [4, 24, 46, 57, 59, 70, 75], game playing  [6, 28], robotic control  [6, 56, 61, 63], optical flow estimation  [7, 58], home robotics  [20, 34, 49] amongst many others, utilizing procedural modeling, existing simulators or human generated scenarios.

Learnt Scene Generation brings a data-driven nature to scene generation. [64, 74] propose learning hierarchical spatial priors between furniture items, which are integrated into a hand-crafted cost used to generate optimized indoor scene layouts. [50] similarly learn to synthesize indoor scenes using a probabilistic scene grammar and human-centric learning by leveraging affordances. [64] learn to generate intermediate object-relationship graphs and instantiate scenes conditioned on them. [76] use a scene-graph representation and learn to add objects into existing scenes. [40, 54, 65] propose methods for learning deep priors from data for indoor scene synthesis. [16] introduce a generative model that sequentially adds objects into scenes, while [29] propose a generative model for 2D object layouts given a label set. [60] generate batches of data with a neural network that is used to train a task model, and learn by differentiating through the learning process of the task model. [30] propose learning to generate scenes by modifying the parameters of objects in scenes sampled from a probabilistic scene grammar. We argue that this ignores the structural aspects of the scene, which we focus on in our work. Similar to [16, 30], and contrary to other works, we learn this in an unsupervised manner, i.e., given only target images as input.

Fig. 2.

Example scene graph (structure and parameters) and depiction of its rendering

Learning with Simulators: Methods in approximate Bayesian inference have looked into inferring the parameters of a simulator that generated a particular data point [35, 45]. [11] provide a great overview of advances in simulator-based inference. Instead of running inference per scene [35, 69], we aim to generate new data that resembles a target distribution. [44] learn to optimize non-differentiable simulators using a variational upper bound of a GAN-like objective. [8] learn to optimize simulator parameters for robotic control tasks by directly comparing trajectories between a real and a simulated robot. [19, 47] train an agent to paint images using brush strokes in an adversarial setting with reinforcement learning. We learn to generate discrete scene structures constrained to a grammar, while optimizing a distribution-matching objective (with reinforcement learning) instead of training adversarially. Compared to [47], we generate large and complex scenes, as opposed to images of single objects or faces.

2.2 Graph Generation

Generative models of graphs and trees [3, 10, 17, 42, 43, 73] generally produce graphs with richer structure and more flexibility than grammar-based models, but often fail to produce syntactically correct graphs in settings with a defined syntax, such as programs and scene graphs. Grammar-based methods have been used for a variety of tasks such as program translation [9], conditional program generation [71, 72], grammar induction [32] and generative modelling of structures with syntax [12, 37], such as molecules. These methods, however, assume access to ground-truth graph structures for learning. We take inspiration from these methods, but show how to learn our model unsupervised, i.e., without any ground-truth scene graph annotations.

3 Methodology

We aim to learn a generative model of synthetic scenes. In particular, given a dataset of real imagery \(X_R\), the problem is to create synthetic data \(D(\theta ) = (X(\theta ), Y(\theta ))\) of images \(X(\theta )\) and labels \(Y(\theta )\) that is representative of \(X_R\), where \(\theta \) represents the parameters of the generative model. We exploit advances in graphics engines and rendering by stipulating that the synthetic data D is the output of creating an abstract scene representation and rendering it with a graphics engine. Rendering ensures that low-level pixel information in \(X(\theta )\) (and its corresponding annotation \(Y(\theta )\)) does not need to be modelled, which has been the focus of recent research in generative modeling of images [31, 51]. Ensuring the semantic validity of sampled scenes requires imposing constraints on their structure. Scene grammars use a set of rules to greatly reduce the space of scenes that can be sampled, making learning a more structured and tractable problem. For example, a grammar can explicitly enforce that a car can only be on a road, a constraint that then need not be learned implicitly; this leads us to use probabilistic scene grammars. Meta-Sim [30] sampled scene-graph structures (see Fig. 2) from a prior imposed on a Probabilistic Context-Free Grammar (PCFG), which we call the structure prior. They sampled parameters for every node in the scene graph from a parameter prior and learned to predict new parameters for each node, keeping the structure intact. Their generated scenes therefore come from a structure prior (which is context-free) and the learnt parameter distribution, leaving a sim-to-real gap in the scene structures untackled. In our work, we aim to alleviate this by learning, unsupervised from images, a context-dependent distribution over synthetic scene structures.

We utilize scene graphs as our abstract scene representation, that are rendered into a corresponding image with labels (Sect. 3.1). Our generative model sequentially samples expansion rules from a given probabilistic scene grammar (Sect. 3.2) to generate a scene graph which is rendered. We train the model unsupervised and with reinforcement learning, using a feature-matching based distribution divergence specifically designed to be amenable to our setting (Sect. 3.3).

3.1 Representing Synthetic Scenes

In Computer Graphics and Vision, Scene Graphs are commonly used to describe scenes in a concise hierarchical manner, where each node describes an object in the scene along with its parameters such as the 3D asset, pose etc. Parent-child relationships define the child’s parameters relative to its parent, enabling easy scene editing and manipulation. Additionally, camera, lighting, weather etc. are also encoded into the scene graph. Generating corresponding pixels and annotations amounts to placing objects into the scene in a graphics engine and rendering with the defined parameters (see Fig. 2).

Fig. 3.

Representation of our generative process for a scene graph. The logits and mask are of shape \(T_{max} \times K\). Green represents a higher value and red a lower one. At every time step, we autoregressively sample a rule and predict the logits for the next rule conditioned on the sample (capturing context dependencies). The figure on the right shows how sampled rules from the grammar are converted into a graph structure (only renderable objects are kept from the full grammar string). Parameters for every node can be sampled from a prior or optionally learnt with the method of [30]. A generated scene graph can be rendered as shown in Fig. 2. (Color figure online)

Notation: A context-free grammar G is defined by a list of symbols (terminal and non-terminal) and expansion rules. Each non-terminal symbol has at least one expansion rule into a new set of symbols. Sampling from the grammar involves expanding a start symbol until only terminal symbols remain. We denote the total number of expansion rules in a grammar G as K. We define scene grammars and represent strings sampled from the grammar as scene graphs following [30, 48] (see Fig. 3). For each scene graph, a structure T is sampled from the grammar G, followed by sampling corresponding parameters \(\alpha \) for every node in the graph.
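To make the notation concrete, the sketch below encodes a toy scene grammar as a Python dictionary mapping each non-terminal to its list of expansion rules, and expands the start symbol until only terminal symbols remain. The grammar, the uniform rule probabilities and the function names are illustrative assumptions, not the paper's implementation; here K would be the total number of listed rules.

```python
# A minimal sketch of a context-free scene grammar and naive sampling from it.
import random

GRAMMAR = {
    "Scene":  [["bg", "Digits"]],
    "Digits": [["Digit", "Digits"], []],          # [] is the epsilon rule
    "Digit":  [[str(d)] for d in range(10)],      # terminals "0" ... "9"
}

def is_nonterminal(symbol):
    return symbol in GRAMMAR

def sample_string(start="Scene"):
    """Expand `start` left-to-right until no non-terminal symbols remain."""
    symbols = [start]
    while any(is_nonterminal(s) for s in symbols):
        i = next(idx for idx, s in enumerate(symbols) if is_nonterminal(s))
        rule = random.choice(GRAMMAR[symbols[i]])  # uniform prior over rules
        symbols = symbols[:i] + rule + symbols[i + 1:]
    return symbols

print(sample_string())   # e.g. ['bg', '3', '7']
```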

3.2 Generative Model

We take inspiration from previous work on learning generative models of graphs constrained by a grammar [37] for our architecture. Specifically, we map a latent vector z to unnormalized probabilities over all possible grammar rules in an autoregressive manner, using a recurrent neural network for a maximum of \(T_{max}\) steps. Deviating from [37], we sample one rule \(r_t\) at every time step and use it to predict the logits for the next rule \(f_{t+1}\). This allows our model to capture context-dependent relationships easily, as opposed to the context-free nature of scene graphs in [30]. Given a list of at most \(T_{max}\) sampled rules, the corresponding scene graph is generated by treating each rule expansion as a node expansion in the graph (see Fig. 3).

Sampling Correct Rules: To ensure the validity of sampled rules in each time step t, we follow  [37] and maintain a last-in-first-out (LIFO) stack of unexpanded non-terminal nodes. Nodes are popped from the stack, expanded according to the sampled rule-expansion, and the resulting new non-terminal nodes are pushed to the stack. When a non-terminal is popped, we create a mask \(m_t\) of size K which is 1 for valid rules from that non-terminal and 0 otherwise. Given the logits for the next expansion \(f_t\), the probability of a rule \(r_{t,k}\) is represented as,

$$\begin{aligned} p(r_t = k \,|\, f_t) = \frac{m_{t, k} \exp (f_{t,k})}{\sum _{j=1}^{K} m_{t,j} \exp (f_{t,j})} \end{aligned}$$

Sampling from this masked multinomial distribution ensures that only valid rules are sampled as \(r_t\). Given the logits and sampled rules \((f_t, r_t) \ \forall t \in \{1, \ldots , T_{max}\}\), the probability of the corresponding scene structure T given z is simply,

$$\begin{aligned} q_\theta (T|z) = \prod _{t=1}^{T_{max}} p(r_t|f_t) \end{aligned}$$
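The sketch below illustrates this sampling loop in PyTorch: a GRU cell autoregressively predicts logits over the K rules, a LIFO stack of unexpanded non-terminals produces the mask \(m_t\), a rule is drawn from the masked multinomial above, and the per-step probabilities are accumulated into \(\log q_\theta (T|z)\). The toy rule set, network sizes and variable names are illustrative assumptions, not the paper's architecture or hyper-parameters.

```python
# A minimal sketch of masked autoregressive rule sampling with a LIFO stack.
import torch
import torch.nn as nn

RULES = [("Scene", ["bg", "Digits"]),
         ("Digits", ["Digit", "Digits"]), ("Digits", []),
         *[("Digit", [str(d)]) for d in range(10)]]
K, T_MAX, HID = len(RULES), 30, 128
NONTERMS = {lhs for lhs, _ in RULES}

class StructureSampler(nn.Module):
    def __init__(self):
        super().__init__()
        self.rule_embed = nn.Embedding(K + 1, HID)   # index K is a "start" token
        self.cell = nn.GRUCell(HID, HID)
        self.head = nn.Linear(HID, K)
        self.z_to_h = nn.Linear(HID, HID)

    def forward(self, z):
        h = torch.tanh(self.z_to_h(z))               # hidden state initialized from z
        prev = torch.tensor(K)                       # start token
        stack, rules, log_q = ["Scene"], [], torch.tensor(0.0)
        for _ in range(T_MAX):
            if not stack:                            # complete string: stop early
                break
            lhs = stack.pop()                        # LIFO pop of a non-terminal
            h = self.cell(self.rule_embed(prev)[None], h[None])[0]
            f = self.head(h)                         # logits f_t over all K rules
            mask = torch.tensor([1.0 if l == lhs else 0.0 for l, _ in RULES])
            scores = torch.exp(f - f.max())          # numerically stable exp(f_t)
            probs = mask * scores / (mask * scores).sum()   # masked multinomial
            k = torch.multinomial(probs, 1).item()   # sample rule r_t
            log_q = log_q + probs[k].log()           # accumulate log q_theta(T | z)
            rules.append(k)
            for sym in reversed(RULES[k][1]):        # push new non-terminals so the
                if sym in NONTERMS:                  # leftmost one is expanded next
                    stack.append(sym)
            prev = torch.tensor(k)
        return rules, log_q, len(stack) == 0         # rules, log-prob, completed?
```

A structure would then be drawn as `rules, log_q, complete = StructureSampler()(torch.randn(HID))`, with the `complete` flag feeding the termination handling described in Sect. 3.3.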

Putting it together, images are generated by sampling a scene structure \(T \sim q_\theta (\cdot | z)\) from the model, followed by sampling parameters for every node in the scene \(\alpha \sim q(\cdot | T)\) and rendering an image \(v' = R(T, \alpha ) \sim q_I\). For some \(v' \sim q_I\) with parameters \(\alpha \) and structure T, we assume,

$$\begin{aligned} q_I(v'|z) = q(\alpha | T)q_\theta (T|z) \end{aligned}$$

3.3 Training

Training such a generative model is commonly done using variational inference [33, 52] or by optimizing a measure of distribution similarity [22, 30, 39, 41]. Variational inference allows using reconstruction-based objectives by introducing an approximate learnt posterior. Our attempts at using variational inference to train this model failed due to the complexity arising from discrete sampling and from having a renderer in the generative process. Moreover, the recognition network here would amount to doing inverse graphics – an extremely challenging problem [36] in itself. Hence, we optimize a measure of distribution similarity between the generated and target data. We do not explore using a trained critic because of the clear visual discrepancy between rendered and real images, which a critic can exploit. Moreover, adversarial training is known to be notoriously difficult for discrete data. We note that recent work [19, 47] has succeeded in adversarially training a generative model of discrete brush strokes with reinforcement learning (RL), by carefully limiting the critic's capacity. We similarly employ RL to train our discrete generative model of scene graphs. While two-sample tests such as MMD [23] have been used in previous work to estimate and minimize the distance between two empirical distributions [15, 30, 39, 41], training with MMD and RL resulted in credit-assignment issues, as MMD is a single score for the similarity of two full sets (batches) of data. Instead, our metric can be computed for every sample, which greatly helps training, as shown empirically in Sect. 4.

Distribution Matching: We train the generative model to match the distribution of real-data features in the latent space of some feature extractor \(\varphi \). We define the real feature distribution \(p_f\) s.t. \(F \sim p_f \iff F = \varphi (v)\) for some \(v \sim p_I\). Similarly, we define the generated feature distribution \(q_f\) s.t. \(F \sim q_f \iff F = \varphi (v)\) for some \(v \sim q_I\). We accomplish distribution matching by approximately computing \(p_f, q_f\) from samples and minimizing the KL divergence \(KL(q_f \,\|\, p_f)\). Our training objective is

$$\begin{aligned} \min _\theta \; KL(q_f \,\Vert \, p_f) \;=\; \min _\theta \; \mathbb {E}_{F \sim q_f}[\log q_f(F) - \log p_f(F)] \end{aligned}$$

Using the feature distribution definition above, we have the equivalent objective

$$\begin{aligned} \min _\theta \mathbb {E}_{v \sim q_I} [\log q_f(\varphi (v)) - \log p_f(\varphi (v))] \end{aligned}$$
(1)

The true underlying feature distributions \(q_f\) and \(p_f\) are usually intractable to compute. We use approximations \(\tilde{q}_f(F)\) and \(\tilde{p}_f(F)\), computed using kernel density estimation (KDE). Let \(V = \{v_1, \ldots , v_l\}\) and \(B = \{v'_1, \ldots , v'_m\}\) be batches of real and generated images, respectively. KDE with B and V to estimate \(q_f\) and \(p_f\), respectively, yields

$$\begin{aligned} \tilde{q}_f(F) = \frac{1}{m} \sum _{j = 1}^m K_{H}(F - \varphi (v_j'))\\ \tilde{p}_f(F) = \frac{1}{l} \sum _{j = 1}^l K_{H}(F - \varphi (v_j)) \end{aligned}$$

where \(K_H\) is the standard multivariate normal kernel with bandwidth matrix H. We use \(H = dI\) where d is the dimensionality of the feature space.
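As a concrete sketch (in numpy, with illustrative function names), the per-sample term \(\log \tilde{q}_f(F) - \log \tilde{p}_f(F)\) can be computed directly from the two feature batches using the Gaussian kernel with \(H = dI\):

```python
# A minimal sketch of the per-sample score used for training.
import numpy as np

def log_kde(F, bank, d):
    """log of a Gaussian KDE with bandwidth matrix H = d*I, evaluated at F."""
    sq = ((bank - F) ** 2).sum(axis=1)           # squared distances to each feature
    log_kernel = -0.5 * sq / d - 0.5 * d * np.log(2 * np.pi * d)
    m = log_kernel.max()                         # stable log-mean-exp
    return m + np.log(np.exp(log_kernel - m).mean())

def per_sample_scores(gen_feats, real_feats):
    """log q~_f(F) - log p~_f(F) for every generated feature (lower is better)."""
    d = gen_feats.shape[1]
    return np.array([log_kde(F, gen_feats, d) - log_kde(F, real_feats, d)
                     for F in gen_feats])
```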

Our generative model involves making a discrete (non-differentiable) choice at each step, leading us to optimize our objective using reinforcement learning techniques. Specifically, using the REINFORCE [67] score-function estimator along with a moving-average baseline, we approximate the gradient of Eq. 1 as

$$\begin{aligned} \nabla _\theta \mathcal {L} \approx \frac{1}{m} \sum _{j = 1}^m (\log \tilde{q}_f(\varphi (v_j')) - \log \tilde{p}_f(\varphi (v_j'))) \nabla _\theta \log q_I (v_j') \end{aligned}$$
(2)

where m is the size of the generated batch, and \(\tilde{q}_f(F)\) and \(\tilde{p}_f(F)\) are the density estimates defined above.
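A minimal PyTorch sketch of the resulting update follows: the per-sample score is treated as a cost, a moving-average baseline is subtracted for variance reduction, and differentiating the surrogate loss reproduces the gradient in Eq. 2. The function and variable names are illustrative, not the paper's implementation.

```python
# A minimal sketch of REINFORCE with a moving-average baseline.
import torch

baseline = 0.0            # exponential moving average of the per-sample score

def reinforce_step(log_q_v, scores, optimizer, momentum=0.9):
    """`scores` = log q~_f - log p~_f per sample (cost to minimize),
    `log_q_v` = log q_I(v') per generated sample (requires grad)."""
    global baseline
    scores = scores.detach()
    centered = scores - baseline                  # baseline for variance reduction
    surrogate = (centered * log_q_v).mean()       # gradient of this matches Eq. 2
    optimizer.zero_grad()
    surrogate.backward()
    optimizer.step()                              # gradient descent on the objective
    baseline = momentum * baseline + (1 - momentum) * scores.mean().item()
    return surrogate.item()
```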

Notice that the gradient above requires computing the marginal probability \(q_I(v')\) of a generated image \(v'\), instead of the conditional \(q_I(v'|z)\). Computing the marginal probability of a generated image requires an intractable marginalization over the latent variable z. To circumvent this, we restrict z to a fixed finite set Z from which it is sampled uniformly, enabling easy marginalization. This translates to,

$$\begin{aligned} q_\theta (T)&= \frac{1}{|Z|} \sum _{z \in Z} q_\theta (T|z)\\ q_I(v')&= q(\alpha | T)q_\theta (T) \end{aligned}$$

We find that this still has enough modeling capacity, since there are only finitely many scene graphs of a maximum length \(T_{max}\) that can be sampled from the grammar. Empirically, we find using one latent vector to be enough in our experiments. Essentially, stochasticity in the rule sampling makes up for lost stochasticity in the latent space.

Pretraining is an essential step for our method. In every experiment, we define a simple handcrafted prior on scene structure. For example, a simple prior could be to put one car on one road in a driving scene. We pre-train the model by sampling strings (scene graphs) from the grammar prior, and training the model to maximize the log-likelihood of these scene graphs. We provide specific details about the priors used in Sect. 4.
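A sketch of this pre-training step is given below, assuming a hypothetical `model.log_prob(seq, z)` helper that scores a fixed rule sequence under the same masked softmax used for sampling (i.e., teacher forcing):

```python
# A minimal sketch of pre-training on scene graphs sampled from the grammar prior.
import torch

def pretrain_step(model, prior_sequences, z, optimizer):
    # maximize the log-likelihood of prior-sampled rule sequences
    nll = -torch.stack([model.log_prob(seq, z) for seq in prior_sequences]).mean()
    optimizer.zero_grad()
    nll.backward()
    optimizer.step()
    return nll.item()
```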

Feature Extraction for distribution matching is a crucial step since the features need to capture structural scene information such as the number of objects and their contextual spatial relationships for effective training. We describe the feature extractor used and its training for each experiment in Sect. 4.

Ensuring Termination: During training, sampling can terminate with an incomplete string after \(T_\text {max}\) steps. We therefore repeatedly sample until we obtain a complete scene graph T whose length is at most \(T_\text {max}\). To ensure that we do not require too many attempts, we record the rejection rate \(r_\text {reject}(F)\) of a sampled feature F as the average number of failed sampling attempts when sampling the single scene graph used to generate F. We set a threshold \(\epsilon \) on \(r_\text {reject}(F)\) (the maximum allowable number of rejections) and a weight \(\lambda \), and add a penalty to our original loss:

$$\begin{aligned} \mathcal {L}' = \mathbb {E}_{F \sim q_f}[\log q_f(F) - \log p_f(F) + \lambda \mathbf {1}_{(\epsilon ,\infty )}(r_\text {reject}(F))] \end{aligned}$$

We found that \(\lambda = 10^{-2}\) and \(\epsilon = 1\) worked well for all of our experiments.
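The bookkeeping can be as simple as the following sketch, which resamples until a complete string is obtained and returns the indicator penalty to be added to the per-sample loss (function names are illustrative):

```python
# A minimal sketch of rejection handling with the indicator penalty.
def sample_complete(model, z, lam=1e-2, eps=1):
    failed = 0
    while True:
        rules, log_q, complete = model(z)     # sampler from Sect. 3.2
        if complete:
            penalty = lam if failed > eps else 0.0
            return rules, log_q, penalty      # penalty is added to the per-sample loss
        failed += 1                           # incomplete string: reject and resample
```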

4 Experiments

We show two controlled experiments, on the MNIST dataset  [38] (Sect. 4.1) and on synthetic aerial imagery  [30] (Sect. 4.2), where we showcase the ability of our model to learn synthetic structure distributions unsupervised. Finally, we show an experiment on generating 3D driving scenes (Sect. 4.3), mimicking structure distributions on the KITTI  [21] driving dataset and showing the performance of an object detector trained on our generated data. The renderers used in each experiment are adapted from  [30]. For each experiment, we first discuss the corresponding scene grammar. Then, we discuss the feature extractor and its training. Finally, we describe the structure prior used to pre-train the model, the target data, and show results on learning to mimic structures in the target data without any access to ground-truth structures. Additionally, we show comparisons with learning with MMD  [23] (Sect. 4.1) and show how our model can learn to generate context-dependent scene graphs from the grammar (Sect. 4.2).

4.1 Multi MNIST

We first evaluate our approach on a toy example of learning to generate scenes with multiple digits. The grammar defining the scene structure is:

$$\begin{aligned} \text {Scene} \rightarrow bg \ \text {Digits}, \quad \text {Digits} \rightarrow \text {Digit} \ \text {Digits} \ | \ \epsilon , \quad \text {Digit} \rightarrow 0 \ | \ 1 \ | \ 2 \ | \ \cdots \ | \ 9 \end{aligned}$$

Sampled digits are placed onto a black canvas of size \(256\,\times \,256\).

Fig. 4.

Prior (Left) and Validation (Right) example for MultiMNIST experiments

Fig. 5.

Prior (Left) and Validation (Right) example for Aerial 2D experiments

Fig. 6.

Distributions of classes and number of digits, in the prior, learned and target scene structures

Fig. 7.

Distributions of classes and number of digits, comparing learning with MMD, ours and the target

Feature Extraction Network: We train a network to predict the binary presence of each digit class in the scene. We use a Resnet [26] made up of three residual blocks, each containing two \(3\,\times \,3\) convolutional layers, to produce an image embedding, followed by three fully connected layers that make the prediction from the embedding. We use the Resnet embeddings as our image features. We train the network on synthetic data generated by our simple prior for both structure and continuous parameters, using a binary cross-entropy criterion for each class. The exact prior and target data used are explained below.
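The sketch below gives one possible PyTorch instantiation of this network; the channel widths, pooling and hidden sizes are our assumptions, as the text only fixes the three residual blocks, the 3×3 convolutions, the three FC layers and the BCE objective.

```python
# A minimal sketch of the MultiMNIST digit-presence feature extractor.
import torch
import torch.nn as nn

class ResBlock(nn.Module):
    def __init__(self, c):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(c, c, 3, padding=1), nn.BatchNorm2d(c), nn.ReLU(),
            nn.Conv2d(c, c, 3, padding=1), nn.BatchNorm2d(c))
    def forward(self, x):
        return torch.relu(x + self.body(x))     # residual connection

class DigitPresenceNet(nn.Module):
    def __init__(self, c=32, n_classes=10):
        super().__init__()
        self.stem = nn.Conv2d(1, c, 3, padding=1)
        self.blocks = nn.Sequential(ResBlock(c), nn.MaxPool2d(2),
                                    ResBlock(c), nn.MaxPool2d(2),
                                    ResBlock(c), nn.AdaptiveAvgPool2d(1))
        self.head = nn.Sequential(nn.Linear(c, 64), nn.ReLU(),
                                  nn.Linear(64, 64), nn.ReLU(),
                                  nn.Linear(64, n_classes))
    def forward(self, x):
        emb = self.blocks(self.stem(x)).flatten(1)   # image embedding = phi(v)
        return emb, self.head(emb)                   # features, per-class logits

criterion = nn.BCEWithLogitsLoss()   # binary presence of each digit class
```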

Prior and Target Data: We sample the number of digits in the scene \(n_d\) uniformly from 0 to 10, and sample \(n_d\) digit classes uniformly to place on the scene. The digits are placed (parameters) uniformly on the canvas. The target data has upright digits arranged in a straight line in the middle of the canvas. Figure 4 shows example prior samples and target data. We show that we can learn scene structures even with a remaining gap in the parameters, by using the parameter prior during training.
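For concreteness, a numpy sketch of the two samplers is given below; rendering onto the 256×256 canvas is elided, and the target digit-count distribution, rotations and exact placements are assumptions made for illustration.

```python
# A minimal sketch of the prior and target structure/parameter samplers.
import numpy as np

def sample_prior_scene(rng):
    n_d = rng.integers(0, 11)                          # 0..10 digits
    classes = rng.integers(0, 10, size=n_d)            # uniform digit classes
    xy = rng.uniform(0, 256, size=(n_d, 2))            # uniform placement on canvas
    rot = rng.uniform(0, 360, size=n_d)                # assumed random rotation
    return classes, xy, rot

def sample_target_scene(rng):
    n_d = rng.integers(0, 11)                          # assumed target count distribution
    classes = rng.integers(0, 10, size=n_d)
    xs = np.linspace(40, 216, num=max(n_d, 1))[:n_d]   # evenly spaced in a row
    xy = np.stack([xs, np.full(n_d, 128.0)], axis=1)   # centered vertically
    rot = np.zeros(n_d)                                # upright digits
    return classes, xy, rot

rng = np.random.default_rng(0)
print(sample_prior_scene(rng))
```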

We attempt to learn a random distribution over the number of digits and their classes in the scene. Figure 6 shows the prior, target and learnt distributions of the number of digits and their classes. We see that our model can faithfully approximate the target, even while learning unsupervised. We also train with MMD [23], computed between two batches of real and generated images and used as the reward for every generated scene. Figure 7 shows that using MMD results in the model learning a smoothed approximation of the target distribution, which stems from the lack of per-sample credit assignment in the MMD score that our objective provides.

4.2 Aerial 2D

Next, we evaluate our approach on a harder synthetic scenario of aerial views of driving scenes. The grammar and the corresponding rendered scenes offer additional complexity to test the model. The grammar here is as follows:

$$\begin{aligned}&\text {Scene} \rightarrow \text {Roads}, \quad&\text {Roads} \rightarrow \text {Road} \ \text {Roads} \ | \ \epsilon \\&\text {Road} \rightarrow \text {Cars}, \quad&\text {Cars} \rightarrow car \ \text {Cars} \ | \ \epsilon \end{aligned}$$

Feature Extraction Network: We use the same Resnet [26] architecture from the MNIST experiment, with the FC layers outputting the number of cars, roads, houses and trees in the scene as one-hot labels. We train by minimizing the cross-entropy over these labels on samples generated from the prior.

Fig. 8.

#cars distribution learned in the Aerial 2D experiment. We can learn context-dependent relationships, placing different numbers of cars on different roads

Prior: We sample the number of roads \(n_r \in [0, 4]\) uniformly. On each road, we sample \(c \in [0, 8]\) cars uniformly. Roads are placed sequentially by sampling a random distance d and placing the road d pixels in front of the previous one. Cars are placed on the road with uniform random position and rotation (Fig. 5).

Learning Context-Dependent Relationships: For the target dataset, we sample the number of roads \(n_r \in [0, 4]\) with probabilities (0.05, 0.15, 0.4, 0.4). On the first road we sample \(n_1 \sim \text {Poisson}(9)\) cars and \(n_i \sim \text {Poisson}(3)\) cars for each of the remaining roads. All cars are placed well spaced on their respective road. Unlike the Multi-MNIST experiment, these structures cannot be modelled by a Probabilistic-CFG, and thus by   [30, 37]. We see that our model can learn this context-dependent distribution faithfully as well in Fig. 8.
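A numpy sketch of this target structure sampler is shown below; the text lists four probabilities for the number of roads, so mapping them to one through four roads is our assumption.

```python
# A minimal sketch of the context-dependent Aerial 2D target structure.
import numpy as np

def sample_target_structure(rng):
    # four probabilities mapped to 1..4 roads (assumption)
    n_roads = rng.choice([1, 2, 3, 4], p=[0.05, 0.15, 0.4, 0.4])
    # first road gets Poisson(9) cars, every other road Poisson(3) cars
    cars_per_road = [rng.poisson(9 if i == 0 else 3) for i in range(n_roads)]
    return n_roads, cars_per_road

rng = np.random.default_rng(0)
print(sample_target_structure(rng))   # e.g. (3, [10, 4, 2])
```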

4.3 3D Driving Scenes

We experiment on the KITTI [21] dataset, which was captured with a camera mounted on top of a car driving around the city of Karlsruhe, Germany. The dataset contains a wide variety of road scenes, ranging from urban traffic scenarios to highways and more rural neighborhoods. We utilize the same grammar and renderer used for road scenes in [30]. Our model, although trained unsupervised, learns to get closer to the underlying structure distribution, improving image-generation metrics and the performance of a downstream task model (Fig. 9).

Fig. 9.

Generated images (good prior expt.). (Left) Using both the structure and parameter prior, (Middle) Using our learnt structure and parameters from  [30], (Right) Real KITTI samples. Our model (middle), although unsupervised, adds diverse scene elements like vegetation, pedestrians, signs etc. to better resemble the real dataset.

Fig. 10.

#cars/scene learned from the good prior (left) and the simple prior (right) on KITTI

Prior and Training: Following SDR  [48], we define three different priors to capture three different modes in the KITTI dataset. They are the ‘Rural’, ‘Suburban’ and ‘Urban’ scenarios, as defined in  [48]. We train three different versions of our model, one for each of the structural priors, and sample from each of them uniformly. We use the scene parameter prior and learnt scene parameter model from  [30] to produce parameters for our generated scene structures to get the final scene graphs, which are rendered and used for our distribution matching.

Feature Extraction Network: We use the pool-3 layer of an Inception-V3 network, pre-trained on the ImageNet  [13] dataset as our feature extractor. Interestingly, we found this to work as well as using features from Mask-RCNN  [25] trained on driving scenes.
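A sketch of the feature extraction with torchvision follows; treating the final average-pool output of Inception-V3 as the 2048-d "pool-3" features is a common convention (as in FID) that we assume here, and the weight-loading argument and preprocessing choices differ across torchvision versions.

```python
# A minimal sketch of extracting 2048-d Inception features for distribution matching.
import torch
from torchvision import models, transforms

inception = models.inception_v3(weights="IMAGENET1K_V1").eval()
feats = {}
inception.avgpool.register_forward_hook(
    lambda m, i, o: feats.update(pool3=o.flatten(1)))   # capture pooled features

preprocess = transforms.Compose([
    transforms.Resize((299, 299)), transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])])

@torch.no_grad()
def extract_features(images):                           # list of PIL images
    batch = torch.stack([preprocess(im) for im in images])
    inception(batch)                                    # hook fills feats["pool3"]
    return feats["pool3"]                               # shape [N, 2048]
```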

Distribution Similarity Metrics: In generative modeling of images, the Fréchet Inception Distance [27] and the Kernel Inception Distance [5] have been used to measure progress. We report FID and KID between our generated synthetic dataset and the KITTI-train set in Tables 1 and 2, computed on the pool-3 features of an Inception-v3 network using 10K generated synthetic samples and the full KITTI-train set. Figure 10 (left) shows the distribution of the number of cars generated by the prior, by the learnt model and in the KITTI dataset (since we have GT for cars). We do not have ground truth for which KITTI scenes could be classified as rural/suburban/urban, so we compare against the global distribution of the whole dataset. We observe that the model bridges the gap in this particular distribution well after training.
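For reference, a numpy/scipy sketch of both metrics computed from such features is shown below (FID from Gaussian moment matching, KID as an unbiased MMD estimate with the standard cubic kernel); this mirrors the usual definitions rather than the exact implementation used in the paper.

```python
# A minimal sketch of FID and KID from two feature matrices of shape [N, d].
import numpy as np
from scipy import linalg

def fid(real, gen):
    mu_r, mu_g = real.mean(0), gen.mean(0)
    cov_r, cov_g = np.cov(real, rowvar=False), np.cov(gen, rowvar=False)
    covmean = linalg.sqrtm(cov_r @ cov_g).real          # matrix square root
    return ((mu_r - mu_g) ** 2).sum() + np.trace(cov_r + cov_g - 2 * covmean)

def kid(real, gen):
    d = real.shape[1]
    k = lambda a, b: (a @ b.T / d + 1.0) ** 3           # cubic polynomial kernel
    m, n = len(real), len(gen)
    k_rr = (k(real, real).sum() - np.trace(k(real, real))) / (m * (m - 1))
    k_gg = (k(gen, gen).sum() - np.trace(k(gen, gen))) / (n * (n - 1))
    k_rg = k(real, gen).mean()
    return k_rr + k_gg - 2 * k_rg                       # unbiased MMD^2 estimate
```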

Fig. 11.

Generated images (simple prior expt.). (Left) Using both the structure and parameter prior, (Middle) Using our learnt structure and parameters from  [30], (Right) Real samples from KITTI. Our model, although trained unsupervised, learns to add an appropriate frequency and diversity of scene elements to resemble the real data, even when trained from a very weak prior.

Task Performance: We report average precision for detection at 0.5 IoU i.e. AP@0.5 (following  [30]) of an object detector trained to convergence on our synthetic data and tested on the KITTI validation set. We use the detection head from Mask-RCNN [25] with a Resnet-50-FPN backbone initialized with pre-trained ImageNet weights as our object detector. The task network in each result row of Table 1 is finetuned from the snapshot of the previous row.  [30] show results with adding Image-to-Image translation to the generated images to reduce the appearance gap and results with training on a small amount of real data. We omit those experiments here and refer the reader to their paper for a sketch of expected results in these settings. Training this model directly on the full KITTI training set obtains AP@0.5 of \(81.52 (\text {easy})\), \(83.58 (\text {medium})\) and \(84.48 (\text {hard})\), denoting a large sim-to-real performance gap left to bridge.

Using a Simple Prior: The structure priors in the previous experiments were taken from [48]. These priors already involve some human intervention, which we aim to minimize. Therefore, we repeat the experiments above with a very simple and quick-to-create prior on the scene structure, where a few instances of each kind of object (car, house etc.) are placed in the scene (see Fig. 11 (Left)). [30] requires a decently crafted structure prior to train the parameter network. Thus, we use the prior parameters while training our structure generator in this experiment (showing the robustness of training with randomized prior parameters), and learn the parameter network later (Table 2). Figure 10 (right) shows that the method learned the distribution of the number of cars well (unsupervised), even when initialized from a bad prior. Notice that the FID/KID of the model learnt from the simple prior in Table 2 is comparable to that trained from a tuned prior in Table 1, which we believe is an exciting result.

Table 1. AP@0.5 on KITTI-val and distribution similarity metrics between generated synthetic data and KITTI-train. Learnt parameters are used from  [30]. *Results from   [30] are our reproduced numbers, and we show learning the structure additionally helps close the distribution gap and improves downstream task performance.
Table 2. Repeat of experiments in Table 1 with a *simple prior on the scene structure. Parameters are learnt using  [30]. We observe a significant boost in both task performance and distribution similarity metrics, by learning the structure and parameters.

Discussion: We noticed that our method worked better when initialized with more spread-out priors than with more localized priors (Tables 1 and 2, Fig. 10). We hypothesize that this is because the distribution-matching metric we use is the reverse-KL divergence between the generated and real data (feature) distributions, which is mode-seeking rather than mode-covering. Therefore, an initialization with a narrow distribution around one of the modes has little incentive to move away from it, hampering learning. Even then, we see a significant improvement in performance when starting from a peaky prior, as shown in Table 2. We also note the importance of pre-training the task network. Rows in Table 1 and Table 2 were finetuned from the checkpoint of the previous row. The first row (Prob. Grammar) is a form of Domain Randomization [48, 62], which has been shown to be crucial for sim-to-real adaptation. Our method, in essence, reduces the randomization in the generated scenes (by learning to generate scenes similar to the target data), and we observe that progressively training the task network with our (more specialized) generated data improves its performance. [1, 66] show the opposite behavior, where increasing randomization (or environment difficulty) throughout task training results in improved performance. A detailed analysis of this phenomenon is beyond the current scope and is left for future work.

5 Conclusion

We introduced an approach for the unsupervised learning of a generative model of synthetic scene structures by optimizing for visual similarity to real data. Inferring scene structures is known to be notoriously hard even when annotations are provided. Our method performs the generative side of this task without any ground-truth information. Experiments on two toy datasets and one real dataset showcase the ability of our model to learn a plausible posterior over scene structures, significantly improving over manually designed priors. Our current method needs to optimize for both the scene structure and the parameters of a synthetic scene generator in order to produce good results. This process has many moving parts and is generally cumbersome to make work in a new application scenario. Reducing this burden, for example by learning the grammar itself, requires further investigation and opens an exciting direction for future work.