1 Introduction

Synthetic datasets offer an appealing opportunity for training machine learning models, e.g., for perception and planning in driving [18, 53, 55], indoor scene perception [46, 57], and robotic control [61]. Rendered with graphics engines, synthetic datasets come with perfect ground truth for tasks in which labels are expensive or even impossible to obtain, such as segmentation, depth or material information. Adding a new type of label to a synthetic dataset is as simple as calling the renderer again, rather than embarking on a time-consuming annotation endeavor that requires new tooling as well as hiring, training and overseeing annotators.

Creating synthetic datasets comes with its own hurdles. While content such as the 3D CAD models that make up a scene is available on online asset stores, artists must write complex procedural models that synthesize scenes by placing these assets in realistic layouts. This often requires browsing through massive amounts of real imagery to carefully tune a procedural model – a time-consuming task. For scenarios such as street scenes, creating synthetic scenes relevant to one city may require re-tuning, from scratch, a procedural model made for another city. In this paper, we propose an automatic method to carry out this task (Fig. 1).

Fig. 1.

We present a method that learns to generate synthetic scenes from real imagery in an unsupervised fashion. It does so by learning a generative model of scene structure, samples from which (with additional scene parameters) can be rendered to create synthetic images and labels.

Recently, Meta-Sim [30] proposed to optimize scene parameters in a synthetically generated scene by exploiting the visual similarity of (rendered) generated synthetic data with real data. They represent scene structure and parameters in a scene graph, and generate data by sampling a random scene structure (and parameters) from a given probabilistic grammar of scenes and then modifying the scene parameters using a learnt model. Since they only learn scene parameters, a sim-to-real gap in the scene structure remains. For example, one would likely find more cars, people and buildings in Manhattan than in a quaint Italian village. Other work on generative models of structured data such as graphs and grammar strings [12, 17, 37, 42] requires large amounts of ground-truth data to learn to generate realistic samples. However, scene structures are extremely cumbersome to annotate and are thus not available in most real datasets.

In this paper, we propose a procedural generative model of synthetic scenes that is learned unsupervised from real imagery. We generate scene graphs object-by-object by learning to sample rule expansions from a given probabilistic scene grammar, and generate scene parameters using [30]. Learning without supervision here is a challenging problem due to the discrete nature of the scene structures we aim to generate and the presence of a non-differentiable renderer in the generative process. To this end, we propose a feature-space divergence to compare (rendered) generated scenes with real scenes, which can be computed per scene and is key to enabling credit assignment when training with reinforcement learning.

We evaluate our method on two synthetic datasets and a real driving dataset, and find that our approach significantly reduces the distribution gap between scene structures in our generated and target data, improving over human priors on scene structure by learning to closely align with the target structure distributions. On the real driving dataset, starting from minimal human priors, we show that we can almost exactly recover the structural distribution of the real target scenes (measured using the GT annotations available for cars) – an exciting result given that the model is trained without any labels. We show that an object detector trained on our generated data outperforms those trained on data generated with human priors or by [30], and show improved distribution-similarity measures between our generated rendered images and real data.

2 Related Work

2.1 Synthetic Content Creation

Synthetic content creation has been receiving significant interest as a promising alternative to dataset collection and annotation. Various works have proposed generating synthetic data for tasks such as perception and planning in driving  [2, 14, 18, 48, 53, 55, 68], indoor scene perception  [4, 24, 46, 57, 59, 70, 75], game playing  [6, 28], robotic control  [6, 56, 61, 63], optical flow estimation  [7, 58], home robotics  [20, 34, 49] amongst many others, utilizing procedural modeling, existing simulators or human generated scenarios.

Learnt Scene Generation brings a data-driven nature to scene generation. [64, 74] propose learning hierarchical spatial priors between furniture items, which are integrated into a hand-crafted cost used to generate optimized indoor scene layouts. [50] similarly learn to synthesize indoor scenes using a probabilistic scene grammar and human-centric learning by leveraging affordances. [64] learn to generate intermediate object-relationship graphs and instantiate scenes conditioned on them. [76] use a scene-graph representation and learn to add objects into existing scenes. [40, 54, 65] propose methods for learning deep priors from data for indoor scene synthesis. [16] introduce a generative model that sequentially adds objects into scenes, while [29] propose a generative model for 2D object layouts given a label set. [60] generate batches of data with a neural network that is used to train a task model, and learn by differentiating through the learning process of the task model. [30] propose learning to generate scenes by modifying the parameters of objects in scenes sampled from a probabilistic scene grammar. We argue that this ignores the structural aspects of the scene, which we focus on in our work. Similar to [16, 30], and contrary to other works, we learn this in an unsupervised manner, i.e., given only target images as input.

Fig. 2.

Example scene graph (structure and parameters) and depiction of its rendering

Learning with Simulators: Methods in approximate Bayesian inference have looked into inferring the parameters of a simulator that generated a particular data point [35, 45]. [11] provide a great overview of advances in simulator-based inference. Instead of running inference per scene [35, 69], we aim to generate new data that resembles a target distribution. [44] learn to optimize non-differentiable simulators using a variational upper bound of a GAN-like objective. [8] learn to optimize simulator parameters for robotic control tasks by directly comparing trajectories between a real and a simulated robot. [19, 47] train an agent to paint images using brush strokes in an adversarial setting with reinforcement learning. We learn to generate discrete scene structures constrained to a grammar, while optimizing a distribution-matching objective (with reinforcement learning) instead of training adversarially. Compared to [47], we generate large and complex scenes, as opposed to images of single objects or faces.

2.2 Graph Generation

Generative models of graphs and trees [3, 10, 17, 42, 43, 73] generally produce graphs with richer structure and more flexibility than grammar-based models, but often fail to produce syntactically correct graphs in settings with a defined syntax, such as programs and scene graphs. Grammar-based methods have been used for a variety of tasks such as program translation [9], conditional program generation [71, 72], grammar induction [32] and generative modelling of structures with syntax [12, 37], such as molecules. These methods, however, assume access to ground-truth graph structures for learning. We take inspiration from these methods, but show how to learn our model unsupervised, i.e., without any ground-truth scene graph annotations.

3 Methodology

We aim to learn a generative model of synthetic scenes. In particular, given a dataset of real imagery \(X_R\), the problem is to create synthetic data \(D(\theta ) = (X(\theta ), Y(\theta ))\) of images \(X(\theta )\) and labels \(Y(\theta )\) that is representative of \(X_R\), where \(\theta \) represents the parameters of the generative model. We exploit advances in graphics engines and rendering by stipulating that the synthetic data D is the output of creating an abstract scene representation and rendering it with a graphics engine. Rendering ensures that low-level pixel information in \(X(\theta )\) (and its corresponding annotation \(Y(\theta )\)) does not need to be modelled, which has been the focus of recent research in generative modeling of images [31, 51]. Ensuring the semantic validity of sampled scenes requires imposing constraints on their structure. Scene grammars use a set of rules to greatly reduce the space of scenes that can be sampled, making learning a more structured and tractable problem. For example, a grammar can explicitly enforce that a car can only be on a road, a constraint that then need not be learned implicitly; this leads us to use probabilistic scene grammars. Meta-Sim [30] sampled scene-graph structures (see Fig. 2) from a prior imposed on a Probabilistic Context-Free Grammar (PCFG), which we call the structure prior. They sampled parameters for every node in the scene graph from a parameter prior and learned to predict new parameters for each node, keeping the structure intact. Their generated scenes therefore come from a structure prior (which is context-free) and the learnt parameter distribution, leaving a sim-to-real gap in the scene structures untackled. In our work, we aim to alleviate this by learning, unsupervised from images, a context-dependent distribution over synthetic scene structures.

We utilize scene graphs as our abstract scene representation, that are rendered into a corresponding image with labels (Sect. 3.1). Our generative model sequentially samples expansion rules from a given probabilistic scene grammar (Sect. 3.2) to generate a scene graph which is rendered. We train the model unsupervised and with reinforcement learning, using a feature-matching based distribution divergence specifically designed to be amenable to our setting (Sect. 3.3).

3.1 Representing Synthetic Scenes

In Computer Graphics and Vision, Scene Graphs are commonly used to describe scenes in a concise hierarchical manner, where each node describes an object in the scene along with its parameters such as the 3D asset, pose etc. Parent-child relationships define the child’s parameters relative to its parent, enabling easy scene editing and manipulation. Additionally, camera, lighting, weather etc. are also encoded into the scene graph. Generating corresponding pixels and annotations amounts to placing objects into the scene in a graphics engine and rendering with the defined parameters (see Fig. 2).

Fig. 3.

Representation of our generative process for a scene graph. The logits and mask are of shape \(T_{max} \times K\). Green represents a higher value and red a lower one. At every time step, we autoregressively sample a rule and predict the logits for the next rule conditioned on the sample (capturing context dependencies). The figure on the right shows how sampled rules from the grammar are converted into a graph structure (only renderable objects are kept from the full grammar string). Parameters for every node can be sampled from a prior or optionally learnt with the method of [30]. A generated scene graph can be rendered as shown in Fig. 2. (Color figure online)

Notation: A context-free grammar G is defined by a list of symbols (terminal and non-terminal) and expansion rules. Each non-terminal symbol has at least one expansion rule into a new set of symbols. Sampling from the grammar involves expanding a start symbol until only terminal symbols remain. We denote the total number of expansion rules in a grammar G as K. We define scene grammars and represent strings sampled from the grammar as scene graphs following [30, 48] (see Fig. 3). For each scene graph, a structure T is sampled from the grammar G, followed by sampling corresponding parameters \(\alpha \) for every node in the graph.
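To make the notation concrete, the sketch below encodes a toy scene grammar as a Python dictionary mapping each non-terminal to its list of expansion rules, and expands the start symbol until only terminal symbols remain. The grammar, the uniform rule probabilities and the function names are illustrative assumptions, not the paper's implementation; here K would be the total number of listed rules.

```python
# A minimal sketch of a context-free scene grammar and naive sampling from it.
import random

GRAMMAR = {
    "Scene":  [["bg", "Digits"]],
    "Digits": [["Digit", "Digits"], []],          # [] is the epsilon rule
    "Digit":  [[str(d)] for d in range(10)],      # terminals "0" ... "9"
}

def is_nonterminal(symbol):
    return symbol in GRAMMAR

def sample_string(start="Scene"):
    """Expand `start` left-to-right until no non-terminal symbols remain."""
    symbols = [start]
    while any(is_nonterminal(s) for s in symbols):
        i = next(idx for idx, s in enumerate(symbols) if is_nonterminal(s))
        rule = random.choice(GRAMMAR[symbols[i]])  # uniform prior over rules
        symbols = symbols[:i] + rule + symbols[i + 1:]
    return symbols

print(sample_string())   # e.g. ['bg', '3', '7']
```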

3.2 Generative Model

We take inspiration from previous work on learning generative models of graphs constrained by a grammar [37] for our architecture. Specifically, we map a latent vector z to unnormalized probabilities over all possible grammar rules in an autoregressive manner, using a recurrent neural network for a maximum of \(T_{max}\) steps. Deviating from [37], we sample one rule \(r_t\) at every time step and use it to predict the logits for the next rule \(f_{t+1}\). This allows our model to capture context-dependent relationships easily, as opposed to the context-free nature of scene graphs in [30]. Given a list of at most \(T_{max}\) sampled rules, the corresponding scene graph is generated by treating each rule expansion as a node expansion in the graph (see Fig. 3).

Sampling Correct Rules: To ensure the validity of sampled rules in each time step t, we follow  [37] and maintain a last-in-first-out (LIFO) stack of unexpanded non-terminal nodes. Nodes are popped from the stack, expanded according to the sampled rule-expansion, and the resulting new non-terminal nodes are pushed to the stack. When a non-terminal is popped, we create a mask \(m_t\) of size K which is 1 for valid rules from that non-terminal and 0 otherwise. Given the logits for the next expansion \(f_t\), the probability of a rule \(r_{t,k}\) is represented as,

$$\begin{aligned} p(r_t = k \,|\, f_t) = \frac{m_{t, k} \exp (f_{t,k})}{\sum _{j=1}^{K} m_{t,j} \exp (f_{t,j})} \end{aligned}$$

Sampling from this masked multinomial distribution ensures that only valid rules are sampled as \(r_t\). Given the logits and sampled rules \((f_t, r_t) \ \forall t \in \{1, \ldots , T_{max}\}\), the probability of the corresponding scene structure T given z is simply,

$$\begin{aligned} q_\theta (T|z) = \prod _{t=1}^{T_{max}} p(r_t|f_t) \end{aligned}$$
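The sketch below illustrates this sampling loop in PyTorch: a GRU cell autoregressively predicts logits over the K rules, a LIFO stack of unexpanded non-terminals produces the mask \(m_t\), a rule is drawn from the masked multinomial above, and the per-step probabilities are accumulated into \(\log q_\theta (T|z)\). The toy rule set, network sizes and variable names are illustrative assumptions, not the paper's architecture or hyper-parameters.

```python
# A minimal sketch of masked autoregressive rule sampling with a LIFO stack.
import torch
import torch.nn as nn

RULES = [("Scene", ["bg", "Digits"]),
         ("Digits", ["Digit", "Digits"]), ("Digits", []),
         *[("Digit", [str(d)]) for d in range(10)]]
K, T_MAX, HID = len(RULES), 30, 128
NONTERMS = {lhs for lhs, _ in RULES}

class StructureSampler(nn.Module):
    def __init__(self):
        super().__init__()
        self.rule_embed = nn.Embedding(K + 1, HID)   # index K is a "start" token
        self.cell = nn.GRUCell(HID, HID)
        self.head = nn.Linear(HID, K)
        self.z_to_h = nn.Linear(HID, HID)

    def forward(self, z):
        h = torch.tanh(self.z_to_h(z))               # hidden state initialized from z
        prev = torch.tensor(K)                       # start token
        stack, rules, log_q = ["Scene"], [], torch.tensor(0.0)
        for _ in range(T_MAX):
            if not stack:                            # complete string: stop early
                break
            lhs = stack.pop()                        # LIFO pop of a non-terminal
            h = self.cell(self.rule_embed(prev)[None], h[None])[0]
            f = self.head(h)                         # logits f_t over all K rules
            mask = torch.tensor([1.0 if l == lhs else 0.0 for l, _ in RULES])
            scores = torch.exp(f - f.max())          # numerically stable exp(f_t)
            probs = mask * scores / (mask * scores).sum()   # masked multinomial
            k = torch.multinomial(probs, 1).item()   # sample rule r_t
            log_q = log_q + probs[k].log()           # accumulate log q_theta(T | z)
            rules.append(k)
            for sym in reversed(RULES[k][1]):        # push new non-terminals so the
                if sym in NONTERMS:                  # leftmost one is expanded next
                    stack.append(sym)
            prev = torch.tensor(k)
        return rules, log_q, len(stack) == 0         # rules, log-prob, completed?
```

A structure would then be drawn as `rules, log_q, complete = StructureSampler()(torch.randn(HID))`, with the `complete` flag feeding the termination handling described in Sect. 3.3.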

Putting it together, images are generated by sampling a scene structure \(T \sim q_\theta (\cdot | z)\) from the model, followed by sampling parameters for every node in the scene \(\alpha \sim q(\cdot | T)\) and rendering an image \(v' = R(T, \alpha ) \sim q_I\). For some \(v' \sim q_I\) with parameters \(\alpha \) and structure T, we assume,

$$\begin{aligned} q_I(v'|z) = q(\alpha | T)q_\theta (T|z) \end{aligned}$$

3.3 Training

Training such a generative model is commonly done using variational inference [33, 52] or by optimizing a measure of distribution similarity [22, 30, 39, 41]. Variational inference allows using reconstruction-based objectives by introducing an approximate learnt posterior. Our attempts at using variational inference to train this model failed due to the complexity arising from discrete sampling and from having a renderer in the generative process. Moreover, the recognition network here would amount to doing inverse graphics – an extremely challenging problem [36] in itself. Hence, we optimize a measure of distribution similarity between the generated and target data. We do not explore using a trained critic because of the clear visual discrepancy between rendered and real images, which a critic can exploit. Moreover, adversarial training is known to be notoriously difficult for discrete data. We note that recent work [19, 47] has succeeded in adversarially training a generative model of discrete brush strokes with reinforcement learning (RL), by carefully limiting the critic's capacity. We similarly employ RL to train our discrete generative model of scene graphs. While two-sample tests such as MMD [23] have been used in previous work to estimate and minimize the distance between two empirical distributions [15, 30, 39, 41], training with MMD and RL resulted in credit-assignment issues, as MMD is a single score for the similarity of two full sets (batches) of data. Instead, our metric can be computed for every sample, which greatly helps training, as shown empirically in Sect. 4.

Distribution Matching: We train the generative model to match the distribution of real-data features in the latent space of some feature extractor \(\varphi \). We define the real feature distribution \(p_f\) s.t. \(F \sim p_f \iff F = \varphi (v)\) for some \(v \sim p_I\). Similarly, we define the generated feature distribution \(q_f\) s.t. \(F \sim q_f \iff F = \varphi (v)\) for some \(v \sim q_I\). We accomplish distribution matching by approximately computing \(p_f, q_f\) from samples and minimizing the KL divergence \(KL(q_f \,\|\, p_f)\). Our training objective is

$$\begin{aligned} \min _\theta \; KL(q_f \,\Vert \, p_f) \;=\; \min _\theta \; \mathbb {E}_{F \sim q_f}[\log q_f(F) - \log p_f(F)] \end{aligned}$$

Using the feature distribution definition above, we have the equivalent objective

$$\begin{aligned} \min _\theta \mathbb {E}_{v \sim q_I} [\log q_f(\varphi (v)) - \log p_f(\varphi (v))] \end{aligned}$$
(1)

The true underlying feature distributions \(q_f\) and \(p_f\) are usually intractable to compute. We use approximations \(\tilde{q}_f(F)\) and \(\tilde{p}_f(F)\), computed using kernel density estimation (KDE). Let \(V = \{v_1, \ldots , v_l\}\) and \(B = \{v'_1, \ldots , v'_m\}\) be batches of real and generated images, respectively. KDE with B and V to estimate \(q_f\) and \(p_f\), respectively, yields

$$\begin{aligned} \tilde{q}_f(F) = \frac{1}{m} \sum _{j = 1}^m K_{H}(F - \varphi (v_j'))\\ \tilde{p}_f(F) = \frac{1}{l} \sum _{j = 1}^l K_{H}(F - \varphi (v_j)) \end{aligned}$$

where \(K_H\) is the standard multivariate normal kernel with bandwidth matrix H. We use \(H = dI\) where d is the dimensionality of the feature space.
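As a concrete sketch (in numpy, with illustrative function names), the per-sample term \(\log \tilde{q}_f(F) - \log \tilde{p}_f(F)\) can be computed directly from the two feature batches using the Gaussian kernel with \(H = dI\):

```python
# A minimal sketch of the per-sample score used for training.
import numpy as np

def log_kde(F, bank, d):
    """log of a Gaussian KDE with bandwidth matrix H = d*I, evaluated at F."""
    sq = ((bank - F) ** 2).sum(axis=1)           # squared distances to each feature
    log_kernel = -0.5 * sq / d - 0.5 * d * np.log(2 * np.pi * d)
    m = log_kernel.max()                         # stable log-mean-exp
    return m + np.log(np.exp(log_kernel - m).mean())

def per_sample_scores(gen_feats, real_feats):
    """log q~_f(F) - log p~_f(F) for every generated feature (lower is better)."""
    d = gen_feats.shape[1]
    return np.array([log_kde(F, gen_feats, d) - log_kde(F, real_feats, d)
                     for F in gen_feats])
```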

Our generative model involves making a discrete (non-differentiable) choice at each step, leading us to optimize our objective using reinforcement learning techniques. Specifically, using the REINFORCE [67] score-function estimator along with a moving-average baseline, we approximate the gradient of Eq. 1 as

$$\begin{aligned} \nabla _\theta \mathcal {L} \approx \frac{1}{m} \sum _{j = 1}^m (\log \tilde{q}_f(\varphi (v_j')) - \log \tilde{p}_f(\varphi (v_j'))) \nabla _\theta \log q_I (v_j') \end{aligned}$$
(2)

where m is the size of the generated batch, and \(\tilde{q}_f(F)\) and \(\tilde{p}_f(F)\) are the density estimates defined above.
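A minimal PyTorch sketch of the resulting update follows: the per-sample score is treated as a cost, a moving-average baseline is subtracted for variance reduction, and differentiating the surrogate loss reproduces the gradient in Eq. 2. The function and variable names are illustrative, not the paper's implementation.

```python
# A minimal sketch of REINFORCE with a moving-average baseline.
import torch

baseline = 0.0            # exponential moving average of the per-sample score

def reinforce_step(log_q_v, scores, optimizer, momentum=0.9):
    """`scores` = log q~_f - log p~_f per sample (cost to minimize),
    `log_q_v` = log q_I(v') per generated sample (requires grad)."""
    global baseline
    scores = scores.detach()
    centered = scores - baseline                  # baseline for variance reduction
    surrogate = (centered * log_q_v).mean()       # gradient of this matches Eq. 2
    optimizer.zero_grad()
    surrogate.backward()
    optimizer.step()                              # gradient descent on the objective
    baseline = momentum * baseline + (1 - momentum) * scores.mean().item()
    return surrogate.item()
```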

Notice that the gradient above requires computing the marginal probability \(q_I(v')\) of a generated image \(v'\), instead of the conditional \(q_I(v'|z)\). Computing the marginal probability of a generated image requires an intractable marginalization over the latent variable z. To circumvent this, we restrict z to a fixed finite set Z from which it is sampled uniformly, enabling easy marginalization. This translates to,

$$\begin{aligned} q_\theta (T)&= \frac{1}{|Z|} \sum _{z \in Z} q_\theta (T|z)\\ q_I(v')&= q(\alpha | T)q_\theta (T) \end{aligned}$$

We find that this still has enough modeling capacity, since there are only finitely many scene graphs of a maximum length \(T_{max}\) that can be sampled from the grammar. Empirically, we find using one latent vector to be enough in our experiments. Essentially, stochasticity in the rule sampling makes up for lost stochasticity in the latent space.

Pretraining is an essential step for our method. In every experiment, we define a simple handcrafted prior on scene structure. For example, a simple prior could be to put one car on one road in a driving scene. We pre-train the model by sampling strings (scene graphs) from the grammar prior, and training the model to maximize the log-likelihood of these scene graphs. We provide specific details about the priors used in Sect. 4.
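A sketch of this pre-training step is given below, assuming a hypothetical `model.log_prob(seq, z)` helper that scores a fixed rule sequence under the same masked softmax used for sampling (i.e., teacher forcing):

```python
# A minimal sketch of pre-training on scene graphs sampled from the grammar prior.
import torch

def pretrain_step(model, prior_sequences, z, optimizer):
    # maximize the log-likelihood of prior-sampled rule sequences
    nll = -torch.stack([model.log_prob(seq, z) for seq in prior_sequences]).mean()
    optimizer.zero_grad()
    nll.backward()
    optimizer.step()
    return nll.item()
```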

Feature Extraction for distribution matching is a crucial step since the features need to capture structural scene information such as the number of objects and their contextual spatial relationships for effective training. We describe the feature extractor used and its training for each experiment in Sect. 4.

Ensuring Termination: During training, sampling can terminate with an incomplete string after \(T_\text {max}\) steps. We therefore repeatedly sample until we obtain a complete scene graph T whose length is at most \(T_\text {max}\). To ensure that we do not require too many attempts, we record the rejection rate \(r_\text {reject}(F)\) of a sampled feature F as the average number of failed sampling attempts when sampling the single scene graph used to generate F. We set a threshold \(\epsilon \) on \(r_\text {reject}(F)\) (the maximum allowable number of rejections) and a weight \(\lambda \), and add a penalty to our original loss:

$$\begin{aligned} \mathcal {L}' = \mathbb {E}_{F \sim q_f}[\log q_f(F) - \log p_f(F) + \lambda \mathbf {1}_{(\epsilon ,\infty )}(r_\text {reject}(F))] \end{aligned}$$

We found that \(\lambda = 10^{-2}\) and \(\epsilon = 1\) worked well for all of our experiments.
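The bookkeeping can be as simple as the following sketch, which resamples until a complete string is obtained and returns the indicator penalty to be added to the per-sample loss (function names are illustrative):

```python
# A minimal sketch of rejection handling with the indicator penalty.
def sample_complete(model, z, lam=1e-2, eps=1):
    failed = 0
    while True:
        rules, log_q, complete = model(z)     # sampler from Sect. 3.2
        if complete:
            penalty = lam if failed > eps else 0.0
            return rules, log_q, penalty      # penalty is added to the per-sample loss
        failed += 1                           # incomplete string: reject and resample
```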

4 Experiments

We show two controlled experiments, on the MNIST dataset  [38] (Sect. 4.1) and on synthetic aerial imagery  [30] (Sect. 4.2), where we showcase the ability of our model to learn synthetic structure distributions unsupervised. Finally, we show an experiment on generating 3D driving scenes (Sect. 4.3), mimicking structure distributions on the KITTI  [21] driving dataset and showing the performance of an object detector trained on our generated data. The renderers used in each experiment are adapted from  [30]. For each experiment, we first discuss the corresponding scene grammar. Then, we discuss the feature extractor and its training. Finally, we describe the structure prior used to pre-train the model, the target data, and show results on learning to mimic structures in the target data without any access to ground-truth structures. Additionally, we show comparisons with learning with MMD  [23] (Sect. 4.1) and show how our model can learn to generate context-dependent scene graphs from the grammar (Sect. 4.2).

4.1 Multi MNIST

We first evaluate our approach on a toy example of learning to generate scenes with multiple digits. The grammar defining the scene structure is:

$$\begin{aligned} \text {Scene} \rightarrow bg \ \text {Digits}, \quad \text {Digits} \rightarrow \text {Digit} \ \text {Digits} \ | \ \epsilon , \quad \text {Digit} \rightarrow 0 \ | \ 1 \ | \ 2 \ | \ \cdots \ | \ 9 \end{aligned}$$

Sampled digits are placed onto a black canvas of size \(256\,\times \,256\).

Fig. 4.

Prior (Left) and Validation (Right) example for MultiMNIST experiments

Fig. 5.

Prior (Left) and Validation (Right) example for Aerial 2D experiments

Fig. 6.

Distributions of classes and number of digits, in the prior, learned and target scene structures

Fig. 7.

Distributions of classes and number of digits, comparing learning with MMD, ours and the target

Feature Extraction Network: We train a network to predict the binary presence of each digit class in the scene. We use a Resnet [26] made up of three residual blocks, each containing two \(3\,\times \,3\) convolutional layers, to produce an image embedding, followed by three fully connected layers that make the prediction from the embedding. We use the Resnet embeddings as our image features. We train the network on synthetic data generated by our simple prior for both structure and continuous parameters, using a binary cross-entropy criterion for each class. The exact prior and target data used are explained below.
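The sketch below gives one possible PyTorch instantiation of this network; the channel widths, pooling and hidden sizes are our assumptions, as the text only fixes the three residual blocks, the 3×3 convolutions, the three FC layers and the BCE objective.

```python
# A minimal sketch of the MultiMNIST digit-presence feature extractor.
import torch
import torch.nn as nn

class ResBlock(nn.Module):
    def __init__(self, c):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(c, c, 3, padding=1), nn.BatchNorm2d(c), nn.ReLU(),
            nn.Conv2d(c, c, 3, padding=1), nn.BatchNorm2d(c))
    def forward(self, x):
        return torch.relu(x + self.body(x))     # residual connection

class DigitPresenceNet(nn.Module):
    def __init__(self, c=32, n_classes=10):
        super().__init__()
        self.stem = nn.Conv2d(1, c, 3, padding=1)
        self.blocks = nn.Sequential(ResBlock(c), nn.MaxPool2d(2),
                                    ResBlock(c), nn.MaxPool2d(2),
                                    ResBlock(c), nn.AdaptiveAvgPool2d(1))
        self.head = nn.Sequential(nn.Linear(c, 64), nn.ReLU(),
                                  nn.Linear(64, 64), nn.ReLU(),
                                  nn.Linear(64, n_classes))
    def forward(self, x):
        emb = self.blocks(self.stem(x)).flatten(1)   # image embedding = phi(v)
        return emb, self.head(emb)                   # features, per-class logits

criterion = nn.BCEWithLogitsLoss()   # binary presence of each digit class
```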

Prior and Target Data: We sample the number of digits in the scene \(n_d\) uniformly from 0 to 10, and sample \(n_d\) digit classes uniformly to place on the scene. The digits are placed (parameters) uniformly on the canvas. The target data has upright digits arranged in a straight line in the middle of the canvas. Figure 4 shows example prior samples and target data. We show that we can learn scene structures even with a remaining gap in the parameters, by using the parameter prior during training.
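For concreteness, a numpy sketch of the two samplers is given below; rendering onto the 256×256 canvas is elided, and the target digit-count distribution, rotations and exact placements are assumptions made for illustration.

```python
# A minimal sketch of the prior and target structure/parameter samplers.
import numpy as np

def sample_prior_scene(rng):
    n_d = rng.integers(0, 11)                          # 0..10 digits
    classes = rng.integers(0, 10, size=n_d)            # uniform digit classes
    xy = rng.uniform(0, 256, size=(n_d, 2))            # uniform placement on canvas
    rot = rng.uniform(0, 360, size=n_d)                # assumed random rotation
    return classes, xy, rot

def sample_target_scene(rng):
    n_d = rng.integers(0, 11)                          # assumed target count distribution
    classes = rng.integers(0, 10, size=n_d)
    xs = np.linspace(40, 216, num=max(n_d, 1))[:n_d]   # evenly spaced in a row
    xy = np.stack([xs, np.full(n_d, 128.0)], axis=1)   # centered vertically
    rot = np.zeros(n_d)                                # upright digits
    return classes, xy, rot

rng = np.random.default_rng(0)
print(sample_prior_scene(rng))
```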

We attempt to learn a random distribution over the number of digits and their classes in the scene. Figure 6 shows the prior, target and learnt distributions of the number of digits and their classes. We see that our model can faithfully approximate the target, even while learning unsupervised. We also train with MMD [23], computed between two batches of real and generated images and used as the reward for every generated scene. Figure 7 shows that using MMD results in the model learning a smoothed approximation of the target distribution, which stems from the lack of per-sample credit assignment in the MMD score that our objective provides.

4.2 Aerial 2D

Next, we evaluate our approach on a harder synthetic scenario of aerial views of driving scenes. The grammar and the corresponding rendered scenes offer additional complexity to test the model. The grammar here is as follows:

$$\begin{aligned}&\text {Scene} \rightarrow \text {Roads}, \quad&\text {Roads} \rightarrow \text {Road} \ \text {Roads} \ | \ \epsilon \\&\text {Road} \rightarrow \text {Cars}, \quad&\text {Cars} \rightarrow car \ \text {Cars} \ | \ \epsilon \end{aligned}$$

Feature Extraction Network: We use the same Resnet [26] architecture from the MNIST experiment, with the FC layers outputting the number of cars, roads, houses and trees in the scene as one-hot labels. We train by minimizing the cross-entropy over these labels on samples generated from the prior.

Fig. 8.

#cars distribution learned in the Aerial 2D experiment. We can learn context-dependent relationships, placing different numbers of cars on different roads

Prior: We sample the number of roads \(n_r \in [0, 4]\) uniformly. On each road, we sample \(c \in [0, 8]\) cars uniformly. Roads are placed sequentially by sampling a random distance d and placing the road d pixels in front of the previous one. Cars are placed on the road with uniform random position and rotation (Fig. 5).

Learning Context-Dependent Relationships: For the target dataset, we sample the number of roads \(n_r \in [0, 4]\) with probabilities (0.05, 0.15, 0.4, 0.4). On the first road we sample \(n_1 \sim \text {Poisson}(9)\) cars and \(n_i \sim \text {Poisson}(3)\) cars for each of the remaining roads. All cars are placed well spaced on their respective road. Unlike the Multi-MNIST experiment, these structures cannot be modelled by a Probabilistic-CFG, and thus by   [30, 37]. We see that our model can learn this context-dependent distribution faithfully as well in Fig. 8.
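A numpy sketch of this target structure sampler is shown below; the text lists four probabilities for the number of roads, so mapping them to one through four roads is our assumption.

```python
# A minimal sketch of the context-dependent Aerial 2D target structure.
import numpy as np

def sample_target_structure(rng):
    # four probabilities mapped to 1..4 roads (assumption)
    n_roads = rng.choice([1, 2, 3, 4], p=[0.05, 0.15, 0.4, 0.4])
    # first road gets Poisson(9) cars, every other road Poisson(3) cars
    cars_per_road = [rng.poisson(9 if i == 0 else 3) for i in range(n_roads)]
    return n_roads, cars_per_road

rng = np.random.default_rng(0)
print(sample_target_structure(rng))   # e.g. (3, [10, 4, 2])
```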

4.3 3D Driving Scenes

We experiment on the KITTI [21] dataset, which was captured with a camera mounted on top of a car driving around the city of Karlsruhe, Germany. The dataset contains a wide variety of road scenes, ranging from urban traffic scenarios to highways and more rural neighborhoods. We utilize the same grammar and renderer used for road scenes in [30]. Our model, although trained unsupervised, learns to get closer to the underlying structure distribution, improving image-generation metrics and the performance of a downstream task model (Fig. 9).

Fig. 9.

Generated images (good prior expt.). (Left) Using both the structure and parameter prior, (Middle) Using our learnt structure and parameters from  [30], (Right) Real KITTI samples. Our model (middle), although unsupervised, adds diverse scene elements like vegetation, pedestrians, signs etc. to better resemble the real dataset.

Fig. 10.

#cars/scene learned from the good prior (left) and the simple prior (right) on KITTI

Prior and Training: Following SDR  [48], we define three different priors to capture three different modes in the KITTI dataset. They are the ‘Rural’, ‘Suburban’ and ‘Urban’ scenarios, as defined in  [48]. We train three different versions of our model, one for each of the structural priors, and sample from each of them uniformly. We use the scene parameter prior and learnt scene parameter model from  [30] to produce parameters for our generated scene structures to get the final scene graphs, which are rendered and used for our distribution matching.

Feature Extraction Network: We use the pool-3 layer of an Inception-V3 network, pre-trained on the ImageNet  [13] dataset as our feature extractor. Interestingly, we found this to work as well as using features from Mask-RCNN  [25] trained on driving scenes.
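A sketch of the feature extraction with torchvision follows; treating the final average-pool output of Inception-V3 as the 2048-d "pool-3" features is a common convention (as in FID) that we assume here, and the weight-loading argument and preprocessing choices differ across torchvision versions.

```python
# A minimal sketch of extracting 2048-d Inception features for distribution matching.
import torch
from torchvision import models, transforms

inception = models.inception_v3(weights="IMAGENET1K_V1").eval()
feats = {}
inception.avgpool.register_forward_hook(
    lambda m, i, o: feats.update(pool3=o.flatten(1)))   # capture pooled features

preprocess = transforms.Compose([
    transforms.Resize((299, 299)), transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])])

@torch.no_grad()
def extract_features(images):                           # list of PIL images
    batch = torch.stack([preprocess(im) for im in images])
    inception(batch)                                    # hook fills feats["pool3"]
    return feats["pool3"]                               # shape [N, 2048]
```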

Distribution Similarity Metrics: In generative modeling of images, the Fréchet Inception Distance [27] and the Kernel Inception Distance [5] have been used to measure progress. We report FID and KID between our generated synthetic dataset and the KITTI-train set in Tables 1 and 2, computed on the pool-3 features of an Inception-v3 network using 10K generated synthetic samples and the full KITTI-train set. Figure 10 (left) shows the distribution of the number of cars generated by the prior, by the learnt model and in the KITTI dataset (since we have GT for cars). We do not have ground truth for which KITTI scenes could be classified as rural/suburban/urban, so we compare against the global distribution of the whole dataset. We observe that the model bridges the gap in this particular distribution well after training.
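For reference, a numpy/scipy sketch of both metrics computed from such features is shown below (FID from Gaussian moment matching, KID as an unbiased MMD estimate with the standard cubic kernel); this mirrors the usual definitions rather than the exact implementation used in the paper.

```python
# A minimal sketch of FID and KID from two feature matrices of shape [N, d].
import numpy as np
from scipy import linalg

def fid(real, gen):
    mu_r, mu_g = real.mean(0), gen.mean(0)
    cov_r, cov_g = np.cov(real, rowvar=False), np.cov(gen, rowvar=False)
    covmean = linalg.sqrtm(cov_r @ cov_g).real          # matrix square root
    return ((mu_r - mu_g) ** 2).sum() + np.trace(cov_r + cov_g - 2 * covmean)

def kid(real, gen):
    d = real.shape[1]
    k = lambda a, b: (a @ b.T / d + 1.0) ** 3           # cubic polynomial kernel
    m, n = len(real), len(gen)
    k_rr = (k(real, real).sum() - np.trace(k(real, real))) / (m * (m - 1))
    k_gg = (k(gen, gen).sum() - np.trace(k(gen, gen))) / (n * (n - 1))
    k_rg = k(real, gen).mean()
    return k_rr + k_gg - 2 * k_rg                       # unbiased MMD^2 estimate
```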

Fig. 11.

Generated images (simple prior expt.). (Left) Using both the structure and parameter prior, (Middle) Using our learnt structure and parameters from  [30], (Right) Real samples from KITTI. Our model, although trained unsupervised, learns to add an appropriate frequency and diversity of scene elements to resemble the real data, even when trained from a very weak prior.

Task Performance: We report average precision for detection at 0.5 IoU i.e. AP@0.5 (following  [30]) of an object detector trained to convergence on our synthetic data and tested on the KITTI validation set. We use the detection head from Mask-RCNN [25] with a Resnet-50-FPN backbone initialized with pre-trained ImageNet weights as our object detector. The task network in each result row of Table 1 is finetuned from the snapshot of the previous row.  [30] show results with adding Image-to-Image translation to the generated images to reduce the appearance gap and results with training on a small amount of real data. We omit those experiments here and refer the reader to their paper for a sketch of expected results in these settings. Training this model directly on the full KITTI training set obtains AP@0.5 of \(81.52 (\text {easy})\), \(83.58 (\text {medium})\) and \(84.48 (\text {hard})\), denoting a large sim-to-real performance gap left to bridge.

Using a Simple Prior: The structure priors in the previous experiments were taken from [48]. These priors already involve some human intervention, which we aim to minimize. Therefore, we repeat the experiments above with a very simple and quick-to-create prior on the scene structure, where a few instances of each kind of object (car, house etc.) are placed in the scene (see Fig. 11 (Left)). [30] requires a decently crafted structure prior to train the parameter network. Thus, we use the prior parameters while training our structure generator in this experiment (showing the robustness of training with randomized prior parameters), and learn the parameter network later (Table 2). Figure 10 (right) shows that the method learned the distribution of the number of cars well (unsupervised), even when initialized from a bad prior. Notice that the FID/KID of the model learnt from the simple prior in Table 2 is comparable to that trained from a tuned prior in Table 1, which we believe is an exciting result.

Table 1. AP@0.5 on KITTI-val and distribution similarity metrics between generated synthetic data and KITTI-train. Learnt parameters are used from  [30]. *Results from   [30] are our reproduced numbers, and we show learning the structure additionally helps close the distribution gap and improves downstream task performance.
Table 2. Repeat of experiments in Table 1 with a *simple prior on the scene structure. Parameters are learnt using  [30]. We observe a significant boost in both task performance and distribution similarity metrics, by learning the structure and parameters.

Discussion: We noticed that our method worked better when initialized with more spread-out priors than with more localized priors (Tables 1 and 2, Fig. 10). We hypothesize that this is because the distribution-matching metric we use is the reverse-KL divergence between the generated and real data (feature) distributions, which is mode-seeking rather than mode-covering. Therefore, an initialization with a narrow distribution around one of the modes has little incentive to move away from it, hampering learning. Even then, we see a significant improvement in performance when starting from a peaky prior, as shown in Table 2. We also note the importance of pre-training the task network. Rows in Table 1 and Table 2 were finetuned from the checkpoint of the previous row. The first row (Prob. Grammar) is a form of Domain Randomization [48, 62], which has been shown to be crucial for sim-to-real adaptation. Our method, in essence, reduces the randomization in the generated scenes (by learning to generate scenes similar to the target data), and we observe that progressively training the task network with our (more specialized) generated data improves its performance. [1, 66] show the opposite behavior, where increasing randomization (or environment difficulty) throughout task training results in improved performance. A detailed analysis of this phenomenon is beyond the current scope and is left for future work.

5 Conclusion

We introduced an approach for the unsupervised learning of a generative model of synthetic scene structures by optimizing for visual similarity to real data. Inferring scene structures is known to be notoriously hard even when annotations are provided. Our method performs the generative side of this task without any ground-truth information. Experiments on two toy datasets and one real dataset showcase the ability of our model to learn a plausible posterior over scene structures, significantly improving over manually designed priors. Our current method needs to optimize for both the scene structure and the parameters of a synthetic scene generator in order to produce good results. This process has many moving parts and is generally cumbersome to make work in a new application scenario. Reducing this burden, for example by learning the grammar itself, requires further investigation and opens an exciting direction for future work.