
1 Introduction

Autonomous mobile agents, such as driverless cars, will be a cornerstone of the smart society of the future. Currently available datasets of labeled street scene images, such as Cityscapes [6], are an important step in this direction, and could, e.g., be used for training models for semantic image segmentation. However, collecting such data poses challenges including privacy intrusions, the need for accurate crowd-sourced labels, and the requirement to cover a huge state-space of different situations and environments. Another approach – especially useful to gather data representing dangerous situations such as collisions with pedestrians – is to generate training images with known ground-truth labeling using game engines or other virtual worlds, but this approach requires object and state-space variability to be manually engineered into the system.

A viable alternative to both these approaches is to augment existing datasets with synthetically-generated novel datapoints, produced by generative image models trained on the existing data. This builds on recent applications of generative models for a variety of tasks such as image style transfer [15] and modality transfer in medical imaging [36].

Among currently-available deep generative approaches, GANs [10] are probably the most widely used in image generation, owing to their achievements in synthesizing realistic high-resolution output with novel and rich detail [3, 16]. Auto-regressive architectures [26, 27] are usually computationally demanding (not parallelizable) and not feasible for generating higher-resolution images. Image samples generated by early variants of VAEs [20, 33] tended to suffer from blurriness [44], although the realism of VAE output has improved in recent years [32, 40].

This article considers normalizing flows [7, 8], a different model class of growing interest. With recent improvements such as Glow [19], flows can generate images whose quality approaches that of GANs. Flows have also achieved competitive results in other tasks such as audio and video generation [17, 21, 30]. Flow-based models exhibit several benefits compared to GANs: 1) stable, monotonic training, 2) learning an explicit representation useful for downstream tasks such as style transfer, 3) efficient synthesis, and 4) exact likelihood evaluation that can be used for density estimation.

In this paper, we propose a new, fully conditional Glow-based architecture called Full-Glow for generating plausible street scene images conditioned on the structure of the image content (i.e., the segmentation mask). We show that, by using this model, we are able to synthesize moderately high-resolution images that are consistent with the given structure but differ substantially from the existing ground-truth images. A quantitative comparison against previously proposed Glow-based models [23, 36] and the popular GAN-based conditional image-generation model pix2pix [15] finds that our improved conditioning allows us to synthesize images that achieve better semantic classification scores under a pre-trained semantic classifier. We also provide visual comparisons of samples generated by the different models.

The remainder of this article is laid out as follows: Sect. 2 presents prior work in street-scene generation and image-to-image translation, while Sect. 3 provides technical background on normalizing flows. Our proposed fully-conditional architecture is then introduced in Sect. 4 and validated experimentally in Sect. 5.

2 Related Work

Synthetic Data Generation. Street-scene image datasets such as Cityscapes [6], CamVid [4], and the KITTI dataset [9] are useful for training vision systems for street-scene understanding. However, collecting and labeling such data is costly, resource-demanding, and associated with privacy issues. An effective alternative that allows for ground-truth labels and scene layout control is synthetic data generation using game engines [34, 35, 39]. Despite these advantages, images generated by game engines tend to differ significantly from real-world images and may not always act as a replacement for real data. Moreover, game engines generally only synthesize objects from pre-generated assets or recipes, meaning that variation has to be hand-engineered into the system. It is therefore difficult and costly to obtain diverse data in this manner. Data generated by approaches such as ours addresses these shortcomings while maintaining the benefits of ground-truth labeling and scene layout control.

Image-to-Image Translation. In order to generate images for data-augmentation of supervised learning tasks, it is necessary to condition the image generation on an input, such that the ground-truth labeling of the generated image is known. For street-scene understanding, this conditioning takes the form of per-pixel class labels (a segmentation mask), meaning that the augmentation task can be formulated as an image-to-image translation problem. GANs [10] have been employed for both paired and unpaired image-to-image translation problems [15, 45]. While GANs can generate convincing-looking images, they are known to suffer from mode collapse and low output diversity [11]. Consequently, their value in augmenting dataset diversity may be limited.

Likelihood-based models, on the other hand, explicitly aim to learn the probability distribution of the data. These models generally favor sample diversity, sometimes at the expense of sample quality [8], which has been linked to the mass-covering property of the likelihood objective [25, 37]. As with GANs [3], perceived image quality can often be improved by reducing the entropy of the distribution at synthesis time, relative to the distribution learned during training, cf. [19, 40]. Flow-based models are a particular class of likelihood-based models that have gained recent attention after an architecture called Glow [19] demonstrated impressive performance in unconditional image generation. Previous works have applied flow-based models to image colorization [1, 2], image segmentation [23], modality transfer in medical imaging [36], and point-cloud generation and image-to-image translation [31].

So far, Glow-based models proposed for image-to-image translation [23, 31, 36] have only considered low-resolution tasks. Although the results are promising, they do not assess the full capacity of Glow for generating realistic image detail, for example in street scenes. High-resolution street-scene synthesis has been performed by the GAN-based model pix2pixHD [41] on a GPU with very high memory capacity (24 GB). In the present work, we synthesize moderately high resolution street scene images using a GPU with lower memory capacity (11–12 GB). We extend previous works on Glow-based models by introducing a fully conditional architecture, and also by modeling high-resolution street-scene images, which is a more challenging task than the low-resolution output considered in prior work.

3 Flow-Based Generative Models

Normalizing flows [28] are a class of probabilistic generative models, able to represent complex probability densities in a manner that allows both easy sampling and efficient training based on explicit likelihood maximization. The key idea is to use a sequence of invertible and differentiable functions/transformations which (nonlinearly) transform a random variable \(\mathbf {z}\) with a simple density function to another random variable \(\mathbf {x}\) with a more complex density function (and vice versa, thanks to invertibility):

$$\begin{aligned} \mathbf {x}= \mathbf {h}_0 \; \overset{\mathbf {f}_1}{\longleftrightarrow } \; \mathbf {h}_1 \; \overset{\mathbf {f}_2}{\longleftrightarrow } \; \mathbf {h}_2 \; \cdots \; \overset{\mathbf {f}_K}{\longleftrightarrow } \; \mathbf {h}_K = \mathbf {z}\end{aligned}$$
(1)

Each component transformation \(\mathbf {f}_i\) is called a flow step. The distribution of \(\mathbf {z}\) (termed the latent, source, or base distribution) is assumed to have a simple parametric form, such as an isotropic unit Gaussian. As in GANs, the generative process can be formulated as:

$$\begin{aligned} \mathbf {z}\sim & {} \, p_{\mathrm {z}} (\mathbf {z}), \end{aligned}$$
(2)
$$\begin{aligned} \mathbf {x}= & {} \, \mathbf {g}_{\boldsymbol{\theta }}(\mathbf {z}) = \mathbf {f}^{-1}_{\boldsymbol{\theta }}(\mathbf {z}) \end{aligned}$$
(3)

where \(\mathbf {z}\) is sampled from the base distribution and \(\mathbf {g}_{\boldsymbol{\theta }}\) represents the cumulative effect of the parametric invertible transformations in Eq. (1). The log-density function of \(\mathbf {x}\) under this transformation can be written as:

$$\begin{aligned} \mathrm {log} \, p_{\mathrm {x}} (\mathbf {x}) = \mathrm {log} \, p_{\mathrm {z}} (\mathbf {z}) + \sum _{i=1}^{K} \mathrm {log} \left| \mathrm {det}\frac{\mathrm {d}\mathbf {h}_i}{\mathrm {d}\mathbf {h}_{i-1}}\right| \end{aligned}$$
(4)

using the change-of-variables theorem, where \(\mathbf {h}_0 \triangleq \mathbf {x}\) and \(\mathbf {h}_K \triangleq \mathbf {z}\). Equation (4) can be used to compute the exact dataset log-likelihood (not possible in GANs) and is the sole objective function for training flow-based models.

The central design challenge of normalizing flows is to create expressive invertible transformations (typically parameterized by deep neural networks) where the so-called Jacobian log-determinant in Eq. (4) remains computationally feasible to evaluate. Often, this is achieved by designing transformations whose Jacobian matrix is triangular, making the determinant trivial to compute. An important example is NICE [7]. NICE introduced the coupling layer, which is a particular kind of flow nonlinearity that uses a neural network to invertibly transform half of the elements in \(\mathbf {h}_k\) with respect to the other half. RealNVP [8] improved on this architecture using more general invertible transformations in the coupling layer and by imposing a hierarchical structure where the flow is partitioned into blocks that operate at different resolutions. This hierarchy allows using smaller \(\mathbf {z}\)-vectors at the initial, smaller resolutions, speeding up computation, and has lately been used by other prominent image-generation systems [32, 40]. Glow [19] added actnorm as a replacement for batchnorm [14] and introduced invertible \(1\times 1\) convolutions to more efficiently mix variables in between the couplings.
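To make the coupling idea concrete, the listing below gives a minimal PyTorch sketch of a RealNVP/Glow-style affine coupling layer; the small convolutional network and its layer sizes are illustrative assumptions, not the exact configuration of any of the cited models.

```python
import torch
import torch.nn as nn

class AffineCoupling(nn.Module):
    """RealNVP/Glow-style affine coupling: transforms half of the channels
    conditioned on the other half, giving a triangular Jacobian whose
    log-determinant is simply the sum of the log scales."""
    def __init__(self, num_channels, hidden_channels=128):
        super().__init__()
        # Small conv net producing scale and shift for the transformed half.
        self.net = nn.Sequential(
            nn.Conv2d(num_channels // 2, hidden_channels, 3, padding=1),
            nn.ReLU(),
            nn.Conv2d(hidden_channels, num_channels, 3, padding=1),
        )

    def forward(self, h):
        # Split along channels: h1 is transformed, h2 passes through unchanged.
        h1, h2 = h.chunk(2, dim=1)
        o1, o2 = self.net(h2).chunk(2, dim=1)
        s = torch.sigmoid(o1 + 2.0)             # positive scale, close to 1 at init
        y1 = s * h1 + o2
        logdet = s.log().flatten(1).sum(dim=1)  # per-example Jacobian log-determinant
        return torch.cat([y1, h2], dim=1), logdet

    def inverse(self, y):
        y1, h2 = y.chunk(2, dim=1)
        o1, o2 = self.net(h2).chunk(2, dim=1)
        s = torch.sigmoid(o1 + 2.0)
        h1 = (y1 - o2) / s
        return torch.cat([h1, h2], dim=1)
```

Because the untransformed half is available unchanged in both directions, the inverse never needs to invert the neural network itself, which is what makes coupling layers both expressive and cheap to invert.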

A number of Glow-based architectures have been proposed for conditional image generation. In these models, the goal is to learn a distribution over the target image \(\mathbf {x}_{b}\) conditioned on the source image \(\mathbf {x}_{a}\). C-Glow [23] is based on the standard Glow architecture from [19], but makes all sub-steps inside the Glow conditional on the raw conditioning image \(\mathbf {x}_{a}\). The Dual-Glow [36] architecture instead builds a generative model of both source and target image together. It consists of two Glows where the base variables \(\mathbf {z}_a\) of the source-image Glow determine the Gaussian distribution of the corresponding base variables \(\mathbf {z}_b\) of the target-image Glow through a neural network. Because of the hierarchical structure of Glow, several different conditioning networks are used, one for each block of flow steps. C-Flow [31] described a similar structure of side-by-side Glows, but kept the Gaussian base distributions in the two flows independent. Instead, they used the latent variables \(\mathbf {h}_{a,i}\) at every flow step i of the target-domain Glow to condition the transformation in the coupling layer at the corresponding level in the source-domain Glow. Compared to the raw image-data conditioning in C-Glow, Dual-Glow and C-Flow simplify the conditional mapping task at the different levels since the source and target information sit at comparable levels of abstraction.

4 Fully Conditional Glow for Scene Generation

This section introduces our new, fully conditional Glow architecture for image-to-image translation, which combines key innovations from all three previous architectures, C-Glow, Dual-Glow, and C-Flow: Like Dual-Glow and C-Flow (but unlike C-Glow), we use two parallel stacks of Glow, so that we can leverage conditioning information at the relevant level of the hierarchy and are not restricted to always using the raw source image as input. In contrast to Dual-Glow and C-Flow (but reminiscent of C-Glow), we introduce conditioning networks that make all operations in the target-domain Glow conditional on the source-domain information. The resulting architecture is illustrated in Fig. 1. Because of its fully conditional nature, we dub this architecture Full-Glow.

Fig. 1. The proposed architecture, where all substeps have been made conditional by inserting conditioning networks. \(\mathbf {x}_{a}\) and \(\mathbf {x}_{b}\) are paired images in the source and target domains, respectively.

In our proposed architecture, not only is the coupling layer conditioned on the output of the corresponding operation in the source Glow, but the actnorm and the \(1 \times 1\) convolutions in the target Glow are also connected to the source Glow. In particular, the parameters of these two operations in each target-side step are generated by conditioning networks CN built from convolutional layers followed by fully connected layers. These networks also enable us to exploit other side information for conditioning, for instance by concatenating the side information with the other input features of each conditioning network. We experimentally show that making the model fully conditional indeed allows for learning a better conditional distribution (measured with lower conditional bits-per-dimension) and more semantically meaningful images (measured using a pre-trained semantic classifier).

We will now describe the architecture of the fully-conditional target-domain Glow in more detail. We describe the computations in the inference (analysis) direction (\(\mathbf {f}_{\boldsymbol{\theta }}\)); every transformation is invertible for synthesis (\(\mathbf {g}_{\boldsymbol{\theta }}\)) given the conditioning image \(\mathbf {x}_{a}\).

Conditional Actnorm. The shift \(\mathbf {t}\) and scale \(\mathbf {s}\) parameters of the conditional actnorm are computed as follows:

$$\begin{aligned} \mathbf {s}, \mathbf {t} = \texttt {CN}\left( \mathbf {x}_{\mathrm {act}}^{\mathrm {source}}\right) \end{aligned}$$
(5)

where \(\mathbf {x}_{\mathrm {act}}^{\mathrm {source}}\) is the output of the corresponding actnorm in the source Glow. For initializing the actnorm conditioning network (CN), we set all parameters of the network except those of the output layer to small, random values. Following [19, 42], the weights of the output layer are initialized to 0 and the biases are initialized such that the target-side activations after applying actnorm have mean 0 and standard deviation 1 per channel for the first batch of data, mirroring the data-dependent initialization of actnorm in regular Glow.
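As an illustration, the following PyTorch sketch shows one way the conditional actnorm could be realized. The conditioning network follows the convolution-plus-fully-connected layout described above, but its exact sizes are assumptions, and the data-dependent output-layer initialization is omitted for brevity.

```python
import torch
import torch.nn as nn

class ConditionalActnorm(nn.Module):
    """Actnorm whose per-channel scale and shift are produced by a conditioning
    network (CN) from the corresponding source-Glow activation, instead of being
    free parameters as in regular Glow. Assumes source and target activations
    have the same number of channels."""
    def __init__(self, num_channels, hidden_dim=64):
        super().__init__()
        self.cn = nn.Sequential(
            nn.Conv2d(num_channels, hidden_dim, 3, padding=1), nn.ReLU(),
            nn.Conv2d(hidden_dim, hidden_dim, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(hidden_dim, 2 * num_channels),   # -> [log s | t]
        )

    def forward(self, x_target, x_act_source):
        log_s, t = self.cn(x_act_source).chunk(2, dim=1)
        log_s = log_s[:, :, None, None]                 # broadcast over H, W
        t = t[:, :, None, None]
        y = torch.exp(log_s) * x_target + t
        # The per-channel scale is applied at every pixel, so the log-determinant
        # is the channel-wise sum of log s multiplied by the number of pixels.
        h, w = x_target.shape[2], x_target.shape[3]
        logdet = log_s.flatten(1).sum(dim=1) * h * w
        return y, logdet
```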

Conditional \(1 \times 1\) Convolution. Like Glow, we represent the convolution kernel W using an LU decomposition for easy log-determinant computation, but we have conditioning networks generate the \(\mathbf {L}\), \(\mathbf {U}\) matrices and the \(\mathbf {s}\) vector:

$$\begin{aligned} \mathbf {L}, \mathbf {U}, \mathbf {s}= \texttt {CN}\left( \mathbf {x}_{\mathrm {\mathbf {W}}}^{\mathrm {source}}\right) \end{aligned}$$
(6)

where \(\mathbf {x}_{\mathrm {\mathbf {W}}}^{\mathrm {source}}\) is the output of the corresponding \(1\times 1\) convolution in the source Glow, \(\mathbf {L}\) is a lower triangular matrix with ones on the diagonal, \(\mathbf {U}\) is an upper triangular matrix with zeros on the diagonal, and \(\mathbf {s}\) is a vector. Initialization again follows [19]: we first sample a random rotation matrix \(\mathbf {W}_0\) per layer, which we factorize using the LU decomposition as \(\mathbf {W}_0=\mathbf {P}\mathbf {L}_0\left( \mathbf {U}_0 + \mathrm {diag}(\mathbf {s}_0) \right) \). The conditioning network is then set up similarly to that of the actnorm, with weights and biases set so that its outputs on the first batch are constant and equivalent to the sampled rotation matrix \(\mathbf {W}_0\). The permutation matrix \(\mathbf {P}\) remains fixed throughout the optimization.
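The sketch below illustrates how the kernel \(\mathbf {W}\) could be assembled from the conditioning-network outputs and applied as a per-example \(1\times 1\) convolution; the function names and the batched formulation are illustrative assumptions, not the exact code of our implementation.

```python
import torch

def build_w_from_lu(l_entries, u_entries, s, P):
    """Assemble W = P L (U + diag(s)) from the vectors produced by the
    conditioning network (one set of vectors per example in the batch)."""
    B, C = s.shape
    tril_idx = torch.tril_indices(C, C, offset=-1)   # strictly lower positions
    triu_idx = torch.triu_indices(C, C, offset=1)    # strictly upper positions
    L = torch.eye(C, device=s.device).expand(B, C, C).clone()
    U = torch.zeros(B, C, C, device=s.device)
    L[:, tril_idx[0], tril_idx[1]] = l_entries       # unit lower triangular
    U[:, triu_idx[0], triu_idx[1]] = u_entries       # strictly upper triangular
    W = P @ L @ (U + torch.diag_embed(s))
    logdet_per_pixel = s.abs().log().sum(dim=-1)     # log |det W|
    return W, logdet_per_pixel

def apply_conditional_1x1(x, W):
    # Mix the channels of every pixel of x[b] with the example-specific matrix W[b].
    return torch.einsum('bij,bjhw->bihw', W, x)
```

The log-determinant contribution of the full layer is then `logdet_per_pixel` multiplied by the number of pixels, exactly as in unconditional Glow.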

Conditional Coupling Layer. The conditional coupling layer resembles that of C-Flow (except that we use the whole source coupling output rather than half of it), where the network in the coupling layer takes input from both source and target sides:

$$\begin{aligned} \mathbf {x}_{1}^{\mathrm {target}}, \mathbf {x}_{2}^{\mathrm {target}}= & {} \, \texttt {split}(\mathbf {x}^{\mathrm {target}}) \end{aligned}$$
(7)
$$\begin{aligned} \mathbf {o_1}, \mathbf {o_2}= & {} \, \texttt {CN}\left( \mathbf {x}_{2}^{\mathrm {target}}, \mathbf {x}^{\mathrm {source}} \right) \end{aligned}$$
(8)
$$\begin{aligned} {\mathbf {s}}= & {} \, {\mathrm {Sigmoid}(\mathbf {o}_1 + 2)} \end{aligned}$$
(9)
$$\begin{aligned} {\mathbf {t}}= & {} \, {\mathbf {o_2}} \end{aligned}$$
(10)

where the split operation splits the input tensor along the channel dimension, \(\mathbf {x}^{\mathrm {source}}\) is the output of the corresponding coupling layer in the source Glow, and \(\mathbf {x}^{\mathrm {target}}\) is the output of the preceding \(1 \times 1\) convolution in the target Glow. \(\mathbf {s}\) and \(\mathbf {t}\) are the affine coupling parameters. The conditioning network inputs are concatenated channel-wise.
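A minimal PyTorch sketch of Eqs. (7)–(10) is given below; the conditioning-network architecture is an assumption, and the source and target activations are assumed to share spatial dimensions.

```python
import torch
import torch.nn as nn

class ConditionalCoupling(nn.Module):
    """Affine coupling whose parameters come from a conditioning network that
    sees both one half of the target activation and the full output of the
    corresponding coupling layer in the source Glow (Eqs. 7-10)."""
    def __init__(self, target_channels, source_channels, hidden=128):
        super().__init__()
        in_ch = target_channels // 2 + source_channels
        self.cn = nn.Sequential(
            nn.Conv2d(in_ch, hidden, 3, padding=1), nn.ReLU(),
            nn.Conv2d(hidden, target_channels, 3, padding=1),
        )

    def forward(self, x_target, x_source):
        x1, x2 = x_target.chunk(2, dim=1)                                # Eq. (7)
        o1, o2 = self.cn(torch.cat([x2, x_source], dim=1)).chunk(2, dim=1)  # Eq. (8)
        s = torch.sigmoid(o1 + 2.0)                                      # Eq. (9)
        t = o2                                                           # Eq. (10)
        y1 = s * x1 + t
        logdet = s.log().flatten(1).sum(dim=1)
        return torch.cat([y1, x2], dim=1), logdet
```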

The objective function for the model has the same form as that of Dual-Glow:

$$\begin{aligned} \frac{1}{N} \left[ -\sum _{{n}=1}^{N} \lambda \log p_{\boldsymbol{\theta }}\left( \mathbf {x}_{a}^{({n})}\right) -\sum _{{n}=1}^{N} \log p_{\boldsymbol{\phi }}\left( \mathbf {x}_{b}^{({n})} \mid \mathbf {x}_{a}^{({n})}\right) \right] \end{aligned}$$
(11)

where \(\boldsymbol{\theta }\) are the parameters of the source Glow and \(\boldsymbol{\phi }\) are the parameters of the target Glow. We note that there is one model (and term) for unconditional image generation in the source domain, coupled with a second model (and term) for conditional image generation in the target domain. With the tuning parameter \(\lambda \) set to unity, Eq. (11) is the average negative joint log-likelihood of the source–target image pairs \((\mathbf {x}_{a}, \mathbf {x}_{b})\), and puts equal emphasis on learning to generate (and to normalize/analyze) both images. In the limit \(\lambda \rightarrow \infty \), we would learn an unconditional model of source images only. Using a \(\lambda \) below 1, however, lets the optimization instead put more importance on the conditional distribution, which is our main priority in image-to-image translation. This “exchange rate” between bits of information in different domains is reminiscent of the tuning parameter in the information bottleneck principle [38].
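Assuming a standard-Gaussian base distribution in both Glows, the objective of Eq. (11) can be sketched as follows; the function names are illustrative, not part of our released code.

```python
import math
import torch

def gaussian_nll(z, logdet):
    """Negative log-likelihood (in nats) of a flow with a standard-Gaussian base:
    -log p(x) = -log N(z; 0, I) - sum_i log |det df_i/dh_{i-1}|."""
    log_pz = -0.5 * (z ** 2 + math.log(2 * math.pi)).flatten(1).sum(dim=1)
    return -(log_pz + logdet)

def full_glow_objective(z_a, logdet_a, z_b, logdet_b, lam=1e-4):
    """Eq. (11): lambda-weighted unconditional NLL of the source image plus the
    conditional NLL of the target image, averaged over the batch."""
    return (lam * gaussian_nll(z_a, logdet_a) + gaussian_nll(z_b, logdet_b)).mean()
```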

5 Experiments

This section reports our findings from applying the proposed model to the Cityscapes dataset from [6]. Each data instance is a photo of a street scene that has been segmented into objects of 30 different classes, such as road, sky, buildings, cars, and pedestrians. 5000 of these images come with fine per-pixel class annotations of the image, a so-called segmentation mask. We used the data splits provided by the dataset (2975 training and 500 validation images), and trained a number of different models to generate street-scene images conditioned on their segmentation masks.

A common way to evaluate the quality of images generated based on the Cityscapes dataset is to apply well-known pre-trained classifiers such as FCN [22] and (here) PSPNet [43] to synthesized images (as done by [15, 41]). The idea is that if a synthesized image is of high quality, a classifier trained on real data should be able to successfully classify different objects in the synthetic image, and thus produce an estimated segmentation mask that closely agrees with the ground-truth segmentation mask. For likelihood-based models we also consider the conditional bits per dimension (BPD), \(-\mathrm {log_2} \ p(\mathbf {x}_{b} | \mathbf {x}_{a})\) normalized by the number of dimensions, as a measure of how well the conditional distribution learned by the model matches the real conditional distribution, when tested on held-out examples.
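For reference, BPD is obtained from a negative log-likelihood in nats via the standard conversion sketched below; this helper is illustrative and not taken from our released implementation.

```python
import math

def bits_per_dimension(nll_nats, height, width, num_channels=3):
    """Convert a conditional negative log-likelihood (in nats) to bits per
    dimension: -log2 p(x_b | x_a) divided by the number of dimensions."""
    num_dims = height * width * num_channels
    return nll_nats / (num_dims * math.log(2.0))
```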

Implementation Details. Our main experiments were performed on images from the Cityscapes data down-sampled to \(256 \times 256\) pixels (higher than C-Flow [31], which uses \(64\times 64\) resolution). The Full-Glow model was implemented in PyTorch [29] and trained using the Adam optimizer [18] with a learning rate of \(10^{-4}\) and a batch size of 1. The conditioning networks (CN) for the actnorm and \(1 \times 1\) convolution in our model consisted of three convolutional layers followed by four fully connected layers. The CN for the coupling layer had two convolutional layers. Network weights were initialized as described in Sect. 4. We used \(\lambda = 10^{-4}\) in the objective function Eq. (11). Training was consistently stable and monotonic; see the loss curve in the supplement. Our implementation can be found at: https://github.com/MoeinSorkhei/glow2.
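As an illustration of this layout, a conditioning network of the described shape could be built as in the sketch below; the hidden sizes, strides, and pooling are assumptions, not the exact hyper-parameters of our released code.

```python
import torch.nn as nn

def make_conditioning_network(in_channels, out_dim, hidden=64):
    """Sketch of a CN with three convolutional layers followed by four fully
    connected layers, as described above (layer sizes are illustrative)."""
    return nn.Sequential(
        nn.Conv2d(in_channels, hidden, 3, stride=2, padding=1), nn.ReLU(),
        nn.Conv2d(hidden, hidden, 3, stride=2, padding=1), nn.ReLU(),
        nn.Conv2d(hidden, hidden, 3, stride=2, padding=1), nn.ReLU(),
        nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        nn.Linear(hidden, hidden), nn.ReLU(),
        nn.Linear(hidden, hidden), nn.ReLU(),
        nn.Linear(hidden, hidden), nn.ReLU(),
        nn.Linear(hidden, out_dim),   # output layer, zero-initialized in practice
    )
```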

Table 1. Comparison of different models on the Cityscapes dataset for label \(\rightarrow \) photo image synthesis.

5.1 Quantitative Comparison with Other Models

We compare the performance of our model against C-Glow [23] and Dual-Glow [36] (two previously proposed Glow-based models) and pix2pix [15], a widely used GAN-based model for image-to-image translation.

Since C-Glow was proposed to deal with low-resolution images, its authors used deep conditioning networks in their model. We could not use equally deep conditioning networks in this task because the images we would like to generate are of higher resolution (\(256 \times 256\)). To enable valid comparisons, we trained two versions of their model. In the first version, we allowed the conditioning networks to be deeper while keeping the Glow itself shallower (3 Blocks, each with 8 Flows). In the second version, the Glow model is deeper (4 Blocks, each with 16 Flows) but the conditioning networks are shallower. More details about the models and their hyper-parameters can be found in the supplementary material. Note that the Glow models in C-Glow version 2, Dual-Glow, and our model are all equally deep (4 Blocks, each with 16 Flows). All models, including Full-Glow, were trained for \({\sim }45\) epochs using the same training procedure described earlier.

We sampled from each trained model 3 times on the validation set, evaluated the synthesized images using PSPNet [43], and calculated the mean and standard deviation of the performance (denoted by ±). The metrics used for evaluation are mean pixel accuracy, mean class accuracy, and mean intersection over union (IoU), as formulated in [22]. Mean pixel accuracy computes the accuracy over all pixels of an image (and can therefore easily be dominated by the sky, trees, and other large objects that are mostly classified correctly). Mean class accuracy instead calculates the accuracy over the pixels of each class and then averages over the classes, so that all classes are treated equally. Finally, mean class IoU calculates, for each class, the intersection over union between the objects of that class segmented in the synthesized image and the objects in the ground-truth segmentation. Ideally, this number should be 1, signifying complete overlap between segmented and ground-truth objects.
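For clarity, the three metrics can be computed from a class confusion matrix as in the sketch below, following the standard formulations of [22]; the helper is illustrative, not the evaluation code we used.

```python
import numpy as np

def segmentation_scores(conf):
    """Compute mean pixel accuracy, mean class accuracy, and mean class IoU from
    a confusion matrix `conf`, where conf[i, j] counts pixels of ground-truth
    class i that were predicted as class j."""
    tp = np.diag(conf).astype(float)
    gt_per_class = conf.sum(axis=1).astype(float)      # pixels of each true class
    pred_per_class = conf.sum(axis=0).astype(float)    # pixels predicted as each class

    pixel_accuracy = tp.sum() / conf.sum()
    with np.errstate(invalid='ignore', divide='ignore'):
        class_accuracy = np.nanmean(tp / gt_per_class)
        iou = tp / (gt_per_class + pred_per_class - tp)
    mean_iou = np.nanmean(iou)
    return pixel_accuracy, class_accuracy, mean_iou
```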

Quantitative results of applying each model to the Cityscapes dataset in the label \(\rightarrow \) photo direction can be seen in Table 1. The results show that street scene images generated by Full-Glow are of higher quality from the viewpoint of semantic segmentation. The noticeable difference in classification performance confirms that the objects in the images generated by our model are more easily distinguishable by the off-the-shelf semantic classifier. We attribute this to the fact that making the model fully conditional enables the target Glow to exploit the information available in the source image and to synthesize an image that closely follows the given structure.

Fig. 2. Visual samples from different models. Samples from likelihood-based models are taken with temperature 0.7. Please zoom in to see more details.

5.2 Visual Comparison with Other Models

It is also interesting to compare samples from the different models visually. Figure 2 illustrates samples from the different models given the same condition. An immediate observation is that C-Glow v.1 [23] (which has deeper conditioning networks but a shallower Glow) is essentially unable to generate any meaningful image. Dual-Glow [36], however, is able to generate plausible images. Samples generated by pix2pix [15] exhibit vibrant colors (especially for the buildings), but the important objects (such as cars) that constitute the general structure of the image are sometimes distorted. We believe this explains its low scores under the semantic classifier: for classification accuracy, respecting the structure appears to matter more than having vibrant colors. The multiple samples from our model illustrate a benefit of flow-based models, namely that each draw yields a different image. Most of the differences are in the colors of objects such as cars.

Generally, the samples generated by likelihood-based models appear somewhat muted. This is in contrast with GAN-based samples, which often have realistic colors. This is probably related to the fundamental difference in the optimization of the two model categories. GAN-based models tend to collapse onto regions of the data space from which only plausible samples can come, and may lack support over other data regions [11], which is also seen as a lack of diversity in their samples. In contrast, likelihood-based models try to learn a distribution that has support over wider data regions while maximizing the probability of the available datapoints. The latter approach seems to result in samples that are diverse but have somewhat muted colors (especially at lower temperatures).

Table 2. Effect of temperature T evaluated using a pre-trained PSPNet [43]. Each column lists the mean over repeated image samples.

5.3 Effect of Temperature

As noted above, likelihood-based models such as Glow [19] generally tend to overestimate the variability of the data distribution [25, 37], hence occasionally generating implausible output samples. A common way to circumvent this issue is to reduce the diversity of the output at generation time. For flows, this can be done by reducing the standard deviation of the base distribution by a factor T (known as the temperature). While \(T=1\) corresponds to sampling from the estimated maximum-likelihood distribution, reducing T generally results in the output distribution becoming concentrated on a core region of especially-probable output samples. Similar ideas are widely used not only in flow-based models (cf. [19]) but also in other generative models such as GANs, VAEs, and Transformer-based language models [3, 5, 13, 40].
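In a flow, applying a temperature simply amounts to scaling the standard deviation of the base distribution before inverting the flow, as in the following sketch; the `inverse` method and `base_shape` attribute are placeholders for the synthesis pass of the target Glow, not part of any specific API.

```python
import torch

@torch.no_grad()
def sample_with_temperature(model, x_source, T=0.8):
    """Draw a sample from the conditional flow with a reduced-entropy base
    distribution: z ~ N(0, T^2 I) instead of N(0, I)."""
    z = torch.randn(model.base_shape) * T      # scale the base std by T
    return model.inverse(z, condition=x_source)
```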

We investigated the effect of temperature by evaluating the performance of the model on samples generated at different temperatures (instead of \(T=1\) as in previous experiments). We sampled on the validation set 3 times with the trained model and evaluated using the PSPNet semantic classifier as before. The results are reported in Table 2 and suggest that the optimal temperature is around 0.8 for this task. That setting strikes a compromise where colors are vibrant while object structure is well maintained, enabling the classifier to correctly identify the objects in the synthesized image. Also note the small standard deviation at lower temperatures, which agrees with our expectation that inter-sample variability should be small at low temperature. Example images generated at different temperatures are provided in the supplementary material.

Fig. 3. Examples of applying a desired content to a desired structure. Please zoom in to see more details.

5.4 Content Transfer

Style transfer, in which the appearance of one image is transferred onto a new structure (the condition), is an interesting application of image synthesis. In this experiment, we transfer the content of a real photo to a new structure (segmentation). Previous work [1, 31] has performed similar experiments with flow-based models, but either on a different dataset or at very low resolution (\(64\times 64\)). We demonstrate how the learned representation enables us to synthesize an image given a desired content and a desired structure at relatively high resolution while maintaining the details of the content.

Suppose \(\mathbf {x}_{b}^{1}\) is the image with the desired content (with \(\mathbf {x}_{a}^{1}\) being its segmentation) and \(\mathbf {x}_{a}^{2}\) is the new segmentation to which we wish to transfer the content. We can take the following steps, sketched in code after the list (\(\mathbf {g}_{\boldsymbol{\theta }}(.)\) and \(\mathbf {f}_{\phi }(.)\) are the forward functions in the source and target Glows, respectively):

  1. Extract the representation of the desired content: \(\mathbf {z}_{\mathrm {b}}^{1}=\mathbf {f}_{\phi }\left( \mathbf {x}_{\mathrm {b}}^{1} \, | \, \mathbf {g}_{\boldsymbol{\theta }}\left( \mathbf {x}_{\mathrm {a}}^{1}\right) \right) \).
  2. Apply the content to the new segmentation: \(\mathbf {x}_{\mathrm {b}}^{\mathrm {new}}=\mathbf {f}_{\phi }^{-1} \left( \mathbf {z}_{\mathrm {b}}^{1} \, | \, \mathbf {g}_{\boldsymbol{\theta }}\left( \mathbf {x}_{\mathrm {a}}^{2}\right) \right) \).
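A minimal sketch of these two steps is given below; the `forward`/`inverse` method names are placeholders for the analysis and synthesis passes of the two Glows, not the exact interface of our implementation.

```python
import torch

@torch.no_grad()
def content_transfer(source_glow, target_glow, x_a1, x_b1, x_a2):
    """Transfer the content of photo x_b1 (with segmentation x_a1) onto a new
    segmentation x_a2, following the two steps listed above."""
    cond_1 = source_glow.forward(x_a1)                 # g_theta(x_a^1)
    z_b1 = target_glow.forward(x_b1, cond=cond_1)      # step 1: extract content code
    cond_2 = source_glow.forward(x_a2)                 # g_theta(x_a^2)
    x_b_new = target_glow.inverse(z_b1, cond=cond_2)   # step 2: re-synthesize
    return x_b_new
```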

Figure 3 shows examples of transferring the content of an image to another segmentation. We can see that the model is able to successfully apply the content of large objects such as buildings, trees, and cars to the desired image structure. Often, however, a given content and a given structure do not fully agree with each other, for instance when there are cars in the content that are missing in the segmentation or vice versa. This kind of mismatch is quite common for content transfer on Cityscapes images, as these images contain many objects placed in different positions. In such cases, the model tries to respect the structure as much as possible while filling it with the given content. The results suggest that content transfer is useful for data augmentation since, given the desired content, the model can fill the structure with coherent information, which makes the output image more realistic. This technique can in practice be applied to any content and any structure (provided that they do not mismatch completely), enabling one to synthesize many more images.

Fig. 4. Higher-resolution samples of our model taken with temperature 0.9. Left: conditioning; middle to right: samples 1, 2, 3. Please zoom in to see more details.

Fig. 5. Higher-resolution samples of our model taken with temperature 0.9. Left: conditioning; middle to right: samples 1, 2, 3. Please zoom in to see more details.

5.5 Higher-Resolution Samples

In order to see how expressive the model is at even higher resolutions, we trained the proposed model on \(512\times 1024\) images. Example output images from the trained model are provided in Figs. 4 and 5. Higher temperatures yield more diversity, but object structure becomes somewhat distorted; we chose to sample with temperature 0.9 for these higher-resolution images. The diversity between multiple samples is especially obvious when looking at the cars. Additional high-resolution samples are available in the supplementary material.

6 Conclusions

In this paper, we proposed a fully conditional Glow-based architecture for more realistic conditional street-scene image generation. We quantitatively compared our method against previous work and observed that our improved conditioning allows for generating images that are more interpretable by the semantic classifier. We also used the architecture to synthesize higher-resolution images in order to better show diversity of samples and the capabilities of Glow at higher resolutions. In addition, we demonstrated how new meaningful images could be synthesized based on a desired content and a desired structure, which is a compelling option for high-quality data augmentation.