1 Introduction

Generating complex, realistic scenes with multiple objects in desired layouts is one of the core frontiers of computer vision. Such algorithms would not only inform the design of inference mechanisms needed for visual understanding, but also provide practical benefits in the form of automatic image generation for artists and users. Indeed, if successful, such algorithms may replace visual search and retrieval engines entirely: why search the web for an image if you can create one to a user's specification?

Fig. 1

Image generation from layout. Given the coarse layout (bounding boxes + object categories), the proposed Layout2Im model samples the appearance of each object from a normal distribution, and transforms these inputs into a real image through a series of components. Please refer to Sect. 3 for a detailed explanation

For these reasons, image generation algorithms have been a major focus of recent research. Of specific relevance are approaches for text-to-image (Hong et al. 2018; Karacan et al. 2016; Mansimov et al. 2015; Reed et al. 2017; Tan et al. 2018; Zhang et al. 2017) generation. By allowing users to describe visual concepts in natural language, text-to-image generation provides a natural and flexible interface for conditioned image generation. However, existing text-to-image approaches exhibit two drawbacks: (i) most approaches can only generate plausible results on simple datasets such as cats (Zhang et al. 2008), birds (Welinder et al. 2010) or flowers (Nilsback and Zisserman 2008), while generating complex, real-world images such as those in the COCO-Stuff (Caesar et al. 2016) and Visual Genome (Krishna et al. 2017) datasets remains a challenge; (ii) the ambiguity of textual descriptions makes it difficult to constrain the complex generation process, e.g., the locations and sizes of different objects are usually not given in the description.

Scene graphs are powerful structured representations that encode objects, their attributes and relationships. In Johnson et al. (2018), an approach for generating complex images with many objects and relationships was proposed by conditioning the generation on scene graphs, addressing some of the aforementioned challenges. However, scene graphs are difficult for a layman user to construct and lack specification of core spatial properties, e.g., object size and position.

To overcome these limitations, we propose to generate complicated real-world images from layouts, as illustrated in Fig. 1. By simply specifying the coarse layout (bounding boxes + categories) of the expected image, our proposed model can generate an image that contains the desired objects in the correct locations. Generating an image from a layout is much more controllable and flexible than generating it from a textual description.

With the new task come new challenges. First, image generation from layout is a difficult one-to-many problem: many images can be consistent with a specified layout, and the same layout may be realized with different object appearances, or even different interactions (e.g., a person next to a frisbee may be throwing it or simply be a bystander; see Fig. 1). Second, the information conveyed by a bounding box and its label is very limited. The actual appearance of an object in an image is determined not only by its category and location, but also by its interactions and consistency with other objects. Moreover, spatially close objects may have overlapping bounding boxes, which leads to the additional challenge of "separating" which object should contribute to individual pixels. A good generative model should take all these factors and challenges into account, implicitly or explicitly.

We address these challenges using a novel variational inference approach. The representation of each object in the image is explicitly disentangled into a specified/certain part (category) and an unspecified/uncertain part (appearance). The category is encoded using a word embedding and the appearance is distilled into a low-dimensional vector sampled from a normal distribution. Based on this representation and the specification of the object bounding box, we construct a feature map for each object. These feature maps are then composed using a convolutional LSTM into a hidden feature map for the entire image, which is subsequently decoded into an output image. This set of modelling choices makes it easy to generate different and diverse images by sampling the appearance of individual objects, and/or by adding, moving or deleting objects from the layout. Our proposed model is learned end-to-end with a loss that consists of several objectives. Specifically, a pair of discriminators is designed to discriminate the overall generated image, and the generated objects within their specified bounding boxes, as real or fake. In addition, the object discriminator is also trained to classify the categories of generated objects.

Contributions Our contributions are four-fold:

  • We propose a novel approach for generating images from coarse layout (bounding boxes + object categories). This provides a flexible control mechanism for image generation;

  • By disentangling the representation of objects into a category and (sampled) appearance, our model is capable of generating a diverse set of consistent images from the same layout;

  • We propose an object-wise attention mechanism, which enables the network to model the shapes of different objects explicitly and therefore enhances the overall visual quality;

  • We show qualitative and quantitative results on the COCO-Stuff (Caesar et al. 2016) and Visual Genome (Krishna et al. 2017) datasets, demonstrating our model's ability to generate complex images that respect the specified object categories and layout (without access to segmentation masks (Hong et al. 2018; Johnson et al. 2018)). We also perform comprehensive ablations to validate each component of our approach.

Summary of Changes This work is an extension of our CVPR paper Image Generation from Layout. The main changes in this extended version are:

  • Explicitly define the loss functions;

  • Extend the Object Feature Map Composition module with Object-Wise Attention.

  • Add a new section to the related work on semantic image generation.

  • Add three new baselines for comparison: BicycleGAN, pix2pixHD and GauGAN (SPADE); report their performance in terms of Inception Score, FID, object classification accuracy and diversity score; and add qualitative results on the COCO and VG datasets.

  • Add four ablation experiments: (a) removing the path that generates I’ from the normal distribution; (b) removing both the latent code reconstruction loss and the KL loss; (c) replacing the convolutional LSTM with summation; (d) removing both the latent code reconstruction loss and the KL loss, and replacing the convolutional LSTM with summation.

  • Report FID scores for the ablation study in Table 6.

The source code of this work is available at https://github.com/zhaobozb/layout2im.

2 Related Work

2.1 Conditional Image Generation

Conditional image generation approaches generate images conditioned on additional input information, including an entire source image (Isola et al. 2017; Liu et al. 2017; Pathak et al. 2016; Yang et al. 2017; Zhao et al. 2018; Zhu et al. 2017a, b), sketches (Isola et al. 2017; Sangkloy et al. 2017; Wang et al. 2018; Xian et al. 2018; Zhu et al. 2017b), scene graphs (Johnson et al. 2018), dialogues (Kim et al. 2017; Sharma et al. 2018) and text descriptions (Mansimov et al. 2015; Reed et al. 2017; Tan et al. 2018; Zhang et al. 2017). Variational Autoencoders (VAEs) (Kingma and Welling 2014; Mansimov et al. 2015; Sohn et al. 2015), autoregressive models (Oord et al. 2016; van den Oord et al. 2016) and GANs (Isola et al. 2017; Mirza and Osindero 2014; Wang et al. 2018; Zhu et al. 2017a) are powerful tools for conditional image generation and have shown promising results. However, many previous generative models (Isola et al. 2017; Pathak et al. 2016; Sangkloy et al. 2017; Xian et al. 2018; Yang et al. 2017; Zhu et al. 2017a) tend to largely ignore the random noise vector when conditioning on the same context, making the generated images very similar to each other. By enforcing a bijective mapping between the latent and target spaces, BicycleGAN (Zhu et al. 2017b) encourages diversity in the images generated from the same input. Inspired by this idea, we also explicitly regress the latent codes that are used to generate the different objects.

2.2 Image Generation from Layout

Existing models usually use object keypoints to generate a specific type of image, e.g., human bodies (Ma et al. 2018b) or birds (Reed et al. 2016). However, these models cannot easily be generalized to arbitrary objects in general image generation. The use of layout in image generation is a relatively novel task. Layout usually serves as an intermediate representation between other input sources [e.g., text (Hong et al. 2018) or scene graphs (Johnson et al. 2018)] and the output images, or as a complementary feature for image generation based on context [e.g., text (Karacan et al. 2016; Reed et al. 2016; Tan et al. 2018), shape and lighting (Dosovitskiy et al. 2015)]. In Hong et al. (2018) and Johnson et al. (2018), instead of learning a direct mapping from a textual description/scene graph to an image, the generation process is decomposed into multiple individual steps: they first construct a semantic layout (bounding boxes + object shapes) from the input, and then convert it to an image using an image generator. Both of them can generate an image from a coarse layout together with a textual description/scene graph. However, Hong et al. (2018) requires detailed object instance segmentation masks to train its object shape generator, and obtaining such segmentation masks for large-scale datasets is both time-consuming and labor-intensive. Different from Hong et al. (2018) and Johnson et al. (2018), we use the coarse layout, without instance segmentation masks, as a fundamental input modality for diverse image generation.

2.3 Semantic Image Generation

The task of image generation from semantic label maps has received considerable attention recently. Pix2pix (Isola et al. 2017) used an encoder-decoder generator with a patch discriminator to translate input semantic label maps into images. Pix2pixHD (Wang et al. 2017) proposed a coarse-to-fine generator and multi-scale discriminators to generate high-resolution images. GauGAN (Park et al. 2019) used the input semantic label maps to generate normalization parameters for different layers. This line of work requires accurate semantic label maps as input, in which the shape and location of each object are encoded. Different from these works, our method only requires a coarse layout as input.

2.4 Disentangled Representations

Many papers (Chen et al. 2016; Cheung et al. 2015; Denton and Birodkar 2017; Lai et al. 2017; Lee et al. 2018; Ma et al. 2018a; Mathieu et al. 2016; Murez et al. 2018) have tried to learn disentangled representations as part of image generation. Disentangled representations model different factors of data variation, such as class-related and class-independent parts (Cheung et al. 2015; Lai et al. 2017; Lee et al. 2018; Mathieu et al. 2016; Murez et al. 2018). By manipulating the disentangled representations, images with different appearances can be generated easily. In Ma et al. (2018a), three factors (foreground, background and pose) are disentangled explicitly when generating person images. InfoGAN (Chen et al. 2016), DrNet (Denton and Birodkar 2017) and DRIT (Lee et al. 2018) learn disentangled representations in an unsupervised manner, either by maximizing mutual information (Chen et al. 2016) or via adversarial losses (Denton and Birodkar 2017; Lee et al. 2018). In our work, we explicitly separate the representation of each object into a category-related part and an appearance-related part, and only the bounding boxes and category labels are used during both training and testing.

3 Image Generation from Layout

The overall training pipeline of the proposed approach is illustrated in Fig. 2. Given a ground-truth image \({\mathbf {I}}\) and its corresponding layout \({\mathbf {L}}\), where each \({\mathbf {L}}_i = (x_i, y_i, h_i, w_i)\) contains the top-left coordinate, height and width of a bounding box, our model first samples two latent codes \({\mathbf {z}}_{ri}\) and \({\mathbf {z}}_{si}\) for each object instance \({\mathbf {O}}_i\). The code \({\mathbf {z}}_{ri}\) is sampled from the posterior \(Q({\mathbf {z}}_{r}|{\mathbf {O}}_i)\) conditioned on the object \({\mathbf {O}}_i\) cropped from the input image according to \({\mathbf {L}}_i\), while \({\mathbf {z}}_{si}\) is sampled from a normal prior distribution \({\mathcal {N}}({\mathbf {z}}_{s})\). Each object \({\mathbf {O}}_i\) also has a word embedding \({\mathbf {w}}_i\), which encodes its category label \(y_i\). Based on the latent codes \({\mathbf {z}}_i \in \{ {\mathbf {z}}_{ri}, {\mathbf {z}}_{si} \}\), the word embedding \({\mathbf {w}}_i\), and the layout \({\mathbf {L}}_i\), multiple object feature maps \({\mathbf {F}}_i\) are constructed and then fed into the object encoder and the objects fuser sequentially, producing a fused hidden feature map \({\mathbf {H}}\) that contains information from all specified objects. Finally, an image decoder D is used to simultaneously reconstruct the input ground-truth image, \(\hat{{\mathbf {I}}} = D({\mathbf {H}})\), and generate a new image \({\mathbf {I}}'\); the former comes from \({\mathbf {z}}_{r} = \{ {\mathbf {z}}_{ri} \}\) and the latter from \({\mathbf {z}}_{s} = \{ {\mathbf {z}}_{si} \}\). Notably, both resulting images match the layout of the training image. To make the mapping between a generated object \({\mathbf {O}}'_i\) and its sampled latent code \({\mathbf {z}}_{si}\) consistent, we make the object estimator regress the sampled latent codes \({\mathbf {z}}_{si}\) from the generated objects \({\mathbf {O}}'_i\) in \({\mathbf {I}}'\) at locations \({\mathbf {L}}_i\). To train the model adversarially, we also introduce a pair of discriminators, \(D_\mathrm {img}\) and \(D_\mathrm {obj}\), to classify the results at the image and object level as real or fake.

Fig. 2

Overview of our Layout2Im network for generating images from layout during training. The inputs to the model are the ground-truth image and its layout. The objects are first cropped from the input image according to their bounding boxes, and then processed with the object estimator to predict a latent code for each object. After that, multiple object feature maps are prepared by the object composer based on the latent codes and layout, and processed with the object encoder, objects fuser and image decoder to reconstruct the input image. An additional set of latent codes is also sampled from a normal distribution to generate a new image. Finally, objects in the generated images are used to regress the sampled latent codes. The model is trained adversarially against a pair of discriminators using a number of objectives. For clarity, we omit \(D_{\mathrm {obj}}\) for the objects cropped from \(\hat{{\mathbf {I}}}\)

Once the model is trained, it can generate a new image from a layout by sampling the object latent codes from the normal prior distribution \({\mathcal {N}}({\mathbf {z}}_{s})\), as illustrated in Fig. 1.

3.1 Object Latent Code Estimation

Object latent code posterior distributions are first estimated from the ground-truth image and used to sample object latent codes \({\mathbf {z}}_{ri} \sim {\mathcal {N}}(\mu ({\mathbf {O}}_i), \sigma ({\mathbf {O}}_i))\). These latent codes model the ambiguity in object appearance in the ground-truth image and play an important role in reconstructing the input image later.

Fig. 3

Object latent code estimation. Given the input image and its layout, the objects are first cropped and resized from the input image. Then the object estimator predicts a distribution for each object from the object crops, and multiple latent codes are sampled from the estimated distribution

Figure 3 illustrates the object latent code estimation process. First, each object \({\mathbf {O}}_i\) is cropped from the input image \({\mathbf {I}}\) according to its bounding box \({\mathbf {L}}_i\), and then resized via bilinear interpolation to fit the input dimensionality of the object estimator. The resized object crops are fed into the object estimator, which consists of several convolutional layers and two fully-connected layers, and which predicts the mean and variance of the posterior distribution for each input object \({\mathbf {O}}_i\). Finally, the predicted mean and variance are used to sample a latent code \({\mathbf {z}}_{ri}\) for the input object \({\mathbf {O}}_i\). We sample a latent code for every object in the input image.
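For concreteness, a minimal PyTorch-style sketch of this step is given below. The layer sizes and names are illustrative assumptions and do not necessarily match the released implementation; the sketch only covers cropping with bilinear resizing, the mean/log-variance prediction, and the reparameterized sampling of \({\mathbf {z}}_{ri}\).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ObjectEstimator(nn.Module):
    """Predicts a Gaussian posterior over the appearance code of an object crop
    (a sketch; layer sizes are illustrative, not the paper's exact architecture)."""

    def __init__(self, z_dim=64, crop_size=32):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(3, 64, 4, stride=2, padding=1), nn.ReLU(),    # 32 -> 16
            nn.Conv2d(64, 128, 4, stride=2, padding=1), nn.ReLU(),  # 16 -> 8
            nn.Conv2d(128, 256, 4, stride=2, padding=1), nn.ReLU()  # 8 -> 4
        )
        feat_dim = 256 * (crop_size // 8) ** 2
        self.fc_mu = nn.Linear(feat_dim, z_dim)
        self.fc_logvar = nn.Linear(feat_dim, z_dim)

    def forward(self, crops):                    # crops: (num_objects, 3, 32, 32)
        h = self.conv(crops).flatten(1)
        return self.fc_mu(h), self.fc_logvar(h)  # mean and log-variance


def crop_objects(image, boxes, crop_size=32):
    """Crop objects from one image and resize them with bilinear interpolation.
    image: (3, H, W); boxes: list of (x, y, h, w) in pixels."""
    crops = []
    for (x, y, h, w) in boxes:
        patch = image[:, y:y + h, x:x + w].unsqueeze(0)
        crops.append(F.interpolate(patch, size=(crop_size, crop_size),
                                   mode='bilinear', align_corners=False))
    return torch.cat(crops, dim=0)


def sample_latent(mu, logvar):
    """Reparameterized sample z_r ~ N(mu, sigma^2)."""
    return mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)
```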

3.2 Object Feature Map Composition

Given an object latent code \({\mathbf {z}}_i \in {\mathbb {R}}^{m}\) sampled from either the posterior or the prior (\({\mathbf {z}}_i \in \{ {\mathbf {z}}_{ri}, {\mathbf {z}}_{si} \}\)), the object category label \(y_i\) and the corresponding bounding box \({\mathbf {L}}_i\), the object composer module constructs a feature map \({\mathbf {F}}_i\) for each object \({\mathbf {O}}_i\). Each feature map \({\mathbf {F}}_i\) contains a region corresponding to \({\mathbf {L}}_i\) filled with the disentangled representation of that object, consisting of its identity and appearance.

Fig. 4

Object feature map composition. The object category is first encoded by a word embedding. The object feature map is then composed by filling the region within the object bounding box with the concatenation of the category embedding and the latent code. The rest of the feature map is all zeros. The symbol \(\bigoplus \) stands for vector concatenation, and \(\bigotimes \) denotes replicating the object representation within its bounding box

Figure 4 illustrates this module. The object category label \(y_i\) is first transformed into a corresponding word embedding \({\mathbf {w}}_i \in {\mathbb {R}}^{n}\), and then concatenated with the object latent code \({\mathbf {z}}_i\). This results in an object representation with two parts: the object embedding and the object latent code. Intuitively, the object embedding encodes the identity of the object, while the latent code encodes the appearance of a specific instance of that object. Jointly, these two components encode sufficient information to reconstruct a specific instance of the object in an image. The object feature map \({\mathbf {F}}_i\) is composed by simply filling the region within its bounding box with this object representation \(({\mathbf {w}}_i, {\mathbf {z}}_i) \in {\mathbb {R}}^{m+n}\). For each tuple \(\langle y_i, {\mathbf {z}}_i, {\mathbf {L}}_i \rangle \) encoding the object label, latent code and bounding box, we compose an object feature map \({\mathbf {F}}_i\). These object feature maps are downsampled by an object encoder network consisting of several convolutional layers. An object fuser module is then used to fuse all the downsampled object feature maps, producing a hidden feature map \({\mathbf {H}}\).
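A minimal sketch of this composition step is given below (names and sizes are illustrative; the released code may organize this differently):

```python
import torch

def compose_object_feature_map(w, z, box, map_size):
    """Compose an object feature map F_i by filling the bounding-box region
    with the concatenated (word embedding, latent code) vector.

    w:        (n,) category word embedding
    z:        (m,) object latent code
    box:      (x, y, h, w_box) in feature-map coordinates
    map_size: (H, W) spatial size of the feature map
    """
    x, y, h, w_box = box
    H, W = map_size
    rep = torch.cat([w, z], dim=0)                       # (m + n,) object representation
    fmap = torch.zeros(rep.size(0), H, W)                # rest of the map stays zero
    fmap[:, y:y + h, x:x + w_box] = rep.view(-1, 1, 1)   # replicate inside the box
    return fmap

# Example: a 64-dim embedding and 64-dim code placed in an 8x8 box of a 64x64 map.
fmap = compose_object_feature_map(torch.randn(64), torch.randn(64),
                                  box=(10, 20, 8, 8), map_size=(64, 64))
```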

Fig. 5

Object feature map composition with Object-Wise Attention (OWA). Instead of using only the object bounding box to construct the object feature map as in Fig. 4, we also predict an attention map for each object from its object embedding. Interpolating the attention map to fit the bounding box provides finer guidance for generating the target objects

3.3 Object Feature Map Composition with Object-Wise Attention (OWA)

Our original method produces the object feature map \({\mathbf {F}}_i\) by simply filling the region within its bounding box with the object representation. This method has a limitation: the shapes of all object classes are provided as rectangles, and the network needs to figure out the actual shapes implicitly in the subsequent fusion layers and decoder. To alleviate this problem, we propose Object-Wise Attention (OWA). Given the word embedding \({\mathbf {w}}_i\) representing each object class, an attention decoder M generates a corresponding object-wise attention mask \(a_i=M({\mathbf {w}}_i)\) for each object. We then fill each feature map \({\mathbf {F}}_i'\) with its object representation \(({\mathbf {w}}_i,{\mathbf {z}}_i)\) modulated by the corresponding attention mask \(a_i\). The shape of each object is thus modeled explicitly, so the network does not need to infer the shapes of different objects during the decoding process and can focus more on textures and coherence between objects, which enhances visual quality. This is validated in the following experiments (Fig. 5).
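The following sketch illustrates the OWA variant under the same assumptions as above; the attention decoder M is modeled here as a single linear layer with a sigmoid, which is an illustrative simplification rather than the paper's exact architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentionDecoder(nn.Module):
    """Maps a category word embedding to a per-object attention mask
    (a sketch; the mask resolution and layers are illustrative)."""

    def __init__(self, embed_dim=64, mask_size=16):
        super().__init__()
        self.mask_size = mask_size
        self.fc = nn.Linear(embed_dim, mask_size * mask_size)

    def forward(self, w):                                  # w: (n,)
        a = torch.sigmoid(self.fc(w))                      # values in (0, 1)
        return a.view(1, self.mask_size, self.mask_size)   # (1, s, s)


def compose_with_owa(w, z, box, map_size, attention_decoder):
    """Object feature map with object-wise attention: the attention mask is
    resized to the bounding box and modulates the replicated representation."""
    x, y, h, w_box = box
    H, W = map_size
    rep = torch.cat([w, z], dim=0).view(-1, 1, 1)          # (m + n, 1, 1)
    mask = attention_decoder(w).unsqueeze(0)               # (1, 1, s, s)
    mask = F.interpolate(mask, size=(h, w_box),            # fit the mask into the box
                         mode='bilinear', align_corners=False)[0]
    fmap = torch.zeros(rep.size(0), H, W)
    fmap[:, y:y + h, x:x + w_box] = rep * mask             # attention-weighted fill
    return fmap
```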

3.4 Object Feature Maps Fusion

Since the resulting image is decoded from it, a good hidden feature map \({\mathbf {H}}\) is crucial for generating a realistic image. The properties of a good hidden feature map can be summarized as follows: (i) it should encode all object instances in the desired locations; (ii) it should coordinate object representations based on the other objects in the image; (iii) it should be able to fill the unspecified regions, e.g., background, by implicitly reasoning about the plausibility of the scene with respect to the specified objects.

Fig. 6

Object feature maps fusion. A three-layer cLSTM is used to encode all object feature maps together

To satisfy these requirements, we choose a multi-layer convolutional Long Short-Term Memory (cLSTM) network (Shi et al. 2015) to fuse the downsampled object feature maps \({\mathbf {F}}\), as illustrated in Fig. 6. Different from the traditional LSTM (Hochreiter and Schmidhuber 1997), the hidden states and cell states in a cLSTM are feature maps rather than vectors, and the gates are computed by convolutional layers. Therefore, the cLSTM can better preserve spatial information compared with the traditional vector-based LSTM. The cLSTM acts as an encoder that integrates the object feature maps one by one, and its last output is used as the fused hidden layout \({\mathbf {H}}\), which incorporates the location and category information of all objects.
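Below is a single-cell sketch of this fusion step (the paper stacks three cLSTM layers; channel counts and spatial sizes here are illustrative assumptions):

```python
import torch
import torch.nn as nn

class ConvLSTMCell(nn.Module):
    """Minimal convolutional LSTM cell: states are feature maps and the gates
    are computed with convolutions (a sketch in the spirit of Shi et al. 2015)."""

    def __init__(self, in_ch, hid_ch, ksize=3):
        super().__init__()
        self.hid_ch = hid_ch
        self.gates = nn.Conv2d(in_ch + hid_ch, 4 * hid_ch, ksize, padding=ksize // 2)

    def forward(self, x, state):
        h, c = state
        i, f, o, g = self.gates(torch.cat([x, h], dim=1)).chunk(4, dim=1)
        i, f, o, g = torch.sigmoid(i), torch.sigmoid(f), torch.sigmoid(o), torch.tanh(g)
        c = f * c + i * g
        h = o * torch.tanh(c)
        return h, c


def fuse_object_feature_maps(obj_maps, cell):
    """Feed the downsampled object feature maps into the cLSTM one by one and
    return the last hidden state as the fused hidden feature map H."""
    B, _, H, W = obj_maps[0].shape
    h = torch.zeros(B, cell.hid_ch, H, W)
    c = torch.zeros_like(h)
    for f in obj_maps:                 # one object feature map per step
        h, c = cell(f, (h, c))
    return h

# Example with three objects and 128-channel maps of size 8x8 (illustrative sizes).
cell = ConvLSTMCell(in_ch=128, hid_ch=128)
H_fused = fuse_object_feature_maps([torch.randn(1, 128, 8, 8) for _ in range(3)], cell)
```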

3.5 Image Decoder

Given the fused image hidden feature map \({\mathbf {H}}\), the image decoder is tasked with generating a resulting image. As shown in Fig. 2, there are two paths (blue and red) in the network, which differ in how the latent codes are obtained. The blue path reconstructs the input image using the object latent codes \({\mathbf {z}}_r\) sampled from the posteriors \(Q({\mathbf {z}}_r | {\mathbf {O}})\) conditioned on the objects \({\mathbf {O}}\) in the input image \({\mathbf {I}}\), while in the red path, the latent codes \({\mathbf {z}}_s\) are sampled directly from the prior distribution \({\mathcal {N}}({\mathbf {z}}_s)\). As a result, two images, \(\hat{{\mathbf {I}}}\) and \({\mathbf {I}}'\), are generated through the blue and red paths, respectively. Although they may differ in appearance, both share the same layout.

3.6 Object Latent Code Regression

To explicitly encourage a consistent connection between the latent codes and the outputs, our model also tries to recover the randomly sampled latent codes from the objects generated along the red path. One can think of this as an inference network for the latent codes. This helps prevent a many-to-one mapping from latent codes to outputs during training and, as a result, produces more diverse results.

To achieve this, we use the same input object bounding boxes \({\mathbf {L}}\) to crop the objects \({\mathbf {O}}'\) from the generated image \({\mathbf {I}}'\). The resized \({\mathbf {O}}'\) are then fed into an object latent code estimator (which shares weights with the one used in the image reconstruction path), producing estimated mean and variance vectors for the generated objects. We directly use the computed mean vectors as the regressed latent codes \({\mathbf {z}}'_s\) and compare them with the sampled codes \({\mathbf {z}}_s\) for all objects.

3.7 Image and Object Discriminators

To make the generated images realistic and the objects recognizable, we adopt a pair of discriminators \(D_{\mathrm {img}}\) and \(D_{\mathrm {obj}}\). Each discriminator is trained to classify an input image or object as real or fake, while the generator network is trained to fool the discriminators.

The image discriminator \(D_{\mathrm {img}}\) is applied to the input images \({\mathbf {I}}\), the reconstructed images \(\hat{{\mathbf {I}}}\) and the sampled images \({\mathbf {I}}'\), classifying them as real or fake. The object discriminator \(D_{\mathrm {obj}}\) is designed to assess the quality and category of the real objects \({\mathbf {O}}\), the reconstructed objects \(\hat{{\mathbf {O}}}\) and the sampled objects \({\mathbf {O}}'\) at the same time. In addition, since \(\hat{{\mathbf {O}}}\) and \({\mathbf {O}}'\) are cropped from the reconstructed/sampled images according to the input bounding boxes \({\mathbf {L}}\), \(D_{\mathrm {obj}}\) also encourages the generated objects to appear in their desired locations.
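As an illustration, a sketch of an object discriminator with an auxiliary classification head is shown below; the text does not specify the architecture at this level of detail, so the layers here are assumptions. The image discriminator would be analogous, operating on full images and without the classification head.

```python
import torch
import torch.nn as nn

class ObjectDiscriminator(nn.Module):
    """Object discriminator sketch with an auxiliary classification head:
    one branch scores real/fake, the other predicts the object category."""

    def __init__(self, num_classes, crop_size=32):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 64, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
            nn.Conv2d(64, 128, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
            nn.Conv2d(128, 256, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
        )
        feat_dim = 256 * (crop_size // 8) ** 2
        self.real_fake = nn.Linear(feat_dim, 1)              # adversarial head
        self.classifier = nn.Linear(feat_dim, num_classes)   # auxiliary classifier

    def forward(self, crops):                # crops: (num_objects, 3, 32, 32)
        h = self.features(crops).flatten(1)
        return self.real_fake(h), self.classifier(h)
```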

3.8 Loss Function

We train the generator network and the two discriminator networks end-to-end in an adversarial manner. The generator network, with all the described components, is trained to minimize a weighted sum of six losses:

KL Loss computes the KL-Divergence between the distribution \(Q({\mathbf {z}}_r|{\mathbf {O}})\) and the normal distribution \({\mathcal {N}}({\mathbf {z}}_r)\), which is defined as:

$$\begin{aligned} {\mathcal {L}}_{\mathrm {KL}} = \sum _{i=1}^{o}{\mathbb {E}}[{\mathcal {D}}_{\mathrm {KL}}(Q({\mathbf {z}}_{ri}|{\mathbf {O}}_i)||{\mathcal {N}}({\mathbf {z}}_r))], \end{aligned}$$
(1)

where o is the number of objects in the image/layout, \({\mathbf {O}}\) are the cropped objects, and \({\mathbf {z}}_{ri}\) are the appearance codes of the objects.

Image Reconstruction Loss penalizes the difference between ground-truth image \({\mathbf {I}}\) and reconstructed image \(\hat{{\mathbf {I}}}\). In our paper, \({\mathcal {L}}_1\) distance is chosen as shown in Eq. (2).

$$\begin{aligned} {\mathcal {L}}_1^{\mathrm {img}} = ||{\mathbf {I}}-\hat{{\mathbf {I}}}||_1. \end{aligned}$$
(2)

Object Latent Code Reconstruction Loss encourages the connection between specific appearance and the latent code to be invertible, which is defined as the \({\mathcal {L}}_1\) distance between the randomly sampled \({\mathbf {z}}_{s} \sim N({\mathbf {z}}_s)\) and the re-estimated \({\mathbf {z}}'_s\) from the generated objects \({\mathbf {O}}'\):

$$\begin{aligned} {\mathcal {L}}_1^{\mathrm {latent}} = \sum _{i=1}^{o}||{\mathbf {z}}_{si}-\mathbf {z'}_{si}||_1. \end{aligned}$$
(3)

Image Adversarial Loss encourages the model to generate realistic images, which is defined as:

$$\begin{aligned}&{\mathcal {L}}_{\mathrm {adv}}^{\mathrm {img}} = \underset{{\mathbf {I}}\sim p_\mathrm {real}}{{\mathbb {E}}}\log D_{\mathrm {img}}({\mathbf {I}})+ 0.5 \underset{\hat{{\mathbf {I}}}\sim {\hat{p}}_\mathrm {fake}}{{\mathbb {E}}}\log (1-D_{\mathrm {img}}(\hat{{\mathbf {I}}}))\nonumber \\&\qquad \qquad + 0.5 \underset{\mathbf {I'}\sim p'_\mathrm {fake}}{{\mathbb {E}}}\log (1-D_{\mathrm {img}}(\mathbf {I'})), \end{aligned}$$
(4)

where \({\mathbf {I}}\) is the ground truth image, \(\hat{{\mathbf {I}}}\) is the reconstructed image and \(\mathbf {I'}\) is the sampled image.

Object Adversarial Loss encourages the model to generate realistic objects within an image, which is defined as:

$$\begin{aligned}&{\mathcal {L}}_{\mathrm {adv}}^{\mathrm {obj}} = \underset{{\mathbf {O}}\sim p_\mathrm {real}}{{\mathbb {E}}}\log D_{\mathrm {obj}}({\mathbf {O}})\nonumber \\&\qquad \qquad + 0.5 \underset{\hat{{\mathbf {O}}}\sim {\hat{p}}_\mathrm {fake}}{{\mathbb {E}}}\log (1{-}D_{\mathrm {obj}}(\hat{{\mathbf {O}}}))\nonumber \\&\qquad \qquad + 0.5\underset{{\mathbf {O}}'\sim p'_\mathrm {fake}}{{\mathbb {E}}}\log (1-D_{\mathrm {obj}}({\mathbf {O}}')), \end{aligned}$$
(5)

where \({\mathbf {O}}\) are the objects cropped from the ground truth image \({\mathbf {I}}\), \(\hat{{\mathbf {O}}}\) and \({\mathbf {O}}'\) are objects cropped from the reconstructed image \(\hat{{\mathbf {I}}}\) and sampled image \({\mathbf {I}}'\), respectively.

Auxiliary Classification Loss is adopted to classify the generated objects \(\hat{{\mathbf {O}}}\) and \({\mathbf {O}}'\). It encourages them to be recognizable as their corresponding categories. The auxiliary classification loss is defined as:

$$\begin{aligned} {\mathcal {L}}_\mathrm {AC}^{\mathrm {obj}} = {\mathbb {E}}_{\hat{{\mathbf {O}}},c}[-\log D_\mathrm {obj}(c|\hat{{\mathbf {O}}})]+ {\mathbb {E}}_{{\mathbf {O}}',c}[-\log D_\mathrm {obj}(c|{\mathbf {O}}')], \end{aligned}$$
(6)

where c is the object class label.

Therefore, the final loss function of our model is defined as:

$$\begin{aligned} {\mathcal {L}}&= \lambda _1 {\mathcal {L}}_{\mathrm {KL}} + \lambda _2 {\mathcal {L}}_{1}^{\text {img}} + \lambda _3{\mathcal {L}}_{1}^{\mathrm {latent}} + \lambda _4 {\mathcal {L}}_{\mathrm {adv}}^\mathrm {img} \nonumber \\&\quad + \lambda _5 {\mathcal {L}}_{\mathrm {adv}}^\mathrm {obj} + \lambda _6 {\mathcal {L}}_{\mathrm {AC}}^{\mathrm {obj}}, \end{aligned}$$
(7)

where \(\lambda _1 \sim \lambda _6\) are the parameters balancing different losses.
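A sketch of how these six terms could be assembled on the generator side is given below. The adversarial terms are written in the non-saturating binary cross-entropy form rather than the exact min-max expressions of Eqs. (4) and (5), the default weights follow Sect. 3.9, and all variable names are illustrative; this is not the released implementation.

```python
import torch
import torch.nn.functional as F

def kl_term(mu, logvar):
    """Closed-form KL( N(mu, diag(sigma^2)) || N(0, I) ), summed over objects (Eq. 1)."""
    return 0.5 * torch.sum(mu.pow(2) + logvar.exp() - logvar - 1.0)

def generator_loss(I, I_hat, z_s, z_s_hat, mu_r, logvar_r,
                   d_img_rec, d_img_gen, d_obj_rec, d_obj_gen,
                   cls_logits_rec, cls_logits_gen, obj_labels,
                   lambdas=(0.01, 1.0, 10.0, 1.0, 1.0, 1.0)):
    """Weighted sum of the six objectives in Eq. (7), seen from the generator's side.
    d_* are discriminator logits on reconstructed/sampled images and objects."""
    l_kl = kl_term(mu_r, logvar_r)                               # Eq. 1
    l_img = F.l1_loss(I_hat, I)                                  # Eq. 2
    l_latent = F.l1_loss(z_s_hat, z_s)                           # Eq. 3
    # adversarial terms: try to make the discriminators label the outputs as real
    real = lambda d: F.binary_cross_entropy_with_logits(d, torch.ones_like(d))
    l_adv_img = 0.5 * (real(d_img_rec) + real(d_img_gen))        # cf. Eq. 4
    l_adv_obj = 0.5 * (real(d_obj_rec) + real(d_obj_gen))        # cf. Eq. 5
    # auxiliary classification of reconstructed and sampled objects (Eq. 6)
    l_ac = (F.cross_entropy(cls_logits_rec, obj_labels)
            + F.cross_entropy(cls_logits_gen, obj_labels))
    l1, l2, l3, l4, l5, l6 = lambdas
    return (l1 * l_kl + l2 * l_img + l3 * l_latent
            + l4 * l_adv_img + l5 * l_adv_obj + l6 * l_ac)
```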

Table 1 Statistics of the COCO-Stuff and Visual Genome datasets

3.9 Implementation Details

We use SN-GAN (Miyato et al. 2018) for stable training. Batch normalization (Ioffe and Szegedy 2015) and ReLU are used in the object encoder and image decoder, while only ReLU is used in the discriminators (no batch normalization). Conditional batch normalization (de Vries et al. 2017) is used in the object estimator to better normalize the object feature maps according to their categories. After the object fuser, we use six residual blocks (He et al. 2016) to further refine the hidden image feature map. We set both \(m\) and \(n\) to 64. The image and crop sizes are set to 64 \(\times \) 64 and 32 \(\times \) 32, respectively. The weights \(\lambda _1 \sim \lambda _6\) are set to 0.01, 1, 10, 1, 1 and 1, respectively.

We train all models using Adam (Kingma and Ba 2014) with a learning rate of 0.0001 and a batch size of 8 for 300,000 iterations; training takes about 3 days on a single Titan Xp GPU. Full details about our architecture can be found in the “Appendix”, and our code is publicly available.
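For reference, the reported training hyper-parameters can be summarized in a small configuration sketch (the Adam betas are left at PyTorch defaults, which the text does not specify):

```python
import torch

TRAIN_CFG = {
    "lr": 1e-4,            # Adam learning rate
    "batch_size": 8,
    "iterations": 300_000,
    "image_size": 64,      # images are 64 x 64
    "crop_size": 32,       # object crops are 32 x 32
    "z_dim": 64,           # m: latent code dimension
    "embed_dim": 64,       # n: word embedding dimension
    "lambdas": (0.01, 1.0, 10.0, 1.0, 1.0, 1.0),  # loss weights lambda_1..lambda_6
}

def make_optimizer(parameters, cfg=TRAIN_CFG):
    """Adam optimizer as described in Sect. 3.9 (betas not specified in the paper)."""
    return torch.optim.Adam(parameters, lr=cfg["lr"])
```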

4 Experiments

Extensive experiments are conducted to evaluate the proposed Layout2Im network. We first compare our proposed method and its OWA extension with previous state-of-the-art models for scene image synthesis, and show their superiority in terms of realism, recognizability and diversity. Finally, the contributions of the individual losses and components are studied through ablations.

4.1 Datasets

Following the previous scene image generation method (Johnson et al. 2018), we evaluate our proposed model on the COCO-Stuff (Caesar et al. 2016) and Visual Genome (Krishna et al. 2017) datasets. We preprocess and split the two datasets in the same way as Johnson et al. (2018). Table 1 lists the dataset statistics. Each image in these datasets has multiple bounding box annotations with object labels.

4.2 Baselines

We compare our approach with several state-of-the-art methods.

pix2pix (Isola et al. 2017) translates images between two domains. In this paper, we define the input domain as feature maps constructed from the layout \({\mathbf {L}}\), and set real images as the output domain. We construct an input feature map of size \(C\times H\times W\) for each layout \({\mathbf {L}}\), where C is the number of object categories and \(H\times W\) is the image size. For an object with label \(y_i\) and bounding box \({\mathbf {L}}_i\), the corresponding region within the channel for category \(y_i\) is set to 1; all other entries are 0. The pix2pix model is trained to translate these feature maps into real images.
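A sketch of this input construction (the function name is illustrative):

```python
import torch

def layout_to_input_map(layout, num_categories, image_size=64):
    """Build the C x H x W input used for the pix2pix-style baselines: the
    channel of each object's category is set to 1 inside its bounding box
    and 0 elsewhere.

    layout: list of (category_id, (x, y, h, w)) tuples in pixel coordinates.
    """
    fmap = torch.zeros(num_categories, image_size, image_size)
    for category_id, (x, y, h, w) in layout:
        fmap[category_id, y:y + h, x:x + w] = 1.0
    return fmap

# Example: two objects on a 64 x 64 canvas with 10 categories.
inp = layout_to_input_map([(3, (5, 10, 20, 30)), (7, (40, 8, 16, 16))], num_categories=10)
```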

BicycleGAN (Zhu et al. 2017b) models a distribution of possible outputs in a conditional generative modeling setting where a single input may correspond to multiple possible outputs. The ambiguity of the mapping is represented as a low-dimensional latent vector, which is combined with the input and translated into the output. It explicitly encourages the connection between the output and the latent code to be invertible, which promotes diversity in the translated images. In our paper, we construct the input in the same way as for pix2pix and generate real images.

Fig. 7

Examples of 64 \(\times \) 64 generated images from complex layouts on COCO-Stuff by our proposed method and the baselines. For each example, we show the input layout, images generated by pix2pix, BicycleGAN, sg2im, pix2pixHD and GauGAN, the results of our method, the results of our method with object-wise attention (OWA), and the ground-truth image from the dataset. Please zoom in to see the category of each object. Best viewed in color

Fig. 8

Examples of 64 \(\times \) 64 generated images from complex layouts on Visual Genome by our proposed method and the baselines. For each example, we show the input layout, images generated by pix2pix, BicycleGAN, sg2im, pix2pixHD and GauGAN, the results of our method, the results of our method with object-wise attention (OWA), and the ground-truth image from the dataset. Please zoom in to see the category of each object. Best viewed in color

Fig. 9

More Qualitative Examples on COCO Dataset. Please zoom in to see the categories. Best viewed in color

Fig. 10

More Qualitative Example Results on Visual Genome Dataset. Please zoom in to see the categories. Best viewed in color

sg2im (Johnson et al. 2018) was originally trained to generate images from scene graphs. However, it can also generate images from layout by simply replacing the predicted layout with the ground-truth layout. We list the Inception Score of sg2im using ground-truth layouts as reported in their paper, and generate the results for the other comparisons using their released model trained with ground-truth layouts. In other words, the input and training data for our model and sg2im are identical.

pix2pixHD (Wang et al. 2017) produces realistic images from given semantic label maps. It uses multi-scale patch-wise discriminators and a multi-scale generator to generate high-resolution images. To manipulate objects with different input style vectors, it uses an encoder-decoder to generate latent vectors at each spatial location and performs instance-wise average pooling for each instance to obtain a style vector, which can then be used to control the style of individual objects, such as their colors. As only coarse layouts are provided, we use the same setting as for pix2pix, where the input is a semantic map translated from the layout.

GauGAN (Park et al. 2019) is proposed to synthesize high-resolution images with realistic details. The authors observe that ordinary batch normalization tends to wash away semantic information in each layer, and propose spatially-adaptive normalization to alleviate this problem. Different from previous conditional normalization layers, their normalization layer applies a spatially varying affine transformation, making it suitable for image synthesis from spatially varying semantic masks. We use the same setting as for pix2pix, with the input converted from the layout.

4.3 Evaluation Metrics

Plausible images generated from layout should meet three requirements: they should be realistic, recognizable and diverse. We therefore choose four metrics: Inception Score (IS) (Salimans et al. 2016), Fréchet Inception Distance (FID) (Heusel et al. 2017), Object Classification Accuracy (Accu.) and Diversity Score (DS) (Zhang et al. 2018).

Inception Score (Salimans et al. 2016) is adopted to measure the quality, as well as diversity, of generated images. In our paper, we use the pre-trained VGG-net (Simonyan and Zisserman 2014) as the base model to compute the inception scores for our model and the baselines.

Fréchet Inception Distance (Heusel et al. 2017) uses second-order statistics of the final-layer features of the Inception model to measure the similarity between generated and real images. It is more robust to noise than the Inception Score.

Classification Accuracy measures the ability to generate recognizable objects, which is an important criterion for our task. We first train a ResNet-101 model (He et al. 2016) to classify objects, using the real objects cropped and resized from the ground-truth images in the training set of each dataset. We then compute and report the object classification accuracy for objects in the generated images.
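A sketch of the classifier setup, assuming the standard torchvision ResNet-101 with its final layer replaced (training details beyond the use of real object crops are not specified in the text):

```python
import torch
import torchvision

def build_object_classifier(num_classes):
    """ResNet-101 with the final fully-connected layer replaced so that it
    predicts the dataset's object categories; trained on real object crops."""
    model = torchvision.models.resnet101()
    model.fc = torch.nn.Linear(model.fc.in_features, num_classes)
    return model
```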

Diversity Score computes the perceptual similarity between two images in a deep feature space. Different from the Inception Score, which reflects the diversity across the entire set of generated images, the diversity score measures the difference between a pair of images generated from the same input. We use the LPIPS metric (Zhang et al. 2018) as the diversity score, with AlexNet (Krizhevsky et al. 2012) for feature extraction as suggested in that paper.
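A sketch of this metric using the publicly available lpips package with an AlexNet backbone (the pairing and averaging scheme here is an illustrative assumption; inputs are expected in [-1, 1]):

```python
import torch
import lpips  # pip install lpips

# LPIPS distance with the AlexNet backbone, as used for the diversity score.
lpips_alex = lpips.LPIPS(net='alex')

def diversity_score(img_pairs):
    """img_pairs: list of (img_a, img_b) tensors of shape (1, 3, H, W),
    each pair generated from the same layout."""
    with torch.no_grad():
        dists = [lpips_alex(a, b).item() for a, b in img_pairs]
    return sum(dists) / len(dists)
```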

4.4 Qualitative Results

Figures 7 and 8 show images generated by our method and the baselines on the COCO and Visual Genome datasets, respectively. These examples make it clear that our method can generate complex images with multiple objects, and even multiple instances of the same object type. For example, Fig. 7(a) shows two boats, (c) shows two cows, and (e) and Fig. 8(r) contain two people. More qualitative results of our model on the COCO and Visual Genome datasets are shown in Figs. 9 and 10, respectively. These examples also show that our method generates images that respect the location constraints of the input bounding boxes, and that the generated objects are recognizable and consistent with their input labels.

As can be seen in Figs. 7 and 8, pix2pix fails to generate meaningful images, owing to the extreme difficulty of directly mapping a layout to a real image without detailed instance segmentation. Pix2pixHD and GauGAN produce more meaningful outputs, but the shapes of some objects do not look realistic due to the lack of accurate segmentation; for example, in Fig. 7(g) the giraffe looks like a rectangle. The results generated by BicycleGAN and sg2im are also not as good as ours. For example, in Fig. 7(g), (i), the generated giraffe and zebra are difficult to recognize, and (l) contains many artifacts, making the result look unrealistic. We also provide results of the OWA extension: the overall quality is better than that of our original method, and the shape of each object is more reasonable. As shown in Fig. 7(f), the shape of the airplane looks more plausible thanks to the attention representation, which better captures the shape of each object.

In Fig. 11 we demonstrate our model's ability to generate complex images by starting with a simple layout and progressively adding new bounding boxes or moving existing ones, e.g., (g) and (k), to build or manipulate a complex image. From these examples we can see that new objects are drawn in the images at the desired locations, and existing objects remain consistent as new content is added.

Figure 12 shows diverse results generated from the same layouts. Given that the same layout may have many different possible real-image realizations, the ability to sample diverse images is a key advantage of our model.

4.5 Quantitative Results

Table 2 Performance on COCO and VG in Inception Score (IS)
Table 3 Performance on COCO and VG in Fréchet Inception Distance (FID)
Table 4 Performance on COCO and VG in Object Classification Accuracy
Table 5 Performance on COCO and VG in Diversity Score

Tables 2, 3, 4 and 5 summarize the comparison between the baseline models and our model in terms of Inception Score, Fréchet Inception Distance (FID), object classification accuracy and diversity score. We also report the Inception Score and object classification accuracy on real images.

Our proposed method significantly outperforms the baselines on all evaluation metrics except the Diversity Score. In terms of Inception Score and Fréchet Inception Distance, our method outperforms the existing approaches by a substantial margin, presumably because it generates more recognizable objects, as evidenced by the object classification accuracy. By adding the object-wise attention mechanism, we see a boost in Inception Score on the COCO dataset and comparable performance on the VG dataset; with OWA, the FID scores and the object classification accuracy also improve on both datasets. Note that the object classification accuracy on real images is not an upper bound on this metric, since an object that cannot be classified correctly in a real image is not necessarily difficult to distinguish in a generated image. Pix2pix is deterministic, so its diversity score is zero. We perform multimodal sampling for pix2pixHD; although its diversity is high, as shown in Table 5, the image quality remains low for different style codes, because during training the network cannot learn meaningful style codes due to the inaccurate correspondence between bounding boxes and the ground-truth image. We can sample diverse images from GauGAN by using a VAE to provide global style information, but we cannot control the style of each object with this global style code. Since BicycleGAN explicitly pursues diversity, it generates more diverse results and has the highest diversity score. However, the objects generated by BicycleGAN are hard to recognize, as shown in Figs. 7 and 8, which leads to poor performance on the remaining evaluation metrics. By adding global noise to the scene layout, sg2im can generate images with limited diversity. The diversity scores show that our method can generate diverse results from the same layout; a very notable improvement is on COCO, where we achieve a diversity score of 0.15 compared to 0.02 for sg2im.

Fig. 11

Examples of images generated by adding or moving bounding boxes based on the previous layout. Three groups of images, (a)–(c), (d)–(g) and (h)–(k), are shown. In (g) and (k), the original bounding boxes are drawn with dashed lines. Please zoom in to see the category of each object

Fig. 12

Examples of diverse images generated from the same layouts. For each layout, we sample 3 images. The generated images have different appearances but share the same layout. Please zoom in to see the category of each object

4.6 Ablation Study

We demonstrate the necessity of all components of our model by comparing the Inception Score, FID, object classification accuracy and diversity score of several ablated versions of our model trained on the Visual Genome dataset:

  • w/o \({\mathcal {L}}_1^\mathrm {img}\) reconstructs ground truth images without pixel regression;

  • w/o \({\mathcal {L}}_1^\mathrm {latent}\) does not regress the latent codes which are used to generate objects in the resulting images;

  • w/o \({\mathcal {L}}_\mathrm {AC}^{\mathrm {obj}}\) does not classify the category of objects;

  • w/o \({\mathcal {L}}_\mathrm {adv}^\mathrm {img}\) removes the image adversarial loss when training the model;

  • w/o \({\mathcal {L}}_\mathrm {adv}^\mathrm {obj}\) removes the object adversarial loss when training the model;

  • w/o \({\mathbf {I}}'\) removes the path which generates \({\mathbf {I}}'\) from the prior distribution;

  • w/o \({\mathcal {L}}_1^\mathrm {latent}\) and \({\mathcal {L}}_\mathrm {KL}\) removes both the KL loss and the latent code regression loss;

  • w/o cLSTM replaces the cLSTM in the object fusion module with a simple summation of the object feature maps;

  • w/o \({\mathcal {L}}_1^\mathrm {latent}\), \({\mathcal {L}}_\mathrm {KL}\) and cLSTM removes both the KL loss and the latent code regression loss, and also replaces the cLSTM in the object fusion module with a simple summation of the object feature maps.

Table 6 Ablation study of our model on Visual Genome dataset by removing different objectives

As shown in Table 6, removing any loss term decreases the overall performance. Specifically, the model trained without \({\mathcal {L}}_1^\text {img}\) or \({\mathcal {L}}_1^\text {latent}\) generates less realistic images, which decreases the inception score, while the object classification accuracy remains high because of the object classification loss. Without the constraint on reconstructed images or latent codes, the models obtain lower inception scores but similar diversity scores. Removing the object classification loss degrades the inception score and object classification accuracy significantly, since the model cannot generate recognizable objects; not surprisingly, this freedom results in a higher diversity score. As expected, removing the adversarial loss on images or objects decreases the inception score substantially. However, the object classification accuracy increases further compared to the full model; we believe that without the realism requirement on images or objects, the object classification loss can be satisfied by adversarial, unrealistic patterns. Without explicitly sampling images from the prior distribution, the model is only learned from real data, and a random latent code cannot yield realistic images at test time since the KL loss may not be well optimized during training. Without both the KL divergence loss and the latent code regression, the model is not forced to encode the appearance of different objects into the prior distribution and cannot generate diverse object appearances from the prior, which degrades both the inception score and the FID. The performance drops on all four metrics when the cLSTM in the object feature maps fusion module is replaced with a simple summation of feature maps, especially the object classification accuracy: since object bounding boxes usually overlap, adding different object vectors together in the overlapping regions may confuse the image decoder and make it harder to generate recognizable objects. We observe a similar decrease in object classification accuracy when we remove both the KL divergence loss and the latent code regression and replace the cLSTM with summation. Trained with all the losses, our full model achieves a good balance across all four metrics.

5 Conclusion

In this paper we have introduced an end-to-end method for generating diverse images from layout (bounding boxes + categories). Our method generates reasonable images that look realistic and contain recognizable objects at the desired locations. We also showed that the image generation process can be controlled easily by adding or moving objects in the layout. To further improve the overall visual quality, we proposed an extension of our method called object-wise attention, which helps the network model the shapes of different object classes. Qualitative and quantitative results on the COCO-Stuff (Caesar et al. 2016) and Visual Genome (Krishna et al. 2017) datasets demonstrate our model's ability to generate realistic complex images. Generating high-resolution images from layouts will be our future work. Moreover, making the image generation process more controllable, for example by specifying fine-grained attributes of instances, is an interesting future direction.