1 Introduction

Generating images from text descriptions has been an active research topic in computer vision. Allowing users to describe visual concepts in natural language provides a natural and flexible interface for conditioning image generation. Moreover, the task probes the model's understanding of visual concepts, since the model must synthesize images that match the text description. Recently, approaches based on conditional Generative Adversarial Networks (GANs) have shown promising results on the text-to-image synthesis task [6, 23, 25, 28, 38,39,40]. Conditioning both the generator and the discriminator on text allows these approaches to generate realistic images that are both diverse and relevant to the input text. Building on the conditional GAN framework, recent approaches further improve generation quality by producing high-resolution images [38,39,40] or by improving the text conditioning [4, 7, 28].

Fig. 5.1. Overall framework of the proposed algorithm. Given a text description, our algorithm sequentially constructs a semantic structure of a scene and generates an image conditioned on the inferred layout and text. Best viewed in color.

However, the success of existing approaches has been limited to simple datasets, such as birds [37] and flowers [19], while the generation of complicated real-world images, such as those in MS-COCO [15], remains an open challenge. As illustrated in Fig. 5.1, generating an image from a general sentence such as “people riding on elephants that are walking through a river” requires reasoning over multiple visual concepts, such as object categories (people and elephants), the spatial configuration of objects (riding), and scene context (walking through a river), which is far more complex than generating a single large object as in simpler datasets [19, 37]. Existing approaches have not been successful in generating reasonable images for such complex text descriptions because learning a direct text-to-pixel mapping from general images is difficult. In addition, deep generative models often lack mechanisms to introspect their generation process, which limits interpretability: the learned mapping cannot be understood from the output images alone, especially when the generation quality is limited (Fig. 5.1).

Instead of learning a direct mapping from text to image, we propose an alternative approach that constructs a semantic layout as an intermediate representation between text and image. The semantic layout defines the structure of the scene based on object instances and provides fine-grained information about the scene, such as the number of objects and their categories, locations, sizes, and shapes (Fig. 5.1). Introducing a mechanism that explicitly aligns the semantic structure of an image to the text allows the proposed method to generate complicated images from complex text descriptions. In addition, conditioning image generation on the semantic structure allows our model to generate images that are semantically more meaningful, recognizable, and interpretable.

Our model for hierarchical text-to-image synthesis consists of two parts: a layout generator that constructs a semantic label map from a text description, and an image generator that converts the estimated layout to an image using the text. Since learning a direct mapping from text to a fine-grained semantic layout is challenging, we further decompose the task into two manageable subtasks: estimating the bounding box layout of an image using a box generator, and then refining the shape of each object inside its box using a shape generator. The generated layout is then used to guide the image generator for pixel-level synthesis. The box generator, shape generator, and image generator are implemented as independent neural networks and trained in parallel with the corresponding supervision.

Generating a semantic layout improves the quality of text-to-image synthesis and provides several potential benefits. First, the predicted semantic layout provides an interpretable intermediate output of the text-to-image mapping, which helps humans understand how the model converts text to images. Second, it offers an interactive interface for controlling the image generation process; users can modify the semantic layout to generate a desired image by removing or adding objects, changing the size and location of objects, and so on. Third, the semantic layout provides instance-wise annotations on generated images, which can be directly exploited for automated scene parsing and object retrieval. The contributions of this paper are as follows:

  • We propose a novel approach for synthesizing images from complicated text descriptions. Our model explicitly constructs a semantic layout from the text description and guides image generation using the inferred semantic layout.

  • Conditioning image generation on explicit layout prediction allows our method to generate images that are semantically meaningful and well-aligned with input descriptions.

  • We conduct extensive quantitative and qualitative evaluations on the challenging MS-COCO dataset and demonstrate substantial improvement on generation quality over existing works.

The remainder of the paper is organized as follows: we briefly review related work in Sect. 5.2 and provide an overview of the proposed approach in Sect. 5.3. Our models for layout and image generation are introduced in Sects. 5.4 and 5.5, respectively. We discuss the experimental results on the MS-COCO dataset in Sect. 5.6.

2 Related Work

Generating images from text descriptions has recently drawn a lot of attention from the research community. Various approaches have formulated the task as a conditional image generation problem based on Variational Auto-Encoders (VAE) [16], auto-regressive models [24], optimization techniques [18], etc. Recently, approaches based on conditional GANs [8] have shown promising results in text-to-image synthesis [4, 6, 7, 23, 25, 28, 38,39,40]. Reed et al. [23] proposed the first approach that formulates the task in the conditional GAN framework, conditioning both the generator and the discriminator on a text embedding, and demonstrated successful generation results on 64 \(\times \) 64 images. Building on this framework, recent approaches have improved the generation quality by synthesizing higher-resolution images [39, 40] or by improving the conditioning on text information [4, 7, 28, 38]. To increase the resolution of output images, Zhang et al. [39] proposed a two-stage GAN that first generates low-resolution images from text and then increases their resolution with another network conditioned on the low-resolution image. Zhang et al. [40] extended this approach with an end-to-end trainable network that regularizes the generator outputs at multiple resolutions with a set of discriminators. To improve the conditioning on text, Hao et al. [7] proposed to augment the text annotations by synthesizing captions with a pre-trained caption generator, and Dash et al. [4] proposed to augment the discriminator to predict additional class labels. To reduce ambiguities in text, Sharma et al. [28] conditioned image generation on a dialogue instead of a single caption. Xu et al. [38] employed a multi-stage generation method with an attention mechanism that conditions image generation on each word in a caption, one at a time. Although these approaches have demonstrated impressive results on datasets of specific categories (e.g., birds [37] and flowers [19]), the perceptual quality of generation tends to degrade substantially on datasets with complicated images (e.g., MS-COCO [15]). We investigate a method to improve text-to-image synthesis on general images by conditioning generation on an inferred semantic layout.

The problem of generating images from pixel-wise semantic labels has been explored recently [3, 11, 14, 24]. In these approaches, the task of image generation is formulated as translating semantic labels to pixels. Isola et al. [11] proposed a pixel-to-pixel translation network that converts dense pixel-wise labels to an image using conditional GAN. Chen et al. [3] proposed a cascaded refinement network that generates a high-resolution output from dense semantic labels using a cascade of upsampling layers conditioned on the layout. Karacan et al. [14] employed both dense layouts and attribute vectors for image generation using conditional GAN. Reed et al. [24] utilized sparse label maps defined on a few foreground objects, similar to our method. Unlike previous approaches that require ground-truth layouts for generation, our method infers the semantic layout, and thus is more applicable to various generation tasks. Note that our main contribution is complementary to these approaches and we can integrate existing segmentation-to-pixel generation methods to generate an image conditioned on a layout.

The idea of inferring scene structure for image generation has been explored by recent work in several domains. For example, Wang et al. [36] proposed to infer a surface normal map as an intermediate structure for generating indoor scene images, and Villegas et al. [33] predicted human joints for future frame prediction. Reed et al. [25] predicted local keypoints of birds or humans for text-to-image synthesis. In contrast to previous approaches that predict specific types of structure for image generation, our method aims to predict semantic label maps, which are a general representation of natural images. Concurrently with this work, Johnson et al. [13] developed a hierarchical text-to-image synthesis method based on a scene graph. In this approach, they first extract a scene graph from the input text and construct an image in a coarse-to-fine manner, incorporating the relationships between objects encoded in the graph. In contrast, we aim to infer a semantic layout directly from the text without explicit information on scene structure, so the information required to organize and associate objects is learned implicitly from text.

Fig. 5.2. Overall pipeline of the proposed algorithm. Given a text embedding, our algorithm first generates a coarse layout of the image by placing a set of object bounding boxes using the box generator (Sect. 5.4.1), and further refines the object shape inside each box using the shape generator (Sect. 5.4.2). Combining the outputs of the box and shape generators yields a semantic label map defining the semantic structure of the scene. Conditioned on the inferred semantic layout and the text, the image generator finally produces a pixel-level image (Sect. 5.5).

3 Overview

The overall pipeline of the proposed framework is illustrated in Fig. 5.2. Given a text description, our model progressively constructs a scene by refining the semantic structure of an image using the following sequence of generators:

  • The box generator takes a text embedding \(\mathbf {s}\) as input and generates a coarse layout by composing object instances in an image. The output of the box generator is a set of bounding boxes \(B_{1:T}=\{B_1,...,B_T\}\), where each bounding box \(B_t\) defines the location, size, and category label of the t-th object (Sect. 5.4.1).

  • The shape generator takes a set of bounding boxes generated from the box generator and predicts the shapes of the objects inside the boxes. The output of the shape generator is a set of binary masks \(M_{1:T}=\{M_1,...,M_T\}\), where each mask \(M_t\) defines the foreground shape of the t-th object (Sect. 5.4.2).

  • The image generator takes the semantic label map \(\mathbf M\) obtained by aggregating instance-wise masks and the text embedding as inputs and generates an image by translating a semantic layout to pixels matching the text description (Sect. 5.5).

By conditioning the image generation process on explicitly inferred semantic layouts, our method generates images that preserve detailed object shapes and whose semantic content is therefore more recognizable. In our experiments, we show that the images generated by our method are semantically more meaningful and better aligned with the input text than images generated by previous approaches [23, 39] (Sect. 5.6).
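The composition of the three generators can be sketched as follows. This is an illustrative, PyTorch-style sketch only, not the authors' released implementation: the generator interfaces, the helper `label_map_fn`, and the 128-dimensional noise vectors are assumptions made for illustration.

```python
import torch

def generate_image(text_embedding, box_gen, shape_gen, image_gen, label_map_fn):
    """Compose the three generators: text -> boxes -> masks -> image (sketch)."""
    boxes = box_gen(text_embedding)                     # B_{1:T}: coarse box layout (Sect. 5.4.1)
    z_masks = torch.randn(len(boxes), 128)              # z_{1:T}: per-instance noise (dim assumed)
    masks = shape_gen(boxes, z_masks)                   # M_{1:T}: instance shapes (Sect. 5.4.2)
    label_map = label_map_fn(masks, boxes)              # M: aggregated semantic label map
    z_img = torch.randn(1, 128)
    return image_gen(label_map, text_embedding, z_img)  # generated image (Sect. 5.5)
```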

4 Inferring Semantic Layout from Text

4.1 Bounding Box Generation

Given an input text embedding \(\mathbf {s}\), we first generate a coarse layout of the image in the form of object bounding boxes. We associate each bounding box \(B_t\) with a class label to define which object class to place and where, which plays a critical role in determining the global layout of the scene. Specifically, we denote the labeled bounding box of the t-th object as \(B_t=(\mathbf {b}_t, {\varvec{l}}_t)\), where \(\mathbf {b}_t=[b_{t,x}, b_{t,y}, b_{t,w}, b_{t,h}] \in \mathbb {R}^4\) represents the location and size of the bounding box, and \({\varvec{l}}_t \in \{0, 1\}^{L+1}\) is a one-hot class label over L categories. We reserve the \((L+1)\)-th class as a special indicator for the end-of-sequence.

The box generator \(G_\text {box}\) defines a stochastic mapping from the input text \(\mathbf {s}\) to a set of T object bounding boxes \(B_{1:T} = \{B_1,...,B_T\}\):

$$\begin{aligned} \widehat{B}_{1:T} \sim G_\text {box}(\mathbf {s}). \end{aligned}$$
(5.1)

Model. We employ an auto-regressive decoder for the box generator by decomposing the conditional joint bounding box probability as \(p(B_{1:T} \mid \mathbf {s}) = \prod _{t=1}^{T} p(B_t \mid B_{1:t-1}, \mathbf {s})\), where the conditionals are approximated by an LSTM [10]. In the generative process, we first sample a class label \({\varvec{l}}_t\) for the t-th object and then generate the box coordinates \(\mathbf {b}_t\) conditioned on \({\varvec{l}}_t\), i.e., \(p(B_t | \cdot ) = p(\mathbf {b}_t, {\varvec{l}}_t | \cdot ) = p({\varvec{l}}_t | \cdot ) \, p(\mathbf {b}_t | {\varvec{l}}_t, \cdot )\). The two conditionals are modeled by a categorical distribution and a Gaussian Mixture Model (GMM) [9], respectively:

$$\begin{aligned} p({\varvec{l}}_t \mid B_{1:t-1}, \mathbf {s})&= \mathrm {Softmax}(\mathbf e_t), \end{aligned}$$
(5.2)
$$\begin{aligned} p(\mathbf {b}_t \mid {\varvec{l}}_t,B_{1:t-1}, \mathbf {s})&= \sum _{k=1}^K \pi _{t,k} \, \mathcal {N}\left( \mathbf {b}_t ; \varvec{\mu }_{t,k}, \varvec{\varSigma }_{t,k} \right) , \end{aligned}$$
(5.3)

where K is the number of mixture components. The softmax logit \(\mathbf e_t\) in Eq. (5.2) and the Gaussian mixture parameters \(\pi _{t,k} \in \mathbb R, \varvec{\mu }_{t,k} \in \mathbb R^4\), and \(\varvec{\varSigma }_{t, k} \in \mathbb R^{4 \times 4}\) in Eq. (5.3) are computed from the LSTM output at each step.
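A minimal PyTorch sketch of one step of this decoder is shown below. It illustrates the factorization of Eqs. (5.2) and (5.3) only; the hidden size, text dimension, and the use of a diagonal (rather than full 4 \(\times \) 4) covariance are simplifying assumptions, and the class and module names are hypothetical.

```python
import torch
import torch.nn as nn

class BoxGenerator(nn.Module):
    """Sketch of the autoregressive box generator (not the authors' released code).

    An LSTM consumes the previous box and label together with the text embedding s;
    its hidden state parameterizes a categorical distribution over L+1 labels
    (Eq. 5.2) and, given a sampled label, a K-component Gaussian mixture over the
    4-d box coordinates (Eq. 5.3).
    """
    def __init__(self, text_dim=128, num_labels=81, K=20, hidden=256):
        super().__init__()
        self.K = K
        self.lstm = nn.LSTMCell(4 + num_labels + text_dim, hidden)
        self.cls_head = nn.Linear(hidden, num_labels)            # logits e_t
        self.pi_head = nn.Linear(hidden + num_labels, K)         # mixture weights pi_{t,k}
        self.mu_head = nn.Linear(hidden + num_labels, K * 4)     # means mu_{t,k}
        self.sigma_head = nn.Linear(hidden + num_labels, K * 4)  # log std-devs (diagonal)

    def step(self, prev_box, prev_label, text, state=None):
        h, c = self.lstm(torch.cat([prev_box, prev_label, text], dim=-1), state)
        return self.cls_head(h), (h, c)                          # e_t for Eq. (5.2)

    def mixture_params(self, h, label):
        # Box coordinates are conditioned on the sampled label (Eq. 5.3).
        hl = torch.cat([h, label], dim=-1)
        pi = torch.softmax(self.pi_head(hl), dim=-1)
        mu = self.mu_head(hl).view(-1, self.K, 4)
        sigma = torch.exp(self.sigma_head(hl)).view(-1, self.K, 4)
        return pi, mu, sigma
```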

Training. We train the box generator by minimizing the negative log-likelihood of ground-truth bounding boxes:

$$\begin{aligned} \mathcal {L}_{\text {box}} = -\lambda _l \,\frac{1}{T} \sum _{t=1}^{T} {\varvec{l}}^*_{t} \log p({\varvec{l}}_{t}) -\lambda _b \,\frac{1}{T} \sum _{t=1}^{T} \log p(\mathbf {b}^*_{t}), \end{aligned}$$
(5.4)

where T is the number of objects in an image, and \(\lambda _l, \lambda _b\) are balancing hyper-parameters. \(\mathbf {b}^*_{t}\) and \({\varvec{l}}^*_{t}\) are the ground-truth bounding box coordinates and label of the t-th object, respectively, which are ordered by their bounding box locations from left to right. Note that we drop the conditioning in Eq. (5.4) for notational brevity. The hyper-parameters are set to \(\lambda _l=4, \lambda _b=1\), and \(K=20\) in our experiments.
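A sketch of the loss in Eq. (5.4), assuming the diagonal-covariance mixture from the sketch above, could look as follows; tensor shapes are assumptions.

```python
import torch
import torch.nn.functional as F
import torch.distributions as D

def box_nll_loss(class_logits, pi, mu, sigma, gt_labels, gt_boxes,
                 lambda_l=4.0, lambda_b=1.0):
    """Negative log-likelihood of Eq. (5.4) for one image (sketch).

    class_logits: (T, L+1); pi: (T, K); mu, sigma: (T, K, 4);
    gt_labels: (T,) integer class ids; gt_boxes: (T, 4).
    """
    # Label term: cross-entropy equals -(1/T) sum_t l_t^* log p(l_t).
    label_nll = F.cross_entropy(class_logits, gt_labels)
    # Box term: mixture of diagonal Gaussians over the 4-d box coordinates.
    gmm = D.MixtureSameFamily(D.Categorical(probs=pi),
                              D.Independent(D.Normal(mu, sigma), 1))
    box_nll = -gmm.log_prob(gt_boxes).mean()
    return lambda_l * label_nll + lambda_b * box_nll
```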

At test time, we generate bounding boxes via ancestral sampling of class labels and box coordinates using Eqs. (5.2) and (5.3), respectively. We terminate sampling when the sampled class label corresponds to the termination indicator \((L+1)\); the number of objects is thus determined adaptively based on the text.
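The sampling loop can be sketched as below, reusing the hypothetical `BoxGenerator` above; the maximum object count and class indexing are assumptions.

```python
import torch

@torch.no_grad()
def sample_boxes(box_gen, text, max_objects=10, end_class=80):
    """Ancestral sampling of bounding boxes (sketch).

    A class label is sampled from Eq. (5.2), then box coordinates from the
    label-conditioned mixture of Eq. (5.3). Sampling stops when the special
    end-of-sequence class (index `end_class`) is drawn, so the number of
    objects adapts to the text.
    """
    num_labels = end_class + 1
    prev_box = torch.zeros(1, 4)
    prev_label = torch.zeros(1, num_labels)
    state, boxes = None, []
    for _ in range(max_objects):
        logits, state = box_gen.step(prev_box, prev_label, text, state)
        cls = torch.distributions.Categorical(logits=logits).sample()   # l_t
        if cls.item() == end_class:                                     # termination indicator
            break
        label = torch.nn.functional.one_hot(cls, num_labels).float()
        pi, mu, sigma = box_gen.mixture_params(state[0], label)
        k = torch.distributions.Categorical(probs=pi).sample()          # mixture component
        box = torch.normal(mu[0, k], sigma[0, k])                       # b_t
        boxes.append((box.squeeze(0), cls.item()))
        prev_box, prev_label = box, label
    return boxes
```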

4.2 Shape Generation

Given a set of bounding boxes obtained by the box generator, the shape generator predicts a more detailed image structure in the form of object masks. Specifically, for each object bounding box \(B_t\) obtained by Eq. (5.1), we generate a binary mask \(M_t\in \mathbb {R}^{H\times W}\) that defines the shape of the object inside the box. To this end, we first convert the discrete bounding box outputs \(\{B_t\}\) to a binary tensor \(\mathbf B_t \in \{0,1\}^{H\times W \times L}\), whose element is 1 if and only if it is contained in the corresponding class-labeled box. Using the notation \(M_{1:T}=\{M_1, ..., M_T\}\), we define the shape generator \(G_\text {mask}\) as

$$\begin{aligned} \widehat{M}_{1:T} = G_\text {mask}(\mathbf B_{1:T}, \mathbf z_{1:T}), \end{aligned}$$
(5.5)

where \(\mathbf z_t \sim \mathcal N(0, I)\) is a random noise vector.
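The conversion from labeled boxes to the binary tensor \(\mathbf B_t\) can be sketched as below; normalized \((x, y, w, h)\) coordinates and the image size are assumptions.

```python
import torch

def boxes_to_tensors(boxes, H=128, W=128, L=80):
    """Convert labeled boxes B_t = (b_t, l_t) into binary tensors in {0,1}^{H x W x L} (sketch).

    `boxes` is a list of ((x, y, w, h), class_id) pairs with coordinates
    normalized to [0, 1]; element (i, j, l) is 1 iff pixel (i, j) lies inside
    a box of class l.
    """
    out = []
    for (x, y, w, h), cls in boxes:
        t = torch.zeros(H, W, L)
        x0, y0 = int(x * W), int(y * H)
        x1, y1 = int((x + w) * W), int((y + h) * H)
        t[y0:y1, x0:x1, cls] = 1.0
        out.append(t)
    return torch.stack(out)  # (T, H, W, L)
```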

Generating accurate object shapes requires satisfying two criteria: (i) each instance-wise mask \(M_t\) should match the location and class information of \(\mathbf B_t\) and be recognizable as an individual instance (instance-wise constraints); (ii) each object shape must be aligned with its surrounding context (global constraints). To satisfy both criteria, we design the shape generator as a recurrent neural network trained with two conditional adversarial losses, as described below.

Model. We build the shape generator \(G_\text {mask}\) using a convolutional recurrent neural network [29], as illustrated in Fig. 5.2. At each step t, the model takes \(\mathbf B_t\) through the encoder CNN and encodes the information of all object instances using bi-directional convolutional LSTM (Bi-convLSTM). In addition to the convLSTM output at the t-th step, we add noise \(\mathbf z_t\) by spatial tiling and concatenation and generate a mask \(M_t\) by forwarding it through a decoder CNN.

Fig. 5.3. Architecture of the image generator. Conditioned on the text description and the semantic layout generated by the layout generator, it generates an image that matches both inputs.

Training. Training the shape generator is based on the GAN framework [8], in which the generator and the discriminator are trained alternately. To enforce both the global and the instance-wise constraints discussed earlier, we employ two conditional adversarial losses [17] with an instance-wise discriminator \(D_\text {inst}\) and a global discriminator \(D_\text {global}\).

First, we encourage each object mask to be compatible with class and location information encoded by the object bounding box. We train an instance-wise discriminator \(D_\text {inst}\) by optimizing the following instance-wise adversarial loss:

$$\begin{aligned} \mathcal L_{\text {inst}}^{(t)}&= \mathbb E_{ (\mathbf {B}_t,M_t) } \Big [ \log D_\text {inst}\big ( \mathbf {B}_t, M_t \big ) \Big ] \\&~~~ + \mathbb E_{\mathbf {B}_t, \mathbf {z}_t } \Big [ \log \Big ( 1 - D_\text {inst}\big (\mathbf {B}_t, G_\text {mask}^{(t)}(\mathbf {B}_{1:T}, \mathbf {z}_{1:T}) \big ) \Big ) \Big ] , \nonumber \end{aligned}$$
(5.6)

where \(G_\text {mask}^{(t)}(\mathbf {B}_{1:T}, \mathbf {z}_{1:T})\) denotes the t-th output of the mask generator. The instance-wise loss is applied to each of the T instance-wise masks and aggregated over all instances as \(\mathcal {L}_\text {inst}= (1/T) \sum _t \mathcal {L}_\text {inst}^{(t)}\).

On the other hand, the global loss encourages all the instance-wise masks to form a globally coherent context. To consider the relations between different objects, we aggregate them into a global mask \( G_\text {global}(\mathbf B_{1:T}, \mathbf z_{1:T}) = \textstyle \sum _t G_\text {mask}^{(t)}(\mathbf {B}_{1:T}, \mathbf z_{1:T}), \) and compute a global adversarial loss analogous to Eq. (5.6) as

$$\begin{aligned}&\mathcal L_\text {global}= \mathbb E_{ (\mathbf {B}_{1:T},M_{1:T}) } \Big [ \log D_\text {global}\big ( \mathbf {B}_\text {global}, M_\text {global}\big ) \Big ] \\&~~ + \mathbb E_{\mathbf {B}_{1:T}, \mathbf {z}_{1:T} } \Big [ \log \Big ( 1 - D_\text {global}\big ( \mathbf {B}_\text {global}, G_\text {global}(\mathbf {B}_{1:T}, \mathbf {z}_{1:T}) \big ) \Big ) \Big ] , \nonumber \end{aligned}$$
(5.7)

where \(M_\text {global}\in \mathbb {R}^{H\times W}\) is an aggregated mask obtained by taking element-wise addition over \(M_{1:T}\). \(\mathbf B_{\text {global}}\in \mathbb {R}^{H\times W\times L}\) is an aggregated bounding box tensor obtained by taking element-wise maximum over \(\mathbf B_{1:T}\).
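The aggregation used by the global discriminator can be sketched as follows; the tensor layouts are assumptions.

```python
import torch

def aggregate_for_global_loss(masks, box_tensors):
    """Build the global-discriminator inputs of Eq. (5.7) (sketch).

    masks:       (T, H, W) instance masks M_{1:T}
    box_tensors: (T, H, W, L) binary box tensors B_{1:T}
    """
    M_global = masks.sum(dim=0)               # element-wise addition over M_{1:T}
    B_global = box_tensors.max(dim=0).values  # element-wise maximum over B_{1:T}
    return M_global, B_global
```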

Finally, we impose a reconstruction loss \(\mathcal L_\text {rec}\) that encourages the predicted instance masks to be similar to the ground-truths. We implement this idea using a perceptual loss [2, 3, 12, 35], which measures the distance between real and fake images in the feature space of a pre-trained CNN:

$$\begin{aligned} \mathcal {L}_\text {rec} = \sum _l \big ||\varPhi _l (G_\text {global}) - \varPhi _l(M_\text {global}) \big ||, \end{aligned}$$
(5.8)

where \(\varPhi _l\) is the feature extracted from the l-th layer of a CNN. We use the VGG-19 network [30] pre-trained on ImageNet [5] in our experiments. Since the input to the pre-trained network is a binary mask, we replicate the mask along the channel dimension and use the converted mask to compute Eq. (5.8). We found that the perceptual loss improves both the stability of GAN training and the quality of object shapes, as discussed in [2, 3, 35].
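A sketch of this mask perceptual loss is given below; the chosen VGG-19 layer indices and the L1 distance are assumptions, not necessarily the layers or norm used in the paper.

```python
import torch
import torchvision.models as models

class MaskPerceptualLoss(torch.nn.Module):
    """Perceptual loss of Eq. (5.8) on masks (sketch).

    Binary masks are replicated to 3 channels so they can be fed to an
    ImageNet-pretrained VGG-19.
    """
    def __init__(self, layer_ids=(3, 8, 17, 26)):
        super().__init__()
        vgg = models.vgg19(weights=models.VGG19_Weights.IMAGENET1K_V1).features.eval()
        for p in vgg.parameters():
            p.requires_grad_(False)
        self.vgg, self.layer_ids = vgg, set(layer_ids)

    def features(self, x):
        feats = []
        for i, layer in enumerate(self.vgg):
            x = layer(x)
            if i in self.layer_ids:
                feats.append(x)
        return feats

    def forward(self, pred_mask, gt_mask):
        # (B, 1, H, W) masks -> (B, 3, H, W) pseudo-images.
        pred = pred_mask.repeat(1, 3, 1, 1)
        gt = gt_mask.repeat(1, 3, 1, 1)
        return sum((p - g).abs().mean()
                   for p, g in zip(self.features(pred), self.features(gt)))
```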

Combining Eqs. (5.6), (5.7) and (5.8), the overall training objective for the shape generator becomes:

$$\begin{aligned}&\mathcal {L}_{\text {shape}} = \lambda _i \mathcal {L}_{\text {inst}} + \lambda _g \mathcal {L}_{\text {global}} + \lambda _r \mathcal {L}_{\text {rec}}, \end{aligned}$$
(5.9)

where \(\lambda _i, \lambda _g\), and \(\lambda _r\) are hyper-parameters that balance different losses, which are set to 1, 1, and 10 in the experiment, respectively.

5 Synthesizing Images from Text and Layout

The outputs of the layout generator define the location, size, shape, and class information of objects, which provides the semantic structure of a scene relevant to the text. Given the semantic structure and the text, the objective of the image generator is to generate an image that conforms to both conditions. To this end, we first aggregate the binary object masks \(M_{1:T}\) into a semantic label map \(\mathbf M \in \{0, 1\}^{H \times W \times L}\), such that \(\mathbf M_{ijk} = 1\) if and only if there exists an object of class k whose mask \(M_t\) covers the pixel (i, j). Then, given the semantic layout \(\mathbf M\) and the text \(\mathbf {s}\), the image generator is defined by:

$$\begin{aligned} \widehat{X} = G_\text {img}(\mathbf M, \mathbf {s}, \mathbf {z}), \end{aligned}$$
(5.10)

where \(\mathbf z\sim \mathcal {N}(0,I)\) is a random noise. We describe the network architecture and training procedures of the image generator below.
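The aggregation of instance masks into the semantic label map \(\mathbf M\) can be sketched as follows; thresholding soft masks at 0.5 is an assumption.

```python
import torch

def masks_to_label_map(masks, class_ids, L=80):
    """Aggregate instance masks M_{1:T} into a semantic label map M (sketch).

    masks: (T, H, W) soft or binary instance masks; class_ids: list of T
    integer labels. M[i, j, k] = 1 iff some instance of class k covers (i, j).
    """
    T, H, W = masks.shape
    label_map = torch.zeros(H, W, L)
    binary = (masks > 0.5).float()
    for t in range(T):
        k = class_ids[t]
        label_map[..., k] = torch.maximum(label_map[..., k], binary[t])
    return label_map
```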

Model. Figure 5.3 illustrates the overall architecture of the image generator. Our generator network is based on a convolutional encoder-decoder network [11] with several modifications. It first encodes the semantic layout \(\mathbf M\) through several down-sampling layers to construct a layout feature \(\mathbf A\in \mathbb {R}^{h \times w\times d}\). We assume that the layout feature encodes various contextual information of the input layout along its channel dimension, and use it to adaptively select context relevant to the text. Specifically, we compute a d-dimensional vector from the text embedding and spatially replicate it to construct \(\mathbf S\in \mathbb {R}^{h \times w\times d}\). We then gate the layout feature by \(\mathbf A^g = \mathbf A \odot \sigma (\mathbf S)\), where \(\sigma \) is the sigmoid nonlinearity and \(\odot \) denotes element-wise multiplication. To further encode text information for the background, we compute another text embedding with separate fully-connected layers and spatially replicate it to size \(h\times w\). The gated layout feature \(\mathbf A^g\), the text embedding, and the noise are then concatenated along the channel dimension and fed into several residual blocks followed by the decoder, which maps them to an image. We employ a cascaded network [3] for the decoder, which takes the semantic layout \(\mathbf M\) as an additional input to every upsampling layer. We found that the cascaded network enhances conditioning on the layout structure and produces more accurate object boundaries.
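A sketch of the text-conditioned gating and feature fusion is given below; the feature dimensions, the noise shape, and the module name are assumptions.

```python
import torch
import torch.nn as nn

class TextGatedFusion(nn.Module):
    """Sketch of the gating step in the image generator.

    A d-dimensional gate computed from the text embedding is spatially
    replicated and applied to the layout feature A via A^g = A * sigmoid(S).
    """
    def __init__(self, text_dim=128, d=512):
        super().__init__()
        self.gate_fc = nn.Linear(text_dim, d)  # produces S (before spatial tiling)
        self.bg_fc = nn.Linear(text_dim, d)    # separate text path for the background

    def forward(self, layout_feat, text):
        # layout_feat: (B, d, h, w); text: (B, text_dim)
        B, d, h, w = layout_feat.shape
        gate = torch.sigmoid(self.gate_fc(text)).view(B, d, 1, 1).expand(B, d, h, w)
        gated = layout_feat * gate                                  # A^g = A * sigmoid(S)
        bg = self.bg_fc(text).view(B, d, 1, 1).expand(B, d, h, w)   # tiled text embedding
        noise = torch.randn(B, d, h, w, device=layout_feat.device)
        return torch.cat([gated, bg, noise], dim=1)  # fed to residual blocks / decoder
```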

Table 5.1. Quantitative evaluation results. Two evaluation metrics based on caption generation and the Inception score are presented. The second and third columns indicate the types of bounding box or mask layouts used in image generation, where “GT” indicates ground-truth layouts and “Pred.” indicates the layouts predicted by our model. The last row presents the caption generation performance on real images, which corresponds to the upper bound of the caption generation metric. Higher values indicate better performance in all columns.

For the discriminator network \(D_{\text {img}}\), we first concatenate the generated image X and the semantic layout \(\mathbf M\). It is fed through a series of down-sampling blocks, resulting in a feature map of size \(h' \times w'\). We concatenate it with a spatially tiled text embedding, from which we compute a decision score of the discriminator.

Training. Conditioned on both the semantic layout \(\mathbf M\) and the text embedding \(\mathbf {s}\) extracted by [22], the image generator \(G_\text {img}\) is jointly trained with the discriminator \(D_{\text {img}}\). We define the objective function by \( \mathcal {L}_{\text {img}} = \lambda _a \mathcal {L}_{\text {adv}} + \lambda _r \mathcal {L}_{\text {rec}}, \) where

$$\begin{aligned} \mathcal L_{\text {adv}}&= \mathbb E_{ (\mathbf M, \mathbf {s}, X) } \Big [ \log D_{\text {img}} \big ( \mathbf M, \mathbf {s}, X\big ) \Big ]\nonumber \\&~~~ + \mathbb E_{(\mathbf M, \mathbf {s}), \mathbf {z} } \Big [ \log \Big ( 1 - D_{\text {img}} \big (\mathbf M, \mathbf {s}, G_\text {img}(\mathbf M, \mathbf {s}, \mathbf {z}) \big ) \Big ) \Big ] , \end{aligned}$$
(5.11)
$$\begin{aligned} \mathcal {L}_{\text {rec}}&= \sum _l ||\varPhi _l (G_\text {img}(\mathbf M, \mathbf {s}, \mathbf {z})) - \varPhi _l(X) ||, \end{aligned}$$
(5.12)

where \(X\) is a ground-truth image associated with semantic layout \(\mathbf M\). As in the mask generator, we apply the same perceptual loss \(\mathcal L_\text {rec}\), which is found to be effective. We set the hyper-parameters \(\lambda _a=1\), \(\lambda _r=10\) in our experiment.

Table 5.2. Human evaluation results.

6 Experiments

6.1 Experimental Setup

Dataset. We use the MS-COCO dataset [15] to evaluate our model. It contains 164,000 training images over 80 semantic classes, where each image is associated with instance-wise annotations (i.e., object bounding boxes and segmentation masks) and five text descriptions. The dataset has complex scenes with many objects in a diverse context, which makes generation challenging. We use the official train and validation splits from MS-COCO 2014 for training and evaluating our model, respectively.

Fig. 5.4. Qualitative examples of generated images conditioned on text descriptions from the MS-COCO validation set, using our method and the baseline methods (StackGAN [39] and Reed et al. [23]). The input text and ground-truth image are shown in the first row. For each method, we provide a caption reconstructed from the generated image.

Evaluation Metrics. We evaluate text-conditional image generation performance using various metrics: Inception score, caption generation, and human evaluation.

Inception Score—We compute the Inception score [26] by applying a pre-trained classifier on synthesized images and investigating statistics of their score distributions. It measures the recognizability and the diversity of generated images and is correlated with human perceptions on visual quality [20]. We use the Inception-v3 [31] network pre-trained on ImageNet [5] for evaluation, and measure the score for all validation images.
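For reference, the Inception score is computed from the classifier posteriors as \(\exp(\mathbb E_x \, \mathrm{KL}(p(y|x) \,\|\, p(y)))\); a minimal sketch is shown below, using 10 splits as is commonly done [26].

```python
import torch

def inception_score(probs, num_splits=10):
    """Inception score from class posteriors (sketch).

    probs: (N, 1000) softmax outputs of Inception-v3 on generated images.
    Returns the mean and standard deviation over `num_splits` splits.
    """
    scores = []
    for chunk in torch.chunk(probs, num_splits, dim=0):
        p_y = chunk.mean(dim=0, keepdim=True)                   # marginal p(y)
        kl = (chunk * (torch.log(chunk + 1e-12)                 # KL(p(y|x) || p(y))
                       - torch.log(p_y + 1e-12))).sum(dim=1)
        scores.append(kl.mean().exp())
    scores = torch.stack(scores)
    return scores.mean().item(), scores.std().item()
```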

Caption Generation—In addition to the Inception score, assessing the performance of text-conditional image generation necessitates measuring the relevance of the generated images to the input texts. To this end, we generate sentences from the synthesized image and measure the similarity between the input text and the predicted sentence. The underlying intuition is that if the generated image is relevant to input text and its contents are recognizable, one should be able to determine the original text from the synthesized image. We employ an image caption generator [34] trained on MS-COCO to generate sentences, where one sentence is generated per image by greedy decoding. We report three standard language similarity metrics: BLEU [21], METEOR [1], and CIDEr [32].
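A minimal sketch of the caption-based relevance measurement using sentence-level BLEU is shown below; the tokenization and smoothing choices are assumptions, and METEOR and CIDEr are computed analogously with their respective scorers (not shown).

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

def caption_similarity(input_caption, generated_caption):
    """Compare the input text with the caption reconstructed from the image (sketch)."""
    reference = [input_caption.lower().split()]
    hypothesis = generated_caption.lower().split()
    return sentence_bleu(reference, hypothesis,
                         smoothing_function=SmoothingFunction().method1)
```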

Human Evaluation—Evaluation based on caption generation is convenient for large-scale evaluation, but may introduce unintended bias from the caption generator. To verify the effectiveness of the caption-based evaluation, we conduct a human evaluation using Amazon Mechanical Turk. For each text randomly selected from the MS-COCO validation set, we presented five images generated by the different methods and asked users to rank the methods by the relevance of the generated images to the text. We collected results for 1,000 sentences, each annotated by five users, and report the ratio of sentences for which each method is ranked as the most accurate.

6.2 Quantitative Analysis

We compare our method with two state-of-the-art approaches [23, 39] based on conditional GANs. Tables 5.1 and 5.2 summarize the quantitative evaluation results.

Comparisons to Other Methods. We first present systematic evaluation results based on Inception scores and caption generation performance, summarized in Table 5.1. The proposed method substantially outperforms existing approaches on both evaluation metrics. In terms of Inception score, our method outperforms existing approaches by a substantial margin, presumably because it generates more recognizable objects. The caption generation performance shows that captions generated from our synthesized images are more strongly correlated with the input text than those of the baselines. This indicates that images generated by our method are more accurately aligned with the descriptions and that their semantic contents are more recognizable.

Table 5.2 summarizes the comparison results based on human evaluation. When asked to rank images by their relevance to the input text, users chose images generated by our method as the most accurate for approximately \(60\%\) of all presented sentences, which is substantially higher than the baselines (about \(20\%\)). This is consistent with the caption generation results in Table 5.1, in which our method substantially outperforms the baselines while the baselines perform comparably to each other.

Fig. 5.5. Image generation results of our method. Each column corresponds to generation results conditioned on (a) predicted box and mask layouts, (b) ground-truth box and predicted mask layouts, and (c) ground-truth box and mask layouts. Classes are color-coded for illustration purposes. Best viewed in color.

Figure 5.4 illustrates qualitative comparisons. Due to adversarial training, images generated by the other methods, especially StackGAN [39], tend to be clear and exhibit high-frequency details. However, it is difficult to recognize contents from these images, since they often fail to capture important semantic structures of objects and scenes. As a result, the captions reconstructed from the generated images are usually not relevant to the input text. In comparison, our method generates much more recognizable and semantically meaningful images by conditioning the generation on the inferred semantic layout, and is able to reconstruct descriptions that are better aligned with the input sentences.

Ablative Analysis. To understand the quality and impact of the predicted semantic layout, we conduct an ablation study by gradually replacing the bounding box and mask layouts predicted by the layout generator with the ground-truths. Table 5.1 summarizes the quantitative evaluation results. Replacing the predicted layouts with the ground-truths leads to gradual performance improvements, which indicates that there are prediction errors in both the bounding box and mask layouts.

Fig. 5.6. Multiple samples generated from a text description.

6.3 Qualitative Analysis

Figure 5.5 shows qualitative results of our method. For each text, we present the generated images alongside the predicted semantic layouts. As in the previous section, we also present results conditioned on ground-truth layouts. Our method generates reasonable semantic layouts and images matching the input text. It produces bounding boxes corresponding to the fine-grained scene structure implied by the text (i.e., object categories and the number of objects) and object masks capturing class-specific visual attributes as well as relations to other objects. Given the inferred layouts, our image generator produces correct object appearances and backgrounds compatible with the text. Replacing the predicted layouts with the ground-truths yields generated images whose context is similar to that of the original images.

Diversity of Samples. To assess the diversity in the generation, we sample multiple images while fixing the input text. Figure 5.6 illustrates the example images generated by our method, which generates diverse semantic structures given the same text description, while preserving semantic details, such as the number of objects and object categories.

Fig. 5.7. Generation results obtained by manipulating captions. The manipulated portions of the text are highlighted in bold, and the type of manipulation (scene context, spatial location, number of objects, or object category) is indicated by color.

Text-Conditional Generation. To see how our model incorporates the text description into the generation process, we generate images while modifying portions of the descriptions. Figure 5.7 illustrates example results. When we change the context of the description, such as the object class, the number of objects, the spatial composition of objects, or the background patterns, our method correctly adapts the semantic structure and the generated image to the modified part of the text.

Controllable Image Generation. We demonstrate controllable image generation by modifying the bounding box layout. Figure 5.8 illustrates the example results. Our method updates object shapes and context based on the modified semantic layout (e.g. adding new objects and changing spatial configuration of objects) and generates accurate images.

Fig. 5.8. Examples of controllable image generation.

7 Conclusion

We proposed an approach for text-to-image synthesis which explicitly infers and exploits a semantic layout as an intermediate representation from text to image. Our model hierarchically constructs a semantic layout in a coarse-to-fine manner by a series of generators. By conditioning image generation on explicit layout prediction, our method generates complicated images that preserve semantic details and are highly relevant to the text description. We also showed that the predicted layout can be used to interpret and control the generation process of the model.

Despite these advantages, training the proposed method requires heavier annotations than the previous methods, such as instance-level bounding box and mask annotations, which makes our method less scalable to large-scale generation problems. To resolve this limitation, we believe that there are several interesting future research directions to explore further. One option is an unsupervised discovery of intermediate semantic structures through end-to-end training. With properly designed regularization on the intermediate output structures, we believe that we can inject useful inductive bias that guides the model to discover meaningful structures from data. Another interesting direction is semi-supervised learning of the model using a large set of partially annotated data. For instance, we can exploit a small number of fully annotated images and a large number of partially annotated images (e.g. images with only text descriptions), which allows our model to exploit large-scale datasets, such as the Google Conceptual Caption dataset [27].