
1 Introduction

The conditional generative adversarial network (cGAN)  [34] has recently made substantial progress in realistic image synthesis. In a cGAN, a generator \(\hat{x}={{\,\mathrm{G}\,}}(c,z)\) aims to output a realistic image \(\hat{x}\) under a constraint implicitly encoded by c. Conversely, a discriminator \({{\,\mathrm{D}\,}}(x,c)\) learns this constraint from ground-truth pairs \(\langle x,c\rangle \) by predicting whether \(\langle \hat{x},c\rangle \) is real or generated.

Current cGAN models  [20, 36, 43] for semantic image synthesis aim to satisfy a structural consistency constraint: the output image \(\hat{x}={{\,\mathrm{G}\,}}(c)\) is required to be aligned with a semantic label map c. However, for such a model, the style of \(\hat{x}\) is inherently determined by the model and thus cannot be controlled by the user. To provide the desired controllability over the generated styles, previous studies  [27, 41] impose additional constraints and allow more inputs to the generator: \(\hat{x}_{2 \rightarrow 1} ={{\,\mathrm{G}\,}}(c_1, x_2, z)\), where \(x_2\) is an exemplar image that guides the style of \(c_1\). However, these studies are designed for specific datasets such as faces  [30, 37], dancing  [41] or street views  [46], where the exemplar images and semantic label maps usually contain similar semantics and spatial structures.

Fig. 1. Our task aims to synthesize style-consistent images from a semantic label map (row 1, column 1) and a structurally and semantically different exemplar image (row 1, columns 2–7). In spite of the differences between the two scenes, our model can synthesize high-quality images of consistent styles with the reference images.

Different from the previous studies, we address a more challenging example-guided synthesis task that transfers styles across semantically very different scenes. As shown in Fig. 1, given a semantic label map \(c_1\) (column 1) and an arbitrary scene image \(x_2\) (row 1, columns 2–7), the task aims to generate a new scene image \(\hat{x}_{2 \rightarrow 1}\) (row 2) that matches the semantic structure of \(c_1\) and the scene style of \(x_2\). The challenge is that scene images have complex semantic structures as well as diversified scene styles, and more importantly, the inputs \(c_1\) and \(x_2\) can be structurally unaligned and semantically different. Therefore, a mechanism is required to better match the structures and semantics for coherent synthesis.

In this paper, we propose a novel Masked Spatial-Channel Attention (MSCA) module (Sect. 3.2) to propagate features across unstructured scenes. Our module is inspired by a recent work  [7] on attention-based recognition, but we instead propose a new cross-attention mechanism to model semantic correspondence for image synthesis. Moreover, our method is based on a novel spatial-channel decoupling design that allows efficient computation. To facilitate example-guided synthesis, we further improve the module by including: i) feature masking for semantic outlier filtering, ii) multi-scaling for global-local feature processing, and iii) resolution extension for image synthesis. As a result, our module provides both clear physical meaning and interpretability for the example-guided synthesis task.

We formulate the proposed approach as a unified synthesis network for joint feature extraction, alignment and image synthesis. We achieve this by applying MSCA modules to the extracted features for multi-scale feature-domain alignment. Next, we apply a recent feature normalization technique, SPADE  [36], to the aligned features to allow spatially-controllable synthesis. To facilitate the learning of this network, we propose a novel self-supervision task. As opposed to  [41], our scheme requires only semantically parsed images for training and does not rely on video data. We show that a model trained with this approach generalizes across different scene semantics (see Fig. 1).

Our main contributions include the following:

  • A novel masked spatial-channel attention (MSCA) module to propagate features between two semantically different scenes.

  • A unified example-guided synthesis network for joint feature extraction, alignment and image synthesis.

  • A novel self-supervision scheme that only requires semantically annotated images for training but not at the testing (image synthesis) stage.

  • Significant improvements over the existing methods on the COCO-stuff  [3] dataset, as well as interpretability and easy extensions to other content manipulation tasks.

2 Related Work

Generative Adversarial Networks. Recent years have witnessed the progress of generative adversarial networks (GANs)  [11] for image synthesis. A GAN model consists of a generator and a discriminator where the generator serves to produce realistic images that cannot be distinguished from the real ones by the discriminator. Recent techniques for realistic image synthesis include modified losses  [1, 33, 38], model regularization  [35], self-attention  [2, 48], feature normalization  [24] and progressive synthesis  [23].

Image-to-Image Translation (I2I). I2I translation aims to translate images from a source domain to a target domain. The initial work of Isola et al.  [20] proposes a conditional GAN framework to learn I2I translation with paired images. Wang et al.  [43] improve the conditional GAN for high-resolution synthesis and content manipulation. To enable I2I translation without using paired data, a few works  [4, 18, 25, 29, 50] apply the cycle consistency constraint in training. Recent works on photo-realistic image synthesis take semantic label maps as inputs for image synthesis. Specifically, Wang et al.  [43] extend the conditional GAN for high-resolution synthesis, and Chen et al.  [6] propose a cascaded refinement pipeline. More recently, Park et al.  [36] propose spatially-adaptive normalization for realistic image generation.

Example-Guided Style Transfer and Synthesis. Example guided style transfer  [8, 13] aims to transfer the style of an example image to a target image. Recent works  [4, 10, 12, 16, 17, 22, 26, 31, 45] utilize deep neural network features to model and transfer styles. Several frameworks  [18, 32] perform style transfer via image domain style and content disentanglement. In addition, domain adaptation  [4] applies a cycle consistency loss to perform cross-domain style transformation.

More recently, example-guided synthesis  [27, 41] has been proposed to transfer the style of an example image to a target condition, e.g., a semantic label map. Specifically, Lin et al.  [27] apply dual learning to disentangle the style for guided synthesis, and Wang et al.  [41] extract style-consistent data pairs from videos for model training. In addition, Park et al.  [36] adopt an I2I network under the auto-encoding framework for example-guided image synthesis. Different from  [27, 36, 41], we address the style alignment issue between arbitrary scenes for region- and semantic-aware style integration. Furthermore, our self-supervised learning scheme does not require video data and poses a more general and more challenging auto-encoding task.

Correspondence Matching for Synthesis. Finding correspondence is critical for many synthesis tasks. For instance, Siarohin et al.  [39] apply the affine transformation on reference person images to improve pose-guided person image synthesis, Wang et al.  [42] use optical flow to align frames for coherent video synthesis. However, the affine transformation and optical flow cannot adequately model the correspondences between two structurally very different scenes.

Efficient Attention Modeling. Self-attention  [44, 48] can capture general pair-wise correspondences. However, it is computationally intensive at high resolutions. To enable fast attention computation, GCNL  [47] and CCCA  [19] respectively apply Taylor series expansion and criss-cross attention to approximate self-attention. Alternatively, \(A^2\)-Nets  [7] factorize self-attention to solve video classification tasks. Inspired by [7], we propose an attention-based module named MSCA. It is worth noting that MSCA is based on cross-attention and feature masking for modeling image correspondences.

3 Method

The proposed approach aims to generate scene images that align with given semantic maps. Different from conventional semantic image synthesis methods  [20, 36, 43], our model takes an exemplary scene as an extra input to provide more controllability over the generated scene image. Unlike existing example-based approaches  [27, 41], our model addresses a more challenging case where the exemplary inputs are structurally and semantically unaligned with the given semantic map.

Our method takes a semantic label map \(c_1\), a reference image \(x_2\) and its corresponding parsed semantic label map \(\widetilde{c}_2\) as inputs and synthesizes an image that matches the style of \(x_2\) and the structure of \(c_1\) using a generator \({{\,\mathrm{G}\,}}\). As shown in Fig. 2 left, the generator \({{\,\mathrm{G}\,}}\) consists of three parts, namely i) feature extraction, ii) feature alignment and iii) image synthesis. In Sect. 3.1, we describe the first part, which extracts features from the inputs of both scenes. In Sect. 3.2, we propose a masked spatial-channel attention (MSCA) module to distill features and discover relations between two arbitrarily structured scenes. Unlike affine transformation  [21] and flow-based warping  [42], MSCA provides better interpretability for the scene alignment task. In Sect. 3.3, we introduce how to use the aligned features for image synthesis. Finally, in Sect. 3.4, we propose a self-supervised scheme to facilitate learning.

Fig. 2. Left: our generator consists of three steps, namely feature extraction, feature alignment, and image synthesis; we describe each step in its corresponding section. Right: the MSCA module for feature alignment (at scale i). The module takes the image feature map \(F^{(i)}_{x,2}\) and the segmentation feature maps \(F^{(i)}_{c,1}\), \(F^{(i)}_{c,2}\) as inputs and outputs a new image feature map \(F^{(i)}_{x,1}\) that is aligned to the condition \(c_1\).

3.1 Feature Extraction

Taking an image \(x_2\) and label maps \(c_1,\widetilde{c}_2\) as inputs, the feature extraction module extracts multi-scale feature maps for each input. Specifically, the feature map \(F^{(i)}_{x,2}\) of image \(x_2\) at scale i is computed by:

$$\begin{aligned} \begin{aligned} F^{(i)}_{x,2} = W^{(i)}_x *F_{\text {vgg}}^{(i)}(x_2), \quad \text {for } i\in \{0, \dots , L\}, \end{aligned} \end{aligned}$$
(1)

where \(*\) denotes the convolution operation, \(F_{\text {vgg}}^{(i)}\) denotes the feature map extracted by VGG-19  [40] at scale i, and \(W^{(i)}_x\) denotes a \(1\times 1\) convolutional kernel for feature compression. L is the number of scales, and we set \(L=4\) in this paper.
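The following PyTorch sketch illustrates Eq. 1: multi-scale VGG-19 features compressed by \(1\times 1\) convolutions. The VGG split points, channel widths, and class names are illustrative assumptions rather than the exact configuration used here.

```python
import torch
import torch.nn as nn
from torchvision.models import vgg19, VGG19_Weights

class ImageFeatureExtractor(nn.Module):
    """Eq. 1: F_{x,2}^(i) = W_x^(i) * F_vgg^(i)(x_2), for i = 0..L (L = 4)."""
    def __init__(self, out_channels=64, num_scales=5):
        super().__init__()
        features = vgg19(weights=VGG19_Weights.DEFAULT).features.eval()
        for p in features.parameters():
            p.requires_grad = False
        # Assumed split points (after relu1_1, relu2_1, relu3_1, relu4_1, relu5_1).
        split_points = [2, 7, 12, 21, 30]
        vgg_channels = [64, 128, 256, 512, 512]
        self.blocks, prev = nn.ModuleList(), 0
        for sp in split_points[:num_scales]:
            self.blocks.append(features[prev:sp])
            prev = sp
        # W_x^(i): 1x1 convolutions for feature compression.
        self.compress = nn.ModuleList(
            nn.Conv2d(ch, out_channels, 1) for ch in vgg_channels[:num_scales])

    def forward(self, x2):
        feats, h = [], x2
        for block, w in zip(self.blocks, self.compress):
            h = block(h)
            feats.append(w(h))      # F_{x,2}^(i), from fine (i = 0) to coarse (i = L)
        return feats
```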

For label map \(c_1\), its feature \(F^{(i)}_{c,1}\) is computed by:

$$\begin{aligned} F^{(i)}_{c,1}= {\left\{ \begin{array}{ll} {{\,\mathrm{LReLU}\,}}(W^{(i)}_{c} * c^{(i)}_1) &amp; \text {for } i=L, \\ {{\,\mathrm{LReLU}\,}}(W^{(i)}_{c} * [\Uparrow (F^{(i+1)}_{c,1}), c^{(i)}_1]) &amp; \text {otherwise}, \end{array}\right. } \end{aligned}$$
(2)

where \(\Uparrow (\cdot )\) denotes \(\times 2\) bilinear upsampling, \(c^{(i)}_1\) denotes the resized label map, \(W^{(i)}_{c}\) denotes a \(1\times 1\) convolutional kernel for feature extraction, and \([\cdot , \cdot ]\) denotes channel-wise concatenation. Note that as scale i decreases from L down to 0, the feature resolutions in Eq. 2 progressively increase to match the finer label maps \(c^{(i)}_1\).

Similarly, applying Eq. 2 with the same weights to label map \(\widetilde{c}_2\), we can extract its features \(F^{(i)}_{c,2}\):

$$\begin{aligned} F^{(i)}_{c,2}= {\left\{ \begin{array}{ll} {{\,\mathrm{LReLU}\,}}(W^{(i)}_{c} * \widetilde{c}^{(i)}_2) &amp; \text {for } i=L, \\ {{\,\mathrm{LReLU}\,}}(W^{(i)}_{c} * [\Uparrow (F^{(i+1)}_{c,2}), \widetilde{c}^{(i)}_2]) &amp; \text {otherwise}. \end{array}\right. } \end{aligned}$$
(3)
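A minimal sketch of the label-feature pyramid of Eqs. 2–3 is given below; the same module (i.e. shared weights \(W^{(i)}_{c}\)) is applied to both \(c_1\) and \(\widetilde{c}_2\). The channel width, leaky-ReLU slope, and nearest-neighbor resizing of the label maps are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LabelFeaturePyramid(nn.Module):
    """Eqs. 2-3: coarse-to-fine label features, shared between c_1 and parsed c_2."""
    def __init__(self, num_labels, out_channels=64, num_scales=5):
        super().__init__()
        self.L = num_scales - 1
        self.convs = nn.ModuleList(
            nn.Conv2d(num_labels if i == self.L else num_labels + out_channels,
                      out_channels, 1)
            for i in range(num_scales))            # W_c^(i), i = 0..L

    def forward(self, c, base_size):
        """c: one-hot label map (B, num_labels, H, W); base_size: (H, W) at scale 0."""
        feats = [None] * (self.L + 1)
        for i in range(self.L, -1, -1):            # coarse (i = L) to fine (i = 0)
            size = (base_size[0] // 2 ** i, base_size[1] // 2 ** i)
            c_i = F.interpolate(c, size=size, mode="nearest")   # resized label map
            if i == self.L:
                h = self.convs[i](c_i)
            else:
                up = F.interpolate(feats[i + 1], size=size, mode="bilinear",
                                   align_corners=False)         # x2 upsampling
                h = self.convs[i](torch.cat([up, c_i], dim=1))
            feats[i] = F.leaky_relu(h, 0.2)
        return feats                                # F_c^(i) for i = 0..L
```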

3.2 Masked Spatial-Channel Attention Module

As shown in Fig. 2 right, taking the image features \(F^{(i)}_{x,2}\) and the label map features \(F^{(i)}_{c,1}\), \(F^{(i)}_{c,2}\) as inputs, the MSCA module generates a new image feature map \(F^{(i)}_{x,1}\) that has the content of \(F^{(i)}_{x,2}\) but is aligned with \(F^{(i)}_{c,1}\). We elaborate on the detailed procedure as follows:

Spatial Attention. Given feature maps \(F^{(i)}_{x,2}, F^{(i)}_{c,2}\) of the exemplar scene, the module first computes a spatial attention tensor \(\alpha ^{(i)}\in {[0,1]}^{K\cdot H\cdot W}\):

$$\begin{aligned} \begin{aligned} \alpha ^{(i)} = {{\,\mathrm{softmax}\,}}_{2,3}(\phi ^{(i)} *[F^{(i)}_{x,2}, F^{(i)}_{c,2}]), \end{aligned} \end{aligned}$$
(4)

with \(\phi ^{(i)}\in \mathbb {R}^{(N+M_2) \cdot K}\) denoting a \(1\times 1\) convolutional filter and \({{\,\mathrm{softmax}\,}}_{2,3}\) denoting a 2D softmax function on spatial dimensions \(\{2,3\}\). The output tensor contains K attention maps of resolution \(H\times W\), which serve to attend K different spatial regions on image feature \(F^{(i)}_{x,2}\).

Spatial Aggregation. Then, the module aggregates K feature vectors from \(F^{(i)}_{x,2}\) using the K spatial attention maps of \(\alpha ^{(i)}\) from Eq. 4. Specifically, a matrix dot product is performed:

$$\begin{aligned} \begin{aligned} \textit{\textbf{V}}^{(i)}&= \textit{\textbf{F}}^{(i)}_{x,2} (\varvec{\alpha }^{(i)})^\intercal , \end{aligned} \end{aligned}$$
(5)

with \(\varvec{\alpha }^{(i)}\in [0,1]^{K\cdot HW}\) and \(\textit{\textbf{F}}^{(i)}_{x,2}\in \mathbb {R}^{N\cdot HW}\) denoting the reshaped versions of \(\alpha ^{(i)}\) and \(F^{(i)}_{x,2}\), respectively. The output \(\textit{\textbf{V}}^{(i)} \in \mathbb {R}^{N \cdot K} \) stores feature vectors spatially aggregated from the K independent regions of \(F^{(i)}_{x,2}\).
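The two steps above (Eqs. 4–5) can be sketched in a few lines of PyTorch; the tensor layout and function name are illustrative.

```python
import torch

def spatial_attention_aggregate(F_x2, F_c2, phi):
    """
    F_x2: (B, N, H, W) exemplar image features; F_c2: (B, M2, H, W) exemplar label
    features; phi: nn.Conv2d(N + M2, K, 1), the 1x1 filter of Eq. 4.
    Returns alpha (B, K, H, W) and the K region descriptors V (B, N, K).
    """
    B, N, H, W = F_x2.shape
    logits = phi(torch.cat([F_x2, F_c2], dim=1))                   # (B, K, H, W)
    K = logits.shape[1]
    alpha = torch.softmax(logits.reshape(B, K, H * W), dim=2)      # softmax over space
    # Eq. 5: V = F_x2 . alpha^T, one aggregated feature vector per attended region.
    V = torch.bmm(F_x2.reshape(B, N, H * W), alpha.transpose(1, 2))  # (B, N, K)
    return alpha.reshape(B, K, H, W), V
```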

Feature Masking. The exemplar scene \(x_2\) may contain semantics that are irrelevant to the label map \(c_1\), and conversely, \(c_1\) may contain semantics that are unrelated to \(x_2\). To address this issue, we apply feature masking on the output of Eq. 5 by multiplying \(\textit{\textbf{V}}^{(i)}\) with a length-K gating vector at each row:

$$\begin{aligned} \begin{aligned} \widetilde{\textit{\textbf{V}}}^{(i)}&= (\textit{\textbf{V}}^{(i)})^T \circ {{\,\mathrm{mlp}\,}}([{{\,\mathrm{gap}\,}}(F^{(i)}_{c,1}),{{\,\mathrm{gap}\,}}(F^{(i)}_{c,2})]), \end{aligned} \end{aligned}$$
(6)

where \({{\,\mathrm{mlp}\,}}(\cdot )\) denotes a 2-layer MLP followed by a sigmoid function, \({{\,\mathrm{gap}\,}}\) denotes a global average pooling layer, \(\circ \) denotes broadcast element-wise multiplication, and \(\widetilde{\textit{\textbf{V}}}^{(i)}\) denotes the masked features. The feature masking in Eq. 6 resembles Squeeze-and-Excitation  [15]: by integrating global information from the label maps \(c_1\) and \(\widetilde{c}_2\), irrelevant features are filtered out.
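A sketch of the feature-masking gate of Eq. 6, with an assumed hidden width for the 2-layer MLP:

```python
import torch
import torch.nn as nn

class FeatureMasking(nn.Module):
    """Eq. 6: gate the K region descriptors using pooled label features."""
    def __init__(self, m1, m2, num_regions, hidden=64):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(m1 + m2, hidden), nn.ReLU(inplace=True),
            nn.Linear(hidden, num_regions), nn.Sigmoid())   # gate in [0, 1] per region

    def forward(self, V, F_c1, F_c2):
        """V: (B, N, K); F_c1: (B, M1, H1, W1); F_c2: (B, M2, H2, W2)."""
        g1 = F_c1.mean(dim=(2, 3))                  # global average pooling (gap)
        g2 = F_c2.mean(dim=(2, 3))
        gate = self.mlp(torch.cat([g1, g2], dim=1))           # (B, K)
        return V * gate.unsqueeze(1)                # broadcast over the N channels
```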

Channel Attention. Given feature \(F^{(i)}_{c,1}\) of label map \(c_1\), a channel attention tensor \(\beta ^{(i)}\in {[0,1]}^{K\cdot H\cdot W}\) is generated as follows:

$$\begin{aligned} \begin{aligned} \beta ^{(i)} = {{\,\mathrm{softmax}\,}}_{1}(\psi ^{(i)} *F^{(i)}_{c,1}), \end{aligned} \end{aligned}$$
(7)

with \(\psi ^{(i)}\in \mathbb {R}^{M_1\cdot K}\) denoting a \(1\times 1\) convolutional filter and \({{\,\mathrm{softmax}\,}}_{1}\) denoting a softmax function on channel dimension. The output \(\beta ^{(i)}\) serves to dynamically reuse features from \(\widetilde{\textit{\textbf{V}}}^{(i)}\).

Fig. 3. Our self-supervision scheme performs cross-reconstruction at the patch scale (top row) and self-reconstruction at the global scale (bottom row). The solid, dashed and dotted bounding boxes respectively represent images, semantic label maps, and synthesized outputs. Boxes with the same color are cropped from the same position.

Channel Aggregation. With channel attention \(\beta ^{(i)}\) computed in Eq. 7, feature vectors at HW spatial locations are aggregated again from \( \widetilde{\textit{\textbf{V}}}^{(i)}\) via matrix dot product:

$$\begin{aligned} \begin{aligned} \textit{\textbf{F}}^{(i)}_{x,1}&= \widetilde{\textit{\textbf{V}}}^{(i)} (\varvec{\beta }^{(i)})^\intercal , \end{aligned} \end{aligned}$$
(8)

where \(\varvec{\beta }^{(i)}\in \mathbb {R}^{K\cdot HW}\) denotes the reshaped version of \(\beta ^{(i)}\). The output \(\textit{\textbf{F}}^{(i)}_{x,1} \in \mathbb {R} ^{N \cdot HW}\) represents the aggregated features at HW locations. The output feature map \(F^{(i)}_{x,1}\) is generated by reshaping \(\textit{\textbf{F}}^{(i)}_{x,1}\) to size \( N \times H \times W\).
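Channel attention and aggregation (Eqs. 7–8) can be sketched as follows; as above, names and tensor layouts are illustrative.

```python
import torch

def channel_attention_aggregate(F_c1, V_masked, psi):
    """
    F_c1: (B, M1, H, W) target label features; V_masked: (B, N, K) masked region
    descriptors from Eq. 6; psi: nn.Conv2d(M1, K, 1), the 1x1 filter of Eq. 7.
    Returns the aligned image feature map F_x1 of shape (B, N, H, W).
    """
    B, _, H, W = F_c1.shape
    beta = torch.softmax(psi(F_c1), dim=1)                    # Eq. 7: (B, K, H, W)
    # Eq. 8: every location mixes the K region descriptors with weights beta.
    F_x1 = torch.bmm(V_masked, beta.reshape(B, -1, H * W))    # (B, N, HW)
    return F_x1.reshape(B, -1, H, W)
```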

Remarks. Spatial attention (Eq. 4) and aggregation (Eq. 5) attend to K independent regions of the feature map \(F^{(i)}_{x,2}\) and store the K regional features in \(\textit{\textbf{V}}^{(i)}\). After feature masking, given a new label map \(c_1\), channel attention (Eq. 7) and aggregation (Eq. 8) recombine \(\widetilde{\textit{\textbf{V}}}^{(i)}\) at each location to compute an output feature map. As a result, each output location either retrieves its corresponding regional features or is suppressed via feature masking. In this way, the features of the example scene are aligned. Note that when \(K=1\) and \(\alpha ^{(i)}\) is constant, the above operations reduce to global average pooling. We show in the experiments that \(K=8\) is sufficient to dynamically capture visually significant scene regions for alignment.

Multi-scaling. Both the global color tone and local appearances are informative for style-constrained synthesis. Therefore, we apply MSCA modules at all scales \(i\in \{0,\dots ,L\}\) to generate global and local features \(F^{(i)}_{x,1}\).

Fig. 4. From left to right: the inputs for example-guided synthesis, i.e. target label maps, exemplar label parsings from Deeplab-v2  [5], and exemplar images; the visual comparisons with cI2I  [27], EGSC-IT  [32], SPADE_VAE  [36], four ablation models, and our full model; and the retrieved ground truth before and after color correction  [45]. Our full model generates the most style-consistent results with the exemplar images. (Color figure online)

3.3 Image Synthesis

The extracted features \(F^{(i)}_{c,1}\) in Sect. 3.1 capture the semantic structure of \(c_1\), whereas the aligned features \(F^{(i)}_{x,1}\) in Sect. 3.2 capture the appearance style of the example scene. In this section, we leverage \(F^{(i)}_{c,1}\) and \(F^{(i)}_{x,1}\) as control signals to generate output images with desired structures and styles.

Specifically, we adopt a recent synthesis model, SPADE  [36], and feed the concatenation of \(F^{(i)}_{x,1}\) and \(F^{(i)}_{c,1}\) to the spatially-adaptive denormalization layer of SPADE at each scale. By taking the style and structure signals as inputs, spatially-controllable image synthesis is achieved. We refer readers to the appendix for more network details of the synthesis module.
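For concreteness, the sketch below shows one way the concatenated control signal could drive a SPADE-style denormalization layer. The conv widths and the parameter-free batch normalization follow the general SPADE recipe and are assumptions, not necessarily the exact architecture used here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SPADEFromAlignedFeatures(nn.Module):
    """Spatially-adaptive denormalization driven by [F_x1, F_c1] (Sect. 3.3)."""
    def __init__(self, norm_channels, cond_channels, hidden=128):
        super().__init__()
        self.norm = nn.BatchNorm2d(norm_channels, affine=False)
        self.shared = nn.Sequential(
            nn.Conv2d(cond_channels, hidden, 3, padding=1), nn.ReLU(inplace=True))
        self.to_gamma = nn.Conv2d(hidden, norm_channels, 3, padding=1)
        self.to_beta = nn.Conv2d(hidden, norm_channels, 3, padding=1)

    def forward(self, h, F_x1, F_c1):
        # Control signal: concatenation of aligned style and structure features.
        cond = torch.cat([F_x1, F_c1], dim=1)
        cond = F.interpolate(cond, size=h.shape[2:], mode="nearest")
        a = self.shared(cond)
        # Per-pixel scale and shift applied to the normalized activations.
        return self.norm(h) * (1 + self.to_gamma(a)) + self.to_beta(a)
```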

3.4 Self-supervised Training

Training an example-guided synthesis model that can transfer styles across semantically different scenes is challenging. First, style-consistent scene images are hard to acquire. A previous work  [41] generates style-consistent pairs from videos; however, collecting scene videos is labor-intensive. Second, even with ground-truth style-consistent pairs, the trained model is not guaranteed to generalize to new arbitrary scenes.

We propose a novel self-supervised scheme to enable style-transfer between two structurally and semantically different scenes. Our solution is motivated by the fact that the style of a scene image is stationary, meaning that patches cropped from the same scene share largely the same style. Moreover, non-overlapping patches from the same scene may contain new structures and semantic labeling, which is essential for the learned model to generalize better.

We first design a cross-reconstruction task at the patch scale: given patches \(x_p\) and \(x_q\) cropped from the same scene image x, the generator is asked to reconstruct \(x_p\) using \(x_q\). Formally,

$$\begin{aligned} \begin{aligned} \hat{x}_{p}&={{\,\mathrm{G}\,}}(c_p,x_q,\widetilde{c}_q). \end{aligned} \end{aligned}$$
(9)

Note that \(c_p\) and \(\widetilde{c}_q\) contain different semantic labelings. Therefore, the generator is required to infer the correlation between different semantic labelings for coherent style transfer. An illustrative example is shown in Fig. 3. More details on patch sampling are included in the appendix.
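A minimal sketch of how such a training pair could be sampled is given below; the simple left/right half split is an assumed sampling strategy, since the exact procedure is described in the appendix.

```python
import torch

def sample_cross_pair(x, c, patch=256):
    """x: (B, 3, H, W) image; c: (B, C, H, W) one-hot label map, with H >= patch
    and W >= 2 * patch so the two crops cannot overlap."""
    _, _, H, W = x.shape
    top = torch.randint(0, H - patch + 1, (1,)).item()
    left_p = torch.randint(0, W // 2 - patch + 1, (1,)).item()            # left half
    left_q = W // 2 + torch.randint(0, W // 2 - patch + 1, (1,)).item()   # right half
    crop = lambda t, r, s: t[:, :, r:r + patch, s:s + patch]
    x_p, c_p = crop(x, top, left_p), crop(c, top, left_p)
    x_q, c_q = crop(x, top, left_q), crop(c, top, left_q)
    return (x_p, c_p), (x_q, c_q)    # train with x_p_hat = G(c_p, x_q, c_q), Eq. 9
```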

The cross-reconstruction task is designed at the patch scale and may not generalize well to the global scale. In fact, a generator trained with the patch-level task alone tends to generate repetitive local textures (see Sect. 4). Therefore, we further design a self-reconstruction task at the global scale, which reconstructs a global image x from itself:

$$\begin{aligned} \begin{aligned} \hat{x}&={{\,\mathrm{G}\,}}(c,x,\widetilde{c}). \end{aligned} \end{aligned}$$
(10)

Our training objective for generator G and discriminator D is formulated as:

$$\begin{aligned} \begin{aligned} \mathcal {L}(G, D) =&\log D(x_p,c_p,x_q,\widetilde{c}_q) + \log (1-D(\hat{x}_{p},c_p,x_q,\widetilde{c}_q)) + \mathcal {L}_{spade}(\hat{x}_{p}, x_p)\\&+\lambda \{\log D(x,c,x,\widetilde{c}) + \log (1-D(\hat{x},c,x,\widetilde{c})) + \mathcal {L}_{spade}(\hat{x}, x)\} \end{aligned} \end{aligned}$$
(11)

where \(\mathcal {L}_{spade}\) refers to the VGG and GAN feature matching losses defined in  [36], and \(\lambda \) is a parameter that controls the relative importance of the two self-supervised tasks. We set \(\lambda =1\) in our experiments. Our full objective for self-supervised training is:

$$\begin{aligned} \begin{aligned} G^*= \arg \min _{G} \max _{D} {\mathcal {L}(G, D)} \end{aligned} \end{aligned}$$
(12)
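The sketch below assembles the two reconstruction tasks into generator and discriminator objectives (Eqs. 11–12). The non-saturating logistic form (softplus on logits), the placeholder spade_loss callable, and passing the ground-truth label map in place of the parsed map \(\widetilde{c}\) are implementation assumptions; \(\mathcal {L}_{spade}\) itself follows  [36].

```python
import torch
import torch.nn.functional as F

def generator_loss(G, D, spade_loss, x, c, x_p, c_p, x_q, c_q, lam=1.0):
    """Generator side of Eq. 11: patch cross-reconstruction + global self-reconstruction."""
    x_p_hat = G(c_p, x_q, c_q)                      # Eq. 9: cross-reconstruction
    x_hat = G(c, x, c)                              # Eq. 10: self-reconstruction
    loss_patch = F.softplus(-D(x_p_hat, c_p, x_q, c_q)).mean() + spade_loss(x_p_hat, x_p)
    loss_global = F.softplus(-D(x_hat, c, x, c)).mean() + spade_loss(x_hat, x)
    return loss_patch + lam * loss_global           # lambda = 1 in our experiments

def discriminator_loss(G, D, x, c, x_p, c_p, x_q, c_q, lam=1.0):
    """Discriminator side of Eq. 11 (Eq. 12: G* = arg min_G max_D L(G, D))."""
    with torch.no_grad():
        x_p_hat, x_hat = G(c_p, x_q, c_q), G(c, x, c)
    real = F.softplus(-D(x_p, c_p, x_q, c_q)).mean() \
        + lam * F.softplus(-D(x, c, x, c)).mean()
    fake = F.softplus(D(x_p_hat, c_p, x_q, c_q)).mean() \
        + lam * F.softplus(D(x_hat, c, x, c)).mean()
    return real + fake
```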

4 Experiments

Dataset. Our model is trained on the COCO-stuff dataset  [3]. It contains densely annotated images captured from various scenes. We remove images with indoor scenes and large objects from the dataset, resulting in 34,698/499 scene images for training/testing, respectively. The COCO-stuff dataset does not provide ground truth for the example-guided synthesis task, i.e. two images with exactly the same style. To quantitatively evaluate the performance, we design three tasks for which a ground-truth image can be obtained: i) the duplicating task self-reconstructs an image using itself as the exemplar and its semantic label map as the layout condition, ii) the mirroring task reconstructs an image using its mirrored version as the exemplar and the semantic label map as the layout condition, and iii) the retrieving task requires a model to reconstruct a ground-truth (GT) image using its semantic label map and an image retrieved from an image pool. To retrieve an image that best matches the GT in style, we first select 20 candidate images from the image pool that have the greatest label histogram intersections with the GT image. Afterwards, the best-matched image is selected out of the candidates using SIFT Flow  [28]. Finally, since the color of the GT differs from that of the retrieved image, we apply color correction  [45] on the GT to eliminate the color discrepancy. Examples of the GT before and after color correction are shown in the blue box in Fig. 4.
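The candidate-selection step of the retrieving task can be sketched as follows; the final SIFT Flow matching and the color correction step are omitted, and all names are illustrative.

```python
import numpy as np

def label_histogram(label_map, num_labels):
    """label_map: (H, W) array of integer label ids; returns a normalized histogram."""
    hist = np.bincount(label_map.ravel(), minlength=num_labels).astype(np.float64)
    return hist / hist.sum()

def top_candidates_by_histogram(gt_label, pool_labels, num_labels, top_k=20):
    """Rank pool images by label-histogram intersection with the GT label map."""
    h_gt = label_histogram(gt_label, num_labels)
    scores = [np.minimum(h_gt, label_histogram(p, num_labels)).sum()   # intersection
              for p in pool_labels]
    return np.argsort(scores)[::-1][:top_k]     # indices of the best-matching images
```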

Table 1. Quantitative comparisons of different methods and ablation models in terms of PSNR, LPIPS  [49], Fréchet Inception Distance (FID)  [14] and style loss (\(\mathcal {L}_{style}\))  [9]. Higher scores are better for metrics marked with an up arrow (\(\uparrow \)), and vice versa.
Fig. 5. With a slight modification and no further training, our model can perform style interpolation between exemplar inputs. Note how our model interpolates styles for new semantics, e.g. "river" in row 3.

Implementation Details. We use a COCO-stuff pretrained Deeplab-v2  [5] model to generate semantic label maps from exemplar images. During training, we resize images to \(512 \times 512\) then crop two non-overlapping patches of size \(256 \times 256\) to facilitate patch-based cross-reconstruction. After 20 epochs, we increase the patch size to \(384 \times 384\) for cross-patch reconstruction in order to improve generalization to global scenes. Details of the patch sampling procedure are provided in the appendix.

For the MSCA modules from scale 0 to 4, the numbers of attention maps K are set to 8, 16, 16, 16 and 16, respectively. The learning rate is set to 0.0002 for both the generator and the discriminator. The weights of the generator are updated every 5 iterations. Our synthesis model and all comparison models based on the SPADE backbone are trained for 40 epochs to generate the results in the experiments.

Before training, we pretrain the spatial-channel attention with a lightweight feature decoder to improve training efficiency. Specifically, at each scale, the concatenation of \(F^{(i)}_{x,1}\) and \(F^{(i)}_{c,1}\) in Sect. 3.3 is fed into a \(1\times 1\) convolutional layer to reconstruct the ground-truth VGG feature at the corresponding scale. More details of the pretraining procedure are provided in the appendix.

Comparative Methods. We compare our approach with an example-guided synthesis approach: variational autoencoding SPADE (SPADE_VAE)  [36], which is trained with a self-reconstruction loss. We also train cI2I  [27], EGSC-IT  [32] and SCGAN  [41] on the COCO-stuff dataset. cI2I and EGSC-IT are originally designed for exemplar-guided image-to-image translation; we observe that they have difficulty generating images from one-hot encoded semantic label maps, but they can synthesize reasonable images from color-encoded semantic label maps. Finally, we note that SCGAN is not directly applicable to the COCO-stuff dataset, as its positive pairs are sampled from video data. We attempted to modify SCGAN so that its positive pairs are generated by our self-supervision task; however, we could not obtain reasonable image outputs. We speculate that the negative sampling and the semantic consistency loss of SCGAN are not optimal for the COCO-stuff dataset, which contains much larger variations for negative pairs. In addition, four ablation models are evaluated (see Ablation Study).

Fig. 6. Left: inputs and outputs of our model. Right: the \(K=8\) learned spatial and channel attention maps, which attend to and transfer features between individual exemplar and target regions. By examining the semantic label maps, we observe clear label-to-label transformation patterns for both samples.

Quantitative Evaluation. For quantitative evaluation, we use PSNR as a low-level metric. Furthermore, perceptual-level metrics, including the Learned Perceptual Image Patch Similarity (LPIPS) distance  [49], the Fréchet Inception Distance (FID)  [14] and the style loss (\(\mathcal {L}_{style}\)) of  [9], are evaluated for the different methods. The linearly calibrated VGG model is used to compute the LPIPS distance.

Among the four competitive methods (cI2I, EGSC-IT, SPADE_VAE and ours full) in Table 1, our method clearly outperforms the others in both low-level and perceptual-level measurements, suggesting that our model better preserves color and texture appearances. We also observe that, without further modification, the off-the-shelf example-guided image translation approaches (cI2I, EGSC-IT) do not perform well on the image synthesis task, suggesting that example-guided image synthesis is more challenging. Finally, a simple synthesis model (ours GAP) outperforms SPADE_VAE, suggesting that the self-supervised task in Sect. 3.4 is beneficial to the example-guided synthesis task (see Ablation Study for more details).

Qualitative Evaluation. Figure 4 qualitatively compares our approach against the other approaches on four scenes. We observe that our full model generates results that are more style-consistent with the exemplar images. In comparison, SPADE_VAE tends to generate results with low color contrast, as it lacks the mechanism and supervision to perform region-aware style transformation. In addition, the existing example-guided image-to-image approaches (cI2I, EGSC-IT) do not generalize well to the image synthesis task.

Ablation Study. To evaluate the effectiveness of our design, we separately train four variants of our model: i) ours GAP, which replaces the MSCA module with global average pooling; ii) ours MSCA w/o att, which keeps the MSCA modules but replaces spatial and channel attention with one-hot label maps from the source and target domains, respectively, so that alignment is performed only for regions with the same semantic labeling; iii) ours MSCA w/o fm, which keeps the MSCA modules but removes the feature masking procedure; and iv) ours MSCA w/o global, which is trained without the global-level self-reconstruction (Eq. 10) or the increased patch size.

In Table 1, our full model clearly achieves the best quantitative results. In Fig. 4, ours GAP tends to produce images with deviated colors since it averages the style features over all exemplar regions; in contrast, our full model dynamically transfers appearance for individual regions. We observe that ours MSCA w/o att is less stable in training and cannot generate plausible results. We suspect that the label-level alignment generates more misaligned and noisier feature maps, thus hurting training. Ours MSCA w/o fm tends to generate inconsistent colors for new semantic labels, for instance, the "hill" and "sky" regions in rows 1 and 2 of Fig. 4; in contrast, our full model eliminates the undesired influence of exemplar inputs on new semantic labels. Ours MSCA w/o global performs reasonably well but tends to generate repetitive local textures, while the self-reconstruction scheme helps our model generalize better at the global scale.

User Study. We conduct a user study to qualitatively evaluate our method. Specifically, we retrieve an exemplar image for each testing label map and ask 20 subjects to choose the most style-consistent result among those generated by our method and two competitive baselines (SPADE_VAE and ours GAP). To generate samples for the user study, we first rank all images in the image pool by their label histogram intersection with each target scene and use the top 20 percentile images as exemplars. The subjects are given unlimited time to make their selections. For each subject, we randomly generate 100 questions from the dataset. Table 2 shows the evaluation results. First, all subjects strongly favor our results. Second, ours GAP is favored more than twice as often as SPADE_VAE  [36], further suggesting that the proposed self-supervision scheme is effective, since ours GAP is also trained with self-supervision.

Table 2. User preference study. The numbers indicate the percentage of users who favor the results generated by each method.

Effect of Attention. To understand the effect of spatial-channel attention, we visualize the learned spatial and channel attention in Fig. 6. We observe that: a) spatial attention can attend to multiple regions of the reference image, and for each reference region, channel attention finds the corresponding target region; b) spatial-channel attention can detect and utilize the similarities of semantic labels to facilitate style feature transfer. In the first sample of Fig. 6, the attention in channels 1 and 4 each performs a distinct label-to-label transformation; in the second sample, the attention in channels 1, 2 and 7 does likewise (see the label maps in Fig. 6). We provide more analysis on the effect of attention in the appendix.

Fig. 7. With a slight modification and no further training, our model can perform spatial style interpolation. In this figure, we demonstrate a horizontal gradient style change on the output image. Please refer to Interpolation, Sect. 4 for more details.

Fig. 8. Given an exemplar patch at the center and the global semantic label map, our trained model can perform example-guided scene image extrapolation, i.e. generating style-consistent beyond-the-border images under semantic map guidance.

Interpolation. We can easily control the synthesized styles at test time by manipulating the spatial and channel attention. First, by manipulating the spatial attention of two exemplar inputs, our trained model can perform style interpolation between the two exemplars; the results are shown in Fig. 5. Next, by manipulating the channel attention, our trained model can perform spatial style interpolation; Fig. 7 shows that our model can interpolate between two images and generate horizontally gradient style changes. More details are included in the appendix.

Extrapolation. Given a scene patch at the center, our model can achieve scene extrapolation, i.e. generating beyond-the-border image content according to the semantic map guidance. A \(512\times 512\) extrapolated image is generated by a weighted combination of synthesized \(256\times 256\) patches at the 4 corners and 10 other random locations. As shown in Fig. 8, our model generates visually plausible extrapolated images, showing the promise of the proposed framework for guided scene panorama generation.
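A sketch of the weighted patch blending is shown below; the Gaussian-shaped blending window is an assumption, since the exact weighting scheme is not specified here.

```python
import torch

def blend_patches(patches, positions, canvas_size=512, patch=256):
    """patches: list of (3, patch, patch) tensors; positions: list of (top, left)."""
    canvas = torch.zeros(3, canvas_size, canvas_size)
    weight = torch.zeros(1, canvas_size, canvas_size)
    # Smooth window that down-weights patch borders to hide seams (assumed shape).
    coords = torch.linspace(-1, 1, patch)
    win = torch.exp(-(coords ** 2) / 0.5)
    w2d = (win[:, None] * win[None, :]).unsqueeze(0)          # (1, patch, patch)
    for img, (top, left) in zip(patches, positions):
        canvas[:, top:top + patch, left:left + patch] += img * w2d
        weight[:, top:top + patch, left:left + patch] += w2d
    return canvas / weight.clamp_min(1e-8)
```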

Fig. 9. Style-structure swapping on 6 arbitrary scenes at resolution \(256\times 256\). Our model generalizes across recognizably different scenes with different semantics. Note that the images along the diagonal (red boxes) are self-reconstructions. (Color figure online)

Style Swapping. Figure 9 shows reference-guided style swapping between six arbitrary scenes. Our model generalizes across recognizably different scene semantics and appearances, including snow, mountain, seashore, grassland, desert and artistic effects, and synthesizes images with reasonable and consistent styles. More results and comparisons to other approaches are included in the appendix.

5 Conclusion

We propose to address a challenging example-guided image synthesis task between semantically very different scenes. To propagate information between two structurally unaligned and semantically different scenes, we propose an MSCA module that leverages decoupled cross-attention for adaptive correspondence modeling. With MSCA, we propose a unified model for joint global-local alignment and image synthesis. We further propose a patch-based self-supervision scheme that enables training. Experiments on the COCO-stuff dataset show significant improvements over the existing methods. Furthermore, our approach provides interpretability and can be extended to other content manipulation tasks.