1 Introduction

Sketches, i.e., rapidly executed freehand drawings, are an intuitive and powerful form of visual expression (Fig. 1). There is much research on sketch recognition [7, 35], sketch parsing [26, 27], and sketch-based image or video retrieval [21, 28, 36]. We study how to imagine a realistic photo from a sketch that is spatially imprecise and missing colorful details, by learning from unpaired sketches and photos.

Fig. 1. Comparisons of image types and challenges of sketch to photo synthesis. Left: A single object shape could have multiple distinctive colorings yet a common or similar grayscale. Edges extracted by Canny and HED detectors lose colorful details but align well with boundaries in the color photo, whereas sketches are more abstract lines drawn with deformations and style variations. Row 2 shows their lines overlaid on the grayscale photo. Right: Human vision can imagine a realistic photo given a free-hand sketch. Our goal is to equip computer vision with the same imagination capability.

Sketch to photo synthesis is challenging for three reasons.

1) Sketches of objects often do not match their shapes in photos, since sketches, commonly drawn by amateurs, have large spatial and geometric distortions. Translating a sketch to a photo thus requires shape rectification. However, it is not trivial to rectify shape distortion in a sketch, as line strokes are only suggestive of the actual shapes and locations, and the extent of shape fidelity varies widely between individuals. In Fig. 1, the three sketches for the same shoe are very different in both overall proportions and local stroke styles.

2) Sketches are colorless and lack details. Drawn in black strokes on white paper, sketches outline mostly object boundaries and characteristic interior markings. To synthesize a photo, shading and colorful textures must be filled in properly. However, it is not trivial to fill in details either. Since a single sketch could depict many photos, any synthesizer must be capable of producing not only realistic but also diverse photos for a single sketch.

3) Sketches may not have corresponding photos. Free-hand sketches can be created from observation, memory, or pure imagination; they are not as widely available as photos, and those with corresponding photos are even rarer. A few sketch datasets exist in computer vision. TU-Berlin [6] and QuickDraw [11] contain sketches only, with 20,000 and 50 million instances over 250 and 345 categories respectively. Contour Drawing [19] and Scenesketchy [39] have sketch-photo image pairs at the scene level; their sketches are either contour tracings or cartoon-style line drawings, neither representative of real-world free-hand sketches. Sketchy [28] has only 500 sketches paired with 100 photos in each of 125 categories. ShoeV2 and ChairV2 [36] contain 6,648/2,000 and 1,297/400 sketches/photos in the single semantic categories of shoes and chairs respectively. To enable data-driven learning of sketch to photo synthesis, we must handle limited sketch data and unpaired sketches and photos.

Fig. 2. Comparison of sketch to photo synthesis settings and results. Left: Three training scenarios on whether line drawings and photos are provided as paired training instances and whether line drawings are spatially aligned with the photos. Edges extracted from photos are aligned, whereas sketches are not. The bottom panel compares synthesis results from representative approaches in each setting, indicated by the same line/bracket color. Ours are superior to unsupervised edge-map-to-photo methods (CycleGAN [38], MUNIT [15], UGATIT [18]) and even supervised methods (Pix2Pix [16]) trained on paired data. Right: Our unsupervised sketch-to-photo synthesis model has two separate stages handling spatial deformation and color enrichment respectively: Shape translation learns to synthesize a grayscale photo given a sketch, from unpaired sketch and photo sets, whereas color enrichment learns to fill the grayscale with colorful details given an optional reference photo.

Existing works focus on either shape or color translation alone (Fig. 2). 1) Most image synthesis work that deals with shape transfiguration stays within the same visual domain, e.g., changing a picture of a dog into one of a cat [15, 22], where the two color images have comparable visual detail. 2) Sketches are a special case of line drawings, and the most studied case of line drawings in computer vision is the edge map extracted automatically from a photo. Such edge-map-to-photo synthesis does not suffer from the spatial deformation problem that exists between sketches and photos, and realistic photos can be synthesized with [16, 31] or without [38] paired training data between drawings and photos. We will show that existing methods fail at sketch to photo synthesis when both shape and color translations are needed simultaneously.

We consider learning sketch to photo synthesis from sketches and photos of the same object category such as shoes. There is no pairing information between individual sketches and photos; these two sets can be independently collected.

Our insight for unsupervised sketch to photo synthesis is to decompose the task into two separate translations (Fig. 2). Our two-stage model first performs shape translation in grayscale and then content fill-in in color. Stage 1) Shape translation learns to synthesize a grayscale photo given a sketch, from unpaired sketch and photo sets. Geometric distortions are eliminated at this step. To handle abstraction and drawing style variations, we apply a self-supervised learning objective to composed noise sketches, and also introduce an attention module that lets the model ignore distractions. Stage 2) Content enrichment learns to fill the grayscale with details, including colors, shading, and textures, given an optional reference image. It is designed to work with or without reference images; this capability is enabled by a mixed training strategy. Our model can thus produce diverse outputs on demand.

Our model links sketches to photos and can be used directly in sketch-based photo retrieval. Another exciting corollary result from our model is that we can also synthesize a sketch given a photo, even from unseen semantic categories. Strokes in a sketch capture information beyond edge maps defined primarily on intensity contrast and object exterior boundaries. Automatic photo to sketch generation could lead to more advanced computer vision capabilities and serve as a powerful human-user interaction device.

Our work makes the following contributions. 1) We propose the first two-stage unsupervised model that can generate diverse, sketch-faithful, and photo-realistic images from a single free-hand sketch. 2) We introduce a self-supervised learning objective and an attention module to handle abstraction and style variations in sketches. 3) Our work not only enables sketch-based image retrieval but also delivers an automatic sketcher that captures human visual perception beyond the edge map of a photo. See http://sketch.icsi.berkeley.edu.

2 Related Works

Sketch-Based Image Synthesis. While much progress has been made on sketch recognition [6, 35, 37] and sketch-based image retrieval [9, 13, 20, 21, 28, 36], sketch-based image synthesis remains under-explored.

Prior to deep learning (DL), Sketch2Photo [4] and PhotoSketcher [8] compose a new photo from photos retrieved for a sketch. Sketch2Photo [4] first retrieves photos based on the class label, then uses the given sketch to filter them and compose a target photo. PhotoSketcher [8] has a similar pipeline but retrieves photos based on a rather restrictive sketch and hand-crafted features.

The first DL-based free-hand sketch-to-photo synthesis is SketchyGAN [5], which trains an encoder-decoder model conditioned on the class label for sketch and photo pairs. Contextual GAN [23] treats sketch to photo synthesis as an image completion problem, using the sketch as a weak contextual constraint. Interactive Sketch [10] focuses on multi-class photo synthesis based on incomplete edges or sketches. All of these works rely on paired sketch and photo data and do not address the shape deformation problem.

Sketches are often used in photo editing [1, 25, 34], e.g., line strokes are drawn on a photo to change the shape of a roof. Unlike our sketch to photo synthesis, these works mainly address a constrained image inpainting problem.

Synthesis from the opposite direction, photo to sketch, has also been studied [19, 29]: The former proposes a hybrid model to synthesize a sketch stroke by stroke given a photo, whereas the latter aims to generate boundary-like drawings that capture the outline of the visual scene. Both models require paired data for training. While photo to sketch is not our focus, our model trained only on shoes can generate realistic sketches from photos in other semantic categories.

Generative Adversarial Networks (GAN). A GAN has a generator (G) and a discriminator (D): G tries to generate fake instances that fool D, while D tries to distinguish fakes from real samples. GANs are widely used for realistic image generation [17, 24] and translation across image domains [15, 16].

Pix2Pix [16] is a conditional GAN that maps source images to target images; it requires paired (source,target) data during training. CycleGAN [38] uses a pair of GANs to map an image from the source domain to the target domain and then back to the source domain. Imposing a consistency loss over such a cycle of mappings, it allows both models to be trained together on unpaired source and target images in two different domains. UNIT [22] and MUNIT [15] are variations of CycleGAN, both achieving impressive performance.

None of these methods work well when the source and target images are poorly aligned spatially (Fig. 1) and lie in different appearance domains.

Fig. 3. Our two-stage model architecture (top) and three major technical components (bottom) that tackle abstract and style-varying strokes: noise sketch composition for training data augmentation, a self-supervised de-noising objective, and an attention module to suppress distracting dense strokes.

3 Unsupervised Two-Stage Sketch-to-Photo Synthesis

In our unsupervised learning setting, we are given two sets of data in the same semantic category such as shoes, and no instance pairing is known or available. Formally, all we have are n sketches \(\{S_1,\ldots , S_n \}\) and m color photos \(\{I_1,\ldots , I_m \}\) along with their grayscale versions \(\{G_1,\ldots , G_m \}\).

Compared to photos, sketches are spatially imprecise and colorless. To synthesize a photo from a sketch, we deal with these two aspects at separate stages: We first translate a sketch into a grayscale photo and then translate the grayscale into a color photo filled with missing details on texture and shading (Fig. 3).

3.1 Shape Translation: Sketch \(S\rightarrow \) Grayscale G

Overview. We first learn to translate sketch S into grayscale photo G. The goal is to rectify shape deformation in sketches. We consider unpaired sketch and photo images, not only because paired data are scarce and hard to collect, but also because heavy reliance on paired data could restrict the model from recognizing the inherent misalignment between sketches and photos.

A pair of mappings, \(T: S\xrightarrow {}G\) and \(T': G\xrightarrow {}S\), each implemented with an encoder-decoder architecture, are learned with cycle-consistency objectives: \(S\approx T'(T(S))\) and \(G\approx T(T'(G))\). Similar to [38], we train two domain discriminators \(D_{G}\) and \(D_{S}\): \(D_{G}\) tries to tease apart G and T(S), while \(D_{S}\) teases apart S and \(T'(G)\) (Fig. 3). The predicted grayscale T(S) goes to content enrichment next.

The input sketch may exhibit various levels of abstraction and different drawing styles. In particular, sketches containing dense strokes or noisy details (Fig. 3) cannot be handled well by a basic CycleGAN model.

To deal with these variations, we introduce two strategies for the model to extract style-invariant information only: 1) We compose additional noise sketches to enrich the dataset and introduce a self-supervised objective; 2) We introduce an attention module to help detect distracting regions.

Noise Sketch Composition. In a rapidly drawn sketch, strokes could be deliberately complex, or simply careless and distracting (Fig. 3). We augment the limited sketch data with additional noise. Let \(S^{\text {noise}}=\varphi (S)\), where \(\varphi (.)\) represents composition. We detect dense strokes and construct a pool of noise masks. We randomly sample from these masks and artificially generate complex sketches by inserting the sampled dense stroke patterns into original sketches. We generate distractive sketches by adding a random patch from a different sketch onto an existing sketch. The noise strokes and random patches simulate irrelevant details in a sketch. We compose such noise sketches on the fly and feed them into the network with a fixed occurrence ratio.
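A minimal sketch of this composition step, assuming sketches are binary NumPy arrays (strokes = 1 on a 0 background) and that a pool of dense-stroke masks has already been extracted; the probabilities and patch size follow Sect. 4.1, while the function and argument names are hypothetical:

```python
import random
import numpy as np

def compose_noise_sketch(sketch, noise_masks, other_sketch,
                         p_complex=0.2, p_distract=0.3, patch=50):
    """S_noise = phi(S): overlay dense-stroke patterns and/or a random
    patch from another sketch. `sketch`, `other_sketch`: HxW binary arrays."""
    out = sketch.copy()
    h, w = out.shape
    if random.random() < p_complex and noise_masks:
        mask = random.choice(noise_masks)          # dense-stroke pattern, HxW
        out = np.maximum(out, mask)                # overlay noisy strokes
    if random.random() < p_distract:
        y, x = random.randint(0, h - patch), random.randint(0, w - patch)
        ys, xs = random.randint(0, h - patch), random.randint(0, w - patch)
        out[y:y+patch, x:x+patch] = np.maximum(    # paste a distracting patch
            out[y:y+patch, x:x+patch],
            other_sketch[ys:ys+patch, xs:xs+patch])
    return out
```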

Self-supervised Objective. We introduce a self-supervised objective to work with the synthesized noise sketches. For a composed noise sketch, the reconstruction goal of our model is to reproduce the original clean sketch:

$$\begin{aligned} L_{ss}(T, T^\prime )= \left\| S-T^\prime \left( T(S^{\text {noise}})\right) \right\| _{1} \end{aligned}$$
(1)

This objective is different from the cycle-consistency loss used on the untouched original sketches. It makes the model ignore irrelevant strokes and put more effort into style-invariant strokes in the sketch.
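In PyTorch-style code, Eq. (1) is a plain L1 reconstruction of the clean sketch from its composed noisy version (T and T' are the two translators; tensors are assumed batched):

```python
import torch.nn.functional as F

def self_supervised_loss(T, T_prime, clean_sketch, noise_sketch):
    # Eq. (1): reconstruct the clean sketch S from the composed noisy S_noise.
    recon = T_prime(T(noise_sketch))
    return F.l1_loss(recon, clean_sketch)
```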

Ignore Distractions with Active Attention. To identify distracting strokes, we also introduce an attention module. Since most areas of a sketch are blank, the activation of dense stroke regions is stronger than that of other regions. We can thus locate distracting areas and suppress the activation there accordingly. That is, the attention module generates an attention map A used for re-weighting the feature representation of sketch S (Eq. 2):

$$\begin{aligned} f_{\text {final}}(S)=(1-A)\odot f(S) \end{aligned}$$
(2)

where f(.) refers to the feature map and \(\odot \) denotes element-wise multiplication. Our attention is used for area suppression instead of the usual area highlighting.
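One possible realization of this suppression attention (Eq. 2), assuming the attention module is a small convolutional head with a sigmoid output; the exact architecture is given in the Supplementary, so this is only illustrative:

```python
import torch.nn as nn

class SuppressionAttention(nn.Module):
    """Predict an attention map A in [0, 1] over stroke features and
    re-weight them as (1 - A) * f(S), suppressing dense/noisy regions."""
    def __init__(self, channels):
        super().__init__()
        self.head = nn.Sequential(
            nn.Conv2d(channels, channels // 2, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // 2, 1, 1),
            nn.Sigmoid(),
        )

    def forward(self, feat):           # feat = f(S), shape (B, C, H, W)
        attn = self.head(feat)         # A, shape (B, 1, H, W)
        return (1.0 - attn) * feat     # Eq. (2): suppress distracting areas
```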

Our total objective for training a shape translation model is:

$$\begin{aligned} \min _{T, T^\prime } \max _{D_{G}, D_{S}}&\lambda _{1} (L_{adv}(T,D_{G};S,G) + L_{adv}(T^\prime ,D_{S};G,S)) \\ +&\lambda _{2} L_{cycle}(T,T^\prime ;S,G)+\lambda _{3} L_{identity}(T,T^\prime ;S,G)+ L_{ss}(T,T^\prime ;S^{noise}). \end{aligned}$$

We follow [38] to add an identity loss \(L_{identity}\), which slightly improves the performance. See the details of each loss in the Supplementary.
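Putting the pieces together, the generator-side objective for shape translation could be assembled as below. The adversarial, cycle, and identity terms follow CycleGAN [38]; the least-squares GAN form and the weight values are assumptions (the exact losses and weights are in the Supplementary), and the discriminators are updated separately:

```python
import torch
import torch.nn.functional as F

def shape_translation_loss(T, T_prime, D_G, D_S, S, G, S_noise, S_clean,
                           lam1=1.0, lam2=10.0, lam3=5.0):
    """Generator-side objective of Sect. 3.1 (weights are placeholders)."""
    fake_G, fake_S = T(S), T_prime(G)

    # Adversarial terms (a least-squares GAN loss is assumed here).
    pred_G, pred_S = D_G(fake_G), D_S(fake_S)
    l_adv = F.mse_loss(pred_G, torch.ones_like(pred_G)) + \
            F.mse_loss(pred_S, torch.ones_like(pred_S))

    # Cycle consistency: S ~ T'(T(S)) and G ~ T(T'(G)).
    l_cyc = F.l1_loss(T_prime(fake_G), S) + F.l1_loss(T(fake_S), G)

    # Identity: feeding a target-domain image should leave it unchanged.
    l_idt = F.l1_loss(T(G), G) + F.l1_loss(T_prime(S), S)

    # Self-supervised de-noising objective, Eq. (1).
    l_ss = F.l1_loss(T_prime(T(S_noise)), S_clean)

    return lam1 * l_adv + lam2 * l_cyc + lam3 * l_idt + l_ss
```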

3.2 Content Enrichment: Grayscale \(G\rightarrow \) Color I

Now that we have a predicted grayscale photo G, we learn a mapping C that turns it into color photo I. The goal at this stage is to enrich the generated grayscale photo G with missing appearance details.

Since a colorless sketch could have many colorful realizations, many fill-ins are possible. We thus model the task as a style transfer task and use an optional reference color image to guide the selection of a particular style.

We implement C as an encoder (E) and decoder (D) network (Fig. 3). Given a grayscale photo G as the input, the model outputs a color photo I. The input G and the grayscale of the output I, specifically the L-channel of the output in CIE Lab color space, should be the same. Therefore we use a self-supervised intensity loss (Eq. 3) to train the model:

$$\begin{aligned} L_{it}(C) = \left\| G-\text {grayscale}\left( C\left( G \right) \right) \right\| _{1} \end{aligned}$$
(3)

We train discriminator \(D_{I}\) to ensure that I is also as photo-realistic as \(\{I_1,\ldots , I_m\}\).
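A minimal version of the intensity loss (Eq. 3). The paper takes the L-channel of CIE Lab as the grayscale of the output; for brevity this sketch approximates it with a Rec. 601 luma, which can be swapped for an exact Lab conversion:

```python
import torch.nn.functional as F

def rgb_to_luma(img):
    # img: (B, 3, H, W) in [0, 1]; Rec. 601 luma as a stand-in for the Lab L-channel.
    r, g, b = img[:, 0:1], img[:, 1:2], img[:, 2:3]
    return 0.299 * r + 0.587 * g + 0.114 * b

def intensity_loss(C, gray):
    # Eq. (3): the grayscale of C(G) should reproduce the input grayscale G.
    color = C(gray)                      # (B, 3, H, W)
    return F.l1_loss(rgb_to_luma(color), gray)
```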

To achieve output diversity, we introduce a conditional module that takes an optional reference image for guidance. We follow AdaIN [14] to inject style information by adjusting the feature map statistics. Specifically, the encoder E takes the input grayscale image G and generates a feature map \(\mathbf{x}=E(G)\); the mean and variance of \(\mathbf{x}\) are then adjusted by the reference's feature map \(\mathbf{x}^{ref}=E(R)\). The new feature map is \(\mathbf{x}^{new}=AdaIN(\mathbf{x},\mathbf{x}^{ref})\) (Eq. 4), which is subsequently sent to the decoder D for rendering the final output image I:

$$\begin{aligned} AdaIN(\mathbf {x},\mathbf {x}^{\text {ref}})&= \sigma (\mathbf {x}^{\text {ref}})(\frac{\mathbf {x}-\mu (\mathbf {x})}{\sigma (\mathbf {x})})+\mu (\mathbf {x}^{\text {ref}}) \end{aligned}$$
(4)
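A compact implementation of Eq. (4) (a sketch, following [14]); the no-reference fallback with \(\sigma =1\), \(\mu =0\), used by the mixed training strategy described below, is included for completeness:

```python
def adain(x, x_ref=None, eps=1e-5):
    """x, x_ref: feature maps of shape (B, C, H, W). Eq. (4)."""
    mu = x.mean(dim=(2, 3), keepdim=True)
    sigma = x.std(dim=(2, 3), keepdim=True) + eps
    normalized = (x - mu) / sigma
    if x_ref is None:                       # reference-free mode: sigma=1, mu=0
        return normalized
    mu_ref = x_ref.mean(dim=(2, 3), keepdim=True)
    sigma_ref = x_ref.std(dim=(2, 3), keepdim=True)
    return sigma_ref * normalized + mu_ref  # match the reference statistics
```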

Our model can work with or without reference images in a single network, enabled by a mixed training strategy. When there is no reference image, only the intensity loss and the adversarial loss are used, while \(\sigma (\mathbf{x}^{ref})\) and \(\mu (\mathbf{x}^{ref})\) are set to 1 and 0 respectively; otherwise, a content loss and a style loss are additionally computed. The content loss (Eq. 5) guarantees that the input and output images are perceptually consistent, whereas the style loss (Eq. 6) ensures that the style of the output is aligned with that of the reference image.

$$\begin{aligned} L_{cont}(C;G,R)&= \left\| E(D(t))-t\right\| _{1} \end{aligned}$$
(5)
$$\begin{aligned} L_{style}(C;G,R)&\!=\!\! \sum _{i=1}^{K}\left\| \mu \left( \phi _{i}(D(t))\right) \!-\!\mu \left( \phi _{i}(R)\right) \right\| _{2} \!+\!\! \sum _{i=1}^{K}\left\| \sigma \left( \phi _{i}(D(t))\right) \!-\!\sigma \left( \phi _{i}(R)\right) \right\| _{2} \end{aligned}$$
(6)
$$\begin{aligned} \text {where }\quad t&= AdaIN(E(G),E(R)) \end{aligned}$$
(7)

\(\phi _{i}(.)\) denotes a layer of a pre-trained VGG-19 model. In our implementation, we use \(relu1\_{1}\), \(relu2\_{1}\), \(relu3\_{1}\), \(relu4\_{1}\) layers with equal weights to compute the style loss. Equation 8 shows the total loss for training the content enrichment model. Network architectures and further details are provided in the Supplementary.

$$\begin{aligned} \min _{C} \max _{D_{I}} \lambda _{4} L_{adv}(C,D_{I};G,I)+ \lambda _{5} L_{it}(C) + \lambda _{6} L_{style}(C;G,R)+\lambda _{7} L_{cont}(C;G,R) \end{aligned}$$
(8)
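Eqs. (5)–(7) in code form, reusing the `adain` helper sketched after Eq. (4) and assuming `vgg_feats(img)` is a hypothetical helper returning the relu1_1–relu4_1 activations of a pre-trained VGG-19, with E and D the content-enrichment encoder and decoder:

```python
import torch
import torch.nn.functional as F

def mean_std(feat, eps=1e-5):
    # Channel-wise spatial statistics of a (B, C, H, W) feature map.
    return feat.mean(dim=(2, 3)), feat.std(dim=(2, 3)) + eps

def content_style_losses(E, D, vgg_feats, G, R):
    t = adain(E(G), E(R))                   # Eq. (7)
    out = D(t)
    # Eq. (5): re-encoding the output should recover the AdaIN target t.
    l_cont = F.l1_loss(E(out), t)
    # Eq. (6): match mean/std of VGG features of the output and the reference.
    l_style = 0.0
    for f_out, f_ref in zip(vgg_feats(out), vgg_feats(R)):
        mu_o, sig_o = mean_std(f_out)
        mu_r, sig_r = mean_std(f_ref)
        l_style = l_style + torch.norm(mu_o - mu_r) + torch.norm(sig_o - sig_r)
    return l_cont, l_style
```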
Fig. 4. Our model can produce high-fidelity and diverse photos from a sketch. Top: Result comparisons. Most baselines cannot handle this task well. While UGATIT can generate realistic photos, our results are more faithful to the input sketch, e.g., the three chair examples. Bottom: Results without (Column 2) or with (Column 3) the reference image. Our single content enrichment model can work under both settings, with or without a reference photo (shown in the top right corner).

4 Experiments and Applications

4.1 Experimental Setup and Evaluation Metrics

Datasets. We train our model on two single-category sketch datasets, ShoeV2 and ChairV2 [36], with 6,648/2,000 and 1,297/400 sketches/photos respectively. Each photo has at least 3 corresponding sketches drawn by different individuals. Note that we do not use pairing information at training time. Compared to QuickDraw [11], Sketchy [28], and TU-Berlin [6], sketches in ShoeV2/ChairV2 have more fine-grained details. They demand correspondingly fine details in synthesized photos and are thus a more challenging testbed for sketch to photo synthesis.

Baselines for Image Translation. 1) Pix2Pix [16] is our supervised learning baseline, which requires paired training data. 2) CycleGAN [38] is an unsupervised bidirectional image translation model. It is the first to apply cycle-consistency with GANs and allows unpaired training data. 3) MUNIT [15] is also an unsupervised model that can generate multiple outputs given an input. It assumes that the representation of an image can be decomposed into a content code and a style code. 4) UGATIT [18] is an attention-based image translation model; the attention helps the model focus on domain-discriminative regions and thereby improves synthesis quality.

Training Details. We train our shape translation network for 500 (400) epochs on shoes (chairs), and train our content enrichment network for 200 epochs. The initial learning rate is 0.0002, and the input image size is \(128\times 128\). We use the Adam optimizer with batch size 1. Following the practice of CycleGAN, we train the first 100 epochs at the same learning rate and then linearly decrease the rate to zero until the maximum epoch. We randomly compose complex and distractive sketches with probabilities of 0.2 and 0.3 respectively. The random patch size is \(50\times 50\). When training the content enrichment network, we feed reference images into the network with probability 0.2.
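This schedule corresponds to the standard CycleGAN-style setup and could be configured as below (a sketch: the Adam betas and the placeholder generator are assumptions, and in practice the optimizer would receive the parameters of T, T', and C):

```python
import torch

generator = torch.nn.Conv2d(1, 1, 3, padding=1)   # placeholder network
total_epochs, decay_start = 500, 100              # shoe shape-translation stage

optimizer = torch.optim.Adam(generator.parameters(), lr=2e-4, betas=(0.5, 0.999))

def lr_lambda(epoch):
    # Constant learning rate for the first 100 epochs, then linear decay to zero.
    if epoch < decay_start:
        return 1.0
    return 1.0 - (epoch - decay_start) / float(total_epochs - decay_start)

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=lr_lambda)
```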

Evaluation Metrics. 1) Fréchet Inception Distance (FID). It evaluates image quality and diversity by the distance between synthesized and real samples, computed from the statistics of activations in the pool3 layer of a pre-trained Inception-v3. A lower FID value indicates higher fidelity. 2) User study (Quality). It evaluates subjective impressions in terms of similarity and realism. As in [30], we ask subjects to compare two generated photos and select the one better fitting their imagination for a given sketch. We sample 50 pairs for each comparison (more details in the Supplementary). 3) Learned perceptual image patch similarity (LPIPS). It measures the distance between two images. As in [15, 38], we use it to evaluate the diversity of synthesized photos.
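For reference, FID can be computed from the pool3 activations as follows (activations assumed precomputed as N×2048 arrays; this is the standard Fréchet distance formula, not the authors' exact evaluation script):

```python
import numpy as np
from scipy import linalg

def fid(act_real, act_fake):
    """act_*: (N, 2048) Inception-v3 pool3 activations."""
    mu1, mu2 = act_real.mean(0), act_fake.mean(0)
    c1 = np.cov(act_real, rowvar=False)
    c2 = np.cov(act_fake, rowvar=False)
    covmean = linalg.sqrtm(c1.dot(c2))
    if np.iscomplexobj(covmean):       # numerical noise can add tiny imaginary parts
        covmean = covmean.real
    diff = mu1 - mu2
    return diff.dot(diff) + np.trace(c1 + c2 - 2.0 * covmean)
```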

4.2 Sketch-Based Photo Synthesis Results

Table 1 shows that: 1) Our model outperforms all the baselines in terms of FID and user studies. Note that all the baselines adopt one-stage architectures. 2) All the models perform poorly on ChairV2, probably due to more shape variations but far less training data for chairs than for shoes (1:5). 3) Ours outperforms MUNIT by a large margin, indicating that our task-level decomposition strategy, i.e., the two-stage architecture, is more effective than feature-level decomposition for this task. 4) UGATIT ranks second on both datasets. It is also an attention-based model, showing the effectiveness of attention in image translation tasks.

Comparisons in Fig. 4 and Variety in Fig. 5 (Left). Our results are more realistic and faithful to the input sketch (e.g., the buckle and logo); synthesis with different reference images produces varied outputs.

Table 1. Benchmarks on ShoeV2/ChairV2. ‘\(*\)’ indicates paired data for training.
Fig. 5. Left: With different references, our model can produce diverse outputs. Middle: Given sketches of similar shoes drawn by different users, our model can capture their commonality as well as subtle distinctions. Each row shows the input sketch, the synthesized grayscale image, and the synthesized RGB photo. Right: Our model even works for sketches at different completion stages, delivering realistic, closely matching shoes. (Color figure online)

Fig. 6. Left: Generalization across domains. Column 1 shows sketches from two unseen datasets, Sketchy and TU-Berlin. Columns 2–4 are results from our model trained on ShoeV2. Right: Our shoe model can be used as a shoe detector and generator. It can generate a shoe photo based on a non-shoe sketch. It can further turn the non-shoe sketch into a more shoe-like sketch. (a) Input sketch; (b) synthesized grayscale photo; (c) re-synthesized sketch; (d) green (a) overlaid over gray (c). (Color figure online)

Robustness and Sensitivity in Fig. 5 (Middle & Right). We test our ShoeV2 model under two settings: 1) sketches corresponding to the same photo, 2) sketches at different completion stages. Given sketches of similar shoes drawn by different users, our model can capture their commonality as well as subtle distinctions and translate them into photos. Our model also works for sketches at different completion stages (obtained by removing strokes according to their ordering), synthesizing realistic, closely matching shoes for partial sketches.

Generalization Across Domains in Fig. 6 (Left). When sketches are randomly sampled from different datasets such as TU-Berlin [6] and Sketchy [28], which have greater shape deformation than ShoeV2, our model trained on ShoeV2 can still produce good results (see more examples in the Supplementary).

Table 2. Comparison of different architecture designs.

Sketches from Novel Categories in Fig. 6 (Right). While we focus on a single category training, we nonetheless feed our model sketches from other categories. When the model is trained on shoes, the shape translation network has learned to synthesize a grayscale shoe photo based on a shoe sketch. For a non-shoe sketch, our model translates it into a shoe-like photo. Some fine details in the sketch become a common component of a shoe. For example, a car becomes a trainer while the front window becomes part of a shoelace. The superimposition of the input sketch and the re-shoe-synthesized sketch reveals which lines are chosen by our model and how it modifies the lines for re-synthesis.

4.3 Ablation Study

Two-Stage Architecture. The two-stage architecture is the key to the success of our model. This strategy can easily be adopted by other models such as CycleGAN. Table 2 compares the performance of the original CycleGAN and its two-stage version (i.e., CycleGAN is used only for shape translation while the content enrichment network is the same as ours). The two-stage version outperforms the original CycleGAN by 27.55 (on ShoeV2) and 68.33 (on ChairV2), indicating the significant benefits brought by this architectural design.

Fig. 7. Left: Synthesized results when the edge map is used as the intermediate goal instead of the grayscale photo. (a) Input sketch; (b) synthesized edge map; (c) synthesized RGB photo using the edge map; (d) synthesized RGB photo using grayscale (ours). Right: Our model can successfully deal with noise sketches, which are not handled well by another attention-based model, UGATIT. For an input sketch (a), our model produces an attention mask (b); (c) and (d) are grayscale images produced by the vanilla model and our full model. (e) and (f) compare our result with that of UGATIT. (Color figure online)

Fig. 8. Comparisons of paired and unpaired training for shape translation. Four examples are shown. In each example, the first image is the input sketch; the second and third are grayscale images synthesized by Pix2Pix and our model respectively. Although the input sketches differ visually across examples, Pix2Pix produces similar-looking grayscale images, whereas our results are more faithful to the sketch.

Edge Map vs. Grayscale as the Intermediate Goal. We choose grayscale as the intermediate goal of translation. As shown in Fig. 1, edge maps could be an alternative since they do not exhibit shape deformation either: we could first translate a sketch into an edge map, and then fill the edge map with colorful details.

Table 2 and Fig. 7 show that using the edge map is worse than using the grayscale. Our explanations are: 1) Grayscale images contain more visual details and thus provide more learning signal for training the shape translation network; 2) Content enrichment is easier for grayscale images, as they are closer to color photos than edge maps are. Grayscale images are also easier to obtain in practice.

Deal with Abstraction and Style Variations. We have discussed the problem encountered during shape translation in Sect. 3.1, and introduced 1) a self-supervised objective along with noise sketch composition strategies and 2) an attention module to handle it. Table 3 compares the FID achieved at the first stage by different variants. Our full model tackles the problem better than the vanilla model, and each component contributes to the improved performance. Figure 7 shows two examples and compares with the results of UGATIT.

Paired vs. Unpaired Training. We train a Pix2Pix model for shape translation to see if pairing information helps. As shown in Table 3 (Pix2Pix) and Fig. 8, the performance of Pix2Pix is much worse than ours (FID: 75.84 vs. 46.46 on ShoeV2 and 164.01 vs. 90.87 on ChairV2). This is most likely caused by the shape misalignment between sketches and grayscale images.

Table 3. Contribution of each proposed component. The FID scores are obtained based on the results of shape translation stage.
Table 4. Exclude the effect of paired data. Although pairing information is not used during training, it does exist in ShoeV2. We compose a new dataset in which no pairing exists and train the model again. Results are obtained on the same test set.

Exclude the Effect of Paired Information. Although pairing information is not used during training, it does exist in ShoeV2. To eliminate any potential facilitation from pairing, we train another model on a composed dataset, created by merging all the sketches of ShoeV2 with 9,995 photos from UT Zappos50K [33]. These photos are collected from a different source than ShoeV2. We train this model in the same setting. Table 4 shows that this model achieves performance similar to the one trained on ShoeV2, indicating the effectiveness of our approach for learning the task from entirely unpaired data.

Fig. 9. Our results on photo-based sketch synthesis. Top: In each sketch-photo pair, the left is the input photo and the right is the synthesized sketch; results obtained on ShoeV2 and ChairV2. Bottom: Results obtained on ShapeNet [3]. Column 1 is the input photo; Columns 2–5 are lines generated by Canny, HED, Photo-Sketching [19] (Contour for short), and our model. Our model can generate line strokes with a hand-drawn effect, while HED and Canny detectors produce edge maps faithful to the original photos. Ours emphasize perceptually significant contours rather than intensity-contrast-significant edges as in edge maps.

4.4 Photo-to-Sketch Synthesis Results

Synthesize a Sketch Given a Photo. As the shape translation network is bidirectional (i.e., T and \(T^\prime \)), our model can also translate a photo into a sketch. This task is not trivial, as users can easily detect a fake sketch based on its stroke continuity and consistency. Figure 9 (Top) shows that our generated sketches mimic manual line-drawings and emphasize contours that are perceptually significant.
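Because the stage-1 model is bidirectional, photo-to-sketch synthesis is a single forward pass through the reverse mapping. A minimal inference sketch, where the grayscale conversion and the [-1, 1] input range are assumptions about preprocessing:

```python
import torch

@torch.no_grad()
def photo_to_sketch(T_prime, photo_rgb):
    """photo_rgb: (B, 3, H, W) in [0, 1]; T_prime is the learned G -> S mapping."""
    gray = (0.299 * photo_rgb[:, 0:1] + 0.587 * photo_rgb[:, 1:2]
            + 0.114 * photo_rgb[:, 2:3])
    T_prime.eval()
    return T_prime(gray * 2.0 - 1.0)   # normalize, then translate to a sketch
```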

Sketch-Like Edge Extraction. Sketch-to-photo and photo-to-sketch synthesis are opposite processes. We suspect that our model can create sketches from photos in broader categories, as this direction may require less class-specific prior knowledge.

We test our shoe model directly on photos in ShapeNet [3]. Figure 9 (Bottom) lists our results along with those from HED [32] and the Canny edge detector [2]. We also compare with Photo-Sketching [19], a method specifically designed for generating boundary-like drawings from photos. 1) Unlike HED and Canny, which produce edge maps faithful to the photo, ours presents a hand-drawn style. 2) Our model can double as an edge\(+\) extractor on unseen classes. This is an exciting corollary product: a promising automatic sketch generator that captures human visual perception beyond the edge map of a photo (more results in the Supplementary).

Fig. 10. Sample retrieval results. Our synthesis model can map photos to the sketch domain and vice versa, so the cross-domain retrieval task can be converted to intra-domain retrieval. Left: All candidate photos are mapped to sketches, so both the query and the candidates are in the sketch domain. Right: The query sketch is translated to a photo, so the matching is done in the photo domain. The top right corner shows the original photo or sketch.

4.5 Application: Unsupervised Sketch-Based Image Retrieval

Sketch-based image retrieval is an important application of sketches. One of its main challenges is the large domain gap. Existing methods either map sketches and photos into a common space or use edge maps as the intermediate representation. Our model, however, enables a direct mapping between the two domains.

We thus conduct experiments in the two possible mapping directions: 1) translate gallery photos to sketches, and then find the nearest sketches to the query sketch (Fig. 10 (Left)); 2) translate a sketch to a photo and then find its nearest neighbors in the photo gallery (Fig. 10 (Right)). Two ResNet18 [12] models, one pretrained on ImageNet and the other on the TU-Berlin dataset, are used as feature extractors for photos and sketches respectively (see the Supplementary for further details). Figure 10 shows our retrieval results. Even without any supervision, the results are already acceptable. In the second experiment, we achieve top-5 (top-20) accuracies of 37.2% (65.2%). These results are higher than those obtained by translating the sketch to an edge map instead, which are 34.5% (57.7%).
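A minimal version of the second direction (translate the query sketch to a photo, then match with ImageNet-pretrained ResNet18 features); the feature-extraction and cosine-matching choices here are assumptions about the retrieval pipeline, not the authors' exact setup:

```python
import torch
import torchvision

def build_photo_encoder():
    # ImageNet-pretrained ResNet18 with the classification head removed.
    resnet = torchvision.models.resnet18(weights="IMAGENET1K_V1")
    return torch.nn.Sequential(*list(resnet.children())[:-1]).eval()

@torch.no_grad()
def retrieve(sketch, gallery_photos, T, C, encoder, topk=5):
    """sketch: (1, 1, H, W); gallery_photos: (N, 3, H, W).
    T, C: shape-translation and content-enrichment networks."""
    query_photo = C(T(sketch))                           # sketch -> color photo
    q = encoder(query_photo).flatten(1)                  # (1, 512)
    g = encoder(gallery_photos).flatten(1)               # (N, 512)
    sims = torch.nn.functional.cosine_similarity(q, g)   # cosine matching
    return sims.topk(topk).indices                       # nearest gallery photos
```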

Summary. We propose the first unsupervised two-stage sketch-to-photo synthesis model that can produce photos of high fidelity, realism, and diversity. It enables sketch-based image retrieval and automatic sketch generation that captures human visual perception beyond the edge map of a photo.