1 Introduction

Sketches, i.e., rapidly executed freehand drawings, are an intuitive and powerful form of visual expression (Fig. 1). There is much research on sketch recognition [7, 35], sketch parsing [26, 27], and sketch-based image or video retrieval [21, 28, 36]. We study how to imagine a realistic photo from a sketch that is spatially imprecise and missing colorful details, by learning from unpaired sketches and photos.

Fig. 1. Comparisons of image types and challenges of sketch to photo synthesis. Left: A single object shape could have multiple distinctive colorings yet a common or similar grayscale. Edges extracted by Canny and HED detectors lose colorful details but align well with boundaries in the color photo, whereas sketches are more abstract lines drawn with deformations and style variations. Row 2 shows their lines overlaid on the grayscale photo. Right: Human vision can imagine a realistic photo given a free-hand sketch. Our goal is to equip computer vision with the same imagination capability.

Sketch to photo synthesis is challenging for three reasons.

1) Sketches of objects often do not match their shapes in photos, since sketches, commonly drawn by amateurs, have large spatial and geometric distortions. Translating a sketch to a photo thus requires shape rectification. However, it is not trivial to rectify shape distortion in a sketch, as line strokes are only suggestive of the actual shapes and locations, and the extent of shape fidelity varies widely between individuals. In Fig. 1, the three sketches for the same shoe are very different in both overall proportions and local stroke styles.

2) Sketches are colorless and lack details. Drawn in black strokes on white paper, sketches outline mostly object boundaries and characteristic interior markings. To synthesize a photo, shading and colorful textures must be filled in properly. However, it is not trivial to fill in details either. Since a single sketch could depict many photos, any synthesizer must be capable of producing not only realistic but also diverse photos for a single sketch.

3) Sketches may not have corresponding photos. Free-hand sketches can be created from observation, memory, or pure imagination; they are not as widely available as photos, and those with corresponding photos are even rarer. A few sketch datasets exist in computer vision. TU-Berlin [6] and QuickDraw [11] contain sketches only, with 20,000 and 50 million instances over 250 and 345 categories respectively. Contour Drawing [19] and Scenesketchy [39] have sketch-photo image pairs at the scene level; their sketches are either contour tracings or cartoon-style line drawings, neither representative of real-world free-hand sketches. Sketchy [28] has only 500 sketches paired with 100 photos in each of 125 categories. ShoeV2 and ChairV2 [36] contain 6,648/2,000 and 1,297/400 sketches/photos in the single semantic categories of shoes and chairs respectively. To enable data-driven learning of sketch to photo synthesis, we must handle limited sketch data and unpaired sketches and photos.

Fig. 2. Comparison of sketch to photo synthesis settings and results. Left: Three training scenarios on whether line drawings and photos are provided as paired training instances and whether line drawings are spatially aligned with the photos. Edges extracted from photos are aligned, whereas sketches are not. The bottom panel compares synthesis results from representative approaches in each setting, indicated by the same line/bracket color. Ours are superior to unsupervised edge-map-to-photo methods (CycleGAN [38], MUNIT [15], UGATIT [18]) and even supervised methods (Pix2Pix [16]) trained on paired data. Right: Our unsupervised sketch-to-photo synthesis model has two separate stages handling spatial deformation and color enrichment respectively: Shape translation learns to synthesize a grayscale photo given a sketch, from unpaired sketch and photo sets, whereas color enrichment learns to fill the grayscale with colorful details given an optional reference photo.

Existing works focus on either shape or color translation alone (Fig. 2). 1) Most image synthesis work that deals with shape transfiguration stays within the same visual domain, e.g., changing a picture of a dog into one of a cat [15, 22], where the two color images have comparable visual detail. 2) Sketches are a special case of line drawings, and the most studied case of line drawings in computer vision is the edge map extracted automatically from a photo. Such edge-map-to-photo synthesis does not suffer from the spatial deformation problem that exists between sketches and photos, and realistic photos can be synthesized with [16, 31] or without [38] paired training data between drawings and photos. We will show that existing methods fail at sketch to photo synthesis when both shape and color translations are needed simultaneously.

We consider learning sketch to photo synthesis from sketches and photos of the same object category such as shoes. There is no pairing information between individual sketches and photos; these two sets can be independently collected.

Our insight for unsupervised sketch to photo synthesis is to decompose the task into two separate translations (Fig. 2). Our two-stage model first performs shape translation in grayscale and then content fill-in in color. Stage 1) Shape translation learns to synthesize a grayscale photo given a sketch, from unpaired sketch and photo sets. Geometric distortions are eliminated at this step. To handle abstraction and drawing style variations, we apply a self-supervised learning objective to composed noise sketches, and also introduce an attention module that lets the model ignore distractions. Stage 2) Content enrichment learns to fill the grayscale with details, including colors, shading, and textures, given an optional reference image. It is designed to work with or without reference images; this capability is enabled by a mixed training strategy. Our model can thus produce diverse outputs on demand.

Our model links sketches to photos and can be used directly in sketch-based photo retrieval. Another exciting corollary result from our model is that we can also synthesize a sketch given a photo, even from unseen semantic categories. Strokes in a sketch capture information beyond edge maps defined primarily on intensity contrast and object exterior boundaries. Automatic photo to sketch generation could lead to more advanced computer vision capabilities and serve as a powerful human-user interaction device.

Our work makes the following contributions. 1) We propose the first two-stage unsupervised model that can generate diverse, sketch-faithful, and photo-realistic images from a single free-hand sketch. 2) We introduce a self-supervised learning objective and an attention module to handle abstraction and style variations in sketches. 3) Our work not only enables sketch-based image retrieval but also delivers an automatic sketcher that captures human visual perception beyond the edge map of a photo. See http://sketch.icsi.berkeley.edu.

2 Related Works

Sketch-Based Image Synthesis. While much progress has been made on sketch recognition [6, 35, 37] and sketch-based image retrieval [9, 13, 20, 21, 28, 36], sketch-based image synthesis remains under-explored.

Prior to deep learning (DL), Sketch2Photo [4] and PhotoSketcher [8] compose a new photo from photos retrieved for a sketch. Sketch2Photo [4] first retrieves photos based on the class label, then uses the given sketch to filter them and compose a target photo. PhotoSketcher [8] has a similar pipeline but retrieves photos based on a rather restrictive sketch and hand-crafted features.

The first DL-based free-hand sketch-to-photo synthesis is SketchyGAN [5], which trains an encoder-decoder model conditioned on the class label for sketch and photo pairs. Contextual GAN [23] treats sketch to photo synthesis as an image completion problem, using the sketch as a weak contextual constraint. Interactive Sketch [10] focuses on multi-class photo synthesis based on incomplete edges or sketches. All of these works rely on paired sketch and photo data and do not address the shape deformation problem.

Sketches are often used in photo editing [1, 25, 34], e.g., line strokes are drawn on a photo to change the shape of a roof. Unlike our sketch to photo synthesis, these works mainly address a constrained image inpainting problem.

Synthesis from the opposite direction, photo to sketch, has also been studied [19, 29]: The former proposes a hybrid model to synthesize a sketch stroke by stroke given a photo, whereas the latter aims to generate boundary-like drawings that capture the outline of the visual scene. Both models require paired data for training. While photo to sketch is not our focus, our model trained only on shoes can generate realistic sketches from photos in other semantic categories.

Generative Adversarial Networks (GAN). A GAN has a generator (G) and a discriminator (D): G tries to generate fake instances that fool D, while D tries to distinguish fakes from real samples. GANs are widely used for realistic image generation [17, 24] and translation across image domains [15, 16].

Pix2Pix [16] is a conditional GAN that maps source images to target images; it requires paired (source,target) data during training. CycleGAN [38] uses a pair of GANs to map an image from the source domain to the target domain and then back to the source domain. Imposing a consistency loss over such a cycle of mappings, it allows both models to be trained together on unpaired source and target images in two different domains. UNIT [22] and MUNIT [15] are variations of CycleGAN, both achieving impressive performance.

None of these methods work well when the source and target images are poorly aligned spatially (Fig. 1) and lie in different appearance domains.

Fig. 3. Our two-stage model architecture (top) and three major technical components (bottom) that tackle abstract and style-varying strokes: noise sketch composition for training data augmentation, a self-supervised de-noising objective, and an attention module to suppress distracting dense strokes.

3 Unsupervised Two-Stage Sketch-to-Photo Synthesis

In our unsupervised learning setting, we are given two sets of data in the same semantic category such as shoes, and no instance pairing is known or available. Formally, all we have are n sketches \(\{S_1,\ldots , S_n \}\) and m color photos \(\{I_1,\ldots , I_m \}\) along with their grayscale versions \(\{G_1,\ldots , G_m \}\).

Compared to photos, sketches are spatially imprecise and colorless. To synthesize a photo from a sketch, we deal with these two aspects at separate stages: We first translate a sketch into a grayscale photo and then translate the grayscale into a color photo filled with missing details on texture and shading (Fig. 3).

3.1 Shape Translation: Sketch \(S\rightarrow \) Grayscale G

Overview. We first learn to translate sketch S into grayscale photo G. The goal is to rectify shape deformation in sketches. We consider unpaired sketch and photo images, not only because paired data are scarce and hard to collect, but also because heavy reliance on paired data could restrict the model from recognizing the inherent misalignment between sketches and photos.

A pair of mappings, \(T: S\xrightarrow {}G\) and \(T': G\xrightarrow {}S\), each implemented with an encoder-decoder architecture, are learned with cycle-consistency objectives: \(S\approx T'(T(S))\) and \(G\approx T(T'(G))\). Similar to [38], we train two domain discriminators \(D_{G}\) and \(D_{S}\): \(D_{G}\) tries to tease apart G and T(S), while \(D_{S}\) teases apart S and \(T'(G)\) (Fig. 3). The predicted grayscale T(S) goes to content enrichment next.

The input sketch may exhibit various levels of abstraction and different drawing styles. In particular, sketches containing dense strokes or noisy details (Fig. 3) cannot be handled well by a basic CycleGAN model.

To deal with these variations, we introduce two strategies for the model to extract style-invariant information only: 1) We compose additional noise sketches to enrich the dataset and introduce a self-supervised objective; 2) We introduce an attention module to help detect distracting regions.

Noise Sketch Composition. In a rapidly drawn sketch, strokes could be deliberately complex, or simply careless and distracting (Fig. 3). We augment the limited sketch data with additional noise. Let \(S^{\text {noise}}=\varphi (S)\), where \(\varphi (.)\) represents composition. We detect dense strokes and construct a pool of noise masks. We randomly sample from these masks and artificially generate complex sketches by inserting the sampled dense stroke patterns into original sketches. We generate distractive sketches by adding a random patch from a different sketch onto an existing sketch. The noise strokes and random patches simulate irrelevant details in a sketch. We compose such noise sketches on the fly and feed them into the network with a fixed occurrence ratio.
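A minimal sketch of this composition step, assuming sketches are binary NumPy arrays (strokes = 1 on a 0 background) and that a pool of dense-stroke masks has already been extracted; the probabilities and patch size follow Sect. 4.1, while the function and argument names are hypothetical:

```python
import random
import numpy as np

def compose_noise_sketch(sketch, noise_masks, other_sketch,
                         p_complex=0.2, p_distract=0.3, patch=50):
    """S_noise = phi(S): overlay dense-stroke patterns and/or a random
    patch from another sketch. `sketch`, `other_sketch`: HxW binary arrays."""
    out = sketch.copy()
    h, w = out.shape
    if random.random() < p_complex and noise_masks:
        mask = random.choice(noise_masks)          # dense-stroke pattern, HxW
        out = np.maximum(out, mask)                # overlay noisy strokes
    if random.random() < p_distract:
        y, x = random.randint(0, h - patch), random.randint(0, w - patch)
        ys, xs = random.randint(0, h - patch), random.randint(0, w - patch)
        out[y:y+patch, x:x+patch] = np.maximum(    # paste a distracting patch
            out[y:y+patch, x:x+patch],
            other_sketch[ys:ys+patch, xs:xs+patch])
    return out
```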

Self-supervised Objective. We introduce a self-supervised objective to work with the synthesized noise sketches. For a composed noise sketch, the reconstruction goal of our model is to reproduce the original clean sketch:

$$\begin{aligned} L_{ss}(T, T^\prime )= \left\| S-T^\prime \left( T(S^{\text {noise}})\right) \right\| _{1} \end{aligned}$$
(1)

This objective is different from the cycle-consistency loss used on the untouched original sketches. It makes the model ignore irrelevant strokes and put more effort into style-invariant strokes in the sketch.
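In PyTorch-style code, Eq. (1) is a plain L1 reconstruction of the clean sketch from its composed noisy version (T and T' are the two translators; tensors are assumed batched):

```python
import torch.nn.functional as F

def self_supervised_loss(T, T_prime, clean_sketch, noise_sketch):
    # Eq. (1): reconstruct the clean sketch S from the composed noisy S_noise.
    recon = T_prime(T(noise_sketch))
    return F.l1_loss(recon, clean_sketch)
```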

Ignore Distractions with Active Attention. To identify distracting strokes, we also introduce an attention module. Since most areas of a sketch are blank, the activation of dense stroke regions is stronger than that of other regions. We can thus locate distracting areas and suppress the activation there accordingly. That is, the attention module generates an attention map A used for re-weighting the feature representation of sketch S (Eq. 2):

$$\begin{aligned} f_{\text {final}}(S)=(1-A)\odot f(S) \end{aligned}$$
(2)

where f(.) refers to the feature map and \(\odot \) denotes element-wise multiplication. Our attention is used for area suppression instead of the usual area highlighting.
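One possible realization of this suppression attention (Eq. 2), assuming the attention module is a small convolutional head with a sigmoid output; the exact architecture is given in the Supplementary, so this is only illustrative:

```python
import torch.nn as nn

class SuppressionAttention(nn.Module):
    """Predict an attention map A in [0, 1] over stroke features and
    re-weight them as (1 - A) * f(S), suppressing dense/noisy regions."""
    def __init__(self, channels):
        super().__init__()
        self.head = nn.Sequential(
            nn.Conv2d(channels, channels // 2, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // 2, 1, 1),
            nn.Sigmoid(),
        )

    def forward(self, feat):           # feat = f(S), shape (B, C, H, W)
        attn = self.head(feat)         # A, shape (B, 1, H, W)
        return (1.0 - attn) * feat     # Eq. (2): suppress distracting areas
```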

Our total objective for training a shape translation model is:

$$\begin{aligned} \min _{T, T^\prime } \max _{D_{G}, D_{S}}&\lambda _{1} (L_{adv}(T,D_{G};S,G) + L_{adv}(T^\prime ,D_{S};G,S)) \\ +&\lambda _{2} L_{cycle}(T,T^\prime ;S,G)+\lambda _{3} L_{identity}(T,T^\prime ;S,G)+ L_{ss}(T,T^\prime ;S^{noise}). \end{aligned}$$

We follow [38] to add an identity loss \(L_{identity}\), which slightly improves the performance. See the details of each loss in the Supplementary.
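Putting the pieces together, the generator-side objective for shape translation could be assembled as below. The adversarial, cycle, and identity terms follow CycleGAN [38]; the least-squares GAN form and the weight values are assumptions (the exact losses and weights are in the Supplementary), and the discriminators are updated separately:

```python
import torch
import torch.nn.functional as F

def shape_translation_loss(T, T_prime, D_G, D_S, S, G, S_noise, S_clean,
                           lam1=1.0, lam2=10.0, lam3=5.0):
    """Generator-side objective of Sect. 3.1 (weights are placeholders)."""
    fake_G, fake_S = T(S), T_prime(G)

    # Adversarial terms (a least-squares GAN loss is assumed here).
    pred_G, pred_S = D_G(fake_G), D_S(fake_S)
    l_adv = F.mse_loss(pred_G, torch.ones_like(pred_G)) + \
            F.mse_loss(pred_S, torch.ones_like(pred_S))

    # Cycle consistency: S ~ T'(T(S)) and G ~ T(T'(G)).
    l_cyc = F.l1_loss(T_prime(fake_G), S) + F.l1_loss(T(fake_S), G)

    # Identity: feeding a target-domain image should leave it unchanged.
    l_idt = F.l1_loss(T(G), G) + F.l1_loss(T_prime(S), S)

    # Self-supervised de-noising objective, Eq. (1).
    l_ss = F.l1_loss(T_prime(T(S_noise)), S_clean)

    return lam1 * l_adv + lam2 * l_cyc + lam3 * l_idt + l_ss
```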

3.2 Content Enrichment: Grayscale \(G\rightarrow \) Color I

Now that we have a predicted grayscale photo G, we learn a mapping C that turns it into color photo I. The goal at this stage is to enrich the generated grayscale photo G with missing appearance details.

Since a colorless sketch could have many colorful realizations, many fill-ins are possible. We thus model the task as a style transfer task and use an optional reference color image to guide the selection of a particular style.

We implement C as an encoder (E) and decoder (D) network (Fig. 3). Given a grayscale photo G as the input, the model outputs a color photo I. The input G and the grayscale of the output I, specifically the L-channel of the output in CIE Lab color space, should be the same. Therefore we use a self-supervised intensity loss (Eq. 3) to train the model:

$$\begin{aligned} L_{it}(C) = \left\| G-\text {grayscale}\left( C\left( G \right) \right) \right\| _{1} \end{aligned}$$
(3)

We train discriminator \(D_{I}\) to ensure that I is also as photo-realistic as \(\{I_1,\ldots , I_m\}\).
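A minimal version of the intensity loss (Eq. 3). The paper takes the L-channel of CIE Lab as the grayscale of the output; for brevity this sketch approximates it with a Rec. 601 luma, which can be swapped for an exact Lab conversion:

```python
import torch.nn.functional as F

def rgb_to_luma(img):
    # img: (B, 3, H, W) in [0, 1]; Rec. 601 luma as a stand-in for the Lab L-channel.
    r, g, b = img[:, 0:1], img[:, 1:2], img[:, 2:3]
    return 0.299 * r + 0.587 * g + 0.114 * b

def intensity_loss(C, gray):
    # Eq. (3): the grayscale of C(G) should reproduce the input grayscale G.
    color = C(gray)                      # (B, 3, H, W)
    return F.l1_loss(rgb_to_luma(color), gray)
```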

To achieve output diversity, we introduce a conditional module that takes an optional reference image for guidance. We follow AdaIN [14] to inject style information by adjusting the feature map statistics. Specifically, the encoder E takes the input grayscale image G and generates a feature map \(\mathbf{x}=E(G)\); the mean and variance of \(\mathbf{x}\) are then adjusted by the reference's feature map \(\mathbf{x}^{ref}=E(R)\). The new feature map is \(\mathbf{x}^{new}=AdaIN(\mathbf{x},\mathbf{x}^{ref})\) (Eq. 4), which is subsequently sent to the decoder D for rendering the final output image I:

$$\begin{aligned} AdaIN(\mathbf {x},\mathbf {x}^{\text {ref}})&= \sigma (\mathbf {x}^{\text {ref}})(\frac{\mathbf {x}-\mu (\mathbf {x})}{\sigma (\mathbf {x})})+\mu (\mathbf {x}^{\text {ref}}) \end{aligned}$$
(4)
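A compact implementation of Eq. (4) (a sketch, following [14]); the no-reference fallback with \(\sigma =1\), \(\mu =0\), used by the mixed training strategy described below, is included for completeness:

```python
def adain(x, x_ref=None, eps=1e-5):
    """x, x_ref: feature maps of shape (B, C, H, W). Eq. (4)."""
    mu = x.mean(dim=(2, 3), keepdim=True)
    sigma = x.std(dim=(2, 3), keepdim=True) + eps
    normalized = (x - mu) / sigma
    if x_ref is None:                       # reference-free mode: sigma=1, mu=0
        return normalized
    mu_ref = x_ref.mean(dim=(2, 3), keepdim=True)
    sigma_ref = x_ref.std(dim=(2, 3), keepdim=True)
    return sigma_ref * normalized + mu_ref  # match the reference statistics
```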

Our model can work with or without reference images in a single network, enabled by a mixed training strategy. When there is no reference image, only the intensity loss and the adversarial loss are used, while \(\sigma (\mathbf{x}^{ref})\) and \(\mu (\mathbf{x}^{ref})\) are set to 1 and 0 respectively; otherwise, a content loss and a style loss are additionally computed. The content loss (Eq. 5) guarantees that the input and output images are perceptually consistent, whereas the style loss (Eq. 6) ensures that the style of the output is aligned with that of the reference image.

$$\begin{aligned} L_{cont}(C;G,R)&= \left\| E(D(t))-t\right\| _{1} \end{aligned}$$
(5)
$$\begin{aligned} L_{style}(C;G,R)&\!=\!\! \sum _{i=1}^{K}\left\| \mu \left( \phi _{i}(D(t))\right) \!-\!\mu \left( \phi _{i}(R)\right) \right\| _{2} \!+\!\! \sum _{i=1}^{K}\left\| \sigma \left( \phi _{i}(D(t))\right) \!-\!\sigma \left( \phi _{i}(R)\right) \right\| _{2} \end{aligned}$$
(6)
$$\begin{aligned} \text {where }\quad t&= AdaIN(E(G),E(R)) \end{aligned}$$
(7)

\(\phi _{i}(.)\) denotes a layer of a pre-trained VGG-19 model. In our implementation, we use \(relu1\_{1}\), \(relu2\_{1}\), \(relu3\_{1}\), \(relu4\_{1}\) layers with equal weights to compute the style loss. Equation 8 shows the total loss for training the content enrichment model. Network architectures and further details are provided in the Supplementary.

$$\begin{aligned} \min _{C} \max _{D_{I}} \lambda _{4} L_{adv}(C,D_{I};G,I)+ \lambda _{5} L_{it}(C) + \lambda _{6} L_{style}(C;G,R)+\lambda _{7} L_{cont}(C;G,R) \end{aligned}$$
(8)
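Eqs. (5)–(7) in code form, reusing the `adain` helper sketched after Eq. (4) and assuming `vgg_feats(img)` is a hypothetical helper returning the relu1_1–relu4_1 activations of a pre-trained VGG-19, with E and D the content-enrichment encoder and decoder:

```python
import torch
import torch.nn.functional as F

def mean_std(feat, eps=1e-5):
    # Channel-wise spatial statistics of a (B, C, H, W) feature map.
    return feat.mean(dim=(2, 3)), feat.std(dim=(2, 3)) + eps

def content_style_losses(E, D, vgg_feats, G, R):
    t = adain(E(G), E(R))                   # Eq. (7)
    out = D(t)
    # Eq. (5): re-encoding the output should recover the AdaIN target t.
    l_cont = F.l1_loss(E(out), t)
    # Eq. (6): match mean/std of VGG features of the output and the reference.
    l_style = 0.0
    for f_out, f_ref in zip(vgg_feats(out), vgg_feats(R)):
        mu_o, sig_o = mean_std(f_out)
        mu_r, sig_r = mean_std(f_ref)
        l_style = l_style + torch.norm(mu_o - mu_r) + torch.norm(sig_o - sig_r)
    return l_cont, l_style
```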
Fig. 4. Our model can produce high-fidelity and diverse photos from a sketch. Top: Result comparisons. Most baselines cannot handle this task well. While UGATIT can generate realistic photos, our results are more faithful to the input sketch, e.g., the three chair examples. Bottom: Results without (Column 2) or with (Column 3) the reference image. Our single content enrichment model can work under both settings, with or without a reference photo (shown in the top right corner).

4 Experiments and Applications

4.1 Experimental Setup and Evaluation Metrics

Datasets. We train our model on two single-category sketch datasets, ShoeV2 and ChairV2 [36], with 6,648/2,000 and 1,297/400 sketches/photos respectively. Each photo has at least 3 corresponding sketches drawn by different individuals. Note that we do not use pairing information at training time. Compared to QuickDraw [11], Sketchy [28], and TU-Berlin [6], sketches in ShoeV2/ChairV2 have more fine-grained details. They demand correspondingly fine details in synthesized photos and are thus a more challenging testbed for sketch to photo synthesis.

Baselines for Image Translation. 1) Pix2Pix [16] is our supervised learning baseline, which requires paired training data. 2) CycleGAN [38] is an unsupervised bidirectional image translation model. It is the first to apply cycle-consistency with GANs and allows unpaired training data. 3) MUNIT [15] is also an unsupervised model that can generate multiple outputs given an input. It assumes that the representation of an image can be decomposed into a content code and a style code. 4) UGATIT [18] is an attention-based image translation model; the attention helps the model focus on domain-discriminative regions and thereby improves synthesis quality.

Training Details. We train our shape translation network for 500 (400) epochs on shoes (chairs), and train our content enrichment network for 200 epochs. The initial learning rate is 0.0002, and the input image size is \(128\times 128\). We use the Adam optimizer with batch size 1. Following the practice of CycleGAN, we train the first 100 epochs at the same learning rate and then linearly decrease the rate to zero until the maximum epoch. We randomly compose complex and distractive sketches with probabilities of 0.2 and 0.3 respectively. The random patch size is \(50\times 50\). When training the content enrichment network, we feed reference images into the network with probability 0.2.
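This schedule corresponds to the standard CycleGAN-style setup and could be configured as below (a sketch: the Adam betas and the placeholder generator are assumptions, and in practice the optimizer would receive the parameters of T, T', and C):

```python
import torch

generator = torch.nn.Conv2d(1, 1, 3, padding=1)   # placeholder network
total_epochs, decay_start = 500, 100              # shoe shape-translation stage

optimizer = torch.optim.Adam(generator.parameters(), lr=2e-4, betas=(0.5, 0.999))

def lr_lambda(epoch):
    # Constant learning rate for the first 100 epochs, then linear decay to zero.
    if epoch < decay_start:
        return 1.0
    return 1.0 - (epoch - decay_start) / float(total_epochs - decay_start)

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=lr_lambda)
```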

Evaluation Metrics. 1) Fréchet Inception Distance (FID). It evaluates image quality and diversity by the distance between synthesized and real samples, computed from the statistics of activations in the pool3 layer of a pre-trained Inception-v3. A lower FID value indicates higher fidelity. 2) User study (Quality). It evaluates subjective impressions in terms of similarity and realism. As in [30], we ask subjects to compare two generated photos and select the one better fitting their imagination for a given sketch. We sample 50 pairs for each comparison (more details in the Supplementary). 3) Learned perceptual image patch similarity (LPIPS). It measures the distance between two images. As in [15, 38], we use it to evaluate the diversity of synthesized photos.
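For reference, FID can be computed from the pool3 activations as follows (activations assumed precomputed as N×2048 arrays; this is the standard Fréchet distance formula, not the authors' exact evaluation script):

```python
import numpy as np
from scipy import linalg

def fid(act_real, act_fake):
    """act_*: (N, 2048) Inception-v3 pool3 activations."""
    mu1, mu2 = act_real.mean(0), act_fake.mean(0)
    c1 = np.cov(act_real, rowvar=False)
    c2 = np.cov(act_fake, rowvar=False)
    covmean = linalg.sqrtm(c1.dot(c2))
    if np.iscomplexobj(covmean):       # numerical noise can add tiny imaginary parts
        covmean = covmean.real
    diff = mu1 - mu2
    return diff.dot(diff) + np.trace(c1 + c2 - 2.0 * covmean)
```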

4.2 Sketch-Based Photo Synthesis Results

Table 1 shows that: 1) Our model outperforms all the baselines in terms of FID and user studies. Note that all the baselines adopt one-stage architectures. 2) All the models perform poorly on ChairV2, probably due to more shape variations but far less training data for chairs than for shoes (1:5). 3) Ours outperforms MUNIT by a large margin, indicating that our task-level decomposition strategy, i.e., the two-stage architecture, is more effective than feature-level decomposition for this task. 4) UGATIT ranks second on both datasets. It is also an attention-based model, showing the effectiveness of attention in image translation tasks.

Comparisons in Fig. 4 and Variety in Fig. 5 (Left). Our results are more realistic and faithful to the input sketch (e.g., the buckle and logo); synthesis with different reference images produces varied outputs.

Table 1. Benchmarks on ShoeV2/ChairV2. ‘\(*\)’ indicates paired data for training.
Fig. 5. Left: With different references, our model can produce diverse outputs. Middle: Given sketches of similar shoes drawn by different users, our model can capture their commonality as well as subtle distinctions. Each row shows the input sketch, the synthesized grayscale image, and the synthesized RGB photo. Right: Our model even works for sketches at different completion stages, delivering realistic, closely matching shoes. (Color figure online)

Fig. 6. Left: Generalization across domains. Column 1 shows sketches from two unseen datasets, Sketchy and TU-Berlin. Columns 2–4 are results from our model trained on ShoeV2. Right: Our shoe model can be used as a shoe detector and generator. It can generate a shoe photo based on a non-shoe sketch. It can further turn the non-shoe sketch into a more shoe-like sketch. (a) Input sketch; (b) synthesized grayscale photo; (c) re-synthesized sketch; (d) green (a) overlaid over gray (c). (Color figure online)

Robustness and Sensitivity in Fig. 5 (Middle & Right). We test our ShoeV2 model under two settings: 1) sketches corresponding to the same photo, 2) sketches at different completion stages. Given sketches of similar shoes drawn by different users, our model can capture their commonality as well as subtle distinctions and translate them into photos. Our model also works for sketches at different completion stages (obtained by removing strokes according to their ordering), synthesizing realistic, closely matching shoes for partial sketches.

Generalization Across Domains in Fig. 6 (Left). When sketches are randomly sampled from different datasets such as TU-Berlin [6] and Sketchy [28], which have greater shape deformation than ShoeV2, our model trained on ShoeV2 can still produce good results (see more examples in the Supplementary).

Table 2. Comparison of different architecture designs.

Sketches from Novel Categories in Fig. 6 (Right). While we focus on a single category training, we nonetheless feed our model sketches from other categories. When the model is trained on shoes, the shape translation network has learned to synthesize a grayscale shoe photo based on a shoe sketch. For a non-shoe sketch, our model translates it into a shoe-like photo. Some fine details in the sketch become a common component of a shoe. For example, a car becomes a trainer while the front window becomes part of a shoelace. The superimposition of the input sketch and the re-shoe-synthesized sketch reveals which lines are chosen by our model and how it modifies the lines for re-synthesis.

4.3 Ablation Study

Two-Stage Architecture. The two-stage architecture is the key to the success of our model. This strategy can easily be adopted by other models such as CycleGAN. Table 2 compares the performance of the original CycleGAN and its two-stage version (i.e., CycleGAN is used only for shape translation while the content enrichment network is the same as ours). The two-stage version outperforms the original CycleGAN by 27.55 (on ShoeV2) and 68.33 (on ChairV2), indicating the significant benefits brought by this architectural design.

Fig. 7. Left: Synthesized results when the edge map is used as the intermediate goal instead of the grayscale photo. (a) Input sketch; (b) synthesized edge map; (c) synthesized RGB photo using the edge map; (d) synthesized RGB photo using grayscale (ours). Right: Our model can successfully deal with noise sketches, which are not handled well by another attention-based model, UGATIT. For an input sketch (a), our model produces an attention mask (b); (c) and (d) are grayscale images produced by the vanilla model and our full model. (e) and (f) compare our result with that of UGATIT. (Color figure online)

Fig. 8. Comparisons of paired and unpaired training for shape translation. Four examples are shown. In each example, the first image is the input sketch; the second and third are grayscale images synthesized by Pix2Pix and our model respectively. Although the input sketches differ visually across examples, Pix2Pix produces similar-looking grayscale images, whereas our results are more faithful to the sketch.

Edge Map vs. Grayscale as the Intermediate Goal. We choose grayscale as the intermediate goal of translation. As shown in Fig. 1, edge maps could be an alternative since they do not exhibit shape deformation either: we could first translate a sketch into an edge map, and then fill the edge map with colorful details.

Table 2 and Fig. 7 show that using the edge map is worse than using the grayscale. Our explanations are: 1) Grayscale images contain more visual details and thus provide more learning signal for training the shape translation network; 2) Content enrichment is easier for grayscale images, as they are closer to color photos than edge maps are. Grayscale images are also easier to obtain in practice.

Deal with Abstraction and Style Variations. We have discussed the problem encountered during shape translation in Sect. 3.1, and introduced 1) a self-supervised objective along with noise sketch composition strategies and 2) an attention module to handle it. Table 3 compares the FID achieved at the first stage by different variants. Our full model tackles the problem better than the vanilla model, and each component contributes to the improved performance. Figure 7 shows two examples and compares with the results of UGATIT.

Paired vs. Unpaired Training. We train a Pix2Pix model for shape translation to see if pairing information helps. As shown in Table 3 (Pix2Pix) and Fig. 8, the performance of Pix2Pix is much worse than ours (FID: 75.84 vs. 46.46 on ShoeV2 and 164.01 vs. 90.87 on ChairV2). This is most likely caused by the shape misalignment between sketches and grayscale images.

Table 3. Contribution of each proposed component. The FID scores are obtained based on the results of shape translation stage.
Table 4. Exclude the effect of paired data. Although pairing information is not used during training, it does exist in ShoeV2. We compose a new dataset in which no pairing exists and train the model again. Results are obtained on the same test set.

Exclude the Effect of Paired Information. Although pairing information is not used during training, it does exist in ShoeV2. To eliminate any potential facilitation from pairing, we train another model on a composed dataset, created by merging all the sketches of ShoeV2 with 9,995 photos from UT Zappos50K [33]. These photos are collected from a different source than ShoeV2. We train this model in the same setting. Table 4 shows that this model achieves performance similar to the one trained on ShoeV2, indicating the effectiveness of our approach for learning the task from entirely unpaired data.

Fig. 9. Our results on photo-based sketch synthesis. Top: In each sketch-photo pair, the left is the input photo and the right is the synthesized sketch; results obtained on ShoeV2 and ChairV2. Bottom: Results obtained on ShapeNet [3]. Column 1 is the input photo; Columns 2–5 are lines generated by Canny, HED, Photo-Sketching [19] (Contour for short), and our model. Our model can generate line strokes with a hand-drawn effect, while HED and Canny detectors produce edge maps faithful to the original photos. Ours emphasize perceptually significant contours rather than intensity-contrast-significant edges as in edge maps.

4.4 Photo-to-Sketch Synthesis Results

Synthesize a Sketch Given a Photo. As the shape translation network is bidirectional (i.e., T and \(T^\prime \)), our model can also translate a photo into a sketch. This task is not trivial, as users can easily detect a fake sketch based on its stroke continuity and consistency. Figure 9 (Top) shows that our generated sketches mimic manual line-drawings and emphasize contours that are perceptually significant.
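Because the stage-1 model is bidirectional, photo-to-sketch synthesis is a single forward pass through the reverse mapping. A minimal inference sketch, where the grayscale conversion and the [-1, 1] input range are assumptions about preprocessing:

```python
import torch

@torch.no_grad()
def photo_to_sketch(T_prime, photo_rgb):
    """photo_rgb: (B, 3, H, W) in [0, 1]; T_prime is the learned G -> S mapping."""
    gray = (0.299 * photo_rgb[:, 0:1] + 0.587 * photo_rgb[:, 1:2]
            + 0.114 * photo_rgb[:, 2:3])
    T_prime.eval()
    return T_prime(gray * 2.0 - 1.0)   # normalize, then translate to a sketch
```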

Sketch-Like Edge Extraction. Sketch-to-photo and photo-to-sketch synthesis are opposite processes. We suspect that our model can create sketches from photos in broader categories, as this direction may require less class-specific prior knowledge.

We test our shoe model directly on photos in ShapeNet [3]. Figure 9 (Bottom) lists our results along with those from HED [32] and the Canny edge detector [2]. We also compare with Photo-Sketching [19], a method specifically designed for generating boundary-like drawings from photos. 1) Unlike HED and Canny, which produce edge maps faithful to the photo, ours presents a hand-drawn style. 2) Our model can double as an edge\(+\) extractor on unseen classes. This is an exciting corollary product: a promising automatic sketch generator that captures human visual perception beyond the edge map of a photo (more results in the Supplementary).

Fig. 10. Sample retrieval results. Our synthesis model can map photos to the sketch domain and vice versa, so the cross-domain retrieval task can be converted to intra-domain retrieval. Left: All candidate photos are mapped to sketches, so both the query and the candidates are in the sketch domain. Right: The query sketch is translated to a photo, so the matching is done in the photo domain. The top right corner shows the original photo or sketch.

4.5 Application: Unsupervised Sketch-Based Image Retrieval

Sketch-based image retrieval is an important application of sketches. One of its main challenges is the large domain gap. Existing methods either map sketches and photos into a common space or use edge maps as the intermediate representation. Our model, however, enables a direct mapping between the two domains.

We thus conduct experiments in the two possible mapping directions: 1) translate gallery photos to sketches, and then find the nearest sketches to the query sketch (Fig. 10 (Left)); 2) translate a sketch to a photo and then find its nearest neighbors in the photo gallery (Fig. 10 (Right)). Two ResNet18 [12] models, one pretrained on ImageNet and the other on the TU-Berlin dataset, are used as feature extractors for photos and sketches respectively (see the Supplementary for further details). Figure 10 shows our retrieval results. Even without any supervision, the results are already acceptable. In the second experiment, we achieve top-5 (top-20) accuracies of 37.2% (65.2%). These results are higher than those obtained by translating the sketch to an edge map instead, which are 34.5% (57.7%).
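A minimal version of the second direction (translate the query sketch to a photo, then match with ImageNet-pretrained ResNet18 features); the feature-extraction and cosine-matching choices here are assumptions about the retrieval pipeline, not the authors' exact setup:

```python
import torch
import torchvision

def build_photo_encoder():
    # ImageNet-pretrained ResNet18 with the classification head removed.
    resnet = torchvision.models.resnet18(weights="IMAGENET1K_V1")
    return torch.nn.Sequential(*list(resnet.children())[:-1]).eval()

@torch.no_grad()
def retrieve(sketch, gallery_photos, T, C, encoder, topk=5):
    """sketch: (1, 1, H, W); gallery_photos: (N, 3, H, W).
    T, C: shape-translation and content-enrichment networks."""
    query_photo = C(T(sketch))                           # sketch -> color photo
    q = encoder(query_photo).flatten(1)                  # (1, 512)
    g = encoder(gallery_photos).flatten(1)               # (N, 512)
    sims = torch.nn.functional.cosine_similarity(q, g)   # cosine matching
    return sims.topk(topk).indices                       # nearest gallery photos
```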

Summary. We propose the first unsupervised two-stage sketch-to-photo synthesis model that can produce photos of high fidelity, realism, and diversity. It enables sketch-based image retrieval and automatic sketch generation that captures human visual perception beyond the edge map of a photo.