
1 Introduction

Zero-shot sketch-based image retrieval (ZS-SBIR) derives from sketch-based image retrieval (SBIR) [8, 12], in which a sketch is used as a query to retrieve relevant images from a gallery. In SBIR, the training and testing data come from the same categories, and the biggest challenge is the modal gap between sketches and images. SBIR under the zero-shot setting, called ZS-SBIR, was proposed to deal with the scarcity of human sketches and human-annotated samples [7, 23]. In the zero-shot setting, the training categories (seen categories) and the testing categories (unseen categories) are disjoint, which raises a new issue of semantic inconsistency between seen and unseen categories.

In recent years, to address the issue of semantic inconsistency, most research on ZS-SBIR has focused on incorporating external semantic information, either to facilitate the transfer of knowledge from seen to unseen categories or to alleviate the gap between modalities. Early work [4, 7, 15] bridges the seen and unseen classes through semantic embeddings. These embeddings are generally obtained from annotations, which require additional human labor. Later methods [11, 16, 17], built on pretrained convolutional neural networks (CNNs) or pretrained vision transformers (ViTs), employ knowledge distillation to preserve the representation capability of the pretrained backbone and thus transfer knowledge from seen to unseen categories. Although these distillation-based methods have achieved great success, they still need external semantic information such as class labels: knowledge distillation alone can yield features that are poorly discriminative for retrieval, so class labels are used to make the features more discriminative. Recently, ZSE [9] took a different perspective on ZS-SBIR, namely the visual correspondence between sketch and image. It achieves superior performance with a transformer-based model without relying on any external semantic information such as texts or class labels. However, it designs a kernel-based relation network that learns the relationship between every sketch-image pair and ignores relationships at higher levels, such as the class level and the modality level. It also fails to preserve the generalization capability of the ViT used as the backbone.

Motivated by these observations, we propose a method that shares the core idea of ZSE: addressing ZS-SBIR without external semantic information by learning the visual correspondence between sketch and image. Specifically, as shown in Fig. 1, we design a dual-pathway transformer-based structure, with one pathway for sketches and one for images. The structure takes data from the two modalities as input and establishes local correspondences between them. A triplet loss is used for preliminary alignment of the retrieval tokens from sketch and image. To maintain relationships at various levels, such as the instance, class and modality levels, a distribution alignment loss is employed so that the data are not aligned only at the instance level. In addition, a teacher ViT performs knowledge distillation on the self-attention modules through an instance-level KD loss, preserving the generalization capability of the ViT used to build these modules. Extensive experiments on three ZS-SBIR benchmark datasets verify the superiority of our method.

We summarize our contributions in this paper as follows:

1. We propose a novel method based on a transformer-based dual-pathway structure that learns the visual correspondence between sketch and image for ZS-SBIR. The method achieves superior performance without any external semantic information, which would require extra human labor to obtain.

2. We propose a distribution alignment loss, which aligns sketch and image data from a global view and maintains their relationships at various levels, such as the instance, class and modality levels.

3. We introduce knowledge distillation with an instance-level KD loss, which preserves the generalization capability of the ViT used as the backbone of the model.

2 Related Work

2.1 Zero-Shot Sketch-Based Image Retrieval (ZS-SBIR)

ZS-SBIR is a challenging task that must simultaneously address the inherent modal gap and semantic inconsistency. Pioneering studies [4, 7, 15] in ZS-SBIR, inspired by the knowledge transfer mechanism in zero-shot learning, used semantic embeddings obtained from labeled category-level texts through a text-based model to support knowledge transfer from the source domain (seen categories) to the target domain (unseen categories).

Liu et al. [11] used a CNN-based model pretrained on ImageNet as the backbone to map data from different modalities into the semantic space of the pretrained model, and introduced knowledge distillation for the first time to preserve the rich semantic information of the pretrained model for knowledge transfer. Tian et al. [16] also used knowledge distillation and selected DINO [1], which has a strong ability to capture global structural information, as the backbone. In addition, they proposed a hypersphere learning framework to align data from different modalities.

Recently, Lin et al. [9] proposed ZSE, which differs from previous work in that it treats a cross-modal matching problem such as ZS-SBIR as a comparison of groups of key local patches. This has the advantage of not requiring external semantic knowledge and achieves superior performance in ZS-SBIR. In this paper, we adopt the same idea as ZSE and address ZS-SBIR by learning a local visual match between sketch and photo, but we perform the matching from a global perspective and use knowledge distillation to prevent catastrophic forgetting.

2.2 Vision Transformer (ViT)

Transformer [19] was originally proposed for machine translation and has since achieved tremendous success in many areas of artificial intelligence, such as natural language processing and computer vision. ViT [6] applies the Transformer architecture to computer vision: it is a transformer-based image classification model with excellent representational capability and strong transferability that can be applied to other vision tasks such as object detection and semantic segmentation.

Subsequently, several studies [2, 5] found that reasoning based on cross-attention is effective for image classification, few-shot learning and sketch segment matching, and they learn visual correspondences at different levels between images through a cross-attention mechanism. In this paper, we use a ViT with powerful representation ability to build the self-attention module, which produces visual tokens corresponding to the most informative local regions, and a cross-attention module following the self-attention module to learn a local visual match between sketch and photo.

Fig. 1. The overview of our proposed method. A dual-pathway structure is designed to learn visual correspondences between sketches and images. The triplet loss is applied to retrieval tokens for preliminary alignment. The distribution alignment loss eliminates the domain gap between sketch and image by aligning their latent feature distributions. The instance-level KD loss is applied to the pseudo ImageNet labels produced by the self-attention module and the teacher ViT to preserve generalization.

3 Methodology

In this section, following the overall scheme of our framework illustrated in Fig. 1, we explain the main modules and learning objectives of our method in detail.

3.1 Cross-Domain ViT

To establish patch-to-patch correspondences between sketches and images, as shown in Fig. 1, we design a dual-pathway structure with one pathway for sketches and one for images. Specifically, sketches and images first pass through their corresponding self-attention modules to obtain tokens with rich visual information independently within each modality, and these tokens then interact with tokens from the other modality through the corresponding cross-attention module to establish local correspondences between the two modalities. Furthermore, a triplet loss is applied for preliminary alignment of the retrieval tokens from the two pathways.

Self-attention Module. We select ViT to build the self-attention module. ViT is composed of L layers of multi-head self-attention (MSA) and feed-forward network (FFN) blocks. The inputs of ViT are first resized to a fixed resolution, and each input is then divided into a sequence of patches of fixed size. ViT adds a learnable class token cls for image recognition; we replace cls with a learnable retrieval token \(x_{ret}\) that captures the global features of an image or sketch for retrieval. All tokens, including \(x_{ret}\), are \(d\)-dimensional, and together they form \(X_{0}=(x_{ret},x_{1},\dots ,x_{n})\), where \(x_{i}\) is a visual token and n is the number of visual tokens. For the \(i\)-th layer, the MSA module has projection matrices \((W_{q},W_{k},W_{v})\), which map the same tokens into queries, keys and values, and the whole MSA process can be formulated as follows:

$$\begin{aligned} Q=X_{i-1}W_{q},\ K=X_{i-1}W_{k},\ V=X_{i-1}W_{v} \end{aligned}$$
(1)
$$\begin{aligned} SA(X_{i-1})=\psi (\frac{QK^{T}}{\sqrt{d}})V \end{aligned}$$
(2)

where \(X_{i-1}\) is the output of the previous layer (\(X_{0}\) for the first layer) and \(\psi \) is the softmax operation. The feed-forward process can be formulated as follows:

$$\begin{aligned} X_{i}=MSA(LN(X_{i-1}))+X_{i-1} \end{aligned}$$
(3)
$$\begin{aligned} X_{i}=FFN(LN(X_{i}))+X_{i} \end{aligned}$$
(4)

where LN is layer normalization.
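For concreteness, a minimal PyTorch sketch of one such pre-norm MSA + FFN layer following Eqs. (1)-(4) is given below; the module names, the use of nn.MultiheadAttention, and the MLP expansion ratio are our illustrative choices, not the authors' implementation.

```python
import torch
import torch.nn as nn

class ViTBlock(nn.Module):
    """One pre-norm MSA + FFN layer, i.e. Eqs. (3)-(4)."""
    def __init__(self, d=768, heads=12, mlp_ratio=4):
        super().__init__()
        self.ln1 = nn.LayerNorm(d)
        self.msa = nn.MultiheadAttention(d, heads, batch_first=True)  # Eqs. (1)-(2) inside
        self.ln2 = nn.LayerNorm(d)
        self.ffn = nn.Sequential(nn.Linear(d, mlp_ratio * d), nn.GELU(),
                                 nn.Linear(mlp_ratio * d, d))

    def forward(self, x):                 # x: (B, 1 + n, d), token 0 is x_ret
        h = self.ln1(x)
        x = x + self.msa(h, h, h, need_weights=False)[0]   # Eq. (3)
        x = x + self.ffn(self.ln2(x))                      # Eq. (4)
        return x

tokens = torch.randn(2, 197, 768)         # x_ret plus 196 patch tokens (224x224 input, 16x16 patches)
out = ViTBlock()(tokens)                  # (2, 197, 768)
```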

Cross-Attention Module. We follow the method proposed by ZSE [9] to build the cross-attention module. Each cross-attention module takes the visual tokens and \(x_{ret}\) of both modalities as input to build pairwise connections between tokens from the two modalities. In detail, the sketch query \(Q_{s}\) and the image query \(Q_{p}\) are swapped, and the cross-modal attention of the sketch can be formulated as:

$$\begin{aligned} CA(X_{s})=\psi (\frac{Q_{p}K_{s}^{T}}{\sqrt{d}})V_{s} \end{aligned}$$
(5)

Similarly, the cross-modal attention of the image can be formulated as:

$$\begin{aligned} CA(X_{p})=\psi (\frac{Q_{s}K_{p}^{T}}{\sqrt{d}})V_{p} \end{aligned}$$
(6)
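Below is a minimal sketch of this query-swapping cross-attention following Eqs. (5) and (6); sharing the projection matrices across modalities and the single-head form are simplifying assumptions on our part.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossAttention(nn.Module):
    """Single-head cross-attention with swapped queries, Eqs. (5)-(6)."""
    def __init__(self, d=768):
        super().__init__()
        self.wq, self.wk, self.wv = nn.Linear(d, d), nn.Linear(d, d), nn.Linear(d, d)
        self.scale = d ** -0.5

    def forward(self, x_s, x_p):          # sketch / image tokens: (B, n, d) each
        q_s, k_s, v_s = self.wq(x_s), self.wk(x_s), self.wv(x_s)
        q_p, k_p, v_p = self.wq(x_p), self.wk(x_p), self.wv(x_p)
        # the sketch query and the image query are swapped across modalities
        ca_s = F.softmax(q_p @ k_s.transpose(-2, -1) * self.scale, dim=-1) @ v_s  # Eq. (5)
        ca_p = F.softmax(q_s @ k_p.transpose(-2, -1) * self.scale, dim=-1) @ v_p  # Eq. (6)
        return ca_s, ca_p
```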

Triplet Loss. To align retrieval tokens from the two modalities, i.e., sketch and image, we use a triplet loss that pulls the sketch retrieval token \(x_{ret}^{s+}\) close to the image retrieval token \(x_{ret}^{p+}\) of the same class and pushes it away from the image retrieval token \(x^{p-}_{ret}\) of a different class. For a batch of N triplets, the triplet loss can be formulated as:

$$\begin{aligned} \mathcal L_{tri} =\frac{1}{N}\sum _{i=1}^{N}max(\parallel x_{ret}^{s+} - x_{ret}^{p+} \parallel - \parallel x_{ret}^{s+} - x_{ret}^{p-} \parallel + m, 0) \end{aligned}$$
(7)

where m is the margin.
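A direct transcription of Eq. (7) might look as follows; the margin value and the use of plain (unsquared) Euclidean distances are assumptions, and torch.nn.TripletMarginLoss would serve equally well.

```python
import torch
import torch.nn.functional as F

def triplet_loss(x_s, x_p_pos, x_p_neg, margin=0.3):
    """Eq. (7) for a batch of (N, d) retrieval tokens."""
    d_pos = (x_s - x_p_pos).norm(p=2, dim=1)   # distance to same-class image token
    d_neg = (x_s - x_p_neg).norm(p=2, dim=1)   # distance to different-class image token
    return F.relu(d_pos - d_neg + margin).mean()
```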

3.2 Gaussian Distribution Based Domain Alignment

Wang et al. [20] showed that datasets from different domains with different latent feature distributions can be aligned under the guidance of a Gaussian prior. This alignment builds a common feature space for datasets from different domains, and the common space provides discriminative features that help eliminate the domain gap. We therefore adopt a Gaussian prior to guide the alignment between sketch and image and thus alleviate the domain gap between them. Specifically, we follow the method proposed by Wu et al. [22] and utilize the Kullback-Leibler divergence to align \(x^{s}_{ret}\) of sketches and \(x^{p}_{ret}\) of images with a common Gaussian distribution. With our dual-pathway structure, a batch of retrieval tokens \(X_{ret} = \{ x_{ret}^{i} \in R^{d} \}^{N}\), representing the global visual information of sketches or images, is generated from a batch of training samples. We simultaneously sample a batch of random features \(F = \{ f_{i} \in R^{d} \}^{N}\) from the Gaussian distribution \(\mathcal N(0,1)\), and the distribution alignment loss \(\mathcal L_{da}\) can be formulated as:

$$\begin{aligned} \mathcal L_{da} = KL(F \parallel X_{ret}) \end{aligned}$$
(8)

where KL is the Kullback-Leibler divergence. By applying \(\mathcal L_{da}\) to both \(x^{s}_{ret}\) and \(x^{p}_{ret}\), the distributions of the two modalities are aligned under the guidance of the Gaussian prior, and the domain gap between sketch and image is indirectly mitigated.
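Eq. (8) compares a batch of sampled Gaussian features with a batch of retrieval tokens. Since the exact formulation follows Wu et al. [22], the sketch below makes the additional assumption that each vector is converted into a distribution by a softmax before the KL divergence is computed.

```python
import torch
import torch.nn.functional as F

def distribution_alignment_loss(x_ret):
    """Eq. (8) for a batch of (N, d) retrieval tokens of one modality."""
    f = torch.randn_like(x_ret)               # batch sampled from N(0, 1)
    p = F.softmax(f, dim=1)                   # Gaussian prior samples as distributions (assumption)
    log_q = F.log_softmax(x_ret, dim=1)       # retrieval tokens as distributions (assumption)
    # F.kl_div(log_q, p) computes KL(p || q), i.e. KL(F || X_ret)
    return F.kl_div(log_q, p, reduction="batchmean")
```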

3.3 Instance-Level Knowledge Distillation

Since ViT is pretrained on a large-scale image dataset, e.g., ImageNet, it has a powerful discrimination capability and can provide probability vectors containing fine-grained semantic information for input images. However, when ViT is used as the backbone of the self-attention module and finetuned on a much smaller ZS-SBIR dataset, the rich knowledge it originally learned from ImageNet is gradually lost. Inspired by the method proposed by Tian et al. [17], we introduce instance-level knowledge distillation into our method. More specifically, as illustrated in Fig. 1, the teacher ViT and the self-attention module produce probability vectors that were originally used to predict ImageNet categories, which we dub pseudo ImageNet labels. We then let the self-attention module mimic the teacher ViT's response by aligning the pseudo ImageNet labels. This alignment is applied only to images because of the domain gap between sketches and the images in ImageNet.

Given an image \(p_{i}\), it is fed into the teacher ViT and the self-attention module to obtain the pseudo ImageNet labels \(e^{t}_{i}\) and \(e^{s}_{i}\), and the instance-level knowledge distillation loss \(\mathcal L_{ikd}\) can be formulated as:

$$\begin{aligned} \mathcal L_{ikd} = KL(e^{t}_{i} \parallel e^{s}_{i}) \end{aligned}$$
(9)
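A standard soft-label distillation sketch for Eq. (9) is shown below; the distillation temperature is not specified in the paper, so T = 1 is assumed, and the loss is computed on image inputs only.

```python
import torch
import torch.nn.functional as F

def instance_kd_loss(teacher_logits, student_logits, T=1.0):
    """Eq. (9): (N, 1000) pseudo ImageNet label logits, images only."""
    p_t = F.softmax(teacher_logits / T, dim=1)          # frozen teacher ViT
    log_p_s = F.log_softmax(student_logits / T, dim=1)  # self-attention module
    return F.kl_div(log_p_s, p_t, reduction="batchmean") * (T * T)
```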

3.4 Overall Objective

Finally, the full objective function of our method can be formulated as:

$$\begin{aligned} \mathcal L = \lambda _{1}\cdot \mathcal L_{tri} + \lambda _{2}\cdot \mathcal L_{da} + \lambda _{3}\cdot \mathcal L_{ikd} \end{aligned}$$
(10)

where \(\lambda _{1}\), \(\lambda _{2}\) and \(\lambda _{3}\) are weight factors to balance the contributions of \(\mathcal L_{tri}\), \(\mathcal L_{da}\) and \(\mathcal L_{ikd}\), respectively.
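Reusing the loss sketches above, the overall objective of Eq. (10) can be assembled as follows; summing \(\mathcal L_{da}\) over the two modalities is an assumption, and the default weights follow the values reported in Sect. 4.1.

```python
def total_loss(x_ret_s, x_ret_p_pos, x_ret_p_neg, teacher_logits, student_logits,
               l1=2.0, l2=0.1, l3=1.0):
    l_tri = triplet_loss(x_ret_s, x_ret_p_pos, x_ret_p_neg)       # Eq. (7)
    l_da = (distribution_alignment_loss(x_ret_s)
            + distribution_alignment_loss(x_ret_p_pos))           # Eq. (8), both modalities (assumed sum)
    l_ikd = instance_kd_loss(teacher_logits, student_logits)      # Eq. (9), images only
    return l1 * l_tri + l2 * l_da + l3 * l_ikd                    # Eq. (10)
```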

4 Experiments

4.1 Datasets and Setup

Datasets. We evaluate our method on three benchmark datasets for ZS-SBIR, i.e., Sketchy [14], TU-Berlin [8] and QuickDraw [4]. Sketchy contains 12,500 natural images and 75,471 sketches in 125 categories; Liu et al. [10] extended it with another 60,502 natural images to alleviate the data imbalance between the two modalities. There are two seen/unseen class splits for Sketchy, which we refer to as Sketchy and Sketchy-NO. The former [10] randomly selects 25 classes as unseen classes, while the latter [23] selects 21 classes that do not overlap with the ImageNet classes. TU-Berlin contains 13,419 natural images and 20,000 sketches in 250 categories; Zhang et al. [24] extended it with another 204,489 natural images. The widely used split [15] for TU-Berlin selects 30 categories as unseen classes. QuickDraw contains 330,000 sketches and 204,000 images in 110 categories, and its split [4] also selects 30 unseen classes that do not overlap with the ImageNet classes.

Implementation Details. We implement our method in PyTorch on a GeForce RTX 2080 Ti GPU. The ViT pretrained on ImageNet-1K is used to build the self-attention module, which consists of 12 layers of MSA and FFN blocks and a fully-connected layer that produces 1000-dimensional pseudo ImageNet labels. The cross-attention module contains only one layer. The dimension of the retrieval and visual tokens is 768. The input size of a sketch or image is \(224 \times 224\). AdamW is used as the optimizer with a learning rate of \(10^{-5}\). The batch size is set to 64 with 2 gradient accumulation steps. To obtain \((x_{ret}^{s+}, x_{ret}^{p+}, x_{ret}^{p-})\) for the triplet loss, each batch consists of 32 sketches from the same category and 32 images from two categories, half of which share the category of the sketches. The model is trained for 30 epochs. \(\lambda _{1}\), \(\lambda _{2}\) and \(\lambda _{3}\) are set to 2.0, 0.1 and 1.0 in all experiments. In the test phase, we use the retrieval tokens for retrieval.
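The batch composition described above could be realized with a sampler along the following lines; the dataset layout (a dict from class name to sample list) and the function name are hypothetical, not the authors' code.

```python
import random

def sample_triplet_batch(sketches_by_cls, images_by_cls, n=32):
    """Return n sketches of one class and n images: half matching, half from another class."""
    pos_cls = random.choice(list(sketches_by_cls))
    neg_cls = random.choice([c for c in images_by_cls if c != pos_cls])
    sketches = random.sample(sketches_by_cls[pos_cls], n)
    images = (random.sample(images_by_cls[pos_cls], n // 2)      # positives for the triplet loss
              + random.sample(images_by_cls[neg_cls], n // 2))   # negatives
    return sketches, images
```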

Evaluation Protocol. Following previous work [11] in ZS-SBIR, we adopt precision (Prec@k) and mean average precision (mAP@k) as evaluation protocols for fair comparison. In all experiments, these metrics are computed using cosine similarity as the distance metric (Table 1).
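A minimal sketch of Prec@k and mAP@k under cosine similarity is given below, assuming the retrieval tokens are compared after L2 normalization; normalizing AP@k by the number of relevant items retrieved in the top k is one common convention and not necessarily the exact protocol of [11]. Setting k to the gallery size yields mAP@all.

```python
import torch
import torch.nn.functional as F

def retrieval_metrics(q, g, q_labels, g_labels, k=100):
    """q: (Nq, d) sketch retrieval tokens, g: (Ng, d) image retrieval tokens."""
    sims = F.normalize(q, dim=1) @ F.normalize(g, dim=1).T      # cosine similarity
    topk = sims.argsort(dim=1, descending=True)[:, :k]          # top-k gallery indices
    hits = (g_labels[topk] == q_labels[:, None]).float()        # (Nq, k) relevance
    prec_k = hits.mean(dim=1).mean().item()                     # Prec@k
    cum_prec = hits.cumsum(dim=1) / torch.arange(1, k + 1)
    ap_k = (cum_prec * hits).sum(dim=1) / hits.sum(dim=1).clamp(min=1)
    return prec_k, ap_k.mean().item()                           # Prec@k, mAP@k
```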

Table 1. Comparison of our method and compared approaches on Sketchy, Sketchy-NO, TU-Berlin and QuickDraw. “–” means that the results are not reported in the original papers. The best and second-best results are marked in bold and underlined, respectively.

4.2 Comparison with the State-of-the-Arts

Comparison Methods. We compare our method with several baselines, including ZSIH [15], SEM-PCYC [7], DOODLE [4], SAKE [11], PDFD [3], DSN [21], RPKD [17], SBTKNet [18], Sketch3T [13], TVT [16], and ZSE [9]. ZSE has two retrieval settings, and we compare against both for a fair comparison. One uses the sketch-image matching scores output by the relation network, referred to as ZSE-RN; the other uses retrieval tokens for retrieval, referred to as ZSE-Ret. It is worth noting that all methods except ours and ZSE utilize external semantic information, such as text or class labels.

Table 2. Ablation results (mAP@all) for each loss on the Sketchy and TU-Berlin datasets. “\(\checkmark \)” means that the loss term is used, while “\(\times \)” means it is not.

Overall Results. We evaluate our method on Sketchy, Sketchy-NO, TU-Berlin and QuickDraw, and compare the results with the other baselines in Table 1. Compared to the state-of-the-art ZSE, our method surpasses it on most metrics, which highlights the benefit of our Gaussian distribution based domain alignment and of knowledge distillation for preserving knowledge.

When compared to methods that use external semantic information, our method significantly outperforms them on Sketchy and TU-Berlin, and achieves the best mAP@all on QuickDraw. The results on Sketchy-NO are slightly worse than some of them, because the unseen categories of Sketchy-NO do not overlap with the ImageNet classes and improving the results without any external semantic information is difficult. Overall, these results show the effectiveness of our method, given that we do not utilize any external semantic information.

Fig. 2. Retrieval examples of ZS-SBIR results on unseen data of TU-Berlin.

4.3 Further Analysis

Ablation Study. We evaluate the effect of each loss term by ablating one of them during training. The results are shown in Table 2, where “w/o” denotes the ablated term. From the comparison of each variant with our full model, we draw the following conclusions: 1) \(\mathcal L_{tri}\) contributes the most among the three losses, since it directly aligns the retrieval tokens used for retrieval; the variant without \(\mathcal L_{tri}\) drops significantly and performs worse than the other variants. 2) The performance of the variant without \(\mathcal L_{da}\) indicates that adopting a Gaussian prior to guide the alignment between sketch and image alleviates the domain gap between them. 3) The variant without \(\mathcal L_{ikd}\) performs slightly worse than the full model, which shows that \(\mathcal L_{ikd}\) helps the backbone retain the extensive knowledge learned from large-scale ImageNet.

Retrieval Results. As shown in Fig. 2, we visualize the top 10 retrieval results for sketch queries, where correct and incorrect candidates are marked with checkmarks and crosses, respectively. The majority of the top 10 images retrieved by our approach resemble the query sketches in overall object pose and shape, and even the incorrectly retrieved results have shapes similar to the queries. This observation supports the validity of our method.

Fig. 3. The t-SNE visualization for seen and unseen data of Sketchy. Colored circles are used to represent images, while upper triangles are used to represent sketches.

Visualization of Embeddings. As shown in Fig. 3, we evaluate the effect of our method on semantic alignment across modalities using t-SNE. We conduct this visualization on both seen and unseen data from Sketchy. From Fig. 3, we find that our method successfully clusters the seen data from different modalities, and to some extent the unseen data also cluster together. Furthermore, most categories are properly separated regardless of modality. This indicates that our method effectively aligns data from the two modalities and generalizes well to unseen classes.

5 Conclusion

This paper tackled ZS-SBIR by learning a local visual correspondence between sketch and photo with a transformer-based dual-pathway model, thereby minimizing the semantic inconsistency. To eliminate the modal gap, a triplet loss and a distribution alignment loss were introduced to align the data from the two modalities. Furthermore, knowledge distillation was introduced to maintain the generalization capability. We conducted extensive experiments on three benchmark datasets, and our method achieves competitive results compared to the baselines without using external semantic information. In the future, we will explore more effective ways of learning the visual correspondence between sketch and photo for ZS-SBIR.