1 Introduction

The last few years have witnessed progress in image retrieval: successful models can be trained, provided that a set of labeled images from the domain of interest (not necessarily from the same categories) is available for training, as in the common deep metric learning scenario. Those models are as powerful as they are specialized: it has been shown, and we confirm in our experiments, that a model carefully tailored to one domain (e.g. bird species) tends to perform poorly on a neighboring yet different domain (e.g. dog breeds).

Here, we argue that a practical visual search system should be able to solve multiple retrieval tasks simultaneously, without needing to explicitly specialize for each task. Consider for example a visual search system specialized in fauna and flora. In such a system, the image database covers a broad range of fine-grained domains, e.g. from searching among different insect species to searching among different kinds of mushrooms. Extending the system to also handle coral species should be as simple as providing a set of unlabeled coral images.

In parallel, the field has worked towards pretraining large and generic models for visual representations that can be used, often as a black box, to extract features for new tasks. Among those, models trained in a self-supervised way have been shown to be versatile across various target tasks, including image retrieval [9, 21].

In this work, we assume access to such a large pretrained model that already provides good zero-shot performance. We also assume access to an unlabeled set of images possibly from multiple tasks. We propose to adapt the initial model so it performs even better on multiple image retrieval tasks simultaneously, i.e. when this same adapted model is used to extract features for all tasks.

This raises two questions. First, how should we perform adaptation? Fine-tuning is prohibitively costly, especially for large pretrained models, and does not always transfer well. As an alternative to fine-tuning, and inspired by an early work on multi-task training [46] and a recent trend in natural language processing [27, 41], we propose to use adaptor layers. Adaptors are embedded between architecture blocks and are the only weights learned; the weights of the original pretrained model remain fixed. Our experiments show that this composite architecture allows for a versatile adaptation of a strong initial model by adjusting only a small percentage of the model parameters.

Second, how should we reconcile various retrieval tasks in a single model? A retrieval task focuses on a given set of visual concepts, often associated with a particular granularity. Yet, unlike in classification, where the granularity is known beforehand, the granularity of a retrieval task is context dependent: it depends on the gallery of images over which visual search is performed. We therefore propose learning different sets of adaptors, each set tailored to one specific granularity. As we assume that training images are unlabeled, without even an indication of the retrieval task they correspond to, we propose to automatically define levels of granularity by partitioning the training set into an increasing number of clusters. As a result, each partition corresponds to a different set of pseudo-labels. We then independently train one set of adaptors for each pseudo-granularity.

Next, we need to reconcile these different sets of adaptors into a single multi-purpose retrieval model. One option is to combine them with a naive fusion mechanism. The resulting model improves results on all retrieval tasks, showing the clear benefit of a multi-granularity understanding of the data. Another option is to go one step further and achieve adaptor fusion via attention propagation. In this case, we require consistency between the adaptor attention of nearest neighbors in the feature space. We observe that this fusion mechanism further improves the model.

To summarize, our contribution is threefold. First, we compensate for the absence of image and task labels by creating sets of pseudo-labels, with the goal of approximating any possible granularity in a given set of retrieval tasks. Second, we propose a way to extend transformer-based architectures with adaptors, and a training framework that tailors individual sets of adaptors to different pseudo-granularities. Third, we propose a number of ways of fusing the adaptor features, e.g. via augmentation invariance or via propagating attention from neighbors in the image feature space. We validate our approach on a collection of datasets for deep metric learning and show that Grappa improves over the successful DINO pretrained model, a model known to already obtain strong zero-shot performance on all these retrieval tasks (see Fig. 1).

Fig. 1. Grappa is an unsupervised method that trains a single model with higher zero-shot performance (measured with \(\mathcal {R}\)-Precision or \(\mathcal {R}\)P) than the pretrained DINO [9] model, over several retrieval tasks.

2 Related Work

The task we tackle in this paper strongly relates to deep metric learning. It requires specific architectural changes to neural networks to extend them with adaptors. Note that our task can be seen as a zero-shot problem: it requires no labeled data from the downstream datasets and learns a single model for all tasks, something fairly uncommon in transfer learning.

Deep Metric Learning (DML). DML aims to learn a metric between data points that reflects the semantic similarity between them. It plays an important role in a wide range of tasks such as image clustering [7, 26], unsupervised learning [8, 10, 24], and visual search [6, 19, 47]. Recent DML approaches typically learn visual similarity using either a pair-based loss [11, 22, 30, 40, 53], which considers pair-wise similarities, a proxy-based loss [14, 23, 31, 57, 58], which considers the similarity between samples and class-representative proxies, or a contextual classification loss [5, 16, 50, 65]. In most cases, DML approaches finetune an ImageNet-pretrained model for each target retrieval task, and each of those finetuned models falls short when applied to other retrieval tasks. We aim for a more versatile visual search system that handles multiple retrieval tasks with a single model.

Neural Architectures with Adaptation Layers. Adaptation layers (or adaptors) have emerged [27, 41, 46, 59] as a way to avoid common problems arising in sequential finetuning or multi-task learning when finetuning large pretrained models to solve multiple tasks, namely catastrophic forgetting [36] and task imbalance. Rebuffi et al. [46] were the first to introduce adaptors to visual recognition tasks, adapting a convolutional model to many classification tasks. Adaptors have also been used with transformer architectures for natural language processing [27]: bottleneck layers are added to all the blocks of a pretrained model and finetuned, keeping the underlying model fixed.

Recently, Pfeiffer et al. [41] introduced a way to share knowledge between adaptors using an adaptor fusion layer within a two-stage learning framework: adaptors are trained independently in the first stage; in the second stage, they are kept fixed while only the fusion layer is trained. All the methods mentioned above still result in models that specialize to a single task; e.g. [41] learns a separate fusion layer per downstream task, whereas we would like to learn a single model for all tasks.

Zero-Shot Problems. The field has recently taken an interest in pretraining large models, sometimes called zero-shot models, using large quantities of data. Those have been shown to be versatile and applicable to many target tasks. Among them, self-supervised models [8,9,10, 24, 64] are trained using self-defined pseudo-labels as supervision and typically millions of images (e.g. from ImageNet [13]). Recent works [20, 54] exploit even larger, yet uncurated, sets of unlabeled images to enhance the quality of the learned representations. Others [28, 45, 66] have leveraged multiple modalities, e.g. training visual representations to be similar to the textual representations of their associated text. These self-supervised or multimodal methods offer excellent initializations for finetuning on a wide range of downstream tasks. Sometimes they are used in a zero-shot setting: a single model is used as a feature extractor, typically to solve multiple tasks. This is the regime we study here, but we further assume that a small amount of unlabeled data from the downstream tasks is available.

Relation to Other Transfer Tasks. The idea of transferring a model trained for a given task to a related one has become central to computer vision [49, 51], and appears in many research fields such as task transfer [63], domain adaptation [12], and self-supervised learning [9, 18, 39]. Yet, in all those, the initial model is only a starting point: it is typically not only extended, but also retrained for each task of interest, leading to a multitude of specialized models. In our work, we need a single model to perform well across retrieval tasks. In that regard, this work is closer to zero-shot transfer of the large pretrained models discussed above. Also related are Mixtures of Experts (MoE) [44, 48, 52, 62], an ensembling technique that decomposes a predictive problem into subtasks, training one expert for each. Although MoE architectures may look similar to ours at first glance, they typically rely on gating and pooling mechanisms that learn to predict, in a supervised way, which experts to trust and how to combine them. Similar to typical transfer approaches, they build one specialized model for each target task. Here, we focus on a purely unsupervised task: no labels are provided to indicate either the semantic content of images or the retrieval task they belong to.

3 A Granularity-Aware Multi-purpose Retrieval Model

In this section we present Grappa, a method for adapting a pretrained model to multiple retrieval tasks simultaneously, in an unsupervised way. We first formalize our task, i.e. visual search over several retrieval tasks using a single model (Sect. 3.1). We then give an overview of the approach (Sect. 3.2). Next, we detail each step, i.e. building multiple granularities (Sect. 3.3), learning adaptors using granularity-aware pseudo-labels (Sect. 3.4), and learning to fuse them by propagating adaptor attention across feature-space neighbors (Sect. 3.5).

3.1 Background

Our task of interest, visual search on multiple retrieval tasks, can be seen as a variant of the standard deep metric learning (DML) task. The most common protocol in DML is to a) split the classes into disjoint train and test sets of labels; b) learn a separate model for each retrieval task on the corresponding train split; c) perform retrieval on all images of the (unseen) test split of classes.

Our setting has several key differences. First, we solve multiple retrieval tasks simultaneously. This means that we do not learn one model for each but a single model that will be used for all tasks. Second, we only assume access to a set of unlabeled images from each retrieval task, and do not have access to labeled training sets, unlike standard DML methods. Even more challenging, unlabeled training images are provided jointly without knowing which target retrieval task they correspond to nor the total number of retrieval tasks.

More formally, let \(\mathcal {T}\) be the set of m retrieval tasks that we want to tackle simultaneously. Each task \(\mathcal {T}^t\) is associated with a training and a test set. At training time, we are provided with a fused training set \(\mathcal {D}\) composed of the union of the training sets of the m datasets in \(\mathcal {T}\). As mentioned earlier, images are not associated with any class or task label.

With so many unknowns about the target retrieval tasks, an obvious choice is to start from a large pretrained model. Self- [9, 10, 24] or weakly- [28, 45] supervised learning has been shown to lead to strong models that generalize well and exhibit high zero-shot transfer performance. We assume that we are given such a model. Here, we base our work on the Vision Transformer (ViT) [15], a popular, efficient, and highly performing architecture, pretrained in a self-supervised way with DINO [9].

We set our pretrained model \(\mathcal {M}\) to be a ViT with L transformer layers and an input patch size of \(P \times P\) pixels. An input image \(\textbf{x}\in \mathbb {R}^{H\times W \times C}\) is reshaped into a sequence of T flattened 2D patches, where \(T=HW/P^2\). The transformer uses a constant latent vector size D throughout its layers, so flattened patches are first mapped to D dimensions with a trainable linear projection and concatenated into \(\textbf{h}^0\), together with a prepended learnable [class] token and added position embeddings. The transformer encoder [55] consists of alternating blocks of multi-headed self-attention (MSA) and MLP (the latter containing two layers with a GELU non-linearity). LayerNorm (LN) is applied before every block, and residual connections are added after every block. Formally, each layer of \(\mathcal {M}\) (shown with a gray background in Fig. 3, left) is given by:

$$\begin{aligned} \textbf{h}^l = \text {MLP}(\text {LN}(\tilde{\textbf{h}}^l)) + \tilde{\textbf{h}}^l, \quad \tilde{\textbf{h}}^l = \text {MSA}(\text {LN}(\textbf{h}^{l-1})) + \textbf{h}^{l-1}, \end{aligned}$$
(1)

for \(l = \{1\ldots L\}\). The image representation \(\textbf{z}\) is the output of the [class] token after the last layer \(\textbf{h}^L\), i.e. \(\textbf{z}= \text {LN}(\textbf{h}^{L})[\)class]. We refer the reader to [15] for more details about the ViT architecture.
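To make Eq. (1) concrete, the following is a minimal PyTorch sketch of one such pre-norm transformer layer (MSA and MLP blocks, each preceded by LayerNorm and followed by a residual connection). The module and its default dimensions (ViT-Small-like) are illustrative, not the exact DINO implementation.

```python
import torch
import torch.nn as nn

class ViTLayer(nn.Module):
    """One pre-norm transformer layer, as in Eq. (1)."""
    def __init__(self, dim=384, num_heads=6, mlp_ratio=4.0):
        super().__init__()
        self.ln1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.ln2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, int(dim * mlp_ratio)),
            nn.GELU(),
            nn.Linear(int(dim * mlp_ratio), dim),
        )

    def forward(self, h):                                        # h: (B, 1 + T, D), [class] token prepended
        x = self.ln1(h)
        h_tilde = self.attn(x, x, x, need_weights=False)[0] + h  # MSA block + residual
        return self.mlp(self.ln2(h_tilde)) + h_tilde             # MLP block + residual
```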

Fig. 2. Training of the proposed Grappa. Left: granularities correspond to pseudo-labels \(\mathcal {P}_i\) obtained by multiple clusterings of the feature space (Step 1). Right: we learn the granularity-aware adaptors (Step 2, in green), and then learn how to fuse them (Step 3, in blue). In this example, N=3. (Color figure online)

3.2 Method Overview

Our method builds on the ViT [15] model \(\mathcal {M}\) pretrained with DINO [9], which we treat as an architectural backbone. We extend and train it in an unsupervised way using \(\mathcal {D}\). The training process consists of three steps (summarized in Fig. 2):

  • Step 1: Learning pseudo-labels. We build multiple sets of pseudo-labels. Each set partitions the feature space using clustering and corresponds to a different pseudo-granularity. This process is illustrated in Fig. 2a.

  • Step 2: Learning adaptors for each pseudo-label set independently. We learn a set of adaptors specific to each pseudo-granularity using a classification loss. This process is depicted by the green arrows in Fig. 2b.

  • Step 3: Learning to fuse adaptors. We learn a set of fusion layers to merge the outputs of multiple adaptors using a transformation invariance or an attention propagation loss, i.e. neighboring images should have similar attentions over adaptors. This process is depicted by the blue arrows in Fig. 2b.

These three stages lead to a single model, denoted \(\mathcal {M}^*\), that unifies the multiple granularities and consists of: the pretrained model \(\mathcal {M}\) used as a frozen backbone (its parameters are kept fixed during the entire process), the embedded adaptors \(\mathcal {A}_i\), and the fusion layers \(\mathcal {F}\). This single model is used as a unique feature extractor for all retrieval tasks considered in our benchmark. We denote our method Grappa, which stands for learning Granularity-aware Adaptors by Attention Propagation. The following subsections detail each learning stage.

3.3 Step 1: Learning Pseudo-labels

We would like to build multiple sets of pseudo-labels such that they partition the feature space at different ‘granularities’. We can approximate this partitioning by estimating multiple sets of clusters while varying the number of centers.

In practice, we extract features using the pretrained model \(\mathcal {M}\). Let \(\textbf{z}= f(\textbf{x}; \mathcal {M})\) be the feature of an image \(\textbf{x}\in \mathcal {D}\), and let \(\mathcal {Z}= \{f(\textbf{x}; \mathcal {M}), \forall \textbf{x}\in \mathcal {D}\}\) be the set of all features for the training set \(\mathcal {D}\). To get multiple sets of pseudo-labels, we cluster the full set of features \(\mathcal {Z}\) into sets of centroids \(\mathcal {C}_i, i = 1..N\), of \(k_i\) clusters respectively, where \(k_i\) grows monotonically as i approaches N. This produces N sets of pseudo-labels \(\mathcal {P}_1, \ldots , \mathcal {P}_N\). For each pseudo-label set \(\mathcal {P}_i\), an image \(\textbf{x}\in \mathcal {D}\) is associated with a pseudo-label given by \(\mathcal {P}_i(\textbf{x}) = \arg \min _{\textbf{c}\in \mathcal {C}_i} ||\textbf{z}- \textbf{c}||\), for \(\textbf{z}= f(\textbf{x}; \mathcal {M})\). We rely on the vanilla k-means clustering algorithm [34] with k-means++ [1] initialization, a common choice for the size of our benchmark. For even larger datasets, more scalable variants could be used, such as hierarchical [42], approximate [2], or quantized [3] k-means. Note that other works have used k-means to define pseudo-labels [7, 60]. Yet, our work is the first to learn multiple sets and subsequently use all of them.
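As an illustration of this step, the sketch below builds several pseudo-label sets from precomputed backbone features with scikit-learn's k-means (k-means++ initialization is its default); the toy feature matrix and cluster counts are placeholders, not our actual setup.

```python
import numpy as np
from sklearn.cluster import KMeans

def build_pseudo_labels(Z, cluster_counts):
    """Z: (num_images, D) array of frozen-backbone features f(x; M).
    Returns one pseudo-label vector per granularity: P_i(x) = index of the nearest centroid."""
    pseudo_label_sets = []
    for k in cluster_counts:                              # k_i grows with i: coarse -> fine
        km = KMeans(n_clusters=k, init="k-means++", n_init=10).fit(Z)
        pseudo_label_sets.append(km.labels_)              # labels_[j] = argmin_c ||Z[j] - c||
    return pseudo_label_sets

# toy usage: 1000 random "features" and three pseudo-granularities
Z = np.random.randn(1000, 384).astype(np.float32)
P_sets = build_pseudo_labels(Z, cluster_counts=[8, 32, 128])
```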

3.4 Step 2: Learning Adaptors for Each Pseudo-label Set

Given the N sets of pseudo-labels computed in the previous step, we now would like to learn adaptors tailored to each pseudo-label set, i.e. to each pseudo-granularity. We use the pretrained model \(\mathcal {M}\) as a backbone and extend it by embedding an adaptor at every layer. We then learn the adaptor parameters while keeping the backbone ones frozen. We learn a set of L adaptors for each pseudo-granularity in an independent way.

Fig. 3. Architecture of the \(l^\text {th}\) layer of the model, for N=3 adaptors.

Adaptor Architecture. Recent works in natural language processing [27, 43, 59] have embedded adaptor layers in transformer architectures. We follow a similar design and embed L adaptors, one at the end of each transformer layer of \(\mathcal {M}\).

Formally, we learn a separate set of adaptors \(\mathcal {A}_i\) for each pseudo-label set \(\mathcal {P}_i\), \(i = \{1,\ldots ,N\}\). Each set \(\mathcal {A}_i\) consists of L adaptors, denoted \(\mathcal {A}_i^1,\ldots , \mathcal {A}_i^L\). These adaptors are bottleneck layers with an intermediate dimensionality \(D^\prime \) (where \(D^\prime < D\)), a GELU layer [25] in between, and a residual connection at the end. Since we are modifying the architecture of \(\mathcal {M}\) by interleaving it with adaptor blocks, we need to revisit notations. The output of layer l in \(\mathcal {M}\) (after the basic ViT block) is now defined as \(\bar{\textbf{h}}^l = \text {MLP}(\text {LN}(\tilde{\textbf{h}}^l)) + \tilde{\textbf{h}}^l\). The output of the new layer l (the original ViT block combined with an adaptor) is still denoted \(\textbf{h}^l\). Details of the overall architecture are shown in Fig. 3.
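For illustration, a minimal PyTorch sketch of a single bottleneck adaptor \(\mathcal {A}_i^l\) as described above (down-projection to \(D^\prime \), GELU, up-projection back to D, residual connection at the end); the dimensions are placeholders.

```python
import torch.nn as nn

class Adaptor(nn.Module):
    """Bottleneck adaptor A_i^l: D -> D' -> D with a GELU in between and a residual connection."""
    def __init__(self, dim=384, bottleneck_dim=64):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck_dim)        # D  -> D'
        self.act = nn.GELU()
        self.up = nn.Linear(bottleneck_dim, dim)          # D' -> D

    def forward(self, h_bar):                             # h_bar: output of the l-th ViT block
        return self.up(self.act(self.down(h_bar))) + h_bar
```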

Learning the Adaptors. Given a set of pseudo-labels \(\mathcal {P}_i\), we can learn the parameters of the set of adaptors \(\mathcal {A}_i\) via a supervised cross-entropy loss. Specifically, we use the norm-softmax loss [33, 57, 58, 65] which, for an image \(\textbf{x}\), is given by:

$$\begin{aligned} \mathcal {L}_{cls}(\textbf{x}; y) = -\log \frac{\exp (\gamma \cos \theta _y)}{\sum _{y^\prime = 1}^{k_i} \exp (\gamma \cos \theta _{y^\prime }) }, \end{aligned}$$
(2)

where \(\gamma \) is a scale factor, \(\cos \theta _y\) is the cosine similarity to the classifier of class y, and the loss is guided by the pseudo-labels, i.e. \(y = \mathcal {P}_i(\textbf{x})\). After learning the parameters for each set \(\mathcal {P}_i\), we keep the adaptors and discard the classifiers.
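For concreteness, a PyTorch sketch of Eq. (2): logits are cosine similarities between L2-normalized features and L2-normalized class proxies, scaled by \(\gamma \), and the target is the pseudo-label \(y = \mathcal {P}_i(\textbf{x})\). The module name and the default value of \(\gamma \) are assumptions made for the example.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class NormSoftmaxLoss(nn.Module):
    """Norm-softmax classification loss guided by pseudo-labels (Eq. (2))."""
    def __init__(self, dim, num_classes, gamma=16.0):
        super().__init__()
        self.proxies = nn.Parameter(torch.randn(num_classes, dim) * 0.01)  # one proxy per cluster
        self.gamma = gamma

    def forward(self, z, y):
        # cosine similarities between normalized features and normalized class proxies
        logits = F.linear(F.normalize(z, dim=-1), F.normalize(self.proxies, dim=-1))
        return F.cross_entropy(self.gamma * logits, y)    # -log softmax(gamma * cos(theta_y))
```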

3.5 Step 3: Learning to Fuse Adaptors

The process described in Sect. 3.4 leads to N separate sets of adaptors, each tailored to a different pseudo-granularity. The next step is to unify all adaptor sets into a single architecture. To that end, we append (i.e. stack) the N adaptors of each layer in parallel, as shown in Fig. 3. We then concatenate the adaptor outputs into a tensor \(\textbf{U}^l \in \mathbb {R}^{N\times T \times D}\) for each layer \(l = \{1,\ldots ,L\}\), where each row corresponds to the output of one adaptor for this layer. Here, another residual connection is added, giving the model the opportunity to bypass the adaptors if needed. Tensor \(\textbf{U}^l\) is therefore given by \(\textbf{U}^l = \{ \mathcal {A}_i^l(\bar{\textbf{h}}^l) + \text {MLP}(\text {LN}(\tilde{\textbf{h}}^l)), i=1..N\}\) and is then fed, together with \(\bar{\textbf{h}}^l\), to a fusion layer, as detailed below.

First Option: Fusion by Average Pooling. A straightforward way of fusing the outputs of the N adaptors is to treat them as equally important and average them. The fusion layer is then simply an average pooling layer that takes tensor \(\textbf{U}^l \in \mathbb {R}^{N\times T \times D}\) as input and computes the mean over its first dimension. We refer to this simpler version of our approach as Grappa-avg.

Second Option: Learning to Fuse. Treating all adaptors as equally important for any input image goes against our intuition that different retrieval tasks are more related to certain granularities, and hence more suited for the corresponding adaptors. We therefore design a fusion layer with trainable parameters, that can learn to weigh the different adaptor outputs. We use a simple dot-product self-attention architecture over the sequence of N adaptor outputs. Yet, we make two crucial modifications to the vanilla query-key-value self-attention: a) To learn an image-level attention, we average over the T spatial tokens; b) Given that we want to fuse the adaptors but do not want to alter the adaptor representations, we omit the linear projection of the value branch, and only learn projections for the query and key branches that affect the re-weighting of adaptor features.

Specifically, the fusion layer learns an attention vector of size N over the adaptors, given inputs \(\bar{\textbf{h}}^l\) and \(\textbf{U}^l\), by \(\mathcal {F}^l(\bar{\textbf{h}}^l, \textbf{U}^l) = \alpha ^l(\bar{\textbf{h}}^l, \textbf{U}^l) \textbf{U}^l\), where vector \(\alpha ^l(\bar{\textbf{h}}^l, \textbf{U}^l) \in \mathbb {R}^N\) is given by:

$$\begin{aligned} \alpha ^l(\bar{\textbf{h}}^l, \textbf{U}^l) = \text {softmax}\left( \frac{\left( \textbf{Q}\sum _T\bar{\textbf{h}}^l\right) \left( \textbf{K}\sum _T\textbf{U}^l\right) ^{T}}{\sqrt{D}}\right) \end{aligned}$$
(3)

where \(l = \{1,\ldots ,L\}\), and \(\textbf{Q}\) and \(\textbf{K}\) are linear projections of size \(D \times D\). A final residual connection is added after the fusion layer. The architecture details of a complete layer, comprising the ViT block, the adaptors, and the fusion layer, all appended in a residual fashion, are shown in Fig. 3.
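A sketch of this fusion layer for a single image (batching omitted): an image-level attention over the N adaptor outputs with trainable \(\textbf{Q}\) and \(\textbf{K}\) projections, no value projection, and the final residual connection. Shapes and names are illustrative; Eq. (3) sums over tokens, which is equivalent to the average mentioned above up to a constant factor inside the softmax.

```python
import math
import torch
import torch.nn as nn

class AdaptorFusion(nn.Module):
    """Computes alpha^l over N adaptor outputs; only Q and K are trainable (Eq. (3))."""
    def __init__(self, dim):
        super().__init__()
        self.Q = nn.Linear(dim, dim, bias=False)          # query projection, D x D
        self.K = nn.Linear(dim, dim, bias=False)          # key projection,   D x D
        self.dim = dim

    def forward(self, h_bar, U):
        # h_bar: (T, D) output of the ViT block; U: (N, T, D) stacked adaptor outputs
        q = self.Q(h_bar.sum(dim=0))                      # (D,)   image-level query
        k = self.K(U.sum(dim=1))                          # (N, D) image-level keys, one per adaptor
        alpha = torch.softmax(k @ q / math.sqrt(self.dim), dim=0)  # (N,) attention over adaptors
        fused = torch.einsum("n,ntd->td", alpha, U)       # re-weighted adaptor features
        return fused + h_bar                              # final residual connection
```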

Given the pretrained model \(\mathcal {M}\) and multiple sets of adaptors, one way to build a single model is to select one set of adaptors per image. This amounts to guessing which pseudo-granularity best fits each image. We argue that, in a generic visual search system, “picking a granularity” for a query image depends less on the image content than on the retrieval task, i.e. the gallery used at test time. Given a dog image as query, for example, the only way to know whether we are looking for any dog image or only for images of the same dog breed is to look at the local structure of the gallery around that image. Both scenarios might favor different representations; our system reconciles them by learning a combination of adaptors. Obviously, we do not have access to the test images during training. Yet, we assume access to an unlabeled set of images \(\mathcal {D}\), representative of the target retrieval tasks, or at least of their granularity. Again, these images are provided without task labels: we do not know which retrieval task they correspond to.

Without any other supervisory signal, we argue that the local neighborhood in the feature space of the training set \(\mathcal {D}\) can be used to approximate the “granularity” of a query. In other words, we assume that visually similar images from \(\mathcal {D}\) should yield similar attention vectors over the sets of adaptors. We therefore propose to learn to fuse adaptors using a loss on neighboring image pairs in the feature space. In this step, the backbone model \(\mathcal {M}\) and the adaptors remain frozen. The fusion layer only learns \(\textbf{K}\) and \(\textbf{Q}\), the two linear projections that are multiplied to give the attention vector \(\alpha ^l\) for each ViT encoder layer l. This means that any loss applied at this fusion step only re-weights adaptor features. We denote the final model, composed of the backbone with all embedded adaptors and their fusion, as \(\mathcal {M}^*\), and the corresponding feature extractor as \(f^*(\textbf{x}, \mathcal {M}^*)\).

Attention Propagation Loss. As mentioned, we propose to train the fusion layer by leveraging the assumption that neighboring image pairs in the feature space should use similar attentions over adaptors. Let \(\mathcal {N}_k(\textbf{x};\mathcal {D})\) denote the k nearest neighbors of \(\textbf{x}\) in dataset \(\mathcal {D}\). We define neighbors \((\textbf{x}_i, \textbf{x}_j)\) as a pair of inputs such that \(\textbf{x}_j \in \mathcal {N}_k(\textbf{x}_i;\mathcal {D})\). Although neighbors could be built using the pretrained model, i.e. \(\textbf{z}= f(\textbf{x}, \mathcal {M})\) (static k-NN), the representations \(\tilde{\textbf{z}} = f^*(\textbf{x},\mathcal {M}^*)\) from the learned model \(\mathcal {M}^*\) provide a better estimate. This requires periodically updating the neighbors during training (in practice, we do so at every epoch). Given a pair of neighboring features, we bring their adaptor attentions close to each other and strive for attention consistency.
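As an illustration of how such neighbor pairs could be mined and refreshed at every epoch, the sketch below uses a brute-force cosine-similarity matrix over the current features; at the scale of our training set an approximate k-NN index would be preferable, and the function name is ours.

```python
import torch

@torch.no_grad()
def mine_knn_pairs(features, k=5):
    """features: (num_images, D) tensor of f*(x; M*) features, recomputed at each epoch.
    Returns (anchor_idx, neighbor_idx) pairs such that x_j is in N_k(x_i; D)."""
    z = torch.nn.functional.normalize(features, dim=-1)
    sim = z @ z.t()                                       # cosine similarity matrix
    sim.fill_diagonal_(-float("inf"))                     # an image is not its own neighbor
    nn_idx = sim.topk(k, dim=-1).indices                  # (num_images, k) nearest neighbors
    anchors = torch.arange(z.size(0)).repeat_interleave(k)
    return anchors, nn_idx.reshape(-1)
```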

Attention consistency is enforced using the pairwise Barlow Twins loss [64]. Given a batch of image pairs, the loss is defined over the output representations \(\tilde{\textbf{z}}_i = f^*(\textbf{x}_i; \mathcal {M}^*)\) and \(\tilde{\textbf{z}}_j = f^*(\textbf{x}_j; \mathcal {M}^*)\) of our model, computed over the \(D\times D\) cross-correlation matrix C and averaged over the batch, i.e.:

$$\begin{aligned} \mathcal {L}_{BT} =\sum _n (1 - C_{nn})^2 + \beta \sum _n \sum _{m \ne n} (C_{nm})^2, \quad C_{nm} = \frac{\sum _b g(\tilde{\textbf{z}}_i)^{b,n}\, g(\tilde{\textbf{z}}_j)^{b,m}}{\sqrt{\sum _b (g(\tilde{\textbf{z}}_i)^{b,n})^2}\, \sqrt{\sum _b (g(\tilde{\textbf{z}}_j)^{b,m})^2}}, \end{aligned}$$
(4)

where b iterates over pairs in the batch, n and m iterate over feature dimensions, \(\beta \) is a hyperparameter, and \(g(\cdot )\) is an MLP projector appended to the model and discarded after training. We refer the reader to [64] for more details.
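A minimal PyTorch sketch of Eq. (4), applied to the projected representations of the two pair members (neighbors for attention propagation, augmented views for the transformation-consistency variant below); the default value of \(\beta \) is an assumption in the spirit of [64].

```python
import torch

def barlow_twins_loss(g_zi, g_zj, beta=5e-3, eps=1e-6):
    """g_zi, g_zj: (B, D) projected representations g(z_i), g(z_j) of the two pair members."""
    # normalize each feature dimension by its L2 norm over the batch, as in Eq. (4)
    zi = g_zi / (g_zi.norm(dim=0, keepdim=True) + eps)
    zj = g_zj / (g_zj.norm(dim=0, keepdim=True) + eps)
    C = zi.t() @ zj                                               # (D, D) cross-correlation matrix
    on_diag = (1.0 - torch.diagonal(C)).pow(2).sum()              # invariance term
    off_diag = (C - torch.diag(torch.diagonal(C))).pow(2).sum()   # redundancy-reduction term
    return on_diag + beta * off_diag
```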

Originally, i.e. in [64], this loss was defined over two transformed versions of the same image (\(\textbf{x}_i = t(\textbf{x}), \textbf{x}_j = t^\prime (\textbf{x})\)). When image pairs are created using image transformations, Eq. (4) defines a transformation consistency (TC) loss \(\mathcal {L}_{TC}\). This is a variant that we consider in our benchmark, referred to as Grappa-TC. However, we are interested in applying this loss to neighboring pairs in the feature space, i.e. pairs \((\textbf{x}_i, \textbf{x}_j)\) such that \(\textbf{x}_j \in \mathcal {N}_k(\textbf{x}_i;\mathcal {D})\), and in using it for attention propagation. In this case, we depart from [64] and follow the recent TLDR method [29], which uses the Barlow Twins loss over neighbor pairs to learn a feature encoder for dimensionality reduction. Similarly, we use the Barlow Twins loss on image pairs defined using the k-NN graph. We denote the loss in Eq. (4) as an attention consistency (AC) loss, \(\mathcal {L}_{AC}\), and refer to this variant as Grappa-AC.

4 Experiments

In this section we validate the proposed Grappa on several retrieval tasks. These tasks are collected in a new benchmark that we introduce, called MRT, which unifies 6 fine-grained classification datasets under a retrieval setting. We present statistics and the evaluation protocol of this benchmark in Sect. 4.1, the methods we compare in Sect. 4.2, and all results in Sect. 4.3.

Implementation Details. We use ViT-Small [15] as a backbone architecture, with a patch size of 16 pixels and DINO [9] pre-trained weights. We generate N=8 sets of pseudo-labels on the training set of MRT, composed of 256, 1024, 4096, 8192, 16384, 32768, 65536, and 131072 clusters, respectively. We learn a set of adaptors for each pseudo-label set, using the norm-softmax loss from Eq. (2) and an Adam optimizer with a learning rate and weight decay of 0.001. We train the fusion layer over these adaptors using the Barlow Twins loss [64] and the LARS optimizer [61]. We use the same hyper-parameters for the scaling and \(\beta \) as suggested in [64], and a learning rate and weight decay of 0.5 and 0.001.

Evaluation Metrics. Recent works [17, 37] in DML have questioned standard evaluation metrics (e.g. Recall@1), arguing that they are not fair. We therefore report the \(\mathcal {R}\)-Precision (\(\mathcal {R}\)P) and MAP@R metrics recently introduced in [37].

4.1 Multiple Retrieval Tasks (MRT) Benchmark

Data. The Multiple Retrieval Tasks (MRT) benchmark combines the 6 following fine-grained datasets under a retrieval setting: Aircraft [35], Cars [32], CUB [56], Flowers [38], Food-101 [4], and Stanford Online Products (Products) [40]. We follow standard practice in the DML community and, for each dataset, assign the first half of the classes (ordered alphabetically) for training and the second half for testing. We then combine images from all the training splits into a single training set \(\mathcal {D}\) of 133,339 images and discard their class and dataset labels. This is the training set we use to learn the pseudo-labels, as well as the adaptor and the fusion parameters. We show statistics for all datasets in Table 1.
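As a side note, the class-splitting convention used to build MRT from each source dataset can be summarized by the sketch below; the helper is illustrative and not part of a released codebase.

```python
def split_classes_for_mrt(class_names):
    """First half of the alphabetically-ordered classes goes to train, second half to test."""
    ordered = sorted(class_names)
    half = len(ordered) // 2
    return set(ordered[:half]), set(ordered[half:])

# The training images of all 6 datasets whose class falls in the train split are then pooled
# into the single unlabeled set D; class and dataset labels are discarded.
```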

Table 1. Statistics of the Multiple Retrieval Tasks (MRT) benchmark. It is composed of 6 datasets. Classes in train and test are disjoint. We provide the number of classes as a reference, but labels are never used during training.

Evaluation Protocol. Models are trained on the combined training set \(\mathcal {D}\) without task nor class labels. Evaluation is performed on the test split of each task independently, following a leave-one-out protocol: each image is used as a query once to rank all the other images in the test set. For evaluation, relevance is defined according to class labels and we report mean average precision (MAP@R or mAP) and \(\mathcal {R}\)-Precision (\(\mathcal {R}\)P) over all queries.
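For reference, a sketch of this leave-one-out evaluation with \(\mathcal {R}\)-Precision and MAP@R as defined in [37]; using cosine similarity between L2-normalized features is an assumption of the example.

```python
import numpy as np

def rp_and_map_at_r(features, labels):
    """features: (num_images, D) array; labels: (num_images,) array of class labels.
    Leave-one-out retrieval: every image queries all the others.
    R-Precision: precision among the top-R results, with R the number of relevant items.
    MAP@R: mean average precision computed over those same top-R results [37]."""
    z = features / np.linalg.norm(features, axis=1, keepdims=True)
    sim = z @ z.T
    np.fill_diagonal(sim, -np.inf)                        # a query never retrieves itself
    rp, map_r = [], []
    for i in range(len(labels)):
        R = int((labels == labels[i]).sum()) - 1          # relevant items, excluding the query
        if R == 0:
            continue
        ranked = np.argsort(-sim[i])[:R]
        correct = (labels[ranked] == labels[i]).astype(np.float64)
        rp.append(correct.mean())
        precisions = np.cumsum(correct) / (np.arange(R) + 1)
        map_r.append((precisions * correct).sum() / R)
    return float(np.mean(rp)), float(np.mean(map_r))
```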

4.2 Compared Methods

Baselines. First and foremost, we compare with the DINO pretrained visual transformer of [9], a self-supervised model trained on ImageNet1K. It obtains impressive zero-shot performance on retrieval and constitutes a very strong baseline. We denote this baseline as DINO or simply \(\mathcal {M}\) in Table 2 and Fig. 4.

Table 2. Results on the Multiple Retrieval Tasks (MRT) benchmark. We report MAP@R (mAP) and \(\mathcal {R}\)P on the six datasets of MRT, obtained by a single model from those listed in Sect. 4.2, all unsupervised. The oracle (in gray) is not comparable as it selects the set of adaptors that performs best for each task.

The Grappa architecture adds an extra set of parameters in the form of adaptors and fusion layers. To verify that improvements do not simply come from these extra parameters, we report results for a second baseline that has the same number of parameters as the Grappa models, but uses neither pseudo-granularity-based adaptors nor the proposed attention consistency loss. Instead of following Step 2, we randomly initialize adaptors and finetune them while learning the fusion. For the latter, we use the Barlow Twins loss from Eq. (4) with transformation consistency, and train it on the training set of MRT, similarly to Grappa. We denote this baseline \(\mathcal {M}^*\) (random).

Proposed. We report results for the Grappa models with adaptor fusion described in Sect. 3, together with results for individual adaptors. More precisely, we build N pseudo-label sets on the training set of MRT and train N sets of adaptors on these pseudo-labels independently. We report their individual performance as \(\mathcal {M} + \mathcal {A}_i\). As mentioned above, we use DINO as a backbone and keep it frozen.

Then, using MRT's training set again, we combine these N sets of adaptors into a single model and train the fusion layer using i) a Barlow Twins loss on the final representation when creating pairs from two augmented views, reported as Grappa-TC, and ii) our proposed attention consistency framework which relies on the local neighborhood, reported as Grappa-AC. We also report results for the case where the fusion is an average pooling layer, denoted as Grappa-avg.

An Adaptor Selector Oracle. What if we could choose the best performing pseudo-granularity for each retrieval task? Obviously, this requires access to the test set labels. This also results in a different representation per task, which departs from the universal representation we seek to learn. For these reasons, we only consider this variant as an oracle, and its results should not be compared with others. We still provide it as a reference, showing how much could be achieved if we set the attention as a one-hot vector that only enables the best possible pseudo-granularity adaptor. We denote the oracle as \(\mathcal {O}\).

4.3 Results

We present our results in Table 2 and Fig. 4. Again, note that, unlike the common DML experimental setting, we use a unique model for all retrieval datasets and no class or task labels during training. We make the following observations.

Fig. 4. Results per dataset in MRT. All 6 datasets use the same model.

Baselines. We confirm the initial observation [9] that, for common DML datasets, DINO is a very strong baseline. It achieves good performance even on the more challenging metrics \(\mathcal {R}\)P and MAP@R. Also, it turned out to be very challenging to improve over DINO by keeping its backbone frozen and embedding extra modules. We did our best to learn the additional modules from scratch, but were unsuccessful. Rows 2–3 of Table 2 report the best results after hyper-parameter tuning for a single set (no fusion) and for 8 sets of adaptors with fusion. Embedding randomly initialized modules typically deteriorates the performance. This makes separately trained pseudo-granularity adaptors all the more important.

Using a Single Set of Adaptors. In rows 4–12 of Table 2 we report results when only using adaptors from a single pseudo-granularity; these results are also visualized as lavender-colored points in Fig. 4. We observe that, for each dataset, there exists at least one pseudo-label set that improves over DINO, and that the best one (reported as Oracle \(\mathcal {O}\)) is different for different retrieval tasks. The oracle results use separate models, and selecting the best one for each task requires access to labels. Considering each set of adaptors as a separate model, some improve over DINO on several retrieval tasks, showing that even individual pseudo-granularity-specific sets of adaptors can be useful. Yet, as we will see next, higher gains can be achieved by fusing multiple adaptors into a single model.

Fusion and Attention Consistency. The bottom part of Table 2 shows the performance of the final model \(\mathcal {M}^*\) that combines adaptors from all eight pseudo-granularities (rows Grappa-avg, Grappa-TC, and Grappa-AC). We see that using the proposed attention consistency (AC) loss over neighborhood pairs results in an even stronger model that improves over DINO, and over the variant that uses pairs from different augmentations of the same image, on all datasets. In fact, in most cases (i.e. apart from Products), the proposed AC loss gives results on par with the oracle \(\mathcal {O}\) that selects the right pseudo-granularity for each dataset. Moreover, for Food-101 and Cars, Grappa-AC outperforms the best adaptor for the dataset, showing that this oracle is not necessarily an upper bound: combining pseudo-granularities is beneficial. It is also worth noting that the simpler, parameter-free fusion of Grappa-avg still improves over DINO in all cases, while also being the best performing variant on the instance-level Products dataset.

Qualitative Results. Figure 5 presents qualitative results for two queries from the MRT benchmark. We present two cases where Grappa achieves significantly higher recall than DINO, i.e. cases where the local feature space is adapted in a way that brings images from the correct class closer together.

Failure Cases. As we see from Table 2 and Fig. 4, our fusion mechanism is not able to identify the best-performing pseudo-granularity for Products. We attribute this to the different topology of that dataset, and to the fact that hyperparameters were chosen to optimize performance over the union of the six datasets.

Fig. 5. Qualitative results. Queries (first column) from Flowers and Food-101 and their top 5 retrieved results (columns 2–6) by DINO [9] and Grappa.

5 Conclusions

We present Grappa, an unsupervised approach for adapting a large pretrained backbone to simultaneously tackle multiple retrieval tasks, given only an unlabeled set of training images associated with these retrieval tasks. We show that one can adapt a large pretrained vision transformer using a set of pseudo-granularity adaptors and simple fusion layers. Our models bring consistent gains over the strong DINO [9] baseline on all six retrieval tasks we adapt to. We envision this work as a first step towards models that dynamically adapt.