1 Introduction

Current machine learning algorithms require enormous amounts of training data to learn new tasks. This is an issue for many practical problems across domains such as biology and medicine where labeled data is hard to come by. In contrast, we humans can quickly learn new concepts from limited training data by relying on our past “visual experience”. Recent work attempts to emulate this by training a feature representation to classify a training dataset of “base” classes with the hope that the resulting representation generalizes not just to unseen examples of the same classes but also to novel classes, which may have very few training examples (called few-shot learning). However, training for base class classification can force the network to only encode features that are useful for distinguishing between base classes. In the process, it might discard semantic information that is irrelevant for base classes but critical for novel classes. This might be especially true when the base dataset is small or when the class distinctions are challenging.

One way to recover this useful semantic information is to leverage representation learning techniques that do not use class labels, namely, unsupervised or self-supervised learning. The key idea is to learn about statistical regularities within images, such as the spatial relationship between patches or the orientation of an image, that might be a cue to semantics. Despite recent advances, these techniques have only been applied to a few domains (e.g., entry-level classes on internet imagery), and under the assumption that large amounts of unlabeled images are available. Their applicability to the general few-shot scenario is unclear. In particular, can these techniques prevent overfitting to base classes and improve performance on novel classes in the few-shot setting? If so, does the benefit generalize across domains and to more challenging tasks? Moreover, can self-supervision boost performance in domains where even unlabeled images are hard to get?

Fig. 1. Combining supervised and self-supervised losses for few-shot learning. Self-supervised tasks such as jigsaw puzzle or rotation prediction act as a data-dependent regularizer for the shared feature backbone. Our work investigates how the performance on the target task domain (\(\mathcal{D}_s\)) is impacted by the choice of the domain used for self-supervision (\(\mathcal{D}_{ss}\)).

This paper seeks to answer these questions. We show that with no additional training data, adding a self-supervised task as an auxiliary task (Fig. 1) improves the performance of existing few-shot techniques on benchmarks across a multitude of domains (Fig. 2), in agreement with conclusions from similar recent work [18]. Intriguingly, we find that the benefits of self-supervision increase with the difficulty of the task, for example when training from a smaller base dataset, or with degraded inputs such as low resolution or greyscale images (Fig. 3).

One might surmise that as with traditional SSL, additional unlabeled images might improve performance further. But what unlabeled images should we use for novel problem domains where unlabeled data is not freely available? To answer this, we conduct a series of experiments with additional unlabeled data from different domains. We find that adding more unlabeled images improves performance only when the images used for self-supervision are within the same domain as the base classes (Fig. 4a); otherwise, they can even negatively impact the performance of the few-shot learner (Fig. 4b). Based on this analysis, we present a simple approach that uses a domain classifier to pick similar-domain unlabeled images for self-supervision from a large and generic pool of images (Fig. 5). The resulting method improves over the performance of a model trained with self-supervised learning from images within the dataset (Fig. 6). Taken together, this results in a powerful, general, and practical approach for improving few-shot learning on small datasets in novel domains. Finally, these benefits are also observed in standard classification tasks (Appendix A.3).

2 Related Work

Few-Shot Learning. Few-shot learning aims to learn representations that generalize well to novel classes for which only a few images are available. To this end, several meta-learning approaches have been proposed that evaluate representations by sampling many few-shot tasks within the domain of a base dataset. These include optimization-based meta-learners, such as the model-agnostic meta-learner (MAML) [16], gradient unrolling [49], closed-form solvers [4], and convex learners [35]. A second class of methods relies on distance-based classifiers such as matching networks [61] and prototypical networks (ProtoNet) [55]. A third class of methods [19, 47, 48] models the mapping between training data and classifier weights using a feed-forward network.

While the literature is rapidly growing, a recent study by Chen et al.  [10] has shown that the differences between meta-learners are diminished when deeper networks are used. They develop a strong baseline for few-shot learning and show that the performance of ProtoNet [55] matches or surpasses several recently proposed meta-learners. We build our experiments on top of this work and show that auxiliary self-supervised tasks provide additional benefits across a large array of few-shot benchmarks and across meta-learners.

Self-supervised Learning. Human labels are expensive to collect and hard to scale up. There has thus been increasing interest in learning representations from unlabeled data. In particular, the image itself already contains structural information that can be exploited. One class of methods removes part of the visual data and tasks the network with predicting what has been removed from the rest in a discriminative manner [34, 46, 59, 68, 69]. Another line of work treats each image (and its augmentations) as a single class and uses contrastive learning as self-supervision [3, 5, 9, 15, 22, 24, 25, 38, 43, 65]. Other self-supervised tasks include predicting rotation [20], relative patch location [13], clusters [7, 8], and the number of objects [42].

Combining different SSL tasks can also be beneficial [14], a benefit we observe in this work as well. Asano et al. [2] showed that representations can be learned with only one image and extreme augmentations. We also investigate SSL in a low-data regime, but use SSL as a regularizer instead of a pre-training task. Goyal et al. [21] and Kolesnikov et al. [31] compared various SSL tasks at scale and concluded that solving jigsaw puzzles and predicting image rotations are among the most effective, motivating the choice of tasks in our experiments. Note that these two works did not include a comparison with contrastive learning approaches.

In addition to pre-training models, SSL can also be used to improve other tasks. For example, Zhai et al.  [67] showed that self-supervision can be used to improve recognition in a semi-supervised setting and presented results on a partially labeled version of the ImageNet dataset. Carlucci et al.  [6] used self-supervision to improve domain generalization. In this work we use SSL to improve few-shot learning where the goal is to generalize to novel classes.

However, the focus of most prior work on SSL is to supplant traditional supervised representation learning with unsupervised learning on large unlabeled datasets for downstream tasks. Crucially, in almost all prior works, self-supervised representations consistently lag behind fully-supervised ones trained on the same dataset with the same architecture [21, 31]. In contrast, our work focuses on an important counterexample: self-supervision can in fact augment standard supervised training for few-shot transfer learning in the low training data regime, without relying on any external dataset.

The most related work is that of Gidaris et al.  [18] who also use self-supervision to improve few-shot learning. Although the initial results are similar (Table 3), we further show these benefits on several datasets with harder recognition problems (fine-grained classification) and with deeper models (Sect. 4.1). Moreover, we present a novel analysis of the impact of the domain of unlabeled data (Sect. 4.2). Finally, we propose a new and simple approach to automatically select similar-domain unlabeled data for self-supervision (Sect. 4.3).

Multi-task Learning. Our work is related to multi-task learning, a class of techniques that train on multiple task objectives together to improve each one. Previous works in the computer vision literature have shown moderate benefits by combining tasks such as edge, normal, and saliency estimation for images, or part segmentation and detection for humans [30, 37, 51]. However, there is significant evidence that training on multiple tasks together often hurts performance on individual tasks [30, 37]. Only certain task combinations appear to be mutually beneficial, and sometimes specific architectures are needed. Our key contribution here is showing that self-supervised tasks and few-shot learning are indeed mutually beneficial in this sense.

Domain Selection. For supervised learning, Cui et al. [12] used the Earth Mover's Distance to measure domain similarity and select the source domain for pre-training. Ngiam et al. [39] found that more pre-training data does not always help and proposed importance weights to select pre-training data. Task2vec [1] generates a task embedding given a probe network and the target dataset; such embeddings can help select a pre-training model from a pool of experts that yields better performance after fine-tuning. Unlike these, we do not assume that the source domain is labeled and rely on self-supervised learning. For self-supervised learning, Goyal et al. [21] used two pre-training and target datasets to show the importance of the source domain in large-scale self-supervised learning. Unlike this work, we investigate the performance on few-shot learning across a number of domains, and also investigate methods for domain selection. A concurrent work [62] also investigates the effect of domain shifts on SSL.

3 Method

We adopt the commonly used setup for few-shot learning where one is provided with labeled training data for a set of base classes \(\mathcal {D}_b\) and a much smaller training set (typically 1–5 examples per class) for novel classes \(\mathcal {D}_n\). The goal of the few-shot learner is to learn representations on the base classes that lead to good generalization on novel classes. Although in theory the base classes are assumed to have a large number of labeled examples, in practice this number can be quite small for novel or fine-grained domains, e.g. less than 5000 images for the birds dataset [63], making it challenging to learn a generalizable representation.

Our framework, as seen in Fig. 1, combines meta-learning approaches for few-shot learning with self-supervised learning. Denote a labeled training dataset \(\mathcal{D}_s\) as \(\{(x_i, y_i)\}_{i=1}^n\) consisting of pairs of images \(x_i \in \mathcal{X}\) and labels \(y_i \in \mathcal{Y}\). A feed-forward convolutional network f(x) maps the input to an embedding space which is then mapped to the label space using a classifier g. The overall mapping from the input to the label can be written as \(g \circ f (x): \mathcal{X}\rightarrow \mathcal{Y}\). Learning consists of estimating functions f and g that minimize an empirical loss \(\ell \) over the training data along with suitable regularization \(\mathcal{R}\) over the functions f and g. This can be written as:

$$ \mathcal{L}_s := \sum _{(x_i,y_i) \in \mathcal{D}_s} \ell \big (g\circ f (x_i), y_i\big ) + \mathcal{R}(f, g). $$

A common choice for \(\ell \) is the cross-entropy loss, and for \(\mathcal{R}\) the \(\ell _2\) norm of the parameters of the functions. In a transfer learning setting, g is discarded and relearned on training data for the novel classes.

We also consider self-supervised losses \(\mathcal{L}_{ss}\) based on pairs \(x \rightarrow (\hat{x},\hat{y})\) that can be derived automatically without any human labeling. Figure 1 shows two examples: the jigsaw task rearranges the input image and uses the index of the permutation as the target label, while the rotation task uses the angle of the rotated image as the target label. A separate function h is used to predict these labels from the shared feature backbone f with a self-supervised loss:

$$ \mathcal{L}_{ss} := \sum _{x_i \in \mathcal{D}_{ss}} \ell \big (h\circ f (\hat{x}_i), \hat{y}_i\big ). $$

Our final loss combines the two: \(\mathcal{L} := \mathcal{L}_s + \mathcal{L}_{ss}\) and thus the self-supervised losses act as a data-dependent regularizer for representation learning. The details of these losses are described in the next sections.
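
To make the objective concrete, here is a minimal PyTorch-style sketch of one update on \(\mathcal{L} = \mathcal{L}_s + \mathcal{L}_{ss}\). The toy backbone and the names `backbone`, `classifier`, and `ssl_head` are illustrative placeholders, not the authors' architecture; the only point is that both heads share the same feature extractor f.

```python
import torch
import torch.nn as nn

# Hypothetical modules: a shared feature backbone f, a supervised head g,
# and a self-supervised head h (e.g. a 4-way rotation classifier).
backbone = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1),
                         nn.AdaptiveAvgPool2d(1), nn.Flatten())
classifier = nn.Linear(16, 5)   # g: features -> base-class logits
ssl_head = nn.Linear(16, 4)     # h: features -> self-supervised labels

criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(
    list(backbone.parameters()) + list(classifier.parameters())
    + list(ssl_head.parameters()), lr=1e-3)

def training_step(x, y, x_hat, y_hat):
    """One update on L = L_s + L_ss.

    x, y        : images and labels for the supervised loss (D_s)
    x_hat, y_hat: transformed images and derived labels for the SSL loss (D_ss)
    """
    loss_s = criterion(classifier(backbone(x)), y)          # L_s
    loss_ss = criterion(ssl_head(backbone(x_hat)), y_hat)   # L_ss
    loss = loss_s + loss_ss                                  # data-dependent regularizer
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```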

Note that the domains of images used for the supervised (\(\mathcal{D}_s\)) and self-supervised (\(\mathcal{D}_{ss}\)) losses need not be identical. In particular, we would like to use larger sets of images from related domains for self-supervised learning. The key questions we ask are: (1) How effective is SSL when \(\mathcal{D}_s = \mathcal{D}_{ss}\), especially when \(\mathcal{D}_s\) is small? (2) How does a domain shift between \(\mathcal{D}_s\) and \(\mathcal{D}_{ss}\) affect generalization performance? (3) How can we select images from a large, generic pool to construct an effective \(\mathcal{D}_{ss}\) for a given target domain \(\mathcal{D}_s\)?

3.1 Supervised Losses (\(\mathcal{L}_s\))

Most of our results are presented using a meta-learner based on prototypical networks [55] that perform episodic training and testing over sampled datasets in stages called meta-training and meta-testing. During meta-training, we randomly sample N classes from the base set \(\mathcal {D}_b\), then we select a support set \(\mathcal {S}_b\) with K images per class and another query set \(\mathcal {Q}_b\) with M images per class. We call this an N-way K-shot classification task. The embeddings are trained to predict the labels of the query set \(\mathcal {Q}_b\) conditioned on the support set \(\mathcal {S}_b\) using a nearest mean (prototype) classifier. The objective is to minimize the prediction loss on the query set. Once training is complete, given the novel dataset \(\mathcal {D}_n\), class prototypes are recomputed for classification and query examples are classified based on the distances to the class prototypes.
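
As a concrete illustration, the sketch below implements the nearest-mean (prototype) classifier for a single N-way K-shot episode, assuming the support and query embeddings have already been computed by the backbone; the function and variable names are ours, not the authors' code.

```python
import torch
import torch.nn.functional as F

def prototypical_loss(support, support_labels, query, query_labels, n_way):
    """Nearest-mean (prototype) classification for one N-way K-shot episode.

    support: (N*K, d) embeddings of the support set
    query:   (N*M, d) embeddings of the query set
    Labels are integers in [0, n_way).
    """
    # Class prototypes: mean embedding of the support examples of each class.
    prototypes = torch.stack(
        [support[support_labels == c].mean(dim=0) for c in range(n_way)])  # (N, d)
    # Negative squared Euclidean distances to the prototypes act as logits.
    logits = -torch.cdist(query, prototypes) ** 2                           # (N*M, N)
    loss = F.cross_entropy(logits, query_labels)
    acc = (logits.argmax(dim=1) == query_labels).float().mean()
    return loss, acc
```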

Prototypical networks are related to distance-based learners such as matching networks [61] or metric-learning based on label similarity [29]. We also present few-shot classification results using a gradient-based meta-learner called MAML [16], and one trained with a standard cross-entropy loss on all the base classes. We also present standard classification results where the test set contains images from the same base categories in Appendix A.3.

3.2 Self-supervised Losses (\(\mathcal{L}_{ss}\))

We consider two losses motivated by a recent large-scale comparison of the effectiveness of self-supervised learning tasks [21] described below:

  • Jigsaw puzzle task loss. Here the input image x is tiled into a 3 \(\times \) 3 grid of regions which are permuted randomly to obtain an input \(\hat{x}\). The target label \(\hat{y}\) is the index of the permutation. The index (one of 9!) is reduced to one of 35 following the procedure outlined in [41], which groups the possible permutations based on their Hamming distance to control the difficulty of the task.

  • Rotation task loss. We follow the method of [20] where the input image x is rotated by an angle \(\theta \in \{0^\circ ,90^\circ ,180^\circ ,270^\circ \}\) to obtain \(\hat{x}\) and the target label \(\hat{y}\) is the index of the angle.

In both cases we use the cross-entropy loss between the target and prediction.
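
The sketch below shows one way the pairs \((\hat{x}, \hat{y})\) could be generated for the two tasks. The greedy Hamming-distance selection of 35 permutations is a rough stand-in for the precomputed permutation set of [41], the 255-pixel crop with 64 \(\times \) 64 tiles follows the image sampling described in the experiments section, and all function names are illustrative.

```python
import random
import torch

def rotation_example(img):
    """Rotate an image tensor (C, H, W) by a random multiple of 90 degrees.
    The rotation index in {0, 1, 2, 3} is the self-supervised label."""
    k = random.randint(0, 3)
    return torch.rot90(img, k, dims=(1, 2)), k

def select_permutations(n=35, n_candidates=2000, seed=0):
    """Greedily pick n tile permutations that are mutually far apart in Hamming
    distance, from a random subset of the 9! possibilities (a rough stand-in
    for the precomputed permutation set of [41])."""
    rng = random.Random(seed)
    candidates = [tuple(rng.sample(range(9), 9)) for _ in range(n_candidates)]
    chosen = [candidates.pop()]
    while len(chosen) < n:
        # Candidate whose minimum Hamming distance to the chosen set is largest.
        best = max(candidates, key=lambda p: min(
            sum(a != b for a, b in zip(p, q)) for q in chosen))
        chosen.append(best)
        candidates.remove(best)
    return chosen

def jigsaw_example(img, permutations):
    """Split a (C, 255, 255) crop into a 3x3 grid of 85x85 regions, take a
    random 64x64 crop from each, and shuffle the tiles with a randomly chosen
    permutation; the permutation index is the self-supervised label."""
    tiles = []
    for i in range(3):
        for j in range(3):
            region = img[:, i * 85:(i + 1) * 85, j * 85:(j + 1) * 85]
            top, left = random.randint(0, 21), random.randint(0, 21)
            tiles.append(region[:, top:top + 64, left:left + 64])
    idx = random.randint(0, len(permutations) - 1)
    shuffled = torch.stack([tiles[t] for t in permutations[idx]])  # (9, C, 64, 64)
    return shuffled, idx
```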

3.3 Stochastic Sampling and Training

When the images used for SSL and meta-learning are identical, i.e., \(\mathcal{D}_s\) = \(\mathcal{D}_{ss}\), the same batch of images is used for computing both losses \(\mathcal{L}_s\) and \(\mathcal{L}_{ss}\). For the experiments investigating the effect of domain shifts described in Sect. 4.2 and 4.3, where the SSL task and the meta-learner are trained on different domains, i.e., \(\mathcal{D}_s\) \(\ne \) \(\mathcal{D}_{ss}\), a separate batch of size 64 is used for computing \(\mathcal{L}_{ss}\). After the two forward passes, one for the supervised task and one for the self-supervised task, the two losses are combined and gradient updates are performed. While other techniques for combining losses exist [11, 26, 54], simply averaging the two losses performed well.
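
A minimal sketch of this interleaving is shown below. It assumes `meta_step` and `ssl_step` are callables that run the two forward passes through the shared backbone and return the respective losses; these names and the loop structure are ours, not the authors' implementation.

```python
from itertools import cycle
from torch.utils.data import DataLoader

def train_with_separate_ssl_batches(meta_step, ssl_step, episodic_loader,
                                    ssl_dataset, optimizer, n_iters):
    """Interleave a supervised episode from D_s with a 64-image SSL batch from D_ss.

    meta_step(episode) -> supervised loss L_s; ssl_step(x_hat, y_hat) -> L_ss.
    Both callables share the same feature backbone, so averaging the two losses
    regularizes its representation.
    """
    ssl_batches = cycle(DataLoader(ssl_dataset, batch_size=64, shuffle=True))
    for _, episode in zip(range(n_iters), episodic_loader):
        loss_s = meta_step(episode)            # forward pass 1: D_s
        x_hat, y_hat = next(ssl_batches)
        loss_ss = ssl_step(x_hat, y_hat)       # forward pass 2: D_ss
        loss = (loss_s + loss_ss) / 2          # simple averaging of the two losses
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```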

Table 1. Example images and dataset statistics. For few-shot learning experiments the classes are split into base, val, and novel sets. Image representations learned on the base set are evaluated on the novel set, while the val set is used for cross-validation. These datasets vary in the number of classes but are orders of magnitude smaller than the ImageNet dataset.

4 Experiments

We first describe the datasets and experimental details. In Sect. 4.1, we present the results of using SSL to improve few-shot learning on various datasets. In Sect. 4.2, we show the effect of domain shift between labeled and unlabeled data for SSL. Last, we propose a way to select images from a pool for SSL to further improve the performance of few-shot learning in Sect. 4.3.

Datasets and Benchmarks. We experiment with datasets across diverse domains: Caltech-UCSD birds [63], Stanford cars [32], FGVC aircrafts [36], Stanford dogs [27], and Oxford flowers [40]. Each dataset contains between 100 and 200 classes with a few thousand images. We also experiment with the widely-used mini-ImageNet [61] and tiered-ImageNet [50] benchmarks for few-shot learning. In mini-ImageNet each class has 600 images, while in tiered-ImageNet each class has 732 to 1300 images.

We split classes within a dataset into three disjoint sets: base, val, and novel. For each class, all the images in the dataset are used in the corresponding set. A model is trained on the base set of categories, validated on the val set, and tested on the novel set of categories given a few examples per class. For birds, we use the same split as [10], where {base, val, novel} sets have {100, 50, 50} classes respectively. The same ratio is used for the other four fine-grained datasets. We follow the original splits for mini-ImageNet and tiered-ImageNet. The statistics of various datasets used in our experiments are shown in Table 1. Notably, fine-grained datasets are significantly smaller.

We also present results on a setting where the base set is “degraded” either by (1) reducing the resolution, (2) removing color, or (3) reducing the number of training examples. This allows us to study the effectiveness of SSL on even smaller datasets and as a function of the difficulty of the task.

Meta-Learners and Feature Backbone. We follow the best practices and use the codebase for few-shot learning described in [10]. In particular, we use ProtoNet [55] with a ResNet-18 [23] network as the feature backbone. Their experiments found this to be the best performing. We also present experiments with other meta-learners such as MAML [16] and softmax classifiers in Sect. 4.1.

Learning and Optimization. We use 5-way (classes), 5-shot (examples per class) episodes with 16 query images for training. For experiments using 20% of the labeled data, we use 5 query images for training since the minimum number of images per class is 10. The models are trained with ADAM [28] with a learning rate of 0.001 for 60,000 episodes. We report the mean accuracy and the 95% confidence interval over 600 test episodes. In each test episode, N classes are selected from the novel set, and for each class 5 support images and 16 query images are selected. We report results for \(N=\{5,20\}\) classes.
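
For reference, the reported numbers can be obtained as in the small sketch below, which assumes a list of per-episode accuracies has already been collected (the episode evaluation itself is not shown).

```python
import numpy as np

def mean_and_confidence_interval(episode_accuracies, z=1.96):
    """Mean accuracy and 95% confidence interval over test episodes
    (e.g. 600 episodes of 5-way 5-shot classification)."""
    accs = np.asarray(episode_accuracies, dtype=np.float64)
    mean = accs.mean()
    ci95 = z * accs.std(ddof=1) / np.sqrt(len(accs))
    return mean, ci95

# Example usage (hypothetical evaluate_episode):
# accs = [evaluate_episode() for _ in range(600)]
# m, ci = mean_and_confidence_interval(accs)
# print(f"{m * 100:.2f} +/- {ci * 100:.2f}")
```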

Image Sampling and Data Augmentation. Data augmentation has a significant impact on few-shot learning performance. We follow the data augmentation procedure outlined in [10], which results in a strong baseline performance. For label and rotation predictions, images are first resized to 224 pixels on the shorter edge while maintaining the aspect ratio, from which a central crop of 224 \(\times \) 224 is obtained. For jigsaw puzzles, we first randomly crop a 255 \(\times \) 255 region from the original image with random scaling between [0.5, 1.0], then split it into a 3 \(\times \) 3 grid of regions, from each of which a random crop of size 64 \(\times \) 64 is picked. While it might appear that with self-supervision the model effectively sees more images, SSL provides consistent improvements even after extensive data augmentation including cropping, flipping, and color jittering. More experimental details are in Appendix A.5.
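
The torchvision sketch below reflects our reading of the two cropping pipelines; the flips and color jittering used during training are omitted, and the 3 \(\times \) 3 tiling itself happens after this transform (see the jigsaw sketch in Sect. 3.2).

```python
from torchvision import transforms

# Supervised / rotation branch: resize the shorter edge to 224 (aspect ratio
# preserved) and take a central 224x224 crop.
label_transform = transforms.Compose([
    transforms.Resize(224),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
])

# Jigsaw branch: a random 255x255 crop covering 50-100% of the image area,
# later split into a 3x3 grid with a random 64x64 crop per tile.
jigsaw_transform = transforms.Compose([
    transforms.RandomResizedCrop(255, scale=(0.5, 1.0)),
    transforms.ToTensor(),
])
```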

Other Experimental Results. In Appendix A.3, we show the benefits of using self-supervision for standard classification tasks when training the model from scratch. We further visualize these models in Appendix A.4 to show that models trained with self-supervision tend to avoid accidental correlation of background features to class labels.

Fig. 2. Benefits of SSL for few-shot learning tasks. We show the accuracy of the ProtoNet baseline with different SSL tasks added. The jigsaw task improves the 5-way 5-shot classification accuracy across datasets, and combining SSL tasks can be beneficial for some datasets. Here SSL was performed on images within the base classes only. See Appendix A.1 for a tabular version and results for 20-way 5-shot classification.

4.1 Results on Few-Shot Learning

Self-supervised Learning Improves Few-Shot Learning. Figure 2 shows the accuracies of various models on few-shot learning benchmarks. Our ProtoNet baseline matches the results on the mini-ImageNet and birds datasets presented in [10] (in their Table A5). Our results show that the jigsaw puzzle task improves the ProtoNet baseline on all seven datasets. Specifically, it reduces the relative error rate by 4.0%, 8.7%, 19.7%, 8.4%, 4.7%, 15.9%, and 27.8% on the mini-ImageNet, tiered-ImageNet, birds, cars, aircrafts, dogs, and flowers datasets respectively. Predicting rotations also improves the ProtoNet baseline on most of the datasets, except for aircrafts and flowers. We speculate this is because most flower images are symmetrical and airplanes are usually horizontal, making the rotation task too hard or too trivial respectively to benefit the main task. In addition, combining the two SSL tasks is sometimes beneficial. A tabular version and the results of 20-way classification are included in Appendix A.1.

Fig. 3. Benefits of SSL for harder few-shot learning tasks. We show the accuracy of using the jigsaw puzzle task over the ProtoNet baseline on harder versions of the datasets. SSL is effective even on smaller datasets, and the relative benefits are higher.

Table 2. Performance on few-shot learning using different meta-learners. Using jigsaw puzzle loss improves different meta-learners on most of the datasets. ProtoNet with jigsaw loss performs the best on all five datasets.

Gains Are Larger for Harder Tasks. Figure 3 shows the performance on the degraded version of the same datasets (first five groups). For cars and aircrafts we use low-resolution images where the images are down-sampled by a factor of four and up-sampled back to 224 \(\times \) 224 with bilinear interpolation. For natural categories we discard color. Low-resolution images are considerably harder to classify for man-made categories while color information is most useful for natural categories [56]. On birds and dogs datasets, the improvements using self-supervision (3.2% and 2.9% on 5-way 5-shot) are higher compared to color images (2.5% and 2.7%), similarly on the cars and aircrafts datasets with low-resolution images (2.2% and 2.1% vs. 0.7% and 0.4%). We also conduct an experiment where only 20% of the images in the base categories are used for both SSL and meta-learning (last five groups in Fig. 3). This results in a much smaller training set than standard few-shot benchmarks: 20% of the birds dataset amounts to only roughly 3% of the popular mini-ImageNet dataset. We find larger benefits from SSL in this setting. For example, the gain from the jigsaw puzzle loss for 5-way 5-shot car classification increases from 0.7% (original dataset) to 7.0% (20% training data).

Improvements Generalize to Other Meta-Learners. We combine SSL with other meta-learners and find the combination to be effective. In particular, we use MAML [16] and a standard feature extractor trained with cross-entropy loss (softmax) as in [10]. Table 2 compares meta-learners based on a ResNet-18 network trained with and without jigsaw puzzle loss. We observe that the average 5-way 5-shot accuracies across five fine-grained datasets for softmax, MAML, and ProtoNet improve from 85.5%, 82.6%, and 88.5% to 86.6%, 83.8%, and 90.4% respectively when combined with the jigsaw puzzle task. Self-supervision improves performance across different meta-learners and different datasets; however, ProtoNet trained with self-supervision is the best model across all datasets.

Table 3. Comparison with prior works on mini-ImageNet. 5-shot 5-way classification accuracies on 600 test episodes are reported. The implementation details including image size, backbone model, and training are different in each paper. \(^*\)validation classes are used for training. \(^\dagger \)dropblock [17], label smoothing, and weight decay are used.

Self-supervision Alone Is Not Enough. SSL alone significantly lags behind supervised learning in our experiments. For example, a ResNet-18 trained with SSL alone achieves 32.9% (w/ jigsaw) and 33.7% (w/ rotation) 5-way 5-shot accuracy averaged across the five fine-grained datasets. While this is better than a random initialization (29.5%), it is dramatically worse than one trained with a simple cross-entropy loss (85.5%) on the labels (details in Table 4 in Appendix A.1). Surprisingly, we also found that initialization with SSL followed by meta-learning did not yield improvements over meta-learning starting from random initialization, supporting the view that SSL acts as a feature regularizer.

Few-Shot Learning as an Evaluation for Self-supervised Tasks. The few-shot classification task provides a way of evaluating the effectiveness of self-supervised tasks. For example, on 5-way 5-shot aircrafts classification, training with only the jigsaw or the rotation task gives 38.8% and 29.5% respectively, suggesting that rotation is not an effective self-supervised task for airplanes. We speculate that this is because the task is too easy, as airplanes are usually horizontal.

Comparison with Prior Works. Our results also echo those of [18] who find that the rotation task improves on mini- and tiered-ImageNet. In addition we show the improvement still holds when using deeper networks, higher resolution images, and in fine-grained domains. We provide a comparison with other few-shot learning methods in Table 3.

4.2 Analyzing the Effect of Domain Shift for Self-supervision

Scaling SSL to the massive unlabeled datasets that are readily available for some domains is a promising avenue for improvement. However, does more unlabeled data always help the task at hand? This question has not been sufficiently addressed in the literature, as most prior works study the effectiveness of SSL on a curated set of images, such as ImageNet, and its transferability to a handful of tasks. We conduct a series of experiments to characterize the effect of the size and distribution of the images \(\mathcal{D}_{ss}\) used for SSL in the context of few-shot learning on a domain \(\mathcal{D}_s\).

Fig. 4. Effect of size and domain of SSL on 5-way 5-shot classification accuracy. (a) More unlabeled data from the same domain for SSL improves the performance of the meta-learner. (b) Replacing a fraction (x-axis) of the images with those from other domains makes SSL less effective.

First, we investigate if SSL on unlabeled data from the same domain improves the meta-learner. We use 20% of the images in the base categories for meta-learning identical to the setting in Fig. 3. The labels of the remaining 80% data are withheld and only the images are used for SSL. We systematically vary the number of images used by SSL from 20% to 100%. The results are presented in Fig. 4a. The accuracy improves with the size of the unlabeled set with diminishing returns. Note that 0% corresponds to no SSL and 20% corresponds to using only the labeled images for SSL (\(\mathcal{D}_{s} = \mathcal{D}_{ss}\)).

Figure 4b shows an experiment where a fraction of the unlabeled images is replaced with images from the other four datasets. For example, 20% along the x-axis for birds indicates that 20% of the images in the base set are replaced by images drawn uniformly at random from the other datasets. Since the number of images used for SSL is identical, the x-axis from left to right represents increasing amounts of domain shift between \(\mathcal{D}_s\) and \(\mathcal{D}_{ss}\). We observe that the effectiveness of SSL decreases as the fraction of out-of-domain images increases. Importantly, training with SSL on the available 20% within-domain images (shown as crosses) is often (on 3 out of 5 datasets) better than a five times larger set of images that includes out-of-domain images.

Fig. 5. Overview of domain selection for self-supervision. Top: We first train a domain classifier using \(\mathcal{D}_s\) and (a subset of) \(\mathcal{D}_p\), then select images for self-supervision using the predictions of the domain classifier. Bottom: Selected images from each dataset using importance weights.

Fig. 6. Effectiveness of selected images for SSL. With random selection, the extra unlabeled data often hurts performance, while images sampled using the importance weights improve performance on all five datasets. A tabular version is shown in Appendix A.2.

4.3 Selecting Images for Self-supervision

Based on the above analysis we propose a simple method to select images for SSL from a large, generic pool of unlabeled images in a dataset-dependent manner. We use a "domain weighted" model that selects the top images according to a domain classifier, in our case a binary logistic regression model trained on ResNet-101 image features, with images from the source domain \(\mathcal{D}_s\) as the positive class and images from the pool \(\mathcal{D}_{p}\) as the negative class. The top images are selected according to the ratio \(p(x \in \mathcal{D}_s)/p(x \in \mathcal{D}_{p})\). Note that these importance weights account for the domain shift. Figure 5 shows an overview of the selection process.
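
A minimal sketch of this selection step is shown below, assuming the ResNet-101 features have already been extracted; the posterior odds of the classifier serve as a proxy for the ratio \(p(x \in \mathcal{D}_s)/p(x \in \mathcal{D}_{p})\) (up to a constant prior factor), and the function name is illustrative.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def select_ssl_images(feats_source, feats_pool, n_select):
    """Rank pool images by the importance weight estimated by a binary domain
    classifier on precomputed image features.

    feats_source: (n_s, d) features of labeled source-domain images D_s
    feats_pool:   (n_p, d) features of the generic unlabeled pool D_p
    Returns the indices of the top n_select pool images to use as D_ss.
    """
    X = np.concatenate([feats_source, feats_pool])
    y = np.concatenate([np.ones(len(feats_source)), np.zeros(len(feats_pool))])
    clf = LogisticRegression(max_iter=1000).fit(X, y)
    p_source = clf.predict_proba(feats_pool)[:, 1]     # p(x in D_s | x)
    weights = p_source / (1.0 - p_source + 1e-12)      # posterior odds as proxy ratio
    return np.argsort(-weights)[:n_select]
```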

We evaluate this approach using a pool of images \(\mathcal{D}_{p}\) consisting of (1) the training images of the "bounding box" subset of Open Images V5 [33], which has 1,743,042 images from 600 classes, and (2) the iNaturalist 2018 dataset [60], which has 461,939 images from 8162 species. For each dataset, we use 20% of the labeled images as \(\mathcal{D}_s\). The remaining 80% of the data is used only for the "oracle" setting, where the unlabeled data is drawn from exactly the same distribution as \(\mathcal{D}_s\). We show some of the selected images for self-supervision \(\mathcal{D}_{ss}\) in Fig. 5.

Figure 6 shows the results of ProtoNet trained on 20% labeled examples with the jigsaw puzzle as self-supervision. For a fair comparison, methods that select images from the pool select the same number of images (80% of the original labeled dataset size) as \(\mathcal{D}_{ss}\). We report the mean accuracy of five runs. "SSL with 20% dataset" denotes a baseline that uses only \(\mathcal{D}_s\) for self-supervision (\(\mathcal{D}_{s}=\mathcal{D}_{ss}\)), which is our reference "lower bound". SSL pool "(random)" and "(weight)" denote two approaches for selecting images for self-supervision. The former selects images uniformly at random, which is detrimental for cars, dogs, and flowers. The pool selected according to the importance weights provides significant improvements over the "no SSL", "SSL with 20% dataset", and "random selection" baselines on all five datasets. The oracle is trained with the remaining 80% of the original dataset as \(\mathcal{D}_{ss}\) and serves as a reference "upper bound".

5 Conclusion

Self-supervision improves the performance on few-shot learning tasks across a range of different domains. Surprisingly, we found that self-supervision is more beneficial for more challenging problems, especially when the number of images used for self-supervision is small, orders of magnitude smaller than in previously reported results. This has the practical benefit that the images within small datasets can be used for self-supervision without relying on a large-scale external dataset. We have also shown that additional unlabeled images improve performance only if they come from the same or similar domains. Finally, for domains where unlabeled data is limited, we present a novel, simple approach to automatically identify such similar-domain images from a larger pool.

Future work could investigate whether other self-supervised tasks, in particular contrastive learning approaches [3, 22, 24, 38, 58], can also improve few-shot learning. Future work could also investigate how and when self-supervision improves generalization across self-supervised and supervised tasks empirically [1, 66].