
1 Introduction

Fig. 1. The Broader Study of Cross-Domain Few-Shot Learning (BSCD-FSL) benchmark. ImageNet is used for source training, and domains of varying dissimilarity from natural images are used for target evaluation. Similarity is measured by 3 orthogonal criteria: 1) existence of perspective distortion, 2) the semantic content, and 3) color depth. No data is provided for meta-learning, and target classes are disjoint from the source classes.

Training deep neural networks for visual recognition typically requires a large number of labelled examples [28]. The generalization ability of deep neural networks relies heavily on the size and variation of the dataset used for training. However, collecting sufficient data for certain classes may be impossible in practice: in dermatology, for example, there are many rare diseases, or diseases that become rare for particular types of skin [1, 25, 48]. In other domains, such as satellite imagery, rare categories such as airplane wreckage arise. Although each individual situation may not carry heavy cost, as a group across many such conditions and modalities, correct identification is critically important, and it remains a significant challenge where access to expertise may be impeded.

Although humans generalize to recognize new categories from few examples in certain circumstances, such as when categories exhibit predictable variations across examples and have reasonable contrast from background [31, 32], even humans have trouble recognizing new categories that vary too greatly between examples or differ from prior experience, such as for diagnosis in dermatology, radiology, or other fields [48]. Because there are many applications where learning must work from few examples, and both machines and humans have difficulty learning in these circumstances, finding new methods to tackle the problem remains a challenging but desirable goal.

The problem of learning to categorize classes from very few training examples is the topic of the “few-shot learning” field, and has been the subject of a large body of recent work [5, 13, 34, 43, 53, 55, 60]. Few-shot learning typically consists of two stages: meta-learning and meta-testing. In the meta-learning stage, an abundance of base classes is available, on which a system can be trained to learn well from few examples within that particular domain. In the meta-testing stage, a set of novel classes with very few examples per class is used to adapt and evaluate the trained model. However, recent work [5] points out that meta-learning based few-shot learning algorithms underperform traditional pre-training and fine-tuning when there is a large shift between the base and novel class domains. This is a major issue that occurs commonly in practice: by the nature of the problem, collecting data from the same domain for many few-shot classification tasks is difficult. This scenario is referred to as cross-domain few-shot learning, to distinguish it from the conventional few-shot learning setting. Although benchmarks for conventional few-shot learning are well established, evaluation benchmarks for cross-domain few-shot learning are still in their early stages. Established works in this space have built cross-domain evaluation benchmarks that are limited to natural images [5, 56, 58]. Under these circumstances, useful knowledge may still transfer effectively across the different domains of natural images, implying that methods designed in this setting may not continue to perform well when applied to other types of images, such as industrial natural images, satellite images, or medical images. Currently, no works study this scenario.

To fill this gap, we propose the Broader Study of Cross-Domain Few-Shot Learning (BSCD-FSL) benchmark (Fig. 1), which covers a spectrum of image types with varying levels of similarity to natural images. Similarity is defined by 3 orthogonal criteria: 1) whether images contain perspective distortion, 2) the semantic content of images, and 3) color depth. The datasets include agriculture images (natural images, but specific to the agriculture industry), satellite images (which lose perspective distortion), dermatology images (which lose perspective distortion and contain different semantic content), and radiological images (different according to all 3 criteria). The performance of existing state-of-the-art meta-learning methods, transfer learning methods, and methods tailored for cross-domain few-shot learning is then rigorously tested on the proposed benchmark.

In summary, the contributions of this paper are itemized as follows:

  • We establish a new Broader Study of Cross-Domain Few-Shot Learning (BSCD-FSL) benchmark, consisting of a diversity of image types with varying dissimilarity to natural images, according to 1) perspective distortion, 2) the semantic content, and 3) color depth.

  • Under these conditions, we extensively evaluate the performance of current meta-learning methods, including methods specifically tailored for cross-domain few-shot learning, as well as variants of fine-tuning.

  • The results demonstrate that state-of-the-art meta-learning methods are outperformed by older meta-learning approaches, and that all meta-learning methods underperform simple fine-tuning by 12.8% average accuracy. In some cases, meta-learning underperforms even networks with random weights.

  • Results also show that accuracy gains for cross-domain few-shot learning methods are lost in this new challenging benchmark.

  • Finally, we find that the accuracy of all methods correlates with the proposed measure of data similarity to natural images, verifying the diversity of the problem representation and the value of the benchmark towards future research.

We believe this work will help the community understand what methods are most effective in practice, and help drive further advances that can more quickly yield benefit for real-world applications.

2 Related Work

Few-Shot Learning. Few-shot learning [31, 32, 60] is an increasingly important topic in machine learning. Many few-shot methods have been proposed, including meta-learning, generative and augmentation approaches, semi-supervised methods, and transfer learning.

Meta-learning methods aim to learn models that can be quickly adapted using a few examples [13, 33, 53, 55, 60]. MatchingNet [60] learns an embedding that can map an unlabelled example to its label using a small number of labelled examples, while MAML [13] aims at learning good initialization parameters that can be quickly adapted to a new task. In ProtoNet [53], the goal is to learn a metric space in which classification can be conducted by calculating distances to prototype representations of each class. RelationNet [55] targets learning a deep distance metric to compare a small number of images. More recently, MetaOpt [33] learns feature embeddings that can generalize well under a linear classification rule for novel categories.

The generative and augmentation based family of approaches learns to generate more samples from the few examples available in a given few-shot learning task. These methods include applying augmentation strategies learned from data [36], synthesizing new data from few examples using a generative model, or using external data to obtain additional examples that facilitate learning on a given few-shot task. In [19, 52], the intra-class relations between pairs of instances of reference categories are modeled in feature space, and this information is then transferred to the novel category instances to generate additional examples in that same feature space. In [63], a generator sub-net is added to a classifier network and trained to synthesize new examples on the fly in order to improve classifier performance when fine-tuned on a novel (few-shot) task. In [44], few-shot class density estimation is performed with an auto-regressive model combined with an attention mechanism, where examples are synthesized by a sequential process. In [6, 51, 67], label and attribute semantics are used as additional information for training an example synthesis network.

In some situations, additional unlabeled data accompanies the few-shot task. In semi-supervised few-shot learning [2, 35, 37, 45, 49], the unlabeled data comes in addition to the support set and is assumed to have a distribution similar to that of the target classes (although some noise from unrelated samples is also allowed). In LST [35], self-labeling and soft attention are used on the unlabeled samples intermittently with fine-tuning on the labeled and self-labeled data. Similarly to LST, [45] updates the class prototypes using k-means-like iterations initialized from the ProtoNet prototypes. In [2], unlabeled examples are used through soft-label propagation. In [15, 24, 37], graph neural networks are used to share information between labeled and unlabeled examples in the semi-supervised [15, 37] and transductive [24] FSL settings. Notably, in [37] a graph construction network is used to predict the task-specific graph for propagating labels between samples of a semi-supervised FSL task.

Transfer learning [42] is based on the idea of reusing features learned from the base classes for the novel classes, and is conducted mainly by fine-tuning, which adjusts a pre-trained model from a source task to a target task. Yosinski et al. [66] conducted extensive experiments to investigate the transfer utility of pre-trained deep neural networks. In [27], the authors investigated whether higher-performing ImageNet models transfer better to new tasks. Ge et al. [16] proposed a selective joint fine-tuning method for improving the performance of models with a limited amount of training data. In [18], the authors proposed an adaptive fine-tuning scheme to decide which layers of the pre-trained network should be fine-tuned. Finally, in [10], the authors found that simple transductive fine-tuning beats all prior state-of-the-art meta-learning approaches.

Common to all few-shot learning methods is the assumption that base classes and novel classes are from the same domain. The current evaluation benchmarks are miniImageNet [60], CUB [61], Omniglot [31], CIFAR-FS [3], and tieredImageNet [46]. In [56], the authors proposed Meta-Dataset, a newer benchmark for training and evaluating few-shot learning algorithms that includes a greater diversity of image content. Although this benchmark is broader than prior works, the included datasets are still limited to natural images, and both the base classes and novel classes are from the same domain. Recently, [47] proposed a successful meta-learning approach based on conditional neural processes for the Meta-Dataset benchmark.

Cross-Domain Few-Shot Learning. In cross-domain few-shot learning, base and novel classes are drawn from different domains, and the class label sets are disjoint. Recent works on cross-domain few-shot learning include an analysis of existing meta-learning approaches in the cross-domain setting [5], specialized methods using a feature-wise transform to encourage learning representations with improved ability to generalize [58], and work studying cross-domain few-shot learning constrained to the setting of images of items in museum galleries [26]. Common to all these prior works is that they limit the cross-domain setting to the realm of natural images, which still retain a high degree of visual similarity, and do not capture the broader spectrum of image types encountered in practice, such as industrial, aerial, and medical images, where cross-domain few-shot learning techniques are in high demand.

3 Proposed Benchmark

In this section, we introduce the Broader Study of Cross-Domain Few-Shot Learning (BSCD-FSL) benchmark, which includes data from the CropDiseases [40], EuroSAT [21], ISIC2018 [8, 57], and ChestX [62] datasets. These datasets cover plant disease images, satellite images, dermoscopic images of skin lesions, and X-ray images, respectively. The selected datasets reflect well-curated real-world use cases for few-shot learning. In addition, collecting enough examples in the above domains is often difficult, expensive, or in some cases not possible. Image similarity to natural images is measured by 3 orthogonal criteria: 1) existence of perspective distortion, 2) the semantic data content, and 3) color depth. According to these criteria, the datasets span the following spectrum of image types: 1) CropDiseases images are natural images, but are very specialized (similar to the existing cross-domain few-shot setting, but specific to the agriculture industry), 2) EuroSAT images are less similar as they have lost perspective distortion, but are still color images of natural scenes, 3) ISIC2018 images are even less similar as they have lost perspective distortion and no longer represent natural scenes, and 4) ChestX images are the most dissimilar as they have lost perspective distortion, do not represent natural scenes, and have lost 2 color channels. Example images from ImageNet and the proposed benchmark datasets are shown in Fig. 1.

Having a few-shot learning model trained on a source domain such as ImageNet [9] that can generalize to domains such as these is highly desirable, as it enables effective learning for rare categories in new types of images, which has previously not been studied in detail.

4 Cross-Domain Few-Shot Learning Formulation

The cross-domain few-shot learning problem can be formalized as follows. We define a domain as a joint distribution P over input space \(\mathcal {X}\) and label space \(\mathcal {Y}\). The marginal distribution of \(\mathcal {X}\) is denoted as \(P_\mathcal {X}\). We use the pair (x, y) to denote a sample x and the corresponding label y from the joint distribution P. For a model \(f_\theta \): \(\mathcal {X}\) \(\rightarrow \) \(\mathcal {Y}\) with parameters \(\theta \) and a loss function \(\ell \), the expected error is defined as,

$$\begin{aligned} \epsilon (f_\theta ) = E_{(x, y) \sim P} [\ell (f_\theta (x), y)] \end{aligned}$$
(1)

In cross-domain few-shot learning, we have a source domain \((\mathcal {X}_s, \mathcal {Y}_s)\) and a target domain \((\mathcal {X}_t, \mathcal {Y}_t)\) with joint distributions \(P_s\) and \(P_t\) respectively, where \(P_{\mathcal {X}_s} \ne P_{\mathcal {X}_t}\) and \(\mathcal {Y}_s\) is disjoint from \(\mathcal {Y}_t\). The base classes data are sampled from the source domain and the novel classes data are sampled from the target domain. During the training or meta-training stage, the model \(f_\theta \) is trained (or meta-trained) on the base classes data. During the testing (or meta-testing) stage, the model is presented with a support set \(S = \{x_i, y_i\}_{i=1}^{K \times N}\) consisting of N labelled examples from each of K novel classes. This configuration is referred to as “K-way N-shot” few-shot learning, as the support set has K novel classes and each novel class has N training examples. After the model is adapted to the support set, a query set from the novel classes is used to evaluate the model performance.
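To make the episode construction concrete, the following is a minimal sketch (plain Python) of how a K-way N-shot support set and its query set could be sampled from a labelled target dataset. The function name `sample_episode` and the list-of-(x, y)-pairs dataset format are illustrative assumptions, not part of the benchmark code.

```python
import random
from collections import defaultdict

def sample_episode(dataset, k_way=5, n_shot=5, n_query=15, seed=None):
    """Sample one K-way N-shot episode (support + query) from a labelled dataset.

    `dataset` is assumed to be an iterable of (x, y) pairs; names here are
    illustrative, not taken from the benchmark code.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for x, y in dataset:
        by_class[y].append(x)

    # Choose K novel classes that have enough examples for support + query.
    eligible = [c for c, xs in by_class.items() if len(xs) >= n_shot + n_query]
    classes = rng.sample(eligible, k_way)

    support, query = [], []
    for label, c in enumerate(classes):            # relabel classes 0..K-1
        xs = rng.sample(by_class[c], n_shot + n_query)
        support += [(x, label) for x in xs[:n_shot]]
        query += [(x, label) for x in xs[n_shot:]]
    return support, query
```

In the evaluation below, such episodes are drawn 600 times per dataset and the adapted model is scored on the query set of each episode.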

5 Evaluated Methods for Cross-Domain Few-Shot Learning

In this section, we describe the few-shot learning algorithms that will be evaluated on our proposed benchmark.

5.1 Meta-learning Based Methods

Single Domain Methods. Meta-learning [13, 43], or learning to learn, aims at learning task-agnostic knowledge in order to efficiently learn on new tasks. Each task \(\mathcal {T}_i\) is assumed to be drawn from a fixed distribution, \(\mathcal {T}_i \sim P(\mathcal {T})\). Specifically, in few-shot learning, each task \(\mathcal {T}_i\) is a small dataset. \(P_s(\mathcal {T})\) and \(P_t(\mathcal {T})\) denote the task distributions of the source (base) classes data and the target (novel) classes data, respectively. During the meta-training stage, the model is trained on T tasks \(\{\mathcal {T}_i\}_{i=1}^T\) sampled independently from \(P_s(\mathcal {T})\). During the meta-testing stage, the model is expected to quickly adapt to a new task \(\mathcal {T}_j \sim P_t(\mathcal {T})\).

Meta-learning methods differ in how they learn the parameters of the initial model \(f_\theta \) on the base classes data. In MatchingNet [60], the goal is to learn a model \(f_\theta \) that can map an unlabelled example \(\hat{x}\) to its label \(\hat{y}\) using a small labelled set as \(\hat{y} = \sum _{j=1}^{K \times N}a_\theta (\hat{x}, x_j)y_j\), where \(a_\theta \) is an attention kernel that leverages \(f_\theta \) to compute the distance between the unlabelled example \(\hat{x}\) and the labelled example \(x_j\), and \(y_j\) is the one-hot representation of the label. In contrast, MAML [13] aims at learning an initial parameter \(\theta \) that can be quickly adapted to a new task; this is achieved by updating the model parameters via a two-stage optimization process. ProtoNet [53] represents each class k with the mean vector of the embedded support examples, \(c_k = \frac{1}{N} \sum _{j=1}^{N}f_\theta (x_j)\). Classification is then conducted by calculating the distance of an example to the prototype representation of each class. In RelationNet [55], the metric of the nearest-neighbor classifier is meta-learned using a Siamese network trained for optimal comparison between query and support samples. More recently, MetaOpt [33] employs convex base learners and aims at learning feature embeddings that generalize well under a linear classification rule for novel categories. All existing meta-learning methods implicitly assume that \(P_s(\mathcal {T})\) = \(P_t(\mathcal {T})\), so that the task-agnostic knowledge learned in the meta-training stage can be leveraged for fast learning on novel classes. However, in cross-domain few-shot learning \(P_s(\mathcal {T})\) \(\ne \) \(P_t(\mathcal {T})\), which poses severe challenges for current meta-learning methods.
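As an illustration of the prototype-based classification rule described above, the following is a minimal PyTorch sketch. The embedding network `f_theta`, the tensor layout, and the function name are assumptions for illustration; the class means and negative squared Euclidean distances follow the ProtoNet formulation.

```python
import torch

def prototypical_logits(f_theta, support_x, support_y, query_x, k_way):
    """ProtoNet-style classification: class prototypes are the mean embeddings
    of the support examples, and queries are scored by the negative squared
    Euclidean distance to each prototype. `f_theta` is any embedding network,
    and `support_y` is assumed to hold integer labels 0..k_way-1."""
    z_support = f_theta(support_x)                    # (K*N, d)
    z_query = f_theta(query_x)                        # (Q, d)
    prototypes = torch.stack([
        z_support[support_y == k].mean(dim=0) for k in range(k_way)
    ])                                                # (K, d)
    dists = torch.cdist(z_query, prototypes) ** 2     # (Q, K)
    return -dists  # a softmax over these logits gives p(y = k | x)
```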

Cross-Domain Methods. Only a few methods specifically tailored to the cross-domain few-shot learning setting have been explored previously, including the feature-wise transform (FWT) [58] and Adversarial Domain Adaptation with Reinforced Sample Selection (ADA-RSS) [11]. Since the problem setting of ADA-RSS requires unlabelled data in the target domain, we study FWT alone.

FWT is a model-agnostic approach that adds feature-wise transform layers to pre-trained models; the layers' scale and shift parameters are either learned from a collection of several dataset domains or determined empirically from a single dataset domain. Both approaches have previously been found to improve performance. Since our benchmark uses ImageNet as the single source domain, we focus on the single data domain approach. The method is studied in combination with all meta-learning algorithms described in the prior section.
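For concreteness, the following is a rough PyTorch sketch of a feature-wise transformation layer in the spirit of FWT [58]. The class name, the default standard deviations, and the placement after batch normalization are assumptions made for illustration; this is not the reference implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FeatureWiseTransform(nn.Module):
    """Sketch of a feature-wise transformation layer: during (meta-)training,
    per-channel scale and shift are sampled from Gaussians whose spreads are
    either learned (multi-domain case) or fixed empirically (single-domain
    case); at test time the layer is an identity."""

    def __init__(self, num_channels, gamma_std=0.3, beta_std=0.5, learnable=False):
        super().__init__()
        init_g = torch.full((1, num_channels, 1, 1), gamma_std)
        init_b = torch.full((1, num_channels, 1, 1), beta_std)
        if learnable:   # spreads learned from multiple source domains
            self.gamma_std = nn.Parameter(init_g)
            self.beta_std = nn.Parameter(init_b)
        else:           # spreads fixed empirically, single source-domain variant
            self.register_buffer('gamma_std', init_g)
            self.register_buffer('beta_std', init_b)

    def forward(self, x):  # x: (B, C, H, W), e.g. the output of a BatchNorm layer
        if not self.training:
            return x
        gamma = 1.0 + torch.randn_like(x[:, :, :1, :1]) * F.softplus(self.gamma_std)
        beta = torch.randn_like(x[:, :, :1, :1]) * F.softplus(self.beta_std)
        return gamma * x + beta
```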

5.2 Transfer Learning Based Methods

An alternative way to tackle the problem of few-shot learning is based on transfer learning, where an initial model \(f_\theta \) is trained on the base classes data in a standard supervised learning way and reused on the novel classes. There are several options to realize the idea of transfer learning for few-shot learning:

Single Model Methods. In this paper, we extensively evaluate the following common variants of single model fine-tuning:

  • Fixed feature extractor (Fixed): simply leverage the pre-trained model as a fixed feature extractor.

  • Fine-tuning all layers (Ft All): adjusts all the pre-trained parameters on the new task with standard supervised learning.

  • Fine-tuning last-k (Ft Last-k): only the last k layers of the pre-trained model are optimized for the new task. In this paper, we consider fine-tuning the last 1, last 2, and last 3 layers.

  • Transductive fine-tuning (Transductive Ft): in transductive fine-tuning, the statistics of the query images are used via batch normalization [10, 41].

In addition, we compare these single model transfer learning techniques against a baseline embedding formed by a randomly initialized network (termed Random), to contrast against a fixed feature vector that has no pre-training. All variants of single model fine-tuning are based on a linear classifier, but differ in their approach to fine-tuning the single model feature extractor, as sketched below.
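The sketch below (PyTorch, illustrative names only) shows how these variants differ merely in which backbone parameters remain trainable once a new linear classifier is attached; the `backbone` structure, the helper name, and the notion of "layer" as a top-level block are assumptions.

```python
import torch.nn as nn
import torch.optim as optim

def build_finetune_model(backbone, feat_dim, k_way, variant='Ft All', last_k=1):
    """Attach a new linear classifier to a pre-trained backbone and freeze
    parameters according to the fine-tuning variant. `backbone` is assumed to
    be an nn.Sequential-like feature extractor whose output flattens to a
    vector of size `feat_dim`."""
    classifier = nn.Linear(feat_dim, k_way)
    model = nn.Sequential(backbone, nn.Flatten(), classifier)

    if variant == 'Fixed':                 # fixed feature extractor
        for p in backbone.parameters():
            p.requires_grad = False
    elif variant == 'Ft Last-k':           # only the last k backbone blocks train
        blocks = list(backbone.children())
        for block in blocks[:len(blocks) - last_k]:
            for p in block.parameters():
                p.requires_grad = False
    # 'Ft All': every parameter stays trainable.

    # SGD with momentum on the trainable parameters, as in the evaluation setup.
    params = [p for p in model.parameters() if p.requires_grad]
    optimizer = optim.SGD(params, lr=0.01, momentum=0.9)
    return model, optimizer
```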

Another line of work for few-shot learning uses a broader variety of classifiers for transfer learning. For example, recent works show that a mean-centroid classifier and a cosine-similarity based classifier are more effective than a linear classifier for few-shot learning [5, 39]. Therefore, we study these two variations as well.

Mean-Centroid Classifier. The mean-centroid classifier is inspired by ProtoNet [53]. We are given the pre-trained model \(f_{\theta }\) and a support set \(S = \{x_i, y_i\}_{i=1}^{K \times N}\), where K is the number of novel classes and N is the number of images per class. The class prototypes are computed in the same way as in ProtoNet. The likelihood that an unlabelled example \(\hat{x}\) belongs to class k is then computed as,

$$\begin{aligned} p(y=k \mid \hat{x}) = \frac{\exp (-d(f_{\theta }(\hat{x}), c_k))}{\sum _{l=1}^K \exp (-d(f_{\theta }(\hat{x}), c_l))} \end{aligned}$$
(2)

where d() is a distance function. In the experiments, we use negative cosine similarity. Different from ProtoNet, \(f_\theta \) is pretrained on the base classes data in a standard supervised learning way.

Cosine-Similarity Based Classifier. In the cosine-similarity based classifier, instead of directly computing the class prototypes with the pre-trained model, each class k is represented by a d-dimensional weight vector \(\mathbf {w}_k\) that is initialized randomly. For each unlabelled example \(\hat{x}_i\), the cosine similarity to each weight vector is computed as \(c_{i,k} = \frac{f_{\theta }(\hat{x}_i)^T \mathbf {w}_k}{\Vert f_{\theta }(\hat{x}_i) \Vert \Vert \mathbf {w}_k \Vert }\). The predictive probability that the example \(\hat{x}_i\) belongs to class k is computed by normalizing the cosine similarities with a softmax function. Intuitively, the weight vector \(\mathbf {w}_k\) can be thought of as the prototype of class k.
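A minimal PyTorch sketch of such a cosine-similarity based classifier head is given below; the class name and the scaling factor applied to the similarities before the softmax are assumptions for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CosineClassifier(nn.Module):
    """Cosine-similarity based classifier: each class k is a learnable weight
    vector w_k, and logits are (scaled) cosine similarities between the
    embedded example and each w_k."""

    def __init__(self, feat_dim, k_way, scale=10.0):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(k_way, feat_dim))  # one random vector per class
        self.scale = scale  # temperature-like factor; value is an assumption

    def forward(self, features):              # features: (B, feat_dim) from f_theta
        f = F.normalize(features, dim=-1)
        w = F.normalize(self.weight, dim=-1)
        return self.scale * f @ w.t()         # softmax over these gives p(y = k | x)
```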

Transfer from Multiple Pre-trained Models. In this section, we describe a straightforward method that utilizes multiple models pre-trained on source domains of natural images similar to ImageNet. Note that all domains are still disjoint from the target datasets for the cross-domain few-shot learning setting. The purpose is to measure how much performance may improve by utilizing an ensemble of models trained from data that is different from the target domain. The described method requires no change to how models are trained and is an off-the-shelf solution to leverage existing pre-trained models for cross-domain few-shot learning, without requiring access to the source datasets.

Assume we have a library of C pre-trained models \(\{M_c\}_{c=1}^{C}\) which are trained on various datasets in a standard way. We denote the layers of all pre-trained models as a set F. Given a support set \(S = \{x_i, y_i\}_{i=1}^{K \times N}\) where \((x_i, y_i) \sim P_t\), our goal is to find a subset I of the layers to generate a feature vector for each example in order to achieve the lowest test error. Mathematically,

$$\begin{aligned} \mathop {\mathrm {arg}\,\mathrm {min}}\limits _{I \subseteq F} \; E_{(x, y) \sim P_t} \, \ell \big ( f_s( T (\{l(x): l \in I \})), y \big ) \end{aligned}$$
(3)

where \(\ell \) is a loss function, T() is a function that concatenates a set of feature vectors, l is one particular layer in the set I, and \(f_s\) is a linear classifier. In practice, feature maps l(x) coming from inner layers are three-dimensional, so we convert them to one-dimensional vectors using global average pooling. Since Eq. 3 is generally intractable, we instead adopt a two-stage greedy selection method, called Incremental Multi-model Selection, to iteratively find the best subset of layers for a given support set S.

In the first stage, for each pre-trained model, we train a linear classifier on the feature vector generated by each layer individually, and select the layer that achieves the lowest average error under five-fold cross-validation on the support set S. Essentially, the goal of the first stage is to find the most effective layer of each pre-trained model for the given task, in order to reduce the search space and mitigate the risk of overfitting. For convenience, we denote the layers selected in this first stage as the set \(I_1\). In the second stage, we greedily add the layers in \(I_1\) to the set I following a similar cross-validation procedure. First, we add to I the layer in \(I_1\) that achieves the lowest cross-validation error. Then we iterate over \(I_1\) and add each remaining layer to I if the cross-validation error is reduced when the new layer is added. Finally, we concatenate the feature vectors generated by the layers in the set I and train the final linear classifier. Please see Algorithm 1 in the Appendix for further details.
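The following is a compact scikit-learn sketch of this two-stage greedy procedure. It assumes features have already been extracted (and globally average pooled where needed) for each candidate layer of each pre-trained model; the dictionary keyed by (model_id, layer_id) and the function names are illustrative data structures, not the paper's implementation.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def cv_error(features, labels):
    """Five-fold cross-validation error of a linear classifier on the support set."""
    clf = LogisticRegression(max_iter=1000)
    return 1.0 - cross_val_score(clf, features, labels, cv=5).mean()

def incremental_multimodel_selection(layer_feats, labels):
    """Sketch of Incremental Multi-model Selection.

    `layer_feats` maps (model_id, layer_id) -> a (num_support, d) array of
    pooled features for the support set.
    Stage 1: keep the single best layer per pre-trained model.
    Stage 2: greedily concatenate those layers whenever doing so reduces the
    cross-validation error."""
    # Stage 1: best layer per model.
    best_per_model = {}
    for (model_id, layer_id), feats in layer_feats.items():
        err = cv_error(feats, labels)
        if model_id not in best_per_model or err < best_per_model[model_id][1]:
            best_per_model[model_id] = ((model_id, layer_id), err)
    stage1 = sorted(best_per_model.values(), key=lambda t: t[1])  # ordered by CV error

    # Stage 2: greedy concatenation starting from the single best layer.
    selected = [stage1[0][0]]
    best_err = stage1[0][1]
    for key, _ in stage1[1:]:
        candidate = selected + [key]
        feats = np.concatenate([layer_feats[k] for k in candidate], axis=1)
        err = cv_error(feats, labels)
        if err < best_err:
            selected, best_err = candidate, err

    # Train the final linear classifier on the selected, concatenated features.
    final = LogisticRegression(max_iter=1000)
    final.fit(np.concatenate([layer_feats[k] for k in selected], axis=1), labels)
    return selected, final
```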

6 Evaluation Setup

For meta-learning methods, we meta-train all methods on the base classes of miniImageNet [60] and meta-test the trained models on each dataset of the proposed benchmark. For transfer learning methods, we train the pre-trained model on the base classes of miniImageNet. For transfer from multiple pre-trained models, we use a maximum of five pre-trained models, trained on miniImageNet, CIFAR100 [29], DTD [7], CUB [64], and Caltech256 [17], respectively. In all experiments we consider 5-way 5-shot, 5-way 20-shot, and 5-way 50-shot. In all cases, the test (query) set has 15 images per class. All experiments are performed with ResNet-10 [20] for fair comparison. For each evaluation, we use the same 600 randomly sampled few-shot episodes (for consistency), and report the average accuracy and the \(95\%\) confidence interval.

Table 1. The results of meta-learning methods on the proposed benchmark.

During the training (meta-training) stage, models used for transfer learning and meta-learning models are both trained for 400 epochs with the Adam optimizer and a learning rate of 0.001. During testing (meta-testing), both the transfer learning methods and those meta-learning methods that require adaptation on the support set of the test episodes (MAML, RelationNet, etc.) use SGD with momentum, with a learning rate of 0.01 and a momentum of 0.9. All variants of fine-tuning are trained for 100 epochs. For the feature-wise transformation [58], we adopt the hyperparameters recommended in the original paper for meta-training from one source domain. In the training or meta-training stage, we apply standard data augmentation, including random crop, random flip, and color jitter.

Table 2. The results of different variants of single model fine-tuning on the proposed benchmark.

In the cross-domain few-shot learning setting, since the source domain and target domain are drastically different, it may not be appropriate to use the source domain data for hyperparameter tuning or validation. Therefore, we leave the question of how to determine the best hyperparameters in the cross-domain few-shot learning as future work. One simple strategy is to use the test set or validation set of the source domain data for hyperparameter tuning. More sophisticated methods may use datasets that are similar to the target domain data.

7 Experimental Results

7.1 Meta-learning Based Results

Table 1 shows the results of the meta-learning methods on the proposed benchmark, for each dataset, method, and shot level. Across all datasets and shot levels, the average accuracies (and 95% confidence intervals) are 50.21% (0.70) for MatchingNet, 46.55% (0.58) for MatchingNet+FWT, 38.75% (0.41) for MAML, 59.78% (0.70) for ProtoNet, 56.72% (0.55) for ProtoNet+FWT, 54.48% (0.71) for RelationNet, 52.60% (0.56) for RelationNet+FWT, and 57.35% (0.68) for MetaOpt. The performance of MAML was impacted by its inability to scale to larger shot levels due to memory overflow. Methods paired with the feature-wise transform are marked with “+FWT”.

What is immediately apparent from Table 1 is that the prior state-of-the-art MetaOptNet is no longer state-of-the-art, as it is outperformed by ProtoNet. In addition, methods designed specifically for cross-domain few-shot learning lead to consistent performance degradation on this new, challenging benchmark. Finally, performance in general correlates strongly and positively with a dataset's similarity to ImageNet, confirming that the benchmark's intentional design allows us to investigate few-shot learning across a spectrum of cross-domain difficulties.

Table 3. The results of varying the classifier for fine-tuning on the proposed benchmark.

7.2 Transfer Learning Based Results

Single Model Results. Table 2 shows the results of various single model transfer learning methods on the proposed benchmark. Across all datasets and shot levels, the average accuracies (and 95% confidence intervals) are 53.99% (1.38) for the random embedding, 64.24% (0.59) for the fixed feature embedding, 67.23% (0.46) for fine-tuning all layers, 67.41% (0.49) for fine-tuning the last 1 layer, 67.26% (0.53) for fine-tuning the last 2 layers, 67.17% (0.58) for fine-tuning the last 3 layers, and 68.14% (0.56) for transductive fine-tuning. Several observations can be made from these results. The first is that, although meta-learning methods have previously been shown to achieve higher performance than transfer learning in the standard few-shot learning setting [5, 60], in the cross-domain few-shot learning setting this situation is reversed: meta-learning methods significantly underperform simple fine-tuning methods. In fact, MatchingNet performs worse than a randomly generated fixed embedding. A possible explanation is that meta-learning methods fit the task distribution of the base class data, improving performance in that circumstance but hindering the ability to generalize to another task distribution. The second observation is that, by leveraging the statistics of the test data, transductive fine-tuning continues to achieve higher results than standard fine-tuning and meta-learning, as previously reported [10]. Transductive fine-tuning, however, assumes that all the queries are available as unlabeled data. The third observation is that the accuracy of most methods on the benchmark continues to depend on how similar the dataset is to ImageNet: CropDiseases commands the highest performance on average, while EuroSAT follows in 2\(^{nd}\) place, ISIC in 3\(^{rd}\), and ChestX in 4\(^{th}\). This further supports the motivation behind the benchmark's design of targeting applications with increasing visual domain dissimilarity to natural images.

Table 3 shows the results of varying the classifier. While the mean-centroid classifier and cosine-similarity classifier have been shown to be more effective than a simple linear classifier in the conventional few-shot learning setting, our results show that the mean-centroid and cosine-similarity classifiers have only a marginal advantage over the linear classifier on ChestX and EuroSAT in the 5-shot case (Table 3). As the number of shots increases, the linear classifier begins to dominate the mean-centroid and cosine-similarity classifiers. One plausible reason is that both the mean-centroid and cosine-similarity classifiers conduct classification based on unimodal class prototypes; when the number of examples increases, a unimodal distribution becomes less suitable, and a multi-modal distribution is required.

Transfer from Multiple Pre-trained Models. The results of the described Incremental Multi-model Selection are shown in Table 4. IMS-f fine-tunes each pre-trained model before applying the model selection. We include a baseline called all embeddings, which concatenates the feature vectors generated by all the layers of the fine-tuned models. Across all datasets and shot levels, the average accuracies (and 95% confidence intervals) are 68.22% (0.45) for all embeddings and 68.69% (0.44) for IMS-f. The results show that IMS-f generally improves upon all embeddings, which indicates the importance of selecting pre-trained models relevant to the target dataset. Model complexity also tends to decrease by over 20% on average compared to all embeddings. We can also observe that it is more beneficial to use multiple pre-trained models than just one model, even though these models are trained on data from different domains and different image types. Compared with standard fine-tuning with a linear classifier, the average improvement of IMS-f across all shots is 0.20% on ChestX, 0.69% on ISIC, 3.52% on EuroSAT, and 1.27% on CropDiseases.

In further analysis, we study the effect of the number of pre-trained models on the studied multi-model selection method. We consider libraries consisting of two, three, four, and all five pre-trained models. The pre-trained models are added to the library in the order ImageNet, CIFAR100, DTD, CUB, Caltech256. For each dataset, the experiment is conducted on 5-way 50-shot with 600 episodes. The results are shown in Table 5. As more pre-trained models are added to the library, the test accuracy on ChestX and ISIC gradually improves, which can be attributed to the diverse features provided by the different pre-trained models. However, on EuroSAT and CropDiseases, only a marginal improvement is observed. One possible reason is that the features from ImageNet already capture the characteristics of these datasets, so additional pre-trained models do not provide further information.

7.3 Benchmark Summary

Figure 2 summarizes the comparison across algorithms, according to the average accuracy across all datasets and shot levels in the benchmark. The degradation in performance suffered by meta-learning approaches is significant. In some cases, a network with random weights outperforms meta-learning approaches. FWT methods, which yielded no performance improvements, are omitted for brevity. MAML, which failed to operate on the entire benchmark, is also omitted.

Table 4. The results of using all embeddings, and the Incremental Multi-model Selection (IMS-f) based on fine-tuned pre-trained models on the proposed benchmark.
Table 5. Number of models’ effect on test accuracy.
Fig. 2. Comparisons of methods across the entire benchmark.

8 Conclusion

In this paper, we formally introduce the Broader Study of Cross-Domain Few-Shot Learning (BSCD-FSL) benchmark, which covers several target domains with varying similarity to natural images. We extensively analyze and evaluate existing meta-learning methods, including approaches specifically designed for cross-domain few-shot learning, and variants of transfer learning. The results show that, surprisingly, state-of-the-art meta-learning approaches are outperformed by earlier approaches, and recent methods for cross-domain few-shot learning actually degrade performance. In addition, all meta-learning methods significantly underperform fine-tuning methods; in fact, some meta-learning approaches are outperformed by networks with random weights. Furthermore, the accuracy of all methods correlates with the proposed measure of data similarity to natural images, verifying the diversity of the proposed benchmark in terms of its problem representation and its value in guiding future research. In conclusion, we believe this work will help the community understand what methods are most effective in practice, and help drive further advances that can more quickly yield benefit for real-world applications.