1 Introduction

Cross-domain few-shot learning (CDFSL) addresses the problem that deep learning methods, such as convolutional neural networks (CNN) for image classification, generally require a large amount of labelled training data to achieve high predictive accuracy when trained from scratch. CDFSL algorithms are designed for scenarios where only a few labelled training instances are available in the form of a so-called “support set”. The aim is to nevertheless achieve high accuracy when predicting labels for instances of the target domain that have never been seen before, i.e., the so-called “query set”. This can generally only be achieved by applying transfer learning: taking knowledge gleaned from one or several source domains with large-scale training data and using this knowledge to inform learning in a few-shot target domain.

In CDFSL, the source domain(s) and the target domain are assumed to have potentially very distinct properties. This cross-domain setting is arguably more realistic than the “in-domain” scenario, used in some few-shot learning literature (Vinyals et al., 2016), where the source and the target domains comprise mutually exclusive sets of classes obtained from the same dataset. It also yields harder learning problems due to greater domain shift.

In a CDFSL setting with multiple source domains, an algorithm needs to both select relevant source domains and effectively transfer their knowledge into a target domain using a few-shot support set. Performance is measured by “meta-testing”—transferring model(s) using target domain support sets and evaluating their predictive accuracy on corresponding query sets. Recent work considering performance in image classification, which is the setting we also focus on in this paper, shows that single-domain learning (SDL) and vanilla multi-domain learning (MDL), which applies one feature extractor and multiple classification heads, fail to achieve competitive performance compared to methods specifically designed for CDFSL (Triantafillou et al., 2020; Li et al., 2021).

A majority of recently published CDFSL methods involve building a universal model from a collection of extractors, with each extractor pretrained in a distinct source domain. This comprises the so-called “meta-training” phase, which is performed before meta-testing begins. The universal-model paradigm is generally efficient when performing meta-testing because a single universal feature extractor is used and fine-tuned on the support set, usually in conjunction with a simple robust classifier that turns extracted feature vectors into predictions. However, training the universal model is computationally expensive, and some methods constrain all extractors to the same architecture as the intended universal model (Triantafillou et al., 2021), rendering them inapplicable to heterogeneous extractor collections that are likely to occur in real-world practice. Moreover, they may require adjustment based on pre-existing domain knowledge to function well. For example, given a source domain/extractor collection for image classification consisting of ImageNet (Deng et al., 2009; Russakovsky et al., 2015), along with other, less comprehensive source domains, authors often assign greater importance to the ImageNet extractor during training (Triantafillou et al., 2021; Li et al., 2021). This achieves good performance on benchmarks, which normally include target domains such as CIFAR-10 that are quite similar to ImageNet in nature, but may not be as useful in real-world applications involving less similar data. Lastly, the process of deriving a universal model is non-incremental, which means it needs to be re-run whenever an extractor is updated or added, and normally requires access to the entire meta-training dataset (Triantafillou et al., 2021; Li et al., 2021).

As an alternative approach that avoids these shortcomings, we propose a novel “lazy” CDFSL method, termed feature extractor stacking (FES), that fine-tunes each extractor independently and trains a classifier using a form of stacked generalisation (Wolpert, 1992) during meta-testing. The “meta-training” phase in FES consists solely of training individual feature extractors, one for each source domain, using standard single-domain supervised learning. In practical applications, it may be possible to skip meta-training entirely if a set of suitable feature extractors has been obtained from other sources. FES is fully compatible with heterogeneous extractor collections, imposing no constraints on their architecture or fine-tuning configuration. It assumes equal importance of the extractors a priori, determining their task-specific relevance based purely on the support set, and does not require derivation of a universal model.

Along with the basic FES algorithm, which applies a simple linear stacking classifier and is described in Sect. 3.1, we present two variants: convolutional FES (ConFES) in Sect. 3.3 and regularised FES (ReFES) in Sect. 3.4. ConFES replaces the flat global kernel of FES with a hierarchy of depthwise convolutional kernels, reducing the number of parameters in the stacking classifier. ReFES applies fused lasso regularisation (Tibshirani et al., 2005) to the stacking classifier of FES to reduce the weights of irrelevant snapshots and induce smooth weight transition between adjacent snapshots.

We evaluate FES and its variants on the Meta-Dataset benchmark (Triantafillou et al., 2020), which contains eight source domains and five target domains, and include five additional target domains: CropDisease, EuroSAT, ISIC, ChestX, and Food101 (Guo et al., 2020; Bossard et al., 2014). We show that FES outperforms three recent universal-model methods: URL (Li et al., 2021), FLUTE (Triantafillou et al., 2021), and a URL extractor with TSA fine-tuning (Li et al., 2022), and advances the state of the art on Meta-Dataset. We also discuss practical advantages of FES in real-world scenarios, as FES can work with heterogeneous extractors out of the box and does not need to train a universal model.

2 Related work

Our empirical comparison of CDFSL methods is based on the Meta-Dataset framework, so we review this benchmark first before discussing methods that we compare to our approach. We also briefly review other noteworthy methods in the literature.

2.1 The Meta-Dataset benchmark

The Meta-Dataset (Triantafillou et al., 2020) benchmark has multiple configurations; we describe the CDFSL configuration that we use—most recent publications in the field use this configuration as well. It contains eight source domains: ILSVRC-2012 (ImageNet), Omniglot, Aircraft, CUB-200-2011 (Birds), Describable Textures, Quick Draw, Fungi, and VGG Flower. Recent work utilising Meta-Dataset (Requeima et al., 2019) has extended its original set of two target domains, Traffic Signs and MSCOCO, by adding three more: MNIST, CIFAR10, and CIFAR100. For an even more comprehensive evaluation, we add four target domains from the CDFSL benchmark in Guo et al. (2020)—CropDisease, EuroSAT, ISIC, and ChestX—and additionally employ Food101 (Bossard et al., 2014). Only the 250 sanitised test images in each Food101 class are used in our experiments.

The Meta-Dataset framework splits each source domain into three partitions: training, validation, and test. The partitions are mutually exclusive in terms of their classes, with the training partition containing approximately 70% of source domain classes and the validation and test partitions containing approximately 15% each. The training and validation partitions are made available to the CDFSL method for “meta-training”, where the training partition is generally used to train extractors and the validation split to aid hyperparameter tuning. The test partition is reserved for evaluating the CDFSL method by sampling few-shot episodes (i.e., meta-testing): the term “episode” refers to the process of sampling a support set and a query set, training a classifier on the support set, and evaluating it on the query set.

In contrast, the entire target domain data can be used for sampling episodes to evaluate few-shot learning in these domains. It is important to note that, by definition, only tasks sampled from target domains truly measure CDFSL performance. Using terminology that is common in this context, good performance in these domains indicates “strong generalisation”; good performance on tasks sampled from source domain test partitions indicates “weak generalisation”.

The most commonly used method to evaluate CDFSL algorithms on Meta-Dataset is to generate 600 any-way any-shot episodes from each dataset, and measure each algorithm’s mean classification accuracy on these 600 episodes, as well as the 95% confidence interval. Any-way any-shot sampling means the number of classes for each episode and the number of support instances per class are arbitrary, leading to imbalanced support sets more representative of real-world scenarios than fixed-way fixed-shot episodes. The query set is balanced in the Meta-Dataset setting. We adhere to this evaluation method in our experiments.
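
For reference, the per-dataset numbers reported under this protocol can be computed as in the following sketch, which assumes the commonly used normal approximation for the 95% confidence interval over the 600 episode accuracies; the function name is ours.

```python
import numpy as np

def mean_and_ci95(episode_accuracies):
    """Mean accuracy and 95% confidence interval over sampled episodes,
    using the normal approximation of the standard error."""
    acc = np.asarray(episode_accuracies, dtype=float)
    mean = acc.mean()
    ci95 = 1.96 * acc.std(ddof=1) / np.sqrt(len(acc))
    return mean, ci95
```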

2.2 Methods included in the experimental comparison

Two recently published CDFSL methods that advanced the state-of-the-art on Meta-Dataset are Few-shot Learning with a Universal TEmplate (FLUTE) (Triantafillou et al., 2021) and Universal Representation Learning (URL) (Li et al., 2021). Even more recently, based on a URL universal model, a fine-tuning method using Task-Specific Adaptors (TSA) (Li et al., 2022) improved accuracy on some target domains even further. We compare our new FES approach to these methods in our experiments.

2.2.1 Few-shot learning with a universal template

Based on the FiLM approach (Perez et al., 2018), FLUTE trains a universal model in the source domains, employing the ResNet18 architecture (He et al., 2016) widely used in CDFSL, but maintaining a separate set of batch normalisation (Ioffe & Szegedy, 2015) parameters for each domain. The ResNet “template” contains one set of convolutional weights shared across all source domains, and only the batch normalisation parameters are specific to each source domain. FLUTE jointly trains the template in all source domains. At each training iteration, a random source domain is selected—with ImageNet having a 50% probability of being selected and the other seven source domains evenly sharing the other 50% probability—and a batch of input data is sampled from the selected source domain. In forward propagation, the input batch flows through the shared convolutional layers and the selected domain’s set of batch normalisation layers, and loss is computed by applying a cosine classifier (Chen et al., 2019, 2021). A nuance of FLUTE training is that backpropagation is performed using a “meta-batch” of eight individual batches: the intention is to stabilise training by aggregating loss values across multiple domains. Hyperparameter tuning is performed using episodes sampled from source domain validation partitions.

While the template is being trained, snapshots are saved frequently. The final template is chosen as the snapshot that performs best on the source domains’ validation partitions. To establish this performance, few-shot episodes are sampled from these partitions. For each episode, feature vectors are extracted using the shared convolutional layers and the corresponding domain’s set of batch normalisation layers. Accuracy is computed using a nearest-centroid classifier (Mensink et al., 2013; Snell et al., 2017).

One more component of FLUTE, produced in a separate meta-training phase, is a blender network, which is a dataset classifier based on a permutation-invariant set encoder (Zaheer et al., 2017) followed by a linear layer. Given a batch of instances, the blender predicts, as a probability distribution, the source domain from which the batch is sampled. It is trained on batches sampled from the source domains’ training partitions, and the final blender model is chosen using batches from the validation partitions.

Given a few-shot episode at meta-test time, the blender uses the support set to produce a probability distribution. These probabilities in turn are used to form a linear combination of the source-domain-specific batch normalisation weights. Along with the shared convolutional weights from the template, this forms the initial set of parameters for the ResNet18 feature extractor, which is applied in conjunction with a nearest-centroid classifier. The model’s batch normalisation parameters are then fine-tuned on the support set while its convolutional weights remain fixed.
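
For illustration, a minimal sketch of this blending step is given below. The dictionary-based parameter representation and the function name are our own assumptions, not FLUTE’s actual implementation; only the idea of forming a convex combination of per-domain batch normalisation parameters follows the description above.

```python
import torch

def blend_batch_norm_params(domain_bn_params, blend_probs):
    """Combine per-domain batch normalisation parameters with blender weights.

    domain_bn_params: list of dicts (one per source domain) mapping a batch-norm
        parameter name, e.g. "layer1.0.bn1.weight", to a tensor.
    blend_probs: 1-D tensor of blender probabilities, one per source domain.
    """
    blended = {}
    for name in domain_bn_params[0]:
        stacked = torch.stack([params[name] for params in domain_bn_params])
        probs = blend_probs.view(-1, *([1] * (stacked.dim() - 1)))
        blended[name] = (probs * stacked).sum(dim=0)  # convex combination across domains
    return blended
```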

2.2.2 Universal representation learning

The URL algorithm also generates a universal model. It first pretrains domain-specific ResNet18 extractors independently. Then, a separate ResNet18 feature extractor is trained to form a universal model by distillation. This model is trained to match each extractor’s output feature vectors and logits using instances sampled from the extractor’s corresponding domain. To this end, the universal model contains pairs of auxiliary domain-specific components that each comprise 1) a projection layer that transforms the universal extractor’s feature vectors to match those of each domain-specific extractor, and 2) a classifier layer trained to match the logits produced by each extractor.

In the experiments by Li et al. (2021), ImageNet is made more prominent in distillation: ImageNet instances make up 50% of each mini-batch and the other seven source domains evenly make up the rest. Snapshots of the universal feature extractor are saved at predefined intervals during knowledge distillation. Episodes sampled from source domain validation partitions are used to select the best snapshot as a form of early stopping.

After meta-training, the auxiliary components of the universal model are discarded, leaving only the feature extractor. During meta-testing, this extractor is frozen, and a projection layer is initialised with an identity weight matrix and trained using the support set. The projected feature vectors are used to build a nearest-centroid classifier. Cosine similarity values between a feature vector to be classified and the centroids are used as logits. Fine-tuning minimises cross-entropy loss on the support set. Note that during fine-tuning, as the projection layer is optimised, projected support feature vectors change, and their centroids change as well. The fine-tuning effect can be interpreted as forming better clusters with projected support feature vectors.
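
The sketch below illustrates this fine-tuning loop under simplifying assumptions: the support features are precomputed from the frozen extractor, and the optimiser type, learning rate, and number of steps are placeholders rather than the settings from the URL paper.

```python
import torch
import torch.nn.functional as F

def url_style_finetune(support_feats, support_labels, num_steps=40, lr=0.1):
    """Fine-tune a linear feature projection on frozen support features (sketch).

    support_feats: (N, D) features from the frozen universal extractor;
    support_labels: (N,) integer class labels. Returns the trained projection.
    """
    n_classes = int(support_labels.max()) + 1
    projection = torch.eye(support_feats.size(1), requires_grad=True)  # identity initialisation
    optimiser = torch.optim.Adadelta([projection], lr=lr)  # optimiser settings are placeholders
    for _ in range(num_steps):
        optimiser.zero_grad()
        z = support_feats @ projection
        # Nearest-centroid classifier: centroids recomputed from the projected support features.
        centroids = torch.stack([z[support_labels == c].mean(dim=0) for c in range(n_classes)])
        logits = F.normalize(z, dim=1) @ F.normalize(centroids, dim=1).t()  # cosine similarities
        F.cross_entropy(logits, support_labels).backward()
        optimiser.step()
    return projection
```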

2.2.3 Task-specific adaptors

TSA (Li et al., 2022) is a fine-tuning method suitable for CDFSL. Given a pretrained extractor, trainable task-specific adaptors are attached to it, and the support set is used to optimise the adaptors with the extractor’s original weights frozen. Like URL, TSA also attaches a trainable linear projection layer and a robust classifier to the end of the feature extractor during fine-tuning, but it adds further adaptor components. Among multiple configurations examined, the most effective approach for few-shot image classification found by Li et al. (2022) is to attach channel projection matrices as residual connections to a model’s convolutional layers. Li et al. (2022) used TSA in conjunction with a URL-distilled universal extractor, but TSA can be applied to other CNN architectures as well.
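
The sketch below shows one way to realise such a residual channel adaptor around a frozen convolution. The exact placement and initialisation used by TSA may differ; treat this as an illustration of the idea rather than the reference implementation.

```python
import torch.nn as nn

class ResidualChannelAdaptor(nn.Module):
    """A frozen convolution with a trainable 1x1 channel projection attached
    in parallel as a residual connection (illustrative sketch)."""

    def __init__(self, conv: nn.Conv2d):
        super().__init__()
        self.conv = conv
        for p in self.conv.parameters():
            p.requires_grad = False  # original pretrained weights stay frozen
        # 1x1 convolution acting as a channel projection; zero-initialised so
        # fine-tuning starts from the pretrained behaviour.
        self.adaptor = nn.Conv2d(conv.in_channels, conv.out_channels,
                                 kernel_size=1, stride=conv.stride, bias=False)
        nn.init.zeros_(self.adaptor.weight)

    def forward(self, x):
        return self.conv(x) + self.adaptor(x)
```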

2.3 Other work on CDFSL

We review additional noteworthy CDFSL methods here. These methods precede FLUTE, URL, and TSA chronologically and achieve lower accuracy than results presented by Triantafillou et al. (2021) and Li et al. (2021, 2022). Hence, in the experiments presented in this paper, we only compare to FLUTE, URL, and a URL extractor with TSA fine-tuning.

2.3.1 Selecting relevant features from a universal representation

SUR (Dvornik et al., 2020) is a CDFSL method that utilises independently pretrained feature extractors directly for meta-testing. Each extractor is used to extract a set of feature vectors from the support set, with a trainable weight assigned to it. Feature vectors are multiplied by their respective weights and concatenated to provide input to a nearest-centroid classifier. The weights are trained by optimising the loss of this classifier on the support set. SUR is similar to URL in the meta-testing phase, as both make predictions with a nearest-centroid classifier and optimise parameters on the support set; the primary difference is that URL maintains a universal model while SUR uses the original extractors directly.

2.3.2 Universal representation transformer

URT (Liu et al., 2021) also assigns a weight to each source domain extractor during meta-testing. However, it utilises a weight assignment model learned using meta-training instead of direct optimisation on the support set to obtain the weights. To this end, URT trains an attention mechanism (Vaswani et al., 2017) that learns to assign appropriate weights to source domain feature extractors given a few-shot episode. The weight assignment model is trained and has its hyperparameters selected using episodes sampled from the source domains’ training and validation partitions.

2.3.3 Conditional neural adaptive processes

The CNAPs method, as proposed in Requeima et al. (2019), uses an extractor pretrained in a large source domain, e.g., ImageNet (Deng et al., 2009; Russakovsky et al., 2015), and meta-trains adaptation networks, using episodes sampled from the source domains, to produce task-specific FiLM (Perez et al., 2018) transformations and a linear classifier for each few-shot episode.

A variant, Simple CNAPs (Bateni et al., 2020), was later proposed utilising a non-parametric Mahalanobis distance (Galeano et al., 2015) measure in place of the classifier adaptation network of CNAPs, reducing the parameter count and improving CDFSL performance. A transductive version of Simple CNAPs was subsequently also proposed (Bateni et al., 2022), making use of clustering of query instances in feature space to achieve better performance than Simple CNAPs, assuming that the query set is available as a batch instead of a sequential stream of incoming instances. As most other CDFSL methods do not rely on such an assumption, they cannot be compared to transductive CNAPs on an even footing.

2.3.4 Multi-mode modulator

Tri-M (Liu et al., 2021), akin to CNAPs, uses an extractor pretrained in a large-scale source domain, and meta-trains a modulation network using source domain episodes to generate appropriate FiLM transformations for each few-shot episode. Tri-M maintains two sets of transformations—a domain-specific one and a domain-cooperative one—and its resulting FiLM transformation is a combination of the two. Tri-M determines a source domain for its domain-specific transformation in a way similar to how FLUTE (Triantafillou et al., 2021) utilises its blender network and uses an attention mechanism (Vaswani et al., 2017) to compose its domain-cooperative transformation from relevant source domains.

3 Cross-domain few-shot learning using stacking

Considering the CDFSL methods discussed in the previous section, the SUR method stands out because its meta-training process is straightforward: all it involves is pretraining individual source domain feature extractors. Once these have been obtained, SUR performs “lazy” learning in the sense that significant work is only performed once the support set for a few-shot episode becomes available. This makes it very flexible because new extractors can be added at any time. However, SUR does not yield state-of-the-art performance. The new methods presented in this paper are inspired by SUR and the well-established method of stacked generalisation, which learns a classifier that combines the predictions of multiple base classifiers. Henceforth, we will refer to this classifier as the “stacking classifier”. There are four primary differences between SUR and our stacking-based methods: 1) the source domain extractors are fine-tuned on the support set, with appropriate classifier layers attached, to extract more information from this data, 2) two-fold cross-validation is used to generate training data for the stacking classifier to tackle overfitting, 3) the feature vectors of this training data consist of logits obtained from the classifier layers attached to the extractors, and 4) multiple snapshots of each extractor are stored during fine-tuning and used to obtain sets of logits, adding further richness to the data available for training the stacking classifier.

In the following, we first explain the basic method of feature extractor stacking (FES) in detail and prove convexity of its optimisation, before describing two variants: convolutional FES (ConFES) and regularised FES (ReFES).

3.1 Feature extractor stacking

Given pretrained feature extractors, FES has three key components: fine-tuning extractors to obtain snapshots, two-fold cross-validation to produce training data for the stacking classifier, and training of the stacking classifier. Figure 1 depicts the FES framework.

Fig. 1

Framework of FES. Given an extractor collection with K extractors, each extractor \(\Phi\) is set up as a network \(\Psi\) for fine-tuning. The support set S is split into \(S_1\) and \(S_2\) using stratified cross-validation. Each network \(\Psi\) is fine-tuned on one split, producing J snapshots in the process, and these snapshots are used to extract logits from the other split. Logits extracted from both splits are combined into cross-validated logits of the full support set, which are used to train a stacking classifier W to fit S’s labels. The full S is then used to fine-tune \(\Psi\), producing snapshots to extract logits for the query set Q. W takes Q’s logits as input and predicts Q’s labels

3.1.1 Fine-tuning the extractors

We use \(f_{\Phi _1}, f_{\Phi _2},..., f_{\Phi _{K}}\) (or just \(\Phi _1, \Phi _2,..., \Phi _{K}\) for brevity) to denote the collection of pretrained feature extractors, where \(\Phi\) represents the corresponding extractor’s parameters and K is the number of source domains. The support set of a few-shot episode is denoted S and the query set Q. S contains N instances belonging to C classes. We fine-tune each extractor independently on S. As \(f_{\Phi }\) is a feature extractor, a classifier g with parameters \(\Theta _1\) is attached to \(f_{\Phi }\) to produce logits. Auxiliary components with parameters \(\Theta _2\) may also be introduced to the model to aid fine-tuning, such as with TSA (Li et al., 2022). The resulting model is defined as \(h_{\Psi } = g_{\Theta _1} \circ f_{(\Phi , \Theta _2)}\), where we use \(\Psi\) to denote the combination of all parameters. It is possible for \(\Theta _2\) to be \(\varnothing\), as auxiliary fine-tuning components are optional. J snapshots are saved sequentially at different fine-tuning iterations of \(h_{\Psi }\). Each snapshot contains parameters \(\Psi _k^j[S]\), where \(k \in [1,K]\) and \(j \in [1,J]\), with S denoting the fine-tuning set used.
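
A minimal sketch of this snapshot procedure is given below; the helper name, the plain full-batch training loop, and the use of state_dict copies are illustrative assumptions.

```python
import copy
import torch.nn.functional as F

def finetune_with_snapshots(model, support_images, support_labels, optimiser, num_iterations):
    """Fine-tune one network h_Psi on the support set, saving a parameter snapshot
    before fine-tuning and after every iteration (J = num_iterations + 1 snapshots)."""
    snapshots = [copy.deepcopy(model.state_dict())]  # snapshot of the initial state
    for _ in range(num_iterations):
        optimiser.zero_grad()
        logits = model(support_images)
        F.cross_entropy(logits, support_labels).backward()
        optimiser.step()
        snapshots.append(copy.deepcopy(model.state_dict()))  # one snapshot per iteration
    return snapshots
```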

3.1.2 Cross-validation to obtain training data for stacked generalisation

In stacked generalisation (Wolpert, 1992), cross-validation is employed to obtain training data for the stacking classifier to combat overfitting, and it is applied in FES as well. More specifically, we apply stratified two-fold cross-validation to the support set S, producing two splits \(S_1\) and \(S_2\), which will take turns serving as the training split \(S^{train}\) and the test split \(S^{test}\). It is possible to employ more folds in FES, but using additional folds did not yield performance gains in our experiments.

Training on one of the training splits amounts to fine-tuning a network \(h_{\Psi }\) on this data. In principle, this could be done for a fixed number of iterations, and once complete, logits on the corresponding test split could be obtained as training data for the stacking classifier. However, this naive approach may not work well because it is not known how many iterations should be performed for fine-tuning to maximise accuracy of the full learning system. The approach we propose and evaluate in this paper is instead based on the idea that we can take multiple snapshots of the models during fine-tuning and use all the snapshots’ logits on the test folds for training the stacking classifier. In other words, the learning algorithm for the stacking classifier will be responsible for deciding which extractor snapshots are the most useful ones for making accurate predictions on the test folds.

More specifically, given a pair \((S^{train}, S^{test})\) and an extractor \(h_{\Psi }\), we fine-tune \(h_{\Psi }\) on \(S^{train}\) with the same configuration used to obtain \(h_{\Psi ^j[S]}\), e.g., optimiser, learning rate, etc., and save snapshots \(h_{\Psi ^j[S^{train}]}\) at the same iterations as \(h_{\Psi ^j[S]}\). Logits \(L^j[S^{test}]\) are extracted from \(S^{test}\) with each \(h_{\Psi ^j[S^{train}]}\), i.e., \(L^j[S^{test}] = h_{\Psi ^j[S^{train}]}(S^{test})\). Using this approach, the two splits \(S_1\) and \(S_2\) can be used to alternately fine-tune extractors and produce logits \(L^j[S_1]\) and \(L^j[S_2]\), which are combined into \(L^j[CV]\), i.e., logits for every support set instance extracted using cross-validation. Considering the logits from all K extractors jointly, \(L_K^J[CV]\) is a tensor of shape \(N \times K \times J \times C\), i.e., N support instances converted into logits for C classes extracted by \(K \times J\) snapshot models, ready to serve as training data for the stacking classifier.
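
A sketch of this procedure is shown below. Stratified splitting via scikit-learn, the network builder callables, and the fine-tuning helper are assumptions made for illustration; the output matches the \(N \times K \times J \times C\) tensor described above.

```python
import numpy as np
import torch
from sklearn.model_selection import StratifiedKFold

def cross_validated_logits(network_builders, support_images, support_labels, finetune_fn, J):
    """Construct the N x K x J x C tensor of cross-validated logits.

    network_builders: one callable per extractor, each returning a freshly
        initialised fine-tuning network h_Psi (assumption for this sketch).
    finetune_fn(network, images, labels): fine-tunes the network and returns
        its J snapshot models (assumption for this sketch).
    """
    N = support_labels.numel()
    K = len(network_builders)
    C = int(support_labels.max()) + 1
    cv_logits = torch.zeros(N, K, J, C)

    labels_np = support_labels.cpu().numpy()
    skf = StratifiedKFold(n_splits=2, shuffle=True)
    for train_idx, test_idx in skf.split(np.zeros(N), labels_np):
        train_idx, test_idx = torch.as_tensor(train_idx), torch.as_tensor(test_idx)
        for k, build_network in enumerate(network_builders):
            snapshots = finetune_fn(build_network(),
                                    support_images[train_idx], support_labels[train_idx])
            with torch.no_grad():
                for j, snapshot in enumerate(snapshots):
                    # Logits of the held-out split, slotted into the combined tensor.
                    cv_logits[test_idx, k, j] = snapshot(support_images[test_idx])
    return cv_logits
```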

3.1.3 Stacking classifier training

Fig. 2

FES uses a global kernel to compute stacking classifier logits from the snapshots’ base logits. The global kernel is essentially flat since it makes no use of the snapshots’ temporal relations. For demonstration purposes, this figure and the following ones assume three extractors (\(K = 3\)), five fine-tuning snapshots per extractor (\(J = 5\)), and a two-class problem (\(C = 2\))

The FES stacking classifier is a weight matrix W of shape \(K \times J\), with \(W_k^j\) representing \(\Psi _k^j\)’s weight. Given an instance l of shape \(K \times J \times C\), the stacking classifier’s output logits \(l^W\) are obtained as a weighted sum:

$$\begin{aligned} l^W[c] = \sum _{k = 1}^K\sum _{j = 1}^JW_k^j \cdot l_k^j[c], \end{aligned}$$
(1)

where c is one of the C classes. We compute the cross-entropy loss using the N support set logits \(L^W\) output by the stacking classifier and the one-hot-encoded labels Y, i.e., \(-\sum \limits _{n = 1}^NY_n \log (\text {softmax}(L_n^W))\), which we minimise by training W. For interpretability, we constrain all values in W to be non-negative by clipping negative weights with ReLU. The FES stacking classifier is shown in Fig. 2.

After training, W is used with Eq. 1 to compute meta logits for the query set Q using the logits \(L_K^J[Q]\) computed by the saved snapshots \(\Psi _K^J[S]\). Then, a softmax function is used to obtain class probability estimates.
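
A compact sketch of training and applying the FES stacking classifier is shown below, using the LBFGS optimiser with line search and the small ridge term described in Sect. 4; the function names are ours, and the number of optimiser steps is a placeholder.

```python
import torch
import torch.nn.functional as F

def train_fes_classifier(cv_logits, support_labels, ridge=1e-2, init=1e-3):
    """Train the FES weight matrix W on cross-validated logits.

    cv_logits: tensor of shape (N, K, J, C); support_labels: tensor of shape (N,).
    """
    N, K, J, C = cv_logits.shape
    W = torch.full((K, J), init, requires_grad=True)
    optimiser = torch.optim.LBFGS([W], line_search_fn="strong_wolfe")

    def closure():
        optimiser.zero_grad()
        weights = F.relu(W)                                  # non-negativity via ReLU clipping
        # Eq. 1: weighted sum of base logits over extractors (k) and snapshots (j).
        meta_logits = torch.einsum("nkjc,kj->nc", cv_logits, weights)
        loss = F.cross_entropy(meta_logits, support_labels) + ridge * (W ** 2).sum()
        loss.backward()
        return loss

    for _ in range(10):   # a few LBFGS steps suffice for this convex problem
        optimiser.step(closure)
    return F.relu(W.detach())

def fes_predict(W, query_logits):
    """Apply Eq. 1 to query logits of shape (M, K, J, C) and return class probabilities."""
    return torch.softmax(torch.einsum("mkjc,kj->mc", query_logits, W), dim=1)
```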

3.2 Proof of convexity

Given a stacking instance l consisting of base logits obtained from the extractor snapshots, which the stacking classifier transforms into meta-level logits \(l^W\), and the label \(c_y\), the negative log-likelihood loss \(\ell\) associated with the stacking classifier’s parameters W is

$$\begin{aligned} \ell (W) = \log \left(\sum _{i = 1}^Ce^{l^W[c_i]}\right) - l^W[c_y]. \end{aligned}$$
(2)

To prove that optimising FES is a convex problem, we show that for any two values of W, named A and B, the loss at any convex combination of A and B never exceeds the corresponding convex combination of the losses at A and B, i.e.,

$$\begin{aligned} \ell (\lambda A + (1 - \lambda ) B) \le \lambda \ell (A) + (1 - \lambda )\ell (B), \lambda \in [0, 1]. \end{aligned}$$
(3)

Applying Eq. 2 to Eq. 3, we get

$$\begin{aligned}&\log \left(\sum _{i = 1}^Ce^{l^{(\lambda A + (1 - \lambda ) B)}[c_i]}\right) - l^{(\lambda A + (1 - \lambda ) B)}[c_y] \le \\&\lambda \left(\log \left(\sum _{i = 1}^Ce^{l^A[c_i]}\right) - l^A[c_y]\right) + (1 - \lambda )\left(\log \left(\sum _{i = 1}^Ce^{l^B[c_i]}\right) - l^B[c_y]\right), \end{aligned}$$

which can be simplified into

$$\begin{aligned} \log \left(\sum _{i = 1}^Ce^{l^{(\lambda A + (1 - \lambda ) B)}[c_i]}\right) \le \lambda \log \left(\sum _{i = 1}^Ce^{l^A[c_i]}\right) + (1 - \lambda )\log \left(\sum _{i = 1}^Ce^{l^B[c_i]}\right), \end{aligned}$$
(4)

because using Eq. 1, we have

$$\begin{aligned}&l^{(\lambda A + (1 - \lambda ) B)}[c_y] \\ =&\sum _{k = 1}^K\sum _{j = 1}^J(\lambda A_k^j + (1 - \lambda ) B_k^j) \cdot l_k^j[c_y] \\ =&\sum _{k = 1}^K\sum _{j = 1}^J\lambda A_k^j \cdot l_k^j[c_y] + \sum _{k = 1}^K\sum _{j = 1}^J(1 - \lambda ) B_k^j \cdot l_k^j[c_y] \\ =&\lambda \sum _{k = 1}^K\sum _{j = 1}^JA_k^j \cdot l_k^j[c_y] + (1 - \lambda )\sum _{k = 1}^K\sum _{j = 1}^JB_k^j \cdot l_k^j[c_y] \\ =&\lambda l^A[c_y] + (1 - \lambda ) l^B[c_y]. \end{aligned}$$

Similarly, Eq. 4 can be transformed using Eq. 1 into

$$\begin{aligned} \log \left(\sum _{i = 1}^Ce^{\lambda l^A[c_i] + (1 - \lambda ) l^B[c_i]}\right) \le \lambda \log \left(\sum _{i = 1}^Ce^{l^A[c_i]}\right) + (1 - \lambda )\log \left(\sum _{i = 1}^Ce^{l^B[c_i]}\right). \end{aligned}$$
(5)

It is known that the LogSumExp function \(LSE(x) = \log (\sum \limits _{i = 1}^ne^{x_i})\) is convex. Therefore, we have

$$\begin{aligned} \forall n \in \mathbb {Z}^+, \alpha , \beta \in \mathbb {R}^n: LSE(\lambda \alpha + (1 - \lambda ) \beta ) \le \lambda LSE(\alpha ) + (1 - \lambda ) LSE(\beta ). \end{aligned}$$
(6)

Hence, Eq. 5 is true because we can make the following assignments:

$$\begin{aligned} n&= C, \\ \alpha _i&= l^A[c_i],\\ \beta _i&= l^B[c_i]. \end{aligned}$$

This completes the proof of Eq. 3, and thus optimising FES on a single instance l is a convex problem. As the sum of convex functions is a convex function, optimising FES on a full batch L is also a convex problem. Therefore, FES is a convex optimisation problem.

3.3 Convolutional feature extractor stacking

Fig. 3

ConFES replaces the flat kernel of FES with a two-level kernel hierarchy. The base-level kernel is a one-dimensional depthwise, i.e., feature-extractor-wise, convolutional kernel, with predefined kernel and stride sizes. The high-level kernel is global like the one in FES but applied to the output of the base-level kernel, which requires substantially fewer parameters

The basic FES approach does not exploit the temporal relation between logits obtained from adjacent snapshots produced during fine-tuning. Convolutional FES (ConFES) replaces the global kernel of FES with a kernel hierarchy, as shown in Fig. 3, to treat the collection of logits as a time series. The hierarchy comprises one or more lower-level one-dimensional depthwise convolutional kernels and a top-level global kernel. The depthwise kernels condense the logit output sequence from each extractor’s snapshots into a 1D feature map, while keeping the extractors separate, and the global kernel summarises the feature maps produced by the lower-level kernels.

ConFES is motivated by the assumption that when each extractor is fine-tuned on the support set, it undergoes gradual changes between iterations, and the logits output by sequentially saved snapshots can be considered a time series. Therefore, 1D convolutions can be used to discern informative patterns in the time series data and compute feature maps, which are smaller in size than the raw logit time series, and therefore require fewer parameters in the global kernel than standard FES.

Given K extractors and J snapshots for each extractor, FES requires \(K \times J\) parameters. Assuming a two-level ConFES hierarchy, with a base-level convolutional kernel of size \(J_b\) and stride T, the feature map for each extractor will be of length \(J_m = \frac{J - J_b}{T} + 1\), leading to a global kernel size of \(K \times (\frac{J - J_b}{T} + 1)\). Including the \(K \times J_b\) parameters in the convolutional kernel, ConFES contains \(K \times (\frac{J - J_b}{T} + 1 + J_b)\) parameters. In practice, it can generally be assumed that \(J \gg 1\): a two-level ConFES architecture should be configured so that \(J \gg J_b \ge T \gg 1\) in order to cover all snapshots with significantly fewer parameters than FES.

ConFES utilises the sequential relation of each extractor’s snapshots through its lower-level 1D depthwise convolutional layers and has substantially fewer parameters than FES, making it less prone to overfitting. Note that Fig. 3 is simplified for demonstration purposes and does not reflect well that ConFES maintains fewer parameters; for a practical example of ConFES kernels, please refer to Fig. 12.
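
The sketch below implements a two-level ConFES stacking classifier with a depthwise 1-D convolution over the snapshot dimension. For simplicity, non-negativity is enforced on both kernels via ReLU, and the class and parameter names are ours.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConFESClassifier(nn.Module):
    """Two-level ConFES stacking classifier: a depthwise 1-D convolution over
    each extractor's snapshot sequence, followed by a global kernel."""

    def __init__(self, K, J, kernel_size, stride, init=1e-3):
        super().__init__()
        self.stride = stride
        J_m = (J - kernel_size) // stride + 1
        w0 = init ** 0.5  # two levels: the product of the initial weights is approx. init
        # Depthwise (feature-extractor-wise) kernel: one 1-D filter per extractor.
        self.depthwise = nn.Parameter(torch.full((K, 1, kernel_size), w0))
        # Global kernel applied to the K x J_m feature maps.
        self.global_kernel = nn.Parameter(torch.full((K, J_m), w0))

    def forward(self, logits):
        # logits: (N, K, J, C) base logits -> (N, C) meta logits.
        N, K, J, C = logits.shape
        x = logits.permute(0, 3, 1, 2).reshape(N * C, K, J)  # classes treated as batch entries
        fmap = F.conv1d(x, F.relu(self.depthwise), stride=self.stride, groups=K)
        fmap = fmap.reshape(N, C, K, -1)                     # (N, C, K, J_m)
        return torch.einsum("nckm,km->nc", fmap, F.relu(self.global_kernel))
```

With \(K = 8\), \(J = 41\), a kernel of size 9, and stride 4 (the configuration used in Sect. 4), this module has \(8 \times 9 + 8 \times 9 = 144\) parameters.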

3.4 Regularised feature extractor stacking

Fig. 4

ReFES uses the same global kernel as FES and applies fused lasso regularisation to the kernel’s training process. Fused lasso drives each individual weight towards zero with a regularisation strength of \(\lambda _1\) and applies depthwise smoothing to the weight matrix by penalising the weight difference between adjacent snapshots with a regularisation strength of \(\lambda _2\)

To combat overfitting, an alternative to reducing the number of parameters is to perform regularisation. Regularised FES (ReFES) introduces fused lasso regularisation (Tibshirani et al., 2005) to the stacking classifier used in FES, as shown in Fig. 4. Non-zero weights are penalised with a strength of \(\lambda _1\), and each feature-extractor-wise weight sequence is smoothed with a strength of \(\lambda _2\). The loss is a combination of cross-entropy loss and depthwise fused lasso loss, as formulated in Eq. 7, given K extractors, J snapshots per extractor, and a 2D global kernel W of shape \(K \times J\).

$$\begin{aligned} \ell = \ell _{\text {cross-entropy}} + \lambda _1 \sum _{k = 1}^K \sum _{j = 1}^J \Vert W_k^j\Vert + \lambda _2 \sum _{k = 1}^K \sum _{j = 1}^{J - 1} \Vert W_k^j - W_k^{j + 1}\Vert . \end{aligned}$$
(7)

In addition to encouraging sparse weights like standard lasso, fused lasso also encourages smaller differences between adjacent weights (Tibshirani et al., 2005). Each extractor’s snapshots are ordered by their fine-tuning iterations, and adjacent snapshots are likely to be similar. By applying fused lasso regularisation, differences between adjacent weights are penalised, and weight sequences are smoothed.
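
The combined objective of Eq. 7 can be written down directly. The sketch below reuses the logit aggregation of Eq. 1 and applies the ReLU clipping before computing the penalties, which is a simplification; the function name is ours.

```python
import torch
import torch.nn.functional as F

def refes_loss(W, cv_logits, support_labels, lambda1, lambda2):
    """Cross-entropy plus depthwise fused lasso penalty (cf. Eq. 7).

    W: (K, J) stacking weights; cv_logits: (N, K, J, C); support_labels: (N,)."""
    weights = F.relu(W)
    meta_logits = torch.einsum("nkjc,kj->nc", cv_logits, weights)
    ce = F.cross_entropy(meta_logits, support_labels)
    sparsity = weights.abs().sum()                               # lasso term, strength lambda1
    smoothness = (weights[:, 1:] - weights[:, :-1]).abs().sum()  # fused term over adjacent snapshots
    return ce + lambda1 * sparsity + lambda2 * smoothness
```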

The stratified two-fold splits \(S_1\) and \(S_2\) can be used to select appropriate \(\lambda _1\) and \(\lambda _2\) values for a few-shot episode. In the spirit of grid search with cross-validation, a ReFES stacking classifier is trained on the logits of one split, e.g., \(L_K^J[S_1]\), and tested on the logits of the other split, e.g., \(L_K^J[S_2]\). Different values for \(\lambda _1\) and \(\lambda _2\) can be explored and the best configuration selected based on the combined accuracy on the two folds. This configuration is then used to train a newly initialised ReFES stacking classifier on the full set of cross-validation logits \(L_K^J[CV]\), and this stacking classifier is used to label the query set instances Q based on their logits \(L_K^J[Q]\).

3.5 Handling single-instance classes

Meta-Dataset’s sampling scheme (Triantafillou et al., 2020) sometimes produces support sets containing single-instance classes. During cross-validation, single-instance classes need to be removed: if a class’s only instance is in the test split \(S^{test}\), then the training split \(S^{train}\) will have no instance of that class. FES and its variants can train their stacking classifiers on a subset of the support classes, containing \(C_{sub} \le C\) classes, because their kernels only encode the weights of the snapshots and are inherently independent of the number of classes C. In Figs. 2, 3, and 4, C can simply be replaced by \(C_{sub}\) during training.

Given a strict one-shot problem, where all classes exhibit exactly one instance, FES cross-validation is infeasible, as all classes need to be removed during cross-validation, leading to \(L_K^J[CV] = \varnothing\). Therefore, support logits obtained from ordinary fine-tuning need to be used in place of cross-validation logits, i.e., \(L_K^J[S]\) is used to train the FES classifier W instead of using \(L_K^J[CV]\).

4 Experimental setup

To evaluate FES and its variants on the Meta-Dataset benchmark described in Sect. 2.1, we use an extractor collection containing eight extractors, each independently pretrained on a Meta-Dataset source domain. In our primary set of experiments, all extractors are ResNet18 models (He et al., 2016) and identical to the source domain extractors used in the publication introducing URL (Li et al., 2021). Note that the extractors are trained on the training split of the source domain data only. The source domain validation split is used to select a trained checkpoint.

FES is compatible with any fine-tuning algorithm that is applicable to the individual extractors. In our experiments, we save a snapshot of each extractor before fine-tuning and save a snapshot after each iteration. We evaluate FES with three fine-tuning methods used by state-of-the-art CDFSL methods in the literature:

  • TSA (Li et al., 2022)—matrix residual adaptors attached to convolutional layers, and a fully-connected layer to project feature vectors.

  • URL (Li et al., 2021)—only a fully-connected layer to project feature vectors.

  • FLUTE (Triantafillou et al., 2021)—scaling and shift factors of batch normalisation layers.

When performing each fine-tuning method for FES, we use the hyperparameters as stated in the source publications, including optimiser type, learning rate, number of iterations, etc., and we compare FES to each source method. The URL (Li et al., 2021) and TSA (Li et al., 2022) papers fine-tune their feature extractors for 40 iterations, leading to 41 FES snapshots per extractor. The FLUTE (Triantafillou et al., 2021) paper fine-tunes its feature extractor for six iterations, leading to seven FES snapshots per extractor.

We adhere to the TSA, URL, and FLUTE papers when replicating and evaluating their methods as benchmarks. Pretrained universal extractors are obtained from the official repositories, and hyperparameter settings are consistent with the papers’ specifications. Note that both the URL and TSA papers used the same URL-distilled universal extractor, and their difference is in fine-tuning, i.e., only fine-tuning a feature projection (URL) or additionally fine-tuning convolutional channel projections (TSA).

We use an LBFGS optimiser to train the stacking classifier, applying its default hyperparameters in the PyTorch library (Paszke et al., 2019), except that we utilise its line search function. A ridge regularisation of strength \(1\textrm{e}^{-2}\) is applied to FES and ConFES to make the LBFGS optimiser more numerically stable. Adjusting the regularisation strength up or down by an order of magnitude does not substantially affect classification accuracy.

Table 1 Meta-Dataset episode statistics

Meta-Dataset’s sampling randomness may cause the accuracy of evaluated methods to fluctuate by one or two percentage points between runs, as also stated in the URL and TSA code repositories (Li et al., 2022). This fluctuation may exceed the 95% confidence interval of most results, so to eliminate it, we sample 600 episodes from each domain once, cache them, and use the same cached episodes to evaluate all CDFSL methods. For each dataset, the number of classes and the number of instances are randomly sampled for each episode, so different episodes can contain different numbers of classes and instances. Within an episode, the number of instances is randomly sampled for each class, so different classes can contain different numbers of instances and episodes can be class-imbalanced. However, the query set is stratified and always contains 10 instances per class.

Triantafillou et al. (2021) pointed out that Meta-Dataset instances need to be shuffled during sampling because some datasets have a particular ordering, e.g., traffic_sign contains consecutive frames from the same video. However, their shuffling solution was implemented as a moving window of size 1,000 over the stream of instances of each class, which we found to be potentially insufficient: it yields approximately 1% higher accuracy on mscoco and 3% higher accuracy on ChestX than true random sampling. We found that a window size of 10,000 yielded virtually the same level of accuracy as true random sampling, but nevertheless use true random sampling in our experiments, i.e., instances in each class are fully randomised and have an equal chance of being selected, and episodes are completely independent of each other. Statistics of our sampling run are shown in Table 1. Using exactly the same sampled episodes for every learning scheme compared also allows us to perform a paired t-test on a per-dataset basis, which is a more sensitive statistical test than simply comparing two algorithms’ mean accuracy and confidence intervals. In addition, we rank the algorithms and show their critical difference diagrams (Demsar, 2006) for weak and strong generalisation.

Considering the complexity of the optimisation problem when learning the stacking classifier, it is worth noting that the FES and ReFES stacking classifiers each maintain \(8 \times 41 = 328\) parameters if the extractors are fine-tuned for 40 iterations, and \(8 \times 7 = 56\) parameters if the extractors are fine-tuned for 6 iterations.

ConFES is applied with a two-level hierarchy, i.e., a low-level depthwise 1D convolutional kernel and a high-level global kernel. For 40-iteration fine-tuning, the convolutional kernel has size \(J_b = 9\) with stride \(T = 4\), leading to a feature sequence/global kernel of length 9. Consequently, ConFES has \(8 \times 9 + 8 \times 9 = 144\) parameters in total. For 6-iteration fine-tuning, the convolutional kernel has size 3 with stride 2, leading to a global kernel of length 3, and therefore ConFES contains \(8 \times 3 + 8 \times 3 = 48\) parameters in total. All parameters are initialised with a constant \((1\textrm{e}^{-3})^{\frac{1}{h}}\), where h is the number of hierarchical levels in the stacking classifier. Therefore, FES and ReFES are initialised with \(1\textrm{e}^{-3}\), and a two-level ConFES hierarchy is initialised with \((1\textrm{e}^{-3})^{\frac{1}{2}}\). This initialisation is deterministic and ensures that the product of weights from all levels is close to \(1\textrm{e}^{-3}\), which is small enough for optimisation to go in either direction, but also large enough to avoid exceedingly small derivatives in gradient-based optimisers.

To facilitate grid search for the \(\lambda _1\) and \(\lambda _2\) values of ReFES, a pool of eight potential values is provided for each hyperparameter: 1, \(1\textrm{e}^{-1}\), \(1\textrm{e}^{-2}\), \(1\textrm{e}^{-3}\), \(1\textrm{e}^{-4}\), \(1\textrm{e}^{-5}\), \(1\textrm{e}^{-6}\), and 0.

5 Results

We present CDFSL results of FES, ConFES, ReFES, and the competing methods URL, FLUTE, and a URL extractor with TSA fine-tuning, on the Meta-Dataset benchmark and show that FES and its variants advance the state of the art on this benchmark. We then visually analyse an example of trained FES, ConFES, and ReFES kernels. Lastly, we examine the ability of FES, ConFES, and ReFES to omit snapshots with their non-negative kernels.

Table 2 Meta-Dataset results with TSA fine-tuning
Table 3 Statistically significant number of wins of column algorithm over row algorithm using paired t-test results with TSA fine-tuning
Fig. 5

TSA weak generalisation critical difference diagram

Fig. 6

TSA strong generalisation critical difference diagram

Table 4 Meta-Dataset results with URL fine-tuning
Table 5 Statistically significant number of wins of column algorithm over row algorithm using paired t-test results with URL fine-tuning
Fig. 7

URL weak generalisation critical difference diagram (\(p > \alpha\))

Fig. 8

URL strong generalisation critical difference diagram

Table 6 Meta-Dataset results with FLUTE fine-tuning
Table 7 Statistically significant number of wins of column algorithm over row algorithm using paired t-test results with FLUTE fine-tuning
Fig. 9

FLUTE weak generalisation critical difference diagram

Fig. 10

FLUTE strong generalisation critical difference diagram

5.1 Meta-Dataset results

Results are organised by fine-tuning algorithms used, to provide a comparison between different CDFSL algorithms with the same fine-tuning scheme. The universal model of URL (Li et al., 2021), applied with TSA fine-tuning (Li et al., 2022), is the most recent and strongest CDFSL approach in the literature. Hence, we compare to this universal-model approach first, applying TSA fine-tuning in our FES methods as well in this comparison. Following that, we present experiments with the simpler fine-tuning approach used in the original URL (Li et al., 2021) paper. Finally, we evaluate FLUTE (Triantafillou et al., 2021) fine-tuning, which fine-tunes batch norm parameters only, and compare to the FLUTE universal template model.

Results with TSA fine-tuning are shown in Table 2, and paired t-test results based on the 600 individual accuracy values per dataset are shown in Table 3. Results with URL fine-tuning are shown in Tables 4 and 5, and those with FLUTE fine-tuning are shown in Tables 6 and 7.

In these tables, mean accuracy over 600 episodes and 95% confidence intervals are shown for each algorithm and dataset, and weak and strong generalisation accuracy and ranks averaged over all individual episodes are listed below the datasets. The best result of each row is shown in bold. If a paired t-test between a FES algorithm and the corresponding universal model/template (in the leftmost column) returns a p value less than 0.05, the null hypothesis (that there is no statistically significant difference) is rejected, and the FES result is marked with either \(\circ\) if it has higher accuracy, or \(\bullet\) if its competitor has higher accuracy.

The tables showing paired t-test results are split by weak generalisation (the eight source domains) and strong generalisation (the ten target domains). Each value indicates the number of datasets where the algorithm in the value’s column significantly outperforms the algorithm in its row according to the paired t-test.

Figures 5, 6, 7, 8, 9, and 10 are critical difference diagrams produced by the Nemenyi test applied with \(\alpha =0.05\), where algorithms are ranked using all relevant accuracy values (8 datasets \(\times\) 600 episodes for weak generalisation, and 10 datasets \(\times\) 600 episodes for strong generalisation). A Friedman test is first performed on all algorithms with the same \(\alpha\) to reject the null hypothesis. A Nemenyi test is then performed to group algorithms with no statistically significant difference into cliques via horizontal lines. Note that the Friedman p value is greater than \(\alpha\) for URL weak generalisation, i.e., Fig. 7, and the null hypothesis over all classifiers cannot be rejected in this case.

When using the same fine-tuning scheme, FES and its variants outperform their competitor CDFSL algorithms—building a universal model using knowledge distillation for URL and its TSA fine-tuning variant, and training a universal template with FiLM layers for FLUTE—in strong generalisation, where learning problems qualify as being cross-domain. The FES algorithms achieve better average accuracy and obtain more wins than losses in paired t-tests. They also rank higher than their competitors in the critical difference diagram.

Considering results with all three fine-tuning methods, the FES algorithms consistently outperform their competitors by a substantial margin on traffic_sign, CropDisease, and Food101, while being outperformed on cifar10 and cifar100. This phenomenon may indicate that FES and its variants perform better in domains that are more specialised, while their competitors gain an edge on datasets more similar to ImageNet, such as the CIFAR datasets. This speculation is supported by the fact that the competitor methods artificially attach greater importance to ImageNet when their universal models are obtained (Triantafillou et al., 2021; Li et al., 2021, 2022).

All three FES variants exhibit good CDFSL performance. Which variant is to be preferred depends on the specific use case: FES is the simplest and most versatile; ConFES has fewer parameters and therefore a smaller search space; and ReFES uses regularisation to achieve smoother and more interpretable snapshot selections.

Fig. 11

FES kernel for traffic_sign

Fig. 12

ConFES kernels for traffic_sign

Fig. 13

ReFES kernel for traffic_sign

5.2 Weight visualisation

Weights of the FES, ConFES, and ReFES kernels after fine-tuning with TSA on traffic_sign are visualised in Figs. 11, 12, and 13. The weights are averaged over 600 episodes.

ConFES maintains two kernels: a low-level depthwise 1D convolutional kernel (12a) and a high-level global kernel (12b). The two kernels can be expanded back into a global kernel (12c) for interpretation because the output of the convolutional kernel 12a serves as direct input to the global kernel 12b, without any intermediate non-linear activation. Figure 12 demonstrates how ConFES emulates a 328-parameter FES kernel with only 144 parameters. The stepped pattern in the expanded ConFES kernel, where every fourth snapshot is assigned relatively greater weight than its neighbours, is an artefact of 1D convolution—with a kernel size of 9 and a stride size of 4, this pattern results from kernel overlaps.

FES determines that the fine-tuned ilsvrc_2012 (ImageNet) and quickdraw extractors are the most prominent contributors to its predictions, indicated by the dark regions on the right end of these two extractors’ rows in Fig. 11. ConFES and ReFES arrive at similar conclusions regarding contributors, but exhibit characteristics that reflect their classifiers’ behaviours: ConFES shows stepped patterns due to 1D convolution as in Fig. 12; ReFES shows smoother weight changes due to fused lasso regularisation as in Fig. 13.

Additional heatmaps visualising kernel weights on the other target domains are in Appendix A, shown by Figs. 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39 and 40.

5.3 Snapshot omission

Table 8 Percentage of snapshots omitted by the stacking classifier

As FES kernels are constrained to be non-negative by clipping their weights with ReLU, some snapshots may have their corresponding weights set to 0 after clipping, which means logits from these snapshots do not contribute to the aggregated meta logits, and these snapshots can be omitted, i.e., they do not need to be saved and are not used for inference.

Table 8 shows the average percentage of snapshots omitted by a FES, ConFES, or ReFES stacking classifier, using TSA, URL, or FLUTE fine-tuning. Note that ConFES omission rates are computed using the expanded kernel, because zero values need to exist in the expanded kernel, instead of merely in one of ConFES’ hierarchical kernels, for the corresponding snapshots to be omitted. A higher omission percentage is considered better because omitting snapshots saves storage space and inference computation. Among the three methods, FES achieves the highest percentage of omission, generally between 60% and 80%, followed by ConFES, which achieves 30% to 70% omission in general, while ReFES achieves the least amount of omission, mostly below 40%. FES achieves higher omission rates than ConFES and ReFES, but trades off mean strong generalisation accuracy as shown in Tables 2, 4, and 6.
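
As a minimal sketch, the omission rate of a trained FES kernel (or an expanded ConFES kernel) can be computed as the fraction of exactly-zero weights after clipping; the function name is ours.

```python
import torch

def omission_rate(kernel: torch.Tensor) -> float:
    """Fraction of snapshots whose kernel weight is exactly zero after ReLU
    clipping, i.e. snapshots that never need to be stored or evaluated."""
    clipped = torch.relu(kernel)
    return (clipped == 0).float().mean().item()
```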

6 Ablation study

Table 9 Ablation results with TSA fine-tuning
Table 10 Ablation results with URL fine-tuning
Table 11 Ablation results with FLUTE fine-tuning

We perform an ablation study by removing cross-validation from the framework and/or using only the first or last snapshots in fine-tuning. When cross-validation is not used, training logits for the “stacking” classifier are extracted from the support set using snapshots fine-tuned on the entire support set, akin to how one-shot episodes are handled in Sect. 3.5. When using only the first or last snapshots, the stacking classifier is a degenerate weight kernel with a singleton dimension for fine-tuning iterations, simply containing one weight value for each extractor. Results are shown in Tables 9, 10, and 11, organised by the fine-tuning algorithm used.

The results show that methods using cross-validation outperform their counterparts without cross-validation. Moreover, using all snapshots achieves better performance than using only the first or last snapshots in terms of mean strong generalisation performance for URL and TSA fine-tuning. For strong generalisation with FLUTE fine-tuning, using only the last snapshots leads to better performance. This could be due to the smaller number of fine-tuning iterations performed by FLUTE, as the last snapshots constitute a more substantial part of all snapshots. It is worth noting that cross-validation is helpful even when only using the first snapshots before any fine-tuning because the training logits are computed using a nearest centroid classifier, and cross-validation keeps the support instances for logit extraction separate from those used to compute the centroids, hence avoiding instance re-use and reducing overfitting.

7 Heterogeneous extractors

Table 12 Results of replacing the ResNet18 ImageNet extractor with a Small EfficientNetV2 pretrained on the 21K version of ImageNet, while the other seven extractors remain the same
Table 13 Comparison between applying FES to an ImageNet-pretrained EfficientNetV2 extractor alone and applying FES to an extractor collection containing it and the seven other ResNet18 source domain extractors

FES and its variants operate in logit space, which means they are independent of the architecture and feature size of each extractor. Therefore, they can naturally work with heterogeneous extractor collections. We demonstrate this by replacing the ResNet18 ImageNet extractor in the source domain collection with a more advanced Small EfficientNetV2 model (Tan & Le, 2021) pretrained on the 21K-class version of ImageNet, while keeping the seven other source domain ResNet18 extractors unchanged. The Small EfficientNetV2 model produces feature vectors of length 1280, as opposed to feature vectors of length 512 generated by ResNet18.

URL-style fine-tuning is used, i.e., a square matrix initialised as an identity matrix is used for feature projection. The results are shown in Table 12 and are compared to the results obtained when all eight extractors are ResNet18 models. Using the EfficientNetV2 model consistently improves FES performance in both weak and strong generalisation. Note that the main purpose of this evaluation is to show FES compatibility with heterogeneous model zoos; its results are not directly comparable to the main results because the 21K-class ImageNet dataset used to pretrain the EfficientNetV2 model contains the Meta-Dataset ImageNet test split, which makes the ImageNet evaluation over-optimistic. Moreover, test classes in the other domains may also be present in the 21K pretraining classes.

Since the EfficientNetV2 model is much more advanced than ResNet18, we investigate whether it dominates the extractor collection and effectively makes the other ResNet18 extractors irrelevant by performing FES using the single EfficientNetV2 extractor, with results in Table 13. Interestingly, all three FES variants obtain very similar accuracy when applied to only one EfficientNetV2 extractor, while their differences are shown more clearly when applied to a collection of eight extractors. Although using only EfficientNetV2 leads to better performance in a number of ImageNet-adjacent domains, e.g., ilsvrc_2012, dtd, vgg_flower, mscoco, and cifar10, it under-performs in most other domains, especially those significantly different from ImageNet, e.g., omniglot, aircraft, quickdraw, fungi, traffic_sign, mnist, CropDisease, ISIC, and ChestX.

Our EfficientNetV2 evaluation indicates that 1) FES and its variants are compatible with heterogeneous extractor collections, and 2) they are robust to discrepancies in extractor architectures and able to select relevant models from a diverse model zoo.

8 Limitations and discussion

Table 14 ResNet152 feature extractors with URL fine-tuning
Table 15 ResNet152 feature extractors with TSA fine-tuning
Table 16 Computational resource consumption of FES variants using TSA fine-tuning, compared to the official TSA algorithm applied to a URL ResNet18 or ResNet152 extractor
Table 17 Comparing the official URL model to a URL model distilled without favouring ImageNet

FES requires no universal extractor, which means the meta-training phase only requires pretraining a collection of extractors, similar to SUR. This cost is reduced to zero if pretrained extractors are readily available. However, FES is more expensive in the meta-testing phase in terms of both computation and storage, as it needs to fine-tune each extractor and save its snapshots instead of utilising a single universal extractor. The good performance of FES could be attributed to this increased capacity: it maintains individual extractors instead of a single universal extractor. In the context of Meta-Dataset, FES maintains eight extractors, i.e., \(8\times\) the parameters of a universal model with the same architecture. Hence, in an additional experiment, we investigate larger universal models with capacities comparable to FES.

Originally, Li et al. (2021) distill eight ResNet18 extractors into a universal ResNet18 extractor. We distilled a universal ResNet152 (He et al., 2016) extractor using the same process; ResNet18 has 11 M parameters while ResNet152 has 60 M. We elected to use the same eight ResNet18 extractors for distillation: pretraining eight ResNet152 extractors from scratch is prohibitively expensive for us, and reusing the ResNet18 extractors avoids introducing a confounding factor into the meta-model evaluation, as different base-model architectures may capture source domain semantics differently. We also pretrained a universal ResNet152 model using “vanilla” multi-domain learning (MDL), i.e., one feature extractor is pretrained with all eight source domains’ data using eight classification heads, one for each domain. Compared to the official ResNet18 URL training, we halved the mini-batch size (and doubled the number of iterations) to fit ResNet152 URL or MDL training into the 48GB memory of an NVIDIA A6000 GPU (the most advanced at our disposal). Tables 14 and 15 show the results with URL and TSA fine-tuning, respectively, and compare them to using the official ResNet18 URL model, as well as FES variants with ResNet18 extractor collections. As TSA fine-tuning has high memory consumption, we omitted adaptors in the first and second convolutional blocks (shown to have a small impact on accuracy by Li et al. (2022)) to fit the ResNet152 TSA experiments on our NVIDIA A6000 GPU. In both tables, the ResNet152 URL model generally outperforms the ResNet18 URL and ResNet152 MDL models, and it achieves the best average weak generalisation accuracy. Its mean strong generalisation accuracy is comparable to that of the FES variants, but individual results show that the methods excel at different tasks: the ResNet152 URL model performs better on mscoco, cifar10, cifar100, EuroSAT, and Food101, while the FES methods perform better on traffic_sign, mnist, CropDisease, ISIC, and ChestX. It appears that the ResNet152 URL model is better at ImageNet-adjacent tasks, while the FES methods are better at tasks that differ more substantially from ImageNet.
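
For clarity, the sketch below illustrates the “vanilla” MDL setup described above: a single shared backbone with one linear classification head per source domain. It is only a schematic; the torchvision `resnet152` constructor and the per-domain class counts are assumptions, not the exact training code.

```python
import torch.nn as nn
from torchvision.models import resnet152

class VanillaMDL(nn.Module):
    def __init__(self, num_classes_per_domain):
        super().__init__()
        backbone = resnet152(weights=None)        # train from scratch
        feat_dim = backbone.fc.in_features        # 2048 for ResNet152
        backbone.fc = nn.Identity()               # drop the default head
        self.backbone = backbone
        # One classification head per source domain (eight for Meta-Dataset).
        self.heads = nn.ModuleList(
            nn.Linear(feat_dim, c) for c in num_classes_per_domain
        )

    def forward(self, x, domain_idx):
        # Instances from domain `domain_idx` are classified by that domain's head.
        return self.heads[domain_idx](self.backbone(x))
```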

Table 16 compares the cost of FES inference on an NVIDIA A6000 GPU to that of the URL ResNet18 and ResNet152 extractors; TSA fine-tuning is used by all methods in this table. It is worth pointing out that, due to the few-shot nature of each episode, meta-testing is generally not time-consuming. Table 16 represents an approximate upper bound on FES computation cost, because 1) the time presented in the table was measured using the largest traffic_sign episode in our cached sample, which contains 497 support instances, whereas smaller episodes consume less time, 2) URL and FLUTE fine-tuning are much less time-consuming than TSA, and 3) Sect. 5.3 shows that a portion of the snapshots does not in fact need to be computed and stored.

FES requires approximately \(2 \times K\) times as much backpropagation as a universal extractor fine-tuned once, where 2 accounts for one fine-tuning run on the cross-validated support set (performed in two splits) and another on the full support set, and K is the number of extractors. This is reflected in Table 16: fine-tuning time for the FES methods is approximately 16 times that of fine-tuning the URL ResNet18 model. The time required to train a FES or ConFES stacking classifier is comparatively trivial, while ReFES requires more time to determine its regularisation strength using grid search with cross-validation. FES stores multiple snapshots of each extractor during fine-tuning, but not all model parameters need to be saved: only weights that are updated during fine-tuning need to be included in a snapshot, as the unchanged weights can be loaded from the original extractor. Common CDFSL fine-tuning algorithms only update a relatively small set of weights: FLUTE fine-tunes batch normalisation weights, URL fine-tunes a feature projection, and TSA fine-tunes channel projections and a feature projection. Therefore, FES snapshots are normally lightweight. Table 16 shows that FES with TSA fine-tuning needs to store approximately 580 M parameters (2.32GB), which can fit in most modern GPUs during inference. As FES can fine-tune its extractors sequentially, its memory requirement is comparable to fine-tuning a single extractor with the same method. On the other hand, FES can easily be parallelised to fine-tune multiple extractors at once, should multiple GPUs be available.
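
A minimal sketch of this lightweight-snapshot idea is given below (assuming the fine-tuning method leaves `requires_grad=False` on all parameters it does not update); only the updated parameters are written to disk, and everything else is restored from the original extractor when a snapshot is loaded.

```python
import torch

def save_snapshot(model, path):
    # Persist only the parameters the fine-tuning algorithm updates
    # (e.g. batch-norm weights for FLUTE, projections for URL/TSA).
    updated = {name: p.detach().cpu().clone()
               for name, p in model.named_parameters() if p.requires_grad}
    torch.save(updated, path)

def load_snapshot(model, path):
    # `model` already holds the original pretrained weights; strict=False
    # overwrites only the fine-tuned subset stored in the snapshot.
    model.load_state_dict(torch.load(path), strict=False)
```

Depending on the fine-tuning method, buffers such as batch-normalisation running statistics may also change and would then need to be included in the snapshot.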

Considering the computational effort required for meta-training, it is worth noting that even though a universal extractor only needs to be trained once, this training process may take days (for ResNet18) to weeks (for ResNet152) on an NVIDIA A6000 GPU; whenever an individual extractor is added or updated, the universal extractor needs to be retrained.

The official URL model was distilled in a process favouring ImageNet: each mini-batch contains as many ImageNet instances as instances from the other seven source domains combined (Li et al., 2021). We distilled an alternative URL model while treating all source domains equally. The comparison is shown in Table 17. The official model performs better in a majority of domains. This indicates that URL distillation may require external knowledge to focus on the right domains to achieve optimal performance. FES and its variants treat all extractors equally a priori and determine their task-specific relevance based purely on the support set.
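
For illustration, the ImageNet-favouring mini-batch composition could look like the following sketch (in-memory lists stand in for the real data loaders; this is our paraphrase, not the official distillation code): half of every mini-batch is drawn from ImageNet and the remaining half is split evenly across the other seven source domains.

```python
import random

def imagenet_favouring_batch(imagenet_pool, other_pools, batch_size=64):
    # `imagenet_pool` is a list of ImageNet instances; `other_pools` is a list
    # of instance lists, one per remaining source domain (seven for Meta-Dataset).
    half = batch_size // 2
    batch = random.sample(imagenet_pool, half)    # as many ImageNet instances...
    per_domain = half // len(other_pools)         # ...as the others combined
    for pool in other_pools:                      # (rounding ignored in this sketch)
        batch += random.sample(pool, per_domain)
    return batch
```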

9 Future work

FES exhibits good CDFSL performance with multiple source domains. It may be feasible to generalise it to other multi-domain learning problems, e.g., multi-domain transfer learning with a more substantial amount of labelled target domain training data.

The heatmaps show that FES generally assigns significant weights to only a small subset of extractor snapshots, implicitly nullifying the majority of snapshots that it deems irrelevant. Pruning strategies could be applied to FES to explicitly eliminate irrelevant snapshots and thereby reduce computational cost.
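
A possible realisation of such pruning is sketched below (the `stacking_weights` matrix of learned FES coefficients, one row per snapshot, and the threshold are hypothetical): snapshots whose total weight mass is negligible are simply dropped from storage and inference.

```python
import torch

def prune_snapshots(snapshots, stacking_weights, threshold=1e-3):
    # stacking_weights: (num_snapshots, num_classes) learned coefficients.
    importance = stacking_weights.abs().sum(dim=1)    # per-snapshot weight mass
    keep = importance > threshold
    return [snap for snap, k in zip(snapshots, keep.tolist()) if k]
```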

FES maintains no prior bias towards any source domain extractor, and its posterior bias depends on the support set only. In scenarios where prior knowledge about the relation between source and target domains is available, it may be beneficial to let the user apply explicit prior biases to certain source domains. This could be achieved through regularisation, e.g., by applying different regularisation pressures to the weights associated with different source domains.
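
One hedged way to realise such a prior bias, assuming the same hypothetical `stacking_weights` layout as above plus a user-supplied penalty per source domain, is a domain-weighted L2 term added to the stacking classifier's loss:

```python
import torch

def domain_weighted_l2(stacking_weights, domain_of_snapshot, penalty_per_domain):
    # stacking_weights: (num_snapshots, num_classes); domain_of_snapshot maps
    # each snapshot index to its source domain; penalty_per_domain encodes the
    # user's prior trust (a smaller penalty expresses a stronger prior preference).
    penalty = stacking_weights.new_zeros(())
    for i, domain in enumerate(domain_of_snapshot):
        penalty = penalty + penalty_per_domain[domain] * stacking_weights[i].pow(2).sum()
    return penalty
```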

10 Conclusion

We present the stacking-based CDFSL method FES and the variants ConFES and ReFES. The FES algorithms create snapshots from fine-tuning independent extractors on the support set, use cross-validation to avoid overfitting from support data reuse, and train a simple stacking classifier to appropriately weight the snapshots. FES, ConFES, and ReFES advance the state-of-the-art on the Meta-Dataset benchmark.

Perhaps more importantly, the FES approaches have some practical advantages in real-world scenarios compared to recent methods based on universal models. FES can work with out-of-the-box heterogeneous extractors. If the extractors are readily available, FES does not require access to their pretraining data in any downstream step. Its stacking classifier requires little hyperparameter tuning. FES is also computationally cheaper, unless the number of few-shot learning tasks is very large, e.g., in the thousands, in which case the total cost of performing FES on all tasks begins to exceed that of training a universal model once. Therefore, for practitioners in the field who wish to use extractors and fine-tuning algorithms specific to their work, FES is likely more flexible and user-friendly than universal-model methods.