1 Introduction

Visual recognition for objects in the “long tail” has been an important challenge to address (Wang et al. 2017; Liu et al. 2019; Kang et al. 2020; Zhou et al. 2020). We often have a very limited amount of data on those objects as they are infrequently observed and/or visual exemplars of them are hard to collect. As such, state-of-the-art methods (e.g., deep learning) cannot be directly applied due to their notorious demand for large amounts of annotated data (Krizhevsky et al. 2017; Simonyan and Zisserman 2014; He et al. 2016).

Fig. 1

A conceptual diagram comparing Few-Shot Learning (FSL) and Generalized Few-Shot Learning (GFSL). GFSL requires extracting inductive biases from seen categories to facilitate efficient learning on few-shot unseen tail categories, while maintaining discernibility on head classes

Few-shot learning (FSL) (Vinyals et al. 2016; Larochelle 2018) is mindful of the limited data per tail concept (i.e., shots). It addresses this challenging problem by distinguishing between the data-rich head categories as seen classes and the data-scarce tail categories as unseen classes. While it is difficult to build classifiers with data from unseen classes, FSL mimics the test scenario by sampling few-shot tasks from the seen class data, and extracts inductive biases for effective classifier acquisition on unseen classes. Instance embedding (Vinyals et al. 2016; Snell et al. 2017; Rusu et al. 2019; Ye et al. 2020), model initialization (Finn et al. 2017; Nichol et al. 2018; Antoniou et al. 2019), image generators (Wang et al. 2018), and optimization flows (Ravi and Larochelle 2017; Lee et al. 2019) are popular forms of meta-knowledge that are usually incorporated into FSL.

This type of learning makes it difficult to directly combine the few-shot classifier for unseen classes with the many-shot classifier for seen classes; however, the ability to recognize all object categories simultaneously is just as essential in object recognition.

In this paper, we study the problem of Generalized Few-Shot Learning (GFSL), which focuses on the joint classification of both data-rich and data-poor categories. Figure 1 illustrates the high-level idea of GFSL, contrasting it with the standard FSL. In particular, our goal is for the model trained on the seen categories to be capable of incorporating the limited unseen class examples, and to make predictions for test data in both the head and the tail of the entire distribution of categories.

One naive GFSL solution is to train a single classifier over the imbalanced long-tail distribution (Hariharan and Girshick 2017; Wang et al. 2017; Liu et al. 2019; Zhou et al. 2020), and re-balance it (Cui et al. 2019; Cao et al. 2019; Kang et al. 2020). One main advantage of such a joint learning objective over all classes is that it characterizes both seen and unseen classes simultaneously. In other words, training of one part (e.g., head) naturally takes the other part (e.g., tail) into consideration, and promotes knowledge transfer between classes. However, such a transductive learning paradigm requires collecting the limited tail data in advance, which is violated in many real-world tasks. In contrast, our learning setup requires an inductive modeling of the tail, which is therefore more challenging as we assume no knowledge about the unseen tail categories is available during the model learning phase.

There are two main challenges in the inductive GFSL problem, including how to construct the many-shot and few-shot classifiers in the GFSL scenario and how to calibrate their predictions.

First, the head and tail classifiers of a GFSL model should encode different properties of all classes towards high discerning ability, and the classifiers for the many-shot part should be adapted based on the tail concepts accordingly. For example, if the unseen classes come from different domains, a single fixed seen classifier can hardly handle their diverse properties and should not be left out of this dynamic process. Furthermore, as observed in the generalized zero-shot learning scenario (Chao et al. 2016), a classifier tends to be over-confident on its familiar concepts and reluctant to make predictions for the unseen ones, which leads to a confidence gap when predicting seen and unseen classes. This calibration issue appears in generalized few-shot learning as well, i.e., seen and unseen classifiers have different confidence ranges. We empirically find that directly optimizing the two objectives together cannot resolve the problem completely.

To this end, we propose ClAssifier SynThesis LEarning (Castle), where the few-shot classifiers are synthesized based on a neural dictionary that captures common characteristics across classes. Such synthesized few-shot classifiers are then used together with the many-shot classifiers, and learned end-to-end. To this purpose, we create a learning scenario by sampling a set of data instances from seen categories, pretending that they come from unseen categories, and apply the synthesized classifiers (based on the above instances) as if they were many-shot classifiers to optimize multi-class classification together with the remaining many-shot seen classifiers. In other words, we construct few-shot classifiers that not only perform well on the few-shot classes but also stay competitive when used in conjunction with many-shot classifiers of populated classes. We argue that such highly contrastive learning benefits few-shot classification in two aspects: (1) it provides high discernibility for the synthesized classifiers; (2) it makes the synthesized classifiers automatically calibrated with the many-shot classifiers.

Going one step further, we then propose Adaptive ClAssifier SynThesis LEarning (a Castle), which has additional flexibility to adapt the many-shot classifiers based on the few-shot training examples. As a result, it allows backward knowledge transfer (Lopez-Paz and Ranzato 2017)—new knowledge learned from novel few-shot training examples can benefit the existing many-shot classifiers. In a Castle, the neural dictionary is the concatenation of shared and task-specific neural bases, whose elements summarize the generality of all visual classes and the specialty of the current few-shot categories. This improved neural dictionary facilitates the adaptation of the many-shot classifiers conditioned on the limited tail training examples. The adapted many-shot classifiers in a Castle are then used together with the (jointly) synthesized few-shot classifiers for GFSL classification.

We first verify the effectiveness of the synthesized GFSL classifiers on multi-domain GFSL tasks, where the unseen classes come from diverse domains. a Castle handles such task heterogeneity best due to its ability to adapt the head classifiers. Next, we empirically validate our approach on two standard benchmark datasets—MiniImageNet (Vinyals et al. 2016) and TieredImageNet (Ren et al. 2018). The proposed approach retains competitive tail concept recognition performance while outperforming existing approaches on generalized few-shot learning under criteria from different aspects. By carefully selecting a prediction bias on the validation set, miscalibrated FSL approaches and other baselines can also perform well in the GFSL scenario; the implicit confidence calibration in Castle and a Castle works as well as or even better than such post-calibration techniques. We note that Castle and a Castle are also applicable to standard few-shot learning, where they stay competitive with and sometimes even outperform state-of-the-art methods when evaluated on two popular FSL benchmarks.

Our contributions are summarized as follows:

  • We propose a framework that synthesizes few-shot classifiers for GFSL with a shared neural dictionary, as well as its adaptive variant that modifies seen many-shot classifiers to allow the backward knowledge transfer.

  • We extend an existing GFSL learning framework into an end-to-end counterpart that learns and contrasts the few-shot and the many-shot classifiers simultaneously, which we observe to benefit the confidence calibration of these two types of classifiers.

  • We empirically demonstrate that a Castle is effective at backward knowledge transfer when learning novel classes in the multi-domain GFSL setting. Meanwhile, we perform a comprehensive evaluation of both existing approaches and ours with criteria from various perspectives on multiple GFSL benchmarks.

In the remainder of this paper, we first describe the problem formulation of GFSL in Sect. 2, and then introduce our Castle/a Castle approach in Sect. 3. We conduct thorough experiments (see Sect. 4 for the setups) to verify the proposed Castle and a Castle across multiple benchmarks. We first conduct a pivot study on multi-domain GFSL benchmarks (Sect. 5) to study the backward transfer capability of different methods. Then we evaluate both a Castle and Castle on popular GFSL (Sect. 6.3) and FSL benchmarks (Sect. 6.4). Finally, we review existing related works in Sect. 7 and discuss their connections to our work.

2 Problem Description

We define a K-shot N-way classification task to be one with N classes to make prediction and K training examples per class for learning. The training set (i.e., the support set) is represented as \({\mathcal {D}}_{\mathbf {train}} = \{(\mathbf{x}_{i}, \mathbf{y}_{i})\}_{i=1}^{NK}\), where \(\mathbf{x}_{i}\in {\mathbb {R}}^{D}\) is an instance and \(\mathbf{y}_{i}\in \{0,1\}^N\) (i.e., one-hot vector) is its label. Similarly, the test set (a.k.a. the query set) is \({\mathcal {D}}_{\mathbf {test}}\), which contains i.i.d. samples from the same distribution as \({\mathcal {D}}_{\mathbf {train}}\).

2.1 Meta-Learning for Few-Shot Learning (FSL)

In many-shot learning, where K is large, a classification model \(f:{\mathbb {R}}^D\rightarrow \{0,1\}^N\) is learned by optimizing over the instances from the head classes:

$$\begin{aligned} {\mathbb {E}}_{(\mathbf{x}_i,\mathbf{y}_i)\in {\mathcal {D}}_{\mathbf {train}}} \ell (f(\mathbf{x}_i), \mathbf{y}_i) \end{aligned}$$

Here f is often instantiated as an embedding function \(\phi (\cdot ):{\mathbb {R}}^D\rightarrow {\mathbb {R}}^{d}\) and a linear classifier \(\varvec{\Theta }\in {\mathbb {R}}^{d\times N}\): \(f(\mathbf{x}_i) = \phi (\mathbf{x}_i)^\top \varvec{\Theta }\). We do not consider the bias term in the linear classifier in the following discussions, and the weight vector of the class n is denoted as \(\varvec{\Theta }_n\). The loss function \(\ell (\cdot , \cdot )\) measures the discrepancy between the prediction and the true label.

On the other hand, few-shot learning (FSL) faces the challenge of transferring knowledge from head visual concepts to tail ones. It assumes two non-overlapping sets of seen (\({\mathcal {S}}\)) and unseen (\({\mathcal {U}}\)) classes. During training, it has access to all seen classes for learning an inductive bias, which is then transferred to rapidly learn a good classifier on \({\mathcal {U}}\) with a small K.

In summary, we aim to minimize the following expected error in FSL:

$$\begin{aligned} \mathbb {E}_{\mathcal {D}^{\mathcal {U}}_{\mathbf {train}}} \mathbb {E}_{(\mathbf{x}_j,\mathbf{y}_j)\in \mathcal {D}^{{\mathcal {U}}}_{\mathbf {test}}} \Big [ \ell \left( f\left( \mathbf{x}_j; \mathcal {D}^{\;{\mathcal {U}}}_{\mathbf {train}}\right) , \mathbf{y}_j\right) \Big ] \end{aligned}$$
(1)

Given any unseen few-shot training set \({\mathcal {D}}^{\mathcal {U}}_\mathbf{train }\), the function f in Eq. 1 maps \({\mathcal {D}}^{\mathcal {U}}_\mathbf{train }\) to the classifiers of unseen classes, which achieves low error of classifying instances in \(\mathcal {D}^{\mathcal {U}}_\mathbf{test }\) via the inference \(f\left( \mathbf{x}_j; {\mathcal {D}}^{\mathcal {U}}_\mathbf{train }\right) \). Here instances in \({\mathcal {D}}^{\mathcal {U}}_\mathbf{test }\) are sampled from the same set of classes as \({\mathcal {D}}^{\mathcal {U}}_\mathbf{train }\).

Since we do not have access to the unseen classes during model training, meta-learning has become an effective framework for FSL (Vinyals et al. 2016; Finn et al. 2017; Snell et al. 2017) in recent years. In particular, a K-shot N-way task \({\mathcal {D}}^{{\mathcal {S}}}_{\mathbf {train}}\) sampled from \({\mathcal {S}}\) is constructed by randomly choosing N categories from \({\mathcal {S}}\) and K examples in each of them. The main idea of meta-learning is to mimic the future few-shot learning scenario by optimizing a shared f across K-shot N-way tasks sampled from the seen class set \({\mathcal {S}}\):

$$\begin{aligned} \min _f {\mathbb {E}}_{({\mathcal {D}}^{\mathcal {S}}_\mathbf{train }, {\mathcal {D}}^{\mathcal {S}}_\mathbf{test })\sim {\mathcal {S}}}\; {\mathbb {E}}_{(\mathbf{x}_j,\mathbf{y}_j)\in {\mathcal {D}}^{\mathcal {S}}_\mathbf{test }} \Big [ \ell \left( f\left( \mathbf{x}_j; {\mathcal {D}}^{\mathcal {S}}_\mathbf{train }\right) , \mathbf{y}_j\right) \Big ]\; \end{aligned}$$
(2)

Equation 2 approximates Eq. 1 with the seen class data, and the meta-model f is applied to different few-shot tasks constructed from the data of seen classes. Following this split use of \({\mathcal {S}}\), tasks and classes related to \({\mathcal {S}}\) are denoted as “meta-training”, and those related to \({\mathcal {U}}\) are called “meta-val/test”. Similar to Eq. 1, a corresponding test set \({\mathcal {D}}^{{\mathcal {S}}}_{\mathbf {test}}\) is sampled from the same N classes in \({\mathcal {S}}\) to evaluate the resulting few-shot classifier \(f\left( \cdot \;; {\mathcal {D}}^{{\mathcal {S}}}_{\mathbf {train}}\right) \). We therefore expect a classifier that “generalizes” well on few-shot tasks sampled from seen classes to also “generalize” well on few-shot tasks drawn from the unseen class set \({\mathcal {U}}\). Once f is learned, for a few-shot task \(\mathcal {D}^{\mathcal {U}}_{\mathbf {train}}\) with unseen classes \(\mathcal {U}\), we can obtain its classifier \(f\left( \cdot \;; \mathcal {D}^{\mathcal {U}}_{\mathbf {train}}\right) \) as in Eq. 2.
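
For concreteness, the sketch below shows one way such K-shot N-way tasks could be drawn from a pool of seen-class data; the helper name and data layout are illustrative assumptions rather than part of our implementation.

```python
import random

def sample_episode(data_by_class, n_way=5, k_shot=1, n_query=15):
    """Sample one K-shot N-way task (support set D_train, query set D_test)
    from a dict mapping each class label to its list of instances."""
    classes = random.sample(list(data_by_class.keys()), n_way)
    support, query = [], []
    for episode_label, c in enumerate(classes):
        chosen = random.sample(data_by_class[c], k_shot + n_query)
        support += [(x, episode_label) for x in chosen[:k_shot]]   # N*K labeled shots
        query += [(x, episode_label) for x in chosen[k_shot:]]     # held-out test instances
    return support, query
```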

Specifically, one popular form of the meta-knowledge to transfer between seen and unseen classes is the instance embedding, i.e., \(f=\phi \), which transforms input examples into a latent space with d dimensions (Vinyals et al. 2016; Snell et al. 2017). \(\phi \) is learned to pull similar objects close while pushing dissimilar ones far away (Koch et al. 2015). For a test instance \(\mathbf{x}_j\), the embedding function \(\phi \) makes a prediction based on a soft nearest neighbor classifier:

$$\begin{aligned} {\hat{y}}_j&= f\left( \mathbf{x}_j; {\mathcal {D}}_{\mathbf {train}}\right) \\&= \underset{(\mathbf{x}_i, \mathbf{y}_i)\in {\mathcal {D}}_{\mathbf {train}}}{\sum } \mathbf {sim}\left( \phi (\mathbf{x}_j), \phi (\mathbf{x}_i)\right) \cdot \mathbf{y}_i \end{aligned}$$

\(\mathbf {sim}(\phi (\mathbf{x}_j), \phi (\mathbf{x}_i))\) measures the similarity between the test instance \(\phi (\mathbf{x}_j)\) and each training instance \(\phi (\mathbf{x}_i)\). When there is more than one instance per class, i.e., \(K>1\), instances of the same class can be averaged to help make the final decision (Snell et al. 2017). By learning a good \(\phi \), important visual features for few-shot classification are distilled, which helps few-shot tasks composed of unseen classes.
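
As an illustration, a minimal PyTorch-style sketch of this prototype-based soft nearest-neighbor prediction is given below; the negative squared Euclidean distance plays the role of the similarity, and all names are placeholders rather than our exact implementation.

```python
import torch

def prototype_predict(phi, support_x, support_y, query_x, n_way):
    """Embed support and query instances, average per-class support embeddings
    into prototypes, and classify queries by similarity to the prototypes."""
    z_s = phi(support_x)                       # [N*K, d] support embeddings
    z_q = phi(query_x)                         # [Q, d]   query embeddings
    protos = torch.stack([z_s[support_y == c].mean(0) for c in range(n_way)])  # [N, d]
    logits = -torch.cdist(z_q, protos) ** 2    # similarity = negative squared distance
    return logits.softmax(dim=-1)              # soft nearest-neighbor prediction
```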

2.2 Meta-learning for Generalized Few-Shot Learning (GFSL)

Different from FSL, which neglects classification of the \(\mathcal {S}\) classes, Generalized Few-Shot Learning (GFSL) aims at building a model that simultaneously predicts over the \(\mathcal {S} \; \cup \; \mathcal {U}\) categories. As a result, such a model needs to deal with many-shot classification of the \(|\mathcal {S}|\) seen classes alongside learning the \(|\mathcal {U}|\) emerging unseen classes. In inductive GFSL, the model only has access to the head part \(\mathcal {S}\) and is required to extract knowledge that facilitates building a joint classifier over seen and unseen categories once the limited tail examples arrive.

In GFSL, we require the function f to map a few-shot training set \(\mathcal {D}^{\;\mathcal {U}}_{\mathbf {train}}\) to a classifier over both seen and unseen classes, which means a GFSL classifier f should have a low expected error as follows:

$$\begin{aligned} \mathbb {E}_{\mathcal {D}^{\mathcal {U}}_{\mathbf {train}}} \mathbb {E}_{(\mathbf{x}_j,\mathbf{y}_j)\in \mathcal {D}^{\mathcal {S}\cup {\;\mathcal {U}}}_{\mathbf {test}}} \Big [ \ell \left( f\left( \mathbf{x}_j; \mathcal {D}^{\;\mathcal {U}}_{\mathbf {train}}, {\varvec{\Theta }}_{\mathcal {S}}\right) , \mathbf{y}_j\right) \Big ] \; \end{aligned}$$
(3)

Different from Eq. 1, in the GFSL setting the meta-model f generates the classifier \(f\left( \cdot ; \mathcal {D}^{\;\mathcal {U}}_{\mathbf {train}}, {\varvec{\Theta }}_{\mathcal {S}}\right) \) by taking as input both the unseen class few-shot training set \(\mathcal {D}^{\;\mathcal {U}}_{\mathbf {train}}\) and a class descriptor set \({\varvec{\Theta }}_{\;\mathcal {S}}\) summarizing the information of the seen classes. Moreover, such a classifier is able to discern instances from the joint set \({\;\mathcal {S}}\cup {\;\mathcal {U}}\).

Similarly, we simulate many GFSL tasks from the seen classes. Each time, we split the seen classes into a tail split with classes \({\mathcal {C}}\), and treat the remaining \(|{\mathcal {S}}| - |{\mathcal {C}}|\) classes as the head split. Eq. 3 is then transformed into:

$$\begin{aligned} \min _{f}\; \sum _{{\mathcal {C}} \subset {\mathcal {S}}} \; \;\sum _{{(\mathbf{x}_j, \mathbf{y}_j)\sim \mathcal {S}}} \ell \left( f\left( \mathbf{x}_j; \mathcal {D}^{\;{\mathcal {C}}}_{\mathbf {train}}, {\mathbf {\Theta }}_{{\mathcal {S}}-{\mathcal {C}}}\right) , \mathbf{y}_j\right) \; \end{aligned}$$
(4)

In particular, the function f outputs an \(|{\mathcal {S}}|\)-way classifier in two steps: (1) for the tail split \({\mathcal {C}}\), it follows what f does in Sect. 2.1 and generates the classifiers of \({\mathcal {C}}\) using their few-shot training examples \({\mathcal {D}}^{\;{\mathcal {C}}}_{\mathbf {train}}\); (2) for the head split \({\mathcal {S}} - {\mathcal {C}}\), it directly makes use of the many-shot classifiers of the \({\mathcal {S}}-{\mathcal {C}}\) classes (instead of asking for training examples of the head split).
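
The sketch below illustrates this two-step construction for one simulated GFSL task, assuming the tail-class weights have already been produced from their few-shot examples (e.g., as prototypes or synthesized classifiers); all names are illustrative.

```python
import torch
import torch.nn.functional as F

def build_joint_classifier(theta_seen, tail_weights, tail_idx):
    """Assemble the |S|-way classifier of a simulated GFSL task: rows of the 'fake'
    tail classes C are replaced by their few-shot classifiers, while the remaining
    head rows reuse the many-shot weights Theta_{S-C}."""
    weights = theta_seen.clone()             # [|S|, d] many-shot classifier weights
    weights[tail_idx] = tail_weights         # plug in the few-shot / synthesized rows
    return F.normalize(weights, dim=-1)      # l2-normalize each class vector
```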

3 Method

There are two key components in Castle and a Castle. First, they employ an effective learning algorithm that trains many-shot classifiers and few-shot classifiers simultaneously, in an end-to-end manner. Second, they contain a classifier composition model, which synthesizes classifiers for the tail classes from the few-shot training data by querying a learnable neural dictionary.

Fig. 2

Illustration of the adaptive GFSL learning process of Castle and a Castle. Different from the stationary learning process (l.h.s.) of Castle, a Castle (r.h.s.) synthesizes the GFSL classifiers for seen and unseen classes in an adaptive manner—the many-shot classifiers of the head classes are also conditioned on the training instances from the tail classes

In Sect. 3.1, we utilize the objective in Eq. 3, which directly contrasts many-shot classifiers with few-shot classifiers via constructing classification tasks over the \(\;{{\mathcal {U}}} \; \cup \; {{\mathcal {S}}}\) categories. By reusing the parameters of the many-shot classifier, the learned model naturally calibrates the prediction ranges over head and tail classes. It enforces the few-shot classifiers to explicitly compete against the many-shot classifiers during model learning, which leads to more discriminative few-shot classifiers in the GFSL setting. In Sect. 3.2, we introduce the classifier composition model, which uses the few-shot training data to query the neural bases and then assembles the target “synthesized classifiers”. Castle uses a set of neural bases shared across tasks and keeps the many-shot classifiers stationary all the time, while in a Castle the neural dictionary contains both shared and task-specific components, so the seen class classifiers are adapted based on their relationship with the unseen class instances.

3.1 Unified Learning of Few-Shot and Many-Shot Classifiers

In addition to transferring knowledge from seen to unseen classes as in FSL, in generalized few-shot learning the few-shot classifiers are required to perform well when used in conjunction with many-shot classifiers. Suppose we have sampled a K-shot N-way few-shot learning task \({{\mathcal {D}}}^{\;{{\mathcal {U}}}}_{\mathbf {train}}\) that contains \(|{{\mathcal {U}}}|\) unseen visual categories; then a GFSL classifier f should have a low expected error as in Eq. 3.

The set of “class descriptors” \({{\varvec{\Theta }}}\) of a classifier is a set of vectors that summarizes the characteristics of its target classes, e.g., some preserved instances from those classes. For the seen classes \({{\mathcal {S}}}\), we set the descriptors to be the union of the weight vectors in the many-shot classifiers \(\varvec{\Theta }_{{{\mathcal {S}}}} = \{\varvec{\Theta }_s\}_{s\in {{\mathcal {S}}}}\) (i.e., the linear classifier over the embedding function \(\phi (\cdot )\)). For each task, the classifier f predicts a test instance in \({{\mathcal {D}}}^{\;{{\mathcal {S}}}\cup {{\mathcal {U}}}}_{\mathbf {test}}\) over both the tail classes \({{\mathcal {U}}}\) and the head classes \({{\mathcal {S}}}\). In other words, based on \({{\mathcal {D}}}^{\;{{\mathcal {U}}}}_{\mathbf {train}}\) and the class descriptor set \(\varvec{\Theta }_{{{\mathcal {S}}}}\) of the many-shot classifier, a randomly sampled instance from \({{\mathcal {S}}}\cup {{\mathcal {U}}}\) should be effectively predicted. In summary, a GFSL classifier generalizes its joint prediction ability to \({{\mathcal {S}}}\cup {{\mathcal {U}}}\) given \({{\mathcal {D}}}^{\;{{\mathcal {U}}}}_{\mathbf {train}}\) and \(\varvec{\Theta }_{{{\mathcal {S}}}}\) during inference.

3.1.1 Neural Dictionary for Classifier Synthesis

We use a neural dictionary to implement the joint prediction \(f\left( \mathbf{x}_j; {\mathcal {D}}^{\;{{\mathcal {U}}}}_{\mathbf {train}}, \varvec{\Theta }_{{{\mathcal {S}}}}\right) \) in Eq. 3. A neural dictionary is a module with a set of neural bases \({\mathcal {B}}\), which represents its input as a weighted combination of those bases based on their similarities. To classify an instance during inference, the neural dictionary takes the limited tail instances \({\mathcal {D}}^{\;{{\mathcal {U}}}}_{\mathbf {train}}\), and possibly the context of the seen classifier descriptor set \(\varvec{\Theta }_{{{\mathcal {S}}}}\), into account, and synthesizes the classifiers for the corresponding classes with \({\mathcal {B}}\). The details of the neural dictionary will be described in the next subsection.

3.1.2 Unified Learning Objective

a Castle and its variants learn a generalizable GFSL classifier via training on the seen class set \({{\mathcal {S}}}\). We sample a “fake” K-shot N-way few-shot task from \({{\mathcal {S}}}\), which contains the categories \({{\mathcal {C}}}\). Given this “fake” few-shot task, we treat the remaining \({{\mathcal {S}}}-{{\mathcal {C}}}\) classes as the “fake” head classes, whose corresponding many-shot classifier descriptor set is \({\varvec{\Theta }}_{{{\mathcal {S}}} - {{\mathcal {C}}}}\). The GFSL model then needs to build a classifier that targets any instance in \({{\mathcal {C}}} \cup ({{\mathcal {S}}} - {{\mathcal {C}}})\). As mentioned before, we synthesize both the few-shot classifiers for \({{\mathcal {C}}}\), i.e., \(\mathbf {W}_{{\mathcal {C}}} = \{\;\mathbf{w}_c \mid c \in {{\mathcal {C}}}\;\}\), and the many-shot classifiers \({\hat{\varvec{\Theta }}}_{{{\mathcal {S}}} - {{\mathcal {C}}}} = \{\;{\hat{\varvec{\Theta }}}_c \mid c \in {{\mathcal {S}}} - {{\mathcal {C}}}\;\}\) with a neural dictionary, so that the composition of one classifier considers the context of the others.

Both the synthesized many-shot classifier (from the “fake” many-shot classes \({{\mathcal {S}}} - {{\mathcal {C}}}\)) \({\hat{\varvec{\Theta }}}_{{{\mathcal {S}}} - {{\mathcal {C}}}}\) and few-shot classifier (from the “fake” few-shot classes \({{\mathcal {C}}}\)) \(\mathbf {W}_{{\mathcal {C}}}\) are combined together to form a joint classifier \({\hat{\mathbf {W}}} = \mathbf {W}_{{\mathcal {C}}} \cup {\hat{\varvec{\Theta }}}_{{{\mathcal {S}}} - {{\mathcal {C}}}}\), over all classes in \({{\mathcal {S}}}\).

Finally, we optimize the learning objective as follows:

$$\begin{aligned} \min _{f}\; \sum _{{\mathcal {C}} \subset {\mathcal {S}}} \; \;\sum _{{(\mathbf{x}_j, \mathbf{y}_j)\sim \mathcal {S}}} \ell \left( \phi (\mathbf{x}_j)^\top {\hat{\mathbf {W}}}, \mathbf{y}_j\right) , \quad {\hat{\mathbf {W}}} = \mathbf {W}_{{\mathcal {C}}} \cup {\hat{\varvec{\Theta }}}_{{{\mathcal {S}}} - {{\mathcal {C}}}} \end{aligned}$$
(5)

In addition to the learnable neural bases \({\mathcal {B}}\), \(\mathrm{U}\) and \(\mathrm{V}\) are two projections in the neural dictionary that facilitate the synthesis of the classifiers, and there is no bias term in our implementation. Although the few-shot classifiers \(\mathbf {W}_{{\mathcal {C}}}\) are synthesized using only K training instances, they are optimized to perform well on all the instances from \({{\mathcal {C}}}\) and, moreover, to perform well against all the instances from the other seen categories. The many-shot classifiers \(\varvec{\Theta }_{{{\mathcal {S}}}}\) are not stationary; they are adapted based on the context of the current few-shot instances (the adaptive GFSL classifier notion is illustrated in Fig. 2). Note that \(\mathbf {W}_{{\mathcal {C}}}\) and \({\hat{\varvec{\Theta }}}_{{{\mathcal {S}}} - {{\mathcal {C}}}}\) are synthesized based on the neural dictionary, which serves as the bridge connecting the “fake” few-shot class set \({{\mathcal {C}}}\) and the “fake” many-shot class set \(({{\mathcal {S}}}-{{\mathcal {C}}})\).

After minimizing the accumulated loss in Eq. 5 over multiple GFSL tasks, the learned model extends its discerning ability to unseen classes so that it achieves a low error in Eq. 3. During inference, a Castle synthesizes the classifiers for unseen classes with the neural dictionary based on their few-shot training examples, and makes a joint prediction over \({{\mathcal {S}}}\cup {{\mathcal {U}}}\) with the help of the adapted many-shot classifier \({\hat{\varvec{\Theta }}}_{{\mathcal {S}}}\).

3.1.3 Reuse Many-Shot Classifiers

We optimize Eq. 5 by using a many-shot classifier trained over \({{\mathcal {S}}}\) to initialize the embedding \(\phi \). In detail, an \(|{{\mathcal {S}}}|\)-way many-shot classifier is trained over all seen classes with the cross-entropy loss, and its backbone is used to initialize the embedding \(\phi \) of the GFSL classifier. We empirically observe that such initialization is essential for the prediction calibration between seen and unseen classes; more details can be found in “Appendix 1” and “Appendix 4”.

3.1.4 Multi-classifier Learning

A natural way to minimize Eq. 5 is to perform a stochastic gradient descent step in each mini-batch by sampling one GFSL task, which contains a K-shot N-way training set together with a set of test instances \((\mathbf{x}_j, \mathbf{y}_j)\) from \({{\mathcal {S}}}\). Increasing the number of GFSL tasks per gradient step clearly improves optimization stability. Therefore, we propose an efficient implementation that utilizes a large number of GFSL tasks to compute gradients. Specifically, we sample two sets of instances from all seen classes, i.e., \({\mathcal {D}}^{{\mathcal {S}}}_{\mathbf {train}}\) and \({\mathcal {D}}^{{\mathcal {S}}}_{\mathbf {test}}\). Then we construct a large number of joint classifiers \(\{ \hat{{\mathbf {W}}}^z = {\mathbf {W}}^z_{{{\mathcal {C}}}} \cup {\hat{\varvec{\Theta }}}^z_{{{\mathcal {S}}} - {{\mathcal {C}}}} \mid z = 1,\ldots ,Z \}\) with different sets of \({{\mathcal {C}}}\), which are then used to compute the loss of Eq. 5 averaged over z. Note that only a single forward pass is needed to obtain the embeddings of the involved instances, and we mimic multiple GFSL tasks through different random partitions into “fake” few-shot and “fake” many-shot classes. In the scope of this paper, the Castle variants always use multi-classifier learning unless explicitly mentioned otherwise. With this, we observe a significant speed-up in terms of convergence (see “Appendix 10” for the ablation study).
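
A minimal sketch of this multi-classifier learning step is shown below; `synthesize` stands in for the neural dictionary of Sect. 3.2 together with the row assembly of the joint classifier, the batch is assumed to contain instances of every seen class, and all names are illustrative rather than our exact implementation.

```python
import torch
import torch.nn.functional as F

def multi_classifier_loss(phi, synthesize, theta_seen, train_x, train_y,
                          test_x, test_y, n_way=5, num_tasks=8):
    """Average the loss of Eq. (5) over several random 'fake' head/tail partitions,
    computing the instance embeddings only once."""
    z_train, z_test = phi(train_x), phi(test_x)        # single forward pass
    num_seen = theta_seen.size(0)
    losses = []
    for _ in range(num_tasks):
        tail = torch.randperm(num_seen)[:n_way]        # random 'fake' tail classes C
        protos = torch.stack([z_train[train_y == c].mean(0) for c in tail])
        w_joint = synthesize(protos, theta_seen, tail)  # [|S|, d] joint classifier W^z
        logits = z_test @ w_joint.t()
        losses.append(F.cross_entropy(logits, test_y))
    return torch.stack(losses).mean()
```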

3.2 Classifier Composition with a Neural Dictionary

The neural dictionary is an essential module for classifier composition in the Castle variants. We first describe the composition of the neural dictionary and the way tail classifiers are synthesized, followed by the adaptation of the head classifiers. The neural dictionary formalizes both head and tail classifiers with common bases, which facilitates transferring relationships between classes. Furthermore, the neural dictionary encodes the shared primitives for composing classifiers, which serve as a kind of meta-knowledge to be transferred across both the seen and the unseen classes.

Fig. 3

Illustration of Adaptive ClAssifier SynThesis LEarning (a Castle). The neural dictionary contains two types of neural bases—the shared component and the task-specific component. During inference, both the prototypes of the tail classes and the descriptors of the seen classifiers are input into the neural dictionary to synthesize the joint classifier over both seen and unseen categories

Similar to Vaswani et al. (2017), we define a neural dictionary as pairs of learnable “key” and “value” embeddings, where each “key” and “value” is associated with a set of neural bases, which are designed to encode shared primitives for composing the classifier of \({\mathcal {S}} \cup {\mathcal {U}}\). Formally, the neural bases contain two sets of elements:

$$\begin{aligned} {\mathcal {B}} = {\mathcal {B}}_{\mathbf {share}} \bigcup {\mathcal {B}}_{\mathbf {specific}} \end{aligned}$$

\({\mathcal {B}}_{\mathbf {share}}\) contains a set of \(|{\mathcal {B}}_{\mathbf {share}}|\) learnable bases \({\mathcal {B}}_{\mathbf {share}} = \{\mathbf{b}_1, \mathbf{b}_2, \ldots , \mathbf{b}_{|{\mathcal {B}}_{\mathbf {share}}|}\}\), where each \(\mathbf{b}_k\in {\mathcal {B}}_{\mathbf {share}}\) lies in \({\mathbb {R}}^d\). This part of the neural dictionary is shared when synthesizing classifiers for different kinds of tasks. \({\mathcal {B}}_{\mathbf {specific}}\) characterizes the local information of the input to the neural dictionary, i.e., the training set \({{\mathcal {D}}}_{\mathbf {train}}\) of the current few-shot task with tail classes and the descriptor set of the many-shot classifier.

The key and value for the neural dictionary are generated based on two linear projections \({\mathrm {U}}\in {\mathbb {R}}^{d\times d}\) and \(\mathrm {V}\in {\mathbb {R}}^{d\times d}\) of elements in the bases \({\mathcal {B}}\). For instance, \(\mathrm {U}\mathbf{b}_k\) and \(\mathrm {V}\mathbf{b}_k\) represent the generated key and value embeddings. For a query to the neural dictionary, it first computes the similarity (a.k.a. the attention) with all keys (\(\mathrm {U}\mathbf{b}_k\)), and the corresponding output of the query is the attention-weighted combination of all the elements in the value set (\(\mathrm {V}\mathbf{b}_k\)).

In a “fake” K-shot N-way few-shot task from \({{\mathcal {S}}}\), the N sampled categories form the set \({{\mathcal {C}}}\). Denote by \({\mathbb {I}}\left[ \;\mathbf{y}_i=c\;\right] \) an indicator that selects instances of the class c. To synthesize the classifier for a class c, we first compute the class signature as the embedding prototype, defined as the average embedding of all K instances of that class (in a K-shot N-way task):

$$\begin{aligned} \mathbf{p}_c = \frac{1}{K} \sum _{(\mathbf{x}_i,\mathbf{y}_i)\in {\mathcal {D}}_{\mathbf {train}}} \phi \left( \mathbf{x}_i\right) \cdot {\mathbb {I}}\left[ \;\mathbf{y}_i=c\;\right] \end{aligned}$$
(6)

The specific component in the neural dictionary bases \({{\mathcal {B}}}\) is the concatenation of the prototype of few-shot instances \(\{\mathbf{p}_c\}\) and the linear classifier descriptors set \(\varvec{\Theta }_{{{\mathcal {S}}}-{{\mathcal {C}}}}\) over the embedding \(\phi \), i.e.,

$$\begin{aligned} {\mathcal {B}}_{\mathbf {specific}} = \{\mathbf{p}_c \mid c \in {{\mathcal {C}}}\} \cup \varvec{\Theta }_{{{\mathcal {S}}}-{{\mathcal {C}}}} \end{aligned}$$
(7)

We then compute the attention coefficients \(\alpha _c\) for assembling the classifier of class c, via measuring the compatibility score between the class signature and the key embeddings of the neural dictionary,

$$\begin{aligned} \alpha ^k_c \propto \exp \left( \mathbf{p}_c^\top \mathrm {U}\mathbf{b}_k \right) , \text {where } k = 1, \ldots , |{\mathcal {B}}| \end{aligned}$$

The coefficient \(\alpha ^k_c\) is then normalized with the sum of compatibility scores over all \(|{\mathcal {B}}|\) bases, which then is used to convexly combine the value embeddings and synthesize the classifier,

$$\begin{aligned} \mathbf{w}_c = \mathbf{p}_c + \sum _{k=1}^{|{\mathcal {B}}|} \alpha ^{k}_{c} \cdot \mathrm {V}\mathbf{b}_k \end{aligned}$$
(8)

We formulate the classifier composition as a summation of the initial prototype embedding \(\mathbf{p}_c\) and the residual component \(\sum _{k=1}^{|{\mathcal {B}}|} \alpha ^{k}_{c} \cdot \mathrm {V}\mathbf{b}_k\). Such a composed classifier is then \(\ell _2\)-normalized and used for (generalized) few-shot classification. Such normalization also fixes the scale differences in the concatenation of the prototype and the descriptors set in the specific neural bases in Eq. 7. The same classifier synthesis process could be applied to the elements in the seen class descriptors set \(\varvec{\Theta }_{{{\mathcal {S}}}-{{\mathcal {C}}}}\), where a head classifier first computes its similarity with the shared neural bases and the tail prototypes, then adapts the classifier to \({\hat{\varvec{\Theta }}}_{{{\mathcal {S}}}-{{\mathcal {C}}}}\) with Eq. 8. Therefore, the seen classifier is also synthesized conditioned on the context of the unseen instances, which promotes the backward knowledge transfer from unseen classes to the seen ones.
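
To make the synthesis procedure of Eqs. (6)–(8) concrete, the following PyTorch-style sketch shows one possible form of the neural dictionary; the module interface, base count, and initialization are assumptions made for illustration, not the exact implementation.

```python
import torch
import torch.nn.functional as F

class NeuralDictionary(torch.nn.Module):
    """Synthesize classifiers from shared + task-specific bases via attention."""

    def __init__(self, dim, num_shared_bases=64):
        super().__init__()
        self.bases = torch.nn.Parameter(torch.randn(num_shared_bases, dim) * 0.02)
        self.U = torch.nn.Linear(dim, dim, bias=False)   # "key" projection
        self.V = torch.nn.Linear(dim, dim, bias=False)   # "value" projection

    def forward(self, protos, theta_head):
        # Neural bases B = shared bases plus the task-specific part (Eq. (7)).
        B = torch.cat([self.bases, protos, theta_head], dim=0)      # [|B|, d]
        keys, values = self.U(B), self.V(B)
        # Every tail prototype and head descriptor queries the dictionary.
        queries = torch.cat([protos, theta_head], dim=0)
        attn = torch.softmax(queries @ keys.t(), dim=-1)            # alpha_c^k
        synthesized = queries + attn @ values                       # residual form of Eq. (8)
        return F.normalize(synthesized, dim=-1)                     # l2-normalized classifiers
```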

Since both the embedding “key” and the classifier “value” are generated from the same set of neural bases, the dictionary encodes a compact set of latent features for a wide range of classes in \({\mathcal {B}}_{\mathbf {share}}\) while leaving the task-specific characteristics to \({\mathcal {B}}_{\mathbf {specific}}\). We hope the learned neural bases contain a rich set of classifier primitives to be transferred to novel compositions of emerging visual categories. Figure 3 illustrates the classifier synthesis process with the neural dictionary.

We denote the degenerated version with only the shared neural bases \({\mathcal {B}} = {\mathcal {B}}_{\mathbf {share}}\) as Castle, which makes a joint prediction with the stationary many-shot classifier \(\varvec{\Theta }_{{{\mathcal {S}}}}\) and the synthesized few-shot classifier.

Remark 1

Changpinyo et al. (2016, 2020) take advantage of a dictionary to synthesize the classifiers for all classes in zero-shot learning. Gidaris and Komodakis (2018) implement a GFSL model in two stages: after pre-training a many-shot classifier, it freezes the embedding and composes the tail classifiers as convex combinations of transforms of the head classifiers. Different from these previous approaches, which construct a dictionary based on a pre-fixed feature embedding, we use a learned embedding function together with the neural dictionary, leading to an end-to-end GFSL framework. Furthermore, different from Gidaris and Komodakis (2018), which keeps the head classifiers stationary, we adapt them conditioned on the tail classes, which can handle the diversity between class domains (as illustrated in Fig. 2). Comprehensive experiments verifying the effectiveness of such an adaptive GFSL classifier can be found in Sects. 5 and 6.3.

Remark 2

The attention mechanism to synthesize the classifier is similar to Vaswani et al. (2017), which is also verified to be effective for adapting embeddings for few-shot learning (Ye et al. 2020). Different from Vaswani et al. (2017), both the specific and shared weights are included in the “key” and “value” part of the neural dictionary. No additional normalization strategies (e.g., layer normalization (Ba et al. 2016) and temperature scaling (Guo et al. 2017)) are used in our module.

4 Experimental Setups

This section details the experimental setups, including the general data splits strategy, the pre-training technique, the specifications of the feature backbone, and the evaluation metrics for GFSL.

Fig. 4

The split of data in the generalized few-shot classification scenario. In addition to a standard dataset like MiniImageNet (blue part), we collect non-overlapping augmented head class instances from the corresponding categories in ImageNet (red part), to measure the classification ability on the seen classes. In a generalized few-shot classification task, few-shot instances are sampled from each of the unseen classes, while the model should be able to predict instances from both the head and tail classes (Color figure online)

4.1 Data Splits

We visualize the general data split strategy in Fig. 4. There are two parts of a dataset for standard meta-learning tasks: the meta-training set for model learning (corresponding to the seen classes), and the meta-val/test parts for model evaluation (corresponding to the unseen classes). To evaluate a GFSL model, we augment the meta-training set with new instances, so that the classification performance on seen classes can be measured. During the inference phase, a few-shot training set from unseen classes is provided to the model, and the model should make a joint prediction over instances from both the head and tail classes. We describe the detailed splits for particular datasets in later sections.

4.2 Pre-training Strategy

Before the meta-training stage, we pre-train a many-shot classifier over the seen classes to find a good initialization for the embedding \(\phi \), and then reuse both this many-shot classifier and the embedding to facilitate the training of a GFSL model. More details of the pre-training stage can be found in “Appendix 1”. In later sections, we verify that this pre-training strategy does not noticeably influence the few-shot classification performance, but it is essential to make the GFSL classifier well-calibrated.

4.3 Feature Network Specification

Following the setting of most recent methods (Qiao et al. 2018; Rusu et al. 2019; Ye et al. 2020), we use ResNet variants (He et al. 2016; Bertinetto et al. 2019) to implement the embedding backbone \(\phi \). Details of the architecture and the optimization strategy are in “Appendix 2”.

4.4 Evaluation Measures

We take advantage of the auxiliary meta-training set of the benchmark datasets during GFSL evaluations; an illustration of the dataset construction can be found in Fig. 4. The notation \(X\rightarrow Y\) with \(X,Y\in \{{{\mathcal {S}}}, {{\mathcal {U}}}, {{\mathcal {S}}}\cup {{\mathcal {U}}}\}\) means computing prediction results for instances from X over the label space of Y. For example, \({{\mathcal {S}}}\rightarrow {{\mathcal {S}}}\cup {{\mathcal {U}}}\) means we first keep the instances coming from the seen class set (\(\mathbf{x}\in {{\mathcal {S}}}\)), and predict them into the joint label space (\(\mathbf{y}\in {{\mathcal {S}}}\cup {{\mathcal {U}}}\)). For a GFSL model, we consider its performance with different measurements.
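
For clarity, the small sketch below shows how such an \(X\rightarrow Y\) accuracy could be computed from joint-space scores; the array layout and names are assumptions made for illustration.

```python
import numpy as np

def xy_accuracy(scores, labels, from_classes, to_classes):
    """Accuracy for the notation X -> Y: keep instances whose true label lies in X,
    then predict them within the label space Y (restricting the score columns to Y)."""
    keep = np.isin(labels, from_classes)            # instances coming from X
    cols = np.asarray(to_classes)
    preds = cols[scores[keep][:, cols].argmax(1)]   # argmax over Y's columns only
    return float((preds == labels[keep]).mean())
```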

4.4.1 Few-Shot Accuracy

Following the standard protocol (Vinyals et al. 2016; Finn et al. 2017; Snell et al. 2017; Ye et al. 2020), we sample 10,000 K-shot N-way tasks from \({{\mathcal {U}}}\) during inference. In detail, we first sample N classes from \({{\mathcal {U}}}\), and then sample \(K+15\) instances for each class. The first NK labeled instances (K instances from each of the N classes) are used to build the few-shot classifier, and the remaining 15N instances (15 from each of the N classes) are used to evaluate the quality of this few-shot classifier. During the test, we consider \(K=1\) and \(K=5\) as in the literature, and vary N over \(\{5,10,15,\ldots ,|{{\mathcal {U}}}|\}\) as a more robust measure. Note that in this test stage, all the instances come from \({{\mathcal {U}}}\) and are predicted into classes of \({{\mathcal {U}}}\) (\({{\mathcal {U}}} \rightarrow {{\mathcal {U}}}\)).

4.4.2 Generalized Few-Shot Accuracy

Different from many-shot and few-shot evaluations, generalized few-shot learning takes the joint instance and label spaces into consideration. In other words, the instances come from \({{\mathcal {S}}}\cup {{\mathcal {U}}}\) and their predicted labels are also in \({{\mathcal {S}}}\cup {{\mathcal {U}}}\) (\({{\mathcal {S}}}\cup {{\mathcal {U}}} \rightarrow {{\mathcal {S}}}\cup {{\mathcal {U}}}\)). This is obviously more difficult than the many-shot (\({{\mathcal {S}}} \rightarrow {{\mathcal {S}}}\)) and few-shot (\({{\mathcal {U}}} \rightarrow {{\mathcal {U}}}\)) tasks. During the test, with a slight abuse of notation, we sample K-shot \((|{\mathcal {S}}|+N)\)-way tasks from \({{\mathcal {S}}} \cup {{\mathcal {U}}}\). Concretely, we first sample a K-shot N-way task from \({{\mathcal {U}}}\), with NK training and 15N test instances, respectively. Then, we randomly sample 15N instances from \({{\mathcal {S}}}\). Thus in a GFSL evaluation task, there are NK labeled instances from \({{\mathcal {U}}}\) and 30N test instances from \({{\mathcal {S}}} \cup {{\mathcal {U}}}\). We compute the accuracy over \({{\mathcal {S}}}\cup {{\mathcal {U}}}\) as the final measure. We abbreviate this criterion as “Mean Acc.” or “Acc.” in later sections.
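
The following sketch makes this sampling procedure concrete; the data containers and names are assumptions made for illustration.

```python
import random

def sample_gfsl_task(unseen_by_class, seen_test_pool, n_way=5, k_shot=1, n_query=15):
    """One GFSL evaluation task: N*K labeled unseen instances, plus 15N unseen and
    15N seen query instances, all predicted over the joint label space S u U."""
    classes = random.sample(list(unseen_by_class.keys()), n_way)
    support, query = [], []
    for c in classes:
        chosen = random.sample(unseen_by_class[c], k_shot + n_query)
        support += [(x, c) for x in chosen[:k_shot]]
        query += [(x, c) for x in chosen[k_shot:]]
    query += random.sample(seen_test_pool, n_way * n_query)   # (x, seen-label) pairs
    return support, query
```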

Fig. 5

An illustration of the harmonic mean based criterion for GFSL evaluation. \({{\mathcal {S}}}\) and \({{\mathcal {U}}}\) denote the seen and unseen instances (\(\mathbf{x}\)) and labels (\(\mathbf{y}\)), respectively. \({{\mathcal {S}}}\cup {{\mathcal {U}}}\) is the joint set of \({{\mathcal {S}}}\) and \({{\mathcal {U}}}\). The notation \(X\rightarrow Y\), with \(X,Y\in \{{{\mathcal {S}}}, {{\mathcal {U}}}, {{\mathcal {S}}}\cup {{\mathcal {U}}}\}\), means computing prediction results for instances from X over the label space of Y. By computing a performance measure (like accuracy) on the joint label space prediction of seen and unseen instances separately, their harmonic mean is computed to obtain the final measure

4.4.3 Generalized Few-Shot \(\Delta \)-Value

Since the problem becomes more difficult when the predicted label space expands from \({{\mathcal {S}}} \rightarrow {{\mathcal {S}}}\) to \({{\mathcal {S}}} \rightarrow {{\mathcal {S}}}\cup {{\mathcal {U}}}\) (and likewise from \({{\mathcal {U}}} \rightarrow {{\mathcal {U}}}\) to \({{\mathcal {U}}} \rightarrow {{\mathcal {S}}}\cup {{\mathcal {U}}}\)), the accuracy of a model drops. To measure how the classification ability of a GFSL model changes when working in a GFSL scenario, Ren et al. (2019) propose the \(\Delta \)-value to measure the average accuracy drop. In detail, for each sampled GFSL task, we first compute its many-shot accuracy (\({{\mathcal {S}}} \rightarrow {{\mathcal {S}}}\)) and few-shot accuracy (\({{\mathcal {U}}} \rightarrow {{\mathcal {U}}}\)). Then we calculate the corresponding accuracy of seen and unseen instances in the joint label space, i.e., \({{\mathcal {S}}} \rightarrow {{\mathcal {S}}}\cup {{\mathcal {U}}}\) and \({{\mathcal {U}}} \rightarrow {{\mathcal {S}}}\cup {{\mathcal {U}}}\). The \(\Delta \)-value is the average decrease of accuracy in these two cases. We abbreviate this criterion as “\(\Delta \)-value” in later sections.
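
In other words, using the \(X\rightarrow Y\) accuracies defined above (a one-line sketch with illustrative names):

```python
def delta_value(acc_s_s, acc_u_u, acc_s_joint, acc_u_joint):
    """Average accuracy drop when the label space expands to the joint space S u U."""
    return 0.5 * ((acc_s_s - acc_s_joint) + (acc_u_u - acc_u_joint))
```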

4.4.4 Generalized Few-Shot Harmonic Mean

Directly computing the accuracy is still biased towards the populated classes, so we also consider the harmonic mean as a more balanced measure (Xian et al. 2017). After computing a performance measurement such as top-1 accuracy for \({{\mathcal {S}}} \rightarrow {{\mathcal {S}}}\cup {{\mathcal {U}}}\) and \({{\mathcal {U}}} \rightarrow {{\mathcal {S}}}\cup {{\mathcal {U}}}\), the harmonic mean of the two is used as the final measure. In other words, denoting the accuracy for \({{\mathcal {S}}} \rightarrow {{\mathcal {S}}}\cup {{\mathcal {U}}}\) and \({{\mathcal {U}}} \rightarrow {{\mathcal {S}}}\cup {{\mathcal {U}}}\) as Acc\(_{{\mathcal {S}}}\) and Acc\(_{{\mathcal {U}}}\), respectively, the value \(\frac{2\mathrm{Acc}_{{\mathcal {S}}}\mathrm{Acc}_{{\mathcal {U}}}}{\mathrm{Acc}_{{\mathcal {S}}} + \mathrm{Acc}_{{\mathcal {U}}}}\) is used as the final measure. An illustration is in Fig. 5. We abbreviate this criterion as “HM” or “HM Acc.” in later sections.

4.4.5 Generalized Few-Shot AUSUC

Chao et al. (2016) propose a calibration-agnostic criterion for generalized zero-shot learning. To avoid the evaluation of a model being influenced by a calibration factor between seen and unseen classes, they propose to first determine the range of the calibration factor over all instances, and then plot the seen-unseen accuracy curve based on different configurations of the calibration value. Finally, the area under the seen-unseen curve is used as a more robust criterion. We follow Chao et al. (2016) to compute the AUSUC value for sampled GFSL tasks. We abbreviate this criterion as “AUSUC” in later sections.
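
A sketch of this procedure is given below; the score layout, the swept range of the calibration factor, and all names are assumptions made for illustration.

```python
import numpy as np

def ausuc(scores, labels, seen_classes, gammas=np.linspace(-10, 10, 201)):
    """Area Under the Seen-Unseen Curve: sweep a calibration factor gamma subtracted
    from the seen-class scores, record (seen acc., unseen acc.) pairs in the joint
    label space, and integrate the resulting curve."""
    seen_cols = np.zeros(scores.shape[1], dtype=bool)
    seen_cols[np.asarray(seen_classes)] = True
    from_seen = seen_cols[labels]                     # which instances come from S
    acc_s, acc_u = [], []
    for g in gammas:
        correct = ((scores - g * seen_cols).argmax(1) == labels)
        acc_s.append(correct[from_seen].mean())
        acc_u.append(correct[~from_seen].mean())
    order = np.argsort(acc_u)                         # integrate with the trapezoidal rule
    return float(np.trapz(np.array(acc_s)[order], np.array(acc_u)[order]))
```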

5 Pivot Study on Multi-domain GFSL

We first present a pivot study to demonstrate the effectiveness of a Castle, which leverages adaptive classifiers synthesized for both seen and unseen classes. To this end, we investigate two multi-domain datasets—“Heterogeneous” and “Office-Home”—with more challenging settings, where a GFSL model is required to transfer knowledge in the backward direction (adapting seen classifiers based on unseen ones) to obtain superior joint classification performance over heterogeneous domains.

Fig. 6

An illustration of the Heterogeneous and Office-Home dataset. Both datasets contain multiple domains. In the Heterogeneous dataset, each class belongs to only one domain, while in Office-Home, a class has images from all three domains

5.1 Dataset

We construct a Heterogeneous dataset based on 5 fine-grained classification datasets, namely AirCraft (Maji et al. 2013), Car-196 (Krause et al. 2013), Caltech-UCSD Birds (CUB) 200-2011 (Wah et al. 2011), Stanford Dog (Khosla et al. 2011), and Indoor Scenes (Quattoni and Torralba 2009). Since these datasets have apparently heterogeneous semantics, we treat images from different datasets as different domains. 20 classes with 50 images each are randomly sampled from each of the 5 datasets to construct the meta-training set. The same sampling strategy is also used to sample classes for the model validation (meta-val) and evaluation (meta-test) sets. Therefore, there are 100 classes in each of the meta-training/val/test sets, with 20 classes from each fine-grained dataset. To evaluate the performance of a GFSL model, we augment the meta-training set by sampling another 15 images from the corresponding categories for each of the seen classes.

We also investigate the Office-Home (Venkateswara et al. 2017) dataset, which originates from a domain adaptation task. There are 65 classes and 4 domains of images per class. Considering the scarcity of images in one particular domain, we select three of the four domains, “Clipart”, “Product”, and “Real World”, to construct our dataset. The number of instances per class and domain is not equal. We randomly sample 25 classes (with all selected domains) for meta-training, 15 classes for meta-validation, and the remaining 25 classes for meta-test. Similarly, we hold out 10 images per domain for each seen class to evaluate the generalized classification ability of a GFSL model.

Table 1 Generalized 1-shot classification performance (mean accuracy and harmonic mean accuracy) on (a) the Heterogeneous dataset with 100 Head and 5 Tail categories and (b) the Office-Home dataset with 25 Head and 5 Tail categories

Note that in addition to the class label, images in these two datasets are also equipped with at least one domain label. In particular, classes in the Heterogeneous dataset belong to a single domain corresponding to “aircraft”, “bird”, “car”, “dog”, or “indoor scene”, while the classes in Office-Home possess images from all 3 domains, namely “Clipart”, “Product”, and “Real World”. An illustration of sampled images (of different domains) from these two datasets is shown in Fig. 6.

The key difference to standard GFSL (cf. Sect. 6.3) is that here the seen categories are collected from multiple (heterogeneous) visual domains and used for training the inductive GFSL model. During the evaluation, the few-shot training instances of tail classes come from one single domain only. With this key difference, the unseen few-shot classes are close to a certain sub-domain of the seen classes and relatively far away from the others. Therefore, a model capable of adapting its seen classifiers can take advantage of this and adapt itself to the domain of the unseen classes.

5.2 Baselines and Comparison Methods

Besides Castle and a Castle, we consider two other baseline models. The first one optimizes Eq. 5 directly but without the neural dictionary, relying on both the (fixed) linear classifier \(\varvec{\Theta }_{{{\mathcal {S}}}}\) and the few-shot prototypes to make a GFSL prediction (we denote it as “Castle\(^-\)”). The second one is DFSL (Gidaris and Komodakis 2018), which requires a two-stage training of the GFSL model: it trains a many-shot classifier with cosine similarity in the first stage, then freezes the backbone as a feature extractor and optimizes a similar form of Eq. 5 by composing new few-shot classifiers as convex combinations of the many-shot classifiers. It can be viewed as a degenerated neural dictionary, where DFSL sets a size-\(|{{\mathcal {S}}}|\) “shared” basis set \({\mathcal {B}}_{\mathbf {share}}\) to the many-shot classifier \(\varvec{\Theta }_{{{\mathcal {S}}}}\). We observe that DFSL is unstable when trained end-to-end, potentially because the few-shot classifier composition uses the many-shot classifiers as bases, and those bases are optimized to be both good bases and good classifiers, which are likely to conflict to some degree. It is also worth noting that all the baselines except a Castle only modify the few-shot classifiers, so it is impossible for them to perform backward knowledge transfer.

5.3 GFSL over Heterogeneous Dataset

The Heterogeneous dataset has 100 seen classes in the meta-training set, 20 per domain. We consider the case where, during the inference, all of the tail classes come from one particular domain. For example, the tail classes are different kinds of birds, and we need to do a joint classification over all seen classes from the heterogeneous domains and the newly arriving tail classes with limited instances. To mimic this inference scenario, we sample “fake” few-shot tasks with classes randomly drawn from one of the five domains, and contrast the discerning ability of the sampled classes against the remaining seen classes as in Eq. 5.

Note that we train DFSL strictly following the strategy in Gidaris and Komodakis (2018), and train the other GFSL models with a pre-trained embedding and the multi-classifier technique to improve the training efficiency. Following Xian et al. (2017), Schönfeld et al. (2019) and Gidaris and Komodakis (2018), we compute the 1-shot 5-way GFSL classification mean accuracy and harmonic mean accuracy over 10,000 sampled tasks, whose results are recorded in Table 1a. \({{\mathcal {S}}} \rightarrow {{\mathcal {S}}}\cup {{\mathcal {U}}}\) and \({{\mathcal {U}}} \rightarrow {{\mathcal {S}}}\cup {{\mathcal {U}}}\) denote the average accuracy of the joint prediction for seen and unseen instances, respectively.

From the results in Table 1a, DFSL does not work well due to its fixed embedding and restricted bases. Castle\(^-\) is able to balance the training accuracy of both seen and unseen classes, benefiting from the pre-training strategy and the unified learning objective, and achieves the highest joint classification performance over unseen classes. The discriminative ability is further improved with the help of the neural dictionary: Castle performs better than its degenerated version, which verifies the effectiveness of the learned neural bases. The neural dictionary encodes the common characteristics among all classes for GFSL classification, so Castle obtains better mean accuracy and harmonic mean accuracy than Castle\(^-\). Since a Castle is able to adapt both many-shot and few-shot classifiers conditioned on the context of the tail instances, it obtains the best GFSL performance in this case. Notably, a Castle gets much higher joint classification accuracy on seen classes than the other methods, which validates its ability to adapt the many-shot classifiers over the seen classes based on the context of the tail classes.

5.4 GFSL over Office-Home Dataset

We also investigate a similar multi-domain GFSL classification task over the Office-Home dataset. However, in this case, a single class can belong to all three domains. We consider the scenario of classifying classes within a single domain, where the domain should be inferred from the limited tail instances. In other words, we train a GFSL model over 25 classes, and each class has 3 sets of instances corresponding to the three domains. In meta-training, a 25-way seen class classifier is constructed. During the inference, the model is provided with another 5-way 1-shot set of unseen class instances from one single domain. The model is required to output a joint classifier for test instances from the whole 30 classes whose domain is the same as that of the unseen class set.

For such a multi-domain GFSL task, we train a GFSL model by keeping the instances of both the few-shot fake tail task and the corresponding test set in the same domain. We use the same set of comparison methods and evaluation protocols as in the previous subsection. The mean accuracy, harmonic mean accuracy, and the specific accuracies for seen and unseen classes are shown in Table 1b.

Due to the ambiguity of domains for each class, the GFSL classification over Office-Home gives rise to a more difficult problem, yet the results in Table 1b reveal a similar trend to those in Table 1a. Since for Office-Home a single GFSL model needs to make the joint prediction over classes from multiple domains conditioned on different configurations of the tail few-shot tasks, stationary seen class classifiers are not suitable for classification over different domains. In this case, a Castle still achieves the best performance over different GFSL criteria, and obtains larger margins over the comparison methods.

6 Experiments on GFSL

In this section, we design experiments on benchmark datasets to validate the effectiveness of Castle and a Castle in GFSL (cf. Sect. 6.3). After a comprehensive comparison with competitive methods using various protocols, we analyze different aspects of GFSL approaches, and observe that post-calibration makes the FSL methods strong GFSL baselines. We verify that Castle/a Castle learn a better calibration between seen and unseen classifiers, and that the neural dictionary makes Castle/a Castle retain their high discerning ability with incrementally added few-shot tail instances. Finally, we show that Castle/a Castle also benefit standard FSL performance (cf. Sect. 6.4).

6.1 Datasets

Two benchmark datasets are used in our experiments. The MiniImageNet dataset (Vinyals et al. 2016) is a subset of the ILSVRC-12 dataset (Russakovsky et al. 2015). There are 100 classes in total, with 600 examples in each class. For evaluation, we follow the split of Ravi and Larochelle (2017) and use 64 of the 100 classes for meta-training, 16 for validation, and 20 for meta-test (model evaluation). In other words, a model is trained on few-shot tasks sampled from the 64-class seen set during meta-training, and the best model is selected based on the few-shot classification performance over the 16-class set. The final model is evaluated on few-shot tasks sampled from the 20 unseen classes.

The TieredImageNet (Ren et al. 2018) is a more complicated version of MiniImageNet. It contains 34 super-categories in total, with 20 for meta-training, 6 for validation (meta-val), and 8 for model testing (meta-test). Each super-category has 10 to 30 classes. In detail, there are 351, 97, and 160 classes for meta-training, meta-validation, and meta-test, respectively. The divergence of the super-concepts leads to a more difficult few-shot classification problem.

Since both datasets are constructed by images from ILSVRC-12, we augment the meta-training set of each dataset by sampling non-overlapping images from the corresponding classes in ILSVRC-12. The auxiliary meta-train set is used to measure the generalized few-shot learning classification performance on the seen class set. For example, for each of the 64 seen classes in the MiniImageNet, we collect 200 more non-overlapping images per class from ILSVRC-12 as the test set for many-shot classification. The illustration of the dataset split is shown in Fig. 4.

6.2 Baselines and Prior Methods

We explore several (strong) choices for deriving classifiers for the seen and unseen classes: Multiclass Classifier (MC) + kNN, where a \(|{{\mathcal {S}}}|\)-way classifier is trained on the seen classes in a standard supervised, many-shot manner, and its embedding is paired with a nearest neighbor classifier for GFSL inference; ProtoNet + ProtoNet, where the embedding trained by the Prototypical Network (Snell et al. 2017) is used, and 100 training instances are sampled from each seen category to form the seen class prototypes; MC + ProtoNet, where we combine the learning objectives of the previous two baselines to jointly learn the MC classifier and the feature embedding. Details of these methods are in “Appendix 3”.
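For clarity, the sketch below illustrates the ProtoNet + ProtoNet baseline: prototypes (class means in the embedding space) for both seen and unseen classes, followed by nearest-prototype prediction in the joint label space. The feature dictionaries and function names are placeholders, and the squared Euclidean distance is assumed as one common choice.

```python
# Minimal sketch of the ProtoNet + ProtoNet baseline: both seen and unseen
# classes are represented by prototypes, and a test instance is assigned to
# the nearest prototype over the union of seen and unseen classes.
import numpy as np

def class_prototypes(features_by_class):
    """features_by_class: dict {class_id: (n_i, d) array of embedded instances}."""
    classes = sorted(features_by_class)
    protos = np.stack([features_by_class[c].mean(axis=0) for c in classes])
    return classes, protos                            # (C,), (C, d)

def joint_predict(query_feats, seen_by_class, unseen_by_class):
    """Nearest-prototype prediction over the joint seen + unseen label space."""
    seen_cls, seen_protos = class_prototypes(seen_by_class)        # 100 shots/class
    unseen_cls, unseen_protos = class_prototypes(unseen_by_class)  # K shots/class
    classes = np.array(seen_cls + unseen_cls)
    protos = np.concatenate([seen_protos, unseen_protos], axis=0)
    dists = ((query_feats[:, None, :] - protos[None, :, :]) ** 2).sum(-1)  # (Q, C)
    return classes[dists.argmin(axis=1)]
```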

Besides, we also compare our approach with L2ML (Wang et al. 2017), Dynamic Few-Shot Learning without forgetting (DFSL) (Gidaris and Komodakis 2018), and the newly proposed Incremental Few-Shot Learning (IFSL) (Ren et al. 2019). For Castle, we use the many-shot classifiers (\(\{\varvec{\Theta }_{{\mathcal {S}}}\}\), cf. Sect. 3.1) for the seen classes and the synthesized classifiers for the unseen classes to classify an instance over all classes, and then select the prediction with the highest confidence score. For a Castle, we adapt the head classifiers to \(\{{\hat{\varvec{\Theta }}}_{{\mathcal {S}}}\}\) with the help of the tail classes.
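This joint inference can be sketched as follows, assuming the seen classifiers \(\{\varvec{\Theta }_{{\mathcal {S}}}\}\) and the synthesized unseen classifiers are available as weight matrices; the cosine-similarity form and temperature here are illustrative choices, not a definitive specification of our classifier.

```python
# Sketch of joint GFSL inference: stack the many-shot seen classifiers with the
# synthesized unseen classifiers and take the most confident class. A
# cosine-similarity classifier is assumed purely for illustration.
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-8):
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def gfsl_logits(query_feats, theta_seen, theta_unseen_synth, temperature=10.0):
    """query_feats: (Q, d); theta_seen: (|S|, d); theta_unseen_synth: (|U|, d)."""
    weights = np.concatenate([theta_seen, theta_unseen_synth], axis=0)  # (|S|+|U|, d)
    logits = temperature * l2_normalize(query_feats) @ l2_normalize(weights).T
    return logits                                     # predict with logits.argmax(axis=1)
```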

Table 2 Generalized Few-shot classification performance (mean accuracy, \(\Delta \)-value, and harmonic mean accuracy) on MiniImageNet when there are 64 Head and 5 Tail categories

6.3 Main Results

We first evaluate all GFSL methods on MiniImageNet with the criteria in Gidaris and Komodakis (2018) and Ren et al. (2019): the mean accuracy over all classes (the higher the better) and the \(\Delta \)-value (the lower the better). An effective GFSL approach not only predicts well on the joint label space (high accuracy) but also keeps its classification ability when moving from the many-shot/few-shot setting to the generalized few-shot case (low \(\Delta \)-value).
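Under our reading of Ren et al. (2019), the \(\Delta \)-value averages the accuracy drop of seen and unseen instances when moving from their separate label spaces to the joint one; the minimal sketch below is written under that assumption.

```python
# Sketch of the Delta-value, assumed here to be the average accuracy drop of
# seen and unseen instances when switching to joint classification.
def delta_value(acc_seen_sep, acc_seen_joint, acc_unseen_sep, acc_unseen_joint):
    """All inputs are top-1 accuracies in [0, 1]; lower Delta is better."""
    drop_seen = acc_seen_sep - acc_seen_joint        # many-shot drop: S -> S vs. S -> S∪U
    drop_unseen = acc_unseen_sep - acc_unseen_joint  # few-shot drop:  U -> U vs. U -> S∪U
    return 0.5 * (drop_seen + drop_unseen)

# e.g. delta_value(0.80, 0.72, 0.60, 0.50) -> 0.09 (a nine-point average drop)
```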

The main results are shown in Table 2. We find that a Castle outperforms all existing methods as well as our proposed baseline systems in terms of mean accuracy. Meanwhile, when looking at the \(\Delta \)-value, the Castle variants are the least affected when switching from predicting \(\textsc {seen}\)/\(\textsc {unseen}\) classes separately to predicting over all classes jointly.

However, we find that neither the mean accuracy nor the \(\Delta \)-value alone is informative enough to characterize a GFSL algorithm’s performance. For example, the baseline ProtoNet + ProtoNet performs better than IFSL in terms of 5-shot mean accuracy but not \(\Delta \)-value. This is consistent with the observation in Ren et al. (2019) that the \(\Delta \)-value should be considered together with the mean accuracy. How, then, should we rank these two systems? To answer this question, we propose an additional evaluation measure: the harmonic mean of the mean accuracies of the seen and unseen categories (Xian et al. 2017; Schönfeld et al. 2019) when they are classified jointly.

6.3.1 Harmonic Mean Accuracy Measures GFSL Performance Better

Since the numbers of seen and unseen classes are usually unequal, e.g., 64 versus 5 in our case, directly computing the mean accuracy over all classes is almost always biased. For example, a many-shot classifier that only assigns samples to seen classes can achieve higher mean accuracy than one that recognizes both seen and unseen classes. Therefore, we argue that the harmonic mean over the per-set mean accuracies better assesses a classifier’s performance, as it is penalized heavily when a classifier ignores some classes (e.g., the MC classifier gets \(0\%\) harmonic mean). Specifically, we compute the top-1 accuracy for instances from seen and unseen classes separately, and take their harmonic mean as the performance measure. The results are included on the right side of Table 2.
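A minimal sketch of this measure: compute the top-1 accuracy of seen and unseen test instances separately (both predicted over the joint label space) and combine them with the harmonic mean.

```python
# Harmonic mean accuracy: a classifier that ignores one side of the label space
# (e.g. the MC baseline on unseen classes) is driven towards 0.
import numpy as np

def harmonic_mean_accuracy(y_true, y_pred, seen_classes):
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    is_seen = np.isin(y_true, list(seen_classes))
    acc_seen = (y_pred[is_seen] == y_true[is_seen]).mean()
    acc_unseen = (y_pred[~is_seen] == y_true[~is_seen]).mean()
    if acc_seen + acc_unseen == 0:
        return 0.0
    return 2 * acc_seen * acc_unseen / (acc_seen + acc_unseen)
```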

We find that the harmonic mean accuracy holistically accounts for both the “absolute” joint classification performance and the “relative” performance drop when classifying over the joint set. For example, the many-shot baseline MC + kNN, despite its good mean accuracy and high \(\Delta \)-value, has extremely low harmonic mean accuracy, as it tends to ignore unseen categories. Meanwhile, Castle and a Castle remain the best when ranked by the harmonic mean accuracy.

Table 3 Generalized Few-shot classification accuracies on MiniImageNet with 64 head categories and 20 tail categories
Table 4 Generalized Few-shot classification accuracy on TieredImageNet with 351 head categories and 160 tail categories

6.3.2 Evaluating GFSL Beyond 5 Unseen Categories

Besides using the harmonic mean accuracy, we argue that another important aspect of evaluating GFSL is to go beyond 5 sampled unseen categories, as such a small tail is rarely the case in the real world. On the contrary, we care most about GFSL with a large number of unseen classes, which also measures the ability of the model to extrapolate to more novel classes than appear in a single unseen class few-shot task. To this end, we consider an extreme case: evaluating GFSL with all available seen and unseen categories on both MiniImageNet and TieredImageNet, and report the results in Tables 3 and 4.

Together with the harmonic mean accuracy over all categories, we also report the tail classification performance, which is a more challenging few-shot classification task (the standard FSL results can be found in Sect. 6.4). In addition, the joint classification accuracies for seen class instances (\({{\mathcal {S}}}\rightarrow {{\mathcal {S}}}\cup {{\mathcal {U}}}\)) and unseen class instances (\({{\mathcal {U}}}\rightarrow {{\mathcal {S}}}\cup {{\mathcal {U}}}\)) are listed.

Methods without an explicit consideration of the head-tail trade-off (e.g., ProtoNet + ProtoNet) fail to make a joint prediction over both seen and unseen classes. We observe that Castle and a Castle outperform all approaches on the unseen categories and, more importantly, on all categories across the two datasets.

6.3.3 Confidence Calibration Matters in GFSL

In generalized zero-shot learning, Chao et al. (2016) identified a significant prediction bias between the classification confidences of seen and unseen classifiers. We find a similar phenomenon in GFSL. For instance, the few-shot ProtoNet + ProtoNet baseline is much more confident when predicting seen categories than unseen categories (the confidence scale is on average 2.1 times higher). To address this issue, we compute a calibration factor based on the meta-validation set of unseen categories, and calibrate the prediction logits by subtracting this factor from the confidences of the seen categories’ predictions. With 5 unseen classes from MiniImageNet, the GFSL results of all comparison methods before and after calibration are shown in Fig. 7. We observe consistent and obvious improvements in the harmonic mean accuracy for all methods. For example, although the FSL approach ProtoNet neglects the classification performance over seen categories outside the sampled task during meta-learning, with such post-calibration it achieves even better harmonic mean accuracy than the GFSL method DFSL (62.70% vs. 62.38%), which makes it a very strong GFSL baseline. Note that Castle and a Castle are the least affected by the selected calibration factor. This suggests that the Castle variants, learned with the unified GFSL objective, have well-calibrated classification confidences and do not require additional data or an extra learning phase to search for this calibration factor.
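A minimal sketch of this post-calibration step is shown below: a single scalar is subtracted from the seen-class logits before the joint argmax, with the scalar itself selected on the meta-validation split (e.g., by a grid search over the harmonic mean accuracy sketched above). Variable names are illustrative.

```python
# Post-calibration sketch: damp the over-confident seen-class scores by a
# scalar gamma chosen on meta-validation, then predict over the joint space.
import numpy as np

def calibrated_predict(logits, n_seen, gamma):
    """logits: (Q, |S|+|U|) with seen classes first; gamma: calibration factor."""
    adjusted = logits.copy()
    adjusted[:, :n_seen] -= gamma
    return adjusted.argmax(axis=1)

# Toy example: logits biased towards seen class 0 get corrected for an unseen query.
logits = np.array([[5.0, 4.8, 3.0, 4.9]])            # 2 seen + 2 unseen classes
print(calibrated_predict(logits, n_seen=2, gamma=0.0))   # [0]
print(calibrated_predict(logits, n_seen=2, gamma=0.5))   # [3]
```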

Fig. 7
figure 7

Calibration’s effect on the 1-shot harmonic mean accuracy on MiniImageNet. Baseline models improve considerably with the help of the calibration factor

Fig. 8
figure 8

The 1-shot AUSUC performance with two configurations of unseen classes on MiniImageNet. The larger the area under the curve, the better the GFSL ability.

Moreover, we use the area under the seen-unseen curve (AUSUC) as a measure for different GFSL algorithms (Chao et al. 2016). AUSUC is a performance measure that factors out the effect of the calibration factor. To do so, we enumerate a large range of calibration factors and subtract each from the confidence scores of the seen classifiers. Through this process, the joint prediction performances over seen and unseen categories, denoted as \({{\mathcal {S}}}\rightarrow {{\mathcal {S}}}\cup {{\mathcal {U}}}\) and \({{\mathcal {U}}}\rightarrow {{\mathcal {S}}}\cup {{\mathcal {U}}}\), vary as the calibration factor changes. For instance, when the calibration factor is infinitely large, we measure a classifier that only predicts unseen categories. Tracing these accuracy pairs yields the seen-unseen curve. The 1-shot GFSL results with 5 unseen classes from MiniImageNet are shown in Fig. 8. We observe that a Castle and Castle achieve the largest area under the curve, which indicates that the Castle variants are generally better than the other algorithms across different calibration factors.
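The sketch below illustrates one way to compute AUSUC consistent with this description: sweep the calibration factor, record the pair of joint accuracies for seen and unseen instances at each value, and integrate the resulting seen-unseen curve. The trapezoidal integration and variable names are assumptions for illustration.

```python
# AUSUC sketch: sweep gamma, collect (seen-joint, unseen-joint) accuracy pairs,
# and integrate the curve along the unseen-accuracy axis.
import numpy as np

def ausuc(logits, labels, n_seen, seen_classes, gammas):
    """logits: (Q, |S|+|U|) with seen classes first; gammas: calibration sweep."""
    logits, labels = np.asarray(logits, dtype=float), np.asarray(labels)
    is_seen = np.isin(labels, list(seen_classes))
    acc_s, acc_u = [], []
    for g in gammas:
        adjusted = logits.copy()
        adjusted[:, :n_seen] -= g                    # same calibration step as above
        pred = adjusted.argmax(axis=1)
        acc_s.append((pred[is_seen] == labels[is_seen]).mean())     # S -> S∪U
        acc_u.append((pred[~is_seen] == labels[~is_seen]).mean())   # U -> S∪U
    order = np.argsort(acc_u)                        # integrate along the unseen axis
    return np.trapz(np.asarray(acc_s)[order], np.asarray(acc_u)[order])
```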

6.3.4 Robust Evaluation of GFSL

Other than the harmonic mean accuracy over all seen and unseen categories shown in Tables 3 and 4, we study how the harmonic mean accuracy changes with an incremental number of unseen tail concepts, i.e., the GFSL performance w.r.t. different numbers of tail concepts. We use this as a robust evaluation of each system’s GFSL capability. In addition to the test instances from the 64 head classes in MiniImageNet, 5 to 20 novel classes are included to compose the generalized few-shot tasks. Concretely, only one instance per novel class is used to construct the tail classifier, with which the model is asked to perform a joint classification of both seen and unseen classes. Figure 9 records the change in generalized few-shot learning performance (harmonic mean) as more unseen classes emerge. We omit the results of MC + kNN and MC + ProtoNet since they are biased towards seen classes and get nearly zero harmonic mean accuracy in all cases. We observe that a Castle consistently outperforms all baseline approaches in each evaluation setup, with a clear margin. We also compute the harmonic mean after selecting the best calibration factor on the meta-val set (cf. Fig. 10). Almost all baseline models improve, consistent with Fig. 7, whereas the results of a Castle and Castle are barely influenced by the post-calibration technique; a Castle retains its superiority in this case.
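The evaluation loop behind Figs. 9 and 10 can be sketched as below; the two callables stand in for the model-specific classifier construction and harmonic mean evaluation and are hypothetical placeholders, not part of the original method.

```python
# Sketch of the robust evaluation loop: grow the number of 1-shot unseen tail
# classes and track how the harmonic mean accuracy changes. Both callables are
# placeholders supplied by the caller.
def robust_gfsl_curve(build_joint_classifier, evaluate_harmonic_mean,
                      seen_classes, unseen_pool, tail_sizes=(5, 10, 15, 20)):
    curve = {}
    for n_tail in tail_sizes:
        tail_classes = unseen_pool[:n_tail]          # one support instance per class
        joint = build_joint_classifier(seen_classes, tail_classes)
        curve[n_tail] = evaluate_harmonic_mean(joint, seen_classes, tail_classes)
    return curve
```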

Fig. 9
figure 9

Results of 1-shot GFSL harmonic mean accuracy with an incremental number of unseen classes on MiniImageNet. Note that MC+kNN and MC+ProtoNet are biased towards seen classes and get nearly zero harmonic mean accuracy

Fig. 10
figure 10

Post-calibrated results of 1-shot GFSL harmonic mean accuracy with an incremental number of unseen classes on MiniImageNet. All methods select their best calibration factors on the meta-val data split

Table 5 Few-shot classification accuracy on MiniImageNet with different types of backbones
Table 6 Few-shot classification accuracy on TieredImageNet with different types of backbones

6.4 Standard Few-Shot Learning

Finally, we evaluate our proposed approaches on two standard few-shot learning benchmarks, i.e., the MiniImageNet and TieredImageNet datasets. In other words, we evaluate the classification performance on few-shot unseen class instances for models trained with our GFSL objective. We compare our approaches with state-of-the-art methods in both 1-shot 5-way and 5-shot 5-way scenarios. We cite the results of the comparison methods from their published papers and note the backbones used to train each FSL model. The mean accuracy and 95% confidence interval are shown in Tables 5 and 6.

It is notable that some comparison methods such as CTM (Li et al. 2019) are evaluated over only 600 unseen class FSL tasks, while we test both Castle and a Castle over 10,000 tasks, leading to more stable results. Castle and a Castle achieve nearly the best 1-shot and 5-shot classification results on both datasets. These results support our hypothesis that jointly learning with many-shot classification forces the few-shot classifiers to be discriminative.

7 Related Work and Discussion

Building a high-quality visual recognition system usually requires a large-scale annotated training set with many shots per category. Large-scale datasets such as ImageNet provide ample instances for popular classes (Russakovsky et al. 2015; Krizhevsky et al. 2017). However, the data-scarce tail of the category distribution matters. For example, a visual search engine needs to handle rare objects of interest (e.g., endangered species) or newly defined items (e.g., new smartphone models), for which only a few instances are available. Directly training a system over all classes is prone to over-fitting and can be biased towards the data-rich categories (Cui et al. 2019; Cao et al. 2019; Kang et al. 2020; Ye et al. 2020; Zhou et al. 2020).

Zero-shot learning (ZSL) (Lampert et al. 2014; Akata et al. 2013; Xian et al. 2017; Changpinyo et al. 2020) is a popular approach to learning without labeled data. By aligning the visual and semantic descriptions of objects, ZSL transfers the relationship between images and attributes learned from seen classes to unseen ones, so as to recognize a novel instance given only its category-wise attributes (Changpinyo et al. 2016, 2017). Generalized ZSL (Chao et al. 2016; Schönfeld et al. 2019) extends this by calibrating the prediction bias to predict jointly over seen and unseen classes. ZSL is limited to recognizing objects with well-defined semantic descriptions; it assumes that the visual appearance of novel categories is harder to obtain than knowledge about their attributes, whereas in the real world we often observe the appearance of objects before learning about their characteristics.

Few-shot learning (FSL) proposes a more realistic setup, where we have access to a very limited number (instead of zero) of visual exemplars from the tail classes (Li et al. 2006; Vinyals et al. 2016). FSL meta-learns an inductive bias from the seen classes, which transfers to the learning process of unseen classes with few training data during model deployment. For example, one line of work uses meta-learned discriminative feature embeddings (Snell et al. 2017; Oreshkin et al. 2018; Rusu et al. 2019; Vuorio et al. 2019; Lee et al. 2019; Ye et al. 2020) together with non-parametric nearest neighbor classifiers to recognize novel classes given a few exemplars. Another line of work learns a common optimization strategy (Ravi and Larochelle 2017; Bertinetto et al. 2019) across few-shot tasks, e.g., a model initialization that can be adapted rapidly with a fixed number of gradient descent steps over the few-shot training data from unseen categories (Finn et al. 2017; Li et al. 2017; Nichol et al. 2018; Lee et al. 2018; Antoniou et al. 2019). FSL has achieved promising results in various domains such as visual recognition (Triantafillou et al. 2017; Lifchitz et al. 2019; Das and Lee 2020), domain adaptation (Dong and Xing 2018; Kang et al. 2018), neural machine translation (Gu et al. 2018), data compression (Wang et al. 2018), and density estimation (Reed et al. 2018). Empirical studies of FSL can be found in Chen et al. (2019) and Triantafillou et al. (2020).

FSL emphasizes building models of the unseen classes, while the simultaneous recognition of the many-shot head categories is also important in real-world use cases. Low-shot learning has been studied in this manner (Hariharan and Girshick 2017; Wang et al. 2018; Gao et al. 2018; Ye et al. 2020; Liu et al. 2019). The main aim is to recognize the entire set of concepts in a transductive learning framework: during the training of the target model, it has access to both the (many-shot) seen and (few-shot) unseen categories. The key difference from our Generalized Few-Shot Learning (GFSL) is that we assume no access to unseen classes in the model learning phase, which requires the model to inductively transfer knowledge from seen classes to unseen ones during the model evaluation phase.

Some previous GFSL approaches (Hariharan and Girshick 2017; Wang et al. 2018; Gao et al. 2018) apply exemplar-based classification paradigms to both seen and unseen categories to resolve the transductive learning problem, which requires recomputing the centroids of seen categories after model updates. Others (Wang et al. 2017; Schönfeld et al. 2019; Liu et al. 2019) usually ignore the explicit relationship between seen and unseen categories and learn separate classifiers. Ren et al. (2019) and Gidaris and Komodakis (2018) propose to solve inductive GFSL by either composing unseen with seen classifiers or meta-learning with a recurrent back-propagation procedure. Gidaris and Komodakis (2018) is the work most related to Castle and a Castle, as it composes the tail classifiers from a convex combination of the many-shot classifiers. Castle differs from Gidaris and Komodakis (2018) in that it presents an end-to-end learnable framework with improved training techniques and employs a shared neural dictionary to compose few-shot classifiers. Moreover, a Castle further relates the knowledge of seen and unseen classes by constructing a neural dictionary with both shared (yet task-agnostic) and task-specific bases, which allows backward knowledge transfer so that seen classifiers benefit from new knowledge of unseen classes. As demonstrated in Sect. 5, a Castle significantly improves the seen classifiers when learning about unseen visual categories over heterogeneous visual domains.

8 Conclusion

A Generalized Few-Shot Learning (GFSL) model accounts for the discriminative ability of both many-shot and few-shot classifiers. In this paper, we propose ClAssifier SynThesis LEarning (Castle) and its adaptive variant (a Castle) to solve the challenging inductive modeling of unseen tail categories in conjunction with seen head ones. Our approach takes advantage of a neural dictionary to learn bases for composing many-shot and few-shot classifiers via a unified learning objective, which better transfers knowledge from seen to unseen classifiers. Our experiments highlight that a Castle especially fits the GFSL scenario with tasks from multiple domains. Both Castle and a Castle not only outperform existing methods in terms of various GFSL criteria but also improve the classifier’s discriminative ability in standard FSL. Future directions include improving the architecture of the neural dictionary and designing better fine-tuning strategies for GFSL.