
1 Introduction

Zero-Shot Learning (ZSL) is the task of recognizing images from unseen categories with a model trained only on seen classes. Current ZSL methods rely on semantic information to classify images of unseen categories, so the task can be formulated as a visual-semantic understanding problem: given candidate text descriptions of a class never seen during training, the goal is to identify images of that unseen class and distinguish them from seen classes and from other unseen classes based on their text descriptions.

In general, current datasets provide two commonly used kinds of semantic information: attribute descriptions (e.g., AWA2 [35], SUN [22], and CUB [34]) and more challenging unstructured text descriptions (e.g., CUB-wiki [5], NAB-wiki [6]). However, these datasets are all small- or medium-sized, with up to a few hundred classes, leaving a significant gap in studying generalization at a realistic scale. In this paper, we focus on large-scale zero-shot image classification. More specifically, we explore the learning limits of a model trained on 1K seen classes when transferred to recognize more than 10 million images from 21K unseen candidate categories in ImageNet-21K [4], which is, to the best of our knowledge, the largest available image classification dataset.

Fig. 1. Intuitive illustration of our proposed HGR-Net. Suppose the ground truth is Hunting Dog; then we can find the real-label path Root \(\rightarrow \) Animal \(\rightarrow \) Domestic Animal \(\rightarrow \) Dog \(\rightarrow \) Hunting Dog. Our goal is to efficiently leverage semantic hierarchical information to better understand visual-language pairs.

A few prior works explored zero-shot image classification on ImageNet-21K. However, performance has plateaued at a few percent Hit@1 on the ImageNet-21K zero-shot classification benchmark [7, 13, 20, 32]. We believe the key challenge is distinguishing among 21K highly fine-grained classes. These methods represent class information by GloVe [23] or Skip-Gram [17] embeddings to align vision-language relationships. However, these low-dimensional features from GloVe or Skip-Gram are not representative enough to distinguish among 21K classes, and they may collapse for fine-grained classes. Besides, most existing works train a held-out classifier to categorize images of unseen classes with different initialization schemes. One common strategy is to initialize the classifier weights with semantic attributes [12, 27, 36, 38], while another is to conduct fully-supervised training with generated unseen images. However, the resulting MLP-like classifier is not expressive enough to capture the fine-grained differences needed to classify an image into a class with high confidence.

To address the challenge of large-scale zero-shot image classification, we propose a novel Hierarchical Graph knowledge Representation network (denoted as HGR-Net). We exploit the conceptual knowledge among classes to improve class distinguishability. Fig. 1 illustrates the intuition behind our method. Suppose the annotated image class label is Hunting Dog. The most straightforward approach is to extract the semantic feature and train a classifier with cross-entropy. However, our experiments show that better leveraging hierarchical conceptual knowledge is important for learning discriminative text representations. The annotated label is Hunting Dog, but every label along the path from the root can also be regarded as a correct label. We therefore incorporate this conceptual semantic knowledge to enhance the network representation.

Moreover, inspired by the recent success of models pre-trained on large-scale vision-language pairs such as CLIP [24] and ALIGN [9], we adopt a dynamic confidence-based classification scheme: we multiply a given image feature with the candidate text features and select the most confident one as the predicted label. Unlike a traditional softmax-based classifier, this scheme is dynamic, and there is no need to train a separate classifier for each task. Besides, the confidence-based scheme helps truly evaluate a model's ability to understand vision-language relationships. For better semantic representation, we adopt a Transformer [29] as the text feature extractor, and follow-up experiments show that a Transformer-based text encoder significantly boosts classification performance.
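Concretely, the confidence-based scheme reduces to a scaled cosine-similarity lookup. Below is a minimal PyTorch sketch under the assumption that image and text features are already extracted and L2-normalized; the function name and temperature value are illustrative, not the exact implementation.

```python
import torch

def zero_shot_predict(image_feats: torch.Tensor, text_feats: torch.Tensor, tau: float = 0.07):
    """Confidence-based prediction: cosine similarity between each image
    feature and every candidate class text feature; the most confident
    class wins. No per-task classifier is trained.

    image_feats: (B, d) tensor; text_feats: (C, d) tensor; both assumed
    L2-normalized along the feature dimension.
    """
    logits = image_feats @ text_feats.t() / tau  # (B, C) scaled cosine similarities
    return logits.argmax(dim=-1)                 # index of the most confident class
```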

Contributions. We consider the most challenging large-scale zero-shot image classification task on ImageNet-21K and propose a novel hierarchical graph representation network, HGR-Net, to model the visual-semantic relationship between seen and unseen classes. Combined with a confidence-based learning scheme and a Transformer to represent class semantic information, HGR-Net achieves new state-of-the-art performance with significantly better results than baselines. We also conducted few-shot evaluations of HGR-Net and found that our method can learn very efficiently by accessing only one example per class. We further conducted extensive experiments on variants of ImageNet-21K, and the results demonstrate the effectiveness of HGR-Net. To better align with our problem, we also propose novel metrics that reflect the conceptual learning ability of different models.

2 Related Work

2.1 Zero-/Few-Shot Learning

Zero-Shot Learning (ZSL) is the task of recognizing images of unseen categories. Our work is most related to semantic-based methods, which learn an alignment between different modalities (i.e., visual and semantic) to facilitate classification [12, 27, 36, 38]. CNZSL [27] proposed to map attributes into the visual space by normalization over classes. In contrast to [27], we map both the semantic text and the images into a common space and calculate confidence there. Our experimental studies show that mapping to a common space achieves higher accuracy. We also explore the Few-Shot Learning (FSL) task, which focuses on classification with access to only a few labeled examples per class during training [28, 37, 39]. Unlike [33], which defines the FSL task as extracting a few training samples from all classes, we take all images from seen classes and select only a few samples from unseen classes during training. Our main goal here is to analyze how performance differs from zero- to one-shot.

2.2 Large-Scale Graphical Zero-Shot Learning

Graph Neural Networks [11] are widely applied to zero-shot learning, where each class is associated with a graph node and each graph edge represents an inter-class relationship. For example, [32] trains a GNN based on the WordNet hierarchy to generate classifiers for unseen classes. Similarly, [10] uses fewer convolutional layers but one additional dense connection layer to propagate features towards distant nodes of the same graph. More recently, [19] adopts a transformer graph convolutional network (TrGCN) to generate class representations. [31] leverages additional neighbor information in the graph with a contrastive objective. Unlike these methods, ours exploits the rich information of a hierarchical structure through class confidence and thus grasps hierarchical relationships among classes to distinguish hard negatives. Besides, some works exploit graphical knowledge without explicitly training a GNN. For example, [15] embeds semantic vectors of the class names using multidimensional scaling (MDS) [3] on WordNet to learn a joint visual-semantic embedding for classification, and [12] learns the similarity between image and class representations in hyperbolic space.

2.3 Visual Representation Learning from Semantic Supervision

Visual representation learning is a challenging task that has been widely studied with supervised and self-supervised methods. Learning visual representations from text supervision [24] is a promising research direction that benefits from the large-scale visual-linguistic pairs collected from the Internet. These methods train a separate encoder for each modality (i.e., visual and language), allowing extension to unseen classes for zero-shot learning. Building on them, [2] improves data efficiency during training, [9] enables learning from larger-scale noisy image-text pairs, and [40] optimizes language prompts for better classifier generation. Our work adopts the pre-trained encoders of [24] but tackles large-scale zero-shot classification over a candidate set of 22K classes instead of at most 1K as in [24].

3 Method

3.1 Problem Definition

Zero-Shot Learning. Let \(\mathcal {C}\) denote the set of all classes, and let \(\mathcal {C}_{s}\) and \(\mathcal {C}_{u}\) be the seen and unseen classes, respectively, where \(\mathcal {C}_{s} \cap \mathcal {C}_{u} =\emptyset \) and \(\mathcal {C}=\mathcal {C}_{s} \cup \mathcal {C}_{u}\). For each class \(c_i\in \mathcal {C}\), a d-dimensional semantic representation vector \(t(c_i)\in \mathbb {R}^{d}\) is provided. We denote the training set \(\mathcal {D}_{tr}=\{(\textbf{x}_i, c_i, t(c_i))\}_{i=1}^{N}\), where \(\textbf{x}_i\) is the i-th training image. In the ZSL setting, given testing images \(\textbf{x}_{te}\), we aim to learn a mapping \(\textbf{x}_{te}\rightarrow \mathcal {C}_u\). In a more challenging setting, dubbed generalized ZSL, we aim to classify images not only from unseen but also from seen categories, i.e., we learn \(\textbf{x}_{te}\rightarrow \mathcal {C}_u \cup \mathcal {C}_s\) covering the entire prediction space.

Fig. 2. HGR-Net: Suppose the annotated single label is D; we can find the tracked label path \(\texttt {R}\rightarrow \cdots \rightarrow \texttt {A}\rightarrow \texttt {B}\rightarrow \texttt {D}\) from the semantic graph extended from WordNet. We first set D as the positive anchor and contrast it with negatives that are sampled siblings of its ancestors (i.e., \(\{\texttt {E}, \texttt {C}, \texttt {G}\}\)) layer by layer. We then iteratively move the positive anchor up to a controlled depth (\(\texttt {B}\), then \(\texttt {A}\)), with layer-by-layer negatives \(\{\texttt {C}, \texttt {G}\}\) and \(\{\texttt {G}\}\), respectively. Finally, we use a memory-efficient adaptive re-weighting strategy to fuse knowledge from different conceptual levels.

Semantic Hierarchical Structure. We assume access to a semantic Directed Acyclic Graph (DAG), \(\mathcal {G}=(\mathcal {V}, \mathcal {E})\), where \(\mathcal {V} = \mathcal {C} \cup \left\{ R\right\} \) and \(\mathcal {E} \subseteq \left\{ (x, y) \mid (x, y) \in \mathcal {C}^{2}, x \ne y\right\} \). Here the tuple (x, y) represents a parent relationship between x and y, meaning y is a more abstract concept than x. We manually add a root node R with an in-degree of 0 to \(\mathcal {G}\). For simplicity, given any node \(c_i\), we denote the ordered set of all its ancestors, obtained by shortest-path search from R to \(c_i\), as \(\mathcal {A}^{c_i} = \left\{ a(c_i)_{j} \right\} _{j=1}^{N^a_i} \subseteq \mathcal {C}\). Similarly, we denote the set of all siblings of \(c_i\) as \(\mathcal {S}^{c_i} = \left\{ s({c_i})_{j} \right\} _{j=1}^{N^s_i} \subseteq \mathcal {C}\). Finally, \(d(c_i) \triangleq |\mathcal {A}^{c_i} |\) is defined as the depth of node \(c_i\).
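For illustration, these graph quantities are straightforward to compute with standard tools. The sketch below assumes the DAG is stored as a networkx.DiGraph with edges pointing from parent to child (the reverse of the (x, y) convention above, for convenience of path search); the helper names are ours.

```python
import networkx as nx

def ancestors(G: nx.DiGraph, root, c):
    """Ordered ancestors A^c of c: nodes on the shortest path from the
    root down to c, excluding the root R and c itself."""
    path = nx.shortest_path(G, source=root, target=c)
    return path[1:-1]

def siblings(G: nx.DiGraph, c):
    """Siblings S^c: nodes sharing at least one parent with c (c excluded)."""
    sibs = set()
    for parent in G.predecessors(c):
        sibs.update(G.successors(parent))
    sibs.discard(c)
    return sibs

def depth(G: nx.DiGraph, root, c):
    """Depth d(c) = |A^c|, the number of ancestors of c."""
    return len(ancestors(G, root, c))
```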

3.2 HGR-Net: Large-Scale ZSL with Hierarchical Graph Representation Learning

We focus on zero-shot learning on variants of ImageNet-21K, to our knowledge the largest image classification dataset. Previous strategies [7, 13, 20, 32] adopt N-way classification over all N seen classes as the training task. However, we argue this is problematic, especially when using a Transformer as the text encoder to obtain class semantic representations. First, in N-way classification, all classes except the single ground truth are treated as negatives. Even though this helps build a well-performing classifier in a fully-supervised scenario, we argue it harms knowledge transfer from seen to unseen classes in ZSL. Second, a batch of N samples must be fed to the text Transformer [29] to obtain their corresponding text representations and compute the class logits afterward. This is acceptable for datasets with a small number of classes, but when the number of classes scales to tens of thousands, as in our case, these operations become computationally prohibitive. Therefore, we propose a memory-efficient hierarchical contrastive objective to learn transferable and discriminative representations for ZSL.

Intuitively, as illustrated in Fig. 2, suppose we have an image sample with annotated ground-truth label D in ImageNet-21K. We can then find a shortest path \(\texttt {R}\rightarrow \cdots \rightarrow \texttt {A}\rightarrow \texttt {B}\rightarrow \texttt {D}\) as the tracked true-label path \(\mathcal {T}_\texttt {D}\). Under our definition of the hierarchical structure, the true labels for the image sample are all the nodes along this path, covering different levels of concepts from abstract to concrete. Therefore, to better leverage this hierarchical semantic knowledge, we propose a hierarchical contrastive loss that conducts two nested levels of bottom-up contrasting with pre-defined ratios.

Specifically, consider node D with depth \(d(\texttt {D})\). In the outer-level loop, we iterate over ground-truth labels at different levels along the ancestor path \(\mathcal {A}^{\texttt {D}}\): we traverse bottom-up \(\texttt {D}\rightarrow \cdots \rightarrow \texttt {B}\rightarrow \texttt {A}\) until reaching the ancestor at depth \(K \cdot d(\texttt {D})\), where K is the outer ratio. In the inner-level loop, fixing the current ground-truth label, we apply the InfoNCE loss [21] layer by layer in a similar bottom-up fashion with an inner ratio M (e.g., when the current ground-truth node is B in Fig. 2, the inner loop considers \(\left\langle \texttt {B}, \texttt {C}\right\rangle \) and \(\left\langle \texttt {B}, \texttt {G}\right\rangle \)). We provide more details in Algorithm 1.

Formally, given an image of \(\boldsymbol{x}\) from class \(c_i\), we define its loss as:

$$\begin{aligned} \mathcal {L_{{\text {cont}}}} = \sum _{j=k_s}^{k_e}g(j, \mathcal {L}_j),\quad \mathcal {L}_{j} = \frac{1}{m_e-m_s+1}\sum _{l=m_s}^{m_e}\mathcal {L}_{j, l}, \end{aligned}$$
(1)

where \(g(\cdot , \cdot )\) is an adaptive attention layer that dynamically re-weights the importance of labels at different levels j; \(j \in [k_s, k_e]\) and \(l \in [m_s, m_e]\) index the outer-level and inner-level loops, respectively. \(k_s\) and \(k_e\) are the start and end layers of the outer loop, while \(m_s\) and \(m_e\) are the start and end layers of the inner loop.

$$\begin{aligned} \mathcal {L}_{j, l}=-\log \frac{\textrm{pos}^j}{\textrm{pos}^j + \textrm{neg}^{j, l}}, \end{aligned}$$
(2)

where

$$\begin{aligned} \begin{aligned} \mathrm {pos^j}&= \exp \left( {\text {sim}}\left( T(\boldsymbol{c}_{j}^{+}), V(\boldsymbol{x})\right) / \tau \right) \end{aligned} \end{aligned}$$
(3)
$$\begin{aligned} \begin{aligned} \qquad \textrm{neg}^{j, l}&= \sum _{q=1}^{n_l} \exp \left( \textrm{sim}\left( T(\boldsymbol{c}^{-}_{j, l, q}), V(\boldsymbol{x})\right) / \tau \right) \end{aligned} \end{aligned}$$
(4)

where \(\textrm{sim}(\cdot , \cdot )\) is the similarity measure and \(\tau \) is the temperature. \(V(\cdot )\) and \(T(\cdot )\) are the visual and text encoders, \(\boldsymbol{c}_j^{+} = a(c_i)_{j}\) is the selected positive label on the tracked label path at layer j, and \(\boldsymbol{c}_{j, l, q}^{-}\) is the q-th sibling negative of the j-th-level ground truth at level l; see Algorithm 1.

Algorithm 1
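To make the loop structure concrete, here is a minimal PyTorch-style sketch of Eqs. (1)-(4). The path and negative-sampling helpers, the exact loop bounds, and the form of the adaptive weights g are our assumptions; this is a sketch, not the authors' released code.

```python
import torch

def hgr_contrastive_loss(img_feat, anc_path, text_encode, neg_sampler,
                         g_weights, K=0.25, M=0.5, tau=0.07):
    """Sketch of Eqs. (1)-(4).
    img_feat: (d,) L2-normalized visual feature V(x).
    anc_path: tracked true-label path from the root's child down to the
              annotated class, so len(anc_path) plays the role of d(c_i).
    text_encode(c): returns the L2-normalized text feature T(c).
    neg_sampler(j, l): sibling negatives for the level-j positive at level l
              (e.g., the TopM strategy of Sec. 4.4).
    g_weights[j]: adaptive weight g(j, .) for level j."""
    depth = len(anc_path)
    k_s = max(int(K * depth), 1)                 # outer loop stops near depth K*d
    loss = img_feat.new_zeros(())
    for j in range(depth - 1, k_s - 1, -1):      # outer loop, bottom-up positives
        pos = torch.exp(img_feat @ text_encode(anc_path[j]) / tau)   # Eq. (3)
        m_s = max(int(M * (j + 1)), 1)           # inner loop stops near depth M*j
        inner = []
        for l in range(j, m_s - 1, -1):          # inner loop, bottom-up negatives
            negs = torch.stack([text_encode(c) for c in neg_sampler(j, l)])
            neg = torch.exp(img_feat @ negs.T / tau).sum()            # Eq. (4)
            inner.append(-torch.log(pos / (pos + neg)))               # Eq. (2)
        loss = loss + g_weights[j] * torch.stack(inner).mean()        # Eq. (1)
    return loss
```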

4 Experiments

4.1 Datasets and the Hierarchical Structure

ImageNet [4] is a widely used large-scale benchmark for ZSL, organized according to the WordNet hierarchy [18], which allows our model to learn the hierarchical relationships among classes. However, the original hierarchical structure is not a DAG (Directed Acyclic Graph) and is thus unsuitable for our method. Therefore, to fit every class into an appropriate location in the hierarchical DAG, we reconstruct the hierarchical structure by removing some classes from the original dataset, which contains seen classes from ImageNet-1K and unseen classes from ImageNet-21K (winter-2021 release), resulting in a modified dataset, ImageNet-21K-D (D for Directed Acyclic Graph).

It is worth noting that although the reconstructed hierarchical tree has 12 layers in total, most nodes reside between the \(2^{{\text {nd}}}\) and \(6^{{\text {th}}}\) layers. Our class-wise dataset split is based on GBU [35], which provides dataset splits for ImageNet-21K with 1K seen classes for training and the remaining 20,841 classes as the test split. Moreover, GBU [35] splits the unseen classes into three levels, “2-hops", “3-hops", and “All", based on the WordNet hierarchy [18]; the “2-hops" unseen concepts are within two hops of the known concepts. After the modification above, training is conducted on the processed ImageNet-1K with 983 seen classes, while 17,295 unseen classes from the processed ImageNet-21K are used for ZSL testing, with 1,533 and 6,898 classes for “2-hops" and “3-hops", respectively. Note that there is no overlap between the seen and unseen classes. The remaining 983 seen classes make our training setting more difficult because our model is exposed to fewer images than with the original 1K seen classes. Please refer to the supplementary materials for more detailed descriptions of the dataset split and reconstruction procedure.

4.2 Implementation Details

We use a modified ResNet-50 [8] from [24] as the image encoder, which replaces the global average pooling layer with an attention mechanism, to obtain visual representations with a feature dimension of 1024. Text descriptions are encoded into tokens, bracketed with start and end tokens, based on byte pair encoding (BPE) [26] with a max length of 77. For text embedding, we use the CLIP [24] Transformer to extract semantic vectors with the same dimension as the visual representation. We obtain the logits from L2-normalized image and text features and calculate the InfoNCE loss [21] layer by layer with an adaptive re-weighting strategy. More specifically, a learnable parameter with size equal to the depth of the hierarchical tree adjusts the weights adaptively.
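For reference, both encoders are available in the public CLIP package; a minimal sketch of extracting a 1024-D text feature with the RN50 model (the prompt text is illustrative):

```python
import torch
import torch.nn.functional as F
import clip  # https://github.com/openai/CLIP

model, preprocess = clip.load("RN50")                  # modified ResNet-50 + text Transformer
tokens = clip.tokenize(["a photo of a hunting dog"])   # BPE tokens, bracketed and padded to length 77
with torch.no_grad():
    text_feat = model.encode_text(tokens)              # (1, 1024) semantic vector for RN50
    text_feat = F.normalize(text_feat, dim=-1)         # L2-normalize before computing logits
```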

Training Details. We use the AdamW optimizer [14] with a learning rate of 3e-7 for all weights except the adaptive attention layer, and the SGD optimizer with a learning rate of 1e-4 for the adaptive layer. When computing the matrix product of visual and text features, a learnable temperature parameter \(\tau \), initialized to 0.07 following [30], scales the logits; we clip the gradient norm of the parameters to prevent training instability. To accelerate training and avoid additional memory costs, we use mixed precision [16], and the model weights are cast to float32 only for optimization. Our proposed HGR model is implemented in PyTorch, and training and testing are conducted on a Tesla V100 GPU with batch sizes of 256 and 512, respectively.
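The following sketch summarizes this setup, assuming a model, adaptive layer, data loader, and loss function as described above; the argument names and the max_norm value are our assumptions.

```python
import torch

def train(model, adaptive_layer, loader, compute_loss):
    """Sketch of the two-optimizer, mixed-precision training loop."""
    adapt_params = list(adaptive_layer.parameters())
    main_params = list(model.parameters())
    opt_main = torch.optim.AdamW(main_params, lr=3e-7)  # all weights but the adaptive layer
    opt_adapt = torch.optim.SGD(adapt_params, lr=1e-4)  # adaptive attention layer only
    scaler = torch.cuda.amp.GradScaler()                # mixed precision [16]
    for images, labels in loader:
        with torch.cuda.amp.autocast():
            loss = compute_loss(images, labels)         # hierarchical loss, Eq. (1)
        opt_main.zero_grad(); opt_adapt.zero_grad()
        scaler.scale(loss).backward()
        scaler.unscale_(opt_main)                       # unscale before clipping
        torch.nn.utils.clip_grad_norm_(main_params, max_norm=1.0)
        scaler.step(opt_main); scaler.step(opt_adapt)
        scaler.update()
```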

4.3 Large-Scale ZSL Performance

Comparison Approaches. We compare with the following approaches:

DeViSE [7] linearly maps visual information to the semantic word-embedding space. The transformation is learned using a hinge ranking loss.

HZSL [12] learns similarity between the image representation and the class representations in the hyperbolic space.

SGCN [10] uses an asymmetrical normalized graph Laplacian to learn the class representations.

DGP [10] separates adjacency matrix into ancestors and descendants and propagates knowledge in two phases with one additional dense connection layer based on the same graph as in GCNZ [32].

CNZSL [27] utilizes a simple but effective class normalization strategy to preserve variance during a forward pass.

FREE [1] incorporates semantic-visual mapping into a unified generative model to address cross-dataset bias.

Evaluation Protocols. We use the typical Top@K criterion, but we also introduce additional metrics. Since a relatively general but correct prediction can be more desirable than a more specific but wrong one, the following metrics evaluate a model's ability to learn the hierarchical relationship between the ground truth and its more general classes (a sketch of both metrics follows the list).

  • Top-Overlap Ratio (TOR). In this metric, we take a further step to also cover all the ancestor nodes of the ground truth class. More concretely, for an image \(x_j\) from class \(c_i\) of depth \(q_{c_i}\), TOR is defined as:

    $$\begin{aligned} TOR(x_j) = \frac{|p_{x_j} \cap \{A_{c_i},c_i\} |}{ q_{c_i} } \end{aligned}$$
    (5)

    where \(c_i\) is the class of image \(x_j\), \(A_{c_i}\) is the union of all ancestors of class \(c_i\), and \(p_{x_j}\) is the predicted class of \(x_j\). In other words, this metric considers the predicted class correct if it is the ground truth or one of its ancestors.

  • Point-Overlap Ratio (POR). In this setting, we let the model predict labels layer by layer. POR is defined as:

    $$\begin{aligned} POR(x_j) = \frac{|P_{x_j} \cap P_{c_i} |}{ q_{c_i} }, \end{aligned}$$
    (6)

    where \(P_{c_i} = \{c_{i_1},c_{i_2},c_{i_3},\cdots ,c_{i_{q_{c_i}-1}},c_{i}\}\) is the set of classes from the root to the ground truth through all its ancestors, and \(P_{x_j}\) is the set of classes predicted by our model layer by layer. \(q_{c_i}\) is the number of ancestors including the ground-truth label, which equals the depth of node \(c_i\). The intersection counts the overlap between correct and predicted points for image \(x_j\).
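A minimal sketch of both metrics, assuming an ancestors mapping as in Sec. 3.1 and taking the depth \(q\) as the number of ancestors; the helper names are ours.

```python
def top_overlap_ratio(pred, gt, ancestors):
    """TOR, Eq. (5): the top-1 prediction counts if it is the ground truth
    or any of its ancestors; normalized by the depth q of the ground truth.
    ancestors[c] lists the ancestors of class c, root side first."""
    true_path = set(ancestors[gt]) | {gt}
    return len({pred} & true_path) / len(ancestors[gt])

def point_overlap_ratio(layer_preds, gt, ancestors):
    """POR, Eq. (6): layer-by-layer predictions intersected with the true
    root-to-ground-truth path, normalized by the same depth q."""
    true_path = set(ancestors[gt]) | {gt}
    return len(set(layer_preds) & true_path) / len(ancestors[gt])
```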

Results Analysis. Table 1 reports the performance of different models in the ImageNet-21K ZSL setting on Top@K and the above-mentioned hierarchical metrics. Our proposed model outperforms SoTA methods on all metrics, including the hierarchical measures, demonstrating its ability to learn the hierarchical relationship between the ground truth and its ancestor classes. We also attach the performance on 2-hops and 3-hops in the supplementary.

4.4 Ablation Studies

Different Attributes. Conventional attribute-based ZSL methods use GloVe [23] or Skip-Gram [17] as text models, while CLIP [24] uses a prompt (i.e., text description) template, “a photo of a [CLASS]", and takes advantage of a Transformer to extract text features. Blindly adding a Transformer to some attribute-based methods, such as HZSL [12], which relies on techniques specific to the attribute setting, yields unreliable results. Therefore, we conducted experiments comparing three selected methods with different attribute representations. The results in Table 2 show that methods based on text embeddings extracted by the CLIP Transformer outperform traditional attribute-based ones, since the low-dimensional representations (500-D) from w2v [17] are not discriminative enough to distinguish unseen classes, while the higher-dimensional (1024-D) text representations significantly boost classification performance. Our HGR-Net gains a significant improvement by using the Transformer compared to the low-dimensional representation from w2v [17].

Table 1. Top@K accuracy, Top-Overlap Ratio (TOR), and Point-Overlap Ratio (POR) for different models on ImageNet-21K-D, testing only on unseen classes. Tr means the text encoder is the CLIP Transformer.
Table 2. Different attributes. DGP(w/o) means without separating the adjacency matrix into ancestors and descendants; CN and INIT in CNZSL mean class normalization and proper initialization, respectively.

Different Outer and Inner Ratios. Fig. 3 shows the Top@1, Top-Overlap Ratio (TOR), and Point-Overlap Ratio (POR) metrics for different K and M, where K, M \(\in [0, 1]\). K and M are the outer and inner ratios that determine how many hierarchy levels are considered in the outer and inner loops, respectively, as illustrated earlier.

We explore different K and M in this setting and observe how performance differs under the three evaluations. Note that when K or M is 0.0, only the current node is involved in the corresponding loop. As K increases, the model tends to achieve higher performance on the hierarchical evaluations. An intuitive explanation is that more conceptual knowledge about ancestor nodes facilitates learning hierarchical relationships among classes.

Fig. 3. Different outer ratio (K) and inner ratio (M)

Different Negative Sampling Strategies. We explore various strategies for choosing negative samples and observe how they differ in performance (a sketch of the TopM strategy follows). Random samples classes uniformly from all classes. TopM samples neighbour nodes from layers \((q_{c_i}-M)\) to \(q_{c_i}\), where \(q_{c_i}\) is the depth of the inner anchor \(c_i\); we set M to 1. Similarity computes the similarity of text features and chooses the samples most similar to the positive as hard negatives. Sibling samples the sibling nodes of the target class. Table 3 indicates that TopM outperforms the other sampling strategies, so we adopt it in the subsequent ablation studies.
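Below is a sketch of the TopM strategy under the same parent-to-child DAG convention assumed in Sec. 3.1; the function name and the exact layer indexing are our assumptions.

```python
import networkx as nx

def topm_negatives(G: nx.DiGraph, root, anchor, M=1):
    """TopM sketch: gather sibling negatives from layers (q - M) to q,
    where q is the depth of the inner anchor; M = 1 as in the paper."""
    nodes = nx.shortest_path(G, source=root, target=anchor)[1:]  # root excluded
    negs = set()
    for node in nodes[-(M + 1):]:            # the anchor and its last M ancestors
        for parent in G.predecessors(node):  # siblings share at least one parent
            negs.update(c for c in G.successors(parent) if c != node)
    return negs
```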

Table 3. Analysis of sampling strategies

Different Weighting Strategies. Orthogonal to negative sampling, we explore in this ablation the influence of different weighting strategies across the levels of the semantic hierarchy. The depth of nodes in the hierarchical structure is not well balanced, and not all layers are reachable for every object, so it is necessary to weight the importance of different layers. We experimented with six weighting strategies and observed how they differ across evaluations. In Table 4, Increasing linearly gives more weight to deeper layers, and \(\uparrow \) non-linear increases the weights of deeper layers exponentially. To balance Top@K and the hierarchical evaluations, we propose the adaptive weighting method to obtain a comprehensive result. More specifically, Adaptive uses a learnable parameter with size equal to the depth of the hierarchical tree to adjust the weights adaptively (a sketch follows). We attach the exact formulation of the different weighting strategies in the supplementary.
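A minimal sketch of the Adaptive strategy; the softmax normalization over the learnable per-depth weights is our assumption of one plausible form, not the paper's exact formulation.

```python
import torch

class AdaptiveWeights(torch.nn.Module):
    """One learnable weight per depth of the hierarchical tree."""
    def __init__(self, tree_depth: int):
        super().__init__()
        self.w = torch.nn.Parameter(torch.zeros(tree_depth))  # uniform after softmax at init

    def forward(self, level_losses: torch.Tensor, levels: torch.Tensor):
        # level_losses: (n,) per-level losses; levels: (n,) their depth indices.
        weights = torch.softmax(self.w, dim=0)[levels]  # pick weights for the used levels
        return (weights * level_losses).sum()
```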

Table 4. Analysis of the weighting strategies when re-weighting in both inner and outer loop with K=0.25 and M=0.5.

Experiment on ImageNet-21K-P [25]. ImageNet-21K-P [25] is a pre-processed version of ImageNet-21K that removes infrequent classes, halving the total number of classes while removing only 13% of the original images; it contains 12,358,688 images from 11,221 classes. We take the intersection of this dataset with our modified ImageNet-21K dataset to ensure DAG structure consistency. The split details (class- and sample-wise) are given in the supplementary.

We show experimental results on ImageNet-21K-P comparing our method to different SoTA variants. Our model performs better on this smaller dataset than on the original larger one in Table 1 and outstrips all previous ZSL methods. We present the main results in Table 5 and attach more results in the supplementary.

Table 5. Results on ImageNet-21K-P [25]. DGP(w/o) [10] means without separating the adjacency matrix into ancestors and descendants; CN and INIT in CNZSL [27] mean class normalization and proper initialization, respectively; Tr is short for the CLIP Transformer.
Fig. 4. Zero-shot retrieved images. The first column shows unseen class names and the corresponding confidence, the middle columns show correct retrievals, and the last column shows incorrect images with their true labels.

4.5 Qualitative Results

Figure 4 shows several images retrieved by our model in the ZSL setting on ImageNet-21K-D. The task is to retrieve images of an unseen class given its semantic representation. Each row shows three correctly retrieved images and one incorrect image with its true label. Even when our algorithm retrieves images from the wrong class, they are still visually similar to the ground truth. For instance, the true label hurling and the wrong class American football both belong to sports games, and images from both contain several athletes wearing helmets against a grass background. We also show some prediction examples in Fig. 5 to present Point-Overlap results.

4.6 Low-shot Classification on Large-Scale Dataset

While zero-shot experiments are the primary goal of this paper, we also explore the effectiveness of our method in the low-shot setting against several baselines. Unlike pure few-shot learning, our support set comprises two parts. To be consistent with the ZSL experiments, all training samples of the 983 seen classes are used for low-shot training. For the 17,295 unseen classes used in the ZSL setting, k images per class (k = 1, 2, 3, 5, 10) are randomly sampled for training in the low-shot setting (a sketch of the split follows), and the remaining images are used for testing. The main goal of this experiment is to show how much models improve from zero to one shot and whether our hierarchy-based method generalizes well in the low-shot scenario. Figure 6 illustrates the few-shot results comparing our model to various SoTA methods. Although our approach gains only modest Top@K improvements from 1 to 10 shots, the jump from 0 to 1 shot is twice that from 1 to 10 shots, showing that our model is an efficient learner.
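For clarity, the k-shot support/test split over the unseen classes can be sketched as follows; the function and variable names are ours.

```python
import random
from collections import defaultdict

def low_shot_split(unseen_samples, k, seed=0):
    """Sample k images per unseen class for the support (training) set;
    the remaining images of each class form the test set.
    unseen_samples: iterable of (image, label) pairs."""
    by_class = defaultdict(list)
    for image, label in unseen_samples:
        by_class[label].append(image)
    rng = random.Random(seed)
    support, test = [], []
    for label, images in by_class.items():
        rng.shuffle(images)
        support += [(img, label) for img in images[:k]]
        test += [(img, label) for img in images[k:]]
    return support, test
```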

Fig. 5. Predicted examples illustrating Point-Overlap. The first row of each example shows the correct points from the root to the ground truth, and the second row shows the predicted points. Hit points are highlighted in bold.

Fig. 6. Few-shot comparison. DGP(w/o) [10] means without separating the adjacency matrix into ancestors and descendants; CN and INIT in CNZSL [27] mean class normalization and proper initialization, respectively; Tr is short for the CLIP [24] Transformer.

5 Conclusions

This paper focuses on scaling up visual recognition of unseen classes to tens of thousands of categories. We proposed a novel hierarchical graph knowledge representation framework for confidence-based classification and demonstrated significantly better performance than baselines on the ImageNet-21K-D and ImageNet-21K-P benchmarks, achieving a new SOTA. We hope our work helps ease future research on zero-shot learning and paves a steady way towards understanding large-scale visual-language relationships with limited data.