1 Introduction

Relying on massive labeled training datasets, image recognition methods have made significant progress in recent years (Simonyan and Zisserman 2014; Szegedy et al. 2015; He et al. 2016). However, it is unrealistic to label all object classes, so these supervised learning methods struggle to recognize objects that are unseen during training. By contrast, Zero-Shot Learning (ZSL) (Norouzi et al. 2014; Zhang and Saligrama 2016b; Zhu et al. 2019; Long et al. 2017) only requires labeled images of seen classes (source domain) and is capable of recognizing images of unseen classes (target domain). The seen and unseen domains often share a common semantic space, which defines how unseen classes are semantically related to seen classes. The most popular semantic space is based on semantic attributes, where each seen or unseen class is represented by an attribute vector. Besides the semantic space, images of the source and target domains are also related and represented in a common visual feature space.

To associate the semantic space and the visual space, existing methods often rely on the source domain data to learn a compatible projection function that maps one space to the other, or two compatible projection functions that map both spaces into one common embedding space. At test time, to recognize an image in the target domain, the semantic vectors of all unseen classes and the visual feature of this image are projected into the embedding space using the learned function, and then nearest neighbor (NN) search is performed to find the best-matching class. However, due to the distribution difference between the source and target domains, especially for in-the-wild data, the learned projection function often suffers from the well-known domain shift problem.

To compensate for this domain gap, transductive zero-shot learning (Fu et al. 2015) assumes that the semantic information (e.g., attributes) of unseen classes and the visual features of all test images are known in advance. Different techniques such as domain adaptation (Kodirov et al. 2015) and label propagation (Zhu and Ghahramani 2002) have been investigated to better leverage this extra information. Recently, Zhang and Saligrama (2016b) found that visual features of unseen target instances can be separated into different clusters even though their labels are unknown, as shown in Fig. 1. By incorporating this prior as a regularization term, a better label assignment matrix can be solved with a non-convex optimization procedure. However, their method still has three main limitations: 1) The visual structure prior is not used to learn a better projection, which directly limits the upper bound of the final performance. 2) They model the ZSL problem in a less-scalable batch mode, which requires re-optimization whenever new test data are added. 3) Like most previous transductive ZSL methods, they do not consider a realistic scenario where many unrelated images may exist in the test dataset and invalidate the above prior.

Considering the first problem, we model the above visual structure prior as a new constraint to learn a better projection function rather than using a pre-defined one. In this paper, we adopt the visual space as the embedding space and project the semantic space into it. To learn the projection function, we not only use the projection constraint of the source domain data as in Zhang et al. (2017) but also impose the aforementioned visual structure constraint on the target domain data. Specifically, during training, we first project all the unseen semantic classes into the visual space, then consider three different strategies (“Chamfer-distance based”, “Bipartite matching based” and “Wasserstein-distance based”) to align the projected unseen semantic centers and the visual centers. However, due to the lack of test instance labels in the ZSL setting, we approximate these real visual centers with unsupervised clustering algorithms (e.g., K-Means). Note that in our method we directly apply the learned projection function in an online testing mode, which is friendlier for real applications than the batch mode in Zhang and Saligrama (2016b).

The third problem is common for data in the wild, where many unrelated images, which belong to neither seen nor unseen classes, often exist in the target domain. Applying current unsupervised clustering algorithms directly to the whole test dataset will therefore generate invalid visual centers and misguide the learning of the projection function. To overcome this problem, we further propose a new training strategy that first filters out highly unrelated images and then uses the remaining ones to impose the proposed visual constraint. The filtering is based on the distance from each image to its closest visual center. Considering that the initial projection function usually deviates more than the later refined one, we further propose another, progressive training strategy that gradually refines the centers and increases the threshold in a class-specific manner. To the best of our knowledge, we are the first to consider this realistic transductive ZSL configuration with unrelated test images.

We demonstrate the effectiveness of the proposed visual structure constraint on many widely-used datasets. Extensive experiments on both small datasets like AwA2 and large-scale datasets like ImageNet show that the proposed visual structure constraint consistently brings substantial performance gains and achieves state-of-the-art results.

To summarize, our contributions are three-fold as below:

  • We propose three different types of visual structure constraints for the projection learning of transductive ZSL to alleviate its domain shift problem.

  • We introduce a realistic transductive ZSL configuration where many unrelated images exist in the test dataset and propose two new training strategies to make our method work for it.

  • Experiments demonstrate that the proposed visual structure constraint can bring substantial performance gain consistently and achieve state-of-the-art results.

This paper extends our preliminary conference version (Wan et al. 2019) in four ways. First, we propose a new progressive training strategy for the realistic setting of data in the wild. Second, we carry out extensive experiments on the large-scale ImageNet dataset and push the current state-of-the-art performance forward by a large margin, which further demonstrates the scalability and effectiveness of our method. Third, we extend the Wasserstein-based visual structure constraint to its instance-based counterpart and show its robustness even in some extremely challenging cases. Fourth, we present a deeper and more detailed analysis and discussion of the proposed method.

2 Related Work

Zero-shot Learning and Semantic Spaces. Although deep supervised learning has gained enormous success for the image recognition task (Simonyan and Zisserman 2014; Szegedy et al. 2015; He et al. 2016), it relies on large-scale human annotations and cannot generalize to new unseen classes. Zero-shot learning bridges the gap between training on seen classes and testing on unseen classes via different kinds of semantic spaces. Among them, the most popular and effective one is the attribute-based semantic space (Farhadi et al. 2009; Akata et al. 2013; Lampert et al. 2009). The attributes are often designed by experts to represent class-specific properties and are demonstrated to be reliable and effective. To incorporate more attributes for fine-grained recognition tasks, text description-based semantic spaces are proposed in Reed et al. (2016), Zhang et al. (2017), Elhoseiny et al. (2013), Zhu et al. (2018), which provide a more natural language interface. Compared to these two labor-intensive types of semantic spaces, word vector-based methods (Frome et al. 2013; Miller 1995; Norouzi et al. 2014; Wang et al. 2018) can learn the semantic space from large text corpora automatically and save much human labor. However, they often suffer from the visual-semantic discrepancy problem and achieve inferior performance.

Embedding Spaces. To relate the visual features of test images and the semantic attributes of unseen classes, three different embedding spaces are often used by existing zero-shot learning methods: the original semantic space, the original visual space, and a newly-learned common intermediate embedding space. Specifically, existing methods learn a projection function from the visual space to the semantic space (Lampert et al. 2014; Reed et al. 2016; Frome et al. 2013; Annadani and Biswas 2018) or from the semantic space to the visual space (Kodirov et al. 2015; Shigeto et al. 2015; Zhang et al. 2017) in the first two cases, or learn two projection functions from the semantic and visual spaces to the common embedding space (Changpinyo et al. 2016; Lu 2016; Zhang and Saligrama 2016a) respectively; the projection can be modeled as a regression or ranking problem solved by conventional methods or deep neural networks. In our method, we also use the visual space as the embedding space, because it has been demonstrated to help alleviate the hubness problem (Radovanović et al. 2010) in Zhang et al. (2017). More importantly, our structure constraint is based on the separability of visual features of unseen classes.

GAN-based Zero-shot Learning. With the recent progress of generative adversarial networks (GAN) (Goodfellow et al. 2014), a series of GAN-based zero-shot learning methods (Xian et al. 2018; Felix et al. 2018; Xian et al. 2019; Huang et al. 2019; Li et al. 2019) have been proposed to solve the domain shift problem. For example, f-CLSWGAN (Xian et al. 2018) directly employs the conditional Wasserstein GAN (Arjovsky et al. 2017) to generate sample features of unseen classes based on the class attributes, which can then be used for subsequent supervised learning. To ensure that the synthetic visual features can be mapped back to their semantic features, Felix et al. (2018) further add a multi-modal cycle-consistency constraint to GAN-based ZSL. In LisGAN (Li et al. 2019), a novel sampling strategy is proposed to improve ZSL performance. To achieve any-shot learning, Xian et al. (2019) design the f-VAEGAN-D2 framework, which combines both GAN and VAE (Kingma and Welling 2013).

Although the above-mentioned GAN-based ZSL methods achieve promising results in both conventional ZSL and generalized ZSL, they have several drawbacks compared to projection-based methods. First, like other generative models, GAN-based ZSL suffers from training instability. Besides, enough samples are needed to narrow the gap between the generated distributions and the real feature distributions; when the training samples for each seen class are insufficient, the generator usually cannot capture the underlying distribution well. In contrast, our method alleviates this problem because training only relies on the visual centers. Last but not least, GAN-based ZSL cannot be directly applied under the proposed realistic transductive setting because of the overlap between the noise distribution and the real distribution.

Domain Shift and Transductive Zero-shot Learning. Since the target unseen classes are disjoint from the source seen classes, their underlying data distributions might be very different. In such cases, if we only learn the projection functions from the source domain data, the learned functions will be biased and incur a serious performance drop when directly applied to the target domain. This problem was first identified by Fu et al. (2015) and alleviated by a new transductive zero-shot learning setting. Under this setting, the unseen target domain data are leveraged in the learning stage to improve generalization, and different strategies have been proposed. For example, Fu et al. (2015) use a multi-view semantic space alignment process to correlate different semantic views and the low-level feature view by projecting them onto a common latent space learned with multi-view canonical correlation analysis. Unsupervised domain adaptation is utilized in Kodirov et al. (2015) through a novel regularized sparse coding framework. Our method also belongs to the transductive family, and the proposed visual structure constraint is inspired by Zhang and Saligrama (2016b), but we address their aforementioned drawbacks and improve performance significantly. To the best of our knowledge, we are the first to utilize the structure of the visual space to constrain projection function learning in transductive ZSL.

3 Method

Problem Definition. In ZSL setting, we have \(N_{s}\) source labeled samples \({\mathcal {D}}_{s} \equiv \{(x^{s}_i, y^{s}_i)\}^{N_{s}}_{i=1}\), where \(x^s_i\) is an image and \(y^{s}_i \in {\mathcal {Y}}_{s}=\{1,\ldots ,S \} \) is the corresponding label within total S source classes. We are also given \(N_{u}\) unlabeled target samples \({\mathcal {D}}_{u} \equiv \{x^{u}_i\}^{N_{u}}_{i=1}\) that are from target classes \({\mathcal {Y}}_{u}=\{S+1,\ldots ,S+U \}\). According to the definition of ZSL, there is no overlap between source seen classes \({\mathcal {Y}}_{s}\) and target unseen classes \({\mathcal {Y}}_{u}\), i.e.\({\mathcal {Y}}_{s} \cap {\mathcal {Y}}_{u}=\emptyset \). But they are associated in a common semantic space, which is the knowledge bridge between the source and target domain. As explained before, we adopt semantic attribute space here, where each class \(z \in {\mathcal {Y}}_{s} \cup {\mathcal {Y}}_{u}\) is represented with a pre-defined auxiliary attribute vector \(a_{z} \in {\mathcal {A}}\). The goal of ZSL is to predict the label \(y^u_i \in {\mathcal {Y}}_{u}\) given \(x^u_i\) with no labeled training data.

Besides the semantic representations, images of the source and target domains are also represented with their corresponding features in a common visual space. To relate these two spaces, projection functions are often learned to project these two spaces into a common embedding space. Following Zhang et al. (2017), we directly use the visual space as the embedding space, in which case only one projection function is needed. The key problem then becomes how to learn a better and generalized projection function.

Fig. 1

Visualization of the CNN feature distribution of 10 target unseen classes on the AwA2 dataset using t-SNE; the features can be clearly clustered around several real centers (stars). Squares (VCL) are synthetic centers projected by the projection function learned only from source domain data. By incorporating our visual structure constraint, our method (BMVSc) learns a better projection function, and the generated synthetic semantic centers are much closer to the real visual centers

Motivation. Our method is inspired by Zhang and Saligrama (2016b), whose idea is shown in Fig. 1: thanks to the powerful discriminative ability of pre-trained CNNs, the visual features of test images can be separated into different clusters. We denote the centers of these clusters as real centers. If we had a perfect projection function to project the semantic attributes into the visual space, the projected points (called synthetic centers) would align with the real centers. However, due to the domain shift problem, the projection function learned on the source domain is imperfect, so the synthetic centers (i.e., the VCL centers in Fig. 1) deviate from the real centers, and NN search among these deviated centers to assign labels causes inferior ZSL performance. Based on the above analysis, besides the source domain data, we attempt to take advantage of the existing discriminative structure of the target unseen class clusters during the learning of the projection function, i.e., the learned projection function should also align the synthetic centers with the real ones in the target domain.

3.1 Visual Center Learning (VCL)

In this section, we first introduce a baseline method that learns the projection function f only with source domain data. Specifically, a CNN feature extractor \(\phi (\cdot )\) is used to convert each image x into a d-dimensional feature vector \(\phi (x)\in {\mathcal {R}}^{d\times 1}\). According to the above analysis, each class i of the source domain has a real visual center \(c^s_i\), defined as the mean of all feature vectors in the corresponding class. For the projection function f, a two-layer embedding network is utilized to transform the source semantic attribute \(a^s_i\) into the corresponding synthetic center \(c^{syn,s}_i\):

$$\begin{aligned} c^{syn,s}_i= f(a^s_i) = \sigma _2(w_2^T\sigma _1 (w_1^Ta^s_i)) \end{aligned}$$
(1)

where \(\sigma _1 (\cdot )\) and \(\sigma _2 (\cdot )\) denote non-linear operations (Leaky ReLU with a negative slope of 0.2 by default), and \(w_1\) and \(w_2\) are the weights of the two fully connected layers to be learned.

Since the correspondence relationship is given in the source domain, we directly adopt the simple mean square loss to minimize the distance between synthetic centers \(c^{syn}\) and real centers c in the visual feature space:

$$\begin{aligned} {\mathcal {L}}_{MSE}= \frac{1}{S}\sum _{i=1}^{S}\left\| c_i^{syn,s} -c_{i}^s \right\| ^{2}_{2} +\lambda \Psi (w_1,w_2) \end{aligned}$$
(2)

where \(\Psi (\cdot )\) is an L2-norm parameter regularizer that reduces model complexity; we empirically set \(\lambda =0.0005\). Note that, different from Zhang et al. (2017), which trains with a large number of individual instances of each class i, we utilize a single cluster center \(c^s_i\) to represent each object class and train the model with just a few center points. This is based on the observation that instances of the same category form compact clusters, and it makes our method much more computationally efficient.
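To make the baseline concrete, the following is a minimal PyTorch-style sketch of the two-layer embedding network in Equation (1) and the center-based loss in Equation (2). The layer sizes follow the implementation details in Sect. 4, but the variable names and the explicit weight regularizer are illustrative assumptions, not taken from any released code.

```python
import torch
import torch.nn as nn

class EmbeddingNet(nn.Module):
    """Two-layer projection f: semantic attribute -> synthetic visual center (Eq. 1)."""
    def __init__(self, attr_dim, feat_dim=2048, hidden_dim=2048):
        super().__init__()
        self.fc1 = nn.Linear(attr_dim, hidden_dim)
        self.fc2 = nn.Linear(hidden_dim, feat_dim)
        self.act = nn.LeakyReLU(0.2)  # sigma_1 and sigma_2

    def forward(self, attrs):                     # attrs: (S, attr_dim)
        return self.act(self.fc2(self.act(self.fc1(attrs))))

def vcl_loss(model, seen_attrs, seen_centers, lam=5e-4):
    """Center-based MSE loss with L2 weight regularization (Eq. 2)."""
    syn_centers = model(seen_attrs)               # (S, feat_dim)
    mse = ((syn_centers - seen_centers) ** 2).sum(dim=1).mean()
    reg = sum((w ** 2).sum() for w in model.parameters())
    return mse + lam * reg
```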

When performing ZSL prediction, we first project the semantic attributes of each unseen class i to its corresponding synthetic visual center \(c_i^{syn,u}\) using the learned embedding network as in Equation (1). Then, for a test image \(x^u_k\), its classification result \(i^*\) is obtained by selecting the nearest synthetic center in the visual space. Formally,

$$\begin{aligned} i^*=\underset{i}{\textit{argmin }}\left\| \phi (x^u_k) - c_i^{syn,u}\right\| _{2} \end{aligned}$$
(3)
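A corresponding sketch of the prediction rule in Equation (3), assuming the test image features and the synthetic unseen centers are already available as tensors:

```python
import torch

def predict_unseen(test_feats, unseen_syn_centers):
    """Nearest-synthetic-center classification (Eq. 3).

    test_feats: (N_u, d) CNN features of test images
    unseen_syn_centers: (U, d) centers produced by the embedding network
    Returns the index of the predicted unseen class for each test image.
    """
    dists = torch.cdist(test_feats, unseen_syn_centers)  # pairwise Euclidean distances
    return dists.argmin(dim=1)
```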

3.2 Chamfer-Distance-based Visual Structure Constraint (CDVSc)

As discussed earlier, the domain shift problem causes the target synthetic centers \(c^{syn,u}\) to deviate from the real centers \(c^u\), thus yielding poor performance. Intuitively, if we also require the projected synthetic centers to align with the real ones using the target domain data during the learning process, a better projection function can be learned. However, due to the lack of label information in the target domain, it is impossible to directly obtain the real centers of unseen classes. Considering that the visual features of unseen classes can be separated into different clusters, we utilize unsupervised clustering algorithms (K-means by default) to obtain approximated real centers. To validate this, we plot the K-means centers in Fig. 1; they are very close to the real ones.

After obtaining the cluster centers, aligning the structure of the cluster centers with that of the synthetic centers can be formulated as reducing the distance between two unordered high-dimensional point sets. Inspired by work on 3D point clouds (Fan et al. 2017), a symmetric Chamfer-distance constraint is proposed to solve this structure matching problem:

$$\begin{aligned} \begin{aligned} {\mathcal {L}}_{CD}&= \sum _{x\in C^{syn,u}}min_{y\in C^{clu,u}}\left\| x -y \right\| ^{2}_{2} \\&\quad +\sum _{y\in C^{clu,u} }min_{x\in C^{syn,u}}\left\| x -y \right\| ^{2}_{2} \end{aligned} \end{aligned}$$
(4)

where \( C^{clu,u}\) indicates the cluster centers of unseen classes obtained by K-means algorithm. \(C^{syn,u}\) represents the synthetic target centers obtained with the learned projection. Combining the above constraint, the final loss function to train the embedding network is defined as:

$$\begin{aligned} {\mathcal {L}}_{CDVSc}= {\mathcal {L}}_{MSE}+\beta \times {\mathcal {L}}_{CD} \end{aligned}$$
(5)
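The Chamfer constraint of Equation (4) reduces to two directional nearest-neighbor sums. A minimal sketch, assuming both center sets are given as 2-D tensors:

```python
import torch

def chamfer_loss(syn_centers, cluster_centers):
    """Symmetric Chamfer distance between two unordered center sets (Eq. 4)."""
    d2 = torch.cdist(syn_centers, cluster_centers) ** 2   # squared pairwise distances
    return d2.min(dim=1).values.sum() + d2.min(dim=0).values.sum()

# Training objective of Eq. (5): total = vcl_loss(...) + beta * chamfer_loss(...)
```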
Fig. 2

Illustration of the possible many-to-one matching problem in the Chamfer-distance based visual structure constraint, which is avoided in the bipartite-matching based visual structure constraint

3.3 Bipartite-Matching-based Visual Structure Constraint (BMVSc)

CDVSc helps to preserve the structural similarity of the two sets, but many-to-one matching may sometimes happen with the Chamfer-distance constraint, as shown in Fig. 2. This conflicts with an important prior in ZSL: the matching between synthetic and real centers should be strictly one-to-one. When undesirable many-to-one matching arises, the synthetic centers are pulled toward incorrect real centers, resulting in inferior performance. To address this issue, we replace CDVSc with a bipartite-matching based visual structure constraint (BMVSc), which aims to find a global minimum distance between the two sets while satisfying the strict one-to-one matching principle.

We first consider a graph \(G=(V,E)\) with two partitions A and B, where A is the set of all synthetic centers \(C^{syn,u}\) and B contains all cluster centers of target classes. Let \(dis _ { i j } \in D\) denote the Euclidean distance between \(i \in A\) and \(j \in B\), and let element \(x_{ij}\) of the assignment matrix X define the matching relationship between i and j. To find a one-to-one minimum matching between real and synthetic centers, we formulate it as a min-weight perfect matching problem and optimize it as follows:

$$\begin{aligned} \begin{aligned}&{\mathcal {L}}_{BM} = \underset{X}{min} \sum _ { i , j } dis _ { i j } x _ { i j }, \\&s.t. \quad \sum _ { j } x _ { i j } = 1, \sum _ { i } x _ { i j } = 1, x _ { i j } \in \{ 0,1 \} \end{aligned} \end{aligned}$$
(6)

In this formulation, the assignment matrix X strictly conforms to the one-to-one principle. To solve this linear programming problem, we employ the Kuhn-Munkres algorithm, whose time complexity is \(O(V^2E)\). As with CDVSc, we combine the MSE loss and this bipartite matching loss:

$$\begin{aligned} {\mathcal {L}}_{BMVSc}= {\mathcal {L}}_{MSE}+\beta \times {\mathcal {L}}_{BM} \end{aligned}$$
(7)
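In practice, the min-weight perfect matching of Equation (6) can be solved with an off-the-shelf Hungarian (Kuhn-Munkres) solver. The sketch below uses scipy.optimize.linear_sum_assignment on detached distances and back-propagates only through the matched squared distances; this is one reasonable way to realize \({\mathcal {L}}_{BM}\), not necessarily the exact implementation.

```python
import torch
from scipy.optimize import linear_sum_assignment

def bipartite_matching_loss(syn_centers, cluster_centers):
    """One-to-one matching loss (Eq. 6)."""
    d2 = torch.cdist(syn_centers, cluster_centers) ** 2            # (U, U) cost matrix
    rows, cols = linear_sum_assignment(d2.detach().cpu().numpy())  # Kuhn-Munkres solver
    rows, cols = torch.as_tensor(rows), torch.as_tensor(cols)
    return d2[rows, cols].sum()                                    # sum of matched distances
```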

3.4 Wasserstein-Distance-based Visual Structure Constraint (WDVSc)

Ideally, if the synthetic and real centers are compact and accurate, the above bipartite-matching based distance can achieve a globally optimal matching. However, this assumption is not always valid, especially for the approximated cluster centers of target classes, because these centers may be noisy and inaccurate. Therefore, instead of using a hard-valued (0 or 1) assignment matrix X, we further consider a soft-valued X whose entries represent the joint probability distribution between the two point sets, by using the Wasserstein distance. In optimal transport theory, the Wasserstein distance is a well-established metric for measuring the distance between two discrete distributions; its goal is to find the optimal “coupling matrix” X that achieves the minimum matching distance. Its objective is the same as Equation (6), but X takes soft joint probability values rather than values in \(\{0,1\}\). In this paper, to make this optimization problem convex and solve it more efficiently, we adopt the entropy-regularized optimal transport formulation solved with Sinkhorn iterations (Cuturi 2013).

$$\begin{aligned} \begin{aligned} {\mathcal {L}}_{WD}&= \underset{X}{min}\sum _ { i , j } dis _ { i j } x _ { i j } - \epsilon H(X) \\ H(X)&= - \sum _{i j} x_{ij}log{x_{ij}} \end{aligned} \end{aligned}$$
(8)

where H(X) is the entropy of the matrix X and \(\epsilon \) is the regularization coefficient that encourages a smoother assignment matrix X. The solution can be written in the form \(X = diag\{u\}K diag\{v\}\) (\(diag\{v\}\) returns a square diagonal matrix with vector v on the main diagonal), and the Sinkhorn iterations alternate between updating u and v:

$$\begin{aligned} u^{(k+1)} = \frac{a}{Kv^{(k)}}, \qquad \qquad v^{(k+1)} = \frac{b}{K^Tu^{(k+1)}} \end{aligned}$$
(9)

Here, K is a kernel matrix computed from D. Since these iterations solve a regularized version of the original problem, the resulting Wasserstein distance is sometimes called the Sinkhorn distance. Combining this constraint, the final loss function is:

$$\begin{aligned} {\mathcal {L}}_{WDVSc} = {\mathcal {L}}_{MSE} + \beta \times {\mathcal {L}}_{WD} \end{aligned}$$
(10)
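A compact sketch of the Sinkhorn computation behind Equations (8)-(9), assuming uniform marginals a and b over the two center sets and treating the resulting coupling as a constant during back-propagation. The cost normalization and the choice \(\epsilon =0.01\) are illustrative, and a log-domain implementation may be preferable for very small \(\epsilon \).

```python
import torch

def wasserstein_loss(syn_centers, cluster_centers, eps=0.01, n_iters=200):
    """Entropy-regularized OT loss (Eq. 8) via Sinkhorn iterations (Eq. 9)."""
    d = torch.cdist(syn_centers, cluster_centers) ** 2      # cost matrix D
    cost = d / d.max().detach()                             # normalize for numerical stability
    with torch.no_grad():
        K = torch.exp(-cost / eps)                          # kernel matrix computed from D
        a = torch.full((cost.shape[0],), 1.0 / cost.shape[0], device=cost.device)
        b = torch.full((cost.shape[1],), 1.0 / cost.shape[1], device=cost.device)
        u = torch.ones_like(a)
        for _ in range(n_iters):                            # alternate u / v updates
            v = b / (K.t() @ u)
            u = a / (K @ v)
        X = torch.diag(u) @ K @ torch.diag(v)               # soft coupling matrix
    return (X * d).sum()                                    # sum_ij dis_ij * x_ij

# Eq. (10): total = vcl_loss(...) + beta * wasserstein_loss(...)
```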

3.5 A Realistic Setting with Unrelated Test Data in the Wild

Existing transductive ZSL methods always assume that all images in the test dataset belong to the pre-defined target unseen classes. However, in the wild, many unrelated images that do not belong to any defined class may exist. If we directly perform clustering on all these unfiltered images, the approximated real centers will deviate far from the real centers of the unseen classes and make the proposed visual structure constraint invalid. This problem also exists in Zhang and Saligrama (2016b). To handle this realistic and challenging setting, we propose two new training strategies for our method. Both strategies assume that the domain shift problem is not severe, so that the initial unseen synthetic centers learned by VCL are roughly correct. This assumption is reasonable because the source and target classes are related in a common attribute space.

Specifically, in the first strategy, we use the baseline VCL to obtain the initial unseen synthetic centers and then directly filter out unrelated images with a single feature distance threshold before applying CDVSc, BMVSc, or WDVSc. The details of this training strategy are shown in Algorithm 1. Since the visual structure constraint is only leveraged in one step, this strategy is called the “one-step” training strategy. Because we do not have the ground truth labels of the target domain and many noisy samples exist, we use the farthest sample feature distance in the source domain to set the feature distance threshold \(\lambda _{dist}\).

By contrast, in the second strategy, rather than using a single feature distance threshold, we proceed progressively, aiming to obtain more accurate synthetic centers and to use larger feature distance thresholds over multiple alternating steps. At each step t, we not only use the latest projection function \(f_{t-1}\) to obtain more accurate synthetic visual centers but also use a looser feature distance threshold to include more target domain data. The following experiments demonstrate that this progressive training strategy is generally more robust and yields better results.
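Since the pseudo-code of Algorithms 1 and 2 is not reproduced here, the following is a rough sketch of both strategies as described above. The helpers kmeans_centers and train_with_constraint are hypothetical, standing in for K-means clustering and one optimization round with the chosen structure constraint, and a single global threshold is used instead of the class-specific schedule for brevity.

```python
import torch

def filter_related(feats, syn_centers, dist_thresh):
    """Keep test features whose distance to the nearest synthetic center is
    below the threshold; the remaining images are treated as unrelated."""
    d = torch.cdist(feats, syn_centers).min(dim=1).values
    return feats[d < dist_thresh]

def one_step_strategy(feats, unseen_attrs, model, dist_thresh, n_clusters):
    """Sketch of Algorithm 1: filter once with the VCL centers, then train."""
    kept = filter_related(feats, model(unseen_attrs), dist_thresh)
    centers = kmeans_centers(kept, n_clusters)        # hypothetical helper
    train_with_constraint(model, centers)             # hypothetical helper
    return model

def progressive_strategy(feats, unseen_attrs, model, thresholds, n_clusters):
    """Sketch of Algorithm 2: alternately refine the centers and loosen the threshold."""
    for thresh in thresholds:                          # thresholds increase over steps
        kept = filter_related(feats, model(unseen_attrs), thresh)
        centers = kmeans_centers(kept, n_clusters)     # hypothetical helper
        train_with_constraint(model, centers)          # hypothetical helper
    return model
```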

Algorithm 1 (the one-step training strategy)
Algorithm 2 (the progressive training strategy)

4 Experiments

Implementation Details. We adopt a pretrained ResNet-101 to extract visual features unless otherwise specified. All images are resized to \(224 \times 224\) without any data augmentation, and the dimension of the extracted features is 2048. The hidden unit numbers of the two FC layers in the embedding network are both 2048. Both visual features and semantic attributes are L2-normalized. Using the Adam optimizer, our method is trained for 5000 epochs with a fixed learning rate of 0.0001. The weight \(\beta \) in CDVSc and BMVSc is cross-validated in \([10^{-4},10^{-3}]\) and \([10^{-5},10^{-4}]\) respectively, while WDVSc directly uses \(\beta =0.001\) because of its very stable performance.

Datasets. Extensive experiments are conducted on three widely-used ZSL benchmark datasets, i.e., Animals with Attributes 2 (AwA2) (Xian et al. 2018), Caltech-UCSD Birds 200-2011 (CUB) (Wah et al. 2011) and Scene UNderstanding (SUN) (Patterson et al. 2014). The statistics of these datasets are briefly introduced below:

  • Animals with Attributes 2 (AwA2) (Xian et al. 2018) contains 37,322 images from 50 animal categories, of which 40 classes are used for training and the remaining 10 for testing. We adopt the class-level continuous 85-dim attributes as the semantic representations. For fair comparison with previous methods, we also report results on AwA1 (Lampert et al. 2014), an earlier version of this animal dataset for ZSL.

  • Caltech-UCSD Birds 200-2011 (CUB) (Wah et al. 2011) is a fine-grained bird dataset with 200 species of birds and 11,788 images. Each image is associated with a 312-dim continuous attribute vector. Following Xian et al. (2018), we use the class-level attribute vector and the 150/50 split.

  • SUN-Attribute (SUN) (Patterson et al. 2014) includes 14,340 images from 717 scene categories. Each sample is paired with a binary 102-dim semantic vector. We compute class-level continuous attributes as our semantic representations by averaging the image-level attributes for each class. The 707/10 (SUN10) and 645/72 (SUN72) splits are adopted in our experiments.

Data Splits. (1) Standard Splits (SS): The standard seen/unseen class split was first proposed in Lampert et al. (2009) and is widely used in most ZSL works. (2) Proposed Splits (PS): Many recent ZSL methods extract visual features using CNN models pretrained on the ImageNet 1K classes, and the unseen classes in SS may overlap with these 1K classes, which violates the zero-shot requirement that test classes should be unseen during ZSL training. Based on this consideration, Xian et al. (2018) introduce the Proposed Splits (PS), in which the overlapping ImageNet classes are removed from the set of unseen test classes. In this paper, we report results on both the standard splits and the proposed splits for fair comparison.

Evaluation Metrics. For fair comparison and completeness, we consider two different ZSL settings: 1) Conventional ZSL, which assumes all test instances belong only to target unseen classes. 2) Generalized ZSL, where test instances come from both seen and unseen classes, a more realistic setting for practical applications. For the former, we compute the multi-way classification accuracy (MCA) as in previous works.

$$\begin{aligned} a c c _ { {\mathcal {Y}} } = \frac{ 1 }{ \Vert {\mathcal {Y}} \Vert } \sum _ { i = 1 } ^ { \Vert {\mathcal {Y}} \Vert } \frac{ \# \text{ correct } \text{ predictions } \text{ in } \mathrm { i } }{ \# \text { samples in } \mathrm { i } } \end{aligned}$$
(11)

For the latter one, we define three metrics. 1) \(acc_{{\mathcal {Y}}_s}\) – the accuracy of classifying the data samples from the seen classes to all the classes (both seen and unseen); 2) \(acc_{{\mathcal {Y}}_u}\) – the accuracy of classifying the data samples from the unseen classes to all the classes; 3) H – the harmonic mean of \(acc_{{\mathcal {Y}}_s}\) and \(acc_{{\mathcal {Y}}_u}\):

$$\begin{aligned} H=\frac{2*acc_{{\mathcal {Y}}_u}*acc_{{\mathcal {Y}}_s}}{acc_{{\mathcal {Y}}_u}+acc_{{\mathcal {Y}}_s}} \end{aligned}$$
(12)
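For completeness, a small sketch of the two evaluation metrics in Equations (11)-(12), assuming integer label arrays:

```python
import numpy as np

def mca(y_true, y_pred):
    """Multi-way classification accuracy (Eq. 11): average of per-class accuracies."""
    classes = np.unique(y_true)
    return float(np.mean([(y_pred[y_true == c] == c).mean() for c in classes]))

def harmonic_mean(acc_seen, acc_unseen):
    """Harmonic mean H of seen and unseen accuracies (Eq. 12)."""
    return 2 * acc_seen * acc_unseen / (acc_seen + acc_unseen)
```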

4.1 Qualitative Results

Fig. 3

Qualitative results of BMVSc on six categories of the AwA2 and CUB datasets. We list the top-6 images classified into each category. Misclassified images are marked with red bounding boxes, and the correct category name is shown below the corresponding image

In Fig. 3, we show some qualitative results of the proposed BMVSc on the AwA2 and CUB datasets. Although the test images of each class differ considerably in overall appearance, the projection function learned by our method can still capture important discriminative semantic information from their visual characteristics, which corresponds to their semantic attributes. For example, the predicted sheep images in AwA2 all share the furry, bulbous, and hooves attributes. However, we also observe some misclassified images, such as the walrus in row 6 of AwA2. After careful analysis, we find two possible reasons: 1) The discriminative ability of the pretrained CNN is insufficient to separate the visual appearances of highly similar categories. In fact, the visual appearances of seal and walrus are so close that even people could not distinguish them at a glance without expert knowledge. This problem can only be solved by more powerful visual features. 2) Some attribute annotations are not accurate enough. For example, seals have spots while walruses do not, yet both categories own this attribute in the semantic annotation. Such incorrect supervision misleads the learning of the projection function.

Table 1 Quantitative comparisons of MCA (%) under standard splits (SS) in conventional ZSL setting. I: Inductive, T: Transductive, O: Our method, Bold: Best, Italic: Second best, V: VGG, R: ResNet, G: GoogLeNet. It can be seen that the proposed three types of visual structure constraints can bring substantial performance gain and outperform existing state-of-the-art methods

4.2 Conventional ZSL Results

To show the effectiveness of the proposed visual structure constraint, we first compare our method with existing state-of-the-art methods in the conventional setting. Table 1 gives the comparison results under the standard splits (SS), where we also re-implement our method using 4096-dimensional VGG features to guarantee fairness. With the three different types of visual structure constraints, our method obtains substantial performance gains consistently on all datasets and outperforms previous state-of-the-art methods. The only exception is that VZSL (Wang et al. 2018) is slightly better than our method on the AwA1 dataset when using VGG features.

In particular, comparing with SP-ZSR (Zhang and Saligrama 2016b), which shares a similar spirit with our method, we find that its performance is sometimes even worse than that of inductive methods such as SynC (Changpinyo et al. 2016), SCoRe (Morgado and Vasconcelos 2017) or VCL. A possible underlying reason is that, when the structure information is utilized only at test time, the final performance gain highly depends on the quality of the projection function. When the projection function is not good enough, the initial synthetic centers deviate far from the real centers and lead to bad matching results with the unsupervised cluster centers, thus causing even worse results. By contrast, in our method this visual structure constraint is incorporated into the learning of the projection function at the training stage, which helps learn a better projection function and brings consistent performance gains. Another bonus is that, at runtime, we can directly perform recognition in a real-time online mode rather than the batch-mode optimization of SP-ZSR (Zhang and Saligrama 2016b), which is friendlier for real applications.

The results on the proposed splits of AwA2, CUB and SUN72 are reported in Table 2 with ResNet-101 features. Almost all methods suffer from performance degradation under this setting; however, our proposed method still maintains the highest accuracy. Specifically, the improvements obtained by our method range from 0.8% to 25.8%, which indicates that the visual structure constraint is effective in solving the domain shift problem.

Table 2 Quantitative comparisons under the proposed splits (PS) in conventional ZSL setting. It shows that, though most methods suffer from performance degradation under this setting, our method can still achieve the best performance and beat other methods by a large margin
Table 3 Quantitative results on the large-scale ImageNet dataset in conventional ZSL setting. Even under this very challenging setting, our method still beats the baseline method VCL and previous state-of-the-art methods by more than 5%
Fig. 4

Some classification results on the large-scale ImageNet dataset. The leftmost column shows the input test images, while the second to sixth columns show representative sample images of the top-5 predicted categories. Although the top-1 classification results of some samples are incorrect, they are all predicted into visually similar categories

4.3 Large-Scale Conventional ZSL Results

To further evaluate the effectiveness of our method in real large-scale scenarios, we follow the procedure of Frome et al. (2013) and evaluate it on the widely used subset of ImageNet, which corresponds to 1,549 unseen classes within two hops of the 1,000 seen classes according to the existing ImageNet hierarchy. Certain classes lack specific semantic representations; instead of dropping them directly like Changpinyo et al. (2018), we average the word embeddings of their constituent words to construct the corresponding semantic attributes. For the evaluation metric, we employ Hit@K as in Frome et al. (2013), defined as the percentage of test images whose true labels are among the top-K predictions. As shown in Table 3, our WDVSc outperforms the baseline VCL and previous state-of-the-art methods by more than 5%. Note that DIPL (Zhao et al. 2018) only reports results on the ILSVRC 2010 test data, which is a subset of the default 2-hop test data and is therefore relatively easier.
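A minimal sketch of the Hit@K metric used here, assuming a distance matrix from each test feature to all unseen synthetic centers and ground-truth labels that index the same class order:

```python
import numpy as np

def hit_at_k(dists, y_true, k=5):
    """Hit@K: fraction of test images whose true class is among the K nearest centers.

    dists: (N, C) distances to all unseen class centers
    y_true: (N,) ground-truth class indices in [0, C)
    """
    topk = np.argsort(dists, axis=1)[:, :k]
    return float(np.mean([y in row for y, row in zip(y_true, topk)]))
```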

Table 4 Quantitative comparisons under the generalized ZSL setting. Most methods, including our baseline VCL, cannot maintain the same level of accuracy for both seen and unseen classes. The much better results on the source seen classes are due to the serious domain shift problem, which can be significantly alleviated by the proposed visual structure constraints

In Fig. 4, we further provide four classification results. Although the number of unseen classes in this dataset is very large, our method still works well and correctly categorizes the input test images. Moreover, samples whose top-1 classification results are incorrect are still categorized into visually similar categories, which is consistent with the performance increase for larger K. In fact, it is difficult even for humans to tell the difference among these visually similar categories.

4.4 Generalized ZSL Results

In Table 4, we compare our method with eight different generalized ZSL methods. Although almost all methods, including our baseline VCL, cannot maintain the same level of accuracy for both seen (\(acc_{{\mathcal {Y}}_s}\)) and unseen classes (\(acc_{{\mathcal {Y}}_u}\)), our method with visual structure constraints significantly outperforms the other methods on these datasets. More specifically, taking CONSE (Norouzi et al. 2014) as an example, due to the domain shift problem it achieves the best results on the source seen classes but totally fails on the target unseen classes. By contrast, since the proposed structure constraints help align the structure of the synthetic centers with that of the real unseen centers, our method achieves acceptable ZSL performance on the target unseen classes. The generalized ZSL setting again demonstrates that the proposed visual structure constraints help alleviate the domain shift problem.

Fig. 5

Comparison between the one-step and progressive training strategies in the realistic setting, for the standard split (SS, top) and the proposed split (PS, bottom) respectively. The x-axis denotes the feature distance threshold and the maximum feature distance threshold in these two strategies respectively. The progressive training strategy is overall more robust and better than the one-step strategy across different thresholds

Table 5 Results (%) in the more realistic setting. With the newly proposed one-step (\(S+*\)) and progressive (\(P+*\)) training strategies, the proposed method still works well and brings performance gains. In particular, WDVSc is extremely robust in this challenging setting thanks to its inherent soft matching scheme, and the progressive training strategy outperforms the one-step one as well. The experiments are conducted on both the coarse-grained dataset AwA2 and the fine-grained dataset CUB

4.5 Results of the Realistic Setting in the Wild

To imitate the in-the-wild setting where many unrelated images may exist in the test dataset, we prepare two types of datasets, coarse-grained and fine-grained. Specifically, for the coarse-grained dataset, we mix the AwA2 test dataset with 8K extra unrelated images from the aPY dataset; these unrelated images belong to the classes of neither AwA2 nor ImageNet-1K. For the fine-grained dataset, we hold out 10 unseen classes of the CUB dataset as noise samples to confuse the target domain. From Table 5, we draw the following observations. 1) Without filtering out the unrelated images, the performance of our method with CDVSc, BMVSc, and WDVSc all degrades compared to the baseline VCL on the coarse-grained dataset, which means that aligning wrong visual structures is counterproductive to learning the projection function. 2) On the fine-grained dataset, directly constraining the noisy target domain with the visual centers does not hurt as much. The underlying reason may be that the visual features of fine-grained images are sometimes indistinguishable, as shown in Fig. 9. 3) With the new training strategies (\(S+*\), \(P+*\)), the proposed visual structure constraints work very well and bring substantial performance gains consistently on both the coarse-grained and fine-grained datasets. 4) The progressive training strategy \(P+*\) is better than the one-step training strategy \(S+*\), which demonstrates the benefit of gradually improving the projection function with more target domain samples and looser distance constraints.

Besides the final performance, we further analyze the influence of the distance threshold on these two strategies, for the standard split (top) and the proposed split (bottom), in Fig. 5 on the AwA2 dataset. Although both strategies yield different MCA results for different threshold values, the progressive strategy is overall more robust and better than the one-step training strategy. Note that the threshold here refers to the feature distance threshold \(\lambda _{dist}\) in Algorithm 1 and the maximum distance threshold \(\lambda _{dist}^{max}\) in Algorithm 2, respectively.

4.6 More Analysis

Possible many-to-one problem in CDVSc. To verify that a many-to-one matching problem may occur during the training of CDVSc, we randomly select the output of the embedding network from one epoch and visualize the matching results on the AwA2 dataset in Fig. 6. One projected semantic center can be matched to multiple visual cluster centers, and vice versa. By contrast, BMVSc guarantees strict one-to-one matching, which may explain its better results on this dataset.

Fig. 6

Matching matrices between the projected semantic centers and visual cluster centers of CDVSc (left) and BMVSc (right) on the AwA2 dataset. BMVSc guarantees strict one-to-one matching while CDVSc may produce many-to-one matching

Fig. 7

The number of correct matches (“epoch-match”) and the distance (“epoch-dis”) between the projected semantic centers and real visual centers during the training of BMVSc on the SUN dataset, which demonstrates that BMVSc can improve the matching relation by only using the approximated K-means centers

Progressive improvement of center matching in BMVSc. The final ZSL performance depends on the alignment between the projected semantic centers and the real visual centers. In our method we use K-means cluster centers to approximate the real centers and minimize their matching distance. A natural question is whether this final objective can be achieved by training with cluster centers from K-means. To answer it, we calculate the number of correctly matched points and the distance between the projected semantic centers and the real visual centers during the training of BMVSc, and plot these two metrics for the SUN dataset in Fig. 7. Clearly, BMVSc improves the matching between the projected semantic centers and the real visual centers by only using the K-means cluster centers throughout the training process.

Center-based objective vs. instance-based objective. Compared to the previous instance-based optimization objective, our center-based optimization objective is much more computationally efficient. To verify this point, we re-implement DEM (Zhang et al. 2017) and adopt the same network structure, parameter setting, and optimization algorithm as our VCL method on the AwA2 dataset. We then plot the loss and accuracy curves over training epochs in Fig. 8. Our center-based objective converges faster than the previous instance-based objective and even achieves slightly better final results.

Fig. 8

Comparison of convergence curves between the instance-based method (DEM) and our center-based method (VCL). The center-based objective converges faster than the instance-based objective

Why does BMVSc obtain slightly worse results than CDVSc on the CUB dataset? In this paper, three different types of visual structure constraints are proposed to alleviate the domain shift problem in ZSL. BMVSc solves the possible many-to-one matching problem of CDVSc and satisfies the strict one-to-one principle, which potentially helps achieve better results, as the gains observed on the AwA2 and SUN datasets show. However, on the CUB dataset, the performance of BMVSc is slightly worse than that of CDVSc. Although this difference is quite subtle compared to the absolute gain brought by the visual structure constraint, we still want to find the possible underlying reason.

Fig. 9

Feature distribution of the CUB dataset, which shows that the feature distribution is not well separable for all categories

To answer this question, we first plot the feature distribution of all categories of the CUB dataset in Fig. 9 with t-SNE. The feature distributions of some categories are too close to be distinguished because features pretrained on ImageNet are not representative enough for this dataset. This somewhat violates our assumption that the clusters of unseen classes obtained from pre-trained CNN models are already discriminative, and thus leads to the degradation. To verify this, we inspect the matching matrix obtained by our methods and find that wrong matches indeed exist due to very close real centers. Specifically, consider the synthetic center X of yellow billed cuckoo and two similar real centers Y and Z of mangrove cuckoo and yellow billed cuckoo, as shown in Fig. 10. \(X-Z\) is the correct matching and \(X-Y\) is the wrong one. In BMVSc, if the wrong matching happens, X is pulled closer to the inaccurate center Y (loss term: \(\left\| X-Y \right\| _2^2\)). By contrast, the contribution of CDVSc to the final loss is \(\frac{\left\| X-Y \right\| _2^2+\left\| X-Z \right\| _2^2}{2}\), which also forces X to approach Z and alleviates the wrong matching problem to a certain degree.

Fig. 10

Matching relations between a synthetic center and two similar real centers. The red line denotes the BMVSc matching, while the green and red lines together denote the CDVSc matching

Table 6 Generality to the word-vector-based semantic space on the AwA1 dataset. Although word vectors contain less effective information than semantic attributes, our visual structure constraint works very well and brings significant performance gains

Importance of unsupervised cluster centers and semantic attributes. In our method, two different types of knowledge are leveraged to recognize target domain images: unsupervised cluster centers of the target domain and semantic attributes. To study the importance of these two components, we design a simple voting algorithm to calculate the upper bound of unsupervised clustering for ZSL recognition. Specifically, we assume the ground truth label of each unseen instance is accessible. Then, for each cluster obtained by K-means, we predict its category through a voting process, i.e., its category is the one to which most images in this cluster belong. Finally, the classification result for each test instance is directly set to the label of its cluster. Because the ground truth information is used, this can be viewed as the upper bound of the K-means clustering algorithm. As shown in Table 7, its performance is even better than our baseline VCL, which demonstrates that the information from unsupervised clustering is very useful. By combining the semantic attributes and this unsupervised cluster information during learning, our CDVSc, BMVSc, and WDVSc are all better than both the upper bound of K-means and VCL.
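A sketch of the voting-based oracle described above, assuming K-means cluster assignments and ground-truth labels are available as integer arrays; the returned predictions would then be scored with the metric of Equation (11).

```python
import numpy as np

def kmeans_voting_upper_bound(cluster_ids, y_true):
    """Oracle upper bound of K-means for ZSL: label every cluster with the
    majority ground-truth class of its members, then score all instances."""
    y_pred = np.empty_like(y_true)
    for c in np.unique(cluster_ids):
        members = cluster_ids == c
        labels, counts = np.unique(y_true[members], return_counts=True)
        y_pred[members] = labels[counts.argmax()]   # majority vote per cluster
    return y_pred                                    # evaluate with mca(y_true, y_pred)
```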

Table 7 Analysis to demonstrate the importance of unsupervised cluster centers and semantic attributes. By combining these two types of information during training, our CDVSc, BMVSc and WDVSc achieve better results than the upper bound of K-Means and VCL
Table 8 We run the whole pipeline multiple times and record the MCA (%) performance on the AwA2 and CUB datasets. The results demonstrate the stability of the clustering method used. Meanwhile, the averaged results are slightly higher than the previously reported results
Table 9 Comparison results (%) among the baseline VCL, the default center-based WDVSc (“WDVSc”) and the instance-based WDVSc (“WDVSc-instance”) on different datasets and conventional settings. The instance-based WDVSc is worse than the default center-based WDVSc but still much better than the VCL baseline, which demonstrates the strong generalization ability of the proposed visual structure constraint

Stability of unsupervised cluster centers. Since the proposed visual structure constraint depends on unsupervised clustering, which has a certain degree of randomness, one may ask whether the final ZSL performance is stable enough. To analyze this point in detail, we run the whole pipeline five times on both the AwA2 and CUB datasets using WDVSc. To eliminate the influence of other factors, we initialize the projection function f with the same parameters, keeping only the randomness of the clustering method. Based on the experimental results in Table 8, the performance variance of our method on each dataset is very small. By using more advanced clustering methods such as Chang et al. (2017), we believe the stability and performance could be further improved.

Instance-based WDVSc. Although our method uses a center-based objective function by default, the proposed Wasserstein-distance-based visual structure constraint can also support an instance-based objective function. This is because the Wasserstein distance can measure the distance between two discrete distributions with unequal numbers of samples, in which case the “coupling matrix” X is no longer square. Specifically, we do not use unsupervised clustering algorithms to generate approximated real centers B in advance, but instead directly use the visual feature of each instance to find its individual optimal matching. To verify its effectiveness, we conduct a controlled experiment on three different datasets and evaluation settings (standard split “SS” and proposed split “PS”). As shown in Table 9, without unsupervised clustering, the instance-based WDVSc is a little worse than the default center-based WDVSc but still significantly better than the baseline VCL. This further demonstrates the strong generalization ability of the proposed visual structure constraint.
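Reusing the earlier Sinkhorn sketch, the instance-based variant only changes the second point set, since that routine already handles a rectangular coupling matrix; the snippet below is a sketch of the idea, not the exact training loop.

```python
# Center-based WDVSc: U synthetic centers vs. U cluster centers (square coupling).
loss_center = wasserstein_loss(syn_centers, cluster_centers)

# Instance-based WDVSc: U synthetic centers vs. all N_u instance features
# (rectangular U x N_u coupling with uniform marginals), no K-means needed.
loss_instance = wasserstein_loss(syn_centers, instance_feats)
```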

Fig. 11

Comparison of MCA (%) on the AwA2 and CUB datasets using instance-based WDVSc with different numbers of samples per target class. Instance-based WDVSc works quite well in this extremely challenging case, and better results are achieved with more samples

Besides dropping the clustering procedure, instance-based WDVSc has other advantages over center-based WDVSc. There may exist extreme cases where the number of samples in some classes is very small, and thus the unsupervised clustering results become unreliable and inaccurate. In these cases, it is more suitable to use instance-based WDVSc, which directly measures the distance between two discrete feature distributions of unequal size. To simulate this setting, we test on both the AwA2 and CUB datasets, keeping only 1 to 5 images per class. As seen in Fig. 11, even in this extremely challenging case, instance-based WDVSc achieves substantial improvements over the VCL baseline, and the final performance increases consistently with more target images per class.

Generality to word-vector-based semantic spaces. Unlike some previous methods that are only applicable to one specific semantic space, we further demonstrate that the proposed visual structure constraint can also be applied to a word-vector-based semantic space. Specifically, to obtain the word representations used as embedding network inputs, we use the GloVe text model (Pennington et al. 2014) trained on the Wikipedia corpus, yielding 300-d vectors. For classes containing multiple words, we match all the words in the trained model and average their word embeddings as the corresponding category embedding. As shown in Table 6, although word vectors contain less effective information than semantic attributes, the proposed visual structure constraint still brings substantial performance gains and outperforms previous methods. Note that DEM (Zhang et al. 2017) utilized the 1000-d word vectors provided by Fu et al. (2014), Fu et al. (2015) to represent a category.
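As an illustration of how multi-word class names can be averaged into a single category embedding, a minimal sketch is given below; the glove lookup dict and the '+'/'_'-separated naming convention are assumptions made for the example.

```python
import numpy as np

def class_word_vector(class_name, glove, dim=300):
    """Average the GloVe embeddings of all words in a (possibly multi-word) class name."""
    words = class_name.lower().replace('+', ' ').replace('_', ' ').split()
    vecs = [glove[w] for w in words if w in glove]   # skip out-of-vocabulary words
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)
```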

Robustness to imperfect separability of visual features for unseen classes. Although our motivation is inspired by the strong separability of visual features for unseen classes on the AwA2 dataset, the proposed visual structure constraint is very robust and does not rely heavily on it. For example, on the CUB dataset, the feature distribution in Fig. 9 is not fully separable, but the proposed visual structure constraint still brings significant performance gains, as shown in the tables above. Even if some clusters are incorrect, as long as most of them are correct, the proposed visual structure constraint remains beneficial.

Fig. 12

MCA results (%) for different cluster numbers K (super-classes) on the coarse-grained dataset (AwA2) and the fine-grained dataset (SUN). Note that the total numbers of unseen classes of AwA2 and SUN are 10 and 72, respectively. “coarse-grained all” and “fine-grained all” denote directly using the maximum cluster number

Fig. 13

ImageNet results with different values of K when the true class number is unknown. On the one hand, this demonstrates the effectiveness of our method even without knowing K; on the other hand, the proposed visual structure constraint captures more fine-grained structure information as K increases and achieves better results. Note that the number of unseen classes in this setting is 1549

On the other hand, although the unseen class number K is often pre-defined, we find that the proposed visual constraint improves performance even when it is unknown. In Fig. 12, we report the performance for different \(K (K\leqslant \text {unseen class number})\) on the coarse-grained AwA2 dataset and the fine-grained SUN dataset. Specifically, we first perform K-means in the semantic space and the visual space simultaneously, and then use WDVSc to align these two synthetic sets. The proposed visual structure constraint brings consistent performance gains. In other words, as long as the visual features can form some super-classes (not necessarily fine-level, which is satisfied by most datasets), the proposed visual structure constraint remains effective. We further conduct similar experiments with WDVSc on the large-scale ImageNet dataset. As shown in Fig. 13, the recognition accuracy with the visual structure constraint also improves over the baseline VCL consistently as the number of visual centers increases. We believe the proposed visual structure constraint captures a more fine-grained structure of the visual space as K increases, thus achieving better results.

5 Conclusion

Domain shift is a key problem for ZSL in the wild. To alleviate it, three different types of visual structure constraints are proposed for transductive ZSL in this paper. Moreover, we introduce a new transductive ZSL configuration for real scenarios and design two new training strategies to make our method work in it. Experiments demonstrate that our method brings substantial performance gains consistently on different benchmark datasets, outperforms previous state-of-the-art methods by a large margin, and generalizes well to data in the wild, including large-scale data and some extreme cases.