1 Introduction

Image recognition models trained on large numbers of labeled instances currently achieve good results, but collecting these labeled images requires considerable manpower and resources. In particular, fine-grained classification requires expert annotation. How to perform image recognition with only a few labeled instances, or even with some categories entirely unlabeled, has therefore become a challenging and practically important task.

Zero-shot learning (ZSL) [22, 33, 41] is an effective approach to this problem. It can be regarded as a special form of unsupervised domain adaptation: a model is learned from a set of labeled source data, and the learned knowledge is then transferred to the target domain to recognize another set of unlabeled data. In the zero-shot setting, the categories in these two domains are assumed to be completely non-overlapping.

Because the source data are labeled during training, the classes in the source domain are usually called seen classes and the classes in the target domain unseen classes. Zero-shot learning can be divided into traditional ZSL and generalized zero-shot learning (GZSL). The difference lies in testing: ZSL classifies only instances from the target domain, for which no labeled visual samples exist, whereas GZSL classifies instances from both the source and target domains. Zero-shot learning can also be divided into inductive and transductive ZSL. Inductive ZSL may use only the labeled source data for training, whereas transductive ZSL may additionally use the unlabeled data from the target domain. Under the inductive setting, predictions for target-domain instances depend entirely on the knowledge learned from the source domain; under the transductive setting, the unlabeled target data can be used to adjust the trained model iteratively.

Since the unseen classes do not appear at all during training, some auxiliary information, i.e., a semantic description, is needed. This auxiliary information can take the form of semantic attribute vectors [1, 8], word2vec embeddings [29], human gaze [18], etc. For example, semantic attribute vectors define characteristics shared between the seen and unseen classes; if both are animal classes, the attributes may describe fur, color, stripes, and so on. Through semantic attribute vectors, the visual features of seen and unseen classes can be bridged. Thus only auxiliary information is needed for the unseen classes, which greatly reduces the difficulty of collecting labeled data.

In the semantic embedding research direction, some existing zero-shot methods map the visual features to the semantic space [4, 9, 21, 33], but doing so reduces the expressive ability of the visual information. Other methods map the semantic features to the visual space [20, 35, 44]; however, this reduces the expressive ability of the semantic attribute vectors and introduces noise that is not a visual description at all [6, 7, 32]. The remaining methods project the visual and semantic features into a common space [5, 26, 45] and align them there. However, such methods usually adopt simple and coarse alignments, such as the shortest Euclidean distance, which we call hard alignment. The visual and semantic feature distributions produced by such hard alignments are not well aligned at the overall level. Moreover, incorrectly bridging visual and semantic information causes an obvious bias problem, as shown in Fig. 1: when instances from the target domain are classified, they are almost always predicted as some seen class from the source domain, which is a serious issue in many zero-shot learning methods.

Fig. 1

Visualization of the bias problem. The labeled training instances all come from seen classes, so an unseen instance is often predicted as a similar seen class. As shown in the figure, a bobcat from the unseen classes has a high probability of being predicted as a tiger, which belongs to the seen classes

In order to solve the above problems, we propose a bidirectional mapping method. With bidirectional projections, we can make full use of the information from both domains without introducing too much noise. Motivated by the idea of CycleGAN [47], a pair of GANs [13] is used to avoid hard alignment: two generators realize the bidirectional mappings between the visual features and the semantic features. At the same time, we remap the information that has been mapped to the other domain back to the domain it belongs to and compare it with the information before the mapping; the error between them is called the cycle loss. The cycle loss and a classification task loss further guarantee that important information is preserved and that the alignment is correct. To better address the bias problem, a transductive method is proposed that uses pseudo-labels to correct the model iteratively. In the test phase, we do not make a decision based on the features in only one space; instead, the features in both the visual and semantic feature spaces are combined. In summary, this paper makes the following contributions:

  1.

    A transductive method based on bidirectional projections is proposed. The method makes the visual features more consistent with the corresponding semantic features and greatly weakens the bias problem. Extensive experimental results show that our model achieves state-of-the-art performance in both the ZSL and GZSL settings.

  2.

    We propose a new zero-shot classifier based on the bidirectional projection method. The classifier combines visual and semantic features to give the final prediction, making full use of both kinds of information to reduce the discriminant bias.

The remainder of the paper is arranged as follows. Section 2 reviews related work on transductive zero-shot learning and zero-shot learning based on GANs. We detail our BGT model in Sect. 3. Experimental results are presented in Sect. 4, and conclusions are given in Sect. 5.

2 Related work

Our approach is a transductive method based on GANs, so we first introduce some common transductive methods and then some GAN-based methods. The similarities and differences between these methods and ours are also discussed.

2.1 Transductive zero-shot learning

Unlike standard ZSL, transductive ZSL uses target-domain data during the training phase to reduce the ubiquitous domain shift problem. This does not violate the "zero-shot" setting because the target-domain data are unlabeled.

Transductive methods use unseen instances in several ways. Some methods first train a model with the source data, then use the trained model to obtain pseudo-labels for the target-domain instances, and finally use the obtained pseudo-labels to further adjust the model [3, 15, 34, 46]. Our method follows this line of research: after obtaining the trained model, it uses pseudo-labels and unseen instances to further train the model so that it becomes more suitable for unseen instances. However, the performance of this approach largely depends on the predictive ability of the trained model on unseen instances, so we also use the unseen instances during the initial training phase. We project the visual features of the unseen instances to the semantic domain and then project them back to the visual domain, requiring that the distance between the reconstructed visual feature and the original visual feature is not too large. In this way, the model also learns how to map the features of unseen instances without losing information, and it can therefore predict more accurate pseudo-labels on them, which improves the overall performance of the model.

Other methods are devoted to making the model more adaptable to the target domain through special training or prediction strategies. In [10], classification of instances from unseen classes is performed in two steps: canonical correlation analysis (CCA) is first used to project visual and semantic features into a multi-view embedding space, and then unseen unlabeled instances are used to construct a hypergraph that transfers knowledge from the seen classes to the unseen classes. Kodirov et al. [20] propose to use a space shared by the seen and unseen classes to improve the performance of the model on the unseen classes. More recently, Verma and Rai [39] propose to learn the data distribution from the attributes of the seen and unseen classes, and then use unseen instances to adjust the parameters of the distribution.

2.2 Zero-shot learning based on GANs

In recent years, much research related to GANs has appeared [14, 27, 28], and GANs have performed well in many scenarios [11, 12, 25, 43]. At the same time, several GAN-based zero-shot learning methods have been proposed. In [30, 42, 48], semantic attribute vectors and random noise are used directly to generate unseen instances, but doing so carries considerable uncertainty because of the properties of GANs. Tong et al. [38] use GANs to generate samples with specified semantic attribute vectors to mitigate the bias problem. To address generative diversity and reliability, LisGAN [24] introduces soul samples and makes all generated unseen instances similar to them.

Different from these methods, our method does not use GANs for feature generation but for feature alignment. That is, a bidirectional generation scheme motivated by CycleGAN [47] is used to project visual and semantic information into each other's domain. Then, through adversarial learning, our model makes the projected features and the original features follow similar distributions and aligns them.

Table 1 Notation used in our approach

3 The proposed approach

3.1 Problem definition

Suppose that we have a set of \(N_{\rm s}\) labeled images \(D_{\rm s}=\{(x_i,y_i,z_i)\}_{i=1}^{N_{\rm s}}\) from \(C_{\rm s}\) seen classes \(Y_{\rm s}=\{1,2,\ldots ,C_{\rm s}\}\), where \(x_i\in X_{\rm s} \subset \mathbb {R}^{m \times N_{\rm s}}\) is the visual feature of the ith instance in \(D_{\rm s}\) and \(m\) is the dimensionality of the visual feature space; \(y_i \in Y_{\rm s}\) is the corresponding label and \(z_i \in Z_{\rm s} \subset \mathbb {R}^{n\times C_{\rm s}}\) is the corresponding attribute vector, where \(n\) is its dimensionality. There is a one-to-one correspondence between \(Y_{\rm s}\) and \(Z_{\rm s}\): each column of \(Z_{\rm s}\) is the semantic attribute vector of a class in \(Y_{\rm s}\). We also have a set \(D_{\rm u}=\{(x_j,y_j,z_j)\}_{j=1}^{N_{\rm u}}\) from \(C_{\rm u}\) unseen classes \(Y_{\rm u}=\{C_{\rm s}+1,C_{\rm s}+2,\ldots ,C_{\rm s}+C_{\rm u}\}\), where \(x_j \in X_{\rm u} \subset \mathbb {R}^{m\times N_{\rm u}}\) is the visual feature of the jth instance in \(D_{\rm u}\), and \(y_j \in Y_{\rm u}\) and \(z_j \in Z_{\rm u} \subset \mathbb {R}^{n\times C_{\rm u}}\) are the corresponding label and semantic attribute vector, both of which are unavailable during training. Similarly, there is a one-to-one correspondence between \(Y_{\rm u}\) and \(Z_{\rm u}\). The goal of the ZSL problem is to learn a function \(f:X_{\rm u}{\rightarrow }Y_{\rm u}\); for the GZSL problem, the goal is to learn a function \(f:\{X_{\rm s},X_{\rm u}\}{\rightarrow }Y_{\rm s}{\cup }Y_{\rm u}\). It is worth noting that \(Y_{\rm s}{\cap }Y_{\rm u}={\emptyset }\). Table 1 lists the main notation used hereinafter.
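To make the notation concrete, the sketch below lays the data out as arrays; the shapes and random contents are purely illustrative, and instances are stored as rows here rather than as columns as in the definitions above.

```python
import numpy as np

m, n = 2048, 85          # visual / attribute dimensionality (e.g. ResNet-101 features, AWA2 attributes)
N_s, C_s = 1000, 40      # number of labeled source instances / seen classes (illustrative)
N_u, C_u = 500, 10       # number of unlabeled target instances / unseen classes (illustrative)

X_s = np.random.randn(N_s, m)           # visual features of seen instances
Y_s = np.random.randint(0, C_s, N_s)    # seen labels
Z_s = np.random.randn(C_s, n)           # one semantic attribute vector per seen class
z_i = Z_s[Y_s]                          # z_i of an instance is the attribute vector of its class

X_u = np.random.randn(N_u, m)           # visual features of unseen instances (available, unlabeled)
Z_u = np.random.randn(C_u, n)           # unseen-class attribute vectors (used only at test time in ZSL)
```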

3.2 Bidirectional generative transductive (BGT) model

3.2.1 Overall idea

The overall framework is shown in Fig. 2. First, the visual features of the source and target data are extracted by a convolutional neural network as \(x^{\rm s}\) and \(x^{\rm u}\), respectively. The semantic attribute vectors are projected into a semantic space by a function \(\Phi \) approximated by a neural network. There are two generators, \(G_{\rm va}\) and \(G_{\rm av}\), which map from the visual feature space to the semantic feature space and vice versa, respectively. The fake semantic and visual features \(a_{\rm fake}^{\rm s}\) and \(x_{\rm fake}^{\rm s}\) are generated from the source visual and semantic features \(x^{\rm s}\) and \(a^{\rm s}\), respectively, and the semantic and visual feature domain classifiers \(D_{\rm a}\) and \(D_{\rm v}\) judge whether they are fake. Through these bidirectional projections, the source visual features are aligned with the semantic features in both the visual and semantic spaces. For the visual features of the target data, we perform similar operations, which make them consistent with the semantic features. The implementation of this stage is summarized in Sect. 3.2.6.

In the test phase, we combine the decisions in both the visual and semantic spaces to give the final prediction, instead of making a judgment in only one space as in previous work. We describe each part of our model in detail below. Section 3.3 shows how to further adjust the model using the pseudo-attribute vectors of target samples.

Fig. 2

The overall architecture. The symbol − in a circle denotes computation of the Euclidean distance. We train a pair of generative networks to bidirectionally generate visual features from semantic features and semantic features from visual features, respectively. In the test phase, the category of a target sample is predicted by combining information from the visual and semantic spaces

3.2.2 Generator loss

Since the visual features in the source dataset have corresponding semantic features, two generators are designed to realize the bidirectional projections that align visual and semantic features in the two spaces. The generator losses are defined as follows:

$$\begin{aligned} L_{{\rm G}_{\rm va}}= & {} \frac{1}{N_{\rm s}}\sum _{i=1}^{N_{\rm s}} -\log (D_{\rm a}(G_{\rm va}(x_i^{\rm s}))), \end{aligned}$$
(1)
$$\begin{aligned} L_{{\rm G}_{\rm av}}= & {} \frac{1}{N_{\rm s}}\sum _{i=1}^{N_{\rm s}} -\log (D_{{\rm v}}(G_{\rm av}(\Phi (z_i^{\rm s})))) \end{aligned}$$
(2)

where the function \(\Phi \) maps the semantic attribute vector \(z_i^{\rm s}\) into the semantic space and \(x_i^{\rm s}\) is the visual feature of the ith instance in the source dataset. Defining the losses in this way makes the generated features as similar as possible to the original features of the source domain. The total generator loss is

$$\begin{aligned} L_{\rm G}=L_{{\rm G}_{\rm va}}+L_{{\rm G}_{\rm av}}. \end{aligned}$$
(3)
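As a concrete illustration, the PyTorch-style sketch below implements Eqs. (1)–(3). The callables G_va, G_av, D_a, D_v and Phi are assumed to be modules matching the notation above, and the small eps is added only for numerical stability.

```python
import torch

def generator_loss(x_s, z_s, G_va, G_av, D_a, D_v, Phi, eps=1e-8):
    """Adversarial generator loss of Eqs. (1)-(3) (module names are illustrative)."""
    a_fake = G_va(x_s)                                 # visual -> semantic space
    x_fake = G_av(Phi(z_s))                            # semantic -> visual space
    L_G_va = -torch.log(D_a(a_fake) + eps).mean()      # Eq. (1)
    L_G_av = -torch.log(D_v(x_fake) + eps).mean()      # Eq. (2)
    return L_G_va + L_G_av                             # Eq. (3)
```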

3.2.3 Discriminator loss

The discriminators \(D_{\rm a}\) and \(D_{{\rm v}}\) are used to determine whether a generated feature is real. The discriminator losses are defined as follows:

$$\begin{aligned} L_{D_{\rm a}}= & {} \frac{1}{N_{\rm s}}\sum _{i=1}^{N_{\rm s}} (-\log (1- D_{\rm a}(G_{\rm va}(x_i^{\rm s})))-\log (D_{\rm a}(\Phi (z_i^{\rm s})))), \end{aligned}$$
(4)
$$\begin{aligned} L_{D_{{\rm v}}}= & {} \frac{1}{N_{\rm s}}\sum _{i=1}^{N_{\rm s}} (-\log (1-D_{{\rm v}}(G_{\rm av}(\Phi (z_i^{\rm s}))))-\log (D_{{\rm v}}(x_i^{\rm s}))). \end{aligned}$$
(5)

Through these losses, the discriminators learn to identify the generated features as fake and the original features as real. The total discriminator loss is

$$\begin{aligned} L_{{\rm D}}=L_{D_{\rm a}}+L_{D_{{\rm v}}}. \end{aligned}$$
(6)
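A matching sketch of the discriminator objective in Eqs. (4)–(6) is given below. Detaching the generated features so that this loss updates only the discriminators is our assumption about the training convention, not a detail stated in the paper.

```python
def discriminator_loss(x_s, z_s, G_va, G_av, D_a, D_v, Phi, eps=1e-8):
    """Discriminator loss of Eqs. (4)-(6): generated features scored as fake, originals as real."""
    a_fake = G_va(x_s).detach()                        # stop gradients into the generators (assumption)
    x_fake = G_av(Phi(z_s)).detach()
    L_D_a = (-torch.log(1 - D_a(a_fake) + eps)
             - torch.log(D_a(Phi(z_s)) + eps)).mean()  # Eq. (4)
    L_D_v = (-torch.log(1 - D_v(x_fake) + eps)
             - torch.log(D_v(x_s) + eps)).mean()       # Eq. (5)
    return L_D_a + L_D_v                               # Eq. (6)
```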

3.2.4 Cycle loss

In order to ensure that the generators do not lose important information, cycle losses for the source data are defined as follows:

$$\begin{aligned} L_{{\rm C}_{\rm a}}= & {} \frac{1}{N_{\rm s}}\sum _{i=1}^{N_{\rm s}} \left\| \Phi (z_i^{\rm s})-G_{\rm va}(G_{\rm av}(\Phi (z_i^{\rm s})))\right\| _2, \end{aligned}$$
(7)
$$\begin{aligned} L_{{\rm C}_{{\rm v}}}= & {} \frac{1}{N_{\rm s}}\sum _{i=1}^{N_{\rm s}} \left\| x_i^{\rm s}-G_{\rm av}(G_{\rm va}(x_i^{\rm s}))\right\| _2. \end{aligned}$$
(8)

At the same time, in order to ensure that the generators also perform well on the target dataset, we introduce a cycle loss on the target dataset during training, defined as follows,

$$\begin{aligned} L_{{\rm C}_{{\rm v}}^{\rm u}}=\frac{1}{N_{\rm u}}\sum _{j=1}^{N_{\rm u}} \left\| x_j^{\rm u}-G_{\rm av}(G_{\rm va}(x_j^{\rm u}))\right\| _2, \end{aligned}$$
(9)

where \(x_j^{\rm u}\) is the visual feature of the jth instance in the target dataset. The final cycle loss is

$$\begin{aligned} L_{\rm C}=L_{{\rm C}_{\rm a}}+L_{{\rm C}_{{\rm v}}}. \end{aligned}$$
(10)

We use \(L_{\rm G}\) and \(L_{\rm C}\) together during training, so the combined loss is

$$\begin{aligned} L_{{\rm GC}}=L_{\rm G}+L_{\rm C}. \end{aligned}$$
(11)
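The cycle terms of Eqs. (7)–(10), together with the target-domain term of Eq. (9), can be sketched as follows, reusing the hypothetical modules from the previous sketches.

```python
def cycle_losses(x_s, z_s, x_u, G_va, G_av, Phi):
    """Cycle-consistency losses of Eqs. (7)-(10) plus the target-domain term of Eq. (9)."""
    phi_z = Phi(z_s)
    L_C_a = torch.norm(phi_z - G_va(G_av(phi_z)), dim=1).mean()   # Eq. (7)
    L_C_v = torch.norm(x_s - G_av(G_va(x_s)), dim=1).mean()       # Eq. (8)
    L_C_v_u = torch.norm(x_u - G_av(G_va(x_u)), dim=1).mean()     # Eq. (9)
    return L_C_a + L_C_v, L_C_v_u                                 # L_C of Eq. (10), target term
```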

3.2.5 Task loss

After ensuring that the generators can map well between the two spaces, we need to further match the visual and semantic features. To achieve accurate classification, we define task losses in the visual and semantic spaces, respectively, as follows:

$$\begin{aligned} L_{T_{\rm a}}= & {} \frac{1}{N_{\rm s}}\sum _{i=1}^{N_{\rm s}} \left\| \Phi (z_i^{\rm s})-G_{\rm va}(x_i^{\rm s})\right\| _2+\lambda \left\| W_{\rm va}\right\| , \end{aligned}$$
(12)
$$\begin{aligned} L_{T_{{\rm v}}}= & {} \frac{1}{N_{\rm s}}\sum _{i=1}^{N_{\rm s}} \left\| x_i^{\rm s}-G_{\rm av}(\Phi (z_i^{\rm s}))\right\| _2+\lambda \left\| W_{\rm av}\right\| \end{aligned}$$
(13)

where \(W_{\rm av}\) and \(W_{\rm va}\) are the learnable parameters of \(G_{\rm av}\) and \(G_{\rm va}\), and \(\lambda \) is a constant regularization coefficient. The regularization effectively reduces the bias problem and improves performance. The total task loss is defined as

$$\begin{aligned} L_{{\rm T}}=L_{T_{\rm a}}+L_{T_{{\rm v}}}. \end{aligned}$$
(14)
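A sketch of the task losses in Eqs. (12)–(14) follows; reading the weight norms \(\Vert W_{\rm va}\Vert \) and \(\Vert W_{\rm av}\Vert \) as L2 penalties over all generator parameters is our interpretation.

```python
def task_loss(x_s, z_s, G_va, G_av, Phi, lam=1e-4):
    """Task losses of Eqs. (12)-(14) with a generator-weight regularizer (interpretation)."""
    phi_z = Phi(z_s)
    L_T_a = torch.norm(phi_z - G_va(x_s), dim=1).mean()           # Eq. (12) data term
    L_T_v = torch.norm(x_s - G_av(phi_z), dim=1).mean()           # Eq. (13) data term
    reg = sum(p.norm() for p in G_va.parameters()) + \
          sum(p.norm() for p in G_av.parameters())
    return L_T_a + L_T_v + lam * reg                              # Eq. (14)
```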

3.2.6 Training process

The training process is shown in Algorithm 1. The steps for each epoch are as follows: first, we train the two generators and \(\Phi \) according to \(L_{{\rm T}}\); then the two generators, the discriminators and \(\Phi \) are trained according to \(L_{{\rm GC}}\) and \(L_{{\rm D}}\); finally, we train the two generators and \(\Phi \) according to \(L_{{\rm C}_{{\rm v}}^{\rm u}}\). These steps are repeated until the model converges.

Algorithm 1 (training process)
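Putting the loss sketches above together, the following is a minimal sketch of one training epoch under the ordering just described. The use of two optimizers (one for the generators and \(\Phi \), one for the discriminators) and the full-batch treatment of \(X_{\rm u}\) are simplifying assumptions, not details given in the paper.

```python
def train_epoch(source_loader, X_u, G_va, G_av, D_a, D_v, Phi, opt_gen, opt_disc):
    """One epoch of the BGT training loop (sketch; reuses the loss sketches above)."""
    for x_s, z_s in source_loader:                   # mini-batches of seen visual features / attributes
        # (1) generators and Phi on the task loss L_T, Eq. (14)
        opt_gen.zero_grad()
        task_loss(x_s, z_s, G_va, G_av, Phi).backward()
        opt_gen.step()
        # (2) generators and Phi on L_GC = L_G + L_C, Eq. (11)
        phi_z = Phi(z_s)
        L_C = (torch.norm(phi_z - G_va(G_av(phi_z)), dim=1).mean() +
               torch.norm(x_s - G_av(G_va(x_s)), dim=1).mean())
        L_GC = generator_loss(x_s, z_s, G_va, G_av, D_a, D_v, Phi) + L_C
        opt_gen.zero_grad()
        L_GC.backward()
        opt_gen.step()
        # (3) discriminators on L_D, Eq. (6)
        opt_disc.zero_grad()
        discriminator_loss(x_s, z_s, G_va, G_av, D_a, D_v, Phi).backward()
        opt_disc.step()
    # (4) after the source batches: generators and Phi on the target cycle loss, Eq. (9)
    L_C_v_u = torch.norm(X_u - G_av(G_va(X_u)), dim=1).mean()
    opt_gen.zero_grad()
    L_C_v_u.backward()
    opt_gen.step()
```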
Fig. 3

Visualization of the classifier. Given an instance \(x\), we take a semantic attribute vector \(z\) and map it into the semantic space through \(\Phi \) to obtain the semantic feature \(\Phi (z)\). Then \(\Phi (z)\) is mapped into the visual domain through the generator \(G_{\rm av}\) to obtain a generated visual feature, and we compute its distance to \(x\). At the same time, we map \(x\) into the semantic domain through the generator \(G_{\rm va}\) to obtain a generated semantic feature and compute its distance to \(\Phi (z)\). Finally, the two distances are added to obtain the score of \(z\) for the instance \(x\). We predict the label of instance \(x\) as the class whose semantic attribute vector has the lowest score

3.2.7 Classification

Finally, we combine the information in both the visual and semantic spaces to give a prediction, as shown in Fig. 3. For an instance \(x\), its predicted semantic attribute vector in the ZSL setting is

$$\begin{aligned} \mathop {\arg \min }_{z \in Z_{\rm u}} \ \Vert x-G_{\rm av}(\Phi (z))\Vert _2+\Vert \Phi (z)-G_{\rm va}(x)\Vert _2. \end{aligned}$$
(15)

For GZSL, it is

$$\begin{aligned} \mathop {\arg \min }_{z \in Z_{\rm s}{\cup }Z_{\rm u}} \ \ \Vert x-G_{\rm av}(\Phi (z))\Vert _2+\Vert \Phi (z)-G_{\rm va}(x)\Vert _2. \end{aligned}$$
(16)

Since each semantic attribute vector corresponds to exactly one class, the label can be obtained directly from the predicted semantic attribute vector, so the above formulas can be used for classification. Whether this classifier is more effective than traditional single-domain classifiers is discussed further in Sect. 4.4.3.
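A sketch of the combined classifier of Eqs. (15) and (16) is shown below; Z_candidates stands for \(Z_{\rm u}\) in the ZSL case and for the concatenation of \(Z_{\rm s}\) and \(Z_{\rm u}\) in the GZSL case.

```python
def predict(x, Z_candidates, G_va, G_av, Phi):
    """Combined visual/semantic distance classifier of Eqs. (15) and (16)."""
    phi_z = Phi(Z_candidates)                                      # (C, semantic_dim)
    d_vis = torch.norm(x.unsqueeze(0) - G_av(phi_z), dim=1)        # distances in the visual space
    d_sem = torch.norm(phi_z - G_va(x).unsqueeze(0), dim=1)        # distances in the semantic space
    return torch.argmin(d_vis + d_sem).item()                      # index of the predicted class
```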

3.3 Self-labeled strategy

The self-labeled strategy is summarized in Algorithm 2. In the first stage, the model is trained using the datasets \(D_{\rm s}\) and \(X_{\rm u}\). When the model converges, we obtain

$$\begin{aligned} \{G_{\rm av}^*,G_{\rm va}^*,D_{{\rm v}}^*,D_{\rm a}^*, \Phi ^*\}=\mathop {\arg \min } \{L_{{\rm GC}},L_{{\rm D}},L_{{\rm C}_{{\rm v}}^{\rm u}},L_{{\rm T}}\}, \end{aligned}$$
(17)

where \(\{G_{\rm av}^*,G_{\rm va}^*,D_{{\rm v}}^*,D_{\rm a}^*,\Phi ^*\}\) represents the optimal generators, discriminators and semantic mapping learned in the training phase. In the second stage, different strategies are used for ZSL and GZSL, which are detailed in the following two subsections.

3.3.1 ZSL

We use \(\{G_{\rm av}^*,G_{\rm va}^*,D_{{\rm v}}^*,D_{\rm a}^*,\Phi ^*\}\) as the initial parameters. Prediction is then performed for the instances in \(X_{\rm u}\) according to Eq. (15), and the predicted semantic attribute vector is referred to as the pseudo-semantic attribute vector. On this basis, we use Eq. (14) to compute the task losses of all instances in the target domain and sort the instances together with their pseudo-semantic attribute vectors by task loss. Given a self-labeling ratio \(r\), the \(r\times N_{\rm u}\) instances with the smallest task losses are selected and replace the same number of samples in \(D_{\rm s}\). The parameters \(\{G_{\rm av}^*,G_{\rm va}^*,D_{{\rm v}}^*,D_{\rm a}^*,\Phi ^*\}\) are then updated according to Eqs. (11) and (14). These steps are repeated until training converges, as sketched below.
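A sketch of one self-labeling round, built on the hypothetical predict function above; how exactly the selected instances replace samples in \(D_{\rm s}\) is kept outside the sketch.

```python
def self_label_step(X_u, Z_u, r, G_va, G_av, Phi):
    """Select the fraction r of target instances with the smallest task losses
    together with their pseudo-semantic attribute vectors (sketch)."""
    scores, pseudo_z = [], []
    with torch.no_grad():                                   # pseudo-labeling needs no gradients
        for x in X_u:
            c = predict(x, Z_u, G_va, G_av, Phi)            # pseudo-label via Eq. (15)
            phi_z = Phi(Z_u[c].unsqueeze(0)).squeeze(0)
            score = (torch.norm(x - G_av(phi_z.unsqueeze(0)).squeeze(0)) +
                     torch.norm(phi_z - G_va(x)))
            scores.append(score.item())
            pseudo_z.append(Z_u[c])
    keep = torch.tensor(scores).argsort()[: int(r * len(X_u))]   # most confident instances
    return [(X_u[i], pseudo_z[i]) for i in keep]
```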

3.3.2 GZSL

As mentioned earlier, instances of unseen classes tend to be classified into seen categories, so we use pseudo-semantic attribute vectors to adjust the model. Since the performance of the model gradually becomes biased toward the unseen classes during this adjustment, after performing the same operations as in ZSL we further use \(D_{\rm s}\) to update the parameters \(\{G_{\rm av}^*,G_{\rm va}^*,D_{{\rm v}}^*,D_{\rm a}^*,\Phi ^*\}\) with the losses (11) and (14).

Algorithm 2 (self-labeled strategy)
Table 2 The details of the datasets we used

4 Experiments

4.1 Datasets and setting

4.1.1 Datasets

AWA2 (Animals with Attributes 2) includes 30,475 instances from 50 classes, 40 of which are used as seen classes and 10 as unseen classes; their semantics are described by 85-dimensional attribute vectors.

aPY (aPascal-aYahoo) includes 15,339 instances from 32 classes. We use 20 classes as seen classes and the remaining 12 classes as unseen classes. Its semantics are described by 64-dimensional attribute vectors.

SUN (SUN Attribute) includes 14,340 instances from 717 categories. Among them, 645 classes are used as seen classes and 72 as unseen classes. Its semantics are described by 102-dimensional attribute vectors.

In this paper, the original images are not used as training data; instead, 2048-dimensional features extracted by a ResNet-101 [17] pre-trained on ImageNet are used as visual features. More details are given in Table 2. Each dataset has two splits, SS and PS, which are the same as in previous work [41].

4.1.2 Methods for comparisons

Our method is based on GANs, so we choose several methods that are also based on GANs for comparison: the generative adversarial approach for zero-shot learning (GAZSL) [48], the Wasserstein GAN with a classification loss (f-CLSWGAN) [42] and the leveraging-invariant-side GAN (LisGAN) [24]. Since our method is also transductive, we also select several transductive methods for comparison: transductive multi-view zero-shot learning (TMV) [10], shared model space (SMS) [16] and quasi-fully supervised learning (QFSL) [37]. Some other methods that do not share many similarities with ours are also selected, because they have greatly promoted the development of ZSL research and are often regarded as baselines by other researchers: direct attribute prediction (DAP) [23], deep visual semantic embedding (DEVISE) [9], cross modal transfer (CMT) [36], convex combination of semantic embeddings (CONSE) [31], semantic similarity embedding (SSE) [45], structured joint embedding (SJE) [2], embarrassingly simple approach to zero-shot learning (ESZSL) [33], latent embeddings (LATEM) [40], attribute label embedding (ALE) [2], synthesized classifiers (SYNC) [5], semantic autoencoder (SAE) [21], generative framework for zero-shot learning (GFZSL) [39] and deep embedding model (DEM) [44].

These methods use a variety of strategies to accomplish ZSL and GZSL tasks. GAZSL [48] uses Wikipedia articles to generate features of unseen classes and uses these generated features for training. f-CLSWGAN [42] generates features of unseen classes for training and optimizes the Wasserstein distance. LisGAN [24] introduces soul samples to ensure the diversity and reliability of the GAN's generation, thereby improving the performance of the model. TMV [10] proposes a transductive multi-view embedding space to address the projection shift problem and exploits the multi-view information of visual features in this space. SMS [16] realizes knowledge transfer by learning a model space shared across multiple models. QFSL [37] uses labeled data to learn the relationship between visual and semantic information, and uses unseen data to reduce the bias. DAP [23] learns an attribute probability classifier and then uses it for classification. DEVISE [9] makes predictions with a pairwise ranking objective. CMT [36] projects images into the semantic space and aligns them with class names. CONSE [31] maps images into the semantic space through a convex combination of label embedding vectors and then aligns them. SSE [45] compares the similarity between visual and semantic information in the visual and semantic spaces simultaneously. SJE [2] learns a bilinear compatibility by optimizing a structural SVM loss. ESZSL [33] learns a bilinear compatibility with a square loss and an explicit Frobenius-norm regularizer. LATEM [40] extends SJE [2] to piecewise linear mappings. ALE [2] uses a ranking loss to learn a bilinear compatibility function between the visual space and the attribute space. SYNC [5] constructs classifiers for unseen classes as linear combinations of classifiers learned on the seen classes. SAE [21] uses a semantic autoencoder to reconstruct the image features. GFZSL [39] models each class as a Gaussian and learns a regression function to project the classes into a common space. DEM [44] uses the visual space as the embedding space and then uses a multi-modality fusion method to combine more semantic information.

4.1.3 Evaluation metrics

We use accuracy evaluation metrics similar to those of [41]. For ZSL, the average per-class top-1 accuracy is computed as follows,

$$\begin{aligned} \hbox {acc}_Y=\frac{1}{|Y|}\sum _{c=1}^{|Y|}\frac{\#\hbox {correct}\ \hbox {predictions}\ \hbox {in}\ c}{\#\hbox {samples}\ \hbox {in}\ c} \end{aligned}$$
(18)

where \(|Y|\) is the total number of categories at test time. For GZSL, we need to consider the performance on both the seen and unseen classes, so the harmonic mean of the accuracies with respect to the seen and unseen classes is calculated as follows,

$$\begin{aligned} H_{\hbox {acc}}=\frac{2\times \hbox {acc}_{Y_{\rm s}}\times \hbox {acc}_{Y_{\rm u}}}{\hbox {acc}_{Y_{\rm s}} +\hbox {acc}_{Y_{\rm u}}} \end{aligned}$$
(19)

where \(\hbox {acc}_{Y_{\rm s}}\) is the accuracy on the seen data and \(\hbox {acc}_{Y_{\rm u}}\) is the accuracy on the unseen data.
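Both metrics are straightforward to compute; a direct NumPy transcription of Eqs. (18) and (19) is shown below (labels and predictions are assumed to be integer arrays).

```python
import numpy as np

def per_class_top1(y_true, y_pred, classes):
    """Average per-class top-1 accuracy, Eq. (18)."""
    per_class = [np.mean(y_pred[y_true == c] == c) for c in classes]
    return float(np.mean(per_class))

def harmonic_mean(acc_seen, acc_unseen):
    """Harmonic mean of the seen and unseen accuracies, Eq. (19)."""
    return 2 * acc_seen * acc_unseen / (acc_seen + acc_unseen)
```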

4.2 Implementation

The generators each consist of three fully connected layers, each followed by a ReLU layer. All generators from the visual space to the semantic space share weights, and all generators from the semantic space to the visual space share weights. The discriminators also use three fully connected layers: the first two are activated by the ReLU function and the last by the Sigmoid function.

This paper uses the semantic attribute vectors provided by the datasets as auxiliary information. The mapping \(\Phi \), composed of one fully connected layer activated by ReLU, maps the original attribute vectors into the semantic space mentioned above. During training, we stop as soon as convergence is achieved, because excessive training aggravates the bias problem. The regularization coefficient \(\lambda \) in the task losses is set to 1e−4 for AWA and aPY and to 0 for SUN. The learning rate is set to 1e−5, and we use Adam [19] for training.
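The following is a sketch of the layer structure just described. The hidden widths and the 128-dimensional semantic space are illustrative assumptions, not values reported by the authors; the 2048-dimensional visual input and 85-dimensional attribute input correspond to the ResNet-101 features and the AWA2 attributes mentioned above.

```python
import torch.nn as nn

hidden, sem_dim = 1024, 128   # illustrative sizes, not values reported by the authors

G_va = nn.Sequential(nn.Linear(2048, hidden), nn.ReLU(),     # visual -> semantic space
                     nn.Linear(hidden, hidden), nn.ReLU(),
                     nn.Linear(hidden, sem_dim), nn.ReLU())
G_av = nn.Sequential(nn.Linear(sem_dim, hidden), nn.ReLU(),  # semantic -> visual space
                     nn.Linear(hidden, hidden), nn.ReLU(),
                     nn.Linear(hidden, 2048), nn.ReLU())
D_a = nn.Sequential(nn.Linear(sem_dim, hidden), nn.ReLU(),   # semantic-domain discriminator
                    nn.Linear(hidden, hidden), nn.ReLU(),
                    nn.Linear(hidden, 1), nn.Sigmoid())
D_v = nn.Sequential(nn.Linear(2048, hidden), nn.ReLU(),      # visual-domain discriminator
                    nn.Linear(hidden, hidden), nn.ReLU(),
                    nn.Linear(hidden, 1), nn.Sigmoid())
Phi = nn.Sequential(nn.Linear(85, sem_dim), nn.ReLU())       # attribute vector -> semantic space (85-d for AWA2)
```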

In the training phase, the generators, the discriminators and \(\Phi \) are trained synchronously. At each epoch, we first use the source dataset for training and then the target dataset. For each batch of source data, we minimize \(L_{{\rm T}}\), then \(L_{{\rm GC}}\), and then \(L_{{\rm D}}\). After each epoch, we train the model by minimizing \(L_{{\rm C}_{\rm v}^{\rm u}}\). In the self-labeled phase, we still train the generators, the discriminators and \(\Phi \) at the same time. For the ZSL setting, we first replace the corresponding part of \(D_{\rm s}\) with the selected pseudo-labeled instances and then minimize the losses \(L_{{\rm T}}\) and \(L_{{\rm GC}}\) successively. For GZSL, after performing the same operations as in ZSL, the losses \(L_{{\rm T}}\) and \(L_{{\rm GC}}\) on \(D_{\rm s}\) are additionally minimized successively.

4.3 Comparison results

4.3.1 Comparisons at ZSL setting

Table 3 shows that our method performs well compared with existing methods known to us. When the SS split is used for SUN, we achieve results similar to the current best methods; in the other cases, our method improves on the best competing method by 1.6–14.9%.

Table 3 Top-1 accuracies of different methods on three datasets with two splits

We find that GAN-based ZSL methods such as GFZSL, f-CLSWGAN, LisGAN and our method tend to produce better results than traditional embedding methods, which shows that GANs can be a powerful tool for ZSL research. GANs can also be used in many different ways: GFZSL, f-CLSWGAN and LisGAN all use GANs to generate visual features, whereas our method uses GANs to achieve a flexible alignment between visual and semantic information. These two approaches do not conflict, so combining them may yield good results.

4.3.2 Comparisons at GZSL setting

Table 4 shows the comparison results. On the AWA and aPY datasets, the \(H_{\hbox {acc}}\) of our method is 4.4% and 5.3% higher than the current best method, respectively. Our method also achieves good results on SUN. The main reason is that our model performs well when predicting unseen instances, which indicates its better generalization ability. We also find that many previous methods, such as DAP, ESZSL and SAE, perform well on the traditional ZSL problem but their performance drops sharply on the GZSL problem. Such models are therefore greatly limited in practical applications, whereas ours does not suffer from this issue.

Table 4 Comparisons at GZSL setting

We use \(H_{\hbox {acc}}\) as the main evaluation index for GZSL, which is a more objective measure because it is affected by the prediction accuracies on both the seen and unseen instances. Some methods, such as SAE, GFZSL, DEM and GAZSL, guarantee a high prediction accuracy on seen instances but a very low accuracy on unseen instances, while f-CLSWGAN and LisGAN achieve a high \(H_{\hbox {acc}}\) by sacrificing accuracy on the seen instances in order to obtain a higher accuracy on the unseen instances. In summary, our method achieves better performance because it considers both aspects. First, the CycleGAN-style structure ensures a better binding of visual and semantic information, and the reconstruction of unseen visual information is introduced in the training phase. These strategies ensure that the knowledge learned in the seen domain can be smoothly transferred to the unseen domain, so that a high prediction accuracy on unseen instances can be obtained. On the other hand, in order not to sacrifice too much prediction accuracy on the seen instances, we adopt a self-labeled strategy different from that of the ZSL setting: while the unseen instances are used to adjust the model, the seen instances are used as well.

Our model achieves good performance on the SUN dataset, but its advantage there is not as large as on the AWA and aPY datasets. This is because the number of categories in SUN is much larger than in the other two datasets, so the model faces more diverse data, and it is more difficult to generate a distribution similar to that of the visual or semantic features.

4.4 Model analysis

4.4.1 Parameter sensitivity

Our model has one important parameter: the ratio of unlabeled target-dataset instances used in the self-labeled phase. Figure 4 shows our experimental analysis of this parameter.

In the ZSL setting, the accuracy on AWA gradually stabilizes as the ratio increases. For the aPY and SUN datasets, the classification accuracy first increases with the ratio but, after reaching a peak, decreases as the ratio increases further. This is because the model does not have a particularly good classification ability for the unseen instances of aPY and SUN; when the ratio becomes too large, too many false predictions are introduced, which may spoil the learned model. In the GZSL setting, the harmonic mean accuracies show similar behavior.

Fig. 4

Parameter sensitivity. The horizontal axis represents the proportion of all unseen instances that are used as unlabeled instances in the self-labeled phase

The results show that the optimal ratios in the ZSL setting are 0.85, 1, 0.8, 0.2, 0.9 and 0.8 for AWA(SS), AWA(PS), aPY(SS), aPY(PS), SUN(SS) and SUN(PS), respectively. In the GZSL setting, the optimal ratios are 0.8, 0.8 and 0.9 for AWA, aPY and SUN, respectively. However, as shown in Fig. 4, these settings are not critical, and good results can also be achieved with ratios close to the optimal ones.

4.4.2 Significance test

In general, the generalization accuracy of a model cannot be known exactly and can only be approximated by the mean of multiple experimental results. The results in Tables 3 and 4 are mean values over repeated experiments. We hypothesize, with more than 95% confidence, that there is no significant difference between the results reported in this paper and the generalization accuracy. To verify this hypothesis, Student's t test is used. Specifically, we conducted 10 repeated experiments and obtained 10 values of Top-1 accuracy and \(H_{\hbox {acc}}\) under ZSL and GZSL for each dataset. We then use these data and the results in Tables 3 and 4 to perform a significance test with Student's t test at the significance level \(\alpha =0.05\). The results are shown in Table 5.
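The test can be reproduced with SciPy's one-sample t test; the accuracy values in the sketch below are placeholders for a single dataset and setting, not the numbers reported in Table 5.

```python
from scipy import stats

runs = [71.2, 70.8, 71.5, 70.9, 71.1, 71.4, 70.7, 71.3, 71.0, 71.2]  # 10 repeated runs (hypothetical %)
reported = 71.1                                                       # mean reported in Table 3/4 (hypothetical)

t_stat, p_value = stats.ttest_1samp(runs, popmean=reported)
print(p_value > 0.05)   # True -> no significant difference at the alpha = 0.05 level
```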

Table 5 shows that the p value on each dataset under both ZSL and GZSL is greater than 0.05, which supports our hypothesis: the results given in Tables 3 and 4 do not differ significantly from the model's generalization accuracy.

Table 5 The result of Student’s t test

4.4.3 Ablation

We adopt a bidirectional generative method and propose the classifier described in Sect. 3.2.7, which we call VSC. To verify the effectiveness of this classifier, two additional settings are designed for the ablation experiment.

  1.

    The classifier only depends on the distance in the visual domain. As shown in Eq. (20), the classifier in this setting is called VC,

    $$\begin{aligned} \left\{ \begin{array}{ll} \mathop {\arg \min }\nolimits_{z \in Z_{\rm u}} \ \Vert x^t-G_{\rm av}(\Phi (z))\Vert _2, & {\rm for}\quad {\rm ZSL},\\ \mathop {\arg \min }\nolimits_{z \in Z_{\rm s}{\cup }Z_{\rm u}} \ \ \Vert x^t-G_{\rm av}(\Phi (z))\Vert _2, & {\rm for}\quad {\rm GZSL}.\\ \end{array} \right. \end{aligned}$$
    (20)
  2.

    The classifier only depends on the distance in the semantic domain. As shown in Eq. (21), the classifier in this setting is called SC,

    $$\begin{aligned} \left\{ \begin{array}{ll} \mathop {\arg \min }\nolimits_{z \in Z_{\rm u}} \ \Vert \Phi (z)-G_{\rm va}(x^t)\Vert _2, &{\rm for}\quad {\rm ZSL},\\ \mathop {\arg \min }\nolimits_{z \in Z_{\rm s}{\cup }Z_{\rm u}} \ \Vert \Phi (z)-G_{\rm va}(x^t)\Vert _2, &{\rm for}\quad {\rm GZSL}.\\ \end{array} \right. \end{aligned}$$
    (21)
Table 6 Comparison results of different classifiers

The experimental results are shown in Table 6. They show that our classifier is significantly better than those that use the distance in a single domain as the classification basis, which also supports the claim made earlier in this paper: the bidirectional mapping retains more useful information, and better classification results can be obtained with the proposed classifier.

4.4.4 Class-wise accuracy

To analyze the sensitivity of our model to different categories of images, we examine its classification results per category. To obtain a more objective evaluation, we compare against f-CLSWGAN, a GAN-based model whose paper provides a confusion matrix. Figure 5 shows the confusion matrices on the aPY dataset. From Fig. 5, we can observe that our method performs better; for some categories for which f-CLSWGAN cannot give reasonable judgments, our method classifies well. In particular, for potted plant, sheep and statue, the classification accuracies of our method are 45%, 49% and 67% higher than those of f-CLSWGAN.

Fig. 5

The confusion matrices on the aPY dataset. Subfigures (a) and (b) are the confusion matrices of f-CLSWGAN and our method, respectively

There is also an interesting phenomenon in the experimental results. In ZSL, researchers usually assume that misclassification occurs because two things are visually similar, such as a goat and a donkey. However, we find that both our model and f-CLSWGAN frequently misclassify tvmonitor and pottedplant as each other, even though these two objects are clearly not visually similar. The reason is that they usually appear in similar environments, i.e., their visual features contain similar background information. This extra, unnecessary background information influences the judgment of the model, so eliminating the influence of background information may become one of our future research directions.

5 Conclusions

This paper proposes a zero-shot learning method based on bidirectional projections, which map visual features and semantic features to each other and align their distributions while ensuring that no useful information is lost in the mapping process. In addition, we introduce the cycle loss of unseen unlabeled data into the training process and use the predicted pseudo-labels of these samples to correct the model, which greatly alleviates the bias problem in zero-shot learning. Experimental results on three popular datasets show that our method is superior to most existing state-of-the-art methods.