
1 Introduction

Generalized Zero-shot Learning (GZSL) separates the classes of interest into a subset of seen classes and a subset of unseen classes. The training process uses the semantic features of both subsets but the visual representations of only the seen classes, while the testing process aims to classify the visual representations of both subsets [2, 3]. The semantic features available for both the training and testing classes are typically acquired from other domains, such as visual features [4], text [3, 5, 6], or learned classifiers [7]. The traditional approach to this challenge [2] involves learning a transformation from the visual to the semantic space of the seen classes. Testing is then performed by transforming the visual representations of the seen and unseen classes into this semantic space, where classification is typically achieved with a nearest neighbor classifier that selects the closest class in the semantic space. In contrast to Zero-shot Learning (ZSL), which uses only the unseen domain for testing, GZSL approaches tend to be biased towards the seen classes, producing poor classification results, particularly for the unseen testing classes [1].

Fig. 1.

Overview of the proposed multi-modal cycle-consistent GZSL approach. Our approach extends the idea of synthesizing visual representations of seen and unseen classes in order to train a classifier for the GZSL problem [1]. The main contribution of the paper is the use of a new multi-modal cycle consistency loss in the training of the visual feature generator that minimizes the reconstruction error between the semantic feature \(\mathbf {a}\), which was used to synthesize the visual feature \(\widetilde{\mathbf {x}}\), and the reconstructed semantic feature \(\widetilde{\mathbf {a}}\) mapped from \(\widetilde{\mathbf {x}}\). This loss is shown to constrain the optimization problem more effectively in order to produce useful synthesized visual features for training the GZSL classifier.

These traditional approaches rely on the assumption that the distributions observed in the semantic and visual spaces are relatively similar. Recently, this assumption has been relaxed to allow the semantic space to be optimized together with the transformation from the visual to the semantic space [8] - this alleviates the classification bias mentioned above to a certain degree. More recent approaches consist of building a generative adversarial network (GAN) that synthesizes visual representations of the seen and unseen classes directly from their semantic representation [8, 9]. These synthesized features are then used to train a multi-class classifier of seen and unseen classes. This approach has been shown to improve the GZSL classification accuracy, but an obvious weakness is that the unconstrained nature of the generation process may let the approach generate unrepresentative synthetic visual representations, particularly of the unseen classes (i.e., representations that are far from possible visual representations of the test classes).

The main contribution of this paper is a new regularization of the generation of synthetic visual representations in the training of GAN-based methods that address the GZSL classification problem. This regularization is based on a multi-modal cycle consistency loss term that enforces good reconstruction from the synthetic visual representations back to their original semantic features (see Fig. 1). It is motivated by the cycle consistency loss used in training GANs [10], which forces the generator to produce more constrained visual representations. We argue that this constraint preserves the semantic compatibility between visual and semantic features. Once our model is trained with this multi-modal cycle consistency loss term, we can synthesize visual representations for unseen classes in order to train a GZSL classifier [1, 11].

Using the experimental setup described by Xian et al. [1], we show that our proposed regularization provides significant improvements not only in GZSL classification accuracy but also in ZSL on the following datasets: Caltech-UCSD-Birds 200-2011 (CUB) [2, 12], Oxford-Flowers (FLO) [13], Scene Categorization Benchmark (SUN) [2, 14], Animals with Attributes (AWA) [2, 4], and ImageNet [15]. In fact, the experiments show that our proposed approach produces the current best ZSL and GZSL classification results in the field for these datasets.

2 Literature Review

The starting point for our literature review is the work by Xian et al. [1, 2], who proposed new benchmarks using commonly accepted evaluation protocols on publicly available datasets. These benchmarks allow a fair comparison among recently proposed ZSL and GZSL approaches, and for this reason we use them to compare our results with the current state of the art in the field. We provide a general summary of the methods presented in [2], and encourage the reader to study that paper for more details on previous works. The majority of ZSL and GZSL methods tend to compensate for the lack of visual representations of the unseen classes by learning a mapping between the visual and semantic spaces [16, 17]. For instance, a fairly successful approach is based on a bi-linear compatibility function that associates visual representations and semantic features. Examples of such approaches are ALE [18], DEVISE [19], SJE [20], ESZSL [21], and SAE [22]. Despite their simplicity, these methods tend to produce the current state-of-the-art results on benchmark datasets [2]. A straightforward extension of the methods above is the exploration of a non-linear compatibility function between visual and semantic spaces. These approaches, exemplified by LATEM [23] and CMT [6], tend not to be as competitive as their bi-linear counterparts, probably because the more complex models need larger training sets to generalize effectively. Seminal ZSL and GZSL methods were based on models that learn intermediate feature classifiers, which are then combined to predict image classes (e.g., DAP and IAP) [4]; these models tend to present relatively poor classification results. Finally, hybrid models, such as SSE [3], CONSE [24], and SYNC [25], rely on a mixture model of seen classes to represent images and semantic embeddings. These methods tend to be competitive for classifying the seen classes, but not the unseen classes.

The main disadvantage of the methods above is that the lack of visual training data for the unseen classes biases the mapping between visual and semantic spaces towards the semantic features of the seen classes, particularly for unseen test images. This is an issue for GZSL because it has a negative effect on the classification accuracy of the unseen classes. Recent research addresses this issue with GAN models that are trained to synthesize visual representations for the seen and unseen classes, which can then be used to train a classifier for both the seen and unseen classes [8, 9]. However, the unconstrained generation of synthetic visual representations for the unseen classes allows the production of synthetic samples that may be too far from the actual distribution of visual representations, particularly for the unseen classes. In the GAN literature, this problem is known as unpaired training [10], where not all source samples (e.g., semantic features) have corresponding target samples (e.g., visual features) for training. This creates a highly unconstrained optimization problem, which Zhu et al. [10] addressed with a cycle consistency loss that pushes the representation from the target domain back to the source domain, thereby constraining the optimization problem. In this paper, we explore this idea for GZSL, which is a novelty compared to previous GAN-based methods proposed for GZSL and ZSL.

3 Multi-modal Cycle-Consistent Generalized Zero Shot Learning

In GZSL and ZSL [2], the dataset is denoted by \(\mathcal {D} = \{(\mathbf x ,\mathbf {a},y)_i\}_{i=1}^{|\mathcal {D}|}\), with \(\mathbf x \in \mathcal {X} \subseteq \mathbb {R}^K\) representing the visual representation (e.g., image features from deep residual nets [26]), \(\mathbf {a} \in \mathcal {A} \subseteq \mathbb R^L\) denoting the L-dimensional semantic feature (e.g., a set of binary attributes [4] or a dense word2vec representation [27]), \(y \in \mathcal {Y} = \{ 1,..., C \}\) denoting the image class, and |.| representing set cardinality. The set \(\mathcal {Y}\) is split into seen and unseen subsets, where the seen subset is denoted by \(\mathcal {Y}_S\) and the unseen subset by \(\mathcal {Y}_U\), with \(\mathcal {Y} = \mathcal {Y}_S \cup \mathcal {Y}_U\) and \(\mathcal {Y}_S \cap \mathcal {Y}_U = \emptyset \). The dataset \(\mathcal {D}\) is also divided into mutually exclusive training and testing subsets, \(\mathcal {D}^{Tr}\) and \(\mathcal {D}^{Te}\), respectively. Furthermore, the training and testing sets can also be divided in terms of the seen and unseen classes, so \(\mathcal {D}^{Tr}_S\) denotes the training samples of the seen classes, while \(\mathcal {D}^{Tr}_U\) represents the training samples of the unseen classes (similarly \(\mathcal {D}^{Te}_S\) and \(\mathcal {D}^{Te}_U\) for the testing set). During training, samples in \(\mathcal {D}_S^{Tr}\) contain the visual representation \(\mathbf {x}_i\), semantic feature \(\mathbf {a}_i\) and class label \(y_i\), while the samples in \(\mathcal {D}_U^{Tr}\) comprise only the semantic feature and class label. During ZSL testing, only the samples from \(\mathcal {D}_U^{Te}\) are used, while in GZSL testing, all samples from \(\mathcal {D}^{Te}\) are used. Note that for both ZSL and GZSL, only the visual representation of the testing samples is used to predict the class label.
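To make the notation concrete, the following minimal Python sketch shows one way to organize the splits described above; the names and shapes are illustrative assumptions, not taken from the authors' code.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class Split:
    """One data subset (e.g., D_S^Tr): visual features x, semantic features a, labels y."""
    x: np.ndarray  # (N, K) visual features; empty for the unseen-class training subset
    a: np.ndarray  # (N, L) semantic features (attributes or CNN-RNN text embeddings)
    y: np.ndarray  # (N,)   integer class labels

# Illustrative shapes only (K = 2048 ResNet-101 features, L = 85 AWA attributes).
train_seen = Split(x=np.zeros((100, 2048)), a=np.zeros((100, 85)),
                   y=np.zeros(100, dtype=int))
# D_U^Tr provides only semantic features and labels (here one per unseen class).
train_unseen = Split(x=np.empty((0, 2048)), a=np.zeros((10, 85)), y=np.arange(10))
```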

Below, we first explain the f-CLSWGAN model [1], which is the baseline for the implementation of the main contribution of this paper: the multi-modal cycle consistency loss used in training the feature generator of GAN-based GZSL models. The loss, feature generator, learning and testing procedures are explained subsequently.

Fig. 2.

Overview of the multi-modal cycle-consistent GZSL model. The visual features, represented by \(\mathbf {x}\), are extracted from a state-of-the-art CNN model, and the semantic features, represented by \(\mathbf {a}\), are available from the training set. The generator G(.) synthesizes new visual features \(\widetilde{\mathbf {x}}\) using the semantic feature and a randomly sampled noise vector \(\mathbf {z} \sim \mathcal {N}(\mathbf {0},\mathbf {I})\), and the discriminator D(.) tries to distinguish between real and synthesized visual features. Our main contribution is the integration of a multi-modal cycle consistency loss (at the bottom) that minimizes the error between the original semantic feature \(\mathbf {a}\) and its reconstruction \(\widetilde{\mathbf {a}}\), produced by the regressor R(.).

3.1 f-CLSWGAN

Our approach is an extension of the feature generation method proposed by Xian et al. [1], which consists of a classification-regularized generative adversarial network (f-CLSWGAN). This network is composed of a generative model \(G:\mathcal {A} \times \mathcal {Z} \rightarrow \mathcal {X}\) (parameterized by \(\theta _G\)) that produces a visual representation \(\widetilde{\mathbf {x}}\) given its semantic feature \(\mathbf {a}\) and a noise vector \(\mathbf {z} \sim \mathcal {N}(\mathbf {0},\mathbf {I})\) sampled from a multi-dimensional centered Gaussian, and a discriminative model \(D:\mathcal {X} \times \mathcal {A} \rightarrow [0,1]\) (parameterized by \(\theta _D\)) that tries to distinguish whether an input pair of visual representation \(\mathbf {x}\) and semantic feature \(\mathbf {a}\) is real or generated. Note that while the method developed by Yan et al. [28] concerns the generation of realistic images, our proposed approach, similarly to [1, 8, 9], aims to generate visual representations, such as the features from a deep residual network [26]; this strategy has been shown to produce more accurate GZSL classification results than the use of realistic images. The training algorithm for estimating \(\theta _G\) and \(\theta _D\) follows a minimax game, where G(.) generates synthetic visual representations that are supposed to fool the discriminator, which in turn tries to distinguish the real from the synthetic visual representations. We rely on one of the most stable training methods for GANs, the Wasserstein GAN, which uses the following loss function [29]:

$$\begin{aligned} \theta _G^*,\theta _D^*=\arg \min _{\theta _G} \max _{\theta _D} \ell _{WGAN}(\theta _G,\theta _D), \end{aligned}$$
(1)

with

$$\begin{aligned} \begin{aligned} \ell _{WGAN}(\theta _G,\theta _D)&= \mathbb E_{(\mathbf {x},\mathbf {a}) \sim \mathbb P_S^{x,a}}[D(\mathbf {x},\mathbf {a};\theta _D)] - \mathbb E_{(\widetilde{\mathbf {x}},\mathbf {a}) \sim \mathbb P^{x,a}_G}[D(\widetilde{\mathbf {x}},\mathbf {a};\theta _D)] \\&-\,\lambda \mathbb E_{(\hat{\mathbf {x}},\mathbf {a}) \sim \mathbb P^{x,a}_{\alpha }}[\left( ||\nabla _{\hat{\mathbf {x}}}D(\hat{\mathbf {x}},\mathbf {a}; \theta _D)||_2 - 1\right) ^2], \end{aligned} \end{aligned}$$
(2)

where \(\mathbb E[.]\) represents the expected value operator, \(\mathbb P_S^{x,a}\) is the joint distribution of visual and semantic features from the seen classes (in practice, samples from that distribution are the ones in \(\mathcal {D}_S^{Tr}\)), \(\mathbb P^{x,a}_G\) represents the joint distribution of semantic features and the visual features produced by the generative model G(.), \(\lambda \) denotes the penalty coefficient, and \(\mathbb P^{x,a}_{\alpha }\) is the joint distribution of the semantic features and the visual features produced by \(\hat{\mathbf {x}} \sim \alpha \mathbf {x} + (1-\alpha )\widetilde{\mathbf {x}}\) with \(\alpha \sim \mathcal {U}(0,1)\) (i.e., uniform distribution).
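To make the objective in (1)-(2) concrete, the sketch below shows a conditional WGAN critic loss with gradient penalty. This is a minimal illustration in PyTorch rather than the authors' TensorFlow implementation; the function and argument names are assumptions.

```python
import torch

def wgan_gp_loss(D, x_real, x_fake, a, lam=10.0):
    """Conditional WGAN critic objective of Eq. (2): score real pairs high, generated
    pairs low, and penalize the critic's gradient norm on interpolated features."""
    d_real = D(x_real, a).mean()
    d_fake = D(x_fake, a).mean()

    # x_hat = alpha * x + (1 - alpha) * x_tilde, with alpha ~ U(0, 1) per sample.
    alpha = torch.rand(x_real.size(0), 1, device=x_real.device)
    x_hat = (alpha * x_real + (1.0 - alpha) * x_fake).requires_grad_(True)
    grads = torch.autograd.grad(D(x_hat, a).sum(), x_hat, create_graph=True)[0]
    penalty = ((grads.norm(2, dim=1) - 1.0) ** 2).mean()

    return d_real - d_fake - lam * penalty  # maximized w.r.t. D, minimized w.r.t. G
```

In practice the critic ascends this quantity (i.e., minimizes its negative) while the generator descends it.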

Finally, the f-CLSWGAN is trained with the following objective function:

$$\begin{aligned} \theta _G^*,\theta _C^*,\theta _D^*=\arg \min _{\theta _G,\theta _C} \max _{\theta _D} \ell _{WGAN}(\theta _G,\theta _D) + \beta \ell _{CLS}(\theta _C,\theta _G), \end{aligned}$$
(3)

where \(\ell _{CLS}(\theta _C,\theta _G) = -\mathbb E_{(\widetilde{\mathbf {x}},y) \sim \mathbb P^{x,y}_G}[\log P(y | \widetilde{\mathbf {x}}, \theta _C)]\), with

$$\begin{aligned} P(y | \widetilde{\mathbf {x}}, \theta _C) = \frac{\exp ( (\theta _C(y))^T\widetilde{\mathbf {x}})}{\sum _{c\in \mathcal {Y}}\exp ((\theta _C(c))^T\widetilde{\mathbf {x}})} \end{aligned}$$
(4)

representing the probability that the sample \(\widetilde{\mathbf {x}}\) is predicted with its true label y, and \(\beta \) is a hyper-parameter that weights the contribution of the classification loss. Xian et al. [1] found that this classification regularization encourages G(.) to generate discriminative visual representations. The model obtained from the optimization in (3) is referred to as baseline in the experiments.
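As a complement to (3)-(4), the following minimal sketch (PyTorch, with illustrative helper names) computes \(\ell _{CLS}\) with a linear softmax classifier; the cross-entropy call directly implements the negative log-likelihood.

```python
import torch.nn as nn
import torch.nn.functional as F

def make_softmax_classifier(x_dim, n_classes):
    """Linear classifier whose logits are theta_C(c)^T x, matching Eq. (4)."""
    return nn.Linear(x_dim, n_classes, bias=False)

def cls_loss(classifier, x_fake, y):
    """l_CLS of Eq. (3): negative log-likelihood of the true labels y for the
    synthesized visual features x_fake."""
    return F.cross_entropy(classifier(x_fake), y)
```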

3.2 Multi-modal Cycle Consistency Loss

The main issue with previously proposed GZSL approaches based on generative models [1, 8, 9] is that the unconstrained nature of the generation process (from semantic to visual features) may produce image representations that are too far from the real distribution present in the training set, resulting in ineffective training of the multi-class classifier, particularly for the unseen classes. The approach we propose to alleviate this problem consists of constraining the synthetic visual representations so that they map back to their original semantic features; this regularization is inspired by the cycle consistency loss [10]. Figure 2 shows an overview of our proposal. This approach, the main contribution of this paper, is represented by the following loss:

$$\begin{aligned} \begin{aligned} \ell _{CYC}(\theta _R,\theta _G)&= \mathbb E_{\mathbf {a} \sim \mathbb P_S^a,\mathbf {z} \sim \mathcal {N}(\mathbf {0},\mathbf {I})} \left[ \Vert {\mathbf {a} - R(G(\mathbf {a},\mathbf {z};\theta _G);\theta _R)}\Vert _2^2 \right] \\&+ \mathbb E_{\mathbf {a} \sim \mathbb P_U^a,\mathbf {z} \sim \mathcal {N}(\mathbf {0},\mathbf {I})} \left[ \Vert {\mathbf {a} - R(G(\mathbf {a},\mathbf {z};\theta _G);\theta _R)}\Vert _2^2 \right] , \end{aligned} \end{aligned}$$
(5)

where \(\mathbb P_S^a\) and \(\mathbb P_U^a\) denote the distributions of semantic features of the seen and unseen classes, respectively, and \(R:\mathcal {X} \rightarrow \mathcal {A}\) represents a regressor that estimates the original semantic features from the visual representation generated by G(.).
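A minimal sketch of the loss in (5), assuming a generator G(a, z) and regressor R(x) as defined above (PyTorch, with illustrative names; in the seen-only variant discussed in Sect. 3.3 the second term is simply dropped):

```python
import torch

def cycle_loss(G, R, a_seen, a_unseen, z_dim):
    """Multi-modal cycle consistency loss of Eq. (5): the semantic feature must be
    recoverable from the visual feature synthesized from it (a -> x~ -> a~)."""
    def term(a):
        z = torch.randn(a.size(0), z_dim, device=a.device)
        a_rec = R(G(a, z))
        return ((a - a_rec) ** 2).sum(dim=1).mean()  # E[ ||a - R(G(a, z))||_2^2 ]
    return term(a_seen) + term(a_unseen)
```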

3.3 Feature Generation

Using the losses proposed in Sects. 3.1 and 3.2, we can define several feature generators. First, we pre-train the regressor R(.) by minimizing the loss in (6) below, computed only from the seen classes:

$$\begin{aligned} \ell _{REG}(\theta _R) = \mathbb E_{(\mathbf {a},\mathbf {x}) \sim \mathbb P_S^{a,x}} \left[ \Vert {\mathbf {a} - R(\mathbf {x};\theta _R)}\Vert _2^2 \right] , \end{aligned}$$
(6)

where \(\mathbb P_S^{a,x}\) represents the real joint distribution of image and semantic features present in the seen classes. In practice, this regressor is defined by a multi-layer perceptron, whose output activation function depends on the format of the semantic vector.
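The pre-training of R(.) by minimizing (6) can be sketched as follows (PyTorch; the optimizer choice, learning rate and number of epochs are placeholders, with the actual values cross-validated as described in Sect. 4.3):

```python
import torch

def pretrain_regressor(R, loader, epochs, lr):
    """Minimize l_REG of Eq. (6) over real (x, a) pairs from the seen classes."""
    opt = torch.optim.Adam(R.parameters(), lr=lr)
    for _ in range(epochs):
        for x, a in loader:                      # real pairs sampled from D_S^Tr
            loss = ((a - R(x)) ** 2).sum(dim=1).mean()
            opt.zero_grad()
            loss.backward()
            opt.step()
    return R
```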

Our first strategy to build a feature generator consists of pre-training the regressor (using samples from the seen classes) by minimizing \(\ell _{REG}\) in (6), which produces \(\theta _R^*\), and then training the generator and discriminator of the WGAN using the following optimization function:

$$\begin{aligned} \theta _G^*,\theta _D^* = \arg \min _{\theta _G} \max _{\theta _D} \ell _{WGAN}(\theta _G,\theta _D) + \lambda _1 \ell _{CYC}(\theta _R^*,\theta _G), \end{aligned}$$
(7)

where \(\ell _{WGAN}\) is defined in (2), \(\ell _{CYC}\) is defined in (5), and \(\lambda _1\) weights the importance of the second optimization term. The optimization in (7) can use both the seen and unseen classes, or it can rely only on the seen classes, in which case the loss \(\ell _{CYC}\) in (5) has to be modified so that its second term (which depends on the unseen classes) is left out of the optimization. The feature generator model in (7) trained with seen and unseen classes is referred to as cycle-(U)WGAN, while the feature generator trained with only seen classes is labeled cycle-WGAN.
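The sketch below illustrates the alternating optimization of (7), reusing wgan_gp_loss and cycle_loss from the sketches above (PyTorch; the number of critic steps, optimizers and hyper-parameter values are assumptions made for illustration, and this variant corresponds to cycle-(U)WGAN since it uses the unseen semantic features, with cycle-WGAN obtained by dropping the unseen term of cycle_loss).

```python
import torch

def train_cycle_wgan(G, D, R, loader, a_unseen, z_dim, lam1=0.01,
                     n_critic=5, lr=1e-4, epochs=100):
    """Alternating updates for Eq. (7); the pre-trained regressor R is kept fixed."""
    opt_g = torch.optim.Adam(G.parameters(), lr=lr)
    opt_d = torch.optim.Adam(D.parameters(), lr=lr)
    for _ in range(epochs):
        for x, a, _ in loader:                   # seen-class minibatch (x, a, y)
            for _ in range(n_critic):            # critic updates (maximize l_WGAN)
                z = torch.randn(a.size(0), z_dim)
                d_loss = -wgan_gp_loss(D, x, G(a, z).detach(), a)
                opt_d.zero_grad()
                d_loss.backward()
                opt_d.step()
            # Generator update (minimize l_WGAN + lambda_1 * l_CYC).
            z = torch.randn(a.size(0), z_dim)
            g_loss = (-D(G(a, z), a).mean()
                      + lam1 * cycle_loss(G, R, a, a_unseen, z_dim))
            opt_g.zero_grad()
            g_loss.backward()
            opt_g.step()
    return G, D
```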

The second strategy explored in this paper to build a feature generator involves pre-training the regressor in (6) using samples from the seen classes to produce \(\theta _R^*\), and pre-training a softmax classifier for the seen classes using \(\ell _{CLS}\), defined in (3), which results in \(\theta _C^*\). We then train the generator and discriminator with the combined loss function:

$$\begin{aligned} \theta _G^*,\theta _D^* = \arg \min _{\theta _G} \max _{\theta _D} \ell _{WGAN}(\theta _G,\theta _D) + \lambda _1 \ell _{CYC}(\theta _R^*,\theta _G) + \lambda _2\ell _{CLS}(\theta _C^*,\theta _G). \end{aligned}$$
(8)

The feature generator model in (8) trained with seen classes is referred to as cycle-CLSWGAN.
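For cycle-CLSWGAN, the only change relative to the previous sketch is the generator objective, which gains the classification term of (8) weighted by \(\lambda _2\). The fragment below is a minimal illustration that reuses cycle_loss and cls_loss defined above; the pre-trained classifier C is an assumed name and is kept fixed.

```python
import torch

def generator_loss_clswgan(G, D, R, C, a, y, a_unseen, z_dim, lam1=0.01, lam2=0.01):
    """Generator objective of Eq. (8): WGAN term plus the cycle and classification
    regularizers; the pre-trained R and C are kept fixed."""
    z = torch.randn(a.size(0), z_dim, device=a.device)
    x_fake = G(a, z)
    return (-D(x_fake, a).mean()
            + lam1 * cycle_loss(G, R, a, a_unseen, z_dim)
            + lam2 * cls_loss(C, x_fake, y))
```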

3.4 Learning and Testing

As shown in [1], training a classifier using a potentially unlimited number of samples from the seen and unseen classes generated with \(\mathbf {x} \sim G(\mathbf {a},\mathbf {z};\theta _G^*)\) produces more accurate classification results than multi-modal embedding models [18,19,20,21]. Therefore, we train a final softmax classifier \(P(y|\mathbf {x},\theta _C)\), defined in (4), on the generated visual features by minimizing the negative log-likelihood loss \(\ell _{CLS}(\theta _C,\theta ^*_G)\), as defined in (3), where \(\theta _G^*\) has been learned with one of the feature learning strategies discussed in Sect. 3.3; the training of the classifier produces \(\theta ^*_C\). The samples used for training the classifier are generated according to the task to be solved. For ZSL, we only use generated visual representations from the set of unseen classes, while for GZSL, we use the generated samples from both seen and unseen classes.

Finally, the testing is based on the prediction of a class for an input test visual representation \(\mathbf {x}\), as follows:

$$\begin{aligned} y^* = \arg \max _{y \in \widetilde{\mathcal {Y}}} P(y|\mathbf {x},\theta ^*_C), \end{aligned}$$
(9)

where \(\widetilde{\mathcal {Y}} = \mathcal {Y}\) for GZSL or \(\widetilde{\mathcal {Y}} = \mathcal {Y}_U\) for ZSL.
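The final classifier training and the prediction rule in (9) can be sketched as follows (PyTorch; the function names, the full-batch training loop and 0-indexed class ids are illustrative assumptions, while the 300 samples per class match the choice reported in Sect. 4.3):

```python
import torch
import torch.nn.functional as F

def train_final_classifier(G, sem_feats, classes, z_dim, n_per_class=300,
                           epochs=50, lr=1e-3):
    """Train the softmax classifier of Sect. 3.4 on features synthesized by G.
    sem_feats[c] is the semantic feature of class c; `classes` holds only the unseen
    classes for ZSL and both seen and unseen classes for GZSL."""
    a = torch.stack([sem_feats[c] for c in classes for _ in range(n_per_class)])
    y = torch.tensor([c for c in classes for _ in range(n_per_class)])
    with torch.no_grad():                       # synthesize the training set once
        x = G(a, torch.randn(a.size(0), z_dim))
    W = torch.nn.Linear(x.size(1), int(y.max()) + 1)
    opt = torch.optim.Adam(W.parameters(), lr=lr)
    for _ in range(epochs):                     # full-batch negative log-likelihood
        loss = F.cross_entropy(W(x), y)
        opt.zero_grad()
        loss.backward()
        opt.step()
    return W

def predict(W, x_test):
    """Eq. (9): choose the class with the highest softmax score."""
    return W(x_test).argmax(dim=1)
```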

4 Experiments

In this section, we first introduce the datasets and evaluation criteria used in the experiments, then discuss the experimental set-up, and finally present the results of our approach, comparing them with the state of the art.

4.1 Datasets

We evaluate the proposed method on the following ZSL/GZSL benchmark datasets, using the experimental setup of [2]: CUB-200-2011 [1, 12], FLO [13], SUN [2], and AWA [2, 30], where CUB, FLO and SUN are fine-grained datasets and AWA is coarse-grained. Table 1 shows basic information about these datasets in terms of the number of seen and unseen classes and the number of training and testing images. For CUB-200-2011 [1, 12] and Oxford-Flowers [13], the semantic feature has 1024 dimensions produced by the character-based CNN-RNN [31], which encodes the textual description of an image containing fine-grained visual descriptions (10 sentences per image). The sentences from the unseen classes are not used for training the CNN-RNN, and the per-class sentence representation is obtained by averaging the CNN-RNN semantic features that belong to the same class. For the FLO dataset [13], we used the same type of 1024-dimensional semantic feature [31] as for CUB (see the description above). For the SUN dataset [2], the semantic features have 102 dimensions. Following the protocol from Xian et al. [2], visual features are represented by the activations of the 2048-dim top-layer pooling units of ResNet-101 [26], obtained from the entire image. For AWA [2, 30], we use a semantic feature containing 85 dimensions denoting per-class attributes. In addition, we also test our approach on ImageNet [15], using a split containing 100 classes for testing [32].

The input images do not undergo any pre-processing (cropping, background subtraction, etc.), and we do not use any type of data augmentation. The ResNet-101 is pre-trained on ImageNet with 1K classes [15] and is not fine-tuned. For the synthetic visual representations, we generate 2048-dim CNN features using one of the feature generation models presented in Sect. 3.3.

For CUB, FLO, SUN, and AWA we use the zero-shot splits proposed by Xian et al. [2], making sure that none of the test classes are present in the ImageNet [15] classes used for pre-training. In contrast to these datasets (i.e., CUB, FLO, SUN, AWA), we observed a lack of a standardized experimental setup for GZSL on ImageNet. Recent papers have used ImageNet for GZSL with several splits (e.g., 2-hop, 3-hop), but we noticed that some of the supposedly unseen classes can actually be seen during training (e.g., in the 2-hop split, the class American mink is assumed to be unseen while the class Mink is seen, but these two classes are arguably the same). Nevertheless, in order to demonstrate the competitiveness of our proposed cycle-WGAN, we compare it to the baseline using 100 carefully selected unseen classes [32] (i.e., with no overlap with the 1K training seen classes) from ImageNet.

Table 1. Information about the datasets CUB [12], FLO [13], SUN [33], AWA [2], and ImageNet [15]. Column (1) shows the number of seen classes, denoted by \(|\mathcal {Y}_S|\), split into the number of training and validation classes (train + val), (2) presents the number of unseen classes \(| \mathcal {Y}_U |\), (3) displays the number of samples available for training \(|\mathcal {D}^{Tr}|\) and (4) shows number of testing samples that belong to the unseen classes \(|\mathcal {D}_U^{Te}|\) and number of testing samples that belong to the seen classes \(|\mathcal {D}_S^{Te}|\).

4.2 Evaluation Protocol

We follow the evaluation protocol proposed by Xian et al. [2], where results are based on average per-class top-1 accuracy. For the ZSL evaluation, top-1 accuracy results are computed with respect to the set of unseen classes \(\mathcal {Y}_U\), where the average accuracy is independently computed for each class, which is then averaged over all unseen classes. For the GZSL evaluation, we compute the average per-class top-1 accuracy on seen classes \(\mathcal {Y}_S\), denoted by s, the average per-class top-1 accuracy on unseen classes \(\mathcal {Y}_U\), denoted by u, and their harmonic mean, i.e. \(H = 2 \times (s \times u)/(s + u)\).
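A small NumPy sketch of these metrics (the helper names are illustrative):

```python
import numpy as np

def per_class_top1(y_true, y_pred, classes):
    """Average per-class top-1 accuracy over the given set of classes."""
    accs = [np.mean(y_pred[y_true == c] == c)
            for c in classes if np.any(y_true == c)]
    return float(np.mean(accs))

def harmonic_mean(s, u):
    """GZSL harmonic mean H = 2 * s * u / (s + u)."""
    return 2.0 * s * u / (s + u) if (s + u) > 0 else 0.0
```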

4.3 Implementation Details

In this section, we explain the implementation details of the generator G(.), the discriminator D(.), the regressor R(.), and the weights used for the hyper-parameters in the loss functions in (2), (3), (7) and (8) - all these terms have been formally defined in Sect. 3 and depicted in Fig. 2. The generator consists of a multi-layer perceptron (MLP) with a single hidden layer containing 4096 nodes, where this hidden layer is activated by LeakyReLU [34], and the output layer, with 2048 nodes, has a ReLU activation [35]. The weights of G(.) are initialized with a truncated normal initialization with mean 0 and standard deviation 0.01 and the biases are initialized with 0. The discriminator D(.) is also an MLP consisting of a single hidden layer with 4096 nodes, which is activated by LeakyReLU, and the output layer has no activation. The initialization of D(.) is the same as for G(.). The regressor R(.) is a linear transform from the visual space \(\mathcal {X}\) to the semantic space \(\mathcal {A}\). Following [1], we set \(\lambda =10\) in (2), \(\beta = 0.01\) in (3) and \(\lambda _1 = \lambda _2 = 0.01\) in (7) and (8). We ran an empirical evaluation with the training set and noticed that when \(\lambda _1\) and \(\lambda _2\) share the same value, the training becomes stable, but a more systematic evaluation to assess the relative importance of these two hyper-parameters is still needed. Table 2 shows the learning rates for each model (denoted by \(lr_{\{ R(.), G(.), D(.) \}}\)), batch sizes (batch) and number of epochs (#ep) used for each dataset and model – the values for G(.) and D(.) have been estimated to reproduce the published results of our implementation of f-CLSWGAN (explained below), and the values for R(.) have been estimated by cross validation using the training and validation sets.
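The architectures described above can be written compactly as follows. This is a PyTorch sketch rather than the authors' TensorFlow code; the LeakyReLU slope and the omission of the truncated-normal initialization described in the text are simplifying assumptions.

```python
import torch
import torch.nn as nn

class Generator(nn.Module):
    """G(a, z): one 4096-unit LeakyReLU hidden layer, 2048-dim ReLU output (Sect. 4.3)."""
    def __init__(self, a_dim, z_dim, x_dim=2048, hidden=4096):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(a_dim + z_dim, hidden), nn.LeakyReLU(0.2),
                                 nn.Linear(hidden, x_dim), nn.ReLU())
    def forward(self, a, z):
        return self.net(torch.cat([a, z], dim=1))

class Discriminator(nn.Module):
    """D(x, a): one 4096-unit LeakyReLU hidden layer, single unactivated output score."""
    def __init__(self, x_dim, a_dim, hidden=4096):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(x_dim + a_dim, hidden), nn.LeakyReLU(0.2),
                                 nn.Linear(hidden, 1))
    def forward(self, x, a):
        return self.net(torch.cat([x, a], dim=1))

# The regressor R(.) is a single linear map from visual to semantic space.
def make_regressor(x_dim, a_dim):
    return nn.Linear(x_dim, a_dim)
```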

Regarding the number of visual representations generated to train the classifier, we performed a few experiments and reached conclusions similar to those in [1]. For all experiments in the paper, we generated 300 visual representations per class [1]. We reached this number after a study showing that for a small number of representations (below 100), the classification results were not competitive; for values around 200, results became competitive but unstable; and from 300 onwards, results were competitive and stable.

Table 2. Summary of cross-validated hyper-parameters in our experiments.
Table 3. Comparison between the reported results of f-CLSWGAN [1] and our implementation of it, labeled baseline, where we show the top-1 accuracy on the unseen test \(\mathcal {Y}_U\) (GZSL), the top-1 accuracy for seen test \(\mathcal {Y}_S\) (GZSL), the harmonic mean H (GZSL), and the top-1 accuracy for ZSL (\(T1_Z\)).

Since our approach is based on f-CLSWGAN [1], we re-implemented this methodology. In the experiments, the results from our implementation of f-CLSWGAN using a softmax classifier are labeled baseline. The results that we obtain from our baseline are very similar to the results reported in [1], as shown in Table 3. For ImageNet, note that we use a split [32] that is different from previous ones used in the literature, as explained in Sect. 4.1, so a direct comparison between f-CLSWGAN [1] and our baseline is not possible. Nevertheless, we show in Table 6 that the results we obtain for this split [32] are in fact similar to the results reported for f-CLSWGAN [1] on similar ImageNet splits. We developed our code (Footnote 1) and performed all experiments using TensorFlow [36].

5 Results

In this section we show the GZSL and ZSL results using our proposed models cycle-WGAN, cycle-(U)WGAN and cycle-CLSWGAN, the baseline model f-CLSWGAN, denoted by baseline, and several other methods previously used in the field for benchmarking [2]. Table 4 shows the GZSL results and Table 5 shows the ZSL results obtained from our proposed methods and several baseline approaches on the CUB, FLO, SUN and AWA datasets. Table 6 shows the top-1 accuracy on ImageNet for cycle-WGAN and baseline [1].

Table 4. GZSL results using per-class average top-1 accuracy on the test sets of unseen classes \(\mathcal {Y}_U\), seen classes \(\mathcal {Y}_S\), and the harmonic mean result H – all results shown in percentage. Results from previously proposed methods in the field extracted from [2].
Table 5. ZSL results using per-class average top-1 accuracy on the test set of unseen classes \(\mathcal {Y}_U\) – all results shown in percentage. Results from previously proposed methods in the field extracted from [2].
Table 6. ZSL and GZSL ImageNet results using per-class average top-1 accuracy on the test sets of unseen classes \(\mathcal {Y}_U\) – all results shown in percentage.

6 Discussion

Regarding the GZSL results in Table 4, we notice a clear trend of all of our proposed feature generation methods (cycle-WGAN, cycle-(U)WGAN, and cycle-CLSWGAN) performing better than baseline on the unseen test set. In particular, it seems advantageous to use the synthetic samples from unseen classes to train the cycle-(U)WGAN model, since it achieves the best top-1 accuracy results in 3 out of the 4 datasets, with improvements from 0.7% to more than 4%. In general, the top-1 accuracy improvement achieved by our approaches on the seen test set is less remarkable, which is expected given that we prioritize improving the results for the unseen classes. Nevertheless, our approaches achieve improvements from 0.4% to more than 2.5% for the seen classes. Finally, the harmonic mean results also show that our approaches improve over the baseline by between 1% and 2.2%. Notice that these results are remarkable considering the outstanding improvements already achieved by f-CLSWGAN [1], represented here by baseline. In fact, our proposed methods produce the current state-of-the-art GZSL results for these four datasets.

Analyzing the ZSL results in Table 5, we again notice that, similarly to the GZSL case, there is a clear advantage in using the synthetic samples from unseen classes to train the cycle-(U)WGAN model. For instance, top-1 accuracy results show that we can improve over the baseline from 0.9% to 3.5%. The results in this table show that our proposed approaches currently hold the best ZSL results for these datasets.

It is interesting to see that, compared to GZSL, the ZSL results from previous methods in the literature are far more competitive, achieving results that are relatively close to ours and to the baseline. This performance gap between ZSL and GZSL shown by previous methods reinforces the argument in favor of using generative models to synthesize images from seen and unseen classes to train GZSL models [1, 8, 9]. As argued throughout this paper, the performance produced by generative models can be improved further with methods that help the training of GANs, such as the cycle consistency loss [10].

In fact, the experiments clearly demonstrate the advantage of using our proposed multi-modal cycle consistency loss in training GANs for GZSL and ZSL. In particular, it is interesting to see that the use of synthetic examples of unseen classes generated by cycle-(U)WGAN to train the GZSL classifier provides remarkable improvements over the baseline, represented by f-CLSWGAN [1]. The only exception is the SUN dataset, where the best result is achieved by cycle-CLSWGAN. We believe that cycle-(U)WGAN is not the top performer on SUN due to the number of classes and the proportion of seen/unseen classes in this dataset. For CUB, FLO and AWA we notice that there is roughly an \((80\%,20\%)\) ratio between seen and unseen classes. In contrast, SUN has a \((91\%,9\%)\) ratio between seen and unseen classes. We also notice a sharp increase in the number of classes, from 50 to 717; GAN models tend not to work well with such a large number of classes. Given the wide variety of GZSL datasets available in the field, with different numbers of classes and seen/unseen proportions, we believe that there is still a lot of room for improvement for GZSL models.

Regarding the large-scale study on ImageNet, the results in Table 6 show that the top-1 accuracy classification results for baseline and cycle-WGAN are quite low (similarly to the results observed in [1] for several ImageNet splits), but our proposed approach still produces more accurate ZSL and GZSL classification results.

An important question about our approach is whether the regularization succeeds in mapping the generated visual representations back to the semantic space. In order to answer this question, we show in Fig. 3 the evolution of the reconstruction loss \(\ell _{REG}\) in (6) as a function of the number of epochs. In general, the reconstruction loss decreases steadily over training, showing that our model succeeds at such mapping. Another relevant question is whether our proposed methods take more or fewer epochs to converge, compared to the baseline. Figure 4 shows the classification accuracy of the generated training samples from the seen classes for the proposed models cycle-WGAN and cycle-CLSWGAN, and also for the baseline (note that cycle-(U)WGAN is a fine-tuned model from cycle-WGAN, so their loss functions are in fact identical for the seen classes shown in the graph). For three out of four datasets, our proposed cycle-WGAN converges faster. However, when \(\ell _{CLS}\) is included in (7) to form the loss in (8) (transforming cycle-WGAN into cycle-CLSWGAN), the convergence of cycle-CLSWGAN is comparable to that of the baseline. Hence, cycle-WGAN tends to converge faster than both the baseline and cycle-CLSWGAN.

Fig. 3.

Evolution of \(\ell _{REG}\) in terms of the number of epochs for CUB, FLO, SUN and AWA.

Fig. 4.

Convergence of the top-1 accuracy in terms of the number of epochs for the generated training samples from the seen classes for CUB, FLO, SUN and AWA.

7 Conclusions and Future Work

In this paper, we propose a new method to regularize the training of GANs in GZSL models. The main argument explored in the paper is that the use of GANs to generate seen and unseen synthetic examples for training GZSL models has shown clear advantages over previous approaches. However, the unconstrained nature of the generation of samples from unseen classes can produce models that may not work robustly for some unseen classes. Therefore, by constraining the generation of samples from unseen classes, we aim to improve the GZSL classification accuracy. Our proposed constraint is motivated by the cycle consistency loss [10]: we enforce that the generated visual representations map back to their original semantic features, which constitutes the multi-modal cycle consistency loss. Experiments show that the use of such a loss is clearly advantageous, providing improvements over the current state of the art, f-CLSWGAN [1], in terms of both GZSL and ZSL.

As noted in Sect. 6, GAN-based GZSL approaches offer an indisputable advantage over previously proposed methods. However, the reliance on GANs to generate samples from unseen classes is challenging because GANs are notoriously difficult to train, particularly in unconstrained and large-scale problems. Therefore, future work in this field should focus on these problems. In this paper, we provide a solution that addresses the unconstrained problem, but it is clear that other regularization approaches could also be used. In addition, the use of GANs in large-scale problems (regarding the number of classes) should also be studied more intensively, particularly when dealing with real-life datasets and scenarios. Therefore, we will focus our future research on solving these two issues in GZSL.