Abstract
In generalized zero-shot learning (GZSL), the set of classes is split into seen and unseen classes, where training relies on the semantic features of the seen and unseen classes and the visual representations of only the seen classes, while testing uses the visual representations of the seen and unseen classes. Current methods address GZSL by learning a transformation from the visual to the semantic space, exploiting the assumption that the distribution of classes in the semantic and visual spaces is relatively similar. Such methods tend to transform unseen testing visual representations into one of the seen classes' semantic features instead of the semantic features of the correct unseen class, resulting in low GZSL classification accuracy. Recently, generative adversarial networks (GAN) have been explored to synthesize visual representations of the unseen classes from their semantic features - the synthesized representations of the seen and unseen classes are then used to train the GZSL classifier. This approach has been shown to boost GZSL classification accuracy, but one important constraint is missing: there is no guarantee that the synthetic visual representations can generate back their semantic features in a multi-modal cycle-consistent manner. Without this constraint, the synthetic visual representations may not represent their semantic features well, which means that enforcing it can improve GAN-based approaches. In this paper, we propose to enforce such a constraint with a new regularization for the GAN training that forces the generated visual features to reconstruct their original semantic features. Once our model is trained with this multi-modal cycle-consistent semantic compatibility, we can then synthesize more representative visual representations for the seen and, more importantly, for the unseen classes.
Our proposed approach achieves the best GZSL classification results in the field on several publicly available datasets.
All authors gratefully acknowledge the support of the Australian Research Council through the Centre of Excellence for Robotic Vision (project number CE140100016), Laureate Fellowship FL130100102 to IR and Discovery Project DP180103232 to GC.
1 Introduction
Generalized Zero-Shot Learning (GZSL) separates the classes of interest into a subset of seen classes and another subset of unseen classes. The training process uses the semantic features of both subsets and the visual representations of only the seen classes, while the testing process aims to classify the visual representations of both subsets [2, 3]. The semantic features available for both the training and testing classes are typically acquired from other domains, such as visual features [4], text [3, 5, 6], or learned classifiers [7]. The traditional approach to address this challenge [2] involves learning a transformation from the visual to the semantic space of the seen classes. Testing is then performed by transforming the visual representation of the seen and unseen classes into this semantic space, where classification is typically achieved with a nearest neighbor classifier that selects the closest class in the semantic space. In contrast to Zero-Shot Learning (ZSL), which uses only the unseen classes for testing, GZSL tests on both subsets, and GZSL approaches tend to be biased towards the seen classes, producing poor classification results, particularly for the unseen testing classes [1].
These traditional approaches rely on the assumption that the distributions observed in the semantic and visual spaces are relatively similar. Recently, this assumption has been relaxed to allow the semantic space to be optimized together with the transformation from the visual to the semantic space [8] - this alleviates the classification bias mentioned above to a certain degree. More recent approaches consist of building a generative adversarial network (GAN) that synthesizes visual representations of the seen and unseen classes directly from their semantic representation [8, 9]. These synthesized features are then used to train a multi-class classifier of seen and unseen classes. This approach has been shown to improve the GZSL classification accuracy, but an obvious weakness is that the unconstrained nature of the generation process may let the approach generate unrepresentative synthetic visual representations, particularly of the unseen classes (i.e., representations that are far from possible visual representations of the test classes).
The main contribution of this paper is a new regularization of the generation of synthetic visual representations in the training of GAN-based methods that address the GZSL classification problem. This regularization is based on a multi-modal cycle consistency loss term that enforces good reconstruction from the synthetic visual representations back to their original semantic features (see Fig. 1). This regularization is motivated by the cycle consistency loss applied in training GANs [10] that forces the generative training approach to produce more constrained visual representations. We argue that this constraint preserves the semantic compatibility between visual features and semantic features. Once our model is trained with this multi-modal cycle consistency loss term, we can then synthesize visual representations for unseen classes in order to train a GZSL classifier [1, 11].
Using the experimental setup described by Xian et al. [1], we show that our proposed regularization provides significant improvements not only in terms of GZSL classification accuracy, but also ZSL, on the following datasets: Caltech-UCSD-Birds 200-2011 (CUB) [2, 12], Oxford-Flowers (FLO) [13], Scene Categorization Benchmark (SUN) [2, 14], Animals with Attributes (AWA) [2, 4], and ImageNet [15]. In fact, the experiments show that our proposed approach holds the current best ZSL and GZSL classification results in the field for these datasets.
2 Literature Review
The starting point for our literature review is the work by Xian et al. [1, 2], who proposed new benchmarks using commonly accepted evaluation protocols on publicly available datasets. These benchmarks allow a fair comparison among recently proposed ZSL and GZSL approaches, and for this reason we use them to compare our results with those obtained by the current state of the art in the field. We provide a general summary of the methods presented in [2], and encourage the reader to study that paper for more details on previous work. The majority of ZSL and GZSL methods tend to compensate for the lack of visual representations of the unseen classes by learning a mapping between the visual and semantic spaces [16, 17]. For instance, a fairly successful approach is based on a bi-linear compatibility function that associates visual representations with semantic features. Examples of such approaches are ALE [18], DEVISE [19], SJE [20], ESZSL [21], and SAE [22]. Despite their simplicity, these methods tend to produce the current state-of-the-art results on benchmark datasets [2]. A straightforward extension of the methods above is the exploration of a non-linear compatibility function between the visual and semantic spaces. These approaches, exemplified by LATEM [23] and CMT [6], tend not to be as competitive as their bi-linear counterparts, probably because the more complex models need larger training sets to generalize effectively. Seminal ZSL and GZSL methods were based on models that learn intermediate feature classifiers, which are combined to predict image classes (e.g., DAP and IAP) [4] – these models tend to present relatively poor classification results. Finally, hybrid models, such as SSE [3], CONSE [24], and SYNC [25], rely on a mixture model of seen classes to represent images and semantic embeddings. These methods tend to be competitive for classifying the seen classes, but not the unseen classes.
The main disadvantage of the methods above is that the lack of visual training data for the unseen classes biases the mapping between the visual and semantic spaces towards the semantic features of the seen classes, particularly for unseen test images. This is an issue for GZSL because it has a negative effect on the classification accuracy of the unseen classes. Recent research addresses this issue using GAN models that are trained to synthesize visual representations for the seen and unseen classes, which can then be used to train a classifier for both sets of classes [8, 9]. However, the unconstrained generation of synthetic visual representations for the unseen classes allows the production of synthetic samples that may be too far from the actual distribution of visual representations, particularly for the unseen classes. In the GAN literature, this problem is known as unpaired training [10], where not all source samples (e.g., semantic features) have corresponding target samples (e.g., visual features) for training. This creates a highly unconstrained optimization problem that Zhu et al. [10] addressed with a cycle consistency loss that pushes the representation from the target domain back to the source domain, constraining the optimization problem. In this paper, we explore this idea for GZSL, which is a novelty compared to previous GAN-based methods proposed for GZSL and ZSL.
3 Multi-modal Cycle-Consistent Generalized Zero-Shot Learning
In GZSL and ZSL [2], the dataset is denoted by \(\mathcal {D} = \{(\mathbf x ,\mathbf {a},y)_i\}_{i=1}^{|\mathcal {D}|}\) with \(\mathbf x \in \mathcal {X} \subseteq \mathbb {R}^K\) representing visual representation (e.g., image features from deep residual nets [26]), \(\mathbf {a} \in \mathcal {A} \subseteq \mathbb R^L\) denoting L-dimensional semantic feature (e.g., set of binary attributes [4] or a dense word2vec representation [27]), \(y \in \mathcal {Y} = \{ 1,..., C \}\) denoting the image class, and |.| representing set cardinality. The set \(\mathcal {Y}\) is split into seen and unseen subsets, where the seen subset is denoted by \(\mathcal {Y}_S\) and the unseen subset by \(\mathcal {Y}_U\), with \(\mathcal {Y} = \mathcal {Y}_S \cup \mathcal {Y}_U\) and \(\mathcal {Y}_S \cap \mathcal {Y}_U = \emptyset \). The dataset \(\mathcal {D}\) is also divided into mutually exclusive training and testing subsets: \(\mathcal {D}^{Tr}\) and \(\mathcal {D}^{Te}\), respectively. Furthermore, the training and testing sets can also be divided in terms of the seen and unseen classes, so this means that \(\mathcal {D}^{Tr}_S\) denotes the training samples of the seen classes, while \(\mathcal {D}^{Tr}_U\) represents the training samples of the unseen classes (similarly for \(\mathcal {D}^{Te}_S\) and \(\mathcal {D}^{Te}_U\) for the testing set). During training, samples in \(\mathcal {D}_S^{Tr}\) contain the visual representation \(\mathbf {x}_i\), semantic feature \(\mathbf {a}_i\) and class label \(y_i\); while the samples in \(\mathcal {D}_U^{Tr}\) comprise only the semantic feature and class label. During ZSL testing, only the samples from \(\mathcal {D}_U^{Te}\) are used; while in GZSL testing, all samples from \(\mathcal {D}^{Te}\) are used. Note that for ZSL and GZSL problems, only the visual representation of the testing samples is used to predict the class label.
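The bookkeeping implied by this notation can be sketched in a few lines. The snippet below uses hypothetical toy class labels and dimensions (not any of the benchmark datasets) to make the split constraints explicit: semantic features exist for every class, but training visual features exist only for the seen classes.

```python
import numpy as np

# Toy GZSL dataset bookkeeping (all class labels and sizes are hypothetical).
rng = np.random.default_rng(0)

K, L = 8, 4                   # visual (K) and semantic (L) feature dimensions
seen_classes = {0, 1, 2}      # Y_S
unseen_classes = {3, 4}       # Y_U

# Disjoint split: Y = Y_S ∪ Y_U and Y_S ∩ Y_U = ∅
assert seen_classes & unseen_classes == set()

# Per-class semantic features a ∈ A are known for ALL classes...
semantic = {y: rng.normal(size=L) for y in seen_classes | unseen_classes}

# ...but visual representations x ∈ X are available only for seen classes
# during training (D_S^{Tr}); D_U^{Tr} carries only (a, y) pairs.
train_visual = {y: rng.normal(size=(10, K)) for y in seen_classes}

assert set(train_visual) == seen_classes
assert set(semantic) == seen_classes | unseen_classes
```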
Below, we first explain the f-CLSWGAN model [1], which is the baseline for the implementation of the main contribution of this paper: the multi-modal cycle consistency loss used in the training for the feature generator in GZSL models based on GANs. The loss, feature generator, learning and testing procedures are explained subsequently.
3.1 f-CLSWGAN
Our approach is an extension of the feature generation method proposed by Xian et al. [1], which consists of a classification regularized generative adversarial network (f-CLSWGAN). This network is composed of a generative model \(G:\mathcal {A} \times \mathcal {Z} \rightarrow \mathcal {X}\) (parameterized by \(\theta _G\)) that produces a visual representation \(\widetilde{\mathbf {x}}\) given its semantic feature \(\mathbf {a}\) and a noise vector \(\mathbf {z} \sim \mathcal {N}(\mathbf {0},\mathbf {I})\) sampled from a multi-dimensional centered Gaussian, and a discriminative model \(D:\mathcal {X} \times \mathcal {A} \rightarrow [0,1]\) (parameterized by \(\theta _D\)) that tries to distinguish whether the input \(\mathbf {x}\) and its semantic representation \(\mathbf {a}\) constitute a true or a generated pair of visual representation and semantic feature. Note that while the method developed by Yan et al. [28] concerns the generation of realistic images, our proposed approach, similarly to [1, 8, 9], aims to generate visual representations, such as the features from a deep residual network [26] - the strategy based on visual representations has been shown to produce more accurate GZSL classification results compared to the use of realistic images. The training algorithm for estimating \(\theta _G\) and \(\theta _D\) follows a minimax game, where G(.) generates synthetic visual representations that are supposed to fool the discriminator, which in turn tries to distinguish the real from the synthetic visual representations. We rely on one of the most stable training methods for GANs, called Wasserstein GAN, which uses the following loss function [29]:
\[ \theta _G^*, \theta _D^* = \arg \min _{\theta _G} \max _{\theta _D} \ell _{WGAN}(\theta _G,\theta _D), \quad (1) \]
with
\[ \ell _{WGAN}(\theta _G,\theta _D) = \mathbb E_{(\mathbf {x},\mathbf {a}) \sim \mathbb P_S^{x,a}}[D(\mathbf {x},\mathbf {a};\theta _D)] - \mathbb E_{(\widetilde{\mathbf {x}},\mathbf {a}) \sim \mathbb P_G^{x,a}}[D(\widetilde{\mathbf {x}},\mathbf {a};\theta _D)] - \lambda \mathbb E_{(\hat{\mathbf {x}},\mathbf {a}) \sim \mathbb P_{\alpha }^{x,a}}\left[ \left( \Vert \nabla _{\hat{\mathbf {x}}} D(\hat{\mathbf {x}},\mathbf {a};\theta _D) \Vert _2 - 1 \right) ^2 \right] , \quad (2) \]
where \(\mathbb E[.]\) represents the expected value operator, \(\mathbb P_S^{x,a}\) is the joint distribution of visual and semantic features from the seen classes (in practice, samples from that distribution are the ones in \(\mathcal {D}_S^{Tr}\)), \(\mathbb P^{x,a}_G\) represents the joint distribution of semantic features and the visual features produced by the generative model G(.), \(\lambda \) denotes the penalty coefficient, and \(\mathbb P^{x,a}_{\alpha }\) is the joint distribution of the semantic features and the visual features produced by \(\hat{\mathbf {x}} \sim \alpha \mathbf {x} + (1-\alpha )\widetilde{\mathbf {x}}\) with \(\alpha \sim \mathcal {U}(0,1)\) (i.e., uniform distribution).
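As a concrete sketch of the WGAN loss \(\ell _{WGAN}\) in (2), the toy example below evaluates its three terms on random data. It uses a hypothetical *linear* critic \(D(\mathbf {x},\mathbf {a}) = \mathbf {w}_x \cdot \mathbf {x} + \mathbf {w}_a \cdot \mathbf {a}\) so that the gradient penalty has a closed form (\(\nabla _{\hat{\mathbf {x}}} D = \mathbf {w}_x\) for every interpolated sample); all dimensions and the critic itself are illustrative assumptions, not the paper's architecture.

```python
import numpy as np

rng = np.random.default_rng(0)
n, K, L, lam = 64, 16, 8, 10.0   # batch size, visual dim, semantic dim, penalty coefficient

# Hypothetical linear critic D(x, a) = w_x·x + w_a·a
w_x, w_a = rng.normal(size=K), rng.normal(size=L)
D = lambda x, a: x @ w_x + a @ w_a

a = rng.normal(size=(n, L))           # semantic features (shared by real and fake pairs)
x_real = rng.normal(size=(n, K))      # real visual features, samples of P_S^{x,a}
x_fake = rng.normal(size=(n, K)) + 1  # generated visual features, samples of P_G^{x,a}

# x̂ = αx + (1−α)x̃ with α ~ U(0,1), as in the definition of P_α^{x,a}
alpha = rng.uniform(size=(n, 1))
x_hat = alpha * x_real + (1 - alpha) * x_fake

# For a linear critic, ||∇_x̂ D(x̂, a)||_2 = ||w_x||_2 for every sample.
grad_norm = np.linalg.norm(w_x)
penalty = lam * np.mean((grad_norm - 1.0) ** 2)

l_wgan = D(x_real, a).mean() - D(x_fake, a).mean() - penalty
print(l_wgan)
```

In practice the critic is a neural network and the gradient penalty is computed by automatic differentiation; the closed form here only serves to make the three terms of (2) explicit.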
Finally, the f-CLSWGAN is trained with the following objective function:
\[ \min _{\theta _G,\theta _C} \max _{\theta _D} \ell _{WGAN}(\theta _G,\theta _D) + \beta \, \ell _{CLS}(\theta _C,\theta _G), \quad (3) \]
where \(\ell _{CLS}(\theta _C,\theta _G) = -\mathbb E_{(\widetilde{\mathbf {x}},y) \sim \mathbb P^{x,y}_G}[\log P(y | \widetilde{\mathbf {x}}, \theta _C)]\), with
\[ P(y | \widetilde{\mathbf {x}}, \theta _C) = \frac{\exp \left( (\theta _C^{y})^{\top } \widetilde{\mathbf {x}} \right) }{\sum _{c \in \mathcal {Y}} \exp \left( (\theta _C^{c})^{\top } \widetilde{\mathbf {x}} \right) }, \quad (4) \]
representing the probability that the sample \(\widetilde{\mathbf {x}}\) has been predicted with its true label y, and \(\beta \) is a hyper-parameter that weights the contribution of the loss function. This regularization with the classification loss was found by Xian et al. [1] to enforce G(.) to generate discriminative visual representations. The model obtained from the optimization in (3) is referred to as baseline in the experiments.
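The classification regularizer \(\ell _{CLS}\) is an ordinary softmax negative log-likelihood evaluated on generated features. A minimal sketch (toy dimensions, random \(\theta _C\) and features, all hypothetical):

```python
import numpy as np

rng = np.random.default_rng(0)
n, K, C = 32, 16, 5               # batch of generated features, visual dim, classes

theta_C = rng.normal(size=(C, K))           # classifier weights, one row per class
x_fake = rng.normal(size=(n, K))            # generated visual features x̃ ~ P_G
y = rng.integers(0, C, size=n)              # their class labels

# Softmax P(y | x̃, θ_C) as in (4), computed in a numerically stable way.
logits = x_fake @ theta_C.T
logits -= logits.max(axis=1, keepdims=True)
probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)

# ℓ_CLS: negative log-likelihood of the true labels under the softmax.
l_cls = -np.mean(np.log(probs[np.arange(n), y]))
print(l_cls)
```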
3.2 Multi-modal Cycle Consistency Loss
The main issue present in previously proposed GZSL approaches based on generative models [1, 8, 9] is that the unconstrained nature of the generation process (from semantic to visual features) may produce image representations that are too far from the real distribution present in the training set, resulting in an ineffective multi-class classifier training, particularly for the unseen classes. The approach we propose to alleviate this problem consists of constraining the synthetic visual representations to generate back their original semantic features - this regularization is inspired by the cycle consistency loss [10]. Figure 2 shows an overview of our proposal. This approach, representing the main contribution of this paper, is expressed by the following loss:
\[ \ell _{CYC}(\theta _G) = \mathbb E_{\mathbf {a} \sim \mathbb P_S^{a}, \mathbf {z} \sim \mathcal {N}(\mathbf {0},\mathbf {I})}\left[ \Vert \mathbf {a} - R(G(\mathbf {a},\mathbf {z};\theta _G)) \Vert _2^2 \right] + \mathbb E_{\mathbf {a} \sim \mathbb P_U^{a}, \mathbf {z} \sim \mathcal {N}(\mathbf {0},\mathbf {I})}\left[ \Vert \mathbf {a} - R(G(\mathbf {a},\mathbf {z};\theta _G)) \Vert _2^2 \right] , \quad (5) \]
where \(\mathbb P_S^a\) and \(\mathbb P_U^a\) denote the distributions of semantic features of the seen and unseen classes, respectively, and \(R:\mathcal {X} \rightarrow \mathcal {A}\) represents a regressor that estimates the original semantic features from the visual representation generated by G(.).
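The cycle term simply measures how well the regressor recovers the semantic feature from the generated visual feature. The sketch below evaluates it with hypothetical linear maps standing in for G(.) and R(.) (toy dimensions, untrained random weights):

```python
import numpy as np

rng = np.random.default_rng(0)
K, L, n = 16, 8, 64       # visual dim, semantic dim, batch size (hypothetical)

W_g = 0.1 * rng.normal(size=(2 * L, K))   # G: concat(a, z) -> visual space (z has L dims here)
W_r = 0.1 * rng.normal(size=(K, L))       # R: visual space -> semantic space

def G(a, z):
    return np.concatenate([a, z], axis=1) @ W_g

def R(x):
    return x @ W_r

def cyc_term(a):
    z = rng.normal(size=a.shape)          # z ~ N(0, I)
    # E[ ||a - R(G(a, z))||_2^2 ], estimated over the batch
    return np.mean(np.sum((a - R(G(a, z))) ** 2, axis=1))

a_seen = rng.normal(size=(n, L))          # a ~ P_S^a
a_unseen = rng.normal(size=(n, L))        # a ~ P_U^a
l_cyc = cyc_term(a_seen) + cyc_term(a_unseen)   # both terms of the cycle loss
print(l_cyc)
```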
3.3 Feature Generation
Using the losses proposed in Sects. 3.1 and 3.2, we can build several feature generators. First, we pre-train the regressor R(.) in (6) by minimizing a loss function computed only from the seen classes:
\[ \ell _{REG}(\theta _R) = \mathbb E_{(\mathbf {a},\mathbf {x}) \sim \mathbb P_S^{a,x}}\left[ \Vert \mathbf {a} - R(\mathbf {x};\theta _R) \Vert _2^2 \right] , \quad (6) \]
where \(\mathbb P_S^{a,x}\) represents the real joint distribution of image and semantic features present in the seen classes. In practice, this regressor is defined by a multi-layer perceptron, whose output activation function depends on the format of the semantic vector.
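In the special case of a linear regressor with a squared loss (the form used in our implementation, Sect. 4.3), minimizing \(\ell _{REG}\) reduces to ordinary least squares, which has a closed-form solution. A toy sketch on synthetic seen-class pairs (all sizes and the data-generating process are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(0)
n, K, L = 200, 16, 8     # samples, visual dim, semantic dim (toy sizes)

# Synthetic seen-class data: a is (nearly) a linear function of x.
theta_true = rng.normal(size=(K, L))
x = rng.normal(size=(n, K))                          # seen-class visual features
a = x @ theta_true + 0.01 * rng.normal(size=(n, L))  # their semantic features, plus noise

# argmin_θ ||a - xθ||² — ordinary least squares for the linear R(x; θ_R) = xθ_R.
theta_R, *_ = np.linalg.lstsq(x, a, rcond=None)

# Empirical ℓ_REG on the training pairs.
l_reg = np.mean(np.sum((a - x @ theta_R) ** 2, axis=1))
print(l_reg)
```

With a non-linear (MLP) regressor, the same loss would instead be minimized by gradient descent.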
Our first strategy to build a feature generator consists of pre-training a regressor (using samples from seen classes) optimized by minimizing \(\ell _{REG}\) in (6), which produces \(\theta _R^*\), and training the generator and discriminator of the WGAN using the following optimization function:
\[ \theta _G^*, \theta _D^* = \arg \min _{\theta _G} \max _{\theta _D} \ell _{WGAN}(\theta _G,\theta _D) + \lambda _1 \ell _{CYC}(\theta _G), \quad (7) \]
where \(\ell _{WGAN}\) is defined in (2), \(\ell _{CYC}\) is defined in (5), and \(\lambda _1\) weights the importance of the second optimization term. The optimization in (7) can use both the seen and unseen classes, or it can rely on only the seen classes, in which case the loss \(\ell _{CYC}\) in (5) has to be modified so that its second term (which depends on unseen classes) is left out of the optimization. The feature generator model in (7) trained with seen and unseen classes is referred to as cycle-(U)WGAN, while the feature generator trained with only seen classes is labeled cycle-WGAN.
The second strategy explored in this paper to build a feature generator involves pre-training the regressor in (6) using samples from seen classes to produce \(\theta _R^*\), and pre-training a softmax classifier for the seen classes using \(\ell _{CLS}\), defined in (3), which results in \(\theta _C^*\). Then we train the generator and discriminator with the combined loss:
\[ \theta _G^*, \theta _D^* = \arg \min _{\theta _G} \max _{\theta _D} \ell _{WGAN}(\theta _G,\theta _D) + \lambda _1 \ell _{CYC}(\theta _G) + \lambda _2 \ell _{CLS}(\theta _C^*,\theta _G). \quad (8) \]
The feature generator model in (8) trained with seen classes is referred to as cycle-CLSWGAN.
3.4 Learning and Testing
As shown in [1] the training of a classifier using a potentially unlimited number of samples from the seen and unseen classes generated with \(\mathbf {x} \sim G(\mathbf {a},\mathbf {z};\theta _G^*)\) produces more accurate classification results compared with multi-modal embedding models [18,19,20,21]. Therefore, we train a final softmax classifier \(P(y|\mathbf {x},\theta _C)\), defined in (4), using the generated visual features by minimizing the negative log likelihood loss \(\ell _{CLS}(\theta _C,\theta ^*_G)\), as defined in (3), where \(\theta _G^*\) has been learned from one of the feature learning strategies discussed in Sect. 3.3 - the training of the classifier produces \(\theta ^*_C\). The samples used for training the classifier are generated based on the task to be solved. For instance, for ZSL, we only use generated visual representations from the set of unseen classes; while for GZSL, we use the generated samples from seen and unseen classes.
Finally, the testing is based on the prediction of a class for an input test visual representation \(\mathbf {x}\), as follows:
\[ y^* = \arg \max _{\widetilde{y} \in \widetilde{\mathcal {Y}}} P(\widetilde{y} | \mathbf {x}, \theta _C^*), \quad (9) \]
where \(\widetilde{\mathcal {Y}} = \mathcal {Y}\) for GZSL or \(\widetilde{\mathcal {Y}} = \mathcal {Y}_U\) for ZSL.
4 Experiments
In this section, we first introduce the datasets and evaluation criteria used in the experiments, then we discuss the experimental set-up and finally show the results of our approach, comparing with the state-of-the-art results.
4.1 Datasets
We evaluate the proposed method on the following ZSL/GZSL benchmark datasets, using the experimental setup of [2], namely: CUB-200-2011 [1, 12], FLO [13], SUN [2], and AWA [2, 30] – where CUB, FLO and SUN are fine-grained datasets, and AWA is coarse-grained. Table 1 shows basic information about these datasets in terms of the number of seen and unseen classes and the number of training and testing images. For CUB-200-2011 [1, 12] and Oxford-Flowers [13], the semantic feature has 1024 dimensions produced by the character-based CNN-RNN [31] that encodes the textual description of an image containing fine-grained visual descriptions (10 sentences per image). The sentences from the unseen classes are not used for training the CNN-RNN, and the per-class sentence is obtained by averaging the CNN-RNN semantic features that belong to the same class. For the FLO dataset [13], we used the same type of 1024-dimensional semantic feature [31] as for CUB (see description above). For the SUN dataset [2], the semantic features have 102 dimensions. Following the protocol from Xian et al. [2], visual features are represented by the activations of the 2048-dim top-layer pooling units of ResNet-101 [26], obtained from the entire image. For AWA [2, 30], we use a semantic feature containing 85 dimensions denoting per-class attributes. In addition, we also test our approach on ImageNet [15], using a split containing 100 classes for testing [32].
The input images do not undergo any pre-processing (cropping, background subtraction, etc.), and we do not use any type of data augmentation. The ResNet-101 is pre-trained on ImageNet with 1K classes [15] and is not fine-tuned. For the synthetic visual representations, we generate 2048-dim CNN features using one of the feature generation models presented in Sect. 3.3.
For CUB, FLO, SUN, and AWA we use the zero-shot splits proposed by Xian et al. [2], making sure that none of the unseen test classes are present in ImageNet [15]. Differently from these datasets (i.e., CUB, FLO, SUN, AWA), we observed a lack of standardized experimental setup for GZSL on ImageNet. Recently, papers have used ImageNet for GZSL with several splits (e.g., 2-hop, 3-hop), but we noticed that some of the supposedly unseen classes can actually be seen during training (e.g., in the 2-hop split, the class American mink is assumed to be unseen, while the class Mink is seen, but these two classes are arguably the same). Nevertheless, in order to demonstrate the competitiveness of our proposed cycle-WGAN, we compare it to the baseline using a carefully selected set of 100 unseen classes [32] (i.e., with no overlap with the 1K training seen classes) from ImageNet.
4.2 Evaluation Protocol
We follow the evaluation protocol proposed by Xian et al. [2], where results are based on average per-class top-1 accuracy. For the ZSL evaluation, top-1 accuracy results are computed with respect to the set of unseen classes \(\mathcal {Y}_U\), where the average accuracy is independently computed for each class, which is then averaged over all unseen classes. For the GZSL evaluation, we compute the average per-class top-1 accuracy on seen classes \(\mathcal {Y}_S\), denoted by s, the average per-class top-1 accuracy on unseen classes \(\mathcal {Y}_U\), denoted by u, and their harmonic mean, i.e. \(H = 2 \times (s \times u)/(s + u)\).
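This protocol can be computed in a few lines. The sketch below evaluates average per-class top-1 accuracy and the harmonic mean H on hypothetical toy predictions (two seen and two unseen classes, made up for illustration):

```python
import numpy as np

def per_class_top1(y_true, y_pred, classes):
    # Accuracy is computed independently for each class, then averaged,
    # so large classes do not dominate the metric.
    accs = [np.mean(y_pred[y_true == c] == c) for c in classes]
    return float(np.mean(accs))

def harmonic_mean(s, u):
    # H = 2su / (s + u): high only when BOTH seen and unseen accuracies are high.
    return 2 * s * u / (s + u)

# Toy ground truth and predictions (hypothetical).
y_true = np.array([0, 0, 1, 1, 2, 2, 3, 3])
y_pred = np.array([0, 0, 1, 0, 2, 3, 3, 3])

s = per_class_top1(y_true, y_pred, classes=[0, 1])   # seen:   (1.0 + 0.5) / 2
u = per_class_top1(y_true, y_pred, classes=[2, 3])   # unseen: (0.5 + 1.0) / 2
print(s, u, harmonic_mean(s, u))                     # 0.75 0.75 0.75
```

The harmonic mean is preferred over the arithmetic mean here because a model that ignores the unseen classes (u near 0) is punished heavily regardless of how high s is.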
4.3 Implementation Details
In this section, we explain the implementation details of the generator G(.), the discriminator D(.), the regressor R(.), and the weights used for the hyper-parameters in the loss functions in (2), (3), (7) and (8) - all these terms have been formally defined in Sect. 3 and depicted in Fig. 2. The generator consists of a multi-layer perceptron (MLP) with a single hidden layer containing 4096 nodes, where this hidden layer is activated by LeakyReLU [34], and the output layer, with 2048 nodes, has a ReLU activation [35]. The weights of G(.) are initialized with a truncated normal initialization with mean 0 and standard deviation 0.01 and the biases are initialized with 0. The discriminator D(.) is also an MLP consisting of a single hidden layer with 4096 nodes, which is activated by LeakyReLU, and the output layer has no activation. The initialization of D(.) is the same as for G(.). The regressor R(.) is a linear transform from the visual space \(\mathcal {X}\) to the semantic space \(\mathcal {A}\). Following [1], we set \(\lambda =10\) in (2), \(\beta = 0.01\) in (3) and \(\lambda _1 = \lambda _2 = 0.01\) in (7) and (8). We ran an empirical evaluation with the training set and noticed that when \(\lambda _1\) and \(\lambda _2\) share the same value, the training becomes stable, but a more systematic evaluation to assess the relative importance of these two hyper-parameters is still needed. Table 2 shows the learning rates for each model (denoted by \(lr_{\{ R(.), G(.), D(.) \}}\)), batch sizes (batch) and number of epochs (#ep) used for each dataset and model – the values for G(.) and D(.) have been estimated to reproduce the published results of our implementation of f-CLSWGAN (explained below), and the values for R(.) have been estimated by cross validation using the training and validation sets.
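The generator architecture described above can be sketched as a plain NumPy forward pass. This is a toy illustration only (no training): the LeakyReLU slope of 0.2 is an assumption not stated above, the truncated normal is approximated by clipping at two standard deviations, and the AWA-like input dimensions are chosen for concreteness.

```python
import numpy as np

rng = np.random.default_rng(0)
L, Z, H, K = 85, 85, 4096, 2048   # semantic, noise, hidden, visual dims (AWA-like; Z assumed)

# Weights: ~truncated normal (mean 0, std 0.01, clipped at 2 std); biases zero.
W1 = np.clip(rng.normal(0, 0.01, size=(L + Z, H)), -0.02, 0.02)
b1 = np.zeros(H)
W2 = np.clip(rng.normal(0, 0.01, size=(H, K)), -0.02, 0.02)
b2 = np.zeros(K)

leaky_relu = lambda v: np.where(v > 0, v, 0.2 * v)   # slope 0.2 is an assumption
relu = lambda v: np.maximum(v, 0.0)

def G(a, z):
    # Single 4096-unit hidden layer (LeakyReLU), 2048-unit ReLU output,
    # matching the description of the generator above.
    h = leaky_relu(np.concatenate([a, z], axis=1) @ W1 + b1)
    return relu(h @ W2 + b2)

a = rng.normal(size=(4, L))   # a batch of semantic features
z = rng.normal(size=(4, Z))   # z ~ N(0, I)
x_fake = G(a, z)
print(x_fake.shape)           # (4, 2048)
```

The discriminator mirrors this structure (4096-unit LeakyReLU hidden layer, linear output), and the regressor is a single linear map from \(\mathcal {X}\) to \(\mathcal {A}\).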
Regarding the number of visual representations generated to train the classifier, we performed a few experiments and reached conclusions similar to those in [1]. For all experiments in the paper, we generated 300 visual representations per class [1]. We reached this number after a study showing that for a small number of representations (below 100) the classification results were not competitive; for 200 or more, the results became competitive but unstable; and at 300 or above, the results were competitive and stable.
Since our approach is based on the f-CLSWGAN [1], we re-implemented this methodology. In the experiments, the results from our implementation of f-CLSWGAN using a softmax classifier are labeled as baseline. The results that we obtained from our baseline are very similar to those reported in [1], as shown in Table 3. For ImageNet, note that we use a split [32] that is different from previous ones used in the literature, as explained in Sect. 4.1, so a direct comparison between f-CLSWGAN [1] and our baseline is not possible. Nevertheless, we show in Table 6 that the results we obtain for this split [32] are in fact similar to the results reported for f-CLSWGAN [1] on similar ImageNet splits. We developed our own code (see footnote 1) and performed all experiments using TensorFlow [36].
5 Results
In this section we show the GZSL and ZSL results using our proposed models cycle-WGAN, cycle-(U)WGAN and cycle-CLSWGAN, the baseline model f-CLSWGAN, denoted by baseline, and several other baseline methods previously used in the field for benchmarking [2]. Table 4 shows the GZSL results and Table 5 shows the ZSL results obtained from our proposed methods and several baseline approaches on the CUB, FLO, SUN and AWA datasets. Table 6 shows the top-1 accuracy on ImageNet for cycle-WGAN and baseline [1].
6 Discussion
Regarding the GZSL results in Table 4, we notice that there is a clear trend of all of our proposed feature generation methods (cycle-WGAN, cycle-(U)WGAN, and cycle-CLSWGAN) performing better than baseline on the unseen test set. In particular, it seems advantageous to use the synthetic samples from unseen classes to train the cycle-(U)WGAN model, since it achieves the best top-1 accuracy results in 3 out of the 4 datasets, with improvements from 0.7% to more than 4%. In general, the top-1 accuracy improvement achieved by our approaches on the seen test set is less remarkable, which is expected given that we prioritize improving the results for the unseen classes. Nevertheless, our approaches achieved improvements from 0.4% to more than 2.5% for the seen classes. Finally, the harmonic mean results also show that our approaches improve over the baseline by between 1% and 2.2%. Notice that these results are remarkable considering the outstanding improvements already achieved by f-CLSWGAN [1], represented here by baseline. In fact, our proposed methods produce the current state-of-the-art GZSL results for these four datasets.
Analyzing the ZSL results in Table 5, we again notice that, similarly to the GZSL case, there is a clear advantage in using the synthetic samples from unseen classes to train the cycle-(U)WGAN model. For instance, top-1 accuracy results show that we can improve over the baseline from 0.9% to 3.5%. The results in this table show that our proposed approaches currently hold the best ZSL results for these datasets.
It is interesting to see that, compared to GZSL, the ZSL results from previous methods in the literature are far more competitive, achieving results that are relatively close to ours and the baseline's. This performance gap between ZSL and GZSL, shown by previous methods, reinforces the argument in favor of using generative models to synthesize images from seen and unseen classes to train GZSL models [1, 8, 9]. As argued throughout this paper, the performance produced by generative models can be improved further with methods that help the training of GANs, such as the cycle consistency loss [10].
In fact, the experiments clearly demonstrate the advantage of using our proposed multi-modal cycle consistency loss in training GANs for GZSL and ZSL. In particular, it is interesting to see that the use of synthetic examples of unseen classes generated by cycle-(U)WGAN to train the GZSL classifier provides remarkable improvements over the baseline, represented by f-CLSWGAN [1]. The only exception is with the SUN dataset, where the best result is achieved by cycle-CLSWGAN. We believe that cycle-(U)WGAN is not the top performer on SUN due to the number of classes and the proportion of seen/unseen classes in this dataset. For CUB, FLO and AWA we notice that there is roughly an \((80\%,20\%)\) ratio between seen and unseen classes. In contrast, SUN has a \((91\%,9\%)\) ratio between seen and unseen classes. We also notice a sharp increase in the number of classes from 50 to 817 – GAN models tend not to work well with such a large number of classes. Given the wide variety of GZSL datasets available in the field, with different numbers of classes and seen/unseen proportions, we believe that there is still much room for improvement in GZSL models.
Regarding the large-scale study on ImageNet, the results in Table 6 show that the top-1 accuracy classification results for Baseline and cycle-WGAN are quite low (similarly to the results observed in [1] for several ImageNet splits), but our proposed approach still shows more accurate ZSL and GZSL classification.
An important question about our approach is whether the regularization succeeds in mapping the generated visual representations back to the semantic space. In order to answer this question, we show in Fig. 3 the evolution of the reconstruction loss \(\ell _{REG}\) in (6) as a function of the number of epochs. In general, the reconstruction loss decreases steadily over training, showing that our model succeeds at such mapping. Another relevant question is whether our proposed methods take more or fewer epochs to converge, compared to the baseline – Fig. 4 shows the classification accuracy of the generated training samples from the seen classes for the proposed models cycle-WGAN and cycle-CLSWGAN, and also for the baseline (note that cycle-(U)WGAN is a fine-tuned model from the cycle-WGAN, so their loss functions are in fact identical for the seen classes shown in the graph). For three out of four datasets, our proposed cycle-WGAN converges faster. However, when \(\ell _{CLS}\) is included in (7) to form the loss in (8) (transforming cycle-WGAN into cycle-CLSWGAN), the convergence of cycle-CLSWGAN is comparable to that of the baseline. Hence, cycle-WGAN tends to converge faster than both the baseline and cycle-CLSWGAN.
7 Conclusions and Future Work
In this paper, we propose a new method to regularize the training of GANs in GZSL models. The main argument explored in the paper is that the use of GANs to generate seen and unseen synthetic examples for training GZSL models has shown clear advantages over previous approaches. However, the unconstrained nature of the generation of samples from unseen classes can produce models that may not work robustly for some unseen classes. Therefore, by constraining the generation of samples from unseen classes, we aim to improve the GZSL classification accuracy. Our proposed constraint is motivated by the cycle consistency loss [10], where we enforce that the generated visual representations map back to their original semantic features – this is the multi-modal cycle consistency loss. Experiments show that the use of such a loss is clearly advantageous, providing improvements over the current state of the art, f-CLSWGAN [1], both in terms of GZSL and ZSL.
As noted in Sect. 6, GAN-based GZSL approaches offer an indisputable advantage over previously proposed methods. However, the reliance on GANs to generate samples from unseen classes is challenging because GANs are notoriously difficult to train, particularly in unconstrained and large-scale problems. Therefore, future work in this field should focus on these problems. In this paper, we provide a solution that addresses the unconstrained generation problem, but other regularization approaches could also be used. In addition, the use of GANs in large-scale problems (in terms of the number of classes) should be studied more intensively, particularly when dealing with real-life datasets and scenarios. We will therefore focus our future research on solving these two issues in GZSL.
Notes
1. Code is available at: https://github.com/rfelixmg/frwgan-eccv18.
References
Xian, Y., Lorenz, T., Schiele, B., Akata, Z.: Feature generating networks for zero-shot learning. In: 31st IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2018), Salt Lake City, UT, USA (2018)
Xian, Y., Schiele, B., Akata, Z.: Zero-shot learning - the Good, the Bad and the Ugly. In: 30th IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2017), Honolulu, HI, USA, pp. 3077–3086. IEEE Computer Society (2017)
Zhang, Z., Saligrama, V.: Zero-shot learning via semantic similarity embedding. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 4166–4174 (2015)
Lampert, C.H., Nickisch, H., Harmeling, S.: Attribute-based classification for zero-shot visual object categorization. IEEE Trans. Pattern Anal. Mach. Intell. 36(3), 453–465 (2014)
Qiao, R., Liu, L., Shen, C., van den Hengel, A.: Less is more: zero-shot learning from online textual documents with noise suppression. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2249–2257 (2016)
Socher, R., Ganjoo, M., Manning, C.D., Ng, A.: Zero-shot learning through cross-modal transfer. In: Advances in Neural Information Processing Systems, pp. 935–943 (2013)
Yu, F.X., Cao, L., Feris, R.S., Smith, J.R., Chang, S.F.: Designing category-level attributes for discriminative visual recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 771–778 (2013)
Long, Y., Liu, L., Shen, F., Shao, L., Li, X.: Zero-shot learning using synthesised unseen visual data with diffusion regularisation. IEEE Trans. Pattern Anal. Mach. Intell. (2017)
Bucher, M., Herbin, S., Jurie, F.: Generating visual representations for zero-shot classification. In: International Conference on Computer Vision (ICCV) Workshops: TASK-CV: Transferring and Adapting Source Knowledge in Computer Vision (2017)
Zhu, J.Y., Park, T., Isola, P., Efros, A.A.: Unpaired image-to-image translation using cycle-consistent adversarial networks. In: 2017 IEEE International Conference on Computer Vision (ICCV) (2017)
Tran, T., Pham, T., Carneiro, G., Palmer, L., Reid, I.: A Bayesian data augmentation approach for learning deep models. In: Advances in Neural Information Processing Systems, pp. 2794–2803 (2017)
Welinder, P., et al.: Caltech-UCSD birds 200 (2010)
Nilsback, M.E., Zisserman, A.: Automated flower classification over a large number of classes. In: Sixth Indian Conference on Computer Vision, Graphics & Image Processing, ICVGIP 2008, pp. 722–729. IEEE (2008)
Farhadi, A., Endres, I., Hoiem, D., Forsyth, D.: Describing objects by their attributes. In: IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2009, pp. 1778–1785. IEEE (2009)
Deng, J., Dong, W., Socher, R., Li, L.J., Li, K., Fei-Fei, L.: ImageNet: a large-scale hierarchical image database. In: CVPR09 (2009)
Chen, L., Zhang, H., Xiao, J., Liu, W., Chang, S.F.: Zero-shot visual recognition using semantics-preserving adversarial embedding networks. In: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018
Annadani, Y., Biswas, S.: Preserving semantic relations for zero-shot learning. In: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018
Akata, Z., Perronnin, F., Harchaoui, Z., Schmid, C.: Label-embedding for image classification. IEEE Trans. Pattern Anal. Mach. Intell. 38(7), 1425–1438 (2016)
Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Mikolov, T., et al.: DeVISE: a deep visual-semantic embedding model. In: Advances in Neural Information Processing Systems, pp. 2121–2129 (2013)
Akata, Z., Reed, S., Walter, D., Lee, H., Schiele, B.: Evaluation of output embeddings for fine-grained image classification. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2927–2936 (2015)
Romera-Paredes, B., Torr, P.: An embarrassingly simple approach to zero-shot learning. In: International Conference on Machine Learning, pp. 2152–2161 (2015)
Kodirov, E., Xiang, T., Gong, S.: Semantic autoencoder for zero-shot learning. In: IEEE CVPR 2017 (2017)
Xian, Y., Akata, Z., Sharma, G., Nguyen, Q., Hein, M., Schiele, B.: Latent embeddings for zero-shot classification. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 69–77 (2016)
Norouzi, M., et al.: Zero-shot learning by convex combination of semantic embeddings. In: ICLR (2014)
Changpinyo, S., Chao, W.L., Gong, B., Sha, F.: Synthesized classifiers for zero-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5327–5336 (2016)
He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016)
Mikolov, T., Sutskever, I., Chen, K., Corrado, G.S., Dean, J.: Distributed representations of words and phrases and their compositionality. In: Advances in Neural Information Processing Systems, pp. 3111–3119 (2013)
Yan, X., Yang, J., Sohn, K., Lee, H.: Attribute2Image: conditional image generation from visual attributes. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) ECCV 2016 Part IV. LNCS, vol. 9908, pp. 776–791. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-46493-0_47
Arjovsky, M., Chintala, S., Bottou, L.: Wasserstein GAN. arXiv preprint arXiv:1701.07875 (2017)
Lampert, C.H., Nickisch, H., Harmeling, S.: Learning to detect unseen object classes by between-class attribute transfer. In: IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2009, pp. 951–958, June 2009
Reed, S., Akata, Z., Lee, H., Schiele, B.: Learning deep representations of fine-grained visual descriptions. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 49–58 (2016)
Wang, P., Liu, L., Shen, C., Huang, Z., van den Hengel, A., Shen, H.T.: Multi-attention network for one shot learning. In: 2017 IEEE Conference on Computer Vision and Pattern Recognition, CVPR, pp. 22–25 (2017)
Xiao, J., Hays, J., Ehinger, K.A., Oliva, A., Torralba, A.: SUN database: large-scale scene recognition from abbey to zoo. In: 2010 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3485–3492. IEEE (2010)
Maas, A.L., Hannun, A.Y., Ng, A.Y.: Rectifier nonlinearities improve neural network acoustic models. Proc. ICML. 30, 3 (2013)
Nair, V., Hinton, G.E.: Rectified linear units improve restricted Boltzmann machines. In: Proceedings of the 27th International Conference on Machine Learning (ICML 2010), pp. 807–814 (2010)
Abadi, M., et al.: TensorFlow: a system for large-scale machine learning. OSDI 16, 265–283 (2016)
© 2018 Springer Nature Switzerland AG
Felix, R., Vijay Kumar, B.G., Reid, I., Carneiro, G. (2018). Multi-modal Cycle-Consistent Generalized Zero-Shot Learning. In: Ferrari, V., Hebert, M., Sminchisescu, C., Weiss, Y. (eds) Computer Vision – ECCV 2018. ECCV 2018. Lecture Notes in Computer Science(), vol 11210. Springer, Cham. https://doi.org/10.1007/978-3-030-01231-1_2
Print ISBN: 978-3-030-01230-4
Online ISBN: 978-3-030-01231-1