
1 Introduction

Generative Adversarial Networks (GANs) can generate samples from complex image distributions [13]. They consist of two networks: a discriminator, which aims to separate real images from fake (or generated) images, and a generator, which is simultaneously optimized to generate images that the discriminator classifies as real. The theory was later extended to conditional GANs, where the generative process is constrained by a conditioning prior [29] provided as an additional input. GANs have since been applied to a wide range of tasks, including super-resolution [26], 3D object generation and reconstruction [40], human pose estimation [28], and age estimation [47].

Deep neural networks have obtained excellent results for discriminative classification problems for which large datasets exist; for example on the ImageNet dataset, which consists of over 1M images [25]. However, for many problems the amount of labeled data is not sufficient to train the millions of parameters typically present in these networks. Fortunately, it was found that the knowledge contained in a network trained on a large dataset (such as ImageNet) can easily be transferred to other computer vision tasks, either by using these networks as off-the-shelf feature extractors [3], or by adapting them to a new domain through a process called fine tuning [33]. In the latter case, the pre-trained network is used to initialize the weights for a new task (effectively transferring the knowledge learned from the source domain), which are then fine tuned with the training images from the new domain. It has been shown that far fewer images are required to train networks that are initialized with a pre-trained network.

GANs are in general trained from scratch. The procedure of using a pre-trained network for initialization, which is very popular for discriminative networks, is to the best of our knowledge not used for GANs. However, as in the case of discriminative networks, the number of parameters in a GAN is vast; for example the popular DC-GAN architecture [36] requires 36M parameters to generate a 64 \(\times \) 64 image. Especially in domains that lack many training images, the use of pre-trained GANs could significantly improve the quality of the generated images.

Therefore, in this paper, we set out to evaluate the usage of pre-trained networks for GANs. The paper has the following contributions:

  1. We evaluate several transfer configurations, and show that pre-trained networks can effectively accelerate the learning process and provide useful prior knowledge when data is limited.

  2. We study how the relation between source and target domains impacts the results, and discuss the problem of choosing a suitable pre-trained model, which seems more difficult than in the case of discriminative tasks.

  3. We evaluate the transfer from unconditional GANs to conditional GANs for two commonly used methods to condition GANs.

2 Related Work

Transfer Learning/Domain Transfer: Learning how to transfer knowledge from a source domain to a target domain is a well studied problem in computer vision [34]. In the deep learning era, complex knowledge is extracted during the training stage on large datasets [38, 48]. Domain adaptation by means of fine tuning a pre-trained network has become the default approach for many applications with limited training data or slow convergence [9, 33].

Several works have investigated transferring knowledge to unsupervised or sparsely labeled domains. Tzeng et al. [43] optimized for domain invariance, while transferring task information that is present in the correlation between the classes of the source domain. Ganin et al. [12] proposed to learn domain invariant features by means of a gradient reversal layer. A network simultaneously trained on these invariant features can be transferred to the target domain. Finally, domain transfer has also been studied for networks that learn metrics [18]. In contrast to these methods, we do not focus on transferring discriminative features, but on transferring knowledge for image generation.

GAN: Goodfellow et al. [13] introduced the first GAN model for image generation. Their architecture uses a series of fully connected layers and is thus limited to simple datasets. For generating real images of higher complexity, convolutional architectures have been shown to be a more suitable option. Shortly afterwards, Deep Convolutional GANs (DC-GAN) became the standard GAN architecture for image generation problems [36]. In DC-GAN, the generator sequentially up-samples the input features by using fractionally-strided convolutions, whereas the discriminator uses normal convolutions to classify the input images. Recent multi-scale architectures [8, 20, 22] can effectively generate high resolution images. It was also found that ensembles can be used to improve the quality of the generated distribution [44].

Independently of the type of architecture used, GANs present multiple challenges regarding their training, such as convergence properties, stability issues, or mode collapse. Arjovsky et al. [1] showed that the original GAN loss [13] is unable to properly deal with ill-suited distributions such as those with disjoint supports, often found during GAN training. Addressing these limitations, the Wasserstein GAN [2] uses the Wasserstein distance as a robust loss, yet requires the critic (discriminator) to be 1-Lipschitz. This constraint was originally enforced by clipping the weights. Alternatively, an even more stable solution is to add a gradient penalty term to the loss (known as WGAN-GP) [15].

cGAN: Conditional GANs (cGANs) [29] are a class of GANs that use a particular attribute as a prior to build conditional generative models. Examples of conditions are class labels [14, 32, 35], text [37, 46], or another image (image translation [23, 50] and style transfer [11]).

Most cGAN models [10, 29, 41, 46] apply their condition in both generator and discriminator by concatenating it to the input of the layers, i.e. the noise vector for the first layer or the learned features for the internal layers. Instead, the authors of [11] include the conditioning in the batch normalization layer. The AC-GAN framework [32] extends the discriminator with an auxiliary decoder to reconstruct class-conditional information. Similarly, InfoGAN [5] reconstructs a subset of the latent variables from which the samples were generated. Miyato et al. [30] propose another modification of the discriminator based on a projection layer that uses the inner product between the conditional information and the intermediate output to compute its loss.

3 Generative Adversarial Networks

3.1 Loss Functions

A GAN consists of a generator G and a discriminator D [13]. The aim is to train a generator G which generates samples that are indistinguishable from the real data distribution. The discriminator is optimized to distinguish samples from the real data distribution \(p_{data}\) from those of the fake (generated) data distribution \(p_g\). The generator takes noise \(z \sim p_z\) as input, and generates samples \(G\left( z\right) \) with a distribution \(p_g\). The networks are trained with an adversarial objective. The generator is optimized to generate samples which would be classified by the discriminator as belonging to the real data distribution. The objective is:

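$$\begin{aligned} G^{*} = \arg \min _{G} \max _{D} \mathcal {L}_{GAN}\left( G,D\right) \end{aligned}$$
(1)
$$\begin{aligned} \mathcal {L}_{GAN}\left( G,D\right) = \mathbb {E}_{x\sim p_{data}} [\log D(x)] + \mathbb {E}_{z\sim p_z}[\log (1-D(G(z)))] \end{aligned}$$
(2)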

In the case of WGAN-GP [15] the two loss functions are:

$$\begin{aligned} \begin{aligned}&\mathcal {L}_{WGAN-GP}\left( D\right) = -\mathbb {E}_{x\sim p_{data}} [D(x)] + \mathbb {E}_{z\sim p_z}[D(G(z))]\\ {}&+ \lambda \mathbb {E}_{x\sim p_{data},{z\sim p_z},\alpha \sim \left( 0,1\right) }\left[ \left( \Vert \nabla D\left( \alpha x+ \left( 1-\alpha \right) G(z) \right) \Vert _2 -1 \right) ^2 \right] \end{aligned} \end{aligned}$$
(3)
$$\begin{aligned} \mathcal {L}_{WGAN-GP}\left( G\right) = -\mathbb {E}_{z\sim p_z}[D(G(z))] \end{aligned}$$
(4)
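
As a rough illustration of Eqs. (3) and (4), the following sketch evaluates both losses with NumPy for a toy linear critic \(D(x) = w^\top x + b\), whose gradient with respect to the input is simply \(w\). The linear critic, the function name `wgan_gp_losses`, and the default value `lam=10.0` are assumptions made for illustration only; the networks used in the experiments are ResNets and would need automatic differentiation to obtain the gradient term.

```python
import numpy as np


def wgan_gp_losses(w, b, real, fake, lam=10.0, rng=None):
    """Evaluate Eqs. (3) and (4) for a toy linear critic D(x) = x @ w + b.

    real, fake: arrays of shape (n, d) holding real samples x and generated
    samples G(z). For a linear critic the input gradient equals w at every
    point, so the penalty does not depend on the interpolates; they are
    still formed here to mirror the equation.
    """
    rng = np.random.default_rng(0) if rng is None else rng
    d_real = real @ w + b
    d_fake = fake @ w + b

    # Interpolates alpha * x + (1 - alpha) * G(z), alpha drawn uniformly from (0, 1).
    alpha = rng.uniform(0.0, 1.0, size=(real.shape[0], 1))
    interpolates = alpha * real + (1.0 - alpha) * fake

    # Gradient of the linear critic at each interpolate (constant, equal to w).
    grads = np.broadcast_to(w, interpolates.shape)
    grad_norm = np.linalg.norm(grads, axis=1)
    penalty = np.mean((grad_norm - 1.0) ** 2)

    loss_d = -np.mean(d_real) + np.mean(d_fake) + lam * penalty
    loss_g = -np.mean(d_fake)
    return loss_d, loss_g
```

With a unit-norm `w` the penalty vanishes and the critic loss reduces to the difference of mean scores; any other norm is penalized quadratically.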

3.2 Evaluation Metrics

Evaluating GANs is notoriously difficult [42] and there is no clear agreed reference metric yet. In general, a good metric should measure the quality and the diversity in the generated data. Likelihood has been shown to not correlate well with these requirements [42]. Better correlation with human perception has been found in the widely used Inception Score [39], but recent works have also shown its limitations [49]. In our experiments we use two recent metrics that show better correlation in recent studies [4, 21]. While not perfect, we believe they are satisfactory enough to help us to compare the models in our experiments.

Fréchet Inception Distance [17]. The similarity between two sets is measured as their Fréchet distance (also known as Wasserstein-2 distance) in an embedded space. The embedding is computed using a fixed convolutional network (an Inception model) up to a specific layer. The embedded data is assumed to follow a multivariate normal distribution, which is estimated by computing their mean and covariance. In particular, the FID is computed as

$$\begin{aligned} FID \left( \mathcal {X}_{1},\mathcal {X}_{2} \right) =\left\| \mu _{1}-\mu _2 \right\| ^2_2 + {\text {Tr}} \left( \varSigma _{1} + \varSigma _2 -2\left( \varSigma _{1}\varSigma _2 \right) ^\frac{1}{2}\right) \end{aligned}$$
(5)

Typically, \(\mathcal {X}_1\) is the full dataset with real images, while \(\mathcal {X}_2\) is a set of generated samples. We use FID as our primary metric, since it is efficient to compute and correlates well with human perception [17].
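
A minimal sketch of Eq. (5) is given below, assuming the Inception embeddings have already been computed and are passed in as NumPy arrays (the function name `fid_from_features` is hypothetical). Instead of an explicit matrix square root, it uses the fact that the trace of \(\left( \varSigma _{1}\varSigma _2 \right) ^{1/2}\) equals the sum of the square roots of the eigenvalues of \(\varSigma _{1}\varSigma _2\).

```python
import numpy as np


def fid_from_features(feats1, feats2):
    """Eq. (5) for two sets of embedded samples of shape (n_samples, dim)."""
    mu1, mu2 = feats1.mean(axis=0), feats2.mean(axis=0)
    sigma1 = np.atleast_2d(np.cov(feats1, rowvar=False))
    sigma2 = np.atleast_2d(np.cov(feats2, rowvar=False))

    # Eigenvalues of a product of two positive semi-definite matrices are real
    # and non-negative; small negative or imaginary parts are numerical noise.
    eigvals = np.linalg.eigvals(sigma1 @ sigma2)
    trace_sqrt = np.sum(np.sqrt(np.clip(eigvals.real, 0.0, None)))

    mean_term = np.sum((mu1 - mu2) ** 2)
    return float(mean_term + np.trace(sigma1) + np.trace(sigma2) - 2.0 * trace_sqrt)
```

Two identical sets give a distance of (numerically) zero, and shifting one set by a constant vector adds exactly the squared norm of that shift.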

Independent Wasserstein (IW) Critic [7]. This metric uses an independent critic \(\hat{D}\) only for evaluation. This independent critic will approximate the Wasserstein distance [1] between two datasets \(\mathcal {X}_{1}\) and \(\mathcal {X}_{2}\) as

$$\begin{aligned} IW \left( \mathcal {X}_{1},\mathcal {X}_{2} \right) =\mathbb {E}_{x\sim \mathcal {X}_{1}}\left( \hat{D}\left( x \right) \right) - \mathbb {E}_{x\sim \mathcal {X}_{2}}\left( \hat{D}\left( x \right) \right) \end{aligned}$$
(6)

In this case, \(\mathcal {X}_1\) is typically a validation set, used to train the independent critic. We report IW only in some experiments, due to the larger computational cost that requires training a network for each measurement.
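
Once the independent critic has been trained, Eq. (6) is a difference of two sample means. The sketch below assumes the critic is available as a callable that maps a batch of samples to scalar scores; `iw_estimate` and the linear `toy_critic` are hypothetical stand-ins, not the critic used in the experiments.

```python
import numpy as np


def iw_estimate(critic, x1, x2):
    """Eq. (6): mean critic score on x1 minus mean critic score on x2."""
    return float(np.mean(critic(x1)) - np.mean(critic(x2)))


def toy_critic(x):
    """Stand-in for a trained critic (assumption): a fixed linear score."""
    return x @ np.array([1.0, -1.0])
```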

Table 1. FID/IW (the lower the better/the higher the better) for different transfer configurations. ImageNet was used as source dataset and LSUN Bedrooms as target (100K images).

4 Transferring GAN Representations

4.1 GAN Adaptation

To study the effect of domain transfer for GANs we use the WGAN-GP [15] architecture, which uses ResNet in both generator and discriminator. This architecture has been experimentally demonstrated to be stable and robust against mode collapse [15]. The generator consists of one fully connected layer, four residual blocks and one convolution layer, and the discriminator has the same structure. The same architecture is used for the conditional GAN.

Implementation Details. We generate images of 64 \(\times \) 64 pixels, using standard values for hyperparameters. The source models are trained with a batch size of 128 for 50K iterations (except 10K iterations for CelebA) using Adam [24] and a learning rate of 1e−4. For fine tuning we use a batch size of 64 and a learning rate of 1e−4 (except 1e−5 for 1K target samples). Batch normalization and layer normalization are used in the generator and discriminator respectively.

4.2 Generator/Discriminator Transfer Configuration

The two networks of the GAN (generator and discriminator) can be initialized with either random or pre-trained weights (from the source networks). In a first experiment we consider the four possible combinations, using a GAN pre-trained with ImageNet and 100K samples of LSUN Bedrooms as target dataset. The source GAN was trained for 50K iterations. The target GAN was trained for an additional 40K iterations.

Table 1 shows the results. Interestingly, we found that transferring the discriminator is more critical than transferring the generator. The former helps to improve the results in both FID and IW metrics, while the latter only helps if the discriminator was already transferred, otherwise harming the performance. Transferring both obtains the best result. We also found that training is more stable in this setting. Therefore, in the rest of the experiments we evaluated either training both networks from scratch or pre-training both (henceforth simply referred to as pre-trained).
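
The four initialization combinations compared in Table 1 can be sketched as follows, treating each network as a dictionary of NumPy weight arrays. The helper names, the dictionary representation, and the random initializer (a normal distribution with standard deviation 0.02) are assumptions for illustration and are not taken from the implementation used in the experiments.

```python
import numpy as np


def init_random(shapes, rng):
    """Random initialization for a network described by {param_name: shape}."""
    return {name: rng.normal(0.0, 0.02, size=shape) for name, shape in shapes.items()}


def build_configuration(pretrained_g, pretrained_d, transfer_g, transfer_d, rng):
    """Return initial (generator, discriminator) weights for one configuration.

    Each network is either copied from the source GAN or re-initialized at
    random with the same parameter shapes.
    """
    def pick(pretrained, transfer):
        if transfer:
            return {k: v.copy() for k, v in pretrained.items()}
        return init_random({k: v.shape for k, v in pretrained.items()}, rng)

    return pick(pretrained_g, transfer_g), pick(pretrained_d, transfer_d)


# (transfer_g, transfer_d) flags for the four combinations.
CONFIGURATIONS = [(g, d) for g in (False, True) for d in (False, True)]
```

Fine tuning then proceeds identically in all four cases; only the starting point differs.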

Fig. 1. Evolution of evaluation metrics when trained from scratch or using a pre-trained model for unconditional GAN measured with (a) FID and (b) IW (source: ImageNet, target: LSUN Bedrooms, metrics: FID and IW). The curves are smoothed for easier visualization by averaging in a window of a few iterations.

Table 2. FID/IW for different sizes of the target set (LSUN Bedrooms) using ImageNet as source dataset.

Figure 1 shows the evolution of FID and IW during the training process with and without transfer. Networks adapted from a pre-trained model reach a given score in significantly fewer iterations. Training from scratch for a long time reduces this gap significantly, but pre-trained GANs already generate images of good quality after far fewer iterations. Figures 2 and 4 show specific examples that illustrate these conclusions visually.

Fig. 2. Images generated at different iterations (from 0 to 10000, in steps of 2000) for LSUN Bedrooms, training from scratch and from a pre-trained network. Better viewed in electronic version.

4.3 Size of the Target Dataset

The number of training images is critical to obtain realistic images, in particular as the resolution increases. Our experimental settings involve generating images of 64 \(\times \) 64 pixels, where GANs typically require hundreds of thousands of training images to obtain convincing results. We evaluate our approach in a challenging setting where we use as few as 1000 images from the LSUN Bedrooms dataset, with ImageNet as source dataset. Note that, in general, GANs evaluated on LSUN Bedrooms use the full set of 3M images.

Table 2 shows FID and IW measured for different numbers of training samples of the target domain. As the training data becomes scarce, the training set implicitly becomes less representative of the full dataset (i.e. less diverse). In this experiment, a GAN adapted from the pre-trained model requires roughly two to five times fewer images to obtain a score similar to that of a GAN trained from scratch. FID and IW are sensitive to this factor, so in order to have a lower bound we also measured the FID between the specific subset used as training data and the full dataset. With 1K images this value is even higher than the value for generated samples after training with 100K and 1M images.

Initializing with the pre-trained GAN helps to improve the results in all cases, and the improvement is more significant when the target data is more limited. The difference with the lower bound is still large, which suggests that there is still room for improvement in settings with limited data.

Figure 2 shows images generated at different iterations. As in the previous case, pre-trained networks can generate high quality images already in earlier iterations, in particular with sharper and more defined shapes and more realistic fine details. Visually, the difference is also more evident with limited data, where learning to generate fine details is difficult, so adapting pre-trained networks can transfer relevant prior information.

4.4 Source and Target Domains

The domain of the source model and its relation with the target domain are also critical factors. We evaluate different combinations of source domains and target domains (see Table 3 for details). As source datasets we used ImageNet, Places, LSUN Bedrooms and CelebA. Note that both ImageNet and Places cover wide domains, with great diversity in objects and scenes, respectively, while LSUN Bedrooms and CelebA cover a narrow domain more densely. As targets we used smaller datasets, including Oxford Flowers, LSUN Kitchens (a subset of 50K out of 2M images), Labeled Faces in the Wild (LFW) and Cityscapes.

Table 3. Datasets used in the experiments.
Table 4. Distance between target real data and target generated data \( FID/IW \left( \mathcal {X}^{tgt}_{data},\mathcal {X}^{tgt}_{gen}\right) \).

We pre-trained GANs for the four source datasets and then trained five GANs for each of the four target datasets (from scratch and initialized with each of the source GANs). The FID and IW after fine tuning are shown in Table 4. Pre-trained GANs achieve significantly better results. Both metrics generally agree, but there are some interesting exceptions. The best source model for Flowers as target is ImageNet, which is not surprising since it also contains flowers, plants and objects in general. It is more surprising that Bedrooms is also competitive according to FID (but not so much according to IW). The most interesting case is perhaps Kitchens, since Places has several thousand kitchens in the dataset, yet also many more classes that are less related. In contrast, bedrooms and kitchens are not the same class yet are still closely related visually and structurally, so the much larger set of related images in Bedrooms may be a better choice. Here FID and IW do not agree, with FID clearly favoring Bedrooms, and even the less related ImageNet, over Places, while IW prefers Places by a small margin. As expected, CelebA is the best source for LFW, since both contain faces (at different scales, though), but Bedrooms comes surprisingly close in both metrics. For Cityscapes all source models yield results within a similar range, with both high FID and IW, perhaps due to the large distance to all source domains (Fig. 3).

Fig. 3. Transferring GANs: training source GANs, estimation of the most suitable pre-trained model and adaptation to the target domain.

Table 5. Distance between source generated data \(\mathcal {X}^{src}_{gen}\) and target real data \(\mathcal {X}^{tgt}_{data}\), and distance between source real \(\mathcal {X}^{src}_{data}\) and generated data \(\mathcal {X}^{src}_{gen}\).

4.5 Selecting the Pre-trained Model

Selecting a pre-trained model for a discriminative task (e.g. classification) is reduced to simply selecting either ImageNet, for object-centric domains, or Places, for scene-centric ones. The target classifier or fine tuning will simply learn to ignore non-related features and filters of the source network.

However, this simple rule of thumb does not seem to apply so clearly in our GAN transfer setting, because generation is a much more complex task than discrimination. Results in Table 4 show that sometimes unrelated datasets may perform better than others that are apparently more related. The large number of unrelated classes may be an important factor, since narrow yet dense domains also seem to perform better even when they are not so related (e.g. Bedrooms). There are also non-trivial biases in the datasets that may explain this behavior. Therefore, given a collection of pre-trained GANs, a way to estimate the most suitable model for a given target dataset is desirable.

Perhaps the simplest way is to measure the distance between the source and target domains. We evaluated the FID between the (real) images in the target and the source datasets (results included in the supplementary material). While it shows some correlation with the FID of the target generated data, it has the limitation of not considering whether the actual pre-trained model is able to sample accurately from the real distribution. A more helpful metric is the distance between the target data and the samples generated by the pre-trained model. In this way, the quality of the model is taken into account. We also estimate this distance using FID. In general, these distances seem to correlate roughly with the final FID results for target generated data (compare Tables 4 and 5). Nevertheless, it is surprising that Places is estimated to be a good source dataset but does not live up to the expectation. The opposite occurs for Bedrooms, which seems to deliver better results than expected. This may suggest that density is more important than diversity for a good transferable model, even for apparently unrelated target domains.

In our opinion, the FID between source generated and target real data is a rough indicator of suitability rather than an accurate metric. It should be taken into account jointly with other factors (e.g. quality of the source model) to decide which model is best for a given target dataset.
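
The selection heuristic discussed in this section amounts to ranking the candidate source GANs by the FID between their generated samples and the real target data. A minimal sketch follows; the function name, the source names and the distances are placeholders chosen for illustration and are not values from Table 5.

```python
def rank_source_models(fid_to_target):
    """Sort candidate source GANs by FID between their generated samples and
    the real target data, lowest (most promising) first."""
    return sorted(fid_to_target, key=fid_to_target.get)


# Placeholder distances for hypothetical sources; NOT results from the paper.
example = {"source_a": 120.0, "source_b": 85.0, "source_c": 240.0}
ranking = rank_source_models(example)
```

As noted above, such a ranking is only a rough indicator and would be combined with other factors in practice.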

4.6 Visualizing the Adaptation Process

One advantage of the image generation setting is that the process of shifting from the source domain towards the target domain can be visualized by sampling images at different iterations, in particular during the initial ones. Figure 4 shows some examples of the target domain Kitchens and different source domains (iterations are sampled in a logarithmic scale).

Trained from scratch, the generated images simply start with noisy patterns that evolve slowly, and after 4000 iterations the model manages to reproduce the global layout and color, but still fails to generate convincing details. Both the GANs pre-trained with Places and ImageNet fail to generate realistic enough source images and often sample from unrelated source classes (see iteration 0). During the initial adaptation steps, the GAN tries to generate kitchen-like patterns by matching and slightly modifying the source pattern, therefore preserving global features such as colors and global layout, at least during a significant number of iterations, then slowly changing them to more realistic ones. Nevertheless, the textures and edges are sharper and more realistic than from scratch. The GAN pre-trained with Bedrooms can already generate very convincing bedrooms, which share a lot of features with kitchens. The larger number of training images in Bedrooms helps to learn transferable fine grained details that other datasets cannot. The adaptation mostly preserves the layout, colors and perspective of the source generated bedroom, and slowly transforms it into a kitchen by changing fine grained details, resulting in more convincing images than with the other source datasets. Despite being a completely unrelated domain, CelebA also manages to speed up the learning process by providing useful priors. Different parts such as face, hair and eyes are transformed into different parts of the kitchen. Rather than the face itself, the most predominant features remaining from the source generated image are the background color and shape, which influence the layout and colors of the generated kitchens.

5 Transferring to Conditional GANs

Here we study transferring the representation learned by a pre-trained unconditional GAN to a cGAN [29]. cGANs allow us to condition the generative model on particular information such as classes, attributes, or even other images. Let y be a conditioning variable. The discriminator \(D(x,y)\) aims to distinguish pairs of real data x and y sampled from the joint distribution \(p_{data}\left( x,y\right) \) from pairs of generated outputs \(G(z,y')\) conditioned on samples \(y'\) from y’s marginal \(p_{data}(y)\).

5.1 Conditional GAN Adaptation

For the current study, we adopt the Auxiliary Classifier GAN (AC-GAN) framework of [32]. In this formulation, the discriminator has an ‘auxiliary classifier’ that outputs a probability distribution over classes \(P(C=y|x)\) conditioned on the input x. The objective function is then composed of the conditional version of the GAN loss \(\mathcal {L}_{GAN}\) (Eq. (1)) and the log-likelihood of the correct class. The final loss functions for generator and discriminator are:

$$\begin{aligned} \mathcal {L}_{AC-GAN}\left( G\right) =\mathcal {L}_{GAN}\left( G\right) -\alpha _{G}\mathbb {E}\left[ \log \left( P\left( C=y'|G(z,y')\right) \right) \right] , \end{aligned}$$
(7)
$$\begin{aligned} \mathcal {L}_{AC-GAN}\left( D\right) =\mathcal {L}_{GAN}\left( D\right) -\alpha _{D}\mathbb {E}\left[ \log \left( P\left( C=y|x\right) \right) \right] , \end{aligned}$$
(8)

respectively. The parameters \(\alpha _{G}\) and \(\alpha _{D}\) weight the contribution of the auxiliary classifier loss with respect to the GAN loss for the generator and discriminator. In our implementation, we use ResNet-18 [16] for both G and D, and the WGAN-GP loss from Eqs. (3) and (4) as the GAN loss. Overall, the implementation details (batch size, learning rate) are the same as introduced in Sect. 4.1.
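
To make the structure of Eqs. (7) and (8) concrete, the sketch below combines a WGAN-GP term with the auxiliary classification term, given critic scores and class logits as NumPy arrays. The function names, the softmax parameterization of \(P(C=y|x)\), and the default weights of 1.0 are assumptions for illustration; the discriminator version takes an already computed value of Eq. (3) as input.

```python
import numpy as np


def log_softmax(logits):
    """Numerically stable log-softmax over the class axis."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))


def ac_gan_generator_loss(critic_fake, class_logits_fake, labels, alpha_g=1.0):
    """Eq. (7), using the WGAN-GP generator loss of Eq. (4) as the GAN term."""
    gan_term = -np.mean(critic_fake)
    log_p = log_softmax(class_logits_fake)[np.arange(len(labels)), labels]
    return float(gan_term - alpha_g * np.mean(log_p))


def ac_gan_discriminator_loss(wgan_gp_d_loss, class_logits_real, labels, alpha_d=1.0):
    """Eq. (8): a precomputed Eq. (3) value minus the weighted class log-likelihood."""
    log_p = log_softmax(class_logits_real)[np.arange(len(labels)), labels]
    return float(wgan_gp_d_loss - alpha_d * np.mean(log_p))
```

Setting `alpha_g` or `alpha_d` to zero recovers the unconditional WGAN-GP losses.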

In AC-GAN, the conditioning is performed only on the generator by appending the class label to the input noise vector. We call this variant ‘Cond Concat’. We randomly initialize the weights which are connected to the conditioning prior. We also used another variant following [11], in which the conditioning prior is embedded in the batch normalization layers of the generator (referred to as ‘Cond BNorm’). In this case, there are different batch normalization parameters for each class. We initialize these parameters by copying the values from the unconditional GAN to all classes.
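
The two conditioning variants and their initialization can be sketched as follows. The one-hot encoding of the class label and the function names are assumptions for illustration; the batch normalization parameters are represented as plain scale and shift vectors.

```python
import numpy as np


def cond_concat_input(z, labels, num_classes):
    """'Cond Concat': append a one-hot class label to each noise vector."""
    one_hot = np.eye(num_classes)[labels]
    return np.concatenate([z, one_hot], axis=1)


def init_cond_bnorm(gamma, beta, num_classes):
    """'Cond BNorm': replicate the unconditional batch-norm scale and shift
    so that every class starts from the same pre-trained values."""
    return np.tile(gamma, (num_classes, 1)), np.tile(beta, (num_classes, 1))
```

In the 'Cond Concat' case the first layer gains `num_classes` extra input dimensions, and only the weights attached to them need random initialization; in the 'Cond BNorm' case no new weights are introduced beyond the per-class copies.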

5.2 Results

We use Places [48] as the source domain and consider all ten classes of the LSUN dataset [45] as target domain. We train the AC-GAN with 10K images per class for 25K iterations. The weights of the conditional GAN can be transferred from the pre-trained unconditional GAN (see Sect. 3.1) or initialized at random. The performance is assessed in terms of the FID score between target domain and generated images. The FID is computed class-wise, averaging over all classes and also considering the dataset as a whole (class-agnostic case). The classes in the target domain have been generated uniformly. The results are presented in Table 6, where we compare an AC-GAN whose weights have been transferred from a pre-trained network with an AC-GAN initialized randomly. We computed the FID for 250, 2500 and 25000 iterations. At the beginning of the learning process, there is a significant difference between the two cases. The gap is reduced towards the end of the learning process, but a significant performance gain still remains for pre-trained networks. We also consider the case with fewer images per class. The results after 25000 iterations for 100 and 1K images per class are provided in the last column of Table 7. The difference between networks trained from scratch and from pre-trained weights is more significant for smaller sample sizes. This confirms the trend observed in Sect. 4.3: transferring the pre-trained weights is especially advantageous when only limited data is available.

The same behavior can be observed in Fig. 5 (left), where we compare the performance of the AC-GAN with two unconditional GANs, one pre-trained on the source domain and one trained from scratch, as in Sect. 4.2. The curves correspond to the class-agnostic case (column ‘All’ in Table 6). From this plot, we can observe three aspects: (i) the two variants of AC-GAN perform similarly (for this reason, for the remainder of the experiments we consider only ‘Cond BNorm’); (ii) the network initialized with pre-trained weights converges faster than the network trained from scratch, and the overall performance is better; and (iii) AC-GAN performs slightly better than the unconditional GAN.

Table 6. Per-class and overall FID for AC-GAN. Source: Places, target: LSUN
Table 7. Accuracy of AC-GAN for the classification task and overall FID for different sizes of the target set (LSUN).
Fig. 4. Evolution of generated images (in logarithmic scale) for LSUN kitchens with different source datasets (from top to bottom: from scratch, ImageNet, Places, LSUN bedrooms, CelebA). Better viewed in electronic version.

Next, we evaluate the AC-GAN performance in a classification experiment. We train a reference classifier on the 10 classes of LSUN (10K real images per class). Then, we evaluate the quality of each model trained for 25K iterations by generating 10K images per class and measuring the accuracy of the reference classifier for 100, 1K and 10K images per class. The results show an improvement when using pre-trained models, with higher accuracy and lower FID in all settings, suggesting that they capture the real data distribution of the dataset better than models trained from scratch.

Finally, we perform a psychophysical experiment with images generated by AC-GAN with LSUN as target. Human subjects are presented with two images: pre-trained vs. from scratch (generated from the same condition <class>), and asked ‘Which of these two images of <class> is more realistic?’ Subjects were also given the option to skip a particular pair should they find it very hard to decide between them. We require each subject to provide 100 valid assessments. We use 10 human subjects, who evaluate image pairs for different settings (100, 1K, 10K images per class). The results (Fig. 5 right) clearly show that the images based on pre-trained GANs are considered to be more realistic in the case of 100 and 1K images per class (e.g. pre-trained is preferred in 67% of cases with 1K images). As expected, the difference is smaller for the 10K case.

Fig. 5. (Left) FID score for Conditional and Unconditional GAN (source: Places, target: LSUN 10 classes). (Right) Human evaluation of image quality.

6 Conclusions

We show how the principles of transfer learning can be applied to generative features for image generation with GANs. GANs, and conditional GANs, benefit from transferring pre-trained models, resulting in lower FID scores and more recognizable images with less training data. Somewhat contrary to intuition, our experiments show that transferring the discriminator is much more critical than the generator (yet transferring both networks is best). However, there are also other important differences with the discriminative scenario. Notably, it seems that a much higher density (images per class) is required to learn good transferable features for image generation, than for image discrimination (where diversity seems more critical). As a consequence, ImageNet and Places, while producing excellent transferable features for discrimination, seem not dense enough for generation, and LSUN data seems to be a better choice despite its limited diversity. Nevertheless, poor transferability may be also related to the limitations of current GAN techniques, and better ones could also lead to better transferability.

Our experiments evaluate GANs in settings rarely explored in previous works and show that there are many open problems. These settings include GANs and evaluation metrics in the very limited data regime, better mechanisms to estimate the most suitable pre-trained model for a given target dataset, and the design of better pre-trained GAN models.