1 Introduction

The ability to generate photo-realistic images of objects such as human faces or fully clothed bodies has wide applications in computer graphics and computer vision. Traditional computer graphics, based on physical simulation, often fails to produce photo-realistic images of objects with complicated geometry and material properties. In contrast, modern data-driven methods, such as deep learning-based generative models, show great promise for realistic image synthesis [23, 24]. Among the four major categories of generative models, namely generative adversarial networks (GANs), variational auto-encoders (VAEs), normalizing flows, and autoregressive models, GANs deliver images with the best visual quality. Although recent efforts in VAEs [13, 34] have tremendously improved their generation quality, they still require larger latent spaces and deliver lower-quality images. Autoregressive models are very slow to sample from and do not provide a latent representation of the training data. Flow-based methods do not perform dimensionality reduction and hence produce large models and latent representations. GANs, on the other hand, offer great generation quality but do not provide a mechanism to embed real images into the latent space. This limits them as a tool for image editing and manipulation. Specifically, while several methods exist [2, 4, 6, 46], there is no method that trains the generative and the inference model together. To that end, we propose InvGAN, an invertible GAN in which the discriminator acts as an inference module. InvGAN enables a wide range of applications, as described in the following paragraphs.

GANs learn a latent representation of the training data. This representation has been shown to be well-structured [10, 23, 24], allowing GANs to be employed for a variety of downstream tasks (e.g. classification, regression and other supervised tasks) [28, 33]. We extend the GAN framework to include an inference model that embeds real images into the latent space. InvGAN can be used to support representation learning [11, 27], data augmentation [10, 37] and algorithmic fairness [7, 38, 39]. Previous inversion methods rely on computationally expensive optimization [3, 4], limiting their scope to offline applications; e.g., data augmentation has to happen before training starts. Efficient, photo-realistic, semantically consistent, and model-based inversion is the key to online and adaptive use cases.

Recent work shows that even unsupervised GAN training isolates several desirable generative characteristics [29, 43]. Prominent examples are correspondences between latent space directions and visual characteristics such as hairstyle and skin tone. Recent works provide empirical evidence suggesting that one can find paths in the latent space (albeit non-linear) that allow for editing individual semantic aspects. GANs therefore have the potential to become a high-quality graphics editing tool [18, 41]. However, without a reliable mechanism for projecting real images into the latent space of the generative model, editing of real data is impossible. InvGAN takes a step towards addressing this problem.

2 Related Work

GAN inversion refers to the task of (approximately) inverting the generator network. It has been addressed in two primary ways: (1) using an inversion model (often a deep neural network) and (2) using an iterative optimization-based method, typically initialized with (1). Although invertibility of generative models extends beyond specific data domains (images, speech, language, etc.), we study InvGAN applied to image data only; applications to sound, language, and other domains are left as future work.

Optimization Based: iGAN [53] optimizes for a latent code that minimizes the distance between the generated image and a source image. To ensure uniqueness of the preimage of a GAN-generated data point, Lipton et al. [26] employ stochastic clipping. As the complexity of GAN generators increases, an inversion process based on gradient descent and a pixel-space \(\text {MSE}\) becomes insufficient. Addressing this, Abdal et al. specifically target StyleGAN generators and optimize a perceptual loss [3, 4]. However, they invert into the \(W^+\) space, the so-called extended W space of StyleGAN. This results in high-dimensional latent codes and consequently prolongs inversion time. It can also produce out-of-distribution latent representations, which makes the codes unsuitable for downstream tasks. In contrast, InvGAN offers fast, model-based embedding into the non-extended latent space.

Model Based: BiGAN [14] and ALI [16] invert the generator of a GAN during the training process by learning the joint distribution of the latent vector and the data in a completely adversarial setting. However, the quality is limited, partially because of the choice of DCGAN [32] and partially because of the significant difference in dimensionality and distribution between the latent variable and the data domain [15]. More recent models target the StyleGAN architecture [35, 44, 52] and achieve impressive results. Most leverage StyleGAN peculiarities, i.e., they invert into the \(W^+\) space, so adaptation to other GAN backbones is non-trivial. Adversarial latent auto-encoders [31] are closest to our work; our model and adversarial autoencoders can be made equivalent with a few alterations to the architecture and the optimization objective, which we discuss in more detail in Sect. 3.2. Our method, in contrast, neither uses any dataset-specific loss nor depends on a specific network architecture.

Hybrid Optimization and Regression Based: Guan et al. [20] train a regressor whose guess initializes an optimization-based method that then refines it. However, this approach is specific to human face datasets. Zhu et al. [51] modify the general hybrid approach with an additional criterion that encourages the recovered latent code to lie in the semantically meaningful region learned by the generator, assuming that the real image can be reconstructed more faithfully in the immediate neighbourhood of an initial guess given by a model-based inversion mechanism. Alaluf et al. [5] replace gradient-based optimization with an iterative encoder that maps the current generation and the target image to an estimated latent code difference. They empirically show that this iterative process converges and that the recovered image improves over iterations. However, this method requires multiple forward passes to arrive at a suitable latent code. In contrast to the work above, the inference module obtained by our method infers the latent code in one shot. Hence, it is much faster and does not run the risk of finding a non-meaningful latent code.

The inversion mechanisms presented so far do not directly influence the generative process; in most cases, they operate on a pre-trained, frozen generator. Although in the case of ALI [16] and BiGAN [14] the inference model loosely interacts with the generative model at training time, the interaction is only indirect, i.e., through the discriminator. In our work, we tightly couple the inference module with the generative module.

Joint Training of Generator and Inference Model: We postulate that jointly training an inference module will help regularize GAN generators towards invertibility. This is inspired by the difficulty of inverting a pre-trained high-performance GAN. For instance, Bau et al. [8] invert PGAN [22], but for best results a two-stage mechanism is needed. Similarly, Image2StyleGAN [2] projects real images into the extended \(W^+\) space of StyleGAN, whereas, arguably, all generated images can be produced from the more compact z or w space. This is further evident from Wulff et al. [45], who find an intermediate latent space in StyleGAN that is more Gaussian than the assumed prior. However, they too use an optimization-based method, which is computationally expensive and at the same time specific to both the StyleGAN backbone and the specific data set. Finally, we refer readers to ‘GAN Inversion: A Survey’ [46] for a comprehensive review of related work.

3 Method

Goal: Our goal is to learn an inversion module alongside the generator during GAN training. An inversion module is a mechanism that returns a latent embedding of a given data point. Specifically, we seek a generator \(G: \mathbb {W}\rightarrow \mathbb {X}\) and an inference model \(D: \mathbb {X}\rightarrow \mathbb {W}\) such that \(x \approx G(D(x))\) for \(x \sim \mathbb {X}\), where \(\mathbb {X}\) denotes the data domain and \(\mathbb {W}\) the latent space. In practice, we reuse the GAN discriminator as this inference model D.

3.1 Architecture

We demonstrate InvGAN using DC-GAN, BigGAN and StyleGAN as the underlying architectures. Figure 1 shows a schematic of our model. We follow the traditional alternating generator-discriminator training mechanism. The generative path consists of three steps: (1) sampling a latent code \(z \sim \mathcal {N}(0, I)\), (2) mapping the latent code to the w space, \(w = M(z)\), and (3) using the mapped code to generate fake data, \(x = G(w)\), where M is a mapping network, G is the generator, D is the discriminator, and \(\mathcal {N}(0, I)\) is the standard normal distribution. In practice, the discriminator outputs the inferred latent code w in addition to the real/fake score; empirically, this shared design worked better than using two separate networks for discrimination and inference. From here on, we use \(\tilde{w}, c = D(x)\) to denote the latent code \(\tilde{w}\) inferred by the discriminator D and the real-fake classification decision c for a sample \(x\in \mathbb {X}\). Wherever obvious, we simply write D(x) for c, the discrimination decision only.
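A minimal PyTorch sketch of this shared design is given below; the convolutional trunk, layer sizes and head names are illustrative assumptions, not the exact InvGAN implementation.

```python
import torch
import torch.nn as nn

class TwoHeadDiscriminator(nn.Module):
    """Shared trunk with a real/fake head and a latent-inference head (illustrative)."""
    def __init__(self, latent_dim: int = 128, feat_dim: int = 512):
        super().__init__()
        # Placeholder trunk; in practice this would be the DC-GAN/BigGAN/StyleGAN
        # discriminator body operating on images.
        self.trunk = nn.Sequential(
            nn.Conv2d(3, 64, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
            nn.Conv2d(64, 128, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(128, feat_dim), nn.LeakyReLU(0.2),
        )
        self.rf_head = nn.Linear(feat_dim, 1)          # real/fake score c
        self.w_head = nn.Linear(feat_dim, latent_dim)  # inferred latent code w~

    def forward(self, x):
        h = self.trunk(x)
        return self.w_head(h), self.rf_head(h)  # (w_tilde, c)

# Generative path: z ~ N(0, I) -> w = M(z) -> x = G(w); inference: w_tilde, c = D(x).
```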

3.2 Objective

GAN Objective: The min-max game between the discriminator network and the generator network of vanilla GAN training is described as

$$\begin{aligned} \min _{G, M}\max _{D}\mathcal {L}_\text {GAN} = \min _{G, M}\max _{D}\left[ \mathbb {E}_{x \in \mathbb {X}}[\log D(x)] + \mathbb {E}_{z\in \mathbb {Z}}[\log (1-D(G(M(z))))]\right] . \end{aligned}$$
(1)

A naive attempt at an approximately invertible GAN would optimize \(\min _{G}\max _{D}\mathcal {L}_\text {GAN} + \min _{G, D}\left\Vert w - \tilde{w}\right\Vert _p\), where \(\left\Vert \bullet \right\Vert _p\) denotes an \(L_p\) norm. This loss function can be interpreted as an optimal transport cost; we discuss this in more detail at the end of this section. However, this arrangement, which we call the “naive model”, does not yield satisfactory results, cf. Sect. 4.4. This can be attributed to three factors: (1) latent codes w corresponding to real images are never seen by the generator; (2) no training signal is provided to the discriminator for inferring the latent codes of real images (\(w_R\)); (3) the distribution of \(w_R\) might differ from the prior distribution of w. We address each of these concerns with a dedicated loss term. Our naive model corresponds to adversarial autoencoders [31] if the real-fake decision is derived from a common latent representation. However, this forces the encodings of real and generated images to be linearly separable and degrades inference performance.

Minimizing Latent Space Extrapolation: Since, in the naive version, neither the generator nor the discriminator is trained with \(w_R\), the model relies entirely on its extrapolation behaviour. To reduce this distribution mismatch for the generator, we draw half the mini-batch of latent codes from the prior and take the other half from \(w_R\); i.e., \(w_\text {total} = w \mathbin {++}w_R,\ w \sim P(W)\), where \(\mathbin {++}\) denotes batch concatenation. By \(w \sim P(W)\) we denote the two-stage process \(w = M(z),\ z\sim P(Z)\). Together with the naive loss, this forms the first three terms of our full objective function given in Eq. 3.
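The sketch below illustrates how such a mixed latent batch could be assembled, assuming the two-headed discriminator interface sketched in Sect. 3.1; all names and shapes are illustrative.

```python
import torch

def mixed_latent_batch(M, D, real_images, half_batch, z_dim):
    """Build w_total = w ++ w_R: half prior codes, half codes inferred from real images."""
    z = torch.randn(half_batch, z_dim)        # z ~ P(Z)
    w = M(z)                                  # w ~ P(W) via the mapping network
    w_R, _ = D(real_images[:half_batch])      # codes inferred from real images; whether
                                              # gradients flow into D here is part of the
                                              # training schedule (see the next paragraph)
    return torch.cat([w, w_R], dim=0)         # batch concatenation "++"
```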

Pixel Space Reconstruction Loss: Since latent codes for real images are not given, the discriminator cannot be trained directly. However, we recover a self-supervised training signal by allowing the gradients from the generator to flow into the discriminator. Intuitively, the discriminator tries to infer latent codes from real images that help the generator reproduce those images. As shown in Sect. 4.4, this improves real image inversion tremendously. We enforce further consistency by imposing an image domain reconstruction loss between input and reconstructed real images. However, designing a meaningful distance function for images is a non-trivial task. Ideally, we would like a feature extractor function f that extracts low- and high-level features from the image such that two images can be compared meaningfully. Given such a function, a reconstruction loss can be constructed as

$$\begin{aligned} \mathcal {L}_\text {fm} = \left\Vert \mathbb {E}_{x \in \mathbb {X}}\left[ f(x) - f(G(w \sim P(W|x)))\right] \right\Vert _p \end{aligned}$$
(2)

A common practice in the literature is to use a pre-trained VGG [21, 50] network as the feature extractor f. However, deep neural networks are known to be susceptible to adversarial perturbations [47], which makes optimizing a perceptual loss error-prone. Hence, a combination of a pixel-domain \(L_2\) loss and a feature-space loss is typically used, but this often degrades quality. Consequently, we take the discriminator itself as the feature extractor f. Owing to the min-max setting of GAN training, using discriminator features instead of VGG features lets us avoid the perils of adversarial and fooling samples. The feature loss is shown in the second half of Fig. 1. Although this resembles the feature matching described by Salimans et al. [36], there is a crucial difference: as seen in Eq. 2, the latent code fed into the generator is drawn from the conditional distribution \(P(W|x):=\delta _{D(x)}(w)\) rather than the prior P(W), where \(\delta (x)\) denotes the Dirac delta located at x. This forces the feature distributions to match more precisely than the simple first-moment matching proposed in [36].
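A hedged sketch of Eq. 2 with the discriminator as feature extractor is given below; the `features` accessor on the discriminator and the single-layer matching are illustrative assumptions.

```python
import torch.nn.functional as F

def feature_matching_loss(D, G, real_images, p: int = 1):
    """Reconstruction loss in discriminator feature space (sketch of Eq. 2)."""
    w_tilde, _ = D(real_images)        # w ~ P(W|x) = delta_{D(x)}(w)
    recon = G(w_tilde)                 # reconstruction G(D(x))
    f_real = D.features(real_images)   # discriminator features of x (assumed accessor)
    f_fake = D.features(recon)         # discriminator features of the reconstruction
    return F.l1_loss(f_fake, f_real) if p == 1 else F.mse_loss(f_fake, f_real)
```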

Addressing Mismatch Between Prior and Posterior: Finally, we address the possible mismatch between the inferred and prior latent distributions (point (3) above) by imposing a maximum mean discrepancy (MMD) loss between samples of the two distributions, computed with an RBF kernel. This loss improves random sampling quality by providing a direct learning signal to the mapping network, and it forms the last term of our objective function as shown in Eq. 3.
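A minimal sketch of an RBF-kernel MMD estimator between prior codes and inferred codes is shown below; the bandwidth and the biased estimator are illustrative choices, not values prescribed by the paper.

```python
import torch

def mmd_rbf(w, w_R, sigma: float = 1.0):
    """Biased MMD^2 estimate between prior codes w and inferred codes w_R (RBF kernel)."""
    def kernel(a, b):
        d2 = torch.cdist(a, b) ** 2            # pairwise squared distances
        return torch.exp(-d2 / (2 * sigma ** 2))
    k_ww = kernel(w, w).mean()
    k_rr = kernel(w_R, w_R).mean()
    k_wr = kernel(w, w_R).mean()
    return k_ww + k_rr - 2 * k_wr
```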

Fig. 1.

We train InvGAN following a regular GAN. We use a second output head in the discriminator, besides the real/fake decision head, to infer the latent code of a given image. The stop-gradient symbol in the figure denotes no gradient propagation during the backpropagation step; placed on a model, it denotes ‘no training’. Red indicates data flow corresponding to real images.

Putting everything together gives the objective of our complete model, shown in Eq. 3. Note that the expectation operator \(\mathbb {E}_{w \mathbin {++}w_R}\) acts on several loss terms that are independent of \(w_R\) or w; in such cases, keeping in mind the identity \(c = \mathbb {E}[c]\) for a constant c adds clarity. Furthermore, here and in the rest of the paper, a plus operator \(+\) between two optimization processes indicates that both are performed simultaneously.

$$\begin{aligned} \begin{aligned} \min _{G, M}\biggl [\max _{D}\mathcal {L}_\text {GAN} + \min _{D}\Big [&\mathbb {E}_{w \mathbin {++}w_R}\big [\left\Vert M(z) - \tilde{w}\right\Vert _2^2 + \left\Vert (\tilde{w}\mathbin {++}w_R) - \tilde{\tilde{w}}\right\Vert _2^2 + \mathcal {L}_\textrm{fm} \\ {}&+ \textrm{MMD}\{w, w_R\}\big ]\Big ]\biggr ] \end{aligned} \end{aligned}$$
(3)
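To make the interplay of these terms concrete, the sketch below assembles the losses of one joint update following Eq. 3, reusing the feature-matching and MMD helpers sketched earlier; the `gan_loss` helper, the unit term weights and all module interfaces are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn.functional as F

def gan_loss(c_real, c_fake):
    # Discriminator side of the adversarial term in Eq. 1 (BCE-with-logits form);
    # the generator step uses the usual flipped targets.
    real = F.binary_cross_entropy_with_logits(c_real, torch.ones_like(c_real))
    fake = F.binary_cross_entropy_with_logits(c_fake, torch.zeros_like(c_fake))
    return real + fake

def invgan_losses(G, M, D, real_images, z_dim):
    z = torch.randn(real_images.shape[0], z_dim)
    w = M(z)                                    # w ~ P(W), two-stage sampling
    x_fake = G(w)                               # first generation cycle
    w_tilde, c_fake = D(x_fake)                 # codes inferred from generated images
    w_R, c_real = D(real_images)                # codes inferred from real images

    w_total = torch.cat([w_tilde, w_R], dim=0)  # w~ ++ w_R
    x_cycle = G(w_total)                        # second generation cycle
    w_tt, _ = D(x_cycle)                        # w~~ (second inference cycle)

    l_gan = gan_loss(c_real, c_fake)                   # adversarial term
    l_prior = ((w - w_tilde) ** 2).mean()              # ||M(z) - w~||_2^2
    l_cycle = ((w_total - w_tt) ** 2).mean()           # ||(w~ ++ w_R) - w~~||_2^2
    l_fm = feature_matching_loss(D, G, real_images)    # Eq. 2, sketched above
    l_mmd = mmd_rbf(w, w_R)                            # prior/posterior matching
    return l_gan, l_prior + l_cycle + l_fm + l_mmd     # all weights set to one
```

The alternating generator/discriminator updates implied by the min/max operators in Eq. 3 are omitted here for brevity.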

An Optimal Transport Based Interpretation: Neglecting the last three terms of Eq. 3, our method can be interpreted as a Wasserstein autoencoder (the GAN version, WAE-GAN) [42]. Consider a WAE whose data domain is our latent space and whose latent space is our image domain; if the encoder and the discriminator share weights, the analogy is complete. Our model can hence be thought of as learning the latent variable model P(W) by randomly sampling a data point \(x\sim \mathbb {X}\) from the training set and mapping it to a latent code w via a deterministic transformation. In terms of densities, this is written in Eq. 4.

$$\begin{aligned} P(W) := \int _{x\in \mathbb {X}} P(w|x)P(x)\textrm{d}x. \end{aligned}$$
(4)

As proven by Bousquet et al. [9], under this model the optimal transport problem \(W_c(P(W), P_D(W)) := \inf _{\Gamma \in P(w_1\sim P(W), w_2\sim P_D(W))}\left[ \mathbb {E}_{w_1, w_2\sim \Gamma } \left[ c(w_1, w_2)\right] \right] \) can be solved by finding a generative model G(X|W) such that its X marginal, \(P_G(X) = \mathbb {E}_{w\sim P(W)}G(X|w)\), matches the image distribution P(X). We ensure this by minimizing the Jensen-Shannon divergence \(D_\text {JS}(P_G(X), P(X))\) using a GAN framework. This leads to the cost function given in Eq. 5 when we choose the ground cost \(c(w_1, w_2)\) to be the squared \(L_2\) norm.

$$\begin{aligned} \min _{G, M}\max _{D}\mathcal {L}_\textrm{GAN} + \min _{G, M}\min _{D}\left\Vert w - \tilde{w}\right\Vert _2^2 \end{aligned}$$
(5)

Finally, we find that running the encoding/decoding cycle one more time lets us impose several constraints that improve the quality of the encoder and decoder networks in practice. This leads to our full optimization criterion, as described in Eq. 3. Note that, because of this extra cycle, our method is computationally less efficient than vanilla VAEs or WAEs, but by incurring this cost we avoid having to define a loss function in the image domain, which results in sharper image generation.

3.3 Dealing with Resolutions Higher Than the Training Resolution

Although StyleGAN [24] and BigGAN [10] have shown that it is possible to generate relatively high-resolution images, in the range of \(1024 \times 1024\) and \(512 \times 512\), their training is resource-intensive and the models are difficult to tune for new data sets. Equipped with invertibility, we explore a tiling strategy to increase the output resolution. First, we train an invertible GAN at a lower resolution (\(m \times m\)) and simply tile its outputs \(n \times n\) times with \(n^2\) latent codes to obtain a higher-resolution (\(mn \times mn\)) final image. The new latent space of \(n^2\) latent codes, obtained using the inference mechanism of the invertible GAN, can then be used for various purposes, as described in Sect. 4.3; reconstructions are visualized in Fig. 3. This process is similar in spirit to COCO-GAN [25]. The main difference, however, is that our model at no point learns to assemble neighbouring patches; indeed, the seams are visible if one squints at the generated images, e.g., in Fig. 3. A detailed study of tiling for generating images at higher resolution than the training domain is beyond the scope of this paper; we simply explore some naive settings and their applications in Sect. 4.3.
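The sketch below illustrates the tiled inversion and reconstruction procedure under illustrative shape assumptions; the helper names are not from the paper.

```python
import torch

def tiled_invert(D, image, m=32):
    """image: (3, H, W) with H and W multiples of m; returns one latent code per m x m patch."""
    _, H, W = image.shape
    patches = image.unfold(1, m, m).unfold(2, m, m)             # (3, H/m, W/m, m, m)
    patches = patches.permute(1, 2, 0, 3, 4).reshape(-1, 3, m, m)
    codes, _ = D(patches)                                       # n^2 latent codes
    return codes

def tiled_generate(G, codes, H, W, m=32):
    """Reassemble generated m x m patches into a (3, H, W) image."""
    n_h, n_w = H // m, W // m
    patches = G(codes).reshape(n_h, n_w, 3, m, m)
    return patches.permute(2, 0, 3, 1, 4).reshape(3, H, W)

def tiled_reconstruct(G, D, image, m=32):
    codes = tiled_invert(D, image, m)
    return tiled_generate(G, codes, image.shape[1], image.shape[2], m)
```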

4 Experiments

We test InvGAN on several diverse datasets (MNIST, ImageNet, CelebA, FFHQ) and multiple backbone architectures (DC-GAN, BigGAN, StyleGAN). For the mapping network in the generator, we use the standard 8-layer mapping network with StyleGAN and add a 2-layer mapping network to BigGAN and DC-GAN. Our method is evaluated both qualitatively (via style mixing, image inpainting, etc.) and quantitatively (via the FID score and the suitability for data augmentation for discriminative tasks such as classification). We found that the relative weights of the different terms in our objective function do not significantly impact the model’s performance. Therefore, for simplicity, we avoid tuning and set them all to one.

Table 1 shows random sample FIDs, middle point linear interpolation FIDs and test set reconstruction mean absolute errors (MAEs) of our generative model. We note here that interpolation FID and random generation FID are comparable to non-inverting GANs. This leads us to conclude that the inversion mechanism does not adversely impact the generative properties. We provide a definition, a baseline and an analysis of the inversion of a high-quality generator for uniform comparison in future work on GAN inversion. We highlight model-based inversion, joint training of the generative and inference model, and its usability in downstream tasks. We demonstrate that InvGAN generalizes across architectures, datasets, and types of downstream task.

Table 1. Here we report random sample FID (RandFID), FID of reconstructed random samples (RandRecFID), FID of reconstructed test set samples (TsRecFID), FID of the linear middle interpolation of test set images (IntTsFID), and reconstruction per-pixel, per-color-channel mean absolute error on test set images normalized between \(\pm 1\). All FID scores are evaluated against the training set using 500 and 50000 samples, separated by ‘/’. For the traditional MSE-optimization-based and In-Domain GAN inversion methods, the MSE errors are converted to MAE by taking the square root, averaging over the color channels and accounting for the re-normalization of pixel values between \(\pm 1\) (MAE\(\pm 1\)). Runtime is given in seconds per image, measured as wall-clock time on a V100 32GB GPU.

We start with a StyleGAN-based architecture on FFHQ and CelebA for image editing. Then we train a BigGAN-based architecture on ImageNet and show super-resolution and video key-framing by tiling in the latent domain, which lets us work with images and videos of higher resolution than the training data. We also show ablation studies with a DC-GAN-based architecture on MNIST. In the following sections we evaluate qualitatively by visualizing semantic editing of real images and quantitatively on various downstream tasks, including classification fairness, image super-resolution and image mixing.

4.1 Semantically Consistent Inversion Using InvGAN

GANs can be used to augment training data and substantially improve learning of downstream tasks, such as improving fairness of classifiers of human facial attributes [7, 33, 38, 39]. There is an important shortcoming in using existing GAN approaches for such tasks: the labeling of augmented data relies on methods that are trained independently on the original data set, using human annotators or compute-expensive optimization-based inversion. A typical example is dataset debiasing by Ramaswamy et al. [33]. For each training image, an altered example that differs in some attribute (e.g., age, hair color) has to be generated. This can be done in one of two ways: 1. by finding the latent representation of the ground truth image via optimization, or 2. by labeling random samples using classifiers pre-trained on the biased data set. Optimization-based methods are slow and not a viable option for on-demand/adaptive data augmentation. Methods using pre-trained classifiers inherit their flaws, e.g. spurious-correlation-induced dependencies. Having access to a high-quality inversion mechanism helps us overcome such problems [54].

To verify that InvGAN is indeed suitable for such tasks, we train ResNet50 attribute classifiers on the CelebA dataset. We validate that the encoding and decoding of InvGAN yields semantically consistent reconstructions by training the classifier only on reconstructions of the full training set. As a baseline, we use the same classifier trained on the original CelebA. We produce two reconstructed training sets, one using the tiling-based inversion (trained on ImageNet) and one by training InvGAN on CelebA (without tiling). For each attribute, a separate classifier was trained for 20 epochs. The resulting mean average precisions are reported in Table 2. We see that training on the reconstructions allows for very good domain transfer to real images, indicating that the reconstruction process maintained the semantics of the images.
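A hedged sketch of this evaluation protocol is given below; the data loader, hyper-parameters and single-attribute head are illustrative assumptions, not the settings used in the paper.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet50

def train_attribute_classifier(recon_loader, epochs=20, lr=1e-4, device="cuda"):
    """Train a binary attribute classifier on reconstructed CelebA images."""
    model = resnet50(num_classes=1).to(device)         # one binary attribute head
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.BCEWithLogitsLoss()
    for _ in range(epochs):
        for images, labels in recon_loader:            # reconstructions + original CelebA labels
            images, labels = images.to(device), labels.float().to(device)
            opt.zero_grad()
            loss = loss_fn(model(images).squeeze(1), labels)
            loss.backward()
            opt.step()
    return model                                       # evaluate on real CelebA test images
```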

Table 2. Mean average precision for a ResNet50 attribute classifier on CelebA, averaged over 20 attributes. We report the performance for training on the original dataset, the reconstructed dataset using the tiling-based method pre-trained on ImageNet and the reconstruction on InvGAN trained on the CelebA training set directly.

4.2 Suitability for Image Editing

GAN inversion methods have been proposed for machine-supported photo editing tasks [12, 30, 51]. Although there is hardly any quantitative evaluation of the suitability of a specific inversion algorithm or model, a variety of representative operations have been reported [3, 4, 51]. Among those are in-painting of cut-out regions of an image, image merging, and improving on amateur manual photo editing. Figures 2 and 8 in the appendix visualize those operations performed on FFHQ and CelebA images, respectively. We demonstrate in-painting by zeroing out a randomly positioned square patch and then simply reconstructing the image. Image merging is performed by reconstructing an image composed of two images simply placed next to each other. By reconstructing an image that has undergone manual photo editing, higher degrees of photo-realism are achieved. Quantitative metrics for such tasks are hard to define and hence scarce in prior work, since they depend on the visual quality of the results. We report reconstruction and interpolation FIDs in Table 1, in an effort to establish a baseline for future research. However, we acknowledge that an improvement in pixel fidelity of our reconstructions would greatly boost the performance of InvGAN on photo editing tasks. The experiments clearly show the general suitability of the learned representations for projecting out-of-distribution images onto the learned posterior manifold via reconstruction.

Fig. 2.

Benchmark image editing tasks on FFHQ (\(128\,\textrm{px}\)). Style mixing: we transfer the first \(0, 1, 2, \dots , 11\) style vectors from one image to another. For the other image editing tasks, pairs of images are input image (left) and reconstruction (right).

4.3 Tiling to Boost Resolution

Limitations in video RAM and instability of high-resolution GANs are prominent obstacles in generative model training. One way to bypass such difficulties is to generate the image in parts. Here we train our invertible generative model, a BigGAN architecture, on \(32\times 32\) random patches from ImageNet. Once the inversion mechanism and the generator are trained to satisfactory quality, we reconstruct both FFHQ and ImageNet images. We use \(256 \times 256\) resolution FFHQ images, tiling \(32\times 32\) px patches in an \(8\times 8\) grid, and \(128\times 128\) resolution ImageNet images, tiling them in a \(4\times 4\) grid. The reconstruction results are shown in Fig. 3. Given the successful reconstruction process, we explore the tiled latent space for tasks such as image deblurring and temporal interpolation of video frames.

Fig. 3.

Tiled reconstruction of random (a) FFHQ images and (b) ImageNet images. The left column shows the real images, the second the patch-by-patch reconstructions, and the third the absolute pixel-wise differences. Note that, interestingly, although the patches are reconstructed independently of each other, the errors lie mostly on the edges of the objects in the images, arguably the most information-dense regions of the images.

Image De-blurring: Here we take a low-resolution image, scale it to the target resolution using bicubic interpolation, invert it patch by patch, Gaussian-blur it, invert the blurred version as well, and linearly extrapolate in the deblurring direction. The deblurring direction is obtained by subtracting the latent code of the Gaussian-blurred image from the latent code of the bicubically upsampled image at the same resolution. The exact amount of extrapolation is left to the user; in Fig. 4 we show the effect of three different levels of extrapolation. Although our method is not trained for the task of super-resolution, by virtue of a meaningful latent space we can enhance image quality.
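The sketch below illustrates this procedure, reusing the tiled inversion helpers from Sect. 3.3; the blur kernel size and the extrapolation factor alpha are illustrative choices left to the user.

```python
import torchvision.transforms.functional as TF
from torchvision.transforms import InterpolationMode

def deblur(G, D, low_res, out_hw, alpha=1.0, m=32):
    """Deblur by extrapolating in the tiled latent space (illustrative sketch)."""
    H, W = out_hw
    up = TF.resize(low_res, [H, W], interpolation=InterpolationMode.BICUBIC)
    blurred = TF.gaussian_blur(up, kernel_size=9)
    w_up = tiled_invert(D, up, m)                 # codes of the upsampled image
    w_blur = tiled_invert(D, blurred, m)          # codes of its blurred version
    direction = w_up - w_blur                     # deblurring direction in latent space
    return tiled_generate(G, w_up + alpha * direction, H, W, m)
```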

Fig. 4.

Super resolution using extrapolation in the tiled latent space. From left, we visualize the original image, the low-resolution version of it, the reconstruction of the low-resolution version, and progressive extrapolation to achieve deblurring.

Temporal Interpolation of Video Frames: Here we boost the frame rate of a video post-capture. We infer the tiled latent codes of consecutive frames in a video and linearly interpolate each tile to generate one or more intermediate frames. Results are shown in Figure 6 in the appendix and in the accompanying videos in the supplementary material. Concretely, we find the latent code of each frame in a video sequence and derive intermediate latent codes by a weighted average of neighboring codes using a Gaussian window. We use the UCF101 data set [40] for this task. Note, however, that since there is no temporal constraint and each patch is interpolated independently, a flickering effect is noticeable. Further studies of this are left to future work.

As can also be seen, this process produces discontinuities at the boundaries of the patches, because neighboring patches are modeled independently of one another in this work. This could be addressed in a variety of ways, e.g., by carefully choosing overlapping patch patterns, or by a predetermined or learned stitching mechanism with seam detection and correction [1, 49]. However, a thorough study in this direction is considered out of scope for the current work.
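The temporal interpolation described above can be sketched as follows, again reusing the tiled inversion helpers from Sect. 3.3; the window bandwidth and the number of inserted frames are illustrative assumptions.

```python
import torch

def interpolate_frames(G, D, frames, n_new=1, sigma=0.5, m=32):
    """frames: list of (3, H, W) tensors; returns decoded key frames plus intermediate frames."""
    H, W = frames[0].shape[1:]
    codes = torch.stack([tiled_invert(D, f, m) for f in frames])   # (T, n^2, d)
    t_axis = torch.arange(len(frames), dtype=torch.float32)
    out = []
    for t in range(len(frames) - 1):
        out.append(tiled_generate(G, codes[t], H, W, m))           # decoded key frame
        for k in range(1, n_new + 1):
            tau = t + k / (n_new + 1)                              # fractional time stamp
            weights = torch.exp(-(t_axis - tau) ** 2 / (2 * sigma ** 2))
            weights = (weights / weights.sum()).view(-1, 1, 1)     # Gaussian window over frames
            out.append(tiled_generate(G, (weights * codes).sum(0), H, W, m))
    out.append(tiled_generate(G, codes[-1], H, W, m))
    return out
```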

4.4 Ablation Studies

Recall that the naive model defined in Sect. 3.2 is trained with the optimization \(\min _{G,M}\big [\max _D \mathcal {L}_\text {GAN} + \min _D\mathbb {E}_{z \sim P(Z)}\left\Vert M(z) - \tilde{w}\right\Vert _p\big ]\) (cf. Eq. 5). In Fig. 5a we show how our three main components progressively improve the naive model. As is apparent from the method section, the first major improvement comes from exposing the generator to latent codes inferred from real images. This matters primarily because of the difference between the prior and the induced posterior distribution, especially during early training, which leaves a lasting impact. The corresponding optimization is \(\min _{G,M}\big [\max _D \mathcal {L}_\text {GAN} + \min _D\mathbb {E}_{w=M(z \sim P(Z)) \mathbin {++}w_R}\left\Vert w - \tilde{w}\right\Vert _p\big ]\). This reduces the distribution mismatch between prior and posterior by injecting inferred latent codes and improves inversion quality, as visualized in Fig. 5b. We call this model the augmented naive model. This modification also unlocks back-propagation of generator loss gradients into the discriminator and the pairing of real and generated images, as detailed in Sect. 3.2. This leads to our full model; the results are visualized in Fig. 5c.

Fig. 5.

Inversion of held-out test samples. Columns are in groups of three: the first column holds real images, the second their reconstructions and the third the absolute pixel-wise differences. (a) Inversion using the naive model, i.e. only the z reconstruction loss is used, (b) inversion using the model that uses latent codes from real samples, i.e., the augmented naive model, and (c) our full model. Notice how the imperfections in the reconstructions, highlighted with red boxes, gradually vanish as the model improves. (Color figure online)

5 Discussion and Future Work

While InvGAN can reliably invert the generator of a GAN, it would still benefit from improved reconstruction fidelity for tasks such as image compression and image segmentation. We observe that reconstructions of rare features, such as microphones, hats or backgrounds, tend to have lower quality, as seen in the appendix in Figure 7 (bottom row, 3rd and 4th columns). Combined with the fact that the reconstruction loss tends to saturate during training even when its weight is sufficiently high, this indicates that even well-engineered architectures such as StyleGAN and BigGAN lack the representational power to provide sufficient data coverage.

Strong inductive biases in the generative model have the potential to improve the quality of the inference module. For instance, GIF [18] and HoloGAN [29], among others, introduce strong inductive biases from the underlying 3D geometry and lighting properties of a 2D image. Hence, an inference module for such generative mechanisms has the potential to outperform counterparts trained fully supervised on labelled data alone at estimating 3D face parameters from 2D images.

As shown by the success of RAEs [17], there is often a mismatch between the induced posterior and the prior of generative models, which can be removed by an ex-post density estimator. InvGAN is also amenable to ex-post density estimation. Applied to the tiled latent codes, it would estimate a joint density of the tiles for unseen data, recovering a generative model without going through unstable GAN training.

We have shown that our method scales to large datasets such as ImageNet, CelebA, and FFHQ. Future work that improves reconstruction fidelity could explore adversarial robustness by extending [19] to larger datasets.

6 Conclusion

We presented InvGAN, an inference framework for the latent generative parameters used by a GAN generator. InvGAN enjoys several advantages compared to state-of-the-art inversion mechanisms. Since InvGAN is model based, inference is computationally efficient; this enables our mechanism to reconstruct images larger than the training images by tiling, with no additional merging step. Furthermore, the inversion mechanism is integrated into the training phase of the generator, which encourages the generator to cover all modes. We further demonstrated that the inferred latent code for a given image is semantically meaningful, i.e., it falls inside the structured part of the latent space learned by the generator.