
1 Introduction

There has been significant progress in image-to-image translation methods [6, 13, 20, 21, 23, 24, 35, 37], especially for facial attribute editing [5, 19, 25, 33, 38] powered by generative adversarial networks (GANs). A main challenge for facial attribute editing methods is to change only one attribute of an image without affecting others, such as the global lighting of the image, the identity of the person, the background, or other attributes. The other challenge is the interpretability of the style codes, so that one can control the attribute intensity of the edit, e.g. increase the intensity of a smile or of aging.

To edit the targeted attribute while preserving the others, many works employ a separate style encoder and an image editing network into which the modified styles are injected [5, 19]. During image-to-image translation, a style encoded from another image or a newly sampled style latent code can be used to output diverse images. To disentangle attributes, these works focus on style encoding and progress from a shared style code, SDIT [30], to mixed style codes, StarGANv2 [5], to hierarchical disentangled styles, HiSD [19]. Among these works, HiSD independently learns styles for each attribute (bangs, hair color, glasses) and introduces a local translator that uses attention masks to avoid global manipulations. HiSD showcases success on these three local attribute editing tasks but is not tested on global attribute editing, e.g. age or smile. Furthermore, one limitation of these works is the uninterpretability of the style codes, as one cannot control the intensity of an attribute (e.g. blondness) in a straightforward manner.

To overcome the challenges of the facial attribute editing task, we propose VecGAN, a novel image-to-image translation framework with interpretable latent directions. Our framework does not require a separate style encoder as in previous works, since we achieve the translation directly in the encoded latent space. The attribute editing directions are learned in the latent space and regularized to be orthogonal to each other for style disentanglement. The other component of our framework is the controllable strength of the change, a scalar value. This scalar can be either sampled from a distribution or encoded from a reference image by projection in the latent space. Our framework not only achieves significant improvements over the state of the art for both local and global edits but also provides, by design, a knob to control the intensity of the edited attribute.

Fig. 1. Attribute editing results of VecGAN. The first column shows the source images, and the other columns show the results of editing a specific attribute. Each edited image has an attribute value opposite to that of the source. For hair color, sources are translated to brown, black, and blonde hair, respectively.

VecGAN is encouraged by the finding that well-trained generative models organize their latent space as disentangled representations with meaningful directions in a completely unsupervised way. Exploring such interpretable directions in the latent codes of fixed, pretrained GANs has emerged as an important research endeavor [9, 26, 27, 28, 32]. These works show that images can be mapped to the GAN's latent space and edits can be achieved by manipulations in that space. However, since these models are not trained end-to-end, the results are sub-optimal, as will also be shown in our experiments.

To enable VecGAN, we use a deeper neural network architecture than previous image-to-image translation networks. Image-to-image translation methods such as the state-of-the-art HiSD [19] use a network with small receptive fields that decreases the image resolution only by a factor of four in the encoder. However, we want the latent space to be organized such that we can take meaningful linear steps in it. Therefore, images should be encoded into a spatially smaller feature space, and the network should have a full understanding of the image. For that reason, we set up a deep encoder and decoder architecture, but such a network then faces the challenge of reconstructing all the details of the input image. To solve this problem, we use a skip connection between the encoder and decoder, but only at a low resolution, to find the optimal equilibrium between information flow with and without the dimensionality-reduction bottleneck. In summary, our main contributions are:

  • We propose VecGAN, a novel image-to-image translation network that is trained end-to-end with interpretable latent directions. Our framework does not employ a separate style network as in previous works, and translations are achieved with a single deep encoder-decoder architecture.

  • VecGAN enables both reference attribute copy and attribute strength manipulation. Reference style encoding is designed in a novel way by using the same encoder as the translation pipeline. First, the encoder is used to obtain the latent code of a reference image; this is followed by projecting the code onto the learned latent directions of the different attributes.

  • We conduct extensive experiments to show the effectiveness of our framework and achieve significant improvements over the state of the art for both local and global edits. Qualitative results of our framework can be seen in Fig. 1.

2 Related Works

Image to Image Translation. Image-to-image translation algorithms aim at preserving a given content while changing targeted attributes. Examples range from translating semantic maps into RGB images [29], to translating summer images into winter images [13], to portrait drawing [36], and, very popularly, to editing faces [2, 5, 7, 11, 19, 25, 31, 33, 38]. These algorithms, powered by the GAN loss [8], adopt an encoder-decoder architecture. In models that learn a deterministic mapping from one domain to the other, images are processed with an encoder and a decoder to output translated images [24, 29]. In multi-modal image-to-image translation methods, the style is encoded separately from another image or sampled from a distribution [5, 12]. In the generator, style and content are either combined via concatenation [41], combined with a mask [19], or fed separately through instance normalization blocks [12, 42]. The generator also uses an encoder-decoder architecture [19, 34] that is separate from the style encoder. In our work, we are interested in designing each attribute as a learnable linear direction in the latent space, and we do not employ a separate style encoder, which results in a more intuitive framework.

Learning Interpretable Latent Directions. In another line of research, it has been shown that GANs trained to synthesize faces can also be used for face attribute manipulations [3, 15, 16]. Initially, these networks are not designed or trained to translate images but rather to synthesize high-fidelity images. However, it is shown that one can embed existing images into the GAN's embedding space [1] and, further, one can find latent directions to edit those images [9, 26, 27, 28, 32]. These directions are explored in supervised [26] and unsupervised ways [9, 27, 28, 32]. It is quite remarkable that, although the generative network is only taught to synthesize realistic images, it organizes its latent space such that linear shifts change a specific attribute. Inspired by these findings, we design our image-to-image translation framework such that a linear shift in the encoded features is expected to change a single attribute of an image. Different from previous works, our framework is trained end-to-end for the translation task and allows for reference-guided attribute manipulation via projection.

3 Method

We follow the hierarchical labels defined by [19]. For a single image, its attribute for tag \(i \in \{1,2,...,N\}\) can be defined as \(j \in \{1,2,...,M_i\}\), where N is the number of tags and \(M_i\) is the number of attributes for tag i. For example, i can be the hair color tag, and its attribute j can take the value black, brown, or blond.
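To make the label hierarchy concrete, the following toy structure (our own illustration, mirroring the CelebA-HQ tags used later in the experiments, not part of the paper's code) encodes tags and their attributes:

```python
# Illustrative tag/attribute hierarchy (toy structure, assumed names).
TAGS = {
    0: {"name": "hair color", "attributes": ["black", "brown", "blond"]},
    1: {"name": "glasses",    "attributes": ["with", "without"]},
    2: {"name": "bangs",      "attributes": ["with", "without"]},
}
N = len(TAGS)                                       # number of tags
M = [len(t["attributes"]) for t in TAGS.values()]   # M_i, attributes per tag i
```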

Our framework has two main objectives. As the main task, we aim to be able to perform the image-to-image translation task in a feature (tag) specific manner. While performing this translation, as the second objective, we also want to obtain an interpretable feature space which allows us to perform tag-specific feature interpolation.

3.1 Generator Architecture

For the image-to-image translation task, we set up an encoder-decoder based architecture with a latent space translation in the middle, as given in Fig. 2. We perform the translation in the encoded latent space, e, which is obtained by \(e = E(x)\), where E refers to the encoder. The encoded features go through a transformation T, which is discussed in the next section. The transformed features are then decoded by G to reconstruct the translated images. The image generation pipeline following feature encoding is described in Eq. 1.

$$\begin{aligned} e' = T(e, \alpha , i) \nonumber \\ x' = G(e') \end{aligned}$$
(1)
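The following minimal sketch illustrates the pipeline of Eq. 1 (together with the latent shift of Eq. 2, introduced in Sect. 3.2). It is our own illustration rather than the released implementation: the encoder, decoder, latent size, and tag count are toy stand-ins.

```python
# Minimal sketch of the VecGAN generation pipeline (Eq. 1); all module sizes are toy assumptions.
import torch
import torch.nn as nn

LATENT_DIM = 64      # assumed size of the flattened latent code e
NUM_TAGS = 3         # e.g. hair color, glasses, bangs

class ToyEncoder(nn.Module):
    """Stand-in for the deep encoder E (reduces an image to a flat latent code)."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Flatten(), nn.Linear(3 * 128 * 128, LATENT_DIM))
    def forward(self, x):
        return self.net(x)

class ToyDecoder(nn.Module):
    """Stand-in for the decoder G (maps the latent code back to an image)."""
    def __init__(self):
        super().__init__()
        self.net = nn.Linear(LATENT_DIM, 3 * 128 * 128)
    def forward(self, e):
        return self.net(e).view(-1, 3, 128, 128)

A = nn.Parameter(torch.randn(NUM_TAGS, LATENT_DIM))   # learnable direction per tag (Sect. 3.2)

def T(e, alpha, i):
    """Eq. 2: shift the latent code along the direction of tag i by scale alpha."""
    return e + alpha * A[i]

E, G = ToyEncoder(), ToyDecoder()
x = torch.randn(1, 3, 128, 128)       # dummy input image
e = E(x)                              # encode
e_shifted = T(e, alpha=0.5, i=0)      # translate tag 0 with an illustrative scale
x_out = G(e_shifted)                  # decode the edited image
```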

Previous image-to-image translation networks [5, 19, 34] set up a shallow encoder-decoder architecture to translate an image and a separate deep network for style encoding. In most cases, the style encoder includes separate branches for each tag. The shallow architecture used to translate images prevents the model from making drastic changes to the images, which helps preserve the identity of the person. Our framework is different, as we do not employ a separate style encoder and instead have a deep encoder-decoder architecture for translation. That is because, to organize the latent space in an interpretable way, our framework requires a full understanding of the image and therefore a larger receptive field, i.e. a deeper network architecture. A deep architecture with decreasing feature size, on the other hand, faces the challenge of reconstructing all the fine details of the input image.

With the motivation of helping the network preserve tag-independent features, such as the fine details of the background, we use skip connections between our encoder and decoder. However, we observe that the flow of information should be limited to force the encoder-decoder architecture to learn facial attributes and well-organized latent representations. For that reason, we only allow a skip connection at a low resolution. This design is extensively justified in our Ablation Studies.

Fig. 2. VecGAN pipeline. Our translator is built on the idea of interpretable latent directions. We encode images with an encoder to a latent representation in which we change a selected tag i, e.g. hair color, with a learnable direction \(A_i\) and a scale \(\alpha \). To calculate the scale, we subtract the source style scale from the target style scale. This operation corresponds to removing an attribute and adding an attribute. To remove the image's attribute, the source style scale is obtained by encoding and projecting the source image. To add the target attribute, the target style scale is either sampled from a distribution mapped for the given attribute j, e.g. blonde or brown, or obtained by encoding and projecting a reference image.

3.2 Translation Module

To achieve a style transformation, we perform the tag-based feature manipulation in a linear fashion in the latent space. First, we set a feature direction matrix A which contains learnable feature directions for each tag. In our formulation \(A_i\) denotes the learned feature direction for tag i. Direction matrix A is randomly initialized and learned during the training process.

Our translation module is formulated in Eq. 2, which adds the desired shift on top of the encoded features e, similarly to [28].

$$\begin{aligned} T(e, \alpha , i) = e + \alpha \times A_i \end{aligned}$$
(2)

We compute the shift by subtracting the source style from the target style, as given in Eq. 3.

$$\begin{aligned} \alpha = \alpha _t - \alpha _s \end{aligned}$$
(3)

Since the attributes are designed as linear steps along the learnable directions, we find the style shift by subtracting the source attribute scale from the target attribute scale. This way, the same target scale \(\alpha _t\) has the same impact on the translated images no matter what the attributes of the original images were. For example, if our target scale corresponds to brown hair, the source scale can come from an image with blonde or black hair, but since we take a step equal to the difference of the scales, both can be translated to an image with the same shade of brown hair.

To extract the target shifting scale \(\alpha _t\) for feature (tag) i, there are two alternative pathways. The first pathway, named the latent-guided path, samples \(z \in \mathcal {U}[0,1)\) and applies a linear transformation \(\alpha _t = w_{i,j} \cdot z + b_{i,j}\), where \(\alpha _t\) denotes the sampled shifting scale for tag i and attribute j. Here, tag i can be hair color and attribute j can be blonde, brown, or black hair. For each attribute we learn a different transformation module, denoted as \(M_{i,j}(z)\). Since we learn a single direction for every tag, for example hair color, this transformation module maps the initially sampled z to the correct scale along the linear direction based on the target hair color attribute. As the other alternative pathway, we encode the scalar value \(\alpha _t\) in a reference-guided manner. We extract \(\alpha _t\) for tag i from a provided reference image by first encoding it into the latent space, \(e_r\), and projecting \(e_r\) onto \(A_i\), as given in Eq. 4.

$$\begin{aligned} \alpha _t = P(e_r, A_i) = \dfrac{e_r \cdot A_i}{||A_i||} \end{aligned}$$
(4)
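A hedged sketch of this projection is given below; the tensor shapes and names are our own illustrative choices.

```python
# Sketch of the projection in Eq. 4: the scalar style of tag i is the component
# of a latent code along the learned direction A_i (toy tensors, illustrative names).
import torch

def project(e, A_i):
    """Scalar projection of latent code e onto direction A_i."""
    return (e * A_i).sum(dim=-1) / A_i.norm()

e_r = torch.randn(1, 64)       # latent code of a reference image, e_r = E(x_ref)
A_i = torch.randn(64)          # learned direction for tag i
alpha_t = project(e_r, A_i)    # target style scale extracted from the reference
```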

In the reference-guided set-up, we do not use the information of attribute j, since it is already encoded in the tag-i features of the reference image.

The source scale, \(\alpha _s\), is obtained in the same way we obtain \(\alpha _t\) from a reference image. We perform the projection for the corresponding tag we want to manipulate, i, by \(P(e, A_i)\). We formulate our framework with the intuition that the scale controls the amount of feature to be added. Therefore, especially when the attribute is copied over from a reference image, the amount of feature that needs to be added differs based on the source image. It is for this reason that we find the amount of shift by subtraction, as given in Eq. 3. Our framework is intuitive and relies on a single encoder-decoder architecture. Figure 2 shows the overall pipeline.
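The sketch below shows how the two pathways compose with Eq. 3 to produce the shift used by the translation module; the affine mapper and all sizes are toy assumptions of ours.

```python
# How the shift scale alpha (Eq. 3) is formed in the two pathways (illustrative stand-ins).
import torch
import torch.nn as nn

D = 64
A_i = torch.randn(D)                        # learned direction for tag i

def project(e, A_i):                        # Eq. 4
    return (e * A_i).sum(dim=-1) / A_i.norm()

class ScaleMapper(nn.Module):
    """Per-(tag, attribute) affine map M_{i,j}(z) = w * z + b."""
    def __init__(self):
        super().__init__()
        self.w = nn.Parameter(torch.ones(1))
        self.b = nn.Parameter(torch.zeros(1))
    def forward(self, z):
        return self.w * z + self.b

e = torch.randn(1, D)                       # latent of the source image
alpha_s = project(e, A_i)                   # source scale, always obtained by projection

# (a) latent-guided target scale
M_ij = ScaleMapper()
z = torch.rand(1)                           # z sampled from U[0, 1)
alpha_t = M_ij(z)

# (b) reference-guided target scale (alternative to (a))
e_ref = torch.randn(1, D)                   # latent of a reference image
# alpha_t = project(e_ref, A_i)

alpha = alpha_t - alpha_s                   # Eq. 3; used in T(e, alpha, i) of Eq. 2
```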

3.3 Training Pathways

Modifying the translation paths defined by [19], we train our network using two different paths. For each training iteration, we sample a tag i for the shift direction, a source attribute j as the current attribute, and a target attribute \(\hat{j}\).

Non-translation Path. To ensure that the encoder-decoder structure preserves the details of the images, we perform a reconstruction of the input image without applying any style shift. The resulting image is denoted as \(x_n\), as given in Eq. 5.

$$\begin{aligned} x_n = G(E(x)) \end{aligned}$$
(5)
Fig. 3. Overview of the cycle-translation path.

Cycle-Translation Path. We apply a cyclic translation to ensure that a translation from a latent-guided scale is reversible. In this path, as shown in Fig. 3, we first apply a style shift by sampling \(z \in \mathcal {U}[0,1)\) and obtaining the target \(\alpha _t\) with \(M_{i,\hat{j}}(z)\) for target attribute \(\hat{j}\). The translation uses \(\alpha \), obtained by subtracting the source style from \(\alpha _t\). The decoder generates an image \(x_t\), as given in Eq. 6, where \(e=E(x)\) denotes the encoded features of the input image x. \(x_t\) refers to the image without glasses in Fig. 3.

$$\begin{aligned} x_t = G(T(e, M_{i,\hat{j}}(z) - P(e, i), i)) \end{aligned}$$
(6)

Then, using the original image x as the reference image, we aim to reconstruct it by translating \(x_t\). Overall, this path attempts to reverse a latent-guided style shift with a reference-guided shift. The second translation is given in Eq. 7, where \(e_t=E(x_t)\).

$$\begin{aligned} x_c = G(T(e_t, P(e, i) - P(e_t, i), i)) \end{aligned}$$
(7)
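A compact sketch of the two steps of this path is given below; the encoder, decoder, mapper, and direction are toy stand-ins of ours, not the paper's implementation.

```python
# Sketch of the cycle-translation path (Eqs. 6-7) with toy stand-ins.
import torch
import torch.nn as nn

D, IMG = 64, 3 * 128 * 128
E = nn.Sequential(nn.Flatten(), nn.Linear(IMG, D))                    # toy encoder
G = nn.Sequential(nn.Linear(D, IMG), nn.Unflatten(1, (3, 128, 128)))  # toy decoder
A_i = torch.randn(D)                                                  # direction for tag i
w, b = torch.ones(1), torch.zeros(1)                                  # toy M_{i, j_hat}

def P(e):                        # projection onto A_i (Eq. 4)
    return (e * A_i).sum(dim=-1, keepdim=True) / A_i.norm()

def T(e, alpha):                 # latent shift along A_i (Eq. 2)
    return e + alpha * A_i

x = torch.randn(1, 3, 128, 128)          # input image with attribute j
e = E(x)

# Step 1 (Eq. 6): latent-guided translation to target attribute j_hat.
z = torch.rand(1)
alpha_t = w * z + b                      # M_{i, j_hat}(z)
x_t = G(T(e, alpha_t - P(e)))            # translated image

# Step 2 (Eq. 7): reference-guided translation back, with x as the reference.
e_t = E(x_t)
x_c = G(T(e_t, P(e) - P(e_t)))           # should reconstruct x
```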

In our learning objectives, we use \(x_n\) and \(x_c\) for the reconstruction losses, \(x_t\) and \(x_c\) for the adversarial losses, and \(M_{i,\hat{j}}(z)\) for the shift reconstruction loss. Details about the learning objectives are given in the next section.

3.4 Learning Objectives

Given an input image \(x_{i,j} \in \mathcal {X}_{i,j}\), where i is the tag to manipulate and j is the current attribute of the image, we optimize our model with the following objectives. In our equations, \(x_{i,j}\) is shown as x.

Adversarial Objective. During training, our generator performs a style shift either in a latent-guided way or in a reference-guided way, which results in a translated image. In our adversarial loss, we receive feedback from the two steps of the cycle-translation path. As the first component of the adversarial loss, we feed a real image x with tag i and attribute j to the discriminator as the real example. To give adversarial feedback to the latent-guided path, we use the intermediate image generated in the cycle-translation path, \(x_t\). Finally, to provide adversarial feedback to the reference-guided path, we use the final outcome of the cycle-translation path, \(x_c\). Only x acts as a real image; both \(x_t\) and \(x_c\) are translated images and are treated as fake images with their respective attributes. The discriminator aims at classifying whether an image, given its tag and attribute, is real or not. The objective is given in Eq. 8.

$$\begin{aligned} \begin{aligned} \mathcal {L}_{adv} = 2\log (D_{i,j}(x)) + \log (1 - D_{i, \hat{j}} (x_t)) + \log (1 - D_{i, j} (x_c)) \end{aligned} \end{aligned}$$
(8)
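The snippet below sketches how Eq. 8 can be evaluated with a toy per-(tag, attribute) discriminator head; the discriminator architecture and head indexing are our assumptions, not the paper's design.

```python
# Sketch of the discriminator objective in Eq. 8 (toy discriminator, illustrative indexing).
import torch
import torch.nn as nn

disc = nn.Sequential(nn.Flatten(), nn.Linear(3 * 128 * 128, 6), nn.Sigmoid())

def D_head(img, i, j):
    """Output of the discriminator branch for tag i, attribute j (assumes <= 3 attributes per tag)."""
    return disc(img)[:, i * 3 + j]

x   = torch.rand(1, 3, 128, 128)     # real image with tag i, attribute j
x_t = torch.rand(1, 3, 128, 128)     # latent-guided translation (fake, attribute j_hat)
x_c = torch.rand(1, 3, 128, 128)     # cycle output (fake, attribute j)
i, j, j_hat = 0, 1, 2
eps = 1e-8                           # numerical stability for the logs

L_adv = (2 * torch.log(D_head(x, i, j) + eps)
         + torch.log(1 - D_head(x_t, i, j_hat) + eps)
         + torch.log(1 - D_head(x_c, i, j) + eps)).mean()
# The discriminator ascends L_adv, while E, G, M, and A descend it (see Eq. 12).
```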

Shift Reconstruction Objective. As the cycle-translation path performs a latent-guided generation followed by a reference-guided generation, we utilize a loss function to make these two pathways consistent with each other [12, 17, 18, 19]. Specifically, we would like to obtain the same target scale, \(\alpha _t\), both from the mapping and from the projection of the translated image that was generated with the mapped \(\alpha _t\). The loss function is given in Eq. 9.

$$\begin{aligned} \mathcal {L}_{shift} = ||M_{i,\hat{j}}(z) - P(e_t, i)||_1 \end{aligned}$$
(9)

These quantities, \(M_{i,\hat{j}}(z)\) and \(P(e_t, i)\), are computed in the cycle-translation path as given in Eqs. 6 and 7.

Image Reconstruction Objective. In all of our training paths, the purpose is to be able to regenerate the original image. To supervise this desired behavior, we use an \(L_1\) reconstruction loss. In our formulation, \(x_n\) and \(x_c\) are the outputs of the non-translation path and the cycle-translation path, respectively. The formulation of this objective is provided in Eq. 10.

$$\begin{aligned} \begin{aligned} \mathcal {L}_{rec} = ||x_n - x||_1 + ||x_c - x||_1 \end{aligned} \end{aligned}$$
(10)

Orthogonality Objective. To encourage orthogonality between the directions, we use a soft orthogonality regularization based on the Frobenius norm, given in Eq. 11. This orthogonality further encourages disentanglement of the learned style directions.

$$\begin{aligned} \mathcal {L}_{ortho} = {\Vert A^{T}A - I\Vert _F} \end{aligned}$$
(11)
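In code, this regularizer is a one-liner; the layout of A below (one direction per column) is our assumption.

```python
# Soft orthogonality regularizer of Eq. 11, assuming the columns of A are the learned directions.
import torch

num_tags, D = 6, 64
A = torch.randn(D, num_tags, requires_grad=True)   # one direction per column
I = torch.eye(num_tags)
L_ortho = torch.norm(A.t() @ A - I, p='fro')       # ||A^T A - I||_F
```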

Full Objective. Combining all of the loss components described above, we reach the overall optimization objective given in Eq. 12. We additionally add an \(L_1\) loss on the parameters of the matrix A, \(\mathcal {L}_{sparse}\), to encourage its sparsity.

$$\begin{aligned} \begin{aligned} \underset{E,G,M,A}{\min }\ \underset{D}{\max }\ \lambda _{a}\mathcal {L}_{adv} + \lambda _{s} \mathcal {L}_{shift} + \lambda _{r} \mathcal {L}_{rec}+ \lambda _{o} \mathcal {L}_{ortho}+ \lambda _{sp} \mathcal {L}_{sparse} \end{aligned} \end{aligned}$$
(12)

To control the contribution of each loss component, we use the hyperparameters \(\lambda _{a}, \lambda _{s}\), \(\lambda _{r}\), \(\lambda _{o}\), and \(\lambda _{sp}\). These hyperparameter values and further training details are given in the Supplementary.
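The following sketch shows how the terms could be combined in a training step; the weight values and dummy loss terms are placeholders of ours, not the paper's hyperparameters.

```python
# Illustrative composition of the full objective in Eq. 12 (placeholder weights and dummy terms).
import torch

lambda_a, lambda_s, lambda_r, lambda_o, lambda_sp = 1.0, 1.0, 1.0, 1.0, 1.0
A = torch.randn(64, 6, requires_grad=True)            # learned direction matrix

# Dummy stand-ins for the loss terms of Eqs. 8-11.
L_adv, L_shift, L_rec, L_ortho = (torch.tensor(0.0, requires_grad=True) for _ in range(4))
L_sparse = A.abs().sum()                               # L1 sparsity on A

total = (lambda_a * L_adv + lambda_s * L_shift + lambda_r * L_rec
         + lambda_o * L_ortho + lambda_sp * L_sparse)
total.backward()    # E, G, M, A descend this objective; D ascends the adversarial term
```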

4 Experiments

4.1 Dataset and Settings

We train our model on the CelebA-HQ dataset [22], which contains 30,000 face images. To extensively compare with the state of the art, we follow two training-evaluation protocols:

Setting A. In our first setting, we follow the set-up of HiSD [19]. Following HiSD, we use the first 3,000 images of the CelebA-HQ dataset as the test set and the remaining 27,000 as the training set. These images include annotations for different attributes, of which we use the hair color, presence of glasses, and bangs tags for the translation task in this setting. The hair color tag includes three attributes, black, brown, and blonde, whereas the other tags are binary. The images are resized to \(128\times 128\). Following the evaluation protocol proposed by HiSD [19], we compute FID scores on the bangs addition task. Each test image without bangs is translated to an image with bangs using latent and reference guidance. For latent guidance, 5 images are generated for each test image by randomly sampling the scale from a uniform distribution. These generated sets of images are then compared, in terms of FID, with real images that have the bangs attribute. FIDs are calculated for the 5 sets and averaged. For reference guidance, we randomly pick 5 reference images to extract the style scale. FIDs are calculated for these 5 sets separately and averaged.

Setting B. In this setting, we follow the set-up of L2M-GAN [34]. The training/test split is obtained by re-indexing each image in CelebA-HQ back to the original CelebA and following the standard CelebA split. This results in 27,176 training and 2,824 test images. Models are trained for the hair color, presence of glasses, bangs, age, smiling, and gender attributes. Images are resized to \(256\times 256\) resolution. For evaluation, the smiling attribute is used, following L2M-GAN [34]. Smiling is noted to be one of the most challenging CelebA facial attributes, because adding or removing a smile requires a high-level understanding of the input face image and the simultaneous modification of multiple facial components. FIDs are calculated for adding and removing the smile attribute.

4.2 Results

We extensively compare our results with competing methods in Table 1. In Setting A, as given in Table 1a, we compare with the SDIT [30], StarGANv2 [5], Elegant [33], and HiSD [19] models. Among these methods, HiSD learns a hierarchical style disentanglement, whereas StarGANv2 learns a mixed style code. Therefore, when translating images, StarGANv2 also performs other, unnecessary manipulations and does not strictly preserve the identity. Our work is most similar to HiSD, as we also learn disentangled style directions. However, HiSD learns feature-based local translators, an approach known to be successful on local edits, e.g. bangs. Our results show that VecGAN achieves significantly better quantitative results than HiSD in both latent-guided and reference-guided evaluations, even though they are compared on a local edit task.

Table 1. Comparisons with state-of-the-art competing methods. Please refer to Sect. 4 for details on the training and evaluation protocols of Settings A and B.
Table 2. User study results conducted with smiling attribute. Smiling (+) denotes the results of adding a smile, Smiling (−) refers to the results of removing a smile, and Smiling (avg) denotes the average of Smiling (+) and Smiling (−). Percentages show the preference rates of our method versus the other competing method.
Fig. 4. Qualitative results for the bangs attribute of our model (VecGAN) and HiSD. In the second example, we provide a very challenging sample for which VecGAN, even though not perfect, achieves significantly better results than HiSD.

Figure 4 shows reference-guided results of our model versus HiSD. We compare with HiSD since it provides the best results after ours. As can be seen from Fig. 4, both methods achieve attribute disentanglement: they do not change any attribute of the image other than the bangs tag. However, HiSD outputs artifacts, especially for the reference image in the last column. VecGAN, on the other hand, outputs higher quality results. As the second case, we pick a very challenging example to compare the methods. Even though our result could be further improved to look more realistic, VecGAN achieves a significantly better output, with no artifacts, compared to HiSD.

In our second evaluation set-up, we compare our method with many state-of-the-art methods, as given in Table 1b. We compare with StarGAN [4], CycleGAN [40], Elegant [33], PA-GAN [10], InterFaceGAN [26], and L2M-GAN [34]. For InterFaceGAN, we use GAN inversion [39] as the encoder and a pretrained StyleGAN [15] as the generator backbone. As can be seen from Table 1b, we achieve significantly better scores in both directions (adding and removing the smile) and on average.

In our visual comparisons, we mainly focus on L2M-GAN and InterFaceGAN, since L2M-GAN is the second best model after ours and InterFaceGAN shares the same intuition as our model, performing edits by latent code manipulation. The results are shown in Fig. 5, where the first four examples show smile addition and the other four show smile removal manipulations. The most prominent limitation of L2M-GAN and InterFaceGAN is that they do not preserve the other attributes of the images, especially the background, whereas VecGAN does a very good job at that. L2M-GAN's smile addition and removal are better than InterFaceGAN's but worse than ours. VecGAN is the only method among them that can produce manipulated images with high fidelity to the originals, with only the targeted attribute manipulated, in a natural and realistic way.

We also conduct a user study on the first 64 images of the validation set with 10 users. We set up an A/B test and provide users with input images and the translations obtained by VecGAN and the other competing methods. The left-right order is randomized to ensure fair comparisons. We perform two separate tests. 1) Quality: We ask users to select the best result according to i) whether the smile attribute is correctly added, ii) whether irrelevant facial attributes are preserved, and iii) whether the output image overall looks realistic and high quality. 2) Fidelity: We ask users to also pay attention to whether details of the input image are preserved, in addition to the quality. When only asked about quality, users pay attention to facial attributes and do not pay much attention to the background, ornaments, details of the hair, and so on. In this test, we remind the users to pay attention to those as well. Table 2 shows the results of the user study. Users preferred our method over L2M-GAN \(59.45\%\) of the time (\(50\%\) would be a tie) and over InterFaceGAN \(82.82\%\) of the time for the quality measure, averaged over smile addition and removal results. When users are asked to pay attention to non-facial attributes as well, they prefer our method over L2M-GAN \(74.22\%\) of the time and over InterFaceGAN \(91.09\%\) of the time on average.

Fig. 5. Qualitative results for the smile attribute of our model (VecGAN), L2M-GAN, and InterFaceGAN. The first four examples show smile addition and the other four show smile removal manipulations.

4.3 Ablation Study

We conduct ablation studies for the network architecture and loss objectives, as given in Table 3. We first experiment with a shallower architecture, where the encoder decreases the input resolution of \(128\times 128\) to a spatial dimension of \(8\times 8\). This version gives reasonable scores; however, we are interested in a better latent space organization. For that, we use a deeper encoder-decoder architecture where the encoded latent space goes as low as \(1\times 1\), which we refer to as the deep architecture. The deep architecture without skip connections is not able to minimize the reconstruction objective and results in a high FID. On the other hand, the deep architecture with skip connections at every resolution from encoder to decoder can minimize the reconstruction loss, but the latent space is not well organized, since the model tends to pass all the information through the skip connections, which destabilizes the training. Our architecture, with a single skip connection at resolution \(32\times 32\), provides a good balance between the encoder-to-decoder information flow and the latent space bottleneck.

Fig. 6. Qualitative results of the ablation study on the orthogonality loss. The bangs tag is transferred from the reference image.

Table 3. FID results of ablation study with Setting A. Lat: Latent guided, Ref: Reference guided.

Next, we experiment with the effect of the loss functions. First, we remove the orthogonality loss on the A directions. This results in worse FID scores, but more importantly, we observe that the styles are not disentangled; e.g. changing the bangs attribute changes the gender, as can be seen in Fig. 6. Even without this loss function, we observe that the orthogonality loss of A decreases during training, but to a higher value than when this loss is added to the final objective. That is because the framework and the other loss objectives also encourage disentanglement of the attribute manipulations, and this shows in the orthogonality of the direction vectors. This also demonstrates the importance of orthogonality for style disentanglement, which the targeted loss helps improve significantly. We also observe that the sparsity loss applied to the direction vectors stabilizes the training; without it, FIDs are much higher.

Fig. 7. Results of gradually changing the strength of a manipulation. Each example shows a different attribute manipulation. Rows show bangs, hair color, gender, smile, glasses, and age manipulations, in this order.

4.4 Other Capabilities of VecGAN

Gradually Increased Scale. We translate images with gradually increased attribute strength, as shown in Fig. 7. We plot the manipulation results for six different attributes. These results show that the attributes, designed as linear transformations, are disentangled, and changing one attribute does not affect the other components. As the scales are gradually increased, the strength of the tag smoothly increases while the identity of the person is preserved.

Multi-tag Edits. We additionally experiment with multi-tag manipulation. To change two attributes, instead of encoding and decoding the image twice with a translation in between each time, we perform the two translation operations in the latent code simultaneously. That is, we apply Eq. 2 twice for two different tags i, as sketched below. Figure 8a shows the results of multi-tag edits. In the first row, we consider the gender and smile tags and first edit those attributes individually. In the last column, we edit the image with these two tags simultaneously. The second row shows a similar experiment with the smile and age tags. We observe that VecGAN provides disentangled tag control and can successfully edit tags independently.
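A minimal sketch of this double shift, with our own toy tensors and illustrative scales, decoded once at the end:

```python
# Multi-tag edit: Eq. 2 applied twice on the same latent code before a single decode (toy values).
import torch

D = 64
e = torch.randn(1, D)                  # encoded latent of the input image, e = E(x)
A = torch.randn(2, D)                  # learned directions for, e.g., gender and smile
alpha_gender, alpha_smile = 0.8, -0.5  # illustrative shift scales obtained via Eq. 3

e_edit = e + alpha_gender * A[0] + alpha_smile * A[1]
# x_edit = G(e_edit)                   # decode once to obtain the jointly edited image
```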

Fig. 8. Multi-attribute editing results and cross-dataset generalization results of VecGAN.

Generalization to Other Domains. We apply the VecGAN model to the MetFaces dataset [14] without any retraining. The results are provided in Fig. 8b. The first row shows the source images, and the second row shows the outputs of our model. In the first two examples, we increase the smile attribute, and in the other two, we decrease it. The results show that VecGAN has good generalization ability and works reasonably well across datasets.

5 Conclusion

This paper introduces VecGAN, an image-to-image translation framework with interpretable latent directions. The framework comprises a deep encoder and decoder architecture with latent space manipulation in between. The latent space manipulation is designed as vector arithmetic, where a linear direction is learned for each attribute. This design is encouraged by the finding that well-trained generative models organize their latent space as disentangled representations with meaningful directions in a completely unsupervised way. Each choice in the architecture and loss functions is extensively studied and compared with the state of the art. Experiments show the effectiveness of our framework.