1 Introduction

Facial image analysis, such as facial occlusion localization [52], 3D facial image analysis [37] and forensic person identification [7], is an important topic in computer vision and pattern recognition. Recently, another facial analysis topic, image-to-image translation, has developed rapidly thanks to the great success of deep neural networks. The goal of image-to-image translation is to translate an image from one domain to another while maintaining some invariance or consistency property that corresponds to the specific task [20, 32, 46, 53, 60]. Facial image editing is one kind of image-to-image translation problem, in which we manipulate the attributes of face images, i.e., whether certain attributes are present or not. Here, the term attribute denotes a high-level feature of a facial image, e.g., expression, hair color or age. We further denote the attribute value as a specific value of an attribute, e.g., neutral/smiling for expression, black/blond/brown for hair color or young/old for age, and a domain as the set of images sharing the same attribute value. The key challenge of facial image editing is that the transformation is ill-posed and the training set is unpaired, that is, it is practically infeasible to collect images with arbitrarily specified attributes for each person. The problem has attracted a lot of interest [9, 17, 29, 39, 56, 57]. In particular, researchers have applied deep generative models such as Bayesian inference [4, 14, 27, 41, 47], adversarial training [2, 13] and variational autoencoders (VAEs) [27, 41] to this problem and made significant progress. Among these approaches, variational autoencoders (VAEs) [27] and generative adversarial networks (GANs) [13] are the cornerstones.

To manipulate facial attributes in a given facial image, many approaches have been proposed [1, 48]. Liu et al. [32] and Lu et al. [35] learn pair-wise generators and discriminators for every pair of image domains. Although these models handle translation between two domains well, they are inefficient or ineffective for multiple-attribute editing. Some works learn to disentangle the attribute representation in the latent space and use one block or one dimension of the latent vector to represent one attribute; feeding the generator with the corresponding components of two images from different domains swapped is then expected to generate images with swapped attributes [25, 55, 56, 59]. However, these methods are criticized for their complex training pipelines and model structures. Larsen et al. [30], Radford et al. [40] and Upchurch et al. [51] compute, for each attribute, the average direction vector in the latent space between a pair of image sets with opposite attribute values; the specified attribute can then be added to or removed from an input image by adding the direction vector to, or subtracting it from, the image's latent variable. These models can add or remove attributes easily, but it has been reported that the average direction vectors are not orthogonal and often entangle highly correlated attributes, so the generation tends to transfer unwanted attributes as well. Bao et al. [3], Choi et al. [9], He et al. [17], Lample et al. [29], Perarnau et al. [39] and Yan et al. [57] extend CVAE [45] or CGAN [38] and inject different attribute labels to guide the image generation.

Although these models succeed in generating new images, they still suffer from one or several of the following limitations:

  • generated images are in low resolution or with lots of artifacts [39, 57];

  • label-paired images in different domains are needed to train the model [20, 53];

  • multiple attribute editing is infeasible [24, 32, 60];

  • models combining VAEs and GANs involve three or more deep mappings, which increases the training complexity [9, 17];

  • some models cannot generate facial images randomly with specified attributes [9, 17, 29, 39, 57].

Thanks to its nice manifold representation and stable training mechanism, the VAE is theoretically well-founded. However, VAEs tend to generate blurry images due to the limited representation ability of the inference distribution, the inherent over-regularization induced by the Kullback-Leibler divergence term and the imperfect reconstruction error [10, 49]. GAN [13] introduces a generator and a discriminator for adversarial learning. When the network reaches equilibrium, the fake (generated) images have the same distribution as the real images, which makes the generated images look more realistic. The emergence of GANs has attracted so much attention that they have overshadowed other generative models and become one of the dominant approaches for generating images with surprising complexity and realism. One drawback of GANs is that the training optimization is unstable and easily leads to the mode-collapse problem [42].

The motivation of our model is quite straightforward: we want to propose a novel model that inherits the advantages of VAE and GAN while avoiding their disadvantages. In addition, we want the model to support multiple-attribute facial image editing rather than manipulating only one attribute at a time. To achieve these goals, we develop a novel framework that incorporates the ideas of VAE and GAN. We first divide the latent space learned by the encoder into two independent subspaces, the attribute-irrelevant subspace and the attribute-relevant subspace. The attribute-irrelevant subspace represents factors such as identity, pose and illumination, while the attribute-relevant subspace represents the attributes that we can edit, such as hair color, hair style, gender and age. Thus, each facial image can be represented in the latent space by an identification vector together with an attribute vector in which each component corresponds to one attribute. During editing, we keep the identification vector unchanged and only manipulate the attribute vector to generate images with the specified attributes. We further show that, without increasing the model complexity or introducing additional deep mappings, adversarial training can be implemented via the encoder and the decoder themselves. By viewing the KL divergence as a special form of regression and introducing a classification loss on the attribute-relevant variables, the encoder can be treated not only as a discriminator for real and generated samples but also as a classifier for attributes. Subsequently, adversarial training is introduced in the latent space to align the generated data distribution with the real data distribution while the attributes of the generated images are classified correctly. We have compared the proposed model with state-of-the-art algorithms for single-attribute and multiple-attribute facial editing. The quantitative and qualitative results show that the proposed model can produce impressive and high-quality images. To summarize, the contributions of this paper include:

  1. Based on the encoder-decoder architecture, the latent space of the proposed network is split into two independent subspaces, the attribute-irrelevant subspace and the attribute-relevant subspace;

  2. Based on the combination of VAE and GAN, an attribute-disentangled generative model is proposed, which involves only two deep mappings: the encoder and the decoder;

  3. Extensive experiments, including single-attribute and multiple-attribute facial image editing, were designed to evaluate the performance of the proposed model.

The remaining sections of the paper are organized as follows. Section 2 reviews the VAE and GAN as well as other related works. The proposed model is introduced in Section 3. Experimental results including quantitative and qualitative evaluation are presented in Section 4, followed by the conclusion in Section 5.

2 Related works

2.1 Variational autoencoders

Variational autoencoders (VAEs) [27, 41] were proposed to estimate flexible deep generative models by variational inference. A standard VAE consists of an encoder Enc and a decoder Dec. The encoder (also regarded as the recognition model) maps an input sample x to a distribution over the latent variable \(z\sim Enc(x) = q_{\phi }(z|x)\). The decoder (also regarded as the generative model) maps from this latent space to a distribution over images \( \tilde {x}\sim Dec(z)=p_{\theta }(x|z) \). VAE regularizes the encoder by imposing a prior over the latent distribution pθ(z), which is typically chosen as the standard Gaussian distribution N(0,I). The objective of VAE is to maximize the evidence lower bound (ELBO) of the log-likelihood log pθ(x):

$$ \begin{array}{ll} \mathcal{L}_{VAE} & = \mathbb{E}_{z\sim q_{\phi} (z|x)}[log p_{\theta} (x|z)] - D_{KL}(q_{\phi} (z|x) \parallel p_{\theta}(z)) \\& \leqslant log p_{\theta}(x) \end{array} $$
(1)

where DKL is the Kullback-Leibler divergence. The first term in Eq. 1 is a reconstruction error. If we assume that the decoder predicts a Gaussian distribution at each pixel, then it reduces to squared Euclidean error in the image space. The second term drives the recognition distribution towards the prior distribution which acts as a regularizer. Both qϕ(z|x) and pθ(z) are commonly assumed to be Gaussian, in which case the KL divergence can be computed analytically. Assume that mean μ(x) and covariance σ(x) of qϕ(z|x) are outputs of Enc(x) for a given input x, then the KL divergence can be derived as follows [3]:

$$ \begin{array}{ll} \mathcal{L}_{KL} = \frac{1}{2}(\mu (x)^{T}\mu (x)+ sum(exp(\sigma (x)) - \sigma (x) -1 )) \end{array} $$
(2)

In addition, a reparameterization of the recognition distribution in terms of auxiliary variables with fixed distributions is used so that the samples from the recognition model are a deterministic function of the inputs and auxiliary variables. That is, a latent sample z is drawn from qϕ(z|x) as follows: a random noise vector (also an auxiliary variable) \(\epsilon \sim N(0,I)\), for example, then z = gϕ(𝜖,x) = μ(x) + σ(x) ⊙ 𝜖, where μ(x) and σ(x) are outputs of Enc(x) for a given input x and ⊙ signifies an element-wise product.
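As a concrete illustration, the following sketch shows the reparameterization trick and the analytic KL term of Eq. 2, under the assumption that the encoder outputs the mean μ(x) and the log-variance σ(x) of qϕ(z|x); the function and variable names are illustrative only and not taken from any released implementation.

```python
# A minimal sketch (PyTorch) of the reparameterization trick and the analytic
# KL term of Eq. 2, assuming sigma(x) is the log-variance output of Enc(x).
import torch

def reparameterize(mu: torch.Tensor, log_var: torch.Tensor) -> torch.Tensor:
    """z = mu + std * eps with eps ~ N(0, I); gradients flow through mu and log_var."""
    eps = torch.randn_like(mu)                      # auxiliary noise with a fixed distribution
    return mu + torch.exp(0.5 * log_var) * eps

def kl_to_standard_normal(mu: torch.Tensor, log_var: torch.Tensor) -> torch.Tensor:
    """D_KL(N(mu, diag(exp(log_var))) || N(0, I)), summed over latent dimensions (Eq. 2)."""
    return 0.5 * torch.sum(mu.pow(2) + log_var.exp() - log_var - 1.0, dim=1)

# usage: a batch of 4 latent codes of dimension 512
mu, log_var = torch.zeros(4, 512), torch.zeros(4, 512)
z = reparameterize(mu, log_var)                     # shape (4, 512)
loss_kl = kl_to_standard_normal(mu, log_var).mean()
```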

One of the major disadvantages of VAE is that, because of the injected noise and imperfect element-wise measures such as the squared error, the generated samples are often blurry.

2.2 Generative adversarial networks

Generative Adversarial Networks (GANs) [13] consist of a generator G and a discriminator D that compete in a two-player minimax game. The goal of GANs is to let G learn a distribution pg(x) that matches the real data distribution pdata(x) via an adversarial process. D tries to distinguish a real image x from a synthetic one G(z), where z is an input noise variable sampled from a prior distribution pz(z), and G tries to synthesize realistic-looking images that can fool D. Concretely, D and G play the game with the value function V (D,G):

$$ \begin{array}{ll} \underset{G}{min} \underset{D}{max} V(D,G) &= \mathbb{E}_{x\sim p_{data}(x)}[log D(x)] \\ & + \mathbb{E}_{z\sim p_{z}(z)}[log(1-D(G(z)))]. \end{array} $$
(3)

It is proved that this minimax game has a global optimum when the distribution pg of the synthetic samples and the distribution pdata of the training samples are the same. Under mild conditions (e.g., G and D have enough capacity), pg converges to pdata.
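For concreteness, a minimal sketch of the two losses implied by Eq. 3 is given below, written with the binary cross-entropy form commonly used in practice; the generator update uses the familiar non-saturating variant rather than the literal min-max objective. `D` and `G` stand for any discriminator and generator networks and are assumptions for illustration, not networks defined in this paper.

```python
# A hedged sketch of the adversarial losses of Eq. 3 in PyTorch.
import torch
import torch.nn.functional as F

def discriminator_loss(D, G, x_real, z):
    # max_D  E[log D(x)] + E[log(1 - D(G(z)))], written as a minimization with BCE
    real_logits = D(x_real)
    fake_logits = D(G(z).detach())                  # stop gradients from reaching G
    return (F.binary_cross_entropy_with_logits(real_logits, torch.ones_like(real_logits)) +
            F.binary_cross_entropy_with_logits(fake_logits, torch.zeros_like(fake_logits)))

def generator_loss(D, G, z):
    # non-saturating generator objective: max_G E[log D(G(z))]
    fake_logits = D(G(z))
    return F.binary_cross_entropy_with_logits(fake_logits, torch.ones_like(fake_logits))
```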

When trained on image datasets, GANs can produce visually sharp and compelling samples. However, GANs are also criticized for training instability, which leads to the problem of mode collapse [42], meaning that the samples generated by GANs do not reflect the diversity of the underlying data distribution.

Many works have tried to improve the stability of training and the quality of the generated images from different perspectives. DCGAN [40], which adopts deconvolutional and convolutional neural networks to implement G and D, respectively, was the first GAN model to generate high-resolution images in a single shot, and many GANs are at least loosely based on the DCGAN architecture. WGAN [2] and WGAN-GP [15], which replace the Jensen-Shannon distance with the Wasserstein distance to form a new training objective, provide a solid theoretical foundation and make the GAN training process more stable.

2.3 Variants of combination of VAEs and GANs

Since both VAEs and GANs have their own advantages and disadvantages, several recent works have explored hybrid approaches that enable both sampling and inference like VAEs or autoencoders (AEs) while producing samples of quality comparable to GANs. Typically this is achieved by training an autoencoder (AE) jointly with one or more adversarial discriminators whose purpose is to improve the alignment of distributions in the latent space (AAE [36], IAN [6], AGE [50]), the data space (MRGAN [8], VAE/GAN [30]) or the joint (product) latent-data space (BiGAN [11], ALI [12]). These algorithms have demonstrated their power by generating excellent quantitative and visual results. However, while compounding autoencoding and adversarial training does improve VAEs and GANs, it comes at the cost of added complexity. In particular, these systems usually involve at least three [11, 12, 29, 30, 36] or four [17] deep mappings: an encoder for encoding representations, a decoder/generator for generating samples, a discriminator for discriminating real from generated samples and a classifier for classifying the attributes of samples.

By introducing additional conditioning, VAEs and GANs can also be trained to perform conditional generation, e.g., CVAE [45] and CGAN [38]. CGAN [38] extends GAN from unsupervised to conditional learning by feeding a conditional variable (e.g., the class label) to both the generator and the discriminator. CVAE-GAN [3] combines CVAE and CGAN for fine-grained category image generation.

AGE [50] directly sets up adversarial training between the encoder and the decoder of an AE and constrains both the real and the generated data distributions to match the prior distribution in the latent space. IntroVAE [19] replaces the AE in AGE with a VAE and preserves the advantages of VAEs, such as stable training and a nice latent manifold. Our work is partially inspired by AGE and IntroVAE, but differs in that our model extends the latent structure of VAE and is jointly trained in the manner of CGAN combined with a perceptual loss, which enables it not only to achieve better reconstruction but also to synthesize photo-realistic images with specified attributes.

2.4 Image to image translation

The pix2pix model [20] is trained by combining a cGAN with an L1 loss in a supervised manner, which requires paired training data; however, obtaining such data is usually expensive. DTN [46] presents a baseline formulation for unsupervised cross-domain image translation: it trains an image-conditional generator that includes a pre-trained function as an encoder and enforces that the translated image stays close to the original image in the latent space. CoGAN [33] learns a joint distribution of images in two different domains by enforcing a weight-sharing constraint on the layers of a pair of GANs, each of which is responsible for synthesizing images in one domain. UNIT [32] extends the CoGAN framework with VAEs and assumes that two different domains can be mapped to a shared latent space. CycleGAN [60] and DiscoGAN [24] train two mapping functions that are inverse to each other between two image domains, using a cycle consistency loss together with two domain-specific adversarially trained discriminators that distinguish between the domains.

2.5 Facial attribute editing

disCVAE [57] learns a disentangled latent variable that is split into a foreground part and a background part and is trained with CVAE [45] to improve generation quality and diversity. AD-VAE [16] trains VAEs by splitting the latent variables into different groups to learn a disentangled representation. IcGAN [39] separately trains a cGAN [38] and an encoder that inverts the mapping of the cGAN. DIAT [31] is a deep identity-aware attribute transfer model that modifies an attribute of a face image via adversarial learning. Shen and Liu [43] adopt a dual residual learning strategy to simultaneously train two generators for adding and removing a specific attribute, respectively. To tackle attribute transfer from an exemplar image with the targeted attribute, Kim et al. [25], GeneGAN [59], DNA-GAN [55] and ELEGANT [56] encode a source image and an exemplar image into their respective latent variables and swap the attribute-relevant latent codes as representations of the “crossbreed” (residual) images to achieve (multiple) attribute transfer.

Recently, several works have been proposed for editing multiple facial attributes simultaneously with high-quality image generation, using a single model trained only once on images from different domains. Fader Networks [29] employ adversarial learning on the latent representation of an autoencoder to learn an attribute-invariant representation. StarGAN [9] performs image-to-image translation among multiple domains using a single GAN model with a cycle consistency loss. AttGAN [17] uses an encoder-decoder architecture together with an attribute classifier and a discriminator and applies an attribute classification constraint to guarantee that the generated images are correctly changed to the desired attributes. All three models can handle the transfer of multiple face attributes and generate sharp images, but they are not able to generate facial images from randomly specified attributes.

3 Latent space adversarial variational autoencoder

We denote the training dataset \(\mathcal {D} = \left \{(x^{i},y^{i}) \right \}\), which consists of m (image, attribute) pairs, where xi is the i-th image and \(y^{i} \in \left \{0,1 \right \}^{n} \) is the corresponding n-dimensional attribute vector of xi. Each component \({y_{k}^{i}}\) of yi (we use the subscript k to refer to the k-th attribute) represents the k-th attribute value, which indicates whether xi has that attribute or not.

Our model is based on the encoder-decoder architecture. Concretely, the latent space mapped by the encoder is split into the attribute-irrelevant subspace \(\mathcal {Z}\) (we assume the prior distribution on this space is \(p(z) = \mathcal {N}(0,I)\)) and the attribute-relevant subspace \(\mathcal {A}\) (we denote the prior distribution on this space by p(a)). The former represents attribute-irrelevant factors such as identity, pose and background, while the latter represents attributes such as hair color, gender and the presence of glasses. The decoder maps these subspaces together back to the data space. Denote by qϕ(z|x) and qϕ(a|x) the approximate posterior distributions and assume that p(z) and p(a) are independent; then the ELBO can be rewritten as:

$$ \begin{array}{ll} log p_{\theta}(x) & \geqslant E_{q_{\phi}(a|x)q_{\phi}(z|x)}\left [log \frac{p_{\theta}(x,z,a)}{q_{\phi}(z|x)q_{\phi}(a|x)} \right ] \\ &= E_{q_{\phi}(a|x)q_{\phi}(z|x)}\left [log \frac{p_{\theta}(x|a,z)p(a)p(z)}{q_{\phi}(z|x)q_{\phi}(a|x)} \right ] \\ &= - D_{KL}(q_{\phi}(a|x)\parallel p(a) ) - D_{KL}(q_{\phi}(z| x) \parallel p(z)) \\ &+ E_{q_{\phi}(a|x),q_{\phi}(z| x)} log p_{\theta}(x| a,z) \\ &= \mathcal{L}_{ELBO}, \end{array} $$
(4)

where qϕ(z|x), qϕ(a|x) and pθ(x|a,z) are assumed to be multivariate Gaussian distributions. Further, we choose a conditional distribution p(a|y) as the prior of a instead, with \(p(a_{i}|y_{i})\sim N(y_{i},\sigma )\), where yi refers to the i-th binary attribute and σ is the standard deviation of p(a|y). Thus the training objective can be rewritten as

$$ \begin{array}{ll} \underset{\phi,\theta}{min} -\mathcal{L}_{ELBO} &= {\sum\limits_{i=1}^{n}} D_{KL}(q_{\phi}(a_{i}|x)\parallel p(a_{i}|y_{i})) \\ &+ \alpha D_{KL}(q_{\phi}(z| x) \parallel p(z)) \\ &- \beta E_{q_{\phi}(a|x),q_{\phi}(z| x)} log p_{\theta}(x| a,z), \end{array} $$
(5)

where α, β are hyper-parameters that control the relative importance of the different terms. The training pipeline of our model consists of two phases: a reconstruction training phase and an adversarial training phase.
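To make the first term of Eq. 5 concrete, the sketch below evaluates the closed-form KL divergence between a Gaussian posterior qϕ(ai|x), parameterized by a predicted mean and log-variance, and the label-centred prior N(yi, σ²). The function and argument names are illustrative assumptions, not the authors' code.

```python
# Closed-form Gaussian-to-Gaussian KL for the attribute term of Eq. 5 (a sketch).
import math
import torch

def attribute_kl(mu_a, logvar_a, y, prior_std=1.0):
    """Sum over attributes of D_KL(N(mu_a, exp(logvar_a)) || N(y, prior_std^2))."""
    prior_var = prior_std ** 2
    kl = 0.5 * (math.log(prior_var) - logvar_a
                + (logvar_a.exp() + (mu_a - y) ** 2) / prior_var - 1.0)
    return kl.sum(dim=1)        # one value per image, summed over the n attributes
```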

In the reconstruction training phase, since we specify the mean of the prior p(a|y) to be the binary attribute label y of the input image, the first term in Eq. 5 is discriminative. Hence qϕ(a|x) can also serve as a classifier for facial attributes; here we adopt the binary cross-entropy loss:

$$ \begin{array}{ll} \mathcal{L}_{attr\_real} = & - \sum\limits_{i=1}^{n} \left[ y_{i} log a_{i} + (1-y_{i}) log (1- a_{i}) \right], \end{array} $$
(6)

where ai indicates the predicted value for the i-th attribute. The third term in Eq. 5 is the reconstruction term. In order to overcome the shortcomings of the pixel-wise \(\ell _{2}\) loss, we use a perceptual loss to measure the similarity between the input image and its reconstruction. Perceptual loss is widely used to measure the content difference between images.

Denote by Φ(x)l the activation of the l-th hidden layer, with Cl channels and size Wl × Hl, when x is fed into the pre-trained VGG-19 [44] model Φ. Then the feature perceptual loss at this layer between the input image x and its reconstruction \(\bar {x}\) is defined as

$$ \mathcal{L}_{rec}^{\Phi,l} = \frac{1}{2C^{l}W^{l}H^{l}}\left \| {\Phi}(x)^{l} - {\Phi}(\bar{x})^{l}\right \|_{2}^{2}. $$
(7)

The total feature perceptual loss is then defined as a weighted sum of the per-layer losses over selected layers of Φ:

$$ \mathcal{L}_{rec} = \sum\limits_{l} \omega_{l} \mathcal{L}_{rec}^{\Phi,l}, $$
(8)

where ωl is the weight for the l-th hidden layer. The loss function of the reconstruction training phase can then be written as

$$ \mathcal{L}_{ir} = \mathcal{L}_{attr\_real}+ \alpha \mathcal{L}_{kl\_real} + \beta \mathcal{L}_{rec}, $$
(9)

where \( {\mathscr{L}}_{kl\_real}\) is the second term in Eq. 5 and α,β are hyper-parameters for balancing the losses.
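The feature perceptual loss of Eqs. 7 and 8 can be sketched with a fixed, pre-trained VGG-19 from torchvision as below. The layer indices are our assumption of where relu1_1, relu2_1 and relu3_1 sit in torchvision's VGG-19, and the weights follow the ωl = 0.5 setting reported in Section 4.3; this is illustrative, not the authors' implementation.

```python
# A minimal sketch of the weighted feature perceptual loss (Eqs. 7-8).
import torch
import torch.nn.functional as F
from torchvision import models

class PerceptualLoss(torch.nn.Module):
    def __init__(self, layer_ids=(1, 6, 11), weights=(0.5, 0.5, 0.5)):
        super().__init__()
        vgg = models.vgg19(pretrained=True).features.eval()
        for p in vgg.parameters():
            p.requires_grad_(False)                 # VGG is a fixed feature extractor
        self.vgg = vgg
        self.weights = dict(zip(layer_ids, weights))
        self.last_layer = max(layer_ids)

    def forward(self, x, x_rec):
        loss, h_x, h_rec = 0.0, x, x_rec
        for idx, layer in enumerate(self.vgg):
            h_x, h_rec = layer(h_x), layer(h_rec)
            if idx in self.weights:
                b, c, h, w = h_x.shape
                # Eq. 7: squared feature difference, normalized by 2 * C^l * W^l * H^l (and batch size)
                diff = F.mse_loss(h_x, h_rec, reduction="sum") / (2 * c * h * w * b)
                loss = loss + self.weights[idx] * diff
            if idx >= self.last_layer:
                break
        return loss
```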

In the adversarial training phase, since the KL divergence statistic DKL(qϕ(z|x) ∥ p(z)), which outputs a single number, can serve as a special form of regression, the encoder Enc can be regarded as both a discriminator and a classifier. In addition, acting as a generator, the decoder Dec can generate two types of fake images:

(1) fake image \(\tilde {x}\), which is generated by feeding Dec with random variables \(\tilde {z}\) and \(\tilde {y}\) drawn from p(z) and p(y);

(2) fake image \(\hat {x}\), which is generated by feeding Dec with the attribute-irrelevant latent variable \(\hat {z}\) of the input image x and another attribute latent variable \(\hat {y}\) drawn from p(y) that is different from the real attributes.

Hence, we end up with the following two objectives, which are optimized alternately during adversarial training:

  • For training the encoder, which also serves as a discriminator:

    $$ \begin{array}{ll} \underset{\phi}{min} \mathcal {L}_{enc} = & \gamma_{1} \mathcal{L}_{kl\_real} + \gamma_{2} \mathcal{L}_{attr\_real} \\ &+ max(0,m- \gamma_{3} \mathcal{L}_{kl\_fake}), \end{array} $$
    (10)

    where m is a positive margin and

    $$ \mathcal{L}_{kl\_fake}= \mathcal{L}_{kl\_\tilde{x}} + \mathcal{L}_{kl\_\hat{x}}, \\ $$
    (11)
    $$ \mathcal{L}_{kl\_\tilde{x}} = D_{KL}(q_{\phi}(z|\tilde{x})|| p(z)),\\ $$
    (12)
    $$ \mathcal{L}_{kl\_\hat{x}} = D_{KL}(q_{\phi}(z|\hat{x})|| p(z)), $$
    (13)

    where \(\tilde {x}\) is the randomly generated sample from \(p_{\theta }(x|\tilde {y},\tilde {z})\), with \(\tilde {y}\) and \(\tilde {z}\) sampled from p(y) and p(z), respectively, and \(\hat {x}\) is the sample generated from \(p_{\theta }(x|\hat {y},\hat {z})\), where \(\hat {y}\) is randomly sampled from p(y) and \(\hat {z}\) is the attribute-irrelevant latent variable of the input image.

  • For training the decoder, which also serves as a generator:

    $$ \underset{\theta}{min} \mathcal {L}_{dec} = \gamma_{4} \mathcal{L}_{kl\_fake} + \gamma_{5} \mathcal{L}_{attr\_fake}, $$
    (14)

    where \({\mathscr{L}}_{kl\_fake}\) is the same as in Eq. 11 and

    $$ \begin{array}{ll} \mathcal{L}_{attr\_fake} = \mathcal{L}_{attr\_\tilde{x}} + \mathcal{L}_{attr\_\hat{x}}, \end{array} $$
    (15)
    $$ \mathcal{L}_{attr\_\tilde{x}} = - {\sum}_{i=1}^{n} \left[ \tilde{y}_{i} log \tilde{a}_{i} + (1-\tilde{y}_{i}) log (1-\tilde{a}_{i}) \right], $$
    (16)
    $$ \mathcal{L}_{attr\_\hat{x}} = - {\sum}_{i=1}^{n} \left[ \hat{y}_{i} log \hat{a}_{i} + (1-\hat{y}_{i}) log (1-\hat{a}_{i}) \right], $$
    (17)

    where \(\tilde {a}\) and \(\hat {a}\) are the latent attribute variables of \(\tilde {x}\) and \(\hat {x}\), respectively.

The overall training loss function can be summarized as

$$ \mathcal{L}_{total} = \mathcal{L}_{ir} + \mathcal {L}_{enc} + \mathcal {L}_{dec}, $$
(18)

where the exact forms of each term are given in Eqs. 9, 10 and 14, respectively. Figure 1 summarizes the reconstruction training and the adversarial training phases of our model. The training procedure is presented in Algorithm 1. We name the proposed model Latent Space Adversarial Variational Autoencoder (LSA-VAE).
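To make the alternation between Eqs. 10 and 14 explicit, the sketch below walks through one adversarial-phase iteration. It assumes `enc` returns (μz, the log-variance of z, attribute logits) and `dec` maps a (z, attribute) pair to an image, and it reuses the `kl_to_standard_normal` helper from the VAE sketch in Section 2.1; every name and the exact data flow are assumptions for illustration, not the authors' released code.

```python
# A hedged sketch of one adversarial-phase iteration (Eqs. 10-14) in PyTorch.
import torch
import torch.nn.functional as F

def kl_to_standard_normal(mu, log_var):             # same helper as in Section 2.1's sketch
    return 0.5 * torch.sum(mu.pow(2) + log_var.exp() - log_var - 1.0, dim=1)

def adversarial_step(enc, dec, x, y, opt_enc, opt_dec, n_attr=13, z_dim=512,
                     g1=0.5, g2=0.5, g3=1.0, g4=1.0, g5=100.0, margin=1500.0):
    b = x.size(0)
    y_tilde = torch.randint(0, 2, (b, n_attr), device=x.device).float()   # attributes for x_tilde
    y_hat = torch.randint(0, 2, (b, n_attr), device=x.device).float()     # attributes for x_hat
    z_tilde = torch.randn(b, z_dim, device=x.device)                      # noise drawn from p(z)

    # ---- encoder update (Eq. 10): keep real KL small, push fake KL above the margin ----
    mu_z, logvar_z, a_logits = enc(x)
    x_tilde = dec(z_tilde, y_tilde).detach()        # fake image from prior noise
    x_hat = dec(mu_z, y_hat).detach()               # fake sharing x's attribute-irrelevant code
    mu_t, logvar_t, _ = enc(x_tilde)
    mu_h, logvar_h, _ = enc(x_hat)
    l_kl_real = kl_to_standard_normal(mu_z, logvar_z).mean()
    l_kl_fake = (kl_to_standard_normal(mu_t, logvar_t) +
                 kl_to_standard_normal(mu_h, logvar_h)).mean()
    l_attr_real = F.binary_cross_entropy_with_logits(a_logits, y)
    l_enc = g1 * l_kl_real + g2 * l_attr_real + torch.clamp(margin - g3 * l_kl_fake, min=0.0)
    opt_enc.zero_grad(); l_enc.backward(); opt_enc.step()

    # ---- decoder update (Eq. 14): regenerate the fakes so gradients reach dec ----
    with torch.no_grad():
        mu_z, _, _ = enc(x)
    x_tilde, x_hat = dec(z_tilde, y_tilde), dec(mu_z, y_hat)
    mu_t, logvar_t, a_t = enc(x_tilde)
    mu_h, logvar_h, a_h = enc(x_hat)
    l_kl_fake = (kl_to_standard_normal(mu_t, logvar_t) +
                 kl_to_standard_normal(mu_h, logvar_h)).mean()
    l_attr_fake = (F.binary_cross_entropy_with_logits(a_t, y_tilde) +
                   F.binary_cross_entropy_with_logits(a_h, y_hat))
    l_dec = g4 * l_kl_fake + g5 * l_attr_fake
    opt_dec.zero_grad(); l_dec.backward(); opt_dec.step()
    return l_enc.item(), l_dec.item()
```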

[Algorithm 1]

Fig. 1 Overview of our model. The meaning of each notation is explained in the corresponding text part

4 Experiments

4.1 Dataset

We evaluated the proposed model on the CelebA dataset [34], which contains 202,599 celebrity images, each annotated with 40 binary attributes. We selected 13 attributes, namely “Bald”, “Bangs”, “Black Hair”, “Blond Hair”, “Brown Hair”, “Bushy Eyebrows”, “Eyeglasses”, “Male”, “Mouth Slightly Open”, “Mustache”, “No Beard”, “Pale Skin” and “Young”, because they are more distinctive in appearance. Officially, CelebA is separated into a training set, a validation set and a testing set. We used the training set and validation set together to train our model and the testing set for evaluation. In our experiments, all images were cropped to the central 170 × 170 region, scaled down to 128 × 128 pixels and normalized to [− 1,1].
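The preprocessing just described can be sketched with torchvision as follows; using torchvision's built-in CelebA wrapper is our assumption about tooling, not a statement about the authors' data pipeline.

```python
# A minimal sketch of the CelebA preprocessing: central 170x170 crop,
# resize to 128x128, and normalization to [-1, 1].
from torchvision import datasets, transforms

preprocess = transforms.Compose([
    transforms.CenterCrop(170),
    transforms.Resize(128),
    transforms.ToTensor(),                          # scales pixels to [0, 1]
    transforms.Normalize([0.5] * 3, [0.5] * 3),     # maps [0, 1] -> [-1, 1]
])

celeba_train = datasets.CelebA(root="data", split="train", target_type="attr",
                               transform=preprocess, download=True)
```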

4.2 Network architecture

The details of the network architectures of LSA-VAE are shown in Tables 1, 2 and 3. Both the encoder and the decoder networks are mainly built from residual blocks, whose configuration is shown in Table 3. As in other VAEs, the mean μ(x) and covariance σ(x) in Eq. 2 are output by the encoder of our model to compute the KL divergence loss and to compute the attribute-irrelevant latent variable z. In addition, the attribute-relevant latent variable a is output by another branch of the encoder. For the decoder, instead of standard zero-padding, we used replication padding, i.e., the feature map of an input was padded with replications of its boundary. We also used nearest-neighbor upsampling with a scale factor of 2 instead of fractionally-strided convolutions. The meanings of the notations in Tables 1, 2 and 3 are as follows: Conv(d,k,s,p) denotes a convolutional layer with d as the dimension, k as the kernel size, s as the stride and p as the padding; BN denotes batch normalization; FC denotes a fully-connected layer; ResBlock(I,O) denotes a residual block with I and O as the numbers of input and output feature maps, respectively; and AvgPool(w) denotes average pooling with w as the window size.
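The building blocks described above can be sketched as below: a residual block with an optional 1×1 projection on the skip path, and a decoder upsampling stage using nearest-neighbor interpolation plus replication padding. The exact channel counts and layer ordering in Tables 1, 2 and 3 may differ; this is an illustration of the components only.

```python
# A hedged sketch of the residual block and the decoder upsampling stage.
import torch
import torch.nn as nn

class ResBlock(nn.Module):
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, kernel_size=3, stride=1, padding=1),
            nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, kernel_size=3, stride=1, padding=1),
            nn.BatchNorm2d(out_ch))
        # 1x1 projection when the channel count changes, so the skip connection matches
        self.skip = nn.Conv2d(in_ch, out_ch, kernel_size=1) if in_ch != out_ch else nn.Identity()

    def forward(self, x):
        return torch.relu(self.body(x) + self.skip(x))

# One decoder upsampling stage (channel counts are placeholders)
upsample_stage = nn.Sequential(
    nn.Upsample(scale_factor=2, mode="nearest"),    # nearest-neighbor x2 instead of deconvolution
    nn.ReplicationPad2d(1),                         # replication padding instead of zero-padding
    nn.Conv2d(256, 128, kernel_size=3, stride=1, padding=0),
    nn.BatchNorm2d(128), nn.ReLU(inplace=True))
```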

Table 1 The architecture of the encoder in LSA-VAE
Table 2 The architecture of the decoder in LSA-VAE
Table 3 The structure of the ResBlock in LSA-VAE

4.3 Training details

As illustrated in Algorithm 1, the reconstruction training phase and the adversarial training phase were alternated, using the ADAM optimizer [26] (β1 = 0.5, β2 = 0.999) with a batch size of 32 and a fixed learning rate of 0.0002. We set the dimensions of the attribute-irrelevant subspace and the attribute-relevant subspace to 512 and 13, respectively. The margin m in Eq. 10 was set to 1500. We used a combination of the relu1_1, relu2_1 and relu3_1 layers of the pre-trained VGG-19 model to compute the feature perceptual loss, and each ωl in Eq. 8 was set to 0.5. α and β in Eq. 9 were set to 1 and 0.5, respectively. γ1, γ2, γ3, γ4 and γ5 were set to 0.5, 0.5, 1, 1 and 100, respectively. We suggest training the model for 10 epochs with the reconstruction phase alone to obtain initial weights before performing the two phases iteratively.
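Under these settings, the optimizer configuration amounts to the short sketch below; `encoder` and `decoder` are placeholders for the networks of Tables 1 and 2 and are assumptions for illustration.

```python
# A minimal sketch of the optimizer setup described above.
import torch

def build_optimizers(encoder, decoder, lr=2e-4, betas=(0.5, 0.999)):
    opt_enc = torch.optim.Adam(encoder.parameters(), lr=lr, betas=betas)
    opt_dec = torch.optim.Adam(decoder.parameters(), lr=lr, betas=betas)
    return opt_enc, opt_dec
```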

Our experiments were conducted on a computer with a GTX TitanX GPU of 12GB memory.

4.4 Baseline models

We compared LSA-VAE with state-of-the-art baseline models including IcGAN [39], StarGAN [9] and AttGAN [17], which were reported to achieve the best performance for facial editing and are capable of manipulating images conditioned on multiple attributes with a single generator. The results were evaluated through quantitative and qualitative comparisons on single-attribute and multiple-attribute editing. For a fair comparison, all baselines were retrained on the CelebA dataset with the authors' released code using the thirteen attributes mentioned above. We briefly introduce these models in the following:

IcGAN combines an encoder with a cGAN model, where the cGAN learns the mapping \(G:\left \{ z,c \right \} \rightarrow x \) that generates an image x conditioned on both the random noise z and the conditional representation c. Trained on random samples of z and c and the corresponding images x synthesized by the cGAN, an encoder learns the inverse mappings \(E_{z}:x\rightarrow z\) and \(E_{c}:x\rightarrow c\). This allows images to be synthesized conditioned on arbitrary conditional representations.

StarGAN trains a single generator G that learns mappings among multiple domains and introduces an auxiliary classifier that allows a single discriminator to control multiple domains. That is, G translates an input image x into an output image y conditioned on the target domain label c, \(G:\left \{ x,c \right \} \rightarrow y \). The discriminator produces probability distributions over both sources and domain labels, \(D : x \rightarrow \left \{ D_{src}(x), D_{cls}(x) \right \} \), which means D can not only discriminate real from fake images but also classify domain labels. In addition, in order to preserve the content of the input image while changing only the domain-related part, StarGAN applies a cycle consistency loss to the generator, \(|| x-G(G(x,c),c{}^{\prime }) ||_{1}\), where c and \(c{}^{\prime }\) are the target and original labels, respectively.

AttGAN employs an encoder-decoder architecture and models the relation between the latent representation and the attributes, which differs from StarGAN. On the one hand, the encoder encodes an input xa into the latent representation z, and the decoder receives z and the original attributes a as input and is trained to reconstruct xa. On the other hand, the decoder receives z and the target attributes b as input and is trained to generate the target image \( \hat {x}_{b} \), which is expected to possess the target attributes while preserving identity. AttGAN also applies an attribute-classification constraint on the generated image to guarantee the correct change of the attributes.

4.5 Qualitative analysis

4.5.1 Single facial attribute editing

In this section, we compare the proposed model with the baselines in terms of single facial attribute editing. Qualitative results are presented in Figs. 2 and 3, where eight attributes were chosen to show the ability of these models at editing a single attribute. As shown in the figures, IcGAN produced blurry images and its reconstruction ability is very limited. StarGAN could generate sharper images than IcGAN, but its results contained some artifacts. In general, StarGAN, AttGAN and our model could edit attributes correctly. Both AttGAN and our model could reconstruct the original image much better than the other algorithms and generate more realistic results. For the “Pale Skin” attribute, StarGAN tended to generate a less natural skin color. For the “Bangs” attribute, the results of AttGAN seem to be less distinct. For a global attribute like “Gender”, both StarGAN and AttGAN tended to change male to female by adding makeup and lipstick, while our model attempted to change eye shapes and facial lines.

Fig. 2 Results of single facial attribute editing. The first and second columns from the left are the ground truth and its reconstruction, respectively. For each ground-truth image, every row demonstrates the results of single facial attribute editing by a different model; the rows from top to bottom are generated by IcGAN, StarGAN, AttGAN and our model, respectively

Fig. 3 Results of single facial attribute editing. The first and second columns from the left are the ground truth and its reconstruction, respectively. For each ground-truth image, every row demonstrates the results of single facial attribute editing by a different model; the rows from top to bottom were generated by IcGAN, StarGAN, AttGAN and our model, respectively

In summary, the proposed model could manipulate the attributes correctly, generate images with high visual quality and perform well on all the attribute tests, which is mainly due to the design of the architecture of the proposed network.

4.5.2 Multiple facial attribute editing

For a more comprehensive comparison, we evaluated these four models in terms of multiple-attribute facial editing. The comparison results are shown in Fig. 4. Similar to single-attribute editing, images generated by IcGAN showed distorted facial details and looked blurrier. StarGAN could edit attributes accurately, but some of its results look unnatural and it did not perform as well as AttGAN and our model in terms of identity preservation. As for AttGAN, when we tried to manipulate the gender and the hair color simultaneously (see row 5, column 6 and row 6, column 5 in Fig. 4), it also tended to change the hair style, for example making the hair short when changing female to male with a mustache, or making it long when changing male to female with blond hair. By contrast, our model can generate more natural and realistic images and handles the task of multiple-attribute facial editing much better.

Fig. 4 Comparison results of IcGAN, StarGAN, AttGAN and LSA-VAE on multiple-attribute facial editing. The first row from the top is the ground truth. Zoom in for better resolution

4.5.3 Attribute-conditioned image progression

Although our model was trained with discrete binary attribute values (0 or 1), we found that it is compatible with continuous attribute values in the testing phase and can generate a progression of attribute intensity. In order to demonstrate this attribute-conditioned progression, we manipulated the value of one dimension of the attribute variable, changing it smoothly from 0 to 1 while keeping all other latent variables fixed. As shown in Fig. 5, the samples generated by progression are visually consistent with the attribute description. By changing the value of an attribute such as “Mouth Open”, “Eyeglasses” or “Age”, the attribute intensity was strengthened or weakened smoothly, and other visual appearances irrelevant to the attribute of interest remained unchanged. In particular, the identity-related visual appearance was well preserved.
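A simple way to produce such a progression is sketched below: the attribute-irrelevant code and all other attribute dimensions are held fixed while one attribute value is swept from 0 to 1. `enc`, `dec` and the attribute index are illustrative assumptions about the interfaces, not the authors' code.

```python
# A hedged sketch of attribute-conditioned image progression.
import torch

@torch.no_grad()
def attribute_progression(enc, dec, x, attr_index, steps=7):
    mu_z, _, a = enc(x)                             # attribute-irrelevant code and attribute code
    frames = []
    for t in torch.linspace(0.0, 1.0, steps):
        a_t = a.clone()
        a_t[:, attr_index] = t                      # sweep only the chosen attribute
        frames.append(dec(mu_z, a_t))
    return torch.stack(frames, dim=1)               # shape (batch, steps, C, H, W)
```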

Fig. 5 Examples of attribute-conditioned image progression on adding or removing eyeglasses, opening or closing the mouth and changing age, respectively. Zoom in for better resolution

4.6 Random image generation

Compared with the baselines, our model is also capable of synthesizing diverse and realistic facial images given specific attributes. With this unique property, our model can not only generate realistic facial images from random noise as in recent works [5, 21, 22, 23], but can also control the attributes that we want the images to possess.

To examine this property of our model, we evaluated it on the task of attribute-conditional random image generation. To synthesize facial images with specific attributes, we generated the samples in Fig. 6 through the following process: first, noise variables were randomly sampled from a unit isotropic Gaussian distribution; second, the noise variables and specific attribute variables with values 1 or 0 were fed into the generator of LSA-VAE. As shown in Fig. 6, the four groups of facial images, synthesized with the attribute groups ‘Black Hair/Male/Young’, ‘Blond Hair/Female/Young’, ‘Brown Hair/Female/Mouth Open/Young’ and ‘Bald/Male/Old’, look photo-realistic with vivid texture. Within each group, the faces are different and cover several viewpoints. These results imply that the proposed model is capable of generating facial images with randomly specified attributes without retraining the model using label-wise samples.
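The generation procedure just described amounts to the short sketch below; the decoder interface and the attribute indices (which follow the thirteen-attribute ordering of Section 4.1) are hypothetical and used only for illustration.

```python
# A hedged sketch of attribute-conditional random generation.
import torch

@torch.no_grad()
def sample_with_attributes(dec, attr, n_samples=8, z_dim=512, device="cpu"):
    z = torch.randn(n_samples, z_dim, device=device)          # attribute-irrelevant noise ~ N(0, I)
    a = attr.to(device).unsqueeze(0).expand(n_samples, -1)    # same attribute vector for every sample
    return dec(z, a)

# e.g. "Black Hair / Male / Young" under the attribute ordering of Section 4.1
attrs = torch.zeros(13)
attrs[2], attrs[7], attrs[12] = 1.0, 1.0, 1.0                 # hypothetical indices
# images = sample_with_attributes(decoder, attrs)
```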

Fig. 6 Examples of attribute-conditional random image generation with our model. Images in each row were generated by feeding the decoder with noise variables and the fixed attributes listed on the left

More results of random image generation are shown in Fig. 7; all of them were synthesized with attributes randomly sampled from the thirteen attributes. For instance, the image in the second row and fifth column was generated by feeding the attributes “Black Hair”, “Bushy Eyebrows”, “Mouth Open”, “Pale Skin”, “Female” and “Young”.

Fig. 7 More examples of attribute-conditional random image generation. All of them were synthesized with random attributes. Zoom in for better resolution

4.7 Quantitative analysis

In this section, we performed two kinds of quantitative analysis: the similarity between source images and their reconstructions, and the quality of the generated images. For a fair comparison, 10,000 images from the testing set were randomly selected as the source image set; their reconstructed images and the images transformed over the 13 attributes by each model formed the reconstruction set and the transformation set, respectively. We evaluated the abilities of the models using four metrics.

Identity preservation is an important factor in facial image editing. We evaluated the similarity between the source images and the reconstructed images for each model. Three metrics were adopted: Peak Signal-to-Noise Ratio (PSNR), Structural Similarity Index Measure (SSIM) [54] and Learned Perceptual Image Patch Similarity (LPIPS) [58]. For PSNR and SSIM, a higher value means the source image and its reconstruction are more similar. For LPIPS, which evaluates similarity on deep features extracted by a pre-trained network such as VGG [44] or AlexNet [28], a lower value means the images are more similar. To evaluate the quality of the generated images, we used the Fréchet Inception Distance (FID) [18]. FID calculates the Fréchet distance, also known as the Wasserstein-2 distance, between the source images and the generated images in the feature space of an Inception network. FID has been shown to be consistent with human judgement and robust to noise; a lower FID value means the distribution of the generated images is closer to the distribution of the source images. In addition, the number of parameters of each model is also worth comparing, since a model with more parameters needs more time and memory to train. Thus, we also report the number of parameters of each model (in millions).
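For reference, PSNR and SSIM on a reconstruction pair can be computed with scikit-image as sketched below; LPIPS and FID are typically computed with their reference packages (e.g., lpips, pytorch-fid), which is an assumption about tooling rather than the authors' exact evaluation code.

```python
# A minimal sketch of the PSNR/SSIM reconstruction metrics (scikit-image >= 0.19).
import numpy as np
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

def reconstruction_scores(src: np.ndarray, rec: np.ndarray):
    """src, rec: HxWx3 uint8 images of the same size; higher is better for both metrics."""
    psnr = peak_signal_noise_ratio(src, rec, data_range=255)
    ssim = structural_similarity(src, rec, channel_axis=-1, data_range=255)
    return psnr, ssim
```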

All these comparison results are shown in Table 4, from which it can be seen that AttGAN achieved the best scores in terms of PSNR, SSIM, LPIPS and FID, and our model took second place. IcGAN produced the poorest evaluation results, which further confirms the conclusion drawn from Figs. 2, 3 and 4. Note that although the evaluation values of our model are slightly weaker than those of AttGAN, the number of parameters of the proposed model is much smaller than that of AttGAN, which implies that our model is more efficient and less complex.

Table 4 Quantitative comparison results of the baseline models and the proposed model, evaluated by PSNR, SSIM, LPIPS and FID. The number of parameters of each model is also listed in the last column

4.8 Ablation study

To study the contribution of each part of the proposed model, we conducted an ablation study. We established two variants of our model: one based on a plain convolutional network and one trained without the adversarial training phase, named LSA-VAE-CNN and LSA-VAE-part, respectively.

The network architectures of LSA-VAE-CNN are shown in Tables 5 and 6. Both the encoder and the decoder networks are based on plain deep convolutional neural networks. Since the only difference between LSA-VAE and LSA-VAE-CNN is the network architecture, we can analyze the effect of different network architectures on the quality of the generated images. In addition, owing to the different network structure, different parameter settings were adopted for training LSA-VAE-CNN. Specifically, LSA-VAE-CNN was trained with the ADAM optimizer [26] (β1 = 0.5, β2 = 0.999) with a learning rate of 0.0005. The margin m in Eq. 10 was set to 10. α and β in Eq. 9 were set to 0.5 and 100, respectively. γ1, γ2, γ3, γ4 and γ5 were set to 0.01, 100, 1, 100 and 10, respectively. We performed comparison experiments between LSA-VAE-CNN and LSA-VAE on the tasks of single-attribute and multiple-attribute editing; the results are shown in Figs. 8 and 9. As we can see, LSA-VAE-CNN could change image attributes accurately, but generated blurrier images than LSA-VAE, and the images generated by LSA-VAE-CNN lack details of skin and hair texture. Moreover, LSA-VAE-CNN could not preserve the facial identity as well as LSA-VAE. We also performed random image generation with LSA-VAE-CNN to examine whether it can synthesize realistic images given specific attributes. As shown in Fig. 10, LSA-VAE-CNN can synthesize acceptable facial images, but the over-smoothed faces and checkerboard texture in the hair regions make them look less like real faces. In short, LSA-VAE can generate much sharper and more realistic images than LSA-VAE-CNN. We suggest that the reason is that the skip connections in the residual blocks help enhance the image quality of the editing results.

Fig. 8 Comparison between LSA-VAE-CNN and LSA-VAE on the task of single-attribute facial editing

Fig. 9 Comparison between LSA-VAE-CNN and LSA-VAE on the task of multiple-attribute facial editing

Fig. 10 Examples of attribute-conditional random image generation based on LSA-VAE-CNN

Table 5 Network architecture of the encoder in LSA-VAE-CNN, a variant of LSA-VAE based on convolutional networks
Table 6 Network architecture of the decoder in LSA-VAE-CNN, a variant of LSA-VAE based on convolutional networks

LSA-VAE-part employs the same network architecture as in Tables 1 and 2, but performs only the reconstruction training phase without the adversarial training. In this case, we investigated the role of adversarial training in random image generation. As shown in Fig. 11, the images generated by LSA-VAE-part with specific attributes are blurry and seriously distorted compared with the results in Figs. 6 and 7, which demonstrates that the adversarial training plays an important role in the success of image synthesis.

Fig. 11 Examples of attribute-conditional random image generation with LSA-VAE without performing the adversarial training phase

5 Conclusion

This paper proposed a novel attribute-disentangled generative model for facial image editing conditioned on arbitrarily specified attributes by combining the advantages of variational autoencoders and generative adversarial networks. In the proposed model, the latent space is split into two independent subspaces. By introducing an adversarial training strategy in the latent space, the generated data distribution is driven towards the real data distribution, and meanwhile the generated images are used to train the encoder as a discriminator. We evaluated our model with attribute manipulation and random image generation experiments. The experimental results demonstrate that the proposed model can learn attribute-disentangled representations of facial images and generate face images with rich details and high visual quality.