
1 Introduction

Facial editing has made remarkable progress with the development of deep neural networks [18, 19]. An increasing number of methods use GANs to edit faces, either through image-to-image translation [21, 27] or by embedding images into the GAN's latent space [22, 29, 36, 37]. Recent studies have shown that StyleGAN2 encodes rich semantic information in its latent space and enables realistic editing of high-quality images.

Most GAN-based image editing methods fall into a few categories. Some works rely on image-to-image translation [15], using an encoder-decoder architecture that takes the source image and a target attribute vector as input [5, 7, 12], e.g., StarGAN [6], AttGAN [13], STGAN [24], and RelGAN [34]. AttGAN first combined attribute classification constraints, reconstruction learning, and adversarial learning. STGAN took the difference attribute vector as input and introduced selective transfer units into the encoder-decoder. Yue et al. proposed HifaFace [9], a novel wavelet-based face editing method; they observed that the generator learns a trick to satisfy the cycle-consistency constraint by hiding signals in the output images. Other works use the latent space of pre-trained GANs [38], e.g., Image2StyleGAN [1], Image2StyleGAN++ [2], InvertGAN [42], InterFaceGAN [28], and StyleFlow [3]. These methods look for disentangled latent variables suitable for image editing. However, they cannot obtain independent editing attributes, and the generated images are not consistent with the raw image.

Fig. 1. The results of adding eyeglasses with different face editing methods.

As shown in Fig. 1, we feed an input face image x into AttGAN and STGAN [24] and expect them to add eyeglasses to the face. Although the output of AttGAN does wear eyeglasses, unrelated regions have changed, e.g., the skin color shifts from yellow to white. The output of STGAN is nearly identical to the input image, but the eyeglasses are barely visible. In both cases, the edited facial images are inconsistent with the raw image because non-editing attributes/areas have changed.

To achieve consistent face editing, we propose a simple yet effective face editing method called SemanticGAN. We mainly address this problem from two aspects. First, we edit the image directly and only consider whether the target attribute is edited successfully, regardless of whether attribute-independent regions remain consistent. Second, we optimize the attribute vector to ensure that the attribute-independent regions stay consistent with the input.

Specifically, our method builds on a recently proposed StyleGAN variant in which face images and their semantic segmentations can be generated simultaneously, and only a small number of labeled images are required to train it. We embed images into the latent space of the GAN and edit the face image by adding attribute vectors, and we use the generated segmentation to optimize the attribute vector so that the result stays consistent with the input image. Alternatively, the attribute vector can be applied directly to real images without any optimization step.

Our contributions that advance the field of face image editing include:

  1) We propose a novel face editing method, named SemanticGAN, for consistent and arbitrary face editing.

  2) We design a few-shot semantic segmentation model, which requires only a small amount of annotated data for training and leverages the semantic knowledge of the generator to preserve identity.

  3) Both qualitative and quantitative results demonstrate the effectiveness of the proposed framework for improving the consistency of edited face images.

2 Related Works

2.1 Generative Adversarial Networks

Generative Adversarial Networks (GANs) [10, 11, 26, 41] have been widely used for image generation [32, 33]. A classic GAN is composed of two parts: a generator and a discriminator. The generator synthesizes images from noise that resemble real images, while the discriminator determines whether an image is real or generated. Recently, GANs have been developed to synthesize diverse faces from random noise, e.g., PGGAN [16], BigGAN [4], StyleGAN [18], StyleGAN2 [19], and StyleGAN3 [17], which encode critical information in intermediate features and latent spaces for high-quality face image generation. GANs are also widely used in other computer vision tasks such as image translation, image synthesis, super-resolution, image restoration, and style transfer.

2.2 Facial Image Editing

Facial image editing is a rapidly growing field [25, 30, 39, 42], thanks to the recent development of GANs. Generally speaking, existing methods can be divided into two categories. The first category utilizes image-to-image translation for face editing. StarGAN and AttGAN take a target attribute vector as input to the translation model and introduce an attribute classification constraint. STGAN enhances the editing performance of AttGAN by using the difference attribute vector as input to generate high-quality attribute edits. RelGAN proposed a relative-attribute-based method for multi-domain image-to-image translation. ELEGANT [35] transfers the same type of attribute from one image to another by swapping parts of the encoding. HiSD [21] realizes image-to-image translation through hierarchical style disentanglement for facial image editing. The second category uses pre-trained GANs, e.g., StyleGAN and StyleGAN2, and achieves facial image editing by changing the latent codes. Yujun Shen et al. proposed semantic face editing by interpreting the latent semantics learned by GANs. StyleFlow explores the latent space of unconditional GANs conditionally, using normalizing flows conditioned on semantic attributes. EditGAN [23] offers very high-precision editing. However, none of these methods gives fully satisfactory editing results: the first category produces images of relatively low quality, and the second often fails to obtain adequate latent codes. In this paper, we utilize the latent codes of StyleGAN2 to realize facial image editing and propose an effective framework to solve inconsistent editing.

3 Proposed Method

In this section, we introduce our proposed editing method, named SemanticGAN. Figure 2 gives an overview of our method, which mainly contains the following two parts: 1) attribute-related fine editing and 2) attribute-independent optimization.

3.1 Preliminary

Our model is based on StyleGAN2, which samples latent codes \(z \in Z\) from a multivariate normal distribution and maps them to real images. A latent code z is first mapped into a latent code \(w \in W\) by a fully connected mapping network and then extended into the \(W^{+}\) space, which controls the generation of images with different styles. The \(W^{+}\) space is the concatenation of k different w vectors, \(W^{+}=w^{0} \times w^{1} \times \ldots \times w^{k}\); because a multi-layer perceptron is learned in front of the generator, this space better decouples the attributes of the model. We therefore embed images into the GAN's \(W^{+}\) space using an encoder, and define the encoder and generator as \(E: x \rightarrow W^{+}\) and \(G_{x}: W^{+} \rightarrow x^{\prime }\). Following previous encoding works, we train an encoder that embeds images into the \(W^{+}\) space, adopting the Encoding in Style training scheme:

$$\begin{aligned} \mathcal {L}_{\text{ per }}(x)=\mathcal {L}_{LPIPS}\left( x,G_{x}(E(x))\right) +\lambda _{1}\left\| x-G_{x}(E(x))\right\| _{2} \end{aligned}$$
(1)

where \(\mathcal {L}_{LPIPS}\) is the Learned Perceptual Image Patch Similarity (LPIPS) distance.

$$\begin{aligned} \mathcal {L}_{I D}(x)=1-\left\langle R(x), R\left( G_{x}(E(x))\right) \right\rangle \end{aligned}$$
(2)
$$\begin{aligned} \mathcal {L}_{reg}(x)=\Vert E(x)-\bar{W}\Vert _{2} \end{aligned}$$
(3)

where E is the latent encoder, R denotes the pre-trained ArcFace [8] feature extraction network, \(\left\langle \cdot ,\cdot \right\rangle \) denotes cosine similarity, and \(\bar{W}\) is the average latent of the generator.

$$\begin{aligned} \mathcal {L}(x)=\lambda _{2}\mathcal {L}_{\text{ per } }(x)+\lambda _{3} \mathcal {L}_{ID}(x)+\lambda _{4} \mathcal {L}_{reg}(x) \end{aligned}$$
(4)

where \(\lambda _{1}\), \(\lambda _{2}\), \(\lambda _{3}\), and \(\lambda _{4}\) are constants defining the loss weight.
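For concreteness, the encoder training objective of Eqs. (1)-(4) can be sketched as follows. This is a minimal PyTorch-style sketch rather than our exact implementation: `encoder`, `generator`, `lpips_fn`, `arcface`, `w_avg`, and the weight values are placeholder assumptions.

```python
import torch
import torch.nn.functional as F

def encoder_loss(x, encoder, generator, lpips_fn, arcface, w_avg,
                 lam1=1.0, lam2=1.0, lam3=0.1, lam4=0.005):
    """Encoder objective of Eq. (4): perceptual + identity + latent regularization."""
    w_plus = encoder(x)            # embed x into W+ space
    x_rec = generator(w_plus)      # reconstruct the image from W+

    # Eq. (1): LPIPS distance plus a pixel-wise L2 reconstruction term
    l_per = lpips_fn(x, x_rec).mean() + lam1 * F.mse_loss(x_rec, x)

    # Eq. (2): identity loss, 1 - cosine similarity of ArcFace features
    l_id = 1.0 - F.cosine_similarity(arcface(x), arcface(x_rec), dim=-1).mean()

    # Eq. (3): keep predicted latents close to the average latent w_avg
    l_reg = (w_plus - w_avg).norm(p=2, dim=-1).mean()

    # Eq. (4): weighted combination
    return lam2 * l_per + lam3 * l_id + lam4 * l_reg
```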

3.2 Attribute-Related Fine Editing

We first feed the input image x into the well-trained encoder E, which embeds x into the \(W^{+}\) latent space. We then add the editing attribute vector \(\delta W^{+}\), obtaining \(W_{edit}=W^{+}+\delta W^{+}\), which is fed into the generator G to produce a facial image \(x^{\prime } = G\left( W^{+}+\delta W^{+}\right) \) that carries the editing attribute. Note that at this stage we only care about whether the edited image possesses the target attributes, not whether editing-irrelevant regions stay consistent. We therefore design an attribute classifier, composed of convolutional neural networks, to detect whether the synthesized facial image contains the corresponding attribute. The classifier is trained on labeled datasets, and the well-trained classifier ensures that the synthesized image \(x^{\prime }\) possesses the target attributes via the following classification loss:

$$\begin{aligned} \begin{aligned}&\mathcal {L}_{a c}=-\left[ \mathbb {I}_{\left\{ \left| \varDelta \right| =1\right\} }\left( a_{y} \log p_{y}+\left( 1-a_{y}\right) \log \left( 1-p_{y}\right) \right) \right] \end{aligned} \end{aligned}$$
(5)

where \(\mathbb {I}\) denotes the indicator function, which equals 1 when the condition is satisfied, \(\varDelta = \textrm{H}\left( \textrm{C}\left( x^{\prime }\right) ,\, a_{y}\right) \) with H the Hamming distance and C the well-trained classifier, and \(p_{y}\) is the probability of the attribute estimated by C. We use \(\varDelta \) to determine whether the attribute has been changed (i.e., \(\left| \varDelta \right| =1\)). This loss drives the generator G to synthesize facial images \(x^{\prime }\) with the related attributes.
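A minimal sketch of how the editing step and this classification loss could be wired together is given below. The `encoder`, `generator`, and `classifier` modules are placeholders, the classifier is assumed to output per-attribute probabilities, and applying the indicator per sample over summed attributes is one possible reading of Eq. (5).

```python
import torch

def edit_and_classify(x, delta_w, encoder, generator, classifier, a_y, eps=1e-8):
    """Attribute-related fine editing followed by the classification loss of Eq. (5).

    delta_w : editing attribute vector added in W+ space
    a_y     : target binary attribute labels, shape (B, num_attrs)
    """
    w_plus = encoder(x)                        # embed x into W+
    x_edit = generator(w_plus + delta_w)       # x' = G(W+ + delta_W+)

    p_y = classifier(x_edit)                   # predicted attribute probabilities
    pred = (p_y > 0.5).float()

    # Hamming distance between predicted and target attribute vectors
    delta = (pred != a_y).float().sum(dim=-1)
    indicator = (delta.abs() == 1).float()     # only count samples where |delta| = 1

    # binary cross-entropy on the target attributes
    bce = -(a_y * torch.log(p_y + eps) + (1 - a_y) * torch.log(1 - p_y + eps)).sum(dim=-1)
    l_ac = (indicator * bce).mean()
    return x_edit, l_ac
```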

Fig. 2. An overview of our proposed SemanticGAN method. SemanticGAN contains attribute-related fine editing and attribute-independent optimization. The attribute-related fine editing focuses only on the accuracy of the target attributes, regardless of whether non-edited areas change. The optimization then uses the segmentation to select the edited regions and keeps the non-edited areas consistent with the raw segmentation.

$$\begin{aligned} \begin{aligned}&\mathcal {L}_{\textrm{adv}}=\mathbb {E}[\log (\textrm{D}(\textrm{x}))]+\mathbb {E}\left[ \log \left( 1-\textrm{D}\left( \textrm{x}^{\prime }\right) \right) \right] \end{aligned} \end{aligned}$$
(6)

where x is the input image and \(x^{\prime }\) is the synthesized facial image. \(\mathcal {L}_{\textrm{adv}}\) encourages the generator G to synthesize high-quality facial images.

$$\begin{aligned} \begin{aligned}&\mathcal {L}=\lambda _{5} \mathcal {L}_{\textrm{ac}}+\lambda _{6} \mathcal {L}_{\textrm{adv}} \end{aligned} \end{aligned}$$
(7)

3.3 Attribute-Independent Optimization

In Fig. 2(A) we can see that the edited images do possess the target attributes, but editing-irrelevant regions have also changed, e.g., when editing the smile attribute the hair color changes from black to brown. In other words, the attribute vector edits the facial image but also alters other attributes. To solve this problem, we use semantics to optimize the result. Specifically, we use the well-trained segmentation model to generate the edited image's semantic label. As mentioned above, when we feed an image into the encoder we obtain a \(W^{+}\) latent code; we then add the attribute vector and feed the sum into the generator, \(\left( x^{\prime }, y^{\prime }\right) =G^{\prime }\left( W^{+}+\delta W^{+}\right) \), so that we also obtain the edited image's label \(y^{\prime }\). To optimize \(\delta W^{+}\), we select the edited regions following EditGAN [23]. We define the edited region r as the region of the edited attributes together with the relevant regions, \(r=\{p:p^y \in P_{edit}\} \cup \{p: p^{y_{edit}}\in P_{edit}\}\), i.e., r contains all pixels p that belong to the edited parts in either the original or the edited segmentation. During training we dilate the region r by 5 pixels to compensate for errors due to inaccurate segmentation. In practice, r acts as a binary pixel-wise mask (see the sketch below).
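As a concrete illustration, the mask r can be built as below; this sketch assumes the segmentations are integer label maps and uses a simple morphological dilation as one way to realize the 5-pixel expansion.

```python
import numpy as np
from scipy.ndimage import binary_dilation

def edit_region_mask(seg_before, seg_after, edit_part_ids, dilate_px=5):
    """Binary edit-region mask r: pixels belonging to the edited parts (P_edit)
    in either the original or the edited segmentation, then dilated."""
    r = np.isin(seg_before, list(edit_part_ids)) | np.isin(seg_after, list(edit_part_ids))
    r = binary_dilation(r, iterations=dilate_px)   # absorb small segmentation errors
    return r.astype(np.float32)                    # used as a pixel-wise mask
```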

To keep the attribute-independent regions unchanged and find the appropriate attribute vector \(\delta W^{+}\), we use the following loss:

$$\begin{aligned} \begin{aligned}&\mathcal {L}_{\textrm{Seg}}\left( \delta W^{+}\right) = L_{L P I P S}\left( x^{\prime } \odot (1-r), x \odot (1-r)\right) \\&+L_{L 2}\left( x^{\prime } \odot (1-r), x \odot (1-r)\right) \end{aligned} \end{aligned}$$
(8)

where \(L_{LPIPS}\) is the Learned Perceptual Image Patch Similarity (LPIPS) distance, \(x^{\prime }\) is the generated face image, and \(L_{L2}\) is a regular pixel-wise L2 loss. \(\mathcal {L}_{{\text {Seg}}}\left( \delta W^{+}\right) \) ensures that the synthesized facial image does not change the unedited region.

$$\begin{aligned} \mathcal {L}_{ID}\left( \delta W^{+}\right) =1-\left\langle R\left( x^{\prime }\right) , R(x)\right\rangle \end{aligned}$$
(9)

with R denoting the pre-trained ArcFace [8] feature extraction network and \(\left\langle .,. \right\rangle \) cosine-similarity.

$$\begin{aligned} \mathcal {L}\left( \delta W^{+}\right) =\lambda _{7} \mathcal {L}_{Seg}+\lambda _{8} \mathcal {L}_{ID} \end{aligned}$$
(10)

with the hyperparameters \(\lambda _{7}\), \(\lambda _{8}\).
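The resulting optimization of \(\delta W^{+}\) can be sketched as a short gradient-based loop. The optimizer choice, step count, learning rate, and weights below are illustrative assumptions, and the identity term is written in the 1 - cosine form of Eq. (2).

```python
import torch
import torch.nn.functional as F

def optimize_delta_w(x, w_plus, delta_w, generator, lpips_fn, arcface, r,
                     steps=100, lr=0.01, lam7=1.0, lam8=0.1):
    """Attribute-independent optimization (Eqs. 8-10).

    r : binary mask of the edited region, broadcastable to the image tensor
    """
    delta_w = delta_w.clone().requires_grad_(True)
    opt = torch.optim.Adam([delta_w], lr=lr)
    keep = 1.0 - r                                 # unedited region only

    for _ in range(steps):
        x_edit = generator(w_plus + delta_w)

        # Eq. (8): LPIPS + L2 restricted to the unedited pixels
        l_seg = lpips_fn(x_edit * keep, x * keep).mean() \
                + F.mse_loss(x_edit * keep, x * keep)

        # Eq. (9): identity preservation via ArcFace features
        l_id = 1.0 - F.cosine_similarity(arcface(x_edit), arcface(x), dim=-1).mean()

        # Eq. (10): total objective
        loss = lam7 * l_seg + lam8 * l_id
        opt.zero_grad()
        loss.backward()
        opt.step()

    return delta_w.detach()
```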

We define \(D_{x, y}\) as the annotated dataset, where x is a real image and y its label. Similar to DatasetGAN [40] and Repurpose-GAN [31], to generate segmentations \(y^{\prime }\) alongside images x we train a segmentation branch S, which is a simple multi-layer convolutional neural network. Figure 3 shows the segmentation framework. We feed the image into the optimized encoder network to obtain a latent code z, which is then fed to the generator. We extract feature maps \(f_{i}\) of dimension \(\left( h_{i}, w_{i}, c_{i}\right) \) for \(i = 1, 2, 3, \ldots , K\). Each feature map is upsampled to the same spatial dimension, \(\hat{f}_k=U_{k}\left( f_{k}\right) \) for \(k \in \{1, \ldots , K\}\), where \(U_{k}\) is an upsampling function. All upsampled feature maps are then concatenated along the channel dimension. The segmentation branch operates on this feature map and predicts a segmentation label for each pixel; it is trained with the cross-entropy loss.
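The segmentation branch can be sketched as a small convolutional head over the upsampled and concatenated generator features; the layer widths and upsampling settings below are illustrative assumptions rather than the exact architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SegBranch(nn.Module):
    """Few-shot segmentation head over generator feature maps (cf. Fig. 3)."""

    def __init__(self, feat_channels, num_classes, out_size=1024):
        super().__init__()
        self.out_size = out_size
        self.head = nn.Sequential(               # simple multi-layer conv head
            nn.Conv2d(sum(feat_channels), 128, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(128, num_classes, 1),
        )

    def forward(self, feature_maps):
        # upsample every feature map f_k to the same spatial size ...
        ups = [F.interpolate(f, size=(self.out_size, self.out_size),
                             mode='bilinear', align_corners=False)
               for f in feature_maps]
        # ... concatenate along channels and predict per-pixel class logits
        return self.head(torch.cat(ups, dim=1))

# Training uses the cross-entropy loss on the few annotated image-mask pairs:
#   logits = seg_branch(feats); loss = F.cross_entropy(logits, mask)
```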

Fig. 3. The framework for few-shot semantic segmentation.

4 Experiments

In this section, we 1) describe the experimental implementation details; 2) show the attribute editing results; 3) show the editing results with SemanticGAN; and 4) provide the results of ablation studies.

4.1 Implementation Details

Datasets: We evaluate our model on the CelebA-HQ dataset. The segmentation model and the encoder model are trained on the CelebA-HQ mask dataset [20]. The image resolution is 1024 \(\times \) 1024 in all our experiments.

Fig. 4. Comparisons with AttGAN, STGAN, InvertGAN, DNI, and SemanticGAN on real image editing.

Implementation: We train our segmentation branch using 15 image-mask pairs as labeled training data for faces. The initial learning rate is 0.02 and is halved every 200 epochs. The model is trained with an Adam optimizer with momentum equal to 0.9. Our experimental environment is based on Lenovo Intelligent Computing Orchestration (LiCO), a software solution that simplifies the use of clustered computing resources for artificial intelligence (AI) model development and training. We implement our method with the PyTorch 1.17 library and CUDA 11.0, and the models are trained on Tesla V100 GPUs with 32 GB of memory.
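The training schedule above can be sketched as follows; the model here is a stand-in, the total epoch count is assumed, and reading "momentum 0.9" as Adam's first-moment coefficient is an interpretation on our part.

```python
import torch
import torch.nn as nn

num_epochs = 600                                      # assumed total epoch budget
model = nn.Conv2d(3, 19, kernel_size=1)               # placeholder segmentation head
optimizer = torch.optim.Adam(model.parameters(), lr=0.02, betas=(0.9, 0.999))
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=200, gamma=0.5)

for epoch in range(num_epochs):
    # ... one pass over the 15 labeled image-mask pairs would go here ...
    scheduler.step()                                  # learning rate halves every 200 epochs
```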

4.2 Attribute Face Editing

We compare our method with several recent works: AttGAN, STGAN, InvertGAN, and DNI. Figure 4 shows the results of attribute editing. AttGAN and STGAN use an encoder-decoder architecture, and their images become blurred after attribute editing. In the reconstruction of AttGAN, the skin color and hair color differ from the input image. Neither AttGAN nor STGAN produces a convincing smile attribute, their eyeglasses are not obvious, and the hair changes when editing the smile attribute. InvertGAN and DNI successfully edit the eyeglasses and smile attributes with higher image quality than AttGAN and STGAN, but InvertGAN changes the age when editing the glasses attribute, and DNI changes the eyes while editing the glasses. SemanticGAN edits while keeping the original image details unchanged, since we use the generated semantics to optimize the edited image. Table 1 compares the quantitative results of the different methods: considering FID [14], attribute accuracy, and ID score, our method outperforms the others and generates face images consistent with the raw images. Furthermore, our higher accuracy and ID score indicate a better ability to edit attributes while preserving identity. The LPIPS of AttGAN is higher than ours, but its attribute accuracy is only 50.3%.

Table 1. The quantitative results of different methods. \(\uparrow \) and \(\downarrow \) denote the higher and the lower the better.
Fig. 5. The results of SemanticGAN. From left to right, each row shows: source image, attribute fine editing, the editing segmentation, attribute-independent optimization, and final segmentation.

4.3 Editing with SemanticGAN

As shown in Fig. 5, we apply our method to images downloaded from the Internet to illustrate the editing process of our model. Figure 5(a) is the input image, and (b) and (c) are the attribute-related fine editing result and its segmentation. We then use the segmentation to select the edited region and optimize the attribute-independent regions, obtaining Fig. 5(d) and (e). We can see that the segmentation changes after the optimization, and the final results are consistent with the raw image.

Table 2. Quantitative results for ablation studies.

4.4 Ablation Studies

In this section, we conduct experiments to validate the effectiveness of each component of SemanticGAN: 1) optimizing the latent codes without the identity loss; 2) not optimizing the latent codes; 3) our full model. Quantitative results of these variants are shown in Table 2. We find that after attribute editing, the attribute accuracy becomes higher and the ID_Score becomes smaller than in the variant without the identity loss, which suggests that the identity loss focuses on face identity while neglecting other significant information. Using the segmentation to optimize the attribute latent contributes results that are consistent with the raw image.

5 Conclusion

In this work, we propose a novel method named SemanticGAN for facial image editing. SemanticGAN can generate images together with their pixel-wise semantic segmentations, and the semantic segmentation model requires only a small amount of annotated data for training. We first embed the attribute vectors into the latent space and focus on attribute-related fine editing, and then optimize the editing results so as to achieve consistent facial image editing. Extensive qualitative and quantitative experimental results demonstrate the effectiveness of the proposed method.