1 Introduction

Face editing aims to manipulate a facial attribute toward a desired status, such as adding age or adding a smile (see Fig. 1). Realistic facial semantic editing is needed in a wide range of applications, but challenges remain [1]. Extensive works have been proposed, such as earlier algorithms [2, 3] based on three-dimensional morphable face models (3DMMs) [4] and methods [5, 6] based on improved conditional generative adversarial networks (CGANs).

Fig. 1

Different editing examples (at 1024 \(\times \) 1024 resolution) from our method. The background and identity are well preserved after disentangled editing. More edited examples are available on our project page at https://github.com/lcd21/PFSF

Recently, generative adversarial networks (GANs) [7,8,9,10] have made impressive strides in generating realistic high-resolution face images. Walking in the latent space of a pretrained facial GAN in appropriate directions results in facial attribute variation. This phenomenon implies that abundant facial semantic information is determined by the latent space together with the GAN model, which we term the facial semantic field (FSF) in this paper. Thus, several recent studies [1, 11,12,13,14,15,16,17] propose to edit faces based on the priors of pretrained GANs. These works edit faces effectively by finding a semantic direction and then moving along it, but three challenges remain: identity loss, background alteration and entangled manipulation. For example, adding eyeglasses via InterFaceGAN [11] may mistakenly remove the beard and alter the background (see the first row of Fig. 2). In addition, adding facial age via InterFaceGAN [11] may also put on eyeglasses (see the second row of Fig. 2). A possible reason is that prior works use a fixed pretrained GAN to edit all faces, ignoring individual differences. For each facial attribute, a universal walking direction is insufficient for disentangled editing of different individuals.

Fig. 2

Visual comparison of different editing results. Notice the background and the other attributes

To address the above limitations, we propose to perform disentangled face editing by learning an individual walk in a personalized facial semantic field (PFSF) instead of a universal facial semantic field (UFSF) shared by all faces as in peer works [11, 18, 19]. The motivation is that facial semantic changes differ between individuals; for example, old people may smile in a quite different manner from children. A personalized walk can therefore leverage more individual characteristics, promoting identity preservation and semantic disentanglement. Our personalized walk framework consists of three steps: (1) The generator of StyleGAN and the inversion model are retrained together, producing a personalized facial semantic field. The PFSF is built under portrait-mask constraints, capturing more personalized facial details for each instance. Sampling from the personalized facial semantic field, the generator can synthesize faces more similar to the original ones than the universal facial semantic field of the fixed GAN model used by peer works [11, 18, 19]. In other words, with the help of the learned personalized facial semantic field, more identity information is preserved. (2) For individual semantic manipulation with disentanglement, we learn to walk in the personalized facial semantic field to manipulate the target attribute while preserving the other attributes. This disentangled semantic walk is supervised by an attribute predictor. (3) The edited portrait is integrated into the background of the given image under the constraint of a portrait mask, which preserves the background.

Extensive experiments are performed on the public datasets FFHQ [9] and CelebaHQ [7] to assess the proposed method. Results demonstrate that our method preserves the background well and surpasses recent state-of-the-art works in identity preservation and disentangled editing.

The major contributions of this work are listed as follows.

  1.

    A personalized facial semantic field is built for each individual by retraining the GAN model with identity and perceptual retention as optimization constraints. This helps to preserve more facial identity during inversion.

  2.

    The portrait mask is introduced as a constraint in both PFSF building and edited-image fusion, which maintains the background and preserves facial details.

  3.

    An individual walk in the PFSF is proposed to perform disentangled semantic manipulation.

  4.

    A framework consisting of the above components is constructed for real face editing, surpassing existing state-of-the-art methods in both quantitative and qualitative evaluations.

2 Related work

GAN inversion searches for a latent vector in the latent space of a pretrained GAN from which a new image similar to the given image can be generated. Current GAN inversion works fall into three types. The first applies an optimization strategy [20,21,22,23,24,25] to find the right latent code by directly optimizing an initial latent code via iteration. Image2StyleGAN [22] is a typical optimization-based work, optimizing the latent vector via gradient descent. Optimization-based methods can achieve decent inversions, but they are time-consuming. The second type is based on a learning strategy [26,27,28,29,30]: a network is trained to encode the given image into a latent vector by minimizing the difference between the inverted image and the original one. pSp [31] trains a network to embed vectors of different styles in the W+ space of StyleGAN. Learning-based inversion works are fast in the inference stage, but they lose identity for real faces in the wild. The last type applies hybrid techniques [32,33,34,35] to combine the above two strategies. IDInv [32] trains a domain-guided network to encode given images into latent vectors, which are used as the initialization for the following optimization. These existing inversion works operate only on a fixed universal semantic field. By contrast, we consider personalized differences, and the proposed method performs inversion and GAN retraining simultaneously, producing a personalized semantic field for precise attribute editing.
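For intuition, an optimization-based inversion of this kind can be sketched as follows. This is a minimal PyTorch-style sketch under our own naming, not the implementation of any cited work; `G`, `lpips_loss`, `latent_dim` and the hyperparameters are illustrative assumptions.

```python
import torch

def invert_by_optimization(x, G, lpips_loss, latent_dim=512, steps=500, lr=0.01):
    """Minimal sketch of optimization-based GAN inversion.

    x:          target image tensor, shape (1, 3, H, W)
    G:          pretrained generator mapping a latent code to an image
    lpips_loss: perceptual distance function (e.g., an LPIPS module)
    """
    z = torch.randn(1, latent_dim, requires_grad=True)   # initial latent code
    opt = torch.optim.Adam([z], lr=lr)
    for _ in range(steps):
        x_hat = G(z)
        loss = torch.nn.functional.mse_loss(x_hat, x) + lpips_loss(x_hat, x).mean()
        opt.zero_grad()
        loss.backward()
        opt.step()          # only z is updated; G's weights are not optimized here
    return z.detach()
```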

Recent works [1, 11,12,13,14,15,16,17, 36,37,38] use the priors in pretrained GANs for face editing. InterFaceGAN [11] searches for a hyperplane separating a specific attribute from one status to another (e.g., from male to female) and moves in the direction perpendicular to the hyperplane. SGF [14] and HijackGAN [13] design surrogate networks to learn the semantic gradient in the latent space of pretrained generators, showing disentangled editing to some extent, but they suffer from loss of identity and background. GANSpace [12] is a typical unsupervised method, searching for attribute directions via principal component analysis (PCA). IALS [18] learns attribute-level directions for face editing and proposes a disentanglement transformation to achieve pairwise disentanglement, but entanglements still exist among multiple attributes. Trans4edit [19] trains a transformer to map semantic variations into latent vector variations. FacialVideoEditing [36] conducts high-resolution face editing in videos by combining GAN inversion and attribute manipulation. Multi-view-face [38] can generate multi-view faces from unpaired images, avoiding large-scale data collection and annotation.

3 Proposed method

Our facial editing framework for real faces is composed of three parts: building a personalized facial semantic field (PFSF), learning an individual walk for editing, and fusing the edited face. An overview of the framework is shown in Fig. 3. First, a personalized facial semantic field for each real face is built by jointly optimizing the inversion and retraining the GAN model (e.g., StyleGAN), preserving more individual information. Then, an individual walk in this personalized semantic field is conducted by searching the target semantic direction step by step, guided by a pretrained facial attribute predictor. Finally, the edited face is fused into the original image via a portrait mask.

Fig. 3

Framework of our method. For a given face image, the personalized facial semantic field (PFSF) is built by inverting the image into the latent space of the GAN and retraining the GAN together, colored in blue. After the PFSF is built, the generator is fixed, and a pretrained facial attribute classifier is used as supervision for searching a disentangled semantic direction in the PFSF. This individual walk aims to move in the right semantic direction so that the given face image is edited according to the semantic target (e.g., smiling); the editing process is colored in orange. Finally, the edited portrait is fused with the background of the original image

Current facial editing methods usually fail on real face images that were not generated by GANs. They rely on a fixed inversion and a fixed GAN, resulting in a universal facial semantic field (UFSF). This implies that embedding real face images into a universal semantic field is challenging, which motivates us to conduct GAN inversion and GAN retraining simultaneously, building a personalized facial semantic field for the given image. Methods based on the UFSF ignore personalized differences between individuals and walk in a fixed universal semantic field along a single linear path for all faces. By contrast, our method learns a personalized facial semantic field for each face independently and then walks in it.

Facial attributes \(A = \left\{ a_{1}, a_{2}, \cdots , a_{N}\right\} \) can be quantified as semantic scores \(S = \left\{ s_{1}, s_{2}, \cdots , s_{N}\right\} \), where \(s_{i} \in \mathbb {R}\) and N denotes the number of considered attributes. For each attribute, different scores mean different semantic intensities; taking aging as an example, a higher attribute score means an older face. The gradient of the semantic scores \(\nabla S\) is determined by the inversion module I and the GAN model G, so the personalized facial semantic field can be defined as

$$\begin{aligned} F(I, G)=\nabla S. \end{aligned}$$
(1)

For a given face x, its personalized facial semantic field F is built by learning a personalized inversion module I and GAN model G. The latent vector of x in F is then \(z=I(x)\). We need to find a semantic direction dz along which z walks from \(z_m\) to \(z_n\) step by step in the learned personalized facial semantic field F, driving the corresponding attribute score gradually from the original score \(s_m\) to the target score \(s_n\). This semantic walk is expressed as

$$\begin{aligned} s_{n}-s_{m} \sim \sum _{i=m}^{n} \left( z_{i}+d z_{i}\right) . \end{aligned}$$
(2)
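To make Eq. (1) concrete, a local entry of the field can be probed with automatic differentiation: the gradient of an attribute score with respect to the latent code gives a local walking direction. The sketch below is illustrative only; `G`, `P` and the attribute index are assumptions rather than parts of the released implementation.

```python
import torch

def local_semantic_gradient(z, G, P, attr_idx):
    """Local entry of the facial semantic field: d s_i / d z at latent code z.

    z: latent code, shape (1, latent_dim); G: generator; P: attribute predictor
    that returns per-attribute scores of shape (1, N).
    """
    z = z.detach().clone().requires_grad_(True)
    s_i = P(G(z))[:, attr_idx].sum()   # semantic score of the generated face
    s_i.backward()                     # autograd gives the local walking direction
    return z.grad.detach()
```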

3.1 Building personalized facial semantic field

GAN inversion I aims to find a latent vector \(z=I(x)\) from which the corresponding generator G can reconstruct a new image \(\hat{x}=G(I(x))\) that differs from the original image x as little as possible. Since the background of the face image does not need to be changed during editing, it is not appropriate to treat the portrait and the background equally, so we focus on the portrait and neglect the background. We introduce the portrait matting method M [39] to extract the portrait mask m of the original image x: \(m=M(x)\). The portrait in the original image x can then be represented as \(x_{p}=x \cdot m\), and the portrait in the reconstructed image \(\hat{x}\) as \(\hat{x}_{p}=\hat{x} \cdot m\). The pixel reconstruction loss and the perceptual loss are used to minimize the difference between the inverted image \(\hat{x}\) and the original image x. The objective is expressed as

$$\begin{aligned} L_{\textrm{inv}}&=L_{\textrm{pix}}+\lambda _{1} L_{\textrm{pert}}\nonumber \\&=L_{2}(x_{p}, \hat{x}_{p})+\lambda _{1} L_{\textrm{pert}}(x_{p}, \hat{x}_{p})\nonumber \\&=L_{2}(M(x)\cdot {x}, M(x)\cdot {G(I(x))})\nonumber \\&\quad +\lambda _{1} L_{\textrm{pert}}(M(x)\cdot {x}, M(x)\cdot {G(I(x))}) \end{aligned}$$
(3)

where the perceptual loss \(L_{\textrm{pert}}\) is from [40], and \(L_{2}\) denotes the mean squared error (MSE) reconstruction loss. Building the PFSF is marked in light blue at the top of Fig. 3.

The inversion module I and the generator G are parameterized by \(\theta _{I}\) and \(\theta _{G}\), respectively; the personalized facial semantic field can then be learned with the following objective function.

$$\begin{aligned} \theta _{I}^{*}, \theta _{G}^{*}=\underset{\theta _{I}, \theta _{G}}{\arg \min } \left( L_{\textrm{inv}}\right) \end{aligned}$$
(4)
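A hedged sketch of this joint optimization (Eqs. 3 and 4) in PyTorch style is shown below; `matting_model`, `lpips_loss`, the optimizer and its settings are illustrative placeholders rather than the exact implementation.

```python
import torch

def build_pfsf(x, z_init, G, matting_model, lpips_loss,
               steps=100, lr=0.1, lambda_1=1.0):
    """Sketch of building the PFSF: jointly refine z and fine-tune G (Eqs. 3-4).

    x: original image, (1, 3, H, W); z_init: latent code from the inversion module I.
    """
    m = matting_model(x).detach()                     # portrait mask m = M(x)
    z = z_init.detach().clone().requires_grad_(True)
    opt = torch.optim.Adam([z] + list(G.parameters()), lr=lr)
    for _ in range(steps):
        x_hat = G(z)
        # Masked losses: only the portrait region is compared (background ignored).
        l_pix = torch.nn.functional.mse_loss(m * x_hat, m * x)
        l_pert = lpips_loss(m * x_hat, m * x).mean()
        loss = l_pix + lambda_1 * l_pert              # Eq. (3)
        opt.zero_grad()
        loss.backward()                               # updates both z and G (Eq. 4)
        opt.step()
    return z.detach(), G
```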

3.2 Individual semantic walk for facial editing

After the personalized facial semantic field is built for the given image x, the personalized generator G and the corresponding latent vector \(z=I(x)\) are obtained. The generated image G(z) shows little difference from the original face x. The synthetic image G(z) can then be manipulated toward the semantic target (e.g., smiling or aging) by walking z in the personalized facial semantic field along the proper semantic direction. For disentangled and precise facial editing, we propose to walk individually, instead of in the universal and linear manner used by current methods.

Our strategy for finding a disentangled semantic direction consists of two steps: first, preliminary semantic directions are obtained for extensive instances sampled from the personalized facial semantic field via the gradients of attribute classifiers; then, the average of these preliminary directions is taken as the final direction. The disentangled semantic direction dz in the personalized facial semantic field should guarantee that when a latent vector z walks along it, the generated image \(G(z+dz)\) changes only the desired attribute and preserves the rest. N binary facial attributes (\(a\sim \left\{ 0,1 \right\} \)) are listed as \(A = \left\{ a_{1}, a_{2}, \cdots , a_{N}\right\} \); for example, \(a_{39}=1\) means young and \(a_{39}=0\) means old. The attributes A of a face image x are recognized via the facial analysis model P [41]. This step is expressed as \(A=P(x)\).

The cross-entropy loss \(L_{\textrm{tar}}\) is used to drive the selected attribute from its original status to the target status.

$$\begin{aligned} L_{\textrm{tar}}=-y_{t} \log \left( {a}_{t}\right) -\left( 1-y_{t}\right) \log \left( 1-{a}_{t}\right) , \end{aligned}$$
(5)

where \(a_{t}=P(G(z+dz))[t]\) is the attribute probability predicted by P, and \(y_t\) is the target label of the selected attribute.

The MSE loss is adopted to keep the other attributes unchanged, which is critical for disentangled editing. Let \(A_{other} = \left\{ a_{1}, a_{2}, \cdots , a_{i}\right\} _{i\ne t}\) denote the attributes other than the target attribute \(a_t\), and \(Y_{other} = \left\{ y_{1}, y_{2}, \cdots , y_{i}\right\} _{i\ne t}\) their corresponding label values. The loss to preserve the other attributes can then be represented as

$$\begin{aligned} L_{\textrm{other}}=\frac{1}{N-1}\left\| A_{\textrm{other}}-Y_{\textrm{other}}\right\| _{2}. \end{aligned}$$
(6)

The individual semantic direction can be searched via the following objective.

$$\begin{aligned} dz_x= \underset{dz}{\arg \min }\left( \lambda _{\textrm{tar}} L_{\textrm{tar}}+\lambda _{\textrm{other}} L_{\textrm{other}}\right) , \end{aligned}$$
(7)

where \(\lambda _{\textrm{tar}}\) and \(\lambda _{\textrm{other}}\) denote the weights to control the two losses’ contributions.

The instance semantic direction \(dz_x\) is thus searched by minimizing the objective in Eq. (7). The final semantic direction is obtained by averaging the instance semantic directions from multiple samples \(z_s\) as

$$\begin{aligned} dz = \frac{1}{S} \sum _{s=1}^S dz_s , \end{aligned}$$
(8)

where S is the number of samples drawn from the PFSF.
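The whole direction search can be summarized by the following sketch, which minimizes Eq. (7) for each sampled latent code and averages the results as in Eq. (8). It is illustrative only: the optimizer, the thresholding of P's outputs into labels, and the use of mean squared error over the non-target attributes are assumptions consistent with, but not identical to, the description above.

```python
import torch
import torch.nn.functional as F

def search_direction(z_samples, G, P, target_idx, target_label,
                     steps=200, lr=0.01, lambda_tar=1.0, lambda_other=2.0):
    """Sketch of the individual semantic walk: minimize Eq. (7) per sample, average (Eq. 8)."""
    directions = []
    for z in z_samples:                               # each z: (1, latent_dim)
        labels = (P(G(z)).detach() > 0.5).float()     # binary attribute labels of this face
        y_t = torch.full_like(labels[:, target_idx], float(target_label))
        dz = torch.zeros_like(z, requires_grad=True)
        opt = torch.optim.Adam([dz], lr=lr)
        keep = torch.ones(labels.size(1), dtype=torch.bool)
        keep[target_idx] = False                      # indices of the other attributes
        for _ in range(steps):
            a = P(G(z + dz))                          # attribute probabilities, (1, N)
            l_tar = F.binary_cross_entropy(a[:, target_idx], y_t)        # Eq. (5)
            l_other = F.mse_loss(a[:, keep], labels[:, keep])            # Eq. (6)
            loss = lambda_tar * l_tar + lambda_other * l_other           # Eq. (7)
            opt.zero_grad()
            loss.backward()
            opt.step()
        directions.append(dz.detach())
    return torch.stack(directions).mean(dim=0)        # Eq. (8): average over S samples
```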

3.3 Portrait fusion

After the individual semantic walk, the given face has been edited, but its background has also been altered. We extract the background of the original image x and the portrait of the edited image \(x^{\prime }\) as \(x \cdot (1-m)\) and \(x^{\prime } \cdot m\), respectively. The final edited image with the background maintained can then be expressed as

$$\begin{aligned} x_{p}^{\prime }=x \cdot (1-m)+x^{\prime } \cdot m, \end{aligned}$$
(9)

where m is the portrait mask of the original image. The portrait fusion is marked as light gray at the bottom of Fig. 3.
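Portrait fusion itself is a single blending step; a minimal sketch (tensor shapes and broadcasting are assumptions):

```python
def fuse_portrait(x, x_edit, m):
    """Portrait fusion of Eq. (9): edited portrait over the original background.

    x: original image; x_edit: edited image; m: portrait mask in [0, 1],
    broadcastable to the image shape.
    """
    return x * (1 - m) + x_edit * m
```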

The overall pipeline of our method is summarized in Algorithm 1.
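Read at a high level, the pipeline ties the three stages together roughly as follows (illustrative only; the helper functions refer to the sketches in Sects. 3.1–3.3, not the released implementation, and `z_samples` stands for latent codes drawn from the PFSF):

```python
def edit_face(x, I, G, P, matting_model, lpips_loss,
              target_idx, target_label, z_samples):
    """End-to-end sketch: build the PFSF, walk individually, then fuse the portrait."""
    z0 = I(x)                                                         # initial inversion
    z, G_pers = build_pfsf(x, z0, G, matting_model, lpips_loss)       # Sect. 3.1
    dz = search_direction(z_samples, G_pers, P,
                          target_idx, target_label)                   # Sect. 3.2
    x_edit = G_pers(z + dz)                                           # edited portrait
    return fuse_portrait(x, x_edit, matting_model(x))                 # Sect. 3.3
```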


4 Experiments

4.1 Experimental settings

Our experiments are performed on subsets of the face datasets FFHQ [9] and CelebaHQ [7]. FFHQ consists of 70,000 real faces at 1024 \(\times \) 1024 resolution, and CelebaHQ of 30,000 real faces at 1024 \(\times \) 1024 resolution. The facial attribute classifier is a ResNet50 [42] pretrained for facial attribute recognition by the work [41]. When building the PFSF, the inversion is conducted via optimization as in Image2StyleGAN [20]. The learning rate and the number of iterations for fine-tuning the generator are set to 0.1 and 100, respectively. The total number of attributes is 40, following the dataset CelebA [43]. The hyperparameters \(\lambda _{\textrm{tar}}\) and \(\lambda _{\textrm{other}}\) in Eq. (7) are set to 1 and 2. Our experiments run on a single NVIDIA GTX 1080Ti GPU.
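For reference, the stated settings can be collected as follows (a hedged summary; key names are illustrative and not taken from any released code):

```python
# Summary of the experimental settings stated in the text.
CONFIG = {
    "resolution": 1024,            # FFHQ / CelebaHQ image resolution
    "finetune_lr": 0.1,            # learning rate for fine-tuning the generator
    "finetune_iters": 100,         # iterations for fine-tuning the generator
    "num_attributes": 40,          # CelebA attribute annotations
    "lambda_tar": 1,               # weight of L_tar in Eq. (7)
    "lambda_other": 2,             # weight of L_other in Eq. (7)
    "direction_samples": 5000,     # S in Eq. (8), as reported in Sect. 4.5
    "gpu": "NVIDIA GTX 1080Ti",
}
```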

Benchmarks The proposed method is compared with recent representative works, including InterFaceGAN [11], GANSpace [12], IALS [18] and Trans4edit [19]. The pretrained generator of StyleGAN [9] is adopted as the backbone.

Metrics Our method is evaluated in terms of identity preservation and disentanglement. Identity preservation is assessed in the same way as in InterFaceGAN [11], using the cosine similarity between the identity features of the original and the edited image; the identity features are extracted by the face recognition framework [44]. A higher identity preservation score means that more identity information is preserved. Disentanglement is evaluated by the cross-entropy between the edited attribute scores and the target attribute scores. A lower disentanglement score means that attributes are less entangled.
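The two metrics can be written compactly as follows (a hedged sketch; `id_model` denotes the face recognition model [44], `P` the attribute classifier, and the exact preprocessing of the cited works may differ):

```python
import torch.nn.functional as F

def identity_preservation(x, x_edit, id_model):
    """Cosine similarity between identity features of the original and edited images."""
    f1, f2 = id_model(x), id_model(x_edit)        # identity embeddings, (1, D)
    return F.cosine_similarity(f1, f2, dim=1).item()

def disentanglement_score(x_edit, target_scores, P):
    """Cross-entropy between attribute scores of the edited image and the target scores."""
    a = P(x_edit)                                 # predicted attribute probabilities, (1, N)
    return F.binary_cross_entropy(a, target_scores).item()
```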

Fig. 4

Visual comparison. Each column shows edited images from the following methods, respectively: (2) GANSpace [12], (3) InterFaceGAN [11], (4) IALS [18], (5) Trans4edit [19], (6) our method

Fig. 5

Reconstructed examples of inversions from the universal facial semantic field (UFSF) and personalized facial semantic fields (PFSF) before background fusion. Reconstructions from the PFSF (ours) clearly preserve more detailed characteristics and show more similarity to the original examples

Fig. 6

Different edited examples from our method. Notice that the other attributes and the background are preserved well

4.2 Qualitative evaluation

Visual comparisons between edited images from the proposed approach and the benchmarks are shown in Fig. 4. Our method achieves much better identity preservation: after editing, personalized facial details are preserved well by our method but are easily lost by the compared works (e.g., see the mouth, makeup, headdress and painting in the first to the fourth row, respectively). This benefit comes from editing in a personalized facial semantic field built for each face instead of in a UFSF shared by all faces as in peer works [11, 18, 19]. Fine-tuning the generator enhances its ability to reconstruct personalized features that do not even appear in the original training dataset (see Fig. 5). Furthermore, our method shows better disentanglement. For instance, the aging results in the fifth and sixth rows show that GANSpace [12] and InterFaceGAN [11] both add age accompanied by eyeglasses; by contrast, our method ages faces without adding eyeglasses. When manipulating gender, our method keeps eyeglasses and smiling (the last two rows) unchanged, whereas the compared works fail to keep them. Results from GANSpace [12] are not as competitive as those of the other works because GANSpace searches for semantic directions in an unsupervised manner while the others are supervised. In addition, the first two rows and the last two rows show that the background of faces edited by our method is the same as in the original images, whereas the compared works fail to reconstruct the corresponding background well. More edited examples are shown in Fig. 6. All these qualitative findings are consistent with the quantitative evaluation in the next section.

4.3 Quantitative evaluation

For a fair quantitative comparison on the subsets of FFHQ and CelebaHQ, the same four attributes as in the benchmarks are evaluated. The symbol – denotes that the method is often unable to edit the target attribute as desired. Tables 1 and 2 show that the identity preservation scores of edited images from our method are much higher than those of the compared methods, indicating that more identity features are kept by the proposed method. Among the editing results for different attributes, adding smiling obtains the best identity preservation score for each method, while manipulating gender obtains the lowest. This is because changing the gender requires changing the whole face, whereas adding smiling only requires small variations around the mouth and eyes.

Table 1 Quantitative evaluation of identity preservation on data from FFHQ [9]. A–D in the first row mean adding eyeglasses, adding smiling, aging and adding male, respectively
Table 2 Quantitative evaluation of identity preservation on data from CelebaHQ [7]. A–D in the first row mean adding eyeglasses, adding smiling, aging and adding male, respectively
Table 3 Quantitative comparison on semantic disentanglement on dataset FFHQ [9]. A–D in the first row mean adding eyeglasses, adding smiling, aging and adding male, respectively
Table 4 Quantitative comparison on semantic disentanglement on dataset CelebaHQ [7]. A–D in the first row mean adding eyeglasses, adding smiling, aging and adding male, respectively

Tables 3 and 4 show that our method achieves better disentanglement scores than the benchmarks, which indicates that our method preserves the other attributes well while the target attribute is edited as desired. Apart from our approach, Trans4edit [19] also obtains competitive disentangled editing, which benefits from the disentanglement loss designed to train its latent transformer. Among the edited attributes, the disentanglement scores are the worst when adding age. Editing gender also requires massive changes of facial appearance, inevitably altering some other attributes. On the other hand, adding eyeglasses only changes the local region around the eyes and does not affect other regions, so editing this attribute achieves the best disentanglement.

4.4 Ablation analysis

We conduct ablation experiments with three variants of our method to test the effectiveness of the different components. To test the portrait mask's contribution to identity preservation, we set the portrait mask to an all-ones matrix, removing its effect; this variant is marked as PFSF-wo-mask. In addition, we set \(\lambda _{1}\) to 0 to remove the loss \(L_{\textrm{pert}}\) in Eq. (3) and test the perceptual loss's contribution to identity preservation; this variant is labeled PFSF-wo-pert. The corresponding identity preservation results are listed in Table 5. The first two rows of Table 5 show that our method with the PFSF improves the identity preservation score by a wide margin compared with the UFSF variant. The last three rows of Table 5 demonstrate that the introduced portrait mask and the perceptual loss each improve identity preservation. This is consistent with the visual comparison in Fig. 5, whose third and fourth columns show that our method with the perceptual loss \(L_{\textrm{pert}}\) preserves more personalized features (e.g., the colored hair and the scar).

Furthermore, we set \(\lambda _{\textrm{other}}\) in Eq. (7) to 0 to test the contribution of the loss \(L_{\textrm{other}}\) to disentanglement. Our method without the loss \(L_{\textrm{other}}\) is marked as PFSF-wo-Lother, and the results are listed in Table 6, which shows that the disentanglement scores improve significantly when the constraint \(L_{\textrm{other}}\) is added. This means that the constraint \(L_{\textrm{other}}\) helps to keep the other attributes unchanged while the target attribute is edited, validating it for disentangled editing.

Table 5 Identity preservation comparison on different versions of our method. A–D in the first row mean adding eyeglasses, adding smiling, aging and adding male, respectively
Table 6 Facial attribute disentanglement comparison on different versions of our method. A–D in the first row mean adding eyeglasses, adding smiling, aging and adding male, respectively

4.5 Limitations

Several limitations exist in this work. First, limited by the 40 binary facial attribute annotations in the dataset CelebA [43], our method cannot edit semantic attributes (e.g., pose or illumination) beyond these annotations. In addition, our method needs to retrain the generator for each face to build the personalized facial semantic field. It takes 60 s to build a personalized facial semantic field and 30 min to search the semantic direction (with the sampling number \(S=5{,}000\) in Eq. 8). This is time-consuming and not appropriate for real-time applications. Since the experiments are conducted on a single NVIDIA GTX 1080Ti GPU, this limitation would be alleviated on GPUs with more memory or higher performance.

5 Conclusion

This paper presented a disentangled face editing method based on walking in a personalized facial semantic field. We build a personalized facial semantic field for each individual by retraining the GAN model with identity and perceptual retention as optimization constraints. Then, an individual walk in the personalized facial semantic field is conducted to perform disentangled semantic manipulation, with the target attribute manipulated and the others preserved. Experiments validate that the proposed method surpasses existing works in terms of identity and background preservation and disentangled editing. One direction for future work is to edit attributes beyond the annotations in CelebA, such as editing facial poses by extending our idea to 3D engineering fields inspired by related works [45,46,47]. Another is extending this method to multimodal-driven face editing (such as voice-driven face editing [48, 49]).