1 Introduction

The demand for face editing is booming in the era of selfies. Both the research community, e.g., [4, 6, 9, 15,16,17, 20, 24, 28, 31, 35, 36, 39, 43], and the industry, e.g., Adobe and Meitu, have extensively explored ways to automate face editing by taking the user’s specification of various facial attributes, e.g., hair color and eye size, as the conditional input. Generative Adversarial Networks (GANs) [7] have made tremendous progress on this task. Prominent examples in this direction include AttGAN [9], StarGAN [6], and STGAN [24], all of which use an encoder-decoder architecture and take both the source image and the target attributes (or the attributes to be changed) as input to generate a new image with the characteristics of the target attributes.

Fig. 1. Visual results of MagGAN at resolution \(1024\times 1024\). The specific sub-regions are cropped for better visualization.

Although promising results have been achieved, state-of-the-art methods still suffer from inaccurately localized editing, where regions irrelevant to the desired attribute change are often edited. For instance, STGAN [24] makes undesired edits by painting the scarf white for “Pale Skin” (left) and the hat golden for “Blond Hair” (right) (see Fig. 2). Solving this problem requires a notion of which regions are editable for a given facial attribute change, so that the non-editable regions can be kept intact. To formalize this concept of region-localized attribute editing, we refer to the facial regions that are editable when a specific attribute changes as attribute-relevant regions (such as the hair region for “To Blonde”), and to the regions that should not be edited (such as the hat and other non-hair regions for “To Bald”) as attribute-irrelevant regions. An ideal attribute-editing generator will only edit attribute-relevant regions while keeping attribute-irrelevant regions intact, so as to minimize artifacts. A second issue with most existing methods is that they only work on low-resolution images (\(128 \times 128\)); how to edit facial attributes of high-resolution (\(1024 \times 1024\)) images is less explored.

In order to address these challenges, we present the Mask-guided Generative Adversarial Network (MagGAN) for high-resolution face attribute editing. The proposed approach is built upon STGAN [24], which uses a difference attribute vector as conditional input and selective transfer units for attribute editing. On top of this, a soft segmentation mask of common face parts, produced by a pre-trained face parser, is used to achieve fine-grained face editing. On the one hand, the facial mask provides useful geometric constraints, which helps generate realistic face images. On the other hand, the mask identifies each facial component (e.g., eyes, mouth, and hair), which is necessary for accurately localized editing. With the introduction of a mask-guided reconstruction loss, MagGAN can effectively focus on the regions most related to the edited attributes and keep the attribute-irrelevant regions intact, thus generating photo-realistic outputs.

Another reason why existing methods fail to preserve the regions that should not be edited lies in how the attribute change information is injected into the generator. Although most attribute changes call for localized editing, the attribute change condition itself does not explicitly contain any spatial information. In order to better learn the alignment between the attribute change and the regions to edit, MagGAN further uses a novel mask-guided conditioning strategy that adaptively learns where to edit.

To further scale our model for high-resolution (\(1024\times 1024\)) face editing (see Fig. 1 for visual results), we propose to use a series of multi-level patch-wise discriminators. The coarsest-level discriminator sees the full downsampled image, and is responsible for judging the global consistency of generated images, while a finer-level discriminator only sees patches of the generated high-resolution image, and tries to classify whether these patches are real or not. Empirically, this leads to more stable model training for high-resolution face editing.

Fig. 2. MagGAN (1st row) can effectively apply accurate attribute editing while keeping attribute-irrelevant regions (e.g., hat, scarf) intact. In comparison, the state-of-the-art STGAN [24] (2nd row) produces undesired modifications on these regions, e.g., whitening the scarf while manipulating “Pale Skin”.

The main contributions of this paper are summarized as follows. (i) We propose MagGAN, which effectively leverages semantic facial mask information for fine-grained face attribute editing via the introduction of a mask-guided reconstruction loss. (ii) A novel mask-guided conditioning strategy is further introduced to inject into the generator the prior that the influence region of each target attribute is localized. (iii) A multi-level patch-wise discriminator structure is introduced to scale our model up to high-resolution face editing. (iv) State-of-the-art results are achieved on the CelebA benchmark, outperforming previous methods in terms of both visual quality and editing performance.

2 Related Work

The development of face editing techniques has evolved alongside the increasing automation of editing tools. In the early stage, researchers focused on developing attribute-dedicated methods for face editing [3, 21, 25, 32, 33, 42], i.e., each model is dedicated to modifying a single attribute. However, such dedicated methods offer a low level of automation, since they cannot manipulate multiple attributes in one step. To this end, many works [6, 9, 15,16,17, 20, 24, 28, 31, 35, 36, 39, 43] started using attribute specifications, i.e., semantically meaningful attribute vectors, as conditional input; multiple attributes can then be manipulated by changing the input attribute specification. This work belongs to this category. Another line of work [5, 29, 34, 37, 45] improves the automation of face editing by providing an exemplar image as the conditional input. Below, we briefly review recent attribute-specification-based methods, and refer the reader to [44] for details of methods not reviewed here.

Fig. 3. Model architecture for the proposed Mask-guided GAN (MagGAN).

Many facial attributes are local properties (such as hair color and baldness), and facial attribute editing should change only the relevant regions while preserving the regions not to be edited. StarGAN [6] and CycleGAN [28] introduced the cycle-consistency loss to conditional GANs so as to preserve attribute-irrelevant details and to stabilize training. AttGAN [9] and STGAN [24] found that the reconstruction loss on images not to be edited is at least as effective as the cycle-consistency loss for preserving attribute-irrelevant regions. STGAN [24] proposed selective transfer units to adaptively select and modify encoder features for enhanced attribute editing, achieving state-of-the-art editing success rates. However, in this paper, we show that neither the cycle-consistency loss nor the reconstruction loss is sufficient to preserve the regions that should not be edited (see Fig. 2), and propose to utilize masks to solve this problem.

Semantic masks/segmentations provide geometry parsing information for image generation, see, e.g., [12, 22, 30]. Semantic mask datasets and models are available for domains with important real applications, such as face editing [18, 19] and fashion [23]. Recently, both [8] and [19] utilized mask information for facial image manipulation, where a target/manipulated mask is required during the manipulation process. In this paper, we focus on the setting of editing with attribute specifications, without requiring a target/manipulated mask; we only make use of a pre-trained face parser, instead of requiring users to provide the mask manually.

3 MagGAN

As illustrated in Fig. 3, face editing is performed in MagGAN via an encoder-decoder architecture [6, 9]. The design of Selective Transfer Units (STUs) in STGAN [24] is adopted to selectively transform encoder features according to the desired attribute change. Inspired by StyleGAN [14, 30], the adaptive layer normalization [2, 11] is used to inject conditions through the de-normalization process, instead of directly concatenating the conditions with the feature map. Our full encoder-decoder generator is denoted as:

$$\begin{aligned} \widehat{\mathbf {x}} = G(\mathbf {x}, \mathbf {att}_{\text {diff}}), \quad \mathbf {att}_{\text {diff}} = \mathbf {att}_{t} - \mathbf {att}_{s}, \end{aligned}$$
(1)

where \(\mathbf {x}\) (or \(\widehat{\mathbf {x}}\)) \(\in \mathbb {R}^{3\times H \times W}\) denotes the input (or edited) image, and \(\mathbf {att}_{s}\) (or \(\mathbf {att}_{t}\)) \(\in \mathbb {R}^C\) denotes the source (or target) attributes. The generator takes the attribute difference \(\mathbf {att}_{\text {diff}} \in \mathbb {R}^C\) as input, following [24].
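For concreteness, the sketch below shows how the editing interface of Eq. (1) could be driven in a PyTorch-style setting; the generator `G` (encoder, STUs, decoder) is assumed to be defined elsewhere, and the helper name `edit_attributes` is illustrative rather than the authors' released code.

```python
import torch

def edit_attributes(G, x, att_s, att_t):
    """Apply Eq. (1): edit image batch x toward the target attributes.

    x:     (B, 3, H, W) source images
    att_s: (B, C) binary source attributes
    att_t: (B, C) binary target attributes
    """
    att_diff = (att_t - att_s).float()   # difference attribute vector
    return G(x, att_diff)                # edited images \hat{x}

# Flipping a single attribute i (as done in evaluation) amounts to
# att_t = att_s.clone(); att_t[:, i] = 1 - att_t[:, i].
```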

3.1 Avoid Editing Attribute-Irrelevant Regions

Although notable results have been achieved, existing work still suffers from inaccurately localized editing, where edits are often made to regions unrelated to the desired attribute change. For example, in Fig. 2, STGAN [24] changes the scarf to white for “Pale Skin” (left) and the hat to golden for “Blond Hair” (right).

As a solution, we leverage facial region masks for accurately localized attribute editing. We utilize a pre-trained face parser to provide soft facial region masks. Specifically, a modified BiseNet [38] trained on the CelebAMask-HQ dataset [19] (footnote 1) is used to generate 19-class region masks, covering various facial components and accessories. For each attribute \(a_i\), we define its influence regions represented by two probability masks \(M_i^{+}, M_i^{-} \in [0,1]^{H\times W}\). If attribute \(a_i\) is strengthened during editing, the region characterized by \(M_i^{+}\) is likely to be changed; if \(a_i\) is weakened, the region characterized by \(M_i^{-}\) is likely to be changed. For example, for “Pale Skin”, both \(M_i^{+}\) and \(M_i^{-}\) characterize the “skin” region; for “Bald”, \(M_i^{+}\) characterizes the “hair” region while \(M_i^{-}\) characterizes the region consisting of “background, skin, ears” and “ear rings”. In this setup, we propose the following Mask-aware Reconstruction Error (MRE) to measure how well the editing process preserves the irrelevant regions that should not be edited:

$$\begin{aligned} \text {MRE} = \frac{1}{H W C} \sum _{i=1}^C \big \Vert (1-M_i^{\text {sgn}(\mathbf {att}_{\text {diff},i})}) (G(\mathbf {x}, \mathbf {att}_{\text {diff},i} \mathbf {e}_i) - \mathbf {x} ) \big \Vert _1, \end{aligned}$$
(2)

where \(\mathbf {att}_{\text {diff},i}\) is the i-th entry of \(\mathbf {att}_{\text {diff}}\), \(\mathbf {e}_i\) is the vector whose i-th entry is 1 and all other entries are 0, and \(M_i^{\text {sgn}(\mathbf {att}_{\text {diff},i})} \in \{ M_i^{+}, M_i^{-} \}\). In the face editing experiments, since all attributes are binary and \(\mathbf {att}_{s} \in \{0,1\}^C\), we take the attribute change vector \(\mathbf {att}_{\text {diff}} := 1 - 2 \mathbf {att}_{s}\). In this case, the image preservation error is computed with one attribute flipped at a time, and MRE is the total error over all attributes, normalized by \(HWC\).
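A minimal PyTorch-style sketch of how the MRE in Eq. (2) could be computed is given below; the tensor layout, the exact reduction (summing over color channels before normalizing by \(HWC\)), and the function names are our assumptions, not the released implementation.

```python
import torch

def mask_aware_reconstruction_error(G, x, att_s, masks_pos, masks_neg):
    """Eq. (2): flip one binary attribute at a time and measure the L1 change
    outside that attribute's influence region.

    x:         (B, 3, H, W) input images
    att_s:     (B, C) binary source attributes
    masks_pos: (B, C, H, W) influence regions M_i^+ (attribute strengthened)
    masks_neg: (B, C, H, W) influence regions M_i^- (attribute weakened)
    """
    B, _, H, W = x.shape
    C = att_s.shape[1]
    att_diff = 1.0 - 2.0 * att_s.float()            # flip every attribute: 0 -> +1, 1 -> -1
    mre = 0.0
    for i in range(C):
        e_i = torch.zeros_like(att_diff)
        e_i[:, i] = att_diff[:, i]                  # edit with only attribute i flipped
        x_edit = G(x, e_i)
        # pick M_i^+ when attribute i is strengthened, M_i^- when weakened
        pos = (att_diff[:, i] > 0).view(B, 1, 1).float()
        M_i = pos * masks_pos[:, i] + (1 - pos) * masks_neg[:, i]   # (B, H, W)
        preserve = (1.0 - M_i).unsqueeze(1)         # attribute-irrelevant region
        mre = mre + (preserve * (x_edit - x)).abs().sum(dim=(1, 2, 3))
    return (mre / (H * W * C)).mean()               # average over the batch
```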

In Sect. 4, we report the MRE of previous methods and our models in Table 3. Neither the cycle-consistency loss used in StarGAN [6] nor the reconstruction loss in [9, 24] is sufficient to preserve the regions that should not be edited.

Fig. 4. MagGAN loss function design (Sect. 3.2). For better illustration, the preserved region is denoted by the non-grey region of the human face.

3.2 Loss Functions for Model Training

We aim to optimize MagGAN regarding the following four aspects: (i) preservation accuracy for regions that should be preserved; (ii) reconstruction error of the original image; (iii) attribute editing success; and (iv) synthesized image quality. Therefore, we design four respective types of loss functions for MagGAN training, as illustrated in Fig. 4.

Mask-Guided Reconstruction Loss. Continuing from the design of MRE in (2), we propose the following mask-guided reconstruction loss:

$$\begin{aligned} \mathcal {L}_{G}^{\text {mre}} = \left\| M(\mathbf {att}_{\text {diff}}, \mathbf {x}) \cdot (\mathbf {x} - G(\mathbf {x}, \mathbf {att}_{\text {diff}})) \right\| _1, \end{aligned}$$
(3)

where \(M(\mathbf {att}_{\text {diff}}, \mathbf {x}) \in [0,1]^{H\times W}\) is a probability mask of the regions to be preserved.

The preserved mask \(M(\mathbf {att}_{\text {diff}}, \mathbf {x})\) is computed from both the attribute difference \(\mathbf {att}_{\text {diff}}\) and the probability facial mask \(\mathbf {M}\) of image \(\mathbf {x}\). We first feed image \(\mathbf {x}\) into a face parser, and obtain a probability map \(\mathbf {M} \in [0,1]^{19 \times H \times W}\) over the 19 facial parts, where \(\sum ^{19}_{i=1} \mathbf {M}_{i,h,w} = \mathbf {1}_{h,w}\). Since the semantic relationship between facial attributes and facial parts can be reasonably assumed to be constant, we explicitly define two binary attribute-part relation matrices \(\mathbf {AR}^{+}\) and \(\mathbf {AR}^{-}\) of dimension \(C \times 19\) to characterize this relationship. The i-th row of \(\mathbf {AR}^{+}\) (or \(\mathbf {AR}^{-}\)) indicates which facial parts should be modified when the i-th attribute is strengthened, i.e., \(\mathbf {att}_{\text {diff},i}>0\) (or weakened, i.e., \(\mathbf {att}_{\text {diff},i}<0\)). Note that if a facial part has no explicit relationship with an attribute, the corresponding entries of \(\mathbf {AR}^{+}\) and \(\mathbf {AR}^{-}\) are set to 0.

To obtain M, we first gather all parts, represented by \(\mathbf {AR}^* \in [0,1]^{19}\), that may be influenced by the attribute change \(\mathbf {att}_{\text {diff}}\):

$$\begin{aligned} \mathbf {AR}^* = \min \left\{ 1, \big (\mathbf {att}_{\text {diff}}^{(+)} \big )^T \mathbf {AR}^{+} + \big (\mathbf {att}_{\text {diff}}^{(-)} \big )^T \mathbf {AR}^{-} \right\} , \end{aligned}$$
(4)

where \(\mathbf {att}_{\text {diff}}^{(+)} = (\mathbf {att}_{\text {diff}}>0)\) and \(\mathbf {att}_{\text {diff}}^{(-)} = (\mathbf {att}_{\text {diff}}<0)\). Finally,

$$\begin{aligned} M_{h,w}(\mathbf {att}_{\text {diff}}, \mathbf {x}) = \mathbf {1} - \sum _{i=1}^{19} \mathbf {M}_{i,h,w} * \mathbf {AR}^*_{i}. \end{aligned}$$
(5)

The influence regions \(M_i^{+}\) and \(M_i^{-}\) in (2) can also be computed this way, with \(\mathbf {att}_{\text {diff}} = \mathbf {e}_i\) and \(\mathbf {att}_{\text {diff}} = -\mathbf {e}_i\).
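The preserved mask of Eqs. (4)-(5) reduces to a few matrix operations; below is a sketch under assumed tensor shapes (the parser output as a (B, 19, H, W) probability map and the relation matrices as dense (C, 19) tensors), not the authors' exact code.

```python
import torch

def preserved_mask(att_diff, part_probs, AR_pos, AR_neg):
    """Eqs. (4)-(5): probability mask of the regions to preserve.

    att_diff:      (B, C) attribute difference vector
    part_probs:    (B, P, H, W) face-parser probabilities over P = 19 parts
    AR_pos/AR_neg: (C, P) binary attribute-part relation matrices
    Returns M:     (B, H, W), close to 1 where the image should stay untouched.
    """
    pos = (att_diff > 0).float()                               # att_diff^(+)
    neg = (att_diff < 0).float()                               # att_diff^(-)
    # Eq. (4): parts possibly influenced by the requested attribute change
    ar_star = (pos @ AR_pos + neg @ AR_neg).clamp(max=1.0)     # (B, P)
    # Eq. (5): aggregate the influenced parts, then invert to get the preserved region
    influenced = torch.einsum('bp,bphw->bhw', ar_star, part_probs)
    return 1.0 - influenced

# The influence regions M_i^+ / M_i^- used in Eqs. (2) and (12) follow from the
# same routine with att_diff = +e_i / -e_i, taken as 1 minus the returned mask.
```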

Reconstruction Loss. Image reconstruction can be considered a sub-task of image editing, because the generator should reconstruct the input image when no edit is applied, i.e., \(\mathbf {att}_{\text {diff}} = \mathbf {0}\). Therefore, the reconstruction loss is defined as

$$\begin{aligned} \mathcal {L}_{G}^{\text {rec}} = \Vert G(\mathbf {x}, \mathbf {0}) - \mathbf {x}\Vert _1, \end{aligned}$$
(6)

where the \(\ell _1\) norm is adopted to preserve the sharpness of the reconstructed image.

GAN Loss for Enhancing Image Quality. The quality of the synthesized images is enhanced via adversarial training, where we use an unconditional image discriminator \(D_{\text {adv}}\) to differentiate real images from edited images. In particular, a Wasserstein GAN (WGAN) [1] objective with a gradient penalty term is utilized:

$$\begin{aligned} \mathcal {L}_{D_{\text {adv}}} = - \mathbb {E}_{\mathbf {x}} \big [ D_{\text {adv}}(\mathbf {x}) \big ] + \mathbb {E}_{\widehat{\mathbf {x}}} \big [ D_{\text {adv}}(\widehat{\mathbf {x}}) \big ] + \lambda _{\text {gp}} \mathbb {E}_{\mathbf {x}_{\text {int}}} \Big [ \big ( \Vert \nabla _{\mathbf {x}_{\text {int}}} D_{\text {adv}}(\mathbf {x}_{\text {int}}) \Vert _2 - 1 \big )^2 \Big ], \end{aligned}$$
(7)

where \(\widehat{\mathbf {x}}\) is the generated image, \(\mathbf {x}_{\text {int}}\) is sampled uniformly along straight lines between pairs of real and generated images, and \(\lambda _{\text {gp}}\) is the gradient penalty coefficient.

The generator G, in turn, tries to fool the discriminator by synthesizing more realistic images:

$$\begin{aligned} \mathcal {L}_{G}^{\text {gan}} = - \mathbb {E}_{\mathbf {x}, \mathbf {att}_{\text {diff}}} \big [ D_{\text {adv}}\big (G(\mathbf {x}, \mathbf {att}_{\text {diff}})\big ) \big ]. \end{aligned}$$
(8)

Attribute Classification Loss. To ensure that the edited image indeed carries the target attributes \(\mathbf {att}_t\), an attribute classifier \(D_{\text {att}}\) is trained on the ground-truth image-attribute pairs \((\mathbf {x}, \mathbf {att}_s)\) with the standard cross-entropy loss:

$$\begin{aligned} \mathcal {L}_{D_{\text {att}}} = - \mathbb {E}_{\mathbf {x}} \sum _{i=1}^{C} \Big [ \mathbf {att}_{s,i} \log D_{\text {att},i}(\mathbf {x}) + (1 - \mathbf {att}_{s,i}) \log \big (1 - D_{\text {att},i}(\mathbf {x})\big ) \Big ], \end{aligned}$$
(9)

where \(D_{\text {att},i}(\mathbf {x})\) denotes the predicted probability that image \(\mathbf {x}\) possesses the i-th attribute.

The generator tries to generate images that maximize the probability of being classified with the target attributes \(\mathbf {att}_t\):

$$\begin{aligned} \mathcal {L}_{G}^{\text {cls}} = - \mathbb {E}_{\mathbf {x}, \mathbf {att}_{\text {diff}}} \sum _{i=1}^{C} \Big [ \mathbf {att}_{t,i} \log D_{\text {att},i}(\widehat{\mathbf {x}}) + (1 - \mathbf {att}_{t,i}) \log \big (1 - D_{\text {att},i}(\widehat{\mathbf {x}})\big ) \Big ]. \end{aligned}$$
(10)

In summary, the loss to train the MagGAN generator G is

$$\begin{aligned} \mathcal {L}_{G} = \mathcal {L}_{G}^{\text {gan}} + \lambda _1 \mathcal {L}_{G}^{\text {rec}} + \lambda _2 \mathcal {L}_{G}^{\text {cls}} + \lambda _3 \mathcal {L}_{G}^{\text {mre}}. \end{aligned}$$
(11)

In experiments, we always set \(\lambda _1=100\) and \(\lambda _2 = 10\), and vary \(\lambda _3\) to examine the effect of the proposed mask-guided reconstruction loss.
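Putting the four terms together, a hedged sketch of the generator objective in Eq. (11) is shown below; the mean reductions, the logit-based classifier interface, and the placeholder value for \(\lambda _3\) are illustrative assumptions rather than the paper's exact training code.

```python
import torch
import torch.nn.functional as F

def generator_loss(G, D_adv, D_att, x, att_s, att_t, preserve_mask,
                   lambda1=100.0, lambda2=10.0, lambda3=1.0):
    """Sketch of Eq. (11). `preserve_mask` is M(att_diff, x) from Eq. (5),
    shaped (B, H, W); lambda3 here is a placeholder since the paper varies it."""
    att_diff = (att_t - att_s).float()
    x_hat = G(x, att_diff)

    # Eq. (8): fool the WGAN critic with the edited image
    l_gan = -D_adv(x_hat).mean()
    # Eq. (6): reconstruct the input when no edit is applied
    l_rec = (G(x, torch.zeros_like(att_diff)) - x).abs().mean()
    # Eq. (10): the edited image should be classified with the target attributes
    l_cls = F.binary_cross_entropy_with_logits(D_att(x_hat), att_t.float())
    # Eq. (3): penalize changes inside the regions to be preserved
    l_mre = (preserve_mask.unsqueeze(1) * (x - x_hat)).abs().mean()

    return l_gan + lambda1 * l_rec + lambda2 * l_cls + lambda3 * l_mre
```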

3.3 Mask-Guided Conditioning in the Generator

Another reason why previous methods cannot preserve the regions that should not be edited lies in how the attribute change information is injected into the generator. Although most attribute changes should lead to localized editing, the attribute change condition \(\mathbf {att}_{\text {diff}} \in \mathbb {R}^{C}\) does not explicitly contain any spatial information. In STGAN [24] (and other previous works on face attribute editing), this condition is replicated to the spatial size of some hidden feature tensor and then concatenated to it in the generator. For example, in the SPADE block in Fig. 3 (right), \(\mathbf {att}_{\text {diff}}\) is replicated spatially to form \(\mathbf {Att}_{\text {diff}} \in \mathbb {R}^{C\times H \times W}\) (the purple block; footnote 2), and then concatenated to the decoder feature (the green block). The hope is that the generator learns the localized nature of attribute editing from this concatenated tensor by itself. In practice, however, this is insufficient, even with the mask-guided reconstruction loss (3).

We propose to directly inject into the generator the inductive bias that the influence region of each attribute change is localized, by making use of masks. We view the i-th channel of \(\mathbf {Att}_{\text {diff}}\), denoted \(\mathbf {Att}_{\text {diff}}^{(i)} \in \mathbb {R}^{H\times W}\), as the condition for editing attribute \(a_i\). In previous work, \(\mathbf {Att}_{\text {diff}}^{(i)} = \mathbf {att}_{\text {diff},i} \mathbf {1}\), which is uniform across the spatial dimensions. Instead, we propose:

$$\begin{aligned} \mathbf {Att}_{\text {diff}}^{(i)} = \mathbf {att}_{\text {diff},i} M_i^{\text {sgn}(\mathbf {att}_{\text {diff},i})}, \end{aligned}$$
(12)

where \(M_i^{+}\) and \(M_i^{-}\) are the influence regions of attribute \(a_i\) defined in (2). We illustrate this mask-guided conditioning process in Fig. 3 (bottom-left). Finally, we simply replace the original replicated tensor with the mask-guided attribute condition tensor, yielding a generator with mask-guided conditioning. Note that this technique is applicable to generators both with and without SPADE.
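A sketch of the mask-guided conditioning in Eq. (12) is given below: the replicated condition tensor is replaced by per-attribute influence masks scaled by the requested change. Resizing the masks to the feature-map resolution is assumed to happen beforehand, and the function name is ours.

```python
import torch

def mask_guided_condition(att_diff, masks_pos, masks_neg):
    """Eq. (12): build the spatial attribute-condition tensor Att_diff.

    att_diff:  (B, C) attribute difference vector
    masks_pos: (B, C, h, w) influence regions M_i^+ resized to the feature map
    masks_neg: (B, C, h, w) influence regions M_i^-
    Returns:   (B, C, h, w) tensor whose i-th channel is att_diff_i * M_i^{sgn}.
    """
    a = att_diff.unsqueeze(-1).unsqueeze(-1)        # (B, C, 1, 1)
    use_pos = (a > 0).float()
    M = use_pos * masks_pos + (1.0 - use_pos) * masks_neg
    return a * M                                    # channels with att_diff_i == 0 stay zero
```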

The blending trick is another simple approach to preserving the attribute-irrelevant regions. More specifically, with the probability mask of attribute-irrelevant regions \(M(\mathbf {att}_{\text {diff}}, \mathbf {x})\) defined in (3), we simply add a blending operation at the end of the generator:

$$\begin{aligned} \widehat{\mathbf {x}} = M(\mathbf {att}_{\text {diff}}, \mathbf {x})*\mathbf {x} + \left( 1 - M(\mathbf {att}_{\text {diff}}, \mathbf {x}) \right) *G(\mathbf {x}, \mathbf {att}_{\text {diff}}). \end{aligned}$$
(13)

This blending trick improves MagGAN in terms of MRE, but visually it introduces sharp transitions at the boundary of the preserved regions. Therefore, we do not include it in our final MagGAN. More discussion is provided in the Supplementary.
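For completeness, the blending trick of Eq. (13) is a one-line operation; the sketch below assumes the preserved mask has already been computed at image resolution, and is not part of the final model.

```python
def blend(x, x_gen, preserve_mask):
    """Eq. (13): keep the preserved region from the input x and take the rest
    from the generator output. preserve_mask is M(att_diff, x) in [0, 1]^(B, H, W)."""
    M = preserve_mask.unsqueeze(1)        # broadcast over color channels
    return M * x + (1.0 - M) * x_gen
```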

3.4 Multi-level Patch-Wise Discriminators for High-Resolution Face Editing

Fig. 5. Illustration of multi-level patch-wise discriminators.

We now describe our approach to scaling image editing up to high resolutions. First, we empirically found that a single “shallow” discriminator cannot learn some global concepts, such as Male/Female, leading to low editing success. On the other hand, a single “deep” discriminator makes the adversarial training very unstable, leading to low image quality.

Inspired by PatchGAN [12] and several multi-level generation works [13, 40, 41], we propose to use a series of multi-level patch-wise “shallow” discriminators for high-resolution face editing, as illustrated in Fig. 5. The discriminators share the same architecture but do not share weights. The coarsest-level discriminator (\(D_1\)) sees the full downsampled image and is responsible for the global consistency of the generated image. The attribute classifier \(C_1\) associated with it is effective for attribute classification, as in the low-resolution editing case. The finer-level discriminators (\(D_2\), etc.) see patches of the generated high-resolution image instead of the full image, and determine whether these patches are real or not. To maintain a unified architecture across levels, we also associate each finer-level discriminator with a classifier (\(C_2\)), which takes the average-pooled feature as input for classification. The total loss over all PatchGAN discriminators is defined as:

$$\begin{aligned} \mathcal {L}_{D} = \frac{1}{P}\sum ^{P}_{i=1} \big ( \mathcal {L}_{D_{\text {att}}^i} + \mathcal {L}_{D_{\text {adv}}^i} \big ), \end{aligned}$$
(14)

where \(D_{\text {att}}^i\) and \(D_{\text {adv}}^i\) denote the attribute classifier and image discriminator of the i-th PatchGAN discriminator, and P is the total number of discriminators. In practice, we found that these finer-level discriminators improve the editing performance.

Note that our generator only generates high-resolution images, which are directly downsampled to lower resolutions and fed to the coarse-level discriminators. In contrast, generators in previous works [13, 40, 41] produce a high-resolution image in a multi-stage manner for the sake of training stability: they generate low-resolution images as intermediate outputs, which are fed to the coarse-level discriminators. Our approach is simpler in comparison, and we did not observe any training stability issues.
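One plausible way to feed the multi-level discriminators is sketched below; the patch size, the number of random crops, and the use of explicit cropping (rather than a fully convolutional patch discriminator) are all our assumptions, since the paper does not spell out this detail.

```python
import torch
import torch.nn.functional as F

def multi_level_inputs(x_1024, patch=256, n_patches=4):
    """Build one input tensor per discriminator level from a 1024x1024 output.

    D_1 (coarsest) sees the full image downsampled to `patch`; finer levels see
    random `patch`-sized crops taken from 512x512 and 1024x1024 versions of the
    same image. The per-level losses are then averaged as in Eq. (14).
    """
    views = []
    # Level 1: full downsampled image -> global consistency + attribute classifier C_1
    views.append(F.interpolate(x_1024, size=patch, mode='bilinear', align_corners=False))
    # Finer levels: random patches from higher-resolution versions of the image
    for size in (512, 1024):
        x = F.interpolate(x_1024, size=size, mode='bilinear', align_corners=False) \
            if size != x_1024.shape[-1] else x_1024
        crops = []
        for _ in range(n_patches):
            top = torch.randint(0, size - patch + 1, (1,)).item()
            left = torch.randint(0, size - patch + 1, (1,)).item()
            crops.append(x[:, :, top:top + patch, left:left + patch])
        views.append(torch.cat(crops, dim=0))
    return views
```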

4 Experiments

Dataset and Pre-processing. We use the CelebA dataset [26] for evaluation. CelebA contains over 200K facial images, each with 40 binary attribute labels. To apply CelebA to high-resolution face editing, we process the original web images by cropping, aligning, and resizing them to \(1024\times 1024\). When images are loaded for editing, they are re-scaled to match the target resolution. The images are divided into a training set, a validation set, and a test set. Following the STGAN repository (footnote 3), we take 637 images from the validation set to monitor the training process, and use the rest of the validation set together with the training set to train our model. The test set (nearly 20K images) is used for evaluation. We consider 13 distinctive attributes: Bald, Bangs, Black Hair, Blond Hair, Brown Hair, Bushy Eyebrows, Eyeglasses, Male, Mouth Slightly Open, Mustache, No Beard, Pale Skin, and Young. Since most images in CelebA have a resolution lower than \(1024\times 1024\), our “high-resolution” MagGAN models are not trained on true high-resolution images; nevertheless, our results show that MagGAN scales up to \(1024\times 1024\) resolution.

MagGAN exploits facial mask information obtained from a pre-trained face parser with 19 classes (as described in Sect. 3.1). Instead of taking a multi-label hard mask, we take the probability of each class as a soft mask with smooth boundaries, which leads to improved generation quality. All facial masks are stored at resolution \(256\times 256\). The two attribute-part relation matrices \(\mathbf {AR}^+, \mathbf {AR}^- \in [0,1]^{13\times 19}\) described in Sect. 3.2 characterize the relation between each edited attribute and the corresponding facial components. Detailed definitions are given in the Supplementary.
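As an illustration of the attribute-part relation described above, a few rows of \(\mathbf {AR}^+\)/\(\mathbf {AR}^-\) can be written as attribute-to-part lookups. Only the examples mentioned in Sect. 3.1 are shown, the part names are assumed to follow the CelebAMask-HQ label set, and the complete \(13\times 19\) matrices are specified in the Supplementary.

```python
# Illustrative rows of the attribute-part relation matrices (Sect. 3.2),
# written as attribute -> influenced parser classes. Class names are assumed
# to follow CelebAMask-HQ; the full matrices are defined in the Supplementary.
AR_POS = {            # parts edited when the attribute is strengthened
    'Pale_Skin':  ['skin'],
    'Blond_Hair': ['hair'],
    'Bald':       ['hair'],
}
AR_NEG = {            # parts edited when the attribute is weakened
    'Pale_Skin':  ['skin'],
    'Blond_Hair': ['hair'],                                           # illustrative choice
    'Bald':       ['background', 'skin', 'l_ear', 'r_ear', 'ear_r'],  # regions uncovered by hair
}
```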

Quantitative Evaluation. The performance of attribute editing is measured in three aspects: (i) mask-aware reconstruction error (MRE), (ii) attribute editing accuracy, and (iii) image quality.

Table 1 shows that MagGAN decreases the MRE significantly, indicating better preservation of regions that should remain intact. This improvement is also evident in the editing results in Fig. 8. Table 1 also reports the PSNR/SSIM scores of the reconstructed images, obtained by keeping the target attribute vector identical to the source one; MagGAN improves PSNR/SSIM significantly as well.

Fig. 6. Facial attribute editing accuracy of IcGAN [31], FaderNet [16], StarGAN [6], AttGAN [9], STGAN [24], STGAN(256), and our model MagGAN(256) (shown from left to right in rainbow color order). The last two models, marked “(256)”, operate at image resolution 256 and their outputs are resized to 128 for evaluation.

Fig. 7. Facial attribute editing accuracy of STGAN and MagGAN on hat and non-hat samples at resolution \(256\times 256\).

Table 1. Comparison of quantitative results with SOTA

We also report the attribute editing accuracy using the pre-trained attribute classification model from [24], following the evaluation protocol of [9, 24]. For each test image, one of its 13 attributes is reversed at a time (\(1 \rightarrow 0\) or \(0 \rightarrow 1\)), and an image is generated after each reversal, yielding 13 edited images per input image. The widely used evaluation metric is attribute editing accuracy, which measures the success rate of manipulating the reversed attribute but ignores the attribute preservation error. Fig. 6 reports the facial attribute manipulation accuracy of the previous works IcGAN [31], FaderNet [16], AttGAN [9], StarGAN [6], and STGAN [24], as well as our proposed MagGAN. To build the strongest baseline, we also train our own STGAN model at resolution \(256\times 256\), optimizing all possible parameters; see the hyperparameter tuning details in the Supplementary.
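The evaluation protocol above can be summarized by the following sketch; the pre-trained classifier interface (per-attribute logits thresholded at zero) and the data-loader format are assumptions for illustration.

```python
import torch

@torch.no_grad()
def editing_accuracy(G, attr_clf, loader, n_attrs=13):
    """Reverse one attribute at a time and check whether a pre-trained
    classifier detects the new value. `attr_clf` is assumed to return
    per-attribute logits of shape (B, n_attrs)."""
    correct = torch.zeros(n_attrs)
    total = 0
    for x, att_s in loader:                      # att_s: (B, n_attrs) binary labels
        for i in range(n_attrs):
            att_t = att_s.clone()
            att_t[:, i] = 1 - att_t[:, i]        # reverse attribute i
            x_edit = G(x, (att_t - att_s).float())
            pred = (attr_clf(x_edit)[:, i] > 0).long()
            correct[i] += (pred == att_t[:, i]).sum()
        total += x.size(0)
    return correct / total                       # per-attribute editing accuracy
```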

High Editing Accuracy vs. Attribute-Irrelevant Region Preservation. As shown in Table 1, MagGAN at resolution 256 outperforms all previously reported numbers except STGAN(256) in average accuracy. In Fig. 6, compared with STGAN(256), MagGAN(256) is better on “Mustache”, “No Beard”, “Gender”, and “Age”, and worse on “Bald”, “Bangs”, “Black Hair”, “Blond Hair”, and “Brown Hair”. We conjecture that STGAN(256) achieves this high accuracy by editing the hat or scarf when they appear in the image, e.g., coloring a hat golden to obtain an editing “success” for “Blond Hair”. To verify this assumption, we separate the test set into two groups, samples with a hat and samples without, by measuring the area ratio of the hat class in the face masks (a threshold of 0.1 decides whether a sample contains a hat), and evaluate the attribute editing accuracy on each subset. The results in Fig. 7 show that the editing accuracy of MagGAN decreases considerably on the hat subset for several hat-related attributes, e.g., “Bald” and “Black Hair”, but is on par with STGAN on the non-hat subset. In this sense, MagGAN’s editing success is arguably higher than that of our strongest baseline STGAN(256), since it preserves the attribute-irrelevant regions and thus produces more realistic edits.
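The hat/non-hat split used above boils down to thresholding the hat area in the parser output; a one-function sketch follows, where the hat-class index depends on the parser's label ordering and is therefore an assumption.

```python
def is_hat_sample(part_probs, hat_index, threshold=0.1):
    """Split rule used above: an image counts as a 'hat sample' when the hat
    class occupies more than 10% of the soft face-parser mask.

    part_probs: (P, H, W) per-class probability maps for one image."""
    hat_ratio = part_probs[hat_index].mean()   # expected area fraction of the hat class
    return hat_ratio.item() > threshold
```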

To measure image quality, we report the FID (Fréchet Inception Distance) score [10], which measures the distance between the Inception-v3 (footnote 4) activation distributions of the original and edited images. Table 1 shows that the FID score improves significantly from resolution 128 to 256, but then stalls and becomes insensitive to image quality at resolution 256 and higher. This is because the input size of the Inception-v3 model is 299, so the increase from 128 to 256 matters, whereas all higher-resolution generations are first downsampled before the FID score is computed. Overall, MagGAN achieves results comparable to the best FID score at all resolutions. Finally, due to the smaller training batches required at high resolutions, the FID scores of MagGAN(512) and MagGAN(1024) are slightly worse than that of MagGAN(256).

Qualitative Evaluation. In addition to the quantitative evaluation, we visualize facial attribute editing results at resolution \(256\times 256\) in Fig. 8, comparing our proposed model with the strongest baseline STGAN [24] and other variants.

Table 2. Results of the user study ranking the methods on the two subsets, with and without hats

User Study. We conduct a user study on Amazon Mechanical Turk to compare the generation quality of STGAN and MagGAN. To verify that MagGAN performs better at editing attribute-relevant regions, we randomly choose 100 input samples from the test set: 50 samples with a hat or scarf and 50 samples without (since STGAN often fails on people wearing hats). For each sample, 5 attribute editing tasks are performed by both STGAN and MagGAN (500 comparison pairs in total). The 5 tasks are randomly chosen from the 13 attributes; for subjects with a hat, we increase the chance of selecting hair-related attributes. The users are instructed to choose the result that changes the attribute more successfully, also considering image quality and identity preservation. To reduce human bias, each sample pair is evaluated by 3 volunteers. The results in Table 2 show that MagGAN outperforms STGAN on both the hat and non-hat samples.

Fig. 8. Visual results of MagGAN variants at resolution \(256\times 256\). Each column shows images edited by reversing one attribute.

5 Ablation Study

We conduct three groups of ablation comparisons at image resolution \(256\times 256\) to verify the effectiveness of the proposed modules individually: (i) the mask-guided reconstruction loss, (ii) the mask-guided (spatially varying) attribute conditioning, and (iii) the use of SPADE normalization.

We consider seven variants: (i) STGAN: STGAN at resolution \(256\times 256\); (ii) STGAN+cycle: STGAN with the cycle-consistency loss instead of its original reconstruction loss; (iii) STGAN w/ \(L^{\text {mre}}\): STGAN plus the mask-guided reconstruction loss; (iv) MagGAN w/o \(L^{\text {mre}}\): MagGAN trained without the mask-guided reconstruction loss; (v) MagGAN w/o SP: MagGAN without SPADE; (vi) MagGAN: our proposed model with both the mask-guided reconstruction loss and mask-guided attribute conditioning; (vii) MagGAN+Seg: instead of using a pre-trained face parser, a face segmentation branch (adopting the FCN [27] architecture) is built into the generator as a sub-task, making the whole model fully trainable.

Mask-guided Reconstruction Loss. We compare three reconstruction losses: (i) the reconstruction loss of STGAN, computed only on reconstructed images; (ii) the cycle-consistency loss used in StarGAN [6]; and (iii) the two-part reconstruction loss (computed on reconstructed and synthesized images, respectively) proposed in Sect. 3.2. Rows 1–3 of Table 3 report the quantitative results of STGAN with each type of reconstruction loss. We observe that adding the mask-guided reconstruction loss to generator training effectively reduces the Mask-aware Reconstruction Error (MRE). In Fig. 8, the images synthesized by STGAN w/ \(L^{\text {mre}}\) for the attributes “Bald” and “Blond Hair” also support this observation. However, since the spatial information of the mask is not directly injected into the generator, STGAN w/ \(L^{\text {mre}}\) still cannot preserve the attribute-irrelevant regions well.

Table 3. Comparison of variants of MagGAN on \(256\times 256\)

Mask-guided Attribute Conditioning. Utilizing mask-guided attribute conditioning instead of spatially uniform attribute conditioning provides the generator with explicit spatial information about the regions of interest. In Table 3, (i) vs. (iv) and (iii) vs. (vi) show that the MRE score decreases notably when mask-guided attribute conditioning is applied in the generator, implying that the generator effectively localizes its edits to the regions of interest. Taking advantage of both the mask-guided reconstruction loss and the attribute conditioning strategy, MagGAN achieves the best MRE and FID. The visual results in Fig. 8 also show that MagGAN makes accurate edits on hair-related attributes (“Bald”, “Blond Hair”, etc.), preserving the hat region while only removing or re-coloring the hair. MagGAN w/o SP and MagGAN perform nearly the same ((v) vs. (vi)), which indicates that the de-normalization method does not affect performance much. Finally, the quantitative and visual results of (vii) MagGAN+Seg are poor, which indicates that incorporating a mask segmentation branch into the generator is not a good choice: since the mask-guided reconstruction loss and attribute conditioning require accurate masks, training the segmentation branch jointly with the generator from scratch makes the model hard to train and undermines the editing accuracy.

Fig. 9. Comparison of training with a single vanilla discriminator and with multi-level PatchGAN discriminators at resolution \(1024\times 1024\): (a) attribute editing accuracy and (b) visual results.

Multi-level PatchGAN Discriminators for High-Resolution Editing. We apply PatchGAN discriminators to supervise the training of high-resolution image generation, which allows us to scale the generated image resolution up to \(1024\times 1024\). In Fig. 9, we compare the \(1024\times 1024\) model trained with a single discriminator against the one trained with our proposed multi-level PatchGAN discriminators. Under this setting, the PatchGAN setup has 3 discriminators working at resolutions \(256\times 256\), \(512\times 512\), and \(1024\times 1024\), respectively. In Fig. 9 (a), with a single vanilla discriminator the generator converges more slowly than with the PatchGAN discriminators and stops early at a low editing accuracy. In Fig. 9 (b), the editing effects on “Eyeglasses” and “Gender” with the PatchGAN discriminators are more pronounced than with the original discriminator. We conjecture that the PatchGAN discriminators provide more supervision signal on both global and local regions, thus helping the generator learn more discriminative features for each attribute. See more visual results in the Supplementary.

6 Conclusion

In this paper, we propose MagGAN for high-resolution face image editing. The key novelty of our work lies in the use of facial masks to achieve more accurately localized editing. Specifically, the mask information is used to construct a mask-guided reconstruction loss and mask-guided conditioning in the generator. MagGAN is further scaled up to high-resolution face editing with the help of multi-level PatchGAN discriminators. To our knowledge, this is the first time face attribute editing has been applied at resolution \(1024\times 1024\).