
1 Introduction

Face frontalization aims to synthesize a frontal view of a face from a given profile. Frontalized faces can be fed directly into general face recognition methods without elaborating additional complex modules. Beyond face recognition, generating photo-realistic frontal faces benefits a series of face-related tasks, including face reconstruction, face attribute analysis, facial animation, etc.

Traditional methods address this problem through 2D/3D local texture warping [6, 36] or statistical modeling [22]. Recently, GAN-based methods have been proposed to recover a frontal face in a data-driven manner [1, 8, 10, 28, 31, 32, 34, 35]. For instance, Yin et al. [32] propose DA-GAN to capture long-displacement contextual information from illumination-discrepant images under large poses; however, it recovers inconsistent illumination on the synthesized image. The flow-based method [33] predicts a dense pixel correspondence between the profile and frontal images and uses it to deform the profile face to the frontal view. However, deforming the profile face directly in image space leads to obvious artifacts, and missing pixels must be addressed under large poses.

Fig. 1. \( \pm 45^\circ \), \( \pm 60^\circ \), \( \pm 75^\circ \) and \( \pm 90^\circ \) images of the two persons in Multi-PIE. The images in each row are captured under the same flash in the recording environment.

The existing methods do not consider the illumination inconsistency between the profile and the ground-truth frontal image. Taking the widely used benchmark Multi-PIE [4] as an example, the visual illumination conditions at several poses are significantly different from those of the ground-truth frontal images, as shown in Fig. 1. Except for \( \pm 90^\circ \), all the other face images are produced by the same camera type. The variation in camera types causes obvious illumination inconsistency between the \( \pm 90^\circ \) images and the ground-truth frontal image. Although efforts have been made to manually color-balance the cameras of the same type, the illumination of the resulting images within \( \pm 75^\circ \) (except \( 0^\circ \)) still looks visually distinguishable from the ground-truth frontal image. Since existing methods minimize a pixel-wise loss between the synthesized image and the illumination inconsistent ground-truth, they tend to change both the pose and the illumination of the profile face image, while the latter is actually not acceptable in face editing and synthesis.

To address the above issue, this paper proposes a novel Flow-based Feature Warping Model (FFWM) which synthesizes photo-realistic and illumination preserving frontal images from illumination inconsistent image pairs. In particular, FFWM incorporates flow estimation with two modules: the Illumination Preserving Module (IPM) and the Warp Attention Module (WAM). Specifically, we estimate optical flow fields from the given profile: the reverse and forward flow fields are predicted to warp the frontal face to the profile view and vice versa, respectively. The estimated flow fields are fed to IPM and WAM to conduct face frontalization.

The IPM is proposed to synthesize illumination preserving images with fine facial details from illumination inconsistent image pairs. Specifically, IPM contains two pathways: (1) the Illumination Preserving Pathway and (2) the Illumination Adaption Pathway. For (1), an illumination preserving loss equipped with the reverse flow field is introduced to constrain the illumination consistency between the synthesized images and the profiles. For (2), a guided filter [7] is introduced to further eliminate the illumination discrepancy and to learn frontal-view facial details from the illumination inconsistent ground-truth image. The WAM is introduced to reduce the pose discrepancy at the feature level. It uses the forward flow field to align the profile features to the frontal view. This flow provides an explicit and accurate supervision signal to guide the frontalization. The attention mechanism in WAM helps to reduce the artifacts caused by the displacements between the profile and frontal images.

Quantitative and qualitative experimental results demonstrate the effectiveness of our FFWM in synthesizing photo-realistic and illumination preserving faces under large poses, and its superiority over the state-of-the-art results on the testing benchmarks. Our contributions can be summarized as:

  • A Flow-based Feature Warping Model (FFWM) is proposed to address the challenging problem in face frontalization, i.e. photo-realistic and illumination preserving image synthesis.

  • The Illumination Preserving Module (IPM), equipped with a guided filter and a flow field, is proposed to achieve illumination preserving image synthesis. The Warp Attention Module (WAM) uses an attention mechanism to effectively reduce the pose discrepancy at the feature level under the explicit and effective guidance of flow estimation.

  • Quantitative and qualitative results demonstrate that the proposed FFWM outperforms the state-of-the-art methods.

2 Related Work

2.1 Face Frontalization

Face frontalization aims to synthesize the frontal face from a given profile. Traditional methods address this problem through 2D/3D local texture warping [6, 36] or statistical modeling [22]. Hassner et al. [6] employ a mean 3D model for face normalization. A statistical model [22] is used for frontalization and landmark detection by solving a constrained low-rank minimization problem.

Benefiting from deep learning, many GAN-based methods [8, 10, 27, 28, 32, 33] have been proposed for face frontalization. Huang et al. [10] use a two-pathway GAN architecture to perceive global structures and local details simultaneously. Domain knowledge such as face symmetry and identity information is used to make the synthesized faces photo-realistic. Zhao et al. [34] propose PIM, which introduces a domain adaptation strategy for pose-invariant face recognition. 3D-based methods [1, 2, 31, 35] attempt to combine prior knowledge of the 3D face with face frontalization. Yin et al. [31] incorporate a 3D face model into a GAN to tackle large-pose face frontalization in the wild. HF-PIM [1] combines the advantages of 3D- and GAN-based methods and frontalizes profile images via a novel texture warping procedure. Beyond supervised learning, Qian et al. [17] propose a Face Normalization Model (FNM) for unsupervised face generation with unpaired face images in the wild. Note that FNM focuses on face normalization, without considering illumination preservation.

Instead of learning a function to represent the frontalization procedure, our method obtains frontally warped features via a flow field and reconstructs an illumination preserving and identity preserving frontal view face.

2.2 Optical Flow

Optical flow estimation has many applications, e.g., action recognition, autonomous driving and video editing. With the progress in deep learning, FlowNet [3], FlowNet2 [13] and others achieve good results through end-to-end supervised learning, while SpyNet [18], PWC-Net [26] and LiteFlowNet [11] adopt a coarse-to-fine strategy to refine the initial flow. It is worth mentioning that PWC-Net and LiteFlowNet have smaller model sizes and are easier to train. Based on weight sharing and residual subnetworks, Hur and Roth [12] learn bi-directional optical flow and occlusion estimation jointly; bilateral refinement of flow and occlusion addresses blurry estimation, particularly near motion boundaries. With global and local correlation layers, GLU-Net [29] resolves the challenges of large displacements, pixel accuracy, and appearance changes.

In this work, we estimate bi-directional flow fields to represent dense pixel correspondence between the profile and frontal faces, which are then exploited to obtain frontal view features and preserve illumination condition, respectively.

Fig. 2. The architecture of our FFWM. The Illumination Preserving Module facilitates the synthesized frontal image \(\hat{I}\) to preserve both the illumination and the facial details through two independent pathways. Based on skip connections, the Warp Attention Module helps synthesize the frontal image effectively. Losses are shown in red, where \( \hat{I}^w \) is the synthesized image \( \hat{I} \) warped by \( \varPhi ' \) and \( \hat{I}^G \) is the guided filter output. (Color figure online)

3 Proposed Method

Let {\(I, I^{gt}\)} be a pair of profile and frontal face images of the same person. Given a profile image I, our goal is to train a model \(\mathcal {R}\) to synthesize the corresponding frontal face image \( \hat{I} = \mathcal {R}(I) \), which is expected to be photo-realistic and illumination preserving. To achieve this, we propose the Flow-based Feature Warping Model (FFWM). As shown in Fig. 2, FFWM takes U-net [20] as the backbone and incorporates the Illumination Preserving Module (IPM) and the Warp Attention Module (WAM) to synthesize \(\hat{I}\). In addition, FFWM estimates optical flow fields which are fed to IPM and WAM to conduct frontalization. Specifically, we compute the forward and reverse flow fields to warp the profile to the frontal view and vice versa, respectively.

In this section, we first introduce the bi-directional flow fields estimation in Sect. 3.1. IPM and WAM are introduced in Sect. 3.2 and Sect. 3.3. Finally, the loss functions are detailed in Sect. 3.4.

3.1 Bi-directional Flow Fields Estimation

Face frontalization can be viewed as a face rotation transformation, and the flow field can model this rotation by establishing a pixel-level correspondence between the profile and frontal faces. Traditional optical flow methods [3, 13] take two frames as input, whereas we only use one profile image. In this work, we adopt the FlowNetSD of FlowNet2 [13] as our flow estimation network and change the input channels from 6 (two frames) to 3 (one image). We estimate the reverse flow field \( \varPhi ' \) for illumination preservation and the forward flow field \( \varPhi \) for frontalization, both from the profile image.
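
To make the single-image input concrete, the change amounts to replacing the first convolution of the flow network. The sketch below is illustrative only; the output channels and kernel size are assumptions rather than the exact FlowNetSD configuration:

```python
import torch.nn as nn

# Hypothetical first layer of a FlowNetSD-like estimator: the standard network
# stacks two RGB frames (6 input channels), whereas here a single profile image
# (3 channels) is fed in. Output channels and kernel size are illustrative.
first_conv = nn.Conv2d(in_channels=3, out_channels=64, kernel_size=3, padding=1)
```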

Reverse Flow Field. Given the profile image I, the reverse flow estimation network \( \mathcal {F'} \) predicts the reverse flow field \( \varPhi ' \), which can warp the ground-truth frontal image \( I^{gt} \) to the profile view of I:

$$\begin{aligned} \varPhi ' =\mathcal {F'}(I;\varTheta _\mathcal {F'}), \end{aligned}$$
(1)
$$\begin{aligned} {I^w}' = \mathcal {W}(I^{gt},\varPhi '), \end{aligned}$$
(2)

where \(\varTheta _\mathcal {F'}\) denotes the parameters of \( \mathcal {F'} \), and \(\mathcal {W}(\cdot )\) [14] is the bilinear sampling operation. To learn an accurate reverse flow field, \( \mathcal {F}' \) is pretrained with the landmark loss [15], the sampling correctness loss [19] and the regularization term [19].

Forward Flow Field. Given the profile image I, the forward flow estimation network \( \mathcal {F} \) predicts the forward flow field \( \varPhi \), which can warp I to the frontal view:

$$\begin{aligned} \varPhi =\mathcal {F}(I;\varTheta _\mathcal {F}), \end{aligned}$$
(3)
$$\begin{aligned} I^w = \mathcal {W}(I,\varPhi ), \end{aligned}$$
(4)

where \(\varTheta _\mathcal {F}\) denotes the parameters of \( \mathcal {F} \). To learn an accurate forward flow field, \( \mathcal {F} \) is pretrained with the same losses as \( \mathcal {F}' \).

The two flow fields \( \varPhi ' \) and \( \varPhi \) are then used by the IPM and WAM to generate illumination preserving and photo-realistic frontal images.
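
For reference, the bilinear sampling operation \(\mathcal {W}(\cdot )\) used in Eqs. (2) and (4) can be sketched in PyTorch as follows. This is a minimal re-implementation that assumes the flow is stored in pixel units; it is not the authors' released code:

```python
import torch
import torch.nn.functional as F

def flow_warp(x, flow):
    """Warp an image/feature map x with a dense flow field via bilinear sampling.

    x:    (N, C, H, W) tensor to be warped.
    flow: (N, 2, H, W) flow field in pixel units; flow[:, 0] is the horizontal
          (x) offset and flow[:, 1] the vertical (y) offset.
    """
    n, _, h, w = x.shape
    # Base sampling grid of pixel coordinates.
    ys, xs = torch.meshgrid(torch.arange(h, device=x.device),
                            torch.arange(w, device=x.device), indexing="ij")
    grid = torch.stack((xs, ys), dim=0).float()          # (2, H, W)
    grid = grid.unsqueeze(0) + flow                      # add flow offsets
    # Normalize to [-1, 1] as required by grid_sample.
    grid_x = 2.0 * grid[:, 0] / max(w - 1, 1) - 1.0
    grid_y = 2.0 * grid[:, 1] / max(h - 1, 1) - 1.0
    grid = torch.stack((grid_x, grid_y), dim=-1)         # (N, H, W, 2)
    return F.grid_sample(x, grid, mode="bilinear", align_corners=True)
```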

3.2 Illumination Preserving Module

Without considering the inconsistent illumination in face datasets, existing frontalization methods potentially overfit to the wrong illumination. To effectively decouple the illumination and the facial details, and hence to synthesize illumination preserving faces with fine details, we propose the Illumination Preserving Module (IPM). As shown in Fig. 2, IPM consists of two pathways. The illumination preserving pathway ensures that the illumination condition of the synthesized image \(\hat{I}\) is consistent with that of the profile I. The illumination adaption pathway ensures that the facial details of the synthesized image \(\hat{I}\) are consistent with those of the ground-truth \(I^{gt}\).

Illumination Preserving Pathway. Because the illumination condition is diverse and cannot be quantified as a label, it is hard to learn a reliable and independent illumination representation from face images. Instead of constraining the illumination consistency between the profile and the synthesized image in the feature space, we constrain it directly in the image space. As shown in Fig. 2, in the illumination preserving pathway, we first use the reverse flow field \({\varPhi }'\) to warp the synthesized image \( \hat{I} \) to the profile view,

$$\begin{aligned} \hat{I}^w = \mathcal {W}(\hat{I}, \varPhi '). \end{aligned}$$
(5)

Then an illumination preserving loss is defined on the warped synthesized image \( \hat{I}^w \) to constrain the illumination consistency between the synthesized image \( \hat{I} \) and the profile I. By minimizing it, FFWM can synthesize illumination preserving frontal images.

Illumination Adaption Pathway. The illumination preserving pathway cannot ensure the consistency of facial details between the synthesized image \(\hat{I}\) and the ground-truth \( I^{gt} \), so we constrain it in the illumination adaption pathway. Since the illumination of the profile I is inconsistent with that of the ground-truth \( I^{gt} \) under large poses, adding constraints directly between \(\hat{I}\) and \( I^{gt} \) would eliminate the illumination consistency between \( \hat{I}\) and I. Therefore, a guided filter layer [7] is first used to transfer the illumination of images. Specifically, the guided filter takes \(I^{gt}\) as the input image and \(\hat{I}\) as the guidance image,

$$\begin{aligned} \hat{I}^G = \mathcal {G}(\hat{I},I^{gt}), \end{aligned}$$
(6)

where \( \mathcal {G}(\cdot ) \) denotes the guided filter, and we set the filter radius to a quarter of the image resolution. After filtering, the guided filter result \( \hat{I}^G \) has the same illumination as \( I^{gt} \) while keeping the same facial details as \( \hat{I} \). The illumination-related losses (e.g., the pixel-wise loss and the perceptual loss) are then defined on \( \hat{I}^G \) to encourage our model to synthesize \( \hat{I} \) with much finer details. In this way, \(\hat{I}\) becomes much more similar to \(I^{gt}\) in facial details without breaking the illumination consistency between \( \hat{I} \) and I.

Note that the guided filter has no trainable parameters and can potentially cause our model to get trapped in local minima during training. We therefore apply the guided filter only after several training iterations, which provides a stable and robust initialization for our model.
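
For concreteness, the guided filtering step of Eq. (6) can be sketched as below. This is a minimal per-channel box-filter formulation of the classic guided filter; the epsilon value is an assumption, and the guided filter layer of [7] may differ in implementation details:

```python
import torch.nn.functional as F

def box_filter(x, r):
    """Local mean with a (2r+1)x(2r+1) window, implemented with avg_pool2d."""
    return F.avg_pool2d(x, kernel_size=2 * r + 1, stride=1, padding=r,
                        count_include_pad=False)

def guided_filter(guide, src, r, eps=1e-2):
    """Guided filter G(guide, src): guide = I_hat (guidance), src = I_gt (input).

    The output follows the local structure (facial details) of the guide while
    retaining the smooth, low-frequency content (illumination) of src.
    r is the radius; the paper sets it to a quarter of the image resolution.
    """
    mean_g = box_filter(guide, r)
    mean_s = box_filter(src, r)
    cov_gs = box_filter(guide * src, r) - mean_g * mean_s
    var_g = box_filter(guide * guide, r) - mean_g * mean_g
    a = cov_gs / (var_g + eps)          # per-window linear coefficients
    b = mean_s - a * mean_g
    return box_filter(a, r) * guide + box_filter(b, r)
```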

3.3 Warp Attention Module

The large pose discrepancy makes it difficult to synthesize correct facial details in the synthesized images. To reduce the pose discrepancy between the profile and frontal face, Warp Attention Module (WAM) is proposed to align the profile face to the frontal one in the feature space. We achieve this by warping the profile features guided by the forward flow field \( \varPhi \). The architecture of our WAM is illustrated in Fig. 3. It contains two steps: flow-guided feature warping and feature attention.

Fig. 3. The architecture of the Warp Attention Module. Considering the symmetry prior of the human face, WAM also takes the flipped warped feature as input.

Flow-Guided Feature Warping. Because the profile and frontal faces have different visible areas, the forward flow field \( \varPhi \) cannot establish a complete pixel-level correspondence between them. Hence, warping the profile face directly leads to artifacts. Here we use \( \varPhi \) with the bilinear sampling operation \(\mathcal {W}(\cdot )\) to warp the profile face to the frontal view in the feature space. Additionally, we exploit the symmetry prior of the human face and take both the warped feature and its horizontal flip to guide the frontal image synthesis.

$$\begin{aligned} f_w = \mathcal {W}(f, \varPhi ) , \end{aligned}$$
(7)

where f denotes the encoder feature of the profile. Let \( {f_w}' \) denote the horizontal flip of \( f_w \), and \( (f_w \oplus {f_w}') \) denote the concatenation of \( f_w \) and \( {f_w}' \).

Feature Attention. After warping, the warped feature encodes background and self-occlusion artifacts, which degrade the frontalization performance. To eliminate this issue and extract a reliable frontal feature, an attention mechanism is used to adaptively focus on the critical parts of \( (f_w \oplus {f_w}') \). The warped feature \( (f_w \oplus {f_w}') \) is first fed into a Conv-BN-ReLU-ResidualBlock layer to generate an attention map A, which has the same height, width and channel size as \( (f_w \oplus {f_w}') \). The reliable frontalized feature \(\hat{f}\) is then obtained by

$$\begin{aligned} \hat{f} = A \otimes (f_w \oplus {f_w}') , \end{aligned}$$
(8)

where \(\otimes \) denotes element-wise multiplication. \(\hat{f}\) is then skip connected to the decoder to help generate photo-realistic frontal face image \( \hat{I} \).
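
A minimal sketch of WAM, reusing flow_warp() from Sect. 3.1, is given below. The attention branch is simplified to a Conv-BN-ReLU stack with a sigmoid rather than the exact Conv-BN-ReLU-ResidualBlock layer, and the flow field is assumed to be resized to the feature resolution beforehand:

```python
import torch
import torch.nn as nn

class WarpAttention(nn.Module):
    """Sketch of the Warp Attention Module: warp -> flip -> concat -> attention."""
    def __init__(self, channels):
        super().__init__()
        self.attn = nn.Sequential(                     # simplified attention branch
            nn.Conv2d(2 * channels, 2 * channels, 3, padding=1),
            nn.BatchNorm2d(2 * channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(2 * channels, 2 * channels, 3, padding=1),
            nn.Sigmoid(),                              # attention map A in [0, 1]
        )

    def forward(self, f, flow):
        f_w = flow_warp(f, flow)                       # Eq. (7): align profile feature to frontal view
        f_w_flip = torch.flip(f_w, dims=[3])           # horizontal flip (symmetry prior)
        feat = torch.cat((f_w, f_w_flip), dim=1)       # (f_w ⊕ f_w')
        A = self.attn(feat)                            # same height, width, channels as feat
        return A * feat                                # Eq. (8): element-wise re-weighting
```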

3.4 Loss Functions

In this section, we formulate the loss functions used in our work. The background of images is masked to make the loss functions focus on the facial area.

Pixel-Wise Loss. Following [8, 10], we employ a multi-scale pixel-wise loss on the guided filter result \( \hat{I}^G \) to constrain the content consistency,

$$\begin{aligned} \mathcal {L}_{pixel}= \sum _{s=1}^{S} \left\| \hat{I}^G_s - I^{gt}_s \right\| _1 , \end{aligned}$$
(9)

where S denotes the number of scales. In our experiments, we set S = 3, and the scales are 32 \(\times \) 32, 64 \(\times \) 64 and 128 \(\times \) 128.
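
A sketch of this multi-scale term is given below. Building the pyramid by bilinear downsampling of \( \hat{I}^G \) and \( I^{gt} \) is an assumption (intermediate network outputs could be supervised instead), and the L1 term is mean-reduced for simplicity:

```python
import torch.nn.functional as F

def multiscale_l1(pred, target, scales=(32, 64, 128)):
    """Multi-scale L1 loss of Eq. (9); inputs are assumed to be 128x128 images."""
    loss = 0.0
    for s in scales:
        p = F.interpolate(pred, size=(s, s), mode="bilinear", align_corners=False)
        t = F.interpolate(target, size=(s, s), mode="bilinear", align_corners=False)
        loss = loss + F.l1_loss(p, t)
    return loss
```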

Perceptual Loss. The pixel-wise loss tends to generate over-smoothed results. To alleviate this, we introduce a perceptual loss defined on a VGG-19 network [25] pre-trained on ImageNet [21],

$$\begin{aligned} \mathcal {L}_{p}= \sum _{i}w_i \left\| \phi _i(\hat{I}^G) - \phi _i(I^{gt}) \right\| _1 , \end{aligned}$$
(10)

where \(\phi _i(\cdot )\) denotes the output of the i-th VGG-19 layer. In our implementation, we use the Conv1-1, Conv2-1, Conv3-1, Conv4-1 and Conv5-1 layers and set \( w = \{1, 1/2, 1/4, 1/4, 1/8\} \). To improve the synthesized imagery in particular facial regions, we also apply the perceptual loss to facial regions such as the eyes, nose and mouth.
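
The whole-image part of this loss can be sketched as follows. The torchvision layer indices are a mapping of Conv1-1 to Conv5-1 onto their ReLU outputs; the additional region-wise terms and ImageNet input normalization are omitted for brevity:

```python
import torch.nn as nn
import torch.nn.functional as F
import torchvision.models as models

class VGGPerceptualLoss(nn.Module):
    """Whole-image perceptual loss of Eq. (10) on a frozen VGG-19."""
    def __init__(self):
        super().__init__()
        vgg = models.vgg19(weights=models.VGG19_Weights.IMAGENET1K_V1).features.eval()
        for p in vgg.parameters():
            p.requires_grad = False
        self.vgg = vgg
        self.layer_ids = [1, 6, 11, 20, 29]           # relu1_1 ... relu5_1
        self.weights = [1.0, 0.5, 0.25, 0.25, 0.125]  # w in Eq. (10)

    def forward(self, x, y):
        loss, fx, fy = 0.0, x, y
        for i, layer in enumerate(self.vgg):
            fx, fy = layer(fx), layer(fy)
            if i in self.layer_ids:
                loss = loss + self.weights[self.layer_ids.index(i)] * F.l1_loss(fx, fy)
            if i == self.layer_ids[-1]:
                break
        return loss
```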

Adversarial Loss. Following [24], we adopt a multi-scale discriminator and adversarial learning to help synthesize photo-realistic images.

$$\begin{aligned} \mathcal {L}_{adv} = \min _{R}\max _{D} \mathbb {E}_{I^{gt}}[ \log D(I^{gt} ) ] - \mathbb {E}_{\hat{I}^G}[\log (1-D(\hat{I}^G))]. \end{aligned}$$
(11)
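
The multi-scale discriminator can be sketched in the pix2pixHD style of [24]: the same PatchGAN is applied to the image and to a downsampled copy. The channel widths and depth below are illustrative assumptions:

```python
import torch.nn as nn

class MultiScaleDiscriminator(nn.Module):
    """Sketch of a two-scale PatchGAN discriminator in the style of [24]."""
    def __init__(self, in_ch=3, base=64):
        super().__init__()
        def patchgan():
            return nn.Sequential(
                nn.Conv2d(in_ch, base, 4, stride=2, padding=1), nn.LeakyReLU(0.2, True),
                nn.Conv2d(base, base * 2, 4, stride=2, padding=1), nn.LeakyReLU(0.2, True),
                nn.Conv2d(base * 2, base * 4, 4, stride=2, padding=1), nn.LeakyReLU(0.2, True),
                nn.Conv2d(base * 4, 1, 4, stride=1, padding=1),   # patch-wise logits
            )
        self.d1, self.d2 = patchgan(), patchgan()
        self.down = nn.AvgPool2d(3, stride=2, padding=1)

    def forward(self, x):
        # One logit map per scale: full resolution and 2x downsampled.
        return [self.d1(x), self.d2(self.down(x))]
```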

Illumination Preserving Loss. To preserve the illumination of the profile I in the synthesized image \( \hat{I} \), we define the illumination preserving loss on the warped synthesized image \( \hat{I}^w \) at different scales,

$$\begin{aligned} \mathcal {L}_{ip}= \sum _{s=1}^{S} \left\| \hat{I}^w_s - I_s \right\| _1 , \end{aligned}$$
(12)

where S denotes the number of scales, and the scale setting is the same as in Eq. (9).

Identity Preserving Loss. Following [8, 10], we present an identity preserving loss to preserve the identity information of the synthesized image \( \hat{I} \),

$$\begin{aligned} \mathcal {L}_{id}=\left\| \psi _{fc2} (\hat{I})-\psi _{fc2} (I^{gt} ) \right\| _1 + \left\| \psi _{pool } (\hat{I})-\psi _{pool } (I^{gt}) \right\| _1, \end{aligned}$$
(13)

where \( \psi (\cdot ) \) denotes the pretrained LightCNN-29 [30], and \( \psi _{fc2}(\cdot ) \) and \( \psi _{pool}(\cdot ) \) denote the outputs of the fully connected layer and the last pooling layer, respectively. To preserve the identity information, we apply the identity loss to both \( \hat{I} \) and \( \hat{I}^G \).
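
A sketch of Eq. (13) is given below; the assumed interface is a frozen LightCNN-29 that returns both the last pooling feature and the fc2 feature:

```python
import torch.nn as nn
import torch.nn.functional as F

class IdentityLoss(nn.Module):
    """Identity preserving loss of Eq. (13). `lightcnn` is assumed to be a
    pretrained LightCNN-29 returning (last_pool_feature, fc2_feature)."""
    def __init__(self, lightcnn):
        super().__init__()
        self.lightcnn = lightcnn.eval()
        for p in self.lightcnn.parameters():
            p.requires_grad = False     # keep the feature extractor frozen

    def forward(self, img, target):
        pool_a, fc2_a = self.lightcnn(img)
        pool_b, fc2_b = self.lightcnn(target)
        return F.l1_loss(fc2_a, fc2_b) + F.l1_loss(pool_a, pool_b)
```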

Overall Losses. Finally, we combine all the above losses to give the overall model objective,

$$\begin{aligned} \mathcal {L}= \lambda _0 \mathcal {L}_{pixel} + \lambda _1 \mathcal {L}_{p} + \lambda _2 \mathcal {L}_{adv} + \lambda _3 \mathcal {L}_{ip} + \lambda _4 \mathcal {L}_{id}, \end{aligned}$$
(14)

where \( \lambda _{*} \) denote the tradeoff parameters of the different losses.
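
Putting the pieces together, a hypothetical composition of Eq. (14) could look like the following, reusing the sketches above and the weights reported in Sect. 4.1. The generator-side adversarial term is one common choice, not necessarily the exact formulation used here:

```python
import torch

# Loss weights as reported in Sect. 4.1 (lambda_0 ... lambda_4).
LAMBDAS = dict(pixel=5.0, p=1.0, adv=0.1, ip=15.0, id=1.0)

def total_loss(I, I_gt, I_hat, I_hat_G, flow_rev, perceptual, identity, d_fake):
    """Hypothetical composition of Eq. (14). `perceptual` and `identity` are the
    loss modules sketched above; `d_fake` is the list of discriminator logits on
    I_hat_G (one per scale). Relies on flow_warp() and multiscale_l1() above."""
    I_hat_w = flow_warp(I_hat, flow_rev)                       # Eq. (5)
    L_pixel = multiscale_l1(I_hat_G, I_gt)                     # Eq. (9)
    L_p     = perceptual(I_hat_G, I_gt)                        # Eq. (10)
    L_adv   = sum(-torch.log(torch.sigmoid(d) + 1e-8).mean()   # generator side of Eq. (11)
                  for d in d_fake)
    L_ip    = multiscale_l1(I_hat_w, I)                        # Eq. (12)
    L_id    = identity(I_hat, I_gt) + identity(I_hat_G, I_gt)  # Eq. (13), on both outputs
    return (LAMBDAS["pixel"] * L_pixel + LAMBDAS["p"] * L_p + LAMBDAS["adv"] * L_adv
            + LAMBDAS["ip"] * L_ip + LAMBDAS["id"] * L_id)
```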

4 Experiments

To show that our model can synthesize photo-realistic and illumination preserving images while preserving identity, we evaluate it qualitatively and quantitatively under both controlled and in-the-wild settings. In the following subsections, we begin with an introduction of the datasets and implementation details. We then demonstrate the merits of our model through qualitative synthesis results and quantitative recognition results over the state-of-the-art methods. Lastly, we conduct an ablation study to demonstrate the benefits of each part of our model.

4.1 Experimental Settings

Fig. 4. Synthesis results of our model on the Multi-PIE dataset under large poses and illumination inconsistent conditions. Each triplet shows the profile (left), the synthesized frontal face (middle) and the ground-truth frontal face (right).

Datasets. We adopt the Multi-PIE dataset [4] as our training and testing set. Multi-PIE is widely used for evaluating face synthesis and recognition in the controlled setting. It contains 754,204 images of 337 identities under 15 poses and 20 illumination conditions. In this paper, the face images with neutral expression under 20 illuminations and 13 poses within \( \pm 90^\circ \) are used. For a fair comparison, we follow the test protocols in [8] and use two settings to evaluate our model. The first setting (Setting 1) only contains images from Session 1. The training set is composed of the images of the first 150 identities. For testing, one gallery image with frontal view and normal illumination is used for each of the remaining 99 identities. For the second setting (Setting 2), we use neutral expression images from all four sessions. The first 200 identities and the remaining 137 identities are used for training and testing, respectively. Each testing identity has one gallery image with frontal view and normal illumination from its first appearance.

LFW [9] contains 13,233 face images collected in unconstrained environments. It is used to evaluate the frontalization performance in uncontrolled settings.

Implementation Details. All images in our experiments are cropped and resized to 128 \( \times \) 128 according to facial landmarks, and image intensities are linearly scaled to the range of [0, 1]. The LightCNN-29 [30] is pretrained on MS-Celeb-1M [5] and fine-tuned on the training set of Multi-PIE.

In all our experiments, we empirically set \( \lambda _0 = 5, \lambda _1 = 1, \lambda _2 = 0.1, \lambda _3 = 15, \lambda _4 = 1 \). The learning rate is initialized to 0.0004 and the batch size is 8. The flow estimation networks \( \mathcal {F} \) and \( \mathcal {F'} \) are pre-trained, and then all networks are trained end-to-end by minimizing the objective \( \mathcal {L} \), with the learning rate set to 0.00005 for \( \mathcal {F} \) and \( \mathcal {F'} \).
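
In code, the two learning rates can be realized with optimizer parameter groups. The optimizer choice (Adam) and the argument names ffwm, flow_fwd and flow_rev are assumptions, since they are not specified above:

```python
import torch

def build_optimizer(ffwm, flow_fwd, flow_rev):
    """Hypothetical setup: 4e-4 for the frontalization model, 5e-5 for the
    pretrained flow estimators F and F'."""
    return torch.optim.Adam([
        {"params": ffwm.parameters(), "lr": 4e-4},
        {"params": list(flow_fwd.parameters()) + list(flow_rev.parameters()),
         "lr": 5e-5},
    ])
```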

4.2 Qualitative Evaluation

In this subsection, we qualitatively compare the synthesized results of our model against state-of-the-art face frontalization methods. We train our model on the training set of the Multi-PIE Setting 2, and evaluate it on the testing set of the Multi-PIE Setting 2 and the LFW [9].

Figure 4 shows face synthesis results under large poses; it is obvious that our model can synthesize photo-realistic images. To demonstrate the illumination preserving strength of our model, we choose profiles with obviously inconsistent illumination. As shown in Fig. 4, the illumination of the profile faces is well preserved in the synthesized images. More synthesized results are provided in the supplementary material.

Fig. 5. Face frontalization comparison on the Multi-PIE dataset under the poses of \( 90^\circ \) (first two rows) and \( 75^\circ \) (last two rows).

Figure 5 illustrates the comparison with the state-of-the-art face frontalization methods [8, 10, 17, 31] on the Multi-PIE dataset. In the large pose cases, existing methods fail to preserve the illumination of the profiles in the synthesized results. The face shape and other facial components (e.g., eyebrows, mustache and nose) also suffer from deformation, because those methods are less able to preserve reliable details from the profiles. Compared with the existing methods, our method produces more identity preserving results while keeping the facial details of the profiles as much as possible. In particular, under large poses, our model can recover the photo-realistic illumination conditions of the profiles, which is important when frontalized images are used for other face-related tasks, such as face editing, face pose transfer and face-to-face synthesis.

Fig. 6. Face frontalization comparison on the LFW dataset. Our method is trained on Multi-PIE and tested on LFW.

Table 1. Rank-1 recognition rates (%) across poses under Setting 2 of the Multi-PIE. The best two results are highlighted by bold and underline respectively.

We further qualitatively compare the face frontalization results of our model on the LFW dataset with [6, 10, 17, 28, 34]. As shown in Fig. 6, the existing methods fail to recover clear global structures and fine facial details, and they cannot preserve the illumination of the profiles. Though FNM [17] generates high-quality images, it still fails to preserve identity. It is worth noting that our method produces more photo-realistic faces with identity and illumination well preserved, which also demonstrates the generalizability of our model in the uncontrolled environment. More results under large poses are provided in the supplementary material.

4.3 Quantitative Evaluation

In this subsection, we quantitatively compare the proposed method with other methods in terms of recognition accuracy on Multi-PIE and LFW. The recognition accuracy is calculated by first extracting deep features with LightCNN-29 [30] and then measuring the similarity between features with a cosine-distance metric.
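
A sketch of this identification protocol, with features assumed to be extracted beforehand by LightCNN-29, is given below:

```python
import torch
import torch.nn.functional as F

def rank1_accuracy(probe_feats, probe_ids, gallery_feats, gallery_ids):
    """Rank-1 identification with cosine similarity.

    probe_feats/gallery_feats: (Np, D)/(Ng, D) deep features.
    probe_ids/gallery_ids:     1-D integer identity labels.
    """
    p = F.normalize(probe_feats, dim=1)
    g = F.normalize(gallery_feats, dim=1)
    sim = p @ g.t()                          # cosine similarity matrix
    pred = gallery_ids[sim.argmax(dim=1)]    # nearest gallery identity per probe
    return (pred == probe_ids).float().mean().item()
```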

Table 1 shows the Rank-1 recognition rates of different methods under Setting 2 of Multi-PIE. Our method has advantages over the competitors, especially under large poses (e.g., \( 75^\circ \), \( 90^\circ \)), which demonstrates that our model can synthesize frontal images while preserving the identity information. The recognition rates under Setting 1 are provided in the supplementary material.

Table 2. Face verification accuracy (ACC) and area-under-curve (AUC) results on LFW.

Table 2 compares the face verification performance (ACC and AUC) of our method with other state-of-the-art methods [8, 16, 23, 31, 32] on LFW. Our method achieves an accuracy of 99.65 and an AUC of 99.92, which is comparable with the other state-of-the-art methods. These quantitative results show that our method is able to preserve the identity information effectively.

4.4 Ablation Study

Fig. 7. Model comparison: synthesis results of our model and its variants on Multi-PIE.

Table 3. Incomplete variants analysis: Rank-1 recognition rates (%) across poses under Setting 2 of the Multi-PIE dataset. IAP and IPP denote the illumination adaption pathway and illumination preserving pathway in the Illumination Preserving Module (IPM). Warp, flip and att denote the three variants in Warp Attention Module (WAM).

In this subsection, we analyze the respective roles of the different modules and loss functions in frontal view synthesis. Both qualitative perceptual performance (Fig. 7) and face recognition rates (Table 3) are reported for a comprehensive comparison under the Multi-PIE Setting 2. Our FFWM exceeds all its variants in both quantitative and qualitative evaluations.

Effects of the Illumination Preserving Module (IPM). Although the recognition rates drop only slightly without IPM (as shown in Table 3), the synthesized results cannot preserve illumination and instead approximate the inconsistent ground-truth illumination (as shown in Fig. 7). We also explore the contributions of the illumination adaption pathway (IAP) and the illumination preserving pathway (IPP) in the IPM. As shown in Fig. 7, without IPP, the illumination of the synthesized images tends to be inconsistent with both the profiles and the ground-truth images. Without IAP, the illumination of the synthesized images tends to be a tradeoff between the profiles and the illumination inconsistent ground-truth images. Only by integrating IPP and IAP together can our model achieve illumination preserving image synthesis. Furthermore, our model achieves a lower recognition rate when the IPP is removed, which demonstrates that the IPP prompts the synthesized results to keep reliable information from the profiles.

Effects of the Warp Attention Module (WAM). Without WAM, the synthesized results tend to be smooth and distorted in the self-occluded parts (as shown in Fig. 7). As shown in Table 3, without WAM, the recognition rates drop significantly, which proves that WAM dominates in preserving identity information. Moreover, we explore the contributions of three components of the WAM: taking the flipped warped feature as an additional input (w/o flip), feature warping (w/o warp) and feature attention (w/o att). As shown in Fig. 7, taking the flipped feature as an additional input helps recover the self-occluded parts of the synthesized images. Without the feature attention mechanism, artifacts appear in the synthesized images. Without feature warping, the synthesized results show worse visual quality. These results suggest that each component of WAM is essential for synthesizing identity preserving and photo-realistic frontal images.

Effects of the Losses. As shown in Table 3, the recognition rates decrease if any one loss function is removed. In particular, the rates drop significantly for all poses if the \( \mathcal {L}_{id} \) loss is not adopted. We also report the qualitative visualization results in Fig. 7. Without the \( \mathcal {L}_{adv} \) loss, the synthesized images tend to be blurry, justifying the use of adversarial learning. Without \( \mathcal {L}_{id} \) and \( \mathcal {L}_{pixel} \), our model cannot guarantee the visual quality of local textures (e.g., eyes). Without \(\mathcal {L}_p\), the synthesized faces present artifacts at edges (e.g., face and hair).

5 Conclusion

In this paper, we propose a novel Flow-based Feature Warping Model (FFWM) to effectively address a challenging problem in face frontalization: photo-realistic and illumination preserving image synthesis with illumination inconsistent supervision. Specifically, an Illumination Preserving Module is proposed to address the illumination inconsistency issue. It helps FFWM synthesize photo-realistic frontal images while preserving the illumination of the profile images. Furthermore, the proposed Warp Attention Module reduces the pose discrepancy in the feature space and helps to synthesize frontal images effectively. Experimental results demonstrate that our method not only synthesizes photo-realistic and illumination preserving results but also outperforms state-of-the-art methods on face recognition across large poses.