
1 Introduction

3D face reconstruction refers to synthesizing a 3D face model from a single input face photo. It has a wide range of applications, such as face recognition and digital entertainment [25]. Reconstructing a 3D face model from a single photo is a classical and fundamental problem in computer vision, yet existing methods mainly concentrate on unobstructed faces, which limits their practical applicability. The task becomes particularly challenging under occlusion, where parts of the facial structure are invisible. Over the past five years, the related problem of face inpainting has progressed to the point where plausible face photos can be generated even in extreme scenes [15].

Current learning-based methods cannot robustly predict the 3D texture of the occluded area of a face; when faces are partially occluded, existing methods often reconstruct the occluded area indiscriminately. With the assistance of a face parsing map, we identify the occluded area and reconstruct the input image into a reasonable 3D face model. The main contributions are summarized as follows:

  • We propose a novel algorithm that combines feature points and a face parsing map to generate a face with complete facial features.

  • To address the problem of invisible face areas in occluded scenes, we propose synthesizing the input face photo with a Generative Adversarial Network rather than reconstructing the 3D face directly.

  • We improve the loss function of our 3D reconstruction framework for occluded scenes. Our method obtains state-of-the-art qualitative performance on real-world images.

2 Related Works

2.1 Generic Face Reconstruction

Classic methods fit reference 3D face models to the input face photo. Some recent techniques use Convolutional Neural Networks (CNNs) to regress landmark locations from the raw face image, while others first use CNNs to predict the 3DMM parameters from the input face image.

2.2 Face Image Synthesis

Deep pixel-level face generation has been studied for several years, and many methods achieve remarkable results. EdgeConnect [12] shows impressive progress by disentangling generation into two stages: an edge generator and an image completion network. Contextual Attention [22] takes a similar two-step approach: it first produces a base estimate of the invisible region, and a refinement block then sharpens the result using background patches. The typical limitations of current face image generation schemes are the need for manual manipulation, the complexity of the underlying architectures, the degradation in accuracy, and the inability to restrict modifications to a local region.

3 Our Approach

3.1 Landmark Prediction Task

Figure 1 shows the entire pipeline of our work. In the landmark prediction task, we found that accurately predicting the 68 feature points \({\mathbf {{Z}}_{\mathbf {lmk}}}\in {{\mathbb {R}}^{2\times 68}}\) is a crucial step under occlusion scenes. The architecture \({{\mathcal {N}}_\text {lmk}}\) generates landmarks from a corrupted face photo \({\text {I}_\text {cor}}\): \({\mathbf {Z}_\mathbf {lmk}}{=}{{\mathcal {N}}_{\text {lmk}}}\left( {{\text {I}}_{\text {cor}}};{{\theta }_{lmk}} \right) \), where \({{\theta }_{lmk}}\) denotes the trainable parameters. Since we prioritize efficiency for the subsequent face parsing map generation task, we built a sufficiently effective \({{\mathcal {N}}_\text {lmk}}\) upon MobileNet-V3 [6]. Unlike traditional landmark detectors, \({{\mathcal {N}}_\text {lmk}}\) focuses on feature extraction; the final module regresses the landmarks by fully connecting the fused feature maps. We set the loss function \({{\mathcal {L}}_{lmk}}\) as follows:

$$\begin{aligned} {{\mathcal {L}}_{lmk}}\text {=}\left\| \mathbf {{Z}_{_{lmk}}^{(i)}}-\mathbf {{\hat{Z}}_{_{gt}}^{(i)}} \right\| _{2}^{2} \end{aligned}$$
(1)

where \(\mathbf {{\hat{Z}}_{_{gt}}^{(i)}}\) denotes the ith ground truth face landmarks.
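A minimal sketch of how such a landmark regressor might be assembled in PyTorch (the use of torchvision's MobileNet-V3-Large backbone, the pooling/head dimensions and the batched loss reduction are our illustrative assumptions, not the authors' released code):

```python
import torch
import torch.nn as nn
from torchvision.models import mobilenet_v3_large

class LandmarkNet(nn.Module):
    """Illustrative N_lmk: MobileNet-V3 features plus a fully connected head
    that regresses 68 (x, y) landmark coordinates."""
    def __init__(self, num_landmarks=68):
        super().__init__()
        backbone = mobilenet_v3_large(weights=None)
        self.features = backbone.features               # feature extraction only
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.head = nn.Linear(960, num_landmarks * 2)    # 960 channels in MobileNet-V3-Large

    def forward(self, img):                              # img: (B, 3, 256, 256)
        f = self.pool(self.features(img)).flatten(1)
        return self.head(f).view(-1, 2, 68)              # Z_lmk in R^{2x68}

def landmark_loss(z_pred, z_gt):
    """Eq. (1): squared L2 distance between predicted and ground-truth landmarks."""
    return ((z_pred - z_gt) ** 2).sum(dim=(1, 2)).mean()
```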

Fig. 1. Overview of our pipeline. We first remove the occluded area and reconstruct the face with complete facial features. Then we utilize ResNet-50 and a texture refinement network to reconstruct the final 3D model.

3.2 Face Parsing Map Generation

Pixel-level recognition of occlusion and face skin areas is a prerequisite for our framework to ensure accuracy. To benefit from the annotated face dataset CelebAMask-HQ [10], we use an encoder-decoder architecture \({{\mathcal {N}}_{\alpha }}\) based on U-Net [17] to estimate pixel-level label classes. Given a squarely resized face image \({{\mathbf {I}}_\mathbf {fac}}\in {{\mathbb {R}}^{H\times W\times 3}}\), we apply the trained face parsing model \({{\mathcal {N}}_{\alpha }}\) to obtain the parsing map \({\mathbf {M}_{\boldsymbol{\alpha }}}\in {{\mathbb {R}}^{H\times W\times 1}}\). On the other hand, given the landmarks \({\mathbf {Z}_\mathbf {lmk}}\in {{\mathbb {R}}^{2\times 68}}\), we connect the feature points to form regions; these regions yield a parsing map \({\mathbf {M}_{\boldsymbol{\beta }} } \in {\mathbb {R}^{H \times W \times 1}}\) that covers the facial features. Note that, in our work, we assume the facial features comprise only five parts: facial skin, eyebrows, eyes, nose and lips. The final map \({\mathbf {M}_{\boldsymbol{\gamma }} } \in {\mathbb {R}^{H \times W \times 1}}\) (see Fig. 2), which is free of occluding objects, is obtained by combining \({\mathbf {M}_{\boldsymbol{\alpha }} }\) and \({\mathbf {M}_{\boldsymbol{\beta }} }\). To generate a \({\mathbf {M}_{\boldsymbol{\gamma }} }\) that includes the complete facial features, we designed Algorithm 1 (a code-level sketch is given below).

Fig. 2. Our face parsing map generation module, which follows Algorithm 1. The results in the figure show that our method successfully removes the occlusion caused by fingers and hair.

Algorithm 1
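Algorithm 1 itself is not reproduced here; the following is a minimal sketch of the kind of mask-merging step it performs, under our own assumptions about the label encoding (the label ids and the fill strategy are illustrative, not the authors' exact procedure):

```python
import numpy as np

# Illustrative label ids; the real CelebAMask-HQ label set is larger.
FEATURE_LABELS = {1, 2, 3, 4, 5}    # skin, eyebrows, eyes, nose, lips (assumed ids)
OCCLUSION_LABELS = {14, 15}         # occluder-type classes (assumed ids)

def merge_parsing_maps(m_alpha, m_beta):
    """Sketch of Algorithm 1: combine the predicted parsing map M_alpha with the
    landmark-derived facial-feature map M_beta into an occlusion-free map M_gamma.

    m_alpha, m_beta: (H, W) integer label maps.
    """
    m_gamma = m_alpha.copy()
    # Pixels that M_alpha marks as occluders but that fall inside the
    # landmark-derived facial-feature region are relabelled from M_beta,
    # so the facial features become complete again.
    occluded = np.isin(m_alpha, list(OCCLUSION_LABELS))
    inside_features = np.isin(m_beta, list(FEATURE_LABELS))
    fill = occluded & inside_features
    m_gamma[fill] = m_beta[fill]
    return m_gamma
```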

3.3 Face Image Synthesis with GAN

Face Image Synthesis Network. To benefit from the Pix2Pix architecture, we propose a Face Image Synthesis Network (FISN) \({\mathcal{N}_{\mathrm{{et}}}}\), which uses Pix2PixHD [26] as a backbone. FISN receives \({{\mathbf {I}}_{\mathbf {fac}}}\in {{\mathbb {R}}^{H\times W\times 3}}\) and \({\mathbf {M}_{\boldsymbol{\alpha }}}\) as inputs; the detailed architecture is shown in Fig. 1. To fuse \({{\mathbf {I}}_{\mathbf {fac}}}\) and \({\mathbf {M}_{\boldsymbol{\alpha } }}\), we use a Spatial Feature Transform (SFT) layer [14], which learns a mapping function \(\mathcal{M}\) that outputs a pair of affine transformation parameters \(\left( {\gamma ,\beta } \right) \) from the prior condition \({\varPsi }\) derived from the features of \({\mathbf {M}_{\boldsymbol{\alpha }} }\), i.e. \((\gamma ,\beta ) = \mathcal{M}\left( \varPsi \right) \). After obtaining \(\left( {\gamma ,\beta } \right) \), the transformation is carried out by the SFT layer:

$$\begin{aligned} SFT({\mathbf {F}_{{\mathbf {map}}}}|\gamma ,\beta ) = \gamma \odot {\mathbf {F}_{{\mathbf {map}}}} + \beta \end{aligned}$$
(2)

where \({{\mathbf {F}}_{\mathbf {map}}}\) denotes the feature maps from \(\mathbf {{{I}}_{fac}}\) and \(\odot \) denotes the Hadamard product.
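A minimal sketch of an SFT-style modulation layer in PyTorch (the small two-convolution condition network is our illustrative choice; it is not the exact layer configuration of [14] or of FISN):

```python
import torch
import torch.nn as nn

class SFTLayer(nn.Module):
    """Spatial Feature Transform: predicts (gamma, beta) from the condition Psi
    (here, features of M_alpha) and applies gamma * F_map + beta (Eq. 2)."""
    def __init__(self, cond_channels, feat_channels, hidden=64):
        super().__init__()
        self.shared = nn.Sequential(
            nn.Conv2d(cond_channels, hidden, 3, padding=1), nn.ReLU(inplace=True))
        self.to_gamma = nn.Conv2d(hidden, feat_channels, 3, padding=1)
        self.to_beta = nn.Conv2d(hidden, feat_channels, 3, padding=1)

    def forward(self, f_map, psi):
        h = self.shared(psi)                  # condition features from M_alpha
        gamma, beta = self.to_gamma(h), self.to_beta(h)
        return gamma * f_map + beta           # element-wise (Hadamard) modulation
```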

Therefore, we condition on the spatial information \({\mathbf {M}_{\boldsymbol{\alpha }} }\) and the style data \({{\mathbf {I}}_{\mathbf {fac}}}\) to generate affine parameters \(({{{x}}_{{i}}}{{,}}{{{y}}_i})\), following \(({{{x}}_{{i}}}{{,}}{{{y}}_i}) = {\mathcal{N}_{{{et}}}}\left( {{{{I}}_{{{fac}}}},{M_\alpha }} \right) \). Related research [14] showed that ordinary normalization layers tend to “wash away” semantic information. To transfer \(({{{x}}_{{i}}}{{,}}{{{y}}_i})\) to the new mask input \({\mathbf {M}_{\boldsymbol{\gamma }}}\), we utilize semantic region-adaptive normalization (SEAN) [29] on the residual blocks \({{{z}}_{{i}}}\) in FISN. Let H, W and C be the height, width and number of channels of the activation map of the deep convolutional network for a batch of N samples. The modulated activation value at each site is defined as:

$$\begin{aligned} SEAN\left( {{z_i},{x_i},{y_i}} \right) = {x_{{i}}}\frac{{{z_i} - \mu \left( {{z_i}} \right) }}{{\sigma \left( {{z_i}} \right) }} + {y_i} \end{aligned}$$
(3)

where \(\mu \left( {{z_i}} \right) \) and \(\sigma \left( {{z_i}} \right) \) are the mean and standard deviation of the activation \(\left( {{{n}} \in N,c \in C,y \in H,x \in W} \right) \) in channel c :

$$\begin{aligned} \mu \left( {{z_i}} \right) \mathrm{{ = }}\frac{1}{{NHW}}\sum \limits _{n,y,x} {{h_{n,c,y,x}}} \end{aligned}$$
(4)
$$\begin{aligned} \sigma \left( {{z}_{i}} \right) {=}\sqrt{\frac{1}{NHW}\sum \limits _{n,y,x}{\left( h_{n,c,y,x}^{2}-\mu {{\left( {{z}_{i}} \right) }^{2}} \right) }} \end{aligned}$$
(5)
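A minimal sketch of the per-channel modulated normalization in Eqs. (3)–(5), assuming the scale and offset tensors \(x_i\) and \(y_i\) have already been broadcast to the activation shape (this simplifies SEAN [29], which additionally blends per-region style codes with a learned parameter):

```python
import torch

def sean_modulation(z, x, y, eps=1e-5):
    """Eqs. (3)-(5): normalize activations z over (N, H, W) per channel,
    then modulate with scale x and offset y (already broadcast to z's shape)."""
    # z: (N, C, H, W)
    mu = z.mean(dim=(0, 2, 3), keepdim=True)                                     # Eq. (4)
    sigma = z.var(dim=(0, 2, 3), keepdim=True, unbiased=False).add(eps).sqrt()   # Eq. (5)
    return x * (z - mu) / sigma + y                                              # Eq. (3)
```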

FISN is a generator that learns the style mapping between \({{\mathbf {I}}_{\mathbf {fac}}}\) and \({\mathbf {M}_{\boldsymbol{\gamma }}}\) according to the spatial information provided by \({\mathbf {M}_{\boldsymbol{\alpha }} }\). Therefore, face features (e.g. eye style) in \({{\mathbf {I}}_{\mathbf {fac}}}\) are shifted to the corresponding positions on \({\mathbf {M}_{\boldsymbol{\gamma }} }\), so that FISN can synthesize an image \({{\mathbf {I}}_{{\mathbf {out}}}}\) with the occlusion removed.

Loss Function. The design of our loss function for FISN is inspired by Pix2PixHD [26], MaskGAN [10] and SEAN [29], which contains three components:

\((1)\ Adversarial\ loss\). Let \({{\text {D}}_{\text {1}}}\) and \({{\text {D}}_{\text {2}}}\) be two discriminators at different scales; \({\mathcal{L}_{GAN}}\) is the conditional adversarial loss defined by

$$\begin{aligned} {{\mathcal {L}}_{GAN}}{=}\mathbb {E}\left[ \log \left( {\text {D}_{1,2}}\left( {\mathbf {I}_{\mathbf {fac}}},{\mathbf {M}_{{\boldsymbol{\alpha }} }} \right) \right) \right] +\mathbb {E}\left[ \log \left( 1-{\text {D}_{1,2}}\left( {\mathbf {I}_{\mathbf {out}}},{{\mathbf {M}}_{\boldsymbol{\alpha } }} \right) \right) \right] \end{aligned}$$
(6)

\((2)\ Feature\ matching\ loss\) [26]. Let T be the total number of layers in discriminator \(\text {D}\). \({{\mathcal {L}}_{\text {fea}}}\) is the feature matching loss, which computes the \(\text {L}_{1}\) distance between the discriminator features of the real and generated face images, defined by

$$\begin{aligned} {{\mathcal {L}}_{\text {fe}a}}{=}\mathbb {E}\sum \limits _{i=1}^{T}{{{\left\| D_{1,2}^{(i)}\left( {\mathbf {I}_{\mathbf {fac}}},{{\mathbf {M}}_{\boldsymbol{\alpha } }} \right) -D_{1,2}^{(i)}\left( {\mathbf {I}_{\mathbf {out}}},{\mathbf {M}_{\mathbf {\alpha } }} \right) \right\| }_{1}}} \end{aligned}$$
(7)

\((3)\ Perceptual\ loss\) [8]. Let \(\text {N}\) be the total number of layers used to calculate the perceptual loss, \({{F}^{(i)}}\) the output feature maps of the ith layer of the VGG network [21], and \({{M}_{i}}\) the number of elements in \({{F}^{(i)}}\). \({\mathcal{L}_{{{per}}}}\) is the perceptual loss, which computes the \({{L}_{1}}\) distance between the VGG features of the real and generated face images, defined by

$$\begin{aligned} {{\mathcal {L}}_{\text {per}}}{=}\mathbb {E}\sum \limits _{i=1}^{N}{\frac{1}{{{M}_{i}}}[{{\left\| {{F}^{\left( i \right) }}\left( {\mathbf {I}_{\mathbf {fac}}} \right) -{{F}^{\left( i \right) }}\left( {\mathbf {I}_{\mathbf {out}}} \right) \right\| }_{1}}]} \end{aligned}$$
(8)

The final loss function of FISN used in our experiment is made up of the above-mentioned three loss terms as:

$$\begin{aligned} {{\mathcal {L}}_{FISN}}{=}{{\mathcal {L}}_{GAN}}+{{\lambda }_{1}}{{\mathcal {L}}_{\text {fea}}}+{{\lambda }_{2}}{{\mathcal {L}}_{\text {per}}} \end{aligned}$$
(9)

where we set \({{\lambda }_{1}}{=}{{\lambda }_{2}}{=10}\) in our experiments.
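A minimal sketch of how the three FISN loss terms (Eqs. 6–9) might be combined on the generator side, assuming lists of multi-scale discriminator feature maps and VGG feature maps are already available (all argument names and the non-saturating GAN formulation are our assumptions, not the authors' code):

```python
import torch
import torch.nn.functional as F

def fisn_generator_loss(d_fake_logits, d_feats_real, d_feats_fake,
                        vgg_feats_real, vgg_feats_fake,
                        lambda_fea=10.0, lambda_per=10.0):
    """Eq. (9): L_FISN = L_GAN + lambda_1 * L_fea + lambda_2 * L_per (generator side)."""
    # Eq. (6), generator term in non-saturating form: push D(I_out, M_alpha) towards real.
    l_gan = -torch.log(torch.sigmoid(d_fake_logits) + 1e-8).mean()
    # Eq. (7): L1 feature matching over the discriminator layers (real features detached).
    l_fea = sum(F.l1_loss(ff, fr.detach())
                for fr, ff in zip(d_feats_real, d_feats_fake))
    # Eq. (8): perceptual loss over VGG layers; F.l1_loss averages over elements,
    # which plays the role of the 1/M_i factor.
    l_per = sum(F.l1_loss(vf, vr) for vr, vf in zip(vgg_feats_real, vgg_feats_fake))
    return l_gan + lambda_fea * l_fea + lambda_per * l_per
```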

3.4 Camera and Illumination Model

Given a face image, we adopt the Basel Face Model (BFM) [16]. After the 3D face is reconstructed, it can be projected onto the image plane with the perspective projection:

$$\begin{aligned} {{V}_{2d}}\left( \mathbf {P} \right) =f*{\mathbf {P}_{\mathbf {r}}}*{\mathbf {R}}*{\mathbf {S}_{\mathbf {mod} }}+{\mathbf {t}_{\mathbf {2d}}} \end{aligned}$$
(10)

where \({{V}_{2d}}\left( \mathbf {P} \right) \) denotes the projection function that maps the 3D model to 2D face positions, f denotes the scale factor, \({\mathbf {P}_{\mathbf {r}}}\) denotes the projection matrix, \(\mathbf {R}\in SO(3)\) denotes the rotation matrix, \({{\mathbf {S}}_{\mathbf {mod}}}\) denotes the shape of the face and \({\mathbf {t}_{\mathbf {2d}}}\in {{\mathbb {R}}^{3}}\) denotes the translation vector.
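A minimal numerical sketch of Eq. (10), assuming row-vector vertices and a simple scaled projection (the exact projection matrix \({\mathbf {P}_{\mathbf {r}}}\) and the conventions used in the paper are not reproduced here):

```python
import numpy as np

def project_vertices(S_mod, R, t_2d, f, P_r=None):
    """Eq. (10): V_2d = f * P_r * R * S_mod + t_2d (sketch).

    S_mod: (N, 3) vertices of the reconstructed face shape.
    R:     (3, 3) rotation matrix.
    t_2d:  translation (first two components used here).
    f:     scalar scale factor.
    P_r:   (2, 3) projection matrix; defaults to dropping the z coordinate.
    """
    if P_r is None:
        P_r = np.array([[1.0, 0.0, 0.0],
                        [0.0, 1.0, 0.0]])
    rotated = S_mod @ R.T                 # rotate the shape
    projected = f * (rotated @ P_r.T)     # scale and project to 2D
    return projected + np.asarray(t_2d)[:2]
```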

We approximate the scene illumination with Spherical Harmonics (SH) [3]. Treating the face as a Lambertian surface, the skin texture is computed as:

$$\begin{aligned} \mathbf {C}\left( {\mathbf {r}_{\mathbf {i}}},{\mathbf {n}_{\mathbf {i}}},\boldsymbol{\gamma } \right) ={\mathbf {r}_{\mathbf {i}}}\odot \sum \limits _{b=1}^{{{B}^{2}}}{{\boldsymbol{\gamma }_{\mathbf {b}}}{{\varPhi }_{{b}}}\left( {\mathbf {n}_{\mathbf {i}}} \right) } \end{aligned}$$
(11)

where \({\mathbf {r}_{\mathbf {i}}}\) denotes the skin reflectance, \({\mathbf {n}_{\mathbf {i}}}\) denotes the surface normal, \(\odot \) denotes the Hadamard product, \(\boldsymbol{\gamma } \in {{\mathbb {R}}^{9}}\) under the monochromatic light assumption, \({{\varPhi }_{b}}:{{\mathbb {R}}^{3}}\rightarrow \mathbb {R}\) denotes the SH basis functions, B denotes the number of spherical harmonic bands and \({\boldsymbol{\gamma }_{\mathbf {b}}}\in {{\mathbb {R}}^{3}}\) (here we set \(B=3\)) denotes the corresponding SH coefficients.
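A minimal sketch of the SH shading in Eq. (11) with B = 3 bands (nine basis functions); the real SH basis constants are standard, while the per-vertex data layout is our assumption:

```python
import numpy as np

def sh_basis(n):
    """First 9 real SH basis functions Phi_b(n) for unit normals n: (N, 3)."""
    x, y, z = n[:, 0], n[:, 1], n[:, 2]
    return np.stack([
        0.282095 * np.ones_like(x),
        0.488603 * y, 0.488603 * z, 0.488603 * x,
        1.092548 * x * y, 1.092548 * y * z,
        0.315392 * (3.0 * z ** 2 - 1.0),
        1.092548 * x * z, 0.546274 * (x ** 2 - y ** 2),
    ], axis=1)                                   # (N, 9)

def shade_vertices(r, n, gamma):
    """Eq. (11): C = r (Hadamard) sum_b gamma_b * Phi_b(n).

    r:     (N, 3) per-vertex albedo.
    n:     (N, 3) unit surface normals.
    gamma: (9, 3) SH coefficients per colour channel.
    """
    irradiance = sh_basis(n) @ gamma             # (N, 3)
    return r * irradiance                        # Hadamard product
```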

Therefore, the parameters to be learned can be denoted by a vector \(\boldsymbol{y}=(\boldsymbol{\widetilde{{{\alpha }_{i}}},\widetilde{{{\beta }_{i}}},\gamma ,p})\in {{\mathbb {R}}^{175}}\), where \(\mathbf {p}\in {{\mathbb {R}}^{6}}=\{\boldsymbol{pitch,yaw,roll,f,{{t}_{2D}}\}}\) denotes the face pose. In this work, we use a fixed ResNet-50 [5] network to regress these coefficients; its loss function follows Eq. 16. We then obtain the fundamental shape \({\mathbf {S}_{\mathbf {base}}}\) (per-vertex xyz coordinates) and the coarse texture \({\mathbf {T}_{\mathbf {coa}}}\) (per-vertex rgb albedo). We use a coarse-to-fine network based on the graph convolutional networks of Lin et al. [11] to produce the fine texture \({\mathbf {T}_{\mathbf {fin}}}\).

3.5 Loss Function of 3D Reconstruction

Given a generated image \({{\mathbf {I}}_{\mathbf {out}}}\), we use the ResNet to regress the corresponding coefficient vector y. The loss function for the ResNet contains four components:

(1) Landmark Loss. As facial landmarks convey the structural information of the human face, we use a landmark loss to measure how close the projected shape landmark vertices are to the corresponding landmarks in the image \({{\mathbf {I}}_{\mathbf {out}}}\). We run the landmark prediction module \({{\mathcal {N}}_{lmk}}\) to detect 68 landmarks \(\left\{ {z}_{lmk}^{\left( n \right) } \right\} \) on the training images and obtain landmarks \(\left\{ l_{y}^{\left( n \right) } \right\} \) from the rendered facial images. We then compute the loss as:

$$\begin{aligned} {{\mathcal {L}}_{{lmk}}}\left( y \right) \text {=}\frac{1}{N}\sum \limits _{\text {n}=1}^{N}{\left\| \text {z}_{\text {lmk}}^{\left( n \right) }-l_{y}^{(n)} \right\| _{2}^{2}} \end{aligned}$$
(12)

where \({{\left\| \cdot \right\| }_{2}}\) denotes the \({{L}_{2}}\) norm.

(2) Accurate Pixel-wise Loss. The rendering layer renders back an image \(\mathbf {I}_{\mathbf {y}}^{(i)}\) to compare with the image \(\mathbf {I}_{\mathbf {out}}^{(i)}\). The pixel-wise loss is formulated as:

$$\begin{aligned} {{\mathcal {L}}_{\text {pix}}}\left( y \right) \text {=}\frac{\sum \nolimits _{i\in \mathcal {M}}{{{P}_{i}}\cdot }{{\left\| \mathbf {I}_{\mathbf {out}}^{(i)}-{\mathbf {I}}_{\mathbf {y}}^{(i)} \right\| }_{2}}}{\sum \nolimits _{i\in \mathcal {M}}{{{P}_{i}}}} \end{aligned}$$
(13)

where i denotes the pixel index, \(\mathcal {M}\) is the reprojected face region obtained with landmarks [13], \({{\left\| \cdot \right\| }_{2}}\) denotes the \({{L}_{{2}}}\) norm and \({{P}_{i}}\) is the occlusion attention coefficient. To favour accurate texture in the visible facial features, we set \(P_{i}={\left\{ \begin{array}{ll}1 \;\;\; \text {if} \;\; i\in \text {facial features of}\;{{M}_{\alpha }}\\ 0.1 \;\; \text {otherwise}\end{array}\right. }\) for each pixel i (see the combined loss sketch after Eq. 16).

(3) Regularization Loss. To prevent shape deformation and texture degeneration, we introduce a prior distribution on the face model parameters and add the regularization loss:

$$\begin{aligned} {{\mathcal {L}}_{\text {reg}}}\text {=}{{\omega }_{\alpha }}{{\left\| \widetilde{{{\boldsymbol{\alpha }}_{\mathbf {i}}}} \right\| }^{2}}+{{\omega }_{\beta }}{{\left\| \widetilde{{{\boldsymbol{\beta }}_{\mathbf {i}}}} \right\| }^{2}} \end{aligned}$$
(14)

Here, we set \({{\omega }_{\alpha }}\text {=}1.0\) and \({{\omega }_{\beta }}=1.\text {75e-3}\).

(4) Face Features Level Loss. To reduce the difference between the 3D face and the 2D image, we define a loss at the face recognition level. The loss computes the feature difference between the input image \({{\mathbf {I}}_{\mathbf {out}}}\) and the rendered image \({{\mathbf {I}}_{\mathbf {y}}}\), defined as a cosine distance:

$$\begin{aligned} {{\mathcal {L}}_{ff}}\text {=1-}\frac{<G({{\mathbf {I}}_{\mathbf {out}}}),G({{\mathbf {I}}_{\mathbf {y}}})>}{\left\| G({{\mathbf {I}}_{\mathbf {out}}}) \right\| \cdot \left\| G({{\mathbf {I}}_{\mathbf {y}}}) \right\| } \end{aligned}$$
(15)

where \(G(\cdot )\) denotes the feature extraction function of FaceNet [19] and \(<\cdot ,\cdot>\) denotes the inner product.

In summary, the final loss function of the 3D face reconstruction used in our experiments combines the above four loss terms:

$$\begin{aligned} {{\mathcal {L}}_{3D}}{=}{{\lambda }_{3}}{{\mathcal {L}}_{lmk}}+{{\lambda }_{4}}{{\mathcal {L}}_{{pix}}}+{{\lambda }_{5}}{{\mathcal {L}}_{reg}}+{{\lambda }_{6}}{{\mathcal {L}}_{{ff}}} \end{aligned}$$
(16)

where we set \({{\lambda }_{3}}{=1}{.6e-3}, {{\lambda }_{4}}{=1}{.4}, {{\lambda }_{5}}{=3}\text {.7e-4}, {{\lambda }_{6}}{=0}{.2}\) respectively in all our experiments.
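As referenced in the pixel-wise loss above, here is a minimal sketch of how the four terms in Eq. (16) might be combined, assuming the rendered image, projected landmarks, masks and FaceNet embeddings are computed elsewhere (all argument names are illustrative placeholders, not the authors' code):

```python
import torch
import torch.nn.functional as F

def reconstruction_loss(z_lmk, l_y, I_out, I_y, face_mask, feature_mask,
                        alpha_id, beta_tex, emb_out, emb_y,
                        l3=1.6e-3, l4=1.4, l5=3.7e-4, l6=0.2):
    """Eq. (16): L_3D = l3*L_lmk + l4*L_pix + l5*L_reg + l6*L_ff (sketch).

    z_lmk, l_y:     (B, 68, 2) detected and projected landmarks.
    I_out, I_y:     (B, 3, H, W) generated and rendered images.
    face_mask:      (B, 1, H, W) reprojected face region M.
    feature_mask:   (B, 1, H, W) 1 inside facial features of M_alpha, else 0.
    emb_out, emb_y: FaceNet embeddings of I_out and I_y.
    """
    # Eq. (12): mean squared landmark distance.
    l_lmk = ((z_lmk - l_y) ** 2).sum(-1).mean()
    # Eq. (13): occlusion-attention weighted pixel loss, P_i = 1 or 0.1 inside M.
    P = torch.where(feature_mask > 0,
                    torch.ones_like(feature_mask),
                    torch.full_like(feature_mask, 0.1)) * face_mask
    per_pixel = torch.linalg.vector_norm(I_out - I_y, dim=1, keepdim=True)
    l_pix = (P * per_pixel).sum() / P.sum().clamp_min(1e-8)
    # Eq. (14): regularization of shape and texture coefficients.
    l_reg = 1.0 * (alpha_id ** 2).sum() + 1.75e-3 * (beta_tex ** 2).sum()
    # Eq. (15): cosine distance between face recognition features.
    l_ff = 1.0 - F.cosine_similarity(emb_out, emb_y, dim=-1).mean()
    return l3 * l_lmk + l4 * l_pix + l5 * l_reg + l6 * l_ff
```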

4 Implementation Details

For the landmark predictor, the 300-W dataset [18] provides labeled ground-truth landmarks, while the CelebA-HQ dataset [9] does not; we generated reference ground truth for CelebA-HQ with the Faceboxes predictor [28]. In the experiments shown in this work, we train the landmark predictor \({{\mathcal {N}}_{lmk}}\) on \(256\times 256\) images with a batch size of 16 and a learning rate of \(10e-4\). We use the trained face parsing model \({{\mathcal {N}}_{\alpha }}\) [10] to generate \({{\mathbf {M}}_{\boldsymbol{\alpha }}}\) and obtain \({{\mathbf {M}}_{\boldsymbol{\gamma }}}\) according to Algorithm 1. FISN follows the design of Pix2PixHD [26] with four residual blocks. To train FISN, we used the CelebAMask-HQ dataset, which contains 30000 semantic label maps of size \(512\times 512\), each clearly marking the facial features.
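As a concrete illustration of these settings, a minimal training-loop sketch for \({{\mathcal {N}}_{lmk}}\) could look as follows; only the image size, batch size and learning rate come from the text, while the optimizer choice, epoch count and the `LandmarkNet`/`landmark_loss` helpers (from the sketch in Sect. 3.1) are our assumptions:

```python
import torch
from torch.utils.data import DataLoader

def train_landmark_predictor(dataset, epochs=30, device="cuda"):
    # LandmarkNet and landmark_loss are the illustrative placeholders
    # defined in the sketch of Sect. 3.1.
    model = LandmarkNet().to(device)
    loader = DataLoader(dataset, batch_size=16, shuffle=True)   # 256x256 crops
    opt = torch.optim.Adam(model.parameters(), lr=10e-4)        # rate as stated in the text
    for _ in range(epochs):
        for img, z_gt in loader:
            img, z_gt = img.to(device), z_gt.to(device)
            loss = landmark_loss(model(img), z_gt)               # Eq. (1)
            opt.zero_grad()
            loss.backward()
            opt.step()
    return model
```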

FISN does not use any ordinary normalization layers (e.g. Instance Normalization), which would wash away style information. Before training the ResNet, we initialize it with the pre-trained weights of R-Net [3]. We set the input image size to \(224\times 224\) and the number of vertices to 35709. We design our texture refinement network based on the Graph Convolutional Network method of Lin et al. [11]. Following [11], we do not adopt any fully-connected layers or convolutional layers in the refinement network, as they would reduce the performance of the module.

5 Experimental Results

5.1 Qualitative Comparisons with Recent Works

Figure 3 compares our results with other works. The last two columns show our results, while the remaining columns show the results of 3DDFA [4], \(\text {D}{{\text {F}}^{\text {2}}}\text {Net}\) [27] and Chen et al. [2]. The qualitative results show that our method surpasses the other methods: it reconstructs a complete face model under occlusions such as glasses, jewelry, palms, and hair. The other methods focus on generating high-resolution face textures and cannot effectively deal with occluded scenes.

Fig. 3. Comparison of qualitative results. Methods from left to right: 3DDFA, \(\text {D}{{\text {F}}^{\text {2}}}\text {Net}\), Chen et al. and our method.

5.2 Quantitative Comparison

Fig. 4. Comparison of error heat maps on the MICC Florence dataset. Digits denote the \(90\%\) error (mm).

Comparison Results on the MICC Florence Dataset. The MICC Florence dataset [1] is a 3D face dataset that contains 53 faces with their ground-truth models. We artificially added occluders to the input images and calculated the average \(90\%\) largest error between the generated model and the ground-truth model. Figure 4 shows that our method effectively handles occlusion.

Occlusion Invariance of the Foundation Shape. Our choice of ResNet-50 to regress the shape coefficients is motivated by its robustness to extreme viewing conditions reported by Deng et al. [3]. To further support the application of our method to occluded face images, we test our system on the Labeled Faces in the Wild (LFW) dataset [7]. We use the same face verification pipeline as Anh et al. [24] and refer to that paper for details.

Fig. 5. Reconstructions with occlusions. Left: qualitative results of Sela et al. [20] and our shape. Right: LFW verification ROC for the shapes, with and without occlusions.

Figure 5 (left) shows the sensitivity of the method of Sela et al. [20]: their result clearly shows the outline of a finger. This failure may be due to their stronger focus on local details, which only weakly regularizes the global shape. Our method, in contrast, recognizes and regenerates the occluded area, robustly providing a natural face shape under common occlusion scenes. Though the 3DMM limits the shape details, we use it only as a foundation and add refined texture separately.

Table 1. Quantitative evaluations on LFW.

We further quantitatively verify the robustness of our method to occlusions. Table 1 (top) reports verification results on the LFW benchmark with and without occlusions (see also the ROC in Fig. 5 (right)). Though occlusions clearly impact recognition, the drop in the curve is limited, demonstrating the robustness of our method.

6 Conclusions

In this work, we present a novel single-image 3D face reconstruction method for occluded scenes with high-fidelity textures. Comprehensive experiments show that our method outperforms previous methods by a large margin in terms of both accuracy and robustness. Future work includes combining our method with Transformer architectures to further improve accuracy.