
1 Introduction

3D face reconstruction refers to synthesizing a 3D face model from a single input face photo. It has a wide range of applications, such as face recognition and digital entertainment [25]. Reconstructing a 3D face model from a single photo is a classical and fundamental problem in computer vision, yet existing methods mainly concentrate on unobstructed faces, which limits their practical applicability. The task becomes particularly challenging under occlusion, where parts of the facial structure are invisible. Over the past five years, the related problem of face inpainting has progressed to the point where plausible face photos can be generated even in extreme scenes [15].

Current learning-based methods cannot robustly predict the 3D texture of the occluded area of a face; when faces are partially occluded, existing methods often reconstruct the occluded area indiscriminately. With the assistance of a face parsing map, we identify the occluded area and reconstruct the input image into a reasonable 3D face model. The main contributions are summarized as follows:

  • We propose a novel algorithm that combines feature points and a face parsing map to generate a face with complete facial features.

  • To address the problem of invisible face areas in occluded scenes, we propose synthesizing the input face photo with a Generative Adversarial Network rather than reconstructing the 3D face directly.

  • We improve the loss function of our 3D reconstruction framework for occluded scenes. Our method obtains state-of-the-art qualitative performance on real-world images.

2 Related Works

2.1 Generic Face Reconstruction

Classic methods fit reference 3D face models to the input face photo. Some recent techniques use Convolutional Neural Networks (CNNs) to regress landmark locations from the raw face image, while others first use CNNs to predict the 3DMM parameters from the input face image.

2.2 Face Image Synthesis

Deep pixel-level face generation has been studied for several years, and many methods achieve remarkable results. EdgeConnect [12] shows impressive progress by disentangling generation into two stages: an edge generator and an image completion network. Contextual Attention [22] takes a similar two-step approach: it first produces a base estimate of the invisible region, and a refinement block then sharpens the result using background patches. The typical limitations of current face image generation schemes are the need for manual manipulation, the complexity of the underlying architectures, the degradation in accuracy, and the inability to restrict modifications to a local region.

3 Our Approach

3.1 Landmark Prediction Task

Figure 1 shows the entire pipeline of our work. In the landmark prediction task, we found that accurately predicting the 68 feature points \({\mathbf {{Z}}_{\mathbf {lmk}}}\in {{\mathbb {R}}^{2\times 68}}\) is a crucial step under occlusion scenes. The architecture \({{\mathcal {N}}_\text {lmk}}\) generates landmarks from a corrupted face photo \({\text {I}_\text {cor}}\): \({\mathbf {Z}_\mathbf {lmk}}{=}{{\mathcal {N}}_{\text {lmk}}}\left( {{\text {I}}_{\text {cor}}};{{\theta }_{lmk}} \right) \), where \({{\theta }_{lmk}}\) denotes the trainable parameters. Since we prioritize efficiency for the subsequent face parsing map generation task, we built a sufficiently effective \({{\mathcal {N}}_\text {lmk}}\) upon MobileNet-V3 [6]. Unlike traditional landmark detectors, \({{\mathcal {N}}_\text {lmk}}\) focuses on feature extraction; the final module regresses the landmarks by fully connecting the fused feature maps. We set the loss function \({{\mathcal {L}}_{lmk}}\) as follows:

$$\begin{aligned} {{\mathcal {L}}_{lmk}}\text {=}\left\| \mathbf {{Z}_{_{lmk}}^{(i)}}-\mathbf {{\hat{Z}}_{_{gt}}^{(i)}} \right\| _{2}^{2} \end{aligned}$$
(1)

where \(\mathbf {{\hat{Z}}_{_{gt}}^{(i)}}\) denotes the ith ground truth face landmarks.
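A minimal sketch of how such a landmark regressor might be assembled in PyTorch (the use of torchvision's MobileNet-V3-Large backbone, the pooling/head dimensions and the batched loss reduction are our illustrative assumptions, not the authors' released code):

```python
import torch
import torch.nn as nn
from torchvision.models import mobilenet_v3_large

class LandmarkNet(nn.Module):
    """Illustrative N_lmk: MobileNet-V3 features plus a fully connected head
    that regresses 68 (x, y) landmark coordinates."""
    def __init__(self, num_landmarks=68):
        super().__init__()
        backbone = mobilenet_v3_large(weights=None)
        self.features = backbone.features               # feature extraction only
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.head = nn.Linear(960, num_landmarks * 2)    # 960 channels in MobileNet-V3-Large

    def forward(self, img):                              # img: (B, 3, 256, 256)
        f = self.pool(self.features(img)).flatten(1)
        return self.head(f).view(-1, 2, 68)              # Z_lmk in R^{2x68}

def landmark_loss(z_pred, z_gt):
    """Eq. (1): squared L2 distance between predicted and ground-truth landmarks."""
    return ((z_pred - z_gt) ** 2).sum(dim=(1, 2)).mean()
```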

Fig. 1. Overview of our pipeline. We first remove the occluded area and reconstruct the face with complete facial features. Then we utilize ResNet-50 and a texture refinement network to reconstruct the final 3D model.

3.2 Face Parsing Map Generation

Pixel-level recognition of occlusion and face skin areas is a prerequisite for our framework to ensure accuracy. To benefit from the annotated face dataset CelebAMask-HQ [10], we use an encoder-decoder architecture \({{\mathcal {N}}_{\alpha }}\) based on U-Net [17] to estimate pixel-level label classes. Given a squarely resized face image \({{\mathbf {I}}_\mathbf {fac}}\in {{\mathbb {R}}^{H\times W\times 3}}\), we apply the trained face parsing model \({{\mathcal {N}}_{\alpha }}\) to obtain the parsing map \({\mathbf {M}_{\boldsymbol{\alpha }}}\in {{\mathbb {R}}^{H\times W\times 1}}\). On the other hand, given the landmarks \({\mathbf {Z}_\mathbf {lmk}}\in {{\mathbb {R}}^{2\times 68}}\), we connect the feature points to form regions; these regions yield a parsing map \({\mathbf {M}_{\boldsymbol{\beta }} } \in {\mathbb {R}^{H \times W \times 1}}\) that covers the facial features. Note that, in our work, we assume the facial features comprise only five parts: facial skin, eyebrows, eyes, nose and lips. The final map \({\mathbf {M}_{\boldsymbol{\gamma }} } \in {\mathbb {R}^{H \times W \times 1}}\) (see Fig. 2), which is free of occluding objects, is obtained by combining \({\mathbf {M}_{\boldsymbol{\alpha }} }\) and \({\mathbf {M}_{\boldsymbol{\beta }} }\). To generate a \({\mathbf {M}_{\boldsymbol{\gamma }} }\) that includes the complete facial features, we designed Algorithm 1 (a code-level sketch is given below).

Fig. 2. Our face parsing map generation module, which follows Algorithm 1. The results in the figure show that our method successfully removes the occlusion caused by fingers and hair.

Algorithm 1
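Algorithm 1 itself is not reproduced here; the following is a minimal sketch of the kind of mask-merging step it performs, under our own assumptions about the label encoding (the label ids and the fill strategy are illustrative, not the authors' exact procedure):

```python
import numpy as np

# Illustrative label ids; the real CelebAMask-HQ label set is larger.
FEATURE_LABELS = {1, 2, 3, 4, 5}    # skin, eyebrows, eyes, nose, lips (assumed ids)
OCCLUSION_LABELS = {14, 15}         # occluder-type classes (assumed ids)

def merge_parsing_maps(m_alpha, m_beta):
    """Sketch of Algorithm 1: combine the predicted parsing map M_alpha with the
    landmark-derived facial-feature map M_beta into an occlusion-free map M_gamma.

    m_alpha, m_beta: (H, W) integer label maps.
    """
    m_gamma = m_alpha.copy()
    # Pixels that M_alpha marks as occluders but that fall inside the
    # landmark-derived facial-feature region are relabelled from M_beta,
    # so the facial features become complete again.
    occluded = np.isin(m_alpha, list(OCCLUSION_LABELS))
    inside_features = np.isin(m_beta, list(FEATURE_LABELS))
    fill = occluded & inside_features
    m_gamma[fill] = m_beta[fill]
    return m_gamma
```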

3.3 Face Image Synthesis with GAN

Face Image Synthesis Network. To benefit from the Pix2Pix architecture, we propose a Face Image Synthesis Network (FISN) \({\mathcal{N}_{\mathrm{{et}}}}\), which uses Pix2PixHD [26] as a backbone. FISN receives \({{\mathbf {I}}_{\mathbf {fac}}}\in {{\mathbb {R}}^{H\times W\times 3}}\) and \({\mathbf {M}_{\boldsymbol{\alpha }}}\) as inputs; the detailed architecture is shown in Fig. 1. To fuse \({{\mathbf {I}}_{\mathbf {fac}}}\) and \({\mathbf {M}_{\boldsymbol{\alpha } }}\), we use a Spatial Feature Transform (SFT) layer [14], which learns a mapping function \(\mathcal{M}\) that outputs a pair of affine transformation parameters \(\left( {\gamma ,\beta } \right) \) from the prior condition \({\varPsi }\) derived from the features of \({\mathbf {M}_{\boldsymbol{\alpha }} }\), i.e. \((\gamma ,\beta ) = \mathcal{M}\left( \varPsi \right) \). After obtaining \(\left( {\gamma ,\beta } \right) \), the transformation is carried out by the SFT layer:

$$\begin{aligned} SFT({\mathbf {F}_{{\mathbf {map}}}}|\gamma ,\beta ) = \gamma \odot {\mathbf {F}_{{\mathbf {map}}}} + \beta \end{aligned}$$
(2)

where \({{\mathbf {F}}_{\mathbf {map}}}\) denotes the feature maps from \(\mathbf {{{I}}_{fac}}\) and \(\odot \) denotes the Hadamard product.
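A minimal sketch of an SFT-style modulation layer in PyTorch (the small two-convolution condition network is our illustrative choice; it is not the exact layer configuration of [14] or of FISN):

```python
import torch
import torch.nn as nn

class SFTLayer(nn.Module):
    """Spatial Feature Transform: predicts (gamma, beta) from the condition Psi
    (here, features of M_alpha) and applies gamma * F_map + beta (Eq. 2)."""
    def __init__(self, cond_channels, feat_channels, hidden=64):
        super().__init__()
        self.shared = nn.Sequential(
            nn.Conv2d(cond_channels, hidden, 3, padding=1), nn.ReLU(inplace=True))
        self.to_gamma = nn.Conv2d(hidden, feat_channels, 3, padding=1)
        self.to_beta = nn.Conv2d(hidden, feat_channels, 3, padding=1)

    def forward(self, f_map, psi):
        h = self.shared(psi)                  # condition features from M_alpha
        gamma, beta = self.to_gamma(h), self.to_beta(h)
        return gamma * f_map + beta           # element-wise (Hadamard) modulation
```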

Therefore, we condition on the spatial information \({\mathbf {M}_{\boldsymbol{\alpha }} }\) and the style data \({{\mathbf {I}}_{\mathbf {fac}}}\) to generate affine parameters \(({{{x}}_{{i}}}{{,}}{{{y}}_i})\), following \(({{{x}}_{{i}}}{{,}}{{{y}}_i}) = {\mathcal{N}_{{{et}}}}\left( {{{{I}}_{{{fac}}}},{M_\alpha }} \right) \). Related research [14] showed that ordinary normalization layers tend to “wash away” semantic information. To transfer \(({{{x}}_{{i}}}{{,}}{{{y}}_i})\) to the new mask input \({\mathbf {M}_{\boldsymbol{\gamma }}}\), we utilize semantic region-adaptive normalization (SEAN) [29] on the residual blocks \({{{z}}_{{i}}}\) in FISN. Let H, W and C be the height, width and number of channels of the activation map of the deep convolutional network for a batch of N samples. The modulated activation value at each site is defined as:

$$\begin{aligned} SEAN\left( {{z_i},{x_i},{y_i}} \right) = {x_{{i}}}\frac{{{z_i} - \mu \left( {{z_i}} \right) }}{{\sigma \left( {{z_i}} \right) }} + {y_i} \end{aligned}$$
(3)

where \(\mu \left( {{z_i}} \right) \) and \(\sigma \left( {{z_i}} \right) \) are the mean and standard deviation of the activation \(\left( {{{n}} \in N,c \in C,y \in H,x \in W} \right) \) in channel c :

$$\begin{aligned} \mu \left( {{z_i}} \right) \mathrm{{ = }}\frac{1}{{NHW}}\sum \limits _{n,y,x} {{h_{n,c,y,x}}} \end{aligned}$$
(4)
$$\begin{aligned} \sigma \left( {{z}_{i}} \right) {=}\sqrt{\frac{1}{NHW}\sum \limits _{n,y,x}{\left( h_{n,c,y,x}^{2}-\mu {{\left( {{z}_{i}} \right) }^{2}} \right) }} \end{aligned}$$
(5)
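A minimal sketch of the per-channel modulated normalization in Eqs. (3)–(5), assuming the scale and offset tensors \(x_i\) and \(y_i\) have already been broadcast to the activation shape (this simplifies SEAN [29], which additionally blends per-region style codes with a learned parameter):

```python
import torch

def sean_modulation(z, x, y, eps=1e-5):
    """Eqs. (3)-(5): normalize activations z over (N, H, W) per channel,
    then modulate with scale x and offset y (already broadcast to z's shape)."""
    # z: (N, C, H, W)
    mu = z.mean(dim=(0, 2, 3), keepdim=True)                                     # Eq. (4)
    sigma = z.var(dim=(0, 2, 3), keepdim=True, unbiased=False).add(eps).sqrt()   # Eq. (5)
    return x * (z - mu) / sigma + y                                              # Eq. (3)
```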

FISN is a generator that learns the style mapping between \({{\mathbf {I}}_{\mathbf {fac}}}\) and \({\mathbf {M}_{\boldsymbol{\gamma }}}\) according to the spatial information provided by \({\mathbf {M}_{\boldsymbol{\alpha }} }\). Therefore, face features (e.g. eye style) in \({{\mathbf {I}}_{\mathbf {fac}}}\) are shifted to the corresponding positions on \({\mathbf {M}_{\boldsymbol{\gamma }} }\), so that FISN can synthesize an image \({{\mathbf {I}}_{{\mathbf {out}}}}\) with the occlusion removed.

Loss Function. The design of our loss function for FISN is inspired by Pix2PixHD [26], MaskGAN [10] and SEAN [29], which contains three components:

\((1)\ Adversarial\ loss\). Let \({{\text {D}}_{\text {1}}}\) and \({{\text {D}}_{\text {2}}}\) be two discriminators at different scales; \({\mathcal{L}_{GAN}}\) is the conditional adversarial loss defined by

$$\begin{aligned} {{\mathcal {L}}_{GAN}}{=}\mathbb {E}\left[ \log \left( {\text {D}_{1,2}}\left( {\mathbf {I}_{\mathbf {fac}}},{\mathbf {M}_{{\boldsymbol{\alpha }} }} \right) \right) \right] +\mathbb {E}\left[ \log \left( 1-{\text {D}_{1,2}}\left( {\mathbf {I}_{\mathbf {out}}},{{\mathbf {M}}_{\boldsymbol{\alpha } }} \right) \right) \right] \end{aligned}$$
(6)

\((2)\ Feature\ matching\ loss\) [26]. Let T be the total number of layers in discriminator \(\text {D}\). \({{\mathcal {L}}_{\text {fea}}}\) is the feature matching loss, which computes the \(\text {L}_{1}\) distance between the discriminator features of the real and generated face images, defined by

$$\begin{aligned} {{\mathcal {L}}_{\text {fe}a}}{=}\mathbb {E}\sum \limits _{i=1}^{T}{{{\left\| D_{1,2}^{(i)}\left( {\mathbf {I}_{\mathbf {fac}}},{{\mathbf {M}}_{\boldsymbol{\alpha } }} \right) -D_{1,2}^{(i)}\left( {\mathbf {I}_{\mathbf {out}}},{\mathbf {M}_{\mathbf {\alpha } }} \right) \right\| }_{1}}} \end{aligned}$$
(7)

\((3)\ Perceptual\ loss\) [8]. Let \(\text {N}\) be the total number of layers used to calculate the perceptual loss, \({{F}^{(i)}}\) the output feature maps of the ith layer of the VGG network [21], and \({{M}_{i}}\) the number of elements in \({{F}^{(i)}}\). \({\mathcal{L}_{{{per}}}}\) is the perceptual loss, which computes the \({{L}_{1}}\) distance between the VGG features of the real and generated face images, defined by

$$\begin{aligned} {{\mathcal {L}}_{\text {per}}}{=}\mathbb {E}\sum \limits _{i=1}^{N}{\frac{1}{{{M}_{i}}}[{{\left\| {{F}^{\left( i \right) }}\left( {\mathbf {I}_{\mathbf {fac}}} \right) -{{F}^{\left( i \right) }}\left( {\mathbf {I}_{\mathbf {out}}} \right) \right\| }_{1}}]} \end{aligned}$$
(8)

The final loss function of FISN used in our experiment is made up of the above-mentioned three loss terms as:

$$\begin{aligned} {{\mathcal {L}}_{FISN}}{=}{{\mathcal {L}}_{GAN}}+{{\lambda }_{1}}{{\mathcal {L}}_{\text {fea}}}+{{\lambda }_{2}}{{\mathcal {L}}_{\text {per}}} \end{aligned}$$
(9)

where we set \({{\lambda }_{1}}{=}{{\lambda }_{2}}{=10}\) in our experiments.
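A minimal sketch of how the three FISN loss terms (Eqs. 6–9) might be combined on the generator side, assuming lists of multi-scale discriminator feature maps and VGG feature maps are already available (all argument names and the non-saturating GAN formulation are our assumptions, not the authors' code):

```python
import torch
import torch.nn.functional as F

def fisn_generator_loss(d_fake_logits, d_feats_real, d_feats_fake,
                        vgg_feats_real, vgg_feats_fake,
                        lambda_fea=10.0, lambda_per=10.0):
    """Eq. (9): L_FISN = L_GAN + lambda_1 * L_fea + lambda_2 * L_per (generator side)."""
    # Eq. (6), generator term in non-saturating form: push D(I_out, M_alpha) towards real.
    l_gan = -torch.log(torch.sigmoid(d_fake_logits) + 1e-8).mean()
    # Eq. (7): L1 feature matching over the discriminator layers (real features detached).
    l_fea = sum(F.l1_loss(ff, fr.detach())
                for fr, ff in zip(d_feats_real, d_feats_fake))
    # Eq. (8): perceptual loss over VGG layers; F.l1_loss averages over elements,
    # which plays the role of the 1/M_i factor.
    l_per = sum(F.l1_loss(vf, vr) for vr, vf in zip(vgg_feats_real, vgg_feats_fake))
    return l_gan + lambda_fea * l_fea + lambda_per * l_per
```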

3.4 Camera and Illumination Model

Given a face image, we adopt the Basel Face Model (BFM) [16]. After the 3D face is reconstructed, it can be projected onto the image plane with the perspective projection:

$$\begin{aligned} {{V}_{2d}}\left( \mathbf {P} \right) =f*{\mathbf {P}_{\mathbf {r}}}*{\mathbf {R}}*{\mathbf {S}_{\mathbf {mod} }}+{\mathbf {t}_{\mathbf {2d}}} \end{aligned}$$
(10)

where \({{V}_{2d}}\left( \mathbf {P} \right) \) denotes the projection function that maps the 3D model to 2D face positions, f denotes the scale factor, \({\mathbf {P}_{\mathbf {r}}}\) denotes the projection matrix, \(\mathbf {R}\in SO(3)\) denotes the rotation matrix, \({{\mathbf {S}}_{\mathbf {mod}}}\) denotes the shape of the face and \({\mathbf {t}_{\mathbf {2d}}}\in {{\mathbb {R}}^{3}}\) denotes the translation vector.
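A minimal numerical sketch of Eq. (10), assuming row-vector vertices and a simple scaled projection (the exact projection matrix \({\mathbf {P}_{\mathbf {r}}}\) and the conventions used in the paper are not reproduced here):

```python
import numpy as np

def project_vertices(S_mod, R, t_2d, f, P_r=None):
    """Eq. (10): V_2d = f * P_r * R * S_mod + t_2d (sketch).

    S_mod: (N, 3) vertices of the reconstructed face shape.
    R:     (3, 3) rotation matrix.
    t_2d:  translation (first two components used here).
    f:     scalar scale factor.
    P_r:   (2, 3) projection matrix; defaults to dropping the z coordinate.
    """
    if P_r is None:
        P_r = np.array([[1.0, 0.0, 0.0],
                        [0.0, 1.0, 0.0]])
    rotated = S_mod @ R.T                 # rotate the shape
    projected = f * (rotated @ P_r.T)     # scale and project to 2D
    return projected + np.asarray(t_2d)[:2]
```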

We approximate the scene illumination with Spherical Harmonics (SH) [3]. Treating the face as a Lambertian surface, the skin texture is computed as:

$$\begin{aligned} \mathbf {C}\left( {\mathbf {r}_{\mathbf {i}}},{\mathbf {n}_{\mathbf {i}}},\boldsymbol{\gamma } \right) ={\mathbf {r}_{\mathbf {i}}}\odot \sum \limits _{b=1}^{{{B}^{2}}}{{\boldsymbol{\gamma }_{\mathbf {b}}}{{\varPhi }_{{b}}}\left( {\mathbf {n}_{\mathbf {i}}} \right) } \end{aligned}$$
(11)

where \({\mathbf {r}_{\mathbf {i}}}\) denotes the skin reflectance, \({\mathbf {n}_{\mathbf {i}}}\) denotes the surface normal, \(\odot \) denotes the Hadamard product, \(\boldsymbol{\gamma } \in {{\mathbb {R}}^{9}}\) under the monochromatic light assumption, \({{\varPhi }_{b}}:{{\mathbb {R}}^{3}}\rightarrow \mathbb {R}\) denotes the SH basis functions, B denotes the number of spherical harmonic bands and \({\boldsymbol{\gamma }_{\mathbf {b}}}\in {{\mathbb {R}}^{3}}\) (here we set \(B=3\)) denotes the corresponding SH coefficients.
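A minimal sketch of the SH shading in Eq. (11) with B = 3 bands (nine basis functions); the real SH basis constants are standard, while the per-vertex data layout is our assumption:

```python
import numpy as np

def sh_basis(n):
    """First 9 real SH basis functions Phi_b(n) for unit normals n: (N, 3)."""
    x, y, z = n[:, 0], n[:, 1], n[:, 2]
    return np.stack([
        0.282095 * np.ones_like(x),
        0.488603 * y, 0.488603 * z, 0.488603 * x,
        1.092548 * x * y, 1.092548 * y * z,
        0.315392 * (3.0 * z ** 2 - 1.0),
        1.092548 * x * z, 0.546274 * (x ** 2 - y ** 2),
    ], axis=1)                                   # (N, 9)

def shade_vertices(r, n, gamma):
    """Eq. (11): C = r (Hadamard) sum_b gamma_b * Phi_b(n).

    r:     (N, 3) per-vertex albedo.
    n:     (N, 3) unit surface normals.
    gamma: (9, 3) SH coefficients per colour channel.
    """
    irradiance = sh_basis(n) @ gamma             # (N, 3)
    return r * irradiance                        # Hadamard product
```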

Therefore, the parameters to be learned can be denoted by a vector \(\boldsymbol{y}=(\boldsymbol{\widetilde{{{\alpha }_{i}}},\widetilde{{{\beta }_{i}}},\gamma ,p})\in {{\mathbb {R}}^{175}}\), where \(\mathbf {p}\in {{\mathbb {R}}^{6}}=\{\boldsymbol{pitch,yaw,roll,f,{{t}_{2D}}\}}\) denotes the face pose. In this work, we use a fixed ResNet-50 [5] network to regress these coefficients; its loss function follows Eq. 16. We then obtain the fundamental shape \({\mathbf {S}_{\mathbf {base}}}\) (per-vertex xyz coordinates) and the coarse texture \({\mathbf {T}_{\mathbf {coa}}}\) (per-vertex rgb albedo). We use a coarse-to-fine network based on the graph convolutional networks of Lin et al. [11] to produce the fine texture \({\mathbf {T}_{\mathbf {fin}}}\).

3.5 Loss Function of 3D Reconstruction

Given a generated image \({{\mathbf {I}}_{\mathbf {out}}}\), we use the ResNet to regress the corresponding coefficient vector y. The loss function for the ResNet contains four components:

(1) Landmark Loss. As facial landmarks convey the structural information of the human face, we use a landmark loss to measure how close the projected shape landmark vertices are to the corresponding landmarks in the image \({{\mathbf {I}}_{\mathbf {out}}}\). We run the landmark prediction module \({{\mathcal {N}}_{lmk}}\) to detect 68 landmarks \(\left\{ {z}_{lmk}^{\left( n \right) } \right\} \) on the training images and obtain landmarks \(\left\{ l_{y}^{\left( n \right) } \right\} \) from the rendered facial images. We then compute the loss as:

$$\begin{aligned} {{\mathcal {L}}_{{lmk}}}\left( y \right) \text {=}\frac{1}{N}\sum \limits _{\text {n}=1}^{N}{\left\| \text {z}_{\text {lmk}}^{\left( n \right) }-l_{y}^{(n)} \right\| _{2}^{2}} \end{aligned}$$
(12)

where \({{\left\| \cdot \right\| }_{2}}\) denotes the \({{L}_{2}}\) norm.

(2) Accurate Pixel-wise Loss. The rendering layer renders back an image \(\mathbf {I}_{\mathbf {y}}^{(i)}\) to compare with the image \(\mathbf {I}_{\mathbf {out}}^{(i)}\). The pixel-wise loss is formulated as:

$$\begin{aligned} {{\mathcal {L}}_{\text {pix}}}\left( y \right) \text {=}\frac{\sum \nolimits _{i\in \mathcal {M}}{{{P}_{i}}\cdot }{{\left\| \mathbf {I}_{\mathbf {out}}^{(i)}-{\mathbf {I}}_{\mathbf {y}}^{(i)} \right\| }_{2}}}{\sum \nolimits _{i\in \mathcal {M}}{{{P}_{i}}}} \end{aligned}$$
(13)

where i denotes the pixel index, \(\mathcal {M}\) is the reprojected face region obtained with landmarks [13], \({{\left\| \cdot \right\| }_{2}}\) denotes the \({{L}_{{2}}}\) norm and \({{P}_{i}}\) is the occlusion attention coefficient. To favour accurate texture in the visible facial features, we set \(P_{i}={\left\{ \begin{array}{ll}1 \;\;\; \text {if} \;\; i\in \text {facial features of}\;{{M}_{\alpha }}\\ 0.1 \;\; \text {otherwise}\end{array}\right. }\) for each pixel i (see the combined loss sketch after Eq. 16).

(3) Regularization Loss. To prevent shape deformation and texture degeneration, we introduce a prior distribution on the face model parameters and add the regularization loss:

$$\begin{aligned} {{\mathcal {L}}_{\text {reg}}}\text {=}{{\omega }_{\alpha }}{{\left\| \widetilde{{{\boldsymbol{\alpha }}_{\mathbf {i}}}} \right\| }^{2}}+{{\omega }_{\beta }}{{\left\| \widetilde{{{\boldsymbol{\beta }}_{\mathbf {i}}}} \right\| }^{2}} \end{aligned}$$
(14)

Here, we set \({{\omega }_{\alpha }}\text {=}1.0\) and \({{\omega }_{\beta }}=1.\text {75e-3}\).

(4) Face Features Level Loss. To reduce the difference between the 3D face and the 2D image, we define a loss at the face recognition level. The loss computes the feature difference between the input image \({{\mathbf {I}}_{\mathbf {out}}}\) and the rendered image \({{\mathbf {I}}_{\mathbf {y}}}\), defined as a cosine distance:

$$\begin{aligned} {{\mathcal {L}}_{ff}}\text {=1-}\frac{<G({{\mathbf {I}}_{\mathbf {out}}}),G({{\mathbf {I}}_{\mathbf {y}}})>}{\left\| G({{\mathbf {I}}_{\mathbf {out}}}) \right\| \cdot \left\| G({{\mathbf {I}}_{\mathbf {y}}}) \right\| } \end{aligned}$$
(15)

where \(G(\cdot )\) denotes the feature extraction function of FaceNet [19] and \(<\cdot ,\cdot>\) denotes the inner product.

In summary, the final loss function of the 3D face reconstruction used in our experiments combines the above four loss terms:

$$\begin{aligned} {{\mathcal {L}}_{3D}}{=}{{\lambda }_{3}}{{\mathcal {L}}_{lmk}}+{{\lambda }_{4}}{{\mathcal {L}}_{{pix}}}+{{\lambda }_{5}}{{\mathcal {L}}_{reg}}+{{\lambda }_{6}}{{\mathcal {L}}_{{ff}}} \end{aligned}$$
(16)

where we set \({{\lambda }_{3}}{=1}{.6e-3}, {{\lambda }_{4}}{=1}{.4}, {{\lambda }_{5}}{=3}\text {.7e-4}, {{\lambda }_{6}}{=0}{.2}\) respectively in all our experiments.
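As referenced in the pixel-wise loss above, here is a minimal sketch of how the four terms in Eq. (16) might be combined, assuming the rendered image, projected landmarks, masks and FaceNet embeddings are computed elsewhere (all argument names are illustrative placeholders, not the authors' code):

```python
import torch
import torch.nn.functional as F

def reconstruction_loss(z_lmk, l_y, I_out, I_y, face_mask, feature_mask,
                        alpha_id, beta_tex, emb_out, emb_y,
                        l3=1.6e-3, l4=1.4, l5=3.7e-4, l6=0.2):
    """Eq. (16): L_3D = l3*L_lmk + l4*L_pix + l5*L_reg + l6*L_ff (sketch).

    z_lmk, l_y:     (B, 68, 2) detected and projected landmarks.
    I_out, I_y:     (B, 3, H, W) generated and rendered images.
    face_mask:      (B, 1, H, W) reprojected face region M.
    feature_mask:   (B, 1, H, W) 1 inside facial features of M_alpha, else 0.
    emb_out, emb_y: FaceNet embeddings of I_out and I_y.
    """
    # Eq. (12): mean squared landmark distance.
    l_lmk = ((z_lmk - l_y) ** 2).sum(-1).mean()
    # Eq. (13): occlusion-attention weighted pixel loss, P_i = 1 or 0.1 inside M.
    P = torch.where(feature_mask > 0,
                    torch.ones_like(feature_mask),
                    torch.full_like(feature_mask, 0.1)) * face_mask
    per_pixel = torch.linalg.vector_norm(I_out - I_y, dim=1, keepdim=True)
    l_pix = (P * per_pixel).sum() / P.sum().clamp_min(1e-8)
    # Eq. (14): regularization of shape and texture coefficients.
    l_reg = 1.0 * (alpha_id ** 2).sum() + 1.75e-3 * (beta_tex ** 2).sum()
    # Eq. (15): cosine distance between face recognition features.
    l_ff = 1.0 - F.cosine_similarity(emb_out, emb_y, dim=-1).mean()
    return l3 * l_lmk + l4 * l_pix + l5 * l_reg + l6 * l_ff
```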

4 Implementation Details

For the landmark predictor, the 300-W dataset [18] provides labeled ground-truth landmarks, while the CelebA-HQ dataset [9] does not; we generated reference ground truth for CelebA-HQ with the Faceboxes predictor [28]. In the experiments shown in this work, we train the landmark predictor \({{\mathcal {N}}_{lmk}}\) on \(256\times 256\) images with a batch size of 16 and a learning rate of \(10e-4\). We use the trained face parsing model \({{\mathcal {N}}_{\alpha }}\) [10] to generate \({{\mathbf {M}}_{\boldsymbol{\alpha }}}\) and obtain \({{\mathbf {M}}_{\boldsymbol{\gamma }}}\) according to Algorithm 1. FISN follows the design of Pix2PixHD [26] with four residual blocks. To train FISN, we used the CelebAMask-HQ dataset, which contains 30000 semantic label maps of size \(512\times 512\), each clearly marking the facial features.
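As a concrete illustration of these settings, a minimal training-loop sketch for \({{\mathcal {N}}_{lmk}}\) could look as follows; only the image size, batch size and learning rate come from the text, while the optimizer choice, epoch count and the `LandmarkNet`/`landmark_loss` helpers (from the sketch in Sect. 3.1) are our assumptions:

```python
import torch
from torch.utils.data import DataLoader

def train_landmark_predictor(dataset, epochs=30, device="cuda"):
    # LandmarkNet and landmark_loss are the illustrative placeholders
    # defined in the sketch of Sect. 3.1.
    model = LandmarkNet().to(device)
    loader = DataLoader(dataset, batch_size=16, shuffle=True)   # 256x256 crops
    opt = torch.optim.Adam(model.parameters(), lr=10e-4)        # rate as stated in the text
    for _ in range(epochs):
        for img, z_gt in loader:
            img, z_gt = img.to(device), z_gt.to(device)
            loss = landmark_loss(model(img), z_gt)               # Eq. (1)
            opt.zero_grad()
            loss.backward()
            opt.step()
    return model
```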

FISN does not use any ordinary normalization layers (e.g. Instance Normalization), which would wash away style information. Before training the ResNet, we initialize it with the pre-trained weights of R-Net [3]. We set the input image size to \(224\times 224\) and the number of vertices to 35709. We design our texture refinement network based on the Graph Convolutional Network method of Lin et al. [11]. Following [11], we do not adopt any fully-connected layers or convolutional layers in the refinement network, as they would reduce the performance of the module.

5 Experimental Results

5.1 Qualitative Comparisons with Recent Works

Figure 3 compares our results with other works. The last two columns show our results, while the remaining columns show the results of 3DDFA [4], \(\text {D}{{\text {F}}^{\text {2}}}\text {Net}\) [27] and Chen et al. [2]. The qualitative results show that our method surpasses the other methods: it reconstructs a complete face model under occlusions such as glasses, jewelry, palms, and hair. The other methods focus on generating high-resolution face textures and cannot effectively deal with occluded scenes.

Fig. 3. Comparison of qualitative results. Methods from left to right: 3DDFA, \(\text {D}{{\text {F}}^{\text {2}}}\text {Net}\), Chen et al. and our method.

5.2 Quantitative Comparison

Fig. 4. Comparison of error heat maps on the MICC Florence dataset. Digits denote the \(90\%\) error (mm).

Comparison Results on the MICC Florence Dataset. The MICC Florence dataset [1] is a 3D face dataset that contains 53 faces with their ground-truth models. We artificially added occluders to the input images and calculated the average \(90\%\) largest error between the generated model and the ground-truth model. Figure 4 shows that our method effectively handles occlusion.

Occlusion Invariance of the Foundation Shape. Our choice of ResNet-50 to regress the shape coefficients is motivated by its robustness to extreme viewing conditions reported by Deng et al. [3]. To further support the application of our method to occluded face images, we test our system on the Labeled Faces in the Wild (LFW) dataset [7]. We use the same face verification pipeline as Anh et al. [24] and refer to that paper for details.

Fig. 5. Reconstructions with occlusions. Left: qualitative results of Sela et al. [20] and our shape. Right: LFW verification ROC for the shapes, with and without occlusions.

Figure 5 (left) shows the sensitivity of the method of Sela et al. [20]: their result clearly shows the outline of a finger. This failure may be due to their stronger focus on local details, which only weakly regularizes the global shape. Our method, in contrast, recognizes and regenerates the occluded area, robustly providing a natural face shape under common occlusion scenes. Though the 3DMM limits the shape details, we use it only as a foundation and add refined texture separately.

Table 1. Quantitative evaluations on LFW.

We further quantitatively verify the robustness of our method to occlusions. Table 1 (top) reports verification results on the LFW benchmark with and without occlusions (see also the ROC in Fig. 5 (right)). Though occlusions clearly impact recognition, the drop in the curve is limited, demonstrating the robustness of our method.

6 Conclusions

In this work, we present a novel single-image 3D face reconstruction method for occluded scenes with high-fidelity textures. Comprehensive experiments show that our method outperforms previous methods by a large margin in terms of both accuracy and robustness. Future work includes combining our method with Transformer architectures to further improve accuracy.