
1 Introduction

High-quality 3D face reconstruction is a fundamental problem in computer graphics [31] that underpins applications such as digital animation [32], video editing [32] and face recognition [40, 41]. Since Vetter’s first 3D face model [35], 3D reconstruction methods have advanced rapidly, enabling many of these applications. However, these methods all perform poorly in terms of face geometry details. To make the problem tractable, most proposed methods introduce existing statistical models or prior knowledge. Such models are unable to reconstruct expression-dependent wrinkles, which are essential for analyzing human expression.

Several methods recover detailed facial geometry but lack robustness to occlusions [1, 9]. We introduce a novel face geometry detail generation method, which learns bump maps (which simulate geometry changes) from in-the-wild face images with occlusion. In contrast to prior work, where estimating mid-level features often breaks down, our method generates bump maps from a low-dimensional representation containing subject-specific detail parameters and expression parameters. Our detailed model builds upon this separation design, which is fundamental because it allows estimating a robust global shape even in occluded scenes.

The main contributions are summarized as follows:

  • We propose a novel Face Image Synthesis Network, a simple yet effective diversity-promoting face image regeneration approach. The regenerated face, with eyeglasses removed, guides the generation of the 3D model.

  • We improve the loss function of our 3D reconstruction system for scenes occluded by eyeglasses. Our results are more accurate than those of other approaches, and our method achieves state-of-the-art qualitative performance on real-world images.

2 Related Work

2.1 Single Image 3D Face Reconstruction

Since the first 3DMM model was proposed by Blanz and Vetter [2], single-image 3D face reconstruction has become a hot research topic and considerable progress has been made in the field. Richardson et al. [25] presented a CNN-based method that reconstructs a 3D face from synthetic data. As training deep neural networks usually demands a large amount of data to obtain acceptable results, Deng et al. [5] proposed an approach that achieves accurate 3D face reconstruction with weakly supervised learning based on less training data. Kemelmacher-Shlizerman and Basri [11] recovered 3D faces by exploiting the similarity of faces based on a single 3D reference model of a different person. Liu et al. [19] built a 3D face model that can exploit both faces with fully labeled 3D landmarks and unlimited unlabeled in-the-wild face images. Lee et al. [16] employed an uncertainty-aware encoder and a fully nonlinear decoder model for realistic 3D face reconstruction. Cheng et al. [3] solved the 3D face reconstruction problem with graph convolutional networks, obtaining good results without sacrificing speed. Shang et al. [30] proposed a self-supervised training architecture that is accurate and robust, even under large variations of expressions, poses, and illumination conditions. Li et al. [17] introduced an end-to-end framework and designed an efficient network model that noticeably increases the accuracy of face alignment and 3D face reconstruction. Li et al. [18] presented a multi-attribute regression reconstruction network that works well in complex cases, such as 2D images with severe poses, extreme expressions, and partial occlusions.

2.2 Generative Adversarial Networks

Generative adversarial networks (GANs) were first proposed by Goodfellow et al. to study generative modeling. Classical GANs consist of a generator and a discriminator. The generator aims to generate data samples that confuse the discriminator. The generator and the discriminator improve themselves to win this ‘game’ until a Nash equilibrium is achieved; the generator then successfully learns the distribution of the real dataset. GANs have been applied in many fields, including face image synthesis. Zhan et al. [39] proposed the Spatial Fusion GAN (SF-GAN), which obtains better results in both the geometry and appearance spaces by utilizing a geometry synthesizer and an appearance synthesizer. A triple-translation GAN (TTGAN) was proposed for face image synthesis by Ye et al. [38]; TTGAN adopts a triple translation consistency loss to translate from a rendered original input image to the desired output image. Sangloy et al. [28] proposed an adversarial image synthesis architecture that extracts information from sketched boundaries and color strokes and outputs realistic face images.

2.3 Face Image Synthesis

Deep pixel-level face generation has been studied for several years, and many methods achieve remarkable results. The context encoder [23] is the first deep learning network designed for image inpainting with an encoder-decoder architecture. Nevertheless, it does a poor job of dealing with human faces. Following this work, Yang et al. used a modified VGG network to improve the result of the context encoder by minimizing the feature difference of the photo background. Dolhansky et al. demonstrated the significance of exemplar data for inpainting; however, their method only focuses on filling in missing eye regions of frontal faces, so it does not generalize well. EdgeConnect [21] shows impressive progress by disentangling generation into two stages: an edge generator and an image completion network. Contextual Attention takes a similar two-step approach: it first produces a base estimate of the invisible region, and a refinement block then sharpens the result using background patch sets. The typical limitations of current face image generation schemes are the necessity of manipulation, the complexity of the underlying architectures, the degradation in accuracy, and the inability to restrict modification to local regions.

3 Proposed Approach

We propose a detailed 3D face reconstruction method (as shown in Fig. 1) based on a single photo that consists of two steps:

  • synthesizing a 2D face with complete facial features in response to the occluded area;

  • reconstructing a detailed 3D shape from the resulting unobstructed frontal image.

Fig. 1.

Method overview. In the first step, our face image synthesis network takes the target image \({{\textbf{I}}_{\textbf{in}}}\) and the map \({{\textbf{M}}_{\textbf{fin}}}\) as input. We utilize the face parsing map generation module and the edge lines map generation module to obtain the maps \({{\textbf{M}}_{\textbf{fa}}}\) and \({{\textbf{M}}_{\textbf{edge}}}\), and then obtain the final face parsing map \({{\textbf{M}}_{\textbf{fin}}}\) following Zhao et al.’s algorithm [42]. After obtaining the face image \({{\textbf{I}}_{\textbf{out}}}\) with eyeglasses removed, in the second step we leverage a ResNet-50 and a texture refinement network to reconstruct the final 3D model.

Fig. 2.

The overview of the proposed face parsing network.

3.1 Face Parsing Map Generation

Our goal is to achieve detailed 3D face shape reconstruction in occluded scenes. Pixel-level recognition of eyeglasses areas is a key step of our framework to ensure accuracy. Face parsing is a fundamental facial analysis task, and methods based on fully convolutional networks have recently achieved remarkable results on it [8, 20, 36]. As shown in Fig. 2, given a squarely resized face image \({{\textbf{I}}_{\textbf{in}}}\in {{\mathbb {R}}^{H\times W\times 3}}\), we apply a modified encoder-decoder network \({{\mathcal {N}}_{fa}}\) as the backbone for face parsing. We use \({{\mathcal {N}}_{fa}}\) to extract features at different levels for a multi-scale representation. In the structure of \({{\mathcal {N}}_{fa}}\), high-level features contain semantic information while low-level features capture local details, both of which are essential for face parsing. We feed the feature map with multi-scale information into the Edge Aware Graph Reasoning module, which learns a graph representation that characterizes the relations between vertices. The reasoning module consists of three components: a graph projection operation, a graph reasoning operation and a graph reprojection operation. The graph projection operation projects the initial information onto graph vertices; the graph reasoning operation reasons about the relations between regions over the graph; and the graph reprojection operation maps the acquired graph representation back to the original pixel grid, yielding an optimized feature map with the same dimension and size. We implemented the reasoning module following the method of Te et al. [31]. In the last step of the network, we feed the optimized features into a decoder to estimate the final pixel labels. In our network, two feature maps at different levels are fused in the decoder: the high-level feature map is first upsampled to the same spatial size as the low-level feature map, the two are concatenated, and a \(1\times 1\) convolution layer fuses them. Finally, we obtain the face parsing map \({{\textbf{M}}_{\textbf{fa}}}\in {{\mathbb {R}}^{H\times W\times 1}}\).
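
To make the decoder fusion step concrete, the following PyTorch-style sketch shows one plausible implementation of upsampling the high-level feature map, concatenating it with the low-level one, and mixing them with a \(1\times 1\) convolution. The module name, channel counts and class count are illustrative assumptions rather than the exact configuration of our network.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DecoderFusion(nn.Module):
    """Minimal sketch of the decoder fusion step: upsample the high-level
    (semantic) feature map to the size of the low-level (detail) map,
    concatenate the two, mix them with a 1x1 convolution and predict
    per-pixel parsing labels. Channel counts and the number of classes
    are illustrative assumptions."""
    def __init__(self, high_ch=256, low_ch=64, num_classes=19):
        super().__init__()
        self.fuse = nn.Conv2d(high_ch + low_ch, 128, kernel_size=1)
        self.classifier = nn.Conv2d(128, num_classes, kernel_size=1)

    def forward(self, feat_high, feat_low):
        # Bring the semantic features to the spatial size of the detail features.
        feat_high = F.interpolate(feat_high, size=feat_low.shape[2:],
                                  mode='bilinear', align_corners=False)
        fused = F.relu(self.fuse(torch.cat([feat_high, feat_low], dim=1)))
        return self.classifier(fused)
```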

3.2 Face Edge Lines Map Generation

Fig. 3.

The overview of the proposed face edge lines map generation approach.

In order to generate an accurate face parsing map, our method uses face edge lines to guide the reconstruction of the face parsing map. Face edge lines are closely related to facial landmarks. We choose face edge lines instead of landmarks because landmarks have difficulty representing the accurate structure of facial features [37]. In this section, we describe the proposed face edge lines map generation framework in detail. As shown in Fig. 3 (a) and (b), the proposed framework consists of two parts: (a) a face edge lines generation module; (b) a face edge lines effectiveness discriminator.

As shown in Fig. 3 (a), stacked U-Nets form the core of the face edge lines generation module. Unlike piecemeal landmarks, face edge lines describe the geometric structure of a face well. Most previous convolutional networks only use the convolutional features of the last layer, so image information at other scales is lost. Unlike these networks, the main contribution of the stacked U-Nets unit [22, 26] is to use multi-scale features to represent image information. We use the mean squared error (MSE) between the estimated face edge lines map and the ground-truth map as the training loss. The presence of obstructions (this paper focuses on eyeglasses) significantly affects the accuracy of edge lines generation. To mitigate the loss of image information caused by eyeglasses, we introduce message passing layers to pass information between face edge lines. In this implementation, the feature map at the end of each stack is divided into M areas. We implemented the message passing approach following the method of Chu et al. [4]. This process is visualized in Fig. 3.
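
As a minimal illustration of this training signal, the sketch below sums the MSE between each stack’s estimated edge lines map and the ground-truth map; the function and variable names are hypothetical.

```python
import torch.nn.functional as F

def stacked_edge_map_loss(predicted_maps, gt_map):
    """Sum of per-stack MSE losses: `predicted_maps` is a list of
    (B, 1, H, W) edge lines maps, one from the end of each U-Net stack,
    and `gt_map` is the (B, 1, H, W) ground-truth map."""
    return sum(F.mse_loss(pred, gt_map) for pred in predicted_maps)
```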

Intra-level Message Passing Layer. Intra-level message passing plays a crucial role in dealing with occlusion by eyeglasses. Such a layer is used at the end of each U-Nets stack to transmit information between visible edge lines and eyeglasses areas. Consequently, in the presence of eyeglasses, the prediction for the eyeglasses areas can be improved using information from the visible edge lines.

Inter-level Message Passing Layer. Different U-Nets stacks focus on different dimensions of facial information; when multiple stacks are used, facial information is transferred across stacks by communication between the earlier stacks and the later stacks. When stacking more hourglass subnets, inter-level message passing is adopted to ensure that the face edge lines map maintains its quality as messages are passed from the lower stacks to the higher stacks.
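
The following is a highly simplified sketch of the idea behind the intra-level message passing layer: the feature map at the end of a stack is split into M regions and each region receives a transformed message from its neighbour. The actual region layout and message topology follow Chu et al. [4] and are more elaborate than this toy illustration; all names and sizes here are assumptions.

```python
import torch
import torch.nn as nn

class IntraLevelMessagePassing(nn.Module):
    """Toy illustration of intra-level message passing: split the feature
    map into M horizontal bands and let each band add a transformed message
    from the previous band. Assumes the feature height is divisible by
    `num_regions`; the real layout follows Chu et al. [4]."""
    def __init__(self, channels, num_regions=4):
        super().__init__()
        self.num_regions = num_regions
        self.message = nn.Conv2d(channels, channels, kernel_size=1)

    def forward(self, feat):
        bands = list(torch.chunk(feat, self.num_regions, dim=2))
        for i in range(1, self.num_regions):
            # pass information from the previous region to the current one
            bands[i] = bands[i] + self.message(bands[i - 1])
        return torch.cat(bands, dim=2)
```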

Adversarial Learning for Edge Lines Effectiveness. A poor face edge lines map will adversely affect the accuracy of the 3D face model. During training, we use adversarial learning between the estimated edge lines map and the ground-truth map to guarantee the effectiveness of the edge lines map obtained in the generation stage. The face edge lines map generator produces the edge lines map \({{\textbf{M}}_{\textbf{edge}}}\in {{\mathbb {R}}^{H\times W\times 1}}\) together with the coordinate set \({{S}_{coor}}\) of its edge points, and the generated coordinate set is compared against the ground-truth distance matrix \(\textbf{M}{{\textbf{A}}_{\textbf{gt}}}\). To determine whether a generated edge lines map is fake or not, the ground-truth label \({{d}_{gt}}\) is calculated as:

$$\begin{aligned} {{d}_{gt}}({{\textbf{M}}_{\textbf{edge}}},{{S}_{coor}})=\left\{ \begin{aligned}&0,\quad \text {Est}_{s\in {{S}_{coor}}}({{d}_{gt}}<\theta )<\delta \\&1,\quad \text {other cases} \\ \end{aligned} \right. \end{aligned}$$
(1)

where Est denotes the probability value calculation function, \(\theta \) denotes the distance threshold to ground truth edge lines, \(\delta \) denotes the probability threshold.
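
A small sketch of how the label in Eq. (1) could be computed, assuming the ground-truth edge lines are available as a distance transform; the threshold values and array layout are illustrative assumptions.

```python
import numpy as np

def effectiveness_label(coords, gt_distance, theta=3.0, delta=0.8):
    """Sketch of Eq. (1). `coords` is an (N, 2) array of generated
    edge-point coordinates (x, y); `gt_distance` is an (H, W) distance
    transform of the ground-truth edge lines, so gt_distance[y, x] is the
    distance from pixel (x, y) to the nearest ground-truth edge pixel.
    theta and delta are assumed example thresholds."""
    x = coords[:, 0].astype(int)
    y = coords[:, 1].astype(int)
    close = gt_distance[y, x] < theta      # points close enough to the GT lines
    fraction_close = float(close.mean())   # Est_{s in S_coor}(distance < theta)
    return 1 if fraction_close >= delta else 0
```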

In order to combine the edge lines effectiveness discriminator D and the face edge lines map estimator G, we apply the concept of adversarial learning. The loss function of the discriminator D can be calculated as:

$$\begin{aligned} {{\mathcal {L}}_{D}}=\mathbb {E}[\log (1-\left| D(G({{\textbf{I}}_{\textbf{in}}}))-{{d}_{gt}} \right| )]-\mathbb {E}[\log D({{\textbf{M}}_{\textbf{gt}}})] \end{aligned}$$
(2)

where \({{\textbf{M}}_{\textbf{gt}}}\) denotes the ground truth face edge lines map. The discriminator is trained to score the ground-truth edge lines map as real and to score the generated edge lines map according to \({{d}_{gt}}\). With the effectiveness discriminator, the adversarial loss of the generator can be calculated as:

$$\begin{aligned} {{\mathcal {L}}_{adv-loss}}=\mathbb {E}\left[ \log (1-D(G({{\textbf{I}}_{\textbf{in}}}))) \right] \end{aligned}$$
(3)
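
The two losses can be transcribed almost literally. The sketch below is PyTorch-style pseudocode for Eqs. (2) and (3), with a small epsilon added for numerical stability; tensor shapes and the optimization loop are omitted.

```python
import torch

def discriminator_loss(d_fake, d_gt_label, d_real, eps=1e-7):
    """Literal transcription of Eq. (2). `d_fake` = D(G(I_in)),
    `d_real` = D(M_gt); `d_gt_label` is the effectiveness label from
    Eq. (1). eps avoids log(0)."""
    term_fake = torch.log(1.0 - torch.abs(d_fake - d_gt_label) + eps).mean()
    term_real = torch.log(d_real + eps).mean()
    return term_fake - term_real

def generator_adv_loss(d_fake, eps=1e-7):
    """Literal transcription of Eq. (3): adversarial loss for the edge
    lines map generator."""
    return torch.log(1.0 - d_fake + eps).mean()
```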

3.3 Recovering 3D Face Geometric Details

We obtain the final face parsing map \({{\textbf{M}}_{\textbf{fin}}}\) following Zhao et al.’s algorithm [42] and synthesize the face photo \({{\textbf{I}}_{\textbf{out}}}\) using existing methods [15]. Given \({{\textbf{I}}_{\textbf{out}}}\), we use a ResNet to regress the corresponding coefficient vector y. Because collecting large-scale high-resolution 3D texture datasets is still very costly, the ResNet is trained in a weakly supervised manner. The corresponding loss function consists of four parts [2, 5]:

$$\begin{aligned} {{\mathcal {L}}_{shape}}={{\lambda }_{feat}}{{\mathcal {L}}_{feat}}+{{\lambda }_{regu}}{{\mathcal {L}}_{regu}}+{{\lambda }_{phot}}{{\mathcal {L}}_{phot}}+{{\lambda }_{land}}{{\mathcal {L}}_{land}} \end{aligned}$$
(4)

Here we set \({{\lambda }_{feat}}=0.2\), \({{\lambda }_{regu}}=3.6\times {{10}^{-4}}\), \({{\lambda }_{phot}}=1.4\) and \({{\lambda }_{land}}=1.6\times {{10}^{-3}}\) in all our experiments.
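
For completeness, Eq. (4) amounts to the following weighted sum, using the weights listed above; the individual loss terms are assumed to be scalars computed elsewhere.

```python
def shape_loss(l_feat, l_regu, l_phot, l_land,
               lambda_feat=0.2, lambda_regu=3.6e-4,
               lambda_phot=1.4, lambda_land=1.6e-3):
    """Weighted combination of the four shape-loss terms in Eq. (4),
    using the weights reported above."""
    return (lambda_feat * l_feat + lambda_regu * l_regu
            + lambda_phot * l_phot + lambda_land * l_land)
```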

The addition of human face geometric details is the core of our method. We choose to add a bump map on top of the base shape \({{\textbf{S}}_{\textbf{basi}}}\). Inspired by image-to-image translation methods, we define the displacements of the depth map as the distances from the pixels of \({{\textbf{I}}_{\textbf{out}}}\) to the 3D face surface. Formally, we define the bump map \(\varPhi (\textbf{b})\) as:

$$\begin{aligned} \varPhi (\textbf{b})=\left\{ \begin{aligned}&\phi ({d}'(\textbf{b})-d(\textbf{b})),\quad \text {face projects to } \textbf{b} \\&\phi (0),\quad \text {other cases} \\ \end{aligned} \right. \end{aligned}$$
(5)

where \(\phi (\cdot )\) denotes an encoding function that converts the depth value to the linear range \([0,\ldots ,255]\), \(\textbf{b}\) denotes the pixel coordinate \(\left[ x,y \right] \) in \({{\textbf{I}}_{\textbf{out}}}\), \({d}'(\textbf{b})\) denotes the depth of the detailed face shape, i.e., the distance from its surface to \(\textbf{b}\) along the line of sight, and \(d(\textbf{b})\) denotes the depth of the basic shape.
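
A minimal NumPy sketch of Eq. (5), assuming a simple linear encoding \(\phi \) that clips the depth offset to a symmetric range before mapping it to \([0,\ldots ,255]\); the clipping range is an assumed example value, not a quantity from the paper.

```python
import numpy as np

def encode_offset(offset, max_abs=0.01):
    """Assumed linear encoding phi: clip the signed depth offset to
    [-max_abs, max_abs] and map it to [0, 255]."""
    offset = np.clip(offset, -max_abs, max_abs)
    return (offset + max_abs) / (2.0 * max_abs) * 255.0

def decode_offset(value, max_abs=0.01):
    """Inverse encoding phi^{-1}: map a [0, 255] value back to a signed
    depth offset."""
    return value / 255.0 * (2.0 * max_abs) - max_abs

def bump_map(depth_detailed, depth_base, face_mask, max_abs=0.01):
    """Eq. (5): pixels onto which the face projects store
    phi(d'(b) - d(b)); all other pixels store phi(0)."""
    background = encode_offset(0.0, max_abs)
    offsets = encode_offset(depth_detailed - depth_base, max_abs)
    return np.where(face_mask, offsets, background)
```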

Thus, given a bump map \(\varPhi \) and the depth of the basic shape, we can compute the detailed depth as \({d}'(\textbf{b})=d(\textbf{b})+{{\phi }^{-1}}(\varPhi (\textbf{b}))\). To increase geometric details while suppressing noise, we define the loss function as follows:

$$\begin{aligned} {{\mathcal {L}}_{geo}}=\left\| \tilde{\varPhi }-\varPhi \right\| +\left\| \frac{\partial \tilde{\varPhi }}{\partial x}-\frac{\partial \varPhi }{\partial x} \right\| +\left\| \frac{\partial \tilde{\varPhi }}{\partial y}-\frac{\partial \varPhi }{\partial y} \right\| \end{aligned}$$
(6)

where \(\left\| \cdot \right\| \) denotes the \({{L}_{1}}\) norm, \(\tilde{\varPhi }\) denotes the ground truth, and \(\frac{\partial \tilde{\varPhi }}{\partial x}\), \(\frac{\partial \tilde{\varPhi }}{\partial y}\) denote the 2D gradient of the bump map. After the 3D face is reconstructed, it can be projected onto the image plane with the perspective projection:

$$\begin{aligned} {{V}_{2d}}\left( \textbf{P} \right) =f*{\textbf{P}_{\textbf{r}}}*{\textbf{R}}*{\textbf{S}_{\textbf{mod} }}+{\textbf{t}_{\textbf{2d}}} \end{aligned}$$
(7)

where \({{V}_{2d}}\left( \textbf{P} \right) \) denotes the projection function that maps the 3D model to 2D face positions, f denotes the scale factor, \({\textbf{P}_{\textbf{r}}}\) denotes the projection matrix, \(\textbf{R}\in SO(3)\) denotes the rotation matrix and \({\textbf{t}_{\textbf{2d}}}\in {{\mathbb {R}}^{3}}\) denotes the translation vector.
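
The projection of Eq. (7) can be sketched as follows for row-vector vertices; here \({\textbf{P}_{\textbf{r}}}\) is assumed to be a fixed \(3\times 3\) projection matrix and the variable names are illustrative.

```python
import numpy as np

def project_vertices(vertices, f, P_r, R, t_2d):
    """Sketch of Eq. (7): V_2d = f * P_r * R * S_mod + t_2d for every model
    vertex. `vertices` is an (N, 3) array of 3D positions, `R` a 3x3
    rotation matrix, `P_r` a 3x3 projection matrix, `f` a scalar scale
    factor, and `t_2d` a length-3 translation vector."""
    return f * (vertices @ R.T @ P_r.T) + t_2d
```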

We approximate the scene illumination with Spherical Harmonics (SH) [24], parameterized by a coefficient vector \(\gamma \in {{\mathbb {R}}^{9}}\). In summary, the unknown parameters to be learned can be denoted by a vector \(y=({{\boldsymbol{\alpha }}_{\textbf{id}}},{{\boldsymbol{\beta }}_{\textbf{exp}}},{{\boldsymbol{\beta }}_{\textbf{t}}},\boldsymbol{\gamma },\textbf{p})\in {{\mathbb {R}}^{239}}\), where \(\textbf{p}\in {{\mathbb {R}}^{6}}=\{\textbf{pitch},\textbf{yaw},\textbf{roll},f,{{\textbf{t}}_{\textbf{2D}}}\}\) denotes the face pose. In this work, we use a fixed ResNet-50 network to regress these coefficients.
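
For reference, the sketch below evaluates the standard second-order (9-term) spherical harmonics lighting model [24] for a monochrome coefficient vector \(\gamma \in {{\mathbb {R}}^{9}}\); the paper's full appearance model (albedo and color channels) is omitted and the function names are assumptions.

```python
import numpy as np

def sh_basis(normals):
    """Second-order (9-term) real spherical harmonics basis evaluated at
    unit surface normals of shape (N, 3). The constants are the standard
    SH basis coefficients."""
    x, y, z = normals[:, 0], normals[:, 1], normals[:, 2]
    return np.stack([
        0.282095 * np.ones_like(x),
        0.488603 * y, 0.488603 * z, 0.488603 * x,
        1.092548 * x * y, 1.092548 * y * z,
        0.315392 * (3.0 * z ** 2 - 1.0),
        1.092548 * x * z,
        0.546274 * (x ** 2 - y ** 2),
    ], axis=1)                                  # (N, 9)

def sh_irradiance(normals, gamma):
    """Per-vertex irradiance under the SH lighting model: dot product of
    the basis with the learned coefficient vector gamma (length 9)."""
    return sh_basis(normals) @ gamma            # (N,)
```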

We found that adding the last two gradient terms of the loss function reduces bump map noise by favoring smoother surfaces, while the final results show that high-frequency details are preserved.
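
Eq. (6) can be implemented with finite differences as in the sketch below; the tensor layout is an assumption.

```python
import torch

def geometric_detail_loss(bump_pred, bump_gt):
    """Sketch of Eq. (6): mean absolute difference between predicted and
    ground-truth bump maps plus the same difference between their
    horizontal and vertical finite differences (the 2D gradients).
    Inputs are (B, 1, H, W) tensors."""
    def grad_x(m):
        return m[..., :, 1:] - m[..., :, :-1]

    def grad_y(m):
        return m[..., 1:, :] - m[..., :-1, :]

    loss = (bump_pred - bump_gt).abs().mean()
    loss = loss + (grad_x(bump_pred) - grad_x(bump_gt)).abs().mean()
    loss = loss + (grad_y(bump_pred) - grad_y(bump_gt)).abs().mean()
    return loss
```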

4 Implementation Details

All the networks were trained using the Adam solver [12]. To train our face parsing map generation network, we collected data from two sources: the Helen dataset [14] and the CelebAMask-HQ dataset [15]. The Helen dataset contains 2330 images with 11 categories: background, skin, paired lips, paired eyes, paired brows, mouth and hair. The CelebAMask-HQ dataset is a large-scale face parsing dataset that includes 30000 high-resolution portrait images annotated with 19 categories. In addition to facial components, accessories such as eyeglasses, earrings, necklaces, neck, and clothing are also annotated.

In the face parsing map generation stage, our backbone is a modified version of the trained parsing model of Te et al. [31], from which we remove the average pooling layer. For the pyramid pooling module, we follow the implementation of Te et al. [31] to exploit global contextual information. We use this fixed parsing model to generate \({{\textbf{M}}_{\textbf{fa}}}\). In the face edge lines map generation stage, all training images are cropped and resized to \(512\times 512\). We obtain \({{\textbf{M}}_{\textbf{edge}}}\) from the edge lines map generation network; our message passing implementation follows Chu et al. [4] and naturally captures face features at different scales. For these two stages, we train our networks on four datasets, including 300W (3148 sample images) [27] and AFLW (24386 sample images) [13].

5 Experimental Results

In this work, we aim to generate diverse yet realistic detailed 3D reconstructions from occluded face images. Our approach should be characterized by the following qualities: 1) the reconstructed geometry should fit the visible regions as convincingly as possible, and 2) the reconstructed model texture should not include eyeglasses, which is essential for the accuracy of the reconstruction.

5.1 Qualitative Comparisons with Recent Art

Figure 4 shows our results compared with other recent methods. The last column shows our results; the remaining columns show the results of Sela et al. [29], PRNet [6] and 3DDFA [7]. Our method handles the occluded areas better than the other methods: Fig. 4 shows that it can reconstruct a complete face shape with geometry details under occlusions such as glasses, food and fingers. The approach of 3DDFA was aimed at extremely large poses, so it cannot reconstruct a detailed face model in occluded scenes and its shapes lack details. The other methods focus on generating high-resolution face textures rather than geometry details, and they also cannot effectively deal with occluded scenes.

Fig. 4.

Comparison of qualitative results. From left to right: Sela et al., PRNet, 3DDFA and our method.

Fig. 5.

Reconstructions with eyeglasses. Left: qualitative results of Sela et al. [29] and our shape. Right: LFW verification ROC for the shapes, with and without eyeglasses.

5.2 Quantitative Comparison with Recent Art

Our choice of ResNet-50 to regress the shape coefficients is motivated by its robustness to extreme viewing conditions, as reported by Deng et al. [5]. To fully support the application of our method to occluded face images, we test our system on the Labeled Faces in the Wild (LFW) dataset [10]. We use the same face test system as Anh et al. [34], and we refer to that paper for more details.

Figure 5 (left) shows the sensitivity of the method of Sela et al. [29]: their result clearly shows the outline of the eyeglasses. Their failure may be due to their focus on local details, which only weakly regularizes the global shape. Our method, in contrast, recognizes and regenerates the occluded area and robustly produces a natural face shape in the presence of eyeglasses. Though the 3DMM also limits the details of the shape, we use it only as a foundation and add refined texture separately.

Table 1. Quantitative evaluations on LFW.

We further quantitatively verify the robustness of our method to eyeglasses. Table 1 (top) reports verification results on the LFW benchmark with and without eyeglasses (see also the ROC curves in Fig. 5, right). Though eyeglasses clearly impact recognition, the drop in the curve is limited, demonstrating the robustness of our method.

6 Conclusions

In this work, we have described a detailed 3D face reconstruction framework that runs efficiently in occluded scenes. Our method enables unobstructed face image synthesis by concatenating the original face parsing map with the face edge lines map, both of which are extracted from the input face image, in an encoder-decoder network. Experiments on 3D face reconstruction using various datasets show that our method can effectively remove eyeglasses with equivalent quality and better accuracy control than existing methods.