
1 Introduction

3D face reconstruction is an important and popular research field in computer vision [4, 12, 33], with wide applications in face recognition, video editing, film avatars, and more. Face occlusions (such as eyeglasses, masks, and eyebrow pendants) noticeably degrade the performance of face recognition and face animation, because the 3D texture of the occluded facial area cannot be predicted robustly. Removing occlusions from a face image robustly and automatically is therefore a crucial problem in 3D face reconstruction. Since the human face is a particular kind of image (the face region is small yet rich in features, and humans are highly familiar with and sensitive to it), general-purpose image inpainting techniques cannot be used directly to remove face occlusions. Traditional inpainting methods reconstruct a damaged region from its surrounding pixels and do not consider the structure of the face: if one eye is occluded, a conventionally inpainted face cannot recover that eye, and the output 2D face will have only one eye [36]. However, the situation has changed in recent years. Thanks to the rapid development of deep learning and face parsing methods, face inpainting approaches have advanced quickly, and some previously extreme scenarios (e.g., eyeglasses) have become tractable.

The 3D Morphable Model (3DMM), proposed in 1999, is a widely influential template-based reconstruction method [3, 4, 7, 25, 38, 40]. Since facial features are distributed very regularly, template-based methods remain in use today. On the other hand, because the model is confined to the template's linear space, its expressive power is limited, especially for geometric details.

In this paper, we propose a robust and fast eyeglasses-removal face reconstruction algorithm based on face parsing and deep learning. The main contributions are summarized as follows:

  • We propose a novel algorithm that combines facial landmarks and a face parsing map to generate an eyeglasses-free face image.

  • To handle face regions that are invisible under eyeglasses occlusion, we propose synthesizing the input face image with a Generative Adversarial Network instead of reconstructing the 3D face directly.

  • We improve the loss function of our 3D reconstruction framework for eyeglasses-occluded scenes. Our method obtains state-of-the-art qualitative performance on real-world images.

2 Related Work

2.1 Generic Face Reconstruction

Blanz et al. [3, 35] proposed the 3D Morphable Model (3DMM) for modeling the 3D face from a single face photo. 3DMM is a statistical model that represents shape and texture in a vector space. Although it yields a relatively robust face model, the expressive power of the 3D model is limited, and the method suffers from high computational cost. Rara et al. [28] proposed a regression model between 2D face landmarks and the corresponding 3DMM coefficients, employing principal component regression to predict the model coefficients. Since large facial pose changes may degrade 2D facial landmark detection, Dou et al. [9] proposed a dictionary-based representation of 3D face shape and adopted sparse coding to predict the model coefficients; comparative experiments show that their method is more robust than previous landmark-based approaches. Following this work, Zhou et al. [41] also utilized a dictionary-based model and introduced a convex formulation to estimate the model parameters.

With the development of deep learning, 3D face reconstruction has achieved remarkable progress in both quality and efficiency through Convolutional Neural Networks (CNNs). In 2017, Anh et al. [33] utilized ResNet to estimate the 3D Morphable Model parameters. However, the performance of such methods is restricted by the 3D space defined by the face model basis or the 3DMM templates.

2.2 Face Parsing

The unique structural pattern of the human face carries rich semantic information, such as the eyes, mouth, and nose. Low- and mid-level visual features of the known region are insufficient to infer the missing semantic content, so they alone cannot model the face geometry [2, 16]. Generative Face Completion [21] introduces face parsing to impose regular semantic constraints. As shown in Fig. 1, a face parsing map can assist the face inpainting task.

2.3 Deep Face Synthesis Methods

In existing deep-learning-based face inpainting methods, the standard convolution layers mix two sources when synthesizing pixels for the region to be inpainted: valid values from the unoccluded area and substitute values from the occluded area. This usually leads to color artifacts and visual blur. Deep learning has been widely used in face synthesis tasks. Li et al. [21] introduced the face parsing map into the face synthesis task to guide the GAN toward a more plausible face structure.

3 Our Method

3.1 Landmark Estimation Network

In the landmark estimation task, we built a sufficiently effective landmark estimation network (Fig. 2) based on the MobileNet-V3 model [15]. In our method, accurate facial landmark \({{\mathbf {L}}_{\mathbf {face}}}\in {{\mathbb {R}}^{2\times 68}}\) generation is a crucial step. The network \({{\mathcal {N}}_{L}}\) aims to generate \({{\mathbf {L}}_{\mathbf {face}}}\) from a face image \({{\mathbf {I}}_{\mathbf {in}}}\): \({{\mathbf {L}}_{\mathbf {face}}}={{\mathcal {N}}_{L}}({{\mathbf {I}}_{\mathbf {in}}};{{\theta }_{lmk}})\), where \({{\theta }_{lmk}}\) denotes the model parameters. Unlike traditional detectors [19, 37], \({{\mathcal {N}}_{L}}\) is designed purely to extract landmark features rather than to support face recognition. We set the loss function as follows:

$$\begin{aligned} {{\mathcal {L}}_{lmk}}=\left\| {{\mathbf {L}}_{\mathbf {face}}}-{{\mathbf {L}}_{\mathbf {gt}}} \right\| _{2}^{2} \end{aligned}$$
(1)

where \({{\mathbf {L}}_{\mathbf {gt}}}\) denotes the ground truth face landmarks and \({{\left\| \cdot \right\| }_{2}}\) denotes the \({{L}_{2}}\) norm.
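For illustration, a minimal sketch of such a landmark regressor and the loss of Eq. (1) is given below; the backbone choice (torchvision's mobilenet_v3_small), the 576-channel feature width, and all names are assumptions for illustration, not the exact configuration of \({{\mathcal {N}}_{L}}\).

```python
# Minimal PyTorch sketch of a landmark regressor in the spirit of N_L,
# assuming torchvision's mobilenet_v3_small backbone (576 feature channels)
# and a plain linear head; the configuration used in our network may differ.
import torch
import torch.nn as nn
from torchvision.models import mobilenet_v3_small

class LandmarkNet(nn.Module):
    def __init__(self, num_landmarks=68):
        super().__init__()
        backbone = mobilenet_v3_small(weights=None)
        self.features = backbone.features               # convolutional feature extractor
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.head = nn.Linear(576, 2 * num_landmarks)   # regress 2 x 68 coordinates

    def forward(self, image):                           # image: (B, 3, H, W)
        x = self.pool(self.features(image)).flatten(1)
        return self.head(x).view(-1, 2, 68)             # L_face in R^{2 x 68}

def landmark_loss(pred, gt):
    """Eq. (1): squared L2 distance between predicted and ground-truth landmarks."""
    return ((pred - gt) ** 2).sum(dim=(1, 2)).mean()
```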

Fig. 1. Our method overview. See related sections for details.

Fig. 2. Landmark prediction network of our method.

3.2 Face Synthesis Module

Overall, we design the synthesis module \({{\mathcal {N}}_{s}}\) to synthesize a 2D image of a human face without eyeglasses. The module \({{\mathcal {N}}_{s}}\) consists of three parts: a deleter, a generator, and a discriminator.

Deleter. The task of the deleter is to delete the occluded eyeglasses area \({{\mathbf {I}}_{\mathbf {m}}}\) covering the facial features in the input image \({{\mathbf {I}}_{\mathbf {in}}}\) (Fig. 3). The deleter \({{\mathcal {N}}_{de}}\) is based on the U-Net structure [30]. Inspired by the annotated face dataset CelebAMask-HQ [20], we use the encoder-decoder architecture \({{\mathcal {N}}_{de}}\) to estimate pixel-level label classes. Given the input face image \({{\mathbf {I}}_{\mathbf {in}}}\in {{\mathbb {R}}^{\text {H}\times \text {W}\times \text {3}}}\), we apply the trained model \({{\mathcal {N}}_{de}}\) to obtain the face parsing map \(\mathbf {M}\in {{\mathbb {R}}^{\text {H}\times \text {W}\times 1}}\). According to the map \(\mathbf {M}\), we identify and delete the eyeglasses area \({{\mathbf {I}}_{\mathbf {m}}}\) to obtain the corrupted image \({{\mathbf {I}}_{\mathbf {co}}}\).
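A hedged sketch of this deletion step follows: given per-pixel parsing logits from \({{\mathcal {N}}_{de}}\) (network not shown), pixels labeled as eyeglasses are zeroed out to form \({{\mathbf {I}}_{\mathbf {co}}}\). The eyeglasses class index depends on the label mapping in use; the value below follows one common CelebAMask-HQ ordering and is an assumption.

```python
# Sketch of the deleter step: argmax the parsing logits to get the map M,
# mark the eyeglasses class as I_m, and zero those pixels to obtain I_co.
import torch

EYEGLASSES_CLASS = 6  # assumed index of the 'eye_g' label; depends on the parsing label mapping

def delete_eyeglasses(image, parsing_logits):
    """image: (B, 3, H, W); parsing_logits: (B, C, H, W) from the parsing network."""
    labels = parsing_logits.argmax(dim=1, keepdim=True)   # face parsing map M, (B, 1, H, W)
    mask = (labels == EYEGLASSES_CLASS).float()           # eyeglasses area I_m
    corrupted = image * (1.0 - mask)                      # corrupted image I_co
    return corrupted, mask
```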

Fig. 3. Face parsing module of our method.

Generator. The generator \({{\mathcal {N}}_{ge}}\) is also based on the U-Net structure; it aims to synthesize the full face from the corrupted image \({{\mathbf {I}}_{\mathbf {co}}}\) and the landmarks \(\mathbf {L}\) (\({{\mathbf {L}}_{\mathbf {face}}}\) or \({{\mathbf {L}}_{\mathbf {gt}}}\)). The generator can be formulated as \({{\mathbf {I}}_{\mathbf {out}}}={{\mathcal {N}}_{ge}}({{\mathbf {I}}_{\mathbf {co}}},\mathbf {L};{{\theta }_{ge}})\), where \({{\theta }_{ge}}\) denotes the trainable parameters.
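One plausible way to feed the landmarks to \({{\mathcal {N}}_{ge}}\) is to rasterize them into a heatmap channel and concatenate it with \({{\mathbf {I}}_{\mathbf {co}}}\), as sketched below; the exact conditioning scheme and the Gaussian width are assumptions for illustration.

```python
# Sketch: rasterize the 68 landmarks into a single Gaussian heatmap channel
# and concatenate it with the corrupted image as the generator input.
import torch

def landmarks_to_heatmap(landmarks, height, width, sigma=2.0):
    """landmarks: (B, 2, 68) in pixel coordinates -> (B, 1, H, W) heatmap."""
    b = landmarks.shape[0]
    ys = torch.arange(height, dtype=torch.float32).view(1, 1, height, 1)
    xs = torch.arange(width, dtype=torch.float32).view(1, 1, 1, width)
    lx = landmarks[:, 0].view(b, -1, 1, 1)                 # x coordinates, (B, 68, 1, 1)
    ly = landmarks[:, 1].view(b, -1, 1, 1)                 # y coordinates
    d2 = (xs - lx) ** 2 + (ys - ly) ** 2                   # squared distance to each landmark
    return torch.exp(-d2 / (2 * sigma ** 2)).amax(dim=1, keepdim=True)

def generator_input(corrupted, landmarks):
    heat = landmarks_to_heatmap(landmarks, corrupted.shape[2], corrupted.shape[3])
    return torch.cat([corrupted, heat], dim=1)             # passed to N_ge
```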

Discriminator. The purpose of the discriminator is to judge whether the generated data distribution matches the real one. The goal of face synthesis is achieved when the generated results are indistinguishable from real images.

Losses of the Synthesis Module. We use a combination of a per-pixel loss, a perceptual loss, a style loss, a total variation loss, and an adversarial loss to train the face synthesis network.

The per-pixel loss is formulated as follows:

$$\begin{aligned} {{\mathcal {L}}_{pixe}}=\frac{1}{S}\left\| {{\mathbf {I}}_{\mathbf {out}}}-{{\mathbf {I}}_{\mathbf {in}}} \right\| \end{aligned}$$
(2)

where S denotes the mask size and \(\left\| \cdot \right\| \) denotes the \({{L}_{1}}\) norm. S serves as a normalizer that adjusts the penalty with the size of the hole. The objective of the per-pixel loss is to minimize the difference between the input face image and the synthetic image. Note that the reference image \({{\mathbf {I}}_{\mathbf {in}}}\) contains no occlusion during training, so no additional masking needs to be considered here.
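A direct reading of Eq. (2), assuming S is the number of masked pixels, could look as follows:

```python
# Per-pixel loss of Eq. (2): L1 difference between I_out and I_in, scaled by
# the mask size S (assumed here to be the number of occluded pixels).
def per_pixel_loss(i_out, i_in, mask):
    s = mask.sum().clamp(min=1.0)              # mask size S (avoid division by zero)
    return (i_out - i_in).abs().sum() / s
```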

The style loss computes the style distance between two images as follows:

$$\begin{aligned} {{\mathcal {L}}_{style}}=\sum \limits _{\text {n}}{\frac{1}{{{O}_{n}}\times {{O}_{n}}}\left\| \frac{{{G}_{n}}({{\mathbf {I}}_{\mathbf {out}}}\odot {{\mathbf {I}}_{\mathbf {m}}})-{{G}_{n}}({{\mathbf {I}}_{\mathbf {in}}}\odot {{\mathbf {I}}_{\mathbf {m}}})}{{{O}_{n}}\times {{H}_{n}}\times {{W}_{n}}} \right\| } \end{aligned}$$
(3)

where \({{G}_{n}}(x)={{\varphi }_{n}}{{(x)}^{T}}{{\varphi }_{n}}(x)\) denotes the Gram matrix of \({{\varphi }_{n}}(x)\), and \({{\varphi }_{n}}(\cdot )\) denotes the \({{O}_{n}}\) feature maps of size \({{H}_{n}}\times {{W}_{n}}\) in the n-th layer.
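Eq. (3) can be sketched as below; phi_n stands for the activations of a pretrained feature extractor (e.g., VGG), which is not shown, and the exact layer set is an assumption.

```python
# Style loss of Eq. (3): compare Gram matrices of masked feature maps.
import torch

def gram_matrix(feat):
    b, c, h, w = feat.shape
    f = feat.view(b, c, h * w)
    return torch.bmm(f, f.transpose(1, 2)) / (c * h * w)   # G_n / (O_n * H_n * W_n)

def style_loss(feats_out, feats_in):
    """feats_*: lists of phi_n(I_out * I_m) and phi_n(I_in * I_m), one entry per layer n."""
    loss = 0.0
    for fo, fi in zip(feats_out, feats_in):
        o_n = fo.shape[1]                                   # number of feature maps O_n
        loss = loss + (gram_matrix(fo) - gram_matrix(fi)).abs().sum() / (o_n * o_n)
    return loss
```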

Due to the normalization layers, the synthesized face may exhibit artifacts such as checkerboard patterns or water droplets. We define the total variation loss as:

$$\begin{aligned} {{\mathcal {L}}_{{\text {var}}}}=\frac{1}{{{\text {P}}_{{{\mathbf {I}}_{\mathbf {in}}}}}}\left\| \nabla {{\mathbf {I}}_{\mathbf {out}}} \right\| \end{aligned}$$
(4)

where \({{\text {P}}_{{{\mathbf {I}}_{\mathbf {in}}}}}\) is the pixel number of \({{\mathbf {I}}_{\mathbf {in}}}\) and \(\nabla \) is the first order derivative, containing \({{\nabla }_{h}}\) (horizontal) and \({{\nabla }_{v}}\) (vertical).
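Eq. (4) reduces to first-order finite differences, for example:

```python
# Total variation loss of Eq. (4): horizontal and vertical differences of I_out,
# normalized by the pixel count of I_in.
def total_variation_loss(i_out, i_in):
    p = float(i_in.shape[2] * i_in.shape[3])                        # pixel count P_{I_in}
    dh = (i_out[:, :, :, 1:] - i_out[:, :, :, :-1]).abs().sum()     # horizontal gradient
    dv = (i_out[:, :, 1:, :] - i_out[:, :, :-1, :]).abs().sum()     # vertical gradient
    return (dh + dv) / p
```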

The total loss of the face synthesis module is:

$$\begin{aligned} {{\mathcal {L}}_{fsm}}={{\lambda }_{pixe}}{{\mathcal {L}}_{pixe}}+{{\lambda }_{style}}{{\mathcal {L}}_{style}}+{{\lambda }_{{\text {var}}}}{{\mathcal {L}}_{{\text {var}}}} \end{aligned}$$
(5)

Here, we use \({{\lambda }_{pixe}}=1\), \({{\lambda }_{style}}=250\) and \({{\lambda }_{{\text {var}}}}=0.1\) in our experiments.
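Combining the hypothetical helpers from the sketches above, Eq. (5) with these weights would read:

```python
# Weighted total loss of the face synthesis module, Eq. (5).
def synthesis_loss(i_out, i_in, mask, feats_out, feats_in):
    return (1.0 * per_pixel_loss(i_out, i_in, mask)
            + 250.0 * style_loss(feats_out, feats_in)
            + 0.1 * total_variation_loss(i_out, i_in))
```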

3.3 3D Face Reconstruction

The classic single-view 3D face reconstruction methods utilize a 3D template model (e.g., 3DMM) to fit the input face image [3, 25]. This type of method usually consists of two steps: face alignment and regression of the 3DMM coefficients. The seminal works [4, 8, 11] describe the 3D face space with PCA:

$$\begin{aligned} \mathbf {S}=\overline{\mathbf {S}}+{{\mathbf {A}}_{\mathbf {id}}}{{\boldsymbol{\alpha }}_{\mathbf {id}}}+{{\mathbf {B}}_{\mathbf {exp}}}{{\boldsymbol{\beta }}_{\mathbf {exp}}},\mathbf {T}=\overline{\mathbf {T}}+{{\mathbf {B}}_{\mathbf {t}}}{{\boldsymbol{\beta }}_{\mathbf {t}}} \end{aligned}$$
(6)

where \(\overline{\mathbf {S}}\) and \(\overline{\mathbf {T}}\) denote the mean shape and texture, and \({{\mathbf {A}}_{\mathbf {id}}}\), \({{\mathbf {B}}_{\mathbf {exp}}}\) and \({{\mathbf {B}}_{\mathbf {t}}}\) denote the PCA bases of identity, expression and texture, respectively. \({{\boldsymbol{\alpha }}_{\mathbf {id}}}\in {{\mathbb {R}}^{80}}\), \({{\boldsymbol{\beta }}_{\mathbf {exp}}}\in {{\mathbb {R}}^{64}}\) and \({{\boldsymbol{\beta }}_{\mathbf {t}}}\in {{\mathbb {R}}^{80}}\) are the corresponding 3DMM coefficient vectors.
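Eq. (6) is a pair of linear combinations; a sketch in PyTorch, with basis shapes chosen for illustration (a Basel-style model stores vertices as a flat 3N vector), is:

```python
# Linear 3DMM of Eq. (6): shape and texture as mean plus basis times coefficients.
import torch

def build_face(s_mean, a_id, b_exp, t_mean, b_t, alpha_id, beta_exp, beta_t):
    """s_mean, t_mean: (3N,); a_id, b_t: (3N, 80); b_exp: (3N, 64); coefficients are 1-D."""
    shape = s_mean + a_id @ alpha_id + b_exp @ beta_exp     # S = S_mean + A_id a_id + B_exp b_exp
    texture = t_mean + b_t @ beta_t                         # T = T_mean + B_t b_t
    return shape.view(-1, 3), texture.view(-1, 3)           # per-vertex positions and albedo
```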

After the 3D face is reconstructed, it can be projected onto the image plane with the perspective projection:

$$\begin{aligned} {{\mathbf {V}}_{\mathbf {2d}}}\left( \mathbf {P} \right) =f*{{\mathbf {P}}_{\mathbf {r}}}*\mathbf {R}*{{\mathbf {S}}_{\mathbf {mod}}}+{{\mathbf {t}}_{\mathbf {2d}}} \end{aligned}$$
(7)

where \({{\mathbf {V}}_{\mathbf {2d}}}\left( \mathbf {P} \right) \) denotes the projection function that maps the 3D model vertices to 2D image positions, f denotes the scale factor, \({{\mathbf {P}}_{\mathbf {r}}}\) denotes the projection matrix, \(\mathbf {R}\in SO(3)\) denotes the rotation matrix, and \({{\mathbf {t}}_{\mathbf {2d}}}\in {{\mathbb {R}}^{3}}\) denotes the translation vector.
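Eq. (7) can be sketched as a scaled projection of rotated vertices; here the translation is applied on the image plane as a 2-vector, which is an illustrative simplification of the notation above.

```python
# Projection of Eq. (7): V_2d = f * P_r * R * S_mod + t_2d (sketch).
import torch

def project(vertices, rotation, scale, t_2d):
    """vertices: (N, 3); rotation: (3, 3); t_2d: (2,) -> (N, 2) image coordinates."""
    pr = torch.tensor([[1.0, 0.0, 0.0],
                       [0.0, 1.0, 0.0]])                    # projection matrix P_r
    return scale * (vertices @ rotation.T) @ pr.T + t_2d
```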

We approximate the scene illumination with Spherical Harmonics (SH) [6, 23, 26, 27] parameterized by a coefficient vector \(\gamma \in {{\mathbb {R}}^{9}}\). In summary, the unknown parameters to be learned can be denoted by a vector \(y=({{\boldsymbol{\alpha }}_{\mathbf {id}}},{{\boldsymbol{\beta }}_{\mathbf {exp}}},{{\boldsymbol{\beta }}_{\mathbf {t}}},\boldsymbol{\gamma },\mathbf {p})\in {{\mathbb {R}}^{239}}\), where \(\mathbf {p}\in {{\mathbb {R}}^{6}}=\{\mathbf {pitch},\mathbf {yaw},\mathbf {roll},f,{{\mathbf {t}}_{\mathbf {2D}}}\}\) denotes the face pose. In this work, we used a fixed ResNet-50 [14] network to regress these coefficients.
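Since the regressor outputs a single 239-dimensional vector, it must be sliced back into its components; the slicing below simply follows the order and dimensions stated above (80 + 64 + 80 + 9 + 6 = 239).

```python
# Split the 239-D regression output y into 3DMM, illumination and pose parts.
def split_coefficients(y):
    """y: (B, 239) output of the ResNet-50 regressor."""
    alpha_id = y[:, 0:80]        # identity coefficients
    beta_exp = y[:, 80:144]      # expression coefficients
    beta_t   = y[:, 144:224]     # texture coefficients
    gamma    = y[:, 224:233]     # spherical harmonics illumination
    pose     = y[:, 233:239]     # pitch, yaw, roll, f, t_2d
    return alpha_id, beta_exp, beta_t, gamma, pose
```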

The corresponding loss function consists of two parts: pixel-wise loss and face feature loss.

Pixel-Wise Loss. This loss minimizes the difference between the input image \(\mathbf {I}_{\mathbf {out}}^{(i)}\) and the rendered image \(\mathbf {I}_{\mathbf {y}}^{(i)}\): the rendering layer renders back an image \(\mathbf {I}_{\mathbf {y}}^{(i)}\), which is compared with \(\mathbf {I}_{\mathbf {out}}^{(i)}\). The pixel-wise loss is formulated as:

$$\begin{aligned} {{\mathcal {L}}_{1}}={{\left\| \mathbf {I}_{\mathbf {out}}^{(i)}-\mathbf {I}_{\mathbf {y}}^{(i)} \right\| }_{2}} \end{aligned}$$
(8)

where i denotes pixel index and \({{\left\| \cdot \right\| }_{2}}\) denotes the \({{L}_{2}}\) norm.

Face Feature Loss. We introduce a loss at the face recognition level to reduce the difference between the 3D face model and the 2D image. The loss computes the feature difference between the input image \(\mathbf {I}_{\mathbf {out}}^{{}}\) and the rendered image \(\mathbf {I}_{\mathbf {y}}^{{}}\). We define the loss as a cosine distance:

$$\begin{aligned} {{\mathcal {L}}_{2}}=1-\frac{{<}G(\mathbf {I}_{\mathbf {out}}^{{}}),G(\mathbf {I}_{\mathbf {y}}^{{}}){>}}{\left\| G(\mathbf {I}_{\mathbf {out}}^{{}}) \right\| \cdot \left\| G(\mathbf {I}_{\mathbf {y}}^{{}}) \right\| } \end{aligned}$$
(9)

where \({G}(\cdot )\) denotes the feature extraction function of FaceNet [31] and \({<}\cdot ,\cdot {>}\) denotes the inner product.
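Eq. (9) is a standard cosine distance over embeddings; a sketch with an arbitrary embedding callable standing in for \({G}(\cdot )\) is:

```python
# Face feature loss of Eq. (9): one minus cosine similarity of face embeddings.
import torch.nn.functional as F

def face_feature_loss(embed_fn, i_out, i_y):
    f_out = embed_fn(i_out)                         # G(I_out)
    f_y = embed_fn(i_y)                             # G(I_y)
    cos = F.cosine_similarity(f_out, f_y, dim=1)    # <a, b> / (||a|| * ||b||)
    return (1.0 - cos).mean()
```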

In summary, we used the loss function \({{\mathcal {L}}_{3D}}\) to reconstruct the basic shape of the face. We set \({{\mathcal {L}}_{3D}}={{\lambda }_{1}}{{\mathcal {L}}_{1}}+{{\lambda }_{2}}{{\mathcal {L}}_{2}}\), with \({{\lambda }_{1}}=1.4\) and \({{\lambda }_{2}}=0.25\) in all our experiments. We then used a coarse-to-fine graph convolutional network based on the framework of Lin et al. [22] to produce the fine texture \({{T}_{fina}}\).

4 Experimental Details and Results

4.1 Implementation Details

For the module \({{\mathcal {N}}_{cont}}\), we used the ground truth of the CelebA-HQ dataset [18] as the reference. The generator \({{\mathcal {N}}_{ge}}\) consists of three gradually down-sampled encoding blocks, followed by seven residual blocks with dilated convolutions and a long-short term attention block. The decoder then gradually up-samples the feature maps back to the input size.
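A structural sketch of this generator follows; the channel widths, normalization, and activation choices are assumptions, and the long-short term attention block is abstracted as an identity placeholder rather than reproduced.

```python
# Structural sketch of N_ge: three down-sampling encoder blocks, seven dilated
# residual blocks, an (abstracted) attention block, and an up-sampling decoder.
import torch
import torch.nn as nn

class DilatedResBlock(nn.Module):
    def __init__(self, ch, dilation=2):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(ch, ch, 3, padding=dilation, dilation=dilation),
            nn.InstanceNorm2d(ch), nn.ReLU(inplace=True),
            nn.Conv2d(ch, ch, 3, padding=1), nn.InstanceNorm2d(ch))

    def forward(self, x):
        return x + self.body(x)

class GeneratorSketch(nn.Module):
    def __init__(self, in_ch=4, base=64):
        super().__init__()
        self.encoder = nn.Sequential(                       # three stride-2 encoding blocks
            nn.Conv2d(in_ch, base, 7, padding=3), nn.ReLU(inplace=True),
            nn.Conv2d(base, base * 2, 4, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(base * 2, base * 4, 4, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(base * 4, base * 4, 4, stride=2, padding=1), nn.ReLU(inplace=True))
        self.middle = nn.Sequential(*[DilatedResBlock(base * 4) for _ in range(7)])
        self.attention = nn.Identity()                      # placeholder for the attention block
        self.decoder = nn.Sequential(                       # up-sample back to the input size
            nn.Upsample(scale_factor=2), nn.Conv2d(base * 4, base * 2, 3, padding=1), nn.ReLU(inplace=True),
            nn.Upsample(scale_factor=2), nn.Conv2d(base * 2, base, 3, padding=1), nn.ReLU(inplace=True),
            nn.Upsample(scale_factor=2), nn.Conv2d(base, 3, 3, padding=1), nn.Tanh())

    def forward(self, x):
        return self.decoder(self.attention(self.middle(self.encoder(x))))
```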

4.2 Qualitative Comparisons with Recent Arts

Fig. 4. Comparison of qualitative results. Methods from left to right: 3DDFA, DF2Net, Chen et al., PRNet, and ours.

Figure 4 shows our experimental results compared with other methods [5, 10, 13, 39]. Our 3D reconstruction method handles eyeglasses-occluded scenes, including transparent glasses and sunglasses, noticeably better than the other frameworks, which cannot handle eyeglasses well because they focus more on generating high-definition textures.

4.3 Quantitative Comparison

Fig. 5. Comparison of error heat maps on the MICC Florence dataset. Digits denote the 90% error (mm).

Comparison Results on the MICC Florence Dataset. The MICC Florence dataset [1] is a 3D face dataset that contains 53 subjects with their ground-truth models. We artificially added eyeglasses to the input images and calculated the average 90% largest error between the generated model and the ground-truth model. Figure 5 shows that our method can effectively handle eyeglasses.

Fig. 6. Reconstructions with eyeglasses. Left: qualitative results of Sela et al. [17] and our shape. Right: LFW verification ROC for the shapes, with and without eyeglasses.

Eyeglasses Invariance of the Foundation Shape. Our choice of ResNet-50 to regress the shape coefficients is motivated by its robustness to extreme viewing conditions reported by Deng et al. [29]. To further support the application of our method to occluded face images, we test our system on the Labeled Faces in the Wild (LFW) dataset [24]. We use the same face test system as Anh et al. [34] and refer to that paper for more details.

Figure 6 (left) shows the sensitivity of the method of Sela et al. [32]: their result clearly shows the outline of a finger. Their failure may be due to their stronger focus on local details, which only weakly regularizes the global shape. In contrast, our method recognizes and regenerates the occluded area and robustly produces a natural face shape under eyeglasses occlusion. Although 3DMM also limits the details of the shape, we use it only as a foundation and add refined texture separately.

Table 1. Quantitative evaluations on LFW.

We further quantitatively verify the robustness of our method to eyeglasses. Table 1 (top) reports verification results on the LFW benchmark with and without eyeglasses (see also the ROC curves in Fig. 6, right). Although eyeglasses clearly impact recognition, the drop of the curve is limited, demonstrating the robustness of our method.

5 Conclusions

We propose a novel method to reconstruct a 3D face model from an eyeglasses-occluded RGB face photo. Given the input image and a pre-trained ResNet, we fit the face to a template model (3DMM). To robustly reconstruct an eyeglasses-free RGB face, we design a deep learning network that regenerates plausible texture for the occluded region. Comprehensive experiments show that our method outperforms previous arts by a large margin in terms of both accuracy and robustness.