1 Introduction

3D face reconstruction from a monocular face image is a longstanding problem in computer graphics and computer vision, with numerous applications such as face recognition [1], interaction in augmented/virtual environments [2], media manipulation, and animation [3]. To recover 3D face shape and texture from monocular images, a statistical 3D Morphable Model (3DMM) [4], built from hundreds of 3D face scans, is most commonly utilized. The 3DMM defines a search space spanning the range of 3D human faces, where each point encodes 3D face geometry and texture. Along with these points, face illumination and pose coefficients are required to generate the desired 3D faces. The reconstructed 3D faces imitate the face shape and color of the corresponding face images; thus, the input face images must be processed, i.e., cropped and aligned. This processing introduces dependencies, such as pre-trained landmark detectors, and requires significant time, which is a major issue, particularly during testing.

Numerous deep learning-based monocular 3D face reconstruction methods [5,6,7] have been proposed, but dependency reduction and test-speed improvement, which are crucial for real-time applications, are beyond the scope of these approaches. Tewari et al. [5, 8] produce 3D faces consistent with the processed inputs. Deng et al. [6] exploit dlib [9] and/or MTCNN [10] to process the input face images for reconstructing accurate 3D faces. Tiwari et al. [7, 11] require processed face images at the input for generating occlusion-robust 3D faces. Feng et al. [12] reconstruct detailed 3D faces from monocular processed face images. Although these methods improve 3D face reconstruction accuracy, they require processed inputs at test time and hence facial landmark information. Moreover, the processing slows down testing, increasing the time these methods require. Therefore, a novel training pipeline is needed to overcome these issues and obtain accurate 3D faces from unprocessed (uncropped and unaligned) monocular images.

In this work, we aim to estimate 3D faces from unprocessed monocular face images in order to reduce test-time dependencies and improve testing speed. Furthermore, an unsupervised training scheme is needed to overcome the requirement of difficult-to-procure ground-truth 3D face scans. To achieve these objectives, we propose a REduced Dependency Fast UnsuperviSEd 3D Face Reconstruction (RED-FUSE) framework, which estimates statistical 3D face coefficients for unprocessed face images in an unsupervised manner, as shown in Fig. 1. More specifically, RED-FUSE exploits (1) a multi-pipeline architecture that ensures reliable reconstruction of 3D faces from challenging unprocessed inputs and (2) a pose transfer module that eliminates the landmark requirement when training the network on the various variants of unprocessed inputs. Owing to the inclusion of challenging (affine-transformed) image variants as inputs to the training pipeline and the landmark-free training for unprocessed variants, RED-FUSE produces accurate 3D faces from real-world in-the-wild face images. The proposed RED-FUSE is qualitatively and quantitatively evaluated on numerous open-source unprocessed images, the CelebA-test dataset [13], the LFW-test set [14], and the NoW selfie-based validation dataset [15]. Our method demonstrates superior performance over several methods. For example, we obtain improvements of \(\mathbf {46.2\%}\), \(\mathbf {15.1\%}\), \(\mathbf {29.6}\%\), and \(\mathbf {27.4}\%\) for the 3D shape-based error, color-based error, NoW selfie challenge, and visual similarity (perceptual) error, respectively, compared to a recent approach. Moreover, our test time improves from \(\mathbf {7.30}\) ms to \(\mathbf {1.85}\) ms per face compared to various 3D face reconstruction methods.

Fig. 1 An overview of our REduced Dependency Fast UnsuperviSEd 3D Face Reconstruction (RED-FUSE) framework. The proposed method addresses the problem of unprocessed monocular 3D face reconstruction in an unsupervised manner by exploiting a novel pose transfer module and speeds up the testing process, without requiring 3D ground-truth face scans

A summary of our multi-fold contributions is as follows.

  1. We propose REduced Dependency Fast UnsuperviSEd 3D Face Reconstruction (RED-FUSE) to perform 3D face reconstruction from unprocessed face images without posing additional requirements and dependencies.

  2. We propose a pose transfer module, which integrates with our training framework to facilitate landmark-free training on unprocessed variants, thus eliminating the landmark requirement at test time.

  3. We leverage a multi-pipeline training scheme to learn the statistical representation of 3D faces for unprocessed variants of face images in an unsupervised manner, overcoming the need for difficult-to-procure ground-truth 3D faces.

  4. Our method demonstrates improvements on several 2D and 3D evaluation metrics. For example, the proposed approach improves 3D shape accuracy by over \(\mathbf {46\%}\) and reduces the 2D visual (perceptual) error by over \(\mathbf {27\%}\), demonstrating its effectiveness.

  5. Our method does not require input processing during testing, thus eliminating the test-time landmark dependency while producing reliable 3D faces.

  6. The proposed approach provides \(\mathbf {75\%}\) faster inference than recent state-of-the-art monocular 3D face reconstruction methods and shows real-time performance.

2 Related work

The literature on 3D face reconstruction methods [17,18,19,20,21] is vast. Therefore, we focus on morphable model-based [4, 22,23,24] monocular 3D face reconstruction approaches and unsupervised training strategies.

3D Face Reconstruction Methods: 3D face shape retrieval from an unconstrained monocular face image is a mathematically ill-posed problem, and a geometric prior is required to address it. The 3DMM [4], which serves as a strong prior for reconstructing 3D faces, has gained immense popularity in recent years. Tewari et al. [5, 25] exploit the 3DMM and a cycle-consistent approach to reconstruct 3D faces from face images. Sela et al. [26] provide high-quality reconstructions by utilizing depth images and facial correspondence maps. Feng et al. [22] disentangle shape features such that the tasks of reconstructing 3D face shapes and learning discriminative shape features for face recognition are accomplished simultaneously. Tran et al. [27] produce accurate 3D faces from non-frontal, obstructed face images. Genova et al. [28] use synthetic images with corresponding ground-truth data, where label-free instances of real images are exploited to reconstruct 3D faces. Deng et al. [6] attain deep-feature consistency to improve the reconstructed 3D face shape accuracy. Gecer et al. [29] produce high-fidelity 3D face texture and shape by estimating the facial texture in UV space. Tu et al. [30] use sparse 2D facial landmark heatmaps to produce high-quality 3D faces. Feng et al. [12] generate a UV displacement map containing person-specific details to reconstruct detailed 3D faces from monocular images. Zeng et al. [31] integrate a fitting-based approach with the shape-from-shading method [32] to reconstruct detailed 3D face geometry. Tiwari et al. [11] distill knowledge for tackling occlusions to reconstruct accurate 3D faces. Tiwari et al. [7] deploy a self-supervision strategy to generate occlusion-robust 3D faces. However, all these approaches require the processing of face images, which imposes a dependency on prior landmark information and degrades the testing speed of the model. In contrast, our method reconstructs reliable 3D faces from unprocessed face data without posing additional dependencies, thus offering reduced dependency and faster testing.

Unsupervised Learning: Recently, there has been a surge of interest in unsupervised training schemes for monocular 3D face reconstruction from processed inputs, as they can learn statistical 3D face coefficients without human intervention. The key is to design a 3D face reconstruction task that relates the projected 3D faces with the corresponding processed face images such that the 3D face coefficients can be self-annotated. Most recent developments for 3D face reconstruction tasks [8, 28, 29, 33] utilize the unsupervised approach mentioned above. Tewari et al. [8] establish consistency between the processed input and the rendered face to overcome the requirement of external supervision. Genova et al. [28] exploit labeled synthetic data, whereas label-free instances of processed real inputs are used to perform unsupervised 3D face learning. Gecer et al. [29] estimate the relationship between facial identity features and the 3DMM shape and texture parameters for processed data in an unsupervised manner. In contrast, our proposed task exploits unprocessed images as input to learn an accurate 3D face representation without external supervision.

3 Technical details 

In this section, we present the preliminaries of 3D face reconstruction (Sect. 3.1). Moreover, we provide the details of the proposed REduced Dependency Fast UnsuperviSEd 3D Face Reconstruction (RED-FUSE) framework, which reconstructs 3D faces from unprocessed face images without requiring external supervision (Sect. 3.2).

3.1 Preliminaries

We present the preliminaries for reconstructing 3D faces from monocular face images. More specifically, we provide the details on the 3D Morphable Model (Sect. 3.1.1), which serves as a prior for facilitating fitting-based monocular 3D face reconstruction. Moreover, we present the illumination assumption (Sect. 3.1.2), and face projection (Sect. 3.1.3).

3.1.1 3D Morphable Model (3DMM)

A 3DMM reconstructs the desired 3D face by exploiting the linear combination of shape (\(\varvec{\alpha }\in {\mathbb {R}}^{80}\)), expression (\(\varvec{\beta }\in {\mathbb {R}}^{64}\)), and texture (\(\varvec{\gamma }\in {\mathbb {R}}^{80}\)) coefficients with their respective basis vectors \(\varvec{{\mathcal {B}}_{\alpha }}\in {\mathbb {R}}^{80\times 3N}\), \(\varvec{{\mathcal {B}}_{\beta }}\in {\mathbb {R}}^{64\times 3N}\), and \(\varvec{{\mathcal {B}}_{\gamma }}\in {\mathbb {R}}^{80\times 3N}\), respectively, as follows.

$$\begin{aligned} \varvec{{\mathcal {S}}} = \overline{\varvec{{\mathcal {S}}}} + \varvec{\alpha }\varvec{{\mathcal {B}}_{\alpha }}+ \varvec{\beta }\varvec{{\mathcal {B}}_{\beta }},\quad\varvec{{\mathcal {T}}} = \overline{\varvec{{\mathcal {T}}}} + \varvec{\gamma }\varvec{{\mathcal {B}}_{\gamma }}, \end{aligned}$$
(1)

where \(\overline{\varvec{{\mathcal {S}}}}\in {\mathbb {R}}^{3N}\) and \(\overline{\varvec{{\mathcal {T}}}}\in {\mathbb {R}}^{3N}\) are the mean 3D face shape and texture, respectively. It is worth noting that \(\overline{\varvec{{\mathcal {S}}}}\), \(\overline{\varvec{{\mathcal {T}}}}\), \(\varvec{{\mathcal {B}}_{\alpha }}\), and \(\varvec{{\mathcal {B}}_{\gamma }}\) are obtained from the Basel Face Model (BFM) [34]. Since BFM provides 3D faces with neutral expressions, the expression basis \(\varvec{{\mathcal {B}}_{\beta }}\) is taken from the FaceWarehouse model [23]. Our network estimates the face coefficients \(\varvec{\alpha }\), \(\varvec{\beta }\), and \(\varvec{\gamma }\). Moreover, we exclude the ear and neck regions of the 3D faces following [6], leading to \(N=36\)K face vertices.
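For concreteness, a minimal NumPy sketch of Eq. (1) is given below. The basis arrays are zero-filled placeholders standing in for the BFM/FaceWarehouse data, and the function and variable names are illustrative rather than part of any released implementation.

```python
import numpy as np

# Assumed pre-loaded 3DMM data (BFM / FaceWarehouse); shapes follow Sect. 3.1.1.
N = 36000                                  # number of face vertices (~36K)
S_mean = np.zeros(3 * N)                   # mean shape,   (3N,)
T_mean = np.zeros(3 * N)                   # mean texture, (3N,)
B_alpha = np.zeros((80, 3 * N))            # shape basis
B_beta  = np.zeros((64, 3 * N))            # expression basis
B_gamma = np.zeros((80, 3 * N))            # texture basis

def reconstruct_face(alpha, beta, gamma):
    """Eq. (1): linear combination of the bases with the predicted coefficients."""
    S = S_mean + alpha @ B_alpha + beta @ B_beta   # 3D shape, flattened (3N,)
    T = T_mean + gamma @ B_gamma                   # per-vertex albedo,  (3N,)
    return S.reshape(N, 3), T.reshape(N, 3)

# Example: coefficients as regressed by the network for one image.
alpha, beta, gamma = np.zeros(80), np.zeros(64), np.zeros(80)
S, T = reconstruct_face(alpha, beta, gamma)        # mean face when all coefficients are zero
```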

3.1.2 Illumination assumption

We illuminate the reconstructed 3D faces (from Eq. (1)) using Spherical Harmonics (SH) under the assumption of Lambertian surface reflectance, following [6]. In particular, we exploit the SH basis functions \(\varvec{\phi }_x: {\mathbb {R}}^3\rightarrow {\mathbb {R}}\), the i-th vertex normal \({\varvec{n}}_i\in {\mathbb {R}}^3\), the illumination coefficients \(\varvec{\delta }_x\in {\mathbb {R}}^3\), and the texture \(\varvec{{\mathcal {T}}}_i\in {\mathbb {R}}^3\) corresponding to the i-th vertex \({\varvec{v}}_i\in {\mathbb {R}}^3\) to illuminate the 3D faces, as follows.

$$\begin{aligned} \varvec{\Gamma }({\varvec{v}}_i,{\varvec{n}}_i\mid \varvec{\delta }) = \varvec{{\mathcal {T}}}_i\cdot \sum _{x=1}^{9} \varvec{\delta }_x\varvec{\phi }_x({\varvec{n}}_i). \end{aligned}$$
(2)

In Eq. (2), \(\varvec{\Gamma }\) represents the illumination function for the reconstructed 3D faces.
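The sketch below evaluates Eq. (2) for all vertices at once, assuming the nine band-0 to band-2 real SH basis functions and per-channel illumination coefficients \(\varvec{\delta }\in {\mathbb {R}}^{9\times 3}\); any Lambertian convolution constants are assumed to be folded into \(\varvec{\delta }\), and the random inputs only make the snippet self-contained.

```python
import numpy as np

# Standard real spherical-harmonics constants for bands 0-2.
SH_C = np.array([0.282095,
                 0.488603, 0.488603, 0.488603,
                 1.092548, 1.092548, 0.315392, 1.092548, 0.546274])

def sh_basis(normals):
    """Evaluate the 9 SH basis functions phi_x on unit normals: (N, 3) -> (N, 9)."""
    x, y, z = normals[:, 0], normals[:, 1], normals[:, 2]
    ones = np.ones_like(x)
    phi = np.stack([ones, y, z, x,
                    x * y, y * z, 3.0 * z ** 2 - 1.0, x * z, x ** 2 - y ** 2], axis=1)
    return phi * SH_C

def illuminate(albedo, normals, delta):
    """Eq. (2): per-vertex color = albedo * sum_x delta_x * phi_x(n).
    albedo: (N, 3) texture, normals: (N, 3), delta: (9, 3) SH coefficients per RGB channel."""
    shading = sh_basis(normals) @ delta        # (N, 3) irradiance per channel
    return albedo * shading                    # element-wise modulation

# Example with random stand-ins; in RED-FUSE these come from Eq. (1) and the network output.
rng = np.random.default_rng(0)
normals = rng.normal(size=(36000, 3))
normals /= np.linalg.norm(normals, axis=1, keepdims=True)
colors = illuminate(np.full((36000, 3), 0.5), normals, rng.normal(size=(9, 3)))
```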

3.1.3 3D face projection

To project the 3D faces onto the screen space, we map each 3D face vertex (carrying shape \(\varvec{{\mathcal {S}}}_i\), texture \(\varvec{{\mathcal {T}}}_i\), illumination \(\varvec{\Gamma }\), and pose \({\varvec{p}}\) information, with \(i\in \{1,2,\dots ,N\}\)) to the image plane by assuming a pinhole camera under full perspective projection, as follows.

$$\begin{aligned} \varvec{I'} = \Upsilon (\varvec{{\mathcal {S}}}_i, \varvec{{\mathcal {T}}}_i,\varvec{\Gamma },{\varvec{p}}), \end{aligned}$$
(3)

where \({\varvec{p}}\) contains \({\varvec{R}}\in SO({3})\) and \({\varvec{t}}\in {\mathbb {R}}^{3}\). It is worth noting that \(\varvec{{\mathcal {S}}}=[\varvec{{\mathcal {S}}}_1, \varvec{{\mathcal {S}}}_2,\dots , \varvec{{\mathcal {S}}}_{N}]\) and \(\varvec{{\mathcal {T}}}=[\varvec{{\mathcal {T}}}_1, \varvec{{\mathcal {T}}}_2,\dots , \varvec{{\mathcal {T}}}_{N}]\), where \(N=36\)K and \(\varvec{{\mathcal {S}}}_i,\varvec{{\mathcal {T}}}_i\in {\mathbb {R}}^{3}\). Moreover, \(\Upsilon\) is the projection function, whereas \(\varvec{I'}\) denotes the projected 3D face.
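A minimal sketch of the geometric part of Eq. (3) is shown below; the focal length and principal point are assumed values for a \(224\times 224\) image plane, and rasterization with the illumination of Eq. (2) is omitted.

```python
import numpy as np

def project(S, R, t, focal=1015.0, center=112.0):
    """Eq. (3), geometry only: rigidly pose the vertices and apply a pinhole camera.
    S: (N, 3) vertices, R: (3, 3) rotation, t: (3,) translation.
    focal/center are assumed intrinsics for a 224x224 image plane."""
    V = S @ R.T + t                       # camera-space vertices
    x = focal * V[:, 0] / V[:, 2] + center
    y = focal * V[:, 1] / V[:, 2] + center
    return np.stack([x, y], axis=1)       # (N, 2) screen-space coordinates

# Example: project a random mesh placed 10 units in front of the camera. In the full
# pipeline the posed, illuminated vertices are rasterized by a differentiable renderer.
rng = np.random.default_rng(0)
uv = project(rng.normal(scale=0.1, size=(36000, 3)), R=np.eye(3), t=np.array([0.0, 0.0, 10.0]))
```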

Fig. 2 An overview of our REduced Dependency Fast UnsuperviSEd 3D Face Reconstruction (RED-FUSE) framework. The proposed method addresses the problem of unprocessed monocular 3D face reconstruction by exploiting a novel pose transfer module in an unsupervised manner and speeds up the testing process, without requiring 3D ground-truth face scans

3.2 Reduced dependency fast unsupervised 3D face reconstruction

Despite recent advancements in monocular 3D face reconstruction, there is still large scope for improvement concerning test-time dependencies. Moreover, the issue of test-speed improvement, which is crucial for real-time applications, remains under-addressed. One way to address the problem is to reconstruct 3D faces from unprocessed data, which eliminates the facial landmark requirement at test time and improves the estimation speed; thus, we aim to reconstruct accurate 3D faces from unprocessed single-view face images without posing additional dependencies. To achieve this objective, we propose the REduced Dependency Fast UnsuperviSEd 3D Face Reconstruction (RED-FUSE) framework, which exploits unprocessed face images and their variants to estimate the corresponding 3D face coefficients. More specifically, the proposed network takes the unprocessed (Original) image \(\varvec{I_O}\) and its three variants, i.e., Rotated \(\varvec{I_R}\), Skewed \(\varvec{I_S}\), and Translated \(\varvec{I_T}\), as inputs to a multi-pipeline framework. It estimates the corresponding 3D face coefficients \(\varvec{{C}_{R}}\), \(\varvec{{C}_{S}}\), \(\varvec{{C}_{T}}\), \(\varvec{{C}_{O}}\), generates the corresponding 3D face meshes \(\varvec{M_{R}}\), \(\varvec{M_{S}}\), \(\varvec{M_{T}}\), \(\varvec{M_{O}}\), transfers the pose of \(\varvec{M_{O}}\) to the remaining 3D face meshes, and projects them onto the processed face image \(\varvec{I_P}\) (obtained by processing \(\varvec{I_O}\)) to obtain the 2D images \(\varvec{I_{R'}}\), \(\varvec{I_{S'}}\), \(\varvec{I_{T'}}\), \(\varvec{I_{O'}}\), all of which should resemble the processed image \(\varvec{I_P}\), as shown in Fig. 2. Furthermore, \(\varvec{{C}_{R}}\), \(\varvec{{C}_{S}}\), \(\varvec{{C}_{T}}\), and \(\varvec{{C}_{O}}\) are learned by enforcing consistency between the processed image \(\varvec{I_P}\) and the projected 3D faces \(\varvec{I_{R'}}\), \(\varvec{I_{S'}}\), \(\varvec{I_{T'}}\), \(\varvec{I_{O'}}\). It should be noted that each 3DMM coefficient vector \({\varvec{C}}_i\) contains shape \(\varvec{\alpha }_i\in {\mathbb {R}}^{80}\), expression \(\varvec{\beta }_i\in {\mathbb {R}}^{64}\), texture \(\varvec{\gamma }_i\in {\mathbb {R}}^{80}\), illumination \(\varvec{\delta }_i\in {\mathbb {R}}^{27}\), and rotation and translation vectors (together known as pose coefficients) \({\varvec{R}}_i\in {\mathbb {R}}^{3}\) and \({\varvec{t}}_i\in {\mathbb {R}}^{3}\), such that \(i\in \{{\varvec{R}},{\varvec{S}},{\varvec{T}},{\varvec{O}}\}\), for generating 3D faces. All the components required to train the proposed framework are described below.
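The sketch below illustrates the coefficient-regression step shared by all four pipelines, assuming the ResNet-50 backbone with a 257-node output layer described in Sect. 4.3; the class and helper names are illustrative, not those of the actual implementation.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet50   # torchvision >= 0.13

class CoefficientRegressor(nn.Module):
    """Shared backbone: 224x224 input -> 257-D 3DMM coefficient vector (Sect. 4.3).
    ImageNet initialization would be used in practice; omitted here to keep the sketch offline."""
    def __init__(self):
        super().__init__()
        self.backbone = resnet50(weights=None)
        self.backbone.fc = nn.Linear(self.backbone.fc.in_features, 257)

    def forward(self, x):
        return self.backbone(x)

def split_coefficients(c):
    """Split a (B, 257) vector into alpha, beta, gamma, delta, rotation, translation."""
    sizes = [80, 64, 80, 27, 3, 3]
    alpha, beta, gamma, delta, rot, trans = torch.split(c, sizes, dim=1)
    return {"alpha": alpha, "beta": beta, "gamma": gamma,
            "delta": delta, "R": rot, "t": trans}

net = CoefficientRegressor().eval()
# The same network processes the original image and its rotated/skewed/translated variants.
I_O, I_R, I_S, I_T = (torch.rand(1, 3, 224, 224) for _ in range(4))
C = {k: split_coefficients(net(img)) for k, img in
     zip(("O", "R", "S", "T"), (I_O, I_R, I_S, I_T))}
```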

Fig. 3 A demonstration of the proposed pose transfer module. It is worth noting that apart from rotation and translation coefficients, we do not transfer other 3D face coefficients

Pose Transfer Module: Conventional approaches ensure cycle consistency of the estimated 3D faces with their processed counterparts. The processing of face images requires facial landmark information; however, deriving facial landmarks becomes tedious and infeasible for challenging unprocessed variants. Therefore, these methods fail to solve the problem of unprocessed monocular 3D face reconstruction. To overcome these issues, we exploit a novel pose transfer scheme. For this purpose, we transfer the pose coefficients of the 3D face (\(\varvec{M_{O}}\)) obtained from the unprocessed image to the 3D faces (\(\varvec{M_{R}}\), \(\varvec{M_{S}}\), \(\varvec{M_{T}}\)) generated from the variants of the unprocessed input (as shown in Fig. 3). This assists RED-FUSE in attaining consistency of all projected 3D faces (\(\varvec{I_{R'}}\), \(\varvec{I_{S'}}\), \(\varvec{I_{T'}}\), \(\varvec{I_{O'}}\)) with a single processed facial image (\(\varvec{I_P}\)) (using Eq. (3)), without requiring landmark information for the unprocessed variants, as follows.

$$\begin{aligned} \varvec{I_{j'}} = \Upsilon (\varvec{{\mathcal {S}}}_{i^j},\varvec{{\mathcal {T}}}_{i^j},\varvec{\Gamma }_{j}, {\varvec{p}}_{j=O}), \end{aligned}$$
(4)

where \(\varvec{{\mathcal {S}}}_{i^j}\) and \(\varvec{{\mathcal {T}}}_{i^j}\) are the i-th elements of the shape vector \(\varvec{{\mathcal {S}}}\) and the texture vector \(\varvec{{\mathcal {T}}}\) (from Eq. (1)), respectively, such that \({j}\in \{{R}, S, {T},{O}\}\). \(\varvec{\Gamma }_j\) (from Eq. (2)) is the illumination vector, whereas \({\varvec{p}}_{j=O}\) represents the pose coefficients of the estimated 3D face \(\varvec{M_O}\). Also, \(\Upsilon\) is the 3D face projection function, which produces the projected 3D face \(\varvec{I_{j'}}\) on the processed image. It is worth noting that this module allows the network to waive the facial landmark requirement during testing, thus reducing test-time dependencies and improving estimation speed.
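In code, the pose transfer amounts to overwriting the rotation and translation predicted for each variant with those predicted for the original image before rendering; a minimal sketch, reusing the coefficient dictionary layout of the previous snippet, is given below.

```python
import torch

def transfer_pose(coeffs):
    """Replace the pose of the rotated/skewed/translated variants with the pose
    predicted for the original image (Eq. (4)); all other coefficients are kept."""
    posed = {}
    for key, c in coeffs.items():
        c = dict(c)                                   # shallow copy; shape/expr/texture/illum unchanged
        c["R"], c["t"] = coeffs["O"]["R"], coeffs["O"]["t"]
        posed[key] = c
    return posed

# Self-contained example with dummy coefficient dictionaries.
dummy = lambda: {"alpha": torch.zeros(1, 80), "beta": torch.zeros(1, 64),
                 "gamma": torch.zeros(1, 80), "delta": torch.zeros(1, 27),
                 "R": torch.rand(1, 3), "t": torch.rand(1, 3)}
coeffs = {k: dummy() for k in ("O", "R", "S", "T")}
posed = transfer_pose(coeffs)
assert torch.equal(posed["S"]["R"], coeffs["O"]["R"])  # every variant now carries p_O
```

After the transfer, every variant is rendered with \({\varvec{p}}_{O}\), so all four projections can be compared against the single processed image \(\varvec{I_P}\).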

Obtaining 3D Face Alignment: To obtain the accurate pose of the estimated 3D faces, we align the projected 3D faces \(\varvec{I_{R'}}\), \(\varvec{I_{S'}}\), \(\varvec{I_{T'}}\), \(\varvec{I_{O'}}\) with the corresponding processed face image \(\varvec{I_P}\). To this end, we enforce consistency between the 68 facial landmark coordinates using the Landmark Loss \({\mathcal {L}}_N\), as follows.

$$\begin{aligned} \begin{aligned} {\mathcal {L}}_N&= \mid \mid \varvec{L_P} - {\varvec{L_{R'}}} \mid \mid +\mid \mid \varvec{L_P} - {\varvec{{L}_{S'}}} \mid \mid \\&\quad +\mid \mid \varvec{{L}_P} - {\varvec{{L}_{T'}}} \mid \mid +\mid \mid \varvec{{L}_P} - {\varvec{{L}_{O'}}} \mid \mid . \end{aligned} \end{aligned}$$
(5)

In Eq. (5), \(\varvec{{L}_P}\) is the set of landmark coordinates obtained for \(\varvec{I_P}\), whereas \({\varvec{{L}_{R'}}}\), \({\varvec{{L}_{S'}}}\), \({\varvec{{L}_{T'}}}\), \({\varvec{{L}_{O'}}}\) are the facial landmark coordinates of \(\varvec{I_{R'}}\), \(\varvec{I_{S'}}\), \(\varvec{I_{T'}}\), \(\varvec{I_{O'}}\), respectively. Also, \(\mid \mid \cdot \mid \mid\) denotes the L2 norm.
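A direct reading of Eq. (5) is only a few lines of PyTorch; the random landmark tensors below are stand-ins for the detected and projected 68-point sets.

```python
import torch

def landmark_loss(L_P, L_proj):
    """Eq. (5): L2 distance between the 68 landmarks of the processed image and
    the landmarks of each projected 3D face (R', S', T', O')."""
    return sum(torch.norm(L_P - L_j) for L_j in L_proj.values())

L_P = torch.rand(68, 2)
L_proj = {k: torch.rand(68, 2) for k in ("R'", "S'", "T'", "O'")}
loss_N = landmark_loss(L_P, L_proj)
```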

Obtaining Photometric Consistency: To learn the 3D face color, we regress the pixels of the projected 3D faces \(\varvec{I_{R'}}\), \(\varvec{I_{S'}}\), \(\varvec{I_{T'}}\), \(\varvec{I_{O'}}\) onto the corresponding processed face image \(\varvec{I_P}\), thus attaining pixel consistency using the Photometric Loss \({\mathcal {L}}_P\), as follows.

$$\begin{aligned} \begin{aligned} {\mathcal {L}}_{P} =&\frac{\varvec{{\mathcal {A}}}\cdot \mid \mid \varvec{I_{P}} - \varvec{I_{R'}}\mid \mid }{\mid \mid \varvec{{\mathcal {A}}}\mid \mid }+\frac{\varvec{{\mathcal {A}}}\cdot \mid \mid \varvec{I_{P}} - \varvec{I_{S'}}\mid \mid }{\mid \mid \varvec{{\mathcal {A}}}\mid \mid }\\ {}&+\frac{\varvec{{\mathcal {A}}}\cdot \mid \mid \varvec{I_{P}} - \varvec{I_{T'}}\mid \mid }{\mid \mid \varvec{{\mathcal {A}}}\mid \mid }+\frac{\varvec{{\mathcal {A}}}\cdot \mid \mid \varvec{ I_{P}} - \varvec{I_{O'}}\mid \mid }{\mid \mid \varvec{{\mathcal {A}}}\mid \mid }, \end{aligned} \end{aligned}$$
(6)

where \(\varvec{{\mathcal {A}}}\) represents the skin attention mask [6] obtained for \(\varvec{I_P}\), and \(\varvec{\cdot }\) denotes element-wise multiplication.
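The sketch below follows Eq. (6) as written, with the skin-attention mask weighting the per-pixel RGB error; the mask is assumed to be precomputed, and the random images only keep the snippet self-contained.

```python
import torch

def photometric_loss(I_P, I_proj, A):
    """Eq. (6): skin-mask-weighted pixel error between the processed image and
    each projected 3D face. I_P: (3, H, W), A: (1, H, W) skin-attention mask.
    The ||A|| normalization follows the equation as written."""
    denom = torch.norm(A)
    return sum(torch.sum(A * torch.norm(I_P - I_j, dim=0, keepdim=True)) / denom
               for I_j in I_proj.values())

I_P = torch.rand(3, 224, 224)
A = (torch.rand(1, 224, 224) > 0.5).float()            # assumed precomputed skin mask
I_proj = {k: torch.rand(3, 224, 224) for k in ("R'", "S'", "T'", "O'")}
loss_P = photometric_loss(I_P, I_proj, A)
```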

Obtaining Deep Feature Similarity: To ensure the visual similarity between the processed image \(\varvec{I_P}\) and the projected 3D faces \(\varvec{I_{R'}}\), \(\varvec{I_{S'}}\), \(\varvec{I_{T'}}\), \(\varvec{I_{O'}}\), we use Deep Feature Loss \({\mathcal {L}}_D\), as follows.

$$\begin{aligned} \begin{aligned} {\mathcal {L}}_D&=4-\Bigg (\frac{<\varvec{\zeta _P}, {\varvec{\zeta _{R'}}>}}{\mid \mid \varvec{\zeta _P}\mid \mid \mid \mid {\varvec{\zeta _{R'}}}\mid \mid }+\frac{<\varvec{\zeta _P}, {\varvec{\zeta _{S'}}>}}{\mid \mid \varvec{\zeta _P}\mid \mid \mid \mid {\varvec{\zeta _{S'}}}\mid \mid }\\&\quad +\frac{<\varvec{\zeta _P}, {\varvec{\zeta _{T'}}>}}{\mid \mid \varvec{\zeta _P}\mid \mid \mid \mid {\varvec{\varvec{\zeta _{T'}}}}\mid \mid }+\frac{<\varvec{\zeta _P}, {\varvec{\zeta _{O'}}>}}{\mid \mid \varvec{\zeta _P}\mid \mid \mid \mid {\varvec{\zeta _{O'}}}\mid \mid }\Bigg ), \end{aligned} \end{aligned}$$
(7)

where \(\varvec{\zeta _P}\) is the deep-feature vector for \(\varvec{I_P}\), whereas \({\varvec{\zeta _{R'}}}\), \({\varvec{\zeta _{S'}}}\), \({\varvec{\zeta _{T'}}}\), \({\varvec{\zeta _{O'}}}\) represent the deep-feature vectors of \(\varvec{I_{R'}}\), \(\varvec{I_{S'}}\), \(\varvec{I_{T'}}\), \(\varvec{I_{O'}}\), respectively. It should be noted that the deep features are obtained using the pre-trained face recognition model FaceNet [35].
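Eq. (7) amounts to summing one minus cosine similarity over the four pipelines; in the sketch below, random 512-D vectors stand in for the frozen face-recognition embeddings.

```python
import torch
import torch.nn.functional as F

def deep_feature_loss(zeta_P, zeta_proj):
    """Eq. (7): 4 minus the sum of cosine similarities between the deep feature of
    the processed image and those of the four projected 3D faces."""
    return 4.0 - sum(F.cosine_similarity(zeta_P, z, dim=0) for z in zeta_proj.values())

# Features would come from a frozen face-recognition network (FaceNet in the paper);
# random vectors are used here only to make the snippet self-contained.
zeta_P = torch.rand(512)
zeta_proj = {k: torch.rand(512) for k in ("R'", "S'", "T'", "O'")}
loss_D = deep_feature_loss(zeta_P, zeta_proj)
```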

Regularization: To ensure the plausibility of the reconstructed 3D face shape, expression, and texture, we enforce the estimated shape (\(\varvec{\alpha }_R\), \(\varvec{\alpha }_S\), \(\varvec{\alpha }_T\), \(\varvec{\alpha }_O\)), expression (\(\varvec{\beta }_R\), \(\varvec{\beta }_S\), \(\varvec{\beta }_T\), \(\varvec{\beta }_O\)), and texture (\(\varvec{\gamma }_R, \varvec{\gamma }_S, \varvec{\gamma }_T, \varvec{\gamma }_O\)) coefficients to follow the (normal) BFM distribution, using a regularization term, as follows.

$$\begin{aligned} \begin{aligned} {\mathcal {L}}_R&= w_{\alpha _R}\mid \mid \varvec{\alpha }_R\mid \mid +w_{\beta _R}\mid \mid \varvec{\beta }_R\mid \mid +w_{\gamma _R} \mid \mid \varvec{\gamma }_R\mid \mid \\&\quad +w_{\alpha _S}\mid \mid \varvec{\alpha }_S\mid \mid +w_{\beta _S}\mid \mid \varvec{\beta }_S\mid \mid +w_{\gamma _S}\mid \mid \varvec{\gamma }_S\mid \mid \\&\quad +w_{\alpha _T}\mid \mid \varvec{\alpha }_T\mid \mid +w_{\beta _T}\mid \mid \varvec{\beta }_T\mid \mid +w_{\gamma _T}\mid \mid \varvec{\gamma }_T\mid \mid \\&\quad +w_{\alpha _O}\mid \mid \varvec{\alpha }_O\mid \mid +w_{\beta _O}\mid \mid \varvec{\beta }_O\mid \mid +w_{\gamma _O}\mid \mid \varvec{\gamma }_O\mid \mid , \end{aligned} \end{aligned}$$
(8)

where \(w_{\alpha _i}\), \(w_{\beta _i}\) and \(w_{\gamma _i}\) are the weights associated with \({\varvec{\alpha }_i}\), \({\varvec{\beta }_i}\) and \({\varvec{\gamma }_i}\), respectively such that \(i\in \{{{R}},{{S}},{{T}},{{O}}\}\).
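A compact form of Eq. (8) is shown below; the individual weights are illustrative placeholders, since the paper only reports the overall regularization weight \(w_R\) (Sect. 4.3).

```python
import torch

def regularization_loss(coeffs, w_alpha=1.0, w_beta=1.0, w_gamma=1.0):
    """Eq. (8): weighted L2 penalties keeping alpha, beta, gamma close to the BFM
    (zero-mean) prior for every pipeline. The weights here are placeholder values."""
    return sum(w_alpha * torch.norm(c["alpha"]) +
               w_beta  * torch.norm(c["beta"])  +
               w_gamma * torch.norm(c["gamma"]) for c in coeffs.values())

coeffs = {k: {"alpha": torch.zeros(80), "beta": torch.zeros(64), "gamma": torch.zeros(80)}
          for k in ("R", "S", "T", "O")}
loss_R = regularization_loss(coeffs)   # zero for the mean face
```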

Obtaining Overall Supervision: The overall supervisory signal for training the REduced Dependency Fast UnsuperviSEd 3D Face Reconstruction framework is obtained using the pose transfer module (Eq. (4)), the landmark loss \({\mathcal {L}}_N\) (Eq. (5)), the photometric loss \({\mathcal {L}}_{P}\) (Eq. (6)), the deep feature loss \({\mathcal {L}}_D\) (Eq. (7)), and the regularization term \({\mathcal {L}}_R\) (Eq. (8)). The overall loss function is formulated as follows.

$$\begin{aligned} {\mathcal {L}} = w_N{\mathcal {L}}_N+w_P{\mathcal {L}}_P+w_{D}{\mathcal {L}}_{D}+w_R{\mathcal {L}}_R, \end{aligned}$$
(9)

where \(w_N\), \(w_{P}\), \(w_{D}\), and \(w_R\) are the weights associated with \({\mathcal {L}}_N\), \({\mathcal {L}}_P\), \({\mathcal {L}}_D\), and \({\mathcal {L}}_R\), respectively.
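Combining the four terms is then a single weighted sum; the weights below are those reported in Sect. 4.3.

```python
# Loss weights as reported in Sect. 4.3 (following R-Net).
w_N, w_P, w_D, w_R = 1.6e-3, 1.92, 0.2, 3e-4

def total_loss(loss_N, loss_P, loss_D, loss_R):
    """Eq. (9): the joint supervisory signal over all four pipelines."""
    return w_N * loss_N + w_P * loss_P + w_D * loss_D + w_R * loss_R
```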

4 Experiments

In this section, we present the details of the training and testing datasets (Sect. 4.1). Also, the evaluation metrics and algorithms are detailed to evaluate the performance of the proposed method (Sect. 4.2). Moreover, we provide the implementation details of our approach (Sect. 4.3).

4.1 Datasets

We gathered various standard face datasets, such as 300W-LP [36], LFW [14], etc., to form a training dataset. To validate the reconstruction accuracy, we use the test dataset of CelebA [37], NoW selfie-based validation dataset [15] and LFW-test set [14].

4.2 Evaluation metrics

To evaluate the reconstruction accuracy of the proposed RED-FUSE, we exploit various 3D and 2D evaluation metrics. Furthermore, we demonstrate the test-speed improvement of our method using a test-time analysis. The details of the metrics are as follows.

Algorithm 1: 3D shape- and color-based error evaluation

3D Shape and Color-based Error: The 3D shape and color-based error metrics evaluate the spatial and color differences between the estimated 3D faces and the corresponding ground truth. Specifically, each 3D face contains \(N=36\)K vertices; each vertex has an associated spatial location (xyz) and color values (r, g, b). The estimated vertex locations and texture values are compared with the ground-truth data using root mean square and standard deviation error metric. The mathematical formulation of the 3D shape-based error (\(M_{3DS}\pm S_{3DS}\)) is given below.

$$\begin{aligned} \begin{aligned} M_{3DS}&= \frac{1}{3N}\sum _{i}{\textbf{E}}_{3DS_i},\quad \\ S_{3DS}&= \sqrt{\frac{1}{3N}\sum _{i}({\textbf{E}}_{3DS_i}-M_{3DS})^2}\quad \text {where,}\quad \\ {\textbf{E}}_{3DS_i}&= \sqrt{(x_{i^G}-x_{i^P})^2+(y_{i^G}-y_{i^P})^2+(z_{i^G}-z_{i^P})^2}, \end{aligned} \end{aligned}$$
(10)

where \(M_{3DS}\) and \(S_{3DS}\) are the mean and standard deviation of shape error, respectively. Moreover, \(k_{i^G}\) and \(k_{i^P}\) are the ground-truth and predicted spatial locations of i-th vertex such that \(k\in \{x, y, z\}\). Also, the mathematical formulation of the 3D color-based error (\(M_{3DC}\pm S_{3DC}\)) is given below.

$$\begin{aligned} \begin{aligned} M_{3DC}&= \frac{1}{3N}\sum _{i}{\textbf{E}}_{3DC_i},\quad \\ S_{3DC}&= \sqrt{\frac{1}{3N}\sum _{i}({\textbf{E}}_{3DC_i}-M_{3DC})^2}\quad \text {where,}\quad \\ {\textbf{E}}_{3DC_i}&= \sqrt{(r_{i^G}-r_{i^P})^2+(g_{i^G}-g_{i^P})^2+(b_{i^G}-b_{i^P})^2}. \end{aligned} \end{aligned}$$
(11)
Algorithm 2: Perceptual error evaluation

\(M_{3DC}\) and \(S_{3DC}\) are the mean and standard deviation of the color error, respectively. Furthermore, \(k_{i^G}\) and \(k_{i^P}\) are the ground-truth and predicted color values associated with the i-th face vertex, such that \(k\in \{r,g,b\}\), where r, g, and b denote the red, green, and blue color values, respectively. We exploit a total of 80 subjects for the comparison. An algorithm for the 3D shape- and color-based error evaluation is given in Algo. 1.
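A minimal sketch of the error computation of Eqs. (10)-(11) for a single subject is given below; it applies equally to the (x, y, z) and (r, g, b) channels, and the random arrays are stand-ins for the registered ground-truth and predicted vertices.

```python
import numpy as np

def error_stats(V_gt, V_pred):
    """Per-vertex Euclidean error (Eqs. (10)-(11)); V_*: (N, 3) arrays holding either
    (x, y, z) locations or (r, g, b) values. The 1/(3N) normalization follows the
    equations as written."""
    E = np.linalg.norm(V_gt - V_pred, axis=1)          # E_i for each of the N vertices
    mean = E.sum() / (3 * len(E))
    std = np.sqrt(((E - mean) ** 2).sum() / (3 * len(E)))
    return mean, std

# Random stand-ins for the 36K registered vertices of one subject; in practice the
# predicted mesh is first brought into correspondence with the ground-truth data.
rng = np.random.default_rng(0)
V_gt, V_pred = rng.random((36000, 3)), rng.random((36000, 3))
m_3ds, s_3ds = error_stats(V_gt, V_pred)               # shape-based error for one subject
```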

NoW Challenge: NoW selfie challenge [15] computes the scan-to-mesh distance between the ground truth scan and the estimated 3D faces on the selfie images. Our method produces 3D faces from unprocessed images such as selfies and near-face pictures; thus, the evaluation is crucial for demonstrating the 3D shape accuracy of the proposed approach.

Perceptual Error: In addition to the 3D evaluation, we also evaluate the performance of our model on a 2D perceptual metric using 3K and 1.5K images of the CelebA-test and LFW-test datasets, respectively. The metric emphasizes the visual similarity between the 2D face image and its rendered counterpart and is therefore crucial for evaluating the visual consistency between the input data and the estimated faces. To perform the evaluation, we leverage seven high-performing face recognition models: VGG-Face [38], FaceNet [35], FaceNet-512 [35], OpenFace [39], DeepFace [40], ArcFace [41], and SFace [42], as follows.

$$\begin{aligned} M_{2DP}&=\sum _i \mid \mid (\mathbf {v}_{i^G} - \mathbf {v}_{i^P})\mid \mid , \quad \nonumber \\ S_{2DP}&=\sqrt{\frac{1}{M} \sum _i(\mid \mid (\mathbf {v}_{i^G} - \mathbf {v}_{i^P})\mid \mid -M_{2DP})^2}, \end{aligned}$$
(12)

where \(\mathbf {v}_{i^G}\in {\mathbb {R}}^M\) and \(\mathbf {v}_{i^P}\in {\mathbb {R}}^M\) are the ground-truth and predicted feature vectors for the i-th face image, respectively, and \(\mid \mid \cdot \mid \mid\) denotes the L2 norm. Moreover, \(M_{2DP}\) and \(S_{2DP}\) are the mean and standard deviation of the perceptual error, respectively. Please refer to Algo. 2 for details.

Test-time Analysis: Finally, we evaluate the improvement in the testing time by deriving its average percentage decrease compared to SOTA methods. For the comparison, we tested the models on 3K images from the CelebA-test dataset and derived the average time taken by each network.

Note that the training and testing datasets are distinct, and the testing data is not accessible during training.

4.3 Implementation details

Our REduced Dependency Fast UnsuperviSEd 3D Face Reconstruction (RED-FUSE) framework contains 3D face prediction networks, which estimate a 3D face vector \(\varvec{C_i}\in {\mathbb {R}}^{257}\), containing shape \(\varvec{\alpha }_i\in {\mathbb {R}}^{80}\), expression \(\varvec{\beta }_i\in {\mathbb {R}}^{64}\), texture \(\varvec{\gamma }_i\in {\mathbb {R}}^{80}\), illumination \(\varvec{\delta }_i\in {\mathbb {R}}^{27}\), rotation \({\varvec{R}}_i\in {\mathbb {R}}^{3}\), and translation \({\varvec{t}}_i\in {\mathbb {R}}^{3}\) coefficients, such that \(i\in \{R,S,T,O\}\). Therefore, the last fully-connected (FC) layer of our backbone architecture contains 257 nodes. Following [6], we exploit ResNet-50 as our backbone architecture, with the classification layer replaced by a 257-node FC layer. Moreover, the in-the-wild (unprocessed) face images and their variants (rotated, skewed, and translated) of size \(224\times 224\) serve as the inputs to our framework. Furthermore, the unprocessed face images (but not their variants) are cropped, aligned (using the method in [43]), and reshaped to size \(224\times 224\), which facilitates the unsupervised training. We opt for a batch size of 5 for each case, i.e., the rotated, skewed, translated, and original unprocessed face images; thus, the proposed framework is trained with a net batch size of 20. Our framework is initialized with ImageNet weights [44]. In addition, an Adam optimizer [45] is utilized with an initial learning rate of \(10^{-4}\) and 500K training iterations. The weights associated with the losses are \(w_N = 1.6\times 10^{-3}\), \(w_{P}=1.92\), \(w_D=0.2\), and \(w_R=3\times 10^{-4}\), following R-Net [6].

4.4 Results

Fig. 4 A comparison of qualitative performance of the proposed RED-FUSE model with R-Net and MoFA methods on open source images. Results show the superior 3D face reconstruction using the proposed approach. MOPI distills the knowledge from R-Net for occlusion robustness, resulting in the same performance

In this section, we compare the qualitative (Sect. 4.4.1) and quantitative (Sect. 4.4.2) results of our method with various methods, MoFA [5, 8], R-Net [6], and MOPI [11], on several open-source images, the CelebA test dataset [37], the LFW-test set [14], and the NoW selfie dataset [15]. MoFA is a preliminary CNN-based 3D face reconstruction method, whereas R-Net and MOPI generate accurate 3D faces from single-view face images using CNN frameworks; thus, we choose these methods for the comparisons.

4.4.1 Qualitative results

With a single monocular unprocessed face image, RED-FUSE reconstructs 3D face shape and texture without posing additional dependencies. The second rows of Figs. 4 and 5 show that the proposed approach attains high visual similarity between 3D faces and the corresponding unprocessed face images.

Figure 4 qualitatively compares RED-FUSE with the recent methods MoFA [5], R-Net [6], and MOPI [11] on open-source unprocessed images (e.g., collected from YouTube, Google, etc.). Compared to these methods, RED-FUSE reconstructs a superior overall 3D face shape (rows 2, 3, 4, and 5) and estimates a reliable 3D face pose (column 3). In addition, RED-FUSE predicts better 3D face expressions than all the other approaches. More specifically, MoFA either drags the search outside the 3DMM space (columns 3 and 6, row 5) or maps to a coordinate distant from the true coordinate in the search space, resulting in unreliable reconstructions (columns 1, 2, 4, and 5, row 5). Moreover, R-Net fails to capture accurate expressions from unprocessed face images, resulting in poor 3D face shape accuracy (columns 1 and 3, row 4). Similar to R-Net, MOPI produces inaccurate face shapes and poses from unprocessed inputs (row 3). It is worth noting that all these methods produce 3D faces with \(N=36\)K face vertices, facilitating a fair comparison.

Fig. 5 A comparison of qualitative performance of the proposed RED-FUSE model with R-Net and MoFA methods on LFW datasets. MOPI distills the knowledge from R-Net for occlusion robustness, resulting in the same performance

Figure 5 demonstrates the performance of our method on unprocessed LFW [14] images. The second row shows variations in the expressions and poses of the 3D faces, emphasizing the ability of RED-FUSE to reproduce difficult facial expressions on 3D faces (columns 3 and 5, row 2). Also, our model captures a range of accurate 3D face shapes from unprocessed images. It is worth noting that RED-FUSE reliably predicts eyebrow patterns, gaze details, etc., resulting in high perceptual similarity between the unprocessed input and the resultant 3D face. Finally, our approach effectively tackles minor occlusions such as caps and spectacles (columns 1 and 3, row 2). MoFA aims to attain cycle consistency with the processed input images (row 5), resulting in poor visual appearance and, in some cases, a face that does not look human (column 1, row 5). R-Net exploits a deep-feature loss to improve the accuracy of 3D faces using cropped and aligned face images as input, thus producing better results than MoFA, but it still estimates unreliable 3D face shapes and expressions for unprocessed face images (row 4). Furthermore, MOPI distills knowledge from R-Net and shows similar performance to R-Net (row 3). In contrast, RED-FUSE exploits the unprocessed images and their variants to estimate 3D faces using a novel pose transfer module and regresses the projections of the predicted 3D faces over the corresponding processed variant of the unprocessed face images to obtain accurate 3D faces. Therefore, our approach shows a significant improvement in performance compared to other recent methods.

In summary, RED-FUSE generates better reconstruction results, outperforming recent 3D face reconstruction approaches in terms of shape robustness, while producing reliable 3D face expression and pose. Moreover, the proposed method effectively tackles minor occlusions and generates occlusion robust 3D faces.

4.4.2 Quantitative results

We compare the quantitative performance of our RED-FUSE framework with methods MoFA [8], R-Net [6], and MOPI [11] on four criteria: (1) Perceptual Error, (2) NoW Selfie Challenge, (3) 3D Shape-based and Color-based Errors, and (4) Required Testing Time and Dependencies, as follows.

Table 1 A quantitative comparison of the perceptual error with other approaches on the CelebA-test dataset; lower is better
Table 2 A quantitative comparison of the perceptual error with other approaches on the LFW-test set; lower is better

(1) Perceptual Error: To emphasize the visual effectiveness of the results obtained using our method over other recent approaches, we compare the perceptual error between rendered 3D faces and color 2D face images with MoFA, R-Net, and MOPI. Our results in Tables 1 and 2 demonstrate superior performance compared to recent approaches. A significant improvement of \(\mathbf {27.4}\%\) (1.007 \(\rightarrow 0.731)\), \(\mathbf {38.2}\%\) (1.296 \(\rightarrow\) 0.801), \(\mathbf {40.6}\% (1.329\rightarrow 0.789)\), \(\mathbf {30.9}\% (0.953\rightarrow 0.659)\), \(\mathbf {17.7}\%\) \((0.785\rightarrow 0.646)\), \(\mathbf {24.9}\%\) (1.315 \(\rightarrow 0.987)\), and \(\mathbf {17.8}\%\) (1.260 \(\rightarrow 1.036)\) in the perceptual error for VGG-Face, FaceNet, FaceNet-512, OpenFace, DeepFace, ArcFace, and SFace, respectively, is achieved compared to MoFA on the CelebA-test dataset. Similarly, our approach obtains superior performance on the LFW-test set (Table 2) for various methods.

All these results demonstrate that the outputs of the proposed approach are visually more similar to the faces in the unprocessed images, thus establishing the effectiveness of the proposed method.

Table 3 A quantitative evaluation on the NoW validation selfie dataset. Our results show superior performance compared to recent methods
Fig. 6 A cumulative error plot obtained for the NoW validation selfie dataset. In the plot, the x-axis shows the scan-to-mesh distance error (in mm), whereas the y-axis displays the cumulative percentage such that the higher the curve, the better the shape-based accuracy. It is worth noting that the error curves for R-Net (orange) and MOPI (green) are overlapping

(2) NoW Selfie Challenge: We evaluate our method on the standard NoW validation selfie challenge [15]. Our results in Table 3 show that the proposed method outperforms recent methods by a large margin. For example, improvements of \(\mathbf {29.6}\%\) (1.99 \(\rightarrow\) 1.40) and \(\mathbf {20.5}\%\) (2.54 \(\rightarrow\) 2.02) are obtained in the median and mean errors, respectively, compared to a monocular 3D face reconstruction method. Moreover, we show the improvement through a cumulative error plot in Fig. 6, where the curve corresponding to the proposed RED-FUSE lies above the curves of the other approaches, validating our method's superiority. It is worth noting that the evaluation is performed on unprocessed images, i.e., no landmark information is exploited to estimate the meshes.

Table 4 A comparison of our method with the MoFA, R-Net, and MOPI methods, following the lower-the-better principle

(3) 3D Shape and Color-based Error: Table 4 shows the 3D shape- and color-based error comparison of RED-FUSE with respect to MoFA, R-Net, and MOPI. We infer that RED-FUSE improves the shape- and color-based RMSE errors by a large margin of \(\mathbf {64.2\%}\) (\(8.78 \rightarrow 3.14\)) and \(\mathbf {29.8\%}\) (\(4.23 \rightarrow 2.97\)), respectively, compared to MoFA. Also, our method shows a significant improvement of \(\mathbf {46.2\%}\) (\(5.84 \rightarrow 3.14\)) and \(\mathbf {15.1\%}\) (\(3.50 \rightarrow 2.97\)) for the shape- and color-based errors, respectively, with respect to R-Net. Furthermore, improvements of \(\mathbf {46.0\%}\) (\(5.82 \rightarrow 3.14\)) and \(\mathbf {15.1\%}\) (\(3.50 \rightarrow 2.97\)) are obtained for the shape- and color-based errors, respectively, compared to MOPI (Table 5).

Table 5 A comparison of our method with recent methods MoFA, R-Net and MOPI. It is worth noting that the proposed method poses fewer dependencies and significantly reduces testing time. Moreover, we re-trained MoFA with the same backbone architecture (as ours) to facilitate a fair comparison. Furthermore, FC stands for the last fully-connected layer

(4) Improved Inference Time: To emphasize the efficacy of the proposed method for real-time applications, we compare our test time with the above-mentioned recent methods MoFA, R-Net, and MOPI. These methods require the same test time due to the need to process the raw data during testing. The proposed approach takes \(\mathbf {1.85}\) ms to generate a 3D face, whereas the above-mentioned methods require \(\mathbf {7.30}\) ms per face, on average, when tested on a Linux platform (Ubuntu 16.04.7) with an NVIDIA GK110GL GPU. Therefore, our method improves the test time by a large margin of \(\mathbf {74.6\%}\) (nearly 4 times faster) compared to the recent approaches. Moreover, unlike various methods, RED-FUSE does not require the 5 facial landmark coordinates during testing, thus eliminating all test-time dependencies.

Fig. 7 An analysis of the impact of various losses on the training. Our results show that the model drifts the search outside the 3DMM space when trained without landmark loss \(\varvec{{\mathcal {L}}_N}\). Besides, the network trained without photometric loss \(\varvec{{\mathcal {L}}_P}\) or deep-feature loss \(\varvec{{\mathcal {L}}_D}\) demonstrates poor visual similarity with the input image

4.5 Ablation analysis

We present a study on the impact of various losses exploited for the training (Sect. 4.5.1). Moreover, we provide an analysis (qualitative and quantitative) to validate the efficacy of the proposed pose-transferring module for training our model (Sect. 4.5.2).

4.5.1 Impact of losses

We exploit a combination of losses for learning the 3D face representation from unprocessed monocular images in an unsupervised manner. Therefore, we qualitatively demonstrate the effectiveness of each loss in the proposed framework (Fig. 7). In Fig. 7a, the model trained without the photometric loss \({\mathcal {L}}_{P}\) produces unreliable 3D face texture, i.e., the estimated skin color of the rendered face is inconsistent with the input image. Moreover, the network trained without the landmark loss \({\mathcal {L}}_{N}\) (Fig. 7b) drags the search outside the 3DMM space, resulting in a face that does not look human. Furthermore, the model trained without the deep-feature loss \({\mathcal {L}}_{D}\) produces visually less convincing 3D faces (Fig. 7c). However, the network trained with all the losses (\({\mathcal {L}}_{P}\), \({\mathcal {L}}_{N}\), and \({\mathcal {L}}_{D}\)) demonstrates the best performance, establishing the efficacy of the proposed framework.

4.5.2 Impact of pose transfer module

A critical question arises: What is the impact of the pose transfer module on the training? To answer this, we train the model without exploiting the proposed scheme and regress the projection of estimated 3D faces (obtained from unprocessed image and its variants) over the corresponding aligned and cropped face image. Figure 8 shows that the performance of our model degrades when trained without the proposed module, particularly in terms of 3D face shape and expressions. We conjecture that the model trained without our scheme is penalized for estimating poses consistent with the unprocessed image variants, impacting the 3D face shapes and expressions. Therefore, during testing, the model fails to capture accurate 3D face shapes and expressions from unprocessed face images. Besides, the model trained with the pose transfer scheme transfers the estimated 3D face pose of the original unprocessed image to the 3D faces of corresponding variants before penalizing pixel discrepancies. Therefore, the model trained with the pose transfer module learns the correct 3D face shape and expression information from unprocessed face images.

Fig. 8 A qualitative demonstration of the effectiveness of pose transfer module for training the proposed framework

Moreover, we demonstrate the impact of each component of the pose transfer module. Our results in Table 6 show that the model trained without rotation and translation transfer performs poorly on the 3D error metrics. However, the accuracy improves when the rotation coefficients (\(\varvec{R_O}\)) of the 3D mesh (\(\varvec{M_O}\)) obtained from the unprocessed image are transferred to its variants.

Table 6 A study on the impact of pose transfer module in training the proposed RED-FUSE framework. It is worth noting that the best performance is obtained by exploiting all the components (rotation and translation transfer) of the proposed module

A further improvement is observed when transferring the translation coefficients (\(\varvec{t_O}\)) of \(\varvec{M_O}\), obtained from the unprocessed image, to its variants. This emphasizes that the translation coefficients are crucial in improving the accuracy of the 3D faces. Finally, the model trained with both rotation and translation transfer demonstrates the best performance, validating the effectiveness of the proposed pose transfer module.

Fig. 9 An analysis of the limitations of the proposed model. (Left) The proposed model does not reconstruct the faces in red rings as the network has an upper limit of processing a single face per image. (Right) The faces in the blue ring are not reliably reconstructed, emphasizing the constraint on the area occupied by a face in the captured image

5 Limitations and future work

While RED-FUSE achieves SOTA results for reconstructing 3D faces from unprocessed monocular images and for testing speed, several challenges remain. First, RED-FUSE reconstructs only one 3D face irrespective of the number of persons in the image (Fig. 9a). This calls for a more robust network that divides the image into patches and reconstructs 3D faces from the face information obtained in each patch. More specifically, the patch size should be small enough to contain a single face only. However, such a network increases computational complexity and poses several dependencies during training, such as prior knowledge of the number of faces in the image. Moreover, RED-FUSE struggles to estimate 3D faces from images containing far-away faces (Fig. 9b). This suggests the need for more diverse unprocessed training data, i.e., a training face dataset containing far-away faces whose corresponding processed 2D face images do not lose facial information. Finally, details such as makeup, mustaches, etc. (Fig. 9a, row 1) are not reproduced because we exploit the BFM, leading to a visual discrepancy between the input image and the corresponding 3D face. The BFM spans the range of natural human facial appearance, thus posing a challenge in reproducing facial accessories such as makeup. Also, the BFM contains Principal Component Analysis (PCA) basis vectors (obtained by projecting 3D facial data from a high-dimensional space to a low-dimensional space) for shape and texture reconstruction, resulting in the loss of fine facial details. A different approach is needed to estimate 3D faces beyond the 3DMM.

In future work, we aim to extend our model to tackle the above-mentioned issues, including patch-wise 3D face reconstruction, training on an expanded face dataset, and reconstruction beyond the constraints posed by the 3DMM.

6 Conclusion

In this work, we proposed a novel REduced Dependency Fast UnsuperviSEd 3D Face Reconstruction (RED-FUSE) framework to reconstruct 3D faces from unprocessed face images in an unsupervised manner without posing additional dependencies. In particular, RED-FUSE is trained on various 2D face datasets using a multi-pipeline training architecture, and a novel pose transfer scheme is exploited to learn an accurate 3D face representation without affecting shape and texture accuracy. This results in fewer dependencies and faster estimation during inference. Our experiments indicate that the proposed model improves the perceptual error, the NoW selfie challenge, and the 3D shape- and color-based errors by a large margin of \(\mathbf {27.4}\%\), \(\mathbf {29.6}\%\), \(\mathbf {46.2\%}\), and \(\mathbf {15.1\%}\), respectively, outperforming a recent method. Moreover, our approach significantly reduces the testing time, i.e., by \(\mathbf {74.6}\%\); thus, RED-FUSE not only reduces test-time dependencies and improves estimation speed but also produces reliable 3D faces. Owing to its reconstruction accuracy, lower dependencies, and speed, the proposed model is well suited for real-time applications.