1 Introduction

Three-dimensional knowledge of a human face is highly beneficial in the computer vision and computer animation fields. 3D face reconstruction can be defined as lifting a 2D face image into a 3D geometry. A 3D face model provides a mathematical representation of the face surface in three dimensions that is consistent under changes in pose, illumination and expression; 3D face reconstruction is therefore essential for real-time applications in face recognition, plastic surgery, facial animation [10, 29], entertainment [9], computer games [38, 39], 3D rendering of comic cartoon characters [26, 27], etc. In the media and entertainment (M&E) industry, 3D rendering of human faces is now common in segments such as cinema, advertising and gaming. In the medical field, 3D printing has abundant applications, such as 3D modeling of human organs from computed tomography (CT) data [43]; 3D reconstruction is essential for generating 3D-printed models of anatomical structures for pre-surgical preparation and for cosmetic surgery. In criminal investigation [14, 32, 41], facial recognition is of predominant importance, and 3D technology is used for precise recognition.

A 3D face model can be represented by a polygonal mesh or B-splines. The ambiguity inherent in the human face limits the accuracy of 3D face modeling. A 3D point cloud obtained with a 3D scanner would be the ideal input for generating a 3D face model, but such scanners are very expensive, so researchers have focused on 3D face reconstruction from 2D images captured by a camera. Moreover, acquiring multiple views of a face is not feasible in all real-time applications, so using a single face image as input for 3D modeling gains more importance. Irrespective of how the input is captured, the result is strongly affected by the person's pose and illumination. Several 3D face reconstruction methods have been presented in the state of the art; however, dealing with pose and illumination variation remains a significant problem, affecting the semantic details of the resultant 3D model.

Many datasets include images and 3D shapes annotated with only the primary (eyes, mouth, nose) and secondary (facial outline, chin, hair) face features, which is not sufficient for detailed 3D face geometry reconstruction. Therefore, to capture cheek and contour information, intermediate serving keypoints (ISK) are introduced in this work. Moreover, some datasets provide uncalibrated images and models but not the keypoints, which are crucial for learning-based 3D modeling. Therefore, to prepare publicly available 3D face databases that lack the required keypoints for research use, a landmark-mapping strategy is proposed in this work; it localizes the landmarks on the faces in the database by adopting a model template containing the required keypoints, and needs no manual intervention. The proposed landmark-mapping strategy can map a set of landmarks from one template model to any arbitrary model.

Most commercial systems have restrictions in modeling 3D shape, such as requiring frontal and profile views or manual localization of feature points. The proposed work focuses on 3D face modeling from a single frontal face image with automated feature localization using cluster analysis [16, 40, 42]. Instead of generating the 3D face from a generic model, SL2E-AFRE follows a predicted-landmark based reference-model deformation approach, resulting in a unique 3D shape model divergent from the generic model. Approaches in the literature that produce highly accurate 3D models use deep learning for both 2D and 3D subspace learning; the proposed approach is confined to learning the 2D subspace only, using an autoencoder, which is essentially a dimensionality reduction technique. Since regression-based landmark estimation [44] is sensitive to the face appearance inside the bounding box [45], clustering-based landmark estimation is proposed in this work.

1.1 Contributions

As its key contribution, this work presents a new energy function to be optimized within the autoencoder architecture; the energy function simultaneously learns the subspace and the landmark positions. Overall, this paper presents a three-step 3D face reconstruction approach comprising adaptation of the database to the proposed 3D reconstruction framework, landmark estimation and shape deformation. The contributions are as follows:

1.

    In the first step, a set of intermediate serving keypoints (ISK) is computed on the model template, and an enhanced keypoint set is formed from the ISK together with the base keypoints to derive a more accurate facial shape. Then, the proposed landmark-mapping strategy is applied to adapt the 3D face database to the proposed 3D face reconstruction approach by mapping the landmarks from the model template to the database, without any manual intervention.

2.

    In the second step, a novel autoencoder-based architecture for 3D face landmark prediction is presented, in which the patch subspace and the landmark positions are learned simultaneously with the help of the proposed energy function. Only 2D images are learned with the deep network, which dramatically reduces computation.

3.

    In the third step, a deformation method based on the predicted landmarks is proposed to obtain a personalized, unique 3D face shape: instead of a generic face, the reference model to be deformed is selected from the training examples based on the predicted landmarks. The model is selected by interpolation followed by measuring the distance between the predicted landmarks and the landmarks of the training examples.

4.

    The proposed system is tested on two databases, namely the Bosphorus 3D Face Database [31] and the USF Human-ID 3D Face database [6].

The performance of the proposed approach is comparable to other state-of-the-art works; however, it is far faster than existing methods, as it does not employ complex architectures.

This paper is organized as follows. Section 2 describes related work. The new methodology for 3D face reconstruction is presented in Section 3. Section 4 analyses the experimental results. Section 5 concludes the paper.

2 Related work

Existing 3D face reconstruction approaches can be classified into three categories: 3D Morphable Model (3DMM), Shape-from-Shading (SFS) and learning-based methods. Shape-from-Shading methods determine the shape using shading information (brightness variation from one pixel to another) inherited from the image [18]. The main drawback of SFS is its presumption of a Lambertian surface and a single point light source at infinity, which leads to unrealistic 3D face reconstruction [8]. In [23], Kemelmacher et al. exploit the similarity of faces and combine the input image's shading information with that of a generic model. To overcome the problems arising from these assumptions, researchers have proposed various approaches [35]. However, SFS does not provide a unique solution owing to the complex albedo variations of the face. Since SFS captures fine-scale facial geometric details, it is used as a tune-up phase in some approaches [21].

In the 3DMM approach, a deformable model is generated as a linear combination of example 3D models [6]. The parameters are then estimated by optimizing a cost function measuring the difference between the 2D projection of the model and the input image, and the 3D model is generated from the estimated parameters. Because the intensity information is used while fitting the model, the process is time-consuming and often becomes trapped in local minima, which results in unrealistic 3D geometry. To overcome the local-minima problem, researchers started fitting the facial landmarks of the deformable model to those in the input 2D image [20]. These approaches require human intervention to locate the landmarks, and the estimated parameters are less accurate. Zhou et al. [49] introduce a shape-space model in which each basis shape can be rotated, making it robust to arbitrary initialization. Baumberger et al. [4] use silhouette information to deform the generic model. Dou et al. [12] propose a deep neural network based end-to-end 3D face reconstruction that generates a 3D model from unconstrained inputs. Richardson et al. [30] generate a coarse face model using a CNN, employ another network to refine the coarse model and use a rendering layer to connect the networks. Tran et al. [36] employ a bump-mapping method in which the global shape is computed before local feature estimation. Ding et al. [11] present a local linear fitting (LLF) based 3D face reconstruction with sparse keypoints. Luo et al. [25] use the ICP (Iterative Closest Point) algorithm, in which each vertex is given an optimal weight for aligning the 3DMM to the given depth image.

Learning-based methods learn a subspace of 2D images and a subspace of the corresponding 3D models; a mapping between the two subspaces is then computed to generate the 3D face model for an input face image [47]. Han et al. [15] propose a cascaded-regression approach with a shape-incremental feature that exploits information from the 2D face and the model estimated at the current stage; it cannot handle pose variations. In [33] and [24], coupled radial basis functions are used to obtain an intermediate face; the model is optimized through landmarks using one coupled dictionary relating the 3D face model and 3D landmarks, and another relating 2D and 3D landmarks (to obtain z-coordinates). Sun et al. [34] propose a coupled statistical model incorporating both the face image and its depth map. In this approach, a new database is generated by illuminating the training dataset with an illumination parameter estimated from the average face model and the input image, making the method robust to different lighting conditions. Zhang et al. [48] use a Stacked Contractive Autoencoder (SCAE) that learns a nonlinear image subspace and the corresponding 3D model subspace, with a one-layer neural network computing the mapping between the subspaces; the drawback of this approach is its computational complexity. Similarly, Arslan and Seke [2] use conditional generative adversarial networks (CGAN) for computing depth. Jackson et al. [19] and Feng et al. [13] learn the mapping from 2D to 3D coordinates using a CNN (Convolutional Neural Network). Jackson et al. [19] develop a volumetric representation that does not consider the semantic significance of the points, whereas the method proposed by Feng et al. [13] does. Tran et al. [37] use deep neural networks (DNN) trained on in-the-wild images.

3 Methodology

The proposed approach is succinctly illustrated in Fig. 1. It includes three components: 1) landmark mapping across 3D databases; 2) patch subspace learning and facial landmark estimation with autoencoder; 3) predicted landmark based deformation.

Fig. 1

The proposed SL2E-AFRE framework for 3D face reconstruction

Initially, we need to find N 3D facial landmarks L = {l1, l2, ... , lN} in the normalized 3D mean frontal face geometry. In this work, a cross-database landmark mapping approach is presented to localize the landmarks in the 3D face model. Then, the reconstruction problem is solved by simultaneous clustering and relevant keypoint estimation, incorporating the proposed energy function within the autoencoder network. The energy function to be minimized to find the landmarks is as follows:

$$ E = AE_{loss} + \alpha \cdot landmark_{loss} - \beta \cdot cluster_{loss} $$
(1)

where AEloss ensures consistent reconstruction even though only the code-layer data of the autoencoder is used, and landmarkloss is the error in the predicted landmarks. Since the code layer is the basis for clustering, which in turn is the basis for landmark prediction, and the input is not represented precisely in the code layer, clusterloss also takes part in this energy function.

An autoencoder network is used to compress the dimensionality, since it can effectively learn a non-linear mapping in an unsupervised way; it maps high-dimensional data to a low-dimensional space. Existing deep learning based approaches [12, 19] learn non-linear subspaces of both the 2D and 3D samples, entailing computationally intensive methods. In the proposed approach, only the 2D subspace of facial patches is learned with the autoencoder network. The 3D geometry of the different patches of a face, each possibly from a different individual, is then generated by performing cluster analysis and 3D landmark estimation simultaneously. Finally, a reference model, chosen from the training dataset based on the predicted landmarks, is deformed with the estimated keypoints using the Laplacian deformation technique. In this paper, the set of 3D face landmarks is represented as L = {l1, l2, ... , lN} with \(l_{i}\in \mathbb {R}^{3}\), and each 2D face image is divided into three partitions P = {Pe, Pn, Pm}, where Pe, Pn and Pm are the sets comprising eye patches, nose patches and mouth patches respectively. The edge map Pf of the face image is also used to estimate keypoints.

3.1 Enhanced keypoint set

An enhanced keypoint set of 342 keypoints (Fig. 2) is built with intermediate serving keypoints for smooth deformation. It includes 68 base keypoints, 113 facial contour vertices, 100 cheek vertices, 40 eye and eyebrow vertices, 20 nose-base vertices and 1 nose tip. The vertex with the largest z-axis value is taken as the nose-tip vertex.

Fig. 2

a 68 base keypoints, b Enhanced keypoint set

Since the target and reference shapes are aligned to each other, the target shape inherits the groundtruth landmark points from the reference shape. The number of landmarks is then expanded with the following landmark function:

$$ \{\nu^{\prime}\} = \{\nu\} \pm \Delta $$
(2)

where {ν} is a set of preset landmark vertices and Δ = {δ1, δ2, ... , δn}. Each small value δi is added to and subtracted from {ν} to obtain a new set of landmarks \(\{\nu ^{\prime }\}\) (a short code sketch follows the definitions below). The set of supplementary eye landmarks is defined by,

$$ \mathbb{L}_{e} = \{(\nu_{e},\nu^{\prime}_{e})\in \mathbb{V} \mid \nu^{\prime}_{e} \text{ can be obtained from } \nu_{e} \text{ by applying (2)}\} $$
(3)

the set of supplementary nose landmarks is defined by,

$$ \mathbb{L}_{n} = \{(\nu_{n},\nu^{\prime}_{n})\in \mathbb{V} \mid \nu^{\prime}_{n} \text{ can be obtained from } \nu_{n} \text{ by applying (2)}\} $$
(4)

and the set of supplementary contour landmarks is defined by,

$$ \mathbb{L}_{c} = \{(\nu_{c},\nu^{\prime}_{c})\in \mathbb{V} \mid \nu^{\prime}_{c} \text{ can be obtained from } \nu_{c} \text{ by applying (2)}\} $$
(5)

where \(\mathbb {V}\) is the set of vertices in the 3D shape model.
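As a minimal sketch of the expansion in (2), assuming that {ν} is a set of preset landmark vertex indices on the mesh and that each δi offsets an index to a neighbouring vertex (consistent with the description in Section 4.1 of adding and subtracting small numbers of consecutive vertices), the following snippet builds the supplementary landmark set; all names and the bounds check are illustrative.

```python
def expand_landmarks(preset_indices, deltas, num_vertices):
    """Sketch of Eq. (2): offset each preset vertex index by +/- delta_i
    to obtain the supplementary landmark set {v'}."""
    expanded = set(preset_indices)
    for v in preset_indices:
        for d in deltas:
            for candidate in (v + d, v - d):
                if 0 <= candidate < num_vertices:  # stay inside the mesh
                    expanded.add(candidate)
    return sorted(expanded)

# hypothetical usage: neighbours at offsets 1 and 2 around three preset eye vertices
eye_landmarks = expand_landmarks([120, 154, 198], deltas=[1, 2], num_vertices=75972)
```

The same routine yields the nose and contour supplements of (4) and (5) when applied to the corresponding preset vertex sets.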

3.2 Landmark mapping across 3D databases

The proposed landmark mapping method is depicted in Fig. 3. The prerequisite for achieving 3D point-cloud correspondence is that the shape of all subjects (the topology of landmarks, irrespective of the subject's size) be similar, so that all subjects have an equal number of vertices. In this paper, 3D facial landmarks are recognized using the 3dMDLab benchmark [7] as the reference. In some previous methods [28], the location of li is obtained with a multivariate Gaussian distribution, i.e.,

$$ p(y|x,\Theta) = N(1,\sigma^{2}) $$
(6)

p(mi, li|Θ) is defined as,

$$ \begin{aligned} p(m_{i},l_{i}|\Theta) &= p(m_{i}|l_{i},\Theta)\,p(l_{i}|\Theta) \\ &\propto \exp\left( -\frac{\| m_{i}-sURl_{i}-t\|^{2}}{2\sigma_{i}^{2}}-\frac{\| l_{i}-\mu_{i}\|^{2}}{2\rho_{i}^{2}}\right) \end{aligned} $$
(7)

where mi represents the 2D frontal face feature points and Θ = {s, R, t} represents the pose parameters, since a 2D face image can be formed by projecting the 3D shape geometry with the pose parameters rotation R, scaling s and translation t. The 2D projection is carried out with U2×3 = [1 0 0 ; 0 1 0], so the depth is not included. The standard deviations σ and ρ are computed from the training dataset. And p(li|mi, Θ) is defined as,

$$ p(l_{i}|m_{i},\Theta) = \frac{p(m_{i}|l_{i},\Theta)\,p(l_{i}|\Theta)}{p(m_{i}|\Theta)} $$
(8)

p(mi|Θ) is represented by,

$$ p(m_{i}|\Theta) = \int p(m_{i},l_{i}|\Theta)\, dl_{i} $$
(9)
Fig. 3

Landmark mapping across 3D databases

In this work, the landmarks are detected using a template model with landmark annotations. First, the mean shape of the USF database is rigidly aligned with the reference shape using the widely used Iterative Closest Point (ICP) algorithm [5] for 3D shape registration; ICP aligns two moderately overlapping meshes. This registration establishes correspondence between the mean USF shape and the reference triangulated mesh [1]. It first selects sample points from the target shape by random sampling, then iteratively finds the closest points in the source mesh for 10 iterations using a k-NN search.

Weights are then assigned to each correspondence based on (10).

$$ W(m,b_{i}) = \frac{\exp(-D(m,b_{i}))}{\sum_{i=1}^{N}\exp(-D(m,b_{i}))} $$
(10)

where m is a vertex in the reference mesh, bi is the ith closest vertex in the source mesh and D(m, bi) is the Euclidean distance between m and bi.

The sum of squared distances is used as the error metric to be minimized to derive the final transformation.

$$ vert\_aligned_{targ}=knn\_search(v_{targ},v_{ref}) $$
(11)

where vert_alignedtarg denotes the aligned vertices of the target shape (the mean shape of the USF database), vtarg the vertices of the target shape and vref the vertices of the reference shape. The landmark points are then recognized using the landmarks of the reference 3D shape geometry.
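A compact sketch of the correspondence step in (10)-(11) is given below, using SciPy's cKDTree for the k-NN search; the number of neighbours k and the random point clouds are assumptions for illustration only.

```python
import numpy as np
from scipy.spatial import cKDTree

def correspondence_weights(ref_vertices, src_vertices, k=5):
    """For each reference vertex m, find the k closest source vertices b_i
    (Eq. 11, knn search) and weight each correspondence by a softmax over
    negative Euclidean distances (Eq. 10)."""
    tree = cKDTree(src_vertices)
    dists, idx = tree.query(ref_vertices, k=k)
    weights = np.exp(-dists)
    weights /= weights.sum(axis=1, keepdims=True)  # normalize as in Eq. (10)
    return idx, weights

# toy point clouds standing in for the reference and source meshes
ref = np.random.rand(100, 3)
src = np.random.rand(120, 3)
idx, w = correspondence_weights(ref, src)
```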

3.3 Patch subspace learning & facial landmarks estimation with autoencoder

The autoencoder network for 3D landmark estimation is shown in Fig. 7; it comprises stacked dense (fully connected) layers. To assist the 3D face shape reconstruction, the input 2D face image is partitioned into three patches and the edge map of the face region is extracted.

3.3.1 Face partitioning

The proposed approach starts by partitioning the input face into patches: eye, nose and mouth. Figure 4 shows how the patches are extracted and used to derive the landmarks for generating the final 3D geometry. Each facial patch, with a predefined dimension, is given as a one-dimensional input to the autoencoder network.

Fig. 4

Patches extraction and 3D face geometry generation

3.3.2 Edgemap for chin-cheek contour keypoints prediction

The chin-cheek region keypoints are extracted with the help of the Sobel operator by detecting horizontal edges with a threshold value of 0.01. Edge detection depends on the quality of the image, so appropriate preprocessing of the image and post-processing of the edge map are performed. Here, the input image is converted to a gray image and then dilated, which makes the edges sharper. Edges are then detected by applying the Sobel operator, followed by a morphological close operation and an area-opening operation, which removes all unwanted small edges consisting of fewer than a specified number of pixels. The unwanted eye and mouth regions are removed through hole filling and deletion of connected components with a significant number of pixels. Finally, the contour is extracted through morphological operations, namely dilation, extraction of the largest blob and erosion. Only the lower half of the resultant image is analysed for predicting the required chin-cheek contour keypoints. Although similar results are obtained with other edge operators such as Prewitt and Canny, the noise-suppression characteristics of Sobel are better than Prewitt's, and Sobel is computationally less expensive than Canny [22]. The output after applying the Sobel method is shown in Fig. 5. The proposed method benefits from this edge map by obtaining more accurate chin-cheek contour keypoints, resulting in an accurate facial shape.
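The pipeline above can be sketched with scikit-image and SciPy as follows; only the Sobel horizontal-edge step with threshold 0.01 is stated explicitly in the text, so the structuring elements, the minimum component size and the exact ordering of the morphological steps are assumptions.

```python
import numpy as np
from scipy import ndimage as ndi
from skimage import color, filters, morphology

def chin_cheek_contour(rgb_image, edge_thresh=0.01, min_size=50):
    """Sketch of the edge-map pipeline for chin-cheek contour keypoints."""
    gray = color.rgb2gray(rgb_image)
    gray = morphology.dilation(gray)                       # sharpen edges before detection
    edges = np.abs(filters.sobel_h(gray)) > edge_thresh    # horizontal edges, threshold 0.01
    edges = morphology.binary_closing(edges)               # morphological close
    edges = morphology.remove_small_objects(edges, min_size)  # area opening
    filled = ndi.binary_fill_holes(edges)                  # suppress eye/mouth holes
    contour = morphology.binary_dilation(filled)
    labels, n = ndi.label(contour)                         # keep only the largest blob
    if n > 0:
        sizes = ndi.sum(contour, labels, range(1, n + 1))
        contour = labels == (1 + int(np.argmax(sizes)))
    contour = morphology.binary_erosion(contour)
    return contour[contour.shape[0] // 2:]                 # only the lower half is analysed
```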

Fig. 5

Contour Extraction with Sobel

3.3.3 Subspace learning and landmark estimation

After acquiring sufficient landmarks on the database faces using the proposed landmark mapping strategy and dividing each face into patches Pe, Pn, Pm and Pf, each patch set is given as input to its own autoencoder, since each category of patches is handled by a separate autoencoder network. A simple autoencoder is shown in Fig. 6. An autoencoder [3, 17] has two parts: an encoder and a decoder. The patches are encoded using the 11-layer autoencoder depicted in Fig. 7; the input is the pixel values of each patch image. A patch pi is encoded using the scaled exponential linear unit (SELU) activation function, and the decoder reconstructs the input pi from the encoding. Generally, an autoencoder network is trained so that the mean squared error between the actual input (xi) and the obtained output \((x_{i}^{\prime })\) is minimized, since the expected output is the input itself.

$$ MSE = \frac{1}{N}\sum\limits_{i=1}^{N} \| x_{i}-x_{i}^{\prime} \|^{2} $$
(12)

where N denotes the total number of samples. The autoencoder network parameters are obtained by minimizing this mean squared error.
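A minimal Keras sketch of such a patch autoencoder is shown below; the layer widths and code dimension are assumptions, chosen only so that the code layer is the fifth dense layer of an 11-layer network, as described in Section 3.3.3, with SELU activations and the MSE loss of (12).

```python
import tensorflow as tf

def build_patch_autoencoder(input_dim, code_dim=64):
    """Fully connected autoencoder for flattened patch pixels (widths assumed)."""
    widths = [1024, 512, 256, 128]
    inp = tf.keras.Input(shape=(input_dim,))
    x = inp
    for w in widths:                                      # encoder
        x = tf.keras.layers.Dense(w, activation="selu")(x)
    code = tf.keras.layers.Dense(code_dim, activation="selu", name="code")(x)
    x = code
    for w in reversed(widths):                            # mirrored decoder
        x = tf.keras.layers.Dense(w, activation="selu")(x)
    out = tf.keras.layers.Dense(input_dim)(x)             # reconstruct the patch pixels
    model = tf.keras.Model(inp, out)
    model.compile(optimizer="adam", loss="mse")           # minimizes Eq. (12)
    return model
```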

Fig. 6

Layers in a simple Autoencoder

Fig. 7

The autoencoder network for 3D landmarks estimation

In the proposed approach, the objective is to minimize the image reconstruction error and the landmark error simultaneously. The proposed energy function, as specified in (1), is

$$ \min_{W,b} \; \frac{1}{N}\sum\limits_{i=1}^{N} \| p_{i}-p_{i}^{\prime} \|^{2} + \alpha \sqrt{\frac{1}{N}\sum\limits_{i=1}^{N} (y_{p_{i}}-\hat{y}_{p_{i}})^{2}} - \beta \| enc(p_{i})-c_{i} \|_{2} $$
(13)

where N is the size of the dataset; \( y_{p_{i}} \) and \(\hat {y}_{p_{i}}\) are the actual and estimated keypoints for the patch pi; enc(pi) and \( p_{i}^{\prime }\) are the encoded and decoded versions of the input face patch; and ci is the centroid to which pi belongs. Since the landmark prediction relies on the clustering performance, which depends on the code-layer output, and the autoencoder does not provide an identical encoded representation for identical inputs, clusterloss is subtracted from the computed error. clusterloss is the L2 distance between the encoded input and the centroid it belongs to. α and β are meta-parameters that help estimate the model parameters and are set by trial and error.
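Under these definitions, the energy function of (13) can be sketched as a TensorFlow loss; the tensor shapes are assumptions, and the default α and β follow the values reported in Section 4.3.1.

```python
import tensorflow as tf

def energy_loss(patch, decoded, y_true, y_pred, code, centroid,
                alpha=3.2e-3, beta=5e-3):
    """Sketch of Eq. (13): AE reconstruction MSE + alpha * landmark RMSE
    - beta * L2 distance between the code and its assigned centroid."""
    ae_loss = tf.reduce_mean(tf.square(patch - decoded))
    landmark_loss = tf.sqrt(tf.reduce_mean(tf.square(y_true - y_pred)))
    cluster_loss = tf.reduce_mean(tf.norm(code - centroid, axis=-1))
    return ae_loss + alpha * landmark_loss - beta * cluster_loss
```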

To compute the loss, in every iteration, the clusters are computed from the 2D subspace representation at the code layer. In the first iteration the centroids are selected randomly; in every iteration each facial patch is assigned to its nearest patch cluster centroid.

$$ c_{i} = \underset{c_{k}}{\arg\min} \; \| enc(p_{i})-c_{k} \|_{2} $$
(14)

The centroids are then updated for the next iteration.

$$ c_{k}^{t+1}= \frac{\sum_{p_{i}\in C_{k}^{t}} enc(p_{i})^{t}}{|C_{k}^{t}|} $$
(15)

where ck is the kth centroid and Ck is the set of patches belonging to ck. The procedure for subspace learning and landmark estimation is given in Algorithm 1.

Algorithm 1
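The clustering step of (14)-(15) inside Algorithm 1 amounts to one k-means-style pass over the code-layer outputs, as in the following sketch; the code dimension and dataset size are illustrative.

```python
import numpy as np

def assign_and_update(codes, centroids):
    """Eq. (14): assign each encoded patch to its nearest centroid;
    Eq. (15): recompute each centroid as the mean of its members."""
    dists = np.linalg.norm(codes[:, None, :] - centroids[None, :, :], axis=-1)
    assign = dists.argmin(axis=1)
    new_centroids = centroids.copy()
    for k in range(len(centroids)):
        members = codes[assign == k]
        if len(members):
            new_centroids[k] = members.mean(axis=0)
    return assign, new_centroids

# e.g. 18 clusters, the value chosen in Section 4.2.3
codes = np.random.rand(500, 64)
centroids = codes[np.random.choice(len(codes), 18, replace=False)]
assign, centroids = assign_and_update(codes, centroids)
```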

To find the relevant samples within a cluster, the SSIM (Structural Similarity Index Measure) is computed between the input and each of the cluster samples. SSIM is defined as,

$$ SSIM(p,q) = [l(p,q)^{\alpha} \cdot c(p,q)^{\beta} \cdot s(p,q)^{\gamma}] $$
(16)

where α, β and γ are the weights given to each of the comparative measures: luminance (l), contrast (c) and structure (s).

$$ l(p,q) = \frac{2{\mu_{p}} {\mu_{q}} + {C_{1}}}{{\mu_{p}^{2}} + {\mu_{q}^{2}} + {C_{1}}} $$
(17)
$$ c(p,q) = \frac{2{\sigma_{p}} {\sigma_{q}} + {C_{2}}}{{\sigma_{p}^{2}} + {\sigma_{q}^{2}} + {C_{2}}} $$
(18)
$$ s(p,q) = \frac{{\sigma_{pq}} + {C_{3}}}{{\sigma_{p}} {\sigma_{q}} + {C_{3}}} $$
(19)

With C3 = C2/2 and α, β, γ set to 1, the equation reduces to,

$$ SSIM (p,q) = \frac{(2{\mu_{p}} {\mu_{q}}+ {C_{1}})({2\sigma_{pq}} + {C_{2}})}{({\mu_{p}^{2}} + {\mu_{q}^{2}} + {C_{1}})({\sigma_{p}^{2}} + {\sigma_{q}^{2}} + {C_{2}})} $$
(20)

Then, based on a threshold value 𝜖, a subset of the cluster is selected, named the enhanced cluster Mc. For a patch pi, Mc is computed as follows:

$$ M_{c} = \{s=[s_{1},\dots,s_{m},\dots,s_{n}] \mid s \subset C_{k},\; s_{m}\in C_{k} \wedge SSIM(p_{i},s_{m})<\epsilon\} $$
(21)

where Ck is the cluster comprising the samples belonging to the cluster centroid ck. This cluster subset is used to estimate the facial patch landmarks.

$$ \hat{y}_{p_{i}}=\frac{1}{|M_{c}|}\sum\limits_{s_{m}\in M_{c}} y[s_{m}] $$
(22)

where |Mc| is the number of samples in Mc and y[sm] is the set of keypoints of sm. This procedure is given in Algorithm 2.

Algorithm 2
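A sketch of the enhanced-cluster selection and landmark averaging of (20)-(22) is given below, using scikit-image's SSIM implementation; the comparison direction follows (21) as printed, and the fallback to the full cluster when no sample passes the threshold is an assumption.

```python
import numpy as np
from skimage.metrics import structural_similarity

def estimate_landmarks(patch, cluster_patches, cluster_landmarks, eps=0.63):
    """Eq. (21): select the enhanced cluster M_c by thresholded SSIM;
    Eq. (22): average the selected members' groundtruth keypoints."""
    selected = [y for s, y in zip(cluster_patches, cluster_landmarks)
                if structural_similarity(patch, s, data_range=1.0) < eps]
    if not selected:                      # assumed fallback: use the whole cluster
        selected = cluster_landmarks
    return np.mean(selected, axis=0)
```

The default eps = 0.63 follows the threshold reported for the experiments in Section 4.3.1.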

In each iteration, the mapping function and the landmark loss function in (13) are optimized via back-propagation.

Upon completion of the learning process, the code-layer output (the output of the 5th layer) is used as the feature vector for a new test input face patch. Landmarks are predicted from its closest centroid using (21), and the patch landmarks are combined to obtain the complete set of face landmarks.

After the facial landmark estimation, Laplacian-coordinate based surface deformation is carried out by deforming a reference model that is structurally similar to the predicted keypoints.

3.4 Predicted landmark based deformation

Since a 3D shape achieved by deforming the mean shape lacks fine details, predicted landmark based deformation is proposed here, in which the depths of the non-landmark vertices are taken from the most similar 3D face according to the predicted keypoints. The reference 3D face geometry is chosen as the one with the minimum gap between the predicted 3D landmark vertices and the reference shape's 3D landmark vertices. Since the 2D coordinates differ, the reference landmark points are first interpolated at the predicted 2D coordinates to reproduce the depth value. The distance is then measured with (23), using the difference between the interpolated depth value and the predicted depth value.

$$ distance = \sqrt{\|depth_{inter}-depth_{predict}\|} $$
(23)

where depthinter is the depth value reproduced by interpolation and depthpredict is the depth-coordinate value of the predicted landmarks. The selected reference 3D model is then deformed towards the predicted landmarks using the Laplacian deformation method with the Iterative Closest Point (ICP) algorithm.
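The reference-selection step can be sketched with SciPy's griddata interpolation; treating each landmark as an (x, y, z) triple and zeroing extrapolated values are assumptions.

```python
import numpy as np
from scipy.interpolate import griddata

def reference_distance(ref_landmarks, pred_landmarks):
    """Interpolate the reference depth at the predicted 2D coordinates,
    then apply Eq. (23). Landmarks are (N, 3) arrays of (x, y, z)."""
    depth_inter = griddata(ref_landmarks[:, :2], ref_landmarks[:, 2],
                           pred_landmarks[:, :2], method="linear")
    diff = np.nan_to_num(depth_inter) - pred_landmarks[:, 2]
    return np.sqrt(np.linalg.norm(diff))  # Eq. (23)

# the training example with the smallest distance becomes the reference model:
# reference = min(training_set, key=lambda m: reference_distance(m, predicted))
```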

4 Experimental results

This section presents information about the databases used, the results, and a comparative analysis showing the effectiveness of the proposed system.

4.1 Datasets

To evaluate the proposed methodology, experiments are conducted on two datasets. The Bosphorus 3D Face Database [31] includes 105 subjects with different expressions. In this database, facial information is captured by a structured-light based 3D digitizer, and the 2D pictures are of high resolution (1600×1200 pixels). The resolutions in the x, y and z coordinates are 0.3 mm, 0.3 mm and 0.4 mm respectively, and the database incorporates the six universal expressions (happiness, sadness, surprise, disgust, anger, fear); in this experiment, the neutral expression is considered. The USF Human-ID 3D Face database [6] consists of 100 laser-scanned 3D faces, scanned under controlled viewing conditions, and each subject in the database has 75972 vertices. The 2D face images were obtained by taking snapshots of the 3D face geometry. The 3dMDLab benchmark [7] used for landmark mapping provides around 68 keypoint locations; in addition, intermediate serving keypoints are found by adding and subtracting small numbers of consecutive vertices, as in (2), from important features among the 68 landmarks. The proposed autoencoder based 3D face reconstruction method is implemented in Python on a 64-bit Windows operating system with an Intel(R) Core(TM) i5-8300H processor @ 2.30 GHz, an NVIDIA GeForce GTX 1060 GPU and 8.0 GB of RAM.

4.2 Results

4.2.1 Landmark detection

Manual landmark localization is error-prone and tedious. Therefore, the 3D landmarks are found using the mean shape of some publicly available faces from the 3dMDLab benchmark [7] and their annotated landmarks. Initially, 68 primary and secondary keypoints are found with the proposed landmark mapping method; then 274 additional intermediate serving keypoints are generated from the obtained keypoints by adding and subtracting small numbers of consecutive vertices. Figure 8 shows some examples of the 3D shapes obtained by deforming the mean USF shape using the landmarks predicted with the proposed landmark mapping strategy.

Fig. 8

Examples of the 3D shapes obtained by deforming the mean USF using the landmarks predicted with the proposed landmark mapping strategy. The first column shows the face images, the second column the 3D shapes from the database, the third column the deformed 3D shapes using the predicted landmarks, and the fourth column the error map (Blue-Green-Yellow-Red)

This 3D face reconstruction work represents a valuable alternative to state-of-the-art conventional [23, 44, 46] and deep neural network learning based [2, 21, 24, 33] methods. In the former, shading information and sophisticated algorithms are used, while in the latter, the 2D face subspace as well as the corresponding 3D shape geometry is learned with computationally expensive networks. The proposed work, in contrast, learns only the 2D face subspace without such non-trivial networks. In addition, the intermediate serving keypoints carry contour and cheek shape details.

4.2.2 3D face reconstruction comparison with and without ISK

The contribution of the ISK to the proposed SL2E-AFRE method is evaluated. First, 3D reconstruction is carried out with the 68 base keypoints alone; with these, the facial shape cannot be recovered accurately. Therefore, the enhanced keypoint set, including the base as well as the intermediate serving keypoints, is introduced and used in the proposed approach. Table 1 summarizes the overall performance gain from the introduced ISK: as the number of important keypoints increases, the system generates more accurate 3D geometry.

Table 1 Performance evaluation of ISK

4.2.3 Optimal choice of number of clusters

The choice of the number of clusters influences the performance to some extent. Experiments are carried out to determine the number of clusters at which both datasets yield better performance. Figure 9 shows the number of clusters versus the RMSE value. The proposed algorithm works most efficiently with 18 clusters, so the number of clusters for the experiments is set to 18. Compared to the method without ISK, the method with ISK performs better on both datasets.

Fig. 9

Impact of choosing the optimal number of clusters

4.2.4 Deformation on generic face vs landmark based reference face

As shown in Fig. 10, when the generic face is used as the reference for deformation, its semblance appears in all the reconstructed models, and unique appearance details are missed. Hence, a landmark-based reference model that is structurally similar to the test subject is used for the deformation, leading to a personalized, unique 3D model. In the figure, the respective reconstruction errors are overlaid.

Fig. 10

a Input image, b Reconstruction with generic face, c Reconstruction with landmark based reference face, d Ground truth.

4.3 Comparison

4.3.1 Experiments on USF human-ID 3D face database

First, the proposed method is evaluated on the USF Human-ID 3D Face database. Since the limited availability of large 3D face databases is a major constraint in the 3D face reconstruction task, training and testing are carried out with a 90:10 split, and the training set is further partitioned into train and validation sets. To deal with overfitting, the validation samples are randomly selected at each epoch. The meta-parameters in (13) are set manually to α = 3.2 × 10−3 and β = 5 × 10−3. The evaluation is carried out with the neutral expression as the baseline. The resolutions of the input patches are given in Table 2.

Table 2 Face patches and their dimensions

The number of clusters is set to 18 for both datasets. To derive the enhanced cluster set for each patch, the threshold value 𝜖 = 0.63 is used, as detailed in Algorithm 1. To evaluate the estimated landmarks, the reconstruction error [23] is employed as the performance metric and is computed by,

$$ \text{reconstruction error} = \frac{100 \cdot |z_{rec}-z_{truth}|}{z_{truth}} $$
(24)

where zrec is the z-coordinate of the reconstructed shape's keypoints and ztruth is the z-coordinate of the ground-truth keypoints.
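For completeness, (24) amounts to the following per-keypoint one-liner:

```python
import numpy as np

def reconstruction_error(z_rec, z_truth):
    """Eq. (24): depth error as a percentage of the groundtruth depth."""
    return 100.0 * np.abs(z_rec - z_truth) / z_truth
```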

The qualitative outcome of the proposed framework on some human faces from the USF Human-ID 3D Face database can be seen in Fig. 11. From the reconstruction error overlaid on the obtained 3D shapes, it can be seen that the results are significant.

Fig. 11

Sample Results. a The 2D image, b Groundtruth profile view, c Reconstructed profile view, d Groundtruth front view, e Reconstructed front view. The reconstruction error has been overlaid on each reconstructed face

The proposed method is compared with [21, 23, 44, 46] and [2]. In the method proposed by Kemelmacher et al. [23], one reference model is used for generating the 3D models of all test subjects, resulting in a resemblance to the reference model in the final model. Some of the compared methods rely on computationally very expensive deep networks [30], but the proposed approach provides plausible results without such complex architectures.

We adopt the mean and standard deviation of the reconstruction error to compare performance, as detailed in Table 3, in which the best reconstruction is displayed in boldface. From the table, it can be inferred that SL2E-AFRE gives superior performance in obtaining the chin keypoints, which leads to a better facial shape. There are still some failure cases due to the dataset constraint, since the proposed method relies entirely on the keypoints.

Table 3 Comparison of the mean and standard deviation of the reconstruction errors on the USF database

4.3.2 Experiments on Bosphorus database

When the proposed approach is evaluated on the Bosphorus database, 90 of the 105 subjects are selected randomly and employed for training and validation, and the subjects for the validation set are reselected randomly in each epoch.

The predicted keypoints are employed to deform a 3D model chosen from the training set based on the landmark prediction. The proposed method is compared with [21, 23, 44, 46] and [2]. Kemelmacher et al. [23] use a single generic model for generating all 3D face models, whereas the proposed approach uses different reference models for different facial patches, ensuring a unique reconstruction; the RMSE for Kemelmacher et al. [23] is larger than that of the proposed approach. Zeng et al. [46] use both front and profile views of a face, while the proposed method uses only the frontal face. The methods proposed in [21] and [2] are deep neural network based; Arslan et al. [2] use Generative Adversarial Networks (GAN), which incorporate both a generator and a discriminator network. Nevertheless, the results of the proposed approach are comparable to those of these computationally expensive methods. Figure 12 presents a comparison on a sample input, and Fig. 13 shows that the proposed method gives significant results on most of the subjects.

Fig. 12

Sample result of various state-of-the-art methods

Fig. 13

Graphical representation of RMSE values computed on 15 subjects from Bosphorus database

5 Conclusion and future work

In this paper, a method for 3D face reconstruction from a single frontal face image is presented. Initially, keypoint localization in the databases is carried out using the landmark mapping approach. The proposed 3D reconstruction method employs different reference models for modeling different facial patches; it utilizes an autoencoder network for learning the patch subspace and for selecting the reference models for landmark estimation. Finally, a reference shape considered structurally similar to the test subject, according to the predicted landmarks, is chosen from the database, and Laplacian deformation is carried out on the selected model with the predicted keypoints.

Unlike existing approaches, the proposed method takes a single face image as input and does not rely on a single reference model. The evaluation has demonstrated that plausible 3D face geometry can be obtained without computationally expensive approaches. However, the findings might not generalize to view-invariant face images, and since the proposed method is an example-based method, its performance depends on the faces in the training dataset.

Further studies taking uncalibrated input images into account will be undertaken in future. It is also believed that this framework can be extended to generating 3D comic characters.