
1 Introduction

3D dense face alignment is essential for many face-related tasks, e.g., recognition [6, 12, 23, 25, 41], animation [9], avatar retargeting [8], tracking [46], attribute classification [3, 20, 21], image restoration [10, 11, 47], and anti-spoofing [24, 37, 45, 49, 50]. Recent studies mainly fall into two categories: 3D Morphable Model (3DMM) parameter regression [22, 31, 33, 44, 54, 55, 57] and dense vertex regression [17, 26]. Dense vertex regression methods directly regress the coordinates of all the 3D points (usually more than 20,000) through a fully convolutional network [17, 26], achieving state-of-the-art performance. However, the resolution of the reconstructed faces depends on the size of the feature map, and these methods rely on heavy networks like the hourglass [35] or its variants, which are slow and memory-consuming in inference. A natural way to speed them up is channel pruning. We try pruning 77.5% of the channels of the state-of-the-art PRNet [17] to reach real-time speed on CPU, but find the error increases greatly, by 44.8% (3.62% vs. 5.24%). Besides, an obvious disadvantage is the presence of checkerboard artifacts caused by the deconvolution operators, which is shown in the supplementary material. Another strategy is to regress a small set of 3DMM parameters (usually fewer than 200). Compared with dense vertices, 3DMM parameters have low dimensionality and low redundancy, which makes them appropriate to regress with a lightweight network. However, different 3DMM parameters influence the reconstructed 3D face [54] differently, making the regression challenging: each parameter has to be dynamically re-weighted according to its importance during training. Cascaded structures [33, 54, 55] are often adopted to progressively update the parameters, but the computation cost grows linearly with the number of cascaded stages.

Fig. 1. A few results from our MobileNet (M+R+S) model, which runs at 50 fps on a single CPU core or 130 fps on multiple CPU cores.

In this paper, we aim to reach real-time speed on CPU and state-of-the-art performance simultaneously (Fig. 1). To this end, we choose to regress 3DMM parameters with a fast backbone, e.g., MobileNet. To handle the optimization problem of the parameter-regression framework, we exploit two different loss terms, WPDC and VDC [54] (see Sect. 2.2), and propose a meta-joint optimization that combines their advantages: it looks ahead by k steps with WPDC and VDC on the meta-train batches, then dynamically selects the better one according to the error on the meta-test batch. By doing so, the whole optimization converges faster and achieves better performance than the vanilla-joint optimization. Besides, a landmark-regression regularization is introduced to further alleviate the optimization problem and reach higher accuracy. In addition to single images, 3D face applications on videos are becoming more and more popular [8, 9, 27, 28], where reconstructing stable results across consecutive frames is important; however, this is often ignored by recent methods [17, 26, 54, 55]. Video-based training [16, 32, 36, 40] is commonly adopted to improve stability in 2D face alignment, but no video databases are publicly available for 3D dense face alignment. To address this, we propose a 3D aided short-video-synthesis method, which simulates both in-plane and out-of-plane face motion to transform one still image into a short video, so that our network learns to produce consistent results on consecutive frames. Experiments show our short-video-synthesis method significantly improves the stability on videos.

In general, our proposed method is (i) fast: it takes about 7.2 ms per image (almost 24\(\times \) faster than PRNet) and runs at over 50 fps (19.2 ms) on a single CPU core or over 130 fps (7.2 ms) on multiple CPU cores (i5-8259U processor); (ii) accurate: by dynamically optimizing the 3DMM parameters through a novel meta-optimization strategy combining the fast WPDC and VDC, we surpass the state-of-the-art results [17, 26, 54, 55] under a strict computation budget in inference; and (iii) stable: in a mini-batch, one still image is transformed slightly and smoothly into a short synthetic video involving both in-plane and out-of-plane rotations, which provides temporal information across adjacent frames for training. Extensive experimental results on four datasets show that the overall performance of our method is the best.

2 Methodology

Fig. 2. Overview of our method. Our architecture consists of four parts: a lightweight backbone like MobileNet for predicting 3DMM parameters, the meta-joint optimization of fWPDC and VDC, the landmark-regression regularization, and the short-video-synthesis for training. The landmark-regression branch is discarded in inference, thus adding no computation burden.

This section details our proposed approach. We first review the 3D Morphable Model (3DMM) [5]. Then we introduce the proposed meta-joint optimization, landmark-regression regularization and 3D aided short-video-synthesis. The overall pipeline is illustrated in Fig. 2 and the training procedure is summarized in Algorithm 1.

Algorithm 1

2.1 Preliminary of 3DMM

The original 3DMM can be described as:

$$\begin{aligned} \mathbf {S} = \overline{\mathbf {S}} + \mathbf {A}_{id} \varvec{\alpha }_{id} + \mathbf {A}_{exp} \varvec{\alpha }_{exp}, \end{aligned}$$
(1)

where \(\mathbf {S}\) is the 3D face mesh, \(\overline{\mathbf {S}}\) is the mean 3D shape, \(\varvec{\alpha }_{id}\) is the shape parameter corresponding to the 3D shape base \(\mathbf {A}_{id}\), \(\mathbf {A}_{exp}\) is the expression base and \(\varvec{\alpha }_{exp}\) is the expression parameter. After the 3D face is reconstructed, it can be projected onto the image plane with the scale orthographic projection:

$$\begin{aligned} V_{2d}{(\mathbf {p})}=f * \mathbf {Pr} * \mathbf {R} *\left( \overline{\mathbf {S}}+\mathbf {A}_{id} \varvec{\alpha }_{id}+\mathbf {A}_{exp} \varvec{\alpha }_{exp} \right) + \mathbf {t}_{2d}, \end{aligned}$$
(2)

where \(V_{2d}{(\mathbf {p})}\) is the projection function generating the 2D positions of the model vertices, f is the scale factor, \(\mathbf {Pr}\) is the orthographic projection matrix, \(\mathbf {R}\) is the rotation matrix constructed from the Euler angles pitch, yaw and roll, and \(\mathbf {t}_{2d}\) is the translation vector. The complete parameters of 3DMM are \(\mathbf {p} = [f, \mathrm {pitch}, \mathrm {yaw}, \mathrm {roll}, \mathbf {t}_{2d}, \varvec{\alpha }_{id}, \varvec{\alpha }_{exp}]\).

However, the three Euler angles suffer from gimbal lock [30] when faces are close to the profile view. This ambiguity confuses the regressor and degrades the performance, so we regress the similarity transformation matrix \(\mathbf {T} = f \left[ \mathbf {R}; \mathbf {t}_{3d} \right] \) instead of \([f, \mathrm {pitch}, \mathrm {yaw}, \mathrm {roll}, \mathbf {t}_{2d}]\) to reduce the regression difficulty, where \(\mathbf {T} \in \mathbb {R}^{3 \times 4}\) is constructed from the scale factor f, the rotation matrix \(\mathbf {R}\) and the translation vector \(\mathbf {t}_{3d} = \begin{bmatrix} \mathbf {t}_{2d} \\ 0 \end{bmatrix}\). The scale orthographic projection in Eq. 2 then simplifies to:

$$\begin{aligned} V_{2d}({\mathbf {p}}) = \mathbf {Pr} * \mathbf {T} * \begin{bmatrix} \overline{\mathbf {S}} + \mathbf {A} \varvec{\alpha } \\ \mathbf {1} \end{bmatrix}, \end{aligned}$$
(3)

where \(\mathbf {A} = [\mathbf {A}_{id}, \mathbf {A}_{exp}]\) and \(\varvec{\alpha } = [\varvec{\alpha }_{id}, \varvec{\alpha }_{exp}]\). Our regression objective is described as \(\mathbf {p} = [\mathbf {T}, \varvec{\alpha }]\).

The high-dimensional parameters \(\varvec{\alpha }_{id} \in \mathbb {R}^{199}\) and \(\varvec{\alpha }_{exp} \in \mathbb {R}^{29}\) are redundant: since 3DMM models the 3D face shape with PCA, the trailing components of the parameters have little effect on the face shape. We therefore keep only the first 40 dimensions of \(\varvec{\alpha }_{id}\) and the first 10 dimensions of \(\varvec{\alpha }_{exp}\) as our regression target, since the NME increase is acceptable and the reconstruction is greatly accelerated. The NME error heatmap for different numbers of shape and expression dimensions is shown in the supplementary material. Our complete regression target is thus simplified to \(\mathbf {p} = [\mathbf {T}^{3 \times 4}, \varvec{\alpha }^{50}]\), with 62 dimensions in total, where \(\varvec{\alpha } = [\varvec{\alpha }_{id}^{40}, \varvec{\alpha }_{exp}^{10}]\). To eliminate the negative impact of magnitude differences between \(\mathbf {T}\) and \(\varvec{\alpha }\), Z-score normalization is adopted: \(\mathbf {p} \leftarrow (\mathbf {p} - \varvec{\mu }_{p}) / \varvec{\sigma }_{p}\), where \(\varvec{\mu }_{p} \in \mathbb {R}^{62}\) is the mean of the parameters and \(\varvec{\sigma }_{p} \in \mathbb {R}^{62}\) their standard deviation.
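To make the reconstruction pipeline concrete, the following is a minimal NumPy sketch of Eqs. (3) and (5) applied to the 62-d target, assuming the mean shape, the concatenated basis and the Z-score statistics are loaded from the morphable model files (all names here are illustrative, not the authors' released API):

```python
import numpy as np

def reconstruct_vertices(p_norm, S_bar, A, mu_p, sigma_p, project=False):
    """p_norm: (62,) Z-score-normalized parameters predicted by the network;
    S_bar: (3N,) mean shape; A: (3N, 50) concatenated [A_id(40), A_exp(10)] basis."""
    p = p_norm * sigma_p + mu_p              # undo Z-score normalization
    T = p[:12].reshape(3, 4)                 # similarity transform f * [R; t_3d]
    alpha = p[12:].reshape(50, 1)            # [alpha_id(40); alpha_exp(10)]
    S = (S_bar.reshape(-1, 1) + A @ alpha).reshape(-1, 3).T   # (3, N) vertices
    V = T[:, :3] @ S + T[:, 3:]              # rotate/scale, then translate
    return V[:2] if project else V           # Pr keeps the x, y rows for Eq. (3)
```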

2.2 Meta-Joint Optimization

We first review the Vertex Distance Cost (VDC) and the Weighted Parameter Distance Cost (WPDC) from [54], then derive the meta-joint optimization to facilitate the parameter regression.

The VDC term \(\mathcal {L}_{vdc}\) directly optimizes \(\mathbf {p}\) by minimizing the vertex distances between the fitted 3D face and the ground truth:

$$\begin{aligned} \mathcal {L}_{vdc} = \left\| V_{3d}\left( \mathbf {p} \right) - V_{3d}\left( \mathbf {p}^{g}\right) \right\| ^{2}, \end{aligned}$$
(4)

where \(\mathbf {p}^g\) is the ground truth parameter, \(\mathbf {p}\) is the predicted parameter and \(V_{3d} (\cdot )\) is the 3D face reconstruction formulated as:

$$\begin{aligned} V_{3d}({\mathbf {p}}) = \mathbf {T} * \begin{bmatrix} \overline{\mathbf {S}} + \mathbf {A} \varvec{\alpha } \\ \mathbf {1} \end{bmatrix}. \end{aligned}$$
(5)
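For illustration, a differentiable PyTorch sketch of the VDC term could look as follows, assuming batched, de-normalized parameters and the same basis layout as the NumPy sketch above (names are again illustrative):

```python
import torch

def vdc_loss(p_pred, p_gt, S_bar, A):
    """p_pred, p_gt: (B, 62) de-normalized parameters; S_bar: (3N,); A: (3N, 50)."""
    def v3d(p):                                        # Eq. (5), batched
        T = p[:, :12].reshape(-1, 3, 4)                # (B, 3, 4)
        alpha = p[:, 12:].unsqueeze(-1)                # (B, 50, 1)
        S = (S_bar.view(1, -1, 1) + A @ alpha).reshape(p.shape[0], -1, 3)
        return S @ T[:, :, :3].transpose(1, 2) + T[:, :, 3].unsqueeze(1)
    return ((v3d(p_pred) - v3d(p_gt)) ** 2).mean()     # mean squared vertex distance
```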

Different from VDC, the WPDC term  [54] \(\mathcal {L}_{wpdc}\) assigns different weights to each parameter:

$$\begin{aligned} \mathcal {L}_{wpdc} = \left\| \mathbf {w} \cdot (\mathbf {p} - \mathbf {p}^{g}) \right\| ^2, \end{aligned}$$
(6)

where \(\mathbf {w}\) indicates the importance weight as follows:

$$\begin{aligned} \begin{aligned} \mathbf {w}&= \left( w_1, w_2, \dots , w_i, \dots , w_n \right) , \\ w_i&= \left\| V_{3d}(\mathbf {p}^{de,i}) - V_{3d}(\mathbf {p}^g)\right\| / Z, \\ \mathbf {p}^{de,i}&= \left( \mathbf {p}_1^g, \mathbf {p}_2^g, \dots , \mathbf {p}_i, \dots , \mathbf {p}_n^g \right) , \\ \end{aligned} \end{aligned}$$
(7)

where n is the number of parameters (\(n = 62\) in our regression framework), \(\mathbf {p}^{de, i}\) is the i-degraded parameter whose i-th element comes from the predicted \(\mathbf {p}\), and Z, the maximum of \(\mathbf {w}\), is a normalizing term. The term \(\left\| V_{3d}(\mathbf {p}^{de,i}) - V_{3d}(\mathbf {p}^g)\right\| \) models the importance of the i-th parameter.
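A direct, deliberately naive transcription of Eq. (7) makes the computational cost apparent: every weight \(w_i\) requires one full dense reconstruction, i.e. 62 reconstructions per sample. This is the bottleneck that fWPDC removes below (the sketch reuses the illustrative reconstruct_vertices above):

```python
import numpy as np

def wpdc_weights_naive(p_pred, p_gt, S_bar, A, mu_p, sigma_p):
    """p_pred, p_gt: (62,) normalized parameters; returns w of Eq. (7)."""
    V_gt = reconstruct_vertices(p_gt, S_bar, A, mu_p, sigma_p)
    w = np.empty(62)
    for i in range(62):                       # one dense reconstruction per parameter
        p_de = p_gt.copy()
        p_de[i] = p_pred[i]                   # the i-degraded parameter p^{de,i}
        V_de = reconstruct_vertices(p_de, S_bar, A, mu_p, sigma_p)
        w[i] = np.linalg.norm(V_de - V_gt)
    return w / w.max()                        # Z is the maximum of w
```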

Algorithm 2

fWPDC. The original calculation of \(\mathbf {w}\) in WPDC is rather slow, as each \(w_i\) requires reconstructing all the vertices once, which is a bottleneck for fast training. We find that the vertices need to be reconstructed only once by decomposing the weight calculation into two parts: the similarity transformation matrix \(\mathbf {T}\), and the combined shape and expression parameters \(\varvec{\alpha }\). Therefore, we design a fast implementation of WPDC named fWPDC: (i) reconstructing the vertices without projection, \(\mathbf {S} = \overline{\mathbf {S}} + \mathbf {A} \varvec{\alpha }\), and calculating \(\mathbf {w}_T\) from the norms of the coordinate row vectors; (ii) calculating \(\mathbf {w}_{\alpha }\) from the column norms of \(\mathbf {A}\) and the input scale f: \(\mathbf {w}_\alpha (i) = f \cdot \big ( \varvec{\alpha }(i) - \varvec{\alpha }^g(i) \big ) \cdot \left\| \mathbf {A}\left( :,i \right) \right\| \); (iii) combining them to compute the final cost. The detailed algorithm of fWPDC is described in Algorithm 2. fWPDC reconstructs the dense vertices only once, not 62 times as WPDC, thus greatly reducing the computation cost: with a batch of 128 samples, the original WPDC takes 41.7 ms while fWPDC takes only 3.6 ms. fWPDC is over 10\(\times \) faster than the original WPDC while producing the same outputs.
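The following NumPy sketch follows steps (i)-(iii) above. Because the model is linear in both \(\mathbf {T}\) and \(\varvec{\alpha }\), the vertex displacement caused by perturbing a single parameter reduces to the parameter difference times a fixed norm, so a single reconstruction suffices; exact constants and parameter layout are assumptions and may differ from the authors' implementation:

```python
import numpy as np

def fwpdc_weights(p, p_gt, S_bar, A):
    """p, p_gt: (62,) de-normalized parameters [T(12, row-major), alpha(50)]."""
    S = (S_bar + A @ p_gt[12:]).reshape(-1, 3)          # (i) a single reconstruction
    S_h = np.hstack([S, np.ones((S.shape[0], 1))])      # homogeneous coords [x, y, z, 1]
    coord_norms = np.linalg.norm(S_h, axis=0)           # norms of the x, y, z, 1 "rows"
    w_T = np.abs(p[:12] - p_gt[:12]) * np.tile(coord_norms, 3)  # T[r, c] scales coord c
    f = np.linalg.norm(p_gt[:3])                        # scale f from a row of f * R
    w_alpha = f * np.abs(p[12:] - p_gt[12:]) * np.linalg.norm(A, axis=0)   # (ii)
    w = np.concatenate([w_T, w_alpha])                  # (iii) combine and normalize
    return w / w.max()
```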

Fig. 3. The vertex error during training on 300W-LP under different loss terms. VDC from scratch has the highest error, fWPDC is lower than VDC, and VDC from fWPDC is better than both. When combining VDC and fWPDC, the proposed meta-joint optimization converges faster and reaches a lower error than vanilla-joint, and converges even better when incorporating the landmark-regression regularization.

Exploitation of VDC and fWPDC. Through Eq. 4 and Eq. 6, we find that WPDC/fWPDC is suitable for parameter regression since each parameter is appropriately weighted, while VDC directly reflects the quality of the 3D face reconstructed from the parameters. In Fig. 3, we investigate how these two losses converge as training progresses. The optimization is difficult for VDC: the vertex error is still over 15 when training converges. The work in [55] also demonstrates that optimizing VDC with gradient descent converges very slowly due to the “zig-zagging” problem. In contrast, fWPDC converges much faster than VDC, with an error of about 7 at convergence. Surprisingly, if the fWPDC-trained model is fine-tuned with VDC, we obtain a much lower error than with fWPDC alone. From these observations we conclude that training from scratch with VDC is hard to converge, and that fWPDC does not fully train the network in the late stage.

Meta-Joint Optimization. Based on the above discussion, it is natural to weight the two terms in a vanilla-joint optimization: \(\mathcal {L}_{vanilla\textit{-}joint} = \beta \mathcal {L}_{fwpdc} + (1 - \beta ) \frac{|l_{fwpdc}|}{|l_{vdc}|} \cdot \mathcal {L}_{vdc}\), where \(\beta \in [0, 1]\) controls the relative importance of fWPDC and VDC. However, the vanilla-joint optimization relies on the manually set hyper-parameter \(\beta \) and does not achieve satisfactory results (Fig. 3). Inspired by Lookahead [52] and MAML [18], we propose a meta-joint optimization strategy that dynamically combines fWPDC and VDC; an overview is shown in Fig. 4. During training, the model looks ahead by k steps with fWPDC or VDC on k meta-train batches \(\mathcal {X}_{mtr}\), then selects the better of the two according to the vertex error on the meta-test batch. Specifically, each cycle of the meta-joint optimization consists of four steps: (i) sampling k batches \(\mathcal {X}_{mtr}\) for meta-train and one batch \(\mathcal {X}_{mte}\) for meta-test; (ii) meta-train: updating the current model parameters \(\theta _i\) by k steps with fWPDC and with VDC on \(\mathcal {X}_{mtr}\), yielding two parameter states \(\theta _{i+k}^f\) and \(\theta _{i+k}^v\); (iii) meta-test: evaluating the vertex errors of \(\theta _{i+k}^f\) and \(\theta _{i+k}^v\) on \(\mathcal {X}_{mte}\); (iv) adopting the parameters with the lower error as the new \(\theta _i\). The proposed meta-joint optimization can be directly embedded into the standard training regime. As Fig. 3 shows, it converges faster than the vanilla-joint optimization and reaches a lower error.
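One cycle of the meta-joint optimization can be sketched in PyTorch as follows, assuming loss callables like the sketches above and an iterator yielding (image, ground-truth parameter) batches; the optimizer details are simplified for illustration:

```python
import copy
import torch

def meta_joint_cycle(model, batches, k, fwpdc_loss, vdc_loss, lr=0.01):
    """batches: iterator yielding (images, ground-truth parameters) pairs."""
    meta_train = [next(batches) for _ in range(k)]      # (i) k meta-train batches X_mtr
    x_mte, p_mte = next(batches)                        #     one meta-test batch X_mte
    best_state, best_err = None, float("inf")
    for loss_fn in (fwpdc_loss, vdc_loss):              # (ii) look ahead with each cost
        net = copy.deepcopy(model)                      #     branch from theta_i
        opt = torch.optim.SGD(net.parameters(), lr=lr)
        for x, p_gt in meta_train:                      #     k look-ahead steps
            opt.zero_grad()
            loss_fn(net(x), p_gt).backward()
            opt.step()
        with torch.no_grad():                           # (iii) vertex error on X_mte
            err = vdc_loss(net(x_mte), p_mte).item()
        if err < best_err:
            best_state, best_err = net.state_dict(), err
    model.load_state_dict(best_state)                   # (iv) keep the better theta_{i+k}
```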

2.3 Landmark-Regression Regularization

In 3D face reconstruction [14, 15, 19, 42, 43], the projected 2D sparse landmarks are usually used as an extra regularization to facilitate the parameter regression. In our regression framework, we find that treating 2D sparse landmark prediction as an auxiliary regression task benefits more.

Fig. 4. Overview of the meta-joint optimization.

As shown in Fig. 2, we add an additional landmark-regression task on the global-pooling layer, trained with an L2 loss. The difference between the conventional landmark regularization and our landmark-regression regularization is that the latter introduces extra parameters to regress the landmarks; in other words, it is a task-level regularization. As the tomato curve in Fig. 3 shows, incorporating the landmark-regression regularization yields a lower error. The comparative results in Table 3 show that the proposed landmark-regression regularization outperforms the plain landmark regularization (3.59% vs. 3.71% on AFLW2000-3D). It is formulated as \(\mathcal {L}_{lrr} = \frac{1}{N} \sum _{i=1}^{N} \left\| l_i - l_i^g \right\| _2 ^2\), where N is 136, as we use 68 2D landmarks flattened into a 136-d vector.
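A minimal sketch of this two-head design could look as follows; the backbone, feature dimension and head shapes are illustrative assumptions, not the authors' exact configuration:

```python
import torch
import torch.nn as nn

class TwoHeadRegressor(nn.Module):
    def __init__(self, backbone, feat_dim=1280):        # e.g. MobileNetV2 feature size
        super().__init__()
        self.backbone = backbone                        # ends with global pooling + flatten
        self.param_head = nn.Linear(feat_dim, 62)       # 3DMM parameter regression
        self.lmk_head = nn.Linear(feat_dim, 136)        # 68 landmarks, flattened

    def forward(self, x):
        feat = self.backbone(x)                         # (B, feat_dim)
        if self.training:                               # extra branch used only in training
            return self.param_head(feat), self.lmk_head(feat)
        return self.param_head(feat)                    # branch discarded in inference

lrr = nn.MSELoss()   # L_lrr: mean squared error over the 136-d landmark vector
```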

2.4 3D Aided Short-Video-Synthesis

Video-based 3D face applications have become more and more popular [8, 9, 27, 28] recently. In these applications, 3D dense face alignment methods are required to run on videos and provide stable reconstruction results across adjacent frames. Stability means that the changes of the reconstructed 3D faces across adjacent frames should be consistent with the true face motion at a fine-grained level. However, most existing methods [17, 26, 54, 55] ignore this requirement and their predictions suffer from random jittering. In 2D face alignment, post-processing such as temporal filtering is a common strategy to reduce jittering, but it degrades precision and causes frame delay. Besides, since no public video databases for 3D dense face alignment are available, video training strategies [16, 32, 36, 40] cannot be applied here. A challenge arises: can we improve the stability on videos when only still images are available for training?

To address this challenge, we propose a batch-level 3D aided short-video-synthesis strategy, which expands one still image into several adjacent frames, forming a short synthetic video within a mini-batch. The common patterns in a video are modelled as follows. (i) Noise: we model noise as \(P(x) = x + \mathcal {N}(0, \varSigma )\), where \(\varSigma = \sigma ^2 I\). (ii) Motion blur: motion blur can be formulated as \(M(x) = K *x\), where K is the convolution kernel (the operator \(*\) denotes convolution). (iii) In-plane rotation: given two adjacent frames \(x_{t}\) and \(x_{t+1}\), the in-plane temporal change from \(x_t\) to \(x_{t+1}\) can be described as a similarity transform \(T ( \cdot )\):

$$\begin{aligned} T(\cdot ) = \varDelta s \begin{bmatrix} \cos (\varDelta \theta ) &{} -\sin (\varDelta \theta ) &{} \varDelta t_1 \\ \sin (\varDelta \theta ) &{} \cos (\varDelta \theta ) &{} \varDelta t_2 \end{bmatrix}, \end{aligned}$$
(8)

where \(\varDelta s\) is the scale perturbation, \(\varDelta \theta \) is the rotation perturbation, and \(\varDelta t_1\) and \(\varDelta t_2\) are translation perturbations. (iv) Out-of-plane rotation: since human faces share a similar 3D structure, we can also synthesize out-of-plane face motion. Face profiling [54] \(F (\cdot )\), originally proposed for solving large-pose face alignment, is utilized to progressively increase the yaw angle \(\varDelta \phi \) and pitch angle \(\varDelta \gamma \) of the face. Specifically, we sample several still images in a mini-batch, and for each still image \(x_0\) we transform it slightly and smoothly to generate a synthetic video with n adjacent frames: \(\{ x'_j | x'_{j} = (M \circ P) (x_j), x_{j} = (T \circ F) (x_{j-1}), 1 \le j \le n-1 \} \cup \{x_0\}\). Figure 5 illustrates how these transformations are applied to an image to generate several adjacent frames, and a code sketch is given below.
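A simplified sketch of the frame-synthesis loop follows. Face profiling \(F(\cdot )\) requires the fitted 3D face, so it is passed in as a callable placeholder, a box blur stands in for the motion-blur kernel K, and all perturbation ranges are illustrative:

```python
import numpy as np
import cv2

def synth_short_video(x0, n, profile_fn, sigma=4.0, blur_k=3):
    """x0: uint8 still image; profile_fn: face profiling F (placeholder callable)."""
    frames, x = [x0], x0
    for _ in range(1, n):
        x = profile_fn(x)                               # F: out-of-plane yaw/pitch step
        ds = np.random.uniform(0.98, 1.02)              # T: scale perturbation
        dth = np.random.uniform(-3.0, 3.0)              #    rotation perturbation (deg)
        h, w = x.shape[:2]
        M_sim = cv2.getRotationMatrix2D((w / 2, h / 2), dth, ds)
        M_sim[:, 2] += np.random.uniform(-2.0, 2.0, 2)  #    translation perturbation
        x = cv2.warpAffine(x, M_sim, (w, h))
        noisy = x + np.random.normal(0.0, sigma, x.shape)            # P: Gaussian noise
        blurred = cv2.blur(np.clip(noisy, 0, 255).astype(np.uint8),  # M: blur stand-in
                           (blur_k, blur_k))
        frames.append(blurred)
    return frames                                       # {x0} plus n-1 transformed frames
```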

Fig. 5. An illustration of how two adjacent frames are synthesized in our 3D aided short-video-synthesis.

3 Experiments

In this section, we first introduce the datasets and evaluation protocols; then we report comparison experiments on accuracy and stability; next, the complexity and running speed are evaluated; finally, extensive discussions are made. The implementation details and the generalization and scaling-up ability of our proposed method are given in the supplementary material.

3.1 Datasets and Evaluation Protocols

Five datasets are used in our experiments. 300W-LP [54] (300W Across Large Poses) is composed of synthesized large-pose face images from 300W [38], including AFW [56], LFPW [2], HELEN [53], IBUG [38] and XM2VTS [34]; specifically, the face profiling method [54] is adopted to generate 122,450 samples across large poses. AFLW [29] consists of 21,080 in-the-wild faces (following [22, 55]) with large poses (yaw from -90\(^\circ \) to 90\(^\circ \)); each image is annotated with up to 21 visible landmarks. AFLW2000-3D [54] is constructed by [54] for evaluating 3D face alignment and contains the ground-truth 3D faces and the corresponding 68 landmarks of the first 2,000 AFLW samples. Florence [1] is a 3D face dataset of 53 subjects with ground-truth 3D meshes acquired from a structured-light scanning system; for evaluation, we generate renderings with different poses for each subject following VRN [26] and PRNet [17]. Menpo-3D [51] provides a benchmark for evaluating 3D facial landmark localization in the wild in arbitrary poses; specifically, it provides 3D facial landmarks for 55 videos from the 300-VW [39] competition.

Fig. 6. Ablative results of the vanilla-joint optimization with different \(\beta \) and the meta-joint optimization with different k. Lower NME (%) is better.

Table 1. The NME (%) of different methods on AFLW2000-3D and AFLW. The first and the second best results are highlighted. M, R, S denote the meta-joint optimization, landmark-regression regularization and short-video-synthesis, respectively.
Table 2. The NME (%) on Florence and AFLW2000-3D (Dense), NME (%) / Stability (%) on Menpo-3D, and the running complexity and time of different methods. Our method outputs 3D dense vertices in only 2.1 ms (2 ms for parameter prediction and 0.1 ms for vertex reconstruction) on GPU or 7.2 ms on CPU (6.2 ms for parameter prediction and 1 ms for vertex reconstruction). The first and second best results are highlighted.

Protocols. The protocol on AFLW follows [54] and the Normalized Mean Error (NME) normalized by the bounding box size is reported. Two protocols are applied on AFLW2000-3D: the first follows AFLW, and the other follows [17] to evaluate the NME of 3D face reconstruction normalized by the bounding box size. For Florence, we follow [17, 26] to evaluate the NME of 3D face reconstruction normalized by the outer interocular distance. On Menpo-3D, we evaluate the NME on still frames and the stability across adjacent frames. Following [40], stability is measured as the NME between the predicted offsets and the ground-truth offsets of adjacent frames: at frames \(t-1\) and t, the ground-truth landmark offset is \(\varDelta p = p_t - p_{t-1}\), the predicted offset is \(\varDelta q = q_t - q_{t-1}\), and the error \(\varDelta p - \varDelta q\), normalized by the bounding box size, represents the stability. Since 300W-LP only provides the indices of 68 landmarks, we use the 68 landmarks of Menpo-3D for consistency.
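A small sketch of this stability measure, under illustrative array shapes:

```python
import numpy as np

def stability(pred, gt, bbox_sizes):
    """pred, gt: (F, 68, 2) landmarks over F frames; bbox_sizes: (F,) per frame."""
    dq = np.diff(pred, axis=0)                  # predicted offsets q_t - q_{t-1}
    dp = np.diff(gt, axis=0)                    # ground-truth offsets p_t - p_{t-1}
    err = np.linalg.norm(dq - dp, axis=-1)      # (F-1, 68) offset errors
    return (err / bbox_sizes[1:, None]).mean()  # NME of offsets, bbox-normalized
```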

Table 3. The comparative and ablative results on AFLW2000-3D and AFLW. The mean NMEs (%) across small, medium and large poses on AFLW2000-3D and AFLW are reported. lmk. indicates a landmark constraint on the parameter regression as in [43], and lrr. is the proposed landmark-regression regularization.
Table 4. Comparisons of NME (%)/Stability (%) on Menpo-3D. svs. indicates short-video-synthesis, rnd. indicates applying in-plane and out-of-plane rotations randomly in one mini-batch.

3.2 Ablation Study

To evaluate the effectiveness of the meta-joint optimization and the landmark-regression regularization, we carry out comparative experiments covering our two baselines, VDC and fWPDC; three joint options: (i) VDC from fWPDC: fine-tuning with VDC from the model pre-trained with fWPDC, (ii) vanilla-joint: weighting VDC and fWPDC with the best scalar \(\beta =0.5\), (iii) meta-joint: the proposed meta-joint optimization with the best \(k=100\); and four options of how the 2D landmarks are utilized. From Table 3, Table 4, Fig. 3 and Fig. 6, we draw the following conclusions:

Meta-Joint Optimization Performs Better. Compared with the two baselines VDC and fWPDC, all three joint optimization methods perform better. Among them, the proposed meta-joint performs better than VDC from fWPDC and vanilla-joint: compared with the fWPDC baseline, the mean NME drops from \(4.04\%\) to \(3.73\%\) on AFLW2000-3D and from \(5.10\%\) to \(4.64\%\) on AFLW. Furthermore, we conduct ablative experiments with different \(\beta \) for vanilla-joint and different look-ahead steps k in Fig. 6. We observe that \(\beta =0.5\) is the best setting for vanilla-joint, but meta-joint still outperforms it, and \(k=100\) performs best on both AFLW2000-3D and AFLW. Overall, the proposed meta-joint optimization is effective in easing the training and improving the performance.

Landmark-Regression Regularization Benefits. Another contribution is the landmark-regression regularization, which can also be regarded as an auxiliary task to the parameter regression. From Table 3, the improvements from fWPDC to fWPDC w/ lrr. on AFLW2000-3D and AFLW are \(0.15\%\) and \(0.26\%\), and the improvements from meta-joint to meta-joint w/ lrr. are \(0.14\%\) and \(0.14\%\). We also compare the proposed landmark-regression regularization with prior methods [42, 43] that directly impose a landmark constraint on the parameter regression; the results show ours is significantly better: 3.59% vs. 3.71% on AFLW2000-3D. We further evaluate the landmark-regression branch itself on AFLW2000-3D and AFLW: it reaches 3.58% and 4.52% respectively, close to the parameter branch, indicating that the two tasks are highly related. Overall, the landmark-regression regularization benefits the training and promotes the performance.

Short-Video-Synthesis Improves Stability. The last contribution is the 3D aided short-video-synthesis, which is designed to enhance stability on videos by augmenting one still image into a short video within a mini-batch. The results in Table 4 indicate that short-video-synthesis works for both the fWPDC and the meta-joint optimization. With short-video-synthesis and landmark-regression regularization, the performance on still frames improves from \(1.86\%\) to \(1.71\%\) and the stability improves from \(0.52\%\) to \(0.48\%\). We also evaluate randomly applying in-plane and out-of-plane rotations in each mini-batch and find it worse than short-video-synthesis: 1.76%/0.50% vs. 1.71%/0.48%. These results validate the effectiveness of the 3D aided short-video-synthesis.

3.3 Evaluations of Accuracy and Stability

Sparse Face Alignment. We use AFLW2000-3D and AFLW to evaluate sparse face alignment with small, medium and large yaw angles. The results in Table 1 indicate that our method performs better than PRNet on AFLW2000-3D (3.51% vs. 3.62%) and better than 3DDFA-TPAMI [55] on AFLW (4.43% vs. 4.55%). Note that these results are achieved with only 3.27M parameters (24% of PRNet) and 6.2 ms on CPU (3.5% of PRNet). Sampling the 68/21 landmarks from the 3DMM is extremely fast (only 0.01 ms on CPU) and can be ignored.

Dense Face Alignment. Dense face alignment is evaluated on Florence and AFLW2000-3D; our evaluation settings follow [17] for consistency. The results in Table 2 show that the proposed method significantly outperforms the others. As for 3D dense vertex reconstruction, reconstructing the 45K vertices takes only 1 ms on CPU (0.1 ms on GPU) with our regression framework.

Video-Based 3D Face Alignment. We use Menpo-3D to evaluate both accuracy and stability. Table 4 has already shown the superiority of short-video-synthesis, so we compare our method with the recent PRNet [17] in Table 2. The results indicate that our method significantly surpasses PRNet in both accuracy and stability on the Menpo-3D videos, at a much lower computation cost.

3.4 Evaluations of Speed

We compare the number of parameters, the MACs (Multiply-Accumulates, measuring the number of fused multiplication-addition operations) and the running time of our method with others in Table 2. As for running speed, 3DDFA [54] takes 23.2 ms (GPU) to predict parameters and 52.5 ms (CPU) for PNCC construction; DeFA [17] needs 11.8 ms (GPU) to predict 3DMM parameters and 23.6 ms (CPU) for post-processing; VRN [26] detects 68 2D landmarks in 28.4 ms (GPU) and regresses the 3D dense vertices in 40.6 ms (GPU); PRNet [17] predicts the 3D dense vertices in 9.8 ms (GPU) or 175 ms (CPU). Compared with them, our method takes only 2 ms (GPU) or 6.2 ms (CPU) to predict the 3DMM parameters and 0.1 ms (GPU) or 1 ms (CPU) to reconstruct the 3D dense vertices.

Specifically, compared with the recent PRNet [17], our model has less than a quarter of its parameters (3.27M vs. 13.4M) and less than 1/30 of its MACs (183.5M vs. 6190M). We measure the overall running time on a GeForce GTX 1080 GPU and an i5-8259U CPU with four cores. Note that our method takes only 7.2 ms, almost 24\(\times \) faster than PRNet (175 ms). Besides, we benchmark our method on a single CPU core (using only one thread): it runs in about 19.2 ms (over 50 fps), including the reconstruction time. The specific CPU is an i5-8259U @ 2.30 GHz in a 13-inch MacBook Pro.
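As a rough guide, single-thread CPU latency can be measured as follows; the MobileNetV2 stand-in, input resolution and repetition counts are assumptions for illustration, not the exact released model:

```python
import time
import torch
from torchvision.models import mobilenet_v2

torch.set_num_threads(1)                        # restrict inference to one CPU thread
model = mobilenet_v2(num_classes=62).eval()     # stand-in for the trained regressor
x = torch.randn(1, 3, 120, 120)                 # assumed input resolution
with torch.no_grad():
    for _ in range(10):                         # warm-up runs
        model(x)
    t0 = time.perf_counter()
    for _ in range(100):
        model(x)
    print((time.perf_counter() - t0) / 100 * 1e3, "ms per image")
```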

3.5 Analysis of Meta-Joint Optimization

We visualize the auto-selection of fWPDC and VDC in the meta-joint optimization in Fig. 7. Both \(k=100\) and \(k=200\) show the same trend: fWPDC dominates in the early stage and VDC guides the late stage. This trend is consistent with the previous observations and explains why the proposed meta-joint optimization works.

Fig. 7. Auto-selection result of the selector in the meta-joint optimization.

4 Conclusion

In this paper, we have pursued fast, accurate and stable 3D dense face alignment simultaneously. Towards this target, we make three main efforts: (i) proposing a fast WPDC (fWPDC) and the meta-joint optimization, which combines fWPDC and VDC to alleviate the optimization problem; (ii) imposing an extra landmark-regression regularization to push the performance to the state of the art; (iii) proposing the 3D aided short-video-synthesis method to improve the stability on videos. The experimental results demonstrate the effectiveness and efficiency of the proposed methods. Our promising results pave the way for real-time 3D dense face alignment in practical use, and the method's efficiency may also reduce the energy consumed by GPUs and the associated carbon emissions.