
1 Introduction

3D dense face alignment is essential for many face-related tasks, e.g., recognition [6, 12, 23, 25, 41], animation [9], avatar retargeting [8], tracking [46], attribute classification [3, 20, 21], image restoration [10, 11, 47], and anti-spoofing [24, 37, 45, 49, 50]. Recent studies mainly fall into two categories: 3D Morphable Model (3DMM) parameter regression [22, 31, 33, 44, 54, 55, 57] and dense vertex regression [17, 26]. Dense vertex regression methods directly regress the coordinates of all the 3D points (usually more than 20,000) through a fully convolutional network [17, 26], achieving state-of-the-art performance. However, the resolution of the reconstructed faces depends on the size of the feature map, and these methods rely on heavy networks like the hourglass [35] or its variants, which are slow and memory-consuming in inference. A natural way to speed them up is channel pruning. We try pruning 77.5% of the channels of the state-of-the-art PRNet [17] to reach real-time speed on CPU, but find the error increases greatly, by 44.8% (3.62% vs. 5.24%). Besides, an obvious disadvantage is the presence of checkerboard artifacts caused by the deconvolution operators, which is shown in the supplementary material. Another strategy is to regress a small set of 3DMM parameters (usually fewer than 200). Compared with dense vertices, 3DMM parameters have low dimensionality and low redundancy, which makes them appropriate to regress with a lightweight network. However, different 3DMM parameters influence the reconstructed 3D face [54] differently, making the regression challenging: each parameter has to be dynamically re-weighted according to its importance during training. Cascaded structures [33, 54, 55] are often adopted to progressively update the parameters, but the computation cost grows linearly with the number of cascaded stages.

Fig. 1. A few results from our MobileNet (M+R+S) model, which runs at 50 fps on a single CPU core or 130 fps on multiple CPU cores.

In this paper, we aim to reach real-time speed on CPU and state-of-the-art performance simultaneously (Fig. 1). To this end, we choose to regress 3DMM parameters with a fast backbone, e.g., MobileNet. To handle the optimization problem of the parameter-regression framework, we exploit two different loss terms, WPDC and VDC [54] (see Sect. 2.2), and propose a meta-joint optimization that combines their advantages: it looks ahead by k steps with WPDC and VDC on the meta-train batches, then dynamically selects the better one according to the error on the meta-test batch. By doing so, the whole optimization converges faster and achieves better performance than the vanilla-joint optimization. Besides, a landmark-regression regularization is introduced to further alleviate the optimization problem and reach higher accuracy. In addition to single images, 3D face applications on videos are becoming more and more popular [8, 9, 27, 28], where reconstructing stable results across consecutive frames is important; however, this is often ignored by recent methods [17, 26, 54, 55]. Video-based training [16, 32, 36, 40] is commonly adopted to improve stability in 2D face alignment, but no video databases are publicly available for 3D dense face alignment. To address this, we propose a 3D aided short-video-synthesis method, which simulates both in-plane and out-of-plane face motion to transform one still image into a short video, so that our network learns to produce consistent results on consecutive frames. Experiments show our short-video-synthesis method significantly improves the stability on videos.

In general, our proposed method is (i) fast: it takes about 7.2 ms per image (almost 24\(\times \) faster than PRNet) and runs at over 50 fps (19.2 ms) on a single CPU core or over 130 fps (7.2 ms) on multiple CPU cores (i5-8259U processor); (ii) accurate: by dynamically optimizing the 3DMM parameters through a novel meta-optimization strategy combining the fast WPDC and VDC, we surpass the state-of-the-art results [17, 26, 54, 55] under a strict computation budget in inference; and (iii) stable: in a mini-batch, one still image is transformed slightly and smoothly into a short synthetic video involving both in-plane and out-of-plane rotations, which provides temporal information across adjacent frames for training. Extensive experimental results on four datasets show that the overall performance of our method is the best.

2 Methodology

Fig. 2. Overview of our method. Our architecture consists of four parts: a lightweight backbone like MobileNet for predicting 3DMM parameters, the meta-joint optimization of fWPDC and VDC, the landmark-regression regularization, and the short-video-synthesis for training. The landmark-regression branch is discarded in inference, thus adding no computation burden.

This section details our proposed approach. We first review the 3D Morphable Model (3DMM) [5]. Then we introduce the proposed meta-joint optimization, landmark-regression regularization and 3D aided short-video-synthesis. The overall pipeline is illustrated in Fig. 2 and the training procedure is summarized in Algorithm 1.

Algorithm 1

2.1 Preliminary of 3DMM

The original 3DMM can be described as:

$$\begin{aligned} \mathbf {S} = \overline{\mathbf {S}} + \mathbf {A}_{id} \varvec{\alpha }_{id} + \mathbf {A}_{exp} \varvec{\alpha }_{exp}, \end{aligned}$$
(1)

where \(\mathbf {S}\) is the 3D face mesh, \(\overline{\mathbf {S}}\) is the mean 3D shape, \(\varvec{\alpha }_{id}\) is the shape parameter corresponding to the 3D shape base \(\mathbf {A}_{id}\), \(\mathbf {A}_{exp}\) is the expression base and \(\varvec{\alpha }_{exp}\) is the expression parameter. After the 3D face is reconstructed, it can be projected onto the image plane with the scale orthographic projection:

$$\begin{aligned} V_{2d}{(\mathbf {p})}=f * \mathbf {Pr} * \mathbf {R} *\left( \overline{\mathbf {S}}+\mathbf {A}_{id} \varvec{\alpha }_{id}+\mathbf {A}_{exp} \varvec{\alpha }_{exp} \right) + \mathbf {t}_{2d}, \end{aligned}$$
(2)

where \(V_{2d}{(\mathbf {p})}\) is the projection function generating the 2D positions of the model vertices, f is the scale factor, \(\mathbf {Pr}\) is the orthographic projection matrix, \(\mathbf {R}\) is the rotation matrix constructed from the Euler angles pitch, yaw and roll, and \(\mathbf {t}_{2d}\) is the translation vector. The complete parameters of 3DMM are \(\mathbf {p} = [f, \mathrm {pitch}, \mathrm {yaw}, \mathrm {roll}, \mathbf {t}_{2d}, \varvec{\alpha }_{id}, \varvec{\alpha }_{exp}]\).

However, the three Euler angles suffer from gimbal lock [30] when faces are close to the profile view. This ambiguity confuses the regressor and degrades the performance, so we regress the similarity transformation matrix \(\mathbf {T} = f \left[ \mathbf {R}; \mathbf {t}_{3d} \right] \) instead of \([f, \mathrm {pitch}, \mathrm {yaw}, \mathrm {roll}, \mathbf {t}_{2d}]\) to reduce the regression difficulty, where \(\mathbf {T} \in \mathbb {R}^{3 \times 4}\) is constructed from the scale factor f, the rotation matrix \(\mathbf {R}\) and the translation vector \(\mathbf {t}_{3d} = \begin{bmatrix} \mathbf {t}_{2d} \\ 0 \end{bmatrix}\). The scale orthographic projection in Eq. 2 then simplifies to:

$$\begin{aligned} V_{2d}({\mathbf {p}}) = \mathbf {Pr} * \mathbf {T} * \begin{bmatrix} \overline{\mathbf {S}} + \mathbf {A} \varvec{\alpha } \\ \mathbf {1} \end{bmatrix}, \end{aligned}$$
(3)

where \(\mathbf {A} = [\mathbf {A}_{id}, \mathbf {A}_{exp}]\) and \(\varvec{\alpha } = [\varvec{\alpha }_{id}, \varvec{\alpha }_{exp}]\). Our regression objective is described as \(\mathbf {p} = [\mathbf {T}, \varvec{\alpha }]\).

The high-dimensional parameters \(\varvec{\alpha }_{id} \in \mathbb {R}^{199}\) and \(\varvec{\alpha }_{exp} \in \mathbb {R}^{29}\) are redundant: since 3DMM models the 3D face shape with PCA, the trailing components of the parameters have little effect on the face shape. We therefore keep only the first 40 dimensions of \(\varvec{\alpha }_{id}\) and the first 10 dimensions of \(\varvec{\alpha }_{exp}\) as our regression target, since the NME increase is acceptable and the reconstruction is greatly accelerated. The NME error heatmap for different numbers of shape and expression dimensions is shown in the supplementary material. Our complete regression target is thus simplified to \(\mathbf {p} = [\mathbf {T}^{3 \times 4}, \varvec{\alpha }^{50}]\), with 62 dimensions in total, where \(\varvec{\alpha } = [\varvec{\alpha }_{id}^{40}, \varvec{\alpha }_{exp}^{10}]\). To eliminate the negative impact of magnitude differences between \(\mathbf {T}\) and \(\varvec{\alpha }\), Z-score normalization is adopted: \(\mathbf {p} \leftarrow (\mathbf {p} - \varvec{\mu }_{p}) / \varvec{\sigma }_{p}\), where \(\varvec{\mu }_{p} \in \mathbb {R}^{62}\) is the mean of the parameters and \(\varvec{\sigma }_{p} \in \mathbb {R}^{62}\) their standard deviation.
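To make the reconstruction pipeline concrete, the following is a minimal NumPy sketch of Eqs. (3) and (5) applied to the 62-d target, assuming the mean shape, the concatenated basis and the Z-score statistics are loaded from the morphable model files (all names here are illustrative, not the authors' released API):

```python
import numpy as np

def reconstruct_vertices(p_norm, S_bar, A, mu_p, sigma_p, project=False):
    """p_norm: (62,) Z-score-normalized parameters predicted by the network;
    S_bar: (3N,) mean shape; A: (3N, 50) concatenated [A_id(40), A_exp(10)] basis."""
    p = p_norm * sigma_p + mu_p              # undo Z-score normalization
    T = p[:12].reshape(3, 4)                 # similarity transform f * [R; t_3d]
    alpha = p[12:].reshape(50, 1)            # [alpha_id(40); alpha_exp(10)]
    S = (S_bar.reshape(-1, 1) + A @ alpha).reshape(-1, 3).T   # (3, N) vertices
    V = T[:, :3] @ S + T[:, 3:]              # rotate/scale, then translate
    return V[:2] if project else V           # Pr keeps the x, y rows for Eq. (3)
```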

2.2 Meta-Joint Optimization

We first review the Vertex Distance Cost (VDC) and the Weighted Parameter Distance Cost (WPDC) from [54], then derive the meta-joint optimization to facilitate the parameter regression.

The VDC term \(\mathcal {L}_{vdc}\) directly optimizes \(\mathbf {p}\) by minimizing the vertex distances between the fitted 3D face and the ground truth:

$$\begin{aligned} \mathcal {L}_{vdc} = \left\| V_{3d}\left( \mathbf {p} \right) - V_{3d}\left( \mathbf {p}^{g}\right) \right\| ^{2}, \end{aligned}$$
(4)

where \(\mathbf {p}^g\) is the ground truth parameter, \(\mathbf {p}\) is the predicted parameter and \(V_{3d} (\cdot )\) is the 3D face reconstruction formulated as:

$$\begin{aligned} V_{3d}({\mathbf {p}}) = \mathbf {T} * \begin{bmatrix} \overline{\mathbf {S}} + \mathbf {A} \varvec{\alpha } \\ \mathbf {1} \end{bmatrix}. \end{aligned}$$
(5)
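For illustration, a differentiable PyTorch sketch of the VDC term could look as follows, assuming batched, de-normalized parameters and the same basis layout as the NumPy sketch above (names are again illustrative):

```python
import torch

def vdc_loss(p_pred, p_gt, S_bar, A):
    """p_pred, p_gt: (B, 62) de-normalized parameters; S_bar: (3N,); A: (3N, 50)."""
    def v3d(p):                                        # Eq. (5), batched
        T = p[:, :12].reshape(-1, 3, 4)                # (B, 3, 4)
        alpha = p[:, 12:].unsqueeze(-1)                # (B, 50, 1)
        S = (S_bar.view(1, -1, 1) + A @ alpha).reshape(p.shape[0], -1, 3)
        return S @ T[:, :, :3].transpose(1, 2) + T[:, :, 3].unsqueeze(1)
    return ((v3d(p_pred) - v3d(p_gt)) ** 2).mean()     # mean squared vertex distance
```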

Different from VDC, the WPDC term  [54] \(\mathcal {L}_{wpdc}\) assigns different weights to each parameter:

$$\begin{aligned} \mathcal {L}_{wpdc} = \left\| \mathbf {w} \cdot (\mathbf {p} - \mathbf {p}^{g}) \right\| ^2, \end{aligned}$$
(6)

where \(\mathbf {w}\) indicates the importance weight as follows:

$$\begin{aligned} \begin{aligned} \mathbf {w}&= \left( w_1, w_2, \dots , w_i, \dots , w_n \right) , \\ w_i&= \left\| V_{3d}(\mathbf {p}^{de,i}) - V_{3d}(\mathbf {p}^g)\right\| / Z, \\ \mathbf {p}^{de,i}&= \left( \mathbf {p}_1^g, \mathbf {p}_2^g, \dots , \mathbf {p}_i, \dots , \mathbf {p}_n^g \right) , \\ \end{aligned} \end{aligned}$$
(7)

where n is the number of parameters (\(n = 62\) in our regression framework), \(\mathbf {p}^{de, i}\) is the i-degraded parameter whose i-th element comes from the predicted \(\mathbf {p}\), and Z, the maximum of \(\mathbf {w}\), is a normalizing term. The term \(\left\| V_{3d}(\mathbf {p}^{de,i}) - V_{3d}(\mathbf {p}^g)\right\| \) models the importance of the i-th parameter.
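A direct, deliberately naive transcription of Eq. (7) makes the computational cost apparent: every weight \(w_i\) requires one full dense reconstruction, i.e. 62 reconstructions per sample. This is the bottleneck that fWPDC removes below (the sketch reuses the illustrative reconstruct_vertices above):

```python
import numpy as np

def wpdc_weights_naive(p_pred, p_gt, S_bar, A, mu_p, sigma_p):
    """p_pred, p_gt: (62,) normalized parameters; returns w of Eq. (7)."""
    V_gt = reconstruct_vertices(p_gt, S_bar, A, mu_p, sigma_p)
    w = np.empty(62)
    for i in range(62):                       # one dense reconstruction per parameter
        p_de = p_gt.copy()
        p_de[i] = p_pred[i]                   # the i-degraded parameter p^{de,i}
        V_de = reconstruct_vertices(p_de, S_bar, A, mu_p, sigma_p)
        w[i] = np.linalg.norm(V_de - V_gt)
    return w / w.max()                        # Z is the maximum of w
```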

Algorithm 2

fWPDC. The original calculation of \(\mathbf {w}\) in WPDC is rather slow, as each \(w_i\) requires reconstructing all the vertices once, which is a bottleneck for fast training. We find that the vertices need to be reconstructed only once by decomposing the weight calculation into two parts: the similarity transformation matrix \(\mathbf {T}\), and the combined shape and expression parameters \(\varvec{\alpha }\). Therefore, we design a fast implementation of WPDC named fWPDC: (i) reconstructing the vertices without projection, \(\mathbf {S} = \overline{\mathbf {S}} + \mathbf {A} \varvec{\alpha }\), and calculating \(\mathbf {w}_T\) from the norms of the coordinate row vectors; (ii) calculating \(\mathbf {w}_{\alpha }\) from the column norms of \(\mathbf {A}\) and the input scale f: \(\mathbf {w}_\alpha (i) = f \cdot \big ( \varvec{\alpha }(i) - \varvec{\alpha }^g(i) \big ) \cdot \left\| \mathbf {A}\left( :,i \right) \right\| \); (iii) combining them to compute the final cost. The detailed algorithm of fWPDC is described in Algorithm 2. fWPDC reconstructs the dense vertices only once, not 62 times as WPDC, thus greatly reducing the computation cost: with a batch of 128 samples, the original WPDC takes 41.7 ms while fWPDC takes only 3.6 ms. fWPDC is over 10\(\times \) faster than the original WPDC while producing the same outputs.
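The following NumPy sketch follows steps (i)-(iii) above. Because the model is linear in both \(\mathbf {T}\) and \(\varvec{\alpha }\), the vertex displacement caused by perturbing a single parameter reduces to the parameter difference times a fixed norm, so a single reconstruction suffices; exact constants and parameter layout are assumptions and may differ from the authors' implementation:

```python
import numpy as np

def fwpdc_weights(p, p_gt, S_bar, A):
    """p, p_gt: (62,) de-normalized parameters [T(12, row-major), alpha(50)]."""
    S = (S_bar + A @ p_gt[12:]).reshape(-1, 3)          # (i) a single reconstruction
    S_h = np.hstack([S, np.ones((S.shape[0], 1))])      # homogeneous coords [x, y, z, 1]
    coord_norms = np.linalg.norm(S_h, axis=0)           # norms of the x, y, z, 1 "rows"
    w_T = np.abs(p[:12] - p_gt[:12]) * np.tile(coord_norms, 3)  # T[r, c] scales coord c
    f = np.linalg.norm(p_gt[:3])                        # scale f from a row of f * R
    w_alpha = f * np.abs(p[12:] - p_gt[12:]) * np.linalg.norm(A, axis=0)   # (ii)
    w = np.concatenate([w_T, w_alpha])                  # (iii) combine and normalize
    return w / w.max()
```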

Fig. 3. The vertex error during training on 300W-LP under different loss terms. VDC from scratch has the highest error, fWPDC is lower than VDC, and VDC from fWPDC is better than both. When combining VDC and fWPDC, the proposed meta-joint optimization converges faster and reaches a lower error than vanilla-joint, and converges even better when incorporating the landmark-regression regularization.

Exploitation of VDC and fWPDC. Through Eq. 4 and Eq. 6, we find that WPDC/fWPDC is suitable for parameter regression since each parameter is appropriately weighted, while VDC directly reflects the quality of the 3D face reconstructed from the parameters. In Fig. 3, we investigate how these two losses converge as training progresses. The optimization is difficult for VDC: the vertex error is still over 15 when training converges. The work in [55] also demonstrates that optimizing VDC with gradient descent converges very slowly due to the “zig-zagging” problem. In contrast, fWPDC converges much faster than VDC, with an error of about 7 at convergence. Surprisingly, if the fWPDC-trained model is fine-tuned with VDC, we obtain a much lower error than with fWPDC alone. From these observations we conclude that training from scratch with VDC is hard to converge, and that fWPDC does not fully train the network in the late stage.

Meta-Joint Optimization. Based on the above discussion, it is natural to weight the two terms in a vanilla-joint optimization: \(\mathcal {L}_{vanilla\textit{-}joint} = \beta \mathcal {L}_{fwpdc} + (1 - \beta ) \frac{|l_{fwpdc}|}{|l_{vdc}|} \cdot \mathcal {L}_{vdc}\), where \(\beta \in [0, 1]\) controls the relative importance of fWPDC and VDC. However, the vanilla-joint optimization relies on the manually set hyper-parameter \(\beta \) and does not achieve satisfactory results (Fig. 3). Inspired by Lookahead [52] and MAML [18], we propose a meta-joint optimization strategy that dynamically combines fWPDC and VDC; an overview is shown in Fig. 4. During training, the model looks ahead by k steps with fWPDC or VDC on k meta-train batches \(\mathcal {X}_{mtr}\), then selects the better of the two according to the vertex error on the meta-test batch. Specifically, each cycle of the meta-joint optimization consists of four steps: (i) sampling k batches \(\mathcal {X}_{mtr}\) for meta-train and one batch \(\mathcal {X}_{mte}\) for meta-test; (ii) meta-train: updating the current model parameters \(\theta _i\) by k steps with fWPDC and with VDC on \(\mathcal {X}_{mtr}\), yielding two parameter states \(\theta _{i+k}^f\) and \(\theta _{i+k}^v\); (iii) meta-test: evaluating the vertex errors of \(\theta _{i+k}^f\) and \(\theta _{i+k}^v\) on \(\mathcal {X}_{mte}\); (iv) adopting the parameters with the lower error as the new \(\theta _i\). The proposed meta-joint optimization can be directly embedded into the standard training regime. As Fig. 3 shows, it converges faster than the vanilla-joint optimization and reaches a lower error.
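One cycle of the meta-joint optimization can be sketched in PyTorch as follows, assuming loss callables like the sketches above and an iterator yielding (image, ground-truth parameter) batches; the optimizer details are simplified for illustration:

```python
import copy
import torch

def meta_joint_cycle(model, batches, k, fwpdc_loss, vdc_loss, lr=0.01):
    """batches: iterator yielding (images, ground-truth parameters) pairs."""
    meta_train = [next(batches) for _ in range(k)]      # (i) k meta-train batches X_mtr
    x_mte, p_mte = next(batches)                        #     one meta-test batch X_mte
    best_state, best_err = None, float("inf")
    for loss_fn in (fwpdc_loss, vdc_loss):              # (ii) look ahead with each cost
        net = copy.deepcopy(model)                      #     branch from theta_i
        opt = torch.optim.SGD(net.parameters(), lr=lr)
        for x, p_gt in meta_train:                      #     k look-ahead steps
            opt.zero_grad()
            loss_fn(net(x), p_gt).backward()
            opt.step()
        with torch.no_grad():                           # (iii) vertex error on X_mte
            err = vdc_loss(net(x_mte), p_mte).item()
        if err < best_err:
            best_state, best_err = net.state_dict(), err
    model.load_state_dict(best_state)                   # (iv) keep the better theta_{i+k}
```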

2.3 Landmark-Regression Regularization

In 3D face reconstruction [14, 15, 19, 42, 43], the projected 2D sparse landmarks are usually used as an extra regularization to facilitate the parameter regression. In our regression framework, we find that treating 2D sparse landmark prediction as an auxiliary regression task benefits more.

Fig. 4. Overview of the meta-joint optimization.

As shown in Fig. 2, we add an additional landmark-regression task on the global-pooling layer, trained with an L2 loss. The difference between the conventional landmark regularization and our landmark-regression regularization is that the latter introduces extra parameters to regress the landmarks; in other words, it is a task-level regularization. As the tomato curve in Fig. 3 shows, incorporating the landmark-regression regularization yields a lower error. The comparative results in Table 3 show that the proposed landmark-regression regularization outperforms the plain landmark regularization (3.59% vs. 3.71% on AFLW2000-3D). It is formulated as \(\mathcal {L}_{lrr} = \frac{1}{N} \sum _{i=1}^{N} \left\| l_i - l_i^g \right\| _2 ^2\), where N is 136, as we use 68 2D landmarks flattened into a 136-d vector.
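A minimal sketch of this two-head design could look as follows; the backbone, feature dimension and head shapes are illustrative assumptions, not the authors' exact configuration:

```python
import torch
import torch.nn as nn

class TwoHeadRegressor(nn.Module):
    def __init__(self, backbone, feat_dim=1280):        # e.g. MobileNetV2 feature size
        super().__init__()
        self.backbone = backbone                        # ends with global pooling + flatten
        self.param_head = nn.Linear(feat_dim, 62)       # 3DMM parameter regression
        self.lmk_head = nn.Linear(feat_dim, 136)        # 68 landmarks, flattened

    def forward(self, x):
        feat = self.backbone(x)                         # (B, feat_dim)
        if self.training:                               # extra branch used only in training
            return self.param_head(feat), self.lmk_head(feat)
        return self.param_head(feat)                    # branch discarded in inference

lrr = nn.MSELoss()   # L_lrr: mean squared error over the 136-d landmark vector
```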

2.4 3D Aided Short-Video-Synthesis

Video-based 3D face applications have become more and more popular [8, 9, 27, 28] recently. In these applications, 3D dense face alignment methods are required to run on videos and provide stable reconstruction results across adjacent frames. Stability means that the changes of the reconstructed 3D faces across adjacent frames should be consistent with the true face motion at a fine-grained level. However, most existing methods [17, 26, 54, 55] ignore this requirement and their predictions suffer from random jittering. In 2D face alignment, post-processing such as temporal filtering is a common strategy to reduce jittering, but it degrades precision and causes frame delay. Besides, since no public video databases for 3D dense face alignment are available, video training strategies [16, 32, 36, 40] cannot be applied here. A challenge arises: can we improve the stability on videos when only still images are available for training?

To address this challenge, we propose a batch-level 3D aided short-video-synthesis strategy, which expands one still image into several adjacent frames, forming a short synthetic video within a mini-batch. The common patterns in a video are modelled as follows. (i) Noise: we model noise as \(P(x) = x + \mathcal {N}(0, \varSigma )\), where \(\varSigma = \sigma ^2 I\). (ii) Motion blur: motion blur can be formulated as \(M(x) = K *x\), where K is the convolution kernel (the operator \(*\) denotes convolution). (iii) In-plane rotation: given two adjacent frames \(x_{t}\) and \(x_{t+1}\), the in-plane temporal change from \(x_t\) to \(x_{t+1}\) can be described as a similarity transform \(T ( \cdot )\):

$$\begin{aligned} T(\cdot ) = \varDelta s \begin{bmatrix} \cos (\varDelta \theta ) &{} -\sin (\varDelta \theta ) &{} \varDelta t_1 \\ \sin (\varDelta \theta ) &{} \cos (\varDelta \theta ) &{} \varDelta t_2 \end{bmatrix}, \end{aligned}$$
(8)

where \(\varDelta s\) is the scale perturbation, \(\varDelta \theta \) is the rotation perturbation, and \(\varDelta t_1\) and \(\varDelta t_2\) are translation perturbations. (iv) Out-of-plane rotation: since human faces share a similar 3D structure, we can also synthesize out-of-plane face motion. Face profiling [54] \(F (\cdot )\), originally proposed for solving large-pose face alignment, is utilized to progressively increase the yaw angle \(\varDelta \phi \) and pitch angle \(\varDelta \gamma \) of the face. Specifically, we sample several still images in a mini-batch, and for each still image \(x_0\) we transform it slightly and smoothly to generate a synthetic video with n adjacent frames: \(\{ x'_j | x'_{j} = (M \circ P) (x_j), x_{j} = (T \circ F) (x_{j-1}), 1 \le j \le n-1 \} \cup \{x_0\}\). Figure 5 illustrates how these transformations are applied to an image to generate several adjacent frames, and a code sketch is given below.
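A simplified sketch of the frame-synthesis loop follows. Face profiling \(F(\cdot )\) requires the fitted 3D face, so it is passed in as a callable placeholder, a box blur stands in for the motion-blur kernel K, and all perturbation ranges are illustrative:

```python
import numpy as np
import cv2

def synth_short_video(x0, n, profile_fn, sigma=4.0, blur_k=3):
    """x0: uint8 still image; profile_fn: face profiling F (placeholder callable)."""
    frames, x = [x0], x0
    for _ in range(1, n):
        x = profile_fn(x)                               # F: out-of-plane yaw/pitch step
        ds = np.random.uniform(0.98, 1.02)              # T: scale perturbation
        dth = np.random.uniform(-3.0, 3.0)              #    rotation perturbation (deg)
        h, w = x.shape[:2]
        M_sim = cv2.getRotationMatrix2D((w / 2, h / 2), dth, ds)
        M_sim[:, 2] += np.random.uniform(-2.0, 2.0, 2)  #    translation perturbation
        x = cv2.warpAffine(x, M_sim, (w, h))
        noisy = x + np.random.normal(0.0, sigma, x.shape)            # P: Gaussian noise
        blurred = cv2.blur(np.clip(noisy, 0, 255).astype(np.uint8),  # M: blur stand-in
                           (blur_k, blur_k))
        frames.append(blurred)
    return frames                                       # {x0} plus n-1 transformed frames
```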

Fig. 5. An illustration of how two adjacent frames are synthesized in our 3D aided short-video-synthesis.

3 Experiments

In this section, we first introduce the datasets and evaluation protocols; then we report comparison experiments on accuracy and stability; next, the complexity and running speed are evaluated; finally, extensive discussions are made. The implementation details and the generalization and scaling-up ability of our proposed method are given in the supplementary material.

3.1 Datasets and Evaluation Protocols

Five datasets are used in our experiments. 300W-LP [54] (300W Across Large Poses) is composed of synthesized large-pose face images from 300W [38], including AFW [56], LFPW [2], HELEN [53], IBUG [38] and XM2VTS [34]; specifically, the face profiling method [54] is adopted to generate 122,450 samples across large poses. AFLW [29] consists of 21,080 in-the-wild faces (following [22, 55]) with large poses (yaw from -90\(^\circ \) to 90\(^\circ \)); each image is annotated with up to 21 visible landmarks. AFLW2000-3D [54] is constructed by [54] for evaluating 3D face alignment and contains the ground-truth 3D faces and the corresponding 68 landmarks of the first 2,000 AFLW samples. Florence [1] is a 3D face dataset of 53 subjects with ground-truth 3D meshes acquired from a structured-light scanning system; for evaluation, we generate renderings with different poses for each subject following VRN [26] and PRNet [17]. Menpo-3D [51] provides a benchmark for evaluating 3D facial landmark localization in the wild in arbitrary poses; specifically, it provides 3D facial landmarks for 55 videos from the 300-VW [39] competition.

Fig. 6. Ablative results of the vanilla-joint optimization with different \(\beta \) and the meta-joint optimization with different k. Lower NME (%) is better.

Table 1. The NME (%) of different methods on AFLW2000-3D and AFLW. The first and the second best results are highlighted. M, R, S denote the meta-joint optimization, landmark-regression regularization and short-video-synthesis, respectively.
Table 2. The NME (%) on Florence and AFLW2000-3D (Dense), NME (%) / Stability (%) on Menpo-3D, and the running complexity and time of different methods. Our method outputs 3D dense vertices in only 2.1 ms (2 ms for parameter prediction and 0.1 ms for vertex reconstruction) on GPU or 7.2 ms on CPU (6.2 ms for parameter prediction and 1 ms for vertex reconstruction). The first and second best results are highlighted.

Protocols. The protocol on AFLW follows [54] and the Normalized Mean Error (NME) normalized by the bounding box size is reported. Two protocols are applied on AFLW2000-3D: the first follows AFLW, and the other follows [17] to evaluate the NME of 3D face reconstruction normalized by the bounding box size. For Florence, we follow [17, 26] to evaluate the NME of 3D face reconstruction normalized by the outer interocular distance. On Menpo-3D, we evaluate the NME on still frames and the stability across adjacent frames. Following [40], stability is measured as the NME between the predicted offsets and the ground-truth offsets of adjacent frames: at frames \(t-1\) and t, the ground-truth landmark offset is \(\varDelta p = p_t - p_{t-1}\), the predicted offset is \(\varDelta q = q_t - q_{t-1}\), and the error \(\varDelta p - \varDelta q\), normalized by the bounding box size, represents the stability. Since 300W-LP only provides the indices of 68 landmarks, we use the 68 landmarks of Menpo-3D for consistency.
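A small sketch of this stability measure, under illustrative array shapes:

```python
import numpy as np

def stability(pred, gt, bbox_sizes):
    """pred, gt: (F, 68, 2) landmarks over F frames; bbox_sizes: (F,) per frame."""
    dq = np.diff(pred, axis=0)                  # predicted offsets q_t - q_{t-1}
    dp = np.diff(gt, axis=0)                    # ground-truth offsets p_t - p_{t-1}
    err = np.linalg.norm(dq - dp, axis=-1)      # (F-1, 68) offset errors
    return (err / bbox_sizes[1:, None]).mean()  # NME of offsets, bbox-normalized
```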

Table 3. The comparative and ablative results on AFLW2000-3D and AFLW. The mean NMEs (%) across small, medium and large poses on AFLW2000-3D and AFLW are reported. lmk. indicates a landmark constraint on the parameter regression as in [43], and lrr. is the proposed landmark-regression regularization.
Table 4. Comparisons of NME (%)/Stability (%) on Menpo-3D. svs. indicates short-video-synthesis, rnd. indicates applying in-plane and out-of-plane rotations randomly in one mini-batch.

3.2 Ablation Study

To evaluate the effectiveness of the meta-joint optimization and the landmark-regression regularization, we carry out comparative experiments covering our two baselines, VDC and fWPDC; three joint options: (i) VDC from fWPDC: fine-tuning with VDC from the model pre-trained with fWPDC, (ii) vanilla-joint: weighting VDC and fWPDC with the best scalar \(\beta =0.5\), (iii) meta-joint: the proposed meta-joint optimization with the best \(k=100\); and four options of how the 2D landmarks are utilized. From Table 3, Table 4, Fig. 3 and Fig. 6, we draw the following conclusions:

Meta-Joint Optimization Performs Better. Compared with the two baselines VDC and fWPDC, all three joint optimization methods perform better. Among them, the proposed meta-joint performs better than VDC from fWPDC and vanilla-joint: compared with the fWPDC baseline, the mean NME drops from \(4.04\%\) to \(3.73\%\) on AFLW2000-3D and from \(5.10\%\) to \(4.64\%\) on AFLW. Furthermore, we conduct ablative experiments with different \(\beta \) for vanilla-joint and different look-ahead steps k in Fig. 6. We observe that \(\beta =0.5\) is the best setting for vanilla-joint, but meta-joint still outperforms it, and \(k=100\) performs best on both AFLW2000-3D and AFLW. Overall, the proposed meta-joint optimization is effective in easing the training and improving the performance.

Landmark-Regression Regularization Benefits. Another contribution is the landmark-regression regularization, which can also be regarded as an auxiliary task to the parameter regression. From Table 3, the improvements from fWPDC to fWPDC w/ lrr. on AFLW2000-3D and AFLW are \(0.15\%\) and \(0.26\%\), and the improvements from meta-joint to meta-joint w/ lrr. are \(0.14\%\) and \(0.14\%\). We also compare the proposed landmark-regression regularization with prior methods [42, 43] that directly impose a landmark constraint on the parameter regression; the results show ours is significantly better: 3.59% vs. 3.71% on AFLW2000-3D. We further evaluate the landmark-regression branch itself on AFLW2000-3D and AFLW: it reaches 3.58% and 4.52% respectively, close to the parameter branch, indicating that the two tasks are highly related. Overall, the landmark-regression regularization benefits the training and promotes the performance.

Short-Video-Synthesis Improves Stability. The last contribution is the 3D aided short-video-synthesis, which is designed to enhance stability on videos by augmenting one still image into a short video within a mini-batch. The results in Table 4 indicate that short-video-synthesis works for both the fWPDC and the meta-joint optimization. With short-video-synthesis and landmark-regression regularization, the performance on still frames improves from \(1.86\%\) to \(1.71\%\) and the stability improves from \(0.52\%\) to \(0.48\%\). We also evaluate randomly applying in-plane and out-of-plane rotations in each mini-batch and find it worse than short-video-synthesis: 1.76%/0.50% vs. 1.71%/0.48%. These results validate the effectiveness of the 3D aided short-video-synthesis.

3.3 Evaluations of Accuracy and Stability

Sparse Face Alignment. We use AFLW2000-3D and AFLW to evaluate sparse face alignment with small, medium and large yaw angles. The results in Table 1 indicate that our method performs better than PRNet on AFLW2000-3D (3.51% vs. 3.62%) and better than 3DDFA-TPAMI [55] on AFLW (4.43% vs. 4.55%). Note that these results are achieved with only 3.27M parameters (24% of PRNet) and 6.2 ms on CPU (3.5% of PRNet). Sampling the 68/21 landmarks from the 3DMM is extremely fast (only 0.01 ms on CPU) and can be ignored.

Dense Face Alignment. Dense face alignment is evaluated on Florence and AFLW2000-3D; our evaluation settings follow [17] for consistency. The results in Table 2 show that the proposed method significantly outperforms the others. As for 3D dense vertex reconstruction, reconstructing the 45K vertices takes only 1 ms on CPU (0.1 ms on GPU) with our regression framework.

Video-Based 3D Face Alignment. We use Menpo-3D to evaluate both accuracy and stability. Table 4 has already shown the superiority of short-video-synthesis, so we compare our method with the recent PRNet [17] in Table 2. The results indicate that our method significantly surpasses PRNet in both accuracy and stability on the Menpo-3D videos, at a much lower computation cost.

3.4 Evaluations of Speed

We compare the number of parameters, the MACs (Multiply-Accumulates, measuring the number of fused multiplication-addition operations) and the running time of our method with others in Table 2. As for running speed, 3DDFA [54] takes 23.2 ms (GPU) to predict parameters and 52.5 ms (CPU) for PNCC construction; DeFA [17] needs 11.8 ms (GPU) to predict 3DMM parameters and 23.6 ms (CPU) for post-processing; VRN [26] detects 68 2D landmarks in 28.4 ms (GPU) and regresses the 3D dense vertices in 40.6 ms (GPU); PRNet [17] predicts the 3D dense vertices in 9.8 ms (GPU) or 175 ms (CPU). Compared with them, our method takes only 2 ms (GPU) or 6.2 ms (CPU) to predict the 3DMM parameters and 0.1 ms (GPU) or 1 ms (CPU) to reconstruct the 3D dense vertices.

Specifically, compared with the recent PRNet [17], our model has less than a quarter of its parameters (3.27M vs. 13.4M) and less than 1/30 of its MACs (183.5M vs. 6190M). We measure the overall running time on a GeForce GTX 1080 GPU and an i5-8259U CPU with four cores. Note that our method takes only 7.2 ms, almost 24\(\times \) faster than PRNet (175 ms). Besides, we benchmark our method on a single CPU core (using only one thread): it runs in about 19.2 ms (over 50 fps), including the reconstruction time. The specific CPU is an i5-8259U @ 2.30 GHz in a 13-inch MacBook Pro.
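As a rough guide, single-thread CPU latency can be measured as follows; the MobileNetV2 stand-in, input resolution and repetition counts are assumptions for illustration, not the exact released model:

```python
import time
import torch
from torchvision.models import mobilenet_v2

torch.set_num_threads(1)                        # restrict inference to one CPU thread
model = mobilenet_v2(num_classes=62).eval()     # stand-in for the trained regressor
x = torch.randn(1, 3, 120, 120)                 # assumed input resolution
with torch.no_grad():
    for _ in range(10):                         # warm-up runs
        model(x)
    t0 = time.perf_counter()
    for _ in range(100):
        model(x)
    print((time.perf_counter() - t0) / 100 * 1e3, "ms per image")
```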

3.5 Analysis of Meta-Joint Optimization

We visualize the auto-selection of fWPDC and VDC in the meta-joint optimization in Fig. 7. Both \(k=100\) and \(k=200\) show the same trend: fWPDC dominates in the early stage and VDC guides the late stage. This trend is consistent with the previous observations and explains why the proposed meta-joint optimization works.

Fig. 7. Auto-selection result of the selector in the meta-joint optimization.

4 Conclusion

In this paper, we have pursued fast, accurate and stable 3D dense face alignment simultaneously. Towards this target, we make three main efforts: (i) proposing a fast WPDC (fWPDC) and the meta-joint optimization, which combines fWPDC and VDC to alleviate the optimization problem; (ii) imposing an extra landmark-regression regularization to push the performance to the state of the art; (iii) proposing the 3D aided short-video-synthesis method to improve the stability on videos. The experimental results demonstrate the effectiveness and efficiency of the proposed methods. Our promising results pave the way for real-time 3D dense face alignment in practical use, and the method's efficiency may also reduce the energy consumed by GPUs and the associated carbon emissions.