1 Introduction

Facial landmark localization aims to detect a sparse set of facial fiducial points on a human face, such as the eye corners, nose tip, and chin center. In a face analysis pipeline, a landmark detector takes as input a face image and the bounding box provided by face detection, and outputs the coordinates of a set of predefined landmarks, as illustrated in Fig. 5.1. These landmarks provide a fine-grained description of the face topology, such as facial feature locations and face region contours, which is essential for many face analysis tasks, e.g., recognition [32], animation [33], attribute classification [34], and face editing [35]. These applications usually run on lightweight devices in uncontrolled environments, requiring landmark detectors to be accurate, robust, and computationally efficient, all at the same time.

Over the past few decades, there have been significant developments in facial landmark detection. Early works treat landmark localization as the process of moving and deforming a face model to fit an image, and construct a statistical facial model to capture the shape and albedo variations of human faces. The most prominent algorithms include the Active Shape Model (ASM) [42], the Active Appearance Model (AAM) [43], and the Constrained Local Model (CLM) [44], which handle faces in controlled environments (normal lighting and frontal poses) well. However, these methods deteriorate greatly under in-the-wild challenges such as large poses, extreme illumination, low resolution, and partial occlusion. The next wave of methods is based on cascaded regression [45, 88, 89], which chains a sequence of weak regressors to reduce the alignment error progressively. For example, the Supervised Descent Method (SDM) [88] updates the landmark locations over several regression iterations. In each iteration, a regressor takes as input appearance features (e.g., SIFT) extracted around the current landmarks and estimates a landmark update that moves them toward the ground-truth locations. The Ensemble of Regression Trees (ERT) [45] learns an ensemble of regression trees to regress the landmarks from a sparse subset of intensity values, which also allows it to handle partial or uncertain labels. Dlib [46], one of the most popular landmark detection libraries, implements ERT as its landmark detector due to its high speed of about 1 millisecond per face.

Fig. 5.1

Facial landmark localization

Following the great success of deep learning in computer vision [47], researchers started to predict facial landmarks with deep convolutional neural networks. In general, deep learning-based landmark detectors can be divided into coordinate-based and heatmap-based methods, illustrated in Fig. 5.2, depending on the detection head of the network architecture. Coordinate-based methods output a vector consisting of the 2D coordinates of the landmarks. In contrast, heatmap-based methods output one heatmap per landmark, where the intensity value at each position indicates the probability that the landmark is located there. It is commonly agreed [38, 39] that heatmap-based methods detect more accurate landmarks but are computationally expensive and sensitive to outliers, whereas coordinate-based methods are fast and robust but have sub-optimal accuracy.

Fig. 5.2

Coordinate-based methods and heatmap-based methods

In recent years, 3D landmark localization has attracted increasing attention due to the additional geometric information it provides and its superiority in handling large poses [40]. However, localizing 3D landmarks is more challenging than 2D landmarks because recovering depth from a monocular image is an ill-posed problem. This requires the model to build a strong 3D face prior from large-scale 3D data in order to accurately detect and locate the facial landmarks in 3D space. Unfortunately, acquiring 3D faces is expensive, and labeling 3D landmarks is tedious. A feasible solution is to fit a 3D Morphable Model (3DMM) [41] with a neural network [40] and sample the 3D landmarks from the fitted 3D model. Another is to use a fully convolutional network to regress 3D heatmaps, from which the coordinates with the largest probabilities are taken as the 3D landmarks [51, 52].

2 Coordinate Regression

As deep learning has become the mainstream method for facial landmark localization, this section focuses on recent advances in deep learning-based coordinate regression approaches. Given an input face image, coordinate regression-based methods predict the 2D coordinates of a set of predefined facial landmarks directly from the deep features extracted by a backbone network, as shown in Fig. 5.3.

Fig. 5.3

Coordinate regression-based facial landmark localization. The input is an RGB face image, and the output is a vector consisting of the 2D coordinates of all the facial landmarks

2.1 Coordinate Regression Framework

The task of coordinate regression-based facial landmark localization is to find a nonlinear mapping function (usually a deep CNN model):

$$\begin{aligned} \Phi : \mathcal {I} \rightarrow \textbf{s}, \end{aligned}$$
(5.1)

that outputs the 2D coordinates vector \(\textbf{s} \in \mathbb {R}^{2L}\) of L landmarks for a given facial image \(\mathcal {I} \in \mathbb {R}^{H \times W \times 3}\). In general, the input image is cropped by using a bounding box obtained by a face detector in a full-stack facial image/video analysis pipeline. The 2D coordinate vector \(\textbf{s} = [x_1, ..., x_L, y_1, ..., y_L]^T\) consists of the coordinates of L predefined landmarks, where \((x_l, y_l)\) are the X- and Y-coordinates of the lth landmark.

To obtain the above mapping function, a deep neural network can be used, which is formulated as a compositional function:

$$\begin{aligned} \Phi = (\phi _1 \circ ... \circ \phi _M)(\mathcal {I}), \end{aligned}$$
(5.2)

with M sub-functions, where each sub-function \(\phi \) represents a specific network layer, e.g., a convolutional layer or a nonlinear activation layer. Most existing deep learning-based facial landmark localization approaches use a CNN as the backbone with a regression output layer [24,25,26].

Given a set of labeled training samples \(\Omega = \{\mathcal {I}_i, \textbf{s}_i\}_{i=1}^{N}\), network training aims to find the parameters of \(\Phi \) that minimize:

$$\begin{aligned} \sum _{i=1}^{N} loss(\Phi (\mathcal {I}_i), \textbf{s}_i), \end{aligned}$$
(5.3)

where loss() is a predefined loss function that measures the difference between the predicted and ground-truth coordinates over all the training samples. To optimize the above objective function, a variety of optimization methods, such as Stochastic Gradient Descent (SGD) and AdamW, can be used for network training.
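To make the framework concrete, the following minimal sketch (in PyTorch) pairs a generic CNN backbone with a fully connected regression head that outputs the \(2L\)-dimensional coordinate vector, and runs one optimization step on Eq. (5.3) with an L2-type loss and AdamW. The backbone choice, input size, and hyper-parameters are illustrative assumptions rather than the configuration of any method cited above.

```python
import torch
import torch.nn as nn
import torchvision.models as models

class CoordinateRegressor(nn.Module):
    """Minimal coordinate-regression landmark detector (illustrative sketch)."""

    def __init__(self, num_landmarks: int = 68):
        super().__init__()
        # Any CNN backbone can be used; ResNet-18 is an arbitrary choice here.
        backbone = models.resnet18(weights=None)
        backbone.fc = nn.Identity()                      # keep the 512-d pooled feature
        self.backbone = backbone
        self.head = nn.Linear(512, 2 * num_landmarks)    # outputs [x1..xL, y1..yL]

    def forward(self, images: torch.Tensor) -> torch.Tensor:
        return self.head(self.backbone(images))

# One training step minimizing Eq. (5.3) with an L2-type loss and AdamW.
model = CoordinateRegressor(num_landmarks=68)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
criterion = nn.MSELoss()

images = torch.randn(8, 3, 256, 256)     # a dummy mini-batch of cropped faces
targets = torch.rand(8, 2 * 68)          # normalized ground-truth coordinates

optimizer.zero_grad()
loss = criterion(model(images), targets)
loss.backward()
optimizer.step()
```

In practice the plain MSE criterion above is often replaced by one of the robust losses discussed in Sect. 2.3.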

2.2 Network Architectures

As shown in Fig. 5.3, the input to a coordinate regression-based facial landmark localization model is usually an image enclosing the whole face region. A backbone CNN is then used for feature extraction, and fully connected layers regress the landmark coordinates. With the development of deep learning, different backbone networks have been explored and evaluated for accurate and robust landmark localization. For example, Feng et al. [38] evaluated different backbones, including VGG, ResNet, and MobileNet, for efficient and high-performance facial landmark localization, and compared their inference speed on different devices, including GPUs, CPUs, and portable devices. As face landmarking is a key element in a full-stack facial image/video analysis system, the design of lightweight networks is crucial for real-time applications. For instance, Guo et al. [18] developed a light framework that is only 2.1 MB and runs at 140 fps on a mobile device. Gao et al. [19] proposed EfficientFAN, which applies deep knowledge transfer via a teacher-student network for efficient and effective network training.

Instead of the whole face image, shape- or landmark-related local patches have also been widely used as the input of neural networks [24, 83]. In this case, CNN-based feature extraction is applied to local patches centered at each landmark for fine-grained landmark prediction or update [83]. The advantage of using the whole face region, where the only input of the network is a cropped face image, is that it does not require an initialization of the facial landmarks. In contrast, a patch-based system usually requires initial estimates of the facial landmarks for a given image. These can be obtained either from the mean-shape landmarks [83] or from the output of another network that predicts coarse landmarks [24, 27, 61].

Fig. 5.4

A two-stage coarse-to-fine facial landmark localization framework

The accuracy of landmark localization can be degraded by in-plane face rotations and inaccurate bounding boxes output by a face detector. To address these issues, a widely used strategy is to cascade multiple networks to form a coarse-to-fine structure. For example, Huang et al. [28] proposed to use a global network to obtain coarse facial landmarks for transforming a face to the canonical view and then applied multiple networks trained on different facial parts for landmark refinement. Similarly, both Yang et al. [29] and Deng et al. [30] proposed to train a network that predicts a small number of facial landmarks (5 or 19) to transform the face to a canonical view. It should be noted that the first network can be trained on a large-scale dataset so it performs well for unconstrained faces with in-plane head rotation, scale, and translation. With the first stage, the subsequent networks that predict all the landmarks can be trained with the input of normalized faces.

Feng et al. [38] also proposed a two-stage network for facial landmark localization, as shown in Fig. 5.4. The coarse network is trained on a dataset with very heavy data augmentation, obtained by randomly rotating each original training image within \([-180^{\circ },180^{\circ }]\) and perturbing the bounding box by \(20\%\) of the original bounding box size. Such a network performs well for faces with large in-plane head rotations and low-quality bounding boxes. For training the second network, each training sample is fed to the first network to obtain its coarse facial landmarks for geometric normalization. To be specific, two anchor points (blue points in Fig. 5.4) are computed to perform the rigid transformation: one anchor is the mean of the four inner eye and eyebrow corners, and the other is the chin landmark. Afterward, the normalized training data is lightly augmented by randomly rotating the image within \([-10^{\circ }, 10^{\circ }]\) and perturbing the bounding box by \(10\%\) of its size, which accounts for the residual errors of the first network's predictions. Finally, the second network is trained on the normalized and lightly augmented dataset to further boost localization accuracy. The joint use of these two networks in a coarse-to-fine fashion is instrumental in enhancing the generalization capacity and accuracy.
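The geometric normalization step can be sketched as follows: a similarity transform is estimated from the two anchor points and applied to the image and coarse landmarks. The landmark indices (for a 68-point markup) and the target anchor positions in the canonical view are assumptions of this sketch, not the exact settings of [38].

```python
import numpy as np
import cv2

def two_point_similarity(src: np.ndarray, dst: np.ndarray) -> np.ndarray:
    """Similarity transform (rotation + scale + translation) mapping two
    source anchor points onto two target anchor points; returns a 2x3 matrix."""
    s_vec, d_vec = src[1] - src[0], dst[1] - dst[0]
    scale = np.linalg.norm(d_vec) / np.linalg.norm(s_vec)
    angle = np.arctan2(d_vec[1], d_vec[0]) - np.arctan2(s_vec[1], s_vec[0])
    cos_a, sin_a = scale * np.cos(angle), scale * np.sin(angle)
    rot = np.array([[cos_a, -sin_a], [sin_a, cos_a]])
    t = dst[0] - rot @ src[0]
    return np.hstack([rot, t[:, None]])

def normalize_face(image, coarse_landmarks, out_size=256):
    """Warp a face to the canonical view using two anchor points: the mean of
    the inner eye/eyebrow corners and the chin landmark (as described above)."""
    # Hypothetical indices for a 68-point markup.
    inner_anchor = coarse_landmarks[[21, 22, 39, 42]].mean(axis=0)
    chin_anchor = coarse_landmarks[8]
    src = np.float32([inner_anchor, chin_anchor])
    # Assumed target anchor positions in the normalized image.
    dst = np.float32([[out_size * 0.5, out_size * 0.3],
                      [out_size * 0.5, out_size * 0.9]])
    M = two_point_similarity(src, dst)
    warped = cv2.warpAffine(image, M.astype(np.float32), (out_size, out_size))
    warped_lms = coarse_landmarks @ M[:, :2].T + M[:, 2]
    return warped, warped_lms
```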

2.3 Loss Functions

Another important element for high-performance coordinates regression is the design of a proper loss function. Most existing regression-based facial landmark localization approaches with deep neural networks are based on the L2 loss function. Given a training image \(\mathcal {I}\) and a network \(\Phi \), we can predict the facial landmarks as a vector \(\textbf{s}' = \Phi (\mathcal {I})\). The loss function is defined as:

$$\begin{aligned} loss(\textbf{s}, \textbf{s}') = \frac{1}{2L}\sum _{i=1}^{2L} f(s_i - s'_i), \end{aligned}$$
(5.4)

where \(\textbf{s}\) is the ground-truth facial landmark coordinates and \(s_i\) is its ith element. For f(x) in the above equation, the L2 loss is defined as:

$$\begin{aligned} f_{L2}(x) = \frac{1}{2}x^2. \end{aligned}$$
(5.5)

However, it is well known that the L2 loss function is sensitive to outliers, which has been noted in connection with many existing studies, such as the bounding box regression problem in face detection [31]. To address this issue, L1 and smooth L1 loss functions are widely used for robust regression. The L1 loss is defined as:

$$\begin{aligned} f_{L1}(x) = |x|. \end{aligned}$$
(5.6)

The smooth L1 loss is defined piecewise as:

$$\begin{aligned} f_{smL1}(x) = \left\{ \begin{array}{ll} \frac{1}{2}x^2 &{} \text {if } |x| < 1 \\ |x| - \frac{1}{2} &{} \text {otherwise} \end{array} \right. , \end{aligned}$$
(5.7)

which is quadratic for small values and linear for large values [31]. More specifically, smooth L1 uses \(f_{L2}(x)\) for \(x\in (-1,1)\) and a shifted L1, \(|x| - \frac{1}{2}\), elsewhere. Figure 5.5 depicts the plots of these three loss functions.

Fig. 5.5

Plots of the L2, L1 and smooth L1 loss functions
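These three losses follow directly from Eqs. (5.5)-(5.7); the short NumPy sketch below writes them out for reference (evaluating them over a range of errors reproduces curves like those in Fig. 5.5).

```python
import numpy as np

def f_l2(x):
    return 0.5 * x ** 2                                   # Eq. (5.5)

def f_l1(x):
    return np.abs(x)                                      # Eq. (5.6)

def f_smooth_l1(x):
    x = np.abs(x)
    return np.where(x < 1.0, 0.5 * x ** 2, x - 0.5)       # Eq. (5.7)

# Per-coordinate losses are averaged over all 2L elements as in Eq. (5.4).
errors = np.linspace(-3, 3, 7)
print(f_l2(errors), f_l1(errors), f_smooth_l1(errors))
```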

However, outliers are not the only subset of points which deserve special consideration. Feng et al. [38] argued that the behavior of the loss function at points exhibiting small-medium errors is just as crucial to finding a good solution to the landmark localization task. Based on a more detailed analysis, they proposed a new loss function, namely Rectified Wing (RWing) loss, for coordinate regression-based landmark localization. Similar to the original Wing loss function, RWing is also defined piecewise:

$$\begin{aligned} RWing(x) = \left\{ \begin{array}{ll} 0 &{} \text {if } |x| < r \\ w \ln (1 + (|x|-r)/\epsilon ) &{} \text {if } r \le |x| < w \\ |x| - C &{} \text {otherwise} \end{array} \right. , \end{aligned}$$
(5.8)

where the non-negative parameter r sets the rectified region \((-r, r)\) for very small errors, the aim being to remove the impact of noisy labels on network convergence. For a training sample with small-medium errors in \([r, w)\), RWing uses a modified logarithm function, where \(\epsilon \) limits the curvature of the nonlinear region and \( C = w - w\ln ({1 + (w -r)/\epsilon })\) is a constant that smoothly links the linear and nonlinear parts. Note that one should not set \(\epsilon \) to a very small value, because this would make network training very unstable and cause exploding gradients for small errors. In fact, the nonlinear part of the RWing loss simply takes a part of the curve of \(\ln (x)\) and scales it along both the X-axis and the Y-axis. RWing also applies a translation along the Y-axis so that \(RWing(\pm r) = 0\) and the loss function is continuous at \(\pm w\). In Fig. 5.6, examples of the RWing loss with different hyper-parameters are shown.

Fig. 5.6

The Rectified Wing loss function plotted with different hyper parameters, where r and w limit the range of the nonlinear part and \(\epsilon \) controls the curvature. By design, the impact of the samples with small- and medium-range errors is amplified, and the impact of the samples with very small errors is ignored
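A PyTorch sketch of Eq. (5.8) is given below; the default hyper-parameter values are placeholders rather than the settings reported in [38].

```python
import torch

def rwing_loss(pred, target, r=0.5, w=10.0, epsilon=2.0):
    """Rectified Wing loss of Eq. (5.8), applied element-wise to coordinate
    errors and averaged. Hyper-parameter defaults are placeholders."""
    x = torch.abs(pred - target)
    C = w - w * torch.log(torch.tensor(1.0 + (w - r) / epsilon))
    # Nonlinear part; clamping keeps the log argument valid for all x.
    inner = w * torch.log(1.0 + torch.clamp(x - r, min=0.0) / epsilon)
    loss = torch.where(
        x < r,
        torch.zeros_like(x),                 # rectified region: ignore tiny errors
        torch.where(x < w, inner, x - C),    # amplified small-medium errors, then linear
    )
    return loss.mean()
```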

3 Heatmap Regression

Another main category of state-of-the-art facial landmark localization methods is heatmap regression. Different from coordinate regression, heatmap regression outputs a heatmap for each facial landmark, in which the intensity value of a pixel indicates the probability that the corresponding landmark is located at that pixel. The task of heatmap regression-based facial landmark localization is to find a nonlinear mapping function:

$$\begin{aligned} \Phi : \mathcal {I} \rightarrow \mathcal {H}, \end{aligned}$$
(5.9)

that outputs L 2D heatmaps \(\mathcal {H} \in \mathbb {R}^{H \times W \times L}\) for a given image \(\mathcal {I} \in \mathbb {R}^{H \times W \times 3}\). As shown in Fig. 5.7, heatmap regression usually uses an encoder-decoder architecture for heatmap generation. For network training, typical loss functions used for heatmap generation include MSE and L1.
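In practice, the ground-truth heatmap for each landmark is usually rendered as a 2D Gaussian centered at the landmark position. A minimal NumPy sketch, with an assumed standard deviation, is shown below.

```python
import numpy as np

def make_gaussian_heatmaps(landmarks, height, width, sigma=2.0):
    """Render one 2D Gaussian per landmark as the regression target.
    `landmarks` is an (L, 2) array of (x, y) pixel coordinates; the output
    has shape (height, width, L). Sigma is an illustrative choice."""
    ys, xs = np.mgrid[0:height, 0:width]
    heatmaps = np.zeros((height, width, len(landmarks)), dtype=np.float32)
    for i, (x, y) in enumerate(landmarks):
        heatmaps[..., i] = np.exp(-((xs - x) ** 2 + (ys - y) ** 2) / (2 * sigma ** 2))
    return heatmaps
```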

Fig. 5.7

Heatmap regression-based facial landmark localization. The input is a face image and the outputs are L 2D heatmaps, one for each predefined facial landmark. The backbone network usually has an encoder-decoder architecture

Fig. 5.8

A typical architecture of a stacked hourglass network

3.1 Network Architectures

As aforementioned, heatmap regression usually applies an encoder-decoder architecture for high-performance facial landmark localization. The most popular backbone used for heatmap regression is arguably the stacked hourglass network [29, 30, 55, 68]. The key to its success is the use of multiple hourglass networks with residual connections, as shown in Fig. 5.8. On the one hand, the residual connections in each hourglass maintain multi-scale facial features for fine-grained heatmap generation. On the other hand, stacking multiple hourglass networks increases the overall network capacity and thus the quality of the generated heatmaps. Besides the stacked hourglass network, two other popular architectures used for heatmap regression are HRNet [75] and U-Net [77]. Similar to the hourglass, both HRNet and U-Net seek an effective way of exploiting multi-scale features, rather than relying solely on a deep, high-level semantic feature map for heatmap generation.

To reduce false alarms of a generated 2D heatmap, Wu et al. [22] proposed a distance-aware softmax function that facilitates the training of a dual-path network. Lan et al. [79] further investigated the issue of quantization error in heatmap regression, and proposed a heatmap-in-heatmap method for improving the prediction accuracy of facial landmarks. Instead of using a Gaussian map for each facial landmark, Wu et al. [68] proposed to create a boundary heatmap mask for feature map fusion and demonstrated its merits in robust facial landmark localization.
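To relate heatmaps back to coordinates, a common decoding step takes the peak location of each heatmap and applies a small sub-pixel shift toward the higher neighboring response to compensate for the limited heatmap resolution. The sketch below shows this generic decoding; it is not the heatmap-in-heatmap method of [79], whose details differ.

```python
import numpy as np

def decode_heatmaps(heatmaps, stride=4):
    """Convert (L, H, W) heatmaps to (L, 2) coordinates in input-image pixels.
    A quarter-pixel shift toward the larger neighbour reduces quantization error."""
    num, h, w = heatmaps.shape
    coords = np.zeros((num, 2), dtype=np.float32)
    for i, hm in enumerate(heatmaps):
        y, x = np.unravel_index(np.argmax(hm), hm.shape)
        fx, fy = float(x), float(y)
        if 0 < x < w - 1:
            fx += 0.25 * np.sign(hm[y, x + 1] - hm[y, x - 1])
        if 0 < y < h - 1:
            fy += 0.25 * np.sign(hm[y + 1, x] - hm[y - 1, x])
        coords[i] = (fx * stride, fy * stride)    # map back to input resolution
    return coords
```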

3.2 Loss Function

Similar to coordinate regression, the design of a proper loss function is crucial for heatmap regression-based facial landmark localization. Most existing heatmap regression methods use the MSE or L1 loss for heatmap generation via an encoder-decoder network. However, a model trained with the MSE or L1 loss tends to predict blurry and dilated heatmaps whose foreground intensities are lower than those of the ground truth. To address this issue, Wang et al. [76] proposed the adaptive Wing loss for heatmap regression. In contrast to the original Wing loss [20], the adaptive Wing loss is tailored to heatmap generation: it adapts its shape to different types of ground-truth heatmap pixels, penalizing errors more on foreground pixels and less on background pixels, which improves the quality of the generated heatmaps and the accuracy of the final landmark localization.

To be specific, the adaptive Wing loss function is defined as:

$$\begin{aligned} AWing(y, \hat{y}) = \left\{ \begin{array}{ll} w \ln (1 + |\frac{y-\hat{y}}{\epsilon }|^{\alpha -y}) &{} \text { if } |y-\hat{y}| < \theta \\ A|y - \hat{y}| - C &{} \text { otherwise} \end{array} \right. , \end{aligned}$$
(5.10)

where y and \(\hat{y}\) are the intensities of the pixels on the ground-truth and predicted heatmaps, respectively. w, \(\theta \), \(\epsilon \) and \(\alpha \) are positive values, and \(A = w(1/(1 + (\theta /\epsilon )^{(\alpha -y)} ))(\alpha - y)\) \(((\theta /\epsilon )^{(\alpha - y - 1)})(1/\epsilon )\) and \(C = (\theta A - w \ln (1+(\theta / \epsilon )^{\alpha - y}))\) are chosen so that the two parts of the loss function are linked continuously and smoothly at \(|y - \hat{y}| = \theta \). Unlike the Wing loss, which uses w as the threshold, the adaptive Wing loss introduces a new variable \(\theta \) as the threshold for switching between the linear and nonlinear parts. Since a deep network for heatmap regression usually regresses values between 0 and 1, the adaptive Wing loss sets the threshold in this range. When \(|y - \hat{y}| < \theta \), the adaptive Wing loss considers the error to be small and applies a stronger influence to it. More importantly, the loss adopts the exponent \(\alpha - y\), which adapts the shape of the loss function to y and makes the loss smooth at the origin.

It should be noted that adaptive Wing loss is able to adapt its curvature to the ground-truth pixel values. This adaptive property reduces small errors on foreground pixels for accurate landmark localization, while tolerating small errors on background pixels for better convergence of a network.
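A PyTorch sketch of Eq. (5.10), applied pixel-wise to a predicted heatmap and its ground truth, is given below; the hyper-parameter defaults are placeholders.

```python
import torch

def adaptive_wing_loss(pred, target, w=14.0, theta=0.5, epsilon=1.0, alpha=2.1):
    """Adaptive Wing loss of Eq. (5.10) on heatmap pixels, averaged over all
    pixels. Hyper-parameter defaults are placeholders, not tuned values."""
    y, y_hat = target, pred
    diff = torch.abs(y - y_hat)
    power = alpha - y                                         # pixel-adaptive exponent
    A = (w * (1.0 / (1.0 + (theta / epsilon) ** power)) * power
         * ((theta / epsilon) ** (power - 1.0)) / epsilon)
    C = theta * A - w * torch.log(1.0 + (theta / epsilon) ** power)
    nonlinear = w * torch.log(1.0 + (diff / epsilon) ** power)  # small errors
    linear = A * diff - C                                        # large errors
    return torch.where(diff < theta, nonlinear, linear).mean()
```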

4 Training Strategies

4.1 Data Augmentation

For a deep learning-based facial landmark localization method, a key to successful network training is large-scale labeled training data. However, manually labeling a large-scale dataset with facial landmarks is a difficult and tedious task. To mitigate this issue, effective data augmentation has become an essential alternative. Existing data augmentation approaches in facial landmark localization usually inject geometric and textural variations into the training images. These approaches are efficient to implement and can thus easily be performed online during network training.

To investigate the impact of these data augmentation methods on the performance of a facial landmark localization model, Feng et al. [26] introduced different data augmentation approaches and performed a systematic analysis of their effectiveness in the context of deep learning-based facial landmark localization. They divided existing data augmentation techniques into two categories, textural and geometric augmentation, as shown in Fig. 5.9. Textural data augmentation includes Gaussian blur, salt-and-pepper noise, color jittering, and random occlusion. Geometric data augmentation consists of horizontal image flipping, bounding box perturbation, rotation, and shear transformation. According to the experimental results, all data augmentation approaches improve the accuracy of the baseline model. However, the key finding is that geometric data augmentation is more effective than textural data augmentation for performance boosting. Furthermore, the joint use of all data augmentation approaches performs better than using any single augmentation method.

Fig. 5.9

Different geometric and textural data augmentation approaches for facial landmark localization. “bbox” refers to “bounding box”

In addition, Feng et al. [26] argued that, by applying random textural and geometric variations to the original labeled training images, some augmented samples become harder and thus more effective for deep network training, while others are less effective. To select the most effective augmented training samples, they proposed a Hard Augmented Example Mining (HAEM) method. In essence, HAEM selects from each mini-batch the N samples that exhibit the largest losses, but excludes the single sample with the dominant loss. The reason for this conservative choice is that some samples generated by random data augmentation may be too difficult for network training; such samples become “outliers” that could disturb the convergence of the training. Thus, in each mini-batch, HAEM identifies the \(N+1\) hardest samples and discards the hardest one to define the hard sample set.
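A minimal sketch of this mining step, assuming the per-sample losses of a mini-batch have already been computed, might look as follows.

```python
import torch

def select_hard_examples(per_sample_losses: torch.Tensor, num_hard: int):
    """Hard example mining as described above: take the N+1 samples with the
    largest losses in a mini-batch and drop the single hardest one, which is
    treated as a potential outlier."""
    k = min(num_hard + 1, per_sample_losses.numel())
    _, top_idx = torch.topk(per_sample_losses, k)   # sorted, largest first
    hard_idx = top_idx[1:]                          # discard the dominant loss
    return hard_idx, per_sample_losses[hard_idx].mean()
```

During training, the mean over the selected hard set (or the selected indices) would replace the plain batch mean when back-propagating.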

4.2 Pose-Based Data Balancing

Existing facial landmark localization methods have achieved good performance for faces in the wild. However, extreme pose variations remain very challenging. To mitigate this problem, Feng et al. [20] proposed a simple but very effective Pose-based Data Balancing (PDB) strategy. PDB argues that the difficulty of accurately localizing landmarks on faces with large poses is mainly due to data imbalance, a well-known problem in many computer vision applications [21].

To perform pose-based data balancing, PDB applies Principal Component Analysis (PCA) to the aligned shapes and projects them onto a one-dimensional space defined by the shape eigenvector (pose space) controlling pose variations. To be more specific, for a training dataset \(\{\textbf{s}_i\}_{i=1}^N\) with N samples, where \(\textbf{s}_i \in \mathbb {R}^{2L}\) is the ith training shape vector consisting of the 2D coordinates of all L landmarks, Procrustes Analysis is first used to align all the training shapes to a reference shape, i.e., the mean shape, using rigid transformations. PDB then approximates any training shape, or a new shape \(\textbf{s}\), using a statistical linear shape model:

$$\begin{aligned} \textbf{s} \approx \bar{\textbf{s}} + \sum _{j=1}^{N_s} p_j \textbf{s}_j^*, \end{aligned}$$
(5.11)

where \(\bar{\textbf{s}} = \frac{1}{N}\sum _{i=1}^{N} \textbf{s}_i\) is the mean shape over all the training samples, \(\textbf{s}_j^*\) is the jth eigenvector obtained by applying PCA to all the aligned training shapes and \(p_j\) is the coefficient of the jth shape eigenvector. Among those shape eigenvectors, we can find an eigenvector, usually the first one, that controls the yaw rotation of a face. We denote this eigenvector as \(\hat{\textbf{s}}\). Then we can obtain the pose coefficient of each training sample \(\textbf{s}_i\) as:

$$\begin{aligned} \hat{p}_i = \hat{\textbf{s}}^T(\textbf{s}_i - \bar{\textbf{s}}). \end{aligned}$$
(5.12)

The distribution of the pose coefficients of all the AFLW training samples is shown in Fig. 5.10. As can be seen, the AFLW dataset is not well balanced in terms of pose variation.

Fig. 5.10

Distribution of the head poses of the AFLW training set

With the pose coefficients of all the training samples, PDB first categorizes the training dataset into K subsets. It then balances the training data by duplicating samples in the subsets of lower cardinality. To be more specific, let the number of training samples in the kth subset be \(B_k\) and the maximum size of the K subsets be \(B^*\). To balance the whole training dataset in terms of pose variation, PDB adds training samples to the kth subset by randomly sampling \(B^*-B_k\) samples from the original kth subset. All subsets then have size \(B^*\), and the total number of training samples increases from \(\sum _{k=1}^{K}B_k\) to \(KB^*\). It should be noted that pose-based data balancing is performed before network training by randomly duplicating training samples in the subsets of lower occupancy. After balancing, the training samples of each mini-batch are randomly drawn from the balanced training dataset. As samples with different poses have the same probability of being drawn into a mini-batch, the network training is pose-balanced.
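The PDB procedure can be sketched as below, assuming the training shapes have already been Procrustes-aligned and flattened into an (N, 2L) array; following the description above, the eigenvector controlling yaw is taken to be the first principal component, and the number of bins is an illustrative choice.

```python
import numpy as np

def pose_balanced_indices(shapes, num_bins=9, rng=None):
    """Pose-based data balancing sketch: project aligned shapes onto the first
    PCA eigenvector (assumed to control yaw), bin the pose coefficients, and
    duplicate samples so every bin reaches the largest bin size."""
    rng = rng or np.random.default_rng(0)
    centered = shapes - shapes.mean(axis=0)
    # First principal component of the aligned training shapes (Eqs. 5.11-5.12).
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    pose_coeffs = centered @ vt[0]
    edges = np.linspace(pose_coeffs.min(), pose_coeffs.max(), num_bins + 1)[1:-1]
    bins = np.digitize(pose_coeffs, edges)
    largest = max(np.sum(bins == b) for b in range(num_bins))
    balanced = []
    for b in range(num_bins):
        idx = np.where(bins == b)[0]
        if len(idx) == 0:
            continue
        extra = rng.choice(idx, size=largest - len(idx), replace=True)
        balanced.extend(idx.tolist() + extra.tolist())
    return np.array(balanced)   # indices of the pose-balanced training set
```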

5 Landmark Localization in Specific Scenarios

5.1 3D Landmark Localization

3D landmark localization aims to locate the 3D coordinates of landmarks, i.e., their 2D positions and depth. The 2D landmark setting assumes that each landmark can be detected from its visual patterns. However, when faces deviate from the frontal view, the contour landmarks become invisible due to self-occlusion. In medium poses, this problem can be addressed by moving the semantic positions of the contour landmarks to the silhouette, which is termed landmark marching [62]. However, in large poses where half of the face is occluded, some landmarks are inevitably invisible. In this case, the 3D landmark setting is employed to keep the semantic meanings of the landmarks consistent, so that the face shape can be robustly recovered. As shown in Fig. 5.11, 3D landmarks are always located at their semantic positions and should be detected even when self-occluded.

Fig. 5.11

Examples of 3D landmark localization. The blue/red ones indicate visible/invisible landmarks

Fig. 5.12

The overview of 3DDFA. At the kth iteration, 3DDFA takes the image and the Projected Normalized Coordinate Code (PNCC) generated by \(\textbf{p}^{k}\) as inputs and uses a convolutional neural network to predict the parameter update \(\Delta \textbf{p}^{k}\)

In recent years, 3D face alignment has achieved satisfactory performance. The methods can be divided into two categories: model-based and non-model-based methods. The former perform 3D face alignment by fitting a 3D Morphable Model (3DMM), which provides a strong prior on face topology. The latter extract features from the image and directly regress the 3D landmarks with deep neural networks.

5.1.1 3D Dense Face Alignment (3DDFA)

Estimating depth information from a monocular image is an ill-posed problem, and a feasible way to realize 3D face alignment is to introduce a strong 3D face prior. 3D Dense Face Alignment (3DDFA) is a typical model-based method, which fits a 3DMM with a cascaded convolutional neural network to recover the dense 3D shape. Since the 3DMM has a unified topology, the 3D landmarks can be easily indexed after 3D shape recovery. An overview of 3DDFA is shown in Fig. 5.12. Specifically, the 3D face shape is described as:

$$\begin{aligned} \textbf{S}=\mathbf {\overline{S}} + \textbf{A}_{id}\alpha _{id} + \textbf{A}_{exp}\alpha _{exp}, \end{aligned}$$
(5.13)

where \(\textbf{S}\) is the 3D face shape, \(\mathbf {\overline{S}}\) is the mean shape, \(\textbf{A}_{id}\) and \(\textbf{A}_{exp}\) are the principal axes for identity and expression, respectively, and \(\alpha _{id}\) and \(\alpha _{exp}\) are the identity and expression parameters to be estimated. To obtain the 2D positions of the 3D vertices, the 3D face is projected onto the image plane by weak perspective projection:

$$\begin{aligned} V(\textbf{p}) = f * \textbf{Pr}*\textbf{R}*(\mathbf {\overline{S}} + \textbf{A}_{id}\alpha _{id} + \textbf{A}_{exp}\alpha _{exp}) +\textbf{t}_{2d}, \end{aligned}$$
(5.14)

where f is the scale factor, \(\textbf{Pr}\) is the orthographic projection matrix \(\left( \begin{array}{ccc} 1 &{} 0 &{} 0 \\ 0 &{} 1 &{} 0 \\ \end{array} \right) \), \(\textbf{R}\) is the rotation matrix derived from the rotation angles pitch, yaw, and roll, and \(\textbf{t}_{2d}\) is the translation vector. The parameters for shape recovery are collected as \(\textbf{p}=[f,pitch,yaw,roll,\textbf{t}_{2d},\alpha _{id},\) \(\alpha _{exp}]^{T}\), and the goal of 3DDFA is to estimate \(\textbf{p}\) from the input image.
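Equations (5.13) and (5.14) translate directly into a few lines of NumPy, sketched below; the vertex ordering and the shapes of the basis matrices depend on the particular 3DMM and are assumptions here.

```python
import numpy as np

def recover_shape(mean_shape, A_id, alpha_id, A_exp, alpha_exp):
    """3DMM shape of Eq. (5.13); all arrays are assumed to share the same
    3n-dimensional vertex ordering."""
    return mean_shape + A_id @ alpha_id + A_exp @ alpha_exp

def weak_perspective_project(S, f, R, t2d):
    """Weak perspective projection of Eq. (5.14). `S` is the recovered 3D shape
    reshaped to a 3 x n array of vertices, `R` is a 3x3 rotation matrix, `f` a
    scalar scale factor, and `t2d` a 2-vector translation."""
    Pr = np.array([[1.0, 0.0, 0.0],
                   [0.0, 1.0, 0.0]])                 # orthographic projection matrix
    return f * (Pr @ (R @ S)) + t2d.reshape(2, 1)    # 2 x n projected image coordinates
```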

3DDFA is a cascaded-regression-based method that employs several networks to update the parameters step by step. A specially designed feature Projected Normalized Coordinate Code (PNCC) is proposed to reflect the fitting accuracy, which is formulated as:

$$\begin{aligned} \text {NCC}_{d}&=\frac{\overline{\textbf{S}}_{d} - \min (\overline{\textbf{S}}_{d})}{\max (\overline{\textbf{S}}_{d}) - \min (\overline{\textbf{S}}_{d})}~~~ (d = x,y,z), \nonumber \\ \text {PNCC} &= \text {Z-Buffer}(V(\textbf{p}), \text {NCC}), \end{aligned}$$
(5.15)

where \(\mathbf {\overline{S}}\) is the mean shape of the 3DMM and \(\text {Z-Buffer}(\nu ,\tau )\) is the rendering operation that renders the 3D mesh \(\nu \), colored by \(\tau \), into an image. PNCC represents the 2D locations of the visible 3D vertices on the image plane. Note that both \(\text {NCC}\) and \(\text {PNCC}\) have three channels for x, y, z, analogous to RGB, so they can be visualized in color as in Fig. 5.13.

Fig. 5.13

The illustration of the Normalized Coordinate Code (NCC) and the Projected Normalized Coordinate Code (PNCC). NCC denotes the position as its texture (\(\text {NCC}_x=R, \text {NCC}_y=G, \text {NCC}_z=B\)) and \(\text {PNCC}\) is generated by rendering the 3D face with \(\text {NCC}\) as its colormap

At the kth iteration, 3DDFA constructs the PNCC from the current parameters \(\textbf{p}^{k}\) and concatenates it with the image as the network input. Then, a neural network predicts the parameter update \(\Delta \textbf{p}^{k}\):

$$\begin{aligned} \begin{aligned} \Delta \textbf{p}^{k} = Net ^{k} (\textbf{I}, \text {PNCC}(\textbf{p}^{k})). \end{aligned} \end{aligned}$$
(5.16)

Afterward, the parameters for the \((k+1)\)th iteration are updated as \(\textbf{p}^{k+1}=\textbf{p}^{k}+\Delta \textbf{p}^{k}\), and another network further updates the parameters until convergence. By incorporating the 3D prior, 3DDFA localizes the invisible landmarks in large poses and achieves state-of-the-art performance. However, it is limited by its computational cost, since it cascades several networks to progressively update the fitting result. To deploy 3DDFA on lightweight devices, 3DDFAv2 [63] employs a MobileNet [64] to directly regress the target parameters and also achieves satisfactory performance.

5.1.2 Face Alignment Network (FAN)

Face Alignment Network (FAN) [52] is a non-model-based method for 3D face alignment, which trains a neural network to regress landmark heatmaps. FAN constructs a strong backbone to localize 3D landmarks, as shown in Fig. 5.14a. Specifically, FAN consists of four stacked hourglass networks [55], in which the bottleneck blocks of each hourglass are replaced with the hierarchical, parallel, and multi-scale residual block of [56] to further improve performance. Given an input image, FAN uses the network to regress the landmark heatmaps, where each channel is a 2D Gaussian centered at the corresponding landmark location with a standard deviation of one pixel.

Fig. 5.14

a The backbone of the Face Alignment Network (FAN). It consists of stacked Hourglass networks [55] in which the bottleneck blocks are replaced with the residual block of [56]. b The illustration of FAN for 3D face alignment. The network takes the images and their corresponding 2D landmark heatmaps as input to regress the heatmaps of the projected 3D landmarks, which are then concatenated with the image to regress the depth values of landmarks

To regress 3D positions, FAN adopts a 2D-landmark-guided network that converts 2D landmarks to 3D landmarks, bridging the gap between the saturating 2D landmark localization task and the more challenging 3D one. The overview of FAN for 3D landmark localization is shown in Fig. 5.14b. Specifically, given an RGB image and its corresponding 2D landmark heatmaps as input, FAN first regresses the heatmaps of the projected 3D landmarks, obtaining the x and y coordinates of the 3D landmarks. Then, the projected 3D landmark heatmaps are concatenated with the input image and fed to a subsequent network that regresses the depth value of each landmark, yielding the full (x, y, z) coordinates of the 3D landmarks.

5.1.3 MediaPipe

MediaPipe [60] is a widely used pipeline for 2D and 3D landmark localization. It was proposed to meet the real-time requirements of applications such as AR make-up, eye tracking, and AR puppeteering. Different from cascaded frameworks, MediaPipe uses a single model to achieve comparable performance. The pipeline of MediaPipe is shown in Fig. 5.15. The network first extracts a global feature map from the cropped image, and is then split into several sub-networks. One sub-network predicts the 3D face mesh, including the 3D landmarks, and outputs the regions of interest (eyes and lips). The remaining two sub-networks estimate the local landmarks of the eyes and lips, respectively. The output of MediaPipe is a sparse mesh composed of 468 points. Through its lightweight architecture [61] and region-specific heads for meaningful regions, MediaPipe is efficient and achieves performance comparable to cascaded methods, realizing real-time on-device inference.

Fig. 5.15

The pipeline of MediaPipe. Given an input image, the face region is first cropped by the face detector and then fed to the feature extractor. After that, the model splits into several sub-models to predict the global landmarks and important local landmarks, including the eyes and lips

5.1.4 3D Landmark Data

One of the main challenges of 3D landmark localization is the lack of data. Acquiring high-precision 3D face models requires expensive devices and a fully controlled environment, making large-scale data collection infeasible. To overcome this bottleneck, current methods usually label the 2D projections of 3D landmarks as an alternative. However, this is still laborious, since the self-occluded parts have to be guessed by intuition. In recent years, 300W-LP [40, 85], AFLW2000-3D [40, 85], and Menpo-3D [84] have been popular datasets for building 3D landmark localization systems. In addition to hand annotation, training data can be generated by virtual synthesis. Face profiling [40, 85] recovers a textured 3D mesh from a 2D face image and rotates the mesh to given rotation angles, which can then be rendered to generate virtual data, as shown in Fig. 5.16. Through face profiling, not only can face samples in large poses (yaw angles up to \(90^{\circ }\)) be obtained, but the dataset can also be augmented to any desired scale.

Fig. 5.16

The face profiling process

5.2 Landmark Localization on Masked Face

Since the outbreak of the worldwide COVID-19 pandemic, facial landmark localization has faced the great challenge of mask occlusion. First, the collection of masked face data is costly and difficult, especially during the spread of COVID-19. Second, a masked facial image suffers from severe occlusion, making the landmarks more difficult to detect. Taking the 106-point landmark setting as an example, around 27 nose and mouth points are occluded by the facial mask (Fig. 5.18), which brings not only additional difficulty to landmark detection but also uncertainty to the ground-truth labeling. These issues seriously harm deep learning-based landmark localization, which relies on labeled data.

Most of these issues lie in the masked face data. Therefore, a feasible and straightforward solution is to synthesize photo-realistic masked face images from mask-free ones, so as to overcome the problems of data collection and labeling. One popular approach [14], as shown in Fig. 5.17, is composed of three steps, i.e., 3D reconstruction, mask segmentation, and re-rendering of the blended result. Given a source masked face and a target mask-free face, their 3D shapes are first recovered by a 3D face reconstruction method (such as PRNet [53]), which warps the image pixels to the UV space to generate UV textures. Second, the mask area in the source image is detected by a facial segmentation method [90] and also warped to the UV space to obtain a UV mask. Finally, the target UV texture is covered by the UV mask, and the synthesized target texture is re-rendered to the original 2D plane.

Fig. 5.17

Adding virtual mask to face images by 3D reconstruction and face segmentation

Fig. 5.18

Examples of synthesized and real masked face images [1]

There are two benefits of this practice. First, a large number of masked face images can be efficiently produced with geometrically reasonable and photo-realistic masks, and the mask styles are fully controllable. Second, once the target image has annotated landmarks, the synthesized one does not have to be labeled again: it directly inherits the ground-truth landmarks for training and testing (Fig. 5.18a). With the synthesized masked face images, a mask-robust landmark detection model can be built in a similar manner to the mask-free condition.

5.3 Joint Face Detection and Landmark Localization

The joint detection of face boxes and landmarks has been studied since the early days when deep learning began to thrive in biometrics. The initial motivation of joint detection was to boost face detection itself by incorporating landmarks to handle certain hard cases, e.g., large poses, severe occlusion, and heavy cosmetics [5, 6]. Afterward, the community paid increasing attention to merging the two tasks into one. The advantages are three-fold. First, the two highly correlated tasks benefit each other when the detector is trained with annotations from both sides. Second, the unified style brings better efficiency to the whole pipeline of face-related applications, as the two detection tasks can be accomplished by a single lightweight model. Finally, the joint model can be conveniently applied in many tasks, including face recognition, simplifying the implementation in practice. Despite these obvious advantages, building such a multi-task system requires more expensive training data with labels of multiple face attributes, increasing the cost of data annotation.

Fig. 5.19

The typical framework of joint detection of face and landmark

Networks. The typical framework of joint face and landmark detection is shown in Fig. 5.19. The input image contains human faces with arbitrary pose, occlusion, illumination, cosmetics, resolution, etc. The backbone extracts an effective feature from the input image and feeds it into the multi-task head. The multi-task head outputs the joint detection results, including at least three items, i.e., the face classification score, the face bounding box coordinates, and the landmark coordinates. Beyond these typical tasks, some methods also predict head pose, gender [8], or 3D reconstruction [11] simultaneously. The major backbones include FPN [10], cascaded CNNs [7], multi-scale fusion within a rapidly digested CNN [9], YOLO-vX-style networks [3], etc. The former two make full use of hierarchical features and predict fine results, while the latter two have excellent efficiency for CPU-real-time applications.

Learning objectives. The framework should be trained with multiple objectives to perform joint predictions. Equation (5.17) is a typical loss formulation for multi-objective training. \(\mathcal {L}_{face-cls}\) is the cross-entropy loss for face classification, which predicts the confidence of whether the candidate is a human face. \(\mathcal {L}_{bbox-reg}\) is defined as the L2 or smooth L1 distance between the coordinates of the predicted bounding box and the ground truth, supervising the model to learn the bounding box locations. Similarly, \(\mathcal {L}_{lm-reg}\) supervises the model to predict the landmark coordinates in the same way.

$$\begin{aligned} \mathcal {L} = \alpha _1 \beta _1 \mathcal {L}_{face-cls} + \alpha _2 \beta _2 \lambda \mathcal {L}_{bbox-reg} + \alpha _3 \beta _3 \lambda \mathcal {L}_{lm-reg}, \end{aligned}$$
(5.17)

where \(\{\alpha _1, \alpha _2, \alpha _3\}\in \mathbb {R}\) are weights balancing the three objectives, \(\{\beta _1,\beta _2,\beta _3\}\in \{0, 1\}\) are binary indicators that activate the supervision only if the corresponding annotation is present in the training sample, and \(\lambda \in \{0, 1\}\) activates the bounding box and landmark supervision only if the candidate's ground truth is a human face [9]. It is worth noting that the incorporation of \(\beta \) enables training on partially annotated datasets.
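A sketch of Eq. (5.17) for a single candidate is shown below, using smooth L1 for the two regression terms and binary cross-entropy for classification; the dictionary keys, indicator names, and default weights are assumptions of this sketch rather than the interface of any cited detector.

```python
import torch
import torch.nn.functional as F

def joint_detection_loss(outputs, targets, alphas=(1.0, 1.0, 1.0)):
    """Multi-task objective of Eq. (5.17) for one candidate/anchor.
    `targets` carries the binary indicators (beta1..3) and the face label used
    as lambda, so that terms without annotations, or for non-face candidates,
    contribute nothing. Keys and weight values are illustrative assumptions."""
    a1, a2, a3 = alphas
    cls_loss = F.binary_cross_entropy_with_logits(
        outputs["face_logit"], targets["is_face"])
    box_loss = F.smooth_l1_loss(outputs["bbox"], targets["bbox"])
    lm_loss = F.smooth_l1_loss(outputs["landmarks"], targets["landmarks"])
    lam = targets["is_face"]                  # activates regression only for faces
    return (a1 * targets["has_cls"] * cls_loss
            + a2 * targets["has_bbox"] * lam * box_loss
            + a3 * targets["has_lms"] * lam * lm_loss)
```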

Datasets. The dataset most commonly used for joint detection is the WIDER FACE [13] dataset with the supplementary annotations of [11]. The initial purpose of WIDER FACE was to train and evaluate face detection models. The supplementary annotation provides five-point landmarks on each face, enabling its use for the joint detection task. Owing to the wide utilization of this dataset, most joint detection models predict five-point landmarks, which are sufficient for face alignment in most cases. Besides, some models [8, 30] trained on the 300W [57] dataset predict 68 landmarks for joint detection.

6 Evaluations of the State of the Arts

In this section, we introduce how to evaluate the performance of a landmark localization method, including the commonly used datasets and evaluation metrics. Evaluation results of representative methods on different datasets are also collected and presented.

Table 5.1 An overview of 2D facial landmark datasets. “Train” and “Test” are the number of samples in the training set and the test set, respectively. “Landmark Num.” represents the number of annotated landmarks

6.1 Datasets

In recent years, many datasets have been collected for training and testing of 2D facial landmark localization, including COFW [67], COFW-68 [72], 300W [65], 300W-LP [85], WFLW [68], Menpo-2D [83], AFLW [66], AFLW-19 [86], AFLW-68 [87], MERL-RAV [77] and WFLW-68 [39], which are listed in Table 5.1. We introduce some representative datasets as follows:

\({\textbf {300W}}\) contains 3,837 images, some of which contain more than one face. Each face is annotated with 68 facial landmarks. The 3,148 training images come from the full set of AFW [69] (337 images), the training part of LFPW [70] (811 images), and HELEN [71] (2,000 images). The test set is divided into a common set and a challenging set. The common set, with 554 images, comes from the testing parts of LFPW (224 images) and HELEN (330 images). The challenging set, with 135 images, is the full set of IBUG [65]. 300W-LP [85] augments the pose variations of 300W by the face profiling technique and generates a large dataset with 61,225 samples, many of which are in profile.

\({\textbf {COFW}}\) contains 1,007 images with 29 annotated landmarks. The training set, with 1,345 samples, is the combination of 845 LFPW samples and 500 COFW samples. The test set, with 507 samples, has two annotation versions: 29 landmarks (the same as the training set) or 68 landmarks, the latter being referred to as COFW-68 [72]. Most faces in COFW exhibit large variations in occlusion.

\({\textbf {AFLW}}\) contains 25,993 faces with at most 21 visible facial landmarks annotated; invisible landmarks are not annotated. A protocol [86] built on the original AFLW divides the dataset into 20,000 training samples and 4,386 test samples. The dataset has large pose variations and, in particular, contains thousands of faces in profile. AFLW-19 [86] builds a 19-landmark annotation by removing the 2 ear landmarks. AFLW-68 [87] follows the 300W configuration and re-annotates the images with 68 facial landmarks.

\({\textbf {Menpo-2D}}\) has a training set of 7,564 images, including 5,658 frontal faces and 1,906 profile faces, and a test set of 7,281 images, including 5,335 frontal faces and 1,946 profile faces. There are two settings for different poses: frontal faces are annotated with 68 landmarks, and profile faces with 39 landmarks.

\({\textbf {WFLW}}\) contains 7,500 images for training and 2,500 images for testing. Each face in WFLW is annotated with 98 landmarks and several attributes, such as occlusion, make-up, expression, and blur. WFLW-68 [39] converts the original 98 landmarks to 68 landmarks for convenient evaluation.

6.2 Evaluation Metric

There are three commonly utilized metrics to evaluate the precision of landmark localization, including Normalized Mean Error (NME), Failure Rate (FR) and Cumulative Error Distribution (CED).

Normalized Mean Error (NME) is one of the most widely used metrics in face alignment, which is defined as:

$$\begin{aligned} \textrm{NME} = \frac{1}{M}\sum \limits _{i = 1}^M \frac{\Vert \textbf{P}_i - \textbf{P}_i^{*}\Vert _2}{d}, \end{aligned}$$
(5.18)

where \(\{\textbf{P}_i\}\) are the predicted landmark coordinates, \(\{\textbf{P}_i^{*}\}\) are the ground-truth coordinates, M is the total number of landmarks, and d is the distance between the outer eye corners (inter-ocular) [39, 68, 75, 79, 82] or the pupil centers (inter-pupil) [76, 80]. The error is normalized by d to reduce the deviation caused by face scale and image size. In some cases, the image size [39] or face box size [77] is also used as the normalization factor d. A smaller NME indicates better performance.

\({\textbf {Failure Rate (FR)}}\) is the percentage of samples whose NMEs are higher than a certain threshold f, denoted as \(\textrm{FR}_{f}\) (f is usually set to 0.1) [57, 68, 92]. A smaller FR means better performance.

\({\textbf {Cumulative Error Distribution (CED)}}\) is defined as a curve (x, y), where x indicates the NME and y is the proportion of samples in the test set whose NMEs are less than x. Figure 5.20 shows an example of a CED curve, which provides a more detailed summary of landmark localization performance. Based on the CED, the \({\textbf {Area Under the Curve (AUC)}}\) is obtained as the area enclosed between the CED curve and the x-axis over the integration interval from \(x=0\) to a threshold \(x=f\), denoted as \(\textrm{AUC}_{f}\). A larger AUC means better performance.

Fig. 5.20

An example of CED curve from [40]. In the curve, x is NME and y is the proportion of samples in the test set whose NMEs are less than x
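The three metrics above can be computed from per-sample predictions as in the sketch below; normalizing the AUC by the threshold f follows a common convention, but conventions vary across papers.

```python
import numpy as np

def nme(pred, gt, d):
    """Eq. (5.18): mean point-to-point error normalized by d
    (e.g., the inter-ocular distance). pred/gt have shape (M, 2)."""
    return np.mean(np.linalg.norm(pred - gt, axis=1)) / d

def failure_rate(nmes, f=0.1):
    """Fraction of test samples whose NME exceeds the threshold f."""
    return np.mean(np.asarray(nmes) > f)

def auc_from_nmes(nmes, f=0.1, num_steps=1000):
    """Area under the CED curve on [0, f], obtained by integrating the
    empirical cumulative distribution and normalizing by f."""
    nmes = np.asarray(nmes)
    xs = np.linspace(0.0, f, num_steps)
    ced = np.array([np.mean(nmes <= x) for x in xs])
    return np.trapz(ced, xs) / f
```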

Table 5.2 Performance comparison on 300W. “Common”, “Challenge”, and “Full” represent the common set, challenging set, and full set of 300W, respectively. “Backbone” represents the model architecture used by each method
Table 5.3 Performance comparison on COFW and COFW-68. The threshold of Failure Rate (FR) and Area Under the Curve (AUC) are set to 0.1
Table 5.4 Performance comparison on WFLW. All, Pose, Expr., Illu., M.u., Occ. and Blur represent full set, pose set, expression set, illumination set, make-up set, occlusion set, and blur set of WFLW, respectively. All results used inter-ocular distance for normalization. The threshold of Failure Rate (FR) and Area Under the Curve (AUC) are set to 0.1

6.3 Comparison of the State of the Arts

We present the performance of some state-of-the-art methods from 2018 to 2022 on commonly used datasets, including LAB [68], SAN [73], HG-HSLE [74], AWing [76], DeCaFA [78], RWing [38], HRNet [75], LUVLi [77], SDL [81], PIPNet [39], HIH [79], ADNet [80], and SLPT [82]. It is worth noting that the reported results should not be compared directly, because the model sizes and training data differ.

300W: Table 5.2 summarizes the results on the most commonly used dataset 300W, with three test subsets of “common”, “challenging”, and “full”. The NME of the 68 facial landmarks is calculated to measure the performance. All the results are collected from the corresponding papers.

COFW: Table 5.3 summarizes the results on COFW and COFW-68, which mainly measure robustness to occlusion. There are two protocols: the within-dataset protocol (COFW) and the cross-dataset protocol (COFW-68). For the within-dataset protocol, the model is trained with 1,345 images and tested with 507 images on COFW, and the NME and FR\(_{0.1}\) of the 29 landmarks are used for comparison. For the cross-dataset protocol, the training set is the complete 300W dataset (3,837 images), and the test set is COFW-68 (507 images); the NME and FR\(_{0.1}\) of the 68 landmarks are reported. All the results are collected from the corresponding papers.

WFLW: Table 5.4 summarizes the results on WFLW. The test set is divided into six subsets to evaluate the models in various specific scenarios: pose (326 images), expression (314 images), illumination (698 images), make-up (206 images), occlusion (736 images), and blur (773 images). The three metrics NME, FR\(_{0.1}\), and AUC\(_{0.1}\) of the 98 landmarks are employed to demonstrate the stability of landmark localization. The results of SAN and SLPT are taken from the supplementary material of [82], and the results of LUVLi from the supplementary material of [77]. For HRNet, the NME is from [75], and the FR\(_{0.1}\) and AUC\(_{0.1}\) are from [81]. The other results are from the corresponding papers.

7 Conclusion

Landmark localization has been the cornerstone of many widely used applications. For example, face recognition utilizes landmarks to align faces, face AR applications use landmarks to locate the eyes and lips, and face animation fits 3D face models with landmarks. In this chapter, we have discussed typical methods of landmark localization, including coordinate regression and heatmap regression, as well as several special landmark localization scenarios. Although these strategies have made great progress and enable robust localization in most cases, many challenging problems remain to be addressed in advanced applications, including profile faces, large-region occlusion, temporal consistency, and pixel-level accuracy. As face applications develop, the requirements on accuracy, robustness, and computational cost become ever higher, and more sophisticated landmark localization strategies are needed.