
1 Introduction

Facial landmark detection, also known as face alignment [29, 43], is a fundamental task in various facial image and video analysis applications [31, 32, 41, 44, 45, 46]. Over the past decades, facial landmark detection has made significant progress. Nevertheless, many existing approaches struggle with in-the-wild faces exhibiting extreme appearance variations in pose, expression, illumination, blur and occlusion.

Existing facial landmark detection algorithms can be roughly divided into three categories: global appearance based approaches, constrained local models and regression-based methods. Global appearance based methods detect the key points using the whole facial texture and global shape information [3, 4, 5, 13, 25, 30]. Constrained local models [17] combine a global face shape with independent local texture information around each key point, which is more robust to illumination and occlusion variations. Regression-based methods can be divided into direct regression, cascaded regression and regression with deep neural networks. At present, the most widely used and most accurate methods are all based on deep Convolutional Neural Networks (CNNs) [12, 16]. The facial landmark detection method proposed in this paper is based on CNNs as well.

The key innovations of the proposed method include:

  • For data augmentation, we adopt the online Pose-based Data Balancing (PDB) [8] method to balance the original training dataset. More specifically, we duplicate the under-represented samples identified by PDB and randomly modify them (flip, rotate, blur, etc.), including rendering the copies in different styles, since the intrinsic variance of image styles can also affect the performance of a trained network [6].

  • The baseline of this paper is CPM [34], which generates a heatmap as the final output of the network. In order to apply the Wing loss, which is specially designed for coordinate regression models, we introduce the soft-argmax function [21]. This function converts heatmaps to coordinates so that the whole network remains differentiable.

  • The original Wing loss function [8] focuses on small and medium errors, but pays less attention to samples with large errors. To address this issue, we design a new loss function, namely the mixed loss, that considers samples with errors of various magnitudes.

2 Related Work

2.1 Pose Variation

The aim of data augmentation is to reduce the bias in network training caused by the imbalance of a training dataset. STN [22] applies a spatial transformer network to learn transformation parameters and thus automatically initialise training samples. SAN [6] translates each image into four different styles using a generative adversarial module. Both try to inject diversity into a training dataset and balance the training samples.

2.2 Regression Model

The regression methods used for facial landmark detection can be divided into two categories: coordinate regression and heatmap regression. A coordinate regression network performs well on datasets with sparse landmarks, but not as well as heatmap regression on dense landmarks. However, it has been shown that heatmap-matching regression can produce worse predictions even while the MSE improves [26]. Luvizon et al. [21] propose the soft-argmax function to convert heatmaps to coordinates and make the network differentiable. Nibali et al. [26] use a new regularisation strategy to improve the prediction accuracy of a network.

2.3 Loss Function

For a CNN-based facial landmark detector, a loss function has to be defined to supervise the network training process. Most existing facial landmark detection approaches are based on the L2 loss, which is sensitive to outliers. Feng et al. [8] propose a new loss function, i.e. the Wing loss, to balance the sensitivity to small and large errors when training a deep CNN model. Guo et al. [11] introduce a loss that adjusts the weights of different samples during training according to a tag describing the pose of each sample. Merget et al. [23] propose a loss function that first checks whether each landmark is labelled and lies within the image boundary, and assigns each landmark a specific weight according to this judgement.

3 Methodology

3.1 Data Augmentation

Data imbalance is a common issue in deep learning, which limits the accuracy and robustness of a trained network [11]. From Table 1 and Fig. 1, we can see that most datasets contain a large number of frontal faces, but lack samples with large poses, expressions, illuminations and occlusions [42]. The pose imbalance of a dataset is particularly significant. If we train a network on an imbalanced dataset, the network may not be able to generalise well to practical applications. Besides, distribution differences between the training and test sets can significantly influence the performance of a trained network.

Table 1. Pose distribution of the 300-W dataset [29].
Fig. 1. Pose distribution of the ICME2019 GC facial landmark datasets [19]. The X-axis stands for the pitch angle in the left figure and the yaw angle in the right figure. The Y-axis denotes the number of samples in the training set.

To address the data imbalance problem, various algorithms have been proposed, including both geometric and textural transformations [9]. The main methods used for geometric transformation are flipping, scaling, translation and rotation. For textural transformation, Gaussian noise and brightness transformation are widely used. Nevertheless, if we randomly apply the above methods to the training samples of a dataset, we do not know how many times each training sample should be augmented or copied.

To improve the balance of a dataset, we introduce the Pose-based Data Balancing (PDB) [8] strategy (Algorithm 1) in our work. PDB is a statistical method that analyses the distribution of a face dataset in shape and pose. To adapt PDB to our network, we first use Procrustes Analysis [10] to align all the faces in the training dataset to the mean face. Procrustes Analysis learns an affine transformation from one shape to another with minimum mean squared error. By applying PCA to the training set and analysing the distribution of the first principal component, we can balance the training set by copying each sample a fixed number of times chosen to flatten the distribution.

Algorithm 1. Pose-based Data Balancing (PDB).
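Algorithm 1 appears as a figure in the original paper and is not reproduced here; the NumPy sketch below illustrates the balancing step described in the paragraph above. The number of bins, the use of the first principal component as a pose score, and the rounding rule are our assumptions.

```python
import numpy as np

def pdb_duplication_factors(aligned_shapes, n_bins=9):
    """Minimal sketch of Pose-based Data Balancing (PDB).

    aligned_shapes: (N, 2*L) array of training shapes, already aligned
    to the mean face by Procrustes Analysis. Returns how many times each
    sample should be duplicated to flatten the pose distribution.
    """
    # Project every shape onto the first principal component, which
    # mainly captures the dominant pose variation.
    centred = aligned_shapes - aligned_shapes.mean(axis=0)
    _, _, vt = np.linalg.svd(centred, full_matrices=False)
    pose_scores = centred @ vt[0]

    # Histogram the scores and duplicate under-represented bins so that
    # each bin approaches the size of the largest one.
    counts, edges = np.histogram(pose_scores, bins=n_bins)
    bin_idx = np.clip(np.digitize(pose_scores, edges[1:-1]), 0, n_bins - 1)
    factors = np.rint(counts.max() / np.maximum(counts, 1)).astype(int)
    return factors[bin_idx]  # one duplication factor per sample
```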

In order to minimise the impact of dataset imbalance on facial landmark detection accuracy, the PDB process is applied at the beginning of each epoch. Since the modification of each sample is random, this online PDB process substantially enhances the variety of samples across different attributes. In each epoch, a sample is copied the same number of times, but in different epochs the copies are transformed randomly and independently. In this way, the dataset is effectively expanded many times over, and each epoch can be regarded as sampling from a much larger dataset. According to our experiments, offline data augmentation already yields a clear improvement in detector performance, and converting offline augmentation to online augmentation improves the trained facial landmark detector further. However, it is worth noting that offline data augmentation does not require many CPU resources, whereas online PDB multiplies the original training time several-fold unless the data augmentation is multi-threaded.

Fig. 2. Backbone network of the proposed facial landmark detector. All inputs are resized to 256 \(\times \) 256. Concat means splicing the feature maps by channel and adjusting the number of channels with a 1 \(\times \) 1 convolution; the 69 output channels correspond to 68 landmarks plus 1 mask denoting visibility.

3.2 Network and Mixed Loss

The backbone network of our facial landmark detector is shown in Figs. 2 and 3. The network is based on VGG16 [33] + CPM [7, 34]: the first four convolutional blocks of VGG16 extract coarse feature maps, followed by three stages of the CPM structure. The detailed architecture of Conv in Fig. 2 is shown in Fig. 3. We use convolutional pose machines (CPM) as the main architecture; CPMs combine and concatenate the outputs of each stage in the network in order to preserve the geometric constraints and semantic information in the feature maps. The ground truth is transformed into heatmap form by applying a Gaussian kernel centred on each landmark point. After down-sampling the annotated image to the same size and number of channels as the CPM output, the error between the predicted and ground-truth values is back-propagated in each stage of the CPM, since each stage is intermediately supervised by the L1/L2 loss function.
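As an illustration of the heatmap-style ground truth described above, here is a minimal PyTorch sketch that places one Gaussian per landmark; the kernel width sigma is an assumption, not a value taken from the paper.

```python
import torch

def landmarks_to_heatmaps(landmarks, height, width, sigma=1.5):
    """Render ground-truth heatmaps with a Gaussian centred on each landmark.

    landmarks: (L, 2) tensor of (x, y) coordinates at heatmap resolution.
    Returns an (L, height, width) tensor, one channel per landmark.
    """
    ys = torch.arange(height, dtype=torch.float32).view(-1, 1)
    xs = torch.arange(width, dtype=torch.float32).view(1, -1)
    maps = []
    for x, y in landmarks:
        d2 = (xs - x) ** 2 + (ys - y) ** 2          # squared distance grid
        maps.append(torch.exp(-d2 / (2.0 * sigma ** 2)))
    return torch.stack(maps)
```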

Fig. 3. Detailed kernel sizes and channels of the convolution layers in Conv in Fig. 2. The first row is Conv (stage 1) and the second row is Conv (stage > 1). The L in the last layers denotes the number of facial landmarks plus the mask.

Models based on heatmap regression achieve higher accuracy. However, heatmap regression methods are almost always supervised by the L2 loss, which makes it difficult to improve the form of the loss function. In coordinate regression, by contrast, it is practical to optimise the loss function, because the errors between points are computed directly. We therefore try to refine the detector, helping the model learn better parameters by combining coordinate and heatmap regression.

In order to obtain refined landmark detection with a cascaded model, the multi-stage CPM network is trained in an L2 heatmap regression style. To calculate the loss of the whole network, the multi-stage L2 heatmap loss and the improved Wing loss are combined:

$$\begin{aligned} l_{mix} = \alpha _1 l_{point} + \alpha _2 l_{stage}, \end{aligned}$$
(1)
$$\begin{aligned} l_{stage} = \sum \limits _{i = 1}^3 \beta _i l_{stage}^{(i)}. \end{aligned}$$
(2)

The form of the mixed loss function is shown in (1) and (2): the network is updated by both point information and heatmap information. The ratio between these two losses is controlled by \(\alpha _1\) and \(\alpha _2\). In the heatmap loss, the output of each stage also contributes to the total loss, weighted by the hyper-parameters \(\beta _1\), \(\beta _2\) and \(\beta _3\).

As aforementioned, it is well known that the proportion of difficult samples in a training dataset is relatively small, causing a data imbalance issue, and that simple samples usually dominate the network training. In this case, the widely used L2 loss is not necessarily the best loss function: it amplifies the effect of samples with large errors and neglects small errors. In contrast, the Wing loss function focuses on small and medium errors, but pays less attention to samples with large errors. In order to design a new loss function that considers samples with errors of various magnitudes, we adopt the Wing loss [8] of Eq. (3) as the point term:

$$\begin{aligned} l_{point} = wing(x) = {\left\{ \begin{array}{ll} w\ln \left( 1 + \frac{|x|}{\varepsilon } \right) &{} \text {if } |x| < w\\ |x| - C &{} \text {otherwise} \end{array}\right. }, \end{aligned}$$
(3)
where \(C = w - w\ln (1 + w/\varepsilon )\) ensures that the two pieces of the function join continuously.
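A minimal PyTorch sketch of the mixed loss follows. The coefficients are those reported in Sect. 4.2, while the Wing loss parameters w = 10 and \(\varepsilon \) = 2 are the defaults from [8] and should be treated as assumptions for this model.

```python
import math
import torch

def wing_loss(pred, target, w=10.0, eps=2.0):
    """Wing loss on coordinate errors, Eq. (3). The defaults w = 10 and
    eps = 2 follow [8] and are assumptions here."""
    x = (pred - target).abs()
    c = w - w * math.log(1.0 + w / eps)   # makes the two pieces continuous
    return torch.where(x < w, w * torch.log(1.0 + x / eps), x - c).mean()

def mixed_loss(pred_points, gt_points, stage_heatmaps, gt_heatmaps,
               alpha1=0.7, alpha2=0.3, betas=(0.5, 0.5, 1.0)):
    """Mixed loss of Eqs. (1)-(2): Wing loss on coordinates plus a
    beta-weighted sum of per-stage L2 heatmap losses. The coefficients
    are those reported in Sect. 4.2."""
    l_point = wing_loss(pred_points, gt_points)
    l_stage = sum(beta * torch.nn.functional.mse_loss(hm, gt_heatmaps)
                  for beta, hm in zip(betas, stage_heatmaps))
    return alpha1 * l_point + alpha2 * l_stage
```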

3.3 Heatmap to Point Regression

It is straightforward to convert a heatmap to key point coordinates by finding the peak location in the heatmap with the argmax function. However, the process is not trivial in training, because gradients cannot be back-propagated through argmax. To address this issue, this paper adopts soft-argmax, which guarantees differentiability during training while still searching for the maximum value. The idea is to represent the argmax function in closed form as an expectation over the heatmap.

Assume that the heatmap can be represented as \(I(x, y)\) with size \(W \times H \times C\), where W and H are the width and height of the heatmap and C denotes the number of channels. The maximum point of each channel can be calculated by [26]:

$$\begin{aligned} softargmax(I) = \left( \sum \limits _{i,j} W_x\left( i,j \right) I\left( i,j \right) , \sum \limits _{i,j} W_y\left( i,j \right) I\left( i,j \right) \right) , \end{aligned}$$
(4)
$$\begin{aligned} W_x\left( i,j \right) = \frac{i}{W}, \end{aligned}$$
(5)
$$\begin{aligned} W_y\left( i,j \right) = \frac{j}{H}. \end{aligned}$$
(6)

In fact, considering that in our model the heatmap values are on the order of \({10^{ - 5}}\), we use the adapted Algorithm 2 to avoid truncation errors and insufficient precision.

Algorithm 2. Adapted soft-argmax.
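Since Algorithm 2 is likewise shown only as a figure in the original paper, the following PyTorch sketch gives one common way to stabilise the computation: each channel is renormalised to sum to 1 before taking the expectation of Eqs. (4)-(6). This renormalisation is our reading of the adaptation, not a verbatim reproduction of Algorithm 2.

```python
import torch

def soft_argmax(heatmaps, eps=1e-12):
    """Differentiable peak localisation following Eqs. (4)-(6).

    heatmaps: (B, C, H, W) tensor of non-negative responses. Because the
    raw values are tiny (~1e-5), each channel is renormalised to sum to 1
    before taking the expectation.
    """
    b, c, h, w = heatmaps.shape
    flat = heatmaps.view(b, c, -1)
    probs = (flat / (flat.sum(dim=-1, keepdim=True) + eps)).view(b, c, h, w)
    xs = torch.arange(w, dtype=probs.dtype).view(1, 1, 1, -1) / w  # W_x(i,j) = i/W
    ys = torch.arange(h, dtype=probs.dtype).view(1, 1, -1, 1) / h  # W_y(i,j) = j/H
    x = (probs * xs).sum(dim=(2, 3))
    y = (probs * ys).sum(dim=(2, 3))
    return torch.stack((x, y), dim=-1)  # (B, C, 2) normalised coordinates
```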

4 Experimental Results

4.1 Datasets

In this paper, we conduct experiments on two datasets: the 300-W [29] and AFLW facial landmark datasets [15].

300-W is an open facial landmark dataset, composed of the LFPW [1], AFW [40], HELEN [18], XM2VTS [24] and IBUG [20] datasets. The whole 300-W dataset contains 3148 training images and 687 test images. Each image in 300-W is labelled with 68 facial landmarks (Fig. 4).

AFLW is another classic dataset for face alignment. AFLW consists of more than 25000 images with 21 landmarks each. In our experiments, we follow the AFLW-Full protocol [15], which contains 24386 images in total: 20000 images for training and the rest for testing. The images are annotated with 19 landmarks, since the landmarks of the two ears are ignored in this protocol.

Fig. 4. Partial visualization of the results of our model on 300-W.

Table 2. Results on 300-W and AFLW datasets. For 300-W, we use inter-pupil distance to compute NME. For AFLW, we use the face size for NME.

4.2 Experimental Settings

We conduct all the experiments on an Intel E5-2650 v4 CPU with two Tesla V100 GPUs. The proposed method was implemented with PyTorch 1.1 [27, 28] and Python 3.7. All input images are resized to \(256\times 256\times 3\) and the output is \(N\times 2\) landmark coordinates. The heatmaps are Gaussian. Our model is updated by Stochastic Gradient Descent (SGD) with a momentum of 0.9 and a weight decay of 0.0005. For the 300-W dataset the learning rate is 0.00005, while for AFLW we set the learning rate to 0.00001. From epoch 30 to 40, the learning rate decays by a factor of 0.2; after 40 epochs, it decays by a further factor of 0.1. We train the model for more than 60 epochs with a batch size of 64. In the mixed loss function, we combine different coefficients via grid search and obtain the best result with \(\alpha _1 = 0.7\) and \(\alpha _2 = 0.3\), while \(\beta _i\) are set to \( \left\{ {0.5,0.5,1} \right\} \). Training takes about half a day on 300-W without PDB and about one day with offline PDB; with online PDB, it takes 8 days on the same CPU and GPU. For the AFLW dataset we do not apply online PDB due to time limitations.
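For concreteness, a sketch of this optimiser configuration in PyTorch is given below. The decay rule admits more than one reading; this version scales the learning rate by 0.2 from epoch 30 and by a further 0.1 from epoch 40, and the `model` variable is a placeholder for the landmark network defined elsewhere.

```python
import torch

model = torch.nn.Linear(1, 1)  # placeholder for the landmark network

# SGD with the momentum and weight decay listed above (300-W learning rate).
optimizer = torch.optim.SGD(model.parameters(), lr=5e-5,
                            momentum=0.9, weight_decay=5e-4)

# One reading of the schedule: multiply the base learning rate by 0.2
# from epoch 30, and by a further 0.1 from epoch 40.
def lr_lambda(epoch):
    if epoch < 30:
        return 1.0
    if epoch < 40:
        return 0.2
    return 0.2 * 0.1

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
```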

4.3 Results

We use the backbone network with L2 point regression as the baseline method. We then examine the effect of each proposed component on 300-W, using the NME as the evaluation metric, which is defined as:

$$\begin{aligned} NME = \frac{1}{N}\sum \limits _{k = 1}^N \frac{\left\| x_k - y_k \right\| _2}{d}, \end{aligned}$$
(7)

where \(x_k\) denotes the ground-truth landmarks of a given face, \(y_k\) denotes the corresponding prediction, and d is the normalisation term, computed as the face size, the inter-ocular distance or the inter-pupil distance.
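A small NumPy helper matching one common reading of Eq. (7), in which the error is averaged over the landmarks of each face before normalisation and then averaged over faces; the array shapes are assumptions.

```python
import numpy as np

def nme(preds, gts, d):
    """Normalised Mean Error, one reading of Eq. (7).

    preds, gts: (N, L, 2) arrays of predicted / ground-truth landmarks
    for N faces with L landmarks each.
    d: (N,) array of per-face normalisation terms (face size,
    inter-ocular or inter-pupil distance, depending on the protocol).
    """
    per_point = np.linalg.norm(preds - gts, axis=-1)   # (N, L)
    per_face = per_point.mean(axis=-1) / d             # (N,)
    return per_face.mean()
```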

4.3.1 Results on 300-W

We apply the different innovations in our experiments on 300-W. The performance of different state-of-the-art methods as well as the proposed method in terms of NME is reported in Table 2. We can see that, in spite of the accuracy loss introduced by point regression, our method achieves competitive results. With a test batch size of 16, the proposed method achieves 85 FPS on a Tesla V100 GPU.

4.3.2 Results on AFLW

As shown in Table 2, we conduct similar experiments on the AFLW dataset. The proposed method also achieves more than 80 FPS in the same environment. We also summarise the important parameters, e.g. model size and FLOPs. The model is identical to the one used for 300-W except for the last output layer. It has 15.94 M parameters, a model size of 127 MB, and requires 2.57 GFLOPs.

5 Conclusion

In this paper, we presented a robust facial landmark detector that combines coordinate and heatmap information, thereby improving the accuracy of a trained CNN. Besides, we used soft-argmax instead of argmax, as well as online PDB for training data augmentation, whose main purpose is to mitigate the dataset imbalance problem. In addition, we designed a mixed loss function that provides richer information for network training. The experimental results on 300-W and AFLW demonstrate the effectiveness of the proposed method compared with state-of-the-art approaches.