
1 Overview

Facial landmark localization, also known as facial landmark detection, aims to locate a set of predefined fiducial points on facial images. It has been successfully applied in many face-related applications. For example, facial landmarks are commonly employed to align faces for recognition tasks, including face identity recognition, facial expression recognition, and facial attribute recognition; landmark-based alignment is crucial for high recognition performance. Facial landmarks are also used as features for face manipulation, such as face aging, face swapping, face cartoonization and face attribute editing. Furthermore, facial landmark localization methods can be applied in other fields such as pose estimation [10, 23]. In recent years, with the popularity of the internet and smartphones, more and more face-related applications are deployed on mobile devices, so lightweight models are required for efficient system deployment. However, the prior facial landmark localization competitions (i.e., 300-W [1, 24], Menpo [2, 34], 300-VW [3, 26] and JD-landmark [4, 20]) focused only on accuracy, without consideration of efficiency. To push the frontier of lightweight facial landmark localization algorithms, we host the 2nd 106-point lightweight Facial Landmark Localization Challenge (FLLC) in conjunction with the \(25^\mathrm{th}\) International Conference on Pattern Recognition (ICPR2020). We extend the JD-landmark dataset used in the \(1^\mathrm{st}\) challenge with thousands of in-the-wild facial images. The upgraded JD-landmark-v2 dataset contains more than 24,000 images; Fig. 2 shows some examples from this dataset. The challenge has attracted much attention from both academia and industrial practitioners. In this paper, we introduce the details of the challenge along with the methods of the winning teams (Fig. 1).

Fig. 1. Example images of the grand challenge dataset.

2 Related Work

In order to provide a fair comparison between different methods of automatic facial landmark localization, the Intelligent Behaviour Understanding Group (I\(\cdot \)BUG) at Imperial College London held a series of competitions, including 2D/3D facial landmark localization in static imagery and 2D/3D facial landmark tracking in videos. The annotated data has been widely used by the academic and industrial communities for training and testing facial landmark localization models. Before presenting FLLC, we outline the previous competitions along with the related datasets.

2.1 Competitions

300-W Challenge. The first Automatic Facial Landmark Detection in-the-Wild Challenge (300-W Challenge [1, 24]) was held in conjunction with ICCV 2013 in Sydney, Australia. It was the first event to benchmark efforts in the facial landmark localization field. The competition provided 4,350 “in-the-wild” images with around 5,000 faces. All the faces are annotated using the same 68-landmark frontal face mark-up scheme as Multi-PIE [5].

300-VW Challenge. In conjunction with ICCV 2015, Zafeiriou et al. held the 300 Videos in the Wild (300-VW [3, 26]) challenge. Its purpose was to develop a comprehensive benchmark for evaluating in-the-wild facial landmark tracking algorithms. The competition collected a large number of long face videos recorded in the wild, each with a duration of about 1 min (at 25–30 FPS). In total, the 300-VW benchmark consists of 114 videos and 218,595 frames. All frames have been annotated with the same 68-point mark-up used in the 300-W competition.

Menpo Challenge. The 300-W and 300-VW challenges have two limitations: 1) a lack of faces in extreme poses; 2) limited test images (around 600). To address these issues, the Menpo challenge [2, 34] was held in conjunction with CVPR 2017. It consists of 5,658 semi-frontal and 1,906 profile facial images in the training set, and 5,335 frontal and 1,946 profile facial images in the test set. The 68-point mark-up scheme is used for frontal faces, while a 39-point mark-up scheme is adopted for profile faces.

3D Menpo Challenge. I\(\cdot \)BUG held the 3D Menpo Challenge [6, 33] in conjunction with ICCV 2017 to develop a comprehensive benchmark for evaluating in-the-wild 3D facial landmark localization algorithms in arbitrary poses. They fitted all the 2D faces provided by the 300-W and Menpo challenges with state-of-the-art 3D facial morphable models, and also provided 3D facial landmarks for all the videos of the 300-VW competition.

106-Point Facial Landmark Localization Challenge. As mentioned above, many efforts have been devoted to 68-point facial landmark localization. However, 68 landmarks are insufficient to depict the detailed structure of facial components. For example, the lower boundary of the eyebrows and the wings of the nose are not covered by the 68-point definition, although they are important in some cases such as face parsing [21]. To overcome this problem, a challenging dataset (named JD-landmark) was constructed and employed for the 106-point facial landmark localization competition [4, 20] in conjunction with ICME 2019.

2.2 Datasets

Large amounts of annotated data are important for training high-performance landmark localization models, especially for deep learning based methods. We summarize the commonly used 2D facial landmark datasets of static images as follows.

LFPW. The Labeled Face Parts in the Wild (LFPW [7, 11]) dataset consists of 1,432 face images downloaded from the internet using simple text queries on sites such as google.com, flickr.com, and yahoo.com. Each image was labeled with 29 fiducial points by three MTurk workers.

HELEN. The HELEN [8, 19] dataset was collected from Flickr. It contains 2,330 high-resolution images, each annotated with 194 points. It has also been extended to a face parsing benchmark [27].

AFW. The Annotated Face in-the-Wild (AFW [35]) dataset is also built from Flickr images. It includes 205 images with 473 labeled faces. For each face, six landmarks along with the pose angles and a rectangular bounding box are provided.

AFLW. The Annotated Facial Landmarks in the Wild (AFLW [9]) dataset provides a large-scale collection of images gathered from Flickr. It consists of 25,993 faces in 21,997 real-world images, each annotated with up to 21 landmarks.

300-W. The training set of the 300-W dataset [1] consists of the LFPW, AFW, HELEN and XM2VTS datasets. Each image is re-annotated with the 68-point mark-up following the landmark configuration of Multi-PIE. In addition, a new dataset (IBUG), which includes 135 images with large variations in expression, illumination and pose, was released as part of the 300-W dataset. The test set consists of 300 images captured indoors and 300 images captured outdoors.

Menpo. The training set of the Menpo dataset [2] consists of 5,658 semi-frontal and 1,906 profile facial images, and the test set contains 5,335 frontal and 1,946 profile facial images. The frontal/semi-frontal images employ the same 68-point landmark configuration as 300-W, while the profile facial images are annotated with a 39-point profile landmark scheme. All the images are taken from the LFW and FDDB datasets.

WFLW. The Wider Facial Landmarks in-the-Wild (WFLW [32]) dataset contains 10,000 faces (7,500 for training and 2,500 for testing) with 98 fully manually annotated landmarks. Apart from the landmark annotations, this dataset provides several attribute annotations, i.e., occlusion, pose, make-up, illumination, blur and expression, for comprehensive analysis of existing algorithms.

JD-landmark. The JD-landmark dataset [4] is an incremental dataset based on 300-W, composed of LFPW, AFW, HELEN and IBUG, and re-annotated with the 106-point mark-up. The dataset contains 11,393 face images for training. In addition, 2,000/2,000 facial images were collected from the open-source MegaFace dataset as the validation/test sets. JD-landmark covers large variations in pose, illumination and expression.

Fig. 2. Example images of the 2D facial landmark localization datasets. (a) LFPW; (b) HELEN; (c) AFW; (d) AFLW; (e) 300-W; (f) Menpo; (g) WFLW; (h) JD-landmark.

3 Introduction of the Competition

3.1 Datasets

Compared with the first version of the JD-landmark dataset, we expand it with about 9,000 in-the-wild facial images collected from the MegaFace dataset [15]. Each sample is annotated with 106 landmarks. In addition to facial images with extreme poses and expressions, many low-quality (low-resolution) images are added to increase the difficulty of the competition. In total, the second version, i.e., the JD-landmark-v2 dataset, consists of 20,386 images for training, 2,000 images for validation and 2,000 images for testing. Each image is provided with the 106-point landmarks along with a reference bounding box.

3.2 Evaluation Criteria

The submissions are ranked according to the Area Under the Curve (AUC) of the Cumulative Error Distribution (CED) curve. Furthermore, statistics from the CED curve, such as the failure rate and the average Normalized Mean Error (NME), are also taken into account. The CED curve gives the proportion of test images whose NME is less than a given threshold. The AUC is the area under the CED curve calculated up to a threshold \(\alpha \) and then divided by \(\alpha \). In this competition, we set \(\alpha \) to 0.08. Accordingly, any image with an NME larger than \(\alpha \) is regarded as a failure case. The NME is computed as:

$$\begin{aligned} NME = \frac{1}{N} \sum _{k=1}^N \frac{\Vert y_k-\hat{y}_k\Vert _2}{d} \end{aligned}$$
(1)

where k indexes the landmarks, N is the number of landmarks, and \(y_k\) and \(\hat{y}_k\) denote the ground truth and predicted locations of the k-th landmark of a given facial image, respectively. In order to alleviate the bias on profile faces caused by their small interocular distances, we employ the square root of the ground-truth bounding-box area as the normalization factor d, computed as \(d = \sqrt{w_{bbox} \times h_{bbox}}\), where \(w_{bbox}\) and \(h_{bbox}\) are the width and height of the enclosing rectangle of the ground truth landmarks, respectively. If no face is detected, the NME is set to infinity.
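To make the protocol concrete, the following Python sketch computes the per-image NME of Eq. (1), the failure rate, and the AUC of the CED curve up to \(\alpha = 0.08\). It is an illustrative re-implementation, not the official evaluation kit; all function names are ours.

```python
import numpy as np

def nme(pred, gt):
    """Eq. (1): mean point-to-point error over (106, 2) landmark arrays,
    normalized by sqrt(w_bbox * h_bbox) of the enclosing rectangle of
    the ground-truth landmarks."""
    w = gt[:, 0].max() - gt[:, 0].min()
    h = gt[:, 1].max() - gt[:, 1].min()
    d = np.sqrt(w * h)
    return np.mean(np.linalg.norm(pred - gt, axis=1)) / d

def auc_and_failure_rate(nmes, alpha=0.08, steps=1000):
    """AUC of the CED curve up to alpha (normalized by alpha), and the
    fraction of images with NME > alpha (failure rate)."""
    nmes = np.asarray(nmes)
    thresholds = np.linspace(0.0, alpha, steps)
    ced = [(nmes <= t).mean() for t in thresholds]  # CED curve samples
    auc = np.trapz(ced, thresholds) / alpha         # normalize area by alpha
    failure_rate = (nmes > alpha).mean()
    return auc, failure_rate

# usage over a test set of predictions and ground truths:
#   errors = [nme(p, g) for p, g in zip(preds, gts)]
#   auc, fr = auc_and_failure_rate(errors)
```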

3.3 Detailed Requirements

The upper bound of computational complexity is set to 1 GFLOPs, and the upper bound of model size is set to 20 MB. For the training/validation/test images, we provide bounding boxes obtained by an off-the-shelf face detector; nevertheless, the participants are allowed to employ their own face detectors. Apart from face detectors, no external datasets or models are allowed. Test-time augmentation and multi-model ensemble strategies are not allowed, either.
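As a rough illustration of how participants might verify these budgets (not the organizers' official checker), the following PyTorch snippet sums parameter and buffer bytes to estimate model size; FLOPs can be estimated with an external profiler such as fvcore.

```python
import torch
from torchvision.models import resnet18

def model_size_mb(model: torch.nn.Module) -> float:
    """Total size of parameters and buffers in MB."""
    n_bytes = sum(p.numel() * p.element_size() for p in model.parameters())
    n_bytes += sum(b.numel() * b.element_size() for b in model.buffers())
    return n_bytes / (1024 ** 2)

model = resnet18()  # stand-in: participants would check their own network
print(f"model size: {model_size_mb(model):.1f} MB (budget: 20 MB)")
# FLOPs against the 1 GFLOPs budget can be estimated with e.g. fvcore:
#   from fvcore.nn import FlopCountAnalysis
#   flops = FlopCountAnalysis(model, torch.randn(1, 3, 256, 256)).total()
```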

The \(2^\mathrm{nd}\) 106-point Lightweight Facial Landmark Localization (FLLC) grand challenge began on July 13, 2020. During the validation phase (from July 27 to October 08), the participants were allowed to evaluate their models on the validation set, and the leaderboard on the validation set was updated daily with respect to the submissions. The test images were released on October 09. To prevent cheating on the test set, each team was given a 24-h window to submit its predicted test results (Fig. 3).

Fig. 3. The CED curves of the top three teams.

Table 1. Leaderboard of the \(2^\mathrm{nd}\) 106-point Lightweight Facial Landmark Localization Competition. The top three teams are ranked according to the AUC of the CED curve.

Rank | Team | AUC | NME | Failure rate
1 | Xu et al. (SogouAI) | 80.52% | 1.58% | 0.05%
2 | Huang et al. (OPPO Research Institute) | 79.92% | 1.63% | 0.05%
3 | Tong et al. (Vision Intelligence Center, Meituan) | 78.88% | 1.71% | 0.05%

4 Summary of Participants

The competition attracted much attention from both academia and industry. More than 70 teams registered for the competition; 15 teams participated in the validation phase, and 9 of them entered the final test phase.

As shown in Table 1, the champion of the competition is Xu et al. from SogouAI. Huang et al. from OPPO Research Institute achieved second place, and third place went to Tong et al. from the Vision Intelligence Center of Meituan. Next, we give a brief introduction to the algorithms of the top three teams.

Xu et al. employ the advanced HRNet [14] for facial landmark prediction, which maintains high-resolution representations throughout the whole process and connects multi-resolution sub-networks in parallel. In order to reduce the computational complexity, inverted residuals [25] are adopted, with the expansion ratio set to 1 in all inverted residuals. To achieve good results, they increase the number of channels in the branches of the HRNet blocks and utilize group convolution [16] in the last few layers. The network structure settings are given in Table 2. During the training phase, they apply several forms of data augmentation, including random rotation and random cropping. In particular, they employ the PDB strategy [12] against pose variations, which duplicates large-pose samples multiple times. Finally, they won first place with an AUC of 80.52%, an NME of 1.58% and a failure rate of 0.05%. The model size is about 12 MB and the complexity is 887.073 MFLOPs.

Table 2. The network structure settings of Xu et al.
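For reference, the inverted residual block [25] they adopt can be sketched in PyTorch as below. With the expansion ratio set to 1, the pointwise expansion disappears and the block reduces to a depthwise 3×3 convolution followed by a linear projection. This is a generic sketch of the building block, not SogouAI's released code.

```python
import torch.nn as nn

class InvertedResidual(nn.Module):
    """MobileNetV2-style inverted residual [25]; expansion ratio 1 as in Xu et al."""
    def __init__(self, in_ch, out_ch, stride=1, expand=1):
        super().__init__()
        hidden = in_ch * expand
        self.use_residual = stride == 1 and in_ch == out_ch
        layers = []
        if expand != 1:  # pointwise expansion (skipped when expand == 1)
            layers += [nn.Conv2d(in_ch, hidden, 1, bias=False),
                       nn.BatchNorm2d(hidden), nn.ReLU6(inplace=True)]
        layers += [
            # depthwise 3x3 convolution
            nn.Conv2d(hidden, hidden, 3, stride, 1, groups=hidden, bias=False),
            nn.BatchNorm2d(hidden), nn.ReLU6(inplace=True),
            # linear pointwise projection (no activation)
            nn.Conv2d(hidden, out_ch, 1, bias=False),
            nn.BatchNorm2d(out_ch),
        ]
        self.block = nn.Sequential(*layers)

    def forward(self, x):
        out = self.block(x)
        return x + out if self.use_residual else out
```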

Huang et al. propose a multi-level supervision strategy to train facial landmark localization models. They take ResNet-18 [13] as the backbone and reduce the channel widths of the last two residual blocks from 256/512 to 192/256 due to the limits on computational complexity. Instance Normalization [31] is adopted instead of Batch Normalization, which better captures individual differences in detail without increasing the computational overhead. As shown in Fig. 4, apart from the main branch, an additional branch from the feature map of the \(3^\mathrm{rd}\) block is introduced to predict the contour landmarks, and mean aggregation is used to produce the final output; a rough sketch follows. The AUC, NME and failure rate are 79.92%, 1.63% and 0.05%, respectively; the model size is 17 MB and the complexity is 995.922 MFLOPs. Huang et al. gained second place in the competition.
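The two-branch design can be sketched roughly as below. The head structure, the number of contour points, and the exact aggregation rule are our illustrative assumptions (they are not fully specified in the text), and the described channel reduction to 192/256 and the Instance Normalization swap are omitted for brevity.

```python
import torch.nn as nn
from torchvision.models import resnet18

class TwoBranchLandmarkNet(nn.Module):
    """Rough sketch of Huang et al.'s multi-level supervision: a main head on
    the final block plus an auxiliary contour head on the 3rd block, with the
    overlapping contour predictions mean-aggregated."""
    def __init__(self, n_points=106, n_contour=33):
        super().__init__()
        backbone = resnet18()
        self.stem = nn.Sequential(backbone.conv1, backbone.bn1, backbone.relu,
                                  backbone.maxpool, backbone.layer1, backbone.layer2)
        self.block3 = backbone.layer3              # feeds the auxiliary branch
        self.block4 = backbone.layer4
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.main_head = nn.Linear(512, n_points * 2)      # all 106 points
        self.contour_head = nn.Linear(256, n_contour * 2)  # contour points only
        self.n_points, self.n_contour = n_points, n_contour

    def forward(self, x):
        f3 = self.block3(self.stem(x))             # 3rd-block feature map
        f4 = self.block4(f3)                       # final feature map
        all_pts = self.main_head(self.pool(f4).flatten(1)).view(-1, self.n_points, 2)
        contour = self.contour_head(self.pool(f3).flatten(1)).view(-1, self.n_contour, 2)
        out = all_pts.clone()
        # mean-aggregate the two branches' contour predictions (assumed to be
        # the first n_contour points of the 106-point mark-up)
        out[:, :self.n_contour] = (all_pts[:, :self.n_contour] + contour) / 2
        return out
```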

Tong et al. take the improved HRNet [28] structure as the backbone, in which the bottleneck block [13] and group convolution [29] are used to replace the standard residual block of the original HRNet. In order to prevent the accuracy loss caused by coordinate quantization, they use a mapping function named Dual Soft Argmax (DSA [18]) to map the heatmap response to the final coordinates, which overcomes the weight imbalance problem of Soft Argmax (SA [22]). The Normalized Mean Error (NME) loss [17] is taken as the training loss. Besides, inspired by [30], they propose a Similarity-FeatureMap knowledge distillation model. As Fig. 5 shows, it guides the training of a student network by preserving the feature-map similarity of input pairs according to the teacher network. Specifically, similarity matrices are derived from the feature maps, and a distillation loss is computed between the matrices produced by the student network and the teacher network. Finally, the submitted model achieves an AUC of 78.88%, an NME of 1.71% and a failure rate of 0.05%; the model size is about 16 MB and the complexity is 996.190 MFLOPs. Tong et al. won third place.
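DSA builds on the standard Soft Argmax decoding [22], which converts each heatmap into sub-pixel coordinates as the expectation of pixel positions under a softmax-normalized response. A minimal PyTorch sketch of this SA baseline is given below; DSA itself is a variant addressing the weight imbalance and is not reproduced here.

```python
import torch

def soft_argmax(heatmaps, beta=100.0):
    """Soft Argmax [22]: decode (B, K, H, W) heatmaps into (B, K, 2) xy
    coordinates as the expectation of pixel positions under a softmax of
    the responses. beta sharpens the distribution."""
    b, k, h, w = heatmaps.shape
    probs = torch.softmax(beta * heatmaps.view(b, k, -1), dim=-1).view(b, k, h, w)
    xs = torch.arange(w, dtype=heatmaps.dtype, device=heatmaps.device)
    ys = torch.arange(h, dtype=heatmaps.dtype, device=heatmaps.device)
    x = (probs.sum(dim=2) * xs).sum(dim=-1)  # marginal over rows, E[x]
    y = (probs.sum(dim=3) * ys).sum(dim=-1)  # marginal over cols, E[y]
    return torch.stack([x, y], dim=-1)
```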

Fig. 4. The network structure of Huang et al.

Fig. 5. Similarity-FeatureMap knowledge distillation guides the training of a student network such that input pairs that produce similar (dissimilar) feature maps in the pre-trained teacher network also produce similar (dissimilar) feature maps in the student network. Given an input mini-batch of b images, similarity matrices are derived from the feature maps, and a distillation loss is computed on the matrices produced by the student and the teacher.
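The distillation loss in Fig. 5 follows the similarity-preserving formulation of [30]: for a mini-batch of b images, a b×b similarity matrix is built from each network's flattened feature maps, row-normalized, and the student's matrix is matched to the teacher's in the Frobenius norm. A minimal sketch under that assumption:

```python
import torch
import torch.nn.functional as F

def similarity_kd_loss(f_student, f_teacher):
    """Similarity-preserving distillation [30]: match the b x b batch
    similarity matrices computed from (B, C, H, W) feature maps."""
    def sim_matrix(f):
        f = f.flatten(1)                   # (B, C*H*W)
        g = f @ f.t()                      # (B, B) pairwise similarities
        return F.normalize(g, p=2, dim=1)  # row-wise L2 normalization
    gs, gt = sim_matrix(f_student), sim_matrix(f_teacher)
    b = gs.size(0)
    return ((gs - gt) ** 2).sum() / (b * b)  # squared Frobenius norm / b^2
```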

5 Conclusion

In this paper, we first summarize the prior facial landmark localization challenges and the commonly used 2D facial landmark datasets. We then introduce the \(2^\mathrm{nd}\) 106-point lightweight facial landmark localization grand challenge in detail, for which we construct and release a new facial landmark dataset, named JD-landmark-v2. Compared with the previous challenges, our work focuses on lightweight facial landmark localization models, which are important for efficient system deployment. More than 70 teams participated in the competition and 9 teams entered the final evaluation; we introduce the methods and performance of the top three teams in this paper. We hope this work can push the frontier of lightweight facial landmark localization algorithms.