Introduction

In recent years, intelligent surveillance systems have been widely studied [1, 2]. The combination of robotics and artificial intelligence has given rise to outstanding developments in the fields of cognitive robotics and human-robot interaction [3]. Nowadays, several academic and industrial research groups are engaged in the design of intelligent robots that act autonomously using deep learning-based algorithms [4, 5] to analyze data acquired from heterogeneous sensors, such as cameras, 3D cameras, stereo cameras, microphones, and LIDAR [6], to understand their surroundings (scenes, objects, people), and to dynamically adapt their interaction with humans and the environment [7, 8]. Object detection or recognition is one of the most fundamental and challenging problems in computer vision [9, 10]. As a longstanding, challenging problem in object detection, facial landmarks detection (FLD) has been an active area of research for several decades [11].

Facial landmarks detection, also known as face alignment, is the process of locating a set of predefined key points such as the eye corners, mouth, brows, and tip of the nose [12]. Because it is a prerequisite for other computer vision applications, the detection of these facial points must be robust and reliable. For example, facial landmark localization is required for many applications such as head pose estimation [13], face recognition [14,15,16], face emotion recognition [17, 18], gender recognition [19, 20], facial beautification [21], and facial expression recognition [22]. For these applications to succeed, highly accurate detection is a must. The practical relevance of FLD has attracted the efforts of both industry and academia, which has led to significant progress in recent years. Despite this progress, locating facial points exactly in uncontrolled settings remains an exceedingly difficult problem [23, 24]. Besides, many of the existing methods are designed to capture the local spatial relationships among sets of facial points, ignoring that these spatial relationships are high order and global [25].

Cascaded regression is regarded as one of the most promising state-of-the-art approaches, since each stage refines the prediction of its predecessor, but the loss of information across the cascading stages makes it fail in complicated real-world cases [26]. Cascaded deep convolutional neural networks can learn a large number of essential filters and combine them hierarchically to describe latent concepts for efficient feature discrimination, and they can withstand strong deformations of the human face and extreme pose changes. Given these capabilities, they can successfully detect facial landmarks. On the other hand, the loss of spatial information due to reduced resolution, as well as the difficulty of imposing a proper facial shape on the set of estimated landmarks, reduces their accuracy [27]. To solve this issue, we propose using heatmap coupling to prevent the loss of crucial feature information related to the input and to transmit this information to the cascaded layers, where it can serve as an initialization for the cascaded CNN regressors.

Fig. 1 The first column shows the face images from the datasets with different landmark annotations. The second column is the output of the heatmap conversion module. The final estimate of the landmarks is shown in the third column

In this regard, the shape S can be progressively refined by estimating the shape increment \(\Delta S\), which is learned stage by stage [28]. Given the facial image I and the initial face shape \(S^0\) or the previous face shape \(S^{t-1}\), the regressor \(R^t\) computes \(\Delta S^t\) from the image features at each stage t. The main aim of cascaded regression is to produce a sequence of updates (\(\Delta S^1\), ..., \(\Delta S^{T}\)) that starts from the initial shape \(S^0\) and converges to the ground truth shape \(S^*\) (i.e., \(S^0 + \sum _{t=1}^{T} \Delta S^t \approx S^*\)). Accordingly, the new face shape \(S^t\) is updated in a cascaded way using

$$\begin{aligned} S^t=S^{t-1}+R^t\left( I,S^{t-1}\right) \end{aligned}$$
(1)

where \(t=1,\dots ,T\) and \(R^t\) is a linear regressor that can be formulated by

$$\begin{aligned} R^t=\mathop {\arg \min }_{R^t} \sum _{i=1}^{N} \Vert (S_i^*-S_i^{t-1})-R^t(\phi (I_{i},S_{i}^{t-1}))\Vert \end{aligned}$$
(2)

where t refers to the current iteration, \(R^t\) maps the shape-indexed feature \(\phi (I_i,S_{i}^{t-1})\) to the shape residual (\(S_i^*-S_i^{t-1}\)), and N is the number of training samples.
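The update in Eq. (1) and the fit in Eq. (2) can be illustrated with a minimal NumPy sketch. Several choices here are assumptions made only for illustration: the hypothetical `features_fn` stands in for the shape-indexed feature extractor \(\phi\), each regressor is fit by ordinary least squares (that is, the norm in Eq. (2) is taken as a squared Euclidean norm), and a bias column is appended to the features.

```python
import numpy as np

def fit_cascade(features_fn, images, shapes_true, shape_init, num_stages=3):
    """Fit one linear regressor per stage on the current shape residuals."""
    shape_init = np.asarray(shape_init, dtype=float)
    # Every training sample starts from the same initial shape S^0.
    shapes = np.repeat(shape_init[None, :], len(images), axis=0)
    regressors = []
    for _ in range(num_stages):
        # phi(I_i, S_i^{t-1}) for every sample, plus a bias column (assumption).
        phi = np.stack([features_fn(img, s) for img, s in zip(images, shapes)])
        phi = np.hstack([phi, np.ones((len(phi), 1))])
        residual = shapes_true - shapes  # S_i^* - S_i^{t-1}
        # Least-squares solution of Eq. (2) under the squared-norm assumption.
        R, *_ = np.linalg.lstsq(phi, residual, rcond=None)
        regressors.append(R)
        shapes = shapes + phi @ R  # Eq. (1): S^t = S^{t-1} + R^t(...)
    return regressors

def apply_cascade(features_fn, image, shape_init, regressors):
    """Refine an initial shape by applying the fitted stages in order."""
    shape = np.array(shape_init, dtype=float)
    for R in regressors:
        phi = np.append(features_fn(image, shape), 1.0)
        shape = shape + phi @ R
    return shape
```

In practice the features depend on the current shape, which is what makes later stages useful; with features that ignore the shape, the first stage already captures everything a linear model can.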

To overcome the drawbacks and limitations of the existing techniques, we present an accurate and efficient FLD method based on a two-stage coordinate regression that is coupled with a heatmap module. The proposed method is called coordinate regression with heatmap coupling (CR-HC). The regression model extracts the shape of the facial landmarks as a coordinate vector in a coarse-to-fine manner. In the first stage, the input is regressed using a simple CNN that generates N landmarks. The generated landmarks are transformed by the heatmap module into a Gaussian heatmap with the same dimensions as the input image. The second stage refines the first estimate by regressing the combination of the input and heatmap images. Figure 1 shows the face images, the outputs of the heatmap module, and the final landmark estimates for different annotation schemes.

In brief, the main contributions of the work can be summarized as follows.

  1. We design a robust deep convolutional neural network model from scratch for facial landmarks detection.

  2. Unlike conventional methods based on cascaded coordinate regression, we propose a new stage coupling scheme based on a heatmap module that carries the input features over to the next stage, which reduces the network complexity.

  3. The proposed network can be applied to images of different resolutions and achieves competitive results at \(128\times 128\) resolution, even though the key facial points are hard to discriminate in low-resolution images.

  4. Experiments on three challenging benchmark datasets are conducted to evaluate the proposed method, which achieves top performance on the three datasets compared to state-of-the-art methods.

  5. The proposed method offers an execution time better suited to reliable applications than other FLD methods that run fast but have a high normalized mean error.

  6. Finally, we provide an open repository of the source code to the community for further research activities.

The rest of the paper is organized as follows. The “Related Works” section briefly discusses the FLD methods in the literature. The “The Proposed Method” section describes the proposed FLD method. The evaluated datasets and experiments are presented in the “Experiments and Results” section. The “Ablation Study” section evaluates the effectiveness of the proposed two-stage CR-HC. Finally, the conclusions and future works are given in the “Conclusion” section.

Related Works

Facial landmarks detection has made large strides in the last two decades, thanks to technological advancements. Despite the positive results obtained, FLD still needs to become more robust to the static and non-static face deformations caused by occlusion, facial expression, and head motion [29]. FLD remains affected by these conditions, which makes it unreliable in real-world settings. Generally, the traditional landmark detection methods can be divided into template-based approaches and regression-based methods [30]. In recent years, deep learning models such as convolutional neural networks have improved facial landmarks detection [31], and they can be categorized into coordinate regression and heatmap regression models. The following part presents a brief review of the state-of-the-art methods in the field of facial landmarks detection.

Conventional FLD Approaches

Template fitting models build a parametric shape model from the training dataset and fit this model to the test image during the testing phase. The most popular template-based method is the active shape model (ASM) [32], in which the face shape is represented by a linear combination of fundamental shapes learned using principal component analysis (PCA). The linear model of the shape description S of an object can be formulated as follows

$$\begin{aligned} {S = \bar{S}+\sum _{i=1}^{n}w_i\tilde{S}_i,} \end{aligned}$$
(3)

where \(\bar{S}\) is the average shape of the described object, \(w_i\) is the weighting factor of the i-th mode, and \(\tilde{S}_i\) is the i-th mode of the object. The model is built from the pre-aligned point clouds \(\tilde{S}_{1...m}\) of the training set, where each of the m training samples is a point cloud that describes the shape of an object. \(\bar{S}\) is the average of the point clouds \(\tilde{S}_{1...m}\), and the modes \(\tilde{S}_{1...n}\) are the result of the PCA. In addition, the PCA can describe the variation in the appearance of the face shape. The appearance of ASM is modeled by a variety of pre-trained template models, which are the active appearance model (AAM) [33] and PCA models. The appearance in a regular coordinate system eliminates shape alterations and the shape representation is identical to that of ASM and AAM. In [34], a matching approach was proposed for generating a collection of area template detectors using a combined shape and texture appearance model. Although these traditional approaches give good results under constrained conditions, they fail in the wild because they are sensitive to large head poses and occlusion. Moreover, both AAM and ASM cannot handle the nonlinearity of faces with large head poses because these methods are linear in nature; in addition, the irregularity of the face shape can lead to self-occlusion.
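The linear shape model of Eq. (3) can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: training shapes are taken to be already aligned and flattened into row vectors, the modes are obtained from an SVD of the centered data (one common way to compute PCA), and the function names are hypothetical.

```python
import numpy as np

def fit_shape_model(shapes, n_modes):
    """Return the mean shape and the first n_modes PCA modes (as rows)."""
    mean_shape = shapes.mean(axis=0)
    centered = shapes - mean_shape
    # Rows of vt are orthonormal principal directions of the centered shapes.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return mean_shape, vt[:n_modes]

def project(shape, mean_shape, modes):
    """Weights w_i of a shape with respect to the modes."""
    return modes @ (shape - mean_shape)

def synthesize(weights, mean_shape, modes):
    """Eq. (3): mean shape plus the weighted sum of the modes."""
    return mean_shape + weights @ modes
```

A shape that lies in the span of the retained modes is reproduced exactly by `synthesize(project(...))`; other shapes are approximated by their closest point in that linear subspace, which is the linearity limitation noted above.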

Regression-based approaches directly learn the mapping from the image to the landmarks. They can use direct regression, which predicts the landmark locations without any initialization, or cascaded regression, which locates the landmarks stage by stage starting from an initial shape estimate. The structure information and shape constraints can be learned during the prediction process. The \(L_2\) loss function is usually adopted to measure the point-wise difference between the predicted (\(S_k\)) and ground truth (\(S_k^*\)) landmarks as

$$\begin{aligned} L_{2} = \dfrac{1}{K}\sum _{k=1}^{K}\Vert S_{k}-S_{k}^{*}\Vert _{2} \end{aligned}$$
(4)

where K is the number of landmarks.

For sequential faces, discriminative response map fitting (DRMF) was proposed in [35], which uses discriminative regression to estimate the model parameters of a part-based model. In [36], a regression forest was used to estimate the face shape conditioned on auxiliary facial characteristics such as head pose and gender. The authors of [37] proposed an ensemble of regression trees, in which a gradient boosting algorithm learns each regressor and the regressors are added to the ensemble in a cascaded manner. The authors of [38] proposed a cascaded regression method that uses the \(L_{2,1}\) normalization factor instead of the least-squares regressor, and multiple initializations are required to make the regressor robust to poor initialization. In the supervised descent method (SDM), SIFT features extracted around the current landmarks are used to solve a sequence of linear least-squares problems iteratively [39]. Local binary features have also been learned for cascaded regression, since extracting and regressing local binary features is computationally inexpensive [40]. Cascaded regression can improve the final facial landmark locations, but it depends on the accuracy of the initial estimate [30]. Moreover, these traditional approaches depend on handcrafted feature extraction, so some important information in the image is lost, which in turn lowers the detection performance.

Deep Learning-Based Approaches

This type of facial landmarks detection directly maps the face image to the landmark coordinates using deep learning models [41]. In the early work of [42], a cascaded CNN was proposed in which the face image is divided into different parts and each part is processed individually by a separate deep CNN. The outputs of these CNNs are then combined and fed into a final deep CNN to generate the final facial coordinates. A task-constrained deep convolutional network (TCDCN) was proposed to optimize facial landmarks detection jointly with correlated auxiliary tasks such as head pose, gender, and expression [43]. Inspired by knowledge distillation, [25] suggested a loss function for training a lightweight model consisting of two networks: a backbone network that regresses the coordinates of the facial landmarks and an auxiliary network that estimates the Euler angles of roll, pitch, and yaw. The latter network is used only during the training phase, which makes the model more practical in terms of model size and processing time. One drawback of these methods is that they need specially annotated datasets containing both landmarks and annotations for the other tasks, which most FLD datasets do not provide. A recurrent neural network and a deep neural network are utilized to estimate the facial coordinates in [44]. This model consists of two networks: a global network with long short-term memory that estimates the initial shape and a second network that uses a component-based search method to generate the final shape. In [45], a two-stage branched convolutional neural network combined with Jacobian deep regression (BCNN-JDR) was proposed. The initialization stage consists of a branched CNN that estimates the face parts individually, and the refinement stage refines the result in a cascaded manner. The work of [46] focuses on the loss function used to train the facial landmark detection model and designs a new loss function named rectified wing loss (Rwing). The developed loss function handles small and medium errors better than conventional loss functions. Although coordinate regression is simple and fast, it is not accurate on its own and needs to be applied in a cascaded manner to reach high accuracy, which sometimes leads to a loss of information during the cascading.

Heatmap regression is the process of finding the likelihood of specific key points residing in the ground truth heatmaps. Methods of this type usually use a fully convolutional framework so that they can regress multiple heatmaps of the same size as the input image. To address the facial landmarks detection problem, [47] presented a multi-order multi-constraint deep network (MMDN) based on the consolidation of an implicit multi-order correlated geometry-aware model and the explicit probability-based boundary-adaptive regression (EPBR) method. Moreover, the authors of [48] proposed a style aggregated network (SAN) that generates a training dataset with new styles using a generative adversarial module and then trains a heatmap regression network on the generated data together with the original data. In [49], a heatmap regression network is built on the stacked hourglass network by stacking four hourglasses and improving them with hierarchical, parallel, and multiscale residual blocks. Yin et al. [50] address the complexity of \({\mathbf{2D}}\) heatmap regression by designing an attentive \({\mathbf{1D}}\) heatmap regression model that generates two groups of \({\mathbf{1D}}\) heatmaps to represent the marginal distributions of the x and y coordinates. Real and fake localizations are discriminated using geometric priors on the face landmarks based on the conditional generative adversarial network (CGAN). CNN-based face localization using a coarse and robust heatmap estimation followed by a regression-based refinement is introduced in [51]. This method has two sub-networks: the first estimates heatmap-based encodings of the facial landmark locations, and the second receives the outputs of the heatmap estimation unit as inputs and refines them by regression. Although heatmap regression provides good accuracy, it suffers from complexity, high execution time, and sensitivity to outliers.

Fig. 2 Structure of the proposed method based on two stages of coordinate regression neural networks with a heatmap coupling module

The Proposed Method

In this work, we introduce a new facial landmark detection method called CR-HC based on a two-stage coordinate regression model with a heatmap coupling. The proposed method aims to predict N points represented by a shape vector S, where

$$\begin{aligned} S=[x_1,y_1,x_2,y_2,\dots ,x_N,y_N]=[P_1,P_2,\dots ,P_N] \end{aligned}$$
(5)

where \(P_n=(x_n,y_n)\) represents the n-th landmark in the face image \(I\in {\mathbb R}^{h\times w\times c}\), where h and w are the height and width of the face image, respectively, while c denotes the number of color channels (e.g., for an RGB image, \(c=3\)).

The CR-HC method consists of a base regression model and a heatmap coupling module. The regression model extracts the shape of the facial landmarks as a coordinate vector in a coarse-to-fine manner: the base model is stacked, and the second copy refines the output of the first with the help of the heatmap coupling module. The overall architecture of the proposed model is shown in Fig. 2. Each part of the proposed method is described in detail in the following subsections.

The Base Model Structure

The backbone network in the proposed model is a custom-built convolutional neural network. The CNN is designed to be simple and effective so that it remains flexible when stacked in a multi-stage architecture. The number of layers in the proposed two-stage CR-HC was determined empirically by trying many layer configurations and hyperparameter values and choosing those with the best results. The network is made up of stacked convolution blocks, each built from a \(3\times 3\) convolution layer followed by batch normalization and a ReLU activation. Each stage has seven convolutional blocks, a \(2\times 2\) pooling layer, and two fully connected (FC) layers.

For a 2D image, the convolution operation can be expressed as in (6), in which \(k(x,y)\) is the kernel function.

$$\begin{aligned} (I*k)(x,y)=\sum _{u,v} I(u,v)\, k(x-u,y-v) \end{aligned}$$
(6)
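For illustration, the discrete 2D convolution, in which each output value sums the products \(I(u,v)\,k(x-u,y-v)\) over the kernel support, can be written directly as a double loop. The sketch below is a hypothetical single-channel helper that keeps only the output positions where the kernel fits entirely inside the image; note that convolution layers in deep learning libraries usually compute cross-correlation, that is, the same sum without flipping the kernel.

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Single-channel 2D convolution over positions where the kernel fits."""
    kh, kw = kernel.shape
    # Flipping the kernel turns the sliding dot product into a true convolution.
    flipped = kernel[::-1, ::-1]
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.empty((out_h, out_w), dtype=float)
    for y in range(out_h):
        for x in range(out_w):
            out[y, x] = np.sum(image[y:y + kh, x:x + kw] * flipped)
    return out
```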

In the fully connected layer, the input and output images have the same size to reduce the matrix-vector multiplication, while the pooling layer is employed to acquire the invariance against image deformation. It divides the input image into \(b\times b\) blocks and chooses the maximum value of each block such that

$$\begin{aligned} \text {pool}_b(I_{h\times w\times c})=\max _{0\le x<b,0\le y<b} I_{(h\times b+x)\times (w\times b+y)\times c}. \end{aligned}$$
(7)
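A minimal sketch of the block-wise maximum described by Eq. (7), assuming non-overlapping \(b\times b\) blocks (stride equal to b) and dropping any rows or columns that do not fill a complete block; the function name is hypothetical.

```python
import numpy as np

def max_pool(feature_map, b=2):
    """Non-overlapping b x b max pooling on an (h, w, c) feature map."""
    h, w, c = feature_map.shape
    h_out, w_out = h // b, w // b
    # Drop incomplete border blocks (assumption), then group pixels by block.
    trimmed = feature_map[:h_out * b, :w_out * b, :]
    blocks = trimmed.reshape(h_out, b, w_out, b, c)
    # Take the maximum over the two within-block axes.
    return blocks.max(axis=(1, 3))
```

With `b=2`, a \(128\times 128\) feature map is reduced to \(64\times 64\) while the number of channels is unchanged.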

The size of the output feature map is determined by the stride s and padding p of each layer as

$$\begin{aligned} {\begin{matrix} h_{l+1}=\frac{h_l-\acute{h}_l+p}{s}+1 \\ w_{l+1}=\frac{w_l-\acute{w}_l+p}{s}+1 \\ c_{l+1}=m_l \end{matrix}} \end{aligned}$$
(8)

where l is the layer index, \(m_l\) denotes the number of kernels in layer l, and \(\acute{h}_l\) and \(\acute{w}_l\) are the height and width of the layer’s kernel, respectively.
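As a small worked example of Eq. (8), the helper below computes one spatial dimension of the output. It assumes that p is the total padding added along that dimension (both sides together) and that the division is rounded down, which is how most frameworks treat sizes that do not divide evenly; both are assumptions, since the text does not state them.

```python
def output_size(in_size, kernel, stride=1, total_padding=0):
    """One spatial dimension of Eq. (8), with p taken as the total padding."""
    return (in_size - kernel + total_padding) // stride + 1

# A 3x3 convolution with one pixel of padding on each side keeps 128 -> 128.
same_conv = output_size(128, kernel=3, stride=1, total_padding=2)
# A 2x2 pooling layer with stride 2 halves the size: 128 -> 64.
pooled = output_size(128, kernel=2, stride=2, total_padding=0)
```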

Because the proposed method uses a deep convolutional model, training can be difficult: such models are sensitive to the initial random weights and to the configuration of the learning algorithm. This issue is mitigated by batch normalization, which standardizes the inputs to a layer for each mini-batch and reduces the generalization error. Table 1 describes the CNN layers in detail.

Let \(I_1 \in {\mathbb R}^{h\times w\times c}\) be the input face image to the first stage with three color channels (\(c=3\)), where \(h \times w\) equals \(128\times 128\). The first convolution block has 64 output channels, and the number of channels is doubled in each subsequent convolution block but halved in the last two blocks of each stage, as shown in Table 1. The output shape vector of the first stage is \(S \in {\mathbb R}^{2\times N}\), where N is the number of detected landmarks. The coordinate vector S is converted into a \(\mathbf {2D}\) heatmap \(\mathbf {H} \in {\mathbb R}^{h\times w}\) by the heatmap coupling module. Then, the heatmap generated from the first stage is concatenated with the input face image to form the new feature map \(I_2 \in {\mathbb R}^{h\times w\times 4}\) that enters the second stage. It is noteworthy that the two stages are identical in structure but have different input and output characteristics. Moreover, the coupling point is not the last layer of each stage; it is located approximately halfway through each stage.

Table 1 Structure of each stage in the proposed method

The Heatmap Coupling Module

Cascaded deep convolutional networks have lately demonstrated remarkable results in FLD tasks as more levels are cascaded and the model becomes deeper. On the other hand, they suffer from several issues, for example when the processed images are captured under unconstrained conditions. Two factors reduce the accuracy of the cascaded model. First, the chain of multiple convolution and pooling layers reduces the resolution of the feature maps, which causes a loss of spatial information. Second, there is an initialization problem, in which the refinement process depends on the starting face shape. By passing information to the cascaded stage, the heatmap coupling module addresses the first issue and also serves as an initialization layer for the second stage. The heatmap conversion module converts the initially detected \({\mathbf{1D}}\) shape vector into a \({\mathbf{2D}}\) heatmap by applying a Gaussian kernel as

$$\begin{aligned} H = \exp \left( -\dfrac{(X-x_{p})^2+(Y-y_{p})^2}{2\sigma ^2}\right) \end{aligned}$$
(9)

where X and Y denote the pixel coordinates of the heatmap, \(x_p\) and \(y_p\) are the coordinates of the predicted landmark, which is the center of the blob, and \(\sigma\) is the spread of the blob.

The concatenation of the face image and the generated \(\mathbf{2D}\) heatmap from the first stage is used as the input to the next stage as in (10). These concatenated feature patches encode sufficient information about the local appearance around the current \(\mathbf{2D}\) landmarks and allow the second stage to fine-tune the detected landmarks. The conversion details are illustrated in Algorithm 1.

figure a
$$\begin{aligned} I_{s2} = I \oplus H. \end{aligned}$$
(10)
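The conversion in Eq. (9) and the concatenation in Eq. (10) can be sketched with NumPy as follows. This is not the authors' Algorithm 1 but a minimal illustration under stated assumptions: landmarks are given in pixel units as interleaved (x, y) values, the value of \(\sigma\) is arbitrary, and the blobs of the individual landmarks are merged into a single channel with a pixel-wise maximum (a sum would be an equally plausible choice). The function names are hypothetical.

```python
import numpy as np

def landmarks_to_heatmap(landmarks, height, width, sigma=2.0):
    """Render one Gaussian blob per landmark into a single (h, w) heatmap."""
    ys, xs = np.mgrid[0:height, 0:width]  # pixel grids Y and X of Eq. (9)
    heatmap = np.zeros((height, width), dtype=float)
    points = np.asarray(landmarks, dtype=float).reshape(-1, 2)
    for x_p, y_p in points:
        blob = np.exp(-((xs - x_p) ** 2 + (ys - y_p) ** 2) / (2.0 * sigma ** 2))
        # Merging by maximum is an assumption; it keeps values within [0, 1].
        heatmap = np.maximum(heatmap, blob)
    return heatmap

def couple(image, heatmap):
    """Eq. (10): append the heatmap to the image as an extra channel."""
    return np.concatenate([image, heatmap[..., None]], axis=-1)
```

For a \(128\times 128\) RGB input, `couple` returns an array of shape (128, 128, 4), which matches the four-channel input of the second stage.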

CR-HC Loss Function

To train the CR-HC model, we used the mean absolute error (MAE) loss function \(L_{MAE}\), which is the sum over the model stages of the \(L_1\) losses between the predicted landmarks and the ground truth landmarks. \(L_{MAE}\) can be defined as

$$\begin{aligned} L_{MAE} = \sum _{1}^{s}\dfrac{1}{K}\sum _{i=1}^{K}\sum _{j=1}^{N} \Vert P_{i,j}-G_{i,j}\Vert \end{aligned}$$
(11)

where s represents the stage number, K is the number of inputs, N is the number of landmarks, \(P_{i,j}\) and \(G_{i,j}\) are the detected and ground truth landmarks. Steps of the training process of the CR-HC model are provided in Algorithm 2.
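A minimal NumPy sketch of Eq. (11) is given below. It assumes that the norm is the \(L_1\) norm of each two-dimensional coordinate difference and that every stage is compared against the same ground truth; the function name and array layout are hypothetical. Library implementations of MAE typically also average over the coordinates, which changes the value only by a constant factor.

```python
import numpy as np

def multi_stage_mae(stage_predictions, ground_truth):
    """Sum over stages of the batch-averaged L1 landmark error, as in Eq. (11).

    stage_predictions: list of (K, N, 2) arrays, one per stage.
    ground_truth: (K, N, 2) array shared by all stages (assumption).
    """
    ground_truth = np.asarray(ground_truth, dtype=float)
    total = 0.0
    for pred in stage_predictions:
        diff = np.abs(np.asarray(pred, dtype=float) - ground_truth)
        # Sum over landmarks and coordinates, then average over the K inputs.
        total += diff.sum(axis=(1, 2)).mean()
    return float(total)
```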

figure b

Experiments and Results

To assess the proposed method, several experiments are carried out on a variety of hard benchmarks with different annotation schemes, including the Annotated Facial Landmarks in the Wild (AFLW) dataset [52], the 300 Faces in the Wild (300W) dataset [53], and the Wider Facial Landmarks in the Wild (WFLW) dataset [54]. All experiments are implemented using the Keras library on two NVIDIA Tesla K80 GPUs. The training images are cropped according to the provided bounding boxes, resized to \(128\times 128\), and represented using RGB values. All training images are normalized by subtracting the mean image of the training set and dividing by its standard deviation. For the 300W dataset, image rotation, flipping, and pixel shifting are applied as data augmentation. For AFLW and WFLW, we used the provided training images without any data augmentation. The CR-HC model is trained from scratch using the Adaptive Moment Estimation (Adam) optimization algorithm with a fixed learning rate of 0.0001, a batch size of 32, and the \(L_1\) loss function. The number of epochs is 100, 150, and 120 for the 300W, WFLW, and AFLW datasets, respectively. The number of epochs differs because each dataset poses a different level of difficulty.

Datasets

AFLW

It is a large collection of images gathered from Flickr, containing 21,997 in-the-wild images with 25,993 faces in total. The collected images show a wide variety of facial appearances in terms of pose, expression, occlusion, and illumination, as well as general imaging and environmental conditions. The dataset is annotated with 21 landmark coordinates. We follow the same setting used in [55] by dropping the landmarks of the ears and using only 19 landmarks. Two protocols are used: AFLW-Full, with 20,000 faces for the training phase and 4386 for the testing phase, and AFLW-Frontal, which uses the same training samples but only 1165 frontal faces for testing.

300W

It is the most popular facial landmarks dataset. It combines five different datasets annotated with a 68-point schema: LFPW, XM2VTS, AFW, IBUG, and HELEN. The same setting as in [48] is applied in the current study, which is based on 3148 training images from LFPW, AFW, and HELEN. The testing set contains all IBUG images and the test subsets of HELEN and LFPW. The 135 images from IBUG form the challenging test subset, and the 554 images from HELEN and LFPW form the common test subset. The combination of the challenging and common subsets is used as the full test set.

WFLW

It is a very challenging facial landmark dataset introduced by [54]. It has 10,000 faces in total, 7500 for training and 2500 for testing, annotated with 98 facial points. The testing set is divided into six subsets: occlusion, illumination, make-up, pose, expression, and blur.

Evaluation Metrics

To evaluate the proposed method and conduct a fair comparison with the state-of-the-art methods, the standard normalized mean error (NME) is used as the evaluation metric, where

$$\begin{aligned} NME = \dfrac{1}{M}\sum _{i=1}^{M}\dfrac{\dfrac{1}{N}\sum _{j=1}^{N}\Vert P_{i,j}-G_{i,j}\Vert _{2}}{d_{i}} \end{aligned}$$
(12)

where M is the number of tested images, N is the number of landmarks, \(P_{i,j}\) and \(G_{i,j}\) are the detected and ground truth landmarks, and \(d_{i}\) is the normalization distance. For 300W and WFLW, we used the inter-ocular distance as the normalization factor, while the face size is used as the normalization factor for the AFLW dataset. In addition, we used two further evaluation metrics: the failure rate at a threshold of 0.1 and the area under the curve (AUC), defined as

$$\begin{aligned} { AUC = \int _{0}^{th} f(e)de} \end{aligned}$$
(13)

where e is the normalized error, f(e) denotes the cumulative error distribution function, and th denotes the upper limit of the integration for calculating the AUC.
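The three metrics can be computed with a few lines of NumPy, as in the hedged sketch below. The per-image error uses the Euclidean distance between predicted and ground truth landmarks, the cumulative error distribution f(e) is estimated empirically on a uniform grid, and the integral in Eq. (13) is approximated with the trapezoidal rule. The function names, the grid size, and the optional division by th (a common convention that scales the AUC to the range [0, 1]) are assumptions.

```python
import numpy as np

def nme_per_image(pred, gt, norm_dist):
    """Per-image normalized error; pred and gt are (M, N, 2), norm_dist is (M,)."""
    point_err = np.linalg.norm(np.asarray(pred) - np.asarray(gt), axis=2)
    return point_err.mean(axis=1) / np.asarray(norm_dist, dtype=float)

def failure_rate(errors, threshold=0.1):
    """Fraction of images whose normalized error exceeds the threshold."""
    return float(np.mean(np.asarray(errors) > threshold))

def auc(errors, threshold=0.1, steps=1000, normalize=False):
    """Area under the empirical cumulative error distribution up to threshold."""
    errors = np.asarray(errors, dtype=float)
    e = np.linspace(0.0, threshold, steps)
    ced = np.array([np.mean(errors <= x) for x in e])
    # Trapezoidal rule written out explicitly.
    area = float(np.sum((ced[1:] + ced[:-1]) * np.diff(e)) / 2.0)
    return area / threshold if normalize else area
```

The NME of Eq. (12) is then the mean of `nme_per_image(...)` over all test images.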

Table 2 Normalized mean error (%) on the AFLW dataset for 19 facial landmarks

Results

To demonstrate the robustness of the proposed method, we conduct experiments on the three datasets, which use different annotation schemes. Each dataset has a different number of annotated landmarks: 19 points for AFLW, 68 points for 300W, and 98 points for WFLW. We compared the proposed method on each dataset with SDM [39], CFSS [57], ERT [37], Wing [67], LAB [54], SAN [48], TCDCN [43], 3FabRec [65], ODN [68], RCN [70], RDR [71], RCN+ [72], SHN-GCN [62], HB+SRT [60], DCNN [73], and more. The proposed CR-HC model achieves competitive results compared to these methods on the three datasets, as reported in the next sections.

Table 3 Performance of the proposed method compared to other methods on the 300W test subsets for 68 facial landmarks

Performance on the AFLW Dataset

Table 2 summarizes the normalized mean error of the proposed method compared to the state-of-the-art methods. The proposed method achieves an NME of 1.56\(\%\) on the frontal subset, which represents an improvement of about 3.70\(\%\) over the best previous method in [69]. The cumulative error distribution (CED) curves of the proposed method and other methods are drawn in Fig. 3. The proposed method achieves the highest CED curve, which differs significantly from those of the previous methods. The experimental results on the AFLW dataset show that the proposed method outperforms the state-of-the-art methods by a large margin.

Fig. 3 Performance comparison of the cumulative error distribution curves on the AFLW dataset

Performance on the 300W Dataset

To thoroughly assess the robustness of the proposed method, we conducted further experiments on the three 300W subsets (Full, Common, and Challenge). Table 3 reports the NME of the proposed method compared to the state-of-the-art methods on the three categories. The cumulative error distribution curve is shown in Fig. 4. The CR-HC method achieves competitive results on the three 300W categories.

Fig. 4 Performance comparison of the cumulative error distribution curves on the 300W dataset

Fig. 5 The validation loss with and without the coupling module versus the epoch number for the datasets: a AFLW, b 300W, and c WFLW

Performance on the WFLW Dataset

The performance of the proposed method is also evaluated on the WFLW dataset with its 98-point annotation schema. The normalized mean error, AUC at 0.1, and failure rate on the test set and the six subsets are summarized in Table 4. Our approach achieves the best NME values on the test set and on all subsets except the pose subset.

Table 4 Evaluation of the proposed method on the WFLW dataset compared to literature work

Fig. 6 Sample results of the proposed (CR-HC) method for AFLW (19 points) dataset

Fig. 7 Sample results of the proposed (CR-HC) model for 300W (68 points) dataset

Fig. 8 Sample results of the proposed (CR-HC) method for WFLW (98 points) dataset

Fig. 9 Failure sample results of the proposed (CR-HC) method

Ablation Study

The proposed method consists of two main parts, the backbone convolutional neural network and the heatmap coupling module. It does not follow the same strategy as the conventional cascaded coordinate regression methods. In this section, we investigate the effectiveness of the heatmap coupling module by evaluating the model on each dataset with and without the coupling module. Figure 5 shows that the validation loss on the three datasets decreases when the coupling module is used. The AFLW validation loss decreases by 9.10% due to the use of the coupling module, as shown in Fig. 5a. In the same way, the validation loss decreases for the 300W and WFLW datasets by 13.20% and 9.50%, as shown in Fig. 5b, c, respectively.

The evaluated datasets contain faces in uncontrolled conditions and challenging images. Figures 6, 7, and 8 show the detection results of the proposed model on the AFLW, 300W, and WFLW datasets, respectively. The displayed images cover a wide range of factors that influence the efficiency of landmark detection, such as occlusion, head pose, illumination, and expression. The results show that the proposed model succeeds in detecting facial landmarks in difficult cases.

For further illustration, Fig. 9 presents failure cases of landmark detection on the three datasets, where rows 1, 2, and 3 show the detection results on the AFLW, 300W, and WFLW datasets, respectively. The results illustrate why the proposed approach might lead to inaccurate estimates in some situations. Referring to the images with indices ranging from 1 to 21, beginning in the top-left corner and proceeding row by row, the detection is noticeably affected when more than one factor distorts the image, such as occlusion combined with head pose, as in images 1, 8, 15, and 16. The detection efficiency is also affected when color is absent from the image, which leads to the overlapping of facial details, as shown in images 9, 11, and 19. Furthermore, the results are significantly impacted when only the eyes are visible, as in images 2 and 18.

To measure the feasibility and usability of the proposed CR-HC method, its execution time is measured and compared to that of other FLD methods. The execution time is computed as the average over 1000 images. In addition, all the compared methods are publicly available online. Table 5 shows that the execution time of the proposed method is better suited to reliable applications than that of other FLD methods, which have a low execution time but a high normalized mean error.

Table 5 Execution time of the proposed CR-HC method compared to some other FLD methods

Conclusion

In this paper, we have presented a deep learning-based method that uses cascaded regression for coarse-to-fine detection of facial landmarks. The method is composed of two-stage cascaded CNNs that are coupled with a heatmap module. The first stage regresses the landmark coordinates of an input face image, and the result is passed to the heatmap coupling module, which converts the estimated shape to a Gaussian heatmap. The second stage refines the output by regressing the concatenation of the face image and the heatmap of the estimated shape vector. The obtained results revealed that the proposed method achieved approximately 1.57% NME on the AFLW dataset, 4.30% on the 300W dataset, and 5.53% on the WFLW dataset. Thus, using the heatmap coupling module distinctly improves the detection performance. Future studies could follow two paths to increase the accuracy of FLD. First, coordinate regression as the first stage of the CR-HC model could be combined with a heatmap regression network as the second stage. Second, other large datasets could be used to train the model, specifically for the WFLW dataset, which has a wide range of styling.