1 Introduction

Face alignment is the task of automatically locating the key feature points of a face (such as the eyes, the tip of the nose, the corners of the mouth, and the eyebrows) in an input face image, as shown in Fig. 1. It is widely used in facial analysis tasks such as face recognition [1], expression analysis [2], face tracking, face animation [3], and facial attribute analysis [4]. Face alignment remains challenging because of variations in pose, expression, and lighting, and because of occlusion.

Fig. 1. The task of face alignment.

SDM (Supervised Descent Method) [5] is a representative method based on the cascaded regression model. It is derived from Newton's method and is both efficient and precise. SDM starts from an initial shape (the mean shape of the training samples) and applies a series of regressors that move the initial shape step by step toward the true shape. One drawback of the cascaded regression model is that its final result depends heavily on the initial shape. Under simple conditions SDM usually achieves good results, but under extreme conditions, such as a large head pose or a highly exaggerated facial expression, the initial shape differs too much from the true shape, and it is very difficult to bring the initial shape close to the true shape within a few iterations.

After 2013, methods based on deep learning began to be widely used in face alignment and achieved remarkable results. These methods generally do not rely on initialization. However, to achieve high precision they often require a complex structure and a time-consuming process.

To solve the above problems, we propose a coarse-to-fine SDM (CFSDM) method. We first use a simple CNN (Convolutional Neural Network) to predict the approximate locations of the facial landmarks. A channel-wise attention mechanism is added to the CNN so that the landmark positions can be predicted more accurately with a relatively simple structure. The resulting landmark coordinates are then passed to SDM in place of its mean-shape coordinates (i.e., as the initial shape at test time). This gives SDM a good initialization, makes it less likely to fall into a local optimum, and thus improves the results.

2 Related Work

2.1 Face Alignment

Traditional Methods

Traditional face alignment methods include ASM [6], AAM [7], and CLM [8]. ASM (Active Shape Model) builds a statistical model of the facial landmarks labeled in the training set, and then searches for the best matching points in a test image to locate the facial landmarks. AAM (Active Appearance Model) builds on ASM: it additionally builds a statistical model of the texture and merges the shape and texture models into an appearance model. CLM (Constrained Local Model) inherits advantages of both ASM and AAM and balances the efficiency of ASM against the accuracy of AAM: on the basis of ASM, it models patches of local texture around the facial landmarks rather than the global texture used by AAM.

Recently, cascaded shape regression models have achieved a major breakthrough in the face alignment task. These methods use a regression model to learn a mapping function directly from the appearance of the face to its shape (or to the parameters of a face shape model), and thereby establish a correspondence from appearance to shape. Such methods do not require complex models of face shape and appearance, and they achieve good results both in controlled scenes (faces collected under laboratory conditions) and in uncontrolled scenes (face images from the web, etc.). In 2010, Dollár et al. proposed CPR (Cascaded Pose Regression) [9], which gradually refines a specified initial prediction through a series of regressors. ESR [10] adopted boosted regression and two-level cascaded regression. CFSS [11] begins with a coarse search over a shape space that contains diverse shapes and uses the coarse solution to constrain the subsequent finer search. LBF (Local Binary Features) [12] learns sparse binary features in local regions based on a random forest regression model. SDM [5] uses a supervised descent method to solve the nonlinear least squares problem and learns a series of regressors for locating facial landmarks. In general, cascaded shape regression methods are very sensitive to the starting point of the regression process.

CNN Based Methods

Deep learning has been widely used in computer vision [24, 25], and face alignment methods based on deep learning have also achieved remarkable results. Sun et al. [13] were the first to use CNNs for face alignment; they proposed a three-level network structure in which each level contains multiple independent CNN models that are responsible for predicting some or all of the key points. Wu et al. [14] found that the features in a CNN are hierarchical and that the features extracted by deeper layers reflect the positions of the facial landmarks more accurately, and on this basis proposed TCNN (Tweaked Convolutional Neural Networks). MTCNN [15] uses a cascade of three CNNs to perform face detection and face alignment simultaneously. SAN [16] predicts facial landmarks from both the original face image and a style-aggregated image, which addresses the problem caused by variation in image style. LAB [17] uses a network to extract the facial boundary and fuses the boundary information into face alignment. TCDCN [18] uses multi-task learning to optimize facial landmark localization together with a set of related tasks. In summary, to achieve high accuracy, deep learning methods generally use a cascaded deep model to improve the estimate gradually, which leads to more complicated computation.

We propose a coarse-to-fine SDM (CFSDM) method that exploits the fact that deep learning methods do not depend on initialization: it keeps the CNN structure simple, uses the CNN to optimize the initialization of SDM, and thereby improves the final result.

2.2 Attention Mechanism

In recent years, deep learning has been studied more and more extensively, and breakthroughs have been made in many fields. Neural networks based on the attention mechanism have become a hot topic in recent research. Mnih et al. [19] used the attention mechanism in an RNN model for image classification. Bahdanau et al. [20] were the first to apply the attention mechanism to NLP. To make the i-th channel relate only to the position of the i-th landmark, we use the channel-wise attention module proposed in [21] in our method.

3 Method

During the test phase, SDM starts from the initial shape (the average shape of the training samples) and applies a series of regressors that gradually move the initial shape toward the true shape. Its final result therefore depends heavily on the initial shape. When the facial expression or the head pose changes too much, the difference between the initial shape and the real shape is too large, and SDM generally cannot perform well. We therefore propose a coarse-to-fine SDM (CFSDM) method. We use a simple CNN (with a channel-wise attention mechanism) to predict the approximate positions of the facial landmarks in advance, and the obtained coordinates are given to SDM as its initial shape, which optimizes the initialization of SDM. The learned regressors then bring this initial shape closer to the real shape. The overall process of the method is shown in Fig. 2.

Fig. 2. The overall process of our CFSDM. The coarse shape is estimated by the CNN and serves as the initial shape of SDM during the test phase.

3.1 Coarse Localization Based on CNN

Architecture of CNN Network

Figure 3 shows the detailed structure of the CNN. We take the first 13 layers of VGG-16 [22] as our backbone. VGG is simple and practical, and it performs very well in both image classification and object detection. The receptive field of each convolutional filter is \( 3 \times 3 \), which is the smallest size that captures the notions of left/right, up/down, and center. The convolution stride is fixed to 1 pixel, and padding is applied in each convolutional layer to preserve the spatial resolution after convolution. The second, fourth, seventh, and tenth convolutional layers are each followed by a max-pooling layer. Max-pooling is performed over a \( 2 \times 2 \) pixel window with stride 2.
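The effect of this layout on spatial resolution can be checked with a few lines of arithmetic. The sketch below is only illustrative: the helper name `vgg13_feature_size` and the 224-pixel input size are assumptions (the text does not state the input resolution), and a padding of 1 is inferred from the statement that padding preserves resolution for a \( 3 \times 3 \) filter with stride 1.

```python
def vgg13_feature_size(input_size, num_conv=13, pool_after=(2, 4, 7, 10)):
    """Spatial size after the 13 convolutional layers described in the text.

    input_size is an assumed square input resolution (not given in the paper).
    """
    size = input_size
    for layer in range(1, num_conv + 1):
        # 3x3 convolution, stride 1, padding 1: spatial size is unchanged.
        size = (size + 2 * 1 - 3) // 1 + 1
        if layer in pool_after:
            # 2x2 max-pooling with stride 2 halves the spatial size.
            size = (size - 2) // 2 + 1
    return size


# Four pooling layers reduce the resolution by a factor of 16.
feature_size = vgg13_feature_size(224)
```

With four pooling stages the backbone output is 16 times smaller than the input along each side, which is the reduction that a later upsampling stage has to undo.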

Fig. 3. The structure of CNN network.

It is well known that the bottleneck of VGG-16 is the large number of parameters in its fully connected layers, which makes it slow, so we discard the fully connected layers of VGG-16. Instead we adopt a structure with fewer parameters and higher efficiency, namely a deconvolution layer, which restores the CNN feature maps to the input size. Each channel of the final output acts as a probability map that predicts where a landmark is most likely to be located. For example, to predict the locations of 68 landmarks for a face image in the Helen dataset, the final output of our network also contains 68 channels. To make the i-th channel relate only to the position of the i-th landmark, we add a channel-wise attention module, whose structure is introduced later in Sect. 3.1.
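One simple way to read coordinates off such per-landmark probability maps is to take the location of the maximum in each channel. The text does not say how the maps are converted to coordinates, so the following is only a sketch under that assumption; the function name `decode_heatmaps` and the toy input are hypothetical.

```python
import numpy as np


def decode_heatmaps(heatmaps):
    """Return (x, y) of the per-channel maximum.

    heatmaps: array of shape (L, H, W), one map per landmark.
    Assumption: argmax decoding; the paper does not specify the decoding rule.
    """
    heatmaps = np.asarray(heatmaps)
    num_landmarks, height, width = heatmaps.shape
    flat_idx = heatmaps.reshape(num_landmarks, -1).argmax(axis=1)
    ys, xs = np.unravel_index(flat_idx, (height, width))
    return np.stack([xs, ys], axis=1)


# Toy example: two 4x5 maps with a single peak each.
maps = np.zeros((2, 4, 5))
maps[0, 1, 3] = 1.0  # landmark 0 peaks at x=3, y=1
maps[1, 2, 0] = 1.0  # landmark 1 peaks at x=0, y=2
coords = decode_heatmaps(maps)
```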

Our CNN is trained in an end-to-end manner. We take the mean square error as the loss function,

$$ {\mathbf{L}}_{{{\mathbf{MSE}}}} = \frac{{\sum\nolimits_{i = 1}^{n} {\left( {\left( {x_{i} - x_{i}^{{\prime }} } \right)^{2} + \left( {y_{i} - y_{i}^{{\prime }} } \right)^{2} } \right)} }}{n} $$
(1)

where \( x_{i} \) and \( y_{i} \) are the ground-truth coordinates of the i-th landmark, \( x_{i}^{{\prime }} \) and \( y_{i}^{{\prime }} \) are the predicted coordinates of the i-th landmark, and n is the number of landmarks.
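As a quick numerical illustration of Eq. (1), the loss can be evaluated with NumPy as follows. The function name `landmark_mse` and the sample coordinates are made up for the example.

```python
import numpy as np


def landmark_mse(pred, truth):
    """Eq. (1): squared coordinate error summed over landmarks, divided by n.

    pred, truth: arrays of shape (n, 2) holding (x, y) for each landmark.
    """
    pred = np.asarray(pred, dtype=float)
    truth = np.asarray(truth, dtype=float)
    return np.sum((truth - pred) ** 2) / truth.shape[0]


truth = np.array([[10.0, 20.0], [30.0, 40.0]])
pred = np.array([[13.0, 24.0], [30.0, 40.0]])
# First landmark is off by (3, 4), second is exact: (9 + 16 + 0) / 2 = 12.5
loss = landmark_mse(pred, truth)
```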

Channel-Wise Attention

The different channels of the feature maps are essentially features extracted by different filters, and different filters may emphasize different things. For example, one filter may respond more to the tip of the nose and another more to the eyes. A weight for each channel, that is, a channel-wise weight, is therefore required. To a certain extent, channel-wise attention can be seen as semantic attention that focuses on different objects in the image. Figure 4 shows the structure of the channel-wise attention module.

Fig. 4. The structure of channel-wise attention module.

First, the feature maps \( {\mathbf{V}} \) extracted by our convolutional filters are reshaped to obtain the channel-wise vectors,

$$ {\mathbf{U}} = \left[ {{\mathbf{u}}_{1} ,{\mathbf{u}}_{2} , \cdots ,{\mathbf{u}}_{C} } \right] $$
(2)

where \( {\mathbf{u}}_{i} \in {\mathbb{R}}^{{{\text{W}} \times {\text{H}}}} \) is the i-th channel of the feature maps and \( C \) is the number of CNN feature channels. Average pooling is then performed on each channel to obtain the vector

$$ {\mathbf{v}} = \left[ {v_{1} ,v_{2} , \cdots ,v_{C} } \right] $$
(3)

where the scalar \( v_{i} \) is the mean of \( {\mathbf{u}}_{i} \), so that \( {\mathbf{v}} \) characterizes the information of the different channels. The final weights are computed as:

$$ {\mathbf{b}} = \tanh \left( {{\mathbf{W}}_{c} \, \otimes \,{\mathbf{v}} + b_{c} } \right) $$
(4)
$$ \upbeta = {\text{softmax}}\left( {{\mathbf{W}}_{i} {\mathbf{b}} + b_{i} } \right) $$
(5)
$$ {\mathbf{V}}^{{\prime }} = {\mathbf{V}}\, \odot \,\upbeta $$
(6)

where \( {\mathbf{W}}_{c} \) and \( {\mathbf{W}}_{i} \) are transformation matrices to be learned, \( \otimes \) represents the outer product of vectors, \( b_{c} \) and \( b_{i} \) are bias terms, and \( \odot \) denotes that each channel of \( {\mathbf{V}} \) is multiplied by the corresponding entry of \( \upbeta \).
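The data flow of Eqs. (2) to (6) can be sketched in NumPy as below. The text does not give the parameter shapes, so this sketch assumes that the first weight and bias are vectors of a hidden size k, that the second weight is also a vector of size k, and that its bias is a scalar; the function name `channel_attention`, the value of k, and the random inputs are all hypothetical.

```python
import numpy as np


def channel_attention(V, W_c, b_c, W_i, b_i):
    """Channel-wise attention forward pass, Eqs. (2)-(6).

    V: feature maps of shape (C, H, W).
    W_c, b_c, W_i: vectors of shape (k,); b_i: scalar (assumed shapes).
    Returns the re-weighted maps and the channel weights beta.
    """
    C = V.shape[0]
    U = V.reshape(C, -1)                            # Eq. (2): one row per channel
    v = U.mean(axis=1)                              # Eq. (3): average pooling, shape (C,)
    b = np.tanh(np.outer(W_c, v) + b_c[:, None])    # Eq. (4): shape (k, C)
    scores = W_i @ b + b_i                          # shape (C,)
    scores = scores - scores.max()                  # shift for numerical stability
    beta = np.exp(scores) / np.exp(scores).sum()    # Eq. (5): softmax over channels
    V_prime = V * beta[:, None, None]               # Eq. (6): scale each channel
    return V_prime, beta


rng = np.random.default_rng(0)
k = 16
V = rng.normal(size=(68, 8, 8))
V_prime, beta = channel_attention(
    V, rng.normal(size=k), rng.normal(size=k), rng.normal(size=k), 0.0
)
```

Because the softmax is taken over the channel axis, the weights sum to one across channels, and each output channel is a scaled copy of the corresponding input channel.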

3.2 Fine Localization Based on SDM

SDM is a commonly used method for solving complex nonlinear least squares (NLS) problems. SDM obtains an initial shape from training and uses it as the starting point at test time, see Fig. 5. In standard SDM, the initial shape is obtained by averaging the landmark coordinates of all training samples. In our CFSDM, however, the initial shape is estimated by the CNN, which gives an initialization closer to the true shape and therefore benefits the final result.

Fig. 5. The initial shape of SDM (top: training phase; bottom: test phase).

Let us review the derivation of SDM. Given an image \( {\mathbf{d}} \in {\mathbb{R}}^{m \times 1} \) of m pixels, \( {\mathbf{d}}\left( {\mathbf{x}} \right) \in {\mathbb{R}}^{p \times 1} \) indexes p landmarks in the image. \( {\mathbf{h}} \) is a nonlinear feature extraction function; for example, \( {\mathbf{h}}\left( {{\mathbf{d}}\left( {\mathbf{x}} \right)} \right) \in {\mathbb{R}}^{128p \times 1} \) can denote the SIFT features extracted at the p landmarks. Given the initial shape \( x_{0} \), the goal is to move \( x_{0} \) closer and closer to the correct face shape \( x_{*} \) by regression. To achieve this, we need to find the \( \Delta x \) that minimizes the following function

$$ f\left( {x_{0} + \Delta x} \right) = \left\| {{\mathbf{h}}\left( {{\mathbf{d}}\left( {x_{0} + \Delta x} \right)} \right) - \varPhi_{*} } \right\|_{2}^{2} $$
(7)

where \( \varPhi_{*} = {\mathbf{h}}\left( {{\mathbf{d}}\left( {x_{\varvec{*}} } \right)} \right) \) represents the SIFT features at the manually labeled landmarks of the face. This objective describes the goal at test time, where only the initial \( x_{0} \) is available and both \( \Delta x \) and \( \varPhi_{*} \) are unknown.

In the training phase, \( \Delta x \) and \( \varPhi_{*} \) are known, and good regressors need to be learned from the training samples so that they can move the initial \( x_{0} \) step by step toward the correct shape, which is unknown at test time. In general, the initial \( x_{0} \) is the mean of the true shapes of all training samples.

To regress the same initial shape to the true shape of the face in each image, we extract SIFT features from each image. Although the initial shape is the same, the SIFT features extracted from different images differ, that is, \( \varPhi_{0} \) differs from image to image. This is what allows the regressors to map the same initial shape to different true shapes, as can also be seen from Eq. 7.

The next step is to obtain a series of regressors that can regress an initial shape to the real shape, that is, to learn regressors that give the best \( \Delta x \). It is generally impossible to reach the real shape with a single \( \Delta x \), because the final goal is very difficult to achieve in one step. We therefore learn a series of regressors that are applied in turn, each producing its own \( \Delta x_{k} \), so that \( x_{0} \) converges to \( x_{*} \) on the training data through \( x_{k + 1} = x_{k} + \Delta x_{k} \) (k indexes the regressors, i.e., the iterations).

The Taylor expansion of the objective function Eq. 7 is described as:

$$ f\left( {x_{0} + \Delta x} \right) \approx f\left( {x_{0} } \right) + {\mathbf{J}}_{f} \left( {x_{0} } \right)^{\text{T}} \Delta x + \frac{1}{2}\Delta x^{\text{T}} {\mathbf{H}}\left( {x_{0} } \right)\Delta x $$
(8)

where \( {\mathbf{J}}_{f} \left( {x_{0} } \right) \) and \( {\mathbf{H}}\left( {x_{0} } \right) \) are the Jacobian and Hessian matrices of f evaluated at \( x_{0} \). Differentiating Eq. 8 with respect to \( \Delta x \) and setting it to 0 gives us the first update for x,

$$ \Delta x_{1} = - {\mathbf{H}}^{ - 1} {\mathbf{J}}_{f} = - 2{\mathbf{H}}^{ - 1} {\mathbf{J}}_{h}^{\text{T}} \left( {\varPhi_{0} - \varPhi_{*} } \right) $$
(9)

where \( \varPhi_{0} = {\mathbf{h}}\left( {{\mathbf{d}}\left( {x_{0} } \right)} \right) \) denotes the SIFT features extracted at the initial shape \( x_{0} \). Let \( {\mathbf{R}}_{0} = - 2{\mathbf{H}}^{ - 1} {\mathbf{J}}_{h}^{\text{T}} \). Since \( \varPhi_{*} \) is unknown but fixed during the test stage, Eq. 9 can be rewritten as follows:

$$ \begin{aligned} \Delta x_{1} & = - 2{\mathbf{H}}^{ - 1} {\mathbf{J}}_{h}^{\text{T}} \left( {\varPhi_{0} - \varPhi_{*} } \right) \\ & = - 2{\mathbf{H}}^{ - 1} {\mathbf{J}}_{h}^{\text{T}} \varPhi_{0} + \left( { - 2{\mathbf{H}}^{ - 1} {\mathbf{J}}_{h}^{\text{T}} } \right)\left( { - \varPhi_{*} } \right) \\ & = {\mathbf{R}}_{0} \varPhi_{0} + b_{0} \\ \end{aligned} $$
(10)

where \( {\mathbf{R}}_{0} \) is a descent direction, \( b_{0} \) is a bias term, and they can be learned from the training samples.
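The algebra of Eqs. (9) and (10) can be checked numerically in a case where the Jacobian and Hessian are available in closed form. The sketch below replaces the SIFT extractor with a hypothetical linear feature map h(x) = Ax, which is an assumption made purely for illustration; in that case the second-order expansion of Eq. (8) is exact and a single update reaches the target.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(6, 3))      # hypothetical linear feature map h(x) = A x
x_star = rng.normal(size=3)      # "true shape" of the toy problem
phi_star = A @ x_star            # target features Phi_*
x0 = np.zeros(3)                 # initial shape

phi0 = A @ x0                    # Phi_0 = h(x0)
J_h = A                          # Jacobian of h
H = 2.0 * A.T @ A                # Hessian of f(x) = ||h(x) - Phi_*||^2

R0 = -2.0 * np.linalg.solve(H, J_h.T)   # R_0 = -2 H^{-1} J_h^T
b0 = -R0 @ phi_star                     # b_0 = 2 H^{-1} J_h^T Phi_*
dx = R0 @ phi0 + b0                     # Eq. (10)
x1 = x0 + dx                            # equals x_star for a linear h
```

For a real feature extractor such as SIFT these quantities are not available in this form, which is why the descent direction and bias are learned from training samples instead.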

Usually the task cannot be completed in just one step and requires multiple steps, so SDM will learn a series of descent directions \( \left\{ {{\mathbf{R}}_{k} } \right\} \) and bias terms \( \left\{ {{\mathbf{b}}_{k} } \right\} \),

$$ x_{k} = x_{k - 1} + {\mathbf{R}}_{k - 1} \varPhi_{k - 1} + {\mathbf{b}}_{k - 1} $$
(11)

where \( x_{k} \) converges to \( x_{*} \) step by step according to Eq. 11.
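The update of Eq. (11) amounts to a short loop at test time. The following is a minimal sketch, not the authors' implementation: `sdm_predict` and `extract_features` are hypothetical names, and the toy check uses a linear stand-in for the feature extractor so that the expected result is known.

```python
import numpy as np


def sdm_predict(image, x0, regressors, extract_features):
    """Apply Eq. (11): x_k = x_{k-1} + R_{k-1} phi_{k-1} + b_{k-1}.

    regressors: list of (R_k, b_k) pairs learned during training.
    extract_features(image, x): returns the feature vector at shape x
    (SIFT in the paper; any callable works for this sketch).
    """
    x = np.asarray(x0, dtype=float).copy()
    for R, b in regressors:
        phi = extract_features(image, x)  # features at the current shape
        x = x + R @ phi + b               # one cascade step
    return x


# Toy check with a hypothetical linear feature extractor h(x) = A x.
A = np.array([[1.0, 0.0], [0.0, 2.0], [1.0, 1.0]])
x_true = np.array([2.0, -1.0])


def toy_features(image, x):
    return A @ x


R = -np.linalg.pinv(A)
b = np.linalg.pinv(A) @ (A @ x_true)
x_hat = sdm_predict(None, np.zeros(2), [(R, b)] * 2, toy_features)
```

In CFSDM the argument `x0` would be the coarse shape predicted by the CNN rather than the mean training shape.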

During the training phase, \( \left\{ {{\mathbf{R}}_{k} } \right\} \) and \( \left\{ {{\mathbf{b}}_{k} } \right\} \) can be learned by minimizing the following function:

$$ \mathop {\text{min}}\nolimits_{{{\mathbf{R}}_{k} ,{\mathbf{b}}_{k} }} \sum\nolimits_{{d^{i} }} {\sum\nolimits_{{x_{k}^{i} }} {\left\| {\Delta x_{*}^{ki} - {\mathbf{R}}_{k} \varPhi_{k}^{i} - {\mathbf{b}}_{k} } \right\|^{2} } } $$
(12)

where \( \Delta x_{*}^{ki} = x_{*}^{i} - x_{k}^{i} \), and i indexes the training images.
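Eq. (12) is a linear least squares problem in the regressor and the bias of stage k, so one stage can be fitted in closed form. The sketch below is an illustration with hypothetical names (`fit_stage`, the synthetic data); the optional `ridge` term is an added assumption for numerical stability when the feature dimension is large and is not part of Eq. (12).

```python
import numpy as np


def fit_stage(features, targets, ridge=1e-3):
    """Solve Eq. (12) for one stage.

    features: (N, F) array, one row of stage-k features per training sample.
    targets:  (N, P) array, one row x_* - x_k per training sample.
    ridge:    assumed regularization strength (0.0 gives plain least squares).
    Returns R of shape (P, F) and b of shape (P,).
    """
    n_samples = features.shape[0]
    X = np.hstack([features, np.ones((n_samples, 1))])  # constant column for the bias
    reg = ridge * np.eye(X.shape[1])
    reg[-1, -1] = 0.0                                   # leave the bias unpenalized
    W = np.linalg.solve(X.T @ X + reg, X.T @ targets)   # shape (F + 1, P)
    return W[:-1].T, W[-1]


# Synthetic check: targets generated by a known linear map are recovered.
rng = np.random.default_rng(0)
phi = rng.normal(size=(200, 5))
R_true = rng.normal(size=(4, 5))
b_true = rng.normal(size=4)
dx_targets = phi @ R_true.T + b_true
R_hat, b_hat = fit_stage(phi, dx_targets, ridge=0.0)
```

Training a full cascade would repeat this fit stage by stage, updating the shapes with the fitted stage before extracting features for the next one.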

During the test stage, the initial shape \( x_{0} \) is determined by the CNN, and it is moved closer and closer to the true shape \( x_{*} \) through the learned \( \left\{ {{\mathbf{R}}_{k} } \right\} \) and \( \left\{ {{\mathbf{b}}_{k} } \right\} \).

4 Experiment

In this section, we evaluate the performance of the proposed method and compare it with several existing state-of-the-art methods on four datasets: LFPW, Helen, IBUG, and 300W.

  • LFPW (68 landmarks) contains 1432 face images, of which 1132 are for training and 300 are for testing. Since only image URLs are available and some links have disappeared over time, we used a reduced version of 1035 images (each annotated with 68 landmarks), 811 of which are for training and the remaining 224 for testing. The challenges are large variations in illumination, expression, pose, and occlusion.

  • Helen (68 landmarks) contains 2330 high-resolution and accurately labeled face images, 2000 of which are for training and 330 for testing. The challenges are large variations in expression, pose, and occlusion.

  • IBUG (68 landmarks) contains 135 accurately labeled face images. The challenges are extremely large variations in illumination, expression, pose, and occlusion.

  • Multiple (68 landmarks) contains images of 150 people, with 10 facial expressions for each person. Each image is annotated with 68 landmarks. We select 225 images as the training set and 100 images as the test set.

We use the Normalized Mean Error (NME) as a metric to measure the shape estimation error

$$ NME = \frac{100}{N}\sum\nolimits_{i = 1}^{N} {\left( {\frac{1}{{\left\| {w_{i}^{g} } \right\|_{1} }}\sum\nolimits_{l = 1}^{L} {\left( {\frac{{w_{i}^{g} \left( l \right) \cdot \left\| {x_{i} \left( l \right) - x_{i}^{g} \left( l \right)} \right\|}}{{d_{i} }}} \right)} } \right)} $$
(13)

It computes the Euclidean distance between the ground-truth and estimated landmark positions, normalized by \( d_{i} \). Table 1 shows the face alignment performance of our CFSDM and of several strong methods: LBF [12], CFAN [23], and SDM [5]. Table 1 shows that our CFSDM outperforms the other methods and improves substantially on SDM, especially on challenging datasets such as IBUG and Multiple. Figure 6 plots the percentage of images versus normalized error and shows that our CFSDM performs best on most of the test images. Some example results of SDM and our CFSDM are displayed in Fig. 7.
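For reference, Eq. (13) can be evaluated as in the sketch below. The text does not define every symbol, so the sketch makes labeled assumptions: the weights are treated as per-landmark values (for example, a visibility indicator), and the normalizing distance is supplied per image by the caller (for example, an inter-ocular distance). The function name `nme` and the sample values are hypothetical.

```python
import numpy as np


def nme(pred, truth, weights, norm_dist):
    """Normalized Mean Error of Eq. (13), in percent.

    pred, truth: (N, L, 2) predicted and ground-truth landmarks.
    weights:     (N, L) per-landmark weights (assumed meaning: visibility).
    norm_dist:   (N,) normalizing distance d_i for each image (assumed given).
    """
    pred = np.asarray(pred, dtype=float)
    truth = np.asarray(truth, dtype=float)
    weights = np.asarray(weights, dtype=float)
    norm_dist = np.asarray(norm_dist, dtype=float)
    err = np.linalg.norm(pred - truth, axis=2)             # (N, L) point errors
    weighted = (weights * err / norm_dist[:, None]).sum(axis=1)
    per_image = weighted / np.abs(weights).sum(axis=1)     # divide by the L1 norm of the weights
    return 100.0 * per_image.mean()


truth = np.array([[[0.0, 0.0], [10.0, 0.0]]])
pred = np.array([[[3.0, 4.0], [10.0, 0.0]]])
# Point errors are 5 and 0; with d = 10 and unit weights: 100 * (0.5 + 0) / 2 = 25
score = nme(pred, truth, weights=np.ones((1, 2)), norm_dist=np.array([10.0]))
```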

Table 1. Error of face alignment methods on four datasets.

Fig. 6. The curves of percentage of test images vs normalized error of various approaches on four datasets: LFPW, Helen, IBUG, and 300W.

Fig. 7. Some example results of SDM and our CFSDM. The first and third rows show our results, and the second and fourth rows show the results of SDM.

To verify that the CNN and the channel-wise attention module both contribute to the improvement of the final result, we performed an ablation experiment, see Table 2.

Table 2. Error of SDM, SDM+CNN (without the channel-wise attention module) and our CFSDM on four datasets.

5 Conclusion

In this paper, we propose a coarse-to-fine SDM (CFSDM) method. We use a CNN with a channel-wise attention module to optimize the initialization of SDM, which reduces the distance between the initial shape and the real shape in the test phase of SDM. This addresses the problem that SDM cannot achieve good results when the facial expression or head pose changes greatly. The evaluation on four datasets shows that our CFSDM improves on the accuracy of the traditional SDM method and outperforms several other strong methods.