1 Introduction

Face alignment aims to locate facial landmarks such as the mouth corners, nose tip, pupils and chin. It is a fundamental component of many applications (e.g., facial attribute inference [15], face recognition [13], face verification [17] and facial animation [18]). With the rapid growth of image data, highly efficient and accurate face alignment methods are in great demand. Although great progress has been made in this field, accurate and robust alignment of facial landmarks remains a formidable challenge because of partial occlusion and large variations in head pose.

The active appearance model (AAM) [7] detects facial landmarks by reconstructing the entire face with an appearance model and minimizing the model parameter errors in the training phase. AAM may not work well on unseen faces because of its limited expressive power in modeling the texture space. It is also well known that AAM cannot handle large variations in expression, illumination and initialization. To address these problems, local feature based methods such as the active shape model (ASM) [8] and the constrained local model (CLM) [9] have been proposed, which model only the local appearance around the landmarks instead of the entire face. These methods generalize better and are more stable. However, the local features sampled around the current facial landmarks are still not robust enough to cope with large deformations, pose variations and occlusions.

The vast majority of face alignment approaches proposed in recent years are based on shape regression [6, 10, 11, 14, 21]. These methods can adaptively enforce shape constraints and effectively leverage large bodies of training data. Shape regression is frequently applied in a cascaded manner. Cascaded shape regression (CSR) was first proposed in [6]. Without using a fixed parametric shape model, the inherent shape constraint in [6] is encoded into a cascaded regression framework and applied from coarse to fine during the test phase. Beginning with an initial shape calculated from the average facial landmarks of the training dataset, the face shape is refined stage by stage by adding a shape increment. In each stage, features are extracted from the image and then used by a regressor to update the current locations of the facial landmarks.

The choice of features is crucial to the regression results, and a series of feature designs has been proposed. Efficiency can be clearly improved by employing the shape-indexed feature [6]. In [26], the SIFT (scale-invariant feature transform) feature is used to obtain a representation that is robust to illumination. Sun et al. [24] take advantage of the deep structure of convolutional networks to learn the features. In [21], local binary features (LBF) are presented for highly accurate and fast face alignment. The obtained LBF are incorporated into the CSR framework to learn a linear regression. Owing to the simplicity of the pixel based feature, LBF is an exceedingly efficient tool for facial landmark localization. Nevertheless, it is more sensitive to noise than conventional features such as HOG (histogram of oriented gradients) and SIFT.

In this paper, we propose a method using local probabilistic features (LPF), which is an improvement on LBF. The proposed LPF models the probability of a test sample belonging to each leaf node. When splitting a tree node, we use the average pixel difference of three pairs of pixels, which not only maintains the accuracy of the algorithm but also improves its speed. To obtain the final output, various convergent models are combined so that each contributes its respective advantages.

The main contributions of our method are:

  1. As an extension of local binary features [21], we focus, for the first time, on the role of probability in improving the effectiveness and efficiency of learning with random forests. The method combines the efficiency of LBF with the probability of a sample reaching each leaf node. Qualitative and quantitative results show the benefit of blending the two.

  2. Traditionally, the results of facial landmark detection are determined by a single regression model [6, 10, 14, 21]. We overcome this limitation by combining various convergent models, since each model has its own advantages. Integrating them overcomes the instability of a single model.

2 Related Work

2.1 The Cascade Shape Regressors

In recent years, cascade shape regression has shown superior quality in face alignment research. All these methods treat face alignment as a regression problem. Cascade shape regressors generally employ N regressors in series. The vector S consists of the x- and y-coordinates of L facial landmarks. Beginning with an image I and a rough initial face shape \(S^0\), S is refined stage by stage by a shape increment \(\delta S^n\), which is calculated by the regressor \(R^n\), n = 1, 2, ..., N:

$$\begin{aligned} S^n =S^{n-1} +\delta S^n \end{aligned}$$
(1)

where \(S^n\) is the current shape estimate, \(S^{n-1}\) is the shape estimated by the previous stage, and \(\delta S^n\) is calculated as follows:

$$\begin{aligned} \delta S^n =W^n {} \varPhi ^n (I ,S^{n-1} ) \end{aligned}$$
(2)

where \(W^n\) is the global linear projection matrix and \(\varPhi ^n\) is the feature mapping function of stage n.
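As a minimal sketch of Eqs. (1) and (2), the cascade can be written as a loop that applies one linear projection per stage. The names `cascaded_regression`, `mat_vec` and `feature_fn` are hypothetical and chosen for illustration only; the feature mapping is passed in as a plain function, and shapes are flat lists of coordinates.

```python
from typing import Callable, List, Sequence, Tuple

Vector = List[float]
Matrix = Sequence[Sequence[float]]
# Hypothetical signature for the feature mapping Phi^n(I, S^{n-1}).
FeatureFn = Callable[[object, Vector], Vector]


def mat_vec(W: Matrix, phi: Sequence[float]) -> Vector:
    """Multiply matrix W (one row per shape coordinate) by feature vector phi."""
    return [sum(w * f for w, f in zip(row, phi)) for row in W]


def cascaded_regression(image: object,
                        initial_shape: Sequence[float],
                        stages: Sequence[Tuple[Matrix, FeatureFn]]) -> Vector:
    """Refine the shape stage by stage: S^n = S^{n-1} + W^n Phi^n(I, S^{n-1})."""
    shape = list(initial_shape)
    for W, feature_fn in stages:
        phi = feature_fn(image, shape)                   # Phi^n(I, S^{n-1})
        delta = mat_vec(W, phi)                          # Eq. (2)
        shape = [s + d for s, d in zip(shape, delta)]    # Eq. (1)
    return shape


if __name__ == "__main__":
    # Toy example: one landmark (x, y) and a constant one-dimensional feature.
    toy_stage = ([[0.5], [-0.5]], lambda img, s: [1.0])
    print(cascaded_regression(None, [0.0, 0.0], [toy_stage, toy_stage]))
```

In the toy example each stage moves the single landmark by (0.5, -0.5), so two stages shift it by (1.0, -1.0) in total.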

2.2 Random Forest

In recent years, random forests [4] have played an important role in many classic pattern recognition problems, such as image classification [3], data clustering [23] and shape regression [21]. This approach has several advantages: (a) efficiency in both training and prediction, (b) the ability to handle a large number of input variables, (c) the ability to detect interactions between features, and (d) suitability for multi-class problems.

3 Method

In the training phase, we first augment our training data to cover a diverse range of situations. Then the local probabilistic features are generated from the random forests. After that, we learn a global linear projection \(W^n\) using the dual coordinate descent method [12]. This procedure is repeated N times in a cascaded form. An overview of our training algorithm is given in Table 1.

The proposed local probabilistic features are extracted from a conventional random forest. Following the framework proposed by [20], each facial landmark corresponds to one random forest, which is composed of 10 trees. For each leaf node, we calculate the local probabilistic feature as follows:

$$\begin{aligned} p'(i)=\dfrac{num(i)}{\sum _{k=1}^{I} num(k)} \end{aligned}$$
(3)

where \(p'(i)\) is the initial probability value of the ith leaf node of a tree, num(i) is the number of training samples falling into that leaf node, and the sum runs over all I leaf nodes of the tree.

$$\begin{aligned} p(i) = {\left\{ \begin{array}{ll} \text{lowTh}, &{} \text{if } p'(i)<\text{lowTh} \\ p'(i), &{} \text{if } \text{lowTh}\le p'(i) <\text{highTh}\\ \text{highTh}, &{} \text{if } p'(i) \ge \text{highTh} \end{array}\right. } \end{aligned}$$
(4)

where p(i) is the final probability value of the ith leaf node of a tree, lowTh and highTh are the lower threshold and the upper threshold, respectively.
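Eqs. (3) and (4) can be illustrated with a short sketch that normalizes the per-leaf sample counts of one tree and then clamps each value to the interval [lowTh, highTh]. The function name `leaf_probabilities` and the threshold values in the example are assumptions made for illustration; the thresholds actually used are not stated here.

```python
from typing import List, Sequence


def leaf_probabilities(counts: Sequence[int],
                       low_th: float,
                       high_th: float) -> List[float]:
    """Return the clamped leaf probabilities p(i) of one tree.

    counts[i] is num(i), the number of training samples in leaf i.
    """
    if low_th > high_th:
        raise ValueError("low_th must not exceed high_th")
    total = sum(counts)
    if total <= 0:
        raise ValueError("the tree must contain at least one training sample")
    probs = []
    for c in counts:
        p_init = c / total                        # Eq. (3): initial probability
        p = min(max(p_init, low_th), high_th)     # Eq. (4): clamp to thresholds
        probs.append(p)
    return probs


if __name__ == "__main__":
    # Three leaves holding 1, 4 and 15 samples; thresholds are illustrative.
    print(leaf_probabilities([1, 4, 15], low_th=0.1, high_th=0.5))
```

With these illustrative thresholds, the sparsely populated leaf (1/20) is raised to lowTh and the dominant leaf (15/20) is capped at highTh, while the middle leaf keeps its initial value.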

Table 1. Training of cascade face alignment with local probabilistic features
Fig. 1. The process of producing local probabilistic features from random forest. For the parameter \(p^{i}(j)\), i denotes the ith tree in its random forest and j denotes the jth leaf node in its tree.

Fig. 2. The high-dimensional probabilistic features are formed by concatenating all local probabilistic features.

For each facial landmark, we train a random forest using the method proposed in [4]. Figure 1 illustrates the process of producing local probabilistic features from a random forest. Each leaf node contains a pixel difference feature f [6], a local probabilistic feature p(i), and a threshold.
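The node test built on the pixel difference feature and threshold can be sketched as follows, using the average difference over three pixel pairs mentioned in the introduction. This is only an illustration under stated assumptions: pixels are addressed by absolute (row, column) positions rather than shape-indexed offsets, the image is a plain 2D list of gray values, and both the function names and the "smaller than threshold goes left" convention are hypothetical.

```python
from typing import Sequence, Tuple

Pixel = Tuple[int, int]          # (row, column)
PixelPair = Tuple[Pixel, Pixel]


def mean_pixel_difference(image: Sequence[Sequence[float]],
                          pairs: Sequence[PixelPair]) -> float:
    """Average the intensity difference over the given pixel pairs."""
    if not pairs:
        raise ValueError("at least one pixel pair is required")
    total = 0.0
    for (r1, c1), (r2, c2) in pairs:
        total += float(image[r1][c1]) - float(image[r2][c2])
    return total / len(pairs)


def goes_left(image: Sequence[Sequence[float]],
              pairs: Sequence[PixelPair],
              threshold: float) -> bool:
    """Assumed split rule: send the sample left if the feature is below the threshold."""
    return mean_pixel_difference(image, pairs) < threshold


if __name__ == "__main__":
    image = [[10, 20, 30],
             [40, 50, 60],
             [70, 80, 90]]
    # Three pixel pairs with differences -10, 40 and 30.
    pairs = [((0, 0), (0, 1)), ((1, 1), (0, 0)), ((2, 2), (1, 2))]
    print(mean_pixel_difference(image, pairs))
    print(goes_left(image, pairs, threshold=25.0))
```

Averaging several pairs is one plausible way to make a single-pair difference less sensitive to noise at a small extra cost per node.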

Once all of the local feature mapping functions \(\phi ^n_i\), i = 1, ..., L, have been established, the high-dimensional probabilistic features are formed by concatenating all local probabilistic features, as shown in Fig. 2.
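The concatenation step can be sketched as below. As an assumption borrowed from the way local binary features are usually encoded, each tree contributes a vector with one entry per leaf, in which the entry of the leaf reached by the sample holds p(i) and all other entries are zero. The function names and the nested-list layout are hypothetical.

```python
from typing import List, Sequence


def tree_feature(leaf_probs: Sequence[float], reached_leaf: int) -> List[float]:
    """Sparse vector for one tree: p(i) at the reached leaf, zero elsewhere (assumed encoding)."""
    vec = [0.0] * len(leaf_probs)
    vec[reached_leaf] = leaf_probs[reached_leaf]
    return vec


def concatenate_features(forests: Sequence[Sequence[Sequence[float]]],
                         reached: Sequence[Sequence[int]]) -> List[float]:
    """Concatenate the per-tree vectors of all landmarks into one feature vector.

    forests[l][t] holds the leaf probabilities of tree t for landmark l;
    reached[l][t] is the index of the leaf the sample reached in that tree.
    """
    feature: List[float] = []
    for forest_probs, forest_reached in zip(forests, reached):
        for leaf_probs, leaf_idx in zip(forest_probs, forest_reached):
            feature.extend(tree_feature(leaf_probs, leaf_idx))
    return feature


if __name__ == "__main__":
    # Two landmarks: the first forest has two trees, the second has one.
    forests = [[[0.2, 0.5], [0.1, 0.3]], [[0.4, 0.4]]]
    reached = [[1, 0], [0]]
    print(concatenate_features(forests, reached))
```

The resulting vector is sparse, with one non-zero entry per tree, which is what keeps the global linear projection cheap to evaluate.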

Testing follows the same cascaded procedure as training, but we add an optimization strategy that combines various models to overcome the instability of a single one:

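$$\begin{aligned} S=\dfrac{1}{M}\sum _{i=1}^M S_{i}, \qquad S_{i}=\bigl (s_{i}(x_1,y_1),\ldots ,s_{i}(x_L,y_L)\bigr ) \end{aligned}$$
(5)

where M is the number of models, \(S_{i}\) is the shape calculated by the ith model, and \(s_{i}(x_j,y_j)\) denotes the location of the jth landmark in that shape. The average is taken landmark by landmark, so the final shape S again contains L landmark locations.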
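The model combination in Eq. (5) amounts to averaging the M predicted shapes landmark by landmark. A minimal sketch, assuming each shape is a list of (x, y) tuples and using the hypothetical name `average_shapes`:

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def average_shapes(shapes: Sequence[Sequence[Point]]) -> List[Point]:
    """Average M shapes, each a list of L (x, y) landmark locations."""
    if not shapes:
        raise ValueError("at least one shape is required")
    num_models = len(shapes)
    num_landmarks = len(shapes[0])
    if any(len(shape) != num_landmarks for shape in shapes):
        raise ValueError("all shapes must have the same number of landmarks")
    averaged = []
    for j in range(num_landmarks):
        x = sum(shape[j][0] for shape in shapes) / num_models
        y = sum(shape[j][1] for shape in shapes) / num_models
        averaged.append((x, y))
    return averaged


if __name__ == "__main__":
    # Two models, two landmarks each.
    model_a = [(0.0, 0.0), (2.0, 4.0)]
    model_b = [(2.0, 2.0), (4.0, 0.0)]
    print(average_shapes([model_a, model_b]))
```

Because every model must be evaluated before averaging, the cost of this step grows linearly with M.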

4 Experiments of Alignment

In this section, we first determine the parameters that are critical to the experimental performance. Then, we show the effectiveness of our method on two datasets, 300-W and Helen.

4.1 Datasets

Helen [16] consists of 2000 training and 330 test web images. Its high-resolution images are useful for accurate localization. To obtain results quickly, we use 68 landmarks instead of 194 landmarks to evaluate our method. 300-W [22] is short for 300 faces in the wild. It is built from existing datasets, including LFPW [2], Helen [16], AFW [20], XM2VTS [19] and IBUG. Our training images consist of the training sets of Helen and LFPW and the whole AFW dataset. Our test images consist of the test sets of LFPW and Helen, together called the common test set, and the whole IBUG dataset, called the challenging test set.

4.2 Selection of Parameters

In our experiments, a multi-pose Viola-Jones (V.J.) detector is used to detect face rectangles, and the mean shape of the training data is used as the initial shape for each test image.

For the proposed method, the number of cascade stages N and the number of trees T in each random forest are crucial parameters. Figure 3 shows the mean error as a function of the number of cascade stages. We can see that the algorithm reaches its desired performance once N increases to 7. Figure 4 shows the mean error as a function of the number of trees. As T grows, the accuracy fluctuates while the test time increases.

Fig. 3. The mean error at each stage of the cascade is plotted. Using many stages of regressors is fairly useful, regardless of the number of dot group, G, in each forest.

Fig. 4. The mean error as a function of the number of trees T in each random forest (line chart). The test time as a function of T (bar chart).

Table 2. Runtime (in FPS) on 300-W. Our parameter T is set to 2, and the results of other approaches are quoted from the original papers.
Table 3. Mean errors (percent) on Helen dataset (68 landmarks)
Table 4. Mean errors (percent) on 300-W dataset (68 landmarks)

4.3 Results and Discussion

Our method is implemented in C++ and tested on an Intel i7-6700 CPU. The results of the compared approaches are taken from the original papers. Table 2 shows that, with the parameter T set to 2, our method runs at a higher frame rate than the others.

To evaluate the effectiveness of the local probabilistic features, we run experiments on two frequently used datasets, Helen and 300-W, and compare our results with other leading methods. The parameters are set as follows: N = 7, D = 5 and M = 6.

On the Helen dataset, to remain consistent with other methods, we use 68 landmarks to evaluate our algorithm. Table 3 shows that our algorithm achieves a satisfactory result. We attribute this to the effectiveness of the LPF and the benefit of combining multiple models.

On the 300-W dataset, Table 4 reports the results of our method and the competing methods. Our method is superior on the challenging subset as well as on the common subset and the full set. Because various convergent models are combined to predict the final shapes, the proposed method lags behind other methods in terms of speed. Example alignment results of our method are shown in Fig. 5.

We provide a demo of our method at https://pan.baidu.com/s/1hsqjZqW.

Fig. 5. Example alignment results on 300-W. The first row shows the challenging sets, the second row shows the Helen dataset of common sets, and the last row shows the LFPW dataset of common sets.

5 Conclusion

In this paper, we propose a novel face alignment method based on local probabilistic features within a cascaded framework. The local probabilistic features model the probability of a test sample reaching each leaf node, which gives the prediction a probabilistic grounding. The cascaded framework also clearly improves performance in our experiments. By combining various convergent models to overcome the instability of a single one, our method achieves regression accuracy superior to state-of-the-art methods. We demonstrate the efficiency and accuracy of the proposed method on two classical face alignment datasets. Furthermore, the local probabilistic features are worth applying to other regression problems.