
1 Introduction

Facial motion and expression allow users to communicate with computers in a natural way. Constructing robust systems for facial motion tracking and expression recognition is therefore an active research topic.

Generally, either 2D approaches [1, 2] or 3D approaches [3, 4] can be adopted for this task. Compared with 2D methods, 3D methods are better suited to view-independent and illumination-insensitive tracking and recognition [5]. 3D methods often rely on a 3D facial mesh model or a depth camera [4,5,6,7,8]. Because single video cameras are highly cost-effective, they are used as input here, together with a 3D facial mesh model that serves as prior knowledge and constraints.

For facial motion tracking [9], appearance-based techniques [12, 13] are more robust than feature-based ones [10, 11], and they are often implemented statistically to further increase robustness. Offline statistical appearance-based models [3, 14], such as 3D shape regression [15, 16], learn the appearance model parameters from a face image dataset taken under different conditions, while online statistical appearance models (OSAM) [17,18,19] are more flexible and efficient because they update the learned model progressively. In addition, an adequate motion filtering strategy is needed to estimate the true motion. Particle filtering [19] has been widely used for its global optimization ability based on the Monte Carlo technique.

For facial expression recognition [20,21,22], static techniques use spatial or spatio-temporal features related to a single frame [23] and classify expressions with statistical analysis tools [24,25,26,27,28,29,30], such as neural networks, while dynamic techniques use the temporal variations of facial deformation and classify expressions with statistical analysis tools [31,32,33], such as dynamic Bayesian networks.

In this paper, a framework (Fig. 1) is proposed for pose-robust and illumination-insensitive facial motion tracking and expression recognition on each video frame, based on the work in [34].

Fig. 1. Framework.

Firstly, facial animation and facial expression are estimated sequentially in particle filtering. To alleviate the difficulty caused by illumination variation during facial motion tracking, OSAM is extended to an illumination weight adaptive online statistical appearance model (IWA-OSAM), in which 13 basis point light positions are constructed to model the lighting condition of each video frame. Facial expressions are then recognized by static facial expression knowledge learned from the anatomical definitions in [35].

Secondly, because both temporal dynamics and static information are important for recognizing expressions [36], they are fused here by tracking facial motion and recognizing facial expression simultaneously in particle filtering. Compared with the sequential approach above, particles are not only generated by resampling but also predicted by the dynamic knowledge, resulting in a more accurate recognition result.

2 Facial Motion Tracking

2.1 OSM-Based Facial Motion Tracking

The CANDIDE3 model [37] (Fig. 2(a)) serves as the prior knowledge and constraints for facial motion tracking, and defines the facial motion parameters as:

$$ \varvec{b} = [\theta_{x} ,\theta_{y} ,\theta_{z} ,t_{x} ,t_{y} ,t_{z} ,\varvec{\beta}^{T} ,\varvec{\alpha}^{T} ]^{T} = [\varvec{h}^{T} ,\varvec{\beta}^{T} ,\varvec{\alpha}^{T} ]^{T} $$
(1)

where \( \varvec{h} = [\theta_{x} ,\theta_{y} ,\theta_{z} ,t_{x} ,t_{y} ,t_{z} ]^{T} \) contains the global head motion parameters, and \( \varvec{\beta}, \varvec{\alpha} \) are the shape and animation parameters. 10 shape parameters and 7 animation parameters are used.
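As an illustration, the following minimal sketch shows how the stacked parameter vector \( \varvec{b} \) of Eq. (1) can be packed and unpacked; the 6/10/7 split follows the text, but the helper names and the zero initial values are illustrative only and not part of the original implementation.

```python
import numpy as np

N_SHAPE, N_ANIM = 10, 7  # 10 shape (beta) and 7 animation (alpha) parameters

def pack_parameters(head_pose, shape, animation):
    """Concatenate h = [rx, ry, rz, tx, ty, tz], beta and alpha into b (Eq. (1))."""
    head_pose = np.asarray(head_pose, dtype=float)
    shape = np.asarray(shape, dtype=float)
    animation = np.asarray(animation, dtype=float)
    assert head_pose.size == 6 and shape.size == N_SHAPE and animation.size == N_ANIM
    return np.concatenate([head_pose, shape, animation])

def unpack_parameters(b):
    """Recover (h, beta, alpha) from the stacked vector b."""
    return b[:6], b[6:6 + N_SHAPE], b[6 + N_SHAPE:]

b = pack_parameters(np.zeros(6), np.zeros(N_SHAPE), np.zeros(N_ANIM))
h, beta, alpha = unpack_parameters(b)
```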

Fig. 2. (a) CANDIDE3 model. (b) A frame of input video. (c) The projection of CANDIDE3 under \( \varvec{b} \). (d) The GNFI. (e) The improved GNFI. (f) Selected facial areas.

A face texture is represented as a geometrically normalized facial image (GNFI) [37]. Figure 2(b)–(d) illustrates the process of obtaining a GNFI from an input image. Different facial areas have different levels of influence on tracking performance. Because the part above the eyebrows hardly contributes to facial motion and is often contaminated by hair, it is removed from the GNFI. In addition, we found that the top part of the nose and the temples seldom undergo local motions, while the appearance of these two facial areas is often influenced by head pose change and illumination variation. Therefore, the image regions corresponding to these two areas are also removed from the GNFI. The resulting image, called the improved GNFI (Fig. 2(e)), is then used for measurement extraction.

Because pixel color values are easily influenced by the environment, and thus not robust for tracking, a more robust measurement is extracted from the improved GNFI as follows, following the discussion in [34]: for the improved GNFIs of the first and current frames, we compute the illumination ratio images, and then compute Gabor wavelet coefficients on the selected facial areas (Fig. 2(f)) where high-frequency appearance changes are more likely to occur.
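The following sketch illustrates this measurement step under stated assumptions: it forms the ratio image between the current and first improved GNFIs and pools Gabor magnitudes over selected patches. The Gabor kernel parameters, the patch bounding-box format and the pooling by mean are illustrative choices, not values taken from the paper.

```python
import cv2
import numpy as np

def ratio_image(gnfi_current, gnfi_first, eps=1.0):
    """Illumination ratio image between the current and first improved GNFIs."""
    return (gnfi_current.astype(np.float32) + eps) / (gnfi_first.astype(np.float32) + eps)

def gabor_features(image, patches, thetas=(0, np.pi / 4, np.pi / 2, 3 * np.pi / 4)):
    """Concatenate mean Gabor magnitudes over each selected facial patch."""
    feats = []
    for theta in thetas:
        # (ksize, sigma, theta, lambd, gamma, psi) -- illustrative parameter values
        kernel = cv2.getGaborKernel((9, 9), 2.0, theta, 6.0, 0.5, 0)
        response = cv2.filter2D(image, cv2.CV_32F, kernel)
        for (y0, y1, x0, x1) in patches:  # patch = row/column bounds of a selected area
            feats.append(np.abs(response[y0:y1, x0:x1]).mean())
    return np.array(feats)
```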

Moreover, illumination variation is one of the most important factors that significantly reduce the performance of face recognition systems. It has been shown that the variations between images of the same face due to illumination are almost always larger than the variations due to a change in face identity. Eliminating the effects of illumination variation therefore relates directly to the performance and practicality of face recognition. To alleviate this problem, a low-dimensional illumination space representation (LDISR) of human faces under arbitrary lighting conditions was proposed for recognition [38]. The key idea underlying this representation is that any lighting condition can be represented by 9 basis point light sources. The lighting subspace is constructed not directly from the eigenvectors of training images under various lighting conditions, but from the light sources corresponding to those eigenvectors. The 9 basis light positions are shown in Table 1.

Table 1. Positions of the 9 basis light sources.

However, LDISR only handles faces with 2D in-plane rotation, while out-of-plane rotation is common in real scenes. Therefore, we extend the LDISR from 2D to 3D so that both in-plane and out-of-plane rotations are considered. The training process is similar to that in [38], and the obtained 13 basis light positions are shown in Table 2.

Table 2. Positions of the 13 basis light sources.

Because different human faces have similar 3D shapes, the LDISR of different faces is also similar. In addition, by using the normalization with GNFI (Fig. 2), it can be assumed that different persons have the same LDISR.

Suppose the 13 basis images obtained under the 13 basis lights are \( L_{i}, i = 1, \cdots ,13 \); then the LDISR of a human face can be denoted as \( \varvec{A} = \left[ {L_{1} ,L_{2} , \cdots ,L_{13} } \right] \). A human face image \( \varvec{I}_{x} \) under an arbitrary lighting condition can then be expressed as:

$$ \varvec{I}_{x} = \varvec{A} \cdot\varvec{\lambda} $$
(2)

where \( \varvec{\lambda} = \left[ {\lambda_{1} ,\lambda_{2} , \cdots ,\lambda_{13} } \right]^{T}, -1 \le \lambda_{i} \le 1 \) is the vector of lighting parameters of image \( \varvec{I}_{x} \), which can be calculated by minimizing the energy function \( E\left( \varvec{\lambda} \right) \):

$$ E\left( \varvec{\lambda} \right) = \left\| {\varvec{A} \cdot \varvec{\lambda} - \varvec{I}_{x} } \right\|^{2} $$
(3)

Then the lighting parameters can be obtained as:

$$ \varvec{\lambda} = \left( {\varvec{A}^{T} \varvec{A}} \right)^{ - 1} \varvec{A}^{T} \cdot \varvec{I}_{x} $$
(4)
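For illustration, the lighting parameters of Eqs. (2)–(4) can be estimated as follows; the basis images are stacked as columns of \( \varvec{A} \), and a least-squares solver is used instead of the explicit normal equations purely as a numerical choice. This is a sketch under the assumption that the basis images and the face image are given as equally sized arrays.

```python
import numpy as np

def lighting_parameters(basis_images, face_image):
    """Estimate lambda minimizing ||A*lambda - I_x||^2 (Eqs. (3)-(4))."""
    A = np.stack([b.ravel() for b in basis_images], axis=1)  # (n_pixels, 13)
    I_x = face_image.ravel()
    lam, *_ = np.linalg.lstsq(A, I_x, rcond=None)            # equivalent to (A^T A)^-1 A^T I_x
    return lam
```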

In practice, image pixels are less influenced by lighting when the light sources are distributed more evenly; in that case the values of \( \lambda_{1} ,\lambda_{2} , \cdots ,\lambda_{13} \) for the 13 basis light positions should be close to one another. Based on this observation, an index can be defined to evaluate the lighting influence on the pixel values of each triangular patch of the improved GNFI. To this end, the triangular patches are first split into 13 areas corresponding to the 13 basis point light positions, and the lighting influence weight of the \( k \)th area (\( k = 1, \cdots ,13 \)) in the \( t \)th video frame is then given as:

$$ w_{t}^{k} \left( j \right) = \frac{\left| {\lambda_{k} - \frac{{\lambda_{1} + \cdots + \lambda_{13} }}{13}} \right|}{\sum\nolimits_{k = 1}^{13} {\left| {\lambda_{k} - \frac{{\lambda_{1} + \cdots + \lambda_{13} }}{13}} \right|} } $$
(5)

where \( j \) is the index of a pixel in the \( k \)th area of the triangular patches of the improved GNFI (Fig. 2(e)).
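A direct transcription of Eq. (5) is sketched below: the weight of the \( k \)th area is the absolute deviation of \( \lambda_{k} \) from the mean of all 13 lighting coefficients, normalized over the 13 areas. The function name is illustrative.

```python
import numpy as np

def lighting_weights(lam):
    """Eq. (5): normalized absolute deviation of each lambda_k from the mean."""
    lam = np.asarray(lam, dtype=float)
    dev = np.abs(lam - lam.mean())
    return dev / dev.sum()
```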

Based on the lighting influence weight discussed above, OSAM is extended to illumination weight adaptive online statistical appearance model (IWA-OSAM). The details are as follows.

\( \varvec{m}\left( {\varvec{b}_{t} } \right) \), abbreviated as \( \varvec{m}_{t} \) and of size \( d \), is the concatenation of the pixel color values at time \( t \). It is modeled as a Gaussian mixture stochastic variable with 3 components, \( s,w,l \), as in Jepson et al. [17]. \( \left\{ {\varvec{\mu}_{i,t} ;i = s,w,l} \right\} \) are the mean vectors, \( \left\{ {\varvec{\sigma}_{i,t} ;i = s,w,l} \right\} \) are the vectors composed of the square roots of the diagonal elements of the covariance matrices, and \( \left\{ {\varvec{k}_{i,t} ;i = s,w,l} \right\} \) are the mixing probability vectors. The observation likelihood \( p\left( {\varvec{m}_{t} \mid \varvec{b}_{t} } \right) \) is represented by the sum of the Gaussian distributions of the 3 components \( s,w,l \), weighted by \( \left\{ {\varvec{k}_{i,t} ;i = s,w,l} \right\} \).

The IWA-OSAM represents the stochastic process of all observations up to time \( t-1 \): \( \varvec{m}_{1:t - 1} \). To enable IWA-OSAM to track the target, \( \left\{ {\varvec{k}_{i,t} ;i = s,w,l} \right\} \), \( \varvec{\mu}_{s,t} \) and \( \varvec{\sigma}_{s,t} \) are updated once \( \varvec{b}_{t} \) and \( \varvec{m}_{t} \) are obtained [18]. The following equations hold for \( j = 1,2, \cdots ,d \), where \( c = 0.2 \) is the forgetting factor.

$$ \begin{aligned} k_{i,t} \left( j \right) & = \left( {\left( {1 - c} \right) + cN\left( {w_{t - 1}^{k} \left( j \right)m_{t - 1} \left( j \right);\mu_{i,t - 1} \left( j \right),\sigma_{i,t - 1}^{2} \left( j \right)} \right)} \right)k_{i,t - 1} \left( j \right) \\ \mu_{s,t} \left( j \right) & = \left( {1 - c} \right)\frac{{\mu_{s,t - 1} \left( j \right)}}{{k_{s,t} \left( j \right)}} + cw_{t - 1}^{k} \left( j \right)m_{t - 1} \left( j \right)\frac{{k_{s,t - 1} \left( j \right)}}{{k_{s,t} \left( j \right)}} \\ \sigma_{s,t}^{2} \left( j \right) & = \left( {1 - c} \right)\frac{{\sigma_{s,t - 1}^{2} \left( j \right)}}{{k_{s,t} \left( j \right)}} + c\left( {w_{t - 1}^{k} \left( j \right)m_{t - 1} \left( j \right)} \right)\frac{{k_{s,t - 1} \left( j \right)}}{{k_{s,t} \left( j \right)}} - \mu_{s,t - 1}^{2} \left( j \right) \\ \end{aligned} $$
(6)

Moreover, the methods discussed in [19] are used to reduce the influences of occlusion and outlier here.

Once \( \varvec{b}_{t} \) is solved, the corresponding pixels in the resulting synthesized texture are used to update IWA-OSAM. IWA-OSAM is not updated for outlier or occluded pixels, so outliers and occlusion cannot deteriorate the model.
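A per-pixel sketch of the Eq. (6) update is given below. It is a literal transcription of the printed equations rather than a verified re-derivation of the original OSAM update; the dictionary-based state layout, the explicit renormalization of the mixing probabilities and all variable names are assumptions made for illustration.

```python
import numpy as np

def normal_pdf(x, mu, var):
    """Scalar Gaussian density N(x; mu, var)."""
    return np.exp(-0.5 * (x - mu) ** 2 / var) / np.sqrt(2.0 * np.pi * var)

def iwa_osam_update(k, mu, var, obs, weight, c=0.2):
    """One Eq. (6) step for pixel j; k, mu, var are dicts keyed by 's', 'w', 'l'.

    obs is m_{t-1}(j), weight is the lighting weight w_{t-1}^k(j)."""
    wm = weight * obs
    # Mixing probability update, then renormalization (left implicit in the text).
    k_new = {i: ((1 - c) + c * normal_pdf(wm, mu[i], var[i])) * k[i] for i in ('s', 'w', 'l')}
    total = sum(k_new.values())
    k_new = {i: v / total for i, v in k_new.items()}
    # Only the stable component's mean and variance are re-estimated.
    mu_s = (1 - c) * mu['s'] / k_new['s'] + c * wm * k['s'] / k_new['s']
    var_s = (1 - c) * var['s'] / k_new['s'] + c * wm * k['s'] / k_new['s'] - mu['s'] ** 2
    return k_new, mu_s, var_s
```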

3 Facial Expression Recognition

Static knowledge and dynamic knowledge are extracted to cope with the complex variability of facial expression.

3.1 Static Facial Expression Knowledge

The \( \varvec{\alpha} \) retrieved from a frame of the input video can be seen as a description of the facial muscle activations of the person in that frame, according to the definitions of action units in [36]. Therefore, a relationship between \( \varvec{\alpha} \) and facial expression modes is established: 7 typical vectors \( \left\{ {\varvec{\alpha}_{su} ,\varvec{\alpha}_{di} ,\varvec{\alpha}_{fe} ,\varvec{\alpha}_{ha} ,\varvec{\alpha}_{sa} ,\varvec{\alpha}_{an} ,\varvec{\alpha}_{ne} } \right\} \) are chosen as representatives of the 7 universal facial expressions: surprise, disgust, fear, happiness, sadness, anger and neutral. They serve as the static knowledge for facial expression recognition.

When the \( \varvec{\alpha} \) of a frame of the input video is retrieved, the Euclidean distances between it and each of \( \left\{ {\varvec{\alpha}_{su} ,\varvec{\alpha}_{di} ,\varvec{\alpha}_{fe} ,\varvec{\alpha}_{ha} ,\varvec{\alpha}_{sa} ,\varvec{\alpha}_{an} ,\varvec{\alpha}_{ne} } \right\} \) are computed, and the facial expression corresponding to the minimum distance is taken as the recognition result.
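The static rule amounts to a nearest-neighbor decision, sketched below under the assumption that the 7 representative vectors are supplied as a dictionary; the representative values themselves come from the learned static knowledge and are not shown.

```python
import numpy as np

EXPRESSIONS = ['surprise', 'disgust', 'fear', 'happiness', 'sadness', 'anger', 'neutral']

def recognize_static(alpha, representatives):
    """Pick the expression whose representative alpha vector is closest in Euclidean distance.

    representatives: dict mapping expression name -> 7-D representative alpha vector."""
    alpha = np.asarray(alpha, dtype=float)
    dists = {name: np.linalg.norm(alpha - np.asarray(vec, dtype=float))
             for name, vec in representatives.items()}
    return min(dists, key=dists.get)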

3.2 Dynamic Facial Expression Knowledge

For each expression \( \gamma \), a three-layer Radial Basis Function (RBF) network is trained to describe the temporal evolution of the facial animations \( \varvec{\alpha}_{t} \):

$$ \varvec{\alpha}_{t} = \varvec{W} \cdot\varvec{\varPhi}(\varvec{\alpha}_{t - 1} ) + \varvec{B} $$
(7)

The middle layer contains 400 nodes. \( \varvec{W}(7 \times 400) \) and \( \varvec{B}(7 \times 1) \) are the weight matrix and bias vector of the output layer. The \( i \)th node of the middle layer is given by the RBF:

$$ \varvec{\varPhi}_{i} (\varvec{\alpha}) = \exp ( - (\left\| {\varvec{\alpha} - \varvec{IW}_{i} } \right\| \times B_{mi} )^{2} ) $$
(8)

where \( \varvec{IW}_{i} \) is the \( i \)th row of the middle-layer weight matrix \( \varvec{IW}(400 \times 7) \) and represents the mean of the \( i \)th RBF, \( \left\| {\varvec{\alpha} - \varvec{IW}_{i} } \right\| \) is the distance between \( \varvec{\alpha} \) and \( \varvec{IW}_{i} \), and \( B_{mi} \) is the \( i \)th component of the middle-layer bias vector \( \varvec{B}_{m} (400 \times 1) \), whose reciprocal represents the variance of the \( i \)th RBF. The dynamics of the facial animations associated with the neutral expression is simplified as \( \varvec{\alpha}_{t} =\varvec{\alpha}_{t - 1} \).
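A forward pass of this per-expression RBF predictor (Eqs. (7)–(8)) is sketched below. The weights would normally come from training on tracked \( \varvec{\alpha} \) sequences; random values are used here only to make the snippet self-contained, and the variable names mirror the symbols in the text.

```python
import numpy as np

rng = np.random.default_rng(0)
IW = rng.normal(size=(400, 7))       # middle-layer centres, one 7-D row per RBF unit
B_m = np.abs(rng.normal(size=400))   # middle-layer biases (inverse widths)
W = rng.normal(size=(7, 400))        # output-layer weights
B = rng.normal(size=(7, 1))          # output-layer bias

def rbf_predict(alpha_prev):
    """Predict alpha_t from alpha_{t-1} via Eqs. (7)-(8)."""
    alpha_prev = np.asarray(alpha_prev, dtype=float).reshape(7)
    phi = np.exp(-(np.linalg.norm(alpha_prev - IW, axis=1) * B_m) ** 2)  # Eq. (8)
    return (W @ phi.reshape(400, 1) + B).ravel()                         # Eq. (7)
```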

We define a transition matrix \( \varvec{T} \) whose entries \( T_{{\gamma^{\prime},\gamma }} \) describe the probability of transition between two expressions \( \gamma^{\prime} \) and \( \gamma \). The transition probabilities are learned from the database in [39]. The RBF network is then trained on 60% of the Extended Cohn-Kanade (CK+) database [25] and the database in [39], with the corresponding facial animations \( \varvec{\alpha}_{t} \) tracked by IWA-OSAM. The RBF network thus serves as the dynamic knowledge for facial expression recognition.

3.3 Framework

Given a facial video, we would like to estimate \( \varvec{b}_{t} \) and the facial expression for each frame at time \( t \), by particle filtering, given all the observations up to time \( t \). Therefore, we create a mixed state \( \left( {\varvec{b}_{t}^{T} ,\gamma_{t} } \right)^{T} \), where \( \gamma_{t} \in \left\{ {1, \cdots ,7} \right\} \) is a discrete state, representing one of 7 universal expressions. For the estimation of \( \left( {\varvec{b}_{t}^{T} ,\gamma_{t} } \right)^{T} \), two schemes are proposed. The first scheme (Fig. 3) is to infer \( \left( {\varvec{b}_{t}^{T} ,\gamma_{t} } \right)^{T} \) sequentially, where facial expression is recognized by static facial expression knowledge. The second scheme (Fig. 4) is to infer \( \left( {\varvec{b}_{t}^{T} ,\gamma_{t} } \right)^{T} \) simultaneously, where facial expression is recognized by fusing the static and dynamic facial expression knowledge.

Fig. 3. Inferring the facial motion and facial expression sequentially.

Fig. 4. Inferring the facial motion and facial expression simultaneously.
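To make the simultaneous scheme (Fig. 4) concrete, the following high-level sketch propagates a set of mixed-state particles for one frame: the expression label is sampled from the learned transition matrix \( \varvec{T} \), the animation part of \( \varvec{b} \) is predicted by the per-expression RBF dynamics, and particles are re-weighted by the IWA-OSAM likelihood. The functions rbf_predict_for and osam_likelihood, the parameter layout (6 pose + 10 shape + 7 animation) and the noise scales are placeholders, not values from the paper.

```python
import numpy as np

def propagate_particles(particles, T, rbf_predict_for, osam_likelihood, frame, rng):
    """One prediction/weighting step over a list of (b, gamma) particles."""
    new_particles, weights = [], []
    for b, gamma in particles:
        gamma_new = rng.choice(len(T), p=T[gamma])            # expression transition
        alpha_pred = rbf_predict_for(gamma_new, b[16:])       # dynamic knowledge on alpha
        b_new = b.copy()
        b_new[:16] += rng.normal(scale=0.01, size=16)         # diffuse pose + shape parts
        b_new[16:] = alpha_pred + rng.normal(scale=0.01, size=7)
        new_particles.append((b_new, gamma_new))
        weights.append(osam_likelihood(frame, b_new))         # p(m_t | b_t)
    weights = np.asarray(weights, dtype=float)
    return new_particles, weights / weights.sum()
```

In the sequential scheme (Fig. 3), by contrast, only \( \varvec{b} \) would be estimated in the filter, with the expression label assigned afterwards by the static rule of Sect. 3.1.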

4 Evaluation

A workstation with an Intel i7-6700K CPU (4.0 GHz), 8 GB of memory and an NVIDIA GTX 960 GPU is used.

4.1 Testing Dataset and Evaluation Methods for Facial Motion Tracking

A facial image sequence [5] and the IMM face database [40], both with ground truth landmarks available, are used. They support point-based error comparison, and a texture-based test can also be performed on them. In addition, since the pose coverage and illumination variations of these databases are not large enough, the 13 videos of the MPEG-4 testing database, including the Carphone and Foreman sequences, and 78 captured videos from 48 subjects with a resolution of 352 × 288 are also used. Their ground truth landmarks are obtained by manual adjustment.

Root Mean Square (RMS) landmark error measures the Root Mean Square Error (RMSE) between the ground truth landmark points and the fitted shape points after tracking, and is defined as:

$$ \sqrt {\frac{1}{N}\sum\nolimits_{i = 1}^{N} {\left( {C_{fit}^{i} - C_{grd}^{i} } \right)^{2} } } $$
(9)

where \( C_{fit}^{i} \) and \( C_{grd}^{i} \) are the \( x \) or \( y \) coordinates of the \( i \)th fitted shape point and the \( i \)th ground truth landmark point, respectively.
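For illustration, the RMS landmark error of Eq. (9) can be computed as below; here the fitted and ground-truth landmarks are given as (N, 2) arrays and the error is pooled over the x and y coordinates, which is a simplifying assumption.

```python
import numpy as np

def rms_landmark_error(fitted, ground_truth):
    """RMS error between fitted and ground-truth landmark coordinates (Eq. (9))."""
    fitted = np.asarray(fitted, dtype=float)
    ground_truth = np.asarray(ground_truth, dtype=float)
    return float(np.sqrt(np.mean((fitted - ground_truth) ** 2)))
```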

4.2 Testing Dataset and Evaluation Methods for Facial Expression Recognition

The Extended Cohn-Kanade (CK+) database [25] and the database in [39], excluding the portions used for training, are used. The CK+ database also provides baseline results using AAM and a linear support vector machine classifier.

For evaluation, the facial expression recognition score is reported, together with a confusion matrix over the different facial expressions.
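Both measures can be computed as in the minimal sketch below, assuming the predicted and ground-truth expressions are given as integer labels in the range 0–6; the function name and the label encoding are illustrative.

```python
import numpy as np

def evaluate(predicted, actual, n_classes=7):
    """Return the overall recognition score and the confusion matrix."""
    predicted = np.asarray(predicted, dtype=int)
    actual = np.asarray(actual, dtype=int)
    score = float(np.mean(predicted == actual))
    confusion = np.zeros((n_classes, n_classes), dtype=int)
    for a, p in zip(actual, predicted):
        confusion[a, p] += 1          # rows: actual expression, columns: predicted
    return score, confusion
```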

4.3 Facial Motion Tracking for Monocular Videos

Figure 5 shows the facial motion tracking results on several public and captured videos. By computing the evaluation criteria, we find that accurate tracking is obtained even in the presence of perturbing factors, including significant head pose, illumination and facial expression variations.

Fig. 5. Facial motion tracking results.

Based on the testing dataset, a comparison between different tracking algorithms is conducted. It should be noted that the work in [3] is an offline method, and its performance is highly dependent on the training data. However, Active Shape Model (ASM)/Active Appearance Model (AAM) based trackers, such as that in [3], are mainstream, and once the model is learned from a dataset, the tracker can track any face without further training. Therefore, we compare this work with that in [3]. The training images, which correspond approximately to every 4th image in the sequences of the testing dataset, are used to construct the AAM in [3], and the number of bases in the constructed AAM is chosen so as to keep 95% of the variation. The performance is evaluated on the testing data excluding the training images.

Table 3 reports the evaluation criteria and shows the superiority of our algorithm. This is because the IWA-OSAM in our approach learns the variation of facial motion effectively and alleviates the influence of lighting, and because the improved GNFI is less affected by head pose change and illumination variation.

Table 3. Performance evaluation of different facial motion tracking algorithms.

4.4 Facial Expression Recognition

Figure 6 shows the facial expression recognition results on the testing database. As can be seen, facial expressions are recognized effectively by our algorithm even in the presence of perturbing factors such as significant head pose and illumination variations.

Fig. 6. Facial expression recognition results.

Based on the CK+ database, we compare our algorithm with the methods in [20,21,22, 31, 34, 37] (Table 4). The recognition score of our facial expression recognition algorithm is higher than those of the other algorithms, and also higher than that of our algorithm without illumination modeling.

Table 4. The accuracy comparison between several algorithms.

Following the benchmarking protocol of the CK+ database, the leave-one-subject-out cross-validation configuration is used, and a confusion matrix documents the results. Table 5 shows the high recognition scores of our algorithm. This is because both static and dynamic knowledge are used, and because illumination is modeled and its effect removed from the improved GNFI, which increases the accuracy and robustness of facial motion tracking.

Table 5. The confusion matrix of facial expression recognition by our algorithm.

5 Conclusion

We propose a unified facial motion tracking and expression recognition framework for monocular video. To retrieve facial motion, an online weight adaptive statistical appearance method is embedded into a particle filtering strategy, using a deformable facial mesh model as an intermediate representation that brings input images into correspondence by registration and deformation. To recognize facial expression, facial animation and facial expression are estimated sequentially for fast and efficient applications, with facial expression recognized from static anatomical facial expression knowledge. In addition, facial animation and facial expression are estimated simultaneously for robust and precise applications, with facial expression recognized by fusing static and dynamic facial expression knowledge. Experiments demonstrate the high tracking robustness and accuracy as well as the high facial expression recognition score of the proposed framework.

In future work, a recursive neural network will be used to learn the dynamic expression knowledge.