
1 Introduction

Automatic analysis of facial expressions is highly important for the automatic understanding of humans, their actions, and their behavior in general. Facial expression has been a focus of research in human behavior for over a hundred years [30]. It is central to several leading theories of emotion [38, 116] and has been the focus of, at times, heated debate about issues in emotion science. Facial expression figures prominently in research on almost every aspect of emotion, including psychophysiology [66], neural correlates [39], development [84], perception [2], addiction [47], social processes [52], depression [27] and other emotion disorders [118]. Facial expression communicates physical pain [100], alertness, personality and interpersonal relations [46]. Applications of facial expression analysis include marketing [107], perceptual user interfaces, human–robot interaction [98, 126, 145], drowsy-driver detection [128], telenursing [29], pain assessment [79], analysis of mother–infant interaction [45], autism [83], social robotics [6, 18], facial animation [72, 110] and expression mapping for video gaming [54], among others. Further examples are provided in Chaps. 22, 23 and 26.

In part because of its importance and potential uses, as well as its inherent challenges, automated facial expression recognition has been of keen interest in computer vision and machine learning. Beginning with a seminal meeting sponsored by the US National Science Foundation [41], research on this topic has become increasingly broad, systematic, and productive. IEEE sponsorship of international conferences (http://www.fg2011.org/), workshops, and a new journal in affective computing, among other outlets (e.g., IEEE Transactions on Systems, Man, and Cybernetics and special issues of journals such as Image and Vision Computing), speaks to the vitality of research in this area. Automated facial expression analysis is critical as well to the emerging fields of Computational Behavior Science and Social Signal Processing.

Automated facial image analysis confronts a series of challenges. The face and facial features must be detected in video; shape or appearance information must be extracted and then normalized for variation in pose, illumination and individual differences; the resulting normalized features are used to segment and classify facial actions. Partial occlusion is a frequent challenge that may be intermittent or continuous (e.g., bringing an object in front of the face, self-occlusion from head turns, eyeglasses or facial jewelry). While human observers easily accommodate changes in pose, scale, illumination, occlusion, and individual differences, these and other sources of variation represent considerable challenges for computer vision. Then there is the machine-learning challenge of automatically detecting actions that require significant training and expertise even for human coders. There is much good research to do.

We begin with a description of approaches to annotation and then review publicly available databases. Research in automated facial expression analysis depends on access to large, well-annotated video data. We then review approaches to feature detection, representation, and registration, and both supervised and unsupervised learning of facial expression. We close with implications for future research in this area.

2 Annotation of Facial Expression

Two broad approaches to annotating facial expression are message–judgment and sign-based [25]. In the former, observers make inferences about the meaning of facial actions and assign corresponding labels. The most widely used approach of this sort makes inferences about felt emotion. Inspired by cross-cultural studies by Ekman [38] and related work by Izard [55], a number of expressions of what are referred to as basic emotions have been described. These include joy, surprise, anger, fear, disgust, sadness, embarrassment, and contempt. Examples of the first six are shown in Fig. 19.1. Message–judgment approaches tend to be holistic; that is, they typically combine information from multiple regions of the face, implicitly acknowledge that the same emotion or cognitive state may be expressed in various ways, and they utilize the perceptual wisdom of human observers, which may include taking account of context. A limitation is that many of these emotions may occur infrequently in daily life and much human experience involves blends of two or more emotions. While a small set of specific expressions that vary in multiple regions of the face may be advantageous for training and testing, their generalizability to new image sources and applications is limited. Moreover, the use of emotion labels implies that posers are experiencing the actual emotion. This inference often is unwarranted, as when facial expression is posed or faked, and the same expression may map to different felt emotions. Smiles, for instance, occur in both joy and embarrassment [1].

Fig. 19.1 Basic facial expression phenotypes. 1, disgust; 2, fear; 3, joy; 4, surprise; 5, sadness; 6, anger. Figure reproduced with permission from [105]. © 2010 IEEE

In a sign-based approach, physical changes in face shape or texture are the descriptors. The most widely used approach is that of Ekman and colleagues. Their Facial Action Coding System (FACS) [40] segments the visible effects of facial muscle activation into "action units" (AUs), each of which is related to one or more facial muscles. FACS is a comprehensive, anatomically based system for measuring nearly all visually discernible facial movement; it describes facial activity on the basis of 44 unique AUs, as well as several categories of head and eye positions and movements. Facial movement is thus described in terms of constituent components, or AUs. Any facial event (for example, an emotion expression or paralinguistic signal) may be decomposed into one or more AUs. For example, what has been described as the felt or Duchenne smile typically includes movement of the zygomatic major (AU 12) and orbicularis oculi, pars lateralis (AU 6).

The FACS taxonomy was defined by manually observing graylevel variation between expressions in images and, to a lesser extent, by recording the electrical activity of the underlying facial muscles [24]. Depending on which edition of FACS is used, there are 30 to 44 AUs and additional "action descriptors"; action descriptors are movements for which the anatomical basis is not established. More than 7000 AU combinations have been observed [104]. Figures 19.2 and 19.3 illustrate AUs from the upper and lower portions of the face, respectively. Figure 19.4 provides an example in which FACS action units have been used to label a prototypic expression of pain. Because of its descriptive power, FACS has become the standard for facial measurement in behavioral research and has supplanted message–judgment approaches in automated facial image analysis. FACS has also become influential in the related area of computer facial animation: the MPEG-4 facial animation parameters [92] are derived from FACS.

Fig. 19.2 FACS action units (AUs) for the upper face. Figure reproduced with permission from [24]

Fig. 19.3 Action units of the lower face. Figure reproduced with permission from [24]

Fig. 19.4 An example of facial action units associated with a prototypic expression of pain. Figure reproduced with permission from [75]. © 2011 IEEE

Facial actions can vary in intensity, which FACS represents at an ordinal level of measurement. The original (1978) version of FACS included criteria for measuring intensity at three levels (X, Y, and Z). The more recent 2002 edition provides criteria for measuring intensity at five levels, ranging from A to E. FACS scoring produces a list of AU-based descriptions of each facial event in a video record. Figure 19.5 shows an example of FACS coding of AU 12 (involved in smiling), with the onset, peak and offset labeled.

Fig. 19.5 Left to right: evolution of an AU 12 (involved in smiling) from onset, to peak, to offset. FACS coding typically involves frame-by-frame inspection of the video, paying close attention to subtle cues such as wrinkles, bulges, and furrows to determine which facial action units have occurred and at what intensity. Full labeling requires marking the onset, peak and offset of the action unit and all changes in intensity; full coding is generally too costly
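To make this annotation scheme concrete, the sketch below shows one possible in-memory representation of a FACS-coded event; the field names, frame numbers and example values are illustrative choices of ours, not part of the FACS standard.

```python
from dataclasses import dataclass

# Hypothetical representation of a single FACS-coded event.
# Field names and example values are illustrative, not part of FACS itself.
@dataclass
class AUEvent:
    au: int           # action unit number, e.g. 12 for the lip corner puller
    intensity: str    # ordinal intensity, "A" (trace) through "E" (maximum) in the 2002 edition
    onset: int        # first frame where the action is visible
    peak: int         # frame of maximum intensity
    offset: int       # frame where the action returns to neutral

# A Duchenne smile decomposed into its constituent action units (AU 6 + AU 12).
duchenne_smile = [
    AUEvent(au=6, intensity="C", onset=110, peak=140, offset=180),
    AUEvent(au=12, intensity="D", onset=105, peak=138, offset=185),
]
```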

For both message–judgment and sign-based approaches, the reliability of human coding has been a neglected topic in the automated facial expression recognition literature. With some exceptions, publicly available databases (Table 19.1) and research reports fail to provide information about inter-observer reliability or agreement. This is an important omission, in that inter-system agreement between manual and automated coding is inherently limited by intra-system agreement: if manual coders disagree about the ground truth used to train classifiers, it is unlikely that classifiers will surpass them. Inter-system reliability can be considered in numerous ways [26], ranging from the precision of measurement of onsets, peaks, offsets, and changes in action unit intensity, to whether or not observers agree on action unit occurrence within some number of frames. More attention to the reliability of coding would be useful in evaluating training data and test results. Sayette and Cohn [103] found that inter-observer agreement varied among AUs. Agreement for AU 7 (lower lid tightener) was relatively low, possibly due to confusion with AU 6 (cheek raiser). Some AUs may occur too infrequently to measure reliably (e.g., AU 11). Investigators may want to consider pooling some AUs to achieve more reliable units.

Table 19.1 Publicly available facial expression databases

Agreement between human coders is better when temporal precision is relaxed. In behavioral research, it is common to expect coders to agree only within a \(\frac{1}{2}\)-second window. In automated facial image analysis, investigators typically assume exact agreement between classifiers and ground truth, a level of temporal precision beyond what may be feasible for many AUs [24].
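As an illustration of relaxed temporal precision, the following sketch scores agreement between two hypothetical coders, counting an event as matched when the other coder marks the same AU within a tolerance window (15 frames, roughly half a second at 30 fps). It is a simplified illustration, not the agreement statistic used in any of the studies cited above.

```python
def windowed_agreement(events_a, events_b, tolerance_frames=15):
    """Fraction of coder A's events matched by coder B within a frame tolerance.

    events_a, events_b: lists of (au, peak_frame) tuples from two human coders.
    tolerance_frames: e.g. 15 frames ~ 1/2 second at 30 fps.
    Illustrative only; published studies typically report kappa or F1 instead.
    """
    if not events_a:
        return 1.0
    matched = 0
    for au, frame in events_a:
        if any(au == au_b and abs(frame - frame_b) <= tolerance_frames
               for au_b, frame_b in events_b):
            matched += 1
    return matched / len(events_a)

# Example: coder B marks AU 12 seven frames later than coder A.
print(windowed_agreement([(12, 100)], [(12, 107)]))  # 1.0 within a 15-frame window
```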

3 Databases

The development of robust facial expression recognition algorithms requires well-labeled databases of sufficient size that include carefully controlled variations of pose, illumination and resolution. Publicly available databases are necessary for comparative evaluation of algorithms. Collecting a high-quality database is a resource-intensive task, so the availability of public facial expression databases is important for the advancement of the field. Table 19.1 summarizes the characteristics of publicly available databases.

Most facial expression databases have been collected by asking subjects to perform a series of expressions. These directed facial action tasks may differ in appearance and timing from spontaneously occurring behavior. Deliberate and spontaneous facial behavior are mediated by separate motor pathways, the pyramidal and extrapyramidal motor tracts, respectively. As a consequence, fine motor control of deliberate facial actions is often inferior to and less symmetrical than what occurs spontaneously. Many people, for instance, are able to raise their outer brows spontaneously while leaving their inner brows at rest; few can perform this action voluntarily. Spontaneous depression of the lip corners (AU 15) and raising and narrowing the inner corners of the brow (AU 1+4) are common signs of sadness. Without training, few people can perform these actions deliberately, which incidentally is an aid to lie detection [36]. Differences in the temporal organization of spontaneous and deliberate facial actions are particularly important in that many pattern recognition approaches, such as HMMs, are highly dependent on the timing of the appearance change. Unless a database includes both deliberate and spontaneous facial actions, it will likely prove inadequate for developing facial expression methods that are robust to these differences.

4 Facial Feature Tracking, Registration and Feature Extraction

Prototypical expression and AU detection from video are challenging computer vision and pattern recognition problems. Some of the most important challenges are: (1) non-frontal pose and moderate to large head motion make facial image registration difficult; (2) classifiers can suffer from over-fitting when trained with relatively few examples for each AU; (3) many facial actions are inherently subtle, making them difficult to model; (4) individual differences among faces in shape and appearance make the classification task difficult to generalize across subjects; (5) the temporal dynamics of AUs are highly variable; these differences can signal different communicative intentions [62] or levels of distress [9], and present a challenge for detection and classification; and (6) the number of possible combinations of 40+ individual action units numbers in the thousands (more than 7000 action unit combinations have been observed [42]). To address these issues, a large number of facial expression and AU recognition/detection systems have been proposed over the last 20 years. Some of the leading efforts include those at: Carnegie Mellon University [81, 108, 112, 142], University of California, San Diego [7, 68], University of Illinois at Urbana-Champaign [23, 129], Rensselaer Polytechnic Institute [117], Massachusetts Institute of Technology [43], University of Maryland [13, 131], Imperial College [59, 95, 123], IDIAP Dalle Molle Institute for Perceptual Artificial Intelligence [44], and others [82, 138].

Most facial expression recognition systems are composed of three main modules: (1) face detection, facial feature tracking and registration; (2) feature extraction; and (3) supervised or unsupervised learning. Figure 19.6 illustrates an example of these three modules. In the following sections we discuss each of these modules in more detail, with emphasis on the current CMU system. For other systems see [44, 93, 113].

Fig. 19.6 Block diagram of the CMU system. The face is tracked using an AAM; shape and appearance features are extracted, normalized, and output to a linear SVM for action unit or expression detection. Figure reproduced with permission from [78]. © 2010 IEEE

4.1 Facial Feature Detection and Tracking

Face detection is an initial step in most automatic facial expression recognition systems (see Chap. 5). For real-time, frontal face detection, the Viola and Jones [127] face detector is arguably the most commonly employed algorithm; see [137] for a survey of recent advances in face detection. Once the face is detected, two approaches to registration are common. One performs coarse registration by detecting a sparse set of facial features (e.g., eyes) in each frame. The other detects detailed features (i.e., dense points around the eyes and other facial landmarks) throughout the video sequence. In this section we describe a unified framework for the latter, which we refer to as Parameterized Appearance Models (PAMs). PAMs include the Lucas–Kanade method [74], Eigentracking [12], Active Appearance Models [28, 33, 34, 87], and Morphable Models [14, 57], which have been popular approaches for facial feature detection, tracking and modeling faces in general.

PAMs are among the most popular methods for facial feature detection and face alignment in general. PAMs for faces build an appearance and/or shape representation from the principal components of labeled training data. Let \(\mathbf{d}_i \in \Re^{m\times 1}\) be the ith sample of a training set \(\mathbf{D} \in \Re^{m\times n}\) of n samples, where each vector \(\mathbf{d}_i\) is a vectorized image of m pixels. Each face image in the training set has previously been manually labeled with p landmarks. A 2p-dimensional shape vector is constructed by stacking the (x, y) positions of the landmarks as \(\mathbf{s} = [x_1; y_1; x_2; y_2; \ldots; x_p; y_p]\). Figure 19.9a shows an example of several face images that have been labeled with 66 landmarks. Given the labeled training samples, Procrustes analysis [28] is applied to the shape vectors to remove two-dimensional rigid transformations. After the rigid transformation has been removed, principal component analysis (PCA) is applied to the shape vectors to build a linear shape model. The shape model can reconstruct any shape in the training set as the mean (\(\mathbf{s}_0\)) plus a linear combination of a shape basis \(\mathbf{U}_s\) (the eigenvectors of the shape covariance matrix), that is, \(\mathbf{s} \approx \mathbf{s}_0 + \mathbf{U}_s \mathbf{c}_s\), where \(\mathbf{c}_s\) are the shape coefficients. \(\mathbf{U}_s\) spans the shape space that accounts for identity, expression and pose variation in the training set. Figure 19.7a shows the shape mean and PCA basis. Similarly, after backwarping the texture to a canonical configuration, the appearance (normalized graylevel) is vectorized into an m-dimensional vector and stacked into the n columns of \(\mathbf{D} \in \Re^{m\times n}\). The appearance model \(\mathbf{U} \in \Re^{m\times k}\) is computed by calculating the first k principal components [56] of \(\mathbf{D}\). Figure 19.7b shows the mean appearance and the PCA basis. Figure 19.7c contains face images generated with the Active Appearance Model (AAM) by setting appropriate shape and texture parameters.

Fig. 19.7 The mean and first two modes of variation of the 2D AAM shape (a) and appearance (b), and the mean and first two modes of the 3D AAM shape. (c) Reconstructed face. Reproduced with permission from [88]
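The shape-model construction described above can be sketched in a few lines of NumPy: landmark vectors are crudely aligned (translation and scale only, a simplification of full Procrustes analysis) and PCA yields the mean shape \(\mathbf{s}_0\) and the basis \(\mathbf{U}_s\). This is a schematic illustration of the idea, not the AAM code used in the cited work.

```python
import numpy as np

def build_shape_model(shapes, n_components=10):
    """shapes: (n_samples, 2p) array of stacked (x, y) landmark coordinates.

    Returns the mean shape s0 and shape basis Us so that s ~= s0 + Us @ cs.
    Simplified: only translation and scale are removed, whereas full
    Procrustes analysis also removes rotation.
    """
    aligned = []
    for s in shapes:
        pts = s.reshape(-1, 2)
        pts = pts - pts.mean(axis=0)             # remove translation
        pts = pts / np.linalg.norm(pts)          # remove scale
        aligned.append(pts.ravel())
    aligned = np.asarray(aligned)

    s0 = aligned.mean(axis=0)
    centered = aligned - s0
    # PCA via SVD: rows of Vt are the principal directions (eigenvectors).
    _, _, Vt = np.linalg.svd(centered, full_matrices=False)
    Us = Vt[:n_components].T                     # (2p, k) shape basis
    return s0, Us

# Example with random data standing in for labeled landmarks (p = 66 points).
rng = np.random.default_rng(0)
shapes = rng.normal(size=(50, 132))
s0, Us = build_shape_model(shapes)
print(s0.shape, Us.shape)                        # (132,), (132, 10)
```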

Once the appearance and shape models have been learned from the training samples (i.e., \(\mathbf{U}\) and \(\mathbf{U}_s\) are known), alignment is achieved by finding the motion parameters p that best align the image with respect to the subspace U by minimizing:

$$\min_{\mathbf{c},\mathbf{p}} \ \big\|\mathbf{d}\bigl(\mathbf{f}(\mathbf{x},\mathbf{p})\bigr) -\mathbf{U}\mathbf{c}\big\|_2^2, $$
(19.1)

where c is the vector of appearance coefficients, \(\mathbf{x} = [x_1, y_1, \ldots, x_l, y_l]^T\) is the coordinate vector of the pixels to track, and \(\mathbf{f}(\mathbf{x}, \mathbf{p})\) is the geometric transformation function, whose value is a vector denoted by \([u_1, v_1, \ldots, u_l, v_l]^T\). \(\mathbf{d}\) is the image frame under consideration, and \(\mathbf{d}(\mathbf{f}(\mathbf{x},\mathbf{p}))\) is the appearance vector whose ith entry is the intensity of image \(\mathbf{d}\) at pixel \((u_i, v_i)\). For affine and non-rigid transformations, \((u_i, v_i)\) relates to \((x_i, y_i)\) by:

$$\left[\begin{array}{c} u_i \\ v_i\end{array}\right] =\left[\begin{array}{c@{\quad }c}a_1 & a_2\\a_4 & a_5\end{array}\right]\left[\begin{array}{c}x_i^s \\y_i^s\end{array}\right] +\left[\begin{array}{c}a_3 \\a_6\end{array}\right]. $$
(19.2)

Here \([x_{1}^{s}, y_{1}^{s}, \ldots, x_{l}^{s}, y_{l}^{s}]^{T} = \mathbf{x} + \mathbf{U}^{s}\mathbf{c}^{s}\). The affine and non-rigid motion parameters are \(\mathbf{a}\) and \(\mathbf{c}^{s}\), respectively, and \(\mathbf{p} = [\mathbf{a}; \mathbf{c}^{s}]\) combines both. In the case of the Lucas–Kanade tracker [74], \(\mathbf{c}\) is fixed to one and \(\mathbf{U}\) is a subspace containing a single vector, the reference template, which is the appearance of the tracked object in the initial/previous frame.

Given an unseen facial image d, facial feature detection or tracking with PAMs is performed by optimizing (19.1). Due to the high dimensionality of the motion space, a standard approach to efficiently search over the parameter space is to use gradient-based methods [5, 10, 12, 28, 31, 87]. To compute the gradient of the cost function given in (19.1), it is common to use a Taylor series expansion to approximate:

$$\mathbf{d}\bigl(\mathbf{f}(\mathbf{x}, \mathbf{p} + \delta\mathbf{p})\bigr)\approx \mathbf{d}\bigl(\mathbf{f}(\mathbf{x},\mathbf{p})\bigr) + \mathbf{J}_{\mathbf{p}}\mathbf {d}(\mathbf{p})\delta\mathbf{p},$$
(19.3)

where \(\mathbf{J}_{\mathbf{p}}\mathbf{d}(\mathbf{p}) = \frac{\partial\mathbf {d}(\mathbf{f}(\mathbf{x},\mathbf{p}))}{\partial\mathbf{p}}\) is the Jacobian of the image d with respect to the motion parameters p [74]. Once linearized, a standard approach is to use the Gauss–Newton method for optimization [10, 12]. Other approaches learn an approximation of the Jacobian matrix with linear [28] or non-linear [71, 102] regression. Figure 19.9a shows an example of tracking 66 facial features with an AAM on the RU-FACS database [7].
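To make the optimization of (19.1) concrete, the sketch below performs Gauss–Newton updates for the simplest member of the PAM family: a Lucas–Kanade-style tracker in which U contains a single fixed template (c fixed to one) and p is a pure 2D translation. It is a pedagogical illustration under those simplifying assumptions, not the fitting algorithm of any particular AAM implementation.

```python
import numpy as np
from scipy.ndimage import map_coordinates

def gauss_newton_translation_step(image, template, coords, p):
    """One Gauss-Newton update of a 2D translation p minimizing
    || d(f(x, p)) - template ||^2  (cf. Eq. (19.1) with a single-template subspace).

    image: 2D array (current frame d).
    template: (l,) vector of reference intensities (the subspace U with c fixed to 1).
    coords: (2, l) array of (row, col) template coordinates x.
    p: (2,) current translation estimate.
    """
    warped_coords = coords + p[:, None]                       # f(x, p) for a translation
    d_warp = map_coordinates(image, warped_coords, order=1)   # sample d(f(x, p))

    # Image gradients sampled at the warped coordinates give the Jacobian
    # J = d d(f(x, p)) / d p, which for a pure translation is [dI/drow, dI/dcol].
    grad_r, grad_c = np.gradient(image)
    J = np.stack([map_coordinates(grad_r, warped_coords, order=1),
                  map_coordinates(grad_c, warped_coords, order=1)], axis=1)  # (l, 2)

    residual = template - d_warp
    delta_p, *_ = np.linalg.lstsq(J, residual, rcond=None)    # Gauss-Newton step
    return p + delta_p

# Toy example: recover a small shift of a smooth synthetic image patch.
img = np.outer(np.hanning(64), np.hanning(64))
rows, cols = np.meshgrid(np.arange(20, 44), np.arange(20, 44), indexing="ij")
coords = np.stack([rows.ravel(), cols.ravel()]).astype(float)
template = map_coordinates(img, coords, order=1)
p = np.array([1.5, -1.0])      # initial (wrong) translation guess
for _ in range(10):
    p = gauss_newton_translation_step(img, template, coords, p)
print(p)                        # should converge toward [0, 0]
```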

4.2 Registration and Feature Extraction

After the face has been detected and the facial feature points have been tracked, the next two steps, registration and feature extraction, follow.

Registration:

The main goal of registration is to normalize the image to remove 3D rigid head motion, so that features can be geometrically normalized. 3D transformations can be estimated from a monocular camera (up to a scale factor) or from multiple cameras using structure-from-motion algorithms [51, 130]. However, if there is little out-of-plane rotation (i.e., less than about 15 to 20 degrees) and the face is relatively far from the camera (so that orthographic projection can be assumed), the 2D projected motion field of a 3D planar surface can be recovered with a six-parameter affine model. In this situation, simpler algorithms can be used to register the image and extract normalized facial features.

Following [108, 142], a similarity transform registers the facial features with respect to an average face (see the middle column of Fig. 19.8). To extract appearance representations in areas that have not been explicitly tracked (e.g., the nasolabial furrow), we use a backward piecewise-affine warp with Delaunay triangulation. Figure 19.8 shows the two-step process for registering the face to a canonical pose for facial expression recognition. Purple squares represent tracked points and blue dots represent non-tracked meaningful points; the dashed blue line shows the mapping between a point in the mean shape and the corresponding point in the original image. Using an affine transformation plus backwarping preserves the shape variation in appearance better than geometric normalization alone. This two-step registration proves particularly important for detecting low-intensity AUs.

Fig. 19.8 Registration with two-step alignment. Figure reproduced with permission from [142]. © 2009 IEEE
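The two-step normalization can be sketched with scikit-image: a global similarity transform maps the face to an average shape, and a piecewise-affine warp (which triangulates the control points internally) resamples the texture so that untracked regions such as the nasolabial furrow are brought into correspondence. This follows the spirit of [108, 142] but is not their implementation; the mean shape and output resolution are assumed inputs.

```python
from skimage import transform

def register_face(image, landmarks, mean_shape, output_shape=(128, 128)):
    """Warp a face image to a canonical frame.

    image: (H, W) or (H, W, 3) array.
    landmarks: (p, 2) tracked (x, y) points in the image.
    mean_shape: (p, 2) landmark positions of the average face, scaled to lie
                inside output_shape.
    """
    # Step 1: global similarity transform (rotation, scale, translation).
    # The transform maps canonical coordinates to image coordinates, which is
    # exactly what skimage's warp expects as its inverse map.
    sim = transform.SimilarityTransform()
    sim.estimate(mean_shape, landmarks)
    similarity_registered = transform.warp(image, sim, output_shape=output_shape)

    # Step 2: piecewise-affine (backward) warp; skimage triangulates the control
    # points (Delaunay) and applies one affine map per triangle.  Only the region
    # inside the landmark mesh is valid; pixels outside it are filled with zeros.
    pw = transform.PiecewiseAffineTransform()
    pw.estimate(mean_shape, landmarks)
    shape_normalized = transform.warp(image, pw, output_shape=output_shape)
    return similarity_registered, shape_normalized
```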

Geometric features:

After the registration step, the shape and appearance features can be extracted from the normalized image. Geometric features contain information about shape and the locations of permanent facial features (e.g., eyes, brows, nose). Approaches that use only geometric features (or their derivatives) mostly rely on detecting sets of fiducial facial points [94, 96, 123], a connected face mesh or active shape model [20, 22, 61], or face component shape parametrization [113]. Some prototypical features include [108]: \(\mathbf{x}^{U}_{1}\) the distance between inner brow and eye, \(\mathbf{x}^{U}_{2}\) the distance between outer brow and eye, \(\mathbf{x}^{U}_{3}\) the height of eye, \(\mathbf{x}^{L}_{1}\) the height of lip, \(\mathbf{x}^{L}_{2}\) the height of teeth, and \(\mathbf{x}^{L}_{3}\) the angle of mouth corners, see Fig. 19.9b. However, shape features alone are unlikely to capture differences between subtle facial expressions or ones that are closely related. Many action units that are easily confusable by shape (e.g., AU 6 and AU 7 in FACS) can be discriminated by differences in appearance (e.g., furrows lateral to the eyes and cheek raising in AU 6 but not AU 7). Other AUs such as AU 11 (nasolabial furrow deepener), 14 (mouth corner dimpler), and 28 (inward sucking of the lips) cannot be detected from the movement of a sparse set of points alone but may be detected from changes in skin texture.

Fig. 19.9 a AAM fitting across different subjects. b Eight different features extracted from the distance between tracked points, height of facial parts, angles of mouth corners, and appearance patches. Figure reproduced with permission from [144]. © 2010 IEEE
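A minimal sketch of how geometric features of the kind listed above could be computed from registered landmarks; the landmark-index names are hypothetical placeholders that would have to be mapped onto the actual 66-point mesh.

```python
import numpy as np

def geometric_features(pts, idx):
    """pts: (p, 2) registered landmarks; idx: dict of placeholder landmark indices.

    Returns a few of the prototypical measurements mentioned in the text.
    The index names are hypothetical and must be mapped to the actual mesh.
    """
    def dist(a, b):
        return np.linalg.norm(pts[a] - pts[b])

    feats = {
        "inner_brow_to_eye": dist(idx["inner_brow"], idx["inner_eye_corner"]),
        "outer_brow_to_eye": dist(idx["outer_brow"], idx["outer_eye_corner"]),
        "eye_height": dist(idx["upper_eyelid"], idx["lower_eyelid"]),
        "lip_height": dist(idx["upper_lip"], idx["lower_lip"]),
    }
    # Mouth-corner angle relative to the horizontal line through the lip center.
    corner = pts[idx["mouth_corner"]] - pts[idx["lip_center"]]
    feats["mouth_corner_angle"] = np.degrees(np.arctan2(corner[1], corner[0]))
    return feats
```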

Appearance features:

These features represent changes in the appearance (skin texture) of the face, such as wrinkles and furrows. Appearance features for AU detection [3, 7, 50, 68, 111] have outperformed shape-only features for some action units, especially when registration is noisy; see Lucey et al. [4, 77, 81] for a comparison.

Several approaches to appearance have been explored. Gabor wavelet coefficients are a popular approach. In several studies, Gabor wavelet coefficients outperformed optical flow, shape features, and Independent Component Analysis representations [3]. Tian [111, 113], however, reported that the combination of shape and appearance achieved better results than either shape or appearance alone. Recently, Zhu et al. [142] have explored the use of SIFT [73] and DAISY [114] descriptors as appearance features. Given feature points tracked with AAMs, SIFT descriptors are first computed around the points of interest. SIFT descriptors are computed from the gradient vector for each pixel in the neighborhood to build a normalized histogram of gradient directions. For each pixel within a subregion, SIFT descriptors add the pixel’s gradient vector to a histogram of gradient directions by quantizing each orientation to one of 8 directions and weighting the contribution of each vector by its magnitude. Similar in spirit to SIFT descriptors, DAISY descriptors are an efficient feature descriptor based on histograms. They are often used to match stereo images [114]. DAISY descriptors use circular grids instead of SIFT descriptors’ regular grids; the former have been found to have better localization properties [89] and to outperform many state-of-the-art feature descriptors for sparse point matching [115]. At each pixel, DAISY builds a vector made of values from the convolved orientation maps located on concentric circles centered on the location. The amount of Gaussian smoothing is proportional to the radius of the circles. Donato [37] combined Gabor wavelet decomposition and independent component analysis. These representations use graylevel texture filters that share properties of spatial locality, independence, and have relationships to the response properties of visual cortical neurons. Zheng [138] investigated the use of two types of features extracted from face images for recognizing facial expressions. The first type is the geometric positions of a set of fiducial points on a face. The second type is a set of multi-scale and multi-orientation Gabor wavelet coefficients extracted from the face image at the fiducial points.
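As an illustration of landmark-centered appearance descriptors, the sketch below computes SIFT descriptors around tracked points with OpenCV (requires opencv-python with SIFT support). It mirrors the general strategy described for [142], with the 48-pixel support taken from the text, but it is not the authors' code.

```python
import cv2

def sift_at_landmarks(gray, landmarks, patch_size=48):
    """Compute SIFT descriptors centered on tracked facial landmarks.

    gray: uint8 grayscale face image.
    landmarks: (p, 2) array of (x, y) points from the tracker.
    patch_size: support region in pixels (48 is the value quoted in the text).
    """
    sift = cv2.SIFT_create()
    keypoints = [cv2.KeyPoint(float(x), float(y), float(patch_size))
                 for x, y in landmarks]
    keypoints, descriptors = sift.compute(gray, keypoints)
    # descriptors: (p, 128) array; concatenate into a single appearance vector.
    return descriptors.reshape(-1)
```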

Other features:

Other popular techniques for feature extraction include more dynamic features such as optical flow [3], dynamic textures [21] and Motion History Images (MHIs) [16]. In an early exploration of facial expression recognition, Mase [86] used optical flow to estimate the activity in a subset of the facial muscles. Essa [43] extended this approach by using optical flow to estimate activity in a detailed anatomical and physical model of the face; motion estimates from optical flow were refined by the physical model in a recursive estimation and control framework, and the estimated forces were used to classify facial expressions. Yacoob and Davis [131] bypassed the physical model and constructed a mid-level representation of facial motion, such as a right mouth corner raise, directly from the optical flow. Ira et al. [22] implicitly recovered motion representations by building features such that each feature motion corresponded to a simple deformation of the face. MHIs were first proposed by Davis and Bobick [16]; they compress the motion over a number of frames into a single image by layering the thresholded differences between consecutive frames one over the other. Valstar et al. [121] encoded face motion into Motion History Images. Zhao et al. [139] used volume Local Binary Patterns (LBP), a temporal extension of the local binary patterns often used in 2D texture analysis; the face is divided into overlapping blocks and the LBP features extracted in each block are concatenated into a single feature vector.
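Because the MHI construction is simple, a minimal sketch is given below: thresholded frame differences are layered so that more recent motion appears brighter. The threshold and decay values are arbitrary illustrative choices.

```python
import numpy as np

def motion_history_image(frames, diff_threshold=25, duration=1.0, frame_dt=1 / 30):
    """Compute an MHI from a list of grayscale frames (numpy arrays).

    Pixels where consecutive frames differ by more than diff_threshold are set to
    the current timestamp; older motion decays and is removed after `duration`
    seconds, so brighter values correspond to more recent movement.
    """
    mhi = np.zeros_like(np.asarray(frames[0], dtype=np.float32))
    t = 0.0
    for prev, curr in zip(frames[:-1], frames[1:]):
        t += frame_dt
        motion = np.abs(curr.astype(np.float32) - prev.astype(np.float32)) > diff_threshold
        mhi[motion] = t
        mhi[~motion & (mhi < t - duration)] = 0.0     # forget old motion
    return mhi / max(t, 1e-6)                          # normalize to [0, 1]
```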

5 Supervised Learning

Supervised and more recently unsupervised approaches to action unit and expression detection have been pursued. In supervised learning event categories are defined in advance in labeled training data. In unsupervised learning no labeled training data are available and event categories must be discovered. In this section we discuss the supervised approach.

Early work in supervised learning sought to detect the six universal expressions of joy, surprise, anger, fear, disgust, and sadness; see Fig. 19.1. More recent work has attempted to detect expressions of pain [4, 69, 79], drowsiness, adult attachment [135], and indices of psychiatric disorder [27, 60]. Action unit detection remains a compelling challenge especially in unposed facial behavior. An open question is whether emotion and similar judgment-based categories are best detected by first detecting AU or by direct detection in which an AU detection step is bypassed. Work on this topic is just beginning [70, 79] and the question remains open.

Whether the focus is expression or AU, two main approaches have been pursued for supervised learning. These are (1) static modeling—typically posed as a discriminative classification problem in which each video frame is evaluated independently; and (2) temporal modeling—in which frames are segmented into sequences and typically modeled with a variant of DBNs (e.g. Hidden Markov Models, Conditional Random Fields).

5.1 Classifiers

In the case of static models, different feature representations and classifiers for frame-by-frame facial expression detection have been extensively studied. The pioneering work of Black and Yacoob [13] recognized facial expressions by fitting local parametric motion models to regions of the face and then feeding the resulting parameters to a nearest-neighbor classifier. Tian et al. [111] made use of ANN classifiers for facial expression recognition. Bartlett et al. [7, 8, 68] used Gabor filters in conjunction with AdaBoost feature selection followed by a Support Vector Machine (SVM) classifier. Lee and Elgammal [65] used multi-linear models to construct a non-linear manifold that factorizes identity from expression. Lucey et al. [76, 81] evaluated different shape and appearance representations derived from an AAM facial feature tracker, with an SVM for classification. Similarly, [139] made use of SVMs.

More recent work has focused on incorporating the dynamics of facial expressions to improve performance (i.e., temporal modeling). De la Torre et al. [35] used condensation and appearance models to simultaneously track and recognize facial expression. Chang et al. [20] used a low-dimensional Lipschitz embedding to build a manifold of shape variation across several people and then used I-condensation to simultaneously track and recognize expressions. A popular strategy is to use HMMs to temporally segment expressions by establishing a correspondence between the action's onset, peak, and offset and an underlying latent state. Valstar and Pantic [123] used a combination of SVM and HMM to temporally segment and recognize AUs, and in [94, 122, 125] proposed a system that enables fully automated, robust facial expression recognition and temporal segmentation of onset, peak and offset from video of mostly frontal faces. Koelstra and Pantic [59] used GentleBoost classifiers on motion from a non-rigid registration combined with an HMM. Similar approaches include a nonparametric discriminant HMM from Shang and Chan [106], and partially observed Hidden Conditional Random Fields by Chang et al. [19]. For other comprehensive surveys see [44, 95, 113, 136]. Tong et al. [117] used DBNs with appearance features to detect facial action units in posed facial behavior; the correlation among action units served as a prior for action unit detection. Ira et al. [22] used BN classifiers for classifying the six universal expressions from video; in particular, they used Naive Bayes classifiers in which the distribution was changed from Gaussian to Cauchy, and Gaussian Tree-Augmented Naive Bayes (TAN) classifiers to learn the dependencies among different facial motion features.
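A minimal frame-by-frame detector in the spirit of the static approaches above: per-frame shape/appearance features, one linear SVM per action unit, and per-frame scores. scikit-learn is assumed here purely for illustration; the cited systems use their own pipelines and features.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

def train_frame_level_au_detector(features, labels):
    """features: (n_frames, d) per-frame feature vectors (e.g. AAM shape + SIFT).
    labels: (n_frames,) binary labels for one action unit (1 = AU present).
    """
    clf = make_pipeline(StandardScaler(),
                        LinearSVC(C=1.0, class_weight="balanced"))  # AUs are rare
    clf.fit(features, labels)
    return clf

# Usage sketch with synthetic data standing in for real features.
rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 64))
y = (X[:, 0] + 0.5 * rng.normal(size=2000) > 1.5).astype(int)   # rare positive class
detector = train_frame_level_au_detector(X, y)
scores = detector.decision_function(X[:5])                       # per-frame AU scores
```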

5.2 Selection of Positive and Negative Samples During Training

Previous research in expression and AU detection has emphasized types of registration methods, features and classifiers (e.g., [67, 97, 113, 134, 140]). Little attention has been paid to making efficient use of the training data when assigning video frames to positive and negative classes. Typically, assignment has been done in one of two ways. One is to assign to the positive class those frames that occur at the peak of each AU or proximal to it. Peaks refer to the maximum intensity of an action unit between the frames at which it begins ("onset") and ends ("offset"). The negative class is then chosen by randomly sampling other AUs, including AU 0 (neutral). This approach suffers at least three drawbacks: (1) the number of training examples will often be small; (2) there is a large imbalance between positive and negative frames; and (3) peak frames may provide too little variability to achieve good generalization. These problems may be circumvented by an alternative approach: including all frames from onset to offset in the positive class. This improves the ratio of positive to negative frames and increases the representativeness of the positive examples. The downside is confusability of the positive and negative classes: onset and offset frames, and many of those proximal to or even farther from them, may be indistinguishable from the negative class, so the number of false positives may increase dramatically. Moreover, how can all the negative samples be used efficiently? Is there a better approach to selecting positive and negative training samples?
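The two assignment strategies just discussed can be written down directly; the (onset, peak, offset) event format follows the FACS annotation of Sect. 19.2, and everything else is an illustrative choice.

```python
def select_training_frames(events, n_frames, strategy="peak", peak_margin=1):
    """Return (positive, negative) frame index sets for one AU.

    events: list of (onset, peak, offset) frame triplets for the target AU.
    strategy: "peak" uses only the peak frames (plus a small margin);
              "onset_offset" uses every frame between onset and offset.
    Frames not assigned to the positive class form the negative class.
    """
    positive = set()
    for onset, peak, offset in events:
        if strategy == "peak":
            positive.update(range(max(onset, peak - peak_margin),
                                  min(offset, peak + peak_margin) + 1))
        else:  # "onset_offset"
            positive.update(range(onset, offset + 1))
    negative = set(range(n_frames)) - positive
    return positive, negative

pos, neg = select_training_frames([(10, 14, 20), (50, 60, 72)], n_frames=200,
                                  strategy="onset_offset")
```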

In this section, we consider two approaches that have shown promise; one static and one dynamic. We illustrate the methods with particular classifiers and features, but the methods are not specific to the specific features or classifiers. As before, we distinguish between static and dynamic approaches. In the former, video frames are assumed to be independent. In the latter, first-order dependencies are assumed.

5.2.1 Static Approach

Recently, Zhu et al. [142] proposed an extension of cascade AdaBoost called Dynamic Cascade Bidirectional Bootstrapping (DCBB) to iteratively select positive samples and improve AU detection performance. In the first iteration, DCBB selects only the peaks and the two neighboring frames as positive samples, and randomly samples other AUs and non-AU frames as negative samples. As in standard AdaBoost cascades [127], DCBB defines the target false positive ratio, the maximum acceptable false positive ratio per cascade stage, and the minimum acceptable true positive ratio per cascade stage. DCBB uses Classification and Regression Trees (CART) [17] as weak classifiers. Once a cascade of peak-frame detectors has been learned in the first iteration, DCBB enlarges the positive set to increase the discriminative performance of the whole classifier. The new positive samples are selected by running the current classifier (learned in the previous iteration) on the original training data and adding to the positive training set the frames that are classified as positive. Recall that only the peak frames were used for training in the first iteration. For more details see [142].

Figure 19.10 shows the improvement in the Receiver Operating Characteristic (ROC) curve on test data (subjects not in the training set) using DCBB for three AUs (AU 12, AU 14, AU 17). The ROC curve is obtained by plotting the true positive rate against the false positive rate for different decision thresholds of the classifier. Each subfigure contains five or six ROC curves corresponding to alternative strategies for selecting the positive training samples: using only the peak frames in the first step (equivalent to standard cascade AdaBoost), running three or four iterations of DCBB (spread x), and using all frames between onset and offset (All+Boost). In the legend, the first number (between vertical bars) denotes the area under the ROC curve; the second gives the number of positive samples in the test set followed, after a slash, by the number of negative samples; and the third gives the number of positive samples in the training working set followed, after a slash, by the total number of frames of the target AU in the training set. The area under the ROC curve for frame-by-frame detection improves gradually at each learning stage, and performance improves faster for some AUs than for others. The rate of improvement appears to be influenced by the base rate of the AU: for AU 14 and AU 17, fewer potential training samples are available than for AU 12.

Fig. 19.10 ROC curves for AU detection using DCBB. See the text for an explanation of Init+Boost, spread x and All+Boost. Figure reproduced with permission from [142]. © 2009 IEEE

The top of Fig. 19.11 shows the manual labeling of AU 12 for subject S015. There are eight instances of AU 12 with intensities ranging from A (weak) to E (strong); the strong AUs are represented by rectangles of height 4 and the weak ones by rectangles of height 1. The remaining eight subfigures illustrate the sample selection process for each instance of AU 12; the number at the top right of each subfigure identifies the corresponding AU instance. The black curve at the bottom of each subfigure represents the similarity between the peak and the neighboring frames; the peak is the maximum of the curve. The positive samples selected in the first step are represented by green asterisks, in the second iteration by red crosses, in the third iteration by blue crosses, and in the final iteration by black circles. Observe that for high-intensity peaks, as in subfigures 3 and 8, the final selected positive samples include regions with low similarity values. When AU intensity is low, as in subfigure 7, positive samples are selected only if they have high similarity with the peak, because otherwise the selected samples would lead to many false positives. The ellipses and rectangles in the figures contain frames that are selected as positive samples and correspond to strong and subtle AUs; the triangles mark frames between the onset and offset that are not selected as positive samples and represent ambiguous AUs.

Fig. 19.11 The spreading of positive samples during each dynamic training step for AU 12. See the text for an explanation of the graphics. Figure reproduced with permission from [142]. © 2009 IEEE

Table 19.2 shows the area under the ROC curve for 14 AUs using DCBB and different sets of features. The appearance features are based on SIFT descriptors; for all AUs the SIFT descriptor is built over a 48×48-pixel square at twenty feature points for the lower-face AUs or sixteen feature points for the upper face. The shape features are the landmarks of the AAM. For more details see [142]. Note that the results in this section were obtained using a particular set of features and classifiers, but the strategy of positive sample selection can in principle be used with any combination of classifiers and features.

Table 19.2 Area under the ROC for six different appearance and sampling strategies. AU peak frames with shape features and SVM (Peak+Shp+SVM), All frames between onset and offset with shape features and SVM (All+Shp+SVM), AU peak frames with appearance features and SVM (Peak+App+SVM), Sampling 1 frame in every 4 frames between onset and offset with PCA reduced appearance features and SVM (All+PCA+App+SVM), AU peak frames with appearance features and Cascade AdaBoost (Peak+App+Cascade Boost), DCBB with appearance features (DCBB)

5.2.2 Dynamic Approach

Extensions of DBNs have been a popular approach for expression analysis [22, 106, 117, 123]. A major challenge for DBNs based on generative models such as HMMs is how to effectively model the null class (none of the labeled classes) and how to train effectively on all possible segments of the video (rather than on independent features). In this section, we review recent work on a temporal extension of the bag-of-words (BoW) model, called kSeg-SVM [108], that overcomes these drawbacks. kSeg-SVM is inspired by the success of the spatial BoW sliding-window model [15] that has been used in difficult object detection problems. We pose the AU detection problem as one of detecting temporal events (segments) in time series of visual features. Events correspond to AUs, including all frames between onset and offset (see Fig. 19.12). kSeg-SVM represents each segment as a BoW; however, the standard histogram of entries is augmented with a soft-clustering assignment of words to account for smoothly varying signals. Given several videos with labeled AU events, kSeg-SVM learns the SVM parameters that maximize the response on positive segments (the AU to be detected) and minimize the response on all remaining segments (all other positions and lengths). Figure 19.12 illustrates the main idea of kSeg-SVM.

Fig. 19.12 During testing, the AU events are found by efficiently searching over the segments (position and length) that maximize the SVM score. During training, the algorithm searches over all possible negative segments to identify those hardest to classify, which improves classification of subtle AUs. Figure reproduced with permission from [108]. © 2009 IEEE

kSeg-SVM can be efficiently trained on all available video using the structured output SVM (SO-SVM) framework [119]. Recent research [90] in the related area of sequence labeling has shown that SO-SVMs can outperform other algorithms, including HMMs, CRFs [63] and Max-Margin Markov Networks [109]. SO-SVMs have several benefits in the context of AU detection: (1) they model the dependencies between visual features and the duration of AUs; (2) they can be trained effectively on all possible segments of the video (rather than on independent sequences); (3) they explicitly select the negative examples that are most similar to the AU to be detected; and (4) they make no assumptions about the underlying structure of the AU events (e.g., i.i.d.). Finally, a novel parameterization of the output space handles multiple AU event occurrences, such as occur in long time series, by searching simultaneously for the k-or-fewer best-matching segments in the time series.

Given frame-level features, we denote each processed video sequence i as \(\mathbf{x}_{i} \in\mathbb{R}^{d\times m_{i}}\), where d is the number of features and m i is the number of frames in the sequence. To simplify, we assume that each sequence contains at most one occurrence of the AU event to be detected; for extensions to k-or-fewer occurrences see [108]. The AU event is described by its corresponding onset-to-offset frame range and is denoted by \(\mathbf{y}_i \in \mathbb{Z}^2\). Let the full training set of video sequences be \(\{\mathbf{x}_i\}_{i=1}^n\), and their associated ground truth annotations for the occurrence of AUs be \(\{\mathbf{y}_i\}_{i=1}^n\). We wish to learn a mapping \(f\) for automatically detecting the AU events in unseen signals. This complex output space \(\mathcal{Y}\) contains all contiguous time intervals; each label \(\mathbf{y}_i\) consists of two numbers indicating the onset and the offset of an AU:

$$\mathcal{Y} = \bigl\{\, [y^{\mathrm{on}}, y^{\mathrm{off}}] : 1 \le y^{\mathrm{on}} \le y^{\mathrm{off}} \le m_i \,\bigr\} \cup \{\emptyset\}. $$
(19.4)

The empty label \(\mathbf{y}=\emptyset\) indicates no occurrence of the AU. We learn the mapping f within the structured learning framework [15, 120] as

$$f(\mathbf{x}) = \mathop{\operatorname{arg\,max}}_{\mathbf{y}\in\mathcal{Y}} \ g(\mathbf{x},\mathbf{y}), $$
(19.5)

where g(x,y) assigns a score to any particular labeling y; the higher this value is, the closer y is to the ground truth annotation. For structured output learning, the choice of g(x,y) is often taken to be a weighted sum of features in the feature space:

$$g(\mathbf{x},\mathbf{y}) = \mathbf{w}^{T}\boldsymbol{\phi}(\mathbf{x},\mathbf{y}), $$
(19.6)

where ϕ(x,y) is a joint feature mapping for temporal signal x and candidate label y, and w is the weight vector. Learning f can therefore be posed as an optimization problem:

$$\begin{aligned}&\min_{\mathbf{w},\,\xi_i\ge 0}\ \frac{1}{2}\|\mathbf{w}\|^2 + C\sum_{i=1}^{n}\xi_i\\&\text{s.t.}\quad g(\mathbf{x}_i,\mathbf{y}_i) \ge g(\mathbf{x}_i,\mathbf{y}) + \Delta(\mathbf{y}_i,\mathbf{y}) - \xi_i \quad \forall i,\ \forall\mathbf{y}\in\mathcal{Y}.\end{aligned}$$
(19.7)

Here, Δ(y i ,y) is a loss function that decreases as a label y approaches the ground truth label y i . Intuitively, the constraints in (19.7) force the score of g(x,y) to be higher for the ground truth label y i than for any other value of y, and moreover, to exceed this value by a margin equal to the loss associated with labeling y.
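To make the segment-level formulation concrete, the sketch below implements a brute-force version of the inference in (19.5): a soft-assignment bag-of-words histogram plays the role of φ(x, y) for each candidate segment and is scored with a learned weight vector w. This is a simplified stand-in for the efficient search and SO-SVM training described in [108]; the dictionary, weights and segment-length bounds are assumed to be given.

```python
import numpy as np

def soft_bow(segment, centers, sigma=1.0):
    """Soft-assignment bag-of-words histogram phi(x, y) for one candidate segment.

    segment: (n_frames, d) frame features inside the segment.
    centers: (k, d) dictionary words (e.g. from k-means on training frames).
    """
    d2 = ((segment[:, None, :] - centers[None, :, :]) ** 2).sum(-1)   # (n_frames, k)
    soft = np.exp(-d2 / (2 * sigma ** 2))
    soft /= soft.sum(axis=1, keepdims=True)
    return soft.sum(axis=0) / len(segment)           # normalized histogram

def detect_segment(x, w, centers, min_len=5, max_len=60):
    """Return the (onset, offset) maximizing g(x, y) = w . phi(x, y), or None.

    Brute-force search over all segment positions and lengths; the empty label
    wins when no segment scores above zero (cf. y = empty-set in the text).
    """
    best, best_score = None, 0.0
    m = len(x)
    for onset in range(m):
        for offset in range(onset + min_len - 1, min(onset + max_len, m)):
            score = w @ soft_bow(x[onset:offset + 1], centers)
            if score > best_score:
                best, best_score = (onset, offset), score
    return best
```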

Table 19.3 shows the experimental results on the RU-FACS-1 dataset. As can be seen, kSeg-SVM consistently outperforms frame-based classification. It has the highest ROC area for seven out of 10 AUs. Using the ROC metric, kSeg-SVM appears comparable to standard SVM. kSeg-SVM achieves highest F1 score on nine out of 10 test cases. As shown in Table 19.3, BoW-kSeg performs poorly. There are two possible reasons for this. First, clustering is done with K-means, an unsupervised, non-discriminative method that is not informed by the ground truth labels. Second, due to the hard dictionary assignment, each frame is forced to commit to a single cluster. While hard-clustering shows good performance in the task of object-detection, our time-series vary smoothly, resulting in large groups of consecutive frames being assigned to the same cluster.

Table 19.3 Performance on the RU-FACS-1 dataset, ROC metric and F1 metric. Higher numbers indicate better performance, and best results are printed in bold

At this point, it is worth noting that, until now, the most common measure of classifier performance for AU detection has been the area under the ROC curve. In object detection, the common measure instead characterizes the relation between recall and precision. The two approaches give very different views of classifier performance. This difference has been discussed in the object detection literature, but little attention has been paid to it in the facial expression literature.

In pattern recognition and machine learning, a common evaluation strategy is to report the correct classification rate (classification accuracy) or its complement, the error rate. However, this assumes that the natural distribution (prior probabilities) of the classes is known and balanced. In an imbalanced setting, where the prior probability of the positive class is significantly smaller than that of the negative class (the ratio of these being defined as the skew), accuracy is inadequate as a performance measure because it becomes biased toward the majority class: as the skew increases, accuracy tends toward majority-class performance, effectively ignoring recognition of the minority class. This is a very common (if not the default) situation in facial expression recognition, where the prior probability of each target class (a certain facial expression) is significantly smaller than that of the negative class (all other facial expressions). When evaluating the performance of an automatic facial expression recognizer, other performance measures are therefore more appropriate: recall (the probability of correctly detecting a positive test sample, which is independent of the class priors), precision (the fraction of detected positives that are actually correct; because it combines results from both positive and negative samples, it depends on the class priors), the F1 measure (computed as 2·recall·precision/(recall + precision)), and the ROC curve (which plots the true positive rate against the false positive rate as the decision threshold is swept from the most positive to the most negative classification).
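The practical difference between these measures is easy to demonstrate; the sketch below contrasts accuracy, precision, recall, F1 and ROC area on a synthetic, heavily skewed detection problem using scikit-learn (the 5% base rate and the score distributions are arbitrary illustrative choices).

```python
import numpy as np
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score)

rng = np.random.default_rng(1)
y_true = (rng.random(10000) < 0.05).astype(int)        # 5% base rate, as for a rare AU
scores = y_true * rng.normal(1.0, 1.0, 10000) + (1 - y_true) * rng.normal(0.0, 1.0, 10000)
y_pred = (scores > 1.5).astype(int)                    # threshold the raw detector score

print("accuracy :", accuracy_score(y_true, y_pred))    # high, dominated by the negatives
print("precision:", precision_score(y_true, y_pred))   # much lower: false alarms dominate
print("recall   :", recall_score(y_true, y_pred))
print("F1       :", f1_score(y_true, y_pred))
print("ROC AUC  :", roc_auc_score(y_true, scores))     # uses the raw scores, not labels
```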

6 Unsupervised Learning

With few exceptions, previous work on facial expression or action unit recognition has been supervised in nature; little attention has been paid to the problem of unsupervised temporal segmentation or clustering of facial events prior to recognition. Essa and Pentland [43] proposed an unsupervised probabilistic flow-based method to describe facial expressions. Hoey [53] presented a multilevel BN to learn the dynamics of facial expression in a weakly supervised manner. Bettinger et al. [11] used AAMs to learn the dynamics of person-specific facial expression models. Zelnik-Manor and Irani [133] proposed a modification of structure-from-motion factorization to temporally segment rigid and non-rigid facial motion. De la Torre et al. [32] proposed a geometric-invariant clustering algorithm to decompose a stream of one person's facial behavior into facial gestures; their approach suggested that unusual facial expressions might be detected through temporal outlier patterns. In recent work, Zhou et al. [143] proposed Aligned Cluster Analysis (ACA), an extension of spectral clustering for time series clustering and embedding. ACA was applied to discover, in an unsupervised manner, facial actions across individuals, achieving moderate agreement with FACS. In this section, we briefly illustrate the application of ACA to facial expression analysis and refer the reader to [141, 143] for further details.

6.1 Facial Event Discovery for One Subject

Figure 19.13 shows the results of running unsupervised ACA on a video sequence of 1000 frames to summarize the facial expression of an infant into 10 temporal clusters. Appearance and shape features in the eyes and mouth, as described in Sect. 19.4.2, are used for temporal clustering. These 10 clusters provide a summarization of the infant’s facial events. This visual summarization can be useful to automatically count the amount of time that the baby spends doing a particular facial expression (i.e. temporal cluster), such as crying, smiling or sleeping.

Fig. 19.13 Temporal clustering of infant facial behavior. Each facial gesture (temporal cluster) is coded with a different color; observe how the frames of the same cluster correspond to similar facial expressions. Figure reproduced with permission from [108]. © 2010 IEEE

Extensions of ACA [143] can be used for facial expression indexing given a query sequence labeled by a user. The left of Fig. 19.14a shows a frame of a sequence labeled by the user; to the right are six frames corresponding to the six sequences returned by Supervised ACA (SACA). Next to the frames is the matching score, which becomes higher the closer the retrieved sequence is to the user-specified facial expression sequence.

Fig. 19.14 a Facial expression indexing. The user specifies a query sequence and Supervised ACA returns six sequences with similar facial behavioral content as the video sequence selected by the user. b Three-dimensional embedding of 30 subjects with different facial expressions from the Cohn–Kanade database

ACA inherits the benefits of spectral clustering algorithms in that it provides a mechanism for finding a semantically meaningful low-dimensional embedding. In an evaluation, we tested the ability of unsupervised ACA to temporally cluster images and provide a visualization tool for several emotion-labeled sequences. Figure 19.14b shows the ACA embedding of 112 sequences from 30 randomly selected subjects of the Cohn–Kanade database [58]. The frames are labeled with five emotions: surprise, sadness, fear, joy and anger. The number of facial expressions varies across subjects. Note that, unlike traditional dimensionality reduction methods, each three-dimensional point in the embedding represents a video segment (of possibly different length) containing a facial expression. The ACA embedding provides a natural mechanism for visualizing facial events and detecting outliers.

6.2 Facial Event Discovery for Sets of Subjects

This section illustrates the ability of ACA to discover dynamic facial events in the more challenging RU-FACS database [7], which contains naturally occurring facial behavior of multiple people. For this database the labels are AUs. We randomly selected 10 sets of 5 people and report the mean and variance of the clustering results. Clustering accuracy is measured as the overlap between the temporal segmentation provided by ACA and the manually labeled FACS. ACA achieved an average accuracy of 52.2% in clustering the lower face and 68.8% for the upper face using AU labels. Figure 19.15a shows the temporal segmentation achieved by ACA for subjects S012, S028 and S049; each color denotes a temporal cluster discovered by ACA. Figure 19.15 shows some of the dynamic vocabularies for facial expression analysis discovered by ACA. The algorithm correctly discovered smiling with and without speech as different facial events. Visual inspection of all subjects' data suggests that the vocabulary of facial events is moderately consistent with human evaluation. More details are given in [143].

Fig. 19.15 a Results obtained by ACA for subjects S012, S028 and S049. b Corresponding video frames. Figure reproduced with permission from [108]. © 2010 IEEE

7 Conclusion and Future Challenges

Although many recent advances and successes in automatic facial expression analysis have been achieved, as described in the previous sections, many questions remain open. Several challenges remain, such as: (1) how to detect subtle AUs: more robust 3D models that effectively decouple rigid and non-rigid motion, and better models that normalize for subject variability, need to be researched; (2) more robust real-time systems are needed for face acquisition, facial data extraction and representation, and facial expression recognition, able to handle head motion (both in-plane and out-of-plane), occlusion, lighting changes, and low-intensity expressions, all of which are common in spontaneous facial behavior in naturalistic environments; new 3D sensors such as structured-light or time-of-flight cameras are a promising direction for real-time segmentation; and (3) most work on facial expression analysis has addressed recognition (where the temporal segmentation is provided), and more specialized machine learning algorithms are needed for the problem of detection in naturally occurring behavior.

Because most investigators have used relatively limited datasets (typically of unknown reliability), the generalizability of different approaches to facial expression analysis remains unknown. With few exceptions, investigators have failed to report inter-observer reliability and the validity of the facial expressions they have analyzed. Approaches to facial expression analysis that have been developed in this way may transfer poorly to applications in which expressions, subjects, contexts, or image properties are more variable. In the absence of comparative tests on common data, the relative strengths and weaknesses of different approaches are difficult to determine. In particular, there is a need for fully FACS-coded databases of naturally occurring behavior. Because intensity and duration measurements are critical, it is important to include descriptive data on these features as well.

Facial expression is one of several modes of nonverbal communication. The message value of various modes may differ depending on context and may be congruent or discrepant with each other. An interesting research topic is the integration of facial expression analysis with gesture, prosody, and speech. Combining facial features with acoustic features would help to separate the effects of facial actions due to facial expression and those due to speech-related movements.

At present, taxonomies of facial expression are based on FACS or other observer-based schemes. Consequently, approaches to automatic facial expression recognition depend on access to corpuses of FACS or similarly labeled video. This is a significant concern, in that recent work suggests that extremely large corpuses of labeled data may be needed to train robust classifiers. An open question in facial analysis is whether facial actions can be learned directly from video in an unsupervised manner. That is, can the taxonomy be learned directly from video? And unlike FACS and similar systems that were initially developed to label static expressions, can we learn dynamic trajectories of facial actions? In our preliminary findings [143] on unsupervised learning using the RU-FACS database, agreement between facial actions identified by unsupervised analysis of face dynamics and those identified by FACS approached the level of agreement that has been found between independent FACS coders. These findings suggest that unsupervised learning of facial expression is a promising alternative to supervised learning of FACS-based actions. At least three benefits follow. First, automatic facial expression analysis may be freed from its dependence on observer-based labeling. Second, because the approach is fully empirical, it can potentially identify regularities in video that have not been anticipated by top-down approaches such as FACS; new discoveries become possible. Third, similar benefits may accrue in other areas of image understanding of human behavior. Recent efforts by Guerra-Filho and Aloimonos [49] to develop vocabularies and grammars of human actions depend on advances in unsupervised learning. However, more robust and efficient algorithms that can learn from large databases are needed, as well as algorithms that can cluster more subtle facial behavior.

While research challenges in automated facial image analysis remain, the time is near to apply these emerging tools to real-world problems in clinical science and practice, marketing, surveillance and human-computer interaction.