
1 Introduction

Manifold learning methods have been widely applied to human emotion recognition, based on the observation that variations of expression can be represented as a low-dimensional manifold embedded in a high-dimensional data space. The original LPP [1], which operates in an unsupervised manner, fails to embed the facial set in a low-dimensional space in which different expression classes are well clustered. Hence, supervised methods based on LPP have been proposed for human emotion recognition [2]. In addition, Ptucha et al. [3] investigated the performance of combining automatic AAM landmark placement with LPP for human emotion recognition and demonstrated its effectiveness on expression classification accuracy.

Note that the aforementioned methods assume that a single common manifold is learned from the training set. However, it is difficult to guarantee that one manifold can represent the structure of high-dimensional data well. To address this problem, Xiao et al. [4] proposed a human emotion recognition method that utilizes multiple manifolds. They argued that different expressions may reside on different manifolds, and obtained promising recognition performance. Lu et al. [5] presented a discriminative multi-manifold analysis method to solve the single-sample-per-person problem in face recognition, splitting each face image into several local patches to form a training set and sequentially learning discriminative information for each subject.

It is known that, under uncontrolled conditions, a number of specific facial areas play a more important role than others in the formation of facial expressions and are more robust to variations in environmental lighting. In light of this observation, several methods have been put forward to represent local features. Chang et al. [7] constructed a training manifold from each local patch and performed expression analysis based on a local discriminant embedding method. Kotsia et al. [8] argued that local patches of facial images provide more discriminant information for recognizing emotional states.

Inspired by the aforementioned works, we propose a novel framework for feature extraction and classification in human emotion recognition from sets of local patches, namely multiple manifolds discriminant analysis (MMDA). MMDA first locates landmark points of interest on each facial image using ASM [9], and then focuses on five local patches, namely the regions of the left and right eyes, the mouth, and the left and right cheeks, to form a sample set for each expression. MMDA learns a projection matrix for each expression that maximizes the manifold margins among different expressions while minimizing the manifold distances within the same expression. As in [4, 5], a reconstruction error criterion is employed to compute the manifold-to-manifold distance.

2 The Proposed Method

Assume that a dataset given in \(R^m\) contains \(n\) samples from \(c\) classes, \(x_i^k\), \(k=1,2,\cdots ,c\), \(i=1,2,\cdots ,n_k\), where \(n_k\) denotes the sample size of the k-th class, \(\sum _{k=1}^c{n_k}=n\), and \(x_i^k\) is the i-th sample in the k-th class. We extract five local patches from each facial image \(x_i^k\), namely the regions of the two eyes, the mouth, and the right and left cheeks, with the size of each salient patch being \(a \times b\).
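As an illustration, the patch extraction step can be sketched as follows; the landmark centers and the patch size \(a=b=16\) here are hypothetical placeholders, not values taken from the paper:

```python
import numpy as np

def extract_patches(image, centers, a=16, b=16):
    """Crop five a-by-b salient patches (eyes, mouth, cheeks) around the
    given landmark centers and vectorize each one."""
    patches = []
    for (r, c) in centers:
        patch = image[r - a // 2: r + a // 2, c - b // 2: c + b // 2]
        patches.append(patch.reshape(-1))  # d = a * b dimensional vector
    return np.stack(patches, axis=1)       # d x 5 matrix, one column per patch

# toy 128x128 face with made-up landmark centers (left eye, right eye,
# mouth, left cheek, right cheek)
face = np.random.rand(128, 128)
centers = [(40, 40), (40, 88), (90, 64), (70, 30), (70, 98)]
P = extract_patches(face, centers)
print(P.shape)  # (256, 5) for a = b = 16
```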

2.1 Problem Formation

To visually study the five local patches, we randomly pick seven facial samples covering seven expressions: ‘Anger’ (AN), ‘Sadness’ (SA), ‘Fear’ (FE), ‘Surprise’ (SU), ‘Disgust’ (DI), ‘Happiness’ (HA) and ‘Neutral’ (NE) from the Cohn-Kanade database [10]. At an intuitive level, different local patches are far apart, e.g., eyes versus cheeks of anger, while the same local patches are very close, e.g., eyes versus eyes. Hence, it is difficult to ensure that one common manifold can model the high-dimensional data well and guarantee the best classification performance. Furthermore, it is more likely that the patches of the same expression reside on the same manifold. In this case, we can model the local patches of each expression as one manifold, so that local patches on the same manifold become closer and patches on different manifolds move far apart.

2.2 Model Formation

Let \(\mathbf M =[M_1,\cdots ,M_c]\in {\mathfrak {R}^{d\times {\hbar }}}\) be the set of local patches, where \(M_k=[P_1^k,P_2^k,\cdots , P_{n_k}^k]\in {\mathfrak {R}^{d\times {l_k}}}\) is the manifold of the k-th expression, \(P_i^k=[x_{i1}^k,x_{i2}^k,\cdots ,x_{it}^k]\) is the patch set of the i-th facial sample in the k-th class, \(t\) is the number of local patches per facial sample, \(l_k=t\cdot {n_k}\) and \(\hbar =\sum _{k=1}^c{l_k}\). The generic feature extraction problem for MMDA is to seek \(c\) projection matrices \(W_1,W_2,\cdots ,W_c\) that map the manifold of each expression to a low-dimensional feature space, i.e., \(Y_k=W_k^TM_k\), so that \(Y_k\) represents \(M_k\) well in terms of a certain optimality criterion, where \(W_k\in {\mathfrak {R}^{d\times {d_k}}}\), with \(d\) and \(d_k\) respectively denoting the dimensions of the original local patch and the feature space.
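Constructing \(M_k\) and the embedding \(Y_k=W_k^TM_k\) amounts to simple matrix stacking and multiplication; a minimal sketch with random stand-in data (the dimensions and the placeholder \(W_k\) are illustrative only):

```python
import numpy as np

d, t, n_k = 256, 5, 6          # patch dim, patches per face, faces in class k
# P_i^k: d x t patch set of the i-th sample (random stand-ins here)
patch_sets = [np.random.rand(d, t) for _ in range(n_k)]
M_k = np.concatenate(patch_sets, axis=1)   # d x (t * n_k) manifold of class k

d_k = 10
W_k = np.random.rand(d, d_k)               # placeholder projection matrix
Y_k = W_k.T @ M_k                          # low-dimensional embedding, d_k x l_k
print(M_k.shape, Y_k.shape)                # (256, 30) (10, 30)
```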

According to the analysis of Sect. 2.1, MMDA aims at maximizing the ratio of the trace of the inter-manifold scatter matrix to the trace of the intra-manifold scatter matrix. To achieve this goal, we formulate MMDA as the following optimization problem:

$$\begin{aligned} J_1(W_1,\cdots ,W_c)= \frac{\sum _{k,i,j}\sum _{\hat{x}_{ijr}^k\in {N_b(x_{ij}^k)}}||W_k^Tx_{ij}^k-W_k^T\hat{x}_{ijr}^k||^2A_{ijr}^k}{\sum _{k,i,j}\sum _{\tilde{x}_{ijr}^k\in {N_w(x_{ij}^k)}}||W_k^Tx_{ij}^k-W_k^T\tilde{x}_{ijr}^k||^2B_{ijr}^k} \end{aligned}$$
(1)

where \(N_{b}(x_{ij}^{k})\) and \(N_{w}(x_{ij}^{k})\) denote the \(k_{b}\) inter-manifold neighbors and the \(k_{w}\) intra-manifold neighbors of \(x_{ij}^{k}\), respectively; \(\hat{x}_{ijr}^k\) denotes the \(r\)-th nearest inter-manifold neighbor and \(\tilde{x}_{ijr}^k\) the \(r\)-th nearest intra-manifold neighbor. \(A_{ijr}^k\) and \(B_{ijr}^k\) are the weights imposed on the edges connecting \(x_{ij}^k\) with \(\hat{x}_{ijr}^k\in {N_b(x_{ij}^k)}\) and \(x_{ij}^k\) with \(\tilde{x}_{ijr}^k\in {N_w(x_{ij}^k)}\), respectively, defined as in LPP [1].
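A sketch of the nearest-neighbor search and heat-kernel weighting assumed here (the LPP-style weight \(\exp(-||x-y||^2/\sigma)\), the value of `sigma`, and the toy data are illustrative assumptions):

```python
import numpy as np

def heat_kernel_weights(x, candidates, k, sigma=1.0):
    """Return the indices of the k nearest neighbors of x among the columns
    of `candidates`, together with LPP-style heat-kernel edge weights."""
    dists = np.linalg.norm(candidates - x[:, None], axis=0)
    idx = np.argsort(dists)[:k]
    w = np.exp(-dists[idx] ** 2 / sigma)
    return idx, w

# toy query point and three candidate patch vectors (as columns)
x = np.zeros(4)
cands = np.column_stack([np.full(4, v) for v in (0.1, 2.0, 0.2)])
idx, w = heat_kernel_weights(x, cands, k=2)
print(idx)  # nearest candidates first; closer neighbors get larger weights
```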

For convenience, (1) can be written in a more compact form

$$\begin{aligned} J_2(W_1,\cdots ,W_c)=\frac{\sum _{k=1}^c{trace(W_k^T\tilde{S}_{b}^kW_k)}}{\sum _{k=1}^c{trace(W_k^T\tilde{S}_{w}^kW_k)}} \end{aligned}$$
(2)

where \(\tilde{S}_{b}^k=\sum _{i=1}^{n_k}\sum _{j=1}^t\sum _{\hat{x}_{ijr}^k\in {N_b(x_{ij}^k)}}(x_{ij}^k-\hat{x}_{ijr}^k)(x_{ij}^k-\hat{x}_{ijr}^k)^TA_{ijr}^k\),

\(\tilde{S}_{w}^k=\sum _{i=1}^{n_k}\sum _{j=1}^t\sum _{\tilde{x}_{ijr}^k\in {N_w(x_{ij}^k)}}(x_{ij}^k-\tilde{x}_{ijr}^k)(x_{ij}^k-\tilde{x}_{ijr}^k)^TB_{ijr}^k\) are, respectively, the inter-manifold and intra-manifold scatter matrices of the k-th expression.
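Computed directly from their definitions, both scatter matrices are the same weighted sum of outer products over neighbor pairs; a sketch on toy data with unit weights (the neighbor lists here are illustrative):

```python
import numpy as np

def scatter(X, nbr_idx, weights):
    """S = sum_i sum_r w_{ir} (x_i - x_{nbr(i,r)})(x_i - x_{nbr(i,r)})^T,
    the direct form of the inter/intra-manifold scatter matrices."""
    d = X.shape[0]
    S = np.zeros((d, d))
    for i in range(X.shape[1]):
        for r, j in enumerate(nbr_idx[i]):
            diff = (X[:, i] - X[:, j])[:, None]
            S += weights[i][r] * diff @ diff.T
    return S

X = np.array([[0.0, 1.0, 0.0],
              [0.0, 0.0, 1.0]])   # three 2-D points as columns
nbr_idx = [[1], [0], [0]]         # one neighbor per point
weights = [[1.0], [1.0], [1.0]]
S = scatter(X, nbr_idx, weights)
print(S)  # symmetric positive semi-definite
```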

Since \((w_v^k)^Tw_\varepsilon ^k=\delta _{v\varepsilon }\) and \(\tilde{S}_{b}^k\) and \(\tilde{S}_{w}^k\) are positive semi-definite matrices, it holds that \(trace(W_k^T\tilde{S}_{b}^kW_k)\ge {0}\) and \(trace(W_k^T\tilde{S}_{w}^kW_k)>{0}\), and we end up with a new objective function derived from (2)

$$\begin{aligned} J_3(W_1,\cdots ,W_c)=\sum _{k=1}^c{\frac{trace(W_k^T\tilde{S}_{b}^kW_k)}{trace(W_k^T\tilde{S}_{w}^kW_k)}} \end{aligned}$$
(3)

Without loss of generality, it is easy to verify that \(J_3(W_1,\cdots ,W_c)\ge J_2(W_1, \cdots ,W_c)\), since a ratio of sums of non-negative terms is a weighted average of the individual ratios.
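A quick numeric sanity check of this inequality, with made-up stand-ins for the per-expression traces:

```python
a = [3.0, 1.0, 4.0]   # stand-ins for trace(W_k^T S_b^k W_k), all >= 0
b = [2.0, 5.0, 1.0]   # stand-ins for trace(W_k^T S_w^k W_k), all > 0

J2 = sum(a) / sum(b)                     # ratio of sums, as in (2)
J3 = sum(x / y for x, y in zip(a, b))    # sum of ratios, as in (3)
print(J2, J3)  # J3 (5.7) >= J2 (1.0)
```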

This means that (3) can obtain more discriminating features from the training set than (2). However, there is no closed-form solution for simultaneously obtaining the \(c\) projection matrices from (2). To address this problem, we solve for each projection matrix sequentially, inspired by the Fisher linear discriminant criterion [11]

$$\begin{aligned} J(W_k)=\frac{trace(W_k^T\tilde{S}_{b}^kW_k)}{trace(W_k^T\tilde{S}_{w}^kW_k)} \end{aligned}$$
(4)

\(\tilde{S}_{b}^k\) can be explicitly written as shown in Eq. (5).

$$\begin{aligned} \tilde{S}_{b}^k&=\sum _{i=1}^{n_k}\sum _{j=1}^t\sum _{r=1}^{k_b}(x_{ij}^k-\hat{x}_{ijr}^k)(x_{ij}^k-\hat{x}_{ijr}^k)^TA_{ijr}^k \nonumber \\&= M_kD_k^cM_k^T-(L_b^-+{L_b^-}^T)+\bar{M}_kD_k^l\bar{M}_k^T \end{aligned}$$
(5)

where \(L_b^-=M_k\Sigma _k\bar{M}_k^T\), \(\Sigma _k\) is an \(l_k\times {(k_b\cdot l_k)}\) matrix with entries \(A_{ijr}^k\), \(\bar{M}_k=\{\hat{x}_{ijr}^k\in {N_b{(x_{ij}^k)}}\}\), and \(D_k^c\) and \(D_k^l\) are diagonal matrices whose entries are the column and row sums of \(A_{ijr}^k\), i.e., \(D_k^c\leftarrow \sum _r{A_{ijr}^k}\) and \(D_k^l\leftarrow \sum _{ij}{A_{ijr}^k}\).

Similarly, \(\tilde{S}_{w}^k\) can also be reformed as shown in Eq. (6).

$$\begin{aligned} \tilde{S}_{w}^k=\sum _{i=1}^{n_k}\sum _{j=1}^t\sum _{r=1}^{k_w}(x_{ij}^k-\tilde{x}_{ijr}^k)(x_{ij}^k-\tilde{x}_{ijr}^k)^TB_{ijr}^k = 2M_k(D_k-A_k^w)M_k^T \end{aligned}$$
(6)

where \(D_k\) is the diagonal matrix whose diagonal entries are the column sums of \(A_k^w\), and \(A_k^w\) is the weight matrix assembled from the entries \(B_{ijr}^k\).
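The compact form in (6) is the standard graph-Laplacian identity \(\sum_{ij}A_{ij}(m_i-m_j)(m_i-m_j)^T=2M(D-A)M^T\) for a symmetric weight matrix; a numeric check with random stand-in data:

```python
import numpy as np

rng = np.random.default_rng(0)
M = rng.standard_normal((4, 5))          # 5 points in R^4 as columns
A = rng.random((5, 5))
A = (A + A.T) / 2                        # symmetric weight matrix
D = np.diag(A.sum(axis=0))               # diagonal of column sums

# direct pairwise sum of weighted outer products
S_direct = sum(A[i, j] * np.outer(M[:, i] - M[:, j], M[:, i] - M[:, j])
               for i in range(5) for j in range(5))
# compact Laplacian form, as in (6)
S_compact = 2 * M @ (D - A) @ M.T
print(np.allclose(S_direct, S_compact))  # True
```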

Following the Fisher discriminant criterion, we can then solve the following generalized eigenvalue problem

$$\begin{aligned} \tilde{S}_{b}^kw_v^k&= \lambda _v^k\tilde{S}_{w}^kw_v^k \end{aligned}$$
(7)

where \(w_1^k,w_2^k,\cdots ,w_{d_k}^k\) denote the eigenvectors corresponding to the \(d_k\) largest eigenvalues and \(v=1,2,\cdots ,d_k\).
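One standard way to solve the symmetric generalized eigenvalue problem (7) is reduction via a Cholesky factor of \(\tilde{S}_{w}^k\); a NumPy sketch on random stand-in scatter matrices (the small ridge added to keep \(\tilde{S}_{w}^k\) positive definite is our assumption):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.standard_normal((5, 20))
S_b = X @ X.T                         # stand-in inter-manifold scatter (PSD)
Y = rng.standard_normal((5, 20))
S_w = Y @ Y.T + 1e-3 * np.eye(5)      # regularized intra-manifold scatter

# Solve S_b w = lambda S_w w: with S_w = L L^T, eigendecompose
# C = L^{-1} S_b L^{-T}, then back-transform w = L^{-T} u.
L = np.linalg.cholesky(S_w)
C = np.linalg.solve(L, np.linalg.solve(L, S_b).T).T
evals, U = np.linalg.eigh(C)
order = np.argsort(evals)[::-1]       # keep the d_k largest first
evals = evals[order]
W = np.linalg.solve(L.T, U[:, order])
print(evals[:2])
```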

Note that, for tasks with high-dimensional data such as facial images, (7) may encounter several difficulties. One of them is determining the feature dimension \(d_k\) for each projection matrix \(W_k\). To this end, we utilize a feature-dimension determination method based on the trace ratio. In particular, because \(\tilde{S}_{b}^k\) and \(\tilde{S}_{w}^k\) are positive semi-definite matrices, we can screen out the eigenvectors whose eigenvalues meet the following condition

$$\begin{aligned} J_2(w_v^k)&= \frac{(w_v^k)^T\tilde{S}_{b}^kw_v^k}{(w_v^k)^T\tilde{S}_{w}^kw_v^k}\ge {1} \end{aligned}$$
(8)

If \(J_2(w_v^k)\ge {1}\), local patches residing on the same manifold (intra-manifold) remain close while patches residing on different manifolds (inter-manifold) stay far apart. According to this criterion, we can automatically determine the feature dimension \(d_k\) for the k-th projection matrix \(W_k\).
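In code, this dimension selection reduces to counting qualifying eigenvalues (the eigenvalues below are hypothetical):

```python
import numpy as np

# hypothetical generalized eigenvalues of (7), sorted in descending order
evals = np.array([3.2, 1.7, 1.0, 0.6, 0.1])

# At an eigenvector w_v^k the trace ratio in (8) equals its eigenvalue,
# so the criterion J_2(w_v^k) >= 1 keeps exactly the eigenvalues >= 1.
d_k = int(np.sum(evals >= 1.0))
print(d_k)  # 3
```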

In conclusion, we summarize the steps to complete MMDA in Algorithm 1.

[Algorithm 1]
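As a companion to Algorithm 1, the classification stage mentioned in Sect. 1 (a reconstruction-error manifold-to-manifold distance, as in [4, 5]) might be sketched as follows; the least-squares neighbor reconstruction and the `n_nbrs` parameter are our illustrative assumptions, not the paper's exact formulation:

```python
import numpy as np

def manifold_distance(Y_test, Y_train, n_nbrs=5):
    """Reconstruction-error distance from a projected test patch set to one
    training manifold: each test column is approximated by a least-squares
    combination of its nearest training columns."""
    err = 0.0
    for v in Y_test.T:
        d = np.linalg.norm(Y_train - v[:, None], axis=0)
        N = Y_train[:, np.argsort(d)[:n_nbrs]]
        c, *_ = np.linalg.lstsq(N, v, rcond=None)
        err += np.linalg.norm(v - N @ c)
    return err

def classify(patches, Ws, manifolds, n_nbrs=5):
    """Assign the expression whose projected manifold reconstructs the
    projected test patches with the smallest error."""
    errs = [manifold_distance(W.T @ patches, W.T @ M, n_nbrs)
            for W, M in zip(Ws, manifolds)]
    return int(np.argmin(errs))
```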

3 Experiments

We perform experiments on two public databases: Cohn-Kanade human emotion database [10] and Jaffe database [13], which are the most commonly used databases in the current human emotion research community.

3.1 Human Emotion Database

The Cohn-Kanade database was acquired from 97 people aged 18 to 30 years with six prototype emotions (Anger, Disgust, Fear, Happiness, Sadness, and Surprise). In our study, 300 sequences are selected. The selection criterion is that a sequence can be labeled as one of the six basic emotions; three peak frames of each sequence are used for processing. In total, 684 images are selected, covering 19 subjects, with 36 images per subject and 6 images per expression per subject. Each normalized image is scaled down to a size of \(128\times {128}\). Some example images from this database are depicted in Fig. 1.

Fig. 1. Six samples from Cohn-Kanade database.

The JAFFE human emotion database consists of 213 images of Japanese female facial expressions. Ten subjects posed three or four examples of each of the six basic expressions. A simple preprocessing step is applied to the JAFFE database before training and testing: each normalized image is scaled down to a size of \(80 \times 80\). Some of the cropped face images from the JAFFE database with different human emotions are shown in Fig. 2.

Fig. 2. Six samples from Jaffe database.

3.2 Experimental Results and Analysis

In this paper, we compare the performance of MMDA with existing feature extraction and classification methods, including PCA+LDA [14], modular PCA [15], GMMSD [16], LPP [1], DLPP [17], MFA [18], and Xiao's method [4]. For a fair comparison, we explore performance over all possible feature dimensions in the discriminant step and report the best results. The experimental results are listed in Table 1. From these results, we make several observations:

Table 1. Recognition rates of comparative methods on Cohn-Kanade and Jaffe databases

(1) MMDA and Xiao's method consistently outperform the other methods, further indicating that modeling each expression as one manifold is preferable, because the expression-specific geometric structure can be discovered without being influenced by subject-specific structure.

(2) Comparing MMDA with Xiao's method, the second-best method in the comparison, reveals that MMDA encodes more discriminating information in the low-dimensional manifold subspace by preserving the local structure, which is more important than the global structure for classification.

(3) Recognition performance on the JAFFE database is much poorer than on the Cohn-Kanade database, likely because the smaller number of samples and subjects results in a poor sampling of the underlying discriminant space.

To provide a more detailed view, we show the corresponding mean confusion matrices, which analyze the confusion between emotions when applying MMDA to human emotion recognition on Cohn-Kanade and JAFFE (see Tables 2 and 3). From Table 2, we can draw the following conclusions: ‘Anger’, ‘Happiness’, ‘Surprise’ and ‘Sadness’ are well distinguished by MMDA, whereas ‘Disgust’ obtains the worst performance in the confusion matrix. In summary, MMDA learns the expression-specific structure of the local patches belonging to ‘Anger’, ‘Happiness’, ‘Surprise’ and ‘Sadness’ well. From Table 3, we see that it is very difficult to recognize the expression of ‘Fear’ accurately, which is consistent with the result reported in [13].

Table 2. The confusion matrix by applying MMDA for facial expression recognition on Cohn-Kanade database
Table 3. The confusion matrix by applying MMDA for facial expression recognition on Jaffe database

4 Conclusions

In this paper, we propose a novel model for human emotion recognition that learns discriminative information based on the principle of multiple manifolds discriminant analysis (MMDA). Considering that local appearances can effectively reflect the structure of the facial space on one manifold and provide more discriminative information, we focus on five local patches, namely the regions of the left and right eyes, the mouth, and the left and right cheeks of each facial image, to learn multiple-manifold features. In this way, the semantic similarity of expressions across different subjects is well preserved on each manifold. Extensive experiments on the Cohn-Kanade and JAFFE databases show that MMDA achieves superior performance compared with several other human emotion recognition methods.