Keywords

1 Introduction

Facial expressions, poses, and gestures are the non-verbal cues, which are an essential part of our daily communication where the face is the most prominent. Facial expression through expressing feelings common means of human communication, it indicates their agreement or disagreement, express intentions, interacting with others as well as environment [1]. Thus, many research is being done in the field of automated expression recognition. Numerous approaches for facial expression recognition have been suggested in the literature, which can be broadly classified into feature-based techniques and appearance-based techniques [2]. Local image descriptors like SIFT(Scale Invariant Feature Transform) [3] has been extensively examined and has assumed a predominant job in FER applications. SIFT has achieved the best overall performance as local image descriptors in the literature. It improves effectiveness while diminishing the memory prerequisites. Impressed by the SIFT [4] presented Speed Up Robust Features (SURF). Instead of using the gradient information like SIFT descriptor does, SURF computes responses of Haar Wavelet and exploits integral images to save computational cost. As a result, it runs faster than SIFT. Dalal and Triggs [5] introduced the Histogram of Oriented Gradient (HOG) descriptor. For a long time, Gabor wavelet [6] have been comprehensively utilized in face feature extrication. It is capable of extracting the multi-direction and multi-scale statistics from the given face image. It is able to extract the multi-orientation and multi-scale information from the given face image. Gabor features has the huge feature dimensions, so it is not useful for the real time applications. [7] proposed Local Binary Patterns (LBP) for texture classification, and such a descriptor encodes texture information by comparing the central pixel and its neighborhood. It was successfully applied in the task of face recognition [8] and object categorization [9]. LBP is sensitive to local illumination variations. It does not correctly identify the texture of facial muscles and other types of local deformations, thus not very robust for FER applications. Another local feature based method based on Histogram of Second order Gradient (HSOG) [10].

Above mentioned work basically works based on local texture information, the local texture of face image is extracted with various image processing procedures, followed by histogram pooling by various block sizes. These approaches usually results in the feature vector length of order \(10^5\) making classification step computationally complex. To overcome problem of high dimensionality, in recent times many Dimensionality Reduction (DR) techniques are proposed, these techniques are proved to be very efficient in image recognition tasks. In this article, we are using one state of the art DR method - ONPP for facial expression recognition. The paper mainly proposes a variant of ONPP suitable for facial expression recognition. The main advantage achieved in the DR based method is computational complexity reduction in the classification. Using ONPP, the image is represented using feature vector 100s of dimensions instead of the feature vector having \(10^5\) dimensions. The performance of proposed method is tested on three bench-mark databases: CK+, Jaffe and Video database. Proposed method achieves similar recognition accuracy with comparatively very low dimensions than that of local feature based methods.

Remaining part of the study is held as follows. Section 2 explains dimensionality reduction methods employed successfully in image recognition tasks, followed by detailed discussion on Orthogonal Neighborhood Preserving Projection in Sect. 3. Section 4 documents expression recognition experiments performed on well-known facial expression databases, followed by the conclusion in Sect. 5.

2 Dimensionality Reduction for Image Recognition

Natural images are considered as very high-dimensional data, but as proved in [11], they usually reside on relatively low dimensional linear or non-linear manifold. The fundamental idea of dimensionality reduction is to seek this linear/non-linear manifold where data representation is compact and more meaningful, in the sense that the learned low dimensional representation reveals the variability of the data more clearly. During last two decades many dimensionality reduction methods are employed in image recognition.

Among these methods appearance based DR methods consider the image of size \(m\times n\) as a mn-dimensional data point in \(\mathcal {R}^{mn}\) and learn the low dimensional image manifold \(\mathcal {M}\in \mathcal {R}^d\) where \(d<<mn\). DR methods are broadly classified in two categories: Non-linear DR methods and Linear DR methods. Linear Embedding (LE) [11], Laplacian Embeddings (LE) are non-linear DR methods. The mapping of data points achieved by these methods are non-explicit thus making recognition of an out-of-sample image data impossible. The solutions to overcome non-explicit mapping are proposed in Locality Preserving Projections (LPP) [12], Neighborhood Preserving Projections (NPP) where training samples are used to learn a projection matrix which can be used to project out-of-sample test data on the learned projection space. These linear methods have performed very well in image recognition tasks, but in recent times, orthogonal variants of these DR methods are proposed in OLPP [13] and ONPP [14] have been proved superior to their non-orthogonal counterparts and considered state-of-the-art in appearance based image recognition. Over the time, various discriminating and/or 2-dimensional variants of OLPP and ONPP are proposed and used in the image recognition tasks, too.

In this article, ONPP technique with class similarity based neighborhood is used to perform expression recognition and the performance is compared with conventional ONPP approach [14] and Modified ONPP approach [15]. The article also compares the performance of DR based expression recognition approach with local feature descriptor based state-of-the-art approaches.

3 Expression Recognition Using ONPP and Class Similarity Based Neighborhood

3.1 Data Preparation for Modular Expression Recognition

Appearance based DR methods consider an \(m\times n\) image as a data point in mn-dimensional space where each image is vectorized either by columns or by rows. Let \(\mathbf {x_{1},x_{2},....,x_{N}}\) be the N data points of given training images, so the data matrix \(\mathbf {X}\) can be defined as \(\mathbf {X=[x_{1},x_{2},....,x_{N}]}\in {\mathbf {R}^{mn\times N}}\). As discussed in Sect. 2 it is expected that the similar images will be projected in the same neighborhood in the projected space. It is proven in that area between eye-brows, eyes, area containing nose and lips plays major role in expressing emotions while the rest of the facial area does not provide any significant information for expression recognition. The modular approach considers only these areas while recognizing expressions, thus the each data vector \(\mathbf {x_i}\) corresponding to image i is prepared considering above mentioned four areas from the image. Each significant area is cropped from the whole image and vectorized, these four vectors of facial region are then concatenated to make l-dimensional vector \(\mathbf {x_i}\) as shown in Fig. 1. Note that the size of these regions across all face images should be same so that the resulting vector representation be a point in \(\mathcal {R}^l\).

Fig. 1.
figure 1

Conversion of Facial expression image data into modular vectorized data: each l-dimensional vector \(\mathbf {{x_i}\in X}\) represents face image \(\mathbf {I_1}\)

3.2 ONPP with Class Similarity Based Neighborhood

Let X be the data matrix of training images such that \(\mathbf {X=[x_{1},x_{2},....,x_{N}]}\in {\mathbf {R}^{l\times N}}\) for N modular vector representing face images. ONPP is a three step procedure where in the first step defines the neighborhood of each data point \(\mathbf {x_i}\), the next step expresses each data point as a linear combination of its neighbors \(\mathbf {x_j}s\). The last step finds projection bases \(\mathbf {V\in \mathcal {R}^{l\times d}}\) of an ONPP space using training data \(\mathbf {X}\) that can be used to embed test image \(\mathbf {x_t}\) in the ONPP space for recognition step.

Step 1: Finding Nearest Neighbors: For every data point \(\mathbf {x_i}\), neighborhood \(\mathcal {N}_{x_i}\) is defined. The choice of neighborhood can be defined in several ways, the simple choice can be made using \(k \) neighbors are chosen by Nearest Neighbor (NN) method where \(k \) is a reasonably picked parameter. In another way, neighbors could be chosen which are inside \(\varepsilon \) radius from the data point. Let \(\mathcal {N}_{x_{i}}\) be the set of \(k \) neighbors of \(\mathbf {x_i}\). Both these approaches do not consider the available knowledge of class labels, thus known as unsupervised approach. In an alternative approach - supervised approach - data points belonging to same class are considered neighbors of each other.

Both the methods have their own shortcomings. Unsupervised methods consider only euclidean distance which essentially describes similarity between data points regardless of discriminating information based on class labels. Supervised methods decide the neighborhood neglecting the similarity between data points those are not in the same class. Both of these approaches are suitable for face recognition application where the difference between two images stems only from occlusions, illumination, expression and pose variations. On the other hand, in expression recognition task, images of an expression coming from two person show large variation which cannot be represented by only euclidean distance or only label knowledge.

Instead of claiming a data point \(\mathbf {x_i}\) belonging to an unique class, it makes much more sense that a data point belongs to a class \(c_i\) with a certain probability \(p_{c_i}\). If the data set has C classes, then for each data point we can build a C-dimensional probability vector \(\mathbf {p(x_i)}=[p_1(\mathbf {x_i}),p_2(\mathbf {x_i}),... p_c(\mathbf {x_i})]^T\). he probability vector is computed based on class similarity between data points which combines the knowledge of class label as well as euclidean distance.

Let \(\varDelta (i,j)\) be the euclidean distance between data points \(\mathbf {x_i}\) and \(\mathbf {x_j}\). Class similarity based distance denoted by \(\varDelta '(i,j)\) can be defined by

$$\begin{aligned} \varDelta '(\mathbf {x_i},\mathbf {x_j})&= \varDelta (i,j) + \alpha \max (\varDelta ) (1-\mathcal {S}(i,j)) \end{aligned}$$
(1)

where, \(\alpha \in [0,1]\) is a tuning parameter. \(max(\varDelta )\) indicates maximum pair-wise distance or data diameter. \(\mathcal {S}(i,j)\) is class similarity between \(\mathbf {x_i}\) and \(\mathbf {x_j}\), which is defined as,

$$\begin{aligned} \mathcal {S}(i,j)= {\left\{ \begin{array}{ll} 1; &{} \mathbf {x_i}=\mathbf {x_j} \\ \mathbf {p(x_i)}^T\mathbf {p(x_j)}; &{} \mathbf {x_i}\ne \mathbf {x_j} \end{array}\right. } \end{aligned}$$
(2)

We used Logistic Discrimination (LD) to find probability of each data point \(\mathbf {x_i}\) belonging to class \(c_i\). Performing LD on high dimensional data causes huge computational burden, thus lower dimensional representation is sought using PCA. Let \(\mathbf {z_i}\) be a lower dimensional representation of \(\mathbf {x_i}\), to find probability vector \(\mathbf {p(x_i)}\). The \(c^{th}\) element of \(\mathbf {p(x_i)}\) corresponding to class c can be computed by

$$\begin{aligned} p_c(\mathbf {x_i})= \frac{\pi (\mathbf {z_i}; \alpha _c , \beta _c)}{\sum _{c=1}^C\pi (\mathbf {z_i}; \alpha _c , \beta _c)} \end{aligned}$$
(3)
$$\begin{aligned} \nonumber \text {where the function, } \pi (\mathbf {z_i}; \alpha _c , \beta _c)= & {} \frac{\exp (\alpha _c+{\beta _c}^T\mathbf {z_i})}{1+\exp (\alpha _c+{\beta _c}^T\mathbf {z_i})} \end{aligned}$$

The neighbors for each data point \(\mathbf {x_i}\) will be chosen based on class similarity based distance given in Eq. (1).

Step 2: Calculating Reconstruction Weight: In this step, the neighborhood \(\mathcal {N}_{x_{i}}\) is expressed as a linear combination of neighbors with reconstruction weight \(w_{ij}s\) as \(\sum _{j=1}^{k} w_{ij}\mathbf {x_{j}}\). The weight \(w_{ij}\) are computed by minimizing the reconstruction error i.e. error between \(\mathbf {x_{i}}\) and linear combination of \(\mathbf {x_{j}}\in \mathcal {N}_{x_{i}}\).

$$\begin{aligned} \arg \min \mathcal {E}(W)=\underset{\mathbf {W}}{\arg \min }{\displaystyle {\displaystyle \sum _{i=1}^{N}}\parallel \mathbf {x_{i}}-\sum _{j=1}^{k}}w_{ij}\mathbf {x_{j}}\parallel ^{2} \end{aligned}$$
(4)

subject to \(\sum _{j=1}^{k} w_{ij}=1\).

For each data point \(\mathbf {x_i}\), optimization problem given in (4) can be modeled as a least square problem \((\mathbf {X_{N_{i}}-x_{i}e^{T})w_{i}}=\mathbf {0}\) with a constraint \(\mathbf {e^{T}w_{i}}=\mathbf {1}\). Here, \(\mathbf {X_{N_{i}}}\) is a matrix having \(\mathbf {x_{j}}\) as its columns, where \(\mathbf {x_{j}}\in {{N}_{x_{i}}}\). Note that \(\mathbf {X_{N_{i}}}\) includes \(\mathbf {x_{i}}\) as its own neighbor making it a matrix of dimension \(l \times k+1\). Solving the least square problem results in a closed form solution for \(\mathbf {w_{i}}\) given by Eq. (5).

$$\begin{aligned} \mathbf {w_{i}}=\frac{\mathbf {G}^{-1}\mathbf {e}}{{\mathbf {e}^{T}\mathbf {G}^{-1}\mathbf {e}}} \end{aligned}$$
(5)

Here, \(\mathbf {e}\) is a vector of ones having dimension \(k\times 1\) same as \(\mathbf {w_i}\). \(\mathbf {G}\in \mathcal {R}^{k\times k}\) is a Gramiam matrix, each entry of G is given by \(\mathbf {g_{pl}}=(\mathbf {x_{i}}-\mathbf {x_{p}})^{T}(\mathbf {x_{i}}-\mathbf {x_{l}}), \;for\,\forall \, \mathbf {x_{p}},\mathbf {x_{l}}\,\in \mathcal {N}_{x_{i}}\).

Step 3: Finding Projection Matrix: Last step is dimensionality reduction or finding the projection matrix V that explicitly maps l-dimensional data point \(\mathbf {x_i}\) to d-dimensional representation \(\mathbf {y_i}\) assuming that the neighborhood relationship among \(\mathcal {N}_\mathbf {x_i}\) with corresponding weights \(w_{ij}\) will be preserved in lower dimensional space, too.

The optimization problem to achieve such mapping can be formed as minimization of the sum of squares of reconstruction errors in lower dimensional space. The cost function is given by

$$\begin{aligned} \arg \min \mathcal {F}(\mathbf {Y})=\underset{\mathbf {Y}}{\arg \min }\sum _{i=1}^{N}\parallel \mathbf {y_{i}}-\sum _{j=1}^{k}w_{ij}\mathbf {y_{j}}\parallel ^{2} \end{aligned}$$
(6)

subject to orthogonality constraint, \(\mathbf {V}^{T}\mathbf {V}=\mathbf {I}\).

Solving the optimization problem results in eigenvalue problem \(\mathbf {XMX^TV}=\lambda \mathbf {V}\). Here, columns of \(\mathbf {V}\) are eigen-vectors that corresponding to the smallest d eigen-values. The matrix \(\mathbf {M}=(\mathbf {I}-\mathbf {W})(\mathbf {I}-\mathbf {W^T})\). Note that \(\mathbf {XMX}^T\) is symmetric and positive semi-definite. ONPP explicitly maps \(\mathbf {X}\) to \(\mathbf {Y}\), which is of the form \(\mathbf {Y}=\mathbf {V^TX}\), i.e. each test sample \(\mathbf {x_t}\) can now be projected to lower dimension by just a matrix-vector product \(\mathbf {y_{t}}=\mathbf {V}^{T}\mathbf {x_{t}}\).

Considering the under-sampled size issue where the number of samples N is less than dimension l. In such situation, the matrix \(\mathbf {XMX^T} \in \mathbf {R}^{l\times l}\) will have maximum rank \(N - c\), where c is number of classes. To ensure that the resulting matrix \( \mathbf {M}\) be non-singular, one may utilize an initial PCA projection that reduces the dimensionality of the data vectors to \(N - c\). If \(\mathbf {V_{PCA}}\) is the projection matrix of PCA, then on performing the ONPP the resulting dimensionality reduction matrix is given by \(\mathbf {V} = \mathbf {V_{PCA}V_{ONPP}}\). Note that the PCA projection is most common pre-processing applied in many dimensionality reduction methods and in this article we are using it in our advantage to define new distance measure for local neighborhood.

Following Table 1 gives procedure to find Class-similarity based ONPP subspace and mechanism to recognize expression of the test image.

Table 1. Procedure for expression recognition using Class-Similarity based ONPP

4 Experiments

To approve the hypothetical conclusion of the proposed framework, experiments were performed on the four facial datasets. 1) JAFFE database [16] having 213 facial images of 10 Japanese female models of 7 facial expressions (6 basic facial expressions + 1 neutral). Out of 213 images, random 140 images were chosen for the training and the remaining 73 were used for testing. 2) The Video database [17] has videos of 11 persons. The single video contains four different facial expressions: Smiling, Angry, Open mouth and Normal. Out of 6668 images, randomly 70% images were chosen for training and remaining 30% images used as testing 3) In CK+ [18] there are 593 sequences across 123 persons giving 8 facial expressions. Table 2 reports average and best recognition result of 20 such iterations.

Comparison of performance of ONPP [14] and CS-ONPP on facial expression databases in the light of recognition score (in \(\%\)) with corresponding subspace dimensions are reported in Table 3. Where as Table 4 reported the comparison between the MONPP [15] and CS-ONPP. CS-ONPP are reported along with tuning parameter Alpha and PCA dimension (d\(\_\)pca).

Table 2. Best recognition Accuracy (%) achieved with proposed method of three benchmark databases along with related parameters: PCA subspace dimension (dpca), Number of Nearest Neighbors (k), tuning parameter \(\alpha \) and ONPP subspace dimensions
Table 3. Comparison of performance of ONPP and CS-ONPP on facial expression databases in the light of recognition score (in \(\%\)) with corresponding subspace dimensions. CS-ONPP are reported along with tuning parameter Alpha and PCA dimension (d\(\_\)pca).
Table 4. Comparison of performance of MONPP and CS-ONPP on facial expression databases in the light of recognition score (in \(\%\)) with corresponding subspace dimensions. CS-ONPP are reported along with tuning parameter Alpha and PCA dimension (d\(\_\)pca).
Table 5. Comparison with Recognition Accuracy reported in some State-of-the-Art facial expression methods
Table 6. Comparison of proposed method with Local feature based methods in the light of Feature vector length for JAFFE dataset for \(256\times 256\)

Table 5 compares recognition results of the proposed technique with that of few State-of-the-art Local feature based facial expression recognition techniques. As can be seen from the comparison that for JAFFE database , similar performance is achieved with proposed method but with feature vector of only 110 dimensions. For CK+ data, best recognition is achieved at 510 dimensions, which is not superior to other methods, but provides an advantage in terms of reduced feature vectors. Table 6 repeats the comparisons given in [23] for Local features based Facial Expression Recognition methods along with the dimensions of feature vectors for the given method.

5 Conclusion

Though, Local Features based methods have been successfully applied to Facial Expression Recognition problems, the resulting feature vector lengths usually are of order \(10^5\) which slow down classification process. The article proposes a Dimensionality Reduction based method which can be employed in FER. Basically, state-of-the-art DR method ONPP is used and a novel approach of neighborhood selection based on class similarity is proposed to suit FER application. Proposed method is tested on three benchmark databases and proved to be gaining huge margin in terms of feature vector length while maintaining same recognition accuracy.