
1 Introduction

Facial expression recognition (FER) is a challenging research problem with important applications in medicine, lie detection, cognitive activity analysis, human-robot interaction, forensics, automated training systems, security, mental state identification, mood-based music selection, operator fatigue detection, and more. Although much progress has been made, the methods used for facial expression recognition share a similar pipeline: after face detection and segmentation, features are extracted and selected, and the selected features are finally classified. Methods differ mainly in the features they select, the feature extraction method, and the classification method.

In recent years, facial expression classification approaches have fallen mainly into two categories: (a) geometric feature-based and (b) texture feature-based. Geometric feature-based methods detect action units (AUs) by tracking changes in permanent and transient facial features through accurate geometric modeling. Texture feature-based methods use local (face parts) or global (entire facial image) descriptors that aim to describe facial appearance.

Deepthi, Archana et al. [1] implemented FER on the JAFFE database using the 2D-DCT for feature extraction and a neural network as the classifier. Punitha, Geetha et al. [2] extracted the mouth intensity code value (MICV) difference between the first frame and the frame of greatest expression intensity, used a Hidden Markov Model (HMM) as the classifier, and achieved 94% accuracy on their own dataset. Zhang, Liu et al. [3] fed Gabor LBP and Gabor LPQ features into a multiclass SVM classifier on the JAFFE database and obtained 98% accuracy. Owusu and Zhan [4] fed selected Gabor features into a support vector machine (SVM) classifier and obtained an average recognition rate of 97.57%. Shah, Khanna et al. [5] implemented FER for color images using Gabor filters, Log Gabor filters, and PCA, with the Euclidean distance used to classify the reduced features; testing on their own database gave an accuracy of 86.7%. Lajevardi, Husain et al. [6] presented an FER system based on hybrid face regions (HFR), in which Log Gabor features are extracted from the whole face image and from face regions and a Naïve Bayes classifier is used; the JAFFE and Cohn-Kanade databases yielded accuracies of 97% and 91%, respectively.

ELLaban, Ewees, Elsaeed et al. [7] addressed facial expression recognition with support vector machine and k-nearest neighbor classifiers. In their work, Gabor filters and PCA are used for feature extraction, and SVM and KNN classifiers are used to classify the extracted features; on their own database the SVM reached 90% accuracy and outperformed the KNN. Lee, Uddin and Kim et al. [8] presented FER using Fisher Independent Component Analysis and a Hidden Markov Model on the Cohn-Kanade database, where FICA with the Fisher Linear Discriminant (FLD) is used for feature extraction based on a class-specific learning algorithm. Sumathi, Santhanam and Mahadevi et al. [9] investigated Facial Action Coding System (FACS) action units and methods that recognize action unit parameters from extracted facial expression data; various human facial expressions are recognized based on their geometric facial appearance and hybrid features. In [10], a pose-invariant spatial-temporal textural descriptor is proposed, which achieves 94.48% average accuracy with SVMs on the CK+ database. Other studies classifying six or seven expressions on the CK+ database can be found in [11,12,13].

Numerous methodologies have been proposed, and they give an overall picture of how effective geometric and texture-based features are; nevertheless, recognizing facial expressions with high accuracy remains difficult because of the complexity and variety of facial expressions.

In this work, we propose an automated approach for recognizing facial expressions with a combination of geometric and texture-based features. The task of an automated FER method can be split into a sequence of processing stages: preprocessing, feature extraction, and classification. Initially, preprocessing is applied before feature extraction to make the texture invariant to translation, rotation, and scaling. We then combine geometric and texture features: the geometric features are described by facial feature point displacements and by slope and angle differences between the normalized neutral and peak expression images, while the texture features are represented by gradient-level normalized cross correlation and Gabor wavelets. Finally, the resulting feature vector is fed to an SVM classifier to recognize facial expressions from the CK+ database.

The remainder of the paper is organized as follows: Sect. 2 describes the preprocessing procedure, Sect. 3 introduces feature extraction, Sect. 4 evaluates the performance of the proposed method, and Sect. 5 concludes the paper.

2 Preprocessing

Preprocessing is a vital step in facial expression recognition, comprising face detection, face alignment, illumination processing, etc. For a given image, we first localize the centers of the eyes with the Adaboost learning algorithm [14] as reference points, and then rotate the image so that the line through the eye centers is horizontal (see Fig. 1).

Fig. 1. Normalized face acquisition.

Here, Le is the center position of the left eye, Re is the center position of the right eye, and \( \theta \) is the angle between the line through the two eyes and the horizontal direction.
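To make the alignment step concrete, the sketch below rotates the face about the midpoint of the eyes by \( \theta \) so that the eye line becomes horizontal. It is only an illustration of this step, assuming the eye centers have already been located (e.g., by the Adaboost detector); the function name and OpenCV calls are our choices, not part of the original work.

```python
import cv2
import numpy as np

def align_face(image, left_eye, right_eye):
    """Rotate the image so the line joining the eye centers becomes horizontal.

    left_eye, right_eye: (x, y) pixel coordinates of the eye on the image's
    left / right side, e.g. from an Adaboost-based eye localizer.
    """
    (lx, ly), (rx, ry) = left_eye, right_eye
    # Angle theta between the eye-to-eye direction and the horizontal axis.
    theta = np.degrees(np.arctan2(ry - ly, rx - lx))
    # Rotate about the midpoint of the two eyes to keep the face centered.
    center = ((lx + rx) / 2.0, (ly + ry) / 2.0)
    rot = cv2.getRotationMatrix2D(center, theta, 1.0)
    h, w = image.shape[:2]
    return cv2.warpAffine(image, rot, (w, h), flags=cv2.INTER_LINEAR)
```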

3 Feature Extraction

3.1 Geometric Feature Representation

In this work, we use difference-ASM (DASM) features to describe the facial shape difference between the neutral and peak expressions. The displacements of the salient point coordinates describe the change in facial landmark positions. In addition, the differences in the corresponding facial angles and slopes between the neutral and peak expressions serve as the 24 geometric features that help capture the coarse (large) distortions in facial geometry across expression classes.

First, we use the Adaboost algorithm to detect the face and the extended Active Shape Model (ASM) [15, 16] to locate the facial feature points; the Stasm implementation [17] of the ASM module, a C++ library built on top of OpenCV, has been used in this work. Figure 2 shows the facial feature points and the detailed face region obtained from the ASM algorithm. The displacements between the x and y coordinates of the 21 feature points on the neutral and expressive faces are then calculated as a 21 × 2 dimensional feature vector.

Fig. 2. ASM-based face region acquisition.

Furthermore, geometric features such as the slopes and angles of the eyes and mouth are extracted from the ASM landmark information. Because the landmarks come from faces with different coordinate origins and scales, all facial landmarks are first normalized to a common scale. The normalization proceeds in the following four steps.

The 21 feature vertices, each with x and y coordinates, can be described as follows:

$$ \alpha_{i} = \left( x_{i}, y_{i} \right), \quad i = 1, 2, \cdots, 21. $$
(1)

The bottom of the nose is assigned as the 16th feature point shown in Fig. 2, and it has been defined as follows:

$$ bas = \alpha_{16} = \left( x_{16}, y_{16} \right) $$
(2)

To transform all points into a common coordinate system, the coordinates of bas are subtracted from the coordinates of each feature point:

$$ \beta_{i} = \alpha_{i} - bas, \quad i = 1, 2, \cdots, 21. $$
(3)

Each \( \beta_{i} \) is then normalized so that all facial models share the same scale:

$$ N_{i} = \beta_{i} / \left( \beta_{11} - \beta_{9} \right), \quad i = 1, 2, \cdots, 21. $$
(4)

Where \( \beta_{11} \) and \( \beta_{9} \) denote the two inner eye corner vertices. Based on the normalized feature points, slope features are obtained as follows:

Eye slope features:

$$ \begin{array}{l} S_{1} = N_{9} - N_{8}; \quad S_{2} = N_{9} - N_{10}; \quad S_{3} = N_{9} - N_{7}; \\ S_{4} = N_{8} - N_{7}; \quad S_{5} = N_{10} - N_{7} \end{array} $$
(5)

Mouth slope features:

$$ \begin{array}{l} S_{6} = N_{20} - N_{19}; \quad S_{7} = N_{20} - N_{21}; \quad S_{8} = N_{20} - N_{18}; \\ S_{9} = N_{19} - N_{18}; \quad S_{10} = N_{21} - N_{18} \end{array} $$
(6)

Also, angle features can be generated as follows:

Eye angle features:

$$ \begin{array}{l} a_{1} = \cos^{ - 1} \left( \left( S_{1} \cdot S_{3} \right) / \left( \left\| S_{1} \right\| \left\| S_{3} \right\| \right) \right); \quad a_{2} = \cos^{ - 1} \left( \left( S_{2} \cdot S_{3} \right) / \left( \left\| S_{2} \right\| \left\| S_{3} \right\| \right) \right); \\ a_{3} = \cos^{ - 1} \left( \left( S_{4} \cdot S_{3} \right) / \left( \left\| S_{4} \right\| \left\| S_{3} \right\| \right) \right); \quad a_{4} = \cos^{ - 1} \left( \left( S_{3} \cdot S_{5} \right) / \left( \left\| S_{3} \right\| \left\| S_{5} \right\| \right) \right); \\ a_{5} = \cos^{ - 1} \left( \left( S_{1} \cdot S_{4} \right) / \left( \left\| S_{1} \right\| \left\| S_{4} \right\| \right) \right); \quad a_{6} = \cos^{ - 1} \left( \left( S_{2} \cdot S_{5} \right) / \left( \left\| S_{2} \right\| \left\| S_{5} \right\| \right) \right) \end{array} $$
(7)

Mouth angle features, where ||*|| is the norm operator:

$$ \begin{array}{l} a_{7} = \cos^{ - 1} \left( \left( S_{6} \cdot S_{8} \right) / \left( \left\| S_{6} \right\| \left\| S_{8} \right\| \right) \right); \quad a_{8} = \cos^{ - 1} \left( \left( S_{7} \cdot S_{8} \right) / \left( \left\| S_{7} \right\| \left\| S_{8} \right\| \right) \right); \\ a_{9} = \cos^{ - 1} \left( \left( S_{9} \cdot S_{8} \right) / \left( \left\| S_{9} \right\| \left\| S_{8} \right\| \right) \right); \quad a_{10} = \cos^{ - 1} \left( \left( S_{10} \cdot S_{8} \right) / \left( \left\| S_{10} \right\| \left\| S_{8} \right\| \right) \right); \\ a_{11} = \cos^{ - 1} \left( \left( S_{6} \cdot S_{9} \right) / \left( \left\| S_{6} \right\| \left\| S_{9} \right\| \right) \right); \quad a_{12} = \cos^{ - 1} \left( \left( S_{7} \cdot S_{10} \right) / \left( \left\| S_{7} \right\| \left\| S_{10} \right\| \right) \right) \end{array} $$
(8)

The difference-ASM features used for classification therefore comprise a 21 × 2 dimensional displacement feature vector, a 12-dimensional slope feature vector, and a 12-dimensional angle feature vector.
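A possible implementation of Eqs. (1)-(8) is sketched below. It assumes a (21, 2) landmark array in the Fig. 2 ordering (0-based indices here, so the paper's point 16 is row 15), takes the norm of \( \beta_{11} - \beta_{9} \) as the scale in Eq. (4), and uses the dot-product form of the angle features; the helper names and the eye/mouth index blocks are illustrative rather than taken from the paper.

```python
import numpy as np

def normalize_landmarks(pts):
    """Eqs. (2)-(4): translate so the nose-bottom point is the origin and
    scale by the inner-eye-corner distance.  pts: (21, 2) array."""
    beta = pts - pts[15]                         # Eq. (3): subtract bas = alpha_16
    scale = np.linalg.norm(beta[10] - beta[8])   # |beta_11 - beta_9| (assumed scalar scale)
    return beta / scale                          # Eq. (4)

def slope_and_angle_features(N, idx):
    """Eqs. (5)-(8) for one facial part.
    idx: four landmark rows, e.g. (6, 7, 8, 9) for the eye block N7..N10."""
    p7, p8, p9, p10 = (N[i] for i in idx)
    S = [p9 - p8, p9 - p10, p9 - p7, p8 - p7, p10 - p7]    # Eq. (5)/(6)
    def angle(u, v):                                        # Eq. (7)/(8)
        c = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
        return np.arccos(np.clip(c, -1.0, 1.0))
    pairs = [(0, 2), (1, 2), (3, 2), (2, 4), (0, 3), (1, 4)]
    return S, [angle(S[i], S[j]) for i, j in pairs]

def dasm_features(neutral_pts, peak_pts):
    """Displacement + slope-difference + angle-difference feature vector."""
    Nn, Np = normalize_landmarks(neutral_pts), normalize_landmarks(peak_pts)
    feats = [(Np - Nn).ravel()]                             # 21 x 2 displacements
    for idx in [(6, 7, 8, 9), (17, 18, 19, 20)]:            # eye block, mouth block (assumed rows)
        Sn, an = slope_and_angle_features(Nn, idx)
        Sp, ap = slope_and_angle_features(Np, idx)
        feats.append(np.ravel(np.array(Sp) - np.array(Sn)))  # slope differences
        feats.append(np.array(ap) - np.array(an))            # angle differences
    return np.concatenate(feats)
```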

3.2 Texture Feature Representation

In this paper, gradient-level normalized cross correlation (GNCC) and Gabor wavelets are used to extract texture features from the face. The gradient-level matrix in the spatial domain handles local image regions well, and the Gabor operator is selected for its simplicity, intuitiveness, and computational efficiency.

A texture difference measure based on the relations between gradient vectors is robust to noise and illumination changes [18], and Ahmed and Hossain [19] showed that gradient-based ternary texture for facial expression recognition operates faster than gray-level texture. The gradient value is thus a good measure of how the gray level changes within a neighborhood and can be used to derive a local texture difference measure. Let \( p = (x, y) \) be a feature point, \( g_{n}(p) \) and \( g_{e}(p) \) the neutral and peak expression frames, \( g'_{n}(p) = (g_{n}^{x}(p), g_{n}^{y}(p)) \) the neutral frame gradient vector with \( g_{n}^{x}(p) = \Delta_{x} g_{n}(p) \) and \( g_{n}^{y}(p) = \Delta_{y} g_{n}(p) \), and \( g'_{e}(p) = (g_{e}^{x}(p), g_{e}^{y}(p)) \) the peak expression frame gradient vector with \( g_{e}^{x}(p) = \Delta_{x} g_{e}(p) \) and \( g_{e}^{y}(p) = \Delta_{y} g_{e}(p) \). The normalized cross correlation is then calculated to evaluate the gradient difference in a feature point's neighborhood between the normalized neutral and peak expression frames, defined as follows:

$$ NC = \frac{\sum\nolimits_{p \in M} \left( g'_{n}(p) - \overline{g'_{n}(p)} \right)\left( g'_{e}(p) - \overline{g'_{e}(p)} \right)}{\sqrt{\sum\nolimits_{p \in M} \left( g'_{n}(p) - \overline{g'_{n}(p)} \right)^{2} \sum\nolimits_{p \in M} \left( g'_{e}(p) - \overline{g'_{e}(p)} \right)^{2}}} $$
(9)

Where \( \overline{g'_{n}(p)} \) and \( \overline{g'_{e}(p)} \) are the averages of \( g'_{n}(p) \) and \( g'_{e}(p) \) over M, and M is the 10 × 10 neighborhood centered at point p. Texture features are calculated in the neighborhoods centered at the 21 feature points (see Fig. 3).

Fig. 3. Examples of the local texture neighborhood.
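The following sketch shows one way to compute the GNCC of Eq. (9): Sobel gradients approximate \( \Delta_{x} \) and \( \Delta_{y} \), the gradient magnitude stands in for \( g' \), and a 10 × 10 window is taken around each landmark. These choices (Sobel kernels, magnitude rather than vector components, and border handling) are assumptions, not details given in the paper.

```python
import cv2
import numpy as np

def gradient_magnitude(gray):
    """Per-pixel gradient magnitude (Sobel approximation of Delta_x, Delta_y)."""
    gx = cv2.Sobel(gray, cv2.CV_64F, 1, 0, ksize=3)
    gy = cv2.Sobel(gray, cv2.CV_64F, 0, 1, ksize=3)
    return np.hypot(gx, gy)

def gncc(neutral_gray, peak_gray, points, win=10):
    """Eq. (9) around every landmark: one correlation value per feature point."""
    gn, ge = gradient_magnitude(neutral_gray), gradient_magnitude(peak_gray)
    half = win // 2
    feats = []
    for x, y in points.astype(int):
        a = gn[y - half:y + half, x - half:x + half].ravel()
        b = ge[y - half:y + half, x - half:x + half].ravel()
        a, b = a - a.mean(), b - b.mean()
        denom = np.sqrt((a ** 2).sum() * (b ** 2).sum()) + 1e-12
        feats.append((a * b).sum() / denom)
    return np.array(feats)      # 21 GNCC values for the 21 landmarks
```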

In the frequency domain, the Gabor wavelet [20, 21] performs outstandingly at capturing multi-scale, multi-orientation local information. The Gabor function is computed over the image as follows:

$$ \varphi_{u,v}\left( z \right) = \frac{\left\| k_{u,v} \right\|^{2}}{\sigma^{2}} \, {\text{e}}^{ - \left\| k_{u,v} \right\|^{2} \left\| z \right\|^{2} / \left( 2\sigma^{2} \right)} \left[ {\text{e}}^{i k_{u,v} z} - {\text{e}}^{ - \sigma^{2} / 2} \right] $$
(10)

z = (x, y) gives the pixel position in the spatial domain, and the frequency vector \( k_{u,v} \) is defined as follows:

$$ k_{u,v} = \frac{k_{max}}{f^{v}}\,{\text{e}}^{i\phi_{u}}, \quad v = 0, 1, \ldots, 4;\; u = 0, 1, \ldots, 7 $$
(11)

Where \( \phi_{u} = u\pi / u_{max} \), \( \phi_{u} \in [0, \pi) \), and u and v denote the orientation and scale factors of the Gabor filters, respectively. In our system, we adopt Gabor filters of five scales and eight orientations, with \( \sigma = 2\pi \) and \( k_{max} = \pi/2 \). For each pixel of the peak expression frame, 40 Gabor features are obtained from the 40 Gabor filters.
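A bank of the 40 kernels of Eqs. (10)-(11) could be built as in the sketch below; the kernel size and the scale spacing factor \( f = \sqrt{2} \) are assumptions, since the paper does not state them.

```python
import numpy as np

def gabor_kernel(u, v, size=31, sigma=2 * np.pi, k_max=np.pi / 2, f=np.sqrt(2)):
    """Complex Gabor kernel of Eq. (10) for orientation u (0..7) and scale v (0..4)."""
    phi = u * np.pi / 8                       # phi_u = u*pi / u_max with u_max = 8
    k = (k_max / f ** v) * np.exp(1j * phi)   # Eq. (11): frequency vector k_{u,v}
    kx, ky = k.real, k.imag
    half = size // 2
    ys, xs = np.mgrid[-half:half + 1, -half:half + 1]
    sq = xs ** 2 + ys ** 2
    k2 = kx ** 2 + ky ** 2
    envelope = (k2 / sigma ** 2) * np.exp(-k2 * sq / (2 * sigma ** 2))
    carrier = np.exp(1j * (kx * xs + ky * ys)) - np.exp(-sigma ** 2 / 2)  # DC-free term
    return envelope * carrier

# 40 filters: 5 scales x 8 orientations.
bank = [gabor_kernel(u, v) for v in range(5) for u in range(8)]
```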

In practice, inhomogeneous sampling is applied so that distinctive expression information, located mainly around the eyes, mouth, and nose, is extracted. Considering the high correlation between adjacent pixels and the fact that the Gabor filter is not sensitive to the exact position of the gray value, we extract Gabor features at fixed geometric positions derived from the facial landmarks [22], as shown in Fig. 4; the Gabor features of the 8 × 8 sampling points form a feature vector of dimension 8 × 8 × 40 = 2560. Before fusing the Gabor features with the other features, we reduce their dimensionality with PCA [23, 24] to remove redundant information. PCA is a useful technique for reducing the dimensionality of a feature vector and has been successfully used in face analysis.

Fig. 4. The distribution of the sampling points on the face.
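As an illustration of the sampling and reduction step, the sketch below evaluates the 40 filter responses at the 64 sampling points to form the 2560-dimensional vector and compresses it with PCA; the convolution routine, the source of the sampling points, and the number of retained components are our assumptions rather than values from the paper.

```python
import numpy as np
from scipy.signal import fftconvolve
from sklearn.decomposition import PCA

def gabor_features(gray, sample_points, bank):
    """Magnitude responses of all 40 filters at the 64 fixed sampling points."""
    responses = [np.abs(fftconvolve(gray, k, mode="same")) for k in bank]
    feats = []
    for x, y in sample_points.astype(int):
        feats.extend(r[y, x] for r in responses)
    return np.array(feats)          # 64 points x 40 filters = 2560 values

# Fit PCA on the training Gabor vectors only, then reuse it for test images.
# n_components=100 is illustrative, not a value taken from the paper.
pca = PCA(n_components=100)
# X_train_gabor = np.vstack([gabor_features(img, pts, bank) for img, pts in train_set])
# reduced = pca.fit_transform(X_train_gabor)
```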

In the classification phase, we combine the advantages of geometric and texture features to attain better FER performance. As described in this section, in a facial expression image sequence we use the neutral and peak expression frames for the DASM and GNCC features, and only the peak expression frame for the Gabor features. Denoting DASM as f1, GNCC as f2, and Gabor as f3, and in order to remove the unfavorable effect of unequal dimensions, f1, f2, and f3 are all normalized to \( V_{1}, V_{2}, V_{3} \) given by:

$$ V_{1} = \frac{f_{1}}{\left\| f_{1} \right\|}, \quad V_{2} = \frac{f_{2}}{\left\| f_{2} \right\|}, \quad V_{3} = \frac{f_{3}}{\left\| f_{3} \right\|} $$
(12)

The fusion feature F can be defined as:

$$ F = \frac{\left[ V_{1} \; V_{2} \; V_{3} \right]}{\left\| \left[ V_{1} \; V_{2} \; V_{3} \right] \right\|} $$
(13)

Where \( \left\| \cdot \right\| \) is the norm operator. This simple concatenation of the geometric and textural features is used for classification.
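Eqs. (12)-(13) amount to per-block L2 normalization followed by a normalized concatenation, as in this minimal sketch (the small epsilon guarding against zero norms is ours, not part of the paper):

```python
import numpy as np

def fuse_features(f1, f2, f3):
    """Eqs. (12)-(13): per-block L2 normalization, then normalized concatenation.
    f1: DASM, f2: GNCC, f3: PCA-reduced Gabor feature vectors (1-D arrays)."""
    blocks = [np.asarray(f) / (np.linalg.norm(f) + 1e-12) for f in (f1, f2, f3)]  # Eq. (12)
    fused = np.concatenate(blocks)
    return fused / (np.linalg.norm(fused) + 1e-12)                               # Eq. (13)
```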

4 Experiment

4.1 Dataset

The extended Cohn-Kanade (CK+) database [25] has been used in all our experiments. The database consists of image sequences from 123 subjects displaying 7 expressions: happy, sad, surprise, fear, anger, disgust, and contempt. Each image sequence starts at a neutral face and ends at the apex, i.e., the peak of the prototypical facial expression, with varying duration. Since our work focuses only on the 6 basic expressions (happy, sad, surprise, fear, anger, and disgust), the image sequences labeled with the 'contempt' expression have been left out. The total number of labeled image sequences considered in our work is therefore 309. Table 1 presents detailed statistics for the portion of the dataset that is used.

Table 1. Overview of the dataset.

4.2 Experiment Results

Support Vector Machines (SVMs) [26] are primarily binary classifiers that find the best separating hyperplane by maximizing the margin to the support vectors, i.e., the data points that lie closest to the decision boundary. LIBSVM [27] has been used in this work. To recognize the facial expressions, a one-versus-all multi-class SVM is used, and all results presented in this section are 5-fold cross validated: the dataset is randomly divided into 5 groups containing roughly equal numbers of subjects and of each prototypical expression.
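A sketch of this protocol using scikit-learn's LIBSVM-backed SVC is given below; the RBF kernel, the C value, and the grouping of folds by subject are assumptions about details the paper leaves open.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.multiclass import OneVsRestClassifier
from sklearn.model_selection import GroupKFold

def cross_validate(X, y, subject_ids, n_splits=5):
    """5-fold cross validation with all sequences of a subject kept in one fold.

    X: (n_samples, n_features) fused feature matrix, y: expression labels,
    subject_ids: subject identifier per sample (keeps folds subject-independent).
    """
    accs = []
    for train_idx, test_idx in GroupKFold(n_splits=n_splits).split(X, y, groups=subject_ids):
        # One-versus-all multi-class SVM; SVC wraps LIBSVM internally.
        clf = OneVsRestClassifier(SVC(kernel="rbf", C=1.0, gamma="scale"))
        clf.fit(X[train_idx], y[train_idx])
        accs.append(clf.score(X[test_idx], y[test_idx]))
    return float(np.mean(accs))
```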

To demonstrate the efficiency of the concatenated feature, classification experiments are run on the individual geometric and texture-based features and on the concatenated feature. The results are shown in Table 2 and Fig. 5. The proposed method achieves an average recognition accuracy of 95.3%. Table 2 lists the results for every facial expression class; happy and surprise achieve the highest recognition rates, which is intuitive since these expressions produce many facial movements, mainly around the mouth, and are therefore relatively easy to recognize. Fear and sadness, however, are sometimes similar and attain lower classification accuracies owing to the smaller facial deformation and the smaller number of training samples. Figure 5 further shows that the concatenated feature performs better than each individual feature.

Table 2. Recognition rates (%) of different features.
Fig. 5. Comparison of recognition rates of different features.

To further evaluate the proposed method, we compare its performance with the state-of-the-art algorithms reviewed in Sect. 1 that also use the CK+ database. The results are shown in Table 3; the comparison with previously published approaches shows that our method outperforms them.

Table 3. A comparison of the proposed method with related methods.

5 Conclusion

In this paper, we proposed a framework that fuses geometric and appearance features of the difference between the neutral and peak expressive facial images to recognize facial expressions. The difference emphasizes the facial parts that change from the neutral to the expressive face and in that way suppresses the identity of the facial image. The feature fusion method fully exploits local geometric features and texture information to extract expression features. Based on the combination of the DASM, GNCC, and Gabor wavelet extraction methods, an SVM classifier is used to recognize six facial expressions from the CK+ database, namely happiness, sadness, anger, fear, disgust, and surprise. We thus obtain a classification system suited to the six basic emotions. Extensive experiments show that the proposed method achieves more reliable results than the DASM, GNCC, or Gabor descriptor alone and outperforms several other methods on the CK+ database, indicating that it has strong potential as an alternative approach for building a facial expression recognition system.

In the future, this work will be extended in three aspects. Firstly, more feature descriptors will be used to give more comprehensive facial representations. Secondly, the fusion strategy will be improved to increase the recognition rate. Finally, we expect to extend this framework to analyze people’s emotional state.