1 Introduction

Communication through facial expressions plays a significant role in social interactions. Over the past two decades, human facial expression recognition (FER) has emerged as an important research area. FER systems are central to applications such as human emotion analysis [11], psychology and cognitive sciences [35], and access control and surveillance [6]. In most of these applications, FER systems must be intelligent enough to understand and analyze the goals and behaviors of humans.

There are two categories of FER systems: posed FER systems [39–41] and spontaneous FER systems [1, 17, 31]. Posed FER systems recognize artificial expressions, that is, the expressions people produce when they are asked to do so [6]. Spontaneous expressions, on the other hand, are produced naturally and are observed on a daily basis, for example during conversation or while watching movies [6]. This work falls under the category of posed FER systems.

Usually, an FER system consists of four basic modules: preprocessing, feature extraction, feature selection, and recognition. The preprocessing module improves image quality by diminishing illumination noise and eliminating unnecessary details from the background. The feature extraction module obtains distinguishable features for each expression and quantizes them as discrete symbols. The feature selection module selects a subset of relevant features from the large number of features extracted from the input data. Finally, in the recognition module, a classifier is first trained on the training data and then used to generate labels for the expressions in the incoming video data.

Many well-known methods, such as Histogram Equalization (HE) [33], Weighted Vector Directional Filters (WVDF) [18], and the Wiener filter [21], have been employed for preprocessing. Similarly, a number of techniques have been developed for feature extraction. Among these, the curvelet transform is the most robust and accurate; its accuracy and robustness have been demonstrated in [42], and it is highly efficient at coding image edge information [8]. Therefore, the curvelet transform [42] was chosen for feature extraction in this work. For further study, please refer to [42].

Likewise, for the recognition module, many well-known classifiers have been studied for accurate expression classification. For instance, artificial neural networks (ANNs) were employed by [12], support vector machines (SVMs) by [22, 38], Gaussian mixture models (GMMs) by [37], and hidden Markov models (HMMs) by [29]. Among these classifiers, the HMM is the most frequently employed and most thoroughly tested technique for sequential data [43].

In pattern recognition, identification of the most discriminative features is an important step [9], since the initial feature set typically contains relevant as well as irrelevant features [13, 16]. Feeding a large set of features into a recognition model not only increases the computational burden but also causes the problem commonly known as the curse of dimensionality. Therefore, selecting only the relevant features speeds up the learning process and alleviates the effect of the curse of dimensionality. Furthermore, feature selection facilitates data visualization and understanding [19]. Regarding feature selection for FER, a number of techniques have been investigated; the most commonly used is mutual information based feature selection, although it still has some limitations [2, 10, 23, 30]. For instance, given a dataset with N features \(X_{1}, X_{2}, {\ldots }, X_{N}\) and a set of i − 1 selected indices \(S_{i-1} = \{s_{1}, s_{2}, {\ldots }, s_{i-1}\}\), the next feature \({X_{{s_{i}}}}\) is selected so that the redundancy \(RC\left ({{X_{{s_{i}}}}} \right ) = \sum \limits _{s \in {S_{i - 1}}} {I\left ({{X_{s}};{X_{{s_{i}}}}} \right )}\) is minimized and the relevance \(RL\left ({{X_{{s_{i}}}}} \right ) = I\left ({C;{X_{{s_{i}}}}} \right )\) is maximized. Because these two problems may not have a common solution, we seek a scalar factor β such that a feature \({X_{{s_{i}}}}\) maximizing \(RL\left ({{X_{{s_{i}}}}} \right ) - \beta \times RC\left ({{X_{{s_{i}}}}} \right )\) is a solution to both the minimization and the maximization. The existing solutions are summarized below:

  • MIFS and MIFS-U: β is selected manually through experiments;

  • mRMR: \(\beta = \frac {1}{{\left | {{S_{i - 1}}} \right |}}\);

  • NMIFS: \(\beta \left ({{X_{s}};{X_{{s_{i}}}}} \right ) = \frac {1}{{\left | {{S_{i - 1}}} \right |}} \times \frac {1}{{\min \left ({H\left ({{X_{s}}} \right ),H\left ({{X_{{s_{i}}}}} \right )} \right )}}\).

where MIFS stands for mutual information feature selection, MIFS-U for MIFS under the assumption of uniform information distribution, mRMR for max-relevance and min-redundancy, and NMIFS for normalized mutual information feature selection.
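To make the role of β concrete, the following minimal Python sketch (ours, for illustration only; the function name and toy values are not taken from the cited methods) scores a candidate feature under the β-weighted criterion:

```python
import numpy as np

def mi_selection_score(relevance, redundancies, beta):
    """Score of a candidate feature under the beta-weighted criterion:
    class relevance RL minus beta times the accumulated redundancy RC
    with the already-selected features."""
    return relevance - beta * np.sum(redundancies)

# e.g., an mRMR-style score with |S_{i-1}| = 3 features already selected:
# mi_selection_score(relevance=0.8, redundancies=[0.2, 0.1, 0.3], beta=1/3)
```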

Therefore, the objective of this paper is to propose a robust feature selection method that utilizes an information measurement to estimate the potential of the features. As for searching algorithms, since an exhaustive search over a large feature space is impractical, greedy forward selection and backward elimination are often used [2, 23, 30]. Here, we exploit greedy forward selection, wherein each feature is appended to the feature set based on its quality. Moreover, in this research, a detailed study of the curvelet transform in combination with the proposed feature selection method is performed in order to extract and select the most prominent features. The dimension of the feature space is further reduced by employing a well-known statistical method called linear discriminant analysis (LDA), and finally, an HMM is used to label the expressions. System validation is performed on four publicly available standard datasets of facial expressions, i.e., the Japanese Female Facial Expressions (JAFFE) dataset [26], the Yale B face dataset [14], the Cohn-Kanade dataset [20], and the Natural Visible and Infrared Facial Expression (USTC-NVIE) dataset [44].

Having discussed the related work above, the rest of the paper is organized as follows. Section 2 provides an overview of the proposed FER system (CNF-FER) with the integration of LDA and HMM. The experimental setup for the CNF-FER system is described in Section 3. Section 4 presents the experimental results, along with a discussion of each experiment and a comparison of our results with other well-known and state-of-the-art methods. Finally, Section 5 concludes the paper with some future directions.

2 Methodology

2.1 Feature extraction

As mentioned above, the curvelet transform [42] is used for feature extraction. This method extracts the most prominent features by preserving the line, curve, and edge information in each expression frame. It is a lightweight method well suited to scenarios in which object edge information is prominent, and it can also be used for image reconstruction in severely ill-posed problems.

The curvelet transform can be implemented in two ways: using the Unequally Spaced Fast Fourier Transform (USFFT), or through wrapping. We utilized curvelets through wrapping because it is faster than the USFFT method and reconstructs the edge details of an image with comparatively few coefficients. The coefficient matrices over angle and scale are given as:

$$ C\left( {j,l,k} \right) = \left\langle {f,{\varphi_{j,l,k}}} \right\rangle $$
(1)

where j represents the scale, l the angle, and k the position parameter, and the inner product projects f onto \(\varphi_{j,l,k}\), the basis function of the curvelet transform. The angle and scale are indicated in Fig. 1.

Fig. 1
figure 1

The illustration of curvelet transform in frequency domain (left) and in spatial domain (right) [7]

As described before, we employed the curvelet transform via wrapping, which proceeds in the following steps [7].

  • In the first step, the fast Fourier transform is employed to obtain the Fourier samples \(\hat f\left [ {{n_{1}},{n_{2}}} \right ]\), \(- \frac {n}{2} \le {n_{1}},{n_{2}} \le \frac {n}{2}\) (see Fig. 1).

  • In the second step, \(\hat f\left [ {{n_{1}},{n_{2}}} \right ]\) is interpolated (re-sampled) for each pair of scale j and angle l to obtain \(\hat f\left [ {{n_{1}},{n_{2}} - {n_{1}}\tan {\theta _{l}}} \right ]\).

  • In the third step, the interpolated function is multiplied by a discrete localizing window \({{\tilde U}_{j}}\left [ {{n_{1}},{n_{2}}} \right ]\): \({{\hat f}_{j,l}}\left [ {{n_{1}},{n_{2}}} \right ] = \hat f\left [ {{n_{1}},{n_{2}} - {n_{1}}\tan {\theta _{l}}} \right ]{{\tilde U}_{j}}\left [ {{n_{1}},{n_{2}}} \right ]\).

  • In the last step, the inverse fast Fourier transform is applied to each \({{\hat f}_{j,l}}\) to obtain the associated curvelet coefficients C(j, l, k),

where \(0 \le {n_{1}} \le {L_{1,j}}\) and \(0 \le {n_{2}} \le {L_{2,j}}\), and θ lies between \(-\frac {\pi }{4}\) and \(\frac {\pi }{4}\). For more detail on the curvelet transform, please refer to [7, 42].
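To ground the four steps above, the following toy Python sketch extracts a single (j, l) wedge. It is illustrative only: the Gaussian band-pass stands in for the proper discrete localizing window \({\tilde U}_{j}\), nearest-neighbour shifts replace true re-sampling, and a production system would instead use CurveLab's Meyer-type windows [7].

```python
import numpy as np

def curvelet_wedge_sketch(f, scale_band, theta):
    """Toy version of the four wrapping steps for one (j, l) wedge.
    f: square n x n frame; scale_band: (lo, hi) radial band in
    frequency samples; theta: the shearing angle theta_l."""
    n = f.shape[0]                                # assumes square frames
    fhat = np.fft.fftshift(np.fft.fft2(f))        # step 1: Fourier samples

    # step 2: shear row n1 by n1 * tan(theta_l); nearest-neighbour
    # shifts replace proper interpolation in this sketch
    rows = np.arange(n) - n // 2
    sheared = np.empty_like(fhat)
    for i, n1 in enumerate(rows):
        sheared[i] = np.roll(fhat[i], -int(round(n1 * np.tan(theta))))

    # step 3: multiply by a localizing window on the given radial band
    rr, cc = np.meshgrid(rows, rows, indexing="ij")
    lo, hi = scale_band
    radius = np.hypot(rr, cc)
    window = np.exp(-(((radius - (lo + hi) / 2) / (hi - lo)) ** 2))

    # step 4: inverse FFT yields the coefficients C(j, l, k)
    return np.fft.ifft2(np.fft.ifftshift(sheared * window))
```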

2.2 Normalized mutual information-based feature selection (NMIFS)

As mentioned before, a robust normalized mutual information feature selection technique is used for the feature selection module. This method is derived from the max-relevance and min-redundancy (mRMR) approach; however, we propose to normalize the mutual information used in the method so that neither the relevance nor the redundancy dominates. In our method, only the upper bound of the mutual information of random variables is considered. Since any continuous variable can be quantized into a discrete form, we assume two discrete random variables X and Y with known marginal and joint distributions. The mutual information of X and Y is then bounded as:

$$ I\left( {X;Y} \right) \le \min \left( {H\left( X \right),H\left( Y \right)} \right) $$
(2)

where I is the mutual information of the two random variables X and Y, and H is the entropy function, which can be bounded by employing Jensen’s inequality as

$$ H\left( X \right) \le {\log_{2}}\left( {\sum\limits_{x \in {{\Omega}_{X}}} {p\left( x \right)\frac{1}{{p\left( x \right)}}} } \right) $$
(3)
$$ H\left( X \right) \le {\log_{2}}\left( {\left| {{{\Omega}_{X}}} \right|} \right) $$
(4)

It is clear from (2) and (4) that

$$ I\left( {X;Y} \right) \le \min \left( {{{\log }_{2}}\left( {\left| {{{\Omega}_{X}}} \right|} \right),{{\log }_{2}}\left( {\left| {{{\Omega}_{Y}}} \right|} \right)} \right) $$
(5)

In this work, every feature is quantized using the same number of levels N, chosen to achieve the expected quantization error. Algorithm 1 illustrates the quantization procedure.

Algorithm 1
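Since Algorithm 1 is rendered as an image in the original, the following Python sketch is our hedged reconstruction of the procedure described below; the uniform binning and the use of mean absolute error are our assumptions.

```python
import numpy as np

def quantize_feature(x, xi=0.05):
    """Uniformly quantize one feature, growing the number of levels N
    until the mean absolute quantization error drops below xi.
    Assumes x is scaled to [0, 1] so that xi = 0.05 is meaningful."""
    x = np.asarray(x, dtype=float)
    N = 2
    while True:
        edges = np.linspace(x.min(), x.max(), N + 1)
        idx = np.clip(np.digitize(x, edges) - 1, 0, N - 1)
        centres = (edges[:-1] + edges[1:]) / 2   # bin representatives
        if np.mean(np.abs(x - centres[idx])) < xi:
            return idx, N                        # |Omega_X| = N symbols
        N += 1
```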

The number of quantization levels progressively increases until the quantization error becomes smaller than a predefined small constant ξ, the expected quantization error. We used ξ = 0.05 in our experiments because smaller values did not improve accuracy but increased the computational cost. It is clear from the algorithm that \(\left | {{{\Omega}_{X}}} \right | = N\) for every feature X, so

$$ I\left( {X;Y} \right) \le {\log_{2}}\left( N \right) $$
(6)

Here, log2(N) is the upper bound of the mutual information I(X;Y) and depends on neither X nor Y; therefore, we call it a feature-independent upper bound.

To eliminate the problem of unequal normalizing weights, we propose to use the feature-independent upper bound in (6) to normalize the mutual information instead of employing (2) as in [10]. Therefore, our normalized feature-feature mutual information is calculated by

$$ NI\left( {X;Y} \right) = \frac{{I\left( {X;Y} \right)}}{{{{\log }_{2}}\left( N \right)}} $$
(7)

It is clear that the normalized feature-feature mutual information always lies within [0, 1]. To achieve a balance between the relevance and the redundancy, the class-feature mutual information is likewise divided by \({\log _{2}}\left | {{{\Omega}_{C}}} \right |\) and is now defined as

$$ NI\left( {C;X} \right) = \frac{{I\left( {C;X} \right)}}{{{{\log }_{2}}\left| {{{\Omega}_{C}}} \right|}} $$
(8)

Combining (7) and (8), the potential of a feature is measured as

$$ {f^{1}}\left( {{X_{i}}} \right) = NI\left( {C;{X_{i}}} \right) - \frac{1}{{\left| {{S_{i - 1}}} \right|}}\sum\limits_{{X_{s}} \in {S_{i - 1}}} {NI\left( {{X_{s}};{X_{i}}} \right)} $$
(9)

where \(S_{i-1}\) is the set of already selected features. In addition, to validate the effect of the imbalance between the relevance and the redundancy that we pointed out above, the normalized class-feature mutual information is combined with the same feature-feature mutual information as in [10]. In this variant, the goodness of a feature is measured by

$$ {f^{2}}\left( {{X_{i}}} \right) = NI\left( {C;{X_{i}}} \right) - \frac{1}{{\left| {{S_{i - 1}}} \right|}}\sum\limits_{{X_{s}} \in {S_{i - 1}}} {\frac{{I\left( {{X_{s}};{X_{i}}} \right)}}{{\min \left( {H\left( {{X_{s}}} \right),H\left( {{X_{i}}} \right)} \right)}}} $$
(10)

The pseudo code in Algorithm 2 represents the selection process using the greedy forward searching strategy.

Algorithm 2
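Algorithm 2 likewise appears as an image in the original; the sketch below is our Python reconstruction of the greedy forward selection based on (7)–(9). Note that scikit-learn's mutual_info_score returns values in nats, so we normalize with natural logarithms, which leaves the ratios in (7) and (8) unchanged.

```python
import numpy as np
from sklearn.metrics import mutual_info_score

def nmifs_select(X, y, k, N):
    """Greedy forward selection with the feature-independent
    normalization of (7)-(9).  X: (samples, features) array whose
    columns are already quantized to N levels; y: class labels;
    k: number of features to select."""
    n_feat = X.shape[1]
    # normalized class-feature relevance, eq. (8)
    rel = np.array([mutual_info_score(y, X[:, i]) for i in range(n_feat)])
    rel /= np.log(len(np.unique(y)))
    selected, remaining = [], set(range(n_feat))
    for _ in range(k):
        def potential(i):
            # average normalized feature-feature redundancy, eq. (7)
            red = (np.mean([mutual_info_score(X[:, s], X[:, i])
                            for s in selected]) / np.log(N)
                   if selected else 0.0)
            return rel[i] - red          # potential f1, eq. (9)
        best = max(remaining, key=potential)
        selected.append(best)
        remaining.remove(best)
    return selected
```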

2.3 Dimension reduction

Some well-known methods for reducing the dimension of a feature space include kernel discriminant analysis (KDA) [28], generalized discriminant analysis (GDA) [3], and LDA [27]. Among these, LDA has been widely used in the FER domain.

2.3.1 Linear discriminant analysis

Linear discriminant analysis maximizes the ratio of between-class variance to within-class variance in any particular dataset, thereby guaranteeing maximal separability. LDA produces an optimal linear discriminant function that maps the input into the classification space, on which the class identity of each sample is decided. LDA easily handles the case in which the within-class frequencies are unequal. The within-class scatter \(S_W\) and between-class scatter \(S_B\) are computed using the following equations:

$$ {S_{B}} = \sum \limits_{i = 1}^{c} {V_{i}}\left( {{\overline {m}_{i}} - \,\overline{\overline m} } \right){\left( {{\overline {m}_{i}} - \,\overline{\overline m} } \right)^{T}} $$
(11)
$$ {S_{W}} = \sum \limits_{i = 1}^{c} \sum \limits_{{m_{k}} \in {C_{i}}} \left( {{m_{k}} - {\overline {m}_{i}}} \right){\left( {{m_{k}} - {\overline {m}_{i}}} \right)^{T}} $$
(12)

where \(V_i\) is the number of vectors in the ith class \(C_i\), and c is the number of classes, which in our case is the number of facial expressions. Also, \(\overline {\overline m}\) represents the mean of all the vectors, \({\overline {m}_i}\) is the mean of class \(C_i\), and \(m_k\) is a vector of a specific class. The optimal discriminant projection matrix \(D_{opt}\) is obtained by maximizing the ratio of the determinants of the between-class and within-class scatter matrices as

$$ {D_{opt}} = \arg\max\limits_{D} \frac{{\left| {{D^{T}}{S_{B}}D} \right|}}{{\left| {{D^{T}}{S_{W}}D} \right|}} = {\left[ {{d_{1}},{d_{2}}, {\ldots} ,{d_{t}}} \right]^{T}} $$
(13)

where \(D_{opt}\) is the set of discriminant vectors of \(S_B\) and \(S_W\) corresponding to the c − 1 largest generalized eigenvalues λ. The size of \(D_{opt}\) is t × r, where t ≤ r and r is the number of elements in a vector. Then,

$$ {S_{B}}{d_{i}} = {\lambda_{i}}{S_{W}}{d_{i}}, i = 1,2, ... , c - 1 $$
(14)

where the rank of \(S_B\) is c − 1 or less, and hence the upper bound on t is c − 1. Thus, LDA maximizes the between-class scatter of the data while minimizing the within-class scatter. For more details on LDA, please refer to [5].
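A compact Python sketch of (11)–(14) is given below; the small ridge added to \(S_W\) is our own safeguard against singularity and is not part of the formulation.

```python
import numpy as np
from scipy.linalg import eigh

def lda_projection(X, y):
    """Build S_B and S_W per (11)-(12) and keep the c - 1 leading
    generalized eigenvectors of (14) as the columns of D_opt."""
    classes = np.unique(y)
    mean_all = X.mean(axis=0)
    r = X.shape[1]
    S_B = np.zeros((r, r))
    S_W = np.zeros((r, r))
    for c in classes:
        Xc = X[y == c]
        d = (Xc.mean(axis=0) - mean_all)[:, None]
        S_B += len(Xc) * d @ d.T                              # eq. (11)
        S_W += (Xc - Xc.mean(axis=0)).T @ (Xc - Xc.mean(axis=0))  # eq. (12)
    # generalized eigenproblem S_B d = lambda S_W d, eq. (14)
    evals, evecs = eigh(S_B, S_W + 1e-6 * np.eye(r))
    order = np.argsort(evals)[::-1][: len(classes) - 1]
    return evecs[:, order]               # D_opt with t <= c - 1 columns

# projected features: X_lda = X @ lda_projection(X, y)
```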

2.4 Expressions labeling using hidden Markov model

As described before, the HMM is the most commonly used method for classifying sequential data (facial expressions); it provides a statistical model λ for a set of observation sequences, where the observations are frames in the FER domain. A typical HMM has a sequence of observations of length T (i.e., \(O = O_{1}, O_{2}, {\ldots }, O_{T}\)), a set of states (i.e., \(S = S_{1}, S_{2}, {\ldots }, S_{N}\), where N is the number of states in the model), and a state sequence over time denoted by Q (such that \(Q = q_{1}, q_{2}, {\ldots }, q_{T}\)). Each time a state j is entered, an observation is generated according to the multivariate Gaussian distribution \(b_{j}(O_{t})\) with the mean \(\mu_{j}\) and covariance matrix \(V_{j}\) associated with that state. Transition probabilities are also associated with the states, such that \(a_{ij}\) is the transition probability from state i to state j, and the initial probability of state j is \(\pi_{j}\). An HMM is thus defined by the parameter set \(\lambda = (A, B, \pi)\), where A is the state transition probability (\(A = \{a_{ij}\}\), \(a_{ij} = Prob(q_{t+1} = S_{j} | q_{t} = S_{i})\), \(1 \le i, j \le N\)), B is the observation probability (\(B = \{b_{j}(O_{t})\}\), \(b_{j} = Prob(O_{t} | q_{t} = S_{j})\), \(1 \le j \le N\)), and π is the initial state probability (\(\pi = \{\pi_{j}\}\), \(\pi_{j} = Prob(q_{1} = S_{j})\)). All the equations are based on the work of [32] and make use of the initial state probability distribution.

In the training step, for a given model λ, multiplying each transition probability by each output probability at each step t gives the joint likelihood of a state sequence Q and the corresponding observation sequence O. The likelihood P(O|λ) can be evaluated by summing over all possible state sequences:

$$ P\left( {O{\mathrm{|}}\lambda } \right) = \sum \limits_{Q} P(O,Q|\lambda ) $$
(15)

A simple procedure for finding the parameters λ that maximize the above equation, introduced in [4], relies on the forward and backward variables \(\alpha _{t}(j) = P(O_{1} {\ldots } O_{t}, q_{t} = j|\lambda )\) and \(\beta _{t}(j) = P(O_{t+1} {\ldots } O_{T}|q_{t} = j, \lambda )\), respectively, which are initialized inductively as follows:

$$ {\alpha_{1}}\left( j \right) = {\pi_{j}}{b_{j}}\left( {{O_{1}}} \right),1 \le j \le N $$
(16)
$$ {\beta_{T}}\left( j \right) = 1,1 \le j \le N $$
(17)

During testing, the appropriate HMM can then be determined by means of the likelihood of the observation sequence O under each trained λ, calculated as

$$ P\left( {O{\mathrm{|}}\lambda } \right) = \sum \limits_{i = 1}^{N} {\alpha_{T}}\left( i \right) $$
(18)

The model yielding the maximum likelihood for the observations indicates the recognized label. The following formula has been utilized to model the HMM (λ):

$$ \lambda = \left( O,\,Q,\,\pi \right) $$
(19)

where O is the sequence of observations (i.e., \(O_{1}, O_{2}, {\ldots }, O_{T}\)), Q denotes the state sequence (\(Q = q_{1}, q_{2}, {\ldots }, q_{T}\), with N states in the model), and π holds the initial state probabilities. The parameter values used to model the HMM (λ) in all experiments were 64, 4, and 4, respectively; these values were selected by performing multiple experiments. For more details on HMM, please refer to [36].
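The recognition step thus reduces to evaluating (18) for each trained model and picking the largest value. A minimal Python sketch of the forward pass, assuming the Gaussian emission densities \(b_{j}(O_{t})\) have already been evaluated into a T × N array, is:

```python
import numpy as np

def forward_likelihood(pi, A, B):
    """Forward algorithm for eqs. (15)-(18).
    pi: (N,) initial state probabilities; A: (N, N) transition matrix;
    B: (T, N) array with B[t, j] = b_j(O_t) for the observed sequence."""
    T, _ = B.shape
    alpha = pi * B[0]                  # initialization, eq. (16)
    for t in range(1, T):
        alpha = (alpha @ A) * B[t]     # induction over time
    return alpha.sum()                 # termination, eq. (18)

# recognized label = the expression whose trained model maximizes
# forward_likelihood(pi, A, B) for the incoming frame sequence
```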

3 Experimental setup

The CNF-FER was tested and validated on four publicly available standard datasets, namely the JAFFE, Yale B, Cohn-Kanade, and USTC-NVIE datasets. Six basic universal expressions, that is, happiness, anger, sadness, disgust, surprise, and fear, were used from these datasets. From each dataset, we selected only those expression frames that display the frontal view of the face, and each expression is composed of several sequences of expression frames. A 10-fold cross-validation scheme was applied, i.e., out of 10 subjects, the data from a single subject were reserved as validation data for testing the CNF-FER, whereas the data from the remaining 9 subjects were used as training data; this process was repeated 10 times (a sketch of this subject-wise scheme follows the dataset list below). Some expressions in the datasets were captured under different lighting conditions; therefore, histogram equalization was used to diminish the lighting effects. The details of each dataset are as follows:

  • JAFFE Dataset: The expressions in this dataset were posed by 10 different subjects (Japanese females). Each image has been rated on 6 expression adjectives by 60 Japanese subjects. Most of the expression frames were taken from the frontal view of the camera with tied hair in order to expose all the sensitive regions of the face. The whole dataset contains 213 facial frames covering seven expressions, including neutral; we therefore selected only the 195 expression frames of the six basic expressions performed by the ten Japanese female subjects. The original size of each facial frame is 256 × 256 pixels.

  • Yale B Face Dataset: This dataset contains a total of 5760 facial frames of 10 distinct subjects taken under a single light source, each subject seen under 576 viewing conditions (9 poses × 64 illumination conditions). For every subject, the ambient illumination was also captured for each pose.

  • Cohn-Kanade Dataset: In this facial expression dataset, 100 subjects (university students) performed the six basic expressions. The subjects ranged in age from 18 to 30 years, and most of them were female. We employed those expression frames for which the camera was fixed in front of the subjects. To utilize the six expressions from this dataset, we employed a total of 450 image sequences from the 100 subjects, each considered as one of the six expressions. The original size of each facial frame is 640 × 480 or 640 × 490 pixels with 8-bit precision for grayscale values. For recognition, twelve expression frames were taken from each expression sequence, resulting in a total of 5400 expression images.

  • USTC-NVIE Dataset: In this dataset, an infrared thermal camera and a visible camera were used to collect both spontaneous and posed expressions; we utilized only the posed expressions. There were 108 subjects (university students) ranging in age from 17 to 31 years. Some of them wore glasses, whereas others did not. They were asked to perform a series of expressions with illumination from three different directions. The size of each facial frame is 640 × 480 or 704 × 490 pixels. In total, 1027 expression frames were utilized from this dataset.
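For clarity, the subject-wise 10-fold scheme described above can be sketched as follows (variable and function names are ours, for illustration only):

```python
import numpy as np

def subject_folds(subject_ids):
    """Hold out each subject once: the held-out subject's frames are
    the validation data, the remaining subjects' frames the training data."""
    for held_out in np.unique(subject_ids):
        yield subject_ids != held_out, subject_ids == held_out

# usage: for train_mask, test_mask in subject_folds(ids): ...
```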

For a thorough validation, we performed four different sets of experiments in this study:

  • In the first experiment, we analyzed the performance of different combinations of previous statistical approaches on all datasets. The performance of the wavelet transform was likewise analyzed on all datasets.

  • In the second experiment, performance of the CNF-FER was analyzed.

  • In the third experiment, the effectiveness of the proposed feature selection method was analyzed.

  • Finally, in the last experiment, the weighted average recognition rate and time complexity of the CNF-FER were compared with those of some state-of-the-art methods.

4 Results and discussion

4.1 First experiment

In this experiment, four sub-experiments were performed based on combinations of different well-known techniques using all the datasets. Each experiment, along with its results, is described below:

  • Recognition Rates of PCA and LDA with HMM: In this experiment, PCA was used along with LDA and HMM (a sketch of this PCA+LDA pipeline is given after this list). PCA is an unsupervised technique, i.e., it does not require any prior information about the classes. For any defined level of compression, PCA is an optimal dimension-reduction scheme that minimizes the mean squared error between the original images and their reconstructions. Analysis showed that, as the number of PC features increased, the recognition rate rose to a certain value and then saturated, as shown in Fig. 2. Therefore, for optimal results, the first 50 eigenvectors with their corresponding eigenvalues were employed for each dataset.

    Once the features were extracted, LDA was applied, and the LDA features were then fed to an HMM for recognition. The idea was to maximize the between-class scatter of the data while minimizing the within-class variance before recognition. The recognition rates for this experiment on the four datasets are shown in Fig. 3.

    It is clear from Fig. 3 that PCA and LDA with HMM did not achieve a high recognition rate. The reason could be that PCA focuses only on global features and does not capture local features.

  • Recognition Rates of ICA and LDA with HMM: In this experiment, we used ICA for feature extraction (to capture the local features) with LDA in order to examine any improvement in the feature space. The results for this experiment are summarized in Fig. 4.

    It can be seen from Fig. 4 that using the global or local features separately with LDA does not guarantee a better recognition rate. ICA is slow to train when the dimension of the data is large, and it is weak at managing large inputs; e.g., when a huge number of expression frames are given as input, ICA sometimes fails to retrieve the desired features. Therefore, combining PCA and ICA with LDA should capture the most informative features.

  • Recognition Rates of PCA+ICA+LDA and HMM: In this experiment, we utilized both PCA and ICA to extract the global and local features with LDA and HMM. The results for this experiment are summarized in Fig. 5.

    It can be seen from Fig. 5 that the existing statistical methods, in their various combinations, did not achieve a better recognition rate, owing to their individual limitations.

  • Recognition Rates of wavelet transform with HMM: Finally, we analyzed the accuracy of the wavelet transform as a feature extraction technique with LDA and HMM for expression recognition. The corresponding experimental results are indicated in Fig. 6.

    It is clear from Fig. 6 that the wavelet transform does not achieve a high recognition rate either. Therefore, we proposed the CNF-FER in order to attain better performance than the existing methods on all datasets.
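As referenced in the first bullet above, the PCA+LDA baseline can be sketched with scikit-learn as follows; this is a simplified stand-in for the pipeline used in the experiments, and the HMM labeling stage is omitted:

```python
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

def pca_lda_features(train_X, train_y, test_X, n_pcs=50):
    """Project frames onto the leading 50 eigenvectors (cf. Fig. 2),
    then apply LDA; the resulting features would be fed to the HMM."""
    pca = PCA(n_components=n_pcs).fit(train_X)
    lda = LinearDiscriminantAnalysis().fit(pca.transform(train_X), train_y)
    return lda.transform(pca.transform(test_X))
```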

Fig. 2
figure 2

Top 100 eigenvalues along with their eigenvectors using the Cohn-Kanade dataset of facial expressions

Fig. 3
figure 3

Recognition rate of PCA and LDA with HMM using four datasets of facial expressions

Fig. 4
figure 4

Recognition rate of ICA and LDA with HMM using four datasets of facial expressions

Fig. 5
figure 5

Recognition rate of PCA+ICA and LDA with HMM using four datasets of facial expressions

Fig. 6
figure 6

Recognition rate of wavelet transform with HMM using four datasets of facial expressions

4.2 Second experiment

In this experiment, two sub-experiments were performed to validate the CNF-FER using the four datasets. The same validation scheme was applied as mentioned in Section 3, and the results are described below.

  • Recognition Rates of CNF-FER: The CNF-FER was evaluated for each dataset separately under the exact settings as mentioned in Section 3. The 3D feature plots of the CNF-FER, for the six expressions, after applying the LDA on four datasets, are shown in Figs. 7, 8, 9, and 10, and the detailed results are provided in Table 1.

    It is clear from Table 1 that the CNF-FER consistently achieved a high recognition rate when applied to these datasets separately, i.e., 99.00 % on JAFFE, 99.17 % on the Yale B face dataset, 99.17 % on Cohn-Kanade, and 99.33 % on USTC-NVIE. This means that, unlike existing statistical methods such as PCA, ICA, LDA, and the wavelet transform, the CNF-FER is more robust, i.e., it provided a high recognition rate not just on one but on all four datasets. This is due to the proposed feature selection method, which utilizes the information measurement to estimate the potential of the features. Furthermore, greedy forward selection was used, wherein each feature is appended to the feature set based on its quality, which confirms that the proposed feature selection method is more robust than others with respect to classification accuracy.

  • Recognition Rates of CNF-FER across Datasets: A set of experiments was performed to show the performance of the CNF-FER system across datasets. For these experiments, n-fold cross-validation based on datasets was employed (in our case n = 4): out of the four datasets, one dataset was used as training data, whereas the remaining three were used as testing data, and this process was repeated four times so that each dataset was used once for training. The weighted average recognition results of the CNF-FER system on the four datasets are shown in Table 2.

    It can be seen from Table 2 that the CNF-FER not only achieves a high recognition rate on individual datasets, but also performs well when the system is trained on one dataset and tested on the others.

Fig. 7
figure 7

3D feature plots of the CNF-FER for recognizing the expressions (on the JAFFE dataset). It can be seen that the CNF-FER clearly separated the expression classes

Fig. 8
figure 8

3D feature plots of the CNF-FER for recognizing the expressions (on the Yale B face dataset). It can be seen that the CNF-FER clearly separated the expression classes

Fig. 9
figure 9

3D feature plots of the CNF-FER for recognizing the expressions (on the Cohn-Kanade dataset). It can be seen that the CNF-FER clearly separated the expression classes

Fig. 10
figure 10

3D feature plots of the CNF-FER for recognizing the expressions (on the USTC-NVIE dataset). It can be seen that the CNF-FER clearly separated the expression classes

Table 1 Confusion matrix of CNF-FER using: (A) JAFFE dataset, (B) Yale B dataset, (C) Cohn-Kanade dataset, and (D) USTC-NVIE dataset of facial expressions (Unit: %)
Table 2 Confusion matrix of CNF-FER system that is: (A) trained on JAFFE dataset and tested on Yale B, Cohn-Kanade, and USTC-NVIE datasets, (B) trained on Yale B face dataset and tested on JAFFE, Cohn-Kanade, and USTC-NVIE datasets, (C) trained on Cohn-Kanade dataset and tested on JAFFE, Yale B, and USTC-NVIE datasets, and (D) trained on USTC-NVIE dataset and tested on JAFFE, Yale B, and Cohn-Kanade datasets (Unit: %)

4.3 Third experiment

To assess the effectiveness of the proposed feature selection method, a series of sub-experiments was performed using all four datasets; the results are presented in Table 3.

Table 3 Confusion matrix of the CNF-FER using: (A) JAFFE dataset, (B) Yale B dataset, (C) Cohn-Kanade dataset, and (D) USTC-NVIE dataset, while removing the proposed feature selection method (Unit: %)

It can be noted from Table 3 that the proposed feature selection method played a major role in the high recognition rate of the CNF-FER. When we removed the proposed feature selection method, the recognition rate decreased significantly. These results confirm the problem of high similarity among the features of different expressions. The experimental results validated our analysis and provided clear evidence that our proposed feature selection method selects a better feature set in terms of classification accuracy.

4.4 Fourth experiment

In the fourth experiment, the CNF-FER was compared against the following state-of-the-art FER methods: [15, 24, 25, 34, 45, 46]. The JAFFE, Yale B, Cohn-Kanade, and USTC-NVIE datasets of facial expressions were used in this experiment. For a fair comparison, we borrowed the implementations of some of these methods, whereas for the others, their published results are reported. The comparison was performed under the same guidelines provided in the respective manuscripts. A 10-fold cross-validation scheme was utilized for each dataset (as described in Section 3). The weighted average recognition rates of the existing methods and of the CNF-FER on the four datasets are presented in Table 4.

Table 4 Comparison results of the proposed approaches with recent feature extraction methods (Unit: %)

It is clear from Table 4 that the CNF-FER outperformed the existing state-of-the-art methods.

Moreover, in order to analyze the computational cost of the CNF-FER, we selected the most efficient method from the experiments of Table 4 (that is, [34]). The FER system of [34] took 1533 ms, 1498 ms, 2292 ms, and 1701 ms to recognize an expression frame from the JAFFE, Yale B, Cohn-Kanade, and USTC-NVIE datasets, respectively, whereas the CNF-FER took 1290 ms, 1034 ms, 1908 ms, and 1865 ms on the same datasets. Thus, the CNF-FER not only achieved a high recognition rate but is also, overall, less expensive in terms of computational cost.

Moreover, the FER system of [34] has a complexity of O(TQ²M), where T is the length of an input sequence (i.e., the number of expression frames), Q is the number of states, and M is the number of mixtures. In contrast, the CNF-FER needs at most O(TM) to compute gradients. These experiments were performed in Matlab on an Intel® Pentium® Dual-Core™ (2.5 GHz) machine with 3 GB of RAM.

5 Conclusion

A typical FER system consists of four modules: preprocessing, feature extraction, feature selection, and recognition. A great deal of research has been done on the preprocessing, feature extraction, and recognition modules; however, feature selection is still an active research area.

In this paper, we have reviewed some recently developed algorithms for mutual information-based feature selection. We discussed the limitations of each method and, based on our observations, proposed our own method derived from NMIFS with two improvements: the normalization of the mutual information and the feature-independent normalizing weights. For feature extraction, we utilized the existing curvelet transform, which extracts prominent features by preserving the line, curve, and edge information in each expression frame. The dimensions of the feature space were reduced by employing LDA. Finally, an HMM was used as the recognizer.

The CNF-FER was tested and validated using four publicly available standard datasets. For each dataset, 10-fold cross-validation scheme was employed. The CNF-FER achieved a weighted average recognition accuracy of 99 %, which is a significant improvement over the recognition rates of existing FER systems. Moreover, from computational perspective, the CNF-FER is less expensive than existing methods.

The performance of the CNF-FER in real-time settings is yet to be investigated, because several factors in real environments might degrade its performance, such as background clutter, image rotation and blur, and varying face angles. Further study is therefore needed to tackle these issues while maintaining the same high recognition rate in real environments.