1 Introduction

Communication through facial expressions plays a significant role in social interactions. Over the past two decades, human facial expression recognition (FER) has emerged as an important research area. FER systems are central to applications such as human emotion analysis [11], psychology and cognitive sciences [35], and access control and surveillance [6]. In most of these applications, FER systems must be intelligent enough to understand and analyze the goals and behaviors of humans.

There are two categories of FER systems: posed FER systems [39–41] and spontaneous FER systems [1, 17, 31]. Posed FER systems recognize artificial expressions, that is, the expressions people produce when they are asked to do so [6]. Spontaneous expressions, on the other hand, are produced naturally and are observed on a daily basis, for example during conversation or while watching movies [6]. This work falls under the category of posed FER systems.

Usually, an FER system consists of four basic modules: preprocessing, feature extraction, feature selection, and recognition. The preprocessing module improves image quality by diminishing illumination noise and eliminating unnecessary details from the background. The feature extraction module obtains distinguishable features for each expression and quantizes them as discrete symbols. The feature selection module selects a subset of relevant features from the large number of features extracted from the input data. Finally, in the recognition module, a classifier is first trained on the training data and then used to generate labels for the expressions in the incoming video data.

Many well-known methods, such as Histogram Equalization (HE) [33], Weighted Vector Directional Filters (WVDF) [18], and the Wiener filter [21], have been employed for preprocessing. Similarly, a number of techniques have been developed for feature extraction. Among these, the curvelet transform is the most robust and accurate; its accuracy and robustness have been demonstrated in [42], and it is highly efficient at coding image edge information [8]. Therefore, the curvelet transform [42] was chosen for feature extraction in this work. For further study, please refer to [42].

Likewise, for the recognition module, many well-known classifiers have been studied for accurate expression classification. For instance, artificial neural networks (ANNs) were employed by [12], support vector machines (SVMs) by [22, 38], Gaussian mixture models (GMMs) by [37], and hidden Markov models (HMMs) by [29]. Among these classifiers, the HMM is the most frequently employed and most thoroughly tested technique for sequential data [43].

In pattern recognition, identification of the most discriminative features is an important step [9], since the initial feature set typically contains relevant as well as irrelevant features [13, 16]. Feeding a large set of features into a recognition model not only increases the computational burden but also causes the problem commonly known as the curse of dimensionality. Therefore, selecting only the relevant features speeds up the learning process and alleviates the effect of the curse of dimensionality. Furthermore, feature selection facilitates data visualization and understanding [19]. Regarding feature selection for FER, a number of techniques have been investigated; the most commonly used is mutual information based feature selection, although it still has some limitations [2, 10, 23, 30]. For instance, given a dataset with N features \(X_{1}, X_{2}, {\ldots }, X_{N}\) and a set of i − 1 selected indices \(S_{i-1} = \{s_{1}, s_{2}, {\ldots }, s_{i-1}\}\), the next feature \({X_{{s_{i}}}}\) is selected so that the redundancy \(RC\left ({{X_{{s_{i}}}}} \right ) = \sum \limits _{s \in {S_{i - 1}}} {I\left ({{X_{s}};{X_{{s_{i}}}}} \right )}\) is minimized and the relevance \(RL\left ({{X_{{s_{i}}}}} \right ) = I\left ({C;{X_{{s_{i}}}}} \right )\) is maximized. Because these two problems may not have a common solution, we seek a scalar factor β such that a feature \({X_{{s_{i}}}}\) maximizing \(RL\left ({{X_{{s_{i}}}}} \right ) - \beta \times RC\left ({{X_{{s_{i}}}}} \right )\) is a solution to both the minimization and the maximization. The existing solutions are summarized below:

  • MIFS and MIFS-U: β is selected manually through experiments;

  • mRMR: \(\beta = \frac {1}{{\left | {{S_{i - 1}}} \right |}}\);

  • NMIFS: \(\beta \left ({{X_{s}};{X_{{s_{i}}}}} \right ) = \frac {1}{{\left | {{S_{i - 1}}} \right |}} \times \frac {1}{{\min \left ({H\left ({{X_{s}}} \right ),H\left ({{X_{{s_{i}}}}} \right )} \right )}}\).

where MIFS stands for mutual information feature selection, MIFS-U for MIFS under the assumption of uniform information distribution, mRMR for max-relevance and min-redundancy, and NMIFS for normalized mutual information feature selection.
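To make the role of β concrete, the following minimal Python sketch (ours, for illustration only; the function name and toy values are not taken from the cited methods) scores a candidate feature under the β-weighted criterion:

```python
import numpy as np

def mi_selection_score(relevance, redundancies, beta):
    """Score of a candidate feature under the beta-weighted criterion:
    class relevance RL minus beta times the accumulated redundancy RC
    with the already-selected features."""
    return relevance - beta * np.sum(redundancies)

# e.g., an mRMR-style score with |S_{i-1}| = 3 features already selected:
# mi_selection_score(relevance=0.8, redundancies=[0.2, 0.1, 0.3], beta=1/3)
```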

Therefore, the objective of this paper is to propose a robust feature selection method that utilizes an information measurement to estimate the potential of the features. As for searching algorithms, since an exhaustive search over a large feature space is impractical, greedy forward selection and backward elimination are often used [2, 23, 30]. Here, we exploit greedy forward selection, wherein each feature is appended to the feature set based on its quality. Moreover, in this research, a detailed study of the curvelet transform in combination with the proposed feature selection method is performed in order to extract and select the most prominent features. The dimension of the feature space is further reduced by employing a well-known statistical method called linear discriminant analysis (LDA), and finally, an HMM is used to label the expressions. System validation is performed on four publicly available standard datasets of facial expressions, i.e., the Japanese Female Facial Expressions (JAFFE) dataset [26], the Yale B face dataset [14], the Cohn-Kanade dataset [20], and the Natural Visible and Infrared Facial Expression (USTC-NVIE) dataset [44].

Having discussed the related work above, the rest of the paper is organized as follows. Section 2 provides an overview of the proposed FER system (CNF-FER) with the integration of LDA and HMM. The experimental setup for the CNF-FER system is described in Section 3. Section 4 presents the experimental results, along with a discussion of each experiment and a comparison of our results with other well-known and state-of-the-art methods. Finally, Section 5 concludes the paper with some future directions.

2 Methodology

2.1 Feature extraction

As mentioned above, the curvelet transform [42] is used for feature extraction. This method extracts the most prominent features by preserving the line, curve, and edge information in each expression frame. It is a lightweight method well suited to scenarios in which object edge information is prominent, and it can also be used for image reconstruction in severely ill-posed problems.

The curvelet transform can be implemented in two ways: using the Unequally Spaced Fast Fourier Transform (USFFT), or through wrapping. We utilized curvelets through wrapping because it is faster than the USFFT method and reconstructs the edge details of an image with comparatively few coefficients. The coefficient matrices over angle and scale are given as:

$$ C\left( {j,l,k} \right) = \left\langle {f,{\varphi_{j,l,k}}} \right\rangle $$
(1)

where j represents the scale, l the angle, and k the position parameter, and the inner product projects f onto \(\varphi_{j,l,k}\), the basis function of the curvelet transform. The angle and scale are indicated in Fig. 1.

Fig. 1
figure 1

The illustration of curvelet transform in frequency domain (left) and in spatial domain (right) [7]

As described before, we employed the curvelet transform via wrapping, which proceeds in the following steps [7].

  • In the first step, the fast Fourier transform is employed to obtain the Fourier samples \(\hat f\left [ {{n_{1}},{n_{2}}} \right ]\), \(- \frac {n}{2} \le {n_{1}},{n_{2}} \le \frac {n}{2}\) (see Fig. 1).

  • In the second step, \(\hat f\left [ {{n_{1}},{n_{2}}} \right ]\) is interpolated (re-sampled) for each pair of scale j and angle l to obtain \(\hat f\left [ {{n_{1}},{n_{2}} - {n_{1}}\tan {\theta _{l}}} \right ]\).

  • In the third step, the interpolated function is multiplied by a discrete localizing window \({{\tilde U}_{j}}\left [ {{n_{1}},{n_{2}}} \right ]\): \({{\hat f}_{j,l}}\left [ {{n_{1}},{n_{2}}} \right ] = \hat f\left [ {{n_{1}},{n_{2}} - {n_{1}}\tan {\theta _{l}}} \right ]{{\tilde U}_{j}}\left [ {{n_{1}},{n_{2}}} \right ]\).

  • In the last step, the inverse fast Fourier transform is applied to each \({{\hat f}_{j,l}}\) to obtain the associated curvelet coefficients C(j, l, k),

where \(0 \le {n_{1}} \le {L_{1,j}}\) and \(0 \le {n_{2}} \le {L_{2,j}}\), and θ lies between \(-\frac {\pi }{4}\) and \(\frac {\pi }{4}\). For more detail on the curvelet transform, please refer to [7, 42].
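To ground the four steps above, the following toy Python sketch extracts a single (j, l) wedge. It is illustrative only: the Gaussian band-pass stands in for the proper discrete localizing window \({\tilde U}_{j}\), nearest-neighbour shifts replace true re-sampling, and a production system would instead use CurveLab's Meyer-type windows [7].

```python
import numpy as np

def curvelet_wedge_sketch(f, scale_band, theta):
    """Toy version of the four wrapping steps for one (j, l) wedge.
    f: square n x n frame; scale_band: (lo, hi) radial band in
    frequency samples; theta: the shearing angle theta_l."""
    n = f.shape[0]                                # assumes square frames
    fhat = np.fft.fftshift(np.fft.fft2(f))        # step 1: Fourier samples

    # step 2: shear row n1 by n1 * tan(theta_l); nearest-neighbour
    # shifts replace proper interpolation in this sketch
    rows = np.arange(n) - n // 2
    sheared = np.empty_like(fhat)
    for i, n1 in enumerate(rows):
        sheared[i] = np.roll(fhat[i], -int(round(n1 * np.tan(theta))))

    # step 3: multiply by a localizing window on the given radial band
    rr, cc = np.meshgrid(rows, rows, indexing="ij")
    lo, hi = scale_band
    radius = np.hypot(rr, cc)
    window = np.exp(-(((radius - (lo + hi) / 2) / (hi - lo)) ** 2))

    # step 4: inverse FFT yields the coefficients C(j, l, k)
    return np.fft.ifft2(np.fft.ifftshift(sheared * window))
```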

2.2 Normalized mutual information-based feature selection (NMIFS)

As mentioned before, a robust normalized mutual information feature selection technique is used for the feature selection module. This method is derived from the max-relevance and min-redundancy (mRMR) approach; however, we propose to normalize the mutual information used in the method so that neither the relevance nor the redundancy dominates. In our method, only the upper bound of the mutual information of random variables is considered. Since any continuous variable can be quantized into a discrete form, we assume two discrete random variables X and Y with known marginal and joint distributions. The mutual information of X and Y is then bounded as:

$$ I\left( {X;Y} \right) \le \min \left( {H\left( X \right),H\left( Y \right)} \right) $$
(2)

where I is the mutual information of the two random variables X and Y, and H is the entropy function, which can be bounded by employing Jensen’s inequality as

$$ H\left( X \right) \le {\log_{2}}\left( {\sum\limits_{x \in {{\Omega}_{X}}} {p\left( x \right)\frac{1}{{p\left( x \right)}}} } \right) $$
(3)
$$ H\left( X \right) \le {\log_{2}}\left( {\left| {{{\Omega}_{X}}} \right|} \right) $$
(4)

It is clear from (2) and (4) that

$$ I\left( {X;Y} \right) \le \min \left( {{{\log }_{2}}\left( {\left| {{{\Omega}_{X}}} \right|} \right),{{\log }_{2}}\left( {\left| {{{\Omega}_{Y}}} \right|} \right)} \right) $$
(5)

In this work, every feature is quantized using the same number of levels N, chosen to achieve the expected quantization error. Algorithm 1 illustrates the quantization procedure.

Algorithm 1
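Since Algorithm 1 is rendered as an image in the original, the following Python sketch is our hedged reconstruction of the procedure described below; the uniform binning and the use of mean absolute error are our assumptions.

```python
import numpy as np

def quantize_feature(x, xi=0.05):
    """Uniformly quantize one feature, growing the number of levels N
    until the mean absolute quantization error drops below xi.
    Assumes x is scaled to [0, 1] so that xi = 0.05 is meaningful."""
    x = np.asarray(x, dtype=float)
    N = 2
    while True:
        edges = np.linspace(x.min(), x.max(), N + 1)
        idx = np.clip(np.digitize(x, edges) - 1, 0, N - 1)
        centres = (edges[:-1] + edges[1:]) / 2   # bin representatives
        if np.mean(np.abs(x - centres[idx])) < xi:
            return idx, N                        # |Omega_X| = N symbols
        N += 1
```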

The number of quantization levels progressively increases until the quantization error becomes smaller than a predefined small constant ξ, the expected quantization error. We used ξ = 0.05 in our experiments because smaller values did not improve accuracy but increased the computational cost. It is clear from the algorithm that \(\left | {{{\Omega}_{X}}} \right | = N\) for every feature X, so

$$ I\left( {X;Y} \right) \le {\log_{2}}\left( N \right) $$
(6)

Here, log2(N) is the upper bound of the mutual information I(X;Y) and depends on neither X nor Y; therefore, we call it a feature-independent upper bound.

To eliminate the problem of unequal normalizing weights, we propose to use the feature-independent upper bound in (6) to normalize the mutual information instead of employing (2) as in [10]. Therefore, our normalized feature-feature mutual information is calculated by

$$ NI\left( {X;Y} \right) = \frac{{I\left( {X;Y} \right)}}{{{{\log }_{2}}\left( N \right)}} $$
(7)

It is clear that the normalized feature-feature mutual information always lies within [0, 1]. To achieve a balance between the relevance and the redundancy, the class-feature mutual information is likewise divided by \({\log _{2}}\left | {{{\Omega}_{C}}} \right |\) and is now defined as

$$ NI\left( {C;X} \right) = \frac{{I\left( {C;X} \right)}}{{{{\log }_{2}}\left| {{{\Omega}_{C}}} \right|}} $$
(8)

Combining (7) and (8), the potential of a feature is measured as

$$ {f^{1}}\left( {{X_{i}}} \right) = NI\left( {C;{X_{i}}} \right) - \frac{1}{{\left| {{S_{i - 1}}} \right|}}\sum\limits_{{X_{s}} \in {S_{i - 1}}} {NI\left( {{X_{s}};{X_{i}}} \right)} $$
(9)

where \(S_{i-1}\) is the set of already selected features. In addition, to validate the effect of the imbalance between the relevance and the redundancy that we pointed out above, the normalized class-feature mutual information is combined with the same feature-feature mutual information as in [10]. In this variant, the goodness of a feature is measured by

$$ {f^{2}}\left( {{X_{i}}} \right) = NI\left( {C;{X_{i}}} \right) - \frac{1}{{\left| {{S_{i - 1}}} \right|}}\sum\limits_{{X_{s}} \in {S_{i - 1}}} {\frac{{I\left( {{X_{s}};{X_{i}}} \right)}}{{\min \left( {H\left( {{X_{s}}} \right),H\left( {{X_{i}}} \right)} \right)}}} $$
(10)

The pseudo code in Algorithm 2 represents the selection process using the greedy forward searching strategy.

Algorithm 2
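Algorithm 2 likewise appears as an image in the original; the sketch below is our Python reconstruction of the greedy forward selection based on (7)–(9). Note that scikit-learn's mutual_info_score returns values in nats, so we normalize with natural logarithms, which leaves the ratios in (7) and (8) unchanged.

```python
import numpy as np
from sklearn.metrics import mutual_info_score

def nmifs_select(X, y, k, N):
    """Greedy forward selection with the feature-independent
    normalization of (7)-(9).  X: (samples, features) array whose
    columns are already quantized to N levels; y: class labels;
    k: number of features to select."""
    n_feat = X.shape[1]
    # normalized class-feature relevance, eq. (8)
    rel = np.array([mutual_info_score(y, X[:, i]) for i in range(n_feat)])
    rel /= np.log(len(np.unique(y)))
    selected, remaining = [], set(range(n_feat))
    for _ in range(k):
        def potential(i):
            # average normalized feature-feature redundancy, eq. (7)
            red = (np.mean([mutual_info_score(X[:, s], X[:, i])
                            for s in selected]) / np.log(N)
                   if selected else 0.0)
            return rel[i] - red          # potential f1, eq. (9)
        best = max(remaining, key=potential)
        selected.append(best)
        remaining.remove(best)
    return selected
```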

2.3 Dimension reduction

Some well-known methods for reducing the dimension of a feature space include kernel discriminant analysis (KDA) [28], generalized discriminant analysis (GDA) [3], and LDA [27]. Among these, LDA has been widely used in the FER domain.

2.3.1 Linear discriminant analysis

Linear discriminant analysis maximizes the ratio of between-class variance to within-class variance in any particular dataset, thereby guaranteeing maximal separability. LDA produces an optimal linear discriminant function that maps the input into the classification space, on which the class identity of each sample is decided. LDA easily handles the case in which the within-class frequencies are unequal. The within-class scatter \(S_W\) and between-class scatter \(S_B\) are computed using the following equations:

$$ {S_{B}} = \sum \limits_{i = 1}^{c} {V_{i}}\left( {{\overline {m}_{i}} - \,\overline{\overline m} } \right){\left( {{\overline {m}_{i}} - \,\overline{\overline m} } \right)^{T}} $$
(11)
$$ {S_{W}} = \sum \limits_{i = 1}^{c} \sum \limits_{{m_{k}} \in {C_{i}}} \left( {{m_{k}} - {\overline {m}_{i}}} \right){\left( {{m_{k}} - {\overline {m}_{i}}} \right)^{T}} $$
(12)

where \(V_i\) is the number of vectors in the ith class \(C_i\), and c is the number of classes, which in our case is the number of facial expressions. Also, \(\overline {\overline m}\) represents the mean of all the vectors, \({\overline {m}_i}\) is the mean of class \(C_i\), and \(m_k\) is a vector of a specific class. The optimal discriminant projection matrix \(D_{opt}\) is obtained by maximizing the ratio of the determinants of the between-class and within-class scatter matrices as

$$ {D_{opt}} = \arg\max\limits_{D} \frac{{\left| {{D^{T}}{S_{B}}D} \right|}}{{\left| {{D^{T}}{S_{W}}D} \right|}} = {\left[ {{d_{1}},{d_{2}}, {\ldots} ,{d_{t}}} \right]^{T}} $$
(13)

where \(D_{opt}\) is the set of discriminant vectors of \(S_B\) and \(S_W\) corresponding to the c − 1 largest generalized eigenvalues λ. The size of \(D_{opt}\) is t × r, where t ≤ r and r is the number of elements in a vector. Then,

$$ {S_{B}}{d_{i}} = {\lambda_{i}}{S_{W}}{d_{i}}, i = 1,2, ... , c - 1 $$
(14)

where the rank of \(S_B\) is c − 1 or less, and hence the upper bound on t is c − 1. Thus, LDA maximizes the between-class scatter of the data while minimizing the within-class scatter. For more details on LDA, please refer to [5].
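A compact Python sketch of (11)–(14) is given below; the small ridge added to \(S_W\) is our own safeguard against singularity and is not part of the formulation.

```python
import numpy as np
from scipy.linalg import eigh

def lda_projection(X, y):
    """Build S_B and S_W per (11)-(12) and keep the c - 1 leading
    generalized eigenvectors of (14) as the columns of D_opt."""
    classes = np.unique(y)
    mean_all = X.mean(axis=0)
    r = X.shape[1]
    S_B = np.zeros((r, r))
    S_W = np.zeros((r, r))
    for c in classes:
        Xc = X[y == c]
        d = (Xc.mean(axis=0) - mean_all)[:, None]
        S_B += len(Xc) * d @ d.T                              # eq. (11)
        S_W += (Xc - Xc.mean(axis=0)).T @ (Xc - Xc.mean(axis=0))  # eq. (12)
    # generalized eigenproblem S_B d = lambda S_W d, eq. (14)
    evals, evecs = eigh(S_B, S_W + 1e-6 * np.eye(r))
    order = np.argsort(evals)[::-1][: len(classes) - 1]
    return evecs[:, order]               # D_opt with t <= c - 1 columns

# projected features: X_lda = X @ lda_projection(X, y)
```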

2.4 Expressions labeling using hidden Markov model

As described before, the HMM is the most commonly used method for classifying sequential data (facial expressions); it provides a statistical model λ for a set of observation sequences, where the observations are frames in the FER domain. A typical HMM has a sequence of observations of length T (i.e., \(O = O_{1}, O_{2}, {\ldots }, O_{T}\)), a set of states (i.e., \(S = S_{1}, S_{2}, {\ldots }, S_{N}\), where N is the number of states in the model), and a state sequence over time denoted by Q (such that \(Q = q_{1}, q_{2}, {\ldots }, q_{T}\)). Each time a state j is entered, an observation is generated according to the multivariate Gaussian distribution \(b_{j}(O_{t})\) with the mean \(\mu_{j}\) and covariance matrix \(V_{j}\) associated with that state. Transition probabilities are also associated with the states, such that \(a_{ij}\) is the transition probability from state i to state j, and the initial probability of state j is \(\pi_{j}\). An HMM is thus defined by the parameter set \(\lambda = (A, B, \pi)\), where A is the state transition probability (\(A = \{a_{ij}\}\), \(a_{ij} = Prob(q_{t+1} = S_{j} | q_{t} = S_{i})\), \(1 \le i, j \le N\)), B is the observation probability (\(B = \{b_{j}(O_{t})\}\), \(b_{j} = Prob(O_{t} | q_{t} = S_{j})\), \(1 \le j \le N\)), and π is the initial state probability (\(\pi = \{\pi_{j}\}\), \(\pi_{j} = Prob(q_{1} = S_{j})\)). All the equations are based on the work of [32] and make use of the initial state probability distribution.

In the training step, for a given model λ, multiplying each transition probability by each output probability at each step t gives the joint likelihood of a state sequence Q and the corresponding observation sequence O. The likelihood P(O|λ) can be evaluated by summing over all possible state sequences:

$$ P\left( {O{\mathrm{|}}\lambda } \right) = \sum \limits_{Q} P(O,Q|\lambda ) $$
(15)

A simple procedure for finding the parameters λ that maximize the above equation, introduced in [4], relies on the forward and backward variables \(\alpha _{t}(j) = P(O_{1} {\ldots } O_{t}, q_{t} = j|\lambda )\) and \(\beta _{t}(j) = P(O_{t+1} {\ldots } O_{T}|q_{t} = j, \lambda )\), respectively, which are initialized inductively as follows:

$$ {\alpha_{1}}\left( j \right) = {\pi_{j}}{b_{j}}\left( {{O_{1}}} \right),1 \le j \le N $$
(16)
$$ {\beta_{T}}\left( j \right) = 1,1 \le j \le N $$
(17)

During testing, the appropriate HMM can then be determined by means of the likelihood of the observation sequence O under each trained λ, calculated as

$$ P\left( {O{\mathrm{|}}\lambda } \right) = \sum \limits_{i = 1}^{N} {\alpha_{T}}\left( i \right) $$
(18)

The model yielding the maximum likelihood for the observations indicates the recognized label. The following formula has been utilized to model the HMM (λ):

$$ \lambda = \left( O,\,Q,\,\pi \right) $$
(19)

where O is the sequence of observations (i.e., \(O_{1}, O_{2}, {\ldots }, O_{T}\)), Q denotes the state sequence (\(Q = q_{1}, q_{2}, {\ldots }, q_{T}\), with N states in the model), and π holds the initial state probabilities. The parameter values used to model the HMM (λ) in all experiments were 64, 4, and 4, respectively; these values were selected by performing multiple experiments. For more details on HMM, please refer to [36].
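The recognition step thus reduces to evaluating (18) for each trained model and picking the largest value. A minimal Python sketch of the forward pass, assuming the Gaussian emission densities \(b_{j}(O_{t})\) have already been evaluated into a T × N array, is:

```python
import numpy as np

def forward_likelihood(pi, A, B):
    """Forward algorithm for eqs. (15)-(18).
    pi: (N,) initial state probabilities; A: (N, N) transition matrix;
    B: (T, N) array with B[t, j] = b_j(O_t) for the observed sequence."""
    T, _ = B.shape
    alpha = pi * B[0]                  # initialization, eq. (16)
    for t in range(1, T):
        alpha = (alpha @ A) * B[t]     # induction over time
    return alpha.sum()                 # termination, eq. (18)

# recognized label = the expression whose trained model maximizes
# forward_likelihood(pi, A, B) for the incoming frame sequence
```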

3 Experimental setup

The CNF-FER was tested and validated on four publicly available standard datasets, namely the JAFFE, Yale B, Cohn-Kanade, and USTC-NVIE datasets. Six basic universal expressions, that is, happiness, anger, sadness, disgust, surprise, and fear, were used from these datasets. From each dataset, we selected only those expression frames that display the frontal view of the face, and each expression is composed of several sequences of expression frames. A 10-fold cross-validation scheme was applied, i.e., out of 10 subjects, the data from a single subject were reserved as validation data for testing the CNF-FER, whereas the data from the remaining 9 subjects were used as training data; this process was repeated 10 times (a sketch of this subject-wise scheme follows the dataset list below). Some expressions in the datasets were captured under different lighting conditions; therefore, histogram equalization was used to diminish the lighting effects. The details of each dataset are as follows:

  • JAFFE Dataset: The expressions in this dataset were posed by 10 different subjects (Japanese females). Each image has been rated on 6 expression adjectives by 60 Japanese subjects. Most of the expression frames were taken from the frontal view of the camera with tied hair in order to expose all the sensitive regions of the face. The whole dataset contains 213 facial frames covering seven expressions, including neutral; we therefore selected only the 195 expression frames of the six basic expressions performed by the ten Japanese female subjects. The original size of each facial frame is 256 × 256 pixels.

  • Yale B Face Dataset: This dataset contains a total of 5760 facial frames of 10 distinct subjects taken under a single light source, each subject seen under 576 viewing conditions (9 poses × 64 illumination conditions). For every subject, the ambient illumination was also captured for each pose.

  • Cohn-Kanade Dataset: In this facial expression dataset, 100 subjects (university students) performed the six basic expressions. The subjects ranged in age from 18 to 30 years, and most of them were female. We employed those expression frames for which the camera was fixed in front of the subjects. To utilize the six expressions from this dataset, we employed a total of 450 image sequences from the 100 subjects, each considered as one of the six expressions. The original size of each facial frame is 640 × 480 or 640 × 490 pixels with 8-bit precision for grayscale values. For recognition, twelve expression frames were taken from each expression sequence, resulting in a total of 5400 expression images.

  • USTC-NVIE Dataset: In this dataset, an infrared thermal camera and a visible camera were used to collect both spontaneous and posed expressions; we utilized only the posed expressions. There were 108 subjects (university students) ranging in age from 17 to 31 years. Some of them wore glasses, whereas others did not. They were asked to perform a series of expressions with illumination from three different directions. The size of each facial frame is 640 × 480 or 704 × 490 pixels. In total, 1027 expression frames were utilized from this dataset.
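For clarity, the subject-wise 10-fold scheme described above can be sketched as follows (variable and function names are ours, for illustration only):

```python
import numpy as np

def subject_folds(subject_ids):
    """Hold out each subject once: the held-out subject's frames are
    the validation data, the remaining subjects' frames the training data."""
    for held_out in np.unique(subject_ids):
        yield subject_ids != held_out, subject_ids == held_out

# usage: for train_mask, test_mask in subject_folds(ids): ...
```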

For a thorough validation, we performed four different sets of experiments in this study:

  • In the first experiment, we analyzed the performance of different combinations of previous statistical approaches on all datasets. The performance of the wavelet transform was likewise analyzed on all datasets.

  • In the second experiment, performance of the CNF-FER was analyzed.

  • In the third experiment, the effectiveness of the proposed feature selection method was analyzed.

  • Finally, in the last experiment, the weighted average recognition rate and time complexity of the CNF-FER were compared with those of some state-of-the-art methods.

4 Results and discussion

4.1 First experiment

In this experiment, four sub-experiments were performed based on combinations of different well-known techniques using all the datasets. Each experiment, along with its results, is described below:

  • Recognition Rates of PCA and LDA with HMM: In this experiment, PCA was used along with LDA and HMM (a sketch of this PCA+LDA pipeline is given after this list). PCA is an unsupervised technique, i.e., it does not require any prior information about the classes. For any defined level of compression, PCA is an optimal dimension-reduction scheme that minimizes the mean squared error between the original images and their reconstructions. Analysis showed that, as the number of PC features increased, the recognition rate rose to a certain value and then saturated, as shown in Fig. 2. Therefore, for optimal results, the first 50 eigenvectors with their corresponding eigenvalues were employed for each dataset.

    Once the features were extracted, LDA was applied, and the LDA features were then fed to an HMM for recognition. The idea was to maximize the between-class scatter of the data while minimizing the within-class variance before recognition. The recognition rates for this experiment on the four datasets are shown in Fig. 3.

    It is clear from Fig. 3 that PCA and LDA with HMM did not achieve a high recognition rate. The reason could be that PCA focuses only on global features and does not capture local features.

  • Recognition Rates of ICA and LDA with HMM: In this experiment, we used ICA for feature extraction (to capture the local features) with LDA in order to examine any improvement in the feature space. The results for this experiment are summarized in Fig. 4.

    It can be seen from Fig. 4 that using the global or local features separately with LDA does not guarantee a better recognition rate. ICA is slow to train when the dimension of the data is large, and it is weak at managing large inputs; e.g., when a huge number of expression frames are given as input, ICA sometimes fails to retrieve the desired features. Therefore, combining PCA and ICA with LDA should capture the most informative features.

  • Recognition Rates of PCA+ICA+LDA and HMM: In this experiment, we utilized both PCA and ICA to extract the global and local features with LDA and HMM. The results for this experiment are summarized in Fig. 5.

    It can be seen from Fig. 5 that the existing statistical methods, in their various combinations, did not achieve a better recognition rate, owing to their individual limitations.

  • Recognition Rates of wavelet transform with HMM: Finally, we analyzed the accuracy of the wavelet transform as a feature extraction technique with LDA and HMM for expression recognition. The corresponding experimental results are indicated in Fig. 6.

    It is clear from Fig. 6 that the wavelet transform does not achieve a high recognition rate either. Therefore, we proposed the CNF-FER in order to attain better performance than the existing methods on all datasets.
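As referenced in the first bullet above, the PCA+LDA baseline can be sketched with scikit-learn as follows; this is a simplified stand-in for the pipeline used in the experiments, and the HMM labeling stage is omitted:

```python
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

def pca_lda_features(train_X, train_y, test_X, n_pcs=50):
    """Project frames onto the leading 50 eigenvectors (cf. Fig. 2),
    then apply LDA; the resulting features would be fed to the HMM."""
    pca = PCA(n_components=n_pcs).fit(train_X)
    lda = LinearDiscriminantAnalysis().fit(pca.transform(train_X), train_y)
    return lda.transform(pca.transform(test_X))
```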

Fig. 2
figure 2

Top 100 eigenvalues along with their eigenvectors using the Cohn-Kanade dataset of facial expressions

Fig. 3
figure 3

Recognition rate of PCA and LDA with HMM using four datasets of facial expressions

Fig. 4
figure 4

Recognition rate of ICA and LDA with HMM using four datasets of facial expressions

Fig. 5
figure 5

Recognition rate of PCA+ICA and LDA with HMM using four datasets of facial expressions

Fig. 6
figure 6

Recognition rate of wavelet transform with HMM using four datasets of facial expressions

4.2 Second experiment

In this experiment, two sub-experiments were performed to validate the CNF-FER using the four datasets. The same validation scheme was applied as mentioned in Section 3, and the results are described below.

  • Recognition Rates of CNF-FER: The CNF-FER was evaluated for each dataset separately under the exact settings as mentioned in Section 3. The 3D feature plots of the CNF-FER, for the six expressions, after applying the LDA on four datasets, are shown in Figs. 7, 8, 9, and 10, and the detailed results are provided in Table 1.

    It is clear from Table 1 that the CNF-FER consistently achieved a high recognition rate when applied to these datasets separately, i.e., 99.00 % on JAFFE, 99.17 % on the Yale B face dataset, 99.17 % on Cohn-Kanade, and 99.33 % on USTC-NVIE. This means that, unlike existing statistical methods such as PCA, ICA, LDA, and the wavelet transform, the CNF-FER is more robust, i.e., it provided a high recognition rate not just on one but on all four datasets. This is due to the proposed feature selection method, which utilizes the information measurement to estimate the potential of the features. Furthermore, greedy forward selection was used, wherein each feature is appended to the feature set based on its quality, which confirms that the proposed feature selection method is more robust than others with respect to classification accuracy.

  • Recognition Rates of CNF-FER across Datasets: A set of experiments was performed to show the performance of the CNF-FER system across datasets. For these experiments, n-fold cross-validation based on datasets was employed (in our case n = 4): out of the four datasets, one dataset was used as training data, whereas the remaining three were used as testing data, and this process was repeated four times so that each dataset was used once for training. The weighted average recognition results of the CNF-FER system on the four datasets are shown in Table 2.

    It can be seen from Table 2 that the CNF-FER not only achieves a high recognition rate on individual datasets, but also performs well when the system is trained on one dataset and tested on the others.

Fig. 7
figure 7

3D feature plots of the CNF-FER for recognizing the expressions (on the JAFFE dataset). It can be seen that the CNF-FER clearly separated the expression classes

Fig. 8
figure 8

3D feature plots of the CNF-FER for recognizing the expressions (on the Yale B face dataset). It can be seen that the CNF-FER clearly separated the expression classes

Fig. 9
figure 9

3D feature plots of the CNF-FER for recognizing the expressions (on the Cohn-Kanade dataset). It can be seen that the CNF-FER clearly separated the expression classes

Fig. 10
figure 10

3D feature plots of the CNF-FER for recognizing the expressions (on the USTC-NVIE dataset). It can be seen that the CNF-FER clearly separated the expression classes

Table 1 Confusion matrix of CNF-FER using: (A) JAFFE dataset, (B) Yale B dataset, (C) Cohn-Kanade dataset, and (D) USTC-NVIE dataset of facial expressions (Unit: %)
Table 2 Confusion matrix of CNF-FER system that is: (A) trained on JAFFE dataset and tested on Yale B, Cohn-Kanade, and USTC-NVIE datasets, (B) trained on Yale B face dataset and tested on JAFFE, Cohn-Kanade, and USTC-NVIE datasets, (C) trained on Cohn-Kanade dataset and tested on JAFFE, Yale B, and USTC-NVIE datasets, and (D) trained on USTC-NVIE dataset and tested on JAFFE, Yale B, and Cohn-Kanade datasets (Unit: %)

4.3 Third experiment

To assess the effectiveness of the proposed feature selection method, a series of sub-experiments was performed using all four datasets; the results are presented in Table 3.

Table 3 Confusion matrix of the CNF-FER using: (A) JAFFE dataset, (B) Yale B dataset, (C) Cohn-Kanade dataset, and (D) USTC-NVIE dataset, while removing the proposed feature selection method (Unit: %)

It can be noted from Table 3 that the proposed feature selection method played a major role in the high recognition rate of the CNF-FER. When we removed the proposed feature selection method, the recognition rate decreased significantly. These results confirm the problem of high similarity among the features of different expressions. The experimental results validated our analysis and provided clear evidence that our proposed feature selection method selects a better feature set in terms of classification accuracy.

4.4 Fourth experiment

In the fourth experiment, the CNF-FER was compared against the following state-of-the-art FER methods: [15, 24, 25, 34, 45, 46]. The JAFFE, Yale B, Cohn-Kanade, and USTC-NVIE datasets of facial expressions were used in this experiment. For a fair comparison, we borrowed the implementations of some of these methods, whereas for the others, their published results are reported. The comparison was performed under the same guidelines provided in the respective manuscripts. A 10-fold cross-validation scheme was utilized for each dataset (as described in Section 3). The weighted average recognition rates of the existing methods and of the CNF-FER on the four datasets are presented in Table 4.

Table 4 Comparison results of the proposed approaches with recent feature extraction methods (Unit: %)

It is clear from Table 4 that the CNF-FER outperformed the existing state-of-the-art methods.

Moreover, in order to analyze the computational cost of the CNF-FER, we selected the most efficient method from the experiments of Table 4 (that is, [34]). The FER system of [34] took 1533 ms, 1498 ms, 2292 ms, and 1701 ms to recognize an expression frame from the JAFFE, Yale B, Cohn-Kanade, and USTC-NVIE datasets, respectively, whereas the CNF-FER took 1290 ms, 1034 ms, 1908 ms, and 1865 ms on the same datasets. Thus, the CNF-FER not only achieved a high recognition rate but is also, overall, less expensive in terms of computational cost.

Moreover, the FER system of [34] has a complexity of O(TQ²M), where T is the length of an input sequence (i.e., the number of expression frames), Q is the number of states, and M is the number of mixtures. In contrast, the CNF-FER needs at most O(TM) to compute gradients. These experiments were performed in Matlab on an Intel® Pentium® Dual-Core™ (2.5 GHz) machine with 3 GB of RAM.

5 Conclusion

A typical FER system consists of four modules: preprocessing, feature extraction, feature selection, and recognition. A great deal of research has been done on the preprocessing, feature extraction, and recognition modules; however, feature selection is still an active research area.

In this paper, we have reviewed some recently developed algorithms for mutual information-based feature selection. We discussed the limitations of each method and, based on our observations, proposed our own method derived from NMIFS with two improvements: the normalization of the mutual information and the feature-independent normalizing weights. For feature extraction, we utilized the existing curvelet transform, which extracts prominent features by preserving the line, curve, and edge information in each expression frame. The dimensions of the feature space were reduced by employing LDA. Finally, an HMM was used as the recognizer.

The CNF-FER was tested and validated using four publicly available standard datasets. For each dataset, 10-fold cross-validation scheme was employed. The CNF-FER achieved a weighted average recognition accuracy of 99 %, which is a significant improvement over the recognition rates of existing FER systems. Moreover, from computational perspective, the CNF-FER is less expensive than existing methods.

The performance of the CNF-FER in real-time settings is yet to be investigated, because several factors in real environments might degrade its performance, such as background clutter, image rotation and blur, and varying face angles. Further study is therefore needed to tackle these issues while maintaining the same high recognition rate in real environments.