1 Introduction

Facial expression recognition (FER) plays a significant role in our daily communication and has attracted much attention in many real-world applications, especially human-computer interaction (HCI) [53, 99], robot control, driver state surveillance [97, 102], clinical decision-making [100], security access control [89], neuroscience, psychology, and cognitive science [12]. These systems improve the communication between the user and the computer according to the needs of the user [30, 53]. However, it is still challenging for such systems to achieve a robust recognition rate because of the difficulty of precisely extracting suitable emotional features from the expression images [102]. Most of these features are represented as static, dynamic, point-based geometric, or region-based appearance features [102].

There are two types of expression recognition systems: posed expression recognition systems [98] and spontaneous expression recognition systems [106]. In the first type, the expressions are produced artificially; that is, the subjects are asked to perform the expressions [11]. In the second type, the expressions are performed spontaneously, as observed on a day-to-day basis, for example during conversations or while watching movies [11]. The focus of this study is both posed and spontaneous FER in naturalistic environments, a setting in which existing works remain limited.

There are two types of expression classification. The first is frame-based classification, in which only the current frame is used, with or without a reference image (a neutral face image), to recognize the expressions. In sequence-based classification, on the other hand, the temporal information of the sequence is employed to recognize the expressions of one or more frames [90]. Sequence-based methods calculate the geometrical displacement of facial feature points between the current frame and the initial frame [76], whereas frame-based methods do not have this property. The temporal information of an expression across a sequence of frames is important for facial expression analysis [75]; therefore, we employ sequence-based classification, which addresses another limitation of the existing works.

A typical FER system consists of four basic modules: face detection, feature extraction, feature selection, and recognition. Since the face contains most of the expression-related information, the face detection module first locates the face in a given image. The feature extraction module derives the distinguishable features from each facial expression shape and quantizes them as discrete symbols [82, 84]. The feature selection module selects a subset of relevant features from the large number of features extracted from the input data. In the recognition module, a classifier is first trained with training data and then used to generate the labels of the facial expressions contained in the incoming video data [86].

Although many works have achieved high recognition rates in controlled environments, their accuracies consistently degrade in naturalistic environments [63]. Moreover, few works automatically detect the face and then extract and select features from the facial muscles, so robust feature extraction [102] in naturalistic environments remains challenging. Therefore, the objective of this paper is to propose an unsupervised automatic face detection and extraction model based on the active contour (AC) model. The proposed AC model uses a level-set formulation that combines two energy functions: the Chan-Vese energy [19] and the Bhattacharyya distance [45]. The proposed AC model is robust to noise and illumination; it not only minimizes the dissimilarities within the object (the face) but also maximizes the distance between the two regions, i.e., the face and the background.

Once the face has been detected, we apply a new feature extraction method based on the wavelet transform (specifically, the symlet wavelet). To obtain the feature vectors, the symlet wavelet family was tested, with each image decomposed up to 4 levels. The proposed feature extraction method extracts the most prominent features; however, some redundancy may remain among them. Therefore, we propose the use of a non-linear feature selection method called stepwise linear discriminant analysis (SWLDA), applied to the extracted feature space. SWLDA selects the most informative features by taking advantage of the forward selection model and removes irrelevant features by taking advantage of the backward regression model.

The rest of the paper is organized as follows. Section 2 reviews related works and their limitations. Section 3 presents an overview of the proposed approaches. The experimental setup and the results with discussion are presented in Sections 4 and 5, respectively. Finally, Section 6 concludes the paper with some future directions.

2 Literature review

2.1 Existing face detection methods

Generally, before the expressions can be recognized, the face must be located, detected, and extracted in the expression frames, because it is the face that contains most of the expression-related information. Many works have addressed face detection and extraction; however, each has its own limitations. Recent works [4, 14, 25, 28] proposed for face detection utilized artificial neural networks (ANNs). However, a common limitation of ANNs is that the neuron model used in neural networks ignores most of the characteristics of its biological counterpart [93]. ANNs are also computationally expensive and difficult to implement. Similarly, skin-tone-based methods [7, 32, 38, 62, 94] have been utilized for face detection and extraction in FER systems. However, skin-tone-based methods are very sensitive to illumination, e.g., under varying lighting conditions [61], and skin color varies from person to person even when different cameras are employed under static lighting conditions [71]. Moreover, the appearance of skin color is completely reliant on the brightness and the color temperature of the light source [87], which may cause misclassification.

Appearance-based methods [24, 34, 40, 65] have also been proposed for face detection. These methods show good performance in controlled environments; however, their performance degrades under wide variations of head pose and illumination in real-world environments [23]. Moreover, holistic approaches [77, 78, 105], which utilize global rather than local information, have been employed for face detection. However, because holistic approaches consider the entire face, detecting facial features is difficult for them when there are wide variations in rotation, scale, head pose, and illumination [13]. The authors of [17, 26, 27, 67] employed invariant-feature-based methods for face detection on spontaneous datasets and achieved good performance. However, the performance of invariant-feature-based methods degrades with environmental change. Moreover, these methods require accurate normalization of the face against pose, illumination, scale, and occlusion [33], and they are computationally expensive [22].

2.2 Existing feature extraction methods

Feature extraction is the process of deriving the distinguishable features from each facial expression shape and quantizing them as discrete symbols [81]. A well-designed feature extraction algorithm improves the recognition rate efficiently and effectively [95]. Some parts of the face, such as the eyes, mouth, nose, and forehead, yield the most prominent and informative features. According to the face descriptors, there are two types of features: global and local. Global features are extracted from the whole face, while local features are extracted from specific parts of the face.

The methods used for global feature extraction, called holistic methods, include nearest-feature-line-based subspace analysis [64], Eigenfaces and eigenvectors [2, 46, 50], Fisherfaces [1], global features [60], neural networks [1, 74], independent component analysis (ICA) [48, 54], and principal component analysis (PCA) [31, 35, 92]. Moreover, frequency-based methods and Gabor wavelets were utilized by [57] and [52], respectively. However, these holistic methods cannot determine which facial features are most important for FER systems. Moreover, they ignore higher-order correlations and may fail if the data sources are dependent [20]. Furthermore, these techniques cannot handle data whose class distributions are far from Gaussian, and they have difficulty processing small sample sizes [20]. Most of these methods work well for face detection and recognition in controlled environments; however, their FER performance degrades under variations in illumination, pose, facial expression, occlusion, and aging [18]. Furthermore, most of these techniques are computationally expensive because they consider the whole face, and they require large amounts of memory [21]. These methods are preferred mainly for face recognition because they preserve the interrelations between facial parts, which are very important for FER systems [101].

On the other hand, local feature-based methods compute local descriptors from parts of the face and then integrate this information into one descriptor. These methods include local feature analysis (LFA) [42], Gabor features [88], non-negative matrix factorization (NMF) [5, 56], local non-negative matrix factorization (LNMF) [16], and local binary patterns (LBP) [43]. Among these methods, LBP achieved the best performance. However, LBP does not provide the directional information of the facial frame [85]. Recent works have been proposed to address the limitations of LBP, including the local transitional pattern (LTP) [3], local directional pattern (LDP) [39], and local directional pattern variance (LDPv) [44]. However, most of these methods exploit information other than intensity to overcome the problems caused by noise and illumination change [70]. Moreover, their performance still degrades under non-monotonic illumination change, noise variation, pose change, and varying expression conditions [70]. Similarly, the authors of [69] exploited local Fisher discriminant analysis (LFDA) for FER. However, LFDA might fail to determine the essential underlying structure when the face image space is highly nonlinear [96]. Furthermore, the authors of [72] employed pixel and color segmentation for feature extraction to detect facial expressions; however, the performance of this approach degrades with variations in illumination.

2.3 Existing feature selection and dimension reduction methods

Dimension reduction or feature selection is an essential step before classification. The dimensions of a feature space can be reduced by extracting discriminating features that exploit the total distribution of the data while lessening the differences within the classes. Without such a step, the feature values for the six classes remain highly merged, which can result in a high misclassification rate; the use of inappropriate coefficients results in high within-class differences and low between-class differences. Therefore, a method is required to reduce the dimension of the feature space and to increase class separability. This idea is employed by various feature selection methods in machine learning, mainly principal component analysis (PCA) [15], linear discriminant analysis (LDA) [58], kernel discriminant analysis (KDA) [59], and generalized discriminant analysis (GDA) [8].

2.3.1 Principal component analysis (PCA)

PCA is one of the most well-known methods for feature selection and dimension reduction. PCA is a second-order approach that offers an easy way of reducing a complex data set by projecting it onto a lower-dimensional space while preserving as much of the variability as possible. PCA constructs the best linear least-squares decomposition of a training set. The method has the benefit of being linear and makes no hypothesis concerning the data distribution. The role of PCA is to approximate the original data with lower-dimensional features that represent the data economically. It also focuses on the global features of the gray-scale faces, for which there is a strong correlation among the observed variables. In this work, the main purpose of using PCA was to express the large 1D vector of pixels constructed from the 2D image as compact principal components of the feature space; this is called eigenspace projection. The primary job of PCA is to compute the eigenvectors of the covariance matrix M of the data and then approximate the data by a combination of the few top eigenvectors. It is the most common feature extraction technique and has been widely employed in FER systems. However, PCA has poor discriminating power within classes and is computationally expensive. Therefore, linear discriminant analysis (LDA) has been exploited to resolve the shortcomings of PCA.
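As a brief illustration of eigenspace projection, the following minimal numpy sketch (our illustration, not the implementation used in this paper) projects flattened face images onto the top principal components:

```python
import numpy as np

def pca_project(X, k):
    """Eigenspace projection: map flattened face images X
    (n_samples x n_pixels) onto the top-k principal components."""
    mean = X.mean(axis=0)
    Xc = X - mean                      # center the data
    # SVD of the centered data is equivalent to an eigen-decomposition
    # of the covariance matrix M, but avoids forming M explicitly.
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:k].T               # compact k-dimensional features
```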

2.3.2 Linear discriminant analysis (LDA)

LDA maximizes the ratio of between-class variance to within-class variance in a data set, thereby guaranteeing maximal separability. LDA has been applied to classification problems such as speech recognition [29]. It produces an optimal linear discriminant function that maps the input into the classification space in which the class identity of each sample is decided. LDA easily handles the case in which the within-class frequencies are unequal, and its performance has been examined on randomly generated test data. Thus, LDA maximizes the total scatter of the data while minimizing the within-class scatter. The use of LDA, however, fails to resolve the overlap, or low between-class variance, among the facial expressions. LDA is a linear technique, which limits its flexibility on complex data sets. Moreover, the assumption made by LDA that all classes share the same within-class covariance matrix is not a valid one. In addition, large amounts of data are necessary to generate robust LDA transforms, and there may be insufficient data to robustly estimate transforms that separate the classes. LDA-based methods also suffer from the limitation that their optimality criteria are not directly associated with the classification capability of the obtained feature representation [55]. Moreover, LDA is more sensitive than PCA to partial occlusion. For more details on LDA, please refer to [10]. Thus, we believe that the use of LDA will not necessarily yield an improvement in the performance of FER systems, and it cannot provide a better classification rate due to the aforementioned limitations. Therefore, kernel discriminant analysis (KDA) has been proposed to solve the limitations of LDA.

2.3.3 Kernel discriminant analysis (KDA)

KDA is a non-linear discriminating approach that seeks non-linear discriminating features using kernel techniques. However, KDA cannot provide good performance when the face images belonging to the same subject are scattered rather than grouped into clusters [66]. Moreover, during model evaluation, KDA requires the entire data set for training, which is inappropriate for FER systems. A more recent non-linear feature selection method, generalized discriminant analysis (GDA), was proposed to solve the shortcomings of KDA.

2.3.4 Generalized discriminant analysis (GDA)

GDA maps the input (training) data into a classification (high-dimensional feature) space by generating an optimum discriminant function. GDA builds a nonlinear feature space for discrimination that is not separable by linear methods, and it calculates the discriminant eigenvectors in the feature space efficiently. However, GDA may produce degenerate eigenvectors in the small-sample-size case [104]. Furthermore, slight changes in the training data can make the GDA solution unstable and suboptimal in terms of discriminant ability [104]. Moreover, GDA requires a great deal of time during training and testing [6].

Therefore, to overcome the limitations of PCA, LDA, KDA, and GDA, we propose the use of a robust technique called stepwise linear discriminant analysis (SWLDA) for FER systems, with the aim of extracting localized features from parts of the face that the previous feature selection and dimension reduction techniques were limited in analyzing. The proposed technique is based on two processes: forward selection and backward regression. In the forward process, the features most correlated with the response are entered into the regression model based on partial F-test values computed over the feature space. In the backward process, the least significant features (i.e., those with the lowest F-test values) are removed from the regression model. In both processes, the F-test values are calculated on the basis of the defined class labels.

3 Proposed facial expression recognition (FER) system

3.1 Proposed face detection and extraction algorithm

The performance of FER systems relies largely on accurate face detection. In the field of image segmentation, the active contour (AC) model has attracted much attention since it was first introduced by [49].

An active contour model is a deformable spline influenced by constraint and image forces that pull it toward object contours; it tries to move into a position where its energy is minimized. The active contour imposes desirable properties such as continuity and smoothness on the contour of the object, which means that the approach adds a certain degree of prior knowledge to the problem of finding the object contour.

Chan and Vese (CV) [19] proposed a novel form of active contour for object segmentation based on the level-set framework. Its energy function is defined by

$$ F(C) = \underset{\text{inside}(C)}{\int} |I(x) - c_{\text{in}}|^{2}\, dx + \underset{\text{outside}(C)}{\int} |I(x) - c_{\text{out}}|^{2}\, dx $$
(1)

where cin and cout are respectively the average intensities inside and outside the variable curve C. Compared to other AC models, the CV model can detect faces more exactly, since it does not need to smooth the initial facial image (via an edge function g(|∇Iσ|2)) even if it is very noisy; in other words, the model is more robust to noise. Unlike other active contour models, which rely heavily on the image gradient as the stopping term and thus perform poorly on noisy images, the CV model does not use edge information but instead uses the difference between the regions inside and outside the curve, making it one of the most robust and thus widely used techniques for image segmentation, especially in the area of face detection. However, the convergence of the CV model depends on the homogeneity of the segmented faces: when the inhomogeneity becomes large, the CV model gives unsatisfactory results, because each segment is represented only by its mean value, which is not sufficient for a highly inhomogeneous object. Moreover, the global minimum of the above energy functional does not always guarantee the desired result, because the functional minimizes the dissimilarity within each segment but does not take into account the distance between different segments.
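For reference, the baseline CV segmentation of (1) is available in common toolkits. The sketch below is our illustration using scikit-image's Chan-Vese routine, not the implementation evaluated in this paper; the file name and parameter values are assumptions:

```python
import numpy as np
from skimage import io, color
from skimage.segmentation import chan_vese

# Load a face frame and convert it to gray scale (hypothetical file name).
frame = color.rgb2gray(io.imread("face_frame.png"))

# Baseline CV segmentation of Eq. (1): mu weighs the curve-length term,
# lambda1/lambda2 weigh the inside/outside fidelity terms.
mask = chan_vese(frame, mu=0.25, lambda1=1.0, lambda2=1.0)

c_in = frame[mask].mean()    # average intensity inside the curve
c_out = frame[~mask].mean()  # average intensity outside the curve
```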

The proposed methodology incorporates an evolving term based on the Bhattacharyya distance into the CV energy functional, so that the model minimizes the dissimilarities within the object and maximizes the distance between the two regions. The proposed energy function is:

$$ E_{0} (C) = \beta F (C) + (1 - \beta) B (C) $$
(2)

where β ∈ [0,1]. Note that, to make it comparable to the F(C) term, B(C) is in practice multiplied by the area of the facial image (frame), because its value always lies within the interval [0,1] whereas F(C) is calculated as an integral over the facial image plane. As usual [19], one regularizes the solution by constraining the length of the curve and the area of the region inside it, yielding the total energy functional

$$\begin{array}{@{}rcl@{}} E (C) &=& \gamma \mathit{Length} (C) + \eta \mathit{Area} (\mathit{inside} (C))\\ && + \beta F (C) + (1 - \beta) B (C) \end{array} $$
(3)

where γ and η are non-negative constants.

The intuition behind the proposed model E(C) is that we seek a curve which is regular (the first two terms) and which partitions the facial image into regions such that the differences within each region are minimized (the F(C) term), thereby reducing environmental effects, while the distance between the two regions (e.g., the human face and the background) is maximized (the B(C) term).

For the level-set formulation, let us define ϕ as the level-set function, I : Ω → Z ⊂ Rn as a certain image feature such as intensity, color, texture, or a combination thereof, and H(∙) and δ0(∙) as the Heaviside and Dirac functions, respectively:

$$ H(u) = \left\{ \begin{array}{ll} 1,\,if\,u \ge 0 \\ 0,\,if\,u < 0\,\, \end{array} \right. \qquad\quad \delta_{0} (u) = \frac{d}{{du}}H(u) $$
(4)

The energy function can then be rewritten as

$$\begin{array}{@{}rcl@{}} E(\phi) &=& \gamma \underset{\Omega}{\int} | \nabla H (\phi (x)) |\,dx + \eta \underset{\Omega}{\int} H(-\phi (x))\,dx\\ && + \beta \left[ \underset{\Omega}{\int} | I(x) - c_{in}|^{2} H (- \phi (x))\,dx + \underset{\Omega}{\int} | I (x) - c_{out} |^{2} H (\phi (x))\,dx \right]\\ && + (1 - \beta )\underset{Z}{\int} \sqrt{p_{in} (z)\, p_{out} (z)}\, dz \end{array} $$
(5)

where

$$\begin{array}{@{}rcl@{}} p_{in} (z) &=& \frac{\underset{\Omega}{\int} \delta_{0} (z - I(x)) H (-\phi(x))dx}{\underset{\Omega}{\int} H (- \phi (x)) dx}\\ p_{out} (z) &=& \frac{\underset{\Omega}{\int} \delta_{0} (z - I(x)) H (\phi (x)) dx}{\underset{\Omega}{\int} H(\phi (x))dx} \end{array} $$
(6)

In general form, it reads

$$ E(\phi) = \underbrace{\underset{\Omega}{\int} f\left( \phi, \phi_{x_{1}}, \phi_{x_{2}}, \ldots, \phi_{x_{n}} \right) dx}_{\overline{F} (\phi)} + (1-\beta)B(\phi) $$
(7)

where X = [x1, x2,…,xn] ∈ Rn, \(\phi _{x_{i}} = \frac {\partial \phi }{\partial _{x_{i}}}\), \(i = \overline {1..n}\), \(B(\phi ) = {\int }_{z} {\sqrt {p_{in} (z)p_{out} (z)}} dz\). The first variation (w.r.t ϕ(x)) is given by

$$ \frac{\delta E}{\delta \phi} = \frac{\delta \bar{F}}{\delta \phi} + (1 - \beta) \frac{\delta B}{\delta \phi} $$
(8)

Using the Euler-Lagrange equation, one has

$$\begin{array}{@{}rcl@{}} \frac{\delta \bar{F}}{\delta \phi} &=& \frac{\partial f}{\partial \phi} - \sum\limits_{i = 1}^{n} \frac{\partial}{\partial x_{i}} \frac{\partial f}{\partial \phi_{x_{i}}}\\ &=& \delta_{0} (\phi) \left[- \eta - \beta (I - c_{in} )^{2} + \beta (I - c_{out} )^{2} - \gamma k \right] \end{array} $$
(9)

where k = div(∇ϕ/|∇ϕ|) denotes the curvature of the level set. On the other hand,

$$ \frac{\delta B}{\delta \phi} = \frac{1}{2} {\int}_{z} \left( \begin{array}{l} \frac{\partial p_{in} (z)}{\partial \phi} \sqrt{\frac{p_{out} (z)}{p_{in} (z)}}\\ + \frac{\partial p_{out} (z)}{\partial \phi} \sqrt{\frac{p_{in} (z)}{p_{out} (z)}} \end{array} \right) dz $$
(10)

where pin(z) and pout(z) are given in (6). Differentiating them w.r.t ϕ(x), one obtains

$$\begin{array}{@{}rcl@{}} \frac{\partial p_{in} (z)}{\partial \phi} &=& \frac{\delta_{0} (\phi)}{A_{in}}\left[ p_{in} (z) - \delta_{0} (z - I) \right]\\ \frac{\partial p_{out} (z)}{\partial \phi} &=& \frac{\delta_{0} (\phi)}{A_{out}}\left[ \delta_{0} (z - I) - p_{out} (z)\right] \end{array} $$
(11)

where Ain and Aout are respectively the areas inside and outside the contour and are given by

$$ A_{in} = {\int}_{\Omega} H(- \phi (x))dx \qquad\quad A_{out} = {\int}_{\Omega} H(\phi(x))dx $$
(12)

Substituting (11) into (10) and performing some simple manipulation, one obtains

$$ \frac{\delta B}{\delta \phi} = \delta_{0} (\phi)V(x) $$
(13)

where

$$ V(x) = \frac{B}{2}\left( \frac{1}{A_{in}} - \frac{1}{A_{out}} \right) + \frac{1}{2} {\int}_{z} \delta_{0} (z-I(x)) \left( \frac{1}{A_{out}} \sqrt{\frac{p_{in} (z)}{p_{out} (z)}} - \frac{1}{A_{in}} \sqrt{\frac{p_{out} (z)}{p_{in} (z)}} \right) dz $$
(14)

Combining (8), (9), and (13), one can derive the first variation of E(ϕ) as

$$\begin{array}{@{}rcl@{}} \frac{\partial E}{\partial \phi} = \delta_{0} (\phi) \left[\begin{array}{l} - \gamma k - \eta - \beta \left( {I - c_{in} } \right)^{2}\\ + \beta \left( {I - c_{out} } \right)^{2} + \left( {1 - \beta } \right)V \end{array} \right] \end{array} $$
(15)

Hence, the evolution flow associated with minimizing the energy functional in (5) is given as

$$\begin{array}{@{}rcl@{}} \frac{\partial \phi }{\partial t} &=& - \frac{\partial E}{\partial \phi}\\ &=& \delta_{0} (\phi) \left\{ \gamma k + \eta + \beta \left[(I - c_{in})^{2} - (I - c_{out})^{2}\right] \right.\\ && \left. - (1 - \beta) \left[ \frac{B}{2}\left( \frac{1}{A_{in}} - \frac{1}{A_{out}}\right) + \frac{1}{2}{\int}_{z} \delta_{0} (z - I(x)) \left( \frac{1}{A_{out}}\sqrt{\frac{p_{in}}{p_{out}}} - \frac{1}{A_{in}}\sqrt{\frac{p_{out}}{p_{in}}} \right)dz \right] \right\} \end{array} $$
(16)

where Ain and Aout are respectively the areas inside and outside the curve C. Thus, the proposed AC model overcomes the limitations of the conventional CV AC model in the area of face detection.
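To make the evolution in (16) concrete, the following minimal numpy sketch shows one possible discrete implementation of the update; the smoothed Dirac delta, the histogram-based densities, and all parameter values are our assumptions rather than specifics from the text:

```python
import numpy as np

def curvature(phi):
    """k = div(grad(phi)/|grad(phi)|) via central differences."""
    gy, gx = np.gradient(phi)
    mag = np.sqrt(gx ** 2 + gy ** 2) + 1e-8
    _, dxx = np.gradient(gx / mag)
    dyy, _ = np.gradient(gy / mag)
    return dxx + dyy

def evolve(I, phi, beta=0.7, gamma=0.2, eta=0.0,
           dt=0.5, iters=200, bins=64, eps=1.5):
    """Gradient-descent flow of Eq. (16); inside(C) = {phi < 0}."""
    I = I.astype(float)
    edges = np.linspace(I.min(), I.max() + 1e-8, bins + 1)
    idx = np.clip(np.digitize(I, edges) - 1, 0, bins - 1)
    for _ in range(iters):
        inside = phi < 0
        if inside.all() or not inside.any():
            break                                   # contour vanished
        A_in, A_out = inside.sum(), (~inside).sum()
        c_in, c_out = I[inside].mean(), I[~inside].mean()
        p_in, _ = np.histogram(I[inside], bins=edges, density=True)
        p_out, _ = np.histogram(I[~inside], bins=edges, density=True)
        p_in, p_out = p_in + 1e-8, p_out + 1e-8
        dz = np.diff(edges).mean()
        B = np.sum(np.sqrt(p_in * p_out)) * dz      # Bhattacharyya coefficient
        # V(x) of Eq. (14): the Dirac delta picks out the bin of z = I(x)
        ratio = (np.sqrt(p_in / p_out) / A_out
                 - np.sqrt(p_out / p_in) / A_in)
        V = 0.5 * B * (1.0 / A_in - 1.0 / A_out) + 0.5 * ratio[idx]
        delta = eps / (np.pi * (eps ** 2 + phi ** 2))  # smoothed delta_0
        phi = phi + dt * delta * (gamma * curvature(phi) + eta
                                  + beta * ((I - c_in) ** 2
                                            - (I - c_out) ** 2)
                                  - (1 - beta) * V)
    return phi < 0                                  # final face mask
```

In practice, ϕ would be initialized as a signed function that is negative inside an initial ellipse placed near the face (as in Section 5.1) and periodically reinitialized to remain well conditioned.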

3.2 Proposed feature extraction technique

Once the faces have been detected and extracted, the wavelet transform is applied for feature extraction. For the decomposition process, the video frames were converted to gray scale; the conversion from RGB to gray scale improves the efficiency of the proposed algorithm. The wavelet decomposition can be interpreted as a decomposition of the signal into a set of independent feature vectors. Each vector consists of sub-vectors:

$$ {V}_{0}^{2D} = {V}_{0}^{2D - 1}, {V}_{0}^{2D - 2}, \ldots, V_{0}^{2D - n} $$
(17)

where V represents the 2D feature vector. A 2D expression frame X is decomposed into orthogonal sub-images corresponding to different visualizations. The following equation shows one level of decomposition:

$$ X = {A_{1}} + {D_{1}} $$
(18)

where X indicates the decomposed image, and A1 and D1 denote the approximation and detail coefficient vectors, respectively. If the expression frame is decomposed to multiple levels, then (18) can be written as

$$ X = A_{j} + [D_{j} + D_{j - 1} + D_{j - 2} + \ldots + D_{2} + D_{1}] $$
(19)

where j represents the level of decomposition. The detail coefficients mostly consist of noise; therefore, only the approximation coefficients were utilized for feature extraction. During the decomposition process, each frame is decomposed up to four levels, i.e., j = 4, because beyond j = 4 the image loses too much information, the informative coefficients can no longer be detected properly, and misclassification may result. The detail coefficients further consist of three sub-coefficients, so (19) can be written as

$$\begin{array}{@{}rcl@{}} X &=& A_{4} + [D_{4} + D_{3} + D_{2} + D_{1}]\\ &=& A_{4} + [(D_{h})_{4} + (D_{v})_{4} + (D_{d})_{4}]\\ && + [(D_{h})_{3} + (D_{v})_{3} + (D_{d})_{3}]\\ && + [(D_{h})_{2} + (D_{v})_{2} + (D_{d})_{2}]\\ && + [(D_{h})_{1} + (D_{v})_{1} + (D_{d})_{1}] \end{array} $$
(20)

Or, more compactly, (20) can be written as

$$ X = A_{4} + \sum\limits_{j = 1}^{4} [(D_{h})_{j} + (D_{v})_{j} + (D_{d})_{j}] $$
(21)

where Dh, Dv, and Dd indicate the horizontal, vertical, and diagonal coefficients, respectively. As (20) and (21) show, all the coefficients are connected with each other like a chain, through which we can easily extract the prominent features. These coefficients are represented graphically in Fig. 1. In each decomposition step, the approximation and detail coefficient vectors are obtained by passing the signal through low-pass and high-pass filters, respectively.

Fig. 1 All the coefficients are connected one after another, as in the head-to-tail rule of vector addition, producing a one-dimensional matrix from which the coefficients are easily extracted
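A minimal sketch of this decomposition with PyWavelets follows; the specific symlet order ('sym4') and the use of the level-4 approximation as the feature sub-band are our assumptions, since the text specifies only the symlet family and four levels:

```python
import numpy as np
import pywt

def wavelet_features(frame):
    """4-level 2D symlet decomposition of a gray-scale frame, per
    Eqs. (20)-(21); returns the level-4 approximation A4 as features."""
    coeffs = pywt.wavedec2(frame, wavelet="sym4", level=4)
    A4 = coeffs[0]    # approximation coefficients at level 4
    # coeffs[1:] hold the (Dh, Dv, Dd) detail sub-bands for levels 4
    # down to 1; they mostly carry noise and are discarded here.
    return np.asarray(A4).ravel()
```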

After the decomposition process, the feature vector is created by taking the average of the frequencies of the expression frames. Within a specified time window, the frequency of each expression frame is estimated by analyzing the corresponding frame using the wavelet transform [91]:

$$ C(a_{i}, b_{j}) = \frac{1}{\sqrt{a_{i}}} \int\limits_{- \infty}^{\infty} y(t) {\psi}_{f.e}^{*} \left( \frac{t -b_{j}}{a_{i}} \right)dt $$
(22)

where ai is the scale of the wavelet between the lower and upper frequency bounds, chosen to obtain a higher resolution for the frequency estimation; bj is the position of the wavelet from the start to the end of the time window, with a spacing equal to the signal sampling period; t is the time; ψf.e is the wavelet function used for frequency estimation; and C(ai, bj) are the wavelet coefficients at the specified scale and position, which are converted to a mode frequency as given below.

$$ f_{1} = \frac{f_{a} \left( \psi_{f.e}\right)}{a_{m}\left( \psi_{f.e}\right).{\Delta}} $$
(23)

where \(f_{a} \left (\psi _{f.e}\right )\) is the average (center) frequency of the wavelet function, \(a_{m}\left (\psi _{f.e}\right )\) is the scale at which the wavelet coefficient magnitude is maximal, and Δ is the signal sampling period. The feature vector is then obtained by taking the average of the frame frequencies over the whole expression:

$$ f_{avg} = \frac{f_{1} + f_{2} + f_{3} + \ldots + f_{N}}{N} $$
(24)

where favg indicates the average frequency of the expression, which serves as the feature vector for that expression, and N represents the total number of frames in the expression.
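One possible reading of (22)-(24) is sketched below using PyWavelets' continuous transform; the Morlet wavelet, the row-mean signal, the scale grid, and the 30 fps sampling period are our assumptions, since the text does not name the frequency-estimation wavelet:

```python
import numpy as np
import pywt

def frame_mode_frequency(frame, dt=1.0 / 30):
    """Estimate the mode frequency of one expression frame, following
    Eqs. (22)-(23): the scale a_m with the strongest wavelet response
    is converted to a frequency via the wavelet's center frequency."""
    signal = frame.mean(axis=0)              # treat the row-mean as y(t)
    scales = np.arange(1, 64)
    coefs, freqs = pywt.cwt(signal, scales, "morl", sampling_period=dt)
    m = np.abs(coefs).mean(axis=1).argmax()  # index of strongest scale a_m
    return freqs[m]

def expression_feature(frames, dt=1.0 / 30):
    """Average of the frame frequencies over one expression, per Eq. (24)."""
    return np.mean([frame_mode_frequency(f, dt) for f in frames])
```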

3.3 Proposed feature selection model

In this step, the most informative features are selected using SWLDA, which maximizes the ratio of between-class variance to within-class variance in the data set, thereby guaranteeing maximal separability. Its forward and backward selection techniques enable SWLDA to effectively reduce the dimensions of the feature space.

In the forward step, the most correlated features are selected based on partial F-test values computed over the feature space. In the backward step, the least significant features (i.e., those with the lowest F-test values) are removed from the regression model. In both processes, the F-test values are calculated on the basis of the defined class labels. The advantage of this method is that it is very efficient at seeking out localized features.

Procedure

In the beginning, there is no predictor in the model. Based on a significance test, i.e., the partial F-test (the t-test), a predictor is either entered into or removed from the model in each iteration. Two thresholds, alpha-to-enter and alpha-to-remove, are defined for the significance test; they are set to ae = 0.10 and aγ = 0.20, respectively. These values were chosen based on various experiments. The thresholds specify the significance levels at which predictors are entered into or removed from the model. The algorithm stops when there are no more predictors to enter into or remove from the stepwise model.

We present an example with three independent predictors, x1, x2, and x3, and an output (response) y. Each predictor is fitted to the model using a regression; that is, we regress y on x1, on x2, …, and on xp−1, where p − 1 is the total number of predictors (three in this case). The first predictor to enter the stepwise model is the one with the smallest t-test p-value (i.e., below ae). This continues until the stopping criterion is met (i.e., there is no predictor with a p-value less than ae). Now suppose x1 is the best predictor. We then fit each of the two-predictor models that include x1: we regress y on (x1, x2), on (x1, x3), …, and on (x1, xp−1). The second predictor to enter the stepwise model is the one with the smallest p-value. If no p-value is less than ae, the iteration stops.

Suppose this time x2 is the best second predictor in the model. The procedure then steps back and checks the p-value for β1 = 0 (i.e., the criterion for removing a predictor from the model). If the p-value for β1 = 0 is above aγ, then x1 is no longer significant given the new entry, and it is removed from the stepwise model.

In contrast, suppose both x1 and x2 have made it into the two-predictor stepwise model. The procedure then fits each of the three-predictor models that include x1 and x2: it regresses y on (x1, x2, x3), on (x1, x2, x4), …, and on (x1, x2, xp−1). The third predictor to enter the stepwise model is the one with the smallest p-value below ae; if no p-value is less than ae, the stopping criterion is met. The procedure then checks the p-values for β1 = 0 and β2 = 0; if either is no longer significant (i.e., above aγ), the corresponding predictor is removed from the stepwise model. The whole procedure stops when adding a further predictor does not yield a p-value below ae. For more details on SWLDA, please refer to a previous study [51].
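The sketch below implements this forward/backward stepwise procedure with ordinary-least-squares p-values; statsmodels is our choice of tooling, and only the thresholds 0.10 and 0.20 come from the text:

```python
import numpy as np
import statsmodels.api as sm

def stepwise_select(X, y, alpha_enter=0.10, alpha_remove=0.20):
    """Forward-entry / backward-removal selection over the columns of X
    against the class-label response y (a simplified SWLDA sketch)."""
    selected = []
    while True:
        remaining = [j for j in range(X.shape[1]) if j not in selected]
        # Forward step: enter the remaining predictor with the best p-value.
        pvals = {}
        for j in remaining:
            fit = sm.OLS(y, sm.add_constant(X[:, selected + [j]])).fit()
            pvals[j] = fit.pvalues[-1]        # p-value of the candidate
        if not pvals or min(pvals.values()) >= alpha_enter:
            break                             # nothing significant to enter
        selected.append(min(pvals, key=pvals.get))
        # Backward step: drop previously entered predictors whose p-value
        # has risen above alpha_remove now that the new one is in.
        fit = sm.OLS(y, sm.add_constant(X[:, selected])).fit()
        for pos in range(len(selected) - 1, -1, -1):
            if fit.pvalues[pos + 1] > alpha_remove:   # +1 skips the constant
                selected.pop(pos)
    return selected
```

A production version would refit after each removal and guard against cycles in which the same predictor is repeatedly entered and removed.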

3.4 Hidden Markov model (HMM) based classification

Commonly, the function of HMMs is to provide a statistical model λ for a set of observation sequences; in facial expression recognition applications, the observations are sometimes called “frames.” Suppose there is a sequence of observations of length T, denoted O1, O2,…,OT. An HMM also has a set of N states S = S1, S2,…,SN, where N is the number of states in the model; the state occupied at time t is denoted qt, giving a state sequence Q = q1, q2,…,qT. The states are connected by arcs, and each time a state j is entered, an observation is generated according to the multivariate Gaussian distribution bj(Ot) with mean vector μj and covariance matrix Vj associated with that state. The arcs also carry transition probabilities, such that aij is the transition probability from state i to state j, and the initial probability of state j is πj. An HMM can therefore be defined by the parameter set λ = (A, B, π), where A = {aij} is the state transition probability matrix with aij = Prob(qt+1 = Sj|qt = Si), 1 ≤ i, j ≤ N; B = {bj(Ot)} is the observation probability set with bj(Ot) = Prob(Ot|qt = Sj), 1 ≤ j ≤ N; and π = {πj} is the initial state distribution with πj = Prob(q1 = Sj). All the equations are based on the work by [68] and make use of the initial state probability distribution.

In the training step, for a given model λ, multiplying each transition probability by each output probability at each step t gives the joint likelihood of a state sequence Q and the corresponding observation sequence O:

$$ P(O, Q | \lambda) = \pi_{q_{1}} b_{q_{1}} (O_{1}) \prod\limits_{t = 2}^{T} a_{q_{t - 1} q_{t}}\, b_{q_{t}} (O_{t}) $$
(25)

In practice, the above equation cannot be evaluated directly because the state sequence is hidden. Therefore, the likelihood \(P\left ({O | \lambda } \right )\) is evaluated by summing over all possible state sequences:

$$ P (O | \lambda) = \underset{Q}{\sum} P(O,Q|\lambda ) $$
(26)

A simple procedure for finding the parameters λ that maximize the above equation, introduced in [9], depends on the forward and backward variables αt(j) = P(O1…Ot, qt = j|λ) and βt(j) = P(Ot+1…OT|qt = j,λ), respectively. These variables are computed inductively by the following three processes:

$$ {\alpha_{1}}(j) = {\pi_{j}}{b_{j}}(O_{1}),1 \le j \le N $$
(27)
$$ {\beta_{T}}(j) = 1,1 \le j \le N $$
(28)

The first process defined in (27) and (28) is known as initialization, and the second is known as the induction process and can be written as:

$$ \alpha_{t + 1}(j) = \left[\sum\limits_{i = 1}^{N} \alpha_{t} (i)\, a_{ij}\right] b_{j} (O_{t + 1}), \quad 1 \le t \le T - 1, \; 1 \le j \le N $$
(29)
$$ \beta_{t} (i) = \sum\limits_{j = 1}^{N} a_{ij}\, b_{j}(O_{t + 1})\, \beta_{t + 1} (j), \quad t = T - 1, T - 2, \ldots, 1, \; 1 \le i \le N $$
(30)

The last process, known as the termination process, can be written as

$$ P (O|\lambda) = \sum\limits_{i = 1}^{N} \alpha_{T} (i) = \sum\limits_{i = 1}^{N} \pi_{i}\, b_{i}(O_{1})\, \beta_{1}(i) $$
(31)

By multiplying the forward and backward probabilities, a new set of HMM quantities γt(j) can be found for each state j by computing weighted averages:

$$ \gamma_{t}(j) = \frac{\alpha_{t} (j)\, \beta_{t} (j)}{\sum_{i = 1}^{N} \alpha_{t}(i)\, \beta_{t}(i)} $$
(32)

The model described by the new set of parameters is denoted \(\overline {\lambda } = (\overline {\pi }, \overline {A}, \overline {B})\). A related quantity ξt(i,j) is used to estimate the transition parameters and is defined as the probability of being in state i at time t and in state j at time t + 1, given the observation sequence and the model:

$$ {\xi_{t}} ({i,j}) = P (q_{t} = i,{q_{t + 1}} = j | O, \lambda) $$
(33)

According to the forward and backward probabilities, (33) can be written as

$$ \xi_{t} (i,j) = \frac{\alpha_{t} (i)\, a_{ij}\, b_{j} (O_{t + 1})\, \beta_{t + 1} (j)}{\sum_{i = 1}^{N} \sum_{j = 1}^{N} \alpha_{t}(i)\, a_{ij}\, b_{j}(O_{t + 1})\, \beta_{t + 1}(j)} $$
(34)

By using the above concept, the new parameter \(\overline {\lambda }\) can be re-estimated as

$$ \overline{\pi}_{i} = \gamma_{1}(i) $$
(35)
$$ \overline{a}_{ij} = \frac{\sum_{t = 1}^{T - 1} \xi_{t} (i,j)}{\sum_{t = 1}^{T - 1} \gamma_{t}(i)} $$
(36)
$$ \overline{\mu}_{i} = \frac{\sum_{t = 1}^{T} \gamma_{t}(i)\, O_{t}}{\sum_{t = 1}^{T} \gamma_{t}(i)} $$
(37)
$$ \overline{V}_{i} = \frac{\sum_{t = 1}^{T} \gamma_{t}(i)\,(O_{t}-\overline{\mu}_{i}) (O_{t}-\overline{\mu}_{i})^{\prime}}{\sum_{t = 1}^{T} \gamma_{t} (i)} $$
(38)

where the prime denotes vector transpose, \(\overline {a}_{ij}\) is the estimated transition probability from state i to state j, and \(\overline {\mu }_{i}\) and \(\overline {V}_{i}\) are the estimates of the mean and covariance matrix of the Gaussian output probability function for state i. Based on the above estimates, \(\overline {\lambda }\) is computed iteratively in place of λ, and this process is repeated until the parameters converge approximately to a critical point, which is a local maximum of P(O|λ).

During testing, the appropriate HMM is determined by means of the likelihood of the observation sequence O computed with each trained λ:

$$ P (O|\lambda) = \sum\limits_{i = 1}^{N} {\alpha_{T}}(i) $$
(39)

The trained HMM providing the maximum likelihood for the observations indicates the recognized label. For more details on HMMs, please refer to [73].
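A sketch of this train-and-score pipeline using hmmlearn follows; the library, the number of states, and the diagonal covariance are our choices, as the text does not fix them:

```python
import numpy as np
from hmmlearn import hmm

def train_expression_hmms(sequences_by_label, n_states=4):
    """Fit one Gaussian HMM per expression class with Baum-Welch
    (Eqs. (27)-(38)). sequences_by_label maps each label to a list of
    (T_i x d) feature sequences."""
    models = {}
    for label, seqs in sequences_by_label.items():
        X = np.vstack(seqs)                    # stack all sequences
        lengths = [len(s) for s in seqs]       # per-sequence lengths
        model = hmm.GaussianHMM(n_components=n_states,
                                covariance_type="diag", n_iter=50)
        model.fit(X, lengths)
        models[label] = model
    return models

def classify(models, seq):
    """Recognized label = argmax over classes of log P(O | lambda),
    as in Eq. (39)."""
    return max(models, key=lambda lbl: models[lbl].score(seq))
```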

4 Experimental environment

We assessed the performance of the proposed FER system in real-world environments by utilizing three YouTube-based datasets [79, 80]: an emulated, a semi-naturalistic, and a naturalistic dataset. Each dataset is described below.

4.1 Emulated dataset

In this dataset, expressions were collected from subjects of different skin colors, ages, and ethnicities. The dataset includes six basic expressions: happy, sad, angry, normal, disgust, and fear. The subjects' ages range from 4 to 60 years, and both males and females are included. In some cases, the camera was rotated while capturing the expressions in order to better assess the accuracy of the system. Each expression has at least 165 images. The images in the dataset are of size 240 × 320 or 320 × 240 pixels and contain the facial frame.

4.2 Semi-naturalistic dataset

In this dataset, the expressions were collected from Hollywood, Bollywood, and Lollywood actors and actresses performing in their respective movies. From all the subjects, six expressions (angry, normal, disgust, happy, fear, and sad) were collected. The dataset contains different views from different angles, subjects with glasses, open hair, or hats, and other commonplace actions. Each expression consists of at least 165 images of size 240 × 320 or 320 × 240 pixels with the facial frame.

4.3 Naturalistic dataset

For this facial expression dataset, a variety of subjects from various parts of the world, races, and ethnicities were selected. Six basic universal expressions (normal, happy, sad, angry, fear, and disgust) were captured from real-world scenarios, mainly real-world talk shows, interviews, and natural YouTube videos such as news footage and real-world incidents. A total of 165 images were considered for each expression. The subjects' ages range from 18 to 50 years. The images in the dataset are of size 240 × 320 or 320 × 240 pixels with the facial frame.

4.4 Setup

For a thorough validation, the following series of experiments was performed in Matlab on an Intel® Core™ i7-6700 (3.4 GHz) machine with 16 GB of RAM.

  • In the first experiment, the performance of the proposed face detection and extraction method was analyzed on each dataset.

  • In the second experiment, the proposed FER system was tested and validated on all the datasets using a 10-fold cross-validation scheme: out of 10 subjects, the data from a single subject were used as the testing data, whereas the data from the remaining 9 subjects were used as the training data. This process was repeated 10 times, with the data from each subject used exactly once as the testing data (see the sketch after this list).

  • In the third experiment, the performance of the proposed FER system was validated in the absence of the proposed methods. For the feature extraction and selection modules, we used well-known statistical methods, ICA and LDA respectively, instead of the proposed wavelet transform and SWLDA.

  • In the fourth experiment, the performance was analyzed across the datasets in order to show the robustness of the proposed system. In other words, of the three datasets, one was used as the training data and the remaining two as the testing data. This process was repeated three times, with each dataset used exactly once as the training data.

  • Finally, in the fifth experiment, the performance of the proposed FER system was compared against previous state-of-the-art works.
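A sketch of the subject-wise protocol used in the second experiment is given below; scikit-learn's LeaveOneGroupOut is our choice of tooling, and the variable names are illustrative:

```python
import numpy as np
from sklearn.model_selection import LeaveOneGroupOut

def subject_wise_folds(features, labels, subject_ids):
    """Yield train/test splits in which each of the 10 subjects is
    held out exactly once, matching the second experiment."""
    logo = LeaveOneGroupOut()
    for train_idx, test_idx in logo.split(features, labels,
                                          groups=subject_ids):
        yield (features[train_idx], labels[train_idx],
               features[test_idx], labels[test_idx])
```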

5 Experimental results and discussion

5.1 First experiment

The proposed face detection and extraction method was validated on all three datasets. The proposed AC model is applied to each frame independently; that is, face detection and extraction are performed frame-by-frame. In this model, an ellipse with an x-axis of length 15 and a y-axis of length 15 is first selected as the initial contour. In this experiment, the initial shape was the same for all frames, and only the center location varied. We manually segmented the first frame by placing the initial contour, which must be close to the face. From the second frame onward, the position of the initial contour's center in the current frame is the mean of the points along the final contour in the previous frame. In this way, the only information carried over from the previous frame is its final contour, which determines the initial position of the active contour in the current frame.

5.2 Second experiment

This experiment shows the performance of the proposed FER system in naturalistic environments. The system was tested and validated on the three datasets (emulated, semi-naturalistic, and naturalistic) separately. The overall results on the three datasets are shown in Tables 1, 2, and 3 and in Figs. 2, 3, and 4, respectively. It is clear from Tables 1, 2, and 3 that the proposed system consistently performed well and achieved high recognition rates on each dataset. This means that, unlike existing methods, the proposed FER system is more robust: it provided high recognition rates not on one dataset only but on all three datasets. The reason is that the proposed feature extraction and feature selection methods are robust to real-life scenarios.

Table 1 Classification results of the proposed FER system on emulated dataset of facial expressions (Unit: %)
Table 2 Classification results of the proposed FER system on semi-naturalistic dataset of facial expressions (Unit: %)
Table 3 Classification results of the proposed FER system on naturalistic dataset of facial expressions (Unit: %)
Fig. 2 3D feature plots for the six expressions after applying the proposed FER system on the emulated dataset

Fig. 3 3D feature plots for the six expressions after applying the proposed FER system on the semi-naturalistic dataset

Fig. 4 3D feature plots for the six expressions after applying the proposed FER system on the naturalistic dataset

5.3 Third experiment

In this experiment, a set of sub-experiments was performed in order to show the importance of the sub-components of the proposed FER system, i.e., the wavelet transform, SWLDA, and HMM. For this purpose, the sub-experiments were performed on each dataset using the 10-fold cross-validation rule.

5.3.1 Results while removing the proposed feature extraction technique

In the first three sub-experiments, ICA (a well-known local feature extraction technique) was utilized with SWLDA and HMM instead of the proposed feature extraction method (i.e., the wavelet transform). The overall results of these sub-experiments on the emulated, semi-naturalistic, and naturalistic datasets are shown in Tables 4, 5, and 6, respectively.

Table 4 Classification rates of ICA + SWLDA with HMM on emulated dataset of facial expressions, while removing the proposed feature extraction (wavelet transform) method (Unit: %)
Table 5 Classification rates of ICA + SWLDA with HMM on semi-naturalistic dataset of facial expressions, while removing the proposed feature extraction (wavelet transform) method (Unit: %)
Table 6 Classification rates of ICA + SWLDA with HMM on naturalistic dataset of facial expressions, while removing the proposed feature extraction (wavelet transform) method (Unit: %)

For feature extraction, we utilized the symlet wavelet transform. It can be seen from Tables 4, 5, and 6 that without the proposed feature extraction method, the system is unable to perform as well. This is because the symlet wavelet extracts the most prominent information, in the form of frequencies, from the expression frames; it is a compactly supported wavelet with the least asymmetry and the highest number of vanishing moments for a given support width. The symlet wavelet supports orthogonal, biorthogonal, and reverse-biorthogonal analysis of gray-scale images, which is why it provides better classification results. The frequency-based assumption is supported by our experiments, in which we measured the statistical dependency of wavelet coefficients across expression frames: the joint probability for a frame is computed by collecting geometrically aligned frames of the expression for each wavelet coefficient, and the mutual information of the wavelet coefficients computed from these distributions estimates the strength of statistical dependency between two frames. Moreover, the symlet wavelet transform extracts prominent features from expression frames by exploiting locality in frequency, orientation, and space. Since the wavelet transform is multi-resolution, it also helps us analyze the images efficiently in a coarse-to-fine manner.

5.3.2 Results while removing the proposed feature selection technique

In the next three sub-experiments, the wavelet transform was coupled with LDA (a well-known discriminant analysis approach) before feeding the features to the proposed HMM. The results of these sub-experiments on the emulated, semi-naturalistic, and naturalistic datasets are presented in Tables 7, 8, and 9, respectively.

Table 7 Classification rates of the wavelet transform + LDA with HMM using emulated dataset of facial expressions, while removing the proposed feature selection (SWLDA) method (Unit: %)
Table 8 Classification rates of the wavelet transform + LDA with HMM using semi-naturalistic dataset of facial expressions, while removing the proposed feature selection (SWLDA) method (Unit: %)
Table 9 Classification rates of the wavelet transform + LDA with HMM using naturalistic dataset of facial expressions, while removing the proposed feature selection (SWLDA) method (Unit: %)

Similarly, it is apparent from Tables 7, 8, and 9 that without the proposed feature selection method (SWLDA), the system was also unable to achieve a high classification rate. This is because SWLDA not only reduces the dimensionality but also increases the between-class variance, improving class separation before the features are fed to the classifier. The low within-class variance and high between-class variance are achieved by the forward and backward regression models in SWLDA.

5.4 Fourth experiment

For this experiment, an n-fold cross-validation rule over the datasets was used (in our case, n = 3). The overall results of this experiment are presented in Tables 10, 11, and 12.

Table 10 Classification rates of the proposed FER system training on emulated dataset and testing on semi-naturalistic and naturalistic datasets (Unit: %)
Table 11 Classification rates of the proposed FER system training on semi-naturalistic and testing on emulated, naturalistic datasets (Unit: %)
Table 12 Classification rates of the proposed FER system training on naturalistic dataset and testing on emulated and semi-naturalistic datasets (Unit: %)

It is clear from Table 10 that the proposed FER system achieved a high recognition rate when it was trained on the emulated dataset and tested on the semi-naturalistic and naturalistic datasets. Similarly, it is apparent from Table 11 that the system achieved slightly better performance when it was trained on the semi-naturalistic dataset and tested on the emulated and naturalistic datasets. However, the system achieved low accuracy when it was trained on the naturalistic dataset and tested on the emulated and semi-naturalistic datasets (Table 12). This might be because the datasets have different facial features and different environments. For instance, the subjects of the emulated dataset performed the expressions in a posed manner, with each subject trying to copy or mimic the instructor, so there was little variation from subject to subject or in timing. However, capturing the expressions from various angles (by placing the camera at different angles) allowed us to test the proposed algorithm on the maximum possible variations in the images; the emulated dataset images are mostly front-faced, right-sided, and left-sided, with up and down orientations. Likewise, the expressions in the semi-naturalistic dataset were collected from movie and drama scenes of professional actors and actresses, where we had no control over expression timing, camera, lighting, or background settings; hence, these are semi-naturalistic expressions collected under dynamic settings. The performance of the system degrades when it is trained on the naturalistic dataset. This is because the expressions in the naturalistic dataset were recorded from real-world talk shows, news, and interviews; hence, these are spontaneous expressions collected in natural and dynamic settings. The dataset includes both indoor and outdoor subjects with varying and dynamic backgrounds, different views from different angles, subjects with glasses, open hair, or hats, and other complex scenarios with commonplace actions and objects. Moreover, the images in this dataset were collected in real-life settings, with a variety of backgrounds, unintentional expressions from the subjects, varying orientations of the subjects' faces, and lighting variations, all factors that may cause misclassification. Nevertheless, the results are very encouraging and suggest that the proposed FER system is robust: the system not only achieved a high recognition rate on a single dataset but also provided good recognition rates across multiple datasets.

5.5 Fifth experiment

In this experiment, the recognition rates of the proposed spontaneous FER system were compared against some of the existing FER systems. The overall results of these systems, along with those of the proposed spontaneous FER system, are summarized in Table 13.

Table 13 The weighted average classification results of the proposed FER system with some existing state-of-the-art systems (Unit: %)

It can be seen from Table 13 that the proposed spontaneous FER system outperformed the existing methods. Thus, the proposed system shows significant potential in its ability to accurately and robustly recognize human facial expressions in naturalistic scenarios.

Furthermore, the proposed FER system was compared with one of the most recent FER systems [37], using the same dataset as [37] for a fair comparison. The proposed FER system achieved 99% accuracy, while the accuracy of [37] is 96%. As can be seen, the proposed FER system performed significantly better than the existing state-of-the-art works.

6 Conclusion and future direction

Facial expression recognition (FER) in naturalistic environments has received much attention, and several FER systems have been proposed for this purpose; however, accurately recognizing expressions in naturalistic environments is still a major concern for most of these systems. Therefore, in this study, we proposed an accurate and robust facial expression recognition system that is capable of achieving a high recognition rate in naturalistic scenarios. In this system, an unsupervised face detection and extraction model is proposed that exploits two energy functions, the Chan-Vese (CV) energy and the Bhattacharyya distance, and thereby not only minimizes the dissimilarities within the object (face) but also maximizes the distance between the object (face) and the background. Furthermore, we proposed a new feature extraction method based on the symlet wavelet transform, which extracts the most prominent information, in the form of frequencies, from the expression frames. Although the proposed feature extraction technique extracts the most informative features, some redundancy may remain among them; therefore, we also proposed the usage of a robust non-linear feature selection method called stepwise linear discriminant analysis (SWLDA), which selects the most informative features via the forward selection model and removes irrelevant features via the backward regression model. The proposed system was tested and validated using three different YouTube-based datasets, collected from YouTube, real-world talk shows and interviews, and daily conversations. Each dataset consists of six facial expressions: happy, sad, anger, disgust, fear, and normal. We utilized a 10-fold cross-validation scheme for each dataset. The system achieved a weighted average recognition rate of 95%, outperforming the existing FER systems, which is a significant contribution to accuracy in naturalistic environments.

All these experiments were performed in the laboratory. In the near future, we will deploy the proposed FER system on smartphones, so that ordinary users can check their mental states during their daily routines.