
1 Introduction

Facial expressions are crucial for social communication. Human communication is both verbal and nonverbal, and facial expressions are a primary nonverbal channel. Mehrabian [40] reported that \(55\%\) of the information exchanged between people is conveyed through facial expressions, \(38\%\) through voice, and \(7\%\) through language [66]. Facial expression recognition (FER) has therefore evolved into a prominent and demanding field of computer vision. Disgust, anger, happiness, fear, surprise, and sadness are the fundamental emotions [13]. Humans are highly skilled at identifying a person’s emotional state, whereas a computer finds this difficult because of variations in occlusion, head pose, and lighting, as well as computational complexity. FER applications include operator tiredness detection [77], automotive systems, healthcare, automated tutoring systems [67], mental state recognition [39], security [6], mood-based music selection [12], and rating products or services in banks, malls, and showrooms. With the help of FER, one can also study how well students interact in a classroom or with teachers [56]. Mobile applications with built-in FER can help visually impaired persons (VIPs) in daily communication, and FER systems can detect a driver’s fatigue and stress level to support safer driving decisions. Facial image acquisition, pre-processing, feature engineering, training, and classification are the typical FER stages; Fig. 1 depicts these steps. Pre-processing removes noise, and feature engineering extracts distinctive visual characteristics. Popular feature engineering techniques are the Histogram of Oriented Gradients (HOG) [10], Local Directional Pattern (LDP) [23], Gabor filters [61], Local Binary Patterns (LBP) [52], Principal Component Analysis (PCA) [2], Independent Component Analysis (ICA), and Linear Discriminant Analysis (LDA) [5]. The extracted features are used to train a classifier with expression class labels. FER approaches fall into deep learning and conventional learning based on feature engineering: deep learning uses a large number of example images to learn and tune feature extraction parameters, while conventional learning relies on algorithms that extract hand-crafted features. Deep learning classifiers use fully connected layers followed by a sigmoid or softmax layer at the classification stage, whereas k-nearest neighbor (KNN) and support vector machine (SVM) are well-known classifiers in conventional learning. A FER system’s accuracy depends on the variability of the captured data, feature extraction, classification, and fine-tuning, while its inference time depends on camera resolution, feature engineering, the classifier, and hardware computation capabilities.

Fig. 1. Different steps of facial expression recognition system

This work primarily reviews various FER approaches in terms of their three primary processes: pre-processing, feature engineering, and classification. It also highlights the benefits of different FER methods and presents a performance analysis of various FER methods. Only image-based FER approaches are covered in this literature review; video-based FER techniques are excluded. FER systems often have to deal with illumination fluctuations, skin tone variations, occlusion, and pose variations. This work also provides suggestions for future FER research. The remainder of the paper is organized into six sections, including this introduction. Section 2 presents the related research work, including the state of the art for FER. Section 3 lists the most frequently used benchmarking datasets for FER. Section 4 provides an overview of FER feature engineering. Section 5 compares the performance of different FER systems. Finally, Sect. 6 offers a conclusion.

2 Related Work

FER has a wide range of applications in computer vision. Because of differences in pose, illumination, scale, and orientation, recognizing facial expressions can be difficult. The primary goal of feature engineering is to find robust features that improve the reliability of expression recognition, so the feature extraction and classification stages are critical in FER. There are two kinds of feature extraction: geometric and appearance-based. Geometric feature extraction relies on facial components such as the eyes, mouth, nose, brows, and ears, whereas appearance-based feature extraction operates on the appearance of specific regions of the face [66].

Abdullah et al. [1] used PCA to reduce the face image to a small set of eigenface features, representing each face with a finite feature description. Yadav et al. [70] extracted facial features using Gabor filters and two-dimensional PCA. ICA identifies characteristics from statistically independent local face regions [59]; Lee et al. [30] used ICA to extract statistically independent features from local face parts across various facial expressions. Mehta and Jadhav [41] classified human emotions using the Gabor filter. Islam et al. [22] used HOG and LBP to extract local characteristics. LBP features are easy to compute, and ICA is less tolerant of illumination fluctuations than LBP. Edge pixels are needed to extract facial features from an image, and the Local Directional Pattern (LDP) captures such visual gradients. In FER, LDP represents gradient-based properties of the local face in the pixel’s eight prime directions [23].

In classic LDP features, the highest edge strengths determine the binary values, and the threshold varies by experiment. LDP also ignores the sign of a pixel’s directional strength, so it cannot differentiate edge pixels with comparable strengths but opposite signs. Uddin et al. [60] overcame this problem by sorting pixels’ major edge strengths in decreasing order and using their signs to build stable features. Many recent attempts have been made to recognize facial expressions from videos or images using deep learning. To learn appearance features from video frames and geometric features from raw face landmarks, Jung et al. [24] merged two deep learning-based models, and a joint learning method was then used to combine the two models’ outputs. Zeng et al. [75] improved performance by incorporating hand-picked features into the deep network training. Recently, several deep learning methods have been developed for FER and applied to real-time images. Wang et al. [63] introduced the Region Attention Network (RAN) for pose-variant and occluded-face FER; in that work, a region-biased loss and region attention mechanisms capture the importance of pose-variant and occluded facial regions. Wang et al. [62] proposed a ResNet-18-based Self-Cure Network (SCN) in which uncertainties caused by low-quality images are suppressed. Li et al. [32] proposed a model that includes an attention mechanism in a CNN to recognize expressions from a partially occluded face.

3 Review Analysis of Facial Expression Dataset

This section describes FER benchmark datasets. A summary of these datasets, i.e., collection conditions, environmental challenges, expression distribution, and the number of images and subjects, is shown in Table 1. In the CK+ dataset, training, testing, and validation sets are not specified. Due to non-uniform expression representation, MMI contains substantial inter-personal discrepancy. The JAFFE dataset has few samples per subject and expression. AFEW is a multi-modal, temporal dataset captured under varied environmental conditions. CMU Multi-PIE and BU-3DFE examine multi-view facial expressions.

Table 1. Benchmarking datasets for facial expression recognition

4 Review of Feature Engineering Techniques

FER accuracy depends on feature engineering. Features can be hand-picked (hand-crafted) or deep-learned. Single-task learning (STL) typically relies on hand-picked features, whereas deep learning techniques learn features iteratively from data. FER’s traditional feature engineering methodologies are as follows:

4.1 Gaussian Mixture Model

The Gaussian Mixture Model (GMM) groups data into clusters that are distinct from each other, with the data points in each cluster modeled by a distribution. A weighted sum of Gaussian functions can approximate many probability distributions. A Gaussian mixture model is the sum of k component Gaussian densities for a vector x, as shown in Eq. 1.

$$\begin{aligned} p(x)=\sum _{j=1}^{k} w_j p(x|j) \end{aligned}$$
(1)

where x is a D-dimensional data vector; \(w_j\), \(j = 1, 2, \dots , k\), are the weights of the mixture; and p(x|j) is the Gaussian density model of the \(j^{th}\) component.

Gaussian one-dimensional probability density function is represented in Eq. 2.

$$\begin{aligned} G(X|\mu ,\sigma ) = \frac{1}{{\sigma \sqrt{2\pi }}} {e^{{{-\left( {x-\mu }\right) ^2}/{2\sigma ^2}}}} \end{aligned}$$
(2)

Here \(\mu \) represents the mean, and \(\sigma ^2\) represents the distribution variance.

Multivariate Gaussian distribution probability density function is given by Eq. 3 [19].

$$\begin{aligned} G(X|\mu ,\varSigma ) = \frac{1}{\sqrt{(2\pi )^d|\varSigma |}} \exp \left( -\frac{1}{2}(x-\mu )^{T} \varSigma ^{-1}(x-\mu )\right) \end{aligned}$$
(3)

where \(\mu \) is a d dimensional vector denoting the mean of the distribution and \(\varSigma \) is the \(d\times d\) covariance matrix. The Expectation-Maximization (EM) method estimates model parameters.
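As an illustration, the sketch below fits such a mixture with the EM algorithm using scikit-learn (a minimal sketch, assuming scikit-learn is installed); the two synthetic clusters merely stand in for extracted facial feature vectors and are not taken from any FER dataset.

```python
# Minimal EM fit of a Gaussian Mixture Model (Eqs. 1-3) on synthetic features.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Two synthetic clusters standing in for feature vectors of two expression classes
features = np.vstack([
    rng.normal(loc=0.0, scale=1.0, size=(200, 2)),
    rng.normal(loc=5.0, scale=1.5, size=(200, 2)),
])

gmm = GaussianMixture(n_components=2, covariance_type="full", random_state=0)
gmm.fit(features)                                # EM estimates the weights w_j, means mu_j, covariances Sigma_j

print("mixture weights:", gmm.weights_)
print("component means:\n", gmm.means_)
log_density = gmm.score_samples(features[:5])    # log p(x) from Eq. 1
print("log p(x) for the first samples:", log_density)
```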

4.2 Local Binary Pattern (LBP) Based Features

LBP captures local spatial patterns and contrast in the facial image. LBP labels image pixels by thresholding each pixel’s neighborhood and produces a binary number [47]. LBP is computed in four steps as follows:

  • For each pixel (x, y) in an image I, P neighboring pixels are chosen at a radius R.

  • Intensity difference of the P adjacent pixels is determined.

  • Positive intensity differences are assigned one (1) and negative intensity differences are assigned zero (0).

  • Convert the P-bit vector to decimal. LBP descriptor is shown in Eq. 4.

The LBP operator is denoted \(LBP_{P, R}\), where the subscript indicates that the operator is applied in a (P, R) neighborhood.

$$\begin{aligned} LBP_{P,R}=\sum _{p=0}^{P-1} f(i_p-i_c)\, 2^p \end{aligned}$$
(4)

where P denotes the number of neighboring pixels chosen at a radius R. \(i_c\) and \(i_p\) represent the intensity of the center and neighboring pixel, respectively. Thresholding function f is as follows:

$$\begin{aligned} f(x)=\left\{ \begin{aligned} 0{} & {} x < 0 \\ 1{} & {} x\ge 0 \end{aligned} \right. \end{aligned}$$
(5)

The LBP histogram is defined as:

$$\begin{aligned} H_j= \sum _{x,y} I\{f_l(x, y) = j\}, \quad \quad j=0,\dots ,n-1 \end{aligned}$$
(6)

where n is the number of labels created by the LBP operator.

$$\begin{aligned} I(M)=\left\{ \begin{aligned} 1,{} & {} \text {if M is true}\\ 0,{} & {} \text {if M is false } \end{aligned} \right. \end{aligned}$$
(7)

Different-sized image patches are normalized using Eq. 8.

$$\begin{aligned} N_j = \frac{H_j}{\sum _{k=0}^{n-1} H_k} \end{aligned}$$
(8)
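The following is a minimal sketch of the LBP pipeline of Eqs. 4–8, assuming scikit-image is available; the file name face.png is a hypothetical grayscale face crop.

```python
# Per-pixel LBP codes followed by a normalized label histogram.
import numpy as np
from skimage import io, img_as_ubyte
from skimage.feature import local_binary_pattern

P, R = 8, 1                                                   # P neighbors on a circle of radius R
image = img_as_ubyte(io.imread("face.png", as_gray=True))     # grayscale face image (hypothetical path)

lbp = local_binary_pattern(image, P, R, method="default")     # LBP code per pixel (Eq. 4)

n_bins = 2 ** P                                               # number of possible labels n
hist, _ = np.histogram(lbp.ravel(), bins=n_bins, range=(0, n_bins))   # H_j (Eq. 6)
hist = hist.astype(float) / hist.sum()                        # normalized histogram N_j (Eq. 8)
print(hist.shape)                                             # a 256-dimensional descriptor for this patch
```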

4.3 Gabor Filter Feature Extraction Technique

Edges and texture are essential features of the face image, and they are obtained by convolving the face image with a Gabor filter kernel. The Gabor filter is an illumination-invariant Gaussian-modulated sinusoid. The Gabor filter kernel [65] is defined in Eq. 11. Its parameters are \(\lambda \) (wavelength), which specifies the number of cycles of the sinusoid; \(\theta \) (orientation), the angle of the normal to the sinusoidal plane; and \(\phi \) (phase), the offset of the sinusoid. The frequency bandwidth of the Gabor filter is:

$$\begin{aligned} b = \log _2 \frac{(\sigma /\lambda )\pi + \sqrt{\log 2/2}}{(\sigma /\lambda )\pi - \sqrt{\log 2/2}} \end{aligned}$$
(9)
$$\begin{aligned} \frac{\sigma }{\lambda } = (1/\pi ) \sqrt{\log 2/2} \frac{2^b+1}{2^b-1} \end{aligned}$$
(10)

The bandwidth b determines the \(\sigma \) value. Convolving the face image I(x, y) with the Gabor kernel \(\varPsi (\theta ,\lambda ,\gamma ,\phi )\) produces Gabor texture-edge features, as shown in Eq. 13 [18]. The Gabor kernel \(\varPsi (\theta ,\lambda ,\gamma ,\phi )\) is complex-valued, as shown in Eq. 14. The real (\(GI_{R}\)) and imaginary (\(GI_{Im}\)) responses are created by convolving the image I(x, y) with the real part R(\(\varPsi \)) and imaginary part Im(\(\varPsi \)) of the kernel, as shown in Eqs. 15 and 16. Equation 17 gives the amplitude features GF of the Gabor responses. The Gabor filter produces redundant, high-dimensional features; PCA and ICA can address this issue.

$$\begin{aligned} \varPsi _{\theta ,\lambda ,\gamma ,\phi }(x,y)=\exp \bigg (-\frac{a'^2+\gamma ^2 b'^2}{2\sigma ^2}\bigg ) e^{j\frac{2\pi {a'}}{\lambda }} \end{aligned}$$
(11)

Here, \(a'\) and \(b'\) are the direction coefficients and \(\theta \) represents the projection angle.

$$\begin{aligned} {a'} = a\cos \theta + b\sin \theta \quad \textrm{and}\quad {b'} = -a\sin \theta + b\cos \theta \end{aligned}$$
(12)
$$\begin{aligned} GI= I(x,y) * \varPsi (\theta ,\lambda ,\gamma ,\phi ) \end{aligned}$$
(13)
$$\begin{aligned} \varPsi (\theta ,\lambda ,\gamma ,\phi ) = R(\varPsi (\theta ,\lambda ,\gamma ,\phi )) + j\, Im(\varPsi (\theta ,\lambda ,\gamma ,\phi )) \end{aligned}$$
(14)
$$\begin{aligned} GI_{R}(\theta ,\lambda ,\gamma ,\phi )=I(x,y)*R(\varPsi (\theta ,\lambda ,\gamma ,\phi )) \end{aligned}$$
(15)
$$\begin{aligned} GI_{Im}(\theta ,\lambda ,\gamma ,\phi )=I(x,y)*Im(\varPsi (\theta ,\lambda ,\gamma ,\phi )) \end{aligned}$$
(16)
$$\begin{aligned} GF(\theta ,\lambda )=\left( GI_{R}(\theta ,\lambda ,\gamma ,\phi )^2+GI_{Im}(\theta ,\lambda ,\gamma ,\phi )^2\right) ^{1/2} \end{aligned}$$
(17)
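For concreteness, the sketch below extracts Gabor amplitude features (Eqs. 13–17) with a small filter bank; it is a minimal sketch assuming scikit-image is available, and face.png is a hypothetical grayscale face crop. The number of orientations and frequencies is illustrative.

```python
# Gabor filter-bank responses reduced to a compact per-image descriptor.
import numpy as np
from skimage import io
from skimage.filters import gabor

image = io.imread("face.png", as_gray=True)

features = []
for theta in np.arange(0, np.pi, np.pi / 4):            # 4 orientations
    for frequency in (0.1, 0.2, 0.3):                   # 3 spatial frequencies (1 / lambda)
        real, imag = gabor(image, frequency=frequency, theta=theta)   # GI_R, GI_Im (Eqs. 15-16)
        amplitude = np.sqrt(real ** 2 + imag ** 2)       # GF (Eq. 17)
        features.extend([amplitude.mean(), amplitude.var()])

gabor_vector = np.asarray(features)   # compact descriptor; PCA/ICA can reduce it further
print(gabor_vector.shape)             # (24,)
```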

4.4 Scale-Invariant Feature Transform (SIFT)

SIFT features are invariant to the scale of the image. The steps for calculating SIFT features are as follows; a minimal code sketch follows the list.

  1. Scale-space extrema detection: a difference-of-Gaussians search identifies candidate interest points that are invariant to scale and rotation; the scale and image location of each candidate are computed.

  2. Key point localization: only stable, well-contrasted points are retained based on intensity.

  3. Orientation assignment: each key point is assigned an orientation based on the direction of the local gradients.

  4. Key point descriptor: a SIFT descriptor built from the region around each key point describes its local appearance.

  5. Key point matching: the nearest-neighbor descriptors of two images are matched.
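The sketch below illustrates these five steps on two hypothetical face crops; it assumes OpenCV 4.4 or later, where SIFT ships in the main module, and uses Lowe's ratio test for the matching step.

```python
# SIFT keypoint detection, description, and nearest-neighbor matching.
import cv2

img1 = cv2.imread("face1.png", cv2.IMREAD_GRAYSCALE)   # hypothetical face crops
img2 = cv2.imread("face2.png", cv2.IMREAD_GRAYSCALE)

sift = cv2.SIFT_create()
kp1, des1 = sift.detectAndCompute(img1, None)   # steps 1-4: detection, localization,
kp2, des2 = sift.detectAndCompute(img2, None)   # orientation, and 128-D descriptors

# Step 5: nearest-neighbor matching with Lowe's ratio test
matcher = cv2.BFMatcher()
matches = matcher.knnMatch(des1, des2, k=2)
good = [m for m, n in matches if m.distance < 0.75 * n.distance]
print(len(kp1), len(kp2), len(good))
```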

4.5 Histogram of Oriented Gradient (HOG) Feature Extraction

Facial characteristics vary between individuals; for instance, a woman’s face is typically rounder than a man’s, which helps distinguish gender. HOG captures such structure through the directions of image gradients: edge directions define the shape and local appearance [10]. The image is divided into blocks, HOG features are computed for each block, and all HOG features are concatenated into one vector. The HOG computation begins with the image gradient. For a face image F,

$$\begin{aligned} {F_x = F(r, c+1)-F(r, c-1)}, \quad \ \quad {F_y = F(r-1, c)-F(r+1, c)} \end{aligned}$$
(18)

here r and c represent rows and columns, respectively.

The magnitude (G) and orientation (\(\theta \)) of the gradient are computed by

$$\begin{aligned} \mid G \mid \, = \sqrt{{F_x}^2 + {F_y}^2} \quad \textrm{and}\quad \theta ={\tan ^{-1}}\frac{F_y}{F_x} \end{aligned}$$
(19)

The orientation range is 0–360\(^{\circ }\) for signed gradients and 0–180\(^{\circ }\) for unsigned gradients. After determining the magnitude and orientation of each pixel in a cell, the cell histograms are normalized over blocks. Concatenating the HOG features of all blocks creates the final feature vector.
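A minimal sketch of this pipeline (gradients of Eqs. 18–19, cell histograms, block normalization, and concatenation), assuming scikit-image and a hypothetical grayscale face crop face.png:

```python
# HOG descriptor for a single face image.
from skimage import io
from skimage.feature import hog

image = io.imread("face.png", as_gray=True)

hog_vector = hog(
    image,
    orientations=9,            # 9 unsigned-gradient bins over 0-180 degrees
    pixels_per_cell=(8, 8),    # cell size for the local histograms
    cells_per_block=(2, 2),    # blocks of cells used for normalization
    block_norm="L2-Hys",
    feature_vector=True,       # concatenate all block histograms into one vector
)
print(hog_vector.shape)
```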

4.6 Discrete Wavelet Transform (DWT)

The 2D-DWT is computed by first applying the 1D-DWT to the rows of the image matrix and then to the columns of the result. The LL (low-frequency), LH, HL, and HH (high-frequency) sub-bands represent the approximation, horizontal, vertical, and diagonal frequency components, respectively. The LL band approximates a low-resolution image by discarding fine details: the low-frequency band (LL) smooths the input image, while the high-frequency bands capture edge patterns [54]. Iteratively applying the 2D-DWT to the LL band further reduces the feature size.
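The following is a minimal sketch of this iterative decomposition, assuming PyWavelets (pywt) and scikit-image are available and face.png is a hypothetical grayscale face crop; the Haar wavelet is used only for illustration.

```python
# Two-level 2D-DWT: reuse the LL band to shrink the feature size.
import pywt
from skimage import io

image = io.imread("face.png", as_gray=True)

# Level 1: LL (approximation) plus LH, HL, HH (detail) sub-bands
LL, (LH, HL, HH) = pywt.dwt2(image, "haar")
# Level 2: re-apply the DWT on the LL band to reduce the feature size further
LL2, (LH2, HL2, HH2) = pywt.dwt2(LL, "haar")

print(image.shape, LL.shape, LL2.shape)   # each level roughly halves both dimensions
features = LL2.ravel()                    # low-resolution approximation used as features
```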

4.7 Principal Component Analysis (PCA)

PCA finds correlations across attributes and exploits the strongest variance patterns to reduce data dimensionality. In PCA, the mean is subtracted from the given image, the covariance matrix is calculated using \(FM^T\), and then the eigenvalues and eigenvectors are computed. The eigenvectors corresponding to the high-magnitude eigenvalues at a given significance level carry the essential information about the image’s variance. Equation 20 defines the PCA significance level.

$$\begin{aligned} \varepsilon =\frac{\sum _{i=1}^{m}\lambda _{i}}{\sum _{i=1}^{n}\lambda _{i}} \quad \quad m \le n \quad \text {and} \quad 0 \le \varepsilon \le 1 \end{aligned}$$
(20)

here \(\lambda _i\) denotes the \(i^{th}\) eigenvalue in decreasing order of magnitude, and \(m \le n\).
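A minimal sketch of this procedure, including the significance level of Eq. 20, assuming NumPy; the data matrix here is synthetic and simply stands in for flattened face images.

```python
# PCA via eigen-decomposition of the covariance matrix of mean-centered data.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 256))                      # 100 flattened face patches (placeholder data)

X_centered = X - X.mean(axis=0)                      # subtract the mean face
cov = np.cov(X_centered, rowvar=False)               # covariance matrix of the centered data
eigvals, eigvecs = np.linalg.eigh(cov)               # eigen-decomposition (ascending order)
eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]   # sort by decreasing eigenvalue

m = 32                                               # keep the m leading components
epsilon = eigvals[:m].sum() / eigvals.sum()          # significance level (Eq. 20)
projected = X_centered @ eigvecs[:, :m]              # reduced feature vectors
print(f"retained variance: {epsilon:.3f}", projected.shape)
```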

4.8 Deep-Learning Feature Engineering

Recent research has emphasized deep learning, where features are extracted using a convolutional neural network (CNN). Deep neural networks (DNNs) were proposed to retrieve patterns from high-dimensional data [27], but they train slowly and tend to overfit. The Deep Belief Network [35] is used to tackle these challenges, with the Restricted Boltzmann Machine (RBM) used for training features [42]. A joint learning algorithm can be used to combine geometric and appearance features [24].
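To make the contrast with hand-crafted features concrete, the sketch below shows a toy CNN feature extractor with a fully connected classification head, assuming PyTorch; the layer sizes, 48x48 input, and seven expression classes are illustrative choices, not a specific architecture from the cited literature.

```python
# Toy CNN: convolutional feature extraction followed by a softmax-style head.
import torch
import torch.nn as nn

class SimpleFERNet(nn.Module):
    def __init__(self, num_classes: int = 7):
        super().__init__()
        self.features = nn.Sequential(               # learned feature engineering
            nn.Conv2d(1, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.classifier = nn.Sequential(             # fully connected classification stage
            nn.Flatten(),
            nn.Linear(64 * 12 * 12, 128), nn.ReLU(),
            nn.Linear(128, num_classes),             # logits; softmax applied at inference
        )

    def forward(self, x):
        return self.classifier(self.features(x))

model = SimpleFERNet()
dummy = torch.randn(1, 1, 48, 48)                    # one 48x48 grayscale face
print(model(dummy).shape)                            # torch.Size([1, 7])
```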

5 Performance Analysis of Different FER Systems

The performance analysis in this review considers the pre-processing, recognition accuracy on various datasets, feature extraction methods, contributions, and advantages of different FER techniques. Table 2 presents a comparative analysis of facial expression recognition techniques to better convey the complexity and accuracy of each method.

Table 2. Performance comparison based on different hand-picked feature engineering (conventional learning) and deep learning approaches for facial expression recognition.

5.1 Conventional Learning-Based FER Analysis

LBP feature extraction with pairwise classification achieved \(99.05\%\) accuracy on JAFFE, outperforming the other methods [9]. Pairwise classifiers select features per class pair, and the feature extraction is more dependable because it does not rely on manually or automatically assigned fiducial points. Islam et al. [22] used HOG and LBP features with an artificial neural network (ANN) classifier to obtain \(99.67\%\) accuracy on the CK+ dataset. Feature fusion gives promising results, and the ANN employs the limited-memory BFGS (L-BFGS) technique for weight optimization, but the feature dimension increases; Principal Component Analysis (PCA) is used to reduce the dimension (a minimal fusion sketch follows). Ryu et al. [50] extracted features via the Local Directional Ternary Pattern (LDTP), used a Support Vector Machine (SVM) classifier, and obtained \(99.8\%\) accuracy on the MMI dataset.
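The following is a hedged sketch of the feature-fusion idea discussed above (HOG and LBP histograms concatenated, then a neural classifier trained with L-BFGS), assuming scikit-image and scikit-learn; the MLPClassifier stands in for the ANN of [22], and the face and label arrays are placeholders, not data from any benchmark.

```python
# Hand-crafted HOG + LBP feature fusion with an L-BFGS-trained MLP classifier.
import numpy as np
from skimage.feature import hog, local_binary_pattern
from sklearn.neural_network import MLPClassifier

def fused_features(image):
    hog_vec = hog(image, orientations=9, pixels_per_cell=(8, 8), cells_per_block=(2, 2))
    lbp = local_binary_pattern((image * 255).astype(np.uint8), P=8, R=1, method="uniform")
    lbp_hist, _ = np.histogram(lbp, bins=10, range=(0, 10), density=True)
    return np.concatenate([hog_vec, lbp_hist])       # fused HOG + LBP descriptor

rng = np.random.default_rng(0)
faces = rng.random((20, 48, 48))                     # placeholder grayscale face crops
labels = rng.integers(0, 7, size=20)                 # placeholder expression labels

X = np.array([fused_features(f) for f in faces])
clf = MLPClassifier(hidden_layer_sizes=(64,), solver="lbfgs", max_iter=500)
clf.fit(X, labels)
print(clf.score(X, labels))
```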

5.2 Deep Learning-Based FER Analysis

The Deeper Cascaded Peak-piloted Network (DCPN) [74] achieved the best accuracy of \(99.6\%\) on the CK+ dataset, which is higher than the other approaches in Table 2. Mahmoudi et al. [38] developed a CNN-based bilinear model that achieved \(77.81\%\) accuracy on the unconstrained FER-2013 dataset.

6 Conclusions

This paper presents a review of different feature engineering techniques, provides a detailed analysis of the pros and cons of each technique, and offers a comparative study of benchmarking datasets. The techniques are categorized into conventional learning and deep learning. Conventional learning includes feature extraction techniques such as LBP, PCA, Gabor filters, HOG, DCT, and DWT, while deep learning includes convolutional neural networks and their variants for facial expression recognition. FER systems based on conventional learning and deep learning are compared in terms of accuracy on benchmarking datasets. Hybrid features provide a better recognition rate than single features. This paper analyzed the different FER techniques according to pre-processing, feature engineering, classification, recognition accuracy, and key contributions. The success of a FER approach depends on pre-processing of the facial images, owing to illumination variations, and on prominent feature engineering. Deep learning models perform significantly better than conventional learning on real-time datasets, but they need a large amount of data and image variability; the performance of these algorithms improves with the size of the dataset. The JAFFE and CK+ datasets are the most frequently used in FER systems, but they do not contain all of the variability of real-time images. Although much research has been done on FER, identifying facial expressions in real life remains difficult due to frequent head movements, subtle facial deformations, and other real-time variability, which motivates researchers to continue searching for possible solutions.