1 Introduction

Emotion recognition determines or verifies an individual's emotion from the face, body language, or speech. Because faces are readily available in a wide range of applications, the face is used in emotion recognition systems more than other biometrics. In the image processing field, the pipeline comprises three phases: preprocessing, feature extraction, and classification. Preprocessing is the preliminary processing of raw data to prepare it for the emotion recognition system. After the preprocessing phase, extracting distinctive features is a vital step in the emotion recognition system. Face feature extraction methods are divided into appearance-based and geometric methods [1, 19, 26, 31, 37]. In geometric feature extraction, the position and shape of facial components, such as the eyes, mouth, nose, and eyebrows, are used to identify distinctive features. In appearance-based feature extraction, the histogram of oriented gradients (HOG) [14], discrete wavelet transform (DWT) [32], local binary pattern (LBP) [55], local directional pattern (LDP) [27], local ternary pattern (LTP) [61], and Gabor filters [22] are employed. Furthermore, the convolutional neural network (CNN) [3, 5, 13, 15, 21, 25, 33], a type of deep learning algorithm, is used to extract features from images and learn to recognize patterns.

Tuning hyperparameters is an effective way to design a high-performance model. Hyperparameters can be set either manually or automatically. Manual adjustment requires considerable resources and time, whereas meta-heuristic algorithms can tune hyperparameters automatically. Meta-heuristic algorithms, inspired by natural phenomena, are used to search for and optimize solutions. Among the many meta-heuristic algorithms available for adjusting hyperparameters, the whale optimization algorithm (WOA) [41], a swarm intelligence algorithm, can be used to reach optimal solutions.

In the following, several related works in the field of facial emotion recognition are reviewed. Boughida et al. [11] detect facial emotions in three steps: first, the face is detected; then, a Gabor filter extracts features from regions of interest; finally, a genetic algorithm performs feature selection and tunes the hyperparameters of a support vector machine (SVM). In [26], Iqbal et al. propose the neighborhood-aware edge direction pattern (NEDP), a descriptor that generates patterns from the gradient of the neighbors of the target pixel. Farkhod and Chae [19] introduced the local prominent directional pattern (LPDP), a feature extraction approach based on local edge-based descriptors for recognizing emotions from faces. Jeen Retna Kumar et al. [28] proposed a facial emotion recognition system comprising face recognition, feature extraction by the new subband selective multilevel stationary biorthogonal wavelet transform (SM-SBWT), and SVM classification. Kola and Samayamantula [35] use four neighbors and diagonal neighbors in LBP for feature extraction; in addition, adaptive windows and averaging in radial directions complete the proposed method. Bendjillali et al. [8] described a facial expression recognition method whose steps include the Viola–Jones face detection algorithm, the contrast limited adaptive histogram equalization (CLAHE) algorithm for enhancing the facial image, the discrete wavelet transform (DWT) for feature extraction, and a deep CNN for classification. Nigam et al. [44] proposed a facial emotion recognition pipeline: first, faces are detected by the Viola–Jones method; next, the discrete wavelet transform maps the spatial domain into the frequency domain; then, features are extracted with the histogram of oriented gradients; finally, an SVM performs classification. Boughanem et al. [10] suggested a multi-channel CNN based on three models, VGG19, GoogleNet, and ResNet101, to form a more robust feature vector, with an SVM classifier for emotion recognition. Mukhopadhyay et al. [43] presented a facial expression recognition method that exploits texture image features such as LBP, LTP, and the completed local binary pattern (CLBP), with a CNN model recognizing the expressions. Meena et al. [40] suggested a system that uses the histogram of oriented gradients (HOG) for feature extraction and graph signal processing (GSP) to reduce the feature vector. Alphonse and Dharma [2] reported the maximum response-based directional texture pattern (MRDTP) and the maximum response-based directional number pattern (MRDNP) for feature extraction, introduced an effective generalized supervised dimension reduction system (GSDRS), and used an extreme learning machine with radial basis function (ELM-RBF) for classification. Barra et al. [6] proposed a geometry-based approach that analyzes salient points through a virtual spider web on the face, with emotion classified by K-nearest neighbors. Kola and Samayamantula [36] elaborated a feature extraction approach that combines local gradient coding based on horizontal and diagonal neighbors (LGC-HD) with the wavelet transform and singular value decomposition to compute the facial image's singular values. Arora et al. [4] proposed a facial emotion recognition method using a gradient filter and PCA for feature extraction, with a random forest algorithm for classification. Tuncer et al. [60] presented a novel graph-based texture transform for feature extraction in automatic facial expression detection, with linear discriminant analysis (LDA) and an SVM classifier used for classification. Kar et al. [3] used the ripplet transform type II (ripplet-II) for feature extraction; principal component analysis (PCA) and LDA compress and discriminate the features, and the least squares variant of the support vector machine (LS-SVM) performs classification. Baygin et al. [7] suggested an approach for recognizing emotional expressions related to individuals' social behavior; this model has four prime phases: facial areas are segmented, features are extracted with AlexNet and MobileNetV2, the 1000 most valuable features are selected by neighborhood component analysis (NCA), and these features are classified with an SVM. The method of Bentoumi et al. [9] is based on CNN models (VGG16, ResNet50) for feature extraction with a multilayer perceptron (MLP) classifier.

The drawbacks of the methods mentioned above, compared to the proposed method, are described below. The methods in [2, 31, 40, 60] have higher computational complexity than the proposed model. On comparable datasets, the proposed system achieves higher accuracy in recognizing facial expressions than the methods in [11, 19, 26, 60]. Moreover, the method of this study extracts distinctive features by combining the well-known hand-crafted feature extraction methods HOG [14], DWT [32], LBP [55], LDP [27], and Gabor [22] with the automatic feature extraction of CNN, so that it exploits the advantages of both and performs better in recognizing facial emotions on real samples. In addition, compared with the reviewed methods, the proposed work uses data augmentation techniques to extract more features, increase the generalization ability of the model, and achieve high accuracy in emotion recognition thanks to the variety of data. Besides, a meta-heuristic algorithm is employed to fine-tune the CNN hyperparameters and improve network efficiency. Therefore, this paper introduces a new technique called the local sorting binary pattern (LSBP). Our model combines LSBP, which captures spatial features, with a CNN optimized by the WOA algorithm, which extracts higher-level features from the image data, for facial emotion recognition. In addition, the CNN classifies the facial expressions (happiness, sadness, anger, contempt/neutral, disgust, fear, and surprise). The proposed method is evaluated on the CK+ [38], JAFFE [39], and MMI [47] datasets. The contributions of our work are as follows:

  • Preparing data and data augmentation in the preprocessing step for emotion recognition.

  • Employing WOA to tune hyperparameters of CNN architecture.

  • Obtaining high accuracy in emotion recognition by using a combination of LSBP and CNN for feature extraction.

  • Employing the CNN classifier in the learning and testing phases.

  • Using different facial image datasets to evaluate the proposed model compared to related facial emotion recognition models.

The rest of the paper is organized as follows: related concepts are explained in Sect. 2. In Sect. 3, the proposed method is presented. Experimental results are reported in Sect. 4. Finally, Sect. 5 concludes the paper.

2 Background

This section discusses the CNN algorithm, the whale optimization algorithm, and the LBP approach as three basic concepts underlying the proposed method.

2.1 Convolutional neural network

Artificial intelligence is the simulation of human intelligence. Machine learning, a subset of artificial intelligence, allows computers to learn: learning is carried out by algorithms based on the features of the collected data.

Deep learning (DL) is a machine learning method, and neural networks are the basis of deep learning algorithms. A neural network consists of input, hidden, and output layers connected similarly to neurons in the human brain. DL emphasizes representation learning [16], which uses raw data to extract features automatically. As a representation learning technique, DL transforms the representation into more abstract levels by hierarchically applying nonlinear modules in each layer [42]. Machine learning methods can be divided into two categories: supervised and unsupervised learning. In supervised learning, algorithms use labeled datasets to predict data classes. CNN is a well-known supervised DL algorithm for image data processing and classification. As an artificial neural network with multiple hidden layers, CNN can be used for feature extraction. There are different CNN architectures, such as LeNet, AlexNet, VGG, GoogleNet, ResNet, DenseNet, and SENet [31, 48]. The CNN network consists of the following components, as shown in Fig. 1:

Fig. 1
figure 1

CNN structure

  • Convolutional layer: a layer containing one or more filters, each smaller than the input image. Convolving a filter with the image produces a feature map. The outputs of the convolution are passed through a nonlinear activation function, and selecting an appropriate activation function affects the performance of the network. Common activation functions include swish, ReLU [62], tanh, softmax, and sigmoid [3, 45, 51, 58]; ReLU is among the most popular for this layer [Eq. (1)].

$$ {\text{ReLU}}\left( x \right) = \max \left( {0,x} \right) $$
(1)

where \(x\) is the input value.

  • Pooling layer: the pooling layer follows the convolutional layer. It reduces the dimensionality of the feature map while keeping the vital information and lowering the computational cost of the network. The size of the feature map from the previous layer is reduced according to the chosen pooling size and stride value. The most commonly used down-sampling operators include min-pooling, max-pooling, average-pooling, global-pooling, and global average-pooling. A sample of this process is shown in Fig. 2.

Fig. 2
figure 2

Min pooling algorithm with stride 2

  • Fully connected layer: it is mostly used at the end of the network. The input of this layer is the output of the last convolution or pooling layer. The obtained feature maps are flattened (2D feature maps are converted into a 1D vector) and fully connected, which makes the fully connected (FC) layer suitable for decisions about data classification. In the last layer of the CNN model, an activation function such as softmax or sigmoid outputs a probability distribution for multi-class classification. A minimal sketch of these three components follows this list.
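To make these components concrete, the following is a minimal Keras sketch, not the architecture proposed in this paper, chaining a convolutional layer with ReLU, a max-pooling layer, and a fully connected softmax head; the 48 × 48 grayscale input shape and seven emotion classes follow the setup described later in Sect. 4.

```python
# Minimal illustrative CNN (not the proposed architecture): one convolutional
# layer with ReLU, one pooling layer, and a fully connected softmax head.
import tensorflow as tf
from tensorflow.keras import layers, models

model = models.Sequential([
    layers.Input(shape=(48, 48, 1)),                    # grayscale face image
    layers.Conv2D(32, (3, 3), activation="relu"),       # convolution + ReLU, Eq. (1)
    layers.MaxPooling2D(pool_size=(2, 2), strides=2),   # down-sampling (cf. Fig. 2)
    layers.Flatten(),                                   # 2D feature maps -> 1D vector
    layers.Dense(7, activation="softmax"),              # probabilities over 7 emotions
])
model.summary()
```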

2.2 Whale optimization algorithm

WOA [41] is a nature-inspired meta-heuristic optimization algorithm that mimics the social behavior of humpback whales. Humpback whales prefer to prey on groups of krill or small fish near the surface of the water by creating distinctive bubbles along a circular or '9'-shaped path. This hunting behavior is called the bubble-net feeding method. The WOA algorithm consists of three steps: encircling prey, the spiral bubble-net feeding maneuver (exploitation phase), and the search for prey (exploration phase), explained as follows:

  • Encircling prey: Whales can detect and encircle the prey’s location. The WOA algorithm assumes that the current best candidate solution is the target prey or is close to the optimal state. After defining the best search agent, other search agents update their position relative to the best agent. Its mathematical model is as follows:

$$\overrightarrow{D}=\left|\overrightarrow{C}\cdot \overrightarrow{{X}^{*}}\left(t\right)-\overrightarrow{X}\left(t\right)\right|$$
(2)
$$\overrightarrow{X}\left(t+1\right)=\overrightarrow{{X}^{*}}\left(t\right)-\overrightarrow{A}\cdot \overrightarrow{D}$$
(3)
$$\overrightarrow{A}=2\cdot \overrightarrow{a}\cdot {\text{rnd}}_{1}- \overrightarrow{a}$$
(4)
$$\overrightarrow{C}=2\cdot {\text{rnd}}_{2}$$
(5)

where \(t\) is the current iteration, \(\overrightarrow{{X}^{*}}\) is the position vector of the best solution obtained so far, \(\overrightarrow{X}\) is the position vector of a whale, \({\text{rnd}}_{1}\) and \({\text{rnd}}_{2}\) are random vectors in \(\left[0,1\right]\), \(\overrightarrow{a}\) decreases linearly from 2 to 0 over the iterations, and \(\overrightarrow{A}\) and \(\overrightarrow{C}\) are the coefficient vectors.

  • Exploitation phase: to mathematically model the bubble-net behavior of whales, two techniques are used:

  • Shrinking encircling approach: This behavior is achieved by decreasing the value of \(a\) in Eq. (4).

  • By setting random values of \(\overrightarrow{A}\) in \(\left[-1,1\right]\), the new position of a search agent can be defined anywhere between the agent's original position and the position of the current best agent.

  • Spiral update position: this approach first calculates the distance between the whale located at \(\left(X,Y\right)\) and the prey located at \(\left({X}^{*},{Y}^{*}\right)\). The spiral movement of the whale toward the prey is simulated by the following spiral equation:

$$\overrightarrow{X}\left(t+1\right)={D}{\prime}\cdot {e}^{bl}\cdot \text{cos}\left(2\pi l\right)+ \overrightarrow{{X}^{*}}\left(t\right)$$
(6)

where \({D}{\prime}=\left|\overrightarrow{{X}^{*}}\left(t\right)-\overrightarrow{X}\left(t\right)\right|\) represents the distance of the \(i\)th whale to the prey (the best solution obtained so far), \(b\) is a constant determining the shape of the logarithmic spiral, and \(l\) is a random number in \(\left[-1,1\right]\).

The whales swim around the prey within a shrinking circle and along a spiral-shaped path simultaneously. Therefore, assuming a probability of \(\frac{1}{2}\) of choosing each mechanism, the mathematical model is as follows:

$$ \vec{X}\left( {t + 1} \right) = \left\{ {\begin{array}{*{20}l} {\overrightarrow {{X^{*} }} \left( t \right) - \vec{A} \cdot \vec{D}} \hfill & {P < 0.5} \hfill \\ {D^{\prime } \cdot e^{bl} \cdot \cos \left( {2\pi l} \right) + \overrightarrow {{X^{*} }} \left( t \right)} \hfill & {P \ge 0.5} \hfill \\ \end{array} } \right. $$
(7)

where \(P\) is a random number in \(\left[0,1\right]\).

  • Exploration phase: whales also search randomly according to each other's positions. Random values of \(\overrightarrow{A}\) greater than 1 or less than \(-1\) force a search agent to move away from the reference whale, and a randomly chosen agent is used to update the search agent's position. This mechanism with \(|\overrightarrow{A} | > 1\) emphasizes exploration and allows the WOA algorithm to perform a global search. The equations are as follows:

$$\overrightarrow{D}=\left|\overrightarrow{C}\cdot \overrightarrow{{X}_{\text{rand}}}-\overrightarrow{X}\right|$$
(8)
$$\overrightarrow{X}\left(t+1\right)= \overrightarrow{{X}_{\text{rand}}}-\overrightarrow{A}\cdot \overrightarrow{D}$$
(9)

In these equations, \(\overrightarrow{{X}_{\text{rand}}}\) is a position vector selected randomly from the current population (a random whale). A random search agent guides the position update when \(|\overrightarrow{A} | > 1\), while the best solution guides it when \(|\overrightarrow{A} | < 1\).

The pseudocode of WOA is shown in Algorithm 1.

Algorithm 1
figure a

WOA
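As a complement to Algorithm 1, the following is a compact Python sketch of the WOA loop built directly from Eqs. (2)–(9). It is an illustrative simplification, not the exact implementation used in this paper: the spiral constant \(b\) is fixed to 1, bounds are simple box constraints, and the \(|A| < 1\) test is applied component-wise.

```python
import numpy as np

def woa(fitness, dim, n_agents=10, max_iter=50, lb=-1.0, ub=1.0):
    """Compact WOA loop following Eqs. (2)-(9); maximizes `fitness`."""
    X = np.random.uniform(lb, ub, (n_agents, dim))        # random initial population
    scores = np.array([fitness(x) for x in X])
    best, best_score = X[scores.argmax()].copy(), scores.max()
    for t in range(max_iter):
        a = 2.0 * (1 - t / max_iter)                      # a decreases linearly 2 -> 0
        for i in range(n_agents):
            A = 2 * a * np.random.rand(dim) - a           # Eq. (4)
            C = 2 * np.random.rand(dim)                   # Eq. (5)
            p, l = np.random.rand(), np.random.uniform(-1, 1)
            if p < 0.5:
                if np.all(np.abs(A) < 1):                 # exploitation: encircle the best
                    D = np.abs(C * best - X[i])           # Eq. (2)
                    X[i] = best - A * D                   # Eq. (3)
                else:                                     # exploration: random whale
                    x_rand = X[np.random.randint(n_agents)]
                    D = np.abs(C * x_rand - X[i])         # Eq. (8)
                    X[i] = x_rand - A * D                 # Eq. (9)
            else:                                         # spiral update, b fixed to 1
                D = np.abs(best - X[i])
                X[i] = D * np.exp(l) * np.cos(2 * np.pi * l) + best  # Eq. (6)
            X[i] = np.clip(X[i], lb, ub)
        scores = np.array([fitness(x) for x in X])
        if scores.max() > best_score:
            best, best_score = X[scores.argmax()].copy(), scores.max()
    return best, best_score

# Toy check: maximize -(x^2), so the optimum lies at the origin.
print(woa(lambda x: -float(np.sum(x ** 2)), dim=3))
```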

2.3 Local binary pattern

Local binary pattern (LBP) [17, 35, 37, 54, 56] is a local texture pattern presented by Ojala et al. [46]. Researchers have studied LBP in various areas of image processing, including face recognition and facial emotion recognition. LBP extracts features from grayscale images and is robust against monotonic illumination changes; the technique is efficient and has low computational complexity. In LBP, a block of a specific size, such as 3 \(\times \) 3, slides over the input image (Fig. 3a). At each step, the difference between the value of each neighboring pixel and the central pixel is calculated. Differences greater than or equal to zero are encoded as 1, and negative differences as 0 (Fig. 3b). Moving counterclockwise, the binary values are written from left to right, and the resulting binary number is converted to a decimal value. Formally, the LBP technique can be represented by Eq. (10):

$$\text{LBP}\left({g}_{i},{g}_{c}\right)={\sum\limits}_{i=0}^{p-1}{2}^{i}f\left({g}_{i}-{g}_{c}\right)$$
(10)
$$f\left(x\right)=\left\{\begin{array}{ll}1 & x\ge 0\\ 0 & x<0\end{array}\right.$$
(11)

where \({g}_{i}\) is the pixel value of the \(i\)th neighbor of the central pixel, \({g}_{c}\) is the central pixel value, and \(p\) is the total number of neighbors. The resulting value is placed in the center of the block, as shown in Fig. 3c. This process continues until the 3 \(\times \) 3 block has scanned the entire image.

Fig. 3
figure 3

A sample of LBP descriptor. a 3 \(\times \)3 block, b binary coding resulting from the difference between the neighboring pixels of the central pixel, and c converting the binary value to decimal and storing it in the central pixel
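The following NumPy sketch implements the basic 3 \(\times \) 3 LBP operator of Eqs. (10) and (11). The neighbor ordering used here is one common convention and may differ from the counterclockwise ordering of Fig. 3; border pixels are left at zero for simplicity.

```python
import numpy as np

def lbp(image):
    """Basic 3x3 LBP of Eqs. (10)-(11): threshold the 8 neighbors at the
    center pixel value and weight the resulting bits by powers of two."""
    img = image.astype(np.int32)
    h, w = img.shape
    out = np.zeros_like(img)
    # Neighbor offsets in a fixed circular order around the center pixel.
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
               (1, 1), (1, 0), (1, -1), (0, -1)]
    for r in range(1, h - 1):
        for c in range(1, w - 1):
            center = img[r, c]
            code = 0
            for i, (dr, dc) in enumerate(offsets):
                if img[r + dr, c + dc] >= center:   # f(g_i - g_c) in Eq. (11)
                    code += 2 ** i                  # 2^i weighting in Eq. (10)
            out[r, c] = code                        # stored in the center (Fig. 3c)
    return out
```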

Recently, numerous modified methods based on local binary patterns have been proposed for extracting distinctive features to classify input data in various fields, including texture and facial recognition. In [24], a method for texture-based image classification invariant to changes in scale and rotation was introduced. Meanwhile, Ryu et al. [58] proposed the sorted consecutive local binary pattern (scLBP) for texture classification; scLBP encodes patterns with varying spatial transitions while remaining rotation-invariant, achieved by sorting consecutive patterns, and the authors additionally used kd-tree-based dictionary learning to partition the data space. Song et al. [57] introduced a histogram sorting method that preserves the distribution information of LBP codes and their complements, called first- and second-order sorted LBP (SLBP), which is robust to inverse grayscale changes and image rotation. Kalyoncu [29] employed the sorted uniform LBP (SULBP), a rotation-invariant LBP variant, for identifying leaf images.

The proposed method is similar to these methods in terms of the input data format, an image; however, its context is the classification of emotions from faces, and the proposed filter uses different steps and parameters than the existing methods. Section 3 describes the proposed method in detail.

3 Proposed method

The proposed approach for classifying facial emotions is discussed in this section (Fig. 4). The combination of the newly proposed LSBP and the CNN network is used to extract features suitable for recognizing emotions with high accuracy. In the preprocessing step, several operations are applied to the input data. LSBP is a hand-crafted approach for feature extraction: the proposed pattern transforms the image pixels to extract features of the input image. The output image is then augmented (to create variety and increase the number of input samples), and the obtained data are normalized and resized. In the next step, the resulting data are fed to the CNN network, whose hyperparameters are optimized with the WOA algorithm, to extract higher-level features and perform classification, recognizing emotions from facial image data with high accuracy. Data preprocessing is presented in Sect. 3.1, Sect. 3.2 introduces the new LSBP descriptor, and Sect. 3.3 discusses the WOA algorithm for optimizing the CNN hyperparameters and the CNN structure.

Fig. 4
figure 4

Diagram of the proposed method

3.1 Preprocessing

Preprocessing is the process of preparing raw data for a deep learning model. The input images in the proposed scheme are in grayscale. Data augmentation is performed by zooming, rotating, and flipping, and normalization and resizing are applied to the data. A sketch of these steps is given below.
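A minimal sketch of this preprocessing with Keras, assuming the augmentation settings reported later in Sect. 4 (rotation of 5°, zoom range [1, 1.5], horizontal flip) and min-max normalization to [0, 1]:

```python
# Preprocessing sketch: normalization plus the three augmentations named above.
from tensorflow.keras.preprocessing.image import ImageDataGenerator

augmenter = ImageDataGenerator(
    rescale=1.0 / 255,       # min-max normalization of pixel values
    rotation_range=5,        # rotate by up to 5 degrees
    zoom_range=[1.0, 1.5],   # zoom factor drawn from [1, 1.5]
    horizontal_flip=True,    # mirror the face horizontally
)
# images: float array of shape (n, 48, 48, 1); labels: one-hot of shape (n, 7)
# batches = augmenter.flow(images, labels, batch_size=32)
```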

3.2 The suggested local sorting binary pattern

In this paper, a simple method with low computational complexity and high emotion recognition accuracy is described. The LSBP technique, based on the LBP method, is proposed to extract local features of the input grayscale images, as shown in Fig. 5. In this pattern, the image pixels are divided into 3 \(\times \) 3 blocks [Eq. (12)]:

$$S=\{{g}_{0},{g}_{1},{g}_{2},{g}_{3},{g}_{4},{g}_{5},{g}_{6},{g}_{7},{g}_{8}\}$$
(12)

where \(S\) is the set of pixels of the selected 3 \(\times \) 3 block. The pixel values of the block are sorted in ascending order, and the indices of the sorted values are saved. The central element \({c=g}_{4}\) is used as the threshold for the eight neighbors: index values greater than or equal to the threshold are replaced with 1 and the others with 0, as shown in Eq. (13):

Fig. 5
figure 5

\(3\times 3\) segmentation of input grayscale image

$$f\left(\text{index value}\right)=\left\{\begin{array}{ll}1 & \text{index value}\ge c\\ 0 & \text{index value}<c\end{array}\right.$$
(13)

The resulting 0s and 1s are written from left to right in clockwise order, starting from pixel \({g}_{0}\). The binary value is converted to decimal according to Eq. (14):

$$\text{result}={\sum\limits}_{i=0}^{8}{2}^{i}f\left({g}_{i}\right)$$
(14)

The resulting value is stored in the center pixel. This process is repeated until the whole image has been scanned. The steps of the LSBP descriptor are described in Algorithm 2, and a sketch under one reading of these steps follows it.

Algorithm 2
figure b

LSBP
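The description of LSBP leaves some room for interpretation, so the following NumPy sketch encodes one possible reading of Algorithm 2: each pixel's rank (index) in the ascending sort of its 3 \(\times \) 3 block is compared against the rank of the central pixel, and the eight neighbor bits are weighted by powers of two as in Eq. (14). The bit ordering and the treatment of the central element here are our assumptions, not the paper's exact specification.

```python
import numpy as np

def lsbp(image):
    """Sketch of the LSBP descriptor under one reading of Sect. 3.2:
    each pixel's rank in the ascending sort of its 3x3 block is
    thresholded at the rank of the central pixel (Eq. (13))."""
    img = image.astype(np.int32)
    h, w = img.shape
    out = np.zeros_like(img)
    for r in range(1, h - 1):
        for c in range(1, w - 1):
            block = img[r - 1:r + 2, c - 1:c + 2].ravel()          # S = {g0, ..., g8}
            ranks = np.argsort(np.argsort(block, kind="stable"))   # sorted-order indices
            bits = (ranks >= ranks[4]).astype(int)                 # threshold at center rank
            neighbors = np.delete(bits, 4)                         # drop the center g4
            # Row-major bit order is an assumption; the paper starts clockwise at g0.
            out[r, c] = int(np.sum(neighbors * 2 ** np.arange(8)))  # Eq. (14)
    return out
```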

3.3 The proposed CNN architecture and tuning the hyperparameters using WOA

In this paper, WOA is used to optimize the CNN. The time complexity of the GA, PSO, and WOA optimization algorithms is presented in Table 1. According to these results, WOA can reach high accuracy in the proposed network faster than GA and PSO. Therefore, the hyperparameters of the CNN, including the activation function, optimizer function, learning rate, number of epochs, and batch size, are adjusted by the WOA algorithm. Each hyperparameter vector (\(\overrightarrow{\text{HP}}\)) acts as an agent in WOA, where each parameter \({P}_{ij}\) takes values from a predetermined range or set. The initial population (\({S}_{n}\)) of WOA is a random set of \(n\) agents:

Table 1 Time complexity of optimization algorithms
$$\overrightarrow{{\text{HP}}_{i}}=\{{P}_{i0},{P}_{i1},{P}_{i2},{P}_{i3},{P}_{i4},{P}_{i5}\}$$
(15)
$${S}_{n}=\{\overrightarrow{{\text{HP}}_{1}},\overrightarrow{{\text{HP}}_{2}},\overrightarrow{{\text{HP}}_{3}},\dots ,\overrightarrow{{\text{HP}}_{n}}\}$$
(16)

In the proposed method, the WOA parameters are left unchanged, and the fitness function is the accuracy achieved by each hyperparameter vector. Eventually, the best agent, i.e., the one with the highest accuracy, is taken as the optimal solution for setting the CNN hyperparameters. A sketch of this mapping is given below.
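The following sketch shows one way such a mapping from WOA agents to CNN hyperparameters could look. The candidate sets are placeholders standing in for Table 2, and `build_cnn` is a hypothetical model builder; the training calls are commented out so the stub runs standalone.

```python
# Hedged sketch: decoding a continuous WOA agent into discrete hyperparameters.
import numpy as np

ACTIVATIONS = ["relu", "tanh", "swish"]          # placeholder candidate sets,
OPTIMIZERS = ["adam", "sgd", "nadam", "adamax"]  # standing in for Table 2
LEARNING_RATES = [1e-2, 1e-3, 1e-4]
EPOCHS = [10, 20, 30]
BATCH_SIZES = [16, 32, 64]
SPACES = [ACTIVATIONS, OPTIMIZERS, LEARNING_RATES, EPOCHS, BATCH_SIZES]

def decode(agent):
    """Map an agent in [0, 1]^5 onto one value per hyperparameter set."""
    return [space[int(a * len(space)) % len(space)]
            for a, space in zip(agent, SPACES)]

def fitness(agent):
    """Fitness = validation accuracy of a CNN trained with these settings."""
    act, opt, lr, epochs, batch = decode(agent)
    # model = build_cnn(activation=act)          # hypothetical builder
    # model.compile(optimizer=opt, loss="categorical_crossentropy",
    #               metrics=["accuracy"])
    # hist = model.fit(x_train, y_train, epochs=epochs, batch_size=batch,
    #                  validation_data=(x_val, y_val), verbose=0)
    # return hist.history["val_accuracy"][-1]
    return np.random.rand()                      # stub so the sketch runs standalone
```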

After tuning the hyperparameters, the optimized CNN network performs feature extraction on the data obtained by the LSBP method and classifies the facial emotions. The recognition of facial emotions by CNN is therefore described in three parts: designing the CNN architecture, training, and evaluating the results:

  • Model design (CNN architecture): the CNN architecture used is a combination of convolution and pooling layers based on inception modules [59] and residual blocks [23]. The inception module runs several convolution and pooling operators in parallel. A residual block is a stack of network layers with a shortcut connection that, as in ResNet, merges the block's output with its input. The layering of the neural network is shown in Fig. 6.

Fig. 6
figure 6

The proposed CNN architecture

  • Training model: The image data are passed to the designed model. First, the loss function is considered to calculate the error. There are different loss functions, such as mean square error (MSE), mean absolute error (MAE), and cross-entropy [49]. In this work, cross-entropy has been used. It is one of the most famous methods for classification problems. The value of the cross-entropy (CE) function increases according to the probability difference between the predicted value and the actual value, which is expressed in Eq. (17):

$$\text{CE}= -\left({y}_{i}\text{log}\widehat{{y}_{i}}+\left(1-{y}_{i}\right)\text{log}\left(1-\widehat{{y}_{i}}\right)\right)$$
(17)

where \({y}_{i}\) is the expected (actual) output value and \(\widehat{{y}_{i}}\) is the predicted output value.
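As a quick numeric check of Eq. (17), the loss is small when the predicted probability is close to the true label and grows as it moves away:

```python
# Tiny numeric check of Eq. (17) for a single binary prediction.
import numpy as np

def binary_ce(y, y_hat):
    return -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

print(binary_ce(1.0, 0.9))   # ~0.105: confident and correct -> small loss
print(binary_ce(1.0, 0.2))   # ~1.609: confident and wrong  -> large loss
```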

Then, the optimizer function is used to update the weights and parameters of the neural network; the choice of optimizer plays an influential role in the network results. Different optimizers, such as Adamax, SGD, Nadam, and Adagrad [52], can be used. Finally, the weights and parameters of the neural network are updated, and the data are passed through the CNN model again. This cycle continues until the minimum expected error is reached.

  • Result analysis: to evaluate the trained model, the specificity, sensitivity, F1-score, and accuracy of the data classification are examined. The k-fold cross-validation method [34] and the data split approach are used to measure the accuracy of emotion recognition. The parameter k in k-fold cross-validation is the number of groups the data are split into for evaluation; this method is suitable for preventing overfitting. The data split validation technique partitions the data into two or more parts; for example, in a two-part split, the data are divided into training and testing parts. The confusion matrix, a two-dimensional table expressing the performance of the proposed method's classification predictions, is also used to display the results.

4 Experimental results

In this section, the proposed system's performance is compared with that of other state-of-the-art methods. Three popular datasets, CK+, JAFFE, and MMI, are used to evaluate facial emotion recognition. The data augmentation techniques, including rotation by 5°, zoom in \(\left[1,1.5\right]\), and horizontal flipping, are applied to improve classification; the number of samples in each dataset is thereby tripled. Besides, the input images are grayscale, normalized, and resized. The values used to adjust the network hyperparameters with WOA are given in Table 2, and Table 3 shows the best hyperparameter values over 100 iterations of the WOA algorithm.

Table 2 Values of hyperparameters
Table 3 Best values of hyperparameters

The fivefold cross-validation method and the data split approach (85–15%) are used to test the proposed method; in the data split approach, 85% of the data is used for the training phase and 15% for the testing phase. Furthermore, the standard metrics of specificity, sensitivity, F1-score, and accuracy are reported to evaluate facial emotion recognition. These metrics can be calculated from the confusion matrix.

The confusion matrix includes the labels of the actual and predicted classes and is used to analyze the performance of the proposed algorithm. Equations (18)–(21) measure the efficiency of the proposed model, where TP, FP, TN, and FN (true positives, false positives, true negatives, and false negatives) refer to results that correctly predict the positive class, incorrectly predict the positive class, correctly predict the negative class, and incorrectly predict the negative class, respectively:

  • Specificity is the proportion of correctly classified positive cases out of all cases assigned to a particular class; the formula is as follows:

$$\text{Specificity}=\frac{TP}{TP+FP}$$
(18)
  • Sensitivity (true positive rate) is the proportion of actual positive cases which are correctly classified. The sensitivity formula is:

$$\text{Sensitivity}=\frac{TP}{TP+FN}$$
(19)
  • F1-score is the harmonic mean of the specificity and sensitivity values. This metric is useful when the data have an unbalanced distribution. The formula for the F1-score is given by:

$$F1-\text{score}= 2\times \frac{\text{Specificity}\times \text{Sensitivity}}{\text{Specificity}+\text{Sensitivity}}$$
(20)
  • Accuracy is an important evaluation metric: the proportion of true results among the total number of items checked for a particular class. The formula for accuracy is as follows:

$$\text{Accuracy} =\frac{TP+TN}{TP+TN+FP+FN}$$
(21)
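The following sketch computes these four metrics for one class of a multi-class confusion matrix, following Eqs. (18)–(21) exactly as written above (note that Eq. (18) computes TP/(TP+FP)):

```python
# Per-class metrics from a multi-class confusion matrix, per Eqs. (18)-(21).
import numpy as np

def per_class_metrics(cm, k):
    """cm[i, j] = number of samples of true class i predicted as class j."""
    tp = cm[k, k]
    fp = cm[:, k].sum() - tp          # predicted as k but belong elsewhere
    fn = cm[k, :].sum() - tp          # belong to k but predicted elsewhere
    tn = cm.sum() - tp - fp - fn
    specificity = tp / (tp + fp)      # Eq. (18)
    sensitivity = tp / (tp + fn)      # Eq. (19)
    f1 = 2 * specificity * sensitivity / (specificity + sensitivity)  # Eq. (20)
    accuracy = (tp + tn) / cm.sum()   # Eq. (21)
    return specificity, sensitivity, f1, accuracy
```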

The accuracy and loss of the proposed method are tracked to assess the classification network's performance. First, the datasets are introduced; then, the results of our experiments are discussed. All experiments are conducted in the Google Colaboratory environment in Python with the TensorFlow framework.

4.1 Datasets

The extended Cohn-Kanade (CK+) dataset contains 593 video sequences from 123 subjects. CK+ covers seven facial emotions (happiness, sadness, anger, contempt, disgust, fear, and surprise) at a resolution of either 640 \(\times \) 490 or 640 \(\times \) 480 pixels; in this study, we employed the 327 image sequences labeled with the seven basic facial emotions.

The Japanese Female Facial Expression (JAFFE) dataset contains 213 images of 10 Japanese female models demonstrating seven facial emotions (happiness, sadness, fear, anger, neutral, disgust, and surprise) at a size of 256 \(\times \) 256 pixels.

The MMI facial expression dataset (MMI) includes 2900 samples of static and sequence images of faces in frontal and profile views. The high-resolution images come from 75 subjects, of which 235 videos have emotional labels. In this work, we used 238 frontal-view image sequences (sessions 1767–2004). This dataset displays six basic emotions (happiness, sadness, anger, disgust, fear, and surprise). Sample images from the CK+, JAFFE, and MMI datasets are depicted in Fig. 7. In the proposed system, the input data are resized to 48 \(\times \) 48 pixels.

Fig. 7
figure 7

Sample images from the datasets (CK+, JAFFE, and MMI)

4.2 Discussion

This subsection describes the experiments conducted with the proposed facial emotion recognition system on the datasets. Table 4 reports the fivefold and split (85–15%) validation results on the three popular datasets (CK+, JAFFE, and MMI) used to evaluate the accuracy of the proposed model. The results demonstrate that our method is effective in recognizing emotions. Tables 5, 6, and 7 show the confusion matrices for the CK+, JAFFE, and MMI datasets, respectively.

Table 4 Results of the proposed model on standard datasets (CK+, JAFFE, and MMI) using validation schemes [fivefold and split (85–15%)]
Table 5 Confusion matrix for CK+ dataset
Table 6 Confusion matrix of the proposed method for JAFFE dataset
Table 7 Confusion matrix of the proposed method for MMI facial expression dataset

In Table 8, the performance of the proposed method is tabulated in terms of specificity, sensitivity, and F1-score on the CK+, JAFFE, and MMI datasets for recognizing the basic facial emotions. Besides, Fig. 8 plots the accuracy on the datasets under fivefold cross-validation.

Table 8 Statistical performance of the proposed model for facial emotion recognition on CK+, JAFFE, and MMI datasets
Fig. 8
figure 8

Accuracy of CK+, JAFFE, and MMI datasets related to fivefold method

The Friedman non-parametric hypothesis test [50] is used to illustrate the effectiveness of the proposed method on the JAFFE dataset. This test involves two hypotheses: \({H}_{0}\) (null hypothesis), that the samples are distributed uniformly between groups, and \({H}_{1}\) (alternative hypothesis), that the method affects the samples in the groups. The Friedman test statistic involves four components. First, the sample size (N) is the total number of observations in each group. Second, the Chi-square distribution, computed from the variance of the mean ranks, approximates the distribution of the test statistic; it is used to examine whether the two categorical groups influence the test statistic independently. Third, the degrees of freedom (df) equal the number of groups minus one. Fourth, the p-value/significance level (Asymp. Sig.) is the asymptotic probability, compared against a type I error probability of \(\alpha =0.05\), used to decide between the two hypotheses. The Friedman test between the proposed method and each of the models [28] and [60] (\(\text{df}=1\)) is performed on the fivefold accuracies (\(N=5\)). According to Tables 9 and 10, the results show that the proposed method has a significant effect (rejection of the null hypothesis) on the accuracy of facial emotion recognition on the JAFFE dataset compared with the models of [28] and [60].

Table 9 Friedman test on the JAFFE dataset between proposed method with model [28]
Table 10 Friedman test on the JAFFE dataset between proposed method with model [60]

High prediction accuracy alone is not sufficient to build user confidence in the proposed CNN model and ensure its deployment in real-world applications. Therefore, the model's explainability and robustness are also examined to evaluate its performance.

Explainability [12, 20, 30] refers to techniques that make the output of intelligent systems understandable to humans. Two kinds of explanation are considered: global-level and local-level. Global-level explanations focus on the model's behavior in general; they can be explored in detail using partial dependence plots (PDPs), which describe how the model's response changes as a single feature's value shifts. Local-level explanations describe the model's behavior around a single prediction, and many methods have been developed to determine feature importance at the level of a single prediction. Shapley additive explanations (SHAP) use Shapley values from cooperative game theory to attribute the effects of individual additive features to individual model predictions. Figure 9 shows the explainability of the proposed CNN network in classifying four images from the JAFFE dataset with the SHAP technique.
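A local-level explanation like the one in Fig. 9 can be produced with the shap library; the sketch below assumes a trained Keras `model` and a small `background` sample of training images, and the specific explainer choice is ours rather than the paper's.

```python
# Sketch of a local-level SHAP explanation for the trained CNN.
import shap

explainer = shap.GradientExplainer(model, background)  # background: (m, 48, 48, 1)
shap_values = explainer.shap_values(test_images[:4])   # four test faces
shap.image_plot(shap_values, test_images[:4])          # pixel attributions per class
```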

Fig. 9
figure 9

Correct emotions recognition of the proposed method using the shap method on the JAFFE dataset

Robustness [18, 20] concerns model performance on manipulated/modified input data and generalizability. The proposed model employs a preprocessing phase and, in particular, data augmentation, which is an effective technique for improving the robustness of machine learning models by diversifying the input data. In addition, robustness can be evaluated with other methods, such as applying noise to the input data [18]. Figure 10 illustrates the correct recognition of a test sample from the JAFFE dataset despite the presence of noise (Gaussian noise and blur) as a form of data corruption; a sketch of this check is given below.
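A sketch of such a noise-based check, assuming a trained `model` and normalized `test_images` from the earlier steps; the kernel size and noise level are illustrative choices, not the paper's settings:

```python
# Robustness check sketch: predict on the original, blurred, and noisy image.
import cv2
import numpy as np

img = test_images[0, :, :, 0]                          # one 48x48 grayscale face in [0, 1]
blurred = cv2.GaussianBlur(img, (5, 5), sigmaX=1.0)    # blur corruption
noisy = np.clip(img + np.random.normal(0, 0.05, img.shape), 0, 1)  # Gaussian noise
for variant in (img, blurred, noisy):
    pred = model.predict(variant[np.newaxis, ..., np.newaxis])
    print(pred.argmax())                               # predicted emotion index
```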

Fig. 10
figure 10

The correct prediction of the emotion of happiness on one sample of JAFFE: a the original image, b the image filtered with LSBP, c the image with blur, and d the image with Gaussian noise

The experimental results of our method and of well-known approaches (LBP, HOG, LDP, Gabor filters, and DWT) are shown in Table 11, and the proposed model is compared with state-of-the-art methods in Table 12. The evaluations confirm that the proposed system improves accuracy over previous approaches, as depicted in Fig. 11.

Table 11 Comparison of proposed system with existing approaches
Table 12 Comparison of the proposed model with the other facial expression recognition models
Fig. 11
figure 11

Comparison between the proposed method and state-of-the-art models

5 Conclusion

Robust feature extraction plays a vital role in emotion recognition. This paper proposes a new facial feature extraction method for a facial emotion recognition system: the combination of the novel LSBP descriptor and a CNN is used to extract effective features. LSBP, based on LBP, slides over the image and encodes the sorted order of pixel values to extract features. A CNN then extracts higher-level features from the data obtained by the LSBP technique, with its hyperparameters optimized using the WOA technique. In the evaluation, we used three well-known face datasets, namely CK+, JAFFE, and MMI. The results show that the proposed method delivers high emotion recognition performance. In the future, the scope of the proposed model can be expanded with new and combined feature extraction methods, optimization of more hyperparameters with meta-heuristic algorithms, and different biometrics, so that applications with an efficient emotion recognition system can be used in commercial, entertainment, and industrial fields.