1 Introduction

Facial expression is widely regarded as one of the most effective ways of communicating emotions and feelings. Charles Darwin [8] made several contributions to its study: first, he treated emotions as discrete; second, he emphasized the face, since facial expressions are the most valuable source of information; third, he argued that facial expressions of emotion are universal; fourth, he observed that emotions are not unique to humans but can be seen in other species; and fifth, he clarified why particular movements correspond to particular emotions. This work marked the beginning of the theory behind Facial Expression Recognition (FER). Non-verbal components such as facial expressions convey 55% of a person’s intention, whereas verbal and vocal components convey 7% and 38% of the communicated message, respectively [11, 19]. This motivates researchers to explore FER. Every region of the face conveys important affective information. It is sometimes challenging to separate the facial features of the same subject under two different expressions, as they may share the same feature space [25]. Selecting appropriate features to distinguish emotions across categories is also an open issue [59]. Expressions vary even within the same culture [5, 35], and their patterns may depend on environment, mood, and situation, making it difficult for machines to recognize them reliably [25]. Variations in the face, facial occlusions, head poses, and illumination also degrade the overall system’s performance. A generalized approach is therefore needed that can cope with all these variations and support an efficient, robust expression recognition system [54].

Both feature extraction and classification are essential for FER, and the key challenge in recognizing facial expressions efficiently is selecting effective techniques for each. If the features are sparse, even the best classifier’s performance degrades [49]. Handcrafted features such as Scale-Invariant Feature Transform (SIFT), Gabor, Local Binary Patterns (LBP), and Histograms of Oriented Gradients (HOG) have achieved breakthroughs in various fields. These handcrafted low-level features work well with a small amount of training data but are inadequate for extracting discriminative information, and they are arduous to tune to the input data. These disadvantages make low-level features inefficient for recognizing facial expressions accurately in real-world applications. Deep learning models overcome these challenges: they learn automatically from raw data, represent the data at multiple levels, and capture more abstract information. The rapid development of deep learning has impacted various areas, including FER, and has shown promising results in identifying expressions from facial images. Despite this success, the computational cost remains high, which limits availability and accessibility.

This work explores the EfficientNet [43] model, a high-quality member of the CNN family, to recognize facial expressions efficiently from static images. It requires fewer parameters (4M) and achieves better accuracy than previous CNN models. The EfficientNet B0 baseline architecture is used as a feature extractor to improve the FER system’s recognition rate and accuracy. The features extracted from an intermediate layer of EfficientNet are then fed to machine learning classifiers for classification. Combining deep learning models with machine learning classifiers effectively improves the ability of the classification algorithms. Two ensemble models are proposed to classify facial expressions into their respective expression classes: SC-EffNet, in which EfficientNet B0 features are fed to a stacking classifier, and FV-EffNet, in which EfficientNet B0 features are fed to machine learning classifiers whose outputs are combined by frequency of votes. The combination of multiple classifiers induces higher-level classifiers that try to learn all possible patterns from the base classifiers [37], which further enhances the overall performance of FER.

The proposed work uses the EfficientNet model, which is computationally efficient and memory-efficient compared with previous CNN models. Feeding the model’s intermediate features to machine learning classifiers with a frequency-based approach improved the system’s accuracy further. We therefore chose the five best weights and fed their corresponding intermediate features into machine learning classifiers to test the system’s performance. Additionally, a stacking classifier (an ensemble approach) was used to improve the model’s performance. The meta classifier analyzes the pattern of the base classifiers’ outputs and learns from their errors before making the final prediction. By integrating the outputs of the base classifiers, the ensemble model makes accurate predictions, reduces over-fitting, reduces the risk of relying on a single classifier, and achieves good results. To the best of our knowledge, no prior FER work combines the best model weights, the extraction of EfficientNet features, and a stacking classifier for classification.

In affective computing, emotion recognition from video data is a current research issue [34]. Although video signals carry comparatively more information, the speed and variability of their dynamics (rapid changes in expression intensity from onset to peak and to offset) pose additional challenges, making it harder to recognize expressions in correlated frame sequences than in static images. Many recent works, including [34], attest that FER on static images is still an active research area. This work therefore focuses on FER based on static images rather than video sequences. The main research contributions of this study are summarized as follows:

  1. Unlike previous approaches, the five best weights and their respective intermediate features are fed to the proposed ensemble models. The frequency-based and stacking classifier approaches showed better performance than existing machine learning and deep learning techniques.

  2. Individual machine learning classifiers are assessed using different parameters. Fusing identical and diverse sets of machine learning classifiers with the frequency-based approach and the stacking classifier maintains good efficiency, achieving state-of-the-art results on posed datasets.

  3. The proposed model is evaluated on both single-pose and multi-pose datasets with fine-tuned parameters, which helps the model perform better under pose variations.

  4. The proposed model tries to reduce errors by analyzing the pattern in the base classifiers’ predictions before making the final prediction.

The remainder of this paper is organized as follows. Section 2 presents preliminaries on the methods used for the ensemble model. FER-related works are outlined in Section 3. The proposed methodology is discussed in Section 4. Experimental results and analysis are summarized in Section 5. Finally, concluding remarks and future work are given in Section 6.

2 Preliminaries

2.1 EfficientNet

Earlier deep learning models have run into hardware memory limits; hence, a more efficient model is required to improve accuracy further. Furthermore, CNNs are computationally expensive compared with classical machine learning models, as neural networks depend heavily on the data, the problem considered, and the complexity of the network required to solve it. This computational burden has been eased by Graphics Processing Units (GPUs). More recently, EfficientNet [43], a CNN variant from the Google Brain research team, achieved state-of-the-art accuracy with faster computation and a more compact architecture than previous deep learning models.

ConvNets are commonly scaled up in depth, width, or resolution to obtain better accuracy. Models scaled along a single dimension tend to achieve higher accuracy with larger depth, width, or resolution, but the accuracy gain diminishes and saturates after reaching 80%. EfficientNet overcomes this drawback with compound scaling [43], i.e., by scaling all three dimensions (width, depth, and resolution) with a fixed ratio. It starts from a compact, high-quality baseline model and scales each dimension uniformly with a fixed set of scaling coefficients. Intuitively, if the input image resolution is larger, the network needs a larger receptive field and more channels to capture fine-grained patterns. The proposed work uses the EfficientNet B0 baseline model, whose architecture details are given in Table 1.
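As a rough illustration of compound scaling, the sketch below computes the depth, width, and resolution multipliers for a compound coefficient phi. The base values 1.2, 1.1, and 1.15 are the grid-search constants reported for the B0 baseline in the EfficientNet paper [43]; the function names are illustrative only, and phi = 0 simply reproduces the unscaled baseline used in this work.

```python
# Compound scaling sketch. ALPHA, BETA, GAMMA are the B0 grid-search values
# reported in the EfficientNet paper [43]; function names are illustrative.
ALPHA, BETA, GAMMA = 1.2, 1.1, 1.15  # bases for depth, width, resolution


def compound_scaling(phi: float) -> dict:
    """Return depth/width/resolution multipliers for compound coefficient phi."""
    return {
        "depth": ALPHA ** phi,
        "width": BETA ** phi,
        "resolution": GAMMA ** phi,
    }


def flops_factor(phi: float) -> float:
    """Approximate FLOPS growth, (alpha * beta^2 * gamma^2) ** phi."""
    return (ALPHA * BETA ** 2 * GAMMA ** 2) ** phi


if __name__ == "__main__":
    for phi in range(4):
        multipliers = {k: round(v, 3) for k, v in compound_scaling(phi).items()}
        print(phi, multipliers, round(flops_factor(phi), 2))
```

Because alpha * beta^2 * gamma^2 is close to 2, each unit increase of phi roughly doubles the computational cost while keeping the three dimensions in a fixed ratio.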

Table 1 Architecture of EfficientNet B0 [43]

Mobile Inverted Bottleneck Conv (MBConv) [14, 15, 26, 38] is the main building block of EfficientNet. It is an inverted residual structure, with skip connections between the thin bottleneck layers, augmented with a Squeeze-and-Excitation (SE) block. Inverted residual blocks are efficient compared with classical residual networks, as gradient propagation across multiple layers is improved.

2.2 Stacking classifier

Stacking is a process for constructing classifier ensembles [1]. It is an ensemble learning technique that combines multiple classification models (machine learning classifiers) via a meta classifier [27, 37], so that the outputs (decisions) of several individual classifiers are combined to classify new instances. The stacking process combines multiple classifiers [22, 29] to create a higher-level classifier with improved performance. At the first level, the features are fed into the various base classifiers, each of which outputs a decision. At the second level, a meta classifier makes the final prediction by considering the base classifiers’ predictions (output patterns) [2]. If the base classifiers make classification errors, the meta classifier can learn the pattern of those errors and decide which prediction to use for the final output, which improves the overall performance of the recognition system. Stacking can reduce bias and variance [1], as the combination of different ensemble components tries to learn from its errors. The stacking approach is flexible and powerful compared with other ensemble methods.

3 Related works

An ensemble classifier [27] combines the decisions of multiple classifiers instead of relying on a single classifier’s decision, which improves the overall accuracy of the system through enhanced decision-making. The ensemble approach has proved efficient in various studies [2, 32, 37] and gave the most accurate predictions. [49] proposed an ensemble of CNN architectures and fused their probabilities using a probability-based fusion method. [53] combined three state-of-the-art face detectors into an ensemble to improve face detection, followed by an ensemble of Deep Convolutional Neural Networks (DCNNs) with randomized initialization for FER classification. An ensemble classifier based on the Dynamic Weight Majority Voting (DWMV) mechanism with an incremental learning property was proposed by [60] to learn incoming expression patterns from images belonging to new expression classes. The combination of SURF with DWMV showed superior performance for FER.

According to [30], approaches such as data augmentation and ensemble voting improve generalization performance. Hence, they proposed an ensemble of CNN architectures (VGG, Inception, ResNet) without using additional training data or facial registration. This approach became a state-of-the-art method compared with previous CNN-based FER architectures. In contrast, [23] proposed an ensemble model with three distinctly structured CNN subnets trained separately. The combination of all three subnets provided better performance on the FER 2013 dataset and obtained 5th rank in the competition. Also, [9] proposed a Multi-Region Ensemble CNN (MRE-CNN) framework to capture the contribution of three different sub-regions of the human face. This framework delivered remarkable performance by assigning weights to the three networks and combining their final predictions. Finally, [34] proposed an ensemble of deep CNNs with four different ensemble strategies (seed, preprocessing, pretraining, and bagging) to recognize facial expressions efficiently. The authors also performed an extensive investigation of various aspects of ensemble generation and focused on the factors that influence classification accuracy. In contrast to these approaches, this work focuses on saving the best model weights, extracting their intermediate features, and feeding them to the base classifiers of the ensemble model to recognize facial expressions and improve the results efficiently.

4 Proposed methodology

This section briefly discusses the proposed approach to recognizing facial expressions. The model consists of three stages: pre-processing, feature extraction, and classification. In the final stage, the facial images are mapped to their respective expression classes using an ensemble approach, as shown in Figs. 1 and 2. The proposed model’s performance is evaluated on the Oulu-CASIA and RaFD (multi-pose and frontal-pose-only images) datasets.

  1. Pre-processing and Data Augmentation: Static images are used to carry out the experiments, and every image in the dataset is pre-processed. For face detection, the AdaBoost-based learning algorithm of Paul Viola and Michael Jones [47, 48] is used. This technique uses Haar-like features and AdaBoost to train cascaded classifiers and detects frontal-view faces quickly [60]. A facial image contains a lot of background information that is not useful for classifying expressions [25]; hence, this irrelevant information is cropped and the expression-specific information is retained. This approach was successful on the Oulu-CASIA and RaFD datasets with only a frontal pose. However, the Viola-Jones algorithm failed to detect faces with pose variations [55]. Hence, Multi-task Cascaded Convolutional Networks (MTCNN) are used to detect faces with multi-pose variations. MTCNN [55] is a deep cascaded architecture that exploits the inherent correlation between detection and alignment. The framework consists of three stages of multi-task deep convolutional networks (P-Net, R-Net, and O-Net) designed to predict face and landmark locations in a coarse-to-fine manner. The MTCNN face detection technique was successful in detecting faces with multi-pose variations on the RaFD dataset. All images in the datasets were resized to 224 × 224 pixels and fed to the network for recognition.

    Offline data augmentation is performed to increase the number of training samples. Data augmentation is a technique that virtually creates extra training data by applying transformations to the input images. In this work, horizontal and vertical flipping is applied to the Oulu-CASIA and RaFD (frontal data) datasets. Data augmentation has proven efficient in improving the generalization ability of deep learning models in various applications such as image classification and speech recognition. Large, complex networks tend to over-fit the training data; avoiding this requires feeding them a massive amount of data [23].

  2. Feature Extraction using a Deep Learning Model: Feature extraction is a vital step in FER, as it derives effective facial representations from the original facial image. There are two ways of extracting facial features: using handcrafted features, or using CNN architectures to extract features automatically [45]. The extracted features play an essential role in minimizing intra-class distances and maximizing inter-class distances. Even the best classifier will fail to achieve good performance if the extracted features are inadequate. Handcrafted feature extraction techniques such as Local Binary Patterns (LBPs), Local Gabor Binary Patterns (LGBPs), Histograms of Oriented Gradients (HOG), and Scale Invariant Feature Transforms (SIFTs) have achieved great success in various fields with a small amount of training data. These low-level features are difficult to extract; tuning them to incoming face images and gathering discriminative information from the data are also difficult. These disadvantages present significant challenges for accurately recognizing expressions in real-life applications, as such data exhibit large inter-personal differences in appearance and capturing conditions. Deep learning approaches cope with these challenges by automatically discovering multiple data representations and extracting abstract concepts at higher representation levels, which is one reason for their breakthrough in recognition tasks.

    The EfficientNet model has achieved state-of-the-art results in image classification [43], with high performance and low computational cost [26]. In the proposed work, the EfficientNet B0 architecture is used for basic feature extraction. Initially, the EfficientNet model is trained for a certain number of epochs, and all the weights are saved to a folder. Then, the model’s five best weights are chosen based on validation accuracy. Later, the best weights are loaded, and their respective intermediate features are extracted and fed into an ensemble of machine learning classifiers to improve the efficiency of FER.

  3. Classification: Combining the EfficientNet model’s features with various machine learning classifiers proved to be advantageous [27]. This work presents two novel ensemble models for classification: an EfficientNet model with machine learning classifiers using a frequency-based voting strategy (FV-EffNet) and an EfficientNet model with the stacking classifier (SC-EffNet). Various machine learning classifiers are evaluated empirically, and the classifiers that produced the best results are chosen for further evaluation. The proposed approaches take the Extremely Randomized Trees classifier (Extra Trees) [51], Random Forest (RF) [1], Decision Trees (DT) [36], K-Nearest Neighbors (KNN) [7], Multilayer Perceptron (MLP) [33], and Support Vector Machine (SVM) [28] as base classifiers.

    Each model’s intermediate features are loaded individually, fed into each base classifier separately, and evaluated to determine which combination of intermediate features and base classifier (with varying parameters) gives the best result. This step is necessary because the features fed to the base classifier play a vital role in recognizing the input pattern and predicting the outcomes.

    (a) Classification using a frequency-based voting approach: In FV-EffNet, the predictions from five separate base classifiers are analyzed row by row, where each row holds the classes predicted for one image. The frequency (vote) of each class is calculated from all five predictions for that image, and the class with the maximum vote is used to generate a key. This key is the final prediction (the emotion class to which the image belongs) obtained from the combination of base classifiers, and it is stored in an array. The strategy is depicted in Fig. 1.

      • Base classifiers: The EfficientNet B0 intermediate features are fed to various machine learning classifiers. KNN, MLP, RF, Extra Trees, and SVM are chosen as base classifiers on the Oulu-CASIA dataset, and Extra Trees, RF, DT, MLP, and KNN on the RaFD datasets, to predict the expression classes efficiently.

      • Ensemble classifier: Predictions aggregated from several base classifiers tend to outperform predictions from a single model [27], which is the reason for choosing an ensemble model to predict the expression class. A frequency-based voting strategy combines the predictions from the machine learning base classifiers and makes a final decision. Hence, the output of the ensemble model under the frequency-based algorithm is the class label predicted by most classifiers [32].

    (b) Classification using a stacking classifier: Stacking is an ensemble learning technique that combines multiple classification models through a meta classifier [27]. Unlike bagging and boosting, stacking learns how to combine the base classifiers (level 0 classifiers) rather than taking votes. A novel approach that combines a deep learning model with a stacking classifier (SC-EffNet) is proposed and depicted in Fig. 2. The best features from the EfficientNet B0 model are fed to a stacking classifier [1, 44] for classification.

      The intermediate features are first fed to the base classifiers (level 0 classifiers), which comprise diverse machine learning classifiers. Each base classifier is trained on the training data. The predictions (outputs) of the base classifiers are appended and stacked as a vector. These predictions form a new dataset that is fed as input to the meta classifier (level 1 classifier) [1]. The meta classifier is then trained on this new dataset, and evaluation is done by performing cross-validation on the test data. The meta classifier helps analyze the data pattern and outputs the final prediction. One advantage of a stacking classifier is that it decreases the risk of getting varied outputs from different machine learning classifiers: it combines the results of all individual machine learning classifiers, analyzes the pattern, and achieves good performance.

      • Base classifiers: KNN, MLP, RF, Extra Trees, and SVM are used as base classifiers on the Oulu-CASIA dataset, and Extra Trees, RF, DT, MLP, and KNN on the RaFD dataset.

      • Meta classifier: The Extra Trees classifier outperformed the other machine learning classifiers and was chosen as the meta classifier for evaluation on the Oulu-CASIA dataset. On the RaFD dataset, the DT classifier proved more efficient than the other machine learning classifiers for the final prediction.
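The two-level stacking flow in (b) can be sketched in a few lines. This is a toy illustration under stated assumptions, not the implementation used in this work: the base-classifier outputs are hard-coded hypothetical labels, and a simple lookup table (`LookupMetaClassifier`, a hypothetical name) stands in for the Extra Trees or DT meta classifier. It shows how predictions are stacked into one row per image and how a level 1 learner can pick up a systematic error pattern of the base classifiers.

```python
from collections import Counter, defaultdict


def stack_predictions(base_preds):
    """Turn per-classifier prediction lists into one row (tuple) per sample."""
    return list(zip(*base_preds))


class LookupMetaClassifier:
    """Toy level 1 learner: maps each pattern of base predictions to the label
    most often seen with it. A stand-in for Extra Trees / DT, not the real thing."""

    def fit(self, rows, labels):
        table = defaultdict(Counter)
        for row, label in zip(rows, labels):
            table[row][label] += 1
        self.table_ = {row: c.most_common(1)[0][0] for row, c in table.items()}
        return self

    def predict(self, rows):
        out = []
        for row in rows:
            if row in self.table_:
                out.append(self.table_[row])
            else:
                # Unseen pattern: fall back to the most frequent base prediction.
                out.append(Counter(row).most_common(1)[0][0])
        return out


if __name__ == "__main__":
    # Hypothetical level 0 outputs: classifiers A and B confuse fear with surprise.
    y_train = ["fear", "fear", "surprise", "happiness"]
    clf_a = ["surprise", "surprise", "surprise", "happiness"]
    clf_b = ["surprise", "surprise", "surprise", "happiness"]
    clf_c = ["fear", "fear", "surprise", "happiness"]
    meta = LookupMetaClassifier().fit(stack_predictions([clf_a, clf_b, clf_c]), y_train)
    print(meta.predict([("surprise", "surprise", "fear")]))
```

In this toy case a plain majority vote over the pattern (surprise, surprise, fear) would return surprise, whereas the meta learner has seen that this pattern usually corresponds to fear, which is the behaviour the stacking description attributes to the level 1 classifier.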

Fig. 1 Ensemble model architecture based on the frequency of votes (FV-EffNet)

Fig. 2 Stacking classifier architecture (SC-EffNet)

5 Experiment results and analysis

5.1 Dataset description

This section presents different datasets used for the evaluation of the proposed technique.

  1. Oulu-CASIA: This facial expression database is a posed dataset [57]. It includes 480 image sequences elicited from 80 subjects in six different background settings using three illumination conditions: normal, weak, and dark. Near-Infrared (NIR) and Visual (VIS) cameras were used to capture the same facial expressions elicited by the subjects. Each image sequence in the database begins with a neutral expression and ends with the peak emotion. Each image has a resolution of 320 × 240 pixels. The images captured by the visual cameras are used to evaluate the proposed methodology. In this experiment, peak expressions from the last three frames are chosen as training, testing, and validation data. After applying the Viola-Jones face detection algorithm, a total of 236 images from all expression classes are chosen for our experiments out of 240 image segments. Images are resized to 224 × 224 pixels. Later, horizontal and vertical scaling augmentations are applied to increase the number of images. This dataset includes the basic emotions anger, disgust, fear, happiness, sadness, and surprise.

  2. Radboud Faces Database (RaFD): The RaFD database [21, 39] contains images of 67 subjects collected using five camera angles, giving five pose angles (0°, 45°, 90°, 135°, and 180°), with three gaze directions (frontal, left, and right views). The dataset includes the expressions anger, disgust, fear, happiness, sadness, surprise, contempt, and neutral. The database contains a total of 8040 images, each with a resolution of 681 × 1024 pixels. Two types of experiments are carried out in this work using the RaFD database. The first uses only frontal-pose images, where the Viola-Jones algorithm is applied for face detection and augmentations such as horizontal and vertical scaling are used to increase the number of frontal-pose images. The second uses the entire RaFD dataset, which includes all five pose angles. A total of 7974 facial images are detected using the MTCNN face detection approach. Data augmentation is not applied to this multi-pose data, and all images are resized to 224 × 224 pixels.
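Section 4 describes the offline augmentation as horizontal and vertical flipping of the input images. A minimal sketch of those two transformations is given below; it represents an image as a nested list of pixel values so that it stays dependency-free, and the function names are illustrative rather than taken from this work.

```python
def flip_horizontal(image):
    """Mirror left-right: reverse the pixels within each row."""
    return [row[::-1] for row in image]


def flip_vertical(image):
    """Mirror top-bottom: reverse the order of the rows."""
    return [list(row) for row in image[::-1]]


def augment(image):
    """Return the original image plus its two flipped copies."""
    return [[list(row) for row in image], flip_horizontal(image), flip_vertical(image)]


if __name__ == "__main__":
    tiny = [[1, 2],
            [3, 4]]
    for variant in augment(tiny):
        print(variant)
```

Applied offline, each source image would yield three training samples under this scheme; the exact augmentation factor used in the experiments is not stated here.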

5.2 Implementation details

To train the network, this work uses a pre-trained network with pre-trained weights instead of training from scratch, an approach known as transfer learning. Transfer learning has proved effective in various computer vision applications [26]. Here it is applied to EfficientNet B0 pre-trained on the ImageNet dataset, which is much broader than the facial images presented to the proposed models. The network weights are fine-tuned by the optimizer in the new training phase, allowing the model to adapt to our problem. The imported model already encodes substantial knowledge about generic objects.

In the training phase, the Adam optimizer is used to update the weights, and the learning rate is reduced by a factor of 10 when training stagnates (patience = 7). The learning rate starts at 1e-4, the batch size is 10, and the number of epochs is fixed at 50. During training, an early stopping callback is used to prevent the EfficientNet architecture from over-fitting. The model weights are saved using the model checkpoint. The Rectified Linear Unit (ReLU) [46] activation function is used to introduce non-linearity; it is computed as f(φ) = max(0, φ). With ReLU, the network becomes more efficient due to its sparse feature representation; it also speeds up training, reduces computational complexity, and mitigates the vanishing gradient problem. A softmax layer is used as the output layer of the EfficientNet model. Softmax is used in multi-class classification problems [34] to estimate the probability that a test sample belongs to each class.
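For concreteness, the two activation functions mentioned above can be written out as follows. This is a plain-Python sketch rather than the framework code used for training; subtracting the maximum logit inside `softmax` is a standard numerical-stability step added here and is not something the text specifies.

```python
import math


def relu(x: float) -> float:
    """ReLU as defined in the text: f(x) = max(0, x)."""
    return max(0.0, x)


def softmax(logits):
    """Convert raw class scores into probabilities that sum to 1."""
    m = max(logits)  # shift by the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


if __name__ == "__main__":
    print([relu(v) for v in (-2.0, 0.0, 3.5)])
    print(softmax([1.0, 2.0, 3.0]))
```

The predicted expression class is then the index of the largest softmax probability.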

5.3 Experiment 1: Evaluation of EfficientNet B0 model

In this experiment, the EfficientNet B0 model is used as a classifier for evaluating posed datasets, and the results are presented in Table 2. With augmented data, the model achieved accuracies of 97.28% and 98.53% on the Oulu-CASIA and RaFD (only the 90° frontal pose) datasets, respectively. Without data augmentation, the EfficientNet B0 model as a classifier achieved 93.72%, 95.10%, and 97.06% on the Oulu-CASIA, RaFD frontal-pose, and RaFD multi-pose datasets, respectively.

Table 2 Experimental results obtained with the EfficientNet B0 architecture as a classifier

5.4 Experiment 2: Proposed methodology

The entire dataset is split into training, testing, and validation sets. The weights of every epoch are saved using the model checkpoint while monitoring the parameter val_acc. An early stopping callback stops the training of the EfficientNet model if val_acc does not increase for 10 consecutive epochs (patience = 10). Among all the saved weights, the ‘n’ best weights are loaded, where n = 1 to 5. Their respective intermediate features are fed to the combination of machine learning base classifiers for classification, as shown in Figs. 1 and 2.
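The checkpointing logic described above can be sketched independently of any deep learning framework. The snippet assumes a hypothetical list of per-epoch validation accuracies; `early_stop_epoch` and `top_n_epochs` are illustrative helper names, and the stopping rule (stop once there has been no improvement for `patience` consecutive epochs) is an assumption about how the callback behaves.

```python
def early_stop_epoch(val_acc, patience=10):
    """Return the index of the last epoch run before early stopping triggers."""
    best, wait = float("-inf"), 0
    for epoch, acc in enumerate(val_acc):
        if acc > best:
            best, wait = acc, 0  # improvement: reset the patience counter
        else:
            wait += 1
            if wait >= patience:
                return epoch
    return len(val_acc) - 1  # training ran to the final epoch


def top_n_epochs(val_acc, n=5):
    """Return the indices of the n epochs with the highest validation accuracy."""
    ranked = sorted(range(len(val_acc)), key=lambda e: val_acc[e], reverse=True)
    return sorted(ranked[:n])


if __name__ == "__main__":
    # Hypothetical validation accuracies, one value per epoch.
    history = [0.70, 0.80, 0.85, 0.84, 0.90, 0.88, 0.91, 0.89]
    last = early_stop_epoch(history, patience=10)
    print("last epoch:", last)
    print("best epochs:", top_n_epochs(history[: last + 1], n=5))
```

The weights saved at the returned epoch indices would then be reloaded one at a time to extract the intermediate features passed to the base classifiers.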

The detailed experimental procedure, outlining the entire flow of the proposed methodology, is depicted in Fig. 3. First, the model weights that achieved the best results are saved, and their respective intermediate features are loaded. Next, every machine learning classifier is fitted, and its performance is verified on the intermediate features of the saved ‘n’ models. Finally, the classifier that gives the best results is chosen as the base classifier for that specific model. The same procedure is followed for the remaining ‘n-1’ models. The parameters of the machine learning classifiers are fine-tuned by varying the number of estimators, the maximum tree depth, the minimum samples per leaf node, the minimum samples per split, and the maximum number of iterations for the RF, DT, Extra Trees, SVM, and MLP classifiers, and the number of nearest neighbors for the KNN classifier. The parameters that gave the best performance were eventually adopted for the individual classifiers. Figs. 4 and 5 present the results obtained from the base classifiers when evaluated using an identical set of machine learning classifiers.

Fig. 3 Detailed procedure outlining the flow of proposed methodology

Fig. 4 Results obtained from base classifiers (level 0) on Oulu-CASIA dataset

Fig. 5 Results obtained from base classifiers (level 0) on RaFD dataset (Multi-Pose)

5.4.1 Frequency-based voting strategy

The intermediate features of the five best epochs are fed to ‘m’ machine learning base classifiers, where m = 1 to 5. The predictions from the five individual machine learning classifiers (base classifiers) are appended and considered for further evaluation. The final class label is the one predicted by most classifiers in the ensemble model. The experimental results obtained with the proposed approach are presented in Tables 3 and 4. The deep learning model (EfficientNet B0), combined with distinct classifiers using a frequency-based voting strategy, gave the best results of 98.71% and 98.56% on the Oulu-CASIA and RaFD multi-pose datasets, respectively, as also depicted in Figs. 7 and 8.
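The voting step can be illustrated with a short sketch. It assumes each base classifier has already produced one label per image; the labels below are hypothetical, and the tie-breaking rule (the label encountered first among equally frequent ones wins) is an assumption, since the text does not state how ties are resolved.

```python
from collections import Counter


def frequency_vote(predictions):
    """Combine per-classifier label lists into one majority label per image.

    predictions: list of lists, one inner list per base classifier,
    all of the same length (one entry per image).
    """
    final = []
    for votes in zip(*predictions):  # one tuple of votes per image (row)
        label, _count = Counter(votes).most_common(1)[0]
        final.append(label)
    return final


if __name__ == "__main__":
    # Hypothetical outputs of five base classifiers for two images.
    preds = [
        ["anger", "fear"],
        ["anger", "surprise"],
        ["sadness", "fear"],
        ["anger", "fear"],
        ["anger", "happiness"],
    ]
    print(frequency_vote(preds))
```

With five voters, a label needs only the largest count to win, not an absolute majority, so three-way splits are still resolved by this rule.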

5.4.2 Stacking classifier

Initially, the intermediate features of the five best epochs are fed to ‘m’ machine learning classifiers (base classifiers), where m = 1 to 5. After empirically testing each machine learning classifier, the classifiers that perform best are chosen as base classifiers. All base classifiers are trained on the training data, and their predictions are horizontally stacked, converted into vectors, and fed to the meta classifier to obtain the final prediction. Due to a lack of test data, the test set is subjected to cross-validation (cv = 5), which helps establish the robustness of the stacking strategy and the model’s generalizability. Each machine learning classifier (Extra Trees (ET), KNN, RF, DT, MLP, and SVM) is individually evaluated as a meta classifier. As shown in Fig. 6, based on the performance of the various meta classifiers, the Extra Trees (ET) and DT classifiers proved to be efficient on the Oulu-CASIA and RaFD datasets, respectively.
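To make the cv = 5 setting concrete, the sketch below partitions sample indices into five folds, each serving once as the held-out part. It is an assumption-laden illustration: the text does not say whether the folds were shuffled or stratified, so this version uses plain contiguous folds, and `k_fold_indices` is a hypothetical helper name.

```python
def k_fold_indices(n_samples, k=5):
    """Yield (train_idx, test_idx) pairs for k contiguous, near-equal folds."""
    # The first n_samples % k folds receive one extra sample.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, test_idx
        start += size


if __name__ == "__main__":
    for fold, (train_idx, test_idx) in enumerate(k_fold_indices(12, k=5)):
        print(fold, "held out:", test_idx)
```

The reported accuracy would then be the mean of the five per-fold scores of the meta classifier.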

Fig. 6 Selection of meta classifier (level 1) for evaluation of distinct set of base classifiers in stacking classifier approach

The results obtained when identical base classifiers are used, with the same machine learning classifier chosen as the meta classifier, are depicted in Figs. 7 and 8 and in Table 3. Similarly, the results for the EfficientNet model with a distinct combination of machine learning classifiers and the stacking classifier are presented in Figs. 7 and 8 and Table 4. Accuracies of 98.35% and 98.06% are obtained with the stacking classifier approach on the Oulu-CASIA and RaFD (multi-pose) datasets, respectively.

Fig. 7 Results obtained when evaluating the frequency-based and stacking classifier approaches on Oulu-CASIA dataset

Fig. 8 Results obtained when evaluating the frequency-based and stacking classifier approaches on RaFD dataset (Multi-Pose)

Table 3 The output of machine learning algorithms after fine-tuning individual classifier on Oulu-CASIA and RaFD datasets
Table 4 The output of distinct machine learning algorithms after fine-tuning on Oulu-CASIA and RaFD datasets

5.5 Observations

The confusion matrices obtained when evaluating the individual machine learning classifiers are presented in Figs. 9 and 10. Figs. 11 and 12 depict the confusion matrices obtained when evaluating the combinations of machine learning classifiers on the Oulu-CASIA and RaFD (multi-pose) datasets, respectively. The observation is that every individual classifier contributes comparably to classifying the expressions into their respective classes. For example, in the confusion matrices in Fig. 9, the Extra Tree, RF, and DT classifiers each assign 108 images correctly to the anger class and misclassify one image as sadness. Similarly, the KNN classifier assigns 107 images correctly to the anger class and makes two misclassifications, whereas the SVM and MLP classifiers correctly classify all 109 anger images. Thus, the conceptual lesson from this work is that every individual classifier plays a part in assigning expressions to the appropriate emotion class.
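The anger-class counts quoted above can be turned into per-classifier recall with a few lines of Python. The counts are taken from the text; the dictionary layout and the classifier abbreviations used as keys are illustrative choices.

```python
# Anger images on Oulu-CASIA (from the Fig. 9 discussion): 109 in total.
anger_total = 109

# Correctly classified anger images per base classifier, as quoted in the text.
anger_correct = {
    "ET": 108,
    "RF": 108,
    "DT": 108,
    "KNN": 107,
    "SVM": 109,
    "MLP": 109,
}

# Recall for the anger class = correct anger predictions / all anger images.
anger_recall = {
    name: correct / anger_total for name, correct in anger_correct.items()
}

for name, recall in anger_recall.items():
    print(f"{name}: {recall:.2%}")

# Gap between the strongest and weakest classifier on this class.
spread = max(anger_recall.values()) - min(anger_recall.values())
print(f"spread: {spread:.2%}")
```

On this class the six classifiers differ by at most two images out of 109, which is the sense in which they contribute comparably.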

Fig. 9 Confusion matrices obtained from machine learning classifiers on Oulu-CASIA dataset ((a) Extra Tree Classifier (b) Random Forest (c) Decision Trees (d) k-Nearest Neighbors (e) Support Vector Machine (f) Multi-Layer Perceptron)

Fig. 10 Confusion matrices obtained from machine learning classifiers on RaFD (Multi-Pose) dataset ((a) Extra Tree Classifier (b) Random Forest (c) Decision Trees (d) k-Nearest Neighbors (e) Support Vector Machine (f) Multi-Layer Perceptron)

Fig. 11 Confusion matrices obtained from the combination of various machine learning classifiers on Oulu-CASIA dataset ((a) Majority Voting (b) Stacking Classifier)

Fig. 12 Confusion matrices obtained from the combination of various machine learning classifiers on RaFD (Multi-Pose) dataset ((a) Majority Voting (b) Stacking Classifier)

The best features fed into an ensemble of machine learning classifiers showed enhanced performance on the frontal pose and multi-pose datasets. By fusing the outputs of the base classifiers and providing them to the higher-level classifier, we try to reduce errors by analyzing the pattern of base predictions before making the final prediction. This suggests that combining classifiers through the stacking approach is better than selecting the single best classifier. It helps improve the system’s efficiency and correct mistakes made at the previous classification level. The ensemble of deep learning and machine learning techniques performs better than earlier methods, achieving state-of-the-art results on the Oulu-CASIA and RaFD datasets. The proposed approach is robust against pose variations and involves multiple processing stages. However, it is majority voting that predominantly enhances the effectiveness of the system.

5.6 Experiment analysis and comparisons

The results of previous FER studies are presented in Table 5 and compared with the proposed approach. With the proposed methodology, accuracies of 98.71% and 98.56% using the frequency-based strategy and 98.35% and 98.06% using the stacking classifier approach are obtained on the Oulu-CASIA and RaFD multi-pose datasets, respectively. The performance of the proposed approach is compared with other machine learning methods, CNN-based methods, and state-of-the-art results on the Oulu-CASIA and RaFD datasets. As observed in Table 5, the proposed model achieves better results than the other methods on these benchmark facial expression databases.

Table 5 Comparison with previous approaches on Oulu-CASIA and RaFD datasets

On the RaFD dataset with a frontal pose, the proposed model outperforms [40] and [13] by 0.83% and 1.42%, respectively. The authors of [39] used a pyramid histogram of oriented gradients for feature extraction with KNN classification and attained 100% accuracy. With an ensemble model, the proposed method likewise obtained 100% accuracy. Additionally, eight expression classes were considered in the proposed work, and the accuracy still improved by 2.29% over [50], whose authors omitted the contempt class and employed seven expression classes. On the Oulu-CASIA dataset, the proposed technique improved accuracy by 10.71% compared to [52], which extracted the expressive component through a de-expression mechanism.

Table 6 presents the experiment results evaluated using other performance metrics. On the RaFD dataset with a frontal pose, the proposed model performs better in terms of precision, recall, and F1-score than [3]. Before making the final prediction, the proposed model examines the pattern in the base classifiers’ predictions to minimize errors. Hence, the results showed better performance than previous studies, making the proposed system robust against pose variations.
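For reference, precision, recall, and F1-score for a multi-class problem can be computed as in the sketch below. The toy labels are invented for illustration, and macro averaging is an assumption, since the averaging scheme behind Table 6 is not stated here.

```python
from sklearn.metrics import precision_recall_fscore_support

# Toy ground-truth and predicted labels for three expression classes
# (illustrative values, not taken from the experiments).
y_true = [0, 0, 0, 1, 1, 1, 2, 2, 2]
y_pred = [0, 0, 1, 1, 1, 1, 2, 2, 0]

# Macro averaging: compute each metric per class, then take the
# unweighted mean so every class counts equally.
precision, recall, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="macro"
)

print(f"precision={precision:.4f} recall={recall:.4f} f1={f1:.4f}")
```

Macro averaging weights each expression class equally, which is a common choice when class sizes are roughly balanced.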

Table 6 Experiment results using other performance metrics

6 Conclusion

An ensemble model with a frequency-based voting approach (FV-EffNet) and a stacking classifier approach (SC-EffNet) is adopted to classify the expressions into their respective classes. Combining multiple base classifiers lets the higher-level classifier learn the pattern of their predictions, which helps the ensemble make more accurate predictions than a single selected classifier. In the proposed methodology, although both ensemble models gave strong results, majority voting predominantly helped improve the system’s performance.

The following conclusions are drawn from the experimental results: (1) Selecting the best model weights and the features extracted from EfficientNet gave better performance than a baseline model: 98.71% and 98.35% on Oulu-CASIA, and 98.56% and 98.06% on RaFD (multi-pose), using the frequency-based and stacking classifier approaches, respectively. (3) The ensemble model with the frequency-based approach showed an improvement of 10.71% on the Oulu-CASIA dataset compared to [6] and of 2.29% on the RaFD multi-pose dataset, outperforming [50]. (4) The stacking classifier approach improved accuracy by 10.35% and 1.79% on the Oulu-CASIA and RaFD datasets, respectively. This method decreases the risk of getting varying results from different machine learning classifiers and reduces bias and variance. The cross-validation performed on the test set supports the model’s robustness and generalizability. (5) The proposed method with multi-stage processing showed better results under pose variations. Future work is to evaluate the proposed approach on spontaneous and in-the-wild databases and to build a fully automated system that is feasible to deploy in real-world applications.