1 Introduction

The ability to express emotions through facial expressions is crucial for humans to communicate and connect. As technology advances and people increasingly rely on computers for activities such as online learning and shopping, interacting with virtual systems has become an essential part of daily life. For this interaction to feel as natural as interaction between people, computers must be able to understand and respond to human emotions and mental states.

In recent years, there has been growing interest in emotion recognition (ER), the accurate and automated classification of emotions in images or video sequences [1,2,3,4]. The topic has gained attention in psychology, computer vision, and artificial emotional intelligence (AEI).

ER not only helps recognize human emotional states but also enables machines to imitate emotions in human-machine interaction, which has significant practical value. One example is driver safety monitoring, which determines whether the driver is attentive or distracted and predicts the driver's actions based on confusion or frustration levels.

Other applications include human-robot emotional interaction; medical settings, where signs of depression or pain are detected; and the identification of children with learning or cognitive disabilities, where the level of involvement is assessed and linked to the likelihood of autism or attention deficit hyperactivity disorder (ADHD). This also involves recognizing the specific elements or situations that capture or irritate a child's attention, as they may indicate a particular condition [5,6,7]. In e-learning, facial expressions are used to determine which sections of a lecture confuse the majority of students and to gauge student engagement while watching a video. Finally, in Talentino, facial expressions are used to understand candidates' engagement during interviews and to analyze their behavior.

Although considerable progress has been made in emotion recognition, capturing dynamic emotion variations still presents several challenges, and precise emotional analysis remains difficult.

Many systems use facial expressions and features to identify human emotions [8,9,10]. Such systems typically involve several steps: image retrieval, preprocessing, segmentation, feature extraction, facial expression classification, and training [11].

Unconstrained, in-the-wild environments present various difficulties for practical deployment. At the same time, social networks and applications increasingly serve as data sources, and deep learning networks have improved both analysis and recognition.

The majority of current efforts [3, 4] concentrate on using convolutional neural networks (CNNs) to extract a feature representation of each frame, but they do not take into account the correlation between the frames of a video sequence. These approaches seek to identify the most significant expression features in each frame and treat the problem as an image-based task, relying mainly on the spatial features in the images. Other recent works [12, 13] have also considered temporal features to enhance recognition accuracy.

There are two main types of techniques for facial expression recognition (FER): static image-based approaches and dynamic sequence-based approaches. Most static frame-based techniques select peak (apex) frames from videos and then perform facial emotion detection on these frames using local binary patterns [14], Gabor wavelets [15], or neural features. For instance, Zhao et al. [16] propose guiding a peak-piloted deep network with samples at peak expression so that it learns from a set of non-peak expressions. Meng et al. [17] propose using an attention mechanism to combine multiple distinct frames into a unified video-level representation. These techniques are effective at choosing peak frames, but they do not take into account changes over time or the relationship between consecutive facial frames.

In contrast to static frame-based techniques, dynamic sequence-based techniques learn spatiotemporal relations using 3D convolutional neural networks (3DCNN) [12] and long short-term memory (LSTM) [18], which allows them to model long-term dependencies and improve FER performance.

To capture the temporal characteristics of the spatial features and increase recognition accuracy, Kim et al. [19] suggest utilizing an LSTM network. Chen et al. [20] propose a 3D-Inception-ResNet that enhances the learned feature representations by computing attention maps based on spatio-temporal and channel-wise factors. Li et al. [13] recently developed a clip-aware dynamic facial expression recognition approach that extracts clip-level features from each clip-based representation and re-weights them.

Although various methods [12, 18] have been developed for in-the-wild FER, their performance is still far from ideal due to occlusions, varying head poses, poor lighting, and other unanticipated challenges in real-world scenarios. Capturing spatially and temporally discriminative information for in-the-wild FER remains difficult. As transformer-based approaches to computer vision have recently become increasingly popular [21, 22], our grasp of discriminative feature representation and contextual information modeling has deepened significantly.

We can summarize our main contributions as follows:

  • We proposed ViTCN, a hybrid architecture combining the powerful Vision Transformer (ViT) with a Temporal Convolution Network (TCN).

  • We treated FER as a sequence-based task, processing sequences of frames so that expressions are recognized in their actual temporal context.

  • We evaluated the proposed architecture on numerous standard data sets, demonstrating that it outperforms state-of-the-art methods. It obtained the highest results on DFEW [23], AFFWild2 [24], MMI [25], and DAiSEE [26], and achieved comparable results on other data sets such as CK+ [27].

  • We conducted an ablation study to confirm the effectiveness of each element of the proposed model. All experiments were run on a single GPU.

Fig. 1 State-of-the-art ViT with TCN architecture

2 Related Work

The temporal correlations of successive frames in a sequence can be useful for facial expression recognition, even though the majority of earlier models concentrated on static images. In this section, we present advanced deep networks for FER that take into account the spatial and temporal motion patterns in video frames, as well as the learned features obtained from the temporal structure. Spatio-temporal FER networks utilize both textural and temporal information to capture and represent more nuanced expressions, taking a set of frames from a temporal window as a single input without knowing the intensity of the expression beforehand.

To identify emotions in a series of video frames, existing approaches frequently use recurrent neural network (RNN) models and their variants. In several real-world applications, hybrid architectures built around CNN models have displayed outstanding performance. Deep RNNs, in particular LSTMs, have demonstrated impressive results in capturing the temporal relationships of sequential data.

RNNs are neural networks with recurrent loops, which allow them to effectively learn time-based patterns in sequential data: to predict the current output, an RNN can link historical information to the task at hand. However, the vanishing or exploding gradient problem makes RNNs difficult to train. LSTM networks, a type of RNN able to learn long-term dependencies, offer a solution to this issue.

CNNs are a different type of neural network that performs convolution operations. Instead of the usual full matrix multiplication [28], convolution computes a weighted sum of neighboring input pixels using special kernels; a convolutional layer convolves the input into a feature map. By analogy with human vision, convolution models the response of a neuron in the visual cortex to a specific stimulus, and each convolutional neuron processes data only within its receptive field.
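
To make the weighted-sum view of convolution concrete, the following minimal PyTorch sketch slides a hand-crafted 3x3 kernel over a small single-channel image (an illustrative example only, not code from any cited work):

```python
import torch
import torch.nn.functional as F

# Each output value is a weighted sum of the neighboring input pixels
# covered by the 3x3 kernel.
image = torch.rand(1, 1, 8, 8)                    # (batch, channels, height, width)
kernel = torch.tensor([[[[-1., -1., -1.],
                         [-1.,  8., -1.],
                         [-1., -1., -1.]]]])      # (out_channels, in_channels, 3, 3)

feature_map = F.conv2d(image, kernel, padding=1)  # same spatial size as the input
print(feature_map.shape)                          # torch.Size([1, 1, 8, 8])
```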

LSTMs have a chain-like structure of memory cells, each containing four interacting neural network layers designed to interact in a very particular way. Gated recurrent unit (GRU) models are a variant of the LSTM architecture: they use less memory since they employ fewer training parameters and compute more quickly, whereas LSTM models tend to be more accurate on larger data sets. LSTM and GRU networks have been used to attain the most advanced results to date, and their performance is further enhanced through FER training. An LSTM [29] or GRU [30] network is given a sequence of frames to learn variations in facial expressions and to identify an individual's emotional or mental state.

For FER, a number of pre-trained models based on CNN architectures and related variants have been proposed, including autoencoders, CNNs, and confidence networks. They show significant potential for automated feature learning but lack the capacity to capture contextual temporal information. To address this, RNN variants such as CNN-LSTM [31, 32] and CNN-GRU [33] have been integrated with CNNs, enhancing their effectiveness on facial emotion recognition tasks. By reducing the impact of individual differences and the surrounding environment, these networks identify facial expressions more accurately, extracting more detailed information and separating expression information from sequences of facial frames. The LSTM component learns and recognizes the patterns that change over time, while the CNN extracts deep visual information. These networks also highlight the significance of reading micro-expressions.
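
As a rough, illustrative sketch of this family of hybrids (not the specific architectures cited above), the snippet below pairs a ResNet-18 feature extractor with an LSTM; the backbone choice, hidden size, and classification from the last time step are simplifying assumptions:

```python
import torch
import torch.nn as nn
from torchvision import models

class CNNLSTM(nn.Module):
    """A CNN extracts per-frame visual features; an LSTM models how those
    features change across the frame sequence."""

    def __init__(self, num_classes=7, hidden_size=128):
        super().__init__()
        backbone = models.resnet18(weights=None)   # ImageNet weights would be loaded in practice
        backbone.fc = nn.Identity()                # keep the 512-d feature vector
        self.cnn = backbone
        self.lstm = nn.LSTM(512, hidden_size, batch_first=True)
        self.classifier = nn.Linear(hidden_size, num_classes)

    def forward(self, clips):                      # clips: (B, T, 3, 224, 224)
        b, t, c, h, w = clips.shape
        feats = self.cnn(clips.reshape(b * t, c, h, w))   # (B*T, 512)
        feats = feats.reshape(b, t, -1)                   # (B, T, 512)
        out, _ = self.lstm(feats)                         # temporal modeling
        return self.classifier(out[:, -1])                # classify from the last step

logits = CNNLSTM()(torch.randn(2, 16, 3, 224, 224))       # two clips of 16 frames
```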

Bidirectional LSTM (BDLSTM) and Bidirectional GRU (BIGRU), extensions of the conventional LSTM and GRU architectures, respectively, further improve the effectiveness of learning models for FER. A BDLSTM processes the sequence in both directions using two LSTMs, so the network receives more context, which speeds up the learning of an expression sequence. For FER, a CNN is included as a hybrid link so that the model can thoroughly process the variations in facial expressions; the CNN-BDLSTM [34, 35] and CNN-BIGRU [36] hybrid models are two examples.

The success of transformer networks in natural language processing (NLP), where they were created to model long sequence inputs, has attracted the curiosity of many in computer vision. Compared with CNNs, the ViT has produced impressive results by pre-training on large data sets such as ImageNet-1k and then fine-tuning on the target data set. ViT adapts the transformer to vision by dividing the image into patches, and the self-attention mechanism of the transformer captures long-range dependencies between these patches. Chaudhari et al. [21] applied ViT to FER, and Aouayeb et al. [22] also applied the ViT structure to FER by injecting a Squeeze-and-Excitation (SE) block before the multilayer perceptron (MLP) head.

Xue and colleagues [37] introduced TransFER, a transformer-based method in which local CNN blocks locate diverse local patches after a backbone CNN extracts feature maps; a transformer encoder with multi-head self-attention dropping then models the global correlation among these local patches.

A two-stream pyramid cross-fusion transformer network was proposed in [38]; to address scale sensitivity, intra-class discrepancy, and inter-class similarity in FER, it explores the relationship between landmark features and image features.

The primary difficulties in using CNNs for FER relate to computational complexity, image quality, lighting fluctuations, high intra-class variation, and strong inter-class similarity caused by changes in facial appearance. Several studies have therefore built hybrid systems by fusing deep learning approaches to address these problems.

Fig. 2 Expression classes in the CK+ data set

3 Hybrid Model Architecture

Our proposed hybrid model, ViTCN, is composed of two parts: a ViT and a TCN. The ViT extracts the important spatial features from the images, while the TCN encodes the spatiotemporal information extracted from the different video frames, combines the correlated features extracted for each frame, analyzes the relationships between them, and classifies the expression accurately. Both models are explained in the following subsections. Figure 1 shows our proposed architecture.

Fig. 3 Expression classes in the MMI data set

Fig. 4 Samples of the CK+ data set

Fig. 5 Samples of the MMI data set

3.1 Vision Transformer

The ViT [39] architecture is inspired by the basic transformer architecture first used for NLP problems. It most closely resembles the encoder part of the transformer: the image is split into a set of image patches, called visual tokens, which are embedded into encoded feature vectors of a specified dimension, and a positional encoding is added to each patch embedding so that the order of the patches is retained. This architecture was selected because it can identify and capture the significant characteristics of an image in a relatively small feature vector, which is then fed to the TCN model.

In our proposed architecture, we used a pre-trained ViT model available through PyTorch, trained on the ImageNet-1k data set; we replaced its final fully connected layer with one whose output dimension is 32, which is fed to the TCN model afterward.
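
A minimal, illustrative sketch of this step is shown below, assuming the timm library as the source of the vit_base_patch16_224 checkpoint named in Sect. 4.4 (the exact loading code is an assumption, not the original implementation):

```python
import timm
import torch
import torch.nn as nn

# Load the pre-trained backbone and swap its classification head for a
# 32-dimensional linear layer, so each frame is summarised by a 32-d vector.
vit = timm.create_model("vit_base_patch16_224", pretrained=True)
vit.head = nn.Linear(vit.head.in_features, 32)     # 768 -> 32 frame embedding

frames = torch.randn(10, 3, 224, 224)              # one clip of 10 frames
frame_features = vit(frames)                       # shape (10, 32): one vector per frame
```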

3.2 Temporal Convolution Network

Lea et al. first proposed TCNs in 2016 for the purpose of action recognition in videos [40]. The advantage of the TCN architecture comes from its ability to encode spatiotemporal information coming from the different frames of a video, which is then passed to a classifier that maps these features to the corresponding classes. These features can be used to detect actions, emotions, or whatever significant information we need to detect. Moreover, a TCN can process any sequence length, which enables it to consider more in-depth features.

To identify the emotional expression in each frame, the proposed model concentrates on the changing characteristics of the face. This information is then sent to a fully connected layer acting as the classifier, which categorizes the input video according to whether the emotion exists or not (1 or 0).
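
The sketch below is illustrative only: it assumes residual dilated 1D convolutions with the eight layers, kernel size 3, and 10% dropout reported in Sect. 3.3, and a single-logit head as used in the binary DAiSEE setting; it is not the exact implementation.

```python
import torch
import torch.nn as nn

class TemporalBlock(nn.Module):
    """One dilated 1D convolution with a residual connection; stacking blocks
    with growing dilation widens the temporal window the network can see."""

    def __init__(self, channels, kernel_size=3, dilation=1, dropout=0.1):
        super().__init__()
        pad = (kernel_size - 1) * dilation // 2          # keep the sequence length
        self.net = nn.Sequential(
            nn.Conv1d(channels, channels, kernel_size, padding=pad, dilation=dilation),
            nn.ReLU(),
            nn.Dropout(dropout),
        )

    def forward(self, x):                                # x: (B, C, T)
        return x + self.net(x)

class TCNClassifier(nn.Module):
    """Eight dilated blocks over the per-frame 32-d ViT features, followed by
    a linear classifier applied to the last time step."""

    def __init__(self, feat_dim=32, num_layers=8, num_classes=1):
        super().__init__()
        self.blocks = nn.Sequential(
            *[TemporalBlock(feat_dim, kernel_size=3, dilation=2 ** i)
              for i in range(num_layers)]
        )
        self.classifier = nn.Linear(feat_dim, num_classes)

    def forward(self, frame_feats):                      # (B, T, 32) from the ViT
        x = self.blocks(frame_feats.transpose(1, 2))     # Conv1d expects (B, C, T)
        return self.classifier(x[:, :, -1])              # one logit per clip

logits = TCNClassifier()(torch.randn(2, 10, 32))         # two clips of 10 frames
```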

3.3 Training Configuration

We conducted many experiments to select the optimal hyper-parameters for training the proposed architecture, which are discussed in the ablation study section. Our model, ViTCN, is composed of two parts: a ViT and a TCN. We used the pre-trained ViT, replacing its final fully connected layer with a new layer of matching dimension that feeds the TCN module. Our TCN module consists of eight TCN layers with a kernel size of 3. After many trials, we chose the following default values: the ADAM [41] optimizer with a learning rate of 0.001 and a dropout of 10%. We split the data sets into training, validation, and testing sets (60%, 20%, and 20%, respectively). Our batch size is 4 samples, owing to a limitation of our training machine, whose single GPU has only 8 GB of memory. We trained all experiments for 50 epochs, using early stopping in most experiments.
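
The sketch below mirrors this configuration (Adam with lr = 0.001, a 60/20/20 split, batch size 4, up to 50 epochs with early stopping). A random tensor data set and a linear model stand in for the real clips and the ViTCN model so that the snippet is self-contained, and the early-stopping patience of 5 epochs is an assumption:

```python
import torch
import torch.nn as nn
from torch.utils.data import TensorDataset, DataLoader, random_split

dataset = TensorDataset(torch.randn(100, 32), torch.randint(0, 2, (100, 1)).float())
model = nn.Linear(32, 1)                         # placeholder for the full ViTCN model
criterion = nn.BCEWithLogitsLoss()

n = len(dataset)
n_train, n_val = int(0.6 * n), int(0.2 * n)      # 60% / 20% / 20% split
train_set, val_set, test_set = random_split(dataset, [n_train, n_val, n - n_train - n_val])
train_loader = DataLoader(train_set, batch_size=4, shuffle=True)
val_loader = DataLoader(val_set, batch_size=4)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

best_val, patience, bad = float("inf"), 5, 0
for epoch in range(50):                          # up to 50 epochs
    model.train()
    for x, y in train_loader:
        optimizer.zero_grad()
        criterion(model(x), y).backward()
        optimizer.step()

    model.eval()
    with torch.no_grad():                        # early stopping on validation loss
        val_loss = sum(criterion(model(x), y).item() for x, y in val_loader)
    if val_loss < best_val:
        best_val, bad = val_loss, 0
    else:
        bad += 1
        if bad >= patience:
            break
```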

3.4 Loss Calculation

Because most of the available data sets and benchmarks are imbalanced, as discussed in [42], we needed a modified loss function for better back-propagation and learning of the model. We therefore used a modified version of the binary cross-entropy (BCE) loss, weighting the loss for each label (0, 1) as defined in equation (1):

$$\begin{aligned} \mathcal {L}oss = \left( 1 - \frac{\mathcal {N}}{B} \right) \times \mathcal {BCE}\left( real_0, predicted_0\right) + \frac{\mathcal {N}}{B} \times \mathcal {BCE}\left( real_1, predicted_1\right) \end{aligned}$$
(1)

where N represents the count of '0' labels within a batch of size B, real\(_{0}\) is the list of '0' labels from the batch, and predicted\(_{0}\) is the corresponding model output; the same holds for subscript 1 and label '1'.

However, if N equals ’0’, we used equation (2):

$$\begin{aligned} \mathcal {L}oss = 0.35 \times \mathcal {BCE}(real_1, predicted_1) \end{aligned}$$
(2)

The weighted loss function improved the results by scaling the calculated loss according to the label distribution: the dominant label contributes a smaller loss value, which gives more attention to learning the under-represented label and reduces overfitting to the dominant one.
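
A direct implementation of equations (1) and (2) could look as follows (an illustrative sketch; the probability inputs and tensor shapes are assumptions):

```python
import torch
import torch.nn.functional as F

def weighted_bce(predicted, real):
    """Weighted BCE following Eqs. (1)-(2). `predicted` holds probabilities in
    [0, 1] and `real` holds float 0/1 targets; each label group's loss is
    scaled so the dominant label contributes less to back-propagation."""
    zeros, ones = real == 0, real == 1
    n = zeros.sum().item()                       # N: number of '0' labels in the batch
    b = real.numel()                             # B: batch size

    if n == 0:                                   # Eq. (2): no '0' labels in this batch
        return 0.35 * F.binary_cross_entropy(predicted[ones], real[ones])

    loss_0 = F.binary_cross_entropy(predicted[zeros], real[zeros])
    loss_1 = (F.binary_cross_entropy(predicted[ones], real[ones])
              if ones.any() else torch.tensor(0.0))
    return (1 - n / b) * loss_0 + (n / b) * loss_1   # Eq. (1)

loss = weighted_bce(torch.sigmoid(torch.randn(4)), torch.tensor([1., 1., 0., 1.]))
```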

Table 1 DAiSEE Labels File sample
Fig. 6 Original DAiSEE statistics before augmentation on all splits

Fig. 7 DAiSEE data set statistics as a binary classification problem

Fig. 8 Final statistics of the data after splitting into two-second clips and augmentation

4 Experiments

In this section, we summarize the experiments conducted and discuss the obtained results. We first describe the data sets used to train and evaluate the proposed architecture, then the evaluation metrics used to assess the experimental results, which we compare with the latest advancements in the field. Finally, we provide an ablation study of the different contributions, further details of the proposed solution, and additional visual analysis for an in-depth understanding of applying ViT to the FER task.

4.1 Data sets

One of the challenges in ER is finding a data set that suits the application requirements: most available data sets have issues such as non-frontal, interview-based views, single images without sequences, or small size with poor class distribution. Benchmarking our architecture is therefore limited to a small number of applicable data sets; we chose frontal-view data sets containing sequences of images or videos. In the following subsections, we present each data set and its class distribution, along with some statistics and insights.

Fig. 9 Samples of the original DAiSEE data set

Fig. 10 Expression classes in the AFFWild2 data set

Fig. 11 Expression classes in the DFEW data set

Fig. 12 Samples of the original AFFWild2 data set

Fig. 13 Samples of the original DFEW data set

Table 2 Data sets summary

4.1.1 CK+

The CK+ (Extended Cohn-Kanade) data set is an expanded version of the CK data set [27, 43]. It consists of 593 video sequences from 123 individuals, of which only 327 are labeled. The image sequences vary in length from 10 to 60 frames, with frontal and 30-degree views. The videos were captured at 30 frames per second (FPS) with a resolution of either 640x490 or 640x480 pixels, in either 8-bit grayscale or 24-bit color.

There are seven categories of facial expressions: anger, disgust, contempt, fear, happiness, surprise, and sadness. The participants in the CK+ data collection ranged from 18 to 50 years of age, with 69% female and 31% male, and came from different ethnic backgrounds: 81% Euro-American, 13% Afro-American, and 6% other groups. The distribution of expressions in CK+ is somewhat unequal. Most facial expression classification (FEC) algorithms use the CK+ database, which is widely recognized as the most commonly utilized laboratory-controlled FEC database. Figure 2 shows the total number of videos per emotion; it demonstrates that CK+ offers a diverse range of expressions with no strongly dominant class. Figure 4 shows samples from the CK+ data set.

4.1.2 MMI

The MMI data set [25] (named after Maja Pantic, Michel Valstar, and Ioannis Patras) contains videos of the full temporal pattern of facial expressions: each video starts from a neutral facial expression, reaches the peak of the emotion, and returns to a neutral expression. It contains prototypical expressions as well as expressions with a single Facial Action Coding System (FACS) action unit. MMI is composed of about 2900 videos from 75 different subjects. Videos are classified into seven facial expression classes, including surprise, anger, disgust, fear, sadness, and happiness. The distribution of the seven emotions is shown in Figure 3. Figure 5 shows samples from the MMI data set.

4.1.3 DAiSEE

The DAiSEE data set [26] includes 9068 video clips from 112 subjects. Each clip lasts 10 seconds and is recorded at 30 frames per second. Each clip carries four labels, “frustration”, “engagement”, “boredom”, and “confusion”, and for each of these emotions the intensity ranges from 0 to 3 (very low, low, high, and very high), representing the intensity level of the user's emotion. A sample is shown in Table 1.

The DAiSEE data set has skewed statistics in terms of bias and labeling, as shown in Fig. 6, which complicates its use for efficient training. Figure 6 shows a huge imbalance in the labels and in the intensity of emotion within each label. Taking the engagement emotion, for instance, there are 4071 instances labeled “3”, 4477 labeled “2”, 459 labeled “1”, and only 61 labeled “0”, a large bias towards levels 2 and 3 compared with levels 0 and 1. We handled the problem as a binary classification problem (label “1” means that the emotion exists and “0” means its absence), merging the old labels 0 and 1 into the new label 0 and the old labels 2 and 3 into the new label 1, as sketched below.
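
A minimal sketch of this binarisation (the label-file path and column names follow the DAiSEE label files and are assumptions here):

```python
import pandas as pd

# Collapse each 0-3 intensity into 0 (levels 0-1, emotion absent) or
# 1 (levels 2-3, emotion present).
labels = pd.read_csv("Labels/TrainLabels.csv")
for emotion in ["Boredom", "Engagement", "Confusion", "Frustration"]:
    labels[emotion] = (labels[emotion] >= 2).astype(int)
```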

At this point, we would have 8548 clips labeled 1 and only 520 labeled 0. This kind of imbalance prevents the deep learning model from training correctly and makes many evaluation metrics unreliable. As a result, we decided to use intensive data augmentation to balance the data. Figure 7 shows the binary classification statistics; the data still had a significant bias that needed to be fixed. Instead of balancing the levels of all emotions at once, we concentrated on balancing the levels of each emotion separately, because the suggested technique involves a distinct model for each emotion.

To prepare the data for the training phase, we first sampled the videos into frames with a sample rate of 5, converting each 10-second video into 250 frames. We then divided the clip frames into two-second segments instead of 10 seconds, which theoretically enlarges the data set to five times its original size; each folder contains 10 consecutive frames representing 2 seconds of the video. Secondly, for each two-second clip with a label in the labeling file, augmentation techniques were applied to balance the data. For each emotion, the less-represented label (0 or 1) was augmented with different methods using random hyper-parameters. The applied augmentation techniques are adding noise, interpolation, random horizontal flips, random rotations within a small range, random resized cropping, and changing sharpness, saturation, or blurring. After the augmentation step, a reliable, balanced data set is produced with the statistics shown in Fig. 8. Figure 9 shows samples from the DAiSEE data set.
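
A hedged sketch of such an augmentation pipeline using torchvision is given below; the probabilities and parameter ranges are illustrative assumptions, not the exact values used in the experiments, and the pipeline expects tensor images in [0, 1]:

```python
import torch
from torchvision import transforms

add_noise = transforms.Lambda(lambda img: (img + 0.02 * torch.randn_like(img)).clamp(0, 1))

augment = transforms.Compose([
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.RandomRotation(degrees=10),              # small range; later dropped in the ablation study
    transforms.RandomResizedCrop(224, scale=(0.8, 1.0)),
    transforms.RandomAdjustSharpness(sharpness_factor=2, p=0.3),
    transforms.ColorJitter(saturation=0.2),
    transforms.GaussianBlur(kernel_size=3),
    add_noise,
])
```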

4.1.4 AFFWild2

The AFFWild2 data set [24] consists of 546 videos from 554 subjects, of which 326 are male and 228 are female. The videos show a large variety of nationalities, ages, and environments. In total there are around 2.8 million frames, annotated in several ways: around 546 videos (2.6 million frames) are labeled with the basic expressions (happiness, surprise, anger, disgust, fear, sadness, and the neutral state), and 541 videos (2.6 million frames) are categorized based on FACS. The videos labeled with basic expressions are used in this research. The distribution of the seven emotions is shown in Figure 10. Figure 12 shows samples from the AFFWild2 data set.

4.1.5 DFEW

The Dynamic Facial Expression in the Wild (DFEW) data set [23] contains 16372 videos from 1500 movies, and every video presents challenging interferences such as varying illumination, occlusions, and crowding. Twelve professional annotators labeled these videos. Videos are classified into seven facial expression labels: surprise, neutral, happy, angry, disgust, fear, and sad. The distribution of the emotions is shown in Figure 11. Figure 13 shows samples from the DFEW data set.

Table 2 provides a summary of the data sets utilized in training and testing.

Table 3 Comparison of expression classification results on the CK+ data set (image-based)

4.2 Frames Transform

In deep learning frameworks such as PyTorch, transformations such as resizing, normalizing, and converting images to tensors are typically applied to single images. Handling a sequence of frames from a video is more complicated: we built our own class for transforming a sequence of frames, which applies the same set of transformations to each image at the defined sample rate and finally collects and reshapes them into one tensor used as input to the ViT model.
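
A minimal sketch of such a sequence transform is shown below; the resize target and the normalization statistics are assumptions:

```python
import torch
from torchvision import transforms

class SequenceTransform:
    """Apply the same per-frame pipeline to every sampled frame of a clip and
    stack the results into one (T, 3, 224, 224) tensor for the ViT backbone."""

    def __init__(self, sample_rate=5):
        self.sample_rate = sample_rate
        self.frame_tf = transforms.Compose([
            transforms.Resize((224, 224)),
            transforms.ToTensor(),
            transforms.Normalize(mean=[0.5, 0.5, 0.5], std=[0.5, 0.5, 0.5]),
        ])

    def __call__(self, frames):                  # frames: a list of PIL images
        sampled = frames[::self.sample_rate]
        return torch.stack([self.frame_tf(f) for f in sampled])
```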

Table 4 Comparison of expression classification on the CK+ data set (sequence-based)

4.3 Evaluation Criteria

The proposed model is assessed primarily using two criteria, the F1-score and accuracy. In some experiments, the weighted average recall (WAR) and the unweighted average recall (UAR) are also used for comparison with previous work.

The F1-score is the harmonic mean of recall (the ability of the classifier to find all the positive samples) and precision (the ability of the classifier not to label a negative sample as positive). The F1-score reaches its best value at 1 and its worst at 0, and is defined in equation (3):

$$\begin{aligned} \mathcal {F}1 = \frac{2 \times \mathcal {P} \times \mathcal {R} }{\mathcal {P} + \mathcal {R}} \end{aligned}$$
(3)

The F1-score for emotions is calculated by considering the prediction made for each frame, where an emotion category is identified in every frame.

Accuracy (abbreviated as Acc.) is a measure of how well the test samples are predicted, expressed as the proportion of correctly predicted samples. The highest possible accuracy score is 1, indicating perfect predictions, while the lowest score is 0, indicating no correct predictions. The formula for calculating accuracy is as follows:

$$\begin{aligned} \mathcal {A}cc = \frac{\textit{Number of Correctly Predicted Samples}}{\textit{Total Number of samples}} \end{aligned}$$
(4)

UAR is defined as the sum of the per-class accuracies divided by the total number of classes, regardless of the number of samples per class, which makes it a better metric to optimize when the class distribution is imbalanced.

WAR, also called overall accuracy, is the ratio of accurately classified samples to the total number of samples, which depends on the number of samples in each category.
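
For reference, these metrics can be computed as in the hedged sketch below (the weighted averaging mode for the F1-score is an assumption, as the averaging mode is not specified above):

```python
from sklearn.metrics import accuracy_score, f1_score, recall_score

def evaluate(y_true, y_pred):
    """Accuracy, F1, UAR (mean per-class recall) and WAR (overall accuracy)."""
    return {
        "Acc": accuracy_score(y_true, y_pred),
        "F1": f1_score(y_true, y_pred, average="weighted"),
        "UAR": recall_score(y_true, y_pred, average="macro"),
        "WAR": accuracy_score(y_true, y_pred),   # overall accuracy
    }

print(evaluate([0, 1, 2, 2, 1], [0, 1, 2, 1, 1]))
```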

4.4 Implementation Details: Training and Testing Settings

Our model is trained using the PyTorch platform on a single NVIDIA GTX 1080 Ti GPU with 8 GB of memory. By default, we train the model with a batch size of 4; for DFEW and CK+, we use a batch size of 2 to fit into our limited GPU memory. We use the ADAM optimizer to optimize the proposed model, starting with a learning rate of 0.001.

Our backbone is the pre-trained ViT vit_base_patch16_224. During training, we split each video clip into 10 frames; some data sets, however, have fewer frames per clip, as discussed in the corresponding experiments.

4.5 Comparison with state-of-the-art

In the following subsections, the outcomes of the experiments conducted to evaluate the proposed architecture are presented and compared with previous work.

4.5.1 CK+ experiments

CK+ experiments are usually conducted as image-based experiments; hence, we compare the obtained results in two different modes: image-based and sequence-based. We first benchmark the image-based mode. Table 3 compares the results achieved on single-image processing. To run a single image through the TCN network, the network was fed two vector representations: the first is the feature vector from the ViT, and the second is the original image.

The proposed model achieved acceptable results compared with the others, while ViT+SE [22] obtained almost 99.8% accuracy. Although [22] also uses ViT as its backbone, it still holds the best result to date because of the SE block combined with the transformer, in which two fully connected layers are followed by a single pointwise multiplication. We think that integrating this mechanism with the TCN might yield better results across all benchmarks, as the SE block can augment the TCN architecture with a channel-wise attention module; however, we consider this point future work.

On the other side, to allow a fair comparison, the main proposed contribution is also tested in sequence-based mode. Training on sequences of frames in CK+ was hard because peak expressions are absent from the first few frames; hence, we chose to train on the peak frames starting from the 7\(^{\hbox {th}}\) frame.

A comparison of the results achieved on sequences of images is shown in Table 4. The proposed architecture performs better than IT-RBM [48] and STM-Explet [49] by 7.94% and 0.91%, respectively. DCPN [53] obtained the highest performance due to its high training capacity. Its architecture consists of three cascaded Inception deep neural networks: the first network is pre-trained on the ImageNet data set together with augmented images from the CK+ data set and is subsequently fine-tuned on CK+, giving the network prior information about the data set. The first network predicts the emotion and transfers this information to the second network, which chooses two frames from the sequence: the frame with the highest prediction score is considered the peak frame and the one with the lowest score a weak frame. This produces a significant contrast between the peak and non-peak frames, and the two selected frames are then used by the third network to define the emotion for the whole sequence. Due to computation limits, we could not perform similar ImageNet fine-tuning. Although DCPN [53] achieved the highest accuracy using a complex architecture, the proposed architecture accomplished comparable results with a simple one.

Fig. 14 The confusion matrix for the CK+ data set

Table 5 Comparison of expression classification on the MMI data set

Fig. 15 The confusion matrix for the MMI data set

Figure 14 shows the confusion matrix for the CK+ data set. It is shown that most of the classes are relatively easy to distinguish except for contempt vs. sadness and fear vs. contempt.

4.5.2 MMI experiments

Table 5 compares the suggested model with other advanced video-based methods on the MMI data set. The proposed model demonstrates superior performance, achieving an accuracy of 99.2% and surpassing the previous top-performing model MDSTFN [51] by 7.74%, and STM-Explet [49], IDFERM [54], IT-RBM [48], and GCN [55] by 24.08%, 18.07%, 16.99%, and 13.31%, respectively.

The confusion matrix is shown in Fig. 15; it shows that the proposed model perfectly distinguishes all classes.

4.5.3 AFFWild2 experiments

We report accuracy and F1-score results in Table 6. The accuracy comparison reveals that the proposed model outperforms the baseline [56] and the most sophisticated approaches, TSAV [57] and NeteaseFuxi [58], by 34.5%, 21.6%, and 14.41%, respectively; in terms of F1-score, it outperforms them by 0.7, 0.452, and 0.087. These outcomes demonstrate the efficiency of the suggested model in classifying facial expressions.

Table 6 Comparison of Seven Basic Expression classification on AFFWild2 test set

Figure 16 shows the confusion matrix for the AFFWild2 data set. It shows that “Neutral”, “Anger”, “Happiness”, and “Sadness” are relatively easy to distinguish, whereas “Disgust”, “Surprise”, and “Fear” are mostly confused with other emotions, as these classes have few examples.

Fig. 16 The confusion matrix for the AFFWild2 data set

Fig. 17 The confusion matrix for the DFEW data set

Table 7 Comparison of Expression classification on DFEW data set
Table 8 Comparison of expression classification on the DAiSEE data set (multi-class); FS refers to full screen
Table 9 Engagement class classification

4.5.4 DFEW experiments

Experiments on the in-the-wild DFEW data set demonstrate that the suggested model offers a successful approach to identifying and understanding changing facial expressions.

We evaluate our suggested model by comparing it with the current results on DFEW with respect to accuracy, UAR, and WAR. The comparison is presented in Table 7. The proposed model obtains the best results on all three reported metrics, outperforming all results to date.

The previous leading method, STT [62], achieved a UAR of 54.85% and a WAR of 66.65%; the proposed model surpasses STT [62] by 1.42% in UAR and 4.35% in WAR. Furthermore, the proposed model outperforms 3D ResNet-18 [23] by 11.27% and 16.02% in UAR and WAR, respectively. The proposed model also achieves superior accuracy compared with alternative approaches.

Figure 17 shows the confusion matrix for the DFEW data set. It shows that our architecture can distinguish the “Anger”, “Happy”, and “Disgust” classes, whereas the “Fear” and “Sad” classes are somewhat confused with the “Happy” class. The “Neutral” and “Surprise” classes are still not up to par, as they have a limited number of examples, and could be improved in future research.

4.5.5 DAiSEE experiments

The DAiSEE data set is challenging: as explained earlier with Table 1, each video can carry more than one label. First, we trained our network on a multi-class classification problem over the whole data set; Table 8 shows the results obtained with multi-class classification. Inspired by [65], we also tried different approaches instead of ViTCN, such as combining a ResNet backbone with TCN layers and a ResNet backbone with LSTM layers.

Table 10 Confusion class classification
Table 11 Frustration class classification
Table 12 Boredom class classification

Many research papers, such as [65, 67,68,69,70,71], have investigated working with the engagement class. This encouraged us to fine-tune our architecture for binary classification of each class; using a separate binary classifier per class results in four different models. In this section, we explain the experiments for each class: Tables 9, 10, 11, and 12 report the obtained results for the engagement, confusion, frustration, and boredom expression classes, respectively.

Table 9 compares the obtained results with those reported in [65, 67, 68], and [71]. The proposed architecture, ViTCN, achieves higher results; however, we noticed that it overfits the dominant class. This overfitting problem is discussed in the ablation study.

There are no previously reported results for the other three classes in the literature as a binary classification problem; hence, we show only our obtained results for those classes.

Table 10 shows the classification results for the confusion emotion. The proposed architecture achieves promising results using the default configuration of cropped faces (CF) in each video sequence, outperforming the other experiments we performed on the full screen (FS) of the video sequence.

Table 11 shows the classification results for the frustration emotion. The proposed architecture achieves promising results using the default CF configuration and an augmentation ratio of 35% in each video sequence, again outperforming the experiments performed on the FS of the video sequence.

Despite the imbalance in the boredom class, the classification presented in Table 12 yields satisfactory results when employing the default CF configuration in every video sequence, along with an augmentation ratio of 35% to maintain data set balance.

4.5.6 Discussion

By successfully integrating ViT with TCN in our ViTCN architecture, we unlock substantial performance gains for ViT in FER tasks. This hybrid approach outperforms existing advanced models while being trained under restrictive conditions (single GPU), demonstrating its potential for real-world applications. The proposed architecture achieved accuracy improvements exceeding 8% in MMI, 14% in AFFWild2, and 4% in DFEW. We delve deeper into the influence of training-phase choices like ViT freezing, input type (full frame vs. cropped faces), and loss function (normal vs. weighted) on the model’s effectiveness in the following section.

4.6 Ablation Study

In this section, we perform ablation experiments to assess the influence of each element of our model, specifically the ViT architecture, TCN block, and data augmentation techniques.

First, the CK+, MMI, and DFEW data sets are utilized to conduct the experiments. Second, we assess our hyper-parameter tuning on the DAiSEE data set.

4.6.1 ViT Study

First, we assess the performance of the ViT architecture, the added TCN block, and the use of the Hugging Face ViT_base Patch 16 model pre-trained on ImageNet-21k. Tables 13 and 14 show the accuracy and the F1-score, respectively. Adding the TCN block outperforms the basic ViT architecture on the MMI and DFEW data sets. On the CK+ data set, although the accuracy decreases slightly by 0.1%, the F1-score improves by 0.85%.

On the MMI data set, the accuracy improves by 0.9% and the F1-score by 1.4%. On the DFEW data set, adding the TCN block improves the accuracy by 20.4% and the F1-score by 13.9%.

Table 13 Ablation study in terms of accuracy with and without TCN block
Table 14 Ablation study in terms of F1-Score with and without TCN block
Table 15 Engagement class on full-screen images across different augmentation (Aug.) ratios
Table 16 Engagement class on cropped faces across different augmentation (Aug.) ratios
Table 17 Studying the effect of freezing versus not freezing the ViT for the engagement class across different proposed models; Aug. stands for the applied augmentation ratio

4.6.2 Data Processing Study on DAiSEE data set

Tables 15 and 16 report different experiments; we first discuss the augmentation ratio over the data set. We generate additional frames from the original DAiSEE frames using the augmentation techniques discussed earlier, rebalancing the data by enlarging the weak class; a 100% augmentation ratio indicates that the augmented data set is fully balanced. We noticed that rotating the images usually leads to overfitting on rotated images, since most rotated images are generated by our augmentation procedure; hence, we eliminated the rotation step from the augmentation techniques used in all the conducted experiments.

Increasing the augmentation level on full-screen images (Table 15) decreases the accuracy and F1-score. A balancing ratio of 35% proves sufficient, as 15% is rather low and more than 50% leads to overfitting; the model also becomes more sensitive to the augmented samples. Furthermore, we observed that training on the FS of the video sequence decreases accuracy, with the model misclassifying more data and reacting more strongly to outliers and noise. In contrast, using CF leads the model to ignore unrelated noise and focus only on facial expressions. Video frames were cropped using multiple techniques, such as Multi-task Cascaded Convolutional Neural Networks (MTCNN) [72] and dLib [73].
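
As an illustration, faces can be cropped with the MTCNN detector from the facenet-pytorch package as sketched below; the margin, output size, and frame path are assumptions, and dLib's detector could be substituted in the same place:

```python
from facenet_pytorch import MTCNN
from PIL import Image

mtcnn = MTCNN(image_size=224, margin=20)         # detector with aligned 224x224 output

frame = Image.open("frame_0001.jpg")             # hypothetical frame path
face = mtcnn(frame)                              # (3, 224, 224) cropped face tensor, or None
```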

In Table 17, we investigate freezing the ViT while fine-tuning ViTCN; we notice that freezing its parameters reduces the obtained results.

Table 18 Studying the effect of using different augmentation ratios (Aug.) with engagement class

Hence, our best setup uses CF without freezing the ViT parameters, together with a sufficient augmentation ratio of 35%.

5 Conclusion and Future Work

In this work, we introduced ViTCN, a hybrid architecture that combines the spatial features learned by the ViT with the temporal features extracted by the TCN from the different video frames and correlates the features extracted for each frame. It shows remarkable success in enhancing the performance of ViT on FER tasks. The performance of the proposed hybrid architecture was evaluated on controlled data sets such as CK+ and MMI, as well as on in-the-wild data sets such as DFEW and AFFWild2. The suggested architecture outperforms other sophisticated solutions when utilizing a single model trained on a single GPU, notably on the MMI, DFEW, and AFFWild2 data sets, where it improves accuracy by more than 8% on MMI, 14% on AFFWild2, and 4% on DFEW. It also produces competitive results on the CK+ data set. Furthermore, given the imbalanced classes in the DAiSEE data set, we examined the effects of augmentation methods and ratios, and we discussed the implications of freezing or not freezing the ViT during the training phase.

We aim to develop a facial expression detection and recognition system that runs on a single GPU, minimizing computational cost while maintaining accuracy. Our goal is to achieve superior performance on the majority of benchmark data sets, or to match the performance of existing methods, while minimizing the computational resources required to process a series of frames.

In future research, we plan to extend the ViTCN architecture to more challenging tasks, such as identifying micro-expressions. Additionally, we intend to improve the ViTCN architecture by incorporating an attention mechanism, the SE block [22], into the TCN. Further computational resources would allow us to employ 16 TCN layers with larger kernel sizes, boosting the module's capacity and potentially yielding superior performance on the CK+ data set.