1 Introduction

The ability to express emotions through facial expressions is crucial for humans to communicate and connect. As technology advances and people increasingly rely on computers for activities such as online learning and shopping, interacting with virtual systems has become an essential part of daily life. For this interaction to feel as natural as interaction between people, computers must be able to understand and respond to human emotions and mental states.

In recent years, there has been growing interest in emotion recognition (ER), the accurate and automated classification of emotions in images or video sequences [1,2,3,4]. The topic has gained attention in psychology, computer vision, and artificial emotional intelligence (AEI).

ER not only helps recognize human emotional states but also enables machines to imitate emotions in human-machine interaction, which has significant practical value. One example is driver safety monitoring, which determines whether the driver is attentive or distracted and predicts the driver's actions based on confusion or frustration levels.

Other applications include human-robot emotional interaction; medical settings, where signs of depression or pain are detected; and the identification of children with learning or cognitive disabilities, where the level of involvement is assessed and linked to the likelihood of autism or attention deficit hyperactivity disorder (ADHD). This also involves recognizing the specific elements or situations that capture or irritate a child's attention, as they may indicate a particular condition [5,6,7]. In e-learning, facial expressions are used to determine which sections of a lecture confuse the majority of students and to gauge student engagement while watching a video. Finally, in Talentino, facial expressions are used to understand candidates' engagement during interviews and to analyze their behavior.

Although considerable progress has been made in emotion recognition, capturing dynamic emotion variations still presents several challenges, and precise emotional analysis remains difficult.

Many systems use facial expressions and features to identify human emotions [8,9,10]. Such systems typically involve several steps: image retrieval, preprocessing, segmentation, feature extraction, facial expression classification, and training [11].

Unconstrained, in-the-wild environments present various difficulties for practical deployment. At the same time, social networks and applications increasingly serve as data sources, and deep learning networks have improved both analysis and recognition.

The majority of current efforts [3, 4] concentrate on using convolutional neural networks (CNNs) to extract a feature representation of each frame, but they do not take into account the correlation between the frames of a video sequence. These approaches seek to identify the most significant expression features in each frame and treat the problem as an image-based task, relying mainly on the spatial features in the images. Other recent works [12, 13] have also considered temporal features to enhance recognition accuracy.

There are two main types of techniques for facial expression recognition (FER): static image-based approaches and dynamic sequence-based approaches. Most static frame-based techniques select peak (apex) frames from videos and then perform facial emotion detection on these frames using local binary patterns [14], Gabor wavelets [15], or neural features. For instance, Zhao et al. [16] propose guiding a peak-piloted deep network with samples at peak expression so that it learns from a set of non-peak expressions. Meng et al. [17] propose using an attention mechanism to combine multiple distinct frames into a unified video-level representation. These techniques are effective at choosing peak frames, but they do not take into account changes over time or the relationship between consecutive facial frames.

In contrast to static frame-based techniques, dynamic sequence-based techniques learn spatiotemporal relations using 3D convolutional neural networks (3DCNN) [12] and long short-term memory (LSTM) [18], which allows them to model long-term dependencies and improve FER performance.

To capture the temporal characteristics of the spatial features and increase recognition accuracy, Kim et al. [19] suggest utilizing an LSTM network. Chen et al. [20] propose a 3D-Inception-ResNet that enhances the learned feature representations by computing attention maps based on spatio-temporal and channel-wise factors. Li et al. [13] recently developed a clip-aware dynamic facial expression recognition approach that extracts clip-level features from each clip-based representation and re-weights them.

Although various methods [12, 18] have been developed for in-the-wild FER, their performance is still far from ideal due to occlusions, varying head poses, poor lighting, and other unanticipated challenges in real-world scenarios. Capturing spatially and temporally discriminative information for in-the-wild FER remains difficult. As transformer-based approaches to computer vision have recently become increasingly popular [21, 22], our grasp of discriminative feature representation and contextual information modeling has deepened significantly.

We can summarize our main contributions as follows:

  • We proposed ViTCN, a hybrid architecture combining the powerful Vision Transformer (ViT) with a Temporal Convolution Network (TCN).

  • We treated FER as a sequence-based task, processing sequences of frames so that expressions are recognized in their actual temporal context.

  • We evaluated the proposed architecture on numerous standard data sets, demonstrating that it outperforms state-of-the-art methods. It obtained the highest results on DFEW [23], AFFWild2 [24], MMI [25], and DAiSEE [26], and achieved comparable results on other data sets such as CK+ [27].

  • We conducted an ablation study to confirm the effectiveness of each element of the proposed model. All experiments were run on a single GPU.

Fig. 1 State-of-the-art ViT with TCN architecture

2 Related Work

The temporal correlations of successive frames in a sequence can be useful for facial expression recognition, even though the majority of earlier models concentrated on static images. In this section, we present advanced deep networks for FER that take into account the spatial and temporal motion patterns in video frames, as well as the learned features obtained from the temporal structure. Spatio-temporal FER networks utilize both textural and temporal information to capture and represent more nuanced expressions, taking a set of frames from a temporal window as a single input without knowing the intensity of the expression beforehand.

To identify emotions in a series of video frames, existing approaches frequently use recurrent neural network (RNN) models and their variants. In several real-world applications, hybrid architectures built around CNN models have displayed outstanding performance. Deep RNNs, in particular LSTMs, have demonstrated impressive results in capturing the temporal relationships of sequential data.

RNNs are neural networks with recurrent loops, which allow them to effectively learn time-based patterns in sequential data: to predict the current output, an RNN can link historical information to the task at hand. However, the vanishing or exploding gradient problem makes RNNs difficult to train. LSTM networks, a type of RNN able to learn long-term dependencies, offer a solution to this issue.

CNNs are a different type of neural network that performs convolution operations. Instead of the usual full matrix multiplication [28], convolution computes a weighted sum of neighboring input pixels using special kernels; a convolutional layer convolves the input into a feature map. By analogy with human vision, convolution models the response of a neuron in the visual cortex to a specific stimulus, and each convolutional neuron processes data only within its receptive field.
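
To make the weighted-sum view of convolution concrete, the following minimal PyTorch sketch slides a hand-crafted 3x3 kernel over a small single-channel image (an illustrative example only, not code from any cited work):

```python
import torch
import torch.nn.functional as F

# Each output value is a weighted sum of the neighboring input pixels
# covered by the 3x3 kernel.
image = torch.rand(1, 1, 8, 8)                    # (batch, channels, height, width)
kernel = torch.tensor([[[[-1., -1., -1.],
                         [-1.,  8., -1.],
                         [-1., -1., -1.]]]])      # (out_channels, in_channels, 3, 3)

feature_map = F.conv2d(image, kernel, padding=1)  # same spatial size as the input
print(feature_map.shape)                          # torch.Size([1, 1, 8, 8])
```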

LSTMs have a chain-like structure of memory cells, each containing four interacting neural network layers designed to interact in a very particular way. Gated recurrent unit (GRU) models are a variant of the LSTM architecture: they use less memory since they employ fewer training parameters and compute more quickly, whereas LSTM models tend to be more accurate on larger data sets. LSTM and GRU networks have been used to attain the most advanced results to date, and their performance is further enhanced through FER training. An LSTM [29] or GRU [30] network is given a sequence of frames to learn variations in facial expressions and to identify an individual's emotional or mental state.

For FER, a number of pre-trained models based on CNN architectures and related variants have been proposed, including autoencoders, CNNs, and confidence networks. They show significant potential for automated feature learning but lack the capacity to capture contextual temporal information. To address this, RNN variants such as CNN-LSTM [31, 32] and CNN-GRU [33] have been integrated with CNNs, enhancing their effectiveness on facial emotion recognition tasks. By reducing the impact of individual differences and the surrounding environment, these networks identify facial expressions more accurately, extracting more detailed information and separating expression information from sequences of facial frames. The LSTM component learns and recognizes the patterns that change over time, while the CNN extracts deep visual information. These networks also highlight the significance of reading micro-expressions.
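
As a rough, illustrative sketch of this family of hybrids (not the specific architectures cited above), the snippet below pairs a ResNet-18 feature extractor with an LSTM; the backbone choice, hidden size, and classification from the last time step are simplifying assumptions:

```python
import torch
import torch.nn as nn
from torchvision import models

class CNNLSTM(nn.Module):
    """A CNN extracts per-frame visual features; an LSTM models how those
    features change across the frame sequence."""

    def __init__(self, num_classes=7, hidden_size=128):
        super().__init__()
        backbone = models.resnet18(weights=None)   # ImageNet weights would be loaded in practice
        backbone.fc = nn.Identity()                # keep the 512-d feature vector
        self.cnn = backbone
        self.lstm = nn.LSTM(512, hidden_size, batch_first=True)
        self.classifier = nn.Linear(hidden_size, num_classes)

    def forward(self, clips):                      # clips: (B, T, 3, 224, 224)
        b, t, c, h, w = clips.shape
        feats = self.cnn(clips.reshape(b * t, c, h, w))   # (B*T, 512)
        feats = feats.reshape(b, t, -1)                   # (B, T, 512)
        out, _ = self.lstm(feats)                         # temporal modeling
        return self.classifier(out[:, -1])                # classify from the last step

logits = CNNLSTM()(torch.randn(2, 16, 3, 224, 224))       # two clips of 16 frames
```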

Bidirectional LSTM (BDLSTM) and Bidirectional GRU (BIGRU), extensions of the conventional LSTM and GRU architectures, respectively, further improve the effectiveness of learning models for FER. A BDLSTM processes the sequence in both directions using two LSTMs, so the network receives more context, which speeds up the learning of an expression sequence. For FER, a CNN is included as a hybrid link so that the model can thoroughly process the variations in facial expressions; the CNN-BDLSTM [34, 35] and CNN-BIGRU [36] hybrid models are two examples.

The success of transformer networks in natural language processing (NLP), where they were created to model long sequence inputs, has attracted the curiosity of many in computer vision. Compared with CNNs, the ViT has produced impressive results by pre-training on large data sets such as ImageNet-1k and then fine-tuning on the target data set. ViT adapts the transformer to vision by dividing the image into patches, and the self-attention mechanism of the transformer captures long-range dependencies between these patches. Chaudhari et al. [21] applied ViT to FER, and Aouayeb et al. [22] also applied the ViT structure to FER by injecting a Squeeze-and-Excitation (SE) block before the multilayer perceptron (MLP) head.

Xue and colleagues [37] introduced TransFER, a transformer-based method in which local CNN blocks locate diverse local patches after a backbone CNN extracts feature maps; a transformer encoder with multi-head self-attention dropping then models the global correlation among these local patches.

A two-stream pyramid cross-fusion transformer network was proposed in [38]; to address scale sensitivity, intra-class discrepancy, and inter-class similarity in FER, it explores the relationship between landmark features and image features.

The primary difficulties in using CNNs for FER relate to computational complexity, image quality, lighting fluctuations, high intra-class variation, and strong inter-class similarity caused by changes in facial appearance. Several studies have therefore built hybrid systems by fusing deep learning approaches to address these problems.

Fig. 2 Expression classes in the CK+ data set

3 Hybrid Model Architecture

Our proposed hybrid model, ViTCN, is composed of two parts: a ViT and a TCN. The ViT extracts the important spatial features from the images, while the TCN encodes the spatiotemporal information extracted from the different video frames, combines the correlated features extracted for each frame, analyzes the relationships between them, and classifies the expression accurately. Both models are explained in the following subsections. Figure 1 shows our proposed architecture.

Fig. 3 Expression classes in the MMI data set

Fig. 4 Samples of the CK+ data set

Fig. 5 Samples of the MMI data set

3.1 Vision Transformer

The ViT [39] architecture is inspired by the basic transformer architecture first used for NLP problems. It most closely resembles the encoder part of the transformer: the image is split into a set of image patches, called visual tokens, which are embedded into encoded feature vectors of a specified dimension, and a positional encoding is added to each patch embedding so that the order of the patches is retained. This architecture was selected because it can identify and capture the significant characteristics of an image in a relatively small feature vector, which is then fed to the TCN model.

In our proposed architecture, we used a pre-trained ViT model available through PyTorch, trained on the ImageNet-1k data set; we replaced its final fully connected layer with one whose output dimension is 32, which is fed to the TCN model afterward.
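
A minimal, illustrative sketch of this step is shown below, assuming the timm library as the source of the vit_base_patch16_224 checkpoint named in Sect. 4.4 (the exact loading code is an assumption, not the original implementation):

```python
import timm
import torch
import torch.nn as nn

# Load the pre-trained backbone and swap its classification head for a
# 32-dimensional linear layer, so each frame is summarised by a 32-d vector.
vit = timm.create_model("vit_base_patch16_224", pretrained=True)
vit.head = nn.Linear(vit.head.in_features, 32)     # 768 -> 32 frame embedding

frames = torch.randn(10, 3, 224, 224)              # one clip of 10 frames
frame_features = vit(frames)                       # shape (10, 32): one vector per frame
```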

3.2 Temporal Convolution Network

Lea et al. first proposed TCNs in 2016 for the purpose of action recognition in videos [40]. The advantage of the TCN architecture comes from its ability to encode spatiotemporal information coming from the different frames of a video, which is then passed to a classifier that maps these features to the corresponding classes. These features can be used to detect actions, emotions, or whatever significant information we need to detect. Moreover, a TCN can process any sequence length, which enables it to consider more in-depth features.

To identify the emotional expression in each frame, the proposed model concentrates on the changing characteristics of the face. This information is then sent to a fully connected layer acting as the classifier, which categorizes the input video according to whether the emotion exists or not (1 or 0).
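
The sketch below is illustrative only: it assumes residual dilated 1D convolutions with the eight layers, kernel size 3, and 10% dropout reported in Sect. 3.3, and a single-logit head as used in the binary DAiSEE setting; it is not the exact implementation.

```python
import torch
import torch.nn as nn

class TemporalBlock(nn.Module):
    """One dilated 1D convolution with a residual connection; stacking blocks
    with growing dilation widens the temporal window the network can see."""

    def __init__(self, channels, kernel_size=3, dilation=1, dropout=0.1):
        super().__init__()
        pad = (kernel_size - 1) * dilation // 2          # keep the sequence length
        self.net = nn.Sequential(
            nn.Conv1d(channels, channels, kernel_size, padding=pad, dilation=dilation),
            nn.ReLU(),
            nn.Dropout(dropout),
        )

    def forward(self, x):                                # x: (B, C, T)
        return x + self.net(x)

class TCNClassifier(nn.Module):
    """Eight dilated blocks over the per-frame 32-d ViT features, followed by
    a linear classifier applied to the last time step."""

    def __init__(self, feat_dim=32, num_layers=8, num_classes=1):
        super().__init__()
        self.blocks = nn.Sequential(
            *[TemporalBlock(feat_dim, kernel_size=3, dilation=2 ** i)
              for i in range(num_layers)]
        )
        self.classifier = nn.Linear(feat_dim, num_classes)

    def forward(self, frame_feats):                      # (B, T, 32) from the ViT
        x = self.blocks(frame_feats.transpose(1, 2))     # Conv1d expects (B, C, T)
        return self.classifier(x[:, :, -1])              # one logit per clip

logits = TCNClassifier()(torch.randn(2, 10, 32))         # two clips of 10 frames
```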

3.3 Training Configuration

We conducted many experiments to select the optimal hyper-parameters for training the proposed architecture, which are discussed in the ablation study section. Our model, ViTCN, is composed of two parts: a ViT and a TCN. We used the pre-trained ViT, replacing its final fully connected layer with a new layer of matching dimension that feeds the TCN module. Our TCN module consists of eight TCN layers with a kernel size of 3. After many trials, we chose the following default values: the ADAM [41] optimizer with a learning rate of 0.001 and a dropout of 10%. We split the data sets into training, validation, and testing sets (60%, 20%, and 20%, respectively). Our batch size is 4 samples, owing to a limitation of our training machine, whose single GPU has only 8 GB of memory. We trained all experiments for 50 epochs, using early stopping in most experiments.
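
The sketch below mirrors this configuration (Adam with lr = 0.001, a 60/20/20 split, batch size 4, up to 50 epochs with early stopping). A random tensor data set and a linear model stand in for the real clips and the ViTCN model so that the snippet is self-contained, and the early-stopping patience of 5 epochs is an assumption:

```python
import torch
import torch.nn as nn
from torch.utils.data import TensorDataset, DataLoader, random_split

dataset = TensorDataset(torch.randn(100, 32), torch.randint(0, 2, (100, 1)).float())
model = nn.Linear(32, 1)                         # placeholder for the full ViTCN model
criterion = nn.BCEWithLogitsLoss()

n = len(dataset)
n_train, n_val = int(0.6 * n), int(0.2 * n)      # 60% / 20% / 20% split
train_set, val_set, test_set = random_split(dataset, [n_train, n_val, n - n_train - n_val])
train_loader = DataLoader(train_set, batch_size=4, shuffle=True)
val_loader = DataLoader(val_set, batch_size=4)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

best_val, patience, bad = float("inf"), 5, 0
for epoch in range(50):                          # up to 50 epochs
    model.train()
    for x, y in train_loader:
        optimizer.zero_grad()
        criterion(model(x), y).backward()
        optimizer.step()

    model.eval()
    with torch.no_grad():                        # early stopping on validation loss
        val_loss = sum(criterion(model(x), y).item() for x, y in val_loader)
    if val_loss < best_val:
        best_val, bad = val_loss, 0
    else:
        bad += 1
        if bad >= patience:
            break
```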

3.4 Loss Calculation

Because most of the available data sets and benchmarks are imbalanced, as discussed in [42], we needed a modified loss function for better back-propagation and learning of the model. We therefore used a modified version of the binary cross-entropy (BCE) loss, weighting the loss for each label (0, 1) as defined in equation (1):

$$\begin{aligned} \mathcal {L}oss = \left( 1 - \frac{\mathcal {N}}{B} \right) \times \mathcal {BCE}\left( real_0, predicted_0\right) + \frac{\mathcal {N}}{B} \times \mathcal {BCE}\left( real_1, predicted_1\right) \end{aligned}$$
(1)

where N represents the count of '0' labels within a batch of size B, real\(_{0}\) is the list of '0' labels from the batch, and predicted\(_{0}\) is the corresponding model output; the same holds for subscript 1 and label '1'.

However, if N equals ’0’, we used equation (2):

$$\begin{aligned} \mathcal {L}oss = 0.35 \times \mathcal {BCE}(real_1, predicted_1) \end{aligned}$$
(2)

The weighted loss function improved the results by scaling the calculated loss according to the label distribution: the dominant label contributes a smaller loss value, which gives more attention to learning the under-represented label and reduces overfitting to the dominant one.
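
A direct implementation of equations (1) and (2) could look as follows (an illustrative sketch; the probability inputs and tensor shapes are assumptions):

```python
import torch
import torch.nn.functional as F

def weighted_bce(predicted, real):
    """Weighted BCE following Eqs. (1)-(2). `predicted` holds probabilities in
    [0, 1] and `real` holds float 0/1 targets; each label group's loss is
    scaled so the dominant label contributes less to back-propagation."""
    zeros, ones = real == 0, real == 1
    n = zeros.sum().item()                       # N: number of '0' labels in the batch
    b = real.numel()                             # B: batch size

    if n == 0:                                   # Eq. (2): no '0' labels in this batch
        return 0.35 * F.binary_cross_entropy(predicted[ones], real[ones])

    loss_0 = F.binary_cross_entropy(predicted[zeros], real[zeros])
    loss_1 = (F.binary_cross_entropy(predicted[ones], real[ones])
              if ones.any() else torch.tensor(0.0))
    return (1 - n / b) * loss_0 + (n / b) * loss_1   # Eq. (1)

loss = weighted_bce(torch.sigmoid(torch.randn(4)), torch.tensor([1., 1., 0., 1.]))
```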

Table 1 DAiSEE Labels File sample
Fig. 6 Original DAiSEE statistics before augmentation on all splits

Fig. 7 DAiSEE data set statistics as a binary classification problem

Fig. 8 Final statistics of the data after splitting into two-second clips and augmentation

4 Experiments

In this section, we summarize the experiments conducted and discuss the obtained results. We first describe the data sets used to train and evaluate the proposed architecture, then the evaluation metrics used to assess the experimental results, which we compare with the latest advancements in the field. Finally, we provide an ablation study of the different contributions, further details of the proposed solution, and additional visual analysis for an in-depth understanding of applying ViT to the FER task.

4.1 Data sets

One of the challenges in ER is finding a data set that suits the application requirements: most available data sets have issues such as non-frontal, interview-based views, single images without sequences, or small size with poor class distribution. Benchmarking our architecture is therefore limited to a small number of applicable data sets; we chose frontal-view data sets containing sequences of images or videos. In the following subsections, we present each data set and its class distribution, along with some statistics and insights.

Fig. 9 Samples of the original DAiSEE data set

Fig. 10 Expression classes in the AFFWild2 data set

Fig. 11 Expression classes in the DFEW data set

Fig. 12 Samples of the original AFFWild2 data set

Fig. 13 Samples of the original DFEW data set

Table 2 Data sets summary

4.1.1 CK+

The CK+ (Extended Cohn-Kanade) data set is an expanded version of the CK data set [27, 43]. It consists of 593 video sequences from 123 individuals, of which only 327 are labeled. The image sequences vary in length from 10 to 60 frames, with frontal and 30-degree views. The videos were captured at 30 frames per second (FPS) with a resolution of either 640x490 or 640x480 pixels, in either 8-bit grayscale or 24-bit color.

There are seven categories of facial expressions: anger, disgust, contempt, fear, happiness, surprise, and sadness. The participants in the CK+ data collection ranged from 18 to 50 years of age, with 69% female and 31% male, and came from different ethnic backgrounds: 81% Euro-American, 13% Afro-American, and 6% other groups. The distribution of expressions in CK+ is somewhat unequal. Most facial expression classification (FEC) algorithms use the CK+ database, which is widely recognized as the most commonly utilized laboratory-controlled FEC database. Figure 2 shows the total number of videos per emotion; it demonstrates that CK+ offers a diverse range of expressions with no strongly dominant class. Figure 4 shows samples from the CK+ data set.

4.1.2 MMI

The MMI data set [25] (named after Maja Pantic, Michel Valstar, and Ioannis Patras) contains videos of the full temporal pattern of facial expressions: each video starts from a neutral facial expression, reaches the peak of the emotion, and returns to a neutral expression. It contains prototypical expressions as well as expressions with a single Facial Action Coding System (FACS) action unit. MMI is composed of about 2900 videos from 75 different subjects. Videos are classified into seven facial expression classes, including surprise, anger, disgust, fear, sadness, and happiness. The distribution of the seven emotions is shown in Figure 3. Figure 5 shows samples from the MMI data set.

4.1.3 DAiSEE

The DAiSEE data set [26] includes 9068 video clips from 112 subjects. Each clip lasts 10 seconds and is recorded at 30 frames per second. Each clip carries four labels, “frustration”, “engagement”, “boredom”, and “confusion”, and for each of these emotions the intensity ranges from 0 to 3 (very low, low, high, and very high), representing the intensity level of the user's emotion. A sample is shown in Table 1.

The DAiSEE data set has skewed statistics in terms of bias and labeling, as shown in Fig. 6, which complicates its use for efficient training. Figure 6 shows a huge imbalance in the labels and in the intensity of emotion within each label. Taking the engagement emotion, for instance, there are 4071 instances labeled “3”, 4477 labeled “2”, 459 labeled “1”, and only 61 labeled “0”, a large bias towards levels 2 and 3 compared with levels 0 and 1. We handled the problem as a binary classification problem (label “1” means that the emotion exists and “0” means its absence), merging the old labels 0 and 1 into the new label 0 and the old labels 2 and 3 into the new label 1, as sketched below.
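
A minimal sketch of this binarisation (the label-file path and column names follow the DAiSEE label files and are assumptions here):

```python
import pandas as pd

# Collapse each 0-3 intensity into 0 (levels 0-1, emotion absent) or
# 1 (levels 2-3, emotion present).
labels = pd.read_csv("Labels/TrainLabels.csv")
for emotion in ["Boredom", "Engagement", "Confusion", "Frustration"]:
    labels[emotion] = (labels[emotion] >= 2).astype(int)
```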

At this point, we would have 8548 clips labeled 1 and only 520 labeled 0. This kind of imbalance prevents the deep learning model from training correctly and makes many evaluation metrics unreliable. As a result, we decided to use intensive data augmentation to balance the data. Figure 7 shows the binary classification statistics; the data still had a significant bias that needed to be fixed. Instead of balancing the levels of all emotions at once, we concentrated on balancing the levels of each emotion separately, because the suggested technique involves a distinct model for each emotion.

To prepare the data for the training phase, we first sampled the videos into frames with a sample rate of 5, converting each 10-second video into 250 frames. We then divided the clip frames into two-second segments instead of 10 seconds, which theoretically enlarges the data set to five times its original size; each folder contains 10 consecutive frames representing 2 seconds of the video. Secondly, for each two-second clip with a label in the labeling file, augmentation techniques were applied to balance the data. For each emotion, the less-represented label (0 or 1) was augmented with different methods using random hyper-parameters. The applied augmentation techniques are adding noise, interpolation, random horizontal flips, random rotations within a small range, random resized cropping, and changing sharpness, saturation, or blurring. After the augmentation step, a reliable, balanced data set is produced with the statistics shown in Fig. 8. Figure 9 shows samples from the DAiSEE data set.
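
A hedged sketch of such an augmentation pipeline using torchvision is given below; the probabilities and parameter ranges are illustrative assumptions, not the exact values used in the experiments, and the pipeline expects tensor images in [0, 1]:

```python
import torch
from torchvision import transforms

add_noise = transforms.Lambda(lambda img: (img + 0.02 * torch.randn_like(img)).clamp(0, 1))

augment = transforms.Compose([
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.RandomRotation(degrees=10),              # small range; later dropped in the ablation study
    transforms.RandomResizedCrop(224, scale=(0.8, 1.0)),
    transforms.RandomAdjustSharpness(sharpness_factor=2, p=0.3),
    transforms.ColorJitter(saturation=0.2),
    transforms.GaussianBlur(kernel_size=3),
    add_noise,
])
```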

4.1.4 AFFWild2

The AFFWild2 data set [24] consists of 546 videos from 554 subjects, of which 326 are male and 228 are female. The videos show a large variety of nationalities, ages, and environments. In total there are around 2.8 million frames, annotated in several ways: around 546 videos (2.6 million frames) are labeled with the basic expressions (happiness, surprise, anger, disgust, fear, sadness, and the neutral state), and 541 videos (2.6 million frames) are categorized based on FACS. The videos labeled with basic expressions are used in this research. The distribution of the seven emotions is shown in Figure 10. Figure 12 shows samples from the AFFWild2 data set.

4.1.5 DFEW

The Dynamic Facial Expression in the Wild (DFEW) data set [23] contains 16372 videos from 1500 movies, and every video presents challenging interferences such as varying illumination, occlusions, and crowding. Twelve professional annotators labeled these videos. Videos are classified into seven facial expression labels: surprise, neutral, happy, angry, disgust, fear, and sad. The distribution of the emotions is shown in Figure 11. Figure 13 shows samples from the DFEW data set.

Table 2 provides a summary of the data sets utilized in training and testing.

Table 3 Comparison of expression classification results on the CK+ data set (image-based)

4.2 Frames Transform

In deep learning frameworks such as PyTorch, transformations such as resizing, normalizing, and converting images to tensors are typically applied to single images. Handling a sequence of frames from a video is more complicated: we built our own class for transforming a sequence of frames, which applies the same set of transformations to each image at the defined sample rate and finally collects and reshapes them into one tensor used as input to the ViT model.
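
A minimal sketch of such a sequence transform is shown below; the resize target and the normalization statistics are assumptions:

```python
import torch
from torchvision import transforms

class SequenceTransform:
    """Apply the same per-frame pipeline to every sampled frame of a clip and
    stack the results into one (T, 3, 224, 224) tensor for the ViT backbone."""

    def __init__(self, sample_rate=5):
        self.sample_rate = sample_rate
        self.frame_tf = transforms.Compose([
            transforms.Resize((224, 224)),
            transforms.ToTensor(),
            transforms.Normalize(mean=[0.5, 0.5, 0.5], std=[0.5, 0.5, 0.5]),
        ])

    def __call__(self, frames):                  # frames: a list of PIL images
        sampled = frames[::self.sample_rate]
        return torch.stack([self.frame_tf(f) for f in sampled])
```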

Table 4 Comparison of expression classification on the CK+ data set (sequence-based)

4.3 Evaluation Criteria

The proposed model is assessed primarily using two criteria, the F1-score and accuracy. In some experiments, the weighted average recall (WAR) and the unweighted average recall (UAR) are also used for comparison with previous work.

The F1-score is the harmonic mean of recall (the ability of the classifier to find all the positive samples) and precision (the ability of the classifier not to label a negative sample as positive). The F1-score reaches its best value at 1 and its worst at 0, and is defined in equation (3):

$$\begin{aligned} \mathcal {F}1 = \frac{2 \times \mathcal {P} \times \mathcal {R} }{\mathcal {P} + \mathcal {R}} \end{aligned}$$
(3)

The F1-score for emotions is calculated by considering the prediction made for each frame, where an emotion category is identified in every frame.

Accuracy (abbreviated as Acc.) is a measure of how well the test samples are predicted, expressed as the proportion of correctly predicted samples. The highest possible accuracy score is 1, indicating perfect predictions, while the lowest score is 0, indicating no correct predictions. The formula for calculating accuracy is as follows:

$$\begin{aligned} \mathcal {A}cc = \frac{\textit{Number of Correctly Predicted Samples}}{\textit{Total Number of samples}} \end{aligned}$$
(4)

UAR is defined as the sum of the per-class accuracies divided by the total number of classes, regardless of the number of samples per class, which makes it a better metric to optimize when the class distribution is imbalanced.

WAR, also called overall accuracy, is the ratio of accurately classified samples to the total number of samples, which depends on the number of samples in each category.
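
For reference, these metrics can be computed as in the hedged sketch below (the weighted averaging mode for the F1-score is an assumption, as the averaging mode is not specified above):

```python
from sklearn.metrics import accuracy_score, f1_score, recall_score

def evaluate(y_true, y_pred):
    """Accuracy, F1, UAR (mean per-class recall) and WAR (overall accuracy)."""
    return {
        "Acc": accuracy_score(y_true, y_pred),
        "F1": f1_score(y_true, y_pred, average="weighted"),
        "UAR": recall_score(y_true, y_pred, average="macro"),
        "WAR": accuracy_score(y_true, y_pred),   # overall accuracy
    }

print(evaluate([0, 1, 2, 2, 1], [0, 1, 2, 1, 1]))
```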

4.4 Implementation Details: Training and Testing Settings

Our model is trained using the PyTorch platform on a single NVIDIA GTX 1080 Ti GPU with 8 GB of memory. By default, we train the model with a batch size of 4; for DFEW and CK+, we use a batch size of 2 to fit into our limited GPU memory. We use the ADAM optimizer to optimize the proposed model, starting with a learning rate of 0.001.

Our backbone is the pre-trained ViT vit_base_patch16_224. During training, we split each video clip into 10 frames; some data sets, however, have fewer frames per clip, as discussed in the corresponding experiments.

4.5 Comparison with state-of-the-art

In the following subsections, the outcomes of the experiments conducted to evaluate the proposed architecture are presented and compared with previous work.

4.5.1 CK+ experiments

CK+ experiments are usually conducted as image-based experiments; hence, we compare the obtained results in two different modes: image-based and sequence-based. We first benchmark the image-based mode. Table 3 compares the results achieved on single-image processing. To run a single image through the TCN network, the network was fed two vector representations: the first is the feature vector from the ViT, and the second is the original image.

The proposed model achieved acceptable results compared with the others, while ViT+SE [22] obtained almost 99.8% accuracy. Although [22] also uses ViT as its backbone, it still holds the best result to date because of the SE block combined with the transformer, in which two fully connected layers are followed by a single pointwise multiplication. We think that integrating this mechanism with the TCN might yield better results across all benchmarks, as the SE block can augment the TCN architecture with a channel-wise attention module; however, we consider this point future work.

On the other side, to allow a fair comparison, the main proposed contribution is also tested in sequence-based mode. Training on sequences of frames in CK+ was hard because peak expressions are absent from the first few frames; hence, we chose to train on the peak frames starting from the 7\(^{\hbox {th}}\) frame.

A comparison of the results achieved on sequences of images is shown in Table 4. The proposed architecture performs better than IT-RBM [48] and STM-Explet [49] by 7.94% and 0.91%, respectively. DCPN [53] obtained the highest performance due to its high training capacity. Its architecture consists of three cascaded Inception deep neural networks: the first network is pre-trained on the ImageNet data set together with augmented images from the CK+ data set and is subsequently fine-tuned on CK+, giving the network prior information about the data set. The first network predicts the emotion and transfers this information to the second network, which chooses two frames from the sequence: the frame with the highest prediction score is considered the peak frame and the one with the lowest score a weak frame. This produces a significant contrast between the peak and non-peak frames, and the two selected frames are then used by the third network to define the emotion for the whole sequence. Due to computation limits, we could not perform similar ImageNet fine-tuning. Although DCPN [53] achieved the highest accuracy using a complex architecture, the proposed architecture accomplished comparable results with a simple one.

Fig. 14 The confusion matrix for the CK+ data set

Table 5 Comparison of expression classification on the MMI data set

Fig. 15 The confusion matrix for the MMI data set

Figure 14 shows the confusion matrix for the CK+ data set. It is shown that most of the classes are relatively easy to distinguish except for contempt vs. sadness and fear vs. contempt.

4.5.2 MMI experiments

Table 5 compares the suggested model with other advanced video-based methods on the MMI data set. The proposed model demonstrates superior performance, achieving an accuracy of 99.2% and surpassing the previous top-performing model MDSTFN [51] by 7.74%, and STM-Explet [49], IDFERM [54], IT-RBM [48], and GCN [55] by 24.08%, 18.07%, 16.99%, and 13.31%, respectively.

The confusion matrix is shown in Fig. 15; it shows that the proposed model perfectly distinguishes all classes.

4.5.3 AFFWild2 experiments

We report accuracy and F1-score results in Table 6. The accuracy comparison reveals that the proposed model outperforms the baseline [56] and the most sophisticated approaches, TSAV [57] and NeteaseFuxi [58], by 34.5%, 21.6%, and 14.41%, respectively; in terms of F1-score, it outperforms them by 0.7, 0.452, and 0.087. These outcomes demonstrate the efficiency of the suggested model in classifying facial expressions.

Table 6 Comparison of Seven Basic Expression classification on AFFWild2 test set

Figure 16 shows the confusion matrix for the AFFWild2 data set. It shows that “Neutral”, “Anger”, “Happiness”, and “Sadness” are relatively easy to distinguish, whereas “Disgust”, “Surprise”, and “Fear” are mostly confused with other emotions, as these classes have few examples.

Fig. 16 The confusion matrix for the AFFWild2 data set

Fig. 17 The confusion matrix for the DFEW data set

Table 7 Comparison of Expression classification on DFEW data set
Table 8 Comparison of expression classification on the DAiSEE data set (multi-class); FS refers to full screen
Table 9 Engagement class classification

4.5.4 DFEW experiments

Experiments on the in-the-wild DFEW data set demonstrate that the suggested model offers a successful approach to identifying and understanding changing facial expressions.

We evaluate our suggested model by comparing it with the current results on DFEW with respect to accuracy, UAR, and WAR. The comparison is presented in Table 7. The proposed model obtains the best results on all three reported metrics, outperforming all results to date.

The previous leading method, STT [62], achieved a UAR of 54.85% and a WAR of 66.65%; the proposed model surpasses STT [62] by 1.42% in UAR and 4.35% in WAR. Furthermore, the proposed model outperforms 3D ResNet-18 [23] by 11.27% and 16.02% in UAR and WAR, respectively. The proposed model also achieves superior accuracy compared with alternative approaches.

Figure 17 shows the confusion matrix for the DFEW data set. It shows that our architecture can distinguish the “Anger”, “Happy”, and “Disgust” classes, whereas the “Fear” and “Sad” classes are somewhat confused with the “Happy” class. The “Neutral” and “Surprise” classes are still not up to par, as they have a limited number of examples, and could be improved in future research.

4.5.5 DAiSEE experiments

The DAiSEE data set is challenging: as explained earlier with Table 1, each video can carry more than one label. First, we trained our network on a multi-class classification problem over the whole data set; Table 8 shows the results obtained with multi-class classification. Inspired by [65], we also tried different approaches instead of ViTCN, such as combining a ResNet backbone with TCN layers and a ResNet backbone with LSTM layers.

Table 10 Confusion class classification
Table 11 Frustration class classification
Table 12 Boredom class classification

Many research papers, such as [65, 67,68,69,70,71], have investigated working with the engagement class. This encouraged us to fine-tune our architecture for binary classification of each class; using a separate binary classifier per class results in four different models. In this section, we explain the experiments for each class: Tables 9, 10, 11, and 12 report the obtained results for the engagement, confusion, frustration, and boredom expression classes, respectively.

Table 9 compares the obtained results with those reported in [65, 67, 68], and [71]. The proposed architecture, ViTCN, achieves higher results; however, we noticed that it overfits the dominant class. This overfitting problem is discussed in the ablation study.

There are no previously reported results for the other three classes in the literature as a binary classification problem; hence, we show only our obtained results for those classes.

Table 10 shows the classification results for the confusion emotion. The proposed architecture achieves promising results using the default configuration of cropped faces (CF) in each video sequence, outperforming the other experiments we performed on the full screen (FS) of the video sequence.

Table 11 shows the classification results for the frustration emotion. The proposed architecture achieves promising results using the default CF configuration and an augmentation ratio of 35% in each video sequence, again outperforming the experiments performed on the FS of the video sequence.

Despite the imbalance in the boredom class, the classification presented in Table 12 yields satisfactory results when employing the default CF configuration in every video sequence, along with an augmentation ratio of 35% to maintain data set balance.

4.5.6 Discussion

By successfully integrating ViT with TCN in our ViTCN architecture, we unlock substantial performance gains for ViT in FER tasks. This hybrid approach outperforms existing advanced models while being trained under restrictive conditions (single GPU), demonstrating its potential for real-world applications. The proposed architecture achieved accuracy improvements exceeding 8% in MMI, 14% in AFFWild2, and 4% in DFEW. We delve deeper into the influence of training-phase choices like ViT freezing, input type (full frame vs. cropped faces), and loss function (normal vs. weighted) on the model’s effectiveness in the following section.

4.6 Ablation Study

In this section, we perform ablation experiments to assess the influence of each element of our model, specifically the ViT architecture, TCN block, and data augmentation techniques.

First, the CK+, MMI, and DFEW data sets are utilized to conduct the experiments. Second, we assess our hyper-parameter tuning on the DAiSEE data set.

4.6.1 ViT Study

First, we assess the performance of the ViT architecture, the added TCN block, and the use of the Hugging Face ViT_base Patch 16 model pre-trained on ImageNet-21k. Tables 13 and 14 show the accuracy and the F1-score, respectively. Adding the TCN block outperforms the basic ViT architecture on the MMI and DFEW data sets. On the CK+ data set, although the accuracy decreases slightly by 0.1%, the F1-score improves by 0.85%.

On the MMI data set, the accuracy improves by 0.9% and the F1-score by 1.4%. On the DFEW data set, adding the TCN block improves the accuracy by 20.4% and the F1-score by 13.9%.

Table 13 Ablation study in terms of accuracy with and without TCN block
Table 14 Ablation study in terms of F1-Score with and without TCN block
Table 15 Engagement class on full-screen images across different augmentation (Aug.) ratios
Table 16 Engagement class on cropped faces across different augmentation (Aug.) ratios
Table 17 Studying the effect of freezing versus not freezing the ViT for the engagement class across different proposed models; Aug. stands for the applied augmentation ratio

4.6.2 Data Processing Study on DAiSEE data set

Tables 15 and 16 report different experiments; we first discuss the augmentation ratio over the data set. We generate additional frames from the original DAiSEE frames using the augmentation techniques discussed earlier, rebalancing the data by enlarging the weak class; a 100% augmentation ratio indicates that the augmented data set is fully balanced. We noticed that rotating the images usually leads to overfitting on rotated images, since most rotated images are generated by our augmentation procedure; hence, we eliminated the rotation step from the augmentation techniques used in all the conducted experiments.

Increasing the augmentation level on full-screen images (Table 15) decreases the accuracy and F1-score. A balancing ratio of 35% proves sufficient, as 15% is rather low and more than 50% leads to overfitting; the model also becomes more sensitive to the augmented samples. Furthermore, we observed that training on the FS of the video sequence decreases accuracy, with the model misclassifying more data and reacting more strongly to outliers and noise. In contrast, using CF leads the model to ignore unrelated noise and focus only on facial expressions. Video frames were cropped using multiple techniques, such as Multi-task Cascaded Convolutional Neural Networks (MTCNN) [72] and dLib [73].
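
As an illustration, faces can be cropped with the MTCNN detector from the facenet-pytorch package as sketched below; the margin, output size, and frame path are assumptions, and dLib's detector could be substituted in the same place:

```python
from facenet_pytorch import MTCNN
from PIL import Image

mtcnn = MTCNN(image_size=224, margin=20)         # detector with aligned 224x224 output

frame = Image.open("frame_0001.jpg")             # hypothetical frame path
face = mtcnn(frame)                              # (3, 224, 224) cropped face tensor, or None
```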

In Table 17, we investigate freezing the ViT while fine-tuning ViTCN; we notice that freezing its parameters reduces the obtained results.

Table 18 Studying the effect of using different augmentation ratios (Aug.) with engagement class

Hence, our best setup uses CF without freezing the ViT parameters, together with a sufficient augmentation ratio of 35%.

5 Conclusion and Future Work

In this work, we introduced ViTCN, a hybrid architecture that combines the spatial features learned by the ViT with the temporal features extracted by the TCN from the different video frames and correlates the features extracted for each frame. It shows remarkable success in enhancing the performance of ViT on FER tasks. The performance of the proposed hybrid architecture was evaluated on controlled data sets such as CK+ and MMI, as well as on in-the-wild data sets such as DFEW and AFFWild2. The suggested architecture outperforms other sophisticated solutions when utilizing a single model trained on a single GPU, notably on the MMI, DFEW, and AFFWild2 data sets, where it improves accuracy by more than 8% on MMI, 14% on AFFWild2, and 4% on DFEW. It also produces competitive results on the CK+ data set. Furthermore, given the imbalanced classes in the DAiSEE data set, we examined the effects of augmentation methods and ratios, and we discussed the implications of freezing or not freezing the ViT during the training phase.

We aim to develop a facial expression detection and recognition system that runs on a single GPU, minimizing computational cost while maintaining accuracy. Our goal is to achieve superior performance on the majority of benchmark data sets, or to match the performance of existing methods, while minimizing the computational resources required to process a series of frames.

In future research, we plan to extend the ViTCN architecture to more challenging tasks, such as identifying micro-expressions. Additionally, we intend to improve the ViTCN architecture by incorporating an attention mechanism, the SE block [22], into the TCN. Further computational resources would allow us to employ 16 TCN layers with larger kernel sizes, boosting the module's capacity and potentially yielding superior performance on the CK+ data set.