
1 Introduction

Recognizing emotions is a fundamental aspect of human communication, and the ability to accurately detect emotional states has a significant impact on a range of applications, from healthcare to human-computer interaction. Emotions are often reflected in physiological signals [26], facial expressions [8], and speech [24]. Recently, the use of physiological signals for affective computing has gained considerable attention due to its potential to provide objective measures of emotional states in real time [21].

There has been growing interest in developing machine learning algorithms for affective computing using physiological signals [2, 4, 5, 22, 28]. These algorithms can be used to classify emotional states, predict changes in emotional states over time, or identify the specific features of physiological signals that are most informative for detecting emotional states. There has also been interest in developing wearable sensors that can capture physiological signals in real-world settings, such as in the workplace or in social situations [20].

The use of end-to-end deep learning architectures for physiological signals has the potential to simplify the development and deployment of an emotion recognition system [21]. By eliminating the need for preprocessing steps, these architectures can reduce the complexity and time required for system development, as well as improve the scalability and accuracy of the system. In addition, end-to-end architectures can enable the development of systems that can process multiple physiological signals simultaneously, such as heart rate, respiration, and electrodermal activity, providing more comprehensive and accurate measures of emotional states.

Despite the potential benefits of end-to-end deep learning architectures for affective computing, there are still challenges that need to be addressed. One challenge is to develop architectures that can handle noisy and non-stationary physiological signals, which can be affected by movement artifacts, signal drift, and other sources of noise. Another challenge is to ensure that the learned features are interpretable and meaningful, which can help improve the transparency and explainability of the system.

In this paper, we propose an end-to-end multi-scale architecture for continuous emotion regression with physiological signals. We evaluate the performance of the proposed architecture on the CASE dataset [25], which contains data collected from experiments carried out in a laboratory setting.

2 Related Works

2.1 Continuous Emotion Recognition from Multimodal Physiological Signals

The utilization of physiological signals has been widely acknowledged as one of the most reliable data forms for affective science and affective computing. Although individuals are capable of manipulating their physical signals such as facial expressions or speech, consciously controlling their internal state is a daunting task. Therefore, analysis of signals from the human body represents a reliable and robust approach to fully recognizing and comprehending an individual’s emotional state [1, 26]. This reliability factor is especially crucial in medical applications, such as mental health treatment or mental illness diagnosis.

Recognizing affect from physiological data remains a significant challenge, not only during the data acquisition process but also in terms of emotion assessment. Laboratory-based research dominates the field of affective science due to the control it affords over experimental variables. Researchers can carefully select and prepare emotional stimuli, and employ various sensor devices to trace and record a subject’s emotional state with minimal unexpected events or interference [21]. However, most of these studies rely on discrete, indirect methods, such as quizzes, surveys, or discrete emotion categories for emotion assessment, which overlook the time-varying nature of human emotional experience. Sharma et al. [25] introduced the joystick-based emotion reporting interface (JERI) to overcome this limitation in emotion assessment. JERI enables the simultaneous annotation of valence and arousal, allowing for moment-to-moment emotion assessment. The Continuously Annotated Signals of Emotion (CASE) dataset, acquired using JERI, provides additional information for researchers to identify the timing of emotional triggers.

In addition, a single physiological signal is often considered insufficient to precisely reflect human emotional changes. Therefore, much recent research has focused on detecting human emotion through multimodal physiological signals. Many types of physiological signals are used in these studies. While some studies record heart-related signals such as the electrocardiogram (ECG) [7, 17, 18] or blood volume pulse (BVP) [15, 33], others use the electrical activity of the brain (electroencephalogram/EEG) [13, 14] or of the muscles (electromyogram/EMG) [19, 23]. Furthermore, some employ skin temperature (SKT) [19], electrodermal activity (EDA), which reflects skin sweat gland activity [13, 23], or respiration (RSP), i.e., the depth and rate of breathing [23].

2.2 Transformer-Based Methods for Multimodal Emotion Recognition from Physiological Signals

Similar to other emotion recognition problems that involve physical signals, affective computing in physiological data has witnessed extensive adoption of machine learning techniques, particularly deep learning methodologies. Dominguez et al. [4] employed various conventional machine learning techniques, including Gaussian naive Bayes, k-Nearest Neighbors, and support vector machines, for valence-arousal estimation. However, these approaches are heavily dependent on the quality of handcrafted feature selection and feature extraction processes. To overcome this challenge, other studies [5, 22, 28] proposed the use of Deep Learning techniques for an end-to-end approach, where the model learns to extract features automatically without the need for pre-designed feature descriptors.

With the advancement of deep learning, various state-of-the-art techniques have been used to analyze physiological signals. Santamaria et al. [22] used convolutional neural networks (CNNs) with 1D convolution layers for emotion detection, while Harper et al. [5] combined CNNs with recurrent neural networks (RNNs) for emotion recognition from ECG signals. Since their introduction in 2017, Transformers [27] have emerged as preferred models in the field of deep learning. Their strong performance in natural language processing, a type of data that shares some characteristics with time-series data, has demonstrated the potential of Transformers when applied to time-series signals. As a result, recent research in the time-series domain has adopted Transformers as the core module of the model architecture [9, 10, 30]. For physiological signals, some studies have proposed using Transformers and their variants to detect emotions [28, 29, 31, 32]. Vazquez et al. [28, 29] focused on applying pre-trained Transformers to multimodal signal processing, but this remains a fairly basic application of Transformer modules. Wu et al. [31] and Yang et al. [32] proposed more advanced Transformer-based techniques, namely self-supervised and convolution-augmented Transformers, for single- and multimodal signal processing. Although these studies have demonstrated the effectiveness of Transformers for physiological signals, they typically feed the model with fixed, original-size signals, which may lead to the loss of global feature information. To address this issue, we propose a new multi-scale Transformer-based architecture for multimodal emotion recognition.

3 Proposed Approach

Fig. 1. An overview of our proposed architecture.

3.1 Problem Definition

The emotion recognition problem for multimodal physiological signals takes as input 8 physiological signals, namely ECG, BVP, EMG_CORU, EMG_TRAP, EMG_ZYGO, GSR, RSP and SKT, extracted from human subjects during emotion-inducing stimuli. These are denoted as 8 sequences of length L. In the affective computing field, the objective of the emotion recognition problem varies depending on the chosen emotional model. In the scope of this study, following the SAM (Self-Assessment Manikin) [3] model used by the CASE dataset, the objective is to estimate valence-arousal (V-A) values. The V-A score consists of two continuous floating-point numbers ranging from 0.5 to 9.5: a value of 0.5 denotes the most negative valence or the lowest arousal, 5 indicates neutral valence or arousal, and 9.5 indicates the most positive valence or the highest arousal.
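For concreteness, a minimal sketch of the tensor shapes involved is given below; the variable names are illustrative and not taken from our implementation.

```python
import tensorflow as tf

# Illustrative shapes only: 8 physiological channels of length L per window,
# regressed to 2 continuous targets (valence, arousal) in [0.5, 9.5].
BATCH, L, C = 32, 2048, 8                              # C: ECG, BVP, 3x EMG, GSR, RSP, SKT
signals = tf.random.normal((BATCH, L, C))              # model input
va_targets = tf.random.uniform((BATCH, 2), 0.5, 9.5)   # regression targets
```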

3.2 Methodology

We construct a new multi-scale architecture for the estimation of valence-arousal from 8 physiological signals. Our architecture consists of two core modules: a feature encoding module and a multi-scale fusion module. Raw physiological data are fed into the feature encoding module, which is designed to extract vital information at varying global and local scales. Subsequently, the multi-scale features are fused and used to estimate the valence-arousal scores. The overall architecture is shown in Fig. 1.

Feature Encoding. To enable the feature encoding module to extract global features for the estimator while suppressing noise and interference in the input, we employ 1-dimensional average pooling to scale the 8 input signals to three different lengths: L, L/2, and L/4. This multi-scale view helps the model extract useful information and attenuate unwanted noise and interference.
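A minimal TensorFlow sketch of this multi-scale downsampling, assuming channels-last input of shape (batch, L, 8); the function and variable names are illustrative.

```python
import tensorflow as tf

def multi_scale_inputs(x):
    """Downsample a (batch, L, 8) signal tensor to scales L, L/2 and L/4
    with 1D average pooling, as described above."""
    x_half = tf.keras.layers.AveragePooling1D(pool_size=2, strides=2)(x)
    x_quarter = tf.keras.layers.AveragePooling1D(pool_size=2, strides=2)(x_half)
    return [x, x_half, x_quarter]
```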

Then, we apply two types of feature encoders in parallel: the Gaussian transform [16] and the transformer encoder [27]. The transformer encoder block uses multi-headed self-attention as its core mechanism. Given an input sequential signal \(S \in R^{L \times C}\), where L represents the length of the signal sequence and \(C=8\) is the number of channels (signal modalities), we apply a positional encoding and embedding layer to convert the raw input into a sequence of tokens. Subsequently, the tokens are fed into transformer layers consisting of multi-headed self-attention (MSA) [27], layer normalization (LN), and multilayer perceptron (MLP) blocks. Each element is formalized in the following equations:

$$\begin{aligned} y^i = MSA(LN(x^i)) + x^i \end{aligned}$$
(1)
$$\begin{aligned} x^{i+1} = MLP(LN(y^i)) + y^i \end{aligned}$$
(2)

Here, i indexes the transformer layer, and \(x^i\) denotes the corresponding sequence of feature tokens. It is worth noting that since the multi-headed self-attention mechanism allows multiple sequences to be processed in parallel, all 8 signal channels are fed into the Transformer Encoder at once.
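A sketch of one such pre-norm transformer layer in TensorFlow, following Eqs. (1)-(2); the layer width and head count mirror Sect. 4.2, but the embedding and positional encoding are omitted and the MLP expansion ratio and GELU activation are assumptions, not details from our implementation.

```python
import tensorflow as tf

class TransformerLayer(tf.keras.layers.Layer):
    """One pre-norm transformer layer: y = MSA(LN(x)) + x; x' = MLP(LN(y)) + y."""

    def __init__(self, dim=1024, num_heads=4, mlp_ratio=4, **kwargs):
        super().__init__(**kwargs)
        self.ln1 = tf.keras.layers.LayerNormalization()
        self.msa = tf.keras.layers.MultiHeadAttention(
            num_heads=num_heads, key_dim=dim // num_heads)
        self.ln2 = tf.keras.layers.LayerNormalization()
        self.mlp = tf.keras.Sequential([
            tf.keras.layers.Dense(dim * mlp_ratio, activation="gelu"),
            tf.keras.layers.Dense(dim),
        ])

    def call(self, x):
        h = self.ln1(x)
        y = self.msa(h, h) + x               # Eq. (1)
        return self.mlp(self.ln2(y)) + y     # Eq. (2)
```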

The Gaussian transform [16] is traditionally employed to kernelize linear models by nonlinearly transforming input features via a single layer and subsequently training a linear model on top of the transformed features. However, in the context of deep learning architectures, random features can also be leveraged, given their ability to perform dimensionality reduction or approximate certain functions via random projections. As a non-parametric technique, this transformation maps input data to a more compressed representation that excludes noise information while still enabling computationally efficient processing. Such a technique may serve as a valuable supplement to Transformer Encoder architectures, compensating for any missing information.
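One common way to realize such a transform is with random Fourier features that approximate an RBF kernel; the following sketch illustrates that idea under this assumption and is not claimed to match the reference implementation of [16].

```python
import numpy as np

def gaussian_random_features(x, out_dim=256, gamma=1.0, seed=0):
    """Project (n_samples, n_features) inputs into a random feature space
    approximating the RBF kernel exp(-gamma * ||a - b||^2)."""
    rng = np.random.default_rng(seed)
    w = rng.normal(scale=np.sqrt(2.0 * gamma), size=(x.shape[-1], out_dim))
    b = rng.uniform(0.0, 2.0 * np.pi, size=out_dim)
    return np.sqrt(2.0 / out_dim) * np.cos(x @ w + b)
```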

Multi-scale Fusion. The features extracted by the feature encoding module at different scales are fused using a concatenation operation. The concatenated features are then fed through a series of fully connected (FC) layers to estimate the two valence and arousal scores. The Rectified Linear Unit (ReLU) activation function is chosen for its ability to introduce non-linearity into the model, thus contributing to the accuracy of the score estimation. The effectiveness of this approach lies in its ability to efficiently estimate the desired scores while maintaining a simple and straightforward architecture.
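A sketch of this fusion head, assuming each per-scale feature has already been reduced to a fixed-length vector; the hidden layer widths are illustrative assumptions.

```python
import tensorflow as tf

def fusion_head(multi_scale_features, hidden_units=(256, 64)):
    """Concatenate per-scale feature vectors and regress valence and arousal."""
    x = tf.keras.layers.Concatenate(axis=-1)(multi_scale_features)
    for units in hidden_units:                        # illustrative widths
        x = tf.keras.layers.Dense(units, activation="relu")(x)
    return tf.keras.layers.Dense(2)(x)                # [valence, arousal]
```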

4 Experiments and Results

4.1 Dataset

The CASE dataset [25] contains data from several physiological sensors together with continuous annotations of emotion. These data were acquired from 30 subjects while they watched several video stimuli and simultaneously reported their emotional experience using JERI. The devices used include sensors for electrocardiography (ECG), blood volume pulse (BVP), galvanic skin response (GSR), respiration (RSP), skin temperature (SKT) and electromyography (EMG). These sensors return 8 types of physiological signals: ECG, BVP, EMG_CORU, EMG_TRAP, EMG_ZYGO, GSR, RSP and SKT. The emotional stimuli consisted of 11 videos, ranging in duration from 120 to 197 s. The annotation and physiological data were collected at sampling rates of 20 Hz and 1000 Hz, respectively. The raw range of the valence-arousal scores is [−26225, 26225].

We evaluate our approach with four different scenarios:

  • Across-time scenario: Each sample represents a single person watching a single video, and the training and test sets are divided based on time. Specifically, the earlier parts of the video are used for training, while the later parts are reserved for testing (a sketch of this split follows the list).

  • Across-subject scenario: Participants are randomly assigned to groups, and all samples from a given group belong to either the train or test set depending on the fold.

  • Across-elicitor scenario: Each subject has two samples (videos) per quadrant in the arousal-valence space. For each fold, both samples related to a given quadrant are excluded, resulting in four folds, with one quadrant excluded in each fold.

  • Across-version scenario: Each subject has two samples per quadrant in the arousal-valence space. In this scenario, one sample is used to train the model, and the other sample is used for testing, resulting in two folds.
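A minimal sketch of the across-time split referenced in the first scenario; the train fraction is an illustrative assumption, not the value used in our experiments.

```python
def across_time_split(sample, train_fraction=0.8):
    """Split one (subject, video) recording along time: the earlier part is
    used for training and the later part for testing.

    sample: array-like of shape (T, channels).
    """
    cut = int(len(sample) * train_fraction)   # illustrative split point
    return sample[:cut], sample[cut:]
```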

4.2 Experiments Setup

Our networks were implemented using the TensorFlow framework. We trained our models using the AdamW optimizer [12] with a learning rate of 0.001 and a cosine annealing warm restarts scheduler [11] over 10 epochs. The MSE loss function was used to optimize the network, and evaluation was performed with RMSE. The sequence length was set to 2048. We used 4 transformer layers in the transformer encoder, with each attention module containing 4 heads. The hidden dimension of the transformer was set to 1024. All training and testing were conducted on an RTX 3090 GPU.
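A sketch of this training configuration in TensorFlow (the built-in AdamW requires TF 2.11 or later); `model`, `train_ds`, and `val_ds` are assumed to be defined elsewhere, and the scheduler arguments are illustrative assumptions.

```python
import tensorflow as tf

steps_per_epoch = 1000                                   # illustrative value
schedule = tf.keras.optimizers.schedules.CosineDecayRestarts(
    initial_learning_rate=1e-3, first_decay_steps=steps_per_epoch)
optimizer = tf.keras.optimizers.AdamW(learning_rate=schedule)

model.compile(optimizer=optimizer,
              loss="mse",                                          # training objective
              metrics=[tf.keras.metrics.RootMeanSquaredError()])   # evaluation metric
model.fit(train_ds, validation_data=val_ds, epochs=10)
```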

Table 1. RMSE on the test data with different scenarios.

4.3 Results

Table 1 presents the results of our model on the test set under the different evaluation scenarios. Overall, our final RMSE for the valence and arousal estimation task is 1.447. Our model shows promising performance in comparison with the approach presented by Hinduja et al. [6], achieving RMSE scores that are slightly lower, by 0.077 for arousal and 0.153 for valence.

In detail, our model achieved its best performance in the across-subject scenario, with an arousal RMSE of 1.336 and a valence RMSE of 1.345. These results suggest that our model generalizes effectively to new subjects and accurately captures emotional changes over the entire video-viewing process. Meanwhile, the relatively low performance in the across-elicitor scenario, with RMSE scores of 1.509 for arousal and 1.514 for valence, suggests that our model struggles to infer emotional states that were not seen during training. In the across-time scenario, our results demonstrate significantly improved performance compared to that of Hinduja et al. [6]: our model improves both scores, by a margin of 0.317 for arousal and 0.121 for valence. This substantial improvement opens up promising avenues for our future research.

5 Conclusion

This paper proposes a new multi-scale architecture for multimodal emotion recognition from physiological signals. Our approach encodes the signals with transformer encoders at multiple scales to capture both global and local features and obtain more informative representations. Our method achieves promising results on the CASE dataset.