INTRODUCTION

Emotions affect the psychological state of any person and play an important role in human life and work. They usually appear spontaneously, which makes recognizing them accurately and on time a challenging problem. A change of a person’s internal affective state or intention is reflected in many human physical signals [3], among which the most useful for practical applications are facial expressions and voice. Automation of facial expression recognition (FER) and speech emotion recognition (SER) methods is one of the crucial points of pattern recognition, having decisive importance for increasing the efficiency of emotion analysis. The availability of both audio and visual modalities in a video makes it possible to develop audio-visual emotion recognition techniques. They can be applied in human-computer interfaces, affective computing, lie detection, intelligent environments [3], assessment of several neuropsychiatric disorders [20], etc.

Unfortunately, constructing image models and representations suitable for efficient emotion recognition algorithms is very difficult because the datasets available for FER and SER are small and noisy. In fact, labeling an emotional video may be very difficult, as the perception of emotions varies from person to person, so many labels are ambiguous [12]. Moreover, labeling the beginning and end positions of each emotion at the frame level [15] is required to track changes in the emotional state [11]. As a result, the accuracy of even the state-of-the-art models trained on such datasets is still limited to 50–70% if the subjects in the training and testing sets are disjoint. For example, a single ResNet model with a multilevel attention mechanism, self-training on an unlabeled body language dataset, and iterative training [8] reached a validation accuracy of 55.2% on the AFEW (Acted Facial Expressions in the Wild) database [3]. Facial representations based on a carefully pre-trained EfficientNet-B0 form the best-known single model for AFEW, with an accuracy greater than 59% [13]. Factorized bilinear pooling in attention-based cross-modal feature fusion [22] leads to the greatest validation accuracy (65.5%) on the same dataset. The highest accuracy on the testing set (62.78%) is obtained by the bi-modality fusion [9] of audio and video features extracted by four different CNNs. The multimodal dynamic fusion network [6] reached an accuracy of 68.2% on the IEMOCAP (Interactive Emotional Dyadic Motion Capture) database for the emotion recognition in conversations problem. A pre-trained deep convolutional neural network (CNN) with correlation-based feature selection [6] in the speaker-independent mode achieved SER accuracies of 56.5% and 63% for IEMOCAP and RAVDESS (Ryerson Audio-Visual Database of Emotional Speech and Song), respectively.

It is important to emphasize that many recent papers report much better performance in the speaker-dependent mode, in which the training and testing sets contain data from the same subjects [12, 17]. For instance, the model from the above-mentioned paper [6] trained in this mode achieved an accuracy of 83.8% for IEMOCAP and 81.3% for RAVDESS. The multiview lightweight facial expression network [7] achieved a FER accuracy of 90–95% on several datasets with a random train-test split. The subject-dependent variant of the FER task is addressed in [19] with a novel face recognition-based attention framework. The greatest unweighted average recall (UAR) for RAMAS (Russian Acted Multimodal Affective Set) [11] has been established by fusing audio and video classifiers [12] with a random train-test split of the sequences of frames with the same emotion.

Thus, in this paper, we propose to develop personalized short-term FER [14] and SER [17] representations adapted to each user of a multi-user system. The audio and video classifiers are fused in a novel technology for automatic audio-visual tracking of changes in the psycho-emotional state of the subject. The concrete audio and video models are chosen using preliminary video-based face recognition. The remaining part of this paper discusses the details of the proposed approach and its experimental study on the RAMAS dataset [11]. The results of this research and the conclusions can be useful for many researchers engaged in the field of pattern recognition and image mining.

TASK FORMULATION

The task of continuous recognition of the emotional state is formulated as follows. Let a set of K users (speakers) be available. Given an input video with the face and voice of one of these users, it is required to assign one of C > 1 emotional classes at every moment in time. In this paper, the typical assumption is made about the smoothness of the psycho-emotional state. Hence, it is possible to split the whole signal into partially overlapped video \(X_v\) and audio \(X_a\) fragments of short duration (0.5–5 s), for which the emotion is considered to be constant. Thus, the task is to predict the class label c of emotions represented by an audio signal \(X_a = \{x_a(t)\}\), \(t = 1, 2, \ldots, T_a\), and a sequence of \(T_v > 1\) video frames (facial images) \(X_v = \{X_v(t)\}\), \(t = 1, 2, \ldots, T_v\), where the number of samples \(T_a\) in the speech signal and the number of video frames \(T_v\) are relatively small. For simplicity, we assume that only one facial image has been preliminarily extracted from each frame using an appropriate face detection technique [21]. In order to solve this task, a training set of N > 1 pairs of facial video and audio signals \(\{(X_{v;n}, X_{a;n})\}\), \(n = 1, 2, \ldots, N\), of other persons with known emotional categories \(c_n\) should be available. Here each video signal is represented by a sequence of facial frames \(X_{v;n} = \{X_{v;n}(t)\}\).
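To make this formulation concrete, the following sketch shows one way to split a synchronized recording into partially overlapped fragments \(X_a\) and \(X_v\); the window and hop lengths are illustrative placeholders within the 0.5–5 s range mentioned above, not values prescribed by the method.

```python
# A minimal sketch of splitting a synchronized recording into partially
# overlapped short fragments X_a (audio samples) and X_v (facial frames).
# The window/hop lengths are illustrative assumptions, not tuned values.
import numpy as np

def split_into_fragments(audio, sr, frames, fps, win_sec=2.0, hop_sec=1.0):
    """Yield (X_a, X_v) pairs for consecutive overlapping windows."""
    duration = min(len(audio) / sr, len(frames) / fps)
    start = 0.0
    while start + win_sec <= duration:
        x_a = audio[int(start * sr):int((start + win_sec) * sr)]
        x_v = frames[int(start * fps):int((start + win_sec) * fps)]
        yield x_a, x_v
        start += hop_sec
```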

At first, it is necessary to extract visual and acoustic emotional features. In this paper, we use the MobileNet [2] and EfficientNet [13, 15] models pre-trained on the AffectNet dataset of facial photos [10]. The facial images \(X_v(t)\) and \(X_{v;n}(t)\) are fed into a CNN, and the D-dimensional feature vectors (embeddings) \(\mathbf{x}_v(t)\) and \(\mathbf{x}_{v;n}(t)\) are extracted at the output of the penultimate layer. There are several techniques to compute a descriptor of the whole video \(X_v\), such as the attention mechanism [14, 19, 22], but we use simple component-wise averaging of the feature vectors \(\mathbf{x}_v(t)\) and \(\mathbf{x}_{v;n}(t)\) to obtain the D-dimensional video descriptors \(\mathbf{x}_v\) and \(\mathbf{x}_{v;n}\), respectively.
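A minimal sketch of this averaging step is given below; emotion_cnn stands for a hypothetical Keras model (e.g., MobileNet or EfficientNet-B0 pre-trained on AffectNet) whose output is the D-dimensional penultimate-layer embedding for a batch of aligned facial crops.

```python
# A minimal sketch of computing the video descriptor x_v by component-wise
# averaging of per-frame embeddings. `emotion_cnn` is a placeholder for a
# Keras model returning D-dimensional embeddings (penultimate layer).
import numpy as np

def video_descriptor(face_crops, emotion_cnn, batch_size=32):
    """face_crops: array of shape (T_v, H, W, 3), already preprocessed."""
    embeddings = emotion_cnn.predict(face_crops, batch_size=batch_size)  # (T_v, D)
    return embeddings.mean(axis=0)                                       # (D,)
```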

The audio feature vectors \(\mathbf{x}_a\) and \(\mathbf{x}_{a;n}\) are extracted from the speech signals \(X_a\) and \(X_{a;n}\) using the OpenSmile [4] and OpenL3 [1] libraries. The former uses the Emobase configuration of traditional acoustic features, such as pitch frequency, Mel-frequency cepstral coefficients, etc. The latter extracts deep audio embeddings based on the L3-Net (Look, Listen, and Learn) CNN trained through self-supervised learning of audio-visual correspondence in videos, as opposed to other embeddings that require labeled data.
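The sketch below shows how both kinds of acoustic descriptors can be obtained with the Python packages opensmile and openl3; averaging the OpenL3 frame-level embeddings over time is our assumption for obtaining a single vector per fragment, and the content_type and embedding_size arguments are illustrative choices.

```python
# A minimal sketch of acoustic feature extraction for one fragment X_a,
# assuming the Python packages `opensmile` and `openl3` are installed.
import numpy as np
import opensmile
import openl3

def emobase_features(audio, sr):
    """Traditional acoustic functionals (Emobase configuration of OpenSmile)."""
    smile = opensmile.Smile(
        feature_set=opensmile.FeatureSet.emobase,
        feature_level=opensmile.FeatureLevel.Functionals,
    )
    return smile.process_signal(audio, sr).values[0]          # 1D feature vector

def openl3_features(audio, sr):
    """Deep L3-Net embeddings; time-averaging is our assumption here."""
    emb, _ = openl3.get_audio_embedding(audio, sr,
                                        content_type="env",   # illustrative choice
                                        embedding_size=512)
    return emb.mean(axis=0)                                   # (512,)
```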

Finally, arbitrary audio and video classifiers are trained on the sets \(\{(\mathbf{x}_{v;n}, c_n)\}\) and \(\{(\mathbf{x}_{a;n}, c_n)\}\). In this paper, we use RF (Random Forest) and SVM (Support Vector Machine) classifiers from scikit-learn and a feed-forward neural network, such as a multiclass logistic regression or an MLP (multilayer perceptron), implemented in the TensorFlow 2 framework.
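A minimal sketch of these classifiers is given below; the hyperparameters are illustrative placeholders rather than the values tuned in our experiments.

```python
# A minimal sketch of the user-independent classifiers; hyperparameters
# (n_estimators, hidden_units, epochs, etc.) are illustrative placeholders.
import tensorflow as tf
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC

def build_mlp(input_dim, num_classes, hidden_units=256):
    model = tf.keras.Sequential([
        tf.keras.layers.Input(shape=(input_dim,)),
        tf.keras.layers.Dense(hidden_units, activation="relu"),
        tf.keras.layers.Dense(num_classes, activation="softmax"),
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model

# x_train: (N, D) descriptors, y_train: (N,) labels in {0, ..., C-1}
# rf  = RandomForestClassifier(n_estimators=200).fit(x_train, y_train)
# svm = SVC(kernel="rbf", probability=True).fit(x_train, y_train)
# mlp = build_mlp(x_train.shape[1], num_classes=C)
# mlp.fit(x_train, y_train, epochs=100, batch_size=32)
```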

PROPOSED APPROACH

The specificity and complexity of audio-visual emotion recognition problems stem from the necessity to balance such conflicting factors as the variability of emotional features across persons, the ambiguous labeling of existing emotional datasets, and the requirement of near-real-time processing for continuous tracking of the emotional state. Hence, the typical approach from the previous section based on lightweight visual and acoustic representations is not very accurate if the training and testing sets do not contain data of the same subjects [14]. In this section, we describe the possibility to develop personalized representations under the assumption that a small set of \(N_k > C\) utterances and facial videos is available for every kth user (k = 1, 2, …, K). The proposed technology for continuous recognition of the emotional state in a multi-user system is shown in Fig. 1.

Fig. 1. Proposed technology for audio-visual tracking of a user’s emotional state in a multi-user system.

The top part of this figure contains the training of the user-independent audio and video MLP-based classifiers from the previous section. Next, personalized acoustic and video models are obtained for every kth user by fine-tuning these MLPs on the data of this user only. Each MLP is initialized with the weights of the speaker-independent model, and the training process is repeated over 50 epochs using the SGD (stochastic gradient descent) optimizer with a learning rate of 0.001.
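A minimal sketch of this fine-tuning step is shown below (build_mlp refers to the user-independent MLP sketched in the previous section); only the optimizer, learning rate, and number of epochs come from the description above.

```python
# A minimal sketch of personalizing the MLP for the k-th user: the weights
# are copied from the speaker-independent model and fine-tuned on the
# user's data with SGD (learning rate 0.001) over 50 epochs.
import tensorflow as tf

def personalize(base_mlp, x_user, y_user, epochs=50, lr=0.001):
    personal = tf.keras.models.clone_model(base_mlp)
    personal.set_weights(base_mlp.get_weights())   # initialize with user-independent weights
    personal.compile(optimizer=tf.keras.optimizers.SGD(learning_rate=lr),
                     loss="sparse_categorical_crossentropy",
                     metrics=["accuracy"])
    personal.fit(x_user, y_user, epochs=epochs, verbose=0)
    return personal
```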

The audio-visual emotion recognition is implemented as follows. At first, facial regions are detected in each video frame using MTCNN (multi-task cascaded CNN) [21]. The face is recognized by nearest-neighbor classification of the average facial embeddings extracted from each video frame by our lightweight MobileNet or EfficientNet [13]. As a result, the user identifier k is obtained. Next, short fragments of the input video and audio signals are stored in \(X_v\) and \(X_a\), and their visual and acoustic representations are estimated as described in the previous section. The input features \(\mathbf{x}_v\) and \(\mathbf{x}_a\) are fed into the kth video and speech models to obtain the C-dimensional scores (estimates of posterior probabilities) \([p_{v;1}, \ldots, p_{v;C}]\) and \([p_{a;1}, \ldots, p_{a;C}]\), respectively. The simple blending rule [16] is used for the fusion of the audio and video modalities to compute the final vector of scores \(p_c = w p_{v;c} + (1 - w) p_{a;c}\), c = 1, 2, …, C, where the weight w is estimated using cross-validation. The emotional class with the greatest score \(p_c\) is returned as the final decision for the current moment in time. The dynamics of the predicted emotional states can be further processed in various practical applications. For example, the standard deviation of the emotion scores computed over all time moments [2] can be useful for stress-level analysis or lie detection. Let us experimentally verify the claim that the proposed personalized models are much more accurate than the speaker-independent classifiers.
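A minimal sketch of this decision-level fusion is given below; the personalized video and audio models are assumed to be Keras MLPs returning softmax scores, and the default value of the weight w is a placeholder to be tuned by cross-validation.

```python
# A minimal sketch of blending the k-th user's video and audio scores.
# Both models are assumed to be Keras classifiers with softmax outputs;
# the default weight w is an illustrative placeholder.
import numpy as np

def fuse_and_predict(video_model, audio_model, x_v, x_a, w=0.7):
    p_v = video_model.predict(x_v[np.newaxis, :])[0]   # [p_{v;1}, ..., p_{v;C}]
    p_a = audio_model.predict(x_a[np.newaxis, :])[0]   # [p_{a;1}, ..., p_{a;C}]
    p = w * p_v + (1.0 - w) * p_a                      # blending rule
    return int(np.argmax(p)), p                        # predicted class and scores
```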

EXPERIMENTAL RESULTS

In this section, the RAMAS dataset [11] was used because it is the only publicly available multimodal emotional dataset with frame-level annotations and known subjects. It contains 564 audio and facial video recordings from 10 actors. The beginning and the end of each of C = 6 emotions (anger, sadness, disgust, happiness, fear, or surprise) and the neutral class are labeled by at least 5 annotators for each video. In this paper, we borrowed the testing protocol originally introduced in [12]. The neutral emotion was dropped, and a threshold (level of confidence) \(n_a\) was set on the number of agreed annotators to obtain emotional intervals for each threshold. As a result, we obtained different sets of video and audio fragments with corresponding class labels that were chosen by at least \(n_a\) annotators.

In the first experiment, the speaker-dependent mode with the random train-test split from the paper [12] was used. As a result, the training/testing sets contain 2277/380, 1539/265, 1425/244, 1468/294, and 1124/234 samples for \(n_a\) = 1, 2, …, 5, respectively. The UAR and accuracy of video and audio emotion recognition are shown in Tables 1 and 2, respectively. These results demonstrate that our visual models are much better (by up to 10%) than VGGFace and the fine-tuned EfficientNet-B3 from [12] for video data with at least \(n_a\) = 2 agreed annotators. We used the same OpenSmile library as the authors of [12], so the UARs for the OpenSmile features are roughly equal. However, the embeddings of the L3-Net are classified slightly more accurately (Table 2).

Table 1. Classification results of speaker-dependent video-based FER
Table 2. Classification results of speaker-dependent audio-based SER

In the next experiment, the study of the proposed approach (Fig. 1) was carried out. The following implementation of 10-fold cross-validation was used. The audio and video of 9 actors were chosen to train the speaker-independent models. The videos of the remaining actor were randomly split into two equal parts. One of them was used to fine-tune the MLPs trained on the data of the other actors. The accuracy and UAR of such a personalized model were estimated on the remaining half of the audio-visual data. The experiment was repeated 10 times with a different testing actor each time, so that all actors were involved in the testing process, and the average metrics were computed. In addition, we estimated the average accuracy of the speaker-independent classifiers trained on the data of 9 actors and tested on the same half of the data of the remaining subject. The results of this experiment are presented in Table 3.
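The evaluation loop can be summarized by the following sketch; train_speaker_independent, personalize, and evaluate are hypothetical placeholders for the steps sketched earlier, and data_by_actor maps each actor to his or her feature vectors and labels.

```python
# A minimal sketch of the cross-validation protocol: one actor is held out,
# a random half of that actor's data is used for fine-tuning, and the other
# half is used for testing. `train_speaker_independent`, `personalize`, and
# `evaluate` are placeholders for the steps described above.
import numpy as np

def cross_validate(data_by_actor, seed=0):
    rng = np.random.default_rng(seed)
    scores = []
    for test_actor in data_by_actor:
        train_actors = {a: d for a, d in data_by_actor.items() if a != test_actor}
        base = train_speaker_independent(train_actors)
        x, y = data_by_actor[test_actor]
        idx = rng.permutation(len(y))
        half = len(y) // 2
        personal = personalize(base, x[idx[:half]], y[idx[:half]])
        scores.append(evaluate(personal, x[idx[half:]], y[idx[half:]]))
    return float(np.mean(scores))
```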

Table 3. Classification results of personalized emotion recognition

The proposed approach has approximately the same performance as the speaker-dependent recognition (Tables 1, 2), but is more flexible, because models for new users can be added at any moment without affecting the results for previous users. Our method (Fig. 1) is much more accurate than the conventional speaker-independent mode in all cases except the audio modality with only one agreed annotator. For example, the best visual representations (EfficientNet-B0) [13] with a personalized classifier increase the accuracy by 15% if \(n_a\) > 1. The quality of SER is typically much lower than that of FER, so their fusion leads to only 1–2% higher accuracy and UAR compared to processing the visual modality alone. Conventional OpenSmile features [4] are significantly worse than OpenL3 deep audio embeddings [1]. Fusing the latter with the visual representations lets us achieve the greatest accuracy (up to 84.8%), and the difference between MobileNet and EfficientNet in this ensemble is not significant. Because MobileNet is faster and smaller, it is preferable for real-time practical applications.

CONCLUSIONS

In this paper, a novel technology was proposed for continuous emotion recognition with personalized audio and visual neural models. It offers a real opportunity to efficiently solve various practical problems by extracting from video the information necessary to analyze the dynamics of the psycho-emotional state. It was experimentally shown that our adaptation of the user-independent classifiers significantly increases the recognition accuracy compared to universal speaker-independent models (Table 3). We demonstrated that the lightweight MobileNet-based visual representations [2] are more suitable for practical applications due to their high accuracy, excellent running time, and small model size. It was also shown that emotional speech is better represented by the L3-Net embeddings from OpenL3 [1] than by the traditional OpenSmile features [4].

The main direction for future research is the development of more sophisticated fusion algorithms for acoustic and facial representations instead of our simple blending of audio and video predictions. For example, it is possible to detect the moments when vowels are pronounced and aggregate the features of the video frames at these moments only [18]. In addition, it is important to study semi-supervised methods for the development of personalized models without the need for labeled audio-visual data of a particular user.