INTRODUCTION

Emotions affect the psychological state of any person and play an important role in human life and work. They usually appear spontaneously, which makes recognizing them accurately and on time a challenging problem. A change of a person’s internal affective state or intention is reflected in many human physical signals [3], among which the most useful for practical applications are facial expressions and voice. Automation of facial expression recognition (FER) and speech emotion recognition (SER) methods is one of the crucial points of pattern recognition, having decisive importance for increasing the efficiency of emotion analysis. The availability of both audio and visual modalities in a video makes it possible to develop audio-visual emotion recognition techniques. They can be applied in human-computer interfaces, affective computing, lie detection, intelligent environments [3], assessment of several neuropsychiatric disorders [20], etc.

Unfortunately, constructing image models and representations suitable for efficient emotion recognition algorithms is very difficult because the datasets available for FER and SER are small and noisy. In fact, labeling an emotional video may be very difficult, as the perception of emotions varies from person to person, so many labels are ambiguous [12]. Moreover, labeling the beginning and end positions of each emotion at the frame level [15] is required to track changes in the emotional state [11]. As a result, the accuracy of even the state-of-the-art models trained on such datasets is still limited to 50–70% if the subjects in the training and testing sets are disjoint. For example, a single ResNet model with a multilevel attention mechanism, self-training on an unlabeled body language dataset, and iterative training [8] reached a validation accuracy of 55.2% on the AFEW (Acted Facial Expressions in the Wild) database [3]. Facial representations based on a carefully pre-trained EfficientNet-B0 form the best-known single model for AFEW, with an accuracy greater than 59% [13]. Factorized bilinear pooling in attention-based cross-modal feature fusion [22] leads to the greatest validation accuracy (65.5%) on the same dataset. The highest accuracy on the testing set (62.78%) is obtained by the bi-modality fusion [9] of audio and video features extracted by four different CNNs. The multimodal dynamic fusion network [6] reached an accuracy of 68.2% on the IEMOCAP (Interactive Emotional Dyadic Motion Capture) database for the emotion recognition in conversations problem. A pre-trained deep convolutional neural network (CNN) with correlation-based feature selection [6] in the speaker-independent mode achieved SER accuracies of 56.5% and 63% for IEMOCAP and RAVDESS (Ryerson Audio-Visual Database of Emotional Speech and Song), respectively.

It is important to emphasize that many recent papers report much better performance in the speaker-dependent mode, in which the training and testing sets contain data from the same subjects [12, 17]. For instance, the model from the above-mentioned paper [6] trained in this mode achieved an accuracy of 83.8% for IEMOCAP and 81.3% for RAVDESS. The multiview lightweight facial expression network [7] achieved a FER accuracy of 90–95% on several datasets with a random train-test split. The subject-dependent variant of the FER task is addressed in [19] with a novel face recognition-based attention framework. The greatest unweighted average recall (UAR) for RAMAS (Russian Acted Multimodal Affective Set) [11] has been established by fusing audio and video classifiers [12] with a random train-test split of the sequences of frames with the same emotion.

Thus, in this paper, we propose to develop personalized short-term FER [14] and SER [17] representations adapted to each user of a multi-user system. The audio and video classifiers are fused in a novel technology for automatic audio-visual tracking of changes in the psycho-emotional state of the subject. The concrete audio and video models are chosen using preliminary video-based face recognition. The remaining part of this paper discusses the details of the proposed approach and its experimental study on the RAMAS dataset [11]. The results of this research and the conclusions can be useful for many researchers engaged in the field of pattern recognition and image mining.

TASK FORMULATION

The task of continuous recognition of the emotional state is formulated as follows. Let a set of K users (speakers) be available. Given an input video with the face and voice of one of these users, it is required to assign one of C > 1 emotional classes at every moment in time. In this paper, the typical assumption is made about the smoothness of the psycho-emotional state. Hence, it is possible to split the whole signal into partially overlapped video \(X_v\) and audio \(X_a\) fragments of short duration (0.5–5 s), for which the emotion is considered to be constant. Thus, the task is to predict the class label c of emotions represented by an audio signal \(X_a = \{x_a(t)\}\), \(t = 1, 2, \ldots, T_a\), and a sequence of \(T_v > 1\) video frames (facial images) \(X_v = \{X_v(t)\}\), \(t = 1, 2, \ldots, T_v\), where the number of samples \(T_a\) in the speech signal and the number of video frames \(T_v\) are relatively small. For simplicity, we assume that only one facial image has been preliminarily extracted from each frame using an appropriate face detection technique [21]. In order to solve this task, a training set of N > 1 pairs of facial video and audio signals \(\{(X_{v;n}, X_{a;n})\}\), \(n = 1, 2, \ldots, N\), of other persons with known emotional categories \(c_n\) should be available. Here each video signal is represented by a sequence of facial frames \(X_{v;n} = \{X_{v;n}(t)\}\).
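To make this formulation concrete, the following sketch shows one way to split a synchronized recording into partially overlapped fragments \(X_a\) and \(X_v\); the window and hop lengths are illustrative placeholders within the 0.5–5 s range mentioned above, not values prescribed by the method.

```python
# A minimal sketch of splitting a synchronized recording into partially
# overlapped short fragments X_a (audio samples) and X_v (facial frames).
# The window/hop lengths are illustrative assumptions, not tuned values.
import numpy as np

def split_into_fragments(audio, sr, frames, fps, win_sec=2.0, hop_sec=1.0):
    """Yield (X_a, X_v) pairs for consecutive overlapping windows."""
    duration = min(len(audio) / sr, len(frames) / fps)
    start = 0.0
    while start + win_sec <= duration:
        x_a = audio[int(start * sr):int((start + win_sec) * sr)]
        x_v = frames[int(start * fps):int((start + win_sec) * fps)]
        yield x_a, x_v
        start += hop_sec
```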

At first, it is necessary to extract visual and acoustic emotional features. In this paper, we use the MobileNet [2] and EfficientNet [13, 15] models pre-trained on the AffectNet dataset of facial photos [10]. The facial images \(X_v(t)\) and \(X_{v;n}(t)\) are fed into a CNN, and the D-dimensional feature vectors (embeddings) \(\mathbf{x}_v(t)\) and \(\mathbf{x}_{v;n}(t)\) are extracted at the output of the penultimate layer. There are several techniques to compute a descriptor of the whole video \(X_v\), such as the attention mechanism [14, 19, 22], but we use simple component-wise averaging of the feature vectors \(\mathbf{x}_v(t)\) and \(\mathbf{x}_{v;n}(t)\) to obtain the D-dimensional video descriptors \(\mathbf{x}_v\) and \(\mathbf{x}_{v;n}\), respectively.
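A minimal sketch of this averaging step is given below; emotion_cnn stands for a hypothetical Keras model (e.g., MobileNet or EfficientNet-B0 pre-trained on AffectNet) whose output is the D-dimensional penultimate-layer embedding for a batch of aligned facial crops.

```python
# A minimal sketch of computing the video descriptor x_v by component-wise
# averaging of per-frame embeddings. `emotion_cnn` is a placeholder for a
# Keras model returning D-dimensional embeddings (penultimate layer).
import numpy as np

def video_descriptor(face_crops, emotion_cnn, batch_size=32):
    """face_crops: array of shape (T_v, H, W, 3), already preprocessed."""
    embeddings = emotion_cnn.predict(face_crops, batch_size=batch_size)  # (T_v, D)
    return embeddings.mean(axis=0)                                       # (D,)
```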

The audio feature vectors \(\mathbf{x}_a\) and \(\mathbf{x}_{a;n}\) are extracted from the speech signals \(X_a\) and \(X_{a;n}\) using the OpenSmile [4] and OpenL3 [1] libraries. The former uses the Emobase configuration of traditional acoustic features, such as pitch frequency, Mel-frequency cepstral coefficients, etc. The latter extracts deep audio embeddings based on the L3-Net (Look, Listen, and Learn) CNN trained through self-supervised learning of audio-visual correspondence in videos, as opposed to other embeddings that require labeled data.
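The sketch below shows how both kinds of acoustic descriptors can be obtained with the Python packages opensmile and openl3; averaging the OpenL3 frame-level embeddings over time is our assumption for obtaining a single vector per fragment, and the content_type and embedding_size arguments are illustrative choices.

```python
# A minimal sketch of acoustic feature extraction for one fragment X_a,
# assuming the Python packages `opensmile` and `openl3` are installed.
import numpy as np
import opensmile
import openl3

def emobase_features(audio, sr):
    """Traditional acoustic functionals (Emobase configuration of OpenSmile)."""
    smile = opensmile.Smile(
        feature_set=opensmile.FeatureSet.emobase,
        feature_level=opensmile.FeatureLevel.Functionals,
    )
    return smile.process_signal(audio, sr).values[0]          # 1D feature vector

def openl3_features(audio, sr):
    """Deep L3-Net embeddings; time-averaging is our assumption here."""
    emb, _ = openl3.get_audio_embedding(audio, sr,
                                        content_type="env",   # illustrative choice
                                        embedding_size=512)
    return emb.mean(axis=0)                                   # (512,)
```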

Finally, arbitrary audio and video classifiers are trained on the sets \(\{(\mathbf{x}_{v;n}, c_n)\}\) and \(\{(\mathbf{x}_{a;n}, c_n)\}\). In this paper, we use RF (Random Forest) and SVM (Support Vector Machine) classifiers from scikit-learn and a feed-forward neural network, such as a multiclass logistic regression or an MLP (multilayer perceptron), implemented in the TensorFlow 2 framework.
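A minimal sketch of these classifiers is given below; the hyperparameters are illustrative placeholders rather than the values tuned in our experiments.

```python
# A minimal sketch of the user-independent classifiers; hyperparameters
# (n_estimators, hidden_units, epochs, etc.) are illustrative placeholders.
import tensorflow as tf
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC

def build_mlp(input_dim, num_classes, hidden_units=256):
    model = tf.keras.Sequential([
        tf.keras.layers.Input(shape=(input_dim,)),
        tf.keras.layers.Dense(hidden_units, activation="relu"),
        tf.keras.layers.Dense(num_classes, activation="softmax"),
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model

# x_train: (N, D) descriptors, y_train: (N,) labels in {0, ..., C-1}
# rf  = RandomForestClassifier(n_estimators=200).fit(x_train, y_train)
# svm = SVC(kernel="rbf", probability=True).fit(x_train, y_train)
# mlp = build_mlp(x_train.shape[1], num_classes=C)
# mlp.fit(x_train, y_train, epochs=100, batch_size=32)
```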

PROPOSED APPROACH

The specificity and complexity of audio-visual emotion recognition problems stem from the necessity to balance such conflicting factors as the variability of emotional features across persons, the ambiguous labeling of existing emotional datasets, and the requirement of near-real-time processing for continuous tracking of the emotional state. Hence, the typical approach from the previous section based on lightweight visual and acoustic representations is not very accurate if the training and testing sets do not contain data of the same subjects [14]. In this section, we describe the possibility to develop personalized representations under the assumption that a small set of \(N_k > C\) utterances and facial videos is available for every kth user (k = 1, 2, …, K). The proposed technology for continuous recognition of the emotional state in a multi-user system is shown in Fig. 1.

Fig. 1. Proposed technology for audio-visual tracking of a user’s emotional state in a multi-user system.

The top part of this figure contains the training of the user-independent audio and video MLP-based classifiers from the previous section. Next, personalized acoustic and video models are obtained for every kth user by fine-tuning these MLPs on the data of this user only. Each MLP is initialized with the weights of the speaker-independent model, and the training process is repeated over 50 epochs using the SGD (stochastic gradient descent) optimizer with a learning rate of 0.001.
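A minimal sketch of this fine-tuning step is shown below (build_mlp refers to the user-independent MLP sketched in the previous section); only the optimizer, learning rate, and number of epochs come from the description above.

```python
# A minimal sketch of personalizing the MLP for the k-th user: the weights
# are copied from the speaker-independent model and fine-tuned on the
# user's data with SGD (learning rate 0.001) over 50 epochs.
import tensorflow as tf

def personalize(base_mlp, x_user, y_user, epochs=50, lr=0.001):
    personal = tf.keras.models.clone_model(base_mlp)
    personal.set_weights(base_mlp.get_weights())   # initialize with user-independent weights
    personal.compile(optimizer=tf.keras.optimizers.SGD(learning_rate=lr),
                     loss="sparse_categorical_crossentropy",
                     metrics=["accuracy"])
    personal.fit(x_user, y_user, epochs=epochs, verbose=0)
    return personal
```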

The audio-visual emotion recognition is implemented as follows. At first, facial regions are detected in each video frame using MTCNN (multi-task cascaded CNN) [21]. The face is recognized by nearest-neighbor classification of the average facial embeddings extracted from each video frame by our lightweight MobileNet or EfficientNet [13]. As a result, the user identifier k is obtained. Next, short fragments of the input video and audio signals are stored in \(X_v\) and \(X_a\), and their visual and acoustic representations are estimated as described in the previous section. The input features \(\mathbf{x}_v\) and \(\mathbf{x}_a\) are fed into the kth video and speech models to obtain the C-dimensional scores (estimates of posterior probabilities) \([p_{v;1}, \ldots, p_{v;C}]\) and \([p_{a;1}, \ldots, p_{a;C}]\), respectively. The simple blending rule [16] is used for the fusion of the audio and video modalities to compute the final vector of scores \(p_c = w p_{v;c} + (1 - w) p_{a;c}\), c = 1, 2, …, C, where the weight w is estimated using cross-validation. The emotional class with the greatest score \(p_c\) is returned as the final decision for the current moment in time. The dynamics of the predicted emotional states can be further processed in various practical applications. For example, the standard deviation of the emotion scores computed over all time moments [2] can be useful for stress-level analysis or lie detection. Let us experimentally verify the claim that the proposed personalized models are much more accurate than the speaker-independent classifiers.
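A minimal sketch of this decision-level fusion is given below; the personalized video and audio models are assumed to be Keras MLPs returning softmax scores, and the default value of the weight w is a placeholder to be tuned by cross-validation.

```python
# A minimal sketch of blending the k-th user's video and audio scores.
# Both models are assumed to be Keras classifiers with softmax outputs;
# the default weight w is an illustrative placeholder.
import numpy as np

def fuse_and_predict(video_model, audio_model, x_v, x_a, w=0.7):
    p_v = video_model.predict(x_v[np.newaxis, :])[0]   # [p_{v;1}, ..., p_{v;C}]
    p_a = audio_model.predict(x_a[np.newaxis, :])[0]   # [p_{a;1}, ..., p_{a;C}]
    p = w * p_v + (1.0 - w) * p_a                      # blending rule
    return int(np.argmax(p)), p                        # predicted class and scores
```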

EXPERIMENTAL RESULTS

In this section, the RAMAS dataset [11] was used because it is the only publicly available multimodal emotional dataset with frame-level annotations and known subjects. It contains 564 audio and facial video recordings from 10 actors. The beginning and the end of each of C = 6 emotions (anger, sadness, disgust, happiness, fear, or surprise) and the neutral class are labeled by at least 5 annotators for each video. In this paper, we borrowed the testing protocol originally introduced in [12]. The neutral emotion was dropped, and a threshold (level of confidence) \(n_a\) was set on the number of agreed annotators to obtain emotional intervals for each threshold. As a result, we obtained different sets of video and audio fragments with corresponding class labels that were chosen by at least \(n_a\) annotators.

In the first experiment, the speaker-dependent mode with the random train-test split from the paper [12] was used. As a result, the training/testing sets contain 2277/380, 1539/265, 1425/244, 1468/294, and 1124/234 samples for \(n_a\) = 1, 2, …, 5, respectively. The UAR and accuracy of video and audio emotion recognition are shown in Tables 1 and 2, respectively. These results demonstrate that our visual models are much better (by up to 10%) than VGGFace and the fine-tuned EfficientNet-B3 from [12] for video data with at least \(n_a\) = 2 agreed annotators. We used the same OpenSmile library as the authors of [12], so the UARs for the OpenSmile features are roughly equal. However, the embeddings of the L3-Net are classified slightly more accurately (Table 2).

Table 1. Classification results of speaker-dependent video-based FER
Table 2. Classification results of speaker-dependent audio-based SER

In the next experiment, the study of the proposed approach (Fig. 1) was carried out. The following implementation of 10-fold cross-validation was used. The audio and video of 9 actors were chosen to train the speaker-independent models. The videos of the remaining actor were randomly split into two equal parts. One of them was used to fine-tune the MLPs trained on the data of the other actors. The accuracy and UAR of such a personalized model were estimated on the remaining half of the audio-visual data. The experiment was repeated 10 times with a different testing actor each time, so that all actors were involved in the testing process, and the average metrics were computed. In addition, we estimated the average accuracy of the speaker-independent classifiers trained on the data of 9 actors and tested on the same half of the data of the remaining subject. The results of this experiment are presented in Table 3.
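The evaluation loop can be summarized by the following sketch; train_speaker_independent, personalize, and evaluate are hypothetical placeholders for the steps sketched earlier, and data_by_actor maps each actor to his or her feature vectors and labels.

```python
# A minimal sketch of the cross-validation protocol: one actor is held out,
# a random half of that actor's data is used for fine-tuning, and the other
# half is used for testing. `train_speaker_independent`, `personalize`, and
# `evaluate` are placeholders for the steps described above.
import numpy as np

def cross_validate(data_by_actor, seed=0):
    rng = np.random.default_rng(seed)
    scores = []
    for test_actor in data_by_actor:
        train_actors = {a: d for a, d in data_by_actor.items() if a != test_actor}
        base = train_speaker_independent(train_actors)
        x, y = data_by_actor[test_actor]
        idx = rng.permutation(len(y))
        half = len(y) // 2
        personal = personalize(base, x[idx[:half]], y[idx[:half]])
        scores.append(evaluate(personal, x[idx[half:]], y[idx[half:]]))
    return float(np.mean(scores))
```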

Table 3. Classification results of personalized emotion recognition

The proposed approach has approximately the same performance as the speaker-dependent recognition (Tables 1, 2), but is more flexible, because models for new users can be added at any moment without affecting the results for previous users. Our method (Fig. 1) is much more accurate than the conventional speaker-independent mode in all cases except the audio modality with only one agreed annotator. For example, the best visual representations (EfficientNet-B0) [13] with a personalized classifier increase the accuracy by 15% if \(n_a\) > 1. The quality of SER is typically much lower than that of FER, so their fusion leads to only 1–2% higher accuracy and UAR compared to processing the visual modality alone. Conventional OpenSmile features [4] are significantly worse than OpenL3 deep audio embeddings [1]. Fusing the latter with the visual representations lets us achieve the greatest accuracy (up to 84.8%), and the difference between MobileNet and EfficientNet in this ensemble is not significant. Because MobileNet is faster and smaller, it is preferable for real-time practical applications.

CONCLUSIONS

In this paper, a novel technology was proposed for continuous emotion recognition with personalized audio and visual neural models. It offers a real opportunity to efficiently solve various practical problems by extracting from video the information necessary to analyze the dynamics of the psycho-emotional state. It was experimentally shown that our adaptation of the user-independent classifiers significantly increases the recognition accuracy compared to universal speaker-independent models (Table 3). We demonstrated that the lightweight MobileNet-based visual representations [2] are more suitable for practical applications due to their high accuracy, excellent running time, and small model size. It was also shown that emotional speech is better represented by the L3-Net embeddings from OpenL3 [1] than by the traditional OpenSmile features [4].

The main direction for future research is the development of more sophisticated fusion algorithms for acoustic and facial representations instead of our simple blending of audio and video predictions. For example, it is possible to detect the moments when vowels are pronounced and aggregate the features of the video frames at these moments only [18]. In addition, it is important to study semi-supervised methods for the development of personalized models without the need for labeled audio-visual data of a particular user.