1 Introduction

The ways in which humans naturally communicate and express emotions are usually multimodal [23]: emotions can be conveyed verbally, vocally, or visually. When emotions are expressed mainly through tone of voice, the audio data carries the major cues for emotion recognition; when they are expressed mainly through facial expressions, most of the cues needed for mining emotions lie in the face. Identifying human emotions from multimodal information such as facial expressions, vocal intonation, and linguistic content is therefore an interesting and challenging problem.

Videos provide multimodal data in the acoustic, visual, and textual modalities. Facial expressions, vocal tones, and the spoken text in a video all provide important information for recognizing the true emotional state of a person. Analyzing videos therefore makes it possible to build better models for emotion recognition and sentiment analysis. Text, vision, and audio are generally regarded as the main sources of information in research on video multimodal emotion recognition. Recognizing and utilizing the three modalities simultaneously can effectively extract the semantic and emotional information conveyed during communication.

To exploit the three-modality data, emotion recognition models must be established for the textual, visual and audio modalities simultaneously. In single-modality emotion recognition for text [12, 13, 32,33,34], vision [1, 2, 6, 16, 42] and audio [17, 19, 25, 38, 40], several studies have achieved good recognition performance using deep learning. Recognizing and utilizing textual, visual and audio information, however, requires the seamless integration of the three modalities. The purpose of multimodal fusion is to combine information from multiple modalities, exploit the complementarity of heterogeneous data, provide more robust predictions, and improve the accuracy and reliability of recognition. Multimodal fusion is usually performed at the feature level: several high-dimensional features are combined into a fused feature, which is then input into a model for training. Morency et al. [23] first proposed a joint model of the textual, visual and audio modalities for multimodal sentiment analysis and conducted verification experiments. Poria et al. [30] and Zhao et al. [46] implemented fusion by concatenating the feature vectors of all three modalities into a single long feature vector. The shortcoming of these concatenation-based fusion methods is that every modality receives the same weight in the fusion; that is, the unequal importance of the modalities is not taken into account. To overcome this weight consistency, Poria et al. [29] used a convolutional neural network (CNN) for multimodal sentiment analysis and proposed the convolutional MKL-based (C-MKL) model. Wang et al. [39] proposed the selective-additive learning CNN (SAL-CNN) for multimodal sentiment analysis. Zadeh et al. [42] proposed the tensor fusion network (TFN). Because of the introduction of the tensor representation, however, the computation and memory costs increase exponentially, which severely limits the application of the model, especially when the dataset contains more than three modalities.

To make better use of the textual, visual and audio data for video emotion recognition, a video multimodal emotion recognition method based on the bidirectional gated recurrent unit (Bi-GRU) and attention fusion is proposed in this paper. The contributions of our work are as follows: (1) A time-contextual learning method based on the Bi-GRU is proposed. The Bi-GRU improves the accuracy of video emotion recognition through time-contextual learning. (2) A new network initialization method is proposed and applied to the network model. This initialization method optimizes the initialization parameters of the network model, improves the robustness of the Bi-GRU during training and thus improves the accuracy of emotion recognition. (3) A video multimodal emotion recognition method based on an attention fusion network is proposed. The attention mechanism handles the variation of the contextual state of the multiple modalities at each moment. The attention distribution over the modalities at each moment is calculated in real-time, so that the network model can learn the multimodal contextual information in real-time, thereby improving the accuracy of video emotion recognition under multimodal fusion.

This paper is organized as follows: Related work is presented in Section 2. The video multimodal emotion recognition method based on Bi-GRU and attention fusion is described in Section 3. Section 4 presents experimental results and analysis, and Section 5 presents conclusions and discusses future research directions.

2 Related work

2.1 Single-modality emotion recognition

2.1.1 Textual modality

Research on textual emotion recognition has long been an active and extremely successful field. Notable works include the automatic recognition of opinionated words and their emotion polarity [11, 36], methods using n-grams and more complex language models [37, 41], and methods using polarity transfer rules or detailed feature engineering to address emotion composition [22, 28]. Li et al. [18] proposed a hybrid approach to recognize word emotion across eight emotion categories with corresponding intensities based on a Chinese emotion corpus. They explored approaches to identifying word emotion from the general emotion attributes of a word. Experimental results showed that integrating morpheme characteristics and semantic relations can effectively improve classification accuracy. These methods have been applied in many areas, including mining opinions from Twitter and other online forums, analyzing political debates, answering questions, summarizing dialogues, and detecting citation emotion.

Research on textual emotion recognition based on deep learning has also been successful. Socher et al. [34] introduced recursive neural tensor networks and the Stanford sentiment treebank; the combination of the new model and data advanced single-sentence positive/negative classification and fine-grained sentiment prediction, and their research showed that sentiment analysis for text is far from solved. Iyyer et al. [12] introduced a deep averaging network (DAN) for textual emotion recognition. This simple and effective sentiment analysis model used only the distribution of words to represent information, rather than the compositional information of sentences, thus reducing computational complexity. The model performed better than syntactic models on datasets with high syntactic variance. Kalchbrenner et al. [13] described a convolutional architecture called the dynamic convolutional neural network (DCNN) for the semantic modelling of sentences. The network handled input sentences of varying length and induced a feature graph over the sentence that explicitly captured short- and long-range relations. It did not rely on a parse tree and was easily applicable to any language. Seyeditabari et al. [32] formulated emotion recognition in text as a binary classification problem and presented a new network based on a Bi-GRU model to capture more meaningful information from text. They reported results for the two word-embedding models with the best performance. Shrivastava et al. [33] proposed a sequence-based CNN with word embeddings to detect emotions. An attention mechanism was applied in the proposed model, allowing the CNN to focus on the words that had more effect on the classification or on the parts of the features that should receive more attention.

2.1.2 Visual modality

Emotion recognition based on visual information is a research focus in the fields of affective computing and computer vision. Facial expression is one of the most powerful means by which humans exchange emotions and intentions. Face analysis and video analysis methods based on deep learning have recently shown good performance on key tasks such as face recognition, emotion recognition and activity recognition. In earlier work, CNNs mainly relied on temporal averaging and pooling to handle time series in video emotion recognition. Recurrent neural networks (RNNs) show stronger performance on time-series analysis tasks and have attracted great interest in recent years.

Byeon et al. [2] used 3D convolutional neural networks (3D-CNN) to extract facial features from speakers and reduced the dimensionality of the extracted features to recognize continuous frames of facial expression images captured by a camera. The network used local receptive fields and spatial down-sampling to achieve a degree of invariance to displacement and deformation. Ebrahimi Kahou et al. [6] combined a CNN and long short-term memory (LSTM) into a CNN-LSTM recurrent model: the convolutional features of the speaker's face region were fed into the LSTM at each time step. Its processing of facial expressions was similar to that of the 3D-CNN, and the architecture was superior to CNN methods that used time-averaged aggregation. Zadeh et al. [42] extracted speakers' facial expression features with the FACET facial expression analysis framework and proposed an RNN model based on FACET. It used FACET features taken every 6 frames as input to an RNN with a memory dimension of 100 neurons, which served as a baseline model for their follow-up experiments. Kumawat et al. [16] proposed a novel 3D convolutional layer called the local binary volume (LBV) layer, which reduced the number of trainable parameters significantly compared to a conventional 3D convolutional layer. Their LBVCNN network achieved results comparable to state-of-the-art (SOTA) models, with or without facial landmarks, on image sequences from the CK+, Oulu-CASIA, and UNBC-McMaster shoulder pain datasets. Bairaju et al. [1] utilized a combination of CNNs and autoencoders to extract features for facial emotion detection and obtained considerable classification accuracy.

2.1.3 Audio modality

Automatically identifying spontaneous emotions from speech is a challenging task. On the one hand, acoustic features need to be powerful enough to capture the emotional content of various speaker styles; on the other hand, machine learning algorithms need to be insensitive to outliers while being able to model context. In recent years, research on audio emotion recognition based on deep learning has made great progress.

Lee et al. [17] presented a speech emotion recognition system using an RNN model trained by an efficient learning algorithm. The proposed system took into account the long-range contextual effect and the uncertainty of emotion labels. To extract a high-level representation of emotional states and their temporal dynamics, a powerful learning method with a bidirectional long short-term memory (BLSTM) architecture was adopted. Trigeorgis et al. [38] addressed the problem of context-aware, emotionally relevant feature extraction by combining CNNs with LSTM networks to automatically learn the best representation of the speech signal directly from the raw time-domain signal. Lim et al. [19] proposed a speech emotion recognition (SER) method based on CNNs and RNNs. Applied to an emotional speech database, the proposed methods were verified to achieve better classification accuracy than conventional classification methods. Orjesek et al. [25] stacked convolutional layers with a Bi-GRU and showed exceptional performance using only raw audio signals, without any need for pre-processing. Wu et al. [40] presented a novel architecture for SER based on capsule networks (CapsNets). The proposed system took into account the spatial relationships of speech features in spectrograms and provided an effective pooling method for obtaining utterance-level global features, demonstrating the effectiveness of CapsNets for SER.

LSTM networks solve the problem of modelling speech context, but how to capture the emotional features of speech still requires active study, even though more than a decade of research has produced a large number of acoustic feature descriptors.

2.2 Video multimodal emotion recognition

Multimodal research, an emerging field of artificial intelligence, has shown great progress on a variety of tasks. Identifying human emotions from facial expressions, vocal intonation and body gestures is an interesting and challenging problem. Many studies have examined only the emotional content of language, or have used only images to recognize facial expressions; relatively few have combined multiple modalities to recognize human emotions. Text, vision, and audio are generally regarded as the main sources of information in research on video multimodal emotion recognition, and the purpose of fusion is to improve the accuracy and reliability of recognition. The main advantage of analyzing videos rather than only text is the abundance of behavioral cues. Text analysis relies on words, phrases, and the dependencies between them, but this information alone is known to be insufficient for extracting the relevant emotional content. Videos provide multimodal data in both the acoustic and visual modalities; facial expressions, vocal tones and the spoken text can all provide important information for recognizing the true emotional state of a person. Analyzing videos therefore makes it possible to build better models for emotion recognition and sentiment analysis.

An important challenge for multimodal fusion is how to extend the fusion to multiple modalities while maintaining reasonable model complexity. Morency et al. [23] demonstrated that a joint model integrating visual, audio, and textual features can be effectively used to identify sentiment in Web videos. They used the joint model for sentiment analysis of product and movie reviews, identified a subset of audio-visual features relevant to sentiment analysis, and presented guidelines on how to integrate these features. However, their method directly concatenated the modality information in an early-fusion representation and did not study the relationships between modalities, and their experiments were conducted in a speaker-dependent manner without analyzing emotion intensity. Park et al. [26] studied the persuasiveness of communication in social activities. They demonstrated that computational descriptors derived from verbal and nonverbal behavior can be predictive of persuasiveness, and further showed that combining descriptors from multiple communication modalities (audio, text and visual) improved prediction performance compared to using any single modality alone. The C-MKL model proposed by Poria et al. [29] is a multimodal emotion classification model that used the combined feature vectors of the textual, visual, and audio modalities to train a classifier based on multiple kernel learning. However, their experiments focused on discourse rather than commentary, and their method depended on emotion polarity rather than emotion intensity. Nojavanasghari et al. [24] also studied persuasion. They used a deep multimodal fusion architecture able to leverage complementary information from individual modalities for predicting persuasiveness: single neural networks were trained for each view's input and combined with a joint neural network. This baseline was the SOTA on the POM dataset. Wang et al. [39] used a selective-additive learning (SAL) procedure that improved the generalizability of trained neural networks for multimodal sentiment analysis; their experiments showed that the SAL approach improved prediction accuracy significantly in all three modalities as well as in their fusion. Zadeh et al. [44] presented a novel neural architecture for understanding human communication, the multi-attention recurrent network (MARN), for sentiment analysis. The main strength of this model came from discovering interactions between modalities through time using a neural component called the multi-attention block (MAB) and storing them in the hybrid memory of a recurrent component called the long-short term hybrid memory (LSTHM). Zadeh et al. [43] introduced the memory fusion network (MFN), a novel approach for multi-view sequential learning that accounted for both view-specific and cross-view interactions, continuously modelling them through time with a special attention mechanism and summarizing them through time with a multi-view gated memory. Liu et al. [20] proposed the low-rank multimodal fusion method, which performed multimodal fusion using low-rank tensors to improve efficiency; they compared against other methods [24, 26, 31] on the CMU-MOSI and POM datasets to demonstrate the effectiveness of the proposed approach. Ma et al. [21] proposed an emotion computing algorithm based on cross-modal fusion and edge network data incentive. Their deep cross-modal fusion captured the semantic deviation between multiple modalities and designed fusion methods through non-linear cross-layer mapping. Simulation experiments and theoretical analysis showed that the proposed algorithm was superior to the edge network data incentive algorithm and the cross-modal data fusion algorithm in recognition accuracy, complex emotion recognition efficiency, computational efficiency and delay.

The summary of related work in emotion recognition using deep learning is shown in Table 1.

Table 1 Summary of related work in emotion recognition using deep learning

3 The proposed method

The video multimodal emotion recognition method based on the Bi-GRU and attention fusion is shown in Fig. 1. The main steps of the method are as follows. First, the high-dimensional features of the textual, visual and audio modalities are extracted from the input videos, and the feature vectors are aligned at the word level and normalized. They are then input into the Bi-GRU network for training; a new network initialization method is used to initialize the weights of the Bi-GRU network and the fully connected network at the start of training in each single-modality subnetwork. The state information output by the Bi-GRU network is processed by a maximum pooling layer and an average pooling layer, and the two pooled feature vectors are spliced to form the input features. The input features of the three single-modality subnetworks are then used to calculate the correlation between the multimodal state information, from which the attention distribution of each modality at each moment, i.e. the weight of the state information at each moment, is obtained. The input features of the three single-modality subnetworks are weighted and averaged with the corresponding weights to obtain the fused feature vector, which is the input of the fully connected network. Finally, the video to be recognized is input into the trained network to obtain its emotion intensity output.

Fig. 1
figure 1

Video multimodal emotion recognition based on Bi-GRU and attention fusion

3.1 Video feature extraction for three modalities

3.1.1 Textual features

Global Vectors (GloVe), a word representation tool based on global word co-occurrence statistics, expresses a word as a vector of real numbers that captures semantic relations between words, such as similarity and analogy. The textual features of the video are defined as \( l=\left\{{l}_1,{l}_2,{l}_3,\dots, {l}_{T_l};{l}_t\in {\mathrm{\mathbb{R}}}^{300}\right\} \), where Tl is the number of words in the video and lt is the 300-dimensional GloVe word vector of the t-th word [27].
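As a concrete illustration, the word-level textual features can be obtained by a simple lookup into a pre-trained GloVe table. The sketch below assumes the standard whitespace-separated GloVe text-file format; the file name, tokenization and zero-vector handling of out-of-vocabulary words are our assumptions, not details given in the paper.

```python
import numpy as np

def load_glove(path, dim=300):
    """Read a GloVe text file into a {word: vector} dictionary."""
    table = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip().split(" ")
            if len(parts) != dim + 1:
                continue  # skip malformed lines
            table[parts[0]] = np.asarray(parts[1:], dtype=np.float32)
    return table

def textual_features(transcript, glove, dim=300):
    """Map the T_l words of a segment to a (T_l, 300) feature matrix.
    Out-of-vocabulary words are mapped to a zero vector (an assumption)."""
    words = transcript.lower().split()
    return np.stack([glove.get(w, np.zeros(dim, dtype=np.float32)) for w in words])

# Example (file name and transcript are hypothetical):
# glove = load_glove("glove.840B.300d.txt")
# l = textual_features("this movie was really good", glove)   # shape (5, 300)
```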

3.1.2 Visual features

We use the FACET facial expression analysis framework to detect the speaker's face in each frame and to extract seven basic emotions (anger, contempt, disgust, fear, joy, sadness, and surprise) and two advanced emotions (frustration and confusion) [7]. FACET can also extract a set of 20 facial action units [8] that indicate detailed muscle movements of the face.

We define the visual features as \( v=\left\{{v}_1,{v}_2,{v}_3,\dots, {v}_{T_v}\right\} \). The visual feature of the jth frame is \( {v}_j=\left[{v}_j^1,{v}_j^2,{v}_j^3,\dots, {v}_j^p\right] \), a set of p visual features, where Tv is the total number of frames in the video. We use v as the input of the visual subnetwork. Since the information extracted by FACET from videos is very rich, feeding it into the Bi-GRU produces meaningful time-contextual high-dimensional features for the visual modality.

3.1.3 Audio features

For the audio portion of each video, the COVAREP acoustic analysis framework is used to extract a set of acoustic features, including 12 mel-frequency cepstral coefficients (MFCCs), pitch tracking and voiced/unvoiced segmentation features, glottal source parameters, peak slope parameters [3], maxima dispersion quotients (MDQ) [14], and Liljencrants-Fant (LF) estimates of the glottal model parameters [9]. The voiced/unvoiced segmentation feature is based on the summation of residual harmonics (SRH), which is robust to additive noise [4], and the glottal source parameters are estimated by glottal inverse filtering based on GCI-synchronous IAIF [5]. These features capture different characteristics of the human voice and have been shown to be related to emotion [10].

Each segment is sampled at 100 Hz, giving Ta audio frames. We extract a set of q acoustic features \( {a}_j=\left[{a}_j^1,{a}_j^2,{a}_j^3,\dots, {a}_j^q\right] \) from the jth frame. The audio features of each segment are \( a=\left\{{a}_1,{a}_2,{a}_3,\dots, {a}_{T_a}\right\} \), and we take a as the input of the audio subnetwork. Since COVAREP extracts rich features from the audio, the Bi-GRU can better extract continuous time-contextual high-dimensional features for the audio modality.

3.1.4 Alignment and normalization

The dimension of the GloVe features extracted by the textual modality subnetwork for each segment is (Tl, 300), the dimension of the FACET features extracted by the visual modality subnetwork is (Tv, p), and the dimension of the COVAREP features extracted by the audio modality subnetwork is (Ta, q). The multimodal high-dimensional features must be aligned [42], which is usually done at the word level. In this paper, the high-dimensional features of the visual and audio modalities are each aligned with the GloVe features of the textual modality according to the Tl words in each segment. Specifically, the start time and end time of the i-th spoken word are recorded, and the high-dimensional features of all visual and audio frames within this interval are collected. The features of each modality are averaged over the number of samples of that modality in the interval, and the average serves as the high-dimensional feature of the corresponding modality for that word. After this step, the high-dimensional features of the textual, visual and audio modalities are aligned within each segment, and the number of high-dimensional features of all three modalities equals that of the pre-aligned textual modality, namely Tl.

Since the extracted high-dimensional features differ in amplitude, normalization is required. Normalization finds the maximum values of the high-dimensional features of each of the three modalities and divides all high-dimensional feature values by the maximum value of the corresponding modality, mapping the data into the range from 0 to 1. In the training of neural networks, normalization speeds up training and improves the convergence of the network.
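The word-level alignment and max normalization described above can be sketched as follows. The code assumes frame-level visual or audio features with per-frame timestamps and per-word (start, end) times; the handling of words with no overlapping frames (a zero vector) and the per-feature maximum are our assumptions about one plausible reading of the text.

```python
import numpy as np

def word_align(frame_feats, frame_times, word_spans):
    """Average the frame features falling inside each word's (start, end) span.

    frame_feats : (T_frames, d) array of FACET or COVAREP features
    frame_times : (T_frames,) array of frame timestamps in seconds
    word_spans  : list of (start, end) times, one per word (length T_l)
    returns     : (T_l, d) word-aligned features
    """
    aligned = []
    for start, end in word_spans:
        mask = (frame_times >= start) & (frame_times <= end)
        if mask.any():
            aligned.append(frame_feats[mask].mean(axis=0))
        else:
            aligned.append(np.zeros(frame_feats.shape[1]))  # assumption for empty spans
    return np.stack(aligned)

def max_normalize(feats, eps=1e-8):
    """Divide each feature column by its maximum absolute value within the modality."""
    return feats / (np.abs(feats).max(axis=0, keepdims=True) + eps)

# Usage (variable names are hypothetical):
# v_aligned = max_normalize(word_align(facet_feats, facet_times, word_spans))
# a_aligned = max_normalize(word_align(covarep_feats, covarep_times, word_spans))
```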

3.2 Bi-GRU with new network initialization

3.2.1 Bi-GRU

The Bi-GRU network combines the architectures of the GRU and the bidirectional RNN (BRNN). Replacing the nodes of an RNN with GRU nodes makes it easier for the network to learn time-contextual information and overcomes the problem that the RNN cannot handle long-term dependencies well, which causes vanishing or exploding gradients during back propagation. The BRNN architecture can simultaneously access information from past and future time steps. By using GRU nodes within the BRNN architecture, the resulting network can fully learn and utilize the contextual information of past and future moments.

The high-dimensional features of the three modalities, after word-level alignment and normalization, are used as the inputs of the respective Bi-GRU networks. Take the textual modality subnetwork as an example. The textual features \( l=\left\{{l}_1,{l}_2,{l}_3,\dots, {l}_{T_l};{l}_t\in {\mathrm{\mathbb{R}}}^{300}\right\} \) are input into the Bi-GRU network, where lt is a 300-dimensional GloVe word vector. We define \( \overrightarrow{G}\left(\cdot \right) \) as the forward calculation of the Bi-GRU network and \( \overleftarrow{G}\left(\cdot \right) \) as the backward calculation, as follows:

$$ {\displaystyle \begin{array}{l}{\overrightarrow{h}}_t=\overrightarrow{G}\left({l}_t,{\overrightarrow{h}}_{\left(t-1\right)}\right)\\ {}{\overleftarrow{h}}_t=\overleftarrow{G}\left({l}_t,{\overleftarrow{h}}_{\left(t+1\right)}\right)\end{array}} $$
(1)

where \( {\overrightarrow{h}}_t \) and \( {\overleftarrow{h}}_t \) are the forward state output and the backward state output, respectively, at moment t of the Bi-GRU network, \( {\overrightarrow{h}}_{\left(t-1\right)} \) is the forward state output at moment t − 1, and \( {\overleftarrow{h}}_{\left(t+1\right)} \) is the backward state output at moment t + 1. The Bi-GRU network model architecture is shown in Fig. 2.

Fig. 2
figure 2

Architecture of the Bi-GRU network model inputting with textual high-dimensional features

After the contextual information of the high-dimensional features is fully learned by the Bi-GRU network, the state information output of the network \( H=\left[\left[{\overleftarrow{h}}_1,{\overrightarrow{h}}_1\right],\left[{\overleftarrow{h}}_2,{\overrightarrow{h}}_2\right],\dots, \left[{\overleftarrow{h}}_{T_l},{\overrightarrow{h}}_{T_l}\right]\right] \) is obtained. The maximum pooling layer and the average pooling layer are used to extract features from the state information output of the Bi-GRU network. The pooling layers use overlapping aggregation technology. Pooling can reduce the feature vector dimension of the Bi-GRU network output. We extract high-dimensional representation vectors max(H) and avg(H) respectively, as follows:

$$ {\displaystyle \begin{array}{l}\mathit{\max}(H)=\left[\underset{1\le i\le {T}_l}{\mathit{\max}}\left({\overrightarrow{h}}_i\right),\underset{1\le i\le {T}_l}{\mathit{\max}}\left({\overleftarrow{h}}_i\right)\right]\\ {} avg(H)=\left[\underset{1\le i\le {T}_l}{avg}\left({\overrightarrow{h}}_i\right),\underset{1\le i\le {T}_l}{avg}\left({\overleftarrow{h}}_i\right)\right]\end{array}} $$
(2)

The feature vector h+ can be obtained by splicing the two pooled feature vectors, which is shown in the following formula:

$$ {h}^{+}=\left[\mathit{\max}(H), avg(H)\right] $$
(3)

h+ is considered as an input feature of the fully connected layer in the single-modality subnetwork. The fully connected layer maps the learned high-dimensional features to the sample label space as follows:

$$ y={W}_y{h}^{+}+{b}_y $$
(4)

where Wy is the weight associated with h+, by is the bias associated with h+, and y is the emotion intensity output of a single-modality subnetwork.

The loss function used to train the network is the L1 loss, which measures the mean absolute error between each element of the input X and the target Y. The L1 loss is calculated as follows:

$$ L\left(X,Y\right)=\left\{{l}_1,\dots, {l}_N\right\},{l}_n=\mid {x}_n-{y}_n\mid $$
(5)

where N is the number of elements in the input X, xn is the nth element of the input X, yn is the nth element of the target Y, and ∣a − b∣ refers to the absolute value of the difference between a and b.
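A minimal PyTorch sketch of one single-modality subnetwork, corresponding to Eqs. (1)–(5): a Bi-GRU over the word-aligned features, max and average pooling over time, splicing of the two pooled vectors, and a fully connected layer trained with the L1 loss. The hidden size and other hyper-parameters below are placeholders, not values reported in the paper.

```python
import torch
import torch.nn as nn

class SingleModalitySubnet(nn.Module):
    def __init__(self, input_dim, hidden_size=64):
        super().__init__()
        # Bidirectional GRU giving forward and backward state outputs (Eq. 1).
        self.bigru = nn.GRU(input_dim, hidden_size,
                            batch_first=True, bidirectional=True)
        # Fully connected layer mapping the spliced pooled features to
        # a single emotion intensity value (Eq. 4).
        self.fc = nn.Linear(4 * hidden_size, 1)

    def forward(self, x):
        # x: (batch, T_l, input_dim) word-aligned features.
        H, _ = self.bigru(x)                # (batch, T_l, 2 * hidden_size)
        max_h = H.max(dim=1).values         # max pooling over time (Eq. 2)
        avg_h = H.mean(dim=1)               # average pooling over time (Eq. 2)
        h_plus = torch.cat([max_h, avg_h], dim=-1)   # spliced feature h+ (Eq. 3)
        return self.fc(h_plus).squeeze(-1), H        # intensity y and state sequence H

# Training criterion: mean absolute error (Eq. 5).
# net = SingleModalitySubnet(input_dim=300)        # e.g. textual GloVe features
# y_hat, _ = net(torch.randn(8, 20, 300))          # a batch of 8 segments, T_l = 20
# loss = nn.L1Loss()(y_hat, torch.zeros(8))
```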

3.2.2 Network initialization

The Bi-GRU network is used in the core layers of our model, and the ReLU activation function is used in the fully connected layer. Orthogonal initialization is more suitable for the Bi-GRU network, while Kaiming initialization is more suitable for the neuron parameters of layers with the ReLU activation function. We therefore adjust the parameter initialization of the Bi-GRU network and the fully connected layer simultaneously: Kaiming initialization with weights drawn from a normal distribution is used, and orthogonal initialization is applied to part of the Bi-GRU weights so that the eigenvalues of the orthogonal weight matrix have an absolute value of 1.

In our network model, the neuron parameters of the fully connected layer include weight W and bias b. Their default initialization methods are the same, which are shown as follows:

$$ {\displaystyle \begin{array}{l}W\sim U\left(-\sqrt{k},\sqrt{k}\right)\\ {}b\sim U\left(-\sqrt{k},\sqrt{k}\right)\end{array}} $$
(6)

where U(−a, a) is the uniform distribution over the interval (−a, a), \( k=\frac{1}{n_{in}} \), and nin is the number of input neurons.

We initialize the weight W according to the Kaiming initialization method and make it conform to the normal distribution, and set the bias b to a constant 0, which are shown as follows:

$$ {\displaystyle \begin{array}{l}W\sim N\left(0,\sqrt{\frac{2}{n_{in}}}\right)\\ {}b=0\end{array}} $$
(7)

where N(μ, σ) denotes the normal distribution with mean μ and standard deviation σ.

There are four kinds of neuron parameters in the Bi-GRU network, which are the weight of the input layer to the hidden layer Wih, the bias of the input layer to the hidden layer bih, the weight of the hidden layer to the hidden layer Whh, the bias of the hidden layer to the hidden layer bhh. By default, the initialization methods for the four different neuron parameters are the same, which are shown as follows:

$$ {\displaystyle \begin{array}{l}{W}_{ih}\sim U\left(-\sqrt{k},\sqrt{k}\right)\\ {}{W}_{hh}\sim U\left(-\sqrt{k},\sqrt{k}\right)\\ {}{b}_{ih}\sim U\left(-\sqrt{k},\sqrt{k}\right)\\ {}{b}_{hh}\sim U\left(-\sqrt{k},\sqrt{k}\right)\end{array}} $$
(8)

where \( k=\frac{1}{hiddensize} \) and hiddensize is the number of features in the hidden state of the Bi-GRU network.

We initialize the weight Wih in the Bi-GRU network according to the Kaiming initialization method and make it conform to the normal distribution. We use orthogonal initialization to initialize the weight Whh, and set the bias bih and bhh to a constant 0. The initialization methods are as follows:

$$ {\displaystyle \begin{array}{l}{W}_{ih}\sim N\left(0,\sqrt{\frac{2}{hiddensize}}\right)\\ {}{W}_{hh}\sim Q\\ {}{b}_{ih}=0\\ {}{b}_{hh}=0\end{array}} $$
(9)

where Q is an orthogonal matrix whose eigenvalues all have an absolute value of 1. The new network initialization method is compared with the default method in Fig. 3.

Fig. 3
figure 3

The new network initialization method compared with the default
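The initialization scheme of Eqs. (7) and (9) can be applied in PyTorch roughly as in the sketch below: Kaiming-normal initialization for the input-to-hidden GRU weights and the fully connected weights, orthogonal initialization for the hidden-to-hidden GRU weights, and zero biases. This is a sketch of the scheme as described, not the authors' implementation.

```python
import torch.nn as nn

def init_new(module):
    """Apply the initialization of Eqs. (7) and (9) to a Bi-GRU + Linear model."""
    for m in module.modules():
        if isinstance(m, nn.GRU):
            for name, param in m.named_parameters():
                if name.startswith("weight_ih"):    # input-to-hidden weights W_ih
                    nn.init.kaiming_normal_(param, nonlinearity="relu")
                elif name.startswith("weight_hh"):  # hidden-to-hidden weights W_hh
                    nn.init.orthogonal_(param)      # eigenvalues with absolute value 1
                elif name.startswith("bias"):       # b_ih and b_hh set to 0
                    nn.init.zeros_(param)
        elif isinstance(m, nn.Linear):
            # W ~ N(0, sqrt(2 / n_in)) as in Eq. (7), bias b = 0.
            nn.init.kaiming_normal_(m.weight, nonlinearity="relu")
            if m.bias is not None:
                nn.init.zeros_(m.bias)

# Usage with the subnetwork sketch above (names are illustrative):
# net = SingleModalitySubnet(input_dim=300)
# init_new(net)
```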

3.3 Video multimodal emotion recognition based on attention fusion

In the video multimodal emotion recognition method based on attention fusion, the attention distribution over the three modalities is calculated at each moment and used as the weight of the state information output by the Bi-GRU network of the corresponding modality subnetwork. The state information output by the Bi-GRU networks is weighted and averaged with the corresponding weights to obtain the fused feature vector, which is used as the input feature for the following fully connected layer, and the multimodal emotion intensity is finally obtained.

3.3.1 Correlation calculation of the state information between multiple modalities

We define the state information output by the Bi-GRU network, after it has sufficiently learned the contextual information of the high-dimensional features, as \( H=\left[\left[{\overleftarrow{h}}_1,{\overrightarrow{h}}_1\right],\left[{\overleftarrow{h}}_2,{\overrightarrow{h}}_2\right],\dots, \left[{\overleftarrow{h}}_{T_l},{\overrightarrow{h}}_{T_l}\right]\right] \), where \( {\overrightarrow{h}}_t \) and \( {\overleftarrow{h}}_t \) are the forward state output and the backward state output of the Bi-GRU network at moment t. Thus, the state information of the textual modality subnetwork is \( {H}_t=\left[\left[{\overleftarrow{h}}_{t_1},{\overrightarrow{h}}_{t_1}\right],\left[{\overleftarrow{h}}_{t_2},{\overrightarrow{h}}_{t_2}\right],\dots, \left[{\overleftarrow{h}}_{t_{T_l}},{\overrightarrow{h}}_{t_{T_l}}\right]\right] \), the state information of the visual modality subnetwork is \( {H}_v=\left[\left[{\overleftarrow{h}}_{v_1},{\overrightarrow{h}}_{v_1}\right],\left[{\overleftarrow{h}}_{v_2},{\overrightarrow{h}}_{v_2}\right],\dots, \left[{\overleftarrow{h}}_{v_{T_l}},{\overrightarrow{h}}_{v_{T_l}}\right]\right] \), and the state information of the audio modality subnetwork is \( {H}_a=\left[\left[{\overleftarrow{h}}_{a_1},{\overrightarrow{h}}_{a_1}\right],\left[{\overleftarrow{h}}_{a_2},{\overrightarrow{h}}_{a_2}\right],\dots, \left[{\overleftarrow{h}}_{a_{T_l}},{\overrightarrow{h}}_{a_{T_l}}\right]\right] \). Here \( {\overrightarrow{h}}_{t_i} \) and \( {\overleftarrow{h}}_{t_i} \) are the forward and backward state outputs of the Bi-GRU network in the textual modality subnetwork at moment i, \( {\overrightarrow{h}}_{v_i} \) and \( {\overleftarrow{h}}_{v_i} \) are the forward and backward state outputs in the visual modality subnetwork at moment i, and \( {\overrightarrow{h}}_{a_i} \) and \( {\overleftarrow{h}}_{a_i} \) are the forward and backward state outputs in the audio modality subnetwork at moment i. In the previous section, the high-dimensional features of the visual and audio modalities were aligned at the word level with those of the textual modality, so the number of time steps of the state information is Tl for all three single-modality subnetworks.

The state neurons of the Bi-GRU network are formed by a forward-calculation part and a backward-calculation part at the hidden layer. The number of time steps of the state information in each single-modality subnetwork is Tl; that is, the state neurons go through Tl time steps of forward calculation and Tl time steps of backward calculation. The essence of the attention fusion network is to extract a useful fused feature vector H∗ from the state information Ht, Hv and Ha output by the Bi-GRU networks of the three single-modality subnetworks.

We use the attention mechanism to weigh the importance of each piece of state information and calculate the attention distribution over the state information as the weight αi of the corresponding state information. Since the state information of multiple modalities is taken into consideration, the weight αi attends to the state information of the three modalities simultaneously; that is, the cross-modal correlation of the state information si depends on the state information of the three single-modality subnetworks at each moment. The correlation si is calculated as follows:

$$ {s}_i=V\tanh \left({W}_t\cdotp {h}_{t_i}+{W}_v\cdotp {h}_{v_i}+{W}_a\cdotp {h}_{a_i}+{b}_1\right)+{b}_2 $$
(10)

where \( {h}_{t_i}=\left[{\overrightarrow{h}}_{t_i},{\overleftarrow{h}}_{t_i}\right] \) is the state information output by the Bi-GRU network in the textual modality subnetwork at moment i, comprising the forward state output \( {\overrightarrow{h}}_{t_i} \) and the backward state output \( {\overleftarrow{h}}_{t_i} \), and Wt is the weight associated with \( {h}_{t_i} \). Similarly, \( {h}_{v_i}=\left[{\overrightarrow{h}}_{v_i},{\overleftarrow{h}}_{v_i}\right] \) is the state information output by the Bi-GRU network in the visual modality subnetwork at moment i, and Wv is the weight associated with \( {h}_{v_i} \); \( {h}_{a_i}=\left[{\overrightarrow{h}}_{a_i},{\overleftarrow{h}}_{a_i}\right] \) is the state information output by the Bi-GRU network in the audio modality subnetwork at moment i, and Wa is the weight associated with \( {h}_{a_i} \). b1 is the bias associated with \( {h}_{t_i} \), \( {h}_{v_i} \) and \( {h}_{a_i} \), tanh is the activation function, V is the weight of the multimodal fusion, and b2 is the bias of the multimodal fusion.

3.3.2 Generation of the fused feature vectors

According to the current correlation of multimodal state information si, we can calculate the attention distribution at each moment in multiple modalities, that is, the weight αi corresponding to the state information. The calculation of weight αi is as follows:

$$ {\alpha}_i=\mathrm{softmax}\left({s}_i\right)=\frac{\mathit{\exp}\left({s}_i\right)}{\sum_{j=1}^{T_l}\mathit{\exp}\left({s}_j\right)} $$
(11)

where softmax is a normalized exponential function.

The state information output by the Bi-GRU networks is weighted by the corresponding αi and summed to obtain the fused feature vector H∗, which serves as the input feature of the fully connected layer. The fused feature vector H∗ is calculated as follows:

$$ {H}^{\ast }=\left[\sum \limits_{i=1}^{T_l}{\alpha}_i{h}_{t_i};\sum \limits_{i=1}^{T_l}{\alpha}_i{h}_{v_i};\sum \limits_{i=1}^{T_l}{\alpha}_i{h}_{a_i}\right] $$
(12)

The architecture of the attention fusion network is shown in Fig. 4.

Fig. 4
figure 4

The attention fusion network
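A minimal PyTorch sketch of the attention fusion network of Eqs. (10)–(12): a scalar correlation score si is computed from the three aligned state vectors at each moment, turned into weights αi by a softmax over the Tl moments, and used to build the fused vector H∗ from the per-modality weighted sums. Dimensions and the final regression layer are placeholders; placing the shared bias b1 on one of the linear projections is our implementation choice.

```python
import torch
import torch.nn as nn

class AttentionFusion(nn.Module):
    def __init__(self, state_dim):
        # state_dim = 2 * hidden_size (forward + backward Bi-GRU state).
        super().__init__()
        self.W_t = nn.Linear(state_dim, state_dim, bias=False)
        self.W_v = nn.Linear(state_dim, state_dim, bias=False)
        self.W_a = nn.Linear(state_dim, state_dim, bias=True)  # carries b_1
        self.v = nn.Linear(state_dim, 1, bias=True)            # V and b_2 (Eq. 10)
        self.fc = nn.Linear(3 * state_dim, 1)                  # final emotion intensity

    def forward(self, H_t, H_v, H_a):
        # H_t, H_v, H_a: (batch, T_l, state_dim) word-aligned Bi-GRU state sequences.
        s = self.v(torch.tanh(self.W_t(H_t) + self.W_v(H_v) + self.W_a(H_a)))  # (B, T_l, 1)
        alpha = torch.softmax(s, dim=1)                                        # Eq. (11)
        # Weighted sums per modality, then concatenation: fused vector H* (Eq. 12).
        H_star = torch.cat([(alpha * H_t).sum(dim=1),
                            (alpha * H_v).sum(dim=1),
                            (alpha * H_a).sum(dim=1)], dim=-1)
        return self.fc(H_star).squeeze(-1)

# Usage with three Bi-GRU subnetworks (hidden_size = 64, so state_dim = 128):
# fusion = AttentionFusion(state_dim=128)
# y = fusion(H_text, H_visual, H_audio)   # each H_*: (batch, T_l, 128)
```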

4 Experimental results and analysis

4.1 Datasets

To verify the validity of the proposed method, we use the Carnegie Mellon University multimodal opinion sentiment intensity (CMU-MOSI) dataset [45] and persuasion opinion multimodal (POM) dataset [26] for video multimodal emotion recognition experiments.

CMU-MOSI is an annotated dataset of video opinions from YouTube providing data in the three modalities of text, vision and audio. The sentiment annotation of CMU-MOSI closely follows the annotation scheme of the Stanford sentiment treebank [35], where sentiment is annotated on a seven-step Likert scale from highly negative to highly positive. The emotion intensity annotation was performed by online workers on the Amazon Mechanical Turk website, with intensities ranging from −3 to +3. The CMU-MOSI dataset contains 93 different speakers and 2199 opinion segments, with 26,295 words in the opinion videos. There is an average of 23.2 opinion segments per video, and the average length of each segment is 4.2 s. Example snapshots of videos from the CMU-MOSI dataset are shown in Fig. 5.

Fig. 5
figure 5

Example snapshots of videos from CMU-MOSI dataset. (a) Highly negative, (b) Negative, (c) Neutral, (d) Positive, (e) Highly positive

POM is a dataset for the analysis of persuasion on online social media. It also has annotations for personality and sentiment, which makes it compelling for a large number of tasks. Each video is annotated on a seven-step Likert scale, with 1 being the least descriptive of the trait and 7 the most descriptive. The speaker traits are: confident (con), passionate (pas), voice pleasant (voi), dominant (dom), credible (cre), vivid (viv), expertise (exp), entertaining (ent), reserved (res), trusting (tru), relaxed (rel), outgoing (out), thorough (tho), nervous (ner), persuasive (per) and humorous (hum). The short forms indicated in parentheses are used in the rest of this paper. Example snapshots of videos from the POM dataset are shown in Fig. 6.

Fig. 6
figure 6

Example snapshots of videos from POM dataset. (a) Confidence, (b) Passionate, (c) Voice pleasant, (d) Dominant, (e) Credible, (f) Vivid, (g) Expertise, (h) Entertaining, (i) Reserved, (j) Trusting, (k) Relaxed, (l) Outgoing, (m) Thorough, (n) Nervous, (o) Persuasive, (p) Humorous

In the experiments, we implemented binary sentiment classification, 5-class sentiment classification and sentiment regression on the CMU-MOSI dataset. The regression range is [−3, 3]. We also implemented multi-label classification of different speaker traits and speaker traits regression on the POM dataset. For classification, we use precision (PR), false positive rate (FPR), recall (RE), accuracy (Acc) and F1 score for evaluation; and for regression, mean absolute error (MAE) and Pearson product-moment correlation coefficients (Corr) between model predictions and real values are used for evaluation. Higher values denote better performance for all metrics except for FPR and MAE.
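For reference, the evaluation metrics above can be computed with standard scientific-Python tools as in the hedged sketch below (binary classification shown; a multi-class FPR would need a per-class treatment, which is omitted here).

```python
import numpy as np
from scipy.stats import pearsonr
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             mean_absolute_error, precision_score, recall_score)

def classification_metrics(y_true, y_pred):
    """PR, FPR, RE, Acc and F1 for binary labels (0/1)."""
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
    return {
        "PR": precision_score(y_true, y_pred),
        "FPR": fp / (fp + tn) if (fp + tn) > 0 else 0.0,
        "RE": recall_score(y_true, y_pred),
        "Acc": accuracy_score(y_true, y_pred),
        "F1": f1_score(y_true, y_pred),
    }

def regression_metrics(y_true, y_pred):
    """MAE and Pearson correlation (Corr) between predictions and labels."""
    return {
        "MAE": mean_absolute_error(y_true, y_pred),
        "Corr": pearsonr(np.asarray(y_true), np.asarray(y_pred))[0],
    }
```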

4.2 Experimental setup

In the single-modality experiments, different network models use the same hyper-parameter settings for ease of comparison. The models are trained with the Adam optimizer [15] for at most 50 epochs. Early stopping is used to monitor the loss on the validation set: when the validation loss has not decreased for 10 consecutive epochs, training stops. When comparing with the SOTA network models, the best hyper-parameters are chosen by grid search based on model performance on the validation set. The training, testing and validation folds are exactly the same for all network models. The sample numbers of the datasets are shown in Table 2.

Table 2 The sample numbers of the datasets
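The training procedure described above (Adam, at most 50 epochs, early stopping with a patience of 10 on the validation loss) corresponds roughly to the following sketch. The data loaders, learning rate and the assumption that the model maps a feature batch directly to emotion intensities are ours, not details from the paper.

```python
import copy
import torch
import torch.nn as nn

def train(model, train_loader, val_loader, max_epochs=50, patience=10, lr=1e-3):
    """Adam training with early stopping on the validation L1 loss."""
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    criterion = nn.L1Loss()
    best_loss, best_state, bad_epochs = float("inf"), None, 0

    for epoch in range(max_epochs):
        model.train()
        for x, y in train_loader:
            optimizer.zero_grad()
            loss = criterion(model(x), y)   # model(x): predicted emotion intensities
            loss.backward()
            optimizer.step()

        model.eval()
        with torch.no_grad():
            val_loss = sum(criterion(model(x), y).item() for x, y in val_loader)

        if val_loss < best_loss:            # validation loss improved
            best_loss = val_loss
            best_state = copy.deepcopy(model.state_dict())
            bad_epochs = 0
        else:                               # no improvement this epoch
            bad_epochs += 1
            if bad_epochs >= patience:      # stop after 10 stagnant epochs
                break

    model.load_state_dict(best_state)       # restore the best validation model
    return model
```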

4.3 Results and analysis

4.3.1 Experimental results for single modality

The emotion recognition results of various network models in the textual, visual, and audio modalities under the CMU-MOSI dataset are compared in Tables 3, 4 and 5 respectively. It can be seen from Tables 3 and 4 that in the binary sentiment classification, 5-class sentiment classification and sentiment regression, the emotion recognition results based on the Bi-GRU with new network initialization in the textual modality and visual modality are superior to the results based on the methods of LSTM, GRU, BLSTM, Bi-GRU, and BLSTM with new network initialization. However, it can be seen from Table 5 that in the binary sentiment classification, 5-class sentiment classification and sentiment regression, the emotion recognition results based on the Bi-GRU in the audio modality (except for FPR in 5-class classification) are superior to the results based on the methods of LSTM, GRU, BLSTM, BLSTM with new network initialization, and Bi-GRU with new network initialization.

Table 3 Comparison of emotion recognition results of various network models in the textual modality under the CMU-MOSI dataset
Table 4 Comparison of emotion recognition results of various network models in the visual modality under the CMU-MOSI dataset
Table 5 Comparison of emotion recognition results of various network models in the audio modality under the CMU-MOSI dataset

The emotion recognition results of the various network models in the textual, visual, and audio modalities under the POM dataset are compared in Tables 6, 7 and 8, respectively. For the classification results, we use the average over all labels. It can be seen that in the multi-label classification of speaker traits, the results based on the Bi-GRU with new network initialization in the three single modalities are superior to those based on the LSTM, GRU, BLSTM, Bi-GRU and BLSTM with new network initialization. In the regression of speaker traits, the results based on the Bi-GRU with new network initialization in the audio modality are superior to those based on the LSTM, GRU, BLSTM, Bi-GRU and BLSTM with new network initialization. However, in the textual and visual modalities, the results based on the Bi-GRU with new network initialization are slightly worse than those based on the BLSTM with new network initialization on a certain metric.

Table 6 Comparison of emotion recognition results of various network models in the textual modality under the POM dataset
Table 7 Comparison of emotion recognition results of various network models in the visual modality under the POM dataset
Table 8 Comparison of emotion recognition results of various network models in the audio modality under the POM dataset

The experimental results obtained under the CMU-MOSI dataset are compared with those of the SOTA network models in the textual, visual and audio modalities in Tables 9, 10 and 11. Table 9 shows that in the textual modality, the proposed method (Bi-GRUinit textual) is superior to the SOTA network models in binary sentiment classification, 5-class sentiment classification and sentiment regression. Tables 10 and 11 show that in the visual and audio modalities, the proposed methods (Bi-GRUinit visual and Bi-GRUinit audio) are superior to the SOTA network models in 5-class sentiment classification and sentiment regression. The F1 scores of the TFN [42] method in binary sentiment classification are better than those of the proposed methods, but the proposed methods achieve the best binary classification accuracy among all the SOTA network models.

Table 9 Comparison with the SOTA network models in the textual modality under the CMU-MOSI dataset
Table 10 Comparison with the SOTA network models in the visual modality under the CMU-MOSI dataset
Table 11 Comparison with the SOTA network models in the audio modality under the CMU-MOSI dataset

The experimental results obtained under the POM dataset are compared with those of the MFN [43] method in the textual, visual and audio modalities in Tables 12, 13 and 14. It can be seen that in the three single modalities, for most speaker traits, the proposed methods (Bi-GRUinit textual, Bi-GRUinit visual and Bi-GRUinit audio) are superior to the MFN method in multi-label classification and regression, except for Thorough (Tho) regression (MAE) in the textual modality, Humorous (Hum) classification (Acc) in the visual modality, and Trusting (Tru) regression (Corr) in the visual modality.

Table 12 Comparison with the SOTA network models in the textual modality under the POM dataset
Table 13 Comparison with the SOTA network models in the visual modality under the POM dataset
Table 14 Comparison with the SOTA network models in the audio modality under the POM dataset

4.3.2 Experimental results for multimodal emotion recognition

The emotion recognition results based on the methods for the three single modalities (Bi-GRUinit textual, Bi-GRUinit visual and Bi-GRUinit audio) and for multimodal fusion (Bi-GRUinit multimodal) under the CMU-MOSI dataset and the POM dataset are compared in Tables 15 and 16, respectively. Table 15 shows that the sentiment classification and sentiment regression results of the multimodal fusion are significantly better than those of the three single modalities under the CMU-MOSI dataset. Table 16 shows that in multi-label classification and regression, for most speaker traits, the results of the multimodal fusion (including PR, FPR, RE, Acc, F1, MAE, and Corr) are better than those of the three single modalities under the POM dataset, and the averages (Avg) of all metrics for the multimodal fusion are better than those of the three single modalities.

Table 15 Comparison of the emotion recognition results for three single modalities and multimodal fusion under the CMU-MOSI dataset
Table 16 Comparison of the emotion recognition results for three single modalities and multimodal fusion under the POM dataset

The emotion recognition results of the proposed multimodal fusion (Bi-GRUinit multimodal) and the SOTA multimodal network models under the CMU-MOSI dataset and the POM dataset are compared in Tables 17 and 18, respectively. They show that the emotion recognition performance of the proposed video multimodal emotion recognition method, based on the Bi-GRU with new network initialization and the attention fusion network, is superior to that of the listed multimodal emotion recognition methods in both sentiment classification and sentiment regression.

Table 17 Comparison of multimodal emotion recognition methods under the CMU-MOSI dataset
Table 18 Comparison of multimodal emotion recognition methods under the POM dataset

5 Conclusions

A time-contextual learning method based on the Bi-GRU network is proposed in this paper. In video emotion recognition, the output at the current moment is related not only to the previous state but also to the state that follows it, and the Bi-GRU improves the accuracy of emotion recognition through time-contextual learning. To further improve the accuracy of video emotion recognition, a new network initialization method is proposed and applied to the network model. This initialization method optimizes the initialization parameters of the ReLU network model, improves the robustness of the Bi-GRU network during training and improves the accuracy of emotion recognition. A video multimodal emotion recognition method based on the attention fusion network is proposed to overcome the weight consistency of each modality in multimodal fusion. The attention mechanism processes the variation of the multimodal contextual state at each moment, and the attention distribution at each moment over the multiple modalities is calculated in real-time, so that the network model can learn multimodal contextual information in real-time, thereby improving the accuracy of video emotion recognition under multimodal fusion.

The main work that can be further carried out is summarized in the following four aspects:

  1. (1)

    Increase the high-dimensional features in the audio modality subnetwork. The audio modality subnetwork in this paper uses COVAREP acoustic features. Other effective acoustic features, such as those extracted by the open-source Speech and Music Interpretation by Large-space Extraction toolkit (openSMILE), may be taken into consideration. openSMILE combines functions such as music information retrieval and speech processing to automatically analyze audio signals in real-time and automatically extract emotional features from speech and music signals. Adding other effective acoustic features can further improve the accuracy of emotion recognition in the audio modality subnetwork and hence in the fused multimodal network.

  2. (2)

    Research on contextual learning emotion recognition method based on the stacked Bi-GRUs. The time-contextual learning method based on the Bi-GRU used in this paper overcomes the problem that the BRNN cannot deal with long-term dependency well and causes gradient vanishing or gradient exploding in the back propagation. Next, we can consider stacking Bi-GRUs and apply them to video emotion recognition. The stacked Bi-GRUs can be defined as a model consisting of multiple Bi-GRU layers, which makes the network model deeper. Thus, we will extract features directly from the network without any manual work. It can make better use of the input data and more complex and comprehensive features can be learned to further improve the accuracy of video emotion recognition in the single-modality subnetwork and then the fused multimodal network.

  3. (3)

    Research on video multimodal emotion recognition based on hierarchical attention network. We will consider applying the idea of hierarchical attention networks to video multimodal emotion recognition. Intra-modality attention network can extract important information in the single modality. Inter-modality attention network can capture significant information globally. Thus, the accuracy of video multimodal emotion recognition can be further improved.

  4. (4)

    Research on video multimodal emotion recognition based on other fusion methods. The video multimodal emotion recognition in this paper is based on the attention fusion network, which calculates the attention distribution over the three single-modality subnetworks at each moment in real-time. Next, we will consider other fusion methods or architectures to further improve the accuracy of video multimodal emotion recognition.