1 Introduction

The ways in which humans naturally communicate and express emotions are usually multimodal [23]: emotions can be conveyed verbally, vocally, or visually. When emotions are expressed mainly through tone of voice, the audio data carries the major cues for emotion recognition; when they are expressed mainly through facial expressions, most of the cues needed for mining emotions lie in the face. Identifying human emotions from multimodal information such as facial expressions, vocal intonation, and linguistic content is therefore an interesting and challenging problem.

Videos provide multimodal data in the acoustic, visual, and textual modalities. Facial expressions, vocal tones, and the spoken text in a video all provide important information for recognizing the true emotional state of a person. Analyzing videos therefore makes it possible to build better models for emotion recognition and sentiment analysis. Text, vision, and audio are generally regarded as the main sources of information in research on video multimodal emotion recognition. Recognizing and utilizing the three modalities simultaneously can effectively extract the semantic and emotional information conveyed during communication.

To exploit the three-modality data, emotion recognition models must be established for the textual, visual and audio modalities simultaneously. In single-modality emotion recognition for text [12, 13, 32,33,34], vision [1, 2, 6, 16, 42] and audio [17, 19, 25, 38, 40], several studies have achieved good recognition performance using deep learning. Recognizing and utilizing textual, visual and audio information, however, requires the seamless integration of the three modalities. The purpose of multimodal fusion is to combine information from multiple modalities, exploit the complementarity of heterogeneous data, provide more robust predictions, and improve the accuracy and reliability of recognition. Multimodal fusion is usually performed at the feature level: several high-dimensional features are combined into a fused feature, which is then input into a model for training. Morency et al. [23] first proposed a joint model of the textual, visual and audio modalities for multimodal sentiment analysis and conducted verification experiments. Poria et al. [30] and Zhao et al. [46] implemented fusion by concatenating the feature vectors of all three modalities into a single long feature vector. The shortcoming of these concatenation-based fusion methods is that every modality receives the same weight in the fusion; that is, the unequal importance of the modalities is not taken into account. To overcome this weight consistency, Poria et al. [29] used a convolutional neural network (CNN) for multimodal sentiment analysis and proposed the convolutional MKL-based (C-MKL) model. Wang et al. [39] proposed the selective-additive learning CNN (SAL-CNN) for multimodal sentiment analysis. Zadeh et al. [42] proposed the tensor fusion network (TFN). Because of the introduction of the tensor representation, however, the computation and memory costs increase exponentially, which severely limits the application of the model, especially when the dataset contains more than three modalities.

To make better use of the textual, visual and audio data for video emotion recognition, a video multimodal emotion recognition method based on the bidirectional gated recurrent unit (Bi-GRU) and attention fusion is proposed in this paper. The contributions of our work are as follows: (1) A time-contextual learning method based on the Bi-GRU is proposed. The Bi-GRU improves the accuracy of video emotion recognition through time-contextual learning. (2) A new network initialization method is proposed and applied to the network model. This initialization method optimizes the initialization parameters of the network model, improves the robustness of the Bi-GRU during training and thus improves the accuracy of emotion recognition. (3) A video multimodal emotion recognition method based on an attention fusion network is proposed. The attention mechanism handles the variation of the contextual state of the multiple modalities at each moment. The attention distribution over the modalities at each moment is calculated in real-time, so that the network model can learn the multimodal contextual information in real-time, thereby improving the accuracy of video emotion recognition under multimodal fusion.

This paper is organized as follows: Related work is presented in Section 2. The video multimodal emotion recognition method based on Bi-GRU and attention fusion is described in Section 3. Section 4 presents experimental results and analysis, and Section 5 presents conclusions and discusses future research directions.

2 Related work

2.1 Single-modality emotion recognition

2.1.1 Textual modality

Research on textual emotion recognition has long been an active and extremely successful field. Notable works include the automatic recognition of opinionated words and their emotion polarity [11, 36], methods using n-grams and more complex language models [37, 41], and methods using polarity transfer rules or detailed feature engineering to address emotion composition [22, 28]. Li et al. [18] proposed a hybrid approach to recognize word emotion across eight emotion categories with corresponding intensities based on a Chinese emotion corpus. They explored approaches to identifying word emotion from the general emotion attributes of a word. Experimental results showed that integrating morpheme characteristics and semantic relations can effectively improve classification accuracy. These methods have been applied in many areas, including mining opinions from Twitter and other online forums, analyzing political debates, answering questions, summarizing dialogues, and detecting citation emotion.

Research on textual emotion recognition based on deep learning has also been successful. Socher et al. [34] introduced recursive neural tensor networks and the Stanford sentiment treebank; the combination of the new model and data advanced single-sentence positive/negative classification and fine-grained sentiment prediction, and their research showed that sentiment analysis for text is far from solved. Iyyer et al. [12] introduced a deep averaging network (DAN) for textual emotion recognition. This simple and effective sentiment analysis model used only the distribution of words to represent information, rather than the compositional information of sentences, thus reducing computational complexity. The model performed better than syntactic models on datasets with high syntactic variance. Kalchbrenner et al. [13] described a convolutional architecture called the dynamic convolutional neural network (DCNN) for the semantic modelling of sentences. The network handled input sentences of varying length and induced a feature graph over the sentence that explicitly captured short- and long-range relations. It did not rely on a parse tree and was easily applicable to any language. Seyeditabari et al. [32] formulated emotion recognition in text as a binary classification problem and presented a new network based on a Bi-GRU model to capture more meaningful information from text. They reported results for the two word-embedding models with the best performance. Shrivastava et al. [33] proposed a sequence-based CNN with word embeddings to detect emotions. An attention mechanism was applied in the proposed model, allowing the CNN to focus on the words that had more effect on the classification or on the parts of the features that should receive more attention.

2.1.2 Visual modality

Emotion recognition based on visual information is a research focus in the fields of affective computing and computer vision. Facial expression is one of the most powerful means by which humans exchange emotions and intentions. Face analysis and video analysis methods based on deep learning have recently shown good performance on key tasks such as face recognition, emotion recognition and activity recognition. In earlier work, CNNs mainly relied on temporal averaging and pooling to handle time series in video emotion recognition. Recurrent neural networks (RNNs) show stronger performance on time-series analysis tasks and have attracted great interest in recent years.

Byeon et al. [2] used 3D convolutional neural networks (3D-CNN) to extract facial features from speakers and reduced the dimensionality of the extracted features to recognize continuous frames of facial expression images captured by a camera. The network used local receptive fields and spatial down-sampling to achieve a degree of invariance to displacement and deformation. Ebrahimi Kahou et al. [6] combined a CNN and long short-term memory (LSTM) into a CNN-LSTM recurrent model: the convolutional features of the speaker's face region were fed into the LSTM at each time step. Its processing of facial expressions was similar to that of the 3D-CNN, and the architecture was superior to CNN methods that used time-averaged aggregation. Zadeh et al. [42] extracted speakers' facial expression features with the FACET facial expression analysis framework and proposed an RNN model based on FACET. It used FACET features taken every 6 frames as input to an RNN with a memory dimension of 100 neurons, which served as a baseline model for their follow-up experiments. Kumawat et al. [16] proposed a novel 3D convolutional layer called the local binary volume (LBV) layer, which reduced the number of trainable parameters significantly compared to a conventional 3D convolutional layer. Their LBVCNN network achieved results comparable to state-of-the-art (SOTA) models, with or without facial landmarks, on image sequences from the CK+, Oulu-CASIA, and UNBC-McMaster shoulder pain datasets. Bairaju et al. [1] utilized a combination of CNNs and autoencoders to extract features for facial emotion detection and obtained considerable classification accuracy.

2.1.3 Audio modality

Automatically identifying spontaneous emotions from speech is a challenging task. On the one hand, acoustic features need to be powerful enough to capture the emotional content of various speaker styles; on the other hand, machine learning algorithms need to be insensitive to outliers while being able to model context. In recent years, research on audio emotion recognition based on deep learning has made great progress.

Lee et al. [17] presented a speech emotion recognition system using an RNN model trained by an efficient learning algorithm. The proposed system took into account the long-range contextual effect and the uncertainty of emotion labels. To extract a high-level representation of emotional states and their temporal dynamics, a powerful learning method with a bidirectional long short-term memory (BLSTM) architecture was adopted. Trigeorgis et al. [38] addressed the problem of context-aware, emotionally relevant feature extraction by combining CNNs with LSTM networks to automatically learn the best representation of the speech signal directly from the raw time-domain signal. Lim et al. [19] proposed a speech emotion recognition (SER) method based on CNNs and RNNs. Applied to an emotional speech database, the proposed methods were verified to achieve better classification accuracy than conventional classification methods. Orjesek et al. [25] stacked convolutional layers with a Bi-GRU and showed exceptional performance using only raw audio signals, without any need for pre-processing. Wu et al. [40] presented a novel architecture for SER based on capsule networks (CapsNets). The proposed system took into account the spatial relationships of speech features in spectrograms and provided an effective pooling method for obtaining utterance-level global features, demonstrating the effectiveness of CapsNets for SER.

LSTM networks solve the problem of modelling speech context, but how to capture the emotional features of speech still requires active study, even though more than a decade of research has produced a large number of acoustic feature descriptors.

2.2 Video multimodal emotion recognition

Multimodal research, an emerging field of artificial intelligence, has shown great progress on a variety of tasks. Identifying human emotions from facial expressions, vocal intonation and body gestures is an interesting and challenging problem. Many studies have examined only the emotional content of language, or have used only images to recognize facial expressions; relatively few have combined multiple modalities to recognize human emotions. Text, vision, and audio are generally regarded as the main sources of information in research on video multimodal emotion recognition, and the purpose of fusion is to improve the accuracy and reliability of recognition. The main advantage of analyzing videos rather than only text is the abundance of behavioral cues. Text analysis relies on words, phrases, and the dependencies between them, but this information alone is known to be insufficient for extracting the relevant emotional content. Videos provide multimodal data in both the acoustic and visual modalities; facial expressions, vocal tones and the spoken text can all provide important information for recognizing the true emotional state of a person. Analyzing videos therefore makes it possible to build better models for emotion recognition and sentiment analysis.

An important challenge for multimodal fusion is how to extend the fusion to multiple modalities while maintaining reasonable model complexity. Morency et al. [23] demonstrated that a joint model integrating visual, audio, and textual features can be effectively used to identify sentiment in Web videos. They used the joint model for sentiment analysis of product and movie reviews, identified a subset of audio-visual features relevant to sentiment analysis, and presented guidelines on how to integrate these features. However, their method directly concatenated the modality information in an early-fusion representation and did not study the relationships between modalities, and their experiments were conducted in a speaker-dependent manner without analyzing emotion intensity. Park et al. [26] studied the persuasiveness of communication in social activities. They demonstrated that computational descriptors derived from verbal and nonverbal behavior can be predictive of persuasiveness, and further showed that combining descriptors from multiple communication modalities (audio, text and visual) improved prediction performance compared to using any single modality alone. The C-MKL model proposed by Poria et al. [29] is a multimodal emotion classification model that used the combined feature vectors of the textual, visual, and audio modalities to train a classifier based on multiple kernel learning. However, their experiments focused on discourse rather than commentary, and their method depended on emotion polarity rather than emotion intensity. Nojavanasghari et al. [24] also studied persuasion. They used a deep multimodal fusion architecture able to leverage complementary information from individual modalities for predicting persuasiveness: single neural networks were trained for each view's input and combined with a joint neural network. This baseline was the SOTA on the POM dataset. Wang et al. [39] used a selective-additive learning (SAL) procedure that improved the generalizability of trained neural networks for multimodal sentiment analysis; their experiments showed that the SAL approach improved prediction accuracy significantly in all three modalities as well as in their fusion. Zadeh et al. [44] presented a novel neural architecture for understanding human communication, the multi-attention recurrent network (MARN), for sentiment analysis. The main strength of this model came from discovering interactions between modalities through time using a neural component called the multi-attention block (MAB) and storing them in the hybrid memory of a recurrent component called the long-short term hybrid memory (LSTHM). Zadeh et al. [43] introduced the memory fusion network (MFN), a novel approach for multi-view sequential learning that accounted for both view-specific and cross-view interactions, continuously modelling them through time with a special attention mechanism and summarizing them through time with a multi-view gated memory. Liu et al. [20] proposed the low-rank multimodal fusion method, which performed multimodal fusion using low-rank tensors to improve efficiency; they compared against other methods [24, 26, 31] on the CMU-MOSI and POM datasets to demonstrate the effectiveness of the proposed approach. Ma et al. [21] proposed an emotion computing algorithm based on cross-modal fusion and edge network data incentive. Their deep cross-modal fusion captured the semantic deviation between multiple modalities and designed fusion methods through non-linear cross-layer mapping. Simulation experiments and theoretical analysis showed that the proposed algorithm was superior to the edge network data incentive algorithm and the cross-modal data fusion algorithm in recognition accuracy, complex emotion recognition efficiency, computational efficiency and delay.

The summary of related work in emotion recognition using deep learning is shown in Table 1.

Table 1 Summary of related work in emotion recognition using deep learning

3 The proposed method

The video multimodal emotion recognition method based on the Bi-GRU and attention fusion is shown in Fig. 1. The main steps of the method are as follows. First, the high-dimensional features of the textual, visual and audio modalities are extracted from the input videos, and the feature vectors are aligned at the word level and normalized. They are then input into the Bi-GRU network for training; a new network initialization method is used to initialize the weights of the Bi-GRU network and the fully connected network at the start of training in each single-modality subnetwork. The state information output by the Bi-GRU network is processed by a maximum pooling layer and an average pooling layer, and the two pooled feature vectors are spliced to form the input features. The input features of the three single-modality subnetworks are then used to calculate the correlation between the multimodal state information, from which the attention distribution of each modality at each moment, i.e. the weight of the state information at each moment, is obtained. The input features of the three single-modality subnetworks are weighted and averaged with the corresponding weights to obtain the fused feature vector, which is the input of the fully connected network. Finally, the video to be recognized is input into the trained network to obtain its emotion intensity output.

Fig. 1
figure 1

Video multimodal emotion recognition based on Bi-GRU and attention fusion

3.1 Video feature extraction for three modalities

3.1.1 Textual features

Global Vectors (GloVe), a word representation tool based on global word co-occurrence statistics, expresses a word as a vector of real numbers that captures semantic relations between words, such as similarity and analogy. The textual features of the video are defined as \( l=\left\{{l}_1,{l}_2,{l}_3,\dots, {l}_{T_l};{l}_t\in {\mathrm{\mathbb{R}}}^{300}\right\} \), where Tl is the number of words in the video and lt is the 300-dimensional GloVe word vector of the t-th word [27].
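As a concrete illustration, the word-level textual features can be obtained by a simple lookup into a pre-trained GloVe table. The sketch below assumes the standard whitespace-separated GloVe text-file format; the file name, tokenization and zero-vector handling of out-of-vocabulary words are our assumptions, not details given in the paper.

```python
import numpy as np

def load_glove(path, dim=300):
    """Read a GloVe text file into a {word: vector} dictionary."""
    table = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip().split(" ")
            if len(parts) != dim + 1:
                continue  # skip malformed lines
            table[parts[0]] = np.asarray(parts[1:], dtype=np.float32)
    return table

def textual_features(transcript, glove, dim=300):
    """Map the T_l words of a segment to a (T_l, 300) feature matrix.
    Out-of-vocabulary words are mapped to a zero vector (an assumption)."""
    words = transcript.lower().split()
    return np.stack([glove.get(w, np.zeros(dim, dtype=np.float32)) for w in words])

# Example (file name and transcript are hypothetical):
# glove = load_glove("glove.840B.300d.txt")
# l = textual_features("this movie was really good", glove)   # shape (5, 300)
```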

3.1.2 Visual features

We use the FACET facial expression analysis framework to detect the speaker's face in each frame and to extract seven basic emotions (anger, contempt, disgust, fear, joy, sadness, and surprise) and two advanced emotions (frustration and confusion) [7]. FACET can also extract a set of 20 facial action units [8] that indicate detailed muscle movements of the face.

We define the visual features as \( v=\left\{{v}_1,{v}_2,{v}_3,\dots, {v}_{T_v}\right\} \). The visual feature of the jth frame is \( {v}_j=\left[{v}_j^1,{v}_j^2,{v}_j^3,\dots, {v}_j^p\right] \), a set of p visual features, where Tv is the total number of frames in the video. We use v as the input of the visual subnetwork. Since the information extracted by FACET from videos is very rich, feeding it into the Bi-GRU produces meaningful time-contextual high-dimensional features for the visual modality.

3.1.3 Audio features

For the audio portion of each video, the COVAREP acoustic analysis framework is used to extract a set of acoustic features, including 12 mel-frequency cepstral coefficients (MFCCs), pitch tracking and voiced/unvoiced segmentation features, glottal source parameters, peak slope parameters [3], maxima dispersion quotients (MDQ) [14], and Liljencrants-Fant (LF) estimates of the glottal model parameters [9]. The voiced/unvoiced segmentation feature is based on the summation of residual harmonics (SRH), which is robust to additive noise [4], and the glottal source parameters are estimated by glottal inverse filtering based on GCI-synchronous IAIF [5]. These features capture different characteristics of the human voice and have been shown to be related to emotion [10].

Each segment is sampled at 100 Hz, giving Ta audio frames. We extract a set of q acoustic features \( {a}_j=\left[{a}_j^1,{a}_j^2,{a}_j^3,\dots, {a}_j^q\right] \) from the jth frame. The audio features of each segment are \( a=\left\{{a}_1,{a}_2,{a}_3,\dots, {a}_{T_a}\right\} \), and we take a as the input of the audio subnetwork. Since COVAREP extracts rich features from the audio, the Bi-GRU can better extract continuous time-contextual high-dimensional features for the audio modality.

3.1.4 Alignment and normalization

The dimension of the GloVe features extracted by the textual modality subnetwork for each segment is (Tl, 300), the dimension of the FACET features extracted by the visual modality subnetwork is (Tv, p), and the dimension of the COVAREP features extracted by the audio modality subnetwork is (Ta, q). The multimodal high-dimensional features must be aligned [42], which is usually done at the word level. In this paper, the high-dimensional features of the visual and audio modalities are each aligned with the GloVe features of the textual modality according to the Tl words in each segment. Specifically, the start time and end time of the i-th spoken word are recorded, and the high-dimensional features of all visual and audio frames within this interval are collected. The features of each modality are averaged over the number of samples of that modality in the interval, and the average serves as the high-dimensional feature of the corresponding modality for that word. After this step, the high-dimensional features of the textual, visual and audio modalities are aligned within each segment, and the number of high-dimensional features of all three modalities equals that of the pre-aligned textual modality, namely Tl.

Since the extracted high-dimensional features differ in amplitude, normalization is required. Normalization finds the maximum values of the high-dimensional features of each of the three modalities and divides all high-dimensional feature values by the maximum value of the corresponding modality, mapping the data into the range from 0 to 1. In the training of neural networks, normalization speeds up training and improves the convergence of the network.
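The word-level alignment and max normalization described above can be sketched as follows. The code assumes frame-level visual or audio features with per-frame timestamps and per-word (start, end) times; the handling of words with no overlapping frames (a zero vector) and the per-feature maximum are our assumptions about one plausible reading of the text.

```python
import numpy as np

def word_align(frame_feats, frame_times, word_spans):
    """Average the frame features falling inside each word's (start, end) span.

    frame_feats : (T_frames, d) array of FACET or COVAREP features
    frame_times : (T_frames,) array of frame timestamps in seconds
    word_spans  : list of (start, end) times, one per word (length T_l)
    returns     : (T_l, d) word-aligned features
    """
    aligned = []
    for start, end in word_spans:
        mask = (frame_times >= start) & (frame_times <= end)
        if mask.any():
            aligned.append(frame_feats[mask].mean(axis=0))
        else:
            aligned.append(np.zeros(frame_feats.shape[1]))  # assumption for empty spans
    return np.stack(aligned)

def max_normalize(feats, eps=1e-8):
    """Divide each feature column by its maximum absolute value within the modality."""
    return feats / (np.abs(feats).max(axis=0, keepdims=True) + eps)

# Usage (variable names are hypothetical):
# v_aligned = max_normalize(word_align(facet_feats, facet_times, word_spans))
# a_aligned = max_normalize(word_align(covarep_feats, covarep_times, word_spans))
```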

3.2 Bi-GRU with new network initialization

3.2.1 Bi-GRU

The Bi-GRU network combines the architectures of the GRU and the bidirectional RNN (BRNN). Replacing the nodes of an RNN with GRU nodes makes it easier for the network to learn time-contextual information and overcomes the problem that the RNN cannot handle long-term dependencies well, which causes vanishing or exploding gradients during back propagation. The BRNN architecture can simultaneously access information from past and future time steps. By using GRU nodes within the BRNN architecture, the resulting network can fully learn and utilize the contextual information of past and future moments.

The high-dimensional features of the three modalities, after word-level alignment and normalization, are used as the inputs of the respective Bi-GRU networks. Take the textual modality subnetwork as an example. The textual features \( l=\left\{{l}_1,{l}_2,{l}_3,\dots, {l}_{T_l};{l}_t\in {\mathrm{\mathbb{R}}}^{300}\right\} \) are input into the Bi-GRU network, where lt is a 300-dimensional GloVe word vector. We define \( \overrightarrow{G}\left(\cdot \right) \) as the forward calculation of the Bi-GRU network and \( \overleftarrow{G}\left(\cdot \right) \) as the backward calculation, as follows:

$$ {\displaystyle \begin{array}{l}{\overrightarrow{h}}_t=\overrightarrow{G}\left({l}_t,{\overrightarrow{h}}_{\left(t-1\right)}\right)\\ {}{\overleftarrow{h}}_t=\overleftarrow{G}\left({l}_t,{\overleftarrow{h}}_{\left(t+1\right)}\right)\end{array}} $$
(1)

where \( {\overrightarrow{h}}_t \) and \( {\overleftarrow{h}}_t \) are the forward state output and the backward state output, respectively, at moment t of the Bi-GRU network, \( {\overrightarrow{h}}_{\left(t-1\right)} \) is the forward state output at moment t − 1, and \( {\overleftarrow{h}}_{\left(t+1\right)} \) is the backward state output at moment t + 1. The Bi-GRU network model architecture is shown in Fig. 2.

Fig. 2
figure 2

Architecture of the Bi-GRU network model inputting with textual high-dimensional features

After the contextual information of the high-dimensional features is fully learned by the Bi-GRU network, the state information output of the network \( H=\left[\left[{\overleftarrow{h}}_1,{\overrightarrow{h}}_1\right],\left[{\overleftarrow{h}}_2,{\overrightarrow{h}}_2\right],\dots, \left[{\overleftarrow{h}}_{T_l},{\overrightarrow{h}}_{T_l}\right]\right] \) is obtained. The maximum pooling layer and the average pooling layer are used to extract features from the state information output of the Bi-GRU network. The pooling layers use overlapping aggregation technology. Pooling can reduce the feature vector dimension of the Bi-GRU network output. We extract high-dimensional representation vectors max(H) and avg(H) respectively, as follows:

$$ {\displaystyle \begin{array}{l}\mathit{\max}(H)=\left[\underset{1\le i\le {T}_l}{\mathit{\max}}\left({\overrightarrow{h}}_i\right),\underset{1\le i\le {T}_l}{\mathit{\max}}\left({\overleftarrow{h}}_i\right)\right]\\ {} avg(H)=\left[\underset{1\le i\le {T}_l}{avg}\left({\overrightarrow{h}}_i\right),\underset{1\le i\le {T}_l}{avg}\left({\overleftarrow{h}}_i\right)\right]\end{array}} $$
(2)

The feature vector h+ can be obtained by splicing the two pooled feature vectors, which is shown in the following formula:

$$ {h}^{+}=\left[\mathit{\max}(H), avg(H)\right] $$
(3)

h+ is considered as an input feature of the fully connected layer in the single-modality subnetwork. The fully connected layer maps the learned high-dimensional features to the sample label space as follows:

$$ y={W}_y{h}^{+}+{b}_y $$
(4)

where Wy is the weight associated with h+, by is the bias associated with h+, and y is the emotion intensity output of a single-modality subnetwork.

The loss function used to train the network is the L1 loss, which measures the mean absolute error between each element of the input X and the target Y. The L1 loss is calculated as follows:

$$ L\left(X,Y\right)=\left\{{l}_1,\dots, {l}_N\right\},{l}_n=\mid {x}_n-{y}_n\mid $$
(5)

where N is the number of elements in the input X, xn is the nth element of the input X, yn is the nth element of the target Y, and ∣a − b∣ refers to the absolute value of the difference between a and b.
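A minimal PyTorch sketch of one single-modality subnetwork, corresponding to Eqs. (1)–(5): a Bi-GRU over the word-aligned features, max and average pooling over time, splicing of the two pooled vectors, and a fully connected layer trained with the L1 loss. The hidden size and other hyper-parameters below are placeholders, not values reported in the paper.

```python
import torch
import torch.nn as nn

class SingleModalitySubnet(nn.Module):
    def __init__(self, input_dim, hidden_size=64):
        super().__init__()
        # Bidirectional GRU giving forward and backward state outputs (Eq. 1).
        self.bigru = nn.GRU(input_dim, hidden_size,
                            batch_first=True, bidirectional=True)
        # Fully connected layer mapping the spliced pooled features to
        # a single emotion intensity value (Eq. 4).
        self.fc = nn.Linear(4 * hidden_size, 1)

    def forward(self, x):
        # x: (batch, T_l, input_dim) word-aligned features.
        H, _ = self.bigru(x)                # (batch, T_l, 2 * hidden_size)
        max_h = H.max(dim=1).values         # max pooling over time (Eq. 2)
        avg_h = H.mean(dim=1)               # average pooling over time (Eq. 2)
        h_plus = torch.cat([max_h, avg_h], dim=-1)   # spliced feature h+ (Eq. 3)
        return self.fc(h_plus).squeeze(-1), H        # intensity y and state sequence H

# Training criterion: mean absolute error (Eq. 5).
# net = SingleModalitySubnet(input_dim=300)        # e.g. textual GloVe features
# y_hat, _ = net(torch.randn(8, 20, 300))          # a batch of 8 segments, T_l = 20
# loss = nn.L1Loss()(y_hat, torch.zeros(8))
```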

3.2.2 Network initialization

The Bi-GRU network is used in the core layers of our model, and the ReLU activation function is used in the fully connected layer. Orthogonal initialization is more suitable for the Bi-GRU network, while Kaiming initialization is more suitable for the neuron parameters of layers with the ReLU activation function. We therefore adjust the parameter initialization of the Bi-GRU network and the fully connected layer simultaneously: Kaiming initialization with weights drawn from a normal distribution is used, and orthogonal initialization is applied to part of the Bi-GRU weights so that the eigenvalues of the orthogonal weight matrix have an absolute value of 1.

In our network model, the neuron parameters of the fully connected layer include weight W and bias b. Their default initialization methods are the same, which are shown as follows:

$$ {\displaystyle \begin{array}{l}W\sim U\left(-\sqrt{k},\sqrt{k}\right)\\ {}b\sim U\left(-\sqrt{k},\sqrt{k}\right)\end{array}} $$
(6)

where U(−a, a) is the uniform distribution over the interval (−a, a), \( k=\frac{1}{n_{in}} \), and nin is the number of input neurons.

We initialize the weight W according to the Kaiming initialization method and make it conform to the normal distribution, and set the bias b to a constant 0, which are shown as follows:

$$ {\displaystyle \begin{array}{l}W\sim N\left(0,\sqrt{\frac{2}{n_{in}}}\right)\\ {}b=0\end{array}} $$
(7)

where N(μ, σ) denotes the normal distribution with mean μ and standard deviation σ.

There are four kinds of neuron parameters in the Bi-GRU network, which are the weight of the input layer to the hidden layer Wih, the bias of the input layer to the hidden layer bih, the weight of the hidden layer to the hidden layer Whh, the bias of the hidden layer to the hidden layer bhh. By default, the initialization methods for the four different neuron parameters are the same, which are shown as follows:

$$ {\displaystyle \begin{array}{l}{W}_{ih}\sim U\left(-\sqrt{k},\sqrt{k}\right)\\ {}{W}_{hh}\sim U\left(-\sqrt{k},\sqrt{k}\right)\\ {}{b}_{ih}\sim U\left(-\sqrt{k},\sqrt{k}\right)\\ {}{b}_{hh}\sim U\left(-\sqrt{k},\sqrt{k}\right)\end{array}} $$
(8)

where \( k=\frac{1}{hiddensize} \) and hiddensize is the number of features in the hidden state of the Bi-GRU network.

We initialize the weight Wih in the Bi-GRU network according to the Kaiming initialization method and make it conform to the normal distribution. We use orthogonal initialization to initialize the weight Whh, and set the bias bih and bhh to a constant 0. The initialization methods are as follows:

$$ {\displaystyle \begin{array}{l}{W}_{ih}\sim N\left(0,\sqrt{\frac{2}{hiddensize}}\right)\\ {}{W}_{hh}\sim Q\\ {}{b}_{ih}=0\\ {}{b}_{hh}=0\end{array}} $$
(9)

where Q is an orthogonal matrix whose eigenvalues all have an absolute value of 1. The new network initialization method is compared with the default method in Fig. 3.

Fig. 3
figure 3

The new network initialization method compared with the default
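The initialization scheme of Eqs. (7) and (9) can be applied in PyTorch roughly as in the sketch below: Kaiming-normal initialization for the input-to-hidden GRU weights and the fully connected weights, orthogonal initialization for the hidden-to-hidden GRU weights, and zero biases. This is a sketch of the scheme as described, not the authors' implementation.

```python
import torch.nn as nn

def init_new(module):
    """Apply the initialization of Eqs. (7) and (9) to a Bi-GRU + Linear model."""
    for m in module.modules():
        if isinstance(m, nn.GRU):
            for name, param in m.named_parameters():
                if name.startswith("weight_ih"):    # input-to-hidden weights W_ih
                    nn.init.kaiming_normal_(param, nonlinearity="relu")
                elif name.startswith("weight_hh"):  # hidden-to-hidden weights W_hh
                    nn.init.orthogonal_(param)      # eigenvalues with absolute value 1
                elif name.startswith("bias"):       # b_ih and b_hh set to 0
                    nn.init.zeros_(param)
        elif isinstance(m, nn.Linear):
            # W ~ N(0, sqrt(2 / n_in)) as in Eq. (7), bias b = 0.
            nn.init.kaiming_normal_(m.weight, nonlinearity="relu")
            if m.bias is not None:
                nn.init.zeros_(m.bias)

# Usage with the subnetwork sketch above (names are illustrative):
# net = SingleModalitySubnet(input_dim=300)
# init_new(net)
```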

3.3 Video multimodal emotion recognition based on attention fusion

In the video multimodal emotion recognition method based on attention fusion, the attention distribution over the three modalities is calculated at each moment and used as the weight of the state information output by the Bi-GRU network of the corresponding modality subnetwork. The state information output by the Bi-GRU networks is weighted and averaged with the corresponding weights to obtain the fused feature vector, which is used as the input feature for the following fully connected layer, and the multimodal emotion intensity is finally obtained.

3.3.1 Correlation calculation of the state information between multiple modalities

We define the state information output by the Bi-GRU network, after it has sufficiently learned the contextual information of the high-dimensional features, as \( H=\left[\left[{\overleftarrow{h}}_1,{\overrightarrow{h}}_1\right],\left[{\overleftarrow{h}}_2,{\overrightarrow{h}}_2\right],\dots, \left[{\overleftarrow{h}}_{T_l},{\overrightarrow{h}}_{T_l}\right]\right] \), where \( {\overrightarrow{h}}_t \) and \( {\overleftarrow{h}}_t \) are the forward state output and the backward state output of the Bi-GRU network at moment t. Thus, the state information of the textual modality subnetwork is \( {H}_t=\left[\left[{\overleftarrow{h}}_{t_1},{\overrightarrow{h}}_{t_1}\right],\left[{\overleftarrow{h}}_{t_2},{\overrightarrow{h}}_{t_2}\right],\dots, \left[{\overleftarrow{h}}_{t_{T_l}},{\overrightarrow{h}}_{t_{T_l}}\right]\right] \), the state information of the visual modality subnetwork is \( {H}_v=\left[\left[{\overleftarrow{h}}_{v_1},{\overrightarrow{h}}_{v_1}\right],\left[{\overleftarrow{h}}_{v_2},{\overrightarrow{h}}_{v_2}\right],\dots, \left[{\overleftarrow{h}}_{v_{T_l}},{\overrightarrow{h}}_{v_{T_l}}\right]\right] \), and the state information of the audio modality subnetwork is \( {H}_a=\left[\left[{\overleftarrow{h}}_{a_1},{\overrightarrow{h}}_{a_1}\right],\left[{\overleftarrow{h}}_{a_2},{\overrightarrow{h}}_{a_2}\right],\dots, \left[{\overleftarrow{h}}_{a_{T_l}},{\overrightarrow{h}}_{a_{T_l}}\right]\right] \). Here \( {\overrightarrow{h}}_{t_i} \) and \( {\overleftarrow{h}}_{t_i} \) are the forward and backward state outputs of the Bi-GRU network in the textual modality subnetwork at moment i, \( {\overrightarrow{h}}_{v_i} \) and \( {\overleftarrow{h}}_{v_i} \) are the forward and backward state outputs in the visual modality subnetwork at moment i, and \( {\overrightarrow{h}}_{a_i} \) and \( {\overleftarrow{h}}_{a_i} \) are the forward and backward state outputs in the audio modality subnetwork at moment i. In the previous section, the high-dimensional features of the visual and audio modalities were aligned at the word level with those of the textual modality, so the number of time steps of the state information is Tl for all three single-modality subnetworks.

The state neurons of the Bi-GRU network are formed by a forward-calculation part and a backward-calculation part at the hidden layer. The number of time steps of the state information in each single-modality subnetwork is Tl; that is, the state neurons go through Tl time steps of forward calculation and Tl time steps of backward calculation. The essence of the attention fusion network is to extract a useful fused feature vector H∗ from the state information Ht, Hv and Ha output by the Bi-GRU networks of the three single-modality subnetworks.

We use the attention mechanism to weigh the importance of each piece of state information and calculate the attention distribution over the state information as the weight αi of the corresponding state information. Since the state information of multiple modalities is taken into consideration, the weight αi attends to the state information of the three modalities simultaneously; that is, the cross-modal correlation of the state information si depends on the state information of the three single-modality subnetworks at each moment. The correlation si is calculated as follows:

$$ {s}_i=V\tanh \left({W}_t\cdotp {h}_{t_i}+{W}_v\cdotp {h}_{v_i}+{W}_a\cdotp {h}_{a_i}+{b}_1\right)+{b}_2 $$
(10)

where \( {h}_{t_i}=\left[{\overrightarrow{h}}_{t_i},{\overleftarrow{h}}_{t_i}\right] \) is the state information output by the Bi-GRU network in the textual modality subnetwork at moment i, comprising the forward state output \( {\overrightarrow{h}}_{t_i} \) and the backward state output \( {\overleftarrow{h}}_{t_i} \), and Wt is the weight associated with \( {h}_{t_i} \). Similarly, \( {h}_{v_i}=\left[{\overrightarrow{h}}_{v_i},{\overleftarrow{h}}_{v_i}\right] \) is the state information output by the Bi-GRU network in the visual modality subnetwork at moment i, and Wv is the weight associated with \( {h}_{v_i} \); \( {h}_{a_i}=\left[{\overrightarrow{h}}_{a_i},{\overleftarrow{h}}_{a_i}\right] \) is the state information output by the Bi-GRU network in the audio modality subnetwork at moment i, and Wa is the weight associated with \( {h}_{a_i} \). b1 is the bias associated with \( {h}_{t_i} \), \( {h}_{v_i} \) and \( {h}_{a_i} \), tanh is the activation function, V is the weight of the multimodal fusion, and b2 is the bias of the multimodal fusion.

3.3.2 Generation of the fused feature vectors

According to the current correlation of multimodal state information si, we can calculate the attention distribution at each moment in multiple modalities, that is, the weight αi corresponding to the state information. The calculation of weight αi is as follows:

$$ {\alpha}_i=\mathrm{softmax}\left({s}_i\right)=\frac{\mathit{\exp}\left({s}_i\right)}{\sum_{j=1}^{T_l}\mathit{\exp}\left({s}_j\right)} $$
(11)

where softmax is a normalized exponential function.

The state information output by the Bi-GRU networks is weighted by the corresponding αi and summed to obtain the fused feature vector H∗, which serves as the input feature of the fully connected layer. The fused feature vector H∗ is calculated as follows:

$$ {H}^{\ast }=\left[\sum \limits_{i=1}^{T_l}{\alpha}_i{h}_{t_i};\sum \limits_{i=1}^{T_l}{\alpha}_i{h}_{v_i};\sum \limits_{i=1}^{T_l}{\alpha}_i{h}_{a_i}\right] $$
(12)

The architecture of the attention fusion network is shown in Fig. 4.

Fig. 4
figure 4

The attention fusion network
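A minimal PyTorch sketch of the attention fusion network of Eqs. (10)–(12): a scalar correlation score si is computed from the three aligned state vectors at each moment, turned into weights αi by a softmax over the Tl moments, and used to build the fused vector H∗ from the per-modality weighted sums. Dimensions and the final regression layer are placeholders; placing the shared bias b1 on one of the linear projections is our implementation choice.

```python
import torch
import torch.nn as nn

class AttentionFusion(nn.Module):
    def __init__(self, state_dim):
        # state_dim = 2 * hidden_size (forward + backward Bi-GRU state).
        super().__init__()
        self.W_t = nn.Linear(state_dim, state_dim, bias=False)
        self.W_v = nn.Linear(state_dim, state_dim, bias=False)
        self.W_a = nn.Linear(state_dim, state_dim, bias=True)  # carries b_1
        self.v = nn.Linear(state_dim, 1, bias=True)            # V and b_2 (Eq. 10)
        self.fc = nn.Linear(3 * state_dim, 1)                  # final emotion intensity

    def forward(self, H_t, H_v, H_a):
        # H_t, H_v, H_a: (batch, T_l, state_dim) word-aligned Bi-GRU state sequences.
        s = self.v(torch.tanh(self.W_t(H_t) + self.W_v(H_v) + self.W_a(H_a)))  # (B, T_l, 1)
        alpha = torch.softmax(s, dim=1)                                        # Eq. (11)
        # Weighted sums per modality, then concatenation: fused vector H* (Eq. 12).
        H_star = torch.cat([(alpha * H_t).sum(dim=1),
                            (alpha * H_v).sum(dim=1),
                            (alpha * H_a).sum(dim=1)], dim=-1)
        return self.fc(H_star).squeeze(-1)

# Usage with three Bi-GRU subnetworks (hidden_size = 64, so state_dim = 128):
# fusion = AttentionFusion(state_dim=128)
# y = fusion(H_text, H_visual, H_audio)   # each H_*: (batch, T_l, 128)
```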

4 Experimental results and analysis

4.1 Datasets

To verify the validity of the proposed method, we use the Carnegie Mellon University multimodal opinion sentiment intensity (CMU-MOSI) dataset [45] and persuasion opinion multimodal (POM) dataset [26] for video multimodal emotion recognition experiments.

CMU-MOSI is an annotated dataset of video opinions from YouTube providing data in the three modalities of text, vision and audio. The sentiment annotation of CMU-MOSI closely follows the annotation scheme of the Stanford sentiment treebank [35], where sentiment is annotated on a seven-step Likert scale from highly negative to highly positive. The emotion intensity annotation was performed by online workers on the Amazon Mechanical Turk website, with intensities ranging from −3 to +3. The CMU-MOSI dataset contains 93 different speakers and 2199 opinion segments, with 26,295 words in the opinion videos. There is an average of 23.2 opinion segments per video, and the average length of each segment is 4.2 s. Example snapshots of videos from the CMU-MOSI dataset are shown in Fig. 5.

Fig. 5
figure 5

Example snapshots of videos from CMU-MOSI dataset. (a) Highly negative, (b) Negative, (c) Neutral, (d) Positive, (e) Highly positive

POM is a dataset for the analysis of persuasion on online social media. It also has annotations for personality and sentiment, which makes it compelling for a large number of tasks. Each video is annotated on a seven-step Likert scale, with 1 being the least descriptive of the trait and 7 the most descriptive. The speaker traits are: confident (con), passionate (pas), voice pleasant (voi), dominant (dom), credible (cre), vivid (viv), expertise (exp), entertaining (ent), reserved (res), trusting (tru), relaxed (rel), outgoing (out), thorough (tho), nervous (ner), persuasive (per) and humorous (hum). The short forms indicated in parentheses are used in the rest of this paper. Example snapshots of videos from the POM dataset are shown in Fig. 6.

Fig. 6
figure 6

Example snapshots of videos from POM dataset. (a) Confidence, (b) Passionate, (c) Voice pleasant, (d) Dominant, (e) Credible, (f) Vivid, (g) Expertise, (h) Entertaining, (i) Reserved, (j) Trusting, (k) Relaxed, (l) Outgoing, (m) Thorough, (n) Nervous, (o) Persuasive, (p) Humorous

In the experiments, we implemented binary sentiment classification, 5-class sentiment classification and sentiment regression on the CMU-MOSI dataset. The regression range is [−3, 3]. We also implemented multi-label classification of different speaker traits and speaker traits regression on the POM dataset. For classification, we use precision (PR), false positive rate (FPR), recall (RE), accuracy (Acc) and F1 score for evaluation; and for regression, mean absolute error (MAE) and Pearson product-moment correlation coefficients (Corr) between model predictions and real values are used for evaluation. Higher values denote better performance for all metrics except for FPR and MAE.
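For reference, the evaluation metrics above can be computed with standard scientific-Python tools as in the hedged sketch below (binary classification shown; a multi-class FPR would need a per-class treatment, which is omitted here).

```python
import numpy as np
from scipy.stats import pearsonr
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             mean_absolute_error, precision_score, recall_score)

def classification_metrics(y_true, y_pred):
    """PR, FPR, RE, Acc and F1 for binary labels (0/1)."""
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
    return {
        "PR": precision_score(y_true, y_pred),
        "FPR": fp / (fp + tn) if (fp + tn) > 0 else 0.0,
        "RE": recall_score(y_true, y_pred),
        "Acc": accuracy_score(y_true, y_pred),
        "F1": f1_score(y_true, y_pred),
    }

def regression_metrics(y_true, y_pred):
    """MAE and Pearson correlation (Corr) between predictions and labels."""
    return {
        "MAE": mean_absolute_error(y_true, y_pred),
        "Corr": pearsonr(np.asarray(y_true), np.asarray(y_pred))[0],
    }
```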

4.2 Experimental setup

In the single-modality experiments, different network models use the same hyper-parameter settings for ease of comparison. The models are trained with the Adam optimizer [15] for at most 50 epochs. Early stopping is used to monitor the loss on the validation set: when the validation loss has not decreased for 10 consecutive epochs, training stops. When comparing with the SOTA network models, the best hyper-parameters are chosen by grid search based on model performance on the validation set. The training, testing and validation folds are exactly the same for all network models. The sample numbers of the datasets are shown in Table 2.

Table 2 The sample numbers of the datasets
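The training procedure described above (Adam, at most 50 epochs, early stopping with a patience of 10 on the validation loss) corresponds roughly to the following sketch. The data loaders, learning rate and the assumption that the model maps a feature batch directly to emotion intensities are ours, not details from the paper.

```python
import copy
import torch
import torch.nn as nn

def train(model, train_loader, val_loader, max_epochs=50, patience=10, lr=1e-3):
    """Adam training with early stopping on the validation L1 loss."""
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    criterion = nn.L1Loss()
    best_loss, best_state, bad_epochs = float("inf"), None, 0

    for epoch in range(max_epochs):
        model.train()
        for x, y in train_loader:
            optimizer.zero_grad()
            loss = criterion(model(x), y)   # model(x): predicted emotion intensities
            loss.backward()
            optimizer.step()

        model.eval()
        with torch.no_grad():
            val_loss = sum(criterion(model(x), y).item() for x, y in val_loader)

        if val_loss < best_loss:            # validation loss improved
            best_loss = val_loss
            best_state = copy.deepcopy(model.state_dict())
            bad_epochs = 0
        else:                               # no improvement this epoch
            bad_epochs += 1
            if bad_epochs >= patience:      # stop after 10 stagnant epochs
                break

    model.load_state_dict(best_state)       # restore the best validation model
    return model
```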

4.3 Results and analysis

4.3.1 Experimental results for single modality

The emotion recognition results of various network models in the textual, visual, and audio modalities under the CMU-MOSI dataset are compared in Tables 3, 4 and 5 respectively. It can be seen from Tables 3 and 4 that in the binary sentiment classification, 5-class sentiment classification and sentiment regression, the emotion recognition results based on the Bi-GRU with new network initialization in the textual modality and visual modality are superior to the results based on the methods of LSTM, GRU, BLSTM, Bi-GRU, and BLSTM with new network initialization. However, it can be seen from Table 5 that in the binary sentiment classification, 5-class sentiment classification and sentiment regression, the emotion recognition results based on the Bi-GRU in the audio modality (except for FPR in 5-class classification) are superior to the results based on the methods of LSTM, GRU, BLSTM, BLSTM with new network initialization, and Bi-GRU with new network initialization.

Table 3 Comparison of emotion recognition results of various network models in the textual modality under the CMU-MOSI dataset
Table 4 Comparison of emotion recognition results of various network models in the visual modality under the CMU-MOSI dataset
Table 5 Comparison of emotion recognition results of various network models in the audio modality under the CMU-MOSI dataset

The emotion recognition results of the various network models in the textual, visual, and audio modalities under the POM dataset are compared in Tables 6, 7 and 8, respectively. For the classification results, we use the average over all labels. It can be seen that in the multi-label classification of speaker traits, the results based on the Bi-GRU with new network initialization in the three single modalities are superior to those based on the LSTM, GRU, BLSTM, Bi-GRU and BLSTM with new network initialization. In the regression of speaker traits, the results based on the Bi-GRU with new network initialization in the audio modality are superior to those based on the LSTM, GRU, BLSTM, Bi-GRU and BLSTM with new network initialization. However, in the textual and visual modalities, the results based on the Bi-GRU with new network initialization are slightly worse than those based on the BLSTM with new network initialization on a certain metric.

Table 6 Comparison of emotion recognition results of various network models in the textual modality under the POM dataset
Table 7 Comparison of emotion recognition results of various network models in the visual modality under the POM dataset
Table 8 Comparison of emotion recognition results of various network models in the audio modality under the POM dataset

The experimental results obtained under the CMU-MOSI dataset are compared with those of the SOTA network models in the textual, visual and audio modalities in Tables 9, 10 and 11. Table 9 shows that in the textual modality, the proposed method (Bi-GRUinit textual) is superior to the SOTA network models in binary sentiment classification, 5-class sentiment classification and sentiment regression. Tables 10 and 11 show that in the visual and audio modalities, the proposed methods (Bi-GRUinit visual and Bi-GRUinit audio) are superior to the SOTA network models in 5-class sentiment classification and sentiment regression. The F1 scores of the TFN [42] method in binary sentiment classification are better than those of the proposed methods, but the proposed methods achieve the best binary classification accuracy among all the SOTA network models.

Table 9 Comparison with the SOTA network models in the textual modality under the CMU-MOSI dataset
Table 10 Comparison with the SOTA network models in the visual modality under the CMU-MOSI dataset
Table 11 Comparison with the SOTA network models in the audio modality under the CMU-MOSI dataset

The experimental results obtained under the POM dataset are compared with those of the MFN [43] method in the textual, visual and audio modalities in Tables 12, 13 and 14. It can be seen that in the three single modalities, for most speaker traits, the proposed methods (Bi-GRUinit textual, Bi-GRUinit visual and Bi-GRUinit audio) are superior to the MFN method in multi-label classification and regression, except for Thorough (Tho) regression (MAE) in the textual modality, Humorous (Hum) classification (Acc) in the visual modality, and Trusting (Tru) regression (Corr) in the visual modality.

Table 12 Comparison with the SOTA network models in the textual modality under the POM dataset
Table 13 Comparison with the SOTA network models in the visual modality under the POM dataset
Table 14 Comparison with the SOTA network models in the audio modality under the POM dataset

4.3.2 Experimental results for multimodal emotion recognition

The emotion recognition results based on the methods for the three single modalities (Bi-GRUinit textual, Bi-GRUinit visual and Bi-GRUinit audio) and for multimodal fusion (Bi-GRUinit multimodal) under the CMU-MOSI dataset and the POM dataset are compared in Tables 15 and 16, respectively. Table 15 shows that the sentiment classification and sentiment regression results of the multimodal fusion are significantly better than those of the three single modalities under the CMU-MOSI dataset. Table 16 shows that in multi-label classification and regression, for most speaker traits, the results of the multimodal fusion (including PR, FPR, RE, Acc, F1, MAE, and Corr) are better than those of the three single modalities under the POM dataset, and the averages (Avg) of all metrics for the multimodal fusion are better than those of the three single modalities.

Table 15 Comparison of the emotion recognition results for three single modalities and multimodal fusion under the CMU-MOSI dataset
Table 16 Comparison of the emotion recognition results for three single modalities and multimodal fusion under the POM dataset

The emotion recognition results of the proposed multimodal fusion (Bi-GRUinit multimodal) and the SOTA multimodal network models under the CMU-MOSI dataset and the POM dataset are compared in Tables 17 and 18, respectively. They show that the emotion recognition performance of the proposed video multimodal emotion recognition method, based on the Bi-GRU with new network initialization and the attention fusion network, is superior to that of the listed multimodal emotion recognition methods in both sentiment classification and sentiment regression.

Table 17 Comparison of multimodal emotion recognition methods under the CMU-MOSI dataset
Table 18 Comparison of multimodal emotion recognition methods under the POM dataset

5 Conclusions

A time-contextual learning method based on the Bi-GRU network is proposed in this paper. In video emotion recognition, the output at the current moment is related not only to the previous state but also to the state that follows it, and the Bi-GRU improves the accuracy of emotion recognition through time-contextual learning. To further improve the accuracy of video emotion recognition, a new network initialization method is proposed and applied to the network model. This initialization method optimizes the initialization parameters of the ReLU network model, improves the robustness of the Bi-GRU network during training and improves the accuracy of emotion recognition. A video multimodal emotion recognition method based on the attention fusion network is proposed to overcome the weight consistency of each modality in multimodal fusion. The attention mechanism processes the variation of the multimodal contextual state at each moment, and the attention distribution at each moment over the multiple modalities is calculated in real-time, so that the network model can learn multimodal contextual information in real-time, thereby improving the accuracy of video emotion recognition under multimodal fusion.

The main work that can be further carried out is summarized in the following four aspects:

  1. (1)

    Increase the high-dimensional features in the audio modality subnetwork. The audio modality subnetwork in this paper uses COVAREP acoustic features. Other effective acoustic features, such as those extracted by the open-source Speech and Music Interpretation by Large-space Extraction toolkit (openSMILE), may be taken into consideration. openSMILE combines functions such as music information retrieval and speech processing to automatically analyze audio signals in real-time and automatically extract emotional features from speech and music signals. Adding other effective acoustic features can further improve the accuracy of emotion recognition in the audio modality subnetwork and hence in the fused multimodal network.

  2. (2)

    Research on contextual learning emotion recognition method based on the stacked Bi-GRUs. The time-contextual learning method based on the Bi-GRU used in this paper overcomes the problem that the BRNN cannot deal with long-term dependency well and causes gradient vanishing or gradient exploding in the back propagation. Next, we can consider stacking Bi-GRUs and apply them to video emotion recognition. The stacked Bi-GRUs can be defined as a model consisting of multiple Bi-GRU layers, which makes the network model deeper. Thus, we will extract features directly from the network without any manual work. It can make better use of the input data and more complex and comprehensive features can be learned to further improve the accuracy of video emotion recognition in the single-modality subnetwork and then the fused multimodal network.

  3. (3)

    Research on video multimodal emotion recognition based on hierarchical attention network. We will consider applying the idea of hierarchical attention networks to video multimodal emotion recognition. Intra-modality attention network can extract important information in the single modality. Inter-modality attention network can capture significant information globally. Thus, the accuracy of video multimodal emotion recognition can be further improved.

  4. (4)

    Research on video multimodal emotion recognition based on other fusion methods. The video multimodal emotion recognition in this paper is based on the attention fusion network, which calculates the attention distribution over the three single-modality subnetworks at each moment in real-time. Next, we will consider other fusion methods or architectures to further improve the accuracy of video multimodal emotion recognition.