
1 Introduction

Nowadays, many human-machine interface (HMI) products are used by ordinary people, and more HMI devices will be needed in the future. Since the emotional functions of HMI products play an important role in our daily life, it is necessary for HMI devices to be able to recognize human emotions automatically.

Many researchers have studied emotion recognition from EEG. Liu et al. used a fractal-dimension-based algorithm to recognize and visualize emotions in real time [1]. Li and Lu used EEG signals in the gamma band to classify two kinds of emotions, and their results showed that the gamma band is suitable for emotion recognition [2].

Duan et al. found that differential entropy features are more suitable for emotion recognition tasks [3]. Wang et al. compared three different kinds of EEG features and proposed a simple approach to track the trajectory of emotion changes over time [4]. Zheng and Lu employed deep neural networks to classify EEG signals and examined critical bands and channels of EEG for emotion recognition [5].

To make full use of information from different modalities, Yang et al. proposed an auxiliary information regularized machine, which treats different modalities with different strategies [6].

In [7], the authors built a single-modal deep autoencoder and a bimodal deep autoencoder to generate shared representations of images and audio. Srivastava and Salakhutdinov extended the methods developed in [7] to bimodal deep Boltzmann machines to handle multimodal deep learning problems [8].

As for multimodal emotion recognition, Verma and Tiwary carried out emotion classification experiments with EEG signals and peripheral physiological signals [9]. Lu et al. used two different strategies for combining EEG and eye movement data: feature-level fusion and decision-level fusion [10]. Liu et al. employed bimodal deep autoencoders to fuse different modalities and tested the framework on multimodal facilitation, unimodal enhancement, and crossmodal learning tasks [11].

To the best of our knowledge, no work reported in the literature deals with emotion recognition from multiple physiological signals using multimodal deep learning algorithms. In this paper, we propose a novel multimodal emotion recognition method based on multimodal deep learning techniques. In Sect. 2, we introduce the bimodal deep autoencoder (BDAE). Section 3 presents data pre-processing, feature extraction, and experiment settings. The experimental results are described in Sect. 4. Section 5 discusses the results, and conclusions and future work are given in Sect. 6.

2 Multimodal Deep Learning

2.1 Restricted Boltzmann Machine

A restricted Boltzmann machine (RBM) is an undirected graphical model with a visible layer and a hidden layer. Connections exist only between the visible layer and the hidden layer; there are no connections within either layer. Assuming binary visible variables \({\mathbf {v}}\in \{0,1\}^M\) and hidden variables \({\mathbf {h}}\in \{0,1\}^N\), we have the following energy function E:

$$\begin{aligned} E({\mathbf {v}},{\mathbf {h}};\theta ) = -\sum _{i=1}^M\sum _{j=1}^NW_{ij}v_ih_j-\sum _{i=1}^M b_iv_i - \sum _{j=1}^Na_jh_j \end{aligned}$$
(1)

where \(\theta = \{\mathbf {a,b,W}\}\) are the model parameters, \(W_{ij}\) is the symmetric weight between visible unit i and hidden unit j, and \(b_i\) and \(a_j\) are the bias terms of the visible and hidden units, respectively. With this energy function, we obtain the joint distribution over the visible and hidden units:

$$\begin{aligned} p({\mathbf {v}},{\mathbf {h}};\theta )&= \frac{1}{{\mathcal {Z}}(\theta )}\exp (-E({\mathbf {v}},{\mathbf {h}};\theta )), \text {and}\nonumber \\ {\mathcal {Z}}(\theta )&= \sum _{\mathbf {v}}\sum _{\mathbf {h}}\exp (-E({\mathbf {v}},{\mathbf {h}};\theta )) \end{aligned}$$
(2)

where \({\mathcal {Z}}(\theta )\) is the normalization constant.
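To make the notation concrete, here is a minimal NumPy sketch (a toy illustration, not the authors' implementation) that evaluates the energy of Eq. (1) and the corresponding unnormalized probability \(\exp (-E)\) from Eq. (2); the layer sizes and parameter values are arbitrary assumptions.

```python
import numpy as np

def rbm_energy(v, h, W, b, a):
    """Energy E(v, h; theta) of a binary RBM, as in Eq. (1)."""
    return -(v @ W @ h) - b @ v - a @ h

def unnormalized_prob(v, h, W, b, a):
    """exp(-E(v, h)); dividing by the partition function Z gives Eq. (2)."""
    return np.exp(-rbm_energy(v, h, W, b, a))

# Toy configuration: M = 3 visible units, N = 2 hidden units.
rng = np.random.default_rng(0)
W = rng.normal(scale=0.1, size=(3, 2))   # symmetric weights W_ij
b = np.zeros(3)                          # visible biases b_i
a = np.zeros(2)                          # hidden biases a_j
v = np.array([1.0, 0.0, 1.0])
h = np.array([0.0, 1.0])
print(rbm_energy(v, h, W, b, a), unnormalized_prob(v, h, W, b, a))
```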

Given a set of visible variables \(\{{\mathbf {v}}_n\}_{n=1}^N\), the derivative of the log-likelihood with respect to the weights \({\mathbf {W}}\) can be calculated from Eq. (2):

$$\begin{aligned} \frac{1}{N}\sum _{n=1}^N\frac{\partial \log p({\mathbf {v}}_n;\theta )}{\partial W_{ij}} = {\mathbb {E}}_{P_{data}}[v_ih_j]-{\mathbb {E}}_{P_{model}}[v_ih_j] \end{aligned}$$

Various algorithms can be used to train an RBM, such as the Contrastive Divergence (CD) algorithm [12]. In this paper, Bernoulli RBMs are used: we treat the visible layer as probabilities and train the RBMs with the CD algorithm.
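As an illustration of how the gradient above is approximated, the following NumPy sketch performs one CD-1 update on a mini-batch; the learning rate, sampling scheme, and variable names are assumptions rather than the authors' exact settings.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_update(V, W, b, a, lr=0.05, seed=0):
    """One CD-1 step on a batch V of visible vectors (shape: batch x M).

    Approximates E_Pdata[v_i h_j] - E_Pmodel[v_i h_j] with a single
    Gibbs step, following the Contrastive Divergence idea of [12].
    """
    rng = np.random.default_rng(seed)
    # Positive phase: hidden probabilities given the data.
    ph = sigmoid(V @ W + a)
    h = (rng.random(ph.shape) < ph).astype(float)
    # Negative phase: one Gibbs step back to a reconstruction.
    pv = sigmoid(h @ W.T + b)          # visible layer treated as probabilities
    ph_recon = sigmoid(pv @ W + a)
    # Gradient approximation and parameter updates.
    W += lr * (V.T @ ph - pv.T @ ph_recon) / V.shape[0]
    b += lr * (V - pv).mean(axis=0)
    a += lr * (ph - ph_recon).mean(axis=0)
    return W, b, a
```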

2.2 Model Construction

The proposed multimodal emotion recognition framework using deep learning is depicted in Fig. 1. There are three steps in total. The first step is to train the BDAE network; we call this step feature extraction. The second step is supervised training, in which the extracted high-level features are used to train a linear SVM classifier. The last step is the testing process, from which the recognition results are produced.

Fig. 1. The proposed multimodal emotion recognition framework. The BDAE network is used to generate high-level features from low-level features or original data, and a linear SVM is trained with the extracted high-level features.

Fig. 2. The structure of the Bimodal Deep AutoEncoder (BDAE).

The BDAE training procedure, consisting of an encoding part and a decoding part, is shown in Fig. 2. In the encoding part, we first train two RBMs for EEG features and eye movement features, respectively. As shown in Fig. 2(a), the EEG RBM is for EEG features and the eye RBM is for eye movement features. Their hidden layers are denoted by \(h_{EEG}\) and \(h_{Eye}\), and \(W_1, W_2\) are the corresponding weight matrices. After training these two RBMs, the hidden layers \(h_{EEG}\) and \(h_{Eye}\) are concatenated, and the concatenated layer is used as the visible layer of an upper RBM, as depicted in Fig. 2(b). Figure 2(c) shows the decoding part. When unfolding the stacked RBMs to reconstruct the input features, we keep the weight matrices tied, i.e., \(W_1, W_2, W_3\) and \(W_1^T, W_2^T, W_3^T\) in Fig. 2(c) are tied weights. Finally, we use an unsupervised back-propagation algorithm to fine-tune the weights and biases.
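The sketch below is a minimal NumPy version of the unfolded encoder-decoder forward pass with tied weights; the sigmoid activations, bias names, and layer sizes are illustrative assumptions, and the unsupervised fine-tuning (back-propagating the reconstruction error) is not shown.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def bdae_forward(x_eeg, x_eye, W1, b1, W2, b2, W3, b3, c_cat, c_eeg, c_eye):
    """Forward pass of the unfolded BDAE in Fig. 2(c).

    W1, W2 are the modality-specific RBM weights and W3 the upper RBM
    weights; the decoder reuses their transposes (tied weights). b* are
    encoder biases and c* decoder biases (names are assumptions).
    """
    # Encoding: modality-specific hidden layers, then the shared layer.
    h_eeg = sigmoid(x_eeg @ W1 + b1)
    h_eye = sigmoid(x_eye @ W2 + b2)
    h_cat = np.hstack([h_eeg, h_eye])
    shared = sigmoid(h_cat @ W3 + b3)          # high-level shared representation
    # Decoding with tied (transposed) weights.
    h_cat_rec = sigmoid(shared @ W3.T + c_cat)
    h_eeg_rec, h_eye_rec = np.split(h_cat_rec, [W1.shape[1]], axis=-1)
    x_eeg_rec = sigmoid(h_eeg_rec @ W1.T + c_eeg)
    x_eye_rec = sigmoid(h_eye_rec @ W2.T + c_eye)
    return shared, x_eeg_rec, x_eye_rec
```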

3 Experiment Settings

3.1 The Datasets

The SEED dataset, which was first introduced in [5], contains EEG signals and eye movement signals of three emotions (positive, negative, and neutral). These signals were collected from 15 subjects while they watched emotional movie clips. There are 15 movie clips, and each clip lasts about 4 min. The EEG signals, recorded with the ESI NeuroScan System, have 62 channels at a sampling rate of 1000 Hz; the eye movement signals, collected with SMI ETG eye-tracking glasses, contain information about blinks, saccades, fixations, and so on. In order to compare our proposed method with the existing approach [10], we use the same data as in [10], that is, 27 data files from 9 subjects. For every data file, the data from the first 9 movie clips are used as training samples and the remaining data are used as test samples.

The DEAP dataset was first introduced in [13]. The EEG signals and peripheral physiological signals of 32 participants were recorded while they watched music videos. The dataset contains 32-channel EEG signals and 8 peripheral physiological signals. The stimuli are 40 one-minute music video clips, and the subjects were asked to perform self-assessment by assigning values from 1 to 9 to five different dimensions, namely valence, arousal, dominance, liking, and familiarity. In order to compare the performance of our proposed method with previous results in [14, 15], we did not take familiarity into consideration. We divided the trials into two classes on each dimension according to the assigned values, with a threshold of 5, so the tasks can be treated as four binary classification problems: high versus low valence, arousal, dominance, and liking. Among all of the data, 90% of the samples were used as training data and the remaining 10% were used as test data.
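A small sketch of the rating binarization described above; the variable name labels is hypothetical, and since the text does not specify how ratings exactly equal to 5 are assigned, we assume that strictly greater than 5 counts as the high class.

```python
import numpy as np

def binarize_ratings(labels, threshold=5.0):
    """Map self-assessment ratings (1-9) to binary high/low classes.

    labels: array of shape (n_trials, 4) with the valence, arousal,
    dominance, and liking ratings (hypothetical layout). Ratings strictly
    above the threshold are treated as the high class; the paper does not
    state how ties at 5 are handled.
    """
    return (np.asarray(labels) > threshold).astype(int)
```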

3.2 Feature Extraction

For the SEED dataset, both Power Spectral Density (PSD) and Differential Entropy (DE) features were extracted from the EEG data. Both kinds of features were computed in five frequency bands: \(\delta \) (1–4 Hz), \(\theta \) (4–8 Hz), \(\alpha \) (8–14 Hz), \(\beta \) (14–31 Hz), and \(\gamma \) (31–50 Hz). For every frequency band, the extracted features have 62 dimensions, so the EEG features have 310 dimensions in total. As for the eye movement data, we used the same features as in [10], 41 dimensions in total, including both PSD and DE features. The extracted EEG features and eye movement features were then rescaled to [0, 1], and the rescaled features were used as the inputs of the BDAE network.
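As a sketch of the per-band DE computation, each channel can be band-pass filtered and the common Gaussian-assumption formula DE = 0.5 log(2πeσ²) applied; the band edges follow the text, while the filter type and order are assumptions, not the authors' exact pipeline.

```python
import numpy as np
from scipy.signal import butter, filtfilt

BANDS = {"delta": (1, 4), "theta": (4, 8), "alpha": (8, 14),
         "beta": (14, 31), "gamma": (31, 50)}

def de_features(eeg, fs):
    """Differential entropy per channel and band for one EEG segment.

    eeg: array of shape (n_channels, n_samples). Under a Gaussian
    assumption, DE = 0.5 * log(2 * pi * e * variance) of the
    band-pass-filtered signal; 62 channels x 5 bands gives 310 values.
    """
    feats = []
    for lo, hi in BANDS.values():
        b, a = butter(4, [lo, hi], btype="bandpass", fs=fs)
        filtered = filtfilt(b, a, eeg, axis=-1)
        var = filtered.var(axis=-1)
        feats.append(0.5 * np.log(2 * np.pi * np.e * var))
    return np.concatenate(feats)
```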

For the DEAP dataset, we used the downloaded preprocessed data directly as the inputs of the BDAE network to generate shared representations of EEG signals and peripheral physiological signals. First, the EEG signals and peripheral physiological signals were separated, and then each 63-s trial was segmented into one-second segments. After segmentation, the data from different channels in the same one-second period are combined to form the input signals of the BDAE network, and the shared representation features are then generated by the BDAE network.
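A minimal sketch of this segmentation step, assuming the standard preprocessed DEAP layout (40 channels, 63 s per trial at 128 Hz); the function and variable names are illustrative.

```python
import numpy as np

def segment_trial(trial, fs=128, window_s=1):
    """Split one preprocessed DEAP trial into non-overlapping 1-s segments.

    trial: array of shape (n_channels, n_samples), e.g. 40 channels x
    63 s at 128 Hz in the preprocessed DEAP release. Each segment keeps
    all channels of the same one-second window, as described above.
    """
    win = fs * window_s
    n_seg = trial.shape[1] // win
    segments = trial[:, :n_seg * win].reshape(trial.shape[0], n_seg, win)
    return segments.transpose(1, 0, 2)   # (n_segments, n_channels, win)
```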

3.3 Classification

The shared representation features generated by the BDAE network are used to train a linear SVM classifier. Because of the variance between EEG signals collected from different subjects at different times, the BDAE model is data-specific: we build one BDAE model for each data file, so there are 27 BDAE models for the SEED dataset and 32 BDAE models for the DEAP dataset. Network parameters, including the numbers of hidden neurons, the number of epochs, and the learning rate, are chosen by grid search.
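A minimal sketch of the classification stage with scikit-learn, assuming X_train holds the shared representations from one subject-specific BDAE model and y_train the emotion labels; the grid of C values is an assumption, since the paper's grid search concerns the network parameters rather than the SVM settings.

```python
from sklearn.model_selection import GridSearchCV
from sklearn.svm import LinearSVC

def train_classifier(X_train, y_train):
    """Fit a linear SVM on the BDAE shared representations.

    The paper tunes the BDAE network parameters by grid search; here we
    only illustrate the classifier stage, with an (assumed) grid over the
    SVM penalty parameter C.
    """
    grid = GridSearchCV(LinearSVC(max_iter=10000),
                        {"C": [0.01, 0.1, 1, 10]}, cv=5)
    grid.fit(X_train, y_train)
    return grid.best_estimator_
```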

4 Results

We compare our model with two other experimental settings. When only a single modality is available, we classify the different emotions with PSD and DE features using an SVM. When multimodal information is available, the features of the different modalities are concatenated directly, and the different emotions are recognized from the concatenated features with an SVM.
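A compact sketch of these two baselines under the assumption of pre-computed feature arrays (the eeg_*/eye_* arrays and label vectors are hypothetical names): single-modality classification uses one feature set, and feature-level fusion simply concatenates the two before the SVM.

```python
import numpy as np
from sklearn.svm import LinearSVC

def single_modality_baseline(X_train, y_train, X_test):
    """Classify emotions from one modality (e.g. EEG PSD/DE features)."""
    return LinearSVC(max_iter=10000).fit(X_train, y_train).predict(X_test)

def concat_baseline(eeg_train, eye_train, y_train, eeg_test, eye_test):
    """Feature-level fusion baseline: directly concatenate both modalities."""
    X_train = np.hstack([eeg_train, eye_train])
    X_test = np.hstack([eeg_test, eye_test])
    return LinearSVC(max_iter=10000).fit(X_train, y_train).predict(X_test)
```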

SEED Results. Figure 3 summarizes the multimodal facilitation experiment results. As can be seen from Fig. 3, the BDAE model achieves the best accuracy (91.01%) and the smallest standard deviation (8.91%).

Fig. 3. Multimodal facilitation results on the SEED dataset. The first two bars denote single modalities, the remaining bars denote multimodal fusion with different strategies, and the fourth bar (Fuzzy) denotes the best result in [10].

Table 1. Accuracies of BDAE model on SEED dataset (%).

Table 1 gives the detailed experimental results of the BDAE model. The last column corresponds to concatenating all five frequency bands of the EEG features together with the eye movement features. We ran the BDAE model three times, and the recognition accuracies shown in Table 1 are averaged over these runs.

DEAP Results. In the literature, Rozgic et al. treated the EEG signals as a sequence of overlapping segments and employed a novel non-parametric nearest-neighbor model to extract response-level features from these segments [14]. Li et al. used a Deep Belief Network (DBN) to automatically extract high-level features from raw EEG signals [15].

The experimental results on the DEAP dataset are shown in Table 2. Besides the baselines mentioned above, we also compared the BDAE results with the results in [14, 15]. As can be seen from Table 2, the BDAE model improves the recognition accuracies on all classification tasks.

Table 2. Comparison of six different approaches on DEAP dataset (Accuracy, %).
Table 3. Confusion matrices of single modalities and different modality fusion methods

5 Discussion

The experimental results demonstrate that the BDAE network can extract shared representations from different modalities and that the extracted features perform better than the other features.

From Table 3(a), we can see that EEG features are good for recognizing positive emotions but not for negative emotions. As a complement, eye features have an advantage in negative emotion recognition, as can be seen from Table 3(b). When EEG and eye features are concatenated directly, the accuracy on positive emotions is improved compared with using eye features alone, and the accuracy on negative emotions is also improved compared with using EEG features alone. The BDAE framework achieves an even better result: it has the highest accuracies on all three kinds of emotions, indicating that the BDAE model can make full use of both EEG features and eye features.

6 Conclusions and Future Work

This paper has shown that the shared representations extracted by the BDAE model are good features for discriminating different emotions. Compared with other existing feature extraction strategies, the BDAE model performs best, with an accuracy of 91.01% on the SEED dataset. For the DEAP dataset, the BDAE network largely improves the recognition accuracies on all four binary classification tasks. We analysed the confusion matrices of the different methods and found that EEG features and eye features contain complementary information, and that the BDAE framework can take full advantage of this complementarity to improve emotion recognition accuracies.

Our future work will focus on investigating the complementarity between EEG features and eye movement features and on explaining the mechanism of multimodal deep learning for emotion recognition from EEG and other physiological signals.