1 Introduction

Intelligent machines are becoming an undeniable part of modern life. This has drawn increasing attention to the field of human–machine interaction in recent years, where the goal is to make the relationship between human and machine more realistic, friendly and interactive. One of the most important factors for improving this relationship is the machine's ability to recognize human emotion and react appropriately. Speech is a common form of communication between human beings for conveying emotion, although the complexity of behavior, accent and similar factors can make recognizing emotion from speech challenging. In addition to speech analysis, the study of facial expressions as well as bodily cues such as hand and foot movements can also be used to recognize human emotions (Sebe et al. 2005; Jaimes and Sebe 2007). However, human emotion is a complex and ambiguous phenomenon that depends on factors such as gender, age, culture, language and nationality, which makes it difficult to model mathematically. To overcome these challenges, the development of intelligent emotion recognition systems has received considerable attention in recent years, owing to the modeling capabilities and high-speed computation of computer systems. Collaboration among researchers in fields such as computer science, psychology and cognitive science has enabled computers to identify, interpret and express human emotions (Oatley and Johnson-Laird 1987; Caridakis et al. 2007). This capability makes computers more useful in areas such as e-learning, e-commerce, remote medicine, psychiatry and social networking.

An intelligent emotion recognition system classifies human emotion from one or more modalities using intelligent methods. In the first stage of this process, features that correlate strongly with human emotion are extracted from the chosen modality; emotion categorization is then performed using a classification method. One of the key differences among the methods proposed in the literature is the type of signal used for feature extraction. On this basis, three main approaches can be distinguished: (1) audio-based approaches extract the desired features from the human speech signal, whereas (2) visual-based approaches obtain features by analyzing the human face. In addition, (3) multimodal methods (i.e., hybrids of the previous two) have attracted interest in recent years; using various feature-fusion techniques, they employ several modalities simultaneously for emotion recognition. An important issue in this field is to extract effective features and to use an appropriate classification model in order to derive more accurate final models (Seng et al. 2016).

The main contribution of this study is a new visual feature extraction method based on facial landmark analysis. To this end, the displacement of specific landmarks over time is exploited: the method treats the displacement of each landmark across consecutive frames of an utterance as a time series and analyzes its temporal variations with a well-known mathematical transform to extract the desired features. The final feature vector is then constructed by concatenating the features extracted from all landmarks. The displacement signals of the landmarks can be analyzed by a variety of mathematical transforms; in this study, the discrete wavelet transform (DWT), a widely used transform in signal processing, is employed. The DWT reveals hidden information in the original signal by separating it into low-frequency (approximation) and high-frequency (detail) components. The information collected from the sub-bands (coefficients) is then combined to construct the final visual feature vector. After the feature extraction phase, the effectiveness of the proposed method is studied using state-of-the-art classifiers; different experiments are conducted and their results compared. The proposed method is applied to three datasets, SAVEE (Jackson and Haq 2014), RML (Wang and Guan 2008) and eNTERFACE05 (Martin et al. 2006), which contain emotional utterances covering six principal emotions. Results show improvements in model accuracy with respect to state-of-the-art alternatives, with few exceptions. Furthermore, common audio features, including energy, zero crossing rate (ZCR), MFCC, LPC, RASTA-PLP and their temporal derivatives, are also utilized to exploit the capabilities of audio-visual fusion for enhancing the final performance of the model.

The rest of this paper is organized as follows: in Sect. 2, we review the latest works on audio-visual emotion recognition. Section 3 describes the components of a common audio-visual emotion recognition system. In Sect. 4, the proposed visual feature extraction method as well as the audio features is described. Section 5 covers the simulation setup and the experiments conducted on three common databases. Section 6 concludes the paper.

2 Related work

Research in the field of intelligent human emotion recognition began years ago, and many studies have been conducted using various types of features, especially audio, visual and body gesture features. In recent years, however, more attention has been paid to multimodal approaches, which fuse different kinds of features in the emotion recognition process. Feature fusion is the process of combining two or more feature vectors into a single feature vector. The works in this area differ mainly in the types of features and classifiers they use. In the following, we review the latest works on multimodal emotion recognition with an emphasis on the audio and visual modalities.

Datcu and Rothkrantz (2014) introduce a novel technique for recognizing emotions from audio and visual data; depending on the presence or absence of speech, two types of models based on geometric face features are used for facial expression recognition. In Xie et al. (2015), a novel audio-visual emotion recognition solution using multimodal information fusion based on entropy estimation is introduced. This work proposes a dual-level fusion framework consisting of a feature-level fusion module based on kernel entropy component analysis and a score-level fusion module based on the maximum correntropy criterion. To recognize facial expressions and detect facial action units (AU) while considering the dynamics of videos, a novel Variable-State Conditional Random Field model is proposed in Walecki et al. (2015) that automatically selects the optimal latent states for the target image sequence based on the input data and the underlying dynamics of the sequence; two novel learning strategies and posterior regularization of the latent states are also proposed to derive a more robust model for the target tasks. The research presented in Poria et al. (2016) describes a method to extract features from the audio, visual and textual modalities using deep convolutional neural networks. Goyal et al. (2016) address the problem of continuous emotion prediction in movies from multimodal cues, using a set of audio and video features including video compressibility and the histogram of the facial area; a fusion model based on a mixture of experts is also proposed that fuses information from the audio and video modalities to predict dynamic emotion. In Haq et al. (2016), a comparative analysis of filter and wrapper approaches to feature selection is presented for audio, visual and audio-visual human emotion recognition. In the filter approach, feature selection is performed using the MORE l-Take Away r algorithm, while in the wrapper approach features are selected based on their classification performance using a support vector machine (SVM) classifier; the SVM classifier is then used for emotion recognition. Kaya and Salah (2016) propose extreme learning machines as an alternative for single-layer feedforward networks to model audio and video features for emotion recognition. In You et al. (2016), a cross-modality consistent regression model is proposed that utilizes both visual and textual sentiment analysis techniques. Seng et al. (2016) propose an audio-visual emotion recognition system that uses a mixture of rule-based and machine learning techniques to improve recognition efficacy: the extracted visual features are passed to an optimized RBF neural classifier, while the extracted audio features are passed to an audio feature-level fusion module that uses a set of rules to determine the most likely emotion contained in the audio signal. Mou et al. (2016) propose a novel framework for automatic emotion analysis of each individual in group settings along both the arousal and valence dimensions; a novel descriptor is introduced to encode spatiotemporal information for facial expression analysis, and a method is proposed to recognize the group membership of each individual using face and body behavioral cues. Hossain et al. (2016) propose a bimodal big data emotion recognition system that combines the potential of emotion-aware big data and cloud technology toward 5G. Subramaniam et al. (2016) propose a novel approach for first impressions recognition in terms of the Big Five personality traits from short videos, training two bimodal end-to-end deep neural network architectures using temporally ordered audio and novel visual features from a few frames. In Patwardhan and Knapp (2016), a color- and depth-sensing device is used for facial feature extraction and for tracking human body joints; temporal features across multiple frames are then used for emotion recognition, and an event-driven decision-level fusion combines the results from the individual modalities. Chao et al. (2016) focus on two key problems for audio-visual emotion recognition in video, namely the temporal alignment of audio and visual streams for feature-level fusion and the locating and re-weighting of perception attentions in the whole audio-visual stream; a long short-term memory network is employed for classification. Gharavian et al. (2017) recognize emotions from audio and visual information using a fuzzy neural network together with PSO for parameter optimization, and conduct the fusion of the audio and visual systems at both the decision and feature levels. In Guo et al. (2017), a multimodality convolutional neural network (CNN) based on visual and geometric information is proposed. In Kaya et al. (2017), a system for multimodal expression recognition in the wild is proposed in which CNN-based features are obtained via transfer learning; the approach fuses audio and visual features with least squares regression-based classifiers and weighted score-level fusion. Noroozi et al. (2017) present a multimodal emotion recognition system based on the analysis of audio and visual cues in which, for the visual part, the geometric relations of facial landmarks are computed; each emotional video is summarized into a reduced set of key frames by applying a CNN, and the confidence outputs of all classifiers from the various modalities are used to define a new feature space. Tzirakis et al. (2017) propose an emotion recognition system using the auditory and visual modalities that utilizes a CNN to extract features from speech and a deep residual network for the visual modality; this method also employs long short-term memory networks, a machine learning algorithm that is relatively insensitive to outliers. For an extended review of multimodal emotion analysis, the interested reader is referred to Poria et al. (2017) and Soleymani et al. (2017), which review the progress made in the field from the past to the present.

As mentioned earlier, three different databases are employed in this work, namely Surrey Audio-Visual Expressed Emotion (SAVEE), RML and eNTERFACE05, to perform multimodal emotion recognition. All three databases cover the six principal emotions of the Ekman model (Ekman et al. 2013), i.e., happiness, sadness, anger, fear, surprise and disgust, which makes them appropriate for conducting experiments. Accordingly, the characteristics of recent multimodal emotion recognition works using these databases have been studied and are summarized in Table 1. As shown, various classifiers have been employed to recognize emotion from the audio and visual modalities. Besides their advantages, the weaknesses and limitations of the widely used classifiers in this field can be examined from different aspects. For instance, the SVM is a powerful and flexible classifier that operates on the concept of decision planes by defining decision boundaries. The most serious problem with the SVM is the selection of the best kernel function and parameters; high algorithmic complexity, extensive memory requirements for large datasets, overfitting, difficulty in interpreting the final model and the need for pairwise classification in the multi-class case are other drawbacks of SVM classifiers (Rifkin et al. 2003; Cawley and Talbot 2010; Cevikalp and Triggs 2013; Karamizadeh et al. 2014). Another commonly used supervised learning method is the Hidden Markov Model (HMM), which provides a tool for representing probability distributions over sequences of observations: the observed data are modeled as a series of outputs generated by one of several hidden states (Ghahramani 2001). Although high speed and accuracy due to its strong analytical basis are characteristic of the HMM, the model suffers from several limitations, such as the dependence of its accuracy on the number of states and parameters; moreover, this modeling approach is limited to applications that satisfy the Markov property (Degirmenci 2014; Chakraborty and Talukdar 2016). Another well-known machine learning method in the field of emotion recognition is the Gaussian Mixture Model (GMM), a parametric probability density function represented as a weighted sum (mixture) of Gaussian component densities (Reynolds 2015). Although this method benefits from the characteristics of Gaussian distributions, several limitations can undermine its functionality: in addition to the computational complexity of high-dimensional problems, there is no specified algorithm for determining the number of mixture components (Yu and Deng 2016). A similar issue applies to the CNN, for which there is no principled way to determine the amount of data and the number of layers required to reach a given performance. Other limitations of the CNN include a very time-consuming training process and the need for powerful GPUs and large amounts of storage; overfitting on small datasets, the complexity of the hidden layers, which makes the results difficult to interpret, and sensitivity to misclassification and over-classification are further challenges. Nevertheless, the CNN is able to detect important features at different levels of the input data (image/video/audio), much like a human brain, and is also more computationally efficient, easier to train and has fewer parameters than a conventional neural network (Goodfellow et al. 2016).

Table 1 Characteristics of recent multimodal emotion recognition models conducted on SAVEE, RML and eNTERFACE05 databases

3 Audio-visual emotion recognition system

Human beings communicate with each other more effectively through the expression of emotions. The effect of emotion can be observed in facial movements, changes in the tone of voice, hand or body motion and also in biological signals such as heart and pulse rate. Because of its importance, the analysis of a person's emotional state, which falls within the affective computing field, has recently received considerable attention. Affective computing can be defined as the study and development of systems that identify, interpret, process and simulate human emotion, drawing on fields such as computer science, psychology and cognitive science (Tao and Tan 2005).

One of the major challenges in the emotion recognition field is the lack of agreement on a unified classification scheme. The emotional behavior of each person is a combination of ambiguous and complex emotions, while emotion itself depends on personality characteristics and on the person's internal state; consequently, in most cases emotion cannot be categorized into a small set of basic emotions. Psychologists have introduced different models for classifying emotion based on the theoretical question of whether emotion is a discrete class of behaviors or part of a larger continuum (Colombetti 2009). From the first viewpoint, emotion is a discrete phenomenon and is therefore distinguishable, measurable and separable (Colombetti 2009). Ekman, one of the most prominent researchers in this field, conducted several studies on the similarities and differences of emotion across cultures (Ekman et al. 2013) and introduced six principal emotions, namely fear, disgust, anger, surprise, joy and sadness, as universal across human cultures. In contrast, from the dimensional viewpoint, all emotions are characterized by two or three dimensions, most often valence and arousal (Barrett 1998). As a well-known example, Robert Plutchik developed the "wheel of emotions", which arranges emotions in concentric circles, with more basic emotions in the inner circles and more complex emotions in the outer circles; the strength of each emotion is indicated by its distance from the center. The eight basic emotions are grouped into positive and negative pairs: joy versus sadness, anger versus fear, trust versus disgust, and surprise versus anticipation (Plutchik 1980). Basic emotions can be combined in different ways to form the complex emotions spanning the full spectrum of human emotional experience; for example, anger and disgust may combine to indicate contempt (Plutchik 1980).

The structure of a common multimodal emotion recognition system is composed of five main parts, as shown in Fig. 1. The first part is the creation of a proper dataset as a prerequisite of the system; its stages include recording utterances in different emotional states, marking and tracking the face, and extracting the voice from the utterances. Next, the features that correlate most strongly with emotion are identified and extracted; these include audio and visual features extracted from the utterances. The following step is the fusion of audio and visual features, which can play an important role in enhancing the efficiency of the model. The feature vector may contain a number of irrelevant features that add to the complexity of the model, so applying dimensionality-reduction techniques can improve the efficiency and reduce the complexity of the final model. In the final stage, emotion classification is performed. The key issue in this process is the selection of appropriate audio-visual features as well as an efficient classification model in order to derive a more accurate model. In the following, the different parts of the emotion recognition system are described in more detail.

Fig. 1
figure 1

Main parts of multimodal emotion recognition system

To study the efficiency of the proposed method, the SAVEE database is used in this work. It includes recorded utterances of four male actors aged between 27 and 31 years. A total of 480 native British English utterances are produced in seven states consisting of six principal emotions (anger, disgust, fear, happiness, sadness, surprise) and one neutral state. Each actor read a total of 120 sentences: 15 for each basic emotion and 30 for the neutral state. The audio sampling rate is 44.1 kHz, and the video frame rate is 60 frames per second (Jackson and Haq 2014). To validate the results of the proposed method, two other emotional databases, RML and eNTERFACE05, are also employed. The RML database includes 720 audio-visual emotional utterances from 8 human subjects speaking six different languages (English, Mandarin, Urdu, Punjabi, Persian and Italian); different accents of English and Chinese are also included. The samples are recorded at a sampling rate of 22,050 Hz with single-channel 16-bit digitization and a frame rate of 30 fps. Ten different sentences are provided for each emotional class, and each video clip is about 3–6 s long with one emotion being expressed (Wang and Guan 2008). The eNTERFACE05 database contains 1166 audio-visual clips in English recorded from 42 subjects of different nationalities; the frame rate is 25 fps and the audio is sampled at 48,000 Hz in an uncompressed stereo 16-bit format (Martin et al. 2006). Table 2 summarizes the characteristics of the databases utilized in this work.

Table 2 Characteristics of the utilized databases

Feature extraction is the most important part of the emotion recognition system. The features extracted in this study fall into two categories: audio and visual. Audio features generally belong to three domains: time, spectral and perceptual. In this study, prosodic features and spectral-domain features inspired by the human hearing system are used, namely energy, ZCR, MFCC, LPC, RASTA-PLP and their first and second temporal derivatives. Visual features are extracted with the proposed approach, which exploits the displacement of specific landmarks on the human face; for this purpose, the discrete wavelet transform (DWT), a well-known transform in signal processing, is employed. The proposed approach is described in more detail below. After audio and visual feature extraction, feature fusion is conducted. Fusion can be performed at two different levels, the feature level and the decision level. In feature-level fusion, the feature vectors extracted from speech and video are combined and used to develop a single intelligent emotion recognition model, whereas in decision-level fusion individual models are derived from the visual and audio features and their outputs are then combined in various ways, such as majority voting. In this study, feature-level fusion, which blends the audio and proposed visual feature vectors, constructs the final feature vector for deriving classification models. The feature vector may include irrelevant and unhelpful features, which on the one hand increase the overall model complexity and on the other hand reduce accuracy; the dimensionality-reduction process can help to simplify the model and increase efficiency. Accordingly, the effect of reducing the feature vector dimension on the final model performance is examined by applying various dimensionality-reduction techniques. The last step is classification, which assigns an observation, i.e., the features of a data sample, to a predefined category. In this study, various classification techniques are utilized in two general categories, namely individual and ensemble models. Ensemble learning methods operate on the principle that combining the predictions of a group of classifiers is often better than using individual models; accordingly, a set of base learners is constructed and then combined in various ways, such as voting or weighted averaging, with the aim of improving accuracy and reducing the error rate.
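To make the two fusion levels concrete, the following sketch contrasts them in Python. It is an illustration rather than the authors' implementation: the function names, the scikit-learn classifiers standing in for the models used later in the paper and the use of probability averaging (a variant of voting) for decision-level fusion are our own assumptions.

```python
# Illustrative sketch of feature-level vs. decision-level fusion.
# Feature-level fusion concatenates the audio and visual feature vectors and a
# single classifier is trained on the result; decision-level fusion trains one
# classifier per modality and combines their outputs afterwards.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC

def feature_level_fusion(X_audio, X_visual):
    """Concatenate per-sample audio and visual feature vectors column-wise."""
    return np.hstack([X_audio, X_visual])

def decision_level_fusion(X_audio, X_visual, y, X_audio_test, X_visual_test):
    """Train one classifier per modality and combine their outputs,
    here by averaging class probabilities (majority voting is another option)."""
    clf_a = SVC(probability=True).fit(X_audio, y)                 # audio model
    clf_v = RandomForestClassifier(n_estimators=100).fit(X_visual, y)  # visual model
    avg = (clf_a.predict_proba(X_audio_test)
           + clf_v.predict_proba(X_visual_test)) / 2.0
    return clf_a.classes_[np.argmax(avg, axis=1)]
```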

4 Proposed visual feature extraction method

The feature extraction process plays a vital role in the emotion recognition system. The features extracted in this study come from two modalities, audio and visual. Figure 2 illustrates the steps of the audio and visual feature extraction paths used to construct the final feature vector, which is the main part of the multimodal emotion recognition system.

Fig. 2
figure 2

Proposed audio-visual feature extraction method for emotion recognition system

Facial feature extraction methods can be grouped into (1) geometric features, which examine sensitive regions of the face such as the eyebrows, mouth and lips to detect emotion; distances between facial landmarks, angles and the shapes of specific regions of the face are examples of this category; and (2) appearance-based features, which represent changes in the texture of the expressive face, such as wrinkles and furrows (Valstar et al. 2015). The main challenge in this context is the robustness of the proposed methods against environmental conditions and individual variation, given the complexity of facial anatomy. In this study, a new approach for visual feature extraction is proposed which recognizes facial emotion by analyzing the displacement signals of individual landmarks. Accordingly, the displacement of landmarks across consecutive frames of an utterance is used for visual feature extraction. The rationale is that the position of specific landmarks varies differently for different emotions; consequently, the generated displacement time series differ and can be used as raw data for feature extraction. Treating the movement variations of landmarks as time series allows us to apply different signal transforms for feature extraction, for example the discrete wavelet transform (DWT), a well-known tool in signal processing. Having explained the main idea, we describe the steps of the proposed visual feature extraction method below, shown as the red path in Fig. 2.

  1. (a)

    Facial landmark detection/marking and tracking

    The proposed method is based on the displacement analysis of facial landmarks; hence, it is important to select appropriate landmarks for feature extraction. In the case of the SAVEE dataset (Jackson and Haq 2014), the actors' faces were marked with 65 blue landmarks on the forehead, eyebrows, cheeks, lips and jaw, as shown in Fig. 3. To reflect the importance of each face region in the proposed visual extraction method, the face is divided into three regions: (1) the upper region includes the markers above the eyes, on the forehead and eyebrows; (2) the middle region covers the cheek area and nose; and (3) the lower region of the face contains the markers below the upper lip, including the lips, chin and jaw. This segmentation is slightly different from that used in Haq et al. (2008).

    Fig. 3
    figure 3

    Face marking process with various emotions (from left to right: angry, happiness, sadness and neutral)—SAVEE

    To the best of our knowledge, the SAVEE database is the only freely available emotional database that uses facial markers. We therefore utilize OpenFace (Baltrušaitis et al. 2016), an open-source facial behavior analysis toolkit, to validate the results of our visual approach. It includes several state-of-the-art algorithms for facial landmark detection, head pose tracking, eye gaze estimation and facial action unit estimation. In this study, we use the output of the OpenFace landmark detection module to extract 68 landmarks as input to the proposed visual feature extraction model (Fig. 4).

    Fig. 4
    figure 4

    Facial landmark markup

    Figure 5 (original image and plotted version in the neutral state) and Table 3 show the characteristics of each region, including the subregions and the number of landmarks, for the SAVEE database. It should be noted that the marker on the bridge of the nose (inside a black circle in Fig. 5) is used as the reference. As described in Haq and Jackson (2010), after recording the utterances, markers are manually labeled for the first frame of a sequence and then tracked over the remaining frames using a marker tracker. The details of the marker tracker are described in Haq and Jackson (2010) and Jackson and Haq (2014).

    Fig. 5
    figure 5

    Characteristics of each region on the face—SAVEE

    Table 3 Number of landmarks for each region—SAVEE
  2. (b)

    Constructing the displacement signals of landmarks

    The displacement signal of a landmark can be considered as the raw data for the proposed visual extraction method. This signal is generated from the variations of the landmark's position across consecutive frames. Since each landmark is defined by its (x, y) coordinates, two signals are generated per landmark: one for its displacement along the horizontal axis and one for its displacement along the vertical axis. The total number of signals is therefore 2 × np, where np is the number of landmarks (a consolidated code sketch of steps (b)–(e) follows step (e) below). Figure 6 illustrates the displacement signals of the reference landmark on the nose in the six principal emotional states: anger, disgust, fear, happiness, sadness and surprise.

    Fig. 6
    figure 6

    Displacement signals for reference landmark on the nose for a specific actor in six principal emotional states—SAVEE

  3. (c)

    Applying wavelet transform

    Applying a mathematical signal transform reveals hidden information in the original signal. In this study, the discrete wavelet transform (DWT) (Fugal 2009) is used to extract the visual features; specifically, the DWT approximation and detail coefficients are employed. These coefficients can be calculated with the fast wavelet transform algorithm, which decomposes a signal into different sub-bands through a series of high-pass and low-pass filters with various cutoff frequencies; the output of each low-pass filter is filtered again for further decomposition. Figure 7 depicts an example of the 3-level DWT sub-band output of a specific landmark for two different emotions (anger and disgust).

    Fig. 7
    figure 7

    An example of the DWT sub-bands output of a specific landmark for two different emotions—SAVEE

  4. (d)

    Calculating statistical parameters

    In this stage, the output of the discrete wavelet transform, i.e., the approximation and detail coefficients at different levels, is used to form the final feature vector. However, because of the large number of sub-band coefficients, statistical parameters such as the mean and standard deviation are used instead of the raw coefficients.

  5. (e)

    Constructing the feature vector

    In the final stage, the overall feature vector is constructed from the mean and standard deviation of the different sub-band coefficients, using all signals associated with the landmarks of the face; the mean and standard deviation of the original signal are also used. If n is the number of landmarks on the face and m is the number of DWT output vectors (sub-bands), the size of the final feature vector is α × [(n × P × m) + R], where α is the number of coordinate dimensions, P is the number of statistical parameters and R is the number of features extracted from the original signals.

    For the audio modality (blue path in Fig. 2), in addition to the prosodic features energy and ZCR, three spectral and cepstral feature types are extracted, namely mel-frequency cepstral coefficients (MFCC), perceptual linear prediction (PLP) and linear predictive coding (LPC), together with the first and second temporal derivatives of the features. The MFCC uses a nonlinear frequency scale, the mel scale, to approximate the sensitivity of the human auditory system (Gupta et al. 2013). The PLP models speech using three psychophysical concepts of human hearing: critical-band spectral resolution, the equal-loudness curve and the intensity-loudness power law (Hermansky 1990); unlike the MFCC, it warps the spectrum according to the Bark scale. Finally, the LPC approximates the speech generation process as an excitation source passing through a linear filter, modeling the vocal tract as an all-pole filter because of the sensitivity of the human auditory system to poles (Vaseghi 2008).
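As a summary of steps (b)–(e), the following sketch outlines one possible implementation of the proposed visual feature extraction. It is illustrative only: the paper's experiments were run in MATLAB, whereas this sketch uses Python with PyWavelets; the wavelet family (db4), measuring displacement relative to the first frame, the optional compensation by the reference (nose-bridge) landmark and all function names are our own assumptions.

```python
# Minimal sketch of steps (b)-(e): displacement signals -> 3-level DWT ->
# mean/std statistics -> concatenated visual feature vector.
# Assumes `landmarks` is a NumPy array of shape (num_frames, num_landmarks, 2).
import numpy as np
import pywt

def displacement_signals(landmarks, ref_idx=None):
    """Step (b): per-landmark x/y displacement over time, here taken relative to
    the first frame and optionally compensated by a reference landmark."""
    coords = landmarks.astype(float)
    if ref_idx is not None:
        coords = coords - coords[:, ref_idx:ref_idx + 1, :]  # remove global motion
    disp = coords - coords[0:1]                              # displacement over time
    return disp.reshape(disp.shape[0], -1).T                 # (num_landmarks*2, num_frames)

def dwt_stats(signal, wavelet="db4", level=3):
    """Steps (c)-(d): mean/std of the original signal and of each DWT sub-band
    (cA3, cD3, cD2, cD1 for a 3-level decomposition)."""
    coeffs = pywt.wavedec(signal, wavelet, level=level)
    feats = [signal.mean(), signal.std()]
    for band in coeffs:
        feats.extend([band.mean(), band.std()])
    return feats                                             # 2 + (level + 1) * 2 = 10 values

def visual_feature_vector(landmarks):
    """Step (e): concatenate the statistics of all displacement signals."""
    return np.concatenate([dwt_stats(s) for s in displacement_signals(landmarks)])
```

For a SAVEE utterance with 65 landmarks, this layout yields 65 landmarks × 2 axes × 10 statistics = 1300 features, matching the dimensions reported in Sect. 5.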

5 System setup and experimental results

This section explains the experiments performed under various conditions. Most of the simulation process, including the preparation of the training and testing datasets, the separation of audio from the utterances, audio-visual feature extraction and feature vector fusion, is carried out in MATLAB. To apply the different classification methods and compare the performance of the resulting models, the Weka tool is used; it is a Java-based open-source package that includes implementations of a wide range of machine learning algorithms. In a series of experiments, the performance of the classification models is compared under two conditions, before and after applying dimensionality-reduction methods; this step is conducted after fusion.

The visual features are extracted with the proposed method in two stages. Given the 65 landmarks marked on the face, 130 displacement signals are first produced (one per landmark for each of the x- and y-axes). Then, using a 3-level discrete wavelet transform, the sub-band coefficients (cD1, cD2, cD3 and cA3) are extracted. The mean and standard deviation of the DWT coefficients and of the original signal are used to form the final feature vectors, so 10 = 2 + 4 × 2 features are extracted for each displacement signal. Consequently, given the total number of landmarks on the face, the final visual feature vector has 10 × 2 × 65 = 1300 elements. The original video database contains 480 samples in different emotional states; hence, including the class column, the dataset has size 480 × 1301. Tenfold cross-validation is used for model validation. As mentioned earlier, the audio feature vector includes the low-level descriptors energy (1), ZCR (1), MFCC (12), LPC (12) and RASTA-PLP (13), as well as first and second temporal derivatives. Each frame is 20 ms long and is obtained using a Hamming window with an overlap of 25%. Finally, the audio feature vector is formed from the mean and standard deviation of the coefficients over all speech frames, giving a final size of 2 + 2 + 24 + 24 + 3 × 26 = 130.
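The audio front end can be sketched as follows. This is not the authors' MATLAB implementation but an illustrative Python version using librosa; RASTA-PLP (and its derivatives) is omitted because librosa provides no implementation, so the resulting vector is smaller than the 130-dimensional one described above. Reading "25% overlap" as a 15 ms hop, using RMS as the frame-energy proxy and applying the deltas to the MFCCs are further assumptions of this sketch.

```python
# Illustrative frame-based audio features: energy, ZCR, MFCC, LPC and MFCC
# deltas, summarized by mean and standard deviation over all frames.
import numpy as np
import librosa

def audio_features(path, n_mfcc=12, lpc_order=12):
    y, sr = librosa.load(path, sr=None)
    frame = int(0.020 * sr)            # 20 ms analysis window
    hop = int(0.015 * sr)              # 25% overlap -> 15 ms hop

    energy = librosa.feature.rms(y=y, frame_length=frame, hop_length=hop)
    zcr = librosa.feature.zero_crossing_rate(y, frame_length=frame, hop_length=hop)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc,
                                n_fft=frame, hop_length=hop, window="hamming")
    # Frame-wise LPC coefficients (the leading coefficient of 1 is dropped);
    # silent frames may need special handling in practice.
    frames = librosa.util.frame(y, frame_length=frame, hop_length=hop)
    lpc = np.stack([librosa.lpc(f * np.hamming(frame), order=lpc_order)[1:]
                    for f in frames.T], axis=1)

    def stats(m):                      # mean and std over frames
        return np.concatenate([m.mean(axis=1), m.std(axis=1)])

    return np.concatenate([stats(energy), stats(zcr), stats(mfcc), stats(lpc),
                           stats(librosa.feature.delta(mfcc)),
                           stats(librosa.feature.delta(mfcc, order=2))])
```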

In the following, we describe the experiments performed on the basis of the proposed visual feature extraction method. The experiments were conducted in two general cases: in the first, there is no fusion between the visual and audio features and the performance of the models is analyzed for each category of features individually; in the second, the effect of fusing the audio and visual features is examined. All experiments are also repeated after applying dimensionality-reduction algorithms to remove irrelevant features and reduce the complexity of the derived models. Model performance is measured by the total accuracy, i.e., the proportion of correctly classified samples among the total number of samples.

In the first case, the audio and visual features are used individually to develop models, with and without dimensionality-reduction techniques. To reduce the feature vector dimension, several search techniques implemented in the Weka tool (Hall et al. 2009), all based on correlation-based feature subset selection, are used: best first (BF), genetic search (GS), linear forward selection (LFS) and particle swarm optimization (PSO). In addition, Naïve Bayes (NB), the fuzzy rough neural network (FRNN), sequential minimal optimization (SMO), bagging (BG) and random forest (RF) are used as state-of-the-art classifiers to learn from each feature vector. Table 4 reports the results of applying the dimension-reduction techniques to the extracted audio and visual features. As shown, LFS and BF achieve the largest reduction in feature size, more than 90% for both types of features. After feature reduction, the classification methods, also implemented in Weka, are employed to develop the final emotion recognition models. Tables 5 and 6 report the accuracy of the classification models for both modalities in two cases, using the original and the reduced versions of the feature vector.

Table 4 Applying dimensionality-reduction techniques—SAVEE
Table 5 Performance of models in terms of accuracy (%)—fused audio features (SAVEE)
Table 6 Performance of models in terms of accuracy (%)—visual features (SAVEE)
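The evaluation protocol can be approximated outside Weka as follows. This is a rough stand-in in which scikit-learn's GaussianNB, SVC, BaggingClassifier and RandomForestClassifier substitute for Weka's NB, SMO, BG and RF; FRNN is omitted for lack of an equivalent, and the CFS-based search methods (BF, GS, LFS, PSO) are not reproduced here.

```python
# 10-fold cross-validated accuracy of several classifiers on a feature matrix X
# (n_samples x n_features, audio, visual or fused) with labels y.
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

def compare_classifiers(X, y, folds=10):
    models = {
        "NB": GaussianNB(),
        "SMO (SVC)": make_pipeline(StandardScaler(), SVC(kernel="rbf")),
        "BG": BaggingClassifier(n_estimators=50),
        "RF": RandomForestClassifier(n_estimators=200),
    }
    # Mean accuracy over the folds for each classifier.
    return {name: cross_val_score(clf, X, y, cv=folds).mean()
            for name, clf in models.items()}
```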

Comparing the results in Tables 5 and 6, it is clear that the proposed visual feature extraction method yields considerably more accurate models than the audio features. For the audio features, the SMO method outperforms the other methods, but its performance decreases after the dimensionality-reduction algorithms are applied; the results of the RF method change little, while the performance of NB and bagging improves in most cases. For the proposed visual features, the FRNN method is better than the other methods both before and after applying the dimensionality-reduction techniques. Among the reduction methods, GS gives the best performance; however, the results of the LFS and BF algorithms are close to those of GS, even though they select fewer features. Although the BF and LFS algorithms reduce the size of the feature vector, they do not have a negative effect on the overall accuracy of the classifier models. It can therefore be concluded that most of the redundant and irrelevant features (more than 90%) have been eliminated by these reduction techniques. Furthermore, regarding the effect of audio feature-level fusion on performance, we compare the emotion recognition rate of the audio features in two cases: individual and fused features. As shown in Fig. 8, fusing the audio features, including MFCC, LPC and PLP, increases the performance of the model compared with using each feature type individually.

Fig. 8
figure 8

Comparison of emotion recognition rate of audio features: individually and fused version—SAVEE

To address this issue in more detail, we next examine the characteristics of the features selected by each reduction technique. As mentioned earlier, the proposed visual method is wavelet based and uses the mean and standard deviation of the original movement signal and of the DWT sub-band coefficients to form the feature vector. We aim to determine the most frequently selected features, especially the most important DWT sub-bands and, consequently, the most important landmarks on the face for the proposed visual feature extraction method. Table 7 reports the number of selected features of each type with respect to the number of landmarks. In this table, for example, "signal-std = 12" in the BF column means that 12 of the 73 selected features are the standard deviation (std) of the landmarks' original displacement signals; similarly, "cD1-mean = 1" in the linear forward selection column means that only 1 of the 51 selected features is the mean of the cD1 coefficients.

Table 7 Number of each type of selected feature—SAVEE

As shown in the table above, the features related to the original movement signals, i.e., the mean and standard deviation of the signal, account for a significant percentage of the selected features under all dimensionality-reduction methods. It can therefore be concluded that the displacement signal of the landmarks itself plays a significant role in the emotion recognition process. Regarding the sub-band features, the results show that as we move toward lower-frequency levels, the statistical parameters of the coefficients, especially the standard deviation, become more important; accordingly, the approximation coefficients are more valuable than the detail coefficients.

Next, we investigate which landmarks are selected by each dimensionality-reduction technique and which areas they belong to. Recall that 20 features in total are extracted from each 2D landmark: 10 for its horizontal displacement (x-axis) and 10 for its vertical displacement (y-axis). Table 8 reports the number of selected landmarks in each area (upper, middle and lower) and the percentage of that area's landmarks which were selected. Two selection strategies are used: under the OR strategy, a landmark counts as selected if at least one of its features appears in the final feature vector, whereas under the MORE strategy a landmark counts as selected only if more than one of its features appears in the final feature vector. As shown in Table 8, a greater percentage of the upper area is selected under both strategies. Figure 9 gives a graphical view of the selected landmarks in both cases; in this figure, the selected landmarks are connected according to the face segmentation discussed in Sect. 4. The details of the selected landmarks for each subregion are given in Table 9. As the figure shows, in the case of BF almost all of the landmarks on the lips, left cheek and eyebrows, as well as a considerable part of the right side of the forehead, are selected under the OR strategy, and a significant number of landmarks on the eyebrows and the right side of the forehead are also selected under the MORE strategy. The MORE strategy is clearly stricter than the OR strategy in selecting landmarks. It can therefore be concluded that the eyebrows and forehead are important areas for the emotion recognition process with the proposed visual feature extraction method. To further verify these results, the experiments were repeated with the LFS reduction method; Fig. 9 shows the positions of the landmarks selected by the two strategies. The results confirm the previous findings, even though fewer features are selected in this case. Under the OR strategy, the lips, eyebrows, cheek and forehead are the most important areas of the face in the case of BF, with 80, 75 and 67 percent coverage, while for LFS they are the chin, eyebrows and lips, with 67, 63 and 60 percent coverage, respectively.

Table 8 Number and percentage of selected landmarks for each area—SAVEE
Fig. 9
figure 9

Position of landmarks selected by two strategies (left OR strategy, right MORE strategy)—SAVEE

Table 9 Statistics of selected landmarks for each face region—SAVEE
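The OR/MORE bookkeeping behind Tables 8 and 9 can be expressed compactly. The sketch below is ours and assumes the feature layout of the earlier visual sketch (20 consecutive features per landmark), which is a convention of this illustration rather than something specified in the paper.

```python
# Map selected feature indices back to landmarks and compute OR / MORE sets,
# then the per-region coverage percentages.
from collections import Counter

def landmarks_per_strategy(selected_idx, feats_per_signal=10):
    """A landmark owns 2 * feats_per_signal consecutive features (x then y)."""
    counts = Counter(i // (2 * feats_per_signal) for i in selected_idx)
    or_set = {lm for lm, c in counts.items() if c >= 1}   # at least one feature
    more_set = {lm for lm, c in counts.items() if c > 1}  # more than one feature
    return or_set, more_set

def region_coverage(selected_landmarks, region_to_landmarks):
    """Percentage of each region's landmarks that were selected."""
    return {region: 100.0 * len(selected_landmarks & set(ids)) / len(ids)
            for region, ids in region_to_landmarks.items()}
```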

Taking the important landmarks to be those with more than two selected features in the final feature vector, Tables 10 and 11 list the important landmarks and the number of selected features for both reduction techniques. Figure 10 also plots the positions of these landmarks on the face (green circles). As shown, in the case of BF the most important landmarks belong to the eyebrows, forehead and lips (3, 2 and 2 landmarks, respectively), while for LFS they belong to the eyebrows and forehead (3 and 2 landmarks, respectively).

Table 10 Important landmarks and the number of selected features—SAVEE
Table 11 Position of important landmarks—SAVEE
Fig. 10
figure 10

Position of important landmarks on the face using OR strategy. a Best first. b Linear forward selection—SAVEE

To exploit the advantages of multimodal emotion recognition systems, fusion is performed between the audio and visual features. As previously described, the feature dimension reduction techniques are applied after the fusion step. Table 12 gives the specifications of the dimensionality-reduction techniques' output after feature fusion, and the corresponding model performance is shown in Table 13. The results show that fusing the audio and visual features improves the efficiency of the models in more than 90% of the cases; the improvement is more noticeable for NB, SMO and BG than for FRNN and RF. Compared with Tables 5 and 6, it can be concluded that the fusion of audio-visual features leads to improved model performance.

Table 12 Specifications of the dimensionality-reduction techniques output in the case of features fusion—SAVEE
Table 13 Performance of models in terms of accuracy (%)—fusion of audio and visual features (SAVEE)

To validate the results obtained with the proposed method, experiments were also conducted on two other databases, RML and eNTERFACE05. Like SAVEE, both databases contain the six principal emotions of happiness, sadness, anger, fear, surprise and disgust, which makes them appropriate for the desired experiments. A set of sample images from the RML and eNTERFACE05 databases is shown in Fig. 11.

Fig. 11
figure 11

A set of sample facial expression images from the RML (top row) and eNTERFACE05 (bottom row)

In the following, we compare the best results obtained with the proposed approach against state-of-the-art approaches evaluated on the SAVEE, RML and eNTERFACE05 databases. As with SAVEE, the best results for the other two databases are also achieved with the FRNN classifier. The recognition accuracy of the proposed method and the comparison with other works are given in Tables 14, 16 and 18, and detailed results are presented as confusion matrices in Tables 15, 17 and 19. The proposed approach yields higher accuracy (a recognition rate of 98.33%) than all approaches evaluated on SAVEE. Less work has been reported on the RML database; nonetheless, with a recognition rate of 73.12%, the proposed approach outperforms two of the four studied works, performing only slightly better than the method proposed in Noroozi et al. (2017). In the case of the eNTERFACE05 database, the proposed approach (recognition rate of 62.80%) is better than all other approaches with the exception of Seng et al. (2016).

Table 14 Comparison of proposed visual method with other works—SAVEE
Table 15 Confusion matrix for six principal emotions plus neutral state—SAVEE
Table 16 Comparison of proposed visual method with state-of-art visual models—RML
Table 17 Confusion matrix for six principal emotions—RML
Table 18 Comparison of proposed visual method with state-of-art visual models—eNTERFACE05
Table 19 Confusion matrix for six principal emotions—eNTERFACE05
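Confusion matrices such as those in Tables 15, 17 and 19 can be obtained from out-of-fold predictions of the cross-validation. The snippet below is an illustrative stand-in (a scikit-learn SVC replaces the FRNN used in the paper).

```python
# Out-of-fold confusion matrix from 10-fold cross-validation.
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import cross_val_predict
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

def cv_confusion_matrix(X, y, labels, folds=10):
    clf = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
    y_pred = cross_val_predict(clf, X, y, cv=folds)   # prediction for each sample
    return confusion_matrix(y, y_pred, labels=labels)
```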

To study the effect of each face region on the total accuracy, we next eliminate the landmarks of specific regions of the face, and to study the effect of each region's landmarks on the individual emotional states, the results are presented as confusion matrices. Figures 12 and 13 show the confusion matrices for the six principal emotions on the video clips of the RML and eNTERFACE05 databases when the jaw and chin, eyebrow, nose, eye and lip landmarks are eliminated in turn. In the figures, (–) denotes the elimination of the landmarks of a specific region; for example, the (–) jaw and chin bars show the results obtained when the jaw and chin landmarks are disregarded. As shown, ignoring the eyebrow landmarks and the jaw and chin landmarks degrades the total accuracy by about 2% and 5%, respectively, on eNTERFACE05 and by about 3% each on RML. Furthermore, the results show that the nose has a negative effect on performance, and removing its landmarks increases the total accuracy (by approximately 1% on both databases). Although ignoring the eye and lip landmarks reduces model performance, the degradation in recognition rate is not significant on either database.

Fig. 12
figure 12

Confusion matrices for the principal emotions after specific region landmarks elimination—RML

Fig. 13
figure 13

Confusion matrices for the principal emotions after specific region landmarks elimination—eNTERFACE05
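One way to run such a region-elimination study is sketched below; it is our illustration (the column-index bookkeeping and the SVC stand-in for FRNN are assumptions), not the authors' code.

```python
# Drop the feature columns belonging to one facial region and re-evaluate the
# model; the change in accuracy reflects that region's contribution.
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

def ablate_regions(X, y, region_columns, folds=10):
    """region_columns maps a region name (e.g., 'jaw and chin') to the indices
    of its feature columns; returns mean CV accuracy with that region removed."""
    clf = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
    results = {}
    for region, cols in region_columns.items():
        keep = np.setdiff1d(np.arange(X.shape[1]), cols)
        results["(-) " + region] = cross_val_score(clf, X[:, keep], y, cv=folds).mean()
    return results
```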

6 Conclusion

In this paper, a multimodal emotion recognition system is introduced. The audio features comprise common low-level descriptors (LLDs): energy, zero crossing rate, MFCC, LPC, RASTA-PLP and their temporal derivatives. As the main part of the system, a novel visual feature extraction approach is proposed that treats the temporal variations of facial landmarks across consecutive frames as time series; the discrete wavelet transform (DWT) coefficients of the landmark displacement signals are used to construct the final feature vector. In addition, various correlation-based feature selection techniques are applied to reduce the complexity of the derived models and to identify the key features. Experiments were conducted on three common databases, namely SAVEE, RML and eNTERFACE05. The FRNN classifier yields the best performance on all of them, with total accuracies of 98.33%, 73.12% and 62.80%, respectively, outperforming most previously published works on the same databases and modalities. The results clearly show that the features derived from the original displacement signals play a significant role in the emotion recognition process and account for a significant percentage of the selected features. Regarding the DWT sub-band features, the results show that the coefficients become more important toward the lower-frequency levels, so the approximation coefficients are more valuable than the detail coefficients. Moreover, fusing the audio features with the proposed visual features improves the performance of the derived models in most cases; however, the improvement is modest, and future work should therefore identify audio features that are more compatible with the proposed visual features. In contrast to the usual geometric and appearance features, the proposed visual feature extraction method is a landmark-based approach that relies on the movement analysis of landmarks across consecutive frames of an utterance; selecting appropriate landmarks can therefore increase the robustness of the proposed method to facial changes. For future work, a trade-off should be sought between how easily landmarks can be detected and how effective they are for emotion recognition; moreover, to reduce the computational load, it may be preferable to select the appropriate landmarks before the wavelet analysis. On the other hand, the efficiency of the method depends strongly on the performance of the face tracker: errors in the 2D landmark coordinates across consecutive frames distort the displacement signals and, consequently, change the DWT coefficients.