1 Introduction

Most people believe they know a great deal about their own emotions; nevertheless, psychologists still struggle to reach a consensus about the nature and the working mechanisms of emotions [1]. Emotions, which are relatively brief, are fundamental human features that play important roles in social communication and affect all social phenomena [2]. They allow an observer to infer the emotional states and intentions of others, making it possible to anticipate their gestures and to regulate one's own behavior accordingly. Emotions manifest through different reactions, such as physiological responses (changes in tone of voice, palpitations, flushing, accelerated pulse), gestural expressions, and facial expressions. However, defining human emotion is not simple, and the complexity that emotions carry has aroused the interest of many researchers [3]. Darwin emphasized that emotion is a response to the environment [4], while Dam et al. [5] defined emotion as a reaction to an event that appears suddenly and does not last long.

Most existing works share the goal of classifying the input into one of the seven basic emotion classes (happiness, sadness, neutrality, disgust, fear, surprise, and anger); they differ only in the modalities used [6] and in the media from which the features and information needed to predict emotions are extracted [7]. Among the relevant modalities, facial expressions are one of the most popular [8] for several reasons: they are visible, they contain many useful features for emotion recognition, and it is relatively easy to collect a large dataset of face images [9]. It is worth mentioning that image datasets acquired under controlled laboratory conditions are more available than those acquired under uncontrolled (in-the-wild) conditions. Among them, the most widely used are the JApanese Female Facial Expression (JAFFE) dataset [10], the Cohn-Kanade (CK) dataset [11] and its extended version (CK+) [12], the Oulu-CASIA dataset [13], the AffectNet dataset [14], the Acted Facial Expressions in the Wild (AFEW) dataset and its static version, the Static Facial Expressions in the Wild (SFEW_2.0) dataset [15, 16], and the Facial Expression Recognition 2013 (FER2013) dataset [17].

Nevertheless, Facial Emotion Recognition (FER) has remained an active research topic over the past decades due to various challenging factors such as illumination changes, head pose, head motion, motion blur, age, gender, and skin color [18]. In fact, FER remains difficult, particularly in-the-wild and in unconstrained real-life environments. Early approaches to automatic facial expression recognition [19] usually perform quickly and accurately in indoor environments, but their performance frequently drops under real-world conditions [20]. Several challenging issues therefore remain. Indeed, most studies have based hand-crafted feature extraction entirely on human expertise, which makes such approaches overly complex in some real applications, and classical methods consequently struggle to extract prominent features. To address this challenge and achieve higher accuracy, recent investigations have turned to FER systems based on deep learning techniques.
Thus, investigating deep neural network models for facial expression analysis has become one of the most active subjects in recent facial analysis works [21]. In fact, feature learning allows deep networks to capture a broader range of facial features than earlier approaches, including robustness to rotation and illumination changes, and it has been shown that Convolutional Neural Networks (CNNs) trained for facial expression recognition can learn facial features reflecting those suggested by the psychologist Ekman [22].

Overall, several recent works have effectively dealt with FER issues using CNNs [23]. Nevertheless, CNN models exhibit limitations that deserve more attention, notably accuracy rates that could still be higher, especially in-the-wild. To cope with this limitation, we focus on the features provided by different CNN models and on the ability of each model to reach high precision rates on its own. Our idea is to gain robustness from multiple complementary sources rather than from a single model. Accordingly, we propose in this work to build upon the fusion of deep features supplied by different CNN models. More precisely, we have studied ResNet101, which has proven its efficiency in learning with deep layers thanks to residual learning; VGG19, which is a shallower model but with a remarkable number of parameters; and GoogleNet, which ensures a balance between efficiency and training speed while reducing the number of network parameters. The proposed method follows a standard FER scheme in which face images are normalized and then augmented. Thereafter, features are extracted from the pre-processed images using pre-trained CNN architectures and finally classified with an SVM classifier. The proposed method relies on a layer-based feature selection performed on each pre-trained model separately. The three feature vectors selected from different layers are then concatenated into a single final vector. The suggested scheme ensures the complementarity of the facial expression features extracted from the three pre-trained architectures. It is composed mainly of two phases: training and validation. During the training phase, images are pre-processed, faces are detected, and features are extracted from each model and concatenated into a single vector that is fed to an SVM classifier. The same pipeline is followed during validation. The main contributions of this work are twofold:

  • We apply three pre-trained neural networks to extract complementary features combined into a multichannel solution, with customized weight freezing during the training phase. A layer-based feature selection is performed on each pre-trained model separately: a layer search is conducted over the last five layers, including the fully connected (FC) ones, and the layer providing the best features is selected and its features retained.

  • The final feature vector is formed by concatenating the features retained from the different pre-trained models. The concatenation yields a single model that gathers the most relevant facial information extracted by the three base models. The overall error rate is reduced compared to each single model, since the failures of one model can be compensated by another.

Extensive experiments have been carried out on the most challenging FER datasets available today (the JAFFE dataset of Japanese female images, the Extended Cohn-Kanade (CK+) dataset, the Facial Expression Recognition 2013 (FER2013) dataset, and the SFEW_2.0 dataset of static images in the wild), and the proposed method has led to very promising results.

The remainder of this paper is organized as follows: Section 2 briefly reviews relevant existing FER methods. In Sect. 3, we describe the proposed method. In Sect. 4, an overview of the datasets used in this work is given before presenting the experiments and a performance comparison with relevant state-of-the-art methods. Finally, conclusions and future research directions are given in Sect. 5.

2 Related work

A standard FER system essentially involves three key components, namely face detection and pre-processing, feature extraction, and classification. Face detection aims to determine the location and the size of the human face, or faces, within the input image [24]. The most widely used methods for face detection include MTCNN [25], Dlib [26], eigenface techniques [27], and the Viola-Jones algorithm [28]. Although face detection is an essential step enabling feature extraction, image pre-processing is usually required for the alignment and normalization of the visual semantic information conveyed by the face. Its primary function is to discard the variations irrelevant to facial expressions, such as different backgrounds, illuminations, and head poses, which are fairly common in unconstrained scenarios, while keeping as many meaningful features as possible [29]. The second stage, feature extraction, aims to extract facial features from the pre-processed images of the detected faces [30]. The third stage is the classification of the extracted facial features into one of the basic emotion classes. Unlike traditional methods, where the feature extraction stage is independent of the classification stage, deep networks can perform FER in an end-to-end manner [29]. Indeed, the way facial changes are encoded into features [31] facilitates emotion prediction for FER systems. In the remainder of this section, an overview of various FER works is briefly presented, focusing on those that have been validated on the JAFFE, CK+, FER2013, and/or SFEW_2.0 datasets. These works are categorized, according to the adopted feature extraction approach, into three major groups: hand-crafted features, deep learning features, and hybrid ones.

2.1 Hand-crafted features

Early emotion recognition works were based on hand-crafted feature representation methods, which are commonly divided into two categories: appearance (template-based) features and geometric features. Appearance feature extraction methods (e.g. Gabor filters [32], Local Binary Patterns (LBP) [33], Histograms of Oriented Gradients (HOG) [34]\(\ldots \)) are applied to the whole face image, whereas geometric feature-based methods commonly exploit landmark points in order to compute geometric distances between face regions [35]. It is worth noting that most existing hand-crafted methods use a combination of these two approaches [36]. For instance, Zhang et al. [37] cropped images of size \(110\times 150\) pixels after automatically detecting faces using a set of rectangular Haar-like features. Features were then extracted using local binary patterns before applying Local Fisher Discriminant Analysis (LFDA) in order to produce a low-dimensional representation of the extracted data. An accuracy of 90.7% was reached by this method on the JAFFE dataset. Likewise, Abdulrahman and Eleyan [38] focused their contribution on the feature extraction step. Their system used LBP as a feature extractor and Principal Component Analysis (PCA) for dimensionality reduction of the feature vectors, which were then fed to a Support Vector Machine (SVM) for classification. Experiments carried out on the JAFFE and MUFE datasets yielded accuracies of 87% and 77%, respectively. Alshamsi et al. [39] opted for the Hausdorff distance for pre-processing and face detection, followed by a combination of facial landmarks and centers of gravity for feature extraction. An SVM classifier was then applied, reaching an accuracy of 96.3% on the CK+ dataset, 91.9% on the JAFFE dataset, and 90.8% on the KDEF dataset. Differently, the FER system designed by Gite et al. [40] detects faces from facial images using the Viola-Jones algorithm; then, a combination of geometric and appearance-based techniques is explored in order to extract reliable features. The authors investigated the coordinates of face landmarks before reducing the dimensionality of the feature vector using principal component analysis. The method was validated on the extended Cohn-Kanade (CK+) dataset, and a recognition accuracy of 93%, using an SVM classifier, was recorded. However, this FER system still struggled with the common issues of handling real-world conditions such as head movement, varying lighting conditions, and low-intensity expressions. Overall, the major issues of hand-crafted methods can be summarized as the failure of low-level features to extract relevant local facial information and the inability to capture high-level salient information, notably under in-the-wild conditions involving different head positions, complex backgrounds, different distances from the camera, multi-face scenes, subject movement, and low lighting.
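As an illustration of this classical pipeline, the following minimal sketch combines an LBP appearance descriptor with PCA-based dimensionality reduction and a linear SVM, in the spirit of [38]; it assumes scikit-image and scikit-learn are available, and the parameter values are illustrative rather than those reported in the cited works.

```python
# Minimal sketch of a classical hand-crafted FER pipeline (LBP -> PCA -> SVM).
# Parameter values (radius, number of points, PCA dimension) are illustrative.
import numpy as np
from skimage.feature import local_binary_pattern
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

def lbp_histogram(gray_face, points=8, radius=1):
    """Uniform LBP codes pooled into a normalized histogram."""
    codes = local_binary_pattern(gray_face, points, radius, method="uniform")
    hist, _ = np.histogram(codes, bins=points + 2, range=(0, points + 2), density=True)
    return hist

def train_handcrafted_fer(face_images, labels):
    """face_images: cropped grayscale faces; labels: emotion classes (0..6)."""
    features = np.stack([lbp_histogram(f) for f in face_images])
    model = make_pipeline(PCA(n_components=min(8, features.shape[1])),
                          SVC(kernel="linear"))
    return model.fit(features, labels)
```

A geometric variant of the same sketch would simply replace the LBP histogram with distances computed between detected facial landmarks.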

2.2 Deep learning features

The swift progress of deep learning models has motivated researchers to introduce deep neural networks into FER systems. Therefore, in recent years, most works have leaned toward the use of deep learning techniques for FER [41, 42]. Indeed, a large proportion of the relevant FER systems have relied on CNNs because of their performance and flexibility [43]. In particular, CNN architectures have proved to be more robust than the Multi-Layer Perceptron (MLP) to face location changes and scale variations, especially in the case of previously unseen faces and pose variations [44]. In addition to CNNs, Deep CNNs (DCNN) [45], Deep Belief Networks (DBN) [46], Deep Auto-Encoders (DAE) [47], Recurrent Neural Networks (RNN) [48], Generative Adversarial Networks (GAN) [49], and, recently, transfer learning-based frameworks [50] have been successfully investigated for facial emotion recognition. For instance, Shaees et al. [51] performed a quantitative comparison between an FER method fully based on transfer learning with a pre-trained CNN and a hybrid FER method based on a mixture of deep features, extracted via transfer learning, combined with a mainstream classifier. They chose the pre-trained AlexNet CNN architecture for their first method, whereas a multiclass SVM was adopted as classifier for the second one. They evaluated their methods on two datasets, namely NVIE and CK+: the first method achieved recognition rates of 91.5% and 90.1%, respectively, while the second method increased these rates to 99.3% and 98.3% on the NVIE and CK+ datasets, respectively. In the same context of deep learning approaches, Zhang et al. [52] proposed two FER methods, both based on a double-channel weighted-mixture deep convolutional neural network (WMDCNN) structure; the first method operates on static images, while the second operates on image sequences by adding long short-term memory units (WMCNN-LSTM). In the designed systems, the facial regions are detected by the AdaBoost method, then cropped and rotated, and only the faces are kept by masking the other areas. The experimental results of the WMDCNN network on the CK+, JAFFE, Oulu-CASIA, and MMI datasets achieved average recognition rates of 98.5%, 92.3%, 86%, and 78.24%, respectively. The WMCNN-LSTM architecture, in turn, achieved an average recognition rate of 97.5% on the CK+ dataset, 88% on the Oulu-CASIA dataset, and 87.1% on the MMI dataset. Differently, Minaee et al. [9] introduced a deep learning approach based on attentional convolutional networks, adding a visualization technique in order to identify the most expressive regions related to emotions in face images. The proposed method was evaluated on four datasets (FER2013, Facial Expression Research Group (FERG), CK+, and JAFFE), and recognition rates of 70.02%, 99.3%, 98.0%, and 92.8%, respectively, were reported. Chen et al. [53] used a Deep Sparse Autoencoder Network (DSAN) for learning facial features and Softmax Regression (SR) for the classification of facial expressions; an average emotion recognition rate of 94.761% was reached on the JAFFE dataset. Likewise, the FER system of Li et al. [31] was conceived based on convolutional neural networks for feature extraction, preceded by a pre-processing phase including a new face cropping and rotation technique.
The evaluation of this system was performed on the CK+ and JAFFE datasets, and recognition accuracies of 97.38% and 97.18% were recorded, respectively. However, deep learning methods typically require large numbers of training instances, which makes transfer learning an attractive approach for in-the-wild FER.

Table 1 Summary of relevant studied works for FER in the JAFFE, the CK+, the SFEW_2.0, and/or the FER2013 datasets using hand-crafted, deep learning and hybrid features

2.3 Hybrid features

Despite the success of automated FER systems based on deep learning architectures, many researchers have argued that traditionally extracted (hand-crafted) features contain relevant texture, shape, and appearance information describing facial expressions, and that hand-crafted and deep learning features are therefore complementary. Hence, hand-crafted features can be effectively combined with deep learned features in order to further improve the robustness and accuracy of FER, especially since such hybrid mechanisms are consistent with the psychological mechanisms underlying facial expression recognition [54]. For instance, a Deep Action Units Graph Network (DAUGN) was investigated for facial expression recognition in [54]. The introduced network is based on a segmentation strategy that divides faces into action units, and a CNN is then used to fuse local-appearance and global-geometry features. The proposed FER system was evaluated on the CK+, MMI, and SFEW_2.0 datasets and achieved accuracy rates of 97.67%, 80.11%, and 55.36%, respectively. The obtained results are competitive compared to other works, but remain insufficient for in-the-wild facial images. Similarly, Fan and Tjahjadi [55] proposed a hybrid framework based on deep features learned using convolutional neural networks and hand-crafted features including shape and appearance descriptors. In order to collect hand-crafted features describing local facial properties, shape descriptors from facial landmarks related to the eyes, the nose, and the mouth were combined with PHOG features. The framework achieved an accuracy of 92.5% on the CK+ dataset. However, it was validated on only one dataset, which raises questions about its robustness and its risk of overfitting. Sun and Lv [56] also chose a hybrid model for facial expression recognition: they combined Scale-Invariant Feature Transform (SIFT) descriptors with deep learning features extracted from a CNN model. The method was validated on the CK+ dataset and achieved an accuracy of 94.82%, while cross-dataset experiments on the JAFFE dataset achieved an accuracy of 48.90%. Likewise, the FER method of Gogić et al. [57], called LBF-NN, combined local binary features with deep learned features via a Gentle Boost Decision Trees Neural Network (GBDTNN). The extracted hand-crafted features were based on facial landmarks detected from cropped facial images. The performance of the method was evaluated on four datasets: CK+ with an accuracy rate of 96.48%, 73.73% for MMI, 85.88% for JAFFE, and 49.31% for SFEW_2.0. Nevertheless, the performance of the method remains quite limited for in-the-wild images, since facial expressions in natural settings are dynamic and vary in intensity. Similarly, Alreshidi and Ullah [58] built their facial emotion recognition system using hybrid features: they extracted Neighborhood Difference Features (NDF) from faces detected with AdaBoost cascade classifiers. They tested their approach on the SFEW_2.0 and RAF datasets, achieving precision rates of 57.7% for SFEW_2.0 and 59.0% for RAF. Overall, in-the-wild facial expression recognition methods based exclusively on deep learned features have proved to be more effective than methods combining such features with hand-crafted ones.

Table 1 summarizes some relevant research studies, ranging from early works up to more recent ones, for each category of features (hand-crafted, deep learning, and hybrid methods). The selected works were collected based on the datasets used to validate them (JAFFE, CK+, SFEW_2.0, and/or FER2013). It is clear that the investigated hand-crafted features (e.g. LBP, PCA, LFDA\(\ldots \)) do not yield sufficiently descriptive patterns of facial expressions, whereas deep learning methods show a remarkable improvement of the precision rate, especially under in-the-wild contexts, of up to 18.57%. However, there is still room for improvement, especially in real-condition environments. The contribution detailed in this work focuses on transfer learning from recent deep learning architectures in order to introduce effective solutions for the implementation of FER systems. The most relevant deep face features are studied by challenging several deep architectures in the context of in-the-wild FER. The suggested method aims to fuse relevant features from several pre-trained CNN models in order to use them in a multichannel solution for the recognition of in-the-wild human facial expressions. To the best of our knowledge, this is the first time that deep learning features extracted from pre-trained architectures under in-the-wild conditions are investigated and fused into a single solution to improve FER accuracy.

3 Proposed method

This section details the proposed method for in-the-wild FER. The method performs the FER task with a multichannel convolutional neural network composed of two deep learning stages. The first one, "DL as extractor", uses transfer learning from three pre-trained CNN models, namely VGG19 [60], GoogleNet [61], and ResNet101 [62], to extract features. The second one, "DL as transformer", selects the richest feature layer from each model; the three resulting vectors are then concatenated into a single vector representing the final feature vector, which is fed to an SVM classifier in order to predict the emotion class of the input image. The proposed method aims to gather the most relevant features extracted from the VGG19, GoogleNet, and ResNet101 networks and to exploit their complementarity in order to reduce the error rate. In what follows, we describe the different steps of the proposed emotion recognition procedure: input images are pre-processed and faces are detected; the three pre-trained CNN models are then used for feature extraction; finally, the richest features from each model are selected and concatenated into a single vector, which is fed to the SVM classifier.

3.1 Pre-processing and data augmentation

For this study, the JAFFE, CK+, SFEW_2.0, and FER2013 datasets have been investigated for training and evaluation. All the used datasets comprise face images with seven basic facial expressions (Anger, Surprise, Fear, Disgust, Happiness, Sadness, and Neutral). Dataset samples are shown in Fig. 1, whereas Fig. 2 illustrates the steps of the proposed method in more detail through its instantiation on the JAFFE dataset.

Fig. 1
figure 1

Prototypical facial expression images from the JAFFE dataset (first column), the CK+ dataset (second column), the SFEW_2.0 dataset (third column), and the FER2013 dataset (fourth column)

Fig. 2
figure 2

Technical steps of the proposed FER method

In fact, input images are first converted into the RGB space and then normalized by rescaling the range of intensity values in order to ensure robustness to illumination changes [63]. Non-face parts and useless regions are then removed from the normalized images in order to keep only the face regions. This pre-processing step is important to enhance recognition performance. In our case, the Viola-Jones face detection algorithm [28], which is known for its robustness especially on frontal images, is used to localize the face regions and crop them from the full images of the used datasets. Furthermore, since a convolutional neural network requires a large amount of data to reach good accuracy, the performance of the model can be improved by Data Augmentation (DA) [64]: the more samples the dataset contains, the more features can be extracted from them and the more the model performance can be improved. Thus, given the small size of some public FER datasets, DA techniques are commonly used to increase their size. Translation, rotation, and skewing have notably shown their benefits while remaining computationally efficient [65]. In our case, the data augmentation step creates new images from each cropped image using the following transformations: horizontal and vertical translations, horizontal reflection, and random rotations with an angle in [\(-\,10^{\circ } , 10^{\circ }\)] (Fig. 3). It is worth noting that data augmentation was applied only to the JAFFE and SFEW_2.0 datasets, which include respectively 213 and 1230 images, because of their reduced number of samples compared to the CK+ and FER2013 datasets, which include 5414 and 7178 images, respectively. Figure 3 illustrates some samples of the JAFFE and SFEW_2.0 datasets before and after applying the data augmentation.
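The pre-processing and augmentation steps above can be sketched as follows, assuming OpenCV; the Haar cascade corresponds to the Viola-Jones detector, the translation and rotation ranges follow the values stated above, and the remaining parameters (scale factor, minimum neighbors) are illustrative.

```python
# Sketch of the pre-processing (Viola-Jones face cropping, intensity normalization,
# resizing) and of the geometric data augmentation described above.
import cv2
import numpy as np

detector = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

def crop_face(image_bgr):
    """Detect the face with a Haar cascade, normalize intensities, and resize it."""
    gray = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2GRAY)
    faces = detector.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    if len(faces) == 0:
        return None
    x, y, w, h = faces[0]
    face = image_bgr[y:y + h, x:x + w]
    face = cv2.normalize(face, None, 0, 255, cv2.NORM_MINMAX)  # intensity normalization
    return cv2.resize(face, (224, 224))  # input size of the pre-trained models

def augment(face, rng=None):
    """Return a few geometrically transformed copies of a cropped face."""
    rng = np.random.default_rng() if rng is None else rng
    h, w = face.shape[:2]
    samples = [cv2.flip(face, 1)]                                  # horizontal reflection
    tx, ty = rng.integers(-10, 11, size=2)                         # small translations
    samples.append(cv2.warpAffine(face, np.float32([[1, 0, tx], [0, 1, ty]]), (w, h)))
    angle = rng.uniform(-10, 10)                                   # rotation in [-10, 10] degrees
    rot = cv2.getRotationMatrix2D((w / 2, h / 2), angle, 1.0)
    samples.append(cv2.warpAffine(face, rot, (w, h)))
    return samples
```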

Fig. 3
figure 3

Illustration of the different geometric DA techniques applied on the SFEW_2.0 (first row) and the JAFFE (second row) datasets

3.2 Feature extraction

After resizing the input images to fit the input size of the pre-trained models, which is \(224\times 224\times 3\), the feature extraction part of the proposed method proceeds through two modules. The first one, called "DL as extractor", extracts features from the pre-processed facial images; to this end, transfer learning is applied while benefiting from the advantages of several relevant CNN models. The second module, called "DL as transformer", concatenates the most relevant features selected from each single model to form the final prediction vector. The details of the two proposed modules are discussed in what follows.

- DL as extractor (CNN feature extraction): In order to represent the numerical information behind facial expressions, we performed transfer learning on CNN models pre-trained on the 1000 classes of the ImageNet dataset, so as to discriminate between the seven emotional classes. We tested several well-known deep learning models (ResNet50, ResNet101, VGG16, VGG19, and GoogleNet), which have already shown their effectiveness in several state-of-the-art FER works [9, 66, 67], on the challenging JAFFE dataset in order to assess their performance for the in-the-wild context. For more stable results, we ran each tested model twenty times. The mean and the standard deviation \(\sigma \) (1) were calculated in order to choose the most appropriate models in terms of performance (i.e. highest accuracy means) and stability (i.e. smallest standard deviations). For each studied model, the four best recognition rates, their mean, and their standard deviation are shown in Table 2. According to this table, ResNet101 recorded the highest accuracy mean with the lowest standard deviation, followed by VGG19. The ResNet50 and GoogleNet models have comparable mean and standard deviation values; in this case, the choice of the third model was based on the mean of the three best recognition rates, which gives the advantage to GoogleNet. For reasons related to the size of the final feature vector, with regard to the curse of dimensionality, and in order to have an odd number of sources, we opted for three models among the five tested ones for feature extraction. Thus, the experiments led us to choose the ResNet101, VGG19, and GoogleNet models in order to guarantee the most stable results in the in-the-wild context and therefore the most robust features.

$$\begin{aligned} \sigma =\sqrt{ \frac{1}{n}\sum ^n_{i=1} (x_i-\mu )^2}, \end{aligned}$$
(1)

where \(x_i\) denotes the recognition rates, \(\mu \) is the mean of the best recognition rates, and n is the total number of experiments. Transfer learning was then applied to these pre-trained CNN models while freezing the weights of a customized range of shallow layers, which do not capture relevant information. The weight-freezing technique is applied to each model separately according to its depth, in order to keep only the relevant image properties for the training phase. This first step of the method freezes some shallow layers and keeps the deeper ones, which contain important data and have more ability to learn discriminative features. Freezing these layers saves training time and, above all, discards less reliable features while retaining only the relevant ones that yield more accurate recognition. The deep features extracted from the three models are used afterward to train the SVM classifier. Furthermore, in order to confirm the suitability of the three chosen CNN models for FER in general, and not only for the in-the-wild context, we also evaluated them separately on the CK+, SFEW_2.0, and FER2013 datasets. Each model was tested on all three datasets, and the experiments were repeated twenty times while reporting the four best recognition rates (Table 3). We also calculated the standard deviation and the mean of the recognition rates (Table 4). For the JAFFE dataset, the recognition rate reached 85.71% for VGG19, 83.33% for GoogleNet, and 85.71% for ResNet101. For the CK+ dataset, recognition rates of 89.19% for VGG19, 89.37% for GoogleNet, and 92.70% for ResNet101 were recorded. For the SFEW_2.0 dataset, lower recognition rates were scored: 54.07% for GoogleNet, 57.72% for VGG19, and 60.57% for ResNet101. Finally, for the FER2013 dataset, VGG19 achieved an accuracy of 58.22%, versus 53.69% for GoogleNet and 55.57% for ResNet101. The accuracies achieved by the three tested CNN architectures are relatively good and promising for the datasets acquired in controlled environments, but remain relatively low for uncontrolled environments (the SFEW_2.0 and FER2013 datasets). However, by examining the confusion matrices of the three models on the SFEW_2.0 dataset, we noticed that where one or two of the models fail, there is at least one that performs well. For example, the GoogleNet model fails to recognize the Disgust emotion, whereas the ResNet101 model scores 28.6% for this emotion on the SFEW_2.0 dataset. Detailed results of the confusion matrices, illustrated later in the experimental results section, confirm this finding. This observation prompted us to investigate this complementarity by selecting the most suitable features from each model.
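The extractor stage can be sketched as follows; the paper does not state the implementation framework, so this assumes PyTorch/torchvision, and the number of trainable child blocks and the layer names are illustrative placeholders for the customized freezing ranges and the selected layers described above.

```python
# Sketch of the "DL as extractor" module: ImageNet pre-trained backbones with shallow
# layers frozen, and deep activations collected from one chosen layer per network.
import torch
import torchvision.models as models

def frozen_backbone(name="resnet101", n_trainable_children=2):
    """Load a pre-trained model and freeze all but its last few child blocks."""
    net = getattr(models, name)(weights="DEFAULT")   # "vgg19", "googlenet", "resnet101"
    children = list(net.children())
    for child in children[:len(children) - n_trainable_children]:
        for p in child.parameters():
            p.requires_grad = False                  # customized freezing range
    return net

def extract_features(net, layer_name, images):
    """Return flattened activations of `layer_name` for a batch (N, 3, 224, 224)."""
    store = {}
    layer = dict(net.named_modules())[layer_name]    # e.g. "avgpool" for ResNet101
    handle = layer.register_forward_hook(
        lambda module, inputs, output: store.update(feat=output.flatten(1).detach()))
    net.eval()
    with torch.no_grad():
        net(images)
    handle.remove()
    return store["feat"]
```

For example, `extract_features(frozen_backbone("resnet101"), "avgpool", batch)` would return one deep feature vector per image of the batch.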

Table 2 Comparison results of five models applied on the JAFFE dataset
Table 3 Four best emotion recognition rates of VGG19, GoogleNet and ResNet101 on the JAFFE, the CK+, the SFEW_2.0, and the FER2013 datasets (best values are in bold)
Table 4 The obtained recognition rates (mean and standard deviation (SD)) using the VGG19, the GoogleNet, and the ResNet101 models

- DL as Transformer (Feature concatenation): Several tests were performed in order to choose, for each model, the most suitable layer for extracting discriminative features. Firstly, features were extracted only from the Fully Connected (FC) layers. Subsequent tests showed that more discriminative features can be selected from layers other than the FC ones, notably the pooling layers, which preserve the most essential features of facial images. Thus, the layer-based feature selection process was focused on the last five layers of each model. The process was validated empirically, and several tests were carried out in order to select the most appropriate combination of feature layers for each of the three pre-trained models. These layers contain high-quality features that help to increase the accuracy of the facial expression recognition model. The five best layer combinations, in terms of recognition accuracy, from which the features were extracted are summarized in Table 5 for each of the four datasets. As illustrated in this table, the layers retained from the three models for feature extraction depend on the dataset, which explains the differences in the number of features retained for each dataset. For instance, the Drop7, Fc7, and pool5 layers, selected respectively from the VGG19, GoogleNet, and ResNet101 models, were retained for feature concatenation in the case of the CK+ dataset; this is the best layer combination, giving an accuracy of 98.80%.
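The layer search described above can be sketched as follows, reusing the extract_features helper from the previous sketch; the candidate layer names and the use of a linear SVM for scoring each layer are assumptions consistent with the description, not an exact reproduction of the authors' code.

```python
# Sketch of the layer-based feature selection: features are taken in turn from each of
# the last five layers of a pre-trained model, scored on a held-out split, and the
# best-performing layer is retained.
from sklearn.svm import LinearSVC

def select_best_layer(net, candidate_layers, x_train, y_train, x_val, y_val):
    best_layer, best_acc = None, 0.0
    for layer in candidate_layers:                     # e.g. the last five layer names
        f_train = extract_features(net, layer, x_train).numpy()
        f_val = extract_features(net, layer, x_val).numpy()
        acc = LinearSVC(max_iter=10000).fit(f_train, y_train).score(f_val, y_val)
        if acc > best_acc:
            best_layer, best_acc = layer, acc
    return best_layer, best_acc
```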

Nevertheless, the results illustrated in Table 5 show that the pooling layers contain more relevant features than the fully connected ones. In the majority of cases, combining two pooling layers from two different models with a fully connected layer from the third model was more effective than combining two fully connected layers with one pooling layer, or than combining three fully connected layers. At the end of this stage, three feature vectors per dataset (one for each model), corresponding to the highest recognition rates, are retained. Given the three feature vectors corresponding to the three pre-trained models, the concatenation module constructs, for each dataset, a single feature vector from the three sets of retained features. To do so, we rely on the selection of the most significant layer of each model in order to extract the most relevant information for emotion classification. The vectors extracted from each model are concatenated to form a single vector, as shown in Fig. 4, where the number of extracted features for each dataset is also provided. Thus, once the layer from which the features are extracted has been chosen for each model, the concatenation is applied to form a final single feature vector that is fed to the SVM classifier in order to predict the emotions of the test facial images. For the CK+ dataset, 6151 features were retained from the three models (3079 features from ResNet101, 2048 from GoogleNet, and 1024 from VGG19), whereas 3079 features were selected for the JAFFE dataset (1790 from ResNet101, 521 from GoogleNet, and 768 from VGG19). For the SFEW_2.0 dataset, 3328 features were kept (2048 from ResNet101, 256 from GoogleNet, and 1024 from VGG19). For the FER2013 dataset, 10787 features were retained (6144 from ResNet101, 2048 from GoogleNet, and 2595 from VGG19).
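A minimal sketch of the concatenation module follows; it simply stacks, column-wise, the per-model feature matrices retained by the layer search (for instance 3079 + 2048 + 1024 columns for CK+, giving the 6151-dimensional final descriptor mentioned above).

```python
# Sketch of the feature concatenation: the three retained feature matrices, one per
# pre-trained model, are joined column-wise into the final descriptor fed to the SVM.
import numpy as np

def fuse_features(feats_resnet101, feats_googlenet, feats_vgg19):
    """Each input has shape (n_samples, d_model); the output is (n_samples, d_total)."""
    return np.hstack([feats_resnet101, feats_googlenet, feats_vgg19])
```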

Table 5 Top five layers’ combinations for the four investigated datasets
Fig. 4
figure 4

Principle of layer selection and concatenation

3.3 Emotion classification

After forming the final vector resulting from the concatenation of the features selected from the three initial vectors, the classification step consists in associating each studied image with the corresponding emotion class. As mentioned previously, the test images are different from the training images and their number is smaller. Instead of the classification layers of the models, a linear support vector machine is used as the emotion classifier. With few samples per class, the SVM is effective at assigning new instances from the test set to the different classes based on the learnt emotions. Owing to the relevance of the data obtained in the extraction and concatenation steps, we do not need a nonlinear kernel to transform the features. The SVM thus finds the optimal hyperplane that maximizes the distance, called the margin of separation, between itself and the closest data points. Since we face a multiclass (non-binary) classification problem, we use a linear SVM following the one-vs-rest strategy, which implements the multiclass SVM.
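A sketch of this classification stage is given below, assuming scikit-learn; LinearSVC uses a linear kernel and the one-vs-rest strategy by default, matching the description above, although the regularization value shown is only illustrative.

```python
# Sketch of the emotion classification stage: a linear SVM trained one-vs-rest on the
# fused feature vectors, with no kernel-based feature transformation.
from sklearn.svm import LinearSVC

def train_emotion_classifier(fused_train, y_train):
    clf = LinearSVC(C=1.0, max_iter=10000)   # linear, one-vs-rest by default
    return clf.fit(fused_train, y_train)

# Example usage on fused test features:
# y_pred = train_emotion_classifier(fused_train, y_train).predict(fused_test)
```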

4 Experimental results

Having a large amount of labeled data is necessary to train a neural network and to help it cope with the curse of dimensionality [68]. In this work, four publicly available datasets have been used: (i) the Extended Cohn-Kanade (CK+) dataset, built under laboratory-controlled conditions, which contains a mixture of posed and spontaneous emotions; (ii) the JApanese Female Facial Expression (JAFFE) dataset, also built under laboratory-controlled conditions, which contains only posed emotions; (iii) the Static Facial Expressions in the Wild (SFEW_2.0) dataset; and (iv) the Facial Expression Recognition 2013 (FER2013) dataset, the latter two illustrating spontaneous emotions captured under in-the-wild conditions. In what follows, we give a brief overview of these datasets before presenting the results.

1- Extended Cohn-Kanade dataset (CK+): This dataset is an extended version of the "CK" collection, which was released in 2000 in order to promote research in the field of facial expression detection [11]. All images were acquired in controlled environments. The subjects are both male and female, 31% men and 69% women, with ages ranging from 20 to 45 years [69]. The dataset includes 593 image sequences, varying in duration from 10 to 60 frames, collected from 123 subjects. Every image has a resolution of \(640\times 490\) or \(640\times 480\) pixels, and the images cover seven emotion categories: the six basic emotions (Anger, Disgust, Fear, Happiness, Sadness, Surprise) plus Contempt [12].

2- Japanese Female Facial Expression dataset (JAFFE): This is a laboratory-controlled dataset. As a benchmark collection, the JAFFE dataset is composed of 213 grayscale facial expression images of 10 Japanese women. The dataset is categorized into seven expressions: Neutral plus the six basic emotional expressions (Anger, Disgust, Fear, Happiness, Sadness, and Surprise). Each image is \(256\times 256\) pixels, each image was rated on six emotion adjectives by 60 Japanese subjects, and each expressor has 2–4 samples per expression. In this dataset, the same expression of one person may differ greatly across samples, and distinct expressions may not be very distinguishable [70].

3- The Static Facial Expressions in the Wild (SFEW_2.0): This is a static dataset covering unconstrained facial expressions, different head poses, a wide age range, and varied face resolutions and focus, making it close to real-world conditions. It was extracted from the temporal Acted Facial Expressions in the Wild (AFEW) dataset and was first published in 2011 by Dhall et al. [71]. Consequently, it is analogous to the AFEW set, except that it is composed of static movie frames. Each frame is associated with an expression label (Angry, Disgust, Fear, Happy, Sad, Surprise, or Neutral) under close to real-world conditions. The SFEW_2.0 dataset contains 1766 images partitioned into 958, 436, and 372 images for the training, validation, and test sets, respectively.

4- The Facial Expression Recognition 2013 (FER2013): This dataset was developed by collecting face images available on the Internet using the Google Image Search API. All images were captured in uncontrolled environments, which makes it a challenging standard benchmark for in-the-wild FER [67]. It contains 35,887 images belonging to the seven main emotion classes (4953 "Anger" images, 547 "Disgust" images, 5121 "Fear" images, 8989 "Happiness" images, 6077 "Sadness" images, 4002 "Surprise" images, and 6198 "Neutral" images), divided into a training set and a test set [17]. The images are grayscale with a size restricted to \(48\times 48\) pixels.

4.1 Data preparation and validation protocol

For this study, the four datasets, JAFFE, CK+, SFEW_2.0, and FER2013, including respectively 213, 5414, 1230, and 7178 images, have been investigated. The images of the CK+ dataset were manually divided into six emotion classes, and a seventh class, "Neutral", was built by collecting the first three frames of each emotion sequence of every subject from the six classes. We selected 5414 images covering the six emotion categories (happiness, fear, sadness, surprise, anger, and disgust), while ignoring the "Contempt" class. The JAFFE dataset was also manually divided into seven emotion classes, whereas the images selected from the SFEW_2.0 and FER2013 datasets were used as downloaded. Each dataset was randomly split into training and testing samples with a ratio of 80:20. Table 6 presents the numbers of samples in the training and testing partitions and the total number of images used for each dataset. All the CNN models were trained for at most 55 epochs. The ADAM optimizer was applied to the GoogleNet and ResNet101 models, while the SIGMOID was used to optimize the VGG19 model. The initial learning rate was fixed at \(10^{-4}\) for all the models.
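The protocol above can be sketched as follows, assuming scikit-learn for the random 80:20 split and PyTorch for optimization; only the ADAM configuration used for GoogleNet and ResNet101 is shown, and the random seed is an illustrative choice.

```python
# Sketch of the validation protocol: random 80:20 train/test split, ADAM optimizer with
# an initial learning rate of 1e-4, and a maximum of 55 training epochs.
import torch
from sklearn.model_selection import train_test_split

MAX_EPOCHS = 55
INITIAL_LR = 1e-4

def split_dataset(images, labels, seed=0):
    """Randomly split a dataset into 80% training and 20% testing samples."""
    return train_test_split(images, labels, test_size=0.2, random_state=seed)

def make_adam_optimizer(net):
    """Optimize only the non-frozen parameters of a backbone (GoogleNet or ResNet101)."""
    trainable = [p for p in net.parameters() if p.requires_grad]
    return torch.optim.Adam(trainable, lr=INITIAL_LR)
```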

Table 6 Numbers of samples for the four datasets

The performance of the proposed method is presented on the above datasets. The results produced by the proposed multichannel CNN solution for facial emotion recognition are presented in two separate parts. The first part concerns the first, feature extraction, deep learning network; the second part gives the final accuracy rates after feature selection and concatenation. It is worth mentioning that all accuracies refer to testing accuracy on samples that are not included in the training set. The outputs of the first deep learning network used as extractor (first step), where for each model weight freezing was applied to certain blocks of layers during the training phase, are presented first. The confusion matrices summarize the prediction results for each emotion separately; they were generated to assess each pre-trained model individually. These matrices are presented for each pre-trained model and for the proposed model, firstly to demonstrate that the used models are complementary and do not err on the same emotions, and secondly to show that feature concatenation can enhance the recognition rate for emotions that are hard to capture.
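The confusion matrices discussed below can be reproduced as sketched here, assuming scikit-learn; rows correspond to the true emotions, columns to the predicted ones, and each row is normalized to percentages. The class ordering is an illustrative assumption.

```python
# Sketch of the per-emotion evaluation: a row-normalized confusion matrix expressed in
# percentages, computed on test predictions of a single model or of the fused model.
from sklearn.metrics import confusion_matrix

EMOTIONS = ["Anger", "Disgust", "Fear", "Happiness", "Neutral", "Sadness", "Surprise"]

def emotion_confusion(y_true, y_pred):
    cm = confusion_matrix(y_true, y_pred,
                          labels=list(range(len(EMOTIONS))), normalize="true")
    return 100.0 * cm  # cm[i, j]: % of true emotion i predicted as emotion j
```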

4.2 Results of the three models on the JAFFE dataset

Three feature vectors have been selected for the JAFFE dataset, with accuracies of 85.71% for VGG19, 83.33% for GoogleNet, and 85.71% for ResNet101. We report in Tables 7, 8 and 9 the corresponding confusion matrices, which show that the VGG19 model achieves a 100% recognition rate for four emotions (Fear, Happiness, Neutral, and Surprise), whereas the Anger emotion is recognized only at a rate of 50%; Disgust and Sadness have recognition rates of 66.7% and 83.3%, respectively. The GoogleNet model, however, achieves a 100% recognition rate for the Anger and Disgust emotions, which are recognized at only 50% and 66.7%, respectively, by VGG19. GoogleNet also achieves a 100% recognition rate for the Fear emotion.

Table 7 The confusion matrix of the VGG19 model on the test set of JAFFE
Table 8 The confusion matrix of the GoogleNet model on the test set of JAFFE
Table 9 The confusion matrix of the ResNet101 model on the test set of JAFFE
Table 10 Misclassified JAFFE images (Original Class–Predicted Class)

The ResNet101 model recognizes the Fear, Happiness, and Surprise emotions at 100%. It reaches 83.3% for the Disgust and Neutral emotions, and a 66.7% recognition rate for Anger and Sadness. Although GoogleNet does not reach a high accuracy for the Happiness, Surprise, and Neutral emotions, ResNet101 recognized Happiness and Surprise at 100% and reached a rate of 33.3% for the Neutral emotion. While the Neutral class had an average recognition rate of 50% with GoogleNet, it reached 100% with VGG19, which also obtained a 16.6% better success rate for Sadness compared to GoogleNet and ResNet101. Overall, the recorded results show a complementarity between the three models in recognizing the seven emotional classes, which allows us to conclude that some models correctly classify emotions that other models misclassify. This finding is illustrated by Table 10(a), which shows an image misclassified by the GoogleNet model and correctly classified by the VGG19 model, and by Table 10(b), which presents an image misclassified by the ResNet101 model and correctly classified by the GoogleNet model.

4.3 Results of the three models on the CK+ dataset

Tables 11, 12 and 13 gather the confusion matrices representing the accuracies of the resulting feature vectors of each studied model on the CK+ dataset. The global recognition rate is 89.19% for VGG19, 89.37% for GoogleNet, and 92.70% for ResNet101. Comparing the confusion matrices of the three pre-trained models, it is clear that VGG19 recognizes the "Happiness" emotion best, with a recognition rate of 95.3%, compared to the GoogleNet and ResNet101 models, while for the "Fear" emotion VGG19 and ResNet101 achieve the same recognition rate of 92.5%. The GoogleNet model recognizes the "Sadness" emotion best, with an accuracy of 97.8%. For the "Disgust" emotion, GoogleNet and ResNet101 achieve a recognition rate of 91.1%, while the Anger, Neutral, and Surprise emotions are best recognized by ResNet101, with accuracies of 97.7%, 85.8%, and 95.6%, respectively. As for the JAFFE dataset, some images of this dataset are misclassified by one model but correctly classified by another. Table 14 shows some examples: images (a, b) are misclassified by GoogleNet but correctly classified by ResNet101; image (c) is correctly classified by ResNet101 and misclassified by VGG19; image (d) is misclassified by GoogleNet but correctly classified by VGG19; and the final image (e) is incorrectly classified by ResNet101 while being correctly classified by GoogleNet.

Table 11 The confusion matrix of the VGG19 model on the test set of CK+
Table 12 The confusion matrix of the GoogleNet model on the test set of CK+
Table 13 The confusion matrix of the ResNet101 model on the test set of CK+
Table 14 Misclassified CK+ images (Original Class–Predicted Class)

4.4 Results of the three models on the SFEW_2.0 dataset

Tables 15, 16 and 17 show the confusion matrices illustrating the emotion recognition rates of each model on the facial images of the SFEW_2.0 dataset, which are taken in real conditions. The mean accuracies are as follows: 57.72% for VGG19, 54.07% for GoogleNet, and 60.60% for ResNet101. The VGG19 model achieves the best recognition rates for Happiness and Surprise compared to GoogleNet and ResNet101, at 89.4% and 60.7%, respectively. The GoogleNet model could not recognize the Disgust emotion; however, it achieves the best recognition rates of 66.7% for the Sadness emotion and 54.2% for the Fear emotion. The Anger, Disgust, and Neutral emotions are best recognized by the ResNet101 model, with rates of 63.8%, 28.6%, and 56.8%, respectively.

Table 15 The confusion matrix of the VGG19 model on the test set of SFEW_2.0
Table 16 The confusion matrix of the GoogleNet model on the test set of SFEW_2.0
Table 17 The confusion matrix of the ResNet101 model on the test set of SFEW_2.0

Table 18 illustrates some examples of images that are misclassified by one model while being correctly classified by another. For instance, the image shown in Table 18(a) was misclassified by ResNet101 but correctly classified by GoogleNet; the image in Table 18(b) was misclassified by VGG19 and correctly classified by ResNet101; and the last example, in Table 18(c), was misclassified by the GoogleNet model and correctly classified by the VGG19 model.

Table 18 Misclassified SFEW_2.0 images (Original Class–Predicted Class)

4.5 Results of the three models on the FER2013 dataset

The three vectors selected for the FER2013 dataset in the first step have the following recognition rates: 58.22% for VGG19, 53.69% for GoogleNet, and 55.57% for ResNet101. The corresponding confusion matrices are reported in Tables 19, 20 and 21. The best emotion recognition rate was achieved for the Happiness emotion, at 81.70%, by the VGG19 model. The lowest recognition rate is for the "Disgust" class, with an accuracy of 18.20% achieved by the GoogleNet model. The ResNet101 and VGG19 models achieved the same accuracy of 43.80% for the Anger class, whereas GoogleNet recognized this emotion better. The Fear and Sadness emotions are better recognized by the VGG19 model, with accuracies of 37.10% and 55.40%, respectively, while the Surprise emotion reached 74.70% with the VGG19 model. The principle of complementarity is also confirmed by the FER2013 results. Table 22 shows some examples of FER2013 images that are misclassified by one model and correctly classified by another. In Table 22(a), the image is misclassified by the ResNet101 model but correctly classified by GoogleNet, whereas the image in Table 22(b) is correctly classified by GoogleNet and misclassified by VGG19. Another image, misclassified by ResNet101 while being correctly classified by the VGG19 model, is shown in Table 22(c).

Table 19 The confusion matrix of the VGG19 model on the test set of FER2013
Table 20 The confusion matrix of the GoogleNet model on the test set of FER2013
Table 21 The confusion matrix of the ResNet101 model on the test set of FER2013
Table 22 Misclassified FER2013 images (Original Class–Predicted Class)

4.6 Results of the proposed model after feature extraction and concatenation on the four used datasets

The second key step of the proposed emotion recognition system is the selection of features from each CNN model, their fusion into a single vector, and the SVM-based classification of this vector. Based on the results obtained by the three models on the four datasets, we can notice that they are complementary; it is therefore beneficial to combine their features so that the shortcomings of one model are compensated by the performance of the others. This led us to fuse the features extracted from the three models and to feed them to a supervised classifier. The observation was confirmed by the experimental results obtained after feature extraction and concatenation, which gave the best recognition rates of the three models. Consequently, the use of mixed features from the three models considerably improved the overall recognition rate: the experiments performed using the concatenated features achieved an overall recognition rate of 97.62% on JAFFE, 98.80% on CK+, 88.20% on SFEW_2.0, and 94.01% on FER2013. To evaluate the overall performance of the proposed method, the confusion matrices on the four datasets are given in Tables 23, 24, 25 and 26.

Table 23 Confusion matrix of the proposed method on the JAFFE dataset
Table 24 Confusion matrix of the proposed method on the CK+ dataset
Table 25 Confusion matrix of the proposed method on the SFEW_2.0 dataset
Table 26 Confusion matrix of the proposed method on the FER2013 dataset
Table 27 Comparison between the FER accuracy of the proposed method and the ones recorded by relevant methods from the state-of-the-art on the following datasets: (a) JAFFE, (b) CK+, (c) SFEW_2.0, (d) FER2013

The process of fusing the resulting feature vectors of JAFFE led the proposed FER system to recognize the Sadness emotion perfectly, reaching a 100% recognition rate for this emotional class, whereas it was previously recognized at 72.23% on average. Similarly, 100% was reached for the Anger and Disgust emotions. The fusion also increased the recognition rate of the Neutral emotion to 83.3%, while a 100% recognition rate was obtained for the other emotions. On the CK+ dataset, the fused feature vector improves the recognition rates of all emotions, which is reflected by the increase in the overall accuracy to 98.8%. The best recognition rate of 97.8%, achieved by the GoogleNet model for the Sadness emotion, reached 100% using the concatenated vector. The lowest recognition rate, that of the Neutral class, which had a mean value of 76.7%, reached 97.7%, whereas the Surprise emotion increased to 97.6%. An average of 99.4% was achieved for the Anger, Disgust, Fear, and Happiness emotions. Regarding the second type of datasets (in-the-wild conditions), the SFEW_2.0 dataset is more challenging than the other facial expression datasets due to the complexity of the backgrounds and the natural situation of the human faces; we note a striking improvement in the emotion recognition rate for this dataset, from an average of 57.46% to 88.2%, a considerable increase of 30.74%. For the second in-the-wild dataset (FER2013), which is known to be one of the most challenging datasets in the emotion recognition domain since it contains images of cartoons and emojis in addition to human facial images, the recognition rate obtained with the concatenated vector is clearly improved compared to that of each model taken separately. Indeed, the lowest accuracy, that of the Disgust emotion, increased from 18.2% to 68.2%, and the accuracy of the Fear emotion increased from 23.4% to 91.2%. All the other recognition rates increased to an average of 92%, and the Happiness emotion was the best recognized, with an accuracy of 98.3%. Many other challenging factors in facial emotion recognition can also reduce the recognition rate, particularly when the images are taken under in-the-wild conditions. It is worth mentioning that such images differ from images taken in laboratory conditions: in-the-wild, head poses vary because the individuals are moving and the distance between the persons and the camera is variable, contrary to controlled conditions, where the subjects face the camera at the same distance with an upright head pose. These challenging factors were overcome through the concatenation of the relevant features of the three models. Assembling the features reduced the error rate compared to each model taken separately and remarkably improved the overall recognition rate, especially for the in-the-wild datasets. The limitations of the single models were largely covered by the union of the three models into a global one able to predict the emotions more precisely. Each model correctly classified a set of images different from the sets correctly classified by the other models, and the feature concatenation applied in this work made it possible to retain the maximum number of images correctly classified across the three models, especially for the in-the-wild datasets FER2013 and SFEW_2.0. This explains the performance increase from around 50% to over 80%.
Overall, the obtained results show that exploiting the complementarity of several deep learning models and extracting features from different models can counteract the difficulties of capturing facial emotions in-the-wild.

Table 28 Sample of challenging in-the-wild conditions

4.7 Comparison of the suggested method with relevant methods from the state-of-the-art

This section presents a comparison of the proposed FER method with relevant emotion recognition methods from the literature. For a fair comparison, we ensured that the compared methods use deep learning, transfer learning, and CNN architectures, and that they are validated on the same datasets as those of this study. The comparison results on the JAFFE, CK+, SFEW_2.0, and FER2013 datasets are summarized in Table 27 (a, b, c, and d, respectively). The proposed method clearly outperforms all the compared methods on all datasets. For instance, it outperforms the method presented in [72] on the JAFFE dataset by 20.35%. Furthermore, on the CK+ dataset, the accuracy of the proposed method exceeds that of the method in [73] by 7.8% and that of [51] by 8.7%, while the average increase in accuracy compared to the other methods varies from 0.3% to 5.56%. Likewise, on the SFEW_2.0 dataset, the proposed method outperforms the relevant state-of-the-art methods by a considerable margin of up to 39.18%, which represents the highest increase among all the studied datasets. Similarly, the validation of the proposed method on the FER2013 dataset shows an improvement over recent studies: the improvement of the recognition rate reaches 24.45% compared to [74] and 15.02% compared to the best accuracy obtained in [42].

Table 29 A sample of misclassified images by the proposed method: Original Class (green)–Predicted Class (red)

Overall, the proposed method achieves better rates than all the compared methods on the four datasets. In particular, there is a substantial increase in the recognition of spontaneous emotions within the SFEW_2.0 and FER2013 datasets. This was expected, since the proposed method is designed to take advantage of the complementarity of deep learning models, especially in the in-the-wild context. We can also notice that the misclassifications of the proposed model mainly concern the classes with fewer samples than the others, for instance the “Disgust” class of the SFEW_2.0 and FER2013 datasets, which includes only 14 and 22 samples, respectively. This emotion was recognized at 71.4% for the SFEW_2.0 dataset and at 68.2% for the FER2013 dataset. Thus, seven images were incorrectly classified in the FER2013 dataset and only four facial images in the SFEW_2.0 dataset, particularly images where the expressions are not accentuated, such as the ones presented in Table 29(c’) and the last two images in Table 29(c”). Likewise, for the FER2013 dataset, the images in Table 29(d) do not strongly express the emotions; under those circumstances, it is difficult even for a human being to identify the specific emotion. Table 29 gathers qualitative results for some examples from the four datasets. For the JAFFE dataset, the only misclassification made by the model concerned a facial image whose “Fear” expression is too similar to the “Neutral” one. Considering all the expressions of the corresponding subject, we notice that they are very similar and not very expressive, which explains the error in Table 29(a). For the CK+ dataset, misclassifications also occur, as shown in Table 29(b): because of the similarities between facial expressions, some images from different classes were classified as the “Neutral” emotion. Furthermore, the misclassified images are the first frames of the image sequence describing the emotion, where the expression has not yet appeared; these images correspond to the first three images of each sequence, which constitute the Neutral class. Concerning the second type of datasets, which include images captured in the real world, some facial emotions were incorrectly predicted by the model for the SFEW_2.0 dataset, as shown in Table 29(c). We may attribute these recognition failures to the varying emotion intensities, the various poses and lighting conditions of movie scenes, and other uncontrolled conditions present in this dataset, mainly resolution variations, different age groups, and occlusion. The dataset even contains images combining more than two challenges at a time; the different challenges are illustrated in Table 28. Such images are delicate even for a human being to classify. Nevertheless, the proposed scheme was able to achieve 88.20% accuracy thanks to the quality and complementarity of the relevant features extracted from each model. As far as we know, this result has not been reported before, which is remarkable given the challenging nature of in-the-wild FER. Turning to the FER2013 dataset, some images were not correctly classified, as presented in Table 29(d).
The degradation of the results in these cases is due to the unbalanced distribution of the number of samples per class: for instance, the “Disgust” class contains 111 samples, while the “Happy” class includes 1774 samples. In addition to the uncontrolled conditions present in the other datasets, FER2013 is characterized by other specific constraints, such as the high number of babies’ and children’s facial images, the variety of skin colors and facial features, and the presence of cartoon images and emojis (Table 28). To sum up, most of the images that were misclassified by the individual models were correctly classified using the deep features extracted and concatenated into a single feature vector. Moreover, the images that remained misclassified after the feature concatenation were misclassified by at least two of the three models and, in most cases, by all three. These images are either very challenging (involving occlusions and extreme head pose deviations) or express the emotion with very low intensity. The concatenation of deep features, combined with the choice of suitable layers, was generally able to resolve the false classifications produced by the individual models. In fact, the concatenation compensated for the weaknesses of one model with the others, through the dynamic selection of the most relevant layers containing the most discriminant features.
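As an aside, the per-class recognition rates and the lists of misclassified images discussed above can be derived from a confusion matrix. The short sketch below is our own illustrative Python example using scikit-learn, not part of the proposed pipeline; the class ordering is an assumption.

import numpy as np
from sklearn.metrics import confusion_matrix

CLASSES = ["Anger", "Disgust", "Fear", "Happiness", "Neutral", "Sadness", "Surprise"]

def per_class_recognition(y_true, y_pred):
    # Row-normalised diagonal of the confusion matrix = recall (recognition rate)
    # of each emotion class, expressed as a percentage.
    cm = confusion_matrix(y_true, y_pred, labels=list(range(len(CLASSES))))
    rates = cm.diagonal() / cm.sum(axis=1)
    return dict(zip(CLASSES, np.round(100 * rates, 1)))

def misclassified_indices(y_true, y_pred):
    # Indices of the test images to inspect visually (as in Table 29).
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return np.flatnonzero(y_true != y_pred)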

Overall, according to the experimental results obtained with the multichannel CNN method based on deep learning techniques on the well-known CK+, JAFFE, FER2013, and SFEW_2.0 datasets, the proposed method shows high recognition accuracy thanks to the richness of each pre-trained model selected in this study (VGG19, GoogleNet, ResNet101) and to the relevance of the deep features extracted from each one. Besides, freezing the layers at a level tailored to the depth of each pre-trained model allowed us to save time and improve the quality of the extracted features. Indeed, the execution time recorded by the proposed method for the test phase is 16.537 ms, 47.200 ms, 36.463 ms, and 36.149 ms for the JAFFE, CK+, FER2013, and SFEW_2.0 datasets, respectively, using the following hardware configuration: Intel(R) i7 9th generation CPU, NVIDIA GeForce RTX2060 GPU, and 16GB RAM. This computational cost is among the lowest recorded by state-of-the-art works tested on the same datasets. For instance, although a more powerful hardware configuration (NVIDIA Quadro P5000 with 16GB of GDDR5X GPU memory, 2560 CUDA cores, and a 256-bit memory interface) was used in [42], execution times of 402.6 ms, 569.7 ms, and 1161.2 ms were recorded for the JAFFE, CK+, and FER2013 datasets, respectively. Moreover, in [75] (resp. [66]), an average execution time of 60 ms (resp. 34 ms) was reported when testing on the three datasets with a hardware configuration similar to the one used in our work. Thus, the proposed method proves to be cost-efficient in terms of computational time, with an average of 34 ms, which represents roughly half of the time required by the work in [75] while being competitive with [66].
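For reproducibility, the per-image test-phase latency quoted above can be measured along the lines of the following sketch (an illustrative PyTorch example under our own assumptions, with CUDA synchronization before reading the timer so that the asynchronous GPU work is fully accounted for); the actual measurement protocol used in the compared works may differ.

import time
import torch

def average_inference_time_ms(model, loader, device="cuda"):
    # Average per-image forward-pass time over a test DataLoader, in milliseconds.
    model.to(device).eval()
    total_s, n_images = 0.0, 0
    with torch.no_grad():
        for images, _ in loader:
            images = images.to(device)
            if device == "cuda":
                torch.cuda.synchronize()   # make sure previous GPU work is finished
            start = time.perf_counter()
            model(images)
            if device == "cuda":
                torch.cuda.synchronize()   # wait for the forward pass to complete
            total_s += time.perf_counter() - start
            n_images += images.size(0)
    return 1000.0 * total_s / n_images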

5 Conclusion

In order to depict facial emotions more accurately, we have proposed a robust and computationally efficient method based on multichannel deep feature extraction and concatenation. The proposed FER method relies on two dual deep learning networks: the first one is dedicated to deep feature extraction from three CNN models, while the second one is used for feature selection and concatenation. The validity of the proposed method has been assessed on four widely used FER datasets. The investigated datasets present various types of emotions: posed and spontaneous, captured under laboratory conditions and in-the-wild conditions. The first-line experiments performed in this study with the three pre-trained CNN models led to two findings. Firstly, the three CNN models are highly effective at capturing human facial emotions; secondly, these models do not share the same weaknesses and strengths regarding the recognition of the different emotion classes. Therefore, the main challenge addressed by the suggested method is to combine deep features coming from multiple channels of convolutional neural networks, thereby getting the most out of the models’ performances through their complementarity. The main objective is to achieve better results than each model applied separately, notably for in-the-wild environments.
The experimental study of the proposed method for emotion recognition has been divided into two parts. The first one presented the results of the DL used as extractor. It highlights the efficiency of the geometric DA techniques, which increased the amount of training images and thereby improved the relevance of the training of the deep layers remaining after weight freezing. These results also emphasize the efficiency of the features extracted from these remaining layers and the quality of the information carried by each model. The second part showed the results of the DL used as transformer, which produces the final recognition rates at the output of the multichannel convolutional neural network. This DL aims to deliver a single emotion prediction vector for each dataset; thus, the final vector encompasses the most relevant features selected from rich layers, which helped improve the final accuracy. In summary, the suggested method outperforms many relevant state-of-the-art methods, in addition to all the single-model-based methods. It achieved 97.62% on the JAFFE dataset and 98.80% on the CK+ dataset, while obtaining 88.20% on the SFEW_2.0 dataset and 94.01% on the FER2013 dataset. According to the experimental results and to the comparative analysis with several state-of-the-art works, the results obtained in this work outperform those in the literature for the four datasets, especially for the FER2013 and SFEW_2.0 datasets, for which the recognition rate is higher by about 20% and 34%, respectively, on average, compared to previously reported rates. This confirms the ability of the concatenated vector, formed by heterogeneous deep features extracted from the three CNN models, to enhance the accuracy of emotion recognition, particularly for the in-the-wild datasets where the enhancement has been remarkable. Furthermore, the proposed method could be improved in future work by investigating the use of action units and facial landmarks in order to better recognize facial images with non-accentuated expressions.