1 Introduction

Most people believe they know a great deal about their own emotions; nevertheless, psychologists still struggle to reach a consensus about the nature and the working mechanisms of emotions [1]. Emotions, which are relatively brief, are fundamental human features that play important roles in social communication and affect all social phenomena [2]. They allow an observer to infer the emotional states and intentions of others, making it possible to anticipate their gestures and to regulate one's own behavior accordingly. Emotions manifest through different reactions, such as physiological responses (changes in tone of voice, palpitations, flushing, accelerated pulse), gestural expressions, and facial expressions. However, defining human emotion is not simple, and the complexity that emotions carry has aroused the interest of many researchers [3]. Darwin emphasized that emotion is a response to the environment [4], while Dam et al. [5] defined emotion as a reaction to an event that appears suddenly and does not last long.

Most existing works share the goal of classifying the input into one of the seven basic emotion classes (happiness, sadness, neutrality, disgust, fear, surprise, and anger); they differ only in the modalities used [6] and in the media from which the features and information needed to predict emotions are extracted [7]. Among the relevant modalities, facial expressions are one of the most popular [8] for several reasons: they are visible, they contain many useful features for emotion recognition, and it is relatively easy to collect a large dataset of face images [9]. It is worth mentioning that image datasets acquired under controlled laboratory conditions are more available than those acquired under uncontrolled (in-the-wild) conditions. Among them, the most widely used are the JApanese Female Facial Expression (JAFFE) dataset [10], the Cohn-Kanade (CK) dataset [11] and its extended version (CK+) [12], the Oulu-CASIA dataset [13], the AffectNet dataset [14], the Acted Facial Expressions in the Wild (AFEW) dataset and its static version, the Static Facial Expressions in the Wild (SFEW_2.0) dataset [15, 16], and the Facial Expression Recognition 2013 (FER2013) dataset [17].

Nevertheless, Facial Emotion Recognition (FER) has remained an active research topic over the past decades due to various challenging factors such as illumination changes, head pose, head motion, motion blur, age, gender, and skin color [18]. In fact, FER remains difficult, particularly in-the-wild and in unconstrained real-life environments. Early approaches to automatic facial expression recognition [19] usually perform quickly and accurately in indoor environments, but their performance frequently drops under real-world conditions [20]. Several challenging issues therefore remain. Indeed, most studies have based hand-crafted feature extraction entirely on human expertise, which makes such approaches overly complex in some real applications, and classical methods consequently struggle to extract prominent features. To address this challenge and achieve higher accuracy, recent investigations have turned to FER systems based on deep learning techniques.
Thus, investigating deep neural network models for facial expression analysis has become one of the most active subjects in recent facial analysis works [21]. In fact, feature learning allows deep networks to capture a broader range of facial features than earlier approaches, including robustness to rotation and illumination changes, and it has been shown that Convolutional Neural Networks (CNNs) trained for facial expression recognition can learn facial features reflecting those suggested by the psychologist Ekman [22].

Overall, several recent works have effectively dealt with FER issues using CNNs [23]. Nevertheless, CNN models exhibit limitations that deserve more attention, notably accuracy rates that could still be higher, especially in-the-wild. To cope with this limitation, we focus on the features provided by different CNN models and on the ability of each model to reach high precision rates on its own. Our idea is to gain robustness from multiple complementary sources rather than from a single model. Accordingly, we propose in this work to build upon the fusion of deep features supplied by different CNN models. More precisely, we have studied ResNet101, which has proven its efficiency in learning with deep layers thanks to residual learning; VGG19, which is a shallower model but with a remarkable number of parameters; and GoogleNet, which ensures a balance between efficiency and training speed while reducing the number of network parameters. The proposed method follows a standard FER scheme in which face images are normalized and then augmented. Thereafter, features are extracted from the pre-processed images using pre-trained CNN architectures and finally classified with an SVM classifier. The proposed method relies on a layer-based feature selection performed on each pre-trained model separately. The three feature vectors selected from different layers are then concatenated into a single final vector. The suggested scheme ensures the complementarity of the facial expression features extracted from the three pre-trained architectures. It is composed mainly of two phases: training and validation. During the training phase, images are pre-processed, faces are detected, and features are extracted from each model and concatenated into a single vector that is fed to an SVM classifier. The same pipeline is followed during validation. The main contributions of this work are twofold:

  • We apply three pre-trained neural networks to extract complementary features combined into a multichannel solution, with customized weight freezing during the training phase. A layer-based feature selection is performed on each pre-trained model separately: a layer search is conducted over the last five layers, including the fully connected (FC) ones, and the layer providing the best features is selected and its features retained.

  • The final feature vector is formed by concatenating the features retained from the different pre-trained models. The concatenation yields a single model that gathers the most relevant facial information extracted by the three base models. The overall error rate is reduced compared to each single model, since the failures of one model can be compensated by another.

Extensive experiments have been carried out on the most challenging FER datasets available today (the JAFFE dataset of Japanese female images, the Extended Cohn-Kanade (CK+) dataset, the Facial Expression Recognition 2013 (FER2013) dataset, and the SFEW_2.0 dataset of static images in the wild), and the proposed method has led to very promising results.

The remainder of this paper is organized as follows: Section 2 briefly reviews relevant existing FER methods. In Sect. 3, we describe the proposed method. In Sect. 4, an overview of the datasets used in this work is given before presenting the experiments and a performance comparison with relevant state-of-the-art methods. Finally, conclusions and future research directions are given in Sect. 5.

2 Related work

A standard FER system essentially involves three key components, namely face detection and pre-processing, feature extraction, and classification. Face detection aims to determine the location and the size of the human face, or faces, within the input image [24]. The most widely used methods for face detection include MTCNN [25], Dlib [26], eigenface techniques [27], and the Viola-Jones algorithm [28]. Although face detection is an essential step enabling feature extraction, image pre-processing is usually required for the alignment and normalization of the visual semantic information conveyed by the face. Its primary function is to discard the variations irrelevant to facial expressions, such as different backgrounds, illuminations, and head poses, which are fairly common in unconstrained scenarios, while keeping as many meaningful features as possible [29]. The second stage, feature extraction, aims to extract facial features from the pre-processed images of the detected faces [30]. The third stage is the classification of the extracted facial features into one of the basic emotion classes. Unlike traditional methods, where the feature extraction stage is independent of the classification stage, deep networks can perform FER in an end-to-end manner [29]. Indeed, the way facial changes are encoded into features [31] facilitates emotion prediction for FER systems. In the remainder of this section, an overview of various FER works is briefly presented, focusing on those that have been validated on the JAFFE, CK+, FER2013, and/or SFEW_2.0 datasets. These works are categorized, according to the adopted feature extraction approach, into three major groups: hand-crafted features, deep learning features, and hybrid ones.

2.1 Hand-crafted features

Early emotion recognition works were based on hand-crafted feature representation methods, which are commonly divided into two categories: appearance (template-based) features and geometric features. Appearance feature extraction methods (e.g. Gabor filters [32], Local Binary Patterns (LBP) [33], Histograms of Oriented Gradients (HOG) [34]\(\ldots \)) are applied to the whole face image, whereas geometric feature-based methods commonly exploit landmark points in order to compute geometric distances between face regions [35]. It is worth noting that most existing hand-crafted methods use a combination of these two approaches [36]. For instance, Zhang et al. [37] cropped images of size \(110\times 150\) pixels after automatically detecting faces using a set of rectangular Haar-like features. Features were then extracted using local binary patterns before applying Local Fisher Discriminant Analysis (LFDA) in order to produce a low-dimensional representation of the extracted data. An accuracy of 90.7% was reached by this method on the JAFFE dataset. Likewise, Abdulrahman and Eleyan [38] focused their contribution on the feature extraction step. Their system used LBP as a feature extractor and Principal Component Analysis (PCA) for dimensionality reduction of the feature vectors, which were then fed to a Support Vector Machine (SVM) for classification. Experiments carried out on the JAFFE and MUFE datasets yielded accuracies of 87% and 77%, respectively. Alshamsi et al. [39] opted for the Hausdorff distance for pre-processing and face detection, followed by a combination of facial landmarks and centers of gravity for feature extraction. An SVM classifier was then applied, reaching an accuracy of 96.3% on the CK+ dataset, 91.9% on the JAFFE dataset, and 90.8% on the KDEF dataset. Differently, the FER system designed by Gite et al. [40] detects faces from facial images using the Viola-Jones algorithm; then, a combination of geometric and appearance-based techniques is explored in order to extract reliable features. The authors investigated the coordinates of face landmarks before reducing the dimensionality of the feature vector using principal component analysis. The method was validated on the extended Cohn-Kanade (CK+) dataset, and a recognition accuracy of 93%, using an SVM classifier, was recorded. However, this FER system still struggled with the common issues of handling real-world conditions such as head movement, varying lighting conditions, and low-intensity expressions. Overall, the major issues of hand-crafted methods can be summarized as the failure of low-level features to extract relevant local facial information and the inability to capture high-level salient information, notably under in-the-wild conditions involving different head positions, complex backgrounds, different distances from the camera, multi-face scenes, subject movement, and low lighting.
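As an illustration of this classical pipeline, the following minimal sketch combines an LBP appearance descriptor with PCA-based dimensionality reduction and a linear SVM, in the spirit of [38]; it assumes scikit-image and scikit-learn are available, and the parameter values are illustrative rather than those reported in the cited works.

```python
# Minimal sketch of a classical hand-crafted FER pipeline (LBP -> PCA -> SVM).
# Parameter values (radius, number of points, PCA dimension) are illustrative.
import numpy as np
from skimage.feature import local_binary_pattern
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

def lbp_histogram(gray_face, points=8, radius=1):
    """Uniform LBP codes pooled into a normalized histogram."""
    codes = local_binary_pattern(gray_face, points, radius, method="uniform")
    hist, _ = np.histogram(codes, bins=points + 2, range=(0, points + 2), density=True)
    return hist

def train_handcrafted_fer(face_images, labels):
    """face_images: cropped grayscale faces; labels: emotion classes (0..6)."""
    features = np.stack([lbp_histogram(f) for f in face_images])
    model = make_pipeline(PCA(n_components=min(8, features.shape[1])),
                          SVC(kernel="linear"))
    return model.fit(features, labels)
```

A geometric variant of the same sketch would simply replace the LBP histogram with distances computed between detected facial landmarks.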

2.2 Deep learning features

The swift progress of deep learning models has motivated researchers to introduce deep neural networks into FER systems. Therefore, in recent years, most works have leaned toward the use of deep learning techniques for FER [41, 42]. Indeed, a large proportion of the relevant FER systems have relied on CNNs because of their performance and flexibility [43]. In particular, CNN architectures have proved to be more robust than the Multi-Layer Perceptron (MLP) to face location changes and scale variations, especially in the case of previously unseen faces and pose variations [44]. In addition to CNNs, Deep CNNs (DCNN) [45], Deep Belief Networks (DBN) [46], Deep Auto-Encoders (DAE) [47], Recurrent Neural Networks (RNN) [48], Generative Adversarial Networks (GAN) [49], and, recently, transfer learning-based frameworks [50] have been successfully investigated for facial emotion recognition. For instance, Shaees et al. [51] performed a quantitative comparison between an FER method fully based on transfer learning with a pre-trained CNN and a hybrid FER method based on a mixture of deep features, extracted via transfer learning, combined with a mainstream classifier. They chose the pre-trained AlexNet CNN architecture for their first method, whereas a multiclass SVM was adopted as classifier for the second one. They evaluated their methods on two datasets, namely NVIE and CK+: the first method achieved recognition rates of 91.5% and 90.1%, respectively, while the second method increased these rates to 99.3% and 98.3% on the NVIE and CK+ datasets, respectively. In the same context of deep learning approaches, Zhang et al. [52] proposed two FER methods, both based on a double-channel weighted-mixture deep convolutional neural network (WMDCNN) structure; the first method operates on static images, while the second operates on image sequences by adding long short-term memory units (WMCNN-LSTM). In the designed systems, the facial regions are detected by the AdaBoost method, then cropped and rotated, and only the faces are kept by masking the other areas. The experimental results of the WMDCNN network on the CK+, JAFFE, Oulu-CASIA, and MMI datasets achieved average recognition rates of 98.5%, 92.3%, 86%, and 78.24%, respectively. The WMCNN-LSTM architecture, in turn, achieved an average recognition rate of 97.5% on the CK+ dataset, 88% on the Oulu-CASIA dataset, and 87.1% on the MMI dataset. Differently, Minaee et al. [9] introduced a deep learning approach based on attentional convolutional networks, adding a visualization technique in order to identify the most expressive regions related to emotions in face images. The proposed method was evaluated on four datasets (FER2013, Facial Expression Research Group (FERG), CK+, and JAFFE), and recognition rates of 70.02%, 99.3%, 98.0%, and 92.8%, respectively, were reported. Chen et al. [53] used a Deep Sparse Autoencoder Network (DSAN) for learning facial features and Softmax Regression (SR) for the classification of facial expressions; an average emotion recognition rate of 94.761% was reached on the JAFFE dataset. Likewise, the FER system of Li et al. [31] was conceived based on convolutional neural networks for feature extraction, preceded by a pre-processing phase including a new face cropping and rotation technique.
The evaluation of this system was performed on the CK+ and JAFFE datasets, and recognition accuracies of 97.38% and 97.18% were recorded, respectively. However, deep learning methods typically require large numbers of training instances, which makes transfer learning an attractive approach for in-the-wild FER.

Table 1 Summary of relevant studied works for FER in the JAFFE, the CK+, the SFEW_2.0, and/or the FER2013 datasets using hand-crafted, deep learning and hybrid features

2.3 Hybrid features

Despite the success of automated FER systems based on deep learning architectures, many researchers have argued that traditionally extracted (hand-crafted) features contain relevant texture, shape, and appearance information describing facial expressions, and that hand-crafted and deep learning features are therefore complementary. Hence, hand-crafted features can be effectively combined with deep learned features in order to further improve the robustness and accuracy of FER, especially since such hybrid mechanisms are consistent with the psychological mechanisms underlying facial expression recognition [54]. For instance, a Deep Action Units Graph Network (DAUGN) was investigated for facial expression recognition in [54]. The introduced network is based on a segmentation strategy that divides faces into action units, and a CNN is then used to fuse local-appearance and global-geometry features. The proposed FER system was evaluated on the CK+, MMI, and SFEW_2.0 datasets and achieved accuracy rates of 97.67%, 80.11%, and 55.36%, respectively. The obtained results are competitive compared to other works, but remain insufficient for in-the-wild facial images. Similarly, Fan and Tjahjadi [55] proposed a hybrid framework based on deep features learned using convolutional neural networks and hand-crafted features including shape and appearance descriptors. In order to collect hand-crafted features describing local facial properties, shape descriptors from facial landmarks related to the eyes, the nose, and the mouth were combined with PHOG features. The framework achieved an accuracy of 92.5% on the CK+ dataset. However, it was validated on only one dataset, which raises questions about its robustness and its risk of overfitting. Sun and Lv [56] also chose a hybrid model for facial expression recognition: they combined Scale-Invariant Feature Transform (SIFT) descriptors with deep learning features extracted from a CNN model. The method was validated on the CK+ dataset and achieved an accuracy of 94.82%, while cross-dataset experiments on the JAFFE dataset achieved an accuracy of 48.90%. Likewise, the FER method of Gogić et al. [57], called LBF-NN, combined local binary features with deep learned features via a Gentle Boost Decision Trees Neural Network (GBDTNN). The extracted hand-crafted features were based on facial landmarks detected from cropped facial images. The performance of the method was evaluated on four datasets: CK+ with an accuracy rate of 96.48%, 73.73% for MMI, 85.88% for JAFFE, and 49.31% for SFEW_2.0. Nevertheless, the performance of the method remains quite limited for in-the-wild images, since facial expressions in natural settings are dynamic and vary in intensity. Similarly, Alreshidi and Ullah [58] built their facial emotion recognition system using hybrid features: they extracted Neighborhood Difference Features (NDF) from faces detected with AdaBoost cascade classifiers. They tested their approach on the SFEW_2.0 and RAF datasets, achieving precision rates of 57.7% for SFEW_2.0 and 59.0% for RAF. Overall, in-the-wild facial expression recognition methods based exclusively on deep learned features have proved to be more effective than methods combining such features with hand-crafted ones.

Table 1 summarizes some relevant research studies, ranging from early works up to more recent ones, for each category of features (hand-crafted, deep learning, and hybrid methods). The selected works were collected based on the datasets used to validate them (JAFFE, CK+, SFEW_2.0, and/or FER2013). It is clear that the investigated hand-crafted features (e.g. LBP, PCA, LFDA\(\ldots \)) do not yield sufficiently descriptive patterns of facial expressions, whereas deep learning methods show a remarkable improvement of the precision rate, especially under in-the-wild contexts, of up to 18.57%. However, there is still room for improvement, especially in real-condition environments. The contribution detailed in this work focuses on transfer learning from recent deep learning architectures in order to introduce effective solutions for the implementation of FER systems. The most relevant deep face features are studied by challenging several deep architectures in the context of in-the-wild FER. The suggested method aims to fuse relevant features from several pre-trained CNN models in order to use them in a multichannel solution for the recognition of in-the-wild human facial expressions. To the best of our knowledge, this is the first time that deep learning features extracted from pre-trained architectures under in-the-wild conditions are investigated and fused into a single solution to improve FER accuracy.

3 Proposed method

This section details the proposed method for in-the-wild FER. The method performs the FER task with a multichannel convolutional neural network composed of two deep learning stages. The first one, "DL as extractor", uses transfer learning from three pre-trained CNN models, namely VGG19 [60], GoogleNet [61], and ResNet101 [62], to extract features. The second one, "DL as transformer", selects the richest feature layer from each model; the three resulting vectors are then concatenated into a single vector representing the final feature vector, which is fed to an SVM classifier in order to predict the emotion class of the input image. The proposed method aims to gather the most relevant features extracted from the VGG19, GoogleNet, and ResNet101 networks and to exploit their complementarity in order to reduce the error rate. In what follows, we describe the different steps of the proposed emotion recognition procedure: input images are pre-processed and faces are detected; the three pre-trained CNN models are then used for feature extraction; finally, the richest features from each model are selected and concatenated into a single vector, which is fed to the SVM classifier.

3.1 Pre-processing and data augmentation

For this study, the JAFFE, CK+, SFEW_2.0, and FER2013 datasets have been investigated for training and evaluation. All the used datasets comprise face images with seven basic facial expressions (Anger, Surprise, Fear, Disgust, Happiness, Sadness, and Neutral). Dataset samples are shown in Fig. 1, whereas Fig. 2 illustrates the steps of the proposed method in more detail through its instantiation on the JAFFE dataset.

Fig. 1
figure 1

Prototypical facial expression images from the JAFFE dataset (first column), the CK+ dataset (second column), the SFEW_2.0 dataset (third column), and the FER2013 dataset (fourth column)

Fig. 2
figure 2

Technical steps of the proposed FER method

In fact, input images are first converted into the RGB space and then normalized by rescaling the range of intensity values in order to ensure robustness to illumination changes [63]. Non-face parts and useless regions are then removed from the normalized images in order to keep only the face regions. This pre-processing step is important to enhance recognition performance. In our case, the Viola-Jones face detection algorithm [28], which is known for its robustness especially on frontal images, is used to localize the face regions and crop them from the full images of the used datasets. Furthermore, since a convolutional neural network requires a large amount of data to reach good accuracy, the performance of the model can be improved by Data Augmentation (DA) [64]: the more samples the dataset contains, the more features can be extracted from them and the more the model performance can be improved. Thus, given the small size of some public FER datasets, DA techniques are commonly used to increase their size. Translation, rotation, and skewing have notably shown their benefits while remaining computationally efficient [65]. In our case, the data augmentation step creates new images from each cropped image using the following transformations: horizontal and vertical translations, horizontal reflection, and random rotations with an angle in [\(-\,10^{\circ } , 10^{\circ }\)] (Fig. 3). It is worth noting that data augmentation was applied only to the JAFFE and SFEW_2.0 datasets, which include respectively 213 and 1230 images, because of their reduced number of samples compared to the CK+ and FER2013 datasets, which include 5414 and 7178 images, respectively. Figure 3 illustrates some samples of the JAFFE and SFEW_2.0 datasets before and after applying the data augmentation.
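The pre-processing and augmentation steps above can be sketched as follows, assuming OpenCV; the Haar cascade corresponds to the Viola-Jones detector, the translation and rotation ranges follow the values stated above, and the remaining parameters (scale factor, minimum neighbors) are illustrative.

```python
# Sketch of the pre-processing (Viola-Jones face cropping, intensity normalization,
# resizing) and of the geometric data augmentation described above.
import cv2
import numpy as np

detector = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

def crop_face(image_bgr):
    """Detect the face with a Haar cascade, normalize intensities, and resize it."""
    gray = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2GRAY)
    faces = detector.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    if len(faces) == 0:
        return None
    x, y, w, h = faces[0]
    face = image_bgr[y:y + h, x:x + w]
    face = cv2.normalize(face, None, 0, 255, cv2.NORM_MINMAX)  # intensity normalization
    return cv2.resize(face, (224, 224))  # input size of the pre-trained models

def augment(face, rng=None):
    """Return a few geometrically transformed copies of a cropped face."""
    rng = np.random.default_rng() if rng is None else rng
    h, w = face.shape[:2]
    samples = [cv2.flip(face, 1)]                                  # horizontal reflection
    tx, ty = rng.integers(-10, 11, size=2)                         # small translations
    samples.append(cv2.warpAffine(face, np.float32([[1, 0, tx], [0, 1, ty]]), (w, h)))
    angle = rng.uniform(-10, 10)                                   # rotation in [-10, 10] degrees
    rot = cv2.getRotationMatrix2D((w / 2, h / 2), angle, 1.0)
    samples.append(cv2.warpAffine(face, rot, (w, h)))
    return samples
```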

Fig. 3
figure 3

Illustration of the different geometric DA techniques applied on the SFEW_2.0 (first row) and the JAFFE (second row) datasets

3.2 Feature extraction

After resizing the input images to fit the input size of the pre-trained models, which is \(224\times 224\times 3\), the feature extraction part of the proposed method proceeds through two modules. The first one, called "DL as extractor", extracts features from the pre-processed facial images; to this end, transfer learning is applied while benefiting from the advantages of several relevant CNN models. The second module, called "DL as transformer", concatenates the most relevant features selected from each single model to form the final prediction vector. The details of the two proposed modules are discussed in what follows.

- DL as extractor (CNN feature extraction): In order to represent the numerical information behind facial expressions, we performed transfer learning on CNN models pre-trained on the 1000 classes of the ImageNet dataset, so as to discriminate between the seven emotional classes. We tested several well-known deep learning models (ResNet50, ResNet101, VGG16, VGG19, and GoogleNet), which have already shown their effectiveness in several state-of-the-art FER works [9, 66, 67], on the challenging JAFFE dataset in order to assess their performance for the in-the-wild context. For more stable results, we ran each tested model twenty times. The mean and the standard deviation \(\sigma \) (1) were calculated in order to choose the most appropriate models in terms of performance (i.e. highest accuracy means) and stability (i.e. smallest standard deviations). For each studied model, the four best recognition rates, their mean, and their standard deviation are shown in Table 2. According to this table, ResNet101 recorded the highest accuracy mean with the lowest standard deviation, followed by VGG19. The ResNet50 and GoogleNet models have comparable mean and standard deviation values; in this case, the choice of the third model was based on the mean of the three best recognition rates, which gives the advantage to GoogleNet. For reasons related to the size of the final feature vector, with regard to the curse of dimensionality, and in order to have an odd number of sources, we opted for three models among the five tested ones for feature extraction. Thus, the experiments led us to choose the ResNet101, VGG19, and GoogleNet models in order to guarantee the most stable results in the in-the-wild context and therefore the most robust features.

$$\begin{aligned} \sigma =\sqrt{ \frac{1}{n}\sum ^n_{i=1} (x_i-\mu )^2}, \end{aligned}$$
(1)

where \(x_i\) denotes the recognition rates, \(\mu \) is the mean of the best recognition rates, and n is the total number of experiments. Transfer learning was then applied to these pre-trained CNN models while freezing the weights of a customized range of shallow layers, which do not capture relevant information. The weight-freezing technique is applied to each model separately according to its depth, in order to keep only the relevant image properties for the training phase. This first step of the method freezes some shallow layers and keeps the deeper ones, which contain important data and have more ability to learn discriminative features. Freezing these layers saves training time and, above all, discards less reliable features while retaining only the relevant ones that yield more accurate recognition. The deep features extracted from the three models are used afterward to train the SVM classifier. Furthermore, in order to confirm the suitability of the three chosen CNN models for FER in general, and not only for the in-the-wild context, we also evaluated them separately on the CK+, SFEW_2.0, and FER2013 datasets. Each model was tested on all three datasets, and the experiments were repeated twenty times while reporting the four best recognition rates (Table 3). We also calculated the standard deviation and the mean of the recognition rates (Table 4). For the JAFFE dataset, the recognition rate reached 85.71% for VGG19, 83.33% for GoogleNet, and 85.71% for ResNet101. For the CK+ dataset, recognition rates of 89.19% for VGG19, 89.37% for GoogleNet, and 92.70% for ResNet101 were recorded. For the SFEW_2.0 dataset, lower recognition rates were scored: 54.07% for GoogleNet, 57.72% for VGG19, and 60.57% for ResNet101. Finally, for the FER2013 dataset, VGG19 achieved an accuracy of 58.22%, versus 53.69% for GoogleNet and 55.57% for ResNet101. The accuracies achieved by the three tested CNN architectures are relatively good and promising for the datasets acquired in controlled environments, but remain relatively low for uncontrolled environments (the SFEW_2.0 and FER2013 datasets). However, by examining the confusion matrices of the three models on the SFEW_2.0 dataset, we noticed that where one or two of the models fail, there is at least one that performs well. For example, the GoogleNet model fails to recognize the Disgust emotion, whereas the ResNet101 model scores 28.6% for this emotion on the SFEW_2.0 dataset. Detailed results of the confusion matrices, illustrated later in the experimental results section, confirm this finding. This observation prompted us to investigate this complementarity by selecting the most suitable features from each model.
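The extractor stage can be sketched as follows; the paper does not state the implementation framework, so this assumes PyTorch/torchvision, and the number of trainable child blocks and the layer names are illustrative placeholders for the customized freezing ranges and the selected layers described above.

```python
# Sketch of the "DL as extractor" module: ImageNet pre-trained backbones with shallow
# layers frozen, and deep activations collected from one chosen layer per network.
import torch
import torchvision.models as models

def frozen_backbone(name="resnet101", n_trainable_children=2):
    """Load a pre-trained model and freeze all but its last few child blocks."""
    net = getattr(models, name)(weights="DEFAULT")   # "vgg19", "googlenet", "resnet101"
    children = list(net.children())
    for child in children[:len(children) - n_trainable_children]:
        for p in child.parameters():
            p.requires_grad = False                  # customized freezing range
    return net

def extract_features(net, layer_name, images):
    """Return flattened activations of `layer_name` for a batch (N, 3, 224, 224)."""
    store = {}
    layer = dict(net.named_modules())[layer_name]    # e.g. "avgpool" for ResNet101
    handle = layer.register_forward_hook(
        lambda module, inputs, output: store.update(feat=output.flatten(1).detach()))
    net.eval()
    with torch.no_grad():
        net(images)
    handle.remove()
    return store["feat"]
```

For example, `extract_features(frozen_backbone("resnet101"), "avgpool", batch)` would return one deep feature vector per image of the batch.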

Table 2 Comparison results of five models applied on the JAFFE dataset
Table 3 Four best emotion recognition rates of VGG19, GoogleNet and ResNet101 on the JAFFE, the CK+, the SFEW_2.0, and the FER2013 datasets (best values are in bold)
Table 4 The obtained recognition rates (mean and standard deviation (SD)) using the VGG19, the GoogleNet, and the ResNet101 models

- DL as Transformer (Feature concatenation): Several tests were performed in order to choose, for each model, the most suitable layer for extracting discriminative features. Firstly, features were extracted only from the Fully Connected (FC) layers. Subsequent tests showed that more discriminative features can be selected from layers other than the FC ones, notably the pooling layers, which preserve the most essential features of facial images. Thus, the layer-based feature selection process was focused on the last five layers of each model. The process was validated empirically, and several tests were carried out in order to select the most appropriate combination of feature layers for each of the three pre-trained models. These layers contain high-quality features that help to increase the accuracy of the facial expression recognition model. The five best layer combinations, in terms of recognition accuracy, from which the features were extracted are summarized in Table 5 for each of the four datasets. As illustrated in this table, the layers retained from the three models for feature extraction depend on the dataset, which explains the differences in the number of features retained for each dataset. For instance, the Drop7, Fc7, and pool5 layers, selected respectively from the VGG19, GoogleNet, and ResNet101 models, were retained for feature concatenation in the case of the CK+ dataset; this is the best layer combination, giving an accuracy of 98.80%.
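The layer search described above can be sketched as follows, reusing the extract_features helper from the previous sketch; the candidate layer names and the use of a linear SVM for scoring each layer are assumptions consistent with the description, not an exact reproduction of the authors' code.

```python
# Sketch of the layer-based feature selection: features are taken in turn from each of
# the last five layers of a pre-trained model, scored on a held-out split, and the
# best-performing layer is retained.
from sklearn.svm import LinearSVC

def select_best_layer(net, candidate_layers, x_train, y_train, x_val, y_val):
    best_layer, best_acc = None, 0.0
    for layer in candidate_layers:                     # e.g. the last five layer names
        f_train = extract_features(net, layer, x_train).numpy()
        f_val = extract_features(net, layer, x_val).numpy()
        acc = LinearSVC(max_iter=10000).fit(f_train, y_train).score(f_val, y_val)
        if acc > best_acc:
            best_layer, best_acc = layer, acc
    return best_layer, best_acc
```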

Nevertheless, the results illustrated in Table 5 show that the pooling layers contain more relevant features than the fully connected ones. In the majority of cases, combining two pooling layers from two different models with a fully connected layer from the third model was more effective than combining two fully connected layers with one pooling layer, or than combining three fully connected layers. At the end of this stage, three feature vectors per dataset (one for each model), corresponding to the highest recognition rates, are retained. Given the three feature vectors corresponding to the three pre-trained models, the concatenation module constructs, for each dataset, a single feature vector from the three sets of retained features. To do so, we rely on the selection of the most significant layer of each model in order to extract the most relevant information for emotion classification. The vectors extracted from each model are concatenated to form a single vector, as shown in Fig. 4, where the number of extracted features for each dataset is also provided. Thus, once the layer from which the features are extracted has been chosen for each model, the concatenation is applied to form a final single feature vector that is fed to the SVM classifier in order to predict the emotions of the test facial images. For the CK+ dataset, 6151 features were retained from the three models (3079 features from ResNet101, 2048 from GoogleNet, and 1024 from VGG19), whereas 3079 features were selected for the JAFFE dataset (1790 from ResNet101, 521 from GoogleNet, and 768 from VGG19). For the SFEW_2.0 dataset, 3328 features were kept (2048 from ResNet101, 256 from GoogleNet, and 1024 from VGG19). For the FER2013 dataset, 10787 features were retained (6144 from ResNet101, 2048 from GoogleNet, and 2595 from VGG19).
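A minimal sketch of the concatenation module follows; it simply stacks, column-wise, the per-model feature matrices retained by the layer search (for instance 3079 + 2048 + 1024 columns for CK+, giving the 6151-dimensional final descriptor mentioned above).

```python
# Sketch of the feature concatenation: the three retained feature matrices, one per
# pre-trained model, are joined column-wise into the final descriptor fed to the SVM.
import numpy as np

def fuse_features(feats_resnet101, feats_googlenet, feats_vgg19):
    """Each input has shape (n_samples, d_model); the output is (n_samples, d_total)."""
    return np.hstack([feats_resnet101, feats_googlenet, feats_vgg19])
```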

Table 5 Top five layers’ combinations for the four investigated datasets
Fig. 4
figure 4

Principle of layer selection and concatenation

3.3 Emotion classification

After forming the final vector resulting from the concatenation of the features selected from the three initial vectors, the classification step consists in associating each studied image with the corresponding emotion class. As mentioned previously, the test images are different from the training images and their number is smaller. Instead of the classification layers of the models, a linear support vector machine is used as the emotion classifier. With few samples per class, the SVM is effective at assigning new instances from the test set to the different classes based on the learnt emotions. Owing to the relevance of the data obtained in the extraction and concatenation steps, we do not need a nonlinear kernel to transform the features. The SVM thus finds the optimal hyperplane that maximizes the distance, called the margin of separation, between itself and the closest data points. Since we face a multiclass (non-binary) classification problem, we use a linear SVM following the one-vs-rest strategy, which implements the multiclass SVM.
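A sketch of this classification stage is given below, assuming scikit-learn; LinearSVC uses a linear kernel and the one-vs-rest strategy by default, matching the description above, although the regularization value shown is only illustrative.

```python
# Sketch of the emotion classification stage: a linear SVM trained one-vs-rest on the
# fused feature vectors, with no kernel-based feature transformation.
from sklearn.svm import LinearSVC

def train_emotion_classifier(fused_train, y_train):
    clf = LinearSVC(C=1.0, max_iter=10000)   # linear, one-vs-rest by default
    return clf.fit(fused_train, y_train)

# Example usage on fused test features:
# y_pred = train_emotion_classifier(fused_train, y_train).predict(fused_test)
```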

4 Experimental results

Having a large amount of labeled data is necessary to train a neural network and to help it cope with the curse of dimensionality [68]. In this work, four publicly available datasets have been used: (i) the Extended Cohn-Kanade (CK+) dataset, built under laboratory-controlled conditions, which contains a mixture of posed and spontaneous emotions; (ii) the JApanese Female Facial Expression (JAFFE) dataset, also built under laboratory-controlled conditions, which contains only posed emotions; (iii) the Static Facial Expressions in the Wild (SFEW_2.0) dataset; and (iv) the Facial Expression Recognition 2013 (FER2013) dataset, the latter two illustrating spontaneous emotions captured under in-the-wild conditions. In what follows, we give a brief overview of these datasets before presenting the results.

1- Extended Cohn-Kanade dataset (CK+): This dataset is an extended version of the "CK" collection, which was released in 2000 in order to promote research in the field of facial expression detection [11]. All images were acquired in controlled environments. The subjects are both male and female, 31% men and 69% women, with ages ranging from 20 to 45 years [69]. The dataset includes 593 image sequences, varying in duration from 10 to 60 frames, collected from 123 subjects. Every image has a resolution of \(640\times 490\) or \(640\times 480\) pixels, and the images cover seven emotion categories: the six basic emotions (Anger, Disgust, Fear, Happiness, Sadness, Surprise) plus Contempt [12].

2- Japanese Female Facial Expression dataset (JAFFE): This is a laboratory-controlled dataset. As a benchmark collection, the JAFFE dataset is composed of 213 grayscale facial expression images of 10 Japanese women. The dataset is categorized into seven expressions: Neutral plus the six basic emotional expressions (Anger, Disgust, Fear, Happiness, Sadness, and Surprise). Each image is \(256\times 256\) pixels, each image was rated on six emotion adjectives by 60 Japanese subjects, and each expressor has 2–4 samples per expression. In this dataset, the same expression of one person may differ greatly across samples, and distinct expressions may not be very distinguishable [70].

3- The Static Facial Expressions in the Wild (SFEW_2.0): This is a static dataset covering unconstrained facial expressions, different head poses, a wide age range, and varied face resolutions and focus, making it close to real-world conditions. It was extracted from the temporal Acted Facial Expressions in the Wild (AFEW) dataset and was first published in 2011 by Dhall et al. [71]. Consequently, it is analogous to the AFEW set, except that it is composed of static movie frames. Each frame is associated with an expression label (Angry, Disgust, Fear, Happy, Sad, Surprise, or Neutral) under close to real-world conditions. The SFEW_2.0 dataset contains 1766 images partitioned into 958, 436, and 372 images for the training, validation, and test sets, respectively.

4- The Facial Expression Recognition 2013 (FER2013): This dataset was developed by collecting face images available on the Internet using the Google Image Search API. All images were captured in uncontrolled environments, which makes it a challenging standard benchmark for in-the-wild FER [67]. It contains 35,887 images belonging to the seven main emotion classes (4953 "Anger" images, 547 "Disgust" images, 5121 "Fear" images, 8989 "Happiness" images, 6077 "Sadness" images, 4002 "Surprise" images, and 6198 "Neutral" images), divided into a training set and a test set [17]. The images are grayscale with a size restricted to \(48\times 48\) pixels.

4.1 Data preparation and validation protocol

For this study, the four datasets, JAFFE, CK+, SFEW_2.0, and FER2013, including respectively 213, 5414, 1230, and 7178 images, have been investigated. The images of the CK+ dataset were manually divided into six emotion classes, and a seventh class, "Neutral", was built by collecting the first three frames of each emotion sequence of every subject from the six classes. We selected 5414 images covering the six emotion categories (happiness, fear, sadness, surprise, anger, and disgust), while ignoring the "Contempt" class. The JAFFE dataset was also manually divided into seven emotion classes, whereas the images selected from the SFEW_2.0 and FER2013 datasets were used as downloaded. Each dataset was randomly split into training and testing samples with a ratio of 80:20. Table 6 presents the numbers of samples in the training and testing partitions and the total number of images used for each dataset. All the CNN models were trained for at most 55 epochs. The ADAM optimizer was applied to the GoogleNet and ResNet101 models, while the SIGMOID was used to optimize the VGG19 model. The initial learning rate was fixed at \(10^{-4}\) for all the models.
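The protocol above can be sketched as follows, assuming scikit-learn for the random 80:20 split and PyTorch for optimization; only the ADAM configuration used for GoogleNet and ResNet101 is shown, and the random seed is an illustrative choice.

```python
# Sketch of the validation protocol: random 80:20 train/test split, ADAM optimizer with
# an initial learning rate of 1e-4, and a maximum of 55 training epochs.
import torch
from sklearn.model_selection import train_test_split

MAX_EPOCHS = 55
INITIAL_LR = 1e-4

def split_dataset(images, labels, seed=0):
    """Randomly split a dataset into 80% training and 20% testing samples."""
    return train_test_split(images, labels, test_size=0.2, random_state=seed)

def make_adam_optimizer(net):
    """Optimize only the non-frozen parameters of a backbone (GoogleNet or ResNet101)."""
    trainable = [p for p in net.parameters() if p.requires_grad]
    return torch.optim.Adam(trainable, lr=INITIAL_LR)
```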

Table 6 Numbers of samples for the four datasets

The performance of the proposed method is presented on the above datasets. The results produced by the proposed multichannel CNN solution for facial emotion recognition are presented in two separate parts. The first part concerns the first, feature extraction, deep learning network; the second part gives the final accuracy rates after feature selection and concatenation. It is worth mentioning that all accuracies refer to testing accuracy on samples that are not included in the training set. The outputs of the first deep learning network used as extractor (first step), where for each model weight freezing was applied to certain blocks of layers during the training phase, are presented first. The confusion matrices summarize the prediction results for each emotion separately; they were generated to assess each pre-trained model individually. These matrices are presented for each pre-trained model and for the proposed model, firstly to demonstrate that the used models are complementary and do not err on the same emotions, and secondly to show that feature concatenation can enhance the recognition rate for emotions that are hard to capture.
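The confusion matrices discussed below can be reproduced as sketched here, assuming scikit-learn; rows correspond to the true emotions, columns to the predicted ones, and each row is normalized to percentages. The class ordering is an illustrative assumption.

```python
# Sketch of the per-emotion evaluation: a row-normalized confusion matrix expressed in
# percentages, computed on test predictions of a single model or of the fused model.
from sklearn.metrics import confusion_matrix

EMOTIONS = ["Anger", "Disgust", "Fear", "Happiness", "Neutral", "Sadness", "Surprise"]

def emotion_confusion(y_true, y_pred):
    cm = confusion_matrix(y_true, y_pred,
                          labels=list(range(len(EMOTIONS))), normalize="true")
    return 100.0 * cm  # cm[i, j]: % of true emotion i predicted as emotion j
```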

4.2 Results of the three models on the JAFFE dataset

Three feature vectors have been selected for the JAFFE dataset, with accuracies of 85.71% for VGG19, 83.33% for GoogleNet, and 85.71% for ResNet101. We report in Tables 7, 8 and 9 the corresponding confusion matrices, which show that the VGG19 model achieves a 100% recognition rate for four emotions (Fear, Happiness, Neutral, and Surprise), whereas the Anger emotion is recognized only at a rate of 50%; Disgust and Sadness have recognition rates of 66.7% and 83.3%, respectively. The GoogleNet model, however, achieves a 100% recognition rate for the Anger and Disgust emotions, which are recognized at only 50% and 66.7%, respectively, by VGG19. GoogleNet also achieves a 100% recognition rate for the Fear emotion.

Table 7 The confusion matrix of the VGG19 model on the test set of JAFFE
Table 8 The confusion matrix of the GoogleNet model on the test set of JAFFE
Table 9 The confusion matrix of the ResNet101 model on the test set of JAFFE
Table 10 Misclassified JAFFE images (Original Class–Predicted Class)

The ResNet101 model recognizes the Fear, Happiness, and Surprise emotions at 100%. It reaches 83.3% for the Disgust and Neutral emotions, and a 66.7% recognition rate for Anger and Sadness. Although GoogleNet does not reach a high accuracy for the Happiness, Surprise, and Neutral emotions, ResNet101 recognized Happiness and Surprise at 100% and reached a rate of 33.3% for the Neutral emotion. While the Neutral class had an average recognition rate of 50% with GoogleNet, it reached 100% with VGG19, which also obtained a 16.6% better success rate for Sadness compared to GoogleNet and ResNet101. Overall, the recorded results show a complementarity between the three models in recognizing the seven emotional classes, which allows us to conclude that some models correctly classify emotions that other models misclassify. This finding is illustrated by Table 10(a), which shows an image misclassified by the GoogleNet model and correctly classified by the VGG19 model, and by Table 10(b), which presents an image misclassified by the ResNet101 model and correctly classified by the GoogleNet model.

4.3 Results of the three models on the CK+ dataset

Tables 11, 12 and 13 gather the confusion matrices representing the accuracies of the resulting feature vectors of each studied model on the CK+ dataset. The global recognition rate is 89.19% for VGG19, 89.37% for GoogleNet, and 92.70% for ResNet101. Comparing the confusion matrices of the three pre-trained models, it is clear that VGG19 recognizes the "Happiness" emotion best, with a recognition rate of 95.3%, compared to the GoogleNet and ResNet101 models, while for the "Fear" emotion VGG19 and ResNet101 achieve the same recognition rate of 92.5%. The GoogleNet model recognizes the "Sadness" emotion best, with an accuracy of 97.8%. For the "Disgust" emotion, GoogleNet and ResNet101 achieve a recognition rate of 91.1%, while the Anger, Neutral, and Surprise emotions are best recognized by ResNet101, with accuracies of 97.7%, 85.8%, and 95.6%, respectively. As for the JAFFE dataset, some images of this dataset are misclassified by one model but correctly classified by another. Table 14 shows some examples: images (a, b) are misclassified by GoogleNet but correctly classified by ResNet101; image (c) is correctly classified by ResNet101 and misclassified by VGG19; image (d) is misclassified by GoogleNet but correctly classified by VGG19; and the final image (e) is incorrectly classified by ResNet101 while being correctly classified by GoogleNet.

Table 11 The confusion matrix of the VGG19 model on the test set of CK+
Table 12 The confusion matrix of the GoogleNet model on the test set of CK+
Table 13 The confusion matrix of the ResNet101 model on the test set of CK+
Table 14 Misclassified CK+ images (Original Class–Predicted Class)

4.4 Results of the three models on the SFEW_2.0 dataset

Tables 15, 16 and 17 show the confusion matrices illustrating the emotion recognition rates of each model on the facial images of the SFEW_2.0 dataset, which are taken in real conditions. The mean accuracies are as follows: 57.72% for VGG19, 54.07% for GoogleNet, and 60.60% for ResNet101. The VGG19 model achieves the best recognition rates for Happiness and Surprise compared to GoogleNet and ResNet101, at 89.4% and 60.7%, respectively. The GoogleNet model could not recognize the Disgust emotion; however, it achieves the best recognition rates of 66.7% for the Sadness emotion and 54.2% for the Fear emotion. The Anger, Disgust, and Neutral emotions are best recognized by the ResNet101 model, with rates of 63.8%, 28.6%, and 56.8%, respectively.

Table 15 The confusion matrix of the VGG19 model on the test set of SFEW_2.0
Table 16 The confusion matrix of the GoogleNet model on the test set of SFEW_2.0
Table 17 The confusion matrix of the ResNet101 model on the test set of SFEW_2.0

Table 18 illustrates some examples of images that are misclassified by one model while being correctly classified by another. For instance, the image shown in Table 18(a) was misclassified by ResNet101 but correctly classified by GoogleNet; the image in Table 18(b) was misclassified by VGG19 and correctly classified by ResNet101; and the last example, in Table 18(c), was misclassified by the GoogleNet model and correctly classified by the VGG19 model.

Table 18 Misclassified SFEW_2.0 images (Original Class–Predicted Class)

4.5 Results of the three models on the FER2013 dataset

The three vectors selected for the FER2013 dataset in the first step have the following recognition rates: 58.22% for VGG19, 53.69% for GoogleNet, and 55.57% for ResNet101. The corresponding confusion matrices are reported in Tables 19, 20 and 21. The best emotion recognition rate was achieved for the Happiness emotion, at 81.70%, by the VGG19 model. The lowest recognition rate is for the "Disgust" class, with an accuracy of 18.20% achieved by the GoogleNet model. The ResNet101 and VGG19 models achieved the same accuracy of 43.80% for the Anger class, whereas GoogleNet recognized this emotion better. The Fear and Sadness emotions are better recognized by the VGG19 model, with accuracies of 37.10% and 55.40%, respectively, while the Surprise emotion reached 74.70% with the VGG19 model. The principle of complementarity is also confirmed by the FER2013 results. Table 22 shows some examples of FER2013 images that are misclassified by one model and correctly classified by another. In Table 22(a), the image is misclassified by the ResNet101 model but correctly classified by GoogleNet, whereas the image in Table 22(b) is correctly classified by GoogleNet and misclassified by VGG19. Another image, misclassified by ResNet101 while being correctly classified by the VGG19 model, is shown in Table 22(c).

Table 19 The confusion matrix of the VGG19 model on the test set of FER2013
Table 20 The confusion matrix of the GoogleNet model on the test set of FER2013
Table 21 The confusion matrix of the ResNet101 model on the test set of FER2013
Table 22 Misclassified FER2013 images (Original Class–Predicted Class)

4.6 Results of the proposed model after feature extraction and concatenation on the four used datasets

The second key step of the proposed emotion recognition system is the selection of features from each CNN model, their fusion into a single vector, and the SVM-based classification of this vector. Based on the results obtained by the three models on the four datasets, we can notice that they are complementary; it is therefore beneficial to combine their features so that the shortcomings of one model are compensated by the performance of the others. This led us to fuse the features extracted from the three models and to feed them to a supervised classifier. The observation was confirmed by the experimental results obtained after feature extraction and concatenation, which gave the best recognition rates of the three models. Consequently, the use of mixed features from the three models considerably improved the overall recognition rate: the experiments performed using the concatenated features achieved an overall recognition rate of 97.62% on JAFFE, 98.80% on CK+, 88.20% on SFEW_2.0, and 94.01% on FER2013. To evaluate the overall performance of the proposed method, the confusion matrices on the four datasets are given in Tables 23, 24, 25 and 26.

Table 23 Confusion matrix of the proposed method on the JAFFE dataset
Table 24 Confusion matrix of the proposed method on the CK+ dataset
Table 25 Confusion matrix of the proposed method on the SFEW_2.0 dataset
Table 26 Confusion matrix of the proposed method on the FER2013 dataset
Table 27 Comparison between the FER accuracy of the proposed method and the ones recorded by relevant methods from the state-of-the-art on the following datasets: (a) JAFFE, (b) CK+, (c) SFEW_2.0, (d) FER2013

The process of fusing the resulting feature vectors of JAFFE led the proposed FER system to recognize the Sadness emotion perfectly, reaching a 100% recognition rate for this emotional class, whereas it was previously recognized at 72.23% on average. Similarly, 100% was reached for the Anger and Disgust emotions. The fusion also increased the recognition rate of the Neutral emotion to 83.3%, while a 100% recognition rate was obtained for the other emotions. On the CK+ dataset, the fused feature vector improves the recognition rates of all emotions, which is reflected by the increase in the overall accuracy to 98.8%. The best recognition rate of 97.8%, achieved by the GoogleNet model for the Sadness emotion, reached 100% using the concatenated vector. The lowest recognition rate, that of the Neutral class, which had a mean value of 76.7%, reached 97.7%, whereas the Surprise emotion increased to 97.6%. An average of 99.4% was achieved for the Anger, Disgust, Fear, and Happiness emotions. Regarding the second type of datasets (in-the-wild conditions), the SFEW_2.0 dataset is more challenging than the other facial expression datasets due to the complexity of the backgrounds and the natural situation of the human faces; we note a striking improvement in the emotion recognition rate for this dataset, from an average of 57.46% to 88.2%, a considerable increase of 30.74%. For the second in-the-wild dataset (FER2013), which is known to be one of the most challenging datasets in the emotion recognition domain since it contains images of cartoons and emojis in addition to human facial images, the recognition rate obtained with the concatenated vector is clearly improved compared to that of each model taken separately. Indeed, the lowest accuracy, that of the Disgust emotion, increased from 18.2% to 68.2%, and the accuracy of the Fear emotion increased from 23.4% to 91.2%. All the other recognition rates increased to an average of 92%, and the Happiness emotion was the best recognized, with an accuracy of 98.3%. Many other challenging factors in facial emotion recognition can also reduce the recognition rate, particularly when the images are taken under in-the-wild conditions. It is worth mentioning that such images differ from images taken in laboratory conditions: in-the-wild, head poses vary because the individuals are moving and the distance between the persons and the camera is variable, contrary to controlled conditions, where the subjects face the camera at the same distance with an upright head pose. These challenging factors were overcome through the concatenation of the relevant features of the three models. Assembling the features reduced the error rate compared to each model taken separately and remarkably improved the overall recognition rate, especially for the in-the-wild datasets. The limitations of the single models were largely covered by the union of the three models into a global one able to predict the emotions more precisely. Each model correctly classified a set of images different from the sets correctly classified by the other models, and the feature concatenation applied in this work made it possible to retain the maximum number of images correctly classified across the three models, especially for the in-the-wild datasets FER2013 and SFEW_2.0. This explains the performance increase from around 50% to over 80%.
Overall, the obtained results show that exploiting the complementarity of several deep learning models and extracting features from different models can counteract the difficulties of capturing facial emotions in-the-wild.

Table 28 Sample of challenging in-the-wild conditions

4.7 Comparison of the suggested method with relevant methods from the state-of-the-art

This section presents a comparison of the proposed FER method with relevant emotion recognition methods from the literature. For a fair comparison, we ensured that the compared methods use deep learning, transfer learning, and CNN architectures, and that they are validated on the same datasets as those of this study. The comparison results on the JAFFE, CK+, SFEW_2.0, and FER2013 datasets are summarized in Table 27 (a, b, c, and d, respectively). The proposed method clearly outperforms all the compared methods on all datasets. For instance, it outperforms the method presented in [72] on the JAFFE dataset by 20.35%. Furthermore, on the CK+ dataset, the accuracy of the proposed method exceeds that of the method in [73] by 7.8% and that of [51] by 8.7%, while the average increase in accuracy compared to the other methods varies from 0.3% to 5.56%. Likewise, on the SFEW_2.0 dataset, the proposed method outperforms the relevant state-of-the-art methods by a considerable margin of up to 39.18%, which represents the highest increase among all the studied datasets. Similarly, the validation of the proposed method on the FER2013 dataset shows an improvement over recent studies: the improvement of the recognition rate reaches 24.45% compared to [74] and 15.02% compared to the best accuracy obtained in [42].

Table 29 A sample of misclassified images by the proposed method: Original Class (green)–Predicted Class (red)

Overall, the proposed method achieves better rates than all the compared methods on the four datasets. In particular, there is a substantial increase in the recognition of spontaneous emotions within the SFEW_2.0 and FER2013 datasets. This was expected, since the proposed method is designed to take advantage of the complementarity of deep learning models, especially in the in-the-wild context. We can also notice that the misclassifications of the proposed model mainly concern the classes with fewer samples than the others, for instance the “Disgust” class of the SFEW_2.0 and FER2013 datasets, which includes only 14 and 22 samples, respectively. This emotion was recognized at 71.4% for the SFEW_2.0 dataset and at 68.2% for the FER2013 dataset. Thus, seven images were incorrectly classified in the FER2013 dataset and only four facial images in the SFEW_2.0 dataset, particularly images where the expressions are not accentuated, such as the ones presented in Table 29(c’) and the last two images in Table 29(c”). Likewise, for the FER2013 dataset, the images in Table 29(d) do not strongly express the emotions; under those circumstances, it is difficult even for a human being to identify the specific emotion. Table 29 gathers qualitative results for some examples from the four datasets. For the JAFFE dataset, the only misclassification made by the model concerned a facial image whose “Fear” expression is too similar to the “Neutral” one. Considering all the expressions of the corresponding subject, we notice that they are very similar and not very expressive, which explains the error in Table 29(a). For the CK+ dataset, misclassifications also occur, as shown in Table 29(b): because of the similarities between facial expressions, some images from different classes were classified as the “Neutral” emotion. Furthermore, the misclassified images are the first frames of the image sequence describing the emotion, where the expression has not yet appeared; these images correspond to the first three images of each sequence, which constitute the Neutral class. Concerning the second type of datasets, which include images captured in the real world, some facial emotions were incorrectly predicted by the model for the SFEW_2.0 dataset, as shown in Table 29(c). We may attribute these recognition failures to the varying emotion intensities, the various poses and lighting conditions of movie scenes, and other uncontrolled conditions present in this dataset, mainly resolution variations, different age groups, and occlusion. The dataset even contains images combining more than two challenges at a time; the different challenges are illustrated in Table 28. Such images are delicate even for a human being to classify. Nevertheless, the proposed scheme was able to achieve 88.20% accuracy thanks to the quality and complementarity of the relevant features extracted from each model. As far as we know, this result has not been reported before, which is remarkable given the challenging nature of in-the-wild FER. Turning to the FER2013 dataset, some images were not correctly classified, as presented in Table 29(d).
The degradation of the results in these cases is due to the unbalanced distribution of the number of samples per class: for instance, the “Disgust” class contains 111 samples, while the “Happy” class includes 1774 samples. In addition to the uncontrolled conditions present in the other datasets, FER2013 is characterized by other specific constraints, such as the high number of babies’ and children’s facial images, the variety of skin colors and facial features, and the presence of cartoon images and emojis (Table 28). To sum up, most of the images that were misclassified by the individual models were correctly classified using the deep features extracted and concatenated into a single feature vector. Moreover, the images that remained misclassified after the feature concatenation were misclassified by at least two of the three models and, in most cases, by all three. These images are either very challenging (involving occlusions and extreme head pose deviations) or express the emotion with very low intensity. The concatenation of deep features, combined with the choice of suitable layers, was generally able to resolve the false classifications produced by the individual models. In fact, the concatenation compensated for the weaknesses of one model with the others, through the dynamic selection of the most relevant layers containing the most discriminant features.
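As an aside, the per-class recognition rates and the lists of misclassified images discussed above can be derived from a confusion matrix. The short sketch below is our own illustrative Python example using scikit-learn, not part of the proposed pipeline; the class ordering is an assumption.

import numpy as np
from sklearn.metrics import confusion_matrix

CLASSES = ["Anger", "Disgust", "Fear", "Happiness", "Neutral", "Sadness", "Surprise"]

def per_class_recognition(y_true, y_pred):
    # Row-normalised diagonal of the confusion matrix = recall (recognition rate)
    # of each emotion class, expressed as a percentage.
    cm = confusion_matrix(y_true, y_pred, labels=list(range(len(CLASSES))))
    rates = cm.diagonal() / cm.sum(axis=1)
    return dict(zip(CLASSES, np.round(100 * rates, 1)))

def misclassified_indices(y_true, y_pred):
    # Indices of the test images to inspect visually (as in Table 29).
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return np.flatnonzero(y_true != y_pred)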

Overall, according to the experimental results obtained with the multichannel CNN method based on deep learning techniques on the well-known CK+, JAFFE, FER2013, and SFEW_2.0 datasets, the proposed method shows high recognition accuracy thanks to the richness of each pre-trained model selected in this study (VGG19, GoogleNet, ResNet101) and to the relevance of the deep features extracted from each one. Besides, freezing the layers at a level tailored to the depth of each pre-trained model allowed us to save time and improve the quality of the extracted features. Indeed, the execution time recorded by the proposed method for the test phase is 16.537 ms, 47.200 ms, 36.463 ms, and 36.149 ms for the JAFFE, CK+, FER2013, and SFEW_2.0 datasets, respectively, using the following hardware configuration: Intel(R) i7 9th generation CPU, NVIDIA GeForce RTX2060 GPU, and 16GB RAM. This computational cost is among the lowest recorded by state-of-the-art works tested on the same datasets. For instance, although a more powerful hardware configuration (NVIDIA Quadro P5000 with 16GB of GDDR5X GPU memory, 2560 CUDA cores, and a 256-bit memory interface) was used in [42], execution times of 402.6 ms, 569.7 ms, and 1161.2 ms were recorded for the JAFFE, CK+, and FER2013 datasets, respectively. Moreover, in [75] (resp. [66]), an average execution time of 60 ms (resp. 34 ms) was reported when testing on the three datasets with a hardware configuration similar to the one used in our work. Thus, the proposed method proves to be cost-efficient in terms of computational time, with an average of 34 ms, which represents roughly half of the time required by the work in [75] while being competitive with [66].
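For reproducibility, the per-image test-phase latency quoted above can be measured along the lines of the following sketch (an illustrative PyTorch example under our own assumptions, with CUDA synchronization before reading the timer so that the asynchronous GPU work is fully accounted for); the actual measurement protocol used in the compared works may differ.

import time
import torch

def average_inference_time_ms(model, loader, device="cuda"):
    # Average per-image forward-pass time over a test DataLoader, in milliseconds.
    model.to(device).eval()
    total_s, n_images = 0.0, 0
    with torch.no_grad():
        for images, _ in loader:
            images = images.to(device)
            if device == "cuda":
                torch.cuda.synchronize()   # make sure previous GPU work is finished
            start = time.perf_counter()
            model(images)
            if device == "cuda":
                torch.cuda.synchronize()   # wait for the forward pass to complete
            total_s += time.perf_counter() - start
            n_images += images.size(0)
    return 1000.0 * total_s / n_images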

5 Conclusion

In order to depict facial emotions more accurately, we have proposed a robust and computationally efficient method based on multichannel deep feature extraction and concatenation. The proposed FER method relies on two dual deep learning networks: the first one is dedicated to deep feature extraction from three CNN models, while the second one is used for feature selection and concatenation. The validity of the proposed method has been assessed on four widely used FER datasets. The investigated datasets present various types of emotions: posed and spontaneous, captured under laboratory conditions and in-the-wild conditions. The first-line experiments performed in this study with the three pre-trained CNN models led to two findings. Firstly, the three CNN models are highly effective at capturing human facial emotions; secondly, these models do not share the same weaknesses and strengths regarding the recognition of the different emotion classes. Therefore, the main challenge addressed by the suggested method is to combine deep features coming from multiple channels of convolutional neural networks, thereby getting the most out of the models’ performances through their complementarity. The main objective is to achieve better results than each model applied separately, notably for in-the-wild environments.
The experimental study of the proposed method for emotion recognition has been divided into two parts. The first one presented the results of the DL used as extractor. It highlights the efficiency of the geometric DA techniques, which increased the amount of training images and thereby improved the relevance of the training of the deep layers remaining after weight freezing. These results also emphasize the efficiency of the features extracted from these remaining layers and the quality of the information carried by each model. The second part showed the results of the DL used as transformer, which produces the final recognition rates at the output of the multichannel convolutional neural network. This DL aims to deliver a single emotion prediction vector for each dataset; thus, the final vector encompasses the most relevant features selected from rich layers, which helped improve the final accuracy. In summary, the suggested method outperforms many relevant state-of-the-art methods, in addition to all the single-model-based methods. It achieved 97.62% on the JAFFE dataset and 98.80% on the CK+ dataset, while obtaining 88.20% on the SFEW_2.0 dataset and 94.01% on the FER2013 dataset. According to the experimental results and to the comparative analysis with several state-of-the-art works, the results obtained in this work outperform those in the literature for the four datasets, especially for the FER2013 and SFEW_2.0 datasets, for which the recognition rate is higher by about 20% and 34%, respectively, on average, compared to previously reported rates. This confirms the ability of the concatenated vector, formed by heterogeneous deep features extracted from the three CNN models, to enhance the accuracy of emotion recognition, particularly for the in-the-wild datasets where the enhancement has been remarkable. Furthermore, the proposed method could be improved in future work by investigating the use of action units and facial landmarks in order to better recognize facial images with non-accentuated expressions.