1 Introduction

Emotions are expressed mainly through hand and body gestures, the voice, and facial expressions, and facial expressions are the primary channel for conveying emotion during interaction. Mehrabian (2007) stated that 55% of emotional communication is conveyed through facial expressions alone. Ekman et al. (1972) identified six basic, universal emotional expressions, and Ekman et al. (1978) later carried out a systematic study of facial emotion analysis around these six expressions: anger, joy, sadness, disgust, surprise, and fear. The human face exhibits rich cues to a person's emotional state and behavior, and humans can identify another person's emotion accurately by observing the face for a few seconds. Facial emotion recognition is used in human–computer interaction (Bartlett et al. 2003), patient care, student attention estimation (Whitehill et al. 2014), multimedia, emotion-aware devices (Soleymani and Pantic 2013), surveillance (Wang et al. 2015), support for patients with autism spectrum disorder (Cockburn et al. 2008), and driver safety (Reddy et al. 2019; Mahesh Babu et al. 2019).

The dataset used in this work for recognizing facial expressions of virtual characters is UIBVFED, which is challenging because of its intra-class variation (Fig. 1) and inter-class similarity (Fig. 2). Inter-class similarity means that images belonging to different expression classes have a similar appearance, which makes them difficult to discriminate. Intra-class variation means that images within the same expression class differ in factors such as illumination, age, and skin color, which makes it difficult for the model to recognize the expression. Such intra-class variations are hard to handle in facial expression recognition. The performance of FER degrades in virtual environments due to the high intra-class variations and high inter-class similarities introduced by subtle facial appearance changes, illumination variations, skin-color changes, and identity-related attributes such as age, gender, and race.

Fig. 1
figure 1

Intra-class variations

Fig. 2
figure 2

Inter-class similarities

In the literature, most experiments have been conducted on datasets with only small intra-class variations. This assumption, however, is hard to satisfy when recognizing the facial expressions of virtual characters in virtual environments. Researchers have proposed various methodologies to address the above problems, but intra-class variation was not explicitly considered in many existing FER approaches even though the datasets used contain such variation. Most existing methods (Mayya et al. 2016; Venkata Rami Reddy et al. 2019; Gogić et al. 2020) depend on engineered features that lack the generalization ability needed for virtual-character expression recognition.

Recent advances in computer vision, especially deep learning models, have improved performance on facial emotion classification tasks. Convolutional Neural Network (CNN) based models are robust and perform well in facial expression classification. In a CNN, the convolutional filter parameters are fine-tuned at each layer to obtain high-level features that generalize and represent the characteristics needed to recognize unseen images.

Lee et al. (2014) addressed the intra-class variation problem by generating an intra-class variation image for each expression from the training images; the differences between these images serve as features for sparse representation. This method mainly addresses illumination variation. Intra-class-variation-reduced features were used in (Xie et al. 2018) to lessen the influence of intra-class variation, but that method did not consider the effects of skin-color and age variations.

The performance of existing FER systems has been limited by inter-class similarity and intra-class variation. To address these issues, we propose a CNN-based model for recognizing facial emotions of virtual characters. A multi-block deep CNN model was designed to extract discriminative features from virtual characters; the discriminative power of these features reduces the impact of intra-class variations and inter-class similarities and makes the model robust to such variations. The CNN model is used to obtain discriminative features from facial expression images, and these features are given as input to three classifiers (Support Vector Machine (SVM), Random Forest (RF), and Logistic Regression (LR)) for recognition. Based on the classifier used, three models are proposed: DCNN (softmax), DCNN-SVM (SVM with bagging), and DCNN-VC (voting technique).

The major contributions of this paper are as follows:

  • Proposed a multi-block DCNN model that extracts discriminative features for recognizing the seven facial emotions of virtual characters.

  • To the best of the authors' knowledge, this is the first model proposed to recognize facial expressions from three kinds of characters: virtual, stylized, and human.

  • Image data augmentation was performed to expand the datasets, improving performance and model generalization.

  • A bagging ensemble with SVM (DCNN-SVM) and a DCNN ensemble of SVM, RF, and LR classifiers with a majority voting technique (DCNN-VC) were proposed to make better predictions.

2 Related works

In emotional analysis, facial emotion recognition and classification is considered a challenging task. In recent years, many authors have proposed and developed various deep learning and machine learning (ML) models for emotion recognition. In most existing works, intra-class variation was not explicitly considered even though the experiments were carried out on datasets containing such variation.

Ramireddy et al. (2013) proposed a fusion-based method for recognizing emotions using Gabor wavelets and the Discrete Cosine Transform (DCT). In this work, different types of features were extracted using Gabor filters and DCT, and kernel principal component analysis was applied to extract features and reduce their dimensionality. A Radial Basis Function Neural Network (RBFNN) was applied to classify the expression images into the six basic emotions. Experiments were performed on the CK dataset, and an accuracy of 99% was obtained with limited training and testing samples. Pons et al. (2018) developed a framework for recognizing emotions using a supervised committee of CNNs, in which 72 CNNs with the same baseline architecture were used for feature extraction; the proposed work was evaluated on the FER2013, MMI, and LFW datasets. Li et al. (2019) designed a CNN model for recognizing emotions using an attention mechanism (ACNN): pACNN was applied to local facial patches, whereas gACNN combined both patch-level and image-level features. Experiments were performed on the AffectNet and RAF-DB datasets, attaining 85% and 58.75% accuracy respectively.

Xie et al. (2018) developed a model based on deep comprehensive multi-patch aggregation CNNs. In this work, two branches of CNNs were used: one branch extracted local features from patches while the other obtained holistic features from the entire face sample, and the two sets of features were combined into a feature vector and given to the classifier for expression classification. Experiments were performed on the CK+ and JAFFE datasets, attaining 93.46% and 94.75% accuracy respectively. Mayya et al. (2016) developed a new method for recognizing emotions using DCNNs: the face was first detected in the dataset images, the frontal face images were given to a CNN for feature extraction, and an SVM with grid search was used for classification. The proposed models were evaluated on CK+ and JAFFE, achieving 97% and 98.12% accuracy respectively. Rami Reddy et al. (2019) proposed several FER methods in which local and global features were extracted using Gabor wavelets and HWT respectively, non-linear PCA (NLPCA) was used to reduce the feature dimensionality, weighted and concatenated fusion techniques were applied to combine the two types of features, and an SVM was used for classification. Experiments on CK+ achieved 98% accuracy.

An RGB–D Microsoft Kinect camera was used to record the facial expressions of students in a classroom for emotion recognition in (Purnama and Sari 2019). An Adaptive-Network-Based Fuzzy Inference System was used to train and classify the expressions, with a combination of the EURECOM and Cohn-Kanade datasets used for training. In biometric recognition, system accuracy depends on the quality of the input images; the impact of image quality on accuracy was discussed in (Alsmirat et al. 2019). In that study, the system maintained good accuracy up to a 30–40% compression ratio of the raw images, while higher ratios negatively impacted accuracy. Li et al. (2019) introduced deep overlap and weighted filter concepts into the macro-pixel approach to extract richer features from macro pixels; the experimental results show that the proposed approach achieved better accuracy than the original macro-pixel approaches.

CNN features were merged with SIFT features to increase FER accuracy by Connie et al. (2017); this work was tested on the FER2013 and CK+ datasets and attained 73.4% and 99.1% accuracy respectively. A CNN-feature-based FER system was developed by Gonzalez-Lozoya et al. (2020), in which facial features were extracted using a CNN and model generalization was improved by mixing images from different datasets. Ozcan et al. (2020) used transfer learning with hyperparameter optimization for FER on static images, utilizing hyperparameter optimization to increase the accuracy of the model; the work was evaluated on the JAFFE and ERUFER datasets. Gogić et al. (2020) developed a joint optimization framework for FER using local binary features and shallow networks with improved execution time. A hybrid deep learning model was developed by Garima and Hemraj (2020) for facial expression recognition, in which one CNN identifies the primary emotion (sad or joy) and a secondary CNN recognizes the secondary emotion of the image; this work was tested on the FER2013 and JAFFE datasets.

All of the above works produced good results on human-face datasets, but the models are sensitive to the illumination and specific poses present in the dataset used because each was evaluated on a single kind of dataset. They lack the generalization ability needed to recognize the expressions of virtual and stylized characters, and their performance is limited by the two problems stated above. Moreover, most existing facial expression recognition methods use a single classifier and therefore suffer from bias and variance, which affects performance. Hence, there is ample scope for a new model that recognizes the emotions of virtual and stylized characters with better accuracy. We therefore developed a new model that recognizes emotions from three kinds of characters, namely virtual, stylized, and human, and used ensemble learning techniques during classification to make better predictions.

3 The proposed models

DCNN, DCNN-SVM, and DCNN-VC models are proposed for facial expression recognition from three kinds of characters, namely virtual characters, stylized characters, and humans. First, the face is detected and cropped; data augmentation is then applied to increase the number of image samples, which are given as input to the DCNN for feature extraction. Finally, these features are fed into the classifiers (SVM, RF, and LR) for recognition. The process is described in detail below.

3.1 Face detection

The Viola-Jones algorithm (Viola et al. 2004) was applied to detect faces, and the detected faces were cropped as shown in Fig. 3. This algorithm was adopted for face detection because of its low false-positive rate. It works as follows: the image is subdivided into a grid of rectangles, and Haar feature selection uses these rectangles to detect features within sliding windows over the image. The AdaBoost algorithm builds a strong classifier by combining a set of weak learners, each of which uses Haar-like features to look for a face in a sub-region of the image. Each classifier examines the sub-region; if it finds a face, the region is forwarded to the next classifier, otherwise the sub-region is rejected. This repeats until the last weak classifier is reached, and if all classifiers detect a face, the strong classifier accepts the sub-region as a human face.
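As a concrete illustration, a minimal sketch of this detection-and-cropping step using OpenCV's bundled Haar cascade (an implementation of Viola-Jones) is given below. The cascade file, the 48 × 48 crop size, and the helper name are illustrative assumptions rather than details reported in the paper.

# Minimal Viola-Jones face detection and cropping sketch (assumed settings).
import cv2

cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

def detect_and_crop(image_path, size=(48, 48)):
    """Detect the largest face in an image and return it cropped and resized."""
    img = cv2.imread(image_path)
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    faces = cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    if len(faces) == 0:
        return None  # no face found; such a sample would be skipped
    x, y, w, h = max(faces, key=lambda f: f[2] * f[3])  # keep the largest box
    return cv2.resize(img[y:y + h, x:x + w], size)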

Fig. 3
figure 3

Pre-processed face detected image

3.2 Data augmentation

The UIBVFED, CK+, JAFFE, and TFEID datasets have limited samples, so there is a risk of over-fitting because deep learning models require many samples for training. Image data augmentation was applied to expand the datasets and improve model performance and generalization. Augmentation techniques such as flipping, rotation, and shifting were applied to increase the number of samples in the UIBVFED, JAFFE, CK+, and TFEID datasets. In the proposed model, horizontal flipping, a rotation range of 20, and a shift of 0.2 were used. A sample image after data augmentation is shown in Fig. 4.
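A minimal Keras sketch of this augmentation step, using the settings stated above (horizontal flip, rotation range 20, shift 0.2), is given below; the directory layout, grayscale mode, and 48 × 48 image size are assumptions made for the sketch.

# Data augmentation sketch with Keras' ImageDataGenerator (assumed I/O settings).
from tensorflow.keras.preprocessing.image import ImageDataGenerator

augmenter = ImageDataGenerator(
    horizontal_flip=True,
    rotation_range=20,
    width_shift_range=0.2,
    height_shift_range=0.2,
    rescale=1.0 / 255)

train_flow = augmenter.flow_from_directory(
    "data/train",             # one sub-folder per emotion class (assumed layout)
    target_size=(48, 48),     # assumed input resolution
    color_mode="grayscale",   # assumed channel count
    batch_size=16,
    class_mode="categorical")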

Fig. 4
figure 4

Pre-processed data augmented image

3.3 DCNN

A multi-block DCNN is proposed for FER; its architecture is depicted in Fig. 5. It consists of four blocks for extracting features from facial images. Each block contains two convolution layers, an exponential linear unit (ELU), batch normalization, a max-pooling layer, and dropout. Kernel and bias regularizers with L2 regularization are used in the first convolution layer of each block to reduce overfitting by penalizing the weight and bias values, and a kernel initializer is used to initialize the weights in the first convolution layer of the first block. Batch normalization is applied after each convolution layer to improve performance and stability and to speed up learning, and dropout is applied to prevent the model from overfitting. Each block generates a feature map that is given as input to the next block. The first block extracts low-level features such as dots, lines, and curves; the second and third blocks extract mid-level features; and the last block generates high-level features. The feature map of the last block is flattened and forwarded to a fully connected (fc) layer, whose output is given to a softmax layer that classifies the facial expression images into the corresponding emotion classes. Convolution, ELU, max-pooling, and softmax are the main computational elements of the proposed multi-block DCNN model, and the following subsections describe their functionality.
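A minimal Keras sketch of this four-block design is given below. The filter counts, dropout rate, dense width, and 48 × 48 grayscale input are assumptions made for the sketch (the paper gives filter counts only in Fig. 5), and the ordering of ELU and batch normalization follows the description above only approximately.

# Sketch of the four-block DCNN (assumed filter counts and input size).
from tensorflow.keras import layers, models, regularizers

def conv_block(x, filters, first_block=False):
    """Two Conv-ELU-BatchNorm stages followed by max-pooling and dropout."""
    x = layers.Conv2D(filters, 3, padding="same",
                      kernel_regularizer=regularizers.l2(0.01),
                      bias_regularizer=regularizers.l2(0.01),
                      kernel_initializer="he_uniform" if first_block else "glorot_uniform")(x)
    x = layers.ELU()(x)
    x = layers.BatchNormalization()(x)
    x = layers.Conv2D(filters, 3, padding="same")(x)
    x = layers.ELU()(x)
    x = layers.BatchNormalization()(x)
    x = layers.MaxPooling2D(2)(x)
    return layers.Dropout(0.25)(x)  # assumed dropout rate

inputs = layers.Input(shape=(48, 48, 1))      # assumed grayscale input
x = conv_block(inputs, 32, first_block=True)  # assumed filter counts: 32-64-128-256
x = conv_block(x, 64)
x = conv_block(x, 128)
x = conv_block(x, 256)
x = layers.Flatten()(x)
x = layers.Dense(256, activation="elu")(x)    # assumed fc width
outputs = layers.Dense(7, activation="softmax")(x)
model = models.Model(inputs, outputs)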

Fig. 5
figure 5

Architecture of DCNN, R: Regularizer, F: Number of filters

3.3.1 Convolution layer

The convolution layer (Teow 2017) is applied to obtain pixel-wise visual features from an input face image. In this layer, the kernel weights are automatically adjusted using backpropagation to learn the input expression features, and the resulting features are forwarded to the next layer for further processing. In the proposed DCNN model, two convolution layers are used in each block. The convolution is a dot product between the face image and the kernel:

$$f_{c}(i,j) = \sum_{m} \sum_{n} I(m, n)\,W(i - m,\, j - n)$$
(1)

Here, fc is the convolution feature map, I represents the facial input image, and W represents the convolution kernel.
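As a worked illustration of Eq. (1), the NumPy sketch below computes the same sum directly: the kernel is flipped and slid over the image, and each output element is the sum of the element-wise product. "Valid" output sizing (no padding, stride 1) is an assumption of the sketch.

# Direct NumPy implementation of Eq. (1) for illustration.
import numpy as np

def conv2d(I, W):
    """2-D convolution of image I with kernel W (no padding, stride 1)."""
    kh, kw = W.shape
    Wf = W[::-1, ::-1]  # kernel flip corresponds to W(i - m, j - n)
    out = np.zeros((I.shape[0] - kh + 1, I.shape[1] - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(I[i:i + kh, j:j + kw] * Wf)
    return out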

3.3.2 ELU layer

The ELU activation function is applied to speed up learning and improve model generalization. In the proposed DCNN, ELU is used before max-pooling in each block. ReLU suffers from the dying-ReLU problem, in which some units stop being updated, so the ELU activation function is used instead. The ELU activation function is given in Eq. 2: if x is greater than zero the result is x, otherwise the output is a small negative value that depends on α. Because ELU can produce negative values, it helps the network nudge biases and weights in the correct direction and produces non-zero activations during gradient computation. The output of ELU is a feature map fe.

$$f_{e} = \begin{cases} x, & \text{if } x > 0 \\ \alpha \left( e^{x} - 1 \right), & \text{if } x \le 0 \end{cases}$$
(2)

Here, fe is the ELU feature map, x represents the input, and α represents the nonlinearity parameter.
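A minimal NumPy version of Eq. (2) is shown below; α = 1.0 is an assumed default value.

# Elementwise ELU, mirroring Eq. (2).
import numpy as np

def elu(x, alpha=1.0):
    """Pass positive values through; map non-positive values to alpha*(exp(x)-1)."""
    return np.where(x > 0, x, alpha * (np.exp(x) - 1.0))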

3.3.3 Pooling layer

The feature map fe generated by the ELU is forwarded to the pooling layer, which subsamples fe for dimensionality reduction. In this work, 2 × 2 max-pooling without zero padding is applied for downsampling. In the max-pooling layer, the pooling operation outputs the maximum value of the input within the kernel area at a given position, as given by Eq. (3).

$$f_{p} = \max_{i, j = 1}^{h, w} x_{i, j}$$
(3)

where fp is the pooled feature map generated by the max-pooling operation.

3.3.4 Softmax layer

In multiclass classification, softmax returns a probability distribution over the target classes: real values between 0 and 1 that assign a probability to each class and sum to 1. Mathematically, the softmax function is given by Eq. (4).

$${\text{Softmax}} \left( {x_{i} } \right) = \frac{{e^{{x_{i} }} }}{{\mathop \sum \nolimits_{j = 1}^{n} e^{{x_{j} }} }}$$
(4)

where n is the number of target classes. Here, the DCNN is trained to classify the facial expression images into classes 0 to 6, and from Eq. 4 the expression class with the highest probability is taken as the predicted output.
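A minimal NumPy version of Eq. (4) follows; the max-shift is the usual numerical-stability trick and does not change the result.

# Softmax over class scores, mirroring Eq. (4).
import numpy as np

def softmax(x):
    """Return a probability distribution over the target classes."""
    e = np.exp(x - np.max(x))
    return e / e.sum()

print(softmax(np.array([2.0, 1.0, 0.1])))  # seven scores in practice; probabilities sum to 1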

3.4 DCNN-SVM

To reduce variance and increase accuracy, a second model was proposed using the bagging ensemble technique. In this model, a bagging ensemble with SVM (Kim et al. 2002) as the base classifier is used for facial expression classification.

In DCNN-SVM, the DCNN model is applied to obtain discriminative features from the face images, and these features are given as input to a bagging ensemble with SVM as the base classifier for facial expression classification. In this model, three SVMs are trained independently on the deep features using a bootstrap technique and are combined using majority voting, in which the class that receives the highest number of votes is predicted as the final class. Figure 6 shows the architecture of the DCNN-SVM model, and the bagging algorithm (Lango and Stefanowski 2017) is given in Table 1.
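A minimal scikit-learn sketch of this stage is given below: three linear SVMs are trained on bootstrap replicas of the deep features and combined by majority vote. The variables X_train, y_train, and X_test are assumed to hold the DCNN feature vectors and labels.

# Bagging ensemble of linear SVMs over the deep features (assumed data variables).
from sklearn.ensemble import BaggingClassifier
from sklearn.svm import SVC

bagged_svm = BaggingClassifier(
    SVC(kernel="linear"),  # base classifier
    n_estimators=3,        # three SVMs, as described above
    bootstrap=True)        # bootstrap replicas of the training set
bagged_svm.fit(X_train, y_train)
y_pred = bagged_svm.predict(X_test)  # majority vote over the three SVMs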

Fig. 6
figure 6

Architecture of DCNN-SVM

Table 1 Bagging algorithm

DCNN-SVM procedure

  1. The dataset is preprocessed using face detection and data augmentation techniques.

  2. The DCNN model is used to obtain discriminative features from the input images.

  3. These discriminative features are separated into training and testing sets.

  4. The bootstrapping technique randomly generates K replicated subsets from the training set.

  5. An SVM base classifier is trained on each subset.

  6. Deep features of the test set are given as input to each trained SVM classifier, which predicts a class label.

  7. The majority voting technique selects the class predicted by most classifiers as the final class.

3.4.1 SVM

In computer vision, SVM is a widely used algorithm for image classification and performs well in facial emotion analysis. It is best suited to binary classification, but with different kernels and decomposition schemes it can also be used for multi-class classification. During experimentation, we use the one-vs-one approach of SVM with a linear kernel for facial expression recognition. In the one-vs-one approach, SVM constructs C(C-1)/2 binary classifiers to classify C classes of input data. The SVM uses the maximum-margin principle to separate the data points.

3.5 DCNN-VC

In this paper, a model using ensemble learning is proposed to increase stability and make better predictions. In DCNN-VC, the DCNN model is applied to obtain discriminative features from the face images, and these features are forwarded to an ensemble of classifiers with a voting technique for emotion recognition. Voting combines the predictions of several ML algorithms; it is not a classifier itself but a wrapper for a set of machine learning algorithms that are trained and tested in parallel to exploit the different strengths of each algorithm. In this work, the deep features are used to train SVM, RF, and LR classifiers. Different combinations of machine learning algorithms were tried, and this combination was finally chosen because it provides better accuracy than the others. The majority-vote ensemble method increases accuracy by combining the advantages of each classifier: the class predicted by most classifiers is selected as the final class. The architecture of DCNN-VC is depicted in Fig. 7, and the majority voting algorithm is given in Table 2.
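A minimal scikit-learn sketch of this stage is shown below: SVM, Random Forest, and Logistic Regression are trained on the deep features and combined by hard (majority) voting. The variables X_train, y_train, and X_test are assumed to hold the DCNN feature vectors and labels, and the classifier hyperparameters are illustrative.

# Majority-voting ensemble of SVM, RF, and LR over the deep features.
from sklearn.ensemble import VotingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

voting_clf = VotingClassifier(
    estimators=[("svm", SVC(kernel="linear")),
                ("rf", RandomForestClassifier(n_estimators=100)),
                ("lr", LogisticRegression(max_iter=1000))],
    voting="hard")  # majority vote over the three classifiers
voting_clf.fit(X_train, y_train)
y_pred = voting_clf.predict(X_test)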

Fig. 7
figure 7

Architecture of DCNN-VC

Table 2 Majority Voting algorithm

DCNN-VC procedure

  1. The dataset is preprocessed using face detection and data augmentation techniques.

  2. The DCNN model is used to obtain discriminative features from the input images.

  3. These discriminative features are separated into training and testing sets.

  4. SVM, RF, and LR classifiers are trained in parallel on the training set.

  5. Deep features of the test set are given as input to each trained classifier, which predicts a class label.

  6. The majority voting technique selects the class predicted by most classifiers as the final class.

3.6 Random Forest classifier

RF (Pu et al. 2015) is a fast, robust algorithm mainly used for classification tasks. RF is itself an ensemble method built from multiple decision trees, whose individual predictions are combined by voting. It mitigates overfitting by combining the predictions of different decision trees and works well on unbalanced data. The RF classifier is stated in Eq. 5.

$$F(x) = \underset{Y}{\arg\max } \sum_{i = 1}^{N} I\left( f_{i}(x) = Y \right)$$
(5)

where F(x) is the majority-voted prediction, N is the number of decision trees, fi is the decision function of the i-th decision tree, Y is a class label, and \(I\left( f_{i}(x) = Y \right)\) is an indicator function that equals 1 when the i-th tree assigns x to class Y.

3.7 Logistic regression classifier

Logistic regression is an ML algorithm used to solve various classification problems. It is based on probability: it is mainly used for binary classification but can also solve multi-class problems using a one-vs-rest scheme. It uses the sigmoid function to map predicted values to probabilities.

3.8 FER using transfer learning

Transfer learning approaches are also used for emotion classification in this work. In transfer learning, pre-trained models are used instead of training a layered architecture from scratch to learn complex features. The ResNet50 and VGG19 pre-trained models were used to obtain the required features from face samples, and these features were forwarded to ensemble classifiers with voting for emotion classification. The two resulting methods, ResNet50-VC and VGG19-VC, were trained on the UIBVFED, FERG, CK+, JAFFE, and TFEID datasets.
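A minimal Keras sketch of the two feature extractors is given below: ResNet50 with global average pooling yields an N × 2048 feature matrix and VGG19's second fully connected layer yields N × 4096 features, which would then be fed to the voting classifier. The ImageNet weights, 224 × 224 RGB input, and the images variable are assumptions of the sketch.

# Pre-trained feature extraction with ResNet50 and VGG19 (assumed input array).
from tensorflow.keras.applications import ResNet50, VGG19
from tensorflow.keras.models import Model

resnet_extractor = ResNet50(weights="imagenet", include_top=False,
                            pooling="avg", input_shape=(224, 224, 3))  # 2048-d features

vgg = VGG19(weights="imagenet")
vgg_extractor = Model(vgg.input, vgg.get_layer("fc2").output)          # 4096-d features

resnet_features = resnet_extractor.predict(images)  # images: N x 224 x 224 x 3 (assumed)
vgg_features = vgg_extractor.predict(images)         # each matrix then goes to the voting classifier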

3.9 ResNet50-VC

ResNet (He et al. 2016), the deep residual network, is a deeply layered architecture. The vanishing gradient problem is mitigated in ResNet by its skip connections: the main idea behind ResNet is to introduce skip connections that bypass one or more layers, as shown in Fig. 8. If any of the layers are not useful during training, the skip connections allow them to be bypassed, which enables faster training and effective parameter tuning. The output G(i) is defined by Eq. 6

$$G(i)=F(i)+i$$
(6)

Here F(i) represents the stacked layers and i represents the identity mapping.

Fig. 8
figure 8

Residual block

Fig. 9
figure 9

Architecture of ResNet50-VC

The ResNet50 model has 50 layers, which are used to obtain complex features from the training dataset; these features are fed into the ensemble voting classifier for classification. Figure 9 shows the structure of ResNet50-VC. ResNet50 has a convolution layer with 64 kernels of size 7 × 7, max-pooling of size 3 × 3 with stride 2, sixteen residual blocks with kernel sizes 1 × 1 and 3 × 3 and filter counts of 64, 128, 256, 512, 1024, and 2048, and average pooling of size 7 × 7 with stride 7. In this architecture, 2048 features are extracted at the last layer, i.e., the average pooling layer. Finally, a feature matrix of size N × 2048 is generated, where N represents the number of images; this matrix is fed into an ensemble of classifiers with a majority voting technique for emotion recognition.

3.10 VGG-19 with voting classifier

The Visual Geometry Group (VGG) network (Simonyan et al. 2014) has several variants, including VGG-11, VGG-13, VGG-16, and VGG-19; the VGG-19 pre-trained model is used in this work for emotion classification. An image of size 224 × 224 is given as input to the model, and the number of parameters is kept small by using very small 3 × 3 convolutions. The VGG-19 model, with 19 weight layers, is used to obtain features from the facial emotion samples. VGG-19 has five blocks of convolution layers with 3 × 3 filters followed by 2 × 2 max-pooling, two fully connected layers, and a softmax layer. The first block consists of two convolutions, each with 64 kernels; the second block also has two convolutions, each with 128 kernels; and the third, fourth, and fifth blocks contain four convolutions each, with 256, 512, and 512 kernels respectively. 4096 features are extracted at the last feature layer, i.e., the second fully connected layer. Finally, this feature vector is fed into an ensemble of classifiers with a majority voting technique for facial emotion recognition. The architecture of VGG19-VC is shown in Fig. 10.

Fig. 10
figure 10

Architecture of VGG19-VC

4 Experimental results

The proposed models were implemented on an Intel Core i5 system with 8 GB RAM and ASUS GeForce GTX 1060 Ti 3 GB graphics. This section presents a detailed discussion of the experimental analysis of our models on five benchmark FER datasets. Table 3 presents information about the datasets, and sample images are shown in Fig. 11.

Table 3 Dataset description
Fig. 11
figure 11

Sample images of various datasets

4.1 Proposed DCNN model hyperparameters

Various hyperparameters were tuned to improve the performance of the proposed DCNN model. L2 kernel and bias regularizers are used to reduce overfitting, and the he_uniform kernel initializer is used for weight initialization. Batch normalization is applied to improve performance and stability and to speed up learning, and the ELU activation function is used to speed up learning and improve generalization.

4.1.1 Learning rate selection

The learning rate determines how quickly the model weights change and is used in minimizing the cost function of the network. If the learning rate is too high, training may not converge and the cost function may increase; if it is low, training is reliable and the loss decreases, but optimization takes longer. To minimize the cost function and improve the accuracy of the model, an optimal learning rate was selected. The DCNN model was trained with different learning rates (0.01, 0.001, 0.0001, and 0.00001) on the five datasets, and its accuracy with each learning rate was evaluated. From Fig. 12 it is observed that DCNN achieves the best accuracy when the learning rate is 0.0001.

Fig. 12
figure 12

Comparison of classification accuracy based on various learning rates

4.1.2 The mini-batch size selection

The mini-batch size determines the number of images processed before the model parameters are updated. If the mini-batch size is large, more memory is needed and the model runs for long stretches with constant weights, which affects performance, so a suitable batch size must be selected to improve the performance of the system. The proposed DCNN was examined with mini-batch sizes of 4, 8, 16, and 32 to select the most suitable value, and its performance with the various batch sizes on the five datasets is compared in Fig. 13. The proposed DCNN was run for 15 epochs with a learning rate of 0.0001. A mini-batch size of 4 gives better accuracy for CK+, JAFFE, and TFEID, and 16 for the UIBVFED dataset. A batch size of 64 was used for FERG, as it has a large number of samples, and the proposed model achieved 99.97% accuracy on FERG with this batch size.

Fig. 13
figure 13

Comparison of classification accuracy based on mini-batch size

4.1.3 Optimizer selection

The purpose of the optimizer in deep learning is to update the bias and weight parameters to reduce the cost (loss) function, and choosing the optimizer best suited to the problem yields better results at a faster rate. The proposed model was evaluated with several optimizers: SGD (Stochastic Gradient Descent), Adam, RMSprop, and Adagrad. The performance with each optimizer on the five datasets, after 15 epochs and with a learning rate of 0.0001, is shown in Fig. 14. The accuracy of the DCNN model was highest with the Adam optimizer.
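For reference, a minimal sketch of the selected training configuration (Adam with the learning rate of 0.0001 found above) is shown below; model and train_flow refer to the earlier sketches, and the loss and metric names are the standard Keras choices assumed here.

# Compile and train with the selected hyperparameters (assumed loss/metric names).
from tensorflow.keras.optimizers import Adam

model.compile(optimizer=Adam(learning_rate=1e-4),
              loss="categorical_crossentropy",
              metrics=["accuracy"])
model.fit(train_flow, epochs=15)  # epoch count used in the hyperparameter study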

Fig. 14
figure 14

Comparison of classification accuracy based on optimizers

4.1.4 Number of epochs selection

In each epoch, the network weights are updated using all input images in the dataset. The optimal number of epochs depends on the dataset size, model depth, learning rate, and optimizer. In this work, the number of epochs was chosen based on the facial expression recognition rate. The proposed DCNN was trained for up to 100 epochs on the five datasets, and the classification accuracy at various numbers of epochs is depicted in Fig. 15. It clearly shows that classification accuracy increases up to 100 epochs, and the highest recognition rate was observed for all datasets when the number of epochs was set to 100.

Fig. 15
figure 15

Comparison of classification accuracy based on the number of epochs

4.2 Overall recognition accuracy of proposed models

The proposed models were evaluated using accuracy (Eq. 7), recall (Eq. 8), precision (Eq. 9), F1-score (Eq. 10), confusion matrix, precision-recall curve, and ROC curves.

$$Accuracy = \frac{1}{k}\mathop \sum \limits_{i = 1}^{k} \frac{{TP_{i} + TN_{i} }}{{TP_{i} + TN_{i} + FP_{i} + FN_{i} }}$$
(7)
$${\text{Recall}} = \frac{TP}{{TP + FN}}$$
(8)
$${\text{Precision}} = \frac{TP}{{TP + FP}}$$
(9)
$$F_{1} = 2*\frac{Precision*Recall}{{Precision + Recall}}$$
(10)

where k is the number of classes, TP represents true positives, TN true negatives, FN false negatives, and FP false positives.

The UIBVFED and FERG datasets are more challenging because of their intra-class variation, inter-class similarity, and imbalanced emotion classes. From each dataset, 55% of the image samples were used for training, 15% for validation, and 30% for testing. The performance of the proposed models on the five datasets is reported in Table 4. The DCNN method provides the highest recognition rate on the large datasets, UIBVFED and FERG, while the DCNN-VC model provides the highest recognition rate on the small datasets, CK+, JAFFE, and TFEID.
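A minimal sketch of this 55/15/30 split is given below, applying scikit-learn's train_test_split twice; X and y are assumed to hold one dataset's images and labels, and the stratification and random seed are assumptions made for the sketch.

# Two-stage 55% / 15% / 30% split (assumed stratification and seed).
from sklearn.model_selection import train_test_split

X_train, X_rest, y_train, y_rest = train_test_split(
    X, y, train_size=0.55, stratify=y, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(
    X_rest, y_rest, train_size=1 / 3, stratify=y_rest, random_state=42)
# one third of the remaining 45% is 15% (validation); the other 30% is the test set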

Table 4 Overall accuracy of proposed methods on five datasets (%)

The recognition rate of each expression obtained by the DCNN model on the five datasets is presented in Table 5, which shows that the DCNN model performs well on all expressions except fear. The per-expression recognition rates of the DCNN-SVM model are presented in Table 6, which shows that DCNN-SVM performs best when recognizing the anger, neutral, and surprise expressions. The per-expression recognition rates of the DCNN-VC model are presented in Table 7, which shows that DCNN-VC performs well on all expressions except fear and sad.

Table 5 Recognition rate of each expression of DCNN model (%)
Table 6 Recognition rate of each expression of DCNN-SVM model(%)
Table 7 Recognition rate of each expression of DCNN-VC model(%)

From Table 8, it can be seen that precision, recall, and F1-score are high for each facial expression, indicating that the proposed models recognize each expression well and return more true positives. The precision-recall curves of DCNN and DCNN-VC on the respective datasets are depicted in Figs. 16, 17, 18, 19 and 20. To support the claim that the DCNN model extracts more discriminative features, ROC curves were plotted and the AUC score was calculated for the proposed models that achieved the highest accuracy on the five datasets; Figs. 21, 22, 23, 24 and 25 and Table 9 show that the proposed models recognize each facial expression accurately. Table 10 presents the confusion matrix of DCNN on UIBVFED: a few fear samples are misclassified as neutral and sad because these three expressions are visually very similar. Table 11 presents the confusion matrix of DCNN on FERG, where very few fear and neutral samples are misclassified as sad. Table 12 presents the confusion matrix of DCNN-VC on CK+, where some sad samples are misclassified as fear. Table 13 presents the confusion matrix of DCNN-VC on JAFFE, where a few sad samples are misclassified as disgust and joy. Table 14 presents the confusion matrix of DCNN-VC on TFEID, where very few contempt and joy samples are misclassified as each other. Table 15 presents the overall accuracy of the pre-trained models on the five datasets; the proposed models produce better results than the pre-trained ResNet50 and VGG-19 models combined with the voting technique.

Table 8 Statistical analysis of proposed models on UIBVFED
Fig. 16
figure 16

Precision-Recall curve of DCNN on UIBVFED

Fig. 17
figure 17

Precision-Recall curve of DCNN on FERG

Fig. 18
figure 18

Precision-Recall curve of DCNN-VC on CK+

Fig. 19
figure 19

Precision-Recall curve of DCNN-VC on JAFFE

Fig. 20
figure 20

Precision-Recall curve of DCNN-VC on TFEID

Fig. 21
figure 21

ROC curves of DCNN on UIBVFED

Fig. 22
figure 22

ROC curves of DCNN on FERG

Fig. 23
figure 23

ROC curves of DCNN-VC on CK+

Fig. 24
figure 24

ROC curves of DCNN-VC on JAFFE

Fig. 25
figure 25

ROC curves of DCNN-VC on TFEID

Table 9 AUC score of proposed models
Table 10 Confusion matrix of DCNN on UIBVFED (%)
Table 11 Confusion matrix of DCNN on FERG (%)
Table 12 Confusion matrix of DCNN-VC on CK+ (%)
Table 13 Confusion matrix of DCNN-VC on JAFFE (%)
Table 14 Confusion matrix of DCNN-VC on TFEID (%)
Table 15 Overall accuracy of pre-trained models on various datasets (%)

4.3 Performance of proposed models on closed expressions (inter-class similarity)

In general, humans recognize the anger, disgust, joy, and surprise expressions accurately because these expressions have unique features that distinguish them from the other facial expressions. Humans sometimes fail to recognize the fear, neutral, and sad expressions because these expressions are more similar to one another (inter-class similarity). The proposed models recognize the fear, neutral, and sad expressions with more than 98% accuracy, as they extract more prominent and discriminative features. The performance of DCNN, DCNN-SVM, and DCNN-VC on these closed expressions is shown in Figs. 26, 27 and 28 respectively.

Fig. 26
figure 26

Performance of DCNN on closed expressions

Fig. 27
figure 27

Performance of DCNN-SVM on closed expressions

Fig. 28
figure 28

Performance of DCNN-VC on closed expressions

4.4 State-of-art models

The performance of the proposed models was compared with state-of-the-art approaches on the UIBVFED (Table 16), FERG (Table 17), CK+ (Table 18), JAFFE (Table 19), and TFEID (Table 20) datasets. It can be noted from these tables (Tables 15, 16, 17, 18, 19, 20) that the proposed models outperform the existing state-of-the-art models on all five datasets, and they are particularly superior on the challenging UIBVFED and FERG datasets.

Table 16 Performance comparisons of existing methods on UIBVFED (%)
Table 17 Performance comparisons of existing methods on FERG (%)
Table 18 Performance comparisons of existing methods on CK+ (%)
Table 19 Performance comparisons of existing methods on JAFFE (%)
Table 20 Performance comparisons of existing methods on TFEID (%)

Moreover, our proposed models exhibit better performance than existing deep learning models such as Deep Comprehensive Multi patches Aggregation CNNs (Xie et al. 2018), hierarchical CNNs (Kim et al. 2019), IB-CNN (Han et al. 2016), Attentional CNNs (Minaee et al. 2019), DeepExpr (Aneja et al. 2016), Ensemble Multi-feature (Zhao et al. 2018), TL-HO (Ozcan and Basturk 2020), CNNS (Gonzalez-Lozoya et al. 2020) and Hybrid DL (Garima and Hemraj 2020).

5 Conclusions

In this work, a new approach for recognizing the facial expressions of virtual characters was proposed using a multi-block DCNN and ensemble classifiers. In the multi-block DCNN, four blocks with various computational elements extract discriminative features from the facial images, and these features are fed into a softmax layer for classification. In DCNN-SVM, the DCNN model is applied to obtain discriminative features from the face images, which are then given to a bagging ensemble with SVM as the base classifier for facial expression classification. In DCNN-VC, the features extracted by the DCNN are forwarded to an ensemble of classifiers with a voting technique for emotion recognition. The proposed models were evaluated on five publicly available datasets (UIBVFED, FERG, CK+, JAFFE, and TFEID). UIBVFED and FERG are challenging datasets because of their intra-class variation and inter-class similarities; the proposed models handle these two issues and produce the best accuracy on both datasets. The proposed models also outperform existing state-of-the-art works on all five datasets. A limitation of the proposed models is their relatively low performance under face occlusion, which will be addressed in future work.