
36.1 Introduction

Intelligent facial expression recognition is a vast and important research topic. Recognizing emotion from images is difficult and sensitive, yet the face is the index of the mind, and a system that can read it can understand a situation better and produce more fruitful results. Such a system is also helpful in human-computer interaction, where the machine interacts with humans in uncontrolled environments in which scene lighting, camera view, image resolution, background, the user's head pose, gender, and ethnicity can vary significantly.

A convolutional neural network (CNN) is a type of neural network that assumes its input is an image. It contains a series of hidden layers that transform the input into the output; each hidden layer is made up of neurons, and each neuron is connected to a local region of the previous layer (in the fully connected layers, each neuron is connected to all neurons of the previous layer). A CNN mainly comprises three types of layers, listed below; a minimal sketch combining them follows the list.

  1. Convolution layer

  2. Pooling layer

  3. Fully connected layer.
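As a minimal sketch (in Keras, with illustrative filter counts and layer sizes that are not taken from this paper), the three layer types combine as follows:

```python
# Minimal sketch of the three CNN layer types (Keras); all hyperparameters are illustrative.
from tensorflow.keras import layers, models

model = models.Sequential([
    layers.Input(shape=(48, 48, 1)),               # 48 x 48 grayscale face image
    layers.Conv2D(32, (3, 3), activation="relu"),  # convolution layer: local feature extraction
    layers.MaxPooling2D(pool_size=(2, 2)),         # pooling layer: spatial down-sampling
    layers.Flatten(),
    layers.Dense(7, activation="softmax"),         # fully connected layer: emotion classification
])
model.summary()
```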

The various emotions used in this paper are shown in Fig. 36.1.

Fig. 36.1 Facial emotions used

36.2 Related Work

Gupta et al. [1] proposed identifying facial expressions using a CNN, ResNet, and an attention block that provides visual perceptibility. They used a deep self-attention network for facial emotion recognition and found that the proposed model outperformed current CNN-based networks, achieving a higher precision of 85.76%. Mellouka and Handouzia [2] surveyed multiple research publications that employ deep learning techniques and observed that researchers are still largely restricted to the six basic emotions plus neutral, and that larger databases need to be created in the future. Mehendale [3] proposed a hybrid of CNN and supervised learning for detecting facial expressions; the model gives better results across different face orientations, and its accuracy is attributed to background removal. Zhang et al. [4] performed facial emotion recognition on facial expression images using a biorthogonal wavelet and a fuzzy multiclass support vector machine, reporting an overall accuracy of 96.7%. Jabid et al. [5] presented an approach based on the local directional pattern (LDP) that takes facial geometry into consideration; they also evaluated the effectiveness of dimensionality reduction techniques such as PCA and AdaBoost in terms of cost and accuracy, showing the superiority of the LDP descriptor over other descriptors. Mollahosseini et al. [6] proposed a deep neural network architecture for facial expression detection consisting of two convolution layers, each followed by pooling, and then four Inception layers. They compared it with several state-of-the-art methods whose engineered features and classifier parameters are usually tuned on very few databases; the proposed method outperformed conventional CNN methods in accuracy in both subject-independent and cross-database evaluation scenarios. Operto et al. [7] performed a study of emotion recognition abilities in children and adults with different learning disorders, without cognitive disabilities, and related these abilities to intelligence; they concluded that deficits in facial emotion recognition are potentially related to difficulties in cognitive control. Shan et al. [8] presented an approach for facial emotion representation based on Local Binary Patterns.

36.3 Methodology

The proposed methodology is implemented as follows:

  1. Dataset: The dataset deployed for implementation is FER2013 from the Kaggle facial expression recognition challenge. It contains 35,887 grayscale images, of which 32,298 are used for training and 3,589 for testing. Each image in FER2013 falls into one of seven categories: neutral, happy, fear, surprise, disgust, angry, and sad.

    Emotion labels in the dataset: 0 - angry (4,953 images), 1 - disgust (547 images), 2 - fear (5,121 images), 3 - happy (8,989 images), 4 - sad (6,077 images), 5 - surprise (4,002 images), 6 - neutral (6,198 images). FER2013 dataset samples are shown in Fig. 36.2; a hedged loading and preprocessing sketch is given after this list.

    Fig. 36.2 FER2013 dataset samples

  2. Preprocessing: Each image is resized to a 48 × 48 grayscale image.

  3. Grayscaling: Grayscaling transforms an RGB input image into a grayscale image whose pixel values range from 0 to 255 according to the intensity of light at each pixel. Since facial expression patterns do not depend on color, and processing color images requires more time and resources, grayscale images are used for processing.

  4. Normalization: Because neural networks are sensitive to the scale of their inputs, each image is normalized to remove illumination variations and obtain an improved face image (see the loading and preprocessing sketch after this list).

  5. VGG19 Architecture:

    We have conducted extensive experiments to demonstrate the proposed method's effectiveness compared with well-known classification models, including the VGG19 architecture. VGG19 was developed by Simonyan and Zisserman at the University of Oxford and comprises 19 weight layers: 16 convolutional and 3 fully connected, with roughly 144 million trainable parameters. The pretrained network was trained on more than a million images and can classify images into 1000 object categories. The detailed methodology is shown in Fig. 36.3.

    Fig. 36.3 Block diagram of the proposed method
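A hedged sketch of the dataset loading and preprocessing steps (items 1-4 above) is shown below. The file name fer2013.csv, the use of the CSV's Usage column to reproduce the 32,298/3,589 split, and the pandas/OpenCV calls are illustrative assumptions, not details given in the paper.

```python
# Hedged sketch: load FER2013 from the Kaggle CSV and apply the preprocessing steps above.
# Assumes the file "fer2013.csv" with columns: emotion, pixels, Usage.
import cv2
import numpy as np
import pandas as pd

EMOTIONS = {0: "angry", 1: "disgust", 2: "fear", 3: "happy",
            4: "sad", 5: "surprise", 6: "neutral"}  # label mapping used in the dataset

def to_arrays(frame):
    # Each row stores one 48 x 48 grayscale face as space-separated pixel values (0-255).
    images = np.stack([np.array(list(map(int, s.split())), dtype=np.uint8).reshape(48, 48)
                       for s in frame["pixels"]])
    return images, frame["emotion"].to_numpy()

df = pd.read_csv("fer2013.csv")
# One way to obtain the 32,298 / 3,589 split mentioned above is to keep the PrivateTest
# portion for testing and everything else for training (an assumption, not stated in the paper).
x_train, y_train = to_arrays(df[df["Usage"] != "PrivateTest"])
x_test, y_test = to_arrays(df[df["Usage"] == "PrivateTest"])

def preprocess(image):
    # Grayscaling: convert a color (BGR) face image to a single channel if necessary.
    if image.ndim == 3:
        image = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    # Preprocessing: resize the face to 48 x 48 pixels.
    face = cv2.resize(image, (48, 48), interpolation=cv2.INTER_AREA)
    # Normalization: scale intensities to [0, 1] to reduce illumination variation.
    return face.astype(np.float32) / 255.0

x_train = np.stack([preprocess(img) for img in x_train])[..., np.newaxis]
x_test = np.stack([preprocess(img) for img in x_test])[..., np.newaxis]
print(x_train.shape, x_test.shape)  # expected: (32298, 48, 48, 1) (3589, 48, 48, 1)
print({EMOTIONS[int(k)]: int(v) for k, v in zip(*np.unique(y_train, return_counts=True))})
```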

Block Diagram for Proposed Model:

The VGG19 network consists of sixteen two-dimensional convolutional layers, five max-pooling layers, and three fully connected layers. Max pooling keeps the maximum value from each cluster of neurons in the prior layer, here using a 5 × 5 max-pooling filter, which reduces the dimensionality of the output array. The input to the network is a preprocessed face of 48 × 48 pixels. A hedged Keras sketch of this architecture is given below.
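As a minimal, hedged sketch rather than the authors' exact implementation, such a network can be assembled in Keras as follows. Note that the stock keras.applications.VGG19 backbone uses 2 × 2 max pooling rather than the 5 × 5 filter mentioned above, and the classifier-head size, dropout, and optimizer are illustrative assumptions.

```python
# Hedged sketch: a VGG19-style classifier for 48 x 48 grayscale faces and 7 emotions (Keras).
# Head size, dropout rate, and optimizer are illustrative, not the paper's exact settings.
from tensorflow.keras import layers, models
from tensorflow.keras.applications import VGG19

base = VGG19(weights=None,             # trained from scratch; ImageNet weights expect 3-channel input
             include_top=False,
             input_shape=(48, 48, 1))  # preprocessed 48 x 48 grayscale face

model = models.Sequential([
    base,                                   # 16 convolutional layers with max pooling
    layers.Flatten(),
    layers.Dense(256, activation="relu"),   # fully connected layer (size assumed)
    layers.Dropout(0.5),
    layers.Dense(7, activation="softmax"),  # output layer over the seven emotion classes
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.summary()
# model.fit(x_train, y_train, validation_data=(x_test, y_test), epochs=50, batch_size=64)
```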

36.4 Experimental Results and Discussion

After performing the experiments with the proposed methodology, the results are discussed in this section. The experiments were performed on an Intel Core i5-8400 CPU @ 2.80 GHz with Python 3. Keeping the kernel size at 5 × 5, the results obtained are shown in Tables 36.1, 36.2 and 36.3. Based on these tables, it is clear that sample size is a very important factor in deep learning: since the disgust and surprise classes have few samples, their accuracy is poor, the happy emotion performs better than the others, and the remaining emotions fall in between. Tables 36.1, 36.2, 36.3 and 36.4 show the confusion matrices and results; a sketch of how such a confusion matrix can be computed follows the tables.

Table 36.1 Confusion matrix with the ELU activation function
Table 36.2 Confusion matrix with the ReLU activation function
Table 36.3 Results of the ELU and ReLU activation functions
Table 36.4 Comparison with other methods
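A minimal sketch of how a confusion matrix such as those in Tables 36.1 and 36.2 can be computed with scikit-learn is shown below; y_true and y_pred are placeholders standing in for the test labels and the trained model's predictions, not the paper's data.

```python
# Hedged sketch: build a 7 x 7 confusion matrix over the emotion classes (scikit-learn).
# y_true / y_pred are placeholder arrays, not the results reported in the tables.
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

labels = ["angry", "disgust", "fear", "happy", "sad", "surprise", "neutral"]

y_true = np.array([0, 3, 3, 6, 2, 4, 5, 1])  # placeholder ground-truth labels
y_pred = np.array([0, 3, 4, 6, 2, 4, 5, 2])  # placeholder predictions from the trained network

cm = confusion_matrix(y_true, y_pred, labels=list(range(7)))
print("accuracy:", accuracy_score(y_true, y_pred))
for name, row in zip(labels, cm):
    print(f"{name:>9}: {row}")  # rows = true class, columns = predicted class
```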

The comparative results on the FER2013 dataset are shown in Table 36.4. Mollahosseini et al. [9] used a convolutional neural network with two convolution layers, each followed by max pooling, and four Inception layers, achieving 66.4% accuracy. Tümen et al. [10] achieved 57.1% accuracy, whereas the proposed VGG19-based methodology achieves 69.1% accuracy. The VGG19 architecture is also used in face recognition and pattern recognition applications [11, 12].

36.5 Conclusions

This paper presents recent developments in the facial emotion recognition domain. It describes the VGG19 architecture with two different non-linear activation functions, ELU and ReLU. Based on the experimental study, it has been observed that the size of the dataset plays an important role in facial expression recognition: emotions with more samples, such as happy, neutral, and angry, achieved better accuracy than the other emotions. Further research based on multimodal deep learning architectures could improve accuracy in this domain. One challenging issue is recognizing emotions from low-resolution images. It is also observed that facial expressions are recognized more reliably from dynamic image sequences than from static images.