1 Introduction

Facial expression, a form of nonverbal communication, plays an important role in interpersonal relations. Facial expressions are changes in facial appearance in reaction to a person's internal emotional states, social communications, or intentions. Automatic Facial Expression Recognition (FER) is currently an active research area. It is useful in fields such as medicine and human emotion analysis, and it is an important step toward improving Human Machine Interaction (HMI) systems. Many factors make emotion detection a challenging problem. One of them is the lack of a standardized database for FER. Building a benchmark database that fulfills the various requirements of the problem domain, so that it can become a standard for future research, is a difficult exercise [1]. In most existing databases, the expressions are posed rather than spontaneous. The biggest challenge is to capture spontaneous expressions in images and video: different subjects express the same emotion at different intensities, and laboratory conditions can hinder subjects from displaying spontaneous expressions. Another major challenge is labeling the data, which is time-consuming and error-prone. Capturing and recognizing spontaneous nonbasic expressions is more challenging than doing so for basic expressions. Most FER systems also lack freedom of rotational movement [1]. Here, our aim is to present a CNN-based approach to FER and a systematic comparison of five different 12-layer CNNs. The input to each CNN is an image, and the output is a predicted facial expression label, which is one of: anger, happiness, fear, sadness, disgust, surprise, and neutral. Among the available databases, the JAFFE database is chosen for implementation because it is the most commonly used in other automatic FER systems.

2 Background

An automatic FER system built with traditional machine learning techniques generally consists of three parts: face detection, facial feature extraction, and facial expression classification. In the face detection step, the system applies image processing techniques to the input image in order to locate the face region. In the feature extraction step, two types of features are generally extracted from the located face to represent the facial expression: geometric features and appearance features. Geometric features describe the shape of the face and its components, such as the lips, nose, or mouth corners, whereas appearance features depict the changes in the texture of the face when an expression is performed. Classification is the last part of the FER system and is based on machine learning theory. The output of the previous stage, a set of features retrieved from the face region, is given as input to a classifier such as Support Vector Machines, K-Nearest Neighbors, Hidden Markov Models, Artificial Neural Networks, Bayesian Networks, or Boosting Techniques [2]. Some expression recognition systems classify the face into a set of standard emotions. Other systems aim to identify the movements of the individual muscles that the face can produce.

In [1], the author provides an extensive survey of research on FER and analysis published between 2001 and 2008. Over the past few years, there have been several advances in FER using traditional machine learning methods, involving different techniques for face tracking and detection, feature extraction, classifier training, and classification. In [3], to handle low-resolution images, a framework for expression recognition is proposed that is based on appearance features of a few selected prominent facial patches which are active when emotions are expressed. The selected patches are processed further to obtain salient patches, which yield discriminative features for classification. One-against-one classification is performed using these features, and the expression is recognized by majority vote. The proposed method is evaluated on the CK and JAFFE facial expression databases. In [4], feature extraction is done using PCA along with LBP, and an SVM classifier is used to obtain the results. The databases used are JAFFE and MUFE, and the results show that PCA and LBP together give high performance. In [5], frames containing a face are extracted from a live video stream using a Gabor feature extraction method and a neural network, and modified k-means with PCA is used to classify the emotion. The JAFFE database is used to simulate the framework. In [6], Active Appearance Models (AAM) are used to identify the face and extract its graphic features. An HMM is used for expression prediction, and K-NN is used to identify the person in the image. In [7], improved Directional Ternary Patterns (DTP) feature extraction and an SVM classifier are used for real-time emotion detection from facial expressions on the JAFFE database.
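The one-against-one scheme with majority voting mentioned for [3] can be illustrated with a minimal Python sketch. This is not the implementation from [3]: the function `toy_decide` is a hypothetical stand-in for trained binary classifiers, and the per-class scores in `sample` are made-up values used only to exercise the voting logic.

```python
from collections import Counter
from itertools import combinations

EMOTIONS = ["anger", "disgust", "fear", "happiness", "sadness", "surprise"]


def one_vs_one_predict(pairwise_decide, sample, classes=EMOTIONS):
    """Run one binary decision per class pair and return the majority winner."""
    votes = Counter()
    for a, b in combinations(classes, 2):
        # Each pairwise classifier votes for exactly one of its two classes.
        votes[pairwise_decide(a, b, sample)] += 1
    # Ties are broken by the order of `classes` (max returns the first maximum).
    winner = max(classes, key=lambda c: votes[c])
    return winner, votes


def toy_decide(a, b, sample):
    """Hypothetical binary classifier: prefers the class with the higher score."""
    return a if sample[a] >= sample[b] else b


# Made-up per-class scores standing in for real patch features.
sample = {"anger": 0.1, "disgust": 0.2, "fear": 0.05,
          "happiness": 0.9, "sadness": 0.3, "surprise": 0.4}

label, votes = one_vs_one_predict(toy_decide, sample)
print(label, dict(votes))
```

With six classes there are 15 pairwise decisions, and a class can collect at most five votes.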

In FER using traditional machine learning techniques, the programmer has to be very specific about the features of interest, which involves a laborious process of feature extraction. Feature extraction also requires domain knowledge. Thus, the success rate of the system depends on the programmer's ability to accurately define a feature set. In addition, whenever the problem domain changes, the whole system needs to change, which requires redesigning the algorithm from the start [8]. In [9], the authors proposed a novel FER system based on features obtained from principal component analysis (PCA) and fine-tuned by applying particle swarm optimization. The best classification result achieved was 97% on the CK database.

Many researchers have also used neural networks because of their ability to extract undefined features from the training database. It is often observed that neural networks trained on large amounts of data extract features that generalize well to scenarios the network has not been trained on. In [10], a constructive training algorithm for MLP neural networks is proposed as the classification step of the FER system. Experiments carried out on three well-known databases show that the constructive training algorithm obtains a better recognition rate than a fixed MLP architecture. In [11], the authors proposed a model based on a neural network and K-NN for facial expression classification. ICA is used to extract facial features on the JAFFE database. Recent approaches make increasing use of deep neural networks (neural networks with many hidden layers) for the automatic FER problem. With growing computing power, deep neural network architectures provide a learning architecture, similar to the development of brain-like structures, that can learn multiple levels of representation and abstraction in order to find complex patterns in images, sound, and text. CNNs, which are deep neural networks, recognize patterns with extreme variability and are robust to distortions and simple geometric transformations. A wide range of CNN applications, such as face detection, face recognition, and gender recognition, has shown that minimal domain knowledge of the problem at hand is sufficient to perform efficient pattern recognition tasks [8, 12,13,14,15,16]. CNNs have become the standard approach for researchers working on vision and deep learning. Starting with LeNet-5 [17], variations of this basic design are prevalent in the image classification literature and achieve the best results. The recent trend is to increase the number of layers and the layer size for larger datasets such as ImageNet, and to use dropout to deal with the problem of overfitting [18, 19].

In [20], the authors proposed a network consisting of two convolutional layers, each followed by max pooling, and then four inception layers, and conducted experiments on seven publicly available facial expression databases. In [21], two different deep network models are proposed: one extracts temporal appearance features from image sequences, and the other extracts temporal geometry features from temporal facial landmark points. The two models are combined using a new integration method in order to boost FER performance. In [23], for the static image track of the Emotions in the Wild [22] contest, multiple deep convolutional neural networks are trained as committee members and their decisions are combined, achieving up to 62% test accuracy.

3 JAFFE Database and Proposed CNN Architecture

The proposed CNN architectures are tested on the JAFFE database, a set of facial expression images of posed emotions (six different emotions and neutral face displays) from 10 Japanese female subjects. The six expressed emotions are the basic emotions given by Ekman and Friesen [24], which appear to be universal across human ethnicities and cultures: happiness, sadness, fear, disgust, surprise, and anger. Figure 2 shows sample images from the JAFFE database [25]. The grayscale images are 256 × 256 pixels in size and are labeled into 6 + 1 = 7 emotion classes. The images feature some head pose variation [26].

Five different 12-layer CNN architectures are proposed for classifying facial expressions into seven different emotion classes. In all five CNNs, the input layer is followed by a convolutional layer whose filter size and number of filters differ between CNNs. This layer is followed by a ReLU layer and a max pooling layer with a 2 × 2 pooling window, which outputs the maximum of the four values in the window. The max pooling layer is followed by a second convolutional layer, again with a filter size and number of filters that differ between CNNs, and then by another ReLU layer and max pooling layer. The next layer is a fully connected layer, whose number of output neurons varies between CNNs, followed by a ReLU layer. The last fully connected layer has seven output neurons, and its output is given to the final softmax and classification layers. Table 1 gives the details of the five proposed CNNs. For CNN_1, CNN_2, CNN_3, and CNN_4, the number of training images is 164 and the number of testing images is 26. In CNN_1 and CNN_2, the training images are repeated to increase the total size of the training set. For CNN_5, the number of training images is 178 and the number of testing images is 35. In Table 1, the numbers in the convolutional layer columns give the number of feature maps generated and the filter size, and the numbers in the fully connected layer columns give the number of output neurons. Thus, in CNN_1, the number of training images is 164 and the first convolutional layer generates 3 feature maps using a 3 × 3 filter. The second convolutional layer generates 9 feature maps using a 3 × 3 filter. Fully connected layer 1 has 512 output neurons, and the last fully connected layer has 7 output neurons, one per emotion class.
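The element-wise operations named above (ReLU, 2 × 2 max pooling, and the final softmax over seven outputs) can be sketched in plain Python as follows. This is only an illustration, not the MATLAB implementation used in this work: the pooling stride of 2, the order of the entries in `LABELS`, and all numeric values are assumptions chosen for the example.

```python
import math

# Assumed label order; the text does not specify one.
LABELS = ["anger", "disgust", "fear", "happiness", "sadness", "surprise", "neutral"]


def relu(feature_map):
    """Replace negative activations with zero."""
    return [[max(0.0, v) for v in row] for row in feature_map]


def max_pool_2x2(feature_map):
    """Non-overlapping 2 x 2 max pooling (stride 2 is an assumption)."""
    rows, cols = len(feature_map), len(feature_map[0])
    return [
        [max(feature_map[r][c], feature_map[r][c + 1],
             feature_map[r + 1][c], feature_map[r + 1][c + 1])
         for c in range(0, cols - 1, 2)]
        for r in range(0, rows - 1, 2)
    ]


def softmax(scores):
    """Convert raw scores to probabilities (shifted for numerical stability)."""
    shift = max(scores)
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Made-up 4 x 4 feature map standing in for a convolution output.
fmap = [[1.0, -2.0, 3.0, 0.5],
        [-1.0, 4.0, -3.0, 2.0],
        [0.0, -0.5, -1.0, -2.0],
        [2.5, 1.0, -4.0, -0.1]]
pooled = max_pool_2x2(relu(fmap))

# Made-up outputs of the last fully connected layer (seven neurons).
probs = softmax([0.2, -1.0, 0.1, 2.5, 0.0, 1.0, 0.3])
predicted = LABELS[probs.index(max(probs))]
print(pooled, predicted)
```

Each pooled value is the maximum of four neighboring activations, so the 4 × 4 map shrinks to 2 × 2, and the classification layer simply picks the label with the highest softmax probability.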

Table 1 Details for proposed CNNs for emotion detection

The proposed 12-layer architecture is shown in Fig. 1. As listed in Table 1, CNN_1, CNN_2, CNN_3, and CNN_4 have a similar architecture and differ only in filter size, number of feature maps, and number of output neurons in the fully connected layer.
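Since the five networks share one layer sequence and differ only in a few parameters, the architecture can be described by a small configuration record. The Python sketch below lists the 12 layers and traces the output shape of each one for the CNN_1 values given in the text. The names `CnnConfig`, `conv_out`, `pool_out`, and `describe` are hypothetical, and the shape arithmetic assumes stride-1 convolution without padding and stride-2 pooling, which the text does not state.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CnnConfig:
    conv1_filters: int
    conv1_size: int
    conv2_filters: int
    conv2_size: int
    fc1_neurons: int
    num_classes: int = 7
    input_size: int = 256


def conv_out(size, kernel):
    # Assumption: stride 1 and no padding.
    return size - kernel + 1


def pool_out(size):
    # Assumption: 2 x 2 window with stride 2.
    return size // 2


def describe(cfg):
    """Return (layer name, output shape) pairs for the 12 layers."""
    s = cfg.input_size
    layers = [("input", (s, s, 1))]
    s = conv_out(s, cfg.conv1_size)
    layers.append(("conv1", (s, s, cfg.conv1_filters)))
    layers.append(("relu1", (s, s, cfg.conv1_filters)))
    s = pool_out(s)
    layers.append(("maxpool1", (s, s, cfg.conv1_filters)))
    s = conv_out(s, cfg.conv2_size)
    layers.append(("conv2", (s, s, cfg.conv2_filters)))
    layers.append(("relu2", (s, s, cfg.conv2_filters)))
    s = pool_out(s)
    layers.append(("maxpool2", (s, s, cfg.conv2_filters)))
    layers.append(("fc1", (cfg.fc1_neurons,)))
    layers.append(("relu3", (cfg.fc1_neurons,)))
    layers.append(("fc2", (cfg.num_classes,)))
    layers.append(("softmax", (cfg.num_classes,)))
    layers.append(("classification", (1,)))
    return layers


# CNN_1 values as stated in the text: 3 and 9 feature maps, 3 x 3 filters, 512 neurons.
CNN_1 = CnnConfig(conv1_filters=3, conv1_size=3,
                  conv2_filters=9, conv2_size=3, fc1_neurons=512)
layers = describe(CNN_1)
for name, shape in layers:
    print(f"{name:15s} {shape}")
```

Under these assumptions, the second pooling layer of CNN_1 outputs 62 × 62 × 9 values, which are flattened before the first fully connected layer. The other CNNs would be described by changing only the fields of `CnnConfig`.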

Fig. 1 Proposed CNN_5 architecture

Fig. 2 JAFFE sample images [25]

4 Results and Discussions

The proposed design is tested on a 2.40 GHz Intel i7-5500U quad core processor with 8 GB RAM, running 64-bit Windows 10. Because CNNs are computationally intensive and benefit from GPU processing for faster computation, the system used a CUDA-enabled NVIDIA GPU (GeForce 920M graphics card) with compute capability higher than 3.0 and DirectX Runtime Version 12. MATLAB is used for implementation, and the system is tested on the JAFFE database. From Tables 1 and 2, it is observed that more image detail is captured using 12 × 12 or 15 × 15 filters. In CNN_1 and CNN_2, the size of the training set is increased by repeating the set, but the effect is similar to increasing the number of epochs during training. If the first layer captures good detail from the 256 × 256 input image and the number of epochs is larger, the final accuracy improves. The maximum accuracy is achieved by CNN_5, which is trained on more images than the other CNNs.
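The observation that repeating the training set behaves like training for more epochs can be checked with simple arithmetic. The following Python sketch counts how many times images are presented to the network in both settings. The helper `sample_presentations` and the repeat and epoch counts are illustrative assumptions; only the figure of 164 training images comes from the text.

```python
def sample_presentations(num_images, repeats, epochs):
    """Total number of image presentations seen during training."""
    return num_images * repeats * epochs


NUM_TRAIN = 164  # training images for CNN_1 to CNN_4, from the text

# Hypothetical settings: training set repeated twice for 10 epochs,
# versus the original training set for 20 epochs.
repeated = sample_presentations(NUM_TRAIN, repeats=2, epochs=10)
more_epochs = sample_presentations(NUM_TRAIN, repeats=1, epochs=20)
print(repeated, more_epochs)  # 3280 3280
```

Both settings show each image to the network the same number of times. The only difference is how images are grouped into mini-batches, since a repeated set can place duplicates of an image within the same epoch.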

Table 2 Comparison for CNNs

5 Conclusion and Future Scope

Minimal preprocessing is involved, as CNNs are designed to recognize visual patterns directly from the pixels of images. This is in complete contrast to conventional pattern recognition tasks, in which prior knowledge of the problem at hand is needed in order to apply a suitable algorithm to extract the right features. In this paper, five new CNN architectures have been proposed for automatic facial expression recognition. Among them, CNN_5 achieved the highest classification accuracy, 96.15%, on test samples of the JAFFE database using a 2.40 GHz i7-5500U quad core processor and 8 GB RAM with a GeForce 920M. The proposed architecture can be extended to the detection of nonbasic expressions in the future.