1 Introduction

Emotions play a major role in communication. Recognition of facial emotions is useful in many tasks such as customer satisfaction identification, criminal justice systems, e-learning, security monitoring, social robots, and smart card applications [1, 2]. The main blocks of a traditional emotion recognition system are face detection, feature extraction, and emotion classification [3]. Based on the literature, the most widely used feature extraction methods are Bezier curves [4], clustering methods [5], independent component analysis [6], two-directional two-dimensional Fisher principal component analysis ((2D)2FPCA) [7, 8], two-directional two-dimensional modified Fisher principal component analysis ((2D)2MFPCA) [9], principal component analysis [10], local binary patterns [11], and feature-level fusion techniques [12]. The extracted features are then given to classifiers such as support vector machines [13], hidden Markov models [14], k-nearest neighbors [15], naïve Bayes, and decision trees [16]. The drawback of conventional systems is that the feature extraction and classification phases are independent [17], which makes it challenging to increase the overall performance of the system.

Deep learning networks use an end-to-end learning process to overcome the problems of conventional approaches [18,19,20]. Data size is very important in deep learning: the larger the dataset, the better the performance. To enlarge the data and improve performance, researchers use data augmentation [21] techniques such as translation, normalization, cropping, noise addition, and scaling [22]. Convolutional neural networks are among the best-proven algorithms for segmentation and classification tasks, and automatic feature extraction is one of their main advantages. Transfer learning is a deep learning method in which a model trained for a particular task is reused for another task by transferring its knowledge [23]. The main advantages of transfer learning are reduced training time and improved accuracy [24].

Some recent works on expression recognition using convolutional neural networks are discussed below. Yingruo Fan et al. [25] proposed a multi-region ensemble convolutional neural network for facial expression identification. Features extracted from three regions (eyes, nose, and mouth) are given to three sub-networks, and the weights of the three sub-networks are then ensembled to predict the emotions. The databases used in this work are AFEW 7.0 and RAF-DB. Yingying Wang et al. [26] proposed emotion recognition based on an auxiliary model. In this work, the information from three major sub-regions (eyes, nose, and mouth) is combined with the complete face image through a weighting process to capture the maximum information. The model is evaluated on four databases: CK+, FER2013, SFEW, and JAFFE. Frans Norden et al. [27] presented facial expression recognition using VGG16 and Resnet50, evaluated on the JAFFE and FER2013 databases. Their experiments show that the best classification accuracy is attained by Resnet50 when compared with other state-of-the-art methods.

Jyostna Devi Bodapati et al. [28] proposed emotion recognition using deep convolutional neural network-based features. In this work, VGG16 is used to extract the features and a multi-class support vector machine (SVM) is used for classification. The algorithm achieved an accuracy of 86.04% with a face detection algorithm and 81.36% without it on the CK+ database. Nithya Roopa et al. [29] proposed emotion recognition using the Inception V3 model; the work is evaluated on the KDEF database and achieved a test accuracy of 39%. To handle occlusions and pose variations, Sreelakshmi et al. [30] presented an emotion recognition system using the MobileNet V2 architecture; the model is tested on real-time occluded images and achieves an accuracy of 92.5%. Aravind Ravi [31] proposed facial emotion recognition based on pre-trained CNN features. In this work, a pre-trained VGG19 network is used to extract the features and a support vector machine is used to predict the expressions. The experiments were conducted on the JAFFE and CK+ databases and achieved accuracies of 92.86% and 92.26%, respectively.

Shamoil Shaees et al. [32] proposed a transfer learning approach with a support vector machine classifier: features are extracted using the AlexNet CNN and fed to an SVM for classification. The work used the CK+ and NVIE databases and achieved good accuracy. The authors of [33] presented facial emotion recognition with convolutional neural networks. The experiments were conducted with different models (VGG19, VGG16, and ResNet50) on the FER2013 dataset; among the three, VGG16 achieved the highest accuracy of 63.07%. Mehmet Akif Ozdemir et al. [34] presented a LeNet architecture-based emotion recognition system using a merged dataset (JAFFE, KDEF, and their own custom data). The Haar cascade library is used to remove unwanted pixels that do not contribute to expression recognition, and the reported accuracy is 96.43%. Poonam Dhankhar et al. [35] presented Resnet50 and VGG16 architectures for facial emotion recognition and suggested an ensemble model combining the two. The ensemble achieved the highest accuracy compared with a baseline SVM and the individual models: the SVM achieves 37.9%, Resnet50 and VGG16 achieve 73.8% and 71.4%, respectively, and the ensemble achieves 75.8%. Other authors explored the transfer learning approach for facial expression recognition using pre-trained Alexnet, VGG, and Resnet architectures, attaining an average accuracy of 90% on the combined JAFFE and CK+ dataset.

In this paper, a transfer learning approach is used for facial emotion recognition. The rest of the paper is organized as follows: Sect. 2 discusses theories of emotions and emotion models, Sect. 3 explains the materials and methods, Sect. 4 describes the training procedure of the proposed models, Sect. 5 discusses the implementation parameters, Sect. 6 presents the experimental results, Sect. 7 gives the comparisons, and Sect. 8 concludes the paper.

2 Related background

Affective computing is one of the most active research areas in the current scenario. It is the field concerned with developing systems that recognize and simulate human affect [36]. The purpose of affective computing is to increase the intelligence of computers for human–computer interaction. Some of its applications are distance education, Internet banking, virtual sales assistants, neurology, and the medical and security fields [37]. In affective computing, the main step is to recognize human emotions from speech signals, body postures, or facial expressions [38].

2.1 Theories of emotions

Theories of emotion are grouped into three categories: physiological (James–Lange and Cannon–Bard theories), cognitive (Lazarus theory), and neurological (facial feedback theory), as shown in Fig. 1.

Fig. 1 Theories of Emotions

The James–Lange model proposes that emotion arises from the interpretation of a physiological response. Walter Cannon later disagreed with the James–Lange theory and proposed, in the Cannon–Bard theory, that emotions and physiological reactions occur simultaneously [39]. The Lazarus theory, also called cognitive appraisal theory, holds that the physiological response occurs first and the person then appraises the reason for that response in order to experience the emotion [40]. Finally, the facial feedback theory explains emotional experience through facial expressions.

2.2 Emotion models

Emotion models are mainly classified into two types: categorical models and dimensional models. The categorical model comprises the basic emotions of anger, fear, sadness, happiness, surprise, and disgust proposed by Ekman and Friesen [41]. Dimensional models describe emotions in two dimensions (arousal and valence) or three dimensions (power, arousal, and valence). The emotion models are shown in Fig. 2.

Fig. 2 Emotion Models

Valence determines an emotion's positivity or negativity, and arousal measures the intensity of excitement of the expression. The circumplex, vector, and PANA (Positive Activation–Negative Activation) models are two-dimensional, whereas Plutchik's model and PAD (Pleasure, Arousal, and Dominance) are three-dimensional. A detailed explanation of all these models is given in [42].

3 Materials and methods

Nowadays, extracting human emotions plays a major role in affective computing. The process of emotion detection using pre-trained ConvNets is shown in Fig. 3.

Fig. 3 Emotion detection process

In this work, 918 images are taken from the CK+ dataset. Sample pictures are displayed in Fig. 4.

Fig. 4 Sample pictures from the CK+ dataset for seven expressions

All the images are in .png format. Of the 918 images, 770 are used for training and 148 for testing. The dataset contains seven emotions: anger, surprise, contempt, sadness, happiness, disgust, and fear. The official web link of the CK+ database is http://www.jeffcohn.net/Resources/.

The initial step in the process is image resizing: the inputs must be resized according to the input sizes of the pre-trained models. The CK+ images are mostly grayscale with a resolution of 640*490. The input size of Resnet50, VGG19, and MobileNet is 224*224, and that of Inception V3 is 299*299, so all the images are resized accordingly. After that, all the layers of the pre-trained ConvNets are frozen except the fully connected layers; only the fully connected layers are trainable, and their weights are updated during training. The emotions are classified based on the number of classes in the final fully connected layer. In this work, we use the Resnet50, VGG19, Inception V3, and MobileNet networks trained on ImageNet, adapting them to our classification task through transfer learning.
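As an illustration of this resizing step, the following is a minimal Python sketch; the folder layout and file name are hypothetical placeholders, not the actual dataset paths:

```python
# Minimal preprocessing sketch; file paths are hypothetical placeholders.
from PIL import Image
import numpy as np

INPUT_SIZES = {
    "vgg19": (224, 224),
    "resnet50": (224, 224),
    "mobilenet": (224, 224),
    "inception_v3": (299, 299),
}

def load_and_resize(path, model_name):
    """Load a (mostly grayscale, 640*490) CK+ frame, convert it to RGB,
    and resize it to the selected pre-trained model's input size."""
    img = Image.open(path).convert("RGB")  # pre-trained nets expect 3 channels
    img = img.resize(INPUT_SIZES[model_name], Image.BILINEAR)
    return np.asarray(img, dtype=np.float32) / 255.0  # scale to [0, 1]

x = load_and_resize("ck_plus/anger/S010_004.png", "vgg19")  # hypothetical path
print(x.shape)  # (224, 224, 3)
```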

4 Training procedure of proposed models

Transfer learning is a strategy in which a model developed for one task is reused for another task. The fundamental concept is to take a model trained on a big dataset and transfer its knowledge to a small dataset. Training a convolutional neural network from scratch requires a large amount of data and is computationally expensive; transfer learning, on the other hand, is computationally efficient and does not need a lot of data. In this work, the training procedure is the same for all the models: first, the weights are initialized from the ImageNet database before training on the emotion dataset; then, taking advantage of transfer learning, the last three layers of each pre-trained model (the fully connected layer, the softmax layer, and the classification output layer) are replaced, and new fully connected layers suitable to the classification task are added. The architectures of the various networks are described below.
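A minimal Keras sketch of this setup is shown below. The pooling layer, the 1024-unit head, and the dropout rate are illustrative assumptions, not the exact configuration reported in Tables 1, 2, 4, and 6:

```python
# Sketch of the transfer-learning setup: load an ImageNet-pretrained base
# without its original classifier, freeze it, and attach a new trainable
# head ending in a 7-way softmax for the CK+ emotion classes.
import tensorflow as tf

BASES = {
    "vgg19": (tf.keras.applications.VGG19, (224, 224, 3)),
    "resnet50": (tf.keras.applications.ResNet50, (224, 224, 3)),
    "mobilenet": (tf.keras.applications.MobileNet, (224, 224, 3)),
    "inception_v3": (tf.keras.applications.InceptionV3, (299, 299, 3)),
}

def build_transfer_model(base_name, num_classes=7):
    base_cls, input_shape = BASES[base_name]
    base = base_cls(weights="imagenet", include_top=False, input_shape=input_shape)
    base.trainable = False  # freeze all pre-trained layers

    return tf.keras.Sequential([
        base,
        tf.keras.layers.GlobalAveragePooling2D(),
        tf.keras.layers.Dense(1024, activation="relu"),  # new 1024-unit FC layer
        tf.keras.layers.BatchNormalization(),            # regularization (cf. Table 7)
        tf.keras.layers.Dropout(0.5),                    # dropout rate assumed
        tf.keras.layers.Dense(num_classes, activation="softmax"),
    ])
```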

4.1 VGG19

VGG19 has a total of 19 layers and is trained on the ImageNet database [43]. ImageNet contains more than 14 million images, and the network can classify images into 1000 different class labels. Figure 5 shows the architecture of VGG19.

Fig. 5 VGG19 model

The input size of this model is 224*224*3 (RGB image). The architecture of VGG19 consists of sixteen convolutional layers and three fully connected layers. The convolution kernels are of size 3*3 with a one-pixel stride, and the network contains five max-pooling layers with a kernel size of 2*2 and a two-pixel stride. Of the three fully connected layers, the first two have 4096 channels each and the last has 1000 channels. The final layer of the architecture is the softmax layer [44].

In this work, we used the pre-trained model to extract the features and changed the fully connected layers to suit our classification task of seven emotions. The VGG19 network ends in a 4096*1000 fully connected layer, which we replace with a 1024*7 fully connected layer. Table 1 shows the summary of the proposed CNN using VGG19 as the base model with our own fully connected layers added on top.

Table 1 Keras summary of the model using VGG19 as a feature extractor
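As a hypothetical usage of the build_transfer_model helper sketched in Sect. 4, the VGG19 variant can be instantiated and inspected as follows; its printed summary should correspond to Table 1:

```python
# Instantiate the VGG19-based variant and print its layer summary.
vgg_model = build_transfer_model("vgg19", num_classes=7)
vgg_model.summary()  # layer-by-layer summary, cf. Table 1
```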

4.2 Resnet50

Resnet50 is one class of deep neural network; Resnet stands for residual network. The Resnet50 architecture contains 50 layers, and its convolution and pooling layers are similar to those of standard convolutional neural networks. The main building block of the Resnet architecture is the residual block, whose purpose is to make shortcut connections between the actual inputs and the predictions. The functioning of the residual block is displayed in Fig. 6.

Fig. 6 Residual block

In the diagram, x is the input carried by the identity connection and F(x) is the residual learned by the stacked layers, so the block output is F(x) + x. When x already equals the desired output, F(x) is driven to zero and the identity connection simply copies the same x value forward [45].
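The following is a simplified sketch of such a three-convolution residual block in Keras functional style; the filter counts are illustrative assumptions, and the Add layer implements the F(x) + x identity connection:

```python
# Simplified three-convolution residual block; the skip connection adds
# the input x to the learned residual F(x).
from tensorflow.keras import layers

def residual_block(x, filters):
    shortcut = x                                         # identity connection
    f = layers.Conv2D(filters, 1, activation="relu")(x)  # 1*1 reduce
    f = layers.Conv2D(filters, 3, padding="same", activation="relu")(f)  # 3*3
    f = layers.Conv2D(x.shape[-1], 1)(f)                 # 1*1 restore channels
    out = layers.Add()([f, shortcut])                    # F(x) + x
    return layers.Activation("relu")(out)
```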

The Resnet50 architecture mainly contains five stages with convolution and identity blocks. The input size of Resnet50 is 224*224 with three channels. It starts with a convolution layer with a 7*7 kernel, followed by a max-pooling layer with a 3*3 kernel. Each convolution block and each identity block contains three convolution layers. After the five stages come an average pooling layer and a final fully connected layer with 1000 neurons. The architecture of Resnet50 is shown in Fig. 7. In our work, we take Resnet50 as the base model and add our own fully connected layers on top of it, replacing the last layer with a 1024*7 fully connected layer. Table 2 displays the summary of the proposed CNN using Resnet50 as the base model with our own fully connected layers added on top.

Fig. 7 Resnet50 Architecture

Table 2 Keras summary of the model using Resnet50 as a feature extractor

4.3 MobileNet

MobileNet is also called a lightweight convolutional neural network and is the most efficient architecture for mobile applications; its advantage is that it requires less computational power to run. Instead of standard convolutions, MobileNet uses depth-wise separable convolutions, which require fewer multiplications than standard convolutions and therefore reduce the computational load. Figure 8 shows the MobileNet architecture.

Fig. 8 MobileNet Architecture

Depth-wise separable convolution involves depth-wise convolutions and point-wise convolutions. In standard CNNs the convolution is applied to all M channels at the same time, whereas in depth-wise convolution it is applied to a single channel at a time. In point-wise convolution, a 1*1 convolution is applied to merge the outputs of the depth-wise convolutions [46, 47]. Figure 9 shows the depth-wise and point-wise convolutions.

Fig. 9 Depth-wise and point-wise convolutions
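A minimal Keras sketch of one depth-wise separable unit is given below; the batch normalization and ReLU placement follow the usual MobileNet pattern, and the filter count is an illustrative assumption:

```python
# Depth-wise separable convolution: a per-channel 3*3 DepthwiseConv2D
# followed by a 1*1 point-wise Conv2D that merges the channel outputs.
from tensorflow.keras import layers

def depthwise_separable(x, out_channels, stride=1):
    x = layers.DepthwiseConv2D(3, strides=stride, padding="same")(x)  # depth-wise
    x = layers.BatchNormalization()(x)
    x = layers.ReLU()(x)
    x = layers.Conv2D(out_channels, 1)(x)  # point-wise 1*1 merge
    x = layers.BatchNormalization()(x)
    return layers.ReLU()(x)
```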

For a kernel of size D_k × D_k, with M input channels, N output channels, and an output feature map of size D_F × D_F, the computational cost of standard convolution is

$$ D_k \cdot D_k \cdot M \cdot N \cdot D_F \cdot D_F $$
(1)

And the computational cost of depth-wise separable convolution is

$$ D_k \cdot D_k \cdot M \cdot D_F \cdot D_F + M \cdot N \cdot D_F \cdot D_F $$
(2)
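As a worked example of Eqs. (1) and (2), take the illustrative (assumed) values D_k = 3, M = 32, N = 64, and D_F = 112:

```python
# Comparing Eq. (1) and Eq. (2) for one layer; all values are assumed
# for illustration only.
Dk, M, N, Df = 3, 32, 64, 112

standard  = Dk * Dk * M * N * Df * Df                # Eq. (1): ~231.2M mults
separable = Dk * Dk * M * Df * Df + M * N * Df * Df  # Eq. (2): ~29.3M mults
print(standard / separable)  # ~7.9, i.e. roughly an 8-fold reduction
```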

The overall MobileNet construction consists of convolution layers with stride 2, depth-wise layers, and point-wise layers that double the channel size. The structure of the MobileNet is presented in Table 3.

Table 3 Structure of the MobileNet

The final layer of the MobileNet architecture is a 1024*1000 fully connected layer; for our emotion classification task we replace it with a 1024*7 fully connected layer, as displayed in Table 4.

Table 4 Keras summary of the model using MobileNet as a feature extractor

4.4 Inception V3

Inception V3 is another type of convolutional neural network model. Its input size is 299*299 and it is a 48-layer-deep network. Figure 10 shows the base Inception V3 module. The 1 × 1 convolutions are added before the bigger convolutions to reduce the dimensionality, and the same is done after the pooling layer. To increase the performance of the architecture, the 5 × 5 convolutions are factorized into two 3 × 3 layers; it is also possible to factorize N × N convolutions into 1 × N and N × 1 convolutions.

Fig. 10 Base Inception V3 module
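A simplified sketch of such a module is given below; the filter counts are illustrative assumptions. Note how each larger path begins with a 1 × 1 reduction and the 5 × 5 path is replaced by two stacked 3 × 3 convolutions:

```python
# Simplified Inception-style module with 1*1 reductions, a factorized
# 5*5 path (two stacked 3*3 convolutions), and a pooled 1*1 path.
from tensorflow.keras import layers

def inception_module(x, f1, f3, f5, fp):
    p1 = layers.Conv2D(f1, 1, activation="relu")(x)                   # 1*1 path
    p3 = layers.Conv2D(f3, 1, activation="relu")(x)                   # 1*1 reduce
    p3 = layers.Conv2D(f3, 3, padding="same", activation="relu")(p3)  # 3*3
    p5 = layers.Conv2D(f5, 1, activation="relu")(x)                   # 1*1 reduce
    p5 = layers.Conv2D(f5, 3, padding="same", activation="relu")(p5)  # two 3*3
    p5 = layers.Conv2D(f5, 3, padding="same", activation="relu")(p5)  # ~ one 5*5
    pp = layers.MaxPooling2D(3, strides=1, padding="same")(x)
    pp = layers.Conv2D(fp, 1, activation="relu")(pp)                  # 1*1 after pool
    return layers.Concatenate()([p1, p3, p5, pp])
```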

The detailed structure of Inception V3, with input sizes, layer types (convolutional, pooling, and softmax), and kernel sizes, is presented in Table 5.

Table 5 Implementation of Inception V3

The final layer of the Inception V3 architecture [48] is a 2048*1000 fully connected layer, as shown in Table 5; for our emotion classification task we replace it with a 1024*7 fully connected layer, as presented in Table 6.

Table 6 Keras summary of the model using Inception V3 as a feature extractor

5 Implementation

The experiments were run on Google Colaboratory with a GPU backend and 12 GB of RAM. The VGG19, Resnet50, MobileNet, and Inception V3 architectures were built using the TensorFlow and Keras APIs, and the CK+ dataset was used. The number of convolutional layers, max-pooling layers (with filter and stride sizes), and fully connected layers used in each model is explained clearly in Sect. 4.

5.1 Implementation parameters

Table 7 shows the implementation parameters for the four models used in this work. The input shape is the same for VGG19, Resnet50, and MobileNet (224*224*3), while Inception V3 differs (299*299*3). For all networks, the weights are initialized from ImageNet, the classifier is softmax, the optimizer is Adam, and the loss function is categorical cross-entropy. Batch normalization is used for regularization in all models, and the dropout rate, number of epochs, and batch size are the same across all four models.

Table 7 Implementation Parameters
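A compilation and training sketch consistent with these parameters is shown below. It reuses the hypothetical build_transfer_model helper from Sect. 4, and the epoch count, batch size, and data tensors (x_train, y_train, x_test, y_test) are assumed placeholders rather than the exact values in Table 7:

```python
# Compile with the Adam optimizer and categorical cross-entropy loss,
# then train only the new fully connected head (the base stays frozen).
model = build_transfer_model("mobilenet", num_classes=7)
model.compile(optimizer="adam",
              loss="categorical_crossentropy",
              metrics=["accuracy"])
history = model.fit(x_train, y_train,                  # 770 training images
                    validation_data=(x_test, y_test),  # 148 test images
                    epochs=50, batch_size=32)          # values assumed
```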

6 Experimental results and discussions

The test results of the various models used in this work are given below. The performance metrics are accuracy, sensitivity (recall), specificity, precision, and F1 score, defined in terms of true positives (TP), false positives (FP), false negatives (FN), and true negatives (TN). A sample confusion matrix showing how the TP, TN, FP, and FN values are obtained is given in Table 8.

Table 8 Sample Confusion Matrix showing TP, TN, FP, and FN values for class 1

6.1 Accuracy

Accuracy is defined as the proportion of correctly classified samples to the total number of samples.

$$ {\text{Accuracy}} = \frac{{{\text{TN}} + {\text{TP}}}}{{{\text{TN}} + {\text{TP}} + {\text{FN}} + {\text{FP}}}} $$
(3)

6.2 Sensitivity

Sensitivity is defined as the ratio of the number of true-positive cases to the total number of true-positive and false-negative cases.

$$ {\text{Sensitivity}}\left( {{\text{Recall}}} \right) = \frac{{{\text{TP}}}}{{{\text{TP}} + {\text{FN}}}} $$
(4)

6.3 Specificity

The ratio of the number of true-negative cases to the total number of true-negative and false-positive cases is known as specificity.

$$ {\text{Specificity}} = \frac{{{\text{TN}}}}{{{\text{TN}} + {\text{FP}}}} $$
(5)

6.4 Precision

The ratio of correctly predicted positive cases to the total predicted positive cases is known as precision.

$$ {\text{Precision}} = \frac{{{\text{TP}}}}{{{\text{TP}} + {\text{FP}}}} $$
(6)

6.5 F1 Score

The F1 score is the harmonic mean of precision and recall. A higher F1 score indicates that the model makes more accurate predictions.

$$ {\text{F}}1\;{\text{Score}} = 2 \cdot \frac{{\text{Precision}} \cdot {\text{Recall}}}{{\text{Precision}} + {\text{Recall}}} $$
(7)
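The following sketch computes Eqs. (3)–(7) per class from a multi-class confusion matrix, using the TP/TN/FP/FN layout of Table 8; rows are assumed to be actual classes and columns predicted classes:

```python
# Per-class metrics from a confusion matrix, following Eqs. (3)-(7).
import numpy as np

def per_class_metrics(cm, k):
    """cm: square confusion matrix (rows = actual, cols = predicted);
    k: index of the class treated as 'positive'."""
    tp = cm[k, k]
    fp = cm[:, k].sum() - tp   # predicted k but actually another class
    fn = cm[k, :].sum() - tp   # actually k but predicted otherwise
    tn = cm.sum() - tp - fp - fn
    accuracy    = (tp + tn) / cm.sum()        # Eq. (3)
    sensitivity = tp / (tp + fn)              # Eq. (4)
    specificity = tn / (tn + fp)              # Eq. (5)
    precision   = tp / (tp + fp)              # Eq. (6)
    f1 = 2 * precision * sensitivity / (precision + sensitivity)  # Eq. (7)
    return accuracy, sensitivity, specificity, precision, f1
```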

6.6 Results of VGG19 on test data

Figure 11 shows the data fitting results using the pre-trained VGG19 as a feature extractor.

Fig. 11 Fitting results by using the VGG19 model

Figures 12 and 13 show the accuracy and loss of the model; as the number of epochs changes, the loss and accuracy values change accordingly.

Fig. 12 Accuracy by using VGG19

Fig. 13 Loss by using VGG19

Table 9 displays the confusion matrix for the 148 test samples. According to Table 10, the model is most accurate at predicting contempt and least accurate at predicting happiness.

Table 9 Confusion matrix by using VGG19 model
Table 10 Performance measures by using the VGG19 model

The accuracy, specificity, and sensitivity metrics are calculated from the true-positive (TP), false-positive (FP), false-negative (FN), and true-negative (TN) values. Table 10 shows the performance measures of the proposed model using VGG19 as a feature extractor.

From these calculations, the F1 score is 0.83 and the accuracy of the model using the pre-trained VGG19 is 96%.

6.7 Results of Resnet50 on test data

Figure 14 shows the data fitting results using the pre-trained Resnet50 as a feature extractor.

Fig. 14 Fitting results by using Resnet50

Table 11 displays the confusion matrix of the Resnet50 model for the 148 test samples. According to Table 12, the model is most accurate at predicting sadness and least accurate at predicting happiness.

Table 11 Confusion matrix by using Resnet50 model
Table 12 Performance measures by using Resnet50 model

Table 12 displays the performance measures of the proposed model using Resnet50 as a feature extractor.

From these calculations, the F1 score is 0.91 and the accuracy of the model using the pre-trained Resnet50 is 97.7%. Figures 15 and 16 show the accuracy and loss of the model using Resnet50.

Fig. 15 Accuracy by using Resnet50

Fig. 16 Loss by using Resnet50

6.8 Results of MobileNet on test data

Figure 17 shows the data fitting results using the pre-trained MobileNet as a feature extractor.

Fig. 17 Fitting results by using MobileNet

Figures 18 and 19 display the accuracy and loss of the model using MobileNet as a feature extractor.

Fig. 18 Accuracy by using MobileNet

Fig. 19 Loss by using MobileNet

Table 13 displays the confusion matrix for the 148 test samples. According to Table 14, the model is most accurate at predicting surprise and fear and least accurate at predicting disgust.

Table 13 Confusion matrix by using the MobileNet
Table 14 Performance measures by using MobileNet model

Table 14 displays the performance measures of the proposed model using MobileNet as a feature extractor.

From these calculations, the F1 score is 0.93 and the accuracy of the model using the pre-trained MobileNet is 98.5% (Fig. 20).

Fig. 20 Fitting results by using Inception V3

6.9 Results of Inception V3 on test data

Figure 20 shows the data fitting results using the pre-trained Inception V3 as a feature extractor.

Table 15 displays the confusion matrix of the Inception V3 model for the 148 test samples. According to Table 16, the model is most accurate at predicting surprise and least accurate at predicting happiness.

Table 15 Confusion matrix by using Inception V3 model
Table 16 Performance measures by using the Inception V3 model

Table 16 displays the performance measures of the proposed model using Inception V3 as a feature extractor.

From these calculations, the F1 score is 0.75 and the accuracy of the model using the pre-trained Inception V3 is 94.2%. Figures 21 and 22 exhibit the accuracy and loss of the proposed model using Inception V3.

Fig. 21 Accuracy by using Inception V3

Fig. 22 Loss by using Inception V3

7 Comparative analysis

7.1 Comparisons within proposed methods

In this work, four pre-trained networks (VGG19, Resnet50, MobileNet, and Inception V3) are used for recognizing emotions. The sensitivity, specificity, precision, F1 score, and accuracy values are calculated for every network; Table 17 shows the values obtained for all the networks.

Table 17 Comparisons within proposed networks

7.1.1 Inference from the results

From the above results, among the four convolutional neural networks, MobileNet achieved the highest F1 score (0.93) and accuracy (98.5%), followed by Resnet50 with an F1 score of 0.91 and an accuracy of 97.7%. MobileNet's reduced size, fewer parameters, and faster performance help it achieve high accuracy compared with the other models, and Resnet also achieves high accuracy because it tackles the vanishing gradient problem. The drawback of VGGNet is its slow training process.

7.2 Comparisons with other approaches

Table 18 displays comparisons, in terms of accuracy, of various deep learning approaches by other researchers for the facial emotion recognition problem.

Table 18 Comparison of Existing works

Compared to all the existing works, our proposed method achieved the highest accuracy of 98.5% for facial emotion recognition.

8 Conclusions

This paper presented a facial emotion recognition system using transfer learning approaches. Pre-trained convolutional neural networks (VGG19, Resnet50, Inception V3, and MobileNet), trained on the ImageNet database, are used for facial emotion recognition, and the experiments are carried out on the CK+ database. The accuracy achieved with VGG19 is 96%, with Resnet50 97.7%, with Inception V3 94.2%, and with MobileNet 98.5%. Among the four pre-trained networks, MobileNet achieved the highest accuracy. In the future, these networks will be applied to speech and EEG signals to recognize emotions.