Introduction

Emotion intensity has attracted researchers for many years. Emotion is a complex psychological state that reveals feelings, moods, thoughts, and reactions, while emotion intensity is a non-monotonic function that measures the strength of an emotion.

Researchers explore the depth of emotions to tackle problems in their own fields, and the breadth of its applications shows the versatility of the topic: medical science, psychology, computer science, security systems, recommender systems, education, marketing, social science, human science, and many more. In medical science, for instance, emotions are studied in depression counseling [1], in autism spectrum disorders [2], and for patients seeking therapy [3]. In psychology, researchers have proposed many theories to explain the origin of emotions, their neurobiological basis, and other aspects of emotion. Developers also examine emotion and its intensity to strengthen security systems [4, 5]. Emotions play a crucial role in education, as leading researchers have argued [6], and many e-learning processes reflect this importance. Human science uses emotion to study mental processes and disorders. In marketing, emotion is studied as emotional marketing: it acts as a channel between buyer and seller that reveals the buyer's experience with a particular product. Other studies have linked smile intensity to life satisfaction and longevity. In computer science, the study of emotions is called affective computing, and it draws on several modalities such as speech [7], text [8], facial expression [9], EEG signals [10], body posture [11], and context [12].

In the domain of computer vision, a lot of work has been done in the field of facial expression recognition (FER). A recent survey [13] of FER systems gives a comprehensive summary of works based on deep learning. Some FER works use CNN-based architectures and others use fuzzy inference systems (FIS) to address the problem. Another popular technique is transfer learning, which is widely used because it reduces training time and improves performance. For instance, Sajjanhar et al. [9] proposed a FER system based on transfer learning and evaluated its performance with the pre-trained models Inception-v3, VGG, and VGG-Face.

Rassadin et al. [14] used transfer learning for face identification in their group-level emotion recognition work. They extracted landmark points with Dlib, computed pair-wise distances between them to learn face features, and then applied classifiers such as logistic regression, support vector regression, gradient boosting trees, and random forests. A recent study [15] showed that among VGG16, ResNet152V3, Inception V3, and Xception, VGG16 performed best for FER on a combined CK+ and JAFFE dataset, with an accuracy of 83.16%. Many geometry-based studies [16,17,18] rely on facial landmark points to recognize facial emotions. Recently, Amal et al. [19] presented real-time emotion recognition on the FER 2013 dataset using local binary patterns (LBP) for face detection and Dlib for landmark point extraction, and constructed a CNN with histogram of oriented gradients (HOG) features. Their 75.1% accuracy shows that high classification accuracy for real-time FER systems remains challenging.

On the other hand, Nicolai and Choi [20] introduced a fuzzy-based FER system with an accuracy of 78.8% on the JAFFE dataset. Farahani et al. [21] also built a Mamdani-type fuzzy system for FER using eye features (opening, width ratio) and mouth features (opening, width) and likewise reached 78.8% accuracy; Chakraborty et al. [22] followed a similar concept. Another FER system [16] applied fuzzy logic to five basic emotions (excluding neutral) by measuring the displacement of 17 landmark points between expressive and neutral frames, with performance evaluated on the CK+ dataset. Bahreini et al. [17] presented software that computes 54 cosine values from the face and predicts six basic emotions with 37 FURIA fuzzy rules, achieving 83.2% accuracy.

Emotion intensity has also been addressed with both CNN and FIS approaches. Witzig et al. [23] presented smile intensity estimation based on a combined CNN and RNN structure. Esau et al. [18] developed a real-time fuzzy-based emotion intensity system for four emotions (happy, sad, angry, and fear), representing facial features by six angles and achieving a recognition accuracy of 72%. Vinola and Vimala Devi [24] built a fuzzy-based smile intensity system using Euclidean distances between landmark points and achieved a recognition rate of 86.54%. Other approaches to emotion intensity are given by Savran et al. [25] and Whitehill et al. [26].

Some researchers use combined datasets to show that their models are not biased toward a particular dataset. Ozdemir et al. [27] developed a CNN based on the LeNet model; by merging three datasets (JAFFE, KDEF, and custom data) they obtained 96.43% training accuracy and 91.81% validation accuracy in real-time facial emotion classification. Ahmed et al. [28] merged eight different datasets, applied augmentation techniques, and achieved 96.24% accuracy with their proposed CNN structure.

Recently, fusion-based approaches have been highlighted for their better accuracy. Kim et al. [29] proposed a hierarchical deep neural network FER system that fuses appearance-based and geometry-based features. Song [30] proposed a feature fusion model based on machine learning and philosophical concepts. Similarly, Park et al. [31] constructed a 3D CNN architecture that extracts spatial and temporal features simultaneously. In other work, Chu et al. [32] used multi-layer convolutional feature fusion, and Zhang et al. [33] proposed a mask-refined R-CNN that attends to both global and detailed information for better results.

Emotions and their intensity play a significant role in various fields, and several affective-computing approaches have already been successful at this task. In this work, we develop an effective approach to recognizing emotions and their intensity by combining two components: a CNN, specifically transfer learning, and a fuzzy inference system. Although each component has been applied to this task before, this particular fusion of the two models has not. The foundation of the proposed work rests on the respective strengths of transfer learning and FIS.

The strength of transfer learning is feature learning under the assumption that the source and target tasks are sufficiently similar; reusing knowledge acquired by one model in another gives higher starting accuracy and faster convergence. A fuzzy inference system, in turn, is easy to construct, flexible, and able to handle vagueness, mapping input values to outputs through fuzzy rules. Both transfer learning and FIS are well-established methods that have been applied to emotion recognition for the past several years. Motivated by their success in this field, we propose an emotion intensity classifier based on transfer learning and FIS that takes advantage of both.

Previous emotion intensity research has mainly concentrated on a single emotion (happy), and fine-grained intensity categories are missing [34]. Because no dedicated dataset exists for this task, most intensity-related works are performed on datasets created [26, 34] or self-annotated [23] by the researchers themselves. In addition, FIS-based emotion intensity work has required a large number of face features [17, 18]. To overcome these limitations, this work comprises two stages: basic emotion classification, followed by an intensity-based subcategory of the recognized emotion. The pre-trained VGG16 architecture [35] is used for basic emotion classification, and a fuzzy inference system estimates the intensity level of the detected emotion. This fusion of a pre-trained-network-based basic emotion classifier and an FIS-based intensity sub-classifier has not been attempted before.

The main contributions of this proposed work are as follows:

  • Emotion intensity work based on CNNs requires specifically annotated data, while FIS-based work requires many face features for accurate prediction. The proposed fusion therefore divides the task into two stages: the first stage performs basic emotion classification with a pre-trained model on available datasets (CK+, KDEF, and FER 2013), and the second stage predicts the intensity level of the detected emotion with a fuzzy system of lower complexity (i.e., fewer face features) and greater accuracy.

  • This work extends emotion intensity estimation from a single emotion (happy) to more classes (happy, sad, surprise, angry). The proposed work can also predict intensity from slight to peak levels through 13 fine-grained intensity categories.

  • We use a combined dataset containing posed and spontaneous images, so the variability in the data encourages real-life deployment of this work, and we compare the findings of the proposed work with recent related works.

The rest of the paper is structured as follows: "Proposed Work" describes the two proposed modules, "Experiment and Results" presents the experiment and its results, "Discussion and Future Work" discusses the experimental results, and "Conclusion" gives an overview of this work and highlights its findings.

Proposed Work

This facial emotion intensity classifier is arranged into two modules: a classifier based on a pre-trained network for the basic emotions happy, sad, angry, and surprise, and a classifier based on a fuzzy inference system for the intensity subcategory of the detected emotion. The utility of this model is that it makes the subcategory task much easier: once a basic emotion is detected by the pre-trained network, its subcategory depends only on selected face features such as the lips, eyes, and, in some cases, eyebrows. Instead of relying on many feature values, we can therefore find emotion subcategories smoothly and precisely. The flowchart of the proposed model is given in Fig. 1.

Fig. 1 Flowchart of the proposed fusion work: facial emotion intensity classifier

Facial Emotion Classifier

Collection of Facial Expression Database and Preprocessing

For the first module, we take three databases: FER 2013, CK+, and KDEF. The FER 2013 dataset [36] contains 35,887 images in total: 28,709 for training, 3589 for validation, and 3589 for testing. All images are 48 × 48 pixels and cover 7 emotions. The images vary considerably: some faces are frontal, some are partial, and in many images the face is covered by hands. The dataset is imbalanced, and several images are not correctly annotated.

The CK+ dataset, introduced by [37], contains 593 image sequences at a resolution of 640 × 490. It contains posed and spontaneous images of 123 different subjects aged 18 to 50 years and covers the same seven facial expressions as FER 2013.

The KDEF (Karolinska Directed Emotional Faces) dataset [38] contains the facial expressions of 35 males and 35 females across 7 major expressions (happy, sad, surprise, angry, disgust, afraid, and neutral), for a total of 4900 images.

The dataset used in this paper was downloaded from Kaggle [39] and combines the three datasets mentioned above with corrected annotations. It contains 32,900 images across 8 emotion classes (including happy, sad, surprise, angry, disgust, afraid, and neutral). All images are grayscale PNGs of size 224 × 224.

In this study we focus on 4 emotions (happy, sad, angry, and surprise). We randomly selected images of these 4 emotions from the combined dataset and also collected some images from Google, giving a total of 6937 images. Of these, 6079 images (approximately 1500 per class) are used for training, 436 for validation, and 422 for testing. Many researchers [27, 28] have used combined datasets in their work; building a dataset from images collected from different sources makes our model more effective and unbiased. Figure 2 shows some sample images from this dataset.

Fig. 2 Some sample images of the dataset

Data preprocessing The two preprocessing steps for the first module are image resizing and image rescaling. All images were already the same size except those downloaded from Google, and every image was resized to the target size of 224 × 224. Images for training, validation, and testing were loaded with the built-in ImageDataGenerator utility provided by the Keras API, which was also used to resize and rescale the images.
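A minimal sketch of this loading step is shown below; the class-per-folder directory layout and the batch size are illustrative assumptions, not values reported in this work.

```python
# Load and preprocess images with Keras' ImageDataGenerator:
# pixel values are rescaled to [0, 1] and every image is resized
# to the 224 x 224 target size. Directory names are assumptions.
from tensorflow.keras.preprocessing.image import ImageDataGenerator

IMG_SIZE = (224, 224)
BATCH_SIZE = 32

datagen = ImageDataGenerator(rescale=1.0 / 255)

train_gen = datagen.flow_from_directory(
    "data/train", target_size=IMG_SIZE, batch_size=BATCH_SIZE,
    class_mode="categorical")                    # happy / sad / angry / surprise
val_gen = datagen.flow_from_directory(
    "data/val", target_size=IMG_SIZE, batch_size=BATCH_SIZE,
    class_mode="categorical")
test_gen = datagen.flow_from_directory(
    "data/test", target_size=IMG_SIZE, batch_size=BATCH_SIZE,
    class_mode="categorical", shuffle=False)     # keep order for evaluation
```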

Basic Emotion Classification by Pre-trained Model

We applied the transfer learning technique using the pre-trained VGG16 model, whose weights were developed for the ImageNet image classification task. The architecture of the first module is summarized in Fig. 3.

Fig. 3 Architecture of the first module: facial emotion classifier

Sub-category Classifier Based on Fuzzy Inference System

Collection of Database and Preprocessing

For the second module, the CK+ and KDEF datasets are used. Because FER 2013 images vary in face alignment and orientation, that dataset is not included in this module. To cover a wide range of emotional intensities, additional images were downloaded from Google, and after preprocessing these images we built the dataset for this module.

Preprocessing The CK+ and KDEF datasets contain images with nearly identical orientation and face alignment, and both comprise only frontal faces. The images downloaded from Google required some extra preprocessing effort: we downloaded frontal-face images for the four emotions (happy, sad, angry, surprise), manually cropped them so that they contained the face only (similar to the images in CK+ and KDEF), and resized them to the target size of 224 × 224. The following preprocessing steps were then applied to the whole dataset:

  a. Conversion to grayscale: Because the images were collected from different sources, they were converted to grayscale after resizing.

  b. Histogram equalization: This scheme adjusts the contrast of an image by spreading its intensity values more evenly across the histogram, and it is adequate for both bright and dark images. We applied the built-in OpenCV function cv2.equalizeHist() to reduce data variance.

  c. Face detection and landmark point extraction: Face detection and landmark extraction are prerequisite steps in a FER system; face detection locates the face in the image. For this task we applied the frontal face detector from the dlib library [14, 17], a pre-trained HOG and linear SVM detector that gives quick and reliable results. We then applied the dlib landmark predictor, which extracts 68 landmark points from the detected face. Figure 4 shows all preprocessing steps, and a short code sketch follows.

Fig. 4 Three preprocessing steps: histogram equalization, face detection, and landmark point detection
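A minimal sketch of these three steps with OpenCV and dlib is given below; the path to the standard 68-point shape predictor file is an assumption about the local setup.

```python
# Grayscale conversion, histogram equalization, and dlib face /
# 68-landmark detection, as applied before the fuzzy module.
import cv2
import dlib

detector = dlib.get_frontal_face_detector()      # pre-trained HOG + linear SVM detector
predictor = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")

def extract_landmarks(image_path):
    img = cv2.imread(image_path)
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)  # a. grayscale conversion
    gray = cv2.equalizeHist(gray)                 # b. histogram equalization
    faces = detector(gray, 1)                     # c. face detection
    if not faces:
        return None
    shape = predictor(gray, faces[0])             # 68 landmark points
    return [(shape.part(i).x, shape.part(i).y) for i in range(68)]
```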

Feature Values Estimation/Estimation of Area and Tangent

Most emotion recognition studies show that the lips, eyes, and eyebrows are the most informative features, and various FER works such as Rassadin et al. [14], Farahani et al. [21], Chakraborty et al. [22], and Islam and Loo [16] are based on them. In this work, to estimate the lip and eye areas from the detected face, we model the lips and eyes as ellipses. We also calculate the lip width and the eyebrow tangent for the intensity-based subcategory task. The lip and eye areas are calculated with the formula:

$$\text{Area of ellipse} = 3.14 \times \text{length of major axis} \times \text{length of minor axis}.$$

Lip width was calculated with the Euclidean distance (ED) formula, and the eyebrow tangent with the slope formula $(y_2 - y_1)/(x_2 - x_1)$. To calculate areas and widths, we first computed the normalized Euclidean distance (NED) between the two contributing landmark points:

$$\text{NED} = \text{ED} / \text{length of face segment}.$$

The raw Euclidean distance cannot be used directly because its value varies from image to image depending on the location of the face and the size of the face segment. To standardize the ED, we use the normalized Euclidean distance, as described by Vinola and Vimala Devi [24] in their smile intensity work. Figure 5 shows the height and width of the face and the 68 landmark points, and Fig. 6 shows all face features used in this study.
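An illustrative sketch of the feature estimation is given below. The landmark indices follow dlib's 68-point convention, but the exact points chosen for each axis, the face segment used for normalization, and the helper names are assumptions for illustration.

```python
# Estimate lip area, lip width, eye area, and eyebrow tangent from
# the 68 landmark points (coordinate pairs as returned above).
import math

def ned(p, q, face_length):
    # normalized Euclidean distance: ED / length of face segment
    return math.dist(p, q) / face_length

def ellipse_area(major, minor):
    # area formula as used in this work: 3.14 * major axis * minor axis
    # (a constant scale relative to pi*a*b with semi-axes, which does
    #  not affect the fuzzy ranges tuned to these values)
    return 3.14 * major * minor

def face_features(pts):
    face_h = math.dist(pts[27], pts[8])           # nose bridge to chin (assumed segment)
    lip_w = ned(pts[48], pts[54], face_h)         # mouth corners
    lip_h = ned(pts[51], pts[57], face_h)         # upper / lower lip midpoints
    eye_w = ned(pts[36], pts[39], face_h)         # left-eye corners
    eye_h = ned(pts[37], pts[41], face_h)         # left-eye upper / lower lid
    brow_tan = (pts[19][1] - pts[17][1]) / (pts[19][0] - pts[17][0] + 1e-6)  # (y2-y1)/(x2-x1)
    return {"lip_area": ellipse_area(lip_w, lip_h),
            "lip_width": lip_w,
            "eye_area": ellipse_area(eye_w, eye_h),
            "eyebrow_tangent": brow_tan}
```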

Fig. 5 Height and width of the face and the 68 landmark points

Fig. 6 Face features used in this study

Fuzzy Inference System for Emotion Intensity-Based Sub-categories

If the basic facial emotion is accurately detected by the first module (described in "Basic Emotion Classification by Pre-trained Model"), the second module can easily predict the subcategory of the detected emotion. Finding this intensity-based subcategory requires only a few face features: as discussed earlier, the lips, eyes, and in some cases the eyebrows are sufficient, so the lip area, lip width, eye area, and eyebrow tangent are used to determine the subcategory of the emotion based on its intensity.

We constructed four separate, independent fuzzy inference systems to predict the subcategories of the four basic emotions. This design is needed because each emotion subcategory has its own interval of values at different intensities, and each emotion has its own set of fuzzy rules over the corresponding linguistic variables. The resulting emotion subcategory value for each emotion ranges from 0 to 100.

For each emotion class, images were taken from the dataset and the lip area, lip width, eye area, and eyebrow tangent were calculated to define the emotion subcategories. Table 1 lists the emotions and the corresponding face features considered in this work. After examining a large number of images with varying emotion intensity, we identified a pattern and used it to define the fuzzy ranges of lip area, lip width, eye area, and eyebrow tangent from lower to higher intensity. The membership function of each fuzzy input is defined by the linguistic variables low, medium, and high, using triangular and trapezoidal membership functions, and if-then fuzzy rules predict the subcategories of that particular emotion. All four fuzzy inference systems were developed in the same manner.

Table 1 Emotions and corresponding face features

Samples of the fuzzy rules for the emotion subcategories are as follows:

IF lip area is low AND eyebrow tangent is high THEN angry is 'angry'.

IF lip area is medium AND eye area is medium THEN surprise is 'surprise'.

IF lip area is low AND eye area is high THEN sad is 'sad'.

IF lip width is low AND eye area is medium THEN happy is 'little bit happy'.

IF lip area is low AND eye area is low THEN sad is 'more than sad'.

IF lip area is medium AND eye area is medium THEN happy is 'happy'.

IF lip area is low AND eyebrow tangent is low THEN angry is 'little bit angry'.

IF lip area is high AND eyebrow tangent is high THEN angry is 'shouting'.
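A minimal Scikit-Fuzzy sketch of one such inference system (the 'happy' classifier) is shown below. The universe ranges and membership break-points are illustrative assumptions; in this work they were derived from the observed feature values, and each of the four systems has its own ranges, rules, and further subcategories.

```python
# One fuzzy inference system: antecedents with low/medium/high
# membership functions, if-then rules, and a 0-100 output range.
import numpy as np
import skfuzzy as fuzz
from skfuzzy import control as ctrl

lip_width = ctrl.Antecedent(np.arange(0.0, 1.01, 0.01), "lip_width")
eye_area = ctrl.Antecedent(np.arange(0.0, 0.51, 0.01), "eye_area")
happy = ctrl.Consequent(np.arange(0, 101, 1), "happy")              # output range 0-100

# triangular / trapezoidal membership functions (break-points assumed)
lip_width["low"] = fuzz.trapmf(lip_width.universe, [0.0, 0.0, 0.2, 0.35])
lip_width["medium"] = fuzz.trimf(lip_width.universe, [0.3, 0.45, 0.6])
lip_width["high"] = fuzz.trapmf(lip_width.universe, [0.55, 0.7, 1.0, 1.0])
eye_area["low"] = fuzz.trapmf(eye_area.universe, [0.0, 0.0, 0.1, 0.2])
eye_area["medium"] = fuzz.trimf(eye_area.universe, [0.15, 0.25, 0.35])
eye_area["high"] = fuzz.trapmf(eye_area.universe, [0.3, 0.4, 0.5, 0.5])
happy["little bit happy"] = fuzz.trimf(happy.universe, [0, 25, 50])
happy["happy"] = fuzz.trimf(happy.universe, [25, 60, 100])

rules = [
    ctrl.Rule(lip_width["low"] & eye_area["medium"], happy["little bit happy"]),
    ctrl.Rule(lip_width["medium"] & eye_area["medium"], happy["happy"]),
]

happy_sim = ctrl.ControlSystemSimulation(ctrl.ControlSystem(rules))
happy_sim.input["lip_width"] = 0.42
happy_sim.input["eye_area"] = 0.22
happy_sim.compute()
print(happy_sim.output["happy"])    # defuzzified intensity value in [0, 100]
```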

The architecture of the second module is summarized in Fig. 7.

Fig. 7 Architecture of the second module: subcategory classifier based on fuzzy inference system

Experiment and Results

We evaluated the performance of our model experimentally. The proposed work was implemented in Google Colaboratory using the TensorFlow, OpenCV, Dlib, and Scikit-Fuzzy libraries. The training, validation, and testing datasets (see "Collection of Facial Expression Database and Preprocessing") were loaded and preprocessed with the built-in ImageDataGenerator. To train the first module, the pre-trained basic emotion classifier described in "Basic Emotion Classification by Pre-trained Model" and shown in Fig. 3, we loaded the pre-trained VGG16 model in Keras with the "include_top" argument set to "false". For prediction, we added a fully connected layer with 512 nodes and "relu" activation, followed by a dropout layer that randomly excludes 50% of the neurons, and a final fully connected layer with 4 nodes and "softmax" activation for classifying the 4 basic emotions (happy, sad, angry, and surprise). The "Adam" optimizer with a learning rate of 0.001 and the "categorical cross-entropy" loss function were chosen for this framework. Figure 8 shows the model accuracy and loss: training accuracy reached 96.06% and validation accuracy 81.19% after 100 iterations.
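A sketch of this configuration in Keras is given below. The frozen convolutional base and the Flatten layer between the base and the dense head are assumptions about details not stated explicitly, and the generators are those from the earlier loading sketch.

```python
# VGG16 transfer-learning classifier: ImageNet weights without the
# top, a 512-unit ReLU layer, 50% dropout, and a 4-way softmax head,
# trained with Adam (lr = 0.001) and categorical cross-entropy.
from tensorflow.keras.applications import VGG16
from tensorflow.keras.layers import Dense, Dropout, Flatten
from tensorflow.keras.models import Model
from tensorflow.keras.optimizers import Adam

base = VGG16(weights="imagenet", include_top=False, input_shape=(224, 224, 3))
base.trainable = False                      # reuse ImageNet features (assumed frozen)

x = Flatten()(base.output)
x = Dense(512, activation="relu")(x)        # first fully connected layer
x = Dropout(0.5)(x)                         # 50% of neurons randomly excluded
out = Dense(4, activation="softmax")(x)     # happy / sad / angry / surprise

model = Model(base.input, out)
model.compile(optimizer=Adam(learning_rate=0.001),
              loss="categorical_crossentropy",
              metrics=["accuracy"])

history = model.fit(train_gen, validation_data=val_gen, epochs=100)
```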

Fig. 8 Model accuracy and loss

The confusion matrix for the testing dataset is shown in Fig. 9 and reveals a testing accuracy of 83%. The precision value shows that when the model predicts the facial emotion surprise, it is correct 89% of the time. The confusion matrix also shows that, on the testing data, the recognition rate of the positive emotions is higher than that of the negative emotions: 93% for happy and 92% for surprise, versus 79% for angry and 70% for sad.
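A hypothetical evaluation sketch that produces this kind of report is shown below; model and test_gen refer to the earlier sketches, and scikit-learn is assumed to be available.

```python
# Per-class precision, recall, f1-score and the confusion matrix
# for the testing generator.
import numpy as np
from sklearn.metrics import classification_report, confusion_matrix

probs = model.predict(test_gen)
y_pred = np.argmax(probs, axis=1)
y_true = test_gen.classes                    # valid because shuffle=False

print(confusion_matrix(y_true, y_pred))
print(classification_report(y_true, y_pred,
                            target_names=list(test_gen.class_indices)))
```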

Fig. 9 Precision, recall, f1-score, and confusion matrix of the classifier

For inference, we pick an image and apply the rescaling and resizing preprocessing steps. The first module then predicts the basic emotion together with its class index. Using this class index, the system automatically forwards the image to the fuzzy inference system constructed for that emotion. Before the image reaches the FIS, three preprocessing steps are applied: histogram equalization, face detection, and landmark point extraction. After the landmark locations are verified, the image is passed to the FIS, where the feature values are estimated (see "Feature Values Estimation/Estimation of Area and Tangent" and Table 1) and then passed through the fuzzy system, which predicts the emotion subcategory based on the fuzzy rules. The pre-trained model transfers knowledge from one task to another, and the fuzzy system handles uncertainty and imprecision. The experiments show that the proposed fusion model, the facial emotion intensity classifier, first predicts the basic emotion class and then the intensity-level subclass. It also reduces the complexity of the task and improves performance, because once the basic emotion is accurately detected by the first module, the intensity of the detected emotion depends on only a few face features (size of mouth opening, eye opening, and eyebrow tangent). Experimental results for all emotions are shown in Fig. 10.
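The routing between the two modules might be sketched as follows; the class-index order, the per-emotion simulation registry, and the helpers extract_landmarks and face_features come from the earlier sketches and are assumptions, not the exact implementation.

```python
# End-to-end inference: module 1 predicts the basic emotion, and the
# class index selects the matching fuzzy inference system for the
# intensity subcategory.
import cv2
import numpy as np

CLASS_NAMES = ["angry", "happy", "sad", "surprise"]   # assumed index order

def classify_intensity(image_path, model, fis_by_emotion):
    # Module 1: resize / rescale, then predict the basic emotion class
    img = cv2.resize(cv2.imread(image_path), (224, 224))
    x = np.expand_dims(img.astype("float32") / 255.0, axis=0)
    emotion = CLASS_NAMES[int(np.argmax(model.predict(x)))]

    # Module 2: landmark features routed to that emotion's fuzzy system
    feats = face_features(extract_landmarks(image_path))
    sim, used = fis_by_emotion[emotion]      # simulation plus the feature names it expects
    for name in used:
        sim.input[name] = feats[name]
    sim.compute()
    return emotion, sim.output[emotion]      # intensity value in [0, 100]
```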

Fig. 10 Experimental results for all emotions

Discussion and Future Work

The performance of our model depends on the prediction accuracy of the basic emotion classifier and on the precision of face detection and landmark point extraction. We therefore used the transfer learning technique together with the Dlib face detector and landmark predictor. To assess the proposed model, images from several datasets (KDEF, CK+, and FER 2013) as well as from Google were used. For the facial emotion recognition task, only two preprocessing steps, resizing and rescaling, were applied, so the accuracy of the model rests entirely on the transfer learning technique, the fuzzy system, and the combined dataset collected from various sources.

The classification results of the second module, the emotion intensity classifier, are correct only if the first module accurately detects the basic emotion and if face detection and landmark point estimation succeed. Once the basic emotion has been detected, the intensity of the emotion depends on the size of the mouth opening, the eye opening, and the eyebrow tangent, as these are the prime face features for this task. The overall performance of our model therefore depends on the accuracy of the first module: 96.06% on training, 81.19% on validation, and 83% on testing data.

The fuzzy-based second module successfully presents the detected emotion subcategories graphically. When different images of the same emotion with varying intensity are examined, the model categorizes them correctly, and even when two images fall in the same category, an intra-class comparison can be made through their membership values. The proposed system recognizes emotion intensity from slight to peak levels easily and effectively. The experiments show that the proposed work gives significant results for images taken from different sources. Overall, the proposed fusion model, the facial emotion intensity classifier, predicts the basic emotion class, an intensity value, and a subcategory of the recognized emotion based on its intensity, presented graphically (Fig. 10).

CK+, JAFFE, and KDEF are all posed datasets, meaning the data contain no variation in head pose or illumination and all images have a similar background, so accuracies of up to 97% have been reached on them. On spontaneous data resembling real-life situations, such as FER 2013, researchers have obtained at most 75–76% accuracy. Table 2 summarizes this accuracy difference between posed [15, 29, 31, 40,41,42,43,44] and spontaneous [17, 19, 30, 45,46,47] datasets. Posed datasets always yield higher accuracy than spontaneous ones but are less reliable for real-life applications.

Table 2 Comparison table for facial emotion classifier

The proposed fusion model, the facial emotion intensity classifier, jointly predicts the basic emotion class through module 1 and the intensity-level category through module 2 (Fig. 10). We also use a combined dataset with images collected from different sources, so comparing the basic emotion classifier (module 1) with related works listed in Table 2 that use a single dataset or a different number of emotion classes would be unfair. Nevertheless, the proposed basic emotion classifier (module 1), trained on combined data with four emotion classes, achieved 96.06% training, 81.19% validation, and 83% testing accuracy. Table 3 compares the proposed fusion work, the facial emotion intensity classifier (Fig. 1), with previous intensity-related works.

Table 3 Comparison table for proposed fusion model: facial emotion intensity classifier

Emotion intensity plays a vital role in human-machine interaction. The literature contains a substantial amount of work on basic emotion classification, whereas facial emotion intensity prediction remains limited. The reason is that no dedicated labeled dataset is available for this task, so in some intensity works [26, 34] the researchers generated their own data for the emotion happy. Collecting intensity data for more emotion classes is time-consuming and expensive, so most previous intensity work [23, 26, 34] concentrates on the single emotion happy, and its intensity classes are not sufficient.

To tackle this problem, fuzzy-based studies have used a large number of face features [18], at the cost of complexity, to define the emotion class and intensity level. For instance, Bahreini et al. [17] calculated 54 cosine values for six-class basic emotion classification, and Vinola and Vimala Devi [24] calculated five Euclidean distances between ten landmark points for smile intensity alone. The proposed fusion-based facial emotion intensity classifier overcomes these limitations with its two-module design: no additional labeled dataset is required for multi-class emotion intensity work, and the complexity of the task is reduced because fewer face features are needed to define the intensity level. The proposed work successfully predicts four emotion classes with 13 intensity-based subcategories. These fine-grained categories capture intensity from slight to peak levels, and the graphical output generated by the system makes the outcome easy to visualize.

This work is still limited to frontal faces and could be improved by adding an audio feature, and the performance of the fusion model depends on the prediction accuracy of the basic emotion classifier (module 1). In the future we will therefore modify our architecture by applying fusion techniques at the feature [49] and score levels [50], add more spontaneous images and image preprocessing steps [51] to improve accuracy, and apply data augmentation [28] and other pre-trained models [9].

Conclusion

The main aspiration of this work is to provide an effective way of finding emotion subcategories by fusing a pre-trained network with a fuzzy inference system. Our emotion intensity classifier divides this complicated task into two sub-tasks: first, basic emotion detection based on a CNN, specifically the transfer learning technique, and second, an intensity-based subcategory of the recognized emotion through an FIS. An important feature of this model is that it works effectively on a combined dataset collected from different sources, which makes it more reliable.

The experimental results support these findings: the proposed emotion intensity classifier reduces the complexity of the task and enhances performance by taking advantage of transfer learning and the fuzzy system. The purpose of this work is to point out opportunities for improvement in emotion intensity research.