
1 Introduction

Facial expressions play a vital role in social communication between humans because the human face is the richest source of emotional cues [18]. We are capable of reading and understanding facial emotions thanks to thousands of years of evolution. We also react to facial expressions [13], and some of these reactions are even unconscious [14]. Emotions play an important role as feedback in learning because they inform the teacher about the student's emotional state [3, 31]. This is particularly important in on-line learning, where a fully automated system can adapt to the emotional state of the learner [15].

We introduce a novel deep-neural-network-based emotion detector targeted at educational settings. Our approach uses a web camera to capture images of the learner and a convolutional neural network to detect facial expressions in real time. The expressions are categorized using Russell's classification model [55] (Fig. 1), which covers the majority of affective states a learner might experience during a learning episode. We also address the different scenes, viewing angles, and lighting conditions that may be encountered in practical use. We apply transfer learning to the fully-connected layers of the VGG_S network, retraining them on manually labeled images of human facial expressions. The overall test accuracy of the detected emotions was 66%, and our system reports the user's emotional state at about five frames per second on a laptop computer. We plan to integrate the emotional state detector into an affective pedagogical agent system, where it will serve as feedback to an intelligent animated tutor.

2 Background and Related Work

Encoding and understanding emotions is particularly important in educational settings [3, 31]. While face-to-face education with a capable, educated, and empathetic teacher is optimal, it is not always possible. People have sought ways to teach without teachers ever since the invention of books, and recent advances in technology, such as simulations [43, 66], continue this trend. We have also seen significant advances in distance-learning platforms and systems [22, 52]. However, while automation brings many advantages, such as reaching a wide population of learners or being available at locations where face-to-face education may not be possible, it also brings new challenges [2, 9, 50, 61]. One of them is the standardized look-and-feel of the course: one layout does not fit all learners, the pace of delivery should be managed, the tasks should vary depending on the level of the learner, and the content should be calibrated to the individual needs of learners.

Affective Agents: Some of these challenges have been addressed by interactive pedagogical agents, which have been found effective in enhancing distance learning [6, 40, 47, 57]. Among them, animated educational agents play an important role [12, 39], because they can be easily controlled and their behavior can be defined by techniques commonly used in computer animation, for example by providing adequate gestures [25]. Pedagogical agents with emotional capabilities can enhance interactions between the learner and the computer and can improve learning, as shown by Kim et al. [30]. Several systems have been implemented. For example, Lisetti and Nasoz [37] combined facial expressions and physiological signals to recognize a learner's emotions. D'Mello and Graesser [15] introduced AutoTutor and showed that learners display a variety of emotions during learning and that AutoTutor can be designed to detect these emotions and respond to them. The virtual agent SimSensei [42] engages in interviews to elicit behaviors that can be automatically measured and analyzed; it uses a multimodal sensing system that captures a variety of signals to assess the user's affective state and to inform the agent's feedback. Manipulating the agent's affective states significantly influences learning [68] and has a positive influence on learner self-efficacy [30].

However, an effective pedagogical agent needs to respond to the learner's emotions, which must first be detected. The communication should be based on real input from the learner; pedagogical agents should be empathetic [11, 30] and should provide emotional interactions with the learner [29]. Various means of emotion detection have been proposed, such as using an eye tracker [62], measuring body temperature [4], using visual context [8], or measuring skin conductivity [51], but a vast body of work has focused on detecting emotions in speech [28, 35, 65].

Facial Expressions: While the above-mentioned previous work provides very good results, it may not always be applicable in an educational context. Speech is often not required while communicating with educational agents, and approaches that require attached sensors may not be ideal for the learner. This leaves the detection and analysis of facial expressions as a good option.

Various approaches have been proposed to detect facial expressions. Early work, such as FACS [16], focuses on facial parameterization, where features are detected and encoded as a feature vector that is used to identify a particular emotion. More recent approaches use active contours [46] or other automated methods to detect the features automatically. A large class of algorithms uses geometry-based approaches, such as facial reconstruction [59], while others detect salient facial features [20, 63]. Various emotions and their variations have been studied [45] and classified [24], and some works focus on micro-expressions [17]. Newer approaches use automated feature detection via machine learning methods such as support vector machines [5, 58], but they share the same sensitivity to the facial detector as the above-mentioned approaches (see also the review [7]).

One of the key components of these approaches is a face tracking system [60] that should be capable of robustly detecting the face and its features even under varying lighting conditions and for different learners [56]. However, existing methods often require careful calibration and similar lighting conditions, and the calibration may not transfer to other persons. Such systems provide good results for tracking head position or orientation, but they may fail to detect the subtle changes in mood that are important for emotion detection.

Deep Learning: Recent advances in deep learning [34] have brought deep neural networks to the field of emotion detection as well. Several approaches have been introduced for robust head rotation detection [53], detection of facial features [64], speech [19], and even emotions [44]. Among them, EmoNets [26] detects acted emotions in movies by simultaneously analyzing the video and audio streams. This approach builds on previous work on CNN-based facial detection [33]. Our work is inspired by Burkert et al. [10], who introduced a deep learning network called DeXpression for emotion detection from videos. In particular, they use the Cohn-Kanade database (CMU-Pittsburgh AU-coded database) [27] and the MMI Facial Expression database [45].

3 Classification of Emotions

Most applications of emotion detection categorize images of facial expressions into seven types of human emotions: anger, disgust, fear, happiness, sadness, surprise, and neutral. Such a classification is too detailed in the context of students' emotions; for instance, when learners are taking video courses in front of a computer, such a fine-grained set of emotions is not applicable in all scenarios. Therefore, we use a classification of emotions related to, and used in, academic learning [48, 49]. In particular, we use Russell's model of core affect [55], in which any particular emotion can be placed along two dimensions (see Fig. 1): 1) valence (ranging from unpleasant to pleasant) and 2) arousal (ranging from activation to deactivation). This model covers a sufficiently large range of emotions and is suitable for a deep learning implementation.

The two main axes of Russell's model divide the emotion space into four quadrants: 1) the upper-left quadrant (active-unpleasant) includes affective states that arise from being exposed to instruction, such as confusion or frustration, 2) the upper-right quadrant (active-pleasant) includes curiosity and interest, 3) the lower-right quadrant (inactive-pleasant) includes contentment and satisfaction, and 4) the lower-left quadrant (inactive-unpleasant) includes hopelessness and boredom.

Fig. 1. Mapping of emotions from the discrete model to the 4-quadrant model (from Russell et al. [55]).

Most of the existing image databases (some of them are discussed in Sect. 4.1) classify the images of facial expressions into the seven above-mentioned discrete emotions (anger, disgust, fear, happiness, sadness, surprise, and neutral). We transform the datasets according to Russell’s 4-quadrants classification model by grouping the images by using the following mapping:

  • pleasant-active \(\Leftarrow \) happy, surprised,

  • unpleasant-active \(\Leftarrow \) angry, fear, disgust,

  • pleasant-inactive \(\Leftarrow \) neutral, and

  • unpleasant-inactive \(\Leftarrow \) sad.

This grouping then assigns a unique label denoted by L to each image as:

$$L \in \{\textit{active-pleasant},\ \textit{active-unpleasant},\ \textit{inactive-pleasant},\ \textit{inactive-unpleasant}\}.$$
(1)
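For concreteness, the grouping above can be expressed as a simple lookup table. The following sketch (Python; the identifier names are ours and purely illustrative, not part of any published implementation) shows how a discrete emotion label is mapped to one of the four quadrant labels of Eq. (1):

```python
# Remapping of the seven discrete emotion labels to the four
# quadrants of Russell's model, following the grouping above.
QUADRANT_OF_EMOTION = {
    "happy":    "active-pleasant",
    "surprise": "active-pleasant",
    "anger":    "active-unpleasant",
    "fear":     "active-unpleasant",
    "disgust":  "active-unpleasant",
    "neutral":  "inactive-pleasant",
    "sad":      "inactive-unpleasant",
}

def quadrant_label(emotion: str) -> str:
    """Return the Russell-quadrant label L for a discrete emotion label."""
    return QUADRANT_OF_EMOTION[emotion.lower()]
```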

4 Methods

4.1 Input Images and Databases

Various databases of categorized (labeled) facial expressions with detected faces and facial features exist. We used images from the Cohn-Kanade database (CK+) [27], Japanese Female Facial Expression (JAFFE) [38], The Multimedia Understanding Facial Expression Database (MUG) [1], Indian Spontaneous Expression Database (ISED) [23], Radboud Faces Database (RaFD) [32], Oulu-CASIA NIR&VIS facial expression database (OULU) [67], AffectNet [41], and The CMU multi-pose, illumination, and expression Face Database (CMU-PIE) [21].

Table 1. Databases used for training the deep neural network.

Table 1 shows the number of images and the subdivision of each dataset into categories (sad, happy, neutral, surprise, fear, anger, and disgust). Figure 2 shows the distributions of data per expression (top-left), per database (top-right), and the percentage distribution of each expression in the dataset (bottom-left). In total we had 853,624 images with 51% neutral faces, 25% happy, 3% sad, 8% disgust, 3% anger, 1% fear, and 9% surprise.

The lower-right image in Fig. 2 shows the percentage of images covered by each label L from Eq. (1). The totals were: active-pleasant 288,741 images (34%), active-unpleasant 102,393 images (12%), inactive-pleasant 434,841 (51%), and inactive-unpleasant 27,599 (3%). The re-mapped categories were used as input for training the deep neural network in Sect. 4.2.
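The quadrant totals follow directly from the per-expression shares reported above; a minimal sanity check using the rounded percentages (illustrative only):

```python
# Rounded per-expression shares of the 853,624 images (Sect. 4.1).
share = {"neutral": 51, "happy": 25, "sad": 3, "disgust": 8,
         "anger": 3, "fear": 1, "surprise": 9}

# Aggregate the shares into the Russell quadrants of Eq. (1).
quadrants = {
    "active-pleasant":     share["happy"] + share["surprise"],                 # 34
    "active-unpleasant":   share["anger"] + share["fear"] + share["disgust"],  # 12
    "inactive-pleasant":   share["neutral"],                                   # 51
    "inactive-unpleasant": share["sad"],                                       # 3
}
print(quadrants)
```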

It is important to note that the actual classification of each image into its category varies across databases, and some classifications are not even unique. Certain images were labeled by only one person while others were labeled by several people, which introduces additional uncertainty. Moreover, some databases are in color and some are not. While it would be ideal to have uniform coverage of the expressions in all databases, the databases are unbalanced in both the quality of images and the coverage of facial expressions (Fig. 2).

Also, certain expressions are easy to classify, but some may be classified as mixed and belong to multiple categories. In this case, we either removed the image from the experiments or put it into only one category. Interestingly, the most difficult expression to classify is neutral, because it does not carry any emotional charge and may be easily misinterpreted. This expression is also the most frequent in the dataset, which should, in theory, improve its detection if the network is correctly trained.

Fig. 2. Statistics of the datasets used: contribution per database and expression (top row), overall percentage of each expression (bottom left), and percentage of each contribution after remapping to the Russell quadrants of Eq. (1) (bottom right).

4.2 Deep Neural Network

We used the VGG_S deep neural network [36] implemented in Caffe. VGG_S is based on VGG Net, which has proven successful in the ImageNet Large Scale Visual Recognition Challenge [54], and VGG_S is effective in facial detection.

Figure 3 shows the VGG_S neural network architecture. The network is a series of five convolutional layers followed by three fully-connected layers and a softmax classifier that outputs probability values. We modified the output layer of the original network so that it generates the probability of the image having a label from Eq. (1). The training input is a set of pairs (image, L), where L is a label belonging to one of the four categories of Russell's diagram from Eq. (1). During the inference stage, the softmax layer outputs the probability of the input image having each label L.
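Our implementation is built in Caffe, but the general idea of keeping the convolutional features fixed and retraining only the fully-connected layers with a 4-way output can be sketched as follows (PyTorch, with VGG-16 standing in for VGG_S; all names and hyperparameters here are illustrative assumptions, not our actual configuration):

```python
import torch
import torch.nn as nn
from torchvision import models

NUM_QUADRANTS = 4  # labels from Eq. (1)

# Load a VGG-style network pre-trained on ImageNet (stand-in for VGG_S).
net = models.vgg16(weights=models.VGG16_Weights.IMAGENET1K_V1)

# Freeze the convolutional feature extractor.
for p in net.features.parameters():
    p.requires_grad = False

# Replace the last fully-connected layer with a 4-way classifier head.
in_features = net.classifier[-1].in_features
net.classifier[-1] = nn.Linear(in_features, NUM_QUADRANTS)

# Only the fully-connected layers are updated during training.
optimizer = torch.optim.SGD(net.classifier.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()  # softmax + log-loss over the 4 quadrants
```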

4.3 Training

We trained the network on images from the datasets discussed in Sect. 4.1. We used data amplification by applying Gaussian blur and variations of contrast, lighting, and subject position to the original images from each dataset, which makes our detector more robust in practical scenarios. The input images were preprocessed with the Haar-cascade filter provided by OpenCV, which crops the image so that it includes only the face without significant background. This, in effect, reduces the training time.
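The face-cropping step relies on a standard Haar cascade; a simplified sketch of this preprocessing using the stock OpenCV cascade (the augmentation parameters below are illustrative, not the exact values we used) could look like this:

```python
import cv2

# Stock frontal-face Haar cascade shipped with OpenCV.
cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

def crop_face(image_bgr):
    """Return the largest detected face region, or None if no face is found."""
    gray = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2GRAY)
    faces = cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    if len(faces) == 0:
        return None
    x, y, w, h = max(faces, key=lambda f: f[2] * f[3])  # largest bounding box
    return image_bgr[y:y + h, x:x + w]

def augment(face):
    """Simple data amplification: blur and brightness/contrast variations."""
    yield face
    yield cv2.GaussianBlur(face, (5, 5), 0)
    yield cv2.convertScaleAbs(face, alpha=1.2, beta=10)   # brighter / higher contrast
    yield cv2.convertScaleAbs(face, alpha=0.8, beta=-10)  # darker / lower contrast
```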

In order to have a balanced dataset, we would prefer a similar number of images for each label from the categories in Eq. (1). Therefore, the smallest category (inactive-unpleasant) dictated the size of the training set. We trained with 68,012 images and a batch size of 15 images for 80,000 iterations, and the average accuracy reached 0.63 after 10,000 epochs. The training time was about 70 min on a desktop computer equipped with an Intel Xeon W-2145 CPU running at 3.7 GHz, 32 GB of memory, and an NVidia RTX 2080 GPU.
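Balancing the training set by under-sampling the larger categories to the size of the smallest one can be done in a few lines; a hedged sketch (the structure of samples_by_label is our assumption):

```python
import random

def balance_by_undersampling(samples_by_label, seed=0):
    """samples_by_label: dict mapping each quadrant label to a list of image paths.
    Returns a shuffled list of (path, label) pairs with equal class counts."""
    rng = random.Random(seed)
    smallest = min(len(items) for items in samples_by_label.values())
    balanced = []
    for label, items in samples_by_label.items():
        picked = rng.sample(items, smallest)  # under-sample the larger classes
        balanced.extend((path, label) for path in picked)
    rng.shuffle(balanced)
    return balanced
```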

Fig. 3. Deep neural network architecture used in our framework.

5 Results

Testing: We divided the dataset randomly into two groups at a ratio of 80:20. We trained on 80% of the images, tested on the remaining 20%, and repeated the experiment three times with a different random split of the input data each time.
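The repeated random split can be reproduced, for example, with scikit-learn (illustrative; our pipeline did not necessarily use this library):

```python
from sklearn.model_selection import train_test_split

def three_random_splits(images, labels):
    """Yield three independent 80:20 train/test splits of the labeled images."""
    for seed in (0, 1, 2):
        yield train_test_split(images, labels, test_size=0.2, random_state=seed)
```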

Table 2. Average and standard deviation of the three runs of our testing.

Table 2 shows the average and standard deviation of the confusion matrices from the three runs of our experiments, and Fig. 4 shows the confusion matrices of the individual runs. The main diagonal indicates that pleasant-active was detected correctly in 70% of cases (standard deviation about 1.6%) and misdetected as pleasant-inactive in 22.3%, unpleasant-active in 2%, and unpleasant-inactive in 6.3% of cases. Similarly, pleasant-inactive was detected correctly in 87.3% of cases, while unpleasant-active (44%) and unpleasant-inactive (62%) were the least precise. This is an expected result, because the lower part of Russell's diagram (Fig. 1) includes passive expressions that are generally more difficult to detect. We achieved an overall accuracy of 66% (the mean of the four per-class accuracies).

Fig. 4. Normalized confusion matrices for the results of our experiment.

Deployment: The trained deep neural network was extracted and used in a real-time session that categorizes facial expressions into the four quadrants of Russell's diagram. We used a laptop computer equipped with an Intel Core i5-6300U CPU at 2.4 GHz and a web camera with a resolution of 1,920 \(\times \) 1,080. We used the Caffe environment on Windows 10 and OpenCV to monitor the input image from the camera and detect the face. Only the face, with the background cropped out, was sent to our trained network. The neural network classified the image and sent the result back to the application, which displayed it as a label on the screen. Figure 5 shows several examples of real-time detection of facial expressions using our system.
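A simplified version of this deployment loop, with the Caffe forward pass abstracted into a hypothetical classify_quadrant function, could look roughly like this (OpenCV handles capture, face detection, and display):

```python
import cv2

LABELS = ["active-pleasant", "active-unpleasant",
          "inactive-pleasant", "inactive-unpleasant"]

cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

def run(classify_quadrant):
    """classify_quadrant(face_image) -> probability list over LABELS
    (in our system this wraps a forward pass of the trained Caffe net)."""
    cap = cv2.VideoCapture(0)  # default webcam
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        faces = cascade.detectMultiScale(gray, 1.1, 5)
        for (x, y, w, h) in faces:
            probs = classify_quadrant(frame[y:y + h, x:x + w])
            label = LABELS[max(range(len(probs)), key=lambda i: probs[i])]
            cv2.rectangle(frame, (x, y), (x + w, y + h), (0, 255, 0), 2)
            cv2.putText(frame, label, (x, y - 10),
                        cv2.FONT_HERSHEY_SIMPLEX, 0.8, (0, 255, 0), 2)
        cv2.imshow("emotion", frame)
        if cv2.waitKey(1) & 0xFF == ord("q"):
            break
    cap.release()
    cv2.destroyAllWindows()
```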

Fig. 5. Examples of real-time expression detection using our system.

6 Conclusions

The purpose of this project is to develop a real-time facial emotion recognition algorithm that detects and classifies human emotions, with the objective of using it as a classifier in online learning. Because of this requirement, our detector reports the probability of an emotion belonging to one of the four quadrants of Russell's diagram.

Our future goal is to integrate the recognition algorithm into a system of affective pedagogical agents that will respond to the students’ detected emotions using different types of emotional intelligence. Our experiments show that the overall test accuracy is sufficient for a practical use and we hope that the entire system will be able to enhance learning.

There are several possible avenues for future work. While our preliminary results show satisfactory precision on our test data, it would be interesting to validate our system in a real-world scenario. We conducted a preliminary user study in which we asked 10 people to make certain facial expressions and we validated the detection. However, this approach did not provide satisfactory results, because we did not find a way to verify that the people were actually in the desired emotional state and that their expressions were genuine; some participants started to laugh each time the system detected an emotion they were not expecting. An emotional state is complicated: happy people cannot force themselves to make sad faces, and some of the expressions were difficult to achieve even for real actors. So while validation of our detector remains future work, another direction is increasing the precision of the detection by expanding the training data set and tuning the parameters of the deep neural network.