1 Introduction

Detecting people’s facial expressions has been an active research topic for more than 20 years. Facial expressions play an important role in many applications such as advertising, social interaction, and assistive technology. Research in facial expression recognition mainly includes three parts: datasets, algorithms, and real-world interaction applications. To improve the performance of facial expression recognition in real scenes, we propose a framework that integrates dataset construction, algorithm design, and interaction implementation.

Even though facial expression research should integrate all three parts, most previous work has focused on only one of these components. Consequently, algorithms or models designed for one or several datasets often do not work well on real scene problems. With this in mind, we propose a recursive updating approach. Starting from a deep learning model trained on facial images collected from the web, a facial expression game is designed to collect new and more balanced data; the newly collected data are then used to update the model. The framework is illustrated in Fig. 1. We start with a candid image facial expression dataset (CIFE) to build an initial facial expression model, which then serves as the game engine for a facial expression game. When users play the game, their facial images are classified into the different expressions by the model and automatically collected. This leads to a new and more balanced dataset, named GaMo (Game-based eMotion), which is used to update our facial expression model.

Fig. 1

Proposed recursive framework

1.1 Contributions of the paper

This paper extends our previous research on web-based data collection (CIFE) [1], CIFE data enhancement and AlexNet model fine-tuning [2], and game interface design for collecting new data (GaMo) [3]. In this paper, we make the following new contributions:

1. A recursive framework is proposed that recursively generates new data, automatically cleanses it, and updates the deep learning model for better performance in real scene facial expression recognition. In particular, we add a new recursive step that cleans the collected data through a self-evaluation process.

2. A deeper CNN model is obtained by fine-tuning the 19-layer CNN structure proposed by the Visual Geometry Group (VGG) [4]. The fine-tuned VGG network is comparable to state-of-the-art approaches on benchmark emotion-in-the-wild datasets, and it outperforms both our initial CNN model and the fine-tuned AlexNet model reported in our previous work.

3. We detail the design and evaluation of the game interface, which is controlled only by human facial expressions and automatically collects expression images while players play the game. The new GaMo dataset is also analyzed, leading to insights and new ideas for more balanced data collection with our recursive framework in the future.

4. Emotion recognition performance is compared and analyzed on the two facial expression datasets, CIFE and GaMo, and on their more balanced subsets, using the new VGG model. CIFE is a facial expression dataset of candid web images; GaMo is a Game-based eMotion dataset collected while users played our facial expression game.

1.2 Organization of the paper

The rest of the paper is organized as follows. After the introduction in Sect. 1, related work is reviewed in Sect. 2. Section 3 discusses how the CIFE dataset is collected, and the CNN-based facial expression models built on it are discussed in Sect. 4. Details of how we design and implement the game interface are described in Sect. 5. We evaluate our framework by comparing GaMo and CIFE in building facial expression recognition models in Sect. 6. Finally, we conclude our work in Sect. 7.

2 Related work

We already mentioned that current facial expression research mainly includes three major components: datasets, algorithms/models, and applications. Here, we would like to give a review of each of the three components.

2.1 Datasets

The many datasets that researchers have provided for recognizing expressions from images fall mainly into two categories. Datasets in the first category are captured in the laboratory; these include the CK+, MMI, and DISFA datasets [5,6,7,8]. Usually, subjects are invited into the laboratory and sit in lighting- and position-constrained environments. Good results can be achieved on these datasets, but performance in real-life scenarios is usually much poorer. Datasets in the second category are collected from existing media and social networks, such as Kaggle and EmotiW [1, 2, 9]. Using web search engines, one can easily obtain thousands of images, but the resulting datasets are usually not balanced. EmotiW is a video clip dataset for an expression recognition challenge, and its samples come from Hollywood movies in which actors show different expressions. For datasets collected from existing media, some expressions such as Happy or Sad are easy to obtain, but for others, such as Disgust or Fear, it is hard to find enough samples.

2.2 Algorithms and models

Although the existing datasets are generally not balanced, many interesting and promising approaches have been proposed for expression detection. Most existing facial expression recognition methods have focused on recognizing expressions of frontal faces, such as the images in CK+ [8]. Shan et al. [10] proposed an LBP-based feature extractor combined with an SVM for classification. In the method proposed by Xiao et al. [11], instead of training one model for all expressions, separate models are trained for each expression, which improves the overall performance. Wang et al. [12] modeled facial expressions as complex activities that consist of temporally overlapping sequences of face events; an interval temporal Bayesian network (ITBN) was then used to capture the complex temporal information. Karan et al. [13] proposed an HMM-based approach that uses consecutive frame information to achieve better expression recognition accuracy from video.

In the last few years, deep learning methods have been successfully used for face recognition and verification [14, 15]. Deep learning approaches are also used in many expression detection applications. Liu et al. [16] proposed a boosted deep belief network to perform feature learning, feature selection, and classifier construction for expression recognition. Different DBN models for unsupervised feature learning in audio-visual expression recognition were compared in the work of Kim et al. [17]. Our early work [1] used CNNs on images collected from the web; to demonstrate the effectiveness of CNNs, we compared CNN-based facial expression recognition on CK+ to state-of-the-art methods. Multimodal deep learning approaches have also been applied to facial expression recognition tasks. An example is Jung et al.’s work [18], in which facial landmark-based shape information and image-based appearance information are learned through a combined CNN network. The results show that deep learning-based multimodal features perform better than individual modalities or traditional learning approaches. Automatically learned features have also been used for multimodal facial emotion recognition on video clips [2].

2.3 Interactive applications

In the generation of the ImageNet dataset [19], Amazon Mechanical Turk (AMT) was used to label all the training images: workers hired online label the dataset remotely. ImageNet is a large-scale effort that aims to label 50 million images for object classification, and without the help of online workers the labeling would not be feasible. This inspired us to involve people in the data collection process through an online framework, preferably using games. There have been some efforts to use games to attract people to perform image classification work. Luis et al. [20] designed an interactive system that attracted people to label images. Mourao et al. [21] developed a facial engagement algorithm as the controller of their Novoexpressions game, obtained a player engagement dataset, and analyzed the relationship between players’ facial engagement and game scores; however, their goal was not to collect data. Expression games have also been used to entertain children with autism spectrum disorder (ASD) and to help them perform facial expressions by mirroring their expressions onto cartoon characters [22]. Our online framework not only makes use of crowdsourcing through games but also has a much lower cost than AMT. And since the numbers of the various facial expressions can be controlled by the game design, the dataset can be kept balanced.

3 CIFE: a dataset with web images

Since we would like to develop facial expression approaches that can be used in real-world scenes, we need to train and test the models on non-posed, or candid, images. We note that most facial expression images on the web are randomly posed and most of the expressions are natural. Therefore, we use web crawling techniques to acquire candid expression images from the web and create our candid image facial expression (CIFE) dataset.

3.1 CIFE data collection

As we have mentioned, we define seven types of expressions: Happy, Anger, Disgust, Sad, Surprise, Fear, and Neutral. Using keywords related to each of the seven expressions in addition to the name of the expression itself (e.g., joy, cheer, smile for Happiness), we collected a large number of images belonging to each of the seven expressions. We used most of the major image search engines, including Google, Baidu, and Flickr. In our initial CIFE dataset [1], the numbers of samples for the different expressions were: Anger (1785), Disgust (266), Fear (781), Happiness (3636), Neutral (644), Sadness (2485), and Surprise (997). The images are from the web, and most of them are not posed. However, the numbers of samples in the different classes were highly unbalanced. Therefore, we added images to the classes with fewer samples (for example, Disgust and Fear) to balance the class sizes [2]. In the end, we obtained 14,756 images for the 7 classes (after some manual post-filtering by humans). The total number for each facial expression in our revised CIFE dataset is listed in Table 1. This is the dataset we use in this paper. In [2], the CNN model was one of the modules for video expression recognition, but here we focus on facial expression recognition in single images. Figure 2 shows a few typical examples of faces with various poses.

Table 1 Sample numbers of the seven facial expressions in CIFE (Ang, Dis, Fea, Hap, Neu, Sad, Sur represent angry, disgust, fear, happy, neutral, sad, surprise, respectively)
Fig. 2

Images from CIFE

3.2 CIFE data augmentation

Deep learning with CNNs requires a very large number of training images in order to train the large number of network parameters and obtain good classification results. Even though our CIFE dataset has 14,756 images for 7 classes, it is still insufficient for training a deep CNN model. Therefore, before training the CNN model, we augment the dataset with transformations that generate small changes in appearance and pose. We applied five image appearance filters and six affine transform matrices. The five filters are disk, average, Gaussian, unsharp, and motion filters, and the six affine transforms are formed by adding slight geometric perturbations to the identity matrix, including a horizontal mirror. Figure 3 shows an example of the facial image augmentation. With this augmentation, each original image in the dataset generates 30 (\(=\,5\times 6\)) samples; the number of possible training samples therefore increases from 10,330 to 309,900, which is sufficient for training the deep learning model.
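
For concreteness, the sketch below illustrates this \(5\times 6\) augmentation in Python with OpenCV. The specific kernel sizes and affine perturbations are our own illustrative choices, not the exact parameters used to build the augmented CIFE set.

```python
# Illustrative augmentation: 5 appearance filters x 6 affine transforms = 30
# variants per face image (filter/transform parameters are assumptions).
import cv2
import numpy as np

def appearance_filters(img):
    """Five appearance filters: disk, average, Gaussian, unsharp, motion."""
    disk = np.zeros((5, 5), np.float32)
    cv2.circle(disk, (2, 2), 2, 1, -1)          # circular averaging kernel
    disk /= disk.sum()
    motion = np.zeros((5, 5), np.float32)
    motion[2, :] = 1.0 / 5                      # horizontal motion-blur kernel
    gauss = cv2.GaussianBlur(img, (5, 5), 0)
    return [
        cv2.filter2D(img, -1, disk),            # disk filter
        cv2.blur(img, (3, 3)),                  # average filter
        gauss,                                  # Gaussian filter
        cv2.addWeighted(img, 1.5, gauss, -0.5, 0),  # unsharp masking
        cv2.filter2D(img, -1, motion),          # motion filter
    ]

def affine_transforms(img):
    """Six slight perturbations of the identity affine, incl. a horizontal mirror."""
    h, w = img.shape[:2]
    mats = [
        np.float32([[1, 0, 0], [0, 1, 0]]),          # identity
        np.float32([[-1, 0, w - 1], [0, 1, 0]]),     # horizontal mirror
        np.float32([[1, 0.05, 0], [0, 1, 0]]),       # small shear
        np.float32([[1, 0, 3], [0, 1, 3]]),          # small translation
        cv2.getRotationMatrix2D((w / 2, h / 2), 5, 1.0),   # +5 degree rotation
        cv2.getRotationMatrix2D((w / 2, h / 2), -5, 1.0),  # -5 degree rotation
    ]
    return [cv2.warpAffine(img, M, (w, h)) for M in mats]

def augment(img):
    """Return the 30 (= 5 x 6) augmented versions of one face image."""
    out = []
    for filtered in appearance_filters(img):
        out.extend(affine_transforms(filtered))
    return out
```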

Fig. 3

CIFE data augmentation

4 Fine-tuned CNN models

After data augmentation, we now have 309,900 training images, and the model will be tested on 4424 original testing images (30% of 14,756). Our goal is to classify all the images into 7 facial expression groups. To achieve this goal, we design a CNN structure. In the following, we describe our initial CNN model and the fine-tuning of two CNN structures, AlexNet [23] and VGG [4], and report a comparison of their performance.

4.1 The initial CNN model

Our initial CNN model structure includes one input layer (the original image), three convolutional layers, and an output layer. This structure was arrived at by trial and error over many experimental tests. The input color image size is \(64 \times 64\), and the number of outputs is 7. We set the convolutional filter size to \(3 \times 3\) and then varied the number of layers and the number of filters in each layer. After many rounds of tests, we settled on the most suitable “simple” structure: 3 convolutional layers with 32, 32, and 64 filters, respectively. After each of the 3 convolutional layers, we add a 2:1 pooling layer to make the training data less redundant. The input \(64 \times 64\) RGB image is then classified as one of the 7 labeled classes. With this structure, the number of parameters is around 184,000; compared to the number of training images (309,900), this setting is appropriate. Finally, we achieved a 65.2% accuracy on our test data. Even though the accuracy was not very high, the CNN-based facial expression recognition showed a clear performance advantage over traditional approaches such as support vector machines (SVMs), which achieved 62.3% with the LBP feature and 59.7% with the SIFT feature. Details of the results with the traditional approaches can be found in our previous work [1]. We note here that we reported a much higher classification accuracy (81.5%) in [1] using a similar CNN model. That was because of the highly unbalanced numbers of samples in the initial CIFE dataset used in [1]: the recognition rates for the Disgust and Fear classes were very low for both the original and the revised datasets, and hence adding new samples to those classes decreased the overall recognition statistics. We suspect that the reason for the low performance is that the three-layer structure is unable to learn the features deeply enough.
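
A minimal sketch of a network with this stated structure (a \(64 \times 64\) RGB input, three \(3 \times 3\) convolutional layers with 32, 32, and 64 filters, 2:1 pooling after each, and 7 outputs) is given below in PyTorch. Details such as padding, activation functions, and the final classification layer are our assumptions, so the exact parameter count will differ from the number quoted above.

```python
# Sketch of the initial three-convolutional-layer expression CNN (assumed
# details: ReLU activations, same-padding convolutions, a single linear output).
import torch
import torch.nn as nn

class InitialExpressionCNN(nn.Module):
    def __init__(self, num_classes=7):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),                      # 64x64 -> 32x32
            nn.Conv2d(32, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),                      # 32x32 -> 16x16
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),                      # 16x16 -> 8x8
        )
        self.classifier = nn.Linear(64 * 8 * 8, num_classes)

    def forward(self, x):
        x = self.features(x)
        return self.classifier(torch.flatten(x, 1))

# A 64x64 RGB face crop maps to 7 expression scores.
scores = InitialExpressionCNN()(torch.randn(1, 3, 64, 64))
```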

Fig. 4

Fine-tuning AlexNet structure for facial expression recognition

4.2 Fine-tuning AlexNet

To further improve the performance of facial expression recognition using CNNs, we noted that learned general classification models can be reused for specific classification problems [24]. Since some existing models are trained deeply on large-scale datasets, the image features they learn can serve as better features for other classification tasks. We were therefore curious to find out whether this could help improve facial expression recognition. To try out this idea, we experimented with fine-tuning the AlexNet [23] and VGG [4] structures.

In the AlexNet structure, there are 1 input layer, 5 convolutional layers, and 3 fully connected layers, leading to 60 million parameters in total. Our first thought was that training AlexNet from scratch on our CIFE dataset would result in better classification accuracy. The only problem was the need for a far larger number of images, as training on ImageNet requires millions of images.

Therefore, we instead propose a CNN fine-tuning method to train a deeper model based on AlexNet. The rationale is that although our task is different from the ImageNet task, which focuses on object classification, similar low-level filters can be used in expression recognition. Based on this hypothesis, we can take the pre-trained AlexNet and use our relatively “small” dataset to update and fine-tune part of its parameters, adapting it to expression recognition.

As shown in Fig. 4, the parameters of convolutional layers 1 through 4 are not changed. Our new CIFE dataset is used to update the parameters of convolutional layer 5 and the first fully connected layer, without changing their structures. In the original AlexNet, the numbers of units in the second and third fully connected layers are 4096 and 1000 (classes), respectively. Since the number of classes in our dataset is just 7, we needed to change the structure of these two layers: we reduced the number of neurons in the penultimate layer to 2048 and in the third fully connected layer to 7. The classification accuracy of this model is 73.5% on the revised CIFE dataset, which shows that fine-tuning leads to much better performance than our first attempt with the three-layer CNN structure, an 8.3% improvement. This was the model we used when collecting the GaMo emotion dataset, as it was the only model available at the time. With this decent accuracy, a system using such a facial expression engine has a good chance of producing the right prediction during human-computer facial interaction, encouraging users to play the interaction game described in the next section.
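
The sketch below illustrates this fine-tuning setup, using a torchvision AlexNet as a stand-in for the original model (an assumption; the work described here predates this API). The frozen convolutional layers and the 2048- and 7-unit fully connected layers follow the description above.

```python
# Hedged sketch of the AlexNet fine-tuning: freeze conv1-conv4, keep conv5 and
# the fully connected layers trainable, shrink fc7 to 2048 units, output 7 classes.
import torch.nn as nn
from torchvision import models

model = models.alexnet(weights=models.AlexNet_Weights.IMAGENET1K_V1)

# Freeze conv1-conv4; conv5 is the last Conv2d in `features` and stays trainable.
conv_layers = [m for m in model.features if isinstance(m, nn.Conv2d)]
for conv in conv_layers[:-1]:
    for p in conv.parameters():
        p.requires_grad = False

# fc6 keeps its structure but is updated during training; fc7 is reduced to
# 2048 units and the 1000-way ImageNet output becomes a 7-way expression output.
model.classifier[4] = nn.Linear(4096, 2048)
model.classifier[6] = nn.Linear(2048, 7)

# Only the unfrozen parameters are handed to the optimizer.
trainable_params = [p for p in model.parameters() if p.requires_grad]
```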

4.3 Fine-tuning VGG

Compared to AlexNet, VGG is a much deeper network. After the GaMo data collection, we also investigated whether using a fine-tuned VGG model can improve facial expression recognition performance. We first tested the fine-tuned VGG model on the revised CIFE dataset to compare its results with those of the fine-tuned AlexNet model. VGG has 19 learning layers in total, with 138 million parameters. Its structure is similar to AlexNet in that both have convolutional parts and fully connected parts, but VGG replaces each convolutional layer of AlexNet with 2–4 convolutional layers. Deeper networks lead to better representations of the input images: in the ImageNet challenge, VGG yielded a 6.8% top-5 error compared to AlexNet’s 16.4%. We applied a similar fine-tuning approach to the VGG net as we did to AlexNet. By fine-tuning the existing VGG model with the revised CIFE dataset, we finally achieved a 76.3% accuracy, a 2.8% improvement over the fine-tuned AlexNet and 11.1% over our initial CNN model. Therefore, in Sect. 6, we will show results using the fine-tuned VGG structure for emotion recognition with the CIFE and GaMo datasets and their subsets.

By fine-tuning the ImageNet models, we obtained improved facial expression recognition results. This indicates that models for general image classification can share convolutional filters with special-purpose image classification tasks such as facial expression recognition. In this way, fine-tuning leads to more robust models for facial expression prediction on non-posed images.

Note that this model has been used to provide reliable CNN-based features in our participation in the EmotiW 2015 challenge on benchmark emotion-in-the-wild datasets. This shows that the fine-tuning approach is comparable with the state-of-the-art methods in solving the emotion recognition in the wild problem [2]. In our previous work, we showed that the fine-tuned CNN feature is the most effective of the three multimodal features we used (an LBP-TOP-based video feature, an openEAR energy/spectral-based audio feature, and the CNN-based deep image feature), enabling us to achieve rank 5 among 73 teams.

5 GaMo: game-based interface for balanced expression data

We have already shown that deep learning can lead to high-accuracy facial expression recognition. While the candid images are mostly randomly posed, they still differ from images captured during real interaction. More “real” data would be facial expression images of people collected without any constraints. If people can show this kind of “real” facial expression, we can use our facial expression model to select the corresponding images and construct a real scene emotion dataset. The selected images may be useful for building a real scene expression model. For this purpose, we decided to design a game interface that invites people to show their facial expressions, willingly, while playing a game.

5.1 Game design

Since we would like game users to show real facial expressions yet remain engaged throughout the game, we had to design the expression game to be as straightforward and interesting as possible. After researching popular web games, we found that the tower defense genre fits our task best. The logic of a tower defense game is very simple: the player needs to build a defense system against intruders. In our application, the user acts as a defender against an “expression target.” An example of the basic game scene is shown in Fig. 5. The live video of the user is shown in the top-left corner of the screen, so the user can always see his or her expression. A randomly picked expression target (a facial expression icon, as shown in Fig. 6) enters the screen as a bomb dropped from above, and the user has to protect the village by making the bomb disappear before it reaches the ground. Sound effects are also added to make the user more engaged. The bomb disappears if the user makes a facial expression that correctly matches the displayed target, as judged by the CNN-based expression detector using the fine-tuned AlexNet model (the model available to us when we collected the new data), and the score shown at the top-right of the screen increases.

Fig. 5

Design of the facial expression game scene

Fig. 6

Facial icons representing the seven basic expressions

5.2 General version and customized version

Here, we would like to provide some technical details of our general game design. The expression game web interface accesses the camera on the user’s machine and displays the video in the top-left corner of the screen. The game interface then captures images of the user’s face and sends them to our server. The CNN model we trained by fine-tuning AlexNet analyzes each image, generates a probability vector over the seven expressions, and sends it back to the game webpage. Face images are processed at a server because of the high computational requirements of the CNN model. After the probability vector of the seven expressions is sent back to the game interface, the interface compares this feedback with the expression target ID displayed on the screen as a bomb and tells the server to save the image if the two match. Since the target facial expression that the user needs to make is defined by our system, we not only know the label and have high confidence that it is correct, but can also make the dataset more balanced according to our needs. Our game can be accessed via a test page (see Footnote 1).

The frame rates of typical webcams are usually 20–60 Hz. We do not need such high frame rates of image capture for two main reasons: 1) from the computational resource point of view, the server would have a huge workload if we ran expression recognition on every single frame, since the CNN computation is time-consuming with hundreds of millions of parameters; 2) there is no need to know the user’s expression frame by frame, and giving the user some time to prepare an expression may actually result in better image quality and, as a consequence, a much better dataset. For these two reasons, we design the game so that it sends only one image per second to the server. It takes only around 200 ms for the server to generate the result for a single image, which makes the game run very smoothly. We also set the number of initial game lives to five and generate the expression targets randomly, with equal probabilities for all seven expressions, which theoretically results in a balanced dataset.

The game is implemented using Javascript and HTML. The backend is hosted using Ruby on Rails. During the game, expression icons drop from the top of the screen, and the user’s face is shown in the top-left corner of the screen so the user can check her/his facial expression. If the user is able to match the expression before the icon hits the ground, she/he receives 1 point. The score changes as the user gains or loses points. When all the game lives are used, a “Game Over” sign is shown, together with the total score gained by the user and a “Replay” button.

After making the game available to a small group and collecting data from several users who tried it, we realized that the collected dataset was not ideal, for two main reasons: 1) sometimes it is hard for users to correctly imitate the exact expressions by just looking at the icons; 2) our expression detector is sometimes unable to correctly determine whether the subject is making the right face. This makes it hard for players to achieve high scores, and as a result, the collected data become imbalanced.

We were able to provide two solutions to these problems. One is to change expression recognition into an expression verification task, which makes the classification problem much easier. Since we know the “ground truth,” or target label, for the icon being displayed in the game, we only need to check whether the probability of this specific expression reaches a predefined threshold. Each expression needs its own threshold, since some expressions are harder to mimic and vary more among different users. This helps users achieve higher scores and also includes a broad range of correctly labeled facial images for each expression in the dataset.
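
A minimal sketch of this verification check is shown below. The threshold values are hypothetical placeholders; the actual per-expression thresholds were tuned for the deployed game.

```python
# Verification instead of recognition: only the probability of the *target*
# expression is tested against its own threshold (threshold values are
# illustrative assumptions, not the deployed settings).
EXPRESSIONS = ["Angry", "Disgust", "Fear", "Happy", "Neutral", "Sad", "Surprise"]

THRESHOLDS = {"Angry": 0.40, "Disgust": 0.25, "Fear": 0.25, "Happy": 0.50,
              "Neutral": 0.40, "Sad": 0.35, "Surprise": 0.40}

def matches_target(prob_vector, target):
    """prob_vector: 7 probabilities in the order of EXPRESSIONS."""
    return prob_vector[EXPRESSIONS.index(target)] >= THRESHOLDS[target]
```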

Another solution is to create an individual model for each player based on the CNN-extracted features. The user is then compared with her/his individual expression templates instead of the general CNN model. The DeepFace work [14] has shown that CNNs are not only able to perform image classification directly, but can also extract robust features from images [25]. Thus, we extract CNN features for each individual user and save them as templates for that specific user. This makes the game customized for each player, so the user can gain higher scores and is encouraged to play more.

Fig. 7

Registration page for building customized facial templates: initial page. The first image is the live webcam view, and the remaining seven are placeholders for the templates of the seven expressions

As a part of the solutions proposed above, we designed a user registration page as shown in Fig. 7. The registration page is divided into 8 subareas. The first subarea shows the current video stream, and the other seven display the seven registered expression templates. To save a template image, the user clicks on the corresponding subarea while imitating the correct facial expression. This process can be repeated several times until the user is happy with the saved image. Once all seven expressions are registered, the user can click the “Send All” button to send the expression templates to the server, where the system detects the face area in the images and uses the CNN model to extract expression features for the user. If a face cannot be detected, an error message is sent back to the user, who is then asked to recapture the image for the specific expression that caused the error, as shown in Fig. 8. The registration page can be accessed via a second test page (see Footnote 2).

Fig. 8

Faces of the seven expressions after the user clicks “Send All.” In this example, two of the images are not qualified, so the user has to recapture these two images

When all the template features are saved on the server, the user is directed to the customized game scene. While the game is being played, the server extracts features for each incoming image and compares them to the saved emotion templates. We use the L2 distance to select the nearest template and send its expression back as the detected emotion. Since the features are robust and the user is always compared with her/his own model, the user can potentially achieve a higher score. We call this version of the game the “customized version,” as opposed to the previous “general version.”
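
A minimal sketch of this matching step is shown below; `extract_feature` stands in for the CNN feature extractor, since the text does not specify which layer’s activations are used as the template feature.

```python
# Customized-game matching: nearest registered template by L2 distance.
import numpy as np

def nearest_expression(face_image, templates, extract_feature):
    """templates: dict mapping expression name -> saved template feature vector."""
    feat = extract_feature(face_image)
    distances = {name: np.linalg.norm(feat - tmpl)
                 for name, tmpl in templates.items()}
    return min(distances, key=distances.get)   # expression with the smallest distance
```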

5.3 GaMo: the new dataset

Within one month of releasing the two test versions of the game to the college students of our department, more than a hundred users played the general version and 74 users tried the customized version. All users from whom we collected data signed the consent form under our IRB approval. We obtained 15,455 images in total during this time period and generated the GaMo (Game-based eMotion) dataset. Compared to some deep learning datasets, the size is still not large, but our game can run at any time, so we can obtain a much bigger dataset as the game reaches more people. The dataset is available by contacting the authors.

One concern about our dataset might be the use of a trained model to obtain more expression data: Will this recognition/verification model only create expression data that are similar to our existing samples and make the dataset less diverse? Will the use of a pre-trained model affect the quality of the data labeling? We arrived at two observations:

1. The game interaction and incentive increase the users’ tendency to express accurate emotions for data collection. We found that when playing our game, users want to achieve higher scores than the others, so they try to show the correct emotion as well as they can. This mechanism automatically helps us avoid wrongly labeled data without manual labeling, since the collected data have been double-checked by both our model and the users themselves.

2. In the customized game mode, users are compared with their own emotion templates. We checked all the templates of all the users, and they were all correct. Different people have different emotion patterns, so by collecting templates and in-game emotion images from different people, we obtain diverse emotion images.

We note here that no manual cleanup of the images and labels was done for the initial GaMo dataset; all the images are used in our first evaluation in Sect. 6. By randomly checking the dataset, we have not found any labels that are far off from the real expressions. The distribution of the dataset is shown in Table 2. Compared to the CIFE dataset, GaMo is more balanced, which hopefully will result in a much more reliable facial expression detector. In conclusion, the data collection is automatic, of high quality, and more balanced. Later, in Sect. 6, we will also discuss whether an automatic data quality evaluation and cleanup can lead to better performance.

6 Evaluation

Based on our proposed framework, we applied deep learning and fine-tuning to the web-collected CIFE images to obtain a facial expression recognition model, the fine-tuned AlexNet, and then used this model to host the facial expression game and collect the new GaMo data. We hope the GaMo dataset can be used to recursively fine-tune our CNN models. To prove this, we first need to show that the GaMo data can actually improve real scene facial expression recognition.

To determine the usefulness of the GaMo dataset, we performed the following experiments. First, we trained a new CNN model on GaMo by fine-tuning the previous AlexNet model that had been used as our game engine and had been trained on CIFE. To compare GaMo with CIFE, we ran both a self-evaluation and a cross-evaluation with the two CNN models: the GaMo CNN model and the CIFE CNN model.

In our earlier work [3], we showed that the model trained with the more balanced GaMo dataset produced more robust results, especially for those classes that were underrepresented in the CIFE dataset. Further, the GaMo model can be applied to the CIFE dataset with decent performance, but not the other way around.

To see whether these observations hold for more complicated and better-performing models, we then used the fine-tuned, deeper VGG models, the best models among the ones we have developed and used. The VGG models were trained on the CIFE and GaMo datasets, respectively, and we performed the same experiments as in our earlier work.

We noted that, because the game engine was based on the unbalanced CIFE dataset, we still had fewer samples in some categories; thus, the new GaMo dataset was not completely balanced. Therefore, we also ran an experiment to see whether using more balanced subsets of both CIFE and GaMo leads to large changes in the recognition results. Finally, we designed a small user study to find out whether the new dataset can actually improve the game engine and the game experience; for this purpose, the users played the general version of the game hosted by the two new CNN models.

Table 2 Comparison of expression sample numbers in CIFE and GaMo

6.1 Comparison of CIFE and GaMo

Table 2 shows the statistics of the GaMo and CIFE datasets. For the CIFE dataset, as mentioned before, the images were collected by searching web engines using keywords, and we went through the dataset to remove all images that do not show meaningful facial expressions. To some extent, the numbers of samples in the seven emotion categories reflect the distribution of facial images online. We can clearly see the imbalance in the sample numbers across the facial expression categories, and it is hard to balance the dataset: if we used the minimal number, 975 for Disgust, the number of samples would be too small. We can also see that the sample numbers of the facial expressions in GaMo are more balanced. Although some expression counts, such as Fear and Disgust, are still smaller than the others, it is easy for us to make them more balanced. When we designed the game, we made all expressions show up with the same probability, but because the game engine trained with the unbalanced CIFE dataset predicts different expressions with different accuracy, the GaMo dataset is not completely balanced. The good news is that, since we already know the engine’s accuracy in predicting each facial expression, in future data collection with our recursive framework we can change the show-up probabilities of the facial expression targets to control the final data distribution. This is our ongoing work.
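
As an illustration of this ongoing idea, one simple weighting rule (our own illustration, not a scheme evaluated in this paper) is to make each expression target’s show-up probability inversely proportional to the number of samples collected so far:

```python
# Illustrative target-selection rule: under-collected expressions are shown
# more often, so the collected data drift toward balance over time.
import random

def target_probabilities(counts):
    """counts: dict mapping expression name -> number of samples collected so far."""
    weights = {e: 1.0 / max(n, 1) for e, n in counts.items()}
    total = sum(weights.values())
    return {e: w / total for e, w in weights.items()}

def pick_target(counts):
    probs = target_probabilities(counts)
    expressions, weights = zip(*probs.items())
    return random.choices(expressions, weights=weights, k=1)[0]
```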

6.2 Comparison of CNN models with CIFE and GaMo

To compare the two models, we test the overall accuracy in recognizing all seven expressions (the average accuracy in Table 3) as well as the accuracy for each individual expression within its own sub-dataset (Angry, Disgust, Fear, Happy, Neutral, Sad, and Surprise), as listed in Table 3 and in Tables 4, 5, 6, and 7. This gives us a good sense of the usefulness of the GaMo dataset. Furthermore, to compare the performance of the two CNN models based on the VGG structure, we perform a cross-evaluation: the model trained on CIFE is tested on images from GaMo and vice versa.

Table 3 Average accuracies of self- and cross-evaluation of CIFE and GaMo models
Table 4 Self-evaluation confusion matrix of CIFE
Table 5 Self-evaluation confusion matrix of GaMo
Table 6 Cross-evaluation confusion matrix of CIFE
Table 7 Cross-evaluation confusion matrix of GaMo
Fig. 9

Comparison of individual template images of two users from GaMo

The confusion matrices of these four experiments are listed in Tables 4, 5, 6, and 7. Looking at the self-evaluation results, we can see that the model trained on GaMo has a much more balanced classification performance across all seven expressions. Even though the average performance of the CNN model on CIFE is slightly higher than that on GaMo, the numbers are misleading: the higher average accuracy of the CIFE-trained CNN model is due to the much larger numbers of samples in the Happy and Sad classes, which also have much higher accuracy than the others. In comparison, the accuracy in recognizing Disgust and Fear is much higher with GaMo than with CIFE.

The results of the cross-dataset tests are even more interesting. The model trained on CIFE has very low performance when tested on the GaMo dataset; the confusion matrix shows that many images are classified as Neutral. We have observed that the images in the two datasets differ significantly. Our observations indicate that the expressions in the CIFE dataset tend to be more exaggerated and thus easier to identify, as shown in Fig. 6, while the GaMo dataset is closer to real life, as it is obtained from ordinary users who show a wide variety of ways of imitating facial expressions while playing the game. As an example, Fig. 9 shows two users who played the game: the first player shows more explicit expressions, while the second player’s expressions tend to be more implicit. This makes it hard for the model trained on CIFE to classify the images from GaMo. The CIFE model almost completely fails in recognizing Angry, Disgust, and Happy in GaMo. We believe the reason is that these three expressions in the CIFE dataset, whether they have fewer or more samples, are much more exaggerated than those in the GaMo dataset. On the other hand, when the model trained on GaMo is cross-tested on CIFE, the performance is surprisingly good, even though it cannot beat the self-test performance. The reason is that the model is further fine-tuned on a larger, more inclusive, and more balanced dataset. The GaMo model does reasonably well on all three expressions that the CIFE model fails on. In addition, if subtle expressions (as in the GaMo dataset) can be recognized, the exaggerated ones (as in CIFE) are not difficult to detect. As an example, the Happy faces in CIFE are much more easily recognized (with a 91% accuracy) using the GaMo model.

Here, we also note that the performance using the VGG structure is much better than that using AlexNet; interested readers can compare the results in Table 3 with those in our previous work [3]. Nevertheless, the comparative observations between the CIFE and GaMo datasets are consistent across the AlexNet and VGG structures.

6.3 Comparison with more “balanced” sub-datasets

In most facial expression datasets, the sample numbers for the different expressions are imbalanced. The training process favors the classes with more samples in order to achieve higher overall accuracy, but this weakens the model’s ability to recognize the facial expressions with fewer samples. In practice, the interactive experience suffers if the facial expression model is unable to recognize some less frequent facial expressions. So, if we want to build a model that recognizes all facial expressions with equal accuracy, the best way is to create a training dataset with similar numbers of samples per class. In this case, for the CIFE dataset, we will only have \(4781=683\times 7\) images in total (683 is the number of Disgust samples in the CIFE training set, 70% of the total), which may not be sufficient for training a well-performing deep learning model. We use this balanced subset of the CIFE dataset and run the same deep learning training as on the full CIFE: we augment the 683 images of each facial expression and then fine-tune the VGG model. The final prediction result on CIFE is shown in Table 8. As predicted, comparing Table 8 with Table 4, the overall performance is lower than when using the full CIFE dataset, except for the least frequent expressions, Disgust and Fear; the average recognition rate drops by 9%. For the GaMo dataset, in contrast, we still have over 10K images in the balanced subset, and the training set has over 7770 images (\(1586 \times 70\% \times 7\)). The performance using the balanced GaMo subset is shown in Table 9. Compared to the result in Table 5, the overall performance improves by 6.4%. In fact, the recognition rates for all categories increase; those with lower numbers of samples in the original GaMo dataset (Fear, Disgust, and Sad) increase significantly, by more than 10%.
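
A minimal sketch of the balanced-subset construction is given below, under the assumption that a dataset split is available as a list of (image path, label) pairs.

```python
# Build a class-balanced training subset by subsampling every class down to
# the size of the smallest class (683 Disgust training images for CIFE).
import random
from collections import defaultdict

def balanced_subset(samples, seed=0):
    """samples: list of (image_path, label) pairs from the training split."""
    by_label = defaultdict(list)
    for path, label in samples:
        by_label[label].append(path)
    n_min = min(len(paths) for paths in by_label.values())
    rng = random.Random(seed)
    subset = []
    for label, paths in by_label.items():
        subset += [(p, label) for p in rng.sample(paths, n_min)]
    return subset
```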

Comparing the two approaches to collecting facial expression images, searching the web and harvesting from game users, we make two important notes. First, it is almost impossible to get more facial images for CIFE, as we have already searched most of the image search engines to obtain high-quality images; for GaMo, on the other hand, as long as our game is running, we can obtain more, and more balanced, expression data. Second, even for the current version of GaMo, when we retrained the deep learning model with the balanced subset of the GaMo dataset and tested on the same original GaMo testing data, the performance increased significantly.

Table 8 Self-evaluation confusion matrix of sub-balanced CIFE
Table 9 Self-evaluation confusion matrix of sub-balanced GaMo

In the balanced CIFE subset, because there are fewer training data than in the full CIFE, the performance for Angry and Happy dropped dramatically, and the accuracy for Disgust and Fear did not improve much. For the GaMo dataset, in contrast, since each facial expression still has more than 1586 images, the balanced subset of GaMo remains a good training dataset. The balanced GaMo produced a better facial expression model than the full GaMo; for the less represented facial expressions such as Disgust, Fear, and Sad, the improvement is large. The reason is that with equal consideration of all facial expressions during training, the deep features of all expressions can be learned correctly, and if the test data are well represented by the training data, we can achieve very good results. So, with our framework, we have a better chance of obtaining a robust expression predictor for all facial expression categories.

6.4 Dataset cleaning via sample evaluation

The CIFE dataset was generated by using image search engines with keywords related to the seven facial expression classes. The GaMo dataset, on the other hand, was collected by using a facial expression recognition engine trained on the CIFE dataset to check users’ facial expression matches. Although we have manually proof-checked the quality of CIFE and have found that most human subjects (game users) in our emotion game tried to show correct emotions to achieve higher scores, we cannot guarantee that all the images are correctly labeled. To reduce the impact of weakly labeled images, we propose yet another “recursive” step to cleanse the datasets.

In our data cleaning step, we use the corresponding trained models to predict the emotion scores for all the images in the CIFE and GaMo datasets, respectively. For each image, there are thus 7 emotion scores. Since we know the emotion label of each image, only one of the 7 scores is useful for that image. For example, a Happy image yields 7 scores describing the probabilities of the image being classified as each of the 7 emotion types, but we only consider the value obtained for the Happy emotion. Using these emotion scores, we can sort all the images in each of the seven emotion classes. For example, to sort all the Sad images, we first find all the images labeled Sad and then sort them by their predicted Sad scores. Within each emotion group, a higher score means that the label assigned to the image is more reliable. To verify whether this sorting process is meaningful, we randomly picked images from the top 5% (“well” labeled) and the bottom 5% (“badly” labeled) images; examples are shown in Figs. 10 and 11, respectively. From these two figures, we can see the quality of the original labeling: the “well” labeled samples tend to correspond better to their emotion labels, while for the “badly” labeled samples, the image content is less related to the labels.

Fig. 10

Comparison of the “well” labeled (correct) and “badly” labeled (wrong) samples in CIFE

Fig. 11

Comparison of the “well” labeled (correct) and “badly” labeled (wrong) samples in GaMo

Since the emotion scores show how correctly the images are labeled, we can use the predicted scores as a criterion to “clean” each dataset. Here, we excluded the 10% of images with the lowest emotion scores in each emotion class in both CIFE and GaMo. This yields the “cleaned” CIFE and GaMo datasets. To see whether this “self-cleansing” can improve emotion recognition performance, we fine-tuned the pre-trained emotion models using the updated CIFE and GaMo datasets. After about 5000 training iterations with a batch size of 50 images, we stopped at the converged models. The confusion matrices for CIFE and GaMo are shown in Tables 10 and 11. Comparing them to the corresponding confusion matrices in Tables 4 and 5, we can see that the average emotion recognition accuracies for CIFE and GaMo increase by 9% and 8%, respectively, from 76% and 75% to 85% and 83%. The improvement in the different emotion classes can be seen in Tables 10 and 11. We conclude that by self-cleansing the emotion datasets, we obtain cleaner datasets consisting of higher quality emotion images, which contribute to more accurate emotion recognition.
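
The self-cleansing step can be summarized by the sketch below, assuming a `predict_probs` helper (a hypothetical stand-in for the trained predictor) that returns the model’s per-expression probabilities for one image.

```python
# Self-cleansing: score every image with the trained model, keep only the
# probability assigned to its *own* label, and drop the lowest-scoring 10%
# within each expression class.
from collections import defaultdict

def clean_dataset(samples, predict_probs, drop_fraction=0.10):
    """samples: list of (image, label); predict_probs(image) -> dict of
    per-expression probabilities."""
    by_label = defaultdict(list)
    for image, label in samples:
        own_score = predict_probs(image)[label]
        by_label[label].append((own_score, image))
    cleaned = []
    for label, scored in by_label.items():
        scored.sort(key=lambda s: s[0], reverse=True)   # most reliable first
        keep = int(len(scored) * (1.0 - drop_fraction))
        cleaned += [(image, label) for _, image in scored[:keep]]
    return cleaned
```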

Table 10 Self-evaluation confusion matrix of the cleaned CIFE
Table 11 Self-evaluation confusion matrix of cleaned GaMo

6.5 Comparison in user feedback

The goal of facial expression recognition research is often to train a model that performs well in real scenes. This is especially true for human–computer interaction applications in daily activities, such as satisfaction studies of customers and viewers, and assistive social interaction for people in need, such as individuals with visual impairment or autism spectrum disorder (ASD). One way to verify an expression detector is to test it on ordinary people showing natural facial expressions. To evaluate the two models accurately, we analyze data collected from five new users (3 male and 2 female), not included in the GaMo dataset, while they played the general version of the game. Note that in the GaMo data collection phase, we mainly used the customized game interface, since users could not perform well with the general interface. In this game engine performance study, the general game is played five times by each user with the same game settings, and the scores of the five rounds are recorded. We use the two versions of our game engine, one trained on CIFE and the other on GaMo. Figure 12 shows the result of this experiment: we plot the two average scores for each player on games powered by the two game engines. According to this figure, the GaMo game engine performs much better and results in higher scores. This further confirms that the model trained on GaMo is more suitable for real-world expression recognition.

Fig. 12

Users’ average scores on GaMo and CIFE-based CNN models

Fig. 13

Subtle facial expression recognition by the CIFE and GaMo models (the left two are from the CIFE model and the right two are from the GaMo model). Each histogram shows the probability distribution of the seven emotions for the corresponding facial image; the order of the expressions is Angry, Disgust, Fear, Happy, Neutral, Sad, and Surprise. For some subtle expressions, only the GaMo model works well

This result agrees with the cross-testing results, which show that the GaMo model performs better on the GaMo dataset itself. These observations also support our claim that GaMo is very useful for detecting subtle expressions. For instance, a user can gain a point with a normal smile in the game powered by the GaMo model, as shown in Fig. 13, while with the CIFE model the same expression cannot be detected. The same holds for detecting Angry or any other expression, as our players do not have any prior knowledge of how obvious and explicit their facial expressions should look.

7 Conclusion and discussion

In this paper, we propose a recursive framework to achieve real scene facial expression recognition. We first build a candid image facial expression dataset, CIFE, by crawling web expression images from image search engines. CNN-based deep learning approaches are then employed to train robust facial expression predictors, and fine-tuning approaches are used to further improve expression recognition accuracy. To collect real scene images, we designed a facial expression interaction game based on our deep learning model trained with the CIFE dataset. With users playing both the general and the customized versions of the face game, correctly labeled facial emotion images are selected and saved, which helps us build the GaMo dataset. We also ran a self-evaluation of the quality of the data labeling and proposed a self-cleansing mechanism to improve the quality of the data. To prove the effectiveness of our framework, we compared GaMo and CIFE based on their balancedness, recognition accuracy, the effectiveness of using strictly balanced subsets, the impact of data cleaning, and feedback from human subjects. The experiments show that our framework can build a reliable facial expression predictor for real scenes.

Through our evaluation of the GaMo and CIFE datasets, we see the effectiveness of our framework. By recursively updating our model with newly collected data, we can achieve a better facial expression recognition model for real scenes. Comparing the statistics of the CIFE- and GaMo-trained models shows that the game-based collection yields higher quality emotion image datasets: we can obtain more samples of “underrepresented” facial expressions with the game interface than by collecting them from image search engines with tremendous manual effort. We have also pointed out that we can use the game engine’s known recognition rate for each emotion category to change the appearance frequency of the corresponding targets, so that we obtain more balanced samples across the seven expressions. The testing results on the GaMo dataset, which consists of real scene images, support the claim that models trained with GaMo perform better in real scenarios. In the balanced subset experiment, GaMo shows the potential to build a robust model able to detect all expressions. We have also shown that by self-cleansing the emotion datasets, we obtain cleaner datasets with higher quality emotion images, which contribute to better emotion recognition results. Finally, our human subject experiment demonstrates the ability of our updated model to detect more subtle expressions.