1 Introduction

Deep learning methods, especially CNNs, outperform the previous state of the art in nearly all computer vision tasks, e.g. object detection, image classification, and facial expression recognition. Facial expression recognition attracts researchers’ attention because of its many applications in human-machine interaction, social robotics, medical diagnosis and treatment, and semi-automated driving.

Regularization is one of the key elements of deep learning, allowing models to generalize well to unseen data, even when training on a limited training set or with an imperfect optimization procedure [8]. Some widely and successfully used regularization techniques are data augmentation, drop-out, batch normalization, and weight decay, which are also common in expression recognition [11, 22]. In addition to these methods, this paper proposes an occlusion-based regularization technique, which consistently improves performance in facial expression recognition and can be combined with any existing regularization technique and network architecture. Occlusion regularization is a specific form of data augmentation and is very easy to implement: training images are synthetically occluded by random black bars or objects at random locations.

The work’s main contributions are:

  1. We propose to apply occlusion augmentation in facial expression recognition tasks, such as recognition of emotion categories and recognition of facial action unit intensities. Occlusion augmentation is a simple and effective regularizer, which can be applied with any CNN approach. It is beneficial even if the test data does not contain occlusions.

  2. We experimentally show the resulting performance improvements using three datasets with different expression recognition tasks (RAF [12], AffectNet [16], and Bosphorus [18] databases) and three CNN architectures (pre-trained Xception [2] and MobileNet [6] as well as a custom architecture).

  3. We compare our results with state-of-the-art results. We clearly outperform prior work on the Bosphorus dataset. On the RAF dataset we reach comparable results with our simple approach, which may also be applied to further improve results of more sophisticated state-of-the-art approaches.

2 Related Work

Most work related to occlusion in facial expression recognition has aimed to improve performance on partially occluded images. In contrast, our work addresses performance improvements on all face images, including occlusion-free images. For a general overview of facial expression recognition and of expression recognition under partial occlusion, the reader is referred to Li et al. [11] and Zhang et al. [24], respectively. A recent approach to occlusion-aware expression recognition is the CNN with an attention mechanism proposed by Li et al. [13]. They combined multiple representations from facial regions of interest, weighting them with a proposed gate unit that computes an adaptive weight for each region from the region itself, according to how unobstructed and how important it is. This way they improved performance on both occluded and occlusion-free face images.

Kukačka et al. [8] review and classify the literature on regularization. Among the most widely used methods are data augmentation, batch normalization, and drop-out. Many works use data augmentation to improve the performance of deep learning networks in general, including in facial expression recognition tasks. Bengio et al. [1] showed that the performance of a deep neural network in image classification can be improved by data augmentation. Earlier, in 1998, LeCun et al. [9] used various affine transformations for data augmentation when training LeNet. Lemley et al. [10] proposed a smart data augmentation technique to optimize data augmentation during training. Lin et al. [14] used data augmentation and compact feature learning to improve the performance of a facial expression recognition model. Sarandi et al. [17] used synthetic object occlusion to improve 3D body pose estimation, which inspired our work on occlusion-based regularization in the facial expression domain.

Ioffe and Szegedy [7] showed how batch normalization can improve training time and the performance of deep learning networks. Batch normalization has a regularizing effect, because the mean and standard deviation used for normalization vary between the randomly composed mini-batches. This introduces additional variation and teaches the layers to be robust to variation in their input. Another widely used regularization concept in deep learning networks is drop-out, which was introduced by Hinton et al. [5]. Drop-out randomly removes hidden neurons during the training of a deep network. As a result, the network cannot depend on any specific activation during training, which reduces overfitting.

Fig. 1. Example images of the used databases with synthetic random black-bar occlusions of the used sizes. The occlusions are only augmented during training. Testing is done with the original (mostly occlusion-free) images.

Fig. 2. Example images of the used databases with synthetic random object occlusions of the used sizes. The occlusions are only augmented during training. Testing is done with the original (mostly occlusion-free) images.

3 Approach

We propose to add synthetic occlusions to the images that are used to train expression recognition models. The position of the occlusion mask in pixel coordinates is randomly selected for each sample (and epoch) such that the mask always lies completely within the image. Two types of occlusions are considered in this work:

Black-Bar Occlusions: We use square occlusion masks of the sizes 10 \(\times \) 10 to 60 \(\times \) 60 pixels for all the experiments and set all pixels to zero. Some examples of random black bar occlusion can be seen in Fig. 1.
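
The black-bar augmentation described above can be sketched in a few lines. The following is a minimal NumPy illustration under stated assumptions, not the OpenCV-based code used for the experiments; the function name `black_bar_occlusion` and its parameters are introduced here purely for illustration.

```python
import numpy as np


def black_bar_occlusion(image, size, rng):
    """Return a copy of `image` with one random size x size square set to zero.

    `image` is an (H, W) or (H, W, C) array. The top-left corner is sampled
    so that the square always lies completely inside the image.
    """
    height, width = image.shape[:2]
    if size <= 0:
        return image.copy()  # size 0 corresponds to the no-occlusion baseline
    if size > height or size > width:
        raise ValueError("occlusion does not fit inside the image")
    # Valid top-left corners are 0 .. H-size and 0 .. W-size (inclusive).
    top = int(rng.integers(0, height - size + 1))
    left = int(rng.integers(0, width - size + 1))
    occluded = image.copy()
    occluded[top:top + size, left:left + size] = 0
    return occluded


rng = np.random.default_rng(0)
face = np.full((100, 100, 3), 255, dtype=np.uint8)  # stand-in for a face image
augmented = black_bar_occlusion(face, size=30, rng=rng)
```

Calling the function anew for every sample in every epoch yields a different occlusion position each time, which matches the per-sample, per-epoch randomization described above.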

Object Occlusions: The PASCAL VOC 2011 [4] dataset is used to augment real objects (excluding faces) on face images. After several experiments, we chose to occlude each training image with two objects, because this resulted in better performance than using one object. Some exemplary occlusion masks and how much occlusion they create on training images can be seen in Fig. 2.
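
One possible way to paste object cut-outs onto a face image is sketched below. It assumes that each object is already available as an RGB patch with a boolean segmentation mask and is no larger than the face image; how the objects are extracted from PASCAL VOC and whether they are rescaled is not covered here. The names `paste_object` and `object_occlusion` are hypothetical.

```python
import numpy as np


def paste_object(image, obj, mask, rng):
    """Paste `obj` onto a copy of `image` at a random position.

    `obj` is an (h, w, C) cut-out and `mask` its (h, w) boolean mask; only
    pixels where `mask` is True are overwritten. The patch is assumed to
    fit inside `image`.
    """
    height, width = image.shape[:2]
    h, w = mask.shape
    top = int(rng.integers(0, height - h + 1))
    left = int(rng.integers(0, width - w + 1))
    out = image.copy()
    region = out[top:top + h, left:left + w]  # view into `out`
    region[mask] = obj[mask]                  # writes through to `out`
    return out


def object_occlusion(image, objects, rng, n_objects=2):
    """Occlude `image` with `n_objects` randomly drawn (obj, mask) pairs."""
    out = image
    for idx in rng.integers(0, len(objects), size=n_objects):
        obj, mask = objects[int(idx)]
        out = paste_object(out, obj, mask, rng)
    return out


rng = np.random.default_rng(0)
face = np.zeros((100, 100, 3), dtype=np.uint8)       # stand-in for a face image
yy, xx = np.mgrid[:20, :20]
disk = (yy - 10) ** 2 + (xx - 10) ** 2 < 81          # toy object silhouette
patch = np.full((20, 20, 3), 200, dtype=np.uint8)    # toy object appearance
augmented = object_occlusion(face, [(patch, disk)], rng)
```

The default `n_objects=2` mirrors the choice of two objects per training image; the two objects may overlap, since their positions are drawn independently in this sketch.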

The occlusions do not resemble realistic occlusions, such as occlusions by hands, glasses, or other objects, because our goal is providing a simple regularization technique. Synthesizing realistic occlusions is a complex task, hard to implement, and – as our experiments show – not necessary for improving the performance.

Similar to other data augmentation techniques and batch normalization, occlusion augmentation increases the variance of the input and teaches the network to be more robust to variations. It encourages the network not to base its decisions exclusively on few local activations, but to combine multiple indicating activations globally. The size of the occlusion is a critical parameter: Occluding more pixels increases the variation of the training images and thus the regularization effect. However, occluding more pixels also hides more information that may be needed for a correct prediction.

The occlusion augmentation can be used with any CNN architecture and training loss. Further, it can be combined with any other regularization technique and be implemented as an extension of an arbitrary data augmentation pipeline. Although it is a simple approach, it is effective in improving the performance, as we will see in the following experiments.

4 Experiments and Results

To show the regularization effect of occlusion augmentation, we use three facial expression datasets: Bosphorus [18], RAF [12], and AffectNet [16]. We present experiments on varying the degree of occlusion augmentation using three CNN models: Xception [2], MobileNet [6], and a custom architecture. This way we verify occlusion regularization with both standard models pre-trained on ImageNet [3] and a custom model with random (Xavier) weight initialization. All three models are trained using both black-bar and object occlusion augmentation.

The custom model architecture contains six convolution layers (kernel 3 \(\times \) 3, 16/32/.../512 channels, ReLU), all except the first followed by MaxPooling2D (pool size 2 \(\times \) 2). After the convolution part, we flatten the features, apply dropout (\(p=0.2\)), and append the final dense layer, using softmax activation for classification and linear activation for regression. The image size is 128 \(\times \) 128 \(\times \) 3 for Xception and MobileNet and 100 \(\times \) 100 \(\times \) 3 for the custom model. We conduct the experiments with the Keras deep learning framework. The occlusions are augmented with custom Python source code (using OpenCV).

4.1 Black-Bar Occlusion Regularization

The Bosphorus Database [18] contains 2,902 images, each with 26 facial action unit (AU) intensity labels. The images of 87 subjects (2,470 samples) are used for training and those of 17 subjects (432 samples) for testing. We align all training and test images with a similarity transform using the facial landmarks provided with the database. The Xception and MobileNet networks are trained with a classification loss (categorical cross-entropy) and the custom model with a regression loss (mean squared error), because we want to verify that occlusion regularization works with both classification and regression. Figure 3 illustrates the performance improvements using black-bar occlusion regularization. The baselines (occlusion size of zero, i.e. no occlusion) are the average of three runs for each of the models. The y-axis presents the average of the 26 AUs’ ICC(3,1) values [19] of the models on the test data (0 corresponds to chance level and 1 to error-free prediction), and the x-axis presents the occlusion sizes used in the respective training runs. The plots clearly show that the performance of the models improves as the size of the occlusion mask increases up to a certain point, after which it declines again. This is to be expected, because at some occlusion size the negative impact of hiding information starts to outweigh the positive effect of regularization.
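
For readers unfamiliar with the evaluation measure, the following sketch shows how an average ICC(3,1) over AUs could be computed. It assumes the standard two-way ANOVA formulation, ICC(3,1) = (BMS - EMS) / (BMS + (k - 1) EMS), with the ground-truth labels and the predictions treated as k = 2 raters; the function names are illustrative and this is not the evaluation code used in the experiments.

```python
import numpy as np


def icc_3_1(y_true, y_pred):
    """ICC(3,1) between two 1-D vectors of n ratings (labels, predictions)."""
    ratings = np.column_stack([y_true, y_pred]).astype(float)  # shape (n, k)
    n, k = ratings.shape
    grand_mean = ratings.mean()
    ss_total = ((ratings - grand_mean) ** 2).sum()
    ss_targets = k * ((ratings.mean(axis=1) - grand_mean) ** 2).sum()
    ss_raters = n * ((ratings.mean(axis=0) - grand_mean) ** 2).sum()
    bms = ss_targets / (n - 1)  # between-target mean square
    # Residual mean square after removing target and rater effects.
    ems = (ss_total - ss_targets - ss_raters) / ((n - 1) * (k - 1))
    return (bms - ems) / (bms + (k - 1) * ems)


def mean_icc(labels, predictions):
    """Average ICC(3,1) over the columns (AUs) of two (n, n_aus) arrays."""
    n_aus = labels.shape[1]
    return float(np.mean([icc_3_1(labels[:, j], predictions[:, j])
                          for j in range(n_aus)]))


intensities = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
perfect = icc_3_1(intensities, intensities)        # predictions equal labels
reversed_ = icc_3_1(intensities, intensities[::-1])  # predictions reversed
```

In this sketch, identical predictions and labels give a value of 1 and exactly reversed predictions give -1. The value is undefined (division by zero) when both vectors are constant, which would need separate handling for AUs that never occur in the test data.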

Fig. 3. Performance on Bosphorus (black-bar occlusion). An occlusion size of 0 corresponds to baseline results, which are outperformed by augmenting mid-size occlusions.

Fig. 4. Performance on RAF (black-bar occlusion). An occlusion size of 0 corresponds to baseline results, which are outperformed by augmenting mid-size occlusions.

Fig. 5. Performance on AffectNet (black-bar occlusion). An occlusion size of 0 corresponds to baseline results, which are outperformed by augmenting mid-size occlusions.

The Real-World Affective Faces Database (RAF-DB) [12] is a large database with around 30,000 diverse real-world face images downloaded from the internet. All images were annotated with basic or compound emotions by 40 trained annotators. Only the images with basic emotion expressions were used for our experiments (12,271 training images and 3,068 test images). The RAF database provides both original images and aligned images; we use the aligned face images for our experiments. Figure 4 shows the model performance improvements using black-bar occlusion regularization. The y-axis shows the test accuracy of the model on non-occluded test images and the x-axis shows the different occlusion sizes used for occlusion regularization. The curves of all three models are qualitatively similar to those obtained with the Bosphorus database, i.e. the performance increases with the occlusion size up to a certain point and decreases if the occlusion size is increased further.

AffectNet  [16] is the largest labeled expression recognition database by far: It contains around 400,000 manually annotated facial images. For this work, 99,852 training images and 2,549 test images were selected randomly to reduce the required computational effort. The AffectNet database provides aligned faces that we directly used in our experiments. The performance plots are depicted in Fig. 5. Again, we observed performance improvements for all models up to a tipping point, after which performance decreases again.

4.2 Object Occlusion Regularization

We repeat all the above experiments using object occlusion augmentation on the training images. The models are then evaluated on the test data, which are mainly free of occlusions. In contrast to the single mask used in black-bar occlusion augmentation, we present results with two occlusion masks per image, because this yielded a better regularization effect than a single mask. The performance plots are qualitatively similar to those of black-bar occlusion regularization, so we report the numbers in Tables 1, 2, and 3 instead. Again, occlusion-based regularization outperforms the training without occlusions, at least up to a certain size of occlusion (see the bold numbers in the tables).

On the Bosphorus dataset the best object augmentation resulted in performance improvements of 3.3%, 3.7%, and 0.7% for the Xception, MobileNet, and custom architectures, respectively. These are lower than the improvements of the best black-bar augmentation, which are 5.6%, 4.3%, and 3.2%.

On RAF the performance improvements of the best object augmentation are 2.3%, 3%, and 0.7% for the Xception, MobileNet, and custom architectures, respectively. With black-bar augmentation we get improvements of 4.7%, 1.7%, and 1.6%, which is better on average.

In contrast to the Bosphorus and RAF databases, object occlusion augmentation performs better than black-bar augmentation on the AffectNet database, with performance improvements of 2.7%, 1.7%, and 2.9% (Xception, MobileNet, and custom architecture), compared to 2.6%, 1.1%, and 0.8%.

4.3 Comparison with State of the Art

We compare our results with the state of the art, although beating it is not the focus of our work. Table 4 shows the comparison on the Bosphorus database (mean of ICC measures of 26 AUs’ intensity outputs). The table shows that we clearly outperform the previous state of the art on this database.

Table 1. Test performance achieved on the Bosphorus dataset with different CNN models (columns) by augmenting the training data with synthetic object occlusions (rows).
Table 2. Test performance achieved on the RAF dataset with different CNN models (columns) by augmenting the training data with synthetic object occlusions (rows).
Table 3. Test performance achieved on the AffectNet dataset with different CNN models (columns) by augmenting the training data with synthetic object occlusions (rows).
Table 4. State of the art on the Bosphorus database.
Table 5. State of the art on the RAF database.

We also compare our best test accuracy on the RAF dataset with the existing state of the art. Table 5 shows that we obtain comparable results. Since occlusion regularization works for all datasets and models used in this work, we think that gCNN and gACNN can be further improved if these models are trained with additional occlusion regularization. For AffectNet we did not use the full dataset, so a comparison with other works would not be fair.

5 Conclusion

We proposed and evaluated the idea of using occlusion augmentation for regularization in order to improve performance in facial expression recognition. Two types of occlusion augmentation were considered: black-bar occlusion and object occlusion. With both we found significant performance improvements (compared to not using occlusion augmentation) on three databases, three CNN architectures, two recognition tasks (basic emotions and AU intensities), and two loss functions (softmax cross-entropy and mean squared error). On the Bosphorus and RAF databases we observe greater performance improvements using black-bar than object occlusion regularization. On the AffectNet database the opposite holds. Due to the consistent improvements, we strongly recommend integrating occlusion regularization into the training of all CNN-based facial expression recognition systems. We propose to use black-bar regularization (which is easy to implement and yields good results) with a square size in the range of about 20–40% of the CNN input image size.

Future work should investigate occlusion augmentation and occlusion regularization in further experiments. Adding more randomization (e.g. regarding size and aspect ratio) may be a promising direction. Another approach is to randomly select, for each image, whether the occlusion augmentation should be applied (leaving a subset of the images occlusion-free). Moreover, an algorithm may be developed that automatically finds the best occlusion augmentation for a particular database by searching the parameter space.
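
To make the randomization ideas above concrete, one possible variant is sketched below. It was not evaluated in this work. The function name and all default values (application probability, aspect-ratio range) are assumptions chosen for illustration, and the 20–40% range is interpreted here as the side length of an equal-area square relative to the shorter image side.

```python
import numpy as np


def randomized_occlusion(image, rng, p_apply=0.5, frac_range=(0.2, 0.4),
                         aspect_range=(0.5, 2.0)):
    """Black rectangle with random presence, size, aspect ratio, and position.

    All default values are illustrative assumptions, not tuned settings.
    """
    if rng.random() >= p_apply:
        return image.copy()  # leave this sample occlusion-free
    height, width = image.shape[:2]
    # Side length of a square with the same area as the rectangle.
    side = rng.uniform(*frac_range) * min(height, width)
    # Width/height ratio, sampled log-uniformly so 0.5 and 2.0 are symmetric.
    aspect = np.exp(rng.uniform(*np.log(aspect_range)))
    bar_h = int(np.clip(round(side / np.sqrt(aspect)), 1, height))
    bar_w = int(np.clip(round(side * np.sqrt(aspect)), 1, width))
    top = int(rng.integers(0, height - bar_h + 1))
    left = int(rng.integers(0, width - bar_w + 1))
    occluded = image.copy()
    occluded[top:top + bar_h, left:left + bar_w] = 0
    return occluded


rng = np.random.default_rng(0)
face = np.full((128, 128, 3), 255, dtype=np.uint8)  # stand-in for a face image
augmented = randomized_occlusion(face, rng)
```

Setting `p_apply=1.0` and `aspect_range=(1.0, 1.0)` reduces this sketch to a square black bar of random size, while `p_apply=0.0` disables the augmentation.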