
1 Introduction

Facial expression recognition (FER) has been a popular research field owing to its potential applications in human-computer interaction, driver fatigue monitoring, mental health assessment, and other areas. Despite the high accuracy achieved under standard conditions, spatial occlusion remains a standing challenge to robustness. Occlusions in real-life scenarios encompass a massive variety of everyday objects and can occupy different positions of a face image, which greatly affects the robustness of FER algorithms.

Earlier researchers mainly studied the influence of occlusion positions on FER. Boucher et al. [4] occluded key areas of the face to learn which areas are the most important in human perception. Kotsia et al. [15] concluded that mouth occlusion causes a greater decrease in FER performance than the equivalent eye occlusion. Methods based on sparse representation were then proposed. Cotter [7] presented a weighted voting method based on the sparse representation classifier (SRC) for FER. Zhang et al. [31] extracted three typical facial features to evaluate the performance of the SRC method. Subsequently, with the emergence of large-scale datasets and robust novel network architectures, researchers combined deep learning with sparse representation. Huang et al. [14] exploited sparse representation and residual statistics for occlusion detection in video sequences. Zhong et al. [33] proposed a two-stage multi-task sparse learning framework to find dominant patches and learn expression-specific facial patches. Recently, attention-based methods have been proposed to address occlusions in FER [19, 20, 27]; they determine whether a facial block should be emphasized based on an importance score.

We are motivated to devise a new mechanism that provides neural networks with knowledge of occlusion for recognizing partially occluded expressions. When observing face images with occlusions, people focus on the non-occluded areas and recognize the expression based on the information they carry. Inspired by this, we propose a novel Region Re-Weight Network (RRWN) to capture and emphasize the non-occluded areas of the face. RRWN is mainly composed of two modules, the Occlusion-Aware Module (OAM) and the Block-Loss Module (BLM). OAM learns to pick out the non-occluded facial regions to facilitate recognition and is compatible with mainstream convolutional neural network (CNN) architectures. As depicted in Fig. 1, OAM works with a widely-used convolutional architecture, in which the feature maps of the holistic image are decomposed as a combination of feature maps from its local regions. Different from the widely-used attention-based methods, OAM employs similarity measurements to capture the difference between facial and non-facial areas. After OAM identifies the non-occluded regions, these regions are highlighted in the subsequent layers of the network. In the meantime, we use the Block-Loss to emphasize the role of the key area among these non-occluded regions. Different from other occlusion-aware methods, our method guides the model to separate occlusions from the human face.

The major contributions of this work can be summarized in three aspects: 1) We propose OAM, a novel network structure that avoids occluded facial blocks and selects non-occluded blocks. 2) We propose a region-biased loss (Block-Loss) to optimize the selection of crucial regions. 3) On four challenging datasets with occlusions, we demonstrate that our method achieves superior performance.

2 Related Work

2.1 FER Methods Against Occlusions

Many FER methods use prior knowledge to strike a better performance in both lab-constrained and in-the-wild scenarios. A common way to incorporate such knowledge is to manually design refined segmentation based on detected facial landmarks, since constraining the model's input to only the regions where expression-related actions occur is effective. According to the facial action coding system [10], action units are situated around the eyes, the forehead, and the mouth. Extracting those key areas accordingly reduces noise from hair, sunglasses, masks, and other occlusions. However, this works only if these key areas are not occluded.

When the location of occlusions is uncertain, dividing the whole facial image into smaller patches while applying some selection or weighting method over the patches is often more robust than the key-area segmentation approaches. Face partitioning methods vary from uniform partitioning [14] and landmark-centered partitioning [19] to sampling-oriented partitioning [27]. Subsequently, the occluded patches are given smaller importance weights, or simply excluded from the recognition process.

Recent works following this principle prefer to generate an importance score for each block according to its contribution to the classification. For example, Li et al. [19] proposed a convolutional neural network with an attention mechanism that computes an adaptive weight for each region according to its unobstructedness and importance. Wang et al. [27] proposed a novel region attention network that uses sigmoid values as attention values and combines overall and part features to enhance the network's ability.

The above methods obtain importance scores through a designed deep neural network, under the assumption that blocks with large importance scores should be focused on. In fact, blocks with large importance scores may well be occluded blocks. Different from these works, our method determines whether a block is occluded by the similarity between the facial block and the whole image, rather than by an importance score alone. When a face image is partially obscured, its overall characteristics are still close to a face, so blocks that are close to the whole image are non-occluded blocks.

2.2 Sparse Representation

Inspired by the success of sparse approximation in the face recognition task [29], researchers proposed adaptations and variations of sparse encoding for the expression recognition task. Methods based on sparse representation decompose a facial image as a linear combination of images from the same expression category. In this process, four typical facial features, i.e., raw pixels [31], Gabor wavelet representations [6], local binary patterns [2], and deep features extracted by a deep convolutional network [1], are used as effective representations of the expression images.

However, the above methods suffer drastically from insufficient training sample sizes and the variations they must cover. To effectively represent an unseen image containing an occluded facial expression, they also require well-performing decorrelation techniques, precise face alignment, and normalization, which are far from attainable in many in-the-wild datasets to date. Although we also decompose the whole facial image as a linear combination, our method distinguishes itself from existing sparse representation methods in that we measure how much content in each patch is related to the whole image.

3 Proposed Method

3.1 Overview of Region Re-weight Network

As depicted in Fig. 1, RRWN extends the traditional CNN architecture with two additional modules, OAM and BLM. To begin with, the face image is fed into the first layer of the backbone network to obtain feature maps for the whole face image as well as for each local block. Next, OAM selects the non-occluded blocks by measuring the similarity between local and global vectors. Finally, the non-occluded blocks are highlighted in the subsequent CNN layers. In addition to OAM, we introduce BLM, which contains a loss function that emphasizes the role of the critical block among the non-occluded blocks chosen by OAM. As a result, the whole RRWN can be trained in an end-to-end manner.

Fig. 1.

The framework of our RRWN. A face image is fed into ResNet-18 and is represented as the global vector y and local vectors \(x_{i}\). The Occlusion-Aware Module takes y and \(x_{i}\) as input to find the non-occluded \(\{Block_{r_{k}}\}\). The \(\{Block_{r_{k}}\}\) are then re-weighted in the latter network (the corresponding black squares). The Block-Loss Module emphasizes the role of the key block among \(\{Block_{r_{k}}\}\) through the Block-Loss function.

3.2 Occlusion-Aware Module

We hold the presumption that the overall characteristics of a face image are close to those of its facial components rather than to the occlusions. In our case, similarity is used as a mathematical measure to find the clear facial areas that resemble the overall face image. In other words, the non-occluded blocks of the face image are located through similarity measurement. Inspired by how the orthogonal matching pursuit (OMP) method finds the most similar component of a signal [24], we design OAM to find the non-occluded blocks.

As shown in Fig. 1, after obtaining feature maps that represent the whole facial image, we partition the feature maps uniformly into multiple sub-feature-maps to obtain diverse blocks of the same size. Next, an adaptive average pooling operation is utilized to encode the feature maps into vectors, i.e., each three-dimensional feature map is mapped to a one-dimensional vector. Let y denote the global vector. We normalize y for convenient calculation so that \(\vert \vert y\vert \vert =1\). Similarly, \(\chi = \{x_{1}, x_{2}, \cdots , x_{n} \}\) denotes the local vectors, with \(\vert \vert x_{i}\vert \vert =1\). In conventional sparse approximation methods, a dictionary is often created to store atomic vectors before finding the sparse representation of the global vector. In our method, the local vectors are used as the atomic vectors when building the dictionary \(D = [x_{1},x_{2},\cdots , x_{n}] \in R ^{k\times n}\), where n is the number of atomic vectors and k is their dimensionality.
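For concreteness, the following PyTorch sketch shows one way to obtain the global vector y and the local vectors \(x_{i}\) under the settings above (a \(7 \times 7\) grid over the 64-channel feature maps after ResNet-18's first layer). The function and variable names are ours, not from any released code.

```python
import torch
import torch.nn.functional as F

def extract_block_vectors(feature_maps, grid=7):
    """Pool the whole feature map and each of its grid x grid blocks
    into unit-norm vectors (a sketch of the OAM input preparation).

    feature_maps: (B, C, H, W), e.g. the 64 x 56 x 56 output of
    ResNet-18's first layer for a 224 x 224 input image.
    Returns y of shape (B, C) and x of shape (B, grid*grid, C).
    """
    # Global vector y: pool the whole map, then L2-normalize (||y|| = 1).
    y = F.adaptive_avg_pool2d(feature_maps, 1).flatten(1)
    y = F.normalize(y, dim=1)
    # Local vectors x_i: pooling down to a grid x grid map yields one
    # C-dimensional vector per block; normalize so ||x_i|| = 1.
    x = F.adaptive_avg_pool2d(feature_maps, grid)   # (B, C, grid, grid)
    x = x.flatten(2).transpose(1, 2)                # (B, grid*grid, C)
    x = F.normalize(x, dim=2)
    return y, x
```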

After building the dictionary, the inner product of the global vector y with each atomic vector \(x_i\) is calculated. The atomic vector with the largest absolute inner product is then selected as the closest match to y. This selection iterates until the maximum number of atomic vectors is reached. In this way, y is decomposed into its projection in the direction of the chosen atomic vector and the corresponding residual, which can be formulated as,

$$\begin{aligned} y = \langle y,x_{r_0}\rangle x_{r_0} + R_{1}, \end{aligned}$$
(1)

where \(\langle .,.\rangle \) is the inner product, \(x_{r_0}\) is the closest matching atomic vector, \(r_{0}\) is its column index in D, \(\langle y,x_{r_0}\rangle x_{r_0}\) is the projection in the direction of \(x_{r_0}\), and \(R_{1}\) is the residual. We then decompose the residual \(R_{1}\) in the same way. After K iterations, we get

$$\begin{aligned} y = \sum \limits _{k=0}^{K}\langle R_{k},x_{r_k}\rangle x_{r_k} + R_{K+1}, \end{aligned}$$
(2)

where K is a hyper-parameter serving as the number of selected atomic vectors, and \(R_{0} = y\). If K is too small, only a few non-occluded areas can be found; if K is too large, occluded areas may also be selected. After these iterations, the linear representation of the target vector is obtained, formulated as follows:

$$\begin{aligned} y&= \sum \limits _{k=0}^{K} c_{k}x_{r_k}\nonumber \\ c_{k}&= \langle R_{k},x_{r_{k}}\rangle \end{aligned}$$
(3)
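To make the iteration concrete, the sketch below runs this matching-pursuit-style selection for a single image, looping K times over Eqs. (1)-(3); masking already-chosen atoms so that the K blocks are distinct is our assumption, not a detail stated above.

```python
import torch

def select_blocks(y, x, K=10):
    """Greedily pick the K local vectors most similar to the global
    vector (a per-image sketch of the OAM selection step).

    y: (C,) unit-norm global vector; x: (n, C) unit-norm local vectors.
    Returns the indices r_k of the selected blocks and coefficients c_k.
    """
    residual = y.clone()                           # R_0 = y
    chosen = torch.zeros(x.shape[0], dtype=torch.bool)
    indices, coeffs = [], []
    for _ in range(K):
        scores = x @ residual                      # <R_k, x_i> for every block
        scores = scores.masked_fill(chosen, 0.0)   # keep selected blocks distinct
        r_k = torch.argmax(scores.abs()).item()
        c_k = scores[r_k]                          # c_k = <R_k, x_{r_k}>, Eq. (3)
        residual = residual - c_k * x[r_k]         # R_{k+1} = R_k - c_k * x_{r_k}
        chosen[r_k] = True
        indices.append(r_k)
        coeffs.append(c_k)
    return indices, torch.stack(coeffs)
```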

Now that the non-occluded blocks and their corresponding weights are obtained, we apply a re-weight operation to the original feature maps. The blocks corresponding to the selected atomic vectors are weighted as in Eq. 4 while the unselected blocks remain unchanged, which can be defined as,

$$\begin{aligned} block_{r_{k}} = (c_{k} + c)block_{r_{k}}, \end{aligned}$$
(4)

where \(block_{r_{k}}\) denotes the \({k}^{th}\) selected block and \(c_{k}\) can be arbitrary in (0, 1). To strengthen the role of the non-occluded area, we further increase the weight by a constant c. Overemphasizing the key blocks by imposing too great a weight on them leads to a decrease in accuracy; we analyze this in the later ablation studies. After OAM, the new feature maps are fed into the rest of ResNet-18.
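Below is a minimal sketch of this re-weighting for one image's feature maps, assuming the block indices and coefficients come from the selection loop above; the default value of c here is purely illustrative (its effect is evaluated in the ablation studies).

```python
def reweight_blocks(feature_map, indices, coeffs, c=1.0, grid=7):
    """Apply Eq. (4): scale each selected block by (c_k + c) and leave
    unselected blocks unchanged. feature_map: (C, H, W).
    """
    C, H, W = feature_map.shape
    bh, bw = H // grid, W // grid             # block height/width, e.g. 8 x 8
    out = feature_map.clone()
    for r_k, c_k in zip(indices, coeffs):
        i, j = divmod(r_k, grid)              # block's row/column in the grid
        out[:, i*bh:(i+1)*bh, j*bw:(j+1)*bw] *= (c_k + c)
    return out
```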

OAM optimizes the subsequent network during training by applying the weighting operation to the original feature maps. OAM selects the atomic vectors that are closest to the target vector, and the weights describe how similar each atomic vector is to the target vector. Even if the face is partially occluded, the face is still the dominant object in the image, so OAM can select the non-occluded areas. However, when the occlusion is too large and occupies most of the face image, the overall features of the image tend toward the occlusion rather than the face, and OAM performs poorly.

3.3 Block-Loss Module

After OAM, we have the non-occluded blocks. Among them, some blocks contribute more significantly to recognizing the expression than others [4]. To encourage a high weight for the most important of these non-occluded blocks, and inspired by [27], we propose the Block-Loss.

As can be seen in Fig. 1, BLM contains a fully-connected layer and a sigmoid function. The global vector y and the non-occluded local vectors \(x_{r_k}\) chosen by OAM are fed into BLM. After the fully-connected layer and the sigmoid function, we obtain their importance values. The Block-Loss can be formulated as,

$$\begin{aligned} \mathcal {L}_{B}&= \max \lbrace 0, \alpha - (\mu _{max} - \mu _{y})\rbrace , \nonumber \\ \mu _{max}&= \max \lbrace f(x_{r_k}q)\rbrace ,\nonumber \\ \mu _{y}&= f(yq), \end{aligned}$$
(5)

where \(\alpha \) is a hyper-parameter serving as a margin, q is the parameter of the fully-connected layer, and f denotes the sigmoid function. In the training process, the Cross-Entropy Loss is jointly optimized with the Block-Loss, which can be defined as,

$$\begin{aligned} \mathcal {L}_{All} = \mathcal {L}_{CE} + \mathcal {L}_{B}, \end{aligned}$$
(6)

where \(\mathcal {L}_{CE}\) denotes the Cross-Entropy Loss.
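The PyTorch sketch below illustrates Eqs. (5) and (6). Whether the fully-connected layer q carries a bias term is not stated above, so it is omitted here; the module and argument names are ours.

```python
import torch
import torch.nn as nn

class BlockLoss(nn.Module):
    """Margin loss of Eq. (5): the best-scoring non-occluded block must
    beat the whole-image score by at least alpha (a sketch).
    """
    def __init__(self, dim=64, alpha=0.01):
        super().__init__()
        self.fc = nn.Linear(dim, 1, bias=False)   # the parameter q in Eq. (5)
        self.alpha = alpha

    def forward(self, y, x_selected):
        # y: (B, C) global vectors; x_selected: (B, K, C) blocks from OAM.
        mu_y = torch.sigmoid(self.fc(y)).squeeze(-1)                    # f(yq)
        mu_max = torch.sigmoid(self.fc(x_selected)).squeeze(-1).max(1).values
        return torch.clamp(self.alpha - (mu_max - mu_y), min=0.0).mean()

# Joint objective of Eq. (6), with the 1:1 ratio used in Sect. 4.2:
# loss = nn.functional.cross_entropy(logits, labels) + block_loss(y, x_sel)
```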

BLM optimizes the earlier layers of the network during training through its loss function. BLM enforces that the largest importance value among the non-occluded blocks exceed that of the whole face image by a margin, so that RRWN focuses on the most important block among the non-occluded blocks.

4 Experiments

4.1 Datasets

RAF-DB [17] contains 30,000 facial images annotated with basic or compound expressions by 40 trained human coders. In our experiment, only images with basic emotions (neutral, happiness, surprise, sadness, anger, disgust, fear) are used, including 12,271 images as training data and 3,068 images as test data. FERPlus [3] contains 28,709 training images, 3,589 validation images, and 3,589 test images collected by the Google search engine, all resized to \(48\times 48\) pixels. FERPlus adds a contempt emotion, and each image is annotated by 10 labelers. AffectNet [23] is the largest FER dataset, containing more than one million images collected by three search engines using expression-related keywords. About 400,000 images are manually annotated with the same eight discrete facial expressions as FERPlus. It has imbalanced training and test sets as well as a balanced validation set. SFEW [8] contains 95 subjects and covers unconstrained facial expressions, a large range of ages, varied head poses, and real-world illumination. We use the newest version of SFEW [9], which is divided into three sets: training (958 images), validation (436 images), and test (372 images); all images are annotated with the same seven discrete facial expressions as RAF-DB.

Table 1. Values of hyper-parameters

4.2 Implementation Details

The proposed RRWN is implemented in Python 3.6 on Windows 10. Preprocessing steps such as image resizing are executed with OpenCV 3.4 for convenience. The network is run on an Intel(R) Core(TM) i7-6700 3.4 GHz CPU and an NVIDIA GTX 1080 Ti GPU with CUDA 9.0. RRWN is implemented on the PyTorch platform with ResNet-18 [12] as the backbone network. By default, ResNet-18 is pre-trained on the MS-Celeb-1M face recognition dataset, and we extract the feature maps after its first layer.

Each face image is first resized to \(224 \times 224\). The feature maps are then partitioned uniformly into \(7 \times 7\) blocks as depicted in Fig. 1. After the adaptive average pooling operation, the feature maps are encoded as vectors of 64 dimensions. The number of selected atomic vectors is 10. The margin in the Block-Loss defaults to 0.01, and the whole network is jointly optimized with the Block-Loss and the Cross-Entropy Loss during training; the ratio of the two loss functions is empirically set to 1:1. Values of the hyper-parameters are shown in Table 1. A batch-based stochastic gradient descent optimizer is used to train the model. On all datasets, the batch size is set to 64, the base learning rate is set to 0.01 and reduced by the polynomial policy with a gamma of 0.1, the momentum is set to 0.9, and the weight decay is set to 0.0001.
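As a rough guide, the optimizer settings above translate to the following PyTorch snippet; the exact form of the polynomial decay schedule (and the number of epochs) is not specified, so the scheduler here is one plausible reading, not the authors' exact code.

```python
import torch
import torchvision

model = torchvision.models.resnet18(num_classes=7)   # stand-in for RRWN
optimizer = torch.optim.SGD(model.parameters(), lr=0.01,
                            momentum=0.9, weight_decay=1e-4)
# Assumed polynomial decay: lr(t) = 0.01 * (1 - t / T)^0.9 over T epochs.
T = 60
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer, lambda epoch: (1.0 - epoch / T) ** 0.9)
```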

Fig. 2.

Images with occlusions from RAF-DB. Each image is equally divided into 49 blocks. The orange squares represent the facial non-occluded areas, and the blue squares represent the occluded areas. Dark orange and dark blue squares represent the blocks selected by OAM. The number in the square is the coefficient of the linear combination obtained by OAM. (Color figure online)

4.3 Visualization of the Blocks Selected by OAM

OAM should be able to match the non-occluded areas of the face. To demonstrate its effect, the non-occluded blocks selected by OAM are shown in Fig. 2. The occluded areas are covered by blue masks while the clear face areas are covered by orange masks. Areas selected by OAM are further highlighted with a darker color and the corresponding weights. It is clear that most of the blocks OAM selects are non-occluded. In addition, some non-occluded blocks play an important role in FER because they include key areas such as the eyes and mouth.

For the images in the first row, where the occlusion and the face differ greatly, OAM finds the key blocks closest to the whole face, effectively avoiding the occluded blocks. In the second row, where the occlusions occupy a relatively larger area of the face image, some blocks containing occlusions are selected, because the features of the whole image in this situation include quite a lot of information about the occlusions. In the last row, where the occluding object is a hand whose color, texture, and other features are relatively similar to the face, OAM may select a few blocks containing hands.

Table 2. Test accuracy (%) on real-world datasets.

4.4 Ablation Studies Evaluation

Effectiveness of RRWN: To evaluate the effectiveness of RRWN against the baseline (ResNet-18), we conduct experiments on real-world datasets. Results are shown in Table 2. When training from scratch, our proposed RRWN outperforms the baseline network by margins of \(4.83\%\), \(0.28\%\), and \(2.05\%\) on RAF-DB, FERPlus, and AffectNet, respectively, showing that our method indeed improves on the baseline. In addition, when using ResNet-18 pre-trained on MS-Celeb-1M, our method obtains improvements of \(1.62\%\), \(0.9\%\), and \(0.2\%\) on these datasets.

Table 3. Test accuracy (%) of the two modules on RAF-DB.

Furthermore, to explore how much each module contributes to the accuracy improvement, we conduct comparative experiments on RAF-DB; the results are shown in Table 3. Note that when only BLM is added, its input vectors come directly from the vectors after the adaptive average pooling operation. When adding only OAM or only BLM, we obtain improvements of \(3.68\%\) and \(1.16\%\) over ResNet-18, and \(1.3\%\) and \(0.4\%\) over ResNet-18 (pretrained). This suggests that both OAM and BLM contribute to improving accuracy, with OAM being the most impactful module in our RRWN.

Fig. 3.

Evaluation of the position on RAF-DB

Position of OAM: We study the impact of different OAM positions. The backbone network, ResNet-18, can be divided into four layers (layer1, layer2, layer3, and layer4). Experiments are carried out with OAM placed after the first, second, and third layers. The result on RAF-DB is shown in Fig. 3. It indicates that OAM works best when placed after the first layer: the further back it is placed, the worse the effect.

We analyze this phenomenon and identify two major reasons for the decline. First, OAM represents the target vector linearly with a certain number of atomic vectors, so the greater the difference between blocks, the more accurately OAM finds the non-occluded blocks. Second, as the CNN deepens and repeatedly applies convolution, pooling, and other operations, the feature maps become smaller and the features more abstract. The features of different blocks become mixed, which makes distinguishing the blocks more difficult. Therefore, adding OAM after the first layer is appropriate.

Fig. 4.

Parameters evaluation

Evaluation of the Weight Increment c: In OAM, we obtain the atomic vectors corresponding to the non-occluded blocks. The blocks corresponding to the selected atomic vectors are re-weighted, while the unselected blocks remain unchanged. We study the effect of the amount of weight increase; the result is shown in Fig. 4(a).

As can be seen from Fig. 4(a), when we merely multiply the non-occluded blocks by the coefficient \(c_{k}\), i.e., \(c=0\), the result is poor: since \(c_{k}\) lies between 0 and 1, direct multiplication weakens the non-occluded blocks. On the other hand, the accuracy declines as c increases, because FER depends not only on the key local blocks but also on global features, and the two should be combined. As shown in [19, 27], the combination of global and local features is more effective. If c is too large, the model focuses too much on local features and ignores the global ones, so the accuracy declines.

Evaluation of the Margin \(\alpha \): From Table 3, we can see that BLM further improves performance on RAF-DB.

The margin \(\alpha \) in the Block-Loss is set to 0.01 by default. We evaluate \(\alpha \) on FERPlus; the result is shown in Fig. 4(b). Increasing \(\alpha \) from 0 to 0.01 gradually improves performance, while larger values lead to fast degradation, which indicates that the features of the overall face image are also important for FER. This further confirms the need to combine local and global features, which we do in two ways: feeding the global vector into BLM, and appropriately emphasizing the key blocks selected by OAM.

4.5 Results and Comparison

We compare our RRWN with several methods on RAF-DB, FERPlus, AffectNet, and SFEW, including attention-based methods [19, 20, 27] and loss-function methods [5, 18, 21]. The results are shown in Table 4.

pACNN [20] re-weights each patch according to an attention mechanism. gACNN [19] leverages a patch-based attention network together with a global network. RAN-ResNet18 [27] captures the importance of facial regions and aggregates region features into a compact representation. These attention-based methods are time-consuming due to their carefully designed deep neural networks, whereas our RRWN adds little computational expense by simply inserting two modules into an existing network architecture. DLP-CNN [18] uses a locality-preserving loss for network training. Island Loss [5] combines the Center Loss [28] with an inter-class loss. IACNN [21] proposes an identity-sensitive contrastive loss to achieve identity-invariant FER. These loss-function methods do not emphasize the key block of the face image, whereas our RRWN emphasizes the key block among the non-occluded blocks. Our RRWN outperforms these recent methods with \(85.80\%\), \(87.70\%\), \(58.70\%\), and \(54.26\%\) on RAF-DB, FERPlus, AffectNet, and SFEW, respectively.

Table 4. Comparison on datasets with occlusions

5 Conclusion

In this work, we propose RRWN to address facial expression recognition in the presence of occlusions. RRWN uses the Occlusion-Aware Module (OAM) to adaptively capture and emphasize the uncovered areas of the face. In addition, we design a region-biased loss (Block-Loss) to encourage a high weight for the most important region. We evaluate our method on real-world datasets, and experiments show that it brings substantial improvements on RAF-DB, FERPlus, AffectNet, and SFEW compared with other methods.