1 Introduction

Facial expression, as the most intuitive signal for humans to convey social information, has become a research hotspot in the field of human–computer interaction (HCI). Both physical states and inner thoughts can be inferred through the analysis of expression variation. In previous research, various approaches have been proposed to address the problem of facial expression recognition (FER) [1,2,3,4]. However, most existing works focus on the recognition of frontal or near-frontal facial expressions, with relatively few studies on pose variation. Nevertheless, in real-world scenarios, the captured facial images are largely determined by the angular position of the camera, which leads to rather unstable recognition accuracy [5,6,7,8]. Therefore, how to effectively extract pose-invariant features from such images is a very challenging and meaningful task.

In the past few decades, several effective feature extraction techniques have been proposed for pose-invariant expression recognition. According to the research route, these techniques can be roughly classified into traditional methods and deep learning-based methods. In traditional methods, facial images are usually represented by geometric feature models or cropped into different regions of interest (ROIs). For example, Zhang et al. [9] used pre-trained Active Appearance Models (AAMs) to extract the positions of facial points, and then trained each set of feature points through a specific model for pose-invariant FER. Zheng et al. [10] utilized 83 landmark points and their surrounding regions to represent facial expressions in different poses, and then extracted SIFT features for expression classification. In [11], the authors divided the multi-view facial images into a set of sub-blocks of the same size and extracted LBP features from each block for FER. Similarly, Zhang et al. [12] first presented a spatially coherent feature learning method for pose-invariant FER (SC-PFER), which normalized the expressions and poses to the same horizontal and pitch angles, then extracted a sequence of key regions for unsupervised feature learning, and finally used the extracted regions for FER. All these methods can achieve good results, but in practical applications they require an indispensable pre-processing step before feature extraction.

When using deep learning-based methods, to extract the regions of interest more accurately, numerous studies adopt multi-channel and multi-model feature learning to improve the representation ability of CNNs. As shown in Fig. 1, Liu et al. [13] presented a multi-channel pose-aware convolutional neural network (MPCNN) for multi-view FER, in which channel-M1, channel-M2 and channel-M3 extract the whole facial region, the eye region and the mouth region, respectively, and these regions are then provided to the classifier for expression recognition. Similarly, Liu et al. [14] designed a multi-channel convolutional network for pose-invariant FER, whose feature extraction part includes three sub-CNNs that learn different regions of interest (ROIs) of expressions; the fused features are then fed into pose-specific CNN operations to enhance the high-level feature representation. Liu et al. [15] designed a multi-channel network for pose-invariant FER, DML-Net, which is composed of three parallel channel networks that learn global and local features from different facial regions and then integrate them for FER; the accuracies on the KDFE, BU-3DFE and Multi-PIE databases are 88.2%, 83.5% and 93.5%, respectively. Moreover, in [16], the authors used two different channels to extract image features and employed fixed loss weighting parameters to enhance the accuracy of expression recognition. Based on this method, Zheng et al. [17] added adaptive dynamic weights (ADW) to different channels to filter useful information, which not only reduced the chance of over-fitting but also improved the training efficiency of the network.

Fig. 1

The main steps of multi-channel facial expression recognition

Although both traditional and deep learning-based methods perform well in reducing the influence of occlusion and pose variation, several shortcomings remain. In the traditional setting, these methods generally require manually cropping out a large number of ROIs, which hinders the construction of an automatic expression recognition system; in particular, geometric feature models that depend heavily on the precise localization of feature points greatly limit the capabilities of subsequent feature extraction and representation. In the deep learning setting, multi-channel multi-model feature learning methods need not only to consider the features of each region but also to pay attention to the impact of each region's loss function on recognition accuracy, which usually results in a convolutional neural network that is more complex than a traditional end-to-end network. Moreover, using ROIs to represent facial images in a sparse pattern may not capture the original meaning of the expressions completely and precisely.

In this paper, the overall information of the pose-variant expression image is exploited, so the cropping of ROIs and the calibration of geometric feature points are avoided, and the automatic operation of the expression recognition system is well preserved. These benefits are brought by the Squeeze-and-Excitation (SE) block [20], which dynamically recalibrates the channel-wise features in each convolutional layer regardless of the different feature maps it contains, aiming to enhance the representation ability of the network on useful channels and suppress the role of useless ones. Following this technique, Ma et al. [18] proposed an optimized neural network based on ResNet18 and SE blocks for FER, embedding the SE module into the ResNet model, which not only reduced the number of parameters but also improved the information flow through the network layer by layer. Li et al. [19] presented a Slide-Patch and Whole-Face Attention model with SE blocks (SPWFA-SE) for multi-view FER in the wild, in which SE blocks are used as attention modules to train the weights of pre-trained patches of each channel, further filtering out salient features from multi-view facial images. Inspired by [18, 19], and in order to accommodate different views, this paper proposes a soft thresholding multi-channel squeeze-and-excitation (ST-SE) block for pose-invariant FER. In each ST-SE block, the extracted feature maps are flattened by global average pooling (GAP) and then sent into the SE module. The threshold parameters are obtained by multiplying the SE training parameters by the absolute-value GAP, which can be regarded as a specific self-attention function aiming at filtering the salient features in the current view. The main contributions of this paper are summarized as follows:

1. A soft thresholding SE (ST-SE) block for pose-invariant FER is designed. In addition to the SE operation, a global average pooling (GAP) layer is added to the ST-SE block. The GAP operation provides the average value of each channel of the feature map, which forces the network to pay more attention to the features in the current view.

2. Multiplying the SE output by the absolute-value GAP is regarded as a self-attention mechanism, which can not only extract salient feature information but also reduce the influence of pose variation on recognition accuracy.

3. To illustrate the effectiveness of the designed ST-SE block, ResNet50 is used as the backbone architecture, and the SE and ST-SE blocks are embedded into the deep architecture as nonlinear transformation layers, respectively.

4. Extensive experiments are conducted on four public pose-invariant datasets. As shown in Fig. 3, they include both controlled and real-world scenarios, i.e., BU-3DFE, Multi-PIE, Pose-RAF-DB and Pose-AffectNet. In addition, the performance of the SE and ST-SE blocks is compared with some previous pose-invariant FER methods, and the experiments show that the ST-SE block designed in this paper is superior.

Fig. 2

An overview of the proposed ST-SE-ResNet block. a The ResNet block is used to extract feature maps. b The SE block is used to extract prominent features from different channels. c The ST-SE block applies a soft thresholding operation that emphasizes prominent features in the current layer. d A basic ST-SE-ResNet block. \(\tilde{x}\) and \(\alpha\) denote the candidate feature maps when the threshold is determined

The remainder of this paper is organized as follows: Sect. 2 introduces the related work on pose-invariant FER, Sect. 3 presents the proposed method in detail, Sect. 4 reports the experimental results and analysis, and Sect. 5 gives the conclusions.

2 Related work

ResNet and the ST-SE block both contain similar basic components, including convolutional layers, batch normalization and rectified linear units, which are generally considered the essential components of convolution operations. In addition, global average pooling (GAP), fully connected layers and the cross-entropy loss are indispensable ancillary operations that are usually utilized in deep learning for classification tasks. Next, this paper introduces these components.

2.1 Basic components

The convolution layer (Conv) is a core component that applies the convolution operation to the input image to extract feature maps and then transmits them to the next layer. Each convolutional layer consists of a number of neurons with trainable weights and biases, and each feature map is computed by sliding a convolutional kernel over the input channels with a fixed stride, which can be defined as follows:

$$ x_{j}^{l + 1} = \sum\limits_{{i \in M_{j} }} {x_{i}^{l} * k_{ij}^{l} + b_{j}^{l} } $$
(1)

where \(x_{i}^{l}\) denotes the input feature map of the \(i{\text{th}}\) channel, \(x_{j}^{l + 1}\) denotes the output feature map of the \(j{\text{th}}\) channel, \(k\) denotes the weight matrix of the convolutional kernel, \(b\) denotes the bias, and \(M_{j}\) denotes the set of input feature maps used to compute the \(j{\text{th}}\) output map.

As a feature normalization method, batch normalization (BN) is usually inserted after a convolution layer to accelerate the convergence of network training [21]. BN reduces the internal covariate shift that arises during the training of a deep network. Especially in the pose-invariant context, the distribution of the training data usually varies with the view. The BN operation normalizes the activations to a fixed distribution during training and adjusts the feature mapping within a reasonable range, which is essential in very deep networks. The calculation steps can be expressed as follows:

$$ \mu_{{\left( {N_{batch} } \right)}} = \frac{1}{m}\sum\limits_{i = 1}^{m} {x_{i} } $$
(2)
$$ \sigma_{{\left( {N_{batch} } \right)}}^{2} = \frac{1}{m}\sum\limits_{i = 1}^{m} {\left( {x_{i} - \mu_{{\left( {N_{batch} } \right)}} } \right)}^{2} $$
(3)
$$ \hat{x}_{i} = \frac{{x_{i} - \mu_{{\left( {N_{batch} } \right)}} }}{{\sqrt {\sigma_{{N_{batch} }}^{2} + \varepsilon } }} $$
(4)
$$ y_{i} = \gamma \hat{x}_{i} + \beta $$
(5)

where \(x_{i}\) and \(y_{i}\) denote the input and output feature maps in the current batch, respectively, m denotes the batch size, \(\gamma\) and \(\beta\) denote the scale and shift factors, respectively, and \(\varepsilon\) denotes a small constant added to avoid division by zero when the variance is zero.

The rectified linear unit (ReLU) is the other indispensable component of convolution operations. It looks and behaves like a linear function for positive inputs, but is non-saturating and nonlinear, which enables complex mappings of the input feature maps to be learned. For any positive input x, the output is the same value; when the input is negative, the output is forced to 0. This can be expressed as \(f\left( x \right) = \max \left( {0,x} \right)\), and it helps alleviate the vanishing and exploding gradient problems when parameters are trained across different layers.
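For concreteness, the following is a minimal PyTorch sketch of the Conv-BN-ReLU unit described above, corresponding to Eqs. (1)-(5) and the ReLU activation; the channel sizes and kernel size are illustrative assumptions rather than the exact settings used later in this paper.

```python
import torch
import torch.nn as nn

# A minimal Conv-BN-ReLU unit corresponding to Eqs. (1)-(5) and f(x) = max(0, x).
# The 64 input / 128 output channels and 3x3 kernel are illustrative choices,
# not the exact configuration used in this paper.
conv_bn_relu = nn.Sequential(
    nn.Conv2d(in_channels=64, out_channels=128, kernel_size=3,
              stride=1, padding=1, bias=True),   # Eq. (1): x * k + b
    nn.BatchNorm2d(num_features=128, eps=1e-5),  # Eqs. (2)-(5): normalize, then scale and shift
    nn.ReLU(inplace=True),                       # f(x) = max(0, x)
)

x = torch.randn(8, 64, 56, 56)   # a dummy batch of feature maps
y = conv_bn_relu(x)              # output shape: (8, 128, 56, 56)
```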

2.2 Global average pooling

Global average pooling (GAP) is another indispensable operation, which computes the average value of each channel of the feature map [22]. Similar to fully connected (FC) layers, it is usually applied as the last layer of the convolutional structure. However, since it has no parameters to be optimized, GAP uses far fewer weights than an FC layer, which reduces the possibility of overfitting. In addition, GAP is more robust to spatial shifts, which provides a unique advantage for pose-invariant FER in complex environmental backgrounds.
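A minimal sketch of GAP in PyTorch is given below; `nn.AdaptiveAvgPool2d(1)`, or equivalently an explicit mean over the spatial dimensions, reduces each channel to a single average value. The tensor shapes are illustrative.

```python
import torch
import torch.nn as nn

feat = torch.randn(8, 512, 7, 7)            # (batch, channels, H, W)

gap = nn.AdaptiveAvgPool2d(output_size=1)   # global average pooling
pooled = gap(feat).flatten(1)               # shape: (8, 512), one value per channel

# Equivalent explicit form: average over the spatial dimensions only
pooled_alt = feat.mean(dim=(2, 3))
assert torch.allclose(pooled, pooled_alt, atol=1e-6)
```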

2.3 Fully connection layer

The fully connected (FC) layer is similar to a multi-layer perceptron, in which each neuron is fully connected to the previous layer. The number of neurons fed into the FC layer is determined by the preceding convolution kernels, and the FC operation flattens the input into a single vector for the next layer. Therefore, the FC layer contains a large number of parameters that capture the characteristics and regularities of the sample data. For classical convolutional models, i.e., VGG, GoogLeNet and ResNet, one to three FC layers are generally sufficient for complex image classification problems.
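A minimal sketch of the flatten-then-FC step follows; the 2048-dimensional input matches the ResNet50 feature width, while the six output classes are an illustrative assumption.

```python
import torch
import torch.nn as nn

# Flatten the pooled feature map and map it to class scores.
# 2048 input features (the ResNet50 output width) and 6 classes are illustrative.
classifier = nn.Sequential(
    nn.Flatten(),        # (batch, 2048, 1, 1) -> (batch, 2048)
    nn.Linear(2048, 6),  # a single FC layer producing class logits
)

pooled = torch.randn(8, 2048, 1, 1)  # e.g., the output of global average pooling
logits = classifier(pooled)          # shape: (8, 6)
```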

2.4 Loss function

With respect to the loss function, cross-entropy is one of the most widely used losses in FER tasks. Before computing the cross-entropy, a softmax function is usually applied to map the outputs into the range (0, 1). It can be defined as follows:

$$ y_{j} = \frac{{e^{{x_{j} }} }}{{\sum\nolimits_{i = 1}^{{N_{class} }} {e^{{x_{i} }} } }} $$
(6)

where \(x_{j}\) denotes the jth input of the softmax function, \(y_{j}\) denotes the predicted probability of belonging to the jth class, and \(N_{class}\) denotes the number of classes. The cross-entropy loss function can then be expressed as:

$$ E\left( {p\left( y \right),q\left( y \right)} \right) = - \sum\nolimits_{j = 1}^{{N_{class} }} {p_{j} \left( y \right)\log (q_{j} (y))} $$
(7)

where \(p_{j} \left( y \right)\) denotes the target (ground-truth) distribution and \(q_{j} \left( y \right)\) denotes the predicted probability of belonging to the jth class.
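The sketch below reproduces Eqs. (6) and (7) in PyTorch; note that `nn.CrossEntropyLoss` applies the softmax internally, so it takes raw logits. The batch size and class count are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

num_classes = 6                        # e.g., the six basic expressions
logits = torch.randn(8, num_classes)   # raw network outputs x_j
targets = torch.randint(0, num_classes, (8,))

# Eq. (6): softmax turns logits into predicted probabilities y_j in (0, 1)
probs = F.softmax(logits, dim=1)

# Eq. (7): cross-entropy between the one-hot target p(y) and prediction q(y).
# nn.CrossEntropyLoss applies log-softmax internally, so it takes raw logits.
loss = nn.CrossEntropyLoss()(logits, targets)

# Equivalent manual computation with one-hot targets (mean over the batch)
manual = -torch.log(probs[torch.arange(8), targets]).mean()
assert torch.allclose(loss, manual, atol=1e-6)
```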

3 Proposed method

From the above, it can be seen that both the residual network and the SE block are composed of these basic elements. In this section, this paper presents in detail the design of the soft thresholding multi-channel SE residual network (ST-SE-ResNet). As shown in Fig. 2, this study first introduces the residual network, then describes the SE block, next describes the ST-SE block, and finally introduces the ST-SE-ResNet block.

3.1 Residual building blocks

ResNet is a classical network model with an "identity shortcut" connection, which has attracted wide attention from researchers [23]. As shown in Fig. 2a, the residual building block (RBB) consists of two BNs, two ReLUs, two Conv layers and an identity shortcut. The key operation is the identity shortcut, which effectively back-propagates the gradient of the loss function to earlier layers and makes ResNet superior to traditional deep learning methods. The residual block is described as:

$$ F\left( X_{res} \right) = H\left( X_{res} \right) - X_{res} $$
(8)

where \(X_{res}\) denotes the input feature map, \(H\left( X_{res} \right)\) denotes the desired mapping and \(F\left( X_{res} \right)\) denotes the output feature maps of one residual module.
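The following is a minimal PyTorch sketch of the residual building block in Fig. 2a; the (BN, ReLU, Conv) ordering and the channel count are assumptions used for illustration, not the exact configuration of the paper.

```python
import torch
import torch.nn as nn

class BasicResidualBlock(nn.Module):
    """Minimal sketch of the residual block in Fig. 2a:
    two (BN, ReLU, Conv) groups plus an identity shortcut.
    The ordering and channel count are illustrative assumptions."""

    def __init__(self, channels: int = 64):
        super().__init__()
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        identity = x                               # identity shortcut
        out = self.conv1(self.relu(self.bn1(x)))   # residual mapping F(x), Eq. (8)
        out = self.conv2(self.relu(self.bn2(out)))
        return out + identity                      # H(x) = F(x) + x
```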

3.2 Squeeze-and-excitation block

As mentioned in [18,19,20], a multi-channel SE block was implemented to improve feature representation. The function of the SE block is to learn channel-wise feature information, which enhances the representation ability of a single basic block. As shown in Fig. 2b, for each input channel, a weight can be trained by a basic SE block. Here we assume \(X = \left\{ x_{1}, x_{2}, \ldots, x_{n} \right\}\) is the input feature map of the SE block and \(Z = \left\{ z_{1}, z_{2}, \ldots, z_{n} \right\}\) is the corresponding output. The Squeeze operation is described as:

$$ z_{c} = F_{sq} \left( {x_{c} } \right) = \frac{1}{W \times H}\sum\limits_{i = 1}^{W} {\sum\limits_{j = 1}^{H} {x_{c} \left( {i,j} \right),c = 1,2, \cdot \cdot \cdot ,n} } $$
(9)

where W and H denote the width and height of the input feature maps of the SE block, \(z_{c}\) denotes the output of the current channel, and n denotes the number of channels in the SE block.

To enhance the representation ability of the current convolutional layer, the Excitation operation is described as:

$$ s_{c} = F_{ex} \left( {z_{c} ,\omega } \right) = \sigma \left( {f\left( {z_{c} ,\omega } \right)} \right) = \sigma \left( {\omega_{2} \delta \left( {\omega_{1} z_{c} } \right)} \right) $$
(10)

where \(\omega_{1}\) and \(\omega_{2}\) denote the weight matrices of the two FC layers, and \(\delta\) and \(\sigma\) denote the ReLU and sigmoid functions, respectively. Finally, the Scale operation recalibrates the input feature maps channel by channel:

$$ \tilde{x}_{c} = F_{scale} \left( {x_{c} ,s_{c} } \right) = s_{c} x_{c} $$
(11)

where \(\tilde{X} = \left( \tilde{x}_{1}, \tilde{x}_{2}, \ldots, \tilde{x}_{n} \right)\) and \(F_{scale} \left( {x_{c} ,s_{c} } \right)\) denotes the channel-wise multiplication between the scaling parameter \(s_{c}\) and the feature map \(x_{c} \in {\mathbb{R}}^{H \times W}\).
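A minimal PyTorch sketch of the SE block defined by Eqs. (9)-(11) is given below; the reduction ratio r = 16 follows the original SE paper [20] and is an assumption here, since this paper does not state the value it uses.

```python
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    """Squeeze-and-Excitation, Eqs. (9)-(11).
    The reduction ratio r=16 follows the original SE paper [20];
    the ratio actually used in this paper is not stated, so treat it as an assumption."""

    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),  # omega_1
            nn.ReLU(inplace=True),                       # delta
            nn.Linear(channels // reduction, channels),  # omega_2
            nn.Sigmoid(),                                # sigma
        )

    def forward(self, x):
        b, c, _, _ = x.shape
        z = x.mean(dim=(2, 3))          # Squeeze, Eq. (9): GAP over H x W
        s = self.fc(z)                  # Excitation, Eq. (10)
        return x * s.view(b, c, 1, 1)   # Scale, Eq. (11): channel-wise multiplication
```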

3.3 Soft thresholding SE block

The designed soft thresholding SE block (ST-SE) is a variant of SE; its main difference is that a specific threshold is learned for each channel of the feature map, meaning that each channel can learn a specialized threshold to refine the significant feature information in the current layer. As shown in Fig. 2c, the ST-SE-ResNet block contains a special module in which GAP is used to flatten the feature map into a 1D vector. Next, the 1D vector is sent to two fully connected layers to obtain a training parameter; this operation is similar to the SE block [20], and the number of output units is equal to the number of channels. Finally, the sigmoid function is used to keep the training parameters within the range of (0, 1), and the operation is described as follows:

$$ \alpha_{c} = \frac{1}{{1 + e^{{ - x_{c} }} }} $$
(12)

where \(x_{c}\) denotes the output of the two fully connected layers, and \(\alpha_{c}\) denotes the \(c{\text{th}}\) training parameter. Next, the training parameter \(\alpha_{c}\) is multiplied by the channel-wise average of \(\left| x \right|\) to obtain the threshold. The inspiration for this design is the fact that the threshold parameters need to be positive and not too large. In the pose-invariant FER setting, the view has a very obvious influence on recognition accuracy, especially at extreme angles. In order to reduce the impact of posture and background, the threshold values in an ST-SE-ResNet block are calculated as follows:

$$ \tau_{c} = \alpha_{c} \cdot \mathop {average}\limits_{i,j} \left| {x_{i,j,c} } \right| $$
(13)

where \(\tau_{c}\) denotes the threshold of the cth channel, and i, j and c denote the indices of the width, height and channel of the feature map x, respectively.
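The sketch below illustrates the ST-SE operation of Eqs. (12)-(13) followed by soft thresholding. The text does not state explicitly whether the FC layers receive the GAP of x or of |x|; the sketch uses the channel-wise average of |x|, which is also the second factor in Eq. (13). The reduction ratio of the two FC layers and the shrinkage formula sign(x)·max(|x|−τ, 0), the standard soft-thresholding function implied by the block's name, are assumptions.

```python
import torch
import torch.nn as nn

class STSE(nn.Module):
    """Soft-thresholding SE block, Eqs. (12)-(13): a learned coefficient
    alpha_c in (0, 1) scales the channel-wise average of |x| to give the
    threshold tau_c, which is then applied as soft thresholding.
    Assumptions: the FC layers take the average of |x| as input, the
    reduction ratio r=16 is illustrative, and the shrinkage formula is
    the standard soft-thresholding function."""

    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),                            # Eq. (12): alpha_c in (0, 1)
        )

    def forward(self, x):
        b, c, _, _ = x.shape
        abs_gap = x.abs().mean(dim=(2, 3))           # average_{i,j} |x_{i,j,c}|
        alpha = self.fc(abs_gap)                     # Eq. (12)
        tau = (alpha * abs_gap).view(b, c, 1, 1)     # Eq. (13): tau_c = alpha_c * avg|x|
        # Soft thresholding: shrink |x| by tau and keep the sign
        return torch.sign(x) * torch.clamp(x.abs() - tau, min=0.0)
```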

To demonstrate the practical use of the proposed ST-SE module, it is vital to compare it under the same network structure and parameter settings. Considering the diversity of the expression images within the same view, this paper uses ResNet50 as the basic network architecture and embeds the SE and ST-SE modules into the network, respectively, as shown in Fig. 2d. The architectures of ResNet50, SE-ResNet50 and ST-SE-ResNet50 are listed in Table 1.
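As a rough illustration of how such a module might be attached to ResNet50, the sketch below wraps every torchvision Bottleneck with the STSE module sketched above, applied to the block's output; the torchvision backbone, the post-block placement and the six-class head are assumptions for illustration, whereas the paper embeds the module inside each block as specified by Fig. 2d and Table 1.

```python
import torch.nn as nn
from torchvision.models import resnet50
from torchvision.models.resnet import Bottleneck

class BottleneckWithSTSE(nn.Module):
    """Illustrative wrapper: applies the STSE module sketched above to the
    output of an existing torchvision Bottleneck. The paper embeds ST-SE
    inside the block (Fig. 2d); wrapping the output is a simplification."""

    def __init__(self, block: Bottleneck):
        super().__init__()
        self.block = block
        self.st_se = STSE(block.bn3.num_features)   # output channels of the bottleneck

    def forward(self, x):
        return self.st_se(self.block(x))

def build_st_se_resnet50(num_classes: int = 6) -> nn.Module:
    # weights=None requires torchvision >= 0.13; older versions use pretrained=False
    model = resnet50(weights=None)
    for layer_name in ("layer1", "layer2", "layer3", "layer4"):
        layer = getattr(model, layer_name)
        wrapped = nn.Sequential(*[BottleneckWithSTSE(b) for b in layer])
        setattr(model, layer_name, wrapped)
    model.fc = nn.Linear(model.fc.in_features, num_classes)
    return model
```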

Table 1 The parameters of ResNet50 (left), SE-ResNet50 (middle) and ST-SE-ResNet50 (right); fc denotes the two fully connected layers in an SE-ResNet50 basic block

4 Experimental results

To evaluate the effectiveness of the designed network, this paper performed extensive experiments on four well-known facial expression databases: BU-3DFE [24] and Multi-PIE [25], which were collected in a controlled environment, as well as RAF-DB [26] and AffectNet [27], which were captured in real-world scenarios. Some samples of these databases are shown in Fig. 3. Since the BU-3DFE and Multi-PIE databases do not provide an official split into training and testing sets, a fivefold cross-validation protocol was employed on these databases. The designed ST-SE-ResNet50 framework was implemented in PyTorch, and the learning rate and batch size were set to 0.000001 and 40, respectively. The input images were resized to 224 × 224, because larger images allow the network to extract more salient features. All experiments were run on an NVIDIA GeForce GTX 1660 Super GPU under Windows 10 (64-bit).
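For reference, a hedged sketch of the training setup described above is given below (224 × 224 inputs, batch size 40, learning rate 1e-6); the optimizer, normalization statistics, dataset path and the plain ResNet50 stand-in are assumptions not specified in the paper.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader
from torchvision import datasets, transforms
from torchvision.models import resnet50

# Stated settings: 224 x 224 inputs, batch size 40, learning rate 1e-6,
# fivefold cross-validation where no official split exists. The Adam optimizer,
# ImageNet normalization statistics and the dataset path are assumptions.
transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

train_set = datasets.ImageFolder("path/to/train_fold", transform=transform)  # hypothetical path
train_loader = DataLoader(train_set, batch_size=40, shuffle=True, num_workers=4)

device = "cuda" if torch.cuda.is_available() else "cpu"
# A plain ResNet50 stands in here; in practice the ST-SE-ResNet50 sketched in
# Sect. 3.3 would replace this backbone.
model = resnet50(weights=None)
model.fc = nn.Linear(model.fc.in_features, 6)   # six expression classes
model = model.to(device)

criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-6)

model.train()
for images, labels in train_loader:
    images, labels = images.to(device), labels.to(device)
    optimizer.zero_grad()
    loss = criterion(model(images), labels)
    loss.backward()
    optimizer.step()
```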

Fig. 3

Some examples of the datasets (BU3DFE-E1, BU3DFE-E2, Multi-PIE, Pose-RAF-DB and Pose-AffectNet)

4.1 Experiments with BU-3DFE dataset

This paper first tested the designed network on the BU-3DFE dataset, which is widely used in pose-invariant FER. A total of 100 subjects are involved, and each of them performed 6 typical expressions, i.e., anger (AN), disgust (DI), happiness (HA), fear (FE), sadness (SA) and surprise (SU), at 4 different intensities. Before using the original dataset, the 3D expression models are typically rotated to different views to generate 2D texture images. Among the existing pose-invariant FER methods, two mainstream extended 2D facial expression image sets are widely adopted. Next, this paper performs experiments on these two extended pose-invariant datasets and compares the results with some previous methods.

The first extended dataset of BU-3DFE (BU3DFE-E1) contains \(5 \times 4 \times 6 \times 100 = 12000\) 2D texture expression images at 5 different yaw angles (0°, 30°, 45°, 60°, 90°) and 4 different intensities. The corresponding expression images are shown in Fig. 3a. Many previous works [10, 11, 30,31,32] adopted the BU3DFE-E1 dataset for pose-invariant FER experiments and achieved remarkable results. This paper evaluates the three network structures on the BU3DFE-E1 dataset and analyzes the reasons for the results.

As shown in Table 2, this study compares the results of the ST-SE-ResNet50 method with SE-ResNet50, ResNet50 and some previous works on the BU3DFE-E1 dataset. It is worth mentioning that the BU3DFE-E1 dataset contains not only 5 different yaw angles but also 4 different intensities. The SE-ResNet50 network achieved 75.9% recognition accuracy, slightly better than the basic ResNet50, while the ST-SE-ResNet50 model further improved the accuracy to 76.2%. In particular, the proposed method outperformed several frequently referenced pose-invariant recognition algorithms, namely 2D JFDNN (72.5%), CNN (68.9%) and VGGNet16 (70.1%), and was slightly better than DBN (73.5%) and LLCBL (74.6%). Compared with the classical local binary patterns (LBP), the designed network was 5.1% higher than its best reported accuracy.

Table 2 The comparison with different methods on the BU3DFE-E1 dataset

Table 3 lists the specific recognition accuracies of the 6 typical expressions under the five yaw angles, and Fig. 4 shows the corresponding confusion matrices. From Table 3, it is easy to see that the recognition accuracy varies with the yaw angle: the best yaw angle is 60°, with an accuracy of 78.6%, while the worst is 90°, with an accuracy of 73.6%. In addition, the recognition accuracies of the 6 basic expressions also differ. Happiness and surprise, as the most distinctive expressions, are usually the easiest to recognize at all yaw angles, while fear is the most challenging expression, with recognition accuracies below 63%. Figure 4 shows the confusion matrix for each yaw view; it can be seen that anger and sadness are easily confused, which explains why the recognition accuracies of these two expressions are low. Overall, the misclassification rates of fear are relatively higher than those of the other expressions, which leads to fear having the lowest recognition accuracy among the six typical expressions.

Table 3 Average recognition accuracies under different yaw angles on BU3DFE-E1 dataset
Fig. 4

The confusion matrices on the BU3DFE-E1 dataset. a–e denote the confusion matrices of the five yaw angles, and f denotes the overall confusion matrix

The second extended dataset of BU-3DFE (BU3DFE-E2) contains \(7 \times 5 \times 6 \times 100 = 21000\) 2D texture expression images with 7 pan angles (0°, ±15°, ±30°, ±45°) and 5 tilt angles (0°, ±15°, ±30°). The corresponding expression images are shown in Fig. 3b. Compared with the BU3DFE-E1 dataset, BU3DFE-E2 pays more attention to the impact of different views on expression recognition. For example, the BU3DFE-E1 dataset contains only yaw rotations, whereas in BU3DFE-E2 the pan angles range from −45° to +45° and the tilt angles vary from −30° to +30°. Besides, the BU3DFE-E2 dataset comprises only the 4th intensity level of the 2D texture expression images, whereas the BU3DFE-E1 dataset contains all intensity levels. Some state-of-the-art methods [34, 35] also adopt the BU3DFE-E2 dataset for pose-invariant FER experiments. This paper evaluates the proposed method with all these expression images grouped by the 7 pan angles.

In the same way as for the BU3DFE-E1 dataset, this paper compares the method with previous works [9, 30, 32, 33] and presents the results in Table 4. It can be seen that ST-SE-ResNet50 achieves 83.7%, while the best result among the state-of-the-art methods is only 81.2%, clearly lower than that of the algorithm in this paper. Moreover, the recognition accuracy of ST-SE-ResNet50 is 4.1% higher than that of the basic ResNet50, which demonstrates that the designed method also performs well under mixed multi-view conditions. Table 5 lists the specific recognition accuracies under different angles: the best angle for expression classification is −30°, with an accuracy of 85.17%, and the worst is 45°, with an accuracy of 82.83%. In addition, the average recognition results of each expression are roughly consistent with those on the BU3DFE-E1 dataset. As shown in Fig. 5h, fear is still the most challenging expression, and anger and sadness remain the most easily confused, but the overall recognition accuracy of each expression is significantly improved.

Table 4 The comparison with different methods on the BU3DFE-E2 dataset
Table 5 Average recognition accuracies under different angles on the BU3DFE-E2 dataset
Table 6 The comparison with different methods on the Multi-PIE dataset
Fig. 5

The confusion matrices on the BU3DFE-E2 dataset. a–g denote the confusion matrices of the 7 pan angles, and h denotes the overall confusion matrix

4.2 Experiments with multi-PIE dataset

The Multi-PIE dataset, captured under nearly real-world conditions, contains 755,370 facial images of 337 subjects. Unlike the BU-3DFE database, it includes not only different poses but also unbalanced illumination and background variations. In this work, the same experimental setting for facial expression recognition as reported in [11, 28,29,30, 34] was adopted, in which only the 100 subjects present in all four recording sessions were selected. For each subject, six different emotions (disgust, neutral, scream, smile, squint and surprise) and 7 yaw angles (0°, 15°, 30°, 45°, 60°, 75° and 90°) were selected, giving a total of \(7 \times 6 \times 100 = 4200\) images. Examples of six subjects, each with 42 facial images, can be found in Fig. 3c.

Compared with ResNet50 and SE-ResNet50 over the 7 yaw angles, whose average recognition accuracies are 80.0% and 83.1%, respectively, the method proposed in this paper reaches 86.1%, which is higher than the other methods. Table 7 lists the specific recognition results under the different facial yaw angles; the accuracy differs across views, and the best results are obtained between 0° and 30°, namely 88.1%, 87.3% and 89.0%, respectively. Figure 6 shows the confusion matrices for each yaw angle and the overall result. It can be seen from Table 7 and Fig. 6 that, among the six typical expressions, scream and surprise are recognized most accurately, with average recognition rates of 96.4% and 92.6%, respectively. On the contrary, squint and disgust, as the most difficult expressions to identify, have average recognition accuracies below 80%. Moreover, the overall confusion matrix shows that squint and disgust are more prone to misclassification, which is most likely the reason they achieve low recognition accuracies.

Table 7 Average recognition accuracies under different angles on the Multi-PIE dataset
Fig. 6

The confusion matrices on the Multi-PIE dataset. a–g denote the confusion matrices of the 7 yaw angles, and h denotes the overall confusion matrix

4.3 Experiments with pose-RAF-DB and pose-AffectNet dataset

To evaluate the performance of the model under pose variation in real-world scenarios, Wang et al. [35] collected two sub-datasets, namely Pose-RAF-DB and Pose-AffectNet, from the test sets of RAF-DB and AffectNet for facial expression recognition. Facial images with head pitch or yaw angles larger than 30° and 45° were selected as pose subsets, and 7 typical expressions (anger (AN), disgust (DI), happiness (HA), fear (FE), neutral (NE), sadness (SA) and surprise (SU)) were considered. Pose-RAF-DB consists of 12,271 facial images for training, with 1,248 and 558 facial images selected for the test subsets at >30° and >45°, respectively, while Pose-AffectNet consists of 283,901 facial images for training, with 1,948 and 985 facial images selected for the test subsets at >30° and >45°, respectively. In this work, the same experimental setting as in [35, 37, 38] was adopted. It should be mentioned that both Pose-RAF-DB and Pose-AffectNet treat facial images rotated in both directions as one pose subset, which enriches the database and increases the difficulty of classification, as shown in Fig. 3d.

This paper conducted experiments on the Pose-RAF-DB and Pose-AffectNet databases, and the experimental results are listed in Table 8. The recognition accuracies of ST-SE-ResNet50 on the Pose-RAF-DB database were 85.00% (> 30°) and 84.42% (> 45°), while those on the Pose-AffectNet database were 56.57% (> 30°) and 57.00% (> 45°), respectively. Figure 7 shows the corresponding confusion matrices, from which it can be seen that happiness is the easiest expression to recognize in both databases; sadness is relatively easy to identify in the Pose-RAF-DB dataset, fear is relatively easy to identify in the Pose-AffectNet dataset, and disgust is the most difficult expression to classify. Disgust tends to be confused with neutral in the Pose-RAF-DB dataset and with anger in the Pose-AffectNet dataset, which reduces the expression recognition accuracy in these two datasets.

Table 8 Average recognition accuracies under different angles on the Pose-RAF-DB and Pose-AffectNet datasets
Fig. 7

The confusion matrices on the Pose-RAF-DB and Pose-AffectNet datasets. a, b denote the confusion matrices on the Pose-RAF-DB dataset, and c, d denote those on the Pose-AffectNet dataset

4.4 Experimental results analysis

From the recognition results on the four pose-invariant datasets, it can be seen that the multi-channel soft thresholding SE residual network achieves accuracy comparable to or better than that of previous methods. Compared with the original ResNet50, the accuracy of the method in this paper on the BU3DFE-E1, BU3DFE-E2, Multi-PIE, Pose-RAF-DB and Pose-AffectNet datasets improves by 1.6%, 4.1%, 5.1%, (0.44% (> 30°), 0.15% (> 45°)) and (0.26% (> 30°), 0.08% (> 45°)), respectively. Two reasons can explain this improvement. The first is that the squeeze-and-excitation (SE) block serves as a bridge between different channels, improving the quality of the designed network by exploiting the interdependencies between the channels of its convolutional features. The second is that GAP and the SE block can be regarded as an attention mechanism, whose task is to learn global information to selectively emphasize valuable features from the current view and suppress the influence of intensity, pose, background and so on, which is necessary for pose-invariant FER.

Regarding databases from different scenarios, we also compared the performance of ResNet50, SE-ResNet50 and ST-SE-ResNet50 in controlled and real-world settings. The detailed results are provided in Fig. 8a–d, and the corresponding average recognition accuracies can be found in Tables 4, 5, 6, 7 and 8. For the controlled setting, it can be observed that, under the influence of different views, the performance of the three models is globally consistent: ST-SE-ResNet50 is the highest, followed by SE-ResNet50, and ResNet50 is the lowest. The standard deviation (SD) of the fivefold cross-validation indicates that ST-SE-ResNet50 provides more stable accuracy than ResNet50 and SE-ResNet50. This phenomenon is even more pronounced on the Multi-PIE dataset, where the minimum SD of ResNet50 over the yaw angles is above 0.7, while those of SE-ResNet50 and ST-SE-ResNet50 are only 0.49 and 0.18, respectively, illustrating that the ST-SE block enhances the stability of the network structure and is more robust for pose-invariant FER. For the real-world settings, the recognition accuracies follow a similar trend in that ResNet50 is the lowest, followed by ST-SE-ResNet50 and then SE-ResNet50. ST-SE-ResNet50 performs slightly better than ResNet50 on the Pose-RAF-DB (0.44% (> 30°), 0.15% (> 45°)) and Pose-AffectNet (0.26% (> 30°), 0.08% (> 45°)) databases, but compared with SE-ResNet50 it is lower by (1.67% (> 30°), 2.75% (> 45°)) and (2.25% (> 30°), 1.79% (> 45°)), respectively. This result shows that the algorithm in this paper cannot achieve good results on databases with non-normalized poses.

Fig. 8

a–d The accuracies of ResNet50, SE-ResNet50 and ST-SE-ResNet50 on the BU3DFE, Multi-PIE, Pose-RAF-DB and Pose-AffectNet datasets

Regarding the influence of expression intensity, as shown in Tables 2 and 4, the recognition accuracy on the BU3DFE-E1 dataset is much lower than that on the second one, which can be attributed to the subtle deformations of the low-intensity expressions and the wider range of yaw angles. To illustrate the impact of intensity on facial expression recognition, the classification accuracy of ST-SE-ResNet50 on the BU3DFE-E1 dataset is shown in Fig. 9. As described in Sect. 4.1, the BU3DFE-E1 dataset contains four different expression intensities. It can be seen from Fig. 9 that the accuracy of expression recognition improves with the intensity level. At intensity levels III and IV, the textures of the six basic expressions are more obvious than those at low intensities, so the high-intensity expression images provide more discriminative representations than intensity levels I and II. This explains why the recognition rates of the three methods on the BU3DFE-E1 dataset are lower than those on the BU3DFE-E2 dataset.

Fig. 9

Influence of four intensities on BU3DFE-E1 dataset

The misclassifications observed in the experiments are closely correlated with the facial expression image textures. As described in [39], each type of expression can be expressed as a combination of similar texture types, and when two expressions contain the same type of textures they are more likely to be misclassified. As shown in Figs. 4f and 5h, anger and sadness have a high probability of being misclassified in the BU-3DFE dataset, while anger and squint have a high probability of being misclassified in the pose-invariant dataset. This may be due to the fact that these expressions share more similar texture types in their datasets, which can be found in Fig. 3a and c. When the texture types of the expressions are distinctive, the probability of misclassification is relatively low.

For different views, the best recognition results remain between −60° and 60°. In the experiments, when the view exceeds 60°, the recognition accuracy decreases sharply, especially for ResNet50. The reason is that as the view rotates, the main regions of interest (such as the eyes, mouth and chin) are gradually occluded, which reduces the recognition accuracy. In addition, as can be seen from Tables 3, 5 and 7, the optimal recognition angle is usually not 0°, but tends to concentrate on near-frontal views. Most frontal face images are symmetrical, that is, half of the image (or slightly more) can represent the characteristics of the entire expression, whereas the full frontal image often contains more redundant features than a near-frontal image. Therefore, a small yaw angle not only preserves the frontal facial features but also adds some detailed feature information from the side, which may be more conducive to the task of expression classification.

5 Conclusions

In this paper, an ST-SE-ResNet50 network based on ST-SE blocks was proposed for pose-invariant FER. Herein, GAP was employed to flatten the feature map into a 1D vector, and the flattened feature maps were then sent to the SE block to filter out salient information. Multiplying the SE output by the absolute-value GAP can be regarded as a self-attention mechanism, whose purpose is to force the network to pay more attention to the feature information in the current view and to reduce the influence of pose and occlusion on recognition accuracy. The proposed method was evaluated on four well-known datasets, i.e., BU-3DFE, Multi-PIE, Pose-RAF-DB and Pose-AffectNet, and the results indicate that it is superior to many previous methods in controlled scenarios. However, in real-world scenarios, especially for facial images with varying horizontal and pitch angles, the improvement in recognition accuracy over the backbone architecture is not obvious.