
1 Introduction

Automatic recognition of facial expressions, which plays a crucial role in various human-computer interaction systems [1] such as medical treatment and driver assistance, has become a popular research subject. The goal of facial expression recognition (FER) is to classify input images into seven basic expressions: neutral, happy, sad, surprised, afraid, disgusted, and angry. According to the type of input data, FER systems can be divided into image-based FER and video-based FER. Early FER methods mainly focused on image-based facial expressions. However, since facial expression is a dynamic process characterized by the interplay of muscle movements in different regions of the face, understanding the temporal sequence of expressions plays a more important role than classifying static images. As a result, video-based FER research has received increasing attention in recent years.

According to the data scenario, video-based FER datasets can be mainly divided into lab-controlled and in-the-wild. For the lab-controlled datasets, all video sequences are collected in a controlled laboratory environment, and the videos are relatively simple and free from occlusion. Representative datasets include CK+ [2], Oulu-CASIA [3], and MMI [4]. For the in-the-wild datasets (e.g., AFEW [5], Ferv39k [6], and DFEW [7]), video sequences are collected from real-world scenes, which are closer to natural facial events. Furthermore, the in-the-wild datasets are captured from thousands of subjects in complex scenes, which greatly increases the diversity of the data. Nowadays, the focus of video-based FER research has shifted from lab-controlled settings to the challenges of in-the-wild conditions.

Early methods for video-based FER were primarily based on hand-crafted features, such as LBP-TOP [8], STLMBP [9], and HOG-TOP [10]. In addition, Liu et al. [11] introduced a spatio-temporal manifold (STM) method to model video clips, and Liu et al. [12] used different Riemannian kernels to measure the similarity distance between sequences. In recent years, deep learning based methods have gradually replaced traditional methods. Among them, RNN-based methods perform better at capturing the temporal relationship between frames, and spatial self-attention [13] emerged as a powerful tool for guiding the extraction of image features and determining the importance of each local feature. However, these methods focus on limited attention features or relationships from a single perspective, thus neglecting the large variations of expression patterns across perspectives and the uncertainties in the data. 3D CNN-based methods [14] can learn both spatial and temporal features in a sequence, but fail to effectively utilize long-distance attention-dependent information to extract rich emotional features. Meanwhile, CNN-based methods require stacking multiple convolutional layers to enlarge the receptive field, which often leads to the loss of input information, increases the computational load, and may even cause vanishing gradients.

Motivated by the above shortcomings of existing methods, in this paper we propose a novel dynamic-static graph convolutional network (DSGCN) for video-based FER. DSGCN consists of a Static-Relational Graph (SRG) and a Dynamic-Relational Graph (DRG). Specifically, in SRG, our method first focuses on the static spatial features extracted from the input facial expression frames and then constructs a GCN over these features, using the features of each frame as vertices and the spatial similarities as edges. The constructed SRG thus strengthens the salient areas of the face through the dependencies between context nodes and weakens the impact of in-the-wild factors (illumination changes, non-frontal head poses, facial occlusions) on the final recognition. In DRG, our method first aggregates the features of the entire input video sequence to learn dynamic temporal information and then constructs a GCN over the other video samples in the same batch. The sample nodes in the batch share features according to their similarity, improving the robustness of facial expressions extracted from a single situation and thus better handling complex and changeable real-world situations. Finally, the video-based FER task is cast as a node classification problem in the graphs constructed over the batches.

In summary, this paper has the following contributions:

  1. We propose DSGCN, which simultaneously captures static spatial feature relationships, long-distance dynamic temporal dependencies, and sample similarity relationships to obtain efficient expression-related features.

  2. We present a graph-based approach to video-based facial expression recognition by casting it as a node classification problem.

  3. Extensive experiments demonstrate that DSGCN significantly outperforms the baseline model and achieves state-of-the-art results on three popular video-based FER datasets. Ablation studies verify the effectiveness of its components (i.e., SRG and DRG).

2 Related Work

2.1 Image-Based FER in the Wild

Image-based FER mainly consists of three stages: face detection, feature extraction, and expression recognition. In the face detection stage, methods such as MTCNN [15] and Dlib [16] are usually used to locate faces in complex situations. In the feature extraction stage, early methods mostly used hand-crafted features. Among them, texture-based features include HOG [10], histograms of LBP [8], and Gabor wavelet coefficients. At the same time, many methods extract features based on landmark points such as the nose, eyes, and mouth, and use combinations of multiple features to obtain richer representations. Currently, deep learning based methods are widely used. Fasel [17] found that shallow CNNs are robust to facial poses. Tang and Kahou et al. [18] used deep CNNs for feature extraction and won the FER2013 and EmotiW 2013 challenges, respectively. Liu et al. [19] proposed a CNN architecture based on facial action units for expression recognition. After feature extraction, the features are fed into supervised classifiers such as support vector machines (SVM), softmax layers, and logistic regression to assign facial expression categories.

2.2 Video-Based FER in the Wild

To capture the spatio-temporal information in videos, methods based on CNNs and RNNs have emerged. Most CNN-RNN based DFER methods first use a CNN to learn spatial facial features for each video frame and then an RNN to process the temporal information between frames. Some methods use VGG or ResNet to extract spatial features and long short-term memory (LSTM) or Gated Recurrent Units (GRU) to extract temporal features. For example, Baddar et al. [20] proposed a pattern-varied LSTM to encode spatio-temporal features that are robust to unseen changing patterns. 3D CNN-based methods [14] jointly extract spatial and temporal feature representations of video sequences through 3D convolutions. These methods [21] extract spatio-temporal facial features by directly adopting 3D CNNs, and such spatio-temporal features are usually combined with other types of facial features. Recently, Liu et al. [22] leveraged graph convolutional networks (GCNs) to learn frame-based features that focus on specific expression regions. Lee et al. [23] proposed a Multi-modal Recurrent Attention Network (MRAN) for learning spatio-temporal attention maps for robust DFER in the wild. Zhao et al. [24] first introduced the transformer to the DFER task, designing the CS-Former and T-Former to extract spatial and temporal features, respectively.

2.3 GNN for Video Understanding

In recent years, transformer- and GNN-based methods have demonstrated excellent performance in video understanding, especially in improving the performance of CNN/RNN-based methods. In this field, GNNs have been applied to dialogue modeling, video retrieval, emotion recognition, and action detection, and there are also video representation frameworks that serve multiple downstream tasks. For example, Arnab et al. [25] created a fully connected graph using foreground nodes extracted from video frames in a sliding-window fashion and established connections between foreground nodes and the context nodes of adjacent frames. Liu et al. [22] introduced a GCN layer into a general CNN-RNN based model for video-based FER, but they only considered the relationships between frames and not the similarity between samples. In contrast, our work is dedicated to constructing a graph structure that captures richer relationships.

Fig. 1. The proposed model (DSGCN) architecture, which mainly consists of a Static-Relational Graph (SRG) and a Dynamic-Relational Graph (DRG).

3 Proposed Method

As shown in Fig. 1, the proposed DSGCN mainly consists of a static-relational graph (SRG) and a dynamic-relational graph (DRG). The input to DSGCN is a fixed-length facial expression sequence dynamically sampled from the raw video. SRG takes the sequence as input, splits it into individual frames, and extracts spatial facial features from each frame. It then constructs a graph using the spatial features of the frames as nodes and the similarity between nodes as edges, thereby capturing the long-distance dependencies of expressions. DRG aggregates the spatial feature sequence enhanced by SRG and constructs a GCN over the other video samples in the same batch, sharing feature information through the similarity between samples. Finally, the classification results are obtained by a fully-connected (FC) layer.

3.1 Static-Relational Graph (SRG)

SRG builds a GCN over frame nodes carrying rich spatial features. Given a facial expression video as input, a fixed-length sequence of facial expressions dynamically sampled from the raw video sequence is fed into the model. The frames in the sequence are first transformed into features carrying rich facial spatial information through the spatial network module, and then a GCN is constructed over these features to strengthen the salient facial expression regions.

Static Spatial Feature: A fixed-length clip \(X \in \mathbb {R}^{T\times 3 \times H \times W }\) is obtained as input by dynamically sampling the raw video. Specifically, we split the video sequence into S segments and randomly select V frames from each segment, obtaining an input clip of fixed length \(T = S\times V\).
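As an illustration, the segment-based sampling can be sketched as follows. This is a minimal sketch: the function and parameter names are ours rather than from the paper, and we assume each segment contains at least V frames.

```python
import random

def sample_clip(num_frames, num_segments, frames_per_segment):
    """Split a video into num_segments equal segments and randomly pick
    frames_per_segment frame indices from each segment (S x V = T indices)."""
    seg_len = num_frames // num_segments
    indices = []
    for s in range(num_segments):
        segment = range(s * seg_len, (s + 1) * seg_len)
        indices.extend(sorted(random.sample(segment, frames_per_segment)))
    return indices

# Example: a 120-frame video with S = 8 and V = 2 yields T = 16 frame indices.
clip_indices = sample_clip(num_frames=120, num_segments=8, frames_per_segment=2)
```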

To build the static-relational graph and extract a rich spatial representation from each frame, we use a Spatial Transformer [24]. The Spatial Transformer consists of five convolution blocks and N spatial encoders. The first four convolution blocks (conv1, conv2, conv3, and conv4) extract local facial spatial features \(M \in \mathbb {R} ^ {C \times H^{\prime } \times W^{\prime } }\). After this, we flatten the features, add positional information \(P_{spatial}\), and feed them into the N spatial encoders, each consisting of a multi-head self-attention and a feed-forward network that model global spatial relationships. The final convolution block conv5 refines the final facial features. Feeding the T-frame clip into the Spatial Transformer thus yields an output \(F \in \mathbb {R}^{T \times f}\) that carries sufficient spatial information.
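The pipeline above can be summarized by the following loose PyTorch sketch. The block depths, channel sizes, strides, number of encoders, and the final pooling are illustrative assumptions and do not reproduce the exact architecture of the Spatial Transformer in [24].

```python
import torch
import torch.nn as nn

class SpatialTransformerSketch(nn.Module):
    def __init__(self, channels=128, nhead=8, num_encoders=2, max_tokens=14 * 14):
        super().__init__()
        # conv1-conv4: stand-ins for the local feature extractor
        self.convs = nn.Sequential(
            nn.Conv2d(3, channels, 7, stride=4, padding=3), nn.ReLU(),
            nn.Conv2d(channels, channels, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(channels, channels, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(channels, channels, 3, stride=1, padding=1), nn.ReLU(),
        )
        self.pos = nn.Parameter(torch.zeros(1, max_tokens, channels))  # P_spatial
        layer = nn.TransformerEncoderLayer(channels, nhead, batch_first=True)
        self.encoders = nn.TransformerEncoder(layer, num_encoders)     # N encoders
        self.conv5 = nn.Conv2d(channels, channels, 3, padding=1)       # refinement

    def forward(self, x):                        # x: (T, 3, H, W) clip frames
        m = self.convs(x)                        # local features (T, C, H', W')
        t, c, hh, ww = m.shape
        tok = m.flatten(2).transpose(1, 2)       # (T, H'*W', C) token sequence
        tok = self.encoders(tok + self.pos[:, : hh * ww])
        m = tok.transpose(1, 2).reshape(t, c, hh, ww)
        return self.conv5(m).mean(dim=(2, 3))    # per-frame features F (T, C)

# Example: a 16-frame clip of 112x112 face crops
features = SpatialTransformerSketch()(torch.randn(16, 3, 112, 112))   # (16, 128)
```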

Intra-Video Graph: In order to capture the long-range dependencies of facial regions in videos, we propose a graph-based module to capture expression changes. We construct a GCN layer over the T features obtained from the previous module and model their contextual correlation by learning the dynamic adjacency matrix A. All nodes tend to be influenced by expression-informative frames and update themselves toward the more contributing ones.

The inputs are the representation maps \(\hat{F} = \{ \hat{f_1}, \dots , \hat{f_T} \}\) extracted by the Spatial Transformer from the original video. To begin with, we use the cosine similarity coefficient to calculate the similarity between different representations:

$$\begin{aligned} cossim(f_i,f_j) = \frac{f_i \cdot f_j}{\Vert f_i\Vert \, \Vert f_j\Vert } \end{aligned}$$
(1)

At the same time, we construct the adjacency matrix A from the cosine similarity coefficient, where \(A_{i,j}\) represents the similarity between node i and node j. At each time step, as the node features are updated, the adjacency matrix A is also updated to reflect the current similarity between nodes.

$$\begin{aligned} A_{i,j} = cossim(f_i,f_j) \end{aligned}$$
(2)

where \(i,j \in \{1,2, \dots ,T \}\). Then, we employ the GCN as:

$$\begin{aligned} F^{l+1} = \bar{D}^{-\frac{1}{2}} \bar{A}\bar{D}^{-\frac{1}{2}}F^{l}W^{l} \end{aligned}$$
(3)

where l indexes the GCN update step, \(\bar{A} = A + I\) is the sum of the undirected adjacency matrix A and the identity matrix, and \(\bar{D}\) is the degree matrix of \(\bar{A}\), i.e., \(\bar{D}_{i,i} = \sum \limits _{j}{\bar{A}_{i,j}}\). \(F^{l}\) and \(F^{l+1}\) are the input and output representations at the l-th step, and \(W^{l}\) denotes the trainable parameters of this step. At each step, the GCN layer propagates the features of each node to its neighbor nodes according to the adjacency matrix A and accepts update messages from them.
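A minimal PyTorch sketch of one such update step is given below. Only the mathematics (the cosine-similarity adjacency of Eqs. (1)-(2) and the normalized propagation of Eq. (3)) follows the text; the class and variable names are our own assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CosineGCNLayer(nn.Module):
    """One GCN update step with a cosine-similarity adjacency (Eqs. (1)-(3))."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.weight = nn.Linear(in_dim, out_dim, bias=False)   # W^l

    def forward(self, x):                                      # x: (T, f) nodes
        # Eqs. (1)-(2): adjacency from pairwise cosine similarity
        a = F.cosine_similarity(x.unsqueeze(1), x.unsqueeze(0), dim=-1)
        a_bar = a + torch.eye(x.size(0), device=x.device)      # A + I
        d_inv_sqrt = a_bar.sum(dim=1).clamp(min=1e-6).pow(-0.5)
        norm = d_inv_sqrt.unsqueeze(1) * a_bar * d_inv_sqrt.unsqueeze(0)
        # Eq. (3): F^{l+1} = D^{-1/2} (A + I) D^{-1/2} F^l W^l
        return norm @ self.weight(x)

# Example: T = 16 frame features of dimension 512
updated = CosineGCNLayer(512, 512)(torch.randn(16, 512))       # (16, 512)
```

Stacking several such layers corresponds to the successive update steps described above.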

3.2 Dynamic-Relational Graph (DRG)

Dynamic Temporal Feature: Using the output \(F^{\prime }\) from the SRG as input, DRG aims to capture the dynamic temporal relations among the feature nodes that have obtained spatial information and to mine the facial expression movement information between nodes. In this module, we first use a Temporal Transformer [24], which consists of M temporal encoders, each including a multi-head self-attention and a feed-forward network. The T spatial features \(F^{\prime }\) from the SRG are fed into the temporal encoders after adding positional information \(P_{temporal}\). Through the multi-head self-attention and the feed-forward network in each temporal encoder, global temporal information is modeled, yielding a feature h with rich spatial-temporal information.
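A rough sketch of this step is given below, assuming standard PyTorch transformer encoder layers and mean pooling over time as the aggregation; the feature dimension, number of heads, and number of encoders M are illustrative choices, not values from the paper.

```python
import torch
import torch.nn as nn

class TemporalTransformerSketch(nn.Module):
    def __init__(self, feat_dim=512, nhead=8, num_encoders=3, max_len=16):
        super().__init__()
        self.pos = nn.Parameter(torch.zeros(1, max_len, feat_dim))   # P_temporal
        layer = nn.TransformerEncoderLayer(feat_dim, nhead, batch_first=True)
        self.encoders = nn.TransformerEncoder(layer, num_encoders)   # M encoders

    def forward(self, x):             # x: (B, T, feat_dim) SRG-enhanced features
        h = self.encoders(x + self.pos[:, : x.size(1)])
        return h.mean(dim=1)          # aggregate over time -> h: (B, feat_dim)

# Example: a batch of 32 videos, each with T = 16 frame features
h = TemporalTransformerSketch()(torch.randn(32, 16, 512))        # (32, 512)
```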

Inter-Video Graph: Not limited to learning contextual relations within a video, we extend the DRG module to learn the similarity between input video samples. The module accepts B video samples in the same batch; the features of each video are transformed into a feature h carrying rich spatio-temporal information through the above steps, and we then construct a GCN over these features. By sharing information across video samples with different expressions, the graph allows each video node, observed in a single scene, to learn how similar expressions appear in other scenes.

The inputs are the representation maps \(\hat{H} = \{\hat{h_{1}},\dots ,\hat{h_{B}}\}\) extracted by the Temporal Transformer from the video samples in the same batch. We construct the adjacency matrix A according to Eq. (2) and update the video nodes in the batch according to Eq. (3). After l rounds of updating the node features, each node has learned different scene knowledge of similar expressions to cope with the large variations of expression patterns and data uncertainties.

Node Classification: In the previous steps, we described the graph construction procedure that converts a batch of videos into a graph in which each node has its own spatio-temporal feature vector. During training, we feed all videos in a batch into the proposed model simultaneously and append FC layers to the outputs, transforming the original video-based facial expression recognition task into a seven-category node classification problem on the constructed graph.
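The inter-video graph and the classification head can be sketched together as follows. The sketch mirrors the cosine-similarity GCN update of Eqs. (2)-(3), here applied to one node per video; the class name, feature dimension, and number of GCN layers are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class InterVideoGCNHead(nn.Module):
    """Batch-level graph plus FC classifier: one node per video, seven classes."""
    def __init__(self, feat_dim=512, num_classes=7, num_layers=2):
        super().__init__()
        self.weights = nn.ModuleList(
            [nn.Linear(feat_dim, feat_dim, bias=False) for _ in range(num_layers)]
        )
        self.fc = nn.Linear(feat_dim, num_classes)

    def forward(self, h):                     # h: (B, feat_dim), one node per video
        for w in self.weights:
            # cosine-similarity adjacency over the batch, as in Eq. (2)
            a = F.cosine_similarity(h.unsqueeze(1), h.unsqueeze(0), dim=-1)
            a_bar = a + torch.eye(h.size(0), device=h.device)
            d = a_bar.sum(dim=1).clamp(min=1e-6).pow(-0.5)
            # normalized propagation, as in Eq. (3)
            h = torch.relu((d.unsqueeze(1) * a_bar * d.unsqueeze(0)) @ w(h))
        return self.fc(h)                     # (B, 7) node-classification logits

# Example: a batch of 32 video features of dimension 512
logits = InterVideoGCNHead()(torch.randn(32, 512))
loss = nn.CrossEntropyLoss()(logits, torch.randint(0, 7, (32,)))
```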

4 Experiments

4.1 Datasets

FERV39k [6] is currently the largest in-the-wild DFER dataset, containing 38,935 video clips collected from four different scenarios, which can be further divided into 22 fine-grained scenes, such as daily life, talk shows, business, and crime. All scene video clips are randomly shuffled, with 80% of clips allocated to the training set and 20% reserved for the test set to avoid overlap. For a fair comparison, we directly use the training and test splits provided by FERV39k.

DFEW [7] is a database containing 16,372 video clips from more than 1,500 movies. All samples are divided into five equally sized parts (fd1–fd5), and five-fold cross-validation is used as the evaluation scheme: in each fold, one part is used for testing while the remaining data is used for training. Finally, all predicted labels are compared with the ground truth to compute the evaluation metrics.

AFEW [5] served as the evaluation platform for the annual EmotiW challenge from 2013 to 2019. It contains 1,809 video clips collected from different movies and TV series. Consistent with DFEW, each video clip in AFEW is assigned to one of the seven basic expressions. Since the test clips are not publicly available, we train our model on the training clips and evaluate on the validation clips.

Table 1. Comparison with state-of-the-art methods on FERV39k.

4.2 Implementation Details

Training Setting: For all three datasets, we train our model from scratch with a batch size of 32, initialize the learning rate to 0.01, and divide it by 5 every 50 epochs. Due to the small number of samples in the AFEW dataset, and to allow a fair comparison, we first pre-train our model and the other models on DFEW (fd1) and then fine-tune on AFEW with the same settings.
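A minimal sketch of this schedule is shown below; the optimizer type is not stated in the text, so SGD is an assumption, and the model is a stand-in placeholder.

```python
import torch
import torch.nn as nn

model = nn.Linear(512, 7)                                    # placeholder model
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)     # assumed optimizer
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=50, gamma=0.2)

for epoch in range(100):
    # ... forward/backward passes over the training loader (batch size 32) ...
    optimizer.step()       # after backpropagating the node-classification loss
    scheduler.step()       # divides the learning rate by 5 every 50 epochs
```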

Evaluation Metrics: Without loss of generality, we choose the Unweighted Average Recall (UAR, i.e., the mean of per-class accuracies regardless of the number of instances per class) and the Weighted Average Recall (WAR, i.e., overall accuracy) as the metrics.
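For reference, the two metrics can be computed from predicted and ground-truth labels as in the following NumPy sketch (seven expression classes assumed; the function name is ours):

```python
import numpy as np

def uar_war(y_true, y_pred, num_classes=7):
    """Return (UAR, WAR) for integer class labels in [0, num_classes)."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    recalls = []
    for c in range(num_classes):
        mask = y_true == c
        if mask.any():
            recalls.append((y_pred[mask] == c).mean())   # per-class accuracy
    uar = float(np.mean(recalls))           # unweighted average recall
    war = float((y_pred == y_true).mean())  # weighted average recall = accuracy
    return uar, war
```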

Table 2. Comparison with state-of-the-art methods on DFEW.
Table 3. Comparison with state-of-the-art methods on AFEW.

4.3 Comparison with State-of-the-Arts

In this section, we compare our best results with current state-of-the-art methods on the FERV39k, DFEW and AFEW benchmarks.

As shown in Table 1, we compare our method with other state-of-the-art methods on the FERV39k dataset, including C3D, P3D, R2Plus1, 3DR18, R18+LSTM, VGG13+LSTM, Two C3D [6], Two R18+LSTM [6], Two VGG13+LSTM [6], Former-DFER [24], STT [26], and NR-DFERNet [27]. DSGCN achieves improvements of 9.88% and 11.77% in UAR and WAR, respectively, over the previous state-of-the-art method STT. Moreover, we report the performance on each expression in the table. As can be seen, our method achieves the best results on most expressions and is only slightly lower than STT on Disgust, with a gap of 1.28%. At the same time, Table 1 shows that most methods perform poorly on "disgust" and "fear", which we believe is caused by insufficient training data for these classes in the original datasets.

Fig. 2. Illustration of the feature distributions learned by Former-DFER [24] (top) and DSGCN (bottom) on the three datasets.

For the DFEW dataset, we compare with 3DR18, R18+LSTM, R18+GRU, EC-STFL [7], Former-DFER [24], EST [28], STT [26], NR-DFERNet [27], GCA+IAL [29], and DPCNet [30] under 5-fold cross-validation. As shown in Table 2, DSGCN outperforms the comparison methods on the WAR metric and is very close to the current state-of-the-art method DPCNet on the UAR metric. Specifically, we obtain a 1.33% improvement on WAR and only a 0.05% reduction on UAR compared to DPCNet. It should be noted that DFEW also has an imbalanced data distribution: the proportions of "disgust" and "fear" sequences are 1.22% and 8.14%, which explains why our method achieves a relatively low performance in UAR.

For the AFEW dataset, all methods are first pre-trained on DFEW (fd1) and then fine-tuned on AFEW with the same settings. We compare our method with C3D, I3D-RGB, R(2+1)D, 3DR18, R18+LSTM, Former-DFER [24], EST [28], STT [26], NR-DFERNet [27], and DPCNet [30]. As shown in Table 3, DSGCN achieves the best results on both UAR and WAR. In particular, our method improves UAR and WAR by 10.89% and 11.23%, respectively, over the previous state-of-the-art method EST.

Fig. 3. Visualization of the learned feature maps. Three sequences are presented, containing the facial expressions Neutral, Anger, and Sadness, respectively. For each sequence, the first row shows heat-maps generated by Former-DFER, and the second row shows heat-maps generated by DSGCN.

4.4 Visualization Results

We utilize t-SNE [31] to analyze the feature distributions learned by Former-DFER and DSGCN on the three datasets. As shown in Fig. 2, the feature distribution of each category learned by our method is clearly tighter, and the boundaries between different categories are more distinct. This shows that our method can better discriminate different facial expressions at the feature level. Furthermore, we visualize the learned facial feature maps, as shown in Fig. 3, using three types of expressions (neutral, angry, and sad) to compare with Former-DFER. For the first, neutral sequence, although there is no significant expressive behavior, our method still pays more attention to facial regions. In the second, angry sequence and the third, sad sequence, our method pays more attention to facial regions such as the mouth and eyes that carry more emotional information. Moreover, in the second sequence, where the subject exhibits large head pose changes, our method consistently locks onto the subject's face region, unlike the comparison method.
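For completeness, the feature-distribution plots of Fig. 2 can be reproduced with a standard t-SNE projection, as in the sketch below; the feature matrix and labels are placeholders for the embeddings and ground-truth classes extracted from a trained model.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

features = np.random.randn(1000, 512)      # placeholder for learned features
labels = np.random.randint(0, 7, 1000)     # placeholder for expression labels

emb = TSNE(n_components=2, init="pca").fit_transform(features)
plt.scatter(emb[:, 0], emb[:, 1], c=labels, s=4, cmap="tab10")
plt.title("t-SNE of learned features")
plt.show()
```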

Table 4. Ablation study to evaluate the effectiveness of different modules in our proposed method.

4.5 Ablation Study

We conduct extensive ablation studies on the three video-based FER datasets to demonstrate the effectiveness of the different components of our proposed method, including the individual SRG and DRG modules as well as the complete DSGCN. Former-DFER is employed as the baseline. As shown in Table 4, our SRG achieves WAR and UAR of 40.43%/54.81%, 55.56%/67.04%, and 57.67%/62.80% on the three datasets, which already outperforms some existing methods thanks to the spatio-temporal features we learn. On the one hand, using SRG brings clear improvements in WAR and UAR over the baseline. This shows that SRG can effectively enhance facial expressions by learning the similarity of expressions at different moments in the same video, providing more robust features for the subsequent extraction of temporal information. On the other hand, through the propagation and enhancement of spatio-temporal features, DRG outperforms the baseline to varying degrees on the three datasets. The most significant improvements are on AFEW, where DRG exceeds the baseline by 12.48% and 13.77% on WAR and UAR. This shows that DRG can indeed capture the correlation between the expressions of different samples to strengthen the current expression. We also notice that using DRG alone performs slightly better than using SRG alone; the reason is that dynamic features from other video sequences improve the robustness of node features more than those from within the same video sequence. Finally, the complete DSGCN reaches the highest values on all metrics on the three datasets, which proves that our method can indeed learn both the spatial and temporal relationships of the input facial expression videos.

5 Conclusion

This paper proposes a novel dynamic-static graph convolutional network (DSGCN) for dynamic facial expression recognition in in-the-wild scenarios. Specifically, the proposed DSGCN mainly consists of a static-relational graph (SRG) and a dynamic-relational graph (DRG), and it captures multi-level relationships among the input video sequences, including spatial, temporal, contextual, and sample relationships. Extensive comparisons with previous methods show that the proposed DSGCN achieves state-of-the-art results on three popular dynamic FER benchmarks, and abundant ablation studies validate the effectiveness of each part of DSGCN. Moreover, the visualization of facial features demonstrates that DSGCN pays more attention to salient facial regions, and the visualization of the feature distributions shows that the method better discriminates the learned facial features.

In future work, we will extend the DSGCN framework to micro-expression recognition, pose prediction, person recognition, and other fields. Additionally, we plan to integrate DSGCN with self-supervised learning, encouraging the model to learn latent internal relationships from a large amount of unlabeled data and thereby alleviating the impact of imbalanced facial data.