1 Introduction

Facial expression analysis refers to the differentiation of facial changes relative to a neutral face and plays an essential role in daily human communication. Almost all anatomically visible facial expressions can be described by another modality known as facial action units (AUs), which refer to local facial muscle actions, as described by the facial action coding system (FACS) [7, 8]. Nowadays, motivated by applications such as human–robot interaction, several studies [19] have focused on improving recognition across both modalities, namely facial expression and AU classification. These studies mainly face two challenges: environmental conditions, such as illumination, occlusion, and low resolution, and human attribute variables, such as age, gender, and appearance. For the former, significant progress has been made [12, 23, 30]. For the latter, however, variations in these attributes cause facial expressions to exhibit different emotional intensities or even styles and, thus, to decompose into different AU combinations. As shown in Fig. 1, a disgusted expression can be decomposed into “AU9+AU17” or “AU9+AU10+AU25”; in some situations, AU17 merely reflects the intensity of the expression. Therefore, researchers have proposed using a generative adversarial network (GAN) [10] to generate the neutral face of the query image [42] or the average face of a database [3] to filter out identity information, and then using the middle layers of the generator to classify expressions or AUs. However, from an anatomical standpoint, these models ignore the co-occurrence and mutual exclusion of facial AUs.

Meanwhile, because of the mapping relationship between facial expressions and AUs, some researchers have explored utilizing AUs to recognize facial expressions, and vice versa. As illustrated in Fig. 1, an angry expression activates AU4, AU17, and AU23, as we generally “lower the brow”, “raise the chin”, and “tighten the lid and lip” when we feel slightly angry [6]. Among traditional methods, because the Bayesian network (BN) can imitate the process of human thinking and reasoning [2], the BN is often used to model such dependencies [11]; however, its performance is limited by the hand-crafted features it takes as input. With the development of deep learning, some studies exploit large-scale data to obtain high-level semantic information. By visualizing the middle layers of a convolutional neural network (CNN), [13] demonstrates that learning expressions is essentially learning AUs. Therefore, some models construct multi-task [34] or multi-branch networks [22] so that the two tasks can learn features from each other. However, these methods mostly ignore the direct representation of prior knowledge as constraints that could alleviate the fuzzy decision boundaries of multi-class classification, and they do not integrate such knowledge into deep learning to simplify the network and guide the learning process.

Fig. 1

Illustration of the relationship between the six basic expressions and multiple AUs from the Cohn–Kanade (CK+) and Radboud Faces Database (RaFD) databases

In this paper, we introduce a novel method for recognizing facial expressions and AUs by exploiting their dependencies with a graph convolutional network (FE-AURDGCN). In particular, we use a conditional GAN (cGAN) to generate the corresponding neutral image, which addresses identity-related variation, and extract AU regions from the middle layers of the generative model. The selected AUs are regarded as nodes of a graph, and we construct an AU correlation graph to represent their co-occurrence relationships for learning non-geometric semantic information. Thereafter, a graph convolutional network (GCN) [14] is used to guide information propagation among the nodes. Subsequently, we integrate the prior distributions of expressions and AUs into the loss function to regulate the network outputs and guide the training direction of the network. We conduct extensive experiments on the widely used CK+ and RaFD datasets. The results demonstrate the superiority of the proposed FE-AURDGCN framework over state-of-the-art facial expression and AU recognition methods.

In summary, this paper has the following contributions:

  (1) In this study, we formulate a novel expression and AU recognition model, termed FE-AURDGCN, which incorporates the cGAN and an AU dependency relationship graph. This model alleviates the identity-difference issue and jointly encodes the appearance and geometry information of facial expressions together with the co-occurrence relations of facial muscle movements.

  (2) To address the fuzzy decision boundaries in emotion and AU classification, we propose to use their inter-dependent conditional relation matrices as prior knowledge describing the dependencies between expressions and AUs, and to incorporate them into the loss function to reduce the probability of final misclassification.

The remainder of this paper is organized as follows: Sect. 2 reviews related work; Sect. 3 presents the details of the FE-AURDGCN model; Sect. 4 describes the experiments and reports the recognition results for AUs and expressions on the CK+ and RaFD datasets; finally, Sect. 5 presents the concluding remarks.

2 Related work

2.1 Expression and AU recognition

Automatic facial expression and AU recognition have garnered widespread research interest and achieved significant progress in recent years. Existing methods can be essentially divided into traditional models based on hand-crafted features and deep learning models based on neural networks. For the traditional models, apart from relying on discriminative learning methods such as the nearest neighbor [43] and support vector machine [38], researchers have focused on incorporating prior knowledge into reasoning models. Tong et al. [41] proposed constructing a BN to describe the dependency relationship between AUs and expressions. Li et al. [20] further developed a dynamic BN (DBN) to represent the probabilistic relationships among facial expressions, AUs, facial components, and feature points. Wang et al. [39] used a restricted Boltzmann machine to capture complex AU relations and their correlations with expressions. Nevertheless, the generalization performance of these methods is restricted because the conditional probabilities used in these topological graph models are all derived from the target labels.

For the deep learning models, researchers have developed effective networks by utilizing the correlation between AUs and expressions. Liu et al. [22] constructed an AU-aware deep network that enables the expression and AU branches to learn important information from each other. Zhao et al. [46] considered the anatomical attributes of facial regions and divided the face into multiple patches for AU multi-label learning. Pons et al. [34] developed a multi-task network and showed experimentally that simultaneous emotion classification and AU detection can improve expression recognition performance. Meanwhile, with the development of GCN [14], a graph-oriented variant of the CNN, Li et al. [18] used the semantic relationships between AUs as extra guidance to enhance facial region representation and significantly improved AU recognition performance. Liu et al. [24] used the geometric and local features of facial muscles to construct a graph structure for expression classification. However, explicitly expressing the FACS-based rules relating expressions and AUs to guide the learning direction of the network remains to be studied.

2.2 Generative adversarial networks

Recently, the GAN has garnered increasing attention. Inspired by the adversarial idea [15], the GAN [10] plays a minimax game between two models: a generator (G) and a discriminator (D). G attempts to capture the distribution of the ground truth, whereas D attempts to distinguish the generated examples from the true examples as accurately as possible. Owing to the development of the GAN, an increasing number of fields apply it or its variants, such as computer vision [36] and natural language processing [25]. Among these applications, generating faces with different attributes is a popular topic. Liu et al. [27] introduced a coupled GAN (CoGAN) to learn and regenerate faces with different attributes such as hair, smiling, and eyeglasses. NVIDIA [31] introduced an alternative generator architecture that produces more realistic faces with all kinds of attributes, such as freckles, pose, and even identity. Zhou et al. [37] applied a conditional GAN (cGAN) to generate the neutral face from expressions, and [32] used this model to recognize expressions by learning the intermediate layers of the cGAN. Furthermore, Lai et al. [16] explored multi-view facial expression recognition by reconstructing the corresponding frontal face using a GAN.

3 Methodology

In this section, we introduce our FE-AURDGCN learning framework in detail. First, we briefly describe how expression information is extracted from an input image that also carries identity information. Thereafter, the graph representation network of expressions is presented. The overall construction of our framework is illustrated in Fig. 2.

Fig. 2

Framework of our proposed FE-AURDGCN. It comprises a generative model for reconstructing a neutral face and a GCN for representing the dependencies between expressions and AUs. “\( \oplus \)” represents matrix concatenation along the feature dimension

3.1 Expression information extraction

A facial expression image comprises both identity (the human face) and expression information. Inspired by [42], a conditional generative adversarial network (cGAN) is used to filter out the identity information by generating the corresponding neutral image of the query image. The cGAN consists of G and D. In particular, the generative model generates the corresponding neutral face \({I_{output}}\) through encoders and decoders and retains the expression information in the middle layers of the network. To narrow the gap between the reconstructed face \({I_{output}}\) and the ground truth \({I_{target}}\) and thereby confuse the discriminator as much as possible, we add image-difference information between them to constrain G and use the L1 loss to measure image similarity. The objective loss for the generator is described as follows:

$$\begin{aligned} {{L}_{cGAN}}\left( G \right)&=\frac{1}{N}\sum \limits _{i=1}^{N}{\left\{ {-}\log D\left( I_{input}^{i},I_{output}^{i} \right) \right. } \nonumber \\&\qquad + \left. \theta {{\left\| I_{target}^{i}-I_{output}^{i} \right\| }_{1}} \right\} \end{aligned}$$
(1)

On the other hand, the discriminator is a CNN for binary classification. Its goal is to differentiate pseudo pairs [\({I_{input}}\), \({I_{output}}\)] from real pairs [\({I_{input}}\), \({I_{target}}\)]. The objective loss for the discriminator is described as follows:

$$\begin{aligned} {{L}_{cGAN}}\left( D \right)= & {} \frac{1}{N}\sum \limits _{i=1}^{N}{\left\{ \log D\left( I_{input}^{i},I_{target}^{i} \right) \right. } \nonumber \\&\qquad + \left. \log \left( 1-D\left( I_{input}^{i},I_{output}^{i} \right) \right) \right\} \end{aligned}$$
(2)

The final loss is described in Eq. (3). The optimization terminates at a saddle point, which is a minimum for G and a maximum for D. When the model reaches equilibrium, the expression information can be extracted from the middle layers of the generative model.

$$\begin{aligned} {G^*} = \arg \min _{G} \max _{D} {L_{cGAN}}\left( D \right) + {L_{cGAN}}\left( G \right) . \end{aligned}$$
(3)
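To make the training objective concrete, the following is a minimal sketch of Eqs. (1)–(3) in TensorFlow 2. The encoder–decoder generator and the discriminator networks themselves are assumed to be defined elsewhere (the `discriminator` callable below is hypothetical), and the stability constant `eps` is our addition, not part of the original formulation.

```python
import tensorflow as tf

theta = 0.05  # weight of the L1 term (value reported in Sect. 4.2)
eps = 1e-8    # small constant for numerical stability inside the logarithms

def generator_loss(discriminator, I_input, I_output, I_target):
    """Eq. (1): adversarial term plus L1 similarity to the ground-truth neutral face."""
    d_fake = discriminator(tf.concat([I_input, I_output], axis=-1))
    adv = -tf.reduce_mean(tf.math.log(d_fake + eps))   # -log D(I_input, I_output)
    l1 = tf.reduce_mean(tf.abs(I_target - I_output))   # ||I_target - I_output||_1
    return adv + theta * l1

def discriminator_loss(discriminator, I_input, I_output, I_target):
    """Eq. (2): log D(real pair) + log(1 - D(fake pair)); maximized by D,
    so in practice its negative is minimized."""
    d_real = discriminator(tf.concat([I_input, I_target], axis=-1))
    d_fake = discriminator(tf.concat([I_input, I_output], axis=-1))
    return tf.reduce_mean(tf.math.log(d_real + eps)
                          + tf.math.log(1.0 - d_fake + eps))
```

In each training step, G descends on \({L_{cGAN}}(G)\) while D ascends on \({L_{cGAN}}(D)\), which realizes the minimax game of Eq. (3).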

3.2 AU-related graph construction

Considering that expressions of different individuals may contain different combinations of AUs because of variations in culture and race, as seen in the disgusted expression in Fig. 1, it is difficult to directly construct an AU relation graph that represents a specific expression. However, according to FACS, co-existent and mutually exclusive relationships exist between AUs, caused by the mechanics of facial muscles. Inspired by [18], we construct an AU-related graph to learn richer semantic features through GCN [14]. GCN works by propagating information between nodes V based on the correlation matrix A. In particular, each node in the graph represents a specific AU, and each value in A represents the correlation dependency between AUs. We detail the construction of A and the embedding of V as follows:

3.2.1 Correlation matrix of AUs

In this study, we define A by mining AU co-occurrence patterns within the dataset in the form of conditional probabilities, i.e., \(P\left( {{y_i} = 1|{y_j} = 1} \right) \), which denotes the probability that \(A{U_i}\) occurs when \(A{U_j}\) appears. Although both positive and negative relationships exist between pairwise AUs, we only consider positive dependencies, interpreted in two ways as expressed in Eqs. (4) and (5). The first condition states that when \(A{U_j}\) appears, \(A{U_i}\) is more likely to appear than not. The second condition states that \(A{U_i}\) is more likely to appear when \(A{U_j}\) is present than when it is absent. If both conditions are satisfied, we set \({A_{i,j}}\) to 1; otherwise, to 0, as expressed in Eq. (6) (a construction sketch is given after the equations). The final AU dependency relationships are illustrated in Fig. 3.

$$\begin{aligned}&P\left( {{\mathrm{{y}}_i} = 1|{y_j} = 1} \right) > P\left( {{\mathrm{{y}}_i} = 0|{y_j} = 1} \right) \end{aligned}$$
(4)
$$\begin{aligned}&P\left( {{\mathrm{{y}}_i} = 1|{y_j} = 1} \right) > P\left( {{\mathrm{{y}}_i} = 1|{y_j} = 0} \right) \end{aligned}$$
(5)
$$\begin{aligned}&{A_{i,j}} = \left\{ {\begin{array}{*{20}{l}} {1,}&{}\quad \text {if Eqs.}\,(4)\,\text {and}\,(5)\,\text {both hold}\\ {0,}&{}\quad \text {otherwise} \end{array}} \right. \end{aligned}$$
(6)
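As a concrete illustration of Eqs. (4)–(6), the sketch below builds A from a binary AU label matrix. The NumPy representation (`Y`, one row per image and one column per AU) is an assumption about how the dataset annotations are stored, not part of the original method description.

```python
import numpy as np

def build_au_correlation(Y, eps=1e-8):
    """Y: (num_samples, num_AUs) binary label matrix; returns the binary matrix A of Eq. (6)."""
    n, c = Y.shape
    A = np.zeros((c, c), dtype=np.float32)
    for i in range(c):
        for j in range(c):
            if i == j:
                continue  # self-connections are added later via Eq. (8)
            n_j1 = Y[:, j].sum()                                      # count of samples with AU_j present
            n_j0 = n - n_j1
            p_i1_j1 = (Y[:, i] * Y[:, j]).sum() / (n_j1 + eps)        # P(y_i = 1 | y_j = 1)
            p_i0_j1 = 1.0 - p_i1_j1                                   # P(y_i = 0 | y_j = 1)
            p_i1_j0 = (Y[:, i] * (1 - Y[:, j])).sum() / (n_j0 + eps)  # P(y_i = 1 | y_j = 0)
            # Eq. (6): keep only positive co-occurrence dependencies
            if p_i1_j1 > p_i0_j1 and p_i1_j1 > p_i1_j0:               # Eqs. (4) and (5)
                A[i, j] = 1.0
    return A
```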
Fig. 3

AU dependency relationship graph

3.2.2 Feature embedding of nodes

According to the mapping between facial areas and AUs illustrated in Fig. 4, we crop the obtained expression information into patches and set them as the corresponding node features. However, in a deep convolutional network, the first one or two layers mainly learn low-level features such as color, whereas deeper layers learn complex features such as texture [44]. Therefore, we only crop from the second layer of the generator onward to reduce the computation cost. In addition, we set 16×16 as the size of each AU region in the input image; based on the definition of the receptive field, a region shrinks by a factor of two after each encoder, and vice versa for each decoder. Finally, the feature of each node is obtained by cascading the cropped regions of the corresponding AU (a cropping sketch is given after this paragraph).
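The cropping step can be sketched as follows, assuming the AU centre coordinates (obtained from the detected landmarks, Fig. 4) and the downsampling level of the feature map are given; the function name and the NumPy representation of the feature map are illustrative, not part of the original implementation.

```python
import numpy as np

def crop_au_region(feature_map, center_xy, level, input_patch=16):
    """Crop the patch of one AU from a feature map downsampled `level` times.

    A 16x16 region around the AU centre in the input image maps to a
    (16 / 2**level)-wide region in the feature map, since each encoder
    halves the spatial resolution.
    """
    scale = 2 ** level
    half = max(input_patch // (2 * scale), 1)
    cx, cy = int(center_xy[0] / scale), int(center_xy[1] / scale)
    h, w = feature_map.shape[:2]
    x0, x1 = max(cx - half, 0), min(cx + half, w)
    y0, y1 = max(cy - half, 0), min(cy + half, h)
    return feature_map[y0:y1, x0:x1, :]   # cascaded over layers to form the node feature
```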

Fig. 4

Central location of facial AUs

3.2.3 Convolutions on graph

To train the constructed affective graph, we apply the GCN proposed in [14]. Unlike traditional convolutions that operate on local Euclidean structures in an image, the GCN takes the feature descriptions X and the adjacency matrix A as inputs. The feature update is computed as follows:

$$\begin{aligned}&Z={\tilde{D}}^{-\frac{1}{2}}\tilde{A}{\tilde{D}}^{-\frac{1}{2}}XW \end{aligned}$$
(7)
$$\begin{aligned}&\tilde{A} = A + {I_N} \end{aligned}$$
(8)
$$\begin{aligned}&{\tilde{D}_{ii}} = \sum \limits _j {\tilde{A}_{ij}} \end{aligned}$$
(9)

where Z denotes the output with \(N \times {D^1}\) dimensions, \(\tilde{A}\) denotes the correlation matrix A with added self-connections, and \(\tilde{D}\) denotes its degree matrix; they are computed as in Eqs. (8) and (9), respectively. W denotes the learnable weight matrix with \({D^0} \times {D^1}\) dimensions. Every graph convolution layer is followed by a ReLU in our experiments. Thus, we can learn and model the semantic relationships of AUs by stacking multiple GCN layers.
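A single propagation step of Eqs. (7)–(9) can be sketched in TensorFlow as follows; the function and variable names are illustrative, and the weight matrix W is assumed to be created by the surrounding training code.

```python
import tensorflow as tf

def gcn_layer(X, A, W):
    """One GCN layer: X is (N, D0) node features, A is the (N, N) binary AU
    correlation matrix, W is a learnable (D0, D1) weight matrix."""
    n = tf.shape(A)[0]
    A_tilde = A + tf.eye(n, dtype=A.dtype)        # Eq. (8): add self-connections
    d = tf.reduce_sum(A_tilde, axis=1)            # Eq. (9): degree vector
    D_inv_sqrt = tf.linalg.diag(tf.math.rsqrt(d))
    A_hat = D_inv_sqrt @ A_tilde @ D_inv_sqrt     # symmetric normalization
    return tf.nn.relu(A_hat @ X @ W)              # Eq. (7) followed by ReLU
```

Stacking several such layers propagates each AU node's features to its correlated neighbours, which is how the graph models the co-occurrence relationships.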

3.3 Loss function with prior probability

According to FACS, nearly any facial expression can be deconstructed into specific AUs, and vice versa. For example, when people feel angry, their face is more likely to exhibit AU4, AU17, AU23, and AU24, while it rarely exhibits AU1, AU25, or AU27 [5]. Thus, we can treat such rules as prior knowledge and use them to improve the network. It is feasible to use the expression-dependent AU marginal probability P(AU|E) and the AU-dependent expression marginal probability P(E|AU) to describe the constraint relation between expressions and AUs. Therefore, during training, we regard the expression and AU labels of a query image as prior knowledge and multiply the model outputs by the prior probabilities P(AU|E) and P(E|AU), respectively, to adjust the outputs and thereby alleviate the fuzzy-boundary problem.

$$\begin{aligned}&{L_E}\mathrm{{ = }} - \frac{1}{N}\sum \limits _{j = 1}^N {\left[ {Y\left( {{x_j}} \right) \log \left( {\frac{{p\left( {{x_j}} \right) \left( {{P_1}Q\left( {{x_j}} \right) } \right) + 0.05}}{{1.05}}} \right) } \right] } \end{aligned}$$
(10)
$$\begin{aligned}&\begin{aligned} {{L}_{M}}&\text {=}-\frac{1}{NC}\sum \limits _{j=1}^{N}{\sum \limits _{i=1}^{C}{\left[ {{Q}_{i}}\left( {{x}_{j}} \right) \log \left( \frac{{{p}_{i}}\left( {{x}_{j}} \right) {{\left( Y\left( {{x}_{j}} \right) {{P}_{2}} \right) }_{i}}+0.05}{1.05} \right) \right. }} \\&\quad + \left. \left( 1-{{Q}_{i}}\left( {{x}_{j}} \right) \right) \log \left( \frac{1.05-{{p}_{i}}\left( {{x}_{j}} \right) {{\left( Y\left( {{x}_{j}} \right) {{P}_{2}} \right) }_{i}}}{1.05} \right) \right] \\ \end{aligned} \end{aligned}$$
(11)
$$\begin{aligned}&{L_{\mathrm{{total}}}} = {\lambda _1}{L_E} + {\lambda _2}{L_M} \end{aligned}$$
(12)

Meanwhile, the categorical and binary cross-entropy losses are commonly used in deep learning for multi-class and multi-label classification, respectively. To improve training efficiency, we add the prior probabilities into the losses as shown in Eqs. (10) and (11). \({L_E}\) denotes the expression training loss and \({L_M}\) the AU training loss. Y and Q denote the ground-truth expression and AU labels, respectively, whereas p denotes the predicted probability. N denotes the batch size and C the number of AUs. \({P_1}\) and \({P_2}\) denote the conditional probability matrices computed from P(AU|E) and P(E|AU). The total loss of the method is the combination of the emotion and AU classification losses, as expressed in Eq. (12), where \({\lambda _1}\) and \({\lambda _2}\) are the weight coefficients of \({L_E}\) and \({L_M}\), respectively.
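A hedged sketch of Eqs. (10)–(12) is given below. The assumed shapes of the prior matrices (both \({P_1}\) and \({P_2}\) stored as E×C arrays, E expressions by C AUs) and the one-hot/multi-hot label encodings reflect our reading of the notation rather than a verbatim implementation.

```python
import tensorflow as tf

def expression_loss(Y, Q, p_expr, P1):
    """Eq. (10). Y: (N, E) one-hot expressions, Q: (N, C) binary AU labels,
    p_expr: (N, E) predicted expression probabilities, P1: (E, C) prior matrix."""
    prior = tf.matmul(Q, P1, transpose_b=True)       # (N, E) prior derived from AU labels
    adjusted = (p_expr * prior + 0.05) / 1.05
    return -tf.reduce_mean(tf.reduce_sum(Y * tf.math.log(adjusted), axis=1))

def au_loss(Y, Q, p_au, P2):
    """Eq. (11). p_au: (N, C) predicted AU probabilities, P2: (E, C) prior matrix."""
    prior = tf.matmul(Y, P2)                         # (N, C) prior derived from the expression label
    adjusted = p_au * prior
    pos = Q * tf.math.log((adjusted + 0.05) / 1.05)
    neg = (1.0 - Q) * tf.math.log((1.05 - adjusted) / 1.05)
    return -tf.reduce_mean(pos + neg)

def total_loss(l_e, l_m, lambda1=1.0, lambda2=1.0):
    """Eq. (12): weighted sum of the expression and AU losses."""
    return lambda1 * l_e + lambda2 * l_m
```

The small offset 0.05 and the 1.05 normalization keep the arguments of the logarithms strictly positive even when the prior suppresses a prediction toward zero.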

Table 1 Expression-dependent AU marginal probability P(AU|E)

4 Experiments

To illustrate the effectiveness of the proposed FE-AURDGCN, extensive experiments have been conducted on the Extended Cohn–Kanade dataset (CK+) [26] and the Radboud Faces Database (RaFD) [17].

4.1 Experimental datasets

The CK+ dataset includes 593 sequences collected from 123 subjects. Among them, we use the 309 sequences of 106 subjects that are labeled with one of the six basic expressions and with AUs. Each video starts with a neutral face and reaches the expression peak in the last frame. Hence, the apex images are selected to construct the dataset, and the following 13 AUs, whose frequency of occurrence is higher than 10, are used in the experiments: AU1, AU2, AU4, AU5, AU6, AU7, AU9, AU12, AU17, AU23, AU24, AU25, and AU27. Tables 1 and 2 summarize the statistics of the conditional probabilities P(AU|E) and P(E|AU), which are used in the loss function.

The RaFD dataset includes 8,040 images of 67 subjects. The dataset contains eight emotional expressions with three gaze directions taken from five view angles. Similar to CK+, we select images annotated with the six basic expressions. Although the dataset does not provide AU labels, each model was trained by a FACS coder to exhibit each emotion; therefore, we can assign AU labels to each image as described in reference [17]. In addition to the 13 AUs selected for CK+, AU10 and AU15 are also included. Fig. 1 summarizes the specific prior distributions between facial expressions and AUs, which are used in the loss function.

Table 2 AU-dependent expression marginal probability P(E|AU)
Table 3 Ablation study on the CK+ database
Table 4 Ablation study on the RaFD database

4.2 Implementation details

First, for the preprocessing of the input image, MTCNN [45] and the OpenCV toolbox are employed to detect the human face and extract facial landmarks, respectively. Thereafter, we use the MMI database [28] to pre-train the cGAN and fine-tune the generative model on the CK+ and RaFD datasets. During the training of the GCN, we set \(\theta \) = 0.05 and \(\lambda _{1}\) = \(\lambda _{2}\) = 1. We use the Adam optimizer with a learning rate of 0.0002, and the mini-batch size is set to 16. All models are trained on an NVIDIA GeForce GTX 1080 GPU using TensorFlow [29].
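For reference, the hyper-parameters listed above can be collected as in the following minimal sketch (TensorFlow 2); the model objects and the data pipeline are omitted and assumed to be defined elsewhere.

```python
import tensorflow as tf

# Hyper-parameters reported in Sect. 4.2
THETA = 0.05               # L1 weight in Eq. (1)
LAMBDA_1 = LAMBDA_2 = 1.0  # loss weights in Eq. (12)
BATCH_SIZE = 16
LEARNING_RATE = 2e-4

optimizer = tf.keras.optimizers.Adam(learning_rate=LEARNING_RATE)
```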

4.3 Evaluation criteria

We evaluate our method on expression and AU recognition using accuracy and the F1-score, respectively. The F1-score is extensively used in binary classification; it considers both the precision P and the recall R, and is computed as in Eqs. (13)–(15).

$$\begin{aligned}&F1\text {-}score = \frac{{2PR}}{{P + R}} \end{aligned}$$
(13)
$$\begin{aligned}&P = \frac{{TP}}{{TP + FP}} \end{aligned}$$
(14)
$$\begin{aligned}&R = \frac{{TP}}{{TP + FN}} \end{aligned}$$
(15)

where TP denotes the number of true positives, FP the number of false positives, and FN the number of false negatives.

Finally, we take the average of accuracy and F1-score as the overall evaluation criterion for model performance, as expressed in Eq. (16); a small helper implementing Eqs. (13)–(16) is sketched after the equation.

$$\begin{aligned} Avg = \frac{{Accuracy + F1\text {-}score}}{{2}} \end{aligned}$$
(16)
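For completeness, Eqs. (13)–(16) can be computed directly from the confusion counts, as in the small illustrative helper below.

```python
def f1_score(tp, fp, fn):
    """Eqs. (13)-(15): F1 from true-positive, false-positive, and false-negative counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def overall_avg(accuracy, f1):
    """Eq. (16): average of accuracy and F1-score as the overall criterion."""
    return (accuracy + f1) / 2
```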

4.4 Ablation study

Effectiveness of cGAN. To verify the effectiveness of the cGAN, we compare the performance of our proposed model with that of models that do not employ the cGAN; in particular, we extract AU information directly from the images and apply the GCN to recognize AUs and expressions. As summarized in Tables 3 and 4, the effectiveness of the cGAN is clear. On the CK+ dataset, our model gains nearly 20% in expression recognition accuracy and 0.12 in AU recognition F1-score compared with the model constructed with only the GCN. On the RaFD dataset, the gains are even more significant, approximately 40% in accuracy and 0.30 in F1-score. Thus, the identity variable has a significant impact on expression and AU recognition, and employing the cGAN is effective for both tasks. Meanwhile, Fig. 5 illustrates samples of the neutral faces reconstructed by the cGAN on the CK+ and RaFD databases. The first column shows the input image, the second column the generated face, and the third column the ground-truth face. The filtered expressions from top to bottom are angry, disgusted, fearful, happy, sad, and surprised. As shown, the expression information is successfully removed by the generator while the identity information is preserved.

Fig. 5

Illustration of the faces generated by the generator on the CK+ and RaFD databases

Effectiveness of Graph Convolutional Network. To verify the effectiveness of the GCN for emotion recognition, we compare our model with one that only uses a multi-layer perceptron (MLP) after the cGAN (cGAN_MLP). On the CK+ dataset, as summarized in Table 3, our model performs better in expression recognition regardless of whether the proposed loss function is used: our method outperforms the cGAN_MLP model by 1.6%, and by 0.6% when the loss function with prior knowledge is used. On the RaFD dataset, as summarized in Table 4, without the help of \({{P}_{1}}\) and \({{P}_{2}}\), our model's expression recognition performance is worse than that of the MLP model, but its AU recognition performance is better by 0.033. With \({{P}_{1}}\) and \({{P}_{2}}\), our model achieves a 1.91% and 0.009 boost in expression and AU recognition, respectively. Overall, the proposed FE-AURDGCN performs better than the cGAN_MLP model, and we conclude that constructing the AU-related knowledge graph is useful for both expression and AU recognition.

Effectiveness of Loss Function with Prior Knowledge. To verify the effectiveness of the rule-based prior knowledge, we compare the performance of the proposed FE-AURDGCN with that of models without the relation constraints. On the CK+ dataset, as summarized in Table 3, the model with the two conditional probability constraints achieves a 3.2% boost in expression recognition accuracy and a 0.007 boost in AU recognition F1-score compared with the model without prior knowledge. On the RaFD dataset, the proposed method achieves boosts of 2.16% and 0.041. Meanwhile, during training, when the predicted occurrence probability of some AU is low while P(AU|E) indicates a high probability, the output probability is increased by multiplying with this prior; otherwise, the reverse occurs. Similarly, the final expression output is also corrected. This demonstrates the importance of the prior knowledge constraints.

4.5 Evaluation of AU recognition

For the recognition of AUs, we compare our method with alternative methods, including the shared feature learning and semantic relation model (SFL-SRM) [47], the latent regression Bayesian network (LRBN) [9], the confidence-weighted local subspace random forest (WLS-RF) [4], and the deep AU graph network (DAUGN) [24]. As summarized in Table 5, our model outperforms all of these state-of-the-art methods. SFL-SRM adopts a multi-task feature learning method to learn shared features and thereafter uses a BN to model the co-existent and mutually exclusive semantic relations among AUs from the target labels. [9] constructs a three-layer hybrid BN, whose top two layers form a latent regression BN representing relations among multiple AUs and whose bottom two layers are BNs that use expressions to facilitate the estimation of label dependencies among AUs. The WLS-RF algorithm extracts a local expression subspace to describe facial expressions as well as AUs. DAUGN proposes a novel method to localize AU regions and uses a graph-based CNN to combine local-appearance and global-geometry information to recognize expressions or AUs. Compared with the first two methods, our method achieves 0.184 and 0.061 higher F1-scores, and compared with the last two models, 0.010 and 0.001 higher AUC, respectively. Although our results are very close to the baselines in terms of AUC, our model provides more information, such as additional AU labels and expression labels, implying that we can obtain the same information with fewer computations and reduced time cost. Moreover, for the individual AUs, the F1-scores of our method are higher than those of the others on 10 out of 13 AUs, and the AUC on 5 out of 13 AUs. Therefore, the overall results demonstrate the superiority of our method.

Table 5 Comparison of quantitative AU recognition results on CK+ database
Fig. 6

Confusion matrix on CK+ database

4.6 Evaluation of emotion recognition

For the CK+ dataset, we compare our method with other state-of-the-art methods, including AU-aware deep networks (AUDN) [22], WLS-RF [4], de-expression residue learning (DeRL) [42], the deep temporal appearance-geometry network (DTAGN) [12], the dynamic BN (DBN) [20], the BN [41], and 3D CNNs with deformable action parts (3DCNN+DAP) [30]. As summarized in Table 6, in terms of multi-task recognition results, the average accuracy of our FE-AURDGCN model shows improvements of 7.70%, 3.70%, and 2.70%. DBN constructs a three-layer model with facial expressions, AUs, facial features, and landmark points, whereas BN adds two prior layers, namely brain cognition and facial muscle layers, on this basis. However, the inputs of the BN are manually extracted characteristics, so it cannot be learned end-to-end. 3DCNN+DAP uses 3D filters over local action parts to predict the expression intensity of a video segment. To evaluate practicability, we also compare our model with single-task methods. The results indicate that our multi-task model improves over AUDN and WLS-RF by 3.05% and 0.80%, respectively, while its performance is 1.9% and 2.3% lower than DTAGN and DeRL, respectively. AUDN generates a complete representation of facial images to expressly describe the appearance of specific areas; however, it only considers partial patches rather than the entire face. DTAGN uses temporal information extracted from videos and relies on other models to fine-tune network parameters, whereas our model only employs static images to recognize expressions, making it more suitable for applications where sequences are not available. DeRL requires a large amount of training data because its training result directly and significantly influences the final results, whereas our model pre-trains the cGAN on only a small dataset without data augmentation, requiring fewer computations. Moreover, our model is a multi-task network, which means it can provide more detailed expression information, such as AUs, with fewer network parameters and at a reduced time cost. On the other hand, the application of the prior knowledge, which includes the AU dependency relationships and the mapping relationship between expressions and AUs, accords with the human psychological mechanism; in this sense, our model has considerable value for further research. Meanwhile, the confusion matrix in Fig. 6 shows that, with the help of the graph structure, the accuracy reaches 100% for the angry and happy expressions. However, because of excessively small sample sizes, there are noticeable errors for the fearful and sad expressions.

Table 6 Comparison of quantitative expression recognition results on CK+ database
Fig. 7

Confusion matrix on RaFD database

For the RaFD dataset, as summarized in Table 7, we compare our method with other state-of-the-art methods, including the neural network ensemble (NNE) [1], SURF boosting (SURFB) [35], SVM [21], the multi-channel CNN (MCCNN) [33], and the transfer learning convolutional network (TLCNN) [40]. Because of the lack of AU labels, only single-task emotion recognition models exist for this dataset. Although MCCNN and TLCNN perform better than our model, our model is more practical. MCCNN learns and fuses spatial-temporal features in the form of optical flow, but the corresponding neutral faces of query images are not always available. TLCNN requires a large dataset to pre-train the deep network before fine-tuning it for expression recognition; in contrast, our network has fewer layers. Moreover, our model provides additional AU information, which offers reference value for understanding facial expression behavior. Compared with the remaining methods, the accuracy of the proposed method shows improvements of 0.53% and 2.28%. [1] employs HOG features to train binary CNNs and thereafter ensembles them to detect expressions. [35] utilizes SURF features and applies a boosting algorithm to train classifiers. These two methods both build N networks for N expressions and therefore involve heavy and complex computation. Finally, the confusion matrix of the proposed method is presented in Fig. 7, where the recognition accuracies of the six basic expressions all exceed 90%.

Table 7 Comparison of quantitative expression recognition results on RaFD database
Table 8 Recognition results of the proposed method in cross-database experiments

4.7 Evaluation of cross-datasets performance

Facial expression and AU recognition methods still struggle to achieve high accuracies and scores when evaluated under a cross-database validation protocol. Because of culture and race, different persons exhibit different combinations of AUs; even though the environment is controlled within each database, the facial behaviors are not. Therefore, it is important to know the performance obtained by a model when it is trained on one database and tested on another. As summarized in Table 8, with the help of the prior knowledge \({{P}_{1}}\) and \({{P}_{2}}\), the accuracies and F1-scores of expression and AU recognition both increase, indicating that the mapping relationship between expressions and AUs has a positive impact on the generalization performance. Overall, however, the cross-database performance is much worse than the within-database performance. This may be attributed to the dependence of the prior knowledge on the two limited databases. Specifically, the RaFD database labels the six basic emotion images strictly according to specific AU combinations, whereas the CK+ database includes prototypes and major variants of each emotion, meaning that its sequences were collected under looser conditions. Because the data sources differ, the prior knowledge follows different distributions; more databases are required to improve generalization. Additionally, the recognition results of the model trained on the RaFD database are both higher than those of the model trained on the CK+ database. The sample size of RaFD is nearly four times that of CK+, indicating that the generalization of the model is limited by the size of the training set.

5 Conclusion

In this paper, we present a novel approach, FE-AURDGCN, for recognizing expressions and AUs. First, a generative model is trained with a cGAN to filter out identity information and extract expression information. Thereafter, we exploit the dependencies among AUs to construct an expression graph and embed its nodes with multiple AU-related patches extracted from the generative model. Finally, we use prior knowledge matrices to represent the strong dependencies between expressions and AUs and integrate them into the loss function to constrain the model. Experimental results on the extensively used CK+ and RaFD datasets demonstrate the superiority of the introduced framework over state-of-the-art methods. In the future, we plan to explore how to incorporate the temporal information of sequences into the network to further improve performance.