1 Introduction

Facial expressions play an essential role in daily human life. Facial expression recognition (FER) is crucial in real-world applications such as service robots, driver fatigue detection and human-computer interaction [1]. Recently, with the availability of large-scale databases (AffectNet [2], RAF-DB [3], SFEW [4], FER2013 [5], FER2013plus [6], EmotionNet [7], etc.), many deep learning approaches for FER [8,9,10,11,12] have been proposed and have achieved promising performance. For example, the recently proposed TransFER [60] achieves strong performance by using a complex Transformer to learn diverse relation-aware local representations. However, the accuracy of these methods degrades considerably in the wild. On the one hand, face images vary with age, race, gender and culture, and images of the same person still vary in pose, occlusion, illumination and other factors. On the other hand, the training samples themselves have quality problems. For example, datasets collected from the Internet suffer from imbalanced categories and inconspicuous features, because the differences between categories are small and labeling is difficult. This lowers the recognition performance of the model.

Facial expressions are usually described in two ways: basic expression categories (Anger, Disgust, Fear, Happy, Sad and Surprise) and facial action units (AUs) [13]. The former is a common and easily understood description of facial behavior. The latter comes from the Facial Action Coding System (FACS) [14], which is domain knowledge derived from research by human experts. Facial action units describe the movement of local facial regions in finer detail. Basic expressions describe facial behavior globally, while facial action units describe local changes of the facial muscles, so facial expressions are strongly correlated with, and dependent on, facial action units. These relationships are what we call domain knowledge. For example, as shown in Fig. 1, when the action units “Eyebrows together and down” and “Lips closed and down” are detected, the expression is most likely “Angry”. Likewise, the expression is more likely to be “Surprise” if the action units “Eyebrows raised and wide-eyed” and “Mouth open without stretching” are detected. Therefore, modeling the relationship between facial action units and facial expressions should help improve the performance of facial expression recognition, especially for highly uncertain or ambiguous expressions.

Fig. 1 Examples of the correlation between facial expression and action units

Thus, this paper uses the relationship between facial expressions and facial action units to define a new heuristic objective function that guides neural network learning, so that facial expression recognition becomes more robust and accurate. Ideally, samples with the same expression category should lie close to each other in feature space. However, when the learned expression features are ambiguous, some samples of the same category may lie far apart, which can lead to classification errors. In such cases, a model trained only with a generic classification loss (Cross-Entropy Loss) [15] may misclassify these samples, so choosing a more appropriate objective function is critical. By using the domain knowledge of facial expressions, the objective function can encode experience summarized by human experts, guide model learning more reasonably and efficiently, and thereby improve the accuracy of facial expression recognition. The main contributions are summarized as follows:

  1. Taking the Facial Action Coding System as domain knowledge, we establish the relationships between expression categories and facial action units. From these, the relationships among expression categories themselves can be inferred and exploited. Neglecting these relationships is one of the important causes of classification errors in current models.

  2. We propose a new heuristic objective function based on the domain knowledge of facial expressions. It guides the model to widen the distance between samples from different categories, so that the model classifies samples better.

  3. On standard databases, we compare our method with existing deep learning models to verify the universality, effectiveness and superiority of the proposed method.

The remainder of this paper is organized as follows: Sect. 2 reviews related work on domain-knowledge-based FER. Section 3 presents the proposed method in detail. The experimental results and analysis are given in Sect. 4. Finally, conclusions are drawn in Sect. 5.

2 Related work

In this section, we discuss deep learning techniques for facial expression recognition based on domain knowledge, as well as the related loss functions.

2.1 FER based on domain knowledge

Domain knowledge is knowledge related to the task that a machine learning method aims to solve [16]. Facial expression recognition also has domain knowledge that can be used to improve the accuracy of machine learning. For example, there are known connections between expressions and facial action units. According to FACS, action units such as Brow Lowerer (AU4) and Lid Tightener (AU7) appear when “Anger” occurs, while Cheek Raiser (AU6) and Lip Corner Puller (AU12) usually appear with “Happy”. In addition, connections between expression and identity information have been established in psychology [17,18,19] and in micro-expression research, and correlations among expressions have also been characterized [20]. Several studies have used such domain knowledge for facial expression recognition [21,22,23,24,25,26,27]. For example, Pu et al. [28] exploited the relationship between AUs and expressions to mine useful local AU information and enhance image feature learning. Zhang et al. [17] suggested that identity information can promote facial expression recognition and proposed a two-branch identity-expression network. He et al. [29] proposed a method for joint facial expression and AU recognition based on the graph convolutional network [30], exploiting the dependence between expressions and AUs. Wen et al. [31] observed the confusion relationships among expression categories from the confusion matrix, proposed a domain information loss function and achieved dynamic objective learning. Unlike the method in [31], this paper deduces the relationships among expressions from the relationships between expression categories and facial action units. Based on this domain knowledge, we design a heuristic objective function for facial expression recognition to guide model training.

2.2 Loss function

Deep neural networks require a loss function to guide model learning. Many loss functions exist, such as Cross-Entropy Loss [15], Contrastive Loss [32], Center Loss [33], and Triplet Loss [34]. Softmax Loss [35] and Cross-Entropy Loss are the ones most commonly applied in facial expression recognition, and new loss functions for this task have also been proposed. For instance, Large-Margin Softmax Loss [36] explicitly guides model learning so that each class becomes more compact and the classes more separable. As an auxiliary loss function, Center Loss is usually combined with Softmax Loss; it further reduces the intra-class distance of expression features while keeping the features of different categories distinguishable. To learn more separable angular features, SphereFace Loss [37] improves on Softmax Loss by normalizing the weights and setting the bias to zero. CosFace Loss [38] additionally normalizes the features. Because a feature norm that is too small causes the training loss to be too large, it introduces a scaling factor and a margin penalty, which yield more separable features. ArcFace Loss [39] places the penalty on the angle and thus constrains the classification boundary directly in angular space. The domain information loss [31] derives domain information from the confusion matrix, making model learning more targeted.

Our method differs from the above in that it builds a new loss function on redefined domain knowledge about which facial expressions are easily confused. For each sample, the center of its own category serves as the positive example and the centers of the easily confused categories serve as negative examples. These are the inputs of the new loss, which enlarges the distance between categories and reduces the distance among expressions of the same category.

3 Proposed method

Based on the domain knowledge relating expression categories and facial action units, we propose a new heuristic objective function. It makes the distance from each sample to its own class center as small as possible and, at the same time, pushes the sample away from the centers of the expression classes with which it is easily confused. The learned features therefore increase the distance between samples of easily confused categories, which improves the generalization ability and robustness of the deep neural network.

3.1 Domain knowledge of expression and AUs

Expression is the overall effect of facial movements. It is generally believed that the six basic expressions can be described by AUs independently of age, race, and culture [40]. Take “happy” for example: Cheek Raiser (AU6) and Lip Corner Puller (AU12) appear simultaneously to produce a happy expression, as shown in Fig. 2. According to FACS, Ekman gives a correspondence between AUs and basic expression categories, as shown in Table 1 [41, 42]. The relationships among expression categories can therefore be inferred indirectly from the relationships between expressions and AUs. Some emotions, such as happy and disgust, are easy to recognize because their AUs differ. Others, such as surprise and fear, are difficult to distinguish even for humans; their AU sets intersect, sharing AU1, AU2, and AU26. This indicates that recognition errors between expression categories occur easily when the categories have AUs in common.

Fig. 2 Relationship between facial expressions and facial action units

Table 1 Relationship between expression and AUs

Following this intuition, we can conclude from Table 1 that anger, fear and sad are easily confused, and that fear, sad and surprise are also difficult to distinguish. Moreover, disgust and sad are somewhat hard to tell apart. Consequently, we obtain the easily confused relationships among five expression categories: Anger, Fear, Sad, Surprise and Disgust. The concrete mixing (confusion) relationships are shown in the Expression-Expression Relationships part of Fig. 3.
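The confusable-label sets implied by this paragraph can be written down directly. The sketch below is one possible reading of the text, not the authors' code: it assumes that every pair of expressions inside a listed group is mutually confusable, and the names `CONFUSABLE_GROUPS` and `build_confusion_sets` are hypothetical.

```python
from itertools import permutations

# Groups of easily confused expressions, as read from the paragraph above.
# Assumption: every pair inside a group is treated as mutually confusable.
CONFUSABLE_GROUPS = [
    {"anger", "fear", "sad"},
    {"fear", "sad", "surprise"},
    {"disgust", "sad"},
]


def build_confusion_sets(groups):
    """Map each label y to the set E(y) of labels it is easily confused with."""
    confusion = {}
    for group in groups:
        for a, b in permutations(group, 2):
            confusion.setdefault(a, set()).add(b)
    return confusion


E = build_confusion_sets(CONFUSABLE_GROUPS)
for label in sorted(E):
    print(label, "->", sorted(E[label]))
```

Under this reading, Sad is confusable with all four other listed categories, Disgust only with Sad, and Happy has no entry at all.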

Fig. 3 The heuristic objective function used by the deep neural network to learn better expression features

3.2 Heuristic objective function

With the confusion relationships among expression categories as heuristic domain knowledge, we define a new heuristic objective function that guides the deep neural network to enlarge the distance between different expression categories and to make the distance among samples of the same category as small as possible. The form of our loss function is similar to that of the triplet loss, but its semantics differ. The core of Triplet Loss is a shared model applied to an anchor example, a positive example and a negative example [43]: the model pulls the anchor toward the positive example and pushes it away from the negative example. This paper implements the heuristic objective function by modifying the Triplet Loss.
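For reference, the standard triplet loss that the heuristic objective builds on can be sketched in a few lines of NumPy. This is a generic illustration rather than the authors' implementation; the function name `triplet_loss`, the margin value 0.2 and the toy embeddings are all assumptions.

```python
import numpy as np


def triplet_loss(anchor, positive, negative, margin=0.2):
    """Standard triplet loss, averaged over a batch of embedding triplets."""
    d_pos = np.sum((anchor - positive) ** 2, axis=1)  # squared anchor-positive distance
    d_neg = np.sum((anchor - negative) ** 2, axis=1)  # squared anchor-negative distance
    # Hinge: only triplets that violate the margin contribute to the loss.
    return float(np.mean(np.maximum(d_pos - d_neg + margin, 0.0)))


anchor = np.array([[0.0, 0.0]])
positive = np.array([[0.1, 0.0]])
near_negative = np.array([[0.2, 0.0]])  # negative closer than the margin allows
far_negative = np.array([[2.0, 0.0]])   # negative already far enough away

loss_near = triplet_loss(anchor, positive, near_negative)
loss_far = triplet_loss(anchor, positive, far_negative)
print(loss_near, loss_far)
```

The near negative yields a positive loss because it violates the margin, while the far negative yields zero loss and therefore no gradient.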

In the deep neural network model, a facial expression image \(x\) is mapped by the learned embedding to a feature \(f\left( x \right)\) in a multi-dimensional Euclidean space, and our goal is to optimize this embedded feature representation. For each training sample, the heuristic objective function uses three kinds of features produced by the network: \(f\left( {x_{i} } \right)\), \(f\left( {x_{i}^{s} } \right)\) and \(f\left( {x_{i}^{m} } \right)\), where \(f\left( {x_{i}^{s} } \right)\) is the center of the sample's own expression category and \(f\left( {x_{i}^{m} } \right)\) is the center of an expression category that is easily confused with it (i.e., among the center features of anger, fear, sad, surprise and disgust). The easily confused categories are obtained from the domain knowledge above. The heuristic objective function aims to bring the features of each training sample closer to the center of its own category while making the distance between the sample and the centers of the easily confused categories as large as possible. The formulas are as follows:

$$ \left\| {f\left( {x_{i} } \right) - f\left( {x_{i}^{s} } \right)} \right\|_{2}^{2} + \alpha < \left\| {f\left( {x_{i} } \right) - f\left( {x_{i}^{m} } \right)} \right\|_{2}^{2} $$
(1)
$$ f\left( {x_{i}^{s} } \right) = \frac{1}{n\left( s \right)}\sum\limits_{{x_{j} \in B,\; y_{j} = s}} f\left( {x_{j} } \right) $$
(2)
$$ f\left( {x_{i}^{m} } \right) = \frac{1}{n\left( m \right)}\sum\limits_{{x_{j} \in B,\; y_{j} = m}} f\left( {x_{j} } \right) $$
(3)
$$ n\left( y \right) = \left| {\left\{ {x_{j} \mid x_{j} \in B,\; y_{j} = y} \right\}} \right| $$
(4)

The heuristic objective function is defined as follows:

$$ L_{{{\text{HO}}}} = \sum\limits_{i = 1}^{N} {\left[ {\left\| {f\left( {x_{i} } \right) - f\left( {x_{i}^{s} } \right)} \right\|_{2}^{2} - \sum\limits_{{m \in E\left( {y_{i} } \right)}} {\left\| {f\left( {x_{i} } \right) - f\left( {x_{i}^{m} } \right)} \right\|_{2}^{2} } + \alpha } \right]_{ + } } $$
(5)

where \(E\left( {y_{i} } \right)\) denotes the set of labels easily confused with \(y_{i}\), which can be constructed in advance from the domain knowledge; \(B = \left\{ {x_{i} } \right\}\) is a mini-batch of training samples; \(y_{i}\) is the expression category of \(x_{i}\); \(\left\| \cdot \right\|_{2}\) is the Euclidean distance; and \(\left[ \cdot \right]_{ + }\) returns the bracketed value when it is greater than 0 and returns 0 otherwise. \(\alpha\) is the margin parameter, that is, the minimum gap required between the feature distance from \(x_{i}\) to \(x_{i}^{m}\) and the feature distance from \(x_{i}\) to \(x_{i}^{s}\). Its role is to widen the gap between the anchor-positive distance and the anchor-negative distance. This matches our motivation that HO should enhance the generalization ability and robustness of the deep neural network.
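A literal NumPy reading of Eqs. (2) to (5) may help make the definition concrete. It is a minimal sketch, not the authors' code: the function name `heuristic_objective`, the margin value `alpha=1.0`, the integer label encoding and the choice to skip confusable classes that are absent from the mini-batch are all assumptions.

```python
import numpy as np


def heuristic_objective(features, labels, confusion_sets, alpha=1.0):
    """Sum over the batch of [d(x_i, own center) - sum_m d(x_i, center_m) + alpha]_+."""
    features = np.asarray(features, dtype=float)
    labels = np.asarray(labels)
    # Eqs. (2)-(4): each class center is the mean feature of that class in the batch.
    centers = {int(c): features[labels == c].mean(axis=0) for c in np.unique(labels)}
    total = 0.0
    for f_i, y_i in zip(features, labels):
        y_i = int(y_i)
        d_same = np.sum((f_i - centers[y_i]) ** 2)
        # Assumption: a confusable class with no sample in the batch is skipped.
        d_mixed = sum(
            np.sum((f_i - centers[m]) ** 2)
            for m in confusion_sets.get(y_i, ())
            if m in centers
        )
        total += max(d_same - d_mixed + alpha, 0.0)
    return float(total)


# Two classes (0 and 1) that are declared mutually confusable.
confusion_sets = {0: {1}, 1: {0}}
labels = [0, 0, 1, 1]
separated = [[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]]
overlapping = [[0.0, 0.0], [0.0, 2.0], [0.5, 0.0], [0.5, 2.0]]

separated_loss = heuristic_objective(separated, labels, confusion_sets)
overlapping_loss = heuristic_objective(overlapping, labels, confusion_sets)
print(separated_loss, overlapping_loss)
```

In this toy batch the well-separated classes incur no loss, while the overlapping classes are penalized, which is the behavior the margin is meant to produce. Note that under this literal reading a label with an empty \(E\left( {y_{i} } \right)\) is still pulled toward its own center.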

To make the semantics of the heuristic objective function clearer, we present an example. In Fig. 4, anger, sad and disgust are easily confused expression categories. After learning with the heuristic objective function, the spatial distribution of the samples changes: the distances among the sad samples become shorter, while the distance between the anger and sad samples becomes larger, which makes classification easier and increases the generalization ability of the network model.

Fig. 4 Example of easily mixed expressions

3.3 Application mode of the HO

Deep neural networks used for expression recognition require loss functions to guide learning. At present, the main loss functions for FER are Softmax Loss and Cross-Entropy Loss. More suitable loss functions would help deep neural networks learn more discriminative expression features for FER. Each loss function has its own meaning and addresses its own problem, so different losses are complementary to some degree. To validate the heuristic objective function and apply it to FER, we combine it with Cross-Entropy Loss to construct the most basic deep neural network. The basic structure is shown in Fig. 3. Because facial expression recognition is a multi-class image classification problem, the Cross-Entropy Loss is used as the classification objective:

$$ L_{{{\text{cls}}}} = - \frac{1}{N}\mathop \sum \limits_{i} \mathop \sum \limits_{c = 1}^{t} y_{ic} \log \left( {p_{ic} } \right) $$
(6)
$$ p_{ic} = p\left( {y_{i} = c\left| {x_{i} } \right.} \right) $$
(7)

where \( t \) is the number of categories, \(y_{ic}\) is an indicator that equals 1 when the label of \(x_{i}\) is class \(c\) and 0 otherwise, and \(p_{ic}\) is the predicted probability that \(x_{i}\) belongs to class \(c\). The loss function of the entire neural network is then the combination of the Cross-Entropy Loss and the heuristic objective function:

$$ L = L_{{{\text{cls}}}} + \lambda *L_{{{\text{HO}}}} $$
(8)

where \(\lambda\) is an adjustable nonnegative weight coefficient. The combination is not limited to Cross-Entropy Loss: many other loss functions can be combined with our heuristic objective function under various combination rules.
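As a small worked illustration of Eqs. (6) and (8), the sketch below computes the cross-entropy term from raw logits and adds a weighted heuristic term. It assumes that \(p_{ic}\) comes from a softmax over the network outputs, which the text does not state explicitly; the function names and the toy values are hypothetical, and `l_ho` stands for any precomputed value of Eq. (5).

```python
import numpy as np


def cross_entropy(logits, labels):
    """Eq. (6): mean negative log-probability of the true class (softmax assumed)."""
    logits = np.asarray(logits, dtype=float)
    shifted = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return float(-log_probs[np.arange(len(labels)), labels].mean())


def total_loss(l_cls, l_ho, lam=0.5):
    """Eq. (8): L = L_cls + lambda * L_HO."""
    return l_cls + lam * l_ho


logits = [[2.0, 0.0, 0.0], [0.0, 0.0, 0.0]]  # toy outputs for two samples, three classes
labels = [0, 2]
l_cls = cross_entropy(logits, labels)
loss = total_loss(l_cls, l_ho=3.0)
print(l_cls, loss)
```

Setting `lam=0` recovers plain cross-entropy training, so the weight controls how strongly the domain knowledge shapes the feature space.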

4 Experimental results

To evaluate the effectiveness of the proposed heuristic objective function, we performed extensive experiments on three widely used facial expression databases. The results demonstrate the universality and superiority of the proposed method in comparison with existing facial expression recognition methods.

4.1 Experimental data

RAF-DB [3] is a crowdsourced database that includes two distinct subsets: a basic subset with single labels and a compound subset with double labels. In our experiments, the single-label subset with seven classes of basic emotions is used. The training set and testing set contain 12,271 and 3068 images, respectively, and their expression distributions are nearly identical.

FER2013Plus [6], which is extended from FER2013 [5], consists of 28,709 training images, 3589 validation images and 3589 testing images. It has 8 expression classes, including the newly introduced contempt class.

AffectNet [2] is the largest database of facial expressions, with approximately 400k manually annotated images. We choose 283,901 facial images as the training set and 3500 images as the testing set, with the same seven basic expressions as in RAF-DB.

The statistics of these datasets are summarized in Table 2.

Table 2 Statistics of the images in the datasets

4.2 Implementation details

All experiments are run in Python on two NVIDIA GeForce RTX 3090 GPUs. We use ResNet34, pre-trained on ImageNet ILSVRC-2012, as the feature extractor for facial expression images, implemented in the PyTorch framework. We use the stochastic gradient descent (SGD) optimizer with a momentum of 0.9 and a weight decay of 5e-4. The initial learning rate is set to 0.01, the number of epochs to 100 and the batch size to 64. The hyper-parameter \(\lambda\) balances the two terms of the loss function and is set to \(\lambda\) = 0.5 by default.

4.3 Effects of the proposed loss

To validate the effectiveness and universality of the proposed heuristic objective function, we apply it to several recent deep neural network models and conduct experiments on the three datasets. The experimental results are shown in Table 3, where +HO denotes adding the heuristic objective function to an existing model. The overall accuracy improves to varying degrees. The average recognition accuracy on the RAF-DB dataset reaches the second-best result of 89.03%. The average recognition accuracy on the FER2013plus and AffectNet datasets is also better than that of the baseline models.

Table 3 Effect of heuristic objective function on different models (accuracy %)

To further analyze the effects of the heuristic objective function on different emotion categories, we present the confusion matrices of the baseline model (ResNet34) and of the model with the heuristic objective function, as shown in Fig. 5.

Fig. 5 Confusion matrix analysis

The confusion matrices for the RAF-DB dataset in Fig. 5a, b show that the accuracy of Anger increased from 79.33 to 87.84%, a gain of 8.51 percentage points. The accuracy of Fear increased from 28.75 to 49.38%, a gain of 20.63 points, and that of Sad from 83.26 to 84.10%, a gain of 0.84 points. The probability of classifying Disgust samples as Sad decreased from 12.16 to 10.81%. The accuracy of Surprise increased from 71.60 to 73.46%, a gain of 1.86 points. However, for some categories the accuracy improves while the corresponding confusion probabilities do not drop. In addition, the accuracy improvement of a single category can be as high as 20.63 points (e.g., Fear), while the improvement over the whole dataset is only 2.8%. The reason is the category imbalance of the dataset. As shown in Table 2, for the RAF-DB dataset, Fear has only 281 images, representing 2.29% of the total data. The confusion matrices for the FER2013plus dataset are shown in Fig. 5c, d. The classification accuracy of Anger increased from 82.78 to 83.15%, up by 0.37 points; Fear increased from 46.99 to 53.01%, up by 6.02 points; Sad increased from 73.96 to 74.48%, up by 0.52 points; Disgust increased from 27.78 to 33.33%, up by 5.55 points; and Surprise increased from 89.90 to 91.67%, up by 1.77 points. A similar analysis can be performed for the confusion matrices of the AffectNet dataset shown in Fig. 5e, f. These analyses show a clear effect of the heuristic objective function and support the effectiveness and universality of the proposed method.
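The imbalance argument can be checked with a short back-of-the-envelope calculation. The sketch below uses only figures quoted in the text and assumes, as stated in Sect. 4.1, that the test set has nearly the same class proportions as the training set; the variable names are illustrative.

```python
fear_train = 281             # Fear images in RAF-DB (from the text)
total_train = 12271          # RAF-DB training images (from the text)
fear_gain = 49.38 - 28.75    # Fear accuracy gain, in percentage points

fear_share = fear_train / total_train
# Assumption: test-set class proportions match the training-set proportions.
contribution = fear_share * fear_gain

print(f"Fear share of the data: {fear_share:.2%}")
print(f"Fear contribution to the overall gain: {contribution:.2f} points")
```

Under this assumption, the 20.63-point gain on Fear adds only about 0.47 points to the overall accuracy, which is why a large per-class gain on a rare class translates into a modest overall gain.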

Fig. 6 The learned feature distribution of ResNet34. For (a) and (b), 0: Anger; 1: Disgust; 2: Fear; 3: Happy; 4: Sad; 5: Surprise; 6: Neutral; for (c) and (d), 0: Anger; 1: Contempt; 2: Disgust; 3: Fear; 4: Happy; 5: Sad; 6: Surprise; 7: Neutral

4.4 Comparison with state-of-the-art methods

To further validate the proposed loss function, we compare it with more state-of-the-art methods on RAF-DB, FER2013Plus, and AffectNet. The experimental results are shown in Table 4. Table 4a shows that our method, applied on top of existing models, obtains the second-best recognition rate (89.03%) on RAF-DB. The recognition results on the FER2013Plus database are shown in Table 4b. Because FER2013Plus has eight expression categories, we evaluate all methods with eight categories (i.e., the seven basic expressions and contempt). Our method, applied on top of existing models, achieves a recognition rate of 89.25%, the best performance on FER2013Plus. The recognition accuracy on AffectNet is shown in Table 4c, where our method achieves the highest recognition rate on AffectNet7 (64.02%). These results on different real-world facial expression datasets indicate that the proposed method obtains better facial expression recognition performance in the wild.

Table 4 Comparison of accuracy of different methods across individual datasets

In addition, Table 5 shows that different base methods have different time costs. When the HO loss is used, the cost increases further, but only by a small margin. Although FDRL [49] achieves better performance, it costs too much time. This is why RUL [46] is selected to verify the effectiveness, universality, and superior performance of the proposed method.

Table 5 MFlops of different methods on RAF-DB

4.5 Visual analysis

To further analyze the effect of the heuristic objective function, we use t-SNE [44] to visualize the distribution of the samples after extracting the features of each image with the baseline model and with the model trained with the heuristic objective function. As shown in Fig. 6, after adding the heuristic objective function, the feature distribution of each facial expression class is more compact, and the boundaries between the feature distributions of different facial expressions are clearer. This shows that the heuristic objective function promotes intra-class compactness and widens the inter-class distance to some extent.

We also observe the effects of the heuristic objective function directly by analyzing the classification results of individual samples. We selected 12 images from the RAF-DB and AffectNet datasets, as shown in Fig. 7, where the first row of labels gives the true expressions, the second row gives the labels predicted by the baseline model, and the third row gives the labels predicted by the model with the heuristic objective function. Some of the labels predicted by the baseline model are wrong, while the predictions are correct after adding the heuristic objective function. This indicates that the heuristic objective function can correct errors between easily confused expression categories.

Fig. 7 Visualization of the results before and after the adjustment

5 Conclusion and future work

This paper proposes a new heuristic objective function based on domain knowledge. It enlarges the distance between expression categories while narrowing the distance between samples of the same category. With the heuristic objective function, a deep neural network can effectively alleviate inter-class confusion in expression recognition, which helps improve the accuracy of facial expression recognition. The heuristic objective function is also universal: it can be used in most deep neural networks for facial expression recognition and can be combined with various existing loss functions in a complementary way to achieve higher accuracy. Although the experiments have demonstrated the effectiveness of the proposed method, its effect can be further improved. The proposed heuristic objective function is based on one form of domain knowledge relevant to facial expression recognition, but such knowledge can be obtained from diverse perspectives. In the future, we will explore the action relationships within and among expression classes in more depth in order to define a better heuristic objective function. We will also study the connection between heuristic objective functions and existing loss functions more carefully in order to find the best way of combining them for FER.