1 Introduction

Facial expression recognition is a computer vision technique that enables emotion recognition in uncontrolled environments through facial feature analysis. As one of the most potent signals in humans, facial expressions play a pivotal role in enabling computers to decipher human emotional states and behavioral intentions. Hence, accurate facial expression recognition is paramount for developing intelligent systems (such as smart robots and virtual reality) that can perceive and respond to human emotions.

In recent years, researchers have achieved significant advancements in facial expression recognition thanks to the development of deep learning technologies and the availability of large-scale facial expression datasets. Supervised learning methods based on convolutional neural networks (CNNs) and Vision Transformers have been introduced to facial expression recognition tasks, demonstrating outstanding performance. Some research methodologies have ventured to incorporate intricate attention mechanisms [1, 2] or to utilize prior knowledge (such as facial landmarks) to guide the networks [3,4,5], subsequently attaining even greater accuracy in facial expression recognition.

While these methods have effectively enhanced network performance, they share a substantial limitation: they rely primarily on supervised learning. The visual complexity of facial expression images, coupled with their marked inter-class similarities and intra-class variations, means that annotating facial images requires significant time and specialized expertise. Moreover, considering the privacy-sensitive nature of facial expressions and the subjective annotation biases among different annotators, the constructed datasets demand rigorous validation. This makes the acquisition of large-scale annotated facial expression data exceedingly challenging, suggesting that future approaches should lean toward reduced label dependency, such as semi-supervised [6] or self-supervised methods.

The Masked Autoencoder [7], in its application to visual representation learning, successfully transferred BERT-style pre-training strategies to the domain of computer vision. This method realized high-quality unsupervised representation learning by establishing an asymmetric encoder–decoder structure based on the vision transformer. Nevertheless, due to the absence of image-specific inductive biases in the vision transformer, its core multi-head attention mechanism tends to focus on global information, often overlooking low-level, high-frequency details. This characteristic poses challenges for pixel-level facial unit reconstruction, making it difficult to acquire high-quality facial representations during the pre-training phase.

To address these deficiencies in existing facial expression recognition work, this paper introduces a novel training strategy for facial expression recognition models: the multi-level feature selector (MFS). This method learns multi-level facial representations from unlabeled data and performs unsupervised fine-grained feature selection during the fine-tuning phase, achieving high-precision facial expression recognition. During the pre-training phase, we design the multi-level feature combiner. It integrates multiple latent features within masked images, compensating for the vision transformer’s shortfall in high-frequency information and thereby helping the model acquire rich facial representations. In the fine-tuning phase, we devise the feature selector. During learning, this module adaptively filters out non-discriminative features based on the discriminative power of the feature units themselves, consequently highlighting highly discriminative regions. Because highly discriminative regions are spatially distributed in a discrete manner, feature filtering yields a spatially sparse set of features. Merely concatenating these features and aggregating them with a multi-layer perceptron (MLP) would lead to a substantial loss of spatial information. When dealing with sparse data, some studies have employed graph structures to aggregate node information and generate enhanced feature representations [8]. By modeling the intricate high-order feature interactions among sparse data, these methods have achieved commendable performance. We therefore conceptualize the filtered features as a graph structure and employ graph convolutional networks to perform efficient graph feature extraction on the discriminative feature map, thereby exploring the latent connections among discriminative feature units. With the feature selector, we can capture granular facial details and thus mitigate the intrinsic inter-class similarities and intra-class variations of facial expressions. The primary contributions of this paper can be summarized as follows:

(1) We propose a self-supervised facial expression recognition algorithm named MFS. During the pre-training phase, with the assistance of the multi-level feature combiner, the backbone network can learn multi-level facial feature representations without the need for expensive labeling.

(2) We design the feature selector, which, through carefully crafted fine-grained feature selection and feature aggregation strategies, helps the network learn superior decision boundaries and addresses the inherent ambiguities of facial expressions.

(3) We evaluate the proposed MFS across multiple datasets. Experimental results indicate that MFS significantly outperforms other self-supervised methods and closely approaches the results of state-of-the-art supervised techniques.

The structure of this paper is organized as follows: Sect. 2 provides an overview of related work on facial expression recognition; Sect. 3 details the implementation of MFS; Sect. 4 presents the experimental results and ablation studies of MFS on the RAF-DB, AffectNet, and FER+ datasets; Sect. 5 discusses various attempts and explorations undertaken during the research; finally, Sect. 6 summarizes the primary contributions of this study.

2 Related work

From the early methods based on handcrafted features [9, 10] to those based on end-to-end learning [11,12,13,14], facial expression recognition has always garnered significant attention. Notably, the majority of research aimed at improving facial expression recognition still focuses on extracting distinctive facial expression features using advanced computational models under supervised settings [15,16,17,18,19]. While these studies have achieved commendable accuracy, they heavily rely on labeled training data. Consequently, these methods might suffer substantial performance degradation when faced with low-quality and noisy labels. To address this issue, some researchers have proposed methodologies to reduce the network’s reliance on fine-grained labels. For instance, Li et al. [20, 21] reclassified the seven basic facial expressions into four coarse-grained classes and employed the coarse labels to assist fine-grained supervised learning and contrastive learning, thereby mitigating the performance degradation caused by the similarity of facial expressions. The authors of [22] designed a training paradigm that employs contrastive learning for self-supervised facial expression recognition on multi-view images. Although effective, the scalability of this method is constrained by its heavy dependence on specific datasets. Subsequently, Shu et al. [23] applied a contrastive self-supervised learning approach to static single-view facial images, effectively enhancing the performance of self-supervised learning in facial expression recognition tasks. Recently, many studies have started to adopt masked image modeling as a self-supervised framework to learn effective facial representations [24,25,26]. Ma et al. [24] utilized a Masked Autoencoder pre-trained on large-scale facial images and achieved state-of-the-art performance in facial action unit analysis tasks.

In addition, to address performance degradation stemming from pose variations, facial occlusions, inherent intra-class variability, and inter-class similarity in facial expressions, some studies have suggested employing fine-grained features for facial recognition [27, 28]. These methods can be broadly categorized into those based on facial landmarks [3,4,5, 29] and those leveraging attention mechanisms [1, 30, 31]. Zheng et al. [3] utilized a pre-trained facial landmark detector to locate facial landmarks during data preprocessing and then fed the salient regions containing these landmarks as prior knowledge into the feature extractor, guiding the feature extraction process. Shi et al. [5] introduced a multi-pose block-occlusion face recognition method grounded on feature point localization. This method segments the face based on facial landmarks and occlusion regions, thereby mitigating the influence of pose variations and occlusion on face recognition performance. In [30], an end-to-end network architecture for facial expression recognition based on attention mechanisms was proposed, which focuses attention on the face while ignoring background noise. The authors of [31] proposed an encoder–decoder attention operation that focuses on the regions of muscle movement beneath the facial skin, such as the mouth, eyes, and nose, allowing the network to better extract deep facial expression features. Similarly, Wang et al. [1] introduced a region-based attention network architecture that, by capturing local facial features, is robust against facial occlusions and pose variations.

Our MFS is a self-supervised training approach that does not require expensive labels. Compared to the Masked Autoencoder with a vanilla vision transformer as its backbone, this method can integrate multi-level features during pre-training, achieving superior facial representation learning. Additionally, we have designed an unsupervised feature selection strategy that adaptively chooses highly discriminative fine-grained facial features during fine-tuning while disregarding non-salient regions. This differs from previous methods based on facial landmarks or complex attention mechanisms.

3 Methodology

The overall framework of MFS is illustrated in Fig. 1. The entire training process is divided into two stages: the pre-training stage (a) and the fine-tuning stage (b). The detailed structure of the multi-level feature combiner is depicted on the right side (c).

Fig. 1 Overall framework of MFS

In the initial phase, we employed a vision transformer backbone augmented with a multi-level feature combiner for self-supervised pre-training, enabling the network to acquire multi-level facial representations. In the second phase, we inherited the encoder weights from the first phase (without freezing) and fine-tuned the entire network’s parameters based on the fine-grained features extracted by the feature selector, resulting in the final model. Detailed training specifics for each phase are elucidated in the following sections.

3.1 Multi-level facial feature learning

This paper employs the asymmetric encoder–decoder structure of the Masked Autoencoder as the primary framework for self-supervised learning. For a given facial image \(I^{C\times H\times W}\), we divide it into n patches. Among these, \(\left(n-k\right)\) patches are masked, while the remaining \(k\) visible patches are fed into the encoder to encode latent features. Subsequently, these latent features are passed into a lightweight decoder to reconstruct the masked pixels.
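As a concrete illustration of this masking step, the following PyTorch-style sketch shows one common way to sample the \(k\) visible patches and record which positions were masked; the function name, mask ratio, and tensor layout are illustrative assumptions rather than the paper's exact implementation.

```python
import torch

def random_masking(patches, mask_ratio=0.75):
    """Keep a random subset of patches for the encoder.

    patches: (B, n, D) sequence of embedded patches.
    Returns the k visible patches, a binary mask (1 = masked), and the
    indices needed to restore the original patch order for reconstruction.
    """
    B, n, D = patches.shape
    k = int(n * (1 - mask_ratio))                       # number of visible patches

    noise = torch.rand(B, n, device=patches.device)     # random score per patch
    ids_shuffle = torch.argsort(noise, dim=1)           # low score = keep
    ids_restore = torch.argsort(ids_shuffle, dim=1)

    ids_keep = ids_shuffle[:, :k]
    visible = torch.gather(patches, 1, ids_keep.unsqueeze(-1).expand(-1, -1, D))

    mask = torch.ones(B, n, device=patches.device)      # 0 = visible, 1 = masked
    mask[:, :k] = 0
    mask = torch.gather(mask, 1, ids_restore)
    return visible, mask, ids_restore
```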

As the backbone architecture for the encoder, the Vision Transformer excels at modeling global information with its core multi-head self-attention mechanism. However, due to the low-pass filtering nature of multi-head attention [32], Vision Transformers may lack emphasis on high-frequency features. Considering that the masked facial images exhibit noticeable sparsity at the semantic level, high-frequency texture features hold significant value for pixel-level facial reconstruction tasks.

The structure of the multi-level feature combiner is depicted on the right side of Fig. 1. When the facial image \(I^{C\times H\times W}\) is fed into the vision transformer encoder, we obtain an intermediate feature set \(\boldsymbol{F}=\{\boldsymbol{f}_{0},\boldsymbol{f}_{1},\boldsymbol{f}_{2},\dots,\boldsymbol{f}_{i}\}\), where \(\boldsymbol{f}_{i}\) denotes the output feature of the i-th transformer block. The multi-level feature combiner selects features from different levels within \(\boldsymbol{F}\) and projects them through affine layers to align them with the deepest layer’s feature in the feature space. Finally, we obtain the integrated multi-level concatenated feature \(\boldsymbol{F}_{\mathrm{m}}\):

$$\boldsymbol{F}_{\mathrm{m}}=\operatorname{stack}\left(\operatorname{affine}\left(\boldsymbol{f}_{i}\right),\,\boldsymbol{f}_{n-1}\right)$$
(1)

where n represents the number of transformer blocks in the encoder. Subsequently, we compute the fused feature \(\widehat{\boldsymbol{F}}\) by taking a weighted average of \(\boldsymbol{F}_{\mathrm{m}}\) using a weighted averaging layer:

$$\widehat{\boldsymbol{F}}=\sum_{i}\boldsymbol{F}_{\mathrm{m},i}\cdot w_{i}$$
(2)

where \({w}_{i}\) represents the weight of the i-th level feature in the fused feature. We initialize all \({w}_{i}\) values to be the same and dynamically optimize them during the subsequent training process.
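The sketch below illustrates one way Eqs. (1) and (2) could be realized in PyTorch: per-level affine projections, stacking with the deepest feature, and a learnable weighted average initialized uniformly. The module name, the uniform initial weight value, and the use of `nn.Linear` for the affine layers are assumptions for illustration.

```python
import torch
import torch.nn as nn

class MultiLevelFeatureCombiner(nn.Module):
    """Hypothetical sketch of Eqs. (1)-(2): project selected intermediate
    features, stack them with the deepest feature, and fuse them with a
    learnable weighted average."""

    def __init__(self, dim, tap_indices=(0, 2, 4, 6)):
        super().__init__()
        self.tap_indices = tap_indices
        # one affine (linear) projection per tapped block, aligning each
        # intermediate feature with the deepest layer's feature space
        self.affines = nn.ModuleList([nn.Linear(dim, dim) for _ in tap_indices])
        # fusion weights w_i, initialised to the same value and optimised
        # jointly with the rest of the network
        num_levels = len(tap_indices) + 1
        self.weights = nn.Parameter(torch.full((num_levels,), 1.0 / num_levels))

    def forward(self, features):
        # features: list of per-block outputs f_0 .. f_{n-1}, each (B, L, D)
        projected = [aff(features[i]) for aff, i in zip(self.affines, self.tap_indices)]
        F_m = torch.stack(projected + [features[-1]], dim=0)   # Eq. (1): (levels, B, L, D)
        w = self.weights.view(-1, 1, 1, 1)
        return (F_m * w).sum(dim=0)                            # Eq. (2): fused feature
```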

Finally, the fused feature \(\widehat{F}\) is fed into the decoder to reconstruct the masked pixels, resulting in the reconstructed image \({I}_{{\text{r}}}\). The reconstruction loss \({L}_{{\text{reconstruction}}}\) is defined as the pixel-wise mean squared error loss between the reconstructed image and the input image:

$$L_{\text{reconstruction}}=\frac{1}{N}\sum_{i=1}^{N}\left(\text{pixel}_{r,i}-\text{pixel}_{\text{input},i}\right)^{2}$$
(3)

where N represents the total number of pixels in a single image, and \({{\text{pixel}}}_{x,i}\) represents the i-th pixel value of image x.
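For completeness, Eq. (3) amounts to a standard pixel-wise MSE; a minimal sketch, assuming image tensors of shape (B, C, H, W), is:

```python
import torch.nn.functional as F

def reconstruction_loss(reconstructed, original):
    # Pixel-wise mean squared error of Eq. (3); averaging only over the
    # masked patches (as in MAE) would be a straightforward variant.
    return F.mse_loss(reconstructed, original)
```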

It is important to note that in the multi-level feature combiner, simply fusing features from all levels may lead to information redundancy and introduce noise, resulting in a degradation of the model’s performance. In this paper, for facial expression recognition, we have chosen to select the output features from \({{\text{Block}}}_{0}\), \({{\text{Block}}}_{2}\), \({{\text{Block}}}_{4}\), and \({{\text{Block}}}_{6}\) and fuse them with the deepest layer features. Detailed experiments regarding this choice will be further discussed in Sect. 5.

3.2 Fine-grained feature fine-tuning

After splitting facial images into n patches, we can observe that not all regions exhibit significant discriminative information (Fig. 2). Regions that solely contain hair and clothing or are nearly monochromatic are commonly found in facial images across different expression categories. If these areas were used for recognition, their predicted probabilities would likely exhibit a relatively flat distribution. Conversely, selecting regions that contain facial landmarks for recognition yields more discriminative prediction probabilities.

Fig. 2 Discriminative/non-discriminative regions in a facial image

Based on the previous analysis, to make more effective use of the multi-level facial feature representations learned during the pre-training phase, we introduced a feature selector during the fine-tuning stage. The purpose of this design is to adaptively filter out background noise and focus on critical fine-grained facial features, thereby achieving more precise parameter optimization.

The feature selector treats each token input to the network as an independent feature unit and performs feature filtering based on the discriminative capacity of each unit. We use the features extracted by the vision transformer encoder, denoted \(\boldsymbol{f}_{i}\in \mathbb{R}^{L\times D}\), as the input to the feature selector, where L is the length of the feature sequence and D is the output dimension of the transformer block. We first project the input features into a C-dimensional space (where C is the total number of predicted categories) and then apply the Softmax function to calculate the category prediction scores for each feature unit:

$$\mathbf{logits}_{i}=\operatorname{Projection}\left(D,C\right)\left(\boldsymbol{f}_{i}\right),\qquad \boldsymbol{s}_{i}=\operatorname{Softmax}\left(\mathbf{logits}_{i}\right)$$
(4)

Among all the feature units, we select the top s units with the highest confidence as discriminative features, while the remaining L-s units are considered non-discriminative. Since the selected discriminative features exhibit notable local and sparse characteristics, we treat them as a discrete feature map. In the feature fusion stage, to preserve their original spatial scale and structural integrity, we employ a graph convolutional network (GCN) to process the discriminative feature map and further explore potential relationships between different features:

$$\boldsymbol{f}_{i}^{\prime}=\sigma\left(A_{\text{adaptive}}W\boldsymbol{f}_{i}\right),\qquad \boldsymbol{f}_{i}\in\mathbb{R}^{C\times s}$$
(5)

Leveraging a graph convolutional network allows us to learn the influence between different feature units and incorporate this influence into the final output features, enabling effective graph feature extraction. An aggregator then consolidates the feature map, and the fused features are fed into a classifier to obtain the facial expression recognition result. The GCN effectively integrates the relationships between multiple facial feature units without disrupting the original feature structure, enabling the model to learn more precise decision boundaries.
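A sketch of how the feature selector described above could be assembled is shown below, covering the per-unit projection and softmax of Eq. (4), top-s selection, and a single graph-convolution step in the spirit of Eq. (5). Because the paper does not specify how the adaptive adjacency \(A_{\text{adaptive}}\) or the aggregator are built, the learnable adjacency, the ReLU nonlinearity, and the mean-pooling aggregator here are assumptions.

```python
import torch
import torch.nn as nn

class FeatureSelector(nn.Module):
    """Hypothetical sketch: select the top-s most confident feature units
    (Eq. 4) and aggregate them with one graph-convolution step (Eq. 5)."""

    def __init__(self, dim, num_classes, s=128):
        super().__init__()
        self.s = s
        self.projection = nn.Linear(dim, num_classes)       # Projection(D, C)
        self.gcn_weight = nn.Linear(num_classes, num_classes, bias=False)  # W in Eq. 5
        # learnable adjacency over the s selected units (assumed form of A_adaptive)
        self.adjacency = nn.Parameter(torch.eye(s) + 0.01 * torch.randn(s, s))
        self.classifier = nn.Linear(num_classes, num_classes)

    def forward(self, f):                       # f: (B, L, D) encoder tokens
        logits = self.projection(f)             # (B, L, C) per-unit logits
        scores = logits.softmax(dim=-1).amax(dim=-1)     # confidence per unit
        top_idx = scores.topk(self.s, dim=1).indices      # top-s discriminative units
        idx = top_idx.unsqueeze(-1).expand(-1, -1, logits.size(-1))
        selected = torch.gather(logits, 1, idx)           # (B, s, C)

        # graph convolution over the selected units: sigma(A W f)
        graph_feat = torch.relu(self.adjacency @ self.gcn_weight(selected))
        pooled = graph_feat.mean(dim=1)                    # assumed aggregator
        return self.classifier(pooled), logits, top_idx
```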

Based on the earlier analysis, we aim to achieve more precise recognition results by relying on fine-grained discriminative features. Therefore, we employ cross-entropy to compute the classification loss for the logits corresponding to discriminative features:

$$L_{\text{select}}=L_{\text{CE}}\left(\mathbf{target}_{i},\mathbf{logits}_{i}\right)$$
(6)

Meanwhile, for the features corresponding to non-discriminative regions, we consider them as "background" information. As these features contribute relatively little to classification, we anticipate their prediction probabilities to exhibit a relatively flat distribution. Therefore, we define the flatten loss as follows:

$$L_{\text{flatten}}=\frac{1}{N}\sum_{i=1}^{N}\left(\tanh\left(\mathbf{logits}_{i}\right)+1\right)^{2}$$
(7)

Through the backpropagation of \({L}_{{\text{flatten}}}\), we aim to drive the logits’ values toward negative infinity, thereby obtaining a flat prediction probability distribution.

Based on the above, the overall loss, denoted as L, can be expressed as:

$$L=\lambda_{s}L_{\text{select}}+\lambda_{f}L_{\text{flatten}}$$
(8)

where \({\lambda }_{s}\) and \({\lambda }_{f}\) are the weighting parameters for \({L}_{{\text{select}}}\) and \({L}_{{\text{flatten}}}\), respectively. In our experiments, we set \({\lambda }_{s}= {\lambda }_{f}= 1\).
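Putting Eqs. (6)-(8) together, the fine-tuning objective could be computed as in the sketch below. The paper leaves open whether the cross-entropy is applied per selected unit or on the aggregated classifier output; the version here uses the aggregated output, and the tensor names follow the hypothetical feature-selector sketch above.

```python
import torch
import torch.nn.functional as F

def mfs_finetune_loss(class_logits, unit_logits, top_idx, target,
                      lambda_s=1.0, lambda_f=1.0):
    """L = lambda_s * L_select + lambda_f * L_flatten (Eqs. 6-8).

    class_logits: (B, C) classifier output over the selected units.
    unit_logits:  (B, L, C) per-unit logits from the projection layer.
    top_idx:      (B, s) indices of the discriminative units.
    target:       (B,) ground-truth expression labels.
    """
    # cross-entropy on the aggregated discriminative prediction (Eq. 6)
    l_select = F.cross_entropy(class_logits, target)

    # flatten loss on the remaining, non-discriminative units (Eq. 7):
    # push their logits toward negative values so tanh(logits) -> -1
    B, L, C = unit_logits.shape
    mask = torch.ones(B, L, device=unit_logits.device)
    mask.scatter_(1, top_idx, 0.0)           # 0 = discriminative, 1 = background
    background = unit_logits[mask.bool()]    # (B*(L-s), C)
    l_flatten = ((torch.tanh(background) + 1) ** 2).mean()

    return lambda_s * l_select + lambda_f * l_flatten
```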

The objective of the feature selector is to enable the model to automatically learn highly discriminative regions within images without relying on pre-extracted facial landmarks or other fine-grained semantic information. Furthermore, it seeks to achieve finer and more accurate recognition by mining the latent relationships between discriminative features. By extracting fine-grained features from facial images, we can more effectively distinguish facial expressions that are easily confused.

4 Experiment

4.1 Experiment settings

Datasets

RAF-DB [11] is one of the most renowned benchmark datasets in facial expression recognition. It comprises 29,672 facial images meticulously annotated by 40 trained annotators. We only utilized the 15,339 images labeled with the six basic emotions and the neutral expression. Of this subset, 12,271 images were allocated for training, while the remaining 3068 were designated for testing.

AffectNet [33] stands as the largest facial expression recognition dataset to date, offering annotations for both classification and emotional valence-arousal dimensions. This dataset was assembled by querying facial expression-related keywords in three search engines, resulting in a collection of over one million images, with manual annotations for 450,000 images. It encompasses eight emotion categories, including seven primary facial expressions and the additional category of contempt.

FER+ [34], derived from the FER2013 dataset, features 28,709 training samples, 3589 validation samples, and 3589 testing samples. All images are in grayscale format, were collected via the Google search engine, and have a uniform resolution of 48 × 48 pixels. Each image was independently annotated by ten different annotators. Like AffectNet, the facial images in FER+ are annotated for eight different expressions.

Implementation details

The facial images used for training were resized to 224 × 224 pixels. During the pre-training phase, we trained for 400 epochs with a batch size of 16. Subsequently, we fine-tuned for 50 epochs with a batch size of 32. Training used the AdamW optimizer with an initial learning rate of 1e-3. The initial 10% of epochs were designated as the warm-up stage, after which we employed a cosine annealing learning rate scheduler. The proposed method was implemented using the PyTorch framework.
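A minimal sketch of the fine-tuning schedule described above (AdamW, a warm-up over the first 10% of epochs, then cosine annealing) is given below; the helper name and per-epoch granularity are assumptions, since the paper does not state whether the schedule is stepped per epoch or per iteration.

```python
import math
import torch

def build_optimizer_and_schedule(model, epochs=50, base_lr=1e-3, warmup_frac=0.1):
    """AdamW with linear warm-up over the first 10% of epochs, followed by
    cosine annealing, matching the settings reported in Sect. 4.1."""
    optimizer = torch.optim.AdamW(model.parameters(), lr=base_lr)
    warmup_epochs = max(1, int(epochs * warmup_frac))

    def lr_lambda(epoch):
        if epoch < warmup_epochs:
            return (epoch + 1) / warmup_epochs               # linear warm-up
        progress = (epoch - warmup_epochs) / max(1, epochs - warmup_epochs)
        return 0.5 * (1 + math.cos(math.pi * progress))      # cosine decay

    scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
    # usage: call scheduler.step() once per epoch after optimizer updates
    return optimizer, scheduler
```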

4.2 Experimental results

Comparison with self-supervised learning methods

In this study, we conducted a systematic evaluation of the performance differences between the proposed MFS method and other self-supervised learning methods. We first carried out pre-training on the AffectNet dataset and further fine-tuned the model on multiple diverse datasets (with random sampling) to obtain evaluation results. Additionally, we assessed the effectiveness of MFS’s pre-training on the RAF-DB dataset. It is worth noting that, due to the limited information contained in the 48 × 48-pixel grayscale images of the FER+ dataset, we did not perform pre-training evaluation on the FER+ dataset.

According to Table 1, our MFS not only excels in fine-tuning but also demonstrates outstanding performance in cross-dataset transfer learning tasks. MFS achieved the highest accuracy rates of 63.49%, 60.75%, 91.45%, and 90.16% on four different datasets, significantly surpassing other self-supervised learning methods. These results further confirm the ability of MFS to learn facial representations with stronger robustness and broader generalization capabilities.

Table 1 Experimental result of self-supervised learning methods with random sampling

Comparison with state-of-the-art supervised learning methods

In this section, we further explored the performance differences between MFS and state-of-the-art supervised learning methods. Specifically, we selected a backbone pre-trained on AffectNet and fine-tuned it under class-balanced sampling or random sampling, with the choice of sampling strategy depending on the sample distribution of each dataset. As observed in Table 2, MFS demonstrates competitive performance. On AffectNet7, AffectNet8, RAF-DB, and FER+, MFS exhibits performance differences of only 0.98%, 0.89%, 0.76%, and 0.70%, respectively, compared to state-of-the-art methods. Furthermore, MFS can rapidly adapt the pre-trained backbone to other data domains with minimal computational overhead, a feat that traditional supervised learning methods struggle to achieve.

Table 2 Comparison with supervised learning methods. Experiments on AffectNet used class-balanced sampler

4.3 Ablation study

To investigate the impact of different components of the MFS on the final results, we conducted extensive ablation experiments on the RAF-DB dataset, where MFC denotes multi-level feature combiner, and FS denotes feature selector (Table 3).

Table 3 Ablation on each component of MFS

When MFC and FS were not utilized, the model achieved an accuracy of 86.27%. When MFC and FS were used individually, the accuracy increased by 1.83% and 0.89%, respectively. However, when MFC and FS were combined, the accuracy improved to 89.41%. This improvement is not simply the mechanical summation of the two components but rather a result of their synergistic enhancement. By integrating these two techniques, the network can perform fine selection among a richer set of facial representations, leading to a significant performance gain.

Furthermore, we explored the number of discriminative features selected (denoted as s in Eq. 5). The results indicated that the Vision Transformer backbone with a patch size of 16 performed best when selecting 128 features. We speculate that this is related to the characteristics of facial images, where approximately 60% of the information in aligned facial images is crucial. This finding may provide valuable insights for future research.

We also attempted to visualize the regions the model focused on using Grad-CAM [46] to further analyze the effectiveness of the feature selector.

Figure 3 shows the model without the feature selector (b) and with the feature selector (c). Compared to (b), (c) pinpoints more discriminative and fine-grained facial features within the facial image. This further demonstrates that the feature selector significantly enhances the network’s ability to extract fine-grained features.

Fig. 3 Attention visualization via Grad-CAM: (a) input, (b) without the feature selector, (c) with the feature selector

In addition, we investigated the model’s confusion matrices without and with the feature selector (Fig. 4).

Fig. 4 Confusion matrices of the model without (a) and with (b) the feature selector

As Fig. 4 shows, noticeable inter-class and intra-class confusions were observed within the Fear and Disgust categories. Upon integrating the feature selector, accuracy was substantially improved by 23% and 7% for Fear and Disgust, respectively. Concurrently, the confusions between Fear, Disgust, and other categories notably decreased. These outcomes indicate that the feature selector aids the network in capturing subtler distinctions among similar facial expressions, thereby mitigating performance deterioration resulting from inter-class similarities and intra-class variations to a certain extent.

5 Discussion

In this section, we will delve into various attempts and explorations made during the research process, laying the foundation for future studies.

5.1 Transformer blocks used in multi-level feature combiner

Regarding facial representation learning, we experimented with fusing multiple hierarchical features. Based on the results (Table 4), we adopted features from five different levels as the final fusion scheme.

Table 4 Experimental results of different fusion solutions

Furthermore, we examined the weights assigned to each layer in Fig. 5. We observed that the features from the deepest layer consistently held the highest weight. In most cases, the weight of \({\text{Block}}_{0}\) closely approximated that of the deepest-level features. For shallower features, their weights decreased as the layers became deeper. When seven layers of features were fused, the features from \({\text{Block}}_{10}\) displayed negative weights. These findings suggest that: (1) shallow features contribute significantly to the final output, particularly the features from \({\text{Block}}_{0}\); (2) if too many layers are involved in the fusion, redundant information may be introduced into the network.

Fig. 5 Weights of features from different transformer blocks

5.2 Mask strategy

Students who only practice simple problems will find it challenging to handle complex exams. To enable the model to learn a better facial feature representation, we attempted to guide mask generation during pre-training using the attention maps of a well-trained model, which sets a higher challenge for the masked image modeling task.

Consider the input features \(\boldsymbol{f}\in \mathbb{R}^{B\times L\times D}\) of \({\text{Block}}_{11}\). During the self-attention computation, \(\boldsymbol{f}\) passes through three independent linear mapping layers, generating the \(\boldsymbol{Q}\), \(\boldsymbol{K}\), and \(\boldsymbol{V}\) matrices. The scaled dot-product attention then yields the per-head attention matrix \(\boldsymbol{A}_{i}\in \mathbb{R}^{L\times L}\). We average the attention matrices over all heads to obtain the averaged attention matrix \(\widehat{\boldsymbol{A}}\):

$$\boldsymbol{A}_{i}=\operatorname{softmax}\left(\frac{\boldsymbol{Q}\boldsymbol{K}^{T}}{\sqrt{D}}\right),\qquad \widehat{\boldsymbol{A}}=\frac{1}{h}\sum_{i=1}^{h}\boldsymbol{A}_{i}$$
(9)

In the attention matrix \(\widehat{\boldsymbol{A}}\), each row corresponds to the attention distribution of a token. We extract the attention vector associated with the class token from the first row and reshape it into a square. This yields the attention map \(\mathbf{M}\in \mathbb{R}^{\sqrt{L-1}\times \sqrt{L-1}}\) associated with the class token. Based on \(\mathbf{M}\), we propose the following two masking strategies (a sketch of both is given after Fig. 6):

(1) Attention-high: prioritize masking patches with high attention scores.

(2) Attention-clue: while masking high-attention-score patches, retain a certain percentage of the patches with the highest attention scores as reference clues for subsequent reconstruction (Fig. 6).

Fig. 6 Investigation of masking strategies: (a) input, (b) attention map, (c) random, (d) attention-high, (e) attention-clue
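The following sketch shows one way the class-token attention map could drive the two strategies above; the mask ratio, clue ratio, and the exact handling of the class token are illustrative assumptions rather than the paper's implementation.

```python
import torch

def attention_guided_mask(attn, mask_ratio=0.75, clue_ratio=0.05,
                          strategy="attention-high"):
    """Build a patch mask from the averaged attention of the last block (Eq. 9).

    attn: (B, heads, L, L) attention weights including the class token.
    Returns a boolean mask over the L-1 image patches (True = masked).
    """
    cls_attn = attn.mean(dim=1)[:, 0, 1:]        # class-token row, drop itself: (B, L-1)
    B, n = cls_attn.shape
    num_mask = int(n * mask_ratio)

    order = cls_attn.argsort(dim=1, descending=True)   # highest attention first
    mask = torch.zeros(B, n, dtype=torch.bool, device=attn.device)
    batch_idx = torch.arange(B, device=attn.device).unsqueeze(1)

    if strategy == "attention-high":
        # mask the patches the model attends to most
        mask[batch_idx, order[:, :num_mask]] = True
    elif strategy == "attention-clue":
        # keep a few highest-attention patches as clues, mask the next ones
        num_clue = int(n * clue_ratio)
        mask[batch_idx, order[:, num_clue:num_clue + num_mask]] = True
    return mask
```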

Experimental results indicate that using attention maps to guide mask generation significantly increases the difficulty of the mask image modeling task (as shown in Fig. 7). The loss value for pre-training with the attention-high and attention-clue masking strategies is notably higher than that with the random strategy. Regrettably, the increased task difficulty did not lead to enhanced performance.

Fig. 7 Loss curves under various masking strategies

Based on Table 5, we can observe that models using random masking significantly outperform those following the attention-high and attention-clue strategies in terms of recognition accuracy. Upon further analysis, it seems unwise to increase task difficulty during the early stages of training. A more ideal approach might be incrementally raising the challenge once the model has accumulated a foundational knowledge base. However, pinpointing the exact moment to escalate this difficulty remains a challenge. A more promising strategy might involve adopting a learnable masking method, allowing the model to autonomously adjust the difficulty of the task. Therefore, designing a learnable masking strategy tailored explicitly for facial representation learning will be the central direction of our future research.

Table 5 Experimental results of different masking strategies

5.3 Flatten loss

We explored several strategies to make the prediction probability distribution of non-discriminative features more uniform. Specifically, we applied an activation function to the logits and gradually guided all activation values to converge to a fixed label value:

$$L_{\text{flatten}}=\operatorname{MSE}\left(\operatorname{Activation}\left(\mathbf{logits}\right),\,\mathbf{label}\right)$$
(10)

The experimental results are shown in Table 6. After comprehensive comparison and analysis, we found that among all tested combinations, \(({\text{tanh}}/-1)\) produced the best results. Based on this observation, we chose this combination as the flatten loss.

Table 6 Experimental results of different flatten loss

6 Conclusion

In this paper, we propose a self-supervised training strategy for facial expression recognition called MFS (multi-level feature selector). During the pre-training phase, we employ the multi-level feature combiner to achieve multi-level facial representation learning. Subsequently, using a meticulously designed feature selector, the network adaptively selects fine-grained features with strong discriminative power, which are then fed into a graph convolutional network for graph feature extraction and aggregation. MFS effectively addresses key challenges in facial expression recognition: the reliance on large-scale annotated data, inter-class similarity, and intra-class variation. Experimental results on multiple FER benchmarks demonstrate that the proposed MFS outperforms the supervised learning baseline and other self-supervised methods.