1 Introduction

Face recognition has entered the commercial era and is widely used in various scenarios. However, a face recognition system has many points that may be attacked.

The most common form of attack is the presentation attack, for instance printed photos and video replays, which greatly threatens the reliability and security of face recognition systems and makes face anti-spoofing (FAS) a challenging problem.

Over the past ten years, numerous FAS approaches have been put forward, which can be grouped into handcrafted methods and convolutional neural network (CNN) based methods. Although they have shown promising FAS performance, their discriminative and generalization capability still needs to be improved. First, a huge number of labeled face images is required. The performance of these methods relies heavily on supervision signals, such as binary labels, remote photoplethysmography (rPPG) [31] and depth maps [16], and the accuracy may degrade once the supervision information contains errors. Second, the convolution operation acts in a local manner and therefore cannot capture the long-range visual context that plays a crucial role in visual pattern recognition. Third, the transfer capability of the learned features is not encouraging when discriminating unknown types of presentation attacks.

In recent years, self-supervised learning (SSL) has emerged as the most prominent technology to overcome the shortcoming of supervised learning in computer vision, namely its requirement for massive labeled data. The core idea behind SSL is to learn general features via a pretext task and then transfer the learned knowledge to a specific downstream task, such as recognition, segmentation and detection. It should be pointed out that the pretext task uses a large-scale unlabeled dataset for pre-training, and another relatively small labeled dataset is then used for fine-tuning. SSL is superior to supervised learning in pre-training, and since the pretext task does not require labels, the model is free from massive and complex label information, such as depth maps and rPPG. To sum up, there are two kinds of SSL models: generative and contrastive. The pretext task of contrastive SSL methods seeks to learn image-level general semantic features [6, 12]. Inspired by BERT [8], masked image modeling (MIM), a generative self-supervised method, has been extensively studied in the past two years. With the help of the self-attention mechanism [24] in transformer models, the two generative SSL methods dubbed masked autoencoders (MAE) [11] and simple masked image modeling (SimMIM) [29] achieved outstanding performance and even surpassed supervised learning baselines on some image processing tasks. MIM learns general image features by masking random patches of the original image and reconstructing the missing pixels. It has the following four advantages: (i) the pretext task does not require image label information; (ii) it can learn general image detail features; (iii) the learned general features have excellent transfer capability; (iv) it can capture long-range global relationships between features thanks to the self-attention mechanism in the transformer encoder.

Generally speaking, the pixel details or global spatial structure of an image are changed in spoofing faces, e.g., pixel blurring in printed photos and image warping in hand-held photos. In other words, the key discrepancies between spoofing faces and genuine faces come from fine-grained image information [23] and the global correlation between features at different regions.

MIM can reconstruct image pixels faithfully even when most regions of the image are masked, which reveals that it is capable of learning image detail information and capturing image spatial structure. Accordingly, our initial motivation for this work is to learn detailed face features through MIM, which is helpful for detecting presentation attacks. Moreover, the transformer encoder network can learn the global correlation between visual features, which is an important clue for distinguishing between genuine and spoofing faces.

Based on the above analysis, and to address the aforementioned issues of existing FAS methods, this paper proposes a novel and simple method to learn general and discriminative features for FAS under the SSL framework. The overall pipeline of our method is illustrated in Fig. 1. In the pretext task stage, MIM is exploited to learn general face detail features in an unsupervised fashion under a transformer encoder-decoder architecture. Afterward, the trained encoder knowledge is utilized to initialize the encoder of our downstream FAS task. Since we consider FAS an image classification problem, the encoder is followed by a simple network with only global average pooling (GAP) and fully connected (FC) layers instead of the decoder. The main contributions of this paper are threefold:

  • To our knowledge, this work is the first attempt to exploit generative SSL for FAS. The SSL strategy enables our method to achieve better results than supervised learning methods by using a large amount of unlabeled images for pre-training, which effectively reduces the cost of labeling.

  • We explore the effectiveness of two different MIM models in learning general face detail features that have superior discriminative ability and transfer advantages.

  • We conduct extensive FAS experiments on three popular datasets. The results show that our method offers competitive performance compared with other FAS methods.

2 Related Work

2.1 Face Anti-spoofing

The majority of FAS methods are based on supervised learning. Early handcrafted feature methods, such as LBP [21], require at least binary labels as supervision. With the rise of deep learning, more types of clues have been proven discriminative for distinguishing spoofing faces. Depth maps were first introduced into the FAS task in [1]. In addition, [16] leverages depth maps and the rPPG signal as supervision, and reflection maps and binary masks are respectively introduced by [13] and [17]. In the past two years, the Vision Transformer (ViT) structure has achieved success in vision tasks, and some researchers have applied ViT to FAS. Although the new architecture further improves FAS metrics, these works still require various types of supervision. For example, ViTranZFAS [10] needs binary labels, and TransRPPG [31] needs rPPG as supervision.

Various types of supervision information seriously increase the cost of labeling, and the quality of labels also greatly affects model performance. Therefore, some studies have begun to explore FAS methods based on contrastive SSL [15, 20]. These works not only get rid of the constraint of labels but also achieve better performance than supervised learning. Unlike these methods, this paper adopts a generative SSL method.

2.2 Masked Image Modeling

Masked image modeling is a generative self-supervised method. The work in [25] proposes denoising autoencoders (DAE), which corrupt the input signal and learn to reconstruct the original input. Further, the work of [26] treats masking as a type of noise in DAE: some values in the input data are randomly set to zero with a certain probability, and the encoder is trained to reconstruct them.

The DAE idea first achieved great success in the field of NLP, where Transformer [24] and BERT [8] are the most representative architectures. Specifically, the self-attention mechanism is proposed in Transformer to capture the relationships between different tokens. Further, a special [MASK] token is introduced in BERT: it replaces some tokens during the training phase, and the network then predicts the original words at these positions. After masked modeling achieved such success in NLP, a natural question is how to apply it to computer vision tasks.

Some pioneering works in recent years have explored the potential of MIM. iGPT [5] reshapes raw images into a 1D sequence of pixels and predicts unknown pixels. BEiT [2] proposes a pre-training task called MIM and is the first to define it: the image is represented as discrete tokens, and these tokens are treated as the reconstruction targets of the masked patches. Most recently, MAE [11] and SimMIM [29] almost simultaneously obtained state-of-the-art results on computer vision tasks. They propose a pre-training paradigm based on MIM: image patches are randomly masked with a high probability (usually greater than 50%), the self-attention mechanism in the encoder learns the relationships between patches, and finally the masked patches are reconstructed in the decoder.

3 Methodology

3.1 Intuition and Motivation

Spoofing faces are very similar in appearance to genuine faces. Their main differences lie in the image pixel details (blur and color) and the overall image structure (deformation and specular reflection). Learning discriminative cues from numerous labeled samples via a CNN is a common way, but it is hard to learn general features, so the generalization ability needs to be improved, and producing labeled samples is expensive. Therefore, how to learn general discriminative features that distinguish spoofing faces from genuine ones with a small amount of labeled faces is the main challenge of FAS.

Fig. 1. Overall architecture of our proposed face anti-spoofing with masked image modeling.

3.2 The Proposed Method

Pretext Task Stage. SSL has been recognized as an effective way to remedy the appetite of supervised learning for large amounts of labeled data. Given the strong power of MIM in reconstructing image pixels, we argue that it can capture face detail visual features and the image structure via position embeddings. Moreover, the global features of a face image can be characterized by the self-attention in the transformer. Consequently, general discriminative face visual cues with good transfer ability can be learned by MIM in an unsupervised manner.

In this paper, we mainly consider two newly proposed MIM methods: MAE [11] and SimMIM [29]. ViT [9] and Swin Transformer [18] are adopted as the encoder backbones of MAE and SimMIM, respectively. Meanwhile, the experiments of both MAE and SimMIM show that random masking is more effective, so this paper also adopts random masking. Concretely, we first divide a face image into several non-overlapping patches and randomly mask a large portion of the patches according to the mask ratio. For MAE, an encoder network with multiple transformer blocks learns latent representations from the remaining unmasked patches. For SimMIM, both unmasked patches and mask tokens are fed into the encoder. All the tokens, composed of encoded visible patches and mask tokens, are fed into a lightweight decoder that regresses the raw pixel values of the masked area under a mean squared error or \(l_1\) loss.
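To make the pretext pipeline concrete, the following is a minimal PyTorch sketch of the masking-and-reconstruction step, assuming MAE-style behavior (the encoder sees only visible patches and the loss is computed on masked patches). The `encoder` and `decoder` modules and their signatures are hypothetical placeholders, and the patch handling is simplified compared with the actual MAE/SimMIM implementations.

```python
import torch

def mim_pretext_step(images, encoder, decoder, patch_size=16, mask_ratio=0.75):
    """One MAE-style pretext step: mask random patches and regress their raw pixels.

    `encoder` and `decoder` stand in for transformer modules operating on patch
    tokens; this sketch only illustrates the data flow.
    """
    B, C, H, W = images.shape
    # Split the image into non-overlapping patches: (B, N, C * patch_size * patch_size)
    patches = images.unfold(2, patch_size, patch_size).unfold(3, patch_size, patch_size)
    patches = patches.permute(0, 2, 3, 1, 4, 5).reshape(B, -1, C * patch_size * patch_size)
    N = patches.shape[1]

    # Randomly choose which patches to mask according to the mask ratio.
    num_masked = int(mask_ratio * N)
    noise = torch.rand(B, N, device=images.device)
    ids_shuffle = noise.argsort(dim=1)
    masked_ids, visible_ids = ids_shuffle[:, :num_masked], ids_shuffle[:, num_masked:]

    # MAE variant: the encoder only sees visible patches; mask tokens enter the decoder.
    visible = torch.gather(patches, 1, visible_ids.unsqueeze(-1).expand(-1, -1, patches.shape[-1]))
    latent = encoder(visible, visible_ids)            # latent tokens of visible patches
    pred = decoder(latent, masked_ids, visible_ids)   # predicted pixels of masked patches

    # Reconstruction loss on the masked patches only (MSE, as in MAE).
    target = torch.gather(patches, 1, masked_ids.unsqueeze(-1).expand(-1, -1, patches.shape[-1]))
    return torch.nn.functional.mse_loss(pred, target)
```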

Downstream Task Stage. Having obtained the knowledge from the trained pretext task, we directly apply the encoder to our downstream FAS task and discard the decoder. For the purpose of recognition, a binary classification network with GAP and FC layers is added after the encoder, and the cross-entropy loss is employed in this stage. We choose fine-tuning rather than linear probing for the supervised training used to evaluate the face feature representations.
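As a complement, a minimal sketch of this downstream head (transferred encoder followed by GAP and an FC layer) could look as follows; `pretrained_encoder`, the embedding size and the token-sequence output shape are assumptions for illustration, not details from the paper.

```python
import torch.nn as nn

class FASClassifier(nn.Module):
    """Sketch of the downstream head: transferred encoder -> GAP -> FC.

    `pretrained_encoder` is a placeholder for the encoder transferred from the
    pretext task; it is assumed to return patch tokens of shape (B, N, D).
    """
    def __init__(self, pretrained_encoder, embed_dim=768, num_classes=2):
        super().__init__()
        self.encoder = pretrained_encoder
        self.fc = nn.Linear(embed_dim, num_classes)

    def forward(self, x):
        tokens = self.encoder(x)        # (B, N, D) patch tokens
        pooled = tokens.mean(dim=1)     # global average pooling over tokens
        return self.fc(pooled)          # genuine-vs-spoof logits

# Fine-tuning then minimizes the standard cross-entropy loss:
# loss = nn.CrossEntropyLoss()(model(images), labels)
```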

4 Experiments

4.1 Datasets and Evaluation Metrics

To evaluate the effectiveness of our method, extensive experiments are carried out on three representative datasets. OULU-NPU [3] contains 4950 high-resolution videos from 55 individuals. CASIA-FASD [36] comprises 600 videos from 50 subjects under three types of attacks. Replay-Attack [7] has 1200 videos from 50 persons, with 24 videos per person, under three kinds of attacks. Three widely used metrics from [32] are adopted: the attack presentation classification error rate, APCER = FP/(TN+FP); the bona fide presentation classification error rate, BPCER = FN/(TP+FN); and the average classification error rate, ACER = (APCER+BPCER)/2. The equal error rate (EER) is also reported. Lower scores signify better performance.
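For reference, the three error-rate metrics above can be computed from binary confusion counts as in the small helper below (a straightforward transcription of the formulas, not code from the paper).

```python
def fas_metrics(tp, fp, tn, fn):
    """APCER, BPCER and ACER from confusion counts, where the positive class is
    the genuine (bona fide) face and attacks are the negative class, matching
    the formulas given above."""
    apcer = fp / (tn + fp)        # attack presentation classification error rate
    bpcer = fn / (tp + fn)        # bona fide presentation classification error rate
    acer = (apcer + bpcer) / 2    # average classification error rate
    return apcer, bpcer, acer
```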

4.2 Implementation Details

Our method is implemented in PyTorch on an Ubuntu system with an NVIDIA Tesla V100 GPU and 32 GB of graphics memory. The input images of the pretext task and the downstream task are of size 224 \(\times \) 224, and each image is divided into regular non-overlapping patches of size 16 \(\times \) 16. It should be pointed out that we did not use any additional datasets such as ImageNet. The numbers of epochs for the pretext task and fine-tuning of our MAE (SimMIM) models are 1600 (1000) and 100 (100), respectively. The fine-tuning of the downstream classification task is performed on each dataset or its protocol. Following [35], frame-level images are used in this paper instead of entire videos. For simplicity, the first 20 frames of each spoofing video from the training set are selected. To alleviate the data imbalance problem, we select more frames from each genuine video so that the ratio between positive and negative samples is 1:1. In the testing phase, 20 trials on each video from the test set are conducted and the average results are reported; in the i-th trial, the i-th frame of each test video is used.
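A rough sketch of this frame-sampling scheme is given below; the exact balancing rule for genuine videos is not specified in detail, so distributing the same total number of frames evenly over the genuine videos is an assumption made for illustration.

```python
def sample_training_frames(genuine_videos, spoof_videos, spoof_frames_per_video=20):
    """Frame-level sampling sketch: take the first 20 frames of each spoofing
    video and enough frames per genuine video to reach a 1:1 positive/negative
    ratio. Each video is assumed to be a list of frame identifiers or images."""
    spoof = [f for v in spoof_videos for f in v[:spoof_frames_per_video]]
    # Assumption: spread the same total number of frames evenly over the genuine videos.
    per_genuine = max(1, len(spoof) // max(1, len(genuine_videos)))
    genuine = [f for v in genuine_videos for f in v[:per_genuine]]
    return genuine, spoof
```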

4.3 Experimental Results and Analysis

Effect of Mask Ratio. The mask ratio of MIM is an important factor that has an obvious effect on the performance of visual recognition. To assess the impact of the mask ratio on the FAS task, three mask ratios \(\{0.50, 0.60, 0.75\}\) are evaluated for both MAE and SimMIM. Several experiments are carried out on the four protocols of OULU-NPU. The results are shown in Fig. 2(a).

For MAE, the ACER score is roughly negatively correlated with the mask ratio. For SimMIM, the performance of the three mask ratios in protocol 2 and protocol 3 is very similar, while the 0.75 mask ratio performs significantly better than the other ratios in protocol 1 and protocol 4. These experimental results show that the mask ratio and the choice of MIM model have a great impact on FAS.

Fig. 2. (a) ACER (%) versus mask ratio under MAE and SimMIM on the four protocols of the OULU-NPU dataset. (b) Feature distribution visualization for all 1080 testing videos from OULU-NPU protocol 2 via t-SNE.

Transfer Ability of Pretext Task. Given superior performance on a single dataset, a natural question is how well the MIM pretext task transfers. To answer this, we conducted six experiments. We first train the MIM pretext task on the training sets of OULU-NPU, Replay-Attack and CASIA-FASD. After knowledge transfer, the downstream tasks are fine-tuned on the training sets of CASIA-FASD and Replay-Attack. All the ACER scores are enumerated in Table 1, from which we can draw the following observations: (1) Even when the pretext task is trained on a different dataset, the downstream task still performs well, which reveals that the generalization ability of the MIM pretext task is excellent. (2) On Replay-Attack, the ACER scores of all three cases are 0. (3) CASIA-FASD has only 240 training videos, fewer than OULU-NPU and Replay-Attack. Our model (SimMIM) achieves better results when the pretext task is performed on a large training dataset than on a small one. This phenomenon is consistent with the finding in transformer models that more training data yields better performance.

Table 1. ACER (%) of different cases of knowledge transfer. O, C and R denote OULU-NPU, CASIA-FASD and Replay-Attack, respectively.
Table 2. Results on the OULU-NPU dataset. In the architecture column, C and T denote CNN and transformer. {M, S}-{0.50, 0.60, 0.75} stands for MAE and SimMIM under the corresponding mask ratio, respectively. Bold values are the best results in each case.

Comparison with State-of-the-Art Methods. In what follows, we compare the performance of our approach on OULU-NPU with several classical methods, including three CNN-based methods: attention-based two-stream CNN (ATS-CNN) [4], central difference convolutional networks (CDCN) [34] and neural architecture search (NAS) for FAS [33]; three transformer-based methods: temporal transformer network with spatial parts (TTN-S) [28], video transformer based PAD (ViTransPAD) [19] and the two-stream vision transformers framework (TSViT) [22]; and one SSL-based method: temporal sequence sampling (TSS) [20]. All the comparison results on the four protocols are tabulated in Table 2.

Compared with these state-of-the-art methods, our method does not achieve the best performance, especially in protocol 4. Nonetheless, it still obtains competitive results, for example, the best BPCER in protocol 2, the second best ACER in protocol 2 and the second best APCER in protocol 3. The reason these methods outperform ours is that they design ingenious but complex models, which increases the consumption of computational resources. It should be noted that our models are relatively simple and do not require complex label information or structural design, which means that our method has great potential. For example, the architecture of TTN-S [28] is complex because it combines temporal difference attention, pyramid temporal aggregation and a transformer. ViTransPAD [19] has a high computational burden since it captures local spatial details with short attention and long-range temporal dependencies over frames. The architecture of TSViT [22] is also complex since it leverages transformers to learn complementary features simultaneously from the RGB color space and the multi-scale Retinex with color restoration space.

To sum up, the excellent performance of our proposed approach originates from two aspects: (i) the masking-and-reconstruction strategy is effective for learning face detail features; (ii) the self-attention of the transformer is able to extract global image information.

To investigate our approach more comprehensively, we compare it with several models on Replay-Attack and CASIA-FASD. All the testing videos of Replay-Attack are recognized correctly, and our EER score is the lowest on CASIA-FASD, which again verifies the superiority of our method.

Table 3. Results on CASIA-FASD and Replay-Attack Datasets. Bold values are the best results in each case.

Ablation Study. We perform an ablation study on protocol 1 and protocol 2 of OULU-NPU to show that the experimental results benefit not only from the ViT structure but also from MIM. We train the downstream task without the pretext task, i.e., using a pure ViT to train the FAS task from scratch. All ACER scores are enumerated in Table 4, and we observe that the pretext task plays a crucial role in the performance of our model: the ACER results of our method are significantly better than those of the pure ViT on both protocols. These experimental results sufficiently prove the necessity of MIM.

Table 4. Ablation experimental results on OULU-NPU dataset.

4.4 Visualization

Feature Distribution. To visualize the distribution of the features learned by MAE, the 1080 testing videos in protocol 2 of OULU-NPU are used, and the GAP-processed feature matrix with dimensions 768 \(\times \) 1080 is fed into the t-SNE algorithm. From Fig. 2(b), it can be seen that the genuine videos and spoofing videos are clearly separated, which implies that our learned features possess powerful discriminative capability.
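A possible t-SNE visualization along these lines is sketched below; it assumes the pooled features have been transposed to an (n_videos, feature_dim) array, i.e. 1080 x 768, and uses scikit-learn's TSNE, which is an assumption about tooling rather than a detail from the paper.

```python
import numpy as np
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

def plot_feature_distribution(features, labels):
    """Project GAP-pooled features to 2D with t-SNE and color by class.

    `features`: (n_videos, feat_dim) array, e.g. 1080 x 768 for OULU-NPU protocol 2.
    `labels`: array of 0/1 flags (0 = genuine, 1 = spoof)."""
    labels = np.asarray(labels)
    emb = TSNE(n_components=2, init="pca", random_state=0).fit_transform(features)
    for cls, name in [(0, "genuine"), (1, "spoof")]:
        mask = labels == cls
        plt.scatter(emb[mask, 0], emb[mask, 1], s=5, label=name)
    plt.legend()
    plt.show()
```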

Fig. 3. Reconstruction details marked by red boxes for genuine and spoofing faces.

Reconstruction Details. To further illustrate the effectiveness of our method, we display the reconstruction details for different types of face images, as shown in Fig. 3. Columns 1–3 represent a genuine face, an eye-cut photo attack and a hand-held photo attack. For the FAS task, the differences between spoofing and genuine faces often lie in the pixel details. One can notice that MIM focuses on faithfully reconstructing the face area: for the image in column 2, the reconstruction quality of the eye-cut region is poor, and for the image in column 3, the hand-held region is reconstructed incorrectly. The parts that cannot be reconstructed well are all non-face areas. This observation directly shows that our method focuses on learning detailed facial features and autonomously discovers the visual cues of spoofing faces.

5 Conclusion

This paper proposes a novel FAS method under the SSL framework. In the pretext task stage, the MIM strategy is employed to learn general face detail features under an encoder-decoder structure. In the downstream task stage, the knowledge in the encoder is directly transferred and followed by a simple classification network with only GAP and FC layers. Extensive experiments on three standard benchmarks show that our method achieves competitive results, which demonstrates that the MIM pretext task is effective for learning general and discriminative face features that are beneficial to FAS.