1 Introduction

Face recognition (FR) systems are widely employed in daily human-computer interaction. Face anti-spoofing (FAS) is crucial to protect FR systems from presentation attacks, e.g., print attacks, video attacks and 3D mask attacks. Traditional FAS methods extract texture patterns with hand-crafted descriptors [8, 15, 24].

Fig. 1. A practical application scenario for face anti-spoofing. In the pre-training phase, the company builds a model on the collected face data. When the model is deployed on the user side, a few collected unlabeled samples can improve performance through adaptation, but they exhibit distribution discrepancies with the source knowledge. Moreover, due to the privacy and security concerns around face data, users have no access to any source data of the company, only the trained model.

With the rise of deep learning, convolutional neural networks (CNNs) have been adopted to extract deep semantic features [40, 45, 46]. Despite promising success in intra-dataset tests, these methods degrade dramatically in cross-dataset tests, where training data come from the source domain and test data come from a target domain with a different distribution. The distribution discrepancies in illumination, background and resolution undermine performance, and an adaptation process is required to mitigate the domain shift.

Domain adaptation (DA) based methods leverage maximum mean discrepancy (MMD) loss [16, 32] and adversarial training [12, 35, 36] to align the source and target domains, which requires access to the source data. Unfortunately, this might be infeasible for sensitive facial images due to restrictions from institutional policies, legal issues and privacy concerns. For example, according to the General Data Protection Regulation (GDPR) [30], institutions in the European Union are required to protect the privacy of their data. Figure 1 illustrates a practical application scenario of source-free domain adaptation for FAS. A model is first pre-trained on the (large-scale) source data and is released for deployment. In the deployment phase, the source data cannot be shared for adapting the pre-trained model to the target data, as they contain sensitive biometric information. Besides, face images acquired under different illumination, backgrounds or resolutions, or with cameras of different parameters, lead to distribution discrepancies between the source and target data. These discrepancies have to be overcome using only the pre-trained source model and unlabeled target data. Domain generalization (DG) methods [11, 26, 27] learn a robust source model without exploiting the target data and achieve limited performance in practice. Consequently, source-free domain adaptation (SFDA) for face anti-spoofing is an important yet challenging problem that remains to be solved.

Fig. 2. The t-SNE visualization of extracted features and corresponding faces under O & M & I\(\rightarrow \)C. Faces with the same identity share the same border color. (a) The SOTA-SFDA method SHOT [17] performs marginal distribution alignment, which is prone to mapping the features of real and fake faces together. (b) Our method with conditional distribution alignment separates them well and increases the discrimination ability. (c) Intra-class and inter-class distances of the extracted features for SHOT and our SDA-FAS.

Recently, SFDA has been considered to tackle a similar issue in image classification [1, 17, 41, 42]. In image classification, label consistency among data with high local affinity is encouraged [41, 42], or the marginal distributions of the source and target domains are implicitly aligned [1, 17], to harmonize the clustered features in the feature space. Different from image classification, in FAS, fake faces of the same identity have similar facial features, whereas real faces of different identities differ. The intra-class distance between real faces of different identities probably exceeds the inter-class distance between real and fake faces of the same identity [11, 27]. Such feature clusters do not exist in FAS, so SFDA models designed for image classification inevitably suffer degraded performance. Table 1 and Fig. 2 provide empirical evidence: SHOT [17], the state-of-the-art SFDA method, tends to cluster the features of real and fake faces together and obscures the discrimination ability. These problems call for an SFDA method designed specifically for FAS.

Lv et al. [20] adapt to the source-free setting for FAS by directly applying self-training, but lack a design tailored to the specifics of FAS. The performance gain from adaptation is trivial (i.e., only a 1.9% HTER reduction on average), as shown in Table 1. To summarize, the challenges of source-free domain adaptation for FAS are source knowledge adaptation and target data exploration.

  • Source knowledge adaptation. Existing self-training and marginal domain alignment cannot adapt source knowledge well in FAS, especially when source data are unavailable. The target pseudo labels generated by the source model are noisy, especially under domain shift, leading to accumulated errors in self-training. Marginal distribution alignment is prone to clustering the features of real and fake faces together, which greatly degrades the discrimination ability for FAS.

  • Target data exploration. Unseen attack types in the target data lead to enormous domain discrepancies, for which the source knowledge is inapplicable and biased. It is indispensable to explore the target data themselves to boost the generalization ability. However, target data exploration is ignored in existing methods.

To address these issues, we propose a novel method of Source-free Domain Adaptation for Face Anti-Spoofing, namely SDA-FAS. Regarding source knowledge adaptation, we design novel strategies for self-training and domain alignment. We develop a contrastive domain alignment module to mitigate feature distribution discrepancies under the source-free setting. The pre-trained classifier weights are employed as the source prototypes, with a theoretical guarantee of equivalence in training. We also introduce source-oriented regularization into self-training to alleviate the self-biasing problem. For target data exploration, self-supervised learning is implemented with a specified patch shuffle data augmentation to mine the intrinsic spoofing features of the target data, which also mitigates the reliance on pseudo labels and boosts the tolerance to interfering knowledge transferred from the source domain. The contributions of this paper are summarized as follows:

  • We propose a novel contrastive domain alignment module to align the features of target data with the source prototypes of the same category for mitigating distribution discrepancies with theoretical support.

  • We implement self-supervised learning with specified patch shuffle data augmentation to explore the target data for robust features in the case where unseen attack types emerge and source knowledge is unreliable.

  • Our method is evaluated extensively on thirteen cross-dataset testing benchmarks and outperforms the state-of-the-art methods by a large margin.

To the best of our knowledge, SDA-FAS is the first attempt to unify the transfer of pre-trained source knowledge and the self-exploration of unlabeled target data for FAS under a practical yet challenging source-free setting.

2 Related Work

Face Anti-spoofing. Existing face anti-spoofing (FAS) methods can be classified into three categories, i.e., handcrafted, deep learning, and DG/DA methods. Handcrafted methods extract frame-level features using handcrafted descriptors such as LBP [8], HOG [15] and SIFT [24]. Deep learning methods boost the discrimination ability of the extracted features. Yang et al. [40] first introduce CNNs into FAS, and Xu et al. [39] design a CNN-LSTM architecture to extract temporal features. Intrinsic spoofing patterns are further explored with pixel-wise supervision [44], e.g., depth maps [18], reflection maps [43] and binary masks [19]. These methods achieve remarkable performance in intra-dataset tests but degrade significantly in cross-dataset tests due to distribution discrepancies.

DG and DA have been leveraged to mitigate domain shift in cross-dataset tests. DG methods focus on extracting domain invariant features without target data. MADDG [27] learns a shared feature space with multi-adversarial learning. SSDG [11] develops a single-side DG framework by only aggregating real faces from different source domains. DA methods achieve the domain alignment using source data and unlabeled target data. Maximum mean discrepancy (MMD) loss [16, 32] and adversarial training [12, 35, 36] are leveraged to align the feature space between the source and target domains. Quan et al. [25] present a transfer learning framework to progressively make use of unlabeled target data with reliable pseudo labels for training. However, these methods fail to work or suffer from poor performance in a practical yet challenging source-free setting, which considers the privacy and security issues of sensitive face images.

Source-Free Domain Adaptation. Domain adaptation aims at transferring knowledge from source domain to target domain. Recently, source-free domain adaptation (SFDA) has been considered to address privacy issues. PrDA [14] progressively updates the model in a self-learning manner with filtered pseudo labels. Based on the source hypothesis, SHOT [17] aligns the marginal distribution of source and target domains via information maximization. DECISION [1] further extends SHOT to a multi-source setting. TENT [34] adapts batch normalization’s affine parameters with an entropy penalty. NRC [41] exploits the intrinsic cluster structure to encourage label consistency among data with high local affinity. However, existing works cannot be easily employed in FAS due to the different nature of tasks. Recently, Lv et al. [20] realize SFDA for FAS by directly using the pseudo labels for self-training, but suffer from trivial performance gain after adaptation due to the accumulated training error brought by noisy pseudo labels, especially under domain shift.

Contrastive Learning. Contrastive learning is popular for self-supervised representation learning. To learn discriminative feature representations, the InfoNCE loss [22] is introduced to pull together an anchor and one positive sample (constructed by augmenting the anchor), and to push the anchor apart from many negative samples. Besides, self-supervised features can be learned by only matching the similarity between the anchor and the positive sample [3, 5]. Contrastive learning has also been introduced into image classification in a supervised manner [13], where categorical labels are used to build positive and negative samples.
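For concreteness, the sketch below shows a minimal InfoNCE computation; the function name and the batch construction are ours for illustration and do not reproduce the exact formulation of [22].

```python
import torch
import torch.nn.functional as F

def info_nce(anchor, positive, negatives, tau=0.07):
    """InfoNCE for one anchor: pull the positive close, push negatives away.

    anchor:    (d,) feature of the anchor view
    positive:  (d,) feature of the augmented view of the same image
    negatives: (n, d) features of other images in the batch
    """
    anchor = F.normalize(anchor, dim=0)
    candidates = F.normalize(
        torch.cat([positive.unsqueeze(0), negatives], dim=0), dim=1)
    logits = candidates @ anchor / tau  # (n + 1,) cosine similarities
    # The positive sits at index 0, so InfoNCE reduces to cross-entropy.
    return F.cross_entropy(logits.unsqueeze(0),
                           torch.zeros(1, dtype=torch.long))
```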

Fig. 3. (a) The overall architecture contains a pre-trained source model (in blue) and a trainable target model with three modules (in orange). For self-training with source regularization, the pseudo labels \(\overline{\textbf{y}}^{t}_T\) and \(\overline{\textbf{y}}^{s}_T\) generated by the target and source models supervise the outputs \(\tilde{\textbf{y}}_{t}\) and \(\tilde{\textbf{y}}_{t2s}\), respectively. (b) Contrastive domain alignment. The features of target data are pulled toward the source prototypes of the same category (green arrow) and pushed away from those of different categories (red arrow) for conditional domain alignment. (c) Target self-supervised exploration. The original image and its patch-shuffled view are sent to the student and teacher networks. The output distributions are matched by minimizing the KL divergence, i.e., augmented features are pulled toward features after scaling, which facilitates the learning of a compact feature space.

3 Proposed Method

3.1 Overview

We consider the practical source-free domain adaptation setting for face anti-spoofing, in which only a trained source model and unlabeled target domain data are available for adaptation. To recover the knowledge in the pre-trained source model, we leverage self-training to generate pseudo labels for target supervision. To alleviate the self-biasing problem caused by vanilla self-training, we introduce source-oriented pseudo labels as regularization in Sect. 3.2. General SFDA methods align the marginal distribution, so they can fail to adapt the source knowledge and mitigate domain shift in FAS, where intra-class distances are prone to being larger than inter-class distances. Therefore, we propose a novel contrastive domain alignment module tailored for FAS that aligns target features to source prototypes for conditional distribution alignment, with theoretical insights, in Sect. 3.3. For unseen attack types not covered by the source knowledge, we introduce a target self-supervised exploration module with patch shuffle data augmentation to get rid of the facial structure and mine the intrinsic spoofing features in Sect. 3.4.

Figure 3 illustrates the overall architecture of our proposed framework, which consists of a pre-trained source model and a trainable target model. The pre-trained source model comprises a feature extractor and a one-layer linear classifier, whose parameters are fixed during adaptation. The feature extractor consists of a transformer encoder for feature encoding and a convolution layer for feature embedding. The target model comprises a student network and a teacher network, where the student network has a feature extractor with multi-branch classifiers. The parameters of each target module are initialized with those of the pre-trained source model.
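As a reading aid, the following PyTorch sketch shows how these components could be organized; the module names, the feature dimension (384, matching DeiT-S), and the initialization helper are our illustrative assumptions based on the description above.

```python
import copy
import torch.nn as nn

class FASModel(nn.Module):
    """Feature extractor g (backbone + embedding) and linear classifiers."""
    def __init__(self, backbone, feat_dim=384, num_classes=2):
        super().__init__()
        self.g = backbone                              # transformer encoder + feature embedding
        self.h = nn.Linear(feat_dim, num_classes)      # main classifier h_t
        self.h_t2s = nn.Linear(feat_dim, num_classes)  # branch supervised by source pseudo labels

    def forward(self, x):
        z = self.g(x)                                  # feature embedding z
        return self.h(z), self.h_t2s(z), z

def build_target_model(source_model):
    # Student and teacher both start from the pre-trained source weights.
    student = copy.deepcopy(source_model)
    teacher = copy.deepcopy(source_model)
    for p in teacher.parameters():
        p.requires_grad_(False)  # teacher is updated by EMA, not by gradients
    return student, teacher
```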

3.2 Self-training with Source Regularization

Self-training Baseline (ST). Given the target domain data \(\mathbf {\mathcal {D}}_{T}=\{\textbf{x}_{T}\}\) and the student network of the target model \(f_t=h_{t}\circ g_{t}\) (initialized by \(f_s\)), the network output is \(\tilde{\textbf{y}}_{t}=h_t(g_t(\textbf{x}_{T}))\) and the self-training loss is

$$\begin{aligned} \mathcal {L}_{\textrm{ST}}=\mathbbm {1}\left( \max \left( \textbf{c}_{T}^{t}\right) \ge \gamma \right) \mathcal {L}_{ce}(\tilde{\textbf{y}}_{t},\overline{\textbf{y}}_{T}^{t}), \end{aligned}$$
(1)

where \(\textbf{c}_{T}^{t}=\sigma (h_t(g_t(\textbf{x}_{T})))\) is the prediction confidence, \(\overline{\textbf{y}}_{T}^{t}=\textrm{argmax}(h_t(g_t(\textbf{x}_{T})))\) is the generated pseudo label, and \(\mathbbm {1} \in \{0,1\}\) is an indicator function that takes the value 1 only when the input condition holds. \(\gamma \) is the confidence threshold used to select more reliable pseudo labels.

Though self-training is effective in exploring unlabeled data [29], under domain shift the noisy pseudo labels lead to accumulated errors and result in a self-biasing problem. As shown in Fig. 5, the accuracy of pseudo labels for ST gradually drops to about 50%, which is no better than a random guess for binary classification. Therefore, we introduce the regularization of source-oriented knowledge to alleviate the self-biasing problem.

Source-Oriented Regularization (SR). The target data \(\mathbf {\mathcal {D}}_{T}=\{\textbf{x}_{T}\}\) are fed into the fixed pre-trained source model \(f_s=h_{s}\circ g_{s}\) to obtain the source-oriented pseudo labels \(\overline{\textbf{y}}_{T}^{s}=\textrm{argmax}(h_s(g_s(\textbf{x}_{T})))\) and prediction confidence \(\textbf{c}_{T}^{s}=\sigma (h_s(g_s(\textbf{x}_{T})))\). The cross-entropy loss for SR compares the output \(\tilde{\textbf{y}}_{t2s}=h_{t2s}(g_t(\textbf{x}_{T}))\) of \(h_{t2s}\) with \(\overline{\textbf{y}}_{T}^{s}\) as

$$\begin{aligned} \mathcal {L}_{\textrm{SR}}=\mathbbm {1}\left( \max \left( \textbf{c}_{T}^{s}\right) \ge \gamma \right) \mathcal {L}_{ce}(\tilde{\textbf{y}}_{t2s},\overline{\textbf{y}}_{T}^{s}). \end{aligned}$$
(2)

Then, ST and SR are dynamically adjusted during training. Due to domain shift, the target model produces many noisy pseudo labels in the early stage of training and generates more reliable pseudo labels as the training proceeds. Thus, we assign higher importance to SR at first and gradually increase the importance of ST. The overall loss is formulated as

$$\begin{aligned} \mathcal {L}_{\textrm{SSR}}=\alpha \cdot \mathcal {L}_{\textrm{ST}}+(1-\alpha )\cdot \mathcal {L}_{\textrm{SR}}. \end{aligned}$$
(3)

Here, the hyperparameter \(\alpha \) gradually increases from 0 to 1.
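The sketch below illustrates how Eqs. (1)–(3) could be computed in PyTorch; the threshold value and the function signature are illustrative assumptions, as the paper does not fix them in this section.

```python
import torch
import torch.nn.functional as F

def ssr_loss(logits_t, logits_t2s, logits_src, alpha, gamma=0.8):
    """Self-training with source regularization, Eqs. (1)-(3).

    logits_t:   student output  y~_t   = h_t(g_t(x_T))
    logits_t2s: student output  y~_t2s = h_t2s(g_t(x_T))
    logits_src: frozen source model output h_s(g_s(x_T))
    gamma:      confidence threshold (value is an assumption)
    """
    # Target-oriented pseudo labels and confidences (Eq. 1).
    conf_t, pseudo_t = F.softmax(logits_t.detach(), dim=1).max(dim=1)
    mask_t = (conf_t >= gamma).float()
    l_st = (mask_t * F.cross_entropy(logits_t, pseudo_t,
                                     reduction="none")).mean()

    # Source-oriented pseudo labels and confidences (Eq. 2).
    conf_s, pseudo_s = F.softmax(logits_src.detach(), dim=1).max(dim=1)
    mask_s = (conf_s >= gamma).float()
    l_sr = (mask_s * F.cross_entropy(logits_t2s, pseudo_s,
                                     reduction="none")).mean()

    # Dynamic weighting (Eq. 3); alpha ramps from 0 to 1 during training.
    return alpha * l_st + (1 - alpha) * l_sr
```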

3.3 Contrastive Domain Alignment

As discussed in Sect. 1, in real applications, faces are captured by various cameras under different environments, leading to distribution discrepancies in illumination, background and resolution. To mitigate the distribution discrepancies between source and target domains, DA methods employ MMD loss or adversarial learning, which requires full access to the source data. In the source-free setting, based on the source hypothesis, existing SFDA methods [1, 17] align the marginal distributions of the source and target domains, i.e., \(P(g_t(\textbf{x}_S))=P(g_t(\textbf{x}_T))\).

However, such a marginal distribution alignment, regardless of the categories, suffers from degraded performance in FAS. Since the intra-class distance tends to exceed the inter-class distance in FAS, features of different categories exhibit close proximity. For example, given a real subject, the corresponding fake faces with the same identity have similar facial features, while real faces with different identities have different facial features. As shown in Fig. 2, such a marginal distribution alignment [17] may align the features of real faces with those of fake ones, which implies mismatched conditional distributions \(P(g_t(\textbf{x}_S)|\textbf{y}_S)\ne P(g_t(\textbf{x}_T)|\textbf{y}_T)\) and harms the discrimination ability.

Thus, as shown in Fig. 3(b), we propose a contrastive domain alignment module to align the conditional distribution between the source and target domains. Due to the inaccessibility of source data, we propose to use the weights of pre-trained classifier \(h_s\) as the feature embeddings of the source prototype to compute the supervised contrastive loss.

Proposition 1

Given a trained model \(f_s=h_s \circ g_s\), where \(g_s\) is the feature extractor and \(h_s\) is the one-layer linear classifier, the \(\ell _{2}\)-normalized weight vectors \(\{\textbf{w}_s^{real},\textbf{w}_s^{fake}\}\) of the classifier are the equivalent representation of the feature embeddings \(\{\textbf{z}_s^{real},\textbf{z}_s^{fake}\}\) of the source prototypes for calculating the supervised contrastive loss.

Proof. Please refer to the supplementary material.

With the generated pseudo labels denoting the category of the feature embeddings of the target data anchor, we have the supervised contrastive loss as

$$\begin{aligned} \mathcal {L}_{\textrm{CDA}}=-\sum _{i=1}^{N_t}\sum _{m=1}^{M}\bigg [\mathbbm {1}(\max (\textbf{c}_{T}^{t,i})\ge \gamma ,\overline{\textbf{y}}_{T}^{t,i}=m)\cdot \log \frac{\exp (\langle {\textbf{z}_t^i},{\textbf{w}^m_{s}}\rangle /\tau )}{\sum _{j=1}^{M} \exp (\langle {\textbf{z}_t^i},{\textbf{w}^j_{s}}\rangle /\tau )}\bigg ], \end{aligned}$$
(4)

where \(\textbf{c}_{T}^{t,i}=\sigma (h_t(g_t(\textbf{x}_{T}^i)))\), \(\overline{\textbf{y}}_{T}^{t,i}=\textrm{argmax}(h_t(g_t(\textbf{x}_{T}^i)))\), \(\textbf{z}_{t}^{i}=g_t(\textbf{x}_{T}^i)\), \(\langle \cdot ,\cdot \rangle \) denotes the inner product, \(\tau \) is the temperature parameter, and M is the number of categories. The contrastive domain alignment module has two properties: (1) it pulls together the feature embeddings of real (fake) faces in the target domain and those of the same category in the source domain to align the conditional distribution (green arrow in Fig. 3 (b)); (2) it pushes apart the feature embeddings of real (fake) faces in the target domain from those of different categories in the source domain to enhance the discrimination ability (red arrow in Fig. 3 (b)).
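Since the indicator in Eq. (4) picks out the prototype matching the pseudo label, the loss reduces to a cross-entropy over prototype similarities for the reliable samples. A minimal sketch follows; the temperature, threshold, and the normalization over reliable samples are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def cda_loss(feats_t, logits_t, w_src, gamma=0.8, tau=0.07):
    """Contrastive domain alignment, Eq. (4).

    feats_t:  (N, d) target features z_t = g_t(x_T)
    logits_t: (N, M) student predictions used for pseudo labels
    w_src:    (M, d) rows of the frozen source classifier, acting as the
              source prototypes (Proposition 1)
    """
    conf, pseudo = F.softmax(logits_t.detach(), dim=1).max(dim=1)
    mask = (conf >= gamma).float()      # keep only reliable pseudo labels
    z = F.normalize(feats_t, dim=1)     # l2-normalized target features
    w = F.normalize(w_src, dim=1)       # l2-normalized prototypes
    logits = z @ w.t() / tau            # (N, M) similarities to prototypes
    loss = F.cross_entropy(logits, pseudo, reduction="none")
    # Averaging over the reliable samples is our normalization choice;
    # Eq. (4) writes the loss as a plain sum.
    return (loss * mask).sum() / mask.sum().clamp(min=1)
```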

3.4 Target Self-supervised Exploration

For FAS applications, novel fake faces continuously evolve, and it is likely to encounter attack types or collection methods unseen in the source data. For example, the spoofing features of 2D attacks and 3D mask attacks are quite different. In cases where the distribution discrepancies are enormous and the source knowledge fails to apply, the generalization ability decreases. Thus, we introduce a target self-supervised exploration (TSE) module to mine valuable information from the target domain. However, traditional data augmentation does not fit the spirit of FAS, which is to capture detailed features: taking the whole image as input inevitably introduces global facial information. Thus, to suppress the facial structure information, i.e., the biased source knowledge that leads to larger intra-class distances than inter-class distances, patch shuffle [47] is leveraged as a data augmentation strategy to destroy the face structure and learn a more compact feature space, as sketched below. Moreover, TSE is naturally independent of pseudo labels and can boost the tolerance to wrongly transferred source supervision. The difference between our method and self-supervised methods [3, 5] lies in the fact that we utilize the patch shuffle augmentation specifically for FAS and that the target model is initialized with the pre-trained source model.
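A minimal sketch of such a patch shuffle augmentation is given below; the grid size is an illustrative assumption, as the exact configuration follows [47].

```python
import torch

def patch_shuffle(x, grid=4):
    """Split an image into a grid x grid jigsaw and randomly permute the
    patches, destroying the global facial structure while keeping local
    spoofing cues (texture, moire, edges) intact.

    x: (C, H, W) tensor; H and W are assumed divisible by grid.
    """
    c, h, w = x.shape
    ph, pw = h // grid, w // grid
    # Cut into a (C, grid, grid, ph, pw) grid of patches.
    patches = x.unfold(1, ph, ph).unfold(2, pw, pw)
    patches = patches.reshape(c, -1, ph, pw)       # (C, grid*grid, ph, pw)
    perm = torch.randperm(patches.size(1))
    patches = patches[:, perm]                     # shuffle the patch order
    # Reassemble the shuffled patches into a full image.
    patches = patches.reshape(c, grid, grid, ph, pw)
    return patches.permute(0, 1, 3, 2, 4).reshape(c, h, w)
```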

Specifically, a Siamese-like architecture is implemented to maximize the similarity of two views of one image [4]. It consists of a student network (i.e., \(g_t^{stu} \triangleq g_t, h_t^{stu} \triangleq h_t, f_t^{stu} \triangleq f_t\)) and a teacher network \(f_t^{tea}=h_t^{tea}\circ g_t^{tea}\). The student network is optimized by gradient descent, whereas the teacher network is updated with an exponential moving average (EMA). Given a target image \(\textbf{x}_{T}\), a patch-disordered view \(\textbf{x}'_{T}\) is obtained by splitting and splicing: we first divide the image into several patches and then randomly permute the patches to form a new image, like a jigsaw. The original view \(\textbf{x}_{T}\) and the patch-permuted view \(\textbf{x}'_{T}\) are alternately fed into the student and teacher networks to obtain two pairs of output probability distributions \(\{P_{stu},P'_{tea}\}\) and \(\{P'_{stu},P_{tea}\}\). Since the two views contain the same detailed real/fake features, the outputs should be consistent, which is enforced by minimizing the Kullback-Leibler (KL) divergence:

$$\begin{aligned} \mathcal {L}_\textrm{TSE} = D_{KL}(P'_{tea}\Vert P_{stu})+D_{KL}(P_{tea}\Vert P'_{stu}). \end{aligned}$$
(5)

After updating \(\theta _t\) with Eq. (5) by gradient descent, the parameters \(\theta _{t}^{tea}\) of the teacher network are updated with an EMA as \(\theta _{t}^{tea} \leftarrow l\theta _{t}^{tea}+(1-l)\theta _{t}\), where \(l\) is the rate parameter.
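The sketch below illustrates Eq. (5) together with the EMA update; the temperature and the rate value are illustrative assumptions, and the networks are abstracted as callables that return logits.

```python
import torch
import torch.nn.functional as F

def tse_loss(student, teacher, x, x_shuffled, temp=0.1):
    """Target self-supervised exploration, Eq. (5): match the output
    distributions of the two views across the student/teacher pair."""
    log_p_stu = F.log_softmax(student(x) / temp, dim=1)        # P_stu
    log_p_stu_s = F.log_softmax(student(x_shuffled) / temp, dim=1)  # P'_stu
    with torch.no_grad():
        p_tea = F.softmax(teacher(x) / temp, dim=1)            # P_tea
        p_tea_s = F.softmax(teacher(x_shuffled) / temp, dim=1)  # P'_tea
    # KL(P'_tea || P_stu) + KL(P_tea || P'_stu)
    return (F.kl_div(log_p_stu, p_tea_s, reduction="batchmean")
            + F.kl_div(log_p_stu_s, p_tea, reduction="batchmean"))

@torch.no_grad()
def ema_update(teacher, student, l=0.996):
    # theta_tea <- l * theta_tea + (1 - l) * theta_stu
    for pt, ps in zip(teacher.parameters(), student.parameters()):
        pt.mul_(l).add_(ps, alpha=1 - l)
```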

The proposed framework for FAS is trained in an end-to-end manner as

$$\begin{aligned} \mathcal {L}= \mathcal {L}_{\textrm{SSR}}+\lambda _1\cdot \mathcal {L}_{\textrm{CDA}}+\lambda _2\cdot \mathcal {L}_{\textrm{TSE}}, \end{aligned}$$
(6)

where \(\lambda _1\) and \(\lambda _2\) are hyper-parameters to balance the losses.
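A minimal sketch of the overall objective follows; the linear schedule for \(\alpha \) and the default loss weights are our assumptions, as the paper only states that \(\alpha \) increases gradually from 0 to 1.

```python
def alpha_schedule(step, total_steps):
    # Linear ramp from 0 to 1; the exact schedule form is an assumption.
    return min(1.0, step / max(1, total_steps))

def total_loss(l_ssr, l_cda, l_tse, lam1=1.0, lam2=1.0):
    # Eq. (6); lam1 and lam2 are placeholders, not the paper's settings.
    return l_ssr + lam1 * l_cda + lam2 * l_tse
```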

4 Experiments

4.1 Experimental Settings

Datasets. Evaluations are made on five public datasets: Idiap Replay-Attack [7] (denoted as I), OULU-NPU [2] (denoted as O), CASIA-MFSD [50] (denoted as C), MSU-MFSD [38] (denoted as M) and CelebA-Spoof [49] (denoted as CA). CA is by far the largest and the most diverse.

Table 1. HTER and AUC for the multi-source domains cross-dataset test. From top to bottom, the compared methods are state-of-the-art deep learning FAS (DL-FAS), DG based FAS (DG-FAS), DA based FAS (DA-FAS), SFDA based FAS (SFDA-FAS) and state-of-the-art general SFDA methods (SOTA-SFDA). SourceOnly is our pre-trained source model and (best) is the target model after adaptation. Our average results are reported as the mean with standard deviation over 3 independent runs with different seeds. Lv et al. (base) is the pre-trained source model and (SE) is the target model after adaptation. \(^\dagger \) indicates our reproduced results with the released code.

Testing Scenarios. Following [27], one dataset is treated as one domain. For simplicity, we use A & B\(\rightarrow \)C for the scenario that trains on the source domains A and B, and tests on the target domain C. There are thirteen scenarios in total:

  • Multi-source Domains Cross-dataset Test: O & C & I\(\rightarrow \)M, O & M & I\(\rightarrow \)C, O & C & M\(\rightarrow \)I, and I & C & M\(\rightarrow \)O.

  • Limited Source Domains Cross-dataset Test: M & I\(\rightarrow \)C and M & I\(\rightarrow \)O.

  • Cross-dataset Test on Large-scale CA: M & C & O\(\rightarrow \)CA.

  • Single Source Domain Cross-dataset Test: C\(\rightarrow \)I, C\(\rightarrow \)M, I\(\rightarrow \)C, I\(\rightarrow \)M, M\(\rightarrow \)C, and M\(\rightarrow \)I.

Fig. 4. ROC curves for the multi-source domains cross-dataset test on O, C, I and M.

Evaluation Metrics. Following [11, 27], the Half Total Error Rate (HTER), i.e., half the sum of the false acceptance rate and the false rejection rate, and the Area Under the Curve (AUC) are used as evaluation metrics.
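For reference, a minimal sketch of these metrics is given below; the label convention and the fixed threshold are our simplifications, since in practice the HTER threshold is typically determined on a development set.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def hter(scores, labels, threshold=0.5):
    """Half Total Error Rate: mean of the false acceptance rate and the
    false rejection rate.

    scores: spoofness scores in [0, 1]; labels: 1 = attack, 0 = live.
    """
    scores, labels = np.asarray(scores), np.asarray(labels)
    accept = scores < threshold               # predicted live
    far = np.mean(accept[labels == 1])        # attacks accepted as live
    frr = np.mean(~accept[labels == 0])       # live faces rejected
    return 0.5 * (far + frr)

def auc(scores, labels):
    return roc_auc_score(labels, scores)
```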

Table 2. HTER and AUC for test on O and C with limited source domain datasets.
Table 3. HTER and AUC for test on large-scale CA.
Table 4. HTER(%) for single source domain cross-dataset test on C, I, and M datasets.

Implementation Details. Following [11], MTCNN [48] is adopted for face detection. The detected faces are normalized to \(256\times 256\times 3\) as inputs. DeiT-S [31] pre-trained on ImageNet is used as the transformer encoder. For pre-training on the source data, we randomly specify a 0.9/0.1 train-validation split and select the optimal model based on the HTER of the validation split. For adaptation, the model is fine-tuned on the training set of the target data and tested on the test set, ensuring that the test set is unseen during the whole procedure. The source code is released at https://github.com/YuchenLiu98/ECCV2022-SDA-FAS.
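One possible instantiation of the components named above is sketched below; apart from the stated input size and backbone, the specific libraries and arguments are our assumptions, not necessarily those of the released code.

```python
import timm
from facenet_pytorch import MTCNN

# Face detection and cropping (library choice is an assumption; the paper
# only states that MTCNN [48] is used).
face_detector = MTCNN(image_size=256)

# DeiT-S pre-trained on ImageNet as the transformer encoder; img_size=256
# asks timm to adapt the positional embeddings to 256x256 inputs, and
# num_classes=0 drops the classification head to expose features.
encoder = timm.create_model(
    "deit_small_patch16_224", pretrained=True, img_size=256, num_classes=0,
)
```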

4.2 Experimental Results

Multi-source Domains Cross-dataset Test. Table 1 shows that our SDA-FAS substantially outperforms conventional deep learning FAS methods by mitigating the distribution discrepancies across datasets. Besides, SDA-FAS performs better than DG based methods by exploiting the unlabeled target data, as shown in Fig. 4. Moreover, SDA-FAS even outperforms the state-of-the-art DA method of Quan et al. under the more challenging source-free setting, i.e., a 7.71% HTER reduction and a 4.71% AUC gain (lower HTER and higher AUC indicate better performance) for I & C & M\(\rightarrow \)O, which tests on the largest dataset O (among I, C, M and O). Furthermore, compared with the SFDA based FAS method of Lv et al. (SE), we greatly improve the performance, i.e., 3.77% vs. 20.30% HTER on average. Starting from the pre-trained source model, our SDA-FAS achieves a large performance gain after adaptation with a 12.7% HTER reduction on average, while Lv et al. only achieve 1.9%, validating the effectiveness of our adaptation framework. Finally, our SDA-FAS outperforms the state-of-the-art general SFDA methods thanks to an adaptation framework specifically designed for FAS.

Limited Source Domains Cross-dataset Test. Compared with the state-of-the-art DG method D\(^{2}\)AM, SDA-FAS substantially improves performance by effectively using the available unlabeled target data, i.e., a 17.28% HTER reduction and a 19.31% AUC gain for M & I\(\rightarrow \)C, as shown in Table 2.

Cross-dataset Test on Large-Scale CA. For the most challenging test M & C & O\(\rightarrow \)CA, where CA is much larger and contains unseen spoofing types (3D mask attacks), our SDA-FAS reduces HTER by 7.2% and increases AUC by 10.9% compared to the state-of-the-art DA method of Panwar et al., as shown in Table 3. The promising results under the more practical source-free setting demonstrate that our method is effective and trustworthy for complex real-world scenarios.

Single Source Domain Cross-dataset Test. Table 4 shows that, under the more difficult source-free setting, SDA-FAS outperforms all DA methods on four of the six tests and achieves the best average result (13.2% HTER). Besides, compared with the SFDA method of Lv et al., SDA-FAS achieves a much larger performance gain after adaptation, a 15.0% vs. 4.3% HTER reduction for I\(\rightarrow \)C.

Table 5. Ablation studies on different components of our proposed SDA-FAS.
Table 6. HTER and AUC for unseen 3D mask attack type test on part of CA.
Table 7. AUC(%) of the cross attack type test on C, I and M. Two attack types of unlabeled target data are used for training and tested on unseen attack type.

4.3 Ablation Studies

Each Component of the Network. The proposed framework and its variants are evaluated on the multi-source domains cross-dataset test. Table 5 shows that, on top of ST, SR improves the performance by introducing source-oriented regularization to alleviate the self-biasing problem. Besides, the performance improves with CDA added, demonstrating the effectiveness of conditional domain alignment in mitigating distribution discrepancies and enhancing the discrimination ability. Moreover, TSE further improves the performance, especially on the large test dataset (e.g., I & C & M\(\rightarrow \)O), reflecting its power in self-exploring valuable information in large-scale target data.

Portion of Target Data Used. Firstly, we randomly sample 10% and 50% of the live and spoof faces in the training set for adaptation. Table 8 shows that, even with 10% of the training samples, SDA-FAS improves the performance considerably, demonstrating its validity for real scenarios with scarce data. For example, SDA-FAS reduces HTER by 9.44% after adaptation using only 24 unlabeled samples in C. Secondly, for extreme cases in FAS where live faces greatly outnumber spoof faces, we randomly sample 5%, 10% and 50% of the spoof faces in the training set. With only 5% of the spoof faces (i.e., 9 samples), SDA-FAS reduces HTER by 8.71% after adaptation to C, demonstrating its effectiveness in more challenging scenarios.

Unseen Attack Types. To further evaluate TSE in self-exploring the target data, we reconstitute the CA test set with all real faces and only 3D mask attack faces (unseen in the source data, where only 2D attack types exist), and conduct experiments under M & C & O\(\rightarrow \)CA (3D mask). As shown in Table 6, TSE significantly improves the performance, i.e., a 9.25% HTER reduction and a 7.15% AUC gain, demonstrating its effectiveness in self-exploring novel attack types when the source knowledge fails to apply. A corresponding qualitative analysis is conducted in the supplementary material by visualizing a few hard 3D mask faces. Moreover, following the protocols in [19], the model is trained with only part of the attack types of the unlabeled target data and tested on the unseen type. Table 7 shows our method outperforms DTN [19], which is fully supervised with labeled data. By adapting source knowledge, our method achieves better performance in an unsupervised manner.

Statistics of Pseudo Labels. As shown in Fig. 5, self-training (ST) suffers from a self-biasing problem, and the accuracy of its pseudo labels gradually drops below 50%. Self-training with source-oriented regularization (SSR) alleviates the self-biasing problem, and the accuracy steadily improves to 70%. Moreover, with CDA mitigating domain discrepancies and TSE self-exploring the target data, SDA-FAS achieves the highest accuracy, exceeding 90%.

Table 8. Experiments on different portions of the target training data and spoof faces. L and S denote live and spoof faces, respectively.
Fig. 5. Reliable sample ratio (dashed lines) and pseudo-label accuracy (solid lines) with respect to the updating iteration.

4.4 Visualizations

Attention Map. Figure 6 shows that, for the real faces in rows 1 and 3, our method exhibits dense attention maps that effectively capture the physical structure of human faces. For the cut attack in row 2, the cut area around the eyes is precisely located, whereas for the print attack in row 4, the finger holding the paper is detected. The attention maps suggest that SDA-FAS models the features of live faces well and also precisely captures the intrinsic and detailed spoofing cues. Therefore, it generalizes well to the target domain.

Feature Space. We select all samples of target data for t-SNE visualizations. As shown in Fig. 7, after adaptation, the features of fake faces and real faces are better separated on the target domain compared to those before adaptation.

Fig. 6. Attention maps [3] from the last layer of the transformer encoder under O & M & I\(\rightarrow \)C. Column 1: cropped input image. Columns 2–7: six heads of the transformer encoder. Rows 1–2: attention maps for subject 1's real face and paper-cut attack. Rows 3–4: attention maps for subject 2's real face and print photo attack.

Fig. 7. The t-SNE [21] visualization of the features extracted by our model with adaptation (right) and without adaptation (left) under O & M & I\(\rightarrow \)C.

5 Conclusion

In this paper, we propose a novel adaptation framework for face anti-spoofing under a practical yet challenging source-free setting, which protects the privacy and security of face data. Specifically, source-oriented regularization is introduced to alleviate the self-biasing problem of self-training. Besides, we propose a novel contrastive domain alignment module that aligns the conditional distributions across domains to mitigate the discrepancies. Moreover, self-supervised learning is adopted to self-explore the target data for robust features under enormous domain discrepancies where source knowledge is inapplicable. Extensive experiments validate the effectiveness of our method both statistically and visually.