1 Introduction

Face anti-spoofing (FAS) comprises techniques that distinguish genuine human faces from faces presented on spoof mediums [6], such as printed photographs, screen replays, and 3D masks. FAS is a critical component of the face recognition pipeline that ensures only genuine faces are matched. As face recognition systems are widely deployed in real-world applications, a laboratory-trained FAS model often needs to be deployed in a new target domain with face images from novel camera sensors, ethnicities, ages, types of spoof attacks, etc., which differ from the source domain training data collected in the laboratory.

In the presence of a large domain shift [23, 50, 58] between the source and target domains, it is necessary to employ new target domain data to update the pre-trained FAS model so that it performs well in the new test environment. However, the source domain data might be inaccessible during the update due to data privacy concerns, which arise increasingly often for Personally Identifiable Information (PII). In addition, the FAS model needs to be evaluated jointly on the source and target domains, as spoof attacks should be detected regardless of which domain they originate from. Motivated by these challenges, the goal of this paper is to answer the following question:

How can we update a FAS model using only target domain data, so that the upgraded model can perform well in both the source and target domains?

Fig. 1.

We study multi-domain learning face anti-spoofing (MD-FAS), in which the model is updated using only target domain data. We first derive the general formulation of FAS models, which consists of Spoof Cue Estimate layers (SCE layers) and a multi-scale feature extractor (MsFE). Based on these two components, we propose FAS-wrapper, which can be adopted by any FAS model, as depicted in (c). (a) and (b) represent naive fine-tuning and joint training, respectively.

We define this problem as multi-domain learning face anti-spoofing (MD-FAS), as depicted in Fig. 1. Notably, Domain Adaptation (DA) works [14, 27, 31, 38, 51] mainly evaluate on the target domain, whereas MD-FAS requires a joint evaluation. MD-FAS is also related to Multiple Domain Learning (MDL) [17, 44, 45], which aims to learn a universal representation for many generic image domains based on one unchanged model. In contrast, an MD-FAS algorithm needs to be model-agnostic for deployment, i.e., it can be tasked to update FAS models with various architectures or loss functions. Lastly, the source domain data is unavailable during MD-FAS training, which differs from previous domain generalization methods in FAS [20, 36, 42, 53] or related manipulation detection problems [5].

There are two main challenges in MD-FAS. First, the source domain data is unavailable during the update. As a result, MD-FAS easily suffers from the long-standing catastrophic forgetting issue [25] when learning new tasks, gradually degrading source domain performance. The most common solution [12, 22, 29] to such forgetting is to use logits and class activation maps (grad-CAM) [52] to restore the prior model's responses when processing the new data. However, due to increasingly sophisticated spoof images, logits and grad-CAM empirically fail to precisely pinpoint the spatial pixel locations where spoofness occurs, and thus cannot uncover the decision making behind the FAS model. To this end, we propose a simple yet effective module, namely the spoof region estimator (SRE), to identify the spoof regions given an input spoof image. Such spoof traces serve as the responses of the pre-trained model, and as a better replacement for logits and activation maps in the MD-FAS scenario. Notably, unlike methods that use multiple traces to pinpoint spoofness or manipulation in an image [31, 69], or a low-resolution binary mask as a manipulation indicator [10, 33, 64], our SRE offers a single, high-resolution, detailed binary mask representing the pixel-wise spatial locations of spoofness. Also, many anti-forgetting algorithms [8, 13, 40, 46, 49, 54] require extra memory for storing exemplar samples or expanding the model size, which makes them inefficient in real-world situations.

Secondly, to develop an algorithm with a high level of adaptability, it is desirable to keep the original FAS model architecture intact for seamless deployment while updating the network parameters. Unlike methods proposed in [44, 45] that specialize in a certain architecture (e.g., ResNet), we first derive a general formulation after studying FAS models [30, 34, 37, 53, 63, 65]; based on this formulation, we propose a novel architecture, named FAS-wrapper (depicted in Fig. 2), which can be deployed with FAS models with minimal changes to the architecture.

In summary, this paper makes the following contributions:

\(\diamond \) Driven by deployment in real-world applications, we define the new problem of MD-FAS, which requires updating a pre-trained FAS model using only target domain data, yet evaluating on both source and target domains. To facilitate the MD-FAS study, we construct the FASMD benchmark, based on existing FAS datasets [7, 34, 36], with four evaluation protocols.

\(\diamond \) We propose a spoof region estimator (SRE) module to identify spoof traces in the input image. Such spoof traces serve as the prior model’s responses to help tackle the catastrophic forgetting during the FAS model updating.

\(\diamond \) We propose a novel method, FAS-wrapper, which can be adopted by any FAS model to adapt to target domains while preserving source domain performance.

\(\diamond \) Our method demonstrates superior performance over prior works, on both source and target domains in the FASMD benchmark. Moreover, our method also generalizes well in the cross-dataset scenario.

Table 1. We study multi-domain learning face anti-spoofing, which differs from prior works.

2 Related Works

Face Anti-spoofing Domain Adaptation. In Domain Adaptation (DA) [14, 27, 31, 38, 51], many prior works assume the source data is accessible, but in our setup the source domain data is unavailable. Moreover, DA performance evaluation is biased towards the target domain data, as source domain performance may deteriorate, whereas FAS models need to excel on both source and target domain data. Some FAS works study the cross-domain scenario [20, 36, 43, 53, 56, 59, 61]. [61] is proposed for the scenario where source data and a few labeled new domain data are available, with the idea of augmenting target data via style transfer [62]. [53] learns a shared, indiscriminative feature space without the target domain data. Besides, [20] constructs a generalized feature space that has a compact feature distribution of real faces across different domains. [36] also works on unseen domain generalization. However, as in the other works, the new domains are not defined by biometric attributes (e.g., age). Orthogonal to prior works, the source domain data in our study is unavailable, which is a more challenging setting, as shown in Table 1.

Fig. 2.

(a) Given the source pre-trained model that contains feature extractor \(f^{S}\), we fine-tune it with the proposed spoof region estimator (SRE) on the target domain data, using the preliminary mask (\({\textbf {I}}_{pre}\)) to assist the learning (see Sect. 3.2). We then obtain a well-trained SRE and a new feature extractor \(f^{T}\) that specializes in the target domain. (b) In FAS-wrapper, SRE helps \(f^{S}\) and the updated model (\(f^{new}\)) generate binary masks indicating spoof cues, which serve as model responses given an input image (\({\textbf {I}}\)). \(\mathcal {L}_{Spoof}\) prevents divergence between the estimated spoof traces, to combat catastrophic forgetting. Meanwhile, using two multi-scale discriminators (\(Dis^{S}\) and \(Dis^{T}\)), FAS-wrapper transfers knowledge from the two teacher models (\(f^{S}\) and \(f^{T}\)) to \(f^{new}\) via adversarial training. (c) The updated model \(f^{new}\) and SRE are used for inference.

Anti-forgetting Learning. The main challenge in MD-FAS is the long-studied catastrophic forgetting [25]. According to [11], there exist four families of solutions: replay [8, 46, 54], parameter isolation [13, 40, 49], prior-driven [4, 25, 28] and data-driven [12, 22, 29]. Replay methods require storing a fraction of the training data, which breaks our source-free constraint; e.g., [47] needs to store exemplar training data. Parameter isolation methods [13, 40, 49] dynamically expand the network, which is also discouraged due to the memory expense. Prior-driven methods [4, 25, 28] assume that model parameters obey a Gaussian distribution, which is not always the case. Data-driven methods [3, 15, 16, 19] are generally favored in the community due to their effectiveness and low computation cost. However, the development of data-driven methods is dampened in FAS, since the commonly-used pre-trained model responses (e.g., class probabilities [29] and grad-CAM [12]) fail to capture spoof regions. In this context, our SRE is a simple yet effective way of estimating the spoof trace in the image, which serves as the response of the pre-trained model.

Multi-domain Learning. Most recently, many large-scale FAS datasets with rich annotations have been collected [30, 67, 68], among which [30] studies cross-ethnicity and cross-gender FAS. However, these works rely on multi-modal data, whereas our input is a single RGB image. In the literature, our work is similar to multi-domain learning (MDL) [41, 44, 45], where a re-trained model is required to perform well on both source and target domain data. The common approaches are proposed in [44, 45] based on ResNet [18], which, compared to [26, 55], has advantages in increasing abstraction via convolution operations. In contrast, an ideal MD-FAS algorithm, such as FAS-wrapper, should work in a model-agnostic fashion.

3 Proposed Method

This section is organized as follows. Section 3.1 summarizes the general formulation of recent FAS models. Sections 3.2 and 3.3 introduce the spoof region estimator and the overall FAS-wrapper architecture. Training and inference procedures are reported in Sect. 3.4.

3.1 FAS Models Study

We investigate the recently proposed FAS methods (see Table 2) and observe that these FAS models have two shared characteristics.

Spoof Cue Estimate. Beyond treating FAS as a binary classification problem, many SOTA works emphasize estimating spoof cues from a given image. Such spoof cues are detected in two ways: (a) optimizing the model to predict auxiliary signals such as depth maps or rPPG signals [34, 63, 65]; (b) interpreting spoofness from different perspectives: the method in [21] aims to disentangle the spoof noise, including color distortions and different types of artifacts, while spoof traces are interpreted in [35, 37] as multi-scale and physics-based traces.

Table 2. Summary of recent FAS models.

Multi-scale Feature Extractor. The majority of previous FAS methods adopt multi-scale features. We believe such a multi-scale structure assists in learning information at different frequency levels. This is also demonstrated in [37]: low-frequency traces (e.g., makeup strokes and specular highlights) and high-frequency content (e.g., Moiré patterns) are equally important for the success of FAS models.

As a result, we formalize the generic FAS model using two components: feature extractor f and spoof cue estimate (SCE) layers (or decoders) g. When f takes an input face image, denoted as \(\textbf{I}\), the output feature map at t-th layer of the feature extractor f is \(f_{t}(\textbf{I})\). The size of \(f_{t}(\textbf{I})\) is \(C_{t} \times H_{t} \times W_{t}\), where \(C_{t}\) is the channel number, and \(H_{t}\) and \(W_{t}\) are respectively the height and width of feature maps.
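To make this formulation concrete, below is a minimal sketch (in Python/TensorFlow, the framework the paper's implementation also uses) of a generic FAS model with a multi-scale feature extractor f and an SCE decoder g. The layer widths, depths, input resolution, and the single-channel cue output are illustrative assumptions, not the architecture of any specific FAS model cited above.

```python
import tensorflow as tf

def build_feature_extractor():
    """f: emits feature maps f_t(I) at several scales t."""
    image = tf.keras.Input(shape=(256, 256, 3), name="I")
    x, feats = image, []
    for channels in (64, 128, 196):                               # assumed C_t per scale
        x = tf.keras.layers.Conv2D(channels, 3, padding="same", activation="relu")(x)
        x = tf.keras.layers.MaxPool2D()(x)                        # halves H_t and W_t
        feats.append(x)                                           # f_t(I): H_t x W_t x C_t
    return tf.keras.Model(image, feats, name="f")

def build_sce_decoder(feat_shapes):
    """g: maps the multi-scale features to a spoof-cue map (e.g., a pseudo-depth map)."""
    inputs = [tf.keras.Input(shape=tuple(s[1:])) for s in feat_shapes]
    ups = [tf.keras.layers.UpSampling2D(2 ** i)(t) for i, t in enumerate(inputs)]
    x = tf.keras.layers.Concatenate()(ups)                        # fuse scales at a common resolution
    cue = tf.keras.layers.Conv2D(1, 3, padding="same", activation="sigmoid")(x)
    return tf.keras.Model(inputs, cue, name="g")

f = build_feature_extractor()
g = build_sce_decoder([o.shape for o in f.outputs])
```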

3.2 Spoof Region Estimator

Motivation. Apart from the importance of identifying spoof cues for FAS performance, we observe that the spoof trace also serves as a key reflection of how different models make the binary decision, namely, of different models' activations on the input image. In other words, although different models might unanimously classify the same image as spoof, they could in fact make decisions based on distinct spatial regions, as depicted in Fig. 6. Thus, we attempt to prevent divergence between the spoof regions estimated by the new model (i.e., \(f^{new}\)) and the source domain pre-trained model (i.e., \(f^{S}\)), enabling \(f^{new}\) to perceive spoof cues from the perspective of \(f^{S}\) and thereby combating the catastrophic forgetting issue. To this end, we propose a spoof region estimator (SRE) to localize spatial pixel positions that contain spoof artifacts or are covered by spoof materials.

Fig. 3.

The preliminary mask generation process: (a) the spoof image, (b) the live reconstruction, (c) and (d) the difference image in RGB and gray formats, and (e) the preliminary spoof mask.

Formulation. Let us formulate the spoof region estimation task. We denote the pixel collection of an image as \(D_{{\textbf {I}}} = \{{(x_1,y_1),(x_2,y_2),...,(x_n,y_n)}\}\). The proposed method aims to predict the region covered by the presentation attack, represented as a binary mask \(D_{\textit{pred}} = \{{(x_1,y'_1),(x_2,y'_2),...,(x_n,y'_n)}\}\), where \(x_{i}\), \(y_{i}\) and \(y'_{i}\) respectively denote the pixel, the ground truth pixel label, and the predicted label at the i-th pixel. The spoof region estimation task can thus be regarded as a pixel-level binary classification problem, namely whether a pixel is live or spoof, so \(y_{i} \in \{ \textit{o}^{Live}, \textit{o}^{Spoof}\}\). Note that \(i \in \{ 1,2,3...,n \}\) and n is the total number of pixels in the image.

Method. As depicted in Fig. 2, we insert an SRE module into the source pre-trained model, between the feature extractor \(f^{S}\) and the spoof cue estimate layers \(g^{S}\). The region estimator converts \(f^{S}({\textbf {I}})\) to a binary mask \({\textbf {M}}\) of size \(H_{t'} \times W_{t'}\). At the beginning of training, we create a preliminary mask to supervise SRE in generating the spoof region. The preliminary mask generation is based on the reconstruction method proposed in [37], as illustrated in Fig. 3. In particular, we denote the input spoof image as \({\textbf {I}}_{spoof}\) and use the method in [37] to reconstruct its live counterpart \(\hat{{\textbf {I}}}_{live}\). By subtracting \({\textbf {I}}_{spoof}\) from \(\hat{{\textbf {I}}}_{live}\) and taking the absolute value of the result, we obtain the difference image \({\textbf {I}}_{d}\), whose size is \(C_{0} \times H_{0} \times W_{0}\) where \(C_{0}\) is 3. We convert \({\textbf {I}}_{d}\) to a gray image \(\hat{{\textbf {I}}}_{d}\) by summing along its channel dimension, so \(\hat{{\textbf {I}}}_{d}\) has size \(C_{1} \times H_{0} \times W_{0}\) where \(C_{1}\) is 1. We assign each pixel value in the preliminary mask by applying a predefined threshold T,

$$\begin{aligned} p'_{ij} = {\left\{ \begin{array}{ll} 1, &{} p_{ij} > T \\ 0, &{} \text {otherwise}, \end{array}\right. } \end{aligned}$$
(1)

where pixels in \(\hat{{\textbf {I}}}_{d}\) and \({\textbf {I}}_{pre}\) are \(p_{ij}\) and \(p'_{ij}\) respectively.

Evidently, the supervisory signal \({\textbf {I}}_{pre}\) is not the ground truth. Inspired by [10], which shows that a model can generate the manipulation mask by itself during training, we only use \({\textbf {I}}_{pre}\) as supervision for the first few training epochs, and then steer the model itself to find the optimal spoof region by optimizing towards higher classification accuracy. More details are in Sect. 3.4.
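For concreteness, the following is a minimal NumPy sketch of the preliminary-mask generation in Eq. (1); the input range ([0, 1] floats) and the threshold value are hypothetical, and the live reconstruction \(\hat{{\textbf {I}}}_{live}\) is assumed to come from the method of [37].

```python
import numpy as np

def preliminary_mask(i_spoof: np.ndarray, i_live_hat: np.ndarray, T: float = 0.1) -> np.ndarray:
    """i_spoof, i_live_hat: H x W x 3 float arrays in [0, 1]; returns the H x W binary mask I_pre."""
    i_d = np.abs(i_spoof - i_live_hat)        # difference image I_d (C_0 = 3)
    i_d_gray = i_d.sum(axis=-1)               # gray image, summed over channels (C_1 = 1)
    return (i_d_gray > T).astype(np.float32)  # Eq. (1): 1 where the difference exceeds T
```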

Discussion. Firstly, we discuss the differences to prior spoof region estimation works. Previous methods [35, 37] use various traces to help reconstruct the live or spoof image, while our goal is to pinpoint the region with spoof artifacts, which serves as the pre-trained model's response to help the new model behave similarly to the pre-trained one(s), alleviating the forgetting issue. [10] offers low-resolution binary masks as the supervisory signal, whereas our self-generated \({\textbf {I}}_{pre}\) is only used to bootstrap the system. Also, [69] proposes an architecture for producing multiple masks, which is not practical in our scenario; thus our mask generation method is different from theirs. Finally, SRE can be a plug-in module for any given FAS model; details are in Sect. 5.4.

3.3 FAS-Wrapper Architecture

Motivation. We aim to deliver an update algorithm that can be effortlessly deployed to different FAS models. Thus, it is important to design a model-agnostic algorithm that allows the FAS model to remain intact, thereby maintaining the original FAS model's performance. Our FAS-wrapper operates in a model-agnostic way where only external expansions are made, largely preserving the original FAS model's ability.

As depicted in Fig. 2, we denote the source pre-trained feature extractor \(f^{S}\) as the source teacher, and the feature extractor after the fine-tuning procedure as the target teacher (\(f^{T}\)). Instead of using one single teacher model as in [20], we use \(f^{S}\) and \(f^{T}\) to regularize the training, offering more informative and instructive supervision for the newly upgraded model, denoted as \(f^{new}\). Lastly, unlike prior FAS works [20, 53, 61] which apply the indiscriminative loss on the final output embedding or logits from \(f^{S}\), we construct multi-scale discriminators that operate at the feature-map level to align the intermediate feature distributions of \(f^{new}\) with those of the teacher models (i.e., \(f^{T}\) and \(f^{S}\)). The motivations for the multi-scale discriminators are: (a) multi-scale features, as a common FAS model attribute (Sect. 3.1), should be considered; (b) adversarial learning can be applied at the feature-map level, which contains richer information than the final output logits.

Method. We construct two multi-scale discriminators, \(Dis^{S}\) and \(Dis^{T}\), for transferring semantic knowledge from \(f^{S}\) and \(f^{T}\) to \(f^{new}\) respectively, via an adversarial learning loss. Specifically, at the l-th scale, \(Dis_{l}^{S}\) and \(Dis_{l}^{T}\) take the previous discriminator output and the l-th scale feature generated by the feature extractors. We use \({\textbf{d}}^{S}_{l}\) and \({\textbf{d}}^{T}_{l}\) to represent the two discriminators' outputs at the l-th level when taking teacher-generated features (i.e., \(f^{S}_{l}({\textbf {I}})\) and \(f^{T}_{l}({\textbf {I}})\)), and \(\textbf{d}^{\prime S}_{l}\) and \(\textbf{d}^{\prime T}_{l}\) when taking the upgraded model's feature, \(f^{new}_{l}({\textbf {I}})\). Therefore, the first-level discriminator outputs are:

$$\begin{aligned} {\textbf {d}}^{S}_{1} = Dis^{S}_{1}\big (f^{S}_{1}({\textbf {I}})\big ), \quad \textbf{d}^{\prime S}_{1} = Dis^{S}_{1}\big (f^{new}_{1}({\textbf {I}})\big ), \end{aligned}$$
(2)
$$\begin{aligned} {\textbf {d}}^{T}_{1} = Dis^{T}_{1}\big (f^{T}_{1}({\textbf {I}})\big ), \quad \textbf{d}^{\prime T}_{1} = Dis^{T}_{1}\big (f^{new}_{1}({\textbf {I}})\big ), \end{aligned}$$
(3)

and the discriminators at the following levels take the l-th (\(l > 1\)) backbone layer's output feature and the previous-level discriminator output, so we have:

$$\begin{aligned} {\textbf {d}}^{S}_{l} = Dis^{S}_{l}\big (f^{S}_{l}({\textbf {I}}),\, {\textbf {d}}^{S}_{l-1}\big ), \quad \textbf{d}^{\prime S}_{l} = Dis^{S}_{l}\big (f^{new}_{l}({\textbf {I}}),\, \textbf{d}^{\prime S}_{l-1}\big ), \end{aligned}$$
(4)
$$\begin{aligned} {\textbf {d}}^{T}_{l} = Dis^{T}_{l}\big (f^{T}_{l}({\textbf {I}}),\, {\textbf {d}}^{T}_{l-1}\big ), \quad \textbf{d}^{\prime T}_{l} = Dis^{T}_{l}\big (f^{new}_{l}({\textbf {I}}),\, \textbf{d}^{\prime T}_{l-1}\big ), \end{aligned}$$
(5)

After obtaining the output from the last-level discriminator, we define \(\mathcal {L}_{D_{S}}\) and \(\mathcal {L}_{D_{T}}\) to train \(Dis^{S}\) and \(Dis^{T}\), and \(\mathcal {L}_{S}\) and \(\mathcal {L}_{T}\) to supervise \(f^{new}\).

(6)
(7)
(8)
(9)
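The snippet below is a hedged sketch of how the multi-scale discriminator chaining of Eqs. (2)-(5) and an adversarial objective in the spirit of Eqs. (6)-(9) could be realized. The binary cross-entropy form, the resize-and-concatenate chaining, and the helper names (`run_discriminator`, `adversarial_losses`) are our assumptions, not the authors' implementation.

```python
import tensorflow as tf

bce = tf.keras.losses.BinaryCrossentropy(from_logits=True)

def run_discriminator(dis_levels, feats):
    """Eqs. (2)-(5): level l takes the l-th scale feature and the previous level's output."""
    d = dis_levels[0](feats[0])                                    # first level, Eqs. (2)-(3)
    for dis_l, f_l in zip(dis_levels[1:], feats[1:]):              # levels l > 1, Eqs. (4)-(5)
        d_resized = tf.image.resize(d, tf.shape(f_l)[1:3])
        d = dis_l(tf.concat([f_l, d_resized], axis=-1))
    return d                                                       # last-level output

def adversarial_losses(dis_levels, teacher_feats, new_feats):
    """Assumed BCE-based adversarial objective: the discriminator separates teacher vs. f^new features."""
    d_teacher = run_discriminator(dis_levels, teacher_feats)       # e.g., d^S from f^S
    d_new = run_discriminator(dis_levels, new_feats)               # e.g., d'^S from f^new
    loss_dis = bce(tf.ones_like(d_teacher), d_teacher) + bce(tf.zeros_like(d_new), d_new)
    loss_gen = bce(tf.ones_like(d_new), d_new)                     # pushes f^new features toward the teacher's
    return loss_dis, loss_gen
```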

Discussion. The idea of adopting adversarial training on feature maps for knowledge transfer is similar to [9]. However, the method in [9] targets an online task and transfers knowledge from two models specialized in the same domain, whereas our case learns from heterogeneous models that specialize in different domains. Additionally, using two symmetric regularization terms based on the two pre-trained models is similar to [66] on knowledge distillation, a topic different from FAS; what is shared is the effect of alleviating the imbalance between the classification loss and the regularization terms, as reported in [24, 66].

3.4 Training and Inference

Our training procedure contains two stages, as depicted in Fig. 2. Firstly, we fine-tune the given source pre-trained FAS model with the proposed SRE on the target dataset. We optimize the model by minimizing the \(\ell _1\) distance (denoted as \(\mathcal {L}_{Mask}\)) between the predicted binary mask \(\textbf{M}\) and \({\textbf {I}}_{pre}\), together with the original loss \(\mathcal {L}_{Orig}\) used in the training procedure of the original FAS model. After the fine-tuning process, we obtain a well-trained SRE and a feature extractor (\(f^{T}\)) that works reasonably well on target domain data. Secondly, we integrate the well-trained SRE with the updated model (\(f^{new}\)) and the source pre-trained model (\(f^{S}\)), so that we can obtain estimated spoof cues from the perspectives of both models. We use \(\mathcal {L}_{Spoof}\) to prevent divergence between the spoof regions estimated from \(f^{new}\) and \(f^{S}\). Lastly, we use \(\mathcal {L}_{S}\) and \(\mathcal {L}_{T}\), as introduced in Sect. 3.3, to transfer knowledge from \(f^{S}\) and \(f^{T}\) to \(f^{new}\), respectively. Therefore, the overall objective function in training is denoted as \(\mathcal {L}_{total}\):

$$\begin{aligned} \mathcal {L}_{total} = \lambda _{1} \mathcal {L}_{Orig} + \lambda _{2} \mathcal {L}_{Spoof} + \lambda _{3} \mathcal {L}_{S} + \lambda _{4} \mathcal {L}_{T}, \end{aligned}$$
(10)

where \(\lambda _{1} {\text{- }} \lambda _{4}\) are the weights that balance the multiple terms. At inference, we only keep the new feature extractor \(f^{new}\) and SRE, as depicted in Fig. 2.
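The sketch below illustrates one FAS-wrapper update step corresponding to Eq. (10). It reuses the hypothetical `adversarial_losses` helper from the earlier sketch; `original_fas_loss`, the SRE call, and the \(\lambda\) values are stand-ins for the wrapped model's own loss and the paper's hyper-parameters (not stated here), and the discriminator updates are omitted for brevity.

```python
import tensorflow as tf

lambdas = dict(orig=1.0, spoof=1.0, s=0.1, t=0.1)   # illustrative weights, not the paper's values

def train_step(image, label, f_new, f_s, f_t, sre, dis_s, dis_t, optimizer, original_fas_loss):
    with tf.GradientTape() as tape:
        feats_new = f_new(image, training=True)                       # multi-scale features of f^new
        feats_s = f_s(image, training=False)                          # frozen source teacher f^S
        feats_t = f_t(image, training=False)                          # frozen target teacher f^T

        loss_orig = original_fas_loss(feats_new, label)               # L_Orig of the wrapped FAS model
        loss_spoof = tf.reduce_mean(tf.abs(sre(feats_new[-1]) - sre(feats_s[-1])))  # L_Spoof: l1 on SRE masks
        _, loss_s = adversarial_losses(dis_s, feats_s, feats_new)     # L_S (Sect. 3.3)
        _, loss_t = adversarial_losses(dis_t, feats_t, feats_new)     # L_T (Sect. 3.3)

        loss_total = (lambdas["orig"] * loss_orig + lambdas["spoof"] * loss_spoof
                      + lambdas["s"] * loss_s + lambdas["t"] * loss_t)  # Eq. (10)
    grads = tape.gradient(loss_total, f_new.trainable_variables)
    optimizer.apply_gradients(zip(grads, f_new.trainable_variables))
    return loss_total
```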

4 FASMD Dataset

We construct a new benchmark for MD-FAS, termed FASMD, based on SiW [34], SiW-Mv2 [36] and Oulu-NPU [7]. FASMD consists of five sub-datasets: dataset A is the source domain dataset, and B, C, D and E are four target domain datasets, which introduce an unseen spoof type, a new ethnicity distribution, a new age distribution and novel illumination, respectively. The statistics of the FASMD benchmark are reported in Table 3.

Fig. 4.

Representative examples in source and target domain for spoof and illumination protocols.

Fig. 5.

The distribution of ethnicity and age in source and target domain subsets.

Table 3. The FASMD benchmark. [Keys: eth.= ethnicity, illu.= illumination.]

New Spoof Type. As illustrated in Fig. 4, target domain dataset B contains novel spoof types that are excluded from the source domain dataset (A). The motivation for this design is that, compared with the print and replay attacks that are prevalent nowadays, other new spoof types are more likely to emerge and pose threats. Given that five macro spoof types are introduced in SiW-Mv2 (print, replay, 3D mask, makeup and partial manipulation attack), we select one micro spoof type from each of the three macro types other than print and replay to constitute dataset B: Mannequin mask, Cosmetic makeup and Funny eyes.

New Ethnicity Distribution. In reality, pre-trained FAS models may be deployed to organizations with a particular ethnicity distribution (e.g., an African American sports club). Therefore, we manually annotate the ethnicity of each subject in the three datasets, then devise the ethnicity protocol in which dataset A has only 1.1% African American samples, while this proportion increases to 52.3% in dataset C, as depicted in Fig. 5.

New Age Distribution. Likewise, a FAS model that is trained on source domain data dominated by college students may need to be deployed to a group with a different age distribution, such as a senior care center or a kindergarten. We estimate the age information with the off-the-shelf tool [48], and construct dataset D to have a large portion of subjects over 50 years old, as seen in Fig. 5.

New Illumination. The Oulu-NPU dataset has three different illumination sessions, and we use the method proposed in [71] to estimate the lighting condition of each sample in the SiW and SiW-Mv2 datasets. We then apply K-means [39] to cluster them into K groups, using the "elbow method" [57] to decide the value of K for the best clustering. We annotate the resulting illumination sessions as Dark, three Front Light, Side Light, and two Bright Light sessions (Fig. 4); dataset E then introduces the new illumination distribution.
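As a concrete illustration of this protocol-construction step, the following sketch clusters per-sample lighting descriptors with K-means and picks K via a simple elbow heuristic. The descriptor array `light_feats` is assumed to come from the lighting-estimation method of [71], and the elbow rule shown (largest second difference of the inertia) is one common heuristic, not necessarily the exact procedure used for FASMD.

```python
import numpy as np
from sklearn.cluster import KMeans

def choose_k_by_elbow(light_feats: np.ndarray, k_range=range(2, 10)) -> int:
    """light_feats: N x d lighting descriptors (one per image); returns the elbow K."""
    inertias = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(light_feats).inertia_
                for k in k_range]
    curvature = np.diff(inertias, 2)                 # discrete second difference of the inertia curve
    return list(k_range)[int(np.argmax(curvature)) + 1]

def cluster_illumination(light_feats: np.ndarray) -> np.ndarray:
    """Assigns each sample to one of K illumination clusters."""
    k = choose_k_by_elbow(light_feats)
    return KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(light_feats)
```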

5 Experimental Evaluations

5.1 Experiment Setup

We evaluate our proposed method on the FASMD dataset. In Sect. 5.3, we report FAS-wrapper performance with different FAS models; we choose PhySTD [35] as the FAS model for the analysis in Sect. 5.2, because PhySTD has demonstrated competitive empirical FAS results. Firstly, we compare to anti-forgetting methods (e.g., LwF [29], MAS [4] and LwM [12]). Specifically, based on the architecture of PhySTD, we concatenate the feature maps generated by the last convolution layers of different branches, then employ Global Average Pooling and fully connected (FC) layers to convert the concatenated features into a 2-dimensional vector. We fix the source pre-trained model weights and only train the added FC layers on the original FAS task, as a binary classifier. In this way, we can apply the methods in [4, 12, 29] to this binary classifier. For multi-domain learning methods (e.g., Serial and Parallel Res-Adapter [44, 45]), we choose a convolution filter with \(1\times 1\) kernel size as the adapter and incorporate it into PhySTD as described in the original works (see details in the supplementary material). We use standard FAS metrics to measure the performance: Attack Presentation Classification Error Rate (APCER), Bona Fide Presentation Classification Error Rate (BPCER), Average Classification Error Rate (ACER) [1], and the Receiver Operating Characteristic (ROC) curve.
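As a reference for the error rates named above, the sketch below computes APCER, BPCER and ACER from hard binary decisions, assuming label 1 denotes an attack (spoof) and 0 a bona fide face; this follows the standard definitions and is not taken from the paper's evaluation code.

```python
import numpy as np

def fas_error_rates(y_true: np.ndarray, y_pred: np.ndarray):
    """y_true, y_pred: arrays of {0, 1}; 1 = attack (spoof), 0 = bona fide."""
    attack, bona_fide = (y_true == 1), (y_true == 0)
    apcer = float(np.mean(y_pred[attack] == 0))      # attacks wrongly classified as bona fide
    bpcer = float(np.mean(y_pred[bona_fide] == 1))   # bona fide wrongly classified as attacks
    acer = (apcer + bpcer) / 2.0                     # average classification error rate
    return apcer, bpcer, acer
```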

Implementation Details. We use Tensorflow [2] for the implementation, and we run experiments on a single NVIDIA TITAN X GPU. In the source pre-training stage, we use a learning rate of 3e-4 with a decay rate of 0.99 per epoch, for 180 epochs in total. We set the mini-batch size to 8, where each mini-batch contains 4 live images and 4 spoof images (e.g., 2 SiW-Mv2 images and 1 image each from SiW and OULU-NPU). Secondly, we keep the same hyper-parameter setting as the pre-training stage and fine-tune the source domain pre-trained model with SRE at a learning rate of 1e-6. The overall FAS-wrapper is trained with a learning rate of 1e-7.

Table 4. The main performance reported in TPR@FPR=0.5%. Scores before and after “/” are performance on the source and target domains, respectively.

5.2 Main Results

Table 4 reports the detailed performance of different models on all four protocols. Overall, our method surpasses the previous best method on both source and target domain evaluation in all categories, with the only exception of the target domain performance in the illumination protocol (\(0.3\%\) worse than [12]). More importantly, regarding the source domain performance, our method surpasses the best previous method in all protocols by a large margin (\(5.3\%\), \(6.5\%\), \(0.8\%\) and \(8.6\%\) on the four protocols, and \(5.6\%\) on average). We believe that the proposed SRE largely alleviates the catastrophic forgetting mentioned above, thereby yielding superior source domain performance compared to prior works. However, the improvement diminishes on the new ethnicity protocol. One possible reason is that print and replay attacks account for a large portion of the data in the new ethnicity distribution, and different methods perform similarly on these two common presentation attacks.

Table 5. The average performance of the different methods on the four protocols. The scores before and after “/” are performance on the source and target domains, respectively.

Additionally, Table 5 reports the average performance on the four protocols in terms of APCER, BPCER and ACER. Our method still remains the best, except for BPCER on the target domain. It is worth mentioning that we achieve \(4.2\%\) APCER on source domain data and \(8.9\%\) ACER on target domain data, which are better than the best results from prior works, namely \(5.6\%\) APCER in [12] and \(11.0\%\) ACER in [45]. Furthermore, in Sect. 5.3, we examine the adaptability of our proposed method by incorporating it with different FAS methods.

Ablation Study on \(\mathcal {L}_{Spoof}\). SRE plays a key role in FAS-wrapper for learning the new spoof type: ablating \(\mathcal {L}_{Spoof}\) largely decreases the source domain performance, namely from \(79.4\%\) to \(72.7\%\) in TPR@FPR = \(0.5\%\) (Table 4) and by \(1.8\%\) in APCER (Table 5). Such a performance degradation supports our statement that \(\mathcal {L}_{Spoof}\) prevents divergence between the spoof traces estimated from the source teacher and the upgraded model, which helps to combat the catastrophic forgetting issue and maintain the source domain performance.

Ablation Study on \(\mathcal {L}_{S}\) and \(\mathcal {L}_{T}\). Without the adversarial learning losses (\(\mathcal {L}_{S}\) + \(\mathcal {L}_{T}\)), the model performance consistently decreases according to Table 4; although this impact is smaller than that of removing \(\mathcal {L}_{Spoof}\), it still causes \(2.0\%\) and \(1.3\%\) average performance drops on the source and target domains, respectively. Finally, the regularization term \(\mathcal {L}_{T}\) also contributes to performance: removing \(\mathcal {L}_{T}\) hinders the FAS performance (e.g., by \(1.0\%\) ACER on the target domain), as reported in Table 5.

Table 6. (a) The FAS-wrapper performance with different FAS models; (b) Performance of adopting different architecture design choices.
Fig. 6.

Spoof regions estimated from different models. Given the input image (a), (b) and (c) are the model responses from [35] and [34], respectively. Detailed analyses are in Sect. 5.3.

5.3 Adaptability Analysis

We apply FAS-wrapper to three different FAS methods: Auxi.-CNN [34], CDCN [65] and PhySTD [35]. CDCN uses a special convolution (i.e., Central Difference Convolution), and Auxi.-CNN is the flagship work that learns FAS via auxiliary supervision. As shown in Table 6, FAS-wrapper consistently improves the performance over naive fine-tuning. When ablating \(\mathcal {L}_{Spoof}\), PhySTD [35] experiences a large performance drop (\(6.7\%\)) on the source domain, indicating the importance of SRE in learning the new domain. Likewise, removing the adversarial learning losses (i.e., \(\mathcal {L}_{T} + \mathcal {L}_{S}\)) makes it difficult to preserve the source domain performance: on the source domain, CDCN [65] drops by \(3.2\%\) and Auxi.-CNN [34] by \(4.7\%\). This indicates that the dual teacher models in FAS-wrapper, trained with adversarial learning, benefit the overall FAS performance. Also, we visualize the spoof regions generated by SRE with [34, 35] in Fig. 6. We can see the spoof cues are different, which supports our hypothesis that, although FAS models make the same final binary prediction, they internally identify spoofness in different areas.

5.4 Algorithm Analysis

Spoof Region Visualization. We feed output features from different models (i.e., \(f^{S}\), \(f^{T}\) and \(f^{new}\)) to a well-trained SRE to generate the spoof regions, as depicted in Fig. 7. In general, \(f^{S}\) produces more accurately activated spoof regions on source domain images. For example, in the two source images of the new spoof category, makeup spoofness is detected on the eyebrows and mouth (first row), and the activation on the funny eye region is more intensive (second row). \(f^{T}\) estimates better spoof cues on target domain images. For example, in the two target images of the new spoof category, the spoofness estimated by \(f^{T}\) is stronger and more comprehensive, and in the novel ethnicity category, the spoofness covers a larger region. With \(\mathcal {L}_{Spoof}\), the updated model (\(f^{new}\)) identifies the spoof traces more accurately.

Fig. 7.

Given the input spoof image (a), spoof regions generated by SRE with the two teacher models (i.e., \(f^{S}\) and \(f^{T}\)) are shown in (b) and (c), and with the newly upgraded model (\(f^{new}\)) in (d), for different protocols. Detailed analyses are in Sect. 5.4.

Fig. 8.

Different spoof estimation methods. Given the input image (a), (b) and (c) are the spoof regions estimated by our method and [10], respectively. (d) and (e) are the activation maps from the methods in [52, 70].

Explainability. We compare SRE with the work that generates binary masks indicating spoofness [10], and with works that explain how a model makes a binary classification decision [52, 70]. In Fig. 8, we can observe that our generated spoof traces better capture the manipulation area, regardless of spoof type. For example, in the first print attack image, the entire face is captured as spoof by our method, while the other three methods fail to do so. Also, our binary mask is more detailed and of higher resolution than that of [10], and more accurate and robust than [52, 70]. Notably, we do not include the works in [35, 37, 69], which use many outputs to identify spoof cues.

Architecture Design. We compare with other architecture design choices, such as sharing weights among all multi-scale discriminators, concatenating features of different scales, and using one single discriminator. Moreover, we use the correlation similarity table in [60] instead of multi-scale discriminators for transferring knowledge from \(f^{S}\) and \(f^{T}\) to \(f^{new}\). Table 6 demonstrates the superiority of our architectural design.

Fig. 9.

We adapt FAS-wrapper for the cross-dataset scenario.

5.5 Cross-Dataset Study

We evaluate our method in the cross-dataset scenario and compare with SSDG [20] and MADDG [53]. Specifically, we denote OULU-NPU [7] as O, SiW [34] as S, SiW-Mv2 [36] as M, and HKBU-MARs [32] as H. We use three datasets as source domains for training and the remaining dataset for testing. We train three individual source domain teacher models on the three source datasets, respectively. Then, as depicted in Fig. 9, inside FAS-wrapper, three multi-scale discriminators are employed to transfer knowledge from the three teacher models to the updated model \(f^{new}\), which is then evaluated on the target domain. Notably, we remove the proposed SRE in this cross-dataset scenario, as there is no need to preserve the prior model's responses.

The results are reported in Table 7, indicating that our FAS-wrapper also exhibits performance comparable to prior works in the cross-dataset scenario.

Table 7. The cross-dataset comparison.

6 Conclusion

We study multi-domain learning face anti-spoofing (MD-FAS), which requires the model to perform well on both the source and novel target domains after updating the source domain pre-trained FAS model with only target domain data. We first summarize the general form of FAS models, based on which we develop a new architecture, FAS-wrapper. FAS-wrapper contains the spoof region estimator, which identifies the spoof traces that help combat catastrophic forgetting while learning new domain knowledge, and it exhibits a high level of flexibility, as it can be adopted by different FAS models. The performance is evaluated on our newly-constructed FASMD benchmark, which is also the first MD-FAS dataset in the community.