1 Introduction

Fig. 1

Left: The original marginal distributions of the samples from three source domains and one target domain. Middle: Conventional multi-source DA methods aim to align the marginal distributions of source and target domain samples to learn a common feature space, which may fail to get a discriminative class boundary. Right: Our proposed multi-source DA method aims to align the joint distributions of sample features and their corresponding labels between source and target domains, which has the potential to learn a more discriminative class boundary

In recent years, face recognition (FR) techniques have been used in various identity authentication scenarios. However, existing FR systems are vulnerable to spoofing attacks such as printed photos, video replay, 3D facial masks, and adversarial attacks (Yu et al., 2022). To secure FR systems from various physical attacks, both industry and academia have paid increasing attention to face anti-spoofing (FAS). In the past two decades, various FAS methods have been proposed, including both traditional methods and deep learning-based methods (Yu et al., 2022). Traditional methods based on handcrafted descriptors (Komulainen et al., 2013; Patel et al., 2016) can be further classified into texture-based, motion-based, and image analysis-based methods. Subsequently, hybrid (handcrafted + deep learning) (Rehman et al., 2020; Khammari, 2019) and end-to-end deep learning-based methods (Liu et al., 2018; Yu et al., 2020; Zhang et al., 2020) have also been proposed.

However, the performance of most FAS methods drops significantly in cross-scenario settings due to variations in lighting, facial appearance, or camera quality. In view of this, most existing solutions (Liu et al., 2022; Wang et al., 2020; Jia et al., 2020; Wang et al., 2023, 2022, 2021; Chen et al., 2021; Jiang et al., 2023) focus on improving the cross-scenario capability of deep FAS models via multi-source domain generalization (DG), which assumes that there exists a generalized feature space shared by the given source domains and the unseen target domain. By adapting multiple source datasets to learn a common feature space, a model trained on the source domains can generalize well to the unseen target domain. In practice, however, a large number of unlabeled facial images is available from existing FR systems, so domain adaptation (DA) forms a natural learning framework for FAS. The DA approach aids cross-scenario FAS by extracting discriminative feature representations from labeled source data and unlabeled target data. It can thus exploit the rich information in the unlabeled target domain and obtain a more robust decision boundary.

In most DA methods, the distributions of source and target features are matched in a learned feature space, by using Maximum Mean Discrepancy (MMD) (Pei et al., 2018; Rahman et al., 2020), Correlation Alignment (CORAL) (Baochen et al., 2016), or Kullback–Leibler (KL) divergence (Zhuang et al., 2015). Another direction is based on adversarial training (Tzeng et al., 2017), where a discriminator (domain classifier) is trained to distinguish between the source and target representations. However, besides the large inter-class differences within each domain, intra-class differences are also obvious, and samples with different labels from different domains may lie closer to each other than samples with the same label from different domains, as illustrated in the left of Fig. 1. Consequently, fitting only the feature distributions of the source and target domains may, for instance, align the features of real samples from the source domains with fake samples from the target domain, as illustrated in the middle of Fig. 1, which is not conducive to classification. Therefore, unlike existing DA-based FAS methods, which attempt to align the marginal distributions in the feature space between the source and target domains, in this paper we consider the discrepancy in the joint distributions of features and labels of source and target data. In this way, the samples in the source and target domains are aligned based on both features and labels, so that samples with different labels from the same domain are separated, while samples with the same label from different domains are aggregated, as illustrated in the right of Fig. 1.

The main idea of this paper is to find optimal transportation mappings between the product spaces (of features and labels) of each source domain and the product space (of features and pseudo-labels) of the target domain. We first compute the cost matrices based on the joint distributions of each source domain and the target domain, and then compute the optimal transportation mappings while reducing the discrepancy between the joint distributions. The distribution inconsistencies are measured by Wasserstein distances (Cuturi et al., 2014). After obtaining the optimal transportation mappings, we learn a convex combination of the joint distributions of the source domains, which allows us to distribute the masses based on the similarities of the sources with the target in both the feature and pseudo-label spaces. Domain weights are updated together with the parameters of the feature extractor and classifier by training on the weighted transportation losses between each source domain and the target domain, the weighted source domain classification losses, and the target domain entropy loss. Here, the target domain entropy loss adaptively adjusts the parameters of the feature extractor and classifier to further fit the distribution of the target domain. Our idea of aligning the joint distributions is reflected in the definition of the cost matrix and acts on the transportation mapping. The reduction of domain discrepancy in our method comes from seeking a domain-invariant product space of features and labels, rather than a domain-invariant feature space. In fact, single-source joint distribution optimal transport is a degenerate case of its multi-source counterpart, in which the weight of the unique source domain always equals 1 and does not need to be trained along with the feature extractor and classifier. Our main contributions in this work can be summarized as follows:

  • Facing the cross-scenario FAS problem, we propose to reduce the discrepancy of domain distributions by aligning the joint distributions of both sample features and labels (or pseudo-labels) of the source and target domains in a common product space, which is largely different from existing methods.

  • To solve the multi-source DA-based FAS, we propose to utilize the Wasserstein distance to measure the distances between the joint distributions, and assign adaptively updated weights to each source domain based on the Wasserstein distances so as to take into account the contributions of different source domains to the target domain.

  • Extensive experimental results on four widely used 2D attack datasets and three recently published 3D attack datasets under both single- and multi-source domain adaptation settings (including both close-set and open-set) show the advantages of our proposed method for cross-scenario FAS. Our method achieves state-of-the-art results in all three protocols under the single-source setting; under the multi-source setting, it achieves state-of-the-art results in two of the three protocols and the second-best performance in the remaining 2D\(\rightarrow \)2D protocol.

2 Related Works

In this section, we will first introduce the DA-based methods for FAS. After that, the focus will be on reviewing the optimal transport-based DA methods and multi-source DA methods that are most relevant to our work.

2.1 Domain Adaptation for Face Anti-spoofing

The basic idea of the DA technique is to mitigate the distribution discrepancy between the source and target domains so that a model trained with the labeled source data can be well adapted to the unlabeled target data. Initially, a maximum mean discrepancy (MMD) based metric learning method was proposed for FAS to align the distributions of source features and target features (Li et al., 2018). Other major developments have focused on adversarial loss functions that prevent CNNs from distinguishing whether a sample comes from the source or the target domain (Wang et al., 2019, 2021; Jia et al., 2021). Specifically, Wang et al. (2021) proposed ML-Net, which combines center loss and triplet loss to jointly learn a feature representation for source data, and then adapted this representation to the target domain via UDA-Net and DR-Net. Jia et al. (2021) designed a marginal distribution alignment module (MDA) for domain-invariant feature learning and a conditional distribution alignment module (CDA) for centroid alignment of labeled features. In addition, Zhou et al. (2022) reformulated unsupervised DA-based FAS as a domain stylization problem: the target data is stylized with the source domain style through image translation to directly fit the target data to the source model. Yue et al. (2022) presented a cyclically disentangled feature translation network and proposed to generate pseudo-labeled images to train a generalizable classifier. Li et al. (2022) proposed a teacher-student framework to improve the cross-domain performance of FAS through single-class DA. Overall, most of these methods require a multi-stage training process, and all of them only consider aligning the marginal distributions, ignoring the role of source labels.

As we know, CASIA-FASD (Zhang et al., 2012), Idiap Replay-Attack (Chingovska et al., 2012), MSU-MFSD (Wen et al., 2015), and OULU-NPU (Boulkenafet et al., 2017) datasets have been widely used to study the DA-based FAS. However, these datasets are limited in data scale and attack types (print and replay) and recorded in controlled indoor scenarios. Recently, many new FAS datasets have been released and there are three major trends in the development of datasets: (1) large-scale data amount, (2) increasing number of novel attack types and complex recording conditions, and (3) multiple modalities. For example, CASIA-SURF 3DMask (Yu et al., 2020) is the first FAS dataset considering outdoor scenes with challenging lighting and it includes three mask decorations (i.e., masks with/without hair and glasses) recorded under six environmental conditions. CASIA-SURF HiFiMask (Liu et al., 2022) dataset contains more than 50,000 videos and it includes 3D mask attacks with three kinds of materials (transparent, plaster, and resin) recorded under six lighting conditions and six indoor/outdoor scenes. And Surveillance High-Fidelity Mask (Fang et al., 2024) dataset is captured under 40 surveillance scenes, and it has 232 3D attacks (high-fidelity masks), 200 2D attacks (posters, portraits, and screens), and 2 adversarial attacks. Besides, CASIA-SURF (Zhang et al., 2019) and CASIA-SURF Cross-ethnicity Face Anti-spoofing (CeFA) (Liu et al., 2021) datasets contain 3 modalities, i.e., RGB, Depth and IR.

Yu et al. (2020) proposed a Neural Architecture Search (NAS)-based approach for FAS. They presented Domain/Type-aware Meta-NAS, which leverages cross-domain/type knowledge for robust searching to improve the transferability of NAS across datasets and unknown attack types. Liu et al. (2022) proposed a training method for supervised FAS tasks, i.e., a contrastive context-aware learning framework, which accurately utilizes the rich context information (e.g., subjects, mask material, and illumination) between live faces and high-fidelity mask attack pairs. Fang et al. (2024) proposed a Contrastive Quality-Invariance Learning network to mitigate the performance degradation of FAS methods caused by low-quality images in surveillance scenarios. These works perform well in single-dataset scenarios, but they have weak generalization ability and cannot effectively solve DA-based FAS with 3D attacks. In this paper, we will study a DA-based FAS method dealing with both 2D and 3D attacks and generalize it to open-set DA, in which the target domain contains new types of attacks that differ from the source domains.

2.2 Optimal Transport Based Domain Adaptation

The optimal transport problem was first introduced by the French mathematician Gaspard Monge at the end of the 18th century as a way to find a minimal-effort solution for transporting a given mass of dirt into a given hole. Kantorovich (2006) extended the Monge problem from transport mappings to transportation plans. Later, new computational strategies were proposed that made it possible to apply optimal transport to the DA problem (Courty et al., 2016, 2017; Damodaran et al., 2018). The core of optimal transport theory applied to DA lies in learning the transformation between domains. In particular, Courty et al. (2016) proposed a regularized unsupervised optimal transport model to align the feature representations of source and target domains. They proposed two regularization schemes to encode the class structure of the source domain while estimating the transportation plan, thus reinforcing the intuition that samples of the same class must undergo similar transformations. Subsequently, Courty et al. (2017) proposed to minimize the optimal transportation loss between the joint distribution of the source domain and the estimated joint distribution of the target domain. Later, this method was extended to deep learning frameworks (Damodaran et al., 2018), where the feature embedding is estimated simultaneously with the classifier by using an efficient stochastic optimization procedure. An important aspect of joint distribution optimal transport is that the optimization problem involves the joint distribution of both feature embeddings and sample labels, and the simultaneous use of feature and label information is the basis of most generalization bounds (Courty et al., 2017).

Fig. 2

An overview of the proposed weighted joint distribution optimal transport method for multi-source DA-based cross-scenario FAS (WJDOT-FAS). The training phase of this method consists of three modules: joint distribution estimation, joint distribution optimal transport, and domain weight optimization. The joint distributions are determined by the feature extractor \(g_{\theta _1}\) and the classifier \(f_{\theta _2}\). Transportation mappings \(\varvec{\gamma }_{s_k\text{- }t}\), and domain weights \(w_k\) are alternately updated by aligning the joint distributions of each source domain and the target domain. Once the parameters of \(g_{\theta _1}\) and \(f_{\theta _2}\) have been well trained, they are used to predict the sample labels of the target domain in the testing phase

2.3 Multi-source Domain Adaptation

For the multi-source DA problem, Yishay et al. (2009) pointed out that learning a weighted combination of multiple source distributions can generalize better to the target domain under certain theoretical guarantees. Judy et al. (2018) proposed an algorithm for distribution-weighted combinatorial solutions based on square loss and cross-entropy loss to solve the multi-source DA problem. Recently, many deep networks have been designed specifically for the multi-source DA problem. Peng et al. (2019) proposed a multi-source domain adaptive moment matching network (M3SDA), which aims to transfer knowledge learned from multiple labeled source domains to an unlabeled target domain by dynamically aligning the moments of feature distributions. Zhao et al. (2018) proposed the Multi-Source Domain Adversarial Network (MDAN), which approaches the DA problem by optimizing task-adaptive generalization bounds. Wen et al. (2020) pointed out that, in order to achieve the optimal generalization upper bound for the target domain, a trade-off is needed between including all source domains to increase the number of valid samples and excluding less relevant domains to avoid negative transfer. Based on this theory, they proposed a domain aggregation network (DARN), which dynamically adjusts the weights of each source domain during end-to-end training. Xu et al. (2018) proposed a deep cocktail network (DCTN) to solve the problem of domain and category shift between multiple sources. Kang et al. (2020) proposed the Contrastive Adaptation Network (CAN), which optimizes a new metric, the contrastive domain discrepancy, explicitly modeling intra-class and inter-class domain discrepancies. Besides, they utilized weights derived from the inverse classification loss of intra-domain samples as the domain weights for network updates. Li et al. (2021) proposed a multi-source contribution learning network (MSCLDA) that considers source contributions when predicting a target task. This method simultaneously learns the similarity and diversity of domains by extracting multi-view features and uses an MMD-based metric as the domain weights. Zhao et al. (2020) proposed a multi-source distillation network (MDDA), which not only considers the different distances between multiple sources and the target but also investigates the different similarities between source samples and target samples; a metric based on the optimal transport distance is used as the domain weights. Most of these multi-source DA methods measure the discrepancy between source and target distributions based on feature distributions alone, and they are not capable of adaptively adjusting the source domain weights. In contrast to the above methods, Turrisi et al. (2022) exploited the diversity of source distributions by adjusting the weights of different source joint distributions to fit the target task, aiming to simultaneously find the optimal transport-based alignment between the source and target joint distributions and the reweighting of the source distributions based on the transportation loss. Inspired by Turrisi et al. (2022), this paper adopts the idea of joint distribution optimal transport to solve single-source and multi-source DA-based cross-scenario FAS. To the best of our knowledge, this is the first work that uses weighted joint distribution optimal transport to solve cross-scenario FAS.

3 Proposed Method

In this paper, we propose a weighted joint distribution optimal transport method for multi-source DA-based FAS (WJDOT-FAS). As shown in Fig. 2, the training phase of the proposed method consists of three modules, namely joint distribution estimation, joint distribution optimal transport, and domain weight optimization. In particular, given the labeled facial samples from K source domains and the unlabeled samples from the target domain, we first estimate the joint distributions of sample features and labels (or pseudo-labels) for each domain by using a pre-trained feature extractor and a randomly initialized classifier. Then, the cost matrices between the joint distributions of each source domain and the target domain are computed by using a weighted distance metric over both the feature space and the label space. Once the cost matrices are estimated, we can compute the optimal transportation mappings between the joint distributions of each source domain and the target domain by solving Lp-L1 optimal transport problems. These optimal transportation mappings map the joint distributions of each source domain and the target domain into a new common space, in which their domain discrepancies can be well aligned. Considering that different source domains contribute differently to the target domain, a domain weight is defined for each source domain; these weights are obtained by solving a convex optimization problem involving the classification losses of the source domains, the entropy loss of the target domain, and the optimal transportation losses from each source domain to the target domain. Meanwhile, the parameters of the feature extractor and the classifier are also updated, and the learnable parameters and the computations of the three modules are updated alternately. Once the feature extractor and classifier have been well trained, they are used to predict the sample labels of the target domain in the testing phase. More details of the proposed method are introduced in the following paragraphs.

3.1 Joint Distribution Estimation

The aim of the joint distribution estimation module is to estimate the joint distributions of each source domain and the target domain. The joint distribution is defined in the product space of the sample feature space and the sample label space. We are given the labeled source data \({\mathcal {D}}_{s_k}=\big \{{\varvec{x}}_{i_k}^{s_k},{\varvec{y}}_{i_k}^{s_k}\big \}_{i_k=1}^{n_{s_k}}\) (\(k=1,\ldots ,K\), where K is the number of source domains) and the unlabeled target data \({\mathcal {D}}_{t}=\big \{{\varvec{x}}_j^{t}\big \}_{j=1}^{n_t}\), where \(n_{s_k}\) and \(n_{t}\) denote the sample numbers of the k-th source domain and the target domain. Our joint distribution estimation module is composed of two parts: a feature extraction function (\(g:\varvec{{\mathcal {X}}}\rightarrow \varvec{{\mathcal {Z}}}\subseteq {\mathbb {R}}^d\)), which maps the given facial samples from both the source domains and the target domain into the feature space, and a classifier (\(f:\varvec{{\mathcal {Z}}}\rightarrow \varvec{{\mathcal {Y}}}\subseteq {\mathbb {R}}^2\)), which maps the sample features into the label space. The sample features of the k-th source domain and the target domain are denoted as \(\big \{{\varvec{z}}_{i_k}^{s_k}\big \}_{i_k=1}^{n_{s_k}}\), i.e. \(\big \{g\big ({\varvec{x}}_{i_k}^{s_k}\big )\big \}_{i_k=1}^{n_{s_k}}\), and \(\big \{{\varvec{z}}_j^t\big \}_{j=1}^{n_t}\), i.e. \(\big \{g\big ({\varvec{x}}_j^t\big )\big \}_{j=1}^{n_t}\), respectively. Let \(\mu _{s_k}\) and \(\mu _{t}\) denote the marginal feature distributions of the k-th source domain and the target domain. Since the facial samples are in discrete form, we consider the empirical versions of \(\mu _{s_k}\) and \(\mu _{t}\), defined as follows:

$$\begin{aligned} {\hat{\mu }}_{s_k}=\frac{1}{n_{s_k}}\sum \limits _{i_k}\delta _{{\varvec{z}}_{i_k}^{s_k}}, \end{aligned}$$
(1)
$$\begin{aligned} {\hat{\mu }}_{t}=\frac{1}{n_{t}}\sum \limits _{j}\delta _{{\varvec{z}}_{j}^{t}}, \end{aligned}$$
(2)

where \(\delta _{{\varvec{z}}_{i_k}^{s_k}}\) and \(\delta _{{\varvec{z}}_{j}^{t}}\) are the Dirac functions at points \({\varvec{z}}_{i_k}^{s_k}\in {\mathbb {R}}^{d}\) and \({\varvec{z}}_{j}^{t}\in {\mathbb {R}}^{d}\) respectively.

Following the above notations, we assume there exist two distinct joint probability distributions \({\mathcal {P}}_{s_k}=({\varvec{z}}^{s_k},{\varvec{y}}^{s_k})_{{\varvec{z}}^{s_k}\sim \mu _{s_k}}\) and \({\mathcal {P}}_{t}=\big ({\varvec{z}}^{t},f({\varvec{z}}^{t})\big )_{{\varvec{z}}^{t}\sim \mu _{t}}\), whose empirical versions can be defined as follows:

$$\begin{aligned} \hat{{\mathcal {P}}}_{s_k}=\frac{1}{n_{s_k}}\sum \limits _{i_k}\delta _{{\varvec{z}}_{i_k}^{s_k},{\varvec{y}}^{s_k}_{i_k}}, \end{aligned}$$
(3)
$$\begin{aligned} \hat{{\mathcal {P}}}_{t}=\frac{1}{n_{t}}\sum \limits _{j}\delta _{{\varvec{z}}_{j}^{t},f\big ({\varvec{z}}_{j}^{t}\big )}, \end{aligned}$$
(4)

where \(\delta _{{\varvec{z}}_{i_k}^{s_k}, {\varvec{y}}^{s_k}_{i_k}}\) and \(\delta _{{\varvec{z}}_{j}^{t},f({\varvec{z}}_{j}^{t})}\) are the Dirac functions at points \(\big ({\varvec{z}}_{i_k}^{s_k},{\varvec{y}}^{s_k}_{i_k}\big )\in {\mathbb {R}}^{d+2}\) and \(\big ({\varvec{z}}_{j}^{t},f\big ({\varvec{z}}_{j}^{t}\big )\big )\in {\mathbb {R}}^{d+2}\) respectively.

In particular, we use a pre-trained ResNet-18 CNN (or Transformer) backbone to extract the deep features of the given facial samples and use a randomly initialized classifier to compute pseudo-labels. The joint distributions of the source samples are estimated from the features produced by the feature extractor and the true labels, while the joint distribution of the target samples is estimated from the extracted features and the pseudo-labels computed by the classifier.
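
To make this module concrete, the following minimal sketch (assuming PyTorch and torchvision; the names `feature_extractor`, `classifier`, and `estimate_joint` are illustrative, not the exact implementation) builds g from an ImageNet-pretrained ResNet-18 and pairs features with labels or pseudo-labels:

```python
import torch
import torch.nn as nn
from torchvision.models import resnet18

# g: X -> Z (512-d features); drop the final FC layer of ResNet-18
feature_extractor = nn.Sequential(
    *list(resnet18(weights="IMAGENET1K_V1").children())[:-1], nn.Flatten()
)
# f: Z -> Y (2 classes, real/fake), randomly initialized
classifier = nn.Linear(512, 2)

def estimate_joint(x, y_onehot=None):
    """Return the support of an empirical joint distribution (Eqs. 3-4).

    Source batch: pair features with one-hot true labels.
    Target batch (y_onehot=None): pair features with classifier pseudo-labels.
    Each pair carries uniform mass 1/n.
    """
    z = feature_extractor(x)
    if y_onehot is None:
        y_onehot = torch.softmax(classifier(z), dim=1)  # soft pseudo-labels f(z)
    return z, y_onehot
```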

3.2 Joint Distribution Optimal Transport

3.2.1 Cost Matrix Computation

Optimal transport (OT) (Cédric et al., 2008) is an efficient way of transforming one distribution into another under a given cost function, and it can be used to compute the Wasserstein distance between probability distributions. Formally, OT searches for a transportation mapping \(\varvec{\gamma } \in \varvec{\Pi }(\mathcal {{\hat{P}}}_{s},\mathcal {\hat{P}}_{t})\) between two distributions \(\mathcal {\hat{P}}_{s}\) and \(\mathcal {\hat{P}}_{t}\) that yields a minimal displacement cost. In a discrete setting (both distributions are empirical), the Wasserstein distance between \(\mathcal {\hat{P}}_{s}\) and \(\mathcal {\hat{P}}_{t}\) calculated by the OT method can be expressed in the following form:

$$\begin{aligned} W(\mathcal {\hat{P}}_{s},\mathcal {\hat{P}}_{t})=\min \limits _{\varvec{\gamma }\in \varvec{\Pi }(\mathcal {\hat{P}}_{s},\mathcal {\hat{P}}_{t})}{\langle \varvec{\gamma },{\varvec{C}}\rangle _F}. \end{aligned}$$
(5)

Here, \(\langle \cdot ,\cdot \rangle _F\) is the Frobenius inner product, and \({\varvec{C}}\in {\mathbb {R}}^{n_{s}\times n_t}\) is the cost matrix representing the pairwise costs between the joint distributions of the source domain samples and the target domain samples. \(\varvec{\Pi }(\mathcal {\hat{P}}_{s},\mathcal {\hat{P}}_{t})\) denotes the space of joint probability couplings of the source and target distributions, and \(\varvec{\gamma }\) is the transportation mapping, a matrix of size \(n_{s}\times n_t\).
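
As a toy illustration of Eq. (5), the snippet below computes the discrete Wasserstein cost between two small empirical distributions with uniform masses; it assumes the open-source POT library (`pip install pot`), which is not necessarily what was used in the paper:

```python
import numpy as np
import ot  # POT: Python Optimal Transport

ns, nt = 4, 5
xs = np.random.randn(ns, 2)    # support of the source distribution
xt = np.random.randn(nt, 2)    # support of the target distribution
a = np.full(ns, 1.0 / ns)      # uniform source masses (Dirac weights)
b = np.full(nt, 1.0 / nt)      # uniform target masses
C = ot.dist(xs, xt)            # pairwise squared Euclidean costs
W = ot.emd2(a, b, C)           # min over Pi(a, b) of <gamma, C>_F
```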

Joint distribution optimal transport is applied to our method, which is reflected in the definition of the cost matrix \({\varvec{C}}\). The underlying idea is to align the joint distributions of features and labels from source and target domains instead of only considering the marginal distributions of features. Next, we will illustrate how to calculate \({\varvec{C}}\) under joint distributions in the case where only one source domain is available. The cost matrix \({\varvec{C}}\) associated with the product space of features and labels can be expressed as the gap between the joint distributions of the source and target domains, that is:

$$\begin{aligned} {\varvec{C}}\triangleq d(\hat{{\mathcal {P}}}_{s}, \hat{{\mathcal {P}}}_{t}). \end{aligned}$$
(6)

Specifically, the element of the i-th row and j-th column in \({\varvec{C}}\) can be expressed as a joint cost measure of costs in the feature and label spaces of the i-th source sample and j-th target sample, combining both the gap between sample features and the discrepancy between sample labels (pseudo labels for the target domain). According to Damodaran et al. (2018), the specific form of \({\varvec{C}}_{ij}\) is defined as follows:

$$\begin{aligned} {\varvec{C}}_{ij}&\triangleq c\big (g\big ({\varvec{x}}_i^s\big ),{\varvec{y}}_i^s;g\big ({\varvec{x}}_j^t\big ),f\big (g\big ({\varvec{x}}_j^t\big )\big )\big )\\&=\parallel g\big ({\varvec{x}}_i^s\big )-g\big ({\varvec{x}}_j^t\big )\parallel ^2+\beta {\mathcal {L}}_{CE}\big ({\varvec{y}}_i^s,f\big (g\big ({\varvec{x}}_j^t\big )\big )\big ), \end{aligned}$$
(7)

where \(\parallel g\big ({\varvec{x}}_i^s\big )-g\big ({\varvec{x}}_j^t\big )\parallel ^2\) compares the compatibility of the features of the source and target samples and is an \(l_2^2\) distance, while \({\mathcal {L}}_{CE}\big ({\varvec{y}}_i^s,f\big (g\big ({\varvec{x}}_j^t\big )\big )\big )\) is a cross-entropy loss that considers the gap between the true label of the i-th source sample and the pseudo-label of the j-th target sample. The parameter \(\beta \) is a scalar weighing the strength of the label cost relative to the feature cost. The definition of \({\varvec{C}}_{ij}\) in Eq. (7) guarantees that our optimal transport is defined under the joint distribution setting. If we only consider aligning the marginal distributions of source and target domain features, then \({\varvec{C}}_{ij}=\parallel g\big ({\varvec{x}}_i^s\big )-g\big ({\varvec{x}}_j^t\big )\parallel ^2\), i.e. the basic form of the cost matrix in OT.
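
A sketch of Eq. (7) in code, reusing the illustrative `classifier` from the sketch in Sect. 3.1 (the one-hot source labels `ys_onehot` and the weight `beta` are assumptions of this example):

```python
import torch

def joint_cost_matrix(zs, ys_onehot, zt, beta):
    """Pairwise joint cost of Eq. (7) between a source and a target batch."""
    # feature term: squared l2 distance between every source/target feature pair
    feat_cost = torch.cdist(zs, zt, p=2) ** 2              # (ns, nt)
    # label term: cross-entropy between source one-hot labels
    # and target pseudo-labels
    log_pt = torch.log_softmax(classifier(zt), dim=1)      # (nt, 2)
    label_cost = -(ys_onehot @ log_pt.T)                   # (ns, nt)
    return feat_cost + beta * label_cost
```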

3.2.2 Transportation Mapping Computation

In this section, we will introduce how to compute the transportation mapping \(\varvec{\gamma }\), considering the case of single-source DA. As shown in Eq. (5), OT searches a transportation mapping \(\varvec{\gamma } \in \varvec{\Pi }(\mathcal {\hat{P}}_s,\mathcal {\hat{P}}_t)\) between two distributions \(\mathcal {\hat{P}}_{s}\) and \(\mathcal {\hat{P}}_{t}\), where \(\varvec{\Pi }(\mathcal {\hat{P}}_{s},\mathcal {\hat{P}}_{t})\) can be expressed mathematically in the following form:

$$\begin{aligned} \varvec{\Pi }(\mathcal {\hat{P}}_{s},\mathcal {\hat{P}}_{t})=\big \{\varvec{\gamma } \in ({\mathbb {R}}^+)^{n_{s}\times n_t}\,|\,\varvec{\gamma } {\varvec{1}}_{n_t}=\mathcal {\hat{P}}_{s},\varvec{\gamma } ^{\intercal } {\varvec{1}}_{n_{s}}=\mathcal {\hat{P}}_{t}\big \}, \end{aligned}$$
(8)

where \({\varvec{1}}_{n_{s}}\) and \({\varvec{1}}_{n_t}\) are the \(n_{s}\)- and \(n_t\)-dimensional vectors of ones. With the definition of \({\varvec{C}}_{ij}\) in Eq. (7), we can compute the transportation mapping based on the following equation:

$$\begin{aligned} \varvec{{\hat{\gamma }}} _0=\mathop {\arg \min }\limits _{\varvec{\gamma }\in \varvec{\Pi }(\mathcal {\hat{P}}_{s},\mathcal {\hat{P}}_{t})}{\langle \varvec{\gamma },{\varvec{C}}\rangle _F}. \end{aligned}$$
(9)

Equation (9) is a linear programming problem and can be solved by the network simplex algorithm, but solving it becomes difficult when the sample size is large. To solve this problem more efficiently, an entropy-regularized version of the above optimal transport problem has been proposed (Cuturi et al., 2013) and can be formulated as follows:

$$\begin{aligned} \varvec{{\hat{\gamma }}} _0^\lambda = \mathop {\arg \min } \limits _{\varvec{\gamma } \in \varvec{\Pi }(\mathcal {\hat{P}}_{s},\mathcal {\hat{P}}_{t})}{\langle \varvec{\gamma },{\varvec{C}} \rangle }_F+\lambda \Omega _e(\varvec{\gamma } ), \end{aligned}$$
(10)

where \(\Omega _e(\varvec{\gamma })=\sum _{i,j}{\varvec{\gamma }(i,j)\textrm{log} \varvec{\gamma }(i,j)}\) computes the negative entropy of \(\varvec{\gamma }\). This regularization is introduced because most of the elements of \(\varvec{{\hat{\gamma }}} _0\), as a solution of a linear program, are zero; a smoother, less sparse version of the transport can be found by increasing the entropy. In particular, \(\varvec{{\hat{\gamma }}}_0^\lambda \) can be solved by using the Sinkhorn algorithm (Cuturi et al., 2013).
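
For instance, with the POT library, the entropic problem of Eq. (10) can be solved in a few lines (a sketch; `C` is the cost matrix from Eq. (7), detached because the mapping is computed with the networks held fixed):

```python
import numpy as np
import ot

C_np = C.detach().cpu().numpy()
a = np.full(C_np.shape[0], 1.0 / C_np.shape[0])  # uniform source masses
b = np.full(C_np.shape[1], 1.0 / C_np.shape[1])  # uniform target masses
gamma = ot.sinkhorn(a, b, C_np, reg=0.1)         # reg plays the role of lambda
```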

Further, we resort to a class regularization term to estimate a better transport using the source sample label information. Our goal is to penalize the coupling of matching source samples with different labels to the same target sample. Thereby, the new optimization problem can be written in the following form:

$$\begin{aligned} \varvec{{\hat{\gamma }}} _0^\eta = \mathop {\arg \min } \limits _{\varvec{\gamma } \in \varvec{\Pi }(\mathcal {\hat{P}}_{s},\mathcal {\hat{P}}_{t}) }{\langle \varvec{\gamma },{\varvec{C}} \rangle }_F+\lambda \Omega _e(\varvec{\gamma } )+\eta \Omega _c(\varvec{\gamma }), \end{aligned}$$
(11)

where \(\eta \ge 0\) and \(\Omega _c(\cdot )\) is the class regularization term. In this work, we use group sparse regularization with the aim of making a given target sample receive masses from source samples with the same label. This regularization term is defined as:

$$\begin{aligned} \Omega _c(\varvec{\gamma } )=\sum \limits _{j}\sum \limits _{cl}\left\| \varvec{\gamma }({I} _{cl},j) \right\| _{1}^{1/2} \end{aligned}$$
(12)

where \(\left\| \cdot \right\| _{1}\) denotes the \(l_{1}\) norm and \(I _{cl}\) contains the indices of the rows of \(\varvec{\gamma }\) related to source domain samples of class cl. Thus, \(\varvec{\gamma }({I} _{cl},j)\) is a vector containing the coefficients of the j-th column of \(\varvec{\gamma }\) associated with class cl. In our case, cl stands for real or fake. This regularization term is called the Lp-L1 regularization term (here, \(p = 1/2\)) (Courty et al., 2014); when the majorization-minimization technique is applied to the Lp-L1 regularization, the problem can be transformed into Eq. (10) and solved by an efficient Sinkhorn-Knopp algorithm (Courty et al., 2016).

Equations (9), (10), and (11) are called the EMD solver, the Sinkhorn solver, and the Lp-L1 solver, respectively. After calculating the optimal transportation mapping, the Wasserstein distance between the source and target domain distributions is obtained according to Eq. (5). By computing the transportation mapping under joint distribution optimal transport, samples with similar features and common labels can be matched in the common product space, resulting in better discrimination.
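
A corresponding sketch of the Lp-L1 solver of Eq. (11), again assuming POT, whose `ot.da.sinkhorn_lpl1_mm` routine implements the majorization-minimization scheme; `ys` denotes the integer source labels (0 = fake, 1 = real), and `a`, `b`, `C_np` are as in the previous snippet:

```python
# Lp-L1 solver (Eq. 11) and the resulting Wasserstein distance (Eq. 5).
gamma = ot.da.sinkhorn_lpl1_mm(a, ys, b, C_np, reg=0.1, eta=0.1)
W = float(np.sum(gamma * C_np))   # Frobenius inner product <gamma, C>
```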

3.3 Domain Weight Optimization

To solve multi-source DA-based FAS, the weighting of each source domain is an important factor for the generalization ability of the final classifier on the target domain. We propose to assign adaptively updated weights to each source domain based on the Wasserstein distances between the joint distributions of each source domain and the target domain. For the FAS classification problem, these weights can be computed by solving a convex optimization problem involving the Wasserstein distances (optimal transportation losses) between the joint distributions of each source domain and the target domain, together with the classification losses of the different source domains.

The Wasserstein distances (optimal transportation losses) between the joint distributions of each source domain and the target domain can be computed by solving the optimal transport problems in Eq. (5). They measure the degree of joint distribution alignment between the source and target domains; clearly, the better the distributions are aligned, the better the generalization on the target domain. Specifically, we first compute the cost matrices of the joint distributions between the samples of each source domain and the target domain by Eq. (7). Then, the optimal transportation mappings of the joint distributions from each source domain to the target domain are computed by Eq. (11). Finally, the Wasserstein distances (optimal transportation losses) between the joint distributions can be computed. The Wasserstein distance (optimal transportation loss) from the k-th source domain to the target domain is defined as:

$$\begin{aligned} {\mathcal {L}}_{s_k\text{- } t}=\sum \limits _{i_k}\sum \limits _j{\hat{\varvec{\gamma }}_{{i_k}j}}^{s_k}d\big (g\big ({\varvec{x}}_{i_k}^{s_k}\big ),{\varvec{y}}_{i_k}^{s_k};g\big ({\varvec{x}}_j^t\big ),f\big (g\big ({\varvec{x}}_j^t\big )\big )\big ). \end{aligned}$$
(13)
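
In code, with the mapping held fixed, Eq. (13) is just the Frobenius inner product of the transportation mapping with the differentiable cost matrix, so gradients flow into g and f through C (a sketch under the same assumptions as above):

```python
gamma_t = torch.as_tensor(gamma, dtype=C.dtype, device=C.device)  # fixed mapping
loss_sk_t = (gamma_t * C).sum()   # C = joint_cost_matrix(...), requires grad
```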

Moreover, to better utilize the source domain information to train the final classifier, we employ the adaptive cross-entropy (AdaCE) loss (Jia et al., 2021) to measure the classification error of the classifier for each source domain. AdaCE loss is defined by adjusting the weight of the cross-entropy loss adaptively based on the classification accuracy. For the k-th source domain, it can be defined as follows:

$$\begin{aligned} {\mathcal {L}}_{s_k}&=\frac{1}{n_{s_k}}\sum \limits _{i_k}{\mathcal {L}}_{s_k}\big ({\varvec{y}}_{i_k}^{s_k},f\big (g\big ({\varvec{x}}_{i_k}^{s_k}\big )\big )\big )\\&=\frac{1}{n_{s_k}}\sum \limits _{i_k}\Big (1-e^{-{\mathcal {L}}_{CE}\big ({\varvec{y}}_{i_k}^{s_k},f\big (g\big ({\varvec{x}}_{i_k}^{s_k}\big )\big )\big )}\Big )^\alpha {\mathcal {L}}_{CE}\big ({\varvec{y}}_{i_k}^{s_k},f\big (g\big ({\varvec{x}}_{i_k}^{s_k}\big )\big )\big ), \end{aligned}$$
(14)

where \({\mathcal {L}}_{CE}(\cdot ,\cdot )\) is the cross-entropy loss, and \(\alpha \) is a hyper-parameter. For the i-th sample of the k-th source domain, \({\mathcal {L}}_{CE}(\cdot ,\cdot )\) is defined as:

$$\begin{aligned} {\mathcal {L}}_{CE}\big ({\varvec{y}}_{i_k}^{s_k},f\big (g\big ({\varvec{x}}_{i_k}^{s_k}\big )\big )\big )=-{\varvec{y}}_{i_k}^{s_k} \textrm{log}\big (f\big (g\big ({\varvec{x}}_{i_k}^{s_k}\big )\big )\big ). \end{aligned}$$
(15)
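
A sketch of Eqs. (14)-(15) in code (with integer labels `ys` and \(\alpha = 2\) as in Sect. 4.2):

```python
import torch
import torch.nn.functional as F

def adace_loss(logits, ys, alpha=2.0):
    """AdaCE loss of Eq. (14): CE reweighted by classification difficulty."""
    ce = F.cross_entropy(logits, ys, reduction="none")  # per-sample CE (Eq. 15)
    weight = (1.0 - torch.exp(-ce)) ** alpha            # small for easy samples
    return (weight * ce).mean()
```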

To further refine the parameters of the FAS classifier, we feed the unlabeled target domain data to the classifier and refer to the entropy loss proposed in Jia et al. (2021), which is expressed as follows:

$$\begin{aligned} {\mathcal {L}}_t=-\frac{1}{n_{t}}\sum \limits _j {f\big (g\big ({\varvec{x}}_j^t\big )\big ) \textrm{log} f\big (g\big ({\varvec{x}}_j^t\big )\big )}. \end{aligned}$$
(16)
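
A one-function sketch of Eq. (16):

```python
def target_entropy_loss(logits_t):
    """Entropy of the classifier's predictions on unlabeled target samples."""
    p = torch.softmax(logits_t, dim=1)
    return -(p * torch.log(p + 1e-8)).sum(dim=1).mean()  # eps for stability
```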

Once the optimal transportation loss functions from different source domains to the target domain, the classification loss functions related to different source domains as well as the entropy loss function related to the target domain have been defined, we can compute the domain weights and update the network parameters of both feature extractor and classifier by solving the following convex optimization problem:

$$\begin{aligned} \Big (g^{(n+1)}_{\theta _1},f^{(n+1)}_{\theta _2},w_k^{(n+1)}\Big )=\mathop {\arg \min } \limits _{g^{(n)}_{\theta _1},f^{(n)}_{\theta _2},w_k^{(n)}}{\mathcal {L}}_{total}, \end{aligned}$$
(17)
$$\begin{aligned} {\mathcal {L}}_{total}=\sum \limits _k w_k({\mathcal {L}}_{s_k}+\lambda _1 {\mathcal {L}}_{s_k\text{- } t})+\lambda _2{{\mathcal {L}}_t}, \end{aligned}$$
(18)

where \(\lambda _1\) and \(\lambda _2\) are the trade-off parameters, and \(w_k\) denotes the domain weight related to k-th source domain.

The domain weights are continuously updated together with the network parameters. The updated networks are then used again for the joint distribution estimation of the source and target domains, then for the joint distribution optimal transport, and finally for the domain weight optimization. That is to say, the three modules are learned and updated alternately. Once the network parameters of the feature extractor and classifier have been well trained, they are used to predict the sample labels of the target domain in the testing phase. The whole training process of the proposed weighted joint distribution optimal transport method for multi-source DA-based FAS is shown in Algorithm 1. It is worth noting that if only one source domain is available, \(w_k\) always equals 1.

Algorithm 1

Weighted joint distribution optimal transport method for multi-source DA-based FAS
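
The sketch below condenses the alternation of Algorithm 1, reusing the illustrative helpers from the previous subsections; the softmax parametrization of the domain weights, the optimizers, and all batch variables (`xs_batch`, `ys`, `ys_onehot`, `xt_batch`, `a`, `b`, `K`, `lambda1`, `lambda2`, `num_steps`) are assumptions of this example rather than the paper's exact recipe:

```python
params = list(feature_extractor.parameters()) + list(classifier.parameters())
opt_net = torch.optim.Adam(params, lr=1e-4)
w_logits = torch.zeros(K, requires_grad=True)        # one weight per source domain
opt_w = torch.optim.Adam([w_logits], lr=6e-3)

for step in range(num_steps):
    w = torch.softmax(w_logits, dim=0)               # keep weights on the simplex
    zt, _ = estimate_joint(xt_batch)                 # target features + pseudo-labels
    total = lambda2 * target_entropy_loss(classifier(zt))
    for k in range(K):
        zs, ys1h = estimate_joint(xs_batch[k], ys_onehot[k])
        C = joint_cost_matrix(zs, ys1h, zt, beta=0.1)            # Eq. (7)
        gamma = ot.da.sinkhorn_lpl1_mm(                          # Eq. (11), networks fixed
            a, ys[k].cpu().numpy(), b,
            C.detach().cpu().numpy(), reg=0.1, eta=0.1)
        gamma_t = torch.as_tensor(gamma, dtype=C.dtype, device=C.device)
        total = total + w[k] * (adace_loss(classifier(zs), ys[k])
                                + lambda1 * (gamma_t * C).sum())  # Eq. (18)
    opt_net.zero_grad(); opt_w.zero_grad()
    total.backward()                                 # updates theta_1, theta_2, and w jointly
    opt_net.step(); opt_w.step()
```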

4 Experimental Results

Fig. 3

Examples of the real (first row) and fake (second row) faces from the CASIA-FASD (Zhang et al., 2012), OULU-NPU (Boulkenafet et al., 2017), CASIA-SURF HiFiMask (Liu et al., 2022), and Surveillance High-Fidelity Mask (Fang et al., 2024) databases. It is easy to see that there exists a large inter-domain gap, such as in lighting, background, and attack types, which results in significant distribution discrepancies among the different datasets

4.1 Datasets

To evaluate the effectiveness of the proposed WJDOT-FAS method for multi-source DA-based FAS, we conducted experiments on four public datasets with only 2D attack types, namely CASIA-FASD (Zhang et al., 2012), Idiap Replay-Attack (Chingovska et al., 2012), MSU-MFSD (Wen et al., 2015), and OULU-NPU (Boulkenafet et al., 2017). For simplicity, they are denoted as C, I, M, and O in the following. Besides, we also conducted experiments on three large-scale public datasets with 3D attack types, namely CASIA-SURF 3DMask (Yu et al., 2020), CASIA-SURF HiFiMask (Liu et al., 2022), and Surveillance High-Fidelity Mask (Fang et al., 2024). For simplicity, they are denoted as 3DMask, HiFiMask, and SuHiFiMask in the following. In addition, we also demonstrated the effectiveness of our WJDOT-FAS method for open-set DA by using datasets containing only 2D attack types as the source domains and a dataset containing 3D attack types as the target domain.

The \(\textit{CASIA-FASD}\) (Zhang et al., 2012) dataset consists of 600 videos of real and attack attempts of 50 different subjects. There are three different image qualities in the dataset: low, normal, and high, which are captured with three different cameras (a Sony NEX-5 camera with 1280\(\times \)720 resolution and two different USB cameras with 640\(\times \)480 resolution). The face attacks include: distorted photo attacks, cut photo attacks, and video attacks. The \(\textit{Idiap Replay-Attack}\) (Chingovska et al., 2012) dataset consists of 1200 videos of real and attack attempts on 50 different subjects. The camera on the MacBook is used to collect the dataset with a resolution of 320 \(\times \) 240 under two conditions: (i) a control condition with a uniform background and fluorescent lights; and (ii) an unfavorable condition with a non-uniform background and daylight. Three types of deceptive attacks are designed: print attack, mobile attack, and high definition attack. The \(\textit{MSU-MFSD}\) (Wen et al., 2015) dataset consists of 440 videos from 55 subjects. The face videos are taken by two types of cameras (MacBook Air camera and Google Nexus 5 Android phone camera). The resolutions are 640 \(\times \) 480 and 720 \(\times \) 480. There are mainly two different spoofing attacks, the print photo attack and the replay video attack. The \(\textit{Oulu-NPU}\) (Boulkenafet et al., 2017) dataset consists of 4950 real and attack videos from 55 subjects. These videos are recorded with the front cameras of 6 mobile devices (Samsung Galaxy S6 edge, HTC Desire EYE, MEIZU X5, ASUS Zenfone Selfie, Sony XPERIA C5 Ultra Dual and OPPO N3). There are three different lighting conditions and background scenes. The types of presentation attacks are printing and video replay. These attacks are created using two printers and two display devices.

The \(\textit{CASIA-SURF 3DMask}\) (Yu et al., 2020) dataset contains 288 real face videos and 864 mask videos from 48 subjects. Six conditions are used for data collection, including normal, back-light, front-light, side-light, outdoor in shadow, and outdoor in sunlight. 3D masks of 48 subjects are collected by 3D printing technology. In addition to the use of naive masks, two more realistic decorative situations (i.e., masks with/without hair and glasses) are considered. The \(\textit{CASIA-SURF HiFiMask}\) (Liu et al., 2022) dataset consists of 75 subjects, and each subject provides high-fidelity plaster, resin, and transparent masks. Six different environments, six directional illuminations, and seven recording sensors are applied to the dataset. A total of 54,600 videos (13,650 live videos, 40,950 mask videos) are available in the dataset. The \(\textit{Surveillance High-Fidelity Mask}\) (Fang et al., 2024) dataset is captured in 40 real-life surveillance scenarios, such as movie theaters, security gates, and parking lots, representing a wide range of face recognition scenarios. It includes 101 participants of different ages and genders who perform various natural activities in their daily lives. In addition, the dataset contains multiple types of spoofing attacks such as high-fidelity masks, 2D attacks, and adversarial attacks.

In general, different datasets differ in acquisition devices, acquisition conditions, and attack types, which leads to discrepancies among domains. In addition, each dataset is collected with multiple acquisition devices and diverse attack types, so samples with the same label within a dataset can be distant from each other, which makes a joint distribution metric between domains necessary. Figure 3 shows some examples of real and fake facial samples in these datasets. It is easy to see that there exists a large inter-domain gap, such as in lighting, background, and attack types, which results in significant distribution discrepancies among the different datasets.

4.2 Experimental Settings and Implementation Details

In this paper, we perform experiments on 2D and 3D attack datasets under single- and multi-source domain settings for (open-set) DA. We set up three protocols under single- and multi-source domain settings, respectively.

Under the single-source domain setting:

  • Cross-dataset testing on 2D attack datasets (2D\(\rightarrow \)2D). We follow the protocols of (Wang et al., 2021; Jia et al., 2021), in which one of the I, C, and M datasets is used as the source domain and another as the target domain, giving six sets of experiments. We use the Half Total Error Rate (HTER), i.e., half of the sum of the false acceptance rate and the false rejection rate, as the evaluation metric. We first compute the Equal Error Rate (EER) and the corresponding threshold on the development set and then use this threshold to calculate the HTER on the testing set (a sketch of this computation is given after this list).

  • Cross-dataset testing on 3D attack datasets (3D\(\rightarrow \)3D). One of the 3DMask and HiFiMask datasets is used as the source domain and the other as the target domain. We use the HTER and the Area Under the Curve (AUC) as the evaluation metrics.

  • Cross-dataset testing for open-set DA (2D\(\rightarrow \)3D). One of the C and I datasets is used as the source domain and the SuHiFiMask dataset as the target domain. We also use the HTER and AUC as the evaluation metrics.
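
For reference, here is a minimal sketch of the HTER computation described above (illustrative; it assumes numpy and scikit-learn, with label 1 for real faces and scores being the predicted probability of the real class):

```python
import numpy as np
from sklearn.metrics import roc_curve

def eer_threshold(dev_labels, dev_scores):
    """Threshold at which FAR and FRR are (approximately) equal on the dev set."""
    fpr, tpr, thr = roc_curve(dev_labels, dev_scores)
    fnr = 1.0 - tpr
    return thr[np.nanargmin(np.abs(fnr - fpr))]

def hter(test_labels, test_scores, threshold):
    """Half Total Error Rate: mean of FAR and FRR at the dev-set EER threshold."""
    accept = test_scores >= threshold
    far = accept[test_labels == 0].mean()       # fake samples accepted as real
    frr = (~accept[test_labels == 1]).mean()    # real samples rejected
    return 0.5 * (far + frr)
```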

Under the multi-source domain setting:

  • Cross-dataset testing on 2D attack datasets (2D\(\rightarrow \)2D). We follow the protocols of (Zhou et al., 2022; Liu et al., 2022), in which three of the four datasets are used as source domains and the remaining one as target domain, so there are four sets of experiments. The HTER and AUC are used as the evaluation metrics.

  • Cross-dataset testing on 3D attack datasets (3D\(\rightarrow \)3D). Two of the 3DMask, HiFiMask, and SuHiFiMask datasets are used as the source domains, and the other dataset as the target domain. We use the HTER and AUC as the evaluation metrics.

  • Cross-dataset testing for open-set DA (2D\(\rightarrow \)3D). The C and I datasets are used as the source domains and one of the 3DMask, HiFiMask, and SuHiFiMask datasets is used as the target domain. We also use the HTER and AUC as the evaluation metrics.

Table 1 Comparison results (HTER (%)) between the proposed method and the state-of-the-art methods for cross-dataset testing under the single-source DA setting on the C, I, and M datasets

In our experiments, we use the MTCNN algorithm (Zhang et al., 2016) for face detection and alignment. We implemented our WJDOT-FAS method on the PyTorch platform and utilized ResNet-18 (He et al., 2016) and the Vision Transformer (ViT) (Touvron et al., 2021), both pre-trained on ImageNet, as our backbones. All detected face images are normalized to 256 \(\times \) 256 \(\times \) 3, and for the ResNet-18 backbone we further resize them to 224 \(\times \) 224 \(\times \) 3. The network structures of our feature extractor and classifier are the same as in (Jia et al., 2021) under the ResNet-18 backbone and the same as in (Liu et al., 2022) under the ViT backbone. Specifically, the feature extractor outputs 512-dimensional features used for optimal transport. Both our JDOT-FAS-ResNet18 and JDOT-FAS-ViT models (including feature extractor and classifier) were trained using the Adam optimizer with a momentum of 1e-4 under the single-source DA setting. Under the multi-source DA setting, our WJDOT-FAS-ResNet18 model was trained using the Adam optimizer with a momentum of 0.06 and a weight decay of 2e-4, and our WJDOT-FAS-ViT model was trained using the Adam optimizer with a momentum of 1e-4. For both source and target domains, mini-batch sizes of \(n_{s_k}=n_t=120\) for ResNet-18 and \(n_{s_k}=n_t=60\) for ViT were used, and the models were trained on a single NVIDIA RTX 3090 GPU. The weights of each source domain were also trained using the Adam optimizer, with a momentum of 0.006. The hyperparameters \(\lambda \), \(\eta \), \(\beta \), \(\lambda _1\), \(\lambda _2\), and \(\alpha \) are set to 0.1, 0.1, 0.1, 5, 0.1, and 2, respectively.

4.3 Comparisons with the State-of-the-Art Methods

4.3.1 Single-source DA Setting

To verify the effectiveness of our JDOT-FAS method under the single-source DA setting, we first compare it with the state-of-the-art FAS methods on the C, I, and M datasets with only 2D attack types. The compared methods can be divided into three categories: traditional DA methods, including ADDA (Tzeng et al., 2017), DRCN (Ghifary et al., 2016), and DupGAN (Hu et al., 2018); DA-based generalized FAS methods, including Li et al. (2018), ADA (Wang et al., 2019), UDA (Wang et al., 2021), and USDAN-Un (Jia et al., 2021); and some novel self-designed DA-based deep learning FAS methods, including OCKD (Li et al., 2022), GDA (Zhou et al., 2022), SFDA-FAS (Liu et al., 2022), and CDFTN-R (Yue et al., 2022). The traditional DA methods generally judge a face as fake or real by using a simple FC-layer-based classifier optimized with cross-entropy loss. The DA-based generalized FAS methods are mainly based on Maximum Mean Discrepancy (MMD) (Li et al., 2018) and adversarial learning (Jia et al., 2021; Wang et al., 2019, 2021). The self-designed DA-based deep learning FAS methods build on novel deep learning frameworks such as teacher-student learning (Li et al., 2022), generative DA (Zhou et al., 2022), contrastive learning (Liu et al., 2022), and disentangled representation learning (Yue et al., 2022). As shown in Table 1, the DA-based generalized FAS methods generally perform better than the traditional DA methods, and the novel self-designed DA-based deep learning FAS methods obtain the best performance. Our JDOT-FAS method belongs to the second category and achieves the best average performance among all the DA-based FAS methods under the ViT backbone, reducing the HTER by \(2.78\%\) compared to the state-of-the-art DA method CDFTN-R (Yue et al., 2022). Our JDOT-FAS-ResNet18 model likewise has significant advantages over the other methods of the second category, reducing the HTER by more than \(5.6\%\). A possible reason is that our method introduces the label (pseudo-label) distance into the measure of distribution discrepancy by flexibly defining the cost matrix, so that labels are taken into account when aligning the source and target distributions. In other DA-based FAS methods, the distributions are defined in the feature space, ignoring the role of source sample labels and target sample pseudo-labels in distribution alignment. In particular, the feature extractors of the methods in (Wang et al., 2019, 2021) are not parameter-shared, while we use a parameter-shared feature extractor and classifier that map both source data and target data to a shared common product space, thus facilitating the search for a more generalizable domain-invariant product space of features and labels. Overall, our JDOT-FAS method achieves competitive results under the single-source DA setting.

Table 2 Comparison results (HTER (%) and AUC (%)) between the proposed method and the state-of-the-art methods for cross-dataset testing under the single-source DA setting on the 3DMask and HiFiMask datasets
Table 3 Results (HTER (%) and AUC (%)) of testings on the SuHiFiMask dataset with the C or I dataset as the source domain under the single-source DA setting
Table 4 Comparison results (HTER (%)) between the proposed method and the state-of-the-art methods for multi-source domain cross-dataset testing on the O, C, I, and M datasets
Table 5 Comparison results (AUC (%)) between the proposed method and the state-of-the-art methods for multi-source domain cross-dataset testing on the O, C, I, and M datasets

Besides, we verify the effectiveness of our JDOT-FAS method under the single-source DA setting on the 3DMask and HiFiMask datasets with 3D attacks. As shown in Table 2, under both the ResNet-18 and ViT frameworks, our JDOT-FAS method improves over the baseline methods, and it outperforms the methods in Liu et al. (2022), which do not use information about the target domain samples during training. In addition, for the 3D attack datasets, the features extracted by ResNet-18 are more capable of capturing the real and fake cues in the face images, and after the joint distribution optimal transportation mapping these features are more conducive to correctly classifying the target domain samples. Therefore, our JDOT-FAS method not only generalizes well for DA-based FAS with 2D attacks but also achieves cross-scenario generalization for DA-based FAS with 3D attacks.

Table 6 Results (HTER (%)) for multi-source domain cross-dataset testing on the 3DMask, HiFiMask, and SuHiFiMask datasets
Table 7 Results (AUC (%)) for multi-source domain cross-dataset testing on the 3DMask, HiFiMask, and SuHiFiMask datasets
Table 8 Results (HTER (%) and AUC (%)) of testings on the 3DMask, HiFiMask, and SuHiFiMask datasets with the C and I datasets as the source domains under the multi-source DA setting

In addition, to verify the effectiveness of our JDOT-FAS method under the open-set single-source DA setting, we choose the C or I dataset, which has only 2D attack types, as the source domain, and use the SuHiFiMask dataset with 2D, 3D, and adversarial attacks as the target domain. As shown in Table 3, under both the ResNet-18 and ViT frameworks, our JDOT-FAS method improves over the baseline methods and outperforms the results under all the backbones in Fang et al. (2024). The improvement is more obvious under the ViT framework, with the average HTER reduced by 11.55\(\%\) and the average AUC improved by 13.4\(\%\). This indicates that our JDOT-FAS method is also effective for the open-set DA problem with novel attacks in the target domain; that is, when the distribution discrepancy between the source and target domains is large, our proposed optimal transportation loss of joint distributions can reduce the domain discrepancy to a certain extent and improve the classification accuracy on the target domain with novel attacks.

4.3.2 Multi-source DA Setting

To verify the effectiveness of our WJDOT-FAS method under the multi-source DA setting, we first compare it with the state-of-the-art FAS methods on the C, I, M, and O datasets with only 2D attack types. The compared methods can be divided into three categories, as shown in Tables 4 and 5: DG-based FAS methods (Jia et al., 2020; Wang et al., 2022; Liu et al., 2022; Wang et al., 2022; Zhou et al., 2023; Liu et al., 2023; Long et al., 2023), source-free DA-based FAS methods (Liu et al., 2022; Li et al., 2018; Wang et al., 2020; He et al., 2020; Liang et al., 2020; Yang et al., 2021a, b; LV et al., 2021; Wang et al., 2021), and unsupervised DA-based FAS methods (Wang et al., 2021; Zhou et al., 2022; Wang et al., 2019; Quan et al., 2021). The DG-based FAS methods are trained without the involvement of target data. The source-free DA-based FAS methods use the source data for model pre-training and then fine-tune the pre-trained model on the target data, which lets them account for the discrepancy between the source and target domains and reduce it with techniques such as meta-learning and contrastive learning. However, most source-free DA methods do not fully utilize the source domain information when aligning the source and target domains. The unsupervised DA-based FAS methods are the most effective at reducing domain discrepancy, as they align the distributions of the source and target domains during training; they are based on adversarial learning (Wang et al., 2019, 2021), cross-domain image generation (Zhou et al., 2022), and progressive transfer learning (Quan et al., 2021). Our WJDOT-FAS method belongs to the category of unsupervised DA-based FAS methods, and it outperforms all the DG-based FAS methods and most of the source-free DA-based FAS methods thanks to its full utilization of target data information. Besides, our WJDOT-FAS method achieves the best average performance among the unsupervised DA methods of the same category. In particular, our WJDOT-FAS-ResNet18 model reduces the average HTER by more than \(1.74\%\) and improves the average AUC by more than \(0.13\%\), and our WJDOT-FAS-ViT model reduces the average HTER by more than \(2.51\%\) and improves the average AUC by more than \(0.36\%\). The possible reasons are as follows. First, other methods only consider the discrepancy in feature distributions and ignore the effect of image labels; in contrast, we align the joint distributions of features and labels between domains, which improves classification on the target domain. Second, they treat all source domains uniformly, i.e., each source domain has the same weight, so the target domain cannot adaptively favor the source domains that are easier to align during training; in contrast, we make more effective use of the source domain information through learnable domain weights, which is an important reason for the improved cross-dataset testing effectiveness of multi-source DA-based FAS with 2D attacks.

Besides, we verify the effectiveness of our WJDOT-FAS method under the multi-source DA setting on the 3DMask, HiFiMask, and SuHiFiMask datasets with 3D attacks. As shown in Tables 6 and 7, under both the ResNet-18 and ViT frameworks, our WJDOT-FAS method outperforms the baseline methods. Specifically, under the ResNet-18 framework, the average HTER (AUC) of our WJDOT-FAS method decreases (improves) by \(6.07\%\) (\(8.53\%\)) compared to the baseline method, and under the ViT framework it decreases (improves) by \(6.6\%\) (\(4.23\%\)). As in the single-source DA case, the ResNet-18 network is better at capturing the real/fake cues in the 3D attack datasets. In addition, the generalization is best on the 3DMask dataset, because its 3D masks are easier to discriminate than those of the other two 3D attack datasets and it is captured in a simpler acquisition environment. Overall, our WJDOT-FAS method achieves cross-scenario generalization for DA-based FAS with 3D attacks.

Table 9 Evaluations (HTER (%)) of optimal transport methods based on marginal and joint distributions under the single-source DA setting
Table 10 Evaluations (HTER (%) and AUC (%)) of optimal transport methods based on marginal and joint distributions under the multi-source DA setting
Fig. 4

Best HTER curves under the C\(\rightarrow \)M (a) and O&M&I\(\rightarrow \)C (b) settings. The red lines indicate the optimal transport methods based on marginal distributions, and the blue lines indicate the optimal transport methods based on joint distributions

In addition, to verify the effectiveness of our WJDOT-FAS method under the open-set multi-source DA setting, we choose the C and I datasets, which contain only 2D attack types, as source domains, and use the 3DMask, HiFiMask, or SuHiFiMask dataset with 3D attacks as the target domain. As shown in Table 8, under both the ResNet-18 and ViT frameworks, our WJDOT-FAS method outperforms the baseline methods under the open-set multi-source DA setting. The improvement is most obvious under the C&I\(\rightarrow \)3DMask setting, as the 3DMask dataset has a smaller distribution discrepancy with the convex combination of C and I than the other two datasets. The experimental results show that our WJDOT-FAS method is also effective for the open-set multi-source DA problem with novel attacks in the target domain: it reduces the large distributional discrepancy between the convex combination of the source distributions and the target distribution by computing the optimal transportation mappings. The weight of each source domain is adaptively selected according to the classification loss and the optimal transportation loss, which ultimately improves classification under the open-set multi-source DA setting.

4.4 Ablation Study

4.4.1 Effectiveness of the Joint Distribution Estimation

Tables 9 and 10 illustrate the advantages of the joint distribution estimation of features and labels (denoted JDOT-FAS or WJDOT-FAS) over the marginal distribution estimation of features (denoted MDOT-FAS or WMDOT-FAS) for target domain discrimination. JDOT-FAS (or WJDOT-FAS) performs better than MDOT-FAS (or WMDOT-FAS) on the target domain. Specifically, under the single-source domain setting, JDOT-FAS has a 6.48\(\%\) lower average HTER than MDOT-FAS; under the multi-source domain setting, WJDOT-FAS has a 1.8\(\%\) lower average HTER and a 0.72\(\%\) higher average AUC than WMDOT-FAS. This is because optimal transport based on joint distributions aligns the feature distributions of samples with the same label (or pseudo-label) in the product space of features and labels, instead of only aligning marginal distributions within the feature space. Therefore, aligning the joint distributions of the source and target domains not only allows the target domain to better learn the source domain label information, but also allows it to better exploit the information provided by the pseudo-labels, yielding more accurate classification results.

Fig. 5

The t-SNE embeddings of samples for optimal transport methods based on marginal distribution and joint distribution under the C\(\rightarrow \)M setting. Different colors represent different domains; different markers represent different classes

Fig. 6

The t-SNE embeddings of samples for optimal transport methods based on weighted marginal distribution and weighted joint distribution under the O&M&I\(\rightarrow \)C setting

To further illustrate the contribution of the joint distribution optimal transport-based method to the accuracy of pseudo-labels, we plot the best HTER curves over the validation sets during training under the single- and multi-source DA settings in Fig. 4. As the decreasing curves in both figures show, both the marginal and the joint distribution optimal transport-based methods improve the accuracy of target domain pseudo-labels: both continuously update the transportation mappings that minimize the transportation losses, thereby aligning the source and target domain distributions and improving the label estimates on the target domain during training. In addition, we observe that the models trained with joint distribution-based optimal transport make the best HTER curves decrease faster and converge to lower values, i.e., the final best HTER drops from \(13.69\%\) to \(2.38\%\) under the single-source DA setting and from \(1.39\%\) to \(0.83\%\) under the multi-source DA setting. This is because the discrepancy between the source and target distributions is characterized more accurately by the joint distributions, so the transportation mapping solved from the joint distribution-based cost matrix aligns the two distributions faster and more accurately while minimizing the optimal transportation loss.
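For reference, HTER is the average of the false acceptance rate (FAR) and the false rejection rate (FRR) at a fixed threshold. The following minimal sketch computes it under the assumed convention that higher scores indicate real faces; the helper name and the score convention are illustrative, not taken from our implementation.

```python
import numpy as np

def hter(scores, labels, threshold):
    """Half Total Error Rate: mean of FAR and FRR at a given threshold.
    Assumed convention: labels are 1 for real faces, 0 for spoof attacks,
    and higher scores mean "more likely real"."""
    scores, labels = np.asarray(scores), np.asarray(labels)
    accept = scores >= threshold                 # predicted "real"
    far = np.mean(accept[labels == 0])           # spoofs accepted
    frr = np.mean(~accept[labels == 1])          # real faces rejected
    return 0.5 * (far + frr)

# The threshold is typically chosen on a validation set, then fixed for testing.
print(hter([0.9, 0.8, 0.3, 0.1], [1, 1, 0, 0], threshold=0.5))  # -> 0.0
```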

Table 11 Evaluations (HTER (%)) of different loss functions for training under the single-source DA setting
Table 12 Evaluations (HTER (%) and AUC (%)) of different loss functions for training under the multi-source DA setting

Figures 5 and 6 show the t-SNE (Maaten et al., 2008) visualizations of the source and target domains learned by the MDOT-FAS and JDOT-FAS methods under the single-source DA setting, and by the WMDOT-FAS and WJDOT-FAS methods under the multi-source DA setting, respectively. For MDOT-FAS and WMDOT-FAS, most target domain samples lie far from the classification boundary, making it easier to distinguish real from fake faces; however, near the boundary a small number of fake target faces are still mixed in with real source faces. Thus, MDOT-FAS and WMDOT-FAS reduce the discrepancy of domain distributions to some extent, but their generalization performance still needs improvement. For JDOT-FAS and WJDOT-FAS, all fake samples from the source and target domains are concentrated in one region (lower-left or upper-right) while all real samples are concentrated in the opposite region, showing that JDOT-FAS and WJDOT-FAS reduce the domain distribution discrepancies more effectively than MDOT-FAS and WMDOT-FAS. By introducing labels into the domain discrepancy metric, the joint distribution-based optimal transport methods further minimize the intra-class discrepancy and maximize the inter-class discrepancy. Therefore, they achieve stronger discriminative power and better generalization ability by seeking a domain-invariant product space.
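As a rough illustration of how such visualizations are produced, the sketch below embeds pooled source and target features with scikit-learn's t-SNE, coloring points by domain and marking them by class, mirroring the convention of Figs. 5 and 6. The feature arrays are random placeholders standing in for the outputs of the trained feature extractor.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

# Placeholder features; in practice these are the extractor outputs for
# source and target samples, with labels 0 = fake, 1 = real.
rng = np.random.default_rng(0)
src_feats, src_labels = rng.normal(size=(200, 512)), rng.integers(0, 2, 200)
tgt_feats, tgt_labels = rng.normal(size=(200, 512)), rng.integers(0, 2, 200)

emb = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(
    np.vstack([src_feats, tgt_feats]))

# Color encodes domain, marker encodes class.
n = len(src_feats)
for dom_emb, dom_labels, color in [(emb[:n], src_labels, "tab:red"),
                                   (emb[n:], tgt_labels, "tab:blue")]:
    for cls, marker in [(0, "x"), (1, "o")]:
        pts = dom_emb[dom_labels == cls]
        plt.scatter(pts[:, 0], pts[:, 1], c=color, marker=marker, s=10)
plt.savefig("tsne_domains.png")
```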

4.4.2 Effectiveness of the Optimal Transportation Loss

Table 13 Evaluations (HTER (%)) of the hyperparameter \(\beta \) under the single-source DA setting
Table 14 Evaluations (HTER (%)) of three different transportation mapping solvers under the single-source DA setting

Tables 11 and 12 show the experimental results for each added loss function under the single- and multi-source DA settings, respectively. Three types of models are trained for comparison: models trained with only the source domain classification loss (source domains weighted classification loss); models additionally trained with the target domain entropy loss; and models further trained with the joint distribution optimal transportation loss (weighted joint distribution optimal transportation loss). The test performance under both settings improves with each added loss function, and the joint distribution optimal transportation loss plays the largest role under the single-source domain setting, reducing the average HTER by \(16.98\%\). The entropy loss allows the classifier to directly access unlabeled target data and adaptively adjust its parameters to the target distribution. The (weighted) joint distribution optimal transportation loss feeds the distribution discrepancy measured during transportation back into the network parameter updates, so that the feature extractor and classifier bring the joint distributions of the source and target domains closer to each other. We measure the discrepancy between the source and target joint distributions with the Wasserstein distance, which enables a continuous transformation of the distributions: we can find optimal transportation mappings that map the joint distributions of the source and target domains into a common product space while preserving their geometric characteristics. The comparison verifies that each loss function of our proposed method contributes to the performance improvement, and all loss functions interact to achieve the best results.
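To make the composition of the training objective concrete, the sketch below assembles the three loss terms in PyTorch. The function names and the trade-off coefficients lambda_e and lambda_ot are illustrative assumptions; only the structure (weighted source classification loss, target entropy loss, and weighted optimal transportation loss) follows the text.

```python
import torch
import torch.nn.functional as F

def entropy_loss(logits_t):
    # Shannon entropy of target predictions; encourages the classifier
    # to make confident decisions on unlabeled target samples.
    p = F.softmax(logits_t, dim=1)
    return -(p * torch.log(p + 1e-8)).sum(dim=1).mean()

def total_loss(logits_s_list, labels_s_list, logits_t, ot_losses, weights,
               lambda_e=0.1, lambda_ot=1.0):
    # weights: per-source-domain weights; ot_losses: the joint distribution
    # optimal transportation loss of each source domain (scalars).
    cls = sum(w * F.cross_entropy(lo, la)
              for w, lo, la in zip(weights, logits_s_list, labels_s_list))
    ot = sum(w * l for w, l in zip(weights, ot_losses))
    return cls + lambda_e * entropy_loss(logits_t) + lambda_ot * ot
```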

4.5 Discussion

4.5.1 Influences of the Cost Matrix Computation

In Eq. (7), we generalize the cost matrix of the marginal distribution optimal transport method, defined as the distance between source and target features, to a weighted combination of feature and label distances in the joint distribution optimal transport method, and achieve alignment of the joint distributions of source and target domain samples by flexibly defining \({\varvec{C}}\). Table 13 shows the experimental results under different values of \(\beta \) in the single-source DA setting. The average performance is best at \(\beta =0.1\), which shows that the label cost plays a key role in the total cost matrix for measuring the distance between joint distributions. An appropriate ratio of feature cost to label cost allows the total cost matrix to align the joint distributions of the source and target domains more effectively, so that target domain samples are classified more accurately.
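The sketch below illustrates one way such a joint cost matrix can be assembled, combining a pairwise feature distance with a label cost computed from source labels and target predictions. Exactly how \(\beta \) enters the weighting is an assumption here; only the structure (feature cost plus weighted label cost) follows Eq. (7).

```python
import torch
import torch.nn.functional as F

def joint_cost_matrix(feat_s, feat_t, y_s, logits_t, beta=0.1):
    # feat_s: (n_s, d) source features; feat_t: (n_t, d) target features;
    # y_s: (n_s,) integer source labels; logits_t: (n_t, K) target logits.
    feat_cost = torch.cdist(feat_s, feat_t, p=2) ** 2   # squared Euclidean
    log_p_t = F.log_softmax(logits_t, dim=1)            # (n_t, K)
    # Cross-entropy between source labels and target predictions:
    # label_cost[i, j] = -log p_t[j, y_s[i]].
    label_cost = -log_p_t[:, y_s].T                     # (n_s, n_t)
    return feat_cost + beta * label_cost
```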

Table 15 Evaluations (HTER (%) and AUC (%)) of domain weight optimization under the multi-source DA setting

4.5.2 Influences of the Transportation Mapping Computation

Three types of solvers can be used to compute the transportation mapping, i.e., the EMD solver (Eq. 9), the Sinkhorn solver (Eq. 10), and the Lp-L1 solver (Eq. 11). Table 14 shows the test results on the target domain with the different transportation mapping solvers under the single-source DA setting. Five of the six experiments achieve the best test results with the Lp-L1 solver. This is because the unregularized transportation mapping problem in Eq. (9) is a linear program, which drives most elements of the transportation mapping matrix solved by the EMD solver to 0, resulting in a highly sparse and unsmooth solution. In contrast, the Sinkhorn solver (Eq. 10) with entropy regularization finds a smoother version of the transport, reducing the sparsity of the transportation mapping matrix. The Lp-L1 solver (Eq. 11) further makes each target sample receive mass only from source samples with the same label on top of the smooth transport. This additionally constrains the transportation mapping to transport between samples with the same label in the source and target domains, complementing the joint distribution-based definition of the cost matrix.
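All three solvers are available in the POT (Python Optimal Transport) library; the sketch below contrasts them on toy data. The regularization strengths reg and eta are illustrative, not the values used in our experiments.

```python
import numpy as np
import ot  # POT: pip install pot

rng = np.random.default_rng(0)
Xs, ys = rng.normal(size=(50, 16)), rng.integers(0, 2, 50)  # source feats/labels
Xt = rng.normal(size=(60, 16))                              # target features
a, b = ot.unif(50), ot.unif(60)                             # uniform sample weights
M = ot.dist(Xs, Xt)                                         # squared Euclidean cost
M /= M.max()                                                # normalize for stability

G_emd = ot.emd(a, b, M)                  # exact LP: sparse, unsmooth plan
G_sink = ot.sinkhorn(a, b, M, reg=0.1)   # entropic regularization: denser plan
G_lpl1 = ot.da.sinkhorn_lpl1_mm(a, ys, b, M, reg=0.1, eta=0.1)
# The Lp-L1 group-lasso term pushes each target sample to receive mass
# from source samples of a single class only.

print(np.mean(G_emd == 0), np.mean(G_sink == 0))  # the EMD plan is far sparser
```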

4.5.3 Influences of the Domain Weight Optimization

In Table 15, we compare three categories of domain-weighting strategies: using the same weight for each domain, non-learnable weighting strategies based on distances, and optimization-based weighting strategies using different optimizers. The distance-based strategies include classification loss-based (Kang et al., 2020), optimal transport distance-based (Zhao et al., 2020), and MMD-based (Li et al., 2021) weighting. The optimization-based strategies include the SGD (Saad, 1998), AdaGrad (Duchi et al., 2011), and Adam (Kingma et al., 2014) algorithms. As the comparison shows, using the same weight for all source domains is equivalent to concatenating all source samples into a single source distribution and applying joint distribution optimal transport to it. Treating every source domain equally prevents the model from learning the convex combination of source domains best adapted to the target domain, so this strategy tests poorly on the target domain. In addition, the optimal transport distance (Wasserstein distance)-based weighting strategy outperforms the MMD-based one, and the classification loss-based strategy is the worst of the three distance-based strategies. This further indicates that the optimal transport distance is a more accurate measure of similarity between domains, so the combination of source domains adapted to the target domain can be better learned with the optimal transport distance-based weighting strategy. Finally, the optimization-based weighting strategies obtain the best average results of the three categories, because they optimize the domain weights more flexibly according to the classification and transportation performance of each source domain. Among the three optimizers, models trained with the Adam algorithm achieve the best average test results on the target domain, because Adam determines the parameter update from the first-order moment estimate of the stochastic gradient and the adaptive learning rate from the second-order moment estimate. Moreover, Adam corrects the bias of both moment estimates, which avoids the oscillation of the SGD algorithm during optimization as well as the low efficiency of the AdaGrad algorithm in the later stages of optimization. Overall, the Adam algorithm best learns the convex combination of source domain distributions closest to the target domain distribution, so we adopt it as our domain-weighting strategy.
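A minimal sketch of the optimization-based strategy is given below: the domain weights are parameterized through a softmax so they remain a convex combination, and Adam updates them from the per-domain losses. The softmax parameterization and the toy loss values are our assumptions; the text only specifies that the weights are updated by the optimizer from the classification and transportation losses.

```python
import torch

K = 3                                          # number of source domains
logits = torch.zeros(K, requires_grad=True)    # unconstrained parameters
opt = torch.optim.Adam([logits], lr=1e-2)

for step in range(100):
    w = torch.softmax(logits, dim=0)           # convex weights, sum to 1
    # per_domain_loss[k] stands in for the classification + transportation
    # loss of source domain k; fixed toy values replace real training losses.
    per_domain_loss = torch.tensor([0.9, 0.4, 0.7])
    loss = (w * per_domain_loss).sum()
    opt.zero_grad()
    loss.backward()
    opt.step()

print(torch.softmax(logits, dim=0))  # mass shifts toward the easiest domain
```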

5 Conclusions and Future Work

In this paper, we introduce a weighted joint distribution optimal transport framework to solve the cross-scenario problem in FAS, applicable in both single- and multi-source DA scenarios. The framework consists of three parts: joint distribution estimation, joint distribution optimal transport, and domain weight optimization. We compute the optimal transportation mappings between each source domain and the target domain based on the joint distribution cost matrices, and simultaneously optimize the feature representation, label estimation, and domain weights using the weighted optimal transportation loss, the weighted source domain classification loss, and the target domain entropy loss. To validate the effectiveness of our method under both single- and multi-source DA settings, we conduct extensive experiments on four public FAS datasets with only 2D attacks and three large-scale FAS datasets with 3D attacks. The experimental results show that our method achieves state-of-the-art results on all three protocols under the single-source setting, and under the multi-source setting it outperforms most multi-source DA methods. However, there is still room for performance improvement under the O&C&M\(\rightarrow \)I setting. In future work, we will design an optimal transport framework applicable to domain generalization-based and multi-modal FAS.