1 Introduction

Fig. 1

Left: The original marginal distributions of the samples from three source domains and one target domain. Middle: Conventional multi-source DA methods aim to align the marginal distributions of source and target domain samples to learn a common feature space, which may fail to get a discriminative class boundary. Right: Our proposed multi-source DA method aims to align the joint distributions of sample features and their corresponding labels between source and target domains, which has the potential to learn a more discriminative class boundary

In recent years, face recognition (FR) techniques have been used in various identity authentication scenarios. However, existing FR systems are vulnerable to spoofing attacks such as printed photos, video replay, 3D facial masks, and adversarial attacks (Yu et al., 2022). To secure FR systems from various physical attacks, both industry and academia have paid increasing attention to face anti-spoofing (FAS). In the past two decades, various FAS methods have been proposed, including both traditional methods and deep learning-based methods (Yu et al., 2022). Traditional methods based on handcrafted descriptors (Komulainen et al., 2013; Patel et al., 2016) can be further classified into texture-based, motion-based, and image analysis-based methods. Subsequently, hybrid (handcrafted + deep learning) (Rehman et al., 2020; Khammari, 2019) and end-to-end deep learning-based methods (Liu et al., 2018; Yu et al., 2020; Zhang et al., 2020) have also been proposed.

However, the performance of most FAS methods drops significantly in cross-scenario settings due to variations in lighting, facial appearance, or camera quality. In view of this, most existing solutions (Liu et al., 2022; Wang et al., 2020; Jia et al., 2020; Wang et al., 2023, 2022, 2021; Chen et al., 2021; Jiang et al., 2023) focus on improving the cross-scenario capability of deep FAS models via multi-source domain generalization (DG), which assumes that there exists a generalized feature space shared by the given source domains and the unseen target domain. By adapting multiple source datasets to learn a common feature space, a model trained on the source domains can generalize well to the unseen target domain. In practice, however, a large number of unlabeled facial images is available from existing FR systems, so domain adaptation (DA) forms a natural learning framework for FAS. The DA approach aids cross-scenario FAS by extracting discriminative feature representations from labeled source data and unlabeled target data. It can thus exploit the rich information in the unlabeled target domain and obtain a more robust decision boundary.

In most DA methods, the distributions of source and target features are matched in a learned feature space, by using Maximum Mean Discrepancy (MMD) (Pei et al., 2018; Rahman et al., 2020), Correlation Alignment (CORAL) (Baochen et al., 2016), or Kullback–Leibler (KL) divergence (Zhuang et al., 2015). Another direction is based on adversarial training (Tzeng et al., 2017), where a discriminator (domain classifier) is trained to distinguish between the source and target representations. However, besides the large inter-class differences within each domain, intra-class differences are also obvious, and samples with different labels from different domains may lie closer to each other than samples with the same label from different domains, as illustrated in the left of Fig. 1. Consequently, fitting only the feature distributions of the source and target domains may, for instance, align the features of real samples from the source domains with fake samples from the target domain, as illustrated in the middle of Fig. 1, which is not conducive to classification. Therefore, unlike existing DA-based FAS methods, which attempt to align the marginal distributions in the feature space between the source and target domains, in this paper we consider the discrepancy in the joint distributions of features and labels of source and target data. In this way, the samples in the source and target domains are aligned based on both features and labels, so that samples with different labels from the same domain are separated, while samples with the same label from different domains are aggregated, as illustrated in the right of Fig. 1.

The main idea of this paper is to find optimal transportation mappings between the product spaces (of features and labels) of each source domain and the product space (of features and pseudo-labels) of the target domain. We first compute the cost matrices based on the joint distributions of each source domain and the target domain, and then compute the optimal transportation mappings while reducing the discrepancy between the joint distributions. The distribution inconsistencies are measured by Wasserstein distances (Cuturi et al., 2014). After obtaining the optimal transportation mappings, we learn a convex combination of the joint distributions of the source domains, which allows us to distribute the masses based on the similarities of the sources with the target in both the feature and pseudo-label spaces. Domain weights are updated together with the parameters of the feature extractor and classifier by training on the weighted transportation losses between each source domain and the target domain, the weighted source domain classification losses, and the target domain entropy loss. Here, the target domain entropy loss adaptively adjusts the parameters of the feature extractor and classifier to further fit the distribution of the target domain. Our idea of aligning the joint distributions is reflected in the definition of the cost matrix and acts on the transportation mapping. The reduction of domain discrepancy in our method comes from seeking a domain-invariant product space of features and labels, rather than a domain-invariant feature space. In fact, single-source joint distribution optimal transport is a degenerate case of its multi-source counterpart, in which the weight of the unique source domain always equals 1 and does not need to be trained along with the feature extractor and classifier. Our main contributions in this work can be summarized as follows:

  • Facing the cross-scenario FAS problem, we propose to reduce the discrepancy of domain distributions by aligning the joint distributions of both sample features and labels (or pseudo-labels) of the source and target domains in a common product space, which is largely different from existing methods.

  • To solve the multi-source DA-based FAS, we propose to utilize the Wasserstein distance to measure the distances between the joint distributions, and assign adaptively updated weights to each source domain based on the Wasserstein distances so as to take into account the contributions of different source domains to the target domain.

  • Extensive experimental results on four widely used 2D attack datasets and three recently published 3D attack datasets under both single- and multi-source domain adaptation settings (including both close-set and open-set) show the advantages of our proposed method for cross-scenario FAS. Our method achieves state-of-the-art results in all three protocols under the single-source setting; under the multi-source setting, it achieves state-of-the-art results in two of the three protocols and the second-best performance in the remaining 2D\(\rightarrow \)2D protocol.

2 Related Works

In this section, we will first introduce the DA-based methods for FAS. After that, the focus will be on reviewing the optimal transport-based DA methods and multi-source DA methods that are most relevant to our work.

2.1 Domain Adaptation for Face Anti-spoofing

The basic idea of the DA technique is to mitigate the distribution discrepancy between the source and target domains so that a model trained with the labeled source data can be well adapted to the unlabeled target data. Initially, a maximum mean discrepancy (MMD) based metric learning method was proposed for FAS to align the distributions of source features and target features (Li et al., 2018). Other major developments have focused on adversarial loss functions that prevent CNNs from distinguishing whether a sample comes from the source or the target domain (Wang et al., 2019, 2021; Jia et al., 2021). Specifically, Wang et al. (2021) proposed ML-Net, which combines center loss and triplet loss to jointly learn a feature representation for source data, and then adapted this representation to the target domain via UDA-Net and DR-Net. Jia et al. (2021) designed a marginal distribution alignment module (MDA) for domain-invariant feature learning and a conditional distribution alignment module (CDA) for centroid alignment of labeled features. In addition, Zhou et al. (2022) reformulated unsupervised DA-based FAS as a domain stylization problem: the target data is stylized with the source domain style through image translation to directly fit the target data to the source model. Yue et al. (2022) presented a cyclically disentangled feature translation network and proposed to generate pseudo-labeled images to train a generalizable classifier. Li et al. (2022) proposed a teacher-student framework to improve the cross-domain performance of FAS through single-class DA. Overall, most of these methods require a multi-stage training process, and all of them only consider aligning the marginal distributions, ignoring the role of source labels.

As we know, CASIA-FASD (Zhang et al., 2012), Idiap Replay-Attack (Chingovska et al., 2012), MSU-MFSD (Wen et al., 2015), and OULU-NPU (Boulkenafet et al., 2017) datasets have been widely used to study the DA-based FAS. However, these datasets are limited in data scale and attack types (print and replay) and recorded in controlled indoor scenarios. Recently, many new FAS datasets have been released and there are three major trends in the development of datasets: (1) large-scale data amount, (2) increasing number of novel attack types and complex recording conditions, and (3) multiple modalities. For example, CASIA-SURF 3DMask (Yu et al., 2020) is the first FAS dataset considering outdoor scenes with challenging lighting and it includes three mask decorations (i.e., masks with/without hair and glasses) recorded under six environmental conditions. CASIA-SURF HiFiMask (Liu et al., 2022) dataset contains more than 50,000 videos and it includes 3D mask attacks with three kinds of materials (transparent, plaster, and resin) recorded under six lighting conditions and six indoor/outdoor scenes. And Surveillance High-Fidelity Mask (Fang et al., 2024) dataset is captured under 40 surveillance scenes, and it has 232 3D attacks (high-fidelity masks), 200 2D attacks (posters, portraits, and screens), and 2 adversarial attacks. Besides, CASIA-SURF (Zhang et al., 2019) and CASIA-SURF Cross-ethnicity Face Anti-spoofing (CeFA) (Liu et al., 2021) datasets contain 3 modalities, i.e., RGB, Depth and IR.

Yu et al. (2020) proposed a Neural Architecture Search (NAS)-based approach for FAS. They presented Domain/Type-aware Meta-NAS, which leverages cross-domain/type knowledge for robust searching to improve the transferability of NAS across datasets and unknown attack types. Liu et al. (2022) proposed a training method for supervised FAS tasks, i.e., a contrastive context-aware learning framework, which accurately utilizes the rich context information (e.g., subjects, mask material, and illumination) between live faces and high-fidelity mask attack pairs. Fang et al. (2024) proposed a Contrastive Quality-Invariance Learning network to mitigate the performance degradation of FAS methods caused by low-quality images in surveillance scenarios. These works perform well in single-dataset scenarios, but they have weak generalization ability and cannot effectively solve DA-based FAS with 3D attacks. In this paper, we will study a DA-based FAS method dealing with both 2D and 3D attacks and generalize it to open-set DA, in which the target domain contains new types of attacks that differ from the source domains.

2.2 Optimal Transport Based Domain Adaptation

The optimal transport problem was first introduced by the French mathematician Gaspard Monge at the end of the 18th century as a way to find a minimal-effort solution for transporting a given mass of dirt into a given hole. Kantorovich (2006) extended the Monge problem from transport mappings to transportation plans. Later, new computational strategies were proposed that made it possible to apply optimal transport to the DA problem (Courty et al., 2016, 2017; Damodaran et al., 2018). The core of optimal transport theory applied to DA lies in learning the transformation between domains. In particular, Courty et al. (2016) proposed a regularized unsupervised optimal transport model to align the feature representations of source and target domains. They proposed two regularization schemes to encode the class structure of the source domain while estimating the transportation plan, thus reinforcing the intuition that samples of the same class must undergo similar transformations. Subsequently, Courty et al. (2017) proposed to minimize the optimal transportation loss between the joint distribution of the source domain and the estimated joint distribution of the target domain. Later, this method was extended to deep learning frameworks (Damodaran et al., 2018), where the feature embedding is estimated simultaneously with the classifier by using an efficient stochastic optimization procedure. An important aspect of joint distribution optimal transport is that the optimization problem involves the joint distribution of both feature embeddings and sample labels, and the simultaneous use of feature and label information is the basis of most generalization bounds (Courty et al., 2017).

Fig. 2

An overview of the proposed weighted joint distribution optimal transport method for multi-source DA-based cross-scenario FAS (WJDOT-FAS). The training phase of this method consists of three modules: joint distribution estimation, joint distribution optimal transport, and domain weight optimization. The joint distributions are determined by the feature extractor \(g_{\theta _1}\) and the classifier \(f_{\theta _2}\). Transportation mappings \(\varvec{\gamma }_{s_k\text{- }t}\), and domain weights \(w_k\) are alternately updated by aligning the joint distributions of each source domain and the target domain. Once the parameters of \(g_{\theta _1}\) and \(f_{\theta _2}\) have been well trained, they are used to predict the sample labels of the target domain in the testing phase

2.3 Multi-source Domain Adaptation

For the multi-source DA problem, Yishay et al. (2009) pointed out that learning a weighted combination of multiple source distributions can generalize better to the target domain under certain theoretical guarantees. Judy et al. (2018) proposed an algorithm for distribution-weighted combinatorial solutions based on square loss and cross-entropy loss to solve the multi-source DA problem. Recently, many deep networks have been designed specifically for the multi-source DA problem. Peng et al. (2019) proposed a multi-source domain adaptive moment matching network (M3SDA), which aims to transfer knowledge learned from multiple labeled source domains to an unlabeled target domain by dynamically aligning the moments of feature distributions. Zhao et al. (2018) proposed the Multi-Source Domain Adversarial Network (MDAN), which approaches the DA problem by optimizing task-adaptive generalization bounds. Wen et al. (2020) pointed out that, in order to achieve the optimal generalization upper bound for the target domain, a trade-off is needed between including all source domains to increase the number of valid samples and excluding less relevant domains to avoid negative transfer. Based on this theory, they proposed a domain aggregation network (DARN), which dynamically adjusts the weights of each source domain during end-to-end training. Xu et al. (2018) proposed a deep cocktail network (DCTN) to solve the problem of domain and category shift between multiple sources. Kang et al. (2020) proposed the Contrastive Adaptation Network (CAN), which optimizes a new metric, the contrastive domain discrepancy, explicitly modeling intra-class and inter-class domain discrepancies. Besides, they utilized weights derived from the inverse classification loss of intra-domain samples as the domain weights for network updates. Li et al. (2021) proposed a multi-source contribution learning network (MSCLDA) that considers source contributions when predicting a target task. This method simultaneously learns the similarity and diversity of domains by extracting multi-view features and uses an MMD-based metric as the domain weights. Zhao et al. (2020) proposed a multi-source distillation network (MDDA), which not only considers the different distances between multiple sources and the target but also investigates the different similarities between source samples and target samples; a metric based on the optimal transport distance is used as the domain weights. Most of these multi-source DA methods measure the discrepancy between source and target distributions based on feature distributions alone, and they are not capable of adaptively adjusting the source domain weights. In contrast to the above methods, Turrisi et al. (2022) exploited the diversity of source distributions by adjusting the weights of different source joint distributions to fit the target task, aiming to simultaneously find the optimal transport-based alignment between the source and target joint distributions and the reweighting of the source distributions based on the transportation loss. Inspired by Turrisi et al. (2022), this paper adopts the idea of joint distribution optimal transport to solve single-source and multi-source DA-based cross-scenario FAS. To the best of our knowledge, this is the first work that uses weighted joint distribution optimal transport to solve cross-scenario FAS.

3 Proposed Method

In this paper, we propose a weighted joint distribution optimal transport method for multi-source DA-based FAS (WJDOT-FAS). As shown in Fig. 2, the training phase of the proposed method consists of three modules, namely joint distribution estimation, joint distribution optimal transport, and domain weight optimization. In particular, given the labeled facial samples from K source domains and the unlabeled samples from the target domain, we first estimate the joint distributions of sample features and labels (or pseudo-labels) for each domain by using a pre-trained feature extractor and a randomly initialized classifier. Then, the cost matrices between the joint distributions of each source domain and the target domain are computed by using a weighted distance metric over both the feature space and the label space. Once the cost matrices are estimated, we can compute the optimal transportation mappings between the joint distributions of each source domain and the target domain by solving Lp-L1 optimal transport problems. These optimal transportation mappings map the joint distributions of each source domain and the target domain into a new common space, in which their domain discrepancies can be well aligned. Considering that different source domains contribute differently to the target domain, a domain weight is defined for each source domain; these weights are obtained by solving a convex optimization problem involving the classification losses of the source domains, the entropy loss of the target domain, and the optimal transportation losses from each source domain to the target domain. Meanwhile, the parameters of the feature extractor and the classifier are also updated, and the learnable parameters and the computations of the three modules are updated alternately. Once the feature extractor and classifier have been well trained, they are used to predict the sample labels of the target domain in the testing phase. More details of the proposed method are introduced in the following paragraphs.

3.1 Joint Distribution Estimation

The aim of the joint distribution estimation module is to estimate the joint distributions of each source domain and the target domain. The joint distribution is defined in the product space of the sample feature space and the sample label space. We are given the labeled source data \({\mathcal {D}}_{s_k}=\big \{{\varvec{x}}_{i_k}^{s_k},{\varvec{y}}_{i_k}^{s_k}\big \}_{i_k=1}^{n_{s_k}}\) (\(k=1,\ldots ,K\), where K is the number of source domains) and the unlabeled target data \({\mathcal {D}}_{t}=\big \{{\varvec{x}}_j^{t}\big \}_{j=1}^{n_t}\), where \(n_{s_k}\) and \(n_{t}\) denote the sample numbers of the k-th source domain and the target domain. Our joint distribution estimation module is composed of two parts: a feature extraction function (\(g:\varvec{{\mathcal {X}}}\rightarrow \varvec{{\mathcal {Z}}}\subseteq {\mathbb {R}}^d\)), which maps the given facial samples from both the source domains and the target domain into the feature space, and a classifier (\(f:\varvec{{\mathcal {Z}}}\rightarrow \varvec{{\mathcal {Y}}}\subseteq {\mathbb {R}}^2\)), which maps the sample features into the label space. The sample features of the k-th source domain and the target domain are denoted as \(\big \{{\varvec{z}}_{i_k}^{s_k}\big \}_{i_k=1}^{n_{s_k}}\), i.e. \(\big \{g\big ({\varvec{x}}_{i_k}^{s_k}\big )\big \}_{i_k=1}^{n_{s_k}}\), and \(\big \{{\varvec{z}}_j^t\big \}_{j=1}^{n_t}\), i.e. \(\big \{g\big ({\varvec{x}}_j^t\big )\big \}_{j=1}^{n_t}\), respectively. Let \(\mu _{s_k}\) and \(\mu _{t}\) denote the marginal feature distributions of the k-th source domain and the target domain. Since the facial samples are in discrete form, we consider the empirical versions of \(\mu _{s_k}\) and \(\mu _{t}\), defined as follows:

$$\begin{aligned} {\hat{\mu }}_{s_k}=\frac{1}{n_{s_k}}\sum \limits _{i_k}\delta _{{\varvec{z}}_{i_k}^{s_k}}, \end{aligned}$$
(1)
$$\begin{aligned} {\hat{\mu }}_{t}=\frac{1}{n_{t}}\sum \limits _{j}\delta _{{\varvec{z}}_{j}^{t}}, \end{aligned}$$
(2)

where \(\delta _{{\varvec{z}}_{i_k}^{s_k}}\) and \(\delta _{{\varvec{z}}_{j}^{t}}\) are the Dirac functions at points \({\varvec{z}}_{i_k}^{s_k}\in {\mathbb {R}}^{d}\) and \({\varvec{z}}_{j}^{t}\in {\mathbb {R}}^{d}\) respectively.

Following the above notations, we assume there exist two distinct joint probability distributions \({\mathcal {P}}_{s_k}=({\varvec{z}}^{s_k},{\varvec{y}}^{s_k})_{{\varvec{z}}^{s_k}\sim \mu _{s_k}}\) and \({\mathcal {P}}_{t}=\big ({\varvec{z}}^{t},f({\varvec{z}}^{t})\big )_{{\varvec{z}}^{t}\sim \mu _{t}}\), whose empirical versions can be defined as follows:

$$\begin{aligned} \hat{{\mathcal {P}}}_{s_k}=\frac{1}{n_{s_k}}\sum \limits _{i_k}\delta _{{\varvec{z}}_{i_k}^{s_k},{\varvec{y}}^{s_k}_{i_k}}, \end{aligned}$$
(3)
$$\begin{aligned} \hat{{\mathcal {P}}}_{t}=\frac{1}{n_{t}}\sum \limits _{j}\delta _{{\varvec{z}}_{j}^{t},f\big ({\varvec{z}}_{j}^{t}\big )}, \end{aligned}$$
(4)

where \(\delta _{{\varvec{z}}_{i_k}^{s_k}, {\varvec{y}}^{s_k}_{i_k}}\) and \(\delta _{{\varvec{z}}_{j}^{t},f({\varvec{z}}_{j}^{t})}\) are the Dirac functions at points \(\big ({\varvec{z}}_{i_k}^{s_k},{\varvec{y}}^{s_k}_{i_k}\big )\in {\mathbb {R}}^{d+2}\) and \(\big ({\varvec{z}}_{j}^{t},f\big ({\varvec{z}}_{j}^{t}\big )\big )\in {\mathbb {R}}^{d+2}\) respectively.

In particular, we use a pre-trained ResNet-18 CNN (or Transformer) backbone to extract the deep features of the given facial samples and use a randomly initialized classifier to compute pseudo-labels. The joint distributions of the source samples are estimated from the features produced by the feature extractor and the true labels, while the joint distribution of the target samples is estimated from the extracted features and the pseudo-labels computed by the classifier.
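
To make this module concrete, the following minimal sketch (assuming PyTorch and torchvision; the names `feature_extractor`, `classifier`, and `estimate_joint` are illustrative, not the exact implementation) builds g from an ImageNet-pretrained ResNet-18 and pairs features with labels or pseudo-labels:

```python
import torch
import torch.nn as nn
from torchvision.models import resnet18

# g: X -> Z (512-d features); drop the final FC layer of ResNet-18
feature_extractor = nn.Sequential(
    *list(resnet18(weights="IMAGENET1K_V1").children())[:-1], nn.Flatten()
)
# f: Z -> Y (2 classes, real/fake), randomly initialized
classifier = nn.Linear(512, 2)

def estimate_joint(x, y_onehot=None):
    """Return the support of an empirical joint distribution (Eqs. 3-4).

    Source batch: pair features with one-hot true labels.
    Target batch (y_onehot=None): pair features with classifier pseudo-labels.
    Each pair carries uniform mass 1/n.
    """
    z = feature_extractor(x)
    if y_onehot is None:
        y_onehot = torch.softmax(classifier(z), dim=1)  # soft pseudo-labels f(z)
    return z, y_onehot
```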

3.2 Joint Distribution Optimal Transport

3.2.1 Cost Matrix Computation

Optimal transport (OT) (Cédric et al., 2008) is an efficient way of transforming one distribution into another under a given cost function, and it can be used to compute the Wasserstein distance between probability distributions. Formally, OT searches for a transportation mapping \(\varvec{\gamma } \in \varvec{\Pi }(\mathcal {{\hat{P}}}_{s},\mathcal {\hat{P}}_{t})\) between two distributions \(\mathcal {\hat{P}}_{s}\) and \(\mathcal {\hat{P}}_{t}\) that yields a minimal displacement cost. In a discrete setting (both distributions are empirical), the Wasserstein distance between \(\mathcal {\hat{P}}_{s}\) and \(\mathcal {\hat{P}}_{t}\) calculated by the OT method can be expressed in the following form:

$$\begin{aligned} W(\mathcal {\hat{P}}_{s},\mathcal {\hat{P}}_{t})=\min \limits _{\varvec{\gamma }\in \varvec{\Pi }(\mathcal {\hat{P}}_{s},\mathcal {\hat{P}}_{t})}{\langle \varvec{\gamma },{\varvec{C}}\rangle _F}. \end{aligned}$$
(5)

Here, \(\langle \cdot ,\cdot \rangle _F\) is the Frobenius inner product, and \({\varvec{C}}\in {\mathbb {R}}^{n_{s}\times n_t}\) is the cost matrix representing the pairwise costs between the joint distributions of the source domain samples and the target domain samples. \(\varvec{\Pi }(\mathcal {\hat{P}}_{s},\mathcal {\hat{P}}_{t})\) denotes the space of joint probability couplings of the source and target distributions, and \(\varvec{\gamma }\) is the transportation mapping, a matrix of size \(n_{s}\times n_t\).
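
As a toy illustration of Eq. (5), the snippet below computes the discrete Wasserstein cost between two small empirical distributions with uniform masses; it assumes the open-source POT library (`pip install pot`), which is not necessarily what was used in the paper:

```python
import numpy as np
import ot  # POT: Python Optimal Transport

ns, nt = 4, 5
xs = np.random.randn(ns, 2)    # support of the source distribution
xt = np.random.randn(nt, 2)    # support of the target distribution
a = np.full(ns, 1.0 / ns)      # uniform source masses (Dirac weights)
b = np.full(nt, 1.0 / nt)      # uniform target masses
C = ot.dist(xs, xt)            # pairwise squared Euclidean costs
W = ot.emd2(a, b, C)           # min over Pi(a, b) of <gamma, C>_F
```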

Joint distribution optimal transport is applied to our method, which is reflected in the definition of the cost matrix \({\varvec{C}}\). The underlying idea is to align the joint distributions of features and labels from source and target domains instead of only considering the marginal distributions of features. Next, we will illustrate how to calculate \({\varvec{C}}\) under joint distributions in the case where only one source domain is available. The cost matrix \({\varvec{C}}\) associated with the product space of features and labels can be expressed as the gap between the joint distributions of the source and target domains, that is:

$$\begin{aligned} {\varvec{C}}\triangleq d(\hat{{\mathcal {P}}}_{s}, \hat{{\mathcal {P}}}_{t}). \end{aligned}$$
(6)

Specifically, the element of the i-th row and j-th column in \({\varvec{C}}\) can be expressed as a joint cost measure of costs in the feature and label spaces of the i-th source sample and j-th target sample, combining both the gap between sample features and the discrepancy between sample labels (pseudo labels for the target domain). According to Damodaran et al. (2018), the specific form of \({\varvec{C}}_{ij}\) is defined as follows:

$$\begin{aligned} {\varvec{C}}_{ij}&\triangleq c\big (g\big ({\varvec{x}}_i^s\big ),{\varvec{y}}_i^s;g\big ({\varvec{x}}_j^t\big ),f\big (g\big ({\varvec{x}}_j^t\big )\big )\big )\\&=\parallel g\big ({\varvec{x}}_i^s\big )-g\big ({\varvec{x}}_j^t\big )\parallel ^2+\beta {\mathcal {L}}_{CE}\big ({\varvec{y}}_i^s,f\big (g\big ({\varvec{x}}_j^t\big )\big )\big ), \end{aligned}$$
(7)

where \(\parallel g\big ({\varvec{x}}_i^s\big )-g\big ({\varvec{x}}_j^t\big )\parallel ^2\) compares the compatibility of the features of the source and target samples and is an \(l_2^2\) distance, while \({\mathcal {L}}_{CE}\big ({\varvec{y}}_i^s,f\big (g\big ({\varvec{x}}_j^t\big )\big )\big )\) is a cross-entropy loss that considers the gap between the true label of the i-th source sample and the pseudo-label of the j-th target sample. The parameter \(\beta \) is a scalar weighing the strength of the label cost relative to the feature cost. The definition of \({\varvec{C}}_{ij}\) in Eq. (7) guarantees that our optimal transport is defined under the joint distribution setting. If we only consider aligning the marginal distributions of source and target domain features, then \({\varvec{C}}_{ij}=\parallel g\big ({\varvec{x}}_i^s\big )-g\big ({\varvec{x}}_j^t\big )\parallel ^2\), i.e. the basic form of the cost matrix in OT.
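
A sketch of Eq. (7) in code, reusing the illustrative `classifier` from the sketch in Sect. 3.1 (the one-hot source labels `ys_onehot` and the weight `beta` are assumptions of this example):

```python
import torch

def joint_cost_matrix(zs, ys_onehot, zt, beta):
    """Pairwise joint cost of Eq. (7) between a source and a target batch."""
    # feature term: squared l2 distance between every source/target feature pair
    feat_cost = torch.cdist(zs, zt, p=2) ** 2              # (ns, nt)
    # label term: cross-entropy between source one-hot labels
    # and target pseudo-labels
    log_pt = torch.log_softmax(classifier(zt), dim=1)      # (nt, 2)
    label_cost = -(ys_onehot @ log_pt.T)                   # (ns, nt)
    return feat_cost + beta * label_cost
```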

3.2.2 Transportation Mapping Computation

In this section, we will introduce how to compute the transportation mapping \(\varvec{\gamma }\), considering the case of single-source DA. As shown in Eq. (5), OT searches a transportation mapping \(\varvec{\gamma } \in \varvec{\Pi }(\mathcal {\hat{P}}_s,\mathcal {\hat{P}}_t)\) between two distributions \(\mathcal {\hat{P}}_{s}\) and \(\mathcal {\hat{P}}_{t}\), where \(\varvec{\Pi }(\mathcal {\hat{P}}_{s},\mathcal {\hat{P}}_{t})\) can be expressed mathematically in the following form:

$$\begin{aligned} \varvec{\Pi }(\mathcal {\hat{P}}_{s},\mathcal {\hat{P}}_{t})=\big \{\varvec{\gamma } \in ({\mathbb {R}}^+)^{n_{s}\times n_t}\,|\,\varvec{\gamma } {\varvec{1}}_{n_t}=\mathcal {\hat{P}}_{s},\varvec{\gamma } ^{\intercal } {\varvec{1}}_{n_{s}}=\mathcal {\hat{P}}_{t}\big \}, \end{aligned}$$
(8)

where \({\varvec{1}}_{n_{s}}\) and \({\varvec{1}}_{n_t}\) are the \(n_{s}\)- and \(n_t\)-dimensional vectors of ones. With the definition of \({\varvec{C}}_{ij}\) in Eq. (7), we can compute the transportation mapping based on the following equation:

$$\begin{aligned} \varvec{{\hat{\gamma }}} _0=\mathop {\arg \min }\limits _{\varvec{\gamma }\in \varvec{\Pi }(\mathcal {\hat{P}}_{s},\mathcal {\hat{P}}_{t})}{\langle \varvec{\gamma },{\varvec{C}}\rangle _F}. \end{aligned}$$
(9)

Equation (9) is a linear programming problem and can be solved by the network simplex algorithm, but solving it becomes difficult when the sample size is large. To solve this problem more efficiently, an entropy-regularized version of the above optimal transport problem has been proposed (Cuturi et al., 2013) and can be formulated as follows:

$$\begin{aligned} \varvec{{\hat{\gamma }}} _0^\lambda = \mathop {\arg \min } \limits _{\varvec{\gamma } \in \varvec{\Pi }(\mathcal {\hat{P}}_{s},\mathcal {\hat{P}}_{t})}{\langle \varvec{\gamma },{\varvec{C}} \rangle }_F+\lambda \Omega _e(\varvec{\gamma } ), \end{aligned}$$
(10)

where \(\Omega _e(\varvec{\gamma })=\sum _{i,j}{\varvec{\gamma }(i,j)\textrm{log} \varvec{\gamma }(i,j)}\) computes the negative entropy of \(\varvec{\gamma }\). This regularization is introduced because most of the elements of \(\varvec{{\hat{\gamma }}} _0\), as a solution of a linear program, are zero; a smoother, less sparse version of the transport can be found by increasing the entropy. In particular, \(\varvec{{\hat{\gamma }}}_0^\lambda \) can be solved by using the Sinkhorn algorithm (Cuturi et al., 2013).
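
For instance, with the POT library, the entropic problem of Eq. (10) can be solved in a few lines (a sketch; `C` is the cost matrix from Eq. (7), detached because the mapping is computed with the networks held fixed):

```python
import numpy as np
import ot

C_np = C.detach().cpu().numpy()
a = np.full(C_np.shape[0], 1.0 / C_np.shape[0])  # uniform source masses
b = np.full(C_np.shape[1], 1.0 / C_np.shape[1])  # uniform target masses
gamma = ot.sinkhorn(a, b, C_np, reg=0.1)         # reg plays the role of lambda
```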

Further, we resort to a class regularization term to estimate a better transport using the source sample label information. Our goal is to penalize the coupling of matching source samples with different labels to the same target sample. Thereby, the new optimization problem can be written in the following form:

$$\begin{aligned} \varvec{{\hat{\gamma }}} _0^\eta = \mathop {\arg \min } \limits _{\varvec{\gamma } \in \varvec{\Pi }(\mathcal {\hat{P}}_{s},\mathcal {\hat{P}}_{t}) }{\langle \varvec{\gamma },{\varvec{C}} \rangle }_F+\lambda \Omega _e(\varvec{\gamma } )+\eta \Omega _c(\varvec{\gamma }), \end{aligned}$$
(11)

where \(\eta \ge 0\) and \(\Omega _c(\cdot )\) is the class regularization term. In this work, we use group sparse regularization with the aim of making a given target sample receive masses from source samples with the same label. This regularization term is defined as:

$$\begin{aligned} \Omega _c(\varvec{\gamma } )=\sum \limits _{j}\sum \limits _{cl}\left\| \varvec{\gamma }({I} _{cl},j) \right\| _{1}^{1/2} \end{aligned}$$
(12)

where \(\left\| \cdot \right\| _{1}\) denotes the \(l_{1}\) norm and \(I _{cl}\) contains the indices of the rows of \(\varvec{\gamma }\) related to source domain samples of class cl. Thus, \(\varvec{\gamma }({I} _{cl},j)\) is a vector containing the coefficients of the j-th column of \(\varvec{\gamma }\) associated with class cl. In our case, cl stands for real or fake. This regularization term is called the Lp-L1 regularization term (here, \(p = 1/2\)) (Courty et al., 2014); when the majorization-minimization technique is applied to the Lp-L1 regularization, the problem can be transformed into Eq. (10) and solved by an efficient Sinkhorn-Knopp algorithm (Courty et al., 2016).

Equations (9), (10), and (11) are called the EMD solver, the Sinkhorn solver, and the Lp-L1 solver, respectively. After calculating the optimal transportation mapping, the Wasserstein distance between the source and target domain distributions is obtained according to Eq. (5). By computing the transportation mapping under joint distribution optimal transport, samples with similar features and common labels can be matched in the common product space, resulting in better discrimination.
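
A corresponding sketch of the Lp-L1 solver of Eq. (11), again assuming POT, whose `ot.da.sinkhorn_lpl1_mm` routine implements the majorization-minimization scheme; `ys` denotes the integer source labels (0 = fake, 1 = real), and `a`, `b`, `C_np` are as in the previous snippet:

```python
# Lp-L1 solver (Eq. 11) and the resulting Wasserstein distance (Eq. 5).
gamma = ot.da.sinkhorn_lpl1_mm(a, ys, b, C_np, reg=0.1, eta=0.1)
W = float(np.sum(gamma * C_np))   # Frobenius inner product <gamma, C>
```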

3.3 Domain Weight Optimization

To solve multi-source DA-based FAS, the weighting of each source domain is an important factor for the generalization ability of the final classifier on the target domain. We propose to assign adaptively updated weights to each source domain based on the Wasserstein distances between the joint distributions of each source domain and the target domain. For the FAS classification problem, these weights can be computed by solving a convex optimization problem involving the Wasserstein distances (optimal transportation losses) between the joint distributions of each source domain and the target domain, together with the classification losses of the different source domains.

The Wasserstein distances (optimal transportation losses) between the joint distributions of each source domain and the target domain can be computed by solving the optimal transport problems in Eq. (5). They measure the degree of joint distribution alignment between the source and target domains; clearly, the better the distributions are aligned, the better the generalization on the target domain. Specifically, we first compute the cost matrices of the joint distributions between the samples of each source domain and the target domain by Eq. (7). Then, the optimal transportation mappings of the joint distributions from each source domain to the target domain are computed by Eq. (11). Finally, the Wasserstein distances (optimal transportation losses) between the joint distributions can be computed. The Wasserstein distance (optimal transportation loss) from the k-th source domain to the target domain is defined as:

$$\begin{aligned} {\mathcal {L}}_{s_k\text{- } t}=\sum \limits _{i_k}\sum \limits _j{\hat{\varvec{\gamma }}_{{i_k}j}}^{s_k}d\big (g\big ({\varvec{x}}_{i_k}^{s_k}\big ),{\varvec{y}}_{i_k}^{s_k};g\big ({\varvec{x}}_j^t\big ),f\big (g\big ({\varvec{x}}_j^t\big )\big )\big ). \end{aligned}$$
(13)
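
In code, with the mapping held fixed, Eq. (13) is just the Frobenius inner product of the transportation mapping with the differentiable cost matrix, so gradients flow into g and f through C (a sketch under the same assumptions as above):

```python
gamma_t = torch.as_tensor(gamma, dtype=C.dtype, device=C.device)  # fixed mapping
loss_sk_t = (gamma_t * C).sum()   # C = joint_cost_matrix(...), requires grad
```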

Moreover, to better utilize the source domain information to train the final classifier, we employ the adaptive cross-entropy (AdaCE) loss (Jia et al., 2021) to measure the classification error of the classifier for each source domain. AdaCE loss is defined by adjusting the weight of the cross-entropy loss adaptively based on the classification accuracy. For the k-th source domain, it can be defined as follows:

$$\begin{aligned} {\mathcal {L}}_{s_k}&=\frac{1}{n_{s_k}}\sum \limits _{i_k}{\mathcal {L}}_{s_k}\big ({\varvec{y}}_{i_k}^{s_k},f\big (g\big ({\varvec{x}}_{i_k}^{s_k}\big )\big )\big )\\&=\frac{1}{n_{s_k}}\sum \limits _{i_k}\Big (1-e^{-{\mathcal {L}}_{CE}\big ({\varvec{y}}_{i_k}^{s_k},f\big (g\big ({\varvec{x}}_{i_k}^{s_k}\big )\big )\big )}\Big )^\alpha {\mathcal {L}}_{CE}\big ({\varvec{y}}_{i_k}^{s_k},f\big (g\big ({\varvec{x}}_{i_k}^{s_k}\big )\big )\big ), \end{aligned}$$
(14)

where \({\mathcal {L}}_{CE}(\cdot ,\cdot )\) is the cross-entropy loss, and \(\alpha \) is a hyper-parameter. For the i-th sample of the k-th source domain, \({\mathcal {L}}_{CE}(\cdot ,\cdot )\) is defined as:

$$\begin{aligned} {\mathcal {L}}_{CE}\big ({\varvec{y}}_{i_k}^{s_k},f\big (g\big ({\varvec{x}}_{i_k}^{s_k}\big )\big )\big )=-{\varvec{y}}_{i_k}^{s_k} \textrm{log}\big (f\big (g\big ({\varvec{x}}_{i_k}^{s_k}\big )\big )\big ). \end{aligned}$$
(15)
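
A sketch of Eqs. (14)-(15) in code (with integer labels `ys` and \(\alpha = 2\) as in Sect. 4.2):

```python
import torch
import torch.nn.functional as F

def adace_loss(logits, ys, alpha=2.0):
    """AdaCE loss of Eq. (14): CE reweighted by classification difficulty."""
    ce = F.cross_entropy(logits, ys, reduction="none")  # per-sample CE (Eq. 15)
    weight = (1.0 - torch.exp(-ce)) ** alpha            # small for easy samples
    return (weight * ce).mean()
```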

To further refine the parameters of the FAS classifier, we feed the unlabeled target domain data to the classifier and refer to the entropy loss proposed in Jia et al. (2021), which is expressed as follows:

$$\begin{aligned} {\mathcal {L}}_t=-\frac{1}{n_{t}}\sum \limits _j {f\big (g\big ({\varvec{x}}_j^t\big )\big ) \textrm{log} f\big (g\big ({\varvec{x}}_j^t\big )\big )}. \end{aligned}$$
(16)
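
A one-function sketch of Eq. (16):

```python
def target_entropy_loss(logits_t):
    """Entropy of the classifier's predictions on unlabeled target samples."""
    p = torch.softmax(logits_t, dim=1)
    return -(p * torch.log(p + 1e-8)).sum(dim=1).mean()  # eps for stability
```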

Once the optimal transportation loss functions from different source domains to the target domain, the classification loss functions related to different source domains as well as the entropy loss function related to the target domain have been defined, we can compute the domain weights and update the network parameters of both feature extractor and classifier by solving the following convex optimization problem:

$$\begin{aligned} \Big (g^{(n+1)}_{\theta _1},f^{(n+1)}_{\theta _2},w_k^{(n+1)}\Big )=\mathop {\arg \min } \limits _{g^{(n)}_{\theta _1},f^{(n)}_{\theta _2},w_k^{(n)}}{\mathcal {L}}_{total}, \end{aligned}$$
(17)
$$\begin{aligned} {\mathcal {L}}_{total}=\sum \limits _k w_k({\mathcal {L}}_{s_k}+\lambda _1 {\mathcal {L}}_{s_k\text{- } t})+\lambda _2{{\mathcal {L}}_t}, \end{aligned}$$
(18)

where \(\lambda _1\) and \(\lambda _2\) are the trade-off parameters, and \(w_k\) denotes the domain weight related to k-th source domain.

The domain weights are continuously updated together with the network parameters. The updated networks are then used again for the joint distribution estimation of the source and target domains, then for the joint distribution optimal transport, and finally for the domain weight optimization. That is to say, the three modules are learned and updated alternately. Once the network parameters of the feature extractor and classifier have been well trained, they are used to predict the sample labels of the target domain in the testing phase. The whole training process of the proposed weighted joint distribution optimal transport method for multi-source DA-based FAS is shown in Algorithm 1. It is worth noting that if only one source domain is available, \(w_k\) always equals 1.

Algorithm 1

Weighted joint distribution optimal transport method for multi-source DA-based FAS
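
The sketch below condenses the alternation of Algorithm 1, reusing the illustrative helpers from the previous subsections; the softmax parametrization of the domain weights, the optimizers, and all batch variables (`xs_batch`, `ys`, `ys_onehot`, `xt_batch`, `a`, `b`, `K`, `lambda1`, `lambda2`, `num_steps`) are assumptions of this example rather than the paper's exact recipe:

```python
params = list(feature_extractor.parameters()) + list(classifier.parameters())
opt_net = torch.optim.Adam(params, lr=1e-4)
w_logits = torch.zeros(K, requires_grad=True)        # one weight per source domain
opt_w = torch.optim.Adam([w_logits], lr=6e-3)

for step in range(num_steps):
    w = torch.softmax(w_logits, dim=0)               # keep weights on the simplex
    zt, _ = estimate_joint(xt_batch)                 # target features + pseudo-labels
    total = lambda2 * target_entropy_loss(classifier(zt))
    for k in range(K):
        zs, ys1h = estimate_joint(xs_batch[k], ys_onehot[k])
        C = joint_cost_matrix(zs, ys1h, zt, beta=0.1)            # Eq. (7)
        gamma = ot.da.sinkhorn_lpl1_mm(                          # Eq. (11), networks fixed
            a, ys[k].cpu().numpy(), b,
            C.detach().cpu().numpy(), reg=0.1, eta=0.1)
        gamma_t = torch.as_tensor(gamma, dtype=C.dtype, device=C.device)
        total = total + w[k] * (adace_loss(classifier(zs), ys[k])
                                + lambda1 * (gamma_t * C).sum())  # Eq. (18)
    opt_net.zero_grad(); opt_w.zero_grad()
    total.backward()                                 # updates theta_1, theta_2, and w jointly
    opt_net.step(); opt_w.step()
```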

4 Experimental Results

Fig. 3

Examples of the real (first row) and fake (second row) faces from the CASIA-FASD (Zhang et al., 2012), OULU-NPU (Boulkenafet et al., 2017), CASIA-SURF HiFiMask (Liu et al., 2022), and Surveillance High-Fidelity Mask (Fang et al., 2024) databases. It is easy to see that there exists a large inter-domain gap, such as in lighting, background, and attack types, which results in significant distribution discrepancies among the different datasets

4.1 Datasets

To evaluate the effectiveness of the proposed WJDOT-FAS method for multi-source DA-based FAS, we conducted experiments on four public datasets with only 2D attack types, namely CASIA-FASD (Zhang et al., 2012), Idiap Replay-Attack (Chingovska et al., 2012), MSU-MFSD (Wen et al., 2015), and OULU-NPU (Boulkenafet et al., 2017). For simplicity, they are denoted as C, I, M, and O in the following. Besides, we also conducted experiments on three large-scale public datasets with 3D attack types, namely CASIA-SURF 3DMask (Yu et al., 2020), CASIA-SURF HiFiMask (Liu et al., 2022), and Surveillance High-Fidelity Mask (Fang et al., 2024). For simplicity, they are denoted as 3DMask, HiFiMask, and SuHiFiMask in the following. In addition, we also demonstrated the effectiveness of our WJDOT-FAS method for open-set DA by using datasets containing only 2D attack types as the source domains and a dataset containing 3D attack types as the target domain.

The \(\textit{CASIA-FASD}\) (Zhang et al., 2012) dataset consists of 600 videos of real and attack attempts of 50 different subjects. There are three different image qualities in the dataset: low, normal, and high, which are captured with three different cameras (a Sony NEX-5 camera with 1280\(\times \)720 resolution and two different USB cameras with 640\(\times \)480 resolution). The face attacks include: distorted photo attacks, cut photo attacks, and video attacks. The \(\textit{Idiap Replay-Attack}\) (Chingovska et al., 2012) dataset consists of 1200 videos of real and attack attempts on 50 different subjects. The camera on the MacBook is used to collect the dataset with a resolution of 320 \(\times \) 240 under two conditions: (i) a control condition with a uniform background and fluorescent lights; and (ii) an unfavorable condition with a non-uniform background and daylight. Three types of deceptive attacks are designed: print attack, mobile attack, and high definition attack. The \(\textit{MSU-MFSD}\) (Wen et al., 2015) dataset consists of 440 videos from 55 subjects. The face videos are taken by two types of cameras (MacBook Air camera and Google Nexus 5 Android phone camera). The resolutions are 640 \(\times \) 480 and 720 \(\times \) 480. There are mainly two different spoofing attacks, the print photo attack and the replay video attack. The \(\textit{Oulu-NPU}\) (Boulkenafet et al., 2017) dataset consists of 4950 real and attack videos from 55 subjects. These videos are recorded with the front cameras of 6 mobile devices (Samsung Galaxy S6 edge, HTC Desire EYE, MEIZU X5, ASUS Zenfone Selfie, Sony XPERIA C5 Ultra Dual and OPPO N3). There are three different lighting conditions and background scenes. The types of presentation attacks are printing and video replay. These attacks are created using two printers and two display devices.

The \(\textit{CASIA-SURF 3DMask}\) (Yu et al., 2020) dataset contains 288 real face videos and 864 mask videos from 48 subjects. Six conditions are used for data collection, including normal, back-light, front-light, side-light, outdoor in shadow, and outdoor in sunlight. 3D masks of 48 subjects are collected by 3D printing technology. In addition to the use of naive masks, two more realistic decorative situations (i.e., masks with/without hair and glasses) are considered. The \(\textit{CASIA-SURF HiFiMask}\) (Liu et al., 2022) dataset consists of 75 subjects, and each subject provides high-fidelity plaster, resin, and transparent masks. Six different environments, six directional illuminations, and seven recording sensors are applied to the dataset. A total of 54,600 videos (13,650 live videos, 40,950 mask videos) are available in the dataset. The \(\textit{Surveillance High-Fidelity Mask}\) (Fang et al., 2024) dataset is captured in 40 real-life surveillance scenarios, such as movie theaters, security gates, and parking lots, representing a wide range of face recognition scenarios. It includes 101 participants of different ages and genders who perform various natural activities in their daily lives. In addition, the dataset contains multiple types of spoofing attacks such as high-fidelity masks, 2D attacks, and adversarial attacks.

In general, different datasets differ in acquisition devices, acquisition conditions, and attack types, which leads to discrepancies among domains. In addition, each dataset is collected with multiple acquisition devices and diverse attack types, so samples with the same label within a dataset can be distant from each other, which makes a joint distribution metric between domains necessary. Figure 3 shows some examples of real and fake facial samples in these datasets. It is easy to see that there exists a large inter-domain gap, such as in lighting, background, and attack types, which results in significant distribution discrepancies among the different datasets.

4.2 Experimental Settings and Implementation Details

In this paper, we perform experiments on 2D and 3D attack datasets under single- and multi-source domain settings for (open-set) DA. We set up three protocols under single- and multi-source domain settings, respectively.

Under the single-source domain setting:

  • Cross-dataset testing on 2D attack datasets (2D\(\rightarrow \)2D). We follow the protocols of (Wang et al., 2021; Jia et al., 2021), in which one of the I, C, and M datasets is used as the source domain and another as the target domain, giving six sets of experiments. We use the Half Total Error Rate (HTER), i.e., half of the sum of the false acceptance rate and the false rejection rate, as the evaluation metric. We first compute the Equal Error Rate (EER) and the corresponding threshold on the development set and then use this threshold to calculate the HTER on the testing set (a sketch of this computation is given after this list).

  • Cross-dataset testing on 3D attack datasets (3D\(\rightarrow \)3D). One of the 3DMask and HiFiMask datasets is used as the source domain and the other as the target domain. We use the HTER and the Area Under the Curve (AUC) as the evaluation metrics.

  • Cross-dataset testing for open-set DA (2D\(\rightarrow \)3D). One of the C and I datasets is used as the source domain and the SuHiFiMask dataset as the target domain. We also use the HTER and AUC as the evaluation metrics.
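
For reference, here is a minimal sketch of the HTER computation described above (illustrative; it assumes numpy and scikit-learn, with label 1 for real faces and scores being the predicted probability of the real class):

```python
import numpy as np
from sklearn.metrics import roc_curve

def eer_threshold(dev_labels, dev_scores):
    """Threshold at which FAR and FRR are (approximately) equal on the dev set."""
    fpr, tpr, thr = roc_curve(dev_labels, dev_scores)
    fnr = 1.0 - tpr
    return thr[np.nanargmin(np.abs(fnr - fpr))]

def hter(test_labels, test_scores, threshold):
    """Half Total Error Rate: mean of FAR and FRR at the dev-set EER threshold."""
    accept = test_scores >= threshold
    far = accept[test_labels == 0].mean()       # fake samples accepted as real
    frr = (~accept[test_labels == 1]).mean()    # real samples rejected
    return 0.5 * (far + frr)
```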

Under the multi-source domain setting:

  • Cross-dataset testing on 2D attack datasets (2D\(\rightarrow \)2D). We follow the protocols of (Zhou et al., 2022; Liu et al., 2022), in which three of the four datasets are used as source domains and the remaining one as target domain, so there are four sets of experiments. The HTER and AUC are used as the evaluation metrics.

  • Cross-dataset testing on 3D attack datasets (3D\(\rightarrow \)3D). Two of the 3DMask, HiFiMask, and SuHiFiMask datasets are used as the source domains, and the other dataset as the target domain. We use the HTER and AUC as the evaluation metrics.

  • Cross-dataset testing for open-set DA (2D\(\rightarrow \)3D). The C and I datasets are used as the source domains and one of the 3DMask, HiFiMask, and SuHiFiMask datasets is used as the target domain. We also use the HTER and AUC as the evaluation metrics.

Table 1 Comparison results (HTER (%)) between the proposed method and the state-of-the-art methods for cross-dataset testing under the single-source DA setting on the C, I, and M datasets

In our experiments, we use the MTCNN algorithm (Zhang et al., 2016) for face detection and alignment. We implemented our WJDOT-FAS method on the PyTorch platform and utilized ResNet-18 (He et al., 2016) and the Vision Transformer (ViT) (Touvron et al., 2021), both pre-trained on ImageNet, as our backbones. All detected face images are normalized to 256 \(\times \) 256 \(\times \) 3, and for the ResNet-18 backbone we further resize them to 224 \(\times \) 224 \(\times \) 3. The network structures of our feature extractor and classifier are the same as in (Jia et al., 2021) under the ResNet-18 backbone and the same as in (Liu et al., 2022) under the ViT backbone. Specifically, the feature extractor outputs 512-dimensional features used for optimal transport. Both our JDOT-FAS-ResNet18 and JDOT-FAS-ViT models (including feature extractor and classifier) were trained using the Adam optimizer with a momentum of 1e-4 under the single-source DA setting. Under the multi-source DA setting, our WJDOT-FAS-ResNet18 model was trained using the Adam optimizer with a momentum of 0.06 and a weight decay of 2e-4, and our WJDOT-FAS-ViT model was trained using the Adam optimizer with a momentum of 1e-4. For both source and target domains, mini-batch sizes of \(n_{s_k}=n_t=120\) for ResNet-18 and \(n_{s_k}=n_t=60\) for ViT were used, and the models were trained on a single NVIDIA RTX 3090 GPU. The weights of each source domain were also trained using the Adam optimizer, with a momentum of 0.006. The hyperparameters \(\lambda \), \(\eta \), \(\beta \), \(\lambda _1\), \(\lambda _2\), and \(\alpha \) are set to 0.1, 0.1, 0.1, 5, 0.1, and 2, respectively.

4.3 Comparisons with the State-of-the-Art Methods

4.3.1 Single-source DA Setting

To verify the effectiveness of our JDOT-FAS method under the single-source DA setting, we first compare it with the state-of-the-art FAS methods on the C, I, and M datasets with only 2D attack types. The compared methods can be divided into three categories: traditional DA methods, including ADDA (Tzeng et al., 2017), DRCN (Ghifary et al., 2016), and DupGAN (Hu et al., 2018); DA-based generalized FAS methods, including Li et al. (2018), ADA (Wang et al., 2019), UDA (Wang et al., 2021), and USDAN-Un (Jia et al., 2021); and some novel self-designed DA-based deep learning FAS methods, including OCKD (Li et al., 2022), GDA (Zhou et al., 2022), SFDA-FAS (Liu et al., 2022), and CDFTN-R (Yue et al., 2022). The traditional DA methods generally judge a face as fake or real by using a simple FC-layer-based classifier optimized with cross-entropy loss. The DA-based generalized FAS methods are mainly based on Maximum Mean Discrepancy (MMD) (Li et al., 2018) and adversarial learning (Jia et al., 2021; Wang et al., 2019, 2021). The self-designed DA-based deep learning FAS methods build on novel deep learning frameworks such as teacher-student learning (Li et al., 2022), generative DA (Zhou et al., 2022), contrastive learning (Liu et al., 2022), and disentangled representation learning (Yue et al., 2022). As shown in Table 1, the DA-based generalized FAS methods generally perform better than the traditional DA methods, and the novel self-designed DA-based deep learning FAS methods obtain the best performance. Our JDOT-FAS method belongs to the second category and achieves the best average performance among all the DA-based FAS methods under the ViT backbone, reducing the HTER by \(2.78\%\) compared to the state-of-the-art DA method CDFTN-R (Yue et al., 2022). Our JDOT-FAS-ResNet18 model likewise has significant advantages over the other methods of the second category, reducing the HTER by more than \(5.6\%\). A possible reason is that our method introduces the label (pseudo-label) distance into the measure of distribution discrepancy by flexibly defining the cost matrix, so that labels are taken into account when aligning the source and target distributions. In other DA-based FAS methods, the distributions are defined in the feature space, ignoring the role of source sample labels and target sample pseudo-labels in distribution alignment. In particular, the feature extractors of the methods in (Wang et al., 2019, 2021) are not parameter-shared, while we use a parameter-shared feature extractor and classifier that map both source data and target data to a shared common product space, thus facilitating the search for a more generalizable domain-invariant product space of features and labels. Overall, our JDOT-FAS method achieves competitive results under the single-source DA setting.

Table 2 Comparison results (HTER (%) and AUC (%)) between the proposed method and the state-of-the-art methods for cross-dataset testing under the single-source DA setting on the 3DMask and HiFiMask datasets
Table 3 Results (HTER (%) and AUC (%)) of testings on the SuHiFiMask dataset with the C or I dataset as the source domain under the single-source DA setting
Table 4 Comparison results (HTER (%)) between the proposed method and the state-of-the-art methods for multi-source domain cross-dataset testing on the O, C, I, and M datasets
Table 5 Comparison results (AUC (%)) between the proposed method and the state-of-the-art methods for multi-source domain cross-dataset testing on the O, C, I, and M datasets

Besides, we verify the effectiveness of our JDOT-FAS method under the single-source DA setting on the 3DMask and HiFiMask datasets with 3D attacks. As shown in Table 2, under both the ResNet-18 and ViT frameworks, our JDOT-FAS method improves over the baseline methods, and it outperforms the methods in Liu et al. (2022), which do not use information about the target domain samples during training. In addition, for the 3D attack datasets, the features extracted by ResNet-18 are more capable of capturing the real and fake cues in the face images, and after the joint distribution optimal transportation mapping these features are more conducive to correctly classifying the target domain samples. Therefore, our JDOT-FAS method not only generalizes well for DA-based FAS with 2D attacks but also achieves cross-scenario generalization for DA-based FAS with 3D attacks.

Table 6 Results (HTER (%)) for multi-source domain cross-dataset testing on the 3DMask, HiFiMask, and SuHiFiMask datasets
Table 7 Results (AUC (%)) for multi-source domain cross-dataset testing on the 3DMask, HiFiMask, and SuHiFiMask datasets
Table 8 Results (HTER (%) and AUC (%)) of testings on the 3DMask, HiFiMask, and SuHiFiMask datasets with the C and I datasets as the source domains under the multi-source DA setting

In addition, to verify the effectiveness of our JDOT-FAS method under the open-set single-source DA setting, we choose the C or I dataset, which has only 2D attack types, as the source domain, and use the SuHiFiMask dataset with 2D, 3D, and adversarial attacks as the target domain. As shown in Table 3, under both the ResNet-18 and ViT frameworks, our JDOT-FAS method improves over the baseline methods and outperforms the results under all the backbones in Fang et al. (2024). The improvement is more obvious under the ViT framework, with the average HTER reduced by 11.55\(\%\) and the average AUC improved by 13.4\(\%\). This indicates that our JDOT-FAS method is also effective for the open-set DA problem with novel attacks in the target domain; that is, when the distribution discrepancy between the source and target domains is large, our proposed optimal transportation loss of joint distributions can reduce the domain discrepancy to a certain extent and improve the classification accuracy on the target domain with novel attacks.

4.3.2 Multi-source DA Setting

To verify the effectiveness of our WJDOT-FAS method under the multi-source DA setting, we first compare it with the state-of-the-art FAS methods on the C, I, M, and O datasets with only 2D attack types. The compared methods can be divided into three categories, as shown in Tables 4 and 5: DG-based FAS methods (Jia et al., 2020; Wang et al., 2022; Liu et al., 2022; Wang et al., 2022; Zhou et al., 2023; Liu et al., 2023; Long et al., 2023), source-free DA-based FAS methods (Liu et al., 2022; Li et al., 2018; Wang et al., 2020; He et al., 2020; Liang et al., 2020; Yang et al., 2021a, b; LV et al., 2021; Wang et al., 2021), and unsupervised DA-based FAS methods (Wang et al., 2021; Zhou et al., 2022; Wang et al., 2019; Quan et al., 2021). The DG-based FAS methods are trained without the involvement of target data. The source-free DA-based FAS methods use the source data for model pre-training and then fine-tune the pre-trained model on the target data, which lets them account for the discrepancy between the source and target domains and reduce it with techniques such as meta-learning and contrastive learning. However, most source-free DA methods do not fully utilize the source domain information when aligning the source and target domains. The unsupervised DA-based FAS methods are the most effective at reducing domain discrepancy, as they align the distributions of the source and target domains during training; they are based on adversarial learning (Wang et al., 2019, 2021), cross-domain image generation (Zhou et al., 2022), and progressive transfer learning (Quan et al., 2021). Our WJDOT-FAS method belongs to the category of unsupervised DA-based FAS methods, and it outperforms all the DG-based FAS methods and most of the source-free DA-based FAS methods thanks to its full utilization of target data information. Besides, our WJDOT-FAS method achieves the best average performance among the unsupervised DA methods of the same category. In particular, our WJDOT-FAS-ResNet18 model reduces the average HTER by more than \(1.74\%\) and improves the average AUC by more than \(0.13\%\), and our WJDOT-FAS-ViT model reduces the average HTER by more than \(2.51\%\) and improves the average AUC by more than \(0.36\%\). The possible reasons are as follows. First, other methods only consider the discrepancy in feature distributions and ignore the effect of image labels; in contrast, we align the joint distributions of features and labels between domains, which improves classification on the target domain. Second, they treat all source domains uniformly, i.e., each source domain has the same weight, so the target domain cannot adaptively favor the source domains that are easier to align during training; in contrast, we make more effective use of the source domain information through learnable domain weights, which is an important reason for the improved cross-dataset testing effectiveness of multi-source DA-based FAS with 2D attacks.

Besides, we verify the effectiveness of our WJDOT-FAS method under the multi-source DA setting on the 3DMask, HiFiMask, and SuHiFiMask datasets with 3D attacks. As shown in Tables 6 and 7, under both the ResNet-18 and ViT frameworks, our WJDOT-FAS method outperforms the baseline methods. Specifically, under the ResNet-18 framework, the average HTER (AUC) of our WJDOT-FAS method decreases (improves) by \(6.07\%\) (\(8.53\%\)) compared to the baseline method, and under the ViT framework it decreases (improves) by \(6.6\%\) (\(4.23\%\)). As in the single-source DA case, the ResNet-18 network is better at capturing the real/fake cues in the 3D attack datasets. In addition, the generalization is best on the 3DMask dataset, because its 3D masks are easier to discriminate than those of the other two 3D attack datasets and it is captured in a simpler acquisition environment. Overall, our WJDOT-FAS method achieves cross-scenario generalization for DA-based FAS with 3D attacks.

Table 9 Evaluations (HTER (%)) of optimal transport methods based on marginal and joint distributions under the single-source DA setting
Table 10 Evaluations (HTER (%) and AUC (%)) of optimal transport methods based on marginal and joint distributions under the multi-source DA setting
Fig. 4

Best HTER curves under the C\(\rightarrow \)M (a) and O&M&I\(\rightarrow \)C (b) settings. The red lines indicate the optimal transport methods based on marginal distributions, and the blue lines indicate the optimal transport methods based on joint distributions

In addition, to verify the effectiveness of our WJDOT-FAS method under the open-set multi-source DA setting, we choose the C and I datasets, which contain only 2D attack types, as source domains, and use the 3DMask, HiFiMask, or SuHiFiMask dataset with 3D attacks as the target domain. As shown in Table 8, under both the ResNet-18 and ViT frameworks, our WJDOT-FAS method outperforms the baseline methods under the open-set multi-source DA setting. The improvement is most obvious under the C&I\(\rightarrow \)3DMask setting, as the 3DMask dataset has a smaller distribution discrepancy with the convex combination of C and I than the other two datasets. The experimental results show that our WJDOT-FAS method is also effective for the open-set multi-source DA problem with novel attacks in the target domain: it reduces the large distributional discrepancy between the convex combination of the source distributions and the target distribution by computing the optimal transportation mappings. The weight of each source domain is adaptively selected according to the classification loss and the optimal transportation loss, which ultimately improves classification under the open-set multi-source DA setting.

4.4 Ablation Study

4.4.1 Effectiveness of the Joint Distribution Estimation

Tables 9 and 10 illustrate the advantages of the joint distribution estimation of features and labels (denoted JDOT-FAS or WJDOT-FAS) over the marginal distribution estimation of features (denoted MDOT-FAS or WMDOT-FAS) for target domain discrimination. JDOT-FAS (or WJDOT-FAS) performs better than MDOT-FAS (or WMDOT-FAS) on the target domain. Specifically, under the single-source domain setting, JDOT-FAS has a 6.48\(\%\) lower average HTER than MDOT-FAS; under the multi-source domain setting, WJDOT-FAS has a 1.8\(\%\) lower average HTER and a 0.72\(\%\) higher average AUC than WMDOT-FAS. This is because optimal transport based on joint distributions aligns the feature distributions of samples with the same label (or pseudo-label) in the product space of features and labels, instead of only aligning marginal distributions within the feature space. Therefore, aligning the joint distributions of the source and target domains not only allows the target domain to better learn the source domain label information, but also allows it to better exploit the information provided by the pseudo-labels, yielding more accurate classification results.

Fig. 5

The t-SNE embeddings of samples for optimal transport methods based on marginal distribution and joint distribution under the C\(\rightarrow \)M setting. Different colors represent different domains; different markers represent different classes

Fig. 6

The t-SNE embeddings of samples for optimal transport methods based on weighted marginal distribution and weighted joint distribution under the O&M&I\(\rightarrow \)C setting

To further illustrate the contribution of the joint distribution optimal transport-based method to the accuracy of pseudo-labels, we plot the best HTER curves over the validation sets during training under the single- and multi-source DA settings in Fig. 4. As the decreasing curves in both figures show, both the marginal and the joint distribution optimal transport-based methods improve the accuracy of target domain pseudo-labels: both continuously update the transportation mappings that minimize the transportation losses, thereby aligning the source and target domain distributions and improving the label estimates on the target domain during training. In addition, we observe that the models trained with joint distribution-based optimal transport make the best HTER curves decrease faster and converge to lower values, i.e., the final best HTER drops from \(13.69\%\) to \(2.38\%\) under the single-source DA setting and from \(1.39\%\) to \(0.83\%\) under the multi-source DA setting. This is because the discrepancy between the source and target distributions is characterized more accurately by the joint distributions, so the transportation mapping solved from the joint distribution-based cost matrix aligns the two distributions faster and more accurately while minimizing the optimal transportation loss.
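For reference, HTER is the average of the false acceptance rate (FAR) and the false rejection rate (FRR) at a fixed threshold. The following minimal sketch computes it under the assumed convention that higher scores indicate real faces; the helper name and the score convention are illustrative, not taken from our implementation.

```python
import numpy as np

def hter(scores, labels, threshold):
    """Half Total Error Rate: mean of FAR and FRR at a given threshold.
    Assumed convention: labels are 1 for real faces, 0 for spoof attacks,
    and higher scores mean "more likely real"."""
    scores, labels = np.asarray(scores), np.asarray(labels)
    accept = scores >= threshold                 # predicted "real"
    far = np.mean(accept[labels == 0])           # spoofs accepted
    frr = np.mean(~accept[labels == 1])          # real faces rejected
    return 0.5 * (far + frr)

# The threshold is typically chosen on a validation set, then fixed for testing.
print(hter([0.9, 0.8, 0.3, 0.1], [1, 1, 0, 0], threshold=0.5))  # -> 0.0
```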

Table 11 Evaluations (HTER (%)) of different loss functions for training under the single-source DA setting
Table 12 Evaluations (HTER (%) and AUC (%)) of different loss functions for training under the multi-source DA setting

Figures 5 and 6 show the t-SNE (Maaten et al., 2008) visualizations of the source and target domains learned by the MDOT-FAS and JDOT-FAS methods under the single-source DA setting, and by the WMDOT-FAS and WJDOT-FAS methods under the multi-source DA setting, respectively. For MDOT-FAS and WMDOT-FAS, most target domain samples lie far from the classification boundary, making it easier to distinguish real from fake faces; however, near the boundary a small number of fake target faces are still mixed in with real source faces. Thus, MDOT-FAS and WMDOT-FAS reduce the discrepancy of domain distributions to some extent, but their generalization performance still needs improvement. For JDOT-FAS and WJDOT-FAS, all fake samples from the source and target domains are concentrated in one region (lower-left or upper-right) while all real samples are concentrated in the opposite region, showing that JDOT-FAS and WJDOT-FAS reduce the domain distribution discrepancies more effectively than MDOT-FAS and WMDOT-FAS. By introducing labels into the domain discrepancy metric, the joint distribution-based optimal transport methods further minimize the intra-class discrepancy and maximize the inter-class discrepancy. Therefore, they achieve stronger discriminative power and better generalization ability by seeking a domain-invariant product space.
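As a rough illustration of how such visualizations are produced, the sketch below embeds pooled source and target features with scikit-learn's t-SNE, coloring points by domain and marking them by class, mirroring the convention of Figs. 5 and 6. The feature arrays are random placeholders standing in for the outputs of the trained feature extractor.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

# Placeholder features; in practice these are the extractor outputs for
# source and target samples, with labels 0 = fake, 1 = real.
rng = np.random.default_rng(0)
src_feats, src_labels = rng.normal(size=(200, 512)), rng.integers(0, 2, 200)
tgt_feats, tgt_labels = rng.normal(size=(200, 512)), rng.integers(0, 2, 200)

emb = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(
    np.vstack([src_feats, tgt_feats]))

# Color encodes domain, marker encodes class.
n = len(src_feats)
for dom_emb, dom_labels, color in [(emb[:n], src_labels, "tab:red"),
                                   (emb[n:], tgt_labels, "tab:blue")]:
    for cls, marker in [(0, "x"), (1, "o")]:
        pts = dom_emb[dom_labels == cls]
        plt.scatter(pts[:, 0], pts[:, 1], c=color, marker=marker, s=10)
plt.savefig("tsne_domains.png")
```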

4.4.2 Effectiveness of the Optimal Transportation Loss

Table 13 Evaluations (HTER (%)) of the hyperparameter \(\beta \) under the single-source DA setting
Table 14 Evaluations (HTER (%)) of three different transportation mapping solvers under the single-source DA setting

Tables 11 and 12 show the experimental results for each added loss function under the single- and multi-source DA settings, respectively. Three types of models are trained for comparison: models trained with only the source domain classification loss (source domains weighted classification loss); models additionally trained with the target domain entropy loss; and models further trained with the joint distribution optimal transportation loss (weighted joint distribution optimal transportation loss). The test performance under both settings improves with each added loss function, and the joint distribution optimal transportation loss plays the largest role under the single-source domain setting, reducing the average HTER by \(16.98\%\). The entropy loss allows the classifier to directly access unlabeled target data and adaptively adjust its parameters to the target distribution. The (weighted) joint distribution optimal transportation loss feeds the distribution discrepancy measured during transportation back into the network parameter updates, so that the feature extractor and classifier bring the joint distributions of the source and target domains closer to each other. We measure the discrepancy between the source and target joint distributions with the Wasserstein distance, which enables a continuous transformation of the distributions: we can find optimal transportation mappings that map the joint distributions of the source and target domains into a common product space while preserving their geometric characteristics. The comparison verifies that each loss function of our proposed method contributes to the performance improvement, and all loss functions interact to achieve the best results.
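To make the composition of the training objective concrete, the sketch below assembles the three loss terms in PyTorch. The function names and the trade-off coefficients lambda_e and lambda_ot are illustrative assumptions; only the structure (weighted source classification loss, target entropy loss, and weighted optimal transportation loss) follows the text.

```python
import torch
import torch.nn.functional as F

def entropy_loss(logits_t):
    # Shannon entropy of target predictions; encourages the classifier
    # to make confident decisions on unlabeled target samples.
    p = F.softmax(logits_t, dim=1)
    return -(p * torch.log(p + 1e-8)).sum(dim=1).mean()

def total_loss(logits_s_list, labels_s_list, logits_t, ot_losses, weights,
               lambda_e=0.1, lambda_ot=1.0):
    # weights: per-source-domain weights; ot_losses: the joint distribution
    # optimal transportation loss of each source domain (scalars).
    cls = sum(w * F.cross_entropy(lo, la)
              for w, lo, la in zip(weights, logits_s_list, labels_s_list))
    ot = sum(w * l for w, l in zip(weights, ot_losses))
    return cls + lambda_e * entropy_loss(logits_t) + lambda_ot * ot
```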

4.5 Discussion

4.5.1 Influences of the Cost Matrix Computation

In Eq. (7), we generalize the cost matrix of the marginal distribution optimal transport method, defined as the distance between source and target features, to a weighted combination of feature and label distances in the joint distribution optimal transport method, and achieve alignment of the joint distributions of source and target domain samples by flexibly defining \({\varvec{C}}\). Table 13 shows the experimental results under different values of \(\beta \) in the single-source DA setting. The average performance is best at \(\beta =0.1\), which shows that the label cost plays a key role in the total cost matrix for measuring the distance between joint distributions. An appropriate ratio of feature cost to label cost allows the total cost matrix to align the joint distributions of the source and target domains more effectively, so that target domain samples are classified more accurately.
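The sketch below illustrates one way such a joint cost matrix can be assembled, combining a pairwise feature distance with a label cost computed from source labels and target predictions. Exactly how \(\beta \) enters the weighting is an assumption here; only the structure (feature cost plus weighted label cost) follows Eq. (7).

```python
import torch
import torch.nn.functional as F

def joint_cost_matrix(feat_s, feat_t, y_s, logits_t, beta=0.1):
    # feat_s: (n_s, d) source features; feat_t: (n_t, d) target features;
    # y_s: (n_s,) integer source labels; logits_t: (n_t, K) target logits.
    feat_cost = torch.cdist(feat_s, feat_t, p=2) ** 2   # squared Euclidean
    log_p_t = F.log_softmax(logits_t, dim=1)            # (n_t, K)
    # Cross-entropy between source labels and target predictions:
    # label_cost[i, j] = -log p_t[j, y_s[i]].
    label_cost = -log_p_t[:, y_s].T                     # (n_s, n_t)
    return feat_cost + beta * label_cost
```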

Table 15 Evaluations (HTER (%) and AUC (%)) of domain weight optimization under the multi-source DA setting

4.5.2 Influences of the Transportation Mapping Computation

Three types of solvers can be used to compute the transportation mapping, i.e., the EMD solver (Eq. 9), the Sinkhorn solver (Eq. 10), and the Lp-L1 solver (Eq. 11). Table 14 shows the test results on the target domain with the different transportation mapping solvers under the single-source DA setting. Five of the six experiments achieve the best test results with the Lp-L1 solver. This is because the unregularized transportation mapping problem in Eq. (9) is a linear program, which drives most elements of the transportation mapping matrix solved by the EMD solver to 0, resulting in a highly sparse and unsmooth solution. In contrast, the Sinkhorn solver (Eq. 10) with entropy regularization finds a smoother version of the transport, reducing the sparsity of the transportation mapping matrix. The Lp-L1 solver (Eq. 11) further makes each target sample receive mass only from source samples with the same label on top of the smooth transport. This additionally constrains the transportation mapping to transport between samples with the same label in the source and target domains, complementing the joint distribution-based definition of the cost matrix.
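All three solvers are available in the POT (Python Optimal Transport) library; the sketch below contrasts them on toy data. The regularization strengths reg and eta are illustrative, not the values used in our experiments.

```python
import numpy as np
import ot  # POT: pip install pot

rng = np.random.default_rng(0)
Xs, ys = rng.normal(size=(50, 16)), rng.integers(0, 2, 50)  # source feats/labels
Xt = rng.normal(size=(60, 16))                              # target features
a, b = ot.unif(50), ot.unif(60)                             # uniform sample weights
M = ot.dist(Xs, Xt)                                         # squared Euclidean cost
M /= M.max()                                                # normalize for stability

G_emd = ot.emd(a, b, M)                  # exact LP: sparse, unsmooth plan
G_sink = ot.sinkhorn(a, b, M, reg=0.1)   # entropic regularization: denser plan
G_lpl1 = ot.da.sinkhorn_lpl1_mm(a, ys, b, M, reg=0.1, eta=0.1)
# The Lp-L1 group-lasso term pushes each target sample to receive mass
# from source samples of a single class only.

print(np.mean(G_emd == 0), np.mean(G_sink == 0))  # the EMD plan is far sparser
```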

4.5.3 Influences of the Domain Weight Optimization

In Table 15, we compare three categories of domain-weighting strategies: using the same weight for each domain, non-learnable weighting strategies based on distances, and optimization-based weighting strategies using different optimizers. The distance-based strategies include classification loss-based (Kang et al., 2020), optimal transport distance-based (Zhao et al., 2020), and MMD-based (Li et al., 2021) weighting. The optimization-based strategies include the SGD (Saad, 1998), AdaGrad (Duchi et al., 2011), and Adam (Kingma et al., 2014) algorithms. As the comparison shows, using the same weight for all source domains is equivalent to concatenating all source samples into a single source distribution and applying joint distribution optimal transport to it. Treating every source domain equally prevents the model from learning the convex combination of source domains best adapted to the target domain, so this strategy tests poorly on the target domain. In addition, the optimal transport distance (Wasserstein distance)-based weighting strategy outperforms the MMD-based one, and the classification loss-based strategy is the worst of the three distance-based strategies. This further indicates that the optimal transport distance is a more accurate measure of similarity between domains, so the combination of source domains adapted to the target domain can be better learned with the optimal transport distance-based weighting strategy. Finally, the optimization-based weighting strategies obtain the best average results of the three categories, because they optimize the domain weights more flexibly according to the classification and transportation performance of each source domain. Among the three optimizers, models trained with the Adam algorithm achieve the best average test results on the target domain, because Adam determines the parameter update from the first-order moment estimate of the stochastic gradient and the adaptive learning rate from the second-order moment estimate. Moreover, Adam corrects the bias of both moment estimates, which avoids the oscillation of the SGD algorithm during optimization as well as the low efficiency of the AdaGrad algorithm in the later stages of optimization. Overall, the Adam algorithm best learns the convex combination of source domain distributions closest to the target domain distribution, so we adopt it as our domain-weighting strategy.
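A minimal sketch of the optimization-based strategy is given below: the domain weights are parameterized through a softmax so they remain a convex combination, and Adam updates them from the per-domain losses. The softmax parameterization and the toy loss values are our assumptions; the text only specifies that the weights are updated by the optimizer from the classification and transportation losses.

```python
import torch

K = 3                                          # number of source domains
logits = torch.zeros(K, requires_grad=True)    # unconstrained parameters
opt = torch.optim.Adam([logits], lr=1e-2)

for step in range(100):
    w = torch.softmax(logits, dim=0)           # convex weights, sum to 1
    # per_domain_loss[k] stands in for the classification + transportation
    # loss of source domain k; fixed toy values replace real training losses.
    per_domain_loss = torch.tensor([0.9, 0.4, 0.7])
    loss = (w * per_domain_loss).sum()
    opt.zero_grad()
    loss.backward()
    opt.step()

print(torch.softmax(logits, dim=0))  # mass shifts toward the easiest domain
```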

5 Conclusions and Future Work

In this paper, we introduce a weighted joint distribution optimal transport framework to solve the cross-scenario problem in FAS, applicable in both single- and multi-source DA scenarios. The framework consists of three parts: joint distribution estimation, joint distribution optimal transport, and domain weight optimization. We compute the optimal transportation mappings between each source domain and the target domain based on the joint distribution cost matrices, and simultaneously optimize the feature representation, label estimation, and domain weights using the weighted optimal transportation loss, the weighted source domain classification loss, and the target domain entropy loss. To validate the effectiveness of our method under both single- and multi-source DA settings, we conduct extensive experiments on four public FAS datasets with only 2D attacks and three large-scale FAS datasets with 3D attacks. The experimental results show that our method achieves state-of-the-art results on all three protocols under the single-source setting, and under the multi-source setting it outperforms most multi-source DA methods. However, there is still room for performance improvement under the O&C&M\(\rightarrow \)I setting. In future work, we will design an optimal transport framework applicable to domain generalization-based and multi-modal FAS.