1 Introduction

Face recognition plays a crucial role in modern applications such as access control systems and electronic payment. Meanwhile, existing face recognition systems are exposed to diverse presentation attacks (PAs), such as printed faces (print attack), face replay on digital devices (replay attack), faces covered by masks (3D high-fidelity mask attack), etc. As a result, the Face Anti-Spoofing (FAS) (Yu et al., 2022) technique, which detects whether a presented face is live or not, has become indispensable for defending face recognition systems.

In the past decade, researchers have proposed numerous data-driven deep-learning-based methods (Lucena et al., 2017; Xu et al., 2015; Shao et al., 2017; Liu et al., 2018a) to distinguish spoof faces from live ones. Most of them train the live/spoof detector by learning from pre-collected datasets (Yu et al., 2020). Despite satisfactory performance on pre-defined testing sets, these well-trained FAS models usually encounter generalization challenges when deployed in real-world scenarios. This overfitting phenomenon is mainly attributed to the limited scale and diversity of datasets. Frequently used datasets (Boulkenafet et al., 2017b; Liu et al., 2018b, 2022) contain fewer than 100 identities and only a subset of PAs due to the high collection cost. Furthermore, studies on data augmentation show limited improvement in this field (Wang et al., 2023).

Motivated by the rapid development of generative models (Goodfellow et al., 2014; Karras et al., 2019; Rombach et al., 2022), we intend to generate face images to extend the diversity of public datasets, and we conjecture that these generated samples can help improve FAS models. Generating data as augmentation for FAS models has been studied in Liu et al. (2020), Wu et al. (2021a), Jourabloo et al. (2018). However, these existing methods (Wang et al., 2023; Liu et al., 2020; Ruiz et al., 2023) exhibit limited effectiveness, as shown in Fig. 1: (1) Low quality: unsatisfactory generation quality with visually perceptible artifacts, which are evident in EPCR (Wang et al., 2023), DSDG (Wu et al., 2021a), STDN (Liu et al., 2020) and other generative methods. (2) Limited diversity: inability to generate arbitrary PAs for any input face, as DSDG-like methods (Wu et al., 2021a) cannot precisely control the generation attributes. (3) Inconsistent generation: difficulty in disentangling face characteristics from spoof features. Although the Stable Diffusion (Rombach et al., 2022) based method DreamBooth (Ruiz et al., 2023) can use prompts to generate spoof faces, and these generated faces indeed carry spoof traits such as 3D mask margins, the face appearance is not consistent with the inputs and changes noticeably throughout generation.

Fig. 1

Comparison of different FAS augmentation manners applied to the HiFiMask (Liu et al., 2022) and OULU-NPU (Boulkenafet et al., 2017b) datasets. The first column lists live, print attack and plaster mask face images as input. The second column shows the results of the Patch Shuffle Augmentation proposed in EPCR (Wang et al., 2023). Another typical GAN-based augmentation, DSDG (Wu et al., 2021a), is compared in the third column. We further conduct image manipulation with the Stable Diffusion (Rombach et al., 2022) based method DreamBooth (Ruiz et al., 2023). The last column shows the result of our proposed CG-FAS, which converts the input images' spoofing labels while keeping other face attributes consistent with the inputs

To overcome these challenges, we propose a novel framework named Cross-label Generative augmentation for Face Anti-Spoofing (CG-FAS). Using any public FAS dataset as input, CG-FAS can generate samples whose spoof labels are opposite to those of the input images, while other characteristics remain consistent, as shown in the last column of Fig. 1. To disentangle spoof and spoof-irrelevant features, we perform face editing in the highly disentangled latent space \(\mathcal {W+}\) (Abdal et al., 2019), which ensures that face identity information is not lost during generation. In addition, an encoder and a generator are trained to connect the RGB space and the latent space. We utilize the prevalent StyleGAN (Karras et al., 2019) as the generator, which produces natural face images superior to those of previous works (Wu et al., 2021a; Liu et al., 2020). To exploit the advantages of the linear \(\mathcal {W+}\) space, we organize each presentation attack's discriminative feature into an Interchange Bridge (IB) matrix, which can be used to generate images between arbitrary PA and live labels, even for unseen face identities.

Given face images from a public FAS dataset, CG-FAS first encodes them into low-dimensional latent codes in the \(\mathcal {W+}\) space. In this latent space, the IB matrix can be flexibly indexed to obtain a residual vector representing an editing direction, such as live to print attack. The editing is performed by adding the residual vector, scaled by an editing coefficient, to the latent codes. The resulting vectors are then fed into the pre-trained generator to produce target PA face images. By adding these generated images to existing FAS datasets as augmentation, the proposed CG-FAS is demonstrated to yield better FAS models.

The main contributions of this study are summarized below:

  • We propose the Interchange Bridge (IB) matrix, which can be used to generate arbitrary live/PA faces while keeping spoof-irrelevant attributes consistent with the input images.

  • Applying the IB matrix as augmentation during FAS model training, we introduce a novel framework called CG-FAS, which significantly enhances the performance of the FAS model.

  • Evaluated on single-domain and cross-domain experiments, our proposed CG-FAS achieves competitive performance on several FAS benchmarks.

2 Related Work

2.1 Face Anti-Spoofing

Before the deep learning era, FAS researchers focused on extracting handcrafted local features to distinguish live from PA face images. The most commonly used features are LBP (Tiago et al., 2013; Boulkenafet et al., 2015), HOG (Komulainen et al., 2013), SIFT (Patel et al., 2016), and DoG (Boulkenafet et al., 2017), which show limited performance. In recent decades, FAS methods have benefited from the breakthroughs of deep neural networks (He et al., 2016a; Ronneberger et al., 2015) and large-scale datasets (Liu et al., 2018a; Zhang et al., 2020; Liu et al., 2021a, 2022; Fang et al., 2024a, b), and numerous deep-learning-based FAS methods (Liu et al., 2019; Menotti et al., 2015; Nagpal & Dubey, 2019; Jourabloo et al., 2018) have emerged.

The influential Central Difference Convolution Network (CDCN) (Yu et al., 2020) was proposed to improve the representation capacity of detailed textures by leveraging local gradient features. Subsequently, dual-cross central difference networks (Yu et al., 2021) were proposed to exploit the difference between central and surrounding sparse local features to alleviate information redundancy and the sub-optimal problem in the training stage. PatchNet (Wang et al., 2022a) utilizes fine-grained face patches to enhance the model's discriminative ability. Other works (Wang et al., 2022b; Sun et al., 2023) address the domain adaptation problem in the FAS task.

Some generation-based methods, such as STDN (Liu et al., 2020) and DSDG (Wu et al., 2021a), show impressive results by augmenting training data. However, these methods (Liu et al., 2020; Wu et al., 2021a) are limited to intra-dataset generation scenarios, and the generated images do not look as realistic as natural samples. In contrast, our proposed CG-FAS flexibly generates vivid samples whose spoof labels differ from those of the inputs and can be easily applied to unseen datasets.

2.2 Image Generation and Editing

2.2.1 Generative Methods

Generative methods are widely studied and applied for image editing (Ling et al., 2021; Ruiz et al., 2023). We first introduce the recently prevailing generative methods, and then the related advances in image editing. Many popular generative paradigms have been put forward, such as auto-regressive models (Van Oord et al., 2016), the Variational Autoencoder (VAE) (Diederik & Max, 2014), the Generative Adversarial Network (GAN) (Goodfellow et al., 2014), and diffusion models (Sohl-Dickstein et al., 2015). Among them, diffusion models are popular, but it is not easy to precisely control generation details with text prompts (Rombach et al., 2022). GAN-based methods are particularly favored for generating high-quality, realistic samples (Arjovsky et al., 2017; Karras et al., 2018; Miyato et al., 2018) and are widely applied in tasks such as image-to-image translation (Isola et al., 2017) and semantic image editing (Ling et al., 2021). We therefore choose the distinguished StyleGAN (Karras et al., 2019, 2020b, a, 2021) network as our image generator (Wu et al., 2021b).

GAN inversion aims to invert real-world images into latent codes in a low-dimensional latent space (Xia et al., 2021), which is the inverse function of a GAN generator (Goodfellow et al., 2014). The latent space of a GAN has been widely studied and is recognized as a Riemannian manifold (Shen et al., 2020a). The \({\mathcal {Z}}\) space is used by randomly sampling vectors from a normal distribution (Radford et al., 2016). StyleGAN employs a non-linear mapping network to convert a \({\mathcal {Z}}\) space latent code into the \({\mathcal {W}}\) space, enabling better interpolation and disentanglement (Karras et al., 2019, 2020b). Some researchers employ the \(\mathcal {W+}\) space, which extends the \({\mathcal {W}}\) space into a richer representation (Abdal et al., 2019, 2020). In this study, the cutting-edge e4e (Tov et al., 2021) method and the \(\mathcal {W+}\) space are chosen as our encoder module for high efficiency.

Fig. 2

An overview of the CG-FAS pipeline. The upper subpicture shows that our CG-FAS can serve as a plugin for any existing FAS method. Fed with source images, CG-FAS generates new samples as augmentation, which helps to improve the training of the FAS model. The lower-left subpicture illustrates the generation process of CG-FAS. Initially, a well-trained encoder maps the source face images into latent codes in the latent space. Subsequently, face editing is performed by adding an IB matrix element to the latent code. Finally, a StyleGAN generator is utilized to generate the target images. The lower-right subpicture shows the training process of the IB matrix. After adding an IB matrix element to the latent codes, the resultant vector is sent into a pre-trained classifier. Thereafter, the cross-entropy loss is calculated to update only the IB matrix element. (Best viewed in color)

2.2.2 Face Image Editing

Face image editing is attractive for its versatility and impressive results (Patashnik et al., 2021). A typical editing pipeline in StyleGAN-based research follows the "invert first, edit later" paradigm (Richardson et al., 2021): first converting the given image into the latent space, then manipulating the latent code, and finally generating the desired image with a generator (Härkönen et al., 2020; Shen et al., 2020a). For instance, InterFaceGAN (Shen et al., 2020b) uses an SVM to find semantic directions in the \(\mathcal {W+}\) space to revise face attributes such as age, gender, and expression. GANSpace (Härkönen et al., 2020) applies PCA (Principal Component Analysis) to find meaningful directions and performs interpolation in a BigGAN (Brock et al., 2018) or StyleGAN (Karras et al., 2019) latent space. StyleCLIP (Patashnik et al., 2021) enables natural language to edit input images, relying on the large-scale vision-language model CLIP (Radford et al., 2021). Since these methods are rarely applied in the FAS area, we seek to use them to improve typical FAS methods.

3 Methodology

3.1 Overview

Since image generation techniques are rarely incorporated in contemporary FAS methods, we aim to utilize generative models to improve FAS model performance. By carefully designing the latent space and the editing approach, we propose CG-FAS to generate new images whose spoof labels are reversed while other face attributes are preserved. These generated samples are subsequently used as augmentation to train a more robust FAS model, which demonstrably performs better in practice.

Our CG-FAS consists of three main stages: (1) Determining the latent space. The \({\mathcal {W}}+\) space is selected as our latent space for its convenient semantic editing capability. Consequently, we train an encoder and a generator to connect the RGB space and this latent space, as shown in Sect. 3.2. (2) Bridging live and PAs in the latent space. After mapping RGB images into the semantically disentangled \({\mathcal {W}}+\) space, we train a unique spoof-characteristic vector for each PA and gather them into a matrix named the Interchange Bridge (IB) matrix. The IB matrix can be used to transfer any face image's spoof label with zero-shot ability, as introduced in Sect. 3.3. (3) Augmentation for FAS models. The IB matrix serves as a plug-in for training a FAS model, and its effectiveness is conceptually analyzed in Sect. 3.4. When performing face editing on batches of images, we encounter a dilemma of balancing the FAS score and the identity similarity score; a strategy to reach a trade-off is proposed in Sect. 3.5. The overall pipeline of CG-FAS is illustrated in Fig. 2, and a minimal code sketch of the generation step is given below.
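The following sketch summarizes how an input face is pushed across spoof labels in our pipeline. It assumes a pre-trained encoder \(f_E\), a StyleGAN generator \(f_G\), and a trained IB matrix; the function and variable names are illustrative only and do not correspond to a released implementation.

```python
import torch

def cg_fas_augment(x, f_E, f_G, ib_matrix, src_label, tgt_label, beta):
    """Sketch of CG-FAS cross-label generation (illustrative names only).

    x          : batch of RGB face images, shape (B, 3, 512, 512)
    f_E, f_G   : pre-trained encoder / StyleGAN generator (Sect. 3.2)
    ib_matrix  : mapping (i, j) -> residual vector delta_ij in W+ (Sect. 3.3)
    beta       : editing coefficient (Sect. 3.5)
    """
    with torch.no_grad():
        e = f_E(x)                                 # RGB -> W+ latent codes
        delta = ib_matrix[(src_label, tgt_label)]  # editing direction, e.g. live -> print
        e_edit = e + beta * delta                  # Eq. (4): move toward the target label
        x_gen = f_G(e_edit)                        # W+ -> RGB faces with the target label
    return x_gen                                   # added to the training set as augmentation
```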

3.2 The Latent Space

Determining a proper latent space is vital for image editing. Since the RGB space is high-dimensional and unsuitable for editing, regular GAN-based methods generate images from a low-dimensional latent space, which is convenient to manipulate. Among them, StyleGAN (Karras et al., 2020a) is popular for generating vivid images, and several typical latent spaces (Radford et al., 2016; Karras et al., 2019; Abdal et al., 2019) of StyleGAN have therefore been put forward. Because the \({\mathcal {W}}+\) space (Abdal et al., 2019) is well suited to human face manipulation, we choose the linear \({\mathcal {W}}+\) space as our latent space. In this study, the \({\mathcal {W}}+\) space is a concatenation of 16 distinct 512-dimensional vectors, which can be used to generate \(512 \times 512\) resolution face images.

To connect the RGB space and the \({\mathcal {W}}+\) space, an indispensable step is training a generator. In this study, the official StyleGAN2-ada (Karras et al., 2020a) is adopted, as its generator can produce images following a distribution similar to that of the input images. The Fréchet Inception Distance (FID) (Heusel et al., 2017) is used as the supervision metric. During training, we combine the FFHQ dataset (Karras et al., 2019) with several FAS datasets as the complete training set, which is sufficient to produce faces with diverse characteristics. In summary, we define a generator \(f_G\) that maps any latent code \(e \in \mathbb {R}^{d}\) to an RGB image x by the following equation:

$$\begin{aligned} x = f_G(e), \end{aligned}$$
(1)

where d is equal to 8192 in this study.
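As a concrete illustration of Eq. (1) and the dimensionality above, the sketch below reshapes a flattened \(\mathcal {W+}\) code of size \(d = 16 \times 512 = 8192\) into per-layer style vectors before calling the generator. The wrapper f_G and its exact input layout are assumptions for illustration; the official StyleGAN2-ada interface may differ in detail.

```python
import torch

NUM_LAYERS, STYLE_DIM = 16, 512      # W+ is a stack of 16 style vectors
D = NUM_LAYERS * STYLE_DIM           # d = 8192, as in Eq. (1)

def generate(f_G, e):
    """Map flattened W+ codes e of shape (B, 8192) to 512x512 RGB images.

    f_G is assumed to wrap a StyleGAN2-ada synthesis network that consumes
    per-layer style codes of shape (B, 16, 512); the reshape below only
    illustrates how d corresponds to the layer-wise W+ layout.
    """
    w_plus = e.view(-1, NUM_LAYERS, STYLE_DIM)   # (B, 16, 512)
    x = f_G(w_plus)                              # (B, 3, 512, 512) images
    return x
```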

Conversely, to obtain the latent code e for a given image x, we train a deep-neural-network-based encoder to map images into the latent space. In this study, the e4e (Tov et al., 2021) encoder network is utilized for this mapping. The encoding procedure is expressed as \(e = f_E(x)\). Generally, \(f_G\) and \(f_E\) are approximately inverse functions of each other, which can be conveyed as follows:

$$\begin{aligned} e = f_E(x) = f_E \circ f_G(e), \end{aligned}$$
(2)

where the notation \(\circ \) denotes function composition. When training the encoder, we combine the LPIPS loss (Zhang et al., 2018) and the ArcFace (Deng et al., 2019) identity loss into the total loss: \({\mathcal {L}}_{encoder} = {\mathcal {L}}_{LPIPS} + \lambda _{ID} \cdot {\mathcal {L}}_{ArcFace}\).
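A minimal sketch of this encoder objective is given below. It assumes the `lpips` package for the perceptual term and a frozen face-embedding network `id_model` standing in for ArcFace; both names and the exact form of the identity term are illustrative assumptions rather than the exact training code.

```python
import torch
import torch.nn.functional as F
import lpips  # perceptual similarity package (Zhang et al., 2018); assumed available

lpips_fn = lpips.LPIPS(net='alex')   # LPIPS distance

def encoder_loss(x, f_E, f_G, id_model, lambda_id=0.5):
    """Sketch of L_encoder = L_LPIPS + lambda_ID * L_ArcFace (lambda_ID = 0.5, Sect. 4.1.2).

    id_model is a hypothetical frozen ArcFace-style embedding network; the
    identity term penalizes drift between the input face and its reconstruction.
    """
    x_rec = f_G(f_E(x))                               # reconstruct through W+
    l_lpips = lpips_fn(x, x_rec).mean()               # perceptual reconstruction term
    id_in, id_rec = id_model(x), id_model(x_rec)
    l_id = 1.0 - F.cosine_similarity(id_in, id_rec, dim=-1).mean()
    return l_lpips + lambda_id * l_id
```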

Algorithm 1

Training IB Matrix.

3.3 The Interchange Bridge Matrix

After the encoder and generator are well trained, we can manipulate images in the latent space. For a given latent code e, commonly used semantic face editing approaches (e.g., for expression, age, and gender) find a residual vector (Härkönen et al., 2020; Shen et al., 2020b) that represents a specific semantic edit in the disentangled \(\mathcal {W+}\) space. In this study, we use a multi-layer perceptron classifier as supervision and calculate a cross-entropy loss to train each residual vector. For convenient usage, we organize the residual vectors into a matrix, namely the Interchange Bridge matrix described below:

$$\begin{aligned} {\mathcal {D}}_{n+1} = \left[ \begin{array}{cccc} \delta _{00} & \delta _{01} & \cdots & \delta _{0n} \\ \delta _{10} & \delta _{11} & \cdots & \delta _{1n} \\ \vdots & \vdots & \ddots & \vdots \\ \delta _{n0} & \delta _{n1} & \cdots & \delta _{nn} \end{array} \right] , \end{aligned}$$
(3)

Here n represents the number of presentation attack categories. An arbitrary element \(\delta _{ij} \in {\mathcal {D}}_{n+1}\) denotes a residual vector that converts live/PA type i into type j. Note that the diagonal elements of the IB matrix are zero vectors, meaning no transformation within a single live/PA type. Once the IB matrix is obtained, the editing operation can be expressed as adding \(\delta _{ij}\) to the latent code e by the following formulation:

$$\begin{aligned} e' = e + \beta \cdot \delta _{ij}, \end{aligned}$$
(4)

where \(\beta \) is the editing coefficient and \(\delta _{ij} \in \mathbb {R}^{d}\).

Since transforming a live face into a PA face and transforming a PA face into a live face are reverse manipulations in the linear \(\mathcal {W+}\) space, we naturally have \(\delta _{ij} = - \delta _{ji}\). Therefore the IB matrix \({\mathcal {D}}_{n+1}\) is skew-symmetric, which means that we only need to obtain its upper triangular part. Moreover, transforming from type i to type j can be decomposed into two steps: first transforming type i into a live face, then transforming the live face into type j. Therefore, we have \(\delta _{ij} = \delta _{i0} + \delta _{0j}\).

Based on the relationships described above, we only need to train the first-row elements of \({\mathcal {D}}_{n+1}\), while the other elements can be derived from these relationships. The whole training procedure is described in Alg. 1. Since the matrix \({\mathcal {D}}_{n+1}\) can be utilized to generate arbitrary PA or live faces, we name it the Interchange Bridge matrix; it shows zero-shot generation capability even for unseen face identities.
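To make Alg. 1 and the relationships above concrete, the sketch below trains only the first-row residuals \(\delta _{0j}\) against a frozen live/PA classifier and then fills in the full IB matrix from them. Interfaces such as `classifier`, the latent-code tensors, and the optimizer settings are illustrative assumptions, not the exact released procedure.

```python
import torch
import torch.nn.functional as F

def train_ib_first_row(codes, labels, classifier, n, d=8192, steps=1000, lr=0.01):
    """Sketch of Alg. 1: learn first-row residuals delta_0j (j = 1..n) in W+.

    codes/labels : latent codes in W+ and their live/PA labels (live = 0)
    classifier   : frozen MLP over W+ codes with n+1 live/PA classes
    Only the residual vectors receive gradients; the classifier stays fixed.
    """
    deltas = [torch.zeros(d)] + [torch.zeros(d, requires_grad=True) for _ in range(n)]
    opt = torch.optim.Adam(deltas[1:], lr=lr)
    live_codes = codes[labels == 0]
    for _ in range(steps):
        j = torch.randint(1, n + 1, (1,)).item()           # pick a target PA type
        logits = classifier(live_codes + deltas[j])         # edited codes should classify as type j
        target = torch.full((live_codes.size(0),), j, dtype=torch.long)
        loss = F.cross_entropy(logits, target)
        opt.zero_grad(); loss.backward(); opt.step()
    return [delta.detach() for delta in deltas]

def build_ib_matrix(first_row):
    """Fill D_{n+1} via delta_ij = delta_i0 + delta_0j = delta_0j - delta_0i."""
    k = len(first_row)
    return {(i, j): first_row[j] - first_row[i] for i in range(k) for j in range(k)}
```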

3.4 Effect Analysis of CG-FAS

Problem Definition When the IB matrix is used as a plugin while training a FAS model, we conjecture that this augmentation can improve the FAS model's performance; this scheme is called CG-FAS in this study. In this subsection, we demonstrate how CG-FAS assists the training of FAS models by preventing overfitting. First, we mathematically define a FAS task as in Eq. (5). For any input image \(x \in {\mathbb {X}}\), there exists a corresponding label \(y \in \{0,1\}\) representing a live or PA face, respectively. Researchers aim to find an optimal mapping between the input x and the label y. Generally, a deep neural network \(f_{S}\) is used to approximate this mapping, and the objective is to minimize the cross-entropy (CE) loss between the model output and the ground truth:

$$\begin{aligned} \begin{aligned} \min _{f_{S}}&\ {L(x, y; f_{S})} = \sum _{x,y} \ {CE({f_S(x), \ y})} \\&\ s.t. \ x \in {\mathbb {X}}, \ y \in \{0,1\}, \end{aligned} \end{aligned}$$
(5)

In the physical world, it is observed that different types of presentation attacks, such as print attacks and 3D mask attacks, exhibit distinct spoofing characteristics. Based on this observation, we have

Assumption 1

Any two PAs' normalized residual vectors in \({{{\textbf {E}}}}_s = \{ s_i \ \mid \ s_i = \frac{\delta _{0i}}{\Vert \delta _{0i} \Vert }, \ i = 1,\ldots ,n \}\) are orthonormal, namely \(s_i \cdot s_j^T = 0\) for all \(s_i, s_j \in {{{\textbf {E}}}}_s\) with \(i \ne j\).

The linear \({\mathcal {W}}+\) space has been shown to be highly disentangled (Abdal et al., 2019, 2020), meaning that any attribute (e.g., glasses, hat) of a live face can be encoded into a distinct embedding. Consequently, any live face can be expressed as a linear combination of such orthogonal embeddings, each corresponding to a specific attribute. Thus we have

Assumption 2

Any spoof-irrelevant face feature (e.g., glasses, hat) can be represented by a set of orthonormal vectors \( {{{\textbf {E}}}}_b = \{ b_i \ \mid \ i = 1,\ldots ,m \}\) in \({\mathcal {W}}+\).

As illustrated in Fig. 3, CG-FAS is able to modify any face image's spoofing attribute without altering other spoof-irrelevant attributes. This leads us to believe that the spoofing features described in Assumption 1 and the features described in Assumption 2 are orthogonal to each other. We have

Assumption 3

Any two distinct elements in \({{{\textbf {E}}}} = {{{\textbf {E}}}}_s \cup {{{\textbf {E}}}}_b \) are orthogonal, i.e., \(e_i \cdot e_j^T = 0, \forall \ e_i, e_j \in {{{\textbf {E}}}}, e_i \ne e_j\). Therefore \({{{\textbf {E}}}} = [s_1,\ldots , s_n,\ b_{1},\ldots , b_{m}]\) forms an orthonormal basis of the latent space \({\mathcal {W}}+\), and an arbitrary latent code e is equivalent to a coordinate vector \(\alpha = [\alpha _1,\ldots , \alpha _n, \ldots , \alpha _{n+m}]^T\) under this basis:

$$\begin{aligned} \begin{aligned} e&= {{{\textbf {E}}}} \cdot \alpha \\&= [s_1,\ldots , s_n, b_{1},\ldots , b_{m}] \cdot [\alpha _1,\ldots , \alpha _n, \alpha _{n+1},\ldots , \alpha _{n+m}]^T, \end{aligned} \end{aligned}$$
(6)

In this study, the subset \({{{\textbf {E}}}}_s = \{s_1,\ldots , s_n \}\) stands for different kinds of spoof clues, such as replay-attack texture and 3D mask margin features, while \({{{\textbf {E}}}}_b = \{b_1,\ldots , b_m \}\) stands for spoof-irrelevant information, such as lighting condition, scene, and camera device, which is independent of spoof clues. Based on the assumptions above, it is natural to define a valid FAS model that discriminates PA faces from live ones as follows:

$$\begin{aligned} \begin{aligned} f_S(x)&= f_S \circ f_G(e) \\&= [\underbrace{1,\ldots , 1}_{n}, \underbrace{1,\ldots , 1}_{t}, \underbrace{0,\ldots , 0}_{m-t}] \cdot {{{\textbf {E}}}}^T \cdot e \\&= \alpha _1 + \ldots + \alpha _n + \underbrace{\alpha _{n+1} + \ldots + \alpha _{n+t}}_{overfitting\ items}, \end{aligned} \end{aligned}$$
(7)

In this FAS model, the coefficients from \(\alpha _1\) to \(\alpha _{n+t}\) determine the final discriminative result. While the first n terms represent the strength of spoof features, the latter t spoof-irrelevant features are mistakenly treated as spoof clues; these are the overfitting terms. Live faces carrying these features may be misidentified as PA faces. Thus, a better FAS model ought to have fewer overfitting terms. In the following, we show how our proposed CG-FAS eliminates such overfitting.

Fig. 3

An illustration of the editing process on the HiFiMask dataset. By increasing the editing coefficient \(\beta \) progressively, the intermediate images are exhibited in detail. Most spoof-irrelevant attributes, such as hats, glasses and lighting conditions, are preserved during the process

For a typical FAS dataset, suppose that \(b_1 \in {{{\textbf {E}}}}_b\) represents a spoof-irrelevant but overfitted feature, which mostly occurs in PA samples but rarely in live samples. In such circumstances, a trained FAS model tends to take the form of Eq. (7), where the overfitting term \(\alpha _{n+1}\) exists.

Any input image x corresponds to a latent code e via Eq. (1), and e is equivalent to a coordinate vector \(\alpha \) under the basis \({{{\textbf {E}}}}\) via Eq. (6). For simplicity, we restrict each coordinate \(\alpha _i\) to 0 or 1. Thus, x can be categorized into the following two types:

$$\begin{aligned} x \triangleq e \triangleq \alpha = {\left\{ \begin{array}{ll} {[}\underbrace{0,\ldots , 0}_{n}, 0, \underbrace{0,\ldots , 0}_{m-1}]^T \ if\ x\ is\ live \\ {[}\underbrace{1,\ldots , 1}_{n}, 1, \underbrace{0,\ldots , 0}_{m-1}]^T \ if\ x\ is\ PA, \end{array}\right. } \end{aligned}$$
(8)

Once we apply CG-FAS to x, the m spoof-irrelevant features stay consistent and the n spoof clues are converted. The generated sample \(x'\) can be expressed as:

$$\begin{aligned} \begin{aligned} x'&= \text {CG-FAS}(x) \triangleq e' \triangleq \alpha ' \\&= {\left\{ \begin{array}{ll} {[}\underbrace{1,\ldots , 1}_{n}, 0, \underbrace{0,\ldots , 0}_{m-1}]^T \ while\ x'\ is\ PA \\ {[}\underbrace{0,\ldots , 0}_{n}, 1, \underbrace{0,\ldots , 0}_{m-1}]^T \ while\ x'\ is\ live, \end{array}\right. } \end{aligned} \end{aligned}$$
(9)

By adding both x and \(x'\) to the training set, the \((n+1)\)-th feature \(b_1\) in Eq. (6) now appears in both live and PA examples. Training on this augmented set, we tend to obtain a better FAS model in which the overfitting term \(\alpha _{n+1}\) in Eq. (7) no longer exists, thus relieving overfitting.
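The toy computation below illustrates this argument numerically under the binary coordinates of Eqs. (8) and (9): before augmentation the spoof-irrelevant coordinate \(b_1\) is perfectly correlated with the live/PA label, and after adding the CG-FAS samples the correlation vanishes. The dimensions and values are arbitrary illustrative choices.

```python
import numpy as np

n, m = 3, 4                                                        # spoof dims, spoof-irrelevant dims
live      = np.concatenate([np.zeros(n), [0], np.zeros(m - 1)])    # Eq. (8), live sample
spoof     = np.concatenate([np.ones(n),  [1], np.zeros(m - 1)])    # Eq. (8), PA sample carrying b_1
# CG-FAS flips the n spoof coordinates but keeps the m spoof-irrelevant ones, Eq. (9)
gen_spoof = np.concatenate([np.ones(n),  [0], np.zeros(m - 1)])    # generated from the live input
gen_live  = np.concatenate([np.zeros(n), [1], np.zeros(m - 1)])    # generated from the PA input

def corr_with_label(samples, labels, dim):
    values = np.array([s[dim] for s in samples], dtype=float)
    return np.corrcoef(values, labels)[0, 1]

print(corr_with_label([live, spoof], [0, 1], dim=n))                              # 1.0: b_1 predicts the label
print(corr_with_label([live, spoof, gen_spoof, gen_live], [0, 1, 1, 0], dim=n))   # 0.0: correlation removed
```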

3.5 Batch Image Editing

With the editing operation in Eq. (4), we can realize the desired cross-label generation of Eq. (9): as shown in Fig. 3, by progressively increasing the editing coefficient \(\beta \), input live faces are smoothly converted into 3D plaster mask faces. If \(\beta \) is set to a low value, the attribute similarity between input and output is high but the output's spoof degree is low; if \(\beta \) is increased, the output's FAS score becomes higher but the attribute similarity decreases. Thus, determining the value of \(\beta \) becomes a vital problem.

Note that \(\beta \) is easy to determine for a single image by careful manual adjustment, but this is infeasible for large batches of images, as it would require excessive human effort. In this study, we use a face recognition model \(f_{R}\) to evaluate the consistency of spoof-irrelevant features and a FAS model \(f_{S}\) to evaluate the spoof confidence score. When editing batches of live/spoof faces, we expect the \(f_{R}\) similarity between input and output to remain close and the \(f_{S}\) score to be reversed after editing.

To measure this objective quantitatively, we use the \(f_{R}\) and \(f_{S}\) scores on the original validation set as references, and we expect CG-FAS generated samples to follow a distribution similar to that of the validation set on these two scores. Thus, we calculate the average \(f_{R}\) and \(f_{S}\) scores on the validation set, denoted as \({\tilde{t}} = [t_{R}, t_{S}]\). The optimization problem can then be stated as finding an optimal value \(\beta ^*\) such that the generated images' \(f_{R}\) and \(f_{S}\) score point is close to \({\tilde{t}}\), as described below:

$$\begin{aligned} \begin{aligned} \beta ^{*}= \mathop {\arg \min }\limits _{\beta } \{ \mid \frac{1}{k} \sum _{i=1}^{k} {f_{R}(f_G(e_i + \beta \delta ),f_G(e_i))} - t_{R} \mid \\ \ + \mid \frac{1}{k} \sum _{i=1}^{k} f_{S}(f_G(e_i + \beta \delta )) - t_{S} \mid \}, \end{aligned} \end{aligned}$$
(10)

Here k is the batch size, \(e_i\) is the \(i\)-th latent code in the batch, and \(\delta \) is a residual vector in the IB matrix \({\mathcal {D}}_{n+1}\). By trying different values of \(\beta \) in the equation above, we can find an approximately optimal value \(\beta ^*\).
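A simple way to approximate \(\beta ^*\) is a grid search over candidate values, evaluating the objective of Eq. (10) on one batch. The sketch below assumes callable wrappers `f_R` (pairwise face-recognition similarity) and `f_S` (spoof confidence); these interfaces and the candidate grid are illustrative assumptions.

```python
import torch

def search_beta(codes, delta, f_G, f_R, f_S, t_R, t_S, betas):
    """Grid-search the editing coefficient beta* of Eq. (10) on one batch.

    codes : W+ latent codes e_i of the batch, shape (k, d)
    delta : one residual vector from the IB matrix, shape (d,)
    t_R, t_S : target scores averaged on the original validation set
    betas : candidate values, e.g. torch.arange(0.10, 0.40, 0.01)
    """
    best_beta, best_obj = None, float('inf')
    with torch.no_grad():
        originals = f_G(codes)
        for beta in betas:
            edited = f_G(codes + beta * delta)
            r = f_R(edited, originals).mean()       # identity consistency with the inputs
            s = f_S(edited).mean()                  # spoof confidence of the edited faces
            obj = abs(r - t_R) + abs(s - t_S)       # objective of Eq. (10)
            if obj < best_obj:
                best_beta, best_obj = float(beta), float(obj)
    return best_beta
```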

4 Experiments

In this section, we first introduce the experimental settings and then quantitatively evaluate the editing results with expert models. Next, we compare our proposed CG-FAS with other contemporary methods on two intra-testing datasets and conduct two cross-testing experiments. We also perform ablation studies on four key factors in our research. Finally, we show more visualization results of the IB matrix applied to four typical FAS datasets.

Table 1 Four FAS datasets used in our experiments
Table 2 Evaluation of the original and CG-FAS generated testing set (marked as \(\checkmark \)) on two FAS models, which are well-trained on HiFiMask and OULU-NPU training set respectively

4.1 Experimental Settings

4.1.1 Datasets & Preprocessing

Four high-resolution datasets, namely OULU-NPU (Boulkenafet et al., 2017b), SiW (Liu et al., 2018a), HKBU MARsV2 (Liu et al., 2016) and HiFiMask (Liu et al., 2022, 2021b), are chosen as FAS datasets, as shown in Table 1. Both OULU-NPU and SiW contain two categories of 2D PA: print and replay attacks. MARsV2 is a 3D mask presentation attack dataset that includes live images and two types of masks: ThatsMyFace and RealF. HiFiMask is a newly proposed 3D high-fidelity dataset that contains live faces and transparent, plaster, and resin masks. In addition, the FFHQ (Karras et al., 2019) dataset, including 70,000 face images, is used when training StyleGAN. After face detection and alignment, all images are cropped to \(512 \times 512\) resolution as preprocessing.

4.1.2 Implementation Details

We choose the StyleGAN2-ada (Karras et al., 2020a) configuration with a pre-trained model as the generator. During fine-tuning, we freeze the first ten layers and train the remaining parameters with FFHQ and the four FAS datasets. For the encoder, we follow the implementation of the e4e (Tov et al., 2021) network, with \(\lambda _{ID}\) set to 0.5. We select CDCN (Yu et al., 2020) as our FAS model backbone. During batch image editing, we set \(\beta \) to 0.22 on HiFiMask and 0.25 on OULU-NPU. Moreover, we set the ratio of generated images to original images to 1.0 when applying the IB matrix to FAS tasks. Eight NVIDIA RTX-2080 GPUs are employed during training.

4.1.3 Performance Metrics

For intra testing, we strictly follow the protocols and evaluation metrics of OULU-NPU and HiFiMask. APCER (Attack Presentation Classification Error Rate) and BPCER (Bona Fide Presentation Classification Error Rate) are computed first, and their mean, the ACER (Average Classification Error Rate), is used as the evaluation metric. For cross testing, the HTER (Half Total Error Rate) and AUC (Area Under Curve) values are calculated as evaluation metrics.
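For reference, the sketch below computes these error rates from spoof-confidence scores at a fixed threshold; the threshold value and the score convention (higher means more likely an attack) are assumptions for illustration.

```python
import numpy as np

def fas_metrics(scores, labels, threshold=0.5):
    """APCER, BPCER, and ACER; labels: 1 = attack (PA), 0 = bona fide (live).

    A sample is predicted as an attack when its spoof-confidence score
    exceeds the threshold. HTER is computed analogously as the mean of
    FAR and FRR, with the threshold fixed on a development set.
    """
    pred_attack = scores > threshold
    attack, bona = labels == 1, labels == 0
    apcer = np.mean(~pred_attack[attack])   # attacks accepted as bona fide
    bpcer = np.mean(pred_attack[bona])      # bona fide rejected as attacks
    acer = (apcer + bpcer) / 2.0
    return apcer, bpcer, acer
```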

Table 3 The intra-dataset testing results on OULU-NPU

4.2 Analyzing Editing Result

As shown in Fig. 3, we demonstrate the complete editing process on images from HiFiMask. By changing the editing coefficient \(\beta \) progressively, live faces turn into 3D high-fidelity masks smoothly. Attributes such as glasses, expression, skin color, hat, and lighting condition are well preserved after generation, which shows the clear advantage of our proposed CG-FAS.

Furthermore, we conducted a group of experiments to evaluate the generated images quantitatively. As shown in Table 2, the first column lists the original testing set and the CG-FAS generated sets to be evaluated, while the last column lists the two training sets on which we train two expert models. The middle columns display the comparison results on HiFiMask and OULU-NPU protocol 1. Evaluated with the expert model trained on HiFiMask, the ACER on the original testing set is 1.3, while that on the generated testing set is 3.1. Both values are low and close to each other, indicating that our generated data carries the same spoof clues as the original HiFiMask. Additionally, when using the model trained on the OULU-NPU dataset as the expert, the ACER on the generated set is 0.9, which is also close to the original testing set result.

Table 4 The intra-dataset testing results on HiFiMask

4.3 Intra Testing

4.3.1 Result on OULU-NPU

OULU-NPU is a widely used evaluation dataset designed for 2D presentation attacks. It defines four protocols by allocating different identities, PA types, devices, and sessions. As shown in Table 3, we apply our proposed CG-FAS framework to the training tasks of all four protocols and achieve the best performance on each. Compared with other contemporary methods, CG-FAS shows clearly superior performance on protocols 3 and 4, which verifies its strong generalization ability on hard examples. This experiment demonstrates the superiority of our proposed CG-FAS on 2D presentation attacks.

4.3.2 Result on HiFiMask

We further conduct an intra-testing experiment on the 3D mask dataset HiFiMask. It is a newly released high-resolution 3D mask dataset containing three representative mask types made of transparent, plaster, and resin materials. The HiFiMask dataset covers various identities, lighting conditions, scenes, and devices, and three protocols are defined based on these factors. Within its training set, we utilize our CG-FAS framework to generate mask faces from live ones and live faces from mask ones. After adding these generated images to the training set, our FAS model achieves state-of-the-art ACER by a considerable margin on all three protocols, as shown in Table 4.

4.4 Cross Testing

We evaluate several state-of-the-art methods and our CG-FAS under the protocols of our cross-domain attack benchmark (Sect. 4.4.2). In Table 6, the results on all three protocols demonstrate that CG-FAS is superior to previous methods. This is mainly attributed to CG-FAS's ability to generate cross-domain images. Adding these generated data changes the original distribution of HiFiMask and OULU-NPU, which improves the FAS model's generalization ability.

4.4.1 HiFiMask & MARsV2

Since each FAS dataset has its own unique spoofing characteristics, cross-dataset experiments can verify a method's generalization ability. Following the paradigm of prior work (Liu et al., 2022), we choose two 3D mask datasets, HiFiMask and MARsV2, to evaluate our CG-FAS framework. Using the HiFiMask dataset as the training set, CG-FAS shows a significant improvement by a large margin when testing on the MARsV2 dataset. Conversely, we further use MARsV2 as the training set and HiFiMask as the testing set; the resulting HTER and AUC values also outperform previous works, as shown in Table 5. This performance strongly demonstrates the generalization ability of our proposed CG-FAS framework.

Table 5 Cross-testing results on two 3D presentation attack datasets
Table 6 Cross-domain results on our proposed cross-domain attack benchmark
Table 7 Ablation study of CG-FAS with different backbones on OULU-NPU protocol 1

4.4.2 Cross-domain Attack Benchmark

To further demonstrate that our method is effective in generating faces of unseen domains, we construct the following settings: (1) We combine the training set of OULU-NPU, the training set of HiFiMask, and the generated label-transformation results of these datasets as the actual training set. (2) We set the testing set of SiW as the testing set of protocol 1. This protocol ensures that the training and testing sets share the same 2D presentation attacks but have different distributions. (3) We set the whole MARsV2 as the testing set of protocol 2. This protocol ensures that the training and testing sets share the same 3D presentation attacks but have different distributions. Considering that MARsV2 is relatively small in scale, we use all of its data as the testing set. (4) Finally, we combine the two protocols above as protocol 3.

Table 8 Ablation study of editing coefficient \(\beta \) on HiFiMask protocol 1

4.5 Ablation Study

In this part, we perform four groups of experiments on four important factors in CG-FAS: the FAS model backbone, the editing coefficient \(\beta \), the number of generated images, and the generative method. These experiments are conducted on the OULU-NPU and HiFiMask datasets.

4.5.1 Impact of FAS Backbones

To verify the effectiveness of our proposed CG-FAS with different FAS model backbones, we select three prevalent networks, ResNet50 (He et al., 2016b), Aux. (Liu et al., 2018a), and CDCN (Yu et al., 2020), as backbones. The comparison results are shown in Table 7. The first row reports the baseline results obtained with these backbones. In the second row, the training set is augmented with DSDG-generated data, and the ACER values of all three backbones improve to some extent. In the last row, CG-FAS achieves the most competitive ACER values, which illustrates that our method does not rely on any specific backbone.

Fig. 4

The face recognition similarity score vs. FAS AUC score under different editing coefficients \(\beta \). The red point denotes the result on the HiFiMask validation set

Table 9 Ablation study of the ratio of CG-FAS generated images to original images during training
Table 10 Ablation study of generative methods on HiFiMask dataset

4.5.2 Impact of Editing Coefficient

As shown in Table 8, we conduct five groups of experiments with different values of \(\beta \), ranging from 0.20 to 0.35. In addition, we plot the similarity score vs. FAS score curve in Fig. 4. All experiments are conducted on the training set of HiFiMask protocol 1. When \(\beta \) is set to 0.35, the face recognition score between the original training set and the generated samples is low, which means the faces are over-edited, while the face anti-spoofing score is high, meaning that we indeed obtain the desired PA or live face images. When \(\beta \) is set to 0.20, the face recognition score is high while the face anti-spoofing score is low. Since the curve is fitted with unavoidable error, we set the optimal \(\beta ^*\) to approximately 0.22, which is very close to the target point \({\tilde{t}}\). Thus \(\beta ^* = 0.22\) is an approximately optimal solution of Eq. (10).

Fig. 5

An exhibition of applying the IB matrix to four FAS datasets. The figure is split into three parts by dotted lines. a Intra-dataset Generation: The left two columns compare input transparent mask images from the HiFiMask dataset with our generated live faces, while the right two columns show live faces from the MARsV2 dataset and our generated ThatsMyFace mask images. b Cross-domain Generation: Both MARsV2 and HiFiMask are 3D mask datasets. Using MARsV2 live faces as input, we can generate HiFiMask plaster-mask-style faces. In addition, OULU-NPU live faces can be used to generate SiW replay-attack-style faces. c Expanding-PA Generation: Here we use a 2D PA dataset as input to generate 3D PA faces, and vice versa. OULU-NPU print attack faces can be used to generate HiFiMask plaster 3D mask faces. Besides, HiFiMask live faces are used to generate SiW print attack faces. (Best viewed in color)

4.5.3 Impact of Generated Image Numbers

An intuitive question is how many generated samples are suitable for data augmentation. We add different amounts of generated images to the training set, using CDCN (Yu et al., 2020) as the backbone, and train models on HiFiMask protocol 1. As shown in Table 9, adding generated images at a ratio of 0.1 already improves the FAS model performance, and the ACER keeps improving until the ratio reaches 1.0. After that, the performance saturates at a ratio of 2.0 on HiFiMask, and when the ratio is increased to 3.0, the ACER value degrades. We conjecture that too many generated images alter the distribution of the original dataset and introduce model bias. Thus we set 1.0 as the best ratio throughout all experimental settings.

4.5.4 Impact of Generative Methods

As shown in Fig. 1, the Stable Diffusion (Rombach et al., 2022) based method DreamBooth (Ruiz et al., 2023) can also generate faces with PA traits, although other face features are not sufficiently consistent with the input images. Therefore, we also use the DreamBooth-generated images as augmentation and test on the HiFiMask dataset. Table 10 shows that the DreamBooth-augmented model achieves a better ACER value than the baseline CDCN model, but is still inferior to our proposed CG-FAS. We believe that diffusion-model-based generation also works, but it is less convenient to use than our proposed IB matrix; in addition, it is hard to find proper prompts that align with the desired face editing tasks.

4.6 More Visualization Results

To show the superiority of our proposed IB matrix, we conduct more visualization experiments on the four aforementioned FAS datasets. According to the differences between source and target images in domain and presentation attack type, our generation results can be categorized into three types: intra-dataset generation, cross-domain generation, and expanding-PA generation, as shown in Fig. 5.

4.6.1 Intra-dataset Generation

First, we utilize our proposed CG-FAS to generate samples within a FAS dataset. Here we conduct experiments on the HiFiMask dataset, which contains live, transparent, plaster, and resin mask images. We select some transparent 3D mask faces from HiFiMask and convert them into their corresponding live versions. These cross-label generation results are shown in the first column of Fig. 5a. We also conduct intra-dataset generation on another 3D mask FAS dataset, MARsV2, shown in the second column. Selected live face images are converted into high-fidelity ThatsMyFace mask style images, which alleviates the high cost of ThatsMyFace mask data collection and facilitates data diversity during training.

4.6.2 Cross-domain Generation

In this study, cross-domain datasets refer to FAS datasets that have the same PA types but different distributions. For instance, both the OULU-NPU and SiW datasets contain print and replay attack images but exhibit serious domain shifts. Figure 5b shows how we transform live face images from OULU-NPU into replay-attack-style images of SiW. Besides, we conduct cross-domain experiments on two 3D mask FAS datasets, MARsV2 and HiFiMask. Live faces from MARsV2 are transformed into plaster mask style images of HiFiMask with strong perceptual quality.

4.6.3 Expanding-PA Generation

Here we utilize CG-FAS to perform generation across two datasets that contain different presentation attack types. For example, we can convert live face images from OULU-NPU into plaster mask style images of HiFiMask. As shown in Fig. 5c, such generated 3D mask attacks are visually of high quality. We can also convert live face images from HiFiMask into the print attack style of the SiW dataset. These generated face images contain obvious print features (e.g., color distortion) and demonstrate the ability of our proposed CG-FAS to expand PA types.

Fig. 6

Result of t-SNE visualization on OULU-NPU dataset. Circle markers represent samples from the original dataset, while triangle markers represent CG-FAS generated images

4.6.4 T-SNE Visualization

Figure 6 shows the t-SNE visualization results on the protocol 1 training set of the OULU-NPU dataset. As illustrated, live and generated live images follow similar distributions, as do PA and generated PA images. This similarity suggests that the generated images effectively extend the boundary of the original dataset, thereby serving as beneficial augmentation for training FAS models.

5 Conclusion and Future Work

In this study, we propose a novel Interchange Bridge matrix, which can convert a live face into a 3D high-fidelity mask, replay, print, or other physical presentation attack. Correspondingly, it can also restore a specific physical presentation attack to a live face. Using this matrix as an augmentation scheme, we put forward CG-FAS to promote the training of FAS models. To validate CG-FAS, we conduct experiments on both existing FAS benchmarks and a proposed cross-domain attack benchmark. Experimental results show that CG-FAS outperforms existing generation methods by a clear margin.

Vision Foundation Models (VFMs) such as Stable Diffusion also show impressive results in this study. Fine-tuning VFMs on downstream tasks is widely studied but rarely applied in the face anti-spoofing domain. In the future, we will seek effective adaptation of VFMs for FAS research.