1 Introduction

With the rapid development of deep learning, generative models such as Variational Auto-Encoders (VAEs) (Kingma & Welling, 2013), Generative Adversarial Networks (GANs) (Goodfellow et al., 2014a), and Diffusion Models (Ho et al., 2020) have entered many fields and made significant progress. Alongside these generative methods, face forgery, e.g., FaceSwap, has become a popular topic in recent research. However, by synthesizing realistic faces that fool human beings, face forgery techniques pose risks and may be used for nefarious purposes such as fake news and financial fraud. Therefore, to eliminate these potential threats, forgery detection has become a significant research direction, and plenty of efforts (Li et al., 2021b; Luo et al., 2021; Chen et al., 2022; Liu et al., 2021; Qian et al., 2020) have been devoted to face forgery detection.

Fig. 1

Challenges in forgery detection. a Current forgery detectors usually make predictions depending on the method-specific synthetic textures (e.g., Deepfakes (DF)). b Unseen deepfake techniques hold quite different synthetic textures (e.g., DF-VAE (Jiang et al., 2020a), Face2Face (F2F) (Thies et al., 2016), FaceSwap (FS), and Neural-Textures (NT) (Thies et al., 2019)), which causes the challenge of generalization across datasets. c Perturbations poison these textures (e.g., blur, noise, block, and adversarial attacks (Szegedy et al., 2014; Goodfellow et al., 2014b; Li et al., 2021a)), which leads to the challenge of robustness against corruption

Current forgery detectors can achieve excellent performance when the training and testing forgeries come from the same datasets and deepfake techniques, dubbed “in-dataset" settings, since they focus on method-specific synthetic textures (Fig. 1a). Although these detectors perform well under “in-dataset" settings, two inevitable challenges remain for practical face forgery detection. (1) The first challenge is the generalization of forgery detectors across various datasets (“cross-dataset" settings), where the testing forgeries are created by unseen advanced deepfake methods. Due to the gap in synthetic textures between the training and testing data generated by different deepfake methods, as shown in Fig. 1b, existing forgery detectors usually suffer from poor detection performance. (2) The other challenge is the robustness of forgery detectors. Owing to the many uncertainties of the real world, natural media data is often disturbed by common corruption like blur and compression, as well as by designed adversarial attacks (Szegedy et al., 2014; Goodfellow et al., 2014b; Li et al., 2021a; Feng et al., 2021, 2023b, a), as shown in Fig. 1c. These perturbations may poison discriminative synthetic textures and mislead forgery detectors into incorrect predictions, thereby reducing their performance.

Regarding these two challenges, plenty of works have made impressive progress in boosting generalization for cross-dataset detection, while few works focus on improving robustness. (1) To improve generalization across datasets, earlier works proposed to exploit frequency artifacts (Durall et al., 2020; Liu et al., 2021; Qian et al., 2020) and spatial information (Wang et al., 2020; Afchar et al., 2018; Nguyen et al., 2019b). Recently, researchers have found that data augmentations can yield better generalization, and many works take advantage of data augmentations to enhance generalization further. Specifically, they usually synthesize data by empirically designed augmentations, such as generating blended images from two pristine images. Although these augmentation-based methods can improve a detector’s generalization to a certain extent, they also have some defects. Firstly, their designed augmentations mainly depend on intuitive thoughts and only include limited, fixed synthetic strategies. Although SLADD (Chen et al., 2022) tries to construct diverse samples dynamically by adversarial learning, its manipulated region selection is still hand-crafted, and its synthetic strategy remains blend-based. Secondly, these augmentation-based methods lack theoretical analysis of the effectiveness of data augmentations, and they do not exploit frequency information to further improve generalization as earlier works do (Durall et al., 2020; Liu et al., 2021; Qian et al., 2020). (2) As for robustness, among these attempts to improve generalization, only limited works (Haliassos et al., 2022, 2021) consider keeping detectors robust against corruption, let alone the more threatening adversarial perturbations. Thus, an investigation into simultaneously improving the generalization and robustness of forgery detectors with theoretical analysis is worthwhile.

Inspired by the above discussion, in this paper we propose a forgery detection generative adversarial network (FD-GAN) with two generators (i.e., a blend-based generator and a transfer-based generator) for adaptive data augmentation and a discriminator (i.e., the forgery detector), which simultaneously boosts both generalization and robustness. Specifically, the blend-based generator adaptively computes manipulated regions (i.e., the forgery masks) for blending, while the transfer-based generator mixes the synthetic style of a fake reference image with the semantics of a real source image to make augmentations. Moreover, to further improve generalization, we design the discriminator to judge whether the input is synthetic and to predict manipulated regions (i.e., the forgery prototypes) through a collaboration of spatial and frequency branches, in the spirit of earlier works (Durall et al., 2020; Liu et al., 2021; Qian et al., 2020). Concretely, the spatial branch computes spatial features and predicts forgery prototypes, while the frequency branch mines helpful and generalizable frequency cues. Further, to boost the robustness of the detector, we propose a Low-rank Module (LRM) in the frequency branch, which utilizes Low-rank Estimation algorithms (Zhuo et al., 2021; Li et al., 2018b; Zhang et al., 2019) to filter out adversarial corruption. Therefore, our method can explore a large variety of augmented forgeries from the adaptive generators, together with spatial and robust frequency cues, to progressively improve generalization and robustness. Besides, we provide a theoretical analysis of the generalization and of the robustness against adversarial attacks to guarantee the effectiveness of the proposed method. Extensive experiments illustrate that our method achieves state-of-the-art forgery detection performance in both in-dataset and cross-dataset settings. The robustness of our method against adversarial attacks is also demonstrated experimentally, and the results in Sect. 4.4 further verify its robustness against common corruption (e.g., blur and noise).

The contributions of our paper are as follows:

  • We propose a forgery detection generative adversarial network (FD-GAN) with two generators (i.e., a blend-based generator and a transfer-based generator) for adaptive data augmentations and a discriminator (i.e., the forgery detector), which can not only improve the generalization across datasets but also boost the robustness against corruption and adversarial attacks.

  • We design the discriminator to identify whether the input is real and to indicate the manipulated regions with spatial and frequency branches. Besides, a Low-rank Module (LRM) in the frequency branch, based on Low-rank Estimation, removes adversarial corruption to keep our model robust.

  • Our method achieves superior performance on face forgery detection compared with current state-of-the-art methods in both in-dataset and cross-dataset settings. Moreover, our method shows strong robustness against common corruption and adversarial attacks.

2 Related Work

Face Forgery Techniques. In the past decades, face forgery techniques have rapidly developed. Early attempts (Dale et al., 2011; Garrido et al., 2014, 2015; Thies et al., 2015) at face forgery usually reconstruct 3D models for both source and target faces and generate synthetic videos. For example, Face2Face (Thies et al., 2016) is a classical real-time face forgery technique with 3D model reconstruction and image-based rendering. Some methods, like FaceSwap, even rely only on image processing to create synthetic faces. With the development of deep learning, many face forgery techniques (e.g., Deep Video Portraits (Kim et al., 2018) and Neural Textures (Thies et al., 2019)) apply neural networks in their pipelines for facial reenactment. Recently, generative models like Variational Auto-Encoders (VAEs) (Kingma & Welling, 2013), Generative Adversarial Networks (GANs) (Goodfellow et al., 2014a), and Diffusion Models (Ho et al., 2020) have shown excellent performance in AI-generated content (AIGC) and become common choices for face forgery generation. Since these techniques can produce high-quality synthetic facial images and videos, their potential threats should be taken seriously.

Face Forgery Detection. Recent research has seen various attempts at face forgery detection (Wang et al., 2020; Rossler et al., 2019; Yang et al., 2019; Li et al., 2018a, 2020a; Chen et al., 2022; Shiohara & Yamasaki, 2022; Li et al., 2021b; Qian et al., 2020; Liu et al., 2021; Luo et al., 2021; Xu & Feng, 2023). Earlier works (Wang et al., 2020; Rossler et al., 2019) apply common CNNs like ResNet (He et al., 2016) and Xception (Chollet, 2017) to treat face forgery detection as a binary classification problem. Later, some approaches (Yang et al., 2019; Li et al., 2018a) pay attention to anomalies that frequently appear in clumsy face forgeries, while others (Cozzolino et al., 2021; Agarwal et al., 2019) utilize auxiliary identity information. Concerns about the generalization of forgery detectors arise along with the rapid development of deepfake techniques, and numerous methods have been proposed to solve this problem, such as applying data augmentations (Li et al., 2020a; Chen et al., 2022; Shiohara & Yamasaki, 2022), mining frequency cues (Li et al., 2021b; Qian et al., 2020; Liu et al., 2021; Luo et al., 2021; Miao et al., 2023, 2022), assisting with extra tasks (Chen et al., 2022; Nguyen et al., 2019a), using attention mechanisms (Zhao et al., 2021), and focusing on self-consistency (Huh et al., 2018; Dong et al., 2022). In this paper, we utilize a forgery detector with a spatial branch and a frequency branch to combine their advantages, a design rarely considered in previous works. Furthermore, to explore more training samples, an adversarial data augmentation strategy is also employed.

Fig. 2

The pipeline of our method. (1) The blend-based generator \(G_B\) and the transfer-based generator \(G_T\) take source and reference images to derive synthetic samples. (2) The discriminator D utilizes synthetic samples to produce the prediction (real or fake) and the manipulated regions (i.e., the forgery prototypes \(\{P_1, \cdots , P_k\}\)) with a spatial branch and a frequency branch. Details are available in Sect. 3

Adversarial Robustness. In the real world, neural networks encounter incidental adversity like common corruption, as well as intentional adversity created by adversarial attackers; both can mislead models into wrong predictions. Some face forgery detection methods (Haliassos et al., 2021, 2022) have made progress in defending against common corruption. However, adversarial attacks are usually considered more severe, since they can fool a model with perturbations that are invisible to human beings.

In addition to universal adversarial attacks (Szegedy et al., 2014; Goodfellow et al., 2014b; Li et al., 2021a; Feng et al., 2024; Xu et al., 2024), some works (Li et al., 2021a; Jia et al., 2022) explore attacks targeting face forgery detection. Li et al. (2021a) use a pre-trained StyleGAN (Karras et al., 2019a) with gradients to generate high-quality adversarial examples, and Carlini and Farid (2020) apply black-box attacks to forgery detectors to evaluate their robustness. Several frequency-based attacks (Jia et al., 2022; Luo et al., 2022) have been proposed to evade frequency-based detectors while keeping adversarial perturbations imperceptible. Recently, backdoor attacks (Sun et al., 2023), attribute variation-based attacks (Meng et al., 2023), and audio-based attacks (Panariello et al., 2023) have been applied to face forgery detection. In contrast, only a few works (Hussain et al., 2021; Neekhara et al., 2021) try to protect detectors against adversarial attacks. In this paper, we remove adversarial corruption with an elaborate module based on Low-rank Estimation.

3 Method

In this section, we propose a forgery detection generative adversarial network (FD-GAN), as shown in Fig. 2. Our FD-GAN consists of two generators and a discriminator: (1) The blend-based and transfer-based generators are responsible for adversarial data augmentations to boost generalization, illustrated in Sect. 3.1. (2) The discriminator (i.e., the detector) is applied for forgery detection with robust features, described in Sect. 3.2. Furthermore, Sect. 3.3 formulates the training objectives and Sect. 3.4 provides a detailed discussion with theoretical support.

3.1 Generators

Both the blend-based generator \(G_B(\cdot ;\theta _{B})\) and transfer-based generator \(G_T(\cdot ;\theta _{T})\) aim to improve generalization with adversarial data augmentations, where \(\theta _{B}\) and \(\theta _{T}\) denote their parameters. They take an original source image \(x_s \in \mathcal {R}^{3 \times H_0 \times W_0}\) and a manipulated reference image \(x_r \in \mathcal {R}^{3 \times H_0 \times W_0}\) as inputs, and output synthetic samples \(x_{syn} \in \mathcal {R}^{3 \times H_0 \times W_0}\).

Fig. 3

Details of the Blend Process. The source image \(x_s\) is processed with k mask filters \(F_B = \{f^1_B, \cdots , f^k_B\}\) by Eq. (1) to get the corresponding forgery masks \(\{M_1, \cdots , M_k\}\). Since each \(M_i\) covers the corresponding part of manipulated regions, these forgery masks are fused by Eq. (2) to obtain the blending mask M. With the blending mask M, the blend-based generator merges the reference image \(x_r\) and the source image \(x_s\) into a synthetic image \(x_{syn}^B\)

Blend-based Generator \(G_B\). The blend-based generator extracts the feature map \(\mathcal {A}_{B} \in \mathcal {R}^{C \times H \times W}\) from the inputs, where C, H, and W represent the number of channels, the height, and the width of the feature map. To produce high-quality data augmentations with dynamic manipulated regions, we suppose the manipulated regions can be divided into k local parts. Thus, we design k mask filters (MFs) \(F_B = \{f^1_B, \cdots , f^k_B\}\), where each \(f^i_B, \forall i = 1, \cdots , k\) is responsible for locating one specific part. Concretely, each MF \(f^i_B \in \mathcal {R}^{1 \times 1 \times C}\) is parameterized by a \(1 \times 1\) convolution kernel weight. These MFs are convolved with the feature map \(\mathcal {A}_B\) to obtain the corresponding forgery masks \(\{M_1, \cdots , M_k\}\):

$$\begin{aligned} \begin{aligned} M_i&= \sigma (f_B^i \odot \mathcal {A}_B), \end{aligned} \end{aligned}$$
(1)

where \(\odot \) is the convolution operation and \(\sigma (\cdot )\) represents the sigmoid function. Each \(M_i\) covers the corresponding part of manipulated regions, and we fuse \(\{M_1, \cdots , M_k\}\) as the blending mask M:

$$\begin{aligned} \begin{aligned} M&= Clip(\frac{1}{k} \sum _{i = 1}^k Clip(M_i, th), 0), \end{aligned} \end{aligned}$$
(2)

where \(Clip(X, th)\) sets values in X greater than th to 1 and others to 0. The blending can then be formulated as:

$$\begin{aligned} \begin{aligned} x_{syn}^B&= M \otimes x_r + (1 - M) \otimes x_s, \end{aligned} \end{aligned}$$
(3)

where \(\otimes \) is the Hadamard product, and \(x^{B}_{syn}\) is the blended synthetic result. The above steps are also illustrated in Fig. 3. Note that pre-processing steps like face alignment, color transfer, and blur should be applied before blending to avoid significant artifacts in the results.
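To make the blend process concrete, the following is a minimal PyTorch sketch of Eqs. (1)–(3); the `BlendGenerator` name, the `backbone` argument, and the default values of k and th are illustrative assumptions rather than our exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BlendGenerator(nn.Module):
    """Sketch of the blend-based generator G_B (Eqs. 1-3)."""

    def __init__(self, backbone, channels, k=8, th=0.5):
        super().__init__()
        self.backbone = backbone          # any C x H x W feature extractor
        # k mask filters, each a 1x1 convolution kernel (Eq. 1).
        self.mask_filters = nn.Conv2d(channels, k, kernel_size=1, bias=False)
        self.th = th

    def forward(self, x_s, x_r):
        feat = self.backbone(x_s)                        # A_B
        masks = torch.sigmoid(self.mask_filters(feat))   # {M_1, ..., M_k}
        # Eq. (2): binarize each mask at th, average over k, then binarize
        # above 0 (i.e., take the union of the k part masks).
        hard = (masks > self.th).float()
        M = (hard.mean(dim=1, keepdim=True) > 0).float()
        M = F.interpolate(M, size=x_s.shape[-2:], mode="nearest")
        # Eq. (3): Hadamard blending of the reference into the source.
        x_syn = M * x_r + (1.0 - M) * x_s
        return x_syn, masks
```

Note that the hard thresholding in Eq. (2) is non-differentiable, so a practical implementation would train through a soft or straight-through variant; the sketch shows the forward pass only.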

Transfer-based Generator \(G_T\). The main idea of the transfer-based generator comes from a simple heuristic principle: face forgeries can be viewed as a combination of original contents and synthetic styles. As a result, we can separate the styles of the synthetic reference samples \(x_r\) and generate a new synthetic sample \(x_{syn}^T\) with such synthetic styles and the content from the original source image \(x_s\). Compared with the blend-based generator, the transfer-based generator avoids boundary artifacts in synthetic samples, which enables it to explore more challenging samples for augmentation. It contains a style encoder \(F_S\), a content encoder \(F_C\), and a decoder \(D_T\).

The style encoder \(F_S\) uses the encoder-bottleneck architecture (Choi et al., 2018) to extract the synthetic style from the reference image \(x_r\), and the output feature map \(\mathcal {A}_S \in \mathcal {R}^{C^\star \times H^\star \times W^\star }\) is fed into two \(1 \times 1\) convolution layers to produce latent styles \((\gamma , \beta )\). The content feature map \(\mathcal {A}_C \in \mathcal {R}^{C^\star \times H^\star \times W^\star }\) is produced from \(x_s\) by the content encoder \(F_C\), which is similar to the style encoder. We further require that styles be transferred between corresponding relative parts of the face; for instance, synthetic textures on the eyes should be transferred to the eye regions of the source image \(x_s\). Consequently, we introduce the Attentive Makeup Morphing (AMM) module (Jiang et al., 2020b) to morph the latent styles \((\gamma , \beta )\) for synthetic sample generation. AMM calculates an attentive matrix \(A \in \mathcal {R}^{H^{\star }W^{\star } \times H^{\star }W^{\star }}\) from \(\mathcal {A}_S\), \(\mathcal {A}_C\), and facial landmarks, where \(A_{i,j}\) indicates the attentive value between the i-th pixel in the source image \(x_s\) and the j-th pixel in the reference image \(x_r\). As a result, we get:

$$\begin{aligned} \begin{aligned} \gamma ^\prime _i = \sum _{j} A_{i,j} \gamma _j, \quad \beta ^\prime _i = \sum _{j} A_{i,j} \beta _j. \end{aligned} \end{aligned}$$
(4)

\(\gamma ^\prime \) and \(\beta ^\prime \) are duplicated and expanded along the channel dimension to produce \((\Gamma , \mathcal {B})\). Finally, we generate synthetic samples \(x_{syn}^T\) with the morphed latent styles \((\Gamma , \mathcal {B})\), the content encoder \(F_C\), and the decoder \(D_T\) (Choi et al., 2018), calculated by:

$$\begin{aligned} \begin{aligned} x_{syn}^T&= D_T(\Gamma \otimes \mathcal {A}_C + \mathcal {B}). \end{aligned} \end{aligned}$$
(5)
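A minimal sketch of the AMM-based style morphing and generation (Eqs. (4) and (5)) may clarify the data flow; the helper names and tensor shapes below are assumptions, and the computation of A from \(\mathcal {A}_S\), \(\mathcal {A}_C\), and facial landmarks is taken as given.

```python
import torch

def amm_morph(A, gamma, beta):
    """Morph latent styles with the attentive matrix A (Eq. 4).

    A:     (HW, HW) attention between source pixels i and reference pixels j.
    gamma: (HW, 1) per-pixel scale styles from the style encoder F_S.
    beta:  (HW, 1) per-pixel shift styles.
    """
    gamma_m = A @ gamma  # gamma'_i = sum_j A_ij * gamma_j
    beta_m = A @ beta    # beta'_i  = sum_j A_ij * beta_j
    return gamma_m, beta_m

def transfer_generate(decoder, A_C, Gamma, B):
    # Eq. (5): modulate the content feature map and decode x_syn^T.
    return decoder(Gamma * A_C + B)
```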

3.2 Discriminator

To discover universal synthetic artifacts, our discriminator \(D(\cdot ; \theta _D)\) has a frequency branch and a spatial branch, focusing on frequency and spatial features separately.

Spatial Branch. With reference to Rossler et al. (2019), we adopt Xception to capture the synthetic spatial textures. As shown in Fig. 2, the spatial extractor gets output feature maps \(\mathcal {A}_{sp} \in \mathcal {R}^{C^\prime \times H^\prime \times W^\prime }\) from the synthetic samples \(x_{syn}\). Considering the blend-based generator creates forgeries with a series of forgery masks, we expect to encourage our spatial branch to locate the manipulated regions for better generalization.

Thus, k forgery prototypes \(\{P_1, \cdots , P_k\}\) are proposed, corresponding to the forgery masks \(\{M_1, \cdots , M_k\}\). They are created from k prototype filters \(\{f_P^1, \cdots , f_P^k\}\), similar to those mask filters \(\{f_B^1, \cdots , f_B^k\}\):

$$\begin{aligned} \begin{aligned} P_i = \sigma (f_P^i \odot \mathcal {A}_{sp}), \, f_P^i \in \mathcal {R}^{1 \times 1 \times C^\prime }, \, \forall i = 1, \cdots , k. \end{aligned} \end{aligned}$$
(6)

Frequency Branch. The frequency branch explores frequency artifacts from both a global and local perspective. It consists mainly of the following modules:

  • The Global Extractor obtains the global feature map \(\mathcal {A}_{g}\) from the synthetic sample \(x_{syn}\) transformed by Discrete Cosine Transform (DCT) to deal with global textures.

  • The Low-rank Module (LRM) removes adversarial perturbations block by block for robustness via Low-rank Estimation algorithms; its rationale is explained in Sect. 3.4. Its input is split into blocks and then transformed by DCT to reduce computation, and its output is reverted to the same size as the input.

  • The Local Extractor pays more attention to local frequency textures. It derives the local feature map \(\mathcal {A}_{l}\) from the output of LRM.

  • The Fusion Module enables the collaboration between the global and local information with a convolution layer \(Conv(\cdot )\). Specifically, it is formulated as:

    $$\begin{aligned} \begin{aligned} M_c&= Conv(\mathcal {A}_g + \mathcal {A}_l), \\ \mathcal {A}_{lo}&= \mathcal {A}_l + M_c \otimes \mathcal {A}_g, \\ \mathcal {A}_{go}&= \mathcal {A}_g + M_c \otimes \mathcal {A}_l, \end{aligned} \end{aligned}$$
    (7)

    where \(\mathcal {A}_{go}\) and \(\mathcal {A}_{lo}\) are the final global and local feature maps.

The frequency feature maps \(\mathcal {A}_{go}\), \(\mathcal {A}_{lo}\), and the spatial feature map \(\mathcal {A}_{sp}\) are concatenated to get the final feature map, utilized in the prediction on forgery detection.
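For clarity, a minimal sketch of the Fusion Module in Eq. (7) is given below; the \(3 \times 3\) kernel size is an assumption, since the text only specifies a convolution layer.

```python
import torch.nn as nn

class FusionModule(nn.Module):
    """Sketch of the global/local collaboration in Eq. (7)."""

    def __init__(self, channels):
        super().__init__()
        self.conv = nn.Conv2d(channels, channels, kernel_size=3, padding=1)

    def forward(self, A_g, A_l):
        M_c = self.conv(A_g + A_l)  # shared attention map M_c
        A_lo = A_l + M_c * A_g      # local map enriched by global cues
        A_go = A_g + M_c * A_l      # global map enriched by local cues
        return A_go, A_lo
```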

3.3 Training Objectives

Classification Loss \(\mathcal {L}_{cls}\). Following previous works (Qian et al., 2020; Shiohara & Yamasaki, 2022), we use the binary cross-entropy loss to compute \(\mathcal {L}_{cls}\):

$$\begin{aligned} \mathcal {L}_{cls} = - \frac{1}{N}\sum _{i=0}^{N-1} \{ y_i {\text {log}} f(x_i) + (1-y_i){\text {log}}(1-f(x_i)) \} , \end{aligned}$$
(8)

where \(y_i\) is the corresponding ground truth label, and \(f(x_i)\) denotes the probability predicted by our proposed model for the input sample \(x_i\).

Forgery Similarity Loss \(\mathcal {L}_{sim}\). The forgery masks and forgery prototypes are fed into a fully connected (FC) layer to adaptively get the importance \(w_i\) of each forgery mask \(M_i\) and the corresponding forgery prototype \(P_i\):

$$\begin{aligned} \begin{aligned} w_i = FC(concat(M_i, P_i)), \forall i = 1, \cdots , k. \end{aligned} \end{aligned}$$
(9)

Note that the transfer-based synthetic sample \(x_{syn}^T\) can be viewed as an entire face synthesis, so we fix its \(w_i\) to \(\frac{1}{k}\) and \(M_i\) to \(\textbf{1}\). The forgery similarity loss \(\mathcal {L}_{sim}\) is then defined as:

$$\begin{aligned} \begin{aligned} \mathcal {L}_{sim} = \sum _{i = 1}^{k} w_i ||M_i - P_i||_1. \end{aligned} \end{aligned}$$
(10)

Diversity Loss \(\mathcal {L}_{div}\). Without further constraints, all forgery masks are likely to cluster in the same region and thus generate homogeneous synthetic samples. To avoid this, we propose a diversity loss inspired by Liu et al. (2019), written as:

$$\begin{aligned} \begin{aligned} \mathcal {L}_{div} = \sum _{i = 1}^{k} \sum _{j = 1}^{k} cos(f^i_B, f^j_B), \end{aligned} \end{aligned}$$
(11)

where \(cos(\cdot , \cdot )\) denotes the cosine similarity.

Finally, the optimization process can be formulated as:

$$\begin{aligned} \begin{aligned}&\min _{\theta _D} \max _{\theta _B, \theta _T} \mathcal {L}(\theta _B, \theta _T, \theta _D), \\ s.t. \, \mathcal {L}(\theta _B,&\theta _T, \theta _D) = \mathcal {L}_{cls} + \eta \mathcal {L}_{sim} + \lambda \mathcal {L}_{div}, \end{aligned} \end{aligned}$$
(12)

where \(\eta \) and \( \lambda \) are hyper-parameters.
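The following PyTorch sketch assembles Eqs. (8), (10), and (11) into the objective of Eq. (12); the tensor shapes are assumptions. In the min-max game, \(\theta _D\) descends this loss while \(\theta _B\) and \(\theta _T\) ascend it, e.g., via alternating updates with the loss negated for the generators.

```python
import torch.nn.functional as F

def fd_gan_loss(logits, labels, masks, prototypes, weights, mask_filters,
                eta=0.1, lam=0.02):
    """Total loss of Eq. (12).

    masks, prototypes: (B, k, H, W); weights: (B, k) from the FC layer in
    Eq. (9); mask_filters: (k, C) flattened 1x1 kernels of G_B.
    """
    # Eq. (8): binary cross-entropy on the real/fake prediction.
    l_cls = F.binary_cross_entropy_with_logits(logits, labels.float())
    # Eq. (10): importance-weighted L1 distance between masks and prototypes.
    l_sim = (weights * (masks - prototypes).abs().sum(dim=(2, 3))).sum(1).mean()
    # Eq. (11): pairwise cosine similarity among the mask filters.
    f = F.normalize(mask_filters, dim=1)
    l_div = (f @ f.t()).sum()
    return l_cls + eta * l_sim + lam * l_div
```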

3.4 Discussion

In this part, we discuss our design on data augmentation and adversarial robustness in detail, suggesting our FD-GAN’s effectiveness with theoretical analyses.

Data Augmentation. Several previous works (Shiohara & Yamasaki, 2022; Li et al., 2020a) utilize data augmentation to improve generalization, but most depend on intuitive ideas. SLADD (Chen et al., 2022) shows the superior performance of generative models like GANs (Goodfellow et al., 2014a) for augmentation in forgery detection. In this part, we give helpful hints about data augmentation, with a detailed theoretical analysis in Appendix A, which expound on how to make data augmentation effective and why generative models like GANs work well in forgery detection.

We first provide some basic settings for the following analysis. For convenience, we treat real samples as the target class, since forgery detection is a binary classification problem. Suppose the real samples in the training set are \(X = \{x_1, \cdots , x_{T_d}\}\), where \(T_d\) is the size of X. They are independent and identically distributed because they are all real samples, following a probability density function (PDF) \(p_d\). Similarly, synthetic samples \(Y = \{y_1, \cdots , y_{T_r}\}\) of size \(T_r\) follow another PDF \(p_r\). Clearly, \(p_r\) depends on both the data augmentation strategies and the source data. Here we are mainly concerned with the influence of various strategies rather than the source data, since the latter is not the focus of our paper.

Our theoretical analysis suggests the following:

  • Suggestion 1: Make the number of synthetic samples \(T_r\) as large as possible, which is universal among most various augmentation strategies in forgery detection.

  • Suggestion 2: Generative models like GANs can effectively create diverse synthetic samples to complement the original training set X in forgery detection.

Because our theoretical analysis relies only on logistic regression, which underlies the final classification layer of most detectors, it is convenient to generalize our discovery across various forgery detector designs. Based on Suggestion 2, we develop our detector (FD-GAN) in a GAN-based manner, incorporating both a blend-based generator and a transfer-based generator to create diverse synthetic samples for better generalization, as demonstrated by the results in Sect. 4.3. Besides, Appendix A also supports our Suggestion 1.

Adversarial Robustness. In general, adversarial examples can be regarded as a combination of natural semantic information and adversarial perturbations. A natural thought arises: we can protect forgery detectors from adversarial attacks if the adversarial perturbations are filtered out. However, this process is usually lossy to the inputs’ quality and may degrade the performance of detectors. To tackle this problem, Low-rank Estimation (Zhuo et al., 2021; Li et al., 2018b; Zhang et al., 2019) provides a well-established theory and useful algorithms for recovering data matrices from noisy observations when the original data matrices have ideal properties (e.g., sparse singular values). As stated in Awasthi et al. (2020), natural images often have sparse singular values, and forgeries hold similar properties. This implies the validity of using Low-rank Estimation to remove adversarial perturbations while keeping the natural semantic information. Since DCT usually makes matrices sparse while leaving their rank unchanged, we place the LRM in the frequency branch. Concretely, it aims to solve the following problem:

$$\begin{aligned} \begin{aligned} \min _{\hat{R}} rank(\hat{R}), \quad s.t., \forall i,j, \, \hat{R}_{ij} \approx X_{ij}, \end{aligned} \end{aligned}$$
(13)

where X, \(\hat{R}\), and R represent the input with adversarial noise, the estimator, and the true data matrix, respectively. Moreover, the recovery of the true data matrix under the low-rank property can be theoretically guaranteed (please refer to Appendix A). This suggests that our LRM can retain the semantic information while removing the annoying adversarial perturbations. Similar to other pre-processing-based adversarial defenses (Dziugaite et al., 2016; Xu et al., 2017; Ding et al., 2019), LRM may slightly lower FD-GAN’s performance on normal images but significantly improves adversarial robustness, as shown by our experiments in Sect. 4.5.
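As a concrete illustration, the core of the LRM can be viewed as singular value thresholding on each DCT block; the sketch below follows the spirit of USVT (Chatterjee, 2012), which we adopt in Sect. 4.1, though the exact threshold form here is an assumption.

```python
import torch

def usvt_denoise(X, tau_scale=1.0):
    """Low-rank estimate of a noisy DCT block X of shape (m, n)."""
    U, S, Vh = torch.linalg.svd(X, full_matrices=False)
    # Discard small singular values, which mostly carry (adversarial) noise.
    tau = tau_scale * S.max() / (S.numel() ** 0.5)
    S = torch.where(S >= tau, S, torch.zeros_like(S))
    return U @ torch.diag(S) @ Vh
```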

Table 1 In-dataset evaluation results. Quantitative results (ACC (%) and AUC (%)) on FF++ are displayed for the Raw, HQ, and LQ versions, respectively. The best results are in bold

4 Experiment

4.1 Experimental Setup

Inputs. We use RetinaFace (Deng et al., 2020) for face extraction and DLIB (Sagonas et al., 2016) for facial landmark detection. All faces are aligned and resized in the training and testing datasets.

Discriminator. We adopt Xception (Rossler et al., 2019) as the backbone of Spatial Extractor, Global Extractor, and Local Extractor. Besides, LRM applies USVT (Chatterjee, 2012) for Low-rank Estimation.

Generator. We modify Xception (Chollet, 2017) as the backbone of the blend-based generator, which is initialized with Xception pre-trained on ImageNet (Deng et al., 2009). The transfer-based generator’s architecture follows Choi et al. (2018), as stated in Sect. 3.1.

Optimization. The hyper-parameters in the final loss function are \(\eta = 0.1\) and \(\lambda = 0.02\). Besides, we use the Adam optimizer (Kingma & Ba, 2014) for both the generator and the discriminator with \(\beta _1 = 0.9\) and \(\beta _2 = 0.999\). The batch size is fixed to 32, and the learning rates of the discriminator and generator are set to \(1 \times 10^{-4}\) and \(3 \times 10^{-5}\), respectively.
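A minimal sketch of this optimization setup is shown below; the module names (`discriminator`, `gen_blend`, `gen_transfer`) are placeholders for the components defined in Sect. 3.

```python
import torch

opt_d = torch.optim.Adam(discriminator.parameters(),
                         lr=1e-4, betas=(0.9, 0.999))
opt_g = torch.optim.Adam(list(gen_blend.parameters()) +
                         list(gen_transfer.parameters()),
                         lr=3e-5, betas=(0.9, 0.999))
```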

Training Datasets. Following recent deepfake detection methods (Li et al., 2021b; Wang & Deng, 2021; Luo et al., 2021; Chen et al., 2022; Liu et al., 2021; Qian et al., 2020; Li et al., 2020a), we train our model mainly on the FaceForensics++ (FF++) dataset (Rossler et al., 2019), which consists of 1K real videos and 4K synthetic videos. Deepfakes (DF), FaceSwap (FS), Face2Face (F2F) (Thies et al., 2016), and Neural-Textures (NT) (Thies et al., 2019) are applied to generate synthetic videos at various compression levels, including RAW, High Quality (HQ), and Low Quality (LQ). We adopt the HQ version in our experiments by default unless otherwise specified.

Testing Datasets. To evaluate the generalizability of our method, we perform experiments on the following datasets: (1) CelebDF-v2 (CDF) (Li et al., 2020b) contains 518 test videos created by the improved deepfake technology. (2) Deepfake Detection Challenge Preview Dataset (DFDC) (Dolhansky et al., 2020) includes over 1K real videos and over 4K synthetic videos manipulated by multiple methods. (3) Deepfake Detection Dataset (DFD) contributes over 300 real videos and over 3K fake videos to support deepfake detection efforts. (4) DeeperForensics (DFo) (Jiang et al., 2020a) mainly consists of forged videos created by DF-VAE (Jiang et al., 2020a). As for the adversarial robustness evaluation, we follow (Jia et al., 2022) to choose 560 (\(140 \times 4\)) frames from synthetic videos in FF++ test dataset.

Table 2 Cross-dataset evaluation results. Quantitative results (AUC (%)) on CDF, DFDC, DFD, and DFo are displayed
Table 3 Robustness against adversarial attacks. ASR (%) of adversarial attacks on various face forgery detection methods is shown
Table 4 Robustness against common corruption. Average AUC scores (%) of methods for each corruption in Jiang et al. (2020a) across five intensity levels are shown

Baselines. We mainly compare our methods with various augment-based methods (Face X-ray (Li et al., 2020a), SBI (Shiohara & Yamasaki, 2022), and SLADD (Chen et al., 2022)) and frequency-based forgery detection methods (F3Net (Qian et al., 2020), SPSL (Liu et al., 2021), and FDFL (Li et al., 2021b)). Some state-of-the-art methods are also selected for comparisons, such as Two Branch (Masi et al., 2020), MADD (Zhao et al., 2021), FTCN (Zheng et al., 2021), RealForensics (Haliassos et al., 2022), LipForensics (Haliassos et al., 2021), and ICT (Dong et al., 2022). Several popular baselines are considered, like Xception (Rossler et al., 2019), MesoNet (Afchar et al., 2018), Patch-based (Chai et al., 2020), CNN-GRU (Sabir et al., 2019), CNN-aug (Wang et al., 2020), Capsule (Nguyen et al., 2019b), Multi-task (Nguyen et al., 2019a), and DSP-FWA (Li & Lyu, 2018). Limited by computational cost and few official code implementations, some results are unavailable and represented by “-". In the adversarial robustness evaluation, we adopt spatial attacks (FGSM (Goodfellow et al., 2014b), PGD (Madry et al., 2017), MIM (Dong et al., 2018), DIM (Xie et al., 2019), and TIM (Dong et al., 2019)) and frequency attacks (FreqAttack (Jia et al., 2022) and SSAH (Luo et al., 2022)). The hyper-parameters follow the defaults in Jia et al. (2022).

Evaluation Metrics. Following the previous works (Li et al., 2021b; Wang & Deng, 2021; Luo et al., 2021; Chen et al., 2022; Liu et al., 2021; Qian et al., 2020; Li et al., 2020a), we mainly report the accuracy (ACC) and the Area Under the receiver operating characteristic Curve (AUC) for the evaluation on forgery detection. Besides, we choose the Attack Success Rate (ASR) for adversarial robustness evaluation based on Jia et al. (2022).
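For reference, a minimal sketch of these metrics is given below; the exact ASR formula of Jia et al. (2022) is not reproduced here, so the convention used (fraction of correctly classified samples flipped by the attack) is an assumption.

```python
import numpy as np
from sklearn.metrics import accuracy_score, roc_auc_score

def evaluate(scores, labels, clean_preds=None, attacked_preds=None):
    """ACC/AUC for detection; ASR for adversarial robustness (assumed form)."""
    scores, labels = np.asarray(scores), np.asarray(labels)
    acc = accuracy_score(labels, (scores > 0.5).astype(int))
    auc = roc_auc_score(labels, scores)
    asr = None
    if attacked_preds is not None:
        ok = np.asarray(clean_preds) == labels          # correct before attack
        asr = np.mean(np.asarray(attacked_preds)[ok] != labels[ok])
    return acc, auc, asr
```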

4.2 In-Dataset Evaluation

In this part, we first compare our method with baselines on different face manipulation methods in FF++, including the RAW, HQ, and LQ versions. The results are shown in Table 1. Obviously, our method outperforms the other baselines in AUC on the LQ dataset, and its performance on the RAW and HQ datasets is comparable (close to 100%). The performance improvement mainly benefits from the spatial and frequency information extracted by our method. In summary, these experiments show the success of our method against previous methods in in-dataset evaluations, and we illustrate the transferability of our method in the following.

4.3 Cross-Dataset Evaluation

In this section, we train our method on FF++ with multiple forgeries while evaluating it on other benchmarks, such as CDF, DFDC, DFo, and DFD. Since the synthetic samples in these benchmarks are generated with unseen techniques, this cross-dataset setting is more challenging than the in-dataset setting. Table 2 shows the AUC comparison with baseline methods for face forgery detection. Our method still achieves the state-of-the-art AUC in most cases, although it is on par with RealForensics (Haliassos et al., 2022) and ICT (Dong et al., 2022) on CDF, which exploit extra temporal information. These results illustrate the advantage of our proposed method in generalizing across different datasets, which mainly benefits from the appreciable data augmentation and the spatial-frequency components. A detailed analysis of our method is available in Sect. 4.5 to understand the components responsible for its excellent performance.

Table 5 Framework ablation. AUC (%) on CDF and DFD, and ASR (%) on selected adversarial images after training on FF++ are shown. “Frequency Branch" and “Spatial Branch" represent the corresponding part of our model. “w/o LRM” and “w/o Prototype” represent the performance without the LRM and forgery prototypes, respectively. All modules are defined in Sect. 3.2

4.4 Robustness Evaluation

In general, ideal detectors should be robust against common and adversarial corruption in addition to generalizing well across various datasets. Thus, we evaluate our model and several baselines against adversarial attacks and common corruption to assess their robustness.

Robustness against Adversarial Attacks. Table 3 reports the ASR on FF++ and illustrates that our method outperforms other forgery detection methods by a large margin. For instance, PGD achieves 66.1% ASR against CNN-aug (Wang et al., 2020) while only achieving 58.9% ASR against our method. Besides, CNN-aug (Wang et al., 2020), which is augmented with blur and compression, is more robust than the other baselines, hinting that some augmentations may help protect models from adversarial corruption.

Robustness against Common Corruption. Following Haliassos et al. (2021, 2022), we train our FD-GAN on FF++. The distortions are the same as those in Jiang et al. (2020a), including changes in saturation and contrast, Gaussian blur and noise, compression at both the video and image levels, and local block-wise distortion. Five intensity levels are applied for each type, and the average AUC across all intensity levels is shown in Table 4. As expected, our method outperforms the baselines in most cases, even compared to the popular RealForensics (Haliassos et al., 2022) and LipForensics (Haliassos et al., 2021). Besides, we report the AUC at each intensity level separately for several common corruptions. In Fig. 5, we observe that our FD-GAN performs best in almost all cases, especially against severe corruption. The superior performance of our FD-GAN against common corruption is credited to our generators (adaptive data augmentation) and the LRM (noise removal).

4.5 Ablation Studies

The ablation studies in this section determine how different factors contribute to our method’s performance.

Table 6 Data augmentation ablation. AUC (%) on CDF and DFD after training on FF++ is shown. “Blend", “Blend (fixed)", and “Transfer" represent augmentations from the blend-based generator \(G_B\), another fixed blending strategy (Li et al., 2020a), and the transfer-based generator \(G_T\), respectively

Framework Ablations. In Table 5, we ablate the components of our method and check the generalization of the detector on CDF and DFD (trained on FF++). First, training our detector with only the frequency or only the spatial branch leads to a significant drop in performance (about 3.3% on average). Second, the LRM slightly degrades the performance of our detector (up to a 0.5% drop) while yielding much better robustness (reducing the ASR by over 30%). More results of the LRM on adversarial robustness are reported below. Finally, we observe remarkable improvements (about 4.4% on average) from the forgery prototypes.

Data Augmentations. Evaluation results are shown in Table 6. We observe that with only one kind of augmentation, the performance of our model degrades, especially with only the transfer-based augmentation (by about 3.85% on average). The main reason is that a limited choice of synthetic samples is likely to suppress the generalization of our model. Moreover, the fixed blend-based augmentation is less effective than our blend-based generator, since fixed augmentations cannot adapt to the evolving forgery detector during training.

Fig. 4

Defense ablation. ACC (%) on clean samples and average ASR (%) under various adversarial attacks (the lower, the better) are shown. “w/o LRM” means our method’s performance without the LRM, while “JPEG”, “Bit-Reduction”, and “Smooth” represent the performance of our model with the corresponding defense

Table 7 AUC (%) on CDF and DFD after training on FF++. “Blend” and “Transfer” denote the blend-based and transfer-based generators. “w AMM" and “w/o AMM" indicate the transfer-based generator is with or without AMM
Fig. 5

AUC under common corruption at various intensity levels. The results of RealForensics and LipForensics are from their original papers

Defense Techniques. Here, we compare with other defense methods based on image processing techniques, including JPEG (Dziugaite et al., 2016), Bit-depth Reduction (Xu et al., 2017), and Smooth (Ding et al., 2019). As a baseline, we also provide the performance of our detector against perturbations without the LRM. In Fig. 4, every defense method benefits our detector’s robustness, but our LRM gains superior improvement compared with the others. Besides, all adversarial defenses listed in Fig. 4 lead to a decrease in performance on clean samples, and LRM shows the least performance penalty, which suggests its effectiveness.

Effect of Attentive Makeup Morphing. We perform experiments on Attentive Makeup Morphing (AMM), shown in Table 7. The performance drop without AMM suggests that AMM helps the transfer-based generator for better augmentation.

Sensitivity Analysis on Hyper-parameters of the Final Loss Function in Eq. (12). Sensitivity experiments on \(\eta \) and \(\lambda \) are displayed in Fig. 6. The results verify the stability of our FD-GAN, since both coefficients are robust over a large range.

Fig. 6

Sensitivity analysis on hyper-parameters in loss function. We test detectors on CDF and DFD after training on FF++

Fig. 7

Performance with various sizes of synthetic samples. AUC (%) of our FD-GAN on CDF (Li et al., 2020b) and DFD is shown when the number of synthetic samples varies. Note that the ratio of original samples to synthetic samples is fixed. “\(\times 0.5\)" means the applied training dataset contains only half of the synthetic samples compared with the normal training process (“\(\times 1.0\)"), and so on

Effect of the number of synthetic samples. To verify our Suggestion 1, we conduct additional experiments. In Fig. 7, we observe that the performance of our method increases with a larger set of augmented data. However, the improvement becomes modest when the number of synthetic samples is very large, suggesting the limited benefit of much larger data augmentation.

Effect of the number of forgery prototypes. The forgery masks and prototypes are designed to locate the manipulated regions, leading to flexible data augmentation strategies and better generalization across datasets. In this part, we focus on the effect of the number of forgery masks and prototypes (i.e., k). As displayed in Fig. 8, more masks and prototypes (\(k \le 8\)) help to improve the generalization of our model, while too many (\(k > 8\)) may cause confusion among different parts of the manipulated regions. Therefore, we select \(k = 8\) as our default setting.

Robustness against more threatening attacks. Table 8 displays the results of our FD-GAN and several baselines against more threatening attacks. Although the performance of our FD-GAN against these threatening attacks degrades by a large margin, it still outperforms other forgery detectors, suggesting its robustness.

Table 8 Attack success rate (ASR) of several adversarial attacks against forgery detectors
Table 9 The accuracy of CNN-aug (Wang et al., 2020) on synthetic samples created by various generative models

“Fakeness" score of FD-GAN’s synthetic samples. Following CNN-aug Wang et al. (2020), we list the performance of forgery detectors (“Blur + JPEG (0.5)" (Wang et al., 2020) and “Blur + JPEG (0.1)" Wang et al. (2020)) on synthetic samples generated by our FD-GAN and baselines. The results in Table 9 illustrate that our generator’s synthetic samples can fool detectors better than others in all cases. It indicates our detector enjoys more challenging samples for better generalization.

Fig. 8

Performance with various sizes of forgery masks and prototypes. AUC (%) of our FD-GAN on CDF (Li et al., 2020b) and DFD is shown when the number of forgery masks and prototypes k varies

4.6 Qualitative Results

Saliency map visualization. To gain better insights into both generalization and robustness, we compare the behavior of our forgery prototypes with some baselines. Figure 9 visualizes several examples with Grad-CAM (Selvaraju et al., 2019). Clearly, our FD-GAN covers the manipulated regions more completely in most cases, since our model enjoys a large variety of synthetic samples, whereas the baselines rely on a limited choice of training samples.

Fig. 9

Saliency map visualization. The baselines (F3Net (Qian et al., 2020) and SBI (Shiohara & Yamasaki, 2022)) capture method-specific artifacts while failing to detect the complete manipulated regions. Our FD-GAN’s attention covers most of the manipulated regions. However, without the help of the frequency branch (“w/o Frequency Branch") or the spatial branch (“w/o Spatial Branch"), our method tends to locate method-specific artifacts like the baselines

Fig. 10

Manipulated regions, predicted masks, part of forgery masks and the corresponding forgery prototypes

Manipulated Regions and Predicted Masks. Visualizations of manipulated regions and predicted masks are shown in Fig. 10. The predicted masks cover most manipulated regions. Moreover, the forgery prototypes describe the corresponding forgery masks accurately. This demonstrates that the discriminator can not only judge whether the input is synthetic, but also predict the manipulated regions.

Synthetic samples. Synthetic samples created by our FD-GAN are shown in Fig. 11. Our synthetic samples are of high quality, although our paper does not focus on face forgery generation, and the manipulated regions vary with the corresponding source and reference images.

Fig. 11

Synthetic samples. a, b, c, and d represent the source images, reference images, blend-based synthetic samples, and transfer-based synthetic samples, respectively

Forgery prototypes. In Fig. 12, we can observe explicit semantic correspondences between the same forgery prototypes, which demonstrates the effectiveness of our forgery prototypes. After training on plenty of synthetic samples, each forgery prototype captures a specific semantic pattern, so that the prediction is more robust to corruption and captures universal synthetic artifacts.

Fig. 12

Visualization of our forgery prototypes. We take five prototypes of forgeries created by NT (Thies et al., 2019) and DF-VAE (Jiang et al., 2020a) as examples. Obviously, we can see that each forgery prototype focuses on a specific manipulated part

5 Conclusion

In this paper, we propose a forgery detection generative adversarial network (FD-GAN) with two generators (a blend-based and a transfer-based generator) and a discriminator (i.e., the detector), which generalizes well to unseen scenarios and remains robust against adversarial and common corruption. Specifically, the two generators adaptively create challenging synthetic images with flexible strategies to improve generalization. Besides, we design the discriminator to judge whether the input is synthetic and to predict the manipulated regions through a collaboration of spatial and frequency branches. Further, we propose a Low-rank Module in the frequency branch to remove adversarial corruption from the input for robustness improvement. We also provide guidelines on data augmentation for improving generalization, together with mathematical support for robustness. In experiments, FD-GAN exhibits superior generalization and robustness compared with state-of-the-art methods.