1 Introduction

Deepfakes, videos manipulated by deep neural networks [1, 2], have led to a crisis of social trust and posed a significant threat to social stability [3, 4]. In response to the growing concerns surrounding deepfakes, extensive efforts have been undertaken to distinguish deepfake content from unaltered videos [5,6,7,8,9]. Most existing Deepfake detection methods [10,11,12] employ Convolutional Neural Networks (CNNs) to extract local information. However, relying solely on local information can make the model more susceptible to dataset-specific features, resulting in limited generalization performance [13, 14]. Meanwhile, the Vision Transformer (ViT) [15, 16], with its self-attention mechanism, demonstrates strong capabilities in capturing global information. Some studies [7, 17] employ the ViT model for Deepfake detection because of this ability, but ViT tends to overlook local details that are crucial for Deepfake detection. Because local and global features are highly complementary, it is essential to integrate the strengths of both CNN and ViT: this enables effective capture of local features while also considering global features, leading to improved overall performance [6, 18]. However, a straightforward fusion of the two frameworks may cause the algorithm to fit the training data too closely, yielding poor generalization on unseen datasets. To address this issue, we propose a hybrid network, the Frequency-based Local and Global (FLAG) network, which effectively combines CNN and ViT models to achieve improved detection performance not only on the in-dataset but also on unseen datasets.

To enhance the aggregation of CNN and ViT, we propose a Frequency-based Attention Enhancement Module (FAEM), which is specifically designed to improve the model’s generalization performance. During face manipulation via deep neural networks, imperfect generative models introduce visual artifacts in the spatial domain. Several detection methods [10, 11] identify forged videos based on these spatial-domain artifacts, and significant results have been achieved by analyzing specific indicators such as visual color discrepancies [12] and inconsistent head poses [19]. However, these spatial-domain algorithms [10, 19, 20] become fragile when the visual quality of manipulated faces is degraded by common post-processing attacks (e.g., compression and blur) [21]. To deal with highly realistic manipulated images, frequency-domain clues have been utilized as generalized features to expose forged videos, either by directly using frequency-domain coefficients as clues or by employing them as spatial attention weights for spatial features [22,23,24,25,26]. We argue that simply feeding the network with frequency features makes the network highly reliant on details that may disappear during compression or other processing, resulting in a drop in performance. Thus, we propose using channel attention instead of spatial attention to emphasize the discriminability among channels. Compared to spatial attention, channel attention can extract correlations and assess the importance of the different channels of a feature map [27, 28].

In this paper, we propose a hybrid network that combines CNN and ViT, connected by FAEM to improve generalization. FAEM is a frequency-based attention enhancement module built on several representative frequency coefficients. More specifically, we select four Discrete Cosine Transform (DCT) coefficients, combining Alternating Current (AC) and Direct Current (DC) coefficients, as clues for constructing the channel attention mechanism. These coefficients provide improved stability and are less susceptible to quality distortion after potential image attacks [29]. Essentially, the combination of AC and DC coefficients is used to obtain frequency-domain channel attention, which is then employed to re-weight mid-level features [10]. To provide a clearer understanding of FAEM and its impact on challenging manipulated faces, we present Gradient-weighted Class Activation Mapping (Grad-CAM) [30] visualizations of the module’s effects in Fig. 1 for Deepfakes (DF) [21] and NeuralTextures (NT) [31]. As shown in Fig. 1, our proposed FAEM effectively focuses on manipulated regions, especially subtly manipulated parts (e.g., face and mouth) in DF and NT.

Fig. 1

Grad-CAM [30] visualization of FAEM on two kinds of challenging manipulated faces, DF [21] and NT [31]. Columns (a) and (c) show features without FAEM; columns (b) and (d) show features with FAEM

The contributions of this paper are summarized as follows.

(1) We propose a novel Deepfake detection network, the Frequency-based Local and Global (FLAG) network, which integrates CNN and ViT through FAEM-enhanced local features to improve generalization performance.

(2) We design a novel Frequency-based Attention Enhancement Module (FAEM) that strengthens the correlations among feature channels and enhances the generalization of the method via several frequency coefficients, in particular three representative AC coefficients and one DC coefficient.

(3) Experimental results demonstrate that the proposed method achieves significant and robust performance on the in-dataset and effective generalization capability on cross-datasets. Additionally, the proposed method exhibits strong robustness against various image attacks.

2 Related work

2.1 Deepfake detection

Early methods in the field of forgery detection, such as [32,33,34,35], rely on intrinsic statistics or hand-crafted features to model spatial manipulation patterns. Matern et al. [12] propose a method for detecting deepfake videos by leveraging artifacts present in the images, focusing on characteristics such as eye color [8] and missing details in the eye and teeth areas. With the rapid advancement of deep learning, however, several studies have focused on developing deep learning-based detectors that differentiate manipulated images from real ones by extracting spatial features.

In recent times, there has been a surge in the development of deep learning-based detection methods, which have consistently delivered impressive results. Several Deepfake detection methods that learn spatial features with deep networks have been proposed. Afchar et al. [10] present a method that utilizes MesoNet for capturing mesoscopic features in the context of deepfake detection. Rössler et al. [11] propose a Deepfake detection method based on XceptionNet, achieving satisfactory results on the FF++ dataset. Li et al. [36] propose a novel spatial image representation called Face-X-ray, trained with a self-supervised algorithm on a large dataset of mixed images synthesized from real images. The Face-X-ray approach achieves high detection performance on high-quality videos and provides interpretable boundaries for face swapping, but it may suffer a performance drop on low-resolution images. Similarly, Zhao et al. [37] propose a multi-attention detection model to capture subtle forgery traces from spatial features. These spatial-based methods [11, 36,37,38], however, are fragile when the quality of a manipulated face is degraded by image processing operations. To counter this weakness against quality degradation, our method not only learns spatial artifacts but also builds channel-enhanced attention based on frequency-domain coefficients.

In addition, Frank et al. [39] observe that forged images generated by Generative Adversarial Networks (GANs) [2] show particular artifacts in the frequency domain introduced by the essential up-sampling operation, and it has been demonstrated that frequency features offer robust generalization for detecting unseen deepfakes. F3-Net [24] transforms images into the frequency domain and employs two modules to capture global and local frequency cues, respectively. SPSL [40] combines spatial image features and phase spectrum information to effectively capture the up-sampling artifacts commonly found in face forgery images. Kohli et al. [22] convert RGB images into the DCT domain for Deepfake detection. Chen et al. [26] introduce an attention module designed for multi-scale feature fusion, integrating RGB and frequency-domain information across various network levels. Luo et al. [25] model the correlation and interaction between the high-frequency modality and the regular modality for detection. In this paper, considering the benefits of channel attention and the significance of frequency-domain information in Deepfake detection tasks, we design a channel attention module that specifically leverages frequency-domain information.

2.2 Vision transformer

The Transformer has found extensive application in natural language processing (NLP) tasks [41, 42], obtaining impressive performance by effectively modeling long-range dependencies. The ViT [15], as a variant of the Transformer, has been successfully adapted for various computer vision tasks such as object recognition [43, 44], scene classification [45], and face recognition [46]. By dividing an image into a sequence of image patches and leveraging its built-in attention mechanism, ViT excels at capturing global information, thus offering notable advantages in capturing global features.

For Deepfake detection tasks, a number of detection algorithms based on ViT have been proposed, yielding remarkable performance. In [17], high-level convolutional features are extracted using a CNN model and then directly fed into ViT for classification. In [18], two CNN models extract feature maps of different sizes; these feature maps are input into a ViT network, which produces two predicted values whose sum gives the final prediction. Wang et al. [14] propose a Transformer-based framework that selects the more valuable blocks for Deepfake detection through a designed attention module. M2TR [7] designs multi-scale Transformer blocks and frequency-domain features to detect local forgery clues. HFI-Net [6] devises a network structure that combines CNN and ViT, utilizing mid-to-high-frequency information for Deepfake detection. To account for both local and global information in the feature maps, we propose a joint network architecture that incorporates the enhanced local and global features, promoting better model convergence [47].

Fig. 2

The overall framework of our proposed method. The extracted middle-level features \(X^m\) are re-weighted by the FAEM module, and the enhanced features \(X^m_{fre}\) are then input into the ViT module to obtain global information. \(\oplus \) and \(\otimes \) denote element-wise sum and channel-wise product

3 Method

3.1 The proposed frequency-based local and global network

CNN captures local features effectively, leading to superior detection performance on the same manipulation method. However, its limited receptive field hampers its performance on unseen datasets. On the other hand, ViT extracts global features but may overlook subtle clues crucial for Deepfake detection. Local and global features are thus strongly complementary, and simply splicing the CNN and ViT networks already improves detection performance on the intra-dataset significantly. However, such naive network splicing tends to over-fit the training dataset, limiting generalization. To address this, we propose a new hybrid network, the FLAG network, which utilizes FAEM to aggregate the CNN and ViT models. In this network, the local features extracted by the CNN are enhanced by FAEM, facilitating the learning of generalizable global features by the ViT. The FLAG network enables better complementarity between local and global features, resulting in improved detection performance not only on the intra-dataset but also on unseen datasets.

The proposed FLAG network is illustrated in Fig. 2. In this hybrid network, we employ convolutional networks to extract local features from the input image \({X} \in {R^{3 \times W \times H}}\), resulting in middle-level features \(X^m \in {R^{C \times M \times N}}\) of the CNN architecture. FAEM, which utilizes selected robust and generalizable frequency coefficients, is used to enhance the manipulation information within these local features. Afterwards, the enhanced local feature maps \(X_{fre}^m \in {R^{C \times M \times N}}\) are flattened and their channel size is adjusted using a 1 \(\times \) 1 convolution to meet the requirements of the ViT’s transformer module. A class token and position embeddings are added, and the number of transformer blocks is modified accordingly. The resulting tokens are then fed into the transformer blocks and an MLP (Multi-Layer Perceptron) head to extract global features for classification; this design also facilitates faster convergence to some extent [47]. As a result, the network extracts both local and global features.
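For concreteness, the following is a minimal PyTorch sketch of this pipeline, under several assumptions: `cnn_stem` stands for the truncated EfficientNet-b4 backbone, `faem` is the module of Sect. 3.2, and the default channel, token, and depth settings (160 channels, 14 × 14 tokens, 12 blocks) are illustrative placeholders rather than the exact configuration used in the paper.

```python
import torch
import torch.nn as nn

class FLAG(nn.Module):
    """Sketch of the FLAG data flow: CNN features -> FAEM -> 1x1 conv -> ViT blocks -> MLP head.

    `cnn_stem` is any module mapping a 3xHxW image to mid-level features of shape
    (B, C, M, N); `faem` is the frequency-based attention module of Sect. 3.2.
    All sizes below are illustrative assumptions, not the exact configuration.
    """

    def __init__(self, cnn_stem, faem, in_channels=160, embed_dim=768,
                 depth=12, num_heads=12, num_tokens=14 * 14, num_classes=2):
        super().__init__()
        self.cnn_stem = cnn_stem          # e.g. EfficientNet-b4 truncated at a middle stage
        self.faem = faem                  # frequency-based attention enhancement module
        self.proj = nn.Conv2d(in_channels, embed_dim, kernel_size=1)  # 1x1 conv to match ViT width
        self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, num_tokens + 1, embed_dim))
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=embed_dim, nhead=num_heads, dim_feedforward=4 * embed_dim,
            batch_first=True, norm_first=True)
        self.blocks = nn.TransformerEncoder(encoder_layer, num_layers=depth)
        self.head = nn.Sequential(nn.LayerNorm(embed_dim), nn.Linear(embed_dim, num_classes))

    def forward(self, x):                         # x: (B, 3, H, W)
        xm = self.cnn_stem(x)                     # local mid-level features (B, C, M, N)
        xm_fre = self.faem(xm)                    # channel-enhanced features (B, C, M, N)
        tokens = self.proj(xm_fre).flatten(2).transpose(1, 2)   # (B, M*N, embed_dim)
        cls = self.cls_token.expand(tokens.size(0), -1, -1)
        tokens = torch.cat([cls, tokens], dim=1) + self.pos_embed
        tokens = self.blocks(tokens)              # global reasoning with self-attention
        return self.head(tokens[:, 0])            # classify from the class token
```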

3.2 Frequency-based attention enhancement module

Previous algorithms [25, 26, 40] have shown better generalization by incorporating frequency-domain information. However, in these approaches, frequency-domain features are often extracted directly and combined with spatial-domain features to detect tampering clues, or used as spatial attention over spatial-domain features. In [28], researchers propose a channel attention mechanism based on the frequency domain: they improve global average pooling from a frequency-domain perspective to introduce more channel information and improve the model’s performance. However, selecting too many coefficients causes overfitting and decreases generalization performance in Deepfake detection. To address this issue, we reduce the number of coefficients and use only relatively robust ones to construct the channel attention module for Deepfake detection. By examining the characteristics of frequency-domain coefficients, we select four DCT coefficients, including the DC coefficient and three AC coefficients, to construct FAEM. This aims to improve the generalization and robustness of the detection model.

DCT is a widely used signal processing technique that converts spatial domain information into a frequency domain representation. For a two-dimensional vector \({x^{2d}} \in {R^{M \times N}}\), the formula of 2D DCT is as follows:

$$\begin{aligned} B_{m,n}^{i,j}= & {} \cos \left( {\frac{{\pi m}}{M}\left( {i + \frac{1}{2}} \right) } \right) \cos \left( {\frac{{\pi n}}{N}\left( {j + \frac{1}{2}} \right) } \right) ,\end{aligned}$$
(1)
$$\begin{aligned} F_{m,n}^{2d}= & {} \sum \limits _{i = 0}^{M - 1} {\sum \limits _{j = 0}^{N - 1} {x_{i,j}^{2d}B_{m,n}^{i,j}} }, \end{aligned}$$
(2)

where \(B_{m,n}^{i,j}\) is the transformation basis function of the DCT, M and N represent the height and width of the feature map, i and j index the spatial position of the feature map (\(i=0,1,\ldots ,M-1\), \(j=0,1,\ldots ,N-1\)), m and n index the position of the DCT coefficients (\(m=0,1,\ldots ,M-1\), \(n=0,1,\ldots ,N-1\)), and \(F_{m,n}^{2d}\) is the DCT coefficient.
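As a sanity check of Eqs. (1) and (2), the basis and a single coefficient can be written directly in PyTorch; this is just a restatement of the formulas, and it also shows that the DC component (m = n = 0) reduces to a plain sum over the map, i.e. (up to scale) global average pooling.

```python
import math
import torch

def dct_basis(m, n, M, N):
    """DCT basis B^{i,j}_{m,n} from Eq. (1), returned as an (M, N) tensor."""
    i = torch.arange(M).float().view(M, 1)
    j = torch.arange(N).float().view(1, N)
    return (torch.cos(math.pi * m / M * (i + 0.5)) *
            torch.cos(math.pi * n / N * (j + 0.5)))

def dct_coefficient(x2d, m, n):
    """Single DCT coefficient F^{2d}_{m,n} from Eq. (2) for a 2D map x2d of shape (M, N)."""
    M, N = x2d.shape
    return (x2d * dct_basis(m, n, M, N)).sum()

# Example: the DC coefficient (m = n = 0) equals the plain sum over the map.
x = torch.rand(14, 14)
print(dct_coefficient(x, 0, 0), x.sum())
```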

Fig. 3

Proposed Frequency-based Attention Enhancement Module (FAEM). \(\odot \) and \(\otimes \) denote element-wise multiplication and channel-wise product

Specifically, the DC coefficient primarily represents the main energy of the entire feature map, while the three adjacent AC coefficients represent the horizontal, vertical, and diagonal energy information of the feature map, respectively. These four selected DCT coefficients are stable and less susceptible to loss under compression. They are used to construct a frequency-based attention enhancement module, aiming to improve the generalization performance of the detection model. The selected four coefficients capture a significant portion of the energy information in the features, and this choice shows better robustness against image attacks such as JPEG compression or Gaussian blur than using all low-frequency information or other frequency-domain information. The effectiveness of this approach is validated through the experimental analysis presented in Tables 4 and 5.

To provide a clearer understanding, Fig. 3 illustrates both the construction of FAEM from the selected coefficients and the process of using FAEM to enhance local features. We divide the channels of the mid-level network features \({X^m} \in {R^{C \times M \times N}}\) into 4 parts according to the 4 selected frequency-domain bases, denoted as \({X^k} \in {R^{C' \times M \times N}}\), \(k \in \left\{ {0,1,2,3} \right\} \), \(C' = C/4\). The features of each part correspond to a specific frequency-domain basis.

For each part, the frequency domain-based attention can be expressed as:

$$\begin{aligned} {F^k} = \sum \limits _{i = 0}^{M - 1} {\sum \limits _{j = 0}^{N - 1} {X_{:,i,j}^kB_{i,j}^{{m_k},{n_k}}} }, \end{aligned}$$
(3)

where \(\left( {{m_k},{n_k}} \right) \) are the 2D frequency-component indices corresponding to \(X^{k}\), and \({F^k} \in {R^{C'}} \) is a \(C^{'}\)-dimensional vector. The frequency-domain attention features of these 4 parts are then aggregated by the concatenation function cat(),

$$\begin{aligned} F_{re} = cat\left( {\left[ {{F^0},{F^1},{F^2},{F^3}} \right] } \right) , \end{aligned}$$
(4)

where \(F_{re} \in {R^C}\) is a C-dimensional vector. Then, we employ a fully connected (FC) layer and a sigmoid function \(\sigma \) to obtain the attentive weights:

$$\begin{aligned} W_{att} = \sigma \left( {FC\left( {F_{re}} \right) } \right) . \end{aligned}$$
(5)

To mitigate the problem of redundant data resulting from directly applying DCT to RGB information, we adopt an alternative approach. Instead of applying DCT to the RGB data directly, we perform it on the network’s middle-level features, denoted as \(X^m\). These middle-level features possess better resistance to interference compared to shallow features and contain more detailed information than high-level features [10]. According to the characteristics of middle-level features, by utilizing \(X^m\), we can preserve crucial tampering clues while minimizing the inclusion of redundant information that may affect recognition. Additionally, we enhance the middle-level features by applying frequency-based attentive weights derived from the AC and DC coefficients,

$$\begin{aligned} X_{fre}^m = W_{att} \otimes {X^m}, \end{aligned}$$
(6)

where \(X_{fre}^m\) denotes the features enhanced by the frequency-based attentive weights, and \(\otimes \) denotes channel-wise multiplication. This process highlights the tampering clues within the feature maps, making them more prominent and improving the overall detection capability.
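Putting Eqs. (3)–(6) together, a minimal sketch of FAEM is given below. The basis helper repeats Eq. (1), and the default frequency indices (the DC component plus its three nearest AC neighbours) are an assumed concrete choice used for illustration rather than the exact indices of the released model.

```python
import math
import torch
import torch.nn as nn

def _dct_basis(m, n, M, N):
    """2D DCT basis from Eq. (1), returned as an (M, N) tensor."""
    i = torch.arange(M).float().view(M, 1)
    j = torch.arange(N).float().view(1, N)
    return (torch.cos(math.pi * m / M * (i + 0.5)) *
            torch.cos(math.pi * n / N * (j + 0.5)))

class FAEM(nn.Module):
    """Frequency-based Attention Enhancement Module (sketch of Eqs. (3)-(6)).

    The channel dimension is split into 4 groups; each group is pooled with one
    fixed DCT basis (one DC + three neighbouring AC components, an assumed choice),
    and the pooled vector is turned into channel-wise attention weights.
    """

    def __init__(self, channels, height, width,
                 freq_indices=((0, 0), (0, 1), (1, 0), (1, 1))):
        super().__init__()
        assert channels % len(freq_indices) == 0
        self.groups = len(freq_indices)
        bases = torch.stack([_dct_basis(m, n, height, width) for m, n in freq_indices])
        self.register_buffer("bases", bases)                     # fixed (4, M, N) bases
        self.fc = nn.Linear(channels, channels)                  # FC layer of Eq. (5)

    def forward(self, xm):                                       # xm: (B, C, M, N)
        b, c, h, w = xm.shape
        xk = xm.view(b, self.groups, c // self.groups, h, w)     # split channels into 4 parts
        # Eq. (3): frequency pooling of each part with its own basis.
        fk = (xk * self.bases.view(1, self.groups, 1, h, w)).sum(dim=(-2, -1))
        f_re = fk.flatten(1)                                     # Eq. (4): concatenate -> (B, C)
        w_att = torch.sigmoid(self.fc(f_re))                     # Eq. (5): attentive weights
        return xm * w_att.view(b, c, 1, 1)                       # Eq. (6): channel-wise product
```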

4 Experiment

4.1 Dataset and settings

Datasets

To verify the effectiveness and generalization of our proposed method, we conduct experiments on the FaceForensics++ (FF++) dataset [11], the Celeb-DF (V2) dataset [48], and the DeepFake Detection Challenge (DFDC) dataset [49]. The FF++ dataset comprises three versions: the original version (raw), the lightly compressed version (C23), and the heavily compressed version (C40). Each compressed version consists of 1000 real videos and corresponding fake videos generated using four common manipulation methods: Deepfakes (DF) [21], Face2Face (F2F) [50], FaceSwap (FS) [51], and NeuralTextures (NT) [31]. Following [11], every 1000 videos are split into 720 training, 140 validation, and 140 testing videos. For training, we select 32 frames from each video, while for validation and testing we use 100 frames per video. The Celeb-DF (V2) dataset [48] comprises 890 real videos and 5639 high-quality fake videos. The DFDC dataset [49] includes more than 20000 real videos and more than 100000 fake videos. In this paper, we employ the Celeb-DF (V2) and DFDC datasets for cross-dataset testing.

Implementation details

We utilize MTCNN [52] to detect faces and save the cropped face images at a size of 224 \(\times \) 224. To extract features, we utilize the EfficientNet-b4 model [53], pretrained on the ImageNet dataset [54], up to layer 5. The extracted middle-level features are fed into the FAEM to obtain enhanced features, which are then passed to the ViT [15] network for classification. We employ AdamW with parameters (0.9, 0.999) as the optimizer. The initial learning rate is set to 0.0001 with a weight decay of 1e-5. Training is conducted with a batch size of 14, while testing is performed with a batch size of 4. The total number of training epochs is 20. For data augmentation, we only apply random horizontal flips, and we use the cross-entropy loss for the final binary classification.
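A minimal sketch of the training loop under these settings is shown below; `model` and `train_loader` are assumed to be the FLAG network of Sect. 3.1 and a standard PyTorch data loader, and only the listed hyper-parameters are taken from the paper.

```python
import torch
import torch.nn as nn

# Assumed: `model` is the FLAG network and `train_loader` yields (face, label)
# batches of 224x224 crops with random horizontal flip already applied.
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4,
                              betas=(0.9, 0.999), weight_decay=1e-5)
criterion = nn.CrossEntropyLoss()

for epoch in range(20):                       # 20 training epochs
    model.train()
    for faces, labels in train_loader:        # batch size 14 during training
        optimizer.zero_grad()
        logits = model(faces)                 # (B, 2) real/fake logits
        loss = criterion(logits, labels)
        loss.backward()
        optimizer.step()
```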

We implement the framework and conduct experiments using the open-source PyTorch library on a single NVIDIA 2080Ti GPU. The proposed model has a computational complexity of 33.59 GMAC (Giga Multiply-Accumulates) and consists of 103.48 million parameters. During the testing phase, our algorithm achieves a detection speed of approximately 119 images per second with a batch size of 4.

Evaluation metrics

We use Accuracy (ACC) and Area Under the Receiver Operating Characteristic Curve (AUC) for evaluation. Since our method is essentially image-based, we evaluate the model at the image level by default, following [6, 11].
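For reference, image-level ACC and AUC can be computed from the per-frame fake probabilities as in the following sketch, which uses scikit-learn's `roc_auc_score`; the 0.5 decision threshold is an assumption.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def image_level_metrics(probs, labels):
    """probs: fake-class probability per frame; labels: 0 = real, 1 = fake."""
    probs, labels = np.asarray(probs), np.asarray(labels)
    acc = ((probs >= 0.5).astype(int) == labels).mean()   # assumed 0.5 threshold
    auc = roc_auc_score(labels, probs)
    return acc, auc
```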

Table 1 Intra-dataset evaluation results (AUC (%) and ACC (%) ) on FF++ dataset
Table 2 Cross-dataset evaluation (AUC (%) ) from FF++ (C23) to Celeb-DF (V2) and DFDC datasets

4.2 Comparison with other methods

In this subsection, we conduct comparative experiments with recent state-of-the-art (SOTA) methods to evaluate their performance in various scenarios. More specifically, MesoNet [10], Xception [11], MaDD [37], MTD-Net [20], and CFFs [55] leverage spatial features for deepfake detection, whereas SPSL [40], M2TR [7], and HFI-Net [6] utilize frequency features to enhance the generalization ability of deepfake detection models. We perform intra-dataset performance tests on FF++ (C23) and FF++ (C40). Cross-dataset evaluations are then conducted on Celeb-DF (V2) and DFDC, and the cross-manipulation evaluation is conducted on FF++ (C23). The best results are shown in bold.

Intra-dataset evaluation

The FF++ dataset is commonly used for evaluating deepfake detection methods. We conduct training and testing using the FF++ (C23) and FF++ (C40) settings, respectively. The results presented in Table 1 demonstrate that our method achieves competitive performance compared to previous approaches. In detail, the proposed method works better on the C23 subset, where more frequency details are preserved than in the C40 subset. Specifically, on the C23 subset, our method achieves an ACC of 96.56% and an AUC of 99.26%, surpassing the second-best ACC performer, CFFs [55], by a notable margin of 1.63% in ACC. Additionally, our proposed method outperforms HFI-Net [6] by a substantial margin, achieving gains of 4.69% in ACC and 2.19% in AUC. As for the heavily compressed C40 subset, our proposed method obtains an AUC of 89.94%, which lags behind CFFs [55] by 0.41% in AUC. However, compared to HFI-Net [6], which also utilizes frequency-domain features, our proposed method is better by 0.9% in ACC and 1.54% in AUC.

Cross-dataset evaluation

In the cross-dataset evaluation, our model is trained on the FF++ (C23) dataset and tested on the Celeb-DF (V2) and DFDC datasets using the AUC metric. The experimental results, compared with SOTA methods, are presented in Table 2. Notably, our proposed approach demonstrates superior generalization on the DFDC dataset, achieving a 1.99% improvement in AUC compared to HFI-Net [6]. For Celeb-DF (V2), our method attains an AUC of 78.84%, surpassing CFFs [55] by a margin of 4.64%. It is worth mentioning that HFI-Net [6] attains the highest performance in the Celeb-DF (V2) evaluation by incorporating a global-local interaction module at each stage, effectively suppressing certain features of the training dataset and enhancing generalization across diverse datasets. However, our method outperforms HFI-Net in terms of intra-dataset performance, as indicated in Table 1.

Cross-manipulation evaluation

To demonstrate the generalization of our method across different manipulation methods, we conduct this experiment on the FF++ (C23) dataset. Following the standard protocol in [5], we train a model on three manipulation methods from the FF++ dataset and test it on the remaining one. During training, the datasets containing the three manipulation methods serve as the training and validation sets, while the remaining manipulation method appears exclusively in the test set. For instance, GID-DF (C23) means training on the other three manipulation methods of FF++ (C23) and testing on the Deepfakes class, and GID-F2F (C23) is defined analogously. The evaluation metrics used in this study are video-level AUC and ACC. Following [5, 24], we use the average score of a sequence of frames to generate the video-level prediction. The comparative experimental results presented in Table 3 are taken from [5].
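A minimal sketch of this frame-to-video aggregation, assuming each frame prediction carries a `video_id`:

```python
from collections import defaultdict
import numpy as np

def video_level_scores(frame_probs, video_ids):
    """Average per-frame fake probabilities into one score per video."""
    buckets = defaultdict(list)
    for p, vid in zip(frame_probs, video_ids):
        buckets[vid].append(p)
    return {vid: float(np.mean(ps)) for vid, ps in buckets.items()}
```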

First of all, our method excels in the GID-F2F scenario, achieving an ACC of 66.31% and an AUC of 86.50%. This surpasses the second-best ACC performer, LTW [5], by a notable margin, with gains of 0.71% in ACC and 5.3% in AUC. In the GID-DF case, LTW [5] has a slightly higher AUC (by 0.24%) than our method, although our method achieves a 2.34% higher ACC than LTW [5]. On average, our performance on GID-DF remains comparable to the SOTA.

Table 3 Cross-manipulation evaluation results (AUC (%) and ACC (%) ) on FF++ (C23) dataset

4.3 Ablation study

Effectiveness of different components

To verify the effectiveness of the proposed modules, we conduct several ablation studies. Starting with the pure EfficientNet-b4 model as the baseline, we gradually add the proposed modules. ‘EfficientNet-b4+Vit_block’ refers to the EfficientNet-b4 model with the ViT blocks but without the attention module, and ‘EfficientNet-b4+FAEM’ refers to the EfficientNet-b4 model with FAEM. These models are trained on FF++ (C23) and tested on the FF++ (C40) and Celeb-DF (V2) datasets. As the proposed modules are added, the model gradually gains expressive capability, leading to improved discriminative performance. Figure 4 shows that the hybrid structure combining CNN and ViT provides more confident information for decision-making than the pure EfficientNet-b4 model, and that the proposed enhancement module, which utilizes frequency-domain information, also improves the generalization capability of the model. At the same time, the proposed network structure, which accounts for both local and global features, significantly improves performance over the original pure EfficientNet-b4 model. The AUC scores indicate that the proposed algorithm improves both intra-dataset performance and generalization across diverse datasets.

Fig. 4

Ablation results for different components

Table 4 Ablation study on different frequency components via training on FF++ (C23) and C40 respectively

Effectiveness of different frequency components

To verify the effectiveness of the frequency coefficients selected in our work, we conduct ablation studies comparing different choices of frequency coefficients. The results are listed in Table 4, in which ‘FLAG_mh’ indicates that the FAEM module is constructed using mid-frequency and high-frequency coefficients, and ‘FLAG_low’ uses low-frequency coefficients to construct the FAEM module. Overall, the proposed method obtains the best performance in all three scenarios: the in-domain test (train on FF++ C40 and test on FF++ C40), the compression robustness test (train on FF++ C23 and test on FF++ C40), and the cross-dataset test (train on FF++ C23 and test on Celeb-DF). More specifically, using low-frequency coefficients performs better than using high and middle frequencies, with at least a 0.81% improvement, and this improvement is further increased by focusing on the four low-frequency coefficients considered in our work.

In real-world situations, images are often affected by image attacks, which reduce image quality. Such processing causes distortions that decrease generalization performance; in particular, high-frequency features tend to be lost during image processing, further exacerbating the issue. To verify the effectiveness of our coefficient selection under image attacks, we conduct a robustness test on channel attention enhancement models constructed with different coefficients. First, we train the model on the C23 training set without image attacks, and then test it on the C23 test set after applying different image attacks [59]. We use various types of image attacks, including: (1) JPEG compression with quality factors of 50, 30, and 20; (2) Gaussian blur with filter windows of sizes 7 \(\times \) 7, 5 \(\times \) 5, and 3 \(\times \) 3; (3) color saturation with saturation levels of 0.1, 0.2, and 0.3; (4) block-wise occlusion with a block size of 8 \(\times \) 8 and 80, 64, and 48 occluded blocks; (5) color contrast with contrast ratios of 0.6, 0.725, and 0.85. Table 5 presents the different image attacks, their corresponding parameters, and the tested AUC results. Additionally, Fig. 5 provides visual examples illustrating the effects of the different image processing techniques and their corresponding levels; each column of images, from top to bottom, corresponds to the parameter levels listed in Table 5. When facing a JPEG compression attack with a quality factor of 20, the proposed enhancement module, built on four frequency-domain coefficients, improves the AUC by 2% compared to FLAG_low. In the case of Gaussian blur with a 7 \(\times \) 7 filter window, the proposed enhancement module achieves an AUC of 88.05%. Furthermore, when confronted with a color contrast attack with a contrast ratio of 0.6, the proposed module shows an AUC improvement of 0.56% over FLAG_low.
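For illustration, the five distortion families can be applied with PIL and OpenCV roughly as follows; the parameter levels match Table 5, while the exact implementations (e.g. the random placement of occluded blocks) are assumptions and may differ from the perturbation code of [59].

```python
import io
import numpy as np
import cv2
from PIL import Image, ImageEnhance

def jpeg_compress(img, quality):
    """JPEG re-encode at the given quality (e.g. 50, 30, or 20)."""
    buf = io.BytesIO()
    img.save(buf, format="JPEG", quality=quality)
    buf.seek(0)
    return Image.open(buf).convert("RGB")

def gaussian_blur(img, ksize):
    """Gaussian blur with an odd filter window (e.g. 3, 5, or 7)."""
    arr = cv2.GaussianBlur(np.array(img), (ksize, ksize), 0)
    return Image.fromarray(arr)

def change_saturation(img, factor):
    """Color saturation scaling (e.g. 0.1, 0.2, or 0.3)."""
    return ImageEnhance.Color(img).enhance(factor)

def change_contrast(img, factor):
    """Color contrast scaling (e.g. 0.6, 0.725, or 0.85)."""
    return ImageEnhance.Contrast(img).enhance(factor)

def block_occlude(img, n_blocks, block=8, seed=None):
    """Black out n_blocks randomly placed block x block squares (e.g. 48, 64, or 80)."""
    rng = np.random.default_rng(seed)
    arr = np.array(img)
    h, w = arr.shape[:2]
    for _ in range(n_blocks):
        y = int(rng.integers(0, h - block))
        x = int(rng.integers(0, w - block))
        arr[y:y + block, x:x + block] = 0
    return Image.fromarray(arr)
```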

Table 5 AUC results of C23 after image processing operation of different types and degrees
Fig. 5

Image visualization of the severity levels of five image processing operations. We use three severity levels for five distortion types in the robustness testing

Fig. 6

Four coefficients are chosen at various positions. The selected locations are highlighted in green

Furthermore, we restrict the selection to four coefficients at different positions, as shown in Fig. 6, which illustrates the specific locations chosen in the experiment: (a) the four coefficients selected by the proposed method, (b) four middle-frequency coefficients, (c) another set of four middle-frequency coefficients, and (d) the four high-frequency coefficients in the lower right corner. The results of the cross-dataset test are shown in Table 6. The proposed selection achieves the best performance on the Celeb-DF (V2) dataset. The suboptimal performance on the DFDC dataset is likely because that dataset primarily comprises high-quality forged videos, in which case the four frequency-domain coefficients represented by Middle_Fre2 may be more effective at capturing tampering information. More importantly, this study supports our motivation to use four selected frequency coefficients to design the channel attention rather than a random sampling strategy.

Fig. 7

Visualization of the proposed method through Grad-CAM [30]. The shown images include DF [21], F2F [50], FS [51], and NT [31], corresponding to each column. Each column includes RGB images and the corresponding Grad-CAM [30] maps

4.4 Visualization experiments

To further understand the effectiveness of our proposed method, we provide visualizations of our method through Grad-CAM [30] on different tampering methods in Fig. 7. Two tampering methods, DF [21] and FS [51], are employed for face swapping by replacing the target face with the source face. F2F [50] and NT [31] are facial reenactment technologies that specifically manipulate facial expressions and lip movements. In Fig. 7, we observe that our method focuses on the face regions in the DF [21] and FS [51] columns, while in the F2F [50] and NT [31] columns, our method focuses on manipulation regions such as the nose and mouth. These visualizations demonstrate that our proposed method captures discriminative and reasonable features, especially for NT where only the mouth part is manipulated.
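For readers who wish to reproduce such maps, a minimal, framework-agnostic Grad-CAM sketch using forward/backward hooks is given below; `target_layer` (e.g. the last convolutional stage of the CNN stem) and the fake-class index are assumptions.

```python
import torch
import torch.nn.functional as F

def grad_cam(model, image, target_layer, class_idx=1):
    """Minimal Grad-CAM: weight the target layer's activations by the
    spatially-averaged gradients of the chosen class score, then ReLU."""
    acts, grads = [], []
    h1 = target_layer.register_forward_hook(lambda m, i, o: acts.append(o))
    h2 = target_layer.register_full_backward_hook(lambda m, gi, go: grads.append(go[0]))
    try:
        logits = model(image.unsqueeze(0))          # image: (3, H, W) tensor
        model.zero_grad()
        logits[0, class_idx].backward()             # gradient of the fake-class score
        a, g = acts[0], grads[0]                    # activations and gradients, (1, C, h, w)
        weights = g.mean(dim=(2, 3), keepdim=True)  # per-channel importance
        cam = F.relu((weights * a).sum(dim=1, keepdim=True))
        cam = F.interpolate(cam, size=image.shape[1:], mode="bilinear",
                            align_corners=False)
        cam = (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)
        return cam[0, 0]                            # (H, W) heat-map in [0, 1]
    finally:
        h1.remove()
        h2.remove()
```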

5 Conclusion

This paper introduces the Frequency-based Local and Global (FLAG) network architecture, which effectively explores both local and global information by leveraging frequency-domain cues. By combining the strengths of CNN and ViT, the framework effectively captures tampering information at both local and global scales. Additionally, we propose a frequency-based attention enhancement module that carefully considers the characteristics of frequency domain coefficients. This module effectively integrates the CNN and ViT, resulting in improved generalization performance of the model. Experimental results on public datasets demonstrate the satisfactory performance of our proposed method. Furthermore, we hope that the FLAG framework can serve as inspiration for researchers to further explore the potential of frequency domain coefficients in the field of Deepfake detection.