
1 Introduction

Face forgery generation methods have recently received considerable attention in the computer vision community [12, 21, 40, 46, 49, 53]; such methods may cause severe trust issues and seriously disturb the social order. For example, a malicious producer can forge videos of world leaders to influence or manipulate politics and social sentiment. Even worse, these fake videos are of high quality and can be easily generated with open-source code such as DeepFaceLab. Therefore, it is urgent to develop effective face forgery detection methods to mitigate the malicious abuse of face forgery.

A straightforward approach is to model face forgery detection as a binary classification problem [1, 12, 16, 44]: a pretrained convolutional neural network (CNN) is used to distinguish the authenticity of the input face, which was the de facto standard in the Deepfake Detection Challenge [12]. However, generated fake faces are becoming more and more realistic, which means the differences between real and fake faces are increasingly subtle. Although CNN models can learn discriminative features, it is still hard to directly use such models to capture these forgery clues in a unified framework, resulting in unsatisfactory performance.

Fig. 1. Visualization of the self-information map and the manipulation ground-truth mask for faces forged by different manipulations (DeepFakes and FaceSwap). The self-information map is calculated by Eq. 1. The ground-truth mask is generated by subtracting the forged face from the corresponding real face, followed by some morphological transformations

To tackle this issue, many heuristic methods [7, 11, 17, 23, 32, 45] use prior knowledge or observed clues to learn more discriminative features, which are instrumental in distinguishing real from fake faces. For example, F3-Net [39] learns forgery patterns with frequency awareness, Gram-Net [32] leverages global image texture representations for robust fake image detection, and Face X-ray [27] takes advantage of the blending boundary of a forged image to enhance performance. Though these methods can improve performance, they lack unified theoretical support and reflect only certain aspects of face forgery, leading to model bias or sub-optimal solutions.

To address this issue, we revisit face forgery detection from a new perspective, i.e., face forgery is highly associated with high-information content. Inspired by [5, 6], we make the first attempt to introduce self-information as theoretic guidance to improve the discriminativeness of the model. Specifically, self-information can be easily defined from the current or surrounding regions [43]: a high-information region differs significantly from its neighborhood, and thus reflects the amount of information in the image content. Moreover, we find that most existing forgery clues are located in high self-information regions. For example, due to the instability of the generative model, abnormal textures often appear in forged faces; these high-frequency artifacts usually differ sharply from the surrounding facial features or skin, so self-information can highlight them. Another example is blending artifacts: Face X-ray [27] demonstrates that forged boundaries widely exist in forged faces because of the blending operation, and the skin-color or texture difference between the real and forged parts enlarges the self-information in blending-artifact regions. Motivated by these observations, we design a novel attention module called Self-Information Attention (SIA), which calculates pixel-wise self-information and uses it as a spatial attention map to capture more subtle abnormal clues. Additionally, the SIA module calculates the average self-information of each channel's feature map and uses it as attention weights to select the most informative feature maps. As shown in Fig. 1, the self-information map of the original image highlights the same region as the ground-truth mask, which indicates its effectiveness in the face forgery detection task.

We conduct experiments on several widely-used benchmarks, and the experimental results show that our proposed method significantly outperforms state-of-the-art competitors. Notably, the proposed SIA module can be flexibly plugged into most CNN architectures with little parameter increase. Our main contributions can be summarized as follows:

  • We propose a new perspective for face forgery detection based on information theory, where self-information is introduced as theoretic guidance for detection models to capture more critical forgery cues.

  • We design a novel attention module based on self-information, which helps the model capture more informative regions and learn more discriminative features. Besides, the SIA module can be plugged into most existing 2D CNNs with a negligible parameter increase.

  • Extensive experiments and visualizations demonstrate that our method achieves consistent improvements over multiple competitors with a comparable number of parameters.

2 Related Work

2.1 Forgery Face Manipulation

Face forgery generation methods threaten the security of scenarios related to identity authentication and have attracted more and more attention in the computer vision community. In particular, DeepFakes is the first deep-learning-based face identity swap method [49], which uses two autoencoders to simulate changes in facial expressions. Another stream of research designs GAN-based models [4, 14, 15, 21] to generate entirely fake faces. Recently, graphics-based approaches have been widely used for identity transfer, which are more stable than deep-learning-based approaches. For instance, Face2Face [50] can perform real-time facial reenactment using only an RGB camera. Averbuch-Elor et al. [3] proposed a reenactment method that deforms the target image to match the expressions of the source face. NeuralTextures [48] renders a fake face by computing the reenactment result with a neural texture. Kim et al. [25] combined an image-to-image translation network with computer graphics renderings to convert face attributes. These forgery methods focus on manipulating high-information areas and may leave high-frequency subtle clues; thus we introduce self-information learning to assist in identifying forged faces.

2.2 Face Forgery Detection

To detect the authenticity of input faces, early works usually extract low-level features such as RGB patterns [36], inconsistencies in JPEG compression [2], and visual artifacts [35]. More recently, binary convolutional neural networks have been widely applied to this task [12] and achieve better performance. However, vanilla CNNs tend to extract semantic information while ignoring subtle and local forgery patterns [54]. Thus, several heuristic methods have been proposed, which leverage observations or prior knowledge to help the model mine forgery patterns. For instance, Face X-ray [27] is supervised by the forged boundary, F3-Net [39] leverages frequency clues as an auxiliary to RGB features, and Local-Relation [9] measures the similarity between features of local regions based on the observed inconsistency between forged and real parts. However, these methods still cannot cover all forgery clues, leading to suboptimal performance. Thus, we introduce self-information to help the model capture informative regions adaptively. In addition, our proposed method contains only a few parameters and can serve as a plug-and-play module on top of several backbones.

3 Proposed Method

3.1 Preliminaries

Problem Formulation. Many works [24] have been proposed to identify whether a given face is real or fake, but most of them are based on experimental observations of the remarkable differences between real and fake faces. Recent work [35] found that these observations correspond to discriminative artifact clues that are subtle yet abnormal compared with their neighborhoods, owing to the generative model's instability and the imperfection of blending methods. On the other hand, existing models consider only one or a small number of these clues, which are integrated into a vanilla CNN, leading to a biased or sub-optimal model. This raises a natural question: is there a metric that can adaptively capture such differential information? To answer this question, this paper turns to information theory and uses classical self-information to adaptively quantify the saliency clues.

Fig. 2. Overview of our face forgery detection framework with the Self-Information Attention (SIA) module. The SIA module is embedded in the middle layers of the CNN. The orange dotted block denotes the channel attention part, while the blue dotted block denotes the spatial attention part. The details of the self-information computation are shown in Fig. 3 (Best viewed in color) (Color figure online)

Self-information Analysis. Self-information is a metric of the information content associated with the outcome of a random variable [5]; it is also called surprisal, i.e., it reflects how surprising an event's outcome is. Given a random variable X with probability mass function \(P_X\), the self-information of X for outcome x is \(I_{X}(x) = -\log (P_{X}(x))\). It follows that the smaller an outcome's probability, the higher its self-information. That is, the more a region differs from its neighboring patches, the more self-information it contains. Inspired by [5, 43], self-information can be applied to the joint likelihood of statistics in a local neighborhood of the current patch, which provides a transformation between probability and the degree of information inherent in the local statistics. For face forgery detection, the heuristic unusual forgery clues (such as high-frequency noise, blending boundaries, abnormal textures, etc.) are hidden in high-information context. Therefore, it is intuitive to introduce the self-information metric into face forgery detection to help the model adaptively learn high-information features.
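For intuition, the definition above can be written as a few lines of Python; the probabilities below are arbitrary illustrative values:

    import math

    def self_information(p: float) -> float:
        # I(x) = -log(P(x)), measured in nats with the natural logarithm
        return -math.log(p)

    print(self_information(0.9))   # ~0.105: a common outcome, little surprise
    print(self_information(0.01))  # ~4.605: a rare outcome, high surprise

Rare (low-probability) outcomes yield large self-information, which is exactly the property we exploit to highlight abnormal forgery regions.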

3.2 Overall Framework

In this paper, we design a new attention mechanism based on the self-information metric, which can highlight the manipulated regions. We call this newly defined module the Self-Information Attention (SIA) module, whose overall framework is shown in Fig. 2. In particular, the proposed SIA module mainly contains three key parts: 1) Self-Information Computation: to capture high-information regions, we calculate the self-information of the input feature map and output a new discriminative attention map. 2) Self-Information based Dual Attention: to maximize the backbone's use of self-information, the self-information of the input feature map is applied as both channel-wise and spatial-wise attention. 3) Self-Information Aggregation: motivated by [19, 54], we densely forward all previous self-information feature maps to the current SIA block, so as to preserve the detailed areas to the greatest extent.

Fig. 3. Visualization of the Self-Information Computation. \(R_f\) denotes the local receptive field region, and \(R_c\) is the channel offset region (Best viewed in color)

3.3 Self-information Computation

Let \(f^t\in {R^{C\times H \times W}}\) denote the input of the t-th SIA module with C channels and spatial size \(H\times W\), where \(f^t_k(i,j)\) denotes the pixel of the k-th channel of \(f^t\) at coordinate (i, j). As mentioned before [5], the self-information can be approximated from the joint probability distribution of the current pixel and its neighborhood using a Gaussian kernel function. Different from previous work [5], we consider self-information along two orthogonal dimensions: one finds the neighborhood in the spatial dimension, and the other searches the neighborhood in the channel dimension.

We define the spatial intra-feature self-information as:

$$\begin{aligned} I_\textrm{intra}(f^t_k(i,j)) = -\log \sum _{m,n\in R_f} e^{-\frac{{||f^t_{k}(i,j)-f^t_{k}(i+m,j+n)||}^{2}}{2h^{2}}}, \end{aligned}$$
(1)

where \(R_f\) is the local receptive field region around pixel (i, j), m and n are the pixel offsets within \(R_f\), and h is the bandwidth.
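For concreteness, a minimal PyTorch sketch of Eq. 1 could look as follows; the window size k, the bandwidth h, and the small constant added for numerical stability are our assumptions rather than values specified in the text:

    import torch
    import torch.nn.functional as F

    def intra_self_information(f: torch.Tensor, k: int = 3, h: float = 1.0) -> torch.Tensor:
        # f: (B, C, H, W) feature map; R_f is a k x k window around each pixel
        B, C, H, W = f.shape
        # Gather each pixel's k*k spatial neighborhood per channel
        neigh = F.unfold(f, kernel_size=k, padding=k // 2).view(B, C, k * k, H, W)
        sq_dist = (f.unsqueeze(2) - neigh) ** 2      # squared differences to neighbors
        kernel = torch.exp(-sq_dist / (2 * h ** 2))  # Gaussian kernel per neighbor
        return -torch.log(kernel.sum(dim=2) + 1e-8)  # Eq. (1), shape (B, C, H, W)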

When the neighborhood is located in the channel dimension, we define the self-information along the channels as the inter-feature self-information \(I_\textrm{inter}\):

$$\begin{aligned} I_\textrm{inter}(f^t_k(i,j)) = -\log \sum _{s\in R_c} e^{ -\frac{{||f^t_{k}(i,j)-f^t_{k+s}(i,j)||}^{2}}{2h^{2}} }, \end{aligned}$$
(2)

where s is the index within the channel offset region \(R_c\). The inter-feature self-information helps us avoid observation noise that exists in individual channels.

As a result, the overall self-information \(I(f^t_k)\) can be formulated as:

$$\begin{aligned} I(f^t_k(i,j)) = I_\textrm{intra}(f^t_k(i,j)) + \lambda I_\textrm{inter}(f^t_k(i,j)), \end{aligned}$$
(3)

where \(\lambda \) is a weight parameter that balances the importance of the inter-feature self-information. Fig. 3 illustrates the computation of self-information.
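A companion sketch of Eq. 2 and Eq. 3, under the same assumptions as above; here we read \(R_c\) as a symmetric set of non-zero channel offsets and use a circular shift for the channel neighbors, which is our interpretation rather than a detail given in the text (\(\lambda = 0.5\) follows Sect. 4.1):

    def inter_self_information(f: torch.Tensor, r: int = 1, h: float = 1.0) -> torch.Tensor:
        # Eq. (2): neighbors along the channel axis, R_c = {-r, ..., r} \ {0}
        kernel_sum = torch.zeros_like(f)
        for s in range(-r, r + 1):
            if s == 0:
                continue  # skip the channel itself
            neighbor = torch.roll(f, shifts=s, dims=1)  # f_{k+s}(i, j), circular shift
            kernel_sum = kernel_sum + torch.exp(-(f - neighbor) ** 2 / (2 * h ** 2))
        return -torch.log(kernel_sum + 1e-8)

    def self_information_map(f: torch.Tensor, lam: float = 0.5) -> torch.Tensor:
        # Eq. (3): I = I_intra + lambda * I_inter
        return intra_self_information(f) + lam * inter_self_information(f)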

3.4 Self-information Based Dual Attention

We propose a new dual attention module, where the saliency is quantified by the self-information measure [6]. Inspired by [20], we consider saliency features along the spatial dimension and the channel dimension.

Spatial-Wise Attention Module. We introduce a spatial attention module based on self-information, whose flowchart is shown in Fig. 2. In detail, we calculate each pixel's self-information \(I(f^t_k)\) via Eq. 3. Then, we use the Sigmoid function to normalize these values and output the self-information based spatial attention map. Finally, we perform an element-wise multiplication with the input feature \(f^t_k\). The Spatial-wise Attention Module is formulated as follows:

$$\begin{aligned} s_k = \textrm{Sigmoid}(I(f^t_k))*f^t_k. \end{aligned}$$
(4)

This attention map focuses on high-information regions with little parameter increase, and can adaptively enhance many subtle artifact clues, such as blending boundaries and high-frequency noise. For more details, please refer to Sect. 4.

Channel-Wise Attention Module. Apart from the spatial attention module, we further introduce a channel attention module, whose pipeline is illustrated in Fig. 2. Similar to [20], we calculate the average self-information of each channel's feature map and generate the channel-wise statistic \(c_k\) for the k-th channel of \(f^t\) as follows:

$$\begin{aligned} c_k = \frac{1}{H\times W}\sum ^{H}_{i=1}\sum ^{W}_{j=1}I(f^t_k(i,j)). \end{aligned}$$
(5)

To improve training stability, we employ a simple linear transform with sigmoid activation on the vector \(c = \{c_1, c_2, \ldots , c_C\}\) to obtain the channel attention \({c^{\prime }}\):

$$\begin{aligned} {c^{\prime }} = \textrm{Sigmoid}(Wc), \end{aligned}$$
(6)

where \(W\in R^{C\times C}\) is a linear transformation. This module amplifies feature maps that contain high self-information, which helps locate the saliency in explicit content.

Dual Attention Module Embedded in CNN. Finally, we combine the two attention modules above and perform an element-wise sum between the processed attention map and \(f^t_k\) to output a residual feature \(O_k \in R^{C\times H \times W}\), formulated as:

$$\begin{aligned} O_k = c^{\prime }_k*s_k + f^t_k. \end{aligned}$$
(7)

The proposed SIA module is flexible and can be easily inserted into any CNN-based architecture. We can also flexibly choose between the spatial attention module and the channel attention module. The SIA module adds few parameters yet enhances the performance of the model.
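Putting Eqs. 4-7 together, a minimal sketch of the SIA block as a plug-and-play PyTorch module might look as follows; it reuses the hypothetical self_information_map helper sketched in Sect. 3.3:

    import torch
    import torch.nn as nn

    class SIABlock(nn.Module):
        def __init__(self, channels: int):
            super().__init__()
            self.fc = nn.Linear(channels, channels, bias=False)  # W in Eq. (6)

        def forward(self, f: torch.Tensor) -> torch.Tensor:
            info = self_information_map(f)                    # I(f^t), shape (B, C, H, W)
            s = torch.sigmoid(info) * f                       # Eq. (4): spatial attention
            c = info.mean(dim=(2, 3))                         # Eq. (5): channel statistics c_k
            cp = torch.sigmoid(self.fc(c))[:, :, None, None]  # Eq. (6): c' = Sigmoid(Wc)
            return cp * s + f                                 # Eq. (7): residual output O

In use, the block would simply wrap the output of a chosen backbone stage, e.g. x = sia(stage(x)).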

3.5 Self-information Aggregation

General CNNs such as EfficientNet [47] usually use down-sampling operations to reduce parameters and expand the receptive field, which tends to eliminate subtle clues with high information content in the face forgery detection task. To overcome this problem, inspired by [22], we design a self-information aggregation operation that cascades SIA modules at different levels via their self-information attention maps, so that local and subtle forgery clues are preserved. As shown in Fig. 2, we add the attention map of the previous stage to the current input feature map to preserve shallow, highly informative texture. Since attention maps at different levels have different sizes, we use a \(1\times 1\) convolution to align the number of channels and interpolation to align the spatial size of the feature maps. This alignment operation is denoted by the function \(\textrm{Align}\). As a result, the t-th input feature \(f^t\) can be defined as:

$$\begin{aligned} f^t = \sum ^{t-1}_{i=1}\textrm{Align}_i(I(f^i)) + m^t, \end{aligned}$$
(8)

where \(m^t\) is the feature map adjacent to the t-th SIA module.
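A possible realization of the \(\textrm{Align}\) function, matching the description above (\(1\times 1\) convolution for the channels, interpolation for the spatial size); the bilinear interpolation mode is our assumption:

    import torch.nn as nn
    import torch.nn.functional as F

    class Align(nn.Module):
        def __init__(self, in_ch: int, out_ch: int):
            super().__init__()
            self.proj = nn.Conv2d(in_ch, out_ch, kernel_size=1)  # align channel count

        def forward(self, info_map, size):
            # align spatial size, then feed into the sum of Eq. (8)
            return F.interpolate(self.proj(info_map), size=size,
                                 mode='bilinear', align_corners=False)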

3.6 Loss Function

We use the cross-entropy loss, which is defined as:

$$\begin{aligned} L_{ce} = -\frac{1}{n}\sum _{i=1}^{n} y_{i}\log (\hat{y_{i}})+(1-y_{i})\log (1-\hat{y_{i}}), \end{aligned}$$
(9)

where n is the number of images, \(\hat{y_i}\) is the prediction for the i-th image, and \(y_i\) is the label of the sample.
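Eq. 9 is the standard binary cross-entropy, so in a PyTorch implementation it reduces to a single library call (y_hat, logits, and y are placeholder names for sigmoid outputs, raw scores, and binary labels):

    import torch.nn.functional as F

    loss = F.binary_cross_entropy(y_hat, y)  # y_hat in (0, 1)
    # or, numerically more stable when working with raw logits:
    loss = F.binary_cross_entropy_with_logits(logits, y)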

4 Experiment

In this section, we evaluate the proposed SIA module against several state-of-the-art face forgery detection methods [1, 8, 10, 11, 13, 18, 27, 34, 39, 47, 54] and attention techniques [20, 32, 42]. We explore robustness under unseen manipulation methods, conduct ablation studies, and present visualization results.

Table 1. Comparison on the FaceForensics++ dataset in terms of ACC and AUC with different qualities (HQ and LQ). The highest results are highlighted in bold. F3-Net uses 0.5 as the threshold
Table 2. Cross-dataset evaluation from FF++ (LQ) to the Deepfakes class of FF++ and Celeb-DF in terms of AUC. The highest results are highlighted in bold

4.1 Experimental Setup

Datasets. We conduct experiments on several challenging datasets to evaluate the effectiveness of our proposed method. FaceForensics++ [40] is a large-scale deepfake detection dataset containing 1,000 videos, in which 720 videos are used for training and the remaining 280 for validation or testing. It covers four face synthesis approaches: two graphics-based methods (Face2Face and FaceSwap) and two learning-based approaches (DeepFakes and NeuralTextures). The videos in FaceForensics++ come in two quality levels: high quality (C23) and low quality (C40). Celeb-DF [30] is another widely-used deepfake dataset, which contains 590 real videos and 5,639 fake videos; the DeepFake videos are generated by swapping faces for each pair of the 59 subjects. Following prior works [39, 47, 54], we use multi-task cascaded CNNs to extract faces and randomly select 50 frames from each video to construct the training and test sets. WildDeepfake [56] is a recently released forgery face dataset collected from the Internet, containing 3,805 real face sequences and 3,509 fake face sequences; it therefore covers a variety of synthesis methods, backgrounds, and identities.

Evaluation Metrics. We apply accuracy score (ACC) and area under the receiver operating characteristic curve (AUC) as our basic evaluation metrics.

Implementation Details. We use EfficientNet-b4 [47] pretrained on ImageNet as our backbone, which is widely used in face forgery detection. The backbone contains seven layers, and we place our proposed SIA module at the outputs of layer 1, layer 2, and layer 4, because the shallow and middle layers contain low-level and mid-level features, which reflect the subtle artifact clues well. We resize each input face to \(299\times 299\). The hyperparameter \(\lambda \) in Eq. 3 is set to 0.5. We use the Adam optimizer to train the network's parameters, with a weight decay of \(1e-5\) and betas of 0.9 and 0.999. The initial learning rate is set to 0.001, and we use a StepLR scheduler with a step size of 5 and a gamma of 0.1. The batch size is set to 32.
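For reference, the optimizer and scheduler settings above translate directly into PyTorch (model stands for the EfficientNet-b4 backbone with the SIA modules inserted):

    import torch

    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3,
                                 betas=(0.9, 0.999), weight_decay=1e-5)
    scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=5, gamma=0.1)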

Table 3. Performance on Celeb-DF and WildDeepfake datasets in terms of ACC and AUC
Table 4. ACC of different pretrained backbones on FaceForensics++ HQ, FaceForensics++ LQ and Celeb-DF datasets

4.2 Experimental Results

Intra-dataset Testing. We evaluate performance under the two quality settings of FaceForensics++. Note that the results of F3-Net use a threshold of 0.5. The overall results in Table 1 show that the proposed method obtains state-of-the-art performance in both the high-quality and low-quality settings. Freq-SCL [26] and SPSL [31] leverage frequency clues as an auxiliary to RGB features; both convert the RGB image into the frequency domain within a dual-stream framework. These two methods boost performance from the input perspective, whereas our method instead promotes representation learning. Compared with the recent attention-based Multi-attentional [54], our method achieves better performance. This is because SIA gives more accurate guidance to the attention mechanism along both the channel-wise and spatial-wise dimensions, which provides more adaptive information for the model.

To further demonstrate the effectiveness of our method, we evaluate the SIA module on two well-known forgery datasets: Celeb-DF and WildDeepfake. The results are shown in Table 3. We observe that our SIA outperforms all comparison methods. Specifically, compared with F3-Net, which requires 80M parameters and 21G MACs, our EN-b4+SIA contains only 35M parameters and 6.05G MACs and achieves about 3% improvement on both datasets. In addition, we evaluate the proposed SIA module on the DFDC [12] dataset and achieve SOTA performance with 82.31% ACC and 90.96% AUC. Due to the page limit, we put these results in the supplementary material.

Cross-Dataset Testing. To further demonstrate the generalization ability of SIA, we conduct cross-dataset evaluations. Specifically, following the setting of [34], we train our model on FF++ (LQ) and test it on the Deepfakes class and Celeb-DF. The quantitative results are shown in Table 2. We observe that our method obtains state-of-the-art performance, especially in the cross-database setting: SIA outperforms the recent SPSL and GFF by 4% and 2% in AUC, respectively, in the cross-dataset setting, and achieves a slight improvement in the intra-domain setting. The reason for the improvement is that our module guides the backbone to focus on the informative subtle details that are commonly present in all forged faces.

Dependency on Backbone. The proposed SIA module is a plug-and-play block that can be embedded in any deep-learning-based model. Therefore, we verify the effectiveness of the SIA module using different backbones: EfficientNet-b0, MobileNet-v2, and XceptionNet, evaluated on FaceForensics++ and Celeb-DF. All SIA modules are embedded in the first and middle layers. For instance, for EfficientNet-b0, we place the module after the 2nd, 3rd, and 5th MBConvBlock; for XceptionNet, the module is inserted between the 3rd and 4th blocks; and for MobileNet-v2, the module is embedded after the 3rd and 7th InvertedResidual blocks. The results are shown in Table 4. We find that our method improves network performance regardless of the backbone type, which proves the flexibility and generality of our method.

Table 5. Quantitative results on the Celeb-DF and FaceForensics++ datasets with different qualities (HQ and LQ). The compared methods are all plug-and-play attention modules. The last column reports the parameter increase after adding the corresponding module. The highest results are highlighted in bold

Comparison with Attention Methods. We compare the proposed method with several classical attention-based methods to show the effectiveness of self-information in this task: (1) Baseline: EfficientNet-b4 pretrained on ImageNet. (2) Baseline+SE-layer [20]: the channel attention module. (3) Baseline+Non-local [52]: Non-local attention has been used in deepfake detection [32]; here we use the Gaussian-embedded version with both a batch-norm layer and the sub-sampling strategy. (4) Baseline+SE-layer+Non-local: we use the SE-layer and Non-local blocks to realize both channel attention and spatial attention. (5) Baseline+GSA [42]: GSA is a state-of-the-art attention module that considers both pixel content and spatial information; here the number of heads is set to 8 and the key dimension to 64.

The comparison results are reported in Table 5. Our proposed SIA module outperforms all reference methods on both benchmarks. Specifically, adding our SIA module to the baseline yields about a \(1.5\%\) ACC improvement with little parameter increase, which reflects that self-information is indeed well suited to the face forgery detection task.

Table 6. Ablation study on the FaceForensics++ (LQ) dataset
Table 7. Comparative experiment on the module insertion position

4.3 Ablation Study

Impact of Different Components. To further explore the impact of the different components of the SIA module, we evaluate each part separately. The ACC and AUC results on FF++ (LQ) are shown in Table 6. The results demonstrate that all three key components have a positive effect on performance and that each is necessary for face forgery detection. Among them, spatial attention has a relatively large impact, which demonstrates the importance of capturing high-information regions for the face forgery task.

Impact of Embedding Layer. We further conduct ablation experiments to explore the effect of the insertion position of our module. The attention module is embedded in different layers of EfficientNet-b4 and tested on the FaceForensics++ LQ dataset. The results on the left of Table 7 show that the best performance is achieved when the attention module is embedded in layer 1, layer 2, and layer 4, i.e., in the shallow and middle parts of the backbone. The SIA module is derived from the theory of self-information (SI), which is usually built on shallow structural and textural features. Therefore, it is intuitive to insert the SIA module in the shallow layers, where it enhances SI. In the middle layers, the SIA module helps reduce the global inconsistency arising from long-range forgery patterns and passes the useful local and subtle forgery information onward via the self-information aggregation scheme. However, in deeper layers, the down-sampling operations discard much of the local and subtle forgery information, so SI can hardly find useful cues for forgery detection. In sum, it is natural and reasonable to plug SIA into the shallow or middle layers (layers 1, 2, and 4), and our experiments verify this.

Fig. 4. Visualization of our channel-wise attention scores and their corresponding SIA maps. The feature maps are sorted by attention weight from low to high. The last column shows the SIA map calculated from the feature map with the highest channel score (Best viewed in color)

4.4 Visualization and Analysis

Analysis of the SIA Module. To analyze our attention module, we visualize the feature maps from different channels, sorted by channel-wise attention weight, together with the SIA map of the highest-weighted channel. Figure 4 shows the results (all visualizations are colored according to the normalized feature maps). We observe that channels with high self-information contain more local high-frequency clues and subtle details, while those with low self-information carry more semantic information and smoother clues, which are less helpful for the face forgery detection task. In addition, the self-information based attention map enhances high-information areas such as the mouth, eyes, high-frequency textures, and blending boundaries, while weakening repetitive low-frequency areas. These visualizations demonstrate that our SIA module can effectively mine the informative channels and subtle clues, which are critical for the performance improvement.

Visualization of Grad-CAM. We apply the Grad-CAM [41] and Guided Grad-CAM tools, which are widely used to explain the attention of deep neural networks, to the baseline model and to our model. Grad-CAM identifies the regions that the network considers important, while Guided Grad-CAM reflects more details of the activation. From Fig. 5, we observe that our module helps the network capture more subtle artifacts compared with the baseline backbone. The red circles indicate obvious high-information forgery details. We also find that the baseline model ignores these artifacts (white circles), whereas our SIA module helps the network pay more attention to these clues. For example, the forged face in the fourth row has an obvious blending boundary, but the baseline CAM does not attend to this area; after passing through our SIA module, the network clearly focuses on this high-information area. Furthermore, the activation area of the Guided Grad-CAM is larger than that of the baseline, because our module helps the network enhance the most informative channels.

Fig. 5. Grad-CAM and Guided Grad-CAM on the baseline model and our proposed model (layer 1 of EfficientNet-b4). The red circles indicate obvious clues that are ignored by the previous approach but well captured by our method (Best viewed in color) (Color figure online)

5 Conclusion

In this work, we propose an information-theoretic framework with Self-Information Attention (SIA) for effective face forgery detection. The proposed SIA module has a strong theoretic basis, which leads to an effective and interpretable method that achieves superior face forgery detection performance with a negligible parameter increase. Specifically, the self-information of each feature map is extracted as the basis of a dual attention mechanism that helps the model capture the informative regions containing critical forgery clues. Experiments on several datasets demonstrate the effectiveness of our method.

Future Work. Currently, we only evaluate our SIA module on the RGB domain. In future work, we will evaluate it in the frequency domain to further demonstrate its effectiveness and generality.