1 Introduction

Deepfake refers to a family of face manipulation and replacement methods based on deep learning. With the development of generative models [1,2,3], the technical barrier to face forgery keeps dropping, and anyone can easily create realistic forged face content with ready-made models or tools. Deepfakes may also be exploited by malicious users to fabricate political disinformation or to spread pornographic content.

As a defense mechanism, face forgery detection has been proposed to respond to the challenge posed by Deepfake. The task is commonly defined as a real-fake binary classification problem. From the face forgery generation procedure, two helpful observations can be drawn for detecting Deepfakes: (1) pixel modification occurs only in local regions of the face, which inevitably leads to spatial inconsistencies such as blending boundaries, and (2) since the forged video is generated frame by frame, temporal inconsistencies across frames, such as facial position jittering, cannot be eliminated. Frame-level detection methods mainly build on the first observation, while video-level approaches rely on the second.

Face forgery detection via biological signals provides another line of thinking. The heartbeat is a typical biological signal, and Photoplethysmography (PPG) is a heart rate monitoring technology used in biomedicine [4]: as hemoglobin levels change with the periodic heartbeat, the skin's absorption of light changes accordingly. The development of remote photoplethysmography (rPPG) [5] makes it possible to capture these subtle changes in skin color from recorded videos. Since facial pixel modifications and inter-frame discrepancies inevitably disrupt the periodic changes in skin color, previous work [6, 7] has shown that the rPPG signal is a powerful biological indicator for face forgery detection.

In this paper, the Multi-scale Spatial–Temporal representation of PPG is adopted to further exploit heartbeat signals calculated from different facial regions. As shown in Fig. 1, the key observation is that Deepfakes cannot yet preserve consistent rPPG signals, and the pseudo signals exhibit rhythmic patterns that are unique to each generation method. Therefore, we treat face forgery detection not only as a binary classification problem but also as a source detection task that recognizes the generation method behind a fake video. To utilize both spatial and temporal information, a two-stage network is designed for face forgery detection and categorization. The Mask-Guided Local Attention module (MLA) is proposed to highlight the modified regions of PPG maps and guide the network to better detect the unique rhythmic patterns of different manipulation methods. Moreover, a Transformer [8] is introduced to let high-level temporal features of adjacent video clips interact over long distances. Extensive experiments demonstrate the superiority of the proposed method, which outperforms all other rPPG-based methods in both face forgery detection and categorization. An extension experiment demonstrates the strong generalization ability of the proposed method against newly added manipulation sources. To show the effectiveness of each component, we also conduct a detailed ablation study on various setups.

Fig. 1 An example of Multi-scale PPG maps (second row) and rPPG signals (third row) generated from real videos and various manipulations, i.e., Deepfakes, Face2Face, FaceSwap, and NeuralTextures. Each forgery method presents unique rhythmic patterns of rPPG signals

In summary, the contributions of this paper are three-fold:

(1) A two-stage network is designed to detect both spatial and temporal inconsistencies; it consists of a Mask-Guided Local Attention module (MLA) that highlights local regions of PPG maps and a Transformer that lets temporally adjacent features interact over long distances.

(2) We utilize the Multi-scale Spatial–Temporal representation of PPG, which contains information from multiple facial regions. Visualizations show that, with the help of the Mask-Guided Local Attention module (MLA), unique patterns of PPG maps can be identified for different generation methods.

(3) Extensive experiments on various datasets demonstrate the effectiveness and extension capability of the proposed method, which outperforms all other rPPG-based methods on both the forgery detection and categorization tasks.

2 Related work

2.1 Deepfake generation

Deepfake has received increasing attention in recent years. Variational Autoencoders (VAE) [9] and Generative Adversarial Networks (GAN) [1] are the main generative models adopted in recent approaches. Existing facial manipulation methods fall into two major categories: facial identity manipulation and facial expression manipulation. Deepfakes (DF) refers to a type of facial identity manipulation spread via online forums and tools such as FakeApp, which adopts two VAE models with a shared encoder to reconstruct the source and target faces. Poisson blending [10] and color transform algorithms [11, 12] are used to blend the source face with the background, as also applied in popular open-source Deepfake tools such as DeepFaceLab [13]. FaceSwap (FS) is a graphics-based approach that transfers the face region from a source video to a target video based on sparsely detected facial landmarks and a 3D template model. FaceShifter (FSH) [14] is a two-stage framework designed for high-fidelity, occlusion-aware facial identity manipulation. On the other hand, Face2Face (F2F) [15] and NeuralTextures (NT) [16] are two typical facial expression manipulation methods. F2F is a facial reenactment system that drives the target video with the expressions of the source video while keeping the target person's facial identity unchanged. NT stores rich signals in learned neural textures of the target person and performs facial reenactment through a deferred neural renderer. However, regardless of the manipulation method, spatial texture inconsistencies appear in each frame because the video background remains constant. Meanwhile, since video tampering is performed frame by frame, Deepfakes inevitably contain temporal discrepancies.

2.2 Deepfake detection based on CNN

Early studies [17,18,19,20] mainly used hand-crafted features to distinguish real from fake videos. With the rise of deep learning and the continuous advancement of face forgery techniques, Convolutional Neural Networks (CNN) have become the mainstream of Deepfake detection. A number of well-designed backbone networks are used to extract crucial features, such as MesoNet [21], CapsuleNet [22], and Xception [23]. Beyond replacing the backbone, another line of work focuses on local spatial inconsistencies within forged frames. For example, FaceXray [24] detects forgery by revealing the blending boundaries of Deepfakes. Dang et al. [25] propose a plug-in local attention module to highlight features in the modified regions. PRRNet [26] fuses pixel-wise and region-level similarity to learn local differences through a spatial attention mechanism. Chen et al. [27] measure the similarity between local areas by computing Multi-scale Patch Similarity and fuse frequency information with RGB channels to obtain a more comprehensive local-feature representation. However, these methods only focus on frame-level forgery traces and tend to ignore cross-frame information at the video level.

On the other hand, many works use 3D CNNs or Recurrent Neural Networks (RNN) to explore the temporal inconsistency of Deepfakes. Lima et al. [28] transfer 3D networks pre-trained on action recognition to video classification. Montserrat et al. [29] propose a weighting mechanism that automatically selects relevant frames and combine a CNN with a GRU [30] to extract both spatial and temporal features. With the success of ViT [8] in computer vision, the Transformer [31] has also been introduced to detect Deepfakes. Zheng et al. [32] propose a hybrid network that combines a fully temporal convolution network with a Temporal Transformer. Xu et al. [33] fuse the visual semantic sequence with the contextual feature sequence extracted by a Transformer. Khan et al. [34] use both RGB images and UV texture maps as two-stream inputs to a Transformer that learns fused features. These methods demonstrate the effectiveness of the long-distance self-attention mechanism of the Transformer architecture in Deepfake detection. The proposed method utilizes a spatiotemporal representation of PPG, combines a CNN with a Transformer, and adopts a local attention mechanism. In other words, we take both the spatial and the temporal inconsistency of Deepfakes into account.

2.3 Deepfake detection based on explainable methods

In addition to using pure CNNs, another line of work explores a variety of explainable methods. Malolan et al. [35] make use of explainable AI (XAI) techniques, including Local Interpretable Model-Agnostic Explanations (LIME) and Layer-Wise Relevance Propagation (LRP), to provide clear visualizations of the salient image regions the model focuses on. Jayakumar et al. [36] propose a model-agnostic high-precision explainer named "Anchors" to visually explain the predictions of a Deepfake detector and obtain better performance than LIME.

Benefiting from their clear physical meaning, biological signals provide another route to explainable approaches. Early attempts adopt biological signals such as eye blinking [37], head posture [38], and lip movement [39]. The development of rPPG makes it possible to estimate heart rate from recorded face videos, and rPPG signals have also been used in Deepfake detection. FakeCatcher [7] first introduces the heartbeat signal into Deepfake detection and proposes a spatial–temporal map of chrominance-based PPG and its power spectral density (PSD). Ciftci et al. [6] adopt the same form of PPG maps and demonstrate that different manipulation methods generate their own unique heartbeat rhythms. Boccignone et al. [40] calculate patch-wise rPPG signals and spectra, and measure both intra-patch and inter-patch coherence of rPPG. DeepRhythm [41] uses a Motion-Magnified Spatial–Temporal Representation (MMSTR) of PPG to enhance facial color changes and amplify heartbeat signals; it also adopts prior predictions from a face-based network to weight the input features. Liang et al. [42] further study the interaction between adjacent PPG maps. All these PPG-based methods prove that Deepfakes are not yet capable of maintaining consistent heartbeat signals, which is strong and explainable evidence for detecting forged videos. However, these methods do not focus on the detailed local discrepancies between PPG maps generated by different manipulation sources. Without any prior knowledge, we adopt the Multi-scale Spatial–Temporal representation of PPG to comprehensively represent facial skin color changes caused by heartbeat activity. In addition, a two-stage network is proposed: on the one hand, the Mask-Guided Local Attention module (MLA) focuses on the spatial local regions of PPG maps; on the other hand, the Temporal Transformer further explores long-distance interactions between adjacent clips within a complete video.

3 Methods

In this section, we introduce the proposed overall framework illustrated in Fig. 2, including the generation of Multi-scale Spatial–Temporal PPG map, the two-stage network, and the loss function.

Fig. 2 Proposed pipeline of the two-stage network. \(\odot\) denotes pointwise multiplication

3.1 Multi-scale spatial–temporal representation of PPG

Since the heartbeat signal is sensitive to head movements, light changes, and other disturbances, the raw face image cannot be used directly to represent the rPPG signal. In order not to be constrained by prior information on ROI selection, and inspired by [43], we adopt the Multi-scale Spatial–Temporal representation of PPG to fuse pixel information from multiple regions. As shown in Fig. 3, we first divide a full video into several \(T\)-frame video clips with step size \(\omega\). For each video clip, face alignment is performed to obtain facial landmarks. According to the landmarks, a set of sub-ROIs \(R_{t} = \{ R_{1t}, R_{2t}, \ldots, R_{nt} \}\) is obtained by selecting \(n\) informative facial regions such as the cheeks, forehead, and jaw. Then, the average pixel values over all non-empty subsets of \(R_{t}\) are calculated in \(C\) color channels. The \(T\)-frame temporal sequence of averaged pixel values from the same sub-ROI combination is arranged into one row. Finally, max–min normalization is applied to all temporal sequences in each channel to scale the values into [0, 255]. The size of the Multi-scale Spatial–Temporal PPG map is \((2^{n} - 1) \times T \times C\) for each video clip.

Fig. 3 An illustration of the generation process of the Multi-scale PPG map from an input face video clip of T frames. The procedure includes face alignment, sub-ROI combination, pixel averaging, and normalization. The final size of the Multi-scale PPG map is \((2^{n} - 1) \times T \times C\)
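As a concrete illustration of the procedure above, the sketch below assembles the Multi-scale Spatial–Temporal PPG map from per-frame sub-ROI means; it assumes the face alignment and per-region averaging have already produced a `(T, n, C)` array (the array name and the per-row normalization detail are our assumptions, not the authors' code).

```python
from itertools import combinations
import numpy as np

def multiscale_ppg_map(roi_means: np.ndarray) -> np.ndarray:
    """roi_means: (T, n, C) per-frame mean pixel value of each sub-ROI in C channels.
    Returns a (2^n - 1, T, C) Multi-scale Spatial-Temporal PPG map."""
    T, n, C = roi_means.shape
    rows = []
    # One row per non-empty subset of the n sub-ROIs, averaged over the subset.
    for k in range(1, n + 1):
        for subset in combinations(range(n), k):
            rows.append(roi_means[:, list(subset), :].mean(axis=1))  # (T, C)
    ppg = np.stack(rows, axis=0)                                     # (2^n - 1, T, C)
    # Max-min normalization of each temporal sequence, per channel, into [0, 255].
    mn, mx = ppg.min(axis=1, keepdims=True), ppg.max(axis=1, keepdims=True)
    return (ppg - mn) / (mx - mn + 1e-8) * 255.0
```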

3.2 Overall framework

Our approach is based on the following two assumptions: (1) various video manipulation methods modify different facial regions, and these modifications are also reflected in the PPG map composed of multi-scale facial regions; highlighting the modified local areas may lead the network to better learn the unique rhythmic patterns of each manipulation method and help it distinguish real from fake videos. (2) A single video contains multiple PPG clips, and sufficient interaction among the features of adjacent maps may yield more global information. Therefore, we propose a two-stage network consisting of a Mask-Guided Local Attention module (MLA), which focuses on the modified local regions of the PPG map, and a Temporal Transformer, which exploits long-distance information between adjacent clips.

3.2.1 Mask-guided local attention module

Due to the unique patterns of rPPG signals, we regard Deepfake detection not only as a real-fake discrimination problem but also as a categorization task over different manipulation methods. To be specific, face swapping methods change the pixels of the entire face area, whereas expression manipulation methods only modify local regions such as the mouth. Since the spatial dimension of the PPG map is arranged by combinations of different facial regions, the spatial–temporal representation of PPG can also reflect the regional discrepancies among face manipulation methods. This observation is often ignored by previous approaches. Inspired by [25], we propose a plug-in Mask-Guided Local Attention module (MLA) to highlight the positions in the PPG feature map that correspond to the modified regions of the face image.

Concretely, the proposed MLA consists of the following steps. As shown in Fig. 3, a PPG clip is denoted as \(X \in {\mathbb{R}}^{C \times (2^{n} - 1) \times T}\), where \(T\) denotes the clip length, \(n\) is the number of face sub-ROIs, and \(C\) is the number of input channels. The mid-level feature map produced by the middle layers of the backbone \(f_{\text{mid}}\) is \(F_{m} = f_{\text{mid}}(X) \in {\mathbb{R}}^{C^{\prime} \times H \times W}\), where \(H\), \(W\), and \(C^{\prime}\) denote the height, width, and channel number of the feature map, respectively. With \(F_{m}\) as input, the attention mask \(A_{\text{mask}} = \phi(F_{m}) \in {\mathbb{R}}^{H \times W}\) is generated. The weighted feature \(F^{\prime} = A_{\text{mask}} \odot F_{m}\) is then fed to the remaining network layers \(f_{\text{high}}\), where \(\odot\) denotes pointwise multiplication. Specifically, \(\phi(\cdot)\) consists of a convolution \({\text{Conv}}(\cdot)\) that compresses the channel dimension and a Sigmoid activation \({\text{Sigmoid}}(\cdot)\) that produces the attention weights, which can be formulated as follows:

$$ \phi \left( {F_{m} } \right) = {\text{Sigmoid}}\left( {{\text{Conv}}\left( {F_{m} } \right)} \right) $$
(1)
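A minimal PyTorch sketch of this module follows; the 1×1 convolution is our assumption for \({\text{Conv}}(\cdot)\) (any channel-compressing convolution fits the formulation), and the class name is ours.

```python
import torch
import torch.nn as nn

class MaskGuidedLocalAttention(nn.Module):
    """Sketch of MLA: Conv compresses channels, Sigmoid yields A_mask (Eq. 1),
    and the mask re-weights the mid-level feature pointwise."""
    def __init__(self, in_channels: int):
        super().__init__()
        self.conv = nn.Conv2d(in_channels, 1, kernel_size=1)  # compress C' -> 1

    def forward(self, f_mid: torch.Tensor):
        # f_mid: (B, C', H, W) mid-level feature map from the backbone.
        a_mask = torch.sigmoid(self.conv(f_mid))   # (B, 1, H, W)
        f_weighted = a_mask * f_mid                # pointwise multiplication (broadcast)
        return f_weighted, a_mask.squeeze(1)       # a_mask is supervised by L_mask below
```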

To make the attention mask \(A_{\text{mask}}\) approximate the ground truth manipulation mask \(A_{\text{gt}}\), we train the MLA in a supervised manner with an extra \(L_1\) loss \(L_{\text{mask}}\):

$$ L_{{\text{mask }}} = \left| {A_{{\text{mask }}} - A_{{{\text{gt}}}} } \right|_{1} $$
(2)

Given a pseudo PPG map generated from a fake video, its ground truth manipulation mask \(A_{\text{gt}}\) is calculated from the corresponding original map as a pair. Specifically, we first compute the absolute pixel-wise difference of the PPG map pair in the RGB channels to obtain a residual map. The residual map is then converted to grayscale, normalized to [0, 1], and resized to the same scale as \(A_{\text{mask}}\). Finally, a threshold of 0.1 is applied to binarize the map into \(A_{\text{gt}}\). For an original PPG map generated from a real video, \(A_{\text{gt}}\) is set to all zeros because no manipulation has occurred.
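The mask construction can be sketched as below; converting the residual to grayscale by channel averaging and the exact resize call are our assumptions about details the text leaves open.

```python
import numpy as np
import cv2

def ground_truth_mask(ppg_fake, ppg_orig, mask_hw, thr=0.1):
    """ppg_fake, ppg_orig: (H, W, 3) RGB PPG maps of a fake/original pair.
    mask_hw: (H', W') spatial size of A_mask. Returns the binary mask A_gt."""
    residual = np.abs(ppg_fake.astype(np.float32) - ppg_orig.astype(np.float32))
    gray = residual.mean(axis=-1)                       # grayscale residual (assumption: channel mean)
    gray = gray / (gray.max() + 1e-8)                   # normalize to [0, 1]
    gray = cv2.resize(gray, (mask_hw[1], mask_hw[0]))   # match the resolution of A_mask
    return (gray > thr).astype(np.float32)              # threshold at 0.1 -> binary A_gt
```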

3.2.2 Temporal transformer

Although a single PPG map already contains a temporal dimension, we believe that several adjacent PPG maps of the same video still carry mutually reinforcing information. To further mine temporal information, a ViT [8] is utilized to let adjacent clip features interact with each other over long distances.

As shown in Fig. 4, \(K\) adjacent PPG maps are fed into the backbone network trained in stage one, and the high-level features from its last convolutional layer are denoted as \(F_{h}\). Average pooling and a linear projection are then applied to \(F_{h}\) to obtain \(K\) \(D\)-dimensional embedded features \(x_{i} \in {\mathbb{R}}^{D}, \; i = 1, 2, \ldots, K\). Following the setting of ViT [8], an extra learnable class token (\(Z_{0}^{0} = x_{\text{class}}\)) is prepended to the embedding sequence, and its output is responsible for the final prediction. Meanwhile, a standard 1D learnable position embedding (\(E_{\text{pos}} \in {\mathbb{R}}^{(K + 1) \times D}\)) records the temporal order of the \(K\) adjacent feature vectors. The input sequence of the Temporal Transformer can be formulated as follows:

$$ Z_{0} = \left[ x_{\text{class}}, x_{1}, x_{2}, \ldots, x_{K} \right]^{T} + E_{\text{pos}} $$
(3)
Fig. 4 Structure of the Temporal Transformer. \(\oplus\) denotes the sum operation

The Temporal Transformer consists of \(L\) Transformer encoder blocks [31], and each block includes a Multi-head Self-Attention operation (MSA) [31] and a Feed-Forward network (FF). LayerNorm (LN) is applied before each sub-block, and residual connections [44] are used after each sub-block. The GELU activation function provides nonlinearity. The forward process of the \(l\)-th layer can be formulated as follows:

$$ Z_{l}^{\prime} = {\text{MSA}}\left( {\text{LN}}\left( Z_{l-1} \right) \right) + Z_{l-1}, \quad l = 1, 2, \ldots, L $$
(4)
$$ Z_{l} = {\text{FF}}\left( {{\text{LN}}\left( {Z_{l}^{\prime } } \right)} \right) + Z_{l}^{\prime } $$
(5)

To obtain the final prediction score \(y\), an MLP head is applied to the class-token output of the last layer, \({\text{LN}}(Z_{L}^{0})\), which can be formulated as follows:

$$ y = {\text{MLP}}\left( {\text{LN}}\left( Z_{L}^{0} \right) \right) $$
(6)
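Under these definitions, the Temporal Transformer can be sketched with standard PyTorch encoder layers as follows; the feed-forward width (4D), the zero-initialized tokens, and the class name are our assumptions rather than the authors' exact implementation.

```python
import torch
import torch.nn as nn

class TemporalTransformer(nn.Module):
    """Sketch: K clip embeddings + class token + 1D positional embedding are fed
    to L pre-LN encoder blocks (MSA + FF with GELU, Eqs. 4-5); an MLP head reads
    the class-token output (Eq. 6)."""
    def __init__(self, dim=256, depth=1, heads=8, num_clips=5, num_classes=5):
        super().__init__()
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, num_clips + 1, dim))
        block = nn.TransformerEncoderLayer(
            d_model=dim, nhead=heads, dim_feedforward=4 * dim,
            activation="gelu", batch_first=True, norm_first=True)
        self.encoder = nn.TransformerEncoder(block, num_layers=depth)
        self.head = nn.Sequential(nn.LayerNorm(dim), nn.Linear(dim, num_classes))

    def forward(self, x):
        # x: (B, K, D) embedded features of K adjacent PPG clips.
        cls = self.cls_token.expand(x.size(0), -1, -1)
        z = torch.cat([cls, x], dim=1) + self.pos_embed   # Eq. (3)
        z = self.encoder(z)
        return self.head(z[:, 0])                         # prediction from the class token
```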

3.3 Loss function

In the first training stage of the backbone and MLA (without the ViT), we formulate a joint loss function \(L_{\text{total}}\) consisting of the softmax cross-entropy loss \(L_{\text{ce}}\) and the attention mask loss \(L_{\text{mask}}\):

$$ L_{{\text{total }}} = L_{{{\text{ce}}}} + \lambda L_{{\text{mask }}} $$
(7)

where \(\lambda\) is a hyperparameter balancing the classification task and the mask regression task. In the second stage, we freeze the parameters of the backbone and the MLA and use only the cross-entropy loss to train the Temporal Transformer.
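A short sketch of this first-stage objective (with \(\lambda = 10\) as in our experiments; the function name is ours):

```python
import torch.nn.functional as F

def stage_one_loss(logits, labels, a_mask, a_gt, lam=10.0):
    l_ce = F.cross_entropy(logits, labels)   # softmax cross-entropy L_ce
    l_mask = F.l1_loss(a_mask, a_gt)         # L1 attention mask loss L_mask (Eq. 2)
    return l_ce + lam * l_mask               # Eq. (7)
```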

4 Experiments

In this section, elaborate evaluations are provided to test the effectiveness of the proposed method. First, our method is compared with six benchmark methods on the binary face forgery detection and multi-category source detection tasks. Then, detailed ablation studies show the impact of each component. Moreover, extension experiments demonstrate the expandability of the proposed method against new manipulation sources. Finally, supplementary experiments on video clip length and video compression are also reported.

4.1 Settings

4.1.1 Dataset

To illustrate the effectiveness of the proposed method on the face forgery detection and source detection tasks, we select the most widely used FaceForensics++ (FF++) dataset [45]. FF++ is a relatively large dataset containing 1000 real videos and 4000 fake videos generated by five different face manipulation methods, i.e., Deepfakes (DF), Face2Face (F2F), FaceSwap (FS), NeuralTextures (NT), and FaceShifter (FSH). In terms of the modified regions, DF, FS, and FSH swap the whole face; F2F focuses on smaller areas, transferring expressions while keeping the identity of the target face consistent; and NT operates only around the mouth area of the target face. FF++ provides three video quality versions corresponding to different compression rates, i.e., RAW (c0), HQ (c23), and LQ (c40). In addition, to demonstrate the extension capability of the proposed method, experiments on the Celeb-DF(v2) dataset [46] are also conducted. Celeb-DF is a more challenging dataset that contains 590 real celebrity videos; the 59 subjects swap faces in pairs to generate 5639 high-quality fake videos. The forged faces in Celeb-DF are more detailed and convincing because of a more advanced synthesis process.

4.1.2 Implementation

For the real-fake binary classification task, the training, test, and validation sets of each sub-dataset are divided in the ratio of 8:1:1. For the source detection task, the dataset is split in the ratio of 7:3, consistent with previous works [6, 7, 42]. We adopt the open-source face detector OpenFace [47] to detect 68 facial landmarks. Following the setting of [43], the number of ROI sub-regions \(n\) is 6, and both the RGB and YUV color spaces are used to generate PPG maps, so \(C\) is 6. Unless otherwise noted, the video clip length \(T\) is 64, the step size \(\omega\) is 16, and the loss balancing hyperparameter \(\lambda = 10\). EfficientNetV2-M [48] pre-trained on ImageNet is adopted as the backbone, and the MLA is inserted after its third stage. The number of adjacent PPG clips \(K\), the number of self-attention heads, and the embedded feature dimension \(D\) are set to 5, 8, and 256, respectively. The batch size is 32. SGD is used as the optimizer with an initial learning rate of 0.01, and the total number of epochs is 30. All models are implemented in the PyTorch framework and trained on a GTX-1080Ti.
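For reference, the hyperparameters above can be collected in a single configuration object (a sketch; the field names are ours and do not come from the authors' code):

```python
from dataclasses import dataclass

@dataclass
class TrainConfig:
    n_subrois: int = 6      # number of face sub-ROIs n
    channels: int = 6       # C: RGB + YUV
    clip_len: int = 64      # T
    step: int = 16          # omega
    lam: float = 10.0       # loss balancing weight lambda
    num_clips: int = 5      # K adjacent PPG clips
    heads: int = 8          # self-attention heads
    embed_dim: int = 256    # D
    batch_size: int = 32
    lr: float = 0.01        # SGD initial learning rate
    epochs: int = 30
```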

4.1.3 Prediction aggregation

Since a full video contains several clips, we predict each clip and count the numbers of real and fake clips. If the number of real clips is greater than the number of fake ones, we identify the video as real, and vice versa. For source detection, majority voting is adopted to determine the predicted source of each video. All results are reported as video-level classification accuracy.
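This rule reduces to a majority vote over clip-level predictions, as in the following sketch (binary and source labels are handled the same way; the function name is ours):

```python
from collections import Counter

def aggregate_video(clip_preds):
    """clip_preds: per-clip predicted labels of one video.
    Returns the majority-voted video-level label."""
    return Counter(clip_preds).most_common(1)[0][0]
```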

4.2 Comparison

In order to make a fair and comprehensive comparison, we consider both face-based and rPPG-based methods as baselines. Among face-based approaches, we choose the popular Xception [23], and all rPPG-based methods are included for comparison, i.e., Ciftci et al. [6], FakeCatcher [7], Boccignone et al. [40], DeepRhythm [41], and Liang et al. [42]. Meanwhile, to demonstrate the ability of the proposed method to identify different manipulation sources, in addition to conventional real-fake binary classification we also examine source detection performance with five categories (1 real, 4 fakes). All comparison experiments are conducted on the FF++ dataset. As shown in Table 1, the proposed method achieves the best results on the DF, F2F, FS, and NT sub-datasets of the binary face forgery task and achieves state-of-the-art performance among all rPPG-based methods on source detection, which demonstrates its effectiveness. Compared with the baseline using cropped faces as input, our method has a clearer advantage on the categorization task: on the FS and NT categories, our method achieves 99.67% and 98.33%, while Xception [23] with face inputs only reaches 92.76% and 89.62%, respectively. This result again proves the strong ability of PPG maps to preserve the unique rhythmic patterns of different manipulation methods. Moreover, compared with other rPPG-based methods, the proposed method performs better on four categories, indicating its superior capacity for exploiting the information in the rPPG signal.

Table 1 Comparison with other methods on the sub-datasets of FF++

4.3 Ablation experiments

To demonstrate the effectiveness of each component of our method, i.e., the Multi-scale Spatial–Temporal PPG map, the Mask-Guided Local Attention module (MLA), and the Temporal Transformer, we conduct detailed ablation experiments on source detection. The results are shown in Table 2.

Table 2 Ablation experiments of our method by progressively adding the Multi-scale Spatial–Temporal PPG map (Multi-scale), the Mask-Guided Local Attention module (MLA), and the Temporal Transformer

4.3.1 Effectiveness of multi-scale PPG map

For a better comparison, two other forms of PPG map from previous works are implemented, i.e., POS [49] and the Motion-Magnified Spatial–Temporal Representation (MMSTR) [41]. The former is based on a skin optical model, and the latter utilizes a motion magnification algorithm [50]. The amount of data and the backbone training settings are the same for all three forms of PPG map. As shown in Table 2, the average categorization accuracy of the Multi-scale PPG map is 13% higher than that of the other two, which proves that the multi-scale combination of facial regions contains richer rhythmic information. Without prior calculation, the Multi-scale Spatial–Temporal representation of PPG fits the deep learning framework better in a fully data-driven manner.

4.3.2 Effectiveness of MLA

We then conduct experiments to demonstrate the effectiveness of the MLA, which utilizes a local attention mechanism. According to the mask examples shown in Fig. 5, the distribution of white pixels, which represent the modified regions, varies across face manipulation methods. The mask calculated from DF shows a wider modified area than that from NT, which is consistent with our hypothesis that PPG maps can reflect the spatial and regional differences between face forgery methods.

Fig. 5 Visualization of Mask-Guided Local Attention (MLA) on various face manipulation methods. The second and the fourth row show the PPG maps and their corresponding masks of modified regions. The third row shows the heat maps of high-level features without MLA in the network, and the last row shows the heat maps when MLA is inserted into the network

To present an intuitive interpretation of how MLA works, we also visualize heatmaps of high-level features using Grad-CAM [51]. As shown in Fig. 5, without the mask and local attention operation, the network focuses on a large area near the upper right corner of the PPG map while ignoring the regional differences between specific face forgery methods. After adding MLA to the backbone, the strong-response area converges at the specific locations guided by the mask; thus, the rhythmic patterns of each face manipulation method are further distinguished. The average categorization accuracy also improves by 0.71% when MLA is used. It is worth noting that the accuracy of the NT category, which is usually difficult to classify, improves by 1.33%. This result proves the effectiveness of MLA for detecting local discrepancies.

4.3.3 Effectiveness of temporal transformer

In order to explore a proper temporal model for interacting adjacent feature vectors, we select one layer of the widely used Bidirectional Long Short-Term Memory (Bi-LSTM) network and single-head self-attention [31] (denoted as Self-attention) for comparison with the standard Transformer encoder [31]. The results in Table 2 show that the average detection accuracy with Bi-LSTM is 0.28% higher than without any temporal module, supporting our second assumption that adjacent PPG clips contain temporally correlated information. Intriguingly, one layer of single-head self-attention is on par with the conventional recurrent structure, indicating that the recurrent structure is constrained by the contextual order of the sequence, while the self-attention mechanism relies entirely on long-distance dependencies between input tokens. With the full structure of MSA and FF blocks, the Temporal Transformer fully exploits the global information between adjacent features and improves the accuracy by 0.38%. To better demonstrate the advantage of the Temporal Transformer over the LSTM, we conduct real-fake binary face forgery detection experiments on the five sub-datasets of the c23 (HQ) version of FF++. As shown in Fig. 6, adopting the Transformer increases the accuracy on all five sub-datasets by an average of 0.84% compared with the LSTM. The superiority of ViT over LSTM is more evident on the F2F, FS, and NT sub-datasets, with accuracy improvements of 1.15%, 1.7%, and 0.65%, respectively. These results provide stronger evidence for the superiority of the global attention and long-distance dependency mechanism of ViT over the locally context-constrained LSTM.

Fig. 6 Binary face forgery detection (real-fake) accuracy (%) of different temporal modules on the five sub-datasets of FF++ (c23), including DF, F2F, FS, NT, and FSH

In addition, the impact of using different numbers of Transformer encoder blocks is also investigated. As shown in Fig. 7, without pre-trained parameters to initialize the second-stage model, adding more Transformer encoder layers introduces additional training parameters but does not improve the outcome. On the other hand, compared with other scenarios adopting ViT, the number of embedded features \(K\) is much smaller in our work; thus, a single layer of the Transformer structure is sufficient.

Fig. 7 Average categorization accuracy (%) when using different numbers of Transformer encoder layers

4.4 Extension experiments

Deepfake generation methods evolve rapidly, which demands that our method be extensible to new face forgery methods. Thus, we conduct extension experiments by adding a new category from Celeb-DF (CD) to illustrate the generalization performance of the proposed method. 1000 fake videos are selected from Celeb-DF(v2) [46], and the PPG maps of each video clip are computed along with their corresponding masks to form the sixth class for source detection. As shown in Table 3, our method is capable of tracking the new source, and its average categorization accuracy over six classes exceeds that of the two previous methods [6, 42], confirming the excellent extension ability of the proposed method.

Table 3 Results of extension experiment

4.5 Video clip length

We also conduct experiments to explore the balance between the length of a single clip and the amount of training data. When the clip length is relatively short, the number of PPG maps obtained from a complete video is large, but the information contained in an individual clip may be limited; conversely, an excessively long clip greatly reduces the amount of training data. Thus, we test the proposed method with different clip lengths \(T = \{32, 64, 128, 256\}\). For a fair comparison, the step size \(\omega\) is set to \(T/4\) so that all PPG maps of a full video are obtained. As shown in Table 4, the accuracy reaches its highest score at \(T = 64\) but drops sharply at \(T = 256\). Not surprisingly, an overly long clip length limits the data size and negatively impacts our data-driven approach.

Table 4 Categorization accuracy for different video clip lengths

4.6 Video compression

To test the performance of the proposed method against video compression, experiments are conducted on the different video quality versions of FF++, i.e., HQ (c23) and LQ (c40). As shown in Table 5, the proposed method still reaches an average source detection accuracy of 90.52% on HQ, demonstrating robustness against light compression. However, due to the loss of subtle facial color changes caused by severe compression, the rPPG signal is disrupted in the LQ version, which leads to an accuracy of merely 56.96%.

Table 5 Categorization accuracy under different compression rates

5 Conclusion

In this paper, the Multi-scale Spatial–Temporal PPG map is adopted to further exploit heartbeat signals from multiple facial regions. Motivated by the key observation that rPPG signals produce unique rhythmic patterns for different manipulation methods, a two-stage network is proposed for both face forgery detection and categorization. Concretely, the Mask-Guided Local Attention module (MLA) is designed to locate the spatial inconsistencies of modified facial regions reflected in PPG maps, and the Temporal Transformer is adopted to exploit long-distance information between adjacent video clips. Extensive experiments on the FaceForensics++ and Celeb-DF(v2) datasets demonstrate the superiority of the proposed method, which outperforms all other rPPG-based approaches. Moreover, the extension experiment confirms the excellent generalization capability of the method against a newly added manipulation source. Furthermore, the detailed ablation study and visualizations illustrate the effectiveness of each component and setting.