1 Introduction

With the development of artificial intelligence (AI), deep learning models have been employed in various fields of computer vision [1,2,3,4]. Convolutional neural network (CNN) models are used for image classification, image generation, and object detection, and with various training techniques they have emerged as state-of-the-art (SOTA) methods. Recent generative adversarial network (GAN) face generation models, including StyleGAN [5], StarGAN [6], and InterFaceGAN [7], perform very well. DeepFake videos can easily be made using such networks.

DeepFake is a combination of the terms “deep learning” and “fake”, and it refers to the technique of replacing the source person in a target video. This technology makes the impersonated person appear to perform actions or say things that they never did or said. Notably, cases of abuse, such as fake news and revenge porn, have emerged as detrimental social issues. Therefore, technologies and datasets for detecting fake videos have been studied.

Deepfakes are created using GANs [8] and variational autoencoders (VAEs) [9]. In the most well-known deepfake generation technique, two VAE models are trained to generate the faces of two people; the encoder is shared by the two models, the decoders are trained separately, and only the decoder is exchanged when generating an image, so that only the face appears to be changed. In the recently created DeepFake detection dataset (DFDC, the world’s largest) [10], every frame was manipulated by a DeepFake autoencoder with a morphable-mask or neural-network face swap that changes the landmarks of the face. Face-swapping GANs use the neural talking-heads method and a GAN (e.g., StyleGAN) to generate DeepFake videos. Using the DFDC full dataset [10] released by Facebook AI, EfficientNet-B7 [11] became the SOTA model.

The general procedure for finding fake videos is as follows. First, the face is located in each frame and cropped, and a deep learning model discriminates whether the cropped image is fake. Fake videos can thus be detected by locating artificial regions within frames. Within a single frame, an unnatural part is identified from spatial characteristics, whereas across frames, unnatural features can be found from temporal characteristics. Networks for detecting edited parts of a frame mainly use a CNN structure because CNNs handle spatial characteristics well. Additionally, anomaly discriminators, such as k-nearest neighbor and support vector machine (SVM) algorithms, work well. Instead of finding defects in a single image, a recurrent neural network (RNN), normally used for natural language processing (NLP), can find defects based on the temporal characteristics between frames. After a CNN extracts frame features, the RNN or a long short-term memory (LSTM) model determines whether the video is fake.
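
As a rough illustration of this generic pipeline (not the implementation of any specific cited work), the following PyTorch sketch extracts per-frame CNN features and aggregates them with an LSTM; the ResNet-18 backbone, hidden size, and clip length are illustrative assumptions.

```python
import torch
import torch.nn as nn
from torchvision import models

class CnnLstmDetector(nn.Module):
    """Illustrative CNN+LSTM pipeline: a CNN extracts per-frame features,
    an LSTM aggregates them over time, and a linear head predicts real/fake."""
    def __init__(self, hidden_dim=256):
        super().__init__()
        backbone = models.resnet18(weights=None)      # any CNN backbone works
        backbone.fc = nn.Identity()                   # keep 512-d frame features
        self.cnn = backbone
        self.lstm = nn.LSTM(512, hidden_dim, batch_first=True)
        self.head = nn.Linear(hidden_dim, 1)          # logit for "fake"

    def forward(self, frames):                        # frames: (B, T, 3, H, W)
        b, t, c, h, w = frames.shape
        feats = self.cnn(frames.view(b * t, c, h, w)) # (B*T, 512)
        feats = feats.view(b, t, -1)                  # (B, T, 512)
        _, (h_n, _) = self.lstm(feats)                # last hidden state
        return self.head(h_n[-1])                     # (B, 1) fake logit

# Example: 4 clips of 16 cropped face frames, 224x224
logits = CnnLstmDetector()(torch.randn(4, 16, 3, 224, 224))
```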

DeepFake detection requires clear goals because the results vary depending on which traits are sought. Additionally, accuracy on real-world data can be poor, depending on the training data. On the DFDC private test dataset [10], most models incorrectly identify real videos as fake. To solve this problem, various training techniques and model innovations are required.

In addition, most current models are CNN-based architectures, but Geoffrey Hinton pointed out that CNN models do not capture the relationships among regional features well and proposed the capsule network [12] to mitigate this problem. As a CNN goes to higher layers, more complex features are extracted and classified at the top layer. However, the positional relationships between simple and complex features cannot be considered because each layer is computed as a weighted sum of the lower layers up to the top layer. Therefore, in this study, we use a model with a different structure that can consider the positional relationship of each face part, rather than the usual CNN-based approach.

In this study, we find fake videos using their spatial characteristics by leveraging a transformer model. Most DeepFake discrimination models use CNN-based networks as the SOTA model [11]. We also reveal the reason for this through detailed results analysis and discussion.

Our most relevant contributions are as follows:

  1. We adopt an improved vision transformer: An efficient deep architecture with a vision transformer (ViT) that can predict fake videos is designed.

  2. We apply a distillation technique for Deepfake detection: We apply a distillation method of data-efficient image transformers (DeiT) for Deepfake detection and show the results of various conditions.

  3. We combine patch embedding and CNN features: By combining the EfficientNet and patch features, we exploit the advantages of both features and obtain higher area under the curve (AUC) and f1 scores than the SOTA [11] and other recent methods.

In summary, we design a DeepFake detection scheme using a ViT model, which has shown good performance in recent image classification tasks. We combine CNN and patch-embedding features at the input stage. Also, the proposed method uses the distillation technique, and the results show higher performance than the SOTA [11]. Moreover, our method performs better on fake videos, and we expect high performance on other test datasets as well.

This paper is organized as follows. Section 2 describes related works, and we introduce a model that considers spatial and temporal characteristics in detail. Section 3 proposes the scheme for the proposed DeepFake detection model. We explain the preprocessing process, the basic network, features combined with CNN, and patch and training processes. Section 4 presents the experimental results and analysis, and Section 5 concludes the paper.

2 Related works

Most configured models for DeepFake detection are based on a CNN structure. There are two approaches for discriminating DeepFake videos. One is to exploit unnatural spatial properties within one frame of video as an image unit, and the other is to exploit temporal properties to find unnaturalness between video frames.

Figure 1 represents the methods of DeepFake detection. To find artifacts from temporal characteristics, the model extracts feature points using a CNN and sends them to a sequential network (e.g., RNN, LSTM, or GRU) in chronological order.

Fig. 1
figure 1

DeepFake detection methods

2.1 DeepFake detection using spatial properties

To detect spatial manipulation in the face, Li [13] used CNN models (i.e., VGG16 [14], ResNet50, ResNet101, and ResNet152 [15]). Nguyen proposed a capsule network that can detect various types of Deepfakes [12] by using features pretrained by VGG16 and suggested a capsule-forensic architecture. A classification method using an SVM was proposed by Yang [16], and Guarnera employed K-nearest neighbors and linear discriminant analysis [17]. However, owing to the limitations of CNNs, it is necessary to interact with and compare all parts of an image to detect falsified areas.

Until now, the best CNN model for this purpose on the DFDC dataset has been EfficientNet [18]. EfficientNet improves performance by applying several techniques: increasing the number of filters via width scaling, the number of layers via depth scaling, and the input image resolution via resolution scaling [18]. The SOTA model based on EfficientNet achieved an AUC of 0.981 using an ensemble technique that averaged the predictions of multiple trained models for DeepFake detection [11]. However, such models, which exploit spatial characteristics with a two-dimensional (2D) CNN structure, cannot correlate features at distant positions or exploit temporal information, which makes it difficult for them to succeed.

Li et al. also proposed a novel image representation (i.e., face X-ray) for detecting forgery in facial images [19]. In this method, the face X-ray of an input face image reveals whether the input image can be decomposed into a blending of images from different sources. Mittal et al. suggested an audio-visual DeepFake detection method using affective cues [20]. This approach employed a deep learning method inspired by the Siamese network architecture and triplet loss. Using this scheme, they achieved an AUC of 0.844 on the DFDC dataset.

Based on this analysis, if we design an improved vision transformer (ViT) model that considers this relational information as a global feature, we can expect a performance increase in the DeepFake detection task.

2.2 DeepFake detection using temporal properties

Figure 2 shows the structure of DeepFake video discrimination using temporal characteristics. Montserrat detected space–time awkwardness by sending the frames of the video to EfficientNet and each feature from the frames into a gated recurrent unit [21]. Similarly, Güera used a CNN to extract frame-level features and train an RNN that learned to classify fake videos [22]. Unlike previous studies using CNN and RNN networks to determine spatiotemporal properties, de Lima [23] used a three-dimensional (3D) CNN to detect them simultaneously. They employed I3D [24], R3D [25], and MC3 [26] owing to their higher performance.

Fig. 2
figure 2

DeepFake detection for temporal characteristics

Using an optical flow-based CNN, Amerini applied an optical flow field to exploit possible inter-frame dissimilarities [27]. However, detection models using temporal properties tend to exhibit poor performance. Amerini [27] used Face2Face and achieved an accuracy of 81.61% on 120 test videos [28]. Montserrat [21] achieved an accuracy of 91.88% on the DFDC test dataset. These methods successfully extract unnatural inter-frame parts, but most frames share similar features because almost all of the scenes are the same. Therefore, temporal features alone may be insufficient.

3 Proposed DeepFake detection algorithm

Before introducing our model, we describe its transformer structure and its advantages for DeepFake detection. Transformers are widely used in the natural language processing (NLP) field, but they also show good results in computer vision. The Facebook AI team proposed a method that combines a transformer model with a distillation method [29]. We employ this model.

3.1 ViT for DeepFake detection

The CNN and ViT models have pros and cons for DeepFake detection. First, a CNN cannot learn the relationships between different parts of the image. For example, it cannot find an unnatural relationship between a mouth and eyes that are out of synchronization. In contrast, the ViT learns the relationship of each position by assigning an order to the patches of the given image. The input is embedded as patches together with positional information. These features are connected by the Multi-Head Self-Attention Layer (MSL) to identify which part is fake.

Second, the ViT utilizes global information more than the CNN. The CNN uses a convolution filter that extracts crucial edges by filtering the surrounding pixel values regardless of absolute position. The multi-head self-attention layer in the ViT makes it possible to embed information globally across the entire image. In [30], the ViT was shown to have more global information and more uniform representations than ResNet at lower layers. The CNN model has no information about location, only information about the surrounding pixel values. This characteristic helps it detect the unnaturalness of surrounding pixels caused by image synthesis. Therefore, we combine CNN features and patch embedding to obtain both local and global spatial information.

The features of the CNN model are gradually reduced by convolution kernels applied over the entire input image. Figure 3(a) illustrates this process of the CNN structure. Finally, the features converge to a single feature vector that predicts the class of the image. To detect DeepFake images, the CNN model finds anomalous features by searching the entire face from the partial features of the face image.

Fig. 3
figure 3

(a) CNN structure process and (b) transformer process on face image

As shown in Fig. 3(b), in the transformer, the class (cls) token interacts with all partial features and with each element to find the most deeply related parts. If there are unsuitable features, they affect the specific area involved. Patches with a strong relationship to the class token appear as active areas. Thus, the features most relevant to the class token are an important factor in predicting a DeepFake.

When the model finds the fake part of a face, the interactive weight of the class token for that part is strong. For real parts, no weight stands out, and the weights are distributed evenly. For this reason, the transformer has slightly more difficulty identifying real images than the CNN does. However, it is more successful at finding fake images by dividing them into patches that interact with the class token. In contrast, the CNN condenses the image features into a single vector. We therefore consider both CNN features and patch-embedding features for their respective advantages.

Figure 4 shows the transformer training process for DeepFake detection. First, we split the face into the desired patch size, and it is patch embedded by one CNN layer using a patch size kernel. Features corresponding to each part of the face are fed to the input of the transformer and interact with each other. Finally, the class token predicts whether the image is real or fake through the fully connected layer. Thus, the transformer can detect fake videos by interacting with the unnatural area.

Fig. 4
figure 4

ViT for DeepFake detection

3.2 The proposed method

The procedure of the proposed DeepFake detection is shown in Fig. 5. The face is extracted from a video using a multitask cascaded convolutional network (MTCNN) model [31]. Then, landmarks are extracted to perform augmentation that drops out parts of the face from the image.

Fig. 5
figure 5

Proposed DeepFake detection procedure

After face extraction from the video and preprocessing, the image enters a deep learning model. Our contribution lies in the deep learning model of this overall process. The output of the deep learning model determines whether the video is real or fake.

Our baseline follows the vision transformer network with a distillation token from DeiT. The input sequences combine patch embedding and CNN features. The entire network is shown in Fig. 6. We introduce our baseline model for DeepFake detection in Section 3.2.2, describe how the input is constructed in Section 3.2.3, and explain the distillation training process with the teacher network in Section 3.2.4.

Fig. 6
figure 6

The proposed overall DeepFake detection network. The image is split into patches and also passes through EfficientNet [18]. We obtain (Batch, N, embedding features) and (Batch, M, embedding features), respectively. These tokens are concatenated, passed through global pooling, and fed to the transformer encoder. The encoder consists of Multi-Head Self-Attention and two Gaussian error linear unit (GELU) layers, which form a feed-forward neural network (FFN). We add a distillation token trained by the teacher network

3.2.1 Data preprocessing

The face is extracted from the frames using the MTCNN model [31]. We generate landmarks on the cropped image and extract structural-similarity difference masks between the real and fake images. We follow [11] for this process. The reason for extracting landmarks is to cut out part of the face to make the model more general. This pre-processing prevents the model from overfitting. For example, the input image is preprocessed as shown in Fig. 7. Finally, the GaussNoise, GaussianBlur, HorizontalFlip, IsotropicResize, and ShiftScaleRotate transforms of Albumentations [32] are used for data augmentation.
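
The following sketch illustrates this pre-processing chain under stated assumptions: it uses the MTCNN implementation from facenet-pytorch and a plausible Albumentations pipeline; the exact augmentation parameters, the landmark-based cutout, and the IsotropicResize transform of [11] (approximated here with LongestMaxSize plus padding) are assumptions rather than the authors’ exact settings.

```python
import cv2
import albumentations as A
from facenet_pytorch import MTCNN   # assumed face-detector implementation

mtcnn = MTCNN(image_size=384, margin=0, post_process=False)

augment = A.Compose([
    A.GaussNoise(p=0.1),
    A.GaussianBlur(blur_limit=3, p=0.05),
    A.HorizontalFlip(p=0.5),
    # IsotropicResize in [11] is a custom transform; LongestMaxSize + padding
    # is used here as an approximate stand-in.
    A.LongestMaxSize(max_size=384),
    A.PadIfNeeded(min_height=384, min_width=384, border_mode=cv2.BORDER_CONSTANT),
    A.ShiftScaleRotate(shift_limit=0.1, scale_limit=0.2, rotate_limit=10, p=0.5),
])

def preprocess_frame(frame_bgr):
    """Detect and crop the face, then apply augmentation (training only)."""
    rgb = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2RGB)
    face = mtcnn(rgb)                          # (3, 384, 384) tensor or None
    if face is None:
        return None
    face_np = face.permute(1, 2, 0).byte().numpy()   # HWC uint8 for Albumentations
    return augment(image=face_np)["image"]
```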

Fig. 7
figure 7

The result of data pre-processing [11]

3.2.2 Basic network architecture

We overview the vision transformer [33] and recognize its efficacy in the field of DeepFake detection. The transformer was originally used for NLP tasks; however, recently, many attempts have been made to apply it to image modeling [34,35,36]. The vision transformer has an encoder like the bidirectional encoder representations from transformers (BERT) model, which uses position information and embedding sequences.

Before the Multi-Head Self Attention Layers (MSLs), the image, \(x \in \mathbb {R}^{(H\times W\times C)}\), is split into patches, \(x_{p} \in \mathbb {R}^{(\frac {H}{P} \times \frac {W}{P} \times E)}\), by learnable embedding, where (H,W) is the resolution of the image, C is the channel, P is the patch size, and E is the number of embedding features. All patches are flattened by linear projection and added to the position embedding equal to \((\frac {H}{P} \times \frac {W}{P})\). The transformer encoder consists of a Multi-Head Self Attention and multilayer perceptron (MLP). The MLP contains two layers with GELU non-linearity [33].
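
A minimal sketch of this patch-embedding step, assuming a 384×384 input, a patch size of 32, and 1,024 embedding features as used later in Section 4.2; the class and distillation tokens and the encoder layers themselves are omitted.

```python
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    """Standard ViT-style patch embedding: a conv with kernel = stride = P
    splits the image into (H/P)*(W/P) patches and projects each to E dims."""
    def __init__(self, img_size=384, patch_size=32, in_chans=3, embed_dim=1024):
        super().__init__()
        self.num_patches = (img_size // patch_size) ** 2
        self.proj = nn.Conv2d(in_chans, embed_dim,
                              kernel_size=patch_size, stride=patch_size)
        self.pos_embed = nn.Parameter(torch.zeros(1, self.num_patches, embed_dim))

    def forward(self, x):                      # x: (B, C, H, W)
        x = self.proj(x)                       # (B, E, H/P, W/P)
        x = x.flatten(2).transpose(1, 2)       # (B, N, E) with N = (H/P)*(W/P)
        return x + self.pos_embed              # add learnable position embedding

tokens = PatchEmbedding()(torch.randn(2, 3, 384, 384))   # (2, 144, 1024)
```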

The sequences of feature vectors include all parts of the image. An encoder refers to all sequences of split patches. The previous CNN structure focused only on the activated part of the face and could not refer to other distant positions. However, input sequences depend on global information, which can reduce overfitting in transformers. We also find an interesting result in that the transformer makes a relatively fair classification of real and fake videos, rather than being skewed to either side, unlike the previous CNN models.

3.2.3 Combination of patch embedding and CNN features

We combine patch embedding and CNN features. Patch embedding captures the features of each patch of the face, and the CNN feature captures the overall features. The two features are combined and passed through global pooling. When both are considered, the performance is higher than when only patch embedding is used.

Unlike the original input vectors of the vision transformer, we introduce input tokens before feeding the encoder. We define \(\mathbf {Z}_{p} = ({x^{1}_{p}}\boldsymbol {E},{x^{2}_{p}}\boldsymbol {E}, \cdots , {x^{N}_{p}}\boldsymbol {E})\) and \(\mathbf {Z}_{f} = f(x) = ({x^{1}_{f}}, {x^{2}_{f}}, \cdots , {x^{M}_{f}})\), where xp is a patch, E is a learnable embedding, N is the number of split patches, M is the number of CNN features, and f(⋅) is the CNN model. Thus, Zf is a feature of the CNN model, and we use EfficientNet as f.

These features are combined as Zp ⊕ Zf (⊕ denotes concatenating features by channels), and global pooling is applied. Min Lin suggested that global average pooling is more interpretable between feature maps and categories [37]. Thus, we use Zpf = globalpooling(Zp ⊕ Zf) as the input vectors, reducing the number of vectors from N + M to N. As a result, we consider not only the main part features of the face but also the correlation of all parts.
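
The following sketch shows one plausible reading of this combination: the EfficientNet feature-map positions are projected to the embedding dimension, concatenated with the patch tokens along the token axis, and pooled from N + M back to N tokens. The timm backbone name, the linear projection, and the use of adaptive average pooling over the token axis are assumptions, not the authors’ exact implementation.

```python
import torch
import torch.nn as nn
import timm   # assumed source of the EfficientNet backbone

class TokenCombiner(nn.Module):
    """One plausible reading of Sec. 3.2.3: CNN feature-map positions become
    extra tokens, are concatenated with the patch tokens along the sequence
    axis, and global pooling reduces the N + M tokens back to N tokens."""
    def __init__(self, embed_dim=1024, num_patches=144):
        super().__init__()
        self.backbone = timm.create_model("tf_efficientnet_b7",
                                          pretrained=False, features_only=True)
        cnn_chans = self.backbone.feature_info.channels()[-1]   # e.g., 640
        self.cnn_proj = nn.Linear(cnn_chans, embed_dim)          # assumed projection
        self.pool = nn.AdaptiveAvgPool1d(num_patches)            # N + M -> N tokens

    def forward(self, patch_tokens, image):            # patch_tokens: (B, N, E)
        fmap = self.backbone(image)[-1]                 # (B, C, h, w)
        cnn_tokens = fmap.flatten(2).transpose(1, 2)    # (B, M, C), M = h*w
        cnn_tokens = self.cnn_proj(cnn_tokens)          # (B, M, E)
        z = torch.cat([patch_tokens, cnn_tokens], dim=1)      # (B, N+M, E)
        z = self.pool(z.transpose(1, 2)).transpose(1, 2)      # (B, N, E)
        return z
```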

The transformer’s input features can interact from patch to patch, and the CNN features can interact with the surrounding features. By using this approach, we can obtain better AUC and f1 scores than using only patch embedding or CNN features.

3.2.4 Distillation method and teacher network

We have xclass and xdistillation tokens. The class token is trained with the true label value, and the distillation token is trained with the prediction of the teacher. To achieve higher performance than the current SOTA model [11], the teacher is set to the SOTA model [11]. If the distillation token is not added, overfitting occurs.

We add class tokens and distillation tokens to input Zpf, and we define the final input, Z0 = [xclass;Zpf; xdistillation] + Epos, where xclass and xdistillation are tokens for training by the label and teacher network, and Epos is the learnable position embedding. Finally, we can define the set train loss as

$$ \begin{array}{@{}rcl@{}} \mathcal{L}_{fake}&=& {\lambda \mathcal{L}_{BCE}}(Z_{c_{fake}}, y) \\&&+ (1-\lambda) {\mathcal{L}_{BCE}}(Z_{d_{fake}}, \sigma(Z_{t_{fake}})), \end{array} $$
(1)
$$ \begin{array}{@{}rcl@{}} {\mathcal{L}_{real}}&=& {\lambda \mathcal{L}_{BCE}}(Z_{c_{real}}, y) \\&&+ (1-\lambda) {\mathcal{L}_{BCE}}({Z_{d_{real}}}, \sigma({Z_{t_{real}}})), \end{array} $$
(2)
$$ \mathcal{L}_{train} = \frac{{\mathcal{L}_{fake}} + {\mathcal{L}_{real}}}{2}, $$
(3)

where \(Z_{t_{fake}}\) and \(Z_{t_{real}}\) are the logits of the teacher model for fake and real predictions, and (\(Z_{d_{fake}}\), \(Z_{d_{real}}\)) and (\(Z_{c_{fake}}\), \(Z_{c_{real}}\)) are the logits of the distillation tokens and the class tokens for fake and real predictions, respectively. We set λ to \(\frac {1}{2}\) through experimental analysis; \({\mathscr{L}}_{BCE}\) is the binary cross entropy computed on the labels, y, and σ is the sigmoid function.
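
A small sketch of this training loss under the stated convention that fake is labeled 1 and real is labeled 0, with raw logits passed to a BCE-with-logits loss so that the sigmoid σ is applied internally; the helper names are illustrative.

```python
import torch
import torch.nn.functional as F

def distillation_loss(z_class, z_dist, z_teacher, y, lam=0.5):
    """Eq. (1)/(2): BCE of the class-token logit against the hard label, plus
    BCE of the distillation-token logit against the teacher's sigmoid output."""
    hard = F.binary_cross_entropy_with_logits(z_class, y)
    soft = F.binary_cross_entropy_with_logits(z_dist, torch.sigmoid(z_teacher))
    return lam * hard + (1.0 - lam) * soft

def train_loss(cls_fake, dist_fake, teach_fake,
               cls_real, dist_real, teach_real, lam=0.5):
    """Eq. (3): average of the fake-sample loss and the real-sample loss."""
    l_fake = distillation_loss(cls_fake, dist_fake, teach_fake,
                               torch.ones_like(cls_fake), lam)   # fake label = 1
    l_real = distillation_loss(cls_real, dist_real, teach_real,
                               torch.zeros_like(cls_real), lam)  # real label = 0
    return 0.5 * (l_fake + l_real)
```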

In [29], the Facebook AI team showed that the distillation method prevents overfitting by expanding the range of label weights. They also showed that the transformer produces the best results when the teacher network is a CNN model.

Therefore, we chose EfficientNet, the state-of-the-art model on the DFDC dataset for DeepFake detection, as the teacher network. Each class and distillation token represents the probability that the video is fake. The distillation tokens are used instead of the class tokens during testing, where they yield better performance on the test dataset.

The proposed scheme is efficient in detecting fake videos because we utilize distillation methodology to generalize the model and combine the CNN and patch-embedding features to gather more contextual information.

4 Experimental results

Here, we describe the datasets and the parameter details. We also compare the proposed scheme with the SOTA model [11], Li [19], and Mittal [20] on the DFDC dataset, and with I3D [24], R3D [25], and MC3 [26] on the Celeb-DF (v2) dataset, under the same performance measurement conditions. We explain why we used the DFDC and Celeb-DF (v2) datasets in Section 4.1. We describe the parameter settings and configuration environments required in the training process in Section 4.2 and analyze the experimental results in Section 4.3.

4.1 Datasets

4.1.1 DFDC Dataset

The DFDC dataset was previewed in a Kaggle competition [38]. Later, the Facebook AI team released the full version [10], which is the largest publicly available DeepFake dataset and includes approximately 100,000 videos produced by GANs. Figure 8 presents an example.

Fig. 8
figure 8

Examples of DFDC Dataset [10]

In a DeepFake dataset survey [10], face-swap datasets were divided into three generations. First-generation datasets, such as DF-TIMIT [39], UADFV [16], and FaceForensics++DF (FF++DF) [40], have \(10^{4} \ \sim 10^{6}\) frames and up to 5,000 videos. Second-generation datasets include Celeb-DF [41] and DFDC preview [38]. The DFDC full dataset is third-generation and has 128,154 total videos and 104,500 unique fake videos.

Because its data size is large compared with other datasets, we chose this largest DeepFake dataset and compared our performance with that of the SOTA model [11] on the DFDC full dataset. In the analysis of Dolhansky [10], the best submitted model has an AUC of 0.734 on a private test set. Also, the higher the average precision of the submitted models ([11, 42,43,44,45]) on the DFDC dataset, the better their performance on real videos. Therefore, if the performance is good on the DFDC dataset, we can assume that the results generalize to real videos.

4.1.2 Celeb-DF (v2) datasets

The Celeb-DF (v2) dataset contains real and DeepFake synthesized videos with visual quality on par with those circulated online [41]. It greatly extends the earlier Celeb-DF (v1), which contains only 795 DeepFake videos. Celeb-DF (v2) includes 590 original videos collected from YouTube, with subjects of different ages, ethnic groups, and genders, and 5,639 corresponding DeepFake videos. Figure 9 shows some examples of the Celeb-DF (v2) dataset. It is smaller than the DFDC dataset.

Fig. 9
figure 9

Examples of Celeb-DF (v2) dataset

4.2 Training and testing detail

4.2.1 Training detail

We first apply a pre-processing stage. After pre-processing, images are loaded in random order for the training process along with their label information.

Pre-processing: We used the MTCNN [31] as the face detector and cropped all frames to 384×384. We augmented our training data using Albumentations [32]. We also cut out and dropped out parts of the image, based on [11].

Training: The patch size for the embedding features was 32, and the embedding dimension was 1,024. We initialized our transformer and the EfficientNet-B7 model [11] using pretrained weights. We set the transformer to 16 heads and 24 layers, which is identical to the large ViT default model. Additionally, our teacher network was an EfficientNet-B7 [11] pre-trained on the DFDC dataset. The distillation token was used for prediction only during testing.

Parameters: Training and testing were performed on a single V100 GPU with a batch size of 12 for training. We used a stochastic gradient descent optimizer with an initial learning rate of 0.01 and a step-based learning rate reduction policy. The number of training epochs was 40 with 2,500 batches per epoch, and training took 2 days.
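
A hypothetical training-loop skeleton reflecting these settings (SGD, initial learning rate 0.01, step-based decay, 40 epochs of 2,500 batches with batch size 12); the momentum, step size, and decay factor are assumptions not specified above, and the model and loss here are stand-ins.

```python
import torch
import torch.nn as nn

model = nn.Linear(1024, 1)              # stand-in for the proposed network
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.1)

for epoch in range(40):                 # 40 epochs, 2,500 batches per epoch
    for step in range(2500):
        optimizer.zero_grad()
        x = torch.randn(12, 1024)       # batch size 12 (dummy data)
        loss = model(x).pow(2).mean()   # placeholder loss
        loss.backward()
        optimizer.step()
    scheduler.step()                    # step-based learning-rate reduction
```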

For classification, we used binary cross entropy as the loss for backpropagation. We tested on the publicly available DFDC test dataset of 5,000 videos. The f1 score was measured for comparison with the SOTA model [11] at a threshold value of 0.55, which was selected through experiments. We also report the performance at the best threshold value of each method.

4.2.2 Testing detail

To detect DeepFake videos, we employed the same procedure as in the SOTA model [11]. During testing, 32 frames per video are selected at regular intervals. If 12 or more frames have predicted values larger than 0.8, those predicted values are averaged. If more than 90% of the frames have predicted values smaller than 0.2, those predicted values are averaged. Otherwise, the average of all predicted values is used. As a result, a value from 0 to 1 indicates whether the video is fake.
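
The aggregation rule above can be read as the following sketch; the interpretation that only the confident frames are averaged in the first two cases is our reading of [11] and may differ in detail from the original implementation.

```python
import numpy as np

def video_score(frame_preds, hi=0.8, lo=0.2):
    """Aggregate 32 per-frame fake probabilities into one video-level score."""
    preds = np.asarray(frame_preds, dtype=np.float32)
    confident_fake = preds[preds > hi]
    confident_real = preds[preds < lo]
    if confident_fake.size >= 12:                  # many strongly "fake" frames
        return float(confident_fake.mean())
    if confident_real.size > 0.9 * preds.size:     # >90% strongly "real" frames
        return float(confident_real.mean())
    return float(preds.mean())                     # otherwise, plain average

score = video_score(np.random.rand(32))            # 0 (real) ... 1 (fake)
```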

4.3 Ablation Study

We show the results of training and testing according to the λ value when applying the distillation method during training. We also analyze the results when the distillation method is not applied and the f1 scores obtained at different threshold values.

Figure 10 shows that the deviation increases as the weight for fake loss increases. The validation loss was mostly similar, but we can see that the loss was relatively higher than when the distillation technique was not applied.

Fig. 10
figure 10

Validation loss result according to the weight of fake loss: (a) weight 0.5, (b) weight 1.0, and (c) weight 1.5 on the DFDC dataset

Table 1 shows the AUC results according to the λ value when training with the distillation method. The highest AUC value was obtained at 0.5 (shown in boldface), and the lowest AUC value was obtained when no distillation was applied. To cover the intermediate range of values without finer interpolation, we experimented with three λ values.

Table 1 AUC results according to λ values in the test dataset

Table 2 shows the f1 score according to the threshold value (β), which is the probability above which an image is determined to be fake. The boldface denotes the best performance of each algorithm. The proposed method achieved its highest f1 score of 0.919 at β = 0.55, and the SOTA model [11] achieved its highest f1 score of 0.911 at β = 0.4. The best threshold value of β was selected by analyzing the f1 score as β varied. Because a different network structure is used, the best threshold value may change due to different training characteristics. When comparing the best performance, the proposed method outperformed the SOTA model [11] by 0.008 (0.8%) in f1 score.

Table 2 The result of f1 score according to the threshold value in the test dataset

Also, we concatenated CNN features and patch embedding to consider both local and global information. Without CNN features, the model achieved an AUC of 0.959 and an f1 score of 0.891. This result shows that the combination of CNN features with patch embedding is very effective.

4.4 Performance analysis

We compared our model to the SOTA model [11]. We trained the proposed model using the training dataset and chose the model weights with the lowest loss in the validation set. In Fig. 11, we compare the validation loss to the SOTA model [11] for real and fake videos. The green plot indicates our model’s loss, and the purple plot indicates the SOTA model’s [11] loss. There was a slight difference in the loss of the real video, but there was a significant difference in that of the fake video. This graph shows that our model is a more robust classifier for fake videos. Although the real loss was similar, the overall average loss was lower. The validation loss is defined as

$$ \begin{array}{@{}rcl@{}} \boldsymbol{LogLoss} = -\frac{1}{n} \sum\limits^{n}_{i=1}[y_{i}log(\hat{y}_{i}) + (1-y_{i})log(1-\hat{y}_{i})], \end{array} $$
(4)

where n is the number of videos being predicted, \(\hat {y}_{i}\) is the predicted probability of the video being fake, yi is one if the video is fake and zero if real. We obtained \(\hat {y}_{i}\) using distillation tokens.
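
For reference, a small sketch of this validation log-loss metric; the clipping constant is an assumption added for numerical stability.

```python
import numpy as np

def log_loss(y_true, y_pred, eps=1e-7):
    """Validation log loss of Eq. (4); predictions are clipped for stability."""
    y_pred = np.clip(np.asarray(y_pred, dtype=np.float64), eps, 1.0 - eps)
    y_true = np.asarray(y_true, dtype=np.float64)
    return float(-np.mean(y_true * np.log(y_pred)
                          + (1 - y_true) * np.log(1 - y_pred)))

# Example: two fake videos (label 1) and one real video (label 0)
print(log_loss([1, 1, 0], [0.9, 0.7, 0.2]))
```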

Fig. 11
figure 11

Results of loss between SOTA model and our model in validation DFDC dataset: (a) the loss for fake video, (b) the loss for real video, and (c) the average loss

In addition, the ROC–AUC curve of the proposed model has a larger area (0.978) than that of the SOTA model [11] (0.972) in Fig. 12. This indicates that the proposed classifier is more robust on fake videos because the precision was higher than that of the SOTA model [11], and the recall was close to one.

Fig. 12
figure 12

Receiver operating characteristic (ROC) and AUC curves. Orange is the curve of our model, and green is the curve of the SOTA model [11]. The proposed model has a larger area than the compared SOTA model [11] on the DFDC dataset

To verify robustness, a confusion matrix was obtained by setting a threshold of 0.55 on the probability of a fake video. It shows the predicted number of videos for each label in Fig. 13. Top-right, top-left, bottom-right, and bottom-left represent false positives, true negatives, true positives, and false negatives, respectively. The confusion matrix on the left side is that of the previous SOTA model [11], and the right side shows our results. We can see that our model predicts fake videos more accurately. The false-negative counts for each model were 335 and 187, respectively. Thus, the proposed model is robust in fake video detection, and the f1 score over all cases was 0.919 (shown in boldface), which is higher than the 0.906 of the SOTA model [11] shown in Table 3.

Fig. 13
figure 13

Confusion matrix from the previous SOTA model which is EfficientNet-B7 [11] (left) and the proposed algorithm (right) on the DFDC dataset when threshold β= 0.55

Table 3 AUCs, f1 scores, and the complexity of the recent methods, SOTA, and the proposed model on the DFDC dataset

To compare the performance at the best threshold condition, we display another confusion matrix in Fig. 14. The best condition was β = 0.40 for the SOTA model [11] and β = 0.55 for the proposed model. For the 2,500 fake videos under these conditions, the proposed model correctly predicted 2,313 as fake, whereas the SOTA model predicted 2,276. The false-negative counts for each model were 224 and 187, respectively. From this result, we can see that the proposed model is robust in fake video detection.

Fig. 14
figure 14

Confusion matrix from the previous SOTA model which is EfficientNet-B7 [11] (best threshold β= 0.40) (left) and the proposed algorithm (best threshold β= 0.55) (right) on the DFDC dataset

We also compared AUC values to other recent methodologies, such as a scheme based on the face X-ray network [19], a DeepFake detection method using emotion audio-visual affective cues [20], and the SOTA model [11], as shown in Table 3. Generally, DeepFake detection methods focus on manipulation artifacts. However, Li et al. [19] proposed a novel face X-ray image representation, which focuses on blending artifacts. They predicted the boundary of the manipulated face and obtained an AUC score of 0.809 in the DFDC test dataset.

Mittal et al. used audio and visual modalities from within the same video to determine similarity [20]. Their training method was similar to a Siamese network with facial and speech features. This scheme achieved an AUC score of 0.844 on the DFDC test dataset. Compared with these methods, the proposed algorithm yielded a significantly improved AUC score of 0.978, as shown in Table 3. Compared with Mittal [20], the proposed method was superior by 0.13 AUC even though it used only image (facial) features.

From Table 3, the proposed method is only 0.006 higher in AUC than the Selim method (SOTA) [11]. However, we observed that the proposed scheme outperformed the Selim method (SOTA) [11] by 0.013 in f1 score. Also, the proposed method showed much better performance in finding fake videos (true positives). In the DeepFake detection task, we consider the detection of fake videos to be more important than that of real videos. From this viewpoint, the proposed scheme is useful enough for the DeepFake detection task.

We also compared the ensemble results of the SOTA [11] and the proposed model. We trained five times and tested it by averaging the probability values. As a result, the AUC result of the SOTA model [11] was 0.981, and the proposed model achieved an AUC of 0.982. Thus, the proposed scheme can detect fake videos more robustly than the SOTA model [11].

Additionally, we trained and tested the proposed scheme with the same process using a single model on the Celeb-DF (v2) dataset. The Celeb-DF (v2) dataset [41] has 590 real videos and 5,639 fake videos synthesized with high quality. Table 4 shows the AUCs, f1 scores, and the complexity of the proposed and recent existing models on the Celeb-DF (v2) dataset [41]. The boldface denotes the results of the proposed algorithm. In [23], 3D CNNs were used to consider both spatial and temporal information [24,25,26]. Except for the R3D scheme [25], the proposed model gave a better AUC than the compared methods. This means that the improved ViT model can perform well on other DeepFake datasets.

Table 4 AUCs, f1 scores, and the complexity of the recent methods, SOTA, and the proposed model on the Celeb-DF (v2) dataset [41]

Since the proposed scheme uses the improved ViT structure shown in Fig. 6, its computational complexity is inevitably higher than that of the ViT. In addition, the proposed scheme also uses the features from EfficientNet [11]. The complexity of the proposed model is almost 8 \(\sim\) 10 times higher because of the modification of the ViT, as shown in Tables 3 and 4. The transformer utilizes the attention mechanism to compute correlations across all tokens, so the attention module has more parameters than CNN structures without attention modules. Moreover, the number of convolution layers must be increased to achieve higher accuracy in classification tasks; however, performance saturates (no longer improves) even as the number of layers continues to increase. To address this, the transformer is being widely utilized. Accordingly, we designed the proposed model to improve DeepFake detection performance based on the ViT model.

Despite its high complexity, the proposed algorithm has shown better performance for DeepFake detection by designing distillation techniques and combining CNN features with patch embedding. From Tables 3 and 4, we can see that the proposed scheme achieves better AUC and f1 scores on the DFDC and Celeb-DF (v2) datasets. In particular, we observed a very high f1 score on the Celeb-DF (v2) dataset [41].

5 Conclusion

In this paper, we proposed an improved vision transformer model for DeepFake detection. The proposed scheme is a combination of patch embedding and CNN features utilizing a distillation token based on DeiT. By considering the characteristics of the CNN and the transformer, we verified superior performance over previous results.

We demonstrated the efficiency of the robust vision transformer model compared with EfficientNet, the SOTA model, which consists of a 2D CNN network. The SOTA obtained an AUC of 0.972, whereas ours obtained 0.978 under the same conditions without an ensemble approach. The proposed scheme produced an f1 score of 0.919, whereas the SOTA model achieved 0.906 under the same threshold condition of 0.55. Furthermore, we observed an AUC improvement of up to 0.17 compared with recent schemes [19, 20]. With the ensemble method, the proposed model achieved an AUC of 0.982, whereas the SOTA model achieved 0.981 [11]. In addition, we verified an AUC of 0.993 and an f1 score of 0.978 on the Celeb-DF (v2) dataset.

In future work, we will investigate more detailed unnaturalness between frames for DeepFake detection. If only the spatial feature is considered, motion information between adjacent frames of the DeepFake or the synthesized pixel portion within one frame may be missed. Therefore, we will further study a hybrid ViT model that can efficiently combine spatial and temporal features.